Data Science for Business and Decision Making covers both statistics and operations research, whereas most competing textbooks focus on only one of the two.
English · 1000 [1209] pages · 2019
Table of contents:
Cover
Data Science for Business and Decision Making
Copyright
Dedication
Epigraph
1. Introduction to Data Analysis and Decision Making
Introduction: Hierarchy Between Data, Information, and Knowledge
Overview of the Book
Final Remarks
2. Types of Variables and Measurement and Accuracy Scales
Introduction
Types of Variables
Nonmetric or Qualitative Variables
Metric or Quantitative Variables
Types of Variables × Scales of Measurement
Nonmetric Variables-Nominal Scale
Nonmetric Variables-Ordinal Scale
Quantitative Variable-Interval Scale
Quantitative Variable-Ratio Scale
Types of Variables × Number of Categories and Scales of Accuracy
Dichotomous or Binary Variable (Dummy)
Polychotomous Variable
Discrete Quantitative Variable
Continuous Quantitative Variable
Final Remarks
Exercises
Part II: Descriptive Statistics
3. Univariate Descriptive Statistics
Introduction
Frequency Distribution Table
Frequency Distribution Table for Qualitative Variables
Frequency Distribution Table for Discrete Data
Frequency Distribution Table for Continuous Data Grouped into Classes
Graphical Representation of the Results
Graphical Representation for Qualitative Variables
Bar Chart
Pie Chart
Pareto Chart
Graphical Representation for Quantitative Variables
Line Graph
Scatter Plot
Histogram
Stem-and-Leaf Plot
Boxplot or Box-and-Whisker Diagram
The Most Common Summary Measures in Univariate Descriptive Statistics
Measures of Position or Location
Measures of Central Tendency
Arithmetic Mean
Case 1: Simple Arithmetic Mean of Ungrouped Discrete and Continuous Data
Case 2: Weighted Arithmetic Mean of Ungrouped Discrete and Continuous Data
Case 3: Arithmetic Mean of Grouped Discrete Data
Case 4: Arithmetic Mean of Continuous Data Grouped into Classes
Median
Case 1: Median of Ungrouped Discrete and Continuous Data
Case 2: Median of Grouped Discrete Data
Case 3: Median of Continuous Data Grouped into Classes
Mode
Case 1: Mode of Ungrouped Data
Case 2: Mode of Grouped Qualitative or Discrete Data
Case 3: Mode of Continuous Data Grouped into Classes
Quantiles
Quartiles
Deciles
Percentiles
Case 1: Quartiles, Deciles, and Percentiles of Ungrouped Discrete and Continuous Data
Case 2: Quartiles, Deciles, and Percentiles of Grouped Discrete Data
Case 3: Quartiles, Deciles, and Percentiles of Continuous Data Grouped into Classes
Identifying the Existence of Univariate Outliers
Measures of Dispersion or Variability
Range
Average Deviation
Case 1: Average Deviation of Ungrouped Discrete and Continuous Data
Case 2: Average Deviation of Grouped Discrete Data
Case 3: Average Deviation of Continuous Data Grouped into Classes
Variance
Case 1: Variance of Ungrouped Discrete and Continuous Data
Case 2: Variance of Grouped Discrete Data
Case 3: Variance of Continuous Data Grouped into Classes
Standard Deviation
Standard Error
Coefficient of Variation
Measures of Shape
Measures of Skewness
Pearson's First Coefficient of Skewness
Pearson's Second Coefficient of Skewness
Bowley's Coefficient of Skewness
Fisher's Coefficient of Skewness
Coefficient of Skewness on Stata
Measures of Kurtosis
Coefficient of Kurtosis
Fisher's Coefficient of Kurtosis
Coefficient of Kurtosis on Stata
A Practical Example in Excel
A Practical Example on SPSS
Frequencies Option
Descriptives Option
Explore Option
A Practical Example on Stata
Univariate Frequency Distribution Tables on Stata
Summary of Univariate Descriptive Statistics on Stata
Calculating Percentiles on Stata
Charts on Stata: Histograms, Stem-and-Leaf, and Boxplots
Histogram
Stem-and-Leaf
Boxplot
Final Remarks
Exercises
4. Bivariate Descriptive Statistics
Introduction
Association Between Two Qualitative Variables
Joint Frequency Distribution Tables
Measures of Association
Chi-Square Statistic
Other Measures of Association Based on Chi-Square
Spearman's Coefficient
Correlation Between Two Quantitative Variables
Joint Frequency Distribution Tables
Graphical Representation Through a Scatter Plot
Measures of Correlation
Covariance
Pearson's Correlation Coefficient
Final Remarks
Exercises
Part III: Probabilistic Statistics
5. Introduction to Probability
Introduction
Terminology and Concepts
Random Experiment
Sample Space
Events
Unions, Intersections, and Complements
Independent Events
Mutually Exclusive Events
Definition of Probability
Basic Probability Rules
Probability Variation Field
Probability of the Sample Space
Probability of an Empty Set
Probability Addition Rule
Probability of a Complementary Event
Probability Multiplication Rule for Independent Events
Conditional Probability
Probability Multiplication Rule
Bayes' Theorem
Combinatorial Analysis
Arrangements
Combinations
Permutations
Final Remarks
Exercises
6. Random Variables and Probability Distributions
Introduction
Random Variables
Discrete Random Variable
Expected Value of a Discrete Random Variable
Variance of a Discrete Random Variable
Cumulative Distribution Function of a Discrete Random Variable
Continuous Random Variable
Expected Value of a Continuous Random Variable
Variance of a Continuous Random Variable
Cumulative Distribution Function of a Continuous Random Variable
Probability Distributions for Discrete Random Variables
Discrete Uniform Distribution
Bernoulli Distribution
Binomial Distribution
Relationship Between the Binomial and the Bernoulli Distributions
Geometric Distribution
Negative Binomial Distribution
Relationship Between the Negative Binomial and the Binomial Distributions
Relationship Between the Negative Binomial and the Geometric Distributions
Hypergeometric Distribution
Approximation of the Hypergeometric Distribution by the Binomial
Poisson Distribution
Approximation of the Binomial by the Poisson Distribution
Probability Distributions for Continuous Random Variables
Uniform Distribution
Normal Distribution
Approximation of the Binomial by the Normal Distribution
Approximation of the Poisson by the Normal Distribution
Exponential Distribution
Relationship Between the Poisson and the Exponential Distribution
Gamma Distribution
Special Cases of the Gamma Distribution
Relationship Between the Poisson and the Gamma Distribution
Chi-Square Distribution
Student's t Distribution
Snedecor's F Distribution
Relationship Between Student's t and Snedecor's F Distribution
Final Remarks
Exercises
Part IV: Statistical Inference
7. Sampling
Introduction
Probability or Random Sampling
Simple Random Sampling
Simple Random Sampling Without Replacement
Simple Random Sampling With Replacement
Systematic Sampling
Stratified Sampling
Cluster Sampling
Nonprobability or Nonrandom Sampling
Convenience Sampling
Judgmental or Purposive Sampling
Quota Sampling
Geometric Propagation or Snowball Sampling
Sample Size
Size of a Simple Random Sample
Sample Size to Estimate the Mean of an Infinite Population
Sample Size to Estimate the Mean of a Finite Population
Sample Size to Estimate the Proportion of an Infinite Population
Sample Size to Estimate the Proportion of a Finite Population
Size of the Systematic Sample
Size of the Stratified Sample
Sample Size to Estimate the Mean of an Infinite Population
Sample Size to Estimate the Mean of a Finite Population
Sample Size to Estimate the Proportion of an Infinite Population
Sample Size to Estimate the Proportion of a Finite Population
Size of a Cluster Sample
Size of a One-Stage Cluster Sample
Sample Size to Estimate the Mean of an Infinite Population
Sample Size to Estimate the Mean of a Finite Population
Sample Size to Estimate the Proportion of an Infinite Population
Sample Size to Estimate the Proportion of a Finite Population
Size of a Two-Stage Cluster Sample
Final Remarks
Exercises
8. Estimation
Introduction
Point and Interval Estimation
Point Estimation
Interval Estimation
Point Estimation Methods
Method of Moments
Ordinary Least Squares
Maximum Likelihood Estimation
Interval Estimation or Confidence Intervals
Confidence Interval for the Population Mean (μ)
Known Population Variance (σ²)
Unknown Population Variance (σ²)
Confidence Interval for Proportions
Confidence Interval for the Population Variance
Final Remarks
Exercises
9. Hypotheses Tests
Introduction
Parametric Tests
Univariate Tests for Normality
Kolmogorov-Smirnov Test
Shapiro-Wilk Test
Shapiro-Francia Test
Solving Tests for Normality by Using SPSS Software
Solving Tests for Normality by Using Stata
Kolmogorov-Smirnov Test on the Stata Software
Shapiro-Wilk Test on the Stata Software
Shapiro-Francia Test on the Stata Software
Tests for the Homogeneity of Variances
Bartlett's χ² Test
Cochran's C Test
Hartley's Fmax Test
Levene's F-Test
Solving Levene's Test by Using SPSS Software
Solving Levene's Test by Using Stata Software
Hypotheses Tests Regarding a Population Mean (μ) From One Random Sample
Z Test When the Population Standard Deviation (σ) Is Known and the Distribution Is Normal
Student's t-Test When the Population Standard Deviation (σ) Is Not Known
Solving Student's t-Test for a Single Sample by Using SPSS Software
Solving Student's t-Test for a Single Sample by Using Stata Software
Student's t-Test to Compare Two Population Means From Two Independent Random Samples
Case 1: σ₁² ≠ σ₂²
Case 2: σ₁² = σ₂²
Solving Student's t-Test From Two Independent Samples by Using SPSS Software
Solving Student's t-Test From Two Independent Samples by Using Stata Software
Student's t-Test to Compare Two Population Means From Two Paired Random Samples
Solving Student's t-Test From Two Paired Samples by Using SPSS Software
Solving Student's t-Test From Two Paired Samples by Using Stata Software
ANOVA to Compare the Means of More Than Two Populations
One-Way ANOVA
Solving the One-Way ANOVA Test by Using SPSS Software
Solving the One-Way ANOVA Test by Using Stata Software
Factorial ANOVA
Two-Way ANOVA
Solving the Two-Way ANOVA Test by Using SPSS Software
Solving the Two-Way ANOVA Test by Using Stata Software
ANOVA With More Than Two Factors
Final Remarks
Exercises
10. Nonparametric Tests
Introduction
Tests for One Sample
Binomial Test
Solving the Binomial Test Using SPSS Software
Solving the Binomial Test Using Stata Software
Chi-Square Test (χ²) for One Sample
Solving the χ² Test for One Sample Using SPSS Software
Solving the χ² Test for One Sample Using Stata Software
Sign Test for One Sample
Solving the Sign Test for One Sample Using SPSS Software
Solving the Sign Test for One Sample Using Stata Software
Tests for Two Paired Samples
McNemar Test
Solving the McNemar Test Using SPSS Software
Solving the McNemar Test Using Stata Software
Sign Test for Two Paired Samples
Solving the Sign Test for Two Paired Samples Using SPSS Software
Solving the Sign Test for Two Paired Samples Using Stata Software
Wilcoxon Test
Solving the Wilcoxon Test Using SPSS Software
Solving the Wilcoxon Test Using Stata Software
Tests for Two Independent Samples
Chi-Square Test (χ²) for Two Independent Samples
Solving the χ² Statistic Using SPSS Software
Solving the χ² Statistic by Using Stata Software
Mann-Whitney U Test
Solving the Mann-Whitney Test Using SPSS Software
Solving the Mann-Whitney Test Using Stata Software
Tests for k Paired Samples
Cochran's Q Test
Solving Cochran's Q Test by Using SPSS Software
Solving Cochran's Q Test by Using Stata Software
Friedman's Test
Solving Friedman's Test by Using SPSS Software
Solving Friedman's Test by Using Stata Software
Tests for k Independent Samples
The χ² Test for k Independent Samples
Solving the χ² Test for k Independent Samples on SPSS
Solving the χ² Test for k Independent Samples on Stata
Kruskal-Wallis Test
Solving the Kruskal-Wallis Test by Using SPSS Software
Solving the Kruskal-Wallis Test by Using Stata
Final Remarks
Exercises
Part V: Multivariate Exploratory Data Analysis
11. Cluster Analysis
Introduction
Cluster Analysis
Defining Distance or Similarity Measures in Cluster Analysis
Distance (Dissimilarity) Measures Between Observations for Metric Variables
Similarity Measures Between Observations for Binary Variables
Agglomeration Schedules in Cluster Analysis
Hierarchical Agglomeration Schedules
Notation
A Practical Example of Cluster Analysis With Hierarchical Agglomeration Schedules
Nearest-Neighbor or Single-Linkage Method
Furthest-Neighbor or Complete-Linkage Method
Between-Groups or Average-Linkage Method
Nonhierarchical K-Means Agglomeration Schedule
Notation
A Practical Example of a Cluster Analysis With the Nonhierarchical K-Means Agglomeration Schedule
Cluster Analysis with Hierarchical and Nonhierarchical Agglomeration Schedules in SPSS
Elaborating Hierarchical Agglomeration Schedules in SPSS
Elaborating Nonhierarchical K-Means Agglomeration Schedules in SPSS
Cluster Analysis With Hierarchical and Nonhierarchical Agglomeration Schedules in Stata
Elaborating Hierarchical Agglomeration Schedules in Stata
Elaborating Nonhierarchical K-Means Agglomeration Schedules in Stata
Final Remarks
Exercises
Appendix
Detecting Multivariate Outliers
12. Principal Component Factor Analysis
Introduction
Principal Component Factor Analysis
Pearson's Linear Correlation and the Concept of Factor
Overall Adequacy of the Factor Analysis: Kaiser-Meyer-Olkin Statistic and Bartlett's Test of Sphericity
Defining the Principal Component Factors: Determining the Eigenvalues and Eigenvectors of Correlation Matrix ρ and Calcula ...
Factor Loadings and Communalities
Factor Rotation
A Practical Example of the Principal Component Factor Analysis
Principal Component Factor Analysis in SPSS
Principal Component Factor Analysis in Stata
Final Remarks
Exercises
Appendix: Cronbach's Alpha
Brief Presentation
Determining Cronbach's Alpha Algebraically
Determining Cronbach's Alpha in SPSS
Determining Cronbach's Alpha in Stata
Part VI: Generalized Linear Models
13. Simple and Multiple Regression Models
Introduction
Linear Regression Models
Estimation of the Linear Regression Model by Ordinary Least Squares
Explanatory Power of the Regression Model: Coefficient of Determination R²
General Statistical Significance of the Regression Model and Each of Its Parameters
Construction of the Confidence Intervals of the Model Parameters and Elaboration of Predictions
Estimation of Multiple Linear Regression Models
Dummy Variables in Regression Models
Presuppositions of Regression Models Estimated by OLS
Normality of Residuals
The Multicollinearity Problem
Causes of Multicollinearity
Consequences of Multicollinearity
Application of Multicollinearity Examples in Excel
Multicollinearity Diagnostics
Possible Solutions for the Multicollinearity Problem
The Problem of Heteroskedasticity
Causes of Heteroskedasticity
Consequences of Heteroskedasticity
Heteroskedasticity Diagnostics: Breusch-Pagan/Cook-Weisberg Test
Weighted Least Squares Method: A Possible Solution
Huber-White Method for Robust Standard Errors
The Autocorrelation of Residuals Problem
Causes of the Autocorrelation of Residuals
Consequences of the Autocorrelation of Residuals
Autocorrelation of Residuals Diagnostic: The Durbin-Watson Test
Autocorrelation of Residuals Diagnostic: The Breusch-Godfrey Test
Possible Solutions for the Autocorrelation of Residuals Problem
Detection of Specification Problems: Linktest and RESET Test
Nonlinear Regression Models
The Box-Cox Transformation: The General Regression Model
Estimation of Regression Models in Stata
Estimation of Regression Models in SPSS
Final Remarks
Exercises
Appendix: Quantile Regression Models
A Brief Introduction
Example: Quantile Regression Model in Stata
14. Binary and Multinomial Logistic Regression Models
Introduction
The Binary Logistic Regression Model
Estimation of the Binary Logistic Regression Model by Maximum Likelihood
General Statistical Significance of the Binary Logistic Regression Model and Each of Its Parameters
Construction of the Confidence Intervals of the Parameters for the Binary Logistic Regression Model
Cutoff, Sensitivity Analysis, Overall Model Efficiency, Sensitivity, and Specificity
The Multinomial Logistic Regression Model
Estimation of the Multinomial Logistic Regression Model by Maximum Likelihood
General Statistical Significance of the Multinomial Logistic Regression Model and Each of Its Parameters
Construction of the Confidence Intervals of the Parameters for the Multinomial Logistic Regression Model
Estimation of Binary and Multinomial Logistic Regression Models in Stata
Binary Logistic Regression in Stata
Multinomial Logistic Regression in Stata
Estimation of Binary and Multinomial Logistic Regression Models in SPSS
Binary Logistic Regression in SPSS
Multinomial Logistic Regression in SPSS
Final Remarks
Exercises
Appendix: Probit Regression Models
A Brief Introduction
Example: Probit Regression Model in Stata
15. Regression Models for Count Data: Poisson and Negative Binomial
Introduction
The Poisson Regression Model
Estimation of the Poisson Regression Model by Maximum Likelihood
General Statistical Significance of the Poisson Regression Model and Each of Its Parameters
Construction of the Confidence Intervals of the Parameters for the Poisson Regression Model
Test to Verify Overdispersion in Poisson Regression Models
The Negative Binomial Regression Model
Estimation of the Negative Binomial Regression Model by Maximum Likelihood
General Statistical Significance of the Negative Binomial Regression Model and Each of Its Parameters
Construction of the Confidence Intervals of the Parameters for the Negative Binomial Regression Model
Estimating Regression Models for Count Data in Stata
Poisson Regression Model in Stata
Negative Binomial Regression Model in Stata
Regression Model Estimation for Count Data in SPSS
Poisson Regression Model in SPSS
Negative Binomial Regression Model in SPSS
Final Remarks
Exercises
Appendix: Zero-Inflated Regression Models
Brief Introduction
Example: Zero-Inflated Poisson Regression Model in Stata
Example: Zero-Inflated Negative Binomial Regression Model in Stata
Part VII: Optimization Models and Simulation
16. Introduction to Optimization Models: General Formulations and Business Modeling
Introduction to Optimization Models
Introduction to Linear Programming Models
Mathematical Formulation of a General Linear Programming Model
Linear Programming Model in the Standard and Canonical Forms
Linear Programming Model in the Standard Form
Linear Programming Model in the Canonical Form
Transformations Into the Standard or Canonical Form
Assumptions of the Linear Programming Model
Proportionality
Additivity
Divisibility and Non-negativity
Certainty
Modeling Business Problems Using Linear Programming
Production Mix Problem
Blending or Mixing Problem
Diet Problem
Capital Budget Problems
Portfolio Selection Problem
Model 1: Maximization of an Investment Portfolio's Expected Return
Model 2: Investment Portfolio Risk Minimization
Production and Inventory Problem
Aggregated Planning Problem
Final Remarks
Exercises
17. Solution of Linear Programming Problems
Introduction
Graphical Solution of a Linear Programming Problem
Linear Programming Maximization Problem with a Single Optimal Solution
Linear Programming Minimization Problem With a Single Optimal Solution
Special Cases
Multiple Optimal Solutions
Unlimited Objective Function z
There Is No Optimal Solution
Degenerate Optimal Solution
Analytical Solution of a Linear Programming Problem in Which m < n
The Simplex Method
Logic of the Simplex Method
Analytical Solution of the Simplex Method for Maximization Problems
Tabular Form of the Simplex Method for Maximization Problems
The Simplex Method for Minimization Problems
Special Cases of the Simplex Method
Multiple Optimal Solutions
Unlimited Objective Function z
There Is No Optimal Solution
Degenerate Optimal Solution
Solution by Using a Computer
Solver in Excel
Solution of the Examples Found in Section 16.6 of Chapter 16 Using Solver in Excel
Solution of Example 16.3 of Chapter 16 (Production Mix Problem at the Venix Toys)
Solution of Example 16.4 of Chapter 16 (Production Mix Problem at Naturelat Dairy)
Solution of Example 16.5 of Chapter 16 (Mix Problem of Oil-South Refinery)
Solution of Example 16.6 of Chapter 16 (Diet Problem)
Solution of Example 16.7 of Chapter 16 (Farmer's Problem)
Solution of Example 16.8 of Chapter 16 (Portfolio Selection-Maximization of the Expected Return)
Solution of Example 16.9 of Chapter 16 (Portfolio Selection-Minimization of the Portfolio's Mean Absolute Deviation)
Solution of Example 16.10 of Chapter 16 (Production and Inventory Problem of Fenix & Furniture)
Solution of Example 16.11 of Chapter 16 (Problem of Lifestyle Natural Juices Manufacturer)
Solver Error Messages for Unlimited and Infeasible Solutions
Unlimited Objective Function z
There Is No Optimal Solution
Result Analysis by Using the Solver Answer and Limits Reports
Answer Report
Limits Report
Sensitivity Analysis
Alteration in One of the Objective Function Coefficients (Graphical Solution)
Alteration in One of the Constants on the Right-Hand Side of the Constraint and Concept of Shadow Price (Graphical Solution)
Reduced Cost
Sensitivity Analysis With Solver in Excel
Special Case: Multiple Optimal Solutions
Special Case: Degenerate Optimal Solution
Exercises
18. Network Programming
Introduction
Terminology of Graphs and Networks
Classic Transportation Problem
Mathematical Formulation of the Classic Transportation Problem
Balancing the Transportation Problem When the Total Supply Capacity Is Not Equal to the Total Demand Consumed
Case 1: Total Supply Is Greater than Total Demand
Case 2: Total Supply Capacity Is Lower than Total Demand Consumed
Solution of the Classic Transportation Problem
The Transportation Algorithm
Solution of the Transportation Problem Using Excel Solver
Transhipment Problem
Mathematical Formulation of the Transhipment Problem
Solution of the Transhipment Problem Using Excel Solver
Job Assignment Problem
Mathematical Formulation of the Job Assignment Problem
Solution of the Job Assignment Problem Using Excel Solver
Shortest Path Problem
Mathematical Formulation of the Shortest Path Problem
Solution of the Shortest Path Problem Using Excel Solver
Maximum Flow Problem
Mathematical Formulation of the Maximum Flow Problem
Solution of the Maximum Flow Problem Using Excel Solver
Exercises
19. Integer Programming
Introduction
Mathematical Formulation of a General Model for Integer Programming and/or Binary and Linear Relaxation
The Knapsack Problem
Modeling of the Knapsack Problem
Solution of the Knapsack Problem Using Excel Solver
The Capital Budgeting Problem as a Model of Binary Programming
Solution of the Capital Budgeting Problem as a Model of Binary Programming Using Excel Solver
The Traveling Salesman Problem
Modeling of the Traveling Salesman Problem
Solution of the Traveling Salesman Problem Using Excel Solver
The Facility Location Problem
Modeling of the Facility Location Problem
Solution of the Facility Location Problem Using Excel Solver
The Staff Scheduling Problem
Solution of the Staff Scheduling Problem Using Excel Solver
Exercises
20. Simulation and Risk Analysis
Introduction to Simulation
The Monte Carlo Method
Monte Carlo Simulation in Excel
Generation of Random Numbers and Probability Distributions in Excel
Practical Examples
Case 1: Consumption of Red Wine
Case 2: Profit × Loss Forecast
Final Remarks
Exercises
Part VIII: Other Topics
21. Design and Analysis of Experiments
Introduction
Steps in the Design of Experiments
The Four Principles of Experimental Design
Types of Experimental Design
Completely Randomized Design (CRD)
Randomized Block Design (RBD)
Factorial Design (FD)
One-Way Analysis of Variance
Factorial ANOVA
Final Remarks
Exercises
22. Statistical Process Control
Introduction
Estimating the Process Mean and Variability
Control Charts for Variables
Control Charts for X̄ and R
Control Charts for X̄
Control Charts for R
Control Charts for X̄ and S
Control Charts for Attributes
P Chart (Defective Fraction)
np Chart (Number of Defective Products)
C Chart (Total Number of Defects per Unit)
U Chart (Average Number of Defects per Unit)
Process Capability
Cp Index
Cpk Index
Cpm and Cpmk Indexes
Final Remarks
Exercises
23. Data Mining and Multilevel Modeling
Introduction to Data Mining
Multilevel Modeling
Nested Data Structures
Hierarchical Linear Models
Two-Level Hierarchical Linear Models With Clustered Data (HLM2)
Three-Level Hierarchical Linear Models With Repeated Measures (HLM3)
Estimation of Hierarchical Linear Models in Stata
Estimation of a Two-Level Hierarchical Linear Model With Clustered Data in Stata
Estimation of a Three-Level Hierarchical Linear Model With Repeated Measures in Stata
Estimation of Hierarchical Linear Models in SPSS
Estimation of a Two-Level Hierarchical Linear Model With Clustered Data in SPSS
Estimation of a Three-Level Hierarchical Linear Model With Repeated Measures in SPSS
Final Remarks
Exercises
Appendix
Hierarchical Nonlinear Models
Answers
Answer Keys: Exercises: Chapter 2
Answer Keys: Exercises: Chapter 3
Answer Keys: Exercises: Chapter 4
Answer Keys: Exercises: Chapter 5
Answer Keys: Exercises: Chapter 6
Answer Keys: Exercises: Chapter 7
Answer Keys: Exercises: Chapter 8
Answer Keys: Exercises: Chapter 9
Answer Keys: Exercises: Chapter 10
Answer Keys: Exercises: Chapter 11
Answer Keys: Exercises: Chapter 12
Answer Keys: Exercises: Chapter 13
Answer Keys: Exercises: Chapter 14
Answer Keys: Exercises: Chapter 15
Answer Keys: Exercises: Chapter 16
Answer Keys: Exercises: Chapter 17
Answer Keys: Exercises: Chapter 18
Answer Keys: Exercises: Chapter 19
Answer Keys: Exercises: Chapter 20
Answer Keys: Exercises: Chapter 21
Answer Keys: Exercises: Chapter 22
Answer Keys: Exercises: Chapter 23
Appendices
References
Index
Data Science for Business and Decision Making
Luiz Paulo Fávero, School of Economics, Business and Accounting, University of São Paulo, São Paulo, SP, Brazil
Patrícia Belfiore, Center of Engineering, Modeling and Applied Social Sciences, Management Engineering, Federal University of ABC, São Bernardo do Campo, SP, Brazil
Academic Press is an imprint of Elsevier
125 London Wall, London EC2Y 5AS, United Kingdom
525 B Street, Suite 1650, San Diego, CA 92101, United States
50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States
The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, United Kingdom

© 2019 Elsevier Inc. All rights reserved.

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher's permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.

This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

Notices
Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary. Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.

To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.
Library of Congress Cataloging-in-Publication Data A catalog record for this book is available from the Library of Congress British Library Cataloguing-in-Publication Data A catalogue record for this book is available from the British Library ISBN 978-0-12-811216-8 For information on all Academic Press publications visit our website at https://www.elsevier.com/books-and-journals
Publisher: Candice Janco Acquisition Editor: Scott Bentley Editorial Project Manager: Susan Ikeda Production Project Manager: Purushothaman Vijayaraj Cover Designer: Miles Hitchen Typeset by SPi Global, India
Dedication

We dedicate this book to Ovídio and Leonor, Antonio and Ana Vera, for the unconditional effort dedicated to our education and development. We dedicate this book to Gabriela and Luiz Felipe, who are the reason for our existence.
Epigraph

When a human awakens to a great dream and throws the full force of his soul over it, all the universe conspires in his favor.
Johann Wolfgang von Goethe
Chapter 1

Introduction to Data Analysis and Decision Making

Everything in us is mortal, except the gifts of the spirit and of intelligence.
Ovid
1.1 INTRODUCTION: HIERARCHY BETWEEN DATA, INFORMATION, AND KNOWLEDGE
In academic and business environments, the growing command of research techniques and modern software packages, together with an understanding, by researchers and managers in the most varied fields of knowledge, of the importance of statistics and data modeling for defining objectives and substantiating research hypotheses based on underlying theories, has been producing papers that are more consistent and rigorous from a methodological and scientific standpoint.

Nevertheless, as the well-known Austrian philosopher Ludwig Joseph Johann Wittgenstein, later naturalized as a British citizen, used to say, methodological rigor alone, combined with authors who merely research more of the same topic, can generate a deep lack of oxygen in the academic world. Besides the availability of data, adequate software packages, and an adequate underlying theory, it is essential for researchers to also use their intuition and experience when defining their objectives and constructing their hypotheses, even when deciding to study the behavior of new and, sometimes, unimaginable variables in their models. This, believe it or not, may also generate interesting and innovative information for the decision-making process!

The basic principle of this book is to explain the hierarchy between data, information, and knowledge, at every turn, in this new scenario we live in. Whenever treated and analyzed, data are transformed into information. Knowledge, in turn, is generated at the moment such information is recognized and applied to the decision-making process. Analogously, the reverse hierarchy can also be applied: knowledge, whenever disseminated or explained, becomes information that, when broken up, can generate a dataset. Fig. 1.1 shows this logic.
1.2 OVERVIEW OF THE BOOK
The book is divided into 23 chapters, which are structured into eight major parts, as follows:

Part I: Foundations of Business Data Analysis
- Chapter 1: Introduction to Data Analysis and Decision Making.
- Chapter 2: Types of Variables and Measurement and Accuracy Scales.

Part II: Descriptive Statistics
- Chapter 3: Univariate Descriptive Statistics.
- Chapter 4: Bivariate Descriptive Statistics.

Part III: Probabilistic Statistics
- Chapter 5: Introduction to Probability.
- Chapter 6: Random Variables and Probability Distributions.

Part IV: Statistical Inference
- Chapter 7: Sampling.
- Chapter 8: Estimation.
- Chapter 9: Hypotheses Tests.
- Chapter 10: Nonparametric Tests.

Part V: Multivariate Exploratory Data Analysis
- Chapter 11: Cluster Analysis.
- Chapter 12: Principal Component Factor Analysis.

Part VI: Generalized Linear Models
- Chapter 13: Simple and Multiple Regression Models.
- Chapter 14: Binary and Multinomial Logistic Regression Models.
- Chapter 15: Regression Models for Count Data: Poisson and Negative Binomial.

Part VII: Optimization Models and Simulation
- Chapter 16: Introduction to Optimization Models: General Formulations and Business Modeling.
- Chapter 17: Solution of Linear Programming Problems.
- Chapter 18: Network Programming.
- Chapter 19: Integer Programming.
- Chapter 20: Simulation and Risk Analysis.

Part VIII: Other Topics
- Chapter 21: Design and Analysis of Experiments.
- Chapter 22: Statistical Process Control.
- Chapter 23: Data Mining and Multilevel Modeling.
[Fig. 1.1 is a flow diagram: Data, through treatment and analysis, become Information; Information, through decision making, becomes Knowledge. In the reverse direction, Knowledge, through diffusion, becomes Information, and Information, through dismemberment, becomes Data.]
FIG. 1.1 Hierarchy between data, information, and knowledge.
Each chapter is structured according to the same didactic logic of presentation, which we believe favors learning. First, the concepts regarding each topic are introduced, always followed by the algebraic solution, often in Excel, of practical exercises based on datasets developed primarily with an educational focus. Next, in many cases, the same exercises are solved in Stata Statistical Software® and IBM SPSS Statistics Software®. We believe that this logic facilitates the study and understanding of the correct use of each technique and of the analysis of the results. Moreover, the practical application of the models in Excel, Stata, and SPSS also benefits researchers, since the results can be compared, at every turn, to the ones already estimated or calculated algebraically in the previous sections of each chapter, in addition to providing an opportunity to use these important software packages. At the end of each chapter, additional exercises are proposed, whose answers, presented through the outputs generated, are available at the end of the book. The datasets used are available at www.elsevier.com.
1.3 FINAL REMARKS
All the benefits and potential of the techniques discussed here will be felt by researchers and managers as the procedures are practiced repeatedly. Since there are several methods, we must be very careful when choosing the technique: selecting the best alternative for treating the data fundamentally depends on this practice. The adequate use of the techniques presented in this book by professors, students, and business managers can more powerfully underpin a research project's initial perception and thus support the decision-making process. Generating knowledge about a phenomenon depends on a well-structured research plan, in which defining the variables to be collected, the dimensions of the sample, the development of the dataset, and the choice of the technique to be used are extremely important.
Thus, we believe that this book is meant for researchers who, for different reasons, are specifically interested in data science and decision making, as well as for those who want to deepen their knowledge by using the Excel, SPSS, and Stata software packages. This book is recommended to undergraduate and graduate students in the fields of Business Administration, Engineering, Economics, Accounting, Actuarial Science, Statistics, Psychology, Medicine and Health, and to students in other fields related to the Human, Exact, and Biomedical Sciences. It is also meant for students taking extension, lato sensu postgraduation, and MBA courses, as well as for company employees, consultants, and other researchers whose main objectives are to treat and analyze data, aiming at preparing data models, generating information, and improving knowledge through decision-making processes. To all the researchers and managers who use this book, we hope that adequate and ever more interesting research questions may arise, that analyses may be developed, and that reliable, robust, and useful models for decision-making processes may be constructed. We also hope that the interpretation of outputs may become friendlier and that the use of Excel, SPSS, and Stata may bear important and valuable fruit for new research and projects. We would like to thank everyone who contributed to making this book become a reality. We would also like to sincerely thank the professionals at Montvero Consulting and Training Ltd., at the International Business Machines Corporation (Armonk, New York), at StataCorp LP (College Station, Texas), and at Elsevier Publishing House, especially Andre Gerhard Wolff, J. Scott Bentley, and Susan E. Ikeda. Last but not least, we would like to thank the professors, students, and employees of the Economics, Business Administration and Accounting College of the University of São Paulo (FEA/USP) and of the Federal University of the ABC (UFABC).
Now it is time for you to get started! We would like to emphasize that any contributions, criticisms, and suggestions will always be welcome, so that, later on, they may be incorporated into this book and make it better.

Luiz Paulo Fávero
Patrícia Belfiore
Chapter 2
Types of Variables and Measurement and Accuracy Scales

And God said: π, i, 0, and 1, and the Universe was created.
Leonhard Euler
2.1 INTRODUCTION
A variable is a characteristic of the population (or sample) being studied, and it is possible to measure, count, or categorize it. The type of variable collected is crucial in the calculation of descriptive statistics and in the graphical representation of results, as well as in the selection of the statistical methods that will be used to analyze the data. According to Freund (2006), statistical data are the raw material of statistical research, arising whenever measurements are made or observations are recorded. This chapter discusses the existing types of variables (metric or quantitative and nonmetric or qualitative), as well as their respective scales of measurement (nominal and ordinal for qualitative variables; interval and ratio for quantitative variables). Classifying the types of variables based on the number of categories and scales of accuracy is also discussed (binary and polychotomous for qualitative variables; discrete and continuous for quantitative variables).
2.2 TYPES OF VARIABLES
Variables can be classified as nonmetric, also known as qualitative or categorical, or metric, also known as quantitative (Fig. 2.1). Nonmetric or qualitative variables represent the characteristics of an individual, object, or element that cannot be measured or quantified. The answers are given in categories. In contrast, metric or quantitative variables represent the characteristics of an individual, object, or element that result from a count (a finite set of values) or from a measurement (an infinite set of values).
2.2.1 Nonmetric or Qualitative Variables
As we will study in Chapter 3, the characteristics of nonmetric or qualitative variables can be represented through frequency distribution tables or graphically, without calculating measures of position, dispersion, or shape. The only exception is the mode, the measure that provides the variable's most frequent value, which can also be applied to nonmetric variables. Imagine that a questionnaire will be used to collect data on family income from a sample of consumers, based on certain salary ranges. Table 2.1 shows the variable categories. Note that both columns (income in minimum wages and income in dollars) represent qualitative variables, since the data are given as ranges. However, it is very common for researchers to classify them incorrectly, mainly when the variable has numerical values in the data. In this case, it is only possible to calculate frequencies, not summary measures such as the mean and standard deviation. The frequencies obtained for each income range can be seen in Table 2.2. A common error found in papers that use qualitative variables represented by numbers is the calculation of the sample mean, or of any other summary measure. Typically, the researcher calculates the mean of the limits of each range, assuming that this value corresponds to the real mean of the consumers in that range. However, since the data distribution is not necessarily linear or symmetrical around the mean, this hypothesis is often violated.
FIG. 2.1 Types of variables.
TABLE 2.1 Family Income Ranges × Social Class

Class   Minimum Wage Salaries (MWS)   Family Income ($)
A       Above 20 MWS                  Above $ 15,760.00
B       From 10 to 20 MWS             From $ 7880.00 to $ 15,760.00
C       From 4 to 10 MWS              From $ 3152.00 to $ 7880.00
D       From 2 to 4 MWS               From $ 1576.00 to $ 3152.00
E       Up to 2 MWS                   Up to $ 1576.00
TABLE 2.2 Frequencies × Family Income Ranges

Frequency   Family Income ($)
10%         Above $ 15,760.00
18%         From $ 7880.00 to $ 15,760.00
24%         From $ 3152.00 to $ 7880.00
36%         From $ 1576.00 to $ 3152.00
12%         Up to $ 1576.00
In order for us to be able to calculate summary measures, such as the mean and standard deviation, the variable being studied must necessarily be quantitative.
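To make this point concrete, here is a minimal Python sketch (not part of the book's Excel/SPSS/Stata workflow) with a hypothetical sample of respondents classified into the income ranges of Table 2.1. Only frequencies are computed, since a mean of the category labels would be meaningless:

```python
from collections import Counter

# Hypothetical answers from 25 respondents, each classified into one of the
# social classes A-E of Table 2.1; the letters are labels, not quantities.
responses = ["A"] * 3 + ["B"] * 4 + ["C"] * 6 + ["D"] * 9 + ["E"] * 3

counts = Counter(responses)
n = len(responses)

# Absolute and relative frequencies are the only meaningful summaries here;
# summing or averaging the category labels is undefined.
for category in sorted(counts):
    print(category, counts[category], f"{100 * counts[category] / n:.0f}%")
```

The same logic applies to the dollar ranges of Table 2.2: treating the range limits as numbers and averaging them is exactly the error described above.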
2.2.2 Metric or Quantitative Variables
Quantitative variables can be represented graphically (line charts, scatter plots, histograms, stem-and-leaf plots, and boxplots), through measures of position or location (mean, median, mode, quartiles, deciles, and percentiles), through measures of dispersion or variability (range, average deviation, variance, standard deviation, standard error, and coefficient of variation), or through measures of shape, such as skewness and kurtosis, as we will study in Chapter 3. These variables can be discrete or continuous. Discrete variables can take on a finite set of values that frequently come from a count, such as the number of children in a family (0, 1, 2, …). Continuous variables, conversely, take on values in an interval of real numbers, such as an individual's weight or income. Imagine a dataset with the name, age, weight, and height of 20 people, as shown in Table 2.3. The data are available in the file VarQuanti.sav. To classify the variables in SPSS (Fig. 2.2), let's click on Variable View. Note that the variable Name is qualitative (a string) and is measured on a nominal scale (column Measure), whereas the variables Age, Weight, and Height are quantitative (Numeric) and are measured on a scale (Scale). The variables' scales of measurement will be studied in more detail in Section 2.3.
TABLE 2.3 Dataset With Information on 20 People

Name        Age (Years)   Weight (kg)   Height (m)
Mariana     48            62            1.60
Roberta     41            56            1.62
Luiz        54            84            1.76
Leonardo    30            82            1.90
Felipe      35            76            1.85
Marcelo     60            98            1.78
Melissa     28            54            1.68
Sandro      50            70            1.72
Armando     40            75            1.68
Heloisa     24            50            1.59
Julia       44            65            1.62
Paulo       39            83            1.75
Manoel      22            68            1.78
Ana Paula   31            56            1.66
Amelia      45            60            1.64
Horacio     62            88            1.77
Pedro       24            80            1.92
Joao        28            75            1.80
Marcos      49            92            1.76
Celso       54            66            1.68
FIG. 2.2 Classification of the variables.
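As a complement to the SPSS classification above, a short Python sketch (an illustration, not the book's procedure) shows the kind of summary measures that become legitimate once a variable is quantitative. The values below are the first ten rows of Table 2.3:

```python
import statistics

# First ten observations of Table 2.3: age in years and height in metres.
ages    = [48, 41, 54, 30, 35, 60, 28, 50, 40, 24]
heights = [1.60, 1.62, 1.76, 1.90, 1.85, 1.78, 1.68, 1.72, 1.68, 1.59]

# Quantitative (ratio-scale) variables support the full set of summary
# measures: central tendency as well as dispersion statistics.
print(statistics.mean(ages))                # arithmetic mean of the ages
print(statistics.median(ages))              # middle value of the ages
print(round(statistics.stdev(heights), 3))  # sample standard deviation
```

None of these operations would be meaningful for the nominal variable Name, which is why identifying the type of variable must come first.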
2.3 TYPES OF VARIABLES × SCALES OF MEASUREMENT
Variables can also be classified according to the level or scale of measurement. Measurement is the process of assigning numbers or labels to objects, people, states, or events, in accordance with specific rules, in order to represent the quantities or qualities of their attributes. A rule is a guide, a method, or a command that tells the researcher how to measure the attribute. A scale is a set of symbols or numbers, based on a rule, applied to individuals or to their behaviors or attitudes. An individual's position on the scale is based on whether that individual has the attribute the scale is meant to measure. Several taxonomies can be found in the existing literature for classifying the scales of measurement of all types of variables (Stevens, 1946; Hoaglin et al., 1983). We will use Stevens' classification because it is simple, widely used, and because its nomenclature is adopted in statistical software. According to Stevens (1946), the scales of measurement of nonmetric, categorical, or qualitative variables can be classified as nominal and ordinal, while metric or quantitative variables are measured on interval and ratio (or proportional) scales, as shown in Fig. 2.3.
FIG. 2.3 Types of variables × scales of measurement.
2.3.1 Nonmetric Variables—Nominal Scale
The nominal scale classifies the units into classes or categories regarding the characteristic represented, without establishing any magnitude or order relationship. It is called nominal because the categories are differentiated only by their names. We can assign numerical labels to the variable categories, but arithmetic operations, such as addition, subtraction, multiplication, and division, over these numbers are not allowed. The nominal scale allows only a few elementary operations: for instance, we can count the number of elements in each class or apply hypotheses tests regarding the distribution of the population units across the classes. Thus, most of the usual statistics, such as the mean and standard deviation, do not make any sense for qualitative variables on a nominal scale. As examples of nonmetric variables on nominal scales, we can mention profession, religion, color, marital status, geographic location, and country of origin. Imagine a nonmetric variable related to the country of origin of 10 large multinational companies. To represent the categories of the variable Country of origin, we can use numbers, assigning value 1 to the United States, 2 to the Netherlands, 3 to China, 4 to the United Kingdom, and 5 to Brazil, as shown in Table 2.4. In this case, the numbers are only labels or tags that help identify and classify objects. This scale of measurement is known as a nominal scale, that is, the numbers are arbitrarily assigned to the object categories, without any kind of order. To represent the behavior of nominal data, we can use descriptive statistics such as frequency distribution tables, bar or pie charts, and the mode (Chapter 3). Next, we will discuss how to define labels for qualitative variables on a nominal scale using SPSS (Statistical Package for the Social Sciences). After that, we will be able to construct absolute and relative frequency tables and charts.
Before generating the dataset, let's define the characteristics of the variables being studied in Variable View. To do that, click on the respective tab, available in the lower left-hand corner of the Data Editor, or double-click on the column var.
TABLE 2.4 Companies and Country of Origin

Company              Country of Origin
Exxon Mobil          1
JP Morgan Chase      1
General Electric     1
Royal Dutch Shell    2
ICBC                 3
HSBC Holdings        4
PetroChina           3
Berkshire Hathaway   1
Wells Fargo          1
Petrobras            5
The first variable, called Company, is a string, that is, its data are inserted as characters or letters. It was established that the maximum number of characters of this variable would be 18. In the column Measure, the scale of measurement of the variable Company is defined as nominal. The second variable, called Country, is numeric, since its data are inserted as numbers. However, the numbers are only used to categorize or label the objects, so the scale of measurement of this variable is also nominal (Fig. 2.4). To insert the data from Table 2.4, we go back to Data View. The information must be typed as shown in Fig. 2.5 (the columns represent the variables and the rows represent the observations or individuals). Since the variable Country is represented by numbers, it is necessary to assign labels to each variable category, as shown in Table 2.5. In order to do that, we must click on Data → Define Variable Properties… and select the variable Country, according to Figs. 2.6 and 2.7. Since the nominal scale of measurement of the variable Country has already been defined in the column Measure in Variable View, it already appears correctly in Fig. 2.8. The labels for each category are defined at this moment, as can also be seen in the same figure. The dataset is then displayed with the label names assigned, as shown in Fig. 2.9. By clicking on Value Labels, located on the toolbar, it is possible to alternate between the numerical values of the nominal or ordinal variable and their respective labels. Having structured the dataset, it is possible to construct absolute and relative frequency tables and charts in SPSS.
FIG. 2.4 Defining the variable characteristics in Variable View.
FIG. 2.5 Inserting the data found in Table 2.4 into Data View.
TABLE 2.5 Categories Assigned to the Countries

Category   Country
1          United States
2          The Netherlands
3          China
4          The United Kingdom
5          Brazil
FIG. 2.6 Defining labels for each nominal variable category.
The descriptive statistics to represent the behavior of a single qualitative variable and of two qualitative variables will be studied in Chapters 3 and 4, respectively.
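The value-labels idea can also be expressed in a few lines of Python (a sketch, not the SPSS procedure itself), mapping the numeric codes of Table 2.5 onto country names for some of the companies of Table 2.4:

```python
# Numeric codes on a nominal scale are mere tags; a dictionary plays the
# role of SPSS value labels (codes and names from Table 2.5).
country_labels = {1: "United States", 2: "The Netherlands", 3: "China",
                  4: "The United Kingdom", 5: "Brazil"}

# A few of the company/code pairs from Table 2.4.
companies = [("Exxon Mobil", 1), ("Royal Dutch Shell", 2), ("ICBC", 3),
             ("HSBC Holdings", 4), ("Petrobras", 5)]

for name, code in companies:
    print(name, "->", country_labels[code])
```

As in SPSS, the codes carry no order or magnitude; the mapping exists only so that outputs can display names instead of numbers.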
2.3.2 Nonmetric Variables—Ordinal Scale
A nonmetric variable on an ordinal scale classifies the units into classes or categories regarding the characteristic being represented, establishing an order between the units of the different categories. An ordinal scale is a scale on which the data are shown in order, determining the relative position of the classes according to one direction. Any set of values can be assigned to the variable categories, as long as the order between them is respected. As on the nominal scale, arithmetic operations (sums, subtractions, multiplications, and divisions) between these values do not make any sense. Thus, the application of the usual descriptive statistics is also limited, just as it is for nominal variables. Since the scale numbers are meant only for classification, the descriptive statistics that can be used for ordinal data are frequency distribution tables, charts (including bar and pie charts), and the mode, as we will study in Chapter 3.
FIG. 2.7 Selecting the nominal variable Country.
Examples of ordinal variables include consumers' opinion and satisfaction scales, educational level, social class, age, etc. Imagine a nonmetric variable called Classification that measures a group of consumers' preference regarding a certain wine brand. The definition of labels for each ordinal variable category can be found in Table 2.6. Value 1 is assigned to the worst classification, value 2 to the second worst, and so on, up to value 5, the best classification, as shown in this table. Instead of using a scale from 1 to 5, we could have assigned any other numerical scale, as long as the order of classification was respected. Thus, the numerical values do not represent a score of the product's quality; they are only meant to classify it. Consequently, the difference between these values does not represent the difference in the attribute analyzed. Scales of measurement of this kind are known as ordinal scales. Fig. 2.10 shows the characteristics of the variables being studied in Variable View in SPSS. The variable Customer is a string (its data are inserted as characters or letters) with a nominal scale of measurement. The variable Classification, on the other hand, is numeric (numerical values were assigned to represent the variable categories) with an ordinal scale of measurement. The procedure for defining labels for qualitative variables on an ordinal scale is the same as the one already presented for nominal variables.
FIG. 2.8 Defining the labels for the variable Country.
FIG. 2.9 Dataset with labels.
TABLE 2.6 Consumers' Classification of a Certain Wine Brand

Value   Label
1       Very bad
2       Bad
3       Average
4       Good
5       Very good
FIG. 2.10 Defining the variable characteristics in Variable View.
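For the ordinal wine scale of Table 2.6, a small Python sketch (with hypothetical consumer answers; the book itself works in SPSS) illustrates the one summary measure allowed for ordinal data, the mode:

```python
from statistics import mode

# Ordinal codes from Table 2.6: the order matters, the distances do not.
scale = {1: "Very bad", 2: "Bad", 3: "Average", 4: "Good", 5: "Very good"}

ratings = [4, 5, 3, 4, 2, 4, 5, 3]   # hypothetical consumer answers

# The mode is legitimate for ordinal data; averaging the codes is not,
# since the distances between categories are undefined.
print(scale[mode(ratings)])   # prints "Good"
```

Replacing the codes 1-5 with, say, 10, 20, 35, 70, 100 would leave the mode's category unchanged, which is precisely what "any order-respecting numbering" means.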
2.3.3 Quantitative Variable—Interval Scale
According to Stevens' classification (1946), metric or quantitative variables have data on an interval or a ratio scale. Besides ordering the units based on the characteristic being measured, the interval scale has a constant unit of measure. The origin, or zero point, of this scale of measurement is arbitrary and does not express an absence of quantity. A classic example of an interval scale is temperature measured in Celsius (°C) or in Fahrenheit (°F). The choice of temperature zero is arbitrary, and equal temperature differences are determined by identifying equal expansion volumes of the liquid inside the thermometer. Hence, the interval scale allows us to evaluate differences between the units measured. However, we cannot state that a value on a specific interval scale is a multiple of another. For instance, assume that two objects are measured at 15°C and 30°C, respectively. Measuring the temperature allows us to determine how much hotter one object is than the other, but we cannot state that the object at 30°C is twice as hot as the one at 15°C. The interval scale is invariant under positive linear transformations: an interval scale can be transformed into another through a positive linear transformation, and converting degrees Celsius into degrees Fahrenheit is an example. Most descriptive statistics can be applied to data on an interval scale, except those based on the ratio scale, such as the coefficient of variation.
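The Celsius example can be checked numerically. The following sketch (plain Python, for illustration) applies the positive linear transformation F = 1.8C + 32 and shows that differences are meaningful while ratios are not preserved:

```python
def to_fahrenheit(c):
    # Positive linear transformation between two interval scales.
    return 1.8 * c + 32

a, b = 15.0, 30.0   # the two temperatures (in Celsius) from the text

print(b - a)                                 # differences are meaningful
print(b / a)                                 # 2.0 in Celsius...
print(to_fahrenheit(b) / to_fahrenheit(a))   # ...but 86/59, not 2.0, in F
```

The same pair of objects is "twice as hot" in one unit and not in the other, which is exactly why ratios are undefined on an interval scale.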
2.3.4 Quantitative Variable—Ratio Scale
Analogous to the interval scale, the ratio scale orders the units based on the characteristic measured and has a constant unit of measure. In contrast, its origin (or zero point) is unique, and the value zero expresses an absence of quantity. Therefore, it is possible to know whether a value on the scale is a multiple of another: equal ratios between values of the scale correspond to equal ratios between the units measured. Thus, ratio scales are invariant under positive proportional transformations. For example, if one unit is 1 m tall and another is 3 m, we can say that the latter is three times as tall as the former. Among the scales of measurement, the ratio scale is the most complete, because it allows all arithmetic operations, and all descriptive statistics can be applied to data expressed on a ratio scale. Examples of variables whose data can be on a ratio scale include income, age, the number of units of a certain product manufactured, and distance traveled.
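The height example, by contrast, lives on a ratio scale, and a two-line check (an illustrative Python sketch) confirms that the ratio survives a change of unit, that is, a positive proportional transformation:

```python
# Heights on a ratio scale: zero means absence of height, so ratios are
# meaningful and unit changes (m -> cm) preserve them.
h1_m, h2_m = 1.0, 3.0                  # the 1 m and 3 m units from the text
h1_cm, h2_cm = 100 * h1_m, 100 * h2_m  # the same heights in centimetres

print(h2_m / h1_m)      # 3.0: three times as tall
print(h2_cm / h1_cm)    # 3.0: the same ratio after rescaling
```

Compare this with the temperature case above: Celsius-to-Fahrenheit adds a constant (32), so it shifts the zero point and destroys ratios, while metres-to-centimetres only rescales and preserves them.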
2.4 TYPES OF VARIABLES × NUMBER OF CATEGORIES AND SCALES OF ACCURACY
Qualitative or categorical variables can also be classified based on the number of categories: (a) dichotomous or binary (dummies), when they only take on two categories; (b) polychotomous, when they take on more than two categories. On the other hand, metric or quantitative variables can also be classified based on the scale of accuracy: discrete or continuous. This classification can be seen in Fig. 2.11.
2.4.1 Dichotomous or Binary Variable (Dummy)
A dichotomous or binary variable (dummy) can take on only two categories, to which the values 0 or 1 are assigned: value 1 when the characteristic of interest is present and value 0 otherwise. As examples, we have: smokers (1) and nonsmokers (0), a developed country (1) and an underdeveloped country (0), vaccinated patients (1) and nonvaccinated patients (0). Multivariate dependence techniques have as their main objective the specification of a model able to explain and predict the behavior of one or more dependent variables through one or more explanatory variables. Many of these techniques, including simple and multiple regression analysis, binary and multinomial logistic regression, regression for count data, and multilevel modeling, among others, can easily and coherently be applied with nonmetric explanatory variables, as long as these are transformed into binary variables that represent the categories of the original qualitative variable. In this regard, a qualitative variable with n categories can be represented by (n - 1) binary variables. For instance, imagine a variable called Evaluation, expressed by the categories good, average, and bad. Two binary variables may thus be necessary to represent the original variable, depending on the researcher's objectives, as shown in Table 2.7. Further details about the definition of dummy variables in confirmatory models will be discussed in Chapter 13, including the operations necessary to generate them in software such as Stata.
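The (n - 1) dummy coding for the three-category variable Evaluation can be sketched directly in Python (an illustration; the book generates dummies in Stata), with good as the reference category:

```python
# Dummy coding for a variable with n = 3 categories: two dummies suffice.
# "Good" is the reference category, coded D1 = D2 = 0.
def to_dummies(evaluation):
    d1 = 1 if evaluation == "Average" else 0   # flags the "Average" category
    d2 = 1 if evaluation == "Bad" else 0       # flags the "Bad" category
    return d1, d2

for ev in ["Good", "Average", "Bad"]:
    print(ev, to_dummies(ev))
```

A third dummy for good would be redundant, since its value is fully determined by the other two; this is why n categories need only (n - 1) dummies.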
2.4.2 Polychotomous Variable
A qualitative variable can take on more than two categories and, in this case, it is called polychotomous. As examples, we can mention social classes (lower, middle, and upper) and educational levels (elementary school, high school, college, and graduate school).
2.4.3 Discrete Quantitative Variable
As described in Section 2.2.2, discrete quantitative variables can take on a finite set of values that frequently come from a count, such as, for example, the number of children in a family (0, 1, 2…), the number of senators elected, or the number of cars manufactured in a certain factory.
2.4.4 Continuous Quantitative Variable
Continuous quantitative variables, on the other hand, are those whose possible values are in an interval with real numbers and result from a metric measurement, as, for example, weight, height, or an individual’s salary (Bussab and Morettin, 2011).
FIG. 2.11 Qualitative variables × number of categories and quantitative variables × scales of accuracy.
TABLE 2.7 Defining Binary Variables (Dummies) for the Variable Evaluation

             Binary Variables (Dummies)
Evaluation   D1   D2
Good         0    0
Average      1    0
Bad          0    1

2.5 FINAL REMARKS
Whenever treated and analyzed through statistical techniques, data are transformed into information and can support the decision-making process. These data can be metric (quantitative) or nonmetric (categorical or qualitative). Metric data represent the characteristics of an individual, object, or element that result from a count or measurement (patients' weight, age, and interest rates, among other examples). Nonmetric data represent characteristics that cannot be measured or quantified (answers such as yes or no, educational levels, among others). According to Stevens (1946), the scales of measurement of nonmetric, categorical, or qualitative variables can be classified as nominal and ordinal, while metric or quantitative variables are measured on interval and ratio (or proportional) scales. Much data can be collected in either a metric or a nonmetric way. Assume that we wish to assess the quality of a certain product. To do that, scores from 1 to 10 regarding certain attributes can be assigned, and a Likert scale can be defined based on previously established information. In general, and whenever possible, questions should be defined in a quantitative way, so that the researcher does not lose data information. For Fávero et al. (2009), constructing the questionnaire and defining the variables' scales of measurement depend on several aspects, including the research objectives, the modeling to be adopted to achieve those objectives, the average time needed to apply the questionnaire, and how it will be collected. A dataset can contain variables on metric as well as nonmetric scales; it does not need to restrict itself to only one type of scale. This combination can yield interesting research and, together with suitable modeling, generate information aimed at assisting the decision-making process.
The type of variable collected is crucial in the calculation of descriptive statistics and in the graphical representation of results, as well as in the selection of the statistical methods that will be used to analyze the data.
2.6 EXERCISES

1) What is the difference between qualitative and quantitative variables?
2) What are scales of measurement and what are the main types of scales? What are the differences between them?
3) What is the difference between discrete and continuous variables?
4) Classify the variables below according to the following scales: nominal, ordinal, binary, discrete, or continuous.
a. A company's revenue.
b. A performance rank: good, average, and bad.
c. Time to process a part.
d. Number of cars sold.
e. Distance traveled in km.
f. Municipalities in Greater São Paulo.
g. Family income ranges.
h. A student's grades: A, B, C, D, O, or R.
i. Hours worked.
j. Region: North, Northeast, Center-West, South, and Southeast.
k. Location: São Paulo or Seoul.
l. Size of the organization: small, medium, and large.
m. Number of bedrooms.
n. Classification of risk: high, average, speculative, substantial, in moratorium.
o. Married: yes or no.
5) A researcher wishes to study the impact of physical aptitude on the improvement of productivity in an organization. How would you describe the binary variables to be included in this model, so that the variable physical aptitude could be represented? The possible variable categories are: (a) active and healthy; (b) acceptable (could be better); (c) not good enough; (d) sedentary.
Chapter 3
Univariate Descriptive Statistics Mathematics is the alphabet with which God has written the Universe. Galileo Galilei
3.1 INTRODUCTION
Descriptive statistics describes and summarizes the main characteristics observed in a dataset through tables, charts, graphs, and summary measures, allowing the researcher to have a better understanding of the data's behavior. The analysis is based on the dataset being studied (the sample), without drawing conclusions or inferences about the population. Researchers can use descriptive statistics to study a single variable (univariate descriptive statistics), two variables (bivariate descriptive statistics), or more than two variables (multivariate descriptive statistics). In this chapter, we will study the concepts of descriptive statistics involving a single variable.

Univariate descriptive statistics considers the following topics: (a) the frequency in which a set of data occurs, through frequency distribution tables; (b) the representation of the variable's distribution through charts; and (c) measures that represent a data series, such as measures of position or location, measures of dispersion or variability, and measures of shape (skewness and kurtosis).

The four main goals of this chapter are: (1) to introduce the most common concepts related to tables, charts, and summary measures in univariate descriptive statistics; (2) to present their applications in real examples; (3) to construct tables, charts, and summary measures using Excel and the statistical software packages SPSS and Stata; and (4) to discuss the results achieved.

As described in the previous chapter, before we begin using descriptive statistics, it is necessary to identify the type of variable being studied. The type of variable is essential when calculating descriptive statistics and in the graphical representation of the results. Fig. 3.1 shows the univariate descriptive statistics that will be studied in this chapter, represented by tables, charts, graphs, and summary measures, for each type of variable.

Fig. 3.1 summarizes the following information:
a) The descriptive statistics used to represent the behavior of one qualitative variable's data are frequency distribution tables and graphs/charts.
b) The frequency distribution table for a qualitative variable represents the frequency in which each variable category occurs.
c) The graphical representation of qualitative variables can be illustrated by bar charts (horizontal and vertical), pie charts, and the Pareto chart.
d) For quantitative variables, the most common descriptive statistics are charts and summary measures (measures of position or location, of dispersion or variability, and of shape). Frequency distribution tables can also be used to represent the frequency in which each possible value of a discrete variable occurs, or the frequency of continuous data grouped into classes.
e) Line graphs, dot or scatter plots, histograms, stem-and-leaf plots, and boxplots (box-and-whisker diagrams) are normally used for the graphical representation of quantitative variables.
f) Measures of position or location can be divided into measures of central tendency (mean, mode, and median) and quantiles (quartiles, deciles, and percentiles).
g) The most common measures of dispersion or variability are the range, average deviation, variance, standard deviation, standard error, and coefficient of variation.
h) The measures of shape include measures of skewness and kurtosis.
FIG. 3.1 A brief summary of univariate descriptive statistics. *The mode, which provides the most frequent value of the variable, is the only summary measure that can also be used for qualitative variables.
3.2 FREQUENCY DISTRIBUTION TABLE
Frequency distribution tables can be used to represent the frequency in which a set of data with qualitative or quantitative variables occurs. In the case of qualitative variables, the table represents the frequency in which each variable category happens. For discrete quantitative variables, the frequency of occurrences is calculated for each discrete value of the variable. On the other hand, continuous variable data are first grouped into classes and, afterwards, we calculate the frequencies in which each class occurs. A frequency distribution table contains the following calculations:

a) Absolute frequency (Fi): number of times each value i appears in the sample.
b) Relative frequency (Fri): percentage related to the absolute frequency.
c) Cumulative frequency (Fac): sum of the absolute frequencies of all values equal to or less than the value being analyzed.
d) Relative cumulative frequency (Frac): percentage related to the cumulative frequency (sum of all relative frequencies equal to or less than the value being considered).
3.2.1 Frequency Distribution Table for Qualitative Variables
Through a practical example, we will build the frequency distribution table using the calculations of the absolute frequency, relative frequency, cumulative frequency, and relative cumulative frequency for each category of the qualitative variable being analyzed. Example 3.1 Saint August Hospital provides 3000 blood transfusions to hospitalized patients every month. In order for the hospital to be able to maintain its stocks, 60 blood donations a day are necessary. Table 3.E.1 shows the total number of donors for each blood type on a certain day. Build the frequency distribution table for this problem.
TABLE 3.E.1 Total Number of Donors of Each Blood Type

Blood Type   Donors
A+           15
A-            2
B+            6
B-            1
AB+           1
AB-           1
O+           32
O-            2
Solution
The complete frequency distribution table for Example 3.1 is shown in Table 3.E.2:

TABLE 3.E.2 Frequency Distribution of Example 3.1

Blood Type   Fi   Fri (%)   Fac   Frac (%)
A+           15   25        15    25
A-            2    3.33     17    28.33
B+            6   10        23    38.33
B-            1    1.67     24    40
AB+           1    1.67     25    41.67
AB-           1    1.67     26    43.33
O+           32   53.33     58    96.67
O-            2    3.33     60   100
Sum          60  100
3.2.2 Frequency Distribution Table for Discrete Data
Through the frequency distribution table, we can calculate the absolute frequency, the relative frequency, the cumulative frequency, and the relative cumulative frequency for each possible value of the discrete variable. Unlike qualitative variables, instead of the possible categories we must list the possible numeric values. To facilitate understanding, the data must be presented in ascending order.

Example 3.2
A Japanese restaurant is defining the new layout for its tables and, in order to do that, it collected information on the number of people who have lunch and dinner at each table throughout one week. Table 3.E.3 shows the first 40 pieces of data collected. Build the frequency distribution table for these data.
TABLE 3.E.3 Number of People per Table

2   5   4   7   4   1   6   2
2   5   4  12   8   6   4   5
2   8   2   6   4   7   2   5
6   4   1   5  10   2   2  10
6   4   3   4   6   3   8   4
Solution
In the table below, each row of the first column represents a possible numeric value of the variable being analyzed, with the data sorted in ascending order. The complete frequency distribution table for Example 3.2 is shown in Table 3.E.4.

TABLE 3.E.4 Frequency Distribution for Example 3.2

Number of People   Fi   Fri (%)   Fac   Frac (%)
1                   2    5          2     5
2                   8   20         10    25
3                   2    5         12    30
4                   9   22.5       21    52.5
5                   5   12.5       26    65
6                   6   15         32    80
7                   2    5         34    85
8                   3    7.5       37    92.5
10                  2    5         39    97.5
12                  1    2.5       40   100
Sum                40  100
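The calculations in a table such as this can also be reproduced programmatically. The sketch below is our own illustration in Python (the book itself uses Excel, SPSS, and Stata; the helper name `frequency_table` is an assumption of this sketch) and computes Fi, Fri, Fac, and Frac for a list of observations:

```python
from collections import Counter

def frequency_table(data):
    """Rows of (value, Fi, Fri %, Fac, Frac %), values in ascending order."""
    n = len(data)
    counts = Counter(data)           # absolute frequency of each value
    rows, fac = [], 0
    for value in sorted(counts):
        fi = counts[value]
        fac += fi                    # cumulative frequency
        rows.append((value, fi, round(100 * fi / n, 2),
                     fac, round(100 * fac / n, 2)))
    return rows
```

Applied to the 40 observations of Example 3.2, it reproduces Table 3.E.4 row by row (for instance, the row for 4 people per table comes out as Fi = 9, Fri = 22.5%, Fac = 21, Frac = 52.5%).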
3.2.3 Frequency Distribution Table for Continuous Data Grouped into Classes
As described in Chapter 2, continuous quantitative variables are those whose possible values lie in an interval of real numbers. Therefore, it makes no sense to calculate the frequency of each possible value, since values rarely repeat themselves; it is better to group the data into classes or ranges. The interval to be defined between the classes is arbitrary. However, we must be careful: if the number of classes is too small, a lot of information can be lost; on the other hand, if the number of classes is too large, the summary of information is compromised (Bussab and Morettin, 2011). The interval between the classes does not need to be constant but, in order to keep things simple, we will assume the same interval. The following steps must be taken to build a frequency distribution table for continuous data:

Step 1: Sort the data in ascending order.
Step 2: Determine the number of classes (k), using one of the options:
a) Sturges' rule: k = 1 + 3.3 · log(n)
b) The square-root rule: k = √n
where n is the sample size. The value of k must be rounded to an integer.
Step 3: Determine the interval between the classes (h), calculated as the range of the sample (A = maximum value − minimum value) divided by the number of classes:
h = A/k
The value of h is rounded up to the nearest integer.
Step 4: Build the frequency distribution table (calculate the absolute frequency, the relative frequency, the cumulative frequency, and the relative cumulative frequency) for each class. The lowest limit of the first class corresponds to the minimum value of the sample. To determine the highest limit of each class, we add the value of h to the lowest limit of the respective class; the lowest limit of the next class corresponds to the highest limit of the previous one.
Example 3.3 Consider the data in Table 3.E.5 regarding the grades of 30 students enrolled in the subject Financial Market. Elaborate a frequency distribution table for this problem.
TABLE 3.E.5 Grades of 30 Students Enrolled in the Subject Financial Market

4.2   3.9   5.7   6.5   4.6   6.3
8.0   4.4   5.0   5.5   6.0   4.5
5.0   7.2   6.4   7.2   5.0   6.8
4.7   3.5   6.0   7.4   8.8   3.8
5.5   5.0   6.6   7.1   5.3   4.7
Note: To determine the number of classes, use Sturges’ rule.
Solution
Let's apply the four steps to build the frequency distribution table of Example 3.3, whose variable is continuous:
Step 1: Let's sort the data in ascending order, as shown in Table 3.E.6.

TABLE 3.E.6 Data From Table 3.E.5 Sorted in Ascending Order

3.5   3.8   3.9   4.2   4.4   4.5
4.6   4.7   4.7   5.0   5.0   5.0
5.0   5.3   5.5   5.5   5.7   6.0
6.0   6.3   6.4   6.5   6.6   6.8
7.1   7.2   7.2   7.4   8.0   8.8
Step 2: Let's determine the number of classes (k) by using Sturges' rule:

k = 1 + 3.3 · log(30) = 5.87 ≅ 6

Step 3: The interval between the classes (h) is given by:

h = A/k = (8.8 − 3.5)/6 = 0.88 ≅ 1

Step 4: Finally, let's build the frequency distribution table for each class. The lowest limit of the first class corresponds to the minimum grade, 3.5. From this value, we add the interval between the classes (1), so the highest limit of the first class will be 4.5. The second class starts from this value, and so on, until the last class is defined. We use the notation ├ to indicate that the lowest limit is included in the class and the highest limit is not. The complete frequency distribution table for Example 3.3 is presented in Table 3.E.7.
TABLE 3.E.7 Frequency Distribution for Example 3.3

Class        Fi   Fri (%)   Fac   Frac (%)
3.5 ├ 4.5     5   16.67      5    16.67
4.5 ├ 5.5     9   30        14    46.67
5.5 ├ 6.5     7   23.33     21    70
6.5 ├ 7.5     7   23.33     28    93.33
7.5 ├ 8.5     1    3.33     29    96.67
8.5 ├ 9.5     1    3.33     30   100
Sum          30  100
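The four steps above can also be sketched in code. The following Python fragment is our own illustration (the book performs these calculations by hand and in Excel; the function name `classes_sturges` is an assumption), applying Sturges' rule and the class width h to continuous data:

```python
import math

def classes_sturges(data):
    """k classes by Sturges' rule, width h = ceil(range/k); lower limit
    included and upper limit excluded, as in the |- notation of the text."""
    n = len(data)
    k = round(1 + 3.3 * math.log10(n))           # Sturges' rule
    h = math.ceil((max(data) - min(data)) / k)   # class width, rounded up
    lower = min(data)
    freqs = []
    for _ in range(k):
        freqs.append(sum(lower <= x < lower + h for x in data))
        lower += h
    return k, h, freqs
```

For the 30 grades of Example 3.3, this yields k = 6, h = 1, and the absolute frequencies per class 5, 9, 7, 7, 1, 1, matching Table 3.E.7.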
3.3 GRAPHICAL REPRESENTATION OF THE RESULTS
The behavior of qualitative and quantitative variable data can also be represented graphically. Charts are a representation of numeric data, in the form of geometric figures (graphs, diagrams, drawings, or images), allowing the reader to interpret these data quickly and objectively. In Section 3.3.1, the main graphical representations for qualitative variables are illustrated: bar charts (horizontal and vertical), pie charts, and the Pareto chart. The graphical representation of quantitative variables is usually illustrated by line graphs, dot plots, histograms, stem-and-leaf plots, and boxplots (or box-and-whisker diagrams), as shown in Section 3.3.2. Bar charts (horizontal and vertical), pie charts, the Pareto chart, line graphs, dot plots, and histograms will be generated in Excel; the boxplots and histograms will also be constructed using SPSS and Stata.

To build a chart in Excel, first, the variables' data and names must be standardized, codified, and selected in a spreadsheet. The next step consists in clicking on the Insert tab and, in the Charts group, selecting the type of chart we are interested in (Columns, Rows, Pie, Bar, Area, Scatter, or Other Charts). The chart will be generated automatically on the screen, and it can be personalized according to the preferences of the researcher. Excel offers a variety of chart styles, layouts, and formats; to use them, the researcher just needs to select the plotted chart and click on the Design, Layout, or Format tab. On the Layout tab, for example, there are many resources available, such as Chart Title; Axis Titles (shows the names of the horizontal and vertical axes); Legend (shows or hides the legend); Data Labels (allows the researcher to insert the series name, the category name, or the values of the labels in the place we are interested in); Data Table (shows the data table below the chart, with or without legend codes); Axes (allows the researcher to personalize the scale of the horizontal and vertical axes); and Gridlines (shows or hides horizontal and vertical gridlines). The Chart Title, Axis Titles, Legend, Data Labels, and Data Table icons are in the Labels group, while the Axes and Gridlines icons are in the Axes group.
3.3.1 Graphical Representation for Qualitative Variables
3.3.1.1 Bar Chart

This type of chart is widely used for nominal and ordinal qualitative variables, but it can also be used for discrete quantitative variables, because it allows us to investigate the presence of data trends. As its name indicates, this chart uses bars to represent the absolute or relative frequencies of each possible category (or numeric value) of a qualitative (or quantitative) variable. In vertical bar charts, each variable category is shown on the X-axis as a bar with constant width, and the height of the respective bar indicates the frequency of the category on the Y-axis. Conversely, in horizontal bar charts, each variable category is shown on the Y-axis as a bar of constant height, and the length of the respective bar indicates the frequency of the category on the X-axis. Let's now build horizontal and vertical bar charts from a practical example.

Example 3.4
A bank carried out a satisfaction survey with 120 customers, trying to measure how agile its services were (excellent, good, satisfactory, and poor). The absolute frequencies for each category are presented in Table 3.E.8. Construct a vertical and a horizontal bar chart for this problem.
TABLE 3.E.8 Frequencies of Occurrences per Category

Satisfaction   Absolute Frequency
Excellent      58
Good           18
Satisfactory   32
Poor           12
Solution Let’s build the vertical and horizontal bar charts of Example 3.4 in Excel.
FIG. 3.2 Vertical bar chart for Example 3.4.

FIG. 3.3 Horizontal bar chart for Example 3.4.
First, the data in Table 3.E.8 must be standardized, codified, and selected in a spreadsheet. After that, we can click on the Insert tab and, in the Charts group, select the option Columns. The chart is automatically generated on the screen. Next, to personalize the chart, while clicking on it, we must select the following icons on the Layout tab: (a) Axis Titles: let's select the title for the horizontal axis (Satisfaction) and for the vertical axis (Frequency); (b) Legend: to hide the legend, we must click on None; (c) Data Labels: clicking on More Data Label Options, the option Value must be selected in Label Contains (or we can select the option Outside End). Fig. 3.2 shows the vertical bar chart of Example 3.4 generated in Excel.

Based on Fig. 3.2, we can see that the categories of the variable being analyzed are presented on the X-axis by bars with the same width, and their respective heights indicate the frequencies on the Y-axis. To construct the horizontal bar chart, we must select the option Bar instead of Columns; the other steps follow the same logic. Fig. 3.3 represents the frequency data from Table 3.E.8 through a horizontal bar chart constructed in Excel. The horizontal bar chart in Fig. 3.3 represents the categories of the variable on the Y-axis and their respective frequencies on the X-axis; for each variable category, we draw a bar with a length that corresponds to its frequency. Therefore, this chart only offers information related to the behavior of each category of the original variable and to investigations regarding the type of distribution; it does not allow us to calculate position, dispersion, skewness, or kurtosis measures, since the variable being studied is qualitative.
3.3.1.2 Pie Chart

Another way to represent qualitative data, in terms of relative frequencies (percentages), is the pie chart. The chart corresponds to a circle with an arbitrary radius (the whole) divided into sectors or slices of several different sizes (parts of the whole).
This chart allows the researcher to visualize the data as slices of a pie or parts of a whole. Let’s now build the pie chart from a practical example. Example 3.5 An election poll was carried out in the city of Sao Paulo to check voters’ preferences concerning the political parties running in the next elections for Mayor. The percentage of voters per political party can be seen in Table 3.E.9. Construct a pie chart for Example 3.5.
TABLE 3.E.9 Percentage of Voters per Political Party

Political Party   Percentage
PMDB              18
PSDB              22
PDT               12.5
PT                24.5
PC do B            8
PV                 5
Others            10
Solution
Let's build the pie chart for Example 3.5 in Excel. The steps are similar to the ones in Example 3.4. However, we now have to select the option Pie in the Charts group, on the Insert tab. Fig. 3.4 presents the pie chart obtained in Excel for the data shown in Table 3.E.9.

FIG. 3.4 Pie chart of Example 3.5.
3.3.1.3 Pareto Chart

The Pareto chart is a quality control tool whose main objective is to investigate the types of problems and, consequently, to identify their respective causes, so that an action can be taken in order to reduce or eliminate them. The Pareto chart combines bars and a line graph: the bars represent the absolute frequencies of occurrences of problems, and the line represents the relative cumulative frequencies. The problems are sorted in descending order of priority. Let's now illustrate a Pareto chart with a practical example.
Example 3.6 A manufacturer of credit and magnetic cards has as its main objective to reduce the number of defective cards. The quality inspector classified a sample of 1000 cards that were collected during one week of production, according to the types of defects found, as shown in Table 3.E.10. Construct a Pareto chart for this problem.
TABLE 3.E.10 Frequencies of the Occurrence of Each Defect

Type of Defect       Absolute Frequency (Fi)
Damaged/Bent          71
Perforated            28
Illegible printing    12
Wrong characters      20
Wrong numbers         44
Others                 6
Total                181
Solution The first step in generating a Pareto chart is to sort the defects in order of priority (from the highest to the lowest frequency). The bar chart represents the absolute frequency of each defect. To construct the line graph, it is necessary to calculate the relative cumulative frequency (%) up to the defect analyzed. Table 3.E.11 shows the absolute frequency for each type of defect, in descending order, and the relative cumulative frequency (%).
TABLE 3.E.11 Absolute Frequency for Each Defect and the Relative Cumulative Frequency (%)

Type of Defect       Number of Defects   Cumulative %
Damaged/Bent         71                   39.23
Wrong numbers        44                   63.54
Perforated           28                   79.01
Wrong characters     20                   90.06
Illegible printing   12                   96.69
Others                6                  100
Let's now build a Pareto chart for Example 3.6 in Excel, using the data in Table 3.E.11. First, the data in Table 3.E.11 must be standardized, codified, and selected in an Excel spreadsheet. In the Charts group, on the Insert tab, let's select the option Columns (and the clustered column subtype). Note that the chart is automatically generated on the screen; however, absolute frequency data as well as relative cumulative frequency data are presented as columns. To change the type of chart related to the cumulative percentage, we must right-click on any bar of the respective series and select the option Change Series Chart Type, followed by a line graph with markers. The resulting chart is a Pareto chart. To personalize the Pareto chart, we must use the following icons on the Layout tab: (a) Axis Titles: for the bar chart, we selected the title for the horizontal axis (Type of defect) and for the vertical axis (Frequency); for the line graph, we called the vertical axis Percentage; (b) Legend: to hide the legend, we must click on None; (c) Data Table: let's select the option Show Data Table with Legend Keys; (d) Axes: the main unit of the vertical axis for both charts is set to 20, and the maximum value of the vertical axis for the line graph to 100. Fig. 3.5 shows the chart constructed in Excel that corresponds to the Pareto chart for Example 3.6.
FIG. 3.5 The Pareto chart for Example 3.6. Legend: A, Damaged/Bent; B, Wrong numbers; C, Perforated; D, Wrong characters; E, Illegible printing; F, Others.
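The sorting and cumulative-percentage computation behind Table 3.E.11 is simple to automate. Here is a minimal Python sketch of our own (the helper name `pareto_rows` is hypothetical; the book builds this table in Excel):

```python
def pareto_rows(defects):
    """Sort (defect, count) pairs by count, descending, and append the
    relative cumulative frequency (%) used by the Pareto line graph."""
    total = sum(defects.values())
    rows, cum = [], 0
    for name, count in sorted(defects.items(), key=lambda kv: -kv[1]):
        cum += count
        rows.append((name, count, round(100 * cum / total, 2)))
    return rows
```

Feeding it the frequencies of Table 3.E.10 returns the rows of Table 3.E.11, starting with Damaged/Bent at a 39.23% cumulative frequency.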
3.3.2 Graphical Representation for Quantitative Variables
3.3.2.1 Line Graph

In a line graph, points are represented by the intersection of the variables involved on the horizontal axis (X) and on the vertical axis (Y), and they are connected by straight lines. Despite considering two axes, line graphs will be used in this chapter to represent the behavior of a single variable. The graph shows the evolution or trend of a quantitative variable's data, which is usually continuous, at regular intervals. The numeric variable values are represented on the Y-axis, and the X-axis only shows the data distribution in a uniform way. Let's now illustrate a practical example of a line graph.

Example 3.7
Cheap & Easy is a supermarket that registered the percentage of losses it had in the last 12 months (Table 3.E.12) and, based on these data, will adopt new prevention measures. Build a line graph for Example 3.7.
TABLE 3.E.12 Percentage of Losses in the Last 12 Months

Month       Losses (%)
January     0.42
February    0.38
March       0.12
April       0.34
May         0.22
June        0.15
July        0.18
August      0.31
September   0.47
October     0.24
November    0.42
December    0.09
Solution To build the line graph for Example 3.7 in Excel, in the Charts group, on the Insert tab, we must select the option Lines. The other steps follow the same logic of the previous examples. The complete chart can be seen in Fig. 3.6.
FIG. 3.6 Line graph for Example 3.7.
3.3.2.2 Scatter Plot

A scatter plot is very similar to a line graph; the biggest difference between them is in the way the data are plotted on the horizontal axis. As in a line graph, the points are represented by the intersection of the variables along the X-axis and the Y-axis; however, they are not connected by straight lines. The scatter plot studied in this chapter is used to show the evolution or trend of a single quantitative variable's data, similar to the line graph, but at (in general) irregular intervals. Analogous to a line graph, the numeric variable values are represented on the Y-axis, and the X-axis only represents the data behavior throughout time. In the next chapter, we will see how a scatter plot can be used to describe the behavior of two variables simultaneously (bivariate analysis), with the numeric values of one variable represented on the Y-axis and those of the other on the X-axis.

Example 3.8
Papermisto is the supplier of three types of raw materials for the production of paper: cellulose, mechanical pulp, and trimmings. In order to maintain its quality standards, the factory carries out a rigorous inspection of its products during each production phase. At irregular intervals, an operator must verify the esthetic and dimensional characteristics of the product selected with specialized instruments. For instance, in the cellulose storage phase, the product must be piled up in bales of approximately 250 kg each. Table 3.E.13 shows the weight of the bales collected in the last 5 hours, at irregular intervals varying between 20 and 45 minutes. Construct a scatter plot for Example 3.8.
TABLE 3.E.13 Evolution of the Weight of the Bales Throughout Time

Time (min)   Weight (kg)
30           250
50           255
85           252
106          248
138          250
178          249
198          252
222          251
252          250
297          245
Solution
To build the scatter plot for Example 3.8 in Excel, in the Charts group, on the Insert tab, we must select the option Scatter. The other steps follow the same logic of the previous examples. The scatter plot can be seen in Fig. 3.7.

FIG. 3.7 Scatter plot for Example 3.8.
3.3.2.3 Histogram

A histogram is a vertical bar chart that represents the frequency distribution of one quantitative variable (discrete or continuous). The variable values being studied are presented on the X-axis (the base of each bar, with a constant width, represents each possible value of the discrete variable or each class of continuous values, sorted in ascending order), while the height of the bars on the Y-axis represents the frequency distribution (absolute, relative, or cumulative) of the respective variable values.

A histogram is very similar to a Pareto chart; it is also one of the seven quality tools. A Pareto chart represents the frequency distribution of a qualitative variable (types of problem), whose categories on the X-axis are sorted in order of priority (from the category with the highest frequency to the one with the lowest), whereas a histogram represents the frequency distribution of a quantitative variable, whose values on the X-axis are sorted in ascending order.

Therefore, the first step in elaborating a histogram is building the frequency distribution table. As presented in Sections 3.2.2 and 3.2.3, for each possible value of a discrete variable or for each class of continuous data, we calculate the absolute frequency, the relative frequency, the cumulative frequency, and the relative cumulative frequency, with the data sorted in ascending order. The histogram is then constructed from this table: its first column, which contains the numeric values or the classes of the variable being studied, is presented on the X-axis, and the column of absolute frequency (or relative, cumulative, or relative cumulative frequency) on the Y-axis. Many statistical software packages generate the histogram automatically from the original values of the quantitative variable being studied, without having to calculate the frequencies.
Even though Excel has the option of building a histogram from analysis tools, we will show how to build it from the column chart, due to its simplicity. Example 3.9 In order to improve their services, a national bank is hiring new managers to serve their corporate clients. Table 3.E.14 shows the number of companies dealt with daily in one of their main branches in the capital. Elaborate a histogram from these data using Excel.
TABLE 3.E.14 Number of Companies Dealt With Daily

13  11  13  10  11  12   8  12   9  10
12  10   8  11   9  11  14  11  10   9
Solution
The first step is building the frequency distribution table (Table 3.E.15). From the data in Table 3.E.15, we can build a histogram of absolute frequency, relative frequency, cumulative frequency, or relative cumulative frequency using Excel. The histogram generated here will be the absolute frequency one. Thus, we must standardize, codify, and select the first two columns of Table 3.E.15 (except the last row: Sum) in an Excel spreadsheet. In the Charts group, on the Insert tab, let's select the option Columns. Let's click on the chart so that it can be personalized. On the Layout tab, we selected the following icons: (a) Axis Titles: select the title for the horizontal axis (Number of companies) and for the vertical axis (Absolute frequency); (b) Legend: to hide the legend, we must click on None. The histogram generated in Excel can be seen in Fig. 3.8.
TABLE 3.E.15 Frequency Distribution for Example 3.9

Number of Companies   Fi   Fri (%)   Fac   Frac (%)
8                      2   10         2    10
9                      3   15         5    25
10                     4   20         9    45
11                     5   25        14    70
12                     3   15        17    85
13                     2   10        19    95
14                     1    5        20   100
Sum                   20  100
FIG. 3.8 Histogram of absolute frequencies elaborated in Excel for Example 3.9.
As mentioned, many statistical computer packages, including SPSS and Stata, build the histogram automatically from the original data of the variable being studied (in this example, using the data in Table 3.E.14), without having to calculate the frequencies. Moreover, these packages have the option of plotting the normal curve. Fig. 3.9 shows the histogram generated using SPSS (with the option of a normal curve) from the data in Table 3.E.14. We will see in detail, in Sections 3.6 and 3.7, how it can be constructed using SPSS and Stata, respectively. Note that the values of the discrete variable are presented in the middle of the base of each bar. For continuous variables, consider the data in Table 3.E.5 (Example 3.3), regarding the grades of the students enrolled in the subject Financial Market; these data were sorted in ascending order in Table 3.E.6. Fig. 3.10 shows the histogram generated using SPSS (with the option of a normal curve) from the data in Table 3.E.5 or Table 3.E.6.
FIG. 3.9 Histogram constructed using SPSS for Example 3.9 (discrete data).
FIG. 3.10 Histogram generated using SPSS for Example 3.3 (continuous data).
Note that the data were grouped considering an interval between classes of h = 0.5, differently from Example 3.3, which considered h = 1. The classes' lower limits are represented on the left side of the base of each bar, and the upper limits (not included in the class) on the right side. The height of the bar represents the total frequency in the class. For example, the first bar represents the 3.5 ├ 4.0 class, and there are three values in this interval (3.5, 3.8, and 3.9).
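This 0.5-wide grouping can be checked directly. A small Python sketch of our own (the function name `bin_counts` is made up for this illustration) counts the observations falling in each class [start + i·h, start + (i+1)·h):

```python
def bin_counts(data, start, width):
    """Count observations per class; lower limit included, upper excluded."""
    counts = {}
    for x in data:
        i = int((x - start) // width)   # index of the class holding x
        lo = start + i * width          # lower limit of that class
        counts[lo] = counts.get(lo, 0) + 1
    return counts
```

With start = 3.5 and width = 0.5 applied to the grades of Table 3.E.5, the first class [3.5, 4.0) indeed contains three grades (3.5, 3.8, and 3.9), as observed in the histogram.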
3.3.2.4 Stem-and-Leaf Plot

Both bar charts and histograms represent the shape of the variable's frequency distribution. The stem-and-leaf plot is an alternative way to represent the frequency distributions of discrete and continuous quantitative variables with few observations, with the advantage of maintaining the original value of each observation (it allows the visualization of all data information).
In the plot, the representation of each observation is divided into two parts, separated by a vertical line: the stem is located on the left of the vertical line and represents the observation's first digit(s); the leaf is located on the right and represents the observation's last digit(s). Choosing the number of initial digits that will form the stem (or of complementary digits that will form the leaf) is arbitrary. The stems usually contain the most significant digits, and the leaves the least significant. The stems are represented in a single column, with their different values throughout many lines; for each stem on the left-hand side of the vertical line, the respective leaves are shown on the right-hand side throughout many columns. Stems as well as leaves must be sorted in ascending order. In cases in which there are too many leaves per stem, we can have more than one line with the same stem; choosing the number of lines is arbitrary, as is defining the interval or the number of classes in a frequency distribution.

To build a stem-and-leaf plot, we can follow this sequence of steps:
Step 1: Sort the data in ascending order, to make the visualization of the data easier.
Step 2: Define the number of initial digits that will form the stem, or the number of complementary digits that will form the leaf.
Step 3: Elaborate the stems, represented in a single column on the left of the vertical line, with their different values throughout many lines, in ascending order. When the number of leaves per stem is very high, we can define two or more lines for the same stem.
Step 4: Place the leaves that correspond to the respective stems on the right-hand side of the vertical line, throughout many columns (in ascending order).

Example 3.10
A small company collected its employees' ages, as shown in Table 3.E.16. Build a stem-and-leaf plot.
TABLE 3.E.16 Employees' Ages

44  60  22  49  31  58  42  63  33  37
54  55  40  71  55  62  35  45  59  54
50  51  24  31  40  73  28  35  75  48
Solution To construct the stem-and-leaf plot, let’s apply the four steps described: Step 1 First, we must sort the data in ascending order, as shown in Table 3.E.17.
TABLE 3.E.17 Employees' Ages in Ascending Order

22  24  28  31  31  33  35  35  37  40
40  42  44  45  48  49  50  51  54  54
55  55  58  59  60  62  63  71  73  75
Step 2
The next step in constructing a stem-and-leaf plot is to define the number of initial digits of the observation that will form the stem; the complementary digits will form the leaf. In this example, all of the observations have two digits, so the stems correspond to the tens and the leaves correspond to the units.
Step 3
The following step is to build the stems. Based on Table 3.E.17, we can see that there are observations beginning with the tens 2, 3, 4, 5, 6, and 7 (stems). Since the stem with the highest frequency is 5 (8 observations), it is possible to represent all of its leaves in a single line; therefore, we will have a single line per stem. Hence, the stems are presented in a single column on the left of the vertical line, in ascending order, as shown in Fig. 3.11.
FIG. 3.11 Building the stems for Example 3.10.

2 |
3 |
4 |
5 |
6 |
7 |
Step 4 Finally, let’s place the leaves that correspond to each stem on the right-hand side of the vertical line. The leaves are represented in ascending order throughout many columns. For example, stem 2 contains leaves 2, 4, and 8. Stem 5 contains leaves 0, 1, 4, 4, 5, 5, 8, and 9, represented throughout 8 columns. If this stem were divided into two lines, the first line would have leaves 0 to 4, and the second line leaves 5 to 9. Fig. 3.12 illustrates the stem-and-leaf plot for Example 3.10. FIG. 3.12 Stem-and-Leaf plot for Example 3.10.
2 | 2 4 8
3 | 1 1 3 5 5 7
4 | 0 0 2 4 5 8 9
5 | 0 1 4 4 5 5 8 9
6 | 0 2 3
7 | 1 3 5
Example 3.11
The average temperature, in degrees Celsius, registered over the last 40 days in the city of Porto Alegre can be found in Table 3.E.18. Build the stem-and-leaf plot for these data.
TABLE 3.E.18 Average Temperature in Celsius
8.5   13.7  12.9  9.4   11.7  19.2  12.8  9.7
19.5  11.5  15.5  16.0  20.4  17.4  18.0  14.4
14.8  13.0  16.6  20.2  17.9  17.7  16.9  15.2
18.5  17.8  16.2  16.4  18.2  16.9  18.7  19.6
13.2  17.2  20.5  14.1  16.1  15.9  18.8  15.7
Solution
Once again, let's apply the four steps to construct the stem-and-leaf plot, but now for a continuous variable.
Step 1
First, let's sort the data in ascending order, as shown in Table 3.E.19.
TABLE 3.E.19 Average Temperature in Ascending Order
8.5   9.4   9.7   11.5  11.7  12.8  12.9  13.0
13.2  13.7  14.1  14.4  14.8  15.2  15.5  15.7
15.9  16.0  16.1  16.2  16.4  16.6  16.9  16.9
17.2  17.4  17.7  17.8  17.9  18.0  18.2  18.5
18.7  18.8  19.2  19.5  19.6  20.2  20.4  20.5
Step 2
In this example, the leaves correspond to the last digit; the remaining digits (to the left) correspond to the stems.
Steps 3 and 4
The stems vary from 8 to 20. The stem with the highest frequency is 16 (7 observations), and its leaves can be represented in a single line. For each stem, we place the respective leaves. Fig. 3.13 shows the stem-and-leaf plot for Example 3.11.
FIG. 3.13 Stem-and-Leaf Plot for Example 3.11.
 8 | 5
 9 | 4 7
10 |
11 | 5 7
12 | 8 9
13 | 0 2 7
14 | 1 4 8
15 | 2 5 7 9
16 | 0 1 2 4 6 9 9
17 | 2 4 7 8 9
18 | 0 2 5 7 8
19 | 2 5 6
20 | 2 4 5
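The four construction steps can also be sketched in code. The following Python function is an illustrative sketch, not the book's own material; the name stem_and_leaf and the leaf_unit parameter (1 for integer data such as Example 3.10, 0.1 for one-decimal data such as Example 3.11) are our own choices:

```python
from collections import defaultdict

def stem_and_leaf(values, leaf_unit=1):
    """Build a stem-and-leaf display: the leaf is the last digit
    (in units of leaf_unit) and the stem is everything to its left."""
    groups = defaultdict(list)
    for v in sorted(values):              # Step 1: sort in ascending order
        n = round(v / leaf_unit)          # express the value in leaf units
        groups[n // 10].append(n % 10)    # Step 2: split into stem | leaf
    lines = []
    for stem in range(min(groups), max(groups) + 1):   # Step 3: stems, one column
        leaves = " ".join(str(leaf) for leaf in groups.get(stem, []))
        lines.append(f"{stem:>2} | {leaves}".rstrip())  # Step 4: leaves per stem
    return lines

# Example 3.10: employees' ages from Table 3.E.16
ages = [44, 60, 22, 49, 31, 58, 42, 63, 33, 37, 54, 55, 40, 71, 55,
        62, 35, 45, 59, 54, 50, 51, 24, 31, 40, 73, 28, 35, 75, 48]
for line in stem_and_leaf(ages):
    print(line)
```

Calling stem_and_leaf(ages) reproduces the display of Fig. 3.12, one printed line per stem.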
3.3.2.5 Boxplot or Box-and-Whisker Diagram
The boxplot (or box-and-whisker diagram) is a graphical representation of five measures of position or location of a certain variable: minimum value, first quartile (Q1), second quartile (Q2) or median (Md), third quartile (Q3), and maximum value. In a sorted sample, the median corresponds to the central position and the quartiles subdivide the sample into four equal parts, each one containing 25% of the data. Thus, the first quartile (Q1) delimits the first 25% of the data (organized in ascending order), the second quartile corresponds to the median (50% of the sorted data are located below it and the remaining 50% above it), and the third quartile (Q3) delimits 75% of the observations. The dispersion measure resulting from these location measures is called the interquartile range (IQR), or interquartile interval (IQI), and corresponds to the difference between Q3 and Q1. This plot allows us to assess the data's symmetry and distribution. It also gives us a visual indication of whether or not there are discrepant observations (univariate outliers), since such observations lie above the upper limit or below the lower limit. A representation of the diagram can be seen in Fig. 3.14.
FIG. 3.14 Boxplot.
Calculating the median, the first, and third quartiles, and investigating the existence of univariate outliers will be discussed in Sections 3.4.1.1, 3.4.1.2, and 3.4.1.3, respectively. In Sections 3.6.3 and 3.7, we will study how to generate the box-and-whisker diagram on SPSS and Stata, respectively, using a practical example.
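Before those sections, the five boxplot measures and the outlier limits can be sketched as follows. This is a minimal illustration with our own function name (five_number_summary); it uses one common quartile convention (linear interpolation between closest ranks) and the usual 1.5 x IQR whisker rule, so its quartiles may differ slightly from the Tukey's Hinges method adopted later in the book:

```python
def five_number_summary(data):
    """Minimum, quartiles, maximum, and the boxplot whisker limits
    (a common convention: Q1 - 1.5*IQR and Q3 + 1.5*IQR)."""
    x = sorted(data)
    n = len(x)

    def quantile(p):
        # linear interpolation between closest ranks (one of several conventions)
        pos = p * (n - 1)
        lo, frac = int(pos), pos - int(pos)
        return x[lo] if frac == 0 else x[lo] * (1 - frac) + x[lo + 1] * frac

    q1, md, q3 = quantile(0.25), quantile(0.50), quantile(0.75)
    iqr = q3 - q1
    return {"min": x[0], "Q1": q1, "Md": md, "Q3": q3, "max": x[-1],
            "lower_fence": q1 - 1.5 * iqr, "upper_fence": q3 + 1.5 * iqr}

s = five_number_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
# the value 100 lies above the upper fence, so a boxplot would flag it
# as a univariate outlier
```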
3.4 THE MOST COMMON SUMMARY-MEASURES IN UNIVARIATE DESCRIPTIVE STATISTICS
Information found in a dataset can be summarized through suitable numerical measures, called summary measures. In univariate descriptive statistics, the main objective of the most common summary measures is to represent the behavior of the variable being studied through its central and noncentral values, its dispersion, or the way its values are distributed around the mean. The summary measures studied in this chapter are measures of position or location (measures of central tendency and quantiles), measures of dispersion or variability, and measures of shape, such as skewness and kurtosis. These measures are calculated for metric or quantitative variables. The only exception is the mode, a measure of central tendency that provides the most frequent value of a certain variable, so it can also be calculated for nonmetric or qualitative variables.
3.4.1 Measures of Position or Location
These measures provide values that characterize the behavior of a data series, indicating the data position or location in relation to the axis of the values assumed by the variable or characteristic being studied. The measures of position or location are subdivided into measures of central tendency (mean, median, and mode) and quantiles (quartiles, deciles, and percentiles).
3.4.1.1 Measures of Central Tendency
The most common measures of central tendency are the arithmetic mean, the median, and the mode.
3.4.1.1.1 Arithmetic Mean
The arithmetic mean can be a representative measure of a population with N elements, represented by the Greek letter \mu, or a representative measure of a sample with n elements, represented by \bar{X}.
3.4.1.1.1.1 Case 1: Simple Arithmetic Mean of Ungrouped Discrete and Continuous Data
The simple arithmetic mean, or simply the mean or average, is the sum of all the values of a certain variable (discrete or continuous) divided by the total number of observations. Thus, the sample arithmetic mean of a certain variable X is:

\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} \qquad (3.1)

where n is the total number of observations in the dataset and X_i, for i = 1, ..., n, represents each one of variable X's values.
Example 3.12
Calculate the simple arithmetic mean of the data in Table 3.E.20, regarding the grades of the graduate students enrolled in the subject Quantitative Methods.
TABLE 3.E.20 Students' Grades
5.7  6.5  6.9  8.3  8.0
4.2  6.3  7.4  5.8  6.9
Solution
The mean is simply calculated as the sum of all the values in Table 3.E.20 divided by the total number of observations:

\bar{X} = \frac{5.7 + 6.5 + \cdots + 6.9}{10} = 6.6
The AVERAGE function in Excel calculates the simple arithmetic mean of the set of values selected. Let's assume that the data in Table 3.E.20 are available from cell A1 to cell A10. To calculate the mean, we just need to insert the expression =AVERAGE(A1:A10). Another way to calculate the mean using Excel, as well as other descriptive measures, such as the median, mode, variance, standard deviation, standard error, skewness, and kurtosis, which will also be studied in this chapter, is by using the Analysis ToolPak add-in (Section 3.5).
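Expression (3.1) translates directly to code. A minimal Python sketch reproducing Example 3.12 (the variable names are ours):

```python
# Example 3.12: grades of the graduate students (Table 3.E.20)
grades = [5.7, 6.5, 6.9, 8.3, 8.0, 4.2, 6.3, 7.4, 5.8, 6.9]

# Expression (3.1): sum of all observations divided by their number
mean = sum(grades) / len(grades)
print(round(mean, 1))  # 6.6, as in the solution above
```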
3.4.1.1.1.2 Case 2: Weighted Arithmetic Mean of Ungrouped Discrete and Continuous Data
When calculating the simple arithmetic mean, all of the occurrences have the same importance or weight. When we are interested in assigning a different weight (p_i) to each value i of variable X, we use the weighted arithmetic mean:

\bar{X} = \frac{\sum_{i=1}^{n} X_i \cdot p_i}{\sum_{i=1}^{n} p_i} \qquad (3.2)

If the weights are expressed as percentages (relative weight, rw), Expression (3.2) becomes:

\bar{X} = \sum_{i=1}^{n} X_i \cdot rw_i \qquad (3.3)
Example 3.13 At Vanessa’s school, the annual average of each subject is calculated based on the grades obtained throughout all four quarters, with their respective weights being: 1, 2, 3, and 4. Table 3.E.21 shows Vanessa’s grades in mathematics in each quarter. Calculate her annual average in the subject.
TABLE 3.E.21 Vanessa's Grades in Mathematics
Period       Grade  Weight
1st Quarter  4.5    1
2nd Quarter  7.0    2
3rd Quarter  5.5    3
4th Quarter  6.5    4
Solution
The annual average is calculated by using the weighted arithmetic mean criterion. Applying Expression (3.2) to the data in Table 3.E.21, we have:

\bar{X} = \frac{4.5 \cdot 1 + 7.0 \cdot 2 + 5.5 \cdot 3 + 6.5 \cdot 4}{1 + 2 + 3 + 4} = 6.1
Example 3.14 There are five stocks in a certain investment portfolio. Table 3.E.22 shows the average yield of each stock in the previous month, as well as the respective percentage invested. Determine the portfolio’s average yield.
TABLE 3.E.22 Yield of Each Stock and Percentage Invested
Stock              Yield (%)  % Investment
Bank of Brazil ON  1.05       10
Bradesco PN        0.56       25
Eletrobras PNB     0.08       15
Gerdau PN          0.24       20
Vale PN            0.75       30
Solution
The portfolio's average yield (%) corresponds to the sum of the products of each stock's average yield (%) and the respective percentage invested. Using Expression (3.3), we have:

\bar{X} = 1.05 \cdot 0.10 + 0.56 \cdot 0.25 + 0.08 \cdot 0.15 + 0.24 \cdot 0.20 + 0.75 \cdot 0.30 = 0.53\%
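Expressions (3.2) and (3.3) can be sketched with a single helper (weighted_mean is our own name; when the weights already sum to 1, as with the relative weights of Example 3.14, Expression (3.2) reduces to Expression (3.3)):

```python
def weighted_mean(values, weights):
    """Expression (3.2): sum of value*weight divided by the sum of weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Example 3.13: Vanessa's quarterly grades and their weights
annual_avg = weighted_mean([4.5, 7.0, 5.5, 6.5], [1, 2, 3, 4])

# Example 3.14: yields weighted by the fraction invested (weights sum to 1)
portfolio_yield = weighted_mean([1.05, 0.56, 0.08, 0.24, 0.75],
                                [0.10, 0.25, 0.15, 0.20, 0.30])
```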
3.4.1.1.1.3 Case 3: Arithmetic Mean of Grouped Discrete Data
When the discrete values of X_i repeat themselves, the data are grouped into a frequency table. To calculate the arithmetic mean, we use the same criterion as for the weighted mean. However, the weight of each X_i will be represented by its absolute frequency (F_i) and, instead of n observations with n different values, we will have n observations with m different values (grouped data):

\bar{X} = \frac{\sum_{i=1}^{m} X_i \cdot F_i}{\sum_{i=1}^{m} F_i} = \frac{\sum_{i=1}^{m} X_i \cdot F_i}{n} \qquad (3.4)

If the frequency of the data is expressed as a percentage of the absolute frequency (relative frequency, Fr), Expression (3.4) becomes:

\bar{X} = \sum_{i=1}^{m} X_i \cdot Fr_i \qquad (3.5)
Example 3.15
A satisfaction survey with 120 participants evaluated the performance of a health insurance company through grades that vary between 1 and 10. The survey's results can be seen in Table 3.E.23. Calculate the arithmetic mean.
TABLE 3.E.23 Absolute Frequency Table
Grades  Number of Participants
1       9
2       12
3       15
4       18
5       24
6       26
7       5
8       7
9       3
10      1
Solution
The arithmetic mean of Example 3.15 is calculated from Expression (3.4):

\bar{X} = \frac{1 \cdot 9 + 2 \cdot 12 + \cdots + 9 \cdot 3 + 10 \cdot 1}{120} = 4.62
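Expression (3.4) can be applied directly to a frequency table. A Python sketch of Example 3.15 (the dict name freq is ours):

```python
# Example 3.15: grades 1-10 and their absolute frequencies (Table 3.E.23)
freq = {1: 9, 2: 12, 3: 15, 4: 18, 5: 24, 6: 26, 7: 5, 8: 7, 9: 3, 10: 1}

n = sum(freq.values())                            # 120 participants
mean = sum(x * f for x, f in freq.items()) / n    # Expression (3.4)
```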
3.4.1.1.1.4 Case 4: Arithmetic Mean of Continuous Data Grouped into Classes
In the calculation of the simple arithmetic mean, the weighted arithmetic mean, and the arithmetic mean of grouped discrete data, X_i represents each value i of variable X. For continuous data grouped into classes, each class does not have a single defined value, but a set of values. In order for the arithmetic mean to be calculated in this case, we assume that X_i is the middle or central point of class i (i = 1, ..., k), so Expressions (3.4) and (3.5) are rewritten in terms of the number of classes (k):

\bar{X} = \frac{\sum_{i=1}^{k} X_i \cdot F_i}{\sum_{i=1}^{k} F_i} = \frac{\sum_{i=1}^{k} X_i \cdot F_i}{n} \qquad (3.6)

\bar{X} = \sum_{i=1}^{k} X_i \cdot Fr_i \qquad (3.7)
Example 3.16 Table 3.E.24 shows the classes of salaries paid to the employees of a certain company and their respective absolute and relative frequencies. Calculate the average salary.
TABLE 3.E.24 Classes of Salaries (US$ 1000.00) and Their Respective Absolute and Relative Frequencies
Classes  Fi    Fri (%)
1 ├ 3    240   17.14
3 ├ 5    480   34.29
5 ├ 7    320   22.86
7 ├ 9    150   10.71
9 ├ 11   130   9.29
11 ├ 13  80    5.71
Sum      1400  100
Solution
Considering X_i the central point of class i and applying Expression (3.6), we have:

\bar{X} = \frac{2 \cdot 240 + 4 \cdot 480 + 6 \cdot 320 + 8 \cdot 150 + 10 \cdot 130 + 12 \cdot 80}{1400} = 5.557

or, using Expression (3.7):

\bar{X} = 2 \cdot 0.1714 + 4 \cdot 0.3429 + \cdots + 10 \cdot 0.0929 + 12 \cdot 0.0571 = 5.557

Therefore, the average salary is US$ 5,557.14.
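Expression (3.6), with each class represented by its midpoint, can be sketched as follows (the tuple layout (lower, upper, Fi) is our own convention):

```python
# Example 3.16: salary classes (US$ thousand) and absolute frequencies
classes = [(1, 3, 240), (3, 5, 480), (5, 7, 320),
           (7, 9, 150), (9, 11, 130), (11, 13, 80)]

n = sum(f for _, _, f in classes)
# Expression (3.6): each class contributes its midpoint times its frequency
mean = sum((lo + hi) / 2 * f for lo, hi, f in classes) / n
```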
3.4.1.1.2 Median
The median (Md) is a measure of location that locates the center of the distribution of a set of data sorted in ascending order. Its value separates the series into two equal parts, so 50% of the elements are less than or equal to the median, and the other 50% are greater than or equal to it.
3.4.1.1.2.1 Case 1: Median of Ungrouped Discrete and Continuous Data
The median of variable X (discrete or continuous) can be calculated as follows:

Md(X) = \begin{cases} \dfrac{X_{(n/2)} + X_{(n/2 + 1)}}{2}, & \text{if } n \text{ is an even number} \\ X_{((n+1)/2)}, & \text{if } n \text{ is an odd number} \end{cases} \qquad (3.8)

where n is the total number of observations and X_1 \le \cdots \le X_n, considering that X_1 is the smallest observation or the value of the first element, and that X_n is the highest observation or the value of the last element.
Example 3.17
Table 3.E.25 shows the monthly production of treadmills of a company in a given year. Calculate the median.
TABLE 3.E.25 Monthly Production of Treadmills in a Given Year
Month  Production (units)
Jan.   210
Feb.   180
Mar.   203
April  195
May    208
June   230
July   185
Aug.   190
Sept.  200
Oct.   182
Nov.   205
Dec.   196
Solution
To calculate the median, the observations are sorted in ascending order. Therefore, we have the following observations and their respective positions:

180  182  185  190  195  196  200  203  205  208   210   230
1st  2nd  3rd  4th  5th  6th  7th  8th  9th  10th  11th  12th

The median will be the mean between the sixth and the seventh elements, since n is an even number, that is:

Md = \frac{X_{(12/2)} + X_{(12/2 + 1)}}{2} = \frac{196 + 200}{2} = 198
Excel calculates the median of a set of data through the MEDIAN function. Note that the median does not consider the order of magnitude of the original variable's values. If, for instance, the highest value were 400 instead of 230, the median would be exactly the same, even though the mean would be much higher. The median is also known as the 2nd quartile (Q2), the 50th percentile (P50), or the 5th decile (D5). These definitions will be studied in more detail in the following sections.
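Expression (3.8) can be sketched in a few lines (the function name median is ours; Python's statistics.median behaves the same way):

```python
def median(data):
    """Expression (3.8): the middle value of the sorted series,
    or the mean of the two middle values when n is even."""
    x = sorted(data)
    n = len(x)
    mid = n // 2
    return x[mid] if n % 2 == 1 else (x[mid - 1] + x[mid]) / 2

# Example 3.17: monthly treadmill production (Table 3.E.25)
production = [210, 180, 203, 195, 208, 230, 185, 190, 200, 182, 205, 196]
```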
3.4.1.1.2.2 Case 2: Median of Grouped Discrete Data
Here, the calculation of the median is similar to the previous case; however, the data are grouped in a frequency distribution table. Analogous to Case 1, if n is an odd number, the position of the central element will be (n + 1)/2. In the cumulative frequency column, we identify the group that contains this position and, consequently, its corresponding value in the first column (the median). If n is an even number, we verify the group(s) that contain(s) the central positions n/2 and (n/2) + 1 in the cumulative frequency column. If both positions correspond to the same group, we obtain the corresponding value in the first column (the median) directly. If each position corresponds to a distinct group, the median will be the average of the corresponding values defined in the first column.
Example 3.18
Table 3.E.26 shows the number of bedrooms in 70 real estate properties in a condominium located in the metropolitan area of Sao Paulo, and their respective absolute and cumulative frequencies. Calculate the median.
TABLE 3.E.26 Frequency Distribution
Number of Bedrooms  Fi  Fac
1    6   6
2    13  19
3    20  39
4    15  54
5    7   61
6    6   67
7    3   70
Sum  70
Solution
Since n is an even number, the median will be the average of the values that occupy positions n/2 and (n/2) + 1, that is:

Md = \frac{X_{(n/2)} + X_{(n/2 + 1)}}{2} = \frac{X_{35} + X_{36}}{2}

Based on Table 3.E.26, we can see that the third group contains all the elements between positions 20 and 39 (including 35 and 36), whose corresponding value is 3. Therefore, the median is:

Md = \frac{3 + 3}{2} = 3
3.4.1.1.2.3 Case 3: Median of Continuous Data Grouped into Classes
For continuous variables grouped into classes, in which the data are presented in a frequency distribution table, we apply the following steps to calculate the median:
Step 1: Calculate the position of the median, regardless of whether n is even or odd, through the following expression:

Pos(Md) = n/2 \qquad (3.9)

Step 2: Identify the class that contains the median (the median class) from the cumulative frequency column.
Step 3: Calculate the median using the following expression:

Md = LI_{Md} + \frac{\frac{n}{2} - Fac_{(Md-1)}}{F_{Md}} \cdot A_{Md} \qquad (3.10)

where:
LI_{Md} = lower limit of the median class;
F_{Md} = absolute frequency of the median class;
Fac_{(Md-1)} = cumulative frequency of the class before the median class;
A_{Md} = range of the median class;
n = total number of observations.
Example 3.19 Consider the data in Example 3.16 regarding the classes of salaries paid to the employees of a company and their respective absolute and cumulative frequencies (Table 3.E.27). Calculate the median.
TABLE 3.E.27 Classes of Salaries (US$ 1000.00) and Their Respective Absolute and Cumulative Frequencies
Classes  Fi    Fac
1 ├ 3    240   240
3 ├ 5    480   720
5 ├ 7    320   1040
7 ├ 9    150   1190
9 ├ 11   130   1320
11 ├ 13  80    1400
Sum      1400
Solution
In the case of continuous data grouped into classes, let's apply the following steps to calculate the median:
Step 1: First, we calculate the position of the median:

Pos(Md) = n/2 = 1400/2 = 700

Step 2: Through the cumulative frequency column, we can see that the median is in the second class (3 ├ 5).
Step 3: Calculating the median through Expression (3.10), where LI_{Md} = 3, F_{Md} = 480, Fac_{(Md-1)} = 240, A_{Md} = 2, and n = 1400, we have:

Md = 3 + \frac{(700 - 240)}{480} \cdot 2 = 4.917 \ (\text{US\$ } 4916.67)
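The three steps of Expression (3.10) can be sketched as a loop over the classes (the (lower, upper, Fi) tuple layout is our own convention):

```python
# Example 3.19: salary classes (US$ thousand) and absolute frequencies
classes = [(1, 3, 240), (3, 5, 480), (5, 7, 320),
           (7, 9, 150), (9, 11, 130), (11, 13, 80)]

n = sum(f for _, _, f in classes)    # 1400
pos = n / 2                          # Step 1: position of the median

cum = 0
for lo, hi, f in classes:            # Step 2: find the median class
    if cum + f >= pos:
        # Step 3: Expression (3.10), interpolating inside the class
        md = lo + (pos - cum) / f * (hi - lo)
        break
    cum += f
```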
3.4.1.1.3 Mode
The mode (Mo) of a data series corresponds to the observation that occurs with the highest frequency. The mode is the only measure of position that can also be used for qualitative variables, since these variables only allow us to calculate frequencies.
3.4.1.1.3.1 Case 1: Mode of Ungrouped Data
Consider a set of observations X_1, X_2, ..., X_n of a certain variable. The mode is the value that appears with the highest frequency. Excel gives us the mode of a set of data through the MODE function.
Example 3.20
The production of carrots in a certain company is divided into five phases, including the post-harvest handling phase. Table 3.E.28 shows the average processing time (in seconds) in this phase for 20 observations. Calculate the mode.
TABLE 3.E.28 Processing Time in the Post-Harvest Handling Phase in Seconds
45.0  44.5  44.0  45.0  46.5
46.0  45.8  44.8  45.0  46.2
44.5  45.0  45.4  44.9  45.7
46.2  44.7  45.6  46.3  44.9
Solution The mode is 45.0, which is the most frequent value in the dataset (Table 3.E.28). This value could be determined directly in Excel by using the MODE function.
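Counting frequencies, as the MODE function does, can be sketched with the standard library (the variable names are ours):

```python
from collections import Counter

# Example 3.20: processing times from Table 3.E.28
times = [45.0, 44.5, 44.0, 45.0, 46.5, 46.0, 45.8, 44.8, 45.0, 46.2,
         44.5, 45.0, 45.4, 44.9, 45.7, 46.2, 44.7, 45.6, 46.3, 44.9]

# the mode is the most frequent observation
mode, count = Counter(times).most_common(1)[0]
```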
3.4.1.1.3.2 Case 2: Mode of Grouped Qualitative or Discrete Data
For qualitative or discrete quantitative data grouped in a frequency distribution table, the mode can be obtained directly from the table: it is the value with the highest absolute frequency.
Example 3.21
A TV station interviewed 500 viewers trying to analyze their preferences in terms of interest categories. The result of the survey can be seen in Table 3.E.29. Calculate the mode.
TABLE 3.E.29 Viewers' Preferences in Terms of Interest Categories
Interest Categories  Fi
Movies       71
Soap Operas  46
News         90
Comedy       98
Sports       120
Concerts     35
Variety      40
Sum          500
Solution
Based on Table 3.E.29, we can see that the mode corresponds to the category Sports (the highest absolute frequency). This confirms that the mode is the only measure of position that can also be used for qualitative variables.
3.4.1.1.3.3 Case 3: Mode of Continuous Data Grouped into Classes
For continuous data grouped into classes, there are several procedures to calculate the mode, such as Czuber's and King's methods. Czuber's method has the following phases:
Step 1: Identify the class that contains the mode (the modal class), which is the one with the highest absolute frequency.
Step 2: Calculate the mode (Mo):

Mo = LI_{Mo} + \frac{F_{Mo} - F_{Mo-1}}{2 F_{Mo} - (F_{Mo-1} + F_{Mo+1})} \cdot A_{Mo} \qquad (3.11)

where:
LI_{Mo} = lower limit of the modal class;
F_{Mo} = absolute frequency of the modal class;
F_{Mo-1} = absolute frequency of the class before the modal class;
F_{Mo+1} = absolute frequency of the class after the modal class;
A_{Mo} = range of the modal class.
Example 3.22 A set of continuous data with 200 observations is grouped into classes with their respective absolute frequencies, as shown in Table 3.E.30. Determine the mode using Czuber’s method.
TABLE 3.E.30 Continuous Data Grouped into Classes and Their Respective Frequencies
Class    Fi
01 ├ 10  21
10 ├ 20  36
20 ├ 30  58
30 ├ 40  24
40 ├ 50  19
Sum      200
Solution
Considering continuous data grouped into classes, we can use Czuber's method to calculate the mode:
Step 1: Based on Table 3.E.30, we can see that the modal class is the third one (20 ├ 30), since it has the highest absolute frequency.
Step 2: Calculating the mode (Mo) through Expression (3.11), where LI_{Mo} = 20, F_{Mo} = 58, F_{Mo-1} = 36, F_{Mo+1} = 24, and A_{Mo} = 10, we have:

Mo = 20 + \frac{58 - 36}{2 \cdot 58 - (36 + 24)} \cdot 10 = 23.9
On the other hand, King's method consists of the following phases:
Step 1: Identify the modal class (the one with the highest absolute frequency).
Step 2: Calculate the mode (Mo) using the following expression:

Mo = LI_{Mo} + \frac{F_{Mo+1}}{F_{Mo-1} + F_{Mo+1}} \cdot A_{Mo} \qquad (3.12)

where:
LI_{Mo} = lower limit of the modal class;
F_{Mo-1} = absolute frequency of the class before the modal class;
F_{Mo+1} = absolute frequency of the class after the modal class;
A_{Mo} = range of the modal class.
Example 3.23
Once again, consider the data from the previous example. Use King's method to determine the mode.
Solution
In Example 3.22, we saw that LI_{Mo} = 20, F_{Mo+1} = 24, F_{Mo-1} = 36, and A_{Mo} = 10. Applying Expression (3.12):

Mo = LI_{Mo} + \frac{F_{Mo+1}}{F_{Mo-1} + F_{Mo+1}} \cdot A_{Mo} = 20 + \frac{24}{36 + 24} \cdot 10 = 24
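Expressions (3.11) and (3.12) can be sketched side by side (the function names czuber_mode and king_mode and the (lower, upper, Fi) tuple layout are our own; neighboring frequencies default to 0 when the modal class is at an edge, an assumption the book does not discuss):

```python
def czuber_mode(classes):
    """Expression (3.11). classes: list of (lower, upper, Fi) tuples."""
    fi = [f for _, _, f in classes]
    i = fi.index(max(fi))                         # Step 1: modal class
    lo, hi, f_mo = classes[i]
    f_prev = fi[i - 1] if i > 0 else 0
    f_next = fi[i + 1] if i + 1 < len(fi) else 0
    return lo + (f_mo - f_prev) / (2 * f_mo - (f_prev + f_next)) * (hi - lo)

def king_mode(classes):
    """Expression (3.12): uses only the neighboring frequencies."""
    fi = [f for _, _, f in classes]
    i = fi.index(max(fi))
    lo, hi, _ = classes[i]
    f_prev = fi[i - 1] if i > 0 else 0
    f_next = fi[i + 1] if i + 1 < len(fi) else 0
    return lo + f_next / (f_prev + f_next) * (hi - lo)

# Examples 3.22 and 3.23: modal class is 20 |- 30 (Table 3.E.30)
data = [(1, 10, 21), (10, 20, 36), (20, 30, 58), (30, 40, 24), (40, 50, 19)]
```

The two methods generally disagree slightly, as here (about 23.9 versus 24), because Czuber's formula also uses the modal class's own frequency.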
3.4.1.2 Quantiles According to Bussab and Morettin (2011), only the use of measures of central tendency may not be suitable to represent a set of data, since they are also impacted by extreme values. Moreover, only with the use of these measures, it is not possible for the researcher to have a clear idea of the data dispersion and symmetry. As an alternative, we can use quantiles, such as, quartiles, deciles, and percentiles. The 2nd quartile (Q2), 5th decile (D5), or 50th percentile (P50) correspond to the median; therefore, they are measures of central tendency. 3.4.1.2.1
Quartiles
Quartiles (Qi, i = 1, 2, 3) are measures of position that divide a set of data, sorted in ascending order, into four parts with equal dimensions:

Min. | Q1 | Md = Q2 | Q3 | Max.
Thus, the 1st quartile (Q1, or the 25th percentile) indicates that 25% of the data are less than Q1, or that 75% of the data are greater than Q1. The 2nd quartile (Q2, or the 5th decile, or the 50th percentile) corresponds to the median, indicating that 50% of the data are less than, and 50% greater than, Q2. The 3rd quartile (Q3, or the 75th percentile) indicates that 75% of the data are less than Q3, or that 25% of the data are greater than Q3.
3.4.1.2.2 Deciles
Deciles (Di, i = 1, 2, ..., 9) are measures of position that divide a set of data, sorted in ascending order, into 10 equal parts:
Min. | D1 | D2 | D3 | D4 | D5 (= Md) | D6 | D7 | D8 | D9 | Max.
Therefore, the 1st decile (D1, or the 10th percentile) indicates that 10% of the data are less than D1, or that 90% of the data are greater than D1. The 2nd decile (D2, or the 20th percentile) indicates that 20% of the data are less than D2, or that 80% of the data are greater than D2. And so on, up to the 9th decile (D9, or the 90th percentile), indicating that 90% of the data are less than D9, or that 10% of the data are greater than D9.
3.4.1.2.3 Percentiles
Percentiles (Pi, i = 1, 2, ..., 99) are measures of position that divide a set of data, sorted in ascending order, into 100 equal parts. Hence, the 1st percentile (P1) indicates that 1% of the data are less than P1, or that 99% of the data are greater than P1. The 2nd percentile (P2) indicates that 2% of the data are less than P2, or that 98% of the data are greater than P2. And so on, up to the 99th percentile (P99), which indicates that 99% of the data are less than P99, or that 1% of the data are greater than P99.
3.4.1.2.3.1 Case 1: Quartiles, Deciles, and Percentiles of Ungrouped Discrete and Continuous Data
If the position of the quartile, decile, or percentile we are interested in is an integer or lies exactly between two positions, calculating it is straightforward. However, this does not happen all the time (imagine a sample with 33 elements in which the objective is to calculate the 67th percentile). There are many methods proposed for this kind of calculation; they lead to close, but not identical, results. We will present a simple and generic method that can be applied to calculate any quartile, decile, or percentile of order i, considering ungrouped discrete and continuous data:
Step 1: Sort the observations in ascending order.
Step 2: Determine the position of the quartile, decile, or percentile of order i we are interested in:

Quartile: Pos(Q_i) = \frac{n}{4} \times i + \frac{1}{2}, \quad i = 1, 2, 3 \qquad (3.13)

Decile: Pos(D_i) = \frac{n}{10} \times i + \frac{1}{2}, \quad i = 1, 2, ..., 9 \qquad (3.14)

Percentile: Pos(P_i) = \frac{n}{100} \times i + \frac{1}{2}, \quad i = 1, 2, ..., 99 \qquad (3.15)

Step 3: Calculate the value of the quartile, decile, or percentile that corresponds to the respective position. Assume that Pos(Q1) = 3.75, that is, the value of Q1 is between the 3rd and 4th positions (75% closer to the 4th position, and 25% to the 3rd). Therefore, Q1 will be the value that corresponds to the 3rd position multiplied by 0.25, plus the value that corresponds to the 4th position multiplied by 0.75.
Example 3.24
Consider the data in Example 3.20 regarding the average carrot processing time in the post-harvest handling phase, as specified in Table 3.E.28. Determine Q1 (1st quartile), Q3 (3rd quartile), D2 (2nd decile), and P64 (64th percentile).
Solution
For ungrouped continuous data, we must apply the following steps to determine the quartiles, deciles, and percentiles we are interested in:
Step 1: Sort the observations in ascending order.

1st   2nd   3rd   4th   5th   6th   7th   8th   9th   10th
44.0  44.5  44.5  44.7  44.8  44.9  44.9  45.0  45.0  45.0

11th  12th  13th  14th  15th  16th  17th  18th  19th  20th
45.0  45.4  45.6  45.7  45.8  46.0  46.2  46.2  46.3  46.5
Step 2: Calculate the positions of Q1, Q3, D2, and P64:
a) Pos(Q1) = (20/4) × 1 + 1/2 = 5.5
b) Pos(Q3) = (20/4) × 3 + 1/2 = 15.5
c) Pos(D2) = (20/10) × 2 + 1/2 = 4.5
d) Pos(P64) = (20/100) × 64 + 1/2 = 13.3
Step 3: Calculate Q1, Q3, D2, and P64:
a) Pos(Q1) = 5.5 means that the corresponding value is 50% near position 5 and 50% near position 6, that is, Q1 is simply the average of the values that correspond to both positions:

Q1 = (44.8 + 44.9)/2 = 44.85

b) Pos(Q3) = 15.5 means that the value we are interested in is between positions 15 and 16 (50% near each), so Q3 can be calculated as follows:

Q3 = (45.8 + 46.0)/2 = 45.9

c) Pos(D2) = 4.5 means that the value we are interested in is between positions 4 and 5, so D2 can be calculated as follows:

D2 = (44.7 + 44.8)/2 = 44.75

d) Pos(P64) = 13.3 means that the value we are interested in is 70% closer to position 13 and 30% closer to position 14, so P64 can be calculated as follows:

P64 = (0.70 × 45.6) + (0.30 × 45.7) = 45.63

Interpretation
Q1 = 44.85 indicates that, in 25% of the observations (the first 5 observations listed in Step 1), the carrot processing time in the post-harvest handling phase is less than 44.85 seconds, or that, in 75% of the observations (the remaining 15), the processing time is greater than 44.85. Q3 = 45.9 indicates that, in 75% of the observations (15 of them), the processing time is less than 45.9 seconds, or that, in 5 observations, the processing time is greater than 45.9. D2 = 44.75 indicates that, in 20% of the observations (4 of them), the processing time is less than 44.75 seconds, or that, in 80% of the observations (16 of them), the processing time is greater than 44.75. P64 = 45.63 indicates that, in 64% of the observations, the processing time is less than 45.63 seconds, or that, in 36% of the observations, the processing time is greater than 45.63.
Excel calculates the quartile of order i (i = 0, 1, 2, 3, 4) through the QUARTILE function. As arguments of the function, we must define the matrix or set of data for which we want to calculate the respective quartile (it does not need to be in ascending order), in addition to the quartile we are interested in (minimum value = 0; 1st quartile = 1; 2nd quartile = 2; 3rd quartile = 3; maximum value = 4). The k-th percentile (0 ≤ k ≤ 1) can also be calculated in Excel through the PERCENTILE function. As arguments of the function, we must define the matrix we are interested in, in addition to the value of k (for example, in the case of P64, k = 0.64).
The calculation of quartiles, deciles, and percentiles using SPSS and Stata statistical software will be demonstrated in Sections 3.6 and 3.7, respectively. SPSS and Stata software use two methods to calculate quartiles, deciles, or percentiles. One of them is called Tukey’s Hinges and it is the method used in this book. The other method is related to the Weighted Average, whose calculations are more complex. Excel, on the other hand, implements another algorithm that gets similar results.
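The positional method of Expressions (3.13) through (3.15) can be sketched as one function (quantile_position is our own name; because conventions differ, its results can deviate slightly from Excel's QUARTILE/PERCENTILE and from the SPSS/Stata methods mentioned above):

```python
def quantile_position(data, i, of=100):
    """The book's positional method: Pos = (n/of)*i + 1/2, then linear
    interpolation between the two surrounding sorted observations.
    of=4 gives quartiles, of=10 deciles, of=100 percentiles."""
    x = sorted(data)
    pos = len(x) / of * i + 0.5          # Expressions (3.13)-(3.15)
    k, frac = int(pos), pos - int(pos)
    if frac == 0:
        return x[k - 1]                  # positions are 1-based
    return x[k - 1] * (1 - frac) + x[k] * frac

# Example 3.24: processing times from Table 3.E.28
times = [45.0, 44.5, 44.0, 45.0, 46.5, 46.0, 45.8, 44.8, 45.0, 46.2,
         44.5, 45.0, 45.4, 44.9, 45.7, 46.2, 44.7, 45.6, 46.3, 44.9]
```

quantile_position(times, 1, of=4), quantile_position(times, 2, of=10), and quantile_position(times, 64) reproduce Q1 = 44.85, D2 = 44.75, and P64 = 45.63 from Example 3.24.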
3.4.1.2.3.2 Case 2: Quartiles, Deciles, and Percentiles of Grouped Discrete Data
Here, the calculation of quartiles, deciles, and percentiles is similar to the previous case; however, the data are grouped in a frequency distribution table, sorted in ascending order, with their respective absolute and cumulative frequencies. First, we must determine the position of the quartile, decile, or percentile of order i we are interested in through Expressions (3.13), (3.14), and (3.15), respectively. From the cumulative frequency column, we verify the group(s) that contain(s) this position. If the position is an integer, its corresponding value is obtained directly in the first column. If the position is a fractional number, such as 2.5, and the 2nd and 3rd positions are in the same group, the respective value is also obtained directly. On the other hand, if the position is a fractional number, such as 4.25, and positions 4 and 5 are in different groups, we must take the value that corresponds to the 4th position multiplied by 0.75, plus the value that corresponds to the 5th position multiplied by 0.25 (similar to Case 1).
Example 3.25
Consider the data in Example 3.18 regarding the number of bedrooms in 70 real estate properties in a condominium located in the metropolitan area of Sao Paulo, and their respective absolute and cumulative frequencies (Table 3.E.26). Calculate Q1, D4, and P96.
Solution
Let's calculate the positions of Q1, D4, and P96 through Expressions (3.13), (3.14), and (3.15), respectively, and then their corresponding values:
a) Pos(Q1) = (70/4) × 1 + 1/2 = 18. Based on Table 3.E.26, we can see that position 18 is in the second group (2 bedrooms), so Q1 = 2.
b) Pos(D4) = (70/10) × 4 + 1/2 = 28.5. Through the cumulative frequency column, we can see that positions 28 and 29 are in the third group (3 bedrooms), so D4 = 3.
c) Pos(P96) = (70/100) × 96 + 1/2 = 67.7, that is, P96 is 70% closer to position 68 and 30% closer to position 67. Through the cumulative frequency column, we can see that position 68 is in the seventh group (7 bedrooms) and position 67 is in the sixth group (6 bedrooms), so P96 can be calculated as follows:

P96 = (0.70 × 7) + (0.30 × 6) = 6.7

Interpretation
Q1 = 2 indicates that 25% of the real estate properties have fewer than 2 bedrooms, or that 75% of the real estate properties have more than 2 bedrooms. D4 = 3 indicates that 40% of the real estate properties have fewer than 3 bedrooms, or that 60% have more than 3 bedrooms. P96 = 6.7 indicates that 96% of the real estate properties have fewer than 6.7 bedrooms, or that 4% have more than 6.7 bedrooms.
3.4.1.2.3.3 Case 3: Quartiles, Deciles, and Percentiles of Continuous Data Grouped into Classes
For continuous data grouped into classes, in which the data are represented in a frequency distribution table, we must apply the following steps to calculate the quartiles, deciles, and percentiles:
Step 1: Calculate the position of the quartile, decile, or percentile of order i we are interested in through the following expressions:

Quartile: Pos(Q_i) = \frac{n}{4} \cdot i, \quad i = 1, 2, 3 \qquad (3.16)

Decile: Pos(D_i) = \frac{n}{10} \cdot i, \quad i = 1, 2, ..., 9 \qquad (3.17)

Percentile: Pos(P_i) = \frac{n}{100} \cdot i, \quad i = 1, 2, ..., 99 \qquad (3.18)
Step 2: Identify the class that contains the quartile, decile, or percentile of order i we are interested in (the quartile, decile, or percentile class) from the cumulative frequency column.
Step 3: Calculate the quartile, decile, or percentile of order i we are interested in through the following expressions:

Quartile: Q_i = LL_{Q_i} + \frac{Pos(Q_i) - Fcum_{(Q_i - 1)}}{F_{Q_i}} \cdot R_{Q_i}, \quad i = 1, 2, 3 \qquad (3.19)

where:
LL_{Q_i} = lower limit of the quartile class;
Fcum_{(Q_i - 1)} = cumulative frequency of the class before the quartile class;
F_{Q_i} = absolute frequency of the quartile class;
R_{Q_i} = range of the quartile class.

Decile: D_i = LL_{D_i} + \frac{Pos(D_i) - Fcum_{(D_i - 1)}}{F_{D_i}} \cdot R_{D_i}, \quad i = 1, 2, ..., 9 \qquad (3.20)

where:
LL_{D_i} = lower limit of the decile class;
Fcum_{(D_i - 1)} = cumulative frequency of the class before the decile class;
F_{D_i} = absolute frequency of the decile class;
R_{D_i} = range of the decile class.

Percentile: P_i = LL_{P_i} + \frac{Pos(P_i) - Fcum_{(P_i - 1)}}{F_{P_i}} \cdot R_{P_i}, \quad i = 1, 2, ..., 99 \qquad (3.21)

where:
LL_{P_i} = lower limit of the percentile class;
Fcum_{(P_i - 1)} = cumulative frequency of the class before the percentile class;
F_{P_i} = absolute frequency of the percentile class;
R_{P_i} = range of the percentile class.
Example 3.26
A survey on the health conditions of 250 patients collected information about their weight. The data are grouped into classes, as shown in Table 3.E.31. Calculate the first quartile, the seventh decile, and the 60th percentile.
TABLE 3.E.31 Absolute and Cumulative Frequency Distribution of Patients' Weight Grouped into Classes

Class        Fi     Fac
50 ├ 60      18      18
60 ├ 70      28      46
70 ├ 80      49      95
80 ├ 90      66     161
90 ├ 100     40     201
100 ├ 110    33     234
110 ├ 120    16     250
Sum         250
PART II Descriptive Statistics
Solution
Let's apply the three steps to calculate Q1, D7, and P60.
Step 1: Calculate the position of the first quartile, the seventh decile, and the 60th percentile through Expressions (3.16), (3.17), and (3.18), respectively:

1st Quartile → Pos(Q1) = 250 × 1/4 = 62.5
7th Decile → Pos(D7) = 250 × 7/10 = 175
60th Percentile → Pos(P60) = 250 × 60/100 = 150

Step 2: Identify the class that contains Q1, D7, and P60 from the cumulative frequency column in Table 3.E.31:
Q1 is in the 3rd class (70 ├ 80)
D7 is in the 5th class (90 ├ 100)
P60 is in the 4th class (80 ├ 90)

Step 3: Calculate Q1, D7, and P60 from Expressions (3.19), (3.20), and (3.21), respectively:

Q1 = LL_Q1 + [(Pos(Q1) − Fcum(Q1−1)) / F_Q1] · R_Q1 = 70 + [(62.5 − 46)/49] × 10 = 73.37

D7 = LL_D7 + [(Pos(D7) − Fcum(D7−1)) / F_D7] · R_D7 = 90 + [(175 − 161)/40] × 10 = 93.5

P60 = LL_P60 + [(Pos(P60) − Fcum(P60−1)) / F_P60] · R_P60 = 80 + [(150 − 95)/66] × 10 = 88.33
Interpretation
Q1 = 73.37 indicates that 25% of the patients weigh less than 73.37 kg, or that 75% of the patients weigh more than 73.37 kg.
D7 = 93.5 indicates that 70% of the patients weigh less than 93.5 kg, or that 30% of the patients weigh more than 93.5 kg.
P60 = 88.33 indicates that 60% of the patients weigh less than 88.33 kg, or that 40% of the patients weigh more than 88.33 kg.
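The three-step interpolation above translates directly into code. The sketch below is illustrative (the helper name grouped_percentile is ours, not from the book) and reproduces Q1, D7, and P60 for Table 3.E.31:

```python
def grouped_percentile(classes, p):
    """p-th percentile (0 < p < 100) for data grouped into classes,
    following Expressions (3.18) and (3.21). `classes` is a list of
    (lower_limit, upper_limit, absolute_frequency) tuples."""
    n = sum(freq for _, _, freq in classes)
    pos = p * n / 100.0                  # position of the percentile
    cum = 0                              # cumulative frequency so far
    for lower, upper, freq in classes:
        if cum + freq >= pos:            # this is the percentile class
            return lower + (pos - cum) / freq * (upper - lower)
        cum += freq
    raise ValueError("position falls beyond the last class")

weights = [(50, 60, 18), (60, 70, 28), (70, 80, 49), (80, 90, 66),
           (90, 100, 40), (100, 110, 33), (110, 120, 16)]
q1 = grouped_percentile(weights, 25)     # first quartile
d7 = grouped_percentile(weights, 70)     # seventh decile
p60 = grouped_percentile(weights, 60)    # 60th percentile
```

Quartiles and deciles are simply the 25i-th and 10i-th percentiles, so one function covers Expressions (3.19) through (3.21).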
3.4.1.3 Identifying the Existence of Univariate Outliers
A dataset can contain observations that are extremely distant from most of the others, or that are inconsistent. These observations are called outliers, or atypical, discrepant, abnormal, or extreme values. Before deciding what to do with the outliers, we must know the causes that lead to their occurrence; in many cases, these causes determine the most suitable treatment. The main causes are measurement mistakes, execution/implementation mistakes, and variability inherent to the population.
There are many outlier identification methods: boxplots, discordance models, Dixon's test, Grubbs' test, Z-scores, among others. In the Appendix of Chapter 11 (Cluster Analysis), a very efficient method for detecting multivariate outliers will be presented (the BACON algorithm, Blocked Adaptive Computationally Efficient Outlier Nominators).
In boxplots (whose construction was studied in Section 3.3.2.5), the existence of outliers is identified from the IQR (interquartile range), which corresponds to the difference between the third and first quartiles:

IQR = Q3 − Q1    (3.22)

Note that the IQR is the length of the box. Any value located more than 1.5·IQR below Q1 or above Q3 is considered a mild outlier and is represented by a circle. Mild outliers may even be accepted in the population, but with some suspicion. Thus, the value X° of a variable is considered a mild outlier when:

X° < Q1 − 1.5·IQR    (3.23)

X° > Q3 + 1.5·IQR    (3.24)
FIG. 3.15 Boxplot with the identification of outliers.
Any value located more than 3·IQR below Q1 or above Q3 is considered an extreme outlier and is represented by an asterisk. Thus, the value X* of a variable is considered an extreme outlier when:

X* < Q1 − 3·IQR    (3.25)

X* > Q3 + 3·IQR    (3.26)
Fig. 3.15 illustrates the boxplot with the identification of outliers.

Example 3.27
Consider the sorted data in Example 3.24 regarding the average carrot processing time in the post-harvest handling phase:

44.0  44.5  44.5  44.7  44.8  44.9  44.9  45.0  45.0  45.0
45.0  45.4  45.6  45.7  45.8  46.0  46.2  46.2  46.3  46.5
where Q1 = 44.85, Q2 = 45, Q3 = 45.9, mean = 45.3, and mode = 45. Check whether there are mild and extreme outliers.
Solution
To verify whether there are possible outliers, we must calculate:

Q1 − 1.5·(Q3 − Q1) = 44.85 − 1.5 × (45.9 − 44.85) = 43.275
Q3 + 1.5·(Q3 − Q1) = 45.9 + 1.5 × (45.9 − 44.85) = 47.475

Since there is no value in the distribution outside this interval, we conclude that there are no mild outliers. Obviously, it is not necessary to calculate the interval for extreme outliers.
In case only one outlier in a certain variable is identified, the researcher can treat it through some existing procedures, such as the complete elimination of this observation. On the other hand, if there is more than one outlier for one or more variables individually, eliminating all of these observations can reduce the sample size significantly. To avoid this problem, it is very common for observations considered outliers for a certain variable to have their atypical values substituted by the mean of the variable computed excluding the outliers (Fávero et al., 2009). The authors mention other procedures for dealing with outliers, such as substituting them with values from a regression, or winsorization, which, in an organized way, eliminates an equal number of observations from each side of the distribution.
Fávero et al. (2009) also highlight the importance of dealing with outliers when the researcher is interested in investigating the behavior of a certain variable without the influence of observations with atypical values. On the other hand, if the main goal is to analyze the behavior of these atypical observations, or to define subgroups through discrepancy criteria, eliminating these observations or substituting their values may not be the best solution.
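As a rough cross-check of Example 3.27, the fences of Expressions (3.23) through (3.26) can be computed in a few lines of Python (an illustrative sketch; the helper name iqr_fences is ours):

```python
def iqr_fences(q1, q3):
    """Return (lower, upper) fences for mild and extreme outliers,
    per Expressions (3.23)-(3.26)."""
    iqr = q3 - q1
    mild = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    extreme = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)
    return mild, extreme

times = [44.0, 44.5, 44.5, 44.7, 44.8, 44.9, 44.9, 45.0, 45.0, 45.0,
         45.0, 45.4, 45.6, 45.7, 45.8, 46.0, 46.2, 46.2, 46.3, 46.5]
mild, extreme = iqr_fences(44.85, 45.9)   # quartiles from Example 3.24
mild_outliers = [x for x in times if x < mild[0] or x > mild[1]]
```

Since mild_outliers comes back empty, no value lies outside the 1.5·IQR fences, confirming the conclusion above.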
3.4.2 Measures of Dispersion or Variability
To study the behavior of a set of data, we use measures of central tendency and measures of dispersion, in addition to the nature or shape of the data distribution. Measures of central tendency determine a value that represents the set of data; to characterize the dispersion or variability of the data, measures of dispersion are necessary. The most common measures of dispersion are the range, average deviation, variance, standard deviation, standard error, and coefficient of variation (CV).
3.4.2.1 Range
The simplest measure of variability is the total range, or simply range (R), which represents the difference between the highest and lowest values in the set of data:

R = Xmax − Xmin    (3.27)
3.4.2.2 Average Deviation
Deviation is the difference between each observed value and the mean of the variable. Thus, for population data, it is denoted by (Xi − μ), and for sample data, by (Xi − X̄). The modulus, or absolute deviation, ignores the sign and is represented by |Xi − μ| or |Xi − X̄|. The average deviation, or mean absolute deviation, represents the arithmetic mean of the absolute deviations.

3.4.2.2.1 Case 1: Average Deviation of Ungrouped Discrete and Continuous Data
The average deviation (D) is the sum of the absolute deviations of all observations divided by the population size (N) or the sample size (n):

D = Σ_{i=1}^{N} |Xi − μ| / N    (for the population)    (3.28)

D = Σ_{i=1}^{n} |Xi − X̄| / n    (for samples)    (3.29)
Example 3.28
Table 3.E.32 shows the distances traveled (in km) by a vehicle in order to deliver 10 packages throughout the day. Calculate the average deviation.

TABLE 3.E.32 Distances Traveled (km)
12.4  22.6  18.9  9.7  14.5  22.5  26.3  17.7  31.2  20.4

Solution
For the data in Table 3.E.32, we have X̄ = 19.62. Applying Expression (3.29), we get the average deviation:

D = (|12.4 − 19.62| + |22.6 − 19.62| + ⋯ + |20.4 − 19.62|) / 10 = 4.98

The average deviation can be directly calculated in Excel using the AVEDEV function.
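For readers working outside Excel, Expression (3.29) is a one-liner in Python (our own sketch, mirroring what AVEDEV computes):

```python
def average_deviation(sample):
    """Mean absolute deviation around the arithmetic mean, Expression (3.29)."""
    m = sum(sample) / len(sample)
    return sum(abs(x - m) for x in sample) / len(sample)

distances = [12.4, 22.6, 18.9, 9.7, 14.5, 22.5, 26.3, 17.7, 31.2, 20.4]
d = average_deviation(distances)
```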
3.4.2.2.2 Case 2: Average Deviation of Grouped Discrete Data
For grouped data, presented in a frequency distribution table with m groups, the average deviation is calculated as:

D = Σ_{i=1}^{m} |Xi − μ|·Fi / N    (for the population)    (3.30)

D = Σ_{i=1}^{m} |Xi − X̄|·Fi / n    (for samples)    (3.31)

bearing in mind that X̄ = Σ_{i=1}^{m} Xi·Fi / n.
Example 3.29
Table 3.E.33 shows the number of goals scored by the D.C. soccer team in their last 30 games, with their respective absolute frequencies. Calculate the average deviation.

TABLE 3.E.33 Frequency Distribution of Example 3.29
Number of Goals    Fi
0                   5
1                   8
2                   6
3                   4
4                   4
5                   2
6                   1
Sum                30
Solution
The mean is X̄ = (0×5 + 1×8 + ⋯ + 6×1)/30 = 2.133. The average deviation can be determined from the calculations presented in Table 3.E.34:

TABLE 3.E.34 Calculations of the Average Deviation for Example 3.29
Number of Goals    Fi    |Xi − X̄|    |Xi − X̄|·Fi
0                   5      2.133        10.667
1                   8      1.133         9.067
2                   6      0.133         0.800
3                   4      0.867         3.467
4                   4      1.867         7.467
5                   2      2.867         5.733
6                   1      3.867         3.867
Sum                30                   41.067

Therefore, D = Σ_{i=1}^{m} |Xi − X̄|·Fi / n = 41.067/30 = 1.369.
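The frequency-weighted computation of Table 3.E.34 can be sketched as follows (illustrative code; the function name is ours, and the same function handles data grouped into classes if the class midpoints are passed as values):

```python
def grouped_average_deviation(values, freqs):
    """Average deviation for grouped data, Expression (3.31)."""
    n = sum(freqs)
    m = sum(x * f for x, f in zip(values, freqs)) / n
    return sum(abs(x - m) * f for x, f in zip(values, freqs)) / n

goals = [0, 1, 2, 3, 4, 5, 6]
games = [5, 8, 6, 4, 4, 2, 1]
d = grouped_average_deviation(goals, games)
```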
3.4.2.2.3 Case 3: Average Deviation of Continuous Data Grouped into Classes
For continuous data grouped into classes, the average deviation is calculated as:

D = Σ_{i=1}^{k} |Xi − μ|·Fi / N    (for the population)    (3.32)

D = Σ_{i=1}^{k} |Xi − X̄|·Fi / n    (for samples)    (3.33)

Note that Expressions (3.32) and (3.33) are similar to Expressions (3.30) and (3.31), respectively, except that, instead of m groups, we consider k classes. Moreover, Xi represents the middle or central point of each class i, and X̄ = Σ_{i=1}^{k} Xi·Fi / n, as presented in Expression (3.6).

Example 3.30
In order to determine its variation due to genetic factors, a survey of 100 newborn babies collected information about their weight. Table 3.E.35 shows the data grouped into classes and their respective absolute frequencies. Calculate the average deviation.
TABLE 3.E.35 Newborn Babies' Weight (in kg) Grouped into Classes
Class        Fi
2.0 ├ 2.5    10
2.5 ├ 3.0    24
3.0 ├ 3.5    31
3.5 ├ 4.0    22
4.0 ├ 4.5    13
Sum         100
Solution
First, we must calculate X̄:

X̄ = Σ_{i=1}^{k} Xi·Fi / n = (2.25×10 + 2.75×24 + 3.25×31 + 3.75×22 + 4.25×13)/100 = 3.270

The average deviation can be determined from the calculations presented in Table 3.E.36:

TABLE 3.E.36 Calculations of the Average Deviation for Example 3.30
Class        Fi    Xi      |Xi − X̄|    |Xi − X̄|·Fi
2.0 ├ 2.5    10    2.25      1.02        10.20
2.5 ├ 3.0    24    2.75      0.52        12.48
3.0 ├ 3.5    31    3.25      0.02         0.62
3.5 ├ 4.0    22    3.75      0.48        10.56
4.0 ├ 4.5    13    4.25      0.98        12.74
Sum         100                          46.60

Therefore, D = Σ_{i=1}^{k} |Xi − X̄|·Fi / n = 46.6/100 = 0.466.
3.4.2.3 Variance
Variance is a measure of dispersion or variability that evaluates how much the data are dispersed in relation to the arithmetic mean. Thus, the higher the variance, the higher the data dispersion.

3.4.2.3.1 Case 1: Variance of Ungrouped Discrete and Continuous Data
Instead of considering the mean of the absolute deviations, as discussed in the previous section, it is more common to calculate the mean of the squared deviations. This measure is known as the variance:

σ² = Σ_{i=1}^{N} (Xi − μ)² / N = [Σ_{i=1}^{N} Xi² − (Σ_{i=1}^{N} Xi)²/N] / N    (for the population)    (3.34)

S² = Σ_{i=1}^{n} (Xi − X̄)² / (n − 1) = [Σ_{i=1}^{n} Xi² − (Σ_{i=1}^{n} Xi)²/n] / (n − 1)    (for samples)    (3.35)

The relationship between the sample variance (S²) and the population variance (σ²) is given by:

S² = N·σ² / (n − 1)    (3.36)
Example 3.31
Consider the data in Example 3.28 regarding the distances traveled (in km) by a vehicle in order to deliver 10 packages throughout the day. Calculate the variance.
Solution
We saw in Example 3.28 that X̄ = 19.62. Applying Expression (3.35), we have:

S² = [(12.4 − 19.62)² + (22.6 − 19.62)² + ⋯ + (20.4 − 19.62)²] / 9 = 41.94

The sample variance can be directly calculated in Excel using the VAR.S function. To calculate the population variance, we must use the VAR.P function.
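Expression (3.35) can also be verified against Python's standard library (a sketch; statistics.variance computes exactly the sample variance with the n − 1 denominator):

```python
import statistics

def sample_variance(sample):
    """Sample variance, Expression (3.35)."""
    m = sum(sample) / len(sample)
    return sum((x - m) ** 2 for x in sample) / (len(sample) - 1)

distances = [12.4, 22.6, 18.9, 9.7, 14.5, 22.5, 26.3, 17.7, 31.2, 20.4]
s2 = sample_variance(distances)
assert abs(s2 - statistics.variance(distances)) < 1e-9  # cross-check
```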
3.4.2.3.2 Case 2: Variance of Grouped Discrete Data
For grouped data, represented in a frequency distribution table with m groups, the variance can be calculated as follows:

σ² = Σ_{i=1}^{m} (Xi − μ)²·Fi / N = [Σ_{i=1}^{m} Xi²·Fi − (Σ_{i=1}^{m} Xi·Fi)²/N] / N    (for the population)    (3.37)

S² = Σ_{i=1}^{m} (Xi − X̄)²·Fi / (n − 1) = [Σ_{i=1}^{m} Xi²·Fi − (Σ_{i=1}^{m} Xi·Fi)²/n] / (n − 1)    (for samples)    (3.38)

where X̄ = Σ_{i=1}^{m} Xi·Fi / n.
Example 3.32
Consider the data in Example 3.29 regarding the number of goals scored by the D.C. soccer team in the last 30 games, with their respective absolute frequencies. Calculate the variance.
Solution
As calculated in Example 3.29, the mean is X̄ = 2.133. The variance can be determined from the calculations presented in Table 3.E.37:

TABLE 3.E.37 Calculations of the Variance
Number of Goals    Fi    (Xi − X̄)²    (Xi − X̄)²·Fi
0                   5      4.551        22.756
1                   8      1.284        10.276
2                   6      0.018         0.107
3                   4      0.751         3.004
4                   4      3.484        13.938
5                   2      8.218        16.436
6                   1     14.951        14.951
Sum                30                   81.467

Therefore, S² = Σ_{i=1}^{m} (Xi − X̄)²·Fi / (n − 1) = 81.467/29 = 2.809.
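The computation in Table 3.E.37 can be coded in the same style as the grouped average deviation (an illustrative sketch; the function name is ours, and passing class midpoints as values covers data grouped into classes as well):

```python
def grouped_sample_variance(values, freqs):
    """Sample variance for grouped data, Expression (3.38)."""
    n = sum(freqs)
    m = sum(x * f for x, f in zip(values, freqs)) / n
    return sum((x - m) ** 2 * f for x, f in zip(values, freqs)) / (n - 1)

goals = [0, 1, 2, 3, 4, 5, 6]
games = [5, 8, 6, 4, 4, 2, 1]
s2 = grouped_sample_variance(goals, games)
```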
3.4.2.3.3 Case 3: Variance of Continuous Data Grouped into Classes
For continuous data grouped into classes, we calculate the variance as follows:

σ² = Σ_{i=1}^{k} (Xi − μ)²·Fi / N = [Σ_{i=1}^{k} Xi²·Fi − (Σ_{i=1}^{k} Xi·Fi)²/N] / N    (for the population)    (3.39)

S² = Σ_{i=1}^{k} (Xi − X̄)²·Fi / (n − 1) = [Σ_{i=1}^{k} Xi²·Fi − (Σ_{i=1}^{k} Xi·Fi)²/n] / (n − 1)    (for samples)    (3.40)

Note that Expressions (3.39) and (3.40) are similar to Expressions (3.37) and (3.38), respectively, except that we consider k classes instead of m groups.
Example 3.33
Consider the data in Example 3.30 regarding the weight of newborn babies grouped into classes, with their respective absolute frequencies. Calculate the variance.
Solution
As calculated in Example 3.30, we have X̄ = 3.270.
The variance can be determined from the calculations presented in Table 3.E.38:

TABLE 3.E.38 Calculations of the Variance for Example 3.33
Class        Fi    Xi      (Xi − X̄)²    (Xi − X̄)²·Fi
2.0 ├ 2.5    10    2.25     1.0404       10.4040
2.5 ├ 3.0    24    2.75     0.2704        6.4896
3.0 ├ 3.5    31    3.25     0.0004        0.0124
3.5 ├ 4.0    22    3.75     0.2304        5.0688
4.0 ├ 4.5    13    4.25     0.9604       12.4852
Sum         100                          34.4600

Therefore, S² = Σ_{i=1}^{k} (Xi − X̄)²·Fi / (n − 1) = 34.46/99 = 0.348.
3.4.2.4 Standard Deviation
Since the variance considers the mean of the squared deviations, its value tends to be very high and difficult to interpret. To solve this problem, we calculate the square root of the variance. This measure is known as the standard deviation, calculated as follows:

σ = √σ²    (for the population)    (3.41)

S = √S²    (for samples)    (3.42)

Example 3.34
Once again, consider the data in Examples 3.28 and 3.31 regarding the distances traveled (in km) by the vehicle. Calculate the standard deviation.
Solution
We have X̄ = 19.62. The standard deviation is the square root of the variance, which has already been calculated in Example 3.31:

S = √{[(12.4 − 19.62)² + (22.6 − 19.62)² + ⋯ + (20.4 − 19.62)²]/9} = √41.94 = 6.476

The standard deviation of a sample can be directly calculated in Excel using the STDEV.S function. To calculate the standard deviation of the population, we use the STDEV.P function.
Example 3.35
Consider the data in Examples 3.29 and 3.32 regarding the number of goals scored by the D.C. soccer team in the last 30 games, with their respective absolute frequencies. Calculate the standard deviation.
Solution
The mean is X̄ = 2.133. The standard deviation is the square root of the variance, already calculated in Example 3.32 (Table 3.E.37):

S = √[Σ_{i=1}^{m} (Xi − X̄)²·Fi / (n − 1)] = √(81.467/29) = √2.809 = 1.676
Example 3.36
Consider the data in Examples 3.30 and 3.33 regarding the weight of newborn babies grouped into classes, with their respective absolute frequencies. Calculate the standard deviation.
Solution
We have X̄ = 3.270. The standard deviation is the square root of the variance, already calculated in Example 3.33 (Table 3.E.38):

S = √[Σ_{i=1}^{k} (Xi − X̄)²·Fi / (n − 1)] = √(34.46/99) = √0.348 = 0.590
3.4.2.5 Standard Error
The standard error is the standard deviation of the mean. It is obtained by dividing the standard deviation by the square root of the population or sample size:

σ_X̄ = σ/√N    (for the population)    (3.43)

S_X̄ = S/√n    (for samples)    (3.44)

The higher the number of measurements, the better the determination of the mean will be (higher accuracy), due to the compensation of random errors.

Example 3.37
One of the phases in the preparation of concrete is mixing it in a concrete mixer. Tables 3.E.39 and 3.E.40 show the concrete mixing times (in seconds), considering samples with 10 and 30 elements, respectively. Calculate the standard error for both cases and interpret the results.
TABLE 3.E.39 Concrete Mixing Time for a Sample With 10 Elements
124  111  132  142  108  127  133  144  148  105
TABLE 3.E.40 Concrete Mixing Time for a Sample With 30 Elements
125  102  135  126  132  129  156  112  108  134
126  104  143  140  138  129  119  114  107  121
124  112  148  145  130  125  120  127  106  148
Solution
First, let's calculate the standard deviation for both samples:

S1 = √{[(124 − 127.4)² + (111 − 127.4)² + ⋯ + (105 − 127.4)²]/9} = 15.364

S2 = √{[(125 − 126.167)² + (102 − 126.167)² + ⋯ + (148 − 126.167)²]/29} = 14.227

To calculate the standard error, we must apply Expression (3.44):

S_X̄1 = S1/√n1 = 15.364/√10 = 4.858
S_X̄2 = S2/√n2 = 14.227/√30 = 2.598

Despite the small difference between the standard deviations, we can see that the standard error of the first sample is almost double that of the second. Therefore, the higher the number of measurements, the higher the accuracy.
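The comparison in Example 3.37 can be reproduced with a short helper (an illustrative sketch; the function name is ours):

```python
import math

def standard_error(sample):
    """Standard error of the mean for a sample, Expression (3.44)."""
    n = len(sample)
    m = sum(sample) / n
    s = math.sqrt(sum((x - m) ** 2 for x in sample) / (n - 1))  # S
    return s / math.sqrt(n)

sample_10 = [124, 111, 132, 142, 108, 127, 133, 144, 148, 105]
sample_30 = [125, 102, 135, 126, 132, 129, 156, 112, 108, 134,
             126, 104, 143, 140, 138, 129, 119, 114, 107, 121,
             124, 112, 148, 145, 130, 125, 120, 127, 106, 148]
se_10 = standard_error(sample_10)
se_30 = standard_error(sample_30)
```

Even though the two standard deviations are similar, se_10 is close to twice se_30, illustrating the gain in accuracy from the larger sample.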
3.4.2.6 Coefficient of Variation
The coefficient of variation (CV) is a relative measure of dispersion that expresses the variation of the data in relation to the mean. The smaller its value, the more homogeneous the data, that is, the smaller the dispersion around the mean. It is calculated as follows:

CV = (σ/μ) × 100 (%)    (for the population)    (3.45)

CV = (S/X̄) × 100 (%)    (for samples)    (3.46)

A CV can be considered low, indicating a reasonably homogeneous set of data, when it is less than 30%. If this value is greater than 30%, the set of data can be considered heterogeneous. However, this standard varies according to the application.

Example 3.38
Calculate the coefficient of variation for both samples of the previous example.
Solution
Applying Expression (3.46), we have:

CV1 = (S1/X̄1) × 100 = (15.364/127.4) × 100 = 12.06%

CV2 = (S2/X̄2) × 100 = (14.227/126.167) × 100 = 11.28%
These results confirm the homogeneity of the data of the variable being studied for both samples. We conclude, therefore, that the mean is a good measure to represent the data. Let’s now study the measures of skewness and kurtosis.
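Before turning to measures of shape, Expression (3.46) can be cross-checked in Python (a sketch; the function name is ours, shown here for the first sample of Example 3.37):

```python
def coefficient_of_variation(sample):
    """Sample coefficient of variation in percent, Expression (3.46)."""
    n = len(sample)
    m = sum(sample) / n
    s = (sum((x - m) ** 2 for x in sample) / (n - 1)) ** 0.5
    return s / m * 100

sample_10 = [124, 111, 132, 142, 108, 127, 133, 144, 148, 105]
cv = coefficient_of_variation(sample_10)
```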
3.4.3 Measures of Shape
Measures of asymmetry (skewness) and kurtosis characterize the shape of the distribution of the population elements sampled around the mean (Maroco, 2014).
3.4.3.1 Measures of Skewness
Measures of skewness describe the shape of a frequency distribution curve. For a symmetrical curve or frequency distribution, the mean, the mode, and the median are the same. For an asymmetrical curve, the mean moves farther away from the mode, and the median is located in an intermediate position. Fig. 3.16 shows a symmetrical distribution.
If the frequency distribution is more concentrated on the left side, that is, the tail on the right is longer than the tail on the left, we have a positively skewed (to the right) distribution, as shown in Fig. 3.17. In this case, the mean is greater than the median, which in turn is greater than the mode (Mo < Md < X̄).
Conversely, if the frequency distribution is more concentrated on the right side, that is, the tail on the left is longer than the tail on the right, we have a negatively skewed (to the left) distribution, as shown in Fig. 3.18. In this case, the mean is less than the median, which in turn is less than the mode (X̄ < Md < Mo).
FIG. 3.16 Symmetrical distribution.
FIG. 3.17 Skewness to the right or positive skewness.
FIG. 3.18 Skewness to the left or negative skewness.
3.4.3.1.1 Pearson's First Coefficient of Skewness
Pearson's first coefficient of skewness (Sk1) is a measure of skewness given by the difference between the mean and the mode, weighted by a measure of dispersion (the standard deviation):

Sk1 = (μ − Mo)/σ    (for the population)    (3.47)

Sk1 = (X̄ − Mo)/S    (for samples)    (3.48)
which has the following interpretation:
If Sk1 = 0, the distribution is symmetrical;
If Sk1 > 0, the distribution is positively skewed (to the right);
If Sk1 < 0, the distribution is negatively skewed (to the left).

Example 3.39
From one set of data, we obtained the following measures: X̄ = 34.7, Mo = 31.5, Md = 33.2, and S = 12.4. Determine the type of skewness and calculate Pearson's first coefficient of skewness.
Solution
Since Mo < Md < X̄, we have a positively skewed (to the right) distribution. Applying Expression (3.48), we can determine Pearson's first coefficient of skewness:

Sk1 = (X̄ − Mo)/S = (34.7 − 31.5)/12.4 = 0.258

The classification of the distribution as positively skewed is confirmed by the value Sk1 > 0.
3.4.3.1.2 Pearson's Second Coefficient of Skewness
To avoid using the mode to calculate the skewness, we can adopt the empirical relationship between the mean, the median, and the mode, (X̄ − Mo) = 3·(X̄ − Md), which leads to Pearson's second coefficient of skewness (Sk2):

Sk2 = 3·(μ − Md)/σ    (for the population)    (3.49)

Sk2 = 3·(X̄ − Md)/S    (for samples)    (3.50)

In the same way, we have:
If Sk2 = 0, the distribution is symmetrical;
If Sk2 > 0, the distribution is positively skewed (to the right);
If Sk2 < 0, the distribution is negatively skewed (to the left).

Pearson's first and second coefficients of skewness allow us to compare two or more distributions and to evaluate which one is more asymmetrical. The modulus of the coefficient indicates the intensity of the skewness; that is, the higher Pearson's coefficient of skewness, the more asymmetrical the curve. Thus:
If 0 < |Sk| < 0.15, the skewness is weak;
If 0.15 ≤ |Sk| ≤ 1, the skewness is moderate;
If |Sk| > 1, the skewness is strong.

Example 3.40
From the data in Example 3.39, calculate Pearson's second coefficient of skewness.
Solution
Applying Expression (3.50), we have:

Sk2 = 3·(X̄ − Md)/S = 3 × (34.7 − 33.2)/12.4 = 0.363

Analogously, since Sk2 > 0, we confirm that the distribution is positively skewed.
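Both Pearson coefficients are simple enough to compute by hand, but a small sketch makes the comparison explicit (function names are ours; inputs are the summary measures of Examples 3.39 and 3.40):

```python
def pearson_sk1(mean, mode, s):
    """Pearson's first coefficient of skewness, Expression (3.48)."""
    return (mean - mode) / s

def pearson_sk2(mean, median, s):
    """Pearson's second coefficient of skewness, Expression (3.50)."""
    return 3 * (mean - median) / s

sk1 = pearson_sk1(34.7, 31.5, 12.4)   # Example 3.39
sk2 = pearson_sk2(34.7, 33.2, 12.4)   # Example 3.40
```

Both coefficients come out positive, agreeing that the distribution is moderately skewed to the right.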
3.4.3.1.3 Bowley's Coefficient of Skewness
Another measure of skewness is Bowley's coefficient of skewness (SkB), also known as the quartile coefficient of skewness, calculated from the first and third quartiles, in addition to the median:

SkB = (Q3 + Q1 − 2·Md) / (Q3 − Q1)    (3.51)

In the same way, we have:
If SkB = 0, the distribution is symmetrical;
If SkB > 0, the distribution is positively skewed (to the right);
If SkB < 0, the distribution is negatively skewed (to the left).
Example 3.41
Calculate Bowley's coefficient of skewness for the following dataset, which has already been sorted in ascending order:

Value:      24   25   29   31   36   40   44   45   48   50   54   56
Position:  1st  2nd  3rd  4th  5th  6th  7th  8th  9th 10th 11th 12th
Solution
We have Q1 = 30, Md = 42, and Q3 = 49. Therefore, we can determine Bowley's coefficient of skewness:

SkB = (Q3 + Q1 − 2·Md) / (Q3 − Q1) = (49 + 30 − 2 × 42) / (49 − 30) = −0.263

Since SkB < 0, we conclude that the distribution is negatively skewed (to the left).
3.4.3.1.4 Fisher's Coefficient of Skewness
The last measure of skewness we will study is known as Fisher's coefficient of skewness (g1), calculated from the third moment around the mean (M3), as presented in Maroco (2014):

g1 = n²·M3 / [(n − 1)·(n − 2)·S³]    (3.52)

where:

M3 = Σ_{i=1}^{n} (Xi − X̄)³ / n    (3.53)

which is interpreted in the same way as the other coefficients of skewness, that is:
If g1 = 0, the distribution is symmetrical;
If g1 > 0, the distribution is positively skewed (to the right);
If g1 < 0, the distribution is negatively skewed (to the left).

Fisher's coefficient of skewness can be calculated in Excel using the SKEW function (see Example 3.42) or through the Analysis ToolPak add-in (Section 3.5). Its calculation through SPSS software will be presented in Section 3.6.

3.4.3.1.5 Coefficient of Skewness on Stata
The coefficient of skewness on Stata is calculated from the second and third moments around the mean, as presented by Cox (2010):
M3 3=2
(3.54)
M2
where: n X
2 Xi X
M2 ¼
i¼1
n
(3.55)
which is interpreted in the same way as the other coefficients of skewness, that is:
If Sk = 0, the distribution is symmetrical;
If Sk > 0, the distribution is positively skewed (to the right);
If Sk < 0, the distribution is negatively skewed (to the left).
3.4.3.2 Measures of Kurtosis
In addition to measures of skewness, measures of kurtosis can be used to characterize the shape of the distribution of the variable being studied. Kurtosis can be defined as the flatness level of a frequency distribution (the height of the peak of the curve) in relation to a theoretical distribution, which usually corresponds to the normal distribution. When the shape of the distribution is neither very flat nor very long, similar to a normal curve, it is called mesokurtic, as we can see in Fig. 3.19. When the distribution shows a frequency curve that is flatter than a normal curve, it is called platykurtic, as shown in Fig. 3.20. And when the distribution presents a frequency curve that is longer than a normal curve, it is called leptokurtic, as shown in Fig. 3.21.
3.4.3.2.1 Coefficient of Kurtosis
One of the most common coefficients to measure the flatness level or kurtosis of a distribution is the percentile coefficient of kurtosis, or simply coefficient of kurtosis (k). It is calculated from the interquartile range, in addition to the 10th and 90th percentiles:

k = (Q3 − Q1) / [2·(P90 − P10)]    (3.56)

which has the following interpretation:
If k = 0.263, we say that the curve is mesokurtic;
If k > 0.263, we say that the curve is platykurtic;
If k < 0.263, we say that the curve is leptokurtic.
FIG. 3.19 Mesokurtic curve.
FIG. 3.20 Platykurtic curve.
FIG. 3.21 Leptokurtic curve.
3.4.3.2.2 Fisher's Coefficient of Kurtosis
Another very common measure of the flatness level or kurtosis of a distribution is Fisher's coefficient of kurtosis (g2). It is calculated using the fourth moment around the mean (M4), as presented in Maroco (2014):

g2 = n²·(n + 1)·M4 / [(n − 1)·(n − 2)·(n − 3)·S⁴] − 3·(n − 1)² / [(n − 2)·(n − 3)]    (3.57)

where:

M4 = Σ_{i=1}^{n} (Xi − X̄)⁴ / n    (3.58)
which has the following interpretation:
If g2 = 0, the curve has a normal distribution (mesokurtic);
If g2 < 0, the curve is very flat (platykurtic);
If g2 > 0, the curve is very long (leptokurtic).

Many statistical software packages, among them SPSS, use Fisher's coefficient of kurtosis to calculate the flatness level or kurtosis (Section 3.6). In Excel, the KURT function calculates Fisher's coefficient of kurtosis (Example 3.42), and it can also be calculated through the Analysis ToolPak add-in (Section 3.5).
3.4.3.2.3 Coefficient of Kurtosis on Stata
The coefficient of kurtosis on Stata is calculated from the second and fourth moments around the mean, as presented by Bock (1975) and Cox (2010):

kS = M4 / M2²    (3.59)

which has the following interpretation:
If kS = 3, the curve has a normal distribution (mesokurtic);
If kS < 3, the curve is very flat (platykurtic);
If kS > 3, the curve is very long (leptokurtic).
Example 3.42
Table 3.E.41 shows the prices of stock Y throughout a month, resulting in a sample with 20 periods (i.e., business days). Calculate:
a) Fisher's coefficient of skewness (g1);
b) The coefficient of skewness used on Stata;
c) Fisher's coefficient of kurtosis (g2);
d) The coefficient of kurtosis used on Stata.

TABLE 3.E.41 Prices of Stock Y Throughout the Month
18.7  18.3  18.4  18.7  18.8  18.8  19.1  18.9  19.1  19.9
18.5  18.5  18.1  17.9  18.2  18.3  18.1  18.8  17.5  16.9
Solution
The mean and the standard deviation of the data in Table 3.E.41 are X̄ = 18.475 and S = 0.6324, respectively. We have:

a) Fisher's coefficient of skewness g1:
It is calculated using the third moment around the mean (M3):

M3 = Σ_{i=1}^{n} (Xi − X̄)³ / n = [(18.7 − 18.475)³ + ⋯ + (16.9 − 18.475)³] / 20 = −0.0788

Therefore, we have:

g1 = n²·M3 / [(n − 1)·(n − 2)·S³] = (20)² × (−0.0788) / (19 × 18 × 0.6324³) = −0.3647

Since g1 < 0, we can conclude that the frequency curve is more concentrated on the right side and has a longer tail to the left, that is, the distribution is negatively skewed (to the left).
Excel calculates Fisher's coefficient of skewness (g1) through the SKEW function. The file Stock_Market.xls shows the data from Table 3.E.41 in cells A1:A20. Thus, to calculate it, we just need to insert the expression =SKEW(A1:A20).

b) The coefficient of skewness used on Stata:
It is calculated from the second and third moments around the mean:

M2 = Σ_{i=1}^{n} (Xi − X̄)² / n = [(18.7 − 18.475)² + ⋯ + (16.9 − 18.475)²] / 20 = 0.3799

M3 = −0.0788

It is calculated as follows:

Sk = M3 / M2^(3/2) = −0.3367

which is interpreted in the same way as Fisher's coefficient of skewness.

c) Fisher's coefficient of kurtosis g2:
It is calculated using the fourth moment around the mean (M4):

M4 = Σ_{i=1}^{n} (Xi − X̄)⁴ / n = [(18.7 − 18.475)⁴ + ⋯ + (16.9 − 18.475)⁴] / 20 = 0.5857

Therefore, we calculate g2 as follows:

g2 = n²·(n + 1)·M4 / [(n − 1)·(n − 2)·(n − 3)·S⁴] − 3·(n − 1)² / [(n − 2)·(n − 3)]
g2 = (20)² × 21 × 0.5857 / (19 × 18 × 17 × 0.6324⁴) − 3 × (19)² / (18 × 17) = 1.7529

Thus, we can conclude that the curve is long or leptokurtic. The KURT function in Excel calculates Fisher's coefficient of kurtosis (g2). To calculate it from the file Stock_Market.xls, we must insert the expression =KURT(A1:A20).

d) Coefficient of kurtosis on Stata:
It is calculated from the second and fourth moments around the mean, M2 = 0.3799 and M4 = 0.5857, as already calculated. Thus:

kS = M4 / M2² = 0.5857 / (0.3799)² = 4.0586

Since kS > 3, the curve is long or leptokurtic.

In the next three sections, we will discuss how to construct tables, charts, graphs, and summary measures in Excel and in the statistical software packages SPSS and Stata, using the data in Example 3.42.
3.5 A PRACTICAL EXAMPLE IN EXCEL
Section 3.3.1 showed the graphical representation of qualitative variables through bar charts (horizontal and vertical), pie charts, and the Pareto chart. We demonstrated how each one of these charts can be obtained using Excel. Conversely, Section 3.3.2 showed the graphical representation of quantitative variables through line graphs, scatter plots, histograms, among others. Analogously, we presented how most of them can be obtained using Excel. Section 3.4 presented the main summary measures, including measures of central tendency (mean, mode, and median), quantiles (quartiles, deciles, and percentiles), and measures of dispersion or variability (range, average deviation, variance, standard deviation, standard error, and coefficient of variation), in addition to measures of shape, such as skewness and kurtosis. Then, we presented how they can be calculated using the Excel functions, except the ones that are not available. This section discusses how to obtain descriptive statistics (such as, the mean, standard error, median, mode, standard deviation, variance, kurtosis, skewness, among others) through the Analysis ToolPak add-in in Excel. In order to do that, let’s consider the problem presented in Example 3.42, whose data are available in Excel in the file Stock_Market.xls, in cells A1:A20, as shown in Fig. 3.22. To load the Analysis ToolPak add-in in Excel, we must first click on the File tab and on Options, as shown in Fig. 3.23. Now, the Excel Options dialog box will open, as shown in Fig. 3.24. From this box, we select the option Add-ins. In Add-ins, we must select the option Analysis ToolPak and click on Go. Then, the Add-ins dialog box will appear, as shown in Fig. 3.25. Among the add-ins available, we must select the option Analysis ToolPak and click on OK.
FIG. 3.22 Dataset in Excel—Price of Stock Y.
Univariate Descriptive Statistics Chapter 3
FIG. 3.23 File tab, focusing more on Options.
Thus, the option Data Analysis will become available on the Data tab, inside the Analysis group, as shown in Fig. 3.26. Fig. 3.27 shows the Data Analysis dialog box. Note that several analysis tools are available. Let’s select the option Descriptive Statistics and click on OK. From the Descriptive Statistics dialog box (Fig. 3.28), we must select the Input Range (A1:A20) and, as Output options, let’s select Summary statistics. The results can be presented in a new worksheet or in a new workbook. Finally, let’s click on OK. The descriptive statistics generated can be seen in Fig. 3.29 and include measures of central tendency (mean, mode, and median), measures of dispersion or variability (variance, standard deviation, and standard error), and measures of shape (skewness and kurtosis). The range can be calculated from the difference between the sample’s maximum and minimum values. As mentioned in Sections 3.4.3.1 and 3.4.3.2, the measure of skewness calculated by Excel (using the SKEW function or the dialog box in Fig. 3.28) corresponds to Fisher’s coefficient of skewness (g1); and the measure of kurtosis calculated (using the KURT function or the dialog box in Fig. 3.28) corresponds to Fisher’s coefficient of kurtosis (g2).
3.6 A PRACTICAL EXAMPLE ON SPSS
Using a practical example, this section presents how to obtain the main univariate descriptive statistics studied in this chapter by using IBM SPSS Statistics Software. These include frequency distribution tables, charts (histograms, stem-and-leaf plots, boxplots, bar charts, and pie charts), measures of central tendency (mean, mode, and median), quantiles
FIG. 3.24 Excel Options dialog box.
FIG. 3.25 Add-ins dialog box.
FIG. 3.26 Availability of the Data Analysis command, from the Data tab.
FIG. 3.27 Data Analysis dialog box.
FIG. 3.28 Descriptive Statistics dialog box.
FIG. 3.29 Descriptive statistics in Excel.
FIG. 3.30 Dataset on SPSS—Price of Stock Y.
(quartiles and percentiles), measures of dispersion or variability (range, variance, standard deviation, standard error, among others), and measures of shape (skewness and kurtosis). The use of the images in this section has been authorized by the International Business Machines Corporation©. The data presented in Example 3.42 are the input basis on SPSS and are available in the file Stock_Market.sav, as shown in Fig. 3.30. To obtain such descriptive statistics, we must click on Analyze → Descriptive Statistics. After that, three options can be used: Frequencies, Descriptives, and Explore.
3.6.1 Frequencies Option
This option can be used for qualitative and quantitative variables, and it provides frequency distribution tables, as well as measures of central tendency (mean, median, and mode), quantiles (quartiles and percentiles), measures of dispersion or variability (range, variance, standard deviation, standard error, among others), and measures of skewness and kurtosis. The Frequencies option also plots bar charts, pie charts, or histograms (with or without a normal curve). Therefore, on the toolbar, click on Analyze → Descriptive Statistics and select Frequencies..., as shown in Fig. 3.31.
FIG. 3.31 Descriptive statistics on SPSS—Frequencies Option.
FIG. 3.32 Frequencies dialog box: selecting the variable and showing the frequency table.
Therefore, the Frequencies dialog box will open. The variable being studied (Stock price, called Price) must be selected in Variable(s) and the Display frequency tables option must be activated so that the frequency distribution table can be shown (Fig. 3.32). The following step consists of clicking on Statistics... to select the summary measures that interest us (Fig. 3.33). Among the quantiles, let’s select the option Quartiles (which calculates the first and third quartiles, in addition to the median). To get the percentile of order i (i = 1, 2, ..., 99), we must select the option Percentile(s) and add the desired order. In this case, we chose to calculate the percentiles of order 10 and 60. The measures of central tendency that we have to select are the mean, median, and mode. As measures of dispersion, let’s select Std. deviation (standard deviation), Variance,
FIG. 3.33 Frequencies: Statistics dialog box.
Range, and S.E. mean (standard error). Finally, let’s select both measures of shape of a distribution: Skewness and Kurtosis. To go back to the Frequencies dialog box, we must click on Continue. Next, let’s click on Charts... and select the chart that interests us. As options, we have Bar charts, Pie charts, or Histograms. Let’s select the last chart with the option of plotting a normal curve (Fig. 3.34). Bar or pie charts can be shown in terms of absolute frequencies (Frequencies) or relative frequencies (Percentages). In order to go back to the Frequencies dialog box once again, we must click on Continue. Finally, click on OK. Fig. 3.35 shows the calculations of the summary measures selected in Fig. 3.33. As studied in Sections 3.4.3.1 and 3.4.3.2, the measure of skewness calculated by SPSS corresponds to Fisher’s coefficient of skewness (g1), and the measure of kurtosis corresponds to Fisher’s coefficient of kurtosis (g2). Also in Fig. 3.35, note that the percentiles of order 25, 50, and 75, which correspond to the first quartile, median, and third quartile, respectively, were calculated automatically. The method used to calculate the percentiles was the Weighted Average. The frequency distribution table can be seen in Fig. 3.36. The first column represents the absolute frequency of each element (Fi), the second and third columns represent the relative frequency of each element (Fri—%), and the last column represents the relative cumulative frequency (Frac—%). Also in Fig. 3.36, we can see that each value occurs only once. Since we have a continuous quantitative variable with 20 observations and no repetitions, constructing bar or pie charts would not give the researcher any additional information, that is, it would not allow a good visualization of how the stock prices behave in terms of bins. Hence, we chose to construct a histogram with previously defined bins.
The histogram generated using SPSS with the option of plotting a normal curve can be seen in Fig. 3.37.
3.6.2 Descriptives Option
Unlike Frequencies..., which also offers the frequency distribution table option, besides bar charts, pie charts, or histograms (with or without a normal curve), Descriptives... only makes summary measures available (therefore, it is recommended for quantitative variables). Nevertheless, measures of central tendency, such as, the median and mode
FIG. 3.34 Frequencies: Charts dialog box.
FIG. 3.35 Summary measures obtained from Frequencies: Statistics.
FIG. 3.36 Frequency distribution.
FIG. 3.37 Histogram with a normal curve obtained from Frequencies: Charts.
[Histogram of Price with normal curve: Mean = 18.47, Std. Dev. = 0.632, N = 20]
are not made available; nor are quantiles, such as, quartiles and percentiles. To use it, let’s click on Analyze → Descriptive Statistics and select Descriptives..., as shown in Fig. 3.38. Therefore, the Descriptives dialog box will open. The variable being studied must be selected in Variable(s), as shown in Fig. 3.39. Let’s click on Options... and select the summary measures that interest us (Fig. 3.40). Note that the same summary measures as in Frequencies... were selected, except for the median, the mode, and the quartiles and percentiles, which are not available, as already mentioned. Let’s click on Continue to go back to the Descriptives dialog box. Finally, click on OK. The results are available in Fig. 3.41.
FIG. 3.38 Descriptive statistics on SPSS—Descriptives Option.
FIG. 3.39 Descriptives dialog box: selecting the variable.
3.6.3 Explore Option
Like Descriptives..., Explore... does not provide the frequency distribution table either. Regarding the types of charts, unlike Frequencies..., which offers bar charts, pie charts, and histograms, Explore... provides stem-and-leaf plots and boxplots, in addition to histograms. However, it does not have the option of plotting a normal curve. Regarding summary measures, Explore... provides measures of central tendency, such as, the mean and median (there is no option for the mode); quantiles, such as, percentiles (of order 5, 10, 25, 50, 75, 90, and 95); measures of dispersion, such as, the range, variance, standard deviation, among others (it does not calculate the standard error); besides measures of skewness and kurtosis.
FIG. 3.40 Descriptives: Options dialog box.
FIG. 3.41 Summary measures obtained from Descriptives: Options.
Therefore, this command is the best one to generate descriptive statistics for quantitative variables. Hence, from Analyze → Descriptive Statistics, select Explore..., as shown in Fig. 3.42. Therefore, the Explore dialog box will open. The variable being studied must be selected from the list of dependent variables (Dependent List), as shown in Fig. 3.43. Next, we must click on Statistics... to open the Explore: Statistics box, and select the options Descriptives, Outliers, and Percentiles, as shown in Fig. 3.44. Let’s click on Continue to go back to the Explore box. Next, we must click on Plots... to open the Explore: Plots box and select the charts that interest us, as shown in Fig. 3.45. In this case, we have to select Boxplots: Factor levels together (the resulting boxplots will be shown together in the same chart), Stem-and-leaf, and the histogram (note that there is no option for plotting the normal curve). Once again, we must click on Continue to go back to the Explore dialog box. Finally, click on OK. The results obtained are illustrated next. Fig. 3.46 shows the results obtained from Explore: Statistics, with the Descriptives option. Fig. 3.47 shows the results obtained from Explore: Statistics, with the Percentiles option. The percentiles of order 5, 10, 25 (Q1), 50 (median), 75 (Q3), 90, and 95 were calculated using two methods: the Weighted Average and Tukey’s Hinges. The latter corresponds to the method proposed in this chapter (Section 3.4.1.2, Case 1). Thus, applying the expressions in
FIG. 3.42 Descriptive statistics on SPSS—Explore Option.
FIG. 3.43 Explore dialog box: selecting the variable.
Section 3.4.1.2 to this example, we get the same results seen in Fig. 3.47, as regards Tukey’s Hinges method for calculating P25, P50, and P75. Coincidentally, in this example, the value of P75 was the same for both methods, but they are usually different. Fig. 3.48 shows the results obtained from Explore: Statistics, with the Outliers option. The extreme values of the distribution are presented here (the five highest and the five lowest), with their respective positions found in the dataset. Now, the charts constructed from the options selected in Explore: Plots (histograms, stem-and-leaf plots, and boxplots) are presented in Figs. 3.49, 3.50, and 3.51, respectively.
FIG. 3.44 Explore: Statistics dialog box.
FIG. 3.45 Explore: Plots dialog box.
FIG. 3.46 Results obtained from the Descriptives option.
FIG. 3.47 Results obtained from the Percentiles option.
FIG. 3.48 Results obtained from the Outliers option.
FIG. 3.49 Histogram constructed from the Explore: Plots dialog box.
[Histogram of Price: Mean = 18.48, Std. Dev. = 0.632, N = 20]
FIG. 3.50 Stem-and-leaf chart generated from the Explore: Plots dialog box.
Price Stem-and-Leaf Plot

 Frequency    Stem &  Leaf
  1.00 Extremes    (=<16.9)
  2.00       17 .  59
  6.00       18 .  112334
  8.00       18 .  55778889
  2.00       19 .  11
  1.00 Extremes    (>=19.9)

 Stem width: 1.0
 Each leaf:  1 case(s)

FIG. 3.51 Boxplot generated from the Explore: Plots dialog box.
[Boxplot of Price (scale 16.0 to 20.0), with mild outliers marked at observations 10 (19.9) and 20 (16.9).]
Obviously, the histogram generated in Fig. 3.49 is the same as the one obtained through Frequencies... (Fig. 3.37); however, without the normal curve, since Explore... does not provide this function. Fig. 3.50 shows that the first two digits of the number (the integer part, before the decimal point) form the stem and the decimals correspond to the leaves. Moreover, stem 18 is represented in two lines because it contains several observations. In Section 3.4.1.3, we learned how to identify an extreme outlier through the expressions X* < Q1 − 3·(Q3 − Q1) and X* > Q3 + 3·(Q3 − Q1). If we consider that Q1 = 18.15 and Q3 = 18.8, we have X* < 16.2 or X* > 20.75. Since there are no observations outside these limits, we conclude that there are no extreme outliers. Repeating the same procedure for mild outliers, that is, applying the expressions X° < Q1 − 1.5·(Q3 − Q1) and X° > Q3 + 1.5·(Q3 − Q1), we can see that there is one observation with a value of less than 17.175 (the 20th observation), and another one with a value greater than 19.775 (the 10th observation). These values are therefore considered mild outliers. The boxplot in Fig. 3.51 shows that observations 10 and 20, with values 19.9 and 16.9, respectively, are mild outliers (represented by circles). Depending on their survey goals, this allows researchers to decide whether to keep them, exclude them (the analysis may be harmed because of the reduction in the sample size), or replace their values with the variable’s mean. Continuing in Fig. 3.51, the values of Q1, Q2 (Md), and Q3 correspond to 18.15, 18.5, and 18.8, respectively, which are those obtained from Tukey’s Hinges method (Fig. 3.47), considering all of the initial 20 observations. Therefore, the boxplot’s measures of position (Q1, Md, and Q3), except for the minimum and maximum values, are calculated without excluding the outliers.
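The fence calculation described above can be sketched in Python (a hypothetical helper of our own, not an SPSS feature), using the quartiles reported in the text:

```python
def tukey_fences(q1, q3):
    """Return (inner_low, inner_high, outer_low, outer_high) Tukey fences."""
    iqr = q3 - q1
    return (q1 - 1.5 * iqr, q3 + 1.5 * iqr,   # mild outlier limits
            q1 - 3.0 * iqr, q3 + 3.0 * iqr)   # extreme outlier limits

inner_low, inner_high, outer_low, outer_high = tukey_fences(18.15, 18.8)

def classify(x):
    """Classify an observation as 'extreme', 'mild', or 'regular'."""
    if x < outer_low or x > outer_high:
        return "extreme"
    if x < inner_low or x > inner_high:
        return "mild"
    return "regular"
```

With Q1 = 18.15 and Q3 = 18.8, the fences reproduce the limits above (17.175, 19.775, 16.2, and 20.75), and the observations 16.9 and 19.9 are flagged as mild outliers.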
3.7 A PRACTICAL EXAMPLE ON STATA
The same descriptive statistics obtained in the previous section through SPSS software will be calculated in this section through Stata Statistical Software. The results will be compared to those obtained in an algebraic way and also by using SPSS. The use of the images in this section has been authorized by StataCorp LP©. The data presented in Example 3.42 are the input basis on Stata, and are available in the file Stock_Market.dta.
3.7.1 Univariate Frequency Distribution Tables on Stata
Through command tabulate, or simply tab, as we will use throughout this book, we can obtain frequency distribution tables for a certain variable. The syntax of the command is: tab variable*
where the term variable* should be substituted for the name of the variable considered in the analysis. Fig. 3.52 shows the obtained output using the command tab price. Just as the frequency distribution table obtained through SPSS (Fig. 3.36), Fig. 3.52 provides the absolute, relative, and relative cumulative frequencies for each category of the variable price.
FIG. 3.52 Frequency distribution on Stata using the command tab.
Consider a case with more than one variable being studied in which the objective is to construct univariate frequency distribution tables (one-way tables), that is, one table for each variable being analyzed. In this case, we must use the command tab1, with the following syntax: tab1 variables*
where the term variables* should be substituted for the list of variables being considered in the analysis.
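For readers without Stata, a one-way table like the output of tab can be sketched with the Python standard library (the function name and the sample values are illustrative, not from the book):

```python
from collections import Counter

def one_way_table(values):
    """Absolute, relative (%), and cumulative (%) frequencies per category."""
    counts = Counter(values)
    n = len(values)
    table, cum = [], 0.0
    for category in sorted(counts):
        freq = counts[category]
        pct = 100.0 * freq / n
        cum += pct
        table.append((category, freq, pct, cum))
    return table

rows = one_way_table([7, 5, 9, 7, 5, 5])
```

Each row holds the category, its absolute frequency, its relative frequency, and the running cumulative percentage, mirroring the columns of Fig. 3.52.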
3.7.2 Summary of Univariate Descriptive Statistics on Stata
Through command summarize, or simply sum, as we will use throughout this book, we can obtain summary measures, such as, the mean, standard deviation, and minimum and maximum values. The syntax of this command is: sum variables*
where the term variables* should be substituted for the list of variables to be considered in the analysis. If no variable is specified, the statistics will be calculated for all of the variables in the dataset. Through the option detail, we can obtain additional statistics, such as, the coefficient of skewness, the coefficient of kurtosis, the four lowest and highest values, as well as several percentiles. The syntax of this command is: sum variables*, detail
Therefore, for the data in our example, available in the file Stock_Market.dta, first, we must type the following command: sum price
obtaining the statistics in Fig. 3.53. To obtain additional descriptive statistics, we must type the following command: sum price, detail
Fig. 3.54 shows the generated outputs. As shown in Fig. 3.54, the option detail provides the calculation of the percentiles of order 1, 5, 10, 25, 50, 75, 90, 95 and 99. These results are obtained by Tukey’s Hinges method. We have seen, through Fig. 3.47 on the SPSS software, the results of the percentiles of order 25, 50, and 75 obtained by the same method. Fig. 3.54 also provides the four lowest and highest values of the sample analyzed, as well as the coefficients of skewness and kurtosis. Note that these values coincide with the ones calculated in Sections 3.4.3.1.5 and 3.4.3.2.3, respectively.
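A minimal sketch of the basic sum output (observations, mean, standard deviation, minimum, maximum) in Python, assuming the standard library statistics module; the function name and sample values are our own, not the stock data:

```python
import statistics

def stata_style_summary(values):
    """Return (n, mean, sample standard deviation, min, max) for a variable."""
    return (len(values),
            statistics.fmean(values),
            statistics.stdev(values),   # sample (n - 1) standard deviation
            min(values),
            max(values))

n, mean, sd, lo, hi = stata_style_summary([18.7, 18.1, 19.9, 16.9])
```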
FIG. 3.53 Summary measures using the command sum on Stata.
FIG. 3.54 Additional statistics using the option detail.
FIG. 3.55 Results obtained from the command centile on Stata.
3.7.3 Calculating Percentiles on Stata
The previous section discussed how to calculate the 1st, 5th, 10th, 25th, 50th, 75th, 90th, 95th, and 99th percentiles through Tukey’s Hinges method. On the other hand, by using the command centile, we can specify the percentiles to be calculated. The method used in this case is the Weighted Average. The syntax of this command is: centile variables*, centile (numbers*)
where the term variables* should be substituted for the list of variables to be considered in the analysis, and the term numbers* for the list of numbers that represent the order of the percentiles to be reported. Therefore, let’s suppose that we want to calculate the percentiles of order 5, 10, 25, 60, 64, 90, and 95 for the variable price, through the Weighted Average. In order to do that, we must use the following command: centile price, centile (5 10 25 60 64 90 95)
The results can be seen in Fig. 3.55. We have seen, through Fig. 3.35, the results of the SPSS software for the percentiles of order 10, 25, 50, 60, and 75 using the same method. Fig. 3.47 on SPSS also provided the calculation of the percentiles of order 5, 10, 25, 50, 75, 90, and 95 through the Weighted Average. The only percentile that had not been specified previously was the one of order 64; the others coincide with the results in Figs. 3.35 and 3.47.
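The weighted-average percentile can be sketched as follows; we assume the common (n + 1)-position definition with linear interpolation between order statistics, which is how this method is usually described:

```python
def weighted_average_percentile(values, p):
    """Percentile of order p (0 < p < 100) via the (n + 1)-position
    weighted-average definition, interpolating between order statistics."""
    xs = sorted(values)
    n = len(xs)
    pos = (n + 1) * p / 100.0   # fractional position in the sorted data
    k = int(pos)                # integer part
    frac = pos - k              # fractional part
    if k < 1:
        return xs[0]            # position falls before the first value
    if k >= n:
        return xs[-1]           # position falls after the last value
    return xs[k - 1] + frac * (xs[k] - xs[k - 1])

p25 = weighted_average_percentile(range(1, 21), 25)
```

For the values 1 through 20, this definition places P25 at position 5.25, that is, one quarter of the way between the 5th and 6th ordered observations.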
3.7.4 Charts on Stata: Histograms, Stem-and-Leaf, and Boxplots
Stata makes a series of charts available, including bar charts, pie charts, scatter plots, histograms, stem-and-leaf, and boxplots, among others. Next, we will discuss how to obtain histograms, stem-and-leaf plots, and boxplots on Stata, for the data available in the file Stock_Market.dta.
3.7.4.1 Histogram
Histograms on Stata can be obtained for continuous and discrete variables. In the case of continuous variables, to obtain a histogram of absolute frequencies, with the option of plotting a normal curve, we must type the following syntax: histogram variable*, normal frequency
or simply: hist variable*, norm freq
as we will use throughout this book. As mentioned before, the term variable* must be substituted for the name of the variable being studied. For discrete variables, we must include the term discrete: hist variable*, discrete norm freq
FIG. 3.56 Frequency histogram on Stata.
Going back to the data in Example 3.42, to obtain a frequency histogram, with the option of plotting a normal curve, we must type the following command: hist price, norm freq
The obtained output is shown in Fig. 3.56.
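The frequency counting behind such a histogram can be sketched in Python (our own helper with half-open bins, not what Stata runs internally):

```python
def histogram_counts(values, edges):
    """Absolute frequency per bin [edges[i], edges[i+1]); the last bin is closed."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(edges) - 1):
            last = (i == len(edges) - 2)
            if edges[i] <= v < edges[i + 1] or (last and v == edges[-1]):
                counts[i] += 1
                break
    return counts

# Illustrative values binned into [17, 18), [18, 19), [19, 20]
counts = histogram_counts([17.5, 18.1, 18.2, 18.9, 19.1, 20.0], [17, 18, 19, 20])
```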
3.7.4.2 Stem-and-Leaf
The stem-and-leaf plot on Stata can be obtained using the command stem, followed by the name of the variable being studied. For the data in the file Stock_Market.dta, we just need to type the following command: stem price
The obtained output is shown in Fig. 3.57.
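A simplified stem-and-leaf construction can be sketched in Python (integer part as stem, first decimal digit as leaf; unlike Stata's stem command, there is no stem splitting or outlier handling):

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group one-decimal values into integer stems and single-digit leaves."""
    stems = defaultdict(list)
    for v in sorted(values):
        stem = int(v)                        # integer part -> stem
        leaf = int(round((v - stem) * 10))   # first decimal digit -> leaf
        stems[stem].append(leaf)
    return {s: "".join(str(l) for l in leaves) for s, leaves in stems.items()}

plot = stem_and_leaf([17.5, 17.9, 18.1, 18.2, 19.1])
```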
3.7.4.3 Boxplot
To obtain the boxplot on the Stata software, we must use the following syntax: graph box variables*
FIG. 3.57 Stem-and-Leaf plot on Stata.
FIG. 3.58 Boxplot on Stata.
where the term variables* should be substituted for the list of variables to be considered in the analysis, and, for each variable, one chart is constructed. For the data in Example 3.42, the command is: graph box price
The chart is shown in Fig. 3.58, which corresponds to the same chart as in Fig. 3.51, generated using SPSS.
3.8 FINAL REMARKS
In this chapter, we studied descriptive statistics for a single variable (univariate descriptive statistics), in order to acquire a better understanding of the behavior of each variable through tables, charts, graphs and summary measures, identifying trends, variability, and outliers. Before we start using descriptive statistics, it is necessary to identify the type of variable we will study. The type of variable is essential for calculating descriptive statistics and in the graphical representation of the results. The descriptive statistics used to represent the behavior of a qualitative variable’s data are frequency distribution tables and charts. The frequency distribution table for a qualitative variable represents the frequency in which each variable category occurs. The graphical representation of qualitative variables can be illustrated by bar charts (horizontal and vertical), pie charts, and a Pareto chart. For quantitative variables, the most common descriptive statistics are charts and summary measures (measures of position or location, dispersion or variability, and measures of shape). Frequency distribution tables can also be used to represent the frequency in which each possible value of a discrete variable occurs, or to represent the frequency of continuous variables’ data grouped into classes. Line graphs, dot or scatter plots, histograms, stem-and-leaf plots, and boxplots (or box-and-whisker diagrams) are normally used to graphically represent quantitative variables.
3.9 EXERCISES
1) What statistics can be used (and in which situations) to represent the behavior of a single quantitative or qualitative variable?
2) What are the limitations of only using measures of central tendency in the study of a certain variable?
3) How can we verify the existence of outliers in a certain variable?
4) Describe each one of the measures of dispersion or variability.
5) What is the difference between Pearson’s first and second coefficients used as measures of skewness in a distribution?
6) What is the best chart to check the position, skewness, and discrepancy among the data?
7) In the case of bar charts and scatter plots, what kind of data should be used?
8) What are the most suitable charts to represent qualitative data?
9) Table 3.1 shows the number of vehicles sold by a dealership in the last 30 days. Construct a frequency distribution table for these data.

TABLE 3.1 Number of Vehicles Sold
7   5   9   11  10  8
9   6   8   10  8   5
7   11  9   11  6   7
10  9   8   5   6   8
6   7   6   5   10  8
10) A survey on patients’ health was carried out and information regarding the weight of 50 patients was collected (Table 3.2). Build the frequency distribution table for this problem.

TABLE 3.2 Patients’ Weight
60.4  78.9  65.7  82.1  80.9  92.3  85.7  86.6  90.3  93.2
75.2  77.3  80.4  62.0  90.4  70.4  80.5  75.9  55.0  84.3
81.3  78.3  70.5  85.6  71.9  77.5  76.1  67.7  80.6  78.0
71.6  74.8  92.1  87.7  83.8  93.4  69.3  97.8  81.7  72.2
69.3  80.2  90.0  76.9  54.7  78.4  55.2  75.5  99.3  66.7
11) At an electrical appliances factory, in the door component production phase, the quality inspector verifies the total number of parts rejected per type of defect (lack of alignment, scratches, deformation, discoloration, and oxygenation), as shown in Table 3.3.
TABLE 3.3 Total Number of Parts Rejected per Type of Defect
Type of Defect      Total
Lack of Alignment   98
Scratches           67
Deformation         45
Discoloration       28
Oxygenation         12
Total               250
We would like you to:
a) Elaborate a frequency distribution table for this problem.
b) Construct a pie chart, in addition to a Pareto chart.
12) To preserve açaí, it is necessary to carry out several procedures, such as, whitening, pasteurization, freezing, and dehydration. The files Dehydration.xls, Dehydration.sav, and Dehydration.dta show the processing times (in seconds) in the dehydration phase throughout 100 periods. We would like you to:
a) Calculate the measures of position regarding the arithmetic mean, the median, and the mode.
b) Calculate the first and third quartiles and see if there are any outliers.
c) Calculate the 10th and 90th percentiles.
d) Calculate the 3rd and 6th deciles.
e) Calculate the measures of dispersion (range, average deviation, variance, standard deviation, standard error, and coefficient of variation).
f) Check if the distribution is symmetrical, positively skewed, or negatively skewed.
g) Calculate the coefficient of kurtosis and determine the flatness level of the distribution (mesokurtic, platykurtic, or leptokurtic).
h) Construct a histogram, a stem-and-leaf plot, and a boxplot for the variable being studied.
13) In a certain bank branch, we collected the average service time (in minutes) from a sample with 50 customers regarding three types of services. The data can be found in files Services.xls, Services.sav, and Services.dta. Compare the results of the services based on the following measures:
a) Measures of position (mean, median, and mode).
b) Measures of dispersion (variance, standard deviation, and standard error).
c) First and third quartiles; check if there are any outliers.
d) Fisher’s coefficient of skewness (g1) and Fisher’s coefficient of kurtosis (g2). Classify the symmetry and the flatness level of each distribution.
e) For each one of the variables, construct a bar chart, a boxplot, and a histogram.
14) A passenger collected the average travel times (in minutes) of a bus in the district of Vila Mariana, on the Jabaquara route, for 120 days (Table 3.4). We would like you to:
a) Calculate the arithmetic mean, the median, and the mode.
TABLE 3.4 Average Travel Times in 120 Days
Time   Number of Days
30     4
32     7
33     10
35     12
38     18
40     22
42     20
43     15
45     8
50     4
b) Calculate Q1, Q3, D4, P61, and P84.
c) Are there any outliers?
d) Calculate the range, the variance, the standard deviation, and the standard error.
e) Calculate Fisher’s coefficient of skewness (g1) and Fisher’s coefficient of kurtosis (g2). Classify the symmetry and the flatness level of each distribution.
f) Construct a bar chart, a histogram, a stem-and-leaf plot, and a boxplot.
15) In order to improve the quality of its services, a retail company collected the average service time, in seconds, of 250 employees. The data were grouped into classes, with their respective absolute and relative frequencies, as shown in Table 3.5. We would like you to:
a) Calculate the arithmetic mean, the median, and the mode.
b) Calculate Q1, Q3, D2, P13, and P95.
c) Are there any outliers?
d) Calculate the range, the variance, the standard deviation, and the standard error.
e) Calculate Pearson’s first coefficient of skewness and the coefficient of kurtosis. Classify the symmetry and the flatness level of each distribution.
f) Construct a histogram.
TABLE 3.5 Average Service Time
Class        Fi    Fri (%)
30 ├ 60      11    4.4
60 ├ 90      29    11.6
90 ├ 120     41    16.4
120 ├ 150    82    32.8
150 ├ 180    54    21.6
180 ├ 210    33    13.2
Sum          250   100
16) A financial analyst wants to compare the price of two stocks throughout the previous month. The data are listed in Table 3.6.
TABLE 3.6 Stock Price
Stock A   Stock B
31        25
30        33
24        27
24        34
28        32
22        26
24        26
34        28
24        34
28        28
23        31
30        28
31        34
32        16
26        28
39        29
25        27
42        28
29        33
24        29
22        34
23        33
32        27
29        26
Carry out a comparative analysis of the price of both stocks based on:
a) Measures of position, such as, the mean, median, and mode.
b) Measures of dispersion, such as, the range, variance, standard deviation, and standard error.
c) The existence of outliers.
d) The symmetry and flatness level of the distribution.
e) A line graph, scatter plot, stem-and-leaf plot, histogram, and boxplot.
17) Aiming to determine the standards of the investments made in hospitals in Sao Paulo (US$ millions), a state government agency collected data regarding 15 hospitals, as shown in Table 3.7.
TABLE 3.7 Investments in 15 Hospitals in the State of Sao Paulo
Hospital   Investment
A          44
B          12
C          6
D          22
E          60
F          15
G          30
H          200
I          10
J          8
K          4
L          75
M          180
N          50
O          64
We would like you to:
a) Calculate the sample’s arithmetic mean and standard deviation.
b) Eliminate possible outliers.
c) Once again, calculate the sample’s arithmetic mean and standard deviation (without the outliers).
d) What can we say about the standard deviation of the new sample without the outliers?
Chapter 4
Bivariate Descriptive Statistics

Numbers rule the world.
Plato
4.1 INTRODUCTION
The previous chapter discussed descriptive statistics for a single variable (univariate descriptive statistics). This chapter presents the concepts of descriptive statistics involving two variables (bivariate analysis). Therefore, a bivariate analysis has as its main objective to study the relationships (associations for qualitative variables and correlations for quantitative variables) between two variables. These relationships can be studied through the joint distribution of frequencies (contingency tables or crossed classification tables—cross tabulation), graphical representations, and summary measures. The bivariate analysis will be studied from two distinct situations:
a) When two variables are qualitative;
b) When two variables are quantitative.
Fig. 4.1 shows the bivariate descriptive statistics that will be studied in this chapter, represented by tables, charts, and summary measures, and presents the following situations:
a) The descriptive statistics used to represent the data behavior of two qualitative variables are: (i) joint frequency distribution tables, in this specific case, also called contingency tables or crossed classification tables (cross tabulation); (ii) charts, such as, perceptual maps resulting from the correspondence analysis technique (more details can be found in Fávero and Belfiore, 2017); (iii) measures of association, such as, the chi-square statistic (used for nominal and ordinal qualitative variables), the Phi coefficient, the contingency coefficient, and Cramer’s V coefficient (all of them based on chi-square and used for nominal variables), in addition to Spearman’s coefficient (for ordinal qualitative variables).
b) In the case of two quantitative variables, we are going to use joint frequency distribution tables, graphical representations, such as, the scatter plot, besides measures of correlation, such as, covariance and Pearson’s correlation coefficient.
4.2
ASSOCIATION BETWEEN TWO QUALITATIVE VARIABLES
The main objective is to assess whether there is a relationship between the qualitative or categorical variables studied, in addition to the level of association between them. This can be done through frequency distribution tables; summary measures, such as the chi-square statistic (used for nominal and ordinal variables), the Phi coefficient, the contingency coefficient, and Cramer's V coefficient (for nominal variables), and Spearman's coefficient (for ordinal variables); and graphical representations, such as perceptual maps resulting from correspondence analysis, as presented in Fávero and Belfiore (2017).
4.2.1
Joint Frequency Distribution Tables
The simplest way to summarize a set of data resulting from two qualitative variables is through a joint frequency distribution table, in this specific case called a contingency table, a crossed classification table (cross tabulation), or even a correspondence table. It jointly shows the absolute or relative frequencies of the categories of variable X, represented in the rows, and of variable Y, represented in the columns.

Data Science for Business and Decision Making. https://doi.org/10.1016/B978-0-12-811216-8.00004-5 © 2019 Elsevier Inc. All rights reserved.
PART II Descriptive Statistics
FIG. 4.1 Bivariate descriptive statistics depending on the type of variable. [Diagram: for two qualitative variables—tables (contingency tables), charts (perceptual maps), and measures of association (chi-square, Phi coefficient, contingency coefficient, Cramer's V coefficient, Spearman's coefficient); for two quantitative variables—tables (frequency distributions), charts (scatter plot), and measures of correlation (covariance, Pearson's correlation coefficient).]
It is common to add the marginal totals to the contingency table; they correspond to the sums across the rows of variable X and across the columns of variable Y. We are going to illustrate this analysis through an example based on Bussab and Morettin (2011).

Example 4.1
A study was done with 200 individuals in order to analyze the joint behavior of variable X (Health insurance agency) and variable Y (Level of satisfaction). The contingency table showing the variables' joint absolute frequency distribution, in addition to the marginal totals, is shown in Table 4.E.1. These data are available on the SPSS software in the file HealthInsurance.sav.
TABLE 4.E.1 Joint Absolute Frequency Distribution of the Variables Being Studied

                        Level of Satisfaction
Agency          Dissatisfied   Neutral   Satisfied   Total
Total Health         40           16        12         68
Live Life            32           24        16         72
Mena Health          24           32         4         60
Total                96           72        32        200
The study can also be carried out based on the relative frequencies, as studied in Chapter 3 for univariate problems. Bussab and Morettin (2011) show three ways to express the proportion of each category:
a) In relation to the general total;
b) In relation to the total of each row;
c) In relation to the total of each column.
The choice among these options depends on the objective of the problem. For example, Table 4.E.2 shows the joint relative frequency distribution of the variables being studied in relation to the general total.
TABLE 4.E.2 Joint Relative Frequency Distribution of the Variables Being Studied in Relation to the General Total

                        Level of Satisfaction
Agency          Dissatisfied   Neutral   Satisfied   Total
Total Health        20%           8%         6%        34%
Live Life           16%          12%         8%        36%
Mena Health         12%          16%         2%        30%
Total               48%          36%        16%       100%
First, we are going to analyze the marginal totals of the rows and columns that provide the unidimensional distributions of each variable. The marginal totals of the rows correspond to the sum of the relative frequencies of each category of the variable Agency and the marginal totals of the columns correspond to the sum of each category of the variable Level of satisfaction. Thus, we can conclude that 34% of the individuals are members of Total Health, 36% of Live Life, and 30% of Mena Health. Analogously, we can conclude that 48% of the individuals are dissatisfied with their health insurance agencies, 36% said they were neutral, and only 16% said they were satisfied. Regarding the joint relative frequency distribution of the variables being studied (a contingency table), we can state that 20% of the individuals are members of Total Health and are dissatisfied. The same logic is applied to the other categories of the contingency table. Conversely, Table 4.E.3 shows the joint relative frequency distribution of the variables being studied in relation to the total of each row.
TABLE 4.E.3 Joint Relative Frequency Distribution of the Variables Being Studied in Relation to the Total of Each Row

                        Level of Satisfaction
Agency          Dissatisfied   Neutral   Satisfied   Total
Total Health       58.8%        23.5%      17.6%      100%
Live Life          44.4%        33.3%      22.2%      100%
Mena Health        40%          53.3%       6.7%      100%
Total              48%          36%        16%        100%
From Table 4.E.3, we can see that the ratio of individuals who are members of Total Health and who are dissatisfied is 58.8% (40/68); those who are neutral, 23.5% (16/68); and those who are satisfied, 17.6% (12/68). The sum of the ratios in the respective row is 100%. The same logic is applied to the other rows. Finally, Table 4.E.4 shows the joint relative frequency distribution of the variables being studied in relation to the total of each column. Thus, among the dissatisfied individuals, the ratio of members of Total Health is 41.7% (40/96); of members of Live Life, 33.3% (32/96); and of members of Mena Health, 25% (24/96). The sum of the ratios in the respective column is 100%. The same logic is applied to the other columns.
TABLE 4.E.4 Joint Relative Frequency Distribution of the Variables Being Studied in Relation to the Total of Each Column

                        Level of Satisfaction
Agency          Dissatisfied   Neutral   Satisfied   Total
Total Health       41.7%        22.2%      37.5%       34%
Live Life          33.3%        33.3%      50%         36%
Mena Health        25%          44.4%      12.5%       30%
Total             100%         100%       100%        100%
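The row, column, and general-total percentages of Tables 4.E.2–4.E.4 can be recomputed directly from the raw counts of Table 4.E.1. The sketch below does this in plain Python (the book itself uses SPSS and Stata; the variable and function names here are illustrative):

```python
# Recomputing the relative frequencies of Tables 4.E.2-4.E.4 (illustrative
# sketch; the book performs this in SPSS/Stata).
observed = {
    "Total Health": [40, 16, 12],
    "Live Life":    [32, 24, 16],
    "Mena Health":  [24, 32, 4],
}
levels = ["Dissatisfied", "Neutral", "Satisfied"]

grand_total = sum(sum(row) for row in observed.values())   # 200
col_totals = [sum(row[j] for row in observed.values()) for j in range(len(levels))]

def pct(count, base):
    """Relative frequency in percent, rounded to one decimal place."""
    return round(100 * count / base, 1)

for agency, row in observed.items():
    row_total = sum(row)
    overall = [pct(x, grand_total) for x in row]                  # Table 4.E.2
    by_row = [pct(x, row_total) for x in row]                     # Table 4.E.3
    by_col = [pct(x, col_totals[j]) for j, x in enumerate(row)]   # Table 4.E.4
    print(agency, overall, by_row, by_col)
```

For the Dissatisfied cell of Total Health, for instance, this reproduces 20.0%, 58.8%, and 41.7%, matching the three tables above.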
Creating Contingency Tables on the SPSS Software
The contingency tables in Example 4.1 will be generated by using SPSS. The use of the images in this chapter has been authorized by the International Business Machines Corporation©. First, we are going to define the properties of each variable on SPSS. The variables Agency and Level of satisfaction are qualitative but, initially, they are presented as numbers, as shown in the file HealthInsurance_NoLabel.sav. Thus, labels corresponding to each category of both variables must be created, so that:
Labels of the variable Agency: 1 = Total Health; 2 = Live Life; 3 = Mena Health
Labels of the variable Level of satisfaction, simply called Satisfaction: 1 = Dissatisfied; 2 = Neutral; 3 = Satisfied
Therefore, we must click on Data → Define Variable Properties… and select the variables that interest us, as seen in Figs. 4.2 and 4.3.
FIG. 4.2 Defining the properties of the variable on SPSS.
FIG. 4.3 Selecting the variables that interest us.
Next, we must click on Continue. Based on Figs. 4.4 and 4.5, note that the variables Agency and Satisfaction were defined as nominal. This definition can also be done in the environment Variable View. The labels must be defined at this moment, as shown in Figs. 4.4 and 4.5. Clicking on OK, the database initially represented as numbers starts being substituted for the respective labels. In the file HealthInsurance.sav, the data have already been labeled. To create contingency tables (cross tabulation), we are going to click on the menu Analyze → Descriptive Statistics → Crosstabs…, as shown in Fig. 4.6. We are going to select the variable Agency in Row(s) and the variable Satisfaction in Column(s). Next, we must click on Cells…, as shown in Fig. 4.7. To create the contingency tables that represent the joint absolute frequency distribution of the variables observed, the joint relative frequency distribution in relation to the general total, the joint relative frequency distribution in relation to the total of each row, and the joint relative frequency distribution in relation to the total of each column (Tables 4.E.1–4.E.4), we must, in the Crosstabs: Cell Display dialog box (opened after clicking on Cells…), select the option Observed in Counts and the options Row, Column, and Total in Percentages, as shown in Fig. 4.8. Finally, we are going to click on Continue and OK. The contingency table (cross tabulation) generated by SPSS is shown in Fig. 4.9. Note that the data generated are exactly the same as those presented in Tables 4.E.1–4.E.4.
FIG. 4.4 Defining the labels of variable Agency.
FIG. 4.5 Defining the labels of variable Satisfaction.
FIG. 4.6 Creating contingency tables (cross tabulation) on SPSS.
FIG. 4.7 Creating a contingency table.
FIG. 4.8 Creating contingency tables from the Crosstabs: Cell Display dialog box.
FIG. 4.9 Cross classification table (cross tabulation) generated by SPSS.
Creating Contingency Tables on the Stata Software
In Chapter 3, we learned how to create frequency distribution tables for a single variable on Stata through the command tabulate, or simply tab. In the case of two or more variables, if the objective is to create univariate frequency distribution tables for each variable being analyzed, we must use the command tab1, followed by the list of variables. The same logic must be applied to create joint frequency distribution tables (contingency tables). To create a contingency table on Stata from the absolute frequencies of the variables being observed, we must use the following syntax:
tabulate variable1* variable2*
or simply: tab variable1* variable2* where the terms variable1* and variable2* must be substituted for the names of the respective variables.
If, in addition to the joint absolute frequency distribution of the variables being observed, we want to obtain the joint relative frequency distribution in relation to the total of each row, to the total of each column, and to the general total, we must use the following syntax: tabulate variable1* variable2*, row column cell
or simply: tab variable1* variable2*, r co ce
Consider a case with more than two variables being studied, in which the objective is to construct bivariate frequency distribution tables (two-way tables), for all the combinations of variables, two by two. In this case, we must use the command tab2, with the following syntax: tab2 variables* where the term variables* should be substituted for the list of variables being considered in the analysis.
Analogously, to obtain both the joint absolute frequency distribution and the joint relative frequency distributions per row, per column, and per general total, we must use the following syntax: tab2 variables*, r co ce
The contingency tables in Example 4.1 will be generated now by using the Stata software. The data are available in the file HealthInsurance.dta. Hence, to obtain the table of joint absolute frequency distribution, relative frequencies per row, relative frequencies per column, and relative frequencies per general total, the command is: tab agency satisfaction, r co ce
The results can be seen in Fig. 4.10 and are similar to those presented in Fig. 4.9 (SPSS).
FIG. 4.10 Contingency table constructed on Stata.
4.2.2
Measures of Association
The main measures that represent the association between two qualitative variables are:
a) The chi-square statistic (χ²)—used for nominal and ordinal qualitative variables;
b) The Phi coefficient, the contingency coefficient, and Cramer's V coefficient—applied to nominal variables and based on chi-square; and
c) Spearman's coefficient—used for ordinal variables.
4.2.2.1 Chi-Square Statistic
The chi-square statistic (χ²) measures the discrepancy between the observed contingency table and the expected contingency table, starting from the hypothesis that there is no association between the variables studied. If the observed frequency distribution is exactly equal to the expected frequency distribution, the chi-square statistic is zero; therefore, values of χ² close to zero indicate independence between the variables. The χ² statistic is given by:

\chi^2 = \sum_{i=1}^{I} \sum_{j=1}^{J} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}    (4.1)

where:
O_ij: number of observations in the ith category of variable X and in the jth category of variable Y;
E_ij: expected frequency of observations in the ith category of variable X and in the jth category of variable Y;
I: number of categories (rows) of variable X;
J: number of categories (columns) of variable Y.
Example 4.2
Calculate the χ² statistic for Example 4.1.
Solution
Table 4.E.5 shows the observed values of the distribution with the respective relative frequencies in relation to the total of each row. The calculation could also be done in relation to the total of each column, arriving at the same result for the χ² statistic.
TABLE 4.E.5 Observed Values of Each Category With the Respective Ratios in Relation to the Total of Each Row

                        Level of Satisfaction
Agency          Dissatisfied   Neutral      Satisfied    Total
Total Health    40 (58.8%)     16 (23.5%)   12 (17.6%)   68 (100%)
Live Life       32 (44.4%)     24 (33.3%)   16 (22.2%)   72 (100%)
Mena Health     24 (40%)       32 (53.3%)    4 (6.7%)    60 (100%)
Total           96 (48%)       72 (36%)     32 (16%)    200 (100%)
The data in Table 4.E.5 suggest dependence between the variables. If there were no association between the variables, we would expect, for all three health insurance agencies, a ratio of 48% of the row total in the Dissatisfied column, 36% in the Neutral column, and 16% in the Satisfied column. The calculation of the expected values can be seen in Table 4.E.6. For example, the calculation of the first cell is 0.48 × 68 = 32.64.
TABLE 4.E.6 Expected Values in Table 4.E.5, Assuming the Nonassociation Between the Variables

                        Level of Satisfaction
Agency          Dissatisfied   Neutral      Satisfied    Total
Total Health    32.6 (48%)     24.5 (36%)   10.9 (16%)   68 (100%)
Live Life       34.6 (48%)     25.9 (36%)   11.5 (16%)   72 (100%)
Mena Health     28.8 (48%)     21.6 (36%)    9.6 (16%)   60 (100%)
Total           96 (48%)       72 (36%)     32 (16%)    200 (100%)
To calculate the χ² statistic, we must apply expression (4.1) to the data in Tables 4.E.5 and 4.E.6. The calculation of each term (O_ij − E_ij)²/E_ij is shown in Table 4.E.7, jointly with the χ² measure resulting from the sum over all the categories.
TABLE 4.E.7 Calculating the χ² Statistic

                        Level of Satisfaction
Agency          Dissatisfied   Neutral   Satisfied
Total Health        1.66         2.94       0.12
Live Life           0.19         0.14       1.74
Mena Health         0.80         5.01       3.27
Total                       χ² = 15.861
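As a cross-check of Tables 4.E.6 and 4.E.7, the sketch below computes the expected frequencies and the χ² statistic of expression (4.1) in plain Python (illustrative names; the book performs this in Excel, SPSS, and Stata):

```python
# Chi-square statistic for the Example 4.1 data (illustrative sketch).
observed = [
    [40, 16, 12],   # Total Health
    [32, 24, 16],   # Live Life
    [24, 32, 4],    # Mena Health
]
n = sum(map(sum, observed))                          # grand total: 200
row_totals = [sum(row) for row in observed]          # 68, 72, 60
col_totals = [sum(col) for col in zip(*observed)]    # 96, 72, 32

# Expected frequency under no association: E_ij = (row total x column total) / n
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi2 = sum(
    (observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
    for i in range(len(row_totals))
    for j in range(len(col_totals))
)
print(round(chi2, 3))   # 15.861, matching Table 4.E.7
```

If SciPy is available, scipy.stats.chi2_contingency(observed) returns the same statistic together with its P-value (about 0.003 for these data) and degrees of freedom.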
As we are going to study in Chapter 9, which discusses hypothesis tests, the significance level α indicates the probability of rejecting a certain hypothesis when it is true. The P-value, on the other hand, represents the probability associated with the observed sample value, indicating the lowest significance level that would lead to the rejection of the supposed hypothesis. In other words, the P-value is a decreasing reliability index of a result: the lower the value, the less we can believe in the assumed hypothesis. In the case of the χ² statistic, whose test presupposes the nonassociation between the variables being studied, most statistical software, including SPSS and Stata, calculates the corresponding P-value. Thus, for a confidence level of 95%, if the P-value < 0.05, the hypothesis is rejected and we can state that there is an association between the variables. On the other hand, if the P-value > 0.05, we conclude that the variables are independent. All of these concepts will be studied in more detail in Chapter 9. Excel calculates the P-value of the χ² statistic through the CHITEST or CHISQ.TEST (Excel 2010 and later versions) functions; we just need to select the range of cells corresponding to the observed values and the range of cells of the expected values.
Solving the chi-square statistic on the SPSS software
Analogous to Example 4.1, calculating the chi-square statistic (χ²) on SPSS is also done on the tab Analyze → Descriptive Statistics → Crosstabs…. Once again, we are going to select the variable Agency in Row(s) and the variable Satisfaction in Column(s). Initially, to generate the observed values and the expected values in case of nonassociation between the variables (data in Tables 4.E.5 and 4.E.6), we must click on Cells… and select the options Observed and Expected in Counts, from the Crosstabs: Cell Display dialog box (Fig. 4.11).
In the same box, to generate the adjusted standardized residuals, we must select the option Adjusted standardized in Residuals. The results can be seen in Fig. 4.12. To calculate the χ² statistic, in Statistics…, we must select the option Chi-square (Fig. 4.13). Finally, we are going to click on Continue and OK. The result can be seen in Fig. 4.14. Based on Fig. 4.14, we can see that the value of χ² is 15.861, equal to the one calculated in Table 4.E.7. We can also observe that the lowest significance level that would lead to the rejection of the nonassociation hypothesis between the variables (P-value) is 0.003. Since 0.003 < 0.05 (for a confidence level of 95%), the null hypothesis is rejected, which allows us to conclude that there is an association between the variables.
FIG. 4.11 Creating the contingency table with the observed frequencies, the expected frequencies, and the residuals.
FIG. 4.12 Contingency table with the observed values, the expected values, and the residuals, assuming the nonassociation between the variables.
FIG. 4.13 Selecting the χ² statistic.
Solving the χ² statistic on the Stata software
In Section 4.2.1, we learned how to create contingency tables on Stata through the command tabulate, or simply tab. Besides the observed frequencies, this command also gives us the expected frequencies through the option expected, or simply exp, as well as the calculation of the χ² statistic through the option chi2, or simply ch. For the data in Example 4.1, available in the file HealthInsurance.dta, to obtain the observed and expected frequency distribution tables, jointly with the χ² statistic, we are going to use the following command: tab agency satisfaction, exp ch However, the command tab does not allow residuals to be generated in the output. As an alternative, the command tabchi
was developed from a tabulation module created by Nicholas J. Cox, allowing the adjusted standardized residuals to be calculated too. In order for this command to be used, we must initially type:
FIG. 4.14 Result of the χ² statistic.
FIG. 4.15 Result of the χ² statistic on Stata.
findit tabchi
and install it in the link tab_chi from http://fmwww.bc.edu/RePEc/bocode/t. After doing this, we can type the following command: tabchi agency satisfaction, a
The result is shown in Fig. 4.15 and is similar to those presented in Figs. 4.12 and 4.14 on the SPSS software. Note that, differently from the command tab, which requires the option exp so that the expected frequencies can be generated, the command tabchi already gives them to us automatically.
4.2.2.2 Other Measures of Association Based on Chi-Square
The main measures of association based on the chi-square statistic (χ²) are the Phi coefficient, Cramer's V coefficient, and the contingency coefficient (C), all of them applied to nominal qualitative variables. In general, an association or correlation coefficient is a measure that varies between 0 and 1, presenting the value 0 when there is no relationship between the variables and the value 1 when they are perfectly related. We are going to see how each one of the coefficients studied in this section behaves in relation to these characteristics.
a) Phi Coefficient
The Phi coefficient is the simplest measure of association for nominal variables based on χ², and it can be expressed as follows:

\text{Phi} = \sqrt{\frac{\chi^2}{n}}    (4.2)

In order for Phi to vary only between 0 and 1, the contingency table must have a 2 x 2 dimension.
Example 4.3
In order to offer high-quality services and meet their customers' expectations, Ivanblue, a company in the male fashion industry, is investing in strategies to segment the market. Currently, the company has four stores in Campinas, located in the north, center, south, and east regions of the city, and sells four types of clothes: ties, shirts, polo shirts, and pants. Table 4.E.8 shows the purchase data of 20 customers, namely, the type of clothes and the location of the store. Check if there is an association between the two variables using the Phi coefficient.
TABLE 4.E.8 Purchase Data of 20 Customers

Customer   Clothes      Region
1          Tie          South
2          Polo shirt   North
3          Shirt        South
4          Pants        North
5          Tie          South
6          Polo shirt   Center
7          Polo shirt   East
8          Tie          South
9          Shirt        South
10         Tie          Center
11         Pants        North
12         Pants        Center
13         Tie          Center
14         Polo shirt   East
15         Pants        Center
16         Tie          Center
17         Pants        South
18         Pants        North
19         Polo shirt   East
20         Shirt        Center
Solution
Using the procedure described in the previous section, the value of the chi-square statistic is χ² = 18.214. Therefore:

\text{Phi} = \sqrt{\frac{\chi^2}{n}} = \sqrt{\frac{18.214}{20}} = 0.954

Since both variables have four categories, in this case the condition 0 ≤ Phi ≤ 1 is not valid, making it difficult to interpret how strong the association is.
b) Contingency coefficient
The contingency coefficient (C), also known as Pearson's contingency coefficient, is another measure of association for nominal variables based on the χ² statistic, represented by the following expression:

C = \sqrt{\frac{\chi^2}{n + \chi^2}}    (4.3)

where n is the sample size. The contingency coefficient (C) has the value 0 as its lowest limit, indicating that there is no relationship between the variables; the highest limit of C, however, varies depending on the number of categories:

0 \leq C \leq \sqrt{\frac{q-1}{q}}    (4.4)

where:

q = \min(I, J)    (4.5)

where I is the number of rows and J is the number of columns of the contingency table. When C = \sqrt{(q-1)/q}, there is a perfect association between the variables; however, this limit never assumes the value 1. Hence, two contingency coefficients can only be compared if both are defined from tables with the same number of rows and columns.
Example 4.4
Calculate the contingency coefficient (C) for the data in Example 4.3.
Solution
We calculate C as follows:

C = \sqrt{\frac{\chi^2}{n + \chi^2}} = \sqrt{\frac{18.214}{20 + 18.214}} = 0.690

Since the contingency table is 4 x 4 (q = min(4, 4) = 4), the values that C can assume are in the interval:

0 \leq C \leq \sqrt{\frac{3}{4}} \;\rightarrow\; 0 \leq C \leq 0.866

We can conclude that there is an association between the variables.
c) Cramer's V coefficient
Another measure of association for nominal variables based on the χ² statistic is Cramer's V coefficient, calculated by:

V = \sqrt{\frac{\chi^2}{n(q-1)}}    (4.6)

where q = min(I, J), as presented in expression (4.5). For 2 x 2 contingency tables, expression (4.6) reduces to V = \sqrt{\chi^2/n}, which corresponds to the Phi coefficient. Cramer's V coefficient is an alternative to the Phi coefficient and to the contingency coefficient (C), and its value is always limited to the interval [0, 1], regardless of the number of categories in the rows and columns:

0 \leq V \leq 1    (4.7)
Value 0 indicates that the variables do not have any kind of association and value 1 shows that they are perfectly associated. Therefore, Cramer’s V coefficient allows us to compare contingency tables that have different dimensions.
Example 4.5
Calculate Cramer's V coefficient for the data in Example 4.3.
Solution

V = \sqrt{\frac{\chi^2}{n(q-1)}} = \sqrt{\frac{18.214}{20 \times 3}} = 0.551

Since 0 ≤ V ≤ 1, there is an association between the variables; however, it is not considered very strong.
Solution of Examples 4.3, 4.4, and 4.5 (calculation of the Phi, contingency, and Cramer's V coefficients) by using SPSS
In Section 4.2.1, we discussed how to create labels that correspond to the variable categories from the menu Data → Define Variable Properties…. The same procedure must be applied to the data in Table 4.E.8 (we cannot forget to define the variables as nominal). The file Market_Segmentation.sav gives us these data already tabulated on SPSS.
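Expressions (4.2)–(4.6) can also be evaluated directly from χ² = 18.214, n = 20, and q = 4, reproducing Examples 4.3, 4.4, and 4.5 in a few lines of plain Python (illustrative sketch, not the book's software):

```python
import math

# Chi-square-based association coefficients for Examples 4.3-4.5
# (illustrative sketch; chi2 and n as given in the text).
chi2, n = 18.214, 20
q = min(4, 4)   # q = min(I, J) for the 4 x 4 contingency table of Table 4.E.8

phi = math.sqrt(chi2 / n)              # expression (4.2)
c = math.sqrt(chi2 / (n + chi2))       # expression (4.3)
v = math.sqrt(chi2 / (n * (q - 1)))    # expression (4.6)
c_max = math.sqrt((q - 1) / q)         # upper bound of C, expression (4.4)

print(round(phi, 3), round(c, 3), round(v, 3), round(c_max, 3))
# 0.954 0.69 0.551 0.866
```

The printed values match the hand calculations above: Phi = 0.954, C = 0.690 (with upper bound 0.866), and V = 0.551.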
FIG. 4.16 Selecting the contingency coefficient and Phi and Cramer’s V coefficients.
FIG. 4.17 Results of the contingency coefficient and Phi and Cramer’s V coefficients.
Similar to the calculation of the w2 statistic, calculating the Phi, contingency, and Cramer’s V coefficients on SPSS can also be done on the menu Analyze → Descriptive Statistics → Crosstabs…. We are going to select the variable Clothes in Row(s) and the variable Region in Column(s). In Statistics…, we are going to select the options Contingency coefficient and Phi and Cramer’s V (Fig. 4.16). Note that these coefficients are calculated for nominal variables. The results of the statistics can be seen in Fig. 4.17. For all three coefficients, the P-value of 0.033 (0.033 < 0.05) indicates that there is association between the variables being studied. Solution of Examples 4.3 and 4.5 (calculation of the Phi and Cramer’s V coefficients) by using Stata Stata calculates the Phi and Cramer’s V coefficients through the command phi. Hence, they are going to be calculated for the data in Example 4.3 available in the file Market_Segmentation.dta.
FIG. 4.18 Calculating the Phi and Cramer’s V coefficients on Stata.
In order for the phi command to be used, initially, we must type: findit phi
and install it in the link snp3.pkg from http://www.stata.com/stb/stb3/. After doing this, we can type the following command: phi clothes region
The results can be seen in Fig. 4.18. Note that the Phi coefficient on Stata is called Cohen’s w. Cramer’s V coefficient, on the other hand, is called Cramer’s phi-prime.
4.2.2.3 Spearman's Coefficient
Spearman's coefficient (r_sp) is a measure of association between two ordinal qualitative variables. Initially, we must sort the data of variable X and of variable Y in ascending order. After sorting the data, it is possible to create ranks or rankings, denoted by k (k = 1, …, n). Ranks are assigned separately for each variable: rank 1 is assigned to the smallest value of the variable, rank 2 to the second smallest value, and so on, up to rank n for the highest value. In case of a tie between positions k and k + 1, we must assign the rank k + 1/2 to both observations. Spearman's coefficient can be calculated by using the following expression:

r_{sp} = 1 - \frac{6 \sum_{k=1}^{n} d_k^2}{n(n^2 - 1)}    (4.8)

where:
n: number of observations (pairs of values);
d_k: difference between the rankings of order k.
Spearman's coefficient is a measure that varies between −1 and 1. If r_sp = 1, all the values of d_k are null, indicating that the rankings of variables X and Y are identical (perfect positive association). The value r_sp = −1 is found when \sum_{k=1}^{n} d_k^2 reaches its maximum value, n(n^2 - 1)/3 (there is an inversion in the rankings of the variables), indicating a perfect negative association. When r_sp = 0, there is no association between variables X and Y. Fig. 4.19 shows a summary of this interpretation, which is similar to that of Pearson's correlation coefficient, studied in Section 4.3.3.2.
FIG. 4.19 Interpretation of Spearman's coefficient.
Example 4.6 The coordinator of the Business Administration course is analyzing if there is any kind of association between the grades of 10 students in two different subjects: Simulation and Finance. The data regarding this problem are presented in Table 4.E.9. Calculate Spearman’s coefficient.
TABLE 4.E.9 Grades in the Subjects Simulation and Finance of the 10 Students Being Analyzed

               Grades
Student   Simulation   Finance
1             4.7         6.6
2             6.3         5.1
3             7.5         6.9
4             5.0         7.1
5             4.4         3.5
6             3.7         4.6
7             8.5         6.8
8             8.2         7.5
9             3.5         4.2
10            4.0         3.3
Solution To calculate Spearman’s coefficient, first, we are going to assign rankings to each category of each variable depending on their respective values, as shown in Table 4.E.10.
TABLE 4.E.10 Ranks in the Subjects Simulation and Finance of the 10 Students

               Rankings
Student   Simulation   Finance    d_k   d_k²
1              5           6       -1     1
2              7           5        2     4
3              8           8        0     0
4              6           9       -3     9
5              4           2        2     4
6              2           4       -2     4
7             10           7        3     9
8              9          10       -1     1
9              1           3       -2     4
10             3           1        2     4
Sum                                      40
Applying expression (4.8), we have:

r_{sp} = 1 - \frac{6 \sum_{k=1}^{n} d_k^2}{n(n^2 - 1)} = 1 - \frac{6 \times 40}{10 \times 99} = 0.7576
Value 0.758 indicates a strong positive association between the variables.
Calculating Spearman's coefficient by using SPSS software
The file Grades.sav shows the data from Example 4.6 (grades in Table 4.E.9) tabulated in an ordinal scale (defined in the environment Variable View). Similar to the calculation of the χ² statistic and the Phi, contingency, and Cramer's V coefficients, Spearman's coefficient can also be generated by SPSS from the menu Analyze → Descriptive Statistics → Crosstabs…. We are going to select the variable Simulation in Row(s) and the variable Finance in Column(s). In Statistics…, we are going to select the option Correlations (Fig. 4.20). We are going to click on Continue and then, finally, on OK. The result of Spearman's coefficient is shown in Fig. 4.21. The P-value 0.011 < 0.05 (under the hypothesis of nonassociation between the variables) indicates that there is a correlation between the grades in Simulation and Finance, with 95% confidence. Spearman's coefficient can also be calculated in the menu Analyze → Correlate → Bivariate…. We must select the variables that interest us, in addition to Spearman's coefficient, as shown in Fig. 4.22. We are going to click on OK, resulting in Fig. 4.23.
FIG. 4.20 Calculating Spearman's coefficient from the Crosstabs: Statistics dialog box.
FIG. 4.21 Result of Spearman’s coefficient from the Crosstabs: Statistics dialog box.
Calculating Spearman’s coefficient by using Stata software In Stata, Spearman’s coefficient is calculated using the command spearman. Therefore, for the data in Example 4.6, available in the file Grades.dta, we must type the following command: spearman simulation finance The results can be seen in Fig. 4.24.
FIG. 4.22 Calculating Spearman’s coefficient from the Bivariate Correlations dialog box.
FIG. 4.23 Result of Spearman’s coefficient from the Bivariate Correlations dialog box. FIG. 4.24 Result of Spearman’s coefficient on Stata.
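For comparison with the SPSS and Stata outputs above, expression (4.8) can also be reproduced in plain Python. The sketch below (illustrative names) ranks the grades of Table 4.E.9 and computes r_sp; the simple ranking helper assumes no tied grades, which holds for these data (with ties, tied values would share the average rank):

```python
# Spearman's coefficient for Example 4.6 via expression (4.8)
# (illustrative sketch; the book uses SPSS and Stata).
simulation = [4.7, 6.3, 7.5, 5.0, 4.4, 3.7, 8.5, 8.2, 3.5, 4.0]
finance = [6.6, 5.1, 6.9, 7.1, 3.5, 4.6, 6.8, 7.5, 4.2, 3.3]

def ranks(values):
    # Rank 1 for the smallest value, rank n for the largest.
    # Assumes no ties (true here); ties would require average ranks.
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

n = len(simulation)
d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(simulation), ranks(finance)))
r_sp = 1 - 6 * d2 / (n * (n ** 2 - 1))
print(d2, round(r_sp, 4))   # 40 0.7576, matching Table 4.E.10
```

The computed rankings coincide with those of Table 4.E.10, and the resulting coefficient agrees with the SPSS and Stata outputs.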
4.3
CORRELATION BETWEEN TWO QUANTITATIVE VARIABLES
In this section, the main objective is to assess if there is a relationship between the quantitative variables being studied, besides the level of correlation between them. This can be done through frequency distribution tables, graphical representations, such as, scatter plots, in addition to measures of correlation, such as, the covariance and Pearson’s correlation coefficient.
4.3.1
Joint Frequency Distribution Tables
The same procedure presented for qualitative variables can be used to represent the joint distribution of quantitative variables and to analyze the possible relationships between them. Analogous to the study of univariate descriptive statistics, continuous data that do not repeat themselves with a certain frequency can be grouped into class intervals.
4.3.2
Graphical Representation Through a Scatter Plot
The correlation between two quantitative variables can be represented graphically through a scatter plot, which plots the values of variables X and Y in a Cartesian plane. Therefore, a scatter plot allows us to assess:
a) Whether there is any relationship between the variables being studied or not;
b) The type of relationship between the two variables, that is, the direction in which variable Y increases or decreases depending on changes in X;
c) The level of relationship between the variables;
d) The nature of the relationship (linear, exponential, among others).
Fig. 4.25 shows a scatter plot in which the relationship between variables X and Y is strong positive linear, that is, variations in Y are directly proportional to variations in X: the level of relationship between the variables is strong, and its nature is linear. If all the points are contained in a straight line, we have a case in which the relationship is perfect linear, as shown in Fig. 4.26. Figs. 4.27 and 4.28, on the other hand, show scatter plots in which the relationship between variables X and Y is strong negative linear and perfect negative linear, respectively.
FIG. 4.25 Strong positive linear relationship.
FIG. 4.26 Perfect positive linear relationship.
Bivariate Descriptive Statistics Chapter 4
FIG. 4.27 Strong negative linear relationship.
FIG. 4.28 Perfect negative linear relationship.
FIG. 4.29 There is no relationship between variables X and Y.
Finally, there may also be a case in which there is no relationship between variables X and Y, as shown in Fig. 4.29.
Constructing a scatter plot on SPSS
Example 4.7
Let us open the file Income_Education.sav on SPSS. The objective is to analyze the correlation between the variables Family Income and Years of Education through a scatter plot. In order to do that, we are going to click on Graphs → Legacy Dialogs → Scatter/Dot… (Fig. 4.30). In the Scatter/Dot window in Fig. 4.31, we are going to select the type of chart (Simple Scatter). Clicking on Define, the Simple Scatterplot dialog box will open, as shown in Fig. 4.32. We are going to select the variable FamilyIncome for the Y-axis and the variable YearsofEducation for the X-axis. Next, we are going to click on OK. The scatter plot created is shown in Fig. 4.33.
Based on Fig. 4.33, we can see a strong positive correlation between the variables Family Income and Years of Education. Therefore, the higher the number of years of education, the higher the family income tends to be, even though this does not imply a cause and effect relationship.
FIG. 4.30 Constructing a scatter plot on SPSS.
FIG. 4.31 Selecting the type of chart.
The scatter plot can also be created in Excel by selecting the option Scatter.
Constructing a scatter plot on Stata
The data from Example 4.7 are also available on Stata from the file Income_Education.dta. The variables being studied are called income and education. The scatter plot on Stata is created using the command twoway scatter (or simply tw sc) followed by the variables we are interested in. Thus, to analyze the correlation between the variables Family Income and Years of Education through a scatter plot on Stata, we must type the following command:
tw sc income education
The resulting scatter plot is shown in Fig. 4.34.
FIG. 4.32 Simple Scatterplot dialog box.
FIG. 4.33 Scatter plot of the variables Family Income and Years of Education.
FIG. 4.34 Scatter plot on Stata.
(Axes: years of education on the X-axis; family income on the Y-axis.)
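The same chart can also be sketched in Python with matplotlib. The values below are illustrative stand-ins for the education and income variables (the real observations live in Income_Education.dta), so this is a sketch of the technique, not a reproduction of the book's figure:

```python
# A minimal matplotlib sketch of the scatter plot built above in SPSS/Stata.
# The lists below are illustrative stand-ins for the book's data.
import matplotlib.pyplot as plt

education = [4.0, 5.0, 5.5, 6.0, 7.0, 7.5, 8.0, 9.0, 10.0]   # X variable
income = [600, 850, 950, 1100, 1500, 1900, 2300, 3200, 4800]  # Y variable

plt.scatter(education, income)
plt.xlabel("Years of education")
plt.ylabel("Family income")
plt.savefig("scatter_income_education.png")
```

With the real dataset, the cloud of points rises from left to right, which is the visual signature of the strong positive correlation discussed above.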
4.3.3
Measures of Correlation
The main measures of correlation, used for quantitative variables, are the covariance and Pearson’s correlation coefficient.
4.3.3.1 Covariance
Covariance measures the joint variation between two quantitative variables X and Y, and it is calculated by using the following expression:

cov(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}   (4.9)

where:
X_i: ith value of X;
Y_i: ith value of Y;
\bar{X}: mean of the values of X_i;
\bar{Y}: mean of the values of Y_i;
n: sample size.
One of the limitations of the covariance is that the measure depends on the sample size, and it may lead to a bad estimate in the case of small samples. Pearson's correlation coefficient is an alternative for this problem.
Example 4.8
Once again, consider the data in Example 4.7 regarding the variables Family Income and Years of Education. The data are also available in Excel in the file Income_Education.xls. Calculate the covariance of the data matrix of both variables.
Solution
Applying expression (4.9), we have:

cov(X, Y) = \frac{(7.6 - 7.08)(1,961 - 1,856.22) + \cdots + (5.4 - 7.08)(775 - 1,856.22)}{95} = \frac{72,326.93}{95} = 761.336

The covariance can be calculated in Excel by using the COVARIANCE.S (sample) function. In the following section, we are also going to discuss how the covariance can be calculated on SPSS, jointly with Pearson's correlation coefficient. SPSS considers the same expression presented in this section.
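Expression (4.9) translates directly into a few lines of Python. The vectors below are short illustrative stand-ins (the book's dataset has 96 observations):

```python
# Sample covariance computed directly from expression (4.9):
# cov(X, Y) = sum((Xi - mean(X)) * (Yi - mean(Y))) / (n - 1)
def sample_cov(x, y):
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)

# Illustrative values standing in for years of education and family income
x = [7.6, 8.0, 6.5, 5.4]
y = [1961, 2100, 1500, 775]
print(sample_cov(x, y))
```

Note the division by n − 1 rather than n: this is the sample covariance, the same convention used by Excel's COVARIANCE.S and by SPSS.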
FIG. 4.35 Interpretation of Pearson’s correlation coefficient.
4.3.3.2 Pearson's Correlation Coefficient
Pearson's correlation coefficient (r) is a measure that varies between −1 and 1. Through the sign, it is possible to verify the type of linear relationship between the two variables analyzed (the direction in which variable Y increases or decreases depending on how X changes); the closer it is to the extreme values, the stronger the correlation between them. Therefore:
– If r is positive, there is a directly proportional relationship between the variables; if r = 1, we have a perfect positive linear correlation.
– If r is negative, there is an inversely proportional relationship between the variables; if r = −1, we have a perfect negative linear correlation.
– If r is null, there is no linear correlation between the variables.
Fig. 4.35 shows a summary of the interpretation of Pearson's correlation coefficient.
Pearson's correlation coefficient (r) can be calculated as the ratio between the covariance of the two variables and the product of the standard deviations (S) of each one of them:

r = \frac{cov(X, Y)}{S_X S_Y} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{(n - 1) \, S_X S_Y}   (4.10)

Since S_X = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}} and S_Y = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n - 1}}, as we studied in Chapter 3, expression (4.10) becomes:

r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2} \, \sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}}   (4.11)

In Chapter 12, we are going to use Pearson's correlation coefficient extensively when studying factor analysis.
Example 4.9
Once again, open the file Income_Education.xls and calculate Pearson's correlation coefficient between the two variables.
Solution
Calculating Pearson's correlation coefficient through expression (4.10), we have:

r = \frac{cov(X, Y)}{S_X S_Y} = \frac{761.336}{970.774 \times 1.009} = 0.777

This calculation could also be done by using expression (4.11), which does not depend on the sample size. The result indicates a strong positive correlation between the variables Family Income and Years of Education.
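Expression (4.11) can be sketched in Python as follows; it works only with the raw cross-products and sums of squares, which is why the sample size cancels out:

```python
# Pearson's correlation coefficient via expression (4.11).
from math import sqrt

def pearson_r(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))  # cross-products
    sxx = sum((a - mean_x) ** 2 for a in x)                       # sum of squares of X
    syy = sum((b - mean_y) ** 2 for b in y)                       # sum of squares of Y
    return sxy / (sqrt(sxx) * sqrt(syy))

print(pearson_r([1, 2, 3], [2, 4, 6]))   # perfectly linear data: r ≈ 1
print(pearson_r([1, 2, 3], [6, 4, 2]))   # perfectly inverse data: r ≈ -1
```

On perfectly linear data the coefficient reaches its extreme values, exactly as the interpretation in Fig. 4.35 describes.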
FIG. 4.36 Bivariate Correlations dialog box.
Excel also calculates Pearson's correlation coefficient through the PEARSON function.
Solution of Examples 4.8 and 4.9 (calculation of the covariance and Pearson's correlation coefficient) on SPSS
Once again, open the file Income_Education.sav. To calculate the covariance and Pearson's correlation coefficient on SPSS, we are going to click on Analyze → Correlate → Bivariate…. The Bivariate Correlations window will open. We are going to select the variables Family Income and Years of Education, in addition to Pearson's correlation coefficient, as shown in Fig. 4.36. In Options…, we must select the option Cross-product deviations and covariances, according to Fig. 4.37. We are going to click on Continue and then on OK. The results of the statistics are presented in Fig. 4.38.
FIG. 4.37 Selecting the covariance statistic.
FIG. 4.38 Results of the covariance and of Pearson’s correlation coefficient on SPSS.
FIG. 4.39 Calculating Pearson’s correlation coefficient on Stata.
FIG. 4.40 Calculating the covariance on Stata.
Analogous to Spearman's coefficient, Pearson's correlation coefficient can also be generated on SPSS from the menu Analyze → Descriptive Statistics → Crosstabs… (option Correlations in the Statistics… button).
Solution of Examples 4.8 and 4.9 (calculation of the covariance and Pearson's correlation coefficient) on Stata
To calculate Pearson's correlation coefficient on Stata, we must use the command correlate, or simply corr, followed by the list of variables we are interested in. The result is the correlation matrix between the respective variables. Once again, open the file Income_Education.dta. Thus, for the data in this file, we can type the following command:
corr income education
The result can be seen in Fig. 4.39. To calculate the covariance, we must use the option covariance, or only cov, at the end of the command correlate (or simply corr). Thus, to generate Fig. 4.40, we must type the following command: corr income education, cov
4.4
FINAL REMARKS
This chapter presented the main concepts of descriptive statistics with greater focus on the study of the relationship between two variables (bivariate analysis). We studied the relationships between two qualitative variables (associations) and between two quantitative variables (correlations). For each situation, several measures, tables, and charts were presented, which allow us to have a better understanding of the data behavior. Fig. 4.1 summarizes this information.
The construction and interpretation of frequency distributions, graphical representations, in addition to summary measures (measures of position or location and measures of dispersion or variability), allow the researcher to have a better understanding and visualization of the data behavior for two variables simultaneously. More advanced techniques can be applied in the future to the same set of data, so that researchers can go deeper in their studies on bivariate analysis, aiming at improving the quality of the decision making process.
4.5
EXERCISES
1) Which descriptive statistics can be used (and in which situations) to represent the behavior of two qualitative variables simultaneously?
2) And to represent the behavior of two quantitative variables?
3) In what situations should we use contingency tables?
4) What are the differences between the chi-square statistic (χ²), the Phi coefficient, the contingency coefficient (C), Cramer's V coefficient, and Spearman's coefficient?
5) What are the main summary measures to represent the data behavior between two quantitative variables? Describe each one of them.
6) Aiming at identifying the behavior of customers who are in default regarding their payments, a survey with information on the age and level of default of the respondents was carried out. The objective is to determine if there is an association between the variables. Based on the files Default.sav and Default.dta, we would like you to:
a) Create the joint frequency distribution tables for the variables age_group and default (absolute frequencies, relative frequencies in relation to the general total, relative frequencies in relation to the total of each row, relative frequencies in relation to the total of each column, and the expected frequencies).
b) Determine the percentage of individuals who are between 31 and 40 years of age.
c) Determine the percentage of individuals who are heavily indebted.
d) Determine the percentage of respondents who are 20 years old or younger and do not have debts.
e) Determine, among the individuals who are older than 60, the percentage of those who are a little indebted.
f) Determine, among the individuals who are relatively indebted, the percentage of those who are between 41 and 50 years old.
g) Verify if there are indications of dependence between the variables.
h) Confirm the previous item using the χ² statistic.
i) Calculate the Phi, contingency, and Cramer's V coefficients, confirming whether there is an association between the variables or not.
7) The files Motivation_Companies.sav and Motivation_Companies.dta show a database with the variables Company and Level of Motivation (Motivation), obtained through a survey carried out with 250 employees (50 respondents for each one of the 5 companies surveyed), aiming at assessing the employees' level of motivation in relation to the companies, considered to be large firms. Hence, we would like you to:
a) Create the contingency tables of absolute frequencies, relative frequencies in relation to the general total, relative frequencies in relation to the total of each row, relative frequencies in relation to the total of each column, and the expected frequencies.
b) Calculate the percentage of respondents who are very demotivated.
c) Calculate the percentage of respondents who are from Company A and very demotivated.
d) Calculate the percentage of motivated respondents in Company D.
e) Calculate the percentage of little motivated respondents in Company C.
f) Among the respondents who are very motivated, determine the percentage of those who work for Company B.
g) Verify if there are indications of dependence between the variables.
h) Confirm the previous item using the χ² statistic.
i) Calculate the Phi, contingency, and Cramer's V coefficients, confirming whether there is an association between the variables or not.
8) The files Students_Evaluation.sav and Students_Evaluation.dta show the grades, from 0 to 10, of 100 students from a public university in the following subjects: Operational Research, Statistics, Operations Management, and Finance. Check whether there is a correlation between the following pairs of variables, constructing the scatter plot and calculating Pearson's correlation coefficient:
a) Operational Research and Statistics;
b) Operations Management and Finance;
c) Operational Research and Operations Management.
9) The files Brazilian_Supermarkets.sav and Brazilian_Supermarkets.dta show revenue data and the number of stores of the 20 largest Brazilian supermarket chains in a given year (source: ABRAS - Brazilian Association of Supermarkets). We would like you to:
a) Create the scatter plot for the variables revenue × number of stores.
b) Calculate Pearson's correlation coefficient between the two variables.
c) Exclude the four largest supermarket chains in terms of revenue, as well as the chain AM/PM Food and Beverages Ltd., and once again create the scatter plot.
d) Once again, calculate Pearson's correlation coefficient between the two variables being studied.
Chapter 5
Introduction to Probability
Do you want to sell sugar water for the rest of your life, or do you want to come with me and change the world?
Steve Jobs
5.1
INTRODUCTION
In the previous part of this book, we studied descriptive statistics, which describes and summarizes the main characteristics observed in a dataset through frequency distribution tables, charts, graphs, and summary measures, allowing the researcher to have a better understanding of the data. Probabilistic statistics, on the other hand, uses probability theory to explain how often certain uncertain events happen, in order to estimate or predict the occurrence of future events. For example, when rolling a die, we do not know for sure which value will appear, so probability can be used to indicate how likely a certain event is to occur. According to Bruni (2011), the history of probability presumably started with cave men, who needed to understand nature's uncertain phenomena better. In the 17th century, probability theory appeared to explain uncertain events, and the study of probability evolved to help plan moves and develop strategies for gambling. Currently, it is also applied to the study of statistical inference, in order to generalize results from a sample to the population. This chapter has as its main objective to present the concepts and terminologies related to probability theory, as well as their practical application.
5.2 TERMINOLOGY AND CONCEPTS
5.2.1 Random Experiment
An experiment consists of any observation or measurement process. A random experiment is one that generates unpredictable results, so, even if the process is repeated several times, it is impossible to predict the result. Flipping a coin and rolling a die are examples of random experiments.
5.2.2
Sample Space
Sample space S consists of all the possible results of an experiment. For example, when flipping a coin, we can get head (H) or tail (T). Therefore, S ¼ {H, T}. On the other hand, when rolling a die, the sample space is represented by S ¼ {1, 2, 3, 4, 5, 6}.
5.2.3
Events
An event is any subset of a sample space. For example, event A only contains the even occurrences of rolling a die. Therefore, A ¼ {2, 4, 6}.
5.2.4
Unions, Intersections, and Complements
Two or more events can form unions, intersections, and complements. The union of two events A and B, represented by A ∪ B, results in a new event containing all the elements of A, B, or both, and can be illustrated according to Fig. 5.1.
Data Science for Business and Decision Making. https://doi.org/10.1016/B978-0-12-811216-8.00005-7 © 2019 Elsevier Inc. All rights reserved.
PART III Probabilistic Statistics
FIG. 5.1 Union of two events (A [ B).
The intersection of two events A and B, represented by A ∩ B, results in a new event containing all the elements that are simultaneously in A and B, and can be illustrated according to Fig. 5.2. The complement of an event A, represented by A^c, is the event that contains all the points of S that are not in A, as shown in Fig. 5.3.
5.2.5
Independent Events
Two events A and B are independent when the probability of B happening does not depend on whether event A has happened. The concept of conditional probability will be discussed in Section 5.5.
5.2.6
Mutually Exclusive Events
Mutually exclusive events are those that do not have any elements in common, so they cannot happen simultaneously. Fig. 5.4 illustrates two events A and B that are mutually exclusive.
FIG. 5.2 Intersection of two events (A \ B).
FIG. 5.3 Complement of event A.
FIG. 5.4 Events A and B that are mutually exclusive.
5.3 DEFINITION OF PROBABILITY
The probability of a certain event A happening in sample space S is given by the ratio between the number of cases favorable to the event (n_A) and the total number of possible cases (n):

P(A) = \frac{n_A}{n} = \frac{\text{number of cases favorable to event A}}{\text{total number of possible cases}}   (5.1)
Example 5.1
When rolling a die, what is the probability of getting an even number?
Solution
The sample space is given by S = {1, 2, 3, 4, 5, 6}. The event we are interested in is A = {even numbers on a die}, so A = {2, 4, 6}. Therefore, the probability of A happening is:

P(A) = 3/6 = 1/2
Example 5.2
A gravity-pick machine contains three white balls, two red balls, four yellow balls, and two black balls. What is the probability of a red ball being drawn?
Solution
Given a total of 11 balls and considering A = {the ball is red}, the probability is:

P(A) = \frac{\text{number of red balls}}{\text{total number of balls}} = \frac{2}{11}

5.4 BASIC PROBABILITY RULES
5.4.1 Probability Variation Field
The probability of an event A happening is a number between 0 and 1:

0 ≤ P(A) ≤ 1   (5.2)

5.4.2 Probability of the Sample Space
Sample space S has probability equal to 1:

P(S) = 1   (5.3)

5.4.3 Probability of an Empty Set
The probability of an empty set (∅) occurring is null:

P(∅) = 0   (5.4)
5.4.4
Probability Addition Rule
The probability of event A, event B, or both happening can be calculated as follows:

P(A ∪ B) = P(A) + P(B) − P(A ∩ B)   (5.5)

If events A and B are mutually exclusive, that is, A ∩ B = ∅, the probability of one of them happening is equal to the sum of the individual probabilities:

P(A ∪ B) = P(A) + P(B)   (5.6)

Expression (5.6) can be extended to n events (A1, A2, …, An) that are mutually exclusive:

P(A1 ∪ A2 ∪ ⋯ ∪ An) = P(A1) + P(A2) + ⋯ + P(An)   (5.7)

5.4.5
Probability of a Complementary Event
If A^c is A's complementary event, then:

P(A^c) = 1 − P(A)   (5.8)

5.4.6
Probability Multiplication Rule for Independent Events
If A and B are two independent events, the probability of them happening together is equal to the product of their individual probabilities:

P(A ∩ B) = P(A) × P(B)   (5.9)

Expression (5.9) can be extended to n independent events (A1, A2, …, An):

P(A1 ∩ A2 ∩ … ∩ An) = P(A1) × P(A2) × … × P(An)   (5.10)
Example 5.3
A gravity-pick machine contains balls with numbers 1 through 60 that have the same probability of being drawn. We would like you to:
a) Define the sample space.
b) Calculate the probability of a ball with an odd number on it being drawn.
c) Calculate the probability of a ball with a multiple of 5 on it being drawn.
d) Calculate the probability of a ball with an odd number or with a multiple of 5 on it being drawn.
e) Calculate the probability of a ball with a multiple of 7 or a multiple of 10 on it being drawn.
f) Calculate the probability of a ball that does not have a multiple of 5 on it being drawn.
g) One ball is drawn randomly and put back into the gravity-pick machine. A new ball will be drawn. Calculate the probability of the first ball having an even number on it and the second one a number greater than 40.
Solution
a) S = {1, 2, 3, …, 60}.
b) A = {1, 3, 5, …, 59}, P(A) = 30/60 = 1/2.
c) A = {5, 10, 15, …, 60}, P(A) = 12/60 = 1/5.
d) Here A = {1, 3, 5, …, 59} and B = {5, 10, 15, …, 60}. Since A and B are not mutually exclusive events, because they have common elements (5, 15, 25, 35, 45, 55), we apply Expression (5.5):
P(A ∪ B) = P(A) + P(B) − P(A ∩ B) = 1/2 + 1/5 − 6/60 = 3/5
e) In this case, A = {7, 14, 21, 28, 35, 42, 49, 56} and B = {10, 20, 30, 40, 50, 60}. Since the events are mutually exclusive (A ∩ B = ∅), we apply Expression (5.6):
P(A ∪ B) = P(A) + P(B) = 8/60 + 6/60 = 7/30
f) In this case, A = {multiples of 5} and A^c = {numbers that are not multiples of 5}. Therefore, the probability of the complementary event A^c happening is:
P(A^c) = 1 − P(A) = 1 − 1/5 = 4/5
g) Since the events are independent, we apply Expression (5.9):
P(A ∩ B) = P(A) × P(B) = 1/2 × 20/60 = 1/6
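Items such as (b), (c), (d), and (f) above can be checked in Python with sets and exact fractions; classical probability (expression (5.1)) is just the size of the event divided by the size of the sample space, and the addition and complement rules fall out of the set operations:

```python
# Verifying Example 5.3 by enumeration.
from fractions import Fraction

S = set(range(1, 61))                  # sample space: balls 1..60
odd = {n for n in S if n % 2 == 1}     # event of item (b)
mult5 = {n for n in S if n % 5 == 0}   # event of item (c)

def p(event):
    # Expression (5.1): favorable cases over possible cases
    return Fraction(len(event), len(S))

print(p(odd))          # → 1/2  (item b)
print(p(mult5))        # → 1/5  (item c)
print(p(odd | mult5))  # → 3/5  (item d: the union removes the double count)
print(1 - p(mult5))    # → 4/5  (item f: complement rule)
```

The union operator handles the overlapping elements automatically, which is exactly what the subtraction of P(A ∩ B) in expression (5.5) does analytically.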
5.5
CONDITIONAL PROBABILITY
When events are not independent, we must use the concept of conditional probability. Considering two events A and B, the probability of A happening, given that B has already happened, is called the conditional probability of A given B, and is represented by P(A | B):

P(A | B) = P(A ∩ B) / P(B)   (5.11)

An event A is considered independent of B if:

P(A | B) = P(A)   (5.12)

Example 5.4
A die is rolled. What is the probability of getting number 4, given that the number drawn was an even number?
Solution
In this case, A = {number 4} and B = {an even number}. Applying Expression (5.11), we have:

P(A | B) = P(A ∩ B) / P(B) = (1/6) / (1/2) = 1/3

5.5.1
Probability Multiplication Rule
From the definition of conditional probability, the multiplication rule allows the researcher to calculate the probability of the simultaneous occurrence of two events A and B as the probability of one of them multiplied by the conditional probability of the other, given that the first event has occurred:

P(A ∩ B) = P(A) × P(B | A) = P(B) × P(A | B)   (5.13)

The multiplication rule can be extended to three events A, B, and C:

P(A ∩ B ∩ C) = P(A) × P(B | A) × P(C | A ∩ B)   (5.14)

This is only one of the six ways in which Expression (5.14) can be written.
Example 5.5
A gravity-pick machine contains eight white balls, six red balls, and four black balls. Initially, we draw a ball that is not put back into the gravity-pick machine. A new ball will be drawn. What is the probability of both balls being red?
Solution
Differently from the previous example, which calculated the conditional probability of a single event, the objective in this case is to calculate the probability of two events occurring simultaneously. The events are also not independent, since the first ball is not put back into the gravity-pick machine. If event A = {the first ball is red} and B = {the second ball is red}, to calculate P(A ∩ B), we must apply Expression (5.13):

P(A ∩ B) = P(A) × P(B | A) = 6/18 × 5/17 = 5/51
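A quick way to double-check this kind of dependent-draw calculation is to enumerate every ordered pair of draws without replacement; a sketch with exact fractions:

```python
# Example 5.5 done two ways: the multiplication rule for dependent
# events (expression (5.13)) and brute-force enumeration.
from fractions import Fraction
from itertools import permutations

balls = ["W"] * 8 + ["R"] * 6 + ["B"] * 4   # the 18 balls in the machine

# Multiplication rule: P(A ∩ B) = P(A) · P(B | A)
p_rule = Fraction(6, 18) * Fraction(5, 17)

# Enumerate all ordered draws of two distinct balls (18 × 17 pairs)
draws = list(permutations(range(18), 2))
both_red = sum(1 for i, j in draws if balls[i] == "R" and balls[j] == "R")
p_enum = Fraction(both_red, len(draws))

print(p_rule, p_enum)   # both equal 5/51
```

The agreement between the two results illustrates why expression (5.13) works: conditioning on the first draw is equivalent to counting over the reduced set of 17 remaining balls.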
Example 5.6
A company will give a car to one of its customers (who are located in different regions of Brazil). Table 5.E.1 shows the data regarding these customers, in terms of gender and city. Determine:
a) What is the probability of a male customer being drawn?
b) What is the probability of a female customer being drawn?
c) What is the probability of a customer from Curitiba being drawn?
d) What is the probability of a customer from Sao Paulo being drawn, given that it is a male customer?
e) What is the probability of a female customer being drawn, given that it is a customer from Aracaju?
f) What is the probability of a female customer from Salvador being drawn?
TABLE 5.E.1 Absolute Frequency Distribution According to Gender and City

City             Male   Female   Total
Goiania            12       14      26
Aracaju             8       12      20
Salvador           16       15      31
Curitiba           24       22      46
Sao Paulo          35       25      60
Belo Horizonte     10       12      22
Total             105      100     205
Solution
a) The probability of the customer being a man is 105/205 = 21/41.
b) The probability of the customer being a woman is 100/205 = 20/41.
c) The probability of the customer being from Curitiba is 46/205.
d) Considering that A = {Sao Paulo} and B = {male}, P(A | B) is calculated according to Expression (5.11):
P(A | B) = P(A ∩ B) / P(B) = (35/205) / (105/205) = 1/3
e) Considering that A = {female} and B = {Aracaju}, P(A | B) is:
P(A | B) = P(A ∩ B) / P(B) = (12/205) / (20/205) = 3/5
f) If A = {Salvador} and B = {female}, P(A ∩ B) is calculated according to Expression (5.13):
P(A ∩ B) = P(A) × P(B | A) = 31/205 × 15/31 = 3/41

5.6
BAYES’ THEOREM
Imagine that the probability of a certain event was calculated. However, new information was added to the process, so the probability must be recalculated. The probability calculated initially is called the a priori probability; the probability with the recently added information is called the a posteriori probability. The calculation of the a posteriori probability is based on Bayes' Theorem and is described here. Consider B1, B2, …, Bn mutually exclusive events, with P(B1) + P(B2) + … + P(Bn) = 1. A, on the other hand, is any given event that will happen jointly with, or as a consequence of, one of the Bi events (i = 1, 2, …, n). The probability of a Bi event happening, given that event A has already happened, is calculated as follows:

P(B_i | A) = \frac{P(B_i ∩ A)}{P(A)} = \frac{P(B_i) \cdot P(A | B_i)}{P(B_1) \cdot P(A | B_1) + P(B_2) \cdot P(A | B_2) + \cdots + P(B_n) \cdot P(A | B_n)}   (5.15)

where:
P(Bi) is the a priori probability;
P(Bi | A) is the a posteriori probability (the probability of Bi after A has happened).
Example 5.7
Consider three identical gravity-pick machines U1, U2, and U3. Gravity-pick machine U1 contains two balls, one yellow and one red. Gravity-pick machine U2, on the other hand, contains three blue balls, while machine U3 contains two red balls and one yellow ball. We select one of the gravity-pick machines at random and draw one ball, and we can see that the ball chosen is yellow. What is the probability of gravity-pick machine U1 having been chosen?
Solution
Let's define the following events:
B1 = choosing gravity-pick machine U1; B2 = choosing gravity-pick machine U2; B3 = choosing gravity-pick machine U3; A = choosing the yellow ball.
The objective is to calculate P(B1 | A), knowing that:
P(B1) = 1/3, P(A | B1) = 1/2
P(B2) = 1/3, P(A | B2) = 0
P(B3) = 1/3, P(A | B3) = 1/3
Therefore, we have:

P(B_1 | A) = \frac{P(B_1) \cdot P(A | B_1)}{P(B_1) \cdot P(A | B_1) + P(B_2) \cdot P(A | B_2) + P(B_3) \cdot P(A | B_3)} = \frac{(1/3)(1/2)}{(1/3)(1/2) + (1/3)(0) + (1/3)(1/3)} = \frac{1/6}{5/18} = \frac{3}{5}
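Expression (5.15) applied to this example in Python, with exact fractions; the priors and likelihoods are exactly the ones listed in the solution above:

```python
# Example 5.7 via Bayes' Theorem (expression (5.15)).
from fractions import Fraction

priors = [Fraction(1, 3)] * 3                                # P(B1), P(B2), P(B3)
likelihoods = [Fraction(1, 2), Fraction(0), Fraction(1, 3)]  # P(A | Bi)

# Denominator of (5.15): total probability of drawing a yellow ball, P(A)
evidence = sum(p * l for p, l in zip(priors, likelihoods))

# A posteriori probability that machine U1 was chosen
posterior_u1 = priors[0] * likelihoods[0] / evidence
print(posterior_u1)   # → 3/5
```

Note how the a priori probability of U1 (1/3) is revised upward to 3/5 once the yellow ball is observed: U2 contains no yellow balls, so observing yellow rules it out entirely.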
5.7
COMBINATORIAL ANALYSIS
Combinatorial analysis is a set of procedures that calculates the number of different groups that can be formed by selecting a finite number of elements from a set. Arrangements, combinations, and permutations are the three main types of configurations, and they are applicable to probability calculations. The probability of an event is, therefore, the ratio between the number of results of the event we are interested in and the total number of results in the sample space (total number of arrangements, combinations, or permutations).
5.7.1
Arrangements
An arrangement calculates the number of possible configurations with distinct elements from a certain set. Bruni (2011) defines an arrangement as the study of the number of ways in which the researcher can organize a sample of objects, removed from a larger population, when the order of the organized objects is relevant. Given n different objects, if the objective is to select p of these objects (n and p are integers, n ≥ p), the number of arrangements, or possible ways of doing this, is represented by A_{n,p} and calculated as follows:

A_{n,p} = \frac{n!}{(n - p)!}   (5.16)
Example 5.8
Consider a set with three elements, A = {1, 2, 3}. If these elements were taken 2 by 2, how many arrangements would be possible? What is the probability of element 3 being in the second position?
Solution
From Expression (5.16), we have:

A_{3,2} = \frac{3!}{(3 - 2)!} = \frac{3 × 2 × 1}{1} = 6

These arrangements are (1, 2), (1, 3), (2, 1), (2, 3), (3, 1), and (3, 2). In an arrangement, the order in which the elements are organized is relevant; for example, (1, 2) ≠ (2, 1). After defining all the arrangements, it is easy to calculate the probability. Since we have two arrangements in which element 3 is in the second position, given that the total number of arrangements is 6, the probability is 2/6 = 1/3.
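Python's itertools.permutations generates exactly these ordered arrangements, so both the count A(3, 2) and the probability can be verified directly:

```python
# Example 5.8 by enumeration: permutations(iterable, p) yields the
# A(n, p) ordered arrangements, where order matters: (1, 2) != (2, 1).
from itertools import permutations
from fractions import Fraction

arrangements = list(permutations([1, 2, 3], 2))
print(len(arrangements))   # → 6, i.e., A(3, 2)

# Arrangements with element 3 in the second position
favorable = [a for a in arrangements if a[1] == 3]
print(Fraction(len(favorable), len(arrangements)))   # → 1/3
```

Despite its name, itertools.permutations with a length argument produces what this chapter calls arrangements; the full permutations of Section 5.7.3 are the special case p = n.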
Example 5.9
Calculate the number of ways in which it is possible to park six vehicles in three parking spaces. What is the probability of vehicle 1 being in the first parking space?
Solution
Through Expression (5.16), we have:

A_{6,3} = \frac{6!}{(6 - 3)!} = \frac{6 × 5 × 4 × 3!}{3!} = 120

From the 120 possible arrangements, in 20 of them vehicle 1 is in the first position: (1, 2, 3), (1, 2, 4), (1, 2, 5), (1, 2, 6), (1, 3, 2), (1, 3, 4), (1, 3, 5), (1, 3, 6), (1, 4, 2), (1, 4, 3), (1, 4, 5), (1, 4, 6), (1, 5, 2), (1, 5, 3), (1, 5, 4), (1, 5, 6), (1, 6, 2), (1, 6, 3), (1, 6, 4), (1, 6, 5). Therefore, the probability is 20/120 = 1/6.
5.7.2
Combinations
Combinations are a special case of arrangements in which the order in which the elements are organized does not matter. Given n different objects, the number of ways, or combinations, in which to organize p of these objects is represented by C_{n,p} (a combination of n elements taken p at a time), and calculated as follows:

C_{n,p} = \binom{n}{p} = \frac{n!}{p!(n - p)!}   (5.17)
Example 5.10
How many different ways can we form groups of four students in a class with 20 students?
Solution
Since the order of the elements in the group is not relevant, we must apply Expression (5.17):

C_{20,4} = \binom{20}{4} = \frac{20!}{4!(20 - 4)!} = \frac{20 × 19 × 18 × 17}{24} = 4,845

Thus, 4,845 different groups can be formed.
Example 5.11
Marcelo, Felipe, Luiz Paulo, Rodrigo, and Ricardo went to an amusement park to have fun. The ride they chose to go on next only has three seats, so only three of them will be chosen randomly. What is the probability of Felipe and Luiz Paulo being on that ride?
Solution
The total number of combinations is:

C_{5,3} = \binom{5}{3} = \frac{5!}{3! 2!} = 10

The 10 possibilities are:
Group 1: Marcelo, Felipe, and Luiz Paulo
Group 2: Marcelo, Felipe, and Rodrigo
Group 3: Marcelo, Felipe, and Ricardo
Group 4: Marcelo, Luiz Paulo, and Rodrigo
Group 5: Marcelo, Luiz Paulo, and Ricardo
Group 6: Marcelo, Rodrigo, and Ricardo
Group 7: Felipe, Luiz Paulo, and Rodrigo
Group 8: Felipe, Luiz Paulo, and Ricardo
Group 9: Felipe, Rodrigo, and Ricardo
Group 10: Luiz Paulo, Rodrigo, and Ricardo
Felipe and Luiz Paulo appear together in Groups 1, 7, and 8. Therefore, the probability is 3/10.
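The same count and probability can be verified with itertools.combinations, which already treats any reordering of a group as the same group, matching expression (5.17):

```python
# Example 5.11 by enumeration: combinations ignores order, so each
# group of 3 friends appears exactly once among the C(5, 3) results.
from itertools import combinations
from fractions import Fraction

friends = ["Marcelo", "Felipe", "Luiz Paulo", "Rodrigo", "Ricardo"]
groups = list(combinations(friends, 3))
print(len(groups))   # → 10, i.e., C(5, 3)

# Groups that contain both Felipe and Luiz Paulo
both_on_ride = [g for g in groups if "Felipe" in g and "Luiz Paulo" in g]
print(Fraction(len(both_on_ride), len(groups)))   # → 3/10
```

Counting analytically gives the same answer: fixing Felipe and Luiz Paulo leaves one seat for any of the 3 remaining friends, so 3 favorable groups out of 10.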
5.7.3
Permutations
Permutation is an arrangement in which all the elements in the set are selected. Therefore, it is the number of ways in which n elements can be grouped by changing their order. The number of possible permutations is represented by P_n and can be calculated as follows:

P_n = n!   (5.18)
Example 5.12
Consider a set with three elements, A = {1, 2, 3}. What is the total number of permutations possible?
Solution
P_3 = 3! = 3 × 2 × 1 = 6. They are (1, 2, 3), (1, 3, 2), (2, 1, 3), (2, 3, 1), (3, 1, 2), and (3, 2, 1).
Example 5.13
A certain factory manufactures six different products. How many different ways can the production sequence occur?
Solution
To determine the number of possible production sequences, we just need to apply Expression (5.18):

P_6 = 6! = 6 × 5 × 4 × 3 × 2 × 1 = 720
5.8
FINAL REMARKS
This chapter discussed the concepts and terminology related to probability theory, as well as their practical application. Probability theory is used to assess the possibility of uncertain events happening; it originated in attempts to understand uncertain natural phenomena, evolved through the planning of games of chance, and is currently applied to the study of statistical inference.
5.9
EXERCISES
1) Two soccer teams will play overtime until the Golden Goal is scored. Define the sample space.
2) What is the difference between mutually exclusive events and independent events?
3) In a deck with 52 cards, determine:
a. The probability of a card of hearts being drawn;
b. The probability of a queen being drawn;
c. The probability of a face card (jack, queen, or king) being drawn;
d. The probability of any card, except a face card, being drawn.
4) A production batch contains 240 parts and 12 of them are defective. One part is drawn randomly. What is the probability of this part being defective?
5) A number between 1 and 30 is chosen randomly. We would like you to:
a. Define the sample space.
b. What is the probability of this number being divisible by 3?
c. What is the probability of this number being a multiple of 5?
d. What is the probability of this number being divisible by 3 or a multiple of 5?
e. What is the probability of this number being even, given that it is a multiple of 5?
f. What is the probability of this number being a multiple of 5, given that it is divisible by 3?
g. What is the probability of this number not being divisible by 3?
h. Assuming that two numbers are chosen randomly, what is the probability of the first number being a multiple of 5 and the second one an odd number?
PART III Probabilistic Statistics
6) Two dice are rolled simultaneously. Determine:
a. The sample space.
b. What is the probability of both numbers being even?
c. What is the probability of the sum of the numbers being 10?
d. What is the probability of the product of the numbers being 6?
e. What is the probability of the sum of the numbers being 10 or 6?
f. What is the probability of the number drawn on the first die being odd, or of the number drawn on the second die being a multiple of 3?
g. What is the probability of the number drawn on the first die being even, or of the number drawn on the second die being a multiple of 4?
7) What is the difference between arrangements, combinations, and permutations?
Chapter 6
Random Variables and Probability Distributions

What we call chance can only be the unknown cause of a known effect.
Voltaire
6.1
INTRODUCTION
In Chapters 3 and 4, we discussed several statistics to describe the behavior of quantitative and qualitative data, including sample frequency distributions. In this chapter, we are going to study population probability distributions (for quantitative variables). The frequency distribution of a sample is an estimate of the corresponding population probability distribution. When the sample size is large, the sample frequency distribution approximately follows the population probability distribution (Martins and Domingues, 2011). According to the authors, the study of descriptive statistics is essential for empirical research, as well as for solving several practical problems. However, when the main goal is to study a population's variables, the probability distribution is more suitable. This chapter discusses the concept of discrete and continuous random variables, the main probability distributions for each type of random variable, and the calculation of the expected value and the variance of each probability distribution. For discrete random variables, the most common probability distributions are the discrete uniform, Bernoulli, binomial, geometric, negative binomial, hypergeometric, and Poisson. For continuous random variables, we are going to study the uniform, normal, exponential, gamma, chi-square (χ²), Student's t, and Snedecor's F distributions.
6.2
RANDOM VARIABLES
As studied in the previous chapter, the set of all possible results of a random experiment is called the sample space. To describe a random experiment, it is convenient to associate numerical values to the elements of the sample space. A random variable can be characterized as a variable that presents a single value for each element, and this value is determined randomly. Assume that e is a random experiment and S is the sample space associated with this experiment. A function X that associates a real number X(s) to each element s ∈ S is called a random variable. Random variables can be discrete or continuous.
6.2.1
Discrete Random Variable
A discrete random variable can only take on countable numbers of distinct values, usually counts. Therefore, it cannot assume decimal or noninteger values. As examples of discrete random variables, we can mention the number of children in a family, the number of employees in a company, or the number of vehicles produced in a certain factory.
6.2.1.1 Expected Value of a Discrete Random Variable
Let X be a discrete random variable that can take on the values {x1, x2, …, xn} with the respective probabilities {p(x1), p(x2), …, p(xn)}. The function {xi, p(xi), i = 1, 2, …, n} is called the probability function of random variable X, and associates, to each value xi, its probability of occurrence:

p(xi) = P(X = xi) = pi,  i = 1, 2, …, n    (6.1)

so that p(xi) ≥ 0 for every xi and Σ_{i=1}^{n} p(xi) = 1.

Data Science for Business and Decision Making. https://doi.org/10.1016/B978-0-12-811216-8.00006-9 © 2019 Elsevier Inc. All rights reserved.
The expected or average value of X is given by the expression:

E(X) = Σ_{i=1}^{n} xi·P(X = xi) = Σ_{i=1}^{n} xi·pi    (6.2)
Expression (6.2) is similar to the one used for the mean in Chapter 3, in which, instead of the probabilities pi, we had the relative frequencies Fri. The difference between pi and Fri is that the former corresponds to the values from an assumed theoretical model and the latter to the variable values observed. Since pi and Fri have the same interpretation, all of the measures and charts presented in Chapter 3, based on the distribution of Fri, have a corresponding one in the distribution of a random variable. The same interpretation is valid for other measures of position and variability, such as the median and the standard deviation (Bussab and Morettin, 2011).
6.2.1.2 Variance of a Discrete Random Variable
The variance of a discrete random variable X is a weighted mean of the squared distances between the values that X can take on and X's expected value, where the weights are the probabilities of the possible values of X. If X assumes the values {x1, x2, …, xn} with the respective probabilities {p1, p2, …, pn}, then its variance is given by:

Var(X) = σ²(X) = E[(X − E(X))²] = Σ_{i=1}^{n} [xi − E(X)]²·pi    (6.3)
In some cases, it is convenient to use the standard deviation of a random variable as a measure of variability. The standard deviation of X is the square root of the variance:

σ(X) = √Var(X)    (6.4)

Example 6.1 Assume that the monthly real estate sales for a certain real estate agent follow the probability distribution seen in Table 6.E.1. Determine the expected value of monthly sales, as well as its variance.

TABLE 6.E.1 Monthly Real Estate Sales and Their Respective Probabilities

xi (sales):  0     1     2     3
p(xi):       2/10  4/10  3/10  1/10

Solution The expected value of monthly sales is:

E(X) = 0 × 0.20 + 1 × 0.40 + 2 × 0.30 + 3 × 0.10 = 1.3

The variance can be calculated as:

Var(X) = (0 − 1.3)² × 0.2 + (1 − 1.3)² × 0.4 + (2 − 1.3)² × 0.3 + (3 − 1.3)² × 0.1 = 0.81
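Expressions (6.2) and (6.3) translate directly into a few lines of Python (a sketch, not from the book; the data are those of Example 6.1):

```python
# Expected value and variance of a discrete random variable,
# following Expressions (6.2) and (6.3).
x = [0, 1, 2, 3]           # possible values (monthly sales)
p = [0.2, 0.4, 0.3, 0.1]   # their probabilities

mean = sum(xi * pi for xi, pi in zip(x, p))
var = sum((xi - mean) ** 2 * pi for xi, pi in zip(x, p))

print(round(mean, 2))  # 1.3
print(round(var, 2))   # 0.81
```
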
6.2.1.3 Cumulative Distribution Function of a Discrete Random Variable
The cumulative distribution function (c.d.f.) of a random variable X, denoted by F(x), corresponds to the sum of the probabilities of the values xi that are less than or equal to x:

F(x) = P(X ≤ x) = Σ_{xi ≤ x} p(xi)    (6.5)
The following properties are valid for the cumulative distribution function of a discrete random variable:

0 ≤ F(x) ≤ 1    (6.6)

lim_{x→+∞} F(x) = 1    (6.7)

lim_{x→−∞} F(x) = 0    (6.8)

a < b → F(a) ≤ F(b)    (6.9)
Example 6.2 For the data in Example 6.1, calculate F(0.5), F(1), F(2.5), F(3), F(4), and F(−0.5).
Solution
a) F(0.5) = P(X ≤ 0.5) = 2/10
b) F(1) = P(X ≤ 1) = 2/10 + 4/10 = 6/10
c) F(2.5) = P(X ≤ 2.5) = 2/10 + 4/10 + 3/10 = 9/10
d) F(3) = P(X ≤ 3) = 2/10 + 4/10 + 3/10 + 1/10 = 1
e) F(4) = P(X ≤ 4) = 1
f) F(−0.5) = P(X ≤ −0.5) = 0
In short, the cumulative distribution function of random variable X in Example 6.1 is given by:

F(x) = 0, if x < 0; 2/10, if 0 ≤ x < 1; 6/10, if 1 ≤ x < 2; 9/10, if 2 ≤ x < 3; 1, if x ≥ 3
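Expression (6.5) is just a filtered sum, so the c.d.f. of Example 6.2 can be written as a small function (an illustration, not from the book):

```python
# Cumulative distribution function F(x) = P(X <= x) for the discrete
# variable of Example 6.1, following Expression (6.5).
x = [0, 1, 2, 3]
p = [0.2, 0.4, 0.3, 0.1]

def F(value):
    # sum the probabilities of all x_i less than or equal to `value`
    return sum(pi for xi, pi in zip(x, p) if xi <= value)

for v in (0.5, 1, 2.5, 3, 4, -0.5):
    print(f"F({v}) = {F(v):.1f}")
```

Note that F is a step function: it jumps at each possible value of X and is flat in between, which is exactly the "staircase" shape described for discrete variables.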
6.2.2
Continuous Random Variable
A continuous random variable can take on any value in an interval of real numbers (uncountably many values). As examples of continuous random variables, we can mention a family's income, the revenue of a company, or the height of a certain child. A continuous random variable X is associated with a function f(x), called the probability density function (p.d.f.) of X, which meets the following condition:

∫_{−∞}^{+∞} f(x) dx = 1,  f(x) ≥ 0    (6.10)
For any a and b, such that −∞ < a < b < +∞, the probability of random variable X taking on values within this interval is:

P(a ≤ X ≤ b) = ∫_a^b f(x) dx    (6.11)

which can be graphically represented as shown in Fig. 6.1.

FIG. 6.1 Probability of X assuming values within the interval [a, b].
6.2.2.1 Expected Value of a Continuous Random Variable
The expected or average value of a continuous random variable X with a probability density function f(x) is given by the expression:

E(X) = ∫_{−∞}^{+∞} x·f(x) dx    (6.12)
6.2.2.2 Variance of a Continuous Random Variable
The variance of a continuous random variable X with a probability density function f(x) is calculated as:

Var(X) = E(X²) − [E(X)]² = ∫_{−∞}^{+∞} (x − E(X))²·f(x) dx    (6.13)
Example 6.3 The probability density function of a continuous random variable X is given by:

f(x) = 2x, if 0 < x < 1; 0, for any other values

Calculate E(X) and Var(X).
Solution

E(X) = ∫_0^1 x·2x dx = ∫_0^1 2x² dx = 2/3

E(X²) = ∫_0^1 x²·2x dx = ∫_0^1 2x³ dx = 1/2

Var(X) = E(X²) − [E(X)]² = 1/2 − (2/3)² = 1/18
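The integrals of Example 6.3 can be checked numerically without any symbolic machinery (a sketch, not from the book): a midpoint Riemann sum approximates the integrals in (6.12) and (6.13) for f(x) = 2x on (0, 1), where the exact answers are E(X) = 2/3 and Var(X) = 1/18.

```python
# Numerical check of Example 6.3 via midpoint Riemann sums.
n = 100_000
dx = 1.0 / n
xs = [(i + 0.5) * dx for i in range(n)]  # midpoints of each subinterval

def f(x):
    return 2 * x  # probability density function on (0, 1)

mean = sum(x * f(x) * dx for x in xs)            # approximates E(X)
second_moment = sum(x * x * f(x) * dx for x in xs)  # approximates E(X^2)
var = second_moment - mean**2                    # Var(X) = E(X^2) - E(X)^2

print(round(mean, 4))  # 0.6667
print(round(var, 4))   # 0.0556
```
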
6.2.2.3 Cumulative Distribution Function of a Continuous Random Variable
As in the discrete case, we can calculate the probabilities associated to a continuous random variable X from a cumulative distribution function. The cumulative distribution function F(x) of a continuous random variable X with probability density function f(x) is defined by:

F(x) = P(X ≤ x),  −∞ < x < ∞    (6.14)

Expression (6.14) is similar to the one presented for the discrete case, in Expression (6.5). The difference is that, for continuous variables, the cumulative distribution function is a continuous function, without jumps. From (6.11) we have:

F(x) = ∫_{−∞}^{x} f(t) dt    (6.15)
As in the discrete case, the following properties are valid for the cumulative distribution function of a continuous random variable:

0 ≤ F(x) ≤ 1    (6.16)

lim_{x→+∞} F(x) = 1    (6.17)

lim_{x→−∞} F(x) = 0    (6.18)

a < b → F(a) ≤ F(b)    (6.19)
Example 6.4 Once again, let us consider the probability density function in Example 6.3:

f(x) = 2x, if 0 < x < 1; 0, for any other values

Calculate the cumulative distribution function of X.
Solution

F(x) = P(X ≤ x) = ∫_{−∞}^{x} f(t) dt = 0, if x ≤ 0; x², if 0 < x ≤ 1; 1, if x > 1

6.3
PROBABILITY DISTRIBUTIONS FOR DISCRETE RANDOM VARIABLES
For discrete random variables, the most common probability distributions are the discrete uniform, Bernoulli, binomial, geometric, negative binomial, hypergeometric, and Poisson.
6.3.1
Discrete Uniform Distribution
It is the simplest discrete probability distribution, and it receives the name uniform because all of the possible values of the random variable have the same probability of occurrence. A discrete random variable X that takes on the values x1, x2, …, xn has a discrete uniform distribution with parameter n, denoted by X ~ Ud{x1, x2, …, xn}, if its probability function is given by:

P(X = xi) = p(xi) = 1/n,  i = 1, 2, …, n    (6.20)

which may be graphically represented as shown in Fig. 6.2.

FIG. 6.2 Discrete uniform distribution.

The mathematical expected value of X is given by:

E(X) = (1/n)·Σ_{i=1}^{n} xi    (6.21)
The variance of X is calculated as:

Var(X) = (1/n)·[Σ_{i=1}^{n} xi² − (Σ_{i=1}^{n} xi)²/n]    (6.22)

And the cumulative distribution function (c.d.f.) is:

F(x) = P(X ≤ x) = Σ_{xi ≤ x} 1/n = n(x)/n,    (6.23)

where n(x) is the number of xi ≤ x, as shown in Fig. 6.3.
FIG. 6.3 Cumulative distribution function.
Example 6.5 A fair, balanced die is thrown and random variable X represents the value on the face facing up. Determine the distribution of X, in addition to X's expected value and variance. Solution The distribution of X is shown in Table 6.E.2.

TABLE 6.E.2 Distribution of X

X:    1    2    3    4    5    6    Sum
f(x): 1/6  1/6  1/6  1/6  1/6  1/6  1

Therefore, we have:

E(X) = (1/6)·(1 + 2 + 3 + 4 + 5 + 6) = 3.5

Var(X) = (1/6)·[1² + 2² + ⋯ + 6² − (21)²/6] = 35/12 = 2.917
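Expressions (6.21) and (6.22) for the fair die can be checked in a few lines (an illustration, not from the book):

```python
# Discrete uniform distribution for the fair die of Example 6.5,
# using Expressions (6.21) and (6.22).
faces = [1, 2, 3, 4, 5, 6]
n = len(faces)

mean = sum(faces) / n
var = (sum(x**2 for x in faces) - sum(faces) ** 2 / n) / n

print(mean)           # 3.5
print(round(var, 3))  # 2.917
```
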
6.3.2
Bernoulli Distribution
The Bernoulli trial is a random experiment that offers only two possible results, conventionally called success or failure. As an example of a Bernoulli trial, we can mention tossing a coin, whose only possible results are heads and tails. For a certain Bernoulli trial, we will consider the random variable X that takes on the value 1 in case of success, and 0 in case of failure. The probability of success is represented by p and the probability of failure by (1 − p) or q. The Bernoulli distribution, therefore, provides the probability of success or failure of variable X when carrying out a single experiment. Therefore, we can say that variable X follows a Bernoulli distribution with parameter p, denoted by X ~ Bern(p), if its probability function is given by:

P(X = x) = p(x) = q = 1 − p, if x = 0; p, if x = 1    (6.24)

which can also be represented in the following way:

P(X = x) = p(x) = p^x·(1 − p)^(1−x),  x = 0, 1    (6.25)

FIG. 6.4 Probability function of the Bernoulli distribution.

FIG. 6.5 Bernoulli's distribution c.d.f.
The probability function of random variable X is represented in Fig. 6.4. It is easy to see that the expected value of X is:

E(X) = p    (6.26)

with X's variance being:

Var(X) = p·(1 − p)    (6.27)
Bernoulli's cumulative distribution function (c.d.f.) is given by:

F(x) = P(X ≤ x) = 0, if x < 0; 1 − p, if 0 ≤ x < 1; 1, if x ≥ 1    (6.28)
which can be represented by Fig. 6.5. It is important to mention that we are going to use all knowledge on Bernoulli's distribution when discussing binary logistic regression models (Chapter 14).

Example 6.6 The Interclub Indoor Soccer Cup final match is going to be between teams A and B. Random variable X represents the team that will win the Cup. We know that the probability of team A winning is 0.60. Determine the distribution of X, in addition to X's expected value and variance.
144
PART
III Probabilistic Statistics
Solution Random variable X can only take on two values:

X = 1, if team A wins; X = 0, if team B wins

Since it is a single game, variable X follows a Bernoulli distribution with parameter p = 0.60, denoted by X ~ Bern(0.6), so:

P(X = x) = p(x) = q = 0.4, if x = 0 (team B); p = 0.6, if x = 1 (team A)

We have:

E(X) = p = 0.6

Var(X) = p·(1 − p) = 0.6 × 0.4 = 0.24
6.3.3
Binomial Distribution
A binomial experiment consists of n independent repetitions of a Bernoulli trial with probability of success p, a probability that remains constant in all repetitions. The discrete random variable X of a binomial model corresponds to the number of successes (k) in the n repetitions of the experiment. Therefore, X follows a binomial distribution with parameters n and p, denoted by X ~ b(n, p), if its probability function is given by:

f(k) = P(X = k) = C(n, k)·p^k·(1 − p)^(n−k),  k = 0, 1, …, n    (6.29)

where C(n, k) = n!/[k!·(n − k)!]. The mean of X is given by:

E(X) = n·p    (6.30)
On the other hand, the variance of X can be expressed as:

Var(X) = n·p·(1 − p)    (6.31)
Note that the mean and variance of the binomial distribution are equal to the mean and variance of the Bernoulli distribution multiplied by n, the number of repetitions of the Bernoulli trial. Fig. 6.6 shows the probability function of the binomial distribution for n = 10 and p varying among 0.3, 0.5, and 0.7. From Fig. 6.6, we can see that, for p = 0.5, the probability function is symmetrical around the mean. If p < 0.5, the distribution is positively skewed, with a higher frequency of smaller values of k and a longer tail to the right. If p > 0.5, the distribution is negatively skewed, with a higher frequency of larger values of k and a longer tail to the left. It is important to mention that we are going to use all knowledge on the binomial distribution when studying multinomial logistic regression models (Chapter 14).

FIG. 6.6 Probability function of the binomial distribution for n = 10.
Random Variables and Probability Distributions Chapter
6
145
6.3.3.1 Relationship Between the Binomial and the Bernoulli Distributions
A binomial distribution with parameter n = 1 is equivalent to a Bernoulli distribution:

X ~ b(1, p) ≡ X ~ Bern(p)
Example 6.7 A certain part is produced in a production line. The probability of a part not having defects is 99%. If 30 parts are produced, what is the probability of at least 28 of them being in good condition? Also determine the random variable's mean and variance.
Solution We have:
X = random variable that represents the number of successes (parts in good condition) in the 30 repetitions
p = 0.99 = probability of a part being in good condition
q = 0.01 = probability of a part being defective
n = 30 repetitions
k = number of successes
The probability of at least 28 parts not being defective is given by:

P(X ≥ 28) = P(X = 28) + P(X = 29) + P(X = 30)

P(X = 28) = [30!/(28!·2!)]·(99/100)^28·(1/100)^2 = 0.0328
P(X = 29) = [30!/(29!·1!)]·(99/100)^29·(1/100)^1 = 0.2242
P(X = 30) = [30!/(30!·0!)]·(99/100)^30·(1/100)^0 = 0.7397

P(X ≥ 28) = 0.0328 + 0.2242 + 0.7397 = 0.997

The mean of X is expressed as:

E(X) = n·p = 30 × 0.99 = 29.7

And the variance of X is:

Var(X) = n·p·(1 − p) = 30 × 0.99 × 0.01 = 0.297
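Expression (6.29) can be evaluated term by term for Example 6.7 (an illustration, not from the book):

```python
from math import comb

# Example 6.7: probability of at least 28 good parts out of n = 30,
# with p = 0.99, using the binomial probability function (6.29).
n, p = 30, 0.99

def binom_pmf(k):
    return comb(n, k) * p**k * (1 - p) ** (n - k)

prob = sum(binom_pmf(k) for k in range(28, 31))  # P(X >= 28)
print(round(prob, 3))  # 0.997

print(round(n * p, 2))            # mean E(X) = 29.7
print(round(n * p * (1 - p), 3))  # variance 0.297
```
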
6.3.4
Geometric Distribution
The geometric distribution, like the binomial, considers successive independent Bernoulli trials, all of them with probability of success p. However, instead of fixing the number of trials, the experiment is carried out until the first success is obtained. The geometric distribution presents two distinct parameterizations, described here. The first parameterization considers successive independent Bernoulli trials, with probability of success p in each trial, until a success occurs. In this case, zero is not a possible result, so the support of the distribution is the set {1, 2, 3, …}. For example, we can consider how many times we toss a coin until we get the first head, or the number of parts manufactured until a defective one is produced, among others. The second parameterization of the geometric distribution counts the number of failures or unsuccessful attempts before the first success. Since here it is possible to obtain success in the first Bernoulli trial, zero is included as a possible result, so the support is the set {0, 1, 2, 3, …}. Let X be the random variable that represents the number of trials until the first success. Variable X has a geometric distribution with parameter p, denoted by X ~ Geo(p), if its probability function is given by:

f(x) = P(X = x) = p·(1 − p)^(x−1),  x = 1, 2, 3, …    (6.32)

For the second case, let Y be the random variable that represents the number of failures or unsuccessful attempts before the first success. Variable Y has a geometric distribution with parameter p, denoted by Y ~ Geo(p), if its probability function is given by:

f(y) = P(Y = y) = p·(1 − p)^y,  y = 0, 1, 2, …    (6.33)
146
PART
III Probabilistic Statistics
FIG. 6.7 Probability function of variable X with parameter p = 0.4.
In both cases, the sequence of probabilities is a geometric progression. The probability function of variable X is graphically represented in Fig. 6.7, for p = 0.4. X's expected value and variance are:

E(X) = 1/p    (6.34)

Var(X) = (1 − p)/p²    (6.35)

In a similar way, for variable Y, we have:

E(Y) = (1 − p)/p    (6.36)

Var(Y) = (1 − p)/p²    (6.37)
The geometric distribution is the only discrete distribution that has the memoryless property (in the case of continuous distributions, we will see that the exponential distribution also has this property). This means that if an experiment is repeated before the first success, then, given that the first success has not happened yet, the conditional distribution of the number of additional trials does not depend on the number of failures that occurred until then. Thus, for any two positive integers s and t, if X is greater than s, then the probability of X being greater than s + t is equal to the unconditional probability of X being greater than t:

P(X > s + t | X > s) = P(X > t)    (6.38)
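The memoryless property (6.38) can be verified numerically (a sketch, not from the book): for the trial-counting parameterization, P(X > k) = (1 − p)^k, since X exceeds k exactly when the first k trials all fail.

```python
# Numerical check of the memoryless property (6.38) for a geometric
# variable with p = 0.4.
p = 0.4

def survival(k):
    # P(X > k) = (1 - p)^k for the trial-counting parameterization
    return (1 - p) ** k

s, t = 3, 5
conditional = survival(s + t) / survival(s)  # P(X > s + t | X > s)

# The conditional probability equals the unconditional P(X > t)
print(round(conditional, 6) == round(survival(t), 6))  # True
```
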
Example 6.8 A company manufactures a certain electronic component and, at the end of the process, each component is tested, one by one. Assume that the probability of an electronic component being defective is 0.05. Determine the probability of the first defect being found in the eighth component tested. Also determine the random variable's expected value and variance. Solution We have: X = random variable that represents the number of electronic components tested until the first defect is found; p = 0.05 = probability of a component being defective; q = 0.95 = probability of a component being in good condition. The probability of the first defect being found in the eighth component tested is given by:

P(X = 8) = 0.05 × (1 − 0.05)^(8−1) = 0.035

The mean of X is expressed as:

E(X) = 1/p = 1/0.05 = 20
And the variance of X is:

Var(X) = (1 − p)/p² = 0.95/0.0025 = 380

6.3.5

Negative Binomial Distribution
The negative binomial distribution, also known as the Pascal distribution, carries out successive independent Bernoulli trials (with a constant probability of success in all the trials) until a prefixed number of successes (k) is reached; that is, the experiment continues until k successes are achieved. Let X be the random variable that represents the number of attempts (Bernoulli trials) carried out until the k-th success is reached. Variable X has a negative binomial distribution, denoted by X ~ nb(k, p), if its probability function is given by:

f(x) = P(X = x) = C(x − 1, k − 1)·p^k·(1 − p)^(x−k),  x = k, k + 1, …    (6.39)

The graphical representation of a negative binomial distribution with parameters k = 2 and p = 0.4 can be found in Fig. 6.8. The expected value of X is:

E(X) = k/p    (6.40)

and the variance is:

Var(X) = k·(1 − p)/p²    (6.41)
6.3.5.1 Relationship Between the Negative Binomial and the Binomial Distributions The negative binomial distribution is related to the binomial distribution. In the binomial, we must set the sample size (number of Bernoulli trials) and observe the number of successes (random variable). In the negative binomial, we must set the number of successes (k) and observe the number of Bernoulli trials necessary to obtain k successes.
6.3.5.2 Relationship Between the Negative Binomial and the Geometric Distributions The negative binomial distribution with parameter k = 1 is equivalent to the geometric distribution:

X ~ nb(1, p) ≡ X ~ Geo(p)

Equivalently, a negative binomial random variable can be seen as the sum of k independent geometric random variables.
FIG. 6.8 Probability function of variable X with parameters k = 2 and p = 0.4.
It is important to mention that we are going to use all knowledge on the negative binomial distribution when studying the regression models for count data (Chapter 15).

Example 6.9 Assume that a student gets three out of every five questions right. Let X be the number of attempts until the twelfth correct answer. Determine the probability of the student having to answer 20 questions in order to get 12 right. Solution We have: k = 12, p = 3/5 = 0.6, q = 2/5 = 0.4, and X = number of attempts until the twelfth correct answer, that is, X ~ nb(12, 0.6). Therefore:

f(20) = P(X = 20) = C(19, 11)·0.6^12·0.4^(20−12) = 0.1078 = 10.78%
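Expression (6.39) for Example 6.9 can be evaluated directly (an illustration, not from the book):

```python
from math import comb

# Example 6.9: P(X = 20) for a negative binomial with k = 12 successes
# and p = 0.6, following Expression (6.39).
k, p = 12, 0.6

def nb_pmf(x):
    # probability that the k-th success occurs exactly on trial x
    return comb(x - 1, k - 1) * p**k * (1 - p) ** (x - k)

print(round(nb_pmf(20), 4))  # 0.1078
```
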
6.3.6
Hypergeometric Distribution
The hypergeometric distribution is also related to a Bernoulli trial. However, differently from binomial sampling, in which the probability of success is constant, in the hypergeometric distribution the sampling is without replacement: as the elements are removed from the population to form the sample, the population size diminishes, making the probability of success vary. The hypergeometric distribution describes the number of successes in a sample with n elements, drawn from a finite population without replacement. For example, let us consider a population with N elements, of which M have a certain attribute. The hypergeometric distribution describes the probability of exactly k elements having such an attribute (k successes and n − k failures), in a sample with n distinct elements randomly drawn from the population without replacement. Let X be a random variable that represents the number of successes obtained among the n elements drawn in the sample. Variable X follows a hypergeometric distribution with parameters N, M, n, denoted by X ~ Hip(N, M, n), if its probability function is given by:

f(k) = P(X = k) = [C(M, k)·C(N − M, n − k)] / C(N, n),  0 ≤ k ≤ min(M, n)    (6.42)

The graphical representation of a hypergeometric distribution with parameters N = 200, M = 50, and n = 30 can be found in Fig. 6.9. The mean of X can be calculated as:
FIG. 6.9 Probability function of variable X with parameters N = 200, M = 50, and n = 30.
E(X) = n·M/N    (6.43)
with variance:

Var(X) = (n·M/N)·[(N − M)·(N − n)] / [N·(N − 1)]    (6.44)
6.3.6.1 Approximation of the Hypergeometric Distribution by the Binomial
Let X be a random variable that follows a hypergeometric distribution with parameters N, M, and n, denoted by X ~ Hip(N, M, n). If the population is large when compared to the sample size, the hypergeometric distribution can be approximated by a binomial distribution with parameters n and p = M/N (probability of success in a single trial):

X ~ Hip(N, M, n) ≈ X ~ b(n, p), with p = M/N
Example 6.10 A gravity-pick machine contains 15 balls, 5 of which are red. Seven balls are chosen randomly, without replacement. Determine:
a) The probability of exactly two red balls being drawn.
b) The probability of at least two red balls being drawn.
c) The expected number of red balls drawn.
d) The variance of the number of red balls drawn.
Solution Let X be the random variable that represents the number of red balls drawn. We have N = 15, M = 5, and n = 7.
a) P(X = 2) = [C(5, 2)·C(10, 5)] / C(15, 7) = 39.16%
b) P(X ≥ 2) = 1 − P(X < 2) = 1 − [P(X = 0) + P(X = 1)] = 1 − [C(5, 0)·C(10, 7) + C(5, 1)·C(10, 6)] / C(15, 7) = 81.82%
c) E(X) = n·M/N = 7 × 5/15 = 2.33
d) Var(X) = (n·M/N)·[(N − M)·(N − n)] / [N·(N − 1)] = (7 × 5/15) × (10 × 8)/(15 × 14) = 0.8889
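The four answers of Example 6.10 follow from Expressions (6.42) through (6.44) (a Python check, not from the book):

```python
from math import comb

# Example 6.10: hypergeometric with N = 15 balls, M = 5 red, n = 7 draws.
N, M, n = 15, 5, 7

def hyper_pmf(k):
    # Expression (6.42): C(M, k) * C(N - M, n - k) / C(N, n)
    return comb(M, k) * comb(N - M, n - k) / comb(N, n)

print(round(hyper_pmf(2), 4))                     # 0.3916
print(round(1 - hyper_pmf(0) - hyper_pmf(1), 4))  # 0.8182

mean = n * M / N
var = mean * (N - M) * (N - n) / (N * (N - 1))
print(round(mean, 2))  # 2.33
print(round(var, 4))   # 0.8889
```
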
6.3.7
Poisson Distribution
The Poisson distribution is used to register the occurrence of rare events, with a very low probability of success (p → 0), in a certain interval of time or space. Differently from the binomial model, which provides the probability of the number of successes in a discrete interval (n repetitions of an experiment), the Poisson model provides the probability of the number of successes in a certain continuous interval (time, area, among other possibilities). As examples of variables that follow a Poisson distribution, we can mention the number of customers that arrive in a line per unit of time, the number of defects per unit of time, and the number of accidents per unit of area, among others. Note that the measurement units (time and area, in these situations) are continuous, but the random variable (number of occurrences) is discrete. The Poisson distribution presents the following hypotheses:

(i) events defined in nonoverlapping intervals are independent;
(ii) in intervals of the same length, the probabilities that the same number of successes will occur are equal;
(iii) in very small intervals, the probability of more than one success occurring is insignificant;
(iv) in very small intervals, the probability of one success is proportional to the length of the interval.
Let us consider a discrete random variable X that represents the number of successes (k) in a certain unit of time, unit of area, among other possibilities. Random variable X, with parameter λ ≥ 0, follows a Poisson distribution, denoted by X ~ Poisson(λ), if its probability function is given by:

f(k) = P(X = k) = (e^(−λ)·λ^k)/k!,  k = 0, 1, 2, …    (6.45)
FIG. 6.10 Poisson probability function.
where:
e: base of the natural (Napierian) logarithm, e ≈ 2.718282;
λ: estimated average rate of occurrence of the event of interest for a certain exposure (time interval, area, among other examples).

Fig. 6.10 shows the Poisson distribution probability function for λ = 1, 3, and 6. In the Poisson distribution, the mean is equal to the variance:

E(X) = Var(X) = λ    (6.46)
It is important to mention that we are going to use all knowledge on the Poisson distribution when studying the regression models for count data (Chapter 15).
6.3.7.1 Approximation of the Binomial by the Poisson Distribution
Let X be a random variable that follows a binomial distribution with parameters n and p, denoted by X ~ b(n, p). When the number of repetitions of a random experiment is very high (n → ∞) and the probability of success is very low (p → 0), such that n·p = λ = constant, the binomial distribution approaches the Poisson distribution:

X ~ b(n, p) ≈ X ~ Poisson(λ), with λ = n·p
Example 6.11 Assume that the number of customers that arrive at a bank follows a Poisson distribution. We verified that, on average, 12 customers arrive at the bank per minute. Calculate: (a) the probability of 10 customers arriving in the next minute; (b) the probability of 8 customers arriving in the next minute; (c) X's mean and variance.
Solution We have λ = 12 customers per minute.
a) P(X = 10) = e^(−12)·12^10/10! = 0.1048
b) P(X = 8) = e^(−12)·12^8/8! = 0.0655
c) E(X) = Var(X) = λ = 12
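Expression (6.45) for Example 6.11 maps onto a two-line function (an illustration, not from the book):

```python
from math import exp, factorial

# Example 6.11: Poisson probabilities with lambda = 12 customers
# per minute, following Expression (6.45).
lam = 12

def poisson_pmf(k):
    return exp(-lam) * lam**k / factorial(k)

print(round(poisson_pmf(10), 4))  # 0.1048
print(round(poisson_pmf(8), 4))   # 0.0655
```
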
Example 6.12 A certain part is produced in a production line. The probability of the part being defective is 0.01. If 300 parts are produced, what is the probability of none of them being defective?
Solution This example is characterized by a binomial distribution. Since the number of repetitions is high and the probability of success is low, the binomial distribution can be approximated by a Poisson distribution with parameter λ = n·p = 300 × 0.01 = 3, such that:

P(X = 0) = e^(−3)·3^0/0! = 0.05

6.4
PROBABILITY DISTRIBUTIONS FOR CONTINUOUS RANDOM VARIABLES
For continuous random variables, we are going to study the uniform, normal, exponential, gamma, chi-square (χ²), Student's t, and Snedecor's F distributions.
6.4.1
Uniform Distribution
The uniform model is the simplest model for continuous random variables. It is used to model the occurrence of events whose probability is constant in intervals with the same range. A random variable X follows a uniform distribution in the interval [a, b], denoted by X ~ U[a, b], if its probability density function is given by:

f(x) = 1/(b − a), if a ≤ x ≤ b; 0, otherwise    (6.47)

which can be graphically represented as seen in Fig. 6.11. The expected value of X is calculated by the expression:

E(X) = ∫_a^b x·[1/(b − a)] dx = (a + b)/2    (6.48)
Table 6.1 presents a summary of the discrete distributions studied in this section, including each random variable's probability function, the distribution parameters, and the calculation of X's expected value and variance.

TABLE 6.1 Models for Discrete Variables

Discrete uniform: P(X = xi) = 1/n, i = 1, …, n; parameter: n; E(X) = (1/n)·Σ_{i=1}^{n} xi; Var(X) = (1/n)·[Σ_{i=1}^{n} xi² − (Σ_{i=1}^{n} xi)²/n]

Bernoulli: P(X = x) = p^x·(1 − p)^(1−x), x = 0, 1; parameter: p; E(X) = p; Var(X) = p·(1 − p)

Binomial: P(X = k) = C(n, k)·p^k·(1 − p)^(n−k), k = 0, 1, …, n; parameters: n, p; E(X) = n·p; Var(X) = n·p·(1 − p)

Geometric: P(X = x) = p·(1 − p)^(x−1), x = 1, 2, 3, …; or P(Y = y) = p·(1 − p)^y, y = 0, 1, 2, …; parameter: p; E(X) = 1/p and E(Y) = (1 − p)/p; Var(X) = Var(Y) = (1 − p)/p²

Negative binomial: P(X = x) = C(x − 1, k − 1)·p^k·(1 − p)^(x−k), x = k, k + 1, …; parameters: k, p; E(X) = k/p; Var(X) = k·(1 − p)/p²

Hypergeometric: P(X = k) = C(M, k)·C(N − M, n − k)/C(N, n), 0 ≤ k ≤ min(M, n); parameters: N, M, n; E(X) = n·M/N; Var(X) = (n·M/N)·(N − M)·(N − n)/[N·(N − 1)]

Poisson: P(X = k) = e^(−λ)·λ^k/k!, k = 0, 1, 2, …; parameter: λ; E(X) = λ; Var(X) = λ
FIG. 6.11 Uniform distribution in the interval [a, b].
And the variance of X is:

Var(X) = E(X²) - [E(X)]² = (b - a)²/12   (6.49)

On the other hand, the cumulative distribution function of the uniform distribution is given by:

F(x) = 0, if x < a; (x - a)/(b - a), if a ≤ x < b; 1, if x ≥ b   (6.50)
Example 6.13 Random variable X represents the time a bank's ATM machines are used (in minutes), and it follows a uniform distribution in the interval [1, 5]. Determine:
a) P(X < 2)
b) P(X > 3)
c) P(3 < X < 5)
d) E(X)
e) Var(X)
Solution
a) P(X < 2) = F(2) = (2 - 1)/(5 - 1) = 1/4
b) P(X > 3) = 1 - F(3) = 1 - (3 - 1)/(5 - 1) = 1/2
c) P(3 < X < 5) = F(5) - F(3) = (5 - 1)/(5 - 1) - (3 - 1)/(5 - 1) = 1/2
d) E(X) = (1 + 5)/2 = 3
e) Var(X) = (5 - 1)²/12 = 4/3
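Example 6.13 can be verified numerically with the uniform CDF of Expression (6.50). A minimal sketch; the helper `uniform_cdf` is our own name:

```python
def uniform_cdf(x: float, a: float, b: float) -> float:
    """F(x) for X ~ U[a, b], following Expression (6.50)."""
    if x < a:
        return 0.0
    if x >= b:
        return 1.0
    return (x - a) / (b - a)

a, b = 1, 5
p_lt_2 = uniform_cdf(2, a, b)                         # P(X < 2)  = 1/4
p_gt_3 = 1 - uniform_cdf(3, a, b)                     # P(X > 3)  = 1/2
p_3_5 = uniform_cdf(5, a, b) - uniform_cdf(3, a, b)   # P(3 < X < 5) = 1/2
mean = (a + b) / 2                                    # E(X) = 3
var = (b - a) ** 2 / 12                               # Var(X) = 4/3
```

For a continuous variable, P(X = x) = 0, so strict and non-strict inequalities give the same probabilities here.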
6.4.2 Normal Distribution
The normal distribution, also known as Gaussian, is the most widely used and important probability distribution, because it allows us to model a myriad of natural phenomena, studies of human behavior, industrial processes, among others, in addition to allowing us to use approximations to calculate the probabilities of many random variables. A random variable X, with mean μ ∈ ℝ and standard deviation σ > 0, follows a normal or Gaussian distribution, denoted by X ~ N(μ, σ²), if its probability density function is given by:

f(x) = 1/(σ · √(2π)) · e^(-(x - μ)²/(2σ²)), -∞ ≤ x ≤ +∞   (6.51)

whose graphical representation is shown in Fig. 6.12.
FIG. 6.12 Normal distribution.
FIG. 6.13 Area under the normal curve.
Fig. 6.13 shows the area under the normal curve based on the number of standard deviations. From Fig. 6.13, we can see that the curve has the shape of a bell and is symmetrical around parameter μ, and the smaller parameter σ is, the more concentrated the curve is around μ. Therefore, in a normal distribution, the mean of X is:

E(X) = μ   (6.52)

And the variance of X is:

Var(X) = σ²   (6.53)
In order to obtain the standard normal distribution or the reduced normal distribution, the original variable X is transformed into a new random variable Z, with mean 0 (μ = 0) and variance 1 (σ² = 1):

Z = (X - μ)/σ ~ N(0, 1)   (6.54)

Score Z represents the number of standard deviations that separates a random variable X from the mean. This kind of transformation, known as Z-scores, is broadly used to standardize variables, because it does not change the shape of the original variable's normal distribution, and it generates a new variable with mean 0 and variance 1. Therefore,
FIG. 6.14 Standard normal distribution.
when many variables with different orders of magnitude are being used in a certain type of modeling, the Z-scores standardization process will make all the new standardized variables have the same distribution, with equal orders of magnitude (Fávero et al., 2009). The probability density function of random variable Z is reduced to:

f(z) = 1/√(2π) · e^(-z²/2), -∞ ≤ z ≤ +∞   (6.55)
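The Z-scores standardization described above can be sketched with Python's standard library. We use the population standard deviation here (an assumption of this sketch; sample standardization would use `stdev` instead), and the function name `z_scores` is our own:

```python
from statistics import mean, pstdev

def z_scores(values):
    """Standardize a variable: subtract the mean, divide by the (population) std. dev."""
    m, s = mean(values), pstdev(values)
    return [(x - m) / s for x in values]

x = [10.0, 12.0, 9.0, 15.0, 14.0]   # illustrative data, not from the book
z = z_scores(x)
# Whatever the original order of magnitude, z has mean 0 and variance 1.
```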
whose graphical representation is shown in Fig. 6.14. The cumulative distribution function F(xc) of a normal random variable X is obtained by integrating Expression (6.51) from -∞ to xc, that is:

F(xc) = P(X ≤ xc) = ∫[-∞ to xc] f(x) dx   (6.56)

Integral (6.56) corresponds to the area under f(x) from -∞ to xc, as shown in Fig. 6.15. In the specific case of the standard normal distribution, the cumulative distribution function is:

F(zc) = P(Z ≤ zc) = ∫[-∞ to zc] f(z) dz = 1/√(2π) · ∫[-∞ to zc] e^(-z²/2) dz   (6.57)
For a random variable Z with a standard normal distribution, let us suppose that the main goal now is to calculate P(Z > zc). So, we have:

P(Z > zc) = ∫[zc to +∞] f(z) dz = 1/√(2π) · ∫[zc to +∞] e^(-z²/2) dz   (6.58)
Fig. 6.16 represents this probability. Table E in the Appendix shows the value of P(Z > zc), that is, the cumulative probability from zc to +∞ (the gray area under the normal curve). FIG. 6.15 Cumulative normal distribution.
FIG. 6.16 Graphical representation of P(Z > zc) for a standardized normal random variable.
6.4.2.1 Approximation of the Binomial by the Normal Distribution
Let X be a random variable that has a binomial distribution with parameters n and p, denoted by X ~ b(n, p). As the average number of successes and the average number of failures tend to infinity (n·p → ∞ and n·(1 - p) → ∞), the binomial distribution gets closer to a normal one with mean μ = n·p and variance σ² = n·p·(1 - p):

X ~ b(n, p) ≈ X ~ N(μ, σ²), with μ = n·p and σ² = n·p·(1 - p)

Some authors admit that the approximation of the binomial by the normal distribution is good when n·p > 5 and n·(1 - p) > 5, or when n·p·(1 - p) ≥ 3. A better and more conservative rule requires n·p > 10 and n·(1 - p) > 10. However, since a discrete distribution is being approximated by a continuous one, we recommend greater accuracy, carrying out a continuity correction that consists in, for example, transforming P(X = x) into the interval P(x - 0.5 < X < x + 0.5).
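The quality of the approximation with continuity correction can be seen by comparing the exact binomial probability with Φ((k + 0.5 - μ)/σ) - Φ((k - 0.5 - μ)/σ). A sketch using only the standard library (Φ via `math.erf`; the parameters n, p, k are illustrative values of our choosing):

```python
from math import comb, erf, sqrt

def phi(z: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def binom_pmf(k: int, n: int, p: float) -> float:
    """Exact binomial probability P(X = k)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p, k = 100, 0.3, 30                     # illustrative, n*p and n*(1-p) both > 10
mu, sigma = n * p, sqrt(n * p * (1 - p))

exact = binom_pmf(k, n, p)
# Continuity correction: P(X = k) ~ P(k - 0.5 < X < k + 0.5) under the normal curve
approx = phi((k + 0.5 - mu) / sigma) - phi((k - 0.5 - mu) / sigma)
```

With these parameters the two values agree to roughly three decimal places, which is why the correction is recommended.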
6.4.2.2 Approximation of the Poisson by the Normal Distribution
Analogous to the binomial distribution, the Poisson distribution can also be approximated by a normal one. Let X be a random variable that follows a Poisson distribution with parameter λ, denoted by X ~ Poisson(λ). As λ → ∞, the Poisson distribution gets closer to a normal one with mean μ = λ and variance σ² = λ:

X ~ Poisson(λ) ≈ X ~ N(μ, σ²), with μ = λ and σ² = λ

In general, we admit that the approximation of the Poisson distribution by the normal distribution is good when λ > 10. Once again, we recommend using the continuity correction x - 0.5 and x + 0.5.

Example 6.14 We know that the average thickness of the hose storage units produced in a factory (X) follows a normal distribution with a mean of 3 mm and a standard deviation of 0.4 mm. Determine:
a) P(X > 4.1)
b) P(X > 3)
c) P(X ≤ 3)
d) P(X ≤ 3.5)
e) P(X < 2.3)
f) P(2 ≤ X ≤ 3.8)
Solution
The probabilities will be calculated based on Table E in the Appendix, which provides the value of P(Z > zc):
a) P(X > 4.1) = P(Z > (4.1 - 3)/0.4) = P(Z > 2.75) = 0.0030
b) P(X > 3) = P(Z > (3 - 3)/0.4) = P(Z > 0) = 0.5
c) P(X ≤ 3) = P(Z ≤ 0) = 0.5
d) P(X ≤ 3.5) = P(Z ≤ (3.5 - 3)/0.4) = P(Z ≤ 1.25) = 1 - P(Z > 1.25) = 1 - 0.1056 = 0.8944
e) P(X < 2.3) = P(Z < (2.3 - 3)/0.4) = P(Z < -1.75) = P(Z > 1.75) = 0.04
f) P(2 ≤ X ≤ 3.8) = P((2 - 3)/0.4 ≤ Z ≤ (3.8 - 3)/0.4) = P(-2.5 ≤ Z ≤ 2) = P(Z ≤ 2) - P(Z < -2.5) = [1 - P(Z > 2)] - P(Z > 2.5) = [1 - 0.0228] - 0.0062 = 0.9710
6.4.3 Exponential Distribution
Another important distribution, which has applications in system reliability and in queueing theory, is the exponential distribution. It has as its main characteristic the property of being memoryless, that is, the future lifetime (t) of a certain object has the same distribution, regardless of its past lifetime (s), for any s, t > 0, as shown in Expression (6.38), once again shown below:

P(X > s + t | X > s) = P(X > t)

A continuous random variable X has an exponential distribution with parameter λ > 0, denoted by X ~ exp(λ), if its probability density function is given by:

f(x) = λ·e^(-λx), if x ≥ 0; 0, if x < 0   (6.59)

Fig. 6.17 represents the probability density function of the exponential distribution for parameters λ = 0.5, λ = 1, and λ = 2. We can see that the exponential distribution is positive asymmetrical (to the right), observing a higher frequency for smaller values of x and a longer tail to the right. The density function assumes value λ when x = 0, and tends to zero as x → ∞. The higher the value of λ, the more quickly the function tends to zero. In the exponential distribution, the mean of X is:

E(X) = 1/λ   (6.60)

and the variance of X is:

Var(X) = 1/λ²   (6.61)

And the cumulative distribution function F(x) is given by:

F(x) = P(X ≤ x) = ∫[0 to x] f(x) dx = 1 - e^(-λx), if x ≥ 0; 0, if x < 0   (6.62)

FIG. 6.17 Exponential distribution for λ = 0.5, λ = 1, and λ = 2.
From (6.62) we can conclude that:

P(X > x) = e^(-λx)   (6.63)

In system reliability, random variable X represents the lifetime, that is, the time during which a component or system remains operational, outside the interval for repairs and above a specified limit (yield, pressure, among other examples). On the other hand, parameter λ represents the failure rate, that is, the number of components or systems that failed in a preestablished time interval:

λ = number of failures / operation time   (6.64)

The main measures of reliability are: (a) mean time to failure (MTTF) and (b) mean time between failures (MTBF). Mathematically, MTTF and MTBF are equal to the mean of the exponential distribution and represent the mean lifetime. Thus, the failure rate can also be calculated as:

λ = 1/MTTF = 1/MTBF   (6.65)
In the queueing theory, random variable X represents the mean waiting time until the next arrival (mean time between two customers’ arrivals). On the other hand, parameter l represents the mean arrivals rate, that is, the expected number of arrivals per unit of time.
6.4.3.1 Relationship Between the Poisson and the Exponential Distribution
If the number of occurrences in a counting process follows a Poisson distribution with parameter λ, then the random variables "time until the first occurrence" and "time between any successive occurrences" of the aforementioned process have an exp(λ) distribution.

Example 6.15 The life span of an electronic component follows an exponential distribution with a mean lifetime of 120 hours. Determine:
a) The probability of a component failing in the first 100 hours of use;
b) The probability of a component lasting more than 150 hours.
Solution
We have λ = 1/120 and X ~ exp(1/120). Therefore:
a) P(X ≤ 100) = ∫[0 to 100] (1/120)·e^(-x/120) dx = 1 - e^(-100/120) = 0.5654
b) P(X > 150) = ∫[150 to ∞] (1/120)·e^(-x/120) dx = e^(-150/120) = 0.2865
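Example 6.15 can be reproduced directly from the exponential CDF (6.62). A standard-library sketch; `expon_cdf` is our own helper name:

```python
from math import exp

lam = 1 / 120   # failure rate per hour; mean lifetime = 1/lam = 120 hours

def expon_cdf(x: float, lam: float) -> float:
    """F(x) = 1 - e^(-lam*x) for X ~ exp(lam), Expression (6.62)."""
    return 1 - exp(-lam * x) if x >= 0 else 0.0

p_fail_100 = expon_cdf(100, lam)       # P(X <= 100) = 1 - e^(-100/120) ~ 0.5654
p_last_150 = 1 - expon_cdf(150, lam)   # P(X > 150)  = e^(-1.25)        ~ 0.2865
```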
6.4.4 Gamma Distribution
The gamma distribution is one of the most general distributions, such that other distributions, like the Erlang, exponential, and chi-square (χ²), are particular cases of it. Like the exponential distribution, it is also widely used in system reliability. The gamma distribution also has applications in physical phenomena, in meteorological processes, in insurance risk theory, and in economic theory. A continuous random variable X has a gamma distribution with parameters α > 0 and λ > 0, denoted by X ~ Gamma(α, λ), if its probability density function is given by:

f(x) = (λ^α / Γ(α)) · x^(α-1) · e^(-λx), if x ≥ 0; 0, if x < 0   (6.66)
FIG. 6.18 Density function of x for some values of α and λ. (Source: Navidi, W., 2012. Probabilidade e estatística para ciências exatas. Bookman, Porto Alegre.)
where Γ(α) is the gamma function, given by:

Γ(α) = ∫[0 to ∞] e^(-x) · x^(α-1) dx, α > 0   (6.67)
The gamma probability density function for some values of α and λ is represented in Fig. 6.18. We can see that the gamma distribution is positive asymmetrical (to the right), observing a higher frequency for smaller values of x and a longer tail to the right. However, as α tends to infinity, the distribution becomes symmetrical. We can also observe that when α = 1, the gamma distribution is equal to the exponential. Moreover, the greater the value of λ, the more quickly the density function tends to zero. With this (rate) parameterization, the expected value of X is:

E(X) = α/λ   (6.68)

On the other hand, the variance of X is given by:

Var(X) = α/λ²   (6.69)

Note that, for α = 1, these expressions reduce to the exponential mean 1/λ and variance 1/λ², consistent with (6.60) and (6.61). The cumulative distribution function is:

F(x) = P(X ≤ x) = ∫[0 to x] f(x) dx = (λ^α / Γ(α)) · ∫[0 to x] x^(α-1) · e^(-λx) dx   (6.70)
6.4.4.1 Special Cases of the Gamma Distribution
A gamma distribution whose parameter α is a positive integer is called an Erlang distribution, such that:
If α is a positive integer ⇒ X ~ Gamma(α, λ) ≡ X ~ Erlang(α, λ)
As mentioned before, a gamma distribution with parameter α = 1 is called an exponential distribution:
If α = 1 ⇒ X ~ Gamma(1, λ) ≡ X ~ exp(λ)
Moreover, a gamma distribution with parameters α = ν/2 and λ = 1/2 is called a chi-square distribution with ν degrees of freedom:
If α = ν/2, λ = 1/2 ⇒ X ~ Gamma(ν/2, 1/2) ≡ X ~ χ²(ν)
6.4.4.2 Relationship Between the Poisson and the Gamma Distribution In the Poisson distribution, we try to determine the number of occurrences of a certain event within a fixed period. On the other hand, the gamma distribution determines the time necessary to obtain a specified number of occurrences of the event.
6.4.5 Chi-Square Distribution
A continuous random variable X has a chi-square distribution with ν degrees of freedom, denoted by X ~ χ²(ν), if its probability density function is given by:

f(x) = (1/(2^(ν/2) · Γ(ν/2))) · x^(ν/2 - 1) · e^(-x/2), x > 0; 0, otherwise   (6.71)

To calculate P(X > xc), we have:

P(X > xc) = ∫[xc to ∞] f(x) dx   (6.76)

which can be represented by Fig. 6.20.

FIG. 6.19 χ² distribution for different values of ν.
FIG. 6.20 Graphical representation of P(X > xc) for a random variable with a χ² distribution.
The χ² distribution has several applications in statistical inference. Due to its importance, the χ² distribution can be found in Table D in the Appendix, for different values of parameter ν. This table provides the critical values xc such that P(X > xc) = α. In other words, from it we can obtain probabilities and values of the cumulative distribution function for different values of x of random variable X.

Example 6.16 Assume that random variable X follows a chi-square distribution (χ²) with 13 degrees of freedom. Determine:
a) P(X > 5)
b) The x value such that P(X ≤ x) = 0.95
c) The x value such that P(X > x) = 0.95
Solution
Through the χ² distribution table (Table D in the Appendix), for ν = 13, we have:
a) P(X > 5) = 97.5%
b) 22.362
c) 5.892
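Without Table D, P(X > 5) for ν = 13 can be recovered by numerically integrating density (6.71). A sketch using Simpson's rule and `math.lgamma` (standard library only; the helper names are ours):

```python
from math import exp, lgamma, log

def chi2_pdf(x: float, nu: int) -> float:
    """Density (6.71) of the chi-square distribution, computed via logarithms."""
    if x <= 0:
        return 0.0
    log_c = -(nu / 2) * log(2) - lgamma(nu / 2)
    return exp(log_c + (nu / 2 - 1) * log(x) - x / 2)

def simpson(f, a, b, n=2000):
    """Composite Simpson's rule on [a, b] with n (even) subintervals."""
    h = (b - a) / n
    s = f(a) + f(b)
    s += 4 * sum(f(a + (2 * i - 1) * h) for i in range(1, n // 2 + 1))
    s += 2 * sum(f(a + 2 * i * h) for i in range(1, n // 2))
    return s * h / 3

nu = 13
p_upper = 1 - simpson(lambda x: chi2_pdf(x, nu), 0, 5)   # P(X > 5), close to 0.975
```

The same integration recovers the other table entries, e.g. integrating up to 22.362 gives a cumulative probability close to 0.95.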
6.4.6 Student's t Distribution
Student's t distribution was developed by William Sealy Gosset, and it is one of the main probability distributions, with several applications in statistical inference. We are going to assume a random variable Z that has a normal distribution with mean 0 and standard deviation 1, and a random variable X with a chi-square distribution with ν degrees of freedom, such that Z and X are independent. Continuous random variable T is then defined as:

T = Z / √(X/ν)   (6.77)

We can say that variable T follows Student's t distribution with ν degrees of freedom, denoted by T ~ t(ν), if its probability density function is given by:

f(t) = [Γ((ν + 1)/2) / (√(νπ) · Γ(ν/2))] · (1 + t²/ν)^(-(ν+1)/2), -∞ < t < ∞   (6.78)

To calculate P(T > tc), we have:

P(T > tc) = ∫[tc to ∞] f(t) dt   (6.82)

as shown in Fig. 6.22. Just as the normal and chi-square (χ²) distributions, Student's t distribution has several applications in statistical inference, and there is a table to obtain the probabilities, based on different values of parameter ν (Table B in the Appendix). This table provides the critical values tc such that P(T > tc) = α. In other words, from it we can obtain probabilities and values of the cumulative distribution function for different values of t of random variable T. We are going to use Student's t distribution when studying simple and multiple regression models (Chapter 13).
FIG. 6.22 Graphical representation of Student's t distribution.
Example 6.17 Assume that random variable T follows Student's t distribution with 7 degrees of freedom. Determine:
a) P(T > 3.5)
b) P(T < 3)
c) P(T < -0.711)
d) The t value such that P(T ≤ t) = 0.95
e) The t value such that P(T > t) = 0.10
Solution
a) 0.5%
b) 99%
c) 25%
d) 1.895
e) 1.415
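Example 6.17 can also be checked by integrating density (6.78) numerically and using the symmetry of the t distribution around zero. A standard-library sketch (helper names are ours):

```python
from math import exp, lgamma, log, pi

def t_pdf(t: float, nu: int) -> float:
    """Density (6.78) of Student's t distribution."""
    log_c = lgamma((nu + 1) / 2) - lgamma(nu / 2) - 0.5 * log(nu * pi)
    return exp(log_c - ((nu + 1) / 2) * log(1 + t * t / nu))

def simpson(f, a, b, n=2000):
    """Composite Simpson's rule on [a, b] with n (even) subintervals."""
    h = (b - a) / n
    s = f(a) + f(b)
    s += 4 * sum(f(a + (2 * i - 1) * h) for i in range(1, n // 2 + 1))
    s += 2 * sum(f(a + 2 * i * h) for i in range(1, n // 2))
    return s * h / 3

nu = 7
# By symmetry, P(T > t) = 0.5 - integral of the density from 0 to t,
# and P(T < -t) = P(T > t).
p_a = 0.5 - simpson(lambda t: t_pdf(t, nu), 0, 3.5)     # P(T > 3.5)    ~ 0.005
p_c = 0.5 - simpson(lambda t: t_pdf(t, nu), 0, 0.711)   # P(T < -0.711) ~ 0.25
```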
6.4.7 Snedecor's F Distribution
Snedecor's F distribution, also known as Fisher's distribution, is frequently used in tests associated with the analysis of variance (ANOVA), to compare the means of more than two populations. Let us consider continuous random variables Y1 and Y2, such that:
- Y1 and Y2 are independent;
- Y1 follows a chi-square distribution with ν1 degrees of freedom, denoted by Y1 ~ χ²(ν1);
- Y2 follows a chi-square distribution with ν2 degrees of freedom, denoted by Y2 ~ χ²(ν2).
We are going to define a new continuous random variable X such that:

X = (Y1/ν1) / (Y2/ν2)   (6.83)

So, we say that X has Snedecor's F distribution with ν1 and ν2 degrees of freedom, denoted by X ~ F(ν1, ν2), if its probability density function is given by:

f(x) = [Γ((ν1 + ν2)/2) / (Γ(ν1/2) · Γ(ν2/2))] · (ν1/ν2)^(ν1/2) · x^(ν1/2 - 1) / (1 + (ν1/ν2)·x)^((ν1+ν2)/2), x > 0   (6.84)

where Γ(a) = ∫[0 to ∞] e^(-x) · x^(a-1) dx.
Fig. 6.23 shows the behavior of Snedecor's F distribution probability density function, for different values of ν1 and ν2. We can see that Snedecor's F distribution is positive asymmetrical (to the right), observing a higher frequency for smaller values of x and a longer tail to the right. However, as ν1 and ν2 tend to infinity, the distribution becomes symmetrical. The expected value of X is calculated as:

E(X) = ν2/(ν2 - 2), for ν2 > 2   (6.85)

On the other hand, the variance of X is given by:

Var(X) = [2·ν2²·(ν1 + ν2 - 2)] / [ν1·(ν2 - 4)·(ν2 - 2)²], for ν2 > 4   (6.86)
FIG. 6.23 Probability density function for F(4, 12) and F(30, 30).
FIG. 6.24 Critical values of Snedecor’s F distribution.
Just as the normal, χ², and Student's t distributions, Snedecor's F distribution has several applications in statistical inference. And there is a table from which we can obtain the probabilities and the cumulative distribution function, based on different values of parameters ν1 and ν2 (Table A in the Appendix). This table provides the critical values Fc such that P(X > Fc) = α (Fig. 6.24). We are going to use Snedecor's F distribution when studying simple and multiple regression models (Chapter 13).
6.4.7.1 Relationship Between Student's t and Snedecor's F Distribution
Let us consider a random variable T with Student's t distribution with ν degrees of freedom. Then the square of variable T follows Snedecor's F distribution with ν1 = 1 and ν2 = ν degrees of freedom, as shown by Fávero et al. (2009). Thus:
If T ~ t(ν), then T² ~ F(1, ν)

Example 6.18 Assume that random variable X follows Snedecor's F distribution with ν1 = 6 degrees of freedom in the numerator, and ν2 = 12 degrees of freedom in the denominator, that is, X ~ F(6, 12). Determine:
a) P(X > 3)
b) F(6, 12) with α = 10%
c) The x value such that P(X ≤ x) = 0.975
Solution
Through Snedecor's F distribution table (Table A in the Appendix), for ν1 = 6 and ν2 = 12, we have:
a) P(X > 3) = 5%
b) 2.33
c) 3.73
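Item (a) of Example 6.18 can be verified by integrating density (6.84) numerically. A standard-library sketch (helper names are ours; the constants are computed via logarithms with `math.lgamma` for numerical stability):

```python
from math import exp, lgamma, log

def f_pdf(x: float, d1: int, d2: int) -> float:
    """Density (6.84) of Snedecor's F distribution with d1, d2 degrees of freedom."""
    if x <= 0:
        return 0.0
    log_c = (lgamma((d1 + d2) / 2) - lgamma(d1 / 2) - lgamma(d2 / 2)
             + (d1 / 2) * log(d1 / d2))
    return exp(log_c + (d1 / 2 - 1) * log(x)
               - ((d1 + d2) / 2) * log(1 + d1 * x / d2))

def simpson(f, a, b, n=4000):
    """Composite Simpson's rule on [a, b] with n (even) subintervals."""
    h = (b - a) / n
    s = f(a) + f(b)
    s += 4 * sum(f(a + (2 * i - 1) * h) for i in range(1, n // 2 + 1))
    s += 2 * sum(f(a + 2 * i * h) for i in range(1, n // 2))
    return s * h / 3

p_gt_3 = 1 - simpson(lambda x: f_pdf(x, 6, 12), 0, 3)   # P(X > 3), close to 5%
```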
Table 6.2 shows a summary of the continuous distributions studied in this section, including the random variable's probability density function, the distribution parameters, besides the calculation of X's expected value and variance.

TABLE 6.2 Models for Continuous Variables

Distribution | Probability Density Function | Parameters | E(X) | Var(X)
Uniform | 1/(b - a), a ≤ x ≤ b | a, b | (a + b)/2 | (b - a)²/12
Normal | (1/(σ·√(2π)))·e^(-(x-μ)²/(2σ²)), -∞ ≤ x ≤ +∞ | μ, σ | μ | σ²
Exponential | λ·e^(-λx), x ≥ 0 | λ | 1/λ | 1/λ²
Gamma | (λ^α/Γ(α))·x^(α-1)·e^(-λx), x ≥ 0 | α, λ | α/λ | α/λ²
Chi-square (χ²) | (1/(2^(ν/2)·Γ(ν/2)))·x^(ν/2-1)·e^(-x/2), x > 0 | ν | ν | 2ν
Student's t | [Γ((ν+1)/2)/(√(νπ)·Γ(ν/2))]·(1 + t²/ν)^(-(ν+1)/2), -∞ < t < ∞ | ν | E(T) = 0 | Var(T) = ν/(ν - 2)
Snedecor's F | [Γ((ν1+ν2)/2)/(Γ(ν1/2)·Γ(ν2/2))]·(ν1/ν2)^(ν1/2)·x^(ν1/2-1)/(1 + (ν1/ν2)·x)^((ν1+ν2)/2), x > 0 | ν1, ν2 | ν2/(ν2 - 2) | 2·ν2²·(ν1 + ν2 - 2)/[ν1·(ν2 - 4)·(ν2 - 2)²]

6.5 FINAL REMARKS
This chapter discussed the main probability distributions used in statistical inference, including the distributions for discrete random variables (discrete uniform, Bernoulli, binomial, geometric, negative binomial, hypergeometric, and Poisson) and for continuous random variables (uniform, normal, exponential, gamma, chi-square (χ²), Student's t, and Snedecor's F). When characterizing probability distributions, it is extremely important to use measures that indicate the most relevant aspects of the distribution, such as measures of position (mean, median, and mode), measures of dispersion (variance and standard deviation), and measures of skewness and kurtosis. Understanding the concepts related to probability and to probability distributions helps the researcher in the study of topics related to statistical inference, including parametric and nonparametric hypothesis tests, multivariate analysis through exploratory techniques, and estimation of regression models.
6.6 EXERCISES
1) In a shoe production line, the probability of a defective item being produced is 2%. For a batch with 150 items, determine the probability of a maximum of two items being defective. Also determine the mean and the variance. 2) The probability of a student solving a certain problem is 12%. If 10 students are selected randomly, what is the probability of exactly one of them being successful? 3) A telemarketing salesman sells one product every 8 customers he contacts. The salesman prepares a list of customers. Determine the probability of the first product being sold in the fifth call, in addition to the expected sales value and the respective variance. 4) The probability of a player scoring a penalty is 95%. Determine the probability of the player having to take a penalty kick 33 times to score 30 goals, besides the mean of penalty kicks. 5) Assume that, in a certain hospital, 3 patients undergo stomach surgery daily, following a Poisson distribution. Calculate the probability of 28 patients undergoing surgery next week (7 business days).
6) Assume that a certain random variable X follows a normal distribution with μ = 8 and σ² = 36. Determine the following probabilities:
a) P(X ≤ 12)
b) P(X < 5)
c) P(X > 2)
d) P(6 < X ≤ 11)
7) Consider random variable Z with a standardized normal distribution. Determine the critical value zc such that P(Z > zc) = 80%.
8) When tossing 40 balanced coins, determine the following probabilities:
a) Of getting exactly 22 heads.
b) Of getting more than 25 heads.
Solve this exercise by approximating the distribution through a normal distribution.
9) The time until a certain electronic device fails follows an exponential distribution with a failure rate per hour of 0.028. Determine the probability of a device chosen randomly remaining operational for:
a) 120 hours;
b) 60 hours.
10) A certain type of device follows an exponential distribution with a mean lifetime of 180 hours. Determine:
a) The probability of the device lasting more than 220 hours;
b) The probability of the device lasting a maximum of 150 hours.
11) The arrival of patients in a lab follows an exponential distribution with an average rate of 1.8 clients per minute. Determine:
a) The probability of the next client's arrival taking more than 30 seconds;
b) The probability of the next client's arrival taking a maximum of 1.5 minutes.
12) The time between clients' arrivals in a restaurant follows an exponential distribution with a mean of 3 minutes. Determine:
a) The probability of more than 3 clients arriving in 6 minutes;
b) The probability of the time until the fourth client arrives being less than 10 minutes.
13) A random variable X has a chi-square distribution with ν = 12 degrees of freedom. What is the critical value xc such that P(X > xc) = 90%?
14) Now, assume that X follows a chi-square distribution with ν = 16 degrees of freedom. Determine:
a) P(X > 25)
b) P(X ≤ 32)
c) P(25 < X ≤ 32)
d) The x value such that P(X ≤ x) = 0.975
e) The x value such that P(X > x) = 0.975
15) A random variable T follows Student's t distribution with ν = 20 degrees of freedom. Determine:
a) The critical value tc such that P(-tc < T < tc) = 95%
b) E(T)
c) Var(T)
16) Now, assume that T follows Student's t distribution with ν = 14 degrees of freedom. Determine:
a) P(T > 3)
b) P(T ≤ 2)
c) P(1.5 < T ≤ 2)
d) The t value such that P(T ≤ t) = 0.90
e) The t value such that P(T > t) = 0.025
17) Consider a random variable X that follows Snedecor's F distribution with ν1 = 4 and ν2 = 16 degrees of freedom, that is, X ~ F(4, 16). Determine:
a) P(X > 3)
b) F(4, 16) with α = 2.5%
c) The x value such that P(X ≤ x) = 0.99
d) E(X)
e) Var(X)
Chapter 7
Sampling Our reason becomes obscure when we consider that the countless fixed stars that shine in the sky do not have any other purpose besides illuminating worlds in which weeping and pain rule, and, in the best case scenario, only unpleasantness exists; at least, judging by the sample we know. Arthur Schopenhauer
7.1 INTRODUCTION
As discussed in the Introduction of this book, population is the set that has all the individuals, objects, or elements to be studied, which have one or more characteristics in common. A census is the study of data related to all the elements of the population. According to Bruni (2011), populations can be finite or infinite. Finite populations have a limited size, allowing their elements to be counted; infinite populations, on the other hand, have an unlimited size, not allowing us to count their elements. As examples of finite populations, we can mention the number of employees in a certain company, the number of members in a club, the number of products manufactured during a certain period, etc. When the number of elements in a population, even though they can be counted, is too high, we assume that the population is infinite. Examples of populations considered infinite are the number of inhabitants in the world, the number of residences in Rio de Janeiro, the number of points on a straight line, etc. Therefore, there are situations in which a study with all the elements in a population is impossible or unwanted. Hence, the alternative is to extract a subset from the population under analysis, which is called a sample. The sample must be representative of the population being studied, hence the importance of this chapter. From the information gathered in the sample and using suitable statistical procedures, the results obtained can be used to generalize, infer, or draw conclusions regarding the population (statistical inference). For Fávero et al. (2009) and Bussab and Morettin (2011), it is rarely possible to obtain the exact distribution of a variable, due to the high costs, the time needed, and the difficulties in collecting the data. Hence, the alternative is to select part of the elements in the population (sample) and, after that, infer the properties for the whole (population).
Essentially, there are two types of sampling: (1) probability or random sampling, and (2) nonprobability or nonrandom sampling. In random sampling, samples are obtained randomly, that is, the probability of each element of the population being a part of the sample is the same. In nonrandom sampling, on the other hand, the probability of some or all the elements of the population being in the sample is unknown. Fig. 7.1 shows the main random and nonrandom sampling techniques. Fávero et al. (2009) show the advantages and disadvantages of random and nonrandom techniques. Regarding random sampling techniques, the main advantages are: a) the selection criteria of the elements are rigorously defined, not allowing the researchers' or the interviewer's subjectivity to interfere in the selection of the elements; b) the possibility to mathematically determine the sample size based on accuracy and on the confidence level desired for the results. On the other hand, the main disadvantages are: a) difficulty in obtaining current and complete listings or regions of the population; b) geographically speaking, a random selection can generate a highly disperse sample, increasing the costs, the time needed for the study, and the difficulty in collecting the data. As regards nonrandom sampling techniques, the advantages are lower costs, less time to carry out the study, and less need of human resources. As disadvantages, we can mention: a) there are units in the population that cannot be chosen; b) a personal bias may happen; c) we do not know with what level of confidence the conclusions arrived at can be inferred
Data Science for Business and Decision Making. https://doi.org/10.1016/B978-0-12-811216-8.00007-0 © 2019 Elsevier Inc. All rights reserved.
170
PART
IV Statistical Inference
FIG. 7.1 Main sampling techniques: random sampling (simple, systematic, stratified, cluster) and nonrandom sampling (convenience, judgmental, quota, snowball).
for the population. These techniques do not use a random method to select the elements of the sample, so there is no guarantee that the sample selected is a good representative of the population (Fávero et al., 2009). Choosing the sampling technique must consider the goals of the survey, the acceptable error in the results, accessibility to the elements of the population, the desired representativeness, the time needed, and the availability of financial and human resources.
7.2 PROBABILITY OR RANDOM SAMPLING
In this type of sampling, samples are obtained randomly, that is, the probability of each element of the population being a part of the sample is the same, and all of the samples selected are equally probable. In this section, we will study the main probability or random sampling techniques: (a) simple random sampling, (b) systematic sampling, (c) stratified sampling, and (d) cluster sampling.
7.2.1 Simple Random Sampling
According to Bolfarine and Bussab (2005), simple random sampling (SRS) is the simplest and most important method for selecting a sample. Consider a population or universe (U) with N elements: U = {1, 2, …, N}. According to Bolfarine and Bussab (2005), planning and selecting the sample include the following steps: (a) Using a random procedure (as, for example, through a table with random numbers or a gravity-pick machine), we must draw an element from population U with the same probability; (b) We repeat the previous process until a sample with n observations is generated (the calculation of the size of a simple random sample will be studied in Section 7.4); (c) When the value drawn is removed from U before the next draw, we have the SRS without replacement process. In case drawing a unit more than once is allowed, we have the SRS with replacement process. According to Bolfarine and Bussab (2005), from a practical point of view, an SRS without replacement is much more interesting, because it satisfies the intuitive principle that we do not gain more information in case the same unit appears more than once in the sample. On the other hand, an SRS with replacement has mathematical and statistical advantages, such as the independence between the units drawn. Let's now study each of them.
7.2.1.1 Simple Random Sampling Without Replacement
According to Bolfarine and Bussab (2005), an SRS without replacement works as follows:
(a) All of the elements in the population are numbered from 1 to N: U = {1, 2, …, N}
Sampling Chapter
7
171
(b) Using a procedure to generate random numbers, we must draw, with the same probability, one of the N observations of the population;
(c) We draw the following element, with the previous value being removed from the population;
(d) We repeat the procedure until n observations have been drawn (how to calculate n is explained in Section 7.4.1).
In this type of sampling, there are C(N, n) = N!/(n!·(N - n)!) possible samples of n elements that can be obtained from the N population, and each sample has the same probability of being selected, 1/C(N, n).

Example 7.1: Simple Random Sampling without Replacement
Table 7.E.1 shows the weight (kg) of 30 parts. Draw, without any replacements, a random sample of size n = 5. How many different samples of size n can be obtained from the population? What is the probability of a sample being selected?
TABLE 7.E.1 Weight (kg) of 30 parts
6.4  6.2  7.0  6.8  7.2  6.4  6.5  7.1  6.8  6.9  7.0  7.1  6.6  6.8  6.7
6.3  6.6  7.2  7.0  6.9  6.8  6.7  6.5  7.2  6.8  6.9  7.0  6.7  6.9  6.8
Solution
All 30 parts were numbered from 1 to 30, as shown in Table 7.E.2.
TABLE 7.E.2 Numbers given to the parts
Part:     1    2    3    4    5    6    7    8    9   10   11   12   13   14   15
Weight: 6.4  6.2  7.0  6.8  7.2  6.4  6.5  7.1  6.8  6.9  7.0  7.1  6.6  6.8  6.7
Part:    16   17   18   19   20   21   22   23   24   25   26   27   28   29   30
Weight: 6.3  6.6  7.2  7.0  6.9  6.8  6.7  6.5  7.2  6.8  6.9  7.0  6.7  6.9  6.8
Through a random procedure (as, for example, the RANDBETWEEN function in Excel), the following numbers were selected:

02  03  14  24  28

The parts associated with these numbers form the random sample selected.
There are C(30, 5) = (30 × 29 × 28 × 27 × 26)/5! = 142,506 different samples.
The probability of a given sample being selected is 1/142,506.
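The draw in Example 7.1 can be reproduced with any random-number generator, not only Excel's RANDBETWEEN. A minimal Python sketch (Python is not used in the chapter; the seed and variable names are ours):

```python
import math
import random

# Weights of the 30 parts from Table 7.E.1, numbered 1 to 30 as in Table 7.E.2
weights = [6.4, 6.2, 7.0, 6.8, 7.2, 6.4, 6.5, 7.1, 6.8, 6.9,
           7.0, 7.1, 6.6, 6.8, 6.7, 6.3, 6.6, 7.2, 7.0, 6.9,
           6.8, 6.7, 6.5, 7.2, 6.8, 6.9, 7.0, 6.7, 6.9, 6.8]

N, n = len(weights), 5

# SRS without replacement: each of the C(N, n) samples is equally likely
random.seed(7)                               # fixed seed, for reproducibility only
drawn = random.sample(range(1, N + 1), k=n)  # part numbers, no repeats possible
sample = [weights[i - 1] for i in drawn]

n_samples = math.comb(N, n)                  # number of distinct samples
print(sorted(drawn), sample)
print(n_samples, 1 / n_samples)              # 142506 and its reciprocal
```

`random.sample` implements exactly the without-replacement draw of steps (a) to (d): once a part number is selected, it cannot appear again.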
7.2.1.2 Simple Random Sampling With Replacement
According to Bolfarine and Bussab (2005), an SRS with replacement works as follows:
(a) All of the elements in the population are numbered from 1 to N: U = {1, 2, …, N};
(b) Using a procedure that generates random numbers, we draw, with equal probability, one of the N observations of the population;
(c) We put this unit back into the population and draw the next value;
(d) We repeat the procedure until n observations have been drawn (how to calculate n is explained in Section 7.4.1).

In this type of sampling, there are N^n possible samples of n elements that can be obtained from the population, and each sample has the same probability, 1/N^n, of being selected.
PART IV Statistical Inference
Example 7.2: Simple Random Sampling With Replacement
Redo Example 7.1 considering simple random sampling with replacement.
Solution
The 30 parts were numbered from 1 to 30. Through a random procedure (for example, the RANDBETWEEN function in Excel), we drew the first part of the sample (12). This part is put back and the second element is drawn (23). The procedure is repeated until five parts have been drawn:

12  23  02  25  23

The parts associated with these numbers form the random sample selected; note that part 23 appears twice, which is possible because the sampling is done with replacement.
There are 30^5 = 24,300,000 different samples. The probability of a given sample being selected is 1/24,300,000.
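The with-replacement variant of Example 7.2 differs only in that each draw is independent of the previous ones; a Python sketch (seed and names are ours):

```python
import random

N, n = 30, 5

random.seed(42)  # reproducibility only; any seed works
# SRS with replacement: each draw is independent, so a part may repeat
drawn = random.choices(range(1, N + 1), k=n)

n_samples = N ** n                  # 30**5 possible ordered samples
print(drawn, n_samples, 1 / n_samples)
```

`random.choices` (plural) samples with replacement, in contrast to `random.sample`, which never repeats a unit.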
7.2.2
Systematic Sampling
According to Costa Neto (2002), when the elements of the population are sorted and removed periodically, we have systematic sampling. For example, in a production line, we can remove one element for every 50 items produced.
As advantages of systematic sampling over simple random sampling, we can mention that it is faster and cheaper to carry out, besides being less susceptible to errors made by the interviewer during the survey. The main disadvantage is the possibility of variation cycles, especially if these cycles coincide with the period at which elements are removed for the sample. For example, suppose that for every 60 parts produced by a certain machine one part is inspected; however, this machine has a recurring flaw, so that for every 20 parts produced, one is defective.
Assuming that the elements of the population are sorted from 1 to N and that we already know the sample size (n), systematic sampling works as follows:
(a) We determine the sampling interval (k), obtained as the quotient of the population size and the sample size:

k = N/n

This value must be rounded to the closest integer.
(b) In this phase, we introduce an element of randomness by choosing the starting unit. The first element chosen, X1, can be any element between 1 and k;
(c) After choosing the first element, every k-th element thereafter is removed from the population. The process is repeated until it reaches the sample size (n):

X1, X1 + k, X1 + 2k, …, X1 + (n − 1)k
Example 7.3: Systematic Sampling
Imagine a population with N = 500 sorted elements. We wish to remove a sample with n = 20 elements from this population. Use the systematic sampling procedure.
Solution
(a) The sampling interval (k) is:

k = N/n = 500/20 = 25

(b) The first element chosen, X1, can be any element between 1 and 25; suppose that X1 = 5;
(c) Since the first element of the sample is 5, the second element will be 5 + 25 = 30, the third element 5 + 50 = 55, and so on; the last element of the sample will be 5 + 19 × 25 = 480:

A = {5, 30, 55, 80, 105, 130, 155, 180, 205, 230, 255, 280, 305, 330, 355, 380, 405, 430, 455, 480}
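The procedure of Example 7.3 is easy to script; a minimal Python sketch (the function name is ours, not the book's):

```python
import random

def systematic_sample(N, n, start=None):
    """Systematic sampling: interval k = N/n (rounded), random start in 1..k."""
    k = round(N / n)                 # sampling interval
    if start is None:
        start = random.randint(1, k) # the single random element of the method
    return [start + j * k for j in range(n)]

# Example 7.3: N = 500, n = 20, starting unit X1 = 5
print(systematic_sample(500, 20, start=5))
# [5, 30, 55, ..., 480]
```

Only the starting unit is random; all later selections are deterministic, which is what makes the method vulnerable to periodic cycles in the population.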
7.2.3 Stratified Sampling
In this type of sampling, a heterogeneous population is stratified, or divided into homogeneous subpopulations or strata, and a sample is drawn from each stratum. Hence, we initially define the number of strata and, by doing that, obtain the size of each stratum. For each stratum, we specify how many elements will be drawn from the subpopulation; the allocation can be uniform or proportional.
According to Costa Neto (2002), uniform stratified sampling, in which we draw an equal number of elements from each stratum, is recommended when the strata are approximately the same size. In proportional stratified sampling, on the other hand, the number of elements drawn from each stratum is proportional to the number of elements in the stratum.
According to Freund (2006), if the elements selected in each stratum are simple random samples, the global process (stratification followed by random sampling) is called (simple) stratified random sampling. According to Freund (2006), stratified sampling works as follows:
(a) A population of size N is divided into k strata of sizes N1, N2, …, Nk;
(b) From each stratum, a random sample of size ni (i = 1, 2, …, k) is selected, resulting in k subsamples of sizes n1, n2, …, nk, where n = n1 + n2 + … + nk.
In uniform stratified sampling, we have:

n1 = n2 = … = nk    (7.1)

so the sample size obtained from each stratum is:

ni = n/k, for i = 1, 2, …, k    (7.2)

In proportional stratified sampling, on the other hand, we have:

n1/N1 = n2/N2 = … = nk/Nk    (7.3)

so the sample size obtained from each stratum can be calculated by the following expression:

ni = (Ni/N) · n, for i = 1, 2, …, k    (7.4)

As examples of stratified sampling, we can mention the stratification of a city into neighborhoods, of a population by gender or age group, of customers by social class, or of students by school. The calculation of the size of a stratified sample will be studied in Section 7.4.3.

Example 7.4: Stratified Sampling
Consider a club that has N = 5000 members. The population can be divided by age group, aiming at identifying the main activities practiced by each group: from 0 to 4 years of age; from 5 to 11; from 12 to 17; from 18 to 25; from 26 to 36; from 37 to 50; from 51 to 65; and over 65 years of age. We have N1 = 330, N2 = 350, N3 = 400, N4 = 520, N5 = 650, N6 = 1030, N7 = 980, N8 = 740. We would like to draw a stratified sample of size n = 80 from the population. What should the size of the sample drawn from each stratum be in the case of uniform sampling and of proportional sampling?
Solution
For uniform sampling, ni = n/k = 80/8 = 10. Therefore, n1 = … = n8 = 10.
For proportional sampling, we calculate ni = (Ni/N) · n, for i = 1, 2, …, 8:
n1 = (330/5000) × 80 = 5.3 ≈ 6,  n2 = (350/5000) × 80 = 5.6 ≈ 6
n3 = (400/5000) × 80 = 6.4 ≈ 7,  n4 = (520/5000) × 80 = 8.3 ≈ 9
n5 = (650/5000) × 80 = 10.4 ≈ 11,  n6 = (1030/5000) × 80 = 16.5 ≈ 17
n7 = (980/5000) × 80 = 15.7 ≈ 16,  n8 = (740/5000) × 80 = 11.8 ≈ 12
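The two allocation rules of Example 7.4 can be sketched in Python (function names are ours; fractional sizes are rounded up, as in the example):

```python
import math

def proportional_allocation(strata_sizes, n):
    """Proportional allocation: n_i = (N_i / N) * n, rounded up."""
    N = sum(strata_sizes)
    return [math.ceil(Ni / N * n) for Ni in strata_sizes]

def uniform_allocation(k, n):
    """Uniform allocation: n_i = n / k for every stratum."""
    return [n // k] * k

strata = [330, 350, 400, 520, 650, 1030, 980, 740]   # the 8 age groups
print(proportional_allocation(strata, 80))
# [6, 6, 7, 9, 11, 17, 16, 12] as in Example 7.4
print(uniform_allocation(8, 80))
# [10, 10, 10, 10, 10, 10, 10, 10]
```

Note that rounding up makes the proportional sizes sum to slightly more than n (here 84 instead of 80), a deliberately conservative choice.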
7.2.4
Cluster Sampling
In cluster sampling, the total population is subdivided into groups of elementary units, called clusters. The sampling is done from the groups, not from the individuals of the population. Hence, we randomly draw a sufficient number of clusters, and the objects in these clusters form the sample. This type of sampling is called one-stage cluster sampling.
According to Bolfarine and Bussab (2005), one of the inconveniences of cluster sampling is the fact that elements in the same cluster tend to have similar characteristics. The authors show that the more similar the elements in a cluster are, the less efficient the procedure is. Each cluster must be a good representative of the population, that is, it must be heterogeneous, containing all kinds of participants. In this respect, it is the opposite of stratified sampling. According to Martins and Domingues (2011), cluster sampling is a simple random sampling in which the sample units are the clusters; however, it is less expensive.
When we also draw elements within the clusters selected, we have two-stage cluster sampling: in the first stage, we draw the clusters and, in the second, we draw the elements. The number of elements to be drawn depends on the variability within the cluster: the higher the variability, the more elements must be drawn. On the other hand, when the units in a cluster are very similar, it is neither advisable nor necessary to draw all the elements, because they will bring the same kind of information (Bolfarine and Bussab, 2005). Cluster sampling can be generalized to several stages.
The main advantages that justify the wide use of cluster sampling are: (a) many populations are already grouped into natural or geographic subgroups, facilitating its application; (b) it allows a substantial reduction in the costs of obtaining the sample, without compromising its accuracy. In short, it is fast, cheap, and efficient. The main disadvantage is that clusters are rarely the same size, making it difficult to control the final sample size; however, certain statistical techniques can be used to overcome this problem. As examples of clusters, we can mention the production in a factory divided into assembly lines, company employees divided by department, students in a municipality divided by school, or the population of a municipality divided into districts.
Consider the following notation for cluster sampling:
N: population size;
M: number of clusters into which the population was divided;
Ni: size of cluster i (i = 1, 2, ..., M);
n: sample size;
m: number of clusters drawn (m < M);
ni: size of cluster i in the sample (i = 1, 2, ..., m), where ni = Ni;
bi: number of elements drawn from cluster i of the sample in the second stage (i = 1, 2, ..., m), where bi < ni.
In short, one-stage cluster sampling adopts the following procedure:
(a) The population is divided into M clusters (C1, …, CM), with sizes that are not necessarily the same;
(b) According to a sample plan, usually SRS, we draw m clusters (m < M);
(c) All the elements of each cluster drawn constitute the global sample, so ni = Ni and Σ_{i=1}^{m} ni = n.
The calculation of the number of clusters (m) will be studied in Section 7.4.4.
On the other hand, two-stage cluster sampling works as follows:
(a) The population is divided into M clusters (C1, …, CM), with sizes that are not necessarily the same;
(b) We draw m clusters in the first stage, according to some sample plan, usually SRS;
(c) From each cluster i drawn, of size ni, we draw bi elements in the second stage, according to the same or to another sample plan, where bi < ni and n = Σ_{i=1}^{m} bi.
Example 7.5: One-Stage Cluster Sampling
Consider a population with N = 20 elements, U = {1, 2, …, 20}. The population is divided into 7 clusters: C1 = {1, 2}, C2 = {3, 4, 5}, C3 = {6, 7, 8}, C4 = {9, 10, 11}, C5 = {12, 13, 14}, C6 = {15, 16}, C7 = {17, 18, 19, 20}. The sample plan adopted says that we should draw three clusters (m = 3) by simple random sampling without replacement. Assuming that clusters C1, C3, and C4 were drawn, determine the sample size and the elements that will constitute the one-stage cluster sample.
Solution
In one-stage cluster sampling, all the elements of each cluster drawn constitute the sample, so the sample is {C1, C3, C4} = {(1, 2), (6, 7, 8), (9, 10, 11)}. Therefore, n1 = 2, n2 = 3, and n3 = 3, and n = Σ_{i=1}^{3} ni = 8.
Example 7.6: Two-Stage Cluster Sampling
Example 7.5 will now be extended to two-stage cluster sampling. From the clusters drawn in the first stage, the sample plan adopted tells us to draw a single element with equal probability from each cluster (bi = 1, i = 1, 2, 3, and n = Σ_{i=1}^{3} bi = 3), which results in the following:
Stage 1: {C1, C3, C4} = {(1, 2), (6, 7, 8), (9, 10, 11)}
Stage 2: {1, 8, 10}
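Examples 7.5 and 7.6 can be simulated as follows. This is only a sketch: the seed is arbitrary, so the clusters actually drawn will generally differ from C1, C3, and C4 in the book's examples.

```python
import random

# The 7 clusters of Examples 7.5 and 7.6, partitioning U = {1, ..., 20}
clusters = {
    "C1": [1, 2], "C2": [3, 4, 5], "C3": [6, 7, 8], "C4": [9, 10, 11],
    "C5": [12, 13, 14], "C6": [15, 16], "C7": [17, 18, 19, 20],
}

random.seed(1)
m = 3
# Stage 1: draw m clusters by SRS without replacement
drawn = random.sample(sorted(clusters), k=m)

# One-stage sampling: every element of each drawn cluster enters the sample
one_stage = [x for c in drawn for x in clusters[c]]

# Stage 2 (two-stage sampling): draw b_i = 1 element from each drawn cluster
two_stage = [random.choice(clusters[c]) for c in drawn]

print(drawn, one_stage, two_stage)
```

The one-stage sample size is determined by the clusters drawn (here the sum of their sizes), while the two-stage sample size is fixed at m × b_i = 3.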
7.3
NONPROBABILITY OR NONRANDOM SAMPLING
In nonprobability sampling methods, samples are obtained in a nonrandom way; that is, the probability of some or all of the elements of the population belonging to the sample is unknown. Thus, it is not possible to estimate the sampling error, nor to generalize the results of the sample to the population, since the sample is not representative of the population. For Costa Neto (2002), this type of sampling is often used due to its simplicity, or because it is impossible to obtain probability samples, which would be the most desirable. Therefore, we must be careful when deciding to use this type of sampling, since it is subjective, based on the researcher's criteria and judgment, and the sampling variability cannot be established with accuracy.
In this section, we will study the main nonprobability or nonrandom sampling techniques: (a) convenience sampling, (b) judgmental or purposive sampling, (c) quota sampling, and (d) geometric propagation or snowball sampling.
7.3.1
Convenience Sampling
Convenience sampling is used when participation is voluntary or when the sample elements are chosen due to convenience or simplicity, such as friends, neighbors, or students. The advantage this method offers is that it allows the researcher to obtain information in a quick and cheap way. However, the sampling process does not guarantee that the sample is representative of the population, so the method should only be employed in extreme situations or in special cases that justify its use.

Example 7.7: Convenience Sampling
A researcher wishes to study customer behavior in relation to a certain brand and, in order to do that, develops a sampling plan. The data are collected through interviews with friends, neighbors, and workmates. This represents convenience sampling, since the sample is not representative of the population. It is important to highlight that, especially if the population is very heterogeneous, the results of the sample cannot be generalized to the population.
7.3.2
Judgmental or Purposive Sampling
In judgmental or purposive sampling, the sample is chosen according to an expert's opinion or previous judgment. It is a risky method, owing to possible mistakes in the researcher's prejudgment. Using this type of sampling requires knowledge of the population and of the elements selected.

Example 7.8: Judgmental or Purposive Sampling
A survey is trying to identify the reasons why a group of employees of a certain company went on strike. In order to do that, the researcher interviews the main leaders of the trade union and of the political movements, as well as employees who are not involved in such movements. Since the sample size is small and the sample is not representative of the population, it is not possible to generalize the results to the population.
7.3.3 Quota Sampling
Quota sampling presents greater rigor when compared to the other nonrandom sampling methods. For Martins and Domingues (2011), it is one of the sampling methods most used in market surveys and election polls. Quota sampling is a variation of judgmental sampling: initially, we set the quotas based on a certain criterion; within the quotas, the selection of the sample items depends on the interviewer's judgment. Quota sampling can also be considered a nonprobability version of stratified sampling.
Quota sampling consists of three steps:
(a) We select the control variables, or the population characteristics considered relevant for the study in question;
(b) We determine the percentage of the population (%) in each of the relevant variable categories;
(c) We establish the size of the quotas (the number of people with the required characteristics to be interviewed) for each interviewer, so that the sample has the same proportions as the population.
The main advantages of quota sampling are its low cost, speed, and the convenience or ease with which the interviewer can select the elements. However, since the selection of elements is not random, there is no guarantee that the sample will be representative of the population. Hence, it is not possible to generalize the results of the survey to the population.

Example 7.9: Quota Sampling
We would like to carry out election polls in a certain municipality with 14,244 voters. The main objective of the survey is to identify how people intend to vote based on their gender and age group. Table 7.E.3 shows the absolute frequencies for each pair of variable categories analyzed. Apply quota sampling, considering that the sample size is 200 voters and that there are two interviewers.
TABLE 7.E.3 Absolute Frequencies for Each Pair of Categories
Age Group        Male   Female   Total
16 and 17          50       48      98
from 18 to 24    1097     1063    2160
from 25 to 44    3409     3411    6820
from 45 to 69    2269     2207    4476
over 69           359      331     690
Total            7184     7060  14,244
Solution
(a) The variables relevant for the study are gender and age group;
(b) The percentage of the population (%) for each pair of categories of the variables analyzed is shown in Table 7.E.4.
TABLE 7.E.4 Percentage of the Population for Each Pair of Categories
Age Group          Male   Female    Total
16 and 17         0.35%    0.34%    0.69%
from 18 to 24     7.70%    7.46%   15.16%
from 25 to 44    23.93%   23.95%   47.88%
from 45 to 69    15.93%   15.49%   31.42%
over 69           2.52%    2.32%    4.84%
% of the Total   50.44%   49.56%  100.00%
(c) If we multiply each cell in Table 7.E.4 by the sample size (200), we get the dimensions of the quotas that compose the global sample, as shown in Table 7.E.5.
TABLE 7.E.5 Dimensions of the Quotas
Age Group        Male   Female   Total
16 and 17           1        1       2
from 18 to 24      16       15      31
from 25 to 44      48       48      96
from 45 to 69      32       31      63
over 69             5        5      10
Total             102      100     202
Considering that there are two interviewers, the quota for each one will be:
TABLE 7.E.6 Dimensions of the Quotas per Interviewer
Age Group        Male   Female   Total
16 and 17           1        1       2
from 18 to 24       8        8      16
from 25 to 44      24       24      48
from 45 to 69      16       16      32
over 69             3        3       6
Total              52       52     104
Note: The data in Tables 7.E.5 and 7.E.6 were rounded up, resulting in a total number of 202 voters in Table 7.E.5 and 104 voters in Table 7.E.6.
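The three quota-sampling steps above can be sketched in Python (the function name and category labels are ours; quotas are rounded up here, so individual cells and totals may differ by a unit from Tables 7.E.5 and 7.E.6, just as the book's rounding note acknowledges):

```python
import math

def quota_sizes(freq, n, interviewers=1):
    """Convert population frequencies into interview quotas.

    freq: {category: absolute frequency}. Each quota is the category's
    population share times n, rounded up; per-interviewer quotas split
    the sample quota evenly, also rounding up.
    """
    N = sum(freq.values())
    per_sample = {c: math.ceil(f / N * n) for c, f in freq.items()}
    per_interviewer = {c: math.ceil(q / interviewers) for c, q in per_sample.items()}
    return per_sample, per_interviewer

# Cells of Table 7.E.3, flattened as (age group, gender)
freq = {("16-17", "M"): 50, ("16-17", "F"): 48,
        ("18-24", "M"): 1097, ("18-24", "F"): 1063,
        ("25-44", "M"): 3409, ("25-44", "F"): 3411,
        ("45-69", "M"): 2269, ("45-69", "F"): 2207,
        (">69", "M"): 359, (">69", "F"): 331}

sample_q, interviewer_q = quota_sizes(freq, n=200, interviewers=2)
print(sample_q[("25-44", "M")], interviewer_q[("25-44", "M")])  # 48 24
```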
7.3.4
Geometric Propagation or Snowball Sampling
Geometric propagation or snowball sampling is widely used when the elements of the population are rare, difficult to access, or unknown. In this method, we identify one or more individuals from the target population, and these individuals identify other individuals belonging to the same population. The process is repeated until the proposed objective is achieved, that is, the point of saturation. The point of saturation is reached when the last respondents do not add new relevant information to the research, merely repeating the content of previous interviews.
As advantages, we can mention that: (a) it allows the researcher to find the desired characteristic in the population; (b) it is easy to apply, because recruiting is done through referrals from people who are in the population; (c) it has a low cost, because less planning and fewer people are needed; and (d) it is efficient for reaching populations that are difficult to access.

Example 7.10: Snowball Sampling
A company is recruiting professionals with a specific profile. The group hired initially recommends other professionals with the same profile. The process is repeated until the number of employees needed has been hired. This is an example of snowball sampling.
7.4
SAMPLE SIZE
According to Cabral (2006), there are six decisive factors when calculating the sample size:
1) The characteristics of the population, such as its variance (σ²) and dimension (N);
2) The sampling distribution of the estimator used;
3) The accuracy and reliability required of the results, it being necessary to specify the estimation error (B), which is the maximum difference that the researcher accepts between the population parameter and the estimate obtained from the sample;
4) The costs: the larger the sample size, the higher the costs;
5) Costs vs. sampling error: must we select a larger sample to reduce the sampling error, or must we reduce the sample size in order to minimize the resources and effort necessary, thus ensuring better control of the interviewers, a higher response rate, and precise and better processing of the information?
6) The statistical techniques that will be used: some statistical techniques demand larger samples than others.
The sample selected must be representative of the population. Based on Ferrão et al. (2001), Bolfarine and Bussab (2005), and Martins and Domingues (2011), this section discusses how to calculate the sample size for the mean (a quantitative variable) and for the proportion (a binary variable) of finite and infinite populations, with a maximum estimation error B, for each type of random sampling (simple, systematic, stratified, and cluster). In the case of nonrandom samples, either we set the sample size based on the available budget, or we adopt a dimension that has already been used successfully in previous studies with the same characteristics. A third alternative is to calculate the size of a random sample and use that dimension as a reference.
7.4.1
Size of a Simple Random Sample
This section discusses how to calculate the size of a simple random sample to estimate the mean (a quantitative variable) and the proportion (a binary variable) of finite and infinite populations, with a maximum estimation error B. The estimation error (B) for the mean is the maximum difference that the researcher accepts between μ (the population mean) and X̄ (the sample mean), that is, B ≥ |μ − X̄|. Analogously, the estimation error (B) for the proportion is the maximum difference that the researcher accepts between p (the population proportion) and p̂ (the sample proportion), that is, B ≥ |p − p̂|.
7.4.1.1 Sample Size to Estimate the Mean of an Infinite Population
If the variable chosen is quantitative and the population is infinite, the size of a simple random sample, such that P(|X̄ − μ| ≤ B) = 1 − α, can be calculated as follows:

n = σ² / (B²/z_α²)    (7.5)

where:
σ²: population variance;
B: maximum estimation error;
z_α: abscissa (coordinate) of the standard normal distribution at the significance level α.
According to Bolfarine and Bussab (2005), to determine the sample size it is necessary to set the maximum estimation error (B) and the significance level α (translated into the value of z_α), and to have some previous knowledge of the population variance (σ²). The first two are set by the researcher, while the third demands more work. When we do not know σ², its value must be substituted by a reasonable initial estimate. In many cases, a pilot sample can provide sufficient information about the population. In other cases, sample surveys carried out previously on the population can also provide satisfactory initial estimates of σ². Finally, some authors suggest the use of an approximate value for the standard deviation, given by σ ≈ range/4.
7.4.1.2 Sample Size to Estimate the Mean of a Finite Population
If the variable chosen is quantitative and the population is finite, the size of a simple random sample, such that P(|X̄ − μ| ≤ B) = 1 − α, can be calculated as follows:

n = N·σ² / [(N − 1)·B²/z_α² + σ²]    (7.6)

where:
N: population size;
σ²: population variance;
B: maximum estimation error;
z_α: abscissa of the standard normal distribution at the significance level α.
7.4.1.3 Sample Size to Estimate the Proportion of an Infinite Population
If the variable chosen is binary and the population is infinite, the size of a simple random sample, such that P(|p̂ − p| ≤ B) = 1 − α, can be calculated as follows:

n = p·q / (B²/z_α²)    (7.7)

where:
p: proportion of the population that has the desired characteristic;
q = 1 − p;
B: maximum estimation error;
z_α: abscissa of the standard normal distribution at the significance level α.
In practice, we do not know the value of p and must, therefore, use its estimate (p̂). If this value is also unknown, we assume p̂ = 0.50, hence obtaining a conservative size, that is, larger than necessary to ensure the required accuracy.
7.4.1.4 Sample Size to Estimate the Proportion of a Finite Population
If the variable chosen is binary and the population is finite, the size of a simple random sample, such that P(|p̂ − p| ≤ B) = 1 − α, can be calculated as follows:

n = N·p·q / [(N − 1)·B²/z_α² + p·q]    (7.8)

where:
N: population size;
p: proportion of the population that has the desired characteristic;
q = 1 − p;
B: maximum estimation error;
z_α: abscissa of the standard normal distribution at the significance level α.
Example 7.11: Calculating the Size of a Simple Random Sample
Consider the population of residents of a condominium (N = 540). We would like to estimate the average age of these residents. Based on previous surveys, we have an estimate of 463.32 for σ². Assume that a simple random sample will be drawn from the population. Given that the difference between the sample mean and the real population mean should be at most 4 years, with a confidence level of 95%, determine the sample size to be collected.
Solution
The value of z_α for α = 5% (two-tailed) is 1.96. From Expression (7.6), the sample size is:

n = N·σ² / [(N − 1)·B²/z_α² + σ²] = 540 × 463.32 / (539 × 4²/1.96² + 463.32) = 92.38 ≈ 93

Therefore, if we collect a simple random sample of at least 93 residents from the population, we can infer, with a confidence level of 95%, that the sample mean (X̄) will differ from the real population mean (μ) by at most 4 years.
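The calculation in Example 7.11 is easy to script; a minimal Python sketch of Expression (7.6) (the function name is ours):

```python
import math

def n_mean_finite(N, sigma2, B, z):
    """Expression (7.6): sample size for the mean, finite population.

    N: population size; sigma2: population variance (or an estimate);
    B: maximum estimation error; z: standard normal abscissa.
    """
    return math.ceil(N * sigma2 / ((N - 1) * B**2 / z**2 + sigma2))

# Example 7.11: N = 540, sigma^2 = 463.32, B = 4 years, z = 1.96 (95%)
print(n_mean_finite(540, 463.32, 4, 1.96))  # 93
```

Rounding up with `ceil` guarantees the estimation error stays within B; rounding down could fall short of the required accuracy.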
Example 7.12: Calculating the Size of a Simple Random Sample
We would like to estimate the proportion of voters who are dissatisfied with a certain politician's administration. We admit that the real proportion is unknown, as is its estimate. Assuming that a simple random sample will be drawn from an infinite population, and admitting a sampling error of 2% and a significance level of 5%, determine the sample size.
Solution
Since we know neither the real value of p nor its estimate, we assume p̂ = 0.50. Applying Expression (7.7) to estimate the proportion of an infinite population, we have:

n = p·q / (B²/z_α²) = 0.5 × 0.5 / (0.02²/1.96²) = 2401

Therefore, by randomly interviewing 2401 voters, we can infer the real proportion of dissatisfied voters with a maximum estimation error of 2% and a confidence level of 95%.
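Expression (7.7) and the conservative choice p = 0.5 used in Example 7.12 can be sketched as follows (function name is ours):

```python
def n_prop_infinite(B, z, p=0.5):
    """Expression (7.7): sample size for a proportion, infinite population.

    p = 0.5 maximizes p*(1-p) and so gives the largest (conservative) size.
    """
    return p * (1 - p) * (z / B) ** 2

# Example 7.12: B = 0.02, z = 1.96, p unknown -> p = 0.5
n = n_prop_infinite(0.02, 1.96)
print(round(n))  # 2401
```

Any other value of p yields a smaller n; for instance, a pilot estimate of p = 0.3 would already reduce the required sample noticeably.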
7.4.2
Size of the Systematic Sample
In systematic sampling, we use the same expressions as in simple random sampling (studied in Section 7.4.1), according to the type of variable (quantitative or binary) and of population (infinite or finite).
7.4.3
Size of the Stratified Sample
This section discusses how to calculate the size of a stratified sample to estimate the mean (a quantitative variable) and the proportion (a binary variable) of finite and infinite populations, with a maximum estimation error B. As before, the estimation error (B) for the mean satisfies B ≥ |μ − X̄|, and for the proportion, B ≥ |p − p̂|.
Let's use the following notation to calculate the size of the stratified sample:
k: number of strata;
Ni: size of stratum i, i = 1, 2, ..., k;
N = N1 + N2 + … + Nk (population size);
Wi = Ni/N (weight or proportion of stratum i, with Σ_{i=1}^{k} Wi = 1);
μi: population mean of stratum i;
σi²: population variance of stratum i;
ni: number of elements randomly selected from stratum i;
n = n1 + n2 + … + nk (sample size);
X̄i: sample mean of stratum i;
Si²: sample variance of stratum i;
pi: proportion of elements that have the desired characteristic in stratum i;
qi = 1 − pi.
7.4.3.1 Sample Size to Estimate the Mean of an Infinite Population
If the variable chosen is quantitative and the population is infinite, the size of the stratified sample, such that P(|X̄ − μ| ≤ B) = 1 − α, can be calculated as:

n = [Σ_{i=1}^{k} Wi·σi²] / (B²/z_α²)    (7.9)

where:
Wi = Ni/N (weight or proportion of stratum i, with Σ_{i=1}^{k} Wi = 1);
σi²: population variance of stratum i;
B: maximum estimation error;
z_α: abscissa of the standard normal distribution at the significance level α.
7.4.3.2 Sample Size to Estimate the Mean of a Finite Population
If the variable chosen is quantitative and the population is finite, the size of the stratified sample, such that P(|X̄ − μ| ≤ B) = 1 − α, can be calculated as:

n = [Σ_{i=1}^{k} Ni²·σi²/Wi] / [N²·B²/z_α² + Σ_{i=1}^{k} Ni·σi²]    (7.10)

where:
Ni: size of stratum i, i = 1, 2, ..., k;
σi²: population variance of stratum i;
Wi = Ni/N (weight or proportion of stratum i, with Σ_{i=1}^{k} Wi = 1);
N: population size;
B: maximum estimation error;
z_α: abscissa of the standard normal distribution at the significance level α.
7.4.3.3 Sample Size to Estimate the Proportion of an Infinite Population
If the variable chosen is binary and the population is infinite, the size of the stratified sample, such that P(|p̂ − p| ≤ B) = 1 − α, can be calculated as:

n = [Σ_{i=1}^{k} Wi·pi·qi] / (B²/z_α²)    (7.11)

where:
Wi = Ni/N (weight or proportion of stratum i, with Σ_{i=1}^{k} Wi = 1);
pi: proportion of elements that have the desired characteristic in stratum i;
qi = 1 − pi;
B: maximum estimation error;
z_α: abscissa of the standard normal distribution at the significance level α.
7.4.3.4 Sample Size to Estimate the Proportion of a Finite Population
If the variable chosen is binary and the population is finite, the size of the stratified sample, such that P(|p̂ − p| ≤ B) = 1 − α, can be calculated as:

n = [Σ_{i=1}^{k} Ni²·pi·qi/Wi] / [N²·B²/z_α² + Σ_{i=1}^{k} Ni·pi·qi]    (7.12)

where:
Ni: size of stratum i, i = 1, 2, ..., k;
pi: proportion of elements that have the desired characteristic in stratum i;
qi = 1 − pi;
Wi = Ni/N (weight or proportion of stratum i, with Σ_{i=1}^{k} Wi = 1);
N: population size;
B: maximum estimation error;
z_α: abscissa of the standard normal distribution at the significance level α.
Example 7.13: Calculating the Size of a Stratified Sample
A university has 11,886 students enrolled in 14 undergraduate courses, divided into three major areas: Exact Sciences, Human Sciences, and Biological Sciences. Table 7.E.7 shows the number of students enrolled per area. A survey will be carried out in order to estimate the average time students spend studying per week (in hours). Based on pilot samples, we obtain the following estimates of the variances for the areas of Exact, Human, and Biological Sciences: 124.36, 153.22, and 99.87, respectively. The samples selected must be proportional to the number of students per area. Determine the sample size, considering an estimation error of 0.8 hours and a confidence level of 95%.
TABLE 7.E.7 Number of students enrolled per area
Area                   Number of students enrolled
Exact Sciences           5285
Human Sciences           3877
Biological Sciences      2724
Total                  11,886
Solution
From the data, we have: k = 3, N1 = 5285, N2 = 3877, N3 = 2724, N = 11,886, B = 0.8.

W1 = 5285/11,886 = 0.4446,  W2 = 3877/11,886 = 0.3262,  W3 = 2724/11,886 = 0.2292

For α = 5%, we have z_α = 1.96. Based on the pilot samples, we use the estimates of σ1², σ2², and σ3². The sample size is calculated from Expression (7.10):

n = [5285² × 124.36/0.4446 + 3877² × 153.22/0.3262 + 2724² × 99.87/0.2292] / [11,886² × 0.8²/1.96² + (5285 × 124.36 + 3877 × 153.22 + 2724 × 99.87)] = 722.52 ≈ 723

Since the allocation is proportional, we obtain the size of each stratum by using the expression ni = Wi·n (i = 1, 2, 3):

n1 = W1·n = 0.4446 × 723 = 321.48 ≈ 322
n2 = W2·n = 0.3262 × 723 = 235.83 ≈ 236
n3 = W3·n = 0.2292 × 723 = 165.70 ≈ 166

Thus, to carry out the survey, we must select 322 students from the area of Exact Sciences, 236 from Human Sciences, and 166 from Biological Sciences. From the sample selected, we can infer, with a 95% confidence level, that the difference between the sample mean and the real population mean will be at most 0.8 hours.
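The computation in Example 7.13 can be sketched in Python (function name is ours); it applies Expression (7.10) and then the proportional allocation of Expression (7.4):

```python
import math

def n_strat_mean_finite(N_i, var_i, B, z):
    """Expression (7.10): stratified sample size for the mean, finite population.

    N_i: stratum sizes; var_i: stratum variances (or estimates);
    B: maximum estimation error; z: standard normal abscissa.
    """
    N = sum(N_i)
    W = [Ni / N for Ni in N_i]
    num = sum(Ni**2 * v / w for Ni, v, w in zip(N_i, var_i, W))
    den = N**2 * B**2 / z**2 + sum(Ni * v for Ni, v in zip(N_i, var_i))
    return num / den

# Example 7.13: three areas, proportional allocation, B = 0.8 h, 95% confidence
N_i = [5285, 3877, 2724]
var_i = [124.36, 153.22, 99.87]
n = math.ceil(n_strat_mean_finite(N_i, var_i, B=0.8, z=1.96))
alloc = [math.ceil(Ni / sum(N_i) * n) for Ni in N_i]
print(n, alloc)  # 723 [322, 236, 166]
```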
Sampling Chapter 7
Example 7.14 Calculating the Size of a Stratified Sample
Consider the same population from the previous example; however, the objective now is to estimate the proportion of students who work, in each area. Based on a pilot sample, we have the following estimates per area: p̂_1 = 0.3 (Exact Sciences), p̂_2 = 0.6 (Human Sciences), and p̂_3 = 0.4 (Biological Sciences). The type of stratified sampling used in this case is uniform. Determine the sample size, considering an estimation error of 3% and a 90% confidence level.

Solution
Since we do not know the real value of p for each area, we use its estimate. For a 90% confidence level, we have z_α = 1.645. Applying Expression (7.12) for stratified sampling to estimate the proportion of a finite population, we have:

$$n = \frac{\sum_{i=1}^{k} N_i^2 \, p_i q_i / W_i}{N^2 \dfrac{B^2}{z_\alpha^2} + \sum_{i=1}^{k} N_i \, p_i q_i}$$

$$n = \frac{\dfrac{5{,}285^2 \cdot 0.3 \cdot 0.7}{0.4447} + \dfrac{3{,}877^2 \cdot 0.6 \cdot 0.4}{0.3262} + \dfrac{2{,}724^2 \cdot 0.4 \cdot 0.6}{0.2292}}{11{,}886^2 \cdot \dfrac{0.03^2}{1.645^2} + 5{,}285 \cdot 0.3 \cdot 0.7 + 3{,}877 \cdot 0.6 \cdot 0.4 + 2{,}724 \cdot 0.4 \cdot 0.6} = 644.54 \cong 645$$

Since the allocation is uniform, we have n_1 = n_2 = n_3 = 215. Therefore, to carry out the survey, we must randomly select 215 students from each area. From the selected sample, we can infer, with a 90% confidence level, that the difference between the sample proportion and the true population proportion will be at most 3%.
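The same cross-check works for Example 7.14. The sketch below (again an illustration, not book code) reproduces n = 645 and the uniform allocation of 215 students per area:

```python
import math

# Stratum sizes and pilot proportions from Example 7.14
N_i = [5285, 3877, 2724]
p_i = [0.3, 0.6, 0.4]             # pilot estimates of the proportion who work
B, z = 0.03, 1.645                # 3% error, 90% confidence

N = sum(N_i)
W = [Ni / N for Ni in N_i]        # proportional weights, as used in the formula

# Expression (7.12): n = sum(N_i^2 p_i q_i / W_i) / (N^2 B^2/z^2 + sum(N_i p_i q_i))
num = sum(Ni**2 * p * (1 - p) / w for Ni, p, w in zip(N_i, p_i, W))
den = N**2 * B**2 / z**2 + sum(Ni * p * (1 - p) for Ni, p in zip(N_i, p_i))
n = math.ceil(num / den)                 # 645

n_per_area = math.ceil(n / len(N_i))     # uniform allocation: 215 per area
```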
7.4.4 Size of a Cluster Sample

This section discusses how to calculate the size of a one-stage and a two-stage cluster sample. Let's consider the following notation:

N: population size;
M: number of clusters into which the population was divided;
N_i: size of cluster i (i = 1, 2, ..., M);
n: sample size;
m: number of clusters drawn (m < M);
n_i: size of cluster i in the sample drawn in the first stage (i = 1, 2, ..., m), where n_i = N_i;
b_i: size of cluster i in the sample drawn in the second stage (i = 1, 2, ..., m), where b_i < n_i;
N̄ = N/M (average size of the population clusters);
n̄ = n/m (average size of the sample clusters);
X_ij: j-th observation in cluster i;
σ²_dc: population variance within the clusters;
σ²_ec: population variance between the clusters;
σ²_i: population variance in cluster i;
μ_i: population mean in cluster i;
σ²_c = σ²_dc + σ²_ec (total population variance).

According to Bolfarine and Bussab (2005), σ²_dc and σ²_ec are calculated as:

$$\sigma_{dc}^2 = \frac{\sum_{i=1}^{M}\sum_{j=1}^{N_i} \left(X_{ij} - \mu_i\right)^2}{N} = \frac{1}{M}\sum_{i=1}^{M} \frac{N_i}{\bar{N}} \sigma_i^2 \qquad (7.13)$$

$$\sigma_{ec}^2 = \frac{1}{N}\sum_{i=1}^{M} N_i \left(\mu_i - \mu\right)^2 = \frac{1}{M}\sum_{i=1}^{M} \frac{N_i}{\bar{N}} \left(\mu_i - \mu\right)^2 \qquad (7.14)$$

Assuming that all the clusters are the same size, the previous expressions simplify to:

$$\sigma_{dc}^2 = \frac{1}{M}\sum_{i=1}^{M} \sigma_i^2 \qquad (7.15)$$
PART IV Statistical Inference
$$\sigma_{ec}^2 = \frac{1}{M}\sum_{i=1}^{M} \left(\mu_i - \mu\right)^2 \qquad (7.16)$$
7.4.4.1 Size of a One-Stage Cluster Sample

This section discusses how to calculate the size of a one-stage cluster sample to estimate the mean (a quantitative variable) of a finite or infinite population, with a maximum estimation error B. The estimation error (B) for the mean is the maximum difference that the researcher accepts between μ (population mean) and X̄ (sample mean), that is, B ≥ |μ − X̄|.

7.4.4.1.1 Sample Size to Estimate the Mean of an Infinite Population

If the variable chosen is quantitative and the population infinite, the number of clusters drawn in the first stage (m), where P(|X̄ − μ| ≤ B) = 1 − α, can be calculated as follows:

$$m = \frac{\sigma_c^2}{B^2 / z_\alpha^2} \qquad (7.17)$$

where:
σ²_c = σ²_dc + σ²_ec, according to Expressions (7.13)–(7.16);
B: maximum estimation error;
z_α: coordinate of the standard normal distribution at the significance level α.

If the clusters are the same size, Bolfarine and Bussab (2005) demonstrate that:

$$m = \frac{\sigma_{ec}^2}{B^2 / z_\alpha^2} \qquad (7.18)$$
According to the authors, σ²_c is generally unknown and has to be estimated from pilot samples or obtained from previous sample surveys.

7.4.4.1.2 Sample Size to Estimate the Mean of a Finite Population
If the variable chosen is quantitative and the population finite, the number of clusters drawn in the first stage (m), where P(|X̄ − μ| ≤ B) = 1 − α, can be calculated as follows:

$$m = \frac{M \sigma_c^2}{M \dfrac{B^2 \bar{N}^2}{z_\alpha^2} + \sigma_c^2} \qquad (7.19)$$

where:
M: number of clusters into which the population was divided;
σ²_c = σ²_dc + σ²_ec, according to Expressions (7.13)–(7.16);
B: maximum estimation error;
N̄ = N/M (average size of the population clusters);
z_α: coordinate of the standard normal distribution at the significance level α.

7.4.4.1.3 Sample Size to Estimate the Proportion of an Infinite Population

If the variable chosen is binary and the population infinite, the number of clusters drawn in the first stage (m), where P(|p̂ − p| ≤ B) = 1 − α, can be calculated as follows:

$$m = \frac{\dfrac{1}{M}\sum_{i=1}^{M} \dfrac{N_i}{\bar{N}} p_i q_i}{B^2 / z_\alpha^2} \qquad (7.20)$$

where:
M: number of clusters into which the population was divided;
N_i: size of cluster i (i = 1, 2, ..., M);
N̄ = N/M (average size of the population clusters);
p_i: proportion of elements that have the desired characteristic in cluster i;
q_i = 1 − p_i;
B: maximum estimation error;
z_α: coordinate of the standard normal distribution at the significance level α.

7.4.4.1.4 Sample Size to Estimate the Proportion of a Finite Population

If the variable chosen is binary and the population finite, the number of clusters drawn in the first stage (m), where P(|p̂ − p| ≤ B) = 1 − α, can be calculated as follows:
$$m = \frac{\sum_{i=1}^{M} \dfrac{N_i}{\bar{N}} p_i q_i}{M \dfrac{B^2 \bar{N}^2}{z_\alpha^2} + \dfrac{1}{M}\sum_{i=1}^{M} \dfrac{N_i}{\bar{N}} p_i q_i} \qquad (7.21)$$

where:
M: number of clusters into which the population was divided;
N_i: size of cluster i (i = 1, 2, ..., M);
N̄ = N/M (average size of the population clusters);
p_i: proportion of elements that have the desired characteristic in cluster i;
q_i = 1 − p_i;
B: maximum estimation error;
z_α: coordinate of the standard normal distribution at the significance level α.
7.4.4.2 Size of a Two-Stage Cluster Sample

In this case, we assume that all the clusters are the same size. Based on Bolfarine and Bussab (2005), let's consider the following linear cost function:

$$C = c_1 n + c_2 b \qquad (7.22)$$

where:
c_1: cost of observing one unit in the first stage;
c_2: cost of observing one unit in the second stage;
n: sample size in the first stage;
b: sample size in the second stage.

The optimal size of b, the one that minimizes the linear cost function, is given by:

$$b^{*} = \frac{\sigma_{dc}}{\sigma_{ec}} \sqrt{\frac{c_1}{c_2}} \qquad (7.23)$$
Example 7.15: Calculating the Size of a Cluster Sample
Consider the members of a certain club in Sao Paulo (N = 4,500). We would like to estimate the average evaluation score (0 to 10) given by these members to the main features of the club. The population is divided into 10 groups of 450 elements each, based on their membership numbers. The estimates of the mean and of the population variance per group, based on previous surveys, can be seen in Table 7.E.8. Assuming that the cluster sampling is carried out in a single stage, determine the number of clusters that must be drawn, considering B = 2% and α = 1%.

TABLE 7.E.8 Mean and population variance per group

i       1     2     3     4     5     6     7     8     9     10
μ_i    7.4   6.6   8.1   7.0   6.7   7.3   8.1   7.5   6.2   6.9
σ²_i  22.5  36.7  29.6  33.1  40.8  51.7  39.7  30.6  40.5  42.7
Solution
From the data given, we have: N = 4,500, M = 10, N̄ = 4,500/10 = 450, B = 0.02, and z_α = 2.575. Since all the clusters are the same size, σ²_dc and σ²_ec are calculated as:

$$\sigma_{dc}^2 = \frac{1}{M}\sum_{i=1}^{M} \sigma_i^2 = \frac{22.5 + 36.7 + \dots + 42.7}{10} = 36.79$$

$$\sigma_{ec}^2 = \frac{1}{M}\sum_{i=1}^{M} \left(\mu_i - \mu\right)^2 = \frac{(7.4 - 7.18)^2 + \dots + (6.9 - 7.18)^2}{10} = 0.35$$

Therefore, σ²_c = σ²_dc + σ²_ec = 36.79 + 0.35 = 37.14. The number of clusters to be drawn in one stage, for a finite population, is given by Expression (7.19):

$$m = \frac{M \sigma_c^2}{M \dfrac{B^2 \bar{N}^2}{z_\alpha^2} + \sigma_c^2} = \frac{10 \cdot 37.14}{10 \cdot \dfrac{0.02^2 \cdot 450^2}{2.575^2} + 37.14} = 2.33 \cong 3$$

Therefore, the population of N = 4,500 members is divided into M = 10 clusters of the same size (N_i = 450, i = 1, ..., 10). From the total number of clusters, we must randomly draw m = 3 clusters. In one-stage cluster sampling, all the elements of each cluster drawn constitute the global sample (n = 450 × 3 = 1,350).
From the sample selected, we can infer, with a 99% confidence level, that the difference between the sample mean and the real population mean will be 2%, at the most. Table 7.1 shows a summary of the expressions used to calculate the sample size for the mean (a quantitative variable) and the proportion (a binary variable) of finite and infinite populations, with a maximum estimation error B, and for each type of random sampling (simple, systematic, stratified, and cluster).
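Before moving on, the cluster-sampling computation of Example 7.15 can be verified numerically. The Python sketch below (an illustration, not code from the book) reproduces σ²_dc, σ²_ec, and the number of clusters from Expression (7.19):

```python
import math

# Group variances and means from Table 7.E.8
s2_i = [22.5, 36.7, 29.6, 33.1, 40.8, 51.7, 39.7, 30.6, 40.5, 42.7]
mu_i = [7.4, 6.6, 8.1, 7.0, 6.7, 7.3, 8.1, 7.5, 6.2, 6.9]
M, N_bar = 10, 450
B, z = 0.02, 2.575                 # 2% error, 99% confidence

s2_dc = sum(s2_i) / M                          # within-cluster variance: 36.79
mu = sum(mu_i) / M                             # overall mean: 7.18
s2_ec = sum((m - mu) ** 2 for m in mu_i) / M   # between-cluster variance: ~0.35
s2_c = s2_dc + s2_ec                           # total variance: ~37.14

# Expression (7.19): number of clusters to draw for a finite population
m_clusters = math.ceil(M * s2_c / (M * B**2 * N_bar**2 / z**2 + s2_c))   # 3
n = m_clusters * N_bar                                                   # 1350
```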
7.5 FINAL REMARKS
It is rarely possible to obtain the exact distribution of a variable by observing all the elements of the population, due to the high costs, the time needed, and the difficulties in collecting the data. The alternative is to select part of the elements of the population (a sample) and then infer the properties of the whole (the population). Since the sample must be a good representative of the population, choosing the sampling technique is essential in this process. Sampling techniques can be classified into two major groups: probability (random) sampling and nonprobability (nonrandom) sampling. Among the main random sampling techniques, we can highlight simple random sampling (with and without replacement), systematic, stratified, and cluster sampling. The main nonrandom sampling techniques are convenience, judgmental (purposive), quota, and snowball sampling. Each of these techniques has advantages and disadvantages, and choosing the best one must take the characteristics of each study into consideration. This chapter also discussed how to calculate the sample size for the mean and the proportion of finite and infinite populations, for each type of random sampling. In the case of nonrandom samples, the researcher must either establish the sample size based on the available budget or adopt a dimension that has already been used successfully in previous studies with similar characteristics. Another alternative is to calculate the size of a random sample and use it as a reference.
TABLE 7.1 Expressions to Calculate the Size of Random Samples

Simple and systematic random sampling (the expressions are the same for both):
Estimating the mean (infinite population): $n = \dfrac{\sigma^2}{B^2/z_\alpha^2}$
Estimating the mean (finite population): $n = \dfrac{N\sigma^2}{(N-1)\dfrac{B^2}{z_\alpha^2} + \sigma^2}$
Estimating the proportion (infinite population): $n = \dfrac{p\,q}{B^2/z_\alpha^2}$
Estimating the proportion (finite population): $n = \dfrac{N\,p\,q}{(N-1)\dfrac{B^2}{z_\alpha^2} + p\,q}$

Stratified random sampling:
Estimating the mean (infinite population): $n = \dfrac{\sum_{i=1}^{k} W_i \sigma_i^2}{B^2/z_\alpha^2}$
Estimating the mean (finite population): $n = \dfrac{\sum_{i=1}^{k} N_i^2 \sigma_i^2 / W_i}{N^2 \dfrac{B^2}{z_\alpha^2} + \sum_{i=1}^{k} N_i \sigma_i^2}$
Estimating the proportion (infinite population): $n = \dfrac{\sum_{i=1}^{k} W_i p_i q_i}{B^2/z_\alpha^2}$
Estimating the proportion (finite population): $n = \dfrac{\sum_{i=1}^{k} N_i^2 p_i q_i / W_i}{N^2 \dfrac{B^2}{z_\alpha^2} + \sum_{i=1}^{k} N_i p_i q_i}$

One-stage cluster sampling:
Estimating the mean (infinite population): $m = \dfrac{\sigma_c^2}{B^2/z_\alpha^2}$
Estimating the mean (finite population): $m = \dfrac{M\sigma_c^2}{M \dfrac{B^2 \bar{N}^2}{z_\alpha^2} + \sigma_c^2}$
Estimating the proportion (infinite population): $m = \dfrac{\frac{1}{M}\sum_{i=1}^{M} \frac{N_i}{\bar{N}} p_i q_i}{B^2/z_\alpha^2}$
Estimating the proportion (finite population): $m = \dfrac{\sum_{i=1}^{M} \frac{N_i}{\bar{N}} p_i q_i}{M \dfrac{B^2 \bar{N}^2}{z_\alpha^2} + \frac{1}{M}\sum_{i=1}^{M} \frac{N_i}{\bar{N}} p_i q_i}$

7.6 EXERCISES

1) Why is sampling important?
2) What are the differences between random and nonrandom sampling techniques? In which cases should each be used?
3) What is the difference between stratified and cluster sampling?
4) What are the advantages and limitations of each sampling technique?
5) What type of sampling is used in the EuroMillions Lottery?
6) To verify whether a part meets certain quality specifications, from every batch of 150 parts produced we randomly pick one unit and inspect all of its quality characteristics. What type of sampling should be used in this case?
7) Assume that the population of the city of Youngstown (OH) is divided by educational level and that, for each level, a percentage of the population will be interviewed. What type of sampling should be used in this case?
8) In a production line, one batch with 1,500 parts is produced every hour. From each batch, we randomly pick a sample with 125 units and inspect all the quality characteristics of each sample unit to check whether the part is defective or not. What type of sampling should be used in this case?
9) The population of the city of Sao Paulo is divided into 96 districts. From this total, 24 districts will be randomly drawn and, in each one of them, a small sample of the population will be interviewed in a public opinion survey. What type of sampling should be used in this case?
10) We would like to estimate the illiteracy rate in a municipality with 4,000 inhabitants who are 15 years of age or older. Based on previous surveys, we estimate that p̂ = 0.24. A random sample will be drawn from the population. Assuming a maximum estimation error of 5% and a 95% confidence level, what should the sample size be?
11) The population of a certain municipality with 120,000 inhabitants is divided into five regions (North, South, Center, East, and West). The table below shows the number of inhabitants per region. A random sample will be collected in each
Region    Inhabitants
North     14,060
South     19,477
Center    36,564
East      26,424
West      23,475
region in order to estimate the average age of its inhabitants. The samples selected must be proportional to the number of inhabitants per region. Based on pilot samples, we obtain the following estimates for the variances in the five regions: 44.5 (North), 59.3 (South), 82.4 (Center), 66.2 (East), and 69.5 (West). Determine the sample size, considering an estimation error of 0.6 and a 99% confidence level. 12) Consider a municipality with 120,000 inhabitants. We would like to estimate the percentage of the population that lives in urban and rural areas. The sampling plan used divides the municipality into 85 districts of different sizes. From all the districts, we would like to select some and, for each district chosen, all the inhabitants will be selected. The file Districts.xls shows the size of each district, as well as the estimated percentage of the urban and rural population. Determine the total number of districts to be drawn assuming a maximum estimation error of 10% and a 90% confidence level.
Chapter 8
Estimation

A comprehensive study of nature is the most fruitful source of mathematical discoveries.
Joseph Fourier
8.1 INTRODUCTION
As previously described, the main objective of statistical inference is to draw conclusions about the population based on data obtained from a sample, and the sample must be representative of the population. One of the most important goals of statistical inference is the estimation of population parameters, which is the focus of this chapter. For Bussab and Morettin (2011), a parameter can be defined as a function of a set of population values; a statistic as a function of a set of sample values; and an estimate as the value assumed by the parameter in a certain sample. Parameters can be estimated through a single value (point estimation) or through an interval of values (interval estimation). The main point estimation methods are the method of moments, ordinary least squares, and maximum likelihood estimation. In turn, the main interval estimation methods, or confidence intervals (CI), are the CI for the population mean when the variance is known, the CI for the population mean when the variance is unknown, the CI for the population variance, and the CI for the proportion.
8.2 POINT AND INTERVAL ESTIMATION
Population parameters can therefore be estimated through a single point or through an interval of values. As examples of population parameter estimators (point and interval), we can mention the mean, the variance, and the proportion.
8.2.1 Point Estimation
Point estimation is used when we want to estimate a single value of the population parameter we are interested in. The population parameter estimate is calculated from a sample. Hence, the sample mean (X̄) is a point estimate of the real population mean (μ). Analogously, the sample variance (S²) is a point estimate of the population variance (σ²), just as the sample proportion (p̂) is a point estimate of the population proportion (p).

Example 8.1: Point Estimation
Consider a luxury condominium with 702 lots. We would like to estimate the average size of the lots, their variance, and the proportion of lots for sale. In order to do that, a random sample with 60 lots is collected, revealing an average size of 1,750 m² per lot, a variance of 420, and a proportion of 8% of the lots for sale. Thus: (a) X̄ = 1,750 is a point estimate of the real population mean (μ); (b) S² = 420 is a point estimate of the real population variance (σ²); and (c) p̂ = 0.08 is a point estimate of the real population proportion (p).
Data Science for Business and Decision Making. https://doi.org/10.1016/B978-0-12-811216-8.00008-2 © 2019 Elsevier Inc. All rights reserved.
8.2.2 Interval Estimation
Interval estimation is used when we are interested in finding an interval of possible values in which the estimated parameter lies, with a certain confidence level (1 − α), where α is the significance level.

Example 8.2: Interval Estimation
Consider the information in Example 8.1. However, instead of a point estimate of each population parameter, let's use an interval estimate: (a) the [1,700; 1,800] interval contains the average size of the 702 condominium lots, with a 99% confidence level; (b) with a 95% confidence level, the [400; 440] interval contains the population variance of the size of the lots; (c) the [6%; 10%] interval contains the proportion of lots for sale in the condominium, with 90% confidence.
8.3 POINT ESTIMATION METHODS
The main point estimation methods are the method of moments, ordinary least squares, and maximum likelihood estimation.
8.3.1 Method of Moments
In the method of moments, the population parameters are estimated from sample estimators such as the sample mean and the sample variance. Consider a random variable X with probability density function (p.d.f.) f(x), and assume that X1, X2, ..., Xn is a random sample of size n drawn from population X. For k = 1, 2, ..., the k-th population moment of distribution f(x) is:

$$E\left(X^k\right) \qquad (8.1)$$

Analogously, the k-th sample moment of distribution f(x) is:

$$M_k = \frac{\sum_{i=1}^{n} X_i^k}{n} \qquad (8.2)$$

The estimation procedure of the method of moments is as follows. Assume that X is a random variable with p.d.f. f(x, θ1, ..., θm), in which θ1, ..., θm are population parameters whose values are unknown, and that a random sample X1, X2, ..., Xn is drawn from population X. The moment estimators θ̂1, ..., θ̂m are obtained by matching the first m sample moments to the corresponding m population moments and solving the resulting equations for θ1, ..., θm. Thus, the first population moment is:

$$E(X) = \mu \qquad (8.3)$$

and the first sample moment is:

$$M_1 = \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} \qquad (8.4)$$

By matching the population and sample moments, we have μ̂ = X̄. Therefore, the sample mean is the moment estimator of the population mean. Table 8.1 shows how to calculate E(X) and Var(X) for different probability distributions, as also studied in Chapter 6.
TABLE 8.1 Calculating E(X) and Var(X) for Different Probability Distributions

Distribution                      E(X)         Var(X)
Normal [X ~ N(μ, σ²)]             μ            σ²
Binomial [X ~ b(n, p)]            np           np(1 − p)
Poisson [X ~ Poisson(λ)]          λ            λ
Uniform [X ~ U(a, b)]             (a + b)/2    (b − a)²/12
Exponential [X ~ exp(λ)]          1/λ          1/λ²
Gamma [X ~ Gamma(α, λ)]           αλ           αλ²
Example 8.3: Method of Moments
Assume that a certain random variable X follows an exponential distribution with parameter λ. A random sample of 10 units is drawn from the population, with the data shown in Table 8.E.1. Calculate the moment estimate of λ.

TABLE 8.E.1 Data Obtained From the Sample
5.4  9.8  6.3  7.9  9.2  10.7  12.5  15.0  13.9  17.2

Solution
We have E(X) = X̄. For an exponential distribution, since E(X) = 1/λ, we have 1/λ = X̄, and the moment estimator of λ is given by λ̂ = 1/X̄. For the data in Table 8.E.1, X̄ = 10.79, so the moment estimate of λ is:

$$\hat{\lambda} = \frac{1}{\bar{X}} = \frac{1}{10.79} = 0.093$$
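The moment estimate in Example 8.3 can be reproduced in a couple of lines of Python (a sketch, not code from the book):

```python
# Sample data from Table 8.E.1
data = [5.4, 9.8, 6.3, 7.9, 9.2, 10.7, 12.5, 15.0, 13.9, 17.2]

x_bar = sum(data) / len(data)   # sample mean: 10.79
lam_hat = 1 / x_bar             # moment estimator of lambda: ~0.093
```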
8.3.2 Ordinary Least Squares
A simple linear regression model is given by the following expression:

$$Y_i = a + b X_i + \mu_i, \quad i = 1, 2, \dots, n \qquad (8.5)$$

where:
Y_i is the i-th observed value of the dependent variable;
a is the linear coefficient of the straight line (constant);
b is the angular coefficient of the straight line (slope);
X_i is the i-th observed value of the explanatory variable;
μ_i is the random error term of the linear relationship between Y and X.

Since parameters a and b of the regression model are unknown, we estimate them by using the regression line:

$$\hat{Y}_i = a + b X_i \qquad (8.6)$$

where:
Ŷ_i is the i-th value estimated (predicted) by the model;
a and b are the estimates of the parameters of the regression model;
X_i is the i-th observed value of the explanatory variable.

However, the observed values Y_i are not always equal to the values Ŷ_i estimated by the regression model. The difference between the observed and the estimated value for the i-th observation is the error term μ_i:

$$\mu_i = Y_i - \hat{Y}_i \qquad (8.7)$$

Thus, the ordinary least squares method is used to determine the best straight line that fits the points of a scatter plot; that is, the method consists of estimating a and b so that the sum of squared residuals is as small as possible:

$$\min \sum_{i=1}^{n} \mu_i^2 = \sum_{i=1}^{n} \left(Y_i - a - b X_i\right)^2$$

The estimators are given by:

$$b = \frac{\sum_{i=1}^{n} \left(Y_i - \bar{Y}\right)\left(X_i - \bar{X}\right)}{\sum_{i=1}^{n} \left(X_i - \bar{X}\right)^2} = \frac{\sum_{i=1}^{n} Y_i X_i - n\bar{X}\bar{Y}}{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2} \qquad (8.8)$$

$$a = \bar{Y} - b\bar{X} \qquad (8.9)$$
In Chapter 13, we will study the estimation of a linear regression model by ordinary least squares in more detail.
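Expressions (8.8) and (8.9) translate directly into code. The sketch below (with illustrative data chosen to lie exactly on a known line, not data from the book) recovers the intercept and slope:

```python
def ols(x, y):
    """Ordinary least squares estimates a and b (Expressions 8.8 and 8.9)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    b = sum((yi - y_bar) * (xi - x_bar) for xi, yi in zip(x, y)) \
        / sum((xi - x_bar) ** 2 for xi in x)
    a = y_bar - b * x_bar
    return a, b

# Illustrative data lying exactly on the line Y = 2 + 3X
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [5.0, 8.0, 11.0, 14.0, 17.0]
a, b = ols(x, y)   # a = 2, b = 3
```

With noisy data the estimates would no longer match the true coefficients exactly, but they would still minimize the sum of squared residuals.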
8.3.3 Maximum Likelihood Estimation
Maximum likelihood estimation is one of the procedures used to estimate the parameters of a model based on the probability distribution of the variable that represents the phenomenon being studied. The parameters are chosen so as to maximize the likelihood function, which is the objective function of the corresponding optimization problem (Fávero, 2015). Consider a random variable X with probability density function f(x, θ), in which the vector θ = (θ1, θ2, ..., θk) is unknown. A random sample X1, X2, ..., Xn of size n is drawn from population X; let x1, x2, ..., xn be the values effectively observed. The likelihood function L associated with X is a joint probability density function given by the product of the densities of each of the observations:

$$L(\theta; x_1, x_2, \dots, x_n) = f(x_1, \theta) \cdot f(x_2, \theta) \cdots f(x_n, \theta) = \prod_{i=1}^{n} f(x_i, \theta) \qquad (8.10)$$

The maximum likelihood estimator is the vector θ̂ that maximizes the likelihood function.
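For a concrete illustration (not from the book), take the exponential data of Example 8.3. Maximizing the log of Expression (8.10) for f(x, λ) = λe^(−λx) has the analytic solution λ̂ = n/Σx_i, which no nearby candidate value of λ can beat:

```python
import math

def loglik_exponential(lam, data):
    """Log of Expression (8.10) for f(x, lambda) = lambda * exp(-lambda * x)."""
    return len(data) * math.log(lam) - lam * sum(data)

# Data from Example 8.3 (exponential sample)
data = [5.4, 9.8, 6.3, 7.9, 9.2, 10.7, 12.5, 15.0, 13.9, 17.2]
lam_mle = len(data) / sum(data)   # analytic maximizer: 1 / x_bar

best = loglik_exponential(lam_mle, data)
```

In this case the MLE coincides with the moment estimator of Example 8.3; for other models the two methods can give different estimates.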
8.4 INTERVAL ESTIMATION OR CONFIDENCE INTERVALS
In Section 8.3, the population parameters of interest were estimated through a single value (point estimation). The main limitation of point estimation is that all the information in the data is summarized by this single numeric value. As an alternative, we can use interval estimation: instead of estimating the population parameter through a single point, an interval of likely estimates is provided. We thus define an interval of values that will contain the true population parameter with a certain confidence level (1 − α), where α is the significance level. Consider θ̂ an estimator of the population parameter θ. An interval estimate for θ is obtained through the interval ]θ − k; θ + k[ such that P(θ − k < θ̂ < θ + k) = 1 − α.
8.4.1 Confidence Interval for the Population Mean (μ)
Estimating the population mean from a sample is applied to two cases: when the population variance (s2) is known or unknown.
FIG. 8.1 Standard normal distribution.
8.4.1.1 Known Population Variance (σ²)

Let X be a random variable with a normal distribution, mean μ, and known variance σ², that is, X ~ N(μ, σ²). Then:

$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1) \qquad (8.11)$$

that is, variable Z has a standard normal distribution. Consider that the probability of variable Z assuming values between −z_c and z_c is 1 − α; the critical values −z_c and z_c are obtained from the standard normal distribution table (Table E in the Appendix), as shown in Fig. 8.1, where NR and CR denote the nonrejection region and the critical region of the distribution, respectively. Therefore, we have:

$$P(-z_c < Z < z_c) = 1 - \alpha \qquad (8.12)$$

or:

$$P\left(-z_c < \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < z_c\right) = 1 - \alpha \qquad (8.13)$$

Thus, the confidence interval for μ is:

$$P\left(\bar{X} - z_c \frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + z_c \frac{\sigma}{\sqrt{n}}\right) = 1 - \alpha \qquad (8.14)$$
Example 8.4: CI for the Population Mean When the Variance Is Known
We would like to estimate the average processing time of a certain part, with a 95% confidence level. We know that σ = 1.2. In order to do that, a random sample with 400 parts was collected, giving a sample mean X̄ = 5.4. Construct a 95% confidence interval for the true population mean.

Solution
We have σ = 1.2, n = 400, X̄ = 5.4, and CI = 95% (α = 5%). The critical values −z_c and z_c for α = 5% can be obtained from Table E in the Appendix (Fig. 8.2). Applying Expression (8.14):

$$P\left(5.4 - 1.96 \cdot \frac{1.2}{\sqrt{400}} < \mu < 5.4 + 1.96 \cdot \frac{1.2}{\sqrt{400}}\right) = 95\%$$

that is:

$$P(5.28 < \mu < 5.52) = 95\%$$

Therefore, the [5.28; 5.52] interval contains the average population value with 95% confidence.

FIG. 8.2 Critical values of −z_c and z_c.
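The interval in Example 8.4 can be reproduced with a short Python sketch (not code from the book):

```python
import math

# Data from Example 8.4
x_bar, sigma, n = 5.4, 1.2, 400
z_c = 1.96   # critical value for 95% confidence (Table E)

# Expression (8.14): x_bar +/- z_c * sigma / sqrt(n)
half_width = z_c * sigma / math.sqrt(n)
ci = (x_bar - half_width, x_bar + half_width)   # ~ (5.28, 5.52)
```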
8.4.1.2 Unknown Population Variance (σ²)

Let X be a random variable with a normal distribution, mean μ, and unknown variance σ², that is, X ~ N(μ, σ²). Since the variance is unknown, it is necessary to use an estimator (S²) instead of σ², which leads to another random variable:

$$T = \frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t_{n-1} \qquad (8.15)$$

that is, variable T follows Student's t-distribution with n − 1 degrees of freedom. Consider that the probability of variable T assuming values between −t_c and t_c is 1 − α; the critical values −t_c and t_c are obtained from Student's t-distribution table (Table B in the Appendix), as shown in Fig. 8.3. Therefore, we have:

$$P(-t_c < T < t_c) = 1 - \alpha \qquad (8.16)$$

or:

$$P\left(-t_c < \frac{\bar{X} - \mu}{S/\sqrt{n}} < t_c\right) = 1 - \alpha \qquad (8.17)$$

Therefore, the confidence interval for μ is:

$$P\left(\bar{X} - t_c \frac{S}{\sqrt{n}} < \mu < \bar{X} + t_c \frac{S}{\sqrt{n}}\right) = 1 - \alpha \qquad (8.18)$$
Example 8.5: CI for the Population Mean When the Variance Is Unknown
We would like to estimate the average weight of a given population, with a 95% confidence level. The random variable analyzed has a normal distribution with mean μ and unknown variance σ². We pick a sample with 25 individuals from the population and calculate the sample mean (X̄ = 78) and the sample variance (S² = 36). Determine the interval that contains the average weight of the population.

FIG. 8.3 Student's t-distribution.
FIG. 8.4 Critical values of Student's t-distribution.

Solution
Since the variance is unknown, we use the estimator S², which leads to variable T following Student's t-distribution. The critical values −t_c and t_c, obtained from Table B in the Appendix for a significance level α = 5% and 24 degrees of freedom, are shown in Fig. 8.4. Applying Expression (8.18):

$$P\left(78 - 2.064 \cdot \frac{6}{\sqrt{25}} < \mu < 78 + 2.064 \cdot \frac{6}{\sqrt{25}}\right) = 95\%$$

that is:

$$P(75.5 < \mu < 80.5) = 95\%$$

Therefore, the [75.5; 80.5] interval contains the average population weight with 95% confidence.
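Example 8.5 in code (a sketch; the t critical value 2.064 is taken from the table, since the Python standard library has no t quantile function):

```python
import math

# Data from Example 8.5
x_bar, s2, n = 78.0, 36.0, 25
t_c = 2.064   # Student's t critical value for alpha = 5%, 24 d.f. (Table B)

# Expression (8.18): x_bar +/- t_c * S / sqrt(n)
half_width = t_c * math.sqrt(s2) / math.sqrt(n)
ci = (x_bar - half_width, x_bar + half_width)   # ~ (75.5, 80.5)
```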
8.4.2 Confidence Interval for Proportions
Consider X a random variable that represents whether or not a characteristic of interest is present in the population. Thus, X follows a binomial distribution with parameter p, in which p represents the probability that an element of the population presents the characteristic of interest:

$$X \sim b(1, p)$$

with mean μ = p and variance σ² = p(1 − p). A random sample X1, X2, ..., Xn of size n is drawn from the population. Let k be the number of sample elements with the characteristic of interest. The estimator p̂ of the population proportion p is given by:

$$\hat{p} = \frac{k}{n} \qquad (8.19)$$

If n is large, we can consider that the sample proportion p̂ follows, approximately, a normal distribution with mean p and variance p(1 − p)/n:

$$\hat{p} \sim N\left(p, \frac{p(1-p)}{n}\right) \qquad (8.20)$$

We consider the variable Z = (p̂ − p)/√(p(1 − p)/n) ~ N(0, 1). Since n is large, we can replace p with its estimate p̂:

$$Z = \frac{\hat{p} - p}{\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}} \sim N(0, 1) \qquad (8.21)$$

Consider that the probability of variable Z assuming values between −z_c and z_c is 1 − α; the critical values −z_c and z_c are obtained from the standard normal distribution table (Table E in the Appendix), as shown in Fig. 8.1. Thus, we have:

$$P(-z_c < Z < z_c) = 1 - \alpha \qquad (8.22)$$

or:

$$P\left(-z_c < \frac{\hat{p} - p}{\sqrt{\hat{p}(1-\hat{p})/n}} < z_c\right) = 1 - \alpha \qquad (8.23)$$

Therefore, the confidence interval for p is:

$$P\left(\hat{p} - z_c\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} < p < \hat{p} + z_c\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right) = 1 - \alpha \qquad (8.24)$$
Example 8.6: CI for Proportions
A factory found 230 defective parts in one batch with 1,000 parts. Construct a 95% confidence interval for the true proportion of defective products.

Solution
We have n = 1,000, p̂ = k/n = 230/1,000 = 0.23, and z_c = 1.96. Therefore, Expression (8.24) can be written as:

$$P\left(0.23 - 1.96\sqrt{\frac{0.23 \cdot 0.77}{1{,}000}} < p < 0.23 + 1.96\sqrt{\frac{0.23 \cdot 0.77}{1{,}000}}\right) = 95\%$$

$$P(0.204 < p < 0.256) = 95\%$$

Thus, the [20.4%; 25.6%] interval contains the true proportion of defective products with 95% confidence.
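Example 8.6 in code (a sketch, not from the book):

```python
import math

# Data from Example 8.6
k, n = 230, 1000
p_hat = k / n
z_c = 1.96   # critical value for 95% confidence (Table E)

# Expression (8.24): p_hat +/- z_c * sqrt(p_hat * (1 - p_hat) / n)
half_width = z_c * math.sqrt(p_hat * (1 - p_hat) / n)
ci = (p_hat - half_width, p_hat + half_width)   # ~ (0.204, 0.256)
```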
8.4.3 Confidence Interval for the Population Variance
Let X_i be a random variable with a normal distribution, mean μ, and variance σ², that is, X_i ~ N(μ, σ²). An estimator for σ² is the sample variance S². The random variable Q has a chi-square distribution with n − 1 degrees of freedom:

$$Q = \frac{(n-1) S^2}{\sigma^2} \sim \chi^2_{n-1} \qquad (8.25)$$

Consider that the probability of variable Q assuming values between χ²_low and χ²_upp is 1 − α; the critical values χ²_low and χ²_upp are obtained from the chi-square distribution table (Table D in the Appendix), as shown in Fig. 8.5. Therefore, we have:

$$P\left(\chi^2_{low} < \chi^2_{n-1} < \chi^2_{upp}\right) = 1 - \alpha \qquad (8.26)$$

FIG. 8.5 Chi-square distribution.

or:

$$P\left(\chi^2_{low} < \frac{(n-1) S^2}{\sigma^2} < \chi^2_{upp}\right) = 1 - \alpha \qquad (8.27)$$

Therefore, the confidence interval for σ² is:

$$P\left(\frac{(n-1) S^2}{\chi^2_{upp}} < \sigma^2 < \frac{(n-1) S^2}{\chi^2_{low}}\right) = 1 - \alpha \qquad (8.28)$$
Example 8.7: CI for the Population Variance
Consider the population of Business Administration students at a public university, whose variable of interest is the students' ages. A sample with 101 students was drawn from the normal population, giving S² = 18.22. Construct a 90% confidence interval for the population variance.

Solution
From the χ² distribution table (Table D in the Appendix), for 100 degrees of freedom, we have χ²_low = 77.929 and χ²_upp = 124.342. Therefore, Expression (8.28) can be written as follows:

$$P\left(\frac{100 \cdot 18.22}{124.342} < \sigma^2 < \frac{100 \cdot 18.22}{77.929}\right) = 90\%$$

$$P(14.65 < \sigma^2 < 23.38) = 90\%$$

Thus, the [14.65; 23.38] interval contains the true population variance with 90% confidence.
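Example 8.7 in code (a sketch; the chi-square critical values are taken from the table, as the standard library provides no chi-square quantile function):

```python
# Data from Example 8.7
n, s2 = 101, 18.22
chi2_low, chi2_upp = 77.929, 124.342   # chi-square critical values, 100 d.f. (Table D)

# Expression (8.28): ((n-1) S^2 / chi2_upp, (n-1) S^2 / chi2_low)
ci = ((n - 1) * s2 / chi2_upp, (n - 1) * s2 / chi2_low)   # ~ (14.65, 23.38)
```

Note that the larger critical value bounds the interval from below: dividing by χ²_upp gives the smaller variance estimate.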
8.5 FINAL REMARKS
Statistical inference is divided into three main parts: sampling, estimation of population parameters, and hypotheses tests. This chapter discussed estimation methods, which may be point or interval methods. Among the main point estimation methods, we can highlight the method of moments, ordinary least squares, and maximum likelihood estimation. Among the main interval estimation methods, we studied the confidence interval (CI) for the population mean (when the variance is known and when it is unknown), the CI for proportions, and the CI for the population variance.
8.6 EXERCISES
1) We would like to estimate the average age of a population that follows a normal distribution and has a standard deviation σ = 18. In order to do that, a sample with 120 individuals was drawn from the population, and the mean obtained was 51 years. Construct a 90% confidence interval for the true population mean.
2) We would like to estimate the average income of a certain population with a normal distribution and an unknown variance. A sample with 36 individuals was drawn from the population, presenting a mean X̄ = 5,400 and a standard deviation S = 200. Construct a 95% confidence interval for the population mean.
3) We would like to estimate the illiteracy rate of a certain municipality. A sample with 500 inhabitants was drawn from the population, presenting an illiteracy rate of 24%. Construct a 95% confidence interval for the proportion of illiterate individuals in the municipality.
4) We would like to estimate the variability of the average service time for customers in a bank branch. A sample with 61 customers was drawn from the population, which has a normal distribution, giving S² = 8. Construct a 95% confidence interval for the population variance.
Chapter 9
Hypotheses Tests We must conduct research and then accept the results. If they don’t stand up to experimentation, Buddha’s own words must be rejected. Tenzin Gyatso, 14th Dalai Lama
9.1
INTRODUCTION
As discussed previously, one of the problems to be solved by statistical inference is hypotheses testing. A statistical hypothesis is an assumption about a certain population parameter, such as the mean, the standard deviation, or the correlation coefficient. A hypothesis test is a procedure for deciding the veracity or falsehood of a certain hypothesis. For a statistical hypothesis to be validated or rejected with certainty, it would be necessary to examine the entire population, which in practice is not viable. As an alternative, we draw a random sample from the population we are interested in. Since the decision is made based on the sample, errors may occur (rejecting a hypothesis when it is true or not rejecting a hypothesis when it is false), as we will study later on. The procedures and concepts necessary to construct a hypothesis test are presented next. Let X be a variable associated with a population and θ a certain parameter of this population. We must define the hypothesis to be tested about parameter θ, which is called the null hypothesis:
H0: θ = θ0
(9.1)
Let's also define the alternative hypothesis (H1), accepted in case H0 is rejected, which can be characterized as follows:
H1: θ ≠ θ0
(9.2)
and the test is called a bilateral test (or two-tailed test). The significance level of a test (α) represents the probability of rejecting the null hypothesis when it is true (one of the two errors that may occur, as we will see later). The critical region (CR) or rejection region (RR) of a bilateral test is represented by two tails of the same size, in the left and right extremities of the distribution curve, each one corresponding to half of the significance level α, as shown in Fig. 9.1. Another way to define the alternative hypothesis (H1) would be:
H1: θ < θ0
(9.3)
and the test is called a unilateral test to the left (or left-tailed test). In this case, the critical region is in the left tail of the distribution and corresponds to the significance level α, as shown in Fig. 9.2. Or the alternative hypothesis could be:
FIG. 9.1 Critical region (CR) of a bilateral test, also emphasizing the nonrejection region (NR) of the null hypothesis.
Data Science for Business and Decision Making. https://doi.org/10.1016/B978-0-12-811216-8.00009-4 © 2019 Elsevier Inc. All rights reserved.
PART
IV Statistical Inference
H1: θ > θ0
(9.4)
and the test is called a unilateral test to the right (or right-tailed test). In this case, the critical region is in the right tail of the distribution and corresponds to the significance level α, as shown in Fig. 9.3. Thus, if the main objective is to check whether a parameter is significantly higher or lower than a certain value, we have to use a unilateral test. On the other hand, if the objective is to check whether a parameter is different from a certain value, we have to use a bilateral test. After defining the null hypothesis to be tested, through a random sample collected from the population, we decide whether or not to reject it. Since the decision is made based on the sample, two types of errors may happen:
Type I error: rejecting the null hypothesis when it is true. The probability of this type of error is represented by α:
P(type I error) = P(rejecting H0 | H0 is true) = α
(9.5)
Type II error: not rejecting the null hypothesis when it is false. The probability of this type of error is represented by β:
P(type II error) = P(not rejecting H0 | H0 is false) = β
(9.6)
Table 9.1 shows the types of errors that may happen in a hypothesis test. The procedure for defining hypotheses tests includes the following phases:
Step 1: Choosing the most suitable statistical test, depending on the researcher's intention.
Step 2: Presenting the test's null hypothesis H0 and its alternative hypothesis H1.
Step 3: Setting the significance level α.
Step 4: Calculating the observed value of the statistic based on the sample obtained from the population.
Step 5: Determining the test's critical region based on the value of α set in Step 3.
Step 6: Decision: if the value of the statistic lies in the critical region, reject H0. Otherwise, do not reject H0.
According to Fávero et al. (2009), most statistical software packages, among them SPSS and Stata, calculate the P-value that corresponds to the probability associated with the value of the statistic calculated from the sample. The P-value indicates the lowest observed significance level that would lead to the rejection of the null hypothesis. Thus, we reject H0 if P ≤ α.
FIG. 9.2 Critical region (CR) of a left-tailed test, also emphasizing the nonrejection region of the null hypothesis (NR).
FIG. 9.3 Critical region (CR) of a right-tailed test.
TABLE 9.1 Types of Errors
Decision         | H0 Is True               | H0 Is False
Not rejecting H0 | Correct decision (1 − α) | Type II error (β)
Rejecting H0     | Type I error (α)         | Correct decision (1 − β)
If we use the P-value instead of the statistic's critical value, Steps 5 and 6 of the construction of hypotheses tests become:
Step 5: Determine the P-value that corresponds to the probability associated with the value of the statistic calculated in Step 4.
Step 6: Decision: if the P-value is less than the significance level α established in Step 3, reject H0. Otherwise, do not reject H0.
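The P-value decision rule in Steps 5 and 6 can be sketched as a small helper function (illustrative Python, not from the book; the name decide is ours):

```python
def decide(p_value, alpha=0.05):
    """P-value decision rule: reject H0 when the P-value is below alpha."""
    return "reject H0" if p_value < alpha else "do not reject H0"

# The same P-value can lead to different decisions at different levels:
print(decide(0.012, alpha=0.05))  # reject H0
print(decide(0.012, alpha=0.01))  # do not reject H0
```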
9.2
PARAMETRIC TESTS
Hypotheses tests are divided into parametric and nonparametric tests. In this chapter, we will study parametric tests. Nonparametric tests will be studied in the next chapter. Parametric tests involve population parameters. A parameter is any numerical measure or quantitative characteristic that describes a population. Parameters are fixed values, usually unknown, and represented by Greek characters, such as the population mean (μ), the population standard deviation (σ), and the population variance (σ²), among others. When hypotheses are formulated about population parameters, the hypothesis test is called parametric. In nonparametric tests, hypotheses are formulated about qualitative characteristics of the population. Therefore, parametric methods are applied to quantitative data and require strong assumptions in order to be valid, including:
(i) The observations must be independent;
(ii) The sample must be drawn from a population with a certain distribution, usually normal;
(iii) The populations must have equal variances for the comparison tests of two paired population means or of k population means (k ≥ 3);
(iv) The variables being studied must be measured on an interval or ratio scale, so that arithmetic operations can be applied to their respective values.
We will study the main parametric tests, including tests for normality, tests for the homogeneity of variances, Student's t-test and its applications, in addition to the analysis of variance (ANOVA) and its extensions. All of them will be solved analytically and also through the statistical software packages SPSS and Stata. To verify the univariate normality of the data, the most commonly used tests are the Kolmogorov-Smirnov and Shapiro-Wilk tests. To compare the homogeneity of variances between populations, we have Bartlett's χ² (1937), Cochran's C (1947a,b), Hartley's Fmax (1950), and Levene's F (1960) tests.
We will describe Student’s t-test for three situations: to test hypotheses about the population mean, to test hypotheses to compare two independent means, and to compare two paired means. ANOVA is an extension of Student’s t-test and is used to compare the means of more than two populations. In this chapter, ANOVA of one factor, ANOVA of two factors and its extension for more than two factors will be described.
9.3
UNIVARIATE TESTS FOR NORMALITY
Among all univariate tests for normality, the most common are Kolmogorov-Smirnov, Shapiro-Wilk, and Shapiro-Francia.
9.3.1
Kolmogorov-Smirnov Test
The Kolmogorov-Smirnov test (K-S) is an adherence test, that is, it compares the cumulative frequency distribution of a set of sample values (observed values) to a theoretical distribution. The main goal is to test whether the sample values come from a population with a supposed theoretical or expected distribution, in this case, the normal distribution. The statistic is given by the point with the biggest difference (in absolute value) between the two distributions. To use the K-S test, the population mean and standard deviation must be known. For small samples the test loses power, so it should be used with large samples (n ≥ 30). The K-S test assumes the following hypotheses:
H0: the sample comes from a population with distribution N(μ, σ)
H1: the sample does not come from a population with distribution N(μ, σ)
As specified in Fávero et al. (2009), let Fexp(X) be the expected (normal) distribution function of cumulative relative frequencies of variable X, where Fexp(X) ~ N(μ, σ), and Fobs(X) the observed cumulative relative frequency distribution of variable X. The objective is to test whether Fobs(X) = Fexp(X), in contrast with the alternative that Fobs(X) ≠ Fexp(X). The statistic can be calculated through the following expression:
Dcal = max{|Fexp(Xi) − Fobs(Xi)|, |Fexp(Xi) − Fobs(Xi−1)|}, for i = 1, …, n (9.7)
where:
Fexp(Xi): expected cumulative relative frequency in category i;
Fobs(Xi): observed cumulative relative frequency in category i;
Fobs(Xi−1): observed cumulative relative frequency in category i − 1.
The critical values of the Kolmogorov-Smirnov statistic (Dc) are shown in Table G in the Appendix. This table provides the critical values of Dc considering that P(Dcal > Dc) = α (for a right-tailed test). In order for the null hypothesis H0 to be rejected, the value of the Dcal statistic must be in the critical region, that is, Dcal > Dc. Otherwise, we do not reject H0. The P-value (the probability associated with the value of the Dcal statistic calculated from the sample) can also be obtained from Table G. In this case, we reject H0 if P ≤ α.
Example 9.1: Using the Kolmogorov-Smirnov Test
Table 9.E.1 shows the data on a company's monthly production of farming equipment in the last 36 months. Check whether the data in Table 9.E.1 come from a population that follows a normal distribution, considering that α = 5%.
TABLE 9.E.1 Production of Farming Equipment in the Last 36 Months
52 50 44 50 42 30 36 34 48 40 55 40
30 36 40 42 55 44 38 42 40 38 52 44
52 34 38 44 48 36 36 55 50 34 44 42
Solution
Step 1: Since the objective is to verify whether the data in Table 9.E.1 come from a population with a normal distribution, the most suitable test is the Kolmogorov-Smirnov (K-S).
Step 2: The K-S test hypotheses for this example are:
H0: the production of farming equipment in the population follows distribution N(μ, σ)
H1: the production of farming equipment in the population does not follow distribution N(μ, σ)
Step 3: The significance level to be considered is 5%.
Step 4: All the steps necessary to calculate Dcal from Expression (9.7) are specified in Table 9.E.2.
TABLE 9.E.2 Calculating the Kolmogorov-Smirnov Statistic
Xi | Fabs(a) | Fac(b) | Fracobs(c) | Zi(d)   | Fracexp(e) | |Fexp(Xi) − Fobs(Xi)| | |Fexp(Xi) − Fobs(Xi−1)|
30 | 2       | 2      | 0.056      | −1.7801 | 0.0375     | 0.018 | 0.036
34 | 3       | 5      | 0.139      | −1.2168 | 0.1118     | 0.027 | 0.056
36 | 4       | 9      | 0.250      | −0.9351 | 0.1743     | 0.076 | 0.035
38 | 3       | 12     | 0.333      | −0.6534 | 0.2567     | 0.077 | 0.007
40 | 4       | 16     | 0.444      | −0.3717 | 0.3551     | 0.089 | 0.022
42 | 4       | 20     | 0.556      | −0.0900 | 0.4641     | 0.092 | 0.020
44 | 5       | 25     | 0.694      | 0.1917  | 0.5760     | 0.118 | 0.020
48 | 2       | 27     | 0.750      | 0.7551  | 0.7749     | 0.025 | 0.081
50 | 3       | 30     | 0.833      | 1.0368  | 0.8501     | 0.017 | 0.100
52 | 3       | 33     | 0.917      | 1.3185  | 0.9064     | 0.010 | 0.073
55 | 3       | 36     | 1.000      | 1.7410  | 0.9592     | 0.041 | 0.043
(a) Absolute frequency.
(b) Cumulative (absolute) frequency.
(c) Observed cumulative relative frequency of Xi.
(d) Standardized Xi values according to the expression Zi = (Xi − X̄)/S.
(e) Expected cumulative relative frequency of Xi; it corresponds to the probability obtained in Table E in the Appendix (standard normal distribution table) from the value of Zi.
Therefore, the value of the K-S statistic based on the sample is Dcal = 0.118.
Step 5: According to Table G in the Appendix, for n = 36 and α = 5%, the critical value of the Kolmogorov-Smirnov statistic is Dc = 0.23.
Step 6: Decision: since the calculated value is not in the critical region (Dcal < Dc), the null hypothesis is not rejected, which allows us to conclude, with a 95% confidence level, that the sample is drawn from a population that follows a normal distribution.
If we use the P-value instead of the statistic's critical value, Steps 5 and 6 will be:
Step 5: According to Table G in the Appendix, for a sample size n = 36, the probability associated with Dcal = 0.118 has as its lowest limit P = 0.20.
Step 6: Decision: since P > 0.05, we do not reject H0.
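Besides SPSS and Stata (covered later in this chapter), the same test can be run in Python. The sketch below uses scipy.stats.kstest with the sample mean and standard deviation, mirroring the hand calculation. One caveat the book does not dwell on: when μ and σ are estimated from the sample itself, the standard K-S P-value is conservative (the Lilliefors variant corrects for this).

```python
import numpy as np
from scipy import stats

# Data from Table 9.E.1
production = np.array([
    52, 50, 44, 50, 42, 30, 36, 34, 48, 40, 55, 40,
    30, 36, 40, 42, 55, 44, 38, 42, 40, 38, 52, 44,
    52, 34, 38, 44, 48, 36, 36, 55, 50, 34, 44, 42,
])

# K-S against a normal with the sample mean and standard deviation
mean, sd = production.mean(), production.std(ddof=1)
stat, p = stats.kstest(production, "norm", args=(mean, sd))
print(f"D = {stat:.3f}")   # matches Dcal = 0.118
```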
9.3.2
Shapiro-Wilk Test
The Shapiro-Wilk test (S-W) is based on Shapiro and Wilk (1965) and can be applied to samples with 4 ≤ n ≤ 2000 observations; it is an alternative to the Kolmogorov-Smirnov test for normality (K-S) in the case of small samples (n < 30). Analogous to the K-S test, the S-W test for normality assumes the following hypotheses:
H0: the sample comes from a population with distribution N(μ, σ)
H1: the sample does not come from a population with distribution N(μ, σ)
The calculation of the Shapiro-Wilk statistic (Wcal) is given by:
Wcal = b² / Σ_{i=1}^{n} (Xi − X̄)², for i = 1, …, n (9.8)
b = Σ_{i=1}^{n/2} a_{i,n} (X_{(n−i+1)} − X_{(i)}) (9.9)
where:
X(i) are the sample statistics of order i, that is, the i-th ordered observation, so X(1) ≤ X(2) ≤ … ≤ X(n);
X̄ is the mean of X;
a_{i,n} are constants generated from the means, variances, and covariances of the statistics of order i of a random sample of size n from a normal distribution. Their values can be seen in Table H2 in the Appendix.
Small values of Wcal indicate that the distribution of the variable being studied is not normal. The critical values of the Shapiro-Wilk statistic (Wc) are shown in Table H1 in the Appendix. Different from most tables, this table provides the critical values of Wc considering that P(Wcal < Wc) = α (for a left-tailed test). In order for the null hypothesis H0 to be rejected, the value of the Wcal statistic must be in the critical region, that is, Wcal < Wc. Otherwise, we do not reject H0. The P-value (the probability associated with the value of the Wcal statistic calculated from the sample) can also be seen in Table H1. In this case, we reject H0 if P ≤ α.
Example 9.2: Using the Shapiro-Wilk Test
Table 9.E.3 shows the data on an aerospace company's monthly production of aircraft in the last 24 months. Check whether the data in Table 9.E.3 come from a population with a normal distribution, considering that α = 1%.
TABLE 9.E.3 Production of Aircraft in the Last 24 Months
28 32 46 24 22 18 20 34 30 24 31 29
15 19 23 25 28 30 32 36 39 16 23 36
Solution
Step 1: For a normality test in which n < 30, the most recommended test is the Shapiro-Wilk (S-W).
Step 2: The S-W test hypotheses for this example are:
H0: the production of aircraft in the population follows normal distribution N(μ, σ)
H1: the production of aircraft in the population does not follow normal distribution N(μ, σ)
Step 3: The significance level to be considered is 1%.
Step 4: The calculation of the S-W statistic for the data in Table 9.E.3, according to Expressions (9.8) and (9.9), is shown next. First of all, to calculate b, we must sort the data in Table 9.E.3 in ascending order, as shown in Table 9.E.4. All the steps necessary to calculate b, from Expression (9.9), are specified in Table 9.E.5. The values of a_{i,n} were obtained from Table H2 in the Appendix.
TABLE 9.E.4 Values From Table 9.E.3 Sorted in Ascending Order
15 16 18 19 20 22 23 23 24 24 25 28
28 29 30 30 31 32 32 34 36 36 39 46
TABLE 9.E.5 Procedure to Calculate b
i  | n − i + 1 | a_{i,n} | X(n−i+1) | X(i) | a_{i,n} (X(n−i+1) − X(i))
1  | 24 | 0.4493 | 46 | 15 | 13.9283
2  | 23 | 0.3098 | 39 | 16 | 7.1254
3  | 22 | 0.2554 | 36 | 18 | 4.5972
4  | 21 | 0.2145 | 36 | 19 | 3.6465
5  | 20 | 0.1807 | 34 | 20 | 2.5298
6  | 19 | 0.1512 | 32 | 22 | 1.5120
7  | 18 | 0.1245 | 32 | 23 | 1.1205
8  | 17 | 0.0997 | 31 | 23 | 0.7976
9  | 16 | 0.0764 | 30 | 24 | 0.4584
10 | 15 | 0.0539 | 30 | 24 | 0.3234
11 | 14 | 0.0321 | 29 | 25 | 0.1284
12 | 13 | 0.0107 | 28 | 28 | 0.0000
b = 36.1675
We have Σ_{i=1}^{n} (Xi − X̄)² = (28 − 27.5)² + ⋯ + (36 − 27.5)² = 1338
Therefore, Wcal = b² / Σ_{i=1}^{n} (Xi − X̄)² = (36.1675)² / 1338 = 0.978
Step 5: According to Table H1 in the Appendix, for n = 24 and α = 1%, the critical value of the Shapiro-Wilk statistic is Wc = 0.884.
Step 6: Decision: the null hypothesis is not rejected, since Wcal > Wc (Table H1 provides the critical values of Wc considering that P(Wcal < Wc) = α), which allows us to conclude, with a 99% confidence level, that the sample is drawn from a population with a normal distribution.
If we use the P-value instead of the statistic's critical value, Steps 5 and 6 will be:
Step 5: According to Table H1 in the Appendix, for a sample size n = 24, the probability associated with Wcal = 0.978 is between 0.50 and 0.90 (a probability of 0.90 is associated with Wcal = 0.981).
Step 6: Decision: since P > 0.01, we do not reject H0.
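For comparison, scipy implements the Shapiro-Wilk test directly. Its algorithm approximates the a_{i,n} coefficients rather than reading them from Table H2, so the statistic may differ from the hand calculation in the third decimal; this is a sketch, not the book's procedure:

```python
import numpy as np
from scipy import stats

# Data from Table 9.E.3
aircraft = np.array([28, 32, 46, 24, 22, 18, 20, 34, 30, 24, 31, 29,
                     15, 19, 23, 25, 28, 30, 32, 36, 39, 16, 23, 36])

w, p = stats.shapiro(aircraft)
print(f"W = {w:.3f}")   # close to Wcal = 0.978
```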
9.3.3
Shapiro-Francia Test
This test is based on Shapiro and Francia (1972). According to Sarkadi (1975), the Shapiro-Wilk (S-W) and Shapiro-Francia (S-F) tests have the same form, differing only in the definition of the coefficients. Moreover, calculating the S-F statistic is much simpler, and it can be considered a simplified version of the S-W test. Despite its simplicity, it is as robust as the Shapiro-Wilk test, making it a substitute for the S-W. The Shapiro-Francia test can be applied to samples with 5 ≤ n ≤ 5000 observations, and it is similar to the Shapiro-Wilk test for large samples. Analogous to the S-W test, the S-F test assumes the following hypotheses:
H0: the sample comes from a population with distribution N(μ, σ)
H1: the sample does not come from a population with distribution N(μ, σ)
The calculation of the Shapiro-Francia statistic (W'cal) is given by:
W'cal = [Σ_{i=1}^{n} mi X(i)]² / [Σ_{i=1}^{n} mi² × Σ_{i=1}^{n} (Xi − X̄)²], for i = 1, …, n (9.10)
where:
X(i) are the sample statistics of order i, that is, the i-th ordered observation, so X(1) ≤ X(2) ≤ … ≤ X(n);
mi is the approximate expected value of the i-th order statistic (Z-score). The values of mi are estimated by:
mi = Φ⁻¹(i / (n + 1)) (9.11)
where Φ⁻¹ corresponds to the inverse of the standard normal distribution function, with mean zero and standard deviation 1. These values can be obtained from Table E in the Appendix.
Small values of W'cal indicate that the distribution of the variable being studied is not normal. The critical values of the Shapiro-Francia statistic (W'c) are shown in Table H1 in the Appendix. Different from most tables, this table provides the critical values of W'c considering that P(W'cal < W'c) = α (for a left-tailed test). In order for the null hypothesis H0 to be rejected, the value of the W'cal statistic must be in the critical region, that is, W'cal < W'c. Otherwise, we do not reject H0.
The P-value (the probability associated with the W'cal statistic calculated from the sample) can also be seen in Table H1. In this case, we reject H0 if P ≤ α.
Example 9.3: Using the Shapiro-Francia Test
Table 9.E.6 shows the data regarding a company's daily production of bicycles in the last 60 months. Check whether the data come from a population with a normal distribution, considering α = 5%.
Solution
Step 1: The normality of the data can be verified through the Shapiro-Francia test.
Step 2: The S-F test hypotheses for this example are:
H0: the production of bicycles in the population follows normal distribution N(μ, σ)
H1: the production of bicycles in the population does not follow normal distribution N(μ, σ)
Step 3: The significance level to be considered is 5%.
TABLE 9.E.6 Production of Bicycles in the Last 60 Months
85 70 74 49 67 88 80 91 57 63 66 60
72 81 73 80 55 54 93 77 80 64 60 63
67 54 59 78 73 84 91 57 59 64 68 67
70 76 78 75 80 81 70 77 65 63 59 60
61 74 76 81 79 78 60 68 76 71 72 84
Step 4: The procedure to calculate the S-F statistic for the data in Table 9.E.6 is shown in Table 9.E.7.
Therefore, W'cal = (574.6704)² / (53.1904 × 6278.8500) = 0.989
TABLE 9.E.7 Procedure to Calculate the Shapiro-Francia Statistic
i   | X(i) | i/(n + 1) | mi      | mi X(i)   | mi²     | (Xi − X̄)²
1   | 49   | 0.0164    | −2.1347 | −104.5995 | 4.5569  | 481.8025
2   | 54   | 0.0328    | −1.8413 | −99.4316  | 3.3905  | 287.3025
3   | 54   | 0.0492    | −1.6529 | −89.2541  | 2.7319  | 287.3025
4   | 55   | 0.0656    | −1.5096 | −83.0276  | 2.2789  | 254.4025
5   | 57   | 0.0820    | −1.3920 | −79.3417  | 1.9376  | 194.6025
6   | 57   | 0.0984    | −1.2909 | −73.5841  | 1.6665  | 194.6025
7   | 59   | 0.1148    | −1.2016 | −70.8960  | 1.4439  | 142.8025
8   | 59   | 0.1311    | −1.1210 | −66.1380  | 1.2566  | 142.8025
…   |      |           |         |           |         |
60  | 93   | 0.9836    | 2.1347  | 198.5256  | 4.5569  | 486.2025
Sum |      |           |         | 574.6704  | 53.1904 | 6278.8500
Step 5: According to Table H1 in the Appendix, for n = 60 and α = 5%, the critical value of the Shapiro-Francia statistic is W'c = 0.9625.
Step 6: Decision: the null hypothesis is not rejected because W'cal > W'c (Table H1 provides the critical values of W'c considering that P(W'cal < W'c) = α), which allows us to conclude, with a 95% confidence level, that the sample is drawn from a population that follows a normal distribution.
If we used the P-value instead of the statistic's critical value, Steps 5 and 6 would be:
Step 5: According to Table H1 in the Appendix, for a sample size n = 60, the probability associated with W'cal = 0.989 is greater than 0.10 (P-value).
Step 6: Decision: since P > 0.05, we do not reject H0.
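scipy has no built-in Shapiro-Francia test, but Expressions (9.10) and (9.11) are short enough to implement directly. The function below is an illustrative sketch (the name shapiro_francia is ours, not a library function):

```python
import numpy as np
from scipy.stats import norm

def shapiro_francia(x):
    """W' statistic from Expressions (9.10) and (9.11)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    m = norm.ppf(np.arange(1, n + 1) / (n + 1))   # m_i = Phi^-1(i/(n+1))
    num = (m @ x) ** 2                            # [sum m_i X_(i)]^2
    den = (m @ m) * ((x - x.mean()) ** 2).sum()   # sum m_i^2 * sum (X_i - Xbar)^2
    return num / den

# Data from Table 9.E.6
bicycles = [85, 70, 74, 49, 67, 88, 80, 91, 57, 63, 66, 60,
            72, 81, 73, 80, 55, 54, 93, 77, 80, 64, 60, 63,
            67, 54, 59, 78, 73, 84, 91, 57, 59, 64, 68, 67,
            70, 76, 78, 75, 80, 81, 70, 77, 65, 63, 59, 60,
            61, 74, 76, 81, 79, 78, 60, 68, 76, 71, 72, 84]

print(f"W' = {shapiro_francia(bicycles):.3f}")   # close to 0.989
```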
9.3.4
Solving Tests for Normality by Using SPSS Software
The Kolmogorov-Smirnov and Shapiro-Wilk tests for normality can be solved by using IBM SPSS Statistics Software. The Shapiro-Francia test, on the other hand, will be elaborated through the Stata software, as we will see in the next section. Based on the procedure that will be described, SPSS shows the results of the K-S and S-W tests for the sample selected. The use of the images in this section has been authorized by the International Business Machines Corporation©.
Let's consider the data presented in Example 9.1, which are available in the file Production_FarmingEquipment.sav. Let's open the file and select Analyze → Descriptive Statistics → Explore …, as shown in Fig. 9.4. In the Explore dialog box, we must select the variable we are interested in on the Dependent List, as shown in Fig. 9.5. Let's click on Plots … (the Explore: Plots dialog box will open) and select the option Normality plots with tests (Fig. 9.6). Finally, let's click on Continue and on OK.
FIG. 9.4 Procedure for elaborating a univariate normality test on SPSS for Example 9.1.
FIG. 9.5 Selecting the variable of interest.
The results of the Kolmogorov-Smirnov and Shapiro-Wilk tests for normality for the data in Example 9.1 are shown in Fig. 9.7. According to Fig. 9.7, the result of the K-S statistic was 0.118, similar to the value calculated in Example 9.1. Since the sample has more than 30 elements, we should only use the K-S test to verify the normality of the data (the S-W test was applied to Example 9.2). Nevertheless, SPSS also makes the result of the S-W statistic available for the sample selected.
FIG. 9.6 Selecting the normality test on SPSS.
FIG. 9.7 Results of the tests for normality for Example 9.1 on SPSS.
FIG. 9.8 Results of the tests for normality for Example 9.2 on SPSS.
As presented in the introduction of this chapter, SPSS calculates the P-value that corresponds to the lowest significance level observed that would lead to the rejection of the null hypothesis. For the K-S and S-W tests, the P-value corresponds to the lowest value of P from which Dcal > Dc and Wcal < Wc, respectively. As shown in Fig. 9.7, the value of P for the K-S test was 0.200 (this probability can also be obtained from Table G in the Appendix, as shown in Example 9.1). Since P > 0.05, we do not reject the null hypothesis, which allows us to conclude, with a 95% confidence level, that the data distribution is normal. The S-W test also allows us to conclude that the data follow a normal distribution. Applying the same procedure to verify the normality of the data in Example 9.2 (the data are available in the file Production_Aircraft.sav), we get the results shown in Fig. 9.8. Analogous to Example 9.2, the result of the S-W test was 0.978. The K-S test was not applied to this example due to the sample size (n < 30). The P-value of the S-W test is 0.857 (in Example 9.2, we saw that this probability would be between 0.50 and 0.90
and closer to 0.90) and, since P > 0.01, the null hypothesis is not rejected, which allows us to conclude that the data distribution in the population follows a normal distribution. We will use this test when estimating regression models in Chapter 13. For this example, we can also conclude from the K-S test that the data distribution follows a normal distribution.
9.3.5
Solving Tests for Normality by Using Stata
The Kolmogorov-Smirnov, Shapiro-Wilk, and Shapiro-Francia tests for normality can be solved by using Stata Statistical Software. The Kolmogorov-Smirnov test will be applied to Example 9.1, the Shapiro-Wilk test to Example 9.2, and the Shapiro-Francia test to Example 9.3. The use of the images in this section has been authorized by StataCorp LP©.
9.3.5.1 Kolmogorov-Smirnov Test on the Stata Software The data presented in Example 9.1 are available in the file Production_FarmingEquipment.dta. Let’s open this file and verify that the name of the variable being studied is production. To elaborate the Kolmogorov-Smirnov test on Stata, we must specify the mean and the standard deviation of the variable that interests us in the test syntax, so, the command summarize, or simply sum, must be typed first, followed by the respective variable: sum production
and we get Fig. 9.9. Therefore, we can see that the mean is 42.63889 and the standard deviation is 7.099911. The Kolmogorov-Smirnov test is given by the following command: ksmirnov production = normal((production-42.63889)/7.099911)
The result of the test can be seen in Fig. 9.10. We can see that the value of the statistic is similar to the one calculated in Example 9.1 and by SPSS software. Since P > 0.05, we conclude that the data distribution is normal.
9.3.5.2 Shapiro-Wilk Test on the Stata Software The data presented in Example 9.2 are available in the file Production_Aircraft.dta. To elaborate the Shapiro-Wilk test on Stata, the syntax of the command is: swilk variables*
where the term variables* should be substituted for the list of variables being considered. For the data in Example 9.2, we have a single variable called production, so, the command to be typed is: swilk production FIG. 9.9 Descriptive statistics of the variable production.
FIG. 9.10 Results of the Kolmogorov-Smirnov test on Stata.
FIG. 9.11 Results of the Shapiro-Wilk test for Example 9.2 on Stata.
FIG. 9.12 Results of the Shapiro-Francia test for Example 9.3 on Stata.
The result of the Shapiro-Wilk test can be seen in Fig. 9.11. Since P > 0.05, we can conclude that the sample comes from a population with a normal distribution.
9.3.5.3 Shapiro-Francia Test on the Stata Software The data presented in Example 9.3 are available in the file Production_Bicycles.dta. To elaborate the Shapiro-Francia test on Stata, the syntax of the command is: sfrancia variables*
where the term variables* should be substituted for the list of variables being considered. For the data in Example 9.3, we have a single variable called production, so, the command to be typed is: sfrancia production
The result of the Shapiro-Francia test can be seen in Fig. 9.12. We can see that the value is similar to the one calculated in Example 9.3 (W' = 0.989). Since P > 0.05, we conclude that the sample comes from a population with a normal distribution. We will use this test when estimating regression models in Chapter 13.
9.4
TESTS FOR THE HOMOGENEITY OF VARIANCES
One of the conditions for applying a parametric test to compare k population means is that the population variances, estimated from k representative samples, be homogeneous or equal. The most common tests to verify the homogeneity of variances are Bartlett's χ² (1937), Cochran's C (1947a,b), Hartley's Fmax (1950), and Levene's F (1960) tests. In the null hypothesis of variance homogeneity tests, the variances of the k populations are homogeneous. In the alternative hypothesis, at least one population variance is different from the others. That is:
H0: σ1² = σ2² = … = σk²
H1: ∃ i, j: σi² ≠ σj² (i, j = 1, …, k) (9.12)
9.4.1
Bartlett's χ² Test
The original test proposed to verify the homogeneity of variances among groups is Bartlett's χ² test (1937). This test is very sensitive to deviations from normality, in which case Levene's test is an alternative. Bartlett's statistic is calculated from q:
q = (N − k) ln(Sp²) − Σ_{i=1}^{k} (ni − 1) ln(Si²) (9.13)
where:
ni, i = 1, …, k, is the size of each sample i, and Σ_{i=1}^{k} ni = N;
Si², i = 1, …, k, is the variance of each sample i;
and
Sp² = Σ_{i=1}^{k} (ni − 1) Si² / (N − k) (9.14)
A correction factor c is applied to the q statistic, with the following expression:
c = 1 + [1 / (3(k − 1))] × [Σ_{i=1}^{k} 1/(ni − 1) − 1/(N − k)] (9.15)
where Bartlett's statistic (Bcal) approximately follows a chi-square distribution with k − 1 degrees of freedom:
Bcal = q / c ~ χ²(k−1) (9.16)
From the previous expressions, we can see that the higher the difference between the variances, the higher the value of B. On the other hand, if all the sample variances are equal, its value will be zero. To confirm whether the null hypothesis of variance homogeneity will be rejected or not, the calculated value must be compared to the statistic's critical value (χ²c), which is available in Table D in the Appendix. This table provides the critical values of χ²c considering that P(χ²cal > χ²c) = α (for a right-tailed test). Therefore, we reject the null hypothesis if Bcal > χ²c. On the other hand, if Bcal ≤ χ²c, we do not reject H0. The P-value (the probability associated with the χ²cal statistic) can also be obtained from Table D. In this case, we reject H0 if P ≤ α.
Example 9.4: Applying Bartlett's χ² Test
A chain of supermarkets wishes to study the number of customers they serve every day in order to make strategic operational decisions. Table 9.E.8 shows the data of three stores throughout two weeks. Check if the variances between the groups are homogeneous. Consider α = 5%.
TABLE 9.E.8 Number of Customers Served Per Day and Per Store
                   | Store 1  | Store 2   | Store 3
Day 1              | 620      | 710       | 924
Day 2              | 630      | 780       | 695
Day 3              | 610      | 810       | 854
Day 4              | 650      | 755       | 802
Day 5              | 585      | 699       | 931
Day 6              | 590      | 680       | 924
Day 7              | 630      | 710       | 847
Day 8              | 644      | 850       | 800
Day 9              | 595      | 844       | 769
Day 10             | 603      | 730       | 863
Day 11             | 570      | 645       | 901
Day 12             | 605      | 688       | 888
Day 13             | 622      | 718       | 757
Day 14             | 578      | 702       | 712
Standard deviation | 24.4059  | 62.2466   | 78.9144
Variance           | 595.6484 | 3874.6429 | 6227.4780
212
PART
IV Statistical Inference
Solution
If we apply the Kolmogorov-Smirnov or the Shapiro-Wilk test for normality to the data in Table 9.E.8, we will verify that their distribution shows adherence to normality, with a 5% significance level, so Bartlett’s χ² test can be applied to compare the homogeneity of the variances between the groups.
Step 1: Since the main goal is to compare the equality of the variances between the groups, we can use Bartlett’s χ² test.
Step 2: Bartlett’s χ² test hypotheses for this example are:
H0: the population variances of all three groups are homogeneous
H1: the population variance of at least one group is different from the others
Step 3: The significance level to be considered is 5%.
Step 4: The complete calculation of Bartlett’s χ² statistic is shown. First, we calculate the value of S²p, according to Expression (9.14):

S²p = 13 · (595.65 + 3874.64 + 6227.48) / (42 − 3) = 3565.92

Thus, we can calculate q through Expression (9.13):

q = 39 · ln(3565.92) − 13 · [ln(595.65) + ln(3874.64) + ln(6227.48)] = 14.94

The correction factor c for the q statistic, calculated from Expression (9.15), is c = 1.0256. Finally, we calculate Bcal:

Bcal = q/c = 14.94/1.0256 = 14.567

Step 5: According to Table D in the Appendix, for ν = 3 − 1 = 2 degrees of freedom and α = 5%, the critical value of Bartlett’s χ² test is χ²c = 5.991.
Step 6: Decision: since the value calculated lies in the critical region (Bcal > χ²c), the null hypothesis is rejected, which allows us to conclude, with a 95% confidence level, that the population variance of at least one group is different from the others.
If we use the P-value instead of the statistic’s critical value, Steps 5 and 6 will be:
Step 5: According to Table D in the Appendix, for ν = 2 degrees of freedom, the probability associated with χ²cal = 14.567 is less than 0.005 (a probability of 0.005 is associated with χ²cal = 10.597).
Step 6: Decision: since P < 0.05, we reject H0.
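In practice, Bartlett’s test can be run directly with `scipy.stats.bartlett` (a sketch; scipy applies the standard correction factor, so its statistic may come out slightly different from the hand calculation above, but the conclusion is the same):

```python
from scipy import stats

# Daily customer counts per store (Table 9.E.8)
store1 = [620, 630, 610, 650, 585, 590, 630, 644, 595, 603, 570, 605, 622, 578]
store2 = [710, 780, 810, 755, 699, 680, 710, 850, 844, 730, 645, 688, 718, 702]
store3 = [924, 695, 854, 802, 931, 924, 847, 800, 769, 863, 901, 888, 757, 712]

b_stat, p_value = stats.bartlett(store1, store2, store3)
print(round(b_stat, 1), p_value < 0.05)  # statistic ≈ 14.4, well above the critical value 5.991
```

Since the P-value is below 0.05, the null hypothesis of homogeneous variances is rejected, matching the manual test.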
9.4.2 Cochran’s C Test
Cochran’s C test (1947a,b) compares the group with the highest variance in relation to the others. The test demands that the data have a normal distribution. Cochran’s C statistic is given by:

Ccal = S²max / Σᵢ₌₁ᵏ S²ᵢ   (9.17)

where:
S²max is the highest variance in the sample;
S²ᵢ is the variance in sample i, i = 1, …, k.
According to Expression (9.17), if all the variances are equal, the value of the Ccal statistic is 1/k. The greater the difference of S²max in relation to the other variances, the closer the value of Ccal gets to 1. To confirm whether the null hypothesis will be rejected or not, the value calculated must be compared to Cochran’s statistic’s critical value (Cc), which is available in Table M in the Appendix.
Hypotheses Tests Chapter 9
The values of Cc vary depending on the number of groups (k), the number of degrees of freedom ν = max(nᵢ − 1), and the value of α. Table M provides the critical values of Cc considering that P(Ccal > Cc) = α (for a right-tailed test). Thus, we reject H0 if Ccal > Cc. Otherwise, we do not reject H0.
Example 9.5: Applying Cochran’s C Test
Use Cochran’s C test for the data in Example 9.4. The main objective here is to compare the group with the highest variability in relation to the others.
Solution
Step 1: Since the objective is to compare the group with the highest variance (group 3, see Table 9.E.8) in relation to the others, Cochran’s C test is the most recommended.
Step 2: Cochran’s C test hypotheses for this example are:
H0: the population variance of group 3 is equal to the others
H1: the population variance of group 3 is different from the others
Step 3: The significance level to be considered is 5%.
Step 4: From Table 9.E.8, we can see that S²max = 6227.48. Therefore, the calculation of Cochran’s C statistic is given by:

Ccal = S²max / Σᵢ₌₁ᵏ S²ᵢ = 6227.48 / (595.65 + 3874.64 + 6227.48) = 0.582

Step 5: According to Table M in the Appendix, for k = 3, ν = 13, and α = 5%, the critical value of Cochran’s C statistic is Cc = 0.575.
Step 6: Decision: since the value calculated lies in the critical region (Ccal > Cc), the null hypothesis is rejected, which allows us to conclude, with a 95% confidence level, that the population variance of group 3 is different from the others.
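scipy does not ship Cochran’s C test, but the statistic in Expression (9.17) is a one-line computation on the sample variances (a sketch using the values from Table 9.E.8):

```python
# Sample variances of the three stores (Table 9.E.8)
variances = [595.6484, 3874.6429, 6227.4780]

# Cochran's C statistic: largest variance over the sum of all variances
c_cal = max(variances) / sum(variances)
print(round(c_cal, 3))  # 0.582
```

Since 0.582 > Cc = 0.575, H0 is rejected, as in the manual solution. The critical value Cc still has to be looked up in a Cochran table for k groups and ν degrees of freedom.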
9.4.3 Hartley’s Fmax Test
Hartley’s Fmax test (1950) has a statistic that represents the ratio between the group with the highest variance (S²max) and the group with the lowest variance (S²min):

Fmax,cal = S²max / S²min   (9.18)
The test assumes that the number of observations per group is equal (n₁ = n₂ = … = nk = n). If all the variances are equal, the value of Fmax will be 1. The greater the difference between S²max and S²min, the higher the value of Fmax. To confirm whether the null hypothesis of variance homogeneity will be rejected or not, the value calculated must be compared to the statistic’s critical value (Fmax,c), which is available in Table N in the Appendix. The critical values vary depending on the number of groups (k), the number of degrees of freedom ν = n − 1, and the value of α; the table provides the critical values of Fmax,c considering that P(Fmax,cal > Fmax,c) = α (for a right-tailed test). Therefore, we reject the null hypothesis H0 of variance homogeneity if Fmax,cal > Fmax,c. Otherwise, we do not reject H0. The P-value (the probability associated with the Fmax,cal statistic) can also be obtained from Table N in the Appendix. In this case, we reject H0 if P ≤ α.
Example 9.6: Applying Hartley’s Fmax Test
Use Hartley’s Fmax test for the data in Example 9.4. The goal here is to compare the group with the highest variability to the group with the lowest variability.
Solution
Step 1: Since the main objective is to compare the group with the highest variance (group 3, see Table 9.E.8) to the group with the lowest variance (group 1), Hartley’s Fmax test is the most recommended.
Step 2: Hartley’s Fmax test hypotheses for this example are:
H0: the population variance of group 3 is the same as that of group 1
H1: the population variance of group 3 is different from that of group 1
Step 3: The significance level to be considered is 5%.
Step 4: From Table 9.E.8, we can see that S²min = 595.65 and S²max = 6227.48. Therefore, the calculation of Hartley’s Fmax statistic is given by:

Fmax,cal = S²max / S²min = 6227.48 / 595.65 = 10.45

Step 5: According to Table N in the Appendix, for k = 3, ν = 13, and α = 5%, the critical value of the test is Fmax,c = 3.953.
Step 6: Decision: since the value calculated lies in the critical region (Fmax,cal > Fmax,c), the null hypothesis is rejected, which allows us to conclude, with a 95% confidence level, that the population variance of group 3 is different from the population variance of group 1.
If we use the P-value instead of the statistic’s critical value, Steps 5 and 6 will be:
Step 5: According to Table N in the Appendix, the probability associated with Fmax,cal = 10.45, for k = 3 and ν = 13, is less than 0.01.
Step 6: Decision: since P < 0.05, we reject H0.
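Like Cochran’s C, Hartley’s statistic has no dedicated scipy function, but Expression (9.18) is immediate from the sample variances (a sketch with the values from Table 9.E.8):

```python
# Sample variances of the three stores (Table 9.E.8)
variances = [595.6484, 3874.6429, 6227.4780]

# Hartley's Fmax: largest variance divided by the smallest
f_max = max(variances) / min(variances)
print(round(f_max, 2))  # 10.45
```

Since 10.45 > Fmax,c = 3.953, H0 is rejected, in line with the manual calculation.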
9.4.4 Levene’s F-Test
The advantage of Levene’s F-test over the other homogeneity of variance tests is that it is less sensitive to deviations from normality, in addition to being considered a more robust test. Levene’s statistic is given by Expression (9.19) and it approximately follows an F-distribution with ν₁ = k − 1 and ν₂ = N − k degrees of freedom, for a significance level α:

Fcal = [(N − k) / (k − 1)] · [Σᵢ₌₁ᵏ nᵢ · (Z̄ᵢ − Z̄)²] / [Σᵢ₌₁ᵏ Σⱼ₌₁ⁿⁱ (Zᵢⱼ − Z̄ᵢ)²] ~ F(k−1, N−k, α) under H0   (9.19)

where:
nᵢ is the dimension of each one of the k samples (i = 1, …, k);
N is the dimension of the global sample (N = n₁ + n₂ + ⋯ + nk);
Zᵢⱼ = |Xᵢⱼ − X̄ᵢ|, i = 1, …, k and j = 1, …, nᵢ;
Xᵢⱼ is observation j in sample i;
X̄ᵢ is the mean of sample i;
Z̄ᵢ is the mean of Zᵢⱼ in sample i;
Z̄ is the mean of Zᵢⱼ in the global sample.
An extension of Levene’s test can be found in Brown and Forsythe (1974).
From the F-distribution table (Table A in the Appendix), we can determine the critical values of Levene’s statistic (Fc = F(k−1, N−k, α)). Table A provides the critical values of Fc considering that P(Fcal > Fc) = α (right-tailed table). In order for the null hypothesis H0 to be rejected, the value of the statistic must be in the critical region, that is, Fcal > Fc. If Fcal ≤ Fc, we do not reject H0. The P-value (the probability associated with the Fcal statistic) can also be obtained from Table A. In this case, we reject H0 if P ≤ α.
Example 9.7: Applying Levene’s Test
Elaborate Levene’s test for the data in Example 9.4.
Solution
Step 1: Levene’s test can be applied to check variance homogeneity between the groups, and it is more robust than the other tests.
Step 2: Levene’s test hypotheses for this example are: H0: the population variances of all three groups are homogeneous H1: the population variance of at least one group is different from the others Step 3: The significance level to be considered is 5%. Step 4: The calculation of the Fcal statistic, according to Expression (9.19), is shown.
TABLE 9.E.9 Calculating the Fcal Statistic

Group 1 (X̄₁ = 609.429, Z̄₁ = 20):

i    X1j    Z1j = |X1j − X̄₁|    Z1j − Z̄₁    (Z1j − Z̄₁)²
1    620    10.571              −9.429      88.898
1    630    20.571              0.571       0.327
1    610    0.571               −19.429     377.469
1    650    40.571              20.571      423.184
1    585    24.429              4.429       19.612
1    590    19.429              −0.571      0.327
1    630    20.571              0.571       0.327
1    644    34.571              14.571      212.327
1    595    14.429              −5.571      31.041
1    603    6.429               −13.571     184.184
1    570    39.429              19.429      377.469
1    605    4.429               −15.571     242.469
1    622    12.571              −7.429      55.184
1    578    31.429              11.429      130.612
                                            Sum = 2143.429

Group 2 (X̄₂ = 737.214, Z̄₂ = 50.418):

i    X2j    Z2j = |X2j − X̄₂|    Z2j − Z̄₂    (Z2j − Z̄₂)²
2    710    27.214              −23.204     538.429
2    780    42.786              −7.633      58.257
2    810    72.786              22.367      500.298
2    755    17.786              −32.633     1064.890
2    699    38.214              −12.204     148.940
2    680    57.214              6.796       46.185
2    710    27.214              −23.204     538.429
2    850    112.786             62.367      3889.686
2    844    106.786             56.367      3177.278
2    730    7.214               −43.204     1866.593
2    645    92.214              41.796      1746.899
2    688    49.214              −1.204      1.450
2    718    19.214              −31.204     973.695
2    702    35.214              −15.204     231.164
                                            Sum = 14,782.192

Group 3 (X̄₃ = 833.357, Z̄₃ = 66.449):

i    X3j    Z3j = |X3j − X̄₃|    Z3j − Z̄₃    (Z3j − Z̄₃)²
3    924    90.643              24.194      585.344
3    695    138.357             71.908      5170.784
3    854    20.643              −45.806     2098.201
3    802    31.357              −35.092     1231.437
3    931    97.643              31.194      973.058
3    924    90.643              24.194      585.344
3    847    13.643              −52.806     2788.487
3    800    33.357              −33.092     1095.070
3    769    64.357              −2.092      4.376
3    863    29.643              −36.806     1354.691
3    901    67.643              1.194       1.425
3    888    54.643              −11.806     139.385
3    757    76.357              9.908       98.172
3    712    121.357             54.908      3014.906
                                            Sum = 19,140.678
Therefore, since Z̄ = (20 + 50.418 + 66.449)/3 = 45.62, the calculation of Fcal is carried out as follows:

Fcal = [(42 − 3)/(3 − 1)] · [14·(20 − 45.62)² + 14·(50.418 − 45.62)² + 14·(66.449 − 45.62)²] / (2143.429 + 14,782.192 + 19,140.678)

Fcal = 8.427

Step 5: According to Table A in the Appendix, for ν₁ = 2, ν₂ = 39, and α = 5%, the critical value of the test is Fc = 3.24.
Step 6: Decision: since the value calculated lies in the critical region (Fcal > Fc), the null hypothesis is rejected, which allows us to conclude, with a 95% confidence level, that the population variance of at least one group is different from the others.
If we use the P-value instead of the statistic’s critical value, Steps 5 and 6 will be:
Step 5: According to Table A in the Appendix, for ν₁ = 2 and ν₂ = 39, the probability associated with Fcal = 8.427 is less than 0.01 (P-value).
Step 6: Decision: since P < 0.05, we reject H0.
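The same result can be obtained with `scipy.stats.levene`, using `center='mean'` to match the mean-based deviations of Expression (9.19) (a sketch; the default `center='median'` would give the Brown-Forsythe variant instead):

```python
from scipy import stats

# Daily customer counts per store (Table 9.E.8)
store1 = [620, 630, 610, 650, 585, 590, 630, 644, 595, 603, 570, 605, 622, 578]
store2 = [710, 780, 810, 755, 699, 680, 710, 850, 844, 730, 645, 688, 718, 702]
store3 = [924, 695, 854, 802, 931, 924, 847, 800, 769, 863, 901, 888, 757, 712]

f_stat, p_value = stats.levene(store1, store2, store3, center='mean')
print(round(f_stat, 3), round(p_value, 3))  # 8.427 0.001
```

Since the P-value is below 0.05, the variances are not homogeneous, matching the hand calculation and the SPSS/Stata output shown next.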
9.4.5 Solving Levene’s Test by Using SPSS Software
The use of the images in this section has been authorized by the International Business Machines Corporation©. To test the variance homogeneity between the groups, SPSS uses Levene’s test. The data presented in Example 9.4 are available in the file CustomerServices_Store.sav. In order to elaborate the test, we must click on Analyze → Descriptive Statistics → Explore …, as shown in Fig. 9.13. Let’s include the variable Customer_services in the list of dependent variables (Dependent List) and the variable Store in the factor list (Factor List), as shown in Fig. 9.14. Next, we must click on Plots … and select the option Untransformed in Spread vs Level with Levene Test, as shown in Fig. 9.15. Finally, let’s click on Continue and on OK. The result of Levene’s test can also be obtained through the ANOVA test, by clicking on Analyze → Compare Means → One-Way ANOVA …. In Options …, we must select the option Homogeneity of variance test (Fig. 9.16).
FIG. 9.13 Procedure for elaborating Levene’s test on SPSS.
FIG. 9.14 Selecting the variables to elaborate Levene’s test on SPSS.
FIG. 9.15 Continuation of the procedure to elaborate Levene’s test on SPSS.
FIG. 9.16 Results of Levene’s test for Example 9.4 on SPSS.
The value of Levene’s statistic is 8.427, exactly the same as the one calculated previously. Since the significance level observed is 0.001, a value lower than 0.05, the test shows the rejection of the null hypothesis, which allows us to conclude, with a 95% confidence level, that the population variances are not homogeneous.
9.4.6 Solving Levene’s Test by Using the Stata Software
The use of the images in this section has been authorized by StataCorp LP©. Levene’s statistical test for equality of variances is calculated on Stata by using the command robvar (robust-test for equality of variances), which has the following syntax: robvar variable*, by(groups*)
in which the term variable* should be substituted for the quantitative variable studied and the term groups* by the categorical variable that represents them. Let’s open the file CustomerServices_Store.dta that contains the data of Example 9.7. The three groups are represented by the variable store and the number of customers served by the variable services. Therefore, the command to be typed is: robvar services, by(store)
The result of the test can be seen in Fig. 9.17. We can verify that the value of the statistic (8.427) is similar to the one calculated in Example 9.7 and to the one generated on SPSS, as well as the calculation of the probability associated to
FIG. 9.17 Results of Levene’s test for Example 9.7 on Stata.
the statistic (0.001). Since P < 0.05, the null hypothesis is rejected, which allows us to conclude, with a 95% confidence level, that the variances are not homogeneous.
9.5 HYPOTHESES TESTS REGARDING A POPULATION MEAN (μ) FROM ONE RANDOM SAMPLE
The main goal is to test whether a population mean assumes a certain value or not.
9.5.1 Z Test When the Population Standard Deviation (σ) Is Known and the Distribution Is Normal
This test is applied when a random sample of size n is obtained from a population with a normal distribution, whose mean (μ) is unknown and whose standard deviation (σ) is known. If the distribution of the population is not known, it is necessary to work with large samples (n > 30), because the central limit theorem guarantees that, as the sample size grows, the sampling distribution of the mean gets closer and closer to a normal distribution. For a bilateral test, the hypotheses are:
H0: the sample comes from a population with a certain mean (μ = μ0)
H1: it challenges the null hypothesis (μ ≠ μ0)
The test statistic used here refers to the sample mean (X̄). In order for the sample mean to be compared to the value in the table, it must be standardized, so:

Zcal = (X̄ − μ0) / σX̄ ~ N(0, 1), where σX̄ = σ/√n   (9.20)
The critical values of the zc statistic are shown in Table E in the Appendix. This table provides the critical values of zc considering that P(Zcal > zc) = α (for a right-tailed test). For a bilateral test, we must consider P(Zcal > zc) = α/2, since P(Zcal < −zc) + P(Zcal > zc) = α. The null hypothesis H0 of a bilateral test is rejected if the value of the Zcal statistic lies in the critical region, that is, if Zcal < −zc or Zcal > zc. Otherwise, we do not reject H0. The unilateral probabilities associated with the Zcal statistic (P1) can also be obtained from Table E. For a unilateral test, we consider that P = P1. For a bilateral test, this probability must be doubled (P = 2P1). Therefore, for both tests, we reject H0 if P ≤ α.
Example 9.8: Applying the z Test to One Sample
A cereal manufacturer states that the average quantity of food fiber in each portion of its product is, at least, 4.2 g, with a standard deviation of 1 g. A health care agency wishes to verify if this statement is true, and collects a random sample of 42 portions, in which the average quantity of food fiber is 3.9 g. With a significance level equal to 5%, is there evidence to reject the manufacturer’s statement?
Solution
Step 1: The suitable test for a population mean with a known σ, considering a single sample of size n > 30 (normal distribution), is the z test.
Step 2: For this example, the z test hypotheses are:
H0: μ ≥ 4.2 g (information provided by the manufacturer)
H1: μ < 4.2 g
which corresponds to a left-tailed test.
Step 3: The significance level to be considered is 5%.
Step 4: The calculation of the Zcal statistic, according to Expression (9.20), is:

Zcal = (X̄ − μ0) / (σ/√n) = (3.9 − 4.2) / (1/√42) = −1.94

Step 5: According to Table E in the Appendix, for a left-tailed test with α = 5%, the critical value of the test is zc = −1.645.
Step 6: Decision: since the value calculated lies in the critical region (Zcal < −1.645), the null hypothesis is rejected, which allows us to conclude, with a 95% confidence level, that the manufacturer’s average quantity of food fiber is less than 4.2 g.
If, instead of comparing the value calculated to the critical value of the standard normal distribution, we use the calculation of the P-value, Steps 5 and 6 will be:
Step 5: According to Table E in the Appendix, for a left-tailed test, the probability associated with Zcal = −1.94 is 0.0262 (P-value).
Step 6: Decision: since P < 0.05, the null hypothesis is rejected, which allows us to conclude, with a 95% confidence level, that the manufacturer’s average quantity of food fiber is less than 4.2 g.
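This one-sample z test is easy to reproduce from the summary statistics with `scipy.stats.norm` (a sketch; scipy has no dedicated z-test function, so the statistic is computed directly from Expression (9.20)):

```python
import math
from scipy import stats

x_bar, mu0, sigma, n = 3.9, 4.2, 1.0, 42

# Standardized test statistic, Expression (9.20)
z_cal = (x_bar - mu0) / (sigma / math.sqrt(n))

# Left-tailed P-value: P(Z < z_cal)
p_value = stats.norm.cdf(z_cal)
print(round(z_cal, 2), round(p_value, 3))  # -1.94 0.026
```

Since the P-value is below 0.05, the manufacturer’s claim is rejected, as in the manual solution.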
9.5.2 Student’s t-Test When the Population Standard Deviation (σ) Is Not Known
Student’s t-test for one sample is applied when we do not know the population standard deviation (σ), so its value is estimated from the sample standard deviation (S). However, when we substitute σ for S in Expression (9.20), the distribution of the variable is no longer normal; it becomes a Student’s t-distribution with n − 1 degrees of freedom. Analogous to the z test, Student’s t-test for one sample assumes the following hypotheses for a bilateral test:
H0: μ = μ0
H1: μ ≠ μ0
And the calculation of the statistic becomes:

Tcal = (X̄ − μ0) / (S/√n) ~ t(n−1)   (9.21)
The value calculated must be compared to the value in Student’s t-distribution table (Table B in the Appendix). This table provides the critical values of tc considering that P(Tcal > tc) = α (for a right-tailed test). For a bilateral test, we have P(Tcal < −tc) = α/2 = P(Tcal > tc), as shown in Fig. 9.18. Therefore, for a bilateral test, the null hypothesis is rejected if Tcal < −tc or Tcal > tc. If −tc ≤ Tcal ≤ tc, we do not reject H0.
FIG. 9.18 Nonrejection region (NR) and critical region (CR) of Student’s t-distribution for a bilateral test.
The unilateral probabilities associated with the Tcal statistic (P1) can also be obtained from Table B. For a unilateral test, we have P = P1. For a bilateral test, this probability must be doubled (P = 2P1). Therefore, for both tests, we reject H0 if P ≤ α.
Example 9.9: Applying Student’s t-Test to One Sample
The average processing time of a task using a certain machine has been 18 min. New concepts have been implemented in order to reduce the average processing time. Hence, after a certain period of time, a sample with 25 elements was collected, and an average time of 16.808 min was measured, with a standard deviation of 2.733 min. Check whether this result represents an improvement in the average processing time. Consider α = 1%.
Solution
Step 1: The suitable test for a population mean with an unknown σ is Student’s t-test.
Step 2: For this example, Student’s t-test hypotheses are:
H0: μ = 18
H1: μ < 18
which corresponds to a left-tailed test.
Step 3: The significance level to be considered is 1%.
Step 4: The calculation of the Tcal statistic, according to Expression (9.21), is:

Tcal = (X̄ − μ0) / (S/√n) = (16.808 − 18) / (2.733/√25) = −2.18

Step 5: According to Table B in the Appendix, for a left-tailed test with 24 degrees of freedom and α = 1%, the critical value of the test is tc = −2.492.
Step 6: Decision: since the value calculated is not in the critical region (Tcal > −2.492), the null hypothesis is not rejected, which allows us to conclude, with a 99% confidence level, that there was no improvement in the average processing time.
If, instead of comparing the value calculated to the critical value of Student’s t-distribution, we use the calculation of the P-value, Steps 5 and 6 will be:
Step 5: According to Table B in the Appendix, for a left-tailed test with 24 degrees of freedom, the probability associated with Tcal = −2.18 is between 0.01 and 0.025 (P-value).
Step 6: Decision: since P > 0.01, we do not reject the null hypothesis.
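Because only the summary statistics are available here, the test can be reproduced with the t-distribution in `scipy.stats` (a sketch; with the raw data one would call `scipy.stats.ttest_1samp` instead):

```python
import math
from scipy import stats

x_bar, mu0, s, n = 16.808, 18.0, 2.733, 25

# Test statistic, Expression (9.21), with n - 1 = 24 degrees of freedom
t_cal = (x_bar - mu0) / (s / math.sqrt(n))

# Left-tailed P-value: P(T < t_cal)
p_value = stats.t.cdf(t_cal, df=n - 1)
print(round(t_cal, 2))  # t ≈ -2.18; P-value lands between 0.01 and 0.025
```

The one-sided P-value comes out near 0.0196; since it exceeds 0.01, H0 is not rejected at the 1% level, as in the manual solution.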
9.5.3 Solving Student’s t-Test for a Single Sample by Using SPSS Software
The use of the images in this section has been authorized by the International Business Machines Corporation©. If we wish to compare means from a single sample, SPSS makes Student’s t-test available. The data in Example 9.9 are available in the file T_test_One_Sample.sav. The procedure to apply the test from Example 9.9 will be described. Initially, let’s select Analyze → Compare Means → One-Sample T Test …, as shown in Fig. 9.19. We must select the variable Time and specify the value 18 that will be tested in Test Value, as shown in Fig. 9.20. Now, we must click on Options … to define the desired confidence level (Fig. 9.21). Finally, let’s click on Continue and on OK. The results of the test are shown in Fig. 9.22. This figure shows the result of the t-test (similar to the value calculated in Example 9.9) and the associated probability (P-value) for a bilateral test. For a unilateral test, the associated probability is 0.0195 (we saw in Example 9.9 that this probability would be between 0.01 and 0.025). Since 0.0195 > 0.01, we do not reject the null hypothesis, which allows us to conclude, with a 99% confidence level, that there was no improvement in the average processing time.
9.5.4 Solving Student’s t-Test for a Single Sample by Using Stata Software
The use of the images in this section has been authorized by StataCorp LP©. Student’s t-test is elaborated on Stata by using the command ttest. For one population mean, the test syntax is: ttest variable* == #
FIG. 9.19 Procedure for elaborating the t-test from one sample on SPSS.
FIG. 9.20 Selecting the variable and specifying the value to be tested.
where the term variable* should be substituted for the name of the variable considered in the analysis and # for the value of the population mean to be tested. The data in Example 9.9 are available in the file T_test_One_Sample.dta. In this case, the variable being analyzed is called time and the goal is to verify if the average processing time is still 18 min, so, the command to be typed is: ttest time == 18
The result of the test can be seen in Fig. 9.23. We can see that the calculated value of the statistic (−2.180) is similar to the one calculated in Example 9.9 and also generated on SPSS, as is the associated probability for a left-tailed test (0.0196). Since P > 0.01, we do not reject the null hypothesis, which allows us to conclude, with a 99% confidence level, that there was no improvement in the processing time.
FIG. 9.21 Options—defining the confidence level.
FIG. 9.22 Results of the t-test for one sample for Example 9.9 on SPSS.
FIG. 9.23 Results of the t-test for one sample for Example 9.9 on Stata.
9.6 STUDENT’S T-TEST TO COMPARE TWO POPULATION MEANS FROM TWO INDEPENDENT RANDOM SAMPLES
The t-test for two independent samples is applied to compare the means of two random samples (X1i, i = 1, …, n1; X2j, j = 1, …, n2). In this test, the population variances are unknown. For a bilateral test, the null hypothesis of the test states that the population means are the same. If the population means are different, the null hypothesis is rejected, so:
H0: μ1 = μ2
H1: μ1 ≠ μ2
The calculation of the T statistic depends on the comparison of the population variances between the groups.
Case 1: σ₁² ≠ σ₂²
Considering that the population variances are different, the calculation of the T statistic is given by:
FIG. 9.24 Nonrejection region (NR) and critical region (CR) of Student’s t-distribution for a bilateral test.
Tcal = (X̄1 − X̄2) / √(S²1/n1 + S²2/n2)   (9.22)

with the following degrees of freedom:

ν = (S²1/n1 + S²2/n2)² / [ (S²1/n1)²/(n1 − 1) + (S²2/n2)²/(n2 − 1) ]   (9.23)

Case 2: σ₁² = σ₂²
When the population variances are homogeneous, to calculate the T statistic, the researcher has to use:

Tcal = (X̄1 − X̄2) / (Sp · √(1/n1 + 1/n2))   (9.24)

where:

Sp = √[ ((n1 − 1)·S²1 + (n2 − 1)·S²2) / (n1 + n2 − 2) ]   (9.25)
and Tcal follows Student’s t-distribution with ν = n1 + n2 − 2 degrees of freedom.
The value calculated must be compared to the value in Student’s t-distribution table (Table B in the Appendix). This table provides the critical values of tc considering that P(Tcal > tc) = α (for a right-tailed test). For a bilateral test, we have P(Tcal < −tc) = α/2 = P(Tcal > tc), as shown in Fig. 9.24. Therefore, for a bilateral test, if the value of the statistic lies in the critical region, that is, if Tcal < −tc or Tcal > tc, the test allows us to reject the null hypothesis. On the other hand, if −tc ≤ Tcal ≤ tc, we do not reject H0. The unilateral probabilities associated with the Tcal statistic (P1) can also be obtained from Table B. For a unilateral test, we have P = P1. For a bilateral test, this probability must be doubled (P = 2P1). Therefore, for both tests, we reject H0 if P ≤ α.
Example 9.10: Applying Student’s t-Test to Two Independent Samples
A quality engineer believes that the average time to manufacture a certain plastic product may depend on the raw materials used, which come from two different suppliers. A sample with 30 observations from each supplier is collected for a test, and the results are shown in Tables 9.E.10 and 9.E.11. For a significance level α = 5%, check if there is any difference between the means.
Solution
Step 1: The suitable test to compare two population means with unknown σ is Student’s t-test for two independent samples.
Step 2: For this example, Student’s t-test hypotheses are:
H0: μ1 = μ2
H1: μ1 ≠ μ2
Step 3: The significance level to be considered is 5%.
TABLE 9.E.10 Manufacturing Time Using Raw Materials From Supplier 1
22.8  23.4  26.2  24.3  22.0  24.8  26.7  25.1  23.1  22.8
25.6  25.1  24.3  24.2  22.8  23.2  24.7  26.5  24.5  23.6
23.9  22.8  25.4  26.7  22.9  23.5  23.8  24.6  26.3  22.7
TABLE 9.E.11 Manufacturing Time Using Raw Materials From Supplier 2
26.8  29.3  28.4  25.6  29.4  27.2  27.6  26.8  25.4  28.6
29.7  27.2  27.9  28.4  26.0  26.8  27.5  28.5  27.3  29.1
29.2  25.7  28.4  28.6  27.9  27.4  26.7  26.8  25.6  26.1
Step 4: For the data in Tables 9.E.10 and 9.E.11, we calculate X̄1 = 24.277, X̄2 = 27.530, S²1 = 1.810, and S²2 = 1.559. Considering that the population variances are homogeneous, according to the solution generated on SPSS, let’s use Expressions (9.24) and (9.25) to calculate the Tcal statistic, as follows:

Sp = √[(29 · 1.810 + 29 · 1.559) / (30 + 30 − 2)] = 1.298

Tcal = (24.277 − 27.530) / (1.298 · √(1/30 + 1/30)) = −9.708

with ν = 30 + 30 − 2 = 58 degrees of freedom.
Step 5: The critical region of the bilateral test, considering ν = 58 degrees of freedom and α = 5%, can be defined from Student’s t-distribution table (Table B in the Appendix), as shown in Fig. 9.25. For a bilateral test, each one of the tails corresponds to half of the significance level α.
FIG. 9.25 Critical region of Example 9.10.
Step 6: Decision: since the value calculated lies in the critical region, that is, Tcal < −2.002, we must reject the null hypothesis, which allows us to conclude, with a 95% confidence level, that the population means are different.
If, instead of comparing the value calculated to the critical value of Student’s t-distribution, we use the calculation of the P-value, Steps 5 and 6 will be:
Step 5: According to Table B in the Appendix, for a right-tailed test with ν = 58 degrees of freedom, the probability P1 associated with |Tcal| = 9.708 is less than 0.0005. For a bilateral test, this probability must be doubled (P = 2P1).
Step 6: Decision: since P < 0.05, the null hypothesis is rejected.
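Working from the summary statistics above, the pooled two-sample test can be reproduced with `scipy.stats.ttest_ind_from_stats` (a sketch; setting `equal_var=False` would give the Welch version of Case 1 instead):

```python
import math
from scipy import stats

# Summary statistics from Tables 9.E.10 and 9.E.11
t_stat, p_value = stats.ttest_ind_from_stats(
    mean1=24.277, std1=math.sqrt(1.810), nobs1=30,
    mean2=27.530, std2=math.sqrt(1.559), nobs2=30,
    equal_var=True,  # pooled variance, Expressions (9.24)-(9.25)
)
print(round(t_stat, 2))  # ≈ -9.71, matching the hand-calculated -9.708
```

The bilateral P-value is far below 0.05, so the null hypothesis of equal means is rejected, as in the manual solution.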
9.6.1 Solving Student’s t-Test From Two Independent Samples by Using SPSS Software
The data in Example 9.10 are available in the file T_test_Two_Independent_Samples.sav. The procedure for solving Student’s t-test to compare two population means from two independent random samples on SPSS is described. The use of the images in this section has been authorized by the International Business Machines Corporation©. We must click on Analyze → Compare Means → Independent-Samples T Test …, as shown in Fig. 9.26.
Let’s include the variable Time in Test Variable(s) and the variable Supplier in Grouping Variable. Next, let’s click on Define Groups … to define the groups (categories) of the variable Supplier, as shown in Fig. 9.27. If the confidence level desired by the researcher is different from 95%, the button Options … must be selected to change it. Finally, let’s click on OK. The results of the test are shown in Fig. 9.28. The value of the t statistic for the test is −9.708 and the associated bilateral probability is 0.000 (P < 0.05), which leads us to reject the null hypothesis and allows us to conclude, with a 95% confidence level, that the population means are different. We can notice that Fig. 9.28 also shows the result of Levene’s test. Since the significance level observed is 0.694, a value greater than 0.05, we can also conclude, with a 95% confidence level, that the variances are homogeneous.
FIG. 9.26 Procedure for elaborating the t-test from two independent samples on SPSS.
FIG. 9.27 Selecting the variables and defining the groups.
FIG. 9.28 Results of the t-test for two independent samples for Example 9.10 on SPSS.
FIG. 9.29 Results of the t-test for two independent samples for Example 9.10 on Stata.
9.6.2 Solving Student’s t-Test From Two Independent Samples by Using Stata Software
The use of the images in this section has been authorized by StataCorp LP©. The t-test to compare the means of two independent groups on Stata is elaborated by using the following syntax: ttest variable*, by(groups*)
where the term variable* must be substituted for the quantitative variable being analyzed, and the term groups* for the categorical variable that represents them. The data in Example 9.10 are available in the file T_test_Two_Independent_Samples.dta. The variable supplier shows the groups of suppliers. The values for each group of suppliers are specified in the variable time. Thus, we must type the following command: ttest time, by(supplier)
The result of the test can be seen in Fig. 9.29. We can see that the calculated value of the statistic (−9.708) is similar to the one calculated in Example 9.10 and also generated on SPSS, as is the associated probability for a bilateral test (0.000). Since P < 0.05, the null hypothesis is rejected, which allows us to conclude, with a 95% confidence level, that the population means are different.
9.7 STUDENT’S T-TEST TO COMPARE TWO POPULATION MEANS FROM TWO PAIRED RANDOM SAMPLES
This test is applied to check whether the means of two paired or related samples, obtained from the same population (before and after), with a normal distribution, are significantly different or not. Besides the normality of the data of each sample, the test requires the homogeneity of the variances between the groups. Different from the t-test for two independent samples, first we must calculate the difference between each pair of values in position i (di = Xbefore,i − Xafter,i, i = 1, …, n) and, after that, test the null hypothesis that the mean of the differences in the population is zero.
For a bilateral test, we have:
H0: μd = 0, where μd = μbefore − μafter
H1: μd ≠ 0
The Tcal statistic for the test is given by:

Tcal = (d̄ − μd) / (Sd/√n) ~ t(ν = n − 1)   (9.26)

where:

d̄ = (Σᵢ₌₁ⁿ dᵢ) / n   (9.27)

and

Sd = √[ Σᵢ₌₁ⁿ (dᵢ − d̄)² / (n − 1) ]   (9.28)
The value calculated must be compared to the value in Student's t-distribution table (Table B in the Appendix). This table provides the critical values of tc considering that P(Tcal > tc) = α (for a right-tailed test). For a bilateral test, we have P(Tcal < −tc) = α/2 = P(Tcal > tc), as shown in Fig. 9.30.

FIG. 9.30 Nonrejection region (NR) and critical region (CR) of Student's t-distribution for a bilateral test.

Therefore, for a bilateral test, the null hypothesis is rejected if Tcal < −tc or Tcal > tc. If −tc ≤ Tcal ≤ tc, we do not reject H0. The unilateral probabilities associated with the Tcal statistic (P1) can also be obtained from Table B. For a unilateral test, we have P = P1. For a bilateral test, this probability must be doubled (P = 2P1). Therefore, for both tests, we reject H0 if P ≤ α.

Example 9.11: Applying Student's t-Test to Two Paired Samples
A group of 10 machine operators, responsible for carrying out a certain task, is trained to perform the same task more efficiently. To verify if there is a reduction in the time taken to perform the task, we measured the time spent by each operator, before and after the training course. Test the hypothesis that the population means of both paired samples are similar, that is, that there is no reduction in the time taken to perform the task after the training course. Consider α = 5%.
TABLE 9.E.12 Time Spent Per Operator Before the Training Course

3.2   3.6   3.4   3.8   3.4   3.5   3.7   3.2   3.5   3.9

TABLE 9.E.13 Time Spent Per Operator After the Training Course

3.0   3.3   3.5   3.6   3.4   3.3   3.4   3.0   3.2   3.6
Solution Step 1: In this case, the most suitable test is Student’s t-test for two paired samples. Since the test requires the normality of the data in each sample and the homogeneity of the variances between the groups, K-S or S-W tests, besides Levene’s test, must be applied for such verification. As we will see, in the solution of this example on SPSS, all of these assumptions will be validated.
Hypotheses Tests Chapter 9
Step 2: For this example, Student's t-test hypotheses are:

H0: μd = 0
H1: μd ≠ 0

Step 3: The significance level to be considered is 5%.
Step 4: In order to calculate the Tcal statistic, first, we must calculate di:
TABLE 9.E.14 Calculating di

Xbefore,i   Xafter,i   di
3.2         3.0        0.2
3.6         3.3        0.3
3.4         3.5        −0.1
3.8         3.6        0.2
3.4         3.4        0
3.5         3.3        0.2
3.7         3.4        0.3
3.2         3.0        0.2
3.5         3.2        0.3
3.9         3.6        0.3
d̄ = Σ_{i=1}^{n} di / n = (0.2 + 0.3 + ⋯ + 0.3) / 10 = 0.19

Sd = √{ [(0.2 − 0.19)² + (0.3 − 0.19)² + ⋯ + (0.3 − 0.19)²] / 9 } = 0.137

Tcal = d̄ / (Sd / √n) = 0.19 / (0.137 / √10) = 4.385
Step 5: The critical region of the bilateral test can be defined from Student's t-distribution table (Table B in the Appendix), considering ν = n − 1 = 9 degrees of freedom and α = 5%, as shown in Fig. 9.31. For a bilateral test, each tail corresponds to half of the significance level α.

FIG. 9.31 Critical region of Example 9.11.
Step 6: Decision: since the value calculated lies in the critical region (Tcal = 4.385 > 2.262), the null hypothesis is rejected, which allows us to conclude, with a 95% confidence level, that there is a significant difference between the times spent by the operators before and after the training course.
If, instead of comparing the value calculated to the critical value of Student's t-distribution, we used the calculation of the P-value, Steps 5 and 6 would be:
Step 5: According to Table B in the Appendix, for a right-tailed test with ν = 9 degrees of freedom, the P1 probability associated with Tcal = 4.385 is between 0.0005 and 0.001. For a bilateral test, this probability must be doubled (P = 2P1), so 0.001 < P < 0.002.
Step 6: Decision: since P < 0.05, the null hypothesis is rejected.
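The hand calculation above can be reproduced programmatically; a minimal sketch with Python's SciPy, using the data from Tables 9.E.12 and 9.E.13:

```python
from scipy import stats

# Times per operator before and after the training course
# (Tables 9.E.12 and 9.E.13)
before = [3.2, 3.6, 3.4, 3.8, 3.4, 3.5, 3.7, 3.2, 3.5, 3.9]
after = [3.0, 3.3, 3.5, 3.6, 3.4, 3.3, 3.4, 3.0, 3.2, 3.6]

# Paired (related-samples) t-test: tests H0 that the mean of the
# differences d_i = before_i - after_i is zero
t_stat, p_value = stats.ttest_rel(before, after)
print(round(t_stat, 3), round(p_value, 4))  # Tcal = 4.385, P = 0.0018

# Critical value for the bilateral test with n - 1 = 9 degrees of freedom
t_c = stats.t.ppf(1 - 0.05 / 2, df=9)
print(round(t_c, 3))  # 2.262
```

Since Tcal > tc (equivalently, P < 0.05), the same decision to reject H0 is reached.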
9.7.1 Solving Student's t-Test From Two Paired Samples by Using SPSS Software
First, we must test the normality of the data in each sample, as well as the variance homogeneity between the groups. Using the same procedures described in Sections 9.3.3 and 9.4.5 (the data must be placed in a table the same way as in Section 9.4.5), we obtain Figs. 9.32 and 9.33. Based on Fig. 9.32, we conclude that there is normality of the data for each sample. From Fig. 9.33, we can conclude that the variances between the samples are homogeneous.
The use of the images in this section has been authorized by the International Business Machines Corporation©. To solve Student's t-test for two paired samples on SPSS, we must open the file T_test_Two_Paired_Samples.sav. Then, we have to click on Analyze → Compare Means → Paired-Samples T Test …, as shown in Fig. 9.34. We must select the variable Before and move it to Variable1, and the variable After to Variable2, as shown in Fig. 9.35. If the desired confidence level is different from 95%, we must click on Options … to change it. Finally, let's click on OK. The results of the test are shown in Fig. 9.36. The value of the t-test is 4.385 and the significance level observed for a bilateral test is 0.002, a value less than 0.05, which leads us to reject the null hypothesis and allows us to conclude, with a 95% confidence level, that there is a significant difference between the times spent by the operators before and after the training course.
FIG. 9.32 Results of the normality tests on SPSS.
FIG. 9.33 Results of Levene’s test on SPSS.
FIG. 9.34 Procedure for elaborating the t-test from two paired samples on SPSS.
FIG. 9.35 Selecting the variables that will be paired.
FIG. 9.36 Results of the t-test for two paired samples.
FIG. 9.37 Results of Student’s t-test for two paired samples for Example 9.11 on Stata.
9.7.2 Solving Student's t-Test From Two Paired Samples by Using Stata Software
The t-test to compare the means of two paired groups will be solved on Stata for the data in Example 9.11. The use of the images in this section has been authorized by StataCorp LP©. Therefore, let’s open the file T_test_Two_Paired_Samples.dta. The paired variables are called before and after. In this case, we must type the following command: ttest before == after
The result of the test can be seen in Fig. 9.37. We can see that the calculated value of the statistic (4.385) matches the one calculated in Example 9.11 and on SPSS, as does the probability associated with the statistic for a bilateral test (0.0018). Since P < 0.05, we reject the null hypothesis that the times spent by the operators before and after the training course are the same, with a 95% confidence level.
9.8 ANOVA TO COMPARE THE MEANS OF MORE THAN TWO POPULATIONS
ANOVA is a test used to compare the means of three or more populations, through the analysis of sample variances. The test is based on a sample obtained from each population, and aims to determine whether the differences between the sample means suggest significant differences between the population means, or whether such differences are only a result of the implicit variability of the samples. ANOVA's assumptions are: (i) the samples must be independent from each other; (ii) the data in the populations must have a normal distribution; (iii) the population variances must be homogeneous.
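These assumptions can be screened programmatically before running ANOVA; a minimal sketch with Python's SciPy (the three samples below are illustrative):

```python
from scipy import stats

# Three illustrative independent samples
g1 = [0.33, 0.79, 1.24, 1.75, 0.94, 2.42, 1.97, 0.87]
g2 = [1.54, 1.11, 0.97, 2.57, 2.94, 3.44, 3.02, 3.55]
g3 = [1.47, 1.69, 1.55, 2.04, 2.67, 3.07, 3.33, 4.01]

# (ii) Normality of each group: Shapiro-Wilk, suited to small samples
for g in (g1, g2, g3):
    w, p_norm = stats.shapiro(g)
    print(f"Shapiro-Wilk W = {w:.3f}, P = {p_norm:.3f}")

# (iii) Homogeneity of variances: Levene's test;
# center='mean' gives the classical Levene statistic
stat, p = stats.levene(g1, g2, g3, center='mean')
print(f"Levene statistic = {stat:.3f}, P = {p:.3f}")
```

In both cases a P-value above the significance level means the corresponding assumption is not rejected, so ANOVA may proceed.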
9.8.1 One-Way ANOVA
One-way ANOVA is an extension of Student's t-test for two population means, allowing the researcher to compare three or more population means. The null hypothesis of the test states that the population means are the same. If there is at least one group with a mean that is different from the others, the null hypothesis is rejected. As stated in Fávero et al. (2009), the one-way ANOVA allows the researcher to verify the effect of a qualitative explanatory variable (factor) on a quantitative dependent variable. Each group includes the observations of the dependent variable in one category of the factor. Assuming that independent samples of size n are obtained from k populations (k ≥ 3) and that the means of these populations can be represented by μ1, μ2, …, μk, the analysis of variance tests the following hypotheses:

H0: μ1 = μ2 = … = μk
H1: ∃(i, j): μi ≠ μj, i ≠ j   (9.29)
According to Maroco (2014), in general, the observations for this type of problem can be represented according to Table 9.2, where Yij represents observation i of sample or group j (i = 1, …, nj; j = 1, …, k) and nj is the dimension of sample or group j. The dimension of the global sample is N = Σ_{j=1}^{k} nj. Pestana and Gageiro (2008) present the following model:

Yij = μi + εij   (9.30)
Yij = μ + (μi − μ) + εij   (9.31)
Yij = μ + αi + εij   (9.32)

where:
μ is the global mean of the population;
μi is the mean of sample or group i;
αi is the effect of sample or group i;
εij is the random error.
TABLE 9.2 Observations of the One-Way ANOVA

Samples or Groups
1          2          …   k
Y11        Y12        …   Y1k
Y21        Y22        …   Y2k
⋮          ⋮              ⋮
Yn1,1      Yn2,2      …   Ynk,k
Therefore, ANOVA assumes that each group comes from a population with a normal distribution, mean μi, and a homogeneous variance, that is, Yij ~ N(μi, σ), resulting in the hypothesis that the errors (residuals) have a normal distribution with a mean equal to zero and a constant variance, that is, εij ~ N(0, σ), besides being independent (Fávero et al., 2009).
The technique's hypotheses are tested from the calculation of the group variances, and that is where the name ANOVA comes from. The technique involves the calculation of the variations between the groups (Ȳi − Ȳ) and within each group (Yij − Ȳi). The residual sum of squares within groups (RSS) is calculated by:

RSS = Σ_{i=1}^{k} Σ_{j=1}^{n_i} (Yij − Ȳi)²   (9.33)

The sum of squares between groups, or the sum of squares of the factor (SSF), is given by:

SSF = Σ_{i=1}^{k} ni (Ȳi − Ȳ)²   (9.34)

Therefore, the total sum of squares is:

TSS = RSS + SSF = Σ_{i=1}^{k} Σ_{j=1}^{n_i} (Yij − Ȳ)²   (9.35)

According to Fávero et al. (2009) and Maroco (2014), the ANOVA statistic is given by the division between the variance of the factor (SSF divided by k − 1 degrees of freedom) and the variance of the residuals (RSS divided by N − k degrees of freedom):

Fcal = [SSF / (k − 1)] / [RSS / (N − k)] = MSF / MSR   (9.36)

where:
MSF represents the mean square between groups (estimate of the variance of the factor);
MSR represents the mean square within groups (estimate of the variance of the residuals).

Table 9.3 summarizes the calculations of the one-way ANOVA. The value of F can be null or positive, but never negative; therefore, ANOVA uses an F-distribution, which is asymmetrical to the right. The calculated value (Fcal) must be compared to the value in the F-distribution table (Table A in the Appendix). This table provides the critical values of Fc = F(k−1, N−k, α), where P(Fcal > Fc) = α (right-tailed test). Therefore, the one-way ANOVA's null hypothesis is rejected if Fcal > Fc. Otherwise, if Fcal ≤ Fc, we do not reject H0. We will use these concepts when we study the estimation of regression models in Chapter 13.
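Expressions (9.33)–(9.36) translate directly into code; a minimal sketch that computes SSF, RSS, and F from the definitions and checks the result against SciPy's f_oneway (the three groups are illustrative):

```python
from scipy import stats

def one_way_anova(groups):
    """Compute SSF, RSS and the F statistic from Expressions (9.33)-(9.36)."""
    N = sum(len(g) for g in groups)
    k = len(groups)
    grand_mean = sum(sum(g) for g in groups) / N
    # Sum of squares between groups (factor), Expression (9.34)
    ssf = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Residual sum of squares within groups, Expression (9.33)
    rss = sum(sum((y - sum(g) / len(g)) ** 2 for y in g) for g in groups)
    # F = MSF / MSR, Expression (9.36)
    msf = ssf / (k - 1)
    msr = rss / (N - k)
    return ssf, rss, msf / msr

groups = [[4.1, 5.2, 3.8, 4.7], [6.0, 5.5, 6.3, 5.8], [4.9, 5.1, 4.6, 5.4]]
ssf, rss, f = one_way_anova(groups)
f_scipy, p = stats.f_oneway(*groups)
print(round(f, 3), round(f_scipy, 3))  # both give the same F statistic
```

Both routes compute the same statistic; f_oneway additionally returns the P-value from the F(k−1, N−k) distribution, replacing the lookup in Table A.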
TABLE 9.3 Calculating the One-Way ANOVA

Source of Variation   Sum of Squares                                      Degrees of Freedom   Mean Squares          F
Between the groups    SSF = Σ_{i=1}^{k} ni (Ȳi − Ȳ)²                      k − 1                MSF = SSF/(k − 1)     F = MSF/MSR
Within the groups     RSS = Σ_{i=1}^{k} Σ_{j=1}^{n_i} (Yij − Ȳi)²         N − k                MSR = RSS/(N − k)
Total                 TSS = Σ_{i=1}^{k} Σ_{j=1}^{n_i} (Yij − Ȳ)²          N − 1

Source: Fávero, L.P., Belfiore, P., Silva, F.L., Chan, B.L., 2009. Análise de dados: modelagem multivariada para tomada de decisões. Campus Elsevier, Rio de Janeiro; Maroco, J., 2014. Análise estatística com o SPSS Statistics, sixth ed. Edições Sílabo, Lisboa.
TABLE 9.E.15 Percentage of Sucrose for the Three Suppliers

Supplier 1 (n1 = 12):  0.33  0.79  1.24  1.75  0.94  2.42  1.97  0.87  0.33  0.79  1.24  3.12
Supplier 2 (n2 = 10):  1.54  1.11  0.97  2.57  2.94  3.44  3.02  3.55  2.04  1.67
Supplier 3 (n3 = 10):  1.47  1.69  1.55  2.04  2.67  3.07  3.33  4.01  1.52  2.03

Ȳ1 = 1.316, S1 = 0.850;  Ȳ2 = 2.285, S2 = 0.948;  Ȳ3 = 2.338, S3 = 0.886
Example 9.12: Applying the One-Way ANOVA Test
A sample with 32 products is collected to analyze the quality of the honey supplied by three different suppliers. One of the ways to test the quality of the honey is finding out how much sucrose it contains, which usually varies between 0.25% and 6.5%. Table 9.E.15 shows the percentage of sucrose in the sample collected from each supplier. Check if there are differences in this quality indicator among the three suppliers, considering a 5% significance level.
Solution
Step 1: In this case, the most suitable test is the one-way ANOVA. First, we must verify the assumptions of normality for each group and of variance homogeneity between the groups through the Kolmogorov-Smirnov, Shapiro-Wilk, and Levene tests. Figs. 9.38 and 9.39 show the results obtained by using SPSS software.
FIG. 9.38 Results of the tests for normality on SPSS.
FIG. 9.39 Results of Levene’s test on SPSS.
Since the significance level observed in the tests for normality for each group and in the variance homogeneity test between the groups is greater than 5%, we can conclude that each one of the groups shows data with a normal distribution and that the variances between the groups are homogeneous, with a 95% confidence level. Since the assumptions of the one-way ANOVA were met, the technique can be applied.
Step 2: For this example, ANOVA's null hypothesis states that there are no differences in the amount of sucrose coming from the three suppliers. If there is at least one supplier with a population mean that is different from the others, the null hypothesis will be rejected. Thus, we have:

H0: μ1 = μ2 = μ3
H1: ∃(i, j): μi ≠ μj, i ≠ j

Step 3: The significance level to be considered is 5%.
Step 4: The calculation of the Fcal statistic is specified here. For this example, we know that there are k = 3 groups and the global sample size is N = 32. The global sample mean is Ȳ = 1.938. The sum of squares between groups (SSF) is:

SSF = 12·(1.316 − 1.938)² + 10·(2.285 − 1.938)² + 10·(2.338 − 1.938)² = 7.449

Therefore, the mean square between groups (MSF) is:

MSF = SSF / (k − 1) = 7.449 / 2 = 3.725

The calculation of the sum of squares within groups (RSS) is shown in Table 9.E.16.
TABLE 9.E.16 Calculation of the Sum of Squares Within Groups (RSS)

Supplier   Sucrose   Yij − Ȳi   (Yij − Ȳi)²
1          0.33      −0.986     0.972
1          0.79      −0.526     0.277
1          1.24      −0.076     0.006
1          1.75      0.434      0.189
1          0.94      −0.376     0.141
1          2.42      1.104      1.219
1          1.97      0.654      0.428
1          0.87      −0.446     0.199
1          0.33      −0.986     0.972
1          0.79      −0.526     0.277
1          1.24      −0.076     0.006
1          3.12      1.804      3.255
2          1.54      −0.745     0.555
2          1.11      −1.175     1.381
2          0.97      −1.315     1.729
2          2.57      0.285      0.081
2          2.94      0.655      0.429
2          3.44      1.155      1.334
2          3.02      0.735      0.540
2          3.55      1.265      1.600
2          2.04      −0.245     0.060
2          1.67      −0.615     0.378
3          1.47      −0.868     0.753
3          1.69      −0.648     0.420
3          1.55      −0.788     0.621
3          2.04      −0.298     0.089
3          2.67      0.332      0.110
3          3.07      0.732      0.536
3          3.33      0.992      0.984
3          4.01      1.672      2.796
3          1.52      −0.818     0.669
3          2.03      −0.308     0.095
RSS                             23.100
Therefore, the mean square within groups is:

MSR = RSS / (N − k) = 23.100 / 29 = 0.797

Thus, the value of the Fcal statistic is:

Fcal = MSF / MSR = 3.725 / 0.797 = 4.676
Step 5: According to Table A in the Appendix, the critical value of the statistic is Fc = F(2, 29, 5%) = 3.33.
Step 6: Decision: since the value calculated lies in the critical region (Fcal > Fc), we reject the null hypothesis, which allows us to conclude, with a 95% confidence level, that there is at least one supplier with a population mean that is different from the others.
If, instead of comparing the value calculated to the critical value of Snedecor's F-distribution, we use the calculation of the P-value, Steps 5 and 6 will be:
Step 5: According to Table A in the Appendix, for ν1 = 2 degrees of freedom in the numerator and ν2 = 29 degrees of freedom in the denominator, the probability associated with Fcal = 4.676 is between 0.01 and 0.025 (P-value).
Step 6: Decision: since P < 0.05, the null hypothesis is rejected.
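The manual solution can be checked programmatically; a minimal sketch with Python's SciPy, using the data in Table 9.E.15:

```python
from scipy import stats

# Percentage of sucrose per supplier (Table 9.E.15)
s1 = [0.33, 0.79, 1.24, 1.75, 0.94, 2.42, 1.97, 0.87, 0.33, 0.79, 1.24, 3.12]
s2 = [1.54, 1.11, 0.97, 2.57, 2.94, 3.44, 3.02, 3.55, 2.04, 1.67]
s3 = [1.47, 1.69, 1.55, 2.04, 2.67, 3.07, 3.33, 4.01, 1.52, 2.03]

# One-way ANOVA: H0 states that the three population means are equal
f_stat, p_value = stats.f_oneway(s1, s2, s3)
print(round(f_stat, 3), round(p_value, 3))  # Fcal = 4.676, P = 0.017

# Since P < 0.05, H0 is rejected at the 5% significance level
```

The F statistic and P-value agree with the hand calculation and with the SPSS and Stata outputs shown in the next sections.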
9.8.1.1 Solving the One-Way ANOVA Test by Using SPSS Software
The use of the images in this section has been authorized by the International Business Machines Corporation©. The data in Example 9.12 are available in the file One_Way_ANOVA.sav. First of all, let's click on Analyze → Compare Means → One-Way ANOVA …, as shown in Fig. 9.40. Let's include the variable Sucrose in the list of dependent variables (Dependent List) and the variable Supplier in the box Factor, according to Fig. 9.41. After that, we must click on Options … and select the option Homogeneity of variance test (Levene's test for variance homogeneity). Finally, let's click on Continue and on OK to obtain the result of Levene's test, besides the ANOVA table. Since the One-Way ANOVA dialog does not make the normality test available, it must be obtained by applying the same procedure described in Section 9.3.3. According to Fig. 9.42, we can verify that each one of the groups has data that follow a normal distribution. Moreover, through Fig. 9.43, we can conclude that the variances between the groups are homogeneous.
FIG. 9.40 Procedure for the one-way ANOVA.
FIG. 9.41 Selecting the variables.
From the ANOVA table (Fig. 9.44), we can see that the value of the F-test is 4.676 and the respective P-value is 0.017 (we saw in Example 9.12 that this value would be between 0.01 and 0.025), a value less than 0.05. This leads us to reject the null hypothesis and allows us to conclude, with a 95% confidence level, that at least one of the population means is different from the others (there are differences in the percentage of sucrose in the honey of the three suppliers).
9.8.1.2 Solving the One-Way ANOVA Test by Using Stata Software The use of the images in this section has been authorized by StataCorp LP©. The one-way ANOVA on Stata is generated from the following syntax: anova variabley* factor*
FIG. 9.42 Results of the tests for normality for Example 9.12 on SPSS.
FIG. 9.43 Results of Levene’s test for Example 9.12 on SPSS.
FIG. 9.44 Results of the one-way ANOVA for Example 9.12 on SPSS.
FIG. 9.45 Results of the one-way ANOVA on Stata.
in which the term variabley* should be substituted for the quantitative dependent variable and the term factor* for the qualitative explanatory variable. The data in Example 9.12 are available in the file One_Way_Anova.dta. The quantitative dependent variable is called sucrose and the factor is represented by the variable supplier. Thus, we must type the following command: anova sucrose supplier
The result of the test can be seen in Fig. 9.45. We can see that the calculated value of the statistic (4.68) matches the one calculated in Example 9.12 and also generated on SPSS, as does the probability associated with the value of the statistic (0.017). Since P < 0.05, the null hypothesis is rejected, which allows us to conclude, with a 95% confidence level, that at least one of the population means is different from the others.
9.8.2 Factorial ANOVA
Factorial ANOVA is an extension of the one-way ANOVA, with the same assumptions, but considering two or more factors. Factorial ANOVA presumes that the quantitative dependent variable is influenced by more than one qualitative explanatory variable (factor). It also tests the possible interactions between the factors, through the resulting effect of the combination of factor A's level i and factor B's level j, as discussed by Pestana and Gageiro (2008), Fávero et al. (2009), and Maroco (2014). For Pestana and Gageiro (2008) and Fávero et al. (2009), the main objective of the factorial ANOVA is to determine whether the means for each factor level are the same (the isolated effect of each factor on the dependent variable), and to verify the interaction between the factors (the joint effect of the factors on the dependent variable). For educational purposes, the factorial ANOVA will be described for the two-way model.
9.8.2.1 Two-Way ANOVA
According to Fávero et al. (2009) and Maroco (2014), the observations of the two-way ANOVA can be represented, in general, as shown in Table 9.4. In each cell, we can see the values of the dependent variable for the combination of levels of factors A and B being studied, where Yijk represents observation k (k = 1, …, n) of factor A's level i (i = 1, …, a) and of factor B's level j (j = 1, …, b). First, in order to check the isolated effects of factors A and B, we must test the following hypotheses (Fávero et al., 2009; Maroco, 2014):

HA0: μ1 = μ2 = … = μa
HA1: ∃(i, j): μi ≠ μj, i ≠ j (i, j = 1, …, a)   (9.37)

and

HB0: μ1 = μ2 = … = μb
HB1: ∃(i, j): μi ≠ μj, i ≠ j (i, j = 1, …, b)   (9.38)
TABLE 9.4 Observations of the Two-Way ANOVA

                                Factor B
Factor A   1                      2                      …   b
1          Y111, Y112, …, Y11n    Y121, Y122, …, Y12n    …   Y1b1, Y1b2, …, Y1bn
2          Y211, Y212, …, Y21n    Y221, Y222, …, Y22n    …   Y2b1, Y2b2, …, Y2bn
⋮          ⋮                      ⋮                          ⋮
a          Ya11, Ya12, …, Ya1n    Ya21, Ya22, …, Ya2n    …   Yab1, Yab2, …, Yabn

Source: Fávero, L.P., Belfiore, P., Silva, F.L., Chan, B.L., 2009. Análise de dados: modelagem multivariada para tomada de decisões. Campus Elsevier, Rio de Janeiro; Maroco, J., 2014. Análise estatística com o SPSS Statistics, sixth ed. Edições Sílabo, Lisboa.
Now, in order to verify the joint effect of the factors on the dependent variable, we must test the following hypotheses (Fávero et al., 2009; Maroco, 2014):

H0: γij = 0 (there is no interaction between the factors A and B)
H1: γij ≠ 0 (there is interaction between the factors A and B)   (9.39)

The model presented by Pestana and Gageiro (2008) can be described as:

Yijk = μ + αi + βj + γij + εijk   (9.40)

where:
μ is the population's global mean;
αi is the effect of factor A's level i, given by μi − μ;
βj is the effect of factor B's level j, given by μj − μ;
γij is the interaction between the factors;
εijk is the random error, which follows a normal distribution with a mean equal to zero and a constant variance.

To standardize the effects of the levels chosen for both factors, we must assume that:

Σ_{i=1}^{a} αi = Σ_{j=1}^{b} βj = Σ_{i=1}^{a} γij = Σ_{j=1}^{b} γij = 0   (9.41)
Let's consider Ȳ, Ȳij, Ȳi, and Ȳj the general mean of the global sample, the mean per cell, the mean of factor A's level i, and the mean of factor B's level j, respectively. We can describe the residual sum of squares (RSS) as:

RSS = Σ_{i=1}^{a} Σ_{j=1}^{b} Σ_{k=1}^{n} (Yijk − Ȳij)²   (9.42)

On the other hand, the sum of squares of factor A (SSFA), the sum of squares of factor B (SSFB), and the sum of squares of the interaction (SSFAB) are represented in Expressions (9.43)–(9.45), respectively:

SSFA = b · n · Σ_{i=1}^{a} (Ȳi − Ȳ)²   (9.43)

SSFB = a · n · Σ_{j=1}^{b} (Ȳj − Ȳ)²   (9.44)

SSFAB = n · Σ_{i=1}^{a} Σ_{j=1}^{b} (Ȳij − Ȳi − Ȳj + Ȳ)²   (9.45)

Therefore, the total sum of squares can be written as follows:

TSS = RSS + SSFA + SSFB + SSFAB = Σ_{i=1}^{a} Σ_{j=1}^{b} Σ_{k=1}^{n} (Yijk − Ȳ)²   (9.46)
Thus, the ANOVA statistic for factor A is given by:

FA = [SSFA / (a − 1)] / [RSS / ((n − 1) · a · b)] = MSFA / MSR   (9.47)

where:
MSFA is the mean square of factor A;
MSR is the mean square of the errors.
TABLE 9.5 Calculations of the Two-Way ANOVA

Source of Variation   Sum of Squares                                            Degrees of Freedom   Mean Squares                        F
Factor A              SSFA = b·n·Σ_{i=1}^{a} (Ȳi − Ȳ)²                          a − 1                MSFA = SSFA/(a − 1)                 FA = MSFA/MSR
Factor B              SSFB = a·n·Σ_{j=1}^{b} (Ȳj − Ȳ)²                          b − 1                MSFB = SSFB/(b − 1)                 FB = MSFB/MSR
Interaction           SSFAB = n·Σ_{i=1}^{a} Σ_{j=1}^{b} (Ȳij − Ȳi − Ȳj + Ȳ)²    (a − 1)·(b − 1)      MSFAB = SSFAB/[(a − 1)·(b − 1)]     FAB = MSFAB/MSR
Error                 RSS = Σ_{i=1}^{a} Σ_{j=1}^{b} Σ_{k=1}^{n} (Yijk − Ȳij)²   (n − 1)·a·b          MSR = RSS/[(n − 1)·a·b]
Total                 TSS = Σ_{i=1}^{a} Σ_{j=1}^{b} Σ_{k=1}^{n} (Yijk − Ȳ)²     N − 1

Source: Fávero, L.P., Belfiore, P., Silva, F.L., Chan, B.L., 2009. Análise de dados: modelagem multivariada para tomada de decisões. Campus Elsevier, Rio de Janeiro; Maroco, J., 2014. Análise estatística com o SPSS Statistics, sixth ed. Edições Sílabo, Lisboa.
On the other hand, the ANOVA statistic for factor B is given by:

FB = [SSFB / (b − 1)] / [RSS / ((n − 1) · a · b)] = MSFB / MSR   (9.48)

where:
MSFB is the mean square of factor B.

And the ANOVA statistic for the interaction is represented by:

FAB = [SSFAB / ((a − 1) · (b − 1))] / [RSS / ((n − 1) · a · b)] = MSFAB / MSR   (9.49)

where:
MSFAB is the mean square of the interaction.

The calculations of the two-way ANOVA are summarized in Table 9.5. The calculated values of the statistics (FAcal, FBcal, and FABcal) must be compared to the critical values obtained from the F-distribution table (Table A in the Appendix): FAc = F(a−1, (n−1)ab, α), FBc = F(b−1, (n−1)ab, α), and FABc = F((a−1)(b−1), (n−1)ab, α). For each statistic, if the calculated value lies in the critical region (FAcal > FAc, FBcal > FBc, FABcal > FABc), we must reject the corresponding null hypothesis. Otherwise, we do not reject H0.

Example 9.13: Using the Two-Way ANOVA
A sample of 24 passengers who traveled from São Paulo to Campinas in a certain week is collected. The following variables are analyzed: (1) travel time in minutes, (2) the bus company chosen, and (3) the day of the week. The main objective is to verify if there is a relationship between the travel time and the bus company, between the travel time and the day of the week, and between the bus company and the day of the week. The levels considered in the variable bus company are Company A (1), Company B (2), and Company C (3). On the other hand, the levels regarding the day of the week are Monday (1), Tuesday (2), Wednesday (3), Thursday (4), Friday (5), Saturday (6), and Sunday (7). The results of the sample are shown in Table 9.E.17 and are available in the file Two_Way_ANOVA.sav as well. Test these hypotheses, considering a 5% significance level.
TABLE 9.E.17 Data From Example 9.13 (Using the Two-Way ANOVA)

Time (Min)   Company   Day of the Week
90           2         4
100          1         5
72           1         6
76           3         1
85           2         2
95           1         5
79           3         1
100          2         4
70           1         7
80           3         1
85           2         3
90           1         5
77           2         7
80           1         2
85           3         4
74           2         7
72           3         6
92           1         5
84           2         4
80           1         3
79           2         1
70           3         6
88           3         5
84           2         4

9.8.2.1.1 Solving the Two-Way ANOVA Test by Using SPSS Software
The use of the images in this section has been authorized by the International Business Machines Corporation©.
Step 1: In this case, the most suitable test is the two-way ANOVA. First, we must verify if there is normality in the variable Time (metric) in the model (as shown in Fig. 9.46). According to this figure, we can conclude that the variable Time follows a normal distribution, with a 95% confidence level. The hypothesis of variance homogeneity will be verified in Step 4.
Step 2: The null hypothesis H0 of the two-way ANOVA for this example assumes that the population means of each level of the factor Company and of each level of the factor Day_of_the_week are equal, that is, HA0: μ1 = μ2 = μ3 and HB0: μ1 = μ2 = … = μ7. The null hypothesis H0 also states that there is no interaction between the factor Company and the factor Day_of_the_week, that is, H0: γij = 0.
Step 3: The significance level to be considered is 5%.
FIG. 9.46 Results of the normality tests on SPSS.
FIG. 9.47 Procedure for elaborating the two-way ANOVA on SPSS.
Step 4: The F statistics in ANOVA for the factor Company, for the factor Day_of_the_week, and for the interaction Company * Day_of_the_week will be obtained through the SPSS software, according to the procedure specified below. In order to do that, let's click on Analyze → General Linear Model → Univariate …, as shown in Fig. 9.47. After that, let's include the variable Time in the box of dependent variables (Dependent Variable) and the variables Company and Day_of_the_week in the box Fixed Factor(s), as shown in Fig. 9.48. This example is based on the two-way ANOVA, in which the factors are fixed. If one of the factors were chosen randomly, it would be inserted into the box Random Factor(s), resulting in a mixed model. The button Model … defines the variance analysis model to be tested. Through the button Contrasts …, we can assess if the category of one of the factors is significantly different from the other categories of the same factor. Charts can be constructed through the button Plots …, thus allowing the visualization of the existence or nonexistence of interactions between the factors. The button Post Hoc …, on the other hand, allows us to compare multiple means. Finally, from the button Options …, we can obtain descriptive statistics and the result of Levene's variance homogeneity test, as well as select the appropriate significance level (Fávero et al., 2009; Maroco, 2014).
FIG. 9.48 Selection of the variables to elaborate the two-way ANOVA.
Therefore, since we want to test variance homogeneity, we must select, in Options …, the option Homogeneity tests, as shown in Fig. 9.49. Finally, let's click on Continue and on OK to obtain Levene's variance homogeneity test and the two-way ANOVA table. In Fig. 9.50, we can see that the variances between groups are homogeneous (P = 0.451 > 0.05). Based on Fig. 9.51, we can conclude that there are no significant differences between the travel times of the companies analyzed, that is, the factor Company does not have a significant impact on the variable Time (P = 0.330 > 0.05). On the other hand, we conclude that there are significant differences between the days of the week, that is, the factor Day_of_the_week has a significant effect on the variable Time (P = 0.003 < 0.05). We finally conclude that there is no significant interaction, with a 95% confidence level, between the two factors Company and Day_of_the_week, since P = 0.898 > 0.05.

9.8.2.1.2 Solving the Two-Way ANOVA Test by Using Stata Software
The use of the images in this section has been authorized by StataCorp LP©. The command anova on Stata specifies the dependent variable being analyzed, as well as the respective factors. Interactions are specified using the character # between the factors. Thus, the two-way ANOVA is generated through the following syntax:

anova variabley* factorA* factorB* factorA*#factorB*

or simply:

anova variabley* factorA*##factorB*

in which the term variabley* should be substituted for the quantitative dependent variable and the terms factorA* and factorB* for the respective factors. If we type the syntax anova variabley* factorA* factorB*, only the ANOVA for each factor will be elaborated, without the interaction between the factors.
Hypotheses Tests Chapter
9
245
FIG. 9.49 Test of variance homogeneity.
FIG. 9.50 Results of Levene’s test on SPSS.
The data presented in Example 9.13 are available in the file Two_Way_ANOVA.dta. The quantitative dependent variable is called time and the factors correspond to the variables company and day_of_the_week. Thus, we must type the following command: anova time company##day_of_the_week
The results can be seen in Fig. 9.52 and are similar to those presented on SPSS, which allows us to conclude, with a 95% confidence level, that only the factor day_of_the_week has a significant effect on the variable time (P = 0.003 < 0.05), and that there is no significant interaction between the two factors analyzed (P = 0.898 > 0.05).
FIG. 9.51 Results of the two-way ANOVA for Example 9.13 on SPSS.
FIG. 9.52 Results of the two-way ANOVA for Example 9.13 on Stata.
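For a balanced design, the decomposition TSS = RSS + SSFA + SSFB + SSFAB in Expressions (9.42)–(9.46) can be verified numerically; a minimal sketch in plain Python with an illustrative 2 × 2 layout and n = 3 replicates per cell (the values are hypothetical, not the data from Example 9.13, which is unbalanced):

```python
def mean(xs):
    return sum(xs) / len(xs)

# Balanced two-way layout: data[i][j] holds the n replicates for
# factor A's level i combined with factor B's level j
data = [[[10.0, 11.0, 12.0], [14.0, 15.0, 16.0]],
        [[20.0, 21.0, 22.0], [26.0, 27.0, 28.0]]]

a, b, n = len(data), len(data[0]), len(data[0][0])

grand = mean([y for row in data for cell in row for y in cell])
cell_means = [[mean(cell) for cell in row] for row in data]
a_means = [mean([y for cell in row for y in cell]) for row in data]      # Ȳi
b_means = [mean([y for row in data for y in row[j]]) for j in range(b)]  # Ȳj

# Expressions (9.43)-(9.45) and (9.42)
ssf_a = b * n * sum((m - grand) ** 2 for m in a_means)
ssf_b = a * n * sum((m - grand) ** 2 for m in b_means)
ssf_ab = n * sum((cell_means[i][j] - a_means[i] - b_means[j] + grand) ** 2
                 for i in range(a) for j in range(b))
rss = sum((y - cell_means[i][j]) ** 2
          for i in range(a) for j in range(b) for y in data[i][j])
tss = sum((y - grand) ** 2 for row in data for cell in row for y in cell)

# Expression (9.46): TSS = RSS + SSFA + SSFB + SSFAB
print(round(tss, 6) == round(rss + ssf_a + ssf_b + ssf_ab, 6))  # True
```

Dividing each sum of squares by its degrees of freedom from Table 9.5 and forming the ratios with MSR then yields the FA, FB, and FAB statistics.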
9.8.2.2 ANOVA With More Than Two Factors
The two-way ANOVA can be generalized to three or more factors. According to Maroco (2014), the model becomes very complex, since the effect of multiple interactions can make the effect of the factors a bit confusing. The generic model with three factors presented by the author is:

Yijkl = μ + αi + βj + γk + αβij + αγik + βγjk + αβγijk + εijkl   (9.50)

9.9 FINAL REMARKS
This chapter presented the concepts and objectives of parametric hypotheses tests and the general procedures for constructing each one of them. We studied the main types of tests and the situations in which each one of them must be used. Moreover, the advantages and disadvantages of each test were established, as well as their assumptions. We studied the tests for normality (Kolmogorov-Smirnov, Shapiro-Wilk, and Shapiro-Francia), variance homogeneity tests (Bartlett's χ², Cochran's C, Hartley's Fmax, and Levene's F), Student's t-test for one population mean, for two independent means, and for two paired means, as well as ANOVA and its extensions.
Regardless of the application's main goal, parametric tests can provide good and interesting research results that are useful in the decision-making process. Whatever modeling software is chosen, the correct use of each test must always be grounded in the underlying theory, without ever ignoring the researcher's experience and intuition.
9.10 EXERCISES

(1) In what situations should parametric tests be applied, and what are the assumptions of these tests?
(2) What are the advantages and disadvantages of parametric tests?
(3) What are the main parametric tests to verify the normality of the data? In what situations must we use each one of them?
(4) What are the main parametric tests to verify the variance homogeneity between groups? In what situations must we use each one of them?
(5) To test a single population mean, we can use the z-test or Student's t-test. In what cases must each one of them be applied?
(6) What are the main mean comparison tests? What are the assumptions of each test?
(7) The monthly aircraft sales data throughout last year can be seen in the table below. Check whether the data are normally distributed. Consider α = 5%.

Jan.  Feb.  Mar.  Apr.  May  Jun.  Jul.  Aug.  Sept.  Oct.  Nov.  Dec.
48    52    50    49    47   50    51    54    39     56    52    55
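As a starting point for Exercise 7, the Shapiro-Wilk test (one of the normality tests covered in this chapter, and well suited to small samples) can be sketched in Python. This assumes scipy is available; the book itself carries out these tests in SPSS and Stata.

```python
# Exercise 7 starting point: Shapiro-Wilk normality test on the monthly
# aircraft sales. Assumes scipy is installed; the book uses SPSS/Stata.
from scipy import stats

sales = [48, 52, 50, 49, 47, 50, 51, 54, 39, 56, 52, 55]

stat, p_value = stats.shapiro(sales)
print(f"W = {stat:.3f}, p-value = {p_value:.3f}")
# If p-value > 0.05, we do not reject the null hypothesis of normality.
```

The decision rule mirrors the chapter's general procedure: compare the reported p-value with the chosen significance level α = 0.05.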
(8) Test the normality of the temperature data listed (α = 5%):

12.5  14.2  13.4  14.6  12.7  10.9  16.5  14.7
11.2  10.9  12.1  12.8  13.8  13.5  13.2  14.1
15.5  16.2  10.8  14.3  12.8  12.4  11.4  16.2
14.3  14.8  14.6  13.7  13.5  10.8  10.4  11.5
11.9  11.3  14.2  11.2  13.4  16.1  13.5  17.5
16.2  15.0  14.2  13.2  12.4  13.4  12.7  11.2
(9) The table shows the final grades of two students in nine subjects. Check whether there is variance homogeneity between the students (α = 5%).

Student 1  6.4  5.8  6.9  5.4  7.3  8.2  6.1  5.5  6.0
Student 2  6.5  7.0  7.5  6.5  8.1  9.0  7.5  6.5  6.8
(10) A fat-free yogurt manufacturer states that the number of calories in each cup is 60. In order to check whether this information is true, a random sample of 36 cups is collected, and we observe that the average number of calories is 65, with a standard deviation of 3.5. Apply the appropriate test and check whether the manufacturer's statement is true, considering a significance level of 5%.
(11) We would like to compare the average waiting time before being seen by a doctor (in minutes) in two hospitals. In order to do that, we collected a sample of 20 patients from each hospital. The data are available in the tables. Check whether there are differences between the average waiting times in both hospitals. Consider α = 1%.
Hospital 1
72  58  91  88  70  76  98  101  65  73
79  82  80  91  93  88  97  83   71  74

Hospital 2
66  40  55  70  76  61  53  50  47  61
52  48  60  72  57  70  66  55  46  51
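A sketch of how Exercise 11 could be approached in Python, assuming scipy is available (the book itself uses SPSS and Stata). Welch's version of the two-sample t-test is used here so that equal variances need not be assumed; this is a choice of this sketch, not the book's prescription.

```python
# Exercise 11 sketch: independent two-sample t-test on the waiting times.
# Assumes scipy; equal_var=False selects Welch's t-test (a choice made
# here to avoid assuming variance homogeneity).
from scipy import stats

hospital_1 = [72, 58, 91, 88, 70, 76, 98, 101, 65, 73,
              79, 82, 80, 91, 93, 88, 97, 83, 71, 74]
hospital_2 = [66, 40, 55, 70, 76, 61, 53, 50, 47, 61,
              52, 48, 60, 72, 57, 70, 66, 55, 46, 51]

t_stat, p_value = stats.ttest_ind(hospital_1, hospital_2, equal_var=False)
print(f"t = {t_stat:.3f}, p-value = {p_value:.6f}")
# A p-value below 0.01 leads us to reject equal means at the 1% level.
```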
(12) Thirty teenagers whose total cholesterol level is higher than what is advisable underwent treatment that consisted of a diet and physical activities. The tables show the levels of LDL cholesterol (mg/dL) before and after the treatment. Check if the treatment was effective (α = 5%).

Before the treatment
220  212  227  234  204  209  211  245  237  250
208  224  220  218  208  205  227  207  222  213
210  234  240  227  229  224  204  210  215  228

After the treatment
195  180  200  204  180  195  200  210  205  211
175  198  195  200  190  200  222  198  201  194
190  204  230  222  209  198  195  190  201  210
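Exercise 12 pairs each teenager's measurement before and after the treatment, which suggests a paired t-test. The sketch below assumes scipy (version 1.6+ for the `alternative` argument); the one-sided alternative is this sketch's reading of "effective" as "LDL decreased".

```python
# Exercise 12 sketch: paired t-test on LDL before vs. after treatment.
# Assumes scipy >= 1.6; alternative="greater" tests whether the mean
# of (before - after) is positive, i.e., whether LDL dropped.
from scipy import stats

before = [220, 212, 227, 234, 204, 209, 211, 245, 237, 250,
          208, 224, 220, 218, 208, 205, 227, 207, 222, 213,
          210, 234, 240, 227, 229, 224, 204, 210, 215, 228]
after = [195, 180, 200, 204, 180, 195, 200, 210, 205, 211,
         175, 198, 195, 200, 190, 200, 222, 198, 201, 194,
         190, 204, 230, 222, 209, 198, 195, 190, 201, 210]

t_stat, p_value = stats.ttest_rel(before, after, alternative="greater")
print(f"t = {t_stat:.3f}, p-value = {p_value:.6f}")
# A p-value below 0.05 supports the claim that the treatment worked.
```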
(13) An aerospace company produces civilian and military helicopters at its three factories. The tables show the monthly production of helicopters in the last 12 months at each factory. Check if there is a difference between the population means. Consider α = 5%.

Factory 1  24  26  28  22  31  25  27  28  30  21  20  24
Factory 2  28  26  24  30  24  27  25  29  30  27  26  25
Factory 3  29  25  24  26  20  22  22  27  20  26  24  25
Chapter 10

Nonparametric Tests

Mathematics has wonderful strength that is capable of making us understand many mysteries of our faith.
Saint Jerome
10.1 INTRODUCTION

As studied in the previous chapter, hypotheses tests are divided into parametric and nonparametric. Applied to quantitative data, parametric tests formulate hypotheses about population parameters, such as the population mean (μ), population standard deviation (σ), population variance (σ²), population proportion (p), etc. Parametric tests require strong assumptions regarding the data distribution. For example, in many cases, we must assume that the samples are collected from populations whose data follow a normal distribution. Likewise, for comparison tests of two paired population means or of k population means (k ≥ 3), the population variances must be homogeneous. Conversely, nonparametric tests can formulate hypotheses about the qualitative characteristics of the population, so they can be applied to qualitative data, in nominal or ordinal scales. Since the assumptions regarding the data distribution are fewer and weaker than those of parametric tests, they are also known as distribution-free tests.

Nonparametric tests are an alternative to parametric ones when the assumptions of the latter are violated. Given that they require a smaller number of assumptions, they are simpler and easier to apply, but less powerful when compared to parametric tests. In short, the main advantages of nonparametric tests are:

(a) They can be applied in a wide variety of situations, because they do not require strict premises concerning the population, as parametric methods do. Notably, nonparametric methods do not require that the populations have a normal distribution.
(b) Unlike parametric methods, nonparametric methods can be applied to qualitative data, in nominal and ordinal scales.
(c) They are easy to apply because they require simpler calculations when compared to parametric methods.
The main disadvantages are:

(a) With regard to quantitative data, since they must be transformed into qualitative data for the application of nonparametric tests, a great deal of information is lost.
(b) Since nonparametric tests are less efficient than parametric tests, we need stronger evidence (a larger sample or one with greater differences) to reject the null hypothesis.

Thus, since parametric tests are more powerful than nonparametric ones, that is, they have a higher probability of rejecting the null hypothesis when it is really false, they must be chosen as long as all their assumptions are confirmed. On the other hand, nonparametric tests are an alternative to parametric ones when those assumptions are violated or in cases in which the variables are qualitative.

Nonparametric tests are classified according to the variables' level of measurement and to sample size. For a single sample, we will study the binomial, chi-square (χ²), and sign tests. The binomial test is applied to binary variables. The χ² test can be applied to nominal variables as well as to ordinal variables, while the sign test is applied only to ordinal variables. In the case of two paired samples, the main tests are the McNemar test, the sign test, and the Wilcoxon test. The McNemar test is applied to qualitative variables that assume only two categories (binary), while the sign test and the Wilcoxon test are applied to ordinal variables.

Data Science for Business and Decision Making. https://doi.org/10.1016/B978-0-12-811216-8.00010-0
© 2019 Elsevier Inc. All rights reserved.
TABLE 10.1 Classification of Nonparametric Statistical Tests

Dimension                 Level of Measurement    Nonparametric Test
One sample                Binary                  Binomial test
                          Nominal or ordinal      χ² test
                          Ordinal                 Sign test
Two paired samples        Binary                  McNemar test
                          Ordinal                 Sign test, Wilcoxon test
Two independent samples   Nominal or ordinal      χ² test
                          Ordinal                 Mann-Whitney U test
K paired samples          Binary                  Cochran's Q test
                          Ordinal                 Friedman's test
K independent samples     Nominal or ordinal      χ² test
                          Ordinal                 Kruskal-Wallis test

Source: Fávero, L.P., Belfiore, P., Silva, F.L., Chan, B.L., 2009. Análise de dados: modelagem multivariada para tomada de decisões. Campus Elsevier, Rio de Janeiro.
Considering two independent samples, we can highlight the χ² test and the Mann-Whitney U test. The χ² test can be applied to nominal or ordinal variables, while the Mann-Whitney U test only considers ordinal variables. For k paired samples (k ≥ 3), we have Cochran's Q test, which considers binary variables, and Friedman's test, which considers ordinal variables. Finally, in the case of more than two independent samples, we will study the χ² test for nominal or ordinal variables and the Kruskal-Wallis test for ordinal variables. Table 10.1 shows this classification. Nonparametric tests in which the variables' level of measurement is ordinal can also be applied to quantitative variables, but in these cases they should be used only when the assumptions of the parametric tests are violated.
10.2 TESTS FOR ONE SAMPLE

In this case, a random sample is taken from the population and we test the hypothesis that the sample data have a certain characteristic or distribution. Among the nonparametric statistical tests for a single sample, we can highlight the binomial test, the χ² test, and the sign test. The binomial test is applied to binary data, the χ² test to nominal or ordinal data, and the sign test to ordinal data.
10.2.1 Binomial Test

The binomial test is applied to an independent sample in which the variable that the researcher is interested in (X) is binary (dummy) or dichotomous, that is, it only has two possibilities: success or failure. By convention, we call result X = 1 a success and result X = 0 a failure. The probability of success in choosing a certain observation is represented by p and the probability of failure by q, that is:

P[X = 1] = p and P[X = 0] = q = 1 − p

For a bilateral test, we must consider the following hypotheses:

H0: p = p0
H1: p ≠ p0

According to Siegel and Castellan (2006), the number of successes (Y), that is, the number of [X = 1] results in a sequence of N observations, is:
$$Y = \sum_{i=1}^{N} X_i$$
For the authors, in a sample of size N, the probability of obtaining k objects in one category and N − k objects in the other category is given by:

$$P[Y = k] = \binom{N}{k} p^k q^{N-k}, \quad k = 0, 1, \ldots, N \qquad (10.1)$$

where:
p: probability of success;
q: probability of failure;
and:

$$\binom{N}{k} = \frac{N!}{k!\,(N-k)!}$$

Table F1 in the Appendix provides the probability P[Y = k] for several values of N, k, and p. However, when we test hypotheses, we must use the probability of obtaining values that are greater than or equal to the value observed:

$$P(Y \geq k) = \sum_{i=k}^{N} \binom{N}{i} p^i q^{N-i} \qquad (10.2)$$

or the probability of obtaining values that are less than or equal to the value observed:

$$P(Y \leq k) = \sum_{i=0}^{k} \binom{N}{i} p^i q^{N-i} \qquad (10.3)$$
According to Siegel and Castellan (2006), when p = q = ½, instead of calculating the probabilities based on the expressions presented, it is more convenient to use Table F2 in the Appendix. This table provides the unilateral probabilities, under the null hypothesis H0: p = 1/2, of obtaining values that are as extreme as or more extreme than k, where k is the lowest of the frequencies observed (P(Y ≤ k)). Due to the symmetry of the binomial distribution, when p = ½, we have P(Y ≤ k) = P(Y ≥ N − k). A unilateral test is used when we predict, in advance, which of the two categories must contain the smallest number of cases. For a bilateral test (when the estimate simply refers to the fact that both frequencies will differ), we just need to double the values from Table F2 in the Appendix. This final value is called the P-value, which, according to what was discussed in Chapter 9, corresponds to the probability (unilateral or bilateral) associated with the value observed in the sample. The P-value indicates the lowest significance level observed that would lead to the rejection of the null hypothesis. Thus, we reject H0 if P ≤ α.

In the case of large samples (N > 25), the sampling distribution of variable Y approaches a normal distribution, so the probability can be calculated using the following statistic:

$$Z_{cal} = \frac{|N\hat{p} - Np| - 0.5}{\sqrt{Npq}} \qquad (10.4)$$
where p̂ refers to the sample estimate of the proportion of successes, so that we can test H0. The value of Z_cal calculated by using Expression (10.4) must be compared to the critical value of the standard normal distribution (see Table E in the Appendix). This table provides the critical values z_c where P(Z_cal > z_c) = α (for a right-tailed unilateral test). For a bilateral test, we have P(Z_cal < −z_c) = α/2 = P(Z_cal > z_c). Therefore, for a right-tailed unilateral test, the null hypothesis is rejected if Z_cal > z_c. For a bilateral test, we reject H0 if Z_cal < −z_c or Z_cal > z_c.

Example 10.1: Applying the Binomial Test to Small Samples

A group of 18 students took an intensive English course and were exposed to two different learning methods. At the end of the course, each student chose his/her favorite teaching method, as shown in Table 10.E.1. We believe there are no differences between the two teaching methods. Test the null hypothesis with a significance level of 5%.
TABLE 10.E.1 Frequencies Obtained After Students Made Their Choice

Events       Method 1   Method 2   Total
Frequency    11         7          18
Proportion   0.611      0.389      1.0
Solution

Before we start the general procedure to construct the hypotheses tests, we will define a few parameters in order to facilitate the understanding. Coding the chosen method as X = 1 (method 1) and X = 0 (method 2), the probability of choosing method 1 is represented by P[X = 1] = p and that of method 2 by P[X = 0] = q. The number of successes (Y = k) corresponds to the total number of X = 1 results, so k = 11.

Step 1: The most suitable test in this case is the binomial test, because the data are categorized into two classes.
Step 2: The null hypothesis states that there are no differences in the probabilities of choosing between both methods:

H0: p = q = ½
H1: p ≠ q

Step 3: The significance level to be considered is 5%.
Step 4: We have N = 18, k = 11, p = ½, and q = ½. Due to the symmetry of the binomial distribution when p = ½, P(Y ≥ 11) = P(Y ≤ 7). So, let's calculate P(Y ≤ 7) by using Expression (10.3) and show how this probability can be obtained directly from Table F2 in the Appendix. The probability of a maximum of seven students choosing method 2 is given by:

P(Y ≤ 7) = P(Y = 0) + P(Y = 1) + ⋯ + P(Y = 7)

$$P(Y = 0) = \frac{18!}{0!\,18!}\left(\frac{1}{2}\right)^{0}\left(\frac{1}{2}\right)^{18} = 3.815 \times 10^{-6}$$

$$P(Y = 1) = \frac{18!}{1!\,17!}\left(\frac{1}{2}\right)^{1}\left(\frac{1}{2}\right)^{17} = 6.866 \times 10^{-5}$$

$$P(Y = 7) = \frac{18!}{7!\,11!}\left(\frac{1}{2}\right)^{7}\left(\frac{1}{2}\right)^{11} = 0.121$$

Therefore:

P(Y ≤ 7) = 3.815 × 10⁻⁶ + ⋯ + 0.121 = 0.240

Since p = ½, the probability P(Y ≤ 7) could be obtained directly from Table F2 in the Appendix. For N = 18 and k = 7 (the lowest frequency observed), the associated unilateral probability is P₁ = 0.240. Since it is a bilateral test, this value must be doubled (P = 2P₁), so the associated bilateral probability is P = 0.480.

Note: In the general procedure of hypotheses tests, Step 4 corresponds to the calculation of the statistic based on the sample, while Step 5 determines the probability associated with the value of the statistic obtained in Step 4. In the case of the binomial test, Step 4 calculates the probability associated with the occurrence in the sample directly.

Step 5: Decision: since the associated probability is greater than α (P = 0.480 > 0.05), we do not reject H0, which allows us to conclude, with a 95% confidence level, that there are no differences in the probabilities of choosing method 1 or 2.
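The Example 10.1 computation can be sketched in Python, assuming scipy is available (the book itself uses SPSS and Stata). The exact lower-tail sum reproduces Expression (10.3), and scipy's binomtest yields the bilateral P-value directly.

```python
# Example 10.1 sketch: exact binomial probabilities (scipy assumed).
from math import comb

from scipy import stats

N, k = 18, 7  # k = lowest observed frequency (method 2)
p = 0.5

# Exact lower-tail probability, as in Expression (10.3):
p_lower = sum(comb(N, i) * p**i * (1 - p) ** (N - i) for i in range(k + 1))

# Bilateral P-value, as in the doubled Table F2 lookup:
p_bilateral = 2 * p_lower

# scipy computes the same bilateral probability in one call:
result = stats.binomtest(11, n=N, p=0.5)  # 11 successes for method 1
print(round(p_lower, 3), round(p_bilateral, 3), round(result.pvalue, 3))
# → 0.24 0.481 0.481
```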
Example 10.2: Applying the Binomial Test to Large Samples

Redo the previous example considering the following results:

TABLE 10.E.2 Frequencies Obtained After Students Made Their Choice

Events       Method 1   Method 2   Total
Frequency    18         12         30
Proportion   0.6        0.4        1.0
FIG. 10.1 Critical region of Example 10.2.
Solution

Step 1: Let's apply the binomial test.
Step 2: The null hypothesis states that there are no differences between the probabilities of choosing both methods, that is:

H0: p = q = ½
H1: p ≠ q

Step 3: The significance level to be considered is 5%.
Step 4: Since N > 25, we can consider that the sampling distribution of variable Y is close to a normal distribution, so the probability can be calculated from the Z statistic:

$$Z_{cal} = \frac{|N\hat{p} - Np| - 0.5}{\sqrt{Npq}} = \frac{|30 \times 0.6 - 30 \times 0.5| - 0.5}{\sqrt{30 \times 0.5 \times 0.5}} = 0.913$$

Step 5: The critical region of a standard normal distribution (Table E in the Appendix), for a bilateral test in which α = 5%, is shown in Fig. 10.1. For a bilateral test, each one of the tails corresponds to half of the significance level α.
Step 6: Decision: since the value calculated is not in the critical region, that is, −1.96 ≤ Z_cal ≤ 1.96, the null hypothesis is not rejected, which allows us to conclude, with a 95% confidence level, that there are no differences in the probabilities of choosing between the methods (p = q = ½).

If we used the P-value instead of the critical value of the statistic, Steps 5 and 6 would be:

Step 5: According to Table E in the Appendix, the unilateral probability associated with the statistic Z_cal = 0.913 is P₁ = 0.181. For a bilateral test, this probability must be doubled (P-value = 0.361).
Step 6: Decision: since P > 0.05, we do not reject H0.
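The large-sample approximation of Expression (10.4) is easy to verify numerically. The sketch below assumes scipy for the normal tail probability; the input values are those of Example 10.2.

```python
# Example 10.2 sketch: normal approximation to the binomial test,
# Expression (10.4), with continuity correction (scipy assumed).
from math import sqrt

from scipy import stats

N, p = 30, 0.5
p_hat = 0.6  # 18 of the 30 students chose method 1

z = (abs(N * p_hat - N * p) - 0.5) / sqrt(N * p * (1 - p))
p_value = 2 * stats.norm.sf(z)  # bilateral P-value (two tails)

print(f"Z = {z:.3f}, P = {p_value:.4f}")
```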
10.2.1.1 Solving the Binomial Test Using SPSS Software

Example 10.1 will be solved using IBM SPSS Statistics Software®. The use of the images in this section has been authorized by the International Business Machines Corporation©. The data are available in the file Binomial_Test.sav. The procedure for solving the binomial test using SPSS is described below. Let's select Analyze → Nonparametric Tests → Legacy Dialogs → Binomial … (Fig. 10.2). First, let's insert the variable Method into the Test Variable List. In Test Proportion, we must define p = 0.50, since the probability of success and failure is the same (Fig. 10.3). Finally, let's click on OK. The results can be seen in Fig. 10.4. The associated probability for a bilateral test is P = 0.481, similar to the value calculated in Example 10.1. Since P > α (0.481 > 0.05), we do not reject H0, which allows us to conclude, with a 95% confidence level, that p = q = ½.
10.2.1.2 Solving the Binomial Test Using Stata Software Example 10.1 will also be solved using Stata Statistical Software®. The use of the images presented in this section has been authorized by Stata Corp LP©. The data are available in the file Binomial_Test.dta. The syntax of the binomial test on Stata is: bitest variable* = #p
where the term variable* must be replaced by the variable considered in the analysis and #p by the probability of success specified in the null hypothesis.
FIG. 10.2 Procedure for applying the binomial test on SPSS.
FIG. 10.3 Selecting the variable and the proportion for the binomial test.
In Example 10.1, our studied variable is method and, through the null hypothesis, there are no differences in the choice between both methods, so, the command to be typed is: bitest method = 0.5
The result of the binomial test is shown in Fig. 10.5. We can see that the associated probability for a bilateral test is P = 0.481, similar to the value calculated in Example 10.1 and also obtained via the SPSS software. Since P > 0.05, we do not reject H0, which allows us to conclude, with a 95% confidence level, that p = q = ½.
FIG. 10.4 Results of the binomial test.
FIG. 10.5 Results of the binomial test for Example 10.1 on Stata.
10.2.2 Chi-Square Test (χ²) for One Sample

The χ² test presented in this section is an extension of the binomial test and is applied to a single sample in which the variable being studied assumes two or more categories. The variables can be nominal or ordinal. The test compares the frequencies observed to the frequencies expected in each category. The χ² test assumes the following hypotheses:

H0: there is no significant difference between the frequencies observed and the ones expected
H1: there is a significant difference between the frequencies observed and the ones expected

The statistic for the test, analogous to Expression (4.1) in Chapter 4, is given by:

$$\chi^2_{cal} = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} \qquad (10.5)$$

where:
O_i: the number of observations in the ith category;
E_i: the expected frequency of observations in the ith category when H0 is not rejected;
k: the number of categories.

The values of χ²_cal approximately follow a χ² distribution with ν = k − 1 degrees of freedom. The critical values of the chi-square statistic (χ²_c) can be found in Table D in the Appendix, which provides the values χ²_c such that P(χ²_cal > χ²_c) = α (for a right-tailed unilateral test). In order for the null hypothesis H0 to be rejected, the value of the χ²_cal statistic must be in the critical region (CR), that is, χ²_cal > χ²_c. Otherwise, we do not reject H0 (Fig. 10.6). The P-value (the probability associated with the value of the χ²_cal statistic calculated from the sample) can also be obtained from Table D. In this case, we reject H0 if P ≤ α.
FIG. 10.6 χ² distribution, highlighting the critical region (CR) and the nonrejection of H0 (NR) region.
Example 10.3: Applying the χ² Test to One Sample

A candy store would like to find out if the number of chocolate candies sold daily varies depending on the day of the week. In order to do that, a sample was collected throughout 1 week, chosen randomly, and the results can be seen in Table 10.E.3. Test the hypothesis that sales do not depend on the day of the week. Assume that α = 5%.

TABLE 10.E.3 Frequencies Observed Versus Frequencies Expected

Events                 Sunday  Monday  Tuesday  Wednesday  Thursday  Friday  Saturday
Frequencies observed   35      24      27       32         25        36      31
Frequencies expected   30      30      30       30         30        30      30
Solution

Step 1: The most suitable test to compare the frequencies observed to the ones expected from one sample with more than two categories is the χ² test for a single sample.
Step 2: Through the null hypothesis, there are no significant differences between the sales observed and the ones expected for each day of the week. Through the alternative hypothesis, there is a difference on at least one day of the week:

H0: O_i = E_i
H1: O_i ≠ E_i

Step 3: The significance level to be considered is 5%.
Step 4: The value of the statistic is given by:

$$\chi^2_{cal} = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} = \frac{(35-30)^2}{30} + \frac{(24-30)^2}{30} + \cdots + \frac{(31-30)^2}{30} = 4.533$$

Step 5: The critical region of the χ² test, considering α = 5% and ν = 6 degrees of freedom, is shown in Fig. 10.7.

FIG. 10.7 Critical region of Example 10.3.

Step 6: Decision: since the value calculated is not in the critical region, that is, χ²_cal < 12.592, the null hypothesis is not rejected, which allows us to conclude, with a 95% confidence level, that the number of chocolate candies sold daily does not vary depending on the day of the week.

If we use the P-value instead of the critical value of the statistic, Steps 5 and 6 of the construction of the hypotheses tests will be:

Step 5: According to Table D in the Appendix, for ν = 6 degrees of freedom, the probability associated with the statistic χ²_cal = 4.533 (P-value) is between 0.1 and 0.9.
Step 6: Decision: since P > 0.05, we do not reject the null hypothesis.
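The same goodness-of-fit computation can be sketched in Python, assuming scipy is available (the book itself uses SPSS and Stata). scipy's chisquare defaults to equal expected frequencies, which matches the "All categories equal" option used in the SPSS walkthrough below.

```python
# Example 10.3 sketch: chi-square goodness-of-fit test (scipy assumed).
# With no `f_exp` argument, chisquare uses equal expected frequencies,
# i.e., 210 / 7 = 30 candies per day.
from scipy import stats

observed = [35, 24, 27, 32, 25, 36, 31]  # Sunday through Saturday

stat, p_value = stats.chisquare(observed)
print(f"chi2 = {stat:.3f}, p-value = {p_value:.3f}")
# → chi2 = 4.533, p-value = 0.605
```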
10.2.2.1 Solving the χ² Test for One Sample Using SPSS Software

The use of the images in this section has been authorized by the International Business Machines Corporation©. The data in Example 10.3 are available in the file Chi-Square_One_Sample.sav. The procedure for applying the χ² test on SPSS is described below. First, let's click on Analyze → Nonparametric Tests → Legacy Dialogs → Chi-Square …, as shown in Fig. 10.8. After that, we should insert the variable Day_week into the Test Variable List. The variable being studied has seven categories. The options Get from data and Use specified range (Lower = 1 and Upper = 7) in Expected Range generate the same results. The frequencies expected for the seven categories are exactly the same. Thus, we must select the option All categories equal in Expected Values, as shown in Fig. 10.9. Finally, let's click on OK to obtain the results of the χ² test, as shown in Fig. 10.10. The value of the χ² statistic is 4.533, similar to the value calculated in Example 10.3. Since the P-value = 0.605 > 0.05 (in Example 10.3, we saw that 0.1 < P < 0.9), we do not reject H0, which allows us to conclude, with a 95% confidence level, that the sales do not depend on the day of the week.

FIG. 10.8 Procedure for elaborating the χ² test on SPSS.
10.2.2.2 Solving the χ² Test for One Sample Using Stata Software

The use of the images presented in this section has been authorized by Stata Corp LP©. The data in Example 10.3 are available in the file Chi-Square_One_Sample.dta. The variable being studied is day_week. The χ² test for one sample on Stata can be obtained from the command csgof (chi-square goodness of fit), which allows us to compare the distribution of frequencies observed to the ones expected for a certain categorical variable with more than two categories. In order for this command to be used, first, we must type: findit csgof
and install it through the link csgof from http://www.ats.ucla.edu/stat/stata/ado/analysis. After doing this, we can type the following command: csgof day_week
The result is shown in Fig. 10.11. We can see that the result of the test is similar to the one calculated in Example 10.3 and on SPSS, as well as to the probability associated to the statistic.
10.2.3 Sign Test for One Sample

The sign test is an alternative to the t-test for a single random sample when the data distribution of the population does not follow a normal distribution. The only assumption required by the sign test is that the distribution of the variable be continuous.
FIG. 10.9 Selecting the variable and the procedure to elaborate the χ² test.
FIG. 10.10 Results of the χ² test for Example 10.3 on SPSS.
The sign test is based on the population median (m). The probability of obtaining a sample value that is less than the median and the probability of obtaining a sample value that is greater than the median are the same (p = ½). The null hypothesis of the test is that m is equal to a certain value specified by the investigator (m₀). For a bilateral test, we have:

H0: m = m₀
H1: m ≠ m₀

The quantitative data are converted into signs, (+) or (−): values greater than the median (m₀) are represented by (+) and values less than m₀ by (−). Data with values equal to m₀ are excluded from the sample. Thus, the sign test is applied to ordinal data and offers little power to the researcher, since this conversion results in a considerable loss of information regarding the original data.
FIG. 10.11 Results of the χ² test for Example 10.3 on Stata.
Small samples

Let N be the number of positive and negative signs (the sample size disregarding any ties) and k the number of signs corresponding to the lowest frequency. For small samples (N ≤ 25), we use the binomial test with p = ½ to calculate P(Y ≤ k). This probability can be obtained directly from Table F2 in the Appendix.

Large samples

When N > 25, the binomial distribution approaches a normal distribution, and the value of Z is given by:

$$Z = \frac{(X \pm 0.5) - N/2}{0.5\sqrt{N}} \sim N(0, 1) \qquad (10.6)$$

where X corresponds to the lowest or highest frequency. If X represents the lowest frequency, we must calculate X + 0.5. On the other hand, if X represents the highest frequency, we must calculate X − 0.5.

Example 10.4: Applying the Sign Test to a Single Sample

We estimate that the median retirement age in a certain Brazilian city is 65. One random sample with 20 retirees was drawn from the population and the results can be seen in Table 10.E.4. Test the null hypothesis that m = 65, at the significance level of 10%.
TABLE 10.E.4 Retirement Age

59  62  66  37  60  64  66  70  72  61
64  66  68  72  78  93  79  65  67  59
Solution

Step 1: Since the data do not follow a normal distribution, the most suitable test for testing the population median is the sign test.
Step 2: The hypotheses of the test are:

H0: m = 65
H1: m ≠ 65

Step 3: The significance level to be considered is 10%.
Step 4: Let's calculate P(Y ≤ k). To facilitate our understanding, let's sort the data in Table 10.E.4 in ascending order.
TABLE 10.E.5 Data From Table 10.E.4 Sorted in Ascending Order

37  59  59  60  61  62  64  64  65  66
66  66  67  68  70  72  72  78  79  93
Excluding value 65 (a tie), the number of (−) signs is 8, the number of (+) signs is 11, and N = 19. From Table F2 in the Appendix, for N = 19, k = 8, and p = ½, the associated unilateral probability is P₁ = 0.324. Since we are using a bilateral test, this value must be doubled, so the associated bilateral probability is 0.648 (P-value).
Step 5: Decision: since P > α (0.648 > 0.10), we do not reject H0, a fact that allows us to conclude, with a 90% confidence level, that m = 65.
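As the solution shows, the sign test reduces to a binomial test on the sign counts after ties are dropped. A sketch of Example 10.4 in Python, assuming scipy is available (the book itself uses SPSS and Stata):

```python
# Example 10.4 sketch: sign test as a binomial test on sign counts
# (scipy assumed; this mirrors the Table F2 lookup in the text).
from scipy import stats

ages = [59, 62, 66, 37, 60, 64, 66, 70, 72, 61,
        64, 66, 68, 72, 78, 93, 79, 65, 67, 59]
m0 = 65

n_plus = sum(1 for a in ages if a > m0)
n_minus = sum(1 for a in ages if a < m0)  # ties (a == m0) are dropped

result = stats.binomtest(min(n_plus, n_minus), n=n_plus + n_minus, p=0.5)
print(n_plus, n_minus, round(result.pvalue, 3))
# → 11 8 0.648
```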
10.2.3.1 Solving the Sign Test for One Sample Using SPSS Software

The use of the images in this section has been authorized by the International Business Machines Corporation©. SPSS makes the sign test available only for two related samples (2 Related Samples). Thus, in order to use the test for a single sample, we must generate a new variable with n values (the sample size including ties), all of them equal to m₀. The data in Example 10.4 are available in the file Sign_Test_One_Sample.sav. The procedure for applying the sign test on SPSS is shown below. First of all, we must click on Analyze → Nonparametric Tests → Legacy Dialogs → 2 Related Samples …, as shown in Fig. 10.12. After that, we must insert variable 1 (Age_pop) and variable 2 (Age_sample) into Test Pairs. Let's select the option regarding the sign test (Sign) in Test Type, as shown in Fig. 10.13. Next, let's click on OK to obtain the results of the sign test, as shown in Figs. 10.14 and 10.15. Fig. 10.14 shows the frequencies of negative and positive signs, the total number of ties, and the total frequency. Fig. 10.15 shows the associated probability for a bilateral test, which is similar to the value found in Example 10.4. Since P = 0.648 > 0.10, we do not reject the null hypothesis, which allows us to conclude, with a 90% confidence level, that the median retirement age is 65.
FIG. 10.12 Procedure for elaborating the sign test on SPSS.
FIG. 10.13 Selecting the variables and the sign test.
FIG. 10.14 Frequencies observed.
FIG. 10.15 Sign test for Example 10.4 on SPSS.
10.2.3.2 Solving the Sign Test for One Sample Using Stata Software The use of the images presented in this section has been authorized by Stata Corp LP©. Different from SPSS software, Stata makes the sign test for one sample available. On Stata, the sign test for a single sample as well as for two paired samples can be obtained from the command signtest. The syntax of the test for one sample is: signtest variable* = #
FIG. 10.16 Results of the sign test for Example 10.4 on Stata.
where the term variable* must be replaced by the variable considered in the analysis and # by the value of the population median to be tested. The data in Example 10.4 are available in the file Sign_Test_One_Sample.dta. The variable analyzed is age and the main objective is to verify if the median retirement age is 65. The command to be typed is: signtest age = 65
The result of the test is shown in Fig. 10.16. Analogous to the results presented in Example 10.4 and also generated on SPSS, the number of positive signs is 11, the number of negative signs is 8, and the associated probability for a bilateral test is 0.648. Since P > 0.10, we do not reject the null hypothesis, which allows us to conclude, with a 90% confidence level, that the median retirement age is 65.
10.3
TESTS FOR TWO PAIRED SAMPLES
These tests investigate whether two samples are related. The most common examples analyze a situation before and after a certain event. We will study the following tests: the McNemar test for binary variables, and the sign test and the Wilcoxon test for ordinal variables.
10.3.1 McNemar Test
The McNemar test is applied to assess the significance of changes in two related samples with qualitative or categorical variables that assume only two categories (binary variables). The main goal of the test is to verify whether there are any significant changes before and after the occurrence of a certain event. In order to do that, we use a 2 × 2 contingency table, as shown in Table 10.2. According to Siegel and Castellan (2006), the (+) and (−) signs are used to represent the possible changes in the answers before and after. The frequencies of each occurrence are represented in their respective cells in Table 10.2. For example, if there are changes from the first answer (+) to the second answer (−), the result is written in the upper right cell, so B represents the total number of observations that changed their behavior from (+) to (−). Analogously, if there are changes from the first answer (−) to the second answer (+), the result is written in the lower left cell, so C represents the total number of observations that changed their behavior from (−) to (+).
Nonparametric Tests Chapter
10
263
TABLE 10.2 2 × 2 Contingency Table

                After
Before       +       −
   +         A       B
   −         C       D
On the other hand, while A represents the total number of observations that kept the same answer (+) before and after, D represents the total number of observations with the same answer (−) in both periods. Thus, the total number of individuals that changed their answer is B + C. Under the null hypothesis of the test, changes in each direction are equally likely, that is:

H0: P(B → C) = P(C → B)
H1: P(B → C) ≠ P(C → B)

According to Siegel and Castellan (2006), the McNemar statistic is calculated from the chi-square (χ²) statistic presented in Expression (10.5), that is:

χ²cal = Σ(i=1 to 2) (Oi − Ei)²/Ei = (B − (B + C)/2)² / ((B + C)/2) + (C − (B + C)/2)² / ((B + C)/2) = (B − C)²/(B + C) ~ χ²₁   (10.7)

According to the same authors, a correction factor must be used so that the continuous χ² distribution better approximates the discrete distribution of the statistic, so:

χ²cal = (|B − C| − 1)²/(B + C), with 1 degree of freedom   (10.8)
The calculated value must be compared to the critical value of the χ² distribution (Table D in the Appendix). This table provides the critical values of χ²c where P(χ²cal > χ²c) = α (for a right-tailed unilateral test). If the value of the statistic lies in the critical region, that is, if χ²cal > χ²c, we reject H0. Otherwise, we do not reject H0. The probability associated with the χ²cal statistic (P-value) can also be obtained from Table D. In this case, the null hypothesis is rejected if P ≤ α. Otherwise, we do not reject H0.

Example 10.5: Applying the McNemar Test
A bill of law proposing the end of full retirement pensions for federal civil servants was being analyzed by the Senate. Aiming to verify whether this measure would bring any changes in the number of people taking public exams, 60 workers were interviewed, before and after the reform, so that they could express their preference for working in a private or a public organization. The results can be seen in Table 10.E.6. Test the hypothesis that there were no significant changes in the workers' answers before and after the social security reform. Assume that α = 5%.

TABLE 10.E.6 Contingency Table

                      After the Reform
Before the Reform     Private     Public
Private               22          3
Public                21          14
Solution
Step 1: The McNemar test is the most suitable for evaluating the significance of before-and-after changes in two related samples, applied to nominal or categorical variables.
Step 2: Under the null hypothesis, the reform would not be effective in changing people's preferences towards the private sector. In other words, among the workers who changed their preferences, the probability of changing their preference
from private to public organizations after the reform is the same as the probability of changing from public to private organizations. That is:

H0: P(Private → Public) = P(Public → Private)
H1: P(Private → Public) ≠ P(Public → Private)

Step 3: The significance level to be considered is 5%.
Step 4: The value of the statistic, according to Expression (10.7), is:

χ²cal = (B − C)²/(B + C) = (3 − 21)²/(3 + 21) = 13.5, with ν = 1

If we use the correction factor, the value of the statistic from Expression (10.8) becomes:

χ²cal = (|B − C| − 1)²/(B + C) = (|3 − 21| − 1)²/(3 + 21) = 12.042, with ν = 1

Step 5: The critical value of the chi-square statistic (χ²c) obtained from Table D in the Appendix, considering α = 5% and ν = 1 degree of freedom, is 3.841.
Step 6: Decision: since the calculated value lies in the critical region, that is, χ²cal > 3.841, we reject the null hypothesis, which allows us to conclude, with a 95% confidence level, that there were significant changes in the choice of working at a private or a public organization after the social security reform.
If we use the P-value instead of the critical value of the statistic, Steps 5 and 6 will be:
Step 5: According to Table D in the Appendix, for ν = 1 degree of freedom, the probability associated with the statistic χ²cal = 12.042 or 13.5 (P-value) is less than 0.005 (a probability of 0.005 corresponds to the statistic χ²cal = 7.879).
Step 6: Decision: since P < 0.05, we must reject H0.
10.3.1.1 Solving the McNemar Test Using SPSS Software
Example 10.5 will be solved using SPSS software. The use of the images in this section has been authorized by the International Business Machines Corporation©. The data are available in the file McNemar_Test.sav. The procedure for applying the McNemar test in SPSS is as follows. Let's click on Analyze → Nonparametric Tests → Legacy Dialogs → 2 Related Samples …, as shown in Fig. 10.17. After that, we insert variable 1 (Before) and variable 2 (After) into Test Pairs and select the McNemar test option in Test Type, as shown in Fig. 10.18. Finally, we click on OK to obtain Figs. 10.19 and 10.20. Fig. 10.19 shows the frequencies observed before and after the reform (Contingency Table). The result of the McNemar test is shown in Fig. 10.20. According to Fig. 10.20, the significance level observed in the McNemar test is 0.000, a value lower than 5%, so the null hypothesis is rejected. Hence, we may conclude, with a 95% confidence level, that there was a significant change in choosing to work at a public or a private organization after the social security reform.
10.3.1.2 Solving the McNemar Test Using Stata Software
Example 10.5 will also be solved using Stata software. The use of the images presented in this section has been authorized by Stata Corp LP©. The data are available in the file McNemar_Test.dta. The McNemar test can be calculated in Stata with the command mcc followed by the paired variables. In our example, the paired variables are called before and after, so the command to be typed is: mcc before after
The result of the McNemar test is shown in Fig. 10.21. We can see that the value of the statistic is 13.5, similar to the value calculated by Expression (10.7), without the correction factor. The significance level observed from the test is 0.000, lower than 5%, which allows us to conclude, with a 95% confidence level, that there was a significant change before and after the reform. The result of the McNemar test could have also been obtained by using the command mcci 14 21 3 22.
10.3.2 Sign Test for Two Paired Samples
The sign test can also be applied to two paired samples. In this case, the sign is given by the difference between the pairs: if the difference is positive, the pair of values is replaced by a (+) sign; if the difference is negative, the pair of values is replaced by a (−) sign. In the case of a tie, the data are excluded from the sample.
FIG. 10.17 Procedure for elaborating the McNemar test on SPSS.
FIG. 10.18 Selecting the variables and McNemar test.
FIG. 10.19 Frequencies observed.
FIG. 10.20 McNemar test for Example 10.5 on SPSS.
FIG. 10.21 Results of the McNemar test for Example 10.5 on Stata.
Analogous to the sign test for a single sample, the sign test presented in this section is also an alternative to the t-test for comparing two related samples when the data distribution is not normal. In this case, the quantitative data are transformed into ordinal data. Thus, the sign test is much less powerful than the t-test, because it only uses the sign of the difference between the pairs as information. Under the null hypothesis, the population median of the differences (md) is zero. Therefore, for a bilateral test, we have:

H0: md = 0
H1: md ≠ 0

In other words, we test the hypothesis that there are no differences between the samples (the samples come from populations with the same median and the same continuous distribution), that is, the number of (+) signs is the same as the number of (−) signs. The same procedure presented in Section 10.2.3 for a single sample is used to calculate the sign statistic in the case of two paired samples.

Small samples
Let N be the number of positive and negative signs (the sample size disregarding the ties) and k the number of signs corresponding to the lowest frequency. If N ≤ 25, we use the binomial test with p = 1/2 to calculate P(Y ≤ k). This probability can be obtained directly from Table F2 in the Appendix.
Large samples
When N > 25, the binomial distribution approaches a normal distribution, and the value of Z is given by Expression (10.6):

Z = ((X ± 0.5) − N/2) / (0.5 √N) ~ N(0, 1)

where X corresponds to the lowest or highest frequency. If X represents the lowest frequency, we must use X + 0.5. On the other hand, if X represents the highest frequency, we must use X − 0.5.

Example 10.6: Applying the Sign Test to Two Paired Samples
A group of 30 workers was submitted to a training course aiming to improve their productivity. The results, in terms of the average number of parts produced per hour per employee before and after the training, are shown in Table 10.E.7. Test the null hypothesis that there were no alterations in productivity before and after the training course. Assume that α = 5%.
TABLE 10.E.7 Productivity Before and After the Training Course

Before    After    Difference Sign
36        40       +
39        41       +
27        29       +
41        45       +
40        39       −
44        42       −
38        39       +
42        40       −
40        42       +
43        45       +
37        35       −
41        40       −
38        38       0
45        43       −
40        40       0
39        42       +
38        41       +
39        39       0
41        40       −
36        38       +
38        36       −
40        38       −
36        35       −
40        42       +
40        41       +
38        40       +
37        39       +
40        42       +
38        36       −
40        40       0
Solution
Step 1: Since the data do not follow a normal distribution, the sign test can be an alternative to the t-test for two paired samples.
Step 2: The null hypothesis assumes that there is no difference in productivity before and after the training course, that is:

H0: md = 0
H1: md ≠ 0

Step 3: The significance level to be considered is 5%.
Step 4: Since N > 25 (N = 26, disregarding the four ties), the binomial distribution approaches a normal distribution, and the value of Z is given by:

Z = ((X + 0.5) − N/2) / (0.5 √N) = ((11 + 0.5) − 13) / (0.5 √26) = −0.588

Step 5: By using the standard normal distribution table (Table E in the Appendix), we determine the critical region (CR) for a bilateral test, as shown in Fig. 10.22.

FIG. 10.22 Critical region of Example 10.6.

Step 6: Decision: since the calculated value is not in the critical region, that is, −1.96 ≤ Zcal ≤ 1.96, the null hypothesis is not rejected, which allows us to conclude, with a 95% confidence level, that there is no difference in productivity before and after the training course.
If, instead of comparing the calculated value to the critical value of the standard normal distribution, we use the P-value, Steps 5 and 6 will be:
Step 5: According to Table E in the Appendix, the unilateral probability associated with the statistic Zcal = −0.59 is P1 = 0.278. For a bilateral test, this probability must be doubled (P-value = 0.556).
Step 6: Decision: since P > 0.05, we do not reject the null hypothesis.
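The Step 4 calculation can be scripted in a few lines of Python (the helper name sign_test_large is hypothetical); the normal cumulative distribution Φ is evaluated with math.erf.

```python
from math import erf, sqrt

def sign_test_large(n_pos, n_neg):
    """Normal approximation of the sign test (Expression 10.6), N > 25, ties dropped."""
    n = n_pos + n_neg
    x = min(n_pos, n_neg)                      # lowest frequency
    z = ((x + 0.5) - n / 2) / (0.5 * sqrt(n))  # +0.5: continuity correction toward N/2
    p_one_sided = 0.5 * (1 + erf(z / sqrt(2)))  # Phi(z); z is negative here
    return z, 2 * p_one_sided                   # bilateral P-value

# Example 10.6: 15 positive and 11 negative differences (4 ties excluded).
z, p = sign_test_large(15, 11)
print(round(z, 3), round(p, 3))  # -0.588 0.556
```

The P-value of 0.556 matches the SPSS output in Fig. 10.26, so the null hypothesis of equal medians is not rejected.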
10.3.2.1 Solving the Sign Test for Two Paired Samples Using SPSS Software
The use of the images in this section has been authorized by the International Business Machines Corporation©. The data in Example 10.6 can be found in the file Sign_Test_Two_Paired_Samples.sav. The procedure for applying the sign test to two paired samples in SPSS is as follows. We click on Analyze → Nonparametric Tests → Legacy Dialogs → 2 Related Samples …, as shown in Fig. 10.23. After that, we insert variable 1 (Before) and variable 2 (After) into Test Pairs and select the sign test option (Sign) in Test Type, as shown in Fig. 10.24. Finally, we click on OK to obtain the results of the sign test for two paired samples (Figs. 10.25 and 10.26). Fig. 10.25 shows the frequencies of negative and positive signs, the total number of ties, and the total frequency. Fig. 10.26 shows the result of the z test and the associated probability P for a bilateral test, values similar to the ones calculated in Example 10.6. Since P = 0.556 > 0.05, the null hypothesis is not rejected, which allows us to conclude, with a 95% confidence level, that there is no difference in productivity before and after the training course.
FIG. 10.23 Procedure for elaborating the sign test on SPSS.
FIG. 10.24 Selecting the variables and the sign test.
FIG. 10.25 Frequencies observed.
FIG. 10.26 Sign test (two paired samples) for Example 10.6 on SPSS.
10.3.2.2 Solving the Sign Test for Two Paired Samples Using Stata Software
The use of the images presented in this section has been authorized by Stata Corp LP©. The data in Example 10.6 are also available in Stata, in the file Sign_Test_Two_Paired_Samples.dta. The paired variables are before and after. As discussed in Section 10.2.3.2 for a single sample, the sign test in Stata is carried out with the command signtest. In the case of two paired samples, we use the same command, followed by the names of the paired variables with an equal sign between them, since the objective is to test the equality of the respective medians. Thus, the command to be typed for our example is: signtest after = before
The result of the test is shown in Fig. 10.27 and includes the number of positive signs (15), the number of negative signs (11), and the probability associated with the statistic for a bilateral test (P = 0.557). These values are similar to the ones calculated in Example 10.6 and also generated in SPSS. Since P > 0.05, we do not reject the null hypothesis, which allows us to conclude, with a 95% confidence level, that there is no difference in productivity before and after the training course.
10.3.3 Wilcoxon Test
Analogous to the sign test for two paired samples, the Wilcoxon test is an alternative to the t-test when the data distribution is not normal. The Wilcoxon test is an extension of the sign test; however, it is more powerful: besides the direction of the differences for each pair, the Wilcoxon test considers the magnitude of the difference within the pairs (Fávero et al., 2009). The logical foundations and the method used in the Wilcoxon test are described based on Siegel and Castellan (2006). Let di be the difference between the values of each pair of data. First, we place all the di's in ascending order of absolute value (without considering the sign) and assign ranks in this order. For example, rank 1 is attributed to the lowest |di|, rank 2 to the second lowest, and so on. At the end, we attach the sign of the difference di to each rank. The sum of all positive ranks is represented by Sp and the sum of all negative ranks by Sn. Occasionally, the values of a certain pair of data are the same (di = 0). In this case, they are excluded from the sample. This is the same procedure used in the sign test, so the value of N represents the sample size disregarding these ties.
FIG. 10.27 Results of the sign test (two paired samples) for Example 10.6 on Stata.
Another type of tie may occur, in which two or more differences have the same absolute value. In this case, the same rank is attributed to the ties, corresponding to the mean of the ranks that would have been attributed had the differences been distinct. For example, suppose that three pairs of data show differences with the same absolute value, 1. Rank 2 is attributed to each pair, which corresponds to the mean of ranks 1, 2, and 3. The next value in order receives rank 4, since ranks 1, 2, and 3 have already been used.
The null hypothesis assumes that the median of the differences in the population (md) is zero, that is, the populations do not differ in location. For a bilateral test, we have:

H0: md = 0
H1: md ≠ 0

In other words, we test the hypothesis that there are no differences between the samples (the samples come from populations with the same median and the same continuous distribution), that is, the sum of the positive ranks (Sp) equals the sum of the negative ranks (Sn).

Small samples
If N ≤ 15, Table I in the Appendix shows the unilateral probabilities associated with the several critical values of Sc (P(Sp > Sc) = α). For a bilateral test, this value must be doubled. If the probability obtained (P-value) is less than or equal to α, we must reject H0.

Large samples
As N grows, the Wilcoxon distribution approaches a standard normal distribution. Thus, for N > 15, we must calculate the value of the variable z that, according to Siegel and Castellan (2006), Fávero et al. (2009), and Maroco (2014), is:

Zcal = (min(Sp, Sn) − N(N + 1)/4) / √( N(N + 1)(2N + 1)/24 − (Σ(j=1 to g) tj³ − Σ(j=1 to g) tj)/48 )   (10.9)

where:
(Σ tj³ − Σ tj)/48 is a correction factor used whenever there are ties;
g: the number of groups of tied ranks;
tj: the number of tied observations in group j.
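The averaged-rank rule for ties described above can be checked with a small Python sketch (the helper name ranks_by_abs is hypothetical): three differences of absolute value 1 share rank 2, and the next value in order receives rank 4.

```python
def ranks_by_abs(diffs):
    """Rank differences by absolute value, ascending; ties share the mean rank."""
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        # Group the positions whose |d| ties with position i.
        while j < len(order) and abs(diffs[order[j]]) == abs(diffs[order[i]]):
            j += 1
        mean_rank = (i + 1 + j) / 2  # mean of positions i+1 .. j
        for k in range(i, j):
            ranks[order[k]] = mean_rank
        i = j
    return ranks

print(ranks_by_abs([-1, 1, 1, 2]))  # [2.0, 2.0, 2.0, 4.0]
```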
The calculated value must be compared to the critical value of the standard normal distribution (Table E in the Appendix). This table provides the critical values of zc where P(Zcal > zc) = α (for a right-tailed unilateral test). For a bilateral test, we have P(Zcal < −zc) = P(Zcal > zc) = α/2. The null hypothesis H0 of a bilateral test is rejected if the value of the Zcal statistic is in the critical region, that is, if Zcal < −zc or Zcal > zc. Otherwise, we do not reject H0. The unilateral probability associated with the statistic Zcal (P1) can also be obtained from Table E. For a unilateral test, we consider P = P1. For a bilateral test, this probability must be doubled (P = 2P1). Thus, for both tests, we reject H0 if P ≤ α.

Example 10.7: Applying the Wilcoxon Test
A group of 18 students from the 12th grade took an English proficiency exam without ever having taken an extracurricular course. The same group of students was submitted to an intensive English course for 6 months and, at the end, took the proficiency exam again. The results can be seen in Table 10.E.8. Test the hypothesis that there was no improvement before and after the course. Assume that α = 5%.

TABLE 10.E.8 Students' Grades Before and After the Intensive Course

Before    After
56        60
65        62
70        74
78        79
47        53
52        59
64        65
70        75
72        75
78        88
80        78
26        26
55        63
60        59
71        71
66        75
60        71
17        24
Solution
Step 1: Since the data do not follow a normal distribution, the Wilcoxon test can be applied, as it is more powerful than the sign test for two paired samples.
Step 2: Under the null hypothesis, there is no difference in the students' performance before and after the course, that is:

H0: md = 0
H1: md ≠ 0
Step 3: The significance level to be considered is 5%.
Step 4: Since N > 15, the Wilcoxon distribution approaches a normal distribution. In order to calculate the value of z, first we calculate each di and the respective ranks, as shown in Table 10.E.9.

TABLE 10.E.9 Calculation of di and the Respective Ranks

Before    After    di     di's Rank
56        60        4     7.5
65        62       −3     5.5
70        74        4     7.5
78        79        1     2
47        53        6     10
52        59        7     11.5
64        65        1     2
70        75        5     9
72        75        3     5.5
78        88       10     15
80        78       −2     4
26        26        0     —
55        63        8     13
60        59       −1     2
71        71        0     —
66        75        9     14
60        71       11     16
17        24        7     11.5

Since there are two pairs of data with equal values (di = 0), they are excluded from the sample, so N = 16. The sum of the positive ranks is Sp = 2 + ⋯ + 16 = 124.5. The sum of the negative ranks is Sn = 2 + 4 + 5.5 = 11.5. Thus, we can calculate the value of z by using Expression (10.9), where the tie correction uses Σ tj³ = 51 and Σ tj = 9 (four groups of tied ranks, with 3, 2, 2, and 2 tied observations):

Zcal = (min(Sp, Sn) − N(N + 1)/4) / √( N(N + 1)(2N + 1)/24 − (Σ tj³ − Σ tj)/48 )
     = (11.5 − (16 · 17)/4) / √( (16 · 17 · 33)/24 − (51 − 9)/48 ) = −2.925

Step 5: By using the standard normal distribution table (Table E in the Appendix), we determine the critical region (CR) for the bilateral test, as shown in Fig. 10.28.
FIG. 10.28 Critical region of Example 10.7.
Step 6: Decision: since the calculated value is in the critical region, that is, Zcal < −1.96, the null hypothesis is rejected, which allows us to conclude, with a 95% confidence level, that there is a difference in the students' performance before and after the course.
If, instead of comparing the calculated value to the critical value of the standard normal distribution, we use the P-value, Steps 5 and 6 will be:
Step 5: According to Table E in the Appendix, the unilateral probability associated with the statistic Zcal = −2.925 is P1 = 0.0017. For a bilateral test, this probability must be doubled (P-value = 0.0034).
Step 6: Decision: since P < 0.05, we must reject the null hypothesis.
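The full Step 4 computation (signed differences, averaged ranks, tie correction, and the z statistic of Expression (10.9)) can be sketched in Python; the helper name wilcoxon_z is hypothetical, and the grades are those of Table 10.E.8.

```python
from math import sqrt
from collections import Counter

def wilcoxon_z(before, after):
    """Large-sample Wilcoxon signed-rank statistic (Expression 10.9), ties corrected."""
    d = [a - b for b, a in zip(before, after) if a != b]  # drop zero differences
    n = len(d)
    # Average ranks of |d|: tied absolute values share one mean rank.
    abs_sorted = sorted(abs(x) for x in d)
    rank_of, i = {}, 0
    while i < n:
        j = i
        while j < n and abs_sorted[j] == abs_sorted[i]:
            j += 1
        rank_of[abs_sorted[i]] = (i + 1 + j) / 2  # mean of positions i+1 .. j
        i = j
    sp = sum(rank_of[abs(x)] for x in d if x > 0)
    sn = sum(rank_of[abs(x)] for x in d if x < 0)
    # Tie correction over groups of tied ranks.
    ties = Counter(abs(x) for x in d).values()
    var = n * (n + 1) * (2 * n + 1) / 24 - sum(t**3 - t for t in ties) / 48
    return sp, sn, (min(sp, sn) - n * (n + 1) / 4) / sqrt(var)

before = [56, 65, 70, 78, 47, 52, 64, 70, 72, 78, 80, 26, 55, 60, 71, 66, 60, 17]
after = [60, 62, 74, 79, 53, 59, 65, 75, 75, 88, 78, 26, 63, 59, 71, 75, 71, 24]
sp, sn, z = wilcoxon_z(before, after)
print(sp, sn, round(z, 3))  # 124.5 11.5 -2.925
```

The rank sums and the z statistic reproduce the hand calculation and the SPSS output of Figs. 10.31 and 10.32.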
10.3.3.1 Solving the Wilcoxon Test Using SPSS Software
The use of the images in this section has been authorized by the International Business Machines Corporation©. The data in Example 10.7 are available in the file Wilcoxon_Test.sav. The procedure for applying the Wilcoxon test to two paired samples in SPSS is as follows. Let's click on Analyze → Nonparametric Tests → Legacy Dialogs → 2 Related Samples …, as shown in Fig. 10.29. First, we insert variable 1 (Before) and variable 2 (After) into Test Pairs and select the option related to the Wilcoxon test in Test Type, as shown in Fig. 10.30. Finally, we click on OK to obtain the results of the Wilcoxon test for two paired samples (Figs. 10.31 and 10.32). Fig. 10.31 shows the number of negative, positive, and tied ranks, besides the mean and the sum of the positive and negative ranks. Fig. 10.32 shows the result of the z test and the associated probability P for a bilateral test, values similar to the ones found in Example 10.7. Since P = 0.003 < 0.05, we must reject the null hypothesis, which allows us to conclude, with a 95% confidence level, that there is a difference in the students' performance before and after the course.
FIG. 10.29 Procedure for elaborating the Wilcoxon test on SPSS.
FIG. 10.30 Selecting the variables and Wilcoxon test.
FIG. 10.31 Ranks.
FIG. 10.32 Wilcoxon test for Example 10.7 on SPSS.
10.3.3.2 Solving the Wilcoxon Test Using Stata Software
The use of the images presented in this section has been authorized by Stata Corp LP©. The data in Example 10.7 are available in the file Wilcoxon_Test.dta. The paired variables are called before and after. The Wilcoxon test on Stata is carried out from the command signrank followed by the name of the paired variables with an equal sign between them. For our example, we must type the following command: signrank before = after
FIG. 10.33 Results of the Wilcoxon test for Example 10.7 on Stata.
The result of the test is shown in Fig. 10.33. Since P < 0.05, we reject the null hypothesis, which allows us to conclude, with a 95% confidence level, that there is a difference in the students’ performance before and after the course.
10.4 TESTS FOR TWO INDEPENDENT SAMPLES
In these tests, we compare two populations represented by their respective samples. Unlike the tests for two paired samples, here it is not necessary for the samples to have the same size. Among the tests for two independent samples, we highlight the chi-square test (for nominal or ordinal variables) and the Mann-Whitney test (for ordinal variables).
10.4.1 Chi-Square Test (χ²) for Two Independent Samples
In Section 10.2.2, the χ² test was applied to a single sample in which the variable being studied was qualitative (nominal or ordinal). Here, the test will be applied to two independent samples of nominal or ordinal qualitative variables. This test was already studied in Chapter 4 (Section 4.2.2), in order to verify whether there is an association between two qualitative variables, and it is described once again in this section. The test compares the frequencies observed in each cell of a contingency table to the frequencies expected. The χ² test for two independent samples assumes the following hypotheses:

H0: there is no significant difference between the observed and the expected frequencies
H1: there is a significant difference between the observed and the expected frequencies

Therefore, the χ² statistic measures the discrepancy between the observed contingency table and the expected contingency table, under the hypothesis that there is no connection between the categories of the two variables studied. If the distribution of observed frequencies is exactly the same as the distribution of expected frequencies, the χ² statistic is zero. Thus, a low value of χ² indicates independence between the variables. As already presented in Expression (4.1) in Chapter 4, the χ² statistic for two independent samples is given by:

χ² = Σ(i=1 to I) Σ(j=1 to J) (Oij − Eij)² / Eij   (10.10)

where:
Oij: the number of observations in the ith category of variable X and in the jth category of variable Y;
Eij: the expected frequency of observations in the ith category of variable X and in the jth category of variable Y;
I: the number of categories (rows) of variable X;
J: the number of categories (columns) of variable Y.
FIG. 10.34 χ² distribution.
The values of χ²cal approximately follow a χ² distribution with ν = (I − 1)(J − 1) degrees of freedom. The critical values of the chi-square statistic (χ²c) can be found in Table D in the Appendix. This table provides the critical values of χ²c where P(χ²cal > χ²c) = α (for a right-tailed unilateral test). In order for the null hypothesis H0 to be rejected, the value of the χ²cal statistic must be in the critical region, that is, χ²cal > χ²c. Otherwise, we do not reject H0 (Fig. 10.34).

Example 10.8: Applying the χ² Test to Two Independent Samples
Let us consider Example 4.1 in Chapter 4 once again, which refers to a study carried out with 200 individuals aiming to analyze the joint behavior of variable X (Health insurance agency) with variable Y (Level of satisfaction). The contingency table showing the joint distribution of the variables' absolute frequencies, besides the marginal totals, is presented in Table 10.E.10. Test the hypothesis that there is no association between the categories of the two variables, considering α = 5%.
TABLE 10.E.10 Joint Distribution of the Absolute Frequencies of the Variables Being Studied

                      Level of Satisfaction
Agency          Dissatisfied    Neutral    Satisfied    Total
Total Health    40              16         12           68
Live Life       32              24         16           72
Mena Health     24              32         4            60
Total           96              72         32           200
Solution
Step 1: The most suitable test to compare the frequencies observed in each cell of a contingency table to the frequencies expected is the χ² test for two independent samples.
Step 2: The null hypothesis states that there is no connection between the categories of the variables Agency and Level of satisfaction, that is, the observed and expected frequencies are the same for each pair of variable categories. The alternative hypothesis states that there are differences in at least one pair of categories:

H0: Oij = Eij
H1: Oij ≠ Eij

Step 3: The significance level to be considered is 5%.
Step 4: In order to calculate the statistic, it is necessary to compare the observed and the expected values. Table 10.E.11 presents the observed values of the distribution with their respective relative frequencies in relation to the row totals. The calculation could also be done in relation to the column totals, leading to the same χ² statistic. The data in Table 10.E.11 suggest a dependence between the variables. If there were no connection between the variables, we would expect a proportion of 48% of each row total in the Dissatisfied level for all three agencies, 36% in the Neutral level, and 16% in the Satisfied level. The calculations of the expected values can be found in Table 10.E.12. For example, the calculation of the first cell is 0.48 × 68 = 32.6. In order to calculate the χ² statistic, we must apply Expression (10.10) to the data in Tables 10.E.11 and 10.E.12. The calculation of each term (Oij − Eij)²/Eij is represented in Table 10.E.13, jointly with the resulting χ²cal measure, the sum over all the categories.
Step 5: The critical region (CR) of the χ² distribution (Table D in the Appendix), considering α = 5% and ν = (I − 1)(J − 1) = 4 degrees of freedom, is shown in Fig. 10.35.
TABLE 10.E.11 Values Observed in Each Category With Their Respective Proportions in Relation to the Row's General Total

                      Level of Satisfaction
Agency          Dissatisfied    Neutral       Satisfied     Total
Total Health    40 (58.8%)      16 (23.5%)    12 (17.6%)    68 (100%)
Live Life       32 (44.4%)      24 (33.3%)    16 (22.2%)    72 (100%)
Mena Health     24 (40%)        32 (53.3%)    4 (6.7%)      60 (100%)
Total           96 (48%)        72 (36%)      32 (16%)      200 (100%)
TABLE 10.E.12 Values Expected From Table 10.E.11 Assuming No Association Between the Variables

                      Level of Satisfaction
Agency          Dissatisfied    Neutral       Satisfied     Total
Total Health    32.6 (48%)      24.5 (36%)    10.9 (16%)    68 (100%)
Live Life       34.6 (48%)      25.9 (36%)    11.5 (16%)    72 (100%)
Mena Health     28.8 (48%)      21.6 (36%)    9.6 (16%)     60 (100%)
Total           96 (48%)        72 (36%)      32 (16%)      200 (100%)
TABLE 10.E.13 Calculation of the χ² Statistic

                      Level of Satisfaction
Agency          Dissatisfied    Neutral    Satisfied
Total Health    1.66            2.94       0.12
Live Life       0.19            0.14       1.74
Mena Health     0.80            5.01       3.27
Total           χ²cal = 15.861
FIG. 10.35 Critical region of Example 10.8.
Step 6: Decision: since the calculated value is in the critical region, that is, χ²cal > 9.488, we must reject the null hypothesis, which allows us to conclude, with a 95% confidence level, that there is an association between the variable categories.
If we use the P-value instead of the critical value of the statistic, Steps 5 and 6 will be:
Step 5: According to Table D in the Appendix, the probability associated with the statistic χ²cal = 15.861, for ν = 4 degrees of freedom, is less than 0.005.
Step 6: Decision: since P < 0.05, we reject H0.
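The hand calculation of Tables 10.E.11 to 10.E.13 reduces to a few lines of Python (the helper name chi2_independence is hypothetical): expected cell counts come from the row and column totals, and Expression (10.10) is summed over all cells.

```python
def chi2_independence(observed):
    """Chi-square statistic (Expression 10.10) from a contingency table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (o - expected) ** 2 / expected
    df = (len(observed) - 1) * (len(observed[0]) - 1)  # (I - 1)(J - 1)
    return stat, df

# Table 10.E.10: agencies (rows) by satisfaction levels (columns).
observed = [[40, 16, 12],
            [32, 24, 16],
            [24, 32, 4]]
stat, df = chi2_independence(observed)
print(round(stat, 3), df)  # 15.861 4
```

The result matches the χ²cal = 15.861 with 4 degrees of freedom obtained by hand and reported by SPSS and Stata.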
Nonparametric Tests Chapter
10
279
10.4.1.1 Solving the χ² Statistic Using SPSS Software
The use of the images in this section has been authorized by the International Business Machines Corporation©. The data in Example 10.8 are available in the file HealthInsurance.sav. In order to calculate the χ² statistic for two independent samples, we must click on Analyze → Descriptive Statistics → Crosstabs …. Let’s insert the variable Agency in Row(s) and the variable Satisfaction in Column(s), as shown in Fig. 10.36. In Statistics …, let’s select the option Chi-square, as shown in Fig. 10.37. Then, we must finally click on Continue and OK. The result is shown in Fig. 10.38. From Fig. 10.38, we can see that the value of χ² is 15.861, the same value calculated in Example 10.8. For a confidence level of 95%, since P = 0.003 < 0.05, we must reject the null hypothesis, which allows us to conclude, with a 95% confidence level, that there is an association between the variable categories, that is, the frequencies observed differ from the frequencies expected in at least one pair of categories.
10.4.1.2 Solving the χ² Statistic by Using Stata Software
The use of the images presented in this section has been authorized by Stata Corp LP©. As presented in Chapter 4, the calculation of the χ² statistic on Stata is done by using the command tabulate, or simply tab, followed by the names of the variables being studied, with the option chi2, or simply ch. The syntax of the test is:
tab variable1* variable2*, ch
The data in Example 10.8 are also available in the file HealthCareInsurance.dta. The variables being studied are agency and satisfaction. Thus, we must type the following command: tab agency satisfaction, ch
The results can be seen in Fig. 10.39 and are similar to the ones presented in Example 10.8 and generated on SPSS.
FIG. 10.36 Selecting the variables.
FIG. 10.37 Selecting the χ² statistic.
FIG. 10.38 Results of the χ² test for Example 10.8 on SPSS.
FIG. 10.39 Results of the χ² test for Example 10.8 on Stata.
10.4.2 Mann-Whitney U Test
The Mann-Whitney U test is one of the most powerful nonparametric tests. It is applied to quantitative variables or to qualitative variables in an ordinal scale, and it aims at verifying whether two nonpaired or independent samples are drawn from the same population. It is an alternative to Student’s t-test when the normality hypothesis is violated or when the sample is small. In addition, it may be considered the nonparametric version of the t-test for two independent samples. Since the original data are transformed into ranks (orders), we lose some information, so the Mann-Whitney U test is not as powerful as the t-test. Unlike the t-test, which verifies the equality of the means of two independent populations with continuous data, the Mann-Whitney U test verifies the equality of the medians. For a bilateral test, the null hypothesis is that the medians of both populations are equal, that is:
H0: m1 = m2
H1: m1 ≠ m2
The calculation of the Mann-Whitney U statistic is described next, for small and for large samples.
Small samples
Method:
(a) Let N1 be the size of the sample with the smallest number of observations and N2 the size of the sample with the largest number of observations. We assume that both samples are independent.
(b) In order to apply the Mann-Whitney U test, we must join both samples into a single combined sample with N = N1 + N2 elements, identifying the original sample of each observation. The combined sample must be sorted in ascending order, and ranks are attributed to each observation: rank 1 is attributed to the lowest observation and rank N to the highest observation. If there are ties, we attribute the mean of the corresponding ranks.
(c) After that, we must calculate the sum of the ranks for each sample, that is, R1, which corresponds to the sum of the ranks in the sample with the smallest number of observations, and R2, which corresponds to the sum of the ranks in the sample with the largest number of observations.
(d) Thus, we can calculate the quantities U1 and U2 as follows:

U1 = N1·N2 + N1(N1 + 1)/2 − R1   (10.11)

U2 = N1·N2 + N2(N2 + 1)/2 − R2   (10.12)
(e) The Mann-Whitney U statistic is given by:
Ucal = min(U1, U2)
Table J in the Appendix shows the critical values of U such that P(Ucal < Uc) = α (for a left-tailed unilateral test), for values of N2 ≤ 20 and significance levels of 0.05, 0.025, 0.01, and 0.005. In order for the null hypothesis H0 of the left-tailed unilateral test to be rejected, the value of the Ucal statistic must be in the critical region, that is, Ucal < Uc. Otherwise, we do not reject H0. For a bilateral test, we must consider P(Ucal < Uc) = α/2, since P(Ucal < Uc) + P(Ucal > Uc) = α. The unilateral probabilities associated with the Ucal statistic (P1) can also be obtained from Table J. For a unilateral test, we have P = P1. For a bilateral test, this probability must be doubled (P = 2P1). Thus, we reject H0 if P ≤ α.
Large samples
As the sample size grows (N2 > 20), the Mann-Whitney distribution becomes more similar to a standard normal distribution.
The real value of the Z statistic is given by:

Zcal = (U − N1·N2/2) / √{ [N1·N2 / (N(N − 1))] · [ (N³ − N)/12 − Σ_{j=1}^{g} (tj³ − tj)/12 ] }   (10.13)

where:
Σ_{j=1}^{g} (tj³ − tj)/12 is a correction factor applied when there are ties;
g: the number of groups with tied ranks;
tj: the number of tied observations in group j.
The value calculated must be compared to the critical value of the standard normal distribution (see Table E in the Appendix). This table provides the critical values of zc, where P(Zcal > zc) = α (for a right-tailed unilateral test). For a bilateral test, we have P(Zcal < −zc) = P(Zcal > zc) = α/2. Therefore, for a bilateral test, the null hypothesis is rejected if Zcal < −zc or Zcal > zc. Unilateral probabilities associated with the Zcal statistic (P1 = P) can also be obtained from Table E. For a bilateral test, this probability must be doubled (P = 2P1). Thus, the null hypothesis is rejected if P ≤ α.
Example 10.9: Applying the Mann-Whitney U Test to Small Samples
Aiming at assessing the quality of two machines, the diameters of the parts produced (in mm) by each one of them are compared, as shown in Table 10.E.14. Use the most suitable test, at a significance level of 5%, to test whether both samples come from populations with the same medians.
TABLE 10.E.14 Diameter of Parts Produced in Two Machines

Mach. A   48.50   48.65   48.58   48.55   48.66   48.64   48.50   48.72
Mach. B   48.75   48.64   48.80   48.85   48.78   48.79   49.20
Solution
Step 1: By applying the normality test to both samples, we can see that the data from machine B do not follow a normal distribution. So, the most suitable test to compare the medians of two independent populations is the Mann-Whitney U test.
Step 2: Through the null hypothesis, the median diameters of the parts produced in both machines are the same, so:
H0: mA = mB
H1: mA ≠ mB
Step 3: The significance level to be considered is 5%.
Step 4: Calculation of the U statistic:
(a) N1 = 7 (sample size from machine B)
    N2 = 8 (sample size from machine A)
(b) Combined sample and respective ranks (Table 10.E.15):
TABLE 10.E.15 Combined Data

Data    Machine   Ranks
48.50   A         1.5
48.50   A         1.5
48.55   A         3
48.58   A         4
48.64   A         5.5
48.64   B         5.5
48.65   A         7
48.66   A         8
48.72   A         9
48.75   B         10
48.78   B         11
48.79   B         12
48.80   B         13
48.85   B         14
49.20   B         15
(c) R1 = 80.5 (sum of the ranks from machine B, the sample with the smallest number of observations); R2 = 39.5 (sum of the ranks from machine A, the sample with the largest number of observations).
(d) Calculation of U1 and U2:

U1 = N1·N2 + N1(N1 + 1)/2 − R1 = 7 × 8 + (7 × 8)/2 − 80.5 = 3.5
U2 = N1·N2 + N2(N2 + 1)/2 − R2 = 7 × 8 + (8 × 9)/2 − 39.5 = 52.5
(e) Calculation of the Mann-Whitney U statistic:
Ucal = min(U1, U2) = 3.5
Step 5: According to Table J in the Appendix, for N1 = 7, N2 = 8, and P(Ucal < Uc) = α/2 = 0.025 (bilateral test), the critical value of the Mann-Whitney U statistic is Uc = 10.
Step 6: Decision: since the calculated statistic is in the critical region, that is, Ucal < 10, the null hypothesis is rejected, which allows us to conclude, with a 95% confidence level, that the medians of both populations are different.
If we use the P-value instead of the critical value of the statistic, Steps 5 and 6 will be:
Step 5: According to Table J in the Appendix, the unilateral probability P1 associated with Ucal = 3.5, for N1 = 7 and N2 = 8, is less than 0.005. For a bilateral test, this probability must be doubled (P < 0.01).
Step 6: Decision: since P < 0.05, we must reject H0.
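For reference, the same test can be sketched in Python with scipy’s mannwhitneyu; the group assignment below follows the combined-ranks table (Table 10.E.15), and the variable names are ours:

```python
from scipy.stats import mannwhitneyu

# Diameters (in mm) from Example 10.9, grouped as in Table 10.E.15:
# machine A has N2 = 8 parts and machine B has N1 = 7
machine_a = [48.50, 48.65, 48.58, 48.55, 48.66, 48.64, 48.50, 48.72]
machine_b = [48.75, 48.64, 48.80, 48.85, 48.78, 48.79, 49.20]

# Bilateral test; scipy reports the U statistic of the first sample,
# which here coincides with min(U1, U2) = 3.5
result = mannwhitneyu(machine_a, machine_b, alternative='two-sided')

print(result.statistic)      # 3.5
print(result.pvalue < 0.05)  # True, so H0 is rejected at the 5% level
```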
Example 10.10: Applying the Mann-Whitney U Test to Large Samples
As described previously, as the sample size grows (N2 > 20), the Mann-Whitney distribution becomes more similar to a standard normal distribution. Even though the data in Example 10.9 represent a small sample (N2 = 8), what would be the value of z in this case, using Expression (10.13)? Interpret the result.
Solution

Zcal = (U − N1·N2/2) / √{ [N1·N2 / (N(N − 1))] · [(N³ − N)/12 − Σ (tj³ − tj)/12] }
     = (3.5 − 7 × 8/2) / √{ [7 × 8 / (15 × 14)] · [(15³ − 15)/12 − ((2³ − 2) + (2³ − 2))/12] } = −2.840
The critical value of the zc statistic for a bilateral test, at a significance level of 5%, is ±1.96 (see Table E in the Appendix). Since Zcal = −2.840 < −1.96, the null hypothesis would also be rejected by the Z statistic, which allows us to conclude, with a 95% confidence level, that the population medians are different. Instead of comparing the value calculated to the critical value, we could obtain the P-value directly from Table E. Thus, the unilateral probability associated with the statistic Zcal = −2.840 is P1 = 0.0023. For a bilateral test, this probability must be doubled (P-value = 0.0046).
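The computation above can be reproduced with a few lines of Python; this is a direct transcription of Expression (10.13) under the tie structure of Example 10.9 (two parts at 48.50 and two at 48.64), with variable names of our own choosing:

```python
import math

N1, N2 = 7, 8   # sample sizes (machine B and machine A)
N = N1 + N2
U = 3.5         # Mann-Whitney statistic from Example 10.9
ties = [2, 2]   # tied groups in the combined sample: 48.50 and 48.64

# Expression (10.13): z = (U - N1*N2/2) / sqrt(tie-corrected variance)
tie_term = sum(t**3 - t for t in ties) / 12
variance = (N1 * N2 / (N * (N - 1))) * ((N**3 - N) / 12 - tie_term)
z = (U - N1 * N2 / 2) / math.sqrt(variance)

print(round(z, 3))  # -2.84, matching Example 10.10
```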
10.4.2.1 Solving the Mann-Whitney Test Using SPSS Software
The use of the images in this section has been authorized by the International Business Machines Corporation©. The data in Example 10.9 are available in the file Mann-Whitney_Test.sav. Since group 1 is the one with the smallest number of observations, in Data → Define Variable Properties …, we assign value 1 to group B and value 2 to group A for the variable Machine. In order to elaborate the Mann-Whitney test on SPSS, we must click on Analyze → Nonparametric Tests → Legacy Dialogs → 2 Independent Samples …, as shown in Fig. 10.40. After that, we should insert the variable Diameter in the box Test Variable List and the variable Machine in Grouping Variable, defining the respective groups. Let’s select the option Mann-Whitney U in Test Type, as shown in Fig. 10.41. Finally, let’s click on OK to obtain Figs. 10.42 and 10.43. Fig. 10.42 shows the mean and the sum of the ranks for each group, while Fig. 10.43 shows the statistic of the test. The results in Fig. 10.42 are similar to the ones calculated in Example 10.9. According to Fig. 10.43, the result of the Mann-Whitney U statistic is 3.50, the same value calculated in Example 10.9. The bilateral probability associated with the U statistic is P = 0.002 (we saw in Example 10.9 that this probability is less than 0.01). For the same data in Example 10.9, if we had to calculate the Z statistic and the respective associated bilateral probability, the result would be Zcal = −2.840 and P = 0.005, similar to the values calculated in Example 10.10. For both tests, as the associated bilateral probability is less than 0.05, the null hypothesis is rejected, which allows us to conclude that the medians of both populations are different.
FIG. 10.40 Procedure to elaborate the Mann-Whitney test on SPSS.
FIG. 10.41 Selecting the variables and Mann-Whitney test.
FIG. 10.42 Ranks.
FIG. 10.43 Mann-Whitney test for Example 10.9 on SPSS.
10.4.2.2 Solving the Mann-Whitney Test Using Stata Software
The use of the images presented in this section has been authorized by Stata Corp LP©. The Mann-Whitney test is elaborated on Stata from the command ranksum (equality test for nonpaired data), by using the following syntax: ranksum variable*, by (groups*)
FIG. 10.44 Results of the Mann-Whitney test for Examples 10.9 and 10.10 on Stata.
where the term variable* must be replaced by the quantitative variable studied and the term groups* by the categorical variable that represents the groups. Let’s open the file Mann-Whitney_Test.dta that contains the data from Examples 10.9 and 10.10. Both groups are represented by the variable machine and the quality characteristic by the variable diameter. Thus, the command to be typed is: ranksum diameter, by (machine)
The results obtained are shown in Fig. 10.44. We can see that the calculated value of the statistic (−2.840) corresponds to the value calculated in Example 10.10, for large samples, from Expression (10.13). The probability associated with the statistic for a bilateral test is 0.0045. Since P < 0.05, we must reject the null hypothesis, which allows us to conclude, with a 95% confidence level, that the population medians are different.
10.5 TESTS FOR K PAIRED SAMPLES
These tests analyze the differences between k (three or more) paired or related samples. According to Siegel and Castellan (2006), the null hypothesis to be tested is that k samples have been drawn from the same population. The main tests for k paired samples are Cochran’s Q test (for binary variables) and Friedman’s test (for ordinal variables).
10.5.1 Cochran’s Q Test
Cochran’s Q test for k paired samples is an extension of the McNemar test for two samples, and it aims to test whether the frequencies or proportions of three or more related groups differ significantly from one another. As in the McNemar test, the data are binary. According to Siegel and Castellan (2006), Cochran’s Q test compares the characteristics of several individuals or the characteristics of the same individual observed under different conditions. For example, we can analyze whether k items differ significantly for N individuals. Or we may have only one item to analyze, and the objective is to compare the answers of N individuals under k different conditions. Let’s suppose that the study data are organized in a table with N rows and k columns, in which N is the number of cases and k is the number of groups or conditions. Through the null hypothesis of Cochran’s Q test, there are no differences between the frequencies or proportions of success (p) of the k related groups, that is, the proportion of a desired answer (success) is the same in each column. Through the alternative hypothesis, there are differences between at least two groups, so:
H0: p1 = p2 = … = pk
H1: ∃(i, j) pi ≠ pj, i ≠ j
Cochran’s Q statistic is given by:

Qcal = k(k − 1) Σ_{j=1}^{k} (Gj − Ḡ)² / [k Σ_{i=1}^{N} Li − Σ_{i=1}^{N} Li²] = (k − 1) [k Σ_{j=1}^{k} Gj² − (Σ_{j=1}^{k} Gj)²] / [k Σ_{i=1}^{N} Li − Σ_{i=1}^{N} Li²]   (10.14)
which approximately follows a χ² distribution with k − 1 degrees of freedom, where:
Gj: the total number of successes in the jth column;
Ḡ: the mean of the Gj;
Li: the total number of successes in the ith row.
The value calculated must be compared to the critical value of the χ² distribution (Table D in the Appendix). This table provides the critical values of χ²c, where P(χ²cal > χ²c) = α (for a right-tailed unilateral test). If the value of the statistic is in the critical region, that is, if Qcal > χ²c, we must reject H0. Otherwise, we do not reject H0. The probability associated with the calculated value of the statistic (P-value) can also be obtained from Table D. In this case, the null hypothesis is rejected if P ≤ α; otherwise, we do not reject H0.
Example 10.11: Applying Cochran’s Q Test
We are interested in assessing 20 consumers’ level of satisfaction with three supermarkets, investigating whether their clients are satisfied (score 1) or not (score 0) with the quality, variety, and price of the products of each supermarket. Check the hypothesis that the probability of receiving a good evaluation from clients is the same for all three supermarkets, considering a significance level of 10%. Table 10.E.16 shows the results of the evaluation.
TABLE 10.E.16 Results of the Evaluation for All Three Supermarkets

Consumer   A         B         C         Li        Li²
1          1         1         1         3         9
2          1         0         1         2         4
3          0         1         1         2         4
4          0         0         0         0         0
5          1         1         0         2         4
6          1         1         1         3         9
7          0         0         1         1         1
8          1         0         1         2         4
9          1         1         1         3         9
10         0         0         1         1         1
11         0         0         0         0         0
12         1         1         0         2         4
13         1         0         1         2         4
14         1         1         1         3         9
15         0         1         1         2         4
16         0         1         1         2         4
17         1         1         1         3         9
18         1         1         1         3         9
19         0         0         1         1         1
20         0         0         1         1         1
Total      G1 = 11   G2 = 11   G3 = 16   ΣLi = 38  ΣLi² = 90
FIG. 10.45 Critical region of Example 10.11.
Solution
Step 1: The most suitable test to compare proportions of three or more paired groups is Cochran’s Q test.
Step 2: Through the null hypothesis, the proportion of successes (score 1) is the same for all three supermarkets. Through the alternative hypothesis, the proportion of satisfied clients differs in at least two supermarkets, so:
H0: p1 = p2 = p3
H1: ∃(i, j) pi ≠ pj, i ≠ j
Step 3: The significance level to be considered is 10%.
Step 4: The calculation of Cochran’s Q statistic, from Expression (10.14), is given by:

Qcal = (k − 1) [k Σ Gj² − (Σ Gj)²] / [k Σ Li − Σ Li²] = (3 − 1) × [3 × (11² + 11² + 16²) − 38²] / (3 × 38 − 90) = 4.167

Step 5: The critical region (CR) of the χ² distribution (Table D in the Appendix), considering α = 10% and ν = k − 1 = 2 degrees of freedom, is shown in Fig. 10.45.
Step 6: Decision: since the value calculated is not in the critical region, that is, Qcal < 4.605, the null hypothesis is not rejected, which allows us to conclude, with a 90% confidence level, that the proportion of satisfied clients is the same for all three supermarkets.
If we use the P-value instead of the statistic’s critical value, Steps 5 and 6 will be:
Step 5: According to Table D in the Appendix, for ν = 2 degrees of freedom, the probability associated with the statistic Qcal = 4.167 is greater than 0.10 (P-value > 0.10).
Step 6: Decision: since P > 0.10, we should not reject H0.
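Expression (10.14) is simple enough to verify directly; below is a short Python sketch with numpy and scipy (our variable names), using the data from Table 10.E.16:

```python
import numpy as np
from scipy.stats import chi2

# Binary evaluations from Table 10.E.16: 20 consumers x 3 supermarkets
scores = np.array([
    [1, 1, 1], [1, 0, 1], [0, 1, 1], [0, 0, 0], [1, 1, 0],
    [1, 1, 1], [0, 0, 1], [1, 0, 1], [1, 1, 1], [0, 0, 1],
    [0, 0, 0], [1, 1, 0], [1, 0, 1], [1, 1, 1], [0, 1, 1],
    [0, 1, 1], [1, 1, 1], [1, 1, 1], [0, 0, 1], [0, 0, 1],
])

k = scores.shape[1]
G = scores.sum(axis=0)  # successes per supermarket: 11, 11, 16
L = scores.sum(axis=1)  # successes per consumer

# Expression (10.14) and the chi-square approximation with nu = k - 1
q = (k - 1) * (k * (G**2).sum() - G.sum()**2) / (k * L.sum() - (L**2).sum())
p_value = chi2.sf(q, k - 1)

print(round(q, 3))        # 4.167
print(round(p_value, 3))  # 0.125, the P-value reported by SPSS
```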
10.5.1.1 Solving Cochran’s Q Test by Using SPSS Software
The use of the images in this section has been authorized by the International Business Machines Corporation©. The data in Example 10.11 are available in the file Cochran_Q_Test.sav. The procedure for elaborating Cochran’s Q test on SPSS is shown next. First of all, let’s click on Analyze → Nonparametric Tests → Legacy Dialogs → K Related Samples …, as shown in Fig. 10.46. After that, we must insert variables A, B, and C in the box Test Variables, and select the option Cochran’s Q in Test Type, as shown in Fig. 10.47. Finally, let’s click on OK to obtain the results of the test. Fig. 10.48 shows the frequencies of each group and Fig. 10.49 shows the result of the statistic. The value of Cochran’s Q statistic is 4.167, the same value calculated in Example 10.11. The probability associated with the statistic is 0.125 (we saw in Example 10.11 that P > 0.10). Since P > α, the null hypothesis is not rejected, which allows us to conclude, with a 90% confidence level, that there are no differences in the proportion of satisfied clients among the three supermarkets.
10.5.1.2 Solving Cochran’s Q Test by Using Stata Software
The use of the images presented in this section has been authorized by Stata Corp LP©. The data from Example 10.11 are also available in the file Cochran_Q_Test.dta. The command used to elaborate the test is cochran, followed by the k paired variables; in our case, the variables that represent the three supermarkets: a, b, and c. So, the command to be typed is:
cochran a b c
FIG. 10.46 Procedure for elaborating Cochran’s Q test on SPSS.
FIG. 10.47 Selecting the variables and Cochran’s Q test.
FIG. 10.48 Frequencies.
FIG. 10.49 Cochran’s Q test for Example 10.11 on SPSS.
FIG. 10.50 Results of Cochran’s Q test for Example 10.11 on Stata.
The results of Cochran’s Q test on Stata are in Fig. 10.50. We can verify that the result of the statistic and the respective associated probability are the same as the ones calculated in Example 10.11, and also generated on SPSS, which allows us to conclude, with a 90% confidence level, that the proportion of satisfied clients is the same for all three supermarkets.
10.5.2 Friedman’s Test
Friedman’s test is applied to quantitative or qualitative variables in an ordinal scale, and its main objective is to verify whether k paired samples are drawn from the same population. It is an extension of the Wilcoxon test for three or more paired samples. It is also an alternative to the analysis of variance when its hypotheses (normality of the data and homogeneity of the variances) are violated or when the sample size is too small. The data are represented in a double-entry table with N rows and k columns, in which the rows represent the several individuals or corresponding sets of individuals, and the columns represent the different conditions. The null hypothesis of Friedman’s test assumes that the k samples (columns) come from the same population or from populations with the same median (m). For a bilateral test, we have:
H0: m1 = m2 = … = mk
H1: ∃(i, j) mi ≠ mj, i ≠ j
To apply Friedman’s test, we must attribute ranks from 1 to k to the elements of each row. For example, rank 1 is attributed to the lowest observation in the row and rank k to the highest observation. If there are ties, we attribute the mean of the corresponding ranks. Friedman’s statistic is given by:

Fcal = [12 / (N·k·(k + 1))] · Σ_{j=1}^{k} Rj² − 3N(k + 1)   (10.15)
where:
N: the number of rows;
k: the number of columns;
Rj: the sum of the ranks in column j.
However, according to Siegel and Castellan (2006), whenever there are ties between the ranks of the same group or row, Friedman’s statistic must be corrected to take the changes in the sampling distribution into account, as follows:

F′cal = [12 Σ_{j=1}^{k} Rj² − 3N²k(k + 1)²] / { N·k·(k + 1) + [N·k − Σ_{i=1}^{N} Σ_{j=1}^{gi} tij³] / (k − 1) }   (10.16)

where:
gi: the number of sets with tied observations in the ith group, including sets of size 1;
tij: the size of the jth set of ties in the ith group.
The value calculated must be compared to the critical value of the sampling distribution. When N and k are small (k = 3 and 3 < N < 13, or k = 4 and 2 < N < 8, or k = 5 and 3 < N < 5), we must use Table K in the Appendix, which shows the critical values of Friedman’s statistic (Fc), where P(Fcal > Fc) = α (for a right-tailed unilateral test). For high values of N and k, the sampling distribution can be approximated by the χ² distribution with ν = k − 1 degrees of freedom. Therefore, if the value of the Fcal statistic is in the critical region, that is, if Fcal > Fc for small N and k, or Fcal > χ²c for high N and k, we must reject the null hypothesis. Otherwise, we do not reject H0.
Example 10.12: Applying Friedman’s Test
A research study is carried out in order to verify the efficacy of breakfast in weight loss; 15 patients were followed up for 3 months. Data regarding the patients’ weight were collected during three different periods, as shown in Table 10.E.17: before the treatment (BT), after the treatment (AT), and after 3 months of treatment (A3M). Check whether the treatment had any results. Assume that α = 5%.
TABLE 10.E.17 Patients’ Weight in Each Period

            Period
Patient   BT    AT    A3M
1         65    62    58
2         89    85    80
3         96    95    95
4         90    84    79
5         70    70    66
6         72    65    62
7         87    84    77
8         74    74    69
9         66    64    62
10        135   132   132
11        82    75    71
12        76    73    67
13        94    90    88
14        80    80    77
15        73    70    68
Solution
Step 1: Since the data do not follow a normal distribution, Friedman’s test is an alternative to ANOVA to verify whether the three paired samples are drawn from the same population.
Step 2: Through the null hypothesis, there is no difference among the treatments. Through the alternative hypothesis, the treatment had some results, so:
H0: m1 = m2 = m3
H1: ∃(i, j) mi ≠ mj, i ≠ j
Step 3: The significance level to be considered is 5%.
Step 4: In order to calculate Friedman’s statistic, first we must attribute ranks from 1 to 3 to the elements of each row, as shown in Table 10.E.18. If there are ties, we attribute the mean of the corresponding ranks.
TABLE 10.E.18 Attributing Ranks

                      Period
Patient             BT      AT      A3M
1                   3       2       1
2                   3       2       1
3                   3       1.5     1.5
4                   3       2       1
5                   2.5     2.5     1
6                   3       2       1
7                   3       2       1
8                   2.5     2.5     1
9                   3       2       1
10                  3       1.5     1.5
11                  3       2       1
12                  3       2       1
13                  3       2       1
14                  2.5     2.5     1
15                  3       2       1
Rj                  43.5    30.5    16
Mean of the ranks   2.900   2.033   1.067
As shown in Table 10.E.18, there are two tied ranks for patient 3, two for patient 5, two for patient 8, two for patient 10, and two for patient 14. Therefore, the total number of sets of ties of size 2 is 5, and the total number of sets of size 1 is 35. Thus:

Σ_{i=1}^{N} Σ_{j=1}^{gi} tij³ = 35 × 1³ + 5 × 2³ = 75

Since there are ties, the real value of Friedman’s statistic is calculated from Expression (10.16), as follows:

F′cal = [12 × (43.5² + 30.5² + 16²) − 3 × 15² × 3 × 4²] / [15 × 3 × 4 + (15 × 3 − 75)/2] = 27.527
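As a cross-check of the calculation above, scipy’s friedmanchisquare ranks the three measurements within each patient and applies a tie correction; on these data it reproduces the corrected statistic (a sketch with our variable names, assuming scipy is installed):

```python
from scipy.stats import friedmanchisquare

# Patients' weights from Table 10.E.17, one list per period
bt  = [65, 89, 96, 90, 70, 72, 87, 74, 66, 135, 82, 76, 94, 80, 73]
at  = [62, 85, 95, 84, 70, 65, 84, 74, 64, 132, 75, 73, 90, 80, 70]
a3m = [58, 80, 95, 79, 66, 62, 77, 69, 62, 132, 71, 67, 88, 77, 68]

# The tie-corrected statistic matches F'cal from Expression (10.16)
stat, p_value = friedmanchisquare(bt, at, a3m)

print(round(stat, 3))   # 27.527
print(p_value < 0.005)  # True
```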
FIG. 10.51 Critical region of Example 10.12.
If we applied Expression (10.15) without the correction factor, the result of Friedman’s test would be 25.233.
Step 5: Since k = 3 and N = 15, let’s use the χ² distribution. The critical region (CR) of the χ² distribution (Table D in the Appendix), considering α = 5% and ν = k − 1 = 2 degrees of freedom, is shown in Fig. 10.51.
Step 6: Decision: since the value calculated is in the critical region, that is, F′cal > 5.991, we reject the null hypothesis, which allows us to conclude, with a 95% confidence level, that the treatment has good results.
If we use the P-value instead of the critical value of the statistic, Steps 5 and 6 will be:
Step 5: According to Table D in the Appendix, for ν = 2 degrees of freedom, the probability associated with the statistic F′cal = 27.527 is less than 0.005 (P-value < 0.005).
Step 6: Decision: since P < 0.05, we reject H0.

Step 6: Decision: since the value calculated is in the critical region, that is, χ²cal > 12.592, we must reject the null hypothesis, which allows us to conclude, with a 95% confidence level, that there is a difference in productivity among the four shifts.
If we use the P-value instead of the critical value of the statistic, Steps 5 and 6 will be:
Step 5: According to Table D in the Appendix, the probability associated with the statistic χ²cal = 13.143, for ν = 6 degrees of freedom, is between 0.05 and 0.025.
Step 6: Decision: since P < 0.05, we reject H0.
FIG. 10.57 Critical region of Example 10.13.
FIG. 10.58 Selecting the variables.
10.6.1.1 Solving the χ² Test for k Independent Samples on SPSS
The use of the images in this section has been authorized by the International Business Machines Corporation©. The data from Example 10.13 are available in the file Chi-Square_k_Independent_Samples.sav. Let’s click on Analyze → Descriptive Statistics → Crosstabs …. After that, we should insert the variable Productivity in Row(s) and the variable Shift in Column(s), as shown in Fig. 10.58. In Statistics …, let’s select the option Chi-square, as shown in Fig. 10.59. If we wish to obtain the observed and expected frequency distribution table, in Cells …, we must select the options Observed and Expected in Counts, as shown in Fig. 10.60. Finally, let’s click on Continue and OK. The results can be seen in Figs. 10.61 and 10.62. From Fig. 10.62, we can see that the value of χ² is 13.143, the same value calculated in Example 10.13. For a confidence level of 95%, since P = 0.041 < 0.05 (we saw in Example 10.13 that this probability is between 0.025 and 0.05), we must reject the null hypothesis, which allows us to conclude, with a 95% confidence level, that there is a difference in productivity among the four shifts.
10.6.1.2 Solving the χ² Test for k Independent Samples on Stata
The use of the images presented in this section has been authorized by Stata Corp LP©. The data in Example 10.13 are available in the file Chi-Square_k_Independent_Samples.dta. The variables being studied are productivity and shift. The syntax of the χ² test for k independent samples is similar to the one presented in Section 10.4.1 for two independent samples. Thus, we must use the command tabulate, or simply tab, followed by the names of the variables being studied, besides the option chi2, or simply ch. The difference is that, in this case, the categorical variable that represents the groups has more than two categories. Therefore, the syntax of the test for the data in Example 10.13 is:
tabulate productivity shift, chi2
FIG. 10.59 Selecting the χ² statistic.
FIG. 10.60 Selecting the observed and expected frequencies distribution table.
FIG. 10.61 Distribution of the observed and expected frequencies.
FIG. 10.62 Results of the χ² test for Example 10.13 on SPSS.
FIG. 10.63 Results of the χ² test for Example 10.13 on Stata.
or simply: tab productivity shift, ch
The results can be seen in Fig. 10.63. The value of the χ² statistic, as well as the probability associated with it, is the same as in the results presented in Example 10.13, and also generated on SPSS.
10.6.2 Kruskal-Wallis Test
The Kruskal-Wallis test aims at verifying if k independent samples (k > 2) come from the same population. It is an alternative to the analysis of variance when the hypotheses of data normality and equality of variances are violated, or when the
sample is small, or even when the variable is measured in an ordinal scale. For k = 2, the Kruskal-Wallis test is equivalent to the Mann-Whitney test. The data are represented in a double-entry table with N rows and k columns, in which the rows represent the observations and the columns represent the different samples or groups. The null hypothesis of the Kruskal-Wallis test assumes that all k samples come from the same population or from identical populations with the same median (m). For a bilateral test, we have:
H0: m1 = m2 = … = mk
H1: ∃(i, j) mi ≠ mj, i ≠ j
In the Kruskal-Wallis test, all N observations (N is the total number of observations in the global sample) are organized in a single series, and we attribute ranks to each element in the series. Thus, rank 1 is attributed to the lowest observation in the global sample, rank 2 to the second lowest observation, and so on, up to rank N. If there are ties, we attribute the mean of the corresponding ranks. The Kruskal-Wallis statistic (H) is given by:

Hcal = [12 / (N(N + 1))] · Σ_{j=1}^{k} (Rj² / nj) − 3(N + 1)   (10.17)
where:
k: the number of samples or groups;
nj: the number of observations in sample or group j;
N: the number of observations in the global sample;
Rj: the sum of the ranks in sample or group j.
However, according to Siegel and Castellan (2006), whenever there are ties between two or more ranks, regardless of the group, the Kruskal-Wallis statistic must be corrected to take the changes in the sampling distribution into account, so:

H′cal = Hcal / { 1 − [Σ_{j=1}^{g} (tj³ − tj)] / (N³ − N) }   (10.18)
where:
g: the number of clusters of tied ranks;
tj: the number of tied ranks in the jth cluster.
According to Siegel and Castellan (2006), the main objective of correcting for these ties is to increase the value of H, making the result more significant. The value calculated must be compared to the critical value of the sampling distribution. If k = 3 and n1, n2, n3 ≤ 5, we must use Table L in the Appendix, which shows the critical values of the Kruskal-Wallis statistic (Hc), where P(Hcal > Hc) = α (for a right-tailed unilateral test). Otherwise, the sampling distribution can be approximated by the χ² distribution with ν = k − 1 degrees of freedom. Therefore, if the value of the Hcal statistic is in the critical region, that is, if Hcal > Hc for k = 3 and n1, n2, n3 ≤ 5, or Hcal > χ²c otherwise, the null hypothesis is rejected, which allows us to conclude that there is a difference between at least two samples. Otherwise, we do not reject H0.
Example 10.14: Applying the Kruskal-Wallis Test
A group of 36 patients with the same level of stress was submitted to three different treatments: 12 patients were submitted to treatment A, 12 to treatment B, and the remaining 12 to treatment C. At the end of the treatment, each patient answered a questionnaire that evaluates a person’s stress level, classified in three phases: the resistance phase, for those who got a maximum of 3 points, the warning phase, for those who got more than 6 points, and the exhaustion phase, for those who got more than 8 points. The results can be seen in Table 10.E.20. Verify whether the three treatments lead to the same results. Consider a significance level of 1%.
Nonparametric Tests Chapter 10
TABLE 10.E.20 Stress Level After the Treatment

Treatment A: 6, 5, 4, 5, 3, 4, 5, 2, 4, 3, 5, 2
Treatment B: 6, 7, 5, 8, 7, 8, 6, 9, 8, 6, 8, 8
Treatment C: 5, 9, 8, 7, 9, 11, 7, 8, 9, 10, 7, 8
Solution
Step 1: Since the variable is measured on an ordinal scale, the most suitable test to verify if the three independent samples are drawn from the same population is the Kruskal-Wallis test.
Step 2: Under the null hypothesis, there is no difference among the treatments; under the alternative hypothesis, there is a difference between at least two treatments, so:
H0: μ1 = μ2 = μ3
H1: ∃(i, j) μi ≠ μj, i ≠ j
Step 3: The significance level to be considered is 1%.
Step 4: In order to calculate the Kruskal-Wallis statistic, first of all, we must attribute ranks from 1 to 36 to each element in the global sample, as shown in Table 10.E.21. In case of ties, we attribute the mean of the corresponding ranks.
TABLE 10.E.21 Attributing Ranks

A: 15.5, 10.5, 6, 10.5, 3.5, 6, 10.5, 1.5, 6, 3.5, 10.5, 1.5 — Sum: 85.5; Mean: 7.13
B: 15.5, 20, 10.5, 26.5, 20, 26.5, 15.5, 32.5, 26.5, 15.5, 26.5, 26.5 — Sum: 262; Mean: 21.83
C: 10.5, 32.5, 26.5, 20, 32.5, 36, 20, 26.5, 32.5, 35, 20, 26.5 — Sum: 318.5; Mean: 26.54
Since there are ties, the Kruskal-Wallis statistic is calculated from Expression (10.18). First of all, we calculate the value of H:

Hcal = [12 / (N(N + 1))] Σ_{j=1}^{k} (Rj² / nj) − 3(N + 1) = [12 / (36 × 37)] × [(85.5² + 262² + 318.5²) / 12] − 3 × 37

Hcal = 22.181

From Tables 10.E.20 and 10.E.21, we can verify that there are eight clusters of tied ranks. For example, two observations scored 2 points (sharing rank 1.5), two scored 3 points (rank 3.5), three scored 4 points (rank 6), and so on, successively, up to four observations that scored 9 points (rank 32.5). The Kruskal-Wallis statistic is corrected to:

H′cal = H / [1 − Σ_{j=1}^{g} (tj³ − tj) / (N³ − N)] = 22.181 / [1 − (2³ − 2 + 2³ − 2 + 3³ − 3 + ⋯ + 4³ − 4) / (36³ − 36)] = 22.662
Step 5: Since n1, n2, n3 > 5, let's use the χ² distribution. The critical region (CR) of the χ² distribution (Table D in the Appendix), considering α = 1% and ν = k − 1 = 2 degrees of freedom, is shown in Fig. 10.64.
FIG. 10.64 Critical region of Example 10.14.
PART IV Statistical Inference
Step 6: Decision: since the value calculated is in the critical region, that is, H′cal > 9.210, we must reject the null hypothesis, which allows us to conclude, with a 99% confidence level, that there is a difference among the treatments. If we use the P-value instead of the critical value of the statistic, Steps 5 and 6 become:
Step 5: According to Table D in the Appendix, for ν = 2 degrees of freedom, the probability associated with the statistic H′cal = 22.662 is less than 0.005 (P < 0.005).
Step 6: Decision: since P < 0.01, we reject H0.
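The steps above can be checked numerically. The sketch below is an illustration, not part of the original text; it assumes SciPy is available and recomputes the rank sums of Table 10.E.21, the uncorrected statistic, and the tie-corrected statistic of Expression (10.18) for the data of Table 10.E.20:

```python
from collections import Counter
from scipy.stats import rankdata

# Data from Table 10.E.20 (stress scores per treatment)
a = [6, 5, 4, 5, 3, 4, 5, 2, 4, 3, 5, 2]
b = [6, 7, 5, 8, 7, 8, 6, 9, 8, 6, 8, 8]
c = [5, 9, 8, 7, 9, 11, 7, 8, 9, 10, 7, 8]

pooled = a + b + c
N = len(pooled)               # 36 observations in the global sample
ranks = rankdata(pooled)      # average ranks, as in Table 10.E.21

# Rank sums per group of n_j = 12 observations: 85.5, 262.0, 318.5
R = [ranks[:12].sum(), ranks[12:24].sum(), ranks[24:].sum()]

# Uncorrected Kruskal-Wallis statistic
H = 12 / (N * (N + 1)) * sum(r ** 2 / 12 for r in R) - 3 * (N + 1)

# Tie correction of Expression (10.18): t_j tied observations per score value
t = Counter(pooled)
H_corr = H / (1 - sum(x ** 3 - x for x in t.values()) / (N ** 3 - N))

print(round(H, 3), round(H_corr, 3))  # 22.181 22.662
```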
10.6.2.1 Solving the Kruskal-Wallis Test by Using SPSS Software
The use of the images in this section has been authorized by the International Business Machines Corporation©. The data in Example 10.14 are available in the file Kruskal-Wallis_Test.sav. In order to elaborate the Kruskal-Wallis test in SPSS, let's click on Analyze → Nonparametric Tests → Legacy Dialogs → K Independent Samples …, as shown in Fig. 10.65. After that, we should insert the variable Result in the box Test Variable List, define the groups of the variable Treatment, and select the Kruskal-Wallis test, as shown in Fig. 10.66. Let's click on OK to obtain the results of the Kruskal-Wallis test. Fig. 10.67 shows the mean of the ranks for each group, matching the values calculated in Table 10.E.21. The value of the Kruskal-Wallis statistic and the significance level of the test are in Fig. 10.68. The value of the test is 22.662, equal to the value calculated in Example 10.14. The probability associated with the statistic is 0.000 (we saw in Example 10.14 that this probability is less than 0.005). Since P < 0.01, we reject the null hypothesis, which allows us to conclude, with a 99% confidence level, that there is a difference among the treatments.
FIG. 10.65 Procedure for elaborating the Kruskal-Wallis test on SPSS.
FIG. 10.66 Selecting the variable and defining the groups for the Kruskal-Wallis test.
FIG. 10.67 Ranks.
FIG. 10.68 Results of the Kruskal-Wallis test for Example 10.14 on SPSS.
10.6.2.2 Solving the Kruskal-Wallis Test by Using Stata
The use of the images presented in this section has been authorized by Stata Corp LP©. On Stata, the Kruskal-Wallis test is elaborated through the command kwallis, using the following syntax: kwallis variable*, by(groups*)
FIG. 10.69 Results of the Kruskal-Wallis test for Example 10.14 on Stata.
where the term variable* must be replaced by the quantitative or ordinal variable being studied and the term groups* by the categorical variable that represents the groups. Let’s open the file Kruskal-Wallis_Test.dta that contains the data from Example 10.14. All three groups are represented by the variable treatment and the characteristic analyzed by the variable result. Thus, the command to be typed is: kwallis result, by(treatment)
The result of the test can be seen in Fig. 10.69. Analogous to the results presented in Example 10.14 and generated in SPSS, Stata calculates the original value of the statistic (22.181) and the value with the correction factor for ties (22.662). Since the probability associated with the statistic is 0.000, we reject the null hypothesis, which allows us to conclude, with a 99% confidence level, that there is a difference among the treatments.
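As a cross-check outside SPSS and Stata (a sketch, not part of the original text), SciPy's kruskal function applies the same tie correction and reproduces the corrected statistic for the data of Example 10.14:

```python
from scipy.stats import kruskal

a = [6, 5, 4, 5, 3, 4, 5, 2, 4, 3, 5, 2]
b = [6, 7, 5, 8, 7, 8, 6, 9, 8, 6, 8, 8]
c = [5, 9, 8, 7, 9, 11, 7, 8, 9, 10, 7, 8]

# Tie-corrected statistic with a chi-square p-value on k - 1 = 2 df
stat, p = kruskal(a, b, c)
print(round(stat, 3), p)  # 22.662 and P < 0.005: reject H0 at 1%
```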
10.7 FINAL REMARKS
In the previous chapter, we studied parametric tests; this chapter was dedicated to the study of nonparametric tests. Nonparametric tests are classified according to the variables' level of measurement and to the sample size and, for each situation, the main types of nonparametric tests were studied. In addition, the advantages, disadvantages, and assumptions of each test were established. For each nonparametric test, we presented the main inherent concepts, the null and alternative hypotheses, the respective statistics, and the solution of the proposed examples in SPSS and in Stata. Whatever the main objective of their application, nonparametric tests can yield sound and interesting research results that are useful in any decision-making process. The correct use of each test, together with a conscious choice of the modeling software, must always be based on the underlying theory, without ignoring the researcher's experience and intuition.
10.8 EXERCISES
(1) In what situations are nonparametric tests applied? (2) What are the advantages and disadvantages of nonparametric tests? (3) What are the differences between the sign test and the Wilcoxon test for two paired samples? (4) Which test is an alternative to the t-test for one sample when the data distribution does not follow a normal distribution? (5) A group of 20 consumers tasted two types of coffee (A and B). At the end, they chose one of the brands, as shown in the table. Test the null hypothesis that there is no difference in these consumers’ preference, with a significance level of 5%.
Events        Brand A    Brand B    Total
Frequency        8          12        20
Proportion     0.40        0.60      1.00
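For a two-category preference such as this, the binomial (sign) test applies. A computational check, not part of the original text (it assumes SciPy ≥ 1.7, where binomtest replaced the older binom_test):

```python
from scipy.stats import binomtest

# 8 of 20 consumers chose brand A; H0: each brand is equally preferred (p = 0.5)
res = binomtest(8, n=20, p=0.5)
print(round(res.pvalue, 4))  # ≈ 0.5034, so H0 is not rejected at 5%
```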
(6) A group of 60 readers evaluated three novels and, at the end, they chose one of the three options, as shown in the table. Test the null hypothesis that there is no difference in these readers’ preference, with a significance level of 5%.
Events        Book A    Book B    Book C    Total
Frequency       29        15        16        60
Proportion    0.483     0.250     0.267      1.00
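With three categories, the chi-square goodness-of-fit test applies. A sketch of a computational check (not part of the original text; assumes SciPy), testing the observed frequencies against a uniform expectation of 20 per book:

```python
from scipy.stats import chisquare

# Observed preferences for books A, B, C; expected 20 each under H0
stat, p = chisquare([29, 15, 16])
print(round(stat, 1), round(p, 4))  # 6.1 and p ≈ 0.0474: reject H0 at 5%
```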
(7) A group of 20 teenagers went on the Points Diet for 30 days. Check and see if there was weight loss after the diet. Assume that α = 5%.
Before    After
58        56
67        62
72        65
88        84
77        72
67        68
75        76
69        62
104       97
66        65
58        59
59        60
61        62
67        63
73        65
58        58
67        62
67        64
78        72
85        80
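Since the samples are paired, the Wilcoxon signed-rank test fits here. A hedged sketch of a computational check (not part of the original text; assumes SciPy, and uses the before/after pairing transcribed in the table above):

```python
from scipy.stats import wilcoxon

before = [58, 67, 72, 88, 77, 67, 75, 69, 104, 66,
          58, 59, 61, 67, 73, 58, 67, 67, 78, 85]
after = [56, 62, 65, 84, 72, 68, 76, 62, 97, 65,
         59, 60, 62, 63, 65, 58, 62, 64, 72, 80]

# One-sided test: H1 is that weights decreased (before - after > 0);
# the zero difference (58, 58) is discarded, as in the classic procedure
res = wilcoxon(before, after, alternative='greater')
print(res.pvalue)  # well below 0.05: the diet reduced weight
```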
(8) Aiming to compare the average service times in two bank branches, data on 22 clients from each bank branch were collected, as shown in the table. Use the most suitable test, with a significance level of 5%, to test whether both samples come or do not come from populations with the same medians.

Bank Branch A    Bank Branch B
6.24             8.14
8.47             6.54
6.54             6.66
6.87             7.85
2.24             8.03
5.36             5.68
7.09             3.05
7.56             5.78
6.88             6.43
8.04             6.39
7.05             7.64
6.58             6.97
8.14             8.07
8.30             8.33
2.69             7.14
6.14             6.58
7.14             5.98
7.22             6.22
7.58             7.08
6.11             7.62
7.25             5.69
7.50             8.04
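For two independent samples, the Mann-Whitney U test is the usual choice. The sketch below (an illustration, not part of the original text; it assumes SciPy and the column assignment transcribed in the table above) shows how to run it, leaving the conclusion to the reader:

```python
from scipy.stats import mannwhitneyu

branch_a = [6.24, 8.47, 6.54, 6.87, 2.24, 5.36, 7.09, 7.56, 6.88, 8.04, 7.05,
            6.58, 8.14, 8.30, 2.69, 6.14, 7.14, 7.22, 7.58, 6.11, 7.25, 7.50]
branch_b = [8.14, 6.54, 6.66, 7.85, 8.03, 5.68, 3.05, 5.78, 6.43, 6.39, 7.64,
            6.97, 8.07, 8.33, 7.14, 6.58, 5.98, 6.22, 7.08, 7.62, 5.69, 8.04]

# Two-sided test of equal medians; compare res.pvalue with alpha = 0.05
res = mannwhitneyu(branch_a, branch_b, alternative='two-sided')
print(res.statistic, res.pvalue)
```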
(9) A group of 20 Business Administration students evaluated their level of learning based on three subjects studied in the field of Applied Quantitative Methods, by answering if their level of learning was high (1) or low (0). The results can be seen in the table. Check and see if the proportion of students with a high level of learning is the same for each subject. Consider a significance level of 2.5%.

Student    A    B    C
1          0    1    1
2          1    1    1
3          0    0    0
4          0    1    0
5          0    1    1
6          1    1    1
7          1    0    1
8          0    1    1
9          0    0    0
10         0    0    0
11         1    1    1
12         0    0    1
13         1    0    1
14         0    1    1
15         0    0    1
16         1    1    1
17         0    0    1
18         1    1    1
19         0    1    1
20         1    1    1
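This exercise involves k = 3 related binary samples per student, the setting of Cochran's Q test. A sketch of a computational check (not part of the original text; cochran_q is a hypothetical helper implementing the classical Q formula):

```python
# Cochran's Q for k related binary samples:
# Q = (k - 1) * (k * sum(Cj^2) - (sum Cj)^2) / (k * sum(Ri) - sum(Ri^2)),
# where Cj are column (subject-area) totals and Ri are row (student) totals
def cochran_q(rows):
    k = len(rows[0])
    col_totals = [sum(r[j] for r in rows) for j in range(k)]
    row_totals = [sum(r) for r in rows]
    num = (k - 1) * (k * sum(c * c for c in col_totals) - sum(col_totals) ** 2)
    den = k * sum(row_totals) - sum(r * r for r in row_totals)
    return num / den

# (A, B, C) answers per student, from the table above
answers = [(0, 1, 1), (1, 1, 1), (0, 0, 0), (0, 1, 0), (0, 1, 1),
           (1, 1, 1), (1, 0, 1), (0, 1, 1), (0, 0, 0), (0, 0, 0),
           (1, 1, 1), (0, 0, 1), (1, 0, 1), (0, 1, 1), (0, 0, 1),
           (1, 1, 1), (0, 0, 1), (1, 1, 1), (0, 1, 1), (1, 1, 1)]

q = cochran_q(answers)
print(round(q, 3))  # 8.727 > 7.378 (chi-square critical, 2 df, 2.5%): reject H0
```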
(10) A group of 15 consumers evaluated their level of satisfaction (1—somewhat dissatisfied, 2—somewhat satisfied, and 3—very satisfied) with three different bank services. The results can be seen in the table. Verify if there is a difference between the three services. Assume a significance level of 5%.

Consumer    A    B    C
1           3    2    3
2           2    2    2
3           1    2    1
4           3    2    2
5           1    1    1
6           3    2    1
7           3    3    2
8           2    2    1
9           3    2    2
10          2    1    1
11          1    1    2
12          3    1    1
13          3    2    1
14          2    1    2
15          3    1    2
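Ordinal ratings from the same consumers across three services call for the Friedman test. A sketch of a computational check (not part of the original text; assumes SciPy, whose friedmanchisquare also corrects for tied within-row ranks):

```python
from scipy.stats import friedmanchisquare

a = [3, 2, 1, 3, 1, 3, 3, 2, 3, 2, 1, 3, 3, 2, 3]  # service A ratings
b = [2, 2, 2, 2, 1, 2, 3, 2, 2, 1, 1, 1, 2, 1, 1]  # service B ratings
c = [3, 2, 1, 2, 1, 1, 2, 1, 2, 1, 2, 1, 1, 2, 2]  # service C ratings

stat, p = friedmanchisquare(a, b, c)
print(stat, p)  # p < 0.05: the three services differ
```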
Part V
Multivariate Exploratory Data Analysis

Two or more variables can relate to one another in several different ways. While one researcher may be interested in the study of the interrelationship between categorical (or nonmetric) variables, for example, in order to assess the existence of possible associations between its categories, another researcher may wish to create performance indicators (new variables) from the existence of correlations between the original metric variables. A third researcher may be interested in identifying homogeneous groups possibly formed from the existence of similarities in the variables between the observations of a certain dataset. In all of these situations, researchers may use multivariate exploratory techniques. Multivariate exploratory techniques, also known as interdependence methods, can probably be used in all fields of human knowledge in which researchers aim to study the relationship between the variables of a certain dataset, without intending to estimate confirmatory models. That is, without having to elaborate inferences regarding the findings for other observations, different from the ones considered in the analysis itself, since neither models nor equations are estimated to predict data behavior. This characteristic is crucial to distinguish the techniques studied in Part V of this book from those considered to be dependence methods, such as the simple and multiple regression models, binary and multinomial logistic regression models, and regression models for count data, all of them studied in Part VI.
Therefore, there is no definition of a predictor variable in exploratory models and, thus, their main objectives refer to the reduction or structural simplification of data, to the classification or clustering of observations and variables, to the investigation of the existence of correlation between metric variables, or association between categorical variables and between their categories, to the creation of performance rankings of observations from variables, and to the elaboration of perceptual maps. Exploratory techniques are considered extremely relevant for developing diagnostics regarding the behavior of the data being analyzed. Thus, their varied procedures are commonly adopted in a preliminary way, or even simultaneously, with the application of a certain confirmatory model. Based on pedagogical and conceptual criteria, we have chosen to discuss the two main sets of existing multivariate exploratory techniques in Part V; therefore, the chapters are structured in the following way: Chapter 11: Cluster Analysis Chapter 12: Principal Component Factor Analysis
The decision about the technique to be used also goes through the measurement scale of the variables available in the dataset, which can be categorical or metric (or even binary, a special case of categorization). The type of question itself, when collecting the data, in some situations, may result in a categorical or metric response, which will favor the use of one or more techniques to the detriment of others. Hence, the clear, precise, and preliminary definition of the research objectives is essential to obtain variables in the measurement scale suitable for the application of a certain technique that will serve as a tool for achieving the objectives proposed. While the cluster analysis techniques (Chapter 11), whose procedures can be hierarchical or nonhierarchical, are used when we wish to study similar behavior between the observations (individuals, companies, municipalities, countries, among other examples) regarding certain metric or binary variables and the possible existence of homogeneous clusters (cluster of observations), the principal component factor analysis (Chapter 12) can be chosen as the technique to be used when the main goal is the creation of new variables (factors, or cluster of variables) that capture the joint behavior of the
BOX V.1 Exploratory Techniques and Main Objectives

Cluster Analysis — Hierarchical (Metric or Binary scale): Sorting and allocation of the observations into groups that are internally homogeneous and heterogeneous between one another. Definition of an interesting number of groups.

Cluster Analysis — Nonhierarchical (Metric or Binary scale): Evaluation of the representativeness of each variable for the formation of a previously established number of groups. From a predefined number of groups, identification of the allocation of each observation.

Principal Component Factor Analysis (Metric scale): Identification of the correlations between the original variables for creating factors that represent the combination of those variables (reduction or structural simplification). Verification of the validity of previously established constructs. Construction of rankings through the creation of performance indicators from the factors. Extraction of orthogonal factors for future use in multivariate confirmatory techniques that require the absence of multicollinearity.
original metric variables. Chapter 11 also presents the procedures for elaborating the multidimensional scaling technique in SPSS and in Stata. It can be considered a natural extension of the cluster analysis, and it has as its main objectives to determine the relative positions (coordinates) of each observation in the dataset and to construct two-dimensional charts in which these coordinates are plotted. It is important to mention that even though they are not discussed in this book, correspondence analysis techniques are very useful when researchers intend to study possible associations between the variables and between their respective categories. While the simple correspondence analysis is applied to the study of the interdependence relationship between only two categorical variables, which characterizes it as a bivariate technique, the multiple correspondence analysis can be used for a larger number of categorical variables, being, in fact, a multivariate technique. For more details on correspondence analysis techniques, we recommend Fávero and Belfiore (2017). Box V.1 shows the main objectives of each one of the exploratory techniques discussed in Part V. Each chapter is structured according to the same presentation logic. First, we introduce the concepts regarding each technique, always followed by the algebraic solution of some practical exercises, from datasets elaborated primarily with a more educational focus. Next, the same exercises are solved in the statistical software packages IBM SPSS Statistics Software and Stata Statistical Software. We believe that this logic facilitates the study and understanding of the correct use of each of the techniques and the analysis of the results obtained.
In addition to this, the practical application of the models in SPSS and Stata also offers benefits to researchers, because, at any given moment, the results can be compared to the ones already obtained algebraically in the initial sections of each chapter, besides providing an opportunity to use these important software packages. At the end of each chapter, additional exercises are proposed, whose answers, presented through the outputs generated in SPSS, are available at the end of the book.
Chapter 11
Cluster Analysis

Maybe Hamlet is right. We could be bounded in a nutshell, but counting ourselves kings of infinite space.
Stephen Hawking
11.1 INTRODUCTION

Cluster analysis represents a set of very useful exploratory techniques that can be applied whenever we intend to verify the existence of similar behavior between observations (individuals, companies, municipalities, countries, among other examples) in relation to certain variables, and there is the intention of creating groups or clusters, in which an internal homogeneity prevails. In this regard, this set of techniques has as its main objective to allocate observations to a relatively small number of clusters that are internally homogeneous and heterogeneous between themselves, and that represent the joint behavior of the observations from certain variables. That is, the observations of a certain group must be relatively similar to one another, in relation to the variables inserted in the analysis, and significantly different from the observations found in other groups. Clustering techniques are considered exploratory, or interdependent, since their applications do not have a predictive nature for other observations not initially present in the sample. Moreover, the inclusion of new observations into the dataset makes it necessary to reapply the modeling, so that, possibly, new clusters can be generated. Besides, the inclusion of a new variable can also generate a complete rearrangement of the observations in the groups. Researchers can choose to develop a cluster analysis when their main goal is to sort and allocate observations to groups and, from then on, to analyze what the ideal number of clusters formed is. Or they can, a priori, define the number of groups they wish to create, based on certain criteria, and verify how the sorting and allocation of observations behave in that specific number of groups. Regardless of the objective, clustering will continue being exploratory.
If a researcher aims to use a technique to, in fact, confirm the creation of groups and to make the analysis predictive, he can use techniques as, for example, discriminant analysis or multinomial logistic regression. Elaborating a cluster analysis does not require vast knowledge of matrix algebra or statistics, different from techniques such as factor analysis and correspondence analysis. The researcher interested in applying a cluster analysis needs to, starting from the definition of the research objectives, choose a certain distance or similarity measure that will be the basis for the observations to be considered less or much closer, and a certain agglomeration schedule that will have to be defined between hierarchical and nonhierarchical methods. Therefore, he will be able to analyze, interpret, and compare the outcomes. It is important to highlight that the outcomes obtained through hierarchical and nonhierarchical agglomeration schedules can be compared and, in this regard, the researcher is free to develop the technique, using one method or another, and to reapply it, if he deems necessary. While hierarchical schedules allow us to identify the sorting and allocation of observations, offering possibilities for researchers to study, assess, and decide the number of clusters formed in nonhierarchical schedules, we start with a known number of clusters and, from then on, we begin allocating the observations to these clusters, with a future evaluation of the representativeness of each variable when creating them. Therefore, the result of one method can serve as input to carry out the other, making the analysis cyclical. Fig. 11.1 shows the logic from which a cluster analysis can be elaborated. 
When choosing the distance or similarity measure and the agglomeration schedule, we must take some aspects into consideration, such as the previously desired number of clusters, defined based on some resource allocation criteria, as well as certain constraints that may lead the researcher to choose a specific solution.

Data Science for Business and Decision Making. https://doi.org/10.1016/B978-0-12-811216-8.00011-2 © 2019 Elsevier Inc. All rights reserved.

FIG. 11.1 Logic for elaborating a cluster analysis.

According to Bussab et al. (1990), different criteria regarding distance measures and agglomeration schedules may lead to different cluster formations, and the homogeneity desired by the researcher fundamentally depends on the objectives set in the research. Imagine that a researcher is interested in studying the interdependence between individuals living in a certain municipality based only on two metric variables (age, in years, and average family income, in R$). His main goal is to assess the effectiveness of social programs aimed at providing health care and then, based on these variables, to propose a still unknown number of new programs aimed at homogeneous groups of people. After collecting the data, the researcher constructed a scatter plot, as shown in Fig. 11.2. Based on the chart seen in Fig. 11.2, the researcher identified four clusters and highlighted them in a new chart (Fig. 11.3). From the creation of these clusters, the researcher decided to develop an analysis of the behavior of the observations in each group, or, more precisely, of the existing variability within the clusters and between them, so that he could clearly and consciously base his decision as regards the allocation of individuals to these four new social programs. In order to illustrate this issue, the researcher constructed the chart found in Fig. 11.4.
FIG. 11.2 Scatter plot with individuals’ Income and Age.
FIG. 11.3 Highlighting the creation of four clusters.
FIG. 11.4 Illustrating the variability within the clusters and between them.
Based on this chart, the researcher was able to notice that the groups formed showed a lot of internal homogeneity, with a certain individual being closer to other individuals in the same group than to individuals in other groups. This is the core of cluster analysis. If the number of social programs to be provided for the population (number of clusters) had already been given to the researcher, due to budgetary, legal, or political constraints, even so we would be able to use clustering, solely, to determine the allocation of individuals from the municipality to that number of programs (groups).
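The intuition in Figs. 11.2-11.4 can also be reproduced computationally. The sketch below is an illustration, not part of the original text: the (age, income) points are made up and deliberately well separated, and it assumes SciPy, applying a hierarchical agglomeration schedule (discussed later in this chapter) to recover four internally homogeneous clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical (age in years, income in R$ thousands) pairs forming
# four well-separated groups, mimicking the programs of Fig. 11.3
people = np.array([
    [25, 1.0], [26, 1.2], [24, 0.9],   # young, low income
    [27, 9.0], [25, 9.3],              # young, high income
    [68, 1.1], [70, 0.8], [69, 1.3],   # older, low income
    [71, 9.1], [69, 8.8],              # older, high income
])

Z = linkage(people, method='ward')               # hierarchical agglomeration schedule
labels = fcluster(Z, t=4, criterion='maxclust')  # cut the dendrogram at 4 clusters
print(labels)  # individuals within the same group above share a label
```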
FIG. 11.5 Rearranging the clusters due to the presence of elderly billionaires.
Having concluded the research and allocated the individuals to the different social health care programs, the following year, the researcher decided to carry out the same research with individuals from the same municipality. However, in the meantime, a group of elderly billionaires decided to move to that city, and, when he constructed the new scatter plot, the researcher realized that those four clusters, clearly formed the previous year, did not exist anymore, since they fused when the billionaires were included. The new scatter plot can be seen in Fig. 11.5. This new situation exemplifies the importance of always reapplying the cluster analysis whenever new observations (and also new variables) are included, which deprives the technique of any predictive power, as we have already discussed. Moreover, this example shows that, before elaborating any cluster analysis, it is advisable for the researcher to study the data behavior and to check the existence of discrepant observations in relation to certain variables, since the creation of clusters is very sensitive to the presence of outliers. Excluding or retaining outliers in the dataset, however, will depend on the research objectives and on the type of data the researcher has. If certain observations represent anomalies in terms of variable values when compared to the other observations, and end up forming small, insignificant, or even individual clusters, they can, in fact, be excluded. On the other hand, if these observations represent one or more relevant groups, even if they are different from the others, they must be considered in the analysis and, whenever the technique is reapplied, they can be separated so that other segmentations can be better structured in new groups, formed with higher internal homogeneity.
We would like to emphasize that cluster analysis methods are considered static procedures, since the inclusion of new observations or variables may change the clusters, thus, making it mandatory to develop a new analysis. In this example, we realized that the original variables from which the groups are established are metric, since the clustering started from the study of the distance behavior (dissimilarity measures) between the observations. In some cases, as we will study throughout this chapter, cluster analyses can be elaborated from the similarity behavior (similarity measures) between observations that present binary variables. However, it is common for researchers to use the incorrect arbitrary weighting procedure with qualitative variables, as, for example, variables on the Likert scale, and, from then on, to apply a cluster analysis. This is a major error, since there are exploratory techniques meant exclusively for the study of the behavior of qualitative variables as, for example, the correspondence analysis. Historically speaking, even though many distance and similarity measures date back to the end of the 19th century and the beginning of the 20th century, cluster analyses, as a better structured set of techniques, began in the field of Anthropology with Driver and Kroeber (1932), and in Psychology with Zubin (1938a,b) and Tryon (1939), as discussed by Reis
(2001) and Fávero et al. (2009). With the acknowledgment that observation clustering and classification procedures are scientific methods, together with astonishing technological developments, mainly verified after the 1960s, cluster analyses started being used more frequently after Sokal and Sneath's (1963) relevant work was published, in which procedures are carried out to compare the biological similarities of organisms with similar characteristics and the respective species. Currently, cluster analysis offers several application possibilities in the fields of consumer behavior, market segmentation, strategy, political science, economics, finance, accounting, actuarial science, engineering, logistics, computer science, education, medicine, biology, genetics, biostatistics, psychology, anthropology, demography, geography, ecology, climatology, geology, archeology, criminology and forensics, among others. In this chapter, we will discuss cluster analysis techniques, aiming at: (1) introducing the concepts; (2) presenting the step-by-step modeling, in an algebraic and practical way; (3) interpreting the results obtained; and (4) applying the technique in SPSS and in Stata. Following the logic proposed in the book, first, we will present the algebraic solution of an example jointly with the presentation of the concepts. Only after the introduction of concepts will the procedures for elaborating the techniques in SPSS and Stata be presented.
11.2 CLUSTER ANALYSIS

Many are the procedures for elaborating a cluster analysis, since there are different distance or similarity measures for metric or binary variables, respectively. Besides, after defining the distance or similarity measure, the researcher still needs to determine, among several possibilities, the observation clustering method, from certain hierarchical or nonhierarchical criteria. Therefore, when one wishes to group observations in internally homogeneous clusters, what initially seems trivial can become quite complex, because there are multiple combinations between different distance or similarity measures and clustering methods. Hence, based on the underlying theory and on his research objectives, as well as on his experience and intuition, it is extremely important for the researcher to define the criteria from which the observations will be allocated to each one of the groups. In the following sections, we will discuss the theoretical development of the technique, along with a practical example. In Sections 11.2.1 and 11.2.2, the concepts of distance and similarity measures and clustering methods are presented and discussed, respectively, always followed by the algebraic solutions developed from a dataset.
11.2.1 Defining Distance or Similarity Measures in Cluster Analysis
As we have already discussed, the first phase for elaborating a cluster analysis consists in defining the distance (dissimilarity) or similarity measure that will be the basis for each observation to be allocated to a certain group. Distance measures are frequently used when the variables in the dataset are essentially metric, since the greater the differences between the variable values of two observations, the smaller the similarity between them or, in other words, the higher the dissimilarity. On the other hand, similarity measures are often used when the variables are binary, and what most interests us is the frequency of converging answer pairs 1-1 or 0-0 of two observations. In this case, the greater the frequency of converging pairs, the higher the similarity between the observations. An exception to this rule is Pearson's correlation coefficient between two observations, calculated from metric variables, however, with similarity characteristics, as we will see in the following section. We will study the dissimilarity measures for metric variables in Section 11.2.1.1 and, in Section 11.2.1.2, we will discuss the similarity measures for binary variables.
11.2.1.1 Distance (Dissimilarity) Measures Between Observations for Metric Variables

As a hypothetical situation, imagine that we intend to calculate the distance between two observations i (i = 1, 2) from a dataset that has three metric variables (X1i, X2i, X3i), with values in the same unit of measure. These data can be found in Table 11.1. It is possible to illustrate the configuration of both observations in a three-dimensional space from these data, since we have exactly three variables. Fig. 11.6 shows the relative position of each observation, emphasizing the distance between them (d12). Distance d12, which is a dissimilarity measure, can be easily calculated by using, for instance, its projection over the horizontal plane formed by axes X1 and X2, called distance d′12, as shown in Fig. 11.7.
TABLE 11.1 Part of a Dataset With Two Observations and Three Metric Variables

Observation i    X1i    X2i    X3i
1                3.7    2.7    9.1
2                7.8    8.0    1.5
FIG. 11.6 Three-dimensional scatter plot for the hypothetical situation with two observations and three variables.
Thus, based on the well-known Pythagorean distance formula for right-angled triangles, we can determine d12 through the following expression:

d12 = √[(d′12)² + (X31 − X32)²]    (11.1)

where |X31 − X32| is the distance of the vertical projections (axis X3) from points 1 and 2. However, distance d′12 is unknown to us, so, once again, we need to use the Pythagorean formula, now using the distances of the projections from points 1 and 2 over the other two axes (X1 and X2), as shown in Fig. 11.8. Thus, we can say that:

d′12 = √[(X11 − X12)² + (X21 − X22)²]    (11.2)

and, substituting (11.2) in (11.1), we have:

d12 = √[(X11 − X12)² + (X21 − X22)² + (X31 − X32)²]    (11.3)

which is the expression of the distance (dissimilarity measure) between points 1 and 2, also known as the Euclidean distance formula.
Chapter 11: Cluster Analysis
FIG. 11.7 Three-dimensional chart highlighting the projection of d12 onto the horizontal plane (the vertical leg of the triangle has length |X31 − X32| and the horizontal leg is d′12).
FIG. 11.8 Projection of the points onto the plane formed by X1 and X2, with emphasis on d′12 (the legs of the right triangle have lengths |X11 − X12| and |X21 − X22|).
Therefore, for the data in our example, we have:

d12 = √[(3.7 − 7.8)² + (2.7 − 8.0)² + (9.1 − 1.5)²] = 10.132

whose unit of measure is the same as that of the original variables in the dataset. It is important to highlight that, if the variables do not have the same unit of measure, a data standardization procedure will have to be carried out first, as we will discuss later. We can generalize this problem to a situation in which the dataset has n observations and, for each observation i (i = 1, ..., n), values corresponding to each of the j (j = 1, ..., k) metric variables X, as shown in Table 11.2. Expression (11.4), based on Expression (11.3), presents the general definition of the Euclidean distance between any two observations p and q:

dpq = √[(X1p − X1q)² + (X2p − X2q)² + … + (Xkp − Xkq)²] = √[Σ_{j=1}^{k} (Xjp − Xjq)²]    (11.4)
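The computation above, and Expression (11.4) in general, can be sketched in a few lines of Python (the function name and data layout are illustrative, not from the book):

```python
import math

def euclidean(p, q):
    """Euclidean distance between two observations, Expression (11.4)."""
    if len(p) != len(q):
        raise ValueError("observations must have the same number of variables")
    return math.sqrt(sum((xp - xq) ** 2 for xp, xq in zip(p, q)))

# Observations 1 and 2 from Table 11.1
d12 = euclidean([3.7, 2.7, 9.1], [7.8, 8.0, 1.5])
print(round(d12, 3))  # 10.132
```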
Although the Euclidean distance is the most commonly used in cluster analysis, there are other dissimilarity measures, and the choice among them depends on the researcher's assumptions and objectives. Next, we discuss other dissimilarity measures that can be used:

• Squared Euclidean distance: it can be used instead of the Euclidean distance when the variables show a small dispersion in values, since squaring the distances makes it easier to interpret the outputs of the analysis and the allocation of the observations to the groups. Its expression is given by:

dpq = (X1p − X1q)² + (X2p − X2q)² + … + (Xkp − Xkq)² = Σ_{j=1}^{k} (Xjp − Xjq)²    (11.5)

• Minkowski distance: it is the most general dissimilarity measure expression, from which the others derive. It is given by:

dpq = [Σ_{j=1}^{k} |Xjp − Xjq|^m]^(1/m)    (11.6)

where m takes on positive integer values (m = 1, 2, ...). We can see that the Euclidean distance is a particular case of the Minkowski distance, when m = 2.
TABLE 11.2 General Model of a Dataset for Elaborating the Cluster Analysis

Observation i    X1i    X2i    ⋯    Xki
1                X11    X21    ⋯    Xk1
2                X12    X22    ⋯    Xk2
⋮                ⋮      ⋮           ⋮
p                X1p    X2p    ⋯    Xkp
⋮                ⋮      ⋮           ⋮
q                X1q    X2q    ⋯    Xkq
⋮                ⋮      ⋮           ⋮
n                X1n    X2n    ⋯    Xkn
• Manhattan distance: also referred to as the absolute or city-block distance, it does not consider the triangular geometry inherent to Pythagoras' initial expression and only considers the differences between the values of each variable. Its expression, also a particular case of the Minkowski distance when m = 1, is given by:

dpq = Σ_{j=1}^{k} |Xjp − Xjq|    (11.7)

• Chebyshev distance: also referred to as the infinite or maximum distance, it considers, for two observations, only the maximum difference across the k variables being studied. Its expression is given by:

dpq = max_j |Xjp − Xjq|    (11.8)

It is a particular case of the Minkowski distance as well, when m → ∞.

• Canberra distance: used in cases in which the variables only take positive values, it assumes values between 0 and k (the number of variables). Its expression is given by:

dpq = Σ_{j=1}^{k} |Xjp − Xjq| / (Xjp + Xjq)    (11.9)
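Since Expressions (11.6)-(11.9) differ only in how the coordinate differences are aggregated, they are easy to sketch and compare on the data of Table 11.1 (an illustration; the function names are ours):

```python
def minkowski(p, q, m):
    """Minkowski distance, Expression (11.6); m = 1 gives Manhattan, m = 2 Euclidean."""
    return sum(abs(xp - xq) ** m for xp, xq in zip(p, q)) ** (1 / m)

def chebyshev(p, q):
    """Chebyshev distance, Expression (11.8): the maximum coordinate difference."""
    return max(abs(xp - xq) for xp, xq in zip(p, q))

def canberra(p, q):
    """Canberra distance, Expression (11.9); assumes strictly positive values."""
    return sum(abs(xp - xq) / (xp + xq) for xp, xq in zip(p, q))

x1, x2 = [3.7, 2.7, 9.1], [7.8, 8.0, 1.5]
print(round(minkowski(x1, x2, 1), 3))  # Manhattan: 17.0
print(round(minkowski(x1, x2, 2), 3))  # Euclidean: 10.132
print(round(chebyshev(x1, x2), 3))     # 7.6
print(round(canberra(x1, x2), 3))      # 1.569
```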
Whenever there are metric variables, the researcher can also use Pearson's correlation which, even though it is not a dissimilarity measure (in fact, it is a similarity measure), can provide important information when the aim is to group rows of the dataset. Pearson's correlation between the values of any two observations p and q, based on Expression (4.11) presented in Chapter 4, can be written as follows:

rpq = Σ_{j=1}^{k} (Xjp − X̄p)·(Xjq − X̄q) / { √[Σ_{j=1}^{k} (Xjp − X̄p)²] · √[Σ_{j=1}^{k} (Xjq − X̄q)²] }    (11.10)

where X̄p and X̄q represent the mean of all variable values for observations p and q, respectively, that is, the mean of each of the rows of the dataset. Therefore, we can see that we are dealing with a coefficient of correlation between rows, not between columns (variables). It is the most common similarity measure in data analysis, and its values vary between −1 and 1. Pearson's correlation coefficient can be used as a similarity measure between the rows of the dataset in analyses that include time series, for example, that is, cases in which the observations represent periods. In this case, the researcher may intend to study the correlations between different periods in order to investigate, for instance, a possible recurrence of behavior in the same row across the set of variables, which may cause certain periods, not necessarily subsequent ones, to be grouped by similarity of behavior.

Going back to the data presented in Table 11.1, we can calculate the different distance measures between observations 1 and 2, given by Expressions (11.4)-(11.9), as well as the correlational similarity measure, given by Expression (11.10). Table 11.3 shows these calculations and the respective results. Based on the results shown in Table 11.3, we can see that different measures produce different results, which may cause the observations to be allocated to different homogeneous clusters, depending on which measure is chosen for the analysis, as discussed by Vicini and Souza (2005) and Malhotra (2012). Therefore, it is essential for researchers to always justify their choice and to bear in mind the reasons why they decided to use a certain measure instead of others. Simply using more than one measure when analyzing the same dataset can support this decision, since, in this case, the results can be compared. This becomes clear when we include a third observation in the analysis, as shown in Table 11.4.
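A quick numerical check of Expression (11.10) on the two rows of Table 11.1 (a sketch; the helper name is ours):

```python
import math

def row_correlation(p, q):
    """Pearson correlation between two observations (rows), Expression (11.10)."""
    k = len(p)
    mean_p, mean_q = sum(p) / k, sum(q) / k
    num = sum((xp - mean_p) * (xq - mean_q) for xp, xq in zip(p, q))
    den = (math.sqrt(sum((xp - mean_p) ** 2 for xp in p))
           * math.sqrt(sum((xq - mean_q) ** 2 for xq in q)))
    return num / den

r12 = row_correlation([3.7, 2.7, 9.1], [7.8, 8.0, 1.5])
print(round(r12, 3))  # -0.993
```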
While the Euclidean distance suggests that the most similar observations (the shortest distance) are 2 and 3, when we use the Chebyshev distance, observations 1 and 3 are the most similar. Table 11.5 shows these distances for each pair of observations, highlighting the smallest value for each distance measure.
TABLE 11.3 Distance and Correlational Similarity Measures Between Observations 1 and 2

Observation i    X1i    X2i    X3i    Mean
1                3.7    2.7    9.1    5.167
2                7.8    8.0    1.5    5.767

Euclidean distance:
d12 = √[(3.7 − 7.8)² + (2.7 − 8.0)² + (9.1 − 1.5)²] = 10.132

Squared Euclidean distance:
d12 = (3.7 − 7.8)² + (2.7 − 8.0)² + (9.1 − 1.5)² = 102.660

Manhattan distance:
d12 = |3.7 − 7.8| + |2.7 − 8.0| + |9.1 − 1.5| = 17.000

Chebyshev distance:
d12 = |9.1 − 1.5| = 7.600

Canberra distance:
d12 = |3.7 − 7.8|/(3.7 + 7.8) + |2.7 − 8.0|/(2.7 + 8.0) + |9.1 − 1.5|/(9.1 + 1.5) = 1.569

Pearson's correlation (similarity):
r12 = [(3.7 − 5.167)(7.8 − 5.767) + (2.7 − 5.167)(8.0 − 5.767) + (9.1 − 5.167)(1.5 − 5.767)] / {√[(3.7 − 5.167)² + (2.7 − 5.167)² + (9.1 − 5.167)²] · √[(7.8 − 5.767)² + (8.0 − 5.767)² + (1.5 − 5.767)²]} = −0.993
TABLE 11.4 Part of the Dataset With Three Observations and Three Metric Variables

Observation i    X1i    X2i    X3i
1                3.7    2.7    9.1
2                7.8    8.0    1.5
3                8.9    1.0    2.7
TABLE 11.5 Euclidean and Chebyshev Distances Between the Pairs of Observations Seen in Table 11.4 (* marks the smallest value for each distance measure)

Distance     Pair 1 and 2    Pair 1 and 3     Pair 2 and 3
Euclidean    d12 = 10.132    d13 = 8.420      d23 = 7.187 *
Chebyshev    d12 = 7.600     d13 = 6.400 *    d23 = 7.000
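The figures in Table 11.5 can be reproduced directly, which is a convenient way to double-check a hand calculation (a pure-Python sketch; with SciPy available, `scipy.spatial.distance.pdist` computes the same pairwise distances):

```python
import math
from itertools import combinations

# Observations from Table 11.4
obs = {1: (3.7, 2.7, 9.1), 2: (7.8, 8.0, 1.5), 3: (8.9, 1.0, 2.7)}

def euclid(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def cheby(p, q):
    return max(abs(a - b) for a, b in zip(p, q))

for i, j in combinations(obs, 2):
    print(f"d{i}{j}: Euclidean = {euclid(obs[i], obs[j]):.3f}, "
          f"Chebyshev = {cheby(obs[i], obs[j]):.3f}")
# d12: Euclidean = 10.132, Chebyshev = 7.600
# d13: Euclidean = 8.420, Chebyshev = 6.400
# d23: Euclidean = 7.187, Chebyshev = 7.000
```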
Hence, in a certain clustering schedule, and solely due to the dissimilarity measure chosen, we would have different initial clusters. Besides deciding which distance measure to choose, the researcher also has to verify whether the data need to be treated beforehand. So far, in the examples already discussed, we were careful to choose metric variables with values in the same unit of measure (as, for example, students' grades in Math, Physics, and Chemistry, which vary from 0 to 10). However, if the variables are measured in different units (as, for example, income in R$, educational level in years of study, and number of children), the intensity of the distances between the observations may be arbitrarily influenced by the variables that present greater magnitude in their values, to the detriment of the others. In these situations, the
researcher must standardize the data, so that the arbitrary nature of the measurement units is eliminated, making each variable contribute equally to the distance measure considered. The Z-scores procedure is the most frequently used method to standardize variables. In it, for each observation i, the value of a new standardized variable ZXj is obtained by subtracting the mean of the corresponding original variable Xj from its value and dividing the result by its standard deviation, as presented in Expression (11.11):

ZXji = (Xji − X̄j) / sj    (11.11)

where X̄j and sj represent the mean and the standard deviation of variable Xj, respectively. Hence, regardless of the magnitude of the values and of the type of measurement units of the original variables in a dataset, all the variables standardized by the Z-scores procedure will have a mean equal to 0 and a standard deviation equal to 1, which ensures that possible arbitrary effects of the measurement units on the distance between each pair of observations are eliminated. In addition, Z-scores have the advantage of not changing the shape of the distribution of the original variable. Therefore, if the original variables are in different units, the distance measure Expressions (11.4)-(11.9) must have the terms Xjp and Xjq replaced by ZXjp and ZXjq, respectively. Table 11.6 presents these expressions, based on the standardized variables. Even though Pearson's correlation is not a dissimilarity measure (in fact, it is a similarity measure), it is important to mention that its use also requires that the variables be standardized by means of the Z-scores procedure in case they do not have the same measurement units. If the main goal were to group variables, which is the main goal of the following chapter (factor analysis), standardizing the variables through the Z-scores procedure would, in fact, be irrelevant, given that the analysis would consist of assessing the correlation between columns of the dataset. On the other hand, as the objective of this chapter is to group the rows of the dataset, which represent the observations, standardizing the variables is necessary for an accurate cluster analysis.
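Expression (11.11) amounts to a column-wise transformation. A minimal sketch (the income values are hypothetical, used only to echo the R$ example in the text):

```python
import statistics

def z_scores(values):
    """Standardize one variable (one column of the dataset), Expression (11.11)."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return [(v - mean) / sd for v in values]

income = [1500.0, 2300.0, 5400.0, 800.0]  # hypothetical values in R$
z = z_scores(income)
print([round(v, 3) for v in z])  # [-0.493, -0.099, 1.43, -0.838]
# The standardized variable has mean 0 and standard deviation 1,
# whatever the unit of the original variable.
```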
11.2.1.2 Similarity Measures Between Observations for Binary Variables

Now, imagine that we intend to calculate the distance between two observations i (i = 1, 2) coming from a dataset that has seven variables (X1i, ..., X7i), all of them related to the presence or absence of certain characteristics. In this situation, it is common for the presence or absence of a characteristic to be represented by a binary variable, or dummy, which assumes the value 1 in case the characteristic occurs, and 0 otherwise. These data can be found in Table 11.7. It is important to highlight that the use of binary variables does not generate the arbitrary weighting problems that would result from assigning discrete values (1, 2, 3, ...) to each category of a qualitative variable. In this regard, if a certain qualitative variable has k categories, (k − 1) binary variables will be necessary to represent the presence or absence of each of the categories. Thus, all the binary variables will be equal to 0 in case the reference category occurs.
TABLE 11.6 Distance Measure Expressions With Standardized Variables

Distance Measure (Dissimilarity)    Expression
Euclidean                           dpq = √[Σ_{j=1}^{k} (ZXjp − ZXjq)²]
Squared Euclidean                   dpq = Σ_{j=1}^{k} (ZXjp − ZXjq)²
Minkowski                           dpq = [Σ_{j=1}^{k} |ZXjp − ZXjq|^m]^(1/m)
Manhattan                           dpq = Σ_{j=1}^{k} |ZXjp − ZXjq|
Chebyshev                           dpq = max_j |ZXjp − ZXjq|
Canberra                            dpq = Σ_{j=1}^{k} |ZXjp − ZXjq| / (ZXjp + ZXjq)
TABLE 11.7 Part of the Dataset With Two Observations and Seven Binary Variables

Observation i    X1i    X2i    X3i    X4i    X5i    X6i    X7i
1                0      0      1      1      0      1      1
2                0      1      1      1      1      0      1
Therefore, by using Expression (11.5), we can calculate the squared Euclidean distance between observations 1 and 2, as follows:

d12 = Σ_{j=1}^{7} (Xj1 − Xj2)² = (0 − 0)² + (0 − 1)² + (1 − 1)² + (1 − 1)² + (0 − 1)² + (1 − 0)² + (1 − 1)² = 3

which represents the total number of variables with answer differences between observations 1 and 2. Therefore, for any two observations p and q, the greater the number of equal answers (0-0 or 1-1), the shorter the squared Euclidean distance between them will be, since:

(Xjp − Xjq)² = 0 if Xjp = Xjq, and (Xjp − Xjq)² = 1 if Xjp ≠ Xjq    (11.12)
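For binary data, Expression (11.12) reduces the squared Euclidean distance to a simple count of disagreements, which a two-line check confirms for the observations of Table 11.7:

```python
# Observations 1 and 2 from Table 11.7
x1 = [0, 0, 1, 1, 0, 1, 1]
x2 = [0, 1, 1, 1, 1, 0, 1]

# For binary data the squared Euclidean distance is just the number of
# positions in which the two observations disagree, Expression (11.12)
d12 = sum((a - b) ** 2 for a, b in zip(x1, x2))
print(d12)  # 3
```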
As discussed by Johnson and Wichern (2007), each term of the distance represented by Expression (11.12) is considered a dissimilarity measure, since the greater the number of answer discrepancies, the greater the squared Euclidean distance. On the other hand, the calculation weights the pairs of answers 0-0 and 1-1 equally, without giving higher relative importance to the pair of answers 1-1, which, in many cases, is a stronger similarity indicator than the pair of answers 0-0. For example, when we group people, the fact that two of them eat lobster every day is stronger evidence of similarity than the absence of this characteristic for both. Hence, many authors, aiming to define similarity measures between observations, proposed coefficients that take the similarity of the answers 1-1 and 0-0 into consideration, without these pairs necessarily having the same relative importance. In order to present these measures, it is necessary to construct an absolute frequency table of answers 0 and 1 for each pair of observations p and q, as shown in Table 11.8. Next, based on this table, we will discuss the main similarity measures, bearing in mind that the use of each one depends on the researcher's assumptions and objectives.

• Simple matching coefficient (SMC): it is the most frequently used similarity measure for binary variables, discussed and used by Zubin (1938a) and by Sokal and Michener (1958). This coefficient, which gives equal weight to the converging 1-1 and 0-0 answers, has its expression given by:

spq = (a + d) / (a + b + c + d)    (11.13)
TABLE 11.8 Absolute Frequencies of Answers 0 and 1 for Two Observations p and q

                     Observation p = 1    Observation p = 0    Total
Observation q = 1    a                    b                    a + b
Observation q = 0    c                    d                    c + d
Total                a + c                b + d                a + b + c + d
• Jaccard index: even though it was first proposed by Gilbert (1884), it received this name because it was discussed and used in two extremely important papers by Jaccard (1901, 1908). This measure, also known as the Jaccard similarity coefficient, does not take the frequency of the pair of answers 0-0 into consideration, which is considered irrelevant. However, it is possible to come across a situation in which all the variables are equal to 0 for two observations, that is, there is frequency only in cell d of Table 11.8. In this case, software packages such as Stata present the Jaccard index equal to 1, which makes sense from a similarity standpoint. Its expression is given by:

spq = a / (a + b + c)    (11.14)

• Dice similarity coefficient (DSC): although it is known only by this name, it was suggested and discussed by Czekanowski (1932), Dice (1945), and Sørensen (1948). It is similar to the Jaccard index; however, it doubles the weight of the frequency of converging 1-1 answer pairs. As in that case, software such as Stata presents the Dice coefficient equal to 1 for the cases in which all the variables are equal to 0 for the two observations, thus avoiding any indeterminacy in the calculation. Its expression is given by:

spq = 2a / (2a + b + c)    (11.15)

• Anti-Dice similarity coefficient: initially proposed by Sokal and Sneath (1963) and Anderberg (1973), the name anti-Dice comes from the fact that this coefficient doubles the weight of the frequencies of discrepant answer pairs (1-0 and 0-1), that is, it doubles the weight of the answer divergences. Just as the Jaccard and Dice coefficients, the anti-Dice coefficient also ignores the frequency of 0-0 answer pairs. Its expression is given by:

spq = a / [a + 2·(b + c)]    (11.16)

• Russell and Rao similarity coefficient: it is also widely used, and it favors only the similarities of 1-1 answers in the calculation of its coefficient. It was proposed by Russell and Rao (1940), and its expression is given by:

spq = a / (a + b + c + d)    (11.17)

• Ochiai similarity coefficient: even though it is known by this name, it was initially proposed by Driver and Kroeber (1932) and later used by Ochiai (1957). This coefficient is undefined when one or both of the observations being studied present all variable values equal to 0. However, if both vectors present all values equal to 0, software such as Stata presents the Ochiai coefficient equal to 1. If this happens for only one of the two vectors, the Ochiai coefficient is considered equal to 0. Its expression is given by:

spq = a / √[(a + b)·(a + c)]    (11.18)

• Yule similarity coefficient: proposed by Yule (1900) and used by Yule and Kendall (1950), this similarity coefficient for binary variables varies from −1 to 1. As we can see from its expression, the coefficient is undefined if one or both of the vectors compared present all values equal to 0 or 1. Software such as Stata generates the Yule coefficient equal to 1 if b = c = 0 (total convergence of answers), and equal to −1 if a = d = 0 (total divergence of answers). Its expression is given by:

spq = (a·d − b·c) / (a·d + b·c)    (11.19)

• Rogers and Tanimoto similarity coefficient: this coefficient, which doubles the weight of the discrepant answers 0-1 and 1-0 in relation to the weight of the converging 1-1 and 0-0 answer combinations, was initially proposed by Rogers and Tanimoto (1960). Its expression, which becomes equal to the anti-Dice coefficient when the frequency of 0-0 answers is equal to 0 (d = 0), is given by:

spq = (a + d) / [a + d + 2·(b + c)]    (11.20)
• Sneath and Sokal similarity coefficient: in contrast to the Rogers and Tanimoto coefficient, this coefficient, proposed by Sneath and Sokal (1962), doubles the weight of the converging 1-1 and 0-0 answers in relation to the other answer combinations (1-0 and 0-1). Its expression, which becomes equal to the Dice coefficient when the frequency of 0-0 answers is equal to 0 (d = 0), is given by:

spq = 2·(a + d) / [2·(a + d) + b + c]    (11.21)

• Hamann similarity coefficient: Hamann (1961) proposed this similarity coefficient for binary variables, in which the frequencies of the discrepant answers (1-0 and 0-1) are subtracted from the total of converging answers (1-1 and 0-0). This coefficient, which varies from −1 (total answer divergence) to 1 (total answer convergence), is equal to two times the simple matching coefficient minus 1. Its expression is given by:

spq = [(a + d) − (b + c)] / (a + b + c + d)    (11.22)
As was done in Section 11.2.1.1 regarding the dissimilarity measures applied to metric variables, let's go back to the data presented in Table 11.7, aiming to calculate the different similarity measures between observations 1 and 2, which involve only binary variables. In order to do that, from that table, we must construct the absolute frequency table of answers 0 and 1 for the observations mentioned (Table 11.9). Then, using Expressions (11.13)-(11.22), we are able to calculate the similarity measures themselves. Table 11.10 presents the calculations and the results for each coefficient. Analogous to what was discussed when the dissimilarity measures were calculated, we can clearly see that different similarity measures generate different results, which may cause, when defining the clustering method, the observations to be allocated to different homogeneous clusters, depending on which measure was chosen for the analysis. Bear in mind that it does not make any sense to apply the Z-scores standardization procedure before calculating the similarity measures discussed in this section, since the variables used for the cluster analysis are binary. At this point, it is important to emphasize that, instead of using similarity measures to define the clusters whenever there are binary variables, it is very common to define clusters from the coordinates of each observation, which can be generated when elaborating simple or multiple correspondence analyses, for instance. This is an exploratory technique applied solely to datasets that have qualitative variables, aiming to create perceptual maps, which are constructed based on the frequencies of the categories of each of the variables under analysis (Fávero and Belfiore, 2017). After defining the coefficient that will be used, based on the research objectives, on the underlying theory, and on his or her experience and intuition, the researcher must move on to the definition of the agglomeration schedule.
The main cluster analysis schedules will be studied in the following section.
11.2.2 Agglomeration Schedules in Cluster Analysis

As discussed by Vicini and Souza (2005) and Johnson and Wichern (2007), in cluster analysis, choosing the clustering method, also known as the agglomeration schedule, is as important as defining the distance (or similarity) measure, and this decision must also be made based on the researchers' objectives.
TABLE 11.9 Absolute Frequencies of Answers 0 and 1 for Observations 1 and 2

                     Observation 1 = 1    Observation 1 = 0    Total
Observation 2 = 1    3                    2                    5
Observation 2 = 0    1                    1                    2
Total                4                    3                    7
TABLE 11.10 Similarity Measures Between Observations 1 and 2

Simple matching:       s12 = (3 + 1)/7 = 0.571
Jaccard:               s12 = 3/(3 + 2 + 1) = 0.500
Dice:                  s12 = (2·3)/(2·3 + 2 + 1) = 0.667
Anti-Dice:             s12 = 3/[3 + 2·(2 + 1)] = 0.333
Russell and Rao:       s12 = 3/7 = 0.429
Ochiai:                s12 = 3/√[(3 + 2)·(3 + 1)] = 0.671
Yule:                  s12 = (3·1 − 2·1)/(3·1 + 2·1) = 0.200
Rogers and Tanimoto:   s12 = (3 + 1)/[3 + 1 + 2·(2 + 1)] = 0.400
Sneath and Sokal:      s12 = 2·(3 + 1)/[2·(3 + 1) + 2 + 1] = 0.727
Hamann:                s12 = [(3 + 1) − (2 + 1)]/7 = 0.143
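All the coefficients in Table 11.10 follow from the four frequencies a, b, c, and d of Table 11.9, so they can be computed together (a sketch; the dictionary keys and the helper `binary_counts` are ours):

```python
import math

def binary_counts(p, q):
    """Frequencies a, b, c, d as in Table 11.8 (a: pairs 1-1, d: pairs 0-0)."""
    a = sum(1 for x, y in zip(p, q) if x == 1 and y == 1)
    b = sum(1 for x, y in zip(p, q) if x == 0 and y == 1)
    c = sum(1 for x, y in zip(p, q) if x == 1 and y == 0)
    d = sum(1 for x, y in zip(p, q) if x == 0 and y == 0)
    return a, b, c, d

x1 = [0, 0, 1, 1, 0, 1, 1]  # observation 1, Table 11.7
x2 = [0, 1, 1, 1, 1, 0, 1]  # observation 2, Table 11.7
a, b, c, d = binary_counts(x1, x2)  # 3, 2, 1, 1

coefficients = {
    "simple matching":     (a + d) / (a + b + c + d),            # 0.571
    "Jaccard":             a / (a + b + c),                      # 0.500
    "Dice":                2 * a / (2 * a + b + c),              # 0.667
    "anti-Dice":           a / (a + 2 * (b + c)),                # 0.333
    "Russell and Rao":     a / (a + b + c + d),                  # 0.429
    "Ochiai":              a / math.sqrt((a + b) * (a + c)),     # 0.671
    "Yule":                (a * d - b * c) / (a * d + b * c),    # 0.200
    "Rogers and Tanimoto": (a + d) / (a + d + 2 * (b + c)),      # 0.400
    "Sneath and Sokal":    2 * (a + d) / (2 * (a + d) + b + c),  # 0.727
    "Hamann":              ((a + d) - (b + c)) / (a + b + c + d),  # 0.143
}
for name, s in coefficients.items():
    print(f"{name}: {s:.3f}")
```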
Basically, agglomeration schedules can be classified into two types: hierarchical and nonhierarchical. While the former are characterized by following a hierarchical (step-by-step) structure when forming clusters, nonhierarchical schedules use algorithms to maximize the homogeneity within each cluster, without going through a hierarchical process. Hierarchical agglomeration schedules can be agglomerative or partitioning, depending on how the process starts. If all the observations are considered separate and, from their distances (or similarities), groups are formed until we reach a final stage with only one cluster, the process is agglomerative. Among the hierarchical agglomeration schedules, the most commonly used are those with the following linkage methods: nearest-neighbor (single-linkage), furthest-neighbor (complete-linkage), and between-groups (average-linkage). On the other hand, if all the observations start out grouped and, stage after stage, smaller groups are formed by separating observations, until these subdivisions generate individual groups (that is, totally separated observations), we have a partitioning process. Conversely, nonhierarchical agglomeration schedules, among which the most popular is the k-means procedure, refer to processes in which clustering centers are defined, and the observations are allocated based on their proximity to them. Different from hierarchical schedules, in which the researcher can study the several possibilities for allocating observations and even define the ideal number of clusters based on each of the clustering stages, a nonhierarchical agglomeration schedule requires that the number of clusters be stipulated in advance, from which the clustering centers will be defined and the observations allocated.

That is why we recommend generating a hierarchical agglomeration schedule before constructing a nonhierarchical one when there is no reasonable estimate of the number of clusters that can be formed from the observations in the dataset, based on the variables under study. Fig. 11.9 shows the logic of agglomeration schedules in cluster analysis. We will study hierarchical agglomeration schedules in Section 11.2.2.1, and Section 11.2.2.2 will discuss the nonhierarchical k-means agglomeration schedule.
11.2.2.1 Hierarchical Agglomeration Schedules

In this section, we will discuss the main hierarchical agglomeration schedules, in which larger and larger clusters are formed at each clustering stage as new observations or groups are added, according to a certain criterion (linkage method) and based on the distance measure chosen. In Section 11.2.2.1.1, the main concepts of these schedules are presented and, in Section 11.2.2.1.2, a practical example is presented and solved algebraically.

11.2.2.1.1 Notation

There are three main linkage methods in hierarchical agglomeration schedules, as shown in Fig. 11.9: the nearest-neighbor (single-linkage), the furthest-neighbor (complete-linkage), and the between-groups (average-linkage) methods. Table 11.11 illustrates the distance to be considered at each clustering stage, based on the linkage method chosen.
FIG. 11.9 Agglomeration schedules in cluster analysis. (The diagram splits agglomeration schedules into hierarchical, either agglomerative or partitioning, with the linkage methods nearest neighbor (single linkage), furthest neighbor (complete linkage), and between groups (average linkage), and nonhierarchical (k-means).)
TABLE 11.11 Distance to be Considered Based on the Linkage Method

Linkage Method                                      Distance (Dissimilarity)
Single (nearest-neighbor or single-linkage)         d23
Complete (furthest-neighbor or complete-linkage)    d15
Average (between-groups or average-linkage)         (d13 + d14 + d15 + d23 + d24 + d25)/6

In the original illustrations, observations 1 and 2 form one cluster and observations 3, 4, and 5 form the other; the single-linkage method takes the shortest distance between the two clusters (d23), the complete-linkage method takes the greatest (d15), and the average-linkage method averages all six between-cluster distances.
FIG. 11.10 Single-linkage method—Hampered analysis when there are observations or clusters just a little further apart.
Johnson and Wichern (2007) propose a logical sequence of steps to facilitate the understanding of a cluster analysis elaborated through a hierarchical agglomerative method:

1. If n is the number of observations in a dataset, we must start the agglomeration schedule with exactly n individual groups (stage 0), such that we will initially have a distance (or similarity) matrix D0 formed by the distances between each pair of observations.

2. In the first stage, we must choose the smallest distance among all of those that form matrix D0, that is, the one that connects the two most similar observations. At this exact moment, we will no longer have n individual groups, but (n − 1) groups, one of which is formed by two observations.

3. In the following clustering stage, we must repeat the previous step. However, we now have to take into consideration the distance between each pair of observations and between the first group already formed and each of the other observations, based on the linkage method adopted. In other words, after the first clustering stage we will have matrix D1 with dimensions (n − 1) × (n − 1), in which one of the rows will be represented by the first grouped pair of observations. Consequently, in the second stage, a new group will be formed either by the grouping of two new observations or by the addition of a certain observation to the group formed in the first stage.

4. The previous process must be repeated (n − 1) times, until there is only a single group formed by all the observations. In other words, in stage (n − 2) we will have matrix Dn−2, which will contain only the distance between the last two remaining groups, before the final fusion.

5. Finally, from the clustering stages and the distances between the clusters formed, it is possible to develop a tree-shaped diagram that summarizes the clustering process and explains the allocation of each observation to each cluster.
This diagram is known as a dendrogram or a phenogram. Therefore, the values that form the D matrices of each one of the stages will be a function of the distance measure chosen and of the linkage method adopted. In a certain clustering stage s, imagine that a researcher groups two clusters M and N formed previously, containing observations m and n, respectively, so that cluster MN can be formed. Next, he intends to group MN with another cluster W, with w observations. Since we know that the decision to choose the next cluster will always be the smallest distance between each pair of observations or groups in the hierarchical agglomerative methods, the agglomeration schedule will be essential in order for the distances that will form each matrix Ds to be analyzed. Using this logic and based on Table 11.11, let’s discuss the criterion to calculate the distance between the clusters MN and W, inserted in matrix Ds, based on the linkage method: l
Nearest-Neighbor or Single-Linkage Method:

d_(MN)W = min{d_MW, d_NW}   (11.23)

where d_MW and d_NW are the distances between the closest observations in clusters M and W and in clusters N and W, respectively.
Furthest-Neighbor or Complete-Linkage Method:

d_(MN)W = max{d_MW, d_NW}   (11.24)

where d_MW and d_NW are the distances between the farthest observations in clusters M and W and in clusters N and W, respectively.
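As a quick illustration of Expressions (11.23) and (11.24), both criteria reduce to a min or a max over the pairwise distances between the two clusters. A minimal sketch in Python (the function name and the hypothetical coordinates are ours, not from the chapter):

```python
from itertools import product
from math import dist  # Euclidean distance between two points (Python 3.8+)

def cluster_distance(cluster_a, cluster_b, method="single"):
    """Distance between two clusters of coordinate tuples:
    Expression (11.23) for method="single" (nearest neighbor),
    Expression (11.24) for method="complete" (furthest neighbor)."""
    pairwise = [dist(p, q) for p, q in product(cluster_a, cluster_b)]
    return min(pairwise) if method == "single" else max(pairwise)

# Hypothetical merged cluster MN and a cluster W, placed on a line
MN = [(0.0, 0.0), (1.0, 0.0)]
W = [(4.0, 0.0), (6.0, 0.0)]
```

Here the single-linkage distance is 3.0 (the pair (1, 0) and (4, 0)), while the complete-linkage distance is 6.0 (the pair (0, 0) and (6, 0)).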
PART V Multivariate Exploratory Data Analysis
TABLE 11.12 Example: Grades in Math, Physics, and Chemistry on the College Entrance Exam

Student (Observation) | Grade in Mathematics (X1i) | Grade in Physics (X2i) | Grade in Chemistry (X3i)
Gabriela | 3.7 | 2.7 | 9.1
Luiz Felipe | 7.8 | 8.0 | 1.5
Patricia | 8.9 | 1.0 | 2.7
Ovidio | 7.0 | 1.0 | 9.0
Leonor | 3.4 | 2.0 | 5.0
Between-Groups or Average-Linkage Method:

d_(MN)W = ( Σ_{p=1}^{m+n} Σ_{q=1}^{w} d_pq ) / ( (m + n)(w) )   (11.25)
where d_pq represents the distance between any observation p in cluster MN and any observation q in cluster W, and m + n and w represent the number of observations in clusters MN and W, respectively.

In the following section, we will present a practical example that will be solved algebraically, and from which the concepts of hierarchical agglomerative methods will be established.

11.2.2.1.2 A Practical Example of Cluster Analysis With Hierarchical Agglomeration Schedules
Imagine that a college professor, who is very concerned about his students' capacity to learn the subject he teaches, Quantitative Methods, is interested in allocating them to groups with the highest homogeneity possible, based on the grades they obtained on the college entrance exams in the subjects considered quantitative (Math, Physics, and Chemistry). In order to do that, the professor collected information on these grades, which vary from 0 to 10. In addition, since he will first carry out the cluster analysis in an algebraic way, he decided, for pedagogical purposes, to work with only five students. This dataset can be seen in Table 11.12.

Based on the data obtained, the chart in Fig. 11.11 is constructed, and, since the variables are metric, the dissimilarity measure known as the Euclidian distance will be used for the cluster analysis. Besides, since all the variables have values in the same unit of measure (grades from 0 to 10), in this case it will not be necessary to standardize them through Z-scores. In the following sections, hierarchical agglomeration schedules based on the Euclidian distance will be elaborated through the three linkage methods being studied.

11.2.2.1.2.1 Nearest-Neighbor or Single-Linkage Method

At this moment, from the data presented in Table 11.12, let's develop a cluster analysis through a hierarchical agglomeration schedule with the single-linkage method. First of all, we define matrix D0, formed by the Euclidian distances (dissimilarities) between each pair of observations, as follows:

              Gabriela   Luiz Felipe   Patricia   Ovidio   Leonor
Gabriela         0
Luiz Felipe   10.132          0
Patricia       8.420        7.187          0
Ovidio         3.713       10.290        6.580       0
Leonor         4.170        8.223        6.045     5.474       0
Cluster Analysis Chapter 11
FIG. 11.11 Three-dimensional chart with the relative position of the five students (axes: Math, Physics, and Chemistry).
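The entries of matrix D0 can be recomputed directly from Table 11.12. A minimal sketch in plain Python (the dictionary and variable names are ours):

```python
from math import dist  # Euclidean distance (Python 3.8+)

# Grades from Table 11.12: (Math, Physics, Chemistry)
grades = {
    "Gabriela": (3.7, 2.7, 9.1),
    "Luiz Felipe": (7.8, 8.0, 1.5),
    "Patricia": (8.9, 1.0, 2.7),
    "Ovidio": (7.0, 1.0, 9.0),
    "Leonor": (3.4, 2.0, 5.0),
}

names = list(grades)
# Matrix D0 as a dict over unordered pairs of students (stage 0)
d0 = {
    frozenset((a, b)): dist(grades[a], grades[b])
    for i, a in enumerate(names)
    for b in names[i + 1:]
}

closest_pair = min(d0, key=d0.get)  # the pair merged in stage 1
```

The smallest entry is the Gabriela-Ovidio distance of approximately 3.713, which is exactly the pair merged in the first clustering stage.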
It is important to mention that, at this initial moment, each observation is considered an individual cluster, that is, in stage 0, we have 5 clusters (the sample size). Highlighted in matrix D0 is the smallest distance between all the observations and, therefore, in the first stage, observations Gabriela and Ovidio are initially grouped and now form a new cluster. We must construct matrix D1 so that we can go to the next clustering stage, in which the distances between the cluster Gabriela-Ovidio and the other observations, which are still isolated, are calculated. Thus, by using the single-linkage method and based on Expression (11.23), we have:

d_(Gabriela-Ovidio)Luiz Felipe = min{10.132, 10.290} = 10.132
d_(Gabriela-Ovidio)Patricia = min{8.420, 6.580} = 6.580
d_(Gabriela-Ovidio)Leonor = min{4.170, 5.474} = 4.170

Matrix D1 gathers these distances, together with the original distances between the observations that are still isolated.
In the same way, the smallest distance in matrix D1 is highlighted. Therefore, in the second stage, observation Leonor is inserted into the already formed cluster Gabriela-Ovidio. Observations Luiz Felipe and Patricia still remain isolated. We must construct matrix D2 so that we can take the next step, in which the distances between the cluster Gabriela-Ovidio-Leonor and the two remaining observations are calculated. Analogously, we have:

d_(Gabriela-Ovidio-Leonor)Luiz Felipe = min{10.132, 8.223} = 8.223
d_(Gabriela-Ovidio-Leonor)Patricia = min{6.580, 6.045} = 6.045

Matrix D2 can be written analogously, gathering these distances and the distance between the two observations that are still isolated.
In the third clustering stage, observation Patricia is incorporated into the cluster Gabriela-Ovidio-Leonor, since the corresponding distance is the smallest among all of those presented in matrix D2. Therefore, we can write matrix D3, taking the following criterion into consideration:

d_(Gabriela-Ovidio-Leonor-Patricia)Luiz Felipe = min{8.223, 7.187} = 7.187
Finally, in the fourth and last stage, all the observations are allocated to the same cluster, thus concluding the hierarchical process. Table 11.13 presents a summary of this agglomeration schedule, constructed by using the single-linkage method.

Based on this agglomeration schedule, we can construct a tree-shaped diagram, known as a dendrogram or phenogram, whose main objective is to illustrate the clustering process step by step and to facilitate the visualization of how each observation is allocated at each stage. The dendrogram can be seen in Fig. 11.12.

Through Figs. 11.13 and 11.14, we are able to interpret the dendrogram constructed. First of all, we drew three lines (I, II, and III) that are orthogonal to the dendrogram lines, as shown in Fig. 11.13, which allow us to identify the number of clusters in each clustering stage, as well as the observations in each cluster. Therefore, line I "cuts" the dendrogram immediately after the first clustering stage and, at this moment, we can verify that there are four clusters (four intersections with the dendrogram's horizontal lines), one of them formed by observations Gabriela and Ovidio, and the others by the individual observations.
TABLE 11.13 Agglomeration Schedule Through the Single-Linkage Method

Stage | Cluster | Grouped Observation | Smallest Euclidian Distance
1 | Gabriela | Ovidio | 3.713
2 | Gabriela-Ovidio | Leonor | 4.170
3 | Gabriela-Ovidio-Leonor | Patricia | 6.045
4 | Gabriela-Ovidio-Leonor-Patricia | Luiz Felipe | 7.187
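The agglomeration schedule in Table 11.13 can be reproduced with a naive quadratic sketch (function and variable names are ours; fine for five observations, not for large datasets):

```python
from math import dist  # Euclidean distance (Python 3.8+)

# Grades from Table 11.12: (Math, Physics, Chemistry)
grades = {
    "Gabriela": (3.7, 2.7, 9.1),
    "Luiz Felipe": (7.8, 8.0, 1.5),
    "Patricia": (8.9, 1.0, 2.7),
    "Ovidio": (7.0, 1.0, 9.0),
    "Leonor": (3.4, 2.0, 5.0),
}

def single_linkage_schedule(points):
    """At each stage, merge the two clusters whose closest pair of
    observations is nearest (Expression 11.23); return the smallest
    distance found at each stage, rounded to three decimals."""
    clusters = [[name] for name in points]
    schedule = []
    while len(clusters) > 1:
        d, i, j = min(
            (min(dist(points[p], points[q]) for p in a for q in b), i, j)
            for i, a in enumerate(clusters)
            for j, b in enumerate(clusters) if i < j
        )
        clusters[i] += clusters[j]  # merge cluster j into cluster i
        del clusters[j]             # j > i, so index i is unaffected
        schedule.append(round(d, 3))
    return schedule
```

The returned stage distances are 3.713, 4.170, 6.045, and 7.187, matching Table 11.13.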
FIG. 11.12 Dendrogram—Single-linkage method (horizontal axis: Euclidean distance).
FIG. 11.13 Interpreting the dendrogram—Number of clusters and allocation of observations.

FIG. 11.14 Interpreting the dendrogram—Distance leaps.
On the other hand, line II intersects three horizontal lines of the dendrogram, which means that, after the second stage, in which observation Leonor was incorporated into the already formed cluster Gabriela-Ovidio, there are three clusters. Finally, line III is drawn immediately after the third stage, in which observation Patricia merges with the cluster Gabriela-Ovidio-Leonor. Since two intersections between this line and the dendrogram's horizontal lines are identified, we can see that observation Luiz Felipe remains isolated, while the others form a single cluster.

Besides providing a study of the number of clusters in each clustering stage and of the allocation of observations, a dendrogram also allows the researcher to analyze the magnitude of the distance leaps needed to establish the clusters. A leap of high magnitude, in comparison to the others, can indicate that a considerably different observation or cluster is being incorporated into already formed clusters, which supports a decision about the number of clusters without the need for a further clustering stage. Although we know that setting an inflexible, mandatory number of clusters may hamper the analysis, at least having an idea of this number, given the distance measure used and the linkage method adopted, may help researchers better understand the characteristics of the observations. Moreover, since the number of clusters is an important input for constructing nonhierarchical agglomeration schedules, this piece of information (considered an output of the hierarchical schedule) may serve as input for the k-means procedure.

Fig. 11.14 presents three distance leaps (A, B, and C), one for each clustering stage, and, from their analysis, we can see that leap B, which represents the incorporation of observation Patricia into the already formed cluster Gabriela-Ovidio-Leonor, is the greatest of the three.
Therefore, in case we intend to set the ideal number of clusters in this example, the researcher may choose the solution with three clusters (line II in Fig. 11.13), without the stage in which observation Patricia is incorporated, since it possibly has characteristics that are not so homogeneous, making it unfeasible to include it in the previously formed cluster, given the large distance leap. Thus, in this case, we would have a cluster formed by Gabriela, Ovidio, and Leonor, another one formed only by Patricia, and a third one formed only by Luiz Felipe.

When using dissimilarity measures in clustering methods, a very useful criterion for choosing the number of clusters consists in identifying a considerable distance leap (whenever possible) and defining the number of clusters formed in the clustering stage immediately before the great leap, since very high leaps may incorporate observations with characteristics that are not so homogeneous. Furthermore, if the distance leaps from one stage to another are small, due to the existence of variables with values that are too close across the observations, which can make it difficult to read the dendrogram, the researcher may use the squared Euclidean distance, so that the leaps become clearer and better defined, making it easier to identify the clusters in the dendrogram and providing better arguments for the decision-making process. Software such as SPSS shows dendrograms with rescaled distance measures, in order to facilitate the interpretation of the allocation of each observation and the visualization of the large distance leaps.

Fig. 11.15 illustrates how clusters can be established after the single-linkage method is elaborated. Next, we will develop the same example; however, now let's use the complete- and average-linkage methods, so that we can compare the order of the observations and the distance leaps.
11.2.2.1.2.2 Furthest-Neighbor or Complete-Linkage Method

Matrix D0, shown here, is obviously the same, and the smallest Euclidian distance, the one highlighted, is between observations Gabriela and Ovidio, which become the first cluster. It is important to emphasize that the first cluster will always be the same, regardless of the linkage method used, since the first stage will always consider the smallest distance between pairs of observations that are still isolated.
FIG. 11.15 Suggestion of clusters formed after the single-linkage method.
In the complete-linkage method, we must use Expression (11.24) to construct matrix D1, as follows:

d_(Gabriela-Ovidio)Luiz Felipe = max{10.132, 10.290} = 10.290
d_(Gabriela-Ovidio)Patricia = max{8.420, 6.580} = 8.420
d_(Gabriela-Ovidio)Leonor = max{4.170, 5.474} = 5.474

Matrix D1 gathers these distances, together with the original distances between the observations that are still isolated; by analyzing it, we can see that observation Leonor will be incorporated into the cluster formed by Gabriela and Ovidio, since, once again, that merge corresponds to the smallest value among all of those shown in matrix D1.
As verified when using the single-linkage method, here observations Luiz Felipe and Patricia also remain isolated at this stage. The differences between the methods start arising now. Therefore, we will construct matrix D2 using the following criteria:

d_(Gabriela-Ovidio-Leonor)Luiz Felipe = max{10.290, 8.223} = 10.290
d_(Gabriela-Ovidio-Leonor)Patricia = max{8.420, 6.045} = 8.420
Matrix D2 gathers these distances, together with the distance between the two observations that are still isolated.
In the third clustering stage, a new cluster is formed by the fusion of observations Patricia and Luiz Felipe, since the furthest-neighbor criterion adopted in the complete-linkage method makes the distance between these two observations the smallest among all of those calculated to construct matrix D2. Notice that, at this stage, differences relative to the single-linkage method appear in terms of the sorting and allocation of the observations to groups. Hence, to construct matrix D3, we must take the following criterion into consideration:

d_(Gabriela-Ovidio-Leonor)(Luiz Felipe-Patricia) = max{10.290, 8.420} = 10.290
In the same way, in the fourth and last stage, all the observations are allocated to the same cluster, through the merging of Gabriela-Ovidio-Leonor and Luiz Felipe-Patricia. Table 11.14 shows a summary of this agglomeration schedule, elaborated by using the complete-linkage method. This agglomeration schedule's dendrogram can be seen in Fig. 11.16. We can initially see that the sorting of the observations is different from what was observed in the dendrogram shown in Fig. 11.12.

Analogous to what was carried out in the previous method, we chose to draw two vertical lines (I and II) over the largest distance leap, as shown in Fig. 11.17. Thus, if the researcher chooses to consider three clusters, the solution will be the same as the one achieved previously through the single-linkage method: one cluster formed by Gabriela, Ovidio, and Leonor, another by Luiz Felipe, and a third by Patricia (line I in Fig. 11.17). However, if he chooses to define two clusters (line II), the solution will be different since, in this case, the second cluster will be formed by Luiz Felipe and Patricia, while in the previous case it was formed only by Luiz Felipe, observation Patricia having been allocated to the first cluster.
TABLE 11.14 Agglomeration Schedule Through the Complete-Linkage Method

Stage | Cluster | Grouped Observation | Smallest Euclidian Distance
1 | Gabriela | Ovidio | 3.713
2 | Gabriela-Ovidio | Leonor | 5.474
3 | Luiz Felipe | Patricia | 7.187
4 | Gabriela-Ovidio-Leonor | Luiz Felipe-Patricia | 10.290
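Analogously, the complete-linkage schedule of Table 11.14 follows by swapping the inner min over pairwise distances for a max, per Expression (11.24) (again a sketch with our own names):

```python
from math import dist  # Euclidean distance (Python 3.8+)

# Grades from Table 11.12: (Math, Physics, Chemistry)
grades = {
    "Gabriela": (3.7, 2.7, 9.1),
    "Luiz Felipe": (7.8, 8.0, 1.5),
    "Patricia": (8.9, 1.0, 2.7),
    "Ovidio": (7.0, 1.0, 9.0),
    "Leonor": (3.4, 2.0, 5.0),
}

def complete_linkage_schedule(points):
    """Furthest-neighbor agglomeration: the distance between two
    clusters is their LARGEST pairwise distance, but we still merge
    the pair of clusters whose distance is smallest overall."""
    clusters = [[name] for name in points]
    schedule = []
    while len(clusters) > 1:
        d, i, j = min(
            (max(dist(points[p], points[q]) for p in a for q in b), i, j)
            for i, a in enumerate(clusters)
            for j, b in enumerate(clusters) if i < j
        )
        clusters[i] += clusters[j]
        del clusters[j]
        schedule.append(round(d, 3))
    return schedule
```

The returned stage distances are 3.713, 5.474, 7.187, and 10.290, matching Table 11.14, including the stage-3 fusion of Luiz Felipe and Patricia.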
FIG. 11.16 Dendrogram—Complete-linkage method (horizontal axis: Euclidean distance).
FIG. 11.17 Interpreting the dendrogram—Clusters and distance leaps.
Similar to what was done in the previous method, Fig. 11.18 illustrates how the clusters can be established after the complete-linkage method is carried out.

Defining the clustering solution can also be based on the application of the average-linkage method, in which two groups merge based on the average distance between all the pairs of observations that belong to these groups. Therefore, as we have already discussed, if the most suitable method is single linkage, because there are observations considerably far apart from one another, the sorting and allocation of the observations will be maintained by the average-linkage method. On the other hand, the outputs of this method will show consistency with the solution achieved through the complete-linkage method, as regards the sorting and allocation of the observations, if they are very similar in the variables under study. Thus, it is advisable for the researcher to apply the three linkage methods when elaborating a cluster analysis through hierarchical agglomeration schedules. Therefore, let's move on to the average-linkage method.

11.2.2.1.2.3 Between-Groups or Average-Linkage Method

First of all, let's show the Euclidian distance matrix between each pair of observations (matrix D0) once again, highlighting the smallest distance between them.
FIG. 11.18 Suggestion of clusters formed after the complete-linkage method.
By using Expression (11.25), we are able to calculate the terms of matrix D1, given that the first cluster Gabriela-Ovidio has already been formed. Thus, we have:

d_(Gabriela-Ovidio)Luiz Felipe = (10.132 + 10.290) / 2 = 10.211
d_(Gabriela-Ovidio)Patricia = (8.420 + 6.580) / 2 = 7.500
d_(Gabriela-Ovidio)Leonor = (4.170 + 5.474) / 2 = 4.822

Matrix D1 shows that observation Leonor is once again incorporated into the cluster formed by Gabriela and Ovidio, since, as highlighted, that merge corresponds to the smallest value among all of those presented in matrix D1.
In order to construct matrix D2, in which the distances between the cluster Gabriela-Ovidio-Leonor and the two remaining observations are calculated, we must perform the following calculations:

d_(Gabriela-Ovidio-Leonor)Luiz Felipe = (10.132 + 10.290 + 8.223) / 3 = 9.548
d_(Gabriela-Ovidio-Leonor)Patricia = (8.420 + 6.580 + 6.045) / 3 = 7.015
Note that the distances used to calculate the dissimilarities inserted into matrix D2 are the original Euclidian distances between each pair of observations, that is, they come from matrix D0. Matrix D2 gathers these average distances, together with the distance between the two observations that are still isolated.
As verified when the single-linkage method was elaborated, here observation Patricia is also incorporated into the cluster already formed by Gabriela, Ovidio, and Leonor, and observation Luiz Felipe remains isolated. Finally, matrix D3 can be constructed from the following calculation:

d_(Gabriela-Ovidio-Leonor-Patricia)Luiz Felipe = (10.132 + 10.290 + 8.223 + 7.187) / 4 = 8.958
Once again, in the fourth and last stage, all the observations are in the same cluster. Table 11.15 and Fig. 11.19 present a summary of this agglomeration schedule and the corresponding dendrogram, respectively, resulting from the average-linkage method.
TABLE 11.15 Agglomeration Schedule Through the Average-Linkage Method

Stage | Cluster | Grouped Observation | Smallest Euclidian Distance
1 | Gabriela | Ovidio | 3.713
2 | Gabriela-Ovidio | Leonor | 4.822
3 | Gabriela-Ovidio-Leonor | Patricia | 7.015
4 | Gabriela-Ovidio-Leonor-Patricia | Luiz Felipe | 8.958
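The average-linkage schedule of Table 11.15 uses the mean of the original pairwise distances, per Expression (11.25). A sketch with our own names:

```python
from math import dist  # Euclidean distance (Python 3.8+)

# Grades from Table 11.12: (Math, Physics, Chemistry)
grades = {
    "Gabriela": (3.7, 2.7, 9.1),
    "Luiz Felipe": (7.8, 8.0, 1.5),
    "Patricia": (8.9, 1.0, 2.7),
    "Ovidio": (7.0, 1.0, 9.0),
    "Leonor": (3.4, 2.0, 5.0),
}

def average_linkage_schedule(points):
    """Between-groups agglomeration: the distance between two clusters
    is the average of all original pairwise distances between them
    (Expression 11.25); merge the pair with the smallest such average."""
    def avg_dist(a, b):
        pairs = [dist(points[p], points[q]) for p in a for q in b]
        return sum(pairs) / len(pairs)

    clusters = [[name] for name in points]
    schedule = []
    while len(clusters) > 1:
        d, i, j = min(
            (avg_dist(a, b), i, j)
            for i, a in enumerate(clusters)
            for j, b in enumerate(clusters) if i < j
        )
        clusters[i] += clusters[j]
        del clusters[j]
        schedule.append(round(d, 3))
    return schedule
```

The returned stage distances are 3.713, 4.822, 7.015, and 8.958, matching Table 11.15 and the same observation ordering as the single-linkage method.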
FIG. 11.19 Dendrogram—Average-linkage method (horizontal axis: Euclidean distance).
Despite having other distance values, we can see that Table 11.15 and Fig. 11.19 show the same sorting and the same allocation of observations to clusters as those presented in Table 11.13 and Fig. 11.12, respectively, obtained when the single-linkage method was elaborated. Hence, we can state that the observations are significantly different from one another in the variables studied, a fact proven by the consistency of the answers obtained from the single- and average-linkage methods. If the observations were more similar, a fact not observed in the diagram in Fig. 11.11, the consistency of answers would occur between the complete- and average-linkage methods, as already discussed. Therefore, when possible, the initial construction of scatter plots may help researchers, even if in a preliminary way, to choose the method to be adopted.

Hierarchical agglomeration schedules are very useful and offer us the possibility to analyze, in an exploratory way, the similarity between observations based on the behavior of certain variables. However, it is essential for researchers to understand that these methods are not conclusive by themselves, and more than one answer may be obtained, depending on what is desired and on the data behavior. Besides, it is necessary for researchers to be aware of how sensitive these methods are to the presence of outliers. The existence of a very discrepant observation may cause other observations, not so similar to one another, to be allocated to the same cluster simply because they are all extremely different from the observation considered an outlier. Hence, it is advisable to apply the hierarchical agglomeration schedules with the chosen linkage method several times and, in each application, to identify one or more observations considered outliers. This procedure will make the cluster analysis more reliable, since more and more homogeneous clusters may be formed.
Researchers are free to characterize the most discrepant observation as the one that remains isolated until the penultimate clustering stage, that is, right before the final fusion. Nonetheless, there are many methods to define an outlier. Barnett and Lewis (1994), for instance, mention almost 1000 articles in the existing literature on outliers and, for pedagogical purposes, in the Appendix of this chapter, we will discuss an efficient procedure in Stata for detecting outliers when carrying out a multivariate data analysis.

It is also important to emphasize, as we have already discussed in this section, that different linkage methods must be applied to the same dataset when elaborating hierarchical agglomeration schedules, and the resulting dendrograms compared. This procedure will help researchers in their decision-making processes with regard to choosing the ideal number of clusters, and also to sorting the observations and allocating each one of them to the different clusters formed. It will even allow researchers to make coherent decisions about the number of clusters that may be considered input in a possible nonhierarchical analysis.

Last but not least, it is worth mentioning that the agglomeration schedules presented in this section (Tables 11.13, 11.14, and 11.15) provide increasing values of the clustering measures because a dissimilarity measure (the Euclidian distance) was used as the comparison criterion between the observations. If we had chosen Pearson's correlation between the observations, a similarity measure also used for metric variables, as discussed in Section 11.2.1.1, the values of the clustering measures in the agglomeration schedules would be decreasing. The latter is also true for cluster analyses in which similarity measures are used, as the ones studied in Section 11.2.1.2, to assess the behavior of observations based on binary variables.
In the following section we will develop the same example, in an algebraic way, using the nonhierarchical k-means agglomeration schedule.
11.2.2.2 Nonhierarchical K-Means Agglomeration Schedule

Among all the nonhierarchical agglomeration schedules, the k-means procedure is the one most often used by researchers in several fields of knowledge. Given that the number of clusters must be previously defined by the researcher, this procedure can be
elaborated after the application of a hierarchical agglomeration schedule when we have no idea of the number of clusters that can be formed; in this situation, the output obtained from the hierarchical schedule can serve as input for the nonhierarchical one.

11.2.2.2.1 Notation
As with the sequence developed in Section 11.2.2.1.1, we now present a logical sequence of steps, based on Johnson and Wichern (2007), in order to facilitate the understanding of the cluster analysis (k-means procedure):

1. We define the initial number of clusters and the respective centroids. The main objective is to divide the observations of the dataset into K clusters, such that those within each cluster are the closest to each other if compared to any observation that belongs to a different cluster. For that, the observations need to be allocated arbitrarily to the K clusters, so that the respective centroids can be calculated.
2. We must choose a certain observation that is closer to the centroid of another cluster and reallocate it to that cluster. At this moment, another cluster has just lost that observation, and, therefore, the centroids of the cluster that receives it and of the cluster that loses it must be recalculated.
3. We must continue repeating the previous step until it is no longer possible to reallocate any observation due to its closer proximity to the centroid of another cluster.

Centroid coordinate x must be recalculated whenever a certain observation p is included in or excluded from the respective cluster, based on the following expressions:

x_new = (N x + x_p) / (N + 1), if observation p is inserted into the cluster under analysis   (11.26)
x_new = (N x - x_p) / (N - 1), if observation p is excluded from the cluster under analysis   (11.27)
where N and x refer to the number of observations in the cluster and to its centroid coordinate before the reallocation of that observation, respectively. In addition, x_p refers to the coordinate of observation p, which changed clusters.

For two variables (X1 and X2), Fig. 11.20 shows a hypothetical situation that represents the end of the k-means procedure, in which it is no longer possible to reallocate any observation because there are no closer proximities to centroids of other clusters. Unlike in the hierarchical agglomeration schedules, the matrix of distances between observations does not need to be defined at each step, which reduces the computational requirements and allows nonhierarchical agglomeration schedules to be applied to considerably larger datasets than those traditionally studied through hierarchical schedules.

FIG. 11.20 Hypothetical situation that represents the end of the k-means procedure.
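Expressions (11.26) and (11.27) can be sketched directly (function names are ours); the point is that a centroid can be updated incrementally, without re-averaging the whole cluster:

```python
def centroid_after_insert(centroid, n, point):
    """Expression (11.26): centroid of a cluster of n observations
    after observation `point` is inserted into it."""
    return tuple((n * c + x) / (n + 1) for c, x in zip(centroid, point))

def centroid_after_remove(centroid, n, point):
    """Expression (11.27): centroid of a cluster of n observations
    after observation `point` is excluded from it."""
    return tuple((n * c - x) / (n - 1) for c, x in zip(centroid, point))
```

For a cluster {(1, 2), (3, 4)} with centroid (2, 3), inserting (5, 6) gives (3, 4), the same result as re-averaging the three points; removing (3, 4) gives back (1, 2).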
In addition, bear in mind that the variables must be standardized before elaborating the k-means procedure (and in the hierarchical agglomeration schedules, too) if the respective values are not in the same unit of measure. Finally, after concluding this procedure, it is important for researchers to analyze whether the values of a certain metric variable differ between the groups defined, that is, whether the variability between the clusters is significantly higher than the internal variability of each cluster. The F-test of the one-way analysis of variance (one-way ANOVA) allows us to develop this analysis, and its null and alternative hypotheses can be defined as follows:

H0: the variable under analysis has the same mean in all the groups formed.
H1: the variable under analysis has a different mean in at least one of the groups in relation to the others.

Therefore, a single F-test can be applied for each variable, aiming to assess the existence of at least one difference among all the comparison possibilities; the main advantage of applying it is that adjustments for the discrepant dimensions of the groups do not need to be carried out in order to analyze several comparisons. On the other hand, rejecting the null hypothesis at a certain significance level does not allow the researcher to know which group(s) is(are) statistically different from the others in relation to the variable being analyzed. The F statistic corresponding to this test is given by the following expression:

F = (variability between the groups) / (variability within the groups)
  = [ Σ_{k=1}^{K} N_k (X̄_k - X̄)² / (K - 1) ] / [ Σ_{k,i} (X_ki - X̄_k)² / (n - K) ]   (11.28)
where N_k is the number of observations in the k-th cluster, X̄_k is the mean of variable X in the same k-th cluster, X̄ is the general mean of variable X, and X_ki is the value that variable X takes on for a certain observation i present in the k-th cluster. In addition, K represents the number of clusters to be compared, and n, the sample size.

By using the F statistic, researchers will be able to identify the variables whose means most differ between the groups, that is, those that most contribute to the formation of at least one of the K clusters (highest F statistic), as well as those that do not contribute to the formation of the suggested number of clusters, at a certain significance level. In the following section, we will discuss a practical example that will be solved algebraically, and from which the concepts of the k-means procedure may be established.

11.2.2.2.2 A Practical Example of a Cluster Analysis With the Nonhierarchical K-Means Agglomeration Schedule

To solve the nonhierarchical k-means agglomeration schedule algebraically, let's use the data from our own example, which can be found in Table 11.12 and are reproduced in Table 11.16. Software packages such as SPSS use the Euclidian distance as the standard dissimilarity measure, which is why we will develop the algebraic procedures based on this measure. This criterion will even allow the results obtained to be compared to the ones found when elaborating the hierarchical agglomeration schedules in Section 11.2.2.1.2, since in those situations the Euclidian distance was also used. In the same way, it will not be necessary to standardize the variables through Z-scores, since all of them are in the same unit of measure (grades from 0 to 10). Otherwise, it would be crucial for researchers to standardize the variables before elaborating the k-means procedure.

TABLE 11.16 Example: Grades in Math, Physics, and Chemistry on the College Entrance Exams

Student (Observation) | Grade in Mathematics (X1i) | Grade in Physics (X2i) | Grade in Chemistry (X3i)
Gabriela | 3.7 | 2.7 | 9.1
Luiz Felipe | 7.8 | 8.0 | 1.5
Patricia | 8.9 | 1.0 | 2.7
Ovidio | 7.0 | 1.0 | 9.0
Leonor | 3.4 | 2.0 | 5.0
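The one-way ANOVA F statistic of Expression (11.28) can be sketched for a single variable as follows (the function name is ours; groups are given as lists of values, one list per cluster):

```python
def one_way_f(groups):
    """F statistic from Expression (11.28): between-groups variability
    over within-groups variability for one metric variable."""
    n = sum(len(g) for g in groups)  # total sample size
    k = len(groups)                  # number of clusters K
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    between = sum(len(g) * (m - grand_mean) ** 2
                  for g, m in zip(groups, means)) / (k - 1)
    within = sum((x - m) ** 2
                 for g, m in zip(groups, means) for x in g) / (n - k)
    return between / within
```

With two hypothetical well-separated groups [1, 2] and [5, 6], F = 32; a large F signals that the variable contributes to discriminating between the clusters.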
TABLE 11.17 Arbitrary Allocation of the Observations in K = 3 Clusters and Calculation of the Centroid Coordinates—Initial Step of the K-Means Procedure

Cluster | Grade in Mathematics (centroid) | Grade in Physics (centroid) | Grade in Chemistry (centroid)
Gabriela, Luiz Felipe | (3.7 + 7.8)/2 = 5.75 | (2.7 + 8.0)/2 = 5.35 | (9.1 + 1.5)/2 = 5.30
Patricia, Ovidio | (8.9 + 7.0)/2 = 7.95 | (1.0 + 1.0)/2 = 1.00 | (2.7 + 9.0)/2 = 5.85
Leonor | 3.40 | 2.00 | 5.00
Using the logical sequence presented in Section 11.2.2.2.1, we will develop the k-means procedure with K = 3 clusters. This number of clusters may come from a decision made by the researcher based on a certain preliminary criterion, or it may be chosen based on the outputs of the hierarchical agglomeration schedules. In our case, the decision was made based on the comparison of the dendrograms already constructed, and on the similarity of the outputs obtained by the single- and average-linkage methods.

Thus, we need to arbitrarily allocate the observations to three clusters, so that the respective centroids can be calculated. We can establish that observations Gabriela and Luiz Felipe form the first cluster, Patricia and Ovidio the second, and Leonor the third. Table 11.17 shows the arbitrary formation of these preliminary clusters, as well as the calculation of the respective centroid coordinates, which constitutes the initial step of the k-means algorithm. Based on these coordinates, we constructed the chart seen in Fig. 11.21, which shows the arbitrary allocation of each observation to its cluster and the respective centroids.

Based on the second step of the logical sequence presented in Section 11.2.2.2.1, we must choose a certain observation and calculate the distance between it and all the cluster centroids, assuming that it is or is not reallocated to each cluster. Selecting the first observation (Gabriela), for example, we can calculate the distances between it and the centroids of the clusters that have already been formed (Gabriela-Luiz Felipe, Patricia-Ovidio, and Leonor) and, after that, assume that it leaves its cluster (Gabriela-Luiz Felipe) and is inserted into one of the other two clusters, forming the cluster Gabriela-Patricia-Ovidio or Gabriela-Leonor.
Thus, from Expressions (11.26) and (11.27), we must recalculate the new centroid coordinates, simulating that, in fact, the reallocation of Gabriela to one of the two clusters takes place, as shown in Table 11.18. Thus, from Tables 11.16, 11.17, and 11.18, we can calculate the following Euclidian distances: l
• Assumption that Gabriela is not reallocated:

d(Gabriela; Gabriela-Luiz Felipe) = √[(3.70 − 5.75)² + (2.70 − 5.35)² + (9.10 − 5.30)²] = 5.066
d(Gabriela; Patricia-Ovidio) = √[(3.70 − 7.95)² + (2.70 − 1.00)² + (9.10 − 5.85)²] = 5.614
d(Gabriela; Leonor) = √[(3.70 − 3.40)² + (2.70 − 2.00)² + (9.10 − 5.00)²] = 4.170

• Assumption that Gabriela is reallocated:

d(Gabriela; Luiz Felipe) = √[(3.70 − 7.80)² + (2.70 − 8.00)² + (9.10 − 1.50)²] = 10.132
d(Gabriela; Gabriela-Patricia-Ovidio) = √[(3.70 − 6.53)² + (2.70 − 1.57)² + (9.10 − 6.93)²] = 3.743
d(Gabriela; Gabriela-Leonor) = √[(3.70 − 3.55)² + (2.70 − 2.35)² + (9.10 − 7.05)²] = 2.085
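These distance calculations are easy to check numerically. The sketch below (our illustrative script, not part of the book) recomputes the six Euclidian distances with NumPy:

```python
import numpy as np

# Grades (mathematics, physics, chemistry) from Table 11.16
gabriela    = np.array([3.70, 2.70, 9.10])
luiz_felipe = np.array([7.80, 8.00, 1.50])
patricia    = np.array([8.90, 1.00, 2.70])
ovidio      = np.array([7.00, 1.00, 9.00])
leonor      = np.array([3.40, 2.00, 5.00])

def centroid(*members):
    # Centroid = coordinate-wise mean of the cluster members
    return np.mean(members, axis=0)

def dist(a, b):
    # Euclidian distance in the three-dimensional grade space
    return float(np.linalg.norm(a - b))

# Assumption that Gabriela is not reallocated
print(f"{dist(gabriela, centroid(gabriela, luiz_felipe)):.3f}")  # 5.066
print(f"{dist(gabriela, centroid(patricia, ovidio)):.3f}")       # 5.614
print(f"{dist(gabriela, leonor):.3f}")                           # 4.170

# Assumption that Gabriela is reallocated
print(f"{dist(gabriela, luiz_felipe):.3f}")                           # 10.132
print(f"{dist(gabriela, centroid(gabriela, patricia, ovidio)):.3f}")  # 3.743
print(f"{dist(gabriela, centroid(gabriela, leonor)):.3f}")            # 2.085
```

Note that the distance of 3.743 uses the exact centroid (19.6/3, 4.7/3, 20.8/3), of which the values 6.53, 1.57, and 6.93 above are two-decimal roundings.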
342
PART
V
Multivariate Exploratory Data Analysis
FIG. 11.21 Arbitrary allocation of the observations in K = 3 clusters and respective centroids—Initial step of the K-means procedure.
TABLE 11.18 Simulating the Reallocation of Gabriela and Calculating the New Centroid Coordinates

Cluster                     Simulation           Grade in Mathematics               Grade in Physics                   Grade in Chemistry
Luiz Felipe                 Excluding Gabriela   [2(5.75) − 3.70]/(2 − 1) = 7.80    [2(5.35) − 2.70]/(2 − 1) = 8.00    [2(5.30) − 9.10]/(2 − 1) = 1.50
Patricia-Ovidio-Gabriela    Including Gabriela   [2(7.95) + 3.70]/(2 + 1) = 6.53    [2(1.00) + 2.70]/(2 + 1) = 1.57    [2(5.85) + 9.10]/(2 + 1) = 6.93
Leonor-Gabriela             Including Gabriela   [1(3.40) + 3.70]/(1 + 1) = 3.55    [1(2.00) + 2.70]/(1 + 1) = 2.35    [1(5.00) + 9.10]/(1 + 1) = 7.05

Obs.: Note that the values calculated for the Luiz Felipe centroid coordinates are exactly the same as this observation's original coordinates, as shown in Table 11.16.
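Although Expressions (11.26) and (11.27) are not reproduced here, the updates in Table 11.18 correspond to the usual incremental centroid formulas: adding observation x to a cluster of n members with centroid c gives (n·c + x)/(n + 1), and removing it gives (n·c − x)/(n − 1). A sketch (function names are ours):

```python
import numpy as np

def centroid_with(c, n, x):
    # Centroid after ADDING observation x to a cluster of n members
    # whose current centroid is c
    return (n * np.asarray(c) + np.asarray(x)) / (n + 1)

def centroid_without(c, n, x):
    # Centroid after REMOVING observation x from a cluster of n members
    # whose current centroid is c
    return (n * np.asarray(c) - np.asarray(x)) / (n - 1)

gabriela = np.array([3.70, 2.70, 9.10])

# Excluding Gabriela from Gabriela-Luiz Felipe leaves Luiz Felipe's own grades
print(centroid_without([5.75, 5.35, 5.30], 2, gabriela))           # 7.80, 8.00, 1.50

# Including Gabriela in Patricia-Ovidio and in Leonor (rounded to 2 decimals)
print(np.round(centroid_with([7.95, 1.00, 5.85], 2, gabriela), 2))  # 6.53, 1.57, 6.93
print(np.round(centroid_with([3.40, 2.00, 5.00], 1, gabriela), 2))  # 3.55, 2.35, 7.05
```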
Since Gabriela is closest to the Gabriela-Leonor centroid (the shortest Euclidian distance), we must reallocate this observation to the cluster initially formed only by Leonor. The cluster in which Gabriela was at first (Gabriela-Luiz Felipe) has just lost it, and Luiz Felipe now becomes an individual cluster. Therefore, the centroids of the cluster that receives the observation and of the one that loses it must be recalculated. Table 11.19 shows the creation of the new clusters, as well as the calculation of the respective centroid coordinates.
Cluster Analysis Chapter
11
343
TABLE 11.19 New Centroids With the Reallocation of Gabriela

Cluster            Grade in Mathematics      Grade in Physics          Grade in Chemistry
Luiz Felipe        7.80                      8.00                      1.50
Patricia-Ovidio    7.95                      1.00                      5.85
Gabriela-Leonor    (3.7 + 3.4)/2 = 3.55      (2.7 + 2.0)/2 = 2.35      (9.1 + 5.0)/2 = 7.05
Based on these new coordinates, we can construct the chart shown in Fig. 11.22. Let's now repeat the previous step. Since observation Luiz Felipe is currently isolated, let's simulate the reallocation of the third observation (Patricia). We must calculate the distances between it and the centroids of the clusters that have already been formed (Luiz Felipe, Patricia-Ovidio, and Gabriela-Leonor) and, afterwards, assume that it leaves its cluster (Patricia-Ovidio) and is inserted into one of the other two clusters, forming the cluster Luiz Felipe-Patricia or Gabriela-Patricia-Leonor. Also based on Expressions (11.26) and (11.27), we must recalculate the new centroid coordinates, simulating that the reallocation of Patricia to one of these two clusters in fact happens, as shown in Table 11.20. Similar to what was carried out when simulating Gabriela's reallocation, based on Tables 11.16, 11.19, and 11.20, let's calculate the Euclidian distances between Patricia and each one of the centroids:

FIG. 11.22 New clusters and respective centroids—Reallocation of Gabriela.
TABLE 11.20 Simulation of Patricia's Reallocation—Next Step of the K-Means Procedure Algorithm

Cluster                     Simulation           Grade in Mathematics               Grade in Physics                   Grade in Chemistry
Luiz Felipe-Patricia        Including Patricia   [1(7.80) + 8.90]/(1 + 1) = 8.35    [1(8.00) + 1.00]/(1 + 1) = 4.50    [1(1.50) + 2.70]/(1 + 1) = 2.10
Ovidio                      Excluding Patricia   [2(7.95) − 8.90]/(2 − 1) = 7.00    [2(1.00) − 1.00]/(2 − 1) = 1.00    [2(5.85) − 2.70]/(2 − 1) = 9.00
Gabriela-Leonor-Patricia    Including Patricia   [2(3.55) + 8.90]/(2 + 1) = 5.33    [2(2.35) + 1.00]/(2 + 1) = 1.90    [2(7.05) + 2.70]/(2 + 1) = 5.60

Obs.: Note that the values calculated for the Ovidio centroid coordinates are exactly the same as this observation's original coordinates, as shown in Table 11.16.
• Assumption that Patricia is not reallocated:

d(Patricia; Luiz Felipe) = √[(8.90 − 7.80)² + (1.00 − 8.00)² + (2.70 − 1.50)²] = 7.187
d(Patricia; Patricia-Ovidio) = √[(8.90 − 7.95)² + (1.00 − 1.00)² + (2.70 − 5.85)²] = 3.290
d(Patricia; Gabriela-Leonor) = √[(8.90 − 3.55)² + (1.00 − 2.35)² + (2.70 − 7.05)²] = 7.026

• Assumption that Patricia is reallocated:

d(Patricia; Luiz Felipe-Patricia) = √[(8.90 − 8.35)² + (1.00 − 4.50)² + (2.70 − 2.10)²] = 3.593
d(Patricia; Ovidio) = √[(8.90 − 7.00)² + (1.00 − 1.00)² + (2.70 − 9.00)²] = 6.580
d(Patricia; Gabriela-Patricia-Leonor) = √[(8.90 − 5.33)² + (1.00 − 1.90)² + (2.70 − 5.60)²] = 4.684
Bearing in mind that the Euclidian distance between Patricia and the centroid of its own cluster, Patricia-Ovidio, is the shortest, we must not reallocate this observation, and, at this moment, we maintain the solution presented in Table 11.19 and in Fig. 11.22. Next, we will develop the same procedure, however, simulating the reallocation of the fourth observation (Ovidio). Analogously, we must calculate the distances between this observation and the centroids of the clusters that have already been formed (Luiz Felipe, Patricia-Ovidio, and Gabriela-Leonor) and, after that, assume that it leaves its cluster (Patricia-Ovidio) and is inserted into one of the other two clusters, forming the cluster Luiz Felipe-Ovidio or Gabriela-Ovidio-Leonor. Once again using Expressions (11.26) and (11.27), we can recalculate the new centroid coordinates, simulating that the reallocation of Ovidio to one of these two clusters in fact takes place, as shown in Table 11.21. Next, we can see the calculations of the Euclidian distances between Ovidio and each one of the centroids, defined from Tables 11.16, 11.19, and 11.21:
• Assumption that Ovidio is not reallocated:

d(Ovidio; Luiz Felipe) = √[(7.00 − 7.80)² + (1.00 − 8.00)² + (9.00 − 1.50)²] = 10.290
TABLE 11.21 Simulating Ovidio's Reallocation—New Step of the K-Means Procedure Algorithm

Cluster                    Simulation         Grade in Mathematics               Grade in Physics                   Grade in Chemistry
Luiz Felipe-Ovidio         Including Ovidio   [1(7.80) + 7.00]/(1 + 1) = 7.40    [1(8.00) + 1.00]/(1 + 1) = 4.50    [1(1.50) + 9.00]/(1 + 1) = 5.25
Patricia                   Excluding Ovidio   [2(7.95) − 7.00]/(2 − 1) = 8.90    [2(1.00) − 1.00]/(2 − 1) = 1.00    [2(5.85) − 9.00]/(2 − 1) = 2.70
Gabriela-Leonor-Ovidio     Including Ovidio   [2(3.55) + 7.00]/(2 + 1) = 4.70    [2(2.35) + 1.00]/(2 + 1) = 1.90    [2(7.05) + 9.00]/(2 + 1) = 7.70

Obs.: Note that the values calculated for the Patricia centroid coordinates are exactly the same as this observation's original coordinates, as shown in Table 11.16.
d(Ovidio; Patricia-Ovidio) = √[(7.00 − 7.95)² + (1.00 − 1.00)² + (9.00 − 5.85)²] = 3.290
d(Ovidio; Gabriela-Leonor) = √[(7.00 − 3.55)² + (1.00 − 2.35)² + (9.00 − 7.05)²] = 4.187

• Assumption that Ovidio is reallocated:

d(Ovidio; Luiz Felipe-Ovidio) = √[(7.00 − 7.40)² + (1.00 − 4.50)² + (9.00 − 5.25)²] = 5.145
d(Ovidio; Patricia) = √[(7.00 − 8.90)² + (1.00 − 1.00)² + (9.00 − 2.70)²] = 6.580
d(Ovidio; Gabriela-Ovidio-Leonor) = √[(7.00 − 4.70)² + (1.00 − 1.90)² + (9.00 − 7.70)²] = 2.791
In this case, since observation Ovidio is the closest to the centroid of Gabriela-Ovidio-Leonor (the shortest Euclidian distance), we must reallocate this observation to the cluster formed originally by Gabriela and Leonor. Therefore, observation Patricia becomes an individual cluster. Table 11.22 shows the centroid coordinates of clusters Luiz Felipe, Patricia, and Gabriela-Ovidio-Leonor. We will not carry out the procedure proposed for the fifth observation (Leonor), since it had already merged with observation Gabriela in the first step of the algorithm. We can consider that the k-means procedure is concluded, since it is no
TABLE 11.22 New Centroids With Ovidio's Reallocation

Cluster                   Grade in Mathematics    Grade in Physics    Grade in Chemistry
Luiz Felipe               7.80                    8.00                1.50
Patricia                  8.90                    1.00                2.70
Gabriela-Ovidio-Leonor    4.70                    1.90                7.70
FIG. 11.23 Solution of the K-means procedure.
longer possible to reallocate any observation due to closer proximity to another cluster's centroid. Fig. 11.23 shows the allocation of each observation to its cluster and the respective centroids. Note that the solution achieved is equal to the one reached through the single-linkage (Fig. 11.15) and average-linkage methods, when we elaborated the hierarchical agglomeration schedules. As we have already discussed, the matrix with the distances between the observations does not need to be defined at each step of the k-means algorithm, unlike in the hierarchical agglomeration schedules. This reduces the computational requirements, allowing nonhierarchical agglomeration schedules to be applied to datasets significantly larger than the ones traditionally studied through hierarchical schedules. Table 11.23 shows the Euclidian distances between each observation of the original dataset and the centroids of each one of the clusters formed.
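The reallocation procedure described in this section can be sketched as a short loop. The implementation below is our illustrative version of the logic (for each observation, it tests whether the centroid of another cluster, recalculated to include the observation, is closer than the centroid of its current cluster), starting from the arbitrary allocation of Table 11.17:

```python
import numpy as np

# Grades (mathematics, physics, chemistry) from Table 11.16
data = np.array([[3.70, 2.70, 9.10],   # 0 Gabriela
                 [7.80, 8.00, 1.50],   # 1 Luiz Felipe
                 [8.90, 1.00, 2.70],   # 2 Patricia
                 [7.00, 1.00, 9.00],   # 3 Ovidio
                 [3.40, 2.00, 5.00]])  # 4 Leonor

# Arbitrary initial allocation (Table 11.17)
clusters = [{0, 1}, {2, 3}, {4}]

def centroid(members):
    return data[list(members)].mean(axis=0)

moved = True
while moved:
    moved = False
    for i in range(len(data)):
        cur = next(k for k, c in enumerate(clusters) if i in c)
        if len(clusters[cur]) == 1:
            continue  # a lone observation would leave its cluster empty
        # Distance to the own centroid (observation kept in place)...
        best_k = cur
        best_d = np.linalg.norm(data[i] - centroid(clusters[cur]))
        # ...versus the centroid of every other cluster after receiving it
        for k, c in enumerate(clusters):
            if k == cur:
                continue
            d = np.linalg.norm(data[i] - centroid(c | {i}))
            if d < best_d:
                best_k, best_d = k, d
        if best_k != cur:
            clusters[cur].remove(i)
            clusters[best_k].add(i)
            moved = True

# Final clusters: {Luiz Felipe}, {Patricia}, {Gabriela, Ovidio, Leonor}
print(clusters)
```

For this dataset the loop reproduces the two reallocations described in the text (Gabriela, then Ovidio) and converges to the solution of Fig. 11.23.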
TABLE 11.23 Euclidian Distances Between Observations and Cluster Centroids

Student (Observation)    Luiz Felipe    Patricia    Gabriela-Ovidio-Leonor
Gabriela                 10.132         8.420       1.897
Luiz Felipe              0.000          7.187       9.234
Patricia                 7.187          0.000       6.592
Ovidio                   10.290         6.580       2.791
Leonor                   8.223          6.045       2.998
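This distance table can be reproduced in one call with SciPy (a sketch; the data and centroid values are the ones from Tables 11.16 and 11.22):

```python
import numpy as np
from scipy.spatial.distance import cdist

# Rows: Gabriela, Luiz Felipe, Patricia, Ovidio, Leonor (Table 11.16)
observations = np.array([[3.70, 2.70, 9.10],
                         [7.80, 8.00, 1.50],
                         [8.90, 1.00, 2.70],
                         [7.00, 1.00, 9.00],
                         [3.40, 2.00, 5.00]])

# Columns: clusters Luiz Felipe, Patricia, Gabriela-Ovidio-Leonor (Table 11.22)
centroids = np.array([[7.80, 8.00, 1.50],
                      [8.90, 1.00, 2.70],
                      [4.70, 1.90, 7.70]])

# 5 x 3 matrix of Euclidian distances, matching Table 11.23
print(np.round(cdist(observations, centroids, metric="euclidean"), 3))
```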
TABLE 11.24 Means per Cluster and General Mean of the Variable mathematics

Cluster 1                 Cluster 2              Cluster 3
X(Luiz Felipe) = 7.80     X(Patricia) = 8.90     X(Gabriela) = 3.70; X(Ovidio) = 7.00; X(Leonor) = 3.40
X̄1 = 7.80                 X̄2 = 8.90              X̄3 = 4.70

General mean: X̄ = 6.16
We would like to emphasize that this algorithm can be elaborated with a preliminary allocation of the observations to the clusters other than the one chosen in this example. Reapplying the k-means procedure with several arbitrary initial choices, given K clusters, allows the researcher to assess how stable the clustering procedure is, and to underpin the allocation of the observations to the groups in a consistent way. After concluding this procedure, it is essential to check, through the F-test of the one-way ANOVA, whether the values of each one of the three variables considered in the analysis are statistically different between the three clusters. To make the calculation of the corresponding F statistics easier, we constructed Tables 11.24, 11.25, and 11.26, which show the means per cluster and the general mean of the variables mathematics, physics, and chemistry, respectively. Based on the values presented in these tables and by using Expression (11.28), we are able to calculate the variation between the groups and within them for each one of the variables, as well as the respective F statistics. Tables 11.27, 11.28, and 11.29 show these calculations. Now, let's analyze whether the null hypothesis of the F-test is rejected for each one of the variables. Since there are two degrees of freedom for the variability between the groups (K − 1 = 2) and two degrees of freedom for the variability within the groups (n − K = 2), by using Table A in the Appendix, we have Fc = 19.00 (critical F at a significance level of 0.05). Therefore, only for the variable physics can we reject the null hypothesis that all the groups formed have the same
TABLE 11.25 Means per Cluster and General Mean of the Variable physics

Cluster 1                 Cluster 2              Cluster 3
X(Luiz Felipe) = 8.00     X(Patricia) = 1.00     X(Gabriela) = 2.70; X(Ovidio) = 1.00; X(Leonor) = 2.00
X̄1 = 8.00                 X̄2 = 1.00              X̄3 = 1.90

General mean: X̄ = 2.94
TABLE 11.26 Means per Cluster and General Mean of the Variable chemistry

Cluster 1                 Cluster 2              Cluster 3
X(Luiz Felipe) = 1.50     X(Patricia) = 2.70     X(Gabriela) = 9.10; X(Ovidio) = 9.00; X(Leonor) = 5.00
X̄1 = 1.50                 X̄2 = 2.70              X̄3 = 7.70

General mean: X̄ = 5.46
TABLE 11.27 Variation and F Statistic for the Variable mathematics

Variability between the groups: [(7.80 − 6.16)² + (8.90 − 6.16)² + 3(4.70 − 6.16)²]/(3 − 1) = 8.296
Variability within the groups: [(3.70 − 4.70)² + (7.00 − 4.70)² + (3.40 − 4.70)²]/(5 − 3) = 3.990
F = 8.296/3.990 = 2.079

Note: The calculation of the variability within the groups only took cluster 3 into consideration, since the others show variability equal to 0, because they are formed by a single observation.
TABLE 11.28 Variation and F Statistic for the Variable physics

Variability between the groups: [(8.00 − 2.94)² + (1.00 − 2.94)² + 3(1.90 − 2.94)²]/(3 − 1) = 16.306
Variability within the groups: [(2.70 − 1.90)² + (1.00 − 1.90)² + (2.00 − 1.90)²]/(5 − 3) = 0.730
F = 16.306/0.730 = 22.337

Note: The same as the previous table.
TABLE 11.29 Variation and F Statistic for the Variable chemistry

Variability between the groups: [(1.50 − 5.46)² + (2.70 − 5.46)² + 3(7.70 − 5.46)²]/(3 − 1) = 19.176
Variability within the groups: [(9.10 − 7.70)² + (9.00 − 7.70)² + (5.00 − 7.70)²]/(5 − 3) = 5.470
F = 19.176/5.470 = 3.506

Note: The same as Table 11.27.
mean, since the calculated F, Fcal = 22.337 > Fc = F2,2,5% = 19.00. So, for this variable, there is at least one group whose mean is statistically different from the others. For the variables mathematics and chemistry, however, we cannot reject the test's null hypothesis at a significance level of 0.05.
Software packages such as SPSS and Stata do not report the Fc for the defined degrees of freedom and a certain significance level. However, they report the significance level of Fcal for these degrees of freedom. Thus, instead of verifying whether Fcal > Fc, we must verify whether the significance level of Fcal is less than 0.05 (5%). Therefore: if Sig. F (or Prob. F) < 0.05, there is at least one difference between the groups for the variable under analysis. The Fcal significance level can be obtained in Excel by using the command Formulas → Insert Function → FDIST, which will open a dialog box like the one shown in Fig. 11.24. As we can see in this figure, sig. F for the variable physics is less than 0.05 (sig. F = 0.043), that is, there is at least one difference between the groups for this variable at a significance level of 0.05. An inquisitive researcher will be able to carry out the same procedure for the variables mathematics and chemistry. In short, Table 11.30 presents the results of the one-way ANOVA, with the variation of each variable, the F statistics, and the respective significance levels. The one-way ANOVA table also allows the researcher to identify the variables that most contribute to the formation of at least one of the clusters, since variables with greater F statistic values have a mean that is statistically different in at least one of the groups in relation to the others.
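Outside of Excel, the same F statistics and significance levels can be obtained programmatically. The sketch below (our code, not the book's) recomputes the three one-way ANOVAs from the per-cluster grades and gets each p-value from the F(2, 2) distribution:

```python
import numpy as np
from scipy.stats import f as f_dist

# Grades per cluster (clusters of Table 11.22) for each variable
grades = {
    "mathematics": [[7.80], [8.90], [3.70, 7.00, 3.40]],
    "physics":     [[8.00], [1.00], [2.70, 1.00, 2.00]],
    "chemistry":   [[1.50], [2.70], [9.10, 9.00, 5.00]],
}

n, K = 5, 3
results = {}
for var, groups in grades.items():
    grand = np.mean(np.concatenate(groups))
    # Between- and within-group variability, as in Tables 11.27-11.29
    between = sum(len(g) * (np.mean(g) - grand) ** 2 for g in groups) / (K - 1)
    within = sum(((np.asarray(g) - np.mean(g)) ** 2).sum() for g in groups) / (n - K)
    F = between / within
    sig = f_dist.sf(F, K - 1, n - K)  # upper-tail p-value, as Excel's FDIST
    results[var] = (round(float(F), 3), round(float(sig), 3))
    print(var, results[var])
```

This reproduces Table 11.30: F = 2.079 (sig. 0.325) for mathematics, 22.337 (0.043) for physics, and 3.506 (0.222) for chemistry.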
It is important to mention that F statistic values are very sensitive to the sample size; in this case, the variables mathematics and chemistry ended up not having statistically different means among the three groups mainly because the sample is small (only five observations). We would like to emphasize that this one-way ANOVA can also be carried out right after the application of a certain hierarchical agglomeration schedule, since it only depends on the classification of the observations into groups. The researcher must be careful about only one thing when comparing the results obtained by a hierarchical schedule to the ones obtained by a nonhierarchical schedule: to use the same distance measure in both situations. Different allocations of the observations to the same number of clusters may happen if different distance measures are used in
FIG. 11.24 Obtaining the F significance level (command Insert Function).
TABLE 11.30 One-way Analysis of Variance (ANOVA)

Variable       Variability Between the Groups    Variability Within the Groups    F         Sig. F
mathematics    8.296                             3.990                            2.079     0.325
physics        16.306                            0.730                            22.337    0.043
chemistry      19.176                            5.470                            3.506     0.222
a hierarchical schedule and in a nonhierarchical schedule. Therefore, different values of the F statistics may be calculated in the two situations. In general, in case there are one or more variables that do not contribute to the formation of the suggested number of clusters, we recommend that the procedure be reapplied without it (or them). In these situations, the number of clusters may change and, if the researcher feels the need to underpin the initial input regarding the number of K clusters, he may even use a hierarchical agglomeration schedule without those variables before reapplying the k-means procedure, which makes the analysis cyclical. Moreover, the existence of outliers may generate considerably dispersed clusters, so treating the dataset in order to identify extremely discrepant observations is an advisable procedure before elaborating nonhierarchical agglomeration schedules. In the Appendix of this chapter, an important procedure in Stata for detecting multivariate outliers will be presented. As with hierarchical agglomeration schedules, the nonhierarchical k-means schedule cannot be used as an isolated technique to make a conclusive decision about the clustering of observations. The allocation of observations and the formation of clusters may be extremely sensitive to the data behavior, the sample size, and the criteria adopted by the researcher. Combining the outputs found with the ones coming from other techniques can underpin the researcher's choices more powerfully, and provide higher transparency in the decision-making process.
At the end of the cluster analysis, the clusters formed can be represented in the dataset by a new qualitative variable, with a category assigned to each observation (cluster 1, cluster 2, ..., cluster K). Other exploratory multivariate techniques can then be elaborated from it, such as a correspondence analysis, so that, depending on the researcher's objectives, we can study a possible association between the clusters and the categories of other qualitative variables. This new qualitative variable, which represents the allocation of each observation, may also be used as an explanatory variable of a certain phenomenon in confirmatory multivariate models, such as multiple regression models, as long
as it is transformed into dummy variables that represent the categories (clusters) of this new variable generated in the cluster analysis, as we will study in Chapter 13. On the other hand, such a procedure only makes sense when we intend to propose a diagnostic regarding the behavior of the dependent variable, without aiming at forecasts. Since a new observation does not have a place in any of the existing clusters, its allocation can only be obtained by including it in a new cluster analysis, which generates a new qualitative variable and, consequently, new dummies. In addition, this new qualitative variable can also be considered the dependent variable of a multinomial logistic regression model, allowing the researcher to evaluate the probability that each observation belongs to each one of the clusters formed, as a function of the behavior of other explanatory variables not initially considered in the cluster analysis. We would also like to highlight that this procedure depends on the research objectives and on the construct established, and has a diagnostic nature as regards the behavior of the variables in the sample for the existing observations, without a predictive purpose. Finally, if the clusters formed are substantial in terms of the number of observations allocated, we may even apply specific confirmatory techniques, using other variables, to each cluster identified, so that better-adjusted models can possibly be generated. Next, the same dataset will be used to run cluster analyses in SPSS and Stata. In Section 11.3, we will discuss the procedures for elaborating the techniques studied in SPSS, along with their results. In Section 11.4, we will study the commands to perform the procedures in Stata, with the respective outputs.
11.3 CLUSTER ANALYSIS WITH HIERARCHICAL AND NONHIERARCHICAL AGGLOMERATION SCHEDULES IN SPSS

In this section, we will discuss the step-by-step procedure for elaborating our example in the IBM SPSS Statistics Software. The main objective is to offer the researcher an opportunity to run cluster analyses with hierarchical and nonhierarchical schedules in this software package, given how easy it is to use and how didactical the operations are. Every time an output is shown, we will mention the respective result obtained when performing the algebraic solution in the previous sections, so that the researcher can compare them and increase his own knowledge of the topic. The use of the images in this section has been authorized by the International Business Machines Corporation©.
11.3.1 Elaborating Hierarchical Agglomeration Schedules in SPSS
Going back to the example presented in Section 11.2.2.1.2, remember that our professor is interested in grouping students into homogeneous clusters based on their grades (from 0 to 10) obtained on the college entrance exams in Mathematics, Physics, and Chemistry. The data can be found in the file CollegeEntranceExams.sav, and they are exactly the same as the ones presented in Table 11.12. In this section, we will carry out the cluster analysis using the Euclidian distance between the observations and only considering the single-linkage method. In order for a cluster analysis to be elaborated through a hierarchical method in SPSS, we must click on Analyze → Classify → Hierarchical Cluster.... A dialog box like the one shown in Fig. 11.25 will open. Next, we must insert the original variables from our example (mathematics, physics, and chemistry) into Variables and the variable that identifies the observations (student) into Label Cases by, as shown in Fig. 11.26. If the researcher does not have a variable that represents the name of the observations (in this case, a string), he may leave this last cell blank. First of all, in Statistics..., let's choose the options Agglomeration schedule and Proximity matrix, which make the outputs include, respectively, the table with the agglomeration schedule, constructed based on the distance measure and the linkage method to be defined, and the matrix with the distances between each pair of observations. Let's maintain the option None in Cluster Membership. Fig. 11.27 shows this dialog box. When we click on Continue, we will go back to the main dialog box of the hierarchical cluster analysis. Next, we must click on Plots.... As seen in Fig. 11.28, let's select the option Dendrogram and the option None in Icicle. In the same way, let's click on Continue, so that we can go back to the main dialog box.
In Method..., which is the most important dialog box of the hierarchical cluster analysis, we must choose the single-linkage method, also known as nearest neighbor. Thus, in Cluster Method, let's select the option Nearest neighbor. An inquisitive researcher will see that the complete- (Furthest neighbor) and average- (Between-groups linkage) linkage methods, discussed in Section 11.2.2.1, are also available in this option. Besides, since the variables in the dataset are metric, we have to choose one of the dissimilarity measures found in Measure → Interval. In order to maintain the same logic used when solving our example algebraically, we will choose the Euclidian distance as the dissimilarity measure and, therefore, we must select the option Euclidean distance. We can also see that this option offers the other dissimilarity measures studied in Section 11.2.1.1, such as the squared
FIG. 11.25 Dialog box for elaborating the cluster analysis with a hierarchical method in SPSS.
FIG. 11.26 Selecting the original variables.
Euclidean distance, Minkowski, Manhattan (Block, in SPSS), Chebyshev, and Pearson's correlation, which, even though it is a similarity measure, is also used for metric variables. Although we do not use similarity measures in this example, because we are not working with binary variables, it is important to mention that some similarity measures can be selected if necessary. Hence, as discussed in Section 11.2.1.2, in Measure → Binary, we can select the simple matching, Jaccard, Dice, Anti-Dice (Sokal and Sneath 2, in SPSS), Russell and Rao, Ochiai, Yule (Yule's Q, in SPSS), Rogers and Tanimoto, Sneath and Sokal (Sokal and Sneath 1, in SPSS), and Hamann coefficients, among others.
FIG. 11.27 Selecting the options that generate the agglomeration schedule and the matrix with the distances between the pairs of observations.
FIG. 11.28 Selecting the option that generates the dendrogram.
FIG. 11.29 Dialog box for selecting the linkage method and the distance measure.
Still in the same dialog box, the researcher may request that the cluster analysis be elaborated from standardized variables. If necessary, in situations in which the original variables have different measurement units, the option Z scores in Transform Values → Standardize can be selected, which makes all the calculations be elaborated from the standardized versions of the variables, which then have means equal to 0 and standard deviations equal to 1. After these considerations, the dialog box in our example will look like Fig. 11.29. Next, we can click on Continue and on OK. The first output (Fig. 11.30) shows the dissimilarity matrix D0, formed by the Euclidian distances between each pair of observations. The legend says, "This is a dissimilarity matrix." If this matrix were formed by similarity measures, resulting from calculations elaborated from binary variables, it would say, "This is a similarity matrix."
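The Z scores option corresponds to the usual standardization z = (x − x̄)/s, with s the sample standard deviation of each variable. A sketch of the transformation (our code, not an SPSS feature):

```python
import numpy as np

# Grades from Table 11.12 (order: Gabriela, Luiz Felipe, Patricia, Ovidio, Leonor)
grades = np.array([[3.70, 2.70, 9.10],
                   [7.80, 8.00, 1.50],
                   [8.90, 1.00, 2.70],
                   [7.00, 1.00, 9.00],
                   [3.40, 2.00, 5.00]])

# Standardize each variable: subtract its mean, divide by its sample standard deviation
z = (grades - grades.mean(axis=0)) / grades.std(axis=0, ddof=1)

# Each standardized variable now has mean 0 and standard deviation 1
print(np.round(z.mean(axis=0), 10))
print(np.round(z.std(axis=0, ddof=1), 10))
```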
FIG. 11.30 Matrix with Euclidian distances (dissimilarity measures) between pairs of observations.
FIG. 11.31 Hierarchical agglomeration schedule—Single-linkage method and Euclidian distance.
Through this matrix, which is equal to the one whose values were calculated and presented in Section 11.2.2.1.2, we can verify that observations Gabriela and Ovidio are the most similar (the smallest Euclidian distance) in relation to the variables mathematics, physics, and chemistry (d(Gabriela; Ovidio) = 3.713). Therefore, in the hierarchical schedule shown in Fig. 11.31, the first clustering stage consists exactly of joining these two students, with Coefficient (Euclidian distance) equal to 3.713. Note that the columns Cluster Combined Cluster 1 and Cluster 2 refer to isolated observations, when they are still not incorporated into a certain cluster, or to clusters that have already been formed. Obviously, in the first clustering stage, the first cluster is formed by the fusion of two isolated observations. Next, in the second stage, observation Leonor (5) is incorporated into the cluster previously formed by Gabriela (1) and Ovidio (4). With regard to the single-linkage method, we can see that the distance considered for the agglomeration of Leonor was the smallest between this observation and Gabriela or Ovidio, that is, the criterion adopted was:

d(Gabriela-Ovidio; Leonor) = min{4.170; 5.474} = 4.170

We can also see that, while the columns Stage Cluster First Appears Cluster 1 and Cluster 2 indicate in which previous stage each corresponding observation was incorporated into a certain cluster, the column Next Stage shows in which future stage the respective cluster will receive a new observation or cluster, given that we are dealing with an agglomerative method.
In the third stage, observation Patricia (3) is incorporated into the already formed cluster Gabriela-Ovidio-Leonor, respecting the following distance criterion:

d(Gabriela-Ovidio-Leonor), Patricia = min{8.420; 6.580; 6.045} = 6.045

And, finally, given that we have five observations, in the fourth and last stage, observation Luiz Felipe, which is still isolated (note that the last observation to be incorporated into a cluster corresponds to the last value equal to 0 in the column Stage Cluster First Appears Cluster 2), is incorporated into the cluster already formed by the other observations, concluding the agglomeration schedule. The distance considered at this stage is given by:

d(Gabriela-Ovidio-Leonor-Patricia), Luiz Felipe = min{10.132; 10.290; 8.223; 7.187} = 7.187

Based on how the observations are sorted in the agglomeration schedule and on the distances used as a clustering criterion, the dendrogram can be constructed, as seen in Fig. 11.32. Note that the distance measures are rescaled when SPSS constructs dendrograms, so that the allocation of each observation to the clusters can be interpreted more easily and, mainly, so that the largest distance leaps can be visualized, as discussed in Section 11.2.2.1.2.1. The way the observations are sorted in the dendrogram corresponds to what was presented in the agglomeration schedule (Fig. 11.31) and, from the analysis shown in Fig. 11.32, it is possible to see that the greatest distance leap occurs when Patricia merges with the cluster Gabriela-Ovidio-Leonor, which had already been formed. This leap could already have been identified in the agglomeration schedule in Fig. 11.31, since a large increase in distance occurs when we go from the second to the third stage, that is, when the Euclidian distance increases from 4.170 to 6.045 (44.96%) so that a new cluster can be formed by incorporating another observation.
Therefore, we can choose the configuration existing at the end of the second clustering stage, in which three clusters are formed. As discussed in Section 11.2.2.1.2.1, the criterion for identifying the number of clusters that considers the clustering stage immediately before a large leap is very useful and commonly used. Fig. 11.33 shows a dashed vertical line that “cuts” the dendrogram in the region where the largest leaps occur. Since this line intersects the branches of the dendrogram three times, we can identify three corresponding clusters, formed by Gabriela-Ovidio-Leonor, Patricia, and Luiz Felipe, respectively.
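The "stage immediately before the largest leap" criterion can also be automated. A minimal sketch with SciPy, again on illustrative placeholder grades:

```python
# Sketch: choosing the number of clusters by the largest-distance-leap
# criterion described in the text. The grades are illustrative placeholders.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

grades = np.array([
    [8.0, 7.5, 9.0],
    [3.0, 4.0, 2.5],
    [6.0, 6.5, 5.0],
    [7.5, 7.0, 8.5],
    [7.0, 8.0, 8.0],
])

schedule = linkage(grades, method="single", metric="euclidean")
merge_dist = schedule[:, 2]

# Stop at the stage immediately BEFORE the largest leap in merge distance:
# after stage s (1-indexed), n - s clusters remain.
leap_stage = int(np.argmax(np.diff(merge_dist))) + 1
n_clusters = len(grades) - leap_stage

labels = fcluster(schedule, t=n_clusters, criterion="maxclust")
print(n_clusters, labels)
```

On this toy data the largest leap occurs between the second and third stages, so the rule keeps the three-cluster configuration, mirroring the decision made in the text.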
Cluster Analysis Chapter 11
FIG. 11.32 Dendrogram—Single-linkage method and rescaled Euclidian distances in SPSS.
FIG. 11.33 Dendrogram with cluster identification.
FIG. 11.34 Defining the number of clusters.
As discussed, it is common to find dendrograms that make it difficult to identify distance leaps, mainly because the dataset contains considerably similar observations in relation to all the variables under analysis. In these situations, it is advisable to use the squared Euclidean distance and the complete-linkage method (furthest neighbor); this combination of criteria is very popular for datasets with extremely homogeneous observations.

Having adopted the solution with three clusters, we can once again click on Analyze → Classify → Hierarchical Cluster... and, on Statistics..., select the option Single solution in Cluster Membership. In this option, we must insert the number 3 into Number of clusters, as shown in Fig. 11.34. When we click on Continue, we will go back to the main dialog box of the cluster analysis. On Save..., let's choose the option Single solution and, in the same way, insert the number 3 into Number of clusters, as shown in Fig. 11.35, so that the new variable corresponding to the allocation of observations to the clusters becomes available in the dataset. Next, we can click on Continue and on OK.

Although the outputs generated are the same, it is important to notice that a new table of results is presented, corresponding to the allocation of the observations to the clusters itself. Fig. 11.36 shows, for three clusters, that, while observations Gabriela, Ovidio, and Leonor form a single cluster, called 1, observations Luiz Felipe and Patricia form two individual clusters, called 2 and 3, respectively. Even though these names are numerical, it is important to highlight that they only represent the labels (categories) of a qualitative variable. When carrying out the procedure described, we can see that a new variable is generated in the dataset. It is called CLU3_1 by SPSS, as shown in Fig. 11.37. This new variable is automatically classified by the software as Nominal, that is, qualitative, as shown in Fig. 11.38, which can be obtained when we click on Variable View, in the lower left-hand corner of the screen in SPSS.

As we have already discussed, the variable CLU3_1 can be used in other exploratory techniques, such as correspondence analysis, or in confirmatory techniques. In the latter, it can be inserted, for example, into the explanatory variables vector (as long as it is transformed into dummies) of a multiple regression model, or used as the dependent variable of a multinomial logistic regression model, in which researchers intend to study the behavior of other variables, not inserted into the cluster analysis, concerning the probability of each observation belonging to each one of the clusters formed. This decision, however, depends on the research objectives.

At this moment, the researcher may consider the cluster analysis with hierarchical agglomeration schedules concluded. Nevertheless, based on the new variable CLU3_1 and by using the one-way ANOVA, he may still study whether the values of a certain variable differ between the clusters formed, that is, whether the variability between the groups is significantly higher than the variability within each one of them. Even though this analysis was not developed when solving the hierarchical schedules algebraically, since we chose to carry it out only after the k-means procedure in Section 11.2.2.2.2, we can show how it can be applied at this moment, since we have already allocated the observations to the groups.
FIG. 11.35 Selecting the option to save the allocation of observations to the clusters with the new variable in the dataset—Hierarchical procedure.
FIG. 11.36 Allocating the observations to the clusters.
FIG. 11.37 Dataset with the new variable CLU3_1—Allocation of each observation.
FIG. 11.38 Nominal (qualitative) classification of the variable CLU3_1.
In order to do that, let's click on Analyze → Compare Means → One-Way ANOVA.... In the dialog box that will open, we must insert the variables mathematics, physics, and chemistry into Dependent List and the variable CLU3_1 (Single Linkage) into Factor. The dialog box will look like the one shown in Fig. 11.39. In Options..., let's choose the options Descriptive (in Statistics) and Means plot, as shown in Fig. 11.40. Next, we can click on Continue and on OK.

While Fig. 11.41 shows the descriptive statistics of the clusters per variable, similar to Tables 11.24, 11.25, and 11.26, Fig. 11.42 uses these values and shows the calculation of the variation between the groups (Between Groups) and within them (Within Groups), as well as the F statistics for each variable and the respective significance levels. These values correspond to the ones calculated algebraically in Section 11.2.2.2.2 and shown in Table 11.30. From Fig. 11.42, we can see that sig. F for the variable physics is less than 0.05 (sig. F = 0.043), that is, there is at least one group that has a statistically different mean, when compared to the others, at a significance level of 0.05. However, the same cannot be said about the variables mathematics and chemistry.

Although the outputs in Fig. 11.41 already give us an idea of which group has a statistically different mean for the variable physics, constructing the diagrams may make the analysis of the differences between the variable means per cluster even easier. The charts generated by SPSS (Figs. 11.43, 11.44, and 11.45) allow us to see these differences between the groups for each variable analyzed. From the chart in Fig. 11.44, it is possible to see that group 2, formed only by observation Luiz Felipe, in fact has a mean different from the others in relation to the variable physics. Besides, even though we can see from the diagrams in Figs. 11.43 and 11.45 that there are mean differences of the variables mathematics and chemistry between the groups, these differences cannot be considered statistically significant at a significance level of 0.05, since we are dealing with a very small number of observations, and the F statistic is very sensitive to the sample size. This graphical analysis becomes really useful when we are studying datasets with a larger number of observations and variables.
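The same one-way ANOVA can be reproduced programmatically. A sketch with scipy.stats.f_oneway on illustrative grades and hypothetical cluster memberships (placeholder values, not the book's data):

```python
# Sketch: one-way ANOVA of one variable across clusters, analogous to the
# SPSS output in Fig. 11.42. Grades and labels are illustrative placeholders.
import numpy as np
from scipy.stats import f_oneway

physics = np.array([8.0, 7.5, 7.0, 2.0, 2.5, 3.0, 6.0, 5.5, 5.0])
labels = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3])  # hypothetical memberships

# Split the variable into one array per cluster and run the F test.
groups = [physics[labels == k] for k in np.unique(labels)]
F, p = f_oneway(*groups)
print(round(F, 3), round(p, 4))
# When p (sig. F) < 0.05, at least one cluster mean differs from the others.
```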
FIG. 11.39 Dialog box with the selection of the variables to run the one-way analysis of variance in SPSS.
FIG. 11.40 Selecting the options to carry out the one-way analysis of variance.
FIG. 11.41 Descriptive statistics of the clusters per variable.
FIG. 11.42 One-way analysis of variance—Between groups and within groups variation, F statistics, and significance levels per variable.
FIG. 11.43 Means of the variable mathematics in the three clusters.

FIG. 11.44 Means of the variable physics in the three clusters.
Finally, researchers can still complement the analysis with a procedure known as multidimensional scaling, since using the distance matrix makes it possible to construct a chart that allows a two-dimensional visualization of the relative positions of the observations, regardless of the total number of variables. In order to do that, we must structure a new dataset, formed exactly by the distance matrix. For the data in our example, we can open the file CollegeEntranceExamMatrix.sav, which contains the Euclidian distance matrix shown in Fig. 11.46. Note that both the columns and the rows of this new dataset refer to the observations in the original dataset (the distance matrix is square).
FIG. 11.45 Means of the variable chemistry in the three clusters.
FIG. 11.46 Dataset with the Euclidean distance matrix.
Let's click on Analyze → Scale → Multidimensional Scaling (ALSCAL).... In the dialog box that will open, we must insert the variables that represent the observations into Variables, as shown in Fig. 11.47. Since the data already correspond to the distances, nothing needs to be done regarding the field Distances (Fig. 11.47). In Model..., let's select the option Ratio in Level of Measurement (note that the option Euclidean distance in Scaling Model has already been selected) and, in Options..., the option Group plots in Display, as shown in Figs. 11.48 and 11.49, respectively. Next, we can click on Continue and on OK. Fig. 11.50 shows the chart with the relative positions of the observations projected on a plane. This type of chart is really useful when researchers wish to prepare didactical presentations of observation clusters (individuals, companies, municipalities, countries, among other examples) and to make the interpretation of the clusters easier, mainly when there is a relatively large number of variables in the dataset.
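The scaling procedure used by SPSS is proprietary, but the core idea of multidimensional scaling, recovering a low-dimensional configuration from a distance matrix, can be sketched with the classical (Torgerson) variant in plain NumPy; this is a simplified stand-in, and the points below are illustrative placeholders:

```python
# Sketch: classical (Torgerson) multidimensional scaling from a square
# distance matrix. Simplified stand-in for SPSS's procedure; the points
# below are illustrative placeholders.
import numpy as np
from scipy.spatial.distance import pdist, squareform

def classical_mds(D, k=2):
    """Recover a k-dimensional configuration from a square distance matrix."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    B = -0.5 * J @ (D ** 2) @ J               # double-centered Gram matrix
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:k]             # k largest eigenvalues
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0))

points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 1.0], [2.0, 2.0]])
D = squareform(pdist(points))                 # square Euclidian distance matrix
coords = classical_mds(D, k=2)

# The recovered configuration reproduces the original distances
# (up to rotation and reflection).
print(np.allclose(pdist(coords), pdist(points)))
```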
FIG. 11.47 Dialog box with the selection of the variables to run the multidimensional scaling in SPSS.
FIG. 11.48 Defining the nature of the variable that corresponds to the distance measure.
FIG. 11.49 Selecting the option for constructing the two-dimensional chart.
FIG. 11.50 Two-dimensional chart with the projected relative positions of the observations.
FIG. 11.51 Dialog box for elaborating the cluster analysis with the nonhierarchical K-means method in SPSS.
11.3.2 Elaborating Nonhierarchical K-Means Agglomeration Schedules in SPSS
Maintaining the same logic proposed in the chapter, we will now develop, from the same dataset, a cluster analysis based on the nonhierarchical k-means agglomeration schedule. Thus, we must once again use the file CollegeEntranceExams.sav. In order to do that, we must click on Analyze → Classify → K-Means Cluster.... In the dialog box that will open, we must insert the variables mathematics, physics, and chemistry into Variables, and the variable student into Label Cases by. The main difference between this initial dialog box and the one corresponding to the hierarchical procedure is the need to determine the number of clusters with which the k-means algorithm will run. In our example, let's insert the number 3 into Number of Clusters. Fig. 11.51 shows how the dialog box will look.

We can see that we inserted the original variables into the field Variables. This procedure is acceptable since, in our example, the values are in the same unit of measure. If this were not the case, before running the k-means procedure, researchers would have to standardize the variables through the Z-scores procedure: in Analyze → Descriptive Statistics → Descriptives..., insert the original variables into Variables and select the option Save standardized values as variables. When we click on OK, the new standardized variables will become part of the dataset.

Going back to the initial screen of the k-means procedure, we will click on Save.... In the dialog box that will open, we must select the option Cluster membership, as shown in Fig. 11.52. When we click on Continue, we will go back to the previous dialog box. In Options..., let's select the options Initial cluster centers, ANOVA table, and Cluster information for each case, in Statistics, as shown in Fig. 11.53. Next, we can click on Continue and on OK. It is important to mention that SPSS already uses the Euclidian distance as the standard dissimilarity measure when elaborating the k-means procedure.
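The k-means step, including the "first k observations as initial centroids" seeding the text describes, can be sketched with SciPy's kmeans2; the grades below are illustrative placeholders:

```python
# Sketch: k-means with the first k observations as the initial centroids,
# mirroring the seeding SPSS uses here. Data are illustrative placeholders.
import numpy as np
from scipy.cluster.vq import kmeans2

grades = np.array([
    [8.0, 8.0, 8.0],
    [2.0, 2.0, 2.0],
    [5.0, 5.0, 5.0],
    [8.5, 8.0, 8.2],
    [7.8, 8.2, 8.1],
])

# Initial centroids = the first 3 observations (minit="matrix").
centroids, labels = kmeans2(grades, grades[:3], minit="matrix")
print(labels)                # cluster membership, analogous to Fig. 11.56
print(centroids.round(3))    # final centroid coordinates
```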
FIG. 11.52 Selecting the option to save the allocation of observations to the clusters with the new variable in the dataset—Nonhierarchical procedure.
FIG. 11.53 Selecting the options to perform the K-means procedure.
The first two outputs generated refer to the initial step and to the first iteration of the k-means algorithm. The centroid coordinates are presented in the initial step and, through them, we can notice that SPSS considers the three clusters to be seeded by the first three observations in the dataset. Although this decision differs from the one we used in Section 11.2.2.2.2, the choice is purely arbitrary and, as we will see, it does not affect the formation of clusters in the final step of the k-means algorithm at all. While Fig. 11.54 shows the values of the original variables for observations Gabriela, Luiz Felipe, and Patricia (as shown in Table 11.16) as the centroid coordinates of the three groups, Fig. 11.55 shows, after the first iteration of the algorithm, that the change in the centroid coordinate of the first cluster is 1.897, which corresponds exactly to the Euclidian distance between observation Gabriela and the cluster Gabriela-Ovidio-Leonor (as shown in Table 11.23). In this last figure, in the footnotes, it is also possible to see the measure 7.187, which corresponds to the Euclidian distance between observations Luiz Felipe and Patricia, which remain isolated after the iteration.

FIG. 11.54 First step of the K-means algorithm—Centroids of the three groups as observation coordinates.

FIG. 11.55 First iteration of the K-means algorithm and change in the centroid coordinates.

FIG. 11.56 Final stage of the K-means algorithm—Allocation of the observations and distances to the respective cluster centroids.

The next three figures refer to the final stage of the k-means algorithm. While the output Cluster Membership (Fig. 11.56) shows the allocation of each observation to each one of the three clusters, as well as the Euclidian distances between each observation and the centroid of the respective group, the output Distances between Final Cluster Centers (Fig. 11.58) shows the Euclidian distances between the group centroids. These two outputs contain values that were calculated algebraically in Section 11.2.2.2.2 and shown in Table 11.23. Moreover, the output Final Cluster Centers (Fig. 11.57) shows the centroid coordinates of the groups after the final stage of this nonhierarchical procedure, which correspond to the values already calculated and presented in Table 11.22.
FIG. 11.57 Final stage of the K-Means algorithm—Cluster centroid coordinates.
FIG. 11.58 Final stage of the K-means algorithm—Distances between the cluster centroids.
FIG. 11.59 One-way analysis of variance in the K-means procedure—Variation between groups and within groups, F statistics, and significance levels per variable.
The ANOVA output (Fig. 11.59) is analogous to the one presented in Table 11.30 in Section 11.2.2.2.2 and in Fig. 11.42 in Section 11.3.1, and, through it, we can see that only the variable physics has a statistically different mean in at least one of the groups formed, when compared to the others, at a significance level of 0.05. As we have previously discussed, if one or more variables are not contributing to the formation of the suggested number of clusters, we recommend that the algorithm be reapplied without these variables. The researcher can even use a hierarchical procedure without the aforementioned variables before reapplying the k-means procedure. For the data in our example, however, the analysis would become univariate due to the exclusion of the variables mathematics and chemistry, which demonstrates the risk researchers take when working with extremely small datasets in cluster analysis.

It is important to mention that the ANOVA output must only be used to study which variables most contribute to the formation of the specified number of clusters, since this number is chosen precisely so that the differences between observations allocated to different groups are maximized. Thus, as explained in this output's footnotes, we cannot use the F statistic to test the equality of the groups formed. For this reason, it is common to find the term pseudo F for this statistic in the existing literature.

Finally, Fig. 11.60 shows the number of observations in each one of the clusters. Similar to the hierarchical procedure, we can see that a new variable (obviously qualitative) is generated in the dataset after the k-means procedure. It is called QCL_1 by SPSS, as shown in Fig. 11.61. In this example, this variable ended up being identical to the variable CLU3_1 (Fig. 11.37).
Nonetheless, this fact does not always happen with a larger number of observations and in the cases in which different dissimilarity measures are used in the hierarchical and nonhierarchical procedures. Having presented the procedures for the application of the cluster analysis in SPSS, let’s discuss this technique in Stata.
FIG. 11.60 Number of observations in each cluster.
FIG. 11.61 Dataset with the new variable QCL_1—Allocation of each observation.
11.4 CLUSTER ANALYSIS WITH HIERARCHICAL AND NONHIERARCHICAL AGGLOMERATION SCHEDULES IN STATA

Now, we will present the step-by-step procedure for preparing our example in Stata Statistical Software®. In this section, our main objective is not to discuss the concepts related to the cluster analysis once again, but to give the researcher an opportunity to apply the technique by using the commands this software has to offer. At each presentation of an output, we will mention the respective result obtained when performing its algebraic solution and also when using SPSS. The use of the images in this section has been authorized by StataCorp LP©.
11.4.1 Elaborating Hierarchical Agglomeration Schedules in Stata
Therefore, let's begin with the dataset constructed by the professor, which contains the grades in Mathematics, Physics, and Chemistry obtained by five students in the college entrance exams. The dataset can be found in the file CollegeEntranceExams.dta and is exactly the same as the one presented in Table 11.12 in Section 11.2.2.1.2. Initially, we can type the command desc, which makes it possible to analyze the dataset characteristics, such as the number of observations, the number of variables, and the description of each one of them. Fig. 11.62 shows the first output in Stata. As discussed previously, since the original variables have values in the same unit of measure in this example, it is not necessary to standardize them by using the Z-scores procedure. However, if the researcher wishes to, he may obtain the standardized variables through the following commands:

egen zmathematics = std(mathematics)
egen zphysics = std(physics)
egen zchemistry = std(chemistry)
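For reference, the Z-scores transformation performed by egen ... = std(...) can be checked by hand. A minimal sketch, assuming (as Stata does) the sample standard deviation and using illustrative grade values:

```python
# Sketch: the Z-scores standardization performed by Stata's egen std().
# Stata uses the sample standard deviation (ddof=1); values are illustrative.
import numpy as np

mathematics = np.array([8.0, 3.0, 6.0, 7.5, 7.0])
z = (mathematics - mathematics.mean()) / mathematics.std(ddof=1)
print(z.round(3))   # standardized variable: mean 0, sample SD 1
```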
FIG. 11.62 Description of the CollegeEntranceExams.dta dataset.
TABLE 11.31 Terms in Stata Corresponding to the Measures for Metric Variables

Measure for Metric Variables     Term in Stata
Euclidian                        L2
Squared Euclidean                L2squared
Manhattan                        L1
Chebyshev                        Linf
Canberra                         Canberra
Pearson's Correlation            corr
First of all, let's obtain the matrix with the distances between the pairs of observations. In general, the sequence of commands for obtaining distance or similarity matrices in Stata is:

matrix dissimilarity D = variables*, option*
matrix list D
where the term variables* must be substituted for the list of variables to be considered in the analysis, and the term option* must be substituted for the term corresponding to the distance or similarity measure that the researcher wishes to use. While Table 11.31 shows the terms in Stata that correspond to each one of the measures for the metric variables studied in Section 11.2.1.1, Table 11.32 shows the terms related to the measures used for the binary variables studied in Section 11.2.1.2. Therefore, since we wish to obtain the Euclidian distance matrix between the pairs of observations, in order to maintain the criterion used in the chapter, we must type the following sequence of commands:

matrix dissimilarity D = mathematics physics chemistry, L2
matrix list D
The output generated, which can be seen in Fig. 11.63, is in accordance with what was presented in matrix D0 in Section 11.2.2.1.2.1, and also in Fig. 11.30 when we elaborated the technique in SPSS (Section 11.3.1). Next, we will carry out the cluster analysis itself. The general command used to run a cluster analysis through a hierarchical schedule in Stata is:

cluster method* variables*, measure(option*)
where, besides the substitution of the terms variables* and option*, as discussed previously, we must substitute the term method* for the linkage method chosen by the researcher. Table 11.33 shows the terms in Stata related to the methods discussed in Section 11.2.2.1.

TABLE 11.32 Terms in Stata Corresponding to the Measures for Binary Variables

Measure for Binary Variables     Term in Stata
Simple matching                  matching
Jaccard                          Jaccard
Dice                             Dice
AntiDice                         antiDice
Russell and Rao                  Russell
Ochiai                           Ochiai
Yule                             Yule
Rogers and Tanimoto              Rogers
Sneath and Sokal                 Sneath
Hamann                           Hamann
FIG. 11.63 Euclidean distance matrix between pairs of observations.
TABLE 11.33 Terms in Stata That Correspond to the Linkage Methods in Hierarchical Agglomeration Schedules

Linkage Method     Term in Stata
Single             singlelinkage
Complete           completelinkage
Average            averagelinkage
Therefore, for the data in our example and following the criterion adopted throughout this chapter (single-linkage method with Euclidian distance, term L2), we must type the following command:

cluster singlelinkage mathematics physics chemistry, measure(L2)
After that, we can type the command cluster list, which presents, in a summarized way, the criteria used by the researcher to develop the hierarchical cluster analysis. Fig. 11.64 shows the outputs generated. From Fig. 11.64 and by analyzing the dataset, we can verify that three new variables are generated, regarding the identification of each observation (_clus_1_id), the sorting of the observations when creating the clusters (_clus_1_ord), and the Euclidian distances used to group a new observation in each one of the clustering stages (_clus_1_hgt). Fig. 11.65 shows how the dataset looks after this cluster analysis is elaborated. It is important to mention that Stata shows the variable _clus_1_hgt with its values shifted by one row, which can make the analysis a little confusing. While distance 3.713 refers to the merger between observations Ovidio and Gabriela (first stage of the agglomeration schedule), distance 7.187 corresponds to the fusion between Luiz Felipe and the cluster already formed by all the other observations (last stage of the agglomeration schedule), as already shown in Table 11.13 and in Fig. 11.31.

FIG. 11.64 Elaboration of the hierarchical cluster analysis and summary of the criteria used.

FIG. 11.65 Dataset with the new variables.

FIG. 11.66 Stages of the agglomeration schedule and respective Euclidian distances.

Thus, in order to correct this discrepancy and to obtain the real behavior of the distances in each new clustering stage, researchers can type the following sequence of commands, whose output can be seen in Fig. 11.66. Note that a new variable (dist) is generated, which corrects the discrepancy found in the variable _clus_1_hgt (term [_n-1]) and presents the value of the Euclidian distance used to establish a new cluster in each stage of the agglomeration schedule.

gen dist = _clus_1_hgt[_n-1]
replace dist=0 if dist==.
sort dist
list student dist
Having carried out this phase, we can ask Stata to construct the dendrogram by typing one of the two equivalent commands: cluster dendrogram, labels(student) horizontal
or

cluster tree, labels(student) horizontal
The diagram generated can be seen in Fig. 11.67. We can see that the dendrogram constructed by Stata, in terms of Euclidian distances, is equal to the one shown in Fig. 11.12, constructed when the modeling was solved algebraically. However, it differs from the one constructed by SPSS (Fig. 11.32), since it does not consider rescaled measures. Regardless of this fact, we will adopt three clusters as a possible solution, one of them formed by Leonor, Ovidio, and Gabriela, another by Patricia, and the third by Luiz Felipe, since the criteria discussed regarding large distance leaps coherently lead us toward this decision. In order to generate a new variable corresponding to the allocation of the observations to the three clusters, we must type the following sequence of commands. Note that we have named this new variable cluster. The output seen in Fig. 11.68 shows the allocation of the observations to the groups and is equivalent to the one shown in Fig. 11.36 (SPSS).

cluster generate cluster = groups(3), name(_clus_1)
sort _clus_1_id
list student cluster
FIG. 11.67 Dendrogram—Single-linkage method and Euclidian distances in Stata.
FIG. 11.68 Allocating the observations to the clusters.
Finally, by using the one-way analysis of variance (ANOVA), we will study whether the values of a certain variable differ between the groups represented by the categories of the new qualitative variable cluster generated in the dataset, that is, whether the variation between the groups is significantly higher than the variation within each one of them, following the logic proposed in Section 11.3.1. In order to do that, let's type the following commands, in which the three metric variables (mathematics, physics, and chemistry) are individually related to the variable cluster:

oneway mathematics cluster, tabulate
oneway physics cluster, tabulate
oneway chemistry cluster, tabulate
The results of the ANOVA for the three variables are in Fig. 11.69. The outputs in this figure, which show the results of the variation Between groups and Within groups, the F statistics, and the respective significance levels (Prob. F, or Prob > F in Stata) for each variable, are equal to the ones calculated algebraically and presented in Table 11.30 (Section 11.2.2.2.2), and also in Fig. 11.42, when this procedure was elaborated in SPSS (Section 11.3.1). Therefore, as we have already discussed, while for the variable physics there is at least one cluster that has a statistically different mean, when compared to the others, at a significance level of 0.05 (Prob. F = 0.0429 < 0.05), the variables mathematics and chemistry do not have statistically different means between the three groups formed, for this sample and at the significance level set. It is important to bear in mind that, if more than one variable has Prob. F less than 0.05, the one considered the most discriminant of the groups is the one with the highest F statistic (that is, the lowest significance level Prob. F).
FIG. 11.69 ANOVA for the variables mathematics, physics, and chemistry.
Even though it is possible to conclude the hierarchical analysis at this moment, the researcher has the option to run a multidimensional scaling, in order to see the projections of the relative positions of the observations in a two-dimensional chart, similar to what was done in Section 11.3.1. In order to do that, he may type the following command:

mds mathematics physics chemistry, id(student) method(modern) measure(L2) loss(sstress) config nolog
The outputs generated can be found in Figs. 11.70 and 11.71, and the chart of the latter is the one shown in Fig. 11.50.
FIG. 11.70 Elaborating the multidimensional scaling in Stata.

FIG. 11.71 Chart with projections of the relative positions of the observations.
Having presented the commands to carry out the cluster analysis with hierarchical agglomeration schedules in Stata, let’s move on to the elaboration of the nonhierarchical k-means agglomeration schedule in the same software package.
11.4.2 Elaborating Nonhierarchical K-Means Agglomeration Schedules in Stata
In order to apply the k-means procedure to the data in the file CollegeEntranceExams.dta, we must type the following command:

cluster kmeans mathematics physics chemistry, k(3) name(kmeans) measure(L2) start(firstk)
where the term k(3) is the input for the algorithm to be run with three clusters. Besides, we define that a new variable with the allocation of the observations to the three groups will be generated in the dataset with the name kmeans (term name(kmeans)), and that the distance measure used will be the Euclidian distance (term L2). Moreover, the term firstk specifies that the coordinates of the first k observations in the sample will be used as the centroids of the k clusters (in our case, k = 3), which corresponds exactly to the criterion adopted by SPSS, as discussed in Section 11.3.2.
FIG. 11.72 Elaborating the nonhierarchical K-means procedure and a summary of the criteria used.
Next, we can type the command cluster list kmeans so that the criteria adopted for elaborating the k-means procedure can be presented in a summarized way. The outputs in Fig. 11.72 show what is generated by Stata after we type the last two commands. The next two commands generate, in the outputs of the software, two tables that refer to the number of observations in each one of the three clusters formed, as well as to the allocation of each observation to these groups, respectively:

table kmeans
list student kmeans
Fig. 11.73 shows these outputs. These results correspond to the ones found when the k-means procedure was solved algebraically in Section 11.2.2.2.2 (Fig. 11.23), and to the ones obtained when this procedure was elaborated using SPSS in Section 11.3.2 (Figs. 11.60 and 11.61). Even though we are able to develop a one-way analysis of variance for the original variables in the dataset from the new qualitative variable generated (kmeans), we chose not to carry out this procedure here, since we have already done that for the variable cluster generated in Section 11.4.1 after the hierarchical procedure, which is exactly the same as the variable kmeans in this case. On the other hand, for pedagogical purposes, we present the command that allows the means of each variable in the three clusters to be generated, so that they can be compared:

tabstat mathematics physics chemistry, by(kmeans)
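The comparison of means that this command produces can be sketched outside Stata as well; the data and cluster labels below are hypothetical stand-ins, not the book's dataset:

```python
# Per-cluster and overall means, analogous to tabstat ..., by(kmeans).
import numpy as np

X = np.array([[3.7, 2.7, 9.1],
              [7.8, 8.0, 1.5],
              [8.9, 1.0, 2.7],
              [7.0, 1.0, 9.0]])
labels = np.array([0, 1, 2, 0])          # cluster allocation (e.g., from k-means)
names = ["mathematics", "physics", "chemistry"]

for j in sorted(set(labels)):
    print(j, dict(zip(names, X[labels == j].mean(axis=0).round(2))))
print("total", dict(zip(names, X.mean(axis=0).round(2))))
```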
The output generated can be seen in Fig. 11.74 and is equivalent to the one presented in Tables 11.24, 11.25, and 11.26. Finally, the researcher can also construct a chart that shows the interrelationships between the variables, two at a time. This chart, known as a matrix chart, can give the researcher a better understanding of how the variables relate to one another and even
FIG. 11.73 Number of observations in each cluster and allocation of observations.
FIG. 11.74 Means per cluster and general means of the variables mathematics, physics, and chemistry.
FIG. 11.75 Interrelationship between the variables and relative position of the observations in each cluster—matrix chart.
make suggestions regarding the relative position of the observations in each cluster in these interrelationships. To construct the chart shown in Fig. 11.75, we must type the following command:

graph matrix mathematics physics chemistry, mlabel(kmeans)
Obviously, this chart could also have been constructed in the previous section. However, we chose to present it only at the end of the preparation of the k-means procedure in Stata. By analyzing it, it is possible to verify, among other things, that considering only the variables mathematics and chemistry is not enough to make observations Luiz Felipe and Patricia (clusters 2 and 3, respectively) stay further apart. It is necessary to consider the variable physics so that these two students can, in fact, be allocated to different clusters when forming three clusters. Although this may seem obvious when analyzing the data in the dataset itself, the chart becomes extremely useful for larger samples with a considerable number of variables, a fact that would multiply these interrelationships.
11.5 FINAL REMARKS

Many are the situations in which researchers may wish to group observations (individuals, companies, municipalities, countries, political parties, plant species, among other examples) based on certain metric or even binary variables. Creating homogeneous clusters, reducing data structurally, and verifying the validity of previously established constructs are some of the main reasons that make researchers choose to work with cluster analysis. This set of techniques allows decision-making mechanisms to be better structured and justified based on the behavior of, and the interdependence relationships between, the observations of a certain dataset. Since the variable that represents the clusters formed is qualitative, the outputs of the cluster analysis can serve as inputs for other multivariate techniques, both exploratory and confirmatory.

It is strongly advisable for researchers to justify, clearly and transparently, the measure they chose and that will serve as the basis for considering observations more or less similar, as well as the reasons that make them choose nonhierarchical or hierarchical agglomeration schedules and, in the latter case, determine the linkage methods.

In the last few years, the evolution of technological capabilities and the development of new software, with extremely improved resources, have caused new and better cluster analysis techniques to arise: techniques that use increasingly sophisticated algorithms and that are aimed at the decision-making process in several fields of knowledge, always with the main goal of grouping observations based on certain criteria. In this chapter, however, we tried to offer a general overview of the main, and most popular, cluster analysis methods.
Lastly, we would like to highlight that the application of this important set of techniques must always be done correctly and sensibly, using the software chosen for the modeling, based on the underlying theory and on researchers' experience and intuition.
11.6 EXERCISES

1) The scholarship department of a certain college wishes to investigate the interdependence relationship between the students entering university in a certain school year, based only on two metric variables (age, in years, and average family income, in US$). The main objective is to propose a still unknown number of new scholarship programs aimed at homogeneous groups of students. In order to do that, data on 100 new students were collected and a dataset was constructed, which can be found in the files Scholarship.sav and Scholarship.dta, with the following variables:

Variable   Description
student    A string variable that identifies all freshmen in the college
age        Student's age (years)
income     Average family income (US$)
We would like you to:
a) Run a cluster analysis through a hierarchical agglomeration schedule, with the complete-linkage method (furthest neighbor) and the squared Euclidean distance. Only present the final part of the agglomeration schedule table and discuss the results. Reminder: Since the variables have different units of measure, it is necessary to apply the Z-scores standardization procedure to prepare the cluster analysis correctly.
b) Based on the table found in the previous item and on the dendrogram, we ask you: how many clusters of students will be formed?
c) Is it possible to identify one or more very discrepant students, in comparison to the others, regarding the two variables under analysis?
d) If the answer to the previous item is "yes," once again run the hierarchical cluster analysis with the same criteria, however, now, without the student(s) considered discrepant. From the analysis of the new results, can new clusters be identified?
e) Discuss how the presence of outliers can hamper the interpretation of results in a cluster analysis.

2) The marketing department of a retail company wants to study possible discrepancies in its 18 stores, spread throughout three regional centers and distributed all over the country. In order to maintain and preserve its brand's image and identity, top management would like to know if the stores are homogeneous in terms of customers'
perception of attributes such as services, variety of goods, and organization. Thus, first, a survey with samples of customers was carried out in each store, so that data regarding these attributes could be collected. These were defined based on the average score obtained (0 to 100) in each store. Next, a dataset was constructed, and it contains the following variables:

Variable       Description
store          A string variable that varies from 01 to 18 and identifies the commercial establishment (store)
regional       A string variable that identifies each regional center (Regional 1 to Regional 3)
services       Customers' average evaluation of the services rendered (score from 0 to 100)
assortment     Customers' average evaluation of the variety of goods (score from 0 to 100)
organization   Customers' average evaluation of the organization (score from 0 to 100)
These data can be found in the files Retail Regional Center.sav and Retail Regional Center.dta. We would like you to:
a) Run a cluster analysis through a hierarchical agglomeration schedule, with the single-linkage method and the Euclidean distance. Present the matrix with the distances between each pair of observations. Reminder: Since the variables are in the same unit, it is not necessary to apply the Z-scores standardization procedure.
b) Present and discuss the agglomeration schedule table.
c) Based on the table found in the previous item and on the dendrogram, we ask you: how many clusters of stores will be formed?
d) Run a multidimensional scaling and, after that, present and discuss the two-dimensional chart generated with the relative positions of the stores.
e) Run a cluster analysis by using the k-means procedure, with the number of clusters suggested in item (c), and interpret the one-way analysis of variance for each variable considered in the study, at a significance level of 0.05. Which variable contributes the most to the creation of at least one of the clusters formed, that is, which of them is the most discriminant of the groups?
f) Is there any correspondence between the allocations of the observations to the groups obtained by the hierarchical and nonhierarchical methods?
g) Is it possible to identify an association between any regional center and a certain discrepant group of stores, which could justify the management's concern regarding the brand's image and identity? If the answer is "yes," once again run the hierarchical cluster analysis with the same criteria, however, now, without this discrepant group of stores. By analyzing the new results, is it possible to see the differences between the other stores more clearly?
3) A financial market analyst has decided to carry out a survey with CEOs and directors of large companies that operate in the health, education, and transport industries, in order to investigate how these companies' operations are carried out and the mechanisms that guide their decision-making processes. In order to do that, he structured a questionnaire with 50 questions, whose answers are only dichotomous, or binary. After applying the questionnaire, he got answers from 35 companies and, from then on, structured a dataset, present in the files Binary Survey.sav and Binary Survey.dta. In a generic way, the variables are:

Variable    Description
q1 to q50   A list of 50 dummy variables that refer to the way the operations and the decision-making processes are carried out in these companies
sector      Company sector
The analyst's main goal is to verify whether companies in the same sector show similarities in relation to the way their operations and decision-making processes are carried out, at least from their own managers' perspective. In order to do that, after collecting the data, a cluster analysis can be elaborated. We would like you to:
a) Based on the hierarchical cluster analysis elaborated with the average-linkage method (between groups) and the simple matching similarity measure for binary variables, analyze the agglomeration schedule generated.
b) Interpret the dendrogram.
c) Check if there is any correspondence between the allocations of the companies to the clusters and the respective sectors, or, in other words, if the companies in the same sector show similarities regarding the way their operations and decision-making processes are carried out.
4) A greengrocer has decided to monitor the sales of his products for 16 weeks (4 months). The main objective is to verify if the sales behavior of three of his main products (bananas, oranges, and apples) is recurrent after a certain period, due to weekly wholesale price fluctuations, prices that are passed on to customers and may impact sales. These data can be found in the files Veggiefruit.sav and Veggiefruit.dta, which have the following variables:

Variable     Description
week         A string variable that varies from 1 to 16 and identifies the week in which the sales were monitored
week_month   A string variable that varies from 1 to 4 and identifies the week in each one of the months
banana       Number of bananas sold that week (un.)
orange       Number of oranges sold that week (un.)
apple        Number of apples sold that week (un.)
We would like you to:
a) Run a cluster analysis through a hierarchical agglomeration schedule, with the single-linkage method (nearest neighbor) and Pearson's correlation measure. Present the matrix of similarity measures (Pearson's correlation) between each row in the dataset (weekly periods). Reminder: Since the variables are in the same unit of measure, it is not necessary to apply the Z-scores standardization procedure.
b) Present and discuss the agglomeration schedule table.
c) Based on the table found in the previous item and on the dendrogram, we ask you: is there any indication that the joint sales behavior of bananas, oranges, and apples is recurrent in certain weeks?
APPENDIX A.1
Detecting Multivariate Outliers
Even though detecting outliers is extremely important when applying practically every multivariate data analysis technique, we chose to add this Appendix to the present chapter because cluster analysis represents the first set of multivariate exploratory techniques being studied, whose outputs can be used as inputs in several other techniques, and because very discrepant observations may significantly interfere in the creation of clusters. Barnett and Lewis (1994) mention almost 1000 articles in the existing literature on outliers. However, we chose to show a very effective, computationally simple, and fast algorithm for detecting multivariate outliers, bearing in mind that the identification of outliers for each variable individually, that is, in a univariate way, has already been studied in Chapter 3.

A) Brief Presentation of the Blocked Adaptive Computationally Efficient Outlier Nominators Algorithm

Billor et al. (2000), in an extremely important work, present an interesting algorithm whose purpose is to detect multivariate outliers, called Blocked Adaptive Computationally Efficient Outlier Nominators, or simply BACON. This algorithm, explained in a very clear and didactic way by Weber (2010), is defined based on a few steps, described briefly:

1. From a dataset with n observations and k variables X_j (j = 1, ..., k), in which each observation is identified by i (i = 1, ..., n), the distance between an observation i, which has a vector of dimension k, x_i = (x_{i1}, x_{i2}, ..., x_{ik}), and the general mean of all sample values (group G), which also has a vector of dimension k, \bar{x} = (\bar{x}_1, \bar{x}_2, ..., \bar{x}_k), is given by the following expression, known as the Mahalanobis distance:

d_{iG} = \sqrt{(x_i - \bar{x})' \, S^{-1} (x_i - \bar{x})}   (11.29)
where S represents the covariance matrix of the n observations. Therefore, the first step of the algorithm consists in identifying m (m > k) homogeneous observations (the initial group M) that have the smallest Mahalanobis distances in relation to the entire sample. It is important to mention that the dissimilarity measure known as the Mahalanobis distance, not discussed in this chapter, is adopted by the aforementioned authors because it is not susceptible to variables that are in different measurement units.

2. Next, the Mahalanobis distances between each observation i and the mean of the m observation values that belong to group M are calculated; this mean also has a vector of dimension k, \bar{x}_M = (\bar{x}_{M1}, \bar{x}_{M2}, ..., \bar{x}_{Mk}), such that:

d_{iM} = \sqrt{(x_i - \bar{x}_M)' \, S_M^{-1} (x_i - \bar{x}_M)}   (11.30)

where S_M represents the covariance matrix of the m observations.

3. All the observations with Mahalanobis distances less than a certain threshold are added to the group M of observations. This threshold is defined as a corrected percentile of the χ² distribution (85% by default in Stata). Steps 2 and 3 must be reapplied until there are no more modifications in group M, which will only contain observations that are not considered outliers. Hence, the ones excluded from the group will be considered multivariate outliers.

Weber (2010) implemented the algorithm proposed in the paper by Billor et al. (2000) in Stata, proposing the command bacon. Next, we will present and discuss an example in which this command is used; its main advantage is being computationally very fast, even when applied to large datasets.

B) Example: The command bacon in Stata

Before the specific preparation of this procedure in Stata, we must install the command bacon by typing findit bacon and clicking on the link st0197 from http://www.stata-journal.com/software/sj10-3. After that, we must click on click here to install. Lastly, going back to the Stata command screen, we can type ssc install moremata and mata: mata mlib index. Having done this, we may apply the command bacon.

To apply this command, let's use the file Bacon.dta, which shows data on the median household income (US$) of 20,000 engineers, their age (years), and the time since they earned a college degree (years). First of all, we can type the command desc, which makes the analysis of the dataset characteristics possible. Fig. 11.76 shows this first output. Next, we can type the following command that, based on the algorithm presented, identifies the observations considered multivariate outliers:

bacon income age tgrad, generate(outbacon)
where the term generate(outbacon) causes a new dummy variable, called outbacon, to be generated in the dataset, with values equal to 0 for observations not considered outliers, and values equal to 1 for the ones considered outliers. This output can be seen in Fig. 11.77.
FIG. 11.76 Description of the Bacon.dta dataset.
FIG. 11.77 Applying the command bacon in Stata.
FIG. 11.78 Observations classified as multivariate outliers.
From the figure, it is possible to see that four observations are classified as multivariate outliers. Besides, Stata uses the 85% percentile of the χ² distribution as the default separation threshold between the observations considered outliers and nonoutliers, as previously discussed and highlighted by Weber (2010). This is the reason why the term BACON outliers (p = 0.15) appears in the outputs. This value may be altered according to a criterion established by the researcher. However, we would like to emphasize that the default percentile(0.15) is very adequate for obtaining consistent answers. The following command, which generates the output seen in Fig. 11.78, allows us to investigate which observations are classified as outliers:

list if outbacon == 1
Even though we are working with three variables, we can construct two-dimensional scatter plots, which allow us to identify the positions of the observations considered outliers in relation to the others. In order to do that, let's type the following commands, which generate the mentioned charts for each pair of variables:

scatter income age, ml(outbacon) note("0 = not outlier, 1 = outlier")
scatter income tgrad, ml(outbacon) note("0 = not outlier, 1 = outlier")
scatter age tgrad, ml(outbacon) note("0 = not outlier, 1 = outlier")
These three charts can be seen in Figs. 11.79, 11.80, and 11.81.
FIG. 11.79 Variables income and age—Relative position of the observations.
FIG. 11.80 Variables income and tgrad—Relative position of the observations.
FIG. 11.81 Variables age and tgrad—Relative position of the observations.
Despite the fact that outliers have been identified, it is important to mention that the decision about what to do with these observations is entirely up to researchers, who must make it based on their research objectives. As already discussed throughout this chapter, excluding these outliers from the dataset may be an option. However, studying why they became multivariately discrepant can also result in many interesting research outcomes.
Chapter 12
Principal Component Factor Analysis

Love and truth are so intertwined that it is practically impossible to disentangle and separate them. They are like the two sides of a coin.
Mahatma Gandhi
12.1 INTRODUCTION

Exploratory factor analysis techniques are very useful when we intend to work with variables that have relatively high correlation coefficients between themselves, and we wish to establish new variables that capture the joint behavior of the original variables. Each one of these new variables is called a factor, which can be understood as a cluster of variables based on previously established criteria. Therefore, factor analysis is a multivariate technique that tries to identify a relatively small number of factors that represent the joint behavior of interdependent original variables. Thus, while cluster analysis, studied in the previous chapter, uses distance or similarity measures to group observations and form clusters, factor analysis uses correlation coefficients to group variables and generate factors. Among the methods used to determine factors, the one known as principal components is, without a doubt, the most widely used in factor analysis, because it is based on the assumption that uncorrelated factors can be extracted from linear combinations of the original variables. Consequently, from a set of original variables correlated to one another, the principal component factor analysis allows another set of variables (factors), resulting from the linear combination of the first set, to be determined.

Even though, as we know, the term confirmatory factor analysis often appears in the existing literature, factor analysis is essentially an exploratory, or interdependence, multivariate technique, since it does not have a predictive nature for observations not initially present in the sample, and the inclusion of new observations in the dataset makes it necessary to reapply the technique, so that more accurate and updated factors can be generated.
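The core assumption just stated, that uncorrelated factors can be extracted from linear combinations of correlated original variables, can be verified numerically. The sketch below uses synthetic data (two correlated variables) and obtains the components from the eigenvectors of the correlation matrix of the standardized variables:

```python
# Principal components as uncorrelated linear combinations of the originals.
import numpy as np

rng = np.random.default_rng(7)
x1 = rng.normal(size=500)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=500)   # correlated with x1
X = np.column_stack([x1, x2])

Z = (X - X.mean(axis=0)) / X.std(axis=0)      # standardize (Z-scores)
rho = np.corrcoef(Z, rowvar=False)            # Pearson correlation matrix
_, eigvecs = np.linalg.eigh(rho)

F = Z @ eigvecs                               # the factors (principal components)
corr_F = np.corrcoef(F, rowvar=False)
print(corr_F.round(6))   # off-diagonal ~0: the extracted factors are orthogonal
```

Because the eigenvectors diagonalize the correlation matrix, the resulting factors are orthogonal by construction, which is exactly what makes them useful later on as multicollinearity-free explanatory variables.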
According to Reis (2001), factor analysis can be used with the main exploratory goal of reducing the data dimension, aiming at creating factors from the original variables, as well as with the objective of confirming an initial hypothesis that the data may be reduced to a certain factor, or a certain dimension, which was previously established. Regardless of the objective, factor analysis will continue to be exploratory. If researchers aim to use a technique to, in fact, confirm the relationships found in the factor analysis, they can use structural equation modeling, for instance. The principal component factor analysis has four main objectives: (1) to identify correlations between the original variables to create factors that represent the linear combination of those variables (structural reduction); (2) to verify the validity of previously established constructs, bearing in mind the allocation of the original variables to each factor; (3) to prepare rankings by generating performance indexes from the factors; and (4) to extract orthogonal factors for future use in confirmatory multivariate techniques that need the absence of multicollinearity. Imagine that a researcher is interested in studying the interdependence between several quantitative variables that translate the socioeconomic behavior of a nation’s municipalities. In this situation, factors that may possibly explain the behavior of the original variables can be determined, and, in this regard, the factor analysis is used to reduce the data structurally and, later on, to create a socioeconomic index that captures the joint behavior of these variables. From this index, we may even propose a performance ranking of the municipalities, and the factors themselves can be used in a possible cluster analysis. In another situation, factors extracted from the original variables can be used as explanatory variables of another variable (dependent), not initially considered in the analysis. 
For example, factors obtained from the joint behavior of grades in certain 12th grade subjects can be used as explanatory variables of students' general classification in the college entrance exams, or of whether students passed the exams or not. In these situations, note that the factors (orthogonal to one another) are
Data Science for Business and Decision Making. https://doi.org/10.1016/B978-0-12-811216-8.00012-4 © 2019 Elsevier Inc. All rights reserved.
used, instead of the original variables themselves, as explanatory variables of a certain phenomenon in confirmatory multivariate models, such as multiple or logistic regression, in order to eliminate possible multicollinearity problems. Nevertheless, it is important to highlight that this procedure only makes sense when we intend to elaborate a diagnosis of the dependent variable's behavior, without aiming at forecasts for other observations not initially present in the sample. Since new observations do not have the corresponding values of the factors generated, obtaining these values is only possible by including such observations in a new factor analysis.

In a third situation, imagine that a retailer is interested in assessing their clients' level of satisfaction by applying a questionnaire in which the questions have been previously classified into certain groups. For instance, questions A, B, and C were classified into the group quality of services rendered; questions D and E, into the group positive perception of prices; and questions F, G, H, and I, into the group variety of goods. After applying the questionnaire to a significant number of customers, in which these nine variables are collected by attributing scores that vary from 0 to 10, the retailer has decided to elaborate a principal component factor analysis to verify if, in fact, the combination of variables reflects the construct previously established. If this occurs, the factor analysis will have been used to validate the construct, presenting a confirmatory objective.

In all of these situations, we can see that the original variables from which the factors will be extracted are quantitative, because a factor analysis begins with the study of the behavior of Pearson's correlation coefficients between the variables.
Nonetheless, it is common for researchers to use the incorrect arbitrary weighting procedure with qualitative variables, such as variables on the Likert scale, and, from then on, to apply a factor analysis. This is a serious error! There are exploratory techniques meant exclusively for studying the behavior of qualitative variables, for instance, correspondence analysis and homogeneity analysis, and a factor analysis is definitely not meant for such a purpose, as discussed by Fávero and Belfiore (2017).

In a historical context, the development of factor analysis is partly due to Pearson's (1896) and Spearman's (1904) pioneering work. While Karl Pearson developed a rigorous mathematical treatment of what we traditionally call correlation at the beginning of the 20th century, Charles Edward Spearman published highly original work in which the interrelationships between students' performance in several subjects, such as French, English, Mathematics, and Music, were evaluated. Since the grades in these subjects showed strong correlation, Spearman proposed that scores resulting from apparently incompatible tests shared a single general factor, and that students who got good grades had a more developed psychological or intelligence component. Generally speaking, Spearman excelled in applying mathematical methods and correlation studies to the analysis of the human mind. Decades later, in 1933, Harold Hotelling, a statistician, mathematician, and influential economics theoretician, decided to call Principal Component Analysis the analysis that determines components from the maximization of the original data's variance.
Also in the first half of the 20th century, psychologist Louis Leon Thurstone, investigating Spearman's ideas and based on the application of certain psychological tests whose results were submitted to a factor analysis, identified people's seven primary mental abilities: spatial visualization, verbal meaning, verbal fluency, perceptual speed, numerical ability, reasoning, and rote memory. In psychology, the term mental factors is even used for variables that have greater influence over a certain behavior. Currently, factor analysis is used in several fields of knowledge, such as marketing, economics, strategy, finance, accounting, actuarial science, engineering, logistics, psychology, medicine, ecology, and biostatistics, among others.

The principal component factor analysis must be defined based on the underlying theory and on the researcher's experience, so that it can be possible to apply the technique correctly and to analyze the results obtained. In this chapter, we will discuss the principal component factor analysis technique, with the following objectives: (1) to introduce the concepts; (2) to present the step-by-step modeling in an algebraic and practical way; (3) to interpret the results obtained; and (4) to show the application of the technique in SPSS and Stata. Following the logic proposed in the book, we first develop the algebraic solution of an example linked to the presentation of the concepts. Only after introducing these concepts do we present and discuss the procedures for running the technique in SPSS and Stata.
12.2
PRINCIPAL COMPONENT FACTOR ANALYSIS
Many are the procedures inherent to factor analysis, with different methods for determining (extracting) factors from Pearson's correlation matrix. The most frequently used method, which was adopted in this chapter for extracting factors, is known as principal components, in which the consequent structural reduction is also called the Karhunen-Loève transformation.
TABLE 12.1 General Dataset Model for Developing a Factor Analysis

Observation i   X1i   X2i   …   Xki
1               X11   X21   …   Xk1
2               X12   X22   …   Xk2
3               X13   X23   …   Xk3
⋮               ⋮     ⋮         ⋮
n               X1n   X2n   …   Xkn
In the following sections, we will discuss the theoretical development of the technique, as well as a practical example. While the main concepts will be presented in Sections 12.2.1–12.2.5, Section 12.2.6 is meant for solving a practical example algebraically, from a dataset.
12.2.1
Pearson’s Linear Correlation and the Concept of Factor
Let's imagine a dataset that has n observations and, for each observation i (i = 1, ..., n), values corresponding to each one of the k metric variables X, as shown in Table 12.1. From the dataset, and given our intention of extracting factors from the k variables X, we must define the correlation matrix r that displays the values of Pearson's linear correlation between each pair of variables, as shown in Expression (12.1):

r = \begin{pmatrix} 1 & r_{12} & \cdots & r_{1k} \\ r_{21} & 1 & \cdots & r_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ r_{k1} & r_{k2} & \cdots & 1 \end{pmatrix}   (12.1)

Correlation matrix r is symmetrical in relation to the main diagonal which, obviously, shows values equal to 1. For example, for variables X1 and X2, Pearson's correlation r12 can be calculated by using Expression (12.2):

r_{12} = \frac{\sum_{i=1}^{n} (X_{1i} - \bar{X}_1)(X_{2i} - \bar{X}_2)}{\sqrt{\sum_{i=1}^{n} (X_{1i} - \bar{X}_1)^2} \cdot \sqrt{\sum_{i=1}^{n} (X_{2i} - \bar{X}_2)^2}}   (12.2)

where \bar{X}_1 and \bar{X}_2 represent the means of variables X1 and X2, respectively; this expression is analogous to Expression (4.11), defined in Chapter 4. Thus, since Pearson's correlation is a measure of the level of linear relationship between two metric variables, which may vary between −1 and 1, a value close to one of these extremes indicates the existence of a linear relationship between the two variables under analysis, which, therefore, may significantly contribute to the extraction of a single factor. On the other hand, a Pearson correlation very close to 0 indicates that the linear relationship between the two variables is practically nonexistent, and, therefore, different factors may be extracted. Let's imagine a hypothetical situation in which a certain dataset only has three variables (k = 3). A three-dimensional scatter plot can be constructed from the values of each variable for each observation. The plot can be seen in Fig. 12.1.
Based only on a visual analysis of the chart in Fig. 12.1, it is difficult to assess the behavior of the linear relationships between each pair of variables. Thus, Fig. 12.2 shows the projection of the points that correspond to each observation onto each one of the planes formed by the pairs of variables, highlighting, as a dotted line, the adjustment that represents the linear relationship between the respective variables. While Fig. 12.2A shows that there is a significant linear relationship between variables X1 and X2 (a very high Pearson correlation), Fig. 12.2B and C make it very clear that there is no linear relationship between X3 and these variables. Fig. 12.3 displays these projections in a three-dimensional plot, with the respective linear adjustments in each plane (the dotted lines). Thus, in this hypothetical example, while variables X1 and X2 may be represented by a single factor in a very significant way, which we will call F1, variable X3 may be represented by another factor, F2, orthogonal to F1. Fig. 12.4 illustrates the extraction of these new factors in a three-dimensional way.
PART V Multivariate Exploratory Data Analysis
FIG. 12.1 Three-dimensional scatter plot for a hypothetical situation with three variables.
FIG. 12.2 Projection of the points in each plane formed by a certain pair of variables. (A) Relationship between X1 and X2: positive and very high Pearson correlation. (B) Relationship between X1 and X3: Pearson correlation very close to 0. (C) Relationship between X2 and X3: Pearson correlation very close to 0.
So, factors can be understood as representations of latent dimensions that explain the behavior of the original variables. Having presented these initial concepts, it is important to emphasize that, in many cases, researchers may choose not to extract a factor that is represented in a considerable way by only one variable (in this case, factor F2). What defines the extraction of each one of the factors is the calculation of the eigenvalues of correlation matrix ρ, as we will study in Section 12.2.3. Before that, however, it will be necessary to verify the overall adequacy of the factor analysis, which is discussed in the following section.
Principal Component Factor Analysis Chapter 12
FIG. 12.3 Projection of the points in a three-dimensional plot with linear adjustments per plane.
12.2.2 Overall Adequacy of the Factor Analysis: Kaiser-Meyer-Olkin Statistic and Bartlett’s Test of Sphericity

An adequate extraction of factors from the original variables requires correlation matrix ρ to have relatively high and statistically significant values. As discussed by Hair et al. (2009), even though visually analyzing correlation matrix ρ does not reveal whether the factor extraction will in fact be adequate, a significant number of values below 0.30 is a preliminary indication that the factor analysis may not be adequate. In order to verify the overall adequacy of the factor extraction itself, we must use the Kaiser-Meyer-Olkin (KMO) statistic and Bartlett’s test of sphericity.

The KMO statistic gives us the proportion of variance considered common to all the variables in the sample under analysis, that is, the proportion that can be attributed to the existence of a common factor. This statistic varies from 0 to 1: while values closer to 1 indicate that the variables share a very high proportion of variance (high Pearson correlations), values closer to 0 result from low Pearson correlations between the variables, which may indicate that the factor analysis will not be adequate. The KMO statistic, presented initially by Kaiser (1970), can be calculated through Expression (12.3):

$$KMO = \frac{\sum_{l=1}^{k}\sum_{c=1}^{k} r_{lc}^{2}}{\sum_{l=1}^{k}\sum_{c=1}^{k} r_{lc}^{2} + \sum_{l=1}^{k}\sum_{c=1}^{k} \varphi_{lc}^{2}},\quad l \neq c \tag{12.3}$$

where l and c represent the rows and columns of correlation matrix ρ, respectively, and the terms φ represent the partial correlation coefficients between two variables. While Pearson’s correlation coefficients r are also called zero-order correlation coefficients, partial correlation coefficients φ are also known as higher-order correlation coefficients. For three
FIG. 12.4 Factor extraction.
variables, they are also called first-order correlation coefficients; for four variables, second-order correlation coefficients; and so on. Let’s imagine a hypothetical situation in which a certain dataset once again has three variables (k = 3). Can r12 in fact reflect the degree of linear relationship between X1 and X2 if variable X3 is related to the other two? In this situation, r12 may not represent the true degree of linear relationship between X1 and X2 in the presence of X3, which may give a false impression regarding the nature of the relationship between the first two. Thus, partial correlation coefficients may contribute to the analysis, since, according to Gujarati and Porter (2008), they are used when researchers wish to find out the correlation between two variables while controlling for (or ignoring) the effects of other variables present in the dataset. For our hypothetical situation, it is the correlation coefficient between X1 and X2 free of X3’s influence over them, if any. Hence, for three variables X1, X2, and X3, we can define the first-order correlation coefficients in the following way:

$$\varphi_{12,3} = \frac{r_{12} - r_{13}\,r_{23}}{\sqrt{\left(1-r_{13}^{2}\right)\left(1-r_{23}^{2}\right)}} \tag{12.4}$$

where φ12,3 represents the correlation between X1 and X2, maintaining X3 constant,

$$\varphi_{13,2} = \frac{r_{13} - r_{12}\,r_{23}}{\sqrt{\left(1-r_{12}^{2}\right)\left(1-r_{23}^{2}\right)}} \tag{12.5}$$

where φ13,2 represents the correlation between X1 and X3, maintaining X2 constant, and

$$\varphi_{23,1} = \frac{r_{23} - r_{12}\,r_{13}}{\sqrt{\left(1-r_{12}^{2}\right)\left(1-r_{13}^{2}\right)}} \tag{12.6}$$

where φ23,1 represents the correlation between X2 and X3, maintaining X1 constant. In general, a first-order correlation coefficient can be obtained through the following expression:

$$\varphi_{ab,c} = \frac{r_{ab} - r_{ac}\,r_{bc}}{\sqrt{\left(1-r_{ac}^{2}\right)\left(1-r_{bc}^{2}\right)}} \tag{12.7}$$

where a, b, and c can assume the values 1, 2, or 3, corresponding to the three variables under analysis. Conversely, for a case in which there are four variables in the analysis, the general expression of a partial correlation coefficient (a second-order correlation coefficient) is given by:

$$\varphi_{ab,cd} = \frac{\varphi_{ab,c} - \varphi_{ad,c}\,\varphi_{bd,c}}{\sqrt{\left(1-\varphi_{ad,c}^{2}\right)\left(1-\varphi_{bd,c}^{2}\right)}} \tag{12.8}$$

where φab,cd represents the correlation between Xa and Xb, maintaining Xc and Xd constant, bearing in mind that a, b, c, and d may take on the values 1, 2, 3, or 4, which correspond to the four variables under analysis. Higher-order correlation coefficients, for cases with five or more variables in the analysis, should always be obtained from lower-order partial correlation coefficients. In Section 12.2.6, we will propose a practical example with four variables, in which the algebraic solution of the KMO statistic will be obtained through Expression (12.8).

It is important to highlight that, even if Pearson’s correlation coefficient between two variables is 0, the partial correlation coefficient between them may differ from 0, depending on the values of Pearson’s correlation coefficients between each of these variables and the others present in the dataset. In order for a factor analysis to be considered adequate, the partial correlation coefficients between the variables must be low. This denotes that the variables share a high proportion of variance, and that disregarding one or more of them in the analysis may hamper the quality of the factor extraction. According to a widely accepted criterion found in the existing literature, Table 12.2 gives an indication of the relationship between the KMO statistic and the overall adequacy of the factor analysis.
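The partial correlations of Expressions (12.4)-(12.8) and the KMO statistic of Expression (12.3) can be sketched as below. The correlation values are hypothetical; in the KMO computation, the partial correlations φ are obtained from the inverse of the correlation matrix, a standard matrix shortcut that, for k = 3, agrees with the first-order formulas above:

```python
import numpy as np

# Hypothetical correlation matrix for k = 3 variables.
rho = np.array([
    [1.00, 0.80, 0.10],
    [0.80, 1.00, 0.15],
    [0.10, 0.15, 1.00],
])

def partial_corr_first_order(r_ab, r_ac, r_bc):
    # Expression (12.7): correlation between X_a and X_b, holding X_c constant.
    return (r_ab - r_ac * r_bc) / np.sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))

def kmo(rho):
    # Expression (12.3). The partial correlations phi come from the inverse
    # of rho (equivalent, for higher orders, to chaining (12.4)-(12.8)).
    inv = np.linalg.inv(rho)
    d = np.sqrt(np.outer(np.diag(inv), np.diag(inv)))
    phi = -inv / d                               # partial correlation matrix
    off = ~np.eye(rho.shape[0], dtype=bool)      # exclude the main diagonal
    r2, phi2 = (rho[off] ** 2).sum(), (phi[off] ** 2).sum()
    return r2 / (r2 + phi2)

phi_12_3 = partial_corr_first_order(rho[0, 1], rho[0, 2], rho[1, 2])
```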
On the other hand, Bartlett’s test of sphericity (Bartlett, 1954) consists in comparing correlation matrix ρ to an identity matrix I of the same dimension. If the differences between the corresponding values outside the main diagonal of each matrix are not statistically different from 0, at a certain significance level, we may consider that the factor extraction will not be adequate. In other words, in this case, Pearson’s correlations between each pair of variables are statistically equal to 0, which makes any attempt to extract factors from the original variables unfeasible. So, we can define the null and alternative hypotheses of Bartlett’s test of sphericity in the following way:

$$H_{0}: \rho = \begin{pmatrix} 1 & r_{12} & \cdots & r_{1k}\\ r_{21} & 1 & \cdots & r_{2k}\\ \vdots & \vdots & \ddots & \vdots\\ r_{k1} & r_{k2} & \cdots & 1 \end{pmatrix} = I = \begin{pmatrix} 1 & 0 & \cdots & 0\\ 0 & 1 & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & 1 \end{pmatrix}$$
TABLE 12.2 Relationship Between the KMO Statistic and the Overall Adequacy of the Factor Analysis

KMO Statistic            Overall Adequacy of the Factor Analysis
Between 1.00 and 0.90    Marvelous
Between 0.90 and 0.80    Meritorious
Between 0.80 and 0.70    Middling
Between 0.70 and 0.60    Mediocre
Between 0.60 and 0.50    Miserable
Less than 0.50           Unacceptable
$$H_{1}: \rho = \begin{pmatrix} 1 & r_{12} & \cdots & r_{1k}\\ r_{21} & 1 & \cdots & r_{2k}\\ \vdots & \vdots & \ddots & \vdots\\ r_{k1} & r_{k2} & \cdots & 1 \end{pmatrix} \neq I = \begin{pmatrix} 1 & 0 & \cdots & 0\\ 0 & 1 & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & 1 \end{pmatrix}$$
The statistic corresponding to Bartlett’s test of sphericity is a χ² statistic, given by the following expression:

$$\chi_{Bartlett}^{2} = -\left[(n-1) - \left(\frac{2k+5}{6}\right)\right]\cdot \ln|D| \tag{12.9}$$

with k·(k − 1)/2 degrees of freedom, where n is the sample size, k is the number of variables, and D represents the determinant of correlation matrix ρ. Thus, for a certain number of degrees of freedom and a certain significance level, Bartlett’s test of sphericity allows us to check whether the calculated value of the χ²_Bartlett statistic is higher than the statistic’s critical value. If it is, we may state that Pearson’s correlations between the pairs of variables are statistically different from 0 and that, therefore, factors can be extracted from the original variables: the factor analysis is adequate. When we develop a practical example in Section 12.2.6, we will also discuss the calculation of the χ²_Bartlett statistic and the result of Bartlett’s test of sphericity.

It is important to emphasize that we should always favor Bartlett’s test of sphericity over the KMO statistic when deciding on the factor analysis’s overall adequacy, given that the former is a test with a defined significance level, while the latter is only a coefficient (a statistic) calculated without any associated probability distribution or hypotheses that would allow us to evaluate a significance level and make a decision. In addition, it is important to mention that, for only two original variables, the KMO statistic will always be equal to 0.50, whereas the χ²_Bartlett statistic may indicate whether or not the null hypothesis of the test of sphericity is rejected, depending on the magnitude of Pearson’s correlation between the two variables. Thus, while the KMO statistic will be 0.50 in these situations, Bartlett’s test of sphericity will allow researchers to decide whether or not to extract a single factor from the two original variables. In contrast, for three original variables, it is very common for researchers to extract two factors with statistical significance in Bartlett’s test of sphericity, yet with a KMO statistic below 0.50.
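Expression (12.9) translates almost directly into code. The sketch below (hypothetical correlation matrix, assumed sample size n = 100) uses scipy’s χ² survival function to obtain the p-value:

```python
import numpy as np
from scipy import stats

def bartlett_sphericity(rho, n):
    # Expression (12.9): chi2 = -[(n - 1) - (2k + 5)/6] * ln|D|,
    # where D = det(rho), with k(k - 1)/2 degrees of freedom.
    k = rho.shape[0]
    chi2 = -((n - 1) - (2 * k + 5) / 6.0) * np.log(np.linalg.det(rho))
    df = k * (k - 1) // 2
    p_value = stats.chi2.sf(chi2, df)            # survival function = 1 - CDF
    return chi2, df, p_value

# Hypothetical correlation matrix (k = 3) and sample size.
rho = np.array([
    [1.00, 0.80, 0.10],
    [0.80, 1.00, 0.15],
    [0.10, 0.15, 1.00],
])
chi2, df, p = bartlett_sphericity(rho, n=100)
# A p-value below the chosen significance level rejects H0 (rho = I),
# indicating that the factor extraction can be considered adequate.
```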
These two situations further emphasize the greater relevance of Bartlett’s test of sphericity, relative to the KMO statistic, in the decision-making process. Finally, we must mention that the recommendation to study the magnitude of Cronbach’s alpha before studying the overall adequacy of the factor analysis is commonly found in the existing literature, as a way of evaluating the reliability with which a factor can be extracted from the original variables. We would like to highlight that Cronbach’s alpha only offers researchers an indication of the internal consistency of the variables in the dataset for the extraction of a single factor. Therefore, determining it is not a mandatory requisite for developing a factor analysis, since this technique allows the extraction of several factors. Nevertheless, for pedagogical purposes, we will discuss the main concepts of Cronbach’s alpha in the Appendix of this chapter, with its algebraic determination and corresponding applications in SPSS and Stata software. Having discussed these concepts and verified the overall adequacy of the factor analysis, we can now move on to the definition of the factors.
12.2.3 Defining the Principal Component Factors: Determining the Eigenvalues and Eigenvectors of Correlation Matrix ρ and Calculating the Factor Scores

Since a factor represents a linear combination of the original variables, for k variables we can define a maximum number of k factors (F1, F2, …, Fk), analogous to the maximum number of clusters that can be defined from a sample with n observations, as we discussed in the previous chapter, since a factor can also be understood as the result of a clustering of variables. Therefore, for k variables, we have:

$$\begin{aligned} F_{1i} &= s_{11}X_{1i} + s_{21}X_{2i} + \cdots + s_{k1}X_{ki}\\ F_{2i} &= s_{12}X_{1i} + s_{22}X_{2i} + \cdots + s_{k2}X_{ki}\\ &\;\;\vdots\\ F_{ki} &= s_{1k}X_{1i} + s_{2k}X_{2i} + \cdots + s_{kk}X_{ki} \end{aligned} \tag{12.10}$$
where the terms s are known as factor scores, which represent the parameters of a linear model that relates a certain factor to the original variables. Calculating the factor scores is essential in the context of the factor analysis technique, and is done by determining the eigenvalues and eigenvectors of correlation matrix ρ. In Expression (12.11), we once again show correlation matrix ρ, already presented in Expression (12.1):

$$\rho = \begin{pmatrix} 1 & r_{12} & \cdots & r_{1k}\\ r_{21} & 1 & \cdots & r_{2k}\\ \vdots & \vdots & \ddots & \vdots\\ r_{k1} & r_{k2} & \cdots & 1 \end{pmatrix} \tag{12.11}$$
This correlation matrix, with dimensions k × k, has k eigenvalues λ² (λ₁² ≥ λ₂² ≥ … ≥ λₖ²), which can be obtained by solving the following equation:

$$\det\left(\lambda^{2}I - \rho\right) = 0 \tag{12.12}$$

where I is the identity matrix, also with dimensions k × k. Since a certain factor represents the result of the clustering of variables, it is important to highlight that:

$$\lambda_{1}^{2} + \lambda_{2}^{2} + \cdots + \lambda_{k}^{2} = k \tag{12.13}$$

Expression (12.12) can be rewritten as follows:

$$\begin{vmatrix} \lambda^{2}-1 & -r_{12} & \cdots & -r_{1k}\\ -r_{21} & \lambda^{2}-1 & \cdots & -r_{2k}\\ \vdots & \vdots & \ddots & \vdots\\ -r_{k1} & -r_{k2} & \cdots & \lambda^{2}-1 \end{vmatrix} = 0 \tag{12.14}$$

from which we can define the eigenvalue matrix Λ² in the following way:

$$\Lambda^{2} = \begin{pmatrix} \lambda_{1}^{2} & 0 & \cdots & 0\\ 0 & \lambda_{2}^{2} & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & \lambda_{k}^{2} \end{pmatrix} \tag{12.15}$$
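As a numerical sketch of Expressions (12.12)-(12.15), the eigenvalues and eigenvectors of a (hypothetical) correlation matrix can be obtained with numpy; `eigh` handles the symmetric case, and the result is reordered so that λ₁² ≥ λ₂² ≥ … ≥ λₖ²:

```python
import numpy as np

# Hypothetical correlation matrix (k = 3).
rho = np.array([
    [1.00, 0.80, 0.10],
    [0.80, 1.00, 0.15],
    [0.10, 0.15, 1.00],
])

# Solve Expression (12.12): eigh is appropriate for symmetric matrices and
# returns eigenvalues in ascending order, so we flip to descending order.
eigenvalues, eigenvectors = np.linalg.eigh(rho)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Expression (12.13): the eigenvalues of a k x k correlation matrix sum to k.
assert np.isclose(eigenvalues.sum(), rho.shape[0])
```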
In order to define the eigenvectors of matrix ρ from the eigenvalues, we must solve the following equation system for each eigenvalue λ² (λ₁², λ₂², …, λₖ²):

● Determining eigenvectors v11, v21, …, vk1 from the first eigenvalue (λ₁²):

$$\begin{pmatrix} \lambda_{1}^{2}-1 & -r_{12} & \cdots & -r_{1k}\\ -r_{21} & \lambda_{1}^{2}-1 & \cdots & -r_{2k}\\ \vdots & \vdots & \ddots & \vdots\\ -r_{k1} & -r_{k2} & \cdots & \lambda_{1}^{2}-1 \end{pmatrix}\cdot\begin{pmatrix} v_{11}\\ v_{21}\\ \vdots\\ v_{k1} \end{pmatrix} = \begin{pmatrix} 0\\ 0\\ \vdots\\ 0 \end{pmatrix} \tag{12.16}$$

from where we obtain:

$$\begin{cases} \left(\lambda_{1}^{2}-1\right)v_{11} - r_{12}\,v_{21} - \cdots - r_{1k}\,v_{k1} = 0\\ -r_{21}\,v_{11} + \left(\lambda_{1}^{2}-1\right)v_{21} - \cdots - r_{2k}\,v_{k1} = 0\\ \quad\vdots\\ -r_{k1}\,v_{11} - r_{k2}\,v_{21} - \cdots + \left(\lambda_{1}^{2}-1\right)v_{k1} = 0 \end{cases} \tag{12.17}$$

● Determining eigenvectors v12, v22, …, vk2 from the second eigenvalue (λ₂²):

$$\begin{pmatrix} \lambda_{2}^{2}-1 & -r_{12} & \cdots & -r_{1k}\\ -r_{21} & \lambda_{2}^{2}-1 & \cdots & -r_{2k}\\ \vdots & \vdots & \ddots & \vdots\\ -r_{k1} & -r_{k2} & \cdots & \lambda_{2}^{2}-1 \end{pmatrix}\cdot\begin{pmatrix} v_{12}\\ v_{22}\\ \vdots\\ v_{k2} \end{pmatrix} = \begin{pmatrix} 0\\ 0\\ \vdots\\ 0 \end{pmatrix} \tag{12.18}$$

from where we obtain:

$$\begin{cases} \left(\lambda_{2}^{2}-1\right)v_{12} - r_{12}\,v_{22} - \cdots - r_{1k}\,v_{k2} = 0\\ -r_{21}\,v_{12} + \left(\lambda_{2}^{2}-1\right)v_{22} - \cdots - r_{2k}\,v_{k2} = 0\\ \quad\vdots\\ -r_{k1}\,v_{12} - r_{k2}\,v_{22} - \cdots + \left(\lambda_{2}^{2}-1\right)v_{k2} = 0 \end{cases} \tag{12.19}$$

● Determining eigenvectors v1k, v2k, …, vkk from the kth eigenvalue (λₖ²):

$$\begin{pmatrix} \lambda_{k}^{2}-1 & -r_{12} & \cdots & -r_{1k}\\ -r_{21} & \lambda_{k}^{2}-1 & \cdots & -r_{2k}\\ \vdots & \vdots & \ddots & \vdots\\ -r_{k1} & -r_{k2} & \cdots & \lambda_{k}^{2}-1 \end{pmatrix}\cdot\begin{pmatrix} v_{1k}\\ v_{2k}\\ \vdots\\ v_{kk} \end{pmatrix} = \begin{pmatrix} 0\\ 0\\ \vdots\\ 0 \end{pmatrix} \tag{12.20}$$

from where we obtain:

$$\begin{cases} \left(\lambda_{k}^{2}-1\right)v_{1k} - r_{12}\,v_{2k} - \cdots - r_{1k}\,v_{kk} = 0\\ -r_{21}\,v_{1k} + \left(\lambda_{k}^{2}-1\right)v_{2k} - \cdots - r_{2k}\,v_{kk} = 0\\ \quad\vdots\\ -r_{k1}\,v_{1k} - r_{k2}\,v_{2k} - \cdots + \left(\lambda_{k}^{2}-1\right)v_{kk} = 0 \end{cases} \tag{12.21}$$

Thus, we can calculate the factor scores of each factor by determining the eigenvalues and eigenvectors of correlation matrix ρ. The factor score vectors can be defined as follows:
● Factor scores of the first factor:

$$S_{1} = \begin{pmatrix} s_{11}\\ s_{21}\\ \vdots\\ s_{k1} \end{pmatrix} = \begin{pmatrix} v_{11}/\sqrt{\lambda_{1}^{2}}\\ v_{21}/\sqrt{\lambda_{1}^{2}}\\ \vdots\\ v_{k1}/\sqrt{\lambda_{1}^{2}} \end{pmatrix} \tag{12.22}$$

● Factor scores of the second factor:

$$S_{2} = \begin{pmatrix} s_{12}\\ s_{22}\\ \vdots\\ s_{k2} \end{pmatrix} = \begin{pmatrix} v_{12}/\sqrt{\lambda_{2}^{2}}\\ v_{22}/\sqrt{\lambda_{2}^{2}}\\ \vdots\\ v_{k2}/\sqrt{\lambda_{2}^{2}} \end{pmatrix} \tag{12.23}$$

● Factor scores of the kth factor:

$$S_{k} = \begin{pmatrix} s_{1k}\\ s_{2k}\\ \vdots\\ s_{kk} \end{pmatrix} = \begin{pmatrix} v_{1k}/\sqrt{\lambda_{k}^{2}}\\ v_{2k}/\sqrt{\lambda_{k}^{2}}\\ \vdots\\ v_{kk}/\sqrt{\lambda_{k}^{2}} \end{pmatrix} \tag{12.24}$$
Since the factor scores of each factor are standardized by the respective eigenvalues, the factors in the set of equations presented in Expression (12.10) must be obtained by multiplying each factor score by the corresponding original variable, standardized by using the Z-scores procedure. Thus, we can obtain each one of the factors based on the following equations:

$$\begin{aligned} F_{1i} &= \frac{v_{11}}{\sqrt{\lambda_{1}^{2}}}ZX_{1i} + \frac{v_{21}}{\sqrt{\lambda_{1}^{2}}}ZX_{2i} + \cdots + \frac{v_{k1}}{\sqrt{\lambda_{1}^{2}}}ZX_{ki}\\ F_{2i} &= \frac{v_{12}}{\sqrt{\lambda_{2}^{2}}}ZX_{1i} + \frac{v_{22}}{\sqrt{\lambda_{2}^{2}}}ZX_{2i} + \cdots + \frac{v_{k2}}{\sqrt{\lambda_{2}^{2}}}ZX_{ki}\\ &\;\;\vdots\\ F_{ki} &= \frac{v_{1k}}{\sqrt{\lambda_{k}^{2}}}ZX_{1i} + \frac{v_{2k}}{\sqrt{\lambda_{k}^{2}}}ZX_{2i} + \cdots + \frac{v_{kk}}{\sqrt{\lambda_{k}^{2}}}ZX_{ki} \end{aligned} \tag{12.25}$$

where ZXᵢ represents the standardized value of each variable X for a certain observation i. It is important to emphasize that all the factors extracted show, among themselves, Pearson correlations equal to 0, that is, they are orthogonal to one another. A more perceptive researcher will notice that the factor scores of each factor correspond exactly to the estimated parameters of a multiple linear regression model that has the factor itself as the dependent variable and the standardized variables as explanatory variables. Mathematically, it is also possible to verify the relationship between the eigenvectors, correlation matrix ρ, and eigenvalue matrix Λ². Defining the eigenvector matrix V as follows:

$$V = \begin{pmatrix} v_{11} & v_{12} & \cdots & v_{1k}\\ v_{21} & v_{22} & \cdots & v_{2k}\\ \vdots & \vdots & \ddots & \vdots\\ v_{k1} & v_{k2} & \cdots & v_{kk} \end{pmatrix} \tag{12.26}$$

we can prove that:

$$V' \cdot \rho \cdot V = \Lambda^{2} \tag{12.27}$$

or:
$$\begin{pmatrix} v_{11} & v_{21} & \cdots & v_{k1}\\ v_{12} & v_{22} & \cdots & v_{k2}\\ \vdots & \vdots & \ddots & \vdots\\ v_{1k} & v_{2k} & \cdots & v_{kk} \end{pmatrix}\cdot\begin{pmatrix} 1 & r_{12} & \cdots & r_{1k}\\ r_{21} & 1 & \cdots & r_{2k}\\ \vdots & \vdots & \ddots & \vdots\\ r_{k1} & r_{k2} & \cdots & 1 \end{pmatrix}\cdot\begin{pmatrix} v_{11} & v_{12} & \cdots & v_{1k}\\ v_{21} & v_{22} & \cdots & v_{2k}\\ \vdots & \vdots & \ddots & \vdots\\ v_{k1} & v_{k2} & \cdots & v_{kk} \end{pmatrix} = \begin{pmatrix} \lambda_{1}^{2} & 0 & \cdots & 0\\ 0 & \lambda_{2}^{2} & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & \lambda_{k}^{2} \end{pmatrix} \tag{12.28}$$
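These identities can be checked numerically. The sketch below (simulated data, purely illustrative) verifies Expression (12.27) and builds the factors of Expression (12.25) from the standardized variables, confirming their orthogonality:

```python
import numpy as np

# Simulated dataset with n = 100 observations of k = 3 variables,
# with X2 built to correlate with X1 (values purely illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 1] += 0.9 * X[:, 0]

rho = np.corrcoef(X, rowvar=False)
lam2, V = np.linalg.eigh(rho)
order = np.argsort(lam2)[::-1]
lam2, V = lam2[order], V[:, order]

# Expression (12.27): V' . rho . V equals the diagonal eigenvalue matrix.
assert np.allclose(V.T @ rho @ V, np.diag(lam2))

# Expressions (12.22)-(12.25): factor scores v / sqrt(lambda^2) applied
# to the Z-standardized variables yield the factors.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
S = V / np.sqrt(lam2)
F = Z @ S

# The extracted factors are pairwise orthogonal (zero Pearson correlations).
assert np.allclose(np.corrcoef(F, rowvar=False), np.eye(3), atol=1e-8)
```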
In Section 12.2.6, we will discuss a practical example in which this relationship can be demonstrated. While Section 12.2.2 dealt with the factor analysis’s overall adequacy, this section discusses the procedures for carrying out the factor extraction itself, once the technique is considered adequate. Even though the maximum number of factors for k variables is equal to k, it is essential for researchers to define, based on some criterion, the adequate number of factors that, in fact, represent the original variables. In our hypothetical example in Section 12.2.1, we saw that only two factors (F1 and F2) would be enough to represent the three original variables (X1, X2, and X3).

Although researchers are free to determine, in a preliminary way, the number of factors to be extracted, since they may wish to verify the validity of a previously established construct (a procedure known as the a priori criterion), it is essential to carry out an analysis based on the magnitude of the eigenvalues calculated from correlation matrix ρ. The eigenvalues correspond to the proportions of variance shared by the original variables to form each factor, as we will discuss in Section 12.2.4. Since λ₁² ≥ λ₂² ≥ … ≥ λₖ², and bearing in mind that factors F1, F2, …, Fk are obtained from the respective eigenvalues, factors extracted from smaller eigenvalues are formed from smaller proportions of the variance shared by the original variables. Because a factor represents a certain clustering of variables, factors extracted from eigenvalues less than 1 will generally not be able to represent the behavior of even a single original variable (there are exceptions to this rule, which occur when an eigenvalue is less than, but very close to, 1).
The criterion for choosing the number of factors in which only the factors corresponding to eigenvalues greater than 1 are considered is frequently used and is known as the latent root criterion, or Kaiser criterion. The factor extraction method presented in this chapter is known as principal components, and the first factor F1, formed by the highest proportion of variance shared by the original variables, is also called the principal factor. This method is often mentioned in the existing literature and is used in practical applications whenever researchers wish to elaborate a structural reduction
of the data in order to create orthogonal factors, to define observation rankings by using the factors generated, and even to confirm the validity of previously established constructs. Other factor extraction methods, such as generalized least squares, unweighted least squares, maximum likelihood, alpha factoring, and image factoring, have different criteria and certain specificities; even though they can also be found in the existing literature, they will not be discussed in this book. Moreover, the need to apply factor analysis only to variables that follow a multivariate normal distribution, in order to ensure consistency when determining the factor scores, is commonly discussed. Nevertheless, it is important to emphasize that multivariate normality is a very rigid assumption, necessary only for a few factor extraction methods, such as maximum likelihood. Most factor extraction methods do not require multivariate normality of the data and, as discussed by Gorsuch (1983), the principal component factor analysis seems, in practice, to be very robust to violations of normality.
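A minimal sketch of the latent root (Kaiser) criterion discussed above, using a hypothetical correlation matrix; with these particular correlations, only the first eigenvalue exceeds 1, so a single factor would be retained:

```python
import numpy as np

def latent_root_criterion(rho):
    # Kaiser criterion: retain only factors whose eigenvalues exceed 1.
    lam2 = np.sort(np.linalg.eigvalsh(rho))[::-1]
    return lam2, int((lam2 > 1).sum())

# Hypothetical correlation matrix: X1 and X2 strongly related, X3 weakly.
rho = np.array([
    [1.00, 0.80, 0.10],
    [0.80, 1.00, 0.15],
    [0.10, 0.15, 1.00],
])
lam2, n_factors = latent_root_criterion(rho)
```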
12.2.4 Factor Loadings and Communalities
Having established the factors, we can now define the factor loadings, which are simply the Pearson correlations between the original variables and each one of the factors. Table 12.3 shows the factor loadings for each variable-factor pair. Based on the latent root criterion (in which only factors resulting from eigenvalues greater than 1 are considered), we assume that the factor loadings between the factors corresponding to eigenvalues less than 1 and all the original variables are low, since the variables will already have presented higher Pearson correlations (loadings) with the factors previously extracted from greater eigenvalues. In the same way, original variables that share only a small portion of variance with the other variables will have a high factor loading in only a single factor. If this occurs for all the original variables, there will not be significant differences between correlation matrix ρ and identity matrix I, making the χ²_Bartlett statistic very low. In this case, the factor analysis will not be adequate, and researchers may choose not to extract factors from the original variables.

As the factor loadings are Pearson’s correlations between each variable and each factor, the sum of the squared loadings in each row of Table 12.3 will always be equal to 1 when all k factors are considered, since each variable shares part of its variance with all the k factors, and the sum of these proportions of variance (squared factor loadings, that is, squared Pearson correlations) will be 100%. Conversely, if fewer than k factors are extracted, due to the latent root criterion, the sum of the squared factor loadings in each row will be less than 1. This sum is called the communality, which represents the total variance shared by each variable with all the factors extracted from eigenvalues greater than 1. So, we can say that:

$$\begin{aligned} c_{11}^{2} + c_{12}^{2} + \cdots &= \text{communality}_{X_{1}}\\ c_{21}^{2} + c_{22}^{2} + \cdots &= \text{communality}_{X_{2}}\\ &\;\;\vdots\\ c_{k1}^{2} + c_{k2}^{2} + \cdots &= \text{communality}_{X_{k}} \end{aligned} \tag{12.29}$$
The main objective of the analysis of communalities is to check if any variable ends up not sharing a significant proportion of variance with the factors extracted. Even though there is no cutoff point from which a certain communality can be considered high or low, since the sample size can interfere in this assessment, the existence of considerably low communalities in relation to the others can indicate to researchers that they may need to reconsider including the respective variable into the factor analysis.
TABLE 12.3 Factor Loadings Between Original Variables and Factors

Variable    F1     F2     …    Fk
X1          c11    c12    …    c1k
X2          c21    c22    …    c2k
⋮           ⋮      ⋮           ⋮
Xk          ck1    ck2    …    ckk
Therefore, after defining the factors based on the factor scores, we can state that the factor loadings will be exactly the same as the parameters estimated in a multiple linear regression model that has a certain standardized variable ZX as the dependent variable and the factors themselves as explanatory variables; the coefficient of determination R² of each such model is equal to the communality of the respective original variable. The sum of the squared factor loadings in each column of Table 12.3, on the other hand, will be equal to the respective eigenvalue, since the ratio between each eigenvalue and the total number of variables can be understood as the proportion of variance shared by all k original variables to form each factor. So, we can say that:

$$\begin{aligned} c_{11}^{2} + c_{21}^{2} + \cdots + c_{k1}^{2} &= \lambda_{1}^{2}\\ c_{12}^{2} + c_{22}^{2} + \cdots + c_{k2}^{2} &= \lambda_{2}^{2}\\ &\;\;\vdots\\ c_{1k}^{2} + c_{2k}^{2} + \cdots + c_{kk}^{2} &= \lambda_{k}^{2} \end{aligned} \tag{12.30}$$
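The row and column identities of Expressions (12.29) and (12.30) can be verified numerically. In the sketch below (hypothetical correlation matrix), the loading of variable l on factor j is obtained as the eigenvector entry scaled by the square root of the eigenvalue:

```python
import numpy as np

# Hypothetical correlation matrix (k = 3).
rho = np.array([
    [1.00, 0.80, 0.10],
    [0.80, 1.00, 0.15],
    [0.10, 0.15, 1.00],
])
lam2, V = np.linalg.eigh(rho)
order = np.argsort(lam2)[::-1]
lam2, V = lam2[order], V[:, order]

# Loadings: correlation of variable l with factor j, c_lj = v_lj * sqrt(lambda_j^2).
C = V * np.sqrt(lam2)

# Row sums of squared loadings over ALL k factors equal 1; summing only the
# retained columns gives the communalities of Expression (12.29).
assert np.allclose((C ** 2).sum(axis=1), 1.0)

# Column sums of squared loadings reproduce the eigenvalues (Expression 12.30).
assert np.allclose((C ** 2).sum(axis=0), lam2)

# Communalities if only the first factor were retained:
communalities = (C[:, :1] ** 2).sum(axis=1)
```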
After the factors have been established and the factor loadings calculated, it is also possible for some variables to have intermediate (neither very high nor very low) Pearson correlations (factor loadings) with all the factors extracted, even though their communalities are not particularly low. In this case, although the solution of the factor analysis has already been obtained in an adequate way and may be considered concluded, researchers can, when the factor loadings table shows intermediate values for one or more variables in all the factors, elaborate a rotation of these factors, so that the Pearson correlations between the original variables and the new factors generated are increased. In the following section, we will discuss factor rotation.
12.2.5 Factor Rotation
Once again, let’s imagine a hypothetical situation in which a certain dataset has only three variables (k = 3). After the principal component factor analysis is carried out, two factors, orthogonal to one another, are extracted, with factor loadings (Pearson correlations) with each one of the three original variables as shown in Table 12.4. In order to construct a chart with the relative position of each variable in each factor (a chart known as a loading plot), we can take the factor loadings as the coordinates (abscissas and ordinates) of the variables in the Cartesian plane formed by the two orthogonal factors. The plot can be seen in Fig. 12.5. In order to better visualize which variables are best represented by each factor, we can rotate the originally extracted factors F1 and F2 around the origin, so as to bring the points corresponding to variables X1, X2, and X3 closer to one of the new factors. These are called the rotated factors F1′ and F2′. Fig. 12.6 shows this process in a simplified way. Based on Fig. 12.6, for each variable under analysis, we can see that while the loading for one factor increases, the loading for the other decreases. Table 12.5 shows the loading redistribution for our hypothetical situation. Thus, for a generic situation, we can say that rotation is a procedure that maximizes the loadings of each variable on a certain factor, to the detriment of the others. The final effect of rotation is therefore the redistribution of factor loadings toward factors that initially had smaller proportions of variance shared by all the original variables. The main objective is to minimize the number of variables with high loadings on each factor, since each factor will then have significant loadings with only some of the original variables. Consequently, rotation may simplify the interpretation of the factors.
TABLE 12.4 Factor Loadings Between Three Variables and Two Factors

Variable    F1     F2
X1          c11    c12
X2          c21    c22
X3          c31    c32
FIG. 12.5 Loading plot for a hypothetical situation with three variables and two factors.
FIG. 12.6 Defining the rotated factors from the original factors.
TABLE 12.5 Original and Rotated Factor Loadings for Our Hypothetical Situation

Variable    Original Factor Loadings (F1, F2)    Rotated Factor Loadings (F1′, F2′)
X1          c11, c12                             |c′11| > |c11|,  |c′12| < |c12|
X2          c21, c22                             |c′21| > |c21|,  |c′22| < |c22|
X3          c31, c32                             |c′31| < |c31|,  |c′32| > |c32|
Despite the fact that the communalities and the total proportion of variance shared by all the variables across all the factors are not modified by rotation (and neither are the KMO statistic or the χ²_Bartlett statistic), the proportion of variance shared by the original variables in each factor is redistributed and, therefore, modified. In other words, new eigenvalues λ′² (λ′₁², λ′₂², …, λ′ₖ²) arise from the rotated factor loadings. Thus, we can say that:

$$\begin{aligned} c_{11}^{\prime 2} + c_{12}^{\prime 2} + \cdots &= \text{communality}_{X_{1}}\\ c_{21}^{\prime 2} + c_{22}^{\prime 2} + \cdots &= \text{communality}_{X_{2}}\\ &\;\;\vdots\\ c_{k1}^{\prime 2} + c_{k2}^{\prime 2} + \cdots &= \text{communality}_{X_{k}} \end{aligned} \tag{12.31}$$
and that:

$$\begin{aligned} c_{11}^{\prime 2} + c_{21}^{\prime 2} + \cdots + c_{k1}^{\prime 2} &= \lambda_{1}^{\prime 2} \neq \lambda_{1}^{2}\\ c_{12}^{\prime 2} + c_{22}^{\prime 2} + \cdots + c_{k2}^{\prime 2} &= \lambda_{2}^{\prime 2} \neq \lambda_{2}^{2}\\ &\;\;\vdots\\ c_{1k}^{\prime 2} + c_{2k}^{\prime 2} + \cdots + c_{kk}^{\prime 2} &= \lambda_{k}^{\prime 2} \neq \lambda_{k}^{2} \end{aligned} \tag{12.32}$$
even though Expression (12.13) is still respected, that is:

$$\lambda_{1}^{2} + \lambda_{2}^{2} + \cdots + \lambda_{k}^{2} = \lambda_{1}^{\prime 2} + \lambda_{2}^{\prime 2} + \cdots + \lambda_{k}^{\prime 2} = k \tag{12.33}$$
Besides, new rotated factor scores s′ are obtained from the rotation of the factors, such that the final expressions of the rotated factors become:

$$\begin{aligned} F_{1i}^{\prime} &= s_{11}^{\prime}ZX_{1i} + s_{21}^{\prime}ZX_{2i} + \cdots + s_{k1}^{\prime}ZX_{ki}\\ F_{2i}^{\prime} &= s_{12}^{\prime}ZX_{1i} + s_{22}^{\prime}ZX_{2i} + \cdots + s_{k2}^{\prime}ZX_{ki}\\ &\;\;\vdots\\ F_{ki}^{\prime} &= s_{1k}^{\prime}ZX_{1i} + s_{2k}^{\prime}ZX_{2i} + \cdots + s_{kk}^{\prime}ZX_{ki} \end{aligned} \tag{12.34}$$
It is important to highlight that the overall adequacy of the factor analysis (the KMO statistic and Bartlett’s test of sphericity) is not altered by rotation, since correlation matrix ρ remains the same. Even though there are several factor rotation methods, the orthogonal rotation method known as Varimax, whose main purpose is to minimize the number of variables that have high loadings on a certain factor by redistributing the factor loadings and maximizing the variance shared in factors corresponding to lower eigenvalues (hence the name Varimax), is the most frequently used, and it will be used in this chapter to solve a practical example. This method was proposed by Kaiser (1958).

The algorithm behind the Varimax rotation method consists in determining a rotation angle θ by which pairs of factors are equally rotated. As discussed by Harman (1976), for a certain pair of factors F1 and F2, for example, the rotated factor loadings c′ between the two factors and the k original variables are obtained from the original factor loadings c, through the following matrix multiplication:

$$\begin{pmatrix} c_{11} & c_{12}\\ c_{21} & c_{22}\\ \vdots & \vdots\\ c_{k1} & c_{k2} \end{pmatrix}\cdot\begin{pmatrix} \cos\theta & -\sin\theta\\ \sin\theta & \cos\theta \end{pmatrix} = \begin{pmatrix} c_{11}^{\prime} & c_{12}^{\prime}\\ c_{21}^{\prime} & c_{22}^{\prime}\\ \vdots & \vdots\\ c_{k1}^{\prime} & c_{k2}^{\prime} \end{pmatrix} \tag{12.35}$$

where θ, the counterclockwise rotation angle, is obtained by the following expression:

$$\theta = 0.25\cdot\arctan\left[\frac{2\left(D - \dfrac{A\cdot B}{k}\right)}{C - \dfrac{A^{2} - B^{2}}{k}}\right] \tag{12.36}$$
where:

A = Σ(l=1→k) (c1l² - c2l²)/communalityl     (12.37)

B = Σ(l=1→k) 2·c1l·c2l/communalityl     (12.38)
398
PART
V Multivariate Exploratory Data Analysis
C = Σ(l=1→k) { [(c1l² - c2l²)/communalityl]² - [2·c1l·c2l/communalityl]² }     (12.39)

D = Σ(l=1→k) [(c1l² - c2l²)/communalityl]·[2·c1l·c2l/communalityl]     (12.40)
In Section 12.2.6, we will use these Varimax rotation expressions to determine the rotated factor loadings from the original loadings. Besides Varimax, we can also mention other orthogonal rotation methods, such as Quartimax and Equamax, even though they are less frequently mentioned in the existing literature and less used in practice. In addition to them, the researcher may also use oblique rotation methods, in which nonorthogonal factors are generated; although they are not discussed in this chapter, we should mention the Direct Oblimin and Promax methods in this category. While oblique rotation methods can sometimes be used when we wish to validate a certain construct, we recommend an orthogonal rotation method whenever the extracted factors are to be used later in other multivariate techniques, such as certain confirmatory models in which the absence of multicollinearity among the explanatory variables is a mandatory premise, since orthogonal rotation keeps the factors uncorrelated.
12.2.6 A Practical Example of the Principal Component Factor Analysis
Imagine that the same professor, deeply engaged in academic and pedagogical activities, is now interested in studying how his students’ grades behave so that, afterwards, he can propose the creation of a school performance ranking. In order to do that, he collected information on the final grades, which vary from 0 to 10, of each one of his 100 students in the following subjects: Finance, Costs, Marketing, and Actuarial Science. Part of the dataset can be seen in Table 12.6. The complete dataset can be found in the file FactorGrades.xls. Through this dataset, it is possible to construct Table 12.7, which shows Pearson’s correlation coefficients between each pair of variables, calculated by using the logic presented in Expression (12.2).
TABLE 12.6 Example: Final Grades in Finance, Costs, Marketing, and Actuarial Science

Student       Final Grade in    Final Grade in  Final Grade in     Final Grade in
              Finance (X1i)     Costs (X2i)     Marketing (X3i)    Actuarial Science (X4i)
Gabriela      5.8               4.0             1.0                6.0
Luiz Felipe   3.1               3.0             10.0               2.0
Patricia      3.1               4.0             4.0                4.0
Gustavo       10.0              8.0             8.0                8.0
Leticia       3.4               2.0             3.2                3.2
Ovidio        10.0              10.0            1.0                10.0
Leonor        5.0               5.0             8.0                5.0
Dalila        5.4               6.0             6.0                6.0
Antonio       5.9               4.0             4.0                4.0
⋮
Estela        8.9               5.0             2.0                8.0
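The Pearson coefficients shown in Table 12.7 follow the logic of Expression (12.2). As a quick illustration, the coefficient can be computed in a few lines of pure Python (a sketch for checking purposes only; the chapter itself works with Excel, SPSS, and Stata). The five grade pairs below are the first rows of Table 12.6, so the resulting coefficient applies to this subsample, not to the full 100-student sample:

```python
import math

def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return sxy / (sx * sy)

# First five students of Table 12.6 (illustration only; the full sample has 100)
finance = [5.8, 3.1, 3.1, 10.0, 3.4]
costs = [4.0, 3.0, 4.0, 8.0, 2.0]
print(round(pearson_r(finance, costs), 3))  # ≈ 0.922 for this subsample
```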
Principal Component Factor Analysis Chapter
12
399
TABLE 12.7 Pearson's Correlation Coefficients for Each Pair of Variables

                    finance  costs   marketing  actuarial science
finance             1.000    0.756   -0.030     0.711
costs               0.756    1.000   0.003      0.809
marketing           -0.030   0.003   1.000      -0.044
actuarial science   0.711    0.809   -0.044     1.000
Therefore, we can write the expression of the correlation matrix ρ as follows:

    | 1     r12   r13   r14 |   | 1.000   0.756   -0.030  0.711  |
ρ = | r21   1     r23   r24 | = | 0.756   1.000   0.003   0.809  |
    | r31   r32   1     r34 |   | -0.030  0.003   1.000   -0.044 |
    | r41   r42   r43   1   |   | 0.711   0.809   -0.044  1.000  |
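The determinant D of ρ reported next can be reproduced with a small pure-Python Gaussian elimination (a dependency-free sketch using the rounded coefficients above, so the result matches only to about three decimal places):

```python
# Determinant of the 4x4 correlation matrix via Gaussian elimination with
# partial pivoting, reproducing D = 0.137 (rounded coefficients from Table 12.7).
rho = [
    [1.000, 0.756, -0.030, 0.711],
    [0.756, 1.000, 0.003, 0.809],
    [-0.030, 0.003, 1.000, -0.044],
    [0.711, 0.809, -0.044, 1.000],
]

def det(m):
    a = [row[:] for row in m]  # work on a copy
    n = len(a)
    d = 1.0
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(a[r][i]))  # pivot row
        if p != i:
            a[i], a[p] = a[p], a[i]
            d = -d
        d *= a[i][i]
        for r in range(i + 1, n):
            f = a[r][i] / a[i][i]
            for c in range(i, n):
                a[r][c] -= f * a[i][c]
    return d

print(round(det(rho), 3))  # ≈ 0.137
```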
which has determinant D = 0.137. By analyzing correlation matrix ρ, it is possible to verify that only the grades corresponding to the variable marketing do not have relevant correlations with the grades in the other subjects. The other three, on the other hand, show relatively high correlations with one another (0.756 between finance and costs, 0.711 between finance and actuarial science, and 0.809 between costs and actuarial science), which indicates that they may share significant variance to form one factor. Although this preliminary analysis is important, it cannot represent more than a simple diagnostic, since the overall adequacy of the factor analysis needs to be evaluated based on the KMO statistic and, mainly, on the result of Bartlett's test of sphericity. As we discussed in Section 12.2.2, the KMO statistic provides the proportion of variance considered common to all the variables present in the analysis and, in order to establish its calculation, we need to determine the partial correlation coefficients φ between each pair of variables. In this case, these will be second-order correlation coefficients, since we are working with four variables simultaneously. Consequently, based on Expression (12.7), we first need to determine the first-order correlation coefficients used to calculate the second-order ones. Table 12.8 shows these coefficients. Then, from these coefficients and by using Expression (12.8), we can calculate the second-order correlation coefficients considered in the KMO statistic's expression. Table 12.9 shows these coefficients.

TABLE 12.8 First-Order Correlation Coefficients

φ12,3 = (r12 - r13·r23)/√[(1 - r13²)·(1 - r23²)] = 0.756
φ13,2 = (r13 - r12·r23)/√[(1 - r12²)·(1 - r23²)] = -0.049
φ14,2 = (r14 - r12·r24)/√[(1 - r12²)·(1 - r24²)] = 0.258
φ14,3 = (r14 - r13·r34)/√[(1 - r13²)·(1 - r34²)] = 0.711
φ23,1 = (r23 - r12·r13)/√[(1 - r12²)·(1 - r13²)] = 0.039
φ24,1 = (r24 - r12·r14)/√[(1 - r12²)·(1 - r14²)] = 0.590
φ24,3 = (r24 - r23·r34)/√[(1 - r23²)·(1 - r34²)] = 0.810
φ34,1 = (r34 - r13·r14)/√[(1 - r13²)·(1 - r14²)] = -0.033
φ34,2 = (r34 - r23·r24)/√[(1 - r23²)·(1 - r24²)] = -0.080
TABLE 12.9 Second-Order Correlation Coefficients

φ12,34 = (φ12,3 - φ14,3·φ24,3)/√[(1 - φ14,3²)·(1 - φ24,3²)] = 0.438
φ13,24 = (φ13,2 - φ14,2·φ34,2)/√[(1 - φ14,2²)·(1 - φ34,2²)] = -0.029
φ14,23 = (φ14,2 - φ13,2·φ34,2)/√[(1 - φ13,2²)·(1 - φ34,2²)] = 0.255
φ23,14 = (φ23,1 - φ24,1·φ34,1)/√[(1 - φ24,1²)·(1 - φ34,1²)] = 0.072
φ24,13 = (φ24,1 - φ23,1·φ34,1)/√[(1 - φ23,1²)·(1 - φ34,1²)] = 0.592
φ34,12 = (φ34,1 - φ23,1·φ24,1)/√[(1 - φ23,1²)·(1 - φ24,1²)] = -0.069
So, based on Expression (12.3), we can calculate the KMO statistic. The terms of the expression are given by:

Σl Σc r²lc = (0.756)² + (-0.030)² + (0.711)² + (0.003)² + (0.809)² + (-0.044)² = 1.734

Σl Σc φ²lc = (0.438)² + (-0.029)² + (0.255)² + (0.072)² + (0.592)² + (-0.069)² = 0.619

from where we obtain:

KMO = 1.734 / (1.734 + 0.619) = 0.737
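This KMO computation can be reproduced with a minimal script, using the six zero-order correlations and the six second-order partial correlations as printed in Tables 12.7 and 12.9:

```python
# KMO statistic (Expression (12.3)): sum of squared zero-order correlations
# divided by that sum plus the sum of squared second-order partial correlations.
r_vals = [0.756, -0.030, 0.711, 0.003, 0.809, -0.044]
phi_vals = [0.438, -0.029, 0.255, 0.072, 0.592, -0.069]

sum_r2 = sum(v ** 2 for v in r_vals)      # = 1.734
sum_phi2 = sum(v ** 2 for v in phi_vals)  # = 0.619
kmo = sum_r2 / (sum_r2 + sum_phi2)
print(round(kmo, 3))  # 0.737
```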
Based on the criterion presented in Table 12.2, the value of the KMO statistic suggests that the overall adequacy of the factor analysis is middling. To test whether correlation matrix ρ is, in fact, statistically different from an identity matrix I of the same dimension, we must use Bartlett's test of sphericity, whose χ²Bartlett statistic is given by Expression (12.9). For n = 100 observations, k = 4 variables, and correlation matrix determinant D = 0.137, we have:

χ²Bartlett = -[(100 - 1) - (2·4 + 5)/6]·ln(0.137) = 192.335

with 4·(4 - 1)/2 = 6 degrees of freedom. Therefore, by using Table D in the Appendix, we have χ²c = 12.592 (critical χ² for 6 degrees of freedom and a significance level of 0.05). Thus, since χ²Bartlett = 192.335 > χ²c = 12.592, we can reject the null hypothesis that correlation matrix ρ is statistically equal to identity matrix I, at a significance level of 0.05. Software packages like SPSS and Stata do not offer the χ²c for the defined degrees of freedom and a certain significance level. However, they offer the significance level of χ²Bartlett for these degrees of freedom. So, instead of verifying whether χ²Bartlett > χ²c, we must verify whether the significance level of χ²Bartlett is less than 0.05 (5%) so that we can continue performing the factor analysis. Thus:

If P-value (either Sig. χ²Bartlett or Prob. χ²Bartlett) < 0.05, correlation matrix ρ is not statistically equal to an identity matrix I of the same dimension.

The significance level of χ²Bartlett can be obtained in Excel by using the command Formulas → Insert Function → CHIDIST, which will open a dialog box, as shown in Fig. 12.7. As we can see in Fig. 12.7, the P-value of the χ²Bartlett statistic is considerably less than 0.05 (P-value = 8.11 × 10⁻³⁹), that is, Pearson's correlations between the pairs of variables are statistically different from 0; therefore, factors can be extracted from the original variables, and the factor analysis is quite adequate.
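The χ²Bartlett statistic itself is a one-line computation (a sketch using the rounded determinant D = 0.137, so the value differs from the text's 192.335 only in the second decimal place):

```python
import math

# Bartlett's test of sphericity (Expression (12.9)):
# chi2 = -[(n - 1) - (2k + 5)/6] * ln(D), with D = determinant of rho.
n, k, D = 100, 4, 0.137
chi2_bartlett = -((n - 1) - (2 * k + 5) / 6) * math.log(D)
df = k * (k - 1) // 2           # 6 degrees of freedom
chi2_critical = 12.592          # 5% critical value for 6 df (Table D)
print(round(chi2_bartlett, 2), chi2_bartlett > chi2_critical)
```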
FIG. 12.7 Obtaining the significance level of χ² (command Insert Function).
Having verified the factor analysis's overall adequacy, we can move on to the definition of the factors. In order to do that, we must initially determine the four eigenvalues λ² (λ1² ≥ λ2² ≥ λ3² ≥ λ4²) of correlation matrix ρ, which can be obtained by solving Expression (12.12). Therefore, we have:

| 1 - λ²   0.756    -0.030   0.711  |
| 0.756    1 - λ²   0.003    0.809  |
| -0.030   0.003    1 - λ²   -0.044 | = 0
| 0.711    0.809    -0.044   1 - λ² |

from where we obtain:

λ1² = 2.519
λ2² = 1.000
λ3² = 0.298
λ4² = 0.183
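With no linear-algebra library at hand, the largest of these eigenvalues can be recovered by power iteration on ρ (a pure-Python sketch; the remaining eigenvalues would follow by deflation, which is omitted here):

```python
import math

# Power iteration on rho: a dependency-free way to recover the largest
# eigenvalue, lambda_1^2 ≈ 2.519. Since rho is symmetric positive semidefinite,
# the Euclidean norm of rho*v converges to the dominant eigenvalue.
rho = [
    [1.000, 0.756, -0.030, 0.711],
    [0.756, 1.000, 0.003, 0.809],
    [-0.030, 0.003, 1.000, -0.044],
    [0.711, 0.809, -0.044, 1.000],
]

def largest_eigenvalue(m, iters=200):
    v = [1.0] * len(m)
    lam = 0.0
    for _ in range(iters):
        w = [sum(row[j] * v[j] for j in range(len(v))) for row in m]
        lam = math.sqrt(sum(x * x for x in w))  # norm of rho*v
        v = [x / lam for x in w]                # renormalize
    return lam

print(round(largest_eigenvalue(rho), 3))  # ≈ 2.519
```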
Consequently, based on Expression (12.15), eigenvalue matrix Λ² can be written as follows:

     | 2.519  0      0      0     |
Λ² = | 0      1.000  0      0     |
     | 0      0      0.298  0     |
     | 0      0      0      0.183 |

Note that Expression (12.13) is satisfied, that is:

λ1² + λ2² + ⋯ + λk² = 2.519 + 1.000 + 0.298 + 0.183 = 4

Since the eigenvalues correspond to the proportion of variance shared by the original variables to form each factor, we can construct a shared variance table (Table 12.10). By analyzing Table 12.10, we can say that while 62.975% of the total variance is shared to form the first factor, 25.010% is shared to form the second factor. The third and fourth factors, whose eigenvalues are less than 1, are formed through smaller proportions of shared variance. Since the most common criterion used to choose the number of factors is the latent root criterion (Kaiser criterion), in which only the factors that correspond to eigenvalues greater than 1 are taken into consideration, the researcher can choose to conduct all the subsequent analysis with only the first two factors, formed by the sharing of 87.985% of the total variance of the original variables, that is, with a total variance loss of 12.015%. Nonetheless, for pedagogical purposes, let's discuss how to calculate the factor scores by determining the eigenvectors that correspond to the four eigenvalues. Consequently, in order to define the eigenvectors of matrix ρ based on the four eigenvalues calculated, we must solve the following equation systems for each eigenvalue, based on Expressions (12.16)–(12.21):

Determining eigenvectors v11, v21, v31, v41 from the first eigenvalue (λ1² = 2.519):

(1 - 2.519)·v11 + 0.756·v21 - 0.030·v31 + 0.711·v41 = 0
0.756·v11 + (1 - 2.519)·v21 + 0.003·v31 + 0.809·v41 = 0
-0.030·v11 + 0.003·v21 + (1 - 2.519)·v31 - 0.044·v41 = 0
0.711·v11 + 0.809·v21 - 0.044·v31 + (1 - 2.519)·v41 = 0
TABLE 12.10 Variance Shared by the Original Variables to Form Each Factor

Factor  Eigenvalue λ²  Shared Variance (%)       Cumulative Shared Variance (%)
1       2.519          2.519/4 × 100 = 62.975    62.975
2       1.000          1.000/4 × 100 = 25.010    87.985
3       0.298          0.298/4 × 100 = 7.444     95.428
4       0.183          0.183/4 × 100 = 4.572     100.000
from where we obtain:

(v11, v21, v31, v41)′ = (0.5641, 0.5887, -0.0267, 0.5783)′
Determining eigenvectors v12, v22, v32, v42 from the second eigenvalue (λ2² = 1.000):

(1 - 1.000)·v12 + 0.756·v22 - 0.030·v32 + 0.711·v42 = 0
0.756·v12 + (1 - 1.000)·v22 + 0.003·v32 + 0.809·v42 = 0
-0.030·v12 + 0.003·v22 + (1 - 1.000)·v32 - 0.044·v42 = 0
0.711·v12 + 0.809·v22 - 0.044·v32 + (1 - 1.000)·v42 = 0

from where we obtain:

(v12, v22, v32, v42)′ = (0.0068, 0.0487, 0.9987, -0.0101)′
Determining eigenvectors v13, v23, v33, v43 from the third eigenvalue (λ3² = 0.298):

(1 - 0.298)·v13 + 0.756·v23 - 0.030·v33 + 0.711·v43 = 0
0.756·v13 + (1 - 0.298)·v23 + 0.003·v33 + 0.809·v43 = 0
-0.030·v13 + 0.003·v23 + (1 - 0.298)·v33 - 0.044·v43 = 0
0.711·v13 + 0.809·v23 - 0.044·v33 + (1 - 0.298)·v43 = 0

from where we obtain:

(v13, v23, v33, v43)′ = (0.8008, -0.2201, -0.0003, -0.5571)′
Determining eigenvectors v14, v24, v34, v44 from the fourth eigenvalue (λ4² = 0.183):

(1 - 0.183)·v14 + 0.756·v24 - 0.030·v34 + 0.711·v44 = 0
0.756·v14 + (1 - 0.183)·v24 + 0.003·v34 + 0.809·v44 = 0
-0.030·v14 + 0.003·v24 + (1 - 0.183)·v34 - 0.044·v44 = 0
0.711·v14 + 0.809·v24 - 0.044·v34 + (1 - 0.183)·v44 = 0

from where we obtain:

(v14, v24, v34, v44)′ = (0.2012, -0.7763, 0.0425, 0.5959)′
After having determined the eigenvectors, a more inquisitive researcher may verify the relationship presented in Expression (12.27), that is, V′·ρ·V = Λ². With the eigenvectors arranged as the columns of V:

    | 0.5641  0.0068   0.8008   0.2012  |
V = | 0.5887  0.0487   -0.2201  -0.7763 |
    | -0.0267 0.9987   -0.0003  0.0425  |
    | 0.5783  -0.0101  -0.5571  0.5959  |

the product V′·ρ·V indeed returns Λ² = diag(2.519, 1.000, 0.298, 0.183). Based on Expressions (12.22)–(12.24), we can calculate the factor scores that correspond to each of the standardized variables for each of the factors. Thus, from Expression (12.25), we are able to write the expressions for factors F1, F2, F3, and F4, as follows:

F1i = (0.5641/√2.519)·Zfinancei + (0.5887/√2.519)·Zcostsi - (0.0267/√2.519)·Zmarketingi + (0.5783/√2.519)·Zactuariali
F2i = (0.0068/√1.000)·Zfinancei + (0.0487/√1.000)·Zcostsi + (0.9987/√1.000)·Zmarketingi - (0.0101/√1.000)·Zactuariali
F3i = (0.8008/√0.298)·Zfinancei - (0.2201/√0.298)·Zcostsi - (0.0003/√0.298)·Zmarketingi - (0.5571/√0.298)·Zactuariali
F4i = (0.2012/√0.183)·Zfinancei - (0.7763/√0.183)·Zcostsi + (0.0425/√0.183)·Zmarketingi + (0.5959/√0.183)·Zactuariali

from where we obtain:

F1i = 0.355·Zfinancei + 0.371·Zcostsi - 0.017·Zmarketingi + 0.364·Zactuariali
F2i = 0.007·Zfinancei + 0.049·Zcostsi + 0.999·Zmarketingi - 0.010·Zactuariali
F3i = 1.468·Zfinancei - 0.403·Zcostsi - 0.001·Zmarketingi - 1.021·Zactuariali
F4i = 0.470·Zfinancei - 1.815·Zcostsi + 0.099·Zmarketingi + 1.394·Zactuariali

Based on the factor expressions and on the standardized variables, we can calculate the values corresponding to each factor for each observation. Table 12.11 shows these results for part of the dataset. For the first observation in the sample (Gabriela), for example, we can see that:

F1Gabriela = 0.355·(0.011) + 0.371·(-0.290) - 0.017·(-1.650) + 0.364·(0.273) = 0.016
F2Gabriela = 0.007·(0.011) + 0.049·(-0.290) + 0.999·(-1.650) - 0.010·(0.273) = -1.665
F3Gabriela = 1.468·(0.011) - 0.403·(-0.290) - 0.001·(-1.650) - 1.021·(0.273) = -0.176
F4Gabriela = 0.470·(0.011) - 1.815·(-0.290) + 0.099·(-1.650) + 1.394·(0.273) = 0.739

It is important to emphasize that all the factors extracted have Pearson correlations of 0 between themselves, that is, they are orthogonal to one another. A more inquisitive researcher may also verify that the factor scores corresponding to each factor are exactly the estimated parameters of a multiple linear regression model that has, as the dependent variable, the factor itself and, as explanatory variables, the standardized variables. Having established the factors, we can define the factor loadings, which correspond to Pearson's correlation coefficients between the original variables and each of the factors. Table 12.12 shows the factor loadings for the data in our example. For each original variable, the highest factor loading is highlighted in Table 12.12. Consequently, while the variables finance, costs, and actuarial science show stronger correlations with the first factor, only the variable marketing shows a stronger correlation with the second factor. This proves the need for a second factor in order for all the
TABLE 12.11 Calculation of the Factors for Each Observation

Student       Zfinancei  Zcostsi  Zmarketingi  Zactuariali  F1i     F2i     F3i     F4i
Gabriela      0.011      -0.290   -1.650       0.273        0.016   -1.665  -0.176  0.739
Luiz Felipe   -0.876     -0.697   1.532        -1.319       -1.076  1.503   0.342   -0.831
Patricia      -0.876     -0.290   -0.590       -0.523       -0.600  -0.603  -0.634  -0.672
Gustavo       1.334      1.337    0.825        1.069        1.346   0.887   0.327   -0.228
Leticia       -0.779     -1.104   -0.872       -0.841       -0.978  -0.922  0.161   0.379
Ovidio        1.334      2.150    -1.650       1.865        1.979   -1.553  -0.812  -0.841
Leonor        -0.267     0.116    0.825        -0.125       -0.111  0.829   -0.312  -0.429
Dalila        -0.139     0.523    0.118        0.273        0.242   0.139   -0.694  -0.623
Antonio       0.021      -0.290   -0.590       -0.523       -0.281  -0.597  0.682   -0.250
⋮
Estela        0.982      0.113    -1.297       1.069        0.802   -1.293  0.305   1.616
Mean          0.000      0.000    0.000        0.000        0.000   0.000   0.000   0.000
Standard
deviation     1.000      1.000    1.000        1.000        1.000   1.000   1.000   1.000
TABLE 12.12 Factor Loadings (Pearson's Correlation Coefficients) Between Variables and Factors

                    Factor
Variable            F1      F2      F3      F4
finance             0.895   0.007   0.437   0.086
costs               0.934   0.049   -0.120  -0.332
marketing           -0.042  0.999   0.000   0.018
actuarial science   0.918   -0.010  -0.304  0.255
variables to share significant proportions of variance. The third and fourth factors, however, present relatively low correlations with the original variables, which explains the fact that their respective eigenvalues are less than 1. If the variable marketing had not been inserted into the analysis, only the first factor would be necessary to explain the joint behavior of the other variables, and the remaining factors would also have eigenvalues less than 1. Therefore, as discussed in Section 12.2.4, we can verify that the factor loadings on factors corresponding to eigenvalues less than 1 are relatively low, since the variables already show stronger Pearson correlations with the factors previously extracted from greater eigenvalues. Based on Expression (12.30), the sum of the squared factor loadings in each column of Table 12.12 equals the respective eigenvalue which, as discussed before, can be understood as the proportion of variance shared by the four original variables to form each factor. Therefore, we have:

(0.895)² + (0.934)² + (-0.042)² + (0.918)² = 2.519
(0.007)² + (0.049)² + (0.999)² + (-0.010)² = 1.000
(0.437)² + (-0.120)² + (0.000)² + (-0.304)² = 0.298
(0.086)² + (-0.332)² + (0.018)² + (0.255)² = 0.183

from which we can see that the second eigenvalue only reached the value 1 due to the high factor loading of the variable marketing.
Furthermore, from the factor loadings presented in Table 12.12, we can also calculate the communalities, which represent the total shared variance of each variable in all the factors extracted from eigenvalues greater than 1. So, based on Expression (12.29), we can write:

communality_finance = (0.895)² + (0.007)² = 0.802
communality_costs = (0.934)² + (0.049)² = 0.875
communality_marketing = (-0.042)² + (0.999)² = 1.000
communality_actuarial = (0.918)² + (-0.010)² = 0.843

Consequently, even though the variable marketing is the only one with a high factor loading on the second factor, it is the variable that loses the lowest proportion of variance to form both factors. On the other hand, the variable finance is the one that presents the highest loss of variance to form these two factors (around 19.8%). If we had considered the factor loadings of all four factors, all the communalities would certainly be equal to 1. As we discussed in Section 12.2.4, the factor loadings are exactly the parameters estimated in a multiple linear regression model that has, as the dependent variable, a certain standardized variable and, as explanatory variables, the factors themselves; the coefficient of determination R² of each model is equal to the communality of the respective original variable. Therefore, for the first two factors, we can construct a chart in which the factor loadings of each variable are plotted on each of the orthogonal axes that represent factors F1 and F2, respectively. This chart, known as a loading plot, can be seen in Fig. 12.8. By analyzing the loading plot, the behavior of the correlations becomes clear: while the variables finance, costs, and actuarial science show high correlation with the first factor (X-axis), the variable marketing shows strong correlation with the second factor (Y-axis).
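These communalities can be checked mechanically (a sketch using the two retained factors' loadings from Table 12.12; with the rounded loadings, finance yields 0.801 instead of the text's 0.802):

```python
# Communalities as row sums of squared loadings over the two retained factors
# (Expression (12.29)); loadings as printed in Table 12.12.
loadings = {
    "finance": (0.895, 0.007),
    "costs": (0.934, 0.049),
    "marketing": (-0.042, 0.999),
    "actuarial": (0.918, -0.010),
}
communality = {v: round(c1 ** 2 + c2 ** 2, 3) for v, (c1, c2) in loadings.items()}
print(communality)
```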
More inquisitive researchers may investigate the reasons why this phenomenon occurs: sometimes, while the subjects Finance, Costs, and Actuarial Science are taught in a more quantitative way, Marketing is taught in a more qualitative and behavioral manner. However, it is important to mention that the definition of factors does not force researchers to name them because, normally, this is not a simple task. Factor analysis does not have "naming factors" as one of its goals and, in case we intend to do that, we need vast knowledge about the phenomenon being studied; confirmatory techniques can help in this endeavor. At this moment, we can consider the preparation of the principal component factor analysis concluded. Nevertheless, as discussed in Section 12.2.5, if researchers wish to obtain a clearer visualization of the variables best represented by a certain factor, they can elaborate a rotation using the Varimax orthogonal method, which maximizes the loading of each variable on a certain factor. In our example, since we already have an excellent idea of which variables have high loadings on each factor, and the loading plot (Fig. 12.8) is already very clear, the rotation may be considered unnecessary; it will be elaborated here only for pedagogical purposes, since researchers may sometimes find themselves in situations in which this phenomenon is not so clear. Consequently, based on the factor loadings for the first two factors (first two columns of Table 12.12), we will obtain the rotated factor loadings c′ after rotating both factors by an angle θ. Thus, based on Expression (12.35), we can write:
FIG. 12.8 Loading plot. The variables finance, costs, and actuarial science plot close to the F1 (horizontal) axis, with loadings near 0.9, while marketing plots close to the F2 (vertical) axis, with a loading near 1.
| 0.895   0.007  |                       | c′11  c′12 |
| 0.934   0.049  |   | cos θ   -sin θ |  | c′21  c′22 |
| -0.042  0.999  | · | sin θ    cos θ | =| c′31  c′32 |
| 0.918   -0.010 |                       | c′41  c′42 |
where the counterclockwise rotation angle θ is obtained from Expression (12.36). Before that, however, we must determine the values of the terms A, B, C, and D present in Expressions (12.37)–(12.40); constructing Tables 12.13–12.16 helps us with this task. So, taking the k = 4 variables into consideration and based on Expression (12.36), we can calculate the counterclockwise rotation angle θ as follows:

θ = 0.25·arctan{ 2·[D - (A·B)/k] / [C - (A² - B²)/k] }
  = 0.25·arctan{ 2·[0.181 - (1.998·0.012)/4] / [3.963 - (1.998² - 0.012²)/4] } = 0.029 rad
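The angle computation can be replicated directly from the four sums (a sketch using the printed values of A, B, C, and D):

```python
import math

# Varimax rotation angle, Expression (12.36), from the sums A, B, C, D
# obtained in Tables 12.13-12.16 (k = 4 variables).
k = 4
A, B, C, D = 1.998, 0.012, 3.963, 0.181
theta = 0.25 * math.atan2(2 * (D - A * B / k), C - (A ** 2 - B ** 2) / k)
print(round(theta, 3), "rad")  # 0.029 rad
```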
TABLE 12.13 Obtaining Term A to Calculate Rotation Angle θ

Variable            c1      c2      communality  (c1l² - c2l²)/communalityl
finance             0.895   0.007   0.802        1.000
costs               0.934   0.049   0.875        0.995
marketing           -0.042  0.999   1.000        -0.996
actuarial science   0.918   -0.010  0.843        1.000
                                    A (sum)      1.998
TABLE 12.14 Obtaining Term B to Calculate Rotation Angle θ

Variable            c1      c2      communality  2·c1l·c2l/communalityl
finance             0.895   0.007   0.802        0.015
costs               0.934   0.049   0.875        0.104
marketing           -0.042  0.999   1.000        -0.085
actuarial science   0.918   -0.010  0.843        -0.022
                                    B (sum)      0.012
TABLE 12.15 Obtaining Term C to Calculate Rotation Angle θ

Variable            c1      c2      communality  [(c1l² - c2l²)/communalityl]² - [2·c1l·c2l/communalityl]²
finance             0.895   0.007   0.802        1.000
costs               0.934   0.049   0.875        0.978
marketing           -0.042  0.999   1.000        0.986
actuarial science   0.918   -0.010  0.843        0.999
                                    C (sum)      3.963
TABLE 12.16 Obtaining Term D to Calculate Rotation Angle θ

Variable            c1      c2      communality  [(c1l² - c2l²)/communalityl]·[2·c1l·c2l/communalityl]
finance             0.895   0.007   0.802        0.015
costs               0.934   0.049   0.875        0.103
marketing           -0.042  0.999   1.000        0.084
actuarial science   0.918   -0.010  0.843        -0.022
                                    D (sum)      0.181
And, finally, we can calculate the rotated factor loadings:

| 0.895   0.007  |                               | c′11  c′12 |   | 0.895   -0.019 |
| 0.934   0.049  |   | cos 0.029   -sin 0.029 |  | c′21  c′22 |   | 0.935   0.021  |
| -0.042  0.999  | · | sin 0.029    cos 0.029 | =| c′31  c′32 | = | -0.013  1.000  |
| 0.918   -0.010 |                               | c′41  c′42 |   | 0.917   -0.037 |
Table 12.17 shows, in a consolidated way, the rotated factor loadings obtained through the Varimax method for the data in our example. As we have already mentioned, even though the results without rotation already showed which variables presented high loadings on each factor, the rotation ended up redistributing, even if slightly for the data in our example, the variable loadings between the two rotated factors. A new loading plot (now with the rotated loadings) also demonstrates this situation (Fig. 12.9).
TABLE 12.17 Rotated Factor Loadings Through the Varimax Method

                    Factor
Variable            F1      F2
finance             0.895   -0.019
costs               0.935   0.021
marketing           -0.013  1.000
actuarial science   0.917   -0.037
FIG. 12.9 Loading plot with rotated loadings. The variables occupy practically the same positions as in Fig. 12.8, since the rotation angle is very small.
Even though the plots in Figs. 12.8 and 12.9 are very similar, since rotation angle θ is very small in this example, it is common for researchers to find situations in which the rotation contributes considerably to an easier understanding of the loadings, which can, consequently, simplify the interpretation of the factors. It is important to emphasize that the rotation does not change the communalities, that is, Expression (12.31) can be verified:

communality_finance = (0.895)² + (-0.019)² = 0.802
communality_costs = (0.935)² + (0.021)² = 0.875
communality_marketing = (-0.013)² + (1.000)² = 1.000
communality_actuarial = (0.917)² + (-0.037)² = 0.843

Nonetheless, the rotation changes the eigenvalues corresponding to each factor. Thus, for the two rotated factors, we have:

(0.895)² + (0.935)² + (-0.013)² + (0.917)² = λ′1² = 2.518
(-0.019)² + (0.021)² + (1.000)² + (-0.037)² = λ′2² = 1.002

Table 12.18 shows, based on the new eigenvalues λ′1² and λ′2², the proportions of variance shared by the original variables to form both rotated factors. In comparison to Table 12.10, we can see that, even though there is no change in the sharing of 87.985% of the total variance of the original variables to form the rotated factors, the rotation redistributes the variance shared by the variables in each factor. As we have already discussed, the factor loadings correspond to the parameters estimated in a multiple linear regression model that has, as the dependent variable, a certain standardized variable and, as explanatory variables, the factors themselves. Therefore, through algebraic operations, we can arrive at the factor score expressions from the loadings, since the scores represent the estimated parameters of the respective regression models that have, as dependent variables, the factors and, as explanatory variables, the standardized variables. Consequently, from the rotated factor loadings (Table 12.17), we arrive at the following expressions for the rotated factors F′1 and F′2:

F′1i = 0.355·Zfinancei + 0.372·Zcostsi + 0.012·Zmarketingi + 0.364·Zactuariali
F′2i = -0.004·Zfinancei + 0.038·Zcostsi + 0.999·Zmarketingi - 0.021·Zactuariali
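The invariance of the communalities under rotation, stated in Expression (12.31), can be verified numerically for any angle (a sketch using θ = 0.029 rad and the two-factor loadings of Table 12.12):

```python
import math

# Rotating the two-factor loadings by theta = 0.029 rad and checking that each
# variable's communality (row sum of squares) is unchanged -- Expression (12.31).
theta = 0.029
c = [(0.895, 0.007), (0.934, 0.049), (-0.042, 0.999), (0.918, -0.010)]
rotated = [
    (c1 * math.cos(theta) + c2 * math.sin(theta),
     -c1 * math.sin(theta) + c2 * math.cos(theta))
    for c1, c2 in c
]
before = [round(c1 ** 2 + c2 ** 2, 3) for c1, c2 in c]
after = [round(r1 ** 2 + r2 ** 2, 3) for r1, r2 in rotated]
print(before == after)  # True: rotation preserves communalities
```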
Finally, the professor wishes to develop a school performance ranking of his students. Since the two rotated factors, F′1 and F′2, are formed by the highest proportions of variance shared by the original variables (in this case, 62.942% and 25.043% of the total variance, respectively, as shown in Table 12.18) and correspond to eigenvalues greater than 1, they will be used to create the desired school performance ranking. A well-accepted criterion used to create rankings from factors is known as the weighted rank-sum criterion: for each observation, the values of all the factors obtained (those with eigenvalues greater than 1), weighted by the respective proportions of shared variance, are added, and the observations are then ranked based on the results. This criterion is well accepted because it considers the performance of all the original variables; considering only the first factor (the principal factor criterion) could ignore a positive performance obtained in a certain variable that shares a considerable proportion of variance with the second factor. For 10 students chosen from the sample, Table 12.19 shows the school performance ranking created after summing the values of the factors weighted by the respective proportions of shared variance. The complete ranking can be found in the file FactorGradesRanking.xls. It is essential to highlight that the creation of performance rankings from original variables is considered to be a static procedure, since the inclusion of new observations or variables may alter the factor scores, which makes the preparation of a
TABLE 12.18 Variance Shared by the Original Variables to Form Both Rotated Factors

Factor  Eigenvalue λ′²  Shared Variance (%)       Cumulative Shared Variance (%)
1       2.518           2.518/4 × 100 = 62.942    62.942
2       1.002           1.002/4 × 100 = 25.043    87.985
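The weighted rank-sum scores reported in Table 12.19 can be reproduced directly (a sketch for two students, using the shared-variance weights of Table 12.18):

```python
# Weighted rank-sum criterion: each student's score adds the rotated factors
# weighted by their proportions of shared variance (Table 12.18).
w1, w2 = 0.62942, 0.25043
students = {             # (F'1, F'2) for two students of Table 12.19
    "Adelino": (1.959, 1.568),
    "Renata": (1.709, 1.570),
}
scores = {s: round(w1 * f1 + w2 * f2, 3) for s, (f1, f2) in students.items()}
print(scores)  # {'Adelino': 1.626, 'Renata': 1.469}
```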
TABLE 12.19 School Performance Ranking Through the Weighted Rank-Sum Criterion

Student      Zfinancei  Zcostsi  Zmarketingi  Zactuariali  F′1i    F′2i    (F′1i × 0.62942) + (F′2i × 0.25043)  Ranking
Adelino      1.30       2.15     1.53         1.86         1.959   1.568   1.626                                 1
Renata       0.60       2.15     1.53         1.86         1.709   1.570   1.469                                 2
⋮
Ovidio       1.33       2.15     -1.65        1.86         1.932   -1.611  0.813                                 13
Kamal        1.33       2.07     -1.65        1.86         1.902   -1.614  0.793                                 14
⋮
Itamar       1.29       0.55     -1.53        1.04         1.022   -1.536  0.259                                 57
Luiz Felipe  -0.88      -0.70    1.53         -1.32        -1.032  1.535   -0.265                                58
⋮
Gabriela     0.01       -0.29    -1.65        0.27         -0.032  -1.665  -0.437                                73
Marina       0.50       -0.50    -0.94        -1.16        -0.443  -0.939  -0.514                                74
⋮
Viviane      -1.64      -1.16    -1.01        -1.00        -1.390  -1.029  -1.133                                99
Gilmar       -1.52      -1.16    -1.40        -1.44        -1.512  -1.409  -1.304                                100
new factor analysis mandatory. As time goes by, the evolution of the phenomena represented by the variables may change the correlation matrix, which makes it necessary to reapply the technique in order to generate new factors obtained from more precise and updated scores. Here, therefore, we express a criticism of socioeconomic indexes that use previously established static scores for each variable when calculating the factor used to define a ranking, in situations in which new observations are constantly included and, more than this, in which there is an evolution throughout time that changes the correlation matrix of the original variables in each period. Finally, it is worth mentioning that the factors extracted are quantitative variables and, therefore, other multivariate exploratory techniques, such as a cluster analysis, can be elaborated from them, depending on the researcher's objectives. Besides, each factor can also be transformed into a qualitative variable, for example, through its categorization into ranges established based on a certain criterion; from then on, a correspondence analysis could be elaborated in order to assess a possible association between the generated categories and the categories of other qualitative variables. Factors can also be used as explanatory variables of a certain phenomenon in confirmatory multivariate models, for instance, multiple regression models, since orthogonality eliminates multicollinearity problems. On the other hand, such a procedure only makes sense when we intend to elaborate a diagnostic regarding the behavior of the dependent variable, without aiming at forecasts: since new observations do not have corresponding factor values, obtaining them is only possible by including such observations in a new factor analysis, generating new factor scores, as this is an exploratory technique. Furthermore, a qualitative variable obtained through the categorization of a certain factor into ranges can also be inserted as the dependent variable of a multinomial logistic regression model, allowing researchers to evaluate the probability of each observation being in each range, due to the behavior of other explanatory variables not initially considered in the factor analysis. We would also like to highlight that this procedure has a diagnostic nature, trying to understand the behavior of the variables in the sample for the existing observations, without a predictive purpose. Next, this same example will be elaborated in the software packages SPSS and Stata. In Section 12.3, the procedures for performing the principal component factor analysis in SPSS will be presented, as well as their results. In Section 12.4, the commands for running the technique in Stata will be presented, with their respective outputs.
PART V Multivariate Exploratory Data Analysis

12.3 PRINCIPAL COMPONENT FACTOR ANALYSIS IN SPSS
In this section, we will present the step-by-step procedure for developing our example in the IBM SPSS Statistics Software. Following the logic proposed in this book, the main objective is to give researchers an opportunity to elaborate the principal component factor analysis in this software package, given how easy it is to use and how didactic its operations are. Every time we present an output, we will mention the respective result obtained when performing the algebraic solution of the technique in the previous section, so that researchers can compare them and broaden their own knowledge and understanding of the technique. The use of the images in this section has been authorized by the International Business Machines Corporation©.

Going back to the example presented in Section 12.2.6, remember that the professor is interested in creating a school performance ranking of his students based on the joint behavior of their final grades in four subjects. The data can be found in the file FactorGrades.sav and are exactly the same as the ones partially presented in Table 12.6 in Section 12.2.6. In order for the factor analysis to be elaborated, let's click on Analyze → Dimension Reduction → Factor …. A dialog box like the one shown in Fig. 12.10 will open. Next, we must insert the original variables finance, costs, marketing, and actuarial into Variables, as shown in Fig. 12.11.
FIG. 12.10 Dialog box for running a factor analysis in SPSS.
FIG. 12.11 Selecting the original variables.
Principal Component Factor Analysis Chapter 12
Different from what was discussed in the previous chapter, when developing the cluster analysis, it is important to mention that the researcher does not need to worry about the Z-scores standardization of the original variables to elaborate the factor analysis, since the correlations between the original variables and between their corresponding standardized variables are exactly the same. Even so, if researchers choose to standardize each one of the variables, they will see that the outputs are exactly the same. In Descriptives …, first, let's select the option Initial solution in Statistics …, which makes all the eigenvalues of the correlation matrix be presented in the outputs, even the ones that are less than 1. In addition, let's select the options Coefficients, Determinant, and KMO and Bartlett's test of sphericity in Correlation Matrix, as shown in Fig. 12.12. When we click on Continue, we will go back to the main dialog box of the factor analysis. Next, we must click on Extraction …. As shown in Fig. 12.13, we will maintain the options regarding the factor extraction method selected
FIG. 12.12 Selecting the initial options for running the factor analysis.
FIG. 12.13 Choosing the factor extraction method and the criterion for determining the number of factors.
(Method: Principal components) and the choice criterion of the number of factors. In this case, as discussed in Section 12.2.3, only the factors that correspond to eigenvalues greater than 1 will be considered (latent root criterion or Kaiser criterion), and, therefore, we must maintain the option Based on Eigenvalue → Eigenvalues greater than: 1 in Extract selected. Moreover, we will also maintain the options Unrotated factor solution, in Display, and Correlation matrix, in Analyze, selected. In the same way, let's click on Continue so that we can go back to the main dialog box of the factor analysis. In Rotation …, for now, let's select the option Loading plot(s) in Display, while still maintaining the option None in Method selected, as shown in Fig. 12.14. Choosing the extraction of unrotated factors at this moment is didactical, since the outputs generated may be compared to the ones obtained algebraically in Section 12.2.6. Nevertheless, researchers can choose to extract rotated factors at this opportunity. After clicking on Continue, we can select the button Scores … in the technique's main dialog box. At this moment, let's select the option Display factor score coefficient matrix, as shown in Fig. 12.15, which makes the factor scores that correspond to each factor extracted be presented in the outputs. Next, we can click on Continue and on OK.
FIG. 12.14 Dialog box for selecting the rotation method and the loading plot.
FIG. 12.15 Selecting the option to present the factor scores.
FIG. 12.16 Pearson’s correlation coefficients.
The first output (Fig. 12.16) shows correlation matrix ρ, equal to the one in Table 12.7 in Section 12.2.6, through which we can see that the variable marketing is the only one that shows low Pearson's correlation coefficients with all the other variables. As we have already discussed, this is a first indication that the variables finance, costs, and actuarial can be correlated with a certain factor, while the variable marketing can correlate strongly with another one. We can also verify that the output seen in Fig. 12.16 shows the value of the determinant of correlation matrix ρ, used to calculate the χ²Bartlett statistic, as discussed when we presented Expression (12.9). In order to study the overall adequacy of the factor analysis, let's analyze the outputs in Fig. 12.17, which shows the results of the calculations that correspond to the KMO statistic and χ²Bartlett. While the first suggests that the overall adequacy of the factor analysis is considered middling (KMO = 0.737), based on the criterion presented in Table 12.2, the χ²Bartlett statistic = 192.335 (Sig. χ²Bartlett < 0.05 for 6 degrees of freedom) allows us to reject the hypothesis that correlation matrix ρ is statistically equal to identity matrix I with the same dimension, at a significance level of 0.05 and based on the hypotheses of Bartlett's test of sphericity. Thus, we can conclude that the factor analysis is adequate. The values of the KMO and χ²Bartlett statistics are calculated through Expressions (12.3) and (12.9), respectively, presented in Section 12.2.2, and are exactly the same as the ones obtained algebraically in Section 12.2.6. Next, Fig. 12.18 shows the four eigenvalues of correlation matrix ρ that correspond to each one of the factors extracted initially, with the respective proportions of variance shared by the original variables. Note that the eigenvalues are exactly the same as the ones obtained algebraically in Section 12.2.6, such that:

λ1² + λ2² + … + λk² = 2.519 + 1.000 + 0.298 + 0.183 = 4
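Readers who want to reproduce these adequacy statistics outside SPSS can sketch the general formulas in a few lines of Python. This is a hedged sketch based on the usual definitions behind Expressions (12.3) and (12.9), not the book's own code, and the correlation matrix used below is illustrative, not the grades data:

```python
import numpy as np
from scipy.stats import chi2

def bartlett_sphericity(R, n):
    """Chi-square test that correlation matrix R equals the identity matrix."""
    p = R.shape[0]
    stat = -((n - 1) - (2 * p + 5) / 6) * np.log(np.linalg.det(R))
    df = p * (p - 1) / 2
    return stat, df, chi2.sf(stat, df)

def kmo(R):
    """Kaiser-Meyer-Olkin measure of overall sampling adequacy."""
    inv = np.linalg.inv(R)
    d = np.sqrt(np.diag(inv))
    partial = -inv / np.outer(d, d)          # anti-image (partial) correlations
    off = ~np.eye(R.shape[0], dtype=bool)    # off-diagonal mask
    r2, q2 = (R[off] ** 2).sum(), (partial[off] ** 2).sum()
    return r2 / (r2 + q2)

# illustrative correlation matrix: three correlated variables plus one outsider
R = np.array([[1.0, 0.7, 0.6, 0.1],
              [0.7, 1.0, 0.7, 0.1],
              [0.6, 0.7, 1.0, 0.1],
              [0.1, 0.1, 0.1, 1.0]])
stat, df, pval = bartlett_sphericity(R, n=100)
```

With n = 100 observations this matrix clearly rejects sphericity, while for an identity matrix the statistic is 0 and the P-value is 1, exactly the situation in which a factor analysis would be inadequate.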
FIG. 12.17 Results of the KMO statistic and Bartlett’s test of sphericity.
FIG. 12.18 Eigenvalues and variance shared by the original variables to form each factor.
Since in the analysis we will only consider the factors whose eigenvalues are greater than 1, the right-hand side of Fig. 12.18 shows the proportion of variance shared by the original variables to form only these factors. Therefore, analogous to what was presented in Table 12.10, we can state that, while 62.975% of the total variance is shared to form the first factor, 25.010% is shared to form the second. Thus, to form these two factors, the total loss of variance of the original variables is equal to 12.015%. Having extracted two factors, Fig. 12.19 shows the factor scores that correspond to each one of the standardized variables for each one of these factors. Hence, we are able to write the expressions of factors F1 and F2 as follows:

F1i = 0.355 Zfinancei + 0.371 Zcostsi − 0.017 Zmarketingi + 0.364 Zactuariali
F2i = 0.007 Zfinancei + 0.049 Zcostsi + 0.999 Zmarketingi − 0.010 Zactuariali

Note that the expressions are identical to the ones obtained in Section 12.2.6 from the algebraic definition of unrotated factor scores. Fig. 12.20 shows the factor loadings, which correspond to Pearson's correlation coefficients between the original variables and each one of the factors. The values shown in Fig. 12.20 are equal to the ones presented in the first two columns of Table 12.12. The highest factor loading is highlighted for each variable and, therefore, we can verify that, while the variables finance, costs, and actuarial show stronger correlations with the first factor, only the variable marketing shows a stronger correlation with the second factor. As we also discussed in Section 12.2.6, the sum of the squared factor loadings in each column results in the eigenvalue of the corresponding factor, that is, it represents the proportion of variance shared by the four original variables to form each factor. Thus, we can verify that:

(0.895)² + (0.934)² + (0.042)² + (0.918)² = 2.519
(0.007)² + (0.049)² + (0.999)² + (0.010)² = 1.000
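The chain just described, from correlation matrix to eigenvalues, loadings, and shared variance, can be reproduced generically in a short Python sketch. The correlation matrix below is illustrative, not the grades data:

```python
import numpy as np

def pc_loadings(R, min_eigen=1.0):
    """Principal-component factor loadings for eigenvalues above min_eigen."""
    eigval, eigvec = np.linalg.eigh(R)
    order = np.argsort(eigval)[::-1]           # sort eigenvalues descending
    eigval, eigvec = eigval[order], eigvec[:, order]
    keep = eigval > min_eigen                  # latent root (Kaiser) criterion
    return eigvec[:, keep] * np.sqrt(eigval[keep]), eigval

# illustrative correlation matrix with a two-factor structure
R = np.array([[1.0, 0.7, 0.1, 0.1],
              [0.7, 1.0, 0.1, 0.1],
              [0.1, 0.1, 1.0, 0.6],
              [0.1, 0.1, 0.6, 1.0]])
L, eigval = pc_loadings(R)

# column sums of squared loadings recover the retained eigenvalues;
# row sums give the communality of each variable
assert np.allclose((L ** 2).sum(axis=0), eigval[eigval > 1.0])
communality = (L ** 2).sum(axis=1)
```

The sum of all four eigenvalues equals the number of variables (the trace of R), mirroring the identity shown above for the grades example.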
FIG. 12.19 Factor scores.
FIG. 12.20 Factor loadings.
On the other hand, the sum of the squared factor loadings in each row results in the communality of the respective variable, that is, it represents the proportion of shared variance of each original variable in the two factors extracted. Therefore, we can also see that:

communality_finance = (0.895)² + (0.007)² = 0.802
communality_costs = (0.934)² + (0.049)² = 0.875
communality_marketing = (0.042)² + (0.999)² = 1.000
communality_actuarial = (0.918)² + (0.010)² = 0.843

In the SPSS outputs, the communalities table is also presented, as shown in Fig. 12.21. The loading plot that shows the relative position of each variable in each factor, based on the respective factor loadings, is also shown in the outputs, as seen in Fig. 12.22 (equivalent to Fig. 12.8 in Section 12.2.6), in which the X-axis represents factor F1, and the Y-axis, factor F2. Even though the relative position of the variables in each axis is very clear, that is, the magnitude of the correlations between each one of them and each factor, for pedagogical purposes, we chose to elaborate the rotation of the axes, which
FIG. 12.21 Communalities.
FIG. 12.22 Loading plot (Component 1 on the X-axis, Component 2 on the Y-axis).
can sometimes facilitate the interpretation of the factors, because it provides a better distribution of the variables' factor loadings in each factor. Thus, once again, let's click on Analyze → Dimension Reduction → Factor … and, on the button Rotation …, select the option Varimax, as shown in Fig. 12.23. When we click on Continue, we will go back to the main dialog box of the factor analysis. In Scores …, let's select the option Save as variables, as shown in Fig. 12.24, so that the factors generated, now rotated, can be made available in the dataset as new variables. From these factors, the students' school performance ranking will be created. Next, we can click on Continue and on OK. Figs. 12.25–12.29 show the outputs that differ from the previous ones due to the rotation. The results of the correlation matrix, of the KMO statistic, of Bartlett's test of sphericity, and of the communalities table are not presented again because, even though the communalities can be recalculated from the rotated loadings, their values do not change. Fig. 12.25 shows these rotated factor loadings and, through them, it is possible to verify, even if very tenuously, a certain redistribution of the variable loadings in each factor. Note that the rotated factor loadings in Fig. 12.25 are exactly the same as the ones obtained algebraically in Section 12.2.6, from Expressions (12.35) to (12.40), and presented in Table 12.17.
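For reference, the Varimax rotation itself can be sketched with the standard SVD-based algorithm. This is a generic, unnormalized version; SPSS additionally applies Kaiser normalization by default, so its numbers can differ slightly:

```python
import numpy as np

def varimax(L, tol=1e-8, max_iter=500):
    """Varimax orthogonal rotation of a p x k loading matrix L."""
    p, k = L.shape
    T = np.eye(k)                              # accumulated rotation matrix
    crit_old = 0.0
    for _ in range(max_iter):
        Lr = L @ T
        # update matrix for the varimax criterion (Kaiser's gamma = 1)
        G = L.T @ (Lr ** 3 - Lr @ np.diag((Lr ** 2).sum(axis=0)) / p)
        U, s, Vt = np.linalg.svd(G)
        T = U @ Vt                             # nearest orthogonal matrix to G
        if s.sum() - crit_old < tol:           # criterion stopped improving
            break
        crit_old = s.sum()
    return L @ T, T

# loadings with magnitudes similar to the example (values illustrative)
L0 = np.array([[0.89, 0.12], [0.93, 0.05], [0.04, 0.99], [0.92, -0.01]])
Lr, T = varimax(L0)
```

Because T is orthogonal, the rotation redistributes loadings between the two factors while preserving each variable's communality, which is exactly the behavior discussed in the text.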
FIG. 12.23 Selecting the Varimax orthogonal rotation method.
FIG. 12.24 Selecting the option to save the factors as new variables in the dataset.
FIG. 12.25 Rotated factor loadings through the Varimax method.
FIG. 12.26 Loading plot with rotated loadings (component plot in rotated space; Component 1 on the X-axis, Component 2 on the Y-axis).
The new loading plot, constructed from the rotated factor loadings and equivalent to Fig. 12.9, can be seen in Fig. 12.26. The rotation angle calculated algebraically in Section 12.2.6 is also part of the SPSS outputs and can be found in Fig. 12.27. As we have already discussed, from the rotated factor loadings, we can verify that there are no changes in the communality values of the variables considered in the analysis, that is:

communality_finance = (0.895)² + (0.019)² = 0.802
communality_costs = (0.935)² + (0.021)² = 0.875
communality_marketing = (0.013)² + (1.000)² = 1.000
communality_actuarial = (0.917)² + (0.037)² = 0.843

On the other hand, the new eigenvalues can be obtained as follows:

(0.895)² + (0.935)² + (0.013)² + (0.917)² = λ'1² = 2.518
(0.019)² + (0.021)² + (1.000)² + (0.037)² = λ'2² = 1.002
FIG. 12.27 Rotation angle (in radians).
FIG. 12.28 Eigenvalues and variance shared by the original variables to form both rotated factors.
FIG. 12.29 Rotated factor scores.
Fig. 12.28 shows the results of the eigenvalues for the first two rotated factors in Rotation Sums of Squared Loadings, with their respective proportions of variance shared by the four original variables. The results are in accordance with the ones presented in Table 12.18. In comparison to the results obtained before the rotation, we can see that, even though there is no change in the sharing of 87.985% of the total variance of the original variables to form both rotated factors, the rotation redistributed the variance shared by the variables to each factor. Fig. 12.29 shows the rotated factor scores, from which the expressions of the new factors can be obtained. Therefore, we can write the following rotated factor expressions:

F'1i = 0.355 Zfinancei + 0.372 Zcostsi + 0.012 Zmarketingi + 0.364 Zactuariali
F'2i = −0.004 Zfinancei + 0.038 Zcostsi + 0.999 Zmarketingi − 0.021 Zactuariali

When developing the procedure described, we can verify that two new variables are generated in the dataset, called FAC1_1 and FAC2_1 by SPSS, as shown in Fig. 12.30 for the first 20 observations. These new variables, which show the values of both rotated factors for each one of the observations in the dataset, are orthogonal to one another, that is, they have a Pearson's correlation coefficient equal to 0. This can be verified when we click on Analyze → Correlate → Bivariate …. In the dialog box that will open, we must insert the two new variables (FAC1_1 and FAC2_1)
FIG. 12.30 Dataset with the F'1 (FAC1_1) and F'2 (FAC2_1) values per observation.
into Variables and select the options Pearson (in Correlation Coefficients) and Two-tailed (in Test of Significance), as shown in Fig. 12.31. When we click on OK, the output seen in Fig. 12.32 will be presented, in which it is possible to verify that Pearson's correlation coefficient between both rotated factors is equal to 0. According to what was studied in Sections 12.2.4 and 12.2.6, a more inquisitive researcher may also verify that the rotated factor scores can be obtained through the estimation of two multiple linear regression models, in which, in each one of them, a certain factor is considered to be the dependent variable, and the standardized variables the explanatory variables. The factor scores will be the parameters estimated in each model. In the same way, it is also possible to verify that the rotated factor loadings can be obtained through the estimation of four multiple linear regression models, in which, in each one of them, a certain standardized variable is considered to be the dependent variable, and the factors the explanatory variables. While the factor loadings will be the parameters estimated in each model, the communalities will be the respective coefficients of determination R². Therefore, the following expressions can be obtained:

Zfinancei = 0.895 F'1i − 0.019 F'2i + ui, R² = 0.802
Zcostsi = 0.935 F'1i + 0.021 F'2i + ui, R² = 0.875
Zmarketingi = −0.013 F'1i + 1.000 F'2i + ui, R² = 1.000
Zactuariali = 0.917 F'1i − 0.037 F'2i + ui, R² = 0.843
in which the terms ui represent additional sources of variance, besides factors F'1 and F'2, to explain the behavior of each variable; they are also called error terms or residuals.
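The same verification can be scripted: with unit-variance, mutually orthogonal principal-component factors, regressing each standardized variable on the factors returns its loadings as coefficients and its communality as R². A sketch on simulated data follows (the variable structure is hypothetical, not the grades file):

```python
import numpy as np

rng = np.random.default_rng(1)
# simulate four variables driven by two latent factors plus noise
F_true = rng.normal(size=(100, 2))
X = F_true @ rng.normal(size=(2, 4)) + 0.3 * rng.normal(size=(100, 4))
Z = (X - X.mean(axis=0)) / X.std(axis=0)       # Z-scores standardization

# principal-component factors: standardized scores on the top two components
eigval, eigvec = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
idx = np.argsort(eigval)[::-1][:2]
F = Z @ eigvec[:, idx] / np.sqrt(eigval[idx])  # unit-variance, orthogonal

# regress each standardized variable on the factors (intercept included)
design = np.column_stack([np.ones(len(Z)), F])
coef, *_ = np.linalg.lstsq(design, Z, rcond=None)
loadings = coef[1:].T                          # one row per variable
communality = (loadings ** 2).sum(axis=1)      # equals each regression's R²
```

The estimated coefficients coincide, up to floating-point error, with the loadings computed directly as eigenvector times the square root of the eigenvalue, which is the verification described in the text.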
FIG. 12.31 Dialog box for determining Pearson’s correlation coefficient between both rotated factors.
FIG. 12.32 Pearson’s correlation coefficient between both rotated factors.
In case there is any interest in verifying these facts, we must obtain the standardized variables by clicking on Analyze → Descriptive Statistics → Descriptives …. After selecting all the original variables, we must click on Save standardized values as variables. Although this specific procedure is not shown here, after clicking on OK, the standardized variables will be generated in the dataset itself. Therefore, based on the factors generated, we are able to create the desired school performance ranking. In order to do that, we will use the criterion described in Section 12.2.6, known as the weighted rank-sum criterion, in which a new variable is generated from the multiplication of the values of each factor by the respective proportions of variance shared by the original variables. Thus, this new variable, which we call ranking, has the following expression:

rankingi = 0.62942 F'1i + 0.25043 F'2i

in which the parameters 0.62942 and 0.25043 correspond to the proportions of variance shared by the original variables to form the first two factors, respectively, as shown in Fig. 12.28. In order for the variable to be generated in the dataset, we must click on Transform → Compute Variable …. In Target Variable, we must type the name of the new variable (ranking) and, in Numeric Expression, we must type the weighted sum expression (FAC1_1*0.62942) + (FAC2_1*0.25043), as shown in Fig. 12.33. When we click on OK, the variable ranking will appear in the dataset.
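The weighted rank-sum criterion is simply a weighted combination of the two factor scores; a minimal sketch, with hypothetical scores for three students, makes the calculation explicit:

```python
import numpy as np

def weighted_rank_sum(factors, shared_variance):
    """Combine factor scores into one index, weighting each factor
    by its proportion of variance shared by the original variables."""
    return factors @ np.asarray(shared_variance)

# hypothetical rotated factor scores (FAC1_1, FAC2_1) for three students
F = np.array([[1.2, -0.3],
              [0.4, 1.5],
              [-0.8, 0.2]])
ranking = weighted_rank_sum(F, [0.62942, 0.25043])
order = np.argsort(-ranking)       # descending: best performance first
```

Sorting the combined score in descending order reproduces what the Sort Cases procedure does in SPSS.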
FIG. 12.33 Creating the new variable (ranking).
Finally, to sort the observations by the variable ranking, we must click on Data → Sort Cases …. In addition to selecting the option Descending, we must insert the variable ranking into Sort by, as shown in Fig. 12.34. When we click on OK, the observations will appear sorted in the dataset, from the highest to the lowest value of the variable ranking, as shown in Fig. 12.35 for the 20 observations with the best school performance. We can see that the ranking constructed through the weighted rank-sum criterion points to Adelino as the student with the best school performance in that set of subjects, followed by Renata, Giulia, Felipe, and Cecilia. Having presented the procedures for applying the principal component factor analysis in SPSS, let's now discuss the technique in Stata, following the standard used in this book.
12.4 PRINCIPAL COMPONENT FACTOR ANALYSIS IN STATA

We now present the step-by-step procedure for preparing our example in the Stata Statistical Software. In this section, our main goal is not to discuss the concepts of the principal component factor analysis once again; instead, it is to give researchers an opportunity to elaborate the technique by using the commands in this software. Every time we present an output, we will mention the respective result obtained when applying the technique algebraically and also by using SPSS. The use of the images in this section has been authorized by StataCorp LP©. Therefore, right away, we begin with the dataset constructed by the professor from the questions asked to each one of his 100 students. This dataset can be found in the file FactorGrades.dta and is exactly the same as the one partially presented in Table 12.6 in Section 12.2.6.
FIG. 12.34 Dialog box for sorting the observations by variable ranking.
FIG. 12.35 Dataset with the school performance ranking.
First of all, we can type the command desc, which allows us to analyze the characteristics of the dataset, such as the number of observations, the number of variables, and the description of each one of them. Fig. 12.36 shows this first output in Stata. The command pwcorr ..., sig generates Pearson's correlation coefficients between each pair of variables, with their respective significance levels. Therefore, we must type the following command:

pwcorr finance costs marketing actuarial, sig
Fig. 12.37 shows the output generated.
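An equivalent of pwcorr ..., sig can be sketched in Python with scipy.stats.pearsonr. The data below are simulated, since only the structure of the output matters here; the column names merely echo the example:

```python
import numpy as np
from scipy.stats import pearsonr

def pwcorr(X, names):
    """Pairwise Pearson correlations with two-tailed p-values,
    analogous to Stata's `pwcorr ..., sig`."""
    for i in range(X.shape[1]):
        for j in range(i + 1, X.shape[1]):
            r, pval = pearsonr(X[:, i], X[:, j])
            print(f"{names[i]:>10} {names[j]:>10}  r = {r:6.3f}  sig = {pval:.4f}")

# simulated stand-in: two correlated columns and one unrelated column
rng = np.random.default_rng(2)
common = rng.normal(size=200)
X = np.column_stack([common + 0.5 * rng.normal(size=200),
                     common + 0.5 * rng.normal(size=200),
                     rng.normal(size=200)])
pwcorr(X, ["finance", "costs", "marketing"])
```

As in the grades example, the two columns built from a common component come out highly correlated and significant, while the unrelated column does not.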
FIG. 12.36 Description of the FactorGrades.dta dataset.
FIG. 12.37 Pearson’s correlation coefficients and respective significance levels.
The outputs seen in Fig. 12.37 show that the correlations between the variable marketing and each one of the other variables are relatively low and not statistically significant at a significance level of 0.05. On the other hand, the other variables have high and statistically significant correlations with one another at this significance level, which is a first indication that the factor analysis may group them into a certain factor without any substantial loss of their variances, while the variable marketing may show a high correlation with another factor. These results are in accordance with the ones presented in Table 12.7 in Section 12.2.6, and also in Fig. 12.16, when we elaborated the technique in SPSS (Section 12.3). The factor analysis's overall adequacy can be evaluated through the results of the KMO statistic and Bartlett's test of sphericity, which can be obtained by using the command factortest. Thus, let's type:

factortest finance costs marketing actuarial
The outputs generated can be seen in Fig. 12.38. Based on the result of the KMO statistic, the overall adequacy of the factor analysis can be considered middling. However, more important than this piece of information is the result of Bartlett's test of sphericity. From the result of the χ²Bartlett statistic, at a significance level of 0.05 and with 6 degrees of freedom, we can say that Pearson's correlation matrix is statistically different from the identity matrix with the same dimension, since χ²Bartlett = 192.335 (χ² calculated for 6 degrees of freedom) and Prob. χ²Bartlett (P-value) < 0.05. Note that the results of these statistics are in accordance with the ones calculated algebraically in Section 12.2.6 and also shown in Fig. 12.17 of Section 12.3. Fig. 12.38 also shows the value of the determinant of the correlation matrix, used to calculate the χ²Bartlett statistic. Stata also allows us to obtain the eigenvalues and eigenvectors of the correlation matrix. In order to do that, we must type the following command:

pca finance costs marketing actuarial
FIG. 12.38 Results of the KMO statistic and Bartlett’s test of sphericity.
FIG. 12.39 Eigenvalues and eigenvectors of the correlation matrix.
Fig. 12.39 shows these eigenvalues and eigenvectors, and they are exactly the same as the ones calculated algebraically in Section 12.2.6. Since we have not yet elaborated the procedure for rotating the factors generated, we can verify that the proportions of variance shared by the original variables to form each factor correspond to the ones presented in Table 12.10. Having presented these first outputs, we can now elaborate the principal component factor analysis itself by typing the following command, whose results are shown in Fig. 12.40:

factor finance costs marketing actuarial, pcf
where the term pcf refers to the principal-component factor method. While the upper part of Fig. 12.40 shows the eigenvalues of the correlation matrix once again, with the respective proportions of shared variance of the original variables (since researchers can choose not to use the command pca), the lower part of the figure shows the factor loadings, which represent the correlations between each variable and the factors that have eigenvalues greater than 1. Therefore, we can see that Stata automatically considers the latent root criterion (Kaiser criterion) when choosing the number of factors. If researchers wish to extract more factors by considering a smaller eigenvalue, they must type the term mineigen(#) at the end of the command factor, in which # is the number that corresponds to the eigenvalue from which factors will be extracted.
FIG. 12.40 Outputs of the principal component factor analysis in Stata.
The factor loadings shown in Fig. 12.40 are equal to the first two columns of Table 12.12 in Section 12.2.6, and to those in Fig. 12.20 of Section 12.3. Through them, we can see that, while the variables finance, costs, and actuarial show high correlations with the first factor, the variable marketing shows a strong correlation with the second factor. Besides, in the factor loadings matrix, a column called Uniqueness (or exclusivity) is also presented, whose values represent, for each variable, the proportion of variance lost to form the factors extracted, that is, (1 − communality) of each variable. Therefore, we have:

uniqueness_finance = 1 − [(0.8953)² + (0.0068)²] = 0.1983
uniqueness_costs = 1 − [(0.9343)² + (0.0487)²] = 0.1246
uniqueness_marketing = 1 − [(0.0424)² + (0.9989)²] = 0.0003
uniqueness_actuarial = 1 − [(0.9179)² + (0.0101)²] = 0.1573

Consequently, because the variable marketing has low correlations with each one of the other original variables, it ends up having a high Pearson's correlation with the second factor. This makes its uniqueness value very low, since its proportion of variance shared with the second factor is almost equal to 100%. Knowing that two factors are extracted, we will now carry out the rotation by using the Varimax method. In order to do that, we must type the following command:

rotate, varimax horst
where the term horst defines the rotation angle from the standardized factor loadings. This procedure is in accordance with the one elaborated algebraically in Section 12.2.6. The outputs generated can be seen in Fig. 12.41. From Fig. 12.41, as we have already discussed, we can verify that the proportion of variance shared by all the variables to form both factors is equal to 87.98%, even though the eigenvalue of each rotated factor is different from the one obtained previously. The same can be said regarding the uniqueness values of each variable, even though the rotated factor loadings differ from their unrotated counterparts, since the Varimax method maximizes the loadings of each variable in a certain factor. Fig. 12.41 also shows the rotation angle at the end. All of these outputs are identical to the ones calculated in Section 12.2.6, and they were also presented when we elaborated the technique in SPSS, in Figs. 12.25, 12.27, and 12.28.
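These identities are easy to check numerically from the rotated loading magnitudes reported in Fig. 12.41 (signs are irrelevant here because only squares enter; small rounding differences are expected, since the loadings are transcribed to four decimals):

```python
import numpy as np

# rotated loading magnitudes (finance, costs, marketing, actuarial) on F'1, F'2
loadings = np.array([[0.8951, 0.0195],
                     [0.9354, 0.0213],
                     [0.0131, 0.9997],
                     [0.9172, 0.0370]])

communality = (loadings ** 2).sum(axis=1)      # variance kept by the two factors
uniqueness = 1 - communality                   # variance lost, as in the text
rotated_eigenvalues = (loadings ** 2).sum(axis=0)
```

Up to rounding, uniqueness reproduces (0.1983, 0.1246, 0.0003, 0.1573), and the column sums reproduce λ'1² = 2.51768 and λ'2² = 1.00170.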
Thus, we can say that:

uniqueness_finance = 1 − [(0.8951)² + (0.0195)²] = 0.1983
uniqueness_costs = 1 − [(0.9354)² + (0.0213)²] = 0.1246
uniqueness_marketing = 1 − [(0.0131)² + (0.9997)²] = 0.0003
uniqueness_actuarial = 1 − [(0.9172)² + (0.0370)²] = 0.1573

and that:

(0.8951)² + (0.9354)² + (0.0131)² + (0.9172)² = λ'1² = 2.51768
(0.0195)² + (0.0213)² + (0.9997)² + (0.0370)² = λ'2² = 1.00170
FIG. 12.41 Rotation of factors through the Varimax method.
If the researcher wishes to, Stata also allows us to compare the rotated factor loadings to the ones obtained before the rotation in the same table. In order to do that, it is necessary to type the following command, after preparing the rotation: estat rotatecompare
The outputs generated can be seen in Fig. 12.42. At this moment, the loading plot of the rotated factor loadings can be obtained by typing the command loadingplot. This chart, which corresponds to the ones presented in Figs. 12.9 and 12.26, can be seen in Fig. 12.43. After developing these procedures, the researcher may want to generate two new variables in the dataset, which correspond to the rotated factors obtained through the factor analysis. Therefore, it is necessary to type the following command: predict f1 f2
FIG. 12.42 Comparison of the rotated and unrotated factor loadings.
FIG. 12.43 Loading plot with rotated loadings.
where f1 and f2 are the names of the variables corresponding to the first and second factors, respectively. When we type the command, in addition to creating these two new variables in the dataset, an output similar to the one in Fig. 12.44 will also be generated, in which the rotated factor scores are presented. The results shown in Fig. 12.44 are equivalent to the ones in SPSS (Fig. 12.29). Besides, it is also possible to verify that both factors generated are orthogonal, that is, they have a Pearson's correlation coefficient equal to 0. In order to do that, let's type:

estat common
which results in the output seen in Fig. 12.45. Only for pedagogical purposes, we can also obtain the scores and the rotated factor loadings from multiple linear regression models. In order to do that, first of all, we have to generate the standardized variables in the dataset through the Z-scores procedure, from each one of the original variables, by typing the following sequence of commands:

egen zfinance = std(finance)
egen zcosts = std(costs)
egen zmarketing = std(marketing)
egen zactuarial = std(actuarial)
FIG. 12.44 Generating the factors in the dataset and the rotated factor scores.
FIG. 12.45 Pearson’s correlation coefficient between both rotated factors.
Having done this, we can type the two following commands, which represent two multiple linear regression models, each of which has a certain factor as the dependent variable and the standardized variables as explanatory variables:

reg f1 zfinance zcosts zmarketing zactuarial
reg f2 zfinance zcosts zmarketing zactuarial
The results of these models can be seen in Fig. 12.46. By analyzing Fig. 12.46, we note that the parameters estimated in each model correspond to the rotated factor scores for each variable, according to what has already been shown in Fig. 12.44. Thus, since all the intercepts are practically equal to 0, we can write:

F'1i = 0.3554795 Zfinancei + 0.3721907 Zcostsi + 0.0124719 Zmarketingi + 0.3639452 Zactuariali
F'2i = −0.0036389 Zfinancei + 0.0377955 Zcostsi + 0.9986053 Zmarketingi − 0.020781 Zactuariali

Obviously, since the four variables share variances to form each factor, the coefficients of determination R² of each model are equal to 1. On the other hand, to obtain the rotated factor loadings, we must type the following four commands, which represent four multiple linear regression models, each of which has a certain standardized variable as the dependent variable and the rotated factors as explanatory variables:

reg zfinance f1 f2
reg zcosts f1 f2
reg zmarketing f1 f2
reg zactuarial f1 f2
The results of these models can be seen in Fig. 12.47. By analyzing this figure, note that the parameters estimated in each model correspond to the rotated factor loadings for each factor, according to what has already been shown in Fig. 12.41. Therefore, since all the intercepts are practically equal to 0, we can write:

Zfinance_i = 0.895146·F'1i - 0.0194694·F'2i + u_i,  R² = 1 - uniqueness = 0.8017
Zcosts_i = 0.935375·F'1i + 0.0212916·F'2i + u_i,  R² = 1 - uniqueness = 0.8754
Principal Component Factor Analysis Chapter 12
FIG. 12.46 Outputs of the multiple linear regression models with factors as dependent variables.
Zmarketing_i = 0.013053·F'1i + 0.9997495·F'2i + u_i,  R² = 1 - uniqueness = 0.9997
Zactuarial_i = 0.917223·F'1i - 0.0370175·F'2i + u_i,  R² = 1 - uniqueness = 0.8427

where the terms u_i represent additional sources of variance, besides factors F'1 and F'2, to explain the behavior of each variable, since two other factors with eigenvalues less than 1 could also have been extracted. The coefficients of determination R² of each model, which are different from 1, correspond to the communality values of each variable, that is, to (1 - uniqueness). Although researchers can choose not to estimate multiple linear regression models when applying the factor analysis, since this is only a verification procedure, we believe that its didactical nature is essential for fully understanding the technique.

From the rotated factors extracted (variables f1 and f2), we can define the desired school performance ranking. As elaborated when applying the technique in SPSS, we will use the criterion described in Section 12.2.6, known as the weighted rank-sum criterion, in which a new variable is generated by multiplying the values of each factor by the respective proportions of variance shared by the original variables. Let's type the following command:

gen ranking = f1*0.6294 + f2*0.2504
where the terms 0.6294 and 0.2504 correspond to the proportions of variance shared by the first two factors, respectively, as shown in Fig. 12.41. The new variable generated in the dataset is called ranking. Next, we can sort the observations, from the highest to the lowest value of the variable ranking, by typing the following command:

gsort -ranking
After that, just as an example, we can list the school performance ranking of the top 20 students, based on the joint behavior of the final grades in all four subjects. In order to do that, we can type the following command:

list student ranking in 1/20
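The three steps above (generate the weighted index, sort descending, list the top observations) can be sketched in a few lines. The student names and factor scores below are illustrative, while 0.6294 and 0.2504 are the variance proportions from Fig. 12.41:

```python
# Weighted rank-sum criterion: weight each factor score by the proportion of
# variance the factor captures, then sort in descending order. Illustrative scores.
students = {"A": (1.2, -0.3), "B": (-0.4, 1.1), "C": (0.8, 0.5), "D": (-1.6, -1.3)}

ranking = {name: f1 * 0.6294 + f2 * 0.2504                # gen ranking = f1*0.6294 + f2*0.2504
           for name, (f1, f2) in students.items()}
ordered = sorted(ranking, key=ranking.get, reverse=True)  # gsort -ranking
for name in ordered[:20]:                                 # list student ranking in 1/20
    print(name, round(ranking[name], 4))
```

Note that the first factor dominates the index because it carries a larger share of the variance, which is the intent of the criterion.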
Fig. 12.48 shows the ranking of the top 20 students.
FIG. 12.47 Outputs of the multiple linear regression models with standardized variables as dependent variables.
FIG. 12.48 School performance ranking of the best 20 students.
12.5 FINAL REMARKS

There are many situations in which researchers wish to group variables into one or more factors, to verify the validity of previously established constructs, to create orthogonal factors for later use in confirmatory multivariate techniques that require the absence of multicollinearity, or to create rankings by developing performance indexes. In these situations, factor analysis procedures are highly recommended, and the most frequently used is known as principal components. Factor analysis therefore allows us to improve decision-making processes based on the behavior of, and the interdependence between, quantitative variables that have a relative correlation intensity.

Since the factors generated from the original variables are also quantitative variables, the outputs of the factor analysis can be inputs in other multivariate techniques, such as cluster analysis. The stratification of each factor into ranges may allow the association between these ranges and the categories of other qualitative variables to be evaluated through a correspondence analysis. The use of factors in confirmatory multivariate techniques may also make sense when researchers intend to elaborate diagnostics about the behavior of a certain dependent variable and use the extracted factors as explanatory variables, a fact that eliminates possible multicollinearity problems because the factors are orthogonal. A qualitative variable obtained from the stratification of a certain factor into ranges can be used, for example, in a multinomial logistic regression model, which allows the preparation of a diagnostic on the probabilities each observation has of being in each range, due to the behavior of other explanatory variables not initially considered in the factor analysis.

Regardless of the main goal for applying the technique, factor analysis may bear good and interesting research fruits that can be useful for the decision-making process.
Its preparation must always be carried out through the correct and conscious use of the software package chosen for the modeling, based on the underlying theory and on researchers’ experience and intuition.
12.6 EXERCISES (1) From a dataset that contains certain clients’ variables (individuals), analysts from a bank’s Customer Relationship Management department (CRM) elaborated a principal component factor analysis aiming to study the joint behavior of these variables so that, afterwards, they can propose the creation of an investment profile index. The variables used to elaborate the modeling were:
Variable     Description
age          Client i's age (years)
fixedif      Percentage of resources invested in fixed-income funds (%)
variableif   Percentage of resources invested in variable-income funds (%)
people       Number of people who live in the residence

In a certain management report, these analysts presented the factor loadings (Pearson's correlation coefficients) between each original variable and both factors extracted by using the latent root criterion or Kaiser criterion. These factor loadings can be found in the table:

Variable     Factor 1   Factor 2
age          0.917      0.047
fixedif      0.874      0.077
variableif   0.844      0.197
people       0.031      0.979
We would like you to answer the following questions:
(a) Which eigenvalues correspond to the two factors extracted?
(b) What are the proportions of variance shared by all the variables to form each factor? What is the total proportion of variance lost by the four variables to extract these two factors?
(c) For each variable, what is the proportion of shared variance to form both factors (communality)?
(d) What is the expression of each standardized variable based on the two factors extracted?
(e) Construct a loading plot from the factor loadings.
(f) Interpret both factors based on the distribution of the loadings of each variable.

(2) A researcher specialized in analyzing the behavior of nations' socioeconomic indexes would like to investigate the possible relationship between variables related to corruption, violence, income, and education, and, in order to do that, he collected data on 50 countries considered to be developed or emerging two years in a row. The data can be found in the files CountriesIndexes.sav and CountriesIndexes.dta, which have the following variables:

Variable      Period   Description
country       -        A string variable that identifies country i
cpi1          Year 1   Corruption perception index, which corresponds to citizens' perception of abuses committed by the public sector as regards a nation's private assets, including administrative and political aspects. The lower the index, the higher the perception of corruption in the country (Source: Transparency International)
cpi2          Year 2   (same as cpi1)
violence1     Year 1   Number of murders per 100,000 inhabitants (Sources: World Health Organization, United Nations Office on Drugs and Crime, and GIMD Global Burden of Injuries)
violence2     Year 2   (same as violence1)
capita_gdp1   Year 1   Per capita GDP in US$ adjusted for inflation, using 2000 as the base year (Source: World Bank)
capita_gdp2   Year 2   (same as capita_gdp1)
school1       Year 1   Average number of years in school per person over 25 years of age, including primary, secondary, and higher education (Source: Institute for Health Metrics and Evaluation)
school2       Year 2   (same as school1)
In order to create a socioeconomic index that generates a country ranking for each year, the researcher has decided to elaborate a principal component factor analysis using the variables of each period. Based on the results obtained, we would like you to answer the following questions:
(a) By using the KMO statistic and Bartlett's test of sphericity, is it possible to state that the principal component factor analysis is adequate for each one of the years of study? In the case of Bartlett's test of sphericity, use a significance level of 0.05.
(b) How many factors are extracted in the analysis in each of the years, considering the latent root criterion? Which eigenvalue(s) correspond to the factor(s) extracted each year, as well as the proportion(s) of variance shared by all the variables to form this(these) factor(s)?
(c) For each variable, what is the proportion of shared variance to form the factor(s) each year? Did any alterations in the communalities of each variable occur from one year to the next?
(d) What are the expression(s) of the factor(s) extracted each year, based on the standardized variables? From one year to the next, did any alterations in the factor scores of the variables occur in each factor? Discuss the importance of developing a specific factor analysis each year in order to create indexes.
(e) Considering the principal factor extracted as a socioeconomic index, create a country ranking from this index for each one of the years. From one year to the next, were there any changes regarding the countries' positions in the ranking?

(3) The general manager of a store, which belongs to a chain of drugstores, wishes to find out its consumers' perception of eight attributes, which are described below:
Attribute (Variable)   Description
assortment             Perception of the variety of goods
replacement            Perception of the quality and speed of inventory replacement
layout                 Perception of the store's layout
comfort                Perception of thermal, acoustic, and visual comfort inside the store
cleanliness            Perception of the store's general cleanliness
services               Perception of the quality of the services rendered
prices                 Perception of the store's prices compared to the competition
discounts              Perception of the store's discount policy
In order to do that, he carried out a survey with 1700 clients at the store over a period of time. The questionnaire was structured based on groups of attributes, and each question corresponding to an attribute asked the consumer to assign a score from 0 to 10 depending on his or her perception of that attribute: 0 corresponded to an entirely negative perception, and 10 to the best perception possible. Since the store's general manager is rather experienced, he decided, in advance, to gather the questions in three groups, such that the complete questionnaire would be as follows:

Based on your perception, fill out the questionnaire below with scores from 0 to 10, in which 0 means that your perception is entirely negative in relation to a certain attribute, and 10 that your perception is the best possible. (Each item is accompanied by a Score field.)

Products and store environment
Please rate the store's variety of goods on a scale of 0-10
Please rate the store's quality and speed of inventory replacement on a scale of 0-10
Please rate the store's layout on a scale of 0-10
Please rate the store's thermal, acoustic, and visual comfort on a scale of 0-10
Please rate the store's general cleanliness on a scale of 0-10

Services
Please rate the quality of the services rendered in our store on a scale of 0-10

Prices and discount policy
Please rate the store's prices compared to the competition on a scale of 0-10
Please rate our discount policy on a scale of 0-10
The complete dataset developed by the store's general manager can be seen in the files DrugstorePerception.sav and DrugstorePerception.dta. We would like you to:
(a) Present the correlation matrix between each pair of variables. Based on the magnitude of the values of Pearson's correlation coefficients, is it possible to identify any indication that the factor analysis may group the variables into factors?
(b) By using the result of Bartlett's test of sphericity, is it possible to state, at a significance level of 0.05, that the principal component factor analysis is adequate?
(c) How many factors are extracted in the analysis considering the latent root criterion? Which eigenvalue(s) correspond to the factor(s) extracted, as well as the proportion(s) of variance shared by all the variables to form this(these) factor(s)?
(d) What is the total percentage of variance loss of the original variables resulting from the extraction of the factor(s) based on the latent root criterion?
(e) For each variable, what are the loading and the proportion of shared variance to form the factor(s)?
(f) By demanding the extraction of three factors, to the detriment of the latent root criterion, and based on the new factor loadings, is it possible to confirm the construct of the questionnaire proposed by the store's general manager? In other words, do the variables of each group in the questionnaire, in fact, end up showing greater sharing of variance with a common factor?
(g) Discuss the impact of the decision to extract three factors on the communality values.
(h) Construct a Varimax rotation and discuss once again, based on the redistribution of the factor loadings, the construct initially proposed in the questionnaire by the store's general manager.
(i) Present the 3D loading plot with the rotated factor loadings.
APPENDIX: CRONBACH'S ALPHA

A.1 Brief Presentation
The alpha statistic, proposed by Cronbach (1951), is a measure used to assess the internal consistency of the variables in a dataset, that is, it measures the level of reliability with which a certain scale, adopted to define the original variables, produces consistent results about the relationship between these variables. According to Nunnally and Bernstein (1994), the level of reliability is defined from the behavior of the correlations between the original (or standardized) variables; therefore, Cronbach's alpha can be used to evaluate the reliability with which a factor can be extracted from variables, and is thus related to factor analysis. According to Rogers et al. (2002), even though Cronbach's alpha is not the only existing measure of reliability, since it has constraints related to multidimensionality, that is, to the identification of multiple factors, it can be defined as the measure that makes it possible to assess the intensity with which a certain construct or factor is present in the original variables. Therefore, a dataset with variables that share a single factor tends to have a high Cronbach's alpha. Hence, Cronbach's alpha cannot be used to assess the overall adequacy of the factor analysis, unlike the KMO statistic and Bartlett's test of sphericity, since its magnitude offers the researcher an indication only of the internal consistency of the scale used to extract a single factor. If its value is low, not even the first factor will be adequately extracted, which is the main reason why some researchers choose to study the magnitude of Cronbach's alpha before running the factor analysis, even though this is not a mandatory requisite for developing the technique. Cronbach's alpha can be defined by the following expression:

α = [k / (k − 1)] · [1 − (Σ_k Var_k) / Var_sum]   (12.41)

where Var_k is the variance of the kth variable, and

Var_sum = [Σ_{i=1}^{n} (Σ_k X_ki)² − (Σ_{i=1}^{n} Σ_k X_ki)² / n] / (n − 1)   (12.42)
which represents the variance of the sum of each row in the dataset, that is, the variance of the sum of the values corresponding to each observation. Besides, we know that n is the sample size, and k is the number of variables X. So, we can state that, if the variable values are consistent with one another, the term Var_sum will be large enough for alpha (α) to tend to 1. On the other hand, variables that have low correlations, possibly due to the presence of random observation values, will make the term Var_sum approach the sum of the variances of each variable (Var_k), which will make alpha (α) tend to 0. Although there is no consensus in the existing literature about the value of alpha above which there is internal consistency of the variables in the dataset, it is desirable that the result obtained be greater than 0.6 when we apply exploratory techniques. Next, we will discuss the calculation of Cronbach's alpha for the data in the example used throughout this chapter.
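Expressions (12.41) and (12.42) translate directly into code. The sketch below computes alpha for a small illustrative matrix (rows are observations, columns are variables, not the chapter's data); three items that move together push alpha to 1:

```python
# Cronbach's alpha as in Expressions (12.41)-(12.42).
import numpy as np

def cronbach_alpha(X):
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    var_k = X.var(axis=0, ddof=1)           # variance of each variable (Var_k)
    var_sum = X.sum(axis=1).var(ddof=1)     # variance of the row sums, Expression (12.42)
    return (k / (k - 1)) * (1 - var_k.sum() / var_sum)   # Expression (12.41)

base = np.arange(10, dtype=float)
consistent = np.column_stack([base, base + 0.1, base - 0.2])  # perfectly consistent items
print(round(cronbach_alpha(consistent), 3))  # → 1.0
```

With uncorrelated columns, Var_sum collapses toward the sum of the individual variances and the same function returns a value near 0, as described above.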
A.2 Determining Cronbach's Alpha Algebraically
From the standardized variables in the example studied throughout this chapter, we can construct Table 12.20, which helps us calculate Cronbach's alpha. Thus, based on Expression (12.42), we have:

Var_sum = 832.570 / 99 = 8.410

and, by using Expression (12.41), we can calculate Cronbach's alpha:

α = (4 / 3) · (1 − 4 / 8.410) = 0.699

We can consider this value acceptable for the internal consistency of the variables in our dataset. Nevertheless, as we will see when determining Cronbach's alpha in SPSS and in Stata, there is a considerable loss of reliability because the original variables are not measuring the same factor, that is, the same dimension, since this statistic has constraints related to multidimensionality. That is, if we did not include the variable marketing when calculating Cronbach's alpha, its value would be
TABLE 12.20 Procedure for Calculating Cronbach's Alpha

Student       Zfinance_i   Zcosts_i   Zmarketing_i   Zactuarial_i   Σ_k X_ki   (Σ_k X_ki)²
Gabriela         0.011       0.290       1.650         -0.273         1.679        2.817
Luiz Felipe      0.876       0.697      -1.532          1.319         1.360        1.849
Patricia         0.876       0.290       0.590          0.523         2.278        5.191
Gustavo          1.334       1.337       0.825          1.069         4.564       20.832
Leticia          0.779       1.104       0.872          0.841         3.597       12.939
Ovidio           1.334       2.150      -1.650          1.865         3.699       13.682
Leonor          -0.267       0.116       0.825         -0.125         0.549        0.301
Dalila          -0.139       0.523       0.118          0.273         0.775        0.600
Antonio         -0.021       0.290       0.590          0.523         1.382        1.909
…                0.982       0.113       1.297          1.069           …            …
⋮
Estela             …           …           …              …           0.868        0.753
Variance         1.000       1.000       1.000          1.000

Σ_{i=1}^{100} Σ_k X_ki = 0        Σ_{i=1}^{100} (Σ_k X_ki)² = 832.570
considerably higher, which indicates that this variable does not contribute to the construct, or to the first factor, formed by the other variables (finance, costs, and actuarial). The complete spreadsheet with the calculation of Cronbach’s alpha can be found in the file AlphaCronbach.xls. Analogous to what was done throughout this chapter, next, we will present the procedures for obtaining Cronbach’s alpha in SPSS and in Stata.
A.3 Determining Cronbach's Alpha in SPSS
Once again, let’s use the file FactorGrades.sav. In order for us to determine Cronbach’s alpha based on the standardized variables, first, we must standardize them by using the Z-scores procedure. To do that, let’s click on Analyze → Descriptive Statistics → Descriptives …. When we select all the original variables, we must click on Save standardized values as variables. Although this specific procedure is not shown here, after clicking on OK, the standardized variables will be generated in the dataset itself. After that, let’s click on Analyze → Scale → Reliability Analysis …. A dialog box will open. We must insert the standardized variables into Items, as shown in Fig. 12.49. Next, in Statistics …, we must select the option Scale if item deleted, as shown in Fig. 12.50. This option calculates the different values of Cronbach’s alpha when each variable in the analysis is eliminated. The term item is often mentioned in Cronbach’s work (1951), and it is used as a synonym for variable. Next, we can click on Continue and on OK. Fig. 12.51 shows the result of Cronbach’s alpha, whose value is exactly the same as the one calculated through Expressions (12.41) and (12.42) and shown in the previous section. Furthermore, Fig. 12.52 also shows, in the last column, Cronbach’s alpha values that would be obtained if a certain variable were excluded from the analysis. Therefore, we can see that the presence of the variable marketing contributes negatively to the identification of only one factor, because, as we know, this variable shows strong correlation with the second factor extracted by the principal component factor analysis elaborated throughout this chapter. Since Cronbach’s alpha is a one-dimensional measure of reliability, excluding the variable marketing would make its value get to 0.904. Next, we will obtain the same outputs by using specific commands in Stata.
FIG. 12.49 Dialog box for determining Cronbach’s alpha in SPSS.
FIG. 12.50 Selecting the option to calculate alpha when excluding a certain variable.
FIG. 12.51 Result of Cronbach’s alpha in SPSS.
FIG. 12.52 Cronbach’s alpha when excluding each variable.
A.4 Determining Cronbach's Alpha in Stata
Now, let's open the file FactorGrades.dta. In order to calculate Cronbach's alpha, we must type the following command:

alpha finance costs marketing actuarial, asis std
where the term std makes Cronbach’s alpha be calculated from the standardized variables, even if the original variables were considered in the command alpha. The output generated can be seen in Fig. 12.53.
FIG. 12.53 Result of Cronbach’s alpha in Stata.
FIG. 12.54 Internal consistency when excluding each variable—last column.
If researchers choose to obtain Cronbach's alpha values when excluding each one of the variables, as is done in SPSS, they may type the following command:

alpha finance costs marketing actuarial, asis std item
The new outputs are shown in Fig. 12.54, in which the values of the last column are exactly the same as the ones presented in Fig. 12.52, which corroborates the fact that the variables finance, costs, and actuarial show high internal consistency for determining a single factor.
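The same "alpha if item deleted" logic can be sketched outside Stata and SPSS: recompute alpha leaving one variable out at a time, and look for a jump. The data below are synthetic (three items driven by one latent factor plus one pure-noise item standing in, loosely, for marketing), and the function simply repeats Expressions (12.41)–(12.42):

```python
# "Alpha if item deleted": drop each variable in turn and recompute alpha.
# A jump when an item is dropped flags it as not sharing the single factor.
import numpy as np

def cronbach_alpha(X):
    n, k = X.shape
    return (k / (k - 1)) * (1 - X.var(axis=0, ddof=1).sum()
                            / X.sum(axis=1).var(ddof=1))

rng = np.random.default_rng(1)
latent = rng.normal(size=200)                               # one common factor
items = [latent + 0.3 * rng.normal(size=200) for _ in range(3)]
items.append(rng.normal(size=200))                          # pure-noise item
data = np.column_stack(items)

for j in range(data.shape[1]):
    alpha_wo = cronbach_alpha(np.delete(data, j, axis=1))
    print(f"alpha without item {j}: {alpha_wo:.3f}")
```

Dropping the noise item raises alpha sharply, just as excluding marketing raised alpha from 0.699 to 0.904 in the chapter's data.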
Part VI
Generalized Linear Models

The study of statistical distributions is not recent, and from the beginning of the 19th century until approximately the beginning of the 20th century, linear models that follow a normal distribution practically dominated the data-modeling scenario. Nonetheless, starting in the period between the two world wars, models arose to represent situations that normal linear models could not satisfactorily represent. McCullagh and Nelder (1989), Turkman and Silva (2000), and Cordeiro and Demetrio (2007) mention, in this context, Berkson's (1944), Dyke and Patterson's (1952), and Rasch's (1960) work on the logistic models that involve the Bernoulli and binomial distributions; Birch's (1963) work on the models for count data involving the Poisson distribution; Feigl and Zelen's (1965), Zippin and Armitage's (1966), and Glasser's (1967) work on the exponential models; and Nelder's (1966) work on polynomial models that include the Gamma distribution. All of these models ended up being consolidated, from a theoretical and conceptual perspective, through Nelder and Wedderburn's (1972) extremely important work, in which the Generalized Linear Models were defined. They represent a group of linear regression models and nonlinear exponential models in which the dependent variable follows, for example, a normal, Bernoulli, binomial, Poisson, or Poisson-Gamma distribution. The following models are special cases of Generalized Linear Models:

- Linear Regression Models and Models with Box-Cox Transformation;
- Binary and Multinomial Logistic Regression Models;
- Poisson and Negative Binomial Regression Models for Count Data;

and the estimation of each one of them must be done respecting the characteristics of the data and the distribution of the variable that represents the phenomenon we wish to study, called the dependent variable. A Generalized Linear Model is defined as follows:

η(Y_i) = a + b1·X1i + b2·X2i + … + bk·Xki
(VI.1)
where η is known as the canonical link function, a represents the constant, bj (j = 1, 2, ..., k) are the coefficients of each explanatory variable and correspond to the parameters to be estimated, Xj are the explanatory variables (metric or dummies), and the subscripts i represent each one of the observations of the sample being analyzed (i = 1, 2, ..., n, where n is the sample size). Box VI.1 relates each specific case of the generalized linear models to the characteristic of the dependent variable, its distribution, and the respective canonical link function.

BOX VI.1 Generalized Linear Models, Characteristics of the Dependent Variable, and Canonical Link Functions

Regression Model        Characteristic of the Dependent Variable                         Distribution                      Canonical Link Function (η)
Linear                  Quantitative                                                     Normal                            Ŷ
With Box-Cox Transf.    Quantitative                                                     Normal after the Transformation   (Ŷ^λ − 1)/λ
Binary Logistic         Qualitative with 2 Categories (Dummy)                            Bernoulli                         ln[p/(1 − p)]
Multinomial Logistic    Qualitative with M (M > 2) Categories                            Binomial                          ln[p_m/(1 − p_m)]
Poisson                 Quantitative with Integer and Non-Negative Values (Count Data)   Poisson                           ln(λ)
Negative Binomial       Quantitative with Integer and Non-Negative Values (Count Data)   Poisson-Gamma                     ln(u)
Therefore, for a given dependent variable Y that represents the phenomenon being studied (outcome variable), we can specify each one of the models presented in Box VI.1 in the following way:

Linear Regression Model

Ŷ_i = a + b1·X1i + b2·X2i + … + bk·Xki   (VI.2)

where Ŷ is the expected value of the dependent variable Y.

Regression Model with Box-Cox Transformation

(Ŷ_i^λ − 1)/λ = a + b1·X1i + b2·X2i + … + bk·Xki   (VI.3)

where Ŷ is the expected value of the dependent variable Y and λ is the Box-Cox transformation parameter that maximizes the adherence to normality of the distribution of the new variable generated from the original variable Y.

Binary Logistic Regression Model

ln[p_i/(1 − p_i)] = a + b1·X1i + b2·X2i + … + bk·Xki   (VI.4)

where p is the probability of occurrence of the event of interest, defined by Y = 1, given that the dependent variable Y is a dummy.

Multinomial Logistic Regression Model

ln[p_im/(1 − p_im)] = a_m + b1m·X1i + b2m·X2i + … + bkm·Xki   (VI.5)

where p_m (m = 0, 1, ..., M − 1) is the probability of occurrence of each one of the M categories of the dependent variable Y.

Poisson Regression Model for Count Data

ln(λ_i) = a + b1·X1i + b2·X2i + … + bk·Xki   (VI.6)

where λ is the expected value of the number of occurrences of the phenomenon represented by the dependent variable Y, which presents count data with a Poisson distribution.

Negative Binomial Regression Model for Count Data

ln(u_i) = a + b1·X1i + b2·X2i + … + bk·Xki   (VI.7)

where u is the expected value of the number of occurrences of the phenomenon represented by the dependent variable Y, which presents count data with a Poisson-Gamma distribution.

Thus, Part VI discusses the Generalized Linear Models. While Chapter 13 discusses the linear regression models and the models with Box-Cox transformation, Chapters 14 and 15 discuss the binary and multinomial logistic regression models and the Poisson and negative binomial regression models for count data, respectively, which are nonlinear exponential models, also called log-linear or semilogarithmic (to the left) models. Fig. VI.1 represents this logic.
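To make the canonical links in Box VI.1 concrete, the sketch below (with illustrative coefficient values, not estimates from any dataset) inverts the logit and log links to turn a linear predictor into a fitted probability and a fitted expected count:

```python
# Canonical links: the linear predictor eta = a + b1*X1 + ... lives on the
# link scale; inverting the link recovers the fitted mean of the model.
import math

def logit(p):               # binary logistic link: ln(p / (1 - p))
    return math.log(p / (1 - p))

def inv_logit(eta):         # inverse logit: p = 1 / (1 + e^(-eta))
    return 1 / (1 + math.exp(-eta))

def inv_log(eta):           # inverse of the Poisson log link: lambda = e^eta
    return math.exp(eta)

eta = 0.5 + 1.2 * 2.0       # a + b*X for one observation (illustrative values)
p = inv_logit(eta)          # fitted probability of the event (Y = 1)
lam = inv_log(eta)          # fitted expected count
print(round(p, 4), round(lam, 4))       # probability in (0, 1), count > 0
assert abs(logit(p) - eta) < 1e-12      # link and inverse link agree
```

This is why the link is useful: whatever values the linear predictor takes, the inverse logit always returns a valid probability and the inverse log always returns a positive expected count.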
FIG. VI.1 Generalized Linear Models and Structure of the Chapters in Part VI: Generalized Linear Models (GLM) comprise simple and multiple regression models and Box-Cox transformation (Chapter 13), binary and multinomial logistic regression models (Chapter 14), and Poisson and negative binomial regression models for count data (Chapter 15).
The chapters in Part VI are structured in the same presentation logic, in which, initially, the concepts regarding each model and the criteria for estimating its parameters are presented, always using datasets that allow us to solve practical exercises in Excel. After that, the same exercises are solved, step by step, in Stata and in SPSS. At the end of each chapter, additional exercises are proposed, whose answers are available at the end of the book.
Chapter 13
Simple and Multiple Regression Models

… because politics is for the present, but an equation is something for eternity.
Albert Einstein
13.1 INTRODUCTION

Of the techniques studied in this book, those known as simple and multiple linear regression models are, without a doubt, the most used in the different fields of knowledge. Imagine that a group of researchers is interested in studying how the rate of return for a financial asset behaves in relation to the market, or how company expenses vary when the factory increases its productive capacity or its number of work hours, or how the number of bedrooms and the amount of floor space in a sample of residential real estate can influence the formation of sales prices. Notice that, in all the examples, the main phenomenon of interest is represented, in each case, by a metric or quantitative variable and, therefore, can be studied by estimating linear regression models, whose main goal is to analyze how the relations between a set of explanatory variables (metric or dummies) and a metric dependent variable (the outcome variable that represents the phenomenon under study) behave, provided that some conditions are respected and some presuppositions are met, as we shall see in this chapter. It is important to emphasize that any and all linear regression models should be defined based on the underlying theory and the experience of the researcher, so that it is possible to estimate the desired model, analyze the results obtained by means of statistical tests, and prepare forecasts. In this chapter, we will consider the simple and multiple linear regression models, with the following objectives: (1) introduce the concepts of simple and multiple linear regression, (2) interpret results obtained and prepare forecasts, (3) discuss the technique's presuppositions, and (4) present the application of the technique in Excel, Stata, and SPSS. Initially, the solution to an example will be prepared in Excel simultaneously with the presentation of the concepts and the manual solution of the example.
Only after the introduction of the concepts will the procedures for the preparation of the regression technique be presented in Stata and SPSS.
13.2 LINEAR REGRESSION MODELS

First, we will address linear regression models and their presuppositions. An analysis of nonlinear regressions will be covered in Section 13.4. According to Fávero et al. (2009), the linear regression technique offers, primarily, the ability to study the relation between one or more explanatory variables, which enter in a linear form, and a quantitative dependent variable. As such, a general linear regression model can be defined as follows:

Y_i = a + b1·X1i + b2·X2i + ⋯ + bk·Xki + u_i
(13.1)
where Y represents the phenomenon under study (quantitative dependent variable), a represents the intercept (constant or linear coefficient), bj (j = 1, 2, …, k) are the coefficients of each variable (angular coefficients), Xj are the explanatory variables (metric or dummies), and u is the error term (the difference between the real value of Y and the value of Y predicted by the model for each observation). The subscripts i represent each of the observations of the sample under analysis (i = 1, 2, …, n, where n is the size of the sample). The equation presented by means of Expression (13.1) represents a multiple linear regression model, since it considers the inclusion of various explanatory variables for the study of the phenomenon in question. On the other hand, if only one

Data Science for Business and Decision Making. https://doi.org/10.1016/B978-0-12-811216-8.00013-6 © 2019 Elsevier Inc. All rights reserved.
FIG. 13.1 Estimated simple linear regression model.
+
⋅ Xi
Y
X X variable is inserted, we have before us a simple linear regression model. For didactic reasons, we will introduce the concepts and present the step-by-step process of estimating the parameters by means of a simple regression model. Following, we will amplify the discussion by means of estimation in multiple regression models, including the consideration of dummy variables on the right side of the equation. It is important to emphasize, therefore, that the simple linear regression model to be predicted present the following expression: Y^i ¼ a + b Xi
(13.2)
where Ŷᵢ represents the predicted value of the dependent variable, obtained by estimating the model for each observation i, and a and b represent the estimated parameters of the intercept and the slope of the proposed model, respectively. Fig. 13.1 presents, graphically, the general configuration of an estimated simple linear regression model. We can verify that, while estimated parameter a gives the point on the regression line where X = 0, estimated parameter b represents the slope of the line, that is, the average increase (or decrease) in Y for each additional unit of X. The inclusion of the error term u in Expression (13.1), also known as the residual, is justified by the fact that any proposed relation will rarely hold perfectly. In other words, the phenomenon under study, represented by the variable Y, will very probably also present a relation with some X variable not included in the model, whose effect will therefore have to be represented by the error term u. As such, the error term u, for each observation i, can be written as:

uᵢ = Yᵢ − Ŷᵢ   (13.3)
According to Kennedy (2008), Fávero et al. (2009), and Wooldridge (2012), error terms arise for reasons that need to be known and considered by researchers, such as:

- Existence of aggregated and/or nonrandom variables.
- Failures in the specification of the model (nonlinear forms and omission of relevant explanatory variables).
- Errors in data gathering.

Further considerations regarding error terms will be made in the study of the regression model assumptions, in Section 13.3. Having discussed the preliminary concepts, we shall now begin the study of the estimation of linear regression models.
13.2.1 Estimation of the Linear Regression Model by Ordinary Least Squares

We often glimpse, in a rational or intuitive way, relations between the behaviors of variables, whether these relations present themselves directly or indirectly. If I swim more often at my club, will I increase my muscle mass? If I change jobs, will I have more time to spend with my children? If I save a greater portion of my wages, will I be able to retire at a younger age? These questions offer clear
relations between a certain dependent variable, which represents the phenomenon we wish to study, and, in this case, a single explanatory variable. The objective of regression analysis is, therefore, to provide conditions for the researcher to evaluate how a Y variable behaves based on the behavior of one or more X variables, without, necessarily, the occurrence of a cause-and-effect relationship. We will introduce the concepts of regression by means of an example that considers only one explanatory variable (simple linear regression).
Imagine that, on a certain class day for a group of 10 students, the professor is interested in discovering the influence of the distance traveled to get to school on the travel time. The professor completes a questionnaire with each of the 10 students and prepares a dataset, which can be found in Table 13.1.

Simple and Multiple Regression Models Chapter 13

TABLE 13.1 Example: Travel Time × Distance Traveled

Student        Time to Get to School (min)    Distance Traveled to School (km)
Gabriela       15                             8
Dalila         20                             6
Gustavo        20                             15
Leticia        40                             20
Luiz Ovidio    50                             25
Leonor         25                             11
Ana            10                             5
Antonio        55                             32
Julia          35                             28
Mariana        30                             20

In fact, the professor wants to know the equation that governs the phenomenon "travel time to school" as a function of the "distance traveled by the students." It is known that other variables influence the time of a certain route, such as the route taken, the type of transportation, or the time at which the student left for school that day. However, the professor knows that such variables will not be part of the model, since they were not collected for the formation of the dataset. The problem can therefore be modeled in the following manner:

time = f(dist)

As such, the equation, or simple regression model, will be:

timeᵢ = a + b·distᵢ + uᵢ

and, in this way, the expected value (estimate) of the dependent variable, for each observation i, will be given as:

timêᵢ = a + b·distᵢ

where a and b are the estimates of parameters a and b, respectively.
This last equation shows that the expected value of the time variable (Ŷ), also known as the conditional mean, is calculated for each sample observation as a function of the behavior of the dist variable, where the subscript i represents, for our example data, the school students (i = 1, 2, …, 10). Our objective here is, therefore, to study whether the behavior of the dependent variable time presents a relation with the variation of the distance, in kilometers, that each student travels to arrive at school on a certain class day. In our example, it does not make much sense to discuss travel time when the distance to school is zero (parameter a). Parameter b, on the other hand, will inform us of the average increase in the time to arrive at school for each additional kilometer traveled. We shall, as such, prepare a graph (Fig. 13.2) that relates the travel time (Y) to the distance traveled (X), where each point represents one of the students.
FIG. 13.2 Travel time × distance traveled for each student.
As previously commented, it is not only the distance traveled that affects the time needed to get to school, since the time can also be affected by other variables related to traffic, the means of transportation, or the individual. As such, the error term u should capture the effect of the remaining variables not included in the model. Now, in order to estimate the equation that best adjusts to this cloud of points, we should establish two fundamental conditions related to the residuals:

(1) The sum of the residuals should be zero: Σᵢ₌₁ⁿ uᵢ = 0, where n is the sample size.

With only this first condition, several regression lines can be found for which the sum of the residuals is zero, as shown in Fig. 13.3. Notice that, for the same dataset, several lines can respect the condition that the sum of the residuals be equal to zero. Therefore, it becomes necessary to establish a second condition.

(2) The residual sum of squares is the least possible: Σᵢ₌₁ⁿ uᵢ² = min.

With this condition, we choose the model that presents the best possible adjustment to the cloud of points, which gives us the definition of least squares. In other words, a and b should be determined in such a way that the sum of the squares of the residuals is the least possible (ordinary least squares, or OLS, method). As such:

Σᵢ₌₁ⁿ (Yᵢ − b·Xᵢ − a)² = min   (13.4)

The minimization is carried out by differentiating Expression (13.4) with respect to a and b and setting the resulting expressions equal to zero. As such:

∂/∂a [Σᵢ₌₁ⁿ (Yᵢ − b·Xᵢ − a)²] = −2·Σᵢ₌₁ⁿ (Yᵢ − b·Xᵢ − a) = 0   (13.5)

∂/∂b [Σᵢ₌₁ⁿ (Yᵢ − b·Xᵢ − a)²] = −2·Σᵢ₌₁ⁿ Xᵢ·(Yᵢ − b·Xᵢ − a) = 0   (13.6)

Distributing Expression (13.5) and dividing it by 2n, where n is the sample size, we have:

−(1/n)·Σᵢ₌₁ⁿ Yᵢ + b·(1/n)·Σᵢ₌₁ⁿ Xᵢ + (1/n)·Σᵢ₌₁ⁿ a = 0   (13.7)

from which comes:

−Ȳ + b·X̄ + a = 0   (13.8)

and, therefore:

a = Ȳ − b·X̄   (13.9)

where Ȳ and X̄ represent the sample averages of Y and X, respectively.
FIG. 13.3 (A–C) Three examples of regression lines for which the sum of the residuals is zero.
Substituting this result into Expression (13.6), we have:

−2·Σᵢ₌₁ⁿ Xᵢ·(Yᵢ − b·Xᵢ − Ȳ + b·X̄) = 0   (13.10)

which, when developed, gives:

Σᵢ₌₁ⁿ Xᵢ·(Yᵢ − Ȳ) − b·Σᵢ₌₁ⁿ Xᵢ·(Xᵢ − X̄) = 0   (13.11)

which therefore generates:

b = Σᵢ₌₁ⁿ (Xᵢ − X̄)·(Yᵢ − Ȳ) / Σᵢ₌₁ⁿ (Xᵢ − X̄)²   (13.12)
TABLE 13.2 Calculation Spreadsheet for the Determination of a and b

Observation (i)   Time (Yi)   Distance (Xi)   Yi − Ȳ   Xi − X̄   (Xi − X̄)(Yi − Ȳ)   (Xi − X̄)²
1                 15          8               −15      −9       135                 81
2                 20          6               −10      −11      110                 121
3                 20          15              −10      −2       20                  4
4                 40          20              10       3        30                  9
5                 50          25              20       8        160                 64
6                 25          11              −5       −6       30                  36
7                 10          5               −20      −12      240                 144
8                 55          32              25       15       375                 225
9                 35          28              5        11       55                  121
10                30          20              0        3        0                   9
Sum               300         170                              1155                814
Average           30          17
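The sums organized in Table 13.2 feed directly into Expressions (13.9) and (13.12). A short Python sketch (our illustration; the book itself works in Excel) reproduces the calculation:

```python
# Reproducing Table 13.2: OLS estimates for time = a + b*dist
# (illustrative sketch; the variable names are ours, not the book's)
times = [15, 20, 20, 40, 50, 25, 10, 55, 35, 30]   # Y: minutes to school
dists = [8, 6, 15, 20, 25, 11, 5, 32, 28, 20]      # X: km traveled

n = len(times)
y_bar = sum(times) / n            # 30
x_bar = sum(dists) / n            # 17

# Expression (13.12): b = sum((Xi - Xbar)(Yi - Ybar)) / sum((Xi - Xbar)^2)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(dists, times))  # 1155
s_xx = sum((x - x_bar) ** 2 for x in dists)                          # 814
b = s_xy / s_xx                   # slope, approximately 1.4189

# Expression (13.9): a = Ybar - b*Xbar
a = y_bar - b * x_bar             # intercept, approximately 5.8784

print(round(b, 4), round(a, 4))   # 1.4189 5.8784
```

For what it is worth, from Python 3.10 onward the standard library offers `statistics.linear_regression(dists, times)`, which returns the same slope and intercept in one call.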
Returning to our example, the professor then prepares a calculation spreadsheet in order to obtain the linear regression model, as shown in Table 13.2. By means of the spreadsheet presented in Table 13.2, we can calculate the estimates a and b as follows:

b = Σᵢ₌₁ⁿ (Xᵢ − X̄)(Yᵢ − Ȳ) / Σᵢ₌₁ⁿ (Xᵢ − X̄)² = 1155/814 = 1.4189

a = Ȳ − b·X̄ = 30 − 1.4189 × 17 = 5.8784

And the simple linear regression equation can be written as:

timêᵢ = 5.8784 + 1.4189·distᵢ

The estimation of our example model can also be done by means of the Solver tool in Excel, respecting the conditions that Σᵢ₌₁¹⁰ uᵢ = 0 and Σᵢ₌₁¹⁰ uᵢ² = min. We can initially open the file TimeLeastSquares.xls, which contains our example data, besides the columns referring to Ŷ, u, and u² for each observation. Fig. 13.4 presents this file, before the preparation of the Solver procedure.

FIG. 13.4 TimeLeastSquares.xls dataset.

According to the logic proposed by Belfiore and Fávero (2012), we now open the Excel Solver tool. The objective function is in cell E13, which is our target cell and which should be minimized (residual sum of squares). Besides this, parameters a and b, whose values are in cells H3 and H5, respectively, are the variable cells. Finally, we should impose that the value of cell D13 be equal to zero (the restriction that the sum of the residuals equal zero). The Solver window will be as shown in Fig. 13.5.

FIG. 13.5 Solver—minimization of the residual sum of squares.

By clicking on Solve and then OK, we obtain the solution that minimizes the residual sum of squares. Fig. 13.6 presents the results obtained by the model.

FIG. 13.6 Obtaining the parameters of the minimization of the sum of u² by Solver.

Therefore, intercept a is 5.8784 and angular coefficient b is 1.4189, in accordance with what we estimated by means of the analytical solution. Taken literally, the average time to get to school for students who travel no distance at all, that is, who are already at school, is 5.8784 min, which does not make much sense from a physical point of view. This type of situation occurs frequently: values of a are often not in keeping with reality. From the mathematical point of view, this is not incorrect; however, the researcher should always analyze the physical or economic sense of the situation under study, as well as the underlying theory used. In analyzing the graph in Fig. 13.2, we notice that there is no student with a distance traveled near zero, and the intercept only reflects the extension, projection, or extrapolation of the regression line up to the Y axis. It is even common for some models to present a negative a in the study of phenomena
that cannot offer negative values. Therefore, the researcher should always be aware of this fact, since a regression model can be quite useful for making inferences regarding the behavior of a Y variable within the limits of the variation of X, that is, for interpolation. Extrapolations, on the other hand, can produce inconsistencies, due to eventual changes in the behavior of the Y variable outside the limits of the X variation in the study sample.
Continuing the analysis, each additional kilometer of distance between the departure point and the school increases travel time by 1.4189 min, on average. As such, a student who lives 10 km farther from school than another will tend to spend, on average, a little more than 14 min (1.4189 × 10) longer to get to school than the classmate who lives closer. Fig. 13.7 presents the simple linear regression model of our example.
Concomitant with the discussion of each of the concepts and the solution of the proposed example in analytical form and with Solver, we will also present the systematic solution by means of the Excel Regression tool. In Sections 13.5 and 13.6, we will present the final solution by means of Stata and SPSS, respectively. We now open the file Timedist.xls, which contains the data from our example, that is, the fictitious travel times and the distances covered by a group of students to the school location. By clicking on Data → Data Analysis, the dialog box in Fig. 13.8 will appear. We now click on Regression and then OK. The dialog box for the insertion of the data to be considered in the regression will appear (Fig. 13.9). For our example, the time (in minutes) variable is the dependent (Y) and the dist (in kilometers) variable is the explanatory (X). Therefore, we must insert their data in the respective entry intervals, as shown in Fig. 13.10. Besides the insertion of the data, we will also select the Residuals option, as shown in Fig. 13.10.
Then, we click on OK. A new spreadsheet will be generated with the regression outputs. We will analyze each of them as the concepts are introduced, as well as perform the calculations manually. As we can observe in Fig. 13.11, four groups of outputs are generated: regression statistics, analysis of variance (ANOVA), table of regression coefficients, and residuals table. We will discuss each of them. As calculated previously, we can verify the regression equation coefficients in the outputs (Fig. 13.12).
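Solver's numeric search can also be mimicked outside Excel. The sketch below is our own rough stdlib-only analogue, not the book's procedure: plain gradient descent on the residual sum of squares, which converges to the same a and b as the closed-form solution.

```python
# Gradient descent on RSS(a, b) = sum((y - a - b*x)^2): a Solver-like
# numeric minimization (illustrative only -- the closed-form OLS solution
# derived analytically in the text is exact).
times = [15, 20, 20, 40, 50, 25, 10, 55, 35, 30]
dists = [8, 6, 15, 20, 25, 11, 5, 32, 28, 20]

a, b = 0.0, 0.0
lr = 0.0002            # small step size keeps this quadratic loss stable
for _ in range(50_000):
    grad_a = -2 * sum(y - a - b * x for x, y in zip(dists, times))
    grad_b = -2 * sum(x * (y - a - b * x) for x, y in zip(dists, times))
    a -= lr * grad_a
    b -= lr * grad_b

rss = sum((y - a - b * x) ** 2 for x, y in zip(dists, times))
print(round(a, 4), round(b, 4), round(rss, 2))   # 5.8784 1.4189 361.15
```

The minimized residual sum of squares, about 361.15, matches the value Solver reports in cell E13 of the example file.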
FIG. 13.7 Simple linear regression model between time and distance traveled.
FIG. 13.8 Dialog box for data analysis in Excel.
FIG. 13.9 Dialog box for estimation of linear regression in Excel.
13.2.2 Explanatory Power of the Regression Model: Coefficient of Determination R²
According to Fávero et al. (2009), to measure the explanatory power of a certain regression model, that is, the percentage of the variability of the Y variable that is explained by the behavior of the explanatory variables, we need to understand some important concepts. While the total sum of squares (TSS) shows the variation of Y with regard to its own average, the sum of squares due to regression (SSR) gives the variation of Y explained by the X variables used in the model. The residual sum of squares (RSS), in turn, presents the variation of Y that is not explained by the estimated model. We can therefore define:

TSS = SSR + RSS   (13.13)

since:

Yᵢ − Ȳ = (Ŷᵢ − Ȳ) + (Yᵢ − Ŷᵢ)   (13.14)
FIG. 13.10 Insertion of data for estimation of linear regression in Excel.
FIG. 13.11 Simple linear regression outputs in Excel.
FIG. 13.12 Linear regression equation coefficients.
where Yᵢ is the value of Y in each observation i of the sample, Ȳ is the average of Y, and Ŷᵢ represents the adjusted value of the regression model for each observation i. As such, we have:

Yᵢ − Ȳ: total deviation of the value of each observation in relation to the average;
Ŷᵢ − Ȳ: deviation of the value of the regression model for each observation in relation to the average;
Yᵢ − Ŷᵢ: deviation of the value of each observation in relation to the regression model,

which results in:

Σᵢ₌₁ⁿ (Yᵢ − Ȳ)² = Σᵢ₌₁ⁿ (Ŷᵢ − Ȳ)² + Σᵢ₌₁ⁿ (Yᵢ − Ŷᵢ)²   (13.15)

or:

Σᵢ₌₁ⁿ (Yᵢ − Ȳ)² = Σᵢ₌₁ⁿ (Ŷᵢ − Ȳ)² + Σᵢ₌₁ⁿ (uᵢ)²   (13.16)

which is precisely Expression (13.13). Fig. 13.13 graphically shows this relation. With these considerations made and the regression equation defined, we embark on the study of the explanatory power of the regression model, also known as the coefficient of determination R². Stock and Watson (2004) define R² as the fraction of the sample variance of Yᵢ explained (or predicted) by the explanatory variables. In the same way, Wooldridge (2012) considers R² the proportion of the sample variation of the dependent variable explained by the set of explanatory variables, which can be used as a measure of the degree of adjustment of the proposed model.
FIG. 13.13 Deviations of Y for two observations.
According to Fávero et al. (2009), the explanatory capacity of the model is analyzed by means of the coefficient of determination R² of the regression. For a simple regression model, this measure shows how much of the behavior of the Y variable is explained by the variation in the behavior of the X variable, always remembering that there is not, necessarily, a cause-and-effect relationship between the X and Y variables. For a multiple regression model, this measure shows how much of the behavior of the Y variable is explained by the joint variation of the X variables considered in the model. The R² is obtained in the following manner:

R² = SSR / (SSR + RSS) = SSR / TSS   (13.17)

or:

R² = Σᵢ₌₁ⁿ (Ŷᵢ − Ȳ)² / [Σᵢ₌₁ⁿ (Ŷᵢ − Ȳ)² + Σᵢ₌₁ⁿ (uᵢ)²]   (13.18)
Also according to Fávero et al. (2009), the R² can vary between 0 and 1 (0%–100%); however, it is practically impossible to obtain an R² equal to 1, since it would be very difficult for all the points to fall exactly on a line. In other words, if the R² were 1, there would be no residual for any of the observations in the sample under study, and the variability of the Y variable would be totally explained by the vector of X variables considered in the regression model. The more disperse the cloud of points, the weaker the relation between the X and Y variables, the greater the residuals, and the closer the R² will be to zero. In the extreme case, if the variation of X does not correspond to any variation in Y, the R² will be zero. Fig. 13.14 presents, in an illustrative manner, the behavior of R² in different cases.
Returning to our example, in which the professor intends to study whether the time students take to get to school is influenced by the distance they travel, we present the following spreadsheet (Table 13.3), which will aid us in calculating the R². The spreadsheet presented in Table 13.3 allows us to calculate the R² of the simple linear regression model of our example. As such:

R² = 1638.85 / (1638.85 + 361.15) = 0.8194

In this way, we can now affirm that, for the sample studied, 81.94% of the variability of the time to get to school is due to the variable referring to the distance traveled on the route taken by each of the students. Therefore, a little more than 18% of the variability is due to other variables not included in the model, which are captured by the variation of the residuals.
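The decomposition TSS = SSR + RSS and Expression (13.17) can be checked numerically. The sketch below is our illustration, using the parameters estimated earlier in the chapter, and reproduces the R² of 0.8194:

```python
# R^2 via the sums of squares of Expression (13.17):
# TSS = SSR + RSS and R^2 = SSR / TSS (illustrative sketch)
times = [15, 20, 20, 40, 50, 25, 10, 55, 35, 30]
dists = [8, 6, 15, 20, 25, 11, 5, 32, 28, 20]
a, b = 5.8784, 1.4189          # parameters estimated in Section 13.2.1

y_bar = sum(times) / len(times)
y_hat = [a + b * x for x in dists]

tss = sum((y - y_bar) ** 2 for y in times)                 # total variation
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)               # explained variation
rss = sum((y - yh) ** 2 for y, yh in zip(times, y_hat))    # residual variation

r2 = ssr / tss
print(round(tss, 2), round(r2, 4))   # 2000.0 0.8194
```

With rounded parameters the identity TSS = SSR + RSS holds only approximately; with the exact OLS estimates it holds exactly, as Expression (13.16) shows.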
FIG. 13.14 R² behavior for different simple linear regressions (panels show R² = 0.82, R² = 0, R² = 0.35, and R² = 1).
TABLE 13.3 Spreadsheet for the Calculation of the Coefficient of Determination R² of the Regression Model

Observation (i)   Time (Yi)   Distance (Xi)   Ŷi      ui = Yi − Ŷi   (Ŷi − Ȳ)²   (ui)²
1                 15          8               17.23   −2.23          163.08      4.97
2                 20          6               14.39   5.61           243.61      31.45
3                 20          15              27.16   −7.16          8.05        51.30
4                 40          20              34.26   5.74           18.12       32.98
5                 50          25              41.35   8.65           128.85      74.80
6                 25          11              21.49   3.51           72.48       12.34
7                 10          5               12.97   −2.97          289.92      8.84
8                 55          32              51.28   3.72           453.00      13.81
9                 35          28              45.61   −10.61         243.61      112.53
10                30          20              34.26   −4.26          18.12       18.12
Sum               300         170                                    1638.85     361.15
Average           30          17

Obs.: Ŷᵢ = timêᵢ = 5.8784 + 1.4189·distᵢ.
The outputs generated by Excel also bring out this information, as can be seen in Fig. 13.15. Note that the outputs also supply the values of Ŷ and the residuals for each observation, as well as the minimum value of the sum of the squares of the residuals, which are exactly equal to those obtained by the estimation of the parameters by means of the Excel Solver tool (Fig. 13.6) and also calculated and presented in Table 13.3. By means of these values, we can now calculate the R². According to Stock and Watson (2004) and Fávero et al. (2009), the coefficient of determination R² does not tell researchers whether a certain explanatory variable is statistically significant, nor whether this variable is the true cause of the change in behavior of the dependent variable. More than that, the R² does not provide the ability to evaluate the existence of an eventual bias from the omission of explanatory variables, nor whether the choice of the variables inserted into the proposed model was appropriate.
FIG. 13.15 Coefficient of determination R2 of the regression.
The importance given to the magnitude of the R² is often excessive. In different situations, researchers highlight the adequacy of their models by citing high R² values, even giving emphasis to a cause-and-effect relationship between the explanatory variables and the dependent variable, which is quite erroneous, since this measure merely captures the relation between the variables used in the model. Wooldridge (2012) is even more emphatic, highlighting that it is fundamental not to give considerable importance to the R² value in the evaluation of regression models. According to Fávero et al. (2009), if we are able, for example, to find a single variable that explains 40% of the behavior of stock returns, this could at first seem like a low explanatory capacity. However, if a single variable is able to capture this entire relation in a situation where innumerable other economic, financial, perceptual, and social factors exist, the model could be quite satisfactory. The general statistical significance of the model and of its estimated parameters is not given by the R², but by means of appropriate statistical tests, which we will study in the next section.
13.2.3 General Statistical Significance of the Regression Model and Each of Its Parameters
To begin, it is of fundamental importance to study the general statistical significance of the estimated model. With this in mind, we make use of the F-test, whose null and alternative hypotheses, for a general regression model, are:

H0: b₁ = b₂ = … = bₖ = 0
H1: there is at least one bⱼ ≠ 0

For a simple regression model, these hypotheses reduce to:

H0: b = 0
H1: b ≠ 0
This test allows the researcher to verify whether the estimated model in fact exists, since, if all the bⱼ (j = 1, 2, …, k) are statistically equal to zero, the behavior of the explanatory variables will not influence in any way the behavior of the dependent variable. The F statistic is given by the following expression:

F = [SSR/(k − 1)] / [RSS/(n − k)] = [Σᵢ₌₁ⁿ (Ŷᵢ − Ȳ)² / (k − 1)] / [Σᵢ₌₁ⁿ (uᵢ)² / (n − k)]   (13.19)

where k represents the number of parameters of the estimated model (including the intercept) and n the size of the sample. We can also obtain an expression for the F statistic based on the R² presented in Expression (13.17). As such, we have:

F = [R²/(k − 1)] / [(1 − R²)/(n − k)]   (13.20)
Returning, then, to our initial example, we obtain:

F = [1638.85/(2 − 1)] / [361.15/(10 − 2)] = 36.30

where, for 1 degree of freedom for the regression (k − 1 = 1) and 8 degrees of freedom for the residuals (n − k = 10 − 2 = 8), we have, by means of Table A in the Appendix, that Fc = 5.32 (critical F at the significance level of 5%). In this way, since the calculated F, Fcal = 36.30 > Fc = F(1, 8, 5%) = 5.32, we can reject the null hypothesis that all the bⱼ (j = 1) parameters are statistically equal to zero. At least one X variable is statistically significant in explaining the variability of Y, and we have a statistically significant regression model for forecasting purposes. As, in this case, we have only one X variable (simple regression), it will be statistically significant, at the 5% significance level, in explaining the behavior of the variation of Y. The outputs present, by means of the analysis of variance (ANOVA), the F statistic and its corresponding significance level (Fig. 13.16).
Software packages such as Excel, Stata, and SPSS do not directly report Fc for the defined degrees of freedom and the chosen significance level. However, they do report the significance level of Fcal for those degrees of freedom. As such, instead of verifying whether Fcal > Fc, we should verify whether the significance level of Fcal is less than 0.05 (5%) in order to continue the regression analysis. Excel calls this significance level the F significance. If the F significance is less than 0.05, the analysis proceeds to the statistical significance of each parameter, evaluated by means of the t-test. For parameter b of our example, tcal = 6.0252 > tc = t(8; 2.5%) = 2.306. We can, therefore, reject the null hypothesis in this case, that is, at the 5% significance level we cannot affirm that this parameter is statistically equal to zero. These outputs are shown in Fig. 13.19.
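Both the F statistic of Expression (13.19) and the identity F = t² that holds for the slope of a simple regression can be reproduced from the sums of squares already computed. The following sketch is our illustration, not part of the book's Excel workflow; the critical values quoted in the text come from its Table A:

```python
# F-test of overall significance, Expression (13.19), plus the identity
# F = t^2 for the slope of a simple regression (illustrative sketch).
import math

ssr, rss = 1638.85, 361.15       # sums of squares from Table 13.3
n, k = 10, 2                     # sample size; parameters incl. intercept

f_stat = (ssr / (k - 1)) / (rss / (n - k))

# t statistic of the slope: b / s.e.(b), with s = sqrt(RSS / (n - k))
s_xx = 814                       # sum((Xi - Xbar)^2) from Table 13.2
b = 1155 / s_xx                  # slope estimate
se_b = math.sqrt(rss / (n - k)) / math.sqrt(s_xx)
t_stat = b / se_b

print(round(f_stat, 2), round(t_stat, 4))   # 36.3 6.0252
```

Both values match the figures quoted in the text (Fcal = 36.30 and tcal = 6.0252), and t_stat² agrees with f_stat up to rounding of the sums of squares.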
FIG. 13.18 Standard error calculation.
Analogous to the F-test, instead of verifying whether tcal > tc for each parameter, we directly verify whether the significance level (P-value) of each tcal is less than 0.05 (5%), in order to keep the parameter in the final model. The P-value of each tcal can be obtained in Excel by means of the command Formulas → Insert Function → DISTT, which opens the dialog box shown in Fig. 13.20. In this figure, the dialog boxes corresponding to parameters a and b are already filled in. It is important to mention that, for simple regressions, the statistic F = t² for parameter b, as shown by Fávero et al. (2009). In our example, therefore, we can verify that:

tb² = (6.0252)² = 36.30 = F

Since hypothesis H1 of the F-test states that at least one b parameter is statistically different from zero at a certain significance level, and since a simple regression presents only one b parameter, if H0 is rejected in the F-test, it will also be rejected in the t-test of this b parameter. For the a parameter, however, since tcal < tc (P-value of tcal for the a parameter > 0.05) in our example, we could consider estimating a new regression that forces the intercept to be equal to zero. This can be done by means of the Excel Regression dialog box, by selecting the option Constant is zero. However, we will not carry out such a procedure, since the nonrejection of the null hypothesis that the a parameter is statistically equal to zero is due to the small sample used, and it does not prevent the researcher from making forecasts by means of the model obtained. The imposition that a be zero could generate forecast bias, by producing another model that would not be the most adequate for interpolations in the data. Fig. 13.21 illustrates this fact.
FIG. 13.19 Calculation of coefficients and significance t-test of parameters.

FIG. 13.20 Obtaining the levels of significance of t for parameters a and b (command Insert Function).
FIG. 13.21 Original regression model and the model with the intercept equal to zero.

In this way, the fact that we cannot reject, at a certain significance level, that the a parameter is equal to zero does not necessarily imply that we should exclude it from the model. If this is the researcher's decision, however, it is important to at least be aware that the result will be a different model from the original, with consequences for the preparation of forecasts. The nonrejection of the null hypothesis for a b parameter at a certain significance level, on the other hand, indicates that the corresponding X variable does not correlate with the Y variable and, therefore, should be excluded from the final model. When, later in this chapter, we present the regression analysis by means of the Stata (Section 13.5) and SPSS (Section 13.6) software, the Stepwise procedure will be introduced. It automatically excludes or keeps the b parameters in the model as a function of the criteria presented, and offers a final model whose b parameters are statistically different from zero at the chosen significance level.
13.2.4 Construction of the Confidence Intervals of the Model Parameters and Elaboration of Predictions

The confidence intervals for the a and bⱼ (j = 1, 2, …, k) parameters, at the 95% confidence level, can be written, respectively, as follows:

P[ a − t_{α/2}·√( (Σᵢ₌₁ⁿ uᵢ²/(n − k)) · (1/n + X̄²/Σᵢ₌₁ⁿ (Xᵢ − X̄)²) ) ≤ a ≤ a + t_{α/2}·√( (Σᵢ₌₁ⁿ uᵢ²/(n − k)) · (1/n + X̄²/Σᵢ₌₁ⁿ (Xᵢ − X̄)²) ) ] = 95%

P[ bⱼ − t_{α/2}·s.e./√( Σᵢ₌₁ⁿ Xᵢ² − (Σᵢ₌₁ⁿ Xᵢ)²/n ) ≤ bⱼ ≤ bⱼ + t_{α/2}·s.e./√( Σᵢ₌₁ⁿ Xᵢ² − (Σᵢ₌₁ⁿ Xᵢ)²/n ) ] = 95%   (13.22)

where s.e. is the standard error of the regression. Therefore, for our example, we have:

Parameter a:

P[ 5.8784 − 2.306·√( (361.1486/8)·(1/10 + 289/814) ) ≤ a ≤ 5.8784 + 2.306·√( (361.1486/8)·(1/10 + 289/814) ) ] = 95%

P[ −4.5731 ≤ a ≤ 16.3299 ] = 95%

Since the confidence interval for parameter a contains zero, we cannot reject, at the 95% confidence level, that this parameter is statistically equal to zero, in accordance with what was verified when calculating the t statistic.
Simple and Multiple Regression Models Chapter
Parameter b:
13
463
3
2
7 6 7 6 6:7189 6:7189 6 P61:4189 2:306 sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi b 1:4189 + 2:306 sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi7 7 ¼ 95% 4 ð170Þ2 ð170Þ2 5 3704 3704 10 10 P½0:8758 b 1:9619 ¼ 95% Being that the confidence level for parameter b does not contain zero, we can reject, at the 95% confidence level, that this parameter is statistically equal to zero, also according to what has been verified when calculating the t statistic. These intervals are also generated in the Excel outputs. Being that the software standard is to use the 95% confidence level, these intervals are shown twice, so as to allow the researcher to manually alter the confidence level desired by selecting the Confidence Level option in the Excel Regression dialog box, but still have the ability to analyze the intervals for the confidence level most commonly used (95%). In other words, the confidence level intervals of 95% in Excel will always be presented, giving the researcher the ability to analyze the intervals from another confidence level in parallel. We will, therefore, alter the regression dialog box (Fig. 13.22) in order to allow the software to also calculate the interval parameters for the confidence level of, for example, 90%. These outputs are presented in Fig. 13.23. It can be seen that the lower and upper bands are symmetrical in relation to the estimated average parameter and offer the researcher the ability to prepare forecasts with a certain confidence level. In the case of parameter b from our example, being that the extremes of the lower and upper bands are positive, we can say that this parameter is positive, with 95% confidence. Besides this, we can also way that the interval [0.8758; 1.9619] contains b with 95% confidence. Different from what we did for the 95% confidence level, we will not manually calculate the intervals for the 90% confidence level. 
However, an analysis of the Excel outputs allows us to affirm that the interval [0.9810; 1.8568] contains b with 90% confidence. We can thus say that the lower the confidence level, the narrower (smaller amplitude) the interval needed to contain a certain parameter; conversely, the higher the confidence level, the greater the amplitude of the interval that contains it. Fig. 13.24 illustrates what happens when we have a dispersed cloud of points surrounding a regression model.
FIG. 13.22 Alteration of the confidence level of the intervals of the parameters to 90%.
PART VI Generalized Linear Models
FIG. 13.23 Intervals with confidence levels of 95% and 90% for each of the parameters.
FIG. 13.24 Confidence intervals for a dispersion of points surrounding a regression model.
[Axes: traveled distance (km) × time to school (min)]
We can note that, even though parameter a is positive and mathematically equal to 5.8784, we cannot affirm that it is statistically different from zero for this small sample, since its confidence interval contains an intercept equal to zero (the origin). A larger sample could solve this problem. For parameter b, however, we can note that the slope has always been positive, with an average value mathematically calculated as 1.4189, and we can visually notice that its confidence interval does not contain a slope equal to zero. As has already been discussed, the rejection of the null hypothesis for parameter b, at a certain significance level, indicates that the corresponding X variable is correlated to the Y variable and, consequently, should remain in the final model. Therefore, we can conclude that the decision to exclude an X variable from a certain regression model can be made by means of
Simple and Multiple Regression Models Chapter 13
BOX 13.1 Decision to Include bj Parameters in Regression Models

t Statistic (for significance level α)   t-Test (P-value for significance level α)   Analysis of Confidence Interval   Decision
tcal < tc(α/2)                           P-value > significance level α              Interval contains zero            Exclude parameter from model
tcal > tc(α/2)                           P-value < significance level α              Interval does not contain zero    Maintain parameter in model

Obs.: The most common practice in the applied social sciences is the adoption of significance level α = 5%.
a direct analysis of the t statistic of its respective parameter b (if tcal < tc, then P-value > 0.05 and we cannot reject that the parameter is statistically equal to zero) or by means of an analysis of the confidence interval (if it contains zero). Box 13.1 presents the inclusion or exclusion criteria for the parameters bj (j = 1, 2, …, k) in regression models.

After a discussion of these concepts, the professor proposed the following exercise to his students: What is the average travel time forecast (estimated Y, or Ŷ) for a student who travels 17 km to get to school? What would be the minimum and maximum values that this travel time could assume, with 95% confidence? The first part of the exercise can be solved by a simple substitution of the value Xi = 17 into the initially obtained equation:

$$\widehat{time}_i = 5.8784 + 1.4189 \cdot dist_i = 5.8784 + 1.4189 \cdot (17) = 29.9997 \text{ min}$$

The second part of the exercise takes us to the outputs in Fig. 13.23, where the a and b parameters assume intervals of [−4.5731; 16.3299] and [0.8758; 1.9619], respectively, at the 95% confidence level. As such, the equations that determine the minimum and maximum travel time values for this confidence level are:

Minimum time:
$$\widehat{time}_{min} = -4.5731 + 0.8758 \cdot dist_i = -4.5731 + 0.8758 \cdot (17) = 10.3155 \text{ min}$$

Maximum time:
$$\widehat{time}_{max} = 16.3299 + 1.9619 \cdot dist_i = 16.3299 + 1.9619 \cdot (17) = 49.6822 \text{ min}$$

We can therefore say, with 95% confidence, that a student who travels 17 km to get to school will take between 10.3155 and 49.6822 min, with an average estimated time of 29.9997 min. Obviously, the amplitude of these values is not small, because the confidence interval of parameter a is quite wide. This can be corrected by increasing the sample size or by including new, statistically significant X variables in the model (which would then become a multiple regression model); in the latter case, the R2 value would also increase.
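The estimates and intervals above can be checked numerically. The following Python sketch refits the simple regression on the chapter's data; the critical t values for 8 degrees of freedom (2.306 for 95%, 1.860 for 90%) are taken as given rather than computed, and small last-digit differences from the book's figures are rounding artifacts.

```python
# Simple regression of time (Y) on distance (X) for the 10 students,
# reproducing the confidence intervals for parameter b discussed in the text.
time = [15, 20, 20, 40, 50, 25, 10, 55, 35, 30]
dist = [8, 6, 15, 20, 25, 11, 5, 32, 28, 20]
n = len(time)
sx, sy = sum(dist), sum(time)
sxx = sum(x * x for x in dist)
sxy = sum(x * y for x, y in zip(dist, time))

b = (sxy - sx * sy / n) / (sxx - sx ** 2 / n)    # slope estimate (1.4189)
a = sy / n - b * sx / n                          # intercept estimate (5.8784)

sse = sum((y - (a + b * x)) ** 2 for x, y in zip(dist, time))
s = (sse / (n - 2)) ** 0.5                       # standard error of the regression
se_b = s / (sxx - sx ** 2 / n) ** 0.5            # standard error of b

for level, t_c in ((95, 2.306), (90, 1.860)):    # critical t, df = 8
    low, high = b - t_c * se_b, b + t_c * se_b
    print(level, round(low, 4), round(high, 4))  # close to the chapter's intervals
```

Note how the 90% interval comes out narrower than the 95% one, exactly the amplitude behavior discussed above.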
After the professor presented the results of the model to the class, a curious student raised his hand and asked: "Professor, is there any influence of the regression model coefficient of determination R2 on the amplitude of the confidence intervals? If we set up this linear regression again and substituted Y with Ŷ, what would the results be? Would the equation change? And the R2? And the confidence intervals?"

The professor then substituted Y with Ŷ and again set up the regression by means of the dataset presented in Table 13.4. The first step taken by the professor was to prepare a new scatter plot, with the estimated regression model. This graph is presented in Fig. 13.25. As we can see, all the points are now located on the regression model, since the procedure forced this situation: each Ŷi was calculated using the regression model itself. As such, we can state in advance that the R2 for this new regression is 1. Let's look at the new outputs (Fig. 13.26). As expected, the R2 is 1. Moreover, the model equation is exactly that which was previously calculated, since it is the same line. However, we can see that the F- and t-tests cause us to strongly reject their respective null hypotheses. Even parameter a, which previously could not be considered statistically different from zero, now presents a t-test that tells us that we can reject, at the 95% confidence level (or higher), that this parameter is statistically equal to zero. This occurs because previously the small sample used (n = 10 observations) did not allow us to affirm that the intercept was different from zero, since the dispersion of points generated a confidence interval that contained an intercept equal to zero (Fig. 13.24).
TABLE 13.4 Dataset for Preparation of New Regression

Observation (i)   Predicted Time (Ŷi)   Distance (Xi)
1                 17.23                  8
2                 14.39                  6
3                 27.16                 15
4                 34.26                 20
5                 41.35                 25
6                 21.49                 11
7                 12.97                  5
8                 51.28                 32
9                 45.61                 28
10                34.26                 20
FIG. 13.25 Scatter plot and linear regression model between predicted time (Ŷ) and distance traveled (X).
[Axes: traveled distance (km) × time to school (min)]
On the other hand, when all the points are on the model, each of the residual terms becomes zero, which causes the R2 to become 1. Besides, the obtained equation is no longer a model adjusted to a dispersion of points, but the very line that passes through all the points and completely explains the sample behavior. As such, we do not have a dispersion surrounding the regression model, and the confidence intervals come to have null amplitude, as we can also see in Fig. 13.26. In this case, for any confidence level, the values of each parameter interval are no longer altered, which allows us to declare, with 100% confidence, that the [5.8784; 5.8784] interval contains a and the [1.4189; 1.4189] interval contains b. In other words, in this extreme case, a is mathematically equal to 5.8784 and b is mathematically equal to 1.4189.

R2 is, therefore, an indicator of just how wide the parameter confidence intervals are. Models with higher R2 values will give the researcher the ability to make more accurate forecasts, given that the cloud of points is less dispersed along the regression model, which reduces the amplitude of the parameter confidence intervals. On the other hand, models with low R2 values can impair the preparation of forecasts because of the greater amplitude of the parameter confidence intervals, but this does not invalidate the existence of the model as such. As we have already discussed, many researchers give too much importance to the R2; however, it is the F-test that truly confirms that a regression model exists (at least one considered X variable is statistically significant to explain Y). As such, it is not rare to find very low R2 values together with statistically significant F-values in Administration, Accounting, or Economics models, which shows that the Y phenomenon studied underwent changes in its behavior due to some X variables adequately included in the model. However, forecast accuracy will be low due to the impossibility of monitoring all the variables that effectively explain the variation of that Y phenomenon. Within the aforementioned knowledge areas, such a fact can easily be found in works on Finance and the Stock Market.
FIG. 13.26 Outputs of the linear regression model between predicted time (Ŷ) and distance traveled (X).
13.2.5 Estimation of Multiple Linear Regression Models
According to Fávero et al. (2009), multiple linear regression follows the same logic as simple linear regression, however now with the inclusion of more than one explanatory X variable in the model. The use of several explanatory variables depends on the underlying theory and previous studies, as well as on the experience and good sense of the researcher, in order to give foundation to the decision. The ceteris paribus concept (holding the remaining conditions constant) should be used in multiple regression analysis, since the parameter of each variable should be interpreted in isolation. As such, in a model that has two explanatory variables, X1 and X2, each respective coefficient will be analyzed considering the other factors as constants. To illustrate multiple linear regression, we will use the same example we have used throughout this chapter. However, we will now imagine that the professor has decided to collect one more variable from each of the students: the number of traffic lights, or semaphores, each student must pass. We will call this variable sem. As such, the theoretical model becomes:

$$time_i = a + b_1 \cdot dist_i + b_2 \cdot sem_i + u_i$$
TABLE 13.5 Example: Travel Time × Distance Traveled and Number of Traffic Lights

Student        Time to Get to School (min) (Yi)   Distance Traveled to School (km) (X1i)   Number of Traffic Lights (X2i)
Gabriela       15                                  8                                        0
Dalila         20                                  6                                        1
Gustavo        20                                 15                                        0
Leticia        40                                 20                                        1
Luiz Ovidio    50                                 25                                        2
Leonor         25                                 11                                        1
Ana            10                                  5                                        0
Antonio        55                                 32                                        3
Julia          35                                 28                                        1
Mariana        30                                 20                                        1
from which, analogous to what was presented for the simple regression, we have:

$$\widehat{time}_i = a + b_1 \cdot dist_i + b_2 \cdot sem_i$$

where a, b1, and b2 are the estimates of parameters a, b1, and b2, respectively. The new dataset is found in Table 13.5, as well as in the file Timedistsem.xls. We will now algebraically develop the procedures for calculating the model parameters, as we did for the simple regression model. By means of the expression:

$$Y_i = a + b_1 X_{1i} + b_2 X_{2i} + u_i$$

we again require that the residual sum of squares be minimum. Therefore:

$$\sum_{i=1}^{n}\left(Y_i - a - b_1 X_{1i} - b_2 X_{2i}\right)^2 = \min$$
The minimization is obtained by differentiating the previous expression with respect to a, b1, and b2 and setting the resulting expressions equal to zero. Therefore:

$$\frac{\partial}{\partial a}\left[\sum_{i=1}^{n}(Y_i - a - b_1 X_{1i} - b_2 X_{2i})^2\right] = -2\sum_{i=1}^{n}(Y_i - a - b_1 X_{1i} - b_2 X_{2i}) = 0 \quad (13.23)$$

$$\frac{\partial}{\partial b_1}\left[\sum_{i=1}^{n}(Y_i - a - b_1 X_{1i} - b_2 X_{2i})^2\right] = -2\sum_{i=1}^{n}X_{1i}\,(Y_i - a - b_1 X_{1i} - b_2 X_{2i}) = 0 \quad (13.24)$$

$$\frac{\partial}{\partial b_2}\left[\sum_{i=1}^{n}(Y_i - a - b_1 X_{1i} - b_2 X_{2i})^2\right] = -2\sum_{i=1}^{n}X_{2i}\,(Y_i - a - b_1 X_{1i} - b_2 X_{2i}) = 0 \quad (13.25)$$

which generates the following system of three equations and three unknowns:

$$\begin{cases} \displaystyle\sum_{i=1}^{n} Y_i = n\,a + b_1 \sum_{i=1}^{n} X_{1i} + b_2 \sum_{i=1}^{n} X_{2i} \\[1ex] \displaystyle\sum_{i=1}^{n} Y_i X_{1i} = a \sum_{i=1}^{n} X_{1i} + b_1 \sum_{i=1}^{n} X_{1i}^2 + b_2 \sum_{i=1}^{n} X_{1i} X_{2i} \\[1ex] \displaystyle\sum_{i=1}^{n} Y_i X_{2i} = a \sum_{i=1}^{n} X_{2i} + b_1 \sum_{i=1}^{n} X_{1i} X_{2i} + b_2 \sum_{i=1}^{n} X_{2i}^2 \end{cases} \quad (13.26)$$
Dividing the first equation of Expression (13.26) by n, we arrive at:

$$a = \bar{Y} - b_1 \bar{X}_1 - b_2 \bar{X}_2 \quad (13.27)$$
By substituting Expression (13.27) into the last two equations of Expression (13.26), we arrive at the following system of two equations and two unknowns:

$$\begin{cases} \displaystyle\sum_{i=1}^{n} Y_i X_{1i} - \frac{\sum_{i=1}^{n} Y_i \sum_{i=1}^{n} X_{1i}}{n} = b_1\left[\sum_{i=1}^{n} X_{1i}^2 - \frac{\left(\sum_{i=1}^{n} X_{1i}\right)^2}{n}\right] + b_2\left[\sum_{i=1}^{n} X_{1i}X_{2i} - \frac{\sum_{i=1}^{n} X_{1i}\sum_{i=1}^{n} X_{2i}}{n}\right] \\[2ex] \displaystyle\sum_{i=1}^{n} Y_i X_{2i} - \frac{\sum_{i=1}^{n} Y_i \sum_{i=1}^{n} X_{2i}}{n} = b_1\left[\sum_{i=1}^{n} X_{1i}X_{2i} - \frac{\sum_{i=1}^{n} X_{1i}\sum_{i=1}^{n} X_{2i}}{n}\right] + b_2\left[\sum_{i=1}^{n} X_{2i}^2 - \frac{\left(\sum_{i=1}^{n} X_{2i}\right)^2}{n}\right] \end{cases} \quad (13.28)$$

We will now manually calculate the parameters for our example model. To do this, we use the spreadsheet in Table 13.6 and substitute its values into the system represented by Expression (13.28):

$$\begin{cases} 6255 - \dfrac{(300)(170)}{10} = b_1\left[3704 - \dfrac{(170)^2}{10}\right] + b_2\left[231 - \dfrac{(170)(10)}{10}\right] \\[1ex] 415 - \dfrac{(300)(10)}{10} = b_1\left[231 - \dfrac{(170)(10)}{10}\right] + b_2\left[18 - \dfrac{(10)^2}{10}\right] \end{cases}$$

which results in:

$$\begin{cases} 1155 = 814\,b_1 + 61\,b_2 \\ 115 = 61\,b_1 + 8\,b_2 \end{cases}$$

Solving the system, we arrive at:

$$b_1 = 0.7972 \quad \text{and} \quad b_2 = 8.2963$$

and, therefore:

$$a = \bar{Y} - b_1\bar{X}_1 - b_2\bar{X}_2 = 30 - 0.7972 \cdot (17) - 8.2963 \cdot (1) = 8.1512$$
TABLE 13.6 Spreadsheet to Calculate the Parameters for the Multiple Linear Regression

Obs. (i)   Yi    X1i   X2i   YiX1i   YiX2i   X1iX2i   (Yi)²    (X1i)²   (X2i)²
1          15    8     0     120     0       0        225      64       0
2          20    6     1     120     20      6        400      36       1
3          20    15    0     300     0       0        400      225      0
4          40    20    1     800     40      20       1600     400      1
5          50    25    2     1250    100     50       2500     625      4
6          25    11    1     275     25      11       625      121      1
7          10    5     0     50      0       0        100      25       0
8          55    32    3     1760    165     96       3025     1024     9
9          35    28    1     980     35      28       1225     784      1
10         30    20    1     600     30      20       900      400      1
Sum        300   170   10    6255    415     231      11,000   3704     18
Average    30    17    1
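The manual solution of the reduced system can be checked directly from the data. The following Python sketch computes the centered sums of Expression (13.28) (the same quantities tabulated in Table 13.6) and solves the 2×2 system by Cramer's rule:

```python
# Solving the reduced system of Expression (13.28) for the travel-time example,
# reproducing the manual calculation above.
y  = [15, 20, 20, 40, 50, 25, 10, 55, 35, 30]
x1 = [8, 6, 15, 20, 25, 11, 5, 32, 28, 20]
x2 = [0, 1, 0, 1, 2, 1, 0, 3, 1, 1]
n = len(y)

def cross(u, v):
    # centered cross-product: sum(u*v) - sum(u)*sum(v)/n
    return sum(p * q for p, q in zip(u, v)) - sum(u) * sum(v) / n

s11, s22, s12 = cross(x1, x1), cross(x2, x2), cross(x1, x2)   # 814, 8, 61
s1y, s2y = cross(x1, y), cross(x2, y)                         # 1155, 115

det = s11 * s22 - s12 ** 2                                    # Cramer's rule
b1 = (s1y * s22 - s12 * s2y) / det
b2 = (s11 * s2y - s12 * s1y) / det
a = sum(y) / n - b1 * sum(x1) / n - b2 * sum(x2) / n          # Expression (13.27)
print(round(a, 4), round(b1, 4), round(b2, 4))                # 8.1512 0.7972 8.2963
```

The printed values match the parameters obtained algebraically above.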
TABLE 13.7 Spreadsheet to Calculate the Remaining Statistics

Observation (i)   Time (Yi)   Distance (X1i)   Traffic Lights (X2i)   Ŷi       ui = Yi − Ŷi   (Ŷi − Ȳ)²   (ui)²
1                 15          8                0                      14.53     0.47           239.36      0.22
2                 20          6                1                      21.23    −1.23            76.90      1.51
3                 20          15               0                      20.11    −0.11            97.83      0.01
4                 40          20               1                      32.39     7.61             5.72     57.89
5                 50          25               2                      44.67     5.33           215.32     28.37
6                 25          11               1                      25.22    −0.22            22.88      0.05
7                 10          5                0                      12.14    −2.14           319.08      4.57
8                 55          32               3                      58.55    −3.55           815.14     12.61
9                 35          28               1                      38.77    −3.77            76.90     14.21
10                30          20               1                      32.39    −2.39             5.72      5.72
Sum               300         170              10                                             1874.85    125.15
Average           30          17               1
Therefore, the estimated equation for the time to get to school now becomes:

$$\widehat{time}_i = 8.1512 + 0.7972 \cdot dist_i + 8.2963 \cdot sem_i$$

It should be remembered that the estimation of these parameters can also be obtained by means of the Excel Solver tool, as shown in Section 13.2.1. The calculations of the coefficient of determination R2, the F- and t-statistics, and the extreme values of the confidence intervals will not be performed manually again, given that they follow exactly the same procedures already presented in Sections 13.2.2–13.2.4 and can be done by means of the respective expressions presented so far. Table 13.7 can be of help in this sense. Let's go directly to the preparation of this multiple linear regression in Excel (file Timedistsem.xls). In the regression dialog box, we should jointly select the variables referring to the distance traveled and the number of traffic lights, as shown in Fig. 13.27. Fig. 13.28 presents the generated outputs. Within these outputs, we find the parameters of our multiple linear regression model, as determined algebraically.

At this time, it is important to introduce the concept of the adjusted R2. According to Fávero et al. (2009), when we wish to compare the coefficient of determination (R2) between two models with different sample sizes or distinct quantities of parameters, the use of the adjusted R2 becomes necessary. It is a measure of the R2 of the regression estimated by the OLS method, adjusted by the number of degrees of freedom, since the sample estimate of R2 tends to overestimate the population parameter. The adjusted R2 expression is:

$$R^2_{adjust} = 1 - \frac{n-1}{n-k}\left(1 - R^2\right) \quad (13.29)$$

where n is the size of the sample and k is the number of regression model parameters (number of explanatory variables plus the intercept). When the number of observations is very large, the adjustment by degrees of freedom becomes negligible; however, when there is a significantly different number of X variables for the two samples, the adjusted R2 should be used for the comparison between models, and the model with the higher adjusted R2 should be chosen. R2 increases when a new variable is added to the model; however, the adjusted R2 will not always increase, and may well decrease or even become negative. For this last case, Stock and Watson (2004) explain that the adjusted R2 can become negative when the explanatory variables, taken as a set, reduce the residual sum of squares by such a small amount that this reduction is unable to compensate for the factor (n − 1)/(n − k).
FIG. 13.27 Multiple linear regression—joint selection of set of explanatory variables.
FIG. 13.28 Multiple linear regression outputs in Excel.
For our example, we have:

$$R^2_{adjust} = 1 - \frac{10-1}{10-3}\left(1 - 0.9374\right) = 0.9195$$
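The R2 and adjusted R2 figures can be verified numerically with Expression (13.29). The sketch below uses the rounded parameters estimated above together with the Table 13.5 data:

```python
# R^2 and adjusted R^2 (Expression 13.29) for the multiple regression,
# using the parameters estimated in the chapter (rounded to four decimals).
y  = [15, 20, 20, 40, 50, 25, 10, 55, 35, 30]
x1 = [8, 6, 15, 20, 25, 11, 5, 32, 28, 20]
x2 = [0, 1, 0, 1, 2, 1, 0, 3, 1, 1]
a, b1, b2 = 8.1512, 0.7972, 8.2963

n, k = len(y), 3                      # k = 2 explanatory variables + intercept
y_bar = sum(y) / n
sse = sum((yi - (a + b1 * u + b2 * v)) ** 2 for yi, u, v in zip(y, x1, x2))
sst = sum((yi - y_bar) ** 2 for yi in y)

r2 = 1 - sse / sst
r2_adj = 1 - (n - 1) / (n - k) * (1 - r2)
print(round(r2, 4), round(r2_adj, 4))   # 0.9374 0.9195
```

Note that sse reproduces the 125.15 total of Table 13.7 and sst equals 2000, the total sum of squares around the mean of 30.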
Therefore, instead of the simple regression initially applied, we should now opt for this multiple regression as a better model to study the behavior of the travel time to get to school, since the adjusted R2 is higher in this case. Let's continue with the analysis of the remaining outputs. Initially, the F-test informs us that at least one of the X variables is statistically significant to explain the behavior of Y. Besides this, we can also verify, at the 5% significance level, that all the parameters (a, b1, and b2) are statistically different from zero (P-values < 0.05).

[…]

$$p(Y_i = m) = \begin{cases} p_{logit_i} + \left(1 - p_{logit_i}\right)\left(\dfrac{1}{1 + \phi\,u_i}\right)^{\frac{1}{\phi}}, & m = 0 \\[2ex] \left(1 - p_{logit_i}\right)\dfrac{\Gamma\!\left(m + \frac{1}{\phi}\right)}{\Gamma\!\left(\frac{1}{\phi}\right)\Gamma(m+1)}\left(\dfrac{1}{1 + \phi\,u_i}\right)^{\frac{1}{\phi}}\left(\dfrac{\phi\,u_i}{1 + \phi\,u_i}\right)^{m}, & m = 1, 2, \ldots \end{cases} \quad (15.36)$$

with Y ~ ZINB(φ, u, plogit), where ZINB means zero-inflated negative binomial and φ represents the inverse of the shape parameter of a determined Gamma distribution. Analogous to what was presented for the zero-inflated Poisson regression models, we have:

$$p_{logit_i} = \frac{1}{1 + e^{-\left(\gamma + \delta_1 W_{1i} + \delta_2 W_{2i} + \cdots + \delta_q W_{qi}\right)}} \quad (15.37)$$

and

$$u_i = e^{\left(a + b_1 X_{1i} + b_2 X_{2i} + \cdots + b_k X_{ki}\right)} \quad (15.38)$$

We can again see that, if plogiti = 0, the probability distribution in Expression (15.36) reduces to the Poisson-Gamma distribution, including in cases where Yi = 0. The zero-inflated negative binomial regression models therefore also present two processes generating zeros, resulting from the binary distribution and from the Poisson-Gamma distribution. Based on Expression (15.36), and on the logarithmic likelihood function defined in Expression (15.29), we arrive at the following objective function, whose intent is to estimate the φ, a, b1, b2, …, bk and γ, δ1, δ2, …, δq parameters of a determined zero-inflated negative binomial regression model:

$$LL = \sum_{Y_i=0} \ln\!\left[p_{logit_i} + \left(1-p_{logit_i}\right)\left(\frac{1}{1+\phi\,u_i}\right)^{\frac{1}{\phi}}\right] + \sum_{Y_i>0}\left[\ln\!\left(1-p_{logit_i}\right) + Y_i\ln\!\left(\frac{\phi\,u_i}{1+\phi\,u_i}\right) - \frac{1}{\phi}\ln\!\left(1+\phi\,u_i\right) - \ln\Gamma(Y_i+1) - \ln\Gamma\!\left(\frac{1}{\phi}\right) + \ln\Gamma\!\left(Y_i+\frac{1}{\phi}\right)\right] = \max \quad (15.39)$$

whose solution can also be obtained by means of optimization tools such as the Excel Solver. Next, we will present an example prepared in Stata where the parameters of a Poisson and of a negative binomial regression model, both with inflated zeros, are estimated. First, the significance of the amount of zeros in the Y dependent variable will be evaluated (Vuong test); then, the significance of the parameter φ (likelihood-ratio test for φ), that is, the existence of overdispersion in the data. Box 15.2 presents the relation between the regression models for count data and the existence of overdispersion and excess of zeros in the data of the dependent variable.
BOX 15.2 Regression Models for Count Data, Overdispersion, and Excess of Zeros in the Data of the Dependent Variable

Verification                                               Poisson   Negative Binomial   Zero-Inflated Poisson (ZIP)   Zero-Inflated Negative Binomial (ZINB)
Overdispersion in the data of the dependent variable       No        Yes                 No                            Yes
Excessive amount of zeros in the data of the dependent     No        No                  Yes                           Yes
variable
Regression Models for Count Data: Poisson and Negative Binomial Chapter 15
In this way, the zero-inflated Poisson and negative binomial models are more appropriate when there is an excessive amount of zeros in the dependent variable, and the use of the latter is even more recommended when there is also overdispersion in the data.
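Expression (15.36) can be sanity-checked numerically: for any valid choice of plogit, u, and φ, the probabilities must sum to 1. The Python sketch below uses hypothetical parameter values (plogit = 0.3, u = 2, φ = 0.5), chosen only for illustration and not taken from any model in this chapter:

```python
import math

# ZINB probability function of Expression (15.36); the log-gamma form
# avoids overflow of Gamma(m + 1/phi) for large counts m.
def zinb_pmf(m, p_logit, u, phi):
    log_base = -(1.0 / phi) * math.log(1.0 + phi * u)   # ln[(1/(1+phi*u))^(1/phi)]
    if m == 0:
        # structural zeros plus zeros from the count component
        return p_logit + (1.0 - p_logit) * math.exp(log_base)
    log_count = (math.lgamma(m + 1.0 / phi) - math.lgamma(1.0 / phi)
                 - math.lgamma(m + 1.0) + log_base
                 + m * math.log(phi * u / (1.0 + phi * u)))
    return (1.0 - p_logit) * math.exp(log_count)

p0 = zinb_pmf(0, 0.3, 2.0, 0.5)
total = sum(zinb_pmf(m, 0.3, 2.0, 0.5) for m in range(200))
print(round(p0, 4), round(total, 4))   # 0.475 1.0
```

With these values, P(Y = 0) = 0.3 + 0.7 · (1/(1+1))² = 0.475, making visible how the two zero-generating processes add up.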
A.2 Example: Zero-Inflated Poisson Regression Model in Stata
So as to prepare zero-inflated regression models, we will use the Accidents.dta dataset. To prepare this dataset, the amount of traffic accidents that occurred in 100 cities in a determined country was investigated, which represents a dependent variable with count data. Besides this, the average age of inhabitants with a current driver's license, the fact that the municipality had adopted a dry law for after 10:00 p.m., and the urban population of each municipality were inserted into the dataset. The desc command allows us to study the dataset characteristics, as shown in Fig. 15.76.
FIG. 15.76 Description of the Accidents.dta dataset.
In this example, we will define the pop variable as the X variable, and the age and drylaw variables as the W1 and W2 variables. In other words, our goal is to see whether the probability of no accidents, that is, the occurrence of structural zeros, is influenced by the average age of drivers and by the existence of a dry law for after 10:00 p.m. in the municipalities and, besides this, whether the occurrence of a determined accident count in the week under study is influenced by the population of each municipality i (i = 1, …, 100). Therefore, for the zero-inflated Poisson regression model, the parameters of the following expressions should be estimated:

$$p_{logit_i} = \frac{1}{1 + e^{-\left(\gamma + \delta_1 \cdot age_i + \delta_2 \cdot drylaw_i\right)}}$$

and

$$\lambda_i = e^{\left(a + b \cdot pop_i\right)}$$

First, let's analyze the distribution of the accidents variable, typing in the following commands:

tab accidents
hist accidents, discrete freq
Figs. 15.77 and 15.78 present the table of frequencies and the histogram, respectively. By their means, it is possible to see that, for the country under study, 58% of the municipalities analyzed did not present any traffic accident in the week researched, which indicates, at least preliminarily, the existence of an excessive amount of zeros in the dependent variable.
FIG. 15.77 Frequency distribution for count data of accidents variable.
FIG. 15.78 Histogram of accidents dependent variable.
To elaborate the zero-inflated Poisson regression model, we should type in the following command: zip accidents pop, inf(age drylaw) vuong nolog
where the X explanatory variable (pop) should come immediately after the dependent variable (accidents), and the W1 and W2 variables (age and drylaw) should come in parentheses, immediately after the term inf, which means inflate and corresponds to the inflation of structural zeros. The term vuong causes the Vuong test (1989) to be executed, which verifies the adequacy of the zero-inflated model in relation to the specified traditional model (in this case, Poisson); that is, its goal is
to verify the existence of an excessive amount of zeros in the dependent variable. The term nolog omits the outputs referring to the modeling iterations, presenting only the maximum value of the logarithmic likelihood function. Besides this, it is important to mention that the command presented implicitly offers, as standard, the logit model probability expression to verify the existence of structural zeros referring to the Bernoulli distribution. However, in case the researcher opts to work with the probit model probability expression, studied in the Appendix of Chapter 14, the term probit should be added to the end of the command. The outputs are found in Fig. 15.79.
FIG. 15.79 Zero-inflated Poisson regression model outputs in Stata.
The first result that should be analyzed refers to the Vuong test, whose statistic is normally distributed, with positive and significant values indicating the adequacy of the zero-inflated Poisson model, and negative and significant values indicating the adequacy of the traditional Poisson model. For the data in our example, we can see that the Vuong test indicates the better adequacy of the zero-inflated model over the traditional model, since z = 4.19 and Pr > z = 0.000. Before analyzing the remaining outputs, it is important to mention that Desmarais and Harden (2013) propose a correction to the Vuong test, based on the Akaike information criterion (AIC) and the Bayesian (Schwarz) information criterion (BIC) statistics, which should be elaborated so as to eliminate eventual biases that can affect the decision regarding the more adequate model. To do this, one need only substitute the term zip with zipcv (which means zero-inflated Poisson with corrected Vuong), and the new command will be as follows:

zipcv accidents pop, inf(age drylaw) vuong nolog
However, before running it in Stata, we should install the zipcv command by typing findit zipcv and clicking on the link st0319 from http://www.stata-journal.com/software/sj13-4. Next, we should click on click here to install. The new outputs are found in Fig. 15.80. For the data in our example, while the Vuong test statistic is z = 4.19, the AIC- and BIC-corrected statistics are z = 4.13 and z = 4.04, respectively, all presenting Pr > z = 0.000. In other words, the results of the Vuong test with AIC and BIC correction continue to allow us to state that, in this case, the zero-inflated model is the most appropriate.
FIG. 15.80 Zero-inflated Poisson regression model with Vuong test correction outputs in Stata.
Notice that the remaining outputs presented in Figs. 15.79 and 15.80 are exactly the same. Based on these outputs, we can see that the estimated parameters are statistically different from zero at 95% confidence, and the final expressions of plogiti and of λi are given by:

$$p_{logit_i} = \frac{1}{1 + e^{-\left(-11.729 + 0.225 \cdot age_i + 1.726 \cdot drylaw_i\right)}}$$

and

$$\lambda_i = e^{\left(0.933 + 0.504 \cdot pop_i\right)}$$

A more curious researcher can obtain these same outputs by means of the Accidents ZIP Maximum Likelihood.xls file, using the Excel Solver tool, as has been the standard adopted throughout this chapter and book. In this file, the Solver criteria have already been defined. Therefore, using Expression (15.32) and the estimated parameters, we can algebraically calculate, in the following way, the average expected amount of weekly traffic accidents in a municipality of 700,000 inhabitants, with an average driver age of 40, and that does not adopt a dry law for after 10:00 p.m.:

$$\lambda_{inflate} = \left\{1 - \frac{1}{1 + e^{-\left[-11.729 + 0.225 \cdot (40) + 1.726 \cdot (0)\right]}}\right\} \cdot e^{\left[0.933 + 0.504 \cdot (0.700)\right]} = 3.39$$

The researcher can find the same result by typing the following command, whose output is found in Fig. 15.81:

mfx, at(pop=0.7 age=40 drylaw=0)
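The hand calculation above can also be reproduced outside Stata. A minimal Python sketch, using the rounded coefficients read from the Fig. 15.79 output and assuming, as in the text, that pop is measured in millions of inhabitants:

```python
import math

# Expected weekly accident count under the estimated ZIP model:
# E(Y) = (1 - p_logit) * lambda, with the chapter's rounded coefficients.
def zip_expected(pop, age, drylaw):
    z = -11.729 + 0.225 * age + 1.726 * drylaw   # structural-zero logit
    p_logit = 1.0 / (1.0 + math.exp(-z))         # probability of a structural zero
    lam = math.exp(0.933 + 0.504 * pop)          # Poisson count component
    return (1.0 - p_logit) * lam

print(round(zip_expected(0.7, 40, 0), 2))        # close to the 3.39 of Fig. 15.81
```

Small differences from the mfx output are due to coefficient rounding.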
Finally, by means of a graph, we can compare the predicted values for the mean number of weekly traffic accidents obtained by the zero-inflated Poisson regression model with those obtained by a traditional Poisson regression model, without
FIG. 15.81 Calculation of expected amount of weekly traffic accidents for values of explanatory variables—mfx command.
considering, therefore, the variables that influence the occurrence of structural zeros, that is, the dichotomous component (age and drylaw variables). To do this, we can type in the following sequence of commands:

quietly zipcv accidents pop, inf(age drylaw) vuong nolog
predict lambda_inf
quietly poisson accidents pop
predict lambda
graph twoway scatter accidents pop || mspline lambda_inf pop || mspline lambda pop ||, legend(label(2 "ZIP") label(3 "Poisson"))
The generated graph is found in Fig. 15.82 and, by its means, we can see that the predicted values of the zero-inflated Poisson regression model (ZIP) adjusted more adequately to the excessive amount of zeros in the dependent variable. FIG. 15.82 Expected number of weekly traffic accidents × municipality population (pop) for the ZIP and Poisson regression models.
Next, we will analyze, based on the same dataset, the results obtained by means of the zero-inflated negative binomial regression model.
A.3 Example: Zero-Inflated Negative Binomial Regression Model in Stata
Following the same logic, we will again use the Accidents.dta dataset; however, we will now focus on the estimation of a zero-inflated negative binomial model. Therefore, the parameters for the following expressions will be estimated.
$$p_{logit_i} = \frac{1}{1 + e^{-\left(\gamma + \delta_1 \cdot age_i + \delta_2 \cdot drylaw_i\right)}}$$

and

$$u_i = e^{\left(a + b \cdot pop_i\right)}$$

As has been done throughout the chapter, we first analyze the mean and variance of the accidents variable, typing in the following command:

tabstat accidents, stats(mean var)
Fig. 15.83 presents the generated result. FIG. 15.83 Mean and variance of the accidents dependent variable.
As we can see, the variance of the dependent variable is about 14 times greater than its mean, which gives a strong indication of the existence of overdispersion in the data. Let's, therefore, go on to estimate the zero-inflated negative binomial regression model. To do this, we should type in the following command:

zinbcv accidents pop, inf(age drylaw) vuong nolog zip

which follows the same logic as the command used to estimate the ZIP model. Notice that we opted to use the term zinbcv (zero-inflated negative binomial with corrected Vuong) instead of the term zinb since, even though the estimated parameters are exactly the same, the former also presents the Vuong test with AIC and BIC correction. Besides this, the term zip at the end of the command causes the likelihood-ratio test for the φ (alpha in Stata) parameter to be performed; that is, it provides a comparison of the adequacy of the ZINB model in relation to the ZIP model. The outputs are presented in Fig. 15.84.
FIG. 15.84 Zero-inflated negative binomial regression model outputs in Stata.
First, we can see that the confidence interval for the f parameter, which is the inverse of the shape parameter c of the Gamma distribution and that Stata calls alpha, does not contain zero, or rather, for the 95% confidence level, we can state that f is statistically different from zero and has an estimated value equal to 1.271. By means of the likelihood-ratio test for the f parameter, we can conclude that the null hypothesis that this parameter is statistically equal to zero can be rejected at the 5% significance level (Sig. w2 ¼ 0.000 < 0.05), which proves the existence of overdispersion in the data and indicates that the ZINB model is preferable to the ZIP model. Besides this, the Vuong test with AIC and BIC correction, by presenting significant z statistics at the 95% confidence level, indicates that the zero-inflated negative binomial regression model is preferable to the traditional negative binomial model for it proves the existence of an excessive amount of zeros. We can also see that the estimated pop variable is statistically different from zero at a 95% confidence level, or rather, this variable is significant to explain the behavior of the weekly amount of traffic accidents (count component). In the same way, the age and drylaw variables are statistically significant to explain the excessive amount of zeros (structural zeros) in the accidents variable (dichotomic component). Based on these outputs, we come to the final expressions for plogiti and for ui, given by: plogiti ¼
1 1 + eð16:237 + 0:288 agei + 2:859 drylawi Þ
and ui ¼ eð0:025 + 0:866 popi Þ Then, a curious researcher can obtain these same outputs by means of the Accidents ZINB Maximum Likelihood.xls file, using the Excel Solver tool, according the standard adopted throughout this chapter and book. In this file, the Solver criteria have been previously defined. Using Expression (15.36) and the estimated parameters, we can again calculate, algebraically, the average expected amount of weekly traffic accidents for a municipality of 700,000 inhabitants, with an average age of 40 and that does not have a dry law for after 10:00 p.m., according to what follows:
u_inflate = {1 − 1 / (1 + e^[−(−16.237 + 0.288 · (40) + 2.859 · (0))])} · e^[0.025 + 0.866 · (0.700)] ≅ 1.86

The researcher can also find the same result by typing in the following command, whose output is presented in Fig. 15.85:

mfx, at(pop=0.7 age=40 drylaw=0)
FIG. 15.85 Calculation of expected amount of weekly traffic accidents for values of explanatory variables—mfx command.
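The same calculation can be reproduced outside Stata. The sketch below assumes the point estimates reported above (−16.237, 0.288, and 2.859 for the logit component; 0.025 and 0.866 for the count component) and evaluates (1 − plogit_i) · u_i for the municipality in question:

```python
import math

def zinb_expected(pop, age, drylaw):
    """Expected count for the zero-inflated model: (1 - p_logit) * u.

    Coefficients are the point estimates reported in the chapter's output.
    """
    # Logit component: probability of a structural zero
    z = -16.237 + 0.288 * age + 2.859 * drylaw
    p_logit = 1.0 / (1.0 + math.exp(-z))
    # Count component: expected count when the observation is not a structural zero
    u = math.exp(0.025 + 0.866 * pop)
    return (1.0 - p_logit) * u

# Municipality with 700,000 inhabitants (pop in millions), mean age 40, no dry law
print(round(zinb_expected(0.700, 40, 0), 2))  # ≈ 1.86
```

Note that having a dry law (drylaw = 1) raises the probability of a structural zero and therefore lowers the expected count.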
Theoretically, the modeling could be finalized at this point. However, if the researcher is also interested in estimating the parameters of a ZIP model, so as to compare them with those obtained by the ZINB model, the following sequence of commands can be typed:

eststo: quietly zip accidents pop, inf(age drylaw) vuong
prcounts lambda_inflate, plot
eststo: quietly zinb accidents pop, inf(age drylaw) vuong
prcounts u_inflate, plot
esttab, scalars(ll) se
PART VI Generalized Linear Models
FIG. 15.86 Main results obtained in ZIP and ZINB estimations.
which generates the outputs presented in Fig. 15.86. These consolidated outputs allow us to see, besides the differences between the estimated parameters of both models, that the value of the log-likelihood function (ll) is considerably higher for the ZINB model (model 2 in Fig. 15.86), which is another indication that this model is more adequate than the ZIP model for the data in our example. Another way to compare the ZINB and ZIP estimations is by analyzing the distributions of the observed and predicted probabilities of weekly accident occurrences for the two estimations, analogous to what we discussed throughout the chapter, using the variables generated by the prcounts commands. To do this, we must enter the following command, which will generate the graph in Fig. 15.87:

graph twoway (scatter u_inflateobeq u_inflatepreq lambda_inflatepreq u_inflateval, connect(l l l))
where the variables u_inflatepreq and lambda_inflatepreq correspond to the predicted probabilities of occurrence of 0 to 9 accidents obtained, respectively, by the ZINB and ZIP models. Besides this, while the variable u_inflateobeq corresponds to the observed probabilities of the dependent variable and, therefore, presents the same probability distribution shown in Fig. 15.77 for up to 9 traffic accidents, the variable u_inflateval contains the actual counts from 0 to 9, to which the observed probabilities refer.
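The logic behind inflated models can be illustrated with a short simulation (a sketch with hypothetical parameters, not the chapter's data): a fraction of the observations are structural zeros, and the remainder follow a Poisson law, so the observed share of zeros far exceeds what a plain Poisson model would predict:

```python
import numpy as np

rng = np.random.default_rng(42)
n, p_zero, lam = 100_000, 0.30, 2.0   # hypothetical parameters

# Structural zeros with probability p_zero; otherwise Poisson counts
structural = rng.random(n) < p_zero
counts = np.where(structural, 0, rng.poisson(lam, n))

observed_zeros = (counts == 0).mean()
poisson_zeros = np.exp(-lam)                       # what a plain Poisson(2) predicts
zip_zeros = p_zero + (1 - p_zero) * np.exp(-lam)   # what the inflated model predicts

print(observed_zeros, poisson_zeros, zip_zeros)
```

The observed zero share tracks the zero-inflated prediction closely, while the plain Poisson prediction falls far short, which is precisely the pattern the Vuong test detects.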
Regression Models for Count Data: Poisson and Negative Binomial (Chapter 15)
FIG. 15.87 Observed and predicted probability distributions of weekly traffic accidents for the ZINB and ZIP models.
By analyzing the graph in Fig. 15.87, we see that the distribution of probabilities predicted by the ZINB model adjusts better to the observed distribution than the one predicted by the ZIP model, for counts of up to 9 traffic accidents per week. Alternatively, as we have discussed throughout the chapter, this fact can also be verified by applying the countfit command, which offers, besides the observed and predicted probabilities for each count (from 0 to 9) of the dependent variable, the error terms resulting from the difference between the probabilities obtained by the ZINB and ZIP models. To do this, we can type the following command:

countfit accidents pop, zip zinb noestimates
which generates the outputs in Fig. 15.88 and the graph in Fig. 15.89. Figs. 15.88 and 15.89 show us, once again, that the ZINB adjustment is better than the ZIP adjustment, for the following reasons:
– While the maximum difference between the observed and predicted probabilities for the ZIP model is, in absolute value, equal to 0.070, for the ZINB model it is equal to 0.016.
– The average of these differences is 0.024 for the ZIP model and 0.006 for the ZINB model.
– The total Pearson value is lower for the ZINB model (1.789) than for the ZIP model (61.233).
The graph in Fig. 15.89 allows a visual comparison between the generated error terms, highlighting the better ZINB adjustment, since its error curve is consistently closer to zero. As was done previously, we can also graphically compare the predicted values of the mean quantity of weekly traffic accidents obtained by the ZIP and ZINB models with those obtained by the corresponding traditional Poisson and negative binomial regression models (nbreg command), without considering the variables that influence the occurrence of structural zeros (age and drylaw). To do this, we should type in the following sequence of commands:

quietly poisson accidents pop
predict lambda
quietly nbreg accidents pop
predict u
graph twoway mspline lambda_inflaterate pop || mspline u_inflaterate pop || mspline lambda pop || mspline u pop ||, legend(label(1 "ZIP") label(2 "ZINB") label(3 "Poisson") label(4 "Negative Binomial"))
FIG. 15.88 Observed and predicted probabilities for each count of the dependent variable and the respective error terms.
FIG. 15.89 Error terms resulting from the difference between the observed and predicted probabilities (ZINB and ZIP models).
The generated graph is found in Fig. 15.90.
FIG. 15.90 Expected number of weekly traffic accidents versus municipality population (pop) for the ZIP, ZINB, Poisson, and negative binomial regression models.
Two considerations can be made in relation to this graph. The first concerns the variance in the predicted quantity of weekly traffic accidents, which causes the ZINB and negative binomial curves to be more elongated at the upper right side of the graph than those generated by the corresponding ZIP and Poisson models, which are not able to capture the existence of overdispersion in the data. Besides this, we can also see that the predicted values generated by the ZINB and ZIP models adjust better to the excessive amount of zeros than those of the Poisson and negative binomial models, since they present smaller slopes, especially for a lower number of expected accidents. As such, it is important for the researcher to have a complete notion of the regression models for count data, so as to estimate, in the best manner possible, the model parameters, always considering the nature and behavior of the dependent variable that represents the phenomenon under study.
Chapter 16
Introduction to Optimization Models: General Formulations and Business Modeling

Education is the most powerful weapon which you can use to change the world.
Nelson Mandela
16.1 INTRODUCTION TO OPTIMIZATION MODELS

Optimization models are used to solve problems in several industrial and commercial sectors (strategy, marketing, finance, operations and logistics, human resources, among others) in order to make decisions on the most effective use of resources. This chapter describes how optimization models can help researchers and managers in the decision-making process. First of all, it is important to study the main concepts involved in this process. There are many definitions for the concept of decision. One of them is that decision making refers to the process of analyzing the many available alternatives for the course of action a person will have to follow; in other words, the process resulting in the selection of a belief or a course of action among several alternative possibilities. Some examples of decisions can be listed here: choosing one location among many that are available, determining the best stock portfolio, or choosing among several alternatives that balance the company's production resources, such as personnel available, hiring, firing, and inventory. Thus, we can see that the organization's goals are directly linked to the decision-making process. In order to minimize the uncertainties, risks, and complexities that are inherent to the process, and aiming at making the most effective decision among the several alternatives available, the value and quality of the information available become essential. Communication among the stakeholders involved in the process, during the information collection phase, as well as when defining the objective and the reasoning of the group, also influences the decisions to be made.
And it is exactly with greater focus on an effective decision-making process, considering the several interfaces and exogeneities of systems and markets, that optimization models insert themselves as a field of knowledge, in order to provide the decision-making agent with a greater foundation and better knowledge of the problem being analyzed, be it in finance, economics, logistics, or marketing. On the other hand, according to Lisboa (2002), a model is a simplified representation of a real system. It can be an existing project or a future project. In the former, we intend to replicate the operations of a real existing system, in order to increase productivity. In the latter, the main goal is to define the ideal structure of the future system. The behavior of a real system is influenced by several variables involved in the decision-making process. Due to the high complexity of this system, it becomes necessary to simplify it by means of a model, in such a way that the main variables involved in the system or project that we aim to understand or control are considered in its construction, as shown in Fig. 16.1. A model is made of three main elements: (a) decision variables and parameters; (b) an objective function; and (c) constraints.
(a) Decision variables and parameters: Decision variables are the unknown values that will be determined by solving the model. The optimization models studied consider the following decision variable measurement and precision scales: continuous, discrete, or
PART VII Optimization Models and Simulation
FIG. 16.1 Modeling from a real system. Source: Andrade, E.L., 2015. Introdução à Pesquisa Operacional: Métodos e Modelos para Análise de Decisões, fifth ed. LTC, Rio de Janeiro.
binary. Decision variables should assume non-negative values. The types of variables and their respective measurement and precision scales were studied in Chapter 2. Parameters are the previously known fixed values of the problem. As examples of parameters within a mathematical model, we can mention: (a) the demand for each product in a production mix problem; (b) the variable cost to produce a certain kind of furniture; (c) the profit or cost per unit of product manufactured; (d) the cost per employee hired; (e) the unit contribution margin whenever a certain electrical appliance is manufactured and sold.
(b) Objective function: The objective function is a mathematical function that determines the target value that we intend to achieve, or the quality of the solution, based on the decision variables and on the parameters. It can be a maximization function (profit, revenue, usefulness, service level, wealth, life expectancy, among other attributes) or a minimization function (cost, risk, error, among others). As examples, we can mention the: (a) minimization of the total production cost of several types of chocolates; (b) minimization of the credit risk in a client portfolio; (c) minimization of the number of employees involved in a certain service; (d) maximization of the return on investment in stock and fixed income funds; (e) maximization of the net profit in the production of several types of soft drinks.
(c) Constraints: Constraints can be defined as a set of equations (mathematical expressions of equality) and inequalities (mathematical expressions of inequality) that the decision variables of the model should meet. Constraints are added to the model in order to consider the system's physical limitations, and they directly impact the values of the decision variables.
As examples of constraints to be considered in a mathematical model, we can mention: (a) the maximum production capacity; (b) the maximum risk a certain investor is willing to take; (c) the maximum number of vehicles available; (d) the minimum acceptable demand for a product. Modeling a decision-making process has the advantage of forcing decision makers to clearly define their goals. Furthermore, it facilitates the identification and storage of the different decisions that impact the goals, it allows us to define the main variables involved in the decision-making process and the system's own limitations, besides allowing greater interaction within the work group. Optimization models can be divided into: linear programming, network programming, integer programming, nonlinear programming, goal or multiobjective programming, and dynamic programming (see Fig. 16.2). In this chapter, we will discuss the modeling of linear programming problems. The solution of linear programming models will be presented in Chapter 17. Network programming and integer programming models will be studied in Chapters 18 and 19, respectively. Nonlinear programming, multiobjective programming, and dynamic programming models are not the focus of this book, but can be found in Belfiore and Fávero (2012, 2013).

FIG. 16.2 Classification of optimization models.
16.2 INTRODUCTION TO LINEAR PROGRAMMING MODELS

In a linear programming problem (LP), the model's objective function and all its constraints are represented by linear functions. Moreover, all the decision variables must be continuous, that is, they can assume any value in an interval of real numbers. The main goal is to maximize or minimize a certain linear function of the decision variables, subject to a set of constraints represented by linear equations or inequalities, including the non-negativity constraints of the decision variables. After constructing the mathematical model that represents the real LP problem being studied, the next step is to determine the optimal solution for the model, which is the one with the highest value (if it is a maximization problem) or the lowest value (if it is a minimization problem) of the objective function that meets the linear constraints established. Many algorithms or methods can be applied to find the optimal solution for the model; however, the Simplex method is the best known and the most common. Since George B. Dantzig developed the Simplex method in 1947, LP has been used to optimize real problems in several sectors. As examples, we can mention trade, services, banking, transportation, automobile, aviation, naval, food, beverages, agriculture and livestock, health, real estate, metallurgy, mining, paper and cellulose, electrical energy, oil, gas and fuels, computers, and the communication sector, among others. Therefore, the use of linear programming techniques in organizational environments has been helping several industries in many countries save millions and sometimes even billions of dollars. According to Winston (2004), a survey of the 500 largest American companies listed by Fortune magazine reported that 85% of the respondents used or had already used the linear programming technique.
16.3 MATHEMATICAL FORMULATION OF A GENERAL LINEAR PROGRAMMING MODEL

Linear programming problems try to determine optimal values for the decision variables x1, x2, …, xn, which must be continuous, in order to maximize or minimize linear function z, subject to a set of m linear constraints of equality (equations with an = sign) and/or of inequality (inequalities with a ≤ or a ≥ sign). The solutions that meet all the constraints, including the non-negativity ones of the decision variables, are called feasible solutions. The feasible solution that presents the best value of the objective function is called the optimal solution. The formulation of a general linear programming model can be mathematically represented as:

max or min z = f(x1, x2, …, xn) = c1x1 + c2x2 + … + cnxn
subject to:
a11x1 + a12x2 + … + a1nxn {≤, =, ≥} b1
a21x1 + a22x2 + … + a2nxn {≤, =, ≥} b2
⋮
am1x1 + am2x2 + … + amnxn {≤, =, ≥} bm
x1, x2, …, xn ≥ 0 (non-negativity constraints)   (16.1)

where:
z is the objective function;
xj are the decision variables, main or controllable, j = 1, 2, …, n;
aij is the constant or coefficient of the ith constraint for the jth variable, i = 1, 2, …, m, j = 1, 2, …, n;
bi is the independent term or amount of resources available in the ith constraint, i = 1, 2, …, m; and
cj is the constant or coefficient of the jth variable in the objective function, j = 1, 2, …, n.
16.4 LINEAR PROGRAMMING MODEL IN THE STANDARD AND CANONICAL FORMS

The previous section presented the general formulation of a linear programming problem. This section discusses the formulation in the standard and canonical forms, in addition to elementary operations that can change the formulation of linear programming problems.
16.4.1 Linear Programming Model in the Standard Form
To solve a linear programming problem, be it by the analytical method or by the Simplex algorithm, the formulation of the model should be in the standard form, that is, it must meet the following requirements:
– The independent terms of the constraints must be non-negative;
– All the constraints must be represented by linear equations and presented as an equality;
– The decision variables must be non-negative.
The standard form can be mathematically represented as:

max or min z = f(x1, x2, …, xn) = c1x1 + c2x2 + … + cnxn
subject to:
a11x1 + a12x2 + … + a1nxn = b1
a21x1 + a22x2 + … + a2nxn = b2
⋮
am1x1 + am2x2 + … + amnxn = bm
xj ≥ 0, j = 1, 2, …, n   (16.2)

The standard linear programming problem can also be written in matrix form:

min f(x) = c · x
subject to:
A · x = b
x ≥ 0

where:

A = [a11 a12 ⋯ a1n
     a21 a22 ⋯ a2n
     ⋮   ⋮       ⋮
     am1 am2 ⋯ amn],  x = [x1, x2, …, xn]ᵀ,  b = [b1, b2, …, bm]ᵀ,  c = [c1, c2, …, cn],  0 = [0, 0, …, 0]ᵀ

16.4.2 Linear Programming Model in the Canonical Form
In a linear programming model in the canonical form, the constraints must be presented as inequalities, and z can be a maximization or a minimization objective function. If z is a maximization function, all the constraints must be represented with a ≤ sign. If z is a minimization function, the constraints should have a ≥ sign. For a maximization problem, the canonical form can be mathematically represented as:

max z = f(x1, x2, …, xn) = c1x1 + c2x2 + … + cnxn
subject to:
a11x1 + a12x2 + … + a1nxn ≤ b1
a21x1 + a22x2 + … + a2nxn ≤ b2
⋮
am1x1 + am2x2 + … + amnxn ≤ bm
xj ≥ 0, j = 1, 2, …, n   (16.3)

Now, if it is a minimization problem, the canonical form becomes:

min z = f(x1, x2, …, xn) = c1x1 + c2x2 + … + cnxn
subject to:
a11x1 + a12x2 + … + a1nxn ≥ b1
a21x1 + a22x2 + … + a2nxn ≥ b2
⋮
am1x1 + am2x2 + … + amnxn ≥ bm
xj ≥ 0, j = 1, 2, …, n   (16.4)
16.4.3 Transformations Into the Standard or Canonical Form
In order for a linear programming problem to have one of the forms presented in Sections 16.4.1 and 16.4.2, some elementary operations can be carried out from a general formulation, as described here.
(1) A maximization problem can be transformed into a minimization linear programming problem:

max z = f(x1, x2, …, xn) ⇔ min −z = −f(x1, x2, …, xn)   (16.5)

Analogously, a minimization problem can be transformed into a maximization problem:

min z = f(x1, x2, …, xn) ⇔ max −z = −f(x1, x2, …, xn)   (16.6)

(2) An inequality constraint of the type ≤ can be transformed into another one of the type ≥ by multiplying both sides by (−1):

ai1x1 + ai2x2 + ⋯ + ainxn ≤ bi is equivalent to −ai1x1 − ai2x2 − ⋯ − ainxn ≥ −bi   (16.7)

Analogously, an inequality constraint of the type ≥ can be transformed into another one of the type ≤:

ai1x1 + ai2x2 + ⋯ + ainxn ≥ bi is equivalent to −ai1x1 − ai2x2 − ⋯ − ainxn ≤ −bi   (16.8)

(3) An equality constraint can be transformed into two inequality constraints:

ai1x1 + ai2x2 + ⋯ + ainxn = bi is equivalent to
ai1x1 + ai2x2 + ⋯ + ainxn ≤ bi and ai1x1 + ai2x2 + ⋯ + ainxn ≥ bi   (16.9)

(4) An inequality constraint of the type ≤ can be rewritten as an expression of equality by adding a new non-negative variable xk ≥ 0, called a slack variable, to the left-hand side (LHS):

ai1x1 + ai2x2 + ⋯ + ainxn ≤ bi is equivalent to ai1x1 + ai2x2 + ⋯ + ainxn + xk = bi   (16.10)

Analogously, an inequality constraint of the type ≥ can be transformed into an expression of equality by subtracting a new non-negative variable xk ≥ 0, called a surplus variable, from the left-hand side:

ai1x1 + ai2x2 + ⋯ + ainxn ≥ bi is equivalent to ai1x1 + ai2x2 + ⋯ + ainxn − xk = bi   (16.11)

(5) An xj variable that is unrestricted in sign, called a free variable, can be expressed as the difference between two non-negative variables:

xj = xj′ − xj″, with xj′, xj″ ≥ 0   (16.12)
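Transformation (4) is easy to verify numerically. The sketch below (with illustrative data, not from the text) appends one slack variable per ≤ constraint, turning A·x ≤ b into an equality system:

```python
import numpy as np

def add_slacks(A_ub, b_ub):
    """Rewrite A_ub @ x <= b_ub as [A_ub | I] @ [x; s] == b_ub with s >= 0."""
    A_ub = np.asarray(A_ub, dtype=float)
    m = A_ub.shape[0]
    return np.hstack([A_ub, np.eye(m)]), np.asarray(b_ub, dtype=float)

A = [[1.0, 2.0], [3.0, 1.0]]
b = [10.0, 12.0]
A_eq, b_eq = add_slacks(A, b)

x = np.array([2.0, 3.0])        # feasible point: 1*2 + 2*3 = 8 <= 10, 3*2 + 3 = 9 <= 12
s = b_eq - np.asarray(A) @ x    # slack values pick up the unused capacity: [2, 3]
assert np.all(s >= 0)
assert np.allclose(A_eq @ np.concatenate([x, s]), b_eq)
```

Each slack variable simply absorbs the unused capacity of its constraint, so any feasible point of the inequality system maps to a feasible point of the equality system.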
Example 16.1
For the following linear programming problem, rewrite it in the standard form, starting from a minimization objective function.

max z = f(x1, x2, x3, x4) = 5x1 + 2x2 − 4x3 − x4
subject to:
x1 + 2x2 − x4 ≤ 12
2x1 + x2 + 3x3 ≥ 6
x1 free, x2, x3, x4 ≥ 0

Solution
In order for the model to be rewritten in the standard form, the inequality constraints must be expressed as equalities (Expressions 16.10 and 16.11), and the free variable x1 can be expressed as the difference between two non-negative variables (Expression 16.12). Considering a minimization objective function, we have:
min −z = f(x1′, x1″, x2, x3, x4) = −5x1′ + 5x1″ − 2x2 + 4x3 + x4
subject to:
x1′ − x1″ + 2x2 − x4 + x5 = 12
2x1′ − 2x1″ + x2 + 3x3 − x6 = 6
x1′, x1″, x2, x3, x4, x5, x6 ≥ 0
Example 16.2
Convert the following problem into the canonical form.

max z = f(x1, x2, x3) = 3x1 + 4x2 + 5x3
subject to:
2x1 + 2x2 + 4x3 ≥ 320
3x1 + 4x2 + 5x3 = 580
x1, x2, x3 ≥ 0

Solution
In order for the maximization model to be written in the canonical form, the constraints must be expressed as inequalities of the type ≤. In order to do that, the expression of equality must be transformed into two inequality constraints (Expression 16.9), and the inequalities with the ≥ sign must be multiplied by (−1), as specified in Expression (16.8). The final model in the canonical form is:

max z = f(x1, x2, x3) = 3x1 + 4x2 + 5x3
subject to:
−2x1 − 2x2 − 4x3 ≤ −320
−3x1 − 4x2 − 5x3 ≤ −580
3x1 + 4x2 + 5x3 ≤ 580
x1, x2, x3 ≥ 0
16.5 ASSUMPTIONS OF THE LINEAR PROGRAMMING MODEL

In a linear programming problem, the objective function and the model constraints must be linear, the decision variables must be continuous (divisible, that is, they can assume fractional values) and non-negative, and the model parameters must be deterministic, in order to satisfy the following assumptions:
1. Proportionality
2. Additivity
3. Divisibility and non-negativity
4. Certainty
16.5.1 Proportionality
The proportionality assumption requires that, for each decision variable considered in the model, its contribution to the objective function and to the model constraints be directly proportional to the value of the decision variable. Let's imagine the following example. A company tries to maximize its production of chairs (x1) and tables (x2), and the profit per chair and per table is $4 and $7, respectively. So, objective function z is expressed as max z = 4x1 + 7x2. Fig. 16.3, adapted from Hillier and Lieberman (2005), shows the contribution of variable x1 to objective function z. We can see that, in order for the proportionality assumption to be respected, for every chair produced, the objective function must increase $4. Let's imagine that an initial set-up cost of $20 is incurred (case 1) before the production of chairs (x1) begins. In this case, the contribution of variable x1 to the objective function would be written as z = 4x1 − 20, instead of z = 4x1, not meeting the proportionality assumption. On the other hand, imagine that there are economies of scale, in a way that production costs diminish and, consequently, the marginal contribution increases as the total amount produced grows (case 2), also violating the proportionality assumption. In this case, the profit function becomes nonlinear.
FIG. 16.3 Contribution of variable x1 to objective function z.
In the same way, regarding the constraints, we assume that the contribution aij · xj of each variable to the ith constraint is directly proportional to the production level xj.
16.5.2 Additivity
The additivity assumption states that the total value of the objective function or of each constraint function of a linear programming model is expressed by the sum of the individual contributions of each decision variable. Thus, the contribution of each decision variable does not depend on the contribution of the other variables, in a way that there are no cross terms, both in the objective function and in the model constraints. Considering the previous example, the objective function is expressed as max z = 4x1 + 7x2. Through the additivity assumption, the total value of the objective function is obtained through the sum of the individual contributions of x1 and x2; for x1 = x2 = 1, that is, z = 4 + 7 = 11. If the objective function is expressed as max z = 4x1 + 7x2 + x1x2, the additivity assumption is violated (z = 4 + 7 + 1 = 12 for x1 = x2 = 1), since the model's decision variables are interdependent. In the same way, regarding each model constraint, we assume that the function's total value is expressed by the sum of each variable's individual contributions.
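This violation is easy to check numerically; a minimal sketch using the chapter's numbers:

```python
def z_linear(x1, x2):
    return 4 * x1 + 7 * x2            # additive objective function

def z_cross(x1, x2):
    return 4 * x1 + 7 * x2 + x1 * x2  # the cross term violates additivity

# Additivity holds: the total equals the sum of the individual contributions
assert z_linear(1, 1) == z_linear(1, 0) + z_linear(0, 1)  # 11 == 4 + 7

# With the cross term, the total (12) differs from the sum of contributions (11)
assert z_cross(1, 1) != z_cross(1, 0) + z_cross(0, 1)
```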
16.5.3 Divisibility and Non-negativity
Each one of the decision variables considered in the model can assume any non-negative value within an interval, including fractional values, as long as it meets the model’s constraints. When the variables being studied can only be integers, the model is called integer (linear) programming (ILP or simply IP).
16.5.4 Certainty
This assumption states that the objective function coefficients, the constraint coefficients, and the independent terms of a linear programming model are deterministic (constants and known with certainty).
16.6 MODELING BUSINESS PROBLEMS USING LINEAR PROGRAMMING

This section discusses the description and modeling of the main resource optimization problems studied in linear programming in the fields of engineering, business management, economics, and accounting: the production mix problem, capital budgeting, investment portfolio selection, production and inventory, and aggregate planning.
16.6.1 Production Mix Problem
The production mix problem aims to find the ideal quantity of certain lines of products to be manufactured that will maximize the company’s results (net profit, total profit, etc.) or minimize the production costs, respecting its limitations as
regards productive and market resources (raw material constraints, maximum production capacity, availability of human resources, maximum and minimum market demand, among others). When the amount of a certain product to be manufactured can only be an integer (cars, electrical appliances, electronic devices, etc.), we have an integer programming (IP) problem. An alternative for this kind of problem is to relax or eliminate the decision variables' integrality constraints, resulting in a linear programming problem. Luckily, sometimes, the optimal solution to the relaxed problem corresponds to the optimal solution of the original model, that is, it meets the decision variables' integrality constraints. When the solution of the relaxed problem is not an integer, rounding or integer programming algorithms must be applied to find the solution to the original problem. Further details can be found in Chapter 19, which discusses integer programming.

Example 16.3
Venix is a toy company that is reviewing its toy car and tricycle production planning. The net profit per toy car and tricycle unit produced is US$ 12.00 and US$ 60.00, respectively. The raw materials and inputs necessary to manufacture each one of these products are outsourced, and the company is in charge of the machining, painting, and assembly processes. The machining process requires 15 minutes of specialized labor per car unit and 30 minutes per tricycle unit produced. The painting process requires 6 minutes of specialized labor per car unit and 45 minutes per tricycle unit produced. The assembly process needs 6 and 24 minutes per car and tricycle unit produced, respectively. Per week, the time available for machining, painting, and assembly is 36, 22, and 15 hours, respectively. The company would like to determine how much of each product it should produce per week, respecting its resource limitations, in order to maximize its weekly net profit.
Formulate the linear programming problem that maximizes the company's net profit.
Solution
First of all, we define the model's decision variables:
xj = amount of product j to be manufactured per week, j = 1, 2.
Therefore, we have:
x1 = amount of toy cars to be manufactured per week.
x2 = amount of tricycles to be manufactured per week.
We can see that the decision variables should be integers (it is impossible to produce fractional amounts of toy cars or tricycles), so this is an integer programming (IP) problem. Luckily, in this problem, the integrality constraints can be relaxed or eliminated, since the relaxed problem's optimal solution still meets the integrality conditions. Thus, the formulation of the problem will be presented as a linear programming (LP) model. The net profit per toy car unit produced is US$ 12.00, while the net profit per tricycle is US$ 60.00. We are trying to maximize the weekly net profit generated from the amount of cars and tricycles manufactured. Therefore, the objective function can be written as follows:

Fobj = max z = 12x1 + 60x2

For the machining process, to produce one car and/or one tricycle, we need 15 minutes (0.25 hours) and 30 minutes (0.5 hours) of specialized labor, respectively (0.25x1 + 0.50x2). However, the total labor time for the machining activity cannot be higher than 36 hours/week, which generates the following constraint:

0.25x1 + 0.5x2 ≤ 36

Analogously, for the painting activity, one car and/or one tricycle produced requires 6 minutes (0.1 hours) and 45 minutes (0.75 hours) of specialized labor, respectively (0.1x1 + 0.75x2). However, the maximum amount of labor available for this activity is 22 hours/week:

0.1x1 + 0.75x2 ≤ 22

Now, the assembly process, to produce one toy car and/or one tricycle, requires 6 minutes (0.1 hours) and 24 minutes (0.4 hours) of labor, respectively (0.1x1 + 0.4x2).
The availability of human resources for this activity is 15 hours/week:

0.1x1 + 0.4x2 ≤ 15

Finally, we have the non-negativity constraints of the decision variables. All the model's constraints are:
(1) Labor availability constraints for the three activities:
0.25x1 + 0.5x2 ≤ 36 (machining)
0.1x1 + 0.75x2 ≤ 22 (painting)
0.1x1 + 0.4x2 ≤ 15 (assembly)
(2) Non-negativity constraints of the decision variables:
xj ≥ 0, j = 1, 2
The model's complete formulation can be represented as:

max z = 12x1 + 60x2
subject to:
0.25x1 + 0.50x2 ≤ 36
0.10x1 + 0.75x2 ≤ 22
0.10x1 + 0.40x2 ≤ 15
xj ≥ 0, j = 1, 2

The optimal solution can be obtained in graphical form, in analytical form, through the Simplex method, or directly from software such as Solver (in Excel), as presented in the next chapter. The model's optimal solution is x1 = 70 (toy cars per week) and x2 = 20 (tricycles per week), with z = 2040 (a weekly net profit of US$ 2,040.00).
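For readers working in Python rather than Excel Solver, the same model can be solved with SciPy's linprog; this is a sketch, and since linprog minimizes, the objective coefficients are negated:

```python
from scipy.optimize import linprog

# Venix production mix: maximize 12*x1 + 60*x2 (negated for minimization)
res = linprog(
    c=[-12, -60],
    A_ub=[[0.25, 0.50],    # machining hours per unit
          [0.10, 0.75],    # painting hours per unit
          [0.10, 0.40]],   # assembly hours per unit
    b_ub=[36, 22, 15],     # weekly hours available per activity
    bounds=[(0, None), (0, None)],  # non-negativity
    method="highs",
)
print(res.x, -res.fun)  # optimal: x1 = 70, x2 = 20, z = 2040
```

The painting and assembly constraints turn out to be binding at the optimum, while machining has slack (27.5 of 36 hours used).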
Example 16.4
Naturelat is a dairy company that manufactures the following products: yogurt, fresh white cheese, Mozzarella, Parmesan, and Provolone. Due to some strategic changes resulting from the competition in the market, the company is redefining its production mix. To manufacture each one of these five products, three types of raw materials are necessary: raw milk, cream line, and cream. Table 16.E.1 shows the amount of raw materials necessary to manufacture 1 kg of each product. The daily amount of raw materials available is limited (1200 L of raw milk, 460 L of cream line, and 650 kg of cream). The daily availability of specialized labor is also limited (170 hours/employee/day). The company needs 0.05 hours/employee to manufacture 1 kg of yogurt, 0.12 hours/employee to manufacture 1 kg of fresh white cheese, 0.09 hours/employee for Mozzarella, 0.04 hours/employee for Parmesan, and 0.16 hours/employee for Provolone. Due to contractual clauses, the company needs to produce minimum daily quantities of 320 kg of yogurt, 380 kg of fresh white cheese, 450 kg of Mozzarella, 240 kg of Parmesan, and 180 kg of Provolone. The company's commercial department guarantees that there is enough demand to absorb any production level, regardless of the product. Table 16.E.2 shows the net profit per unit of each product (US$/kg), which is calculated as the difference between the sales price and the total variable costs. The company aims to determine the quantity of each product it has to manufacture in order to maximize its results. Formulate the linear programming problem that maximizes the expected result.
Solution
First of all, we define the model's decision variables:
xj = amount of product j (in kg) to be manufactured per day, j = 1, 2, …, 5.
Therefore, we have:
x1 = amount of yogurt (in kg) to be manufactured per day.
x2 = amount of fresh white cheese (in kg) to be manufactured per day.
x3 = amount of Mozzarella (in kg) to be manufactured per day.
x4 ¼ amount of Parmesan (in kg) to be manufactured per day. x5 ¼ amount of Provolone (in kg) to be manufactured per day.
TABLE 16.E.1 Raw Materials Necessary to Manufacture 1 kg of Each Product

Product             Raw Milk (L)  Cream Line (L)  Cream (kg)
Yogurt              0.70          0.16            0.25
Fresh white cheese  0.40          0.22            0.33
Mozzarella          0.40          0.32            0.33
Parmesan            0.60          0.19            0.40
Provolone           0.60          0.23            0.47
PART VII Optimization Models and Simulation
TABLE 16.E.2 Net Profit Per Product Unit (US$/kg)

Product             Sales Price (US$/kg)  Total Variable Costs (US$/kg)  Contribution Margin (US$/kg)
Yogurt              3.20                  2.40                           0.80
Fresh white cheese  4.10                  3.40                           0.70
Mozzarella          6.30                  5.15                           1.15
Parmesan            8.25                  6.95                           1.30
Provolone           7.50                  6.80                           0.70
The total net profit per product is obtained by multiplying the net profit per unit (in this case, US$/kg) by the respective quantity sold. The problem's objective function maximizes the total net profit of all the company's products, obtained by adding the total net profits of each product:

Fobj = max z = 0.80x1 + 0.70x2 + 1.15x3 + 1.30x4 + 0.70x5

Regarding the raw material availability constraints, let us first consider the amount of raw milk (liters) used daily to manufacture each product. To manufacture 1 kg of yogurt, the company needs 0.7 L of raw milk, so 0.70x1 represents the total amount of raw milk used every day to manufacture yogurt. The amount of raw milk (liters) used daily to manufacture fresh white cheese is represented by 0.40x2. Analogously, the daily use of raw milk is 0.40x3 for Mozzarella, 0.60x4 for Parmesan, and 0.60x5 for Provolone. The total amount of raw milk (liters) used daily to manufacture all of these products can be represented by 0.70x1 + 0.40x2 + 0.40x3 + 0.60x4 + 0.60x5. This total cannot exceed 1200 L (the daily amount of raw milk available), and the constraint is represented by:

0.70x1 + 0.40x2 + 0.40x3 + 0.60x4 + 0.60x5 ≤ 1200

Likewise, the quantity of cream line (in liters) used daily to manufacture yogurt, fresh white cheese, Mozzarella, Parmesan, and Provolone cannot be higher than the maximum amount available, which is 460 L:

0.16x1 + 0.22x2 + 0.32x3 + 0.19x4 + 0.23x5 ≤ 460

Still with regard to the raw material availability constraints, in the case of cream, the quantity (kg) used daily to manufacture the five products cannot be higher than the maximum quantity available, which is 650 kg:

0.25x1 + 0.33x2 + 0.33x3 + 0.40x4 + 0.47x5 ≤ 650

We must also take the daily availability of specialized labor into consideration.
Each kilo of yogurt manufactured requires 0.05 hours-employee, so 0.05x1 represents the total hours-employee used daily in the manufacturing of yogurt. The number of hours-employee used daily to manufacture fresh white cheese is represented by 0.12x2. Similarly, 0.09x3 hours-employee are used daily to make Mozzarella, 0.04x4 to produce Parmesan, and 0.16x5 to make Provolone. The total number of hours-employee used daily to manufacture all these products can be represented by 0.05x1 + 0.12x2 + 0.09x3 + 0.04x4 + 0.16x5. This total cannot exceed 170 hours-employee, the availability of specialized human resources per day:

0.05x1 + 0.12x2 + 0.09x3 + 0.04x4 + 0.16x5 ≤ 170

We must finally consider the minimum daily market demand constraint for each product: 320 kg for yogurt (x1 ≥ 320), 380 for fresh white cheese (x2 ≥ 380), 450 for Mozzarella (x3 ≥ 450), 240 for Parmesan (x4 ≥ 240), and 180 for Provolone (x5 ≥ 180), besides the non-negativity constraints of the decision variables. All the constraints of the model can be represented as:

(1) Amount of raw materials used daily to produce yogurt, fresh white cheese, Mozzarella, Parmesan, and Provolone:
0.70x1 + 0.40x2 + 0.40x3 + 0.60x4 + 0.60x5 ≤ 1200 (raw milk)
0.16x1 + 0.22x2 + 0.32x3 + 0.19x4 + 0.23x5 ≤ 460 (cream line)
0.25x1 + 0.33x2 + 0.33x3 + 0.40x4 + 0.47x5 ≤ 650 (cream)

(2) Daily availability of specialized labor for producing yogurt, fresh white cheese, Mozzarella, Parmesan, and Provolone:
0.05x1 + 0.12x2 + 0.09x3 + 0.04x4 + 0.16x5 ≤ 170
Introduction to Optimization Models: General Formulations and Business Modeling (Chapter 16)
(3) Daily minimum demand for each product:
x1 ≥ 320 (yogurt)
x2 ≥ 380 (fresh white cheese)
x3 ≥ 450 (Mozzarella)
x4 ≥ 240 (Parmesan)
x5 ≥ 180 (Provolone)

(4) Non-negativity constraints of the decision variables: xj ≥ 0, j = 1, 2, …, 5

The complete problem is modeled here:

max z = 0.80x1 + 0.70x2 + 1.15x3 + 1.30x4 + 0.70x5
subject to:
0.70x1 + 0.40x2 + 0.40x3 + 0.60x4 + 0.60x5 ≤ 1200
0.16x1 + 0.22x2 + 0.32x3 + 0.19x4 + 0.23x5 ≤ 460
0.25x1 + 0.33x2 + 0.33x3 + 0.40x4 + 0.47x5 ≤ 650
0.05x1 + 0.12x2 + 0.09x3 + 0.04x4 + 0.16x5 ≤ 170
x1 ≥ 320
x2 ≥ 380
x3 ≥ 450
x4 ≥ 240
x5 ≥ 180
xj ≥ 0, j = 1, …, 5

Using Solver (in Excel), the model's optimal solution is x1 = 320 (kg/day of yogurt), x2 = 380 (kg/day of fresh white cheese), x3 = 690.96 (kg/day of Mozzarella), x4 = 329.95 (kg/day of Parmesan), and x5 = 180 (kg/day of Provolone) with z = 1871.55 (total daily contribution margin of US$ 1871.55).
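As a sketch of how this model could be checked outside Excel, the formulation translates directly to SciPy's `linprog` (assuming SciPy is available). `linprog` minimizes, so the contribution margins are negated, and the minimum-demand constraints become lower bounds on the variables:

```python
from scipy.optimize import linprog

c = [-0.80, -0.70, -1.15, -1.30, -0.70]   # negated contribution margins
A_ub = [
    [0.70, 0.40, 0.40, 0.60, 0.60],       # raw milk (L)    <= 1200
    [0.16, 0.22, 0.32, 0.19, 0.23],       # cream line (L)  <= 460
    [0.25, 0.33, 0.33, 0.40, 0.47],       # cream (kg)      <= 650
    [0.05, 0.12, 0.09, 0.04, 0.16],       # labor (hours)   <= 170
]
b_ub = [1200, 460, 650, 170]
# Minimum daily demands become lower bounds (non-negativity is then implied).
bounds = [(320, None), (380, None), (450, None), (240, None), (180, None)]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
print([round(x, 2) for x in res.x], round(-res.fun, 2))
```

The solver reproduces the solution reported above: x = (320, 380, 690.96, 329.95, 180) with a total daily contribution margin of about US$ 1871.55.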
16.6.2
Blending or Mixing Problem
The blending or mixing problem seeks the minimum-cost or maximum-profit combination of several ingredients used to produce one or more products. The raw materials can be ore, metals, chemical products, crude oil, or water, while the final products can be metal ingots, steel, paint, gasoline, or other chemical products. Among the many mixing problems, we can mention a few examples: 1. Mixing several types of crude oil to produce different types of gasoline. 2. Mixing chemical products to create other products. 3. Mixing different types of paper to generate recycled paper.
Example 16.5 Petrisul is an oil refinery that uses three types of crude oil (oil 1, oil 2, and oil 3) to produce three types of gasoline: regular, super, and extra. To ensure its quality, each type of gasoline requires certain specifications based on the composition of the several types of crude oil, as shown in Table 16.E.3. In order to meet its clients' demands, the refinery needs to produce at least 5000 barrels a day of regular gasoline and 3000 barrels a day each of super and extra gasoline. The daily supply available is 10,000 barrels of oil 1; 8000 of oil 2; and 7000 of oil 3. The refinery can produce up to 20,000 barrels of gasoline a day. The refinery earns $5 per barrel of regular gasoline produced, $7 per barrel of super gasoline, and $8 per barrel of extra gasoline, before crude oil costs. The costs per barrel of crude oil 1, 2, and 3 are $2, $3, and $3, respectively. Formulate the linear programming problem aiming to maximize the company's daily profit.
TABLE 16.E.3 Specifications for Each Type of Gasoline

Type of Gasoline  Specifications
Regular           Not more than 70% of oil 1
Super             Not more than 50% of oil 1; not less than 10% of oil 2
Extra             Not more than 50% of oil 2; not less than 40% of oil 3
Solution
First of all, we must define the model's decision variables: xij = barrels of crude oil i used daily to produce gasoline j, i = 1, 2, 3; j = 1, 2, 3. Therefore, we have:

Daily production of regular gasoline = x11 + x21 + x31
Daily production of super gasoline = x12 + x22 + x32
Daily production of extra gasoline = x13 + x23 + x33
Barrels of crude oil 1 used daily = x11 + x12 + x13
Barrels of crude oil 2 used daily = x21 + x22 + x23
Barrels of crude oil 3 used daily = x31 + x32 + x33

The problem's objective function maximizes the refinery's daily profit (revenue minus costs). The model constraints should guarantee that the specifications required for each type of gasoline are taken into consideration, that the clients' demands are met, and that the gasoline production capacity and the crude oil supplies are respected. The daily revenue from the gasoline produced is:

5(x11 + x21 + x31) + 7(x12 + x22 + x32) + 8(x13 + x23 + x33)

On the other hand, the daily costs of the crude oil purchased are:

2(x11 + x12 + x13) + 3(x21 + x22 + x23) + 3(x31 + x32 + x33)

The objective function can be written as:

Fobj = max z = (5 - 2)x11 + (5 - 3)x21 + (5 - 3)x31 + (7 - 2)x12 + (7 - 3)x22 + (7 - 3)x32 + (8 - 2)x13 + (8 - 3)x23 + (8 - 3)x33

All the model's constraints are:

(1) The regular gasoline should contain a maximum of 70% of oil 1:
x11 / (x11 + x21 + x31) ≤ 0.70
which can be rewritten as:
0.30x11 - 0.70x21 - 0.70x31 ≤ 0

(2) The super gasoline should contain a maximum of 50% of oil 1:
x12 / (x12 + x22 + x32) ≤ 0.50
which can be rewritten as:
0.50x12 - 0.50x22 - 0.50x32 ≤ 0

(3) The super gasoline should contain at least 10% of oil 2:
x22 / (x12 + x22 + x32) ≥ 0.10
which can be rewritten as:
-0.10x12 + 0.90x22 - 0.10x32 ≥ 0
(4) The extra gasoline should contain a maximum of 50% of oil 2:
x23 / (x13 + x23 + x33) ≤ 0.50
which can be rewritten as:
-0.50x13 + 0.50x23 - 0.50x33 ≤ 0

(5) The extra gasoline should contain at least 40% of oil 3:
x33 / (x13 + x23 + x33) ≥ 0.40
which can be rewritten as:
-0.40x13 - 0.40x23 + 0.60x33 ≥ 0

(6) The daily demands for regular, super, and extra gasoline must be met:
x11 + x21 + x31 ≥ 5000 (regular)
x12 + x22 + x32 ≥ 3000 (super)
x13 + x23 + x33 ≥ 3000 (extra)

(7) The maximum number of barrels of crude oil 1 (10,000), crude oil 2 (8000), and crude oil 3 (7000) available daily must be respected:
x11 + x12 + x13 ≤ 10,000 (crude oil 1)
x21 + x22 + x23 ≤ 8000 (crude oil 2)
x31 + x32 + x33 ≤ 7000 (crude oil 3)

(8) The refinery's daily production capacity is 20,000 barrels of gasoline a day:
x11 + x21 + x31 + x12 + x22 + x32 + x13 + x23 + x33 ≤ 20,000

(9) The model's decision variables are non-negative:
xij ≥ 0, i = 1, 2, 3; j = 1, 2, 3

The complete problem is modeled here:

Fobj = max z = 3x11 + 2x21 + 2x31 + 5x12 + 4x22 + 4x32 + 6x13 + 5x23 + 5x33
subject to:
0.30x11 - 0.70x21 - 0.70x31 ≤ 0
0.50x12 - 0.50x22 - 0.50x32 ≤ 0
-0.10x12 + 0.90x22 - 0.10x32 ≥ 0
-0.50x13 + 0.50x23 - 0.50x33 ≤ 0
-0.40x13 - 0.40x23 + 0.60x33 ≥ 0
x11 + x21 + x31 ≥ 5000
x12 + x22 + x32 ≥ 3000
x13 + x23 + x33 ≥ 3000
x11 + x12 + x13 ≤ 10,000
x21 + x22 + x23 ≤ 8000
x31 + x32 + x33 ≤ 7000
x11 + x21 + x31 + x12 + x22 + x32 + x13 + x23 + x33 ≤ 20,000
x11, x21, x31, x12, x22, x32, x13, x23, x33 ≥ 0

By using Solver, the model's optimal solution is x11 = 1300, x21 = 3700, x31 = 0, x12 = 1500, x22 = 1500, x32 = 0, x13 = 7200, x23 = 0, x33 = 4800 with z = 92,000 (a total daily profit of US$ 92,000.00).
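A sketch of the same blending model in SciPy (assuming SciPy is available as an open alternative to Solver). The "≥" rows are multiplied by -1 to fit `linprog`'s "≤" form; note that this model has alternative optima in the oil 2/oil 3 split, so only the optimal profit is a reliable check:

```python
from scipy.optimize import linprog

# Variable order: x11, x21, x31 (regular), x12, x22, x32 (super),
# x13, x23, x33 (extra).
c = [-3, -2, -2, -5, -4, -4, -6, -5, -5]      # negated profit per barrel
A_ub = [
    [0.30, -0.70, -0.70, 0, 0, 0, 0, 0, 0],   # regular: <= 70% oil 1
    [0, 0, 0, 0.50, -0.50, -0.50, 0, 0, 0],   # super:   <= 50% oil 1
    [0, 0, 0, 0.10, -0.90, 0.10, 0, 0, 0],    # super:   >= 10% oil 2 (negated)
    [0, 0, 0, 0, 0, 0, -0.50, 0.50, -0.50],   # extra:   <= 50% oil 2
    [0, 0, 0, 0, 0, 0, 0.40, 0.40, -0.60],    # extra:   >= 40% oil 3 (negated)
    [-1, -1, -1, 0, 0, 0, 0, 0, 0],           # regular demand >= 5000 (negated)
    [0, 0, 0, -1, -1, -1, 0, 0, 0],           # super demand   >= 3000 (negated)
    [0, 0, 0, 0, 0, 0, -1, -1, -1],           # extra demand   >= 3000 (negated)
    [1, 0, 0, 1, 0, 0, 1, 0, 0],              # oil 1 supply <= 10,000
    [0, 1, 0, 0, 1, 0, 0, 1, 0],              # oil 2 supply <= 8000
    [0, 0, 1, 0, 0, 1, 0, 0, 1],              # oil 3 supply <= 7000
    [1] * 9,                                  # total production <= 20,000
]
b_ub = [0, 0, 0, 0, 0, -5000, -3000, -3000, 10_000, 8000, 7000, 20_000]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * 9, method="highs")
print(round(-res.fun))  # 92000
```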
16.6.3
Diet Problem
The diet problem is a classic linear programming problem that tries to determine the best food combination to be ingested per meal, at the lowest possible cost, while meeting a person's nutritional needs. Several nutrients can be analyzed, for example: calories, protein, fat, carbs, fibers, calcium, iron, magnesium, phosphorus, potassium, sodium, zinc, copper, manganese, selenium, vitamin A, vitamin C, vitamin B1, vitamin B2, vitamin B12, niacin, folic acid, and cholesterol, among others (Pessôa et al., 2009).

Example 16.6 Anemia is a disease caused by low levels of hemoglobin in the blood, the protein responsible for carrying oxygen. According to Dr. Adriana Ferreira, a hematologist, iron-deficiency anemia is the most common kind of anemia, and it is caused by the lack of iron in the body. In order to prevent it, we must adopt a diet that is rich in iron, vitamin A, vitamin B12, and folic acid. These nutrients can be found in several kinds of food, such as spinach, broccoli, watercress, tomatoes, carrots, eggs, beans, chickpeas, soybeans, beef, liver, and fish. Table 16.E.4 shows the daily needs of each nutrient, the respective quantity in each one of the food items, and their price. In order to prevent its patients from having this kind of anemia, Hospital Metropolis is studying a new diet. The goal is to choose the ingredients with the lowest possible cost. These ingredients will be part of both main daily meals (lunch and dinner), in such a way that 100% of a person's daily needs of each of these nutrients will be met in both meals. In addition, the total amount ingested in both meals cannot be higher than 1.5 kg.
TABLE 16.E.4 Nutrients, Daily Needs, and Cost Per Food Item (100 g Servings)

Food          Iron (mg)  Vitamin A (IU)  Vitamin B12 (mcg)  Folic Acid (mg)  Price (US$)
Spinach       3.00       7400            0                  0.400            0.30
Broccoli      1.20       138.8           0                  0.500            0.20
Watercress    0.20       4725            0                  0.100            0.18
Tomatoes      0.49       1130            0                  0.250            0.16
Carrots       1.00       14,500          0.10               0.005            0.30
Eggs          0.90       3215            1.00               0.050            0.30
Beans         7.10       0               0                  0.056            0.40
Chickpeas     4.86       41              0                  0.400            0.40
Soybeans      3.00       1000            0                  0.080            0.45
Beef          1.50       0               3.00               0.060            0.75
Liver         10.00      32,000          100.00             0.380            0.80
Fish          1.10       140             2.14               0.002            0.85
Daily intake  8          4500            2                  0.4
Solution
First of all, we have to define the model's decision variables: xj = quantity (kg) of food j consumed daily, j = 1, 2, …, 12. Therefore, we have:

x1 = quantity (kg) of spinach consumed daily.
x2 = quantity (kg) of broccoli consumed daily.
x3 = quantity (kg) of watercress consumed daily.
⋮
x12 = quantity (kg) of fish consumed daily.
The model's objective function minimizes the total cost spent on food and may be written as follows:

Fobj = min z = 3x1 + 2x2 + 1.8x3 + 1.6x4 + 3x5 + 3x6 + 4x7 + 4x8 + 4.5x9 + 7.5x10 + 8x11 + 8.5x12

The constraints related to the minimum daily intake of each nutrient must be met. Furthermore, we must consider the maximum weight allowed in both meals.

(1) The minimum daily intake of iron must be met:
30x1 + 12x2 + 2x3 + 4.9x4 + 10x5 + 9x6 + 71x7 + 48.6x8 + 30x9 + 15x10 + 100x11 + 11x12 ≥ 80

(2) The minimum daily intake of vitamin A must be met:
74,000x1 + 1388x2 + 47,250x3 + 11,300x4 + 145,000x5 + 32,150x6 + 410x8 + 10,000x9 + 320,000x11 + 1400x12 ≥ 45,000

(3) The minimum daily intake of vitamin B12 must be met:
x5 + 10x6 + 30x10 + 1000x11 + 21.4x12 ≥ 20

(4) The minimum daily intake of folic acid must be met:
4x1 + 5x2 + x3 + 2.5x4 + 0.05x5 + 0.5x6 + 0.56x7 + 4x8 + 0.8x9 + 0.6x10 + 3.8x11 + 0.02x12 ≥ 4

(5) The total amount consumed in both meals cannot be higher than 1.5 kg:
x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 + x11 + x12 ≤ 1.5

(6) The model's decision variables are nonnegative:
x1, x2, …, x12 ≥ 0

The complete model can be described as follows:

Fobj = min z = 3x1 + 2x2 + 1.8x3 + 1.6x4 + 3x5 + 3x6 + 4x7 + 4x8 + 4.5x9 + 7.5x10 + 8x11 + 8.5x12
s.t.
30x1 + 12x2 + 2x3 + ⋯ + 15x10 + 100x11 + 11x12 ≥ 80
74,000x1 + 1388x2 + 47,250x3 + ⋯ + 320,000x11 + 1400x12 ≥ 45,000
x5 + ⋯ + 30x10 + 1000x11 + 21.40x12 ≥ 20
4x1 + 5x2 + x3 + ⋯ + 0.6x10 + 3.8x11 + 0.02x12 ≥ 4
x1 + x2 + x3 + ⋯ + x10 + x11 + x12 ≤ 1.5
xj ≥ 0, j = 1, …, 12
The model's optimal solution is x2 = 0.427 (kg of broccoli), x7 = 0.698 (kg of beans), x8 = 0.237 (kg of chickpeas), x11 = 0.138 (kg of liver), and x1, x3, x4, x5, x6, x9, x10, x12 = 0 with z = 5.70 (total cost of US$ 5.70).
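The diet model above can be sketched in SciPy as well (assuming SciPy is available). The coefficients are the per-kilogram values used in the text, that is, the Table 16.E.4 per-100-g values multiplied by 10, and the "≥" nutrient rows are negated into `linprog`'s "≤" form:

```python
from scipy.optimize import linprog

cost = [3, 2, 1.8, 1.6, 3, 3, 4, 4, 4.5, 7.5, 8, 8.5]      # US$/kg
iron = [30, 12, 2, 4.9, 10, 9, 71, 48.6, 30, 15, 100, 11]
vit_a = [74_000, 1388, 47_250, 11_300, 145_000, 32_150,
         0, 410, 10_000, 0, 320_000, 1400]
vit_b12 = [0, 0, 0, 0, 1, 10, 0, 0, 0, 30, 1000, 21.4]
folic = [4, 5, 1, 2.5, 0.05, 0.5, 0.56, 4, 0.8, 0.6, 3.8, 0.02]

A_ub = [[-v for v in iron],      # iron        >= 80 (negated)
        [-v for v in vit_a],     # vitamin A   >= 45,000 (negated)
        [-v for v in vit_b12],   # vitamin B12 >= 20 (negated)
        [-v for v in folic],     # folic acid  >= 4 (negated)
        [1.0] * 12]              # total weight <= 1.5 kg
b_ub = [-80, -45_000, -20, -4, 1.5]

res = linprog(cost, A_ub=A_ub, b_ub=b_ub,
              bounds=[(0, None)] * 12, method="highs")
print(round(res.fun, 2), [round(x, 3) for x in res.x])
```

The minimized cost matches the US$ 5.70 reported above, with broccoli, beans, chickpeas, and liver as the only foods selected.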
16.6.4
Capital Budget Problems
Optimization models, including linear programming, are widely used to solve several financial investment problems, such as the capital budgeting problem, investment or portfolio selection, cash flow management, and risk analysis, among others. This section discusses a linear programming model that can be used to solve a capital budgeting problem. In the following section, we will discuss the investment portfolio selection problem. The capital budgeting problem aims at selecting, from a set of alternatives, financially feasible investment projects, respecting the investing company's budgetary constraints.
The capital budgeting problem uses the concept of NPV (net present value), which aims to define which investment is the most attractive. NPV is defined as the present value (period t = 0) of the cash inflows (receivables) minus the cash outflows (payables/investment) for each period t = 0, 1, …, n. Considering different investment projects, the most attractive is the one that has the highest net present value. The calculation of the NPV is:

NPV = Σ (t = 0 to n) CIFt / (1 + i)^t  -  Σ (t = 0 to n) COFt / (1 + i)^t     (16.13)

where:
CIFt = cash inflow in period t = 0, 1, …, n
COFt = cash outflow in period t = 0, 1, …, n
i = rate of return

We will analyze two types of investment (A and B) in order to determine which one is the most attractive. Investment A requires an initial investment of US$ 100,000 plus a US$ 50,000 investment in 1 year, with a US$ 200,000 return in 2 years. The interest rate is 12% per year. The calculation of the NPV of investment A is:

NPV = -100,000 - 50,000/(1 + 0.12)^1 + 200,000/(1 + 0.12)^2
NPV = 14,795.92

Investment B requires an initial investment of US$ 150,000 plus a US$ 70,000 investment in 2 years, with a US$ 130,000 return in 1 year, and US$ 120,000 in 3 years. The interest rate is 12% per year. The NPV of investment B is:

NPV = -150,000 + 130,000/(1 + 0.12)^1 - 70,000/(1 + 0.12)^2 + 120,000/(1 + 0.12)^3
NPV = -4318.51
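Expression (16.13) can be sketched as a small Python helper and applied to investments A and B; the function name `npv` is ours, not from the text. Here `cash_flows[t]` is the net flow (inflow minus outflow) in year t, with the t = 0 flow undiscounted:

```python
# A minimal sketch of Expression (16.13): discount each net cash flow by
# (1 + i)^t and sum, where index t counts years from the initial investment.
def npv(rate, cash_flows):
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cash_flows))

npv_a = npv(0.12, [-100_000, -50_000, 200_000])
npv_b = npv(0.12, [-150_000, 130_000, -70_000, 120_000])
print(round(npv_a, 2), round(npv_b, 2))  # 14795.92 -4318.51
```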
Therefore, we can see that investment B is not profitable, so investment A is the most attractive. The example showed us how to calculate the NPV of one or more investments and, from that, how to determine which one is the most attractive. However, many times the resources available are limited, so, to choose one or more investment projects, a linear programming or a binary programming model should be used.

Example 16.7 A farmer is analyzing five types of investment based on different crops (soybeans, cassava, corn, wheat, and beans) for his new farm, which has a total of 1000 hectares available. Each crop requires capital investments that will result in future benefits. The initial investment and the payables in the following 3 years for each crop are specified in Table 16.E.5. The return expected in the following 3 years for each crop investment is specified in Table 16.E.6. The farmer has limited resources that he can invest in each period (last column of Table 16.E.5), and he expects a minimum cash flow in each period (last column of Table 16.E.6). The interest rate for each crop is 10% per year. From the total area available for the investment, the farmer would like to determine how much he should invest in each crop (in hectares), in order to maximize the NPV of the set of investment projects being analyzed, respecting the minimum expected inflows and the maximum outflows in each period. Formulate the farmer's linear programming problem.
TABLE 16.E.5 Cash Outflow for Each Year: Initial Investment/Payables (US$ Thousands Per Hectare)

Year  Soybeans  Cassava  Corn  Wheat  Beans  Maximum Cash Outflow (US$ Thousands)
0     5.00      4.00     3.50  3.50   3.00   3800.00
1     1.00      1.00     0.50  1.50   0.50   3500.00
2     1.20      0.50     0.50  0.50   1.00   3200.00
3     0.80      0.50     1.00  0.50   0.50   2500.00
TABLE 16.E.6 Cash Inflow for Each Year: Expected Return (US$ Thousands Per Hectare)

Year  Soybeans  Cassava  Corn  Wheat  Beans  Minimum Cash Inflow (US$ Thousands)
1     5.00      4.20     2.20  6.60   3.00   6000.00
2     7.70      6.50     3.70  8.00   3.50   5000.00
3     7.90      7.20     2.90  6.10   4.10   6500.00
Solution
First of all, we have to define the model's decision variables: xj = total area (in hectares) to be invested in planting crop j, j = 1, 2, …, 5. Therefore, we have:

x1 = total area (in hectares) to be invested in planting soybeans.
x2 = total area (in hectares) to be invested in planting cassava.
x3 = total area (in hectares) to be invested in planting corn.
x4 = total area (in hectares) to be invested in planting wheat.
x5 = total area (in hectares) to be invested in planting beans.

The model's objective function maximizes the NPV (US$ thousands) of the set of crop investments being analyzed, that is, the sum of the NPV of each crop (US$ thousands per hectare) multiplied by the total area to be invested in the respective crop (hectares). The calculation of the soybean crop NPV (US$ thousands per hectare), according to Expression (16.13), is:

NPV = 5.0/(1 + 0.10)^1 + 7.7/(1 + 0.10)^2 + 7.9/(1 + 0.10)^3 - 5.0 - 1.0/(1 + 0.10)^1 - 1.2/(1 + 0.10)^2 - 0.8/(1 + 0.10)^3
NPV = 9.343 (US$ 9342.60/hectare)

The calculation of the NPV of the other crops, following the same procedure, is listed in Table 16.E.7.
TABLE 16.E.7 Net Present Value (NPV) of Each Crop (US$ Thousands Per Hectare)

Soybeans  Cassava  Corn   Wheat   Beans
9.343     8.902    2.118  11.542  4.044
Thus, objective function z can be described as follows:

Fobj = max z = 9.343x1 + 8.902x2 + 2.118x3 + 11.542x4 + 4.044x5

The maximum and minimum cash flow constraints for each year, besides the total area available, must be considered and are shown here.

(1) Maximum capacity available (hectares) for planting the crops:
x1 + x2 + x3 + x4 + x5 ≤ 1000

(2) Minimum cash inflow for each year (US$ thousands):
5.0x1 + 4.2x2 + 2.2x3 + 6.6x4 + 3.0x5 ≥ 6000 (1st year)
7.7x1 + 6.5x2 + 3.7x3 + 8.0x4 + 3.5x5 ≥ 5000 (2nd year)
7.9x1 + 7.2x2 + 2.9x3 + 6.1x4 + 4.1x5 ≥ 6500 (3rd year)

(3) Maximum cash outflow for each year (US$ thousands):
5.0x1 + 4.0x2 + 3.5x3 + 3.5x4 + 3.0x5 ≤ 3800 (initial investment)
1.0x1 + 1.0x2 + 0.5x3 + 1.5x4 + 0.5x5 ≤ 3500 (1st year)
1.2x1 + 0.5x2 + 0.5x3 + 0.5x4 + 1.0x5 ≤ 3200 (2nd year)
0.8x1 + 0.5x2 + 1.0x3 + 0.5x4 + 0.5x5 ≤ 2500 (3rd year)

(4) Non-negativity constraints of the decision variables:
xj ≥ 0, j = 1, 2, …, 5

The complete model can be formulated as follows:

max z = 9.343x1 + 8.902x2 + 2.118x3 + 11.542x4 + 4.044x5
subject to:
x1 + x2 + x3 + x4 + x5 ≤ 1000
5.0x1 + 4.2x2 + 2.2x3 + 6.6x4 + 3.0x5 ≥ 6000
7.7x1 + 6.5x2 + 3.7x3 + 8.0x4 + 3.5x5 ≥ 5000
7.9x1 + 7.2x2 + 2.9x3 + 6.1x4 + 4.1x5 ≥ 6500
5.0x1 + 4.0x2 + 3.5x3 + 3.5x4 + 3.0x5 ≤ 3800
1.0x1 + 1.0x2 + 0.5x3 + 1.5x4 + 0.5x5 ≤ 3500
1.2x1 + 0.5x2 + 0.5x3 + 0.5x4 + 1.0x5 ≤ 3200
0.8x1 + 0.5x2 + 1.0x3 + 0.5x4 + 0.5x5 ≤ 2500
xj ≥ 0, j = 1, 2, …, 5

The optimal solution of the linear programming model is x1 = 173.33 (hectares for soybeans), x2 = 80 (hectares for cassava), x3 = 0 (hectares for corn), x4 = 746.67 (hectares for wheat), and x5 = 0 (hectares for beans) with z = 10,949.59 (US$ 10,949,590.00). The example used a linear programming (LP) model to solve the respective capital budgeting problem. However, many times, given a set of investment project alternatives, we try to analyze whether each project i will be approved (xi = 1) or rejected (xi = 0), which becomes a binary programming (BP) problem, since the decision variables are binary. This last case will be discussed in Chapter 19.
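As a sketch, the farmer's model also translates directly to SciPy (assuming SciPy is available). Note that we feed it the rounded per-hectare NPVs of Table 16.E.7, so the optimal z comes out at about 10,949.64, a few hundredths away from the text's 10,949.59, which was presumably computed with unrounded NPVs:

```python
from scipy.optimize import linprog

npv = [9.343, 8.902, 2.118, 11.542, 4.044]   # US$ thousands per hectare
A_ub = [
    [1, 1, 1, 1, 1],                         # area <= 1000 hectares
    [-5.0, -4.2, -2.2, -6.6, -3.0],          # inflow year 1 >= 6000 (negated)
    [-7.7, -6.5, -3.7, -8.0, -3.5],          # inflow year 2 >= 5000 (negated)
    [-7.9, -7.2, -2.9, -6.1, -4.1],          # inflow year 3 >= 6500 (negated)
    [5.0, 4.0, 3.5, 3.5, 3.0],               # outflow year 0 <= 3800
    [1.0, 1.0, 0.5, 1.5, 0.5],               # outflow year 1 <= 3500
    [1.2, 0.5, 0.5, 0.5, 1.0],               # outflow year 2 <= 3200
    [0.8, 0.5, 1.0, 0.5, 0.5],               # outflow year 3 <= 2500
]
b_ub = [1000, -6000, -5000, -6500, 3800, 3500, 3200, 2500]

res = linprog([-v for v in npv], A_ub=A_ub, b_ub=b_ub,
              bounds=[(0, None)] * 5, method="highs")
print([round(x, 2) for x in res.x], round(-res.fun, 2))
```

The solver reproduces the same allocation: about 173.33 hectares of soybeans, 80 of cassava, and 746.67 of wheat, with corn and beans at zero.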
16.6.5
Portfolio Selection Problem
Markowitz (1952) developed a mathematical model to optimize portfolios that tries to choose, among a set of financial investments, the best combination that maximizes the portfolio’s expected return and minimizes its risk. The model is a quadratic programming problem that tries to find the portfolio’s efficient frontier. The risks of the portfolio are measured by using the variance of the return on assets, calculated as the sum of the individual variances of each asset and the covariances between the pairs of assets. Sharpe (1964) proposed a simplified portfolio optimization model aiming at facilitating the calculation of the covariance matrix. Similar to Markowitz’s model, Sharpe’s model also tries to determine the portfolio’s optimal composition that will result in the highest return possible with the lowest risk. Markowitz’s model requires the extensive calculation of the covariance matrix, so, it is highly complex. In order to facilitate its application, alternative models to Markowitz’s original model have been proposed. From Markowitz’s (1952) and Sharpe’s (1964) theory, we can develop a linear programming model that determines the investment portfolio’s optimal composition and that minimizes the risks, with an expected level of return. Similarly, we can search for the best portfolio composition that maximizes the portfolio’s expected return, subject to the requirement of a minimum level of this value and to the maximum risk allowed.
Model 1: Maximization of an Investment Portfolio's Expected Return

A linear programming model that maximizes an investment portfolio's expected return, subject to the requirement of a minimum level of this value and to a given risk, is proposed.

Model parameters:
E(R) = investment portfolio's expected return
mj = expected return of asset j, j = 1, …, n
r = minimum level required by the investor for the portfolio's expected return
xj^max = maximum percentage of asset j allowed in the portfolio, j = 1, …, n
sj = standard deviation of asset j, j = 1, …, n
s = the portfolio's standard deviation or average risk
Decision variables:
xj = percentage of asset j allocated in the portfolio, j = 1, …, n

General formulation:

max E(R) = Σ (j = 1 to n) mj xj
s.t.
Σ (j = 1 to n) xj = 1                  (1)
Σ (j = 1 to n) mj xj ≥ r               (2)
xj ≤ xj^max, j = 1, …, n               (3)
Σ (j = 1 to n) sj xj ≤ s               (4)
xj ≥ 0, j = 1, …, n                    (5)

(16.14)
The model's objective function maximizes the average return of an investment portfolio with n financial assets. Constraint 1 guarantees that all the capital will be invested. Constraint 2 guarantees that the portfolio's average return will achieve the minimum limit required by the investor, in the value of r. Constraint 3 states that the percentage of asset j allocated in the portfolio cannot be higher than xj^max, so that the portfolio can be diversified and its risk minimized. It is important to mention that some expected return maximization models do not consider this constraint. An alternative to constraint 3 can be represented by Equation (4) of Expression (16.14), which ensures that the portfolio's average risk cannot be higher than s. The risks of each asset and of the portfolio are measured by the standard deviation. Finally, the decision variables should meet the non-negativity condition.
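Formulation (16.14) can be sketched in SciPy on small synthetic data; the three assets' expected returns m, risks s, the cap x_max, and the limits r_min and s_bar below are invented purely for illustration (and SciPy itself is an assumption, as an open alternative to Solver):

```python
from scipy.optimize import linprog

m = [0.0030, 0.0024, 0.0018]   # expected daily returns (synthetic)
s = [0.0250, 0.0210, 0.0150]   # standard deviations (synthetic)
x_max = 0.60                   # cap per asset, constraint (3)
r_min = 0.0020                 # required expected return, constraint (2)
s_bar = 0.0220                 # maximum average risk, constraint (4)

c = [-mj for mj in m]                       # maximize expected return
A_ub = [[-mj for mj in m],                  # (2) sum m_j x_j >= r_min (negated)
        s]                                  # (4) sum s_j x_j <= s_bar
b_ub = [-r_min, s_bar]
A_eq = [[1.0, 1.0, 1.0]]                    # (1) fully invested
b_eq = [1.0]
bounds = [(0.0, x_max)] * 3                 # (3) and (5)

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=bounds, method="highs")
print([round(x, 4) for x in res.x], round(-res.fun, 6))
# approx. [0.6, 0.1667, 0.2333] with E(R) = 0.00262
```

With these numbers, the highest-return asset is pushed to its cap, and the risk constraint, not the return requirement, determines the split between the other two assets.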
Model 2: Investment Portfolio Risk Minimization

An alternative to Markowitz's model was proposed by Konno and Yamazaki (1991), who introduced the mean absolute deviation (MAD) as a risk measure. The model that minimizes the MAD is shown now.

Model parameters:
MAD = the portfolio's mean absolute deviation
rjt = return of asset j in period t, j = 1, …, n, t = 1, …, T
mj = expected return of asset j, j = 1, …, n
r = minimum level required by the investor for the portfolio's expected return
xj^max = maximum percentage of asset j allowed in the portfolio, j = 1, …, n

Decision variables:
xj = percentage of asset j allocated in the portfolio, j = 1, …, n

General formulation:
min MAD = (1/T) Σ (t = 1 to T) | Σ (j = 1 to n) (rjt - mj) xj |
s.t.
Σ (j = 1 to n) xj = 1                  (1)
Σ (j = 1 to n) mj xj ≥ r               (2)
0 ≤ xj ≤ xj^max, j = 1, …, n           (3)

(16.15)

The model's objective function minimizes the portfolio's mean absolute deviation. Constraint 1 guarantees that all the capital will be invested.
Constraint 2 guarantees that the portfolio's average return will achieve the minimum limit required by the investor, in the value of r. Constraint 3 states that the percentage of asset j allocated in the portfolio must be non-negative and cannot be higher than xj^max.

Example 16.8 Investor Paul Smith operates daily on Forinvest's home broker system. Paul wants to select a new investment portfolio that maximizes its expected return for a given level of risk. Based on a historical analysis of the most highly negotiated and representative assets in the Brazilian stock market, Forinvest's financial analyst selected a set of 10 stocks traded on B3 (Brazilian Stock Exchange) that could form Paul's portfolio, as shown in Table 16.E.8. The financial analyst tried to select a set of stocks from many different sectors, which were chosen according to Paul's preferences. Table 16.E.9 shows a partial history of the daily return of each stock during 247 days. The complete data can be found in the file Forinvest.xls. In order to diversify the portfolio and, consequently, minimize its risks, the financial analyst advised Paul to invest 30%, at the most, in each stock. Besides, the portfolio's risk, measured by the standard deviation, could not be greater than 2.5%. Elaborate the linear programming model that maximizes Paul's portfolio's expected return.
TABLE 16.E.8 Stocks That Could Be in Paul's Portfolio

    Stock            Code
1   Banco Brasil ON  BBAS3
2   Bradesco PN      BBDC4
3   Eletrobras PNB   ELET6
4   Gerdau PN        GGBR4
5   Itausa PN        ITSA4
6   Petrobras PN     PETR4
7   Sid Nacional ON  CSNA3
8   Telemar PN       TNLP4
9   Usiminas PNA     USIM5
10  Vale PNA         VALE5
TABLE 16.E.9 Partial History of the Assets' Daily Return (%)

Period  BBAS3  BBDC4  ELET6  GGBR4  ITSA4  PETR4  CSNA3  TNLP4  USIM5  VALE5
1       6.74   6.04   1.47   4.48   6.50   2.71   2.06   3.19   4.40   3.93
2       6.31   3.05   4.23   5.00   2.14   3.43   4.34   0.22   3.42   2.72
3       4.00   2.08   1.47   1.67   3.27   0.75   2.45   2.19   3.06   0.76
4       0.28   0.14   3.66   1.64   0.81   1.85   1.01   1.29   0.63   0.79
5       6.86   5.28   3.79   4.76   5.50   3.23   6.66   0.11   4.87   4.13
6       2.23   4.87   2.96   3.25   3.69   5.20   7.05   0.97   3.89   2.65
7       1.45   0.90   1.04   4.12   2.47   2.56   0.92   0.07   0.41   0.46
8       1.85   1.05   1.17   1.77   2.39   0.21   2.82   3.67   4.13   1.74
9       6.09   0.14   1.39   0.90   0.82   0.89   1.42   3.75   2.90   2.47
10      1.70   1.94   1.21   3.44   1.38   0.42   2.34   0.14   0.40   3.64
Solution First of all, we calculated the average return and the standard deviation of the daily returns of each investment during the period analyzed, as shown in Table 16.E.10.
TABLE 16.E.10 Average Return and Standard Deviation of Each Stock During the Period Analyzed

                        BBAS3  BBDC4  ELET6  GGBR4  ITSA4  PETR4  CSNA3  TNLP4  USIM5  VALE5
Average return (%)      0.37   0.24   0.14   0.30   0.24   0.19   0.28   0.18   0.25   0.24
Standard deviation (%)  2.48   2.16   ⋯      ⋯      ⋯      ⋯      ⋯      ⋯      ⋯      2.47
The second step consists of defining the model's decision variables: xj = percentage of stock j to be allocated in the portfolio, j = 1, …, 10. Therefore, we have:

x1 = percentage of stock BBAS3 to be allocated in the portfolio.
x2 = percentage of stock BBDC4 to be allocated in the portfolio.
x3 = percentage of stock ELET6 to be allocated in the portfolio.
x4 = percentage of stock GGBR4 to be allocated in the portfolio.
x5 = percentage of stock ITSA4 to be allocated in the portfolio.
x6 = percentage of stock PETR4 to be allocated in the portfolio.
x7 = percentage of stock CSNA3 to be allocated in the portfolio.
x8 = percentage of stock TNLP4 to be allocated in the portfolio.
x9 = percentage of stock USIM5 to be allocated in the portfolio.
x10 = percentage of stock VALE5 to be allocated in the portfolio.

The model's objective function maximizes Paul's portfolio's expected return during the period analyzed. Therefore, the objective function can be expressed as:

Fobj = max z = 0.0037x1 + 0.0024x2 + 0.0014x3 + 0.0030x4 + 0.0024x5 + 0.0019x6 + 0.0028x7 + 0.0018x8 + 0.0025x9 + 0.0024x10

The model's constraints are described now.
(1) The first constraint guarantees that 100% of the capital will be invested, that is, the sum of the composition of the stocks is equal to 1:
x1 + x2 + ⋯ + x10 = 1

(2) Constraint 2 states that the maximum percentage to be invested in each stock is 30% of the total capital invested:
x1, x2, …, x10 ≤ 0.30

(3) Constraint 3 guarantees that the portfolio's risk, for the period analyzed, will not be greater than the maximum risk stipulated, that is, 2.5%:
0.0248x1 + 0.0216x2 + ⋯ + 0.0247x10 ≤ 0.0250

(4) Finally, the non-negativity conditions of the decision variables must be met:
x1, x2, …, x10 ≥ 0

The complete model can be formulated as follows:

max E(R) = 0.0037x1 + 0.0024x2 + 0.0014x3 + 0.0030x4 + 0.0024x5 + 0.0019x6 + 0.0028x7 + 0.0018x8 + 0.0025x9 + 0.0024x10
s.t.
x1 + x2 + ⋯ + x10 = 1                            (1)
x1, x2, …, x10 ≤ 0.30                            (2)
0.0248x1 + 0.0216x2 + ⋯ + 0.0247x10 ≤ 0.0250     (3)
x1, x2, …, x10 ≥ 0                               (4)
The optimal solution of the linear programming model is x1 = 30% (Banco do Brasil ON—BBAS3), x2 = 30% (Bradesco PN—BBDC4), x4 = 18.17% (Gerdau PN—GGBR4), x7 = 21.83% (Sid Nacional ON—CSNA3), and x3, x5, x6, x8, x9, x10 = 0, with z = 0.3% (daily average return of 0.3%).
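As a quick numerical check, the reported optimum can be substituted back into the model. The snippet below is a plain-Python sketch; the weights and average daily returns are taken from the example. It confirms feasibility of constraints (1), (2), and (4) and reproduces the 0.3% expected daily return:

```python
# Average daily returns of the 10 stocks (the objective-function coefficients)
returns = [0.0037, 0.0024, 0.0014, 0.0030, 0.0024,
           0.0019, 0.0028, 0.0018, 0.0025, 0.0024]
# Optimal weights reported for Example 16.8
weights = [0.30, 0.30, 0.0, 0.1817, 0.0, 0.0, 0.2183, 0.0, 0.0, 0.0]

assert abs(sum(weights) - 1.0) < 1e-9        # constraint (1): fully invested
assert all(0 <= w <= 0.30 for w in weights)  # constraints (2) and (4)

expected_return = sum(r * w for r, w in zip(returns, weights))
print(round(expected_return, 4))  # 0.003, i.e., 0.3% per day
```

The risk constraint (3) is not checked here because only three of its ten coefficients are quoted in the text.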
Example 16.9
Consider the same portfolio optimization problem as Paul Smith's problem described in Example 16.8. Now, in this case, instead of maximizing the expected return, we want to minimize the portfolio's mean absolute deviation (MAD). Different from the previous example, instead of considering the maximum risk allowed constraint, we will consider a minimum limit of 0.15% on the portfolio's expected daily return. Analogous to the previous example, we must invest, at the most, 30% of the total capital in each asset. Consider the same assets (Table 16.E.9) and the same history of daily returns (Table 16.E.10) of the previous example (see file Forinvest.xls). Elaborate the linear programming problem that minimizes the portfolio's MAD.

Solution
First of all, we must calculate the MAD of each asset in the portfolio. Let's consider the first stock (BBAS3). The first step is to calculate the absolute deviation in each period. As calculated in Example 16.8, the average return of stock BBAS3, for the period analyzed, is 0.37%. Since the return of the first period is −6.74%, we can conclude that |r11 − m1| = |−0.0674 − 0.0037| = 0.0711. Now, for period 2, we have |r12 − m1| = |0.0631 − 0.0037| = 0.0594. Then, we do the same for the other periods. For the last period, we have |r1,247 − m1| = |0.0128 − 0.0037| = 0.0091. The second step consists of calculating the mean absolute deviation of BBAS3, that is, the arithmetic mean of the absolute deviations over all periods:

MAD1 = (1/247)(0.0711 + 0.0594 + ⋯ + 0.0091) = 0.0187

Then, we do the same for the other stocks. Table 16.E.11 shows the mean absolute deviation of each asset.
TABLE 16.E.11 Mean Absolute Deviation of Each Stock

Stock:    BBAS3   BBDC4   ELET6   GGBR4   ITSA4   PETR4   CSNA3   TNLP4   USIM5   VALE5
MAD (%):  1.87    1.65    1.47    2.28    1.69    1.50    1.99    1.66    2.11    1.79
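The deviation calculation above is mechanical and easy to script. The plain-Python sketch below reproduces the three absolute deviations quoted for BBAS3; the full 247-period series lives in Forinvest.xls and is not reproduced here, so the final averaging step is only indicated in a comment:

```python
# Absolute deviations of BBAS3 around its mean daily return (m1 = 0.37%),
# using the three returns quoted in the solution (periods 1, 2, and 247).
mean_return = 0.0037
sample_returns = [-0.0674, 0.0631, 0.0128]

deviations = [abs(r - mean_return) for r in sample_returns]
print([round(d, 4) for d in deviations])  # [0.0711, 0.0594, 0.0091]

# With the complete 247-period series, the MAD is the mean of all deviations:
# mad = sum(abs(r - mean_return) for r in returns) / len(returns)  # -> 0.0187
```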
The objective function minimizes the portfolio's MAD, and it can be written as follows:

Fobj = min MAD = 0.0187x1 + 0.0165x2 + 0.0147x3 + 0.0228x4 + 0.0169x5 + 0.0150x6 + 0.0199x7 + 0.0166x8 + 0.0211x9 + 0.0179x10

The constraint that the portfolio's daily average return should achieve the minimum limit required by the investor must be considered:

0.0037x1 + 0.0024x2 + ⋯ + 0.0024x10 ≥ 0.0015

The complete model can be formulated as follows:

min MAD = 0.0187x1 + 0.0165x2 + 0.0147x3 + 0.0228x4 + 0.0169x5 + 0.0150x6 + 0.0199x7 + 0.0166x8 + 0.0211x9 + 0.0179x10
s.t.
x1 + x2 + ⋯ + x10 = 1   (1)
0.0037x1 + 0.0024x2 + ⋯ + 0.0024x10 ≥ 0.0015   (2)
0 ≤ x1, x2, …, x10 ≤ 0.30   (3)
The optimal solution of the linear programming model is x2 = 30% (Bradesco PN—BBDC4), x3 = 30% (Eletrobras PNB—ELET6), x6 = 30% (Petrobras PN—PETR4), x8 = 10% (Telemar PN—TNLP4), and x1, x4, x5, x7, x9, x10 = 0, with z = 1.55% (the portfolio's mean absolute deviation).
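Because every coefficient of this model appears in the text (the MADs in Table 16.E.11 and the average returns in Example 16.8), the model can be solved directly with an off-the-shelf LP solver. A sketch using SciPy's linprog, assuming SciPy is available:

```python
from scipy.optimize import linprog

# MAD of each stock (Table 16.E.11) -- the objective-function coefficients
mad = [0.0187, 0.0165, 0.0147, 0.0228, 0.0169,
       0.0150, 0.0199, 0.0166, 0.0211, 0.0179]
# Average daily returns (from Example 16.8)
ret = [0.0037, 0.0024, 0.0014, 0.0030, 0.0024,
       0.0019, 0.0028, 0.0018, 0.0025, 0.0024]

res = linprog(
    c=mad,                           # minimize the portfolio MAD
    A_ub=[[-r for r in ret]],        # -ret.x <= -0.0015  <=>  ret.x >= 0.0015
    b_ub=[-0.0015],
    A_eq=[[1.0] * 10], b_eq=[1.0],   # weights sum to 1
    bounds=[(0.0, 0.30)] * 10,       # 0 <= xj <= 30%
)
print(round(res.fun, 4))  # 0.0155, i.e., the 1.55% reported in the text
```

The solver recovers the same allocation (x2 = x3 = x6 = 30%, x8 = 10%): since the return constraint turns out to be non-binding, the LP simply fills the three lowest-MAD assets to their 30% caps.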
Introduction to Optimization Models: General Formulations and Business Modeling, Chapter 16

16.6.6 Production and Inventory Problem
In this section, we will consider a linear programming model that integrates production and inventory decisions. The period of time can be short, medium, or long. A general linear programming model to solve the production and inventory problem, based on Taha (2010, 2016) and Ahuja et al. (2007), will be presented, for a case with m products (i = 1, …, m) and T periods (t = 1, …, T).

Model parameters:
Dit = demand for product i in period t
cit = production cost per unit of product i in period t
iit = inventory cost per unit of product i in period t
xitmax = maximum production capacity of product i in period t
Iitmax = maximum inventory capacity of product i in period t

Decision variables:
xit = amount of product i to be produced in period t
Iit = final inventory of product i in period t

General formulation:

min z = Σt=1,…,T Σi=1,…,m (cit xit + iit Iit)
s.t.
Iit = Ii,t−1 + xit − Dit,  i = 1, …, m; t = 1, …, T   (1)
xit ≤ xitmax,  i = 1, …, m; t = 1, …, T   (2)
Iit ≤ Iitmax,  i = 1, …, m; t = 1, …, T   (3)
xit, Iit ≥ 0,  i = 1, …, m; t = 1, …, T   (4)
(16.16)
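Formulation (16.16) maps directly onto a standard LP solver. The sketch below builds a deliberately tiny instance (one product, two periods, hypothetical costs and capacities, zero initial inventory; none of these numbers come from the chapter) using SciPy's linprog, with the decision variables ordered as (x1, x2, I1, I2):

```python
from scipy.optimize import linprog

# Tiny hypothetical instance of formulation (16.16): m = 1 product, T = 2 periods.
demand = [10, 15]        # D_t
prod_cost = [2.0, 3.0]   # c_t
inv_cost = [0.5, 0.5]    # i_t
x_max, i_max = 20, 100   # capacities; initial inventory I_0 = 0

# Decision variables, in order: x1, x2, I1, I2
c = prod_cost + inv_cost
# Inventory balance I_t = I_{t-1} + x_t - D_t, rewritten as equalities:
#   x1 - I1 = D1        and        x2 + I1 - I2 = D2
A_eq = [[1, 0, -1, 0],
        [0, 1, 1, -1]]
b_eq = demand
bounds = [(0, x_max)] * 2 + [(0, i_max)] * 2  # constraints (2)-(4)

res = linprog(c=c, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
# Producing early is cheaper here (2.0 + 0.5 carrying cost < 3.0), so the
# solver fills period 1 to capacity: x = (20, 5), I = (10, 0), total cost 60.
print(res.x, res.fun)
```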
The model's objective function minimizes the sum of the production and inventory costs over the T periods. For each product, constraint 1 represents the inventory balance expression, that is, the final inventory in period t equals the final inventory in the previous period, plus the total produced in the period, minus the demand for the period. So, in order for the demand for product i to be met in period t, the inventory level of the same product in the previous period, plus what was produced in the period, must be greater than or equal to the demand. This condition is implied in the model, since decision variable Iit can only assume non-negative values. Constraint 2 guarantees that the maximum production capacity will not be exceeded. Constraint 3 guarantees that the maximum inventory capacity will not be exceeded. Finally, the non-negativity conditions of the model's decision variables must also be met. Similar to the production mix problem, when the decision variables can only assume integer values (i.e., the manufacturing and storing of products such as cars, electrical appliances, and electrical devices cannot be fractional), we have an integer programming (IP) problem.

Example 16.10
Fenix & Furniture is launching their new collection of sofas and armchairs for next semester. This new collection includes 2- and 3-seat sofas, sofa-beds, armchairs, and poufs. Table 16.E.12 shows the data on the production and inventory costs and capacity for each product, which are constant for all the periods. The demand for each product for next semester is listed in Table 16.E.13. The initial inventory for all the products is 200 units. Determine the optimal production and inventory control planning that minimizes the total production and storage costs, meets the intended demand, and respects the production and storage capacity limitations.
Solution
The mathematical formulation of Example 16.10 is similar to the general formulation of the production and inventory problem presented in Expression (16.16). The complete model is shown now. First of all, we have to define the model's decision variables:

xit = number of pieces of furniture i to be produced in month t (units), i = 1, …, 5, t = 1, …, 6
Iit = final inventory of piece of furniture i in month t (units), i = 1, …, 5, t = 1, …, 6
TABLE 16.E.12 Production and Inventory Costs and Capacity for Each Product

                              2-Seat Sofa   3-Seat Sofa   Sofa-Bed   Armchair   Pouf
Production cost (US$/unit)    320           440           530        66         48
Inventory cost (US$/unit)     8             8             9          3          3
Production capacity (units)   1800          1600          1500       2000       2000
Inventory capacity (units)    20,000        18,000        15,000     22,000     22,000
TABLE 16.E.13 Demand Per Product and Period

              Jan.   Feb.   March   April   May    Jun.
2-Seat sofa   1200   1250   1400    1860    2000   1700
3-Seat sofa   1250   1430   1650    1700    1450   1500
Sofa-bed      1400   1500   1200    1350    1600   1450
Armchair      1800   1750   2100    2000    1850   1630
Pouf          1850   1700   2050    1950    2050   1740
Therefore, we have:
x11 = number of 2-seat sofas to be produced in January.
⋮
x16 = number of 2-seat sofas to be produced in June.
x21 = number of 3-seat sofas to be produced in January.
⋮
x26 = number of 3-seat sofas to be produced in June.
x31 = number of sofa-beds to be produced in January.
⋮
x36 = number of sofa-beds to be produced in June.
x41 = number of armchairs to be produced in January.
⋮
x46 = number of armchairs to be produced in June.
x51 = number of poufs to be produced in January.
⋮
x56 = number of poufs to be produced in June.

I11 = final inventory of 2-seat sofas in January.
⋮
I16 = final inventory of 2-seat sofas in June.
I21 = final inventory of 3-seat sofas in January.
⋮
I26 = final inventory of 3-seat sofas in June.
I31 = final inventory of sofa-beds in January.
⋮
I36 = final inventory of sofa-beds in June.
I41 = final inventory of armchairs in January.
⋮
I46 = final inventory of armchairs in June.
I51 = final inventory of poufs in January.
⋮
I56 = final inventory of poufs in June.

Since the decision variables are discrete, we have an integer programming (IP) problem. Luckily, in this problem, the integrality constraints can be relaxed or eliminated, since the relaxed problem's optimal solution still meets the integrality conditions. Thus, the formulation of the problem will be presented as a linear programming (LP) model. The objective function can be written as:

min z = 320(x11 + x12 + x13 + x14 + x15 + x16) + 8(I11 + I12 + I13 + I14 + I15 + I16)
+ 440(x21 + x22 + x23 + x24 + x25 + x26) + 8(I21 + I22 + I23 + I24 + I25 + I26)
+ 530(x31 + x32 + x33 + x34 + x35 + x36) + 9(I31 + I32 + I33 + I34 + I35 + I36)
+ 66(x41 + x42 + x43 + x44 + x45 + x46) + 3(I41 + I42 + I43 + I44 + I45 + I46)
+ 48(x51 + x52 + x53 + x54 + x55 + x56) + 3(I51 + I52 + I53 + I54 + I55 + I56)

The model's constraints are described here.
(1) Inventory balance equations, for each piece of furniture i (i = 1, …, 5), each month t (t = 1, …, 6):

I11 = 200 + x11 − 1200   I12 = I11 + x12 − 1250   I13 = I12 + x13 − 1400
I14 = I13 + x14 − 1860   I15 = I14 + x15 − 2000   I16 = I15 + x16 − 1700
I21 = 200 + x21 − 1250   I22 = I21 + x22 − 1430   I23 = I22 + x23 − 1650
I24 = I23 + x24 − 1700   I25 = I24 + x25 − 1450   I26 = I25 + x26 − 1500
I31 = 200 + x31 − 1400   I32 = I31 + x32 − 1500   I33 = I32 + x33 − 1200
I34 = I33 + x34 − 1350   I35 = I34 + x35 − 1600   I36 = I35 + x36 − 1450
I41 = 200 + x41 − 1800   I42 = I41 + x42 − 1750   I43 = I42 + x43 − 2100
I44 = I43 + x44 − 2000   I45 = I44 + x45 − 1850   I46 = I45 + x46 − 1630
I51 = 200 + x51 − 1850   I52 = I51 + x52 − 1700   I53 = I52 + x53 − 2050
I54 = I53 + x54 − 1950   I55 = I54 + x55 − 2050   I56 = I55 + x56 − 1740

(2) Maximum production capacity:

x11, x12, x13, x14, x15, x16 ≤ 1800
x21, x22, x23, x24, x25, x26 ≤ 1600
x31, x32, x33, x34, x35, x36 ≤ 1500
x41, x42, x43, x44, x45, x46 ≤ 2000
x51, x52, x53, x54, x55, x56 ≤ 2000

(3) Maximum inventory capacity:

I11, I12, I13, I14, I15, I16 ≤ 20,000
I21, I22, I23, I24, I25, I26 ≤ 18,000
I31, I32, I33, I34, I35, I36 ≤ 15,000
I41, I42, I43, I44, I45, I46 ≤ 22,000
I51, I52, I53, I54, I55, I56 ≤ 22,000

(4) Non-negativity constraints:

xit, Iit ≥ 0,  i = 1, …, 5; t = 1, …, 6

The production and inventory model's optimal solution is shown in Table 16.E.14.
TABLE 16.E.14 Production and Inventory Model's Optimal Solution

Solution   Jan.   Feb.   March   April   May    Jun.
x1t        1000   1250   1660    1800    1800   1700
x2t        1050   1580   1600    1600    1450   1500
x3t        1200   1500   1200    1450    1500   1450
x4t        1600   1850   2000    2000    1850   1630
x5t        1650   1750   2000    2000    2000   1740
I1t        0      0      260     200     0      0
I2t        0      150    100     0       0      0
I3t        0      0      0       100     0      0
I4t        0      100    0       0       0      0
I5t        0      50     0       50      0      0

z = US$ 12,472,680.00
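The plan in Table 16.E.14 can be verified against the model: the inventory balance equations must hold month by month, and substituting the plan into the objective function must reproduce z. A plain-Python check:

```python
# Data from Example 16.10 (initial inventory is 200 units for every product)
prod_cost = [320, 440, 530, 66, 48]
inv_cost = [8, 8, 9, 3, 3]
demand = [[1200, 1250, 1400, 1860, 2000, 1700],   # 2-seat sofa
          [1250, 1430, 1650, 1700, 1450, 1500],   # 3-seat sofa
          [1400, 1500, 1200, 1350, 1600, 1450],   # sofa-bed
          [1800, 1750, 2100, 2000, 1850, 1630],   # armchair
          [1850, 1700, 2050, 1950, 2050, 1740]]   # pouf
# Optimal plan from Table 16.E.14
x = [[1000, 1250, 1660, 1800, 1800, 1700],
     [1050, 1580, 1600, 1600, 1450, 1500],
     [1200, 1500, 1200, 1450, 1500, 1450],
     [1600, 1850, 2000, 2000, 1850, 1630],
     [1650, 1750, 2000, 2000, 2000, 1740]]
I = [[0, 0, 260, 200, 0, 0],
     [0, 150, 100, 0, 0, 0],
     [0, 0, 0, 100, 0, 0],
     [0, 100, 0, 0, 0, 0],
     [0, 50, 0, 50, 0, 0]]

for i in range(5):                  # inventory balance, constraint (1)
    prev = 200
    for t in range(6):
        assert I[i][t] == prev + x[i][t] - demand[i][t]
        prev = I[i][t]

z = sum(prod_cost[i] * sum(x[i]) + inv_cost[i] * sum(I[i]) for i in range(5))
print(z)  # 12472680, i.e., US$ 12,472,680.00
```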
16.6.7 Aggregated Planning Problem
Aggregated planning studies the balance between production and demand. The period of time considered is the medium run. In order to meet a fluctuating demand at a minimum cost, we can change the company's resources (employees, production, and inventory levels), we can influence the demand, or we can combine both strategies. As strategies to influence the demand, we have: advertising, sales, development of alternative products, etc. As strategies to influence production, we can highlight:

– Controlling inventory levels;
– Hiring and firing employees;
– Overtime or reducing the number of working hours;
– Outsourcing.
Most of the methods used to solve the aggregated planning problem consider the demand to be deterministic, so they only change the company's productive resources. Thus, we can use trial and error to select the best option among a set of alternative production plans, or a linear programming model to determine the problem's optimal solution.
Linear programming (LP) models are widely used to solve aggregated planning problems, in order to find the best combination of productive resources that minimizes the total labor, production, and storage costs. For T periods of time, the objective function can minimize the sum of costs related to: regular production, regular labor, hiring and firing employees, overtime, inventory, and/or outsourcing. The constraints are related to the total production and storage capacity, besides the use of labor. The problem can also be characterized as a nonlinear programming (NLP) model (nonlinear costs) or as a binary programming (BP) model (a choice among n alternative plans). Buffa and Sarin (1987), Moreira (2006), and Silva Filho et al. (2009) present a general linear programming model for the aggregated planning problem. An adjusted formulation, for T periods of time (t = 1, …, T), is shown.

Model parameters:
Pt = total production in period t
Dt = demand in period t
rt = production cost per unit (regular hours) in period t
ot = production cost per unit (overtime) in period t
st = production cost per unit (with subcontracted/outsourced labor) in period t
ht = cost of an additional unit (regular hours) in period t by hiring employees from period t − 1 to period t
ft = cost of a cancelled unit in period t by firing employees from period t − 1 to period t
it = inventory cost per unit from period t to period t + 1
Itmax = maximum inventory capacity in period t (units)
Rtmax = maximum production capacity at regular hours in period t (units)
Otmax = maximum production capacity during overtime in period t (units)
Stmax = maximum subcontracted production capacity in period t (units)

Decision variables:
It = final inventory in period t (units)
Rt = regular production (regular hours) in period t (units)
Ot = overtime production in period t (units)
St = production with subcontracted labor in period t (units)
Ht = additional production in period t by hiring employees from period t − 1 to period t (units)
Ft = cancelled production in period t by firing employees from period t − 1 to period t (units)

General formulation:

min z = Σt=1,…,T (rt Rt + ot Ot + st St + ht Ht + ft Ft + it It)
s.t.
It = It−1 + Pt − Dt   (1)
Pt = Rt + Ot + St   (2)
Rt = Rt−1 + Ht − Ft   (3)
It ≤ Itmax   (4)
Rt ≤ Rtmax   (5)
Ot ≤ Otmax   (6)
St ≤ Stmax   (7)
Rt, Ot, St, Ht, Ft, It ≥ 0, for t = 1, …, T   (8)
(16.17)
For T periods of time, the model's objective function minimizes the sum of costs related to regular production, overtime production, subcontracting or outsourcing, and hiring and firing employees, besides the inventory maintenance costs. Equation (1) of Expression (16.17) states that the final inventory in period t equals the final inventory in the previous period, plus the total produced in the period, minus the demand for the current period. The total production is specified in Equation (2) of Expression (16.17) as the sum of the regular production in period t, the overtime production, and the total of subcontracted units for the same period. Equation (3) of Expression (16.17) states that the total number of units produced with regular labor in period t equals that of the previous period (t − 1), plus the additional units produced with possible hiring, minus the units cancelled due to possible dismissals from period t − 1 to period t. Constraint 4 stipulates the maximum inventory capacity allowed for period t.
Constraint 5 guarantees that the regular production in period t will not be greater than the maximum limit allowed. Constraint 6 stipulates the maximum production limit allowed using overtime in period t. Constraint 7 sets a maximum production limit using outsourced labor for period t. Finally, the non-negativity conditions of the model's decision variables must also be met. The formulation is based on a linear programming (LP) model to solve the respective aggregated planning problem. However, if we considered as a decision variable the number of employees to be hired and fired in each period, instead of the variation in production due to the hiring or firing of employees, we would have a mixed-integer programming (MIP) problem, in which part of the decision variables are discrete. Similar to the production mix problem and to the production and inventory problem, when all the model's decision variables are discrete (the quantities produced and stored can only assume integer values), we have an integer programming (IP) model.

Example 16.11
Lifestyle, a company that produces natural juices, was analyzing several alternative aggregated planning options that could be adopted to produce cranberry juice in the second semester of the following year. However, they verified that an optimal solution for the problem could be obtained from a linear programming model. According to the sales department, the demand expected for the period being analyzed is listed in Table 16.E.15.
TABLE 16.E.15 Expected Demand (in Liters) for Cranberry Juice in the Second Semester of the Following Year

Month       Demand (L)
July        4500
August      5200
September   4780
October     5700
November    5820
December    4480
The production sector provided the following data:

Regular production cost (regular hours)                  US$ 1.50 per L
Production cost using overtime                           US$ 2.00 per L
Production cost using subcontracted labor                US$ 2.70 per L
Cost of increasing production by hiring new employees    US$ 3.00 per L
Cost of decreasing production by firing employees        US$ 1.20 per L
Inventory maintenance costs                              US$ 0.40 per L-month
Initial inventory                                        1000 L
Regular production in the previous month                 4000 L
Maximum inventory capacity                               1500 L/month
Maximum regular production capacity                      5000 L/month
Maximum production capacity using overtime               50 L/month
Maximum production capacity using subcontracted labor    500 L/month
Determine the mathematical formulation of Lifestyle’s aggregated planning problem so that they can minimize their total production costs, respecting the problem’s capacity constraints.
Solution
The mathematical formulation of Example 16.11 is similar to the general formulation of the aggregated planning problem presented in Expression (16.17). The complete model is shown here. First of all, we have to define the model's decision variables:

It = final inventory of cranberry juice in month t (liters), t = 1 (July), …, 6 (December)
Rt = regular production (regular hours) of juice in month t (liters), t = 1, …, 6
Ot = production of juice using overtime in month t (liters), t = 1, …, 6
St = production of juice using subcontracted labor in month t (liters), t = 1, …, 6
Ht = additional production of juice in month t by hiring employees from month t − 1 to month t (liters), t = 1, …, 6
Ft = cancelled production of juice in month t by firing employees from month t − 1 to month t (liters), t = 1, …, 6

The objective function can be written as:

min z = 1.5R1 + 2O1 + 2.7S1 + 3H1 + 1.2F1 + 0.4I1
+ 1.5R2 + 2O2 + 2.7S2 + 3H2 + 1.2F2 + 0.4I2
+ 1.5R3 + 2O3 + 2.7S3 + 3H3 + 1.2F3 + 0.4I3
+ 1.5R4 + 2O4 + 2.7S4 + 3H4 + 1.2F4 + 0.4I4
+ 1.5R5 + 2O5 + 2.7S5 + 3H5 + 1.2F5 + 0.4I5
+ 1.5R6 + 2O6 + 2.7S6 + 3H6 + 1.2F6 + 0.4I6

The model's constraints are:
(1) Inventory balance equations for each month t (t = 1, …, 6):

I1 = 1000 + R1 + O1 + S1 − 4500
I2 = I1 + R2 + O2 + S2 − 5200
I3 = I2 + R3 + O3 + S3 − 4780
I4 = I3 + R4 + O4 + S4 − 5700
I5 = I4 + R5 + O5 + S5 − 5820
I6 = I5 + R6 + O6 + S6 − 4480

Notice that Equation (2) of Expression (16.17) (Pt = Rt + Ot + St) is already represented.

(2) Quantity produced each month t with regular labor:

R1 = 4000 + H1 − F1
R2 = R1 + H2 − F2
R3 = R2 + H3 − F3
R4 = R3 + H4 − F4
R5 = R4 + H5 − F5
R6 = R5 + H6 − F6

(3) Maximum inventory capacity allowed for each period t:
I1, I2, I3, I4, I5, I6 ≤ 1500
(4) Maximum regular production capacity in period t:
R1, R2, R3, R4, R5, R6 ≤ 5000
(5) Maximum production capacity using overtime in period t:
O1, O2, O3, O4, O5, O6 ≤ 50
(6) Maximum production capacity using outsourced labor in period t:
S1, S2, S3, S4, S5, S6 ≤ 500
(7) Non-negativity constraints:
Rt, Ot, St, Ht, Ft, It ≥ 0, for t = 1, …, 6
The optimal solution of the aggregated planning model is:

I1 = 1270   I2 = 840   I3 = 880   I4 = 500   I5 = 0   I6 = 0
R1 = 4770   R2 = 4770   R3 = 4770   R4 = 4770   R5 = 4770   R6 = 4480
O1 = 0   O2 = 0   O3 = 50   O4 = 50   O5 = 50   O6 = 0
S1 = 0   S2 = 0   S3 = 0   S4 = 500   S5 = 500   S6 = 0
H1 = 770   H2 = 0   H3 = 0   H4 = 0   H5 = 0   H6 = 0
F1 = 0   F2 = 0   F3 = 0   F4 = 0   F5 = 0   F6 = 290

z = 49,549 (US$ 49,549.00)
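Substituting this solution back into the objective function reproduces the reported total cost, a useful sanity check when a model is solved by hand or in a spreadsheet:

```python
# Optimal solution of Example 16.11 and its unit costs
R = [4770, 4770, 4770, 4770, 4770, 4480]   # regular production (US$ 1.50/L)
O = [0, 0, 50, 50, 50, 0]                  # overtime (US$ 2.00/L)
S = [0, 0, 0, 500, 500, 0]                 # subcontracted (US$ 2.70/L)
H = [770, 0, 0, 0, 0, 0]                   # hiring (US$ 3.00/L)
F = [0, 0, 0, 0, 0, 290]                   # firing (US$ 1.20/L)
I = [1270, 840, 880, 500, 0, 0]            # inventory (US$ 0.40/L-month)

z = (1.5 * sum(R) + 2.0 * sum(O) + 2.7 * sum(S)
     + 3.0 * sum(H) + 1.2 * sum(F) + 0.4 * sum(I))
print(round(z, 2))  # 49549.0, matching z = US$ 49,549.00
```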
16.7 FINAL REMARKS
Optimization models can help researchers and managers in their business decision-making process. Among the existing optimization models, we can mention linear programming, network programming, integer programming, nonlinear programming, goal or multiobjective programming, and dynamic programming. Linear programming is one of the most widely used tools. This chapter introduced the main concepts of optimization models, especially the modeling of linear programming problems (general formulation in the standard and canonical forms, and business modeling problems). The use of optimization models, mainly linear programming, is increasingly widespread in academia and in the business world. They can be applied to several areas (strategy, marketing, finance, operations and logistics, human resources, among others) and to several sectors (transportation, automobile, aviation, naval, trade, services, banking, food, beverages, agribusiness, health, real estate, metallurgy, paper and cellulose, electrical energy, oil, gas and fuels, computers, telecommunications, mining, among others). The greatest motivation is the enormous savings they can generate, in the millions or even billions of dollars, for the industries that use them. Several real problems can be formulated through a linear programming model, including: the production mix problem, the mixture problem, the capital budget problem, investment portfolio selection, production and inventory, and aggregated planning, among others. The methods to solve a linear programming problem (graphical, analytical, using the Simplex algorithm, or using computerized solutions) will be discussed in the next chapter.
16.8 EXERCISES
(1) Describe the main characteristics present in a linear programming model.
(2) Give examples of the main fields and sectors in which the linear programming technique can be applied.
(3) Transform the problems into the standard form:

(a) max Σj=1,2 xj
s.t.
2x1 − 5x2 = 10   (1)
x1 + 2x2 ≤ 50   (2)
x1, x2 ≥ 0   (3)

(b) min 24x1 + 12x2
s.t.
3x1 + 2x2 ≥ 4   (1)
2x1 − 4x2 ≤ 26   (2)
x2 ≥ 3   (3)
x1, x2 ≥ 0   (4)

(c) max 10x1 − x2
s.t.
6x1 + x2 ≤ 10   (1)
x2 ≥ 6   (2)
x1, x2 ≥ 0   (3)

(d) max 3x1 + 3x2 − 2x3
s.t.
6x1 + 3x2 − x3 ≤ 10   (1)
−4 ≤ x2 + x3 ≤ 20   (2)
x1, x2, x3 ≥ 0   (3)
(4) Do the same here, but now into the canonical form.
(5) Transform the maximization problems into minimization problems:
(a) max z = 10x1 − x2
(b) max z = 3x1 + 3x2 − 2x3
(6) What are the hypotheses of a linear programming model? Describe each one of them.
(7) KMX is an American company in the automobile industry. It will launch three new car models next year: Arlington, Marilandy, and Lagoon. The production of each one of these models goes through the following processes: injection, foundry, machining, upholstery, and final assembly. The average operation times (minutes) of one unit of each component can be found in Table 16.1. Each one of these operations is 100% automated. The number of machines available for each sector can also be found in the same table. It is important to mention that each machine works 16 hours a day, from Monday to Friday. According to the commercial department, besides the minimum sales potential per week, the profit per unit of each automobile model can be seen in Table 16.2. Assuming that 100% of the models will be sold,

TABLE 16.1 Average Operation Time (Minutes) of 1 Unit of Each Component and Total Number of Machines Available

Sector           Arlington   Marilandy   Lagoon   Machines Available
Injection        3           4           3        6
Foundry          5           5           4        8
Machining        2           4           4        5
Upholstery       4           5           5        8
Final assembly   2           3           3        5

TABLE 16.2 Profit Per Unit and Weekly Minimum Sales Potential Per Product

Model       Profit Per Unit (US$)   Minimum Sales Potential (Units/Week)
Arlington   2500                    50
Marilandy   3000                    30
Lagoon      2800                    30
formulate the linear programming problem that determines the number of automobiles of each model to be manufactured, in order to maximize the company's weekly net profit.
(8) Refresh is a company in the beverage industry that is rethinking its production mix of beers and soft drinks. The production of beer goes through the following processes: the extraction of malt (which may or may not be manufactured internally), processing the wort, which produces the alcohol, fermenting (the main phase), processing the beer, and filling the bottles (packaging). The production of soft drinks goes through the following processes: preparation of the simple syrup, preparation of the compound syrup, dilution, carbonation, and packaging. Each one of the beer and soft drink processing phases is 100% automated. Besides the total number of machines available for each activity, the average operation times (in minutes) of each beer component can be found in Table 16.3. The same data regarding the processing of soft drinks can be found in Table 16.4. It is important to mention that each machine works 8 hours a day, 20 business days a month. Due to the existing competition, the total demand for beer and soft drinks is not greater than 42,000 L a month. The net profit is US$ 0.50 per liter of beer produced and US$ 0.40 per liter of soft drink. Formulate the linear programming problem that maximizes the total monthly profit margin.

TABLE 16.3 Average Beer Operation Time and Number of Machines Available

Sector                Operation Time (Minutes)   Number of Machines
Extraction of malt    2                          6
Processing the wort   4                          12
Fermenting            3                          10
Processing the beer   4                          12
Packaging the beer    5                          13

TABLE 16.4 Average Soft Drink Operation Time and Number of Machines Available

Sector                     Operation Time (Minutes)   Number of Machines
Simple syrup               1                          6
Compound syrup             3                          7
Dilution                   4                          8
Carbonation                5                          10
Packaging the soft drink   2                          5
(9) Golmobilec is a company in the electrical appliance industry that is reviewing its production mix regarding the main household equipment used in the kitchen: refrigerators, freezers, stoves, dishwashers, and microwave ovens. The manufacturing of each one of these devices starts with the pressing process that molds, perforates, adjusts, and cuts each component. The next phase consists of painting, followed by the molding process that gives the product its final shape. The last two phases consist of the assembly and packaging of the final product. Table 16.5 shows the time required (in hours/machine) to manufacture one unit of each component in each manufacturing process, besides the total time available for each sector. Table 16.6 shows the total number of labor hours (hours/employee) necessary to manufacture one unit of each component in each manufacturing process, in addition to the total number of employees available who work in each sector. It is important to highlight that each employee works 8 hours a day, from Monday to Friday. Due to storage capacity limitations, there is a maximum production capacity per product, as specified in Table 16.7. The same table also shows the minimum demand for each product that must be met, besides the net profit per unit sold. Formulate the linear programming problem that maximizes the total net profit.
TABLE 16.5 Time Necessary (in Hours/Machine) to Manufacture 1 Unit of Each Component in Each Sector

Sector      Refrigerator   Freezer   Stove   Dishwasher   Microwave Oven   Time Available (Hours/Machine/Week)
Pressing    0.2            0.2       0.4     0.4          0.3              400
Painting    0.2            0.3       0.3     0.3          0.2              350
Molding     0.4            0.3       0.3     0.3          0.2              250
Assembly    0.2            0.4       0.4     0.4          0.4              200
Packaging   0.1            0.2       0.2     0.2          0.3              200

TABLE 16.6 Total Number of Labor Hours Necessary to Produce 1 Unit of Each Product in Each Sector, Besides the Total Number of Employees Available

Sector      Refrigerator   Freezer   Stove   Dishwasher   Microwave Oven   Employees Available
Pressing    0.5            0.4       0.5     0.4          0.2              12
Painting    0.3            0.4       0.4     0.4          0.3              10
Molding     0.5            0.5       0.3     0.4          0.3              8
Assembly    0.6            0.5       0.4     0.5          0.6              10
Packaging   0.4            0.4       0.4     0.3          0.2              8

TABLE 16.7 Maximum Capacity, Minimum Demand, and Unit Profit Per Product

Product          Maximum Capacity (Units/Week)   Minimum Demand (Units/Week)   Profit Per Unit (US$/Unit)
Refrigerator     1000                            200                           52
Freezer          800                             50                            37
Stove            500                             50                            35
Dishwasher       500                             50                            40
Microwave oven   200                             40                            29
(10) A refinery produces three types of gasoline: regular, green, and yellow. Each type of gasoline can be produced from the mixture of four types of petroleum: petroleum 1, petroleum 2, petroleum 3, and petroleum 4. Each type of gasoline requires certain specifications of octane and benzene:
– A liter of regular gasoline requires, at least, 0.20 L of octane and 0.18 L of benzene
– A liter of green gasoline requires, at least, 0.25 L of octane and 0.20 L of benzene
– A liter of yellow gasoline requires, at least, 0.30 L of octane and 0.22 L of benzene
The octane and benzene compositions of each type of petroleum are:
– A liter of petroleum 1 contains 0.20 of octane and 0.25 of benzene
– A liter of petroleum 2 contains 0.30 of octane and 0.20 of benzene
– A liter of petroleum 3 contains 0.15 of octane and 0.30 of benzene
– A liter of petroleum 4 contains 0.40 of octane and 0.15 of benzene
Due to contracts that have already been signed, the refinery needs to produce 12,000 L of regular gasoline, 10,000 L of green gasoline, and 8000 L of yellow gasoline daily. The refinery has a maximum production capacity of 60,000 L of gasoline a day, and can purchase up to 15,000 L of each type of petroleum daily. Each liter of regular, green, and yellow gasoline has a net profit of $ 0.40, $ 0.45 and $ 0.50, respectively. The purchase prices per liter of petroleum 1, petroleum 2, petroleum 3, and petroleum 4 are $ 0.20, $ 0.25, $ 0.30, and $ 0.30, respectively. Formulate the linear programming problem aiming at maximizing the daily net profit. (11) Model Adrianne Medici Torres is upset about some localized fat and would like to lose a few kilos in a few weeks. Her nutritionist recommended a diet that is rich in carbs, moderate in fruit, vegetables, protein, legumes, milk and dairy products, and low in fats and sugar. Table 16.8 shows the food options that can be part of Adrianne’s diet and their respective compositions and characteristics. The data in Table 16.8 can also be found in the file AdrianneTorres’Diet.xls. According to her nutritionist, a balanced diet should contain between 4 and 9 portions of carbs, 3 to 5 portions of fruit, 4 to 5 portions of vegetables, 1 portion of legumes, 2 portions of protein, 2 to 3 portions of milk and dairy products, 1 to 2 portions of sugar and sweets, and 1 to 2 portions of fat. We tried to determine how many portions of each food must be ingested daily, at each meal, in order to minimize the total number of calories consumed, meeting the following requisites: (a) The ideal number of portions ingested, of each type of food, must be respected. (b) Each food can only be ingested at the meal specified in Table 16.8. For example, in the case of cereal, we tried to determine how many portions must be ingested daily at breakfast. 
In the case of cereal bars, in turn, we determine how many portions can be ingested daily at breakfast and as part of the morning and afternoon snacks.
(c) The total number of calories ingested at breakfast cannot be higher than 300 calories.
(d) The total number of calories ingested as a morning snack cannot be higher than 200 calories.
(e) The total number of calories ingested at lunch cannot be higher than 550 calories.
(f) The total number of calories ingested as an afternoon snack cannot be higher than 200 calories.
(g) The total number of calories ingested at dinner cannot be higher than 350 calories.
(h) At breakfast, she must ingest, at least, 1 portion of carbs, 2 of fruit, and 1 of milk and/or dairy products.
(i) Lunch should contain, at least, 1 portion of each of the following types of food: carbs, protein, legumes, and vegetables.
(j) The morning and afternoon snacks should each contain, at least, 1 fruit.
(k) Dinner should contain, at least, 1 portion of carbs, 1 of protein, 1 of milk and dairy products, and 1 of vegetables.
(l) A balanced diet should contain, at least, 25 g of fibers a day.
(m) 100% of our daily needs of the main vitamins and minerals (iron, zinc, vitamins A, C, B1, B2, B6, B12, niacin, folic acid, etc.) must be met in order for our bodies to work properly. Table 16.8 shows the percentage guaranteed by each portion of food with regard to our daily needs of vitamins and minerals.
Formulate the linear programming model for Adrianne's diet problem.
TABLE 16.8 Composition and Characteristics of Each Food That Can Be Part of Adrianne's Diet (File AdrianneTorres'Diet.xls)

| Food | Energy (cal/portion) | Fibers (g/portion) | % Vitamins and Minerals | Type of Food | Meals |
|---|---|---|---|---|---|
| Lettuce | 1 | 1 | 9 | V | 3, 5 |
| Plums/prunes | 30 | 2.4 | 4 | F | 1, 2, 4 |
| Rice | 130 | 1.2 | 0.5 | C | 3 |
| Brown rice | 110 | 1.6 | 1 | C | 3 |
| Olive oil | 90 | 0 | 0 | TF | 3, 5 |
| Banana | 80 | 2.6 | 13 | F | 1, 2, 4 |
| Cereal bar | 90 | 0.9 | 11 | C | 1, 2, 4 |
| Crackers | 90 | 0.4 | 0.4 | C | 1, 2, 4 |
| Broccoli | 10 | 2.7 | 15 | V | 3, 5 |
| Meat | 132 | 0 | 1 | P | 3 |
| Carrots | 31 | 2 | 19 | V | 3, 5 |
| Cereal | 120 | 1.3 | 20 | C | 1 |
| Chocolate | 150 | 0.2 | 0.5 | SS | 3, 5 |
| Spinach | 18 | 2 | 28 | V | 3, 5 |
| Beans | 95 | 7.9 | 6 | L | 3 |
| Chicken | 112 | 0 | 1.5 | P | 3 |
| Jello | 30 | 0.2 | 0 | SS | 3, 5 |
| Chickpeas | 92 | 3.5 | 4 | L | 3 |
| Yoghurt | 70 | 1.1 | 0.7 | MD | 1, 2, 4 |
| Apples | 60 | 3 | 0.9 | F | 1, 2, 4 |
| Papayas | 56 | 2.4 | 3.1 | F | 1, 2, 4 |
| Eggs | 60 | 0.6 | 8.5 | P | 3 |
| Butter | 100 | 0 | 0 | TF | 1, 5 |
| Bread | 140 | 0.5 | 3.3 | C | 1, 5 |
| Wholewheat bread | 142 | 0.8 | 12 | C | 1, 5 |
| Turkey ham | 75 | 0.4 | 0.4 | P | 1, 5 |
| Fish | 104 | 0.7 | 11 | P | 3 |
| Pears | 88 | 4 | 1.2 | F | 1, 2, 4 |
| Cottage cheese | 80 | 0.4 | 0.6 | MD | 1, 5 |
| Arugula | 4 | 1 | 9.5 | V | 3, 5 |
| Natural sandwiches | 240 | 1.4 | 19 | Mixed | 5 |
| Soya | 85 | 3.9 | 8 | L | 3 |
| Soup | 120 | 3.5 | 16 | Mixed | 5 |
| Tomatoes | 26 | 1.5 | 5 | V | 3, 5 |

C, carbs; V, vegetables; F, fruit; P, protein; L, legumes; MD, milk and dairy products; TF, total fat; SS, sugar and sweets; 1, food that can be eaten at breakfast; 2, food that can be eaten as a morning snack; 3, food that can be eaten at lunch; 4, food that can be eaten as an afternoon snack; 5, food that can be eaten at dinner.
Note: The soup contains 1 portion of carbs, 1 of protein, 1 of vegetables, and 1 of fat. A natural sandwich, in turn, contains 2 portions of carbs, 1 of protein, 1 of milk and dairy products, 1 of vegetables, and 1 of fat.

(12) Company GWX is trying to obtain a competitive differential in the market and, in order to do this, it is considering five new investment projects for the following 3 years: the development of new products, investment in IT, investment in training courses, factory expansion, and warehouse expansion. Each project requires an initial investment and generates an expected return over the following 3 years, as shown in Table 16.9. Currently, the company has a maximum budget of US$ 1,000,000 to invest. For each investment project, the interest rate is 10% per year. It is important to highlight that the investment project in IT depends on the investment project in training, that is, it can only be selected if the investment project in training is accepted. Besides, the factory and warehouse expansion projects are mutually exclusive, that is, only one of them can be selected. Formulate the problem whose main objective is to determine in which projects the company should invest, in order to maximize the present wealth generated by the set of investment projects being analyzed.
TABLE 16.9 Initial Investment and Expected Return in the Following 3 Years for Each Project (Cash Flow per Year, US$ Thousand)

| Year | Product Development | Investment in IT | Training | Factory Expansion | Warehouse Expansion |
|---|---|---|---|---|---|
| 0 | 360 | 240 | 180 | 480 | 320 |
| 1 | 250 | 100 | 120 | 220 | 180 |
| 2 | 300 | 150 | 180 | 350 | 200 |
| 3 | 320 | 180 | 180 | 330 | 310 |

Note: Year 0 is the initial investment (cash outflow); years 1 to 3 are the expected returns.
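Exercise 12 is a 0-1 (binary) capital-budgeting formulation. Before writing the algebraic model, the intended optimum can be sanity-checked by brute force, since there are only 2^5 project combinations. The sketch below is a hedged illustration, not the book's method: it assumes that "present wealth" means the net present value (NPV) of each selected project's cash flows discounted at the 10% rate, and it enumerates every selection against the budget, dependency, and mutual-exclusion constraints.

```python
from itertools import product

# Cash flows in US$ thousand, taken from Table 16.9; year 0 is the outlay.
projects = {
    "Product Development": [-360, 250, 300, 320],
    "Investment in IT":    [-240, 100, 150, 180],
    "Training":            [-180, 120, 180, 180],
    "Factory Expansion":   [-480, 220, 350, 330],
    "Warehouse Expansion": [-320, 180, 200, 310],
}

def npv(flows, rate=0.10):
    # Net present value of a cash-flow list indexed by year
    return sum(f / (1 + rate) ** t for t, f in enumerate(flows))

names = list(projects)
best_set, best_npv = None, float("-inf")
for choice in product([0, 1], repeat=len(names)):
    chosen = {n for n, c in zip(names, choice) if c}
    # Budget: initial outlays cannot exceed US$ 1,000 thousand
    if sum(-projects[n][0] for n in chosen) > 1000:
        continue
    # IT can only be selected together with Training
    if "Investment in IT" in chosen and "Training" not in chosen:
        continue
    # Factory and warehouse expansion are mutually exclusive
    if {"Factory Expansion", "Warehouse Expansion"} <= chosen:
        continue
    total = sum(npv(projects[n]) for n in chosen)
    if total > best_npv:
        best_set, best_npv = chosen, total
```

In a binary LP this enumeration is replaced by 0-1 variables x_j with the constraints 240x_IT ≤ 240x_Training (dependency) and x_Factory + x_Warehouse ≤ 1 (mutual exclusion); the brute-force result is only a check for the formulated model.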
(13) A financial analyst from a major brokerage firm is selecting a portfolio for a group of clients. The analyst intends to invest in several sectors, having as options five companies from the financial sector (banks and insurance companies), two from the metallurgy sector, one from the mining sector, one from the paper and cellulose sector, and one from the electrical energy sector. Table 16.10 shows the monthly return history of each one of these stocks over a period of 36 months. These data are available in the file Stocks.xls. In order to increase diversification, it was established that the portfolio may contain, at most, 50% of stocks from the financial sector (banks and insurance companies) and, at most, 40% of any single asset. Besides, the portfolio should contain, at least, 20% of stocks from the banking sector, 20% from the metallurgy or mining sectors, and 20% from the paper and cellulose or electrical energy sectors. Investors expect the average return of the portfolio to reach a minimum value of 0.80% per month. Furthermore, the portfolio's risk, measured by the standard deviation, cannot exceed 5%. Elaborate the linear programming model that minimizes the portfolio's mean absolute deviation.

(14) Redo the previous exercise considering the period from month 1 to month 24. In this case, however, the model must be formulated for three distinct goals: (a) to minimize the MAD (mean absolute deviation), as in the previous case; (b) to minimize the square root of the sum of squared deviations from the mean; (c) min-max (to minimize the highest absolute deviation).

(15) CTA Investment Bank manages third parties' financial resources, operating in several different investment modalities and ensuring its clients the best return with the lowest risk. Robert Johnson, a client of CTA Investments, wishes to invest US$ 500,000.00 in investment funds.
According to Robert's profile, his bank account manager selected 11 types of investment funds that could be part of his portfolio. Table 16.11 shows a description of each fund, its annual return, its risk, and the minimum initial investment required. The expected annual return was calculated as the weighted moving mean of the last five years. The risk of each fund, measured from the standard deviation of its return history, is also specified in Table 16.11. The maximum risk allowed for Robert's portfolio is 6%. Besides, due to his conservative profile, Robert would like to invest, at least, 50% of his capital in index-pegged funds and fixed income funds and, at most, 25% in each one of the other investments. Formulate the linear programming problem that determines how much to invest in each fund, in order to maximize the expected annual return, respecting the constraints on the portfolio's maximum risk, the minimum investment in fixed income, and the minimum initial investment in each fund.

(16) Company Arts & Chemical, a leader in the chemical sector, manufactures m products, including plastic, rubber, paints, and polyurethane, among others. The company plans to integrate production, inventory, and transportation decisions. The merchandise can be produced in n different facilities that distribute these products to p different retailers located in the regions of Washington, Baltimore, Philadelphia, New York, and Pittsburgh. The planning horizon comprises T periods. In each period, we intend to determine which of the n facilities should produce and deliver each one of the m products to each one of the p retailers. Each facility can cater to more than one retailer; however, the total demand of each retailer must be met by a single facility. The production and storage capacities of the facilities are limited and differ from one another depending on the product and period.
Unit production, transportation, and inventory maintenance costs also differ per product, facility, and period. The main objective is to assign retailers to facilities and to determine how much to produce and the level of inventories of each product in each facility and period, in such a way that the sum of the total production, transportation, and inventory
TABLE 16.10 Monthly Return History of 10 Stocks From Different Industries in a Period of 36 Months (File Stocks.xls)

| Month | Stock 1 Banking (%) | Stock 2 Banking (%) | Stock 3 Banking (%) | Stock 4 Banking (%) | Stock 5 Insurance (%) | Stock 6 Metallurgy (%) | Stock 7 Metallurgy (%) | Stock 8 Mining (%) | Stock 9 Paper-Cellulose (%) | Stock 10 Electrical Energy (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2.57 | 4.47 | 1.08 | 4.78 | 4.19 | 2.54 | 0.57 | 0.60 | 4.07 | 2.78 |
| 2 | 3.14 | 4.33 | 0.87 | 3.41 | 3.08 | 2.69 | 0.98 | 5.78 | 3.57 | 3.69 |
| 3 | 6.00 | 2.67 | 4.87 | 2.81 | 6.47 | 1.98 | 5.69 | 3.25 | 2.69 | 2.14 |
| 4 | 2.14 | 3.59 | 3.57 | 6.70 | 8.05 | 3.14 | 3.10 | 0.88 | 2.02 | 4.01 |
| 5 | 5.44 | 3.34 | 2.78 | 2.08 | 5.04 | 7.58 | 3.28 | 4.52 | 1.57 | 1.33 |
| 6 | 11.30 | 2.09 | 5.69 | 3.00 | 3.47 | 6.85 | 8.07 | 2.88 | 2.33 | 4.21 |
| 7 | 8.07 | 7.80 | 6.44 | 3.54 | 2.09 | 4.70 | 2.67 | 0.58 | 2.87 | 0.74 |
| 8 | 2.77 | 6.14 | 6.87 | 2.97 | 2.56 | 11.02 | 3.69 | 3.69 | 0.05 | 0.65 |
| 9 | 2.37 | 5.77 | 10.07 | 5.90 | 4.44 | 5.99 | 6.47 | 1.44 | 1.69 | 2.47 |
| 10 | 2.14 | 3.23 | 5.64 | 7.01 | 6.07 | 0.14 | 0.22 | 4.22 | 5.87 | 3.54 |
| 11 | 4.40 | 1.04 | 3.30 | 2.04 | 5.30 | 2.36 | 3.11 | 0.47 | 2.14 | 2.58 |
| 12 | 2.10 | 3.02 | 2.27 | 3.50 | 2.07 | 2.14 | 4.55 | 0.05 | 1.01 | 5.47 |
| 13 | 2.14 | 2.01 | 5.47 | 9.33 | 4.44 | 1.34 | 0.24 | 6.95 | 3.99 | 3.54 |
| 14 | 4.69 | 3.67 | 2.10 | 8.07 | 6.14 | 0.98 | 3.50 | 8.41 | 1.47 | 2.57 |
| 15 | 11.32 | 5.69 | 2.07 | 2.77 | 3.07 | 0.66 | 2.78 | 5.41 | 2.58 | 4.78 |
| 16 | 4.69 | 2.00 | 3.47 | 5.48 | 2.05 | 2.89 | 8.40 | 0.22 | 3.57 | 1.23 |
| 17 | 2.01 | 6.75 | 3.78 | 3.50 | 2.67 | 13.47 | 7.55 | 9.54 | 0.88 | 0.27 |
| 18 | 7.65 | 9.47 | 3.89 | 6.41 | 3.07 | 4.23 | 0.07 | 11.02 | 2.34 | 3.55 |
| 19 | 2.36 | 5.33 | 5.68 | 3.04 | 4.08 | 0.28 | 9.56 | 2.55 | 1.09 | 2.67 |
| 20 | 11.47 | 6.01 | 3.46 | 2.08 | 4.99 | 2.63 | 5.04 | 12.23 | 7.03 | 0.74 |
| 21 | 3.39 | 2.01 | 3.09 | 3.64 | 3.70 | 3.63 | 3.66 | 2.00 | 4.33 | 3.69 |
| 22 | 8.43 | 5.03 | 1.01 | 6.80 | 8.02 | 2.47 | 4.40 | 4.47 | 5.87 | 0.25 |
| 23 | 4.16 | 5.33 | 5.61 | 5.47 | 7.35 | 0.50 | 2.57 | 6.58 | 2.67 | 0.98 |
| 24 | 2.37 | 3.36 | 7.43 | 6.17 | 2.44 | 7.99 | 3.01 | 8.80 | 7.80 | 4.36 |
| 25 | 7.00 | 11.04 | 6.40 | 5.55 | 11.07 | 6.01 | 9.77 | 5.96 | 2.22 | 1.66 |
| 26 | 3.22 | 4.64 | 6.43 | 4.58 | 2.47 | 14.15 | 6.41 | 3.22 | 1.49 | 0.20 |
| 27 | 4.67 | 2.07 | 2.98 | 2.07 | 2.60 | 5.47 | 2.60 | 4.74 | 1.42 | 1.59 |
| 28 | 3.20 | 3.68 | 3.10 | 2.65 | 3.18 | 3.14 | 3.01 | 2.33 | 0.77 | 5.67 |
| 29 | 0.74 | 0.58 | 2.73 | 6.47 | 3.08 | 3.25 | 7.78 | 4.01 | 0.59 | 4.90 |
| 30 | 5.02 | 7.04 | 9.40 | 6.07 | 2.00 | 1.08 | 8.36 | 4.32 | 3.07 | 3.92 |
| 31 | 4.30 | 2.99 | 6.81 | 5.88 | 6.47 | 5.47 | 2.04 | 6.77 | 2.55 | 2.14 |
| 32 | 2.64 | 7.66 | 6.90 | 0.47 | 6.13 | 11.01 | 2.15 | 2.64 | 0.84 | 0.71 |
| 33 | 6.77 | 7.16 | 5.87 | 8.09 | 2.47 | 5.71 | 3.19 | 5.74 | 5.98 | 2.04 |
| 34 | 6.70 | 3.41 | 6.80 | 6.47 | 2.08 | 14.33 | 2.03 | 9.12 | 0.25 | 4.33 |
| 35 | 2.98 | 2.01 | 5.32 | 5.00 | 4.43 | 5.44 | 6.07 | 8.40 | 0.50 | 2.36 |
| 36 | 5.70 | 11.52 | 6.00 | 0.27 | 2.29 | 2.47 | 5.73 | 6.47 | 1.00 | 1.60 |
TABLE 16.11 Characteristics of Each Fund

| Fund | Annual Yield/Return (%) | Risk (%) | Initial Investment (US$) |
|---|---|---|---|
| Index-pegged fund A | 11.74 | 1.07 | 30,000.00 |
| Index-pegged fund B | 12.19 | 1.07 | 100,000.00 |
| Index-pegged fund C | 12.66 | 1.07 | 250,000.00 |
| Fixed income fund A | 12.22 | 1.62 | 30,000.00 |
| Fixed income fund B | 12.87 | 1.62 | 100,000.00 |
| Fixed income fund C | 12.96 | 1.62 | 250,000.00 |
| Commercial paper A | 16.04 | 5.89 | 20,000.00 |
| Commercial paper B | 17.13 | 5.89 | 100,000.00 |
| Multimarket fund | 18.10 | 5.92 | 10,000.00 |
| Stock fund A | 19.53 | 6.54 | 1000.00 |
| Stock fund B | 22.16 | 7.23 | 1000.00 |
maintenance costs is minimized, the demand of each retailer is met, and the capacity constraints are not violated. From the general production and inventory model proposed in Section 16.6.6, elaborate a general model that integrates production, inventory, and distribution decisions. Note: Since we have a binary decision variable (equal to 1 if product i is delivered by facility j to retailer k in period t, and 0 otherwise), we have a mixed-integer programming problem.

(17) Based on the previous exercise, consider a case in which each retailer can receive products from more than one facility. Elaborate the adapted general model. Note: In this case, we must define a new decision variable that determines the amount of product i to be transported from facility j to retailer k in period t.

(18) Pharmabelz, a company in the cosmetics and cleaning products industry, would like to define the aggregate production planning of Leveza, a type of soap, for the first semester of the following year. To do so, the sales department provided the expected demand for the period being studied, as shown in Table 16.12.
TABLE 16.12 Soap Demand Expected (kg) for the First Semester of the Following Year

| Month | Demand (kg) |
|---|---|
| January | 9600 |
| February | 10,600 |
| March | 12,800 |
| April | 10,650 |
| May | 11,640 |
| June | 10,430 |
The production data are:

| Parameter | Value |
|---|---|
| Costs of regular production | US$ 1.50 per kg |
| Cost to outsource production | US$ 2.00 per kg |
| Costs of regular labor | US$ 600.00/employee-month |
| Cost to hire a worker | US$ 1000.00/worker |
| Cost to fire a worker | US$ 900.00/worker |
| Cost per overtime hour | US$ 7.00/hour |
| Inventory maintenance costs | US$ 1.00/kg-month |
| Regular labor in the previous month | 10 workers |
| Initial inventory | 600 kg |
| Average productivity per employee | 16 kg/employee-hour |
| Average productivity per overtime hour | 14 kg/hour |
| Maximum outsourced production capacity | 1000 kg/month |
| Maximum regular labor capacity | 20 workers |
| Maximum inventory capacity | 2500 kg/month |
Each employee usually works 6 business hours a day, 20 business days a month, and may work, at most, 20 overtime hours a month. Formulate the aggregate planning model (a mixed-integer program) that minimizes the total production, labor, and storage costs for the period analyzed, respecting the system's capacity constraints.
Chapter 17
Solution of Linear Programming Problems

There is geometry everywhere. However, it is necessary to have eyes to see it, intelligence to understand it, and a soul to admire it.
Malba Tahan in "The Man Who Counted"
17.1 INTRODUCTION
In this chapter, we discuss several ways of solving a linear programming (LP) problem: (a) graphically; (b) through the analytical method; (c) by using the Simplex method; (d) by using a computer. A simple linear programming problem with only two decision variables can easily be solved graphically or through the analytical method. The graphical solution can also be applied to problems with, at most, three decision variables, though with greater complexity. Similarly, the analytical solution becomes impractical for problems with many variables and equations, since it calculates all the possible basic solutions. As an alternative to these procedures, we can use the Simplex algorithm or existing software directly (GAMS, AMPL, AIMMS, or spreadsheet-based tools such as Solver in Excel and What's Best, among others) to solve any linear programming problem. In this chapter, we will solve each one of the management problems modeled in the previous chapter (Examples 16.3 to 16.12) by using Solver in Excel. Some linear programming problems do not have a single nondegenerate optimal solution; such a problem falls into one of four categories: (a) multiple optimal solutions; (b) unlimited objective function z; (c) no optimal solution; (d) degenerate optimal solution. Throughout this chapter, we will discuss how to identify each one of these special cases graphically, through the Simplex method, and by using a computer. Many times, the estimation of model parameters is based on forecasts, and changes may occur before the final solution is implemented in the real world. Examples of such changes include changes in the quantity of resources available, the introduction of a new product, variation in a product's price, and increases or decreases in production costs, among others.
Therefore, sensitivity analysis is essential in the study of linear programming problems, since its main goal is to investigate the impact that changes in the model parameters have on the optimal solution. The sensitivity analysis presented at the end of the chapter discusses the variation that the objective function coefficients and the constants on the right-hand side of each constraint can assume without changing the initial model's optimal solution, or without changing the feasibility region. This analysis can be carried out graphically, by using algebraic calculations, or directly through Solver in Excel or other software packages, such as Lindo, considering one alteration at a time.
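The core idea of sensitivity analysis can be previewed numerically: the shadow price of a constraint is the change in the optimal z per unit change in that constraint's right-hand-side constant. The sketch below estimates a shadow price by simply re-solving a small made-up LP with the first right-hand side increased by one unit, using SciPy's `linprog` (an assumption: the book itself carries out this analysis with Solver in Excel, not Python).

```python
from scipy.optimize import linprog

# Hypothetical LP for illustration:
# max 6x1 + 4x2  s.t.  2x1 + 3x2 <= b1,  5x1 + 4x2 <= 40,  x1 <= 6,  x >= 0
def solve(b1):
    # linprog minimizes, so the objective coefficients are negated
    res = linprog([-6, -4], A_ub=[[2, 3], [5, 4], [1, 0]],
                  b_ub=[b1, 40, 6],
                  bounds=[(0, None), (0, None)], method="highs")
    return -res.fun  # optimal z of the maximization problem

# Shadow price of the first constraint: change in optimal z when b1 goes 18 -> 19
shadow_price = solve(19) - solve(18)
```

This finite-difference view is only valid while the perturbation stays within the range in which the same basis remains optimal, which is exactly what the Sensitivity Report's allowable increase/decrease columns describe.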
17.2 GRAPHICAL SOLUTION OF A LINEAR PROGRAMMING PROBLEM
A simple linear programming problem that has two decision variables can easily be solved graphically. According to Hillier and Lieberman (2005), any LP problem that has two decision variables can be solved graphically. Problems with up to three decision variables can also be solved graphically, but with greater complexity. In the graphical solution of a linear programming model, first of all, we must determine the feasible solution space, or feasible region, along the Cartesian axes. A viable or feasible solution is one that satisfies all the model constraints, including the non-negativity constraints. If a solution violates at least one of the model constraints, it is called an unfeasible solution. The following step consists of determining the model's optimal solution, that is, the feasible solution that has the best objective function value. For a maximization problem, after the set of feasible solutions is established, the optimal solution is the one that gives the highest value to the objective function within this set; for a minimization problem, the optimal solution is the one that minimizes the objective function.
Data Science for Business and Decision Making. https://doi.org/10.1016/B978-0-12-811216-8.00017-3 © 2019 Elsevier Inc. All rights reserved.
FIG. 17.1 An example of convex and nonconvex sets.
The set of feasible solutions for a linear programming problem is represented by K. This gives rise to the first theorem:
Theorem 1. The set K is convex.
Definition. A set K is convex when all the line segments that connect any two points of K are contained in K. A convex set is bounded if it includes its bounds.
As an illustrative example, the graphical representation of convex and nonconvex sets can be seen in Fig. 17.1. The graphical solution of a linear programming maximization and minimization problem with a single optimal solution will be illustrated through Examples 17.1 and 17.2, respectively. The special cases (multiple optimal solutions, unlimited objective function, unfeasible solution, and degenerate optimal solution) will be presented through Examples 17.3, 17.4, 17.5, and 17.6.
17.2.1 Linear Programming Maximization Problem With a Single Optimal Solution
The graphical solution of an LP maximization problem with a single optimal solution will be illustrated through Example 17.1.

Example 17.1
Consider the following LP maximization problem:

max z = 6x1 + 4x2
subject to:
2x1 + 3x2 ≤ 18
5x1 + 4x2 ≤ 40
x1 ≤ 6
x2 ≤ 8
x1, x2 ≥ 0        (17.1)
Determine the set of feasible solutions, in addition to the model's optimal solution.

Solution
Feasible region

In the Cartesian axes x1 and x2, we determine the feasible solution space that represents the constraints of the maximization model being studied. First, for each constraint, we plot the line that represents the equality equation (ignoring the ≤ or ≥ sign) and, from then on, we determine the side of the line that satisfies the inequality. Thus, for the first constraint, the line that represents the equation 2x1 + 3x2 = 18 can be plotted from two points. If x1 = 0, then x2 = 6. Analogously, if x2 = 0, then x1 = 9. To determine the side of the line that satisfies the inequality 2x1 + 3x2 ≤ 18, we can consider any point outside the line. We usually use the origin (x1, x2) = (0, 0), due to its simplicity. We can see that the origin satisfies the first inequality, because 0 + 0 ≤ 18. Therefore, we can identify the side of the line that contains feasible solutions, as shown in Fig. 17.2. In the same way, for the second constraint, the line that represents the equality equation 5x1 + 4x2 = 40 is plotted from two points. If x1 = 0, then x2 = 10. Analogously, if x2 = 0, then x1 = 8. We can also see that the origin satisfies the inequality 5x1 + 4x2 ≤ 40, because 0 + 0 ≤ 40, indicating the side of the line that contains the feasible solutions, consistent with Fig. 17.2. Similarly, we can determine the feasible solution space for the other constraints: x1 ≤ 6, x2 ≤ 8, x1 ≥ 0, and x2 ≥ 0. Constraints 5x1 + 4x2 ≤ 40 and x2 ≤ 8 are redundant, that is, if they were excluded from the model, the feasible solution space would not be affected. The feasible region is represented by the four-sided polygon ABCD. Any point on the boundary of the polygon or inside it belongs to the feasible region. On the other hand, any point outside the polygon fails to satisfy at least one of the model constraints.
Solution of Linear Programming Problems Chapter
17
749
FIG. 17.2 Feasible region of Example 17.1.
FIG. 17.3 Optimal solution for Example 17.1.
Optimal solution
The next step is to determine the model's optimal solution, the one that maximizes the function z = 6x1 + 4x2 within the feasible solution space shown in Fig. 17.2. Since the solution space contains an infinite number of points, a formal procedure is necessary to identify the optimal solution (Taha, 2016). First, we need to identify the direction in which the function increases (maximization function). In order to do that, we draw different lines based on the objective function equation, assigning different values to z by trial and error. Having identified the direction in which the objective function increases, it is possible to identify the model's optimal solution within the feasible solution space. First, we assigned a value of z = 24, followed by z = 36, obtaining the equations 6x1 + 4x2 = 24 and 6x1 + 4x2 = 36, respectively. From these two equations, it was possible to identify the direction, within the feasible solution space, that maximizes the objective function, concluding that point C is optimal. Since vertex C is the intersection of the lines 2x1 + 3x2 = 18 and x1 = 6, the values of x1 and x2 can be calculated algebraically from these two equations. Therefore, we have x1 = 6 and x2 = 2 with z = 6 × 6 + 4 × 2 = 44. The complete procedure is presented in Fig. 17.3. Since all the lines are represented by the equation z = 6x1 + 4x2, differing only in the value of z, we can conclude from Fig. 17.3 that the lines are parallel. Another important theorem states that an optimal solution for a linear programming problem is always associated with a vertex or extreme point of the solution space:
Theorem 2. For linear programming problems with a single optimal solution, the objective function reaches its maximum or minimum at an extreme point of the convex set K.
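The vertex found graphically in Example 17.1 can be cross-checked numerically. The sketch below uses SciPy's `linprog` (an assumption: the book itself solves these models with Solver in Excel); since `linprog` minimizes by convention, the objective coefficients are negated and the sign of the optimum is flipped back.

```python
from scipy.optimize import linprog

# Example 17.1: max z = 6x1 + 4x2  ->  min -6x1 - 4x2
c = [-6, -4]
A_ub = [[2, 3],   # 2x1 + 3x2 <= 18
        [5, 4],   # 5x1 + 4x2 <= 40
        [1, 0],   # x1 <= 6
        [0, 1]]   # x2 <= 8
b_ub = [18, 40, 6, 8]

res = linprog(c, A_ub=A_ub, b_ub=b_ub,
              bounds=[(0, None), (0, None)], method="highs")
x1, x2 = res.x
z = -res.fun  # undo the sign flip
```

The solver finds the same vertex C obtained graphically: x1 = 6, x2 = 2, and z = 44.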
17.2.2 Linear Programming Minimization Problem With a Single Optimal Solution
Example 17.2
Consider the following minimization problem:

min z = 10x1 + 6x2
subject to:
4x1 + 3x2 ≥ 24
2x1 + 5x2 ≥ 20
x1 ≤ 8
x2 ≤ 6
x1, x2 ≥ 0        (17.2)
Determine the set of feasible solutions and the model's optimal solution.

Solution
Feasible region

The same procedure of Example 17.1 is used to obtain the feasible solution space of the minimization problem. First, we must determine the feasible region from the minimization model constraints. Considering the first constraint, 4x1 + 3x2 ≥ 24, and the second, 2x1 + 5x2 ≥ 20, we can see that the origin (x1, x2) = (0, 0) does not satisfy either inequality. Thus, the feasible side of these two lines does not contain this point. Including the constraints x1 ≤ 8 and x2 ≤ 6, the feasible solution space becomes limited, as shown in Fig. 17.4. Different from Example 17.1, in this case all the constraints are nonredundant, that is, all of them are responsible for defining the model's feasibility region. The feasibility region is represented by polygon ABCD, which is highlighted in Fig. 17.4.
Optimal solution
The same procedure of Example 17.1 is used to find the optimal solution of the minimization problem. Therefore, we try to determine the model's optimal solution, the one that minimizes the function z = 10x1 + 6x2 within the feasible solution space shown in Fig. 17.4. To analyze the direction in which the objective function decreases (minimization function), different values of z were assigned by trial and error. First, we assigned a value of z = 72, obtaining the equation 10x1 + 6x2 = 72, followed by z = 60, where 10x1 + 6x2 = 60. Hence, it was possible to identify the direction that minimizes the objective function, concluding that point D represents the model's optimal solution (see Fig. 17.5).
FIG. 17.4 Feasible region of Example 17.2.
FIG. 17.5 Optimal solution for Example 17.2.
Coordinates x1 and x2 of point D can be calculated algebraically from the equations 4x1 + 3x2 = 24 and x2 = 6, because point D is the intersection of these two lines. Thus, we have x1 = 1.5 and x2 = 6 with z = 10 × 1.5 + 6 × 6 = 51. As in the previous example, the LP problem shows a single optimal solution that is associated with an optimal vertex of the solution space (Theorem 2).
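Example 17.2 can also be verified with SciPy's `linprog` (an assumption, as before: the book works in Excel). Here the problem is already a minimization, so no sign flip is needed; each ≥ constraint is encoded as a ≤ constraint by negating both sides.

```python
from scipy.optimize import linprog

# Example 17.2: min z = 10x1 + 6x2
c = [10, 6]
A_ub = [[-4, -3],  # 4x1 + 3x2 >= 24  ->  -4x1 - 3x2 <= -24
        [-2, -5],  # 2x1 + 5x2 >= 20  ->  -2x1 - 5x2 <= -20
        [1, 0],    # x1 <= 8
        [0, 1]]    # x2 <= 6
b_ub = [-24, -20, 8, 6]

res = linprog(c, A_ub=A_ub, b_ub=b_ub,
              bounds=[(0, None), (0, None)], method="highs")
x1, x2 = res.x
z = res.fun
```

The solver confirms the vertex D found graphically: x1 = 1.5, x2 = 6, and z = 51.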
17.2.3 Special Cases
Sections 17.2.1 and 17.2.2 presented the graphical solution of a maximization problem (Example 17.1) and a minimization problem (Example 17.2), respectively, each with a single nondegenerate optimal solution. The graphical concept of a degenerate solution will be presented in Section 17.2.3.4. However, some linear programming problems do not have a single nondegenerate optimal solution, and they may fall into one of the four cases presented:
1. Multiple optimal solutions
2. Unlimited objective function z
3. There is no optimal solution
4. Degenerate optimal solution
This section aims to identify, in a graphical way, each one of the special cases listed, which may happen in a linear programming problem. We will also study how to identify them through the Simplex method (see Section 17.4.5) and through a computer (cases 2 and 3 in Section 17.5.3, and cases 1 and 4 in Section 17.6.4).
17.2.3.1 Multiple Optimal Solutions A linear programming problem can have more than one optimal solution. In this case, considering a problem with two decision variables, different values of x1 and x2 achieve the same optimal value in the objective function. This case is graphically illustrated through Example 17.3. According to Taha (2016), when the objective function is parallel to an active constraint, we have a case with multiple optimal solutions. An active constraint is the one responsible for determining the model’s optimal solution. Example 17.3 Determine the set of feasible solutions and the model’s optimal solutions for the following linear programming problem:
max z = 8x1 + 4x2
subject to:
4x1 + 2x2 ≤ 16
x1 + x2 ≤ 6
x1, x2 ≥ 0        (17.3)
Solution The same procedure used in the previous examples to find the optimal solution was applied in this case. Fig. 17.6 shows the feasible region determined from the constraints of the model analyzed. We can see that the feasible solution space is represented by the four-sided polygon ABCD. FIG. 17.6 Feasible region with multiple optimal solutions.
To determine the model's optimal solution, first of all, we assigned a value of z = 16 and obtained the line presented in Fig. 17.6. Since the objective function is a maximization one, the higher the values of x1 and x2, the higher the value of the function z, so the direction in which the function increases can be easily identified. We can see that the lines represented by the equations z = 16 = 8x1 + 4x2 and 4x1 + 2x2 = 16 are parallel. Therefore, this is a case with multiple optimal solutions, represented by the segment BC. For example, for point B, x1 = 4, x2 = 0, the value of z is 8 × 4 + 4 × 0 = 32. Point C is the intersection of the lines 4x1 + 2x2 = 16 and x1 + x2 = 6. Calculating it algebraically, we obtain x1 = 2 and x2 = 4 with z = 8 × 2 + 4 × 4 = 32. Any other point in this segment is an alternative optimal solution and also presents z = 32. Hence, a new theorem follows:
Theorem 3. For linear programming problems with more than one optimal solution, the objective function assumes its optimal value in at least two extreme points of the convex set K and in all the convex linear combinations of these extreme points (all the points of the line segment that joins these two extremes, that is, the side of the polygon that contains these extreme points).
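A solver only reports one point of an optimal segment such as BC. The sketch below (SciPy's `linprog`, used here as an assumed stand-in for the book's Excel workflow) solves Example 17.3; whichever vertex of segment BC the solver returns, the objective value is the same z = 32, which is exactly the signature of multiple optima.

```python
from scipy.optimize import linprog

# Example 17.3: max z = 8x1 + 4x2  ->  min -8x1 - 4x2
res = linprog([-8, -4],
              A_ub=[[4, 2],   # 4x1 + 2x2 <= 16
                    [1, 1]],  # x1 + x2 <= 6
              b_ub=[16, 6],
              bounds=[(0, None), (0, None)], method="highs")
x1, x2 = res.x
z = -res.fun  # optimal objective; any point of segment BC gives 32
```

Note that the solver does not flag the alternative optima by itself; detecting them requires inspecting the reduced costs (or, in Excel, the Sensitivity Report), as discussed later in the chapter.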
17.2.3.2 Unlimited Objective Function z
In this case, there is no limit to how much the value of at least one decision variable can increase, resulting in a feasible region and an objective function z that are unlimited. For a maximization problem, the value of the objective function increases unlimitedly, while for a minimization problem, the value decreases in an unlimited way. Example 17.4 graphically illustrates a case with an unlimited set of solutions, resulting in an unlimited value of the objective function.

Example 17.4
Determine the feasible solution space and the model's optimal solution for the following linear programming problem:

max z = 4x1 + 3x2
subject to:
2x1 + 5x2 ≥ 20
x1 ≤ 8
x1, x2 ≥ 0        (17.4)
Solution
From the constraints in Example 17.4, we obtain the feasible solution space, which in this case is unlimited, because there is no limit to the increase of x2, as shown in Fig. 17.7. Consequently, the objective function z can also increase in an unlimited way. The complete procedure is shown in Fig. 17.7.
FIG. 17.7 Unlimited set of feasible solutions and unlimited maximization function z.
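A solver detects this case as well. In SciPy's `linprog` (assumed here in place of the book's Excel Solver), an unlimited objective is reported through the result's status code rather than through an optimal point: status 3 means the problem appears to be unbounded.

```python
from scipy.optimize import linprog

# Example 17.4: max z = 4x1 + 3x2  ->  min -4x1 - 3x2
res = linprog([-4, -3],
              A_ub=[[-2, -5],  # 2x1 + 5x2 >= 20  ->  -2x1 - 5x2 <= -20
                    [1, 0]],   # x1 <= 8
              b_ub=[-20, 8],
              bounds=[(0, None), (0, None)], method="highs")
# res.status == 3: the problem appears to be unbounded (z can grow without limit)
```

Excel's Solver reports the analogous condition with the message that the objective cell values do not converge.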
17.2.3.3 There Is No Optimal Solution
In this case, it is not possible to find a feasible solution for the problem being studied, that is, there is no optimal solution. The set of feasible solutions is empty. Example 17.5 graphically illustrates a case in which there is no optimal solution.

Example 17.5
Consider the following linear programming problem:

max z = x1 + x2
subject to:
5x1 + 4x2 ≥ 40
2x1 + x2 ≤ 6
x1, x2 ≥ 0        (17.5)
Determine the feasibility region and the optimal solution for the linear programming model.
Solution
Fig. 17.8 shows the graphical solution of Example 17.5, considering each one of the model constraints, besides the objective function with an arbitrary value of z = 7. From Fig. 17.8, we can see that no point satisfies all the problem constraints. This means that the feasible solution space in Example 17.5 is empty, resulting in an unfeasible LP problem that has no optimal solution.
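The emptiness of the feasible region in Example 17.5 can be confirmed numerically: with x1, x2 ≥ 0, the constraint 2x1 + x2 ≤ 6 caps 5x1 + 4x2 at 5 × 3 + 4 × 6 = 39 < 40, so no point reaches the first constraint. In SciPy's `linprog` (assumed in place of the book's Excel Solver), infeasibility is reported as status 2.

```python
from scipy.optimize import linprog

# Example 17.5: max z = x1 + x2  ->  min -x1 - x2
res = linprog([-1, -1],
              A_ub=[[-5, -4],  # 5x1 + 4x2 >= 40  ->  -5x1 - 4x2 <= -40
                    [2, 1]],   # 2x1 + x2 <= 6
              b_ub=[-40, 6],
              bounds=[(0, None), (0, None)], method="highs")
# res.status == 2: the problem appears to be infeasible (empty feasible set)
```

Excel's Solver reports the analogous condition with the message that it could not find a feasible solution.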
17.2.3.4 Degenerate Optimal Solution

Graphically, a special case of degenerate solution arises whenever one of the vertices of the feasible region is obtained from the intersection of more than two distinct lines; such a vertex is called a degenerate vertex. If the degeneration occurs at the optimal solution, we have a case known as a degenerate optimal solution. The concepts of degenerate solution and degeneration will be discussed in depth in Sections 17.4.5.4 (identification of a degenerate optimal solution through the Simplex method) and 17.6.4.2 (identification of a degenerate optimal solution through the Sensitivity Report of the Solver in Excel).
PART VII Optimization Models and Simulation
FIG. 17.8 Empty set of feasible solutions without an optimal solution. [The figure plots, on axes x1 and x2, the constraints 5x1 + 4x2 ≥ 40 and 2x1 + x2 ≤ 6, with x1, x2 ≥ 0, together with the line z = 7 = x1 + x2; the two constraint regions do not intersect.]
Example 17.6
Consider the following linear programming problem:

min z = x1 + 5x2
subject to:
2x1 + 4x2 ≥ 16
x1 + x2 ≤ 6
x1 ≤ 4
x1, x2 ≥ 0
(17.6)
Determine the feasibility region and the optimal solution for the linear programming model.
Solution
The feasible solution space of Example 17.6 is shown in Fig. 17.9, and it is represented by triangle ABC. We can see that constraint x1 ≤ 4 is redundant. Since vertex B is the intersection of three lines, it is a degenerate vertex. The minimization function consists of the equation z = x1 + 5x2, which tries to find the minimum point that satisfies all the model constraints. Thus, starting from a value of z = 50, it is possible to identify the direction of the line that minimizes function z, as shown in Fig. 17.9. We can see that point B is the degenerate optimal solution. Since point B is the intersection of lines 2x1 + 4x2 = 16 and x1 + x2 = 6, coordinates x1 and x2 can be calculated algebraically from these equations. So, we have x1 = 4 and x2 = 2, with z = 4 + 5 × 2 = 14.
FIG. 17.9 Feasible region with a degenerate optimal solution.
17.3 ANALYTICAL SOLUTION OF A LINEAR PROGRAMMING PROBLEM IN WHICH m < n

The graphical procedure for solving LP problems was presented in Section 17.2. This section discusses the analytical procedure for solving a linear programming problem.
Consider a system Ax = b with m linear equations and n variables, in which m < n. According to Taha (2016), if m = n and the equations are consistent, the system has a single solution. In cases in which m > n, at least m − n equations must be redundant. However, if m < n and the equations are consistent, the system has an infinite number of solutions.
To find a solution for the system Ax = b, in which m < n, first we choose a set of n − m variables from x, called nonbasic variables (NBV), to which values equal to zero are assigned. The m remaining variables of the system, called basic variables (BV), are then determined. This solution is called a basic solution (BS). The set of basic variables is called the base. If the basic solution meets the non-negativity constraints, that is, if the basic variables are non-negative, it is called a feasible basic solution (FBS). According to Winston (2004), a basic variable can also be defined as one that has coefficient 1 in only one equation and 0 in the others; all the remaining variables are nonbasic.
To calculate the optimal solution, we just need to calculate the value of objective function z for all the possible basic solutions and choose the best alternative. The maximum number of basic solutions to be calculated is:

C(n, m) = n! / [m! (n − m)!]   (17.7)

Hence, the analytical method applied in this section analyzes all the possible combinations of the n variables taken m at a time, choosing the best one. Solving the resulting linear equation systems is feasible when m and n are small. However, for high values of m and n, the calculation becomes impractical. As an alternative, we can use the Simplex method, which will be studied in Section 17.4.
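The count in Expression (17.7) is just a binomial coefficient, so it can be sanity-checked in a couple of lines. Python is used here purely for illustration; the book's own calculations are carried out by hand:

```python
from math import comb

# Maximum number of basic solutions: C(n, m) = n! / (m! (n - m)!)
def max_basic_solutions(n, m):
    return comb(n, m)

# n = 4 variables, m = 2 equations (the standard form of Example 17.8)
print(max_basic_solutions(4, 2))  # 6
# n = 3 variables, m = 2 equations (Example 17.7)
print(max_basic_solutions(3, 2))  # 3
```

Even for moderate sizes the count explodes, which is exactly why complete enumeration becomes impractical and the Simplex method is preferred.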
Example 17.7
Consider the following system with three variables and two equations:

x1 + 2x2 + 3x3 = 28
3x1 − x3 = 4
Determine all the basic solutions for this system.
Solution
For a system with three variables and two equations, we have n − m = 3 − 2 = 1 nonbasic variable and m = 2 basic variables. In this example, the total number of possible basic solutions is 3.
Solution 1
NBV = {x1} and BV = {x2, x3}
We assign the value zero to the nonbasic variable, that is, x1 = 0. The values of variables x2 and x3 of the basic solution are then calculated algebraically from the equation system given in the statement of the exercise. So, x2 = 20 and x3 = −4. Since x3 < 0, the solution is unfeasible.
Solution 2
NBV = {x2} and BV = {x1, x3}
If x2 = 0, the basic solution is x1 = 4 and x3 = 8. Therefore, we have a feasible basic solution (FBS).
Solution 3
NBV = {x3} and BV = {x1, x2}
If x3 = 0, the basic solution is x1 = 1.33 and x2 = 13.33. As in the previous case, we have an FBS here.
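The three basic solutions above can be verified mechanically. A minimal check in Python, assuming the second equation of the system reads 3x1 − x3 = 4 (with the negative sign, which is consistent with Solution 1, where x3 = −4 makes the solution unfeasible):

```python
from fractions import Fraction as F

def satisfies(x1, x2, x3):
    """Check a candidate point against both equations of the system."""
    return x1 + 2 * x2 + 3 * x3 == 28 and 3 * x1 - x3 == 4

# Solution 1: NBV = {x1}; x2 = 20, x3 = -4 (unfeasible, since x3 < 0)
assert satisfies(F(0), F(20), F(-4))
# Solution 2: NBV = {x2}; x1 = 4, x3 = 8 (FBS)
assert satisfies(F(4), F(0), F(8))
# Solution 3: NBV = {x3}; x1 = 4/3 ≈ 1.33, x2 = 40/3 ≈ 13.33 (FBS)
assert satisfies(F(4, 3), F(40, 3), F(0))
print("all three basic solutions satisfy the system")
```

Using exact `Fraction` arithmetic avoids the rounding in the decimal values 1.33 and 13.33, which are really 4/3 and 40/3.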
Example 17.8
Consider the following linear programming problem:

max z = 3x1 + 2x2
subject to:
x1 + x2 ≤ 6
5x1 + 2x2 ≤ 20
x1, x2 ≥ 0
(17.8)
Solve the problem in an analytical way.
Solution
In order for the analytical solution procedure to be applied, the problem must be in the standard form (see Section 16.4.1 of the previous chapter). For the inequality constraints to be rewritten as equalities, slack variables x3 and x4 must be included. Thus, the original problem rewritten in the standard form becomes:

max z = 3x1 + 2x2
subject to:
x1 + x2 + x3 = 6
5x1 + 2x2 + x4 = 20
x1, x2, x3, x4 ≥ 0
(17.9)
The system has m = 2 equations and n = 4 variables. In order for a basic solution to be found, values equal to zero will be assigned to n − m = 4 − 2 = 2 nonbasic variables, such that the values of the m = 2 remaining basic variables can be determined from the equation system represented by Expression (17.9). In this example, the total number of basic solutions is:

C(4, 2) = 4! / [2! (4 − 2)!] = 6

Solution A
NBV = {x1, x2} and BV = {x3, x4}
First, we assign value zero to the nonbasic variables x1 and x2, such that the values of the basic variables x3 and x4 can be calculated algebraically from Expression (17.9). Therefore, we have:
Nonbasic solution: x1 = 0 and x2 = 0
Basic solution: x3 = 6 and x4 = 20
Objective function: z = 0
The same calculation will be carried out to obtain the other basic solutions. At every new solution, a variable from the set of nonbasic variables goes into the set of basic variables (base) and, as a result, another one leaves the base.
Solution B
In this case, variable x1 enters the base in place of variable x4, which becomes part of the set of nonbasic variables.
NBV = {x2, x4} and BV = {x1, x3}
Nonbasic solution: x2 = 0 and x4 = 0
Basic solution: x1 = 4 and x3 = 2
Objective function: z = 12
Solution C
In this case, variable x4 goes into the base in place of variable x3.
NBV = {x2, x3} and BV = {x1, x4}
Nonbasic solution: x2 = 0 and x3 = 0
Basic solution: x1 = 6 and x4 = −10
Since x4 < 0, the solution is unfeasible.
Solution D
In this case, variable x2 goes into the base in place of variable x4.
NBV = {x3, x4} and BV = {x1, x2}
Nonbasic solution: x3 = 0 and x4 = 0
Basic solution: x1 = 2.67 and x2 = 3.33
Objective function: z = 14.67
Solution E
In this case, variable x4 goes into the base in place of variable x1.
NBV = {x1, x3} and BV = {x2, x4}
Nonbasic solution: x1 = 0 and x3 = 0
Basic solution: x2 = 6 and x4 = 8
Objective function: z = 12
FIG. 17.10 Graphical representation of Example 17.8. [The figure plots, on axes x1 and x2, the constraints x1 + x2 ≤ 6 and 5x1 + 2x2 ≤ 20, with x1, x2 ≥ 0, the maximization direction of the line z = 10, and the six basic solutions A through F; the optimal solution is vertex D (x1 = 2.67, x2 = 3.33).]
Solution F
In this case, variable x3 goes into the base in place of variable x4.
NBV = {x1, x4} and BV = {x2, x3}
Nonbasic solution: x1 = 0 and x4 = 0
Basic solution: x2 = 10 and x3 = −4
Since x3 < 0, the solution is unfeasible.
Thus, the optimal solution is D, with x1 = 2.67, x2 = 3.33, x3 = 0, x4 = 0, and z = 14.67. Fig. 17.10 shows the graphical representation of each of the six solutions obtained on the Cartesian axes x1 and x2. Solutions A, B, D, and E correspond to extreme points of the feasible region. On the other hand, since they are unfeasible, solutions C and F do not belong to the set of feasible solutions. This leads to a new theorem:
Theorem 4. Every feasible basic solution of a linear programming problem is an extreme point of the convex set K.
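The enumeration of Solutions A through F can be reproduced mechanically: solve the 2×2 system for every candidate base (Cramer's rule is enough here) and keep the best feasible one. A sketch with exact arithmetic, with names chosen for illustration:

```python
from fractions import Fraction
from itertools import combinations

# Standard form of Example 17.8: max z = 3x1 + 2x2,
# x1 + x2 + x3 = 6, 5x1 + 2x2 + x4 = 20, all variables >= 0
F = Fraction
A = [[F(1), F(1), F(1), F(0)],
     [F(5), F(2), F(0), F(1)]]
b = [F(6), F(20)]
c = [F(3), F(2), F(0), F(0)]

best = None
for basis in combinations(range(4), 2):        # C(4, 2) = 6 candidate bases
    # Solve the 2x2 system restricted to the basic columns (Cramer's rule)
    a11, a12 = A[0][basis[0]], A[0][basis[1]]
    a21, a22 = A[1][basis[0]], A[1][basis[1]]
    det = a11 * a22 - a12 * a21
    if det == 0:
        continue                               # singular base: no basic solution
    v1 = (b[0] * a22 - a12 * b[1]) / det
    v2 = (a11 * b[1] - b[0] * a21) / det
    x = [F(0)] * 4                             # nonbasic variables stay at zero
    x[basis[0]], x[basis[1]] = v1, v2
    if all(v >= 0 for v in x):                 # keep only feasible basic solutions
        z = sum(ci * xi for ci, xi in zip(c, x))
        if best is None or z > best[0]:
            best = (z, x)

print(best)  # optimal: z = 44/3 with x1 = 8/3, x2 = 10/3 (vertex D)
```

The loop reproduces the six solutions A through F: two bases turn out unfeasible (C and F), and the best feasible one is D, with z = 44/3 = 14.67.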
17.4 THE SIMPLEX METHOD

As presented in Section 17.2, a graphical solution can be used to solve linear programming problems with at most two or three decision variables (the three-variable case with greater complexity). Similarly, the analytical solution presented in Section 17.3 becomes impractical for problems with many variables and equations, because it calculates all the possible basic solutions and only afterwards determines the optimal one. As an alternative, the Simplex method can be applied to solve any LP problem.
The Simplex method for solving linear programming problems was developed in 1947 by a team led by George B. Dantzig, with the dissemination of operations research in the United States after World War II. For Goldbarg and Luna (2005), the Simplex algorithm is the most widely used method for solving linear programming problems.
The Simplex method is an iterative algebraic procedure that starts from an initial feasible basic solution and tries to find, at each iteration, a new feasible basic solution with a better value of the objective function, until the optimal value is achieved. More details of the algorithm will be discussed in the following sections.
This section is divided into three parts. The logic of the Simplex method is presented in Section 17.4.1. In Section 17.4.2, the Simplex method is described in an analytical way. The tabular form of the Simplex method is discussed in Section 17.4.3.
17.4.1 Logic of the Simplex Method
The Simplex algorithm is an iterative method that starts from an initial feasible basic solution and tries to find, at each iteration, a new feasible basic solution, called an adjacent feasible basic solution, with a better value of the objective function, until the optimal value is achieved. The concept of an adjacent FBS is described next.
FIG. 17.11 General description of the Simplex algorithm.
Beginning: The problem must be in the standard form.
Step 1: Find an initial FBS for the LP problem. Initial FBS = current FBS.
Step 2: Verify whether the current FBS is the optimal solution for the LP problem.
While the current FBS is not the optimal solution for the LP problem, do:
    Find an adjacent FBS with a better value of the objective function.
    Adjacent FBS = current FBS.
End while
End
FIG. 17.12 Flowchart with the general description of the Simplex algorithm. (Source: Lachtermacher, G., 2009. Pesquisa operacional na tomada de decisões. 4th ed. Prentice Hall do Brasil, São Paulo.)
Start
→ Rewrite the inequalities with sign ≤ and introduce the slack variables
→ Find an initial basic solution
→ Is the current solution optimal?
    Yes → End
    No → Determine a better adjacent basic solution and return to the test
From a current basic solution, a nonbasic variable goes into the base in place of a basic variable, which becomes nonbasic, generating a new solution called an adjacent basic solution. For a problem with m basic variables and n − m nonbasic variables, two basic solutions are adjacent if they have m − 1 basic variables in common (even though their numerical values may differ); this also implies that they have n − m − 1 nonbasic variables in common. If the adjacent basic solution satisfies the non-negativity constraints, it is called an adjacent feasible basic solution (adjacent FBS).
According to Theorem 4, every feasible basic solution is an extreme point (vertex) of the feasible region. Thus, two vertices are adjacent if they are connected by a line segment called an edge, which means that they share n − 1 constraint boundaries.
The general description of the Simplex algorithm is presented in Fig. 17.11. Analogously to the analytical procedure, in order for the Simplex method to be applied, the problem must be in the standard form (see Section 16.4.1 of the previous chapter). The algorithm can also be described through a flowchart, as seen in Fig. 17.12.
17.4.2 Analytical Solution of the Simplex Method for Maximization Problems
Each of the steps of the general algorithm described in Figs. 17.11 and 17.12 is rewritten in Fig. 17.13 in a very detailed way, based on Hillier and Lieberman (2005), for the analytical solution of the Simplex method for linear programming problems in which objective function z is a maximization one (max z = c1x1 + c2x2 + ⋯ + cnxn). In Example 17.8 of Section 17.3, to calculate the model's optimal solution, all the possible basic solutions were calculated and the best of them was chosen. The same exercise is solved in Example 17.9, however, through the analytical solution of the Simplex method.
FIG. 17.13 Detailed steps of the general algorithm of Figs. 17.11 and 17.12 for solving LP maximization problems through the analytical form of the Simplex method.

Beginning: The problem must be in the standard form.
Step 1: Find an initial feasible basic solution (FBS) for the LP problem.
An initial FBS can be obtained by assigning values equal to zero to the decision variables. In order for this solution to be feasible, none of the problem constraints can be violated.
Step 2: Optimality test.
A feasible basic solution is optimal if there are no better adjacent feasible basic solutions. An adjacent FBS is better than the current FBS if there is a positive increase in the value of objective function z; it is worse if the increase in z is negative. While at least one of the nonbasic variables in objective function z has a positive coefficient, there is a better adjacent FBS.
Iteration: Determine a better adjacent FBS.
The direction with the greatest increase in z must be identified, so that a better feasible basic solution can be determined. Three steps are required:
1. Determine the nonbasic variable that will go into the set of basic variables (base). It must be the one with the greatest increase in z, that is, the one with the highest positive coefficient in z.
2. Determine the basic variable that will leave the base, joining the set of nonbasic variables. The variable chosen to leave the base must be the one that limits the increase of the nonbasic variable selected in the previous step.
3. Solve the equation system, recalculating the values of the new adjacent basic solution. Before that, the equation system must be converted into a more convenient form, through basic algebraic operations, using the Gauss-Jordan elimination method. In the new equation system, each equation must have only one basic variable, with a coefficient equal to 1; each basic variable must appear in only one equation; and the objective function must be written in terms of the nonbasic variables, so that the values of the new basic variables and of objective function z can be obtained directly and the optimality test can easily be verified.
Example 17.9
Solve the problem by using the analytical solution of the Simplex method.

max z = 3x1 + 2x2
subject to:
x1 + x2 ≤ 6
5x1 + 2x2 ≤ 20
x1, x2 ≥ 0
(17.10)
Solution
Each step of the algorithm will be discussed in depth, based on Hillier and Lieberman (2005).
Beginning: The problem must be in the standard form.

max z = 3x1 + 2x2          (0)
subject to:
x1 + x2 + x3 = 6           (1)
5x1 + 2x2 + x4 = 20        (2)
x1, x2, x3, x4 ≥ 0         (3)

(17.11)
Step 1: Find an initial FBS for the LP problem.
An initial basic solution can be obtained by assigning values equal to zero to decision variables x1 and x2 (nonbasic variables). Note that the values of the basic variables (x3, x4) can be obtained immediately from the equation system represented by Expression (17.11), since each equation has only one basic variable, with coefficient 1, and each basic variable appears in only one equation.
Moreover, since the objective function is written in terms of each of the nonbasic variables, the optimality test can easily be applied in step 2. The complete result of the initial solution is:
NBV = {x1, x2} and BV = {x3, x4}
Nonbasic solution: x1 = 0 and x2 = 0
Feasible basic solution: x3 = 6 and x4 = 20
Solution: {x1, x2, x3, x4} = {0, 0, 6, 20}
Objective function: z = 0
This solution corresponds to vertex A of the feasible region shown in Example 17.8 of Section 17.3, as presented in Fig. 17.10.
Step 2: Optimality test
We can state that the initial FBS obtained in step 1 is not optimal, since the coefficients of the nonbasic variables x1 and x2 in the objective function of the equation system represented by Expression (17.11) are positive. If either variable stops taking on value zero and starts taking on a positive value, there will be a positive increase in the value of objective function z. Thus, it is possible to obtain a better adjacent FBS.
Iteration 1: Determine a better adjacent FBS.
Each of the three steps to be implemented in this iteration is shown next.
1. Nonbasic variable that will go into the base.
According to the equation system represented by Expression (17.11), variable x1 has a greater positive coefficient in the objective function than variable x2, thus generating a greater positive increase in z (assuming the same measurement units for x1 and x2). So, the nonbasic variable chosen to go into the base is x1:
NBV = {x1*, x2}
2. Basic variable that will leave the base.
To select the basic variable that will leave the base, we must choose the one that limits the increase of the nonbasic variable chosen in the previous step (x1). In order to do that, first, we must assign the value zero to the variables that remain nonbasic (in this case, only x2) in all the equations.
From then on, we can obtain the equations of each of the basic variables as functions of the nonbasic variable chosen to go into the base (x1). Since all the basic variables must take on non-negative values, by imposing the inequality sign ≥ 0 in each of the constraints, we can identify the basic variable that limits the increase of x1. Therefore, by assigning value zero to variable x2 in Equations (1) and (2) of the equation system represented by Expression (17.11), we obtain the equations of basic variables x3 and x4 as functions of x1:

x3 = 6 − x1
x4 = 20 − 5x1

Since variables x3 and x4 must take on non-negative values, then:

x3 = 6 − x1 ≥ 0 ⟹ x1 ≤ 6
x4 = 20 − 5x1 ≥ 0 ⟹ x1 ≤ 4

We can conclude that the variable that limits the increase of x1 is x4, since the maximum value that x1 can reach from x4 is smaller than the one obtained from x3 (4 < 6). Hence, the basic variable chosen to leave the base is x4:

BV = {x3, x4*}

3. Transform the equation system by using the Gauss-Jordan elimination method and recalculate the basic solution.
As shown in the two previous steps, variable x1 enters the base in place of variable x4, generating a better adjacent basic solution. Therefore, the sets of nonbasic and basic variables become:
NBV = {x2, x4} and BV = {x1, x3}
In this phase, we recalculate the values of the new feasible basic solution. Since x4 is the new nonbasic variable in the adjacent solution, jointly with x2, which remained nonbasic, we have x2 = 0 and x4 = 0. From then on, the values of basic variables x1 and x3 of the adjacent solution must be recalculated, as well as the value of objective function z.
First, the equation system must be converted, through basic operations, into a more convenient form, by using the Gauss-Jordan elimination method, such that each equation has only one basic variable (x1 or x3) with a coefficient equal to 1, each basic variable appears in only one equation, and the objective function is written in terms of nonbasic variables x2 and x4. In order to do that, the coefficients of variable x1 in the current equation system represented by Expression (17.11) must be transformed from 3, 1, and 5 (Equations (0), (1), and (2), respectively) to 0, 0, and 1 (the coefficients of variable x4 in the current equation system). According to Hillier and Lieberman (2005), the two basic algebraic operations to be used are:
(a) Multiply (or divide) an equation by a nonzero constant.
(b) Add (or subtract) a multiple of one equation to (or from) another equation.
First, let's convert the coefficient of variable x1 in Equation (2) of Expression (17.11) from 5 to 1. In order to do that, we just need to divide Equation (2) by 5, such that the new Expression (17.12) is written with a single basic variable (x1) with coefficient 1:

x1 + (2/5)x2 + (1/5)x4 = 4     (17.12)
Another transformation must be carried out to convert the coefficient of variable x1 in Equation (1) of Expression (17.11) from 1 to 0. To do that, we just need to subtract Expression (17.12) from Equation (1) of Expression (17.11), such that the new Expression (17.13) is written with a single basic variable (x3) with coefficient 1:

(3/5)x2 + x3 − (1/5)x4 = 2     (17.13)

Finally, we must convert the coefficient of variable x1 in the objective function [Equation (0) of Expression (17.11)] from 3 to 0. To do that, we just need to multiply Expression (17.12) by 3 and subtract it from Equation (0) of Expression (17.11), such that the new Expression (17.14) is written in terms of x2 and x4:

z = (4/5)x2 − (3/5)x4 + 12     (17.14)

The complete equation system, obtained after we applied the Gauss-Jordan elimination method, is:
(0) z = (4/5)x2 − (3/5)x4 + 12
(1) (3/5)x2 + x3 − (1/5)x4 = 2
(2) x1 + (2/5)x2 + (1/5)x4 = 4

(17.15)

From the new equation system represented by Expression (17.15), it is possible to obtain the new values of x1, x3, and z immediately. The complete result of the new solution is:
NBV = {x2, x4} and BV = {x1, x3}
Nonbasic solution: x2 = 0 and x4 = 0
Feasible basic solution: x1 = 4 and x3 = 2
Solution: {x1, x2, x3, x4} = {4, 0, 2, 0}
Objective function: z = 12
This solution corresponds to vertex B of the feasible region shown in Example 17.8 of Section 17.3 (Fig. 17.10). Thus, there was a movement from extreme point A to point B (A → B). It was therefore possible to obtain a better adjacent FBS, since there was a positive increase in z compared to the current FBS. The adjacent FBS obtained in this iteration becomes the current FBS.
Step 2: Optimality test
The current FBS is not optimal yet, since the coefficient of nonbasic variable x2 in Equation (0) of Expression (17.15) is positive. If this variable starts to take on any positive value, there will be a positive increase in the value of objective function z. Thus, it is possible to obtain a better adjacent FBS.
Iteration 2: Determine a better adjacent FBS.
The three steps to be implemented to determine a new adjacent FBS are:
1. Nonbasic variable that will go into the base.
According to the new equation system represented by Expression (17.15), variable x2 is the only one with a positive coefficient in Equation (0), and it will generate a positive increase in objective function z for any positive value it may assume. So, the variable chosen to go from the set of nonbasic variables to the set of basic variables is x2:
NBV = {x2*, x4}
2. Basic variable that will leave the base.
The basic variable that will leave the base is the one that limits the increase of the nonbasic variable chosen in the previous step to go into the base (x2).
By assigning value zero to the variable that remained nonbasic (x4 = 0) in each of Equations (1) and (2) of Expression (17.15), it is possible to obtain the equations of basic variables x1 and x3 of the current basic solution as functions of the nonbasic variable chosen to go into the base (x2):

x1 = 4 − (2/5)x2
x3 = 2 − (3/5)x2

Since variables x1 and x3 must take on non-negative values, then:

x1 = 4 − (2/5)x2 ≥ 0 ⟹ x2 ≤ 10
x3 = 2 − (3/5)x2 ≥ 0 ⟹ x2 ≤ 10/3

We can conclude that the variable that limits the increase of x2 is x3, since the maximum value that x2 can assume from x3 is smaller than the one obtained from x1 (10/3 < 10). Therefore, the basic variable chosen to leave the base is x3:

BV = {x1, x3*}
3. Transform the equation system by using the Gauss-Jordan elimination method and recalculate the basic solution.
As shown in the two previous steps, variable x2 enters the base in place of variable x3, generating a better adjacent basic solution. Thus, the sets of nonbasic and basic variables become:
NBV = {x3, x4} and BV = {x1, x2}
Before calculating the values of the new basic solution, the equation system must be converted by using the Gauss-Jordan elimination method. In this case, the coefficients of variable x2 in the current equation system represented by Expression (17.15) must be transformed from 4/5, 3/5, and 2/5 (Equations (0), (1), and (2), respectively) to 0, 1, and 0 (the coefficients of variable x3 in the current equation system), through basic algebraic operations.
First, let's convert the coefficient of variable x2 in Equation (1) of Expression (17.15) from 3/5 to 1. To do that, we just need to multiply Equation (1) by 5/3, such that the new Expression (17.16) is written with a single basic variable (x2) with coefficient 1:

x2 + (5/3)x3 − (1/3)x4 = 10/3     (17.16)

Analogously, we must convert the coefficient of variable x2 in Equation (2) of Expression (17.15) from 2/5 to 0. In order to do that, we just need to multiply Expression (17.16) by 2/5 and subtract it from Equation (2) of Expression (17.15), such that the new Expression (17.17) is written with a single basic variable (x1) with coefficient 1:

x1 − (2/3)x3 + (1/3)x4 = 8/3     (17.17)

Finally, we must convert the coefficient of variable x2 in the objective function [Equation (0) of Expression (17.15)] from 4/5 to 0. To do that, we just need to multiply Expression (17.16) by 4/5 and subtract it from Equation (0) of Expression (17.15), such that the new Expression (17.18) is written in terms of x3 and x4:

z = −(4/3)x3 − (1/3)x4 + 44/3     (17.18)

The complete equation system is represented in (17.19):
(0) z = −(4/3)x3 − (1/3)x4 + 44/3
(1) x2 + (5/3)x3 − (1/3)x4 = 10/3
(2) x1 − (2/3)x3 + (1/3)x4 = 8/3

(17.19)

From the new equation system represented by Expression (17.19), it is possible to obtain the new values of x1, x2, and z immediately. The complete result of the new solution is:
NBV = {x3, x4} and BV = {x1, x2}
Nonbasic solution: x3 = 0 and x4 = 0
Feasible basic solution: x1 = 8/3 = 2.67 and x2 = 10/3 = 3.33
Solution: {x1, x2, x3, x4} = {8/3, 10/3, 0, 0}
Objective function: z = 44/3 = 14.67
This solution corresponds to vertex D of the feasible region shown in Fig. 17.10. The direction in which z increases, from the initial solution, goes through vertices A → B → D of the graphical solution. Hence, it was possible to obtain a better adjacent FBS, since there was a positive increase in z compared to the current FBS. The adjacent FBS obtained in this iteration becomes the current FBS.
Step 2: Optimality test
The current FBS is optimal, since the coefficients of nonbasic variables x3 and x4 in Equation (0) of Expression (17.19) are negative. Therefore, it is no longer possible to obtain a positive increase in the value of objective function z, which concludes the algorithm of Example 17.9.
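The two iterations above can be automated. The sketch below is a minimal Simplex routine for maximization problems of the form Ax ≤ b with b ≥ 0, using exact `Fraction` arithmetic so the pivots reproduce the textbook fractions; the function and variable names are illustrative, not from the book:

```python
from fractions import Fraction

def simplex_max(c, A, b):
    """Tiny Simplex sketch: max c·x s.t. Ax <= b, x >= 0 (all b >= 0).
    Slack variables are added internally; unboundedness is not handled."""
    m, n = len(A), len(A[0])
    # Row 0 holds the objective as z - c·x = 0; rows 1..m hold [A | I | b]
    T = [[Fraction(-c[j]) for j in range(n)] + [Fraction(0)] * m + [Fraction(0)]]
    for i in range(m):
        row = [Fraction(A[i][j]) for j in range(n)]
        row += [Fraction(int(k == i)) for k in range(m)] + [Fraction(b[i])]
        T.append(row)
    basis = list(range(n, n + m))              # the slacks start in the base
    while True:
        # Entering variable: most negative coefficient in row 0
        j = min(range(n + m), key=lambda k: T[0][k])
        if T[0][j] >= 0:
            break                              # optimality test passed
        # Ratio test: leaving row = smallest rhs / positive pivot-column entry
        ratios = [(T[i][-1] / T[i][j], i) for i in range(1, m + 1) if T[i][j] > 0]
        _, r = min(ratios)
        basis[r - 1] = j
        piv = T[r][j]
        T[r] = [v / piv for v in T[r]]         # normalize the pivot row
        for i in range(m + 1):                 # eliminate j from the other rows
            if i != r and T[i][j] != 0:
                f = T[i][j]
                T[i] = [a - f * p for a, p in zip(T[i], T[r])]
    x = [Fraction(0)] * (n + m)
    for i, bv in enumerate(basis):
        x[bv] = T[i + 1][-1]
    return x[:n], T[0][-1]

# Example 17.9: max z = 3x1 + 2x2 s.t. x1 + x2 <= 6, 5x1 + 2x2 <= 20
x, z = simplex_max([3, 2], [[1, 1], [5, 2]], [6, 20])
print(x, z)  # optimal: x1 = 8/3, x2 = 10/3, z = 44/3 (vertex D)
```

Run on Example 17.9, the routine performs exactly the two pivots of the text (x1 in place of x4, then x2 in place of x3), visiting vertices A → B → D.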
17.4.3 Tabular Form of the Simplex Method for Maximization Problems
The previous section presented the analytical procedure of the Simplex method to solve a linear programming maximization problem. This section shows the Simplex method in tabular form. To understand the logic of the Simplex algorithm, it is important to use the Simplex method in an analytical way. However, when the calculation is done manually, it is more convenient to use the tabular form. The tabular form uses the same concepts presented in Section 17.4.2, however, in a more practical way.
As presented in Section 16.4.1 of the previous chapter, the standard form of a general linear programming maximization model is:

max z = c1x1 + c2x2 + … + cnxn
subject to:
a11x1 + a12x2 + … + a1nxn = b1
a21x1 + a22x2 + … + a2nxn = b2
⋮
am1x1 + am2x2 + … + amnxn = bm
xj ≥ 0, j = 1, 2, …, n

(17.20)
This same model can be represented in tabular form:

BOX 17.1 General Linear Programming Model in Tabular Form

Equation | z |  Coefficients            | Constant
         |   | x1   x2   …   xn  |
0        | 1 | −c1  −c2  …  −cn  | 0
1        | 0 | a11  a12  …  a1n  | b1
2        | 0 | a21  a22  …  a2n  | b2
⋮        | ⋮ | ⋮    ⋮        ⋮   | ⋮
m        | 0 | am1  am2  …  amn  | bm
According to Box 17.1, maximization function z, in tabular form, is rewritten as z − c1x1 − c2x2 − … − cnxn = 0. The columns in the middle show the coefficients of the variables on the left-hand side of each equation, in addition to the coefficient of z. The constants on the right-hand side of each equation are represented in the last column.
Each of the steps of the general algorithm described in Figs. 17.11 and 17.12 is rewritten in Fig. 17.14, in a very detailed way, for solving linear programming maximization problems through the tabular form of the Simplex method. The logic presented in this section is the same as that of the analytical solution of the Simplex method; however, instead of using an algebraic equation system, the solution is calculated directly in the tabular form, using the concepts of pivot column, pivot row, and pivot number, which will be defined throughout the algorithm.
Example 17.9 of the previous section presented the solution of a linear programming problem through the analytical form of the Simplex method. The same exercise will be solved in Example 17.10 through the tabular form of the Simplex method.

Example 17.10
Solve the problem by using the tabular form of the Simplex method.

max z = 3x1 + 2x2
subject to:
x1 + x2 ≤ 6
5x1 + 2x2 ≤ 20
x1, x2 ≥ 0
(17.21)
Solution
The maximization problem must also be in the standard form:

max z = 3x1 + 2x2          (0)
subject to:
x1 + x2 + x3 = 6           (1)
5x1 + 2x2 + x4 = 20        (2)
x1, x2, x3, x4 ≥ 0         (3)
In the tabular form, maximization function z is written as: z = 3x1 + 2x2 ⟹ z − 3x1 − 2x2 = 0
(17.22)
FIG. 17.14 Detailed steps of the general algorithm in Figs. 17.11 and 17.12 for solving LP maximization problems in the tabular form of the Simplex method.
Beginning: The problem must be in the standard form. Step 1: Find an initial FBS for the LP problem. Analogously to the analytical form of the Simplex method presented in Section 17.4.2, an initial basic solution can be obtained by assigning values equal to zero to the decision variables. The initial FBS corresponds to the current FBS. Step 2: Optimality test. The current FBS is optimal if, and only if, the coefficients of all the non-basic variables in Equation (0) of the tabular form are non-negative ( ≥ 0 ). While there is at least one non-basic variable with a negative coefficient in Equation (0), there is a better adjacent FBS. Iteration: Determine a better adjacent FBS. The direction in which z increases the most must be identified, in order for a better feasible basic solution to be determined. To do that, three steps must be taken: 1. Determine the non-basic variable that will go into the base. It must be the one that has the greatest increase in z, that is, with the highest negative coefficient in Equation (0). The column of the non-basic variable chosen to go into the base is called pivot column. 2. Determine the basic variable that will leave the base. Similarly to the analytical form, the variable chosen must be the one that limits the increase of the non-basic variable selected in the previous step to go into the base. As presented by Hillier and Lieberman (2005), in order for the variable to be chosen, three phases are necessary: a) Select the positive coefficients of the pivot column that represent the coefficients of the new basic variable in each constraint of the current model. b) For each positive coefficient selected in the previous step, divide the constant of the same row by it. c) Identify the row with the smallest quotient. This row contains the variable that will leave the base. The row that contains the basic variable chosen to leave the base is called pivot row. 
The pivot number is the value that corresponds to the intersection of the pivot row and the pivot column. 3. Transform the current tabular form by using the Gauss-Jordan elimination method and recalculate the basic solution. Similarly to the analytical solution, the current tabular form must be converted into a more convenient form, through basic operations, such that the values of the new basic variables and of objective function z can be obtained directly from the new tabular form. The objective function starts being rewritten based on the new non-basic variables of the adjacent solution, such that the optimality test can easily be verified. According to Taha (2016), the new tabular form is obtained after the following basic operations: a) New pivot row = current pivot row ÷ pivot number b) For the other rows, including z: New row = (current row) – (coefficient of the pivot column of the current row) × (new pivot row)
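The steps in Fig. 17.14 can be sketched directly in code. The function below is an illustrative implementation (not from the book) of the tabular Simplex for a maximization problem already in the standard form. Each row of `tableau` is one equation without the z column, with the constant last; Equation (0) comes first with the coefficients as they appear in the tableau (that is, z − c1x1 − c2x2 − … = 0). Exact fractions are used so the entries match hand calculations.

```python
from fractions import Fraction

def simplex_max(tableau):
    """Tabular Simplex for a maximization problem in standard form.

    tableau: list of rows [Eq(0), Eq(1), ...]; the z column is omitted
    and the constant is the last entry of each row.
    Returns the final tableau and the optimal value of z.
    """
    T = [[Fraction(v) for v in row] for row in tableau]
    while True:
        # Step 2 (optimality test): stop when Eq(0) has no negative coefficient.
        obj = T[0][:-1]
        col = min(range(len(obj)), key=lambda j: obj[j])
        if obj[col] >= 0:
            return T, T[0][-1]
        # Ratio test: the row with the smallest quotient limits the
        # increase of the entering variable and leaves the base.
        ratios = [(T[i][-1] / T[i][col], i)
                  for i in range(1, len(T)) if T[i][col] > 0]
        if not ratios:
            raise ValueError("unbounded objective function z")
        _, piv = min(ratios)
        # Gauss-Jordan step: normalize the pivot row, then clear the column.
        p = T[piv][col]
        T[piv] = [v / p for v in T[piv]]
        for i in range(len(T)):
            if i != piv and T[i][col] != 0:
                f = T[i][col]
                T[i] = [a - f * b for a, b in zip(T[i], T[piv])]
```

On the data of Example 17.10 this routine reproduces the two hand iterations worked out below and stops with z = 44/3 in the constant of Equation (0).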
Solution of Linear Programming Problems (Chapter 17)
Table 17.E.1 shows the tabular form of the equation system represented by Expression (17.22):
TABLE 17.E.1 Initial Tabular Form of Example 17.10

                               Coefficients
Basic Variable   Equation    z    x1    x2    x3    x4   Constant
z                   (0)      1    -3    -2     0     0       0
x3                  (1)      0     1     1     1     0       6
x4                  (2)      0     5     2     0     1      20
From Table 17.E.1, we can see that a new column was added when we compare it to Box 17.1. The first column shows the basic variables considered in each phase (the initial basic variables will be x3 and x4).

Step 1: Find an initial FBS
Decision variables x1 and x2 were selected for the initial set of nonbasic variables, thus representing the origin of the feasible region (x1, x2) = (0, 0). The set of basic variables, in turn, is represented by x3 and x4:
NBV = {x1, x2} and BV = {x3, x4}
The feasible basic solution can immediately be obtained from Table 17.E.1:
Feasible basic solution: x3 = 6 and x4 = 20 with z = 0
Solution: {x1, x2, x3, x4} = {0, 0, 6, 20}

Step 2: Optimality test
Since the coefficients of x1 and x2 are negative in row 0 (−3 and −2), the current FBS is not optimal, because a positive increase in x1 or x2 will result in an adjacent FBS better than the current one.

Iteration 1: Determine a better adjacent FBS. Each iteration has three steps:
1. Determine the nonbasic variable that will go into the base. We have to choose the variable with the highest negative coefficient in Equation (0) of Table 17.E.1. For the problem in question, the variable with the highest unitary contribution to objective function z is x1 (3 > 2). Therefore, variable x1 is selected to enter the base, and this variable's column is the pivot column.
2. Determine the basic variable that will leave the base. Here, we select the basic variable that will leave the base (and become null), which is the one that limits the increase of x1. The results of the three phases needed to choose the variable, from Table 17.E.1, are shown below and can also be seen in Table 17.E.2:
(a) The positive coefficients selected from the pivot column (column of variable x1) are 1 and 5 (Equations (1) and (2), respectively).
(b) For Equation (1), divide constant 6 by coefficient 1 (quotient 6). For Equation (2), divide constant 20 by coefficient 5 (quotient 4).
(c) The row with the smallest quotient is Equation (2) (4 < 6). Therefore, the variable chosen to leave the base is x4.
Step 1 (determining the variable that enters the base) and the three phases of Step 2 used to determine the variable that leaves the base are shown in Table 17.E.2.
TABLE 17.E.2 Determining the Variable that Enters and Leaves the Base in the First Iteration
PART VII: Optimization Models and Simulation
Quotient 4 of Equation (2) represents the maximum value that variable x1 can take on in this equation (5x1 + x4 = 20, with x2 = 0), if variable x4 starts taking on value zero (x4 = 0). On the other hand, quotient 6 represents the maximum value that variable x1 can take on in Equation (1) (x1 + x3 = 6), considering x3 = 0. Since we want to maximize the value of x1, we choose variable x4, which limits its increase, to leave the base and assume a null value. The pivot row and the pivot column are highlighted in Table 17.E.2. The pivot number (intersection of the pivot row and the pivot column) is 5.
3. Transform the current tabular form by using the Gauss-Jordan elimination method and recalculate the basic solution. In the same way as in the analytical procedure, the coefficients of variable x1 in the current tabular form (Table 17.E.2) must be transformed from −3, 1, and 5 (Equations (0), (1), and (2), respectively) to 0, 0, and 1 (the coefficients of variable x4 in the current tabular form), through basic operations (Gauss-Jordan elimination method), such that the values of the new basic variables and of objective function z can be obtained directly from the new tabular form. The new tabular form is obtained after the following basic operations:
(a) New pivot row = current pivot row ÷ pivot number
(b) For the other rows, including z: New row = (current row) − (coefficient of the pivot column of the current row) × (new pivot row)
By applying the first operation in the current tabular form (dividing Equation (2) by 5), we obtain the new pivot row, as shown in Table 17.E.3:
TABLE 17.E.3 New Pivot Row (iteration 1)
Since variable x1 entered the base instead of variable x4, the column of the basic variable in the new pivot row must be altered, as shown in Table 17.E.3. Phase (b) will be applied to the other rows [Equations (0) and (1)]. We will begin with Equation (1), which has a positive coefficient in the pivot column (+1). First, we multiply this coefficient (+1) by the new pivot row (Equation (2) from Table 17.E.3). This product is then subtracted from the current Equation (1), resulting in the new Equation (1):

                         x1    x2    x3     x4    Constant
Equation (1)              1     1     1      0       6
Equation (2) × (+1)       1    2/5    0     1/5      4
Subtraction:              0    3/5    1    -1/5      2
New Equation (1) is rewritten in Table 17.E.4:
TABLE 17.E.4 Phase (b) for Obtaining New Equation (1) (iteration 1)
Phase (b) is also applied to Equation (0), which has a negative coefficient in the pivot column (−3). First, we multiply this coefficient's value (−3) by the new pivot row (Equation (2) from Table 17.E.3). This product is then subtracted from the current Equation (0), resulting in the new Equation (0):

                         x1     x2    x3     x4    Constant
Equation (0)             -3     -2     0      0        0
Equation (2) × (−3)      -3   -6/5     0   -3/5      -12
Subtraction:              0   -4/5     0    3/5       12
The new tabular form, after applying the basic operations described, is presented in Table 17.E.5:
TABLE 17.E.5 New Tabular Form After the Gauss-Jordan Elimination Method Is Used (Iteration 1)

                               Coefficients
Basic Variable   Equation    z    x1    x2     x3     x4   Constant
z                   (0)      1     0   -4/5     0    3/5      12
x3                  (1)      0     0    3/5     1   -1/5       2
x1                  (2)      0     1    2/5     0    1/5       4
From the new tabular form (Table 17.E.5), it is possible to obtain the new values of x1, x3, and z immediately. The new feasible basic solution is x1 = 4 and x3 = 2 with z = 12. The new solution is {x1, x2, x3, x4} = {4, 0, 2, 0}.

Step 2: Optimality test
As shown in Table 17.E.5, Equation (0) starts being rewritten based on the new nonbasic variables (x2 and x4), such that the optimality test can easily be verified. The current FBS is not optimal yet, because the coefficient of x2 in Equation (0) in Table 17.E.5 is negative (−4/5). Any positive increase in x2 will result in a positive increase in the value of objective function z, such that a better adjacent FBS can be obtained.

Iteration 2: Determine a better adjacent FBS. The three steps to be implemented in this iteration are:
1. Determine the nonbasic variable that will go into the base. According to the new tabular form (Table 17.E.5), variable x2 is the only one with a negative coefficient in Equation (0). Thus, the variable chosen to go into the base is x2, and the column of variable x2 becomes the pivot column.
2. Determine the basic variable that will leave the base. The basic variable chosen to leave the base (and become null) is the one that limits the increase of x2. The results of the three phases needed to choose the variable, from Table 17.E.5, are shown below and can also be seen in Table 17.E.6:
(a) The positive coefficients selected from the pivot column (column of variable x2) are 3/5 and 2/5 (Equations (1) and (2), respectively).
(b) For Equation (1), divide constant 2 by coefficient 3/5 (quotient 10/3). For Equation (2), divide constant 4 by coefficient 2/5 (quotient 10).
(c) The row with the smallest quotient is Equation (1) (10/3 < 10). Therefore, the variable chosen to leave the base is x3.
Step 1 (determining the variable that enters the base) and the three phases of Step 2 used to determine the variable that leaves the base are shown in Table 17.E.6.
TABLE 17.E.6 Determining the Variable that Enters and Leaves the Base in the Second Iteration
The row of variable x3 [Equation (1)] becomes the pivot row. The pivot number is 3/5.
3. Transform the current tabular form by using the Gauss-Jordan elimination method and recalculate the basic solution. The coefficients of variable x2 in the current tabular form (Table 17.E.6) must be transformed from −4/5, 3/5, and 2/5 (Equations (0), (1), and (2), respectively) to 0, 1, and 0 (the coefficients of variable x3 in the current tabular form), such that the values of the new basic variables and of objective function z can be obtained directly from the new tabular form. In the same way as in the first iteration, the new tabular form is obtained after the following basic operations:
(a) New pivot row = current pivot row ÷ pivot number
(b) For the other rows, including z: New row = (current row) − (coefficient of the pivot column of the current row) × (new pivot row)
By applying the first operation in the current tabular form (dividing Equation (1) by 3/5), we obtain the new pivot row, as shown in Table 17.E.7:
TABLE 17.E.7 New Pivot Row (iteration 2)
Since variable x2 entered the base instead of variable x3, the column of the basic variable in the new pivot row must be altered, as shown in Table 17.E.7. Phase (b) will be applied to the other rows [Equations (0) and (2)]. Let's begin with Equation (2), which has a positive coefficient in the pivot column (2/5). First, we multiply this coefficient (2/5) by the new pivot row (Equation (1) from Table 17.E.7). This product is then subtracted from the current Equation (2), resulting in the new Equation (2):
                         x1    x2     x3      x4     Constant
Equation (2)              1    2/5     0      1/5       4
Equation (1) × (2/5)      0    2/5    2/3   -2/15      4/3
Subtraction:              1     0    -2/3     1/3      8/3
New Equation (2) appears in Table 17.E.8:
TABLE 17.E.8 Tabular Form With New Equation (2) (iteration 2)
Phase (b) is also applied to Equation (0), which has a negative coefficient in the pivot column (−4/5). First, we multiply the value of this coefficient (−4/5) by the new pivot row (Equation (1) from Table 17.E.7). This product is then subtracted from the current Equation (0), resulting in the new Equation (0):
                          x1     x2     x3     x4    Constant
Equation (0)               0    -4/5     0    3/5       12
Equation (1) × (−4/5)      0    -4/5   -4/3   4/15     -8/3
Subtraction:               0      0    4/3    1/3      44/3
The new tabular form, after the application of the Gauss-Jordan elimination method, is presented in Table 17.E.9:
TABLE 17.E.9 New Tabular Form After the Gauss-Jordan Elimination Method Is Used (Iteration 2)

                               Coefficients
Basic Variable   Equation    z    x1    x2     x3     x4   Constant
z                   (0)      1     0     0    4/3    1/3     44/3
x2                  (1)      0     0     1    5/3   -1/3     10/3
x1                  (2)      0     1     0   -2/3    1/3      8/3
From Table 17.E.9, we can obtain the new values of x1, x2, and z immediately. The new feasible basic solution is x1 = 8/3 and x2 = 10/3 with z = 44/3. The new solution is {x1, x2, x3, x4} = {8/3, 10/3, 0, 0}.

Step 2: Optimality test
The current FBS is the optimal one, because the coefficients of the nonbasic variables x3 and x4 in Equation (0) in Table 17.E.9 are positive (4/3 and 1/3).
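As a cross-check on this result, every basic solution of the standard-form system can be enumerated directly; the best feasible one must coincide with the Simplex optimum. The sketch below (illustrative, not from the book) solves each 2 × 2 system by Cramer's rule with exact fractions:

```python
from fractions import Fraction as F
from itertools import combinations

# Standard form of Example 17.10: columns x1, x2, x3, x4.
A = [[F(1), F(1), F(1), F(0)],
     [F(5), F(2), F(0), F(1)]]
b = [F(6), F(20)]
c = [F(3), F(2), F(0), F(0)]   # z = 3*x1 + 2*x2

best = None
for i, j in combinations(range(4), 2):
    # Solve for the two basic variables by Cramer's rule.
    det = A[0][i] * A[1][j] - A[0][j] * A[1][i]
    if det == 0:
        continue                      # no basic solution for this choice
    xi = (b[0] * A[1][j] - A[0][j] * b[1]) / det
    xj = (A[0][i] * b[1] - b[0] * A[1][i]) / det
    x = [F(0)] * 4
    x[i], x[j] = xi, xj
    if min(x) >= 0:                   # keep only feasible basic solutions
        z = sum(ck * xk for ck, xk in zip(c, x))
        if best is None or z > best[0]:
            best = (z, x)

z_opt, x_opt = best
```

The enumeration visits the six candidate bases, discards the infeasible ones, and returns z = 44/3 at {8/3, 10/3, 0, 0}, confirming the tableau calculations.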
17.4.4 The Simplex Method for Minimization Problems
The Simplex method can also be used for solving linear programming minimization problems. The minimization problems discussed in this section will be solved through the tabular form. There are two ways of solving a minimization problem through the Simplex method:

Solution 1
Transform the minimization problem into a maximization problem and use the same procedure described in Section 17.4.3. As presented in Expression (16.6) in Section 16.4.3 of the previous chapter, a minimization problem can be converted into an equivalent maximization problem through the following transformation:

min z = f(x1, x2, …, xn) ⇔ max (−z) = −f(x1, x2, …, xn)
(17.23)
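The equivalence in Expression (17.23) is easy to check numerically: minimizing f and maximizing −f select the same point, and the optimal values differ only in sign. A small illustrative sketch over a finite set of hypothetical candidate points (the data and names are ours, not from the book):

```python
# Hypothetical candidate points (x1, x2) of some feasible region.
points = [(0, 0), (5, 0), (0, 10), (2, 6)]

def f(x1, x2):
    # An arbitrary linear objective for illustration.
    return 4 * x1 - 2 * x2

min_point = min(points, key=lambda p: f(*p))
max_point = max(points, key=lambda p: -f(*p))       # maximize -f instead

assert min_point == max_point                       # same optimizer
assert f(*min_point) == -max(-f(*p) for p in points)  # min f = -max(-f)
```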
Solution 2
Adapt the procedure described in Section 17.4.3 to linear programming minimization problems. Fig. 17.14 presented the detailed steps of the general algorithm described in Figs. 17.11 and 17.12 for solving linear programming maximization problems in the tabular form. To solve LP minimization problems through the tabular form, Step 2 (optimality test) and Step 1 of each iteration (determining the nonbasic variable that will go into the base) must be adapted from Fig. 17.14, since their decisions are based on Equation (0) of the objective function. Fig. 17.15 shows these adjustments in relation to Fig. 17.14. Except for Step 2 (optimality test) and Step 1 of each iteration, the other steps are the same as the ones presented in Fig. 17.14 for maximization problems.
FIG. 17.15 Adjustment of the steps shown in Fig. 17.14 for solving LP minimization problems through the tabular form of the Simplex method.
Beginning: The problem must be in the standard form.
Step 1: Find an initial FBS for the LP problem. We can use the same procedure of Fig. 17.14 for maximization problems.
Step 2: Optimality test. The current FBS is optimal if, and only if, the coefficients of all the non-basic variables in Equation (0) of the tabular form are non-positive (≤ 0). While there is at least one non-basic variable with a positive coefficient in Equation (0), there is a better adjacent FBS.
Iteration: Determine a better adjacent FBS.
1. Determine the non-basic variable that will go into the base. It must be the one that produces the greatest decrease in z, that is, the one with the highest positive coefficient in Equation (0).
2. Determine the basic variable that will leave the base. We can use the same procedure of Fig. 17.14 for maximization problems.
3. Transform the current tabular form by using the Gauss-Jordan elimination method and recalculate the basic solution.
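The only change between the two variants is thus the entering-variable rule and the optimality test. A small illustrative helper (ours, not from the book) makes the symmetry explicit, operating on Equation (0) without the z column or constant:

```python
def entering_variable(eq0, sense):
    """Return the pivot-column index, or None if the current FBS is optimal.

    eq0: coefficients of Equation (0) (without the z column and constant).
    sense: 'max' -> enter the most negative coefficient, optimal when all >= 0;
           'min' -> enter the most positive coefficient, optimal when all <= 0.
    """
    if sense == 'max':
        j = min(range(len(eq0)), key=lambda k: eq0[k])
        return j if eq0[j] < 0 else None
    j = max(range(len(eq0)), key=lambda k: eq0[k])
    return j if eq0[j] > 0 else None
```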
Example 17.11
Consider the following linear programming minimization problem:

min z = 4x1 − 2x2
subject to:
  2x1 + x2 ≤ 10
  x1 − x2 ≤ 8
  x1, x2 ≥ 0
(17.24)
Determine the optimal solution for the problem.

Solution 1
First, in order for the problem represented by Expression (17.24) to be in the standard form, slack variables must be introduced into each one of the model constraints. The problem can also be rewritten as a maximization problem by using Expression (17.23):

max (−z) = −4x1 + 2x2
subject to:
  2x1 + x2 + x3 = 10
  x1 − x2 + x4 = 8
  x1, x2, x3, x4 ≥ 0
(17.25)
The initial tabular form that represents the equation system in Expression (17.25) is:
TABLE 17.E.10 Initial Tabular Form of the Equation System Represented by Expression (17.25)

                               Coefficients
Basic Variable   Equation    z    x1    x2    x3    x4   Constant
z                   (0)      1     4    -2     0     0       0
x3                  (1)      0     2     1     1     0      10
x4                  (2)      0     1    -1     0     1       8
The initial set of nonbasic variables is represented by x1 and x2, while the initial set of basic variables is represented by x3 and x4. The initial solution {x1, x2, x3, x4} = {0, 0, 10, 8} is not optimal, since the coefficient of nonbasic variable x2 in Equation (0) is negative (−2). To determine a better adjacent FBS, variable x2 enters the base (greatest negative coefficient) instead of variable x3, which is the only basic variable that limits the increase of x2 (the coefficient of x2 in Equation (2) is negative and, therefore, does not limit it), as seen in Table 17.E.11.
TABLE 17.E.11 Variable That Enters and Leaves the Base in the First Iteration
The new tabular form, after the Gauss-Jordan elimination method is used, is:
TABLE 17.E.12 New Tabular Form Using the Gauss-Jordan Elimination Method (Iteration 1)

                               Coefficients
Basic Variable   Equation    z    x1    x2    x3    x4   Constant
z                   (0)      1     8     0     2     0      20
x2                  (1)      0     2     1     1     0      10
x4                  (2)      0     3     0     1     1      18
From Table 17.E.12, we can obtain the new values of x2, x4, and z immediately. The results of the new feasible basic solution are x2 = 10 and x4 = 18 with −z = 20, that is, z = −20. The new solution can be represented by {x1, x2, x3, x4} = {0, 10, 0, 18}. The new basic solution obtained is the optimal one, since all the coefficients of the nonbasic variables in Equation (0) are nonnegative.

Solution 2
In order for the Simplex method to be applied, the initial minimization problem described in Expression (17.24) must be in the standard form:

min z = 4x1 − 2x2
subject to:
  2x1 + x2 + x3 = 10
  x1 − x2 + x4 = 8
  x1, x2, x3, x4 ≥ 0
(17.26)
The initial tabular form of the equation system in Expression (17.26) is represented in Table 17.E.13.
TABLE 17.E.13 Initial Tabular Form of the Equation System Represented by Expression (17.26)

                               Coefficients
Basic Variable   Equation    z    x1    x2    x3    x4   Constant
z                   (0)      1    -4     2     0     0       0
x3                  (1)      0     2     1     1     0      10
x4                  (2)      0     1    -1     0     1       8
Analogous to Solution 1, the initial set of nonbasic variables is represented by x1 and x2, while the initial set of basic variables is represented by x3 and x4. For a minimization problem, the solution is optimal if all the coefficients of the nonbasic variables in Equation (0) are nonpositive (≤ 0). Therefore, the initial solution {x1, x2, x3, x4} = {0, 0, 10, 8} is not optimal, since the coefficient of nonbasic variable x2 in Equation (0) is positive (+2). As shown in Table 17.E.14, variable x2 goes into the base (greatest positive coefficient) instead of variable x3, which is the only basic variable with a positive coefficient in the pivot column.
TABLE 17.E.14 Variable That Enters and Leaves the Base in the First Iteration
After the Gauss-Jordan elimination method is used, the new tabular form is:
TABLE 17.E.15 New Tabular Form Using the Gauss-Jordan Elimination Method (Iteration 1)

                               Coefficients
Basic Variable   Equation    z    x1    x2    x3    x4   Constant
z                   (0)      1    -8     0    -2     0     -20
x2                  (1)      0     2     1     1     0      10
x4                  (2)      0     3     0     1     1      18
According to Table 17.E.15, the new adjacent solution is {x1, x2, x3, x4} = {0, 10, 0, 18} with z = −20. The basic solution obtained is optimal, because the coefficients of all the nonbasic variables in Equation (0) are nonpositive (−8 and −2).
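Both routes can be cross-checked by enumerating the basic solutions of the standard-form system, as done for Example 17.10. An illustrative sketch (ours, not from the book), again using Cramer's rule with exact fractions:

```python
from fractions import Fraction as F
from itertools import combinations

# Standard form of Expression (17.26): columns x1, x2, x3, x4.
A = [[F(2), F(1),  F(1), F(0)],
     [F(1), F(-1), F(0), F(1)]]
b = [F(10), F(8)]
c = [F(4), F(-2), F(0), F(0)]   # z = 4*x1 - 2*x2

solutions = []
for i, j in combinations(range(4), 2):
    det = A[0][i] * A[1][j] - A[0][j] * A[1][i]
    if det == 0:
        continue
    xi = (b[0] * A[1][j] - A[0][j] * b[1]) / det
    xj = (A[0][i] * b[1] - b[0] * A[1][i]) / det
    x = [F(0)] * 4
    x[i], x[j] = xi, xj
    if min(x) >= 0:  # keep only feasible basic solutions
        solutions.append((sum(ck * xk for ck, xk in zip(c, x)), x))

z_min, x_min = min(solutions)
```

The smallest objective value over the feasible basic solutions is z = −20 at {0, 10, 0, 18}, which agrees with both Solution 1 and Solution 2.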
17.4.5 Special Cases of the Simplex Method
As presented in Section 17.2.3, an LP problem may not present a single nondegenerate optimal solution and may instead be characterized as one of the following special cases:
1. Multiple optimal solutions
2. Unlimited objective function z
3. There is no optimal solution
4. Degenerate optimal solution
Section 17.2.3 discussed the graphical solution for each one of the special cases listed. This section discusses how to identify the peculiarities of each special case in one of the tabular forms (initial, intermediate, or final).
17.4.5.1 Multiple Optimal Solutions
As discussed in Section 17.2.3.1, in a linear programming problem with infinite optimal solutions, several points reach the same optimal value in the objective function. Graphically, when the objective function is parallel to an active constraint, we have a case with multiple optimal solutions.
Through the Simplex method, we can identify a case with multiple optimal solutions when, in the optimal tabular form, the coefficient of one of the nonbasic variables is null in row 0 of the objective function.
17.4.5.2 Unlimited Objective Function z
As described in Section 17.2.3.2, in this case there is no limit to the increase in the value of at least one decision variable, resulting in an unbounded feasible region and an unlimited objective function z. For a maximization problem, the value of the objective function increases unlimitedly, while for a minimization problem the value decreases in an unlimited way. Through the Simplex method, we can identify a case whose objective function is unlimited when, in one of the tabular forms, a candidate nonbasic variable is prevented from entering the base because the rows of all the basic variables have nonpositive coefficients in the column of the candidate nonbasic variable.
17.4.5.3 There Is No Optimal Solution
According to Taha (2016), this case never occurs in models whose constraints are of the ≤ type with non-negative constants on the right-hand side, since the slack variables guarantee a feasible solution. During the implementation of the Simplex algorithm in the tabular form, whenever the basic variables take on nonnegative values, we have a feasible basic solution. In contrast, when at least one of the basic variables assumes a negative value, we have an infeasible basic solution.
17.4.5.4 Degenerate Optimal Solution
As discussed in Section 17.2.3.4, we can identify a special case of degenerate solution, graphically, when one of the vertices of the feasible region is obtained by the intersection of more than two distinct lines. Through the Simplex method, in turn, we can identify a case with a degenerate solution when, in one of the solutions of the Simplex method, the value of one of the basic variables is null. This variable is called a degenerate variable. When all the basic variables take on positive values, we say that the feasible basic solution is nondegenerate. If there is degeneration in the optimal solution, we have a case known as a degenerate optimal solution.
The degenerate solution is obtained when there is a tie between at least two basic variables when choosing which one of them must leave the base (rows with the same positive quotient). When this happens, we can choose any one of them randomly. The basic variable that is not chosen remains in the base; however, its value becomes null in the new adjacent solution. Analogously, during the solution of any linear programming problem through the Simplex method, if there is a tie when choosing the nonbasic variable that will go into the base, we should choose one of the variables randomly.
The problem with degeneration is that, in some cases, the Simplex algorithm can go into a loop, cycling through the same basic solutions without being able to leave that solution space. In this case, the optimal solution will never be reached.
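Two of these signatures can be checked mechanically on an optimal tableau. The sketch below is an illustrative helper (ours, not from the book), assuming the tableau is stored without the z column and that the basic variables' values and column indices are known:

```python
def special_cases(eq0, basic_values, basic_cols):
    """Flag Simplex special cases visible in an optimal tableau.

    eq0: coefficients of Equation (0) (without the z column and constant);
    basic_values: current values of the basic variables;
    basic_cols: column indices of the basic variables.
    """
    flags = []
    nonbasic = [j for j in range(len(eq0)) if j not in basic_cols]
    # Multiple optimal solutions: a nonbasic variable with a null
    # coefficient in row 0 can enter the base without changing z.
    if any(eq0[j] == 0 for j in nonbasic):
        flags.append("multiple optima")
    # Degenerate solution: a basic variable whose value is null.
    if any(v == 0 for v in basic_values):
        flags.append("degenerate")
    return flags
```

On the final tableau of Example 17.10 (nonbasic coefficients 4/3 and 1/3, basic values 10/3 and 8/3) the function returns no flags, as expected for a single nondegenerate optimum.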
17.5 SOLUTION BY USING A COMPUTER
In this chapter, we discussed several ways of solving an LP problem: (a) graphically, for problems with two decision variables; (b) by using the analytical method, for cases in which m < n; (c) through the Simplex method. Understanding each one of these methods theoretically is essential; however, to minimize the time spent solving an LP model, the same problems can be solved using a computer, without having to do the calculations and construct the charts manually.
Currently, there are several software packages in the market for solving linear programming problems, such as GAMS, AMPL, AIMMS, and software with electronic spreadsheets (Solver in Excel, What's Best!), among others. Software packages GAMS, AMPL, and AIMMS are algebraic modeling languages or systems (Algebraic Modeling Language, AML), that is, high-level programming languages used for solving complex and large-scale mathematical programming problems. These languages have an open interface that makes it possible to connect them to several optimization packages or solvers (LINDO, LINGO, CPLEX, XPRESS, MINOS, OSL, etc.), which find the model's solution. These optimization packages can also be used separately, but many of them usually run within a development environment. Let's now discuss the main characteristics of each of these software packages.
LINDO (Linear Interactive and Discrete Optimizer), developed by LINDO Systems (http://www.lindo.com/), solves linear, nonlinear, and integer programming problems. It is very easy to use and also very fast. The complete version of the software does not have limitations regarding the number of constraints or of real and integer variables. To solve linear programming problems, the Solver in LINDO uses more than one optimization method, including the Simplex method, the revised Simplex, the dual Simplex, and interior-point methods. Different from the Simplex algorithm, in the interior-point methods,
new solutions can be found in the interior of the feasible region. The Solver in LINDO has an interface with the following programming languages: Visual Basic, C, C++, among others. A free version of the software can be downloaded directly from the website http://www.lindo.com/.
Software LINGO (Language for Interactive General Optimizer), also developed by LINDO Systems, solves linear, nonlinear, and integer programming problems quickly and effectively. The complete version of the software does not have limitations regarding the number of constraints or of real and integer variables either. The Solver in LINGO also uses the Simplex method, the revised Simplex, the dual Simplex, and interior-point methods to determine the optimal solution for a linear programming model. All input data can be read directly from LINGO; however, many times, the software uses electronic spreadsheets, such as Excel, as an interface. The Solver in LINGO also has an interface with the following programming languages: Visual Basic, C, C++, among others. A free version of the software can also be downloaded from the website http://www.lindo.com/.
Also developed by LINDO Systems, What's Best! is a module installed inside electronic spreadsheets, such as Excel, and it is used to solve linear, nonlinear, and integer programming problems. The complete version of the software does not have limitations regarding the number of constraints or of real and integer variables either. The Solver in What's Best! uses the same optimization methods as LINDO and LINGO. What's Best! is also compatible with VBA (Visual Basic for Applications) in Excel, thus allowing the application of macros and programming code. A free version of the software can also be downloaded from the website http://www.lindo.com/.
CPLEX is an optimization package that was originally developed by Robert Bixby at CPLEX Optimization. In 1997, it was purchased by ILOG and, later on (2009), by IBM.
CPLEX has been widely used to solve large-scale linear, integer, and nonlinear programming problems, many times serving as a solver within algebraic modeling systems, such as GAMS, AMPL, and AIMMS. The Solver in CPLEX uses the Simplex method and interior-point methods to determine the optimal solution for a linear programming problem. It has an interface with the following programming languages: C, C++, C#, and Java. A free version of CPLEX can be downloaded on the website https://www.ibm.com/analytics/cplex-optimizer.
Developed by Dash Optimization Ltd., XPRESS is an optimization software that solves complex linear, nonlinear, and integer programming problems. The Solver in XPRESS allows us to choose a solution method (Simplex, dual Simplex, or interior-point methods) to determine the best solution for a linear programming problem. XPRESS has an interface with the following programming languages: C, C++, Java, Visual Basic, and .NET. Further information on the software can be found on the website http://www.dashoptimization.com/.
MINOS, developed by Bruce Murtagh and Michael Saunders from Stanford University, is an optimization software that solves large-scale linear and nonlinear programming problems. To solve linear programming problems, MINOS uses the Simplex method. MINOS has also been widely used as a solver for algebraic modeling languages. It has an interface with the following programming languages: Fortran, C, and Matlab. Further information on MINOS can be found on the website http://www.sbsi-sol-optimize.com/asp/sol_product_minos.htm/.
Software GAMS (General Algebraic Modeling System), developed by GAMS Development Corporation, is a high-level algebraic modeling system used to solve complex and large-scale linear, nonlinear, and integer programming problems. As specified previously, GAMS has an interface that connects it to several optimization packages, including CPLEX, MINOS, OSL, XPRESS, LINGO, and LINDO, among others. A free version of the software can be downloaded directly from the website http://www.gams.com/.
AMPL (A Mathematical Programming Language) is also an algebraic modeling language, developed at Bell Laboratories to solve highly complex linear, integer, and nonlinear programming problems. AMPL has an open interface that makes it possible to connect it to several solvers (such as CPLEX, MINOS, OSL, and XPRESS, among others) that find the model's optimal solution. A free version of AMPL can be downloaded directly from the website http://www.ampl.com/.
AIMMS (Advanced Integrated Multidimensional Modeling Software), developed by Paragon Decision Technology, is also a high-level algebraic modeling language that solves highly complex linear, nonlinear, and integer programming problems. It uses optimization packages, such as CPLEX, MINOS, and XPRESS, among others, to determine the optimal solution for a linear programming model. It interfaces with the following programming languages: C, C++, Visual Basic, and Excel. A free version of the software can be downloaded from the website http://www.aimms.com/.
IBM's OSL (Optimization Subroutine Library) is an optimization software that solves large-scale linear, nonlinear, and integer programming problems. The Solver in OSL uses the Simplex method and interior-point methods to determine the optimal solution for a linear programming problem. It interfaces with the programming languages C and Fortran.
Solver is an add-in in Excel that has been widely used for solving small-scale linear, nonlinear, and integer programming problems, due to its popularity and simplicity. Solver uses the Simplex algorithm to determine the optimal solution for a linear programming model. To solve nonlinear problems, Solver uses the GRG2 (Generalized Reduced Gradient) algorithm, while for integer programming problems it uses the branch-and-bound algorithm. Solver has an interface with other programming languages, so that the final solution can be exported to another package. In the following section, we will discuss how to use it step by step.
17.5.1 Solver in Excel
Solver is capable of solving problems with up to 200 decision variables and 100 constraints. To use it, it is necessary to activate the Solver add-in in Excel. First, click on the File tab and select Options (Fig. 17.16). From the Excel Options dialog box (Fig. 17.17), choose the option Add-Ins and select the name Solver Add-in. Also in Fig. 17.17, the next step consists in clicking on Go, which will open the Add-Ins dialog box shown in Fig. 17.18. Finally, confirm by clicking on Solver Add-in and OK. Thus, Solver will become available on the Data tab, in the Analyze group, as shown in Fig. 17.19.
Having selected the Solver command, the Solver Parameters dialog box will appear (Fig. 17.20). Let's now discuss each one of its fields.
1. Set Objective
The objective cell (target cell in earlier Excel versions) is the one that contains the value of the objective function.
2. To (Equal To in earlier Excel versions)
We must choose whether the objective function is a maximization (Max) or a minimization (Min) one. Solver can also use the option Value Of. In this case, Solver will search for a solution whose objective function value (objective cell) is the same as, or as close as possible to, the value stipulated.
3. By Changing Variable Cells
Variable cells (changing cells or adjustable cells in earlier versions) refer to the model's decision variables. They are the cells whose values vary until the model's optimal solution is reached.
FIG. 17.16 Activating Solver from Excel Options.
776
PART
VII Optimization Models and Simulation
FIG. 17.17 Activating Solver from the Add-Ins option.
FIG. 17.18 Add-Ins dialog box.
Solution of Linear Programming Problems Chapter
17
777
FIG. 17.19 Availability of Solver on the Data tab.
FIG. 17.20 Solver Parameters.
FIG. 17.21 Add Constraint dialog box.
4. Subject to the Constraints
Each of the model constraints must be included by using the Add button seen in Fig. 17.20, which opens the Add Constraint dialog box shown in Fig. 17.21. First, in the Cell Reference box, we must select the cell that represents the left-hand side of the constraint to be added. Next, select the constraint symbol (≤, =, or ≥), int (integer variable), or bin (binary variable). In the Constraint box, select a constant, a reference cell, or a formula with a numerical value that represents the constraint's right-hand side. While there are still constraints to be included in the model, click on Add. The non-negativity constraints of the decision variables must also be included in this phase. After the last constraint, press OK to go back to the Solver Parameters dialog box.
As shown in Fig. 17.20, each constraint that has already been added can be changed or deleted. To do that, we just need to select the desired constraint and click on the Change or Delete button. Additionally, the Reset All button clears all the data regarding the current model. Another way of including the non-negativity constraints is to select the Make Unconstrained Variables Non-Negative check box.
5. Select a Solving Method
For linear programming problems, we must select the Simplex LP engine. Select the GRG Nonlinear engine for smooth nonlinear problems, and the Evolutionary engine for nonsmooth problems.
6. Options
In the Solver Parameters dialog box, it is also possible to click on the Options button, which opens the Options window (Fig. 17.22). On the All Methods tab (Fig. 17.22), we can change options for all solving methods. In the Constraint Precision box, the degree of precision can be specified; the smaller the number, the higher the precision. If the Use Automatic Scaling check box is selected, Solver internally rescales the values of variables, constraints, and the objective to similar magnitudes, to reduce the impact of extremely large or small values on the accuracy of the solution process. The Show Iteration Results check box displays the values of each trial solution. In the Solving with Integer Constraints box, if the Ignore Integer Constraints check box is selected, all integer, binary, and all different constraints are ignored; this is known as the relaxation of the integer programming problem. In the Integer Optimality (%) box, we can type the maximum percentage difference Solver should accept between the objective value of the best integer solution found and the best known bound on the true optimal objective value before stopping.
In the Solving Limits box, we can specify the maximum CPU time and the maximum number of iterations, respectively, in the Max Time (Seconds) and Iterations boxes. Finally, the last two options are used only for problems that include integer constraints on variables, or problems that use the Evolutionary Solving method. In the Max Subproblems box, we can specify the maximum number of subproblems, and
FIG. 17.22 Options dialog box.
FIG. 17.23 Solver Results dialog box—feasible solution.
in the Max Feasible Solutions box, we can specify the maximum number of feasible solutions (https://www.solver.com/excel-solver-change-options-all-solving-methods).
7. Solve
Going back to the Solver Parameters dialog box, click on Solve to obtain the Solver Results dialog box. When Solver finds a feasible solution for the problem in question, one that satisfies all the model constraints, a corresponding message appears in the Solver Results dialog box, as seen in Fig. 17.23. In this case, the Solver result will appear automatically in the spreadsheet under analysis; we just need to click on OK to keep the optimal solution found. To restore the model's initial values, select the Restore Original Values option and, finally, click on OK. The current scenario can also be saved by using the Save Scenario button. Solver has three types of reports: Answer, Sensitivity, and Limits. In order for these reports to be made available in new Excel spreadsheets, we just need to select the desired option before clicking on the OK button in the Solver Results dialog box. The Answer report provides the results of the model's optimal solution. The Limits report shows the lower and upper limits of each one of the variable cells. The Answer and Limits reports will be discussed in Section 17.5.4, and the Sensitivity report in Section 17.6.4.
17.5.2
Solution of the Examples found in Section 16.6 of Chapter 16 using Solver in Excel
Each of the examples found in Section 16.6 of the previous chapter (modeling of real linear programming problems) will be solved by the Solver in Excel.
17.5.2.1 Solution of Example 16.3 of Chapter 16 (Production Mix Problem at Venix Toys)
Example 16.3, presented in Section 16.6.1 of the previous chapter and regarding the production mix problem at Venix, a company in the toy sector, will be solved through Solver in Excel. Fig. 17.24 shows how the linear programming model must be edited in an Excel spreadsheet so that it can be solved by Solver (see file Example3_Venix.xls). First, we can see that the unit profits of each product are represented by cells B5 and C5, whereas the decision variables (weekly number of toy cars and tricycles to be manufactured) are represented by cells B14 and C14, respectively. The objective function is represented by cell D14 (see formula in Box 17.2). The labor availability constraints for the machining, painting, and assembly activities are represented in rows 8, 9, and 10, respectively. For instance, the labor availability constraint for the machining activity is represented by the inequality 0.25x1 + 0.5x2 ≤ 36. In order for it to be added to the Subject to the Constraints box in the Solver Parameters dialog box, the left-hand side of the constraint must be represented in a single cell. Therefore, the term 0.25x1 + 0.5x2 is represented by cell D8 (see formula in Box 17.2). We repeat the same procedure for the other constraints. The initial solution has the following values: x1 = 0, x2 = 0, and z = 0.
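The same linear programming model can be checked outside Excel. Below is a minimal sketch using SciPy's linprog (not part of the book's Solver workflow; since linprog minimizes, the profit coefficients are negated):

```python
from scipy.optimize import linprog

# Maximize z = 12*x1 + 60*x2 (linprog minimizes, so negate the profits)
c = [-12, -60]
# Hours used per unit in machining, painting, and assembly (<= hours available)
A_ub = [[0.25, 0.50],
        [0.10, 0.75],
        [0.10, 0.40]]
b_ub = [36, 22, 15]
# Non-negativity constraints as variable bounds
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print(res.x, -res.fun)  # optimum: x1 = 70, x2 = 20, z = 2040
```

The three pieces of the call mirror the Solver Parameters fields: c plays the role of the objective cell, the bounds of the non-negativity constraints, and A_ub/b_ub of the Subject to the Constraints box.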
FIG. 17.24 Production mix model at Venix Toys in Excel. The spreadsheet contains:

Unit profit: x1 (cars) 12; x2 (tricycles) 60

Activity    x1 cars   x2 tricycles   Hours used       Hours available
Machining   0.25      0.5            0.0          ≤   36
Painting    0.1       0.75           0            ≤   22
Assembly    0.1       0.4            0            ≤   15

Solution (quantities produced): x1 (cars) 0; x2 (tricycles) 0; z (total profit) $0.00
BOX 17.2 Formula of the Objective Function and of the Total Number of Hours Used in Each Activity

Cell   Formula
D8     =B8*$B$14 + C8*$C$14
D9     =B9*$B$14 + C9*$C$14
D10    =B10*$B$14 + C10*$C$14
D14    =B5*$B$14 + C5*$C$14
BOX 17.3 Alternative to Box 17.2 When Using the SUMPRODUCT Function

Cell   Formula
D8     =SUMPRODUCT(B8:C8,$B$14:$C$14)
D9     =SUMPRODUCT(B9:C9,$B$14:$C$14)
D10    =SUMPRODUCT(B10:C10,$B$14:$C$14)
D14    =SUMPRODUCT(B5:C5,$B$14:$C$14)
For complex problems, we can use the SUMPRODUCT function directly, which multiplies the corresponding components of the ranges or arrays provided and returns the sum of these products, as shown in Box 17.3. The problem is now ready to be solved by Solver in Excel. Clicking on the Solver command, we obtain the Solver Parameters dialog box, as shown in Fig. 17.25. First, we must select the objective cell (D14), which is the one that contains the value of the objective function. Since it is a maximization problem, select the option Max. The next step consists in selecting the variable cells (B14:C14) that represent the model's decision variables. Lastly, we must add each one of the model constraints to the Subject to the Constraints box. Since the labor availability constraints for the three activities are all of the ≤ type, instead of adding them separately, they can be added simultaneously. To do that, first click on the Add button. Select the cell range D8:D10 in the Cell Reference box, the symbol ≤, and the cell range F8:F10 in the Constraint box, as shown in Fig. 17.26. To conclude the inclusion of the current constraint, we could click on OK; nevertheless, since the non-negativity constraints of the model's decision variables will also be included, click on Add instead. Once again, the Add Constraint dialog box appears, so that the new model constraint can be included. Therefore, we select the variable cells (B14:C14) in the Cell Reference box, the symbol ≥, and the value 0 in the Constraint box, as seen in Fig. 17.27. Since this is the last constraint, conclude by clicking on OK. The non-negativity constraints can also be activated by selecting the Make Unconstrained Variables Non-Negative check box.
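SUMPRODUCT is simply a dot product of two ranges. A short pure-Python sketch of what Excel computes in a cell such as D8 (the numeric values below are illustrative):

```python
# SUMPRODUCT(B8:C8, B14:C14) multiplies element by element and sums the results:
def sumproduct(xs, ys):
    return sum(x * y for x, y in zip(xs, ys))

hours_per_unit = [0.25, 0.50]   # machining hours for x1 (cars) and x2 (tricycles)
quantities = [70, 20]           # a candidate production plan
print(sumproduct(hours_per_unit, quantities))  # 0.25*70 + 0.5*20 = 27.5
```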
FIG. 17.25 Solver Parameters dialog box for Example 16.3.
FIG. 17.26 Adding the labor availability constraint for the three activities.
FIG. 17.27 Adding the non-negativity constraint of the decision variables.
Since we have a linear programming problem, select the Simplex LP engine in the Select a Solving Method box. The next step consists in clicking on the Options button, thus obtaining the Options dialog box (Fig. 17.28). The values of the Constraint Precision, Max Time, and Iterations parameters will be kept. If Solver cannot find a feasible solution, we can try a lower precision (a larger Constraint Precision value); another alternative is to increase the maximum time and the number of iterations. If the problem persists, the model must be infeasible (Taha, 2016). Finally, click on OK to go back to the Solver Parameters dialog box. From then on, the model is ready to be solved; thus, click on Solve. In the case of a feasible solution, we can update the current Excel spreadsheet with the new solution by clicking on Keep Solver Solution. Fig. 17.29 shows the result of the optimal solution of Example 16.3 for Venix Toys.
FIG. 17.28 Options dialog box for Example 16.3.
FIG. 17.29 Optimal solution for the production mix problem for Venix Toys.
Therefore, we can see that the model's optimal solution is x1 = 70 and x2 = 20, with z = 2,040 ($2,040.00). The same results can be seen, in a more detailed way, in the Answer report (see Section 17.5.4). The Limits report for Venix Toys will also be discussed in Section 17.5.4.
Solution Using Names in a Cell or Cell Range
According to Haddad and Haddad (2004), using names in a cell or cell range makes formulas easier to understand. To name a cell or cell range, we just need to click on it, type the desired name in the Name Box, which appears on the left-hand side of the Formula bar, and conclude by pressing ENTER. Hence, the cells will be referenced by name and no longer by the corresponding columns and rows. For example, the cell range B5:C5 will be named Unit_profit, as seen in Fig. 17.30. Another way of doing this is to right-click and choose Define Name; the New Name dialog box will appear (Fig. 17.31), and the desired name must be typed in the Name box. A third alternative is to select the Formulas tab and
FIG. 17.30 Inserting a name into a set of cells.
FIG. 17.31 New Name dialog box.
FIG. 17.32 Name Manager dialog box.
choose the Name Manager option (CTRL + F3), resulting in Fig. 17.32. As shown in Fig. 17.32, we can include a new name by using the New button (once again, the New Name dialog box will appear), or select an existing cell or cell range and click on Edit to change the current name, or on Delete. Note that Fig. 17.32 shows the name of each cell or cell range and its corresponding value, in addition to its respective rows and columns. Therefore, each cell or cell range included in the Name Manager is now referenced by its name instead of its corresponding rows and columns. For example, the formulas in Box 17.3 regarding cells D8, D9, D10, and D14 can be rewritten based on the new names, as shown in Box 17.4. Fig. 17.33 is an adaptation of Fig. 17.25 in which each cell or cell range is called by its respective name. Notice that objective cell D14 is now referred to as Total_profit, the variable cells (B14:C14) as Quantities_produced, the left-hand side of the labor constraint (D8:D10) as Hours_used, and its right-hand side (F8:F10) as Hours_available.
BOX 17.4 Formulas From Box 17.3 Written Based on Their New Names

Cell   Formula
D8     =SUMPRODUCT(B8:C8,Quantities_produced)
D9     =SUMPRODUCT(B9:C9,Quantities_produced)
D10    =SUMPRODUCT(B10:C10,Quantities_produced)
D14    =SUMPRODUCT(Unit_profit,Quantities_produced)
FIG. 17.33 Solver Parameters after the inclusion of new names.
17.5.2.2 Solution of Example 16.4 of Chapter 16 (Production Mix Problem at Naturelat Dairy)
Example 16.4, presented in Section 16.6.1 of the previous chapter and regarding the production mix problem at Naturelat, a dairy product company, will also be solved through Solver in Excel. Fig. 17.34 illustrates the representation of the model in an Excel spreadsheet (see file Example4_Naturelat.xls). The formulas used in Fig. 17.34 are shown in Box 17.5. Analogous to the Venix Toys example, names were assigned to the cells and cell ranges in Fig. 17.34, which will be used in Solver to facilitate the understanding of the model. Fig. 17.35 shows the names assigned to the respective cells. The representation of the Naturelat Dairy problem in the Solver Parameters dialog box is shown in Fig. 17.36. Since names were assigned to the model cells, they are referred to in Fig. 17.36 by their respective names. Note, in Fig. 17.36, that the constraints are sorted in alphabetical order; the same happens in the Name Manager (Fig. 17.35). Similar to the Venix Toys example, we selected the Make Unconstrained Variables Non-Negative check box and the Simplex LP engine in the Select a Solving Method box. The Options command remained unaltered. Finally, click on Solve and select the option Keep Solver Solution in the Solver Results dialog box. Fig. 17.37 shows the optimal solution for the production mix model at Naturelat Dairy.
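This model can also be stated directly in SciPy. The sketch below transcribes the coefficients shown in Fig. 17.34 (a sketch only, not the book's workflow); the minimum demands enter conveniently as lower bounds on the variables:

```python
from scipy.optimize import linprog

# Maximize the total contribution margin (negated, since linprog minimizes)
c = [-0.80, -0.70, -1.15, -1.30, -0.70]
# Raw milk, whey, fat, and labor used per unit (<= availability)
A_ub = [[0.70, 0.40, 0.40, 0.60, 0.60],
        [0.16, 0.22, 0.32, 0.19, 0.23],
        [0.25, 0.33, 0.33, 0.40, 0.47],
        [0.05, 0.12, 0.09, 0.04, 0.16]]
b_ub = [1200, 460, 650, 170]
demand = [320, 380, 450, 240, 180]       # minimum demand per product
bounds = [(d, None) for d in demand]     # x_i >= minimum demand
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
print(res.x, -res.fun)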
FIG. 17.34 Representation of the production mix model in Excel for Naturelat Dairy. The spreadsheet contains:

Unit contribution margin: x1 yogurt 0.80; x2 minas 0.70; x3 mozzarella 1.15; x4 parmesan 1.30; x5 provolone 0.70

Resource      x1     x2     x3     x4     x5     Used        Available
Raw milk      0.70   0.40   0.40   0.60   0.60   0.00    ≤   1200
Whey          0.16   0.22   0.32   0.19   0.23   0.00    ≤   460
Fat           0.25   0.33   0.33   0.40   0.47   0.00    ≤   650
Labour (MH)   0.05   0.12   0.09   0.04   0.16   0.00    ≤   170

Minimum demand: yogurt ≥ 320; minas cheese ≥ 380; mozzarella cheese ≥ 450; parmesan cheese ≥ 240; provolone cheese ≥ 180

Solution (quantities produced): all initially 0; z (total contribution margin) $0.00
BOX 17.5 Formulas From Fig. 17.34

Cell   Formula
G8     =SUMPRODUCT(B8:F8,$B$24:$F$24)
G9     =SUMPRODUCT(B9:F9,$B$24:$F$24)
G10    =SUMPRODUCT(B10:F10,$B$24:$F$24)
G13    =SUMPRODUCT(B13:F13,$B$24:$F$24)
G16    =SUMPRODUCT(B16:F16,$B$24:$F$24)
G17    =SUMPRODUCT(B17:F17,$B$24:$F$24)
G18    =SUMPRODUCT(B18:F18,$B$24:$F$24)
G19    =SUMPRODUCT(B19:F19,$B$24:$F$24)
G20    =SUMPRODUCT(B20:F20,$B$24:$F$24)
G24    =SUMPRODUCT(B5:F5,$B$24:$F$24)
FIG. 17.35 Name Manager for the problem at Naturelat Dairy.
FIG. 17.36 Solver Parameters regarding the problem at Naturelat Dairy.
FIG. 17.37 Result of the Naturelat Dairy model.
FIG. 17.38 Representation of the mix problem in Excel of Oil-South Refinery. The spreadsheet contains: the unit profit of each variable (x11 = 3, x21 = 2, x31 = 2, x12 = 5, x22 = 4, x32 = 4, x13 = 6, x23 = 5, x33 = 5); the composition constraints Common-oil1, Super-oil1, Super-oil2, Extra-oil2, and Extra-oil3 in rows 6 to 10, each with right-hand side 0 and symbol ≤ (for instance, Common-oil1 has coefficients 0.3, −0.7, −0.7 on x11, x21, x31); the minimum demand constraints for common, super, and extra gasoline (≥ 5,000, 3,000, and 3,000 barrels, respectively); the capacity constraints for oils 1, 2, and 3 (≤ 10,000, 8,000, and 7,000 barrels, respectively); the refinery's total production capacity (≤ 20,000 barrels); and the solution cells with the initial quantities produced x_ij = 0 and z (total profit) $0.00.
17.5.2.3 Solution of Example 16.5 of Chapter 16 (Mix Problem of Oil-South Refinery)
Example 16.5, presented in Section 16.6.2 of the previous chapter and regarding the mix problem of Oil-South Refinery, will also be solved through Solver in Excel. Fig. 17.38 illustrates the representation of the model in an Excel spreadsheet (see file Example5_OilSouth.xls). Rows 6 to 10 represent the constraints on the maximum or minimum percentage allowed of a certain type of oil in the composition of a certain type of gasoline. In order for all these constraints to have the same symbol (≤), the ≥-type inequalities were multiplied by (−1). The formulas used in Fig. 17.38 are shown in Box 17.6. In order to facilitate the understanding of the model, names were assigned to the main cells and cell ranges in Fig. 17.38, as seen in Fig. 17.39. Having assigned the names to the cells or cell ranges of Fig. 17.38, they will be referenced by their respective names. Therefore, the representation of the problem of Oil-South Refinery in the Solver Parameters dialog box is illustrated in Fig. 17.40.

BOX 17.6 Formulas From Fig. 17.38

Cell   Formula
K6     =SUMPRODUCT(B6:J6,$B$26:$J$26)
K7     =SUMPRODUCT(B7:J7,$B$26:$J$26)
K8     =SUMPRODUCT(B8:J8,$B$26:$J$26)
K9     =SUMPRODUCT(B9:J9,$B$26:$J$26)
K10    =SUMPRODUCT(B10:J10,$B$26:$J$26)
K13    =SUMPRODUCT(B13:J13,$B$26:$J$26)
K14    =SUMPRODUCT(B14:J14,$B$26:$J$26)
K15    =SUMPRODUCT(B15:J15,$B$26:$J$26)
K18    =SUMPRODUCT(B18:J18,$B$26:$J$26)
K19    =SUMPRODUCT(B19:J19,$B$26:$J$26)
K20    =SUMPRODUCT(B20:J20,$B$26:$J$26)
K23    =SUMPRODUCT(B23:J23,$B$26:$J$26)
K26    =SUMPRODUCT(B4:J4,$B$26:$J$26)
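The percentage constraints in rows 6 to 10 come from linearizing ratios. For example, requiring oil 1 to be at most a fraction p of common gasoline, x11/(x11 + x21 + x31) ≤ p, rearranges to (1 − p)x11 − p·x21 − p·x31 ≤ 0. A small sketch of this rearrangement (the helper name below is ours, not the book's):

```python
def blend_ratio_coeffs(p, n, i):
    """Left-hand-side coefficients of the linearized constraint
    x_i <= p * (x_1 + ... + x_n), rewritten as
    (1 - p)*x_i - p*(sum of the others) <= 0."""
    return [(1 - p) if j == i else -p for j in range(n)]

# At most 70% of oil 1 in common gasoline (three oils):
print(blend_ratio_coeffs(0.7, 3, 0))  # ~[0.3, -0.7, -0.7]
```

These are, up to rounding, the coefficients shown for the Common-oil1 row in Fig. 17.38, which is why the spreadsheet constraint has a right-hand side of 0.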
FIG. 17.39 Name Manager for the problem of Oil-South Refinery.
FIG. 17.40 Solver Parameters regarding the problem of Oil-South Refinery.
Note that the non-negativity constraints were activated by selecting the Make Unconstrained Variables Non-Negative check box, and the Simplex LP method was selected. The Options command remained unaltered. Finally, click on Solve and OK to keep Solver solution. Fig. 17.41 shows the optimal solution for the mix problem of Oil-South Refinery.
17.5.2.4 Solution of Example 16.6 of Chapter 16 (Diet Problem) Example 16.6 presented in Section 16.6.3 of the previous chapter, regarding a diet problem, will also be solved through the Solver in Excel. Fig. 17.42 represents the model in an Excel spreadsheet (see file Example6_Diet.xls).
FIG. 17.41 Result of the mix problem of Oil-South Refinery.
FIG. 17.42 Representation of the diet problem in Excel. The spreadsheet contains: the cost per kg of each of the 12 foods (x1 spinach 3; x2 broccoli 2; x3 cress 1.8; x4 tomato 1.6; x5 carrot 3; x6 egg 3; x7 bean 4; x8 chickpea 4; x9 soy 4.5; x10 meat 7.5; x11 liver 8; x12 fish 8.5); the amount of iron, vitamin A, vitamin B12, and folic acid supplied by each food, with the total ingested of each nutrient subject to the minimum necessities (≥ 80, 45,000, 20, and 4, respectively); the maximum daily consumption of each food (1 or 1.5 kg); and the solution cells with the initial quantities consumed equal to 0 and z (total cost) $0.00.
BOX 17.7 Formulas From Fig. 17.42

Cell   Formula
N8     =SUMPRODUCT(B8:M8,$B$18:$M$18)
N9     =SUMPRODUCT(B9:M9,$B$18:$M$18)
N10    =SUMPRODUCT(B10:M10,$B$18:$M$18)
N11    =SUMPRODUCT(B11:M11,$B$18:$M$18)
N14    =SUMPRODUCT(B14:M14,$B$18:$M$18)
N18    =SUMPRODUCT(B5:M5,$B$18:$M$18)
The formulas used in Fig. 17.42 are shown in Box 17.7. The names assigned to the main cells and cell ranges of Fig. 17.42 are listed in Fig. 17.43. Therefore, the Solver Parameters dialog box regarding the diet problem shows the names assigned to the respective cells or cell ranges, as shown in Fig. 17.44.
FIG. 17.43 Name Manager for the diet problem.
FIG. 17.44 Solver Parameters related to the diet problem.
Note that the non-negativity constraints were activated by selecting the Make Unconstrained Variables Non-Negative check box, and the Simplex LP method was selected. The Options command remained unaltered. Finally, click on Solve and OK to keep Solver solution. Fig. 17.45 shows the optimal solution for the diet problem.
17.5.2.5 Solution of Example 16.7 of Chapter 16 (Farmer’s Problem) In order for Example 16.7 in Section 16.6.4 of the previous chapter (farmer’s problem) to be solved by Solver in Excel, it must be represented in an Excel spreadsheet, as shown in Fig. 17.46 (see file Example7_Farmer.xls). Box 17.8 shows the formulas used in Fig. 17.46.
FIG. 17.45 Result of the diet problem.
FIG. 17.46 Representation of the farmer's problem in an Excel spreadsheet. The spreadsheet contains: the NPV per hectare of each crop (x1 soy 9.343; x2 manioc 8.902; x3 corn 2.118; x4 wheat 11.542; x5 bean 4.044); the total area used (≤ maximum area of 1,000 ha); the cash inflow per hectare in the 1st, 2nd, and 3rd years, subject to minimum flows of 6,000, 5,000, and 6,500; the initial investment per hectare (5.00, 4.00, 3.50, 3.50, 3.00), limited to a maximum initial investment of 3,800; the cash outflow per hectare in the 1st, 2nd, and 3rd years, subject to maximum flows of 3,500, 3,200, and 2,500; and the solution cells with the initial area invested in each crop equal to 0 and z (total NPV) $0.00.
BOX 17.8 Formulas Used in Fig. 17.46

Cell   Formula
G8     =SUMPRODUCT(B8:F8,$B$25:$F$25)
G11    =SUMPRODUCT(B11:F11,$B$25:$F$25)
G12    =SUMPRODUCT(B12:F12,$B$25:$F$25)
G13    =SUMPRODUCT(B13:F13,$B$25:$F$25)
G16    =SUMPRODUCT(B16:F16,$B$25:$F$25)
G19    =SUMPRODUCT(B19:F19,$B$25:$F$25)
G20    =SUMPRODUCT(B20:F20,$B$25:$F$25)
G21    =SUMPRODUCT(B21:F21,$B$25:$F$25)
G25    =SUMPRODUCT(B5:F5,$B$25:$F$25)
FIG. 17.47 Name Manager for the farmer’s problem.
Fig. 17.47 shows the names assigned to the cells and cell ranges in Fig. 17.46, which will be referenced in Solver. In turn, Fig. 17.48 shows the Solver parameters for the farmer's problem; note that the cells and cell ranges are represented by their respective names. The non-negativity constraints were activated by selecting the Make Unconstrained Variables Non-Negative check box, and the Simplex LP method was selected. The Options command remained unaltered. Finally, click on Solve and OK to keep the Solver solution. Fig. 17.49 shows the optimal solution for the farmer's problem.
FIG. 17.48 Solver Parameters as regards the farmer’s problem.
FIG. 17.49 Optimal solution for the farmer’s problem.
17.5.2.6 Solution of Example 16.8 of Chapter 16 (Portfolio Selection—Maximization of the Expected Return)
Example 16.8 in Section 16.6.5 of the previous chapter will also be solved through Solver in Excel. Fig. 17.50 shows the representation of the portfolio optimization problem in an Excel spreadsheet (see file Example8_Portfolio.xls). Box 17.9 shows the formulas used in Fig. 17.50. Fig. 17.51 shows the names assigned to the cells and cell ranges in Fig. 17.50, whereas Fig. 17.52 shows the Solver parameters regarding the portfolio selection problem (maximization of the expected return), in which each cell or cell range is referenced by its respective name. Note that the non-negativity constraints were activated by selecting the Make Unconstrained Variables Non-Negative check box, and the Simplex LP method was selected. The Options command remained unaltered. Finally, click on Solve and OK to keep the Solver solution. Fig. 17.53 shows the optimal solution for the portfolio selection problem.
FIG. 17.50 Representation of Example 16.8 in an Excel spreadsheet. The spreadsheet contains: the average return and standard deviation of the 10 stocks (x1 BBAS3 0.37%, 2.48%; x2 BBDC4 0.24%, 2.16%; x3 ELET6 0.14%, 1.95%; x4 GGBR4 0.30%, 2.93%; x5 ITSA4 0.24%, 2.40%; x6 PETR4 0.19%, 2.00%; x7 CSNA3 0.28%, 2.63%; x8 TNLP4 0.18%, 2.14%; x9 USIM5 0.25%, 2.73%; x10 VALE5 0.24%, 2.47%); the invested capital constraint (Σx_i = 100%); the maximum composition of each asset (30%); the portfolio risk constraint (real ≤ theoretical 2.50%); and the solution cells with the optimum composition of each asset initially at 0% and z = E(R) = 0%.
794
PART
VII Optimization Models and Simulation
BOX 17.9 Formulas Used in Fig. 17.50

Cell   Formula
B9     =SUM(B18:K18)
B14    =SUMPRODUCT(B6:K6,B18:K18)
L18    =SUMPRODUCT(B5:K5,B18:K18)

FIG. 17.51 Name Manager for the portfolio selection problem (maximization of the expected return).
FIG. 17.52 Solver Parameters regarding the portfolio selection problem (maximization of the expected return).
FIG. 17.53 Optimal solution for the portfolio selection problem.
17.5.2.7 Solution of Example 16.9 of Chapter 16 (Portfolio Selection—Minimization of the Portfolio's Mean Absolute Deviation)
Example 16.9 in Section 16.6.5 of the previous chapter, which determines the portfolio composition that minimizes its mean absolute deviation (MAD), will be solved through Solver in Excel. The representation of the problem in an Excel spreadsheet can be seen in Fig. 17.54 (see file Example9_Portfolio.xls). The calculation of the portfolio's MAD is shown in Fig. 17.55 and can also be found in the same file Example9_Portfolio.xls. For a portfolio that has 10% of each asset in its composition, row 250 in Fig. 17.55 shows the mean absolute deviation of each asset and of the portfolio. As shown in row 250, or in the formula of cell P250 in Box 17.10, the percentage of each asset in the portfolio can be multiplied directly by its mean absolute deviation, since the percentage is constant in all the periods. Box 17.10 shows the main formulas used in Figs. 17.54 and 17.55. The formulas in cells P248 and P250 show the calculation of the mean absolute deviation of the first asset (BBAS3); for the other assets, we use the same procedure. Fig. 17.56 shows the names assigned to the cells and cell ranges in Fig. 17.54.
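The spreadsheet's calculation can be sketched in a few lines of Python (with synthetic, made-up return series; as the text notes, the weights are constant in every period, so the portfolio MAD is computed here as the weighted sum of each asset's MAD):

```python
def mean(xs):
    return sum(xs) / len(xs)

def asset_mad(returns):
    """Mean absolute deviation of one asset's return series."""
    m = mean(returns)
    return mean([abs(r - m) for r in returns])

# Two assets with synthetic return series and fixed 50/50 weights
returns = {"A": [0.01, -0.02, 0.03, 0.00], "B": [0.02, 0.01, -0.01, 0.02]}
weights = {"A": 0.5, "B": 0.5}
portfolio_mad = sum(weights[k] * asset_mad(returns[k]) for k in returns)
print(portfolio_mad)  # 0.5*0.015 + 0.5*0.01 = 0.0125
```

Each asset_mad call mirrors cells O248/P2/P248 of Fig. 17.55, and the weighted sum mirrors cells P250 and AI250.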
FIG. 17.54 Representation of Example 16.9 in an Excel spreadsheet. The spreadsheet contains: the average return and standard deviation of the same 10 stocks of Example 16.8 (x1 BBAS3 to x10 VALE5); the invested capital constraint (Σx_i = 100%); the portfolio return constraint (real ≥ theoretical 0.15%); the maximum composition of each asset (30%); and the solution cells with the optimum composition of each asset initially at 0% and z = MAD = 0.00%.
FIG. 17.55 Mean absolute deviation of each asset and of the portfolio.
BOX 17.10 Main Formulas Used in Figs. 17.54 and 17.55

Cell    Formula
B9      =SUM(B18:K18)
B12     =SUMPRODUCT(B5:K5,B18:K18)
O248    =AVERAGE(O2:O247)
P2      =ABS(O2-$O$248)
P248    =AVERAGE(P2:P247)
P250    =P248*B18
AI250   =SUM(P250:AH250)
L18     =AI250

FIG. 17.56 Name Manager for the portfolio selection problem (minimization of the MAD).
FIG. 17.57 Solver Parameters regarding the portfolio selection problem (minimization of the MAD).
Fig. 17.57 presents the Solver parameters regarding the portfolio selection problem (minimization of the MAD). Analogous to the previous models, we assume that the variables are non-negative and that the model is linear. Finally, click on Solve and OK to keep the Solver solution. Fig. 17.58 shows the optimal solution for the portfolio selection problem (minimization of the MAD).
FIG. 17.58 Optimal solution for the portfolio selection problem (minimization of the MAD).
17.5.2.8 Solution of Example 16.10 of Chapter 16 (Production and Inventory Problem of Fenix&Furniture)
Example 16.10 in Section 16.6.6 of the previous chapter, regarding the production and inventory problem of Fenix&Furniture, will be solved through Solver in Excel. Fig. 17.59 illustrates the representation of the problem in an Excel spreadsheet (see file Example10_Fenix&Furniture.xls). Note that, in the initial solution presented in Fig. 17.59, 2,000 units of each product were produced in each period. Applying the formula I_it = I_i,t-1 + x_it - D_it, we obtain the inventory levels of each product in each period, which must take on non-negative values. It is important to mention that this solution is not the optimal one. Box 17.11 shows the formulas used in Fig. 17.59. Fig. 17.60 shows the names assigned to the main cells and cell ranges in Fig. 17.59. The Solver parameters regarding the production and inventory problem of Fenix&Furniture are represented in Fig. 17.61. Since names were assigned to the main cells and cell ranges, they are referred to by their respective names. Analogous to the previous models, we assume that the variables are non-negative and that the model is linear. Finally, click on Solve and OK to keep the Solver solution. Fig. 17.62 shows the optimal solution for the production and inventory problem of Fenix&Furniture.
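The inventory balance equation I_it = I_i,t-1 + x_it - D_it is applied recursively, period by period. A small sketch with illustrative numbers (not the Fenix&Furniture data):

```python
def inventory_levels(initial, production, demand):
    """Apply I_t = I_{t-1} + x_t - D_t period by period."""
    levels, inv = [], initial
    for x, d in zip(production, demand):
        inv = inv + x - d   # inventory carried into the next period
        levels.append(inv)
    return levels

# Illustrative: start with 100 units in stock, produce 2,000 per period
print(inventory_levels(100, [2000, 2000, 2000], [1800, 2100, 1900]))
# -> [300, 200, 300]
```

In a feasible plan, every entry of the resulting list must be non-negative, which is exactly the constraint imposed on the inventory cells of the spreadsheet.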
17.5.2.9 Solution of Example 16.11 of Chapter 16 (Problem of the Lifestyle Natural Juices Manufacturer)
Example 16.11 in Section 16.6.7 of the previous chapter, regarding the aggregate planning problem at Lifestyle Natural Juices, will be solved through Solver in Excel. Fig. 17.63 illustrates the representation of the problem in an Excel spreadsheet (see file Example11_Lifestyle.xls). In the initial solution presented in Fig. 17.63, note that 1,000 additional liters were produced in July after new employees were hired in the previous month, such that the values of R_t from July to December become 5,000 (R_t = R_t-1 + H_t - F_t). Applying the formula I_t = I_t-1 + R_t + O_t + S_t - D_t, we obtain the inventory levels in each period. The values of the other decision variables remain null. It is important to mention that this solution is not the optimal one. Box 17.12 shows the formulas used in Fig. 17.63. Fig. 17.64 shows the names assigned to the cells and cell ranges in Fig. 17.63. The Solver parameters regarding the aggregate planning problem at Lifestyle are represented in Fig. 17.65. Since names were assigned to the main cells and cell ranges, they are referenced by their respective names. Analogous to the previous models, we assume that the variables are non-negative and that the model is linear. Finally, click on Solve and OK to keep the Solver solution. Fig. 17.66 shows the optimal solution for the aggregate planning problem of Lifestyle.
17.5.3
Solver Error Messages for Unlimited and Infeasible Solutions
Section 17.2.3 and Section 17.4.5 presented how to identify, graphically and through the Simplex method, respectively, each one of the special cases that may happen in a linear programming problem:

1. Multiple optimal solutions
2. Unlimited objective function z
3. There is no optimal solution
4. Degenerate optimal solution
In this section, we will analyze the error messages generated in the Solver Results dialog box for cases 2 and 3 (unlimited objective function z and infeasible solution). Special cases 1 and 4 will be discussed in Sections 17.6.4.1 and 17.6.4.2, respectively, by using the Solver Sensitivity Report.
17.5.3.1 Unlimited Objective Function z
For the maximization problem (max z = 4x1 + 3x2) presented in Section 17.2.3.2 (Example 17.4), the graphical solution was obtained from Fig. 17.67. Solving the same example through Solver in Excel, an error message appears in the Solver Results dialog box: "The Objective Cell values do not converge." Therefore, whenever we come across a problem with an unlimited objective function, the message in Fig. 17.68 will appear.
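Outside of Excel, the same diagnosis can be reproduced with an open-source LP solver. The sketch below (an illustration assuming SciPy is installed; it is not part of the book's Solver workflow) feeds Example 17.4 to scipy.optimize.linprog, which reports a dedicated status code instead of Solver's "do not converge" message:

```python
# Example 17.4 rewritten for scipy.optimize.linprog (a sketch, not the
# book's Excel workflow). linprog minimizes, so max z = 4x1 + 3x2
# becomes min -4x1 - 3x2; the >= constraint is multiplied by -1.
from scipy.optimize import linprog

c = [-4, -3]                       # min -4x1 - 3x2
A_ub = [[-2, -5]]                  # 2x1 + 5x2 >= 20  ->  -2x1 - 5x2 <= -20
b_ub = [-20]
bounds = [(0, 8),                  # 0 <= x1 <= 8
          (0, None)]               # x2 >= 0, no upper bound

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
# x2 can grow without limit while staying feasible, so the solver flags
# the problem as unbounded (status code 3 in scipy's convention).
print(res.status, res.success)
```

Since x2 has a positive profit coefficient and no constraint bounds it from above, any LP solver must report unboundedness here rather than an optimum.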
Solution of Linear Programming Problems Chapter 17
[Fig. 17.59 (spreadsheet figure, December to June): unit production and inventory costs (cit, iit), demands Dit, production and inventory capacity limits, the inventory balance equations, and the initial solution (xit = 2,000 for every product and period), with objective function value z = R$ 17,197,650.00 for that solution.]
FIG. 17.59 Representation of the production and inventory problem of Fenix&Furniture in an Excel spreadsheet.
PART VII Optimization Models and Simulation
BOX 17.11 Formulas Used in Fig. 17.59

Cell  Formula          Cell  Formula          Cell  Formula
C35   =B35+C42-C18     C37   =B37+C44-C20     C39   =B39+C46-C22
D35   =C35+D42-D18     D37   =C37+D44-D20     D39   =C39+D46-D22
E35   =D35+E42-E18     E37   =D37+E44-E20     E39   =D39+E46-E22
F35   =E35+F42-F18     F37   =E37+F44-F20     F39   =E39+F46-F22
G35   =F35+G42-G18     G37   =F37+G44-G20     G39   =F39+G46-G22
H35   =G35+H42-H18     H37   =G37+H44-H20     H39   =G39+H46-H22
C36   =B36+C43-C19     C38   =B38+C45-C21
D36   =C36+D43-D19     D38   =C38+D45-D21
E36   =D36+E43-E19     E38   =D38+E45-E21
F36   =E36+F43-F19     F38   =E38+F45-F21
G36   =F36+G43-G19     G38   =F38+G45-G21
H36   =G36+H43-H19     H38   =G38+H45-H21
FIG. 17.60 Name Manager for the production and inventory problem at Fenix&Furniture.
17.5.3.2 There Is No Optimal Solution
For the maximization problem (max z = x1 + x2) presented in Section 17.2.3.3 (Example 17.5), the graphical solution was obtained from Fig. 17.69. By solving Example 17.5 through Solver in Excel, a new error message appears in the Solver Results dialog box, since we have a case in which it is not possible to find a feasible solution (see Fig. 17.70).
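For comparison, the infeasible case can also be reproduced outside Excel. The sketch below (assuming SciPy is available; illustrative, not the book's Solver workflow) passes Example 17.5 to linprog, which returns its infeasibility status code rather than Solver's dialog message:

```python
# Example 17.5 for scipy.optimize.linprog (illustrative sketch).
# max z = x1 + x2 -> min -x1 - x2; the >= constraint is negated.
from scipy.optimize import linprog

c = [-1, -1]
A_ub = [[-5, -4],                  # 5x1 + 4x2 >= 40 -> -5x1 - 4x2 <= -40
        [ 2,  1]]                  # 2x1 +  x2 <= 6
b_ub = [-40, 6]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * 2,
              method="highs")
# The two half-planes share no point in the first quadrant, so the
# solver reports infeasibility (status code 2 in scipy's convention).
print(res.status, res.success)
```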
17.5.4
Result Analysis by Using the Solver Answer and Limits Reports
Section 17.5.2 presented the Solver results for each one of the examples presented in Section 16.6 of the previous chapter (modeling of real linear programming problems), directly in Excel spreadsheets. A detailed analysis of the results can also be done through the Solver Answer, Limits, and Sensitivity Reports. As mentioned before, the Sensitivity Report will be discussed in Section 17.6. The Answer and Limits Reports, in turn, will be discussed in this section for the Venix Toys problem (Example 16.3 of the previous chapter). The modeling of and solution for the problem were presented in Section 17.5.2.1 of this chapter in the same Excel spreadsheet.
FIG. 17.61 Solver parameters regarding the problem at Fenix&Furniture.
FIG. 17.62 Optimal solution for Fenix&Furniture.
[Fig. 17.63 (spreadsheet figure, June to December): unit costs (it, rt, ot, st, ht, ft), monthly demand Dt, capacity limits, the inventory and workforce balance equations, and the initial solution described above, with objective function value z = R$ 50,264.00 for that solution.]
FIG. 17.63 Representation of the aggregate planning problem at Lifestyle in an Excel spreadsheet.
BOX 17.12 Formulas Used in Fig. 17.63

Cell  Formula
C22   =B22+C27+C28+C29-C14
D22   =C22+D27+D28+D29-D14
E22   =D22+E27+E28+E29-E14
F22   =E22+F27+F28+F29-F14
G22   =F22+G27+G28+G29-G14
H22   =G22+H27+H28+H29-H14
C23   =B23+C30-C31
D23   =C23+D30-D31
E23   =D23+E30-E31
F23   =E23+F30-F31
G23   =F23+G30-G31
H23   =G23+H30-H31
I26   =SUMPRODUCT(C6:H11,C26:H31)
17.5.4.1 Answer Report The Answer Report provides the results of the optimal solution found by Solver in a new Excel spreadsheet. Fig. 17.71 shows the Answer Report of the problem faced by Venix Toys.
FIG. 17.64 Name Manager for the aggregate planning problem at Lifestyle.
FIG. 17.65 Solver parameters regarding the problem at Lifestyle.
FIG. 17.66 Optimal solution for Lifestyle company.
[Fig. 17.67 (graph): the constraints x1 <= 8, 2x1 + 5x2 >= 20, x1 >= 0, and x2 >= 0 leave an unlimited solution space; the line z = 24 = 4x1 + 3x2 can be moved indefinitely in the direction of maximization of z.]
FIG. 17.67 Graphical solution for Example 17.4 with an unlimited objective function.
According to Fig. 17.71, we can see that the results of the Answer Report are divided into three main parts: objective cell, variable cells, and constraints. As shown in Fig. 17.29 of Section 17.5.2.1, the maximization function z of Venix Toys' problem is represented by objective cell D14 (Total_profit). Row 8 in Fig. 17.71 shows the original value and the final value (maximum profit) of the objective cell. The model's decision variables are represented by variable cells B14 and C14 in Fig. 17.29 of Section 17.5.2.1. Rows 13 and 14 in Fig. 17.71 show the original and final values of each variable cell. From column E, we can see that the optimal number of toy cars to be produced is x1 = 70 and the optimal number of tricycles to be produced is x2 = 20.
FIG. 17.68 Error message for a problem with an unlimited objective function.
[Fig. 17.69 (graph): the constraints 5x1 + 4x2 >= 40, 2x1 + x2 <= 6, x1 >= 0, and x2 >= 0 have no common region, so the line z = 7 = x1 + x2 never reaches a feasible point.]
FIG. 17.69 Graphical solution for Example 17.5 with an infeasible solution.
FIG. 17.70 Error message for a problem with an infeasible solution.
FIG. 17.71 Answer Report for the problem at Venix Toys.
The machining, painting, and assembly human resources availability constraints are represented by rows 19, 20, and 21, respectively, while the non-negativity constraints of each decision variable are represented by rows 22 and 23. Cells D8, D9, and D10 represent the total number of hours used, or the amount of resources necessary, for the machining, painting, and assembly departments, respectively. The value of each cell (column D) can be obtained by substituting the optimal values of each decision variable (x1 = 70 and x2 = 20) on the left-hand side of each constraint. Column E shows the formula used to represent each constraint. Column F, in turn, presents the status of each constraint: binding or not binding. The Binding status occurs when the total amount of resources used (column D) is equal to the maximum limit available, that is, when there is no slack or idleness of resources. As shown in Fig. 17.29 in Section 17.5.2.1, the quantity of resources available for the machining, painting, and assembly activities is represented by cells F8, F9, and F10, respectively. The Not Binding status indicates that the maximum capacity of resources has not been used. The Slack field shows the difference between the total amount of resources available and the total amount of resources used, that is, the amount of idle resources. For example, for the machining sector, from the total amount of resources available (36 hours), only 27.5 hours have been used, which generates an idleness of 8.5 hours (slack). For the painting and assembly activities, since the maximum capacity of the resources was used, slack is zero.
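The Status/Slack logic of the Answer Report is easy to restate in code. The sketch below is a hypothetical helper (only the machining figures, 27.5 hours used out of 36 available, come from the text; the function itself is generic):

```python
# Minimal sketch of the Answer Report's Status and Slack columns for a
# <= resource constraint. The 27.5/36 machining figures come from the
# Venix Toys example; everything else is illustrative.
def constraint_status(used, available, tol=1e-9):
    """Return (slack, status): 'Binding' when the resource is fully
    used (slack of zero, up to tol), 'Not Binding' otherwise."""
    slack = available - used
    status = "Binding" if abs(slack) <= tol else "Not Binding"
    return slack, status

print(constraint_status(27.5, 36))   # machining: slack 8.5, Not Binding
print(constraint_status(36.0, 36))   # fully used resource: Binding
```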
17.5.4.2 Limits Report The main results provided by the Limits Report refer to the lower and upper limits of each variable cell (decision variable). Fig. 17.72 shows the Limits Report for the problem faced by Venix Toys. According to Fig. 17.72, we can see that the results of the Limits Report are divided into two main parts: objective cell and variable cells. Analogous to the Answer Report, the Limits Report also provides the optimal value of the objective cell.
FIG. 17.72 Limits Report for the problem at Venix Toys.
Alternatively, the data regarding each one of the variable cells are represented in rows 13 and 14 in Fig. 17.72. Analogous to the Answer Report, the optimal value of each variable cell is also provided by the Limits Report (column D). Column F presents the lower limits of the variable cells, which refer to the minimum value each variable can take on. By assigning the lower limit to one of the variables (x1 = 0) and keeping the other constant (x2 = 20), we obtain a feasible solution with z = 1,200 (cell G13). In contrast, if x1 = 70 and x2 = 0, we obtain another feasible solution with z = 840 (cell G14). Finally, column I shows the upper limits of the variable cells, that is, the maximum values each variable can achieve. In this case, the value of objective function z is 2,040.
17.6 SENSITIVITY ANALYSIS
As presented in Section 16.5 of the previous chapter, one of the hypotheses of a linear programming model is to assume that all the model parameters (objective function coefficients cj, constraint variable coefficients aij, and independent terms bi) are deterministic, that is, constant and known with certainty. However, many times, the estimation of these parameters is based on future forecasts, such that changes may happen until the final solution is implemented in the real world. As examples of changes, we can mention changes in the amount of resources available, the launch of a new product, variation in a product's price, and increases or decreases in production costs, among others. Therefore, sensitivity analysis is essential in the study of linear programming problems, since its main objective is to investigate the effects that certain changes in the model parameters would have on the optimal solution. The sensitivity analysis discusses the variation that the objective function coefficients and the constants on the right-hand side of each constraint can assume (lower and upper limits) without changing the initial model's optimal solution or without changing the feasibility region. This analysis can be done graphically, by using algebraic calculations, or directly through Solver in Excel or other software packages, such as Lindo, considering one alteration at a time. Therefore, the sensitivity analysis being studied considers two cases:
(a) The model's sensitivity analysis based on alterations in one of the objective function coefficients, without changing the model's original basic solution (the basic solution remains optimal). Since one of the objective function coefficients is altered, the value of objective function z also changes.
(b) The model's sensitivity analysis based on alterations in the independent terms of the constraints, without changing the optimal solution or the feasibility region.
Thus, we eliminate the need to recalculate a model's new optimal solution after changes in its parameters. Section 17.6.1 graphically analyzes the possible changes in the model's objective function coefficients. The same analysis, based on the independent terms of the constraints, will be studied in Section 17.6.2. Both cases can also be analyzed by Solver in Excel (see Section 17.6.4), always considering one alteration at a time. The sensitivity analysis studied in this section will be described based on Example 17.12.
Example 17.12
Romes Shoes is a shoe company interested in planning its production of flip-flops and clogs for next summer. Its products go through the following processes: cutting, assembly, and finishing. Table 17.E.16 shows the total number of labor hours (man-hours) necessary to produce a unit of each component in each manufacturing process, besides the total time available per week, also in man-hours. The unit profit per pair of flip-flops and clogs manufactured is $15.00 and $20.00, respectively. Determine the graphical solution for the model.
TABLE 17.E.16 Time Necessary to Produce a Unit of Each Component in Each Manufacturing Process and Total Time Available per Week (man-hours)

             Time (man-hours) to process 1 unit
Sector       Flip-flops   Clogs   Time available (man-hours/week)
Cutting      5            4       240
Assembly     4            8       360
Finishing    0            7.5     300
Solution The model’s decision variables are: x1 ¼ number of flip-flops to be produced weekly. x2 ¼ number of clogs to be produced weekly. The model’s mathematical formulation can be represented as: max z ¼ 15x1 + 20x2 subject to : 5x1 + 4x2 240 ð1Þ 4x1 + 8x2 360 ð2Þ 7,5x2 300 ð3Þ xj 0, j ¼ 1, 2
(17.27)
The current model’s optimal solution is x1 ¼ 20 (flip-flops per week) and x2 ¼ 35 (clogs per week) with z ¼ 1,000 (Weekly net profit of $1,000.00). Graphically, it is represented in Fig. 17.73.
17.6.1
Alteration in one of the Objective Function Coefficients (Graphical Solution)
Fig. 17.73 presented the graphical solution for Example 17.12, in which extreme point C represents the model's optimal solution (x1 = 20, x2 = 35 with z = 1,000). Now, let's carry out a sensitivity analysis based on changes in the values of the objective function coefficients, performing one alteration at a time. The main objective is to determine the value range that each objective function coefficient can take on, keeping the other coefficients constant, without impacting the model's basic solution (the basic solution remains optimal). This analysis is based on the comparison between the angular coefficients of the active constraints (treated as equality constraints) and the angular coefficient of the objective function. The active constraints are the ones that define the model's optimal solution. If the model's inactive constraints are eliminated, the optimal solution will not be affected.
Slope and angular coefficient of a line
Let α be the angle formed by the line and the X-axis, counterclockwise, called the line slope. The angular coefficient m determines the direction of the line, that is, the trigonometric tangent of slope α:

m = tan(α)    (17.28)

In a graphical way, Fig. 17.74 specifies four cases for m from different values of α. From Fig. 17.74, we can see that every nonvertical line has a real number m that specifies its direction. Case d is a special case, since there is no tangent for slope α = 90° and, consequently, no angular coefficient m for a vertical line. This relationship can be better visualized in Fig. 17.75, which shows the tangent chart for different values of α. For instance, for case (a) in the previous figure (0° < α < 90°), we can see that tan 0° = 0 and tan 45° = 1. As α gets closer to 90°, the value of m tends to infinity. FIG. 17.73 Graphical solution for Example 17.12.
FIG. 17.74 Relationship between line slope (a) and angular coefficient (m).
FIG. 17.75 Tangent for different values of a. (Source: https://www.quora.com/Is-tan-x-a-continous-function.)
The angular coefficient can be calculated from two points on the line. Given two distinct points in the Cartesian plane, A(x1, y1) and B(x2, y2), there is a single line r that goes through these two points. The angular coefficient of r can be calculated as:

m = tan(α) = opposite leg / adjacent leg = Δy/Δx = (y2 - y1)/(x2 - x1)    (17.29)
Angular coefficient of the objective function from its reduced equation
Consider the general equation of an objective function with two decision variables (x1 and x2):

z = c1x1 + c2x2    (17.30)
To determine the reduced equation of Expression (17.30), we have to isolate variable x2 in the general equation:

x2 = -(c1/c2)x1 + z/c2    (17.31)

where m = -c1/c2 is the angular coefficient of the objective function.
Value range for c1 or c2 that does not change the model's original basic solution
By calculating the angular coefficient of each active constraint in the model (treated as an equality constraint) and the angular coefficient of the objective function, be it from the reduced equation of the line or from two points on the line [Expression (17.29)], we can determine the value range for c1 or c2 that does not change the model's original basic solution. Let's illustrate this condition through an example. Going back to Example 17.12, from the graphical solution presented in Fig. 17.73, we can see that only the first two constraints of Expression (17.27) are active. Hence, to carry out the sensitivity analysis from variations in one of the objective function coefficients, we must calculate the angular coefficients of the first and second constraints, treated as equality equations. First, we can see that the slopes of the first equation 5x1 + 4x2 = 240 and of the second equation 4x1 + 8x2 = 360 are within the interval 90° < α < 180°. Therefore, the angular coefficients of both lines will be negative. From Fig. 17.75, we conclude that the value of m1 (angular coefficient of the first equation) will be smaller than m2 (angular coefficient of the second equation). The value of the angular coefficient of each equation will be determined from the reduced equation of the line. The first equation 5x1 + 4x2 = 240 can be written in the reduced form:

x2 = -(5/4)x1 + 60    (17.32)

So, the angular coefficient of the first equation is m1 = -5/4. Analogously, the reduced form of the second equation can be expressed as:

x2 = -(1/2)x1 + 45    (17.33)

We can conclude that the angular coefficient of the second equation is m2 = -1/2. According to Fig. 17.73, we can see that, while the angular coefficient of the objective function (-c1/c2) is between -5/4 and -1/2, that is, between the angular coefficients of the model's first and second equations (active equations), the original problem's basic solution does not change, remaining optimal. Mathematically, the value of the original basic solution remains constant while:

-5/4 <= -c1/c2 <= -1/2, or 0.5 <= c1/c2 <= 1.25    (17.34)
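The slope condition in Expression (17.34) reduces to a one-line test. The sketch below encodes it for the two active constraints of Example 17.12 (a hypothetical helper, not from the book):

```python
# Slope test behind Expression (17.34): the basic solution at the
# intersection of the cutting and assembly lines stays optimal while
# the objective slope -c1/c2 lies between the two constraint slopes.
def keeps_basis(c1, c2, m_low=-5/4, m_high=-1/2):
    """True when -c1/c2 lies between the slopes of the active lines."""
    return m_low <= -c1 / c2 <= m_high

print(keeps_basis(15, 20))   # original coefficients, slope -0.75: True
print(keeps_basis(20, 25))   # Example 17.13(a), slope -0.80: True
print(keeps_basis(30, 20))   # ratio 1.5 > 1.25, basis changes: False
```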
Example 17.13
From the problem of company Romes Shoes (Example 17.12), carry out a sensitivity analysis considering the following changes:
(a) Let's assume that there was an increase in the unit profits of flip-flops and clogs to $20.00 and $25.00, respectively, based on reductions in production costs, mainly in terms of human resources. Verify whether the basic solution remains optimal. If yes, what is the new value of z?
(b) Which possible variations in c2 would maintain the original model's basic solution? Note: The other parameters remain constant.
(c) Do the same for c1.
(d) Imagine that there was an increase in the price of leather, the main raw material for producing clogs, diminishing its unit profit to $18.00. In order for the original model's basic solution to remain unaltered, which interval must the unit profit of flip-flops satisfy?
Solution
(a) Considering the new objective function equation (z = 20x1 + 25x2), we can determine the ratio c1/c2 = 20/25 = 0.8 directly. The condition 0.5 <= c1/c2 <= 1.25 continues being satisfied, such that the original model's basic solution (x1 = 20 and x2 = 35) remains optimal. Therefore, the new value of z is z = 20 × 20 + 25 × 35 = 1,275.
(b) Substituting c1⁰ = 15 (original value of c1) in the condition 0.5 <= c1/c2 <= 1.25, we have:
0.5 <= 15/c2 ⇒ c2 <= 30
15/c2 <= 1.25 ⇒ c2 >= 12
so that 12 <= c2 <= 30, or c2⁰ - 8 <= c2 <= c2⁰ + 10
Thus, while c2 continues satisfying the interval specified here, the original model's optimal basic solution (x1 = 20 and x2 = 35) will remain unaltered.
(c) Substituting c2⁰ = 20 (original value of c2) in the condition 0.5 <= c1/c2 <= 1.25, we have:
0.5 × 20 <= c1 <= 1.25 × 20 ⇒ 10 <= c1 <= 25, or c1⁰ - 5 <= c1 <= c1⁰ + 10
Therefore, while the conditions specified for c1 continue being satisfied, the original model's basic solution will remain unaltered.
(d) Substituting c2 = 18 in the condition 0.5 <= c1/c2 <= 1.25, we have:
0.5 × 18 <= c1 <= 1.25 × 18 ⇒ 9 <= c1 <= 22.5
Therefore, while c1 continues satisfying this interval, for a value of c2 = 18, the basic solution remains optimal.
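These intervals can also be verified numerically by re-solving the model for a few values of c1 (keeping c2 = 20 fixed). The sketch below (assuming SciPy is available; illustrative only) keeps the optimum at (20, 35) inside the interval 10 <= c1 <= 25 and moves it to another vertex outside it:

```python
# Numerical spot-check of Example 17.13(c): vary c1 with c2 = 20 fixed
# and watch where the optimum of the Romes Shoes model lands (sketch).
from scipy.optimize import linprog

A_ub = [[5, 4], [4, 8], [0, 7.5]]     # cutting, assembly, finishing
b_ub = [240, 360, 300]

def optimum(c1, c2=20):
    res = linprog([-c1, -c2], A_ub=A_ub, b_ub=b_ub,
                  bounds=[(0, None)] * 2, method="highs")
    return tuple(round(v, 6) for v in res.x)

print(optimum(12))   # inside [10, 25]: stays at (20, 35)
print(optimum(24))   # inside [10, 25]: stays at (20, 35)
print(optimum(26))   # above 25: optimum moves to vertex (48, 0)
print(optimum(9))    # below 10: optimum moves to vertex (10, 40)
```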
Example 17.14
Consider the following maximization problem:

max z = 15x1 + 20x2
subject to:
4x1 + 8x2 <= 360 (1)
x1 <= 60 (2)
xj >= 0, j = 1, 2    (17.35)

Determine:
(a) The graphical solution for the original model represented in Expression (17.35).
(b) The possible variations in c1 that would maintain the original model's basic solution (the basic solution remains optimal), keeping the other parameters constant. Do the same for c2.
Solution
Fig. 17.76 shows the graphical solution of the model in Expression (17.35). The current model's optimal solution, represented by extreme point C, is x1 = 60 and x2 = 15 with z = 1,200. From Fig. 17.76, we can see that we have a special case, since the second constraint in Expression (17.35) corresponds to a vertical line (slope of 90° in relation to axis x1). When one of the active constraints is vertical, there will be no upper or lower limit for the angular coefficient of the objective function, since there is no tangent for α = 90°. Now, let's carry out a sensitivity analysis from variations in c1 or c2. In order to do that, we need to calculate the angular coefficient of the first constraint (treated in its equality form), be it from its reduced equation or from two points on the line. Let's use the first case. The reduced form of the first equation in Expression (17.35) is:

x2 = -(1/2)x1 + 45

So, its angular coefficient is -1/2.
FIG. 17.76 Graphical solution for Example 17.14.
According to Fig. 17.76, we can see that, while the angular coefficient of the objective function is between the angular coefficients of the vertical equation and of equation 4x1 + 8x2 = 360, that is, between -∞ (there is no lower limit, since the tangent of 90° does not exist) and -1/2, the basic solution remains optimal. Mathematically, the original basic solution remains constant while:

-∞ <= -c1/c2 <= -1/2, or 0.5 <= c1/c2 < ∞

By setting c2 = 20 (original value) in the condition 0.5 <= c1/c2 < ∞, we have:
0.5 × 20 <= c1 < ∞ ⇒ 10 <= c1 < ∞
Thus, while c1 is within this interval, the original model's basic solution (x1 = 60 and x2 = 15) will remain unaltered.
By setting c1 = 15 (original value) in the condition 0.5 <= c1/c2 < ∞, we have:
0.5 <= 15/c2 ⇒ c2 <= 30
15/c2 < ∞ ⇒ c2 > 0
so that 0 < c2 <= 30. Therefore, while c2 continues satisfying this condition, the basic solution will remain optimal.
17.6.2 Alteration in One of the Constants on the Right-Hand Side of the Constraint and Concept of Shadow Price (Graphical Solution)
The sensitivity analysis from alterations in the value of one of the constants on the right-hand side of a constraint (availability of resources) is based on the concept of shadow price, which can be defined as the increase (or decrease) in the objective function in case 1 unit is added to (or removed from) the current amount of resources available of the i-th constraint (bi⁰). The shadow price (Pi) in case 1 unit of resources is added to bi⁰ is:

Pi = Δz₊₁/Δbi,₊₁ = (z₊₁ - z0)/(+1)    (17.36)

where:
Δz₊₁ = increase in the value of objective function z in case 1 unit of resources is added to bi⁰.
z0 = initial value of the objective function.
z₊₁ = new value of the objective function after 1 unit is added to bi⁰.
Δbi,₊₁ = increase in bi⁰. The definition of shadow price considers an increase of 1 unit in the amount of resources i.
The shadow price (Pi) in case 1 unit of resources is removed from bi⁰ is:

Pi = Δz₋₁/Δbi,₋₁ = (z₋₁ - z0)/(-1) = (z0 - z₋₁)/1    (17.37)

where:
Δz₋₁ = decrease in the value of objective function z in case 1 unit of resources is removed from bi⁰.
z0 = initial value of the objective function.
z₋₁ = new value of the objective function after 1 unit is removed from bi⁰.
Δbi,₋₁ = decrease in bi⁰. The definition of shadow price considers a decrease of 1 unit in the amount of resources i.
Shadow price can be interpreted as the fair price to pay for using 1 unit of resource i, or the opportunity cost of resources due to the loss of 1 unit of resource i. After defining the shadow price for resource i (Pi), the main goal of this sensitivity analysis is to determine the value range in which bi can vary (maximum permissible increase of p units in bi⁰ or decrease of q units in bi⁰), that is, bi⁰ - q <= bi <= bi⁰ + p, in which the shadow price remains constant. The interval must be determined in order to satisfy the following relationship:

Pi = Δz₊p/Δbi,₊p = Δz₋q/Δbi,₋q = (z₊p - z0)/p = (z0 - z₋q)/q    (17.38)

where:
Δz₊p = increase in the value of objective function z in case p units of resources are added to bi⁰.
Δz₋q = decrease in the value of objective function z in case q units of resources are removed from bi⁰.
z0 = initial value of the objective function.
z₊p = new value of the objective function after p units are added to bi⁰.
z₋q = new value of the objective function after q units are removed from bi⁰.
Δbi,₊p = increase of p units in bi⁰.
Δbi,₋q = decrease of q units in bi⁰.
Thus, for the interval specified, in which the shadow price remains constant, if p units were added to bi⁰, the value of the objective function would increase by Δz₊p = Pi × p. This quantity can also be interpreted as the fair price to pay for using p units of resource i, being proportional to the shadow price. Analogously, if q units were removed from bi⁰, the value of the objective function would decrease by Δz₋q = Pi × q. This quantity can also be interpreted as the opportunity cost due to the loss of q units of resource i. The calculation of the shadow price is only valid for active constraints (those that define the model's optimal solution). Otherwise, variations in bi within the feasibility region will not impact the model's optimal solution, so the shadow price for nonactive constraints is zero. Solving the problem in an analytical or algebraic way, this means that the current model's optimal solution (with a value of bi different from the original, yet within the interval specified above in which the shadow price remains constant) will have the same basic variables as the original model's optimal solution (the original basic solution remains optimal); however, the values of the decision variables and of the objective function are altered due to the changes in bi. Solving the problem in a graphical way, as the amount of resources bi varies, the i-th constraint moves parallel to the i-th original constraint.
Nevertheless, the current model's optimal solution continues being determined by the intersection of the same active lines as in the original model (the intersection between the altered i-th constraint and another active constraint from the initial model). While the intersection between these lines happens inside the feasibility region, that is, between the extreme points that limit the feasible solution space analyzed, the increase in the value of the objective function due to the use of p units of resource i, or the decrease due to the loss of q units of the same resource, will be proportional to the shadow price (the shadow price will remain constant for the interval bi⁰ - q <= bi <= bi⁰ + p, in which bi⁰ represents its original value). For any value of bi outside this range, it will be necessary to recalculate the model's new optimal solution, since the feasible region is altered.
Example 17.15
Similar to Example 17.13, the sensitivity analysis based on changes in the resources will also be applied to the case of Romes Shoes (Example 17.12). From the model's graphical solution, determine:
(a) The shadow price for each sector (cutting, assembly, and finishing).
(b) The maximum permissible decrease and increase for each bi that would maintain its shadow price constant (when it is positive), or that would not change the initial model's optimal solution (when the shadow price is null).
Solution
As presented in Expression (17.27), Example 17.12 of company Romes Shoes can be mathematically represented as:

max z = 15x1 + 20x2
subject to:
5x1 + 4x2 <= 240 (1) cutting
4x1 + 8x2 <= 360 (2) assembly
7.5x2 <= 300 (3) finishing
xj >= 0, j = 1, 2    (17.39)
The original model’s optimal solution is x1 ¼ 20 and x2 ¼ 35 with z ¼ 1,000. Changes in the availability of resources in the cutting sector (a) Shadow price If the time available for the cutting sector increases in one man-hour, the first constraint in Expression (17.39) becomes 5x1 + 4x2 241. The new optimal solution is then determined by the intersection between active lines 5x1 + 4x2 ¼ 241 and 4x1 + 8x2 ¼ 360, being represented by point H (x1 ¼ 20.333 and x2 ¼ 34.833 with z ¼ 1,001.667), as shown in Fig. 17.77. The shadow price for the cutting sector (P1), considering an increase of 1 man-hour in the availability of resources, can be calculated as: P1 ¼
1, 001:667 1, 000 ¼ 1:667 241 240
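The same shadow price can be obtained numerically by re-solving the model with b1 = 240 and b1 = 241 and differencing the optimal z values. A sketch assuming SciPy is available, not the book's Solver report:

```python
# Cutting-sector shadow price by finite difference on b1 (sketch):
# P1 = z(b1 = 241) - z(b1 = 240) for the Romes Shoes model (17.39).
from scipy.optimize import linprog

def optimal_z(b1):
    res = linprog([-15, -20],
                  A_ub=[[5, 4], [4, 8], [0, 7.5]],
                  b_ub=[b1, 360, 300],
                  bounds=[(0, None)] * 2, method="highs")
    return -res.fun

p1 = optimal_z(241) - optimal_z(240)
print(round(p1, 3))   # 1.667, matching the graphical calculation
```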
FIG. 17.77 Sensitivity analysis after adding 1 man-hour to the availability of resources in the cutting sector.
Thus, for each man-hour added to the cutting sector, the objective function increases 1.667. Or the fair price paid for each manhour used in the cutting sector is 1.667. If there were a reduction of 1 man-hour in the cutting sector, we would obtain the same result for the shadow price. By changing the value of the constant of the first constraint to b1 ¼ 239, the model’s new optimal solution becomes: x1 ¼ 19.667 and x2 ¼ 35.167 with z ¼ 998.333. The calculation of the shadow price, considering a decrease of 1 man-hour for the cutting sector, is: 1,000 998:33 ¼ 1:667 240 239 Hence, for each man-hour removed from the cutting sector, the objective function decreases 1.667. Or the opportunity cost for each man-hour lost in the cutting sector is 1.667. (b) Maximum permissible decrease and increase for b1 The main objective is to determine the value range in which b1 can vary (b01 q b1 b01 + p), and in which the shadow price remains constant. While this happens, the price to be paid for the use of p man-hours in the cutting sector will be P1 p ¼ 1.667 p. Analogously, the opportunity cost due to the loss of q man-hours in the same sector will be P1 q ¼ 1.667 q. From Fig. 17.77, we can see that the original model’s optimal solution is determined by the intersection of lines 5x1 + 4x2 ¼ 240 and 4x1 + 8x2 ¼ 360. Also note that the new constraint 5x1 + 4x2 241 is parallel to the original constraint 5x1 + 4x2 240. As the value of b1 increases, the line moves in the direction of extreme point G, always parallel to the original constraint. Similarly, as the value of b1 decreases, the line moves in the direction of extreme point D. While the intersection of equations 5x1 + 4x2 ¼ b1 and 4x1 + 8x2 ¼ 360 occurs within the feasibility region (segment DG), the shadow price will remain constant. Extreme points D and G represent the lower and upper limits for b1. Any point out of this segment will result in a new basic solution. 
Therefore, the lower and upper limits for b1 can be determined by substituting the coordinates of point D (x1 = 10 and x2 = 40) and of point G (x1 = 90 and x2 = 0), respectively, in 5x1 + 4x2:

Lower limit for b1 (point D): 5 × 10 + 4 × 40 = 210
Upper limit for b1 (point G): 5 × 90 + 4 × 0 = 450

We can conclude that, while the value of b1 is within the interval 210 ≤ b1 ≤ 450, its shadow price remains constant. The interval can also be specified based on the maximum permissible decrease and increase from b1^0 = 240 (its original value), being expressed as:
b1^0 - 30 ≤ b1 ≤ b1^0 + 210

For example, for a value of p = 210, the price to be paid for the use of these 210 man-hours in the cutting sector will be P1 × 210 = 1.667 × 210 = 350 (if 210 man-hours were added to the total time available for the cutting sector, the objective function would increase by $350.00). Conversely, for a value of q = 30, the opportunity cost due to the loss of these 30 man-hours from the total time available in the cutting sector will be P1 × 30 = 1.667 × 30 = 50 (if 30 man-hours were removed from the total time available for the cutting sector, the objective function would decrease by $50.00). For any value of b1 outside this interval, it is necessary to recalculate the new optimal solution, because the feasible region is altered. These results can be better visualized in Fig. 17.78. Polygon AEDB represents the feasibility region for b1 = 210 (lower limit for b1), whereas polygon AEDG represents the feasibility region for b1 = 450 (upper limit for b1).

Changes in the availability of resources in the assembly sector

(a) Shadow price

By adding 1 man-hour to the assembly sector, the second constraint of (1) becomes 4x1 + 8x2 ≤ 361. The new optimal solution is then determined by the intersection between the active lines 5x1 + 4x2 = 240 and 4x1 + 8x2 = 361, being represented by point I (x1 = 19.833 and x2 = 35.208 with z = 1,001.667), as illustrated in Fig. 17.79.
Solution of Linear Programming Problems Chapter 17
FIG. 17.78 Sensitivity analysis based on changes in the availability of resources in the cutting sector.
FIG. 17.79 Sensitivity analysis after adding 1 man-hour to the availability of resources in the assembly sector.
Since the new value of the objective function is also 1,001.667 (similar to the cutting sector), we obtain the same shadow price (P2 = 1.667). Therefore, for each man-hour added to the assembly sector, the objective function also increases by 1.667. By reducing the time available for the assembly sector by 1 man-hour (b2 = 359), the model's new optimal solution becomes x1 = 20.167 and x2 = 34.792 with z = 998.333. Thus, the shadow price in this case is also P2 = 1.667. Therefore, for each man-hour removed from the assembly sector, the objective function also decreases by 1.667.

(b) Maximum permissible decrease and increase for b2

Fig. 17.79 illustrates the new constraint 4x1 + 8x2 ≤ 361 for the assembly sector, parallel to the original constraint 4x1 + 8x2 ≤ 360. As the value of b2 increases, the line moves in the direction of extreme point J, always parallel to the original constraint. Similarly, as the value of b2 decreases, the line moves in the direction of extreme point B. While the intersection of the equations 5x1 + 4x2 = 240 and 4x1 + 8x2 = b2 occurs within the feasibility region (segment BJ), the shadow price remains constant. Extreme points B and J represent the lower and upper limits for b2. Any point outside this segment will result in a new basic solution. Hence, the lower and upper limits for b2 can be determined by substituting the coordinates of point B (x1 = 48 and x2 = 0) and of point J (x1 = 16 and x2 = 40), respectively, in 4x1 + 8x2:

Lower limit for b2 (point B): 4 × 48 + 8 × 0 = 192
Upper limit for b2 (point J): 4 × 16 + 8 × 40 = 384

We can conclude that, while the value of b2 is within the interval 192 ≤ b2 ≤ 384, its shadow price remains constant.
The interval can also be specified based on the maximum permissible decrease and increase from b2^0 = 360 (its initial value), being expressed as:

b2^0 - 168 ≤ b2 ≤ b2^0 + 24

For example, for a value of p = 24, the price to be paid for the use of these 24 man-hours in the assembly sector will be P2 × 24 = 1.667 × 24 = 40 (if 24 man-hours were added to the total time available for the assembly sector, the objective function would increase by $40.00). On the other hand, for a value of q = 168, the opportunity cost due to the loss of these 168 man-hours from the
PART VII Optimization Models and Simulation
FIG. 17.80 Sensitivity analysis based on changes in the availability of resources in the assembly sector.
total time available for the assembly sector will be P2 × 168 = 1.667 × 168 = 280 (if 168 man-hours are removed from the total time available for the assembly sector, the objective function would decrease by $280.00). For any value of b2 outside this interval, it is necessary to recalculate the new optimal solution. These results can be better visualized in Fig. 17.80. Polygon ABJE represents the feasibility region for b2 = 384 (upper limit for b2), whereas triangle ABK represents the feasibility region for b2 = 192 (lower limit for b2).

Changes in the availability of resources in the finishing sector

(a) Shadow price

As presented in Fig. 17.77, the original model's optimal solution is determined by the intersection of the first two equations, 5x1 + 4x2 = 240 and 4x1 + 8x2 = 360. Since the finishing constraint (7.5x2 ≤ 300) is not active, changes in the value of b3 within the feasibility region will not impact the original model's optimal solution. So, its shadow price is zero.

(b) Maximum permissible decrease and increase for b3

Since the constraint 7.5x2 ≤ 300 is not active, the main goal here is to determine the value range in which b3 can vary without changing the initial model's optimal solution (x1 = 20 and x2 = 35 with z = 1,000). To determine the lower limit for b3, we just need to substitute the value of coordinate x2 of optimal point C (x2 = 35) in 7.5x2 = b3 (intersection of lines 4x1 + 8x2 = 360 and 7.5x2 = b3). So, the lower limit for b3 is 7.5 × 35 = 262.5. In contrast, any value of b3 above its initial value will still not impact the initial model's optimal solution, such that the value range for b3 can be written as:

262.5 ≤ b3

The interval can also be specified based on the maximum permissible decrease from b3^0 = 300 (its initial value), the increase being unlimited:

b3^0 - 37.5 ≤ b3
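The book carries out these range calculations graphically and, later in this chapter, with Solver. As a rough numerical cross-check, the sketch below assumes SciPy's linprog with the HiGHS solver (not the tool used in the book) and probes the shadow price of the cutting sector by re-solving the model for perturbed values of b1:

```python
from scipy.optimize import linprog

# Romes Shoes model (Example 17.12): max z = 15x1 + 20x2.
# linprog minimizes, so the objective coefficients are negated.
c = [-15, -20]
A = [[5, 4],    # cutting:  5x1 + 4x2   <= b1
     [4, 8],    # assembly: 4x1 + 8x2   <= 360
     [0, 7.5]]  # finishing:      7.5x2 <= 300

def z_max(b1):
    res = linprog(c, A_ub=A, b_ub=[b1, 360, 300], method="highs")
    return -res.fun

# Gain in z from one extra man-hour of cutting at several values of b1:
for b1 in (210, 240, 449):
    print(b1, round(z_max(b1 + 1) - z_max(b1), 3))  # 1.667 inside 210 <= b1 <= 450
print(450, round(z_max(451) - z_max(450), 3))       # 0.0: beyond the upper limit
```

The marginal gain stays at 1.667 throughout 210 ≤ b1 ≤ 450 and drops to zero beyond the upper limit, matching the interval derived graphically above.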
17.6.3 Reduced Cost
The reduced cost of a nonbasic variable xj can be interpreted as the amount by which its original coefficient cj in the objective function must improve before the current basic solution becomes suboptimal and xj becomes a basic variable. For a maximization problem, the reduced cost of the nonbasic variable xj in the optimal tabular form (c*j) corresponds to the maximum increase in the value of its original coefficient in the objective function (addition of c*j units to the value of cj) that maintains the current basic solution optimal and the variable xj nonbasic. Any increase greater than c*j will make the current basic solution suboptimal, such that xj will enter the basis. Alternatively, for a minimization problem, c*j corresponds to the maximum decrease in the value of its original coefficient in the objective function (subtraction of c*j units from the value of cj) that maintains the current basic solution optimal and the variable xj nonbasic. Any decrease greater than c*j will make the current basic solution suboptimal and the variable xj basic. According to Winston (2004), when the coefficient of the nonbasic variable xj improves by a value exactly equal to its reduced cost, we have a case with multiple optimal solutions. In this case, there is at least one solution in which the variable
xj becomes basic and another in which the variable xj remains nonbasic. By contrast, for any increase (maximization) or decrease (minimization) greater than its reduced cost, the variable xj will always be basic in any optimal solution. A special case may happen when a nonbasic variable simply cannot become basic, that is, its reduced cost remains null, since it is not active (it does not influence the model's optimal solution).

Example 17.16
Consider the following maximization problem:

max z = 3x1 + 6x2
subject to:
2x1 + 3x2 ≤ 60   (1)
4x1 + 2x2 ≤ 120  (2)
x1, x2 ≥ 0
(17.40)
Through a sensitivity analysis of the problem under study, we obtain the reduced cost of each variable, based on the model's optimal solution of Expression (17.40), as shown in Table 17.E.17:

TABLE 17.E.17 Optimal Solution and Reduced Cost of Each Variable

Variable | Optimal Solution | Reduced Cost
x1       | 0                | 1
x2       | 20               | 0
Interpret the results presented in Table 17.E.17.

Solution
The model's basic solution represented by Expression (17.40) is x1 = 0 and x2 = 20. First, we can see that the reduced cost of the variable x2 is null, since it is a basic variable. Conversely, the reduced cost of the variable x1 represents the maximum increase (maximization problem) in the value of c1 that maintains the current basic solution optimal and the variable x1 nonbasic. Thus, if the coefficient of x1 in the objective function goes from 3 to 4, the current basic solution will remain optimal, such that the variable x1 will continue being nonbasic. On the other hand, if the coefficient of x1 is greater than 4, the current solution will become suboptimal and the variable x1 will be basic in the new optimal solution. It is important to mention that if the problem represented by Expression (17.40) is solved by Solver in Excel, the reduced cost of x1 will appear with a negative sign in the sensitivity report.
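The book obtains these values from Solver's sensitivity report. As a sketch (assuming SciPy's HiGHS interface, where reduced costs of variables sitting at their lower bound are exposed as bound marginals), the same numbers can be reproduced in Python:

```python
from scipy.optimize import linprog

# Example 17.16: max z = 3x1 + 6x2
#   s.t. 2x1 + 3x2 <= 60, 4x1 + 2x2 <= 120, x1, x2 >= 0.
# Solved as a minimization of -z, as linprog requires.
res = linprog([-3, -6], A_ub=[[2, 3], [4, 2]], b_ub=[60, 120], method="highs")

print(res.x)                # optimal solution: x1 = 0, x2 = 20
# Reduced costs at the lower bounds, in the sign convention of the
# minimization form (which here matches the book's value of 1 for x1):
print(res.lower.marginals)  # reduced costs: 1 for x1, 0 for x2
```

Note that the sign reported by a given solver depends on whether it works with the original maximization or the negated minimization, which is exactly the sign caveat for Solver mentioned above.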
Example 17.17
Consider the following minimization problem:

min z = 4x1 + 8x2
subject to:
6x1 + 3x2 ≥ 140  (1)
8x1 + 5x2 ≥ 120  (2)
x1, x2 ≥ 0

(17.41)
Through a sensitivity analysis of the problem under study, we obtain the optimal solution and the reduced cost of each variable in the model represented by Expression (17.41), as seen in Table 17.E.18:

TABLE 17.E.18 Optimal Solution and Reduced Cost of the Variables

Variable | Optimal Solution | Reduced Cost
x1       | 23.333           | 0
x2       | 0                | 6
Interpret the results presented in Table 17.E.18.

Solution
The model's basic solution represented by Expression (17.41) is x1 = 23.333 and x2 = 0. First, we can see that the reduced cost of the variable x1 is null, since it is a basic variable. In contrast, the reduced cost of the variable x2 specifies the maximum decrease in the value of c2 (minimization problem) that maintains the current basic solution optimal and the variable x2 nonbasic. Hence, if the coefficient of x2 in the objective function goes from 8 to 2, the current basic solution will remain optimal and the variable x2 nonbasic. On the other hand, if the coefficient of x2 is less than 2, the current solution will become suboptimal and the variable x2 will be basic in the new optimal solution. If the problem represented by Expression (17.41) is solved by Solver in Excel, the reduced cost of x2 will appear with a positive sign in the sensitivity report.
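The reduced cost of 6 can also be verified by perturbing c2 directly and watching when x2 enters the basis. The sketch below again assumes SciPy's linprog (not the book's tool), with the ≥ rows negated into ≤ form:

```python
from scipy.optimize import linprog

# Example 17.17: min z = 4x1 + 8x2
#   s.t. 6x1 + 3x2 >= 140, 8x1 + 5x2 >= 120, x1, x2 >= 0.
A = [[-6, -3], [-8, -5]]   # >= rows negated to fit A_ub @ x <= b_ub
b = [-140, -120]

def solve(c2):
    return linprog([4, c2], A_ub=A, b_ub=b, method="highs")

base = solve(8)      # x2 nonbasic: x = (23.333, 0)
edge = solve(2.001)  # decrease of 5.999 < 6: x2 stays at 0
over = solve(1.9)    # decrease of 6.1 > 6: x2 enters the basis
print(base.x, edge.x, over.x)
```

Any decrease of c2 smaller than the reduced cost leaves x2 nonbasic; once c2 falls below 2, the solver switches to a solution with x2 basic, as the text describes.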
17.6.4 Sensitivity Analysis With Solver in Excel
The sensitivity analysis through Solver in Excel will be presented in this section for the problem of company Romes Shoes (Example 17.12). Fig. 17.81 (see file Example17.12_Romes.xls) shows the modeling of the problem in Excel, already with the model's optimal solution. Having solved the model through Solver in Excel, the window Solver Results appears. Select the option Sensitivity Report, as shown in Fig. 17.82. The results of the sensitivity analysis for the problem of Romes Shoes, considering changes in one of the objective function coefficients (Section 17.6.1), changes in one of the constants on the right-hand side and the concept of shadow price (Section 17.6.2), and the concept of reduced cost (Section 17.6.3), obtained from Solver in Excel, are consolidated in Fig. 17.83. Lines 4 and 5 show the results of the sensitivity analysis based on changes in one of the objective function coefficients (Section 17.6.1), starting from the final value of the variable cells (B14 and C14), which represent the model's decision variables. According to column D, these values are x1 = 20 and x2 = 35. Column E shows the reduced cost of each variable. Since both are basic, their values are null. If one of the variables were nonbasic, its reduced cost would appear with a negative sign in the sensitivity report in Excel, that is, with the opposite sign to the one presented in this book, as already mentioned. Analogously, for a minimization problem, the reduced costs of the nonbasic variables are presented with a positive sign in the sensitivity report in Excel. The initial values of the coefficients of each variable in the objective function are presented in column F. On the other hand, columns G and H show the maximum permissible increase and decrease for each coefficient, from its initial value, the other parameters remaining constant, without changing the original model's optimal basic solution.
Lines 10, 11, and 12 show the results of the sensitivity analysis based on changes in the amount of resources of each one of the model constraints (Section 17.6.2). By substituting the optimal values of each variable on the left-hand side of each constraint, we obtain the optimal number of resources necessary for each sector, as shown in column D. These values can also be updated in Fig. 17.81, if the option Keep Solver Solution is chosen in the window Solver Results. The shadow price
FIG. 17.81 Modeling of the problem of Romes Shoes in Excel. The spreadsheet lays out the model as follows:

                x1 (slippers)   x2 (clogs)
Unit cost       15              20

Sector      x1    x2     Hours used         Hours available
Cutting     5     4      240          ≤     240
Assembly    4     8      360          ≤     360
Finishing   0     7.5    262.5        ≤     300

Quantities produced (solution): x1 = 20 slippers, x2 = 35 clogs; z (total profit) = $1,000.00
FIG. 17.82 Option Sensitivity Report in Solver Results.
FIG. 17.83 Sensitivity report of the problem at Romes Shoes.
(price paid for the use or opportunity cost due to the loss of 1 unit of each resource) is presented in column E. The initial amount available of each resource is presented in column F. Alternatively, the maximum permissible increase and decrease for each resource that maintains its shadow price constant or within the feasibility region, from its initial value, are specified in columns G and H, respectively.
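The shadow-price column of the Solver report can be cross-checked outside Excel. As a sketch (assuming SciPy's HiGHS interface, whose ineqlin.marginals hold the constraint duals of the minimization form), the same three values are recovered:

```python
from scipy.optimize import linprog

# Romes Shoes model: max z = 15x1 + 20x2, negated for linprog.
res = linprog([-15, -20],
              A_ub=[[5, 4], [4, 8], [0, 7.5]],  # cutting, assembly, finishing
              b_ub=[240, 360, 300],
              method="highs")

# Negating the duals of the minimization form recovers the shadow prices
# of the original maximization: about 1.667 for cutting and assembly, and
# 0 for the non-binding finishing constraint.
shadow_prices = [-m for m in res.ineqlin.marginals]
print([round(p, 3) for p in shadow_prices])
```

The zero dual on the finishing constraint mirrors the graphical argument of Section 17.6.2: a non-active constraint has a null shadow price.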
17.6.4.1 Special Case: Multiple Optimal Solutions

As presented in Section 17.2.3.1, in a graphical solution, we can identify a special case of linear programming with multiple optimal solutions when the objective function is parallel to an active constraint. Alternatively, through the Simplex method (see Section 17.4.6.1), we can identify a case with multiple optimal solutions when, in the optimal tabular form, the coefficient of one of the nonbasic variables is null in row 0 of the objective function. According to Ragsdale (2009), it is also possible to identify a case with multiple optimal solutions through the Sensitivity Report of Solver in Excel. This case happens when the permissible increase or decrease of the coefficient of one or more variables in the objective function is zero and we do not have a degenerate solution (see the following section). For the maximization problem (max z = 8x1 + 4x2) presented in Section 17.2.3.1 (Example 17.3), the graphical solution was obtained from Fig. 17.84. The representation of this problem in Excel can be seen in Fig. 17.85.
FIG. 17.84 Graphical solution for Example 17.3 with multiple optimal solutions.
FIG. 17.85 Representation of Example 17.3 in Excel.
FIG. 17.86 Sensitivity Report for a case with multiple optimal solutions.
By solving this example with Solver in Excel, it was possible to find a feasible solution, obtaining the same message presented in Fig. 17.82: that Solver found a solution and all the optimized constraints and conditions were met. However, Solver provided only one of the optimal solutions, x1 = 4 and x2 = 0 with z = 32 (vertex B), without displaying any message about the special case of multiple optimal solutions. By solving the problem through Solver in Excel and selecting the option Sensitivity Report in Solver Results, we obtain Fig. 17.86. As shown in rows 9 and 10 of Fig. 17.86, the permissible decrease for the coefficient of x1 in the objective function and the permissible increase for the coefficient of x2 in the objective function are null. Since there is no degeneration (see Section 17.6.4.2), we have a case with multiple optimal solutions.
Ragsdale (2009) recommends two strategies that can be applied to determine a new optimal solution through Solver in Excel: 1) inserting a new constraint into the model that does not change the optimal value of the objective function and maintains the model's feasibility; and 2) when the permissible increase for one of the decision variables is null, maximizing the value of this variable (the objective function must be changed to a maximization problem whose objective cell is the respective variable, and no longer the function z); when the permissible decrease for one of the variables is null, minimizing the value of this variable (the objective function must be changed to a minimization problem whose objective cell is that variable). For example, if we only use the first strategy, inserting the new constraint x1 - x2 ≥ 1 into the model, a new optimal solution is determined: x1 = 3 and x2 = 2 with z = 32. From Fig. 17.86, note that the maximum permissible increase for the coefficient of variable x2 in the objective function is zero. Analogously, the permissible decrease for the coefficient of variable x1 in the objective function is also zero. Therefore, regarding the second strategy proposed by Ragsdale, we would have two alternatives: to maximize the new objective cell $C$11, which represents variable x2, or to minimize the new objective cell $B$11, which represents variable x1. Thus, if we got the same initial solution (x1 = 4 and x2 = 0 with z = 32) by only using the first strategy, in addition to inserting the constraint x1 - x2 ≥ 1, we should use one of the alternatives listed for the second strategy. For instance, if we inserted the constraint x1 - x2 ≥ 1 and changed the objective function to a maximization problem whose objective cell is $C$11, we would also obtain the new optimal solution x1 = 3 and x2 = 2 with z = 32, as shown in Fig. 17.87.
On the other hand, instead of the constraint x1 - x2 ≥ 1, if we inserted the constraint 2x1 + x2 ≥ 8 into the original model and changed the objective function to a minimization problem whose objective cell is $B$11, which represents the variable x1, we would obtain a new optimal solution: x1 = 2 and x2 = 4 with z = 32, as shown in Fig. 17.88.
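Ragsdale's strategies can also be combined programmatically. The sketch below uses a stripped-down stand-in for Example 17.3 (only the binding constraint 2x1 + x2 ≤ 8 is kept, so the alternative vertex differs from the full model's); it freezes z at its optimum and then optimizes a secondary objective to land on a different optimal vertex:

```python
from scipy.optimize import linprog

# Stand-in for Example 17.3: max z = 8x1 + 4x2 subject to 2x1 + x2 <= 8
# (the full example has further constraints, omitted here).
A = [[2, 1]]
b = [8]

first = linprog([-8, -4], A_ub=A, b_ub=b, method="highs")
z_star = -first.fun                       # optimal value: 32

# Freeze z at its optimum (8x1 + 4x2 >= 32, negated into <= form) and
# maximize x2 as a secondary objective to reach another optimal vertex:
alt = linprog([0, -1],
              A_ub=A + [[-8, -4]],
              b_ub=b + [-z_star],
              method="highs")
print(alt.x, 8 * alt.x[0] + 4 * alt.x[1])  # a vertex with z still 32
```

This is exactly strategy 2 with strategy 1's z-preserving constraint built in: the secondary solve cannot degrade the objective, only move along the optimal face.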
17.6.4.2 Special Case: Degenerate Optimal Solution

As discussed in Section 17.2.3.4, in a graphical solution, we can identify a degenerate solution when one of the vertices of the feasible region is obtained by the intersection of more than two distinct lines. Through the Simplex method (see Section 17.4.5.4), we can identify a case with a degenerate solution when, in one of the solutions of the Simplex method, the value of one of the basic variables is null. If there is degeneration in the optimal solution, we have a case known as a degenerate optimal solution. As presented in Section 17.4.5.4, the problem with degeneration is that, in some cases, the Simplex algorithm can go into a loop, generating the same basic solutions, since it cannot leave that solution space. In this case, the optimal solution will never be reached. We can detect a case with degeneration through the Sensitivity Report of Solver in Excel when the permissible increase or decrease regarding the amount of resources of one of the constraints is zero. The same Example 17.6 presented in Section 17.2.3.4 for the case with a degenerate optimal solution will be solved in this section with Solver in Excel. The graphical solution for this example (min z = x1 + 5x2) is shown in Fig. 17.89.
FIG. 17.87 Answer report adding the constraint x1 - x2 ≥ 1 and maximizing x2.
FIG. 17.88 Answer report adding the constraint 2x1 + x2 ≥ 8 and minimizing x1.
FIG. 17.89 Graphical solution for Example 17.6 with a degenerate optimal solution.
Analogous to the case with multiple optimal solutions, by solving Example 17.6 through Solver in Excel, it was also possible to find a feasible solution, obtaining the same message as the one in Fig. 17.82: that Solver found a solution and all the optimized constraints and conditions were met. However, Solver does not display any message about the special case of a degenerate solution. By solving the problem through Solver in Excel and selecting the option Sensitivity Report in Solver Results, we obtain Fig. 17.90. As shown in rows 15 and 17 of Fig. 17.90, the permissible increase regarding the amount of resources available in the first and third constraints is null, whereas row 16 shows that the permissible decrease regarding the amount of resources in the second constraint is also zero. Therefore, we have a case with a degenerate optimal solution. In this case, the analysis of the Sensitivity Report may be compromised. Ragsdale (2009) and Lachtermacher (2009) highlight the precautions that must be taken when we identify a case with a degenerate optimal solution:
1. When the permissible increase or decrease for the coefficient of one of the variables in the objective function is also zero, the statement that multiple optimal solutions have occurred is no longer reliable.
2. The reduced costs of the variables may not be unique. Moreover, in order for the optimal solution to change, the coefficients of the variables in the objective function must improve by at least their respective reduced costs.
3. The permissible changes in the coefficients of the variables in the objective function hold; however, values outside this interval may still not change the current optimal solution.
4. The shadow prices may not be unique, as well as the permissible increase or decrease regarding the availability of resources of each constraint.
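Degeneracy can also be detected numerically by counting the constraints that are tight at the optimum. The sketch below uses a small made-up model (not Example 17.6, whose full data is given back in Section 17.2.3.4) and assumes SciPy/NumPy:

```python
import numpy as np
from scipy.optimize import linprog

# Illustrative degenerate model: max x1 + x2 subject to
# x1 <= 1, x2 <= 1, x1 + x2 <= 2 -- three constraint lines meet at (1, 1).
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
b = np.array([1.0, 1.0, 2.0])
res = linprog([-1, -1], A_ub=A, b_ub=b, method="highs")

slack = b - A @ res.x
n_binding = int(np.sum(np.isclose(slack, 0.0)))
# More binding constraints than decision variables => degenerate vertex:
print(res.x, n_binding, n_binding > len(res.x))
```

At the optimum (1, 1) all three slacks are zero: more constraints are active than there are decision variables, which is precisely the graphical criterion of Section 17.2.3.4.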
FIG. 17.90 Sensitivity Report for a case with a degenerate solution.
17.7 EXERCISES

Section 17.2 (ex.1). Determine the feasible solution space that satisfies each one of the constraints separately, considering x1, x2 ≥ 0:
(a) 3x1 + 2x2 ≤ 12
(b) 2x1 + 3x2 ≤ 24
(c) 3x1 - 2x2 ≤ 6
(d) x1 - x2 ≤ 4
(e) x1 + 4x2 ≤ 16
(f) x1 + 2x2 ≤ 10
Section 17.2.1 (ex.1). For each maximization function z, determine the direction in which the objective function increases:
(a) max z = 5x1 + 3x2
(b) max z = 4x1 - 2x2
(c) max z = 2x1 + 6x2
(d) max z = x1 - 2x2
Section 17.2.1 (ex.2). Determine the graphical solution (feasible solution space and the optimal solution) of the following LP maximization problems:

(a) max z = 3x1 + 4x2
subject to:
2x1 + 5x2 ≤ 18
4x1 + 4x2 ≤ 12
5x1 + 10x2 ≤ 20
x1, x2 ≥ 0

(b) max z = 2x1 + 3x2
subject to:
2x1 + 2x2 ≤ 10
3x1 + 4x2 ≤ 24
x2 ≤ 4
x1, x2 ≥ 0

(c) max z = 4x1 + 2x2
subject to:
x1 + x2 ≤ 16
3x1 - 2x2 ≤ 36
x1 ≤ 10
x2 ≤ 6
x1, x2 ≥ 0
Section 17.2.1 (ex.3). Graphically solve Venix Toys' production mix problem (Example 16.3 presented in Section 16.5.1 of the previous chapter, solved through the Solver in Excel in Section 17.5.2.1 of this chapter).

Section 17.2.1 (ex.4). Are the following solutions in the feasible region of Venix Toys' problem?
(a) x1 = 30, x2 = 25
(b) x1 = 30, x2 = 30
(c) x1 = 44, x2 = 24
(d) x1 = 45, x2 = 28
(e) x1 = 75, x2 = 15
(f) x1 = 90, x2 = 14
(g) x1 = 100, x2 = 14
(h) x1 = 120, x2 = 10
(i) x1 = 130, x2 = 5
Section 17.2.2 (ex.1). For each minimization function z, determine the direction in which the objective function decreases:
(a) min z = 5x1 + 8x2
(b) min z = 2x1 - 3x2
(c) min z = 4x1 + 5x2
(d) min z = 7x1 - 5x2
Section 17.2.2 (ex.2). Determine the graphical solution for the following LP minimization problems:

(a) min z = 2x1 + x2
subject to:
x1 - x2 ≥ 10
2x1 + 3x2 ≥ 30
x1, x2 ≥ 0

(b) min z = 2x1 - x2
subject to:
x1 - 2x2 ≥ 2
x1 + 3x2 ≥ 6
x1, x2 ≥ 0

(c) min z = 6x1 + 4x2
subject to:
2x1 + 2x2 ≥ 40
x1 + 3x2 ≥ 30
4x1 + 2x2 ≥ 60
x1, x2 ≥ 0

Section 17.2.3 (ex.1). Graphically identify which of the special cases each LP problem presents: a) multiple optimal solutions; b) unlimited objective function z; c) there is no optimal solution; d) degenerate optimal solution.

(a) max z = 2x1 + x2
subject to:
x1 + 4x2 ≤ 12
4x1 + 2x2 ≤ 20
3x2 ≤ 6
x1, x2 ≥ 0

(b) min z = 2x1 + x2
subject to:
4x1 + 5x2 ≥ 20
x1 + x2 ≤ 3
x1, x2 ≥ 0
(c) max z = 2x1 + 3x2
subject to:
4x1 + 2x2 ≥ 20
x1 - x2 ≤ 10
x1, x2 ≥ 0

(d) max z = 6x1 + 4x2
subject to:
4x1 - 4x2 ≤ 20
3x1 + 2x2 ≤ 30
x2 ≤ 12
x1, x2 ≥ 0

(e) min z = 2x1 + 3x2
subject to:
x1 + x2 ≥ 10
4x1 + 2x2 ≥ 20
4x2 ≥ 40
x1, x2 ≥ 0

(f) min z = 2x1 + 3x2
subject to:
x1 + x2 ≥ 10
4x1 + 2x2 ≥ 20
x1 + x2 ≤ 4
x1, x2 ≥ 0

Section 17.2.3 (ex.2). Graphically determine the alternative optimal solutions for the following LP problems:

(a) max z = 6x1 + 4x2
subject to:
3x1 + 2x2 ≤ 90
2x1 + x2 ≤ 50
x1, x2 ≥ 0

(b) min z = 2x1 + 3x2
subject to:
4x1 - x2 ≥ 11
4x1 + 6x2 ≥ 32
x1, x2 ≥ 0

Section 17.3 (ex.1). Consider the following LP minimization problem:

min z = 3x1 + 2x2
subject to:
8x1 + 5x2 ≥ 140
4x1 + 3x2 ≥ 80
x1, x2 ≥ 0

By solving the problem in an analytical way, determine:
(a) The number of possible basic solutions for this system.
(b) The feasible basic solutions for the problem, and represent them graphically.
(c) The optimal solution.

Section 17.3 (ex.2). Do the same for the following LP maximization problem:

max z = 4x1 + 3x2 + 5x3
subject to:
3x1 - x2 + 2x3 ≤ 10
4x1 + 2x2 + 5x3 ≤ 50
x1, x2, x3 ≥ 0
Section 17.4.2 (ex.1). Consider the following LP maximization problem:

max z = 4x1 + 5x2 + 3x3
subject to:
2x1 + 3x2 - x3 ≤ 48
x1 + 2x2 + 5x3 ≤ 60
3x1 + x2 + 2x3 ≤ 30
x1, x2, x3 ≥ 0

Solve the problem through the analytical form of the Simplex method.

Section 17.4.3 (ex.1). Solve the production mix problem of Venix Toys through the Simplex method.

Section 17.4.3 (ex.2). Use the Simplex method to solve the following LP maximization problems:

(a) max z = 3x1 + 2x2
subject to:
3x1 - x2 ≤ 6
x1 + 3x2 ≤ 12
x1, x2 ≥ 0

(b) max z = 2x1 + 4x2 + 3x3
subject to:
x1 + x2 + 2x3 ≤ 6
2x1 + 2x2 + 3x3 ≤ 16
x1 + 4x2 + x3 ≤ 18
x1, x2, x3 ≥ 0

(c) max z = 3x1 + x2 + 2x3
subject to:
2x1 + 2x2 + x3 ≤ 20
3x1 + x2 + 4x3 ≤ 60
x1 + x2 + 2x3 ≤ 30
x1, x2, x3 ≥ 0

Section 17.4.3 (ex.3). What is the biggest difficulty in solving the farmer's problem (Example 16.7 in Section 16.5.4 of the previous chapter, solved through the Solver in Excel in Section 17.5.2.5 of this chapter) through the Simplex method?

Section 17.4.4 (ex.1). Use the Simplex method to solve the following LP minimization problems:

(a) min z = 2x1 - x2
subject to:
2x1 + 6x2 ≥ 24
8x1 + 2x2 ≤ 40
x1, x2 ≥ 0

(b) min z = 5x1 - 6x2
subject to:
4x1 + 2x2 ≥ 10
x1 + 3x2 ≤ 22
x1, x2 ≥ 0

(c) min z = 2x1 - x2 - x3
subject to:
3x1 + 5x2 + 4x3 ≥ 120
x1 + 2x2 + 4x3 ≤ 90
2x1 - x2 + 2x3 ≤ 60
x1, x2, x3 ≥ 0

(d) min z = x1 + 3x2 - x3
subject to:
4x1 - 2x2 + 2x3 ≥ 160
2x1 + 5x2 + 10x3 ≤ 200
x1 - x2 + x3 ≥ 50
x1, x2, x3 ≥ 0
Section 17.4.5.1 (ex.1). Solve the following LP maximization problem through the Simplex method:

max z = 4x1 - x2
subject to:
3x1 - 3x2 ≤ 175
8x1 - 2x2 ≤ 460
x1 ≤ 60
x1, x2 ≥ 0

(a) Demonstrate that we have a special case with multiple optimal solutions here.
(b) Determine at least two of the alternative optimal solutions.
(c) Solve the problem graphically and compare the results obtained.

Section 17.4.5.1 (ex.2). Do the same for the LP minimization problem:

min z = 3x1 + 6x2
subject to:
2x1 + 4x2 ≥ 620
7x1 + 3x2 ≥ 630
x1, x2 ≥ 0

Section 17.4.5.1 (ex.3). Determine all the optimal FBS of the following maximization problem:

max z = 4x1 + 4x2
subject to:
x1 + x2 ≤ 1
x1, x2 ≥ 0

Section 17.4.5.2 (ex.1). Demonstrate that the following LP maximization problem has an unlimited objective function z:

max z = 5x1 + 2x2
subject to:
2x1 - 3x2 ≤ 66
9x1 - 3x2 ≤ 99
x1, x2 ≥ 0

Section 17.4.5.2 (ex.2). Demonstrate that the following LP minimization problem has an unlimited objective function z:

min z = 3x1 - 2x2
subject to:
2x1 + x2 ≥ 12
3x1 - 2x2 ≤ 24
x1, x2 ≥ 0

Determine a feasible basic solution with z = -90.

Section 17.4.5.3 (ex.1). Demonstrate that the following LP maximization problem has an infeasible solution:

max z = 18x1 + 12x2
subject to:
4x1 + 16x2 ≤ 1850
8x1 - 5x2 ≥ 4800
x1, x2 ≥ 0

Section 17.4.5.3 (ex.2). Do the same for the following minimization problem:

min z = 7x1 + 5x2
subject to:
6x1 + 4x2 ≥ 24
x1 + x2 ≤ 3
x1, x2 ≥ 0
Section 17.4.5.4 (ex.1). Demonstrate that the following LP maximization problem has a degenerate optimal solution:

max z = 2x1 + 3x2
subject to:
x1 - x2 ≤ 10
2x1 + 3x2 ≤ 90
x1 ≤ 24
x1, x2 ≥ 0

Section 17.4.5.4 (ex.2). Do the same for the minimization problem:

min z = 6x1 + 8x2
subject to:
2x1 + 4x2 ≥ 60
5x1 - 4x2 ≥ 80
3x1 + 8x2 ≥ 100
x1, x2 ≥ 0

Section 17.4.5 (ex.1). Through the Simplex method, identify which of the special cases each LP problem presents: a) multiple optimal solutions; b) unlimited objective function z; c) there is no optimal solution; d) degenerate optimal solution.

(a) max z = x1 + 3x2
subject to:
2x1 + 6x2 ≤ 48
3x1 + 5x2 ≤ 60
x1 + 8x2 ≥ 6
x1, x2 ≥ 0

(b) min z = 2x1 - 6x2
subject to:
3x1 + 2x2 ≥ 24
2x1 + 6x2 ≥ 30
x1, x2 ≥ 0

(c) max z = 2x1 + x2
subject to:
8x1 + 4x2 ≤ 600
4x1 + 2x2 ≤ 300
x1, x2 ≥ 0

Section 17.4.5 (ex.2). From each one of the tabular forms presented, identify whether we have a special case of the Simplex method or not. If so, determine the special case in which the problem analyzed finds itself: a) multiple optimal solutions; b) unlimited objective function z; c) there is no optimal solution; d) degenerate optimal solution. In each tabular form, we specify whether the original problem is a maximization or a minimization.

(a) Maximization problem

Basic Variable | Equation | z | x1 | x2 | x3 | x4 | Constant
z              | 0        | 1 | 8  | 0  | 2  | 0  | 20
x2             | 1        | 0 | 2  | 1  | 1  | 0  | 10
x4             | 2        | 0 | 3  | 0  | 1  | 1  | 18
(b) Minimization problem

Basic Variable | Equation | z | x1 | x2 | x3  | x4 | x5  | Constant
z              | 0        | 1 | 0  | 10 | 1   | 0  | 2   | 60
x4             | 1        | 0 | 0  | 2  | 7/3 | 1  | 1/3 | 14
x1             | 2        | 0 | 1  | 0  | 2/3 | 0  | 1/3 | 10
(c) Minimization problem

Basic Variable | Equation | z | x1 | x2  | x3 | x4 | a1    | a2    | Constant
z              | 0        | 1 | 0  | 7/4 | 0  | 0  | M+7/4 | M+3/4 | 86
x1             | 1        | 0 | 1  | 1/4 | 0  | 0  | 1/4   | 1/4   | 2
x3             | 2        | 0 | 0  | 1/4 | 1  | 0  | 1/4   | 1/4   | 0
x4             | 3        | 0 | 0  | 1/5 | 0  | 1  | 1/5   | 3/4   | 5
(d) Maximization problem

Basic Variable | Equation | z | x1 | x2  | x3 | x4  | Constant
z              | 0        | 1 | 0  | 0   | 0  | 2   | 3,000
x3             | 1        | 0 | 0  | 8/3 | 1  | 2/3 | 1,120
x1             | 2        | 0 | 1  | 2/3 | 0  | 1/3 | 500
(e) Minimization problem

Basic Variable | Equation | z | x1 | x2 | x3 | x4 | Constant
z              | 0        | 1 | 2  | 5  | 0  | 0  | 0
x3             | 1        | 0 | 3  | 6  | 1  | 0  | 840
x4             | 2        | 0 | 1  | 5  | 0  | 1  | 500
Section 17.5.2 (ex.1). Consider Exercise 1 proposed in Section 16.5.1 of the previous chapter, regarding the production mix problem of company KMX:
(a) Represent the problem in an Excel spreadsheet.
(b) Determine the optimal solution through the Solver in Excel.

Section 17.5.2 (ex.2). Do the same for Exercise 2 proposed in Section 16.5.1 of the previous chapter, regarding the production mix problem of company Refresh.

Section 17.5.2 (ex.3). Do the same for Exercise 3 of Section 16.5.1 of the previous chapter, regarding the company Golmobile.

Section 17.5.2 (ex.4). Do the same for Exercise 1 of Section 16.5.2 of the previous chapter, regarding the petroleum mix problem.

Section 17.5.2 (ex.5). Do the same for Exercise 1 of Section 16.5.4 of the previous chapter, regarding the capital budget problem of company GWX.

Section 17.5.2 (ex.6). Do the same for Exercise 1 of Section 16.5.5 of the previous chapter, regarding the portfolio optimization problem.

Section 17.5.2 (ex.7). Do the same for Exercise 3 of Section 16.5.5 of the previous chapter, regarding the portfolio optimization problem of CTA Investment Bank.

Section 17.5.2 (ex.8). Do the same for Exercise 1 of Section 16.5.7 of the previous chapter, regarding the aggregate planning problem of company Pharmabelz.

Section 17.6.1 (ex.1). Company Solutions manufactures two types of thermometers: digital and mercury. Each digital thermometer guarantees a net unit profit of $7.00, while a mercury thermometer generates a net unit profit of $5.00. Manufacturing these two types of thermometers requires three types of operations. One digital thermometer requires 4, 5, and 2 minutes in each of the operations, whereas a mercury thermometer requires 2, 3, and 3 minutes. The time available for each operation is 300, 360, and 180 minutes, respectively.
(a) Determine the model's graphical solution.
(b) Determine the maximum permissible increase in the net unit profit of a digital thermometer that would maintain the original basic solution unaltered. Assume that the other model parameters remain constant. (c) Determine the maximum permissible decrease in the unit profit of a mercury thermometer that would maintain the original basic solution unaltered, assuming that the other parameters remain constant. (d) Assuming that there was a reduction in the unit profit of digital thermometers to $3.00, check whether the original model's optimal solution remains optimal. (e) Assuming that there was an increase in the unit profit of mercury thermometers to $10.00, verify if the original model's optimal solution remains optimal.
Section 17.6.1 (ex.2). Consider the following maximization problem:

max z = 8x1 + 6x2
subject to:
2x1 + 5x2 ≤ 30
3x1 + 6x2 ≤ 54
2x1 + 8x2 ≤ 64
x1, x2 ≥ 0

(a) Determine the model's graphical solution. (b) What is the value range in which c1 can vary that would maintain the original basic solution unaltered, assuming that c2 remains constant? (c) Determine the value range in which c2 can vary that would maintain the original basic solution unaltered, assuming that c1 remains constant.
Section 17.6.1 (ex.3). Consider the following minimization problem:

min z = 8x1 + 6x2
subject to:
2x1 + 5x2 ≥ 60
3x1 + 6x2 ≥ 102
2x1 + 8x2 ≥ 128
x1, x2 ≥ 0
Solution of Linear Programming Problems, Chapter 17
(a) Determine the model's graphical solution. (b) In the original model's optimal solution, we verified that variable x2 is basic. Thus, if the non-negativity constraint is not imposed on the possible variations of c2, which problem will occur? (c) What is the value range in which c1 can vary that would maintain the original basic solution unaltered, assuming that c2 remains constant? (d) Determine the value range in which c2 can vary that would maintain the original basic solution unaltered, assuming that c1 remains constant.
Section 17.6.1 (ex.4). Consider Venix Toys' production mix problem (Example 16.3 of the previous chapter): (a) Determine the optimality condition (c1/c2) that would maintain the original model's basic solution unaltered. (b) Let's assume that there was a simultaneous reduction in the unit profits of toy cars and tricycles to $10.00 and $50.00, respectively, due to the competition in the market. Verify if the original model's basic solution remains optimal and determine the new value of z. (c) What are the possible changes in the unit profit of toy cars that would maintain the original model's basic solution unaltered? Assume that the other model parameters remain constant. (d) What is the value range in which the unit profit of tricycles can vary without impacting the original model's basic solution? Assume that the other parameters remain constant. (e) If there is a reduction in the unit profit of toy cars to $9.00, will the original model's basic solution remain optimal? In this case, what is the new value of the objective function? (f) If there is an increase in the unit profit of tricycles to $80.00, will the original model's basic solution be affected (the other parameters remain constant)? What is the new value of the objective function after these changes? (g) Imagine that there has been a significant reduction in the production costs of tricycles, increasing their unit profit to $100.00.
In order for the original model's basic solution to remain unaltered, which interval must the unit profit of toy cars satisfy?
Section 17.6.2 (ex.1). Once again, consider the production mix problem of Venix Toys: (a) Determine the shadow price for the machining, painting, and assembly departments. (b) Determine the value range in which each bi can vary that maintains the shadow price constant. (c) If the availability in the machining sector increases to 40 hours, what will be the increase in the value of the objective function? (d) If the availability in the painting sector is reduced to 18 hours, what will be the decrease in the value of the objective function? Also determine the new values of x1 and x2.
Section 17.6.2 (ex.2). Consider Exercise 1 proposed in Section 17.6.1 of company Solutions: (a) If the availability of each operation increases by 1 minute, which one of them must have priority? (b) Determine the maximum permissible increase and decrease (p and q minutes, respectively) in b1, b2, and b3 that maintains the shadow price constant. (c) What is the fair price to pay for using p minutes of b2? (d) What is the opportunity cost due to the loss of q minutes of b3?
Section 17.6.3 (ex.1). Consider the following maximization problem:

max z = 3x1 + 2x2
subject to:
x1 + x2 ≤ 6
5x1 + 2x2 ≤ 20
x1, x2 ≥ 0

Table 17.1 shows the initial tabular form of the model, Table 17.2 shows the tabular form in the first iteration, and Table 17.3 the optimal tabular form of the same problem.

TABLE 17.1 Row 0 of the Initial Tabular Form

Equation | z | x1 | x2 | x3 | x4 | Constant
0 | 1 | −3 | −2 | 0 | 0 | 0
We would like you to: (a) Interpret the reduced costs of Tables 17.2 and 17.3. (b) Determine the values of z11, z12, z∗1, and z∗2.
Section 17.6.3 (ex.2). Consider the following minimization problem:

min z = 4x1 − 2x2
subject to:
2x1 + x2 ≤ 10
x1 − x2 ≤ 8
x1, x2 ≥ 0

Tables 17.4 and 17.5 show the initial tabular form and the optimal tabular form of the model, respectively. We would like you to: (a) Interpret the reduced costs of Table 17.5. (b) Determine the values of z∗1 and z∗2.
TABLE 17.2 Row 0 of the Tabular Form in the First Iteration

Equation | z | x1 | x2 | x3 | x4 | Constant
0 | 1 | 0 | −4/5 | 0 | 3/5 | 12
TABLE 17.3 Row 0 of the Optimal Tabular Form

Equation | z | x1 | x2 | x3 | x4 | Constant
0 | 1 | 0 | 0 | 4/3 | 1/3 | 44/3
TABLE 17.4 Row 0 of the Initial Tabular Form

Equation | z | x1 | x2 | x3 | x4 | Constant
0 | 1 | −4 | 2 | 0 | 0 | 0
TABLE 17.5 Row 0 of the Optimal Tabular Form

Equation | z | x1 | x2 | x3 | x4 | Constant
0 | 1 | −8 | 0 | −2 | 0 | −20
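The figures in the optimal row 0 of Exercise 1 can be cross-checked with any LP solver, since the coefficients of the slack variables in the optimal row 0 equal the dual values (shadow prices) of the corresponding constraints. Below is a sketch using SciPy's `linprog` (an assumed alternative to the manual tableau calculations; the text itself works the tableaus by hand):

```python
# Numeric cross-check of Section 17.6.3 (ex.1) using SciPy's linprog
# (an assumed alternative to the manual tableau calculations above).
from scipy.optimize import linprog

c = [-3, -2]              # max 3x1 + 2x2  ->  min -(3x1 + 2x2)
A_ub = [[1, 1], [5, 2]]
b_ub = [6, 20]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, method="highs")

# The duals of the two constraints should match the coefficients of the
# slack variables x3 and x4 in the optimal row 0 (4/3 and 1/3).
duals = [-m for m in res.ineqlin.marginals]
print("z* =", -res.fun)   # the constant of the optimal row 0 (44/3)
print("duals:", duals)
```

The sign convention is the only subtlety: `linprog` minimizes, so the reported marginals are negated to recover the values of the original maximization problem.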
Section 17.6.4 (ex.1). Consider Exercise 1 of Section 17.6.1 of company Solutions: (a) Solve it through the Solver in Excel. (b) Through the Solver Sensitivity Report, determine the maximum permissible increase and decrease in c1 that maintains the original basic solution unaltered. (c) Through the Solver Sensitivity Report, determine the maximum permissible increase and decrease in c2 that maintains the original basic solution unaltered. (d) Through the Solver Sensitivity Report, determine the shadow price for each operation. (e) Through the Solver Sensitivity Report, determine the maximum permissible increase and decrease in b1, b2, and b3 that maintains the shadow price constant.
Section 17.6.4 (ex.2). Do the same for the production mix problem of company Venix Toys.
Section 17.6.4 (ex.3). Through the Solver Sensitivity Report, identify whether the problems below belong to the special case "multiple optimal solutions" or "degenerate optimal solution."

(a) max z = 4x1 + 2x2
subject to:
6x1 + 2x2 ≤ 240
2x1 + 3x2 ≤ 200
3x1 + x2 ≤ 120
x1, x2 ≥ 0

(b) max z = 3x1 + 8x2
subject to:
2x1 + 2x2 ≤ 300
5x1 + 4x2 ≤ 800
9x1 + 24x2 ≤ 1,080
x1, x2 ≥ 0

(c) max z = 2x1 + 6x2
subject to:
2x1 + 2x2 ≤ 600
2x1 + 8x2 ≤ 800
x1 − 8x2 ≤ 0
x1, x2 ≥ 0

(d) max z = 4x1 + 2x2
subject to:
6x1 + 2x2 ≤ 240
2x1 + 3x2 ≤ 200
8x1 + 4x2 ≤ 240
x1, x2 ≥ 0

(e) min z = 6x1 + 3x2
subject to:
4x1 + 2x2 ≥ 832
7x1 + 3x2 ≥ 714
2x1 + 9x2 ≥ 900
x1, x2 ≥ 0

(f) min z = 4x1 + 5x2
subject to:
2x1 + 3x2 ≥ 675
2x1 + 5x2 ≥ 1,125
3x1 + 4x2 ≥ 900
x1, x2 ≥ 0

(g) min z = 2x1 + x2
subject to:
4x1 + 8x2 ≥ 1,920
3x1 + 2x2 ≥ 600
7x1 + 3x2 ≥ 1,050
x1, x2 ≥ 0
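For readers without Excel at hand, a Solver-style sensitivity report can be approximated in code. The sketch below (an assumed alternative using SciPy, not the text's Excel Solver procedure) solves the company Solutions model of Section 17.6.1 (ex.1) and recovers the shadow price of each operation from the solver's dual values:

```python
# Sketch: approximating a Solver Sensitivity Report for the company
# Solutions problem of Section 17.6.1 (ex.1) with SciPy's linprog (an
# assumed alternative tool; the text uses the Excel Solver report).
from scipy.optimize import linprog

c = [-7, -5]                     # max 7x1 + 5x2  ->  min -(7x1 + 5x2)
A_ub = [[4, 2], [5, 3], [2, 3]]  # minutes consumed per thermometer
b_ub = [300, 360, 180]           # minutes available per operation

res = linprog(c, A_ub=A_ub, b_ub=b_ub, method="highs")

# HiGHS reports marginals as d(min objective)/d(b_i); negating them gives
# the shadow prices of the original maximization problem.
shadow_prices = [-m for m in res.ineqlin.marginals]
print("optimal mix (x1, x2):", res.x)
print("profit z:", -res.fun)
print("shadow prices:", shadow_prices)
```

A zero shadow price flags a non-binding operation: increasing its availability would not change the optimal profit, which is exactly what the Sensitivity Report's allowable-increase columns quantify.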
Chapter 18
Network Programming

I understand reason to be, not the ability to ratiocinate, which may be well or poorly employed, but the sequencing of truths that can only produce truths, and one truth cannot be contrary to the other.
Gottfried Wilhelm von Leibniz
18.1 INTRODUCTION A network programming problem is modeled through a graph structure or network that consists of various nodes, in which each node must be connected to one or more arcs. Network models are increasingly utilized in various business areas, such as production, transportation, facility location, project management, finances, and others. Many of them may be formulated as linear programming problems (LP) and, therefore, may be solved by the Simplex method. Network modeling facilitates visualization and understanding of system characteristics. Thus, simplified versions of the Simplex method may be used for solving LP problems in networks. Additionally, other more efficient algorithms and software are being proposed and utilized for solving models in networks. Among the main problems in network programming are the classic transportation problem, the transshipment problem, the job assignment problem, the shortest path problem, and the maximum flow problem. Each one of the problems listed here will be studied in this chapter. We will initially present the mathematical modeling of each problem, as well as its solution using Excel Solver. In the case of the classic transportation problem, we will also describe how to solve it by using the transportation algorithm, which is a simplification of the Simplex method.
18.2 TERMINOLOGY OF GRAPHS AND NETWORKS
A graph is defined as a set of nodes or vertices and a set of arcs or edges interconnecting these nodes. The nodes, drawn as circles or points, may represent facilities (such as factories, distribution centers, terminals, or seaports), workstations, or intersections. The arcs, illustrated as line segments, make connections between pairs of nodes, and can represent paths, routes, wires, cables, channels, among others. The notation for a graph is G = (N, A), in which N is a set of nodes and A is a set of arcs. Fig. 18.1 shows an example of a graph with five nodes and eight arcs. Often, the arcs of a graph that make connections between nodes are associated with a numerical variable called a flow, which represents a measurable characteristic of that connection, such as the distance between nodes, the transportation cost, the time spent, the dimensions of the wire, the number of parts transported, and other factors. Analogously, the nodes of a graph may be associated with a numerical variable called capacity, which may represent the loading and unloading capacity, supply, demand, among others. A graph whose arcs and/or nodes are associated with the numerical variables flow and/or capacity is called a network. Fig. 18.2 shows an example of a network. The nodes represent cities and the flows represent the distances (km) between them. For simplicity's sake, we will henceforth no longer make a distinction between the terms "graph" and "network," and will employ only the term "network." The nodes of a network may be subdivided into three types: (a) supply nodes or sources, which represent entities that produce or distribute a given product; (b) demand nodes, which represent entities that consume the product; (c) transshipment nodes, which are the intermediate points between the supply and demand nodes and represent the waypoints for those products.
Data Science for Business and Decision Making.
https://doi.org/10.1016/B978-0-12-811216-8.00018-5 © 2019 Elsevier Inc. All rights reserved.
FIG. 18.1 Example of a graph.
FIG. 18.2 Example of a network.
[Fig. 18.2 shows a network of five nodes (Manaus, Natal, Brasilia, Salvador, and Porto Alegre) whose arcs are labeled with the distances between the cities, in km: 5985, 3490, 2422, 4563, 2027, 1446, 1126, and 3090.]
The arcs may have an arrow indicating the direction of the arc. When the flow between the respective nodes occurs in a single direction, indicated by an arrow, we have a directed arc. When the flow occurs in both directions, it is called an undirected arc. In cases where there is a single connection between the nodes, but without an arrow indicating the direction of the arc, it is presumed that the arc is undirected. Each one of these cases may be visualized in Fig. 18.3.
FIG. 18.3 Differences between directed and undirected arcs.
The arcs of Figs. 18.1 and 18.2 are also examples of undirected arcs. One may assume in these cases that the distances are symmetrical. When all of the arcs of the network are directed, we have a directed network. Analogously, when all of the arcs are undirected, we say that the network is undirected. Fig. 18.2 is an example of an undirected network. As for Fig. 18.4, it
FIG. 18.4 Example of a directed network.
refers to a directed network, whose nodes represent a set of physical activities with the respective durations (minutes) and whose directed arcs represent the precedence relations between the activities. Other definitions from graph theory, such as path, Hamiltonian path, cycle, tree, spanning tree, and minimum spanning tree, will be presented next. Hillier and Lieberman (2005) define a path between two nodes as a sequence of distinct arcs connecting these nodes. For example, the sequence of arcs AB, BC, CE (A → B → C → E) of Fig. 18.1 is considered a path. In a directed network, one may have a directed or undirected path. A path in which all arcs follow a single direction is called a directed path. On the other hand, if at least one arc of the path has a direction opposite to the others, the path is undirected. For example, the path AC, CD, DE (A → C → D → E) of Fig. 18.4 is considered a directed path, given that all of the arcs follow the same direction. On the other hand, the path AB, BD, DC (A → B → D → C) of the same figure is considered an undirected path, given that the direction of arc DC is contrary to that of the other arcs. A Hamiltonian path is one that visits each node a single time. For example, the path AB, BC, CE of Fig. 18.1 is also considered a Hamiltonian path. On the other hand, the path AB, BC, CE, ED, DC (A → B → C → E → D → C) of the same figure is not a Hamiltonian path. A path that begins and ends at the same node forms a cycle. The path AB, BC, CE, EA (A → B → C → E → A) of Fig. 18.1 is an example of a cycle. In a directed network, one may have a directed or undirected cycle. When the path traversed in a cycle is directed, we have a directed cycle. Analogously, an undirected path that begins and ends at the same node is called an undirected cycle. For example, the cycle AB, BD, DE, EA (A → B → D → E → A) of Fig. 18.4 is directed, whereas the cycle AB, BC, CA (A → B → C → A) of the same figure is an example of an undirected cycle.
An undirected network G = (N, A) is said to be connected when there is a path between any pair of nodes. A network G has a tree structure if it is connected and acyclic (without cycles). In addition, as part of the tree concept, it can be affirmed that:
– A tree with n nodes contains n − 1 arcs.
– If an arc is added to a tree, a cycle is formed.
– If an arc is eliminated from a tree, the network ceases to be connected (instead of a single connected network there will be two connected networks).
The network presented in Fig. 18.5 is an example of a tree based on the network presented in Fig. 18.1. Before we define the concept of spanning tree, we will define the concept of subgraph. G′ = (N′, A′) is a subgraph of G = (N, A) if the set of nodes of G′ is a subset of the set of nodes of G (N′ ⊆ N), if the set of arcs of G′ is a subset of the set of arcs of G (A′ ⊆ A), and if G′ is itself a graph. Given the network G = (N, A), a spanning tree, also called a generating tree, is a subgraph of G that has a tree structure and contains all of the nodes of G. Fig. 18.6 is an example of a spanning tree of the network drawn in Fig. 18.1. A minimum spanning tree of G is the spanning tree with the lowest total cost. FIG. 18.5 Example of a tree.
FIG. 18.6 Example of a spanning tree.
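The minimum spanning tree definition can be made concrete with Kruskal's algorithm, which repeatedly adds the cheapest arc that does not form a cycle. The sketch below reuses the five city nodes of Fig. 18.2, but the arc pairings are assumed for illustration (the figure's exact node-distance mapping is not reproduced here):

```python
# A minimal Kruskal's algorithm sketch for a minimum spanning tree.
# The edge list is hypothetical: it borrows the node names and distance
# values of Fig. 18.2, but the pairings are assumed for illustration.

def minimum_spanning_tree(nodes, edges):
    """edges: list of (weight, u, v); returns (total weight, tree arcs)."""
    parent = {n: n for n in nodes}

    def find(n):                     # union-find with path compression
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    tree, total = [], 0
    for w, u, v in sorted(edges):    # examine arcs in order of weight
        ru, rv = find(u), find(v)
        if ru != rv:                 # adding the arc does not form a cycle
            parent[ru] = rv
            tree.append((u, v, w))
            total += w
    return total, tree

nodes = ["Manaus", "Natal", "Brasilia", "Salvador", "Porto Alegre"]
edges = [                            # (distance in km, node, node) - assumed
    (5985, "Manaus", "Natal"), (3490, "Manaus", "Brasilia"),
    (4563, "Manaus", "Salvador"), (2422, "Natal", "Brasilia"),
    (1126, "Natal", "Salvador"), (1446, "Brasilia", "Salvador"),
    (2027, "Brasilia", "Porto Alegre"), (3090, "Salvador", "Porto Alegre"),
]
total, tree = minimum_spanning_tree(nodes, edges)
print(total, tree)                   # a tree with n - 1 = 4 arcs
```

Note how the result obeys the tree properties listed above: the five nodes are connected by exactly n − 1 = 4 arcs, and skipping any arc that closes a cycle is precisely what keeps the subgraph a tree.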
18.3 CLASSIC TRANSPORTATION PROBLEM
The classic transportation problem has the objective of determining the quantities of products to be transported from a set of suppliers to a set of consumers, so that the total transportation cost is minimized. Each supplier manufactures a fixed number of products, and each consumer has a known demand that will be met. The problem is modeled using two links in the supply chain; in other words, without considering intermediate facilities (distribution centers, terminals, seaports, or factories). The mathematical notation and the network representation of the classic transportation problem are presented as follows. Consider a set of m suppliers that provide goods to a set of n consumers. The maximum amount to be transported from a given supplier i (i = 1, …, m) corresponds to its capacity of Csi units. On the other hand, the demand of each consumer j (j = 1, …, n) must be met and is represented by dj. The unit transportation cost from supplier i to consumer j is represented by cij. The objective is to determine the quantities to be transported from supplier i to consumer j (xij) in order to minimize the total transportation cost (z). Fig. 18.7 presents the network representation of the classic transportation problem.
18.3.1 Mathematical Formulation of the Classic Transportation Problem
The model parameters, the decision variables, and the general mathematical formulation of the classic transportation problem are specified as follows.
Parameters of the model:
cij = unit transportation cost from supplier i (i = 1, …, m) to consumer j (j = 1, …, n)
Csi = supply capacity of supplier i (i = 1, …, m)
dj = demand of consumer j (j = 1, …, n)
Decision variables:
xij = quantity transported from supplier i (i = 1, …, m) to consumer j (j = 1, …, n)
FIG. 18.7 Network representation of the classic transportation problem.
General formulation:

min z = Σ(i=1..m) Σ(j=1..n) cij xij
subject to:
Σ(j=1..n) xij ≤ Csi,  i = 1, 2, …, m
Σ(i=1..m) xij ≥ dj,  j = 1, 2, …, n
xij ≥ 0,  i = 1, 2, …, m, j = 1, 2, …, n     (18.1)
which corresponds to a linear programming problem. Thus, the problem could be solved by the Simplex method. However, the special structure of the network problem makes it possible to obtain more efficient solution algorithms, such as the transportation algorithm that will be described in Section 18.3.3.1. In order for the problem represented by Expression (18.1) to have a feasible basic solution, the total supply capacity should be greater than or equal to the total demand of the consumers, that is, Σ(i=1..m) Csi ≥ Σ(j=1..n) dj. If the total supply capacity is exactly equal to the total demand consumed, that is, Σ(i=1..m) Csi = Σ(j=1..n) dj (balancing equation), the problem is known as the balanced transportation problem, and may be rewritten as:

min z = Σ(i=1..m) Σ(j=1..n) cij xij
subject to:
Σ(j=1..n) xij = Csi,  i = 1, 2, …, m
Σ(i=1..m) xij = dj,  j = 1, 2, …, n
xij ≥ 0,  i = 1, 2, …, m, j = 1, 2, …, n     (18.2)
We can have a third case, in which the total supply capacity is less than the total demand consumed (Σ(i=1..m) Csi < Σ(j=1..n) dj), so that the total demand of some consumers will not be met. On the other hand, the suppliers will utilize their maximum capacity. That case may be mathematically formulated as:

min z = Σ(i=1..m) Σ(j=1..n) cij xij
subject to:
Σ(j=1..n) xij = Csi,  i = 1, 2, …, m
Σ(i=1..m) xij ≤ dj,  j = 1, 2, …, n
xij ≥ 0,  i = 1, 2, …, m, j = 1, 2, …, n     (18.3)
Example 18.1 Karpet Ltd. is an automotive parts manufacturer, whose units are located in the Brazilian cities of Osasco, Sorocaba, and Sao Sebastiao. Its clients are found in Sao Paulo, Rio de Janeiro, and Curitiba, as presented in Fig. 18.8. The unit transportation costs from each origin to each destination, as well as the capacity of each supplier and the demand of each consumer, are found in Table 18.E.1. The objective is to meet the demand of each final consumer, respecting the supply capacities, so as to minimize the total transportation cost. Model the transportation problem.
FIG. 18.8 Pool of Karpet Ltd. suppliers and consumers.
TABLE 18.E.1 Transportation Data of the Karpet Ltd. Company (Transportation Unit Costs)

Supplier | Sao Paulo | Rio de Janeiro | Curitiba | Capacity
Osasco | 12 | 22 | 30 | 100
Sorocaba | 18 | 24 | 32 | 140
Sao Sebastiao | 22 | 15 | 34 | 160
Demand | 120 | 130 | 150 |
Solution
Since the total supply capacity is exactly equal to the total demand consumed, we have a balanced transportation problem. First, the decision variables of the model are defined: xij = number of parts transported from supplier i to consumer j, i = 1, 2, 3; j = 1, 2, 3. Thus, we have:
x11 = parts transported from the supplier in Osasco to the consumer in Sao Paulo.
x12 = parts transported from the supplier in Osasco to the consumer in Rio de Janeiro.
x13 = parts transported from the supplier in Osasco to the consumer in Curitiba.
⋮
x31 = parts transported from the supplier in Sao Sebastiao to the consumer in Sao Paulo.
x32 = parts transported from the supplier in Sao Sebastiao to the consumer in Rio de Janeiro.
x33 = parts transported from the supplier in Sao Sebastiao to the consumer in Curitiba.
The objective function seeks to minimize the total transportation cost:
min z = 12x11 + 22x12 + 30x13 + 18x21 + 24x22 + 32x23 + 22x31 + 15x32 + 34x33
The constraints of the model are specified as follows:
1. The capacity of each supplier will be fully utilized to meet consumer demand:
x11 + x12 + x13 = 100
x21 + x22 + x23 = 140
x31 + x32 + x33 = 160
2. The demand of each consumer must be met:
x11 + x21 + x31 = 120
x12 + x22 + x32 = 130
x13 + x23 + x33 = 150
3. The decision variables of the model are non-negative:
xij ≥ 0,  i = 1, 2, 3, j = 1, 2, 3
The optimal solution, obtained by the transportation algorithm (see Section 18.3.3.1) or using Excel Solver (Section 18.3.3.2), is x11 = 100, x12 = 0, x13 = 0, x21 = 20, x22 = 0, x23 = 120, x31 = 0, x32 = 130, x33 = 30, with z = 8,370.
18.3.2 Balancing the Transportation Problem When the Total Supply Capacity Is Not Equal to the Total Demand Consumed
In the next section, we will present several methods for solving the classic transportation problem. Most require that the transportation problem be balanced, so a ghost (dummy) supplier or customer should be added when the total supply is not equal to the total demand. We will examine each one of the cases.
Case 1: Total Supply Is Greater than Total Demand
Consider an unbalanced transportation problem whose total supply capacity is greater than the total demand consumed. To restore the balance, one must create a ghost customer (dummy) that will absorb the excess supply. Thus, the demand of this new destination will correspond to the difference between the total supply and the total demand consumed, indicating the nonutilized supply capacity. The unit transportation cost from any supplier to the ghost customer will be null, since this customer is not real. We thus guarantee a feasible basic solution when using the solution procedures presented in Sections 18.3.3.1 and 18.3.3.2, given that the total supply capacity becomes exactly equal to the total demand.
Example 18.2 The Caramel Candy & Confetti company has been involved in the candy sector since 1990 and owns three stores located in the Greater Sao Paulo area. Its main clients are located in the Sao Paulo Capital, Baixada Santista, and Vale do Paraiba, as shown in Fig. 18.9. The production capacity of the stores, the consumer demand, and the costs per unit distributed by each store for each
FIG. 18.9 Stores and clients of the Caramel Candy & Confetti company.
TABLE 18.E.2 Transportation Data of the Caramel Candy & Confetti Company (Transportation Unit Costs)

Supplier | Sao Paulo | Baixada Santista | Vale do Paraiba | Capacity
Store 1 | 8 | 12 | 10 | 50
Store 2 | 4 | 10 | 6 | 100
Store 3 | 6 | 15 | 12 | 40
Demand | 60 | 70 | 30 |
consumer are illustrated in Table 18.E.2. In order to minimize the total transportation cost, the company wants to determine how much to distribute from each store to the respective consumers, respecting the production capacity and ensuring that the demands will be met. Formulate the Caramel Candy & Confetti company transportation problem.
Solution
We can verify that the Caramel Candy & Confetti company transportation problem is unbalanced, given that the total supply capacity (190) is greater than the total demand consumed (160).
Solution (a)
One way to represent the mathematical model of the Caramel Candy & Confetti company is by Expression (18.1), in which the constraints are written in inequality form. In that model, one has the following decision variables: xij = number of candies transported from store i to consumer j, i = 1, 2, 3; j = 1, 2, 3. Thus, we have:
x11 = candies transported from store 1 to the consumer in Sao Paulo (SP).
x12 = candies transported from store 1 to the consumer in Baixada Santista (BS).
x13 = candies transported from store 1 to the consumer in Vale do Paraiba (VP).
⋮
x31 = candies transported from store 3 to the consumer in Sao Paulo (SP).
x32 = candies transported from store 3 to the consumer in Baixada Santista (BS).
x33 = candies transported from store 3 to the consumer in Vale do Paraiba (VP).
The objective function seeks to minimize the total transportation cost:
min z = 8x11 + 12x12 + 10x13 + 4x21 + 10x22 + 6x23 + 6x31 + 15x32 + 12x33
The constraints of the model are specified as follows:
1. The production capacity of each store should be respected:
x11 + x12 + x13 ≤ 50
x21 + x22 + x23 ≤ 100
x31 + x32 + x33 ≤ 40
2. The demand of each consumer should be met:
x11 + x21 + x31 ≥ 60
x12 + x22 + x32 ≥ 70
x13 + x23 + x33 ≥ 30
3. The decision variables of the model are non-negative:
xij ≥ 0,  i = 1, 2, 3, j = 1, 2, 3
The optimal solution of this model, obtained using Excel Solver (see Section 18.3.3.2), is x11 = 0, x12 = 50, x13 = 0, x21 = 50, x22 = 20, x23 = 30, x31 = 10, x32 = 0, x33 = 0, with z = 1,240. One may see, from that result, that store 3 did not use its maximum capacity of 40 units, but only 10 units.
FIG. 18.10 Network modeling of the balanced problem for the Caramel Candy & Confetti company.
Solution (b)
In order for the transportation algorithm presented in Section 18.3.3.1 to be applied, we must have a balanced transportation problem, so that the total supply capacity is equal to the total demand. To restore the balance for the Caramel Candy & Confetti company problem, one must create a ghost customer (dummy) that will absorb the excess supply of 30 units. The network modeling of the balanced problem is illustrated in Fig. 18.10. The mathematical formulation of the balanced Caramel Candy & Confetti company problem is described as follows. Since a new consumer was added, xij may be rewritten as: xij = number of candies transported from store i to consumer j, i = 1, 2, 3; j = 1, 2, 3, 4. The new decision variables are:
x14 = candies transported from store 1 to the new ghost customer (dummy).
x24 = candies transported from store 2 to the new ghost customer (dummy).
x34 = candies transported from store 3 to the new ghost customer (dummy).
Since the unit transportation cost from any supplier to the new consumer is null, the objective function is not changed:
min z = 8x11 + 12x12 + 10x13 + 4x21 + 10x22 + 6x23 + 6x31 + 15x32 + 12x33
The constraints on supply capacity and on demand consumed are changed:
1. Supply constraints of the stores:
x11 + x12 + x13 + x14 = 50
x21 + x22 + x23 + x24 = 100
x31 + x32 + x33 + x34 = 40
2. Demand constraints:
x11 + x21 + x31 = 60
x12 + x22 + x32 = 70
x13 + x23 + x33 = 30
x14 + x24 + x34 = 30
3. Nonnegativity constraints:
xij ≥ 0,  i = 1, 2, 3, j = 1, 2, 3, 4
From solution (a), we already know that the nonutilized capacity of 30 units comes only from store 3. Since the new ghost consumer was created to absorb that surplus supply, we can affirm that x34 = 30. Therefore, the optimal solution of the balanced model is x11 = 0, x12 = 50, x13 = 0, x14 = 0, x21 = 50, x22 = 20, x23 = 30, x24 = 0, x31 = 10, x32 = 0, x33 = 0, and x34 = 30, with z = 1,240.
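The balancing procedure of solution (b) is mechanical enough to sketch in code: append a dummy consumer column with null costs, then solve the balanced model with equality constraints. SciPy's `linprog` is used here as an assumed stand-in for the transportation algorithm:

```python
# Sketch of solution (b) for Example 18.2: balance the problem by adding
# a dummy consumer column with zero costs, then solve the balanced model.
# SciPy's linprog is an assumed stand-in for the transportation algorithm.
from scipy.optimize import linprog

supply = [50, 100, 40]
demand = [60, 70, 30]
cost = [[8, 12, 10], [4, 10, 6], [6, 15, 12]]

excess = sum(supply) - sum(demand)   # 30 units of nonutilized capacity
demand = demand + [excess]           # dummy consumer absorbs the excess
cost = [row + [0] for row in cost]   # null cost to the dummy consumer

m, n = len(supply), len(demand)
c = [cost[i][j] for i in range(m) for j in range(n)]
A_eq = ([[1 if k // n == i else 0 for k in range(m * n)] for i in range(m)]
        + [[1 if k % n == j else 0 for k in range(m * n)] for j in range(n)])
b_eq = supply + demand

res = linprog(c, A_eq=A_eq, b_eq=b_eq, method="highs")
print("z =", res.fun)                # total cost of the balanced model
```

Because shipments to the dummy column cost nothing, the balanced model's optimal cost coincides with the inequality model of solution (a).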
Case 2: Total Supply Capacity Is Lower than Total Demand Consumed
Consider an unbalanced transportation problem whose total supply capacity is less than the total demand consumed. To restore the balance, one must create a ghost supplier (dummy) that will meet the remaining demand. Thus, the amount offered by this new supplier will correspond to the difference between the total demand consumed and the total supply capacity, indicating the unmet demand. The unit transportation cost from the ghost supplier to any consumer will be null, since the supplier is not real. Analogous to Case 1, the balancing equation between supply and demand guarantees that a feasible basic solution will be found.
Example 18.3 Consider Example 18.2 of the Caramel Candy & Confetti company, with, however, distinct production capacities of the stores and consumer demand, as shown in Table 18.E.3. Formulate the new Caramel Candy & Confetti company transportation problem.
TABLE 18.E.3 New Transportation Data of the Caramel Candy & Confetti Company (Transportation Unit Costs)

Supplier | Sao Paulo | Baixada Santista | Vale do Paraiba | Capacity
Store 1 | 8 | 12 | 10 | 60
Store 2 | 4 | 10 | 6 | 40
Store 3 | 6 | 15 | 12 | 50
Demand | 50 | 120 | 80 |
Solution
Once again, we are faced with an unbalanced transportation problem; however, this time, the total supply capacity (150) is less than the total demand consumed (250).
Solution (a)
One way to represent that model is through Expression (18.3), in which the suppliers utilize their maximum capacity; however, the total demand of some consumers is not met. The decision variables are not changed in relation to the unbalanced model presented in Example 18.2: xij = number of candies transported from store i to consumer j, i = 1, 2, 3; j = 1, 2, 3. The same occurs in relation to the objective function:
min z = 8x11 + 12x12 + 10x13 + 4x21 + 10x22 + 6x23 + 6x31 + 15x32 + 12x33
The constraints of Example 18.3 are specified as follows.
1. The suppliers will utilize their maximum capacity:
x11 + x12 + x13 = 60
x21 + x22 + x23 = 40
x31 + x32 + x33 = 50
2. The total demand of each consumer may not be met:
x11 + x21 + x31 ≤ 50
x12 + x22 + x32 ≤ 120
x13 + x23 + x33 ≤ 80
3. The decision variables of the model are non-negative:
xij ≥ 0,  i = 1, 2, 3, j = 1, 2, 3
The optimal solution of this model, obtained using Excel Solver (see Section 18.3.3.2), is x11 = 0, x12 = 20, x13 = 40, x21 = 0, x22 = 0, x23 = 40, x31 = 50, x32 = 0, x33 = 0, with z = 1,180.
FIG. 18.11 Network modeling of the balanced Example 18.3.
One may see, from that result, that the total demand of 120 units from the consumer in Baixada Santista was not fully met, only part of it (20 units).
Solution (b)
Analogous to Example 18.2, in order for the transportation algorithm presented in Section 18.3.3.1 to be applied, we must have a balanced transportation problem. To restore the balance, one must create a ghost supplier (dummy) that will meet the unmet demand of 100 units. The network modeling of the new balanced problem is illustrated in Fig. 18.11. The mathematical formulation of the balanced Example 18.3 is described as follows. Since a new supplier has been added, xij may be rewritten as: xij = number of candies transported from store i to consumer j, i = 1, 2, 3, 4; j = 1, 2, 3. The new decision variables are:
x41 = candies transported from the new ghost store (dummy) to consumer 1.
x42 = candies transported from the new ghost store (dummy) to consumer 2.
x43 = candies transported from the new ghost store (dummy) to consumer 3.
Since the unit transportation cost from the new supplier to any consumer is null, the objective function is not changed:
min z = 8x11 + 12x12 + 10x13 + 4x21 + 10x22 + 6x23 + 6x31 + 15x32 + 12x33
The balanced model of Example 18.3 presents the following constraints:
1. Supply constraints of the stores:
x11 + x12 + x13 = 60
x21 + x22 + x23 = 40
x31 + x32 + x33 = 50
x41 + x42 + x43 = 100
2. Demand constraints:
x11 + x21 + x31 + x41 = 50
x12 + x22 + x32 + x42 = 120
x13 + x23 + x33 + x43 = 80
3. Nonnegativity constraints:
xij ≥ 0,  i = 1, 2, 3, 4, j = 1, 2, 3
From solution (a) of that same example, we already know that the unmet demand of 100 units comes from the consumer in Baixada Santista. Because the new ghost supplier was created to meet that remaining demand, we can affirm that x42 = 100. Therefore, the optimal solution of the model is x11 = 0, x12 = 20, x13 = 40, x21 = 0, x22 = 0, x23 = 40, x31 = 50, x32 = 0, x33 = 0, x41 = 0, x42 = 100, x43 = 0, with z = 1,180.
PART VII Optimization Models and Simulation

18.3.3 Solution of the Classic Transportation Problem
The classic transportation problem will be solved in two ways. First, we will use the transportation algorithm, which is a simplification of the Simplex method presented in Chapter 17. Then, in Section 18.3.3.2, we will present a solution of the problem using Excel Solver.
18.3.3.1 The Transportation Algorithm

In order to facilitate resolution of the classic transportation problem using the methods presented in this section, the problem should be represented in table form. Box 18.1 presents the tabular form of the general balanced transportation model expressed in Expression (18.2). The transportation algorithm follows the same logic as the Simplex method presented in Chapter 17, with some simplifications in light of the peculiarities of the transportation problem. Fig. 18.12 presents each one of the steps of the transportation algorithm.

BOX 18.1 General Tabular Form of the Balanced Transportation Problem
FIG. 18.12 Transportation algorithm.
Initially: The problem must be balanced (total inflow equal to total outflow) and represented in table form, as specified in Box 18.1.
Step 1. Find the initial feasible basic solution (FBS). To do that, we will present three methods: the northwest corner method, the minimum cost method, and the Vogel approximation method.
Step 2. Optimality test. To verify whether the solution found is optimal, we employ the multiplier method, which is based on duality theory. We apply the optimality condition of the Simplex method to the transportation problem. If the condition is satisfied, the algorithm finishes here. If not, we determine a better adjacent FBS.
Iteration: Determine a better adjacent FBS. To find the new feasible basic solution, three steps should be taken:
1. Determine the nonbasic variable that will go into the base, using the multiplier method.
2. Choose the basic variable that will go into the set of nonbasic variables, using the feasibility condition of the Simplex method.
3. Recalculate the new basic solution.
The elementary operations utilized in the Simplex method to recalculate the values of the new adjacent basic solution are not necessary, given that the new solution may be easily obtained using the table form of the transportation problem. Each one of the steps of the transportation algorithm presented in Fig. 18.12 will be detailed and later applied to solve the transportation problem of the Karpet Ltd. company (Example 18.1).

Example 18.4
Represent the Karpet Ltd. company transportation problem (Example 18.1) in the table form expressed in Box 18.1.

Solution
Using data from the balanced Karpet Ltd. company transportation problem presented in Table 18.E.1, one may easily obtain its tabular form, as shown in Table 18.E.4.
TABLE 18.E.4 Representation of the Balanced Karpet Ltd. Company Transportation Problem in Table Form

                   Consumer
              1      2      3    Capacity
Supplier 1   12     22     30      100
         2   18     24     32      140
         3   22     15     34      160
Demand      120    130    150
Step 1. Determining the Initial Feasible Basic Solution
The classic transportation problem considers a set of m suppliers and n consumers. Using Expression (18.2), it was found that the balanced transportation problem contains m + n equality constraints. Because in the balanced transportation problem the total inflow is equal to the total outflow, we can affirm that one of those constraints is redundant, so that the model contains m + n − 1 independent equations and, consequently, m + n − 1 basic variables. For the Karpet Ltd. company transportation problem, in which m = 3 and n = 3, we therefore have 5 basic variables. We first see how to find an initial FBS using the northwest corner method, followed by the minimum cost method and the Vogel approximation method.

Northwest Corner Method
The northwest corner method follows these steps:
Initially: Represent the transportation problem in initial tabular form (see Box 18.1). In this method, the transportation costs need not be specified, given that they are not utilized in the algorithm.
Step 1: Select the cell located in the upper left corner (northwest), among the cells not yet allocated in Step 2 and the cells not yet blocked in Step 3. Therefore, x11 will always be the first variable selected.
Step 2: Allocate the largest possible amount of product to that cell, so that the sum of the corresponding cells in the same row and in the same column does not exceed the total supply and total demand capacities, respectively.
Step 3: Using the cell selected in the previous step, block (marking with an x) the cells corresponding to the same row or column that reached the maximum limit of supply or demand, respectively, given that no value other than zero may be attributed to those cells. If the maximum limit is reached on both the row and the column, only one of them must be blocked. That condition guarantees that there will be basic variables with null values.
The algorithm finishes when all of the cells have been allocated or blocked. Otherwise, return to Step 1.
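The steps above can be sketched directly in code. A minimal Python illustration (the function name and data layout are my own, not the book's; the data are those of the Karpet Ltd. problem in Table 18.E.4):

```python
def northwest_corner(supply, demand):
    """Initial FBS by the northwest corner method (Fig. 18.12, Step 1)."""
    supply, demand = list(supply), list(demand)
    alloc = [[0] * len(demand) for _ in supply]
    i = j = 0
    while i < len(supply) and j < len(demand):
        q = min(supply[i], demand[j])   # Step 2: largest feasible amount
        alloc[i][j] = q
        supply[i] -= q
        demand[j] -= q
        if supply[i] == 0:              # Step 3: row exhausted, move down
            i += 1
        else:                           # column exhausted, move right
            j += 1
    return alloc

# Karpet Ltd. data (Table 18.E.4)
cost = [[12, 22, 30], [18, 24, 32], [22, 15, 34]]
alloc = northwest_corner([100, 140, 160], [120, 130, 150])
z = sum(cost[i][j] * alloc[i][j] for i in range(3) for j in range(3))
print(alloc, z)  # x11=100, x21=20, x22=120, x32=10, x33=150; z = 9690
```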
Example 18.5
Apply the northwest corner method to the Karpet Ltd. company problem to obtain an initial FBS.

Solution
The initial tabular form of the Karpet Ltd. company problem, for applying the northwest corner method, is represented in Table 18.E.5 and is similar to Table 18.E.4, but without the unit transportation costs.
TABLE 18.E.5 Initial Tabular Form of the Karpet Ltd. Company Problem for the Northwest Corner Method

                   Consumer
              1      2      3    Capacity
Supplier 1                         100
         2                         140
         3                         160
Demand      120    130    150
The three steps of the first round are described and represented in Table 18.E.6.
Step 1: Select cell x11, located in the upper left (northwest) corner.
Step 2: One may see, from Table 18.E.5, that the total capacity of supplier 1 (Osasco) is 100. For its part, the demand of consumer 1 (Sao Paulo) is 120, so the maximum value to be allocated in that cell is the minimum of those two values: x11 = min{100, 120} = 100.
Step 3: The maximum capacity limit of supplier 1 has been reached, so the cells corresponding to the same row (x12 and x13) should be blocked.
TABLE 18.E.6 The Three Steps of the First Round

                   Consumer
              1      2      3    Capacity
Supplier 1   100     x      x      100
         2                         140
         3                         160
Demand      120    130    150
The same logic is applied to the second round.
Step 1: Among the remaining cells, select the one located in the northwest corner (x21).
Step 2: Fixing column 1, relating to the Sao Paulo consumer, the maximum amount that may be allocated in cell x21 is 20, since the sum of the quantities allocated in the cells of that column may not exceed that consumer's demand of 120 units. Fixing row 2, one may allocate up to 140 units in that same cell. Therefore, x21 = min{20, 140} = 20.
Step 3: Cell x31 should be blocked, since the maximum demand limit of column 1 has been reached.
For the third round, one has the following steps:
Step 1: Among the remaining cells, select the one located in the northwest corner (x22).
Step 2: Fixing row 2, referring to the Sorocaba supplier, the maximum amount that may be allocated in x22 is 120, given that the sum of the quantities allocated in all of the cells of that row may not exceed that supplier's capacity of 140 units. Fixing column 2, one may allocate up to 130 units in that same cell. Therefore, x22 = min{120, 130} = 120.
Step 3: Cell x23 should be blocked, since the maximum capacity limit of row 2 has been reached.
TABLE 18.E.7 Result of the Second Round

                   Consumer
              1      2      3    Capacity
Supplier 1   100     x      x      100
         2    20                   140
         3     x                   160
Demand      120    130    150
TABLE 18.E.8 Result of the Third Round

                   Consumer
              1      2      3    Capacity
Supplier 1   100     x      x      100
         2    20    120     x      140
         3     x                   160
Demand      120    130    150
In the last two rounds, Step 3 will not be applied, given that the other cells belonging to the row or column that reached the maximum capacity or demand, respectively, have already been blocked in previous rounds. In the next-to-last round, one selects cell x32 and allocates to it the maximum possible amount of 10 units. Finally, in the last round, one allocates the remaining amount of 150 units to cell x33. The initial FBS of the northwest corner method is listed in Table 18.E.9.
TABLE 18.E.9 Final Result of the Northwest Corner Method

                   Consumer
              1      2      3    Capacity
Supplier 1   100     0      0      100
         2    20    120     0      140
         3     0     10    150     160
Demand      120    130    150
The basic solution is, therefore: x11 = 100, x21 = 20, x22 = 120, x32 = 10, and x33 = 150, with z = 9,690. Nonbasic variables: x12 = 0, x13 = 0, x23 = 0, and x31 = 0.

Minimum Cost Method
The minimum cost method is an adaptation of the northwest corner method in which, instead of selecting the cell closest to the northwest corner, one selects the one with the lowest cost. The complete minimum cost algorithm is detailed as follows.
Initially: Represent the transportation problem in the initial tabular form of Box 18.1.
Step 1: Select the cell with the lowest possible cost, among the cells not yet allocated in Step 2 and the cells not yet blocked in Step 3.
Step 2: Allocate the largest possible amount of product to that cell, so that the sum of the corresponding cells in the same row and in the same column does not exceed the total supply and total demand capacities, respectively.
Step 3: Starting with the cell selected in the previous step, block (marking with an x) the cells corresponding to the same row or column that reached the maximum limit of supply or demand, respectively. Analogous to the northwest corner method, if the maximum limit is reached on both the row and the column, only one of them must be blocked.
The algorithm finishes when all of the cells have been allocated or blocked. Otherwise, return to Step 1.
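These steps translate to a short routine as well. A sketch (function name mine; since costs drive the cell order, visiting the cells sorted by cost and allocating whatever remains is equivalent to the loop above for this data):

```python
def minimum_cost(cost, supply, demand):
    """Initial FBS by the minimum cost method: visit cells in cost order."""
    supply, demand = list(supply), list(demand)
    m, n = len(supply), len(demand)
    alloc = [[0] * n for _ in range(m)]
    for c, i, j in sorted((cost[i][j], i, j) for i in range(m) for j in range(n)):
        q = min(supply[i], demand[j])  # 0 if the row or column is exhausted
        alloc[i][j] = q
        supply[i] -= q
        demand[j] -= q
    return alloc

cost = [[12, 22, 30], [18, 24, 32], [22, 15, 34]]  # Karpet Ltd. (Table 18.E.4)
alloc = minimum_cost(cost, [100, 140, 160], [120, 130, 150])
z = sum(cost[i][j] * alloc[i][j] for i in range(3) for j in range(3))
print(alloc, z)  # x11=100, x21=20, x23=120, x32=130, x33=30; z = 8370
```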
Example 18.6
Apply the minimum cost method to the Karpet Ltd. company problem to obtain an initial FBS.

Solution
Consider the balanced Karpet Ltd. company transportation problem in its initial tabular form (Table 18.E.4). The three steps of the first round are described now and represented in Table 18.E.10.
Step 1: Select cell x11, which is the one with the lowest cost.
Step 2: Analogous to the northwest corner method, the largest possible amount to be allocated in that cell is 100 = min{100, 120}.
Step 3: The maximum capacity limit of supplier 1 has been reached, so the cells corresponding to the same row (x12 and x13) should be blocked.
TABLE 18.E.10 The Three Steps of the First Round (unit costs in parentheses)

                     Consumer
                1         2         3      Capacity
Supplier 1   100 (12)   x (22)    x (30)     100
         2       (18)     (24)      (32)     140
         3       (22)     (15)      (34)     160
Demand         120       130       150
The same logic is applied to the second round, and the result is presented in Table 18.E.11.
Step 1: Among the remaining cells, select the one with the lowest unit cost (x32).
Step 2: The maximum amount that may be allocated in that cell is 130 = min{130, 160}.
Step 3: Cell x22 should be blocked, since the maximum demand limit of column 2 has been reached.
TABLE 18.E.11 Result of the Second Round (unit costs in parentheses)

                     Consumer
                1         2         3      Capacity
Supplier 1   100 (12)   x (22)    x (30)     100
         2       (18)   x (24)      (32)     140
         3       (22) 130 (15)      (34)     160
Demand         120       130       150
For the third round, one has the following steps:
Step 1: Among the remaining cells, select the one with the lowest cost (x21).
Step 2: Fixing row 2, referring to the Sorocaba supplier, one could allocate up to 140 units in x21. However, fixing column 1, relating to the Sao Paulo consumer, the maximum limit is 20 units, given that the sum of the quantities allocated in all of the cells of that column may not exceed that consumer's demand of 120 units. Therefore, x21 = min{20, 140} = 20.
Step 3: Cell x31 should be blocked, since the maximum limit of column 1 has been reached.
TABLE 18.E.12 Result of the Third Round (unit costs in parentheses)

                     Consumer
                1         2         3      Capacity
Supplier 1   100 (12)   x (22)    x (30)     100
         2    20 (18)   x (24)      (32)     140
         3     x (22) 130 (15)      (34)     160
Demand         120       130       150
Analogous to the northwest corner method, in the last two rounds Step 3 is not applied, given that the cells belonging to the row or column that reached the maximum capacity or demand, respectively, have already been blocked. In the next-to-last round, one selects cell x23 and allocates 120 units to it. Finally, in the last round, one allocates the remaining 30 units to cell x33. The initial FBS of the minimum cost method is represented in Table 18.E.13.
TABLE 18.E.13 Result of the Minimum Cost Method (unit costs in parentheses)

                     Consumer
                1         2         3      Capacity
Supplier 1   100 (12)   0 (22)    0 (30)     100
         2    20 (18)   0 (24)  120 (32)     140
         3     0 (22) 130 (15)   30 (34)     160
Demand         120       130       150
The basic solution is, therefore: x11 = 100, x21 = 20, x23 = 120, x32 = 130, and x33 = 30, with z = 8,370. Nonbasic variables: x12 = 0, x13 = 0, x22 = 0, and x31 = 0.

Vogel Approximation Method
According to Taha (2016), the Vogel approximation method is an improved version of the minimum cost method that generally leads to better initial solutions. The detailed steps of the algorithm are:
Initially: Represent the transportation problem in the initial tabular form of Box 18.1.
Step 1: For each row (and column), calculate the penalty, which corresponds to the difference between the two smallest unit transportation costs in the respective row (and column). The penalty for a row (column) is calculated as long as there are at least two cells not yet allocated and not blocked in the same row (column).
Step 2: Choose the row or column with the highest penalty. In case of a tie, randomly choose any one of them. In the row or column selected, choose the cell with the lowest cost.
Step 3: As with the northwest corner and minimum cost methods, allocate the largest possible amount of product to this cell, so that the sum of the corresponding cells in the same row and in the same column does not exceed the total supply and total demand capacities, respectively.
Step 4: Analogous to the northwest corner and minimum cost methods, using the cell selected in the previous step, block (marking with an x) the cells corresponding to the same row or column that reached the maximum limit of supply or demand, respectively. If the maximum limit is reached on both the row and the column, only one of them must be blocked. As long as there is more than one cell that is neither allocated nor blocked, return to Step 1. Otherwise, go to Step 5.
Step 5: Allocate the remaining capacity or demand to this last cell.
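A code sketch of these steps follows (function name and tie-breaking rules are mine; rows or columns with a single remaining cell are scored by that cell's cost, a simplification of the book's Step 5). For the Karpet Ltd. data it reproduces the same initial FBS as the tables below:

```python
def vogel(cost, supply, demand):
    """Initial FBS by the Vogel approximation method: penalty = gap between
    the two smallest remaining costs in each row/column (Steps 1-5)."""
    supply, demand = list(supply), list(demand)
    rows, cols = set(range(len(supply))), set(range(len(demand)))
    alloc = [[0] * len(demand) for _ in supply]
    while rows and cols:
        best = None                      # (penalty, -cost, i, j)
        for i in rows:                   # row penalties
            cs = sorted(cost[i][j] for j in cols)
            pen = cs[1] - cs[0] if len(cs) > 1 else cs[0]
            j = min(cols, key=lambda j, i=i: cost[i][j])
            best = max(best or (-1,), (pen, -cost[i][j], i, j))
        for j in cols:                   # column penalties
            cs = sorted(cost[i][j] for i in rows)
            pen = cs[1] - cs[0] if len(cs) > 1 else cs[0]
            i = min(rows, key=lambda i, j=j: cost[i][j])
            best = max(best, (pen, -cost[i][j], i, j))
        _, _, i, j = best
        q = min(supply[i], demand[j])    # allocate as much as possible
        alloc[i][j] = q
        supply[i] -= q
        demand[j] -= q
        if supply[i] == 0:
            rows.discard(i)              # block the exhausted row
        elif demand[j] == 0:
            cols.discard(j)              # or the exhausted column
    return alloc

cost = [[12, 22, 30], [18, 24, 32], [22, 15, 34]]  # Karpet Ltd. (Table 18.E.4)
alloc = vogel(cost, [100, 140, 160], [120, 130, 150])
z = sum(cost[i][j] * alloc[i][j] for i in range(3) for j in range(3))
print(z)  # 8370, the same initial FBS cost as the minimum cost method here
```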
Example 18.7
Apply the Vogel approximation method to the Karpet Ltd. company problem to obtain an initial FBS.

Solution
All of the steps of the first round are represented in Table 18.E.14. First, the penalties for each row and column were calculated. One may see that the largest penalty occurred in row 1. One selects cell x11, which is the one with the lowest cost in row 1. The next step consists of allocating the largest possible amount of product to this cell, which is 100 = min{100, 120}. The other cells of row 1 are blocked, since the capacity limit of that supplier has been reached. The result of the first round is highlighted in gray.
TABLE 18.E.14 First Round of the Vogel Approximation Method (unit costs in parentheses)

                     Consumer
                1         2         3      Capacity   Row penalty
Supplier 1   100 (12)   x (22)    x (30)     100      22 − 12 = 10
         2       (18)     (24)      (32)     140      24 − 18 = 6
         3       (22)     (15)      (34)     160      22 − 15 = 7
Demand         120       130       150
Column penalty  18−12=6  22−15=7   32−30=2
The same process is repeated for the second round (see Table 18.E.15). First, the new penalties for each column and for rows 2 and 3 are calculated. The largest penalty this time is in column 2. Cell x32 is selected as the one with the lowest cost in column 2 and is allocated the largest possible amount of product, which is 130 = min{130, 160}. Cell x22 is also blocked, given that the total demand of consumer 2 was met. The new cells allocated and blocked in the second round are highlighted in gray.
TABLE 18.E.15 Second Round of the Vogel Approximation Method (unit costs in parentheses)

                     Consumer
                1         2         3      Capacity   Row penalty
Supplier 1   100 (12)   x (22)    x (30)     100
         2       (18)   x (24)      (32)     140      24 − 18 = 6
         3       (22) 130 (15)      (34)     160      22 − 15 = 7
Demand         120       130       150
Column penalty  22−18=4  24−15=9   34−32=2
In the third round (see Table 18.E.16), one first calculates the new penalties for rows 2 and 3 and for columns 1 and 3. One may see that the largest penalty is in row 2. Among the remaining cells, the cell with the lowest cost in row 2 is x21. Fixing column 1, the maximum amount that may be allocated in x21 is 20, given that the sum of the quantities allocated in all of the cells of that column may not exceed that consumer's demand of 120 units. Fixing row 2, one may allocate up to 140 units in that same cell. Therefore, x21 = min{20, 140} = 20. Cell x31 is blocked, given that the total demand of consumer 1 was met. The result of the third round is highlighted in gray.
TABLE 18.E.16 Third Round of the Vogel Approximation Method
There now only remains the calculation of the penalty for column 3. One chooses the cell with the lowest cost in that column: x23 is chosen and allocated 120 units. Finally, one allocates 30 units to the last cell, x33. The initial FBS of the Vogel approximation method is illustrated in Table 18.E.17.
TABLE 18.E.17 Initial Feasible Basic Solution Obtained by the Vogel Approximation Method (unit costs in parentheses)

                     Consumer
                1         2         3      Capacity
Supplier 1   100 (12)   0 (22)    0 (30)     100
         2    20 (18)   0 (24)  120 (32)     140
         3     0 (22) 130 (15)   30 (34)     160
Demand         120       130       150
The basic solution is, therefore: x11 = 100, x21 = 20, x23 = 120, x32 = 130, and x33 = 30, with z = 8,370. Note that this solution is the same as the one obtained by the minimum cost method.

Step 2. Optimality Test
To verify whether the solution found is optimal, we employ the method of multipliers, which is based on duality theory. One associates with each row i and each column j the multipliers ui and vj, respectively. The reduced cost of variable xij, denoted c̄ij, is given by the following equation:

c̄ij = ui + vj − cij    (18.4)

Since the reduced costs of the basic variables are null, Expression (18.4) states that:

ui + vj = cij, for each basic variable xij    (18.5)

Since the model contains m + n − 1 independent equations and, consequently, m + n − 1 basic variables, to solve the system of equations represented by Expression (18.5) with m + n unknowns, one must arbitrarily attribute the value zero to one of the multipliers; for example, u1 = 0.
After calculating the multipliers, one may determine the reduced costs of the nonbasic variables from Expression (18.4). For the transportation problem (a minimization problem), the current solution is optimal if, and only if, the reduced costs of all the nonbasic variables are nonpositive:

ui + vj − cij ≤ 0, for each nonbasic variable xij    (18.6)

As long as there is at least one nonbasic variable with positive reduced cost, there is a better adjacent feasible basic solution (FBS).

Iteration. Determine a Better Adjacent FBS
To find the new feasible basic solution, three steps should be taken:
1. Determine the nonbasic variable that will go into the base, using the method of multipliers. The nonbasic variable xij selected is the one with the greatest reduced cost (greatest value of ui + vj − cij).
2. Choose the basic variable that will leave the base (see the explanation later).
3. Recalculate the new basic solution. Unlike the Simplex method, this calculation may be done directly using the table form of the transportation problem.
The choice of the variable that leaves the base and the calculation of the new basic solution may be obtained by constructing a closed cycle that begins and ends in the nonbasic variable chosen to enter the base (Step 1). The cycle consists of a sequence of horizontal and vertical segments connected to each other (diagonal movements are not permitted), in which each corner is associated with a basic variable, with the exception of the nonbasic variable selected. There is only one closed cycle that may be constructed under those conditions. With the closed cycle constructed, the next step consists of determining the variable that will leave the base. Among the corners next to the nonbasic variable xij (horizontally or vertically), one chooses the basic variable with the lowest value, given that the capacity constraint of supplier i and the demand of consumer j must be respected. In case of a tie, one of them is chosen randomly. Finally, one recalculates the new basic solution. First, the value corresponding to the basic variable chosen to leave the base is attributed to the new basic variable xij. The variable that leaves the base thus assumes the value of zero. The new values of the basic variables of the closed cycle should also be recalculated, so that the required supply capacities and demands continue to be satisfied.
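The multiplier computation in Step 2 is mechanical and easy to script. A sketch (function name and 0-based indexing are mine) that solves ui + vj = cij over the basic cells, starting from u1 = 0, and then scores the nonbasic cells:

```python
def multipliers(cost, basis):
    """Solve u_i + v_j = c_ij over the basic cells, fixing u_0 = 0."""
    u, v = {0: 0}, {}
    changed = True
    while changed:                       # propagate along the basis tree
        changed = False
        for i, j in basis:
            if i in u and j not in v:
                v[j] = cost[i][j] - u[i]
                changed = True
            elif j in v and i not in u:
                u[i] = cost[i][j] - v[j]
                changed = True
    return u, v

cost = [[12, 22, 30], [18, 24, 32], [22, 15, 34]]
basis = {(0, 0), (1, 0), (1, 1), (2, 1), (2, 2)}  # NW-corner FBS of Example 18.5
u, v = multipliers(cost, basis)
# Reduced costs of the nonbasic cells: a positive value signals improvement
reduced = {(i, j): u[i] + v[j] - cost[i][j]
           for i in range(3) for j in range(3) if (i, j) not in basis}
entering = max(reduced, key=reduced.get)
print(u, v, entering)  # u2=6, u3=-3; v=(12, 18, 37); entering variable x23
```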
Example 18.8
Starting from the initial basic solution obtained by the northwest corner method for the Karpet Ltd. company problem (Example 18.5), determine the optimal solution using the transportation algorithm.

Solution
Each one of the steps of the transportation algorithm will be applied to determine the optimal solution of the problem studied. As the initial FBS, we will use the one obtained by the northwest corner method.
Step 1. Initial FBS Obtained by the Northwest Corner Method
The initial solution of the northwest corner method obtained in Example 18.5, including the unit transportation costs of each cell, is represented in Table 18.E.18.
TABLE 18.E.18 Initial FBS of the Northwest Corner Method, Including the Unit Transportation Costs (in parentheses)

                     Consumer
                1         2         3      Capacity
Supplier 1   100 (12)   0 (22)    0 (30)     100
         2    20 (18) 120 (24)    0 (32)     140
         3     0 (22)  10 (15)  150 (34)     160
Demand         120       130       150
Step 2. Optimality Test
For each basic variable xij, write the equation ui + vj = cij (Expression (18.5)):
For x11: u1 + v1 = 12
For x21: u2 + v1 = 18
For x22: u2 + v2 = 24
For x32: u3 + v2 = 15
For x33: u3 + v3 = 34
Setting u1 = 0, one obtains the following results: v1 = 12, u2 = 6, v2 = 18, u3 = −3, and v3 = 37.
Using those multipliers, one determines the reduced costs of the nonbasic variables from Expression (18.4):
c̄12 = u1 + v2 − c12 = 0 + 18 − 22 = −4
c̄13 = u1 + v3 − c13 = 0 + 37 − 30 = 7
c̄23 = u2 + v3 − c23 = 6 + 37 − 32 = 11
c̄31 = u3 + v1 − c31 = −3 + 12 − 22 = −13
Since the reduced costs of the nonbasic variables x13 and x23 are positive, there is a better adjacent feasible basic solution (FBS). The nonbasic variable that will enter the base is x23, because it has the greatest reduced cost.

Iteration. Determine a Better Adjacent FBS
The closed cycle should be constructed to determine the variable that will leave the base and to calculate the new basic solution. That closed cycle must satisfy the following conditions: (a) begin and end in x23; (b) be formed by a sequence of horizontal and vertical segments connected to each other; and (c) each corner must be associated with a basic variable, with the exception of
TABLE 18.E.19 Construction of the Closed Cycle in the First Iteration (unit costs in parentheses; the cycle links x23 → x22 → x32 → x33 → x23)

                     Consumer
                1         2         3      Capacity
Supplier 1   100 (12)   0 (22)    0 (30)     100
         2    20 (18) 120 (24)    0 (32)     140
         3     0 (22)  10 (15)  150 (34)     160
Demand         120       130       150
variable x23. Table 18.E.19 presents the closed cycle that satisfies those conditions. With the closed cycle constructed, the next step consists of determining the variable that will leave the base. Among the corners next to the nonbasic variable x23 (horizontally or vertically), one chooses the basic variable x22, which has the lowest value (120 < 150), given that the capacity constraint of supplier 2 must be respected. Finally, one recalculates the new basic solution. First, one attributes the value of 120 from the leaving basic variable x22 to the new basic variable x23. The variable x22 that leaves the base assumes, therefore, the value of zero. To restore the balance of the closed cycle, one calculates the new values of the basic variables x32 and x33 (130 and 30, respectively). Table 18.E.20 illustrates the new adjacent FBS.
TABLE 18.E.20 Adjacent Basic Solution Obtained in the First Iteration (unit costs in parentheses)

                     Consumer
                1         2         3      Capacity
Supplier 1   100 (12)   0 (22)    0 (30)     100
         2    20 (18)   0 (24)  120 (32)     140
         3     0 (22) 130 (15)   30 (34)     160
Demand         120       130       150
Step 2. Optimality Test
For each basic variable xij, write the equation ui + vj = cij (Expression (18.5)):
For x11: u1 + v1 = 12
For x21: u2 + v1 = 18
For x23: u2 + v3 = 32
For x32: u3 + v2 = 15
For x33: u3 + v3 = 34
Setting u1 = 0, one obtains the following results: v1 = 12, u2 = 6, v3 = 26, u3 = 8, and v2 = 7.
Using those multipliers, one determines the reduced costs of the nonbasic variables through Expression (18.4):
c̄12 = u1 + v2 − c12 = 0 + 7 − 22 = −15
c̄13 = u1 + v3 − c13 = 0 + 26 − 30 = −4
c̄22 = u2 + v2 − c22 = 6 + 7 − 24 = −11
c̄31 = u3 + v1 − c31 = 8 + 12 − 22 = −2
Since the reduced costs of all the nonbasic variables are nonpositive, the current solution is optimal. The optimal solution is, therefore:
Basic solution: x11 = 100, x21 = 20, x23 = 120, x32 = 130, and x33 = 30, with z = 8,370.
Nonbasic variables: x12 = 0, x13 = 0, x22 = 0, and x31 = 0.
Note that this solution is the same as the initial solution obtained by the minimum cost and Vogel approximation methods.
18.3.3.2 Solution of the Transportation Problem Using Excel Solver

Examples 18.1, 18.2, and 18.3, referring to the classic transportation problem, will be solved in this section using Excel Solver.

Solution of the Karpet Ltd. company problem (Example 18.1):
Fig. 18.13 illustrates the representation of the Karpet Ltd. company transportation problem in an Excel spreadsheet (see file Example18.1_Karpet.xls). The equations utilized in Fig. 18.13 are specified in Box 18.2. Analogous to the examples from Chapter 17, names were attributed to the cells and cell ranges of Fig. 18.13 that will be referred to in the Solver, in order to facilitate understanding of the model. Box 18.3 presents the names attributed to the respective cells.
FIG. 18.13 Representation in Excel of the Karpet Ltd. company transportation problem.

Karpet Ltd.
Transport unit cost
                        Consumer
Supplier         Sao Paulo  Rio de Janeiro  Curitiba
Osasco               12           22           30
Sorocaba             18           24           32
Sao Sebastiao        22           15           34

Quantities_transported
                        Consumer
Supplier         Sao Paulo  Rio de Janeiro  Curitiba   Quantities_supplied       Capacity
Osasco                0            0            0               0            =      100
Sorocaba              0            0            0               0            =      140
Sao Sebastiao         0            0            0               0            =      160
Quantities_delivered  0            0            0
                      =            =            =
Demand               120          130          150          Total_cost (z): $0.00
BOX 18.2 Equations of Fig. 18.13

Cell   Equation
E16    =SUM(B16:D16)
E17    =SUM(B17:D17)
E18    =SUM(B18:D18)
B20    =SUM(B16:B18)
C20    =SUM(C16:C18)
D20    =SUM(D16:D18)
G22    =SUMPRODUCT(B7:D9,B16:D18)
BOX 18.3 Names Attributed to the Cells of Fig. 18.13

Name                    Cells
Quantities_transported  B16:D18
Quantities_supplied     E16:E18
Capacity                G16:G18
Quantities_delivered    B20:D20
Demand                  B22:D22
Total_cost              G22
The representation of the Karpet Ltd. company problem in the Solver Parameters dialog box is illustrated in Fig. 18.14. Since names were attributed to the cells of the model, they are referred to in Fig. 18.14 by their respective names. Note that the nonnegativity constraints were activated by selecting the Make Unconstrained Variables Non-Negative check box, and the Simplex LP engine was selected in the Solving Method box. The Options command remained unaltered. Finally, click on Solve and select the option Keep Solver Solution in the Solver Results dialog box. Fig. 18.15 presents the optimal solution of the Karpet Ltd. company transportation problem.

Solution of the Caramel Candy & Confetti company problem (Example 18.2):
The representation of the Caramel Candy & Confetti company transportation problem in an Excel spreadsheet is shown in Fig. 18.16 (see file Example18.2_Confetti.xls). Analogous to the Karpet Ltd. company problem (Example 18.1), this problem also considers three suppliers and three consumers. Note that the unit transportation costs, the quantities transported, supplied, and delivered, besides the capacity, demand, and total cost, are represented in the same cells as in Fig. 18.13. Thus, the equations and the names attributed to the cells of Fig. 18.16 are similar to those of Fig. 18.13 from the previous example. Because we are not faced with a balanced problem (the total supply capacity is greater than the total demand), the constraints may not all be written in equality form. The new constraints may be visualized in Fig. 18.16 and in the Solver Parameters dialog box, as shown in Fig. 18.17. Analogous to previous models, one assumes that the variables are nonnegative and that the model is linear. The optimal solution of the Caramel Candy & Confetti company transportation problem (Example 18.2) is illustrated in Fig. 18.18.
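The model the Solver handles can be cross-checked with any LP library. A sketch with scipy.optimize.linprog (assuming SciPy; the data are the Karpet Ltd. costs, capacities, and demands shown in Fig. 18.13):

```python
# Balanced Karpet Ltd. problem: supply and demand both hold with equality.
from scipy.optimize import linprog

cost = [12, 22, 30, 18, 24, 32, 22, 15, 34]  # Osasco, Sorocaba, Sao Sebastiao
supply = [100, 140, 160]
demand = [120, 130, 150]                     # balanced: 400 = 400

A_eq = ([[1 if k // 3 == i else 0 for k in range(9)] for i in range(3)] +
        [[1 if k % 3 == j else 0 for k in range(9)] for j in range(3)])
res = linprog(cost, A_eq=A_eq, b_eq=supply + demand,
              bounds=[(0, None)] * 9, method="highs")
print(round(res.fun))  # 8370, matching the transportation algorithm and Solver
```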
Solution of the Modified Caramel Candy & Confetti company problem (Example 18.3):
Example 18.3 is an adaptation of the previous Caramel Candy & Confetti example, in which the production capacities of the stores and client demand are changed, focusing on Case 2, in which the total supply capacity is less than the total demand. Thus, the supply constraints are represented in equality form, given that the entire supply capacity will be utilized.
FIG. 18.14 Solver Parameters regarding the Karpet Ltd. company problem.
FIG. 18.15 Solution of the Karpet Ltd. company transportation problem (Example 18.1) using Excel Solver.
FIG. 18.16 Representation in Excel of the Caramel Candy & Confetti company transportation problem.

Caramel Candy & Confetti
Transport unit cost
                     Consumer
Supplier     Sao Paulo  Baix. Santista  Vale Paraiba
Store 1          8            12             10
Store 2          4            10              6
Store 3          6            15             12

Quantities_transported
                     Consumer
Supplier     Sao Paulo  Baix. Santista  Vale Paraiba
Store 1          0             0              0
Store 2          0             0              0
Store 3          0             0              0
Quantities_supplied:  0, 0, 0
Quantities_delivered: 0 >= 60, 0 >= 70, 0 >= 30 (Demand: 60, 70, 30)
(4, 1), (4, 2), (4, 3), (4, 4), (4, 5), (4, 6), (5, 1), (5, 2), (5, 3), (5, 4), (5, 5), (5, 6), (6, 1), (6, 2), (6, 3), (6, 4), (6, 5), (6, 6)}
1/4  1/12  1/9  7/36  2/3  1/12
ANSWER KEYS: EXERCISES: CHAPTER 6
1) P(X ≤ 2) = C(150,0)·0.02^0·0.98^150 + C(150,1)·0.02^1·0.98^149 + C(150,2)·0.02^2·0.98^148 = 0.42
E(X) = 150 × 0.02 = 3
Var(X) = 150 × 0.02 × 0.98 = 2.94
2) P(X = 1) = C(10,1)·0.12·0.88^9 = 0.38
3) P(X = 5) = 0.125 × 0.875^4 = 0.073
E(X) = 8
Var(X) = 56
4) P(X = 33) = C(32,29)·0.95^30·0.05^3 = 1.33%
E(X) = 31.6 ≅ 32
5) P(X = 4) = 16.8%
6) a) P(X ≤ 12) = P(Z ≤ 0.67) = 1 − P(Z > 0.67) = 0.75
b) P(X < 5) = P(Z < −0.5) = P(Z > 0.5) = 0.3085
c) P(X > 2) = P(Z > −1) = P(Z < 1) = 1 − P(Z > 1) = 0.8413
d) P(6 < X ≤ 11) = P(−0.33 < Z ≤ 0.5) = [1 − P(Z > 0.5)] − P(Z > 0.33) = 0.3208
7) zc = 0.84
8) a) μ = np = 40 × 0.5 = 20
σ = √(np(1 − p)) = √(40 × 0.5 × 0.5) = 3.16
P(X = 22) ≅ P(21.5 < X < 22.5) = P(0.474 < Z < 0.791) = 0.103
b) P(X > 25.5) = P(Z > 1.74) = 4.09%
9) a) P(X > 120) = e^(−0.028 × 120) = 0.0347
b) P(X > 60) = e^(−0.028 × 60) = 0.1864
10) a) P(X > 220) = e^(−220/180) = 0.2946
b) P(X ≤ 150) = 1 − e^(−150/180) = 0.5654
11) a) P(X > 0.5) = e^(−1.8 × 0.5) = 0.4066
b) P(X ≤ 1.5) = 1 − e^(−1.8 × 1.5) = 0.9328
12) a) P(X > 2) = e^(−0.33 × 2) = 0.5134
b) P(X ≤ 2.5) = 1 − e^(−0.33 × 2.5) = 0.5654
13) 6.304
14) a) P(X > 25) = 0.07
b) P(X ≤ 32) = 0.99
c) P(25 < X ≤ 32) = P(X > 25) − P(X > 32) = 0.06
d) 28.845
e) 6.908
15) a) 2.086
b) E(T) = 0
c) Var(T) = 1.111
16) a) P(T > 3) = 0.0048
b) P(T ≤ 2) = 1 − P(T > 2) = 1 − 0.0344 = 0.9656
c) P(1.5 < T ≤ 2) = P(T > 1.5) − P(T > 2) = 0.0814 − 0.0344 = 0.0469
d) 1.345
e) 2.145
17) a) P(X > 3) = 0.05
b) 3.73
c) 4.77
d) E(X) = 1.14
e) Var(X) = 0.98
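The answers above come from probability tables or statistical software; as an informal plain-Python sanity check (a sketch, not part of the original solutions), the binomial figures from Exercise 1 and the exponential figures from Exercise 10 can be reproduced directly:

```python
from math import comb, exp

def binom_pmf(k, n, p):
    # Binomial pmf: C(n,k) * p^k * (1-p)^(n-k)
    return comb(n, k) * p**k * (1 - p)**(n - k)

def expon_sf(x, mean):
    # Exponential survival function with the given mean: P(X > x) = e^(-x/mean)
    return exp(-x / mean)

# Exercise 1: X ~ Binomial(n = 150, p = 0.02)
n, p = 150, 0.02
p_le_2 = sum(binom_pmf(k, n, p) for k in range(3))
print(round(p_le_2, 2))                  # 0.42
print(round(n * p * (1 - p), 2))         # Var(X) = 2.94

# Exercise 10: X ~ Exponential with mean 180
print(round(expon_sf(220, 180), 4))      # 0.2946
print(round(1 - expon_sf(150, 180), 4))  # 0.5654
```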
ANSWER KEYS: EXERCISES: CHAPTER 7
5) Simple random sampling without replacement.
6) Systematic sampling.
7) Stratified sampling.
8) Stratified sampling.
9) Two-stage cluster sampling.
10) By using Expression (7.8) (SRS to estimate the proportion of a finite population), we have n = 262.
11) By using Expression (7.9) (stratified sampling to estimate the mean of an infinite population), we have n = 1,255.
12) By using Expression (7.20) (one-stage cluster sampling to estimate the proportion of an infinite population), we have m = 35.
ANSWER KEYS: EXERCISES: CHAPTER 8
1) P(51 − 1.645 × 18/√120 < μ < 51 + 1.645 × 18/√120) = 90%
2) P(5,400 − 2.030 × 200/√36 < μ < 5,400 + 2.030 × 200/√36) = 95%
3) P(0.24 − 1.96 × √(0.24 × 0.76/500) < p < 0.24 + 1.96 × √(0.24 × 0.76/500)) = 95%
4) P(60 × 8/83.298 < σ² < 60 × 8/40.482) = 95%
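The interval in Exercise 1 is simple enough to evaluate numerically; a minimal Python sketch (not part of the original solution):

```python
from math import sqrt

# 90% confidence interval for the mean (Exercise 1): xbar = 51, sigma = 18, n = 120
xbar, sigma, n, z = 51, 18, 120, 1.645
half = z * sigma / sqrt(n)  # half-width of the interval
print(round(xbar - half, 2), round(xbar + half, 2))  # 48.3 53.7
```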
ANSWER KEYS: EXERCISES: CHAPTER 9
7) For the K-S and S-W tests, we have P = 0.200 and 0.151, respectively. Therefore, since P > 0.05, the distribution of the data is normal.
8) The data follow a normal distribution (P = 0.200 > 0.05).
9) The variances are homogeneous (P = 0.876 > 0.05, Levene's test).
10) Since σ is unknown, the most suitable test is Student's t:
Tcal = (65 − 60)/(3.5/√36) = 8.571; tc = 2.030; since Tcal > tc, we reject H0 (μ ≠ 60).
11) Tcal = 6.921 and P-value = 0.000 < 0.005, so we reject H0 (μ1 ≠ μ2).
12) Tcal = 11.953 and P-value = 0.000 < 0.025, so we reject H0 (μbefore ≠ μafter), i.e., there was improvement after the treatment.
13) Fcal = 2.476 and P-value = 0.1 > 0.05, so we do not reject H0 (there is no difference between the population means).
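The t statistic in Exercise 10 can be verified in a couple of lines (a sketch, not part of the original solution):

```python
from math import sqrt

# One-sample t statistic (Exercise 10): xbar = 65, mu0 = 60, s = 3.5, n = 36
t_cal = (65 - 60) / (3.5 / sqrt(36))
print(round(t_cal, 3))  # 8.571
```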
ANSWER KEYS: EXERCISES: CHAPTER 10
4) Sign test.
5) By applying the binomial test for small samples, since P = 0.503 > 0.05, we do not reject H0, concluding that there is no difference in consumers' preferences.
6) By applying the chi-square test, since χ²cal > χ²c (6.100 > 5.991) or P < α (0.047 < 0.05), we reject H0, concluding that there are differences in readers' preferences.
7) By applying the Wilcoxon test, since zcal < −zc (−3.135 < −1.645) or P < α (0.0085 < 0.05), we reject H0, concluding that the diet resulted in weight loss.
8) By applying the Mann-Whitney U test (the data do not follow a normal distribution), since |zcal| < zc (0.129 < 1.96) or P > α (0.897 > 0.05), we do not reject H0, concluding that the samples come from populations with equal medians.
9) By applying Cochran's Q test, since Qcal > Qc (8.727 > 7.378) or P < α (0.013 < 0.025), we reject H0, concluding that the proportion of students with high learning levels is not the same in each subject.
10) By applying the Friedman test, since Fcal > Fc (9.190 > 5.991) or P < α (0.010 < 0.05), we reject H0, concluding that there are differences between the three services.
ANSWER KEYS: EXERCISES: CHAPTER 11
1) a) Agglomeration Schedule (final stages)

Stage  Cluster 1  Cluster 2  Coefficients  First Appears (C1)  First Appears (C2)  Next Stage
 77        5         13          .006              39                  64              87
 78       40         56          .014              56                  53              88
 79       25         58          .014               0                  26              92
 80       30         55          .014              62                  61              86
 81       38         48          .014              75                  36              89
 82        1         15          .024              71                  55              91
 83        2         14          .024              72                  58              90
 84        6         83          .024              74                   0              95
 85        4          7          .024              76                  68              94
 86       30         42          .038              80                   0              91
 87        5         39          .038              77                  70              92
 88       29         40          .055              65                  78              96
 89       31         38          .075              69                  81              93
 90        2          3          .075              83                  73              93
 91        1         30          .153              82                  86              94
 92        5         25          .209              87                  79              95
 93        2         31          .246              90                  89              96
 94        1          4          .246              91                  85              97
 95        5          6          .723              92                  84              97
 96        2         29          .760              93                  88              98
 97        1          5         2.764              94                  95              98
 98        1          2         8.466              97                  96              99
 99        1          9       173.124              98                   0               0
From the agglomeration schedule, it is possible to verify that a big Euclidean distance leap occurs from the 98th stage (when only two clusters remain) to the 99th stage. Analyzing the dendrogram also helps in this interpretation.
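The "big leap" rule of thumb can be sketched in a few lines of Python (illustrative only; the coefficients are the final ones from the agglomeration schedule):

```python
# Coefficients from the last merging stages (95 to 99) of the agglomeration schedule
coefs = [0.723, 0.760, 2.764, 8.466, 173.124]
# Leap between consecutive stages; the largest one suggests where to stop merging
jumps = [b - a for a, b in zip(coefs, coefs[1:])]
print(jumps.index(max(jumps)))  # 3 -> the biggest leap is at the very last merge,
                                # so the two-cluster solution is retained
```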
b)
In fact, the solution with two clusters is highly advisable at this moment.
c) Yes. From the agglomeration schedule, it is possible to verify that observation 9 (Antonio) had not been clustered until the moment exactly before the last stage. From the dendrogram, it is also possible to verify that this student differs considerably from the others, which, in this case, results in the generation of only two clusters.
d) Agglomeration Schedule (final stages)

Stage  Cluster 1  Cluster 2  Coefficients  First Appears (C1)  First Appears (C2)  Next Stage
 77       13         34          .537              67                   0              86
 78       27         29          .537              62                  60              91
 79        1          4          .537              63                  69              85
 80       41         46          .754               0                   0              94
 81        6         82         1.103              72                   0              92
 82       30         55         1.103              58                  53              90
 83        5         74         1.584              68                   0              92
 84       16         57         1.584              55                  73              88
 85        1         38         1.584              79                  66              91
 86       13         39         1.584              77                  64              90
 87        2         15         2.045              74                  76              89
 88       14         16         2.149              61                  84              96
 89        2         28         2.149              87                  71              95
 90       13         30         3.091              86                  82              93
 91        1         27         3.091              85                  78              94
 92        5          6         4.411              83                  81              96
 93        9         13         4.835              75                  90              98
 94        1         41         7.134              91                  80              95
 95        1          2        10.292              94                  89              97
 96        5         14        12.374              92                  88              97
 97        1          5        18.848              95                  96              98
 98        1          9        26.325              97                  93               0
Yes, the new results show that there is one cluster rearrangement in the absence of observation Antonio. e) The existence of an outlier may cause other observations, not so similar to one another, to be allocated in the same cluster simply because they are all extremely different from the outlier. Therefore, reapplying the technique after deciding whether to exclude or keep outliers makes the new clusters better structured and generated with higher internal homogeneity.
2) a)
b) Agglomeration Schedule

Stage  Cluster 1  Cluster 2  Coefficients  First Appears (C1)  First Appears (C2)  Next Stage
  1        8         18         2.000              0                   0               5
  2        4         16         2.000              0                   0               4
  3        1         10         2.000              0                   0               6
  4        4          9         2.000              2                   0               5
  5        4          8         2.000              4                   1               7
  6        1          5         2.000              3                   0               7
  7        1          4         2.000              6                   5               8
  8        1          3         2.828              7                   0               9
  9        1          2         6.633              8                   0              17
 10       11         12        12.329              0                   0              11
 11        6         11        14.697              0                  10              15
 12       15         17        23.409              0                   0              13
 13        7         15        24.495              0                  12              14
 14        7         14        32.802             13                   0              16
 15        6         13        35.665             11                   0              16
 16        6          7        40.497             15                  14              17
 17        1          6        78.256              9                  16               0
From the agglomeration schedule, it is possible to verify that a big Euclidean distance leap occurs from the 16th stage (when only two clusters remain) to the 17th stage. Analyzing the dendrogram also helps in this interpretation.
c) [Dendrogram using single linkage (rescaled distance cluster combine, 0 to 25). Leaf order: stores 8, 18, 4, 16, 9, 1, 10, 5, 3, 2 (Regional 3); 15, 17, 7, 14 (Regional 1); 11, 12, 6, 13 (Regional 2).]
In fact, there are indications of two clusters of stores.
d) [Derived stimulus configuration (Euclidean distance model): two-dimensional chart, Dimension 1 versus Dimension 2, plotting the stores (Store01 to Store18).]
The two-dimensional chart generated through the multidimensional scaling allows us to see these two clusters and that one is more homogeneous than the other.
e) ANOVA

                                                          Cluster               Error
                                                     Mean Square   df    Mean Square   df       F      Sig.
Customers' average evaluation of services
  rendered (0 to 100)                                  10802.178    1        99.600    16    108.456   .000
Customers' average evaluation of the variety
  of goods (0 to 100)                                  12626.178    1       199.100    16     63.416   .000
Customers' average evaluation of the
  organization (0 to 100)                              18547.378    1       314.900    16     58.899   .000
The F tests should be used only for descriptive purposes because the clusters have been chosen to maximize the differences among cases in different clusters. The observed significance levels are not corrected for this and thus cannot be interpreted as tests of the hypothesis that the cluster means are equal.
It is possible to state that both clusters formed present statistically different means for the three variables considered in the study, at a significance level of 0.05 (Prob. F < 0.05). The variable considered most discriminating between the groups is the one with the highest F statistic, that is, services rendered (F = 108.456).
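The F statistics in the table are just the ratio of the two mean squares, which can be checked quickly (a Python sketch, not part of the original solution):

```python
# (cluster mean square, error mean square) for each variable in the ANOVA table
ms = [(10802.178, 99.600), (12626.178, 199.100), (18547.378, 314.900)]
# F = between-cluster mean square / within-cluster (error) mean square
f_stats = [round(between / within, 3) for between, within in ms]
print(f_stats)  # [108.456, 63.416, 58.899]
```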
f) Single Linkage × Cluster Number of Case Crosstabulation (count)

                     Cluster Number of Case
Single Linkage          1       2     Total
      1                10       0       10
      2                 0       8        8
Total                  10       8       18
Yes, there is correspondence between the allocations of the observations in the groups obtained through the hierarchical and k-means methods.
g) Yes, based on the dendrogram generated, it is possible to verify that all the stores that belong to regional center 3 form cluster 1, which has the lowest means for all the variables. This fact may determine some specific management action at these stores. After preparing a new cluster analysis without the stores from cluster 1 (regional center 3), the new agglomeration schedule and its corresponding dendrogram are obtained. From them, we can see the differences between the stores from regional centers 1 and 2 more clearly.

Agglomeration Schedule

Stage  Cluster 1  Cluster 2  Coefficients  First Appears (C1)  First Appears (C2)  Next Stage
  1       11         12        12.329              0                   0               2
  2        6         11        14.697              0                   1               6
  3       15         17        23.409              0                   0               4
  4        7         15        24.495              0                   3               5
  5        7         14        32.802              4                   0               7
  6        6         13        35.665              2                   0               7
  7        6          7        40.497              6                   5               0
[Dendrogram using single linkage (rescaled distance cluster combine, 0 to 25). Leaf order: stores 11, 12, 6, 13 (Regional 2); 15, 17, 7, 14 (Regional 1).]
3) a) Agglomeration Schedule

Stage  Cluster 1  Cluster 2  Coefficients  First Appears (C1)  First Appears (C2)  Next Stage
  1       18         33         1.000              0                   0               8
  2       19         34          .980              0                   0               7
  3       17         32          .980              0                   0               7
  4       16         31          .980              0                   0              21
  5       20         35          .960              0                   0              17
  6       23         27          .880              0                   0               9
  7       17         19          .880              3                   2              20
  8       18         26          .860              1                   0              11
  9       21         23          .860              0                   6              18
 10       11         14          .860              0                   0              18
 11       15         18          .853              0                   8              19
 12       13         30          .840              0                   0              14
 13       22         29          .840              0                   0              25
 14        2         13          .820              0                  12              19
 15        4          5          .820              0                   0              26
 16        6         24          .800              0                   0              28
 17       12         20          .800              0                   5              27
 18       11         21          .797             10                   9              24
 19        2         15          .793             14                  11              23
 20       17         25          .790              7                   0              25
 21        3         16          .790              0                   4              23
 22        1         10          .780              0                   0              30
 23        2          3          .770             19                  21              28
 24        9         11          .768              0                  18              27
 25       17         22          .764             20                  13              31
 26        4          8          .750             15                   0              32
 27        9         12          .749             24                  17              30
 28        2          6          .742             23                  16              33
 29        7         28          .740              0                   0              31
 30        1          9          .728             22                  27              34
 31        7         17          .727             29                  25              32
 32        4          7          .703             26                  31              33
 33        2          4          .513             28                  32              34
 34        1          2          .484             30                  33               0
Since it is a similarity measure, the values of the coefficients are in descending order in the agglomeration schedule. From this table, it is possible to verify that a considerable leap, in relation to the others, occurs from the 32nd stage (when three clusters are formed) to the 33rd clustering stage. Analyzing the dendrogram also helps in this interpretation.
b) [Dendrogram using average linkage between groups (rescaled distance cluster combine, 0 to 25). Leaf order: observations 18, 33, 26, 15, 13, 30, 2, 16, 31, 3, 6, 24, 4, 5, 8, 22, 29, 19, 34, 17, 32, 25, 7, 28, 1, 10, 20, 35, 12, 23, 27, 21, 11, 14, 9.]
In fact, the solution with three clusters is highly advisable.
c) Average Linkage (Between Groups) × Sector Crosstabulation (count)

Sector        Cluster 1   Cluster 2   Cluster 3
Health           11           0           0
Education         0          12           0
Transport         0           0          12
Yes, there is correspondence between the industries and the allocations of the companies in the clusters. That is, for the sample under analysis, we can state that companies from the same industry have similarities in relation to how their operations and decision-making processes are carried out, at least as regards the managers' perception.
4) a) Proximity Matrix — Correlation Between Vectors of Values (this is a similarity matrix)

Case     1       2       3       4       5       6       7       8       9      10      11      12      13      14      15      16
  1    1.000    .866  -1.000    .000    .998    .945   -.996    .000   1.000    .971  -1.000   -.500    .999    .997  -1.000    .327
  2     .866   1.000   -.866   -.500    .896    .655   -.908   -.500    .866    .721   -.856   -.866    .891    .822   -.881   -.189
  3   -1.000   -.866   1.000    .000   -.998   -.945    .996    .000  -1.000   -.971   1.000    .500   -.999   -.997   1.000   -.327
  4     .000   -.500    .000   1.000   -.064    .327    .091   1.000    .000    .240   -.020    .866   -.052    .082    .030    .945
  5     .998    .896   -.998   -.064   1.000    .922  -1.000   -.064    .998    .953   -.996   -.554   1.000    .989   -.999    .266
  6     .945    .655   -.945    .327    .922   1.000   -.911    .327    .945    .996   -.951   -.189    .926    .969   -.935    .619
  7    -.996   -.908    .996    .091  -1.000   -.911   1.000    .091   -.996   -.945    .994    .577   -.999   -.985    .998   -.240
  8     .000   -.500    .000   1.000   -.064    .327    .091   1.000    .000    .240   -.020    .866   -.052    .082    .030    .945
  9    1.000    .866  -1.000    .000    .998    .945   -.996    .000   1.000    .971  -1.000   -.500    .999    .997  -1.000    .327
 10     .971    .721   -.971    .240    .953    .996   -.945    .240    .971   1.000   -.975   -.277    .957    .987   -.963    .545
 11   -1.000   -.856   1.000   -.020   -.996   -.951    .994   -.020  -1.000   -.975   1.000    .483   -.997   -.998    .999   -.346
 12    -.500   -.866    .500    .866   -.554   -.189    .577    .866   -.500   -.277    .483   1.000   -.545   -.427    .526    .655
 13     .999    .891   -.999   -.052   1.000    .926   -.999   -.052    .999    .957   -.997   -.545   1.000    .991  -1.000    .277
 14     .997    .822   -.997    .082    .989    .969   -.985    .082    .997    .987   -.998   -.427    .991   1.000   -.994    .404
 15   -1.000   -.881   1.000    .030   -.999   -.935    .998    .030  -1.000   -.963    .999    .526  -1.000   -.994   1.000   -.298
 16     .327   -.189   -.327    .945    .266    .619   -.240    .945    .327    .545   -.346    .655    .277    .404   -.298   1.000
b) Agglomeration Schedule

Stage  Cluster 1  Cluster 2  Coefficients  First Appears (C1)  First Appears (C2)  Next Stage
  1        1          9         1.000              0                   0               6
  2        4          8         1.000              0                   0              11
  3        5         13         1.000              0                   0               6
  4        3         11         1.000              0                   0               5
  5        3         15         1.000              4                   0               7
  6        1          5          .999              1                   3               8
  7        3          7          .998              5                   0              15
  8        1         14          .997              6                   0              10
  9        6         10          .996              0                   0              10
 10        1          6          .987              8                   9              12
 11        4         16          .945              2                   0              13
 12        1          2          .896             10                   0              14
 13        4         12          .866             11                   0              14
 14        1          4          .619             12                  13              15
 15        1          3          .577             14                   7               0
Since Pearson's correlation is being used as a similarity measure between observations, the values of the coefficients appear in descending order in the agglomeration schedule. From this table, it is possible to verify that a leap that is relevant in relation to the others occurs from the 13th stage (when three clusters of weekly periods are formed) to the 14th clustering stage. Analyzing the dendrogram also helps in this interpretation.
c) [Dendrogram using single linkage (rescaled distance cluster combine, 0 to 25). Leaf order (week_month label: observations): week 1: 1, 9, 5, 13; week 2: 14, 6, 10, 2; week 4: 4, 8, 16, 12; week 3: 3, 11, 15, 7.]
In fact, the solution with three week clusters is highly advisable at this moment. Moreover, it is possible to verify that the second and third clusters are formed exclusively by the periods related to the third and fourth weeks of each month, respectively. This may offer subsidies to prove that there is recurrence of the joint behavior of banana, orange, and apple sales in these periods, for the data in this example. The following table shows the association between the variable week_month and the allocation of each observation to a certain cluster.

Single Linkage × week_month Crosstabulation (count)

week_month   Cluster 1   Cluster 2   Cluster 3
    1            4           0           0
    2            4           0           0
    3            0           4           0
    4            0           0           4
ANSWER KEYS: EXERCISES: CHAPTER 12
1) a) For each factor, we have the following eigenvalues:
Factor 1: (0.917)² + (0.874)² + (−0.844)² + (0.031)² = 2.318
Factor 2: (0.047)² + (0.077)² + (0.197)² + (0.979)² = 1.005
b) The proportions of variance shared by all the variables to form each factor are:
Factor 1: 2.318/4 = 0.580 (58.00%)
Factor 2: 1.005/4 = 0.251 (25.10%)
The total proportion of variance lost by the four variables to extract these two factors is:
1 − 0.580 − 0.251 = 0.169 (16.90%)
c) The proportions of variance shared to form both factors (communalities) are:
communality_age = (0.917)² + (0.047)² = 0.843
communality_fixedif = (0.874)² + (0.077)² = 0.770
communality_variableif = (−0.844)² + (0.197)² = 0.751
communality_people = (0.031)² + (0.979)² = 0.959
d) Based on the two factors extracted, the expressions of each standardized variable are:
Zage_i = 0.917·F1i + 0.047·F2i + ui, R² = 0.843
Zfixedif_i = 0.874·F1i + 0.077·F2i + ui, R² = 0.770
Zvariableif_i = −0.844·F1i + 0.197·F2i + ui, R² = 0.751
Zpeople_i = 0.031·F1i + 0.979·F2i + ui, R² = 0.959
e) [Loading plot (factor 1 on the X-axis, factor 2 on the Y-axis, both from −1 to 1): age and fixedif lie near the positive end of the first factor, variableif near its negative end, and people near the positive end of the second factor.]
f) While the variables age, fixedif, and variableif have a high correlation with the first factor (X-axis), the variable people has a strong correlation with the second factor (Y-axis). This may result from the fact that older customers, since they do not like taking risks, invest a great deal more in fixed-income funds, such as savings accounts or CDBs (Bank Deposit Certificates). On the other hand, even though the variable variableif has a high correlation with the first factor, its factor loading is negative. This shows that younger customers invest a great deal more in variable-income funds, such as stocks. Finally, the number of people who live in the household (variable people) has a low correlation with the other variables. Thus, it ends up having a high factor loading on the second factor.
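The eigenvalue and communality arithmetic of Exercise 1 can be checked in plain Python (a sketch, not part of the original solution; the negative sign of variableif's loading on the first factor follows the discussion in item f):

```python
# Loadings from Exercise 1 (rows = variables age, fixedif, variableif, people;
# columns = factors F1, F2). The sign of variableif's F1 loading is taken as negative.
loadings = [
    (0.917, 0.047),
    (0.874, 0.077),
    (-0.844, 0.197),
    (0.031, 0.979),
]
# Eigenvalue of each factor: sum of squared loadings down a column
eigenvalues = [sum(row[j]**2 for row in loadings) for j in range(2)]
# Communality of each variable: sum of squared loadings along a row
communalities = [l1**2 + l2**2 for l1, l2 in loadings]
print([round(e, 3) for e in eigenvalues])    # [2.318, 1.005]
print([round(c, 3) for c in communalities])  # [0.843, 0.77, 0.751, 0.959]
```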
2) a) KMO and Bartlett's Test

                                               Year 1    Year 2
Kaiser-Meyer-Olkin Measure of
  Sampling Adequacy                              .719      .718
Bartlett's Test     Approx. Chi-Square         89.637    86.483
  of Sphericity     df                              6         6
                    Sig.                         .000      .000
Based on the KMO statistic, we can state that the overall adequacy of the factor analysis is considered average for each of the years of study (KMO = 0.719 for the first year and KMO = 0.718 for the second year). In both periods, the χ²Bartlett statistic allows us to reject, at a significance level of 0.05 and based on the hypothesis of Bartlett's test of sphericity, that the correlation matrix is statistically equal to the identity matrix with the same dimension, since χ²Bartlett = 89.637 (Sig. χ²Bartlett < 0.05 for 6 degrees of freedom) for the first year and χ²Bartlett = 86.483 (Sig. χ²Bartlett < 0.05 for 6 degrees of freedom) for the second year. Therefore, the principal component analysis is adequate for each of the years of study.
b) YEAR 1 — Total Variance Explained

               Initial Eigenvalues                       Extraction Sums of Squared Loadings
Component    Total   % of Variance   Cumulative %      Total   % of Variance   Cumulative %
    1        2.589      64.718          64.718         2.589      64.718          64.718
    2         .730      18.247          82.965
    3         .536      13.391          96.357
    4         .146       3.643         100.000
Extraction Method: Principal Component Analysis.

YEAR 2 — Total Variance Explained

               Initial Eigenvalues                       Extraction Sums of Squared Loadings
Component    Total   % of Variance   Cumulative %      Total   % of Variance   Cumulative %
    1        2.566      64.149          64.149         2.566      64.149          64.149
    2         .737      18.435          82.584
    3         .543      13.577          96.162
    4         .154       3.838         100.000
Extraction Method: Principal Component Analysis.
Based on the latent root criterion, only one factor is extracted in each of the years, with the respective eigenvalues:
Year 1: 2.589
Year 2: 2.566
The proportion of variance shared by all the variables to form the factor each year is:
Year 1: 64.718%
Year 2: 64.149%
c) Component Matrix (one component extracted per year; extraction method: principal component analysis). Values are shown for the year 1 and year 2 versions of each variable, respectively.

                                                                 Year 1    Year 2
Corruption Perception Index (Transparency International)           .900      .899
Number of murders per 100,000 inhabitants
  (OMS, UNODC and GIMD)                                           -.614     -.608
Per capita GDP, using 2000 as the base year
  (in US$ adjusted for inflation) (World Bank)                     .911      .908
Average number of years in school per person
  over 25 years of age (IHME)                                      .755      .750

Communalities (initial = 1.000 for all variables; extraction method: principal component analysis)

                                                                 Year 1    Year 2
Corruption Perception Index                                        .810      .808
Number of murders per 100,000 inhabitants                          .378      .370
Per capita GDP                                                     .830      .825
Average number of years in school                                  .571      .563
We can see that slight reductions occurred in the communalities of all the variables from the first to the second year.
d) Component Score Coefficient Matrix (extraction method: principal component analysis)

                                                                 Year 1    Year 2
Corruption Perception Index                                        .348      .350
Number of murders per 100,000 inhabitants                         -.237     -.237
Per capita GDP                                                     .352      .354
Average number of years in school                                  .292      .292

Based on the standardized variables, the expression of the factor extracted each year is:
Year 1: Fi = 0.348·Zcpi1i − 0.237·Zviolence1i + 0.352·Zcapita_gdp1i + 0.292·Zschool1i
Year 2: Fi = 0.350·Zcpi2i − 0.237·Zviolence2i + 0.354·Zcapita_gdp2i + 0.292·Zschool2i
Even though only small changes occurred in the factor scores from one year to the next, this reinforces the importance of reapplying the technique to obtain factors with more precise and updated scores, mainly when they are used to create indexes and rankings.
e)
                      Year 1                                   Year 2
Ranking   Country                   Index        Country                   Index
   1      Switzerland              1.6923        Norway                   1.6885
   2      Norway                   1.6794        Switzerland              1.6594
   3      Denmark                  1.4327        Sweden                   1.4388
   4      Sweden                   1.4040        Denmark                  1.4225
   5      Japan                    1.3806        Japan                    1.3848
   6      United States            1.3723        Canada                   1.3844
   7      Canada                   1.3430        United States            1.3026
   8      United Kingdom           1.1560        United Kingdom           1.1321
   9      Netherlands              1.1086        Netherlands              1.1007
  10      Australia                1.0607        Australia                1.0660
  11      Germany                  1.0297        Germany                  1.0401
  12      Austria                  0.9865        Austria                  0.9903
  13      Ireland                  0.9439        Ireland                  0.9411
  14      New Zealand              0.9269        Singapore                0.9184
  15      Singapore                0.8781        New Zealand              0.9063
  16      Belgium                  0.8175        Belgium                  0.8265
  17      Israel                   0.6322        Israel                   0.6444
  18      France                   0.5545        France                   0.5448
  19      Cyprus                   0.5099        Cyprus                   0.4606
  20      United Arab Emirates     0.3157        United Arab Emirates     0.2849
  21      Czech Rep.               0.2244        Czech Rep.               0.1857
  22      Italy                    0.0859        Poland                   0.0868
  23      Poland                   0.0373        Spain                    0.0334
  24      Spain                    0.0303        Chile                    0.0170
  25      Chile                   -0.0517        Italy                    0.0064
  26      Greece                  -0.1432        Kuwait                  -0.1462
  27      Kuwait                  -0.2276        Greece                  -0.2247
  28      Portugal                -0.2980        Portugal                -0.2794
  29      Romania                 -0.3028        Romania                 -0.3150
  30      Oman                    -0.4742        Saudi Arabia            -0.4321
  31      Saudi Arabia            -0.5111        Oman                    -0.5034
  32      Serbia                  -0.5407        Argentina               -0.5342
  33      Argentina               -0.5556        Serbia                  -0.5544
  34      Turkey                  -0.6476        Malaysia                -0.6098
  35      Ukraine                 -0.7109        Turkey                  -0.6401
  36      Kazakhstan              -0.7423        Ukraine                 -0.6807
  37      Malaysia                -0.7459        Kazakhstan              -0.6970
  38      Lebanon                 -0.7966        Lebanon                 -0.8060
  39      Russia                  -0.8534        Russia                  -0.8513
  40      Mexico                  -0.8803        China                   -0.8982
  41      China                   -0.8840        Mexico                  -0.9323
  42      Egypt                   -0.9792        Egypt                   -0.9485
  43      Thailand                -1.0632        Thailand                -1.0800
  44      Indonesia               -1.2245        Indonesia               -1.2431
  45      India                   -1.2272        India                   -1.2533
  46      Brazil                  -1.3294        Brazil                  -1.3468
  47      Philippines             -1.3466        Philippines             -1.3885
  48      Venezuela               -1.3916        Venezuela               -1.4149
  49      South Africa            -1.8215        Colombia                -1.7697
  50      Colombia                -1.8534        South Africa            -1.9173
From the first to the second year, there were some changes in the relative positions of the countries in the ranking.
3) a) Correlation Matrix

Variables (perceptions, 0 to 10): 1 = variety of goods; 2 = quality and speed of inventory replacement; 3 = store's layout; 4 = thermal, acoustic, and visual comfort inside the store; 5 = store's general cleanliness; 6 = quality of the services rendered; 7 = store's prices compared to the competition; 8 = store's discount policy.

         1        2        3        4        5        6        7        8
1     1.000     .753     .898     .733     .640     .193     .084     .053
2      .753    1.000     .429     .633     .548     .208    -.449    -.367
3      .898     .429    1.000     .641     .567     .142     .413     .318
4      .733     .633     .641    1.000     .864     .227     .235     .174
5      .640     .548     .567     .864    1.000     .194     .220     .173
6      .193     .208     .142     .227     .194    1.000     .137     .113
7      .084    -.449     .413     .235     .220     .137    1.000     .906
8      .053    -.367     .318     .174     .173     .113     .906    1.000
Yes. Based on the magnitude of some of Pearson's correlation coefficients, it is possible to identify a first indication that the factor analysis may group the variables into factors.
b) KMO and Bartlett's Test

Kaiser-Meyer-Olkin Measure of Sampling Adequacy        .610
Bartlett's Test of Sphericity   Approx. Chi-Square     13,752.938
                                df                     28
                                Sig.                   .000
Yes. From the result of the χ²Bartlett statistic, it is possible to reject the hypothesis that the correlation matrix is statistically equal to the identity matrix with the same dimension, at a significance level of 0.05 and based on the hypothesis of Bartlett's test of sphericity, since χ²Bartlett = 13,752.938 (Sig. χ²Bartlett < 0.05 for 28 degrees of freedom). Therefore, the principal component analysis can be considered adequate.
c) Total Variance Explained

               Initial Eigenvalues                       Extraction Sums of Squared Loadings
Component    Total   % of Variance   Cumulative %      Total   % of Variance   Cumulative %
    1        3.825      47.812          47.812         3.825      47.812          47.812
    2        2.254      28.174          75.986         2.254      28.174          75.986
    3         .944      11.794          87.780
    4         .597       7.458          95.238
    5         .214       2.679          97.917
    6         .126       1.570          99.486
    7         .025        .313          99.799
    8         .016        .201         100.000
Extraction Method: Principal Component Analysis.
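The percentage columns follow from dividing each eigenvalue by the number of variables; a quick Python check (a sketch, not part of the original solution):

```python
# Eigenvalues reported for the two retained factors of Exercise 3
eigs = [3.825, 2.254]
# Share of total variance per factor = eigenvalue / number of observed variables (8)
shares = [e / 8 for e in eigs]
print([round(s, 3) for s in shares])  # [0.478, 0.282]
print(round(sum(shares), 3))          # 0.76 (i.e., roughly 75.99% of variance retained)
```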
Considering the latent root criterion, two factors are extracted, with the respective eigenvalues:
Factor 1: 3.825
Factor 2: 2.254
The proportion of variance shared by all the variables to form each factor is:
Factor 1: 47.812%
Factor 2: 28.174%
Thus, the total proportion of variance shared by all the variables to form both factors is equal to 75.986%.
d) The total proportion of variance lost by all the variables to extract these two factors is:
1 − 0.75986 = 0.24014 (24.014%)
e)
Communalities (initial = 1.000 for all variables; extraction method: principal component analysis)

                                                                        Extraction
Perception of the variety of goods (0 to 10)                               .873
Perception of the quality and speed of inventory replacement (0 to 10)     .914
Perception of the store's layout (0 to 10)                                 .766
Perception of thermal, acoustic and visual comfort inside the store        .827
Perception of the store's general cleanliness (0 to 10)                    .721
Perception of the quality of the services rendered (0 to 10)               .101
Perception of the store's prices compared to the competition (0 to 10)     .978
Perception of the store's discount policy (0 to 10)                        .900
Note that the loadings and the communality of the variable services rendered are considerably low. This may demonstrate the need to extract a third factor, which departs from the latent root criterion.
f) Communalities (initial = 1.000 for all variables; extraction method: principal component analysis)

                                                                        Extraction
Perception of the variety of goods (0 to 10)                               .887
Perception of the quality and speed of inventory replacement (0 to 10)     .917
Perception of the store's layout (0 to 10)                                 .804
Perception of thermal, acoustic and visual comfort inside the store        .828
Perception of the store's general cleanliness (0 to 10)                    .722
Perception of the quality of the services rendered (0 to 10)               .987
Perception of the store's prices compared to the competition (0 to 10)     .978
Perception of the store's discount policy (0 to 10)                        .900
Yes, it is possible to confirm the construct of the questionnaire proposed by the store’s general manager, because variables variety of goods, replacement, layout, comfort, and cleanliness have a stronger correlation with a specific factor, variables price and discounts, with another factor, and, finally, variable services rendered, with a third factor. g) To the detriment of the extraction based on the latent root criterion, the decision to extract three factors increases the communalities of the variables, highlighting the variable services rendered, now, more strongly correlated with the third factor. h)
Varimax rotation redistributes the variable loadings in each factor, which facilitates the confirmation of the construct proposed by the store's general manager.
i) [Component plot in rotated space: three-dimensional chart of components 1, 2, and 3 (axes from −1.0 to 1.0) showing the rotated loadings of discounts, prices, layout, cleanliness, assortment, comfort, services, and replacement.]
[Component plot in rotated space, second view: the same three-dimensional chart of components 1, 2, and 3 seen from another angle.]

ANSWER KEYS: EXERCISES: CHAPTER 13
1) a) Ŷ = −3.8563 + 0.3872·X
b) R² = 0.9250
c) Yes (P-value of t = 0.000 < 0.05).
d) 9.9595 billion dollars (we must make Y = 0 and solve the equation).
e) −3.8563% (we must make X = 0).
f) 0.4024% (mean); −1.2505% (minimum); 2.0554% (maximum)
2)
a) Yes, since the P-value of the F statistic < 0.05, we can state that at least one of the explanatory variables is statistically significant to explain the behavior of the variable cpi, at a significance level of 0.05.
b) Yes, since the P-values of both t statistics < 0.05, we can state that their parameters are statistically different from zero, at a significance level of 0.05. Therefore, the Stepwise procedure would not exclude any of the explanatory variables from the final model.
c) ĉpi_i = 15.1589 + 0.0700·age_i − 0.4245·hours_i
d) R² = 0.3177
e) By analyzing the signs of the final model's coefficients, for this cross-section, we can state that countries whose billionaires have lower average ages have lower cpi indexes, that is, a higher corruption perception by society. Besides, on average, a greater number of hours worked per week has a negative relationship with the variable cpi; that is, countries with a higher corruption perception (lower cpis) have a higher workload per week. It is important to mention that the countries with lower cpis are those considered emerging countries.
f) By using the Shapiro-Francia test, the most suitable for the size of this sample, we can see that the residuals follow a normal distribution, at a significance level of 0.05. We would have arrived at the same conclusion if the test used had been the Shapiro-Wilk.
g) From the Breusch-Pagan/Cook-Weisberg test, it is possible to verify that there is homoskedasticity in the model proposed.
h) Since the final model obtained does not have very high VIF statistics (1 − Tolerance = 0.058), we may consider that there are no multicollinearity problems.
3)
a) The difference between the average cpi value for emerging and for developed countries is 3.6318. That is, while emerging countries have an average cpi ¼ 4.0968, developed countries have an average cpi ¼ 7.7286. This is exactly the value of the cpi regression intercept based on variable emerging, since the dummy emerging for developed countries ¼ 0. Yes, this difference is statistically significant, at a significance level of 0.05, since the P-value of t statistic < 0.05 for the variable emerging. b)
cpî_i = 13.1701 - 0.1734 hours_i - 3.2238 emerging_i c)
cpî = 13.1701 - 0.1734 (37) - 3.2238 (1) = 3.5305
d)
cpî_min = 8.0092 - 0.3369 (37) - 4.0309 (1) = -8.4870
cpî_max = 18.3310 - 0.0099 (37) - 2.4168 (1) = 15.5479
Obviously, the confidence interval is extremely broad, and the negative lower bound makes no sense for cpi. This happened because the value of R² is not so high.
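The point prediction in item (c) and the interval bounds in item (d) are the same plug-in computation with different coefficient vectors; a minimal sketch:

```python
# Plug-in prediction for a model of the form cpi-hat = b0 + b1*hours + b2*emerging
def predict(b0, b1, b2, hours, emerging):
    return b0 + b1 * hours + b2 * emerging

point = predict(13.1701, -0.1734, -3.2238, 37, 1)   # item (c)
lower = predict(8.0092, -0.3369, -4.0309, 37, 1)    # interval lower bound, item (d)
upper = predict(18.3310, -0.0099, -2.4168, 37, 1)   # interval upper bound, item (d)
print(point, lower, upper)
```

The negative lower bound reproduces the arithmetic above and shows why the interval is too broad to be useful.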
e) cpî_i = 27.4049 - 5.7138 ln(hours_i) - 3.2133 emerging_i
f) Since the adjusted R² is slightly higher in the model with the nonlinear functional form (logarithmic functional form for the variable hours) than in the model with the linear functional form, we choose the nonlinear model estimated in item (e). Since, in both cases, neither the number of variables nor the sample size used changes, this analysis could also be carried out directly based on the values of R². 4) a)
choleŝterol_t = 136.7161 + 1.9947 bmi_t - 5.1635 sport_t
b) We can see that the body mass index has a positive relationship with the LDL cholesterol index, such that every time the index increases by one unit, on average, there is an increase of almost 2 mg/dL of the cholesterol commonly known as bad cholesterol, ceteris paribus. Analogously, increasing the frequency of physical activities per week by one unit makes the LDL cholesterol index drop, on average, by more than 5 mg/dL, ceteris paribus. Therefore, maintaining one's weight, or even losing weight, plus establishing a routine of weekly physical activities, may contribute to the establishment of a healthier life.
c)
Since, at a significance level of 0.05 and for a model with 3 parameters and 48 observations, we have DW = 0.938 < dL = 1.45, we can state that there is positive first-order autocorrelation between the error terms. d)
By analyzing the Breusch-Godfrey test, we can see that, besides the first-order autocorrelation between the error terms, there are also autocorrelation problems in the 3rd-, 4th-, and 12th-order residuals. This shows the existing seasonality in the executive's behavior regarding his body mass and engagement in physical activities.
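The Durbin-Watson statistic used in item (c) can be sketched as follows; the residual series here are synthetic, not the exercise data:

```python
# Durbin-Watson statistic: DW = sum((e_t - e_{t-1})^2) / sum(e_t^2).
# Values near 2 suggest no first-order autocorrelation; values well below 2
# (such as the 0.938 reported above) suggest positive autocorrelation.
def durbin_watson(residuals):
    num = sum((residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals)))
    den = sum(e * e for e in residuals)
    return num / den

print(durbin_watson([1.0, 2.0, 3.0, 4.0]))    # smooth, positively autocorrelated series: low DW
print(durbin_watson([1.0, -1.0, 1.0, -1.0]))  # alternating series: DW near 4
```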
ANSWER KEYS: EXERCISES: CHAPTER 14 1)
a) Yes. Since the P-value of the χ² statistic < 0.05, we can state that at least one of the explanatory variables is statistically significant to explain the probability of default, at a significance level of 0.05. b) Yes. Since the P-values of all Wald z statistics < 0.05, we can state that their respective parameters are statistically different from zero, at a significance level of 0.05. Therefore, no explanatory variable will be excluded from the final model. c)
p_i = 1 / (1 + e^-(2.97507 - 0.02433 age_i + 0.74149 gender_i - 0.00025 income_i))
d) Yes. Since the parameter estimated for the variable gender is positive, on average, male individuals (dummy = 1) have higher probabilities of default than female individuals, as long as the other conditions are kept constant. The chances of the event occurring will be multiplied by a factor greater than 1.
e) No. Older people, on average, tend to have lower probabilities of default, maintaining the remaining conditions constant, since the parameter of the variable age is negative; that is, the chances of the event occurring are multiplied by a factor less than 1 as age increases. f)
p = 1 / (1 + e^-[2.97507 - 0.02433 (37) + 0.74149 (1) - 0.00025 (6,850)]) = 0.7432
The average probability of default estimated for this individual is 74.32%. g)
The chance of default, as the income increases by one unit, is, on average and maintaining the remaining conditions constant, multiplied by a factor of 0.99974 (a chance 0.026% lower). h)
While the overall model efficiency is 77.40%, the sensitivity is 93.80% and the specificity is 30.23% (for a cutoff of 0.5).
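A minimal sketch of the computations behind items (f) and (h): the estimated probability is the logistic transformation of the linear predictor, and sensitivity/specificity come from a confusion matrix. The matrix counts below are toy numbers, not the exercise output; also note that with the coefficients rounded as printed here the probability comes out near 0.75, while the book's 0.7432 reflects the unrounded estimates:

```python
from math import exp

# Logistic probability with the rounded coefficients printed in item (c).
def default_probability(age, gender, income):
    z = 2.97507 - 0.02433 * age + 0.74149 * gender - 0.00025 * income
    return 1.0 / (1.0 + exp(-z))

# Confusion-matrix metrics; tp/fn/tn/fp below are toy counts, not the exercise data.
def classification_metrics(tp, fn, tn, fp):
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    overall = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, overall

print(default_probability(37, 1, 6850))
print(classification_metrics(tp=90, fn=10, tn=30, fp=70))
```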
2) a)
Only the category bad of the variable price was not statistically significant, at a significance level of 0.05, to explain the probability of occurrence of the event in which we are interested. That is, there are no differences that would change the probability of someone becoming loyal to the retailer between answering terrible or bad on their perception of prices, maintaining the remaining conditions constant. b)
c)
For a cutoff of 0.5, the overall model efficiency is 86.00%. d)
[Graph: sensitivity and specificity versus probability cutoff, both plotted from 0.00 to 1.00]
The cutoff from which the specificity becomes slightly higher than the sensitivity is equal to 0.57.
e) On average, the chance of becoming loyal to the establishment is multiplied by a factor of 5.39 when their perception of services rendered changes from terrible to bad. Whereas, from terrible to regular, this chance is multiplied by a factor of 6.17. From terrible to good, it is multiplied by a factor of 27.78, and, finally, from terrible to excellent, by a factor of 75.60. These answers will only be valid if the other conditions are kept constant. f) On average, the chance of becoming loyal to the establishment is multiplied by a factor of 6.43 when their perception of variety of goods changes from terrible to bad. Whereas, from terrible to regular, this chance is multiplied by a factor of 7.83. From terrible to good, it is multiplied by a factor of 28.09, and, finally, from terrible to excellent, by a factor of 381.88. Conversely, for the variable accessibility, on average, the chance of becoming loyal to the establishment is multiplied by a factor of 10.49 when their perception changes from terrible to bad. From terrible to regular, this chance is multiplied by a factor of 18.55. From terrible to good, it is multiplied by a factor of 127.40, and, finally, from terrible to excellent, by a factor of 213.26. Finally, for the variable price, on average, the chance of becoming loyal to the establishment is multiplied by a factor of 18.47 when their perception changes from terrible or bad to regular. From terrible or bad to good, this chance is multiplied by a factor of 20.82. Lastly, from terrible or bad to excellent, the chance of becoming loyal to the establishment is multiplied by a factor of 49.87. These answers will only be valid if the other conditions are kept constant in each case. 
g) Based on the analysis of these chances, if the establishment wishes to invest in a single perceptual variable to increase the probability of consumers becoming loyal, such that, they leave their terrible perceptions behind and begin, with higher frequency, to have excellent perceptions of this issue, it must invest in the variable variety of goods, since this variable is the one that shows the highest odds ratio (381.88). In other words, the chances of becoming loyal to the establishment, when their perception changes from terrible variety of goods to excellent, are, on average, multiplied by a factor of 381.88 (38,088% higher), maintaining the remaining conditions constant.
3) a)
b)
Yes. Since the P-value of the χ² statistic < 0.05, we can reject the null hypothesis that all parameters β_jm (j = 1, 2; m = 1, 2, 3, 4) are statistically equal to zero, at a significance level of 0.05. That is, at least one of the explanatory variables is statistically significant to form the occurrence probability expression of at least one of the classifications proposed for the LDL cholesterol index. c) Since all the parameters are statistically significant for all the logits (Wald z tests at a significance level of 0.05), the final equations estimated for the average occurrence probabilities of the classifications proposed for the LDL cholesterol index can be written in the following way:
Probability of an individual i having a very high LDL cholesterol index:
p_i = 1 / D_i
Probability of an individual i having a high LDL cholesterol index:
p_i = e^(-0.42 - 0.31 cigarette_i + 0.16 sport_i) / D_i
Probability of an individual i having a borderline LDL cholesterol index:
p_i = e^(-2.62 - 0.41 cigarette_i + 1.01 sport_i) / D_i
Probability of an individual i having a near optimal LDL cholesterol index:
p_i = e^(-2.46 - 1.41 cigarette_i + 1.13 sport_i) / D_i
Probability of an individual i having an optimal LDL cholesterol index:
p_i = e^(-2.86 - 1.67 cigarette_i + 1.16 sport_i) / D_i
where the common denominator is
D_i = 1 + e^(-0.42 - 0.31 cigarette_i + 0.16 sport_i) + e^(-2.62 - 0.41 cigarette_i + 1.01 sport_i) + e^(-2.46 - 1.41 cigarette_i + 1.13 sport_i) + e^(-2.86 - 1.67 cigarette_i + 1.16 sport_i)
d) For an individual who does not smoke and only practices sports once a week, we have: Probability of having a very high LDL cholesterol index = 41.32%. Probability of having a high LDL cholesterol index = 31.99%. Probability of having a borderline LDL cholesterol index = 8.23%. Probability of having a near optimal LDL cholesterol index = 10.92%. Probability of having an optimal LDL cholesterol index = 7.54%.
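The five probabilities in item (d) follow directly from the multinomial logit equations above; the sketch below reproduces them with the rounded coefficients (small differences from the printed percentages are due to that rounding):

```python
from math import exp

# Multinomial logit with "very high" as the reference category.
# Coefficients are the rounded values printed in item (c).
def ldl_probabilities(cigarette, sport):
    logits = [
        0.0,                                        # very high (reference)
        -0.42 - 0.31 * cigarette + 0.16 * sport,    # high
        -2.62 - 0.41 * cigarette + 1.01 * sport,    # borderline
        -2.46 - 1.41 * cigarette + 1.13 * sport,    # near optimal
        -2.86 - 1.67 * cigarette + 1.16 * sport,    # optimal
    ]
    weights = [exp(l) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]

print(ldl_probabilities(cigarette=0, sport=1))  # non-smoker, sports once a week
```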
e)
If people start practicing sports twice a week, they will considerably increase their probability of having near-optimal or optimal levels of LDL cholesterol.
f) The chances of having a high cholesterol index, in comparison to a level considered very high, are, on average, multiplied by a factor of 1.1745 (17.45% higher), when we increase the number of times physical activities are done weekly by one unit and maintaining the remaining conditions constant. g) The chances of having an optimal cholesterol index, on average and in comparison to a level considered near optimal, are multiplied by a factor of 1.2995 (0.2450047 / 0.1885317), when people stop smoking and maintaining the remaining conditions constant. That is, the chances are 29.95% higher.
Tip: For those who are in doubt about this procedure, you just need to change the reference category of the variable cigarette (now, smokes = 0) and estimate the model with the category near optimal of the dependent variable as the reference category. h) and i)
ANSWER KEYS: EXERCISES: CHAPTER 15
1) a)
Statistic: Mean = 1.020; Variance = 1.125
Even if in a preliminary way, we can see that the mean and variance of the variable purchases are quite close.
b)
Since the P-value of the t-test that corresponds to the β parameter of lambda is greater than 0.05, we can state that the data of the dependent variable purchases do not present overdispersion. So, the Poisson regression model estimated is suitable, due to the presence of equidispersion in the data. c)
The result of the χ² test suggests that there is quality in the adjustment of the estimated Poisson regression model. That is, there are no statistically significant differences, at a significance level of 0.05, between the observed and the predicted probability distributions of the annual use incidence of closed-end credit. d) Since all the zcal values < -1.96 or > 1.96, the P-values of the Wald z statistics < 0.05 for all the parameters estimated; thus, we arrive at the final Poisson regression model. Therefore, the final expression for the estimated average number of annual uses of closed-end credit financing when purchasing durable goods, for a consumer i, is:
purchases_i = e^(7.048 - 0.001 income_i - 0.086 age_i)
e) purchases = e^[7.048 - 0.001(2,600) - 0.086(47)] = 1.06
We recommend that this calculation be carried out with a larger number of decimal places.
f) The annual use incidence rate of closed-end credit financing when there is an increase of US$1.00 in the customer's monthly income is, on average and as long as the other conditions are kept constant, multiplied by a factor of 0.9988 (0.1124% lower). Consequently, at each increase of US$100.00 in the customer's monthly income, we expect
the annual use incidence rate of closed-end credit financing to be 11.24% lower, on average and provided the other conditions are kept constant. g) The annual use incidence rate of closed-end credit financing when there is an increase of 1 year in consumers’ average age is, on average and as long as the other conditions are kept constant, multiplied by a factor of 0.9171 (8.29% lower). h)
In the constructed chart, it is possible to see that higher monthly incomes lead to a decrease in the expected annual use of closed-end credit financing when purchasing durable goods, with an average reduction rate of 12.0% at each increase of US$100.00 in income. i)
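Items (f) and (g) are incidence-rate ratios, obtained as e^β for the corresponding coefficient; a minimal sketch with the rounded coefficients printed above (the book's 0.9171 and 11.24% figures come from the unrounded estimates, so the rounded values differ slightly):

```python
from math import exp

# Incidence-rate ratio for a change of `delta` units in a variable with coefficient `beta`.
def incidence_rate_ratio(beta, delta=1.0):
    return exp(beta * delta)

print(incidence_rate_ratio(-0.086))        # age: roughly 8.3% lower rate per extra year
print(incidence_rate_ratio(-0.001, 100))   # income: effect of a US$100.00 increase
```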
j) Young people with lower monthly income.
2) a)
Statistic: Mean = 2.760; Variance = 8.467
Even if in a preliminary way, there are indications of overdispersion in the data of the variable property, since its variance is extremely higher than its mean.
b)
Since the P-value of the t-test that corresponds to the β parameter of lambda is lower than 0.05, we can state that the data of the dependent variable property present overdispersion, making the estimated Poisson regression model unsuitable.
Furthermore, the result of the χ² test suggests a lack of adjustment quality in the estimated Poisson regression model. That is, there are statistically significant differences, at a significance level of 0.05, between the probability distributions observed and predicted for the number of real estate properties for sale per square. c)
d) Since the confidence interval for ϕ (alpha in Stata) does not include zero, we can state that, at a 95% confidence level, ϕ is statistically different from zero, with an estimated value equal to 0.230. The result of the likelihood-ratio test for the parameter ϕ (alpha) itself suggests that the null hypothesis that this parameter is statistically equal to zero can be rejected at a significance level of 0.05. This confirms that there is overdispersion in the data and, therefore, we must choose the estimation of the negative binomial model.
e) Since all the zcal values < -1.96 or > 1.96, the P-values of the Wald z statistics < 0.05 for all the parameters estimated, and, thus, we arrive at the final negative binomial regression model. Therefore, the expression for the estimated average number of real estate properties for sale in a certain square ij is:
property_ij = e^(0.608 + 0.001 distpark_ij - 0.687 mall_ij)
f) property = e^[0.608 + 0.001(820) - 0.687(0)] = 5.07
We recommend that this calculation be carried out with a larger number of decimal places.
g) The number of real estate properties for sale per square is multiplied, on average and provided the other conditions are kept constant, by a factor of 1.0012 for each meter further away from the municipal park. Hence, when we move 1 meter closer to the park, we must divide the average amount of real estate properties for sale per square by this same factor; that is, the number will be multiplied by a factor of 0.9987 (0.1237% lower). Thus, at each approximation of 100 meters to the park, we expect the average amount of real estate properties for sale to be, on average and as long as the other conditions are kept constant, 12.37% lower. h) The expected number of real estate properties for sale when a commercial center or mall is built in the microregion (square) is, as long as the other conditions are kept constant, multiplied by a factor of 0.5031. That is, on average, it becomes 49.69% lower. i)
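The factor in item (h) is again e^β applied to the mall dummy; a one-line check with the printed coefficient:

```python
from math import exp

# Rate multiplier when a mall is built (mall dummy goes from 0 to 1), from item (h).
mall_factor = exp(-0.687)
pct_drop = (1.0 - mall_factor) * 100.0
print(mall_factor, pct_drop)  # approximately 0.5031 and 49.69
```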
j)
k) Yes, we can state that proximity to parks and green spaces and the existence of malls and commercial centers in the microregion make the number of real estate properties for sale go down. That is, these features may be helping reduce the intention of selling residential real estate. l)
m)
We can see that the adjustment of the negative binomial regression model is better than the adjustment of the Poisson regression model, since:
– the maximum difference between the observed and the predicted probabilities is lower for the negative binomial model;
– Pearson's total value is also lower for the negative binomial regression model.
n)
ANSWER KEYS: EXERCISES: CHAPTER 16
Ex. 3
a) max z = x1 + x2
s.t. 2x1 - 5x2 = 10 (1)
x1 + 2x2 + x3 = 50 (2)
x1, x2, x3 ≥ 0 (3)
b) min z = 24x1 + 12x2
s.t. 3x1 + 2x2 - x3 = 4 (1)
2x1 - 4x2 + x4 = 26 (2)
x2 - x5 = 3 (3)
x1, x2, x3, x4, x5 ≥ 0 (4)
c) max z = 10x1 - x2
s.t. 6x1 + x2 + x3 = 10 (1)
x2 - x4 = 6 (2)
x1, x2, x3, x4 ≥ 0 (3)
d) max z = 3x1 + 3x2 - 2x3
s.t. 6x1 + 3x2 - x3 + x4 = 10 (1)
x2 + x3 - x5 = 20 (2)
x1, x2, x3, x4, x5 ≥ 0 (3)
Ex. 4
a) max z = x1 + x2
s.t. 2x1 - 5x2 ≤ 10 (1)
-2x1 + 5x2 ≤ -10 (2)
x1 + 2x2 ≤ 50 (3)
x1, x2 ≥ 0 (4)
b) min z = 24x1 + 12x2
s.t. 3x1 + 2x2 ≥ 4 (1)
-2x1 + 4x2 ≥ -26 (2)
x2 ≥ 3 (3)
x1, x2 ≥ 0 (4)
c) max z = 10x1 - x2
s.t. 6x1 + x2 ≤ 10 (1)
-x2 ≤ -6 (2)
x1, x2 ≥ 0 (3)
d) max z = 3x1 + 3x2 - 2x3
s.t. 6x1 + 3x2 - x3 ≤ 10 (1)
-x2 - x3 ≤ -20 (2)
x1, x2, x3 ≥ 0 (3)
Ex. 5
a) min z = -10x1 + x2
b) min z = -3x1 - 3x2 + 2x3
Ex. 7
xi = number of vehicles of model i to be manufactured per week, i = 1, 2, 3.
x1 = number of vehicles of the Arlington model to be manufactured per week.
x2 = number of vehicles of the Marilandy model to be manufactured per week.
x3 = number of vehicles of the Lagoinha model to be manufactured per week.
Fobj = max z = 2,500x1 + 3,000x2 + 2,800x3
subject to:
3x1 + 4x2 + 3x3 ≤ 480 (machine-minutes/week available for injection)
5x1 + 5x2 + 4x3 ≤ 640 (machine-minutes/week available for foundry)
2x1 + 4x2 + 4x3 ≤ 400 (machine-minutes/week available for machining)
4x1 + 5x2 + 5x3 ≤ 640 (machine-minutes/week available for upholstery)
2x1 + 3x2 + 3x3 ≤ 320 (machine-minutes/week available for final assembly)
x1 ≥ 50 (minimum sales potential of the Arlington model)
x2 ≥ 30 (minimum sales potential of the Marilandy model)
x3 ≥ 30 (minimum sales potential of the Lagoinha model)
x1, x2, x3 ≥ 0
Ex. 8
xi = liters of product i to be manufactured per month, i = 1, 2
x1 = liters of beer to be manufactured per month.
x2 = liters of soft drink to be manufactured per month.
Fobj = max z = 0.5x1 + 0.4x2
subject to:
2x1 ≤ 57,600 (minutes/month available to extract beer malt)
4x1 ≤ 115,200 (minutes/month available to process wort)
3x1 ≤ 96,000 (minutes/month available to ferment beer)
4x1 ≤ 115,200 (minutes/month available to process beer)
5x1 ≤ 96,000 (minutes/month available to bottle beer)
1x2 ≤ 57,600 (minutes/month available to prepare simple syrup)
3x2 ≤ 67,200 (minutes/month available to prepare compound syrup)
4x2 ≤ 76,800 (minutes/month available to dilute soft drinks)
5x2 ≤ 96,000 (minutes/month available to carbonate soft drinks)
2x2 ≤ 48,000 (minutes/month available to bottle soft drinks)
x1 + x2 ≤ 42,000 (maximum demand of beer and soft drinks)
x1, x2 ≥ 0
Ex. 9
xi = quantity of product i to be manufactured per week, i = 1, 2, …, 5.
x1 = number of refrigerators to be manufactured per week.
x2 = number of freezers to be manufactured per week.
x3 = number of stoves to be manufactured per week.
x4 = number of dishwashers to be manufactured per week.
x5 = number of microwave ovens to be manufactured per week.
Fobj = max z = 52x1 + 37x2 + 35x3 + 40x4 + 29x5
subject to:
0.2x1 + 0.2x2 + 0.4x3 + 0.4x4 + 0.3x5 ≤ 400 (machine-hours/week, pressing)
0.2x1 + 0.3x2 + 0.3x3 + 0.3x4 + 0.2x5 ≤ 350 (machine-hours/week, painting)
0.4x1 + 0.3x2 + 0.3x3 + 0.3x4 + 0.2x5 ≤ 250 (machine-hours/week, molding)
0.2x1 + 0.4x2 + 0.4x3 + 0.4x4 + 0.4x5 ≤ 200 (machine-hours/week, assembly)
0.1x1 + 0.2x2 + 0.2x3 + 0.2x4 + 0.3x5 ≤ 200 (machine-hours/week, packaging)
0.5x1 + 0.4x2 + 0.5x3 + 0.4x4 + 0.2x5 ≤ 480 (employee-hours/week, pressing)
0.3x1 + 0.4x2 + 0.4x3 + 0.4x4 + 0.3x5 ≤ 400 (employee-hours/week, painting)
0.5x1 + 0.5x2 + 0.3x3 + 0.4x4 + 0.3x5 ≤ 320 (employee-hours/week, molding)
0.6x1 + 0.5x2 + 0.4x3 + 0.5x4 + 0.6x5 ≤ 400 (employee-hours/week, assembly)
0.4x1 + 0.4x2 + 0.4x3 + 0.3x4 + 0.2x5 ≤ 1,280 (employee-hours/week, packaging)
200 ≤ x1 ≤ 1,000 (min. demand; max. capacity, refrigerator)
50 ≤ x2 ≤ 800 (min. demand; max. capacity, freezer)
50 ≤ x3 ≤ 500 (min. demand; max. capacity, stove)
50 ≤ x4 ≤ 500 (min. demand; max. capacity, dishwasher)
40 ≤ x5 ≤ 200 (min. demand; max. capacity, microwave)
Ex. 10
xij = liters of type i petroleum used daily to produce gasoline j, i = 1, 2, 3, 4; j = 1, 2, 3.
x11 = liters of petroleum 1 used daily to produce regular gasoline.
⋮
x41 = liters of petroleum 4 used daily to produce regular gasoline.
x12 = liters of petroleum 1 used daily to produce green gasoline.
⋮
x42 = liters of petroleum 4 used daily to produce green gasoline.
x13 = liters of petroleum 1 used daily to produce yellow gasoline.
⋮
x43 = liters of petroleum 4 used daily to produce yellow gasoline.
Fobj = max z = (0.40 - 0.20)x11 + (0.40 - 0.25)x21 + (0.40 - 0.30)x31 + (0.40 - 0.30)x41 + (0.45 - 0.20)x12 + (0.45 - 0.25)x22 + (0.45 - 0.30)x32 + (0.45 - 0.30)x42 + (0.50 - 0.20)x13 + (0.50 - 0.25)x23 + (0.50 - 0.30)x33 + (0.50 - 0.30)x43
subject to:
0.10x21 - 0.05x31 + 0.20x41 ≤ 0
0.07x11 + 0.02x21 - 0.12x31 - 0.03x41 ≤ 0
0.05x12 + 0.05x22 - 0.10x32 - 0.15x42 ≤ 0
0.05x12 + 0.10x32 - 0.05x42 ≤ 0
0.10x13 - 0.15x33 + 0.10x43 ≤ 0
0.03x13 - 0.02x23 + 0.08x33 - 0.07x43 ≤ 0
x11 + x21 + x31 + x41 ≤ 12,000
x12 + x22 + x32 + x42 ≤ 10,000
x13 + x23 + x33 + x43 ≤ 8,000
x11 + x12 + x13 ≤ 15,000
x21 + x22 + x23 ≤ 15,000
x31 + x32 + x33 ≤ 15,000
x41 + x42 + x43 ≤ 15,000
x11 + x21 + x31 + x41 + x12 + x22 + x32 + x42 + x13 + x23 + x33 + x43 ≤ 60,000
x11, x21, x31, x41, x12, x22, x32, x42, x13, x23, x33, x43 ≥ 0
Ex. 12
xi = 1 if the company invests in project i; 0 otherwise
x1 = whether the company invests in the development of new products or not.
x2 = whether the company invests in capacity building or not.
x3 = whether the company invests in Information Technology or not.
x4 = whether the company invests in expanding the factory or not.
x5 = whether the company invests in expanding the depot or not.
Fobj = max z = 355.627x1 + 110.113x2 + 213.088x3 + 257.190x4 + 241.833x5
subject to:
360x1 + 240x2 + 180x3 + 480x4 + 320x5 ≤ 1,000 (budget constraint)
x2 - x3 ≤ 0 (project 2 depends on 3)
x4 + x5 ≤ 1 (mutually excluding projects)
xi = 0 or 1
Ex. 13
xi = percentage of stock i to be allocated in the portfolio, i = 1, …, 10.
x1 = percentage of stock 1 from the banking sector to be allocated in the portfolio.
x2 = percentage of stock 2 from the banking sector to be allocated in the portfolio.
⋮
x10 = percentage of stock 10 from the electrical sector to be allocated in the portfolio.
Fobj = 0.0439x1 + 0.0453x2 + 0.0455x3 + 0.0439x4 + 0.0402x5 + 0.0462x6 + 0.0421x7 + 0.0473x8 + 0.0233x9 + 0.0221x10
s.t.
x1 + x2 + ⋯ + x10 = 1 (1)
0.0122x1 + 0.0121x2 + ⋯ + 0.0148x10 ≥ 0.008 (2)
0.0541x1 + 0.0528x2 + ⋯ + 0.0267x10 ≤ 0.05 (3)
x1 + x2 + x3 + x4 + x5 ≤ 0.50 (4)
x1 + x2 + x3 + x4 ≥ 0.20 (5)
x6 + x7 + x8 ≥ 0.20 (6)
x9 + x10 ≥ 0.20 (7)
0 ≤ x1, x2, ⋯, x10 ≤ 0.40 (8)
Ex. 16
Decision variables:
x_ijt = quantity of product i to be manufactured in facility j in period t
I_ijt = final stock of product i in facility j in period t
z_ijkt = 1 if product i is delivered by facility j to retailer k in period t; 0 otherwise
Model parameters:
D_ikt = demand of product i by retailer k in period t
c_ijt = unit production cost of product i in facility j in period t
i_ijt = unit storage cost of product i in facility j in period t
y_ijkt = total transportation cost of product i from facility j to retailer k in period t
xmax_ijt = maximum production capacity of product i in facility j in period t
Imax_ijt = maximum storage capacity of product i in facility j in period t
General formulation:
Fobj = min z = Σ_{i=1}^{m} Σ_{j=1}^{n} Σ_{t=1}^{T} ( c_ijt x_ijt + i_ijt I_ijt + Σ_{k=1}^{p} y_ijkt z_ijkt )
s.t.
Σ_{k=1}^{p} D_ikt z_ijkt + I_ijt = I_ij,t-1 + x_ijt,   i = 1, …, m; j = 1, …, n; t = 1, …, T   (1)
Σ_{j=1}^{n} z_ijkt = 1,   i = 1, …, m; k = 1, …, p; t = 1, …, T   (2)
x_ijt ≤ xmax_ijt,   i = 1, …, m; j = 1, …, n; t = 1, …, T   (3)
I_ijt ≤ Imax_ijt,   i = 1, …, m; j = 1, …, n; t = 1, …, T   (4)
z_ijkt ∈ {0, 1},   i = 1, …, m; j = 1, …, n; k = 1, …, p; t = 1, …, T   (5)
x_ijt, I_ijt ≥ 0,   i = 1, …, m; j = 1, …, n; t = 1, …, T
Ex. 17
Decision variables:
x_ijt = quantity of product i to be manufactured in facility j in period t
I_ijt = final stock of product i in facility j in period t
Y_ijkt = quantity of product i to be transported from facility j to retailer k in period t
z_ijt = 1 if the manufacturing of product i in period t occurs in facility j; 0 otherwise
Model parameters:
D_ikt = demand of product i by retailer k in period t
c_ijt = unit production cost of product i in facility j in period t
i_ijt = unit storage cost of product i in facility j in period t
y_ijkt = unit transportation cost of product i from facility j to retailer k in period t
xmax_ijt = maximum production capacity of product i in facility j in period t
Imax_ijt = maximum storage capacity of product i in facility j in period t
General formulation:
min z = Σ_{i=1}^{m} Σ_{j=1}^{n} Σ_{t=1}^{T} ( c_ijt x_ijt + i_ijt I_ijt + Σ_{k=1}^{p} y_ijkt Y_ijkt )
s.t.
I_ijt = I_ij,t-1 + x_ijt - Σ_{k=1}^{p} Y_ijkt,   i = 1, …, m; j = 1, …, n; t = 1, …, T   (1)
Σ_{j=1}^{n} Y_ijkt = D_ikt,   i = 1, …, m; k = 1, …, p; t = 1, …, T   (2)
x_ijt ≤ ( Σ_{k=1}^{p} D_ikt ) z_ijt,   i = 1, …, m; j = 1, …, n; t = 1, …, T   (3)
x_ijt ≤ xmax_ijt,   i = 1, …, m; j = 1, …, n; t = 1, …, T   (4)
I_ijt ≤ Imax_ijt,   i = 1, …, m; j = 1, …, n; t = 1, …, T   (5)
z_ijt ∈ {0, 1},   i = 1, …, m; j = 1, …, n; t = 1, …, T   (6)
x_ijt, I_ijt, Y_ijkt ≥ 0
Ex. 18
Time frame with T = 6 periods, t = 1, …, 6 (Jan., Feb., Mar., Apr., May, Jun.).
P_t = production in period t (kg)
S_t = production with outsourced labor in period t (kg)
NR_t = number of regular employees in period t
NC_t = number of employees hired from period t-1 to period t
ND_t = number of employees fired from period t-1 to period t
HE_t = total amount of overtime in period t
I_t = final stock in period t (kg)
min z = 1.5P1 + 2S1 + 600NR1 + 1,000NC1 + 900ND1 + 7HE1 + 1I1
+ 1.5P2 + 2S2 + 600NR2 + 1,000NC2 + 900ND2 + 7HE2 + 1I2
+ ⋮
+ 1.5P6 + 2S6 + 600NR6 + 1,000NC6 + 900ND6 + 7HE6 + 1I6
s.t.
I1 = 600 + P1 - 9,600
I2 = I1 + P2 - 10,600
⋮
I6 = I5 + P6 - 10,430
ANSWER KEYS: EXERCISES: CHAPTER 17
Section 17.2.1 (ex.2) a) Optimal solution: x1 = 2, x2 = 1 and z = 10 b) Optimal solution: x1 = 1, x2 = 4 and z = 14 c) Optimal solution: x1 = 10, x2 = 6 and z = 52
Section 17.2.1 (ex.4) a) yes b) no c) yes d) no e) yes f) yes g) no h) no i) yes
Section 17.2.2 (ex.2) a) Optimal solution: x1 = 12, x2 = 2 and z = 26 b) Optimal solution: x1 = 18, x2 = 8 and z = 28 c) Optimal solution: x1 = 10, x2 = 10 and z = 100
Section 17.2.3 (ex.1) e) Multiple optimal solutions. f) There is no optimal solution. g) Unlimited objective function z. h) Multiple optimal solutions. i) Degenerate optimal solution. j) There is no optimal solution.
Section 17.2.3 (ex.2) a) Any point of the segment CD (C = (10, 30); D = (0, 45)). b) Any point of the segment AB (A = (8, 0); B = (7/2, 3)).
Section 17.3 (ex.1) a) Six basic solutions. c) Optimal solution: x1 = 5, x2 = 20 and z = 55
Section 17.3 (ex.2) a) Ten basic solutions. c) Optimal solution: x1 = 7, x2 = 11, x3 = 0 and z = 61
Section 17.4.2 (ex.1) a) Optimal solution: x1 = 1, x2 = 17, x3 = 5 and z = 104
Section 17.4.3 (ex.2) a) Optimal solution: x1 = 3, x2 = 3 and z = 15 b) Optimal solution: x1 = 2, x2 = 4, x3 = 0 and z = 20 c) Optimal solution: x1 = 4, x2 = 0, x3 = 12 and z = 36
Section 17.4.4 (ex.1)
a) Optimal solution: x1 = 0, x2 = 4 and z = 4
b) Optimal solution: x1 = 1, x2 = 7 and z = 37
c) Optimal solution: x1 = 0, x2 = 10, x3 = 35/2 and z = 55/2
d) Optimal solution: x1 = 100/3, x2 = 0, x3 = 40/3 and z = 140/3
Section 17.4.5.1 (ex.1) b) Solution 1: x1 = 115/2, x2 = 0 and z = 230; Solution 2: x1 = 60, x2 = 10 and z = 230
Section 17.4.5.1 (ex.2) b) Solution 1: x1 = 310, x2 = 0 and z = 930; Solution 2: x1 = 30, x2 = 140 and z = 930
Section 17.4.5.2 (ex.2) Solution 1: x1 = 10, x2 = 30; Solution 2: x1 = 30, x2 = 0
Section 17.4.5 (ex.1) a) Multiple optimal solutions. b) Unlimited objective function z. c) Multiple optimal solutions/degenerate optimal solution.
Section 17.4.5 (ex.2) a) No. b) Unfeasible solution. c) Degenerate optimal solution. d) Multiple optimal solutions. e) Unlimited objective function z.
Section 17.5.2 (ex.1) b) Optimal solution: x1 = 70, x2 = 30, x3 = 35 and z = 363,000.
Section 17.5.2 (ex.2) b) Optimal solution: x1 = 24,960, x2 = 17,040 and z = 19,296.
Section 17.5.2 (ex.3) b) Optimal solution: x1 = 475, x2 = 50, x3 = 50, x4 = 50, x5 = 75 and z = 32,475.
Section 17.5.2 (ex.4) b) Optimal solution: x11 = 3,600, x21 = 0, x22 = 10,000, x12 = 0, x23 = 0, x13 = 0, x31 = 0, x41 = 8,400, x32 = 0, x42 = 0, x33 = 3,200, x43 = 4,800 and z = 5,160
Section 17.5.2 (ex.5) b) Optimal solution: x1 = 1, x2 = 0, x3 = 1, x4 = 0, x5 = 1 and z = 810,548 ($810,548.00).
Section 17.5.2 (ex.6) b) Optimal solution: x1 = 20%, x7 = 20%, x9 = 20%, x10 = 40%, x2, x3, x4, x5, x6, x8, x11 = 0% and z = 3.07%.
Section 17.5.2 (ex.7) b) Optimal solution: 50% ($250,000.00) in the RF_C fund; 25% ($125,000.00) in the Petrobras stock fund; 25% ($125,000.00) in the Vale stock fund. Objective function z = 16.90% per year.
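Assuming ex. 5 refers to the Exercise 12 investment model from Chapter 16 (the objective coefficients match: 355.627 + 213.088 + 241.833 = 810.548, i.e., $810,548.00), the reported selection can be checked against that model's constraints:

```python
# NPVs (thousands of $) and costs from the Exercise 12 (Chapter 16) formulation;
# x is the reported optimal selection (projects 1, 3, and 5).
npv = {1: 355.627, 2: 110.113, 3: 213.088, 4: 257.190, 5: 241.833}
cost = {1: 360, 2: 240, 3: 180, 4: 480, 5: 320}
x = {1: 1, 2: 0, 3: 1, 4: 0, 5: 1}

z = sum(npv[i] * x[i] for i in x)
budget_used = sum(cost[i] * x[i] for i in x)
print(z, budget_used)
```

The budget (860 ≤ 1,000), the dependency x2 ≤ x3, and the exclusion x4 + x5 ≤ 1 are all satisfied.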
Section 17.5.2 (ex.8) b) Optimal solution: z = 126,590 ($126,590.00).
Solution   Jan.     Feb.     Mar.     Apr.     May      Jun.
Pt         9,600    10,000   12,800   11,520   10,770   10,430
St         0        0        0        0        0        0
NRt        5        5        6        6        5        5
NCt        0        0        1        0        0        0
NDt        5        0        0        0        1        0
HEt        0        28.57    91.43    0        83.57    59.29
It         600      0        0        870      0        0
Section 17.6.1 (ex.1) a) x1 = 60, x2 = 20 with z = 520 b) 1.333 c) 0.8 d) No. e) The basic solution remains optimal.
Section 17.6.1 (ex.2) a) x1 = 15, x2 = 0 with z = 120 b) c1 ≥ 2.4, that is, c1 ≥ c1⁰ - 5.6 c) c2 ≤ 20, that is, c2 ≤ c2⁰ + 14
Section 17.6.1 (ex.3) a) x1 = 0, x2 = 17 with z = 102 b) Unlimited objective function z. c) c1 ≤ 3, that is, c1 ≤ c1⁰ + 5 d) 0 ≤ c2 ≤ 16, that is, c2⁰ - 6 ≤ c2 ≤ c2⁰ + 10
Section 17.6.1 (ex.4) a) 0.133 ≤ c1/c2 ≤ 0.25 b) The basic solution remains optimal with z = 1,700. c) 8 ≤ c1 ≤ 15, that is, c1⁰ - 4 ≤ c1 ≤ c1⁰ + 3 d) 48 ≤ c2 ≤ 90, that is, c2⁰ - 12 ≤ c2 ≤ c2⁰ + 30 e) The basic solution remains optimal with z = 1,830. f) The basic solution remains optimal with z = 2,440. g) 13.333 ≤ c1 ≤ 25
Section 17.6.2 (ex.1) a) P1 = 0, P2 = 34.286, P3 = 85.714 b) b1 ≥ b1⁰ - 8.5; b2⁰ - 5.95 ≤ b2 ≤ b2⁰ + 6.125; b3⁰ - 3.267 ≤ b3 ≤ b3⁰ + 2.164 c) 0 d) $137.14 (z = 1,902.86), x1 = 115.71 and x2 = 8.57
Section 17.6.2 (ex.2) a) P1 = 0, P2 = 1.222, P3 = 0.444 (2nd operation) b) b1 ≥ b1⁰ - 20; b2⁰ - 180 ≤ b2 ≤ b2⁰ + 22.5; b3⁰ - 36 ≤ b3 ≤ b3⁰ + 180 c) $27.50 d) $16.00
Section 17.6.3 (ex.1) b) z11 = 3, z12 = 65, z1* = 3 and z2* = 2
Section 17.6.3 (ex.2) b) z1* = 4 and z2* = 2
Section 17.6.4 (ex.3) a) Degenerate optimal solution. b) Multiple optimal solutions. c) Degenerate optimal solution. d) Multiple optimal solutions. e) Multiple optimal solutions. f) Degenerate optimal solution. g) Degenerate optimal solution.
ANSWER KEYS: EXERCISES: CHAPTER 18
Ex.1 a) N = {1, 2, 3, 4, 5, 6} b) A = {(1, 2), (1, 3), (2, 3), (3, 4), (3, 5), (4, 2), (4, 5), (4, 6), (5, 6)} c) Directed network. d) 1 → 2 → 3 → 4 → 2 e) 1 → 3 → 5 → 4 f) 1 → 3 → 4 → 6 g) 2 → 3 → 4 → 2 h) 3 → 4 → 5 → 3
Ex.2 a) N = {1, 2, 3, 4, 5, 6} b) A = {(1, 2), (1, 3), (2, 3), (2, 4), (3, 5), (4, 6), (5, 2), (5, 4), (6, 5)} c) Directed network. d) 2 → 3 → 5 → 4 → 6 → 5 e) 1 → 2 → 5 → 4 → 6 → 5 f) 1 → 3 → 5 → 4 g) 2 → 3 → 5 → 2 h) 1 → 2 → 3 → 1
Ex.3 a) Tree: [tree diagram on nodes 1, 3, 4, 5]
b) Cover tree: [spanning-tree diagram on nodes 1, 2, 3, 4, 5, 6]
Ex.4 [network diagram on nodes 1 to 8]
Ex.5 Classic transportation problem:
[transportation network diagram with supply nodes, demand nodes, and unit transportation costs]
Optimal FBS: x11 = 40, x14 = 30, x22 = 60, x24 = 20, x33 = 50 with z = 1,110.
Ex.6 Maximum flow problem:
[flow network diagram with arc capacities]
Optimal solution: x12 = 6, x13 = 2, x14 = 7, x24 = 3, x25 = 3, x34 = 2, x36 = 0, x45 = 3, x46 = 3, x47 = 6, x57 = 6, x67 = 3 with z = 15.
Ex.7 Shortest route problem:
[network diagram with arc lengths]
Optimal FBS: x13 = 1, x36 = 1, x68 = 1 (1 → 3 → 6 → 8) with z = 11. Ex.8 x11 = 50, x22 = 10, x23 = 20, x33 = 20.
Ex.9 x11 = 80, x13 = 70, x22 = 50, x23 = 80 with z = 4,590.
Ex.10 x13 = 150, x21 = 80, x22 = 50 with z = 4,110.
Ex.11 a) Optimal FBS: x12 = 100, x13 = 100, x23 = 100, x31 = 150, x32 = 50 with z = 6,800. b) Optimal FBS: x13 = 50, x31 = 100, x41 = 20, x42 = 150, x43 = 30 with z = 1,250. c) Optimal FBS: x12 = 20, x14 = 30, x21 = 20, x24 = 10, x32 = 20, x33 = 60 with z = 1,490. Alternative solution: x11 = 20, x12 = 20, x14 = 10, x24 = 30, x32 = 20, x33 = 60 with z = 1,490.
Ex.12
Indexes:
Suppliers i ∈ I
Consolidating centers j ∈ J
Factory k ∈ K
Products p ∈ P
Model parameters:
Cmax,j = maximum capacity of consolidating center j
Dpk = demand of product p in factory k
Sip = capacity of supplier i to produce product p
cpij = unit transportation cost of p from supplier i to consolidating center j
cpjk = unit transportation cost of p from consolidating center j to factory k
cpik = unit transportation cost of p from supplier i to factory k
Model's decision variables:
xpij = amount of product p transferred from supplier i to consolidating center j
ypjk = amount of product p transferred from consolidating center j to factory k
zpik = amount of product p transferred from supplier i to factory k
The problem can be formulated as follows:
min Σp Σi Σj cpij xpij + Σp Σj Σk cpjk ypjk + Σp Σi Σk cpik zpik
s.t.:
Σj ypjk + Σi zpik = Dpk, ∀p, k (1)
Σp Σi xpij ≤ Cmax,j, ∀j (2)
Σj xpij + Σk zpik ≤ Sip, ∀i, p (3)
Σi xpij = Σk ypjk, ∀p, j (4)
xpij, ypjk, zpik ≥ 0, ∀p, i, j, k (5)
In the objective function, the first term represents suppliers' transportation costs up to the consolidation terminals, the second refers to the transportation costs from the consolidation terminals to the final client (factory in Harbin), and the third represents suppliers' transportation costs directly to the final client. Constraint (1) ensures that client k's demand for product p will be met. Constraint (2) refers to the maximum capacity of each consolidation terminal. Constraint (3) represents supplier i's capacity to supply product p. Constraint (4) refers to the preservation of the input and output flows in each transshipment point. Finally, we have the non-negativity constraints.
Ex.13
xij = 1 if task i is designated to machine j, i = 1, …, 4, j = 1, …, 4; 0 otherwise
a) Optimal FBS: x12 = 1, x24 = 1, x33 = 1, x41 = 1 with z = 37. b) Optimal FBS: x13 = 1, x24 = 1, x33 = 1, x41 = 1 with z = 35.
Ex.14
xij = 1 if route (i, j) is included in the shortest route, ∀i, j; 0 otherwise
min 6x12 + 9x13 + 4x23 + 4x24 + 7x25 + 6x35 + 2x45 + 7x46 + 3x56
s.t.:
x12 + x13 = 1
x46 + x56 = 1
x12 − x23 − x24 − x25 = 0
x13 + x23 − x35 = 0
x24 − x45 − x46 = 0
x25 + x35 + x45 − x56 = 0
xij ∈ {0, 1} or xij ≥ 0
Optimal FBS: x12 = 1, x24 = 1, x45 = 1, x56 = 1 (1 → 2 → 4 → 5 → 6) with z = 15.
Ex.15 Optimal FBS: xAB = 1, xBD = 1, xDE = 1 (A → B → D → E) with z = 64.
Ex.16 x12 = 6, x13 = 4, x23 = 0, x24 = 6, x34 = 1, x35 = 3, x45 = 0, x46 = 7, x56 = 3 with z = 10.
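The Ex.14 shortest-route answer can be cross-checked with Dijkstra's algorithm, reading the arc costs directly off the objective function above:

```python
import heapq

# Arc costs from the Ex.14 objective: 6x12 + 9x13 + 4x23 + 4x24 + 7x25 + 6x35 + 2x45 + 7x46 + 3x56
graph = {1: {2: 6, 3: 9}, 2: {3: 4, 4: 4, 5: 7}, 3: {5: 6}, 4: {5: 2, 6: 7}, 5: {6: 3}, 6: {}}

def dijkstra(g, source, target):
    """Shortest directed path by Dijkstra's algorithm; returns (cost, path)."""
    pq, seen = [(0, source, [source])], set()
    while pq:
        cost, node, path = heapq.heappop(pq)
        if node == target:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for nxt, w in g[node].items():
            if nxt not in seen:
                heapq.heappush(pq, (cost + w, nxt, path + [nxt]))
    return float("inf"), []

cost, path = dijkstra(graph, 1, 6)
print(cost, path)  # 15 [1, 2, 4, 5, 6] -- matching the Optimal FBS above
```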
ANSWER KEYS: EXERCISES: CHAPTER 19
Section 19.1 (ex.1) a) BP b) MIP c) IP d) BIP e) BP f) MBP g) MIP
Section 19.2 (ex.1) a) No b) Yes (x1 = 10, x2 = 0 with z = 20) c) No d) Yes (x1 = 0, x2 = 4 with z = 32) e) Yes (x1 = 1, x2 = 0 with z = 4) f) No g) Yes (x1 = 6, x2 = 5 with z = 58)
Section 19.2 (ex.2) b) SF = {(0, 0); (0, 1); (0, 2); (0, 3); (0, 4); (1, 0); (1, 1); (1, 2); (1, 3); (2, 0); (2, 1); (2, 2); (2, 3); (3, 0); (3, 1); (3, 2); (4, 0); (4, 1); (4, 2); (5, 0); (5, 1); (6, 0)} c) Optimal solution: x1 = 4 and x2 = 2 with z = 14.
Section 19.2 (ex.3) b) SF = {(0, 0); (0, 1); (0, 2); (1, 0); (1, 1); (2, 0); (2, 1); (3, 0)} c) Optimal solution: x1 = 2 and x2 = 1 with z = 4.
Section 19.2 (ex.4) b) SF = {(0, 0); (0, 1); (0, 2); (1, 0); (1, 1); (1, 2); (2, 0); (2, 1); (2, 2); (3, 0); (3, 1); (3, 2); (4, 0)} c) Optimal solution: x1 = 3 and x2 = 2 with z = 13.
Section 19.3 (ex.1) Optimal FBS = {x3 = 1, x4 = 1, x6 = 1, x8 = 1} with z = 172.
Section 19.4 (ex.1)
max z = 7x1 + 12x2 + 8x3 + 10x4 + 7x5 + 6x6
s.t.:
4x1 + 7x2 + 5x3 + 6x4 + 4x5 + 3x6 ≤ 20
x5 + x6 ≤ 1
x3 − x2 ≤ 0
x1, x2, x3, x4, x5, x6 ∈ {0, 1}
Optimal solution: x1 = 1, x2 = 1, x3 = 0, x4 = 1, x5 = 0, x6 = 1 with z = 35.
Section 19.5 (ex.1)
Indexes
i, j = 1, …, n represent the customers (index 0 represents the depot)
v = 1, …, NV represent the vehicles
Parameters
Cmax,v = maximum capacity of vehicle v
di = demand of client i
cij = travel cost from client i to client j
Decision variables
xvij = 1 if the arc from i to j is traveled by vehicle v; 0 otherwise
yvi = 1 if the order of client i is delivered by vehicle v; 0 otherwise
Model formulation
min Σi Σj Σv cij xvij
s.t.:
Σv yvi = 1, i = 1, …, n (1)
Σv yvi = NV, i = 0 (2)
Σi di yvi ≤ Cmax,v, v = 1, …, NV (3)
Σi xvij = yvj, j = 0, …, n, v = 1, …, NV (4)
Σj xvij = yvi, i = 0, …, n, v = 1, …, NV (5)
Σij∈S xvij ≤ |S| − 1, S ⊆ {1, …, n}, 2 ≤ |S| ≤ n − 1, v = 1, …, NV (6)
xvij ∈ {0, 1}, i = 0, …, n, j = 0, …, n, v = 1, …, NV (7)
yvi ∈ {0, 1}, i = 0, …, n, v = 1, …, NV (8)
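The Section 19.4 (ex.1) binary program is small enough (2^6 = 64 candidate vectors) to confirm by brute force:

```python
from itertools import product

# Section 19.4 (ex.1): max 7x1 + 12x2 + 8x3 + 10x4 + 7x5 + 6x6, binary variables
values, weights = [7, 12, 8, 10, 7, 6], [4, 7, 5, 6, 4, 3]

best = max(
    (x for x in product((0, 1), repeat=6)
     if sum(w * xi for w, xi in zip(weights, x)) <= 20  # knapsack capacity
     and x[4] + x[5] <= 1                               # x5 + x6 <= 1
     and x[2] <= x[1]),                                 # x3 - x2 <= 0
    key=lambda x: sum(v * xi for v, xi in zip(values, x)),
)
z = sum(v * xi for v, xi in zip(values, best))
print(best, z)  # (1, 1, 0, 1, 0, 1) 35 -- the optimal solution reported above
```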
Constraint (3) guarantees that vehicle capacity will not be exceeded. Constraints (4) and (5) guarantee that vehicles will not interrupt their routes at one client. They are the constraints for the preservation of the input and output flows. Constraint (6) guarantees that subroutes will not be formed. Finally, constraints (7) and (8) guarantee that variables xvij and yvi will be binary.
Section 19.6 (ex.1)
Indexes
i = 1, …, m represent the distribution centers (DCs)
j = 1, …, n represent the consumers
Model parameters
fi = fixed costs to maintain DC i open
cij = transportation costs from DC i to consumer j
Dj = demand of customer j
Cmax,i = maximum capacity of DC i
Decision variables
yi = 1 if DC i opens; 0 otherwise
xij = 1 if consumer j is supplied by DC i; 0 otherwise
General formulation
Fobj = min z = Σi fi yi + Σi Σj cij xij Dj
s.t.:
Σj xij Dj ≤ Cmax,i yi, i = 1, …, m (1)
Σi xij = 1, j = 1, …, n (2)
xij, yi ∈ {0, 1}, i = 1, …, m, j = 1, …, n (3)
which corresponds to a binary programming problem. For this problem, index i corresponds to: i = 1 (Belém), i = 2 (Palmas), i = 3 (São Luís), i = 4 (Teresina), and i = 5 (Fortaleza); and index j corresponds to: j = 1 (Belo Horizonte), j = 2 (Vitória), j = 3 (Rio de Janeiro), j = 4 (São Paulo), and j = 5 (Campo Grande). Optimal FBS: x22 = 1, x24 = 1, x45 = 1, x51 = 1, x53 = 1, y2 = 1, y4 = 1, y5 = 1 with z = 459,400.00.
Section 19.6 (ex.2)
Indexes:
Suppliers i ∈ I
Consolidating centers j ∈ J
Factories k ∈ K
Products p ∈ P
Model parameters:
Cmax,j = maximum capacity of consolidating center j.
fj = fixed costs to open consolidating center j.
Dpk = demand of product p in factory k.
Sip = capacity of supplier i to produce product p.
cpij = unit transportation cost of p from supplier i to consolidating center j.
cpjk = unit transportation cost of p from consolidating center j to factory k.
cpik = unit transportation cost of p from supplier i to factory k.
Model's decision variables:
xpij = amount of product p transported from supplier i to consolidating center j.
ypjk = amount of product p transported from consolidating center j to factory k.
zpik = amount of product p transported from supplier i to factory k.
zj = binary variable that assumes value 1 if center j operates, and 0 otherwise.
The problem can be formulated as follows:
min Σp Σi Σj cpij xpij + Σp Σj Σk cpjk ypjk + Σp Σi Σk cpik zpik + Σj fj zj
s.t.:
Σj ypjk + Σi zpik = Dpk, ∀p, k (1)
Σp Σi xpij ≤ Cmax,j zj, ∀j (2)
Σj xpij + Σk zpik ≤ Sip, ∀i, p (3)
Σi xpij = Σk ypjk, ∀p, j (4)
xpij, ypjk, zpik ≥ 0, ∀p, i, j, k (5)
zj ∈ {0, 1}, ∀j (6)
In the objective function, the first term represents suppliers' transportation costs up to the consolidation terminals, the second refers to the transportation costs from the consolidation terminals to the final client (factory in Harbin), the third represents suppliers' transportation costs directly to the final client, and the last one the fixed costs related to the consolidation terminals' location. Constraint (1) ensures that client k's demand for product p will be met. Constraint (2) refers to the maximum capacity of each consolidation terminal. Constraint (3) represents supplier i's capacity to supply product p. Constraint (4) refers to the preservation of the input and output flows in each transshipment point. Finally, we have the non-negativity constraints and the constraint that variable zj is binary.
Section 19.7 (ex.1)
xi = number of buses that start working in shift i, i = 1, 2, …, 9.

Shift  Period
1      6:01–14:00
2      8:01–16:00
3      10:01–18:00
4      12:01–20:00
5      14:01–22:00
6      16:01–24:00
7      18:01–02:00
8      20:01–04:00
9      22:01–06:00

Therefore, we have:
x1 = number of buses that start working at 6:01.
x2 = number of buses that start working at 8:01.
x3 = number of buses that start working at 10:01.
x4 = number of buses that start working at 12:01.
x5 = number of buses that start working at 14:01.
x6 = number of buses that start working at 16:01.
x7 = number of buses that start working at 18:01.
x8 = number of buses that start working at 20:01.
x9 = number of buses that start working at 22:01.

Fobj = min z = x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9
subject to
x1 ≥ 20 (6:01–8:00)
x1 + x2 ≥ 24 (8:01–10:00)
x1 + x2 + x3 ≥ 18 (10:01–12:00)
x1 + x2 + x3 + x4 ≥ 15 (12:01–14:00)
x2 + x3 + x4 + x5 ≥ 16 (14:01–16:00)
x3 + x4 + x5 + x6 ≥ 27 (16:01–18:00)
x4 + x5 + x6 + x7 ≥ 18 (18:01–20:00)
x5 + x6 + x7 + x8 ≥ 12 (20:01–22:00)
x6 + x7 + x8 + x9 ≥ 10 (22:01–24:00)
x7 + x8 + x9 ≥ 4 (00:01–02:00)
x8 + x9 ≥ 3 (02:01–04:00)
x9 ≥ 8 (04:01–06:00)
xi ≥ 0, i = 1, 2, …, 9
Optimal solution: x1 = 24, x2 = 0, x3 = 0, x4 = 0, x5 = 16, x6 = 11, x7 = 0, x8 = 0, x9 = 8 with z = 59.
Section 19.7 (ex.2)
xi = number of employees that start working on day i, i = 1, 2, …, 7.
x1 = number of employees that start working on Monday.
x2 = number of employees that start working on Tuesday.
⋮
x7 = number of employees that start working on Sunday.
min z = x1 + x2 + x3 + x4 + x5 + x6 + x7
subject to
x1 + x4 + x5 + x6 + x7 ≥ 15 (Monday)
x1 + x2 + x5 + x6 + x7 ≥ 20 (Tuesday)
x1 + x2 + x3 + x6 + x7 ≥ 17 (Wednesday)
x1 + x2 + x3 + x4 + x7 ≥ 22 (Thursday)
x1 + x2 + x3 + x4 + x5 ≥ 25 (Friday)
x2 + x3 + x4 + x5 + x6 ≥ 15 (Saturday)
x3 + x4 + x5 + x6 + x7 ≥ 10 (Sunday)
xi ≥ 0, i = 1, …, 7
Alternative optimal solution: x1 = 10, x2 = 6, x3 = 0, x4 = 5, x5 = 4, x6 = 0, x7 = 1 with z = 26.
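The Section 19.7 (ex.1) optimum z = 59 can be confirmed without a solver: check that the reported solution satisfies every period's coverage constraint, and note that the constraints over the disjoint variable groups {x1, x2}, {x3, …, x6}, and {x9} force z ≥ 24 + 27 + 8 = 59.

```python
# Coverage of the bus-shift model above: each row lists which x_i cover a period
# and the required minimum, taken directly from the constraints.
periods = [
    ([1], 20), ([1, 2], 24), ([1, 2, 3], 18), ([1, 2, 3, 4], 15),
    ([2, 3, 4, 5], 16), ([3, 4, 5, 6], 27), ([4, 5, 6, 7], 18),
    ([5, 6, 7, 8], 12), ([6, 7, 8, 9], 10), ([7, 8, 9], 4), ([8, 9], 3), ([9], 8),
]
x = {1: 24, 2: 0, 3: 0, 4: 0, 5: 16, 6: 11, 7: 0, 8: 0, 9: 8}  # reported optimum

feasible = all(sum(x[i] for i in cover) >= need for cover, need in periods)
z = sum(x.values())
lower_bound = 24 + 27 + 8  # disjoint groups {x1,x2}, {x3..x6}, {x9} give z >= 59
print(feasible, z, lower_bound)  # True 59 59
```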
ANSWER KEYS: EXERCISES: CHAPTER 20
4) P(I < 0) = 15.92% by using the NORM.DIST function in Excel, or P(I < 0) = 12.19% analyzing the data generated in the Monte Carlo simulation for variable I. Note: The results can change at each new simulation.
5) P(Index > 0.07) = 22.50% by using the NORM.DIST function in Excel, or P(Index > 0.07) = 20.43% analyzing the values generated in the simulation. Note: The results can change at each new simulation.
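The gap between the analytic (NORM.DIST) probability and the simulated one shrinks as the number of Monte Carlo draws grows. A minimal sketch with illustrative parameters (the book's exercise data are in Chapter 20):

```python
import random
from statistics import NormalDist

# Illustrative only: I ~ N(mu, sigma); compare P(I < 0) analytically and by simulation.
mu, sigma = 10.0, 10.0
analytic = NormalDist(mu, sigma).cdf(0)           # exact left-tail probability

random.seed(42)                                    # otherwise results change at each new simulation
draws = [random.gauss(mu, sigma) for _ in range(100_000)]
simulated = sum(d < 0 for d in draws) / len(draws)

print(round(analytic, 4), round(simulated, 4))     # the two estimates should be close
```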
ANSWER KEYS: EXERCISES: CHAPTER 21
1) Fcal = 2.476 (sig. 0.100), that is, there are no differences in the production of helicopters in the three factories.
2) There are no significant differences between the hardness measures of the different converters. That is, the "Type of Converter" factor does not have a significant effect on the variable "Hardness." On the other hand, we can conclude that there are significant differences in the hardness of the different types of ore. That is, the "Type of Ore" factor has a significant effect on the variable "Hardness." We can also conclude that there is a significant interaction between the two factors.

Tests of Between-Subjects Effects
Dependent Variable: Hardness
Source             Type III Sum of Squares    df    Mean Square      F           Sig.
Corrected Model            15006.222a          8       1875.778       41.547     .000
Intercept                2023032.111           1    2023032.111    44808.751     .000
Converter                     66.074           2         33.037         .732     .485
Ore                        14433.852           2       7216.926      159.850     .000
Converter * Ore              506.296           4        126.574        2.804     .032
Error                       3250.667          72         45.148
Total                    2041289.000          81
Corrected Total            18256.889          80
a. R Squared = .822 (Adjusted R Squared = .802).
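The F statistics in the ANOVA table above can be recomputed from the sums of squares and degrees of freedom (Mean Square = SS/df, F = MS of the effect over MS of the error):

```python
# Values taken from the Hardness ANOVA table above.
ss = {"Converter": 66.074, "Ore": 14433.852, "Converter * Ore": 506.296, "Error": 3250.667}
df = {"Converter": 2, "Ore": 2, "Converter * Ore": 4, "Error": 72}

ms = {k: ss[k] / df[k] for k in ss}                # Mean Square = SS / df
f = {k: ms[k] / ms["Error"] for k in ms if k != "Error"}

print({k: round(v, 3) for k, v in f.items()})
# e.g. Ore: F = 7216.926 / 45.148 = 159.850 (Sig. .000), matching the table
```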
3) There are significant differences between the octane rating indexes of the different types of petroleum and between the octane rating indexes of the different oil refining processes. That is, both factors have a significant effect on the octane rating index. Finally, we can conclude that there is significant interaction between the two factors.

Tests of Between-Subjects Effects
Dependent Variable: Octane Rating
Source                 Type III Sum of Squares    df    Mean Square       F            Sig.
Corrected Model                 450.229a          11        40.930        41.801      .000
Intercept                    399857.521            1    399857.521    408365.128      .000
Petroleum                       402.792            2       201.396       205.681      .000
Refining                         31.729            3        10.576        10.801      .000
Petroleum * Refining             15.708            6         2.618         2.674      .030
Error                            35.250           36          .979
Total                        400343.000           48
Corrected Total                 485.479           47
a. R Squared = .927 (Adjusted R Squared = .905).

ANSWER KEYS: EXERCISES: CHAPTER 22
1) a) Control charts for X̄: UCL = 17.4035; Average = 16.5318; LCL = 15.6600. Control charts for R: UCL = 2.7305; Average = 1.1965; LCL = 0.0000.
b) Cp = 0.860; Cpk = 0.842; Cpm = 0.859
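The X̄-R limits reported in exercise 1 of Chapter 22 follow the standard formulas UCL/LCL = X̄ ± A2·R̄ and UCL_R = D4·R̄, LCL_R = D3·R̄. The constants below assume subgroups of size n = 4 (A2 = 0.729, D3 = 0, D4 = 2.282), an assumption that is consistent with the reported limits:

```python
# Exercise 1 data from above; control-chart constants assume subgroup size n = 4.
xbar, rbar = 16.5318, 1.1965
A2, D3, D4 = 0.729, 0.0, 2.282

ucl_x, lcl_x = xbar + A2 * rbar, xbar - A2 * rbar   # X-bar chart limits
ucl_r, lcl_r = D4 * rbar, D3 * rbar                 # R chart limits

print(round(ucl_x, 3), round(lcl_x, 3), round(ucl_r, 4))
# ~17.404, ~15.660 and ~2.7304, in line with the UCL/LCL reported in exercise 1
```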
2) a) Control charts for X̄: UCL = 17.3895; Average = 16.5318; LCL = 15.6740. Control charts for S: UCL = 1.1938; Average = 0.5268; LCL = 0.0000. b) Cp = 0.9491; Cpk = 0.9290
3) a) Control charts for X̄: UCL = 6.7113; Average = 6.0625; LCL = 5.4137. Control charts for R: UCL = 2.0322; Average = 0.8905; LCL = 0.0000. b) Cp = 0.771; Cpk = 0.722; Cpm = 0.542
4) a) Control charts for X̄: UCL = 6.7162; Average = 6.0625; LCL = 5.4088. Control charts for S: UCL = 0.9098; Average = 0.4015; LCL = 0.0000. b) Cp = 0.8302; Cpk = 0.7783
5) P chart: UCL = 0.1748; Average = 0.0680; LCL = 0.0000
6) UCL = 8.7403; Average = 3.4000; LCL = 0.0000
7) UCL = 11.9996; Average = 5.1750; LCL = 0.0000
8) Control chart for the fraction of nonconformities (defects): Center = 1.1357; sigma level: 3. [control chart]

ANSWER KEYS: EXERCISES: CHAPTER 23
1) a)
In fact, this is a balanced clustered data structure. b)
c)
d) Yes. Since the estimate of the variance component τ00, which corresponds to the random intercept u0j, is considerably higher than its standard error, it is possible to verify that there is variability, at a significance level of 0.05, in the score obtained between students from different countries. Statistically, z = 422.619/125.284 = 3.373 > 1.96, where 1.96 is the critical value of the standard normal distribution at a significance level of 0.05.
e) Since Sig. χ² = 0.000, it is possible to reject the null hypothesis that the random intercepts are equal to zero (H0: u0j = 0), which rules out the estimation of a traditional linear regression model for these clustered data.
f) rho = τ00 / (τ00 + σ²) = 422.619 / (422.619 + 11.196) = 0.974
which suggests that approximately 97% of the total variance in students’ grades in science is due to differences between the participants’ countries of origin. g)
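The item f) intraclass correlation and the item d) Wald-type z statistic are both simple ratios of the estimated variance components:

```python
# Variance components from the estimated null (intercept-only) multilevel model above.
tau00, sigma2 = 422.619, 11.196   # between-country and within-country variances
se_tau00 = 125.284                # standard error of tau00

rho = tau00 / (tau00 + sigma2)    # share of total variance between countries
z = tau00 / se_tau00              # compare with the critical value 1.96

print(round(rho, 3), round(z, 3))  # 0.974 3.373
```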
h)
i) The estimated parameters of the fixed- and random-effects components are statistically different from zero, at a significance level of 0.05. j)
k)
l)
The significance level of the test is equal to 1.000 (much greater than 0.05) because the logarithms of both restricted likelihood functions are identical (LLr = 357.501). Therefore, the model with only random effects in the intercept is favored, since the random error terms u1j are statistically equal to zero.
m)
n) scoreij = 13.22 + 0.0028 · incomeij + 0.0008 · resdevelj · incomeij + u0j + rij
o)
2) a)
In fact, this is an unbalanced clustered data structure of real estate properties in districts. b)
This is also an unbalanced data panel. c)
d)
e)
f)
g)
Level-2 intraclass correlation:
rho_property|district = (τu000 + τr000) / (τu000 + τr000 + σ²) = (0.1228 + 0.0368) / (0.1228 + 0.0368 + 0.0007) = 0.996
Level-3 intraclass correlation:
rho_district = τu000 / (τu000 + τr000 + σ²) = 0.1228 / (0.1228 + 0.0368 + 0.0007) = 0.766
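The item g) three-level intraclass correlations are ratios of the same three variance components:

```python
# Variance components from the three-level null model above.
tau_u000, tau_r000, sigma2 = 0.1228, 0.0368, 0.0007

total = tau_u000 + tau_r000 + sigma2
rho_property_district = (tau_u000 + tau_r000) / total  # same property, same district
rho_district = tau_u000 / total                        # properties in the same district

print(round(rho_property_district, 3), round(rho_district, 3))  # 0.996 0.766
```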
The correlation between the natural logarithms of the rental prices per square meter of the properties in the same district is equal to 76.6% (rho_district), and the correlation between these annual indexes, for the same property of a certain district, is equal to 99.6% (rho_property|district). Thus, we estimate that the real estate and district random effects form more than 99% of the total variance of the residuals! h) Given the statistical significance of the estimated variances τu000, τr000, and σ² (the ratios between the estimates and the respective standard errors are higher than 1.96, the critical value of the standard normal distribution at a significance level of 0.05), we can state that there is variability in the rental price of the commercial properties throughout the period analyzed. Moreover, there is variability in the rental price, throughout time, between real estate properties in the same district and between properties located in different districts.
i) Since Sig. χ² = 0.000, it is possible to reject the null hypothesis that the random intercepts are equal to zero (H0: u00k = r0jk = 0), which rules out the estimation of a traditional linear regression model for these data. j)
k) First, we can see that the variable that corresponds to the year (linear trend) with fixed effects is statistically significant, at a significance level of 0.05 (Sig. z = 0.000 < 0.05), which demonstrates that, each year, rental prices of commercial properties increase, on average, 1.10% (e^0.011 = 1.011), ceteris paribus. In relation to the random-effects components, it is also possible to verify that there is statistical significance in the variances of u00k, r0jk, and etjk, at a significance level of 0.05, because the estimates of τu000, τr000, and σ² are considerably higher than the respective standard errors.
l)
m)
n)
Level-2 intraclass correlation:
rho_property|district = (τu000 + τu100 + τr000 + τr100) / (τu000 + τu100 + τr000 + τr100 + σ²) = (0.142444 + 0.000043 + 0.039638 + 0.000047) / (0.142444 + 0.000043 + 0.039638 + 0.000047 + 0.000103) = 0.9994
Level-3 intraclass correlation:
rho_district = (τu000 + τu100) / (τu000 + τu100 + τr000 + τr100 + σ²) = (0.142444 + 0.000043) / (0.142444 + 0.000043 + 0.039638 + 0.000047 + 0.000103) = 0.7817
For this model, we estimate that the real estate and districts random effects form more than 99.9% of the total variance of the residuals!
o)
Since Sig. χ²(2) = 0.000, we choose the linear trend model with random intercepts and slopes. p)
q) ln(p)tjk = 4.134 + 0.015 · yearjk + 0.231 · foodjk + 0.189 · space4jk − 0.004 · valetjk · yearjk + u00k + u10k · yearjk + r0jk + r1jk · yearjk + etjk
Note: At this moment, we decide to insert the parameter of the variable space4 in the expression, statistically significant at a significance level of 0.10.
r) Yes, it is possible to state that the natural logarithm of the rental price per square meter of the real estate properties follows a linear trend throughout time. In addition, there is a significant variance in the intercepts and slopes between those located in the same district and between those located in different districts. Yes, the existence of restaurants or food courts in the building, at least four or a higher number of parking spaces, and valet parking in the building where the property is located explain part of the evolution in variability of the natural logarithm of the rental price per square meter of the properties. s)
t)
Random-effects variance-covariance matrix for level district:
var(u00k, u10k) = [0.037004, 0; 0, 0.000016]
Random-effects variance-covariance matrix for level property:
var(r0jk, r1jk) = [0.030961, 0; 0, 0.000044]
u)
v)
Random-effects variance-covariance matrix for level district:
var(u00k, u10k) = [0.037253, 0.000653; 0.000653, 0.000014]
Random-effects variance-covariance matrix for level property:
var(r0jk, r1jk) = [0.031679, 0.000484; 0.000484, 0.000046]
w) Since Sig. χ²(2) = 0.000, the structure of the random-terms variance-covariance matrices is considered unstructured, that is, we can conclude that error terms u00k and u10k are correlated (cov(u00k, u10k) ≠ 0), and that error terms r0jk and r1jk are also correlated (cov(r0jk, r1jk) ≠ 0).
x) ln(p)tjk = 3.7807 + 0.0144 · yearjk + 0.2314 · foodjk + 0.2071 · space4jk + 0.5111 · subwayk − 0.0031 · valetjk · yearjk − 0.0072 · subwayk · yearjk + 0.0001 · violencek · yearjk + u00k + u10k · yearjk + r0jk + r1jk · yearjk + etjk
y) Yes, it is possible to state that the existence of subway and the violence index in the district explain part of the variability of the evolution of the natural logarithm of the rental price per square meter between real estate properties located in different districts.
z)
Appendices

TABLE A Snedecor's F-Distribution
Critical values Fc such that P(Fcal > Fc) = 0.10 (a = 0.10). Columns: numerator degrees of freedom (n1); rows: denominator degrees of freedom (n2).

 n2      1      2      3      4      5      6      7      8      9     10
  1  39.86  49.50  53.59  55.83  57.24  58.20  58.91  59.44  59.86  60.19
  2   8.53   9.00   9.16   9.24   9.29   9.33   9.35   9.37   9.38   9.39
  3   5.54   5.46   5.39   5.34   5.31   5.28   5.27   5.25   5.24   5.23
  4   4.54   4.32   4.19   4.11   4.05   4.01   3.98   3.95   3.94   3.92
  5   4.06   3.78   3.62   3.52   3.45   3.40   3.37   3.34   3.32   3.30
  6   3.78   3.46   3.29   3.18   3.11   3.05   3.01   2.98   2.96   2.94
  7   3.59   3.26   3.07   2.96   2.88   2.83   2.78   2.75   2.72   2.70
  8   3.46   3.11   2.92   2.81   2.73   2.67   2.62   2.59   2.56   2.54
  9   3.36   3.01   2.81   2.69   2.61   2.55   2.51   2.47   2.44   2.42
 10   3.29   2.92   2.73   2.61   2.52   2.46   2.41   2.38   2.35   2.32
 11   3.23   2.86   2.66   2.54   2.45   2.39   2.34   2.30   2.27   2.25
 12   3.18   2.81   2.61   2.48   2.39   2.33   2.28   2.24   2.21   2.19
 13   3.14   2.76   2.56   2.43   2.35   2.28   2.23   2.20   2.16   2.14
 14   3.10   2.73   2.52   2.39   2.31   2.24   2.19   2.15   2.12   2.10
 15   3.07   2.70   2.49   2.36   2.27   2.21   2.16   2.12   2.09   2.06
 16   3.05   2.67   2.46   2.33   2.24   2.18   2.13   2.09   2.06   2.03
 17   3.03   2.64   2.44   2.31   2.22   2.15   2.10   2.06   2.03   2.00
 18   3.01   2.62   2.42   2.29   2.20   2.13   2.08   2.04   2.00   1.98
 19   2.99   2.61   2.40   2.27   2.18   2.11   2.06   2.02   1.98   1.96
 20   2.97   2.59   2.38   2.25   2.16   2.09   2.04   2.00   1.96   1.94
 21   2.96   2.57   2.36   2.23   2.14   2.08   2.02   1.98   1.95   1.92
 22   2.95   2.56   2.35   2.22   2.13   2.06   2.01   1.97   1.93   1.90
 23   2.94   2.55   2.34   2.21   2.11   2.05   1.99   1.95   1.92   1.89
 24   2.93   2.54   2.33   2.19   2.10   2.04   1.98   1.94   1.91   1.88
 25   2.92   2.53   2.32   2.18   2.09   2.02   1.97   1.93   1.89   1.87
 26   2.91   2.52   2.31   2.17   2.08   2.01   1.96   1.92   1.88   1.86
 27   2.90   2.51   2.30   2.17   2.07   2.00   1.95   1.91   1.87   1.85
 28   2.89   2.50   2.29   2.16   2.06   2.00   1.94   1.90   1.87   1.84
 29   2.89   2.50   2.28   2.15   2.06   1.99   1.93   1.89   1.86   1.83
 30   2.88   2.49   2.28   2.14   2.05   1.98   1.93   1.88   1.85   1.82
 35   2.85   2.46   2.25   2.11   2.02   1.95   1.90   1.85   1.82   1.79
 40   2.84   2.44   2.23   2.09   2.00   1.93   1.87   1.83   1.79   1.76
 45   2.82   2.42   2.21   2.07   1.98   1.91   1.85   1.81   1.77   1.74
 50   2.81   2.41   2.20   2.06   1.97   1.90   1.84   1.80   1.76   1.73
100   2.76   2.36   2.14   2.00   1.91   1.83   1.78   1.73   1.69   1.66
Critical values Fc such that P(Fcal > Fc) = 0.05 (a = 0.05). Columns: numerator degrees of freedom (n1); rows: denominator degrees of freedom (n2).

 n2       1       2       3       4       5       6       7       8       9      10
  1  161.45  199.50  215.71  224.58  230.16  233.99  236.77  238.88  240.54  241.88
  2   18.51   19.00   19.16   19.25   19.30   19.33   19.35   19.37   19.38   19.40
  3   10.13    9.55    9.28    9.12    9.01    8.94    8.89    8.85    8.81    8.79
  4    7.71    6.94    6.59    6.39    6.26    6.16    6.09    6.04    6.00    5.96
  5    6.61    5.79    5.41    5.19    5.05    4.95    4.88    4.82    4.77    4.74
  6    5.99    5.14    4.76    4.53    4.39    4.28    4.21    4.15    4.10    4.06
  7    5.59    4.74    4.35    4.12    3.97    3.87    3.79    3.73    3.68    3.64
  8    5.32    4.46    4.07    3.84    3.69    3.58    3.50    3.44    3.39    3.35
  9    5.12    4.26    3.86    3.63    3.48    3.37    3.29    3.23    3.18    3.14
 10    4.96    4.10    3.71    3.48    3.33    3.22    3.14    3.07    3.02    2.98
 11    4.84    3.98    3.59    3.36    3.20    3.09    3.01    2.95    2.90    2.85
 12    4.75    3.89    3.49    3.26    3.11    3.00    2.91    2.85    2.80    2.75
 13    4.67    3.81    3.41    3.18    3.03    2.92    2.83    2.77    2.71    2.67
 14    4.60    3.74    3.34    3.11    2.96    2.85    2.76    2.70    2.65    2.60
 15    4.54    3.68    3.29    3.06    2.90    2.79    2.71    2.64    2.59    2.54
 16    4.49    3.63    3.24    3.01    2.85    2.74    2.66    2.59    2.54    2.49
 17    4.45    3.59    3.20    2.96    2.81    2.70    2.61    2.55    2.49    2.45
 18    4.41    3.55    3.16    2.93    2.77    2.66    2.58    2.51    2.46    2.41
 19    4.38    3.52    3.13    2.90    2.74    2.63    2.54    2.48    2.42    2.38
 20    4.35    3.49    3.10    2.87    2.71    2.60    2.51    2.45    2.39    2.35
 21    4.32    3.47    3.07    2.84    2.68    2.57    2.49    2.42    2.37    2.32
 22    4.30    3.44    3.05    2.82    2.66    2.55    2.46    2.40    2.34    2.30
 23    4.28    3.42    3.03    2.80    2.64    2.53    2.44    2.37    2.32    2.27
 24    4.26    3.40    3.01    2.78    2.62    2.51    2.42    2.36    2.30    2.25
 25    4.24    3.39    2.99    2.76    2.60    2.49    2.40    2.34    2.28    2.24
 26    4.23    3.37    2.98    2.74    2.59    2.47    2.39    2.32    2.27    2.22
 27    4.21    3.35    2.96    2.73    2.57    2.46    2.37    2.31    2.25    2.20
 28    4.20    3.34    2.95    2.71    2.56    2.45    2.36    2.29    2.24    2.19
 29    4.18    3.33    2.93    2.70    2.55    2.43    2.35    2.28    2.22    2.18
 30    4.17    3.32    2.92    2.69    2.53    2.42    2.33    2.27    2.21    2.16
 35    4.12    3.27    2.87    2.64    2.49    2.37    2.29    2.22    2.16    2.11
 40    4.08    3.23    2.84    2.61    2.45    2.34    2.25    2.18    2.12    2.08
 45    4.06    3.20    2.81    2.58    2.42    2.31    2.22    2.15    2.10    2.05
 50    4.03    3.18    2.79    2.56    2.40    2.29    2.20    2.13    2.07    2.03
100    3.94    3.09    2.70    2.46    2.31    2.19    2.10    2.03    1.97    1.93
Critical values Fc such that P(Fcal > Fc) = 0.025 (a = 0.025). Columns: numerator degrees of freedom (n1); rows: denominator degrees of freedom (n2).

 n2       1       2       3       4       5       6       7       8       9      10
  1   647.8   799.5   864.2   899.6   921.8   937.1   948.2   956.7   963.3   968.6
  2   38.51   39.00   39.17   39.25   39.30   39.33   39.36   39.37   39.39   39.40
  3   17.44   16.04   15.44   15.10   14.88   14.73   14.62   14.54   14.47   14.42
  4   12.22   10.65    9.98    9.60    9.36    9.20    9.07    8.98    8.90    8.84
  5   10.01    8.43    7.76    7.39    7.15    6.98    6.85    6.76    6.68    6.62
  6    8.81    7.26    6.60    6.23    5.99    5.82    5.70    5.60    5.52    5.46
  7    8.07    6.54    5.89    5.52    5.29    5.12    4.99    4.90    4.82    4.76
  8    7.57    6.06    5.42    5.05    4.82    4.65    4.53    4.43    4.36    4.30
  9    7.21    5.71    5.08    4.72    4.48    4.32    4.20    4.10    4.03    3.96
 10    6.94    5.46    4.83    4.47    4.24    4.07    3.95    3.85    3.78    3.72
 11    6.72    5.26    4.63    4.28    4.04    3.88    3.76    3.66    3.59    3.53
 12    6.55    5.10    4.47    4.12    3.89    3.73    3.61    3.51    3.44    3.37
 13    6.41    4.97    4.35    4.00    3.77    3.60    3.48    3.39    3.31    3.25
 14    6.30    4.86    4.24    3.89    3.66    3.50    3.38    3.29    3.21    3.15
 15    6.20    4.77    4.15    3.80    3.58    3.41    3.29    3.20    3.12    3.06
 16    6.12    4.69    4.08    3.73    3.50    3.34    3.22    3.12    3.05    2.99
 17    6.04    4.62    4.01    3.66    3.44    3.28    3.16    3.06    2.98    2.92
 18    5.98    4.56    3.95    3.61    3.38    3.22    3.10    3.01    2.93    2.87
 19    5.92    4.51    3.90    3.56    3.33    3.17    3.05    2.96    2.88    2.82
 20    5.87    4.46    3.86    3.51    3.29    3.13    3.01    2.91    2.84    2.77
 21    5.83    4.42    3.82    3.48    3.25    3.09    2.97    2.87    2.80    2.73
 22    5.79    4.38    3.78    3.44    3.22    3.05    2.93    2.84    2.76    2.70
 23    5.75    4.35    3.75    3.41    3.18    3.02    2.90    2.81    2.73    2.67
 24    5.72    4.32    3.72    3.38    3.15    2.99    2.87    2.78    2.70    2.64
 25    5.69    4.29    3.69    3.35    3.13    2.97    2.85    2.75    2.68    2.61
 26    5.66    4.27    3.67    3.33    3.10    2.94    2.82    2.73    2.65    2.59
 27    5.63    4.24    3.65    3.31    3.08    2.92    2.80    2.71    2.63    2.57
 28    5.61    4.22    3.63    3.29    3.06    2.90    2.78    2.69    2.61    2.55
 29    5.59    4.20    3.61    3.27    3.04    2.88    2.76    2.67    2.59    2.53
 30    5.57    4.18    3.59    3.25    3.03    2.87    2.75    2.65    2.57    2.51
 40    5.42    4.05    3.46    3.13    2.90    2.74    2.62    2.53    2.45    2.39
 60    5.29    3.93    3.34    3.01    2.79    2.63    2.51    2.41    2.33    2.27
120    5.15    3.80    3.23    2.89    2.67    2.52    2.39    2.30    2.22    2.16
Critical values Fc such that P(Fcal > Fc) = 0.01 (a = 0.01). Columns: numerator degrees of freedom (n1); rows: denominator degrees of freedom (n2).

 n2        1        2        3        4        5        6        7        8        9       10
  1  4,052.2  4,999.3  5,403.5  5,624.3  5,764.0  5,859.0  5,928.3  5,981.0  6,022.4  6,055.9
  2    98.50    99.00    99.16    99.25    99.30    99.33    99.36    99.38    99.39    99.40
  3    34.12    30.82    29.46    28.71    28.24    27.91    27.67    27.49    27.34    27.23
  4    21.20    18.00    16.69    15.98    15.52    15.21    14.98    14.80    14.66    14.55
  5    16.26    13.27    12.06    11.39    10.97    10.67    10.46    10.29    10.16    10.05
  6    13.75    10.92     9.78     9.15     8.75     8.47     8.26     8.10     7.98     7.87
  7    12.25     9.55     8.45     7.85     7.46     7.19     6.99     6.84     6.72     6.62
  8    11.26     8.65     7.59     7.01     6.63     6.37     6.18     6.03     5.91     5.81
  9    10.56     8.02     6.99     6.42     6.06     5.80     5.61     5.47     5.35     5.26
 10    10.04     7.56     6.55     5.99     5.64     5.39     5.20     5.06     4.94     4.85
 11     9.65     7.21     6.22     5.67     5.32     5.07     4.89     4.74     4.63     4.54
 12     9.33     6.93     5.95     5.41     5.06     4.82     4.64     4.50     4.39     4.30
 13     9.07     6.70     5.74     5.21     4.86     4.62     4.44     4.30     4.19     4.10
 14     8.86     6.51     5.56     5.04     4.69     4.46     4.28     4.14     4.03     3.94
 15     8.68     6.36     5.42     4.89     4.56     4.32     4.14     4.00     3.89     3.80
 16     8.53     6.23     5.29     4.77     4.44     4.20     4.03     3.89     3.78     3.69
 17     8.40     6.11     5.19     4.67     4.34     4.10     3.93     3.79     3.68     3.59
 18     8.29     6.01     5.09     4.58     4.25     4.01     3.84     3.71     3.60     3.51
 19     8.18     5.93     5.01     4.50     4.17     3.94     3.77     3.63     3.52     3.43
 20     8.10     5.85     4.94     4.43     4.10     3.87     3.70     3.56     3.46     3.37
 21     8.02     5.78     4.87     4.37     4.04     3.81     3.64     3.51     3.40     3.31
 22     7.95     5.72     4.82     4.31     3.99     3.76     3.59     3.45     3.35     3.26
 23     7.88     5.66     4.76     4.26     3.94     3.71     3.54     3.41     3.30     3.21
 24     7.82     5.61     4.72     4.22     3.90     3.67     3.50     3.36     3.26     3.17
 25     7.77     5.57     4.68     4.18     3.85     3.63     3.46     3.32     3.22     3.13
 26     7.72     5.53     4.64     4.14     3.82     3.59     3.42     3.29     3.18     3.09
 27     7.68     5.49     4.60     4.11     3.78     3.56     3.39     3.26     3.15     3.06
 28     7.64     5.45     4.57     4.07     3.75     3.53     3.36     3.23     3.12     3.03
 29     7.60     5.42     4.54     4.04     3.73     3.50     3.33     3.20     3.09     3.00
 30     7.56     5.39     4.51     4.02     3.70     3.47     3.30     3.17     3.07     2.98
 35     7.42     5.27     4.40     3.91     3.59     3.37     3.20     3.07     2.96     2.88
 40     7.31     5.18     4.31     3.83     3.51     3.29     3.12     2.99     2.89     2.80
 45     7.23     5.11     4.25     3.77     3.45     3.23     3.07     2.94     2.83     2.74
 50     7.17     5.06     4.20     3.72     3.41     3.19     3.02     2.89     2.78     2.70
100     6.90     4.82     3.98     3.51     3.21     2.99     2.82     2.69     2.59     2.50

Critical values of the Snedecor's F-distribution.
TABLE B Student's t-Distribution
Critical values tc such that P(Tcal > tc) = a (associated probability for a right-tailed test). Rows: degrees of freedom (n); columns: probability a.

  n    0.25   0.10   0.05   0.025    0.01   0.005  0.0025   0.001   0.0005
  1   1.000  3.078  6.314  12.706  31.821  63.657   127.3  318.309  636.619
  2   0.816  1.886  2.920   4.303   6.965   9.925   14.09    22.33    31.60
  3   0.765  1.638  2.353   3.182   4.541   5.841   7.453    10.21    12.92
  4   0.741  1.533  2.132   2.776   3.747   4.604   5.598    7.173    8.610
  5   0.727  1.476  2.015   2.571   3.365   4.032   4.773    5.894    6.869
  6   0.718  1.440  1.943   2.447   3.143   3.707   4.317    5.208    5.959
  7   0.711  1.415  1.895   2.365   2.998   3.499   4.029    4.785    5.408
  8   0.706  1.397  1.860   2.306   2.896   3.355   3.833    4.501    5.041
  9   0.703  1.383  1.833   2.262   2.821   3.250   3.690    4.297    4.781
 10   0.700  1.372  1.812   2.228   2.764   3.169   3.581    4.144    4.587
 11   0.697  1.363  1.796   2.201   2.718   3.106   3.497    4.025    4.437
 12   0.695  1.356  1.782   2.179   2.681   3.055   3.428    3.930    4.318
 13   0.694  1.350  1.771   2.160   2.650   3.012   3.372    3.852    4.221
 14   0.692  1.345  1.761   2.145   2.624   2.977   3.326    3.787    4.140
 15   0.691  1.341  1.753   2.131   2.602   2.947   3.286    3.733    4.073
 16   0.690  1.337  1.746   2.120   2.583   2.921   3.252    3.686    4.015
 17   0.689  1.333  1.740   2.110   2.567   2.898   3.222    3.646    3.965
 18   0.688  1.330  1.734   2.101   2.552   2.878   3.197    3.610    3.922
 19   0.688  1.328  1.729   2.093   2.539   2.861   3.174    3.579    3.883
 20   0.687  1.325  1.725   2.086   2.528   2.845   3.153    3.552    3.850
 21   0.686  1.323  1.721   2.080   2.518   2.831   3.135    3.527    3.819
 22   0.686  1.321  1.717   2.074   2.508   2.819   3.119    3.505    3.792
 23   0.685  1.319  1.714   2.069   2.500   2.807   3.104    3.485    3.768
 24   0.685  1.318  1.711   2.064   2.492   2.797   3.091    3.467    3.745
 25   0.684  1.316  1.708   2.060   2.485   2.787   3.078    3.450    3.725
 26   0.684  1.315  1.706   2.056   2.479   2.779   3.067    3.435    3.707
 27   0.684  1.314  1.703   2.052   2.473   2.771   3.057    3.421    3.689
 28   0.683  1.313  1.701   2.048   2.467   2.763   3.047    3.408    3.674
 29   0.683  1.311  1.699   2.045   2.462   2.756   3.038    3.396    3.660
 30   0.683  1.310  1.697   2.042   2.457   2.750   3.030    3.385    3.646
 35   0.682  1.306  1.690   2.030   2.438   2.724   2.996    3.340    3.591
 40   0.681  1.303  1.684   2.021   2.423   2.704   2.971    3.307    3.551
 45   0.680  1.301  1.679   2.014   2.412   2.690   2.952    3.281    3.520
 50   0.679  1.299  1.676   2.009   2.403   2.678   2.937    3.261    3.496
  z   0.674  1.282  1.645   1.960   2.326   2.576   2.807    3.090    3.291

Critical values of the Student's t-distribution.
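Tables A and B are linked: for one numerator degree of freedom, the F critical value F(alpha; 1, n) equals the square of the two-tailed t critical value t(alpha/2; n). A quick check against a few tabulated rows:

```python
# (t from Table B at a = 0.025, F from Table A at a = 0.05 with n1 = 1), for n = 10, 20, 30
pairs = [
    (2.228, 4.96),
    (2.086, 4.35),
    (2.042, 4.17),
]
for t, f in pairs:
    # t^2 should reproduce the tabulated F value up to rounding
    assert abs(t ** 2 - f) < 0.01, (t, f)
print("t^2 matches F(1, n) for all checked rows")
```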
TABLE C Durbin-Watson (DW) Distribution
Decision regions of the DW statistic: positive autocorrelation (DW < dL); inconclusive test (dL ≤ DW ≤ dU); no autocorrelation (dU < DW < 4 − dU); inconclusive test (4 − dU ≤ DW ≤ 4 − dL); negative autocorrelation (DW > 4 − dL).
DW statistics for models with intercept, level of significance α = 5%. k = number of parameters (includes the intercept).

n   | k=2 dL dU   | k=3         | k=4         | k=5         | k=6         | k=7         | k=8         | k=9         | k=10
6   | 0.610 1.400 | –     –     | –     –     | –     –     | –     –     | –     –     | –     –     | –     –     | –     –
7   | 0.700 1.356 | 0.467 1.896 | –     –     | –     –     | –     –     | –     –     | –     –     | –     –     | –     –
8   | 0.763 1.332 | 0.559 1.777 | 0.367 2.287 | –     –     | –     –     | –     –     | –     –     | –     –     | –     –
9   | 0.824 1.320 | 0.629 1.699 | 0.455 2.128 | 0.296 2.588 | –     –     | –     –     | –     –     | –     –     | –     –
10  | 0.879 1.320 | 0.697 1.641 | 0.525 2.016 | 0.376 2.414 | 0.243 2.822 | –     –     | –     –     | –     –     | –     –
11  | 0.927 1.324 | 0.758 1.604 | 0.595 1.928 | 0.444 2.283 | 0.315 2.645 | 0.203 3.004 | –     –     | –     –     | –     –
12  | 0.971 1.331 | 0.812 1.579 | 0.658 1.864 | 0.512 2.177 | 0.380 2.506 | 0.268 2.832 | 0.171 3.149 | –     –     | –     –
13  | 1.010 1.340 | 0.861 1.562 | 0.715 1.816 | 0.574 2.094 | 0.444 2.390 | 0.328 2.692 | 0.230 2.985 | 0.147 3.266 | –     –
14  | 1.045 1.350 | 0.905 1.551 | 0.767 1.779 | 0.632 2.030 | 0.505 2.296 | 0.389 2.572 | 0.286 2.848 | 0.200 3.111 | 0.127 3.360
15  | 1.077 1.361 | 0.946 1.543 | 0.814 1.750 | 0.685 1.977 | 0.562 2.220 | 0.447 2.471 | 0.343 2.727 | 0.251 2.979 | 0.175 3.216
16  | 1.106 1.371 | 0.982 1.539 | 0.857 1.728 | 0.734 1.935 | 0.615 2.157 | 0.502 2.388 | 0.398 2.624 | 0.304 2.860 | 0.222 3.090
17  | 1.133 1.381 | 1.015 1.536 | 0.897 1.710 | 0.779 1.900 | 0.664 2.104 | 0.554 2.318 | 0.451 2.537 | 0.356 2.757 | 0.272 2.975
18  | 1.158 1.391 | 1.046 1.535 | 0.933 1.696 | 0.820 1.872 | 0.710 2.060 | 0.603 2.258 | 0.502 2.461 | 0.407 2.668 | 0.321 2.873
19  | 1.180 1.401 | 1.074 1.536 | 0.967 1.685 | 0.859 1.848 | 0.752 2.023 | 0.649 2.206 | 0.549 2.396 | 0.456 2.589 | 0.369 2.783
20  | 1.201 1.411 | 1.100 1.537 | 0.998 1.676 | 0.894 1.828 | 0.792 1.991 | 0.691 2.162 | 0.595 2.339 | 0.502 2.521 | 0.416 2.704
21  | 1.221 1.420 | 1.125 1.538 | 1.026 1.669 | 0.927 1.812 | 0.829 1.964 | 0.731 2.124 | 0.637 2.290 | 0.546 2.461 | 0.461 2.633
22  | 1.239 1.429 | 1.147 1.541 | 1.053 1.664 | 0.958 1.797 | 0.863 1.940 | 0.769 2.090 | 0.677 2.246 | 0.588 2.407 | 0.504 2.571
23  | 1.257 1.437 | 1.168 1.543 | 1.078 1.660 | 0.986 1.785 | 0.895 1.920 | 0.804 2.061 | 0.715 2.208 | 0.628 2.360 | 0.545 2.514
24  | 1.273 1.446 | 1.188 1.546 | 1.101 1.656 | 1.013 1.775 | 0.925 1.902 | 0.837 2.035 | 0.750 2.174 | 0.666 2.318 | 0.584 2.464
25  | 1.288 1.454 | 1.206 1.550 | 1.123 1.654 | 1.038 1.767 | 0.953 1.886 | 0.868 2.013 | 0.784 2.144 | 0.702 2.280 | 0.621 2.419
26  | 1.302 1.461 | 1.224 1.553 | 1.143 1.652 | 1.062 1.759 | 0.979 1.873 | 0.897 1.992 | 0.816 2.117 | 0.735 2.246 | 0.657 2.379
27  | 1.316 1.469 | 1.240 1.556 | 1.162 1.651 | 1.084 1.753 | 1.004 1.861 | 0.925 1.974 | 0.845 2.093 | 0.767 2.216 | 0.691 2.342
28  | 1.328 1.476 | 1.255 1.560 | 1.181 1.650 | 1.104 1.747 | 1.028 1.850 | 0.951 1.959 | 0.874 2.071 | 0.798 2.188 | 0.723 2.309
29  | 1.341 1.483 | 1.270 1.563 | 1.198 1.650 | 1.124 1.743 | 1.050 1.841 | 0.975 1.944 | 0.900 2.052 | 0.826 2.164 | 0.753 2.278
30  | 1.352 1.489 | 1.284 1.567 | 1.214 1.650 | 1.143 1.739 | 1.071 1.833 | 0.998 1.931 | 0.926 2.034 | 0.854 2.141 | 0.782 2.251
31  | 1.363 1.496 | 1.297 1.570 | 1.229 1.650 | 1.160 1.735 | 1.090 1.825 | 1.020 1.920 | 0.950 2.018 | 0.879 2.120 | 0.810 2.226
32  | 1.373 1.502 | 1.309 1.574 | 1.244 1.650 | 1.177 1.732 | 1.109 1.819 | 1.041 1.909 | 0.972 2.004 | 0.904 2.102 | 0.836 2.203
33  | 1.383 1.508 | 1.321 1.577 | 1.258 1.651 | 1.193 1.730 | 1.127 1.813 | 1.061 1.900 | 0.994 1.991 | 0.927 2.085 | 0.861 2.181
34  | 1.393 1.514 | 1.333 1.580 | 1.271 1.652 | 1.208 1.728 | 1.144 1.808 | 1.079 1.891 | 1.015 1.978 | 0.950 2.069 | 0.885 2.162
35  | 1.402 1.519 | 1.343 1.584 | 1.283 1.653 | 1.222 1.726 | 1.160 1.803 | 1.097 1.884 | 1.034 1.967 | 0.971 2.054 | 0.908 2.144
36  | 1.411 1.525 | 1.354 1.587 | 1.295 1.654 | 1.236 1.724 | 1.175 1.799 | 1.114 1.876 | 1.053 1.957 | 0.991 2.041 | 0.930 2.127
37  | 1.419 1.530 | 1.364 1.590 | 1.307 1.655 | 1.249 1.723 | 1.190 1.795 | 1.131 1.870 | 1.071 1.948 | 1.011 2.029 | 0.951 2.112
38  | 1.427 1.535 | 1.373 1.594 | 1.318 1.656 | 1.261 1.722 | 1.204 1.792 | 1.146 1.864 | 1.088 1.939 | 1.029 2.017 | 0.970 2.098
39  | 1.435 1.540 | 1.382 1.597 | 1.328 1.658 | 1.273 1.722 | 1.218 1.789 | 1.161 1.859 | 1.104 1.932 | 1.047 2.007 | 0.990 2.085
40  | 1.442 1.544 | 1.391 1.600 | 1.338 1.659 | 1.285 1.721 | 1.230 1.786 | 1.175 1.854 | 1.120 1.924 | 1.064 1.997 | 1.008 2.072
45  | 1.475 1.566 | 1.430 1.615 | 1.383 1.666 | 1.336 1.720 | 1.287 1.776 | 1.238 1.835 | 1.189 1.895 | 1.139 1.958 | 1.089 2.022
50  | 1.503 1.585 | 1.462 1.628 | 1.421 1.674 | 1.378 1.721 | 1.335 1.771 | 1.291 1.822 | 1.246 1.875 | 1.201 1.930 | 1.156 1.986
55  | 1.528 1.601 | 1.490 1.641 | 1.452 1.681 | 1.414 1.724 | 1.374 1.768 | 1.334 1.814 | 1.294 1.861 | 1.253 1.909 | 1.212 1.959
60  | 1.549 1.616 | 1.514 1.652 | 1.480 1.689 | 1.444 1.727 | 1.408 1.767 | 1.372 1.808 | 1.335 1.850 | 1.298 1.894 | 1.260 1.939
65  | 1.567 1.629 | 1.536 1.662 | 1.503 1.696 | 1.471 1.731 | 1.438 1.767 | 1.404 1.805 | 1.370 1.843 | 1.336 1.882 | 1.301 1.923
70  | 1.583 1.641 | 1.554 1.672 | 1.525 1.703 | 1.494 1.735 | 1.464 1.768 | 1.433 1.802 | 1.401 1.838 | 1.369 1.874 | 1.337 1.910
75  | 1.598 1.652 | 1.571 1.680 | 1.543 1.709 | 1.515 1.739 | 1.487 1.770 | 1.458 1.801 | 1.428 1.834 | 1.399 1.867 | 1.369 1.901
80  | 1.611 1.662 | 1.586 1.688 | 1.560 1.715 | 1.534 1.743 | 1.507 1.772 | 1.480 1.801 | 1.453 1.831 | 1.425 1.861 | 1.397 1.893
85  | 1.624 1.671 | 1.600 1.696 | 1.575 1.721 | 1.550 1.747 | 1.525 1.774 | 1.500 1.801 | 1.474 1.829 | 1.448 1.857 | 1.422 1.886
90  | 1.635 1.679 | 1.612 1.703 | 1.589 1.726 | 1.566 1.751 | 1.542 1.776 | 1.518 1.801 | 1.494 1.827 | 1.469 1.854 | 1.445 1.881
95  | 1.645 1.687 | 1.623 1.709 | 1.602 1.732 | 1.579 1.755 | 1.557 1.778 | 1.535 1.802 | 1.512 1.827 | 1.489 1.852 | 1.465 1.877
100 | 1.654 1.694 | 1.634 1.715 | 1.613 1.736 | 1.592 1.758 | 1.571 1.780 | 1.550 1.803 | 1.528 1.827 | 1.489 1.852 | 1.465 1.877
150 | 1.720 1.747 | 1.706 1.760 | 1.693 1.774 | 1.679 1.788 | 1.665 1.802 | 1.651 1.817 | 1.637 1.832 | 1.622 1.846 | 1.608 1.862
200 | 1.758 1.779 | 1.748 1.789 | 1.738 1.799 | 1.728 1.809 | 1.718 1.820 | 1.707 1.831 | 1.697 1.841 | 1.686 1.852 | 1.675 1.863
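The DW statistic that Table C is consulted with is computed directly from the regression residuals: d = Σ(e_t − e_{t−1})² / Σe_t². A minimal stdlib sketch, with two hypothetical residual series chosen to show the two extremes of the statistic:

```python
# Durbin-Watson statistic from a list of regression residuals.
def durbin_watson(residuals):
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

# Hypothetical data: residuals that keep the same sign (strong positive
# autocorrelation) push DW toward 0; perfectly alternating residuals
# (strong negative autocorrelation) push it toward 4; independence ~ 2.
positive = [1.0, 1.1, 0.9, 1.2, 1.0, 0.8, 1.1, 0.9, 1.0, 1.1]
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
dw_pos = durbin_watson(positive)    # close to 0
dw_alt = durbin_watson(alternating) # close to 4
```

The computed d is then placed in the decision regions of Table C (below dL, between dL and dU, etc.) for the given n and k.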
TABLE D Chi-Square Distribution P(χ²cal with ν degrees of freedom > χ²c) = α

ν    0.99     0.975    0.95     0.90     0.10     0.05     0.025    0.01     0.005
1    0.000    0.001    0.004    0.016    2.706    3.841    5.024    6.635    7.879
2    0.020    0.051    0.103    0.211    4.605    5.991    7.378    9.210    10.597
3    0.115    0.216    0.352    0.584    6.251    7.815    9.348    11.345   12.838
4    0.297    0.484    0.711    1.064    7.779    9.488    11.143   13.277   14.860
5    0.554    0.831    1.145    1.610    9.236    11.070   12.832   15.086   16.750
6    0.872    1.237    1.635    2.204    10.645   12.592   14.449   16.812   18.548
7    1.239    1.690    2.167    2.833    12.017   14.067   16.013   18.475   20.278
8    1.647    2.180    2.733    3.490    13.362   15.507   17.535   20.090   21.955
9    2.088    2.700    3.325    4.168    14.684   16.919   19.023   21.666   23.589
10   2.558    3.247    3.940    4.865    15.987   18.307   20.483   23.209   25.188
11   3.053    3.816    4.575    5.578    17.275   19.675   21.920   24.725   26.757
12   3.571    4.404    5.226    6.304    18.549   21.026   23.337   26.217   28.300
13   4.107    5.009    5.892    7.041    19.812   22.362   24.736   27.688   29.819
14   4.660    5.629    6.571    7.790    21.064   23.685   26.119   29.141   31.319
15   5.229    6.262    7.261    8.547    22.307   24.996   27.488   30.578   32.801
16   5.812    6.908    7.962    9.312    23.542   26.296   28.845   32.000   34.267
17   6.408    7.564    8.672    10.085   24.769   27.587   30.191   33.409   35.718
18   7.015    8.231    9.390    10.865   25.989   28.869   31.526   34.805   37.156
19   7.633    8.907    10.117   11.651   27.204   30.144   32.852   36.191   38.582
20   8.260    9.591    10.851   12.443   28.412   31.410   34.170   37.566   39.997
21   8.897    10.283   11.591   13.240   29.615   32.671   35.479   38.932   41.401
22   9.542    10.982   12.338   14.041   30.813   33.924   36.781   40.289   42.796
23   10.196   11.689   13.091   14.848   32.007   35.172   38.076   41.638   44.181
24   10.856   12.401   13.848   15.659   33.196   36.415   39.364   42.980   45.558
25   11.524   13.120   14.611   16.473   34.382   37.652   40.646   44.314   46.928
26   12.198   13.844   15.379   17.292   35.563   38.885   41.923   45.642   48.290
27   12.878   14.573   16.151   18.114   36.741   40.113   43.195   46.963   49.645
28   13.565   15.308   16.928   18.939   37.916   41.337   44.461   48.278   50.994
29   14.256   16.047   17.708   19.768   39.087   42.557   45.722   49.588   52.335
30   14.953   16.791   18.493   20.599   40.256   43.773   46.979   50.892   53.672
31   15.655   17.539   19.281   21.434   41.422   44.985   48.232   52.191   55.002
32   16.362   18.291   20.072   22.271   42.585   46.194   49.480   53.486   56.328
33   17.073   19.047   20.867   23.110   43.745   47.400   50.725   54.775   57.648
34   17.789   19.806   21.664   23.952   44.903   48.602   51.966   56.061   58.964
35   18.509   20.569   22.465   24.797   46.059   49.802   53.203   57.342   60.275
36   19.233   21.336   23.269   25.643   47.212   50.998   54.437   58.619   61.581
37   19.960   22.106   24.075   26.492   48.363   52.192   55.668   59.893   62.883
38   20.691   22.878   24.884   27.343   49.513   53.384   56.895   61.162   64.181
39   21.426   23.654   25.695   28.196   50.660   54.572   58.120   62.428   65.475
40   22.164   24.433   26.509   29.051   51.805   55.758   59.342   63.691   66.766
41   22.906   25.215   27.326   29.907   52.949   56.942   60.561   64.950   68.053
42   23.650   25.999   28.144   30.765   54.090   58.124   61.777   66.206   69.336
43   24.398   26.785   28.965   31.625   55.230   59.304   62.990   67.459   70.616
44   25.148   27.575   29.787   32.487   56.369   60.481   64.201   68.710   71.892
45   25.901   28.366   30.612   33.350   57.505   61.656   65.410   69.957   73.166
46   26.657   29.160   31.439   34.215   58.641   62.830   66.616   71.201   74.437
47   27.416   29.956   32.268   35.081   59.774   64.001   67.821   72.443   75.704
48   28.177   30.754   33.098   35.949   60.907   65.171   69.023   73.683   76.969
49   28.941   31.555   33.930   36.818   62.038   66.339   70.222   74.919   78.231
50   29.707   32.357   34.764   37.689   63.167   67.505   71.420   76.154   79.490

Critical values (for a right-tailed unilateral test) of the chi-square distribution.
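A tabulated χ² critical value is typically compared against the goodness-of-fit statistic χ² = Σ(O − E)²/E. A short stdlib sketch with hypothetical counts, using the Table D entry for ν = 2 and right-tail area 0.05 (5.991):

```python
# Chi-square goodness-of-fit statistic for hypothetical category counts.
observed = [50, 30, 20]          # observed counts in 3 categories (invented)
expected = [40.0, 35.0, 25.0]    # counts implied by the null hypothesis (invented)
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1           # df = number of categories - 1

critical = 5.991                 # Table D: df = 2, right-tail area 0.05
reject = chi2 > critical
```

Here χ² ≈ 4.21 < 5.991, so the null distribution of category proportions is not rejected at the 5% level.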
TABLE E Standard Normal Distribution P(zcal > zc) = α
Each body entry is the right-tail area beyond zc; rows give zc to one decimal place, columns give the second decimal of zc.

zc    0.00    0.01    0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09
0.0   0.5000  0.4960  0.4920  0.4880  0.4840  0.4801  0.4761  0.4721  0.4681  0.4641
0.1   0.4602  0.4562  0.4522  0.4483  0.4443  0.4404  0.4364  0.4325  0.4286  0.4247
0.2   0.4207  0.4168  0.4129  0.4090  0.4052  0.4013  0.3974  0.3936  0.3897  0.3859
0.3   0.3821  0.3783  0.3745  0.3707  0.3669  0.3632  0.3594  0.3557  0.3520  0.3483
0.4   0.3446  0.3409  0.3372  0.3336  0.3300  0.3264  0.3228  0.3192  0.3156  0.3121
0.5   0.3085  0.3050  0.3015  0.2981  0.2946  0.2912  0.2877  0.2843  0.2810  0.2776
0.6   0.2743  0.2709  0.2676  0.2643  0.2611  0.2578  0.2546  0.2514  0.2483  0.2451
0.7   0.2420  0.2389  0.2358  0.2327  0.2296  0.2266  0.2236  0.2206  0.2177  0.2148
0.8   0.2119  0.2090  0.2061  0.2033  0.2005  0.1977  0.1949  0.1922  0.1894  0.1867
0.9   0.1841  0.1814  0.1788  0.1762  0.1736  0.1711  0.1685  0.1660  0.1635  0.1611
1.0   0.1587  0.1562  0.1539  0.1515  0.1492  0.1469  0.1446  0.1423  0.1401  0.1379
1.1   0.1357  0.1335  0.1314  0.1292  0.1271  0.1251  0.1230  0.1210  0.1190  0.1170
1.2   0.1151  0.1131  0.1112  0.1093  0.1075  0.1056  0.1038  0.1020  0.1003  0.0985
1.3   0.0968  0.0951  0.0934  0.0918  0.0901  0.0885  0.0869  0.0853  0.0838  0.0823
1.4   0.0808  0.0793  0.0778  0.0764  0.0749  0.0735  0.0721  0.0708  0.0694  0.0681
1.5   0.0668  0.0655  0.0643  0.0630  0.0618  0.0606  0.0594  0.0582  0.0571  0.0559
1.6   0.0548  0.0537  0.0526  0.0516  0.0505  0.0495  0.0485  0.0475  0.0465  0.0455
1.7   0.0446  0.0436  0.0427  0.0418  0.0409  0.0401  0.0392  0.0384  0.0375  0.0367
1.8   0.0359  0.0351  0.0344  0.0336  0.0329  0.0322  0.0314  0.0307  0.0301  0.0294
1.9   0.0287  0.0281  0.0274  0.0268  0.0262  0.0256  0.0250  0.0244  0.0239  0.0233
2.0   0.0228  0.0222  0.0217  0.0212  0.0207  0.0202  0.0197  0.0192  0.0188  0.0183
2.1   0.0179  0.0174  0.0170  0.0166  0.0162  0.0158  0.0154  0.0150  0.0146  0.0143
2.2   0.0139  0.0136  0.0132  0.0129  0.0125  0.0122  0.0119  0.0116  0.0113  0.0110
2.3   0.0107  0.0104  0.0102  0.0099  0.0096  0.0094  0.0091  0.0089  0.0087  0.0084
2.4   0.0082  0.0080  0.0078  0.0075  0.0073  0.0071  0.0069  0.0068  0.0066  0.0064
2.5   0.0062  0.0060  0.0059  0.0057  0.0055  0.0054  0.0052  0.0051  0.0049  0.0048
2.6   0.0047  0.0045  0.0044  0.0043  0.0041  0.0040  0.0039  0.0038  0.0037  0.0036
2.7   0.0035  0.0034  0.0033  0.0032  0.0031  0.0030  0.0029  0.0028  0.0027  0.0026
2.8   0.0026  0.0025  0.0024  0.0023  0.0023  0.0022  0.0021  0.0021  0.0020  0.0019
2.9   0.0019  0.0018  0.0017  0.0017  0.0016  0.0016  0.0015  0.0015  0.0014  0.0014
3.0   0.0013  0.0013  0.0013  0.0012  0.0012  0.0011  0.0011  0.0011  0.0010  0.0010
3.1   0.0010  0.0009  0.0009  0.0009  0.0008  0.0008  0.0008  0.0008  0.0007  0.0007
3.2   0.0007
3.3   0.0005
3.4   0.0003
3.5   0.00023
3.6   0.00016
3.7   0.00011
3.8   0.00007
3.9   0.00005
4.0   0.00003

Associated probability for a right-tailed test.
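The entries of Table E can be reproduced from the complementary error function, since P(Z > z) = ½·erfc(z/√2) for a standard normal Z. A quick stdlib check against a few table entries:

```python
import math

def normal_upper_tail(z):
    """P(Z > z) for a standard normal Z, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# Reproduce a few entries of Table E.
print(round(normal_upper_tail(0.00), 4))  # 0.5
print(round(normal_upper_tail(1.00), 4))  # 0.1587
print(round(normal_upper_tail(1.96), 4))  # 0.025
```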
TABLE F1
Binomial Distribution P[Y = k] = C(N, k) · p^k · (1 − p)^(N−k)
N
k
0.01
0.05
0.10
0.15
0.20
0.25
0.30
1/3
0.40
0.45
0.50
2
0
9801
9025
8100
7225
6400
5625
4900
4444
3600
3025
2500
2
1
198
950
1800
2550
3200
3750
4200
4444
4800
4950
5000
1
2
1
25
100
225
400
625
900
1111
1600
2025
2500
0
0
9703
8574
7290
6141
5120
4219
3430
2963
2160
1664
1250
3
1
294
1354
2430
3251
3840
4219
4410
4444
4320
4084
3750
2
2
3
71
270
574
960
1406
1890
2222
2880
3341
3750
1
3
0
1
10
34
80
156
270
370
640
911
1250
0
0
9606
8145
6561
5220
4096
3164
2401
1975
1296
915
625
4
1
388
1715
2916
3685
4096
4219
4116
3951
3456
2995
2500
3
2
6
135
486
975
1536
2109
2646
2963
3456
3675
3750
2
3
0
5
36
115
256
469
756
988
1536
2005
2500
1
4
0
0
1
5
16
39
81
123
256
410
625
0
0
9510
7738
5905
4437
3277
2373
1681
1317
778
503
312
5
1
480
2036
3280
3915
4096
3955
3602
3292
2592
2059
1562
4
2
10
214
729
1382
2048
2637
3087
3292
3456
3369
3125
3
3
0
11
81
244
512
879
1323
1646
2304
2757
3125
2
4
0
0
4
22
64
146
283
412
768
1128
1562
1
5
0
0
0
1
3
10
24
41
102
185
312
0
0
9415
7351
5314
3771
2621
1780
1176
878
467
277
156
6
1
571
2321
3543
3993
3932
3560
3025
2634
1866
1359
938
5
3
4
5
6
2
3
4
5
6
TABLE F1 Binomial Distribution—cont’d p N
7
k
0.01
0.05
0.10
0.15
0.20
0.25
0.30
1/3
0.40
0.45
0.50
2
14
305
984
1762
2458
2966
3241
3292
3110
2780
2344
4
3
0
21
146
415
819
1318
1852
2195
2765
3032
3125
3
4
0
1
12
55
154
330
595
823
1382
1861
2344
2
5
0
0
1
4
15
44
102
165
369
609
938
1
6
0
0
0
0
1
2
7
14
41
83
156
0
0
9321
6983
4783
3206
2097
1335
824
585
280
152
78
7
1
659
2573
3720
3960
3670
3115
2471
2048
1306
872
547
6
2
20
406
1240
2097
2753
3115
3177
3073
2613
2140
1641
5
3
0
36
230
617
1147
1730
2269
2561
2903
2918
2734
4
4
0
2
26
109
287
577
972
1280
1935
2388
2734
3
5
0
0
2
12
43
115
250
384
774
1172
1641
2
6
0
0
0
1
4
13
36
64
172
320
547
1
7
0
0
0
0
0
1
2
5
16
37
78
0
0.99
0.95
0.90
0.85
0.80
0.75
0.70
2/3
0.60
0.55
0.50
7
k
N
8
p
p N
k
0.01
0.05
0.10
0.15
0.20
0.25
0.30
8
0
9227
6634
4305
2725
1678
1001
576
1
746
2793
3826
3847
3355
2670
2
26
515
1488
2376
2936
3
1
54
331
839
4
0
4
46
5
0
0
6
0
7
9
1/3
0.40
0.45
0.50
390
168
84
39
8
1977
1561
896
548
312
7
3115
2965
2731
2090
1569
1094
6
1468
2076
2541
2731
2787
2568
2188
5
185
459
865
1361
1707
2322
2627
2734
4
4
26
92
231
467
683
1239
1719
2188
3
0
0
2
11
38
100
171
413
703
1094
2
0
0
0
0
1
4
12
24
79
164
312
1
8
0
0
0
0
0
0
1
2
7
17
39
0
0
9135
6302
3874
2316
1342
751
404
260
101
46
20
9
1
830
2985
3874
3679
3020
2253
1556
1171
605
339
176
8
2
34
629
1722
2597
3020
3003
2668
2341
1612
1110
703
7
3
1
77
446
1069
1762
2336
2668
2731
2508
2119
1641
6
4
0
6
74
283
661
1168
1715
2048
2508
2600
2461
5
5
0
0
8
50
165
389
735
1024
1672
2128
2461
4
6
0
0
1
6
28
87
210
341
743
1160
1641
3
7
0
0
0
0
3
12
39
73
212
407
703
2
9
TABLE F1 Binomial Distribution—cont’d p N
k
0.01
0.05
0.10
0.15
0.20
0.25
0.30
8
0
0
0
0
0
1
4
9
0
0
0
0
0
0
10 0
9044
5987
3487
1969
1074
1
914
3151
3874
3474
2
42
746
1937
3
1
105
4
0
5
1/3
0.40
0.45
0.50
9
35
83
176
1
0
1
3
8
20
0
563
282
173
60
25
10
10
2684
1877
1211
867
403
207
98
9
2759
3020
2816
2335
1951
1209
763
439
5
574
1298
2013
2503
2668
2601
2150
1665
1172
7
10
112
401
881
1460
2001
2276
2508
2384
2051
6
0
1
15
85
264
584
1029
1366
2007
2340
2461
5
6
0
0
1
12
55
162
368
569
1115
1596
2051
4
7
0
0
0
1
8
31
90
163
425
746
1172
3
8
0
0
0
0
1
4
14
30
106
229
439
2
9
0
0
0
0
0
0
1
3
16
42
98
1
10
0
0
0
0
0
0
0
0
1
3
10
0
15 0
8601
4633
2059
874
352
134
47
23
5
1
0
15
1
1303
3658
3432
2312
1319
668
305
171
47
16
5
14
2
92
1348
2669
2856
2309
1559
916
599
219
90
32
13
3
4
307
1285
2184
2501
2252
1700
1299
634
318
139
12
4
0
49
428
1156
1876
2252
2186
1948
1268
780
417
11
10
15
5
0
6
105
449
1032
1651
2061
2143
1859
1404
916
10
6
0
0
19
132
430
917
1472
1786
2066
1914
1527
9
7
0
0
3
30
138
393
811
1148
1771
2013
1964
8
8
0
0
0
5
35
131
348
574
1181
1647
1964
7
9
0
0
0
1
7
34
116
223
612
1048
1527
6
10
0
0
0
0
1
7
30
67
245
515
916
5
11
0
0
0
0
0
1
6
15
74
191
417
4
12
0
0
0
0
0
0
1
3
16
52
139
3
13
0
0
0
0
0
0
0
0
3
10
32
2
14
0
0
0
0
0
0
0
0
0
1
5
1
15
0
0
0
0
0
0
0
0
0
0
0
0
0.99
0.95
0.90
0.85
0.80
0.75
0.70
0.60
0.55
0.50
k
N
20
20
2/3
p
p N
k
0.01
0.05
0.10
0.15
0.20
0.25
0.30
20 0
8179
3585
1216
388
115
32
8
1/3 3
0.40
0.45
0.50
0
0
0
TABLE F1 Binomial Distribution—cont’d p N
k
0.01
0.05
0.10
0.15
0.20
0.25
0.30
1
1652
3774
2702
1368
576
211
68
2
159
1887
2852
2293
1369
669
3
10
596
1901
2428
2054
4
0
133
898
1821
5
0
22
319
6
0
3
7
0
8
1/3
0.40
0.45
0.50
30
5
1
0
19
278
143
31
8
2
18
1339
716
429
123
40
11
17
2182
1897
1304
911
350
139
46
16
1028
1746
2023
1789
1457
746
365
148
15
89
454
1091
1686
1916
1821
1244
746
370
14
0
20
160
545
1124
1643
1821
1659
1221
739
13
0
0
4
46
222
609
1144
1480
1797
1623
1201
12
9
0
0
1
11
74
271
654
987
1597
1771
1602
11
10
0
0
0
2
20
99
308
543
1171
1593
1762
10
11
0
0
0
0
5
30
120
247
710
1185
1602
9
12
0
0
0
0
1
8
39
92
355
727
1201
8
13
0
0
0
0
0
2
10
28
146
366
739
7
14
0
0
0
0
0
0
2
7
49
150
370
6
15
0
0
0
0
0
0
0
1
13
49
148
5
16
0
0
0
0
0
0
0
0
3
13
46
4
17
0
0
0
0
0
0
0
0
0
2
11
3
18
0
0
0
0
0
0
0
0
0
0
2
2
19
0
0
0
0
0
0
0
0
0
0
0
1
20
0
0
0
0
0
0
0
0
0
0
0
0
25 0
7778
2774
718
172
38
8
1
0
0
0
0
25
1
1964
3650
1994
759
236
63
14
5
0
0
0
24
2
238
2305
2659
1607
708
251
74
30
4
1
0
23
3
18
930
2265
2174
1358
641
243
114
19
4
1
22
4
1
269
1384
2110
1867
1175
572
313
71
18
4
21
5
0
60
646
1564
1960
1645
1030
658
199
63
16
20
6
0
10
239
920
1633
1828
1472
1096
442
172
53
19
7
0
1
72
441
1108
1654
1712
1487
800
381
143
18
8
0
0
18
175
623
1241
1651
1673
1200
701
322
17
9
0
0
4
58
294
781
1336
1580
1511
1084
609
16
10
0
0
1
16
118
417
916
1264
1612
1419
974
15
11
0
0
0
4
40
189
536
862
1465
1583
1328
14
12
0
0
0
1
12
74
268
503
1140
1511
1550
13
25
TABLE F1 Binomial Distribution—cont’d p N
k
0.01
0.05
0.10
0.15
0.20
0.25
0.30
13
0
0
0
0
3
25
115
14
0
0
0
0
1
7
15
0
0
0
0
0
16
0
0
0
0
17
0
0
0
18
0
0
19
0
20
1/3
0.40
0.45
0.50
251
760
1236
1550
12
42
108
434
867
1328
11
2
13
40
212
520
974
10
0
0
4
12
88
266
609
9
0
0
0
1
3
31
115
322
8
0
0
0
0
0
1
9
42
143
7
0
0
0
0
0
0
0
2
13
53
6
0
0
0
0
0
0
0
0
0
3
16
5
21
0
0
0
0
0
0
0
0
0
1
4
4
22
0
0
0
0
0
0
0
0
0
0
1
3
23
0
0
0
0
0
0
0
0
0
0
0
2
24
0
0
0
0
0
0
0
0
0
0
0
1
25
0
0
0
0
0
0
0
0
0
0
0
0
0.99
0.95
0.90
0.85
0.80
0.75
0.70
0.60
0.55
0.50
k
N
30
2/3
p
p N
k
0.01
0.05
0.10
0.15
0.20
0.25
0.30
30 0
7397
2146
424
76
12
2
0
1
2242
3389
1413
404
93
18
2
328
2586
2277
1034
337
3
31
1270
2361
1703
4
2
451
1771
5
0
124
6
0
7
1/3
0.40
0.45
0.50
0
0
0
0
30
3
1
0
0
0
29
86
18
6
0
0
0
28
785
269
72
26
3
0
0
27
2028
1325
604
208
89
12
2
0
26
1023
1861
1723
1047
464
232
41
8
1
25
27
474
1368
1795
1455
829
484
115
29
6
24
0
5
180
828
1538
1662
1219
829
263
81
19
23
8
0
1
58
420
1106
1593
1501
1192
505
191
55
22
9
0
0
16
181
676
1298
1573
1457
823
382
133
21
10
0
0
4
67
355
909
1416
1530
1152
656
280
20
11
0
0
1
22
161
551
1103
1391
1396
976
509
19
12
0
0
0
6
64
291
749
1101
1474
1265
805
18
13
0
0
0
1
22
134
444
762
1360
1433
1115
17
14
0
0
0
0
7
54
231
436
1101
1424
1354
16
15
0
0
0
0
2
19
106
247
783
1242
1445
15
16
0
0
0
0
0
6
42
116
489
953
1354
14
17
0
0
0
0
0
2
15
48
269
642
1115
13
18
0
0
0
0
0
0
5
17
129
379
805
12
TABLE F1 Binomial Distribution—cont’d p N
k
0.01
0.05
0.10
0.15
0.20
0.25
0.30
19
0
0
0
0
0
0
1
20
0
0
0
0
0
0
21
0
0
0
0
0
22
0
0
0
0
23
0
0
0
24
0
0
25
0
26
0.40
0.45
0.50
5
54
196
509
11
0
1
20
88
280
10
0
0
0
6
34
133
9
0
0
0
0
1
12
55
8
0
0
0
0
0
0
3
19
7
0
0
0
0
0
0
0
1
6
6
0
0
0
0
0
0
0
0
0
1
5
0
0
0
0
0
0
0
0
0
0
0
4
27
0
0
0
0
0
0
0
0
0
0
0
3
28
0
0
0
0
0
0
0
0
0
0
0
2
29
0
0
0
0
0
0
0
0
0
0
0
1
30
0
0
0
0
0
0
0
0
0
0
0
0
0.99
0.95
0.90
0.85
0.80
0.75
0.70
0.60
0.55
0.50
k
The decimal point was omitted; all entries should be read as 0.nnnn. For p ≤ .5, use the upper line for p and the left column for k. For p > .5, use the bottom line for p and the right column for k.
1/3
2/3
N
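Each body entry of Table F1 is a binomial probability mass with the decimal point omitted. A minimal stdlib check against two sample entries (N = 2, k = 1, p = 0.10 is listed as 1800; N = 2, k = 0, p = 0.01 as 9801):

```python
from math import comb

def binom_pmf(N, k, p):
    """P[Y = k] = C(N, k) * p^k * (1 - p)^(N - k)."""
    return comb(N, k) * p**k * (1 - p)**(N - k)

# Table F1 omits the decimal point: entry "1800" means 0.1800, etc.
print(round(binom_pmf(2, 1, 0.10), 4))  # 0.18
print(round(binom_pmf(2, 0, 0.01), 4))  # 0.9801
```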
TABLE F2
Binomial Distribution P(Y ≤ k) = Σ_{i=0}^{k} C(N, i) · p^i · (1 − p)^(N−i)
N
0
1
2
3
4
4
062
312
688
938
1.0
5
031
188
500
812
969
1.0
6
016
109
344
656
891
984
1.0
7
008
062
227
500
773
938
992
1.0
8
004
035
145
363
637
855
965
996
1.0
9
002
020
090
254
500
746
910
980
998
1.0
10
001
011
055
172
377
623
828
945
989
999
1.0
11
006
033
113
274
500
726
887
967
994
999 +
1.0
12
003
019
073
194
387
613
806
927
981
997
999 +
1.0
13
002
011
046
133
291
500
709
867
954
989
998
999+
1.0
14
001
006
029
090
212
395
605
788
910
971
994
999+
999+
1.0
15
004
018
059
151
304
500
696
849
941
982
996
999+
999+
1.0
16
002
011
038
105
227
402
598
773
895
962
989
998
999+
999 +
1.0
17
001
006
025
072
166
315
500
685
834
928
975
994
999
999 +
999 +
1.0
18
001
004
015
048
119
240
407
593
760
881
952
985
996
999
999 +
999+
19
002
010
032
084
180
324
500
676
820
916
968
990
998
999 +
999+
20
001
006
021
058
132
252
412
588
748
868
942
979
994
999
999+
21
001
004
013
039
095
192
332
500
668
808
905
961
987
996
999
22
002
008
026
067
143
262
416
584
738
857
933
974
992
998
23
001
005
017
047
105
202
339
500
661
798
895
953
983
995
24
001
003
011
032
076
154
271
419
581
729
846
924
968
989
002
007
022
054
115
212
345
500
655
788
885
946
6
7
8
9
10
11
12
13
14
15
16
17
978
25
5
TABLE F2 Binomial Distribution—cont’d k N
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
26
001
005
014
038
084
163
279
423
577
721
837
916
962
27
001
003
010
026
061
124
221
351
500
649
779
876
939
28
002
006
018
044
092
172
286
425
575
714
828
908
29
001
004
012
031
068
132
229
356
500
644
771
868
30
001
003
008
021
049
100
181
292
428
572
708
819
31
002
005
015
035
075
141
237
360
500
640
763
32
001
004
010
025
055
108
189
298
430
570
702
33
001
002
007
018
040
081
148
243
364
500
636
34
001
005
012
029
061
115
196
304
432
568
35
001
003
008
020
045
088
155
250
368
500
Unilateral probabilities for the binomial test when p = q = 1/2. Note: decimal points and values less than 0.0005 were omitted.
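With p = 1/2 the cumulative binomial probabilities of Table F2 reduce to counting subsets. A quick stdlib check against one sample entry (N = 6, k = 2 is listed as 344, i.e. 0.344):

```python
from math import comb

def binom_cdf_half(N, k):
    """P(Y <= k) for Y ~ Binomial(N, 1/2)."""
    return sum(comb(N, i) for i in range(k + 1)) / 2**N

# Table F2 omits the decimal point: N = 6, k = 2 -> entry 344 (0.344).
print(round(binom_cdf_half(6, 2), 3))  # 0.344
```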
TABLE G Critical Values of Dc for the Kolmogorov-Smirnov Test Considering P(Dcal > Dc) = α

Sample Size (N)   0.20    0.15    0.10    0.05    0.01
1                 0.900   0.925   0.950   0.975   0.995
2                 0.684   0.726   0.776   0.842   0.929
3                 0.565   0.597   0.642   0.708   0.828
4                 0.494   0.525   0.564   0.624   0.733
5                 0.446   0.474   0.510   0.565   0.669
6                 0.410   0.436   0.470   0.521   0.618
7                 0.381   0.405   0.438   0.486   0.577
8                 0.358   0.381   0.411   0.457   0.543
9                 0.339   0.360   0.388   0.432   0.514
10                0.322   0.342   0.368   0.410   0.490
11                0.307   0.326   0.352   0.391   0.468
12                0.295   0.313   0.338   0.375   0.450
13                0.284   0.302   0.325   0.361   0.433
14                0.274   0.292   0.314   0.349   0.418
15                0.266   0.283   0.304   0.338   0.404
16                0.258   0.274   0.295   0.328   0.392
17                0.250   0.266   0.286   0.318   0.381
18                0.244   0.259   0.278   0.309   0.371
19                0.237   0.252   0.272   0.301   0.363
20                0.231   0.246   0.264   0.294   0.356
25                0.21    0.22    0.24    0.27    0.32
30                0.19    0.20    0.22    0.24    0.29
35                0.18    0.19    0.21    0.23    0.27
Greater than 50   1.07/√N 1.14/√N 1.22/√N 1.36/√N 1.63/√N
TABLE H1
Critical Values of the Shapiro-Wilk Wc Statistic Considering P(Wcal < Wc) = α

Sample Size N   0.01    0.02    0.05    0.10    0.50    0.90    0.95    0.98    0.99
3               0.753   0.758   0.767   0.789   0.959   0.998   0.999   1.000   1.000
4               0.687   0.707   0.748   0.792   0.935   0.987   0.992   0.996   0.997
5               0.686   0.715   0.762   0.806   0.927   0.979   0.986   0.991   0.993
6               0.713   0.743   0.788   0.826   0.927   0.974   0.981   0.986   0.989
7               0.730   0.760   0.803   0.838   0.928   0.972   0.979   0.985   0.988
8               0.749   0.778   0.818   0.851   0.932   0.972   0.978   0.984   0.987
9               0.764   0.791   0.829   0.859   0.935   0.972   0.978   0.984   0.986
10              0.781   0.806   0.842   0.869   0.938   0.972   0.978   0.983   0.986
11              0.792   0.817   0.850   0.876   0.940   0.973   0.979   0.984   0.986
12              0.805   0.828   0.859   0.883   0.943   0.973   0.979   0.984   0.986
13              0.814   0.837   0.866   0.889   0.945   0.974   0.979   0.984   0.986
14              0.825   0.846   0.874   0.895   0.947   0.975   0.980   0.984   0.986
15              0.835   0.855   0.881   0.901   0.950   0.976   0.980   0.984   0.987
16              0.844   0.863   0.887   0.906   0.952   0.975   0.981   0.985   0.987
17              0.851   0.869   0.892   0.910   0.954   0.977   0.981   0.985   0.987
18              0.858   0.874   0.897   0.914   0.956   0.978   0.982   0.986   0.988
19              0.863   0.879   0.901   0.917   0.957   0.978   0.982   0.986   0.988
20              0.868   0.884   0.905   0.920   0.959   0.979   0.983   0.986   0.988
21              0.873   0.888   0.908   0.923   0.960   0.980   0.983   0.987   0.989
22              0.878   0.892   0.911   0.926   0.961   0.980   0.984   0.987   0.989
23              0.881   0.895   0.914   0.928   0.962   0.981   0.984   0.987   0.989
24              0.884   0.898   0.916   0.930   0.963   0.981   0.984   0.987   0.989
25              0.888   0.901   0.918   0.931   0.964   0.981   0.985   0.988   0.989
26              0.891   0.904   0.920   0.933   0.965   0.982   0.985   0.988   0.989
27              0.894   0.906   0.923   0.935   0.965   0.982   0.985   0.988   0.990
28              0.896   0.908   0.924   0.936   0.966   0.982   0.985   0.988   0.990
29              0.898   0.910   0.926   0.937   0.966   0.982   0.985   0.988   0.990
30              0.900   0.912   0.927   0.939   0.967   0.983   0.985   0.988   0.990
TABLE H2
Coefficients a(i, n) for the Shapiro-Wilk Normality Test

i\n   2       3       4       5       6       7       8       9       10
1     0.7071  0.7071  0.6872  0.6646  0.6431  0.6233  0.6052  0.5888  0.5739
2     –       0.0000  0.1677  0.2413  0.2806  0.3031  0.3164  0.3244  0.3291
3     –       –       –       0.0000  0.0875  0.1401  0.1743  0.1976  0.2141
4     –       –       –       –       –       0.0000  0.0561  0.0947  0.1224
5     –       –       –       –       –       –       –       0.0000  0.0399

i\n   11      12      13      14      15      16      17      18      19      20
1     0.5601  0.5475  0.5359  0.5251  0.5150  0.5056  0.4968  0.4886  0.4808  0.4734
2     0.3315  0.3325  0.3325  0.3318  0.3306  0.3290  0.3273  0.3253  0.3232  0.3211
3     0.2260  0.2347  0.2412  0.2460  0.2495  0.2521  0.2540  0.2553  0.2561  0.2565
4     0.1429  0.1586  0.1707  0.1802  0.1878  0.1939  0.1988  0.2027  0.2059  0.2085
5     0.0695  0.0922  0.1099  0.1240  0.1353  0.1447  0.1524  0.1587  0.1641  0.1686
6     0.0000  0.0303  0.0539  0.0727  0.0880  0.1005  0.1109  0.1197  0.1271  0.1334
7     –       –       0.0000  0.0240  0.0433  0.0593  0.0725  0.0837  0.0932  0.1013
8     –       –       –       –       0.0000  0.0196  0.0359  0.0496  0.0612  0.0711
9     –       –       –       –       –       –       0.0000  0.0163  0.0303  0.0422
10    –       –       –       –       –       –       –       –       0.0000  0.0140

i\n   21      22      23      24      25      26      27      28      29      30
1     0.4643  0.4590  0.4542  0.4493  0.4450  0.4407  0.4366  0.4328  0.4291  0.4254
2     0.3185  0.3156  0.3126  0.3098  0.3069  0.3043  0.3018  0.2992  0.2968  0.2944
3     0.2578  0.2571  0.2563  0.2554  0.2543  0.2533  0.2522  0.2510  0.2499  0.2487
4     0.2119  0.2131  0.2139  0.2145  0.2148  0.2151  0.2152  0.2151  0.2150  0.2148
5     0.1736  0.1764  0.1787  0.1807  0.1822  0.1836  0.1848  0.1857  0.1864  0.1870
6     0.1399  0.1443  0.1480  0.1512  0.1539  0.1563  0.1584  0.1601  0.1616  0.1630
7     0.1092  0.1150  0.1201  0.1245  0.1283  0.1316  0.1346  0.1372  0.1395  0.1415
8     0.0804  0.0878  0.0941  0.0997  0.1046  0.1089  0.1128  0.1162  0.1192  0.1219
9     0.0530  0.0618  0.0696  0.0764  0.0823  0.0876  0.0923  0.0965  0.1002  0.1036
10    0.0263  0.0368  0.0459  0.0539  0.0610  0.0672  0.0728  0.0778  0.0822  0.0862
11    0.0000  0.0122  0.0228  0.0321  0.0403  0.0476  0.0540  0.0598  0.0650  0.0697
12    –       –       0.0000  0.0107  0.0200  0.0284  0.0358  0.0424  0.0483  0.0537
13    –       –       –       –       0.0000  0.0094  0.0178  0.0253  0.0320  0.0381
14    –       –       –       –       –       –       0.0000  0.0084  0.0159  0.0227
15    –       –       –       –       –       –       –       –       0.0000  0.0076
TABLE I Wilcoxon Test P(Sp ≥ Sc) = α
Sc
3
3
0.6250
4
0.3750
5
0.2500
0.5625
6
0.1250
0.4375
4
5
7
0.3125
8
0.1875
0.5000
9
0.1250
0.4063
10
0.0625
0.3125
6
7
8
9
11
0.2188
0.5000
12
0.1563
0.4219
13
0.0938
0.3438
14
0.0625
0.2813
0.5313
15
0.0313
0.2188
0.4688
16
0.1563
0.4063
17
0.1094
0.3438
18
0.0781
0.2891
0.5273
19
0.0469
0.2344
0.4727
20
0.0313
0.1875
0.4219
21
0.0156
0.1484
0.3711
22
0.1094
0.3203
23
0.0781
0.2734
0.5000
24
0.0547
0.2305
0.4551
25
0.0391
0.1914
0.4102
26
0.0234
0.1563
0.3672
27
0.0156
0.1250
0.3262
28
0.0078
0.0977
0.2852
10
0.5000
11
12
13
14
15
N
0.2480
0.4609
30
0.0547
0.2129
0.4229
31
0.0391
0.1797
0.3848
32
0.0273
0.1504
0.3477
33
0.0195
0.1250
0.3125
0.5171
34
0.0117
0.1016
0.2783
0.4829
35
0.0078
0.0820
0.2461
0.4492
36
0.0039
0.0645
0.2158
0.4155
37
0.0488
0.1875
0.3823
38
0.0371
0.1611
0.3501
39
0.0273
0.1377
0.3188
0.5151
40
0.0195
0.1162
0.2886
0.4849
41
0.0137
0.0967
0.2598
0.4548
42
0.0098
0.0801
0.2324
0.4250
43
0.0059
0.0654
0.2065
0.3955
44
0.0039
0.0527
0.1826
0.3667
45
0.0020
0.0420
0.1602
0.3386
46
0.0322
0.1392
0.3110
0.5000
47
0.0244
0.1201
0.2847
0.4730
48
0.0186
0.1030
0.2593
0.4463
49
0.0137
0.0874
0.2349
0.4197
50
0.0098
0.0737
0.2119
0.3934
51
0.0068
0.0615
0.1902
0.3677
52
0.0049
0.0508
0.1697
0.3424
53
0.0029
0.0415
0.1506
0.3177
0.5000
54
0.0020
0.0337
0.1331
0.2939
0.4758
55
0.0010
0.0269
0.1167
0.2709
0.4516
56
0.0210
0.1018
0.2487
0.4276
57
0.0161
0.0881
0.2274
0.4039
58
0.0122
0.0757
0.2072
0.3804
0.0742
29
TABLE I Wilcoxon Test—cont’d N 11
12
13
14
59
0.0093
0.0647
0.1879
0.3574
60
0.0068
0.0549
0.1698
0.3349
0.5110
61
0.0049
0.0461
0.1527
0.3129
0.4890
62
0.0034
0.0386
0.1367
0.2915
0.4670
63
0.0024
0.0320
0.1219
0.2708
0.4452
64
0.0015
0.0261
0.1082
0.2508
0.4235
65
0.0010
0.0212
0.0955
0.2316
0.4020
66
0.0005
0.0171
0.0839
0.2131
0.3808
67
0.0134
0.0732
0.1955
0.3599
68
0.0105
0.0636
0.1788
0.3394
69
0.0081
0.0549
0.1629
0.3193
70
0.0061
0.0471
0.1479
0.2997
71
0.0046
0.0402
0.1338
0.2807
72
0.0034
0.0341
0.1206
0.2622
73
0.0024
0.0287
0.1083
0.2444
74
0.0017
0.0239
0.0969
0.2271
75
0.0012
0.0199
0.0863
0.2106
76
0.0007
0.0164
0.0765
0.1947
77
0.0005
0.0133
0.0676
0.1796
78
0.0002
0.0107
0.0594
0.1651
79
0.0085
0.0520
0.1514
80
0.0067
0.0453
0.1384
81
0.0052
0.0392
0.1262
3
4
5
6
7
8
9
10
15
Sc
82
0.0040
0.0338
0.1147
83
0.0031
0.0290
0.1039
84
0.0023
0.0247
0.0938
85
0.0017
0.0209
0.0844
86
0.0012
0.0176
0.0757
87
0.0009
0.0148
0.0677
88
0.0006
0.0123
0.0603
89
0.0004
0.0101
0.0535
90
0.0002
0.0083
0.0473
91
0.0001
0.0067
0.0416
92
0.0054
0.0365
93
0.0043
0.0319
94
0.0034
0.0277
95
0.0026
0.0240
96
0.0020
0.0206
97
0.0015
0.0177
98
0.0012
0.0151
99
0.0009
0.0128
100
0.0006
0.0108
101
0.0004
0.0090
102
0.0003
0.0075
103
0.0002
0.0062
104
0.0001
0.0051 0.0042
106
0.0034
107
0.0027
108
0.0021
109
0.0017
105
TABLE I Wilcoxon Test—cont’d N Sc
3
4
5
6
7
8
9
10
11
12
13
14
15
110
0.0013
111
0.0010
112
0.0008
113
0.0006
114
0.0004
115
0.0003
116
0.0002
117
0.0002
118
0.0001
119
0.0001
120
0.0000
Right-tailed unilateral probabilities for the Wilcoxon test.
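The Table I probabilities can be reproduced exactly for small N: under the null hypothesis, the positive-rank sum S is the sum of a uniformly random subset of the ranks 1..N. A brute-force stdlib sketch, checked against sample table entries (N = 3, Sc = 3 gives 0.6250; N = 4, Sc = 10 gives 0.0625; N = 5, Sc = 10 gives 0.3125):

```python
from itertools import combinations

def wilcoxon_right_tail(n, s_c):
    """P(S >= s_c) where S is the sum of a uniformly random subset of
    ranks 1..n -- the exact null distribution of the Wilcoxon
    signed-rank statistic (no ties, no zero differences)."""
    count = 0
    for r in range(n + 1):
        for subset in combinations(range(1, n + 1), r):
            if sum(subset) >= s_c:
                count += 1
    return count / 2 ** n

print(wilcoxon_right_tail(3, 3))   # 0.625
print(wilcoxon_right_tail(4, 10))  # 0.0625
print(wilcoxon_right_tail(5, 10))  # 0.3125
```

Note these values match the table when the tail is taken as P(S ≥ Sc), i.e. the tabulated point is included in the tail.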
TABLE J Critical Values of Uc for the Mann-Whitney U Test Considering P(Ucal < Uc) = α
P(Ucal < Uc) = 0.05
N2 \ N1
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
0
0
1
2
2
3
4
4
5
5
6
7
7
8
9
9
10
11
4
0
1
2
3
4
5
6
7
8
9
10
11
12
14
15
16
17
18
5
1
2
4
5
6
8
9
11
12
13
15
16
18
19
20
22
23
25
6
2
3
5
7
8
10
12
14
16
17
19
21
23
25
26
28
30
32
7
2
4
6
8
11
13
15
17
19
21
24
26
28
30
33
35
37
39
8
3
5
8
10
13
15
18
20
23
26
28
31
33
36
39
41
44
47
9
4
6
9
12
15
18
21
24
27
30
33
36
39
42
45
48
51
54
10
4
7
11
14
17
20
24
27
31
34
37
41
44
48
51
55
58
62
11
5
8
12
16
19
23
27
31
34
38
42
46
50
54
57
61
65
69
12
5
9
13
17
21
26
30
34
38
42
47
51
55
60
64
68
72
77
13
6
10
15
19
24
28
33
37
42
47
51
56
61
65
70
75
80
84
14
7
11
16
21
26
31
36
41
46
51
56
61
66
71
77
82
87
92
15
7
12
18
23
28
33
39
44
50
55
61
66
72
77
83
88
94
100
16
8
14
19
25
30
36
42
48
54
60
65
71
77
83
89
95
101
107
17
9
15
20
26
33
39
45
51
57
64
70
77
83
89
96
102
109
115
18
9
16
22
28
35
41
48
55
61
68
75
82
88
95
102
109
116
123
19
10
17
23
30
37
44
51
58
65
72
80
87
94
101
109
116
123
130
20
11
18
25
32
39
47
54
62
69
77
84
92
100
107
115
123
130
138
3
P(Ucal < Uc) = 0.025
N2 \ N1
3
3
–
0
0
1
1
2
4
–
0
1
2
3
5
0
1
2
3
5
6
1
2
3
5
7
1
3
5
6
8
2
4
6
9
2
4
10
3
11
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
2
3
3
4
4
5
5
6
6
7
7
8
4
4
5
6
7
8
9
10
11
11
12
13
14
6
7
8
9
11
12
13
14
15
17
18
19
20
6
8
10
11
13
14
16
17
19
21
22
24
25
27
8
10
12
14
16
18
20
22
24
26
28
30
32
34
8
10
13
15
17
19
22
24
26
29
31
34
36
38
41
7
10
12
15
17
20
23
26
28
31
34
37
39
42
45
48
5
8
11
14
17
20
23
26
29
33
36
39
42
45
48
52
55
3
6
9
13
16
19
23
26
30
33
37
40
44
47
51
55
58
62
12
4
7
11
14
18
22
26
29
33
37
41
45
49
53
57
61
65
69
13
4
8
12
16
20
24
28
33
37
41
45
50
54
59
63
67
72
76
14
5
9
13
17
22
26
31
36
40
45
50
55
59
64
67
74
78
83
15
5
10
14
19
24
29
34
39
44
49
54
59
64
70
75
80
85
90
16
6
11
15
21
26
31
37
42
47
53
59
64
70
75
81
86
92
98
17
6
11
17
22
28
34
39
45
51
57
63
67
75
81
87
93
99
105
18
7
12
18
24
30
36
42
48
55
61
67
74
80
86
93
99
103
112
19
7
13
19
25
32
38
45
52
58
65
72
78
85
92
99
106
113
119
20
8
14
20
27
34
41
48
55
62
69
76
83
90
98
105
112
119
127
P(Ucal < Uc) = 0.01
N2 \ N1
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
3
–
0
0
0
0
0
1
1
1
2
2
2
3
3
4
4
4
5
4
–
–
0
1
1
2
3
3
4
5
5
6
7
7
8
9
9
10
5
–
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
6
–
1
2
3
4
6
7
8
9
11
12
13
15
16
18
19
20
22
7
0
1
3
4
6
7
9
11
12
14
16
17
19
21
23
24
26
28
8
0
2
4
6
7
9
11
13
15
17
20
22
24
26
28
30
32
34
9
1
3
5
7
9
11
14
16
18
21
23
26
28
31
33
36
38
40
10
1
3
6
8
11
13
16
19
22
24
27
30
33
36
38
41
44
47
11
1
4
7
9
12
15
18
22
25
29
31
34
37
41
44
47
50
53
12
2
5
8
11
14
17
21
24
28
31
35
38
42
46
49
53
56
60
13
2
5
9
12
16
20
23
27
31
35
39
43
47
51
55
59
63
67
14
2
6
10
13
17
22
26
30
34
38
43
47
51
56
60
65
69
73
15
3
7
11
15
19
24
28
33
37
42
47
51
56
61
66
70
75
80
16
3
7
12
16
21
26
31
36
41
46
51
56
61
66
71
76
82
87
17
4
8
13
18
23
28
33
38
44
49
55
60
66
71
77
82
88
93
18
4
9
14
19
24
30
36
41
47
53
59
65
70
76
82
88
94
100
19
4
9
15
20
26
32
38
44
50
56
63
69
75
82
88
94
101
107
20
5
10
16
22
28
34
40
47
53
60
67
73
80
87
93
100
107
114
Critical Values for the Mann-Whitney U Test Considering P(Ucal < Uc) = 0.005

N2\N1    3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20
  3      –   0   0   0   0   0   0   0   0   1   1   1   2   2   2   2   3   3
  4      –   –   0   0   0   1   1   2   2   3   3   4   5   5   6   6   7   8
  5      –   –   0   1   1   2   3   4   5   6   7   7   8   9  10  11  12  13
  6      –   0   1   2   3   4   5   6   7   9  10  11  12  13  15  16  17  18
  7      –   0   1   3   4   6   7   9  10  12  13  15  16  18  19  21  22  24
  8      –   1   2   4   6   7   9  11  13  15  17  18  20  22  24  26  28  30
  9      0   1   3   5   7   9  11  13  16  18  20  22  24  27  29  31  33  36
 10      0   2   4   6   9  11  13  16  18  21  24  26  29  31  34  37  39  42
 11      0   2   5   7  10  13  16  18  21  24  27  30  33  36  39  42  45  48
 12      1   3   6   9  12  15  18  21  24  27  31  34  37  41  44  47  51  54
 13      1   3   7  10  13  17  20  24  27  31  34  38  42  45  49  53  56  60
 14      1   4   7  11  15  18  22  26  30  34  38  42  46  50  54  58  63  67
 15      2   5   8  12  16  20  24  29  33  37  42  46  51  55  60  64  69  73
 16      2   5   9  13  18  22  27  31  36  41  45  50  55  60  65  70  74  79
 17      2   6  10  15  19  24  29  34  39  44  49  54  60  65  70  75  81  86
 18      2   6  11  16  21  26  31  37  42  47  53  58  64  70  75  81  87  92
 19      3   7  12  17  22  28  33  39  45  51  56  63  69  74  81  87  93  99
 20      3   8  13  18  24  30  36  42  48  54  60  67  73  79  86  92  99 105
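In practice these critical values need not be read from a table: for small samples the exact null distribution of U can be enumerated with a standard recurrence. The sketch below (standard Python only) returns the largest u with P(U ≤ u) ≤ α, the convention behind the tables above, returning None where the table shows a dash.

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def ways(n1, n2, u):
    """Number of rank arrangements of samples of sizes n1, n2 whose
    Mann-Whitney statistic equals u (standard counting recurrence)."""
    if u < 0:
        return 0
    if n1 == 0 or n2 == 0:
        return 1 if u == 0 else 0
    return ways(n1 - 1, n2, u - n2) + ways(n1, n2 - 1, u)

def u_critical(n1, n2, alpha):
    """Largest u with P(U <= u) <= alpha under H0; None means no
    critical value exists (the '-' entries in the tables)."""
    total = comb(n1 + n2, n1)          # number of equally likely arrangements
    cum, crit = 0, None
    for u in range(n1 * n2 + 1):
        cum += ways(n1, n2, u)
        if cum / total <= alpha:
            crit = u
        else:
            break
    return crit
```

For example, u_critical(5, 5, 0.01) returns 1 and u_critical(3, 3, 0.01) returns None, reproducing the N1 = N2 = 5 entry and the dashed N1 = N2 = 3 entry of the α = 0.01 table.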
TABLE K
Critical Values for Friedman's Test Considering P(Fcal > Fc) = α

k    N    α ≤ 0.10   α ≤ 0.05   α ≤ 0.01
3    3      6.00       6.00        –
3    4      6.00       6.50       8.00
3    5      5.20       6.40       8.40
3    6      5.33       7.00       9.00
3    7      5.43       7.14       8.86
3    8      5.25       6.25       9.00
3    9      5.56       6.22       8.67
3   10      5.00       6.20       9.60
3   11      4.91       6.54       8.91
3   12      5.17       6.17       8.67
3   13      4.77       6.00       9.39
3    ∞      4.61       5.99       9.21
4    2      6.00       6.00        –
4    3      6.60       7.40       8.60
4    4      6.30       7.80       9.60
4    5      6.36       7.80       9.96
4    6      6.40       7.60      10.00
4    7      6.26       7.80      10.37
4    8      6.30       7.50      10.35
4    ∞      6.25       7.82      11.34
5    3      7.47       8.53      10.13
5    4      7.60       8.80      11.00
5    5      7.68       8.96      11.52
5    ∞      7.78       9.49      13.28
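The statistic compared against Table K is Fr = 12/(N·k·(k+1)) · Σ Rj² − 3N(k+1), where Rj is the sum of the within-block ranks of treatment j over the N blocks. A minimal standard-library sketch, on hypothetical data with no within-block ties:

```python
def friedman_statistic(blocks):
    """Friedman's Fr for a list of blocks, each a list of the k treatment
    values observed in that block; assumes no ties within a block."""
    n = len(blocks)            # number of blocks (N in Table K)
    k = len(blocks[0])         # number of treatments
    col_rank_sums = [0.0] * k
    for block in blocks:
        order = sorted(range(k), key=lambda j: block[j])
        for rank, j in enumerate(order, start=1):
            col_rank_sums[j] += rank   # rank 1 = smallest value in the block
    s = sum(r * r for r in col_rank_sums)
    return 12.0 * s / (n * k * (k + 1)) - 3.0 * n * (k + 1)

# Three hypothetical blocks that all rank the k = 3 treatments the same way:
fr = friedman_statistic([[1, 5, 9], [2, 6, 8], [3, 4, 7]])
```

Here fr evaluates to 6.00, exactly the k = 3, N = 3 critical value at α ≤ 0.05 in Table K, the fully concordant case.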
TABLE L
Critical Values for the Kruskal-Wallis Test Considering P(Hcal > Hc) = α

Rows: sample sizes n1, n2, n3 (from 2, 2, 2 up to 5, 5, 5); columns: α = 0.10, 0.05, 0.01, 0.005, 0.001, with a dash where no critical value exists at that level.
[The body of this table was scattered out of order by the text extraction and the per-row values cannot be reliably realigned, so they are not reproduced here.]
Large samples: 4.61   5.99   9.21   10.60   13.82 (the chi-square limit with k − 1 = 2 degrees of freedom).
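The statistic compared against Table L is H = 12/(N(N+1)) · Σ Ri²/ni − 3(N+1), where Ri is the sum of the pooled ranks in group i and N is the total sample size. A minimal standard-library sketch on hypothetical data (no tied observations; with ties a correction factor is normally applied):

```python
def kruskal_h(groups):
    """Kruskal-Wallis H for a list of samples; assumes no tied values."""
    # Pool all observations, remembering which group each came from.
    pooled = sorted((x, g) for g, sample in enumerate(groups) for x in sample)
    rank_sums = [0.0] * len(groups)
    for rank, (_, g) in enumerate(pooled, start=1):
        rank_sums[g] += rank
    n = len(pooled)                     # total sample size N
    s = sum(r * r / len(sample) for r, sample in zip(rank_sums, groups))
    return 12.0 * s / (n * (n + 1)) - 3.0 * (n + 1)

h = kruskal_h([[1.2, 1.9], [3.1, 3.4], [5.0, 5.6]])
```

For these three hypothetical groups of size 2, h ≈ 4.57, which exceeds the 4.25 tabulated for n1 = n2 = n3 = 2 at α = 0.10, so H0 would be rejected at that level.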
TABLE M
Critical Values of Cochran's C Statistic Considering P(Ccal > Cc) = α

α = 5%

n/k      2      3      4      5      6      7      8      9     10     12     15     20     24     30     40     60    120
  1  0.9985 0.9669 0.9065 0.8412 0.7808 0.7271 0.6798 0.6385 0.6020 0.5410 0.4709 0.3894 0.3434 0.2929 0.2370 0.1737 0.0998
  2  0.9750 0.8709 0.7679 0.6838 0.6161 0.5612 0.5157 0.4775 0.4450 0.3924 0.3346 0.2705 0.2354 0.1980 0.1567 0.1131 0.0632
  3  0.9392 0.7977 0.6841 0.5981 0.5321 0.4800 0.4377 0.4027 0.3733 0.3264 0.2758 0.2205 0.1907 0.1593 0.1259 0.0895 0.0495
  4  0.9057 0.7457 0.6287 0.5441 0.4803 0.4307 0.3910 0.3584 0.3311 0.2880 0.2419 0.1921 0.1656 0.1377 0.1082 0.0765 0.0419
  5  0.8772 0.7071 0.5895 0.5065 0.4447 0.3974 0.3595 0.3286 0.3029 0.2624 0.2195 0.1735 0.1493 0.1237 0.0968 0.0682 0.0371
  6  0.8534 0.6771 0.5598 0.4783 0.4184 0.3726 0.3362 0.3067 0.2823 0.2439 0.2034 0.1602 0.1374 0.1137 0.0887 0.0623 0.0337
  7  0.8332 0.6530 0.5365 0.4564 0.3980 0.3535 0.3185 0.2901 0.2666 0.2299 0.1911 0.1501 0.1286 0.1061 0.0827 0.0583 0.0312
  8  0.8159 0.6333 0.5175 0.4387 0.3817 0.3384 0.3043 0.2768 0.2541 0.2187 0.1815 0.1422 0.1216 0.1002 0.0780 0.0552 0.0292
  9  0.8010 0.6167 0.5017 0.4241 0.3682 0.3259 0.2926 0.2659 0.2439 0.2098 0.1736 0.1357 0.1160 0.0958 0.0745 0.0520 0.0279
 10  0.7880 0.6025 0.4884 0.4118 0.3568 0.3154 0.2829 0.2568 0.2353 0.2020 0.1671 0.1303 0.1113 0.0921 0.0713 0.0497 0.0266
 16  0.7341 0.5466 0.4366 0.3645 0.3135 0.2756 0.2462 0.2226 0.2032 0.1737 0.1429 0.1108 0.0942 0.0771 0.0595 0.0411 0.0218
 36  0.6602 0.4748 0.3720 0.3066 0.2612 0.2278 0.2022 0.1820 0.1655 0.1403 0.1144 0.0879 0.0743 0.0604 0.0462 0.0316 0.0165
144  0.5813 0.4031 0.3093 0.2513 0.2119 0.1833 0.1616 0.1446 0.1308 0.1100 0.0889 0.0675 0.0567 0.0457 0.0347 0.0234 0.0120
  ∞  0.5000 0.3333 0.2500 0.2000 0.1667 0.1429 0.1250 0.1111 0.1000 0.0833 0.0667 0.0500 0.0417 0.0333 0.0250 0.0167 0.0083

α = 1%

n/k      2      3      4      5      6      7      8      9     10     12     15     20     24     30     40     60    120
  1  0.9999 0.9933 0.9676 0.9279 0.8828 0.8376 0.7945 0.7544 0.7175 0.6528 0.5747 0.4799 0.4247 0.3632 0.2940 0.2151 0.1225
  2  0.9950 0.9423 0.8643 0.7885 0.7218 0.6644 0.6152 0.5727 0.5358 0.4751 0.4069 0.3297 0.2821 0.2412 0.1915 0.1371 0.0759
  3  0.9794 0.8831 0.7814 0.6957 0.6258 0.5685 0.5209 0.4810 0.4469 0.3919 0.3317 0.2654 0.2295 0.1913 0.1508 0.1069 0.0585
  4  0.9586 0.8335 0.7212 0.6329 0.5635 0.5080 0.4627 0.4251 0.3934 0.3428 0.2882 0.2288 0.1970 0.1635 0.1281 0.0902 0.0489
  5  0.9373 0.7933 0.6761 0.5875 0.5195 0.4659 0.4226 0.3870 0.3572 0.3099 0.2593 0.2048 0.1759 0.1454 0.1135 0.0796 0.0429
  6  0.9172 0.7606 0.6410 0.5531 0.4866 0.4347 0.3932 0.3592 0.3308 0.2861 0.2386 0.1877 0.1608 0.1327 0.1033 0.0722 0.0387
  7  0.8988 0.7335 0.6129 0.5259 0.4608 0.4105 0.3704 0.3378 0.3106 0.2680 0.2228 0.1748 0.1495 0.1232 0.0957 0.0668 0.0357
  8  0.8823 0.7107 0.5897 0.5037 0.4401 0.3911 0.3522 0.3207 0.2945 0.2535 0.2104 0.1646 0.1406 0.1157 0.0898 0.0625 0.0334
  9  0.8674 0.6912 0.5702 0.4854 0.4229 0.3751 0.3373 0.3067 0.2813 0.2419 0.2002 0.1567 0.1388 0.1100 0.0853 0.0594 0.0316
 10  0.8539 0.6743 0.5536 0.4697 0.4084 0.3616 0.3248 0.2950 0.2704 0.2320 0.1918 0.1501 0.1283 0.1054 0.0816 0.0567 0.0302
 16  0.7949 0.6059 0.4884 0.4094 0.3529 0.3105 0.2779 0.2514 0.2297 0.1961 0.1612 0.1248 0.1060 0.0867 0.0668 0.0461 0.0242
 36  0.7067 0.5153 0.4057 0.3351 0.2858 0.2494 0.2214 0.1992 0.1811 0.1535 0.1251 0.0960 0.0810 0.0658 0.0503 0.0344 0.0178
144  0.6062 0.4230 0.3251 0.2644 0.2229 0.1929 0.1700 0.1521 0.1376 0.1157 0.0934 0.0709 0.0595 0.0480 0.0363 0.0245 0.0125
  ∞  0.5000 0.3333 0.2500 0.2000 0.1667 0.1429 0.1250 0.1111 0.1000 0.0833 0.0667 0.0500 0.0417 0.0333 0.0250 0.0167 0.0083
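Cochran's statistic itself is simple to compute: it is the largest of the k sample variances expressed as a share of their sum, and Ccal is then compared with the tabulated Cc for the appropriate row and number of groups k. A minimal standard-library sketch on hypothetical data:

```python
from statistics import variance

def cochran_c(samples):
    """Cochran's C: largest sample variance divided by the sum of the
    k sample variances (all samples assumed to have the same size)."""
    variances = [variance(s) for s in samples]
    return max(variances) / sum(variances)

# Three hypothetical groups of size 3; the third is far more dispersed:
c = cochran_c([[1, 2, 3], [1, 1, 2], [5, 9, 1]])
```

Here c ≈ 0.923 (the sample variances are 1, 1/3, and 16, so C = 16 / (52/3) = 12/13); homogeneity of variances is rejected when Ccal exceeds the tabulated Cc.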
TABLE N
Critical Values of Hartley's Fmax Statistic Considering P(Fmax.cal > Fmax.c) = α

α = 5%

n/k      2      3      4      5      6      7      8      9     10     11     12
  2     39    87.5   142    202    266    333    403    475    550    626    704
  3    15.4   27.8   39.2   50.7   62     72.9   83.5   93.9   104    114    124
  4    9.6    15.5   20.6   25.2   29.5   33.6   37.5   41.1   44.6   48     51.4
  5    7.15   10.8   13.7   16.3   18.7   20.8   22.9   24.7   26.5   28.2   29.9
  6    5.82   8.38   10.4   12.1   13.7   15     16.3   17.5   18.6   19.7   20.7
  7    4.99   6.94   8.44   9.7    10.8   11.8   12.7   13.5   14.3   15.1   15.8
  8    4.43   6      7.18   8.12   9.03   9.78   10.5   11.1   11.7   12.2   12.7
  9    4.03   5.34   6.31   7.11   7.8    8.41   8.95   9.45   9.91   10.3   10.7
 10    3.72   4.85   5.67   6.34   6.92   7.42   7.87   8.28   8.66   9.01   9.34
 12    3.28   4.16   4.79   5.3    5.72   6.09   6.42   6.72   7      7.25   7.48
 15    2.86   3.54   4.01   4.37   4.68   4.95   5.19   5.4    5.59   5.77   5.93
 20    2.46   2.95   3.29   3.54   3.76   3.94   4.1    4.24   4.37   4.49   4.59
 30    2.07   2.4    2.61   2.78   2.91   3.02   3.12   3.21   3.29   3.36   3.39
 60    1.67   1.85   1.96   2.04   2.11   2.17   2.22   2.26   2.3    2.33   2.36
  ∞    1      1      1      1      1      1      1      1      1      1      1

α = 1%

n/k      2      3      4      5      6      7      8      9     10     11     12
  2    199    448    729   1036   1362   1705   2069   2432   2813   3204   3605
  3    47.5   85     120    151    184    216    249    281    310    337    361
  4    23.2   37     49     59     69     79     89     97     106    113    120
  5    14.9   22     28     33     38     42     46     50     54     57     60
  6    11.1   15.5   19.1   22     25     27     30     32     34     36     37
  7    8.89   12.1   14.5   16.5   18.4   20     22     23     24     26     27
  8    7.5    9.9    11.7   13.2   14.5   15.8   16.9   17.9   18.9   19.8   21
  9    6.54   8.5    9.9    11.1   12.1   13.1   13.9   14.7   15.3   16     16.6
 10    5.85   7.4    8.6    9.6    10.4   11.1   11.8   12.4   12.9   13.4   13.9
 12    4.91   6.1    6.9    7.6    8.2    8.7    9.1    9.5    9.9    10.2   10.6
 15    4.07   4.9    5.5    6      6.4    6.7    7.1    7.3    7.5    7.8    8
 20    3.32   3.8    4.3    4.6    4.9    5.1    5.3    5.5    5.6    5.8    5.9
 30    2.63   3      3.3    3.4    3.6    3.7    3.8    3.9    4      4.1    4.2
 60    1.96   2.2    2.3    2.4    2.4    2.5    2.5    2.6    2.6    2.7    2.7
  ∞    1      1      1      1      1      1      1      1      1      1      1
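The Fmax statistic compared against Table N is simply the ratio of the largest to the smallest of the k sample variances; it tends to 1 (the ∞ rows) as the within-group sample size grows. A minimal standard-library sketch on hypothetical data:

```python
from statistics import variance

def hartley_fmax(samples):
    """Hartley's Fmax: largest sample variance divided by the smallest,
    across k groups of (roughly) equal size n."""
    variances = [variance(s) for s in samples]
    return max(variances) / min(variances)

# Three hypothetical groups of size 3:
fmax = hartley_fmax([[1, 2, 3], [1, 1, 2], [5, 9, 1]])
```

Here fmax = 16 / (1/3) = 48; equality of variances is rejected when Fmax.cal exceeds the tabulated Fmax.c for the chosen α, n, and k.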
TABLE O
Control Chart Constants (X-bar and R charts: d2, d3, A2, D3, D4; X-bar and S charts: c4, A3, B3, B4)

 n     d2     d3     c4     A2     D3     D4     A3     B3     B4
 2   1.128  0.853  0.798  1.880    –   3.267  2.659    –   3.267
 3   1.693  0.888  0.886  1.023    –   2.574  1.954    –   2.568
 4   2.059  0.880  0.921  0.729    –   2.282  1.628    –   2.266
 5   2.326  0.880  0.940  0.577    –   2.114  1.427    –   2.089
 6   2.534  0.848  0.952  0.483    –   2.004  1.287  0.030  1.970
 7   2.704  0.833  0.959  0.419  0.076  1.924  1.182  0.118  1.882
 8   2.847  0.820  0.965  0.373  0.136  1.864  1.099  0.185  1.815
 9   2.970  0.808  0.969  0.337  0.184  1.816  1.032  0.239  1.761
10   3.078  0.797  0.973  0.308  0.223  1.777  0.975  0.284  1.716
11   3.173  0.787  0.975  0.285  0.256  1.744  0.927  0.321  1.679
12   3.258  0.779  0.978  0.266  0.283  1.717  0.886  0.354  1.646
13   3.336  0.770  0.979  0.249  0.307  1.693  0.850  0.382  1.618
14   3.407  0.763  0.981  0.235  0.328  1.672  0.817  0.406  1.594
15   3.472  0.756  0.982  0.223  0.347  1.653  0.789  0.428  1.572
16   3.532  0.750  0.984  0.212  0.363  1.637  0.763  0.448  1.552
17   3.588  0.744  0.985  0.203  0.378  1.622  0.739  0.466  1.534
18   3.640  0.739  0.985  0.194  0.391  1.607  0.718  0.482  1.518
19   3.689  0.734  0.986  0.187  0.403  1.597  0.698  0.497  1.503
20   3.735  0.729  0.987  0.180  0.415  1.585  0.680  0.510  1.490
21   3.778  0.727  0.988  0.173  0.425  1.575  0.663  0.523  1.477
22   3.819  0.720  0.988  0.167  0.434  1.566  0.647  0.534  1.466
23   3.858  0.716  0.989  0.162  0.443  1.557  0.633  0.545  1.455
24   3.895  0.712  0.989  0.157  0.451  1.548  0.619  0.555  1.445
25   3.931  0.708  0.990  0.153  0.459  1.541  0.606  0.565  1.435
For n > 25:

c4 ≅ 4(n − 1)/(4n − 3)
A = 3/√n                          A3 = 3/(c4 √n)
B3 = 1 − 3/(c4 √(2(n − 1)))       B4 = 1 + 3/(c4 √(2(n − 1)))
B5 = c4 − 3/√(2(n − 1))           B6 = c4 + 3/√(2(n − 1))
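For any n the constant c4 (and the S-chart constants derived from it) can also be computed exactly with the gamma function, rather than read from Table O or approximated by the n > 25 formula. A minimal standard-library sketch, using the standard exact expressions c4 = √(2/(n−1))·Γ(n/2)/Γ((n−1)/2), A3 = 3/(c4√n), and B4 = 1 + 3√(1 − c4²)/c4 (B3 is the analogous lower constant, truncated at zero, which is why the table shows a dash for small n):

```python
from math import gamma, sqrt

def c4(n):
    """Exact c4 = sqrt(2/(n-1)) * Gamma(n/2) / Gamma((n-1)/2)."""
    return sqrt(2.0 / (n - 1)) * gamma(n / 2.0) / gamma((n - 1) / 2.0)

def a3(n):
    """A3 = 3 / (c4 * sqrt(n)), the X-bar chart factor for S charts."""
    return 3.0 / (c4(n) * sqrt(n))

def b3(n):
    """Lower S-chart limit factor, truncated at 0 (dash in Table O)."""
    return max(0.0, 1.0 - 3.0 * sqrt(1.0 - c4(n) ** 2) / c4(n))

def b4(n):
    """Upper S-chart limit factor."""
    return 1.0 + 3.0 * sqrt(1.0 - c4(n) ** 2) / c4(n)
```

For example, round(c4(5), 3) gives 0.940 and round(a3(5), 3) gives 1.427, matching the n = 5 row of Table O.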
Stat. Comput. Simul. 81 (7), 827–842. Cordeiro, G.M., Paula, G.A., 1989. Improved likelihood ratio statistics for exponential family nonlinear models. Biometrika 76 (1), 93–100. Cornwell, C., Rupert, P., 1988. Efficient estimation with panel data: an empirical comparison of instrumental variables estimators. J. Appl. Econometrics 3 (2), 149–155. Cortina, J.M., 1993. What is coefficient alpha? An examination of theory and applications. J. Appl. Psychol. 78 (1), 98–104. Costa Neto, P.L.O., 2002. Estatı´stica, second ed. Edgard Bl€ucher, Sa˜o Paulo. Costa, P.S., Santos, N.C., Cunha, P., Cotter, J., Sousa, N., 2013. The use of multiple correspondence analysis to explore associations between categories of qualitative variables in healthy ageing. J. Aging Res. 2013. Courgeau, D., 2003. Methodology and Epistemology of Multilevel Analysis. Kluwer Academic Publishers, London. Covarsi, M.G.A., 1996. Tecnicas de ana´lisis factorial aplicadas al ana´lisis de la informacio´n financiera: fundamentos, limitaciones, hallazgo y evidencia empı´rica espan˜ola. Revista Espan˜ola de Financiacio´n y Contabilidad 26 (86), 57–101. Cox, D.R., 1972. Regression models and life tables. J. Roy. Stat. Soc. Ser. B 34 (2), 187–220. Cox, D.R., 1983. Some remarks on overdispersion. Biometrika 70 (1), 269–274. Cox, D.R., Oakes, D., 1984. Analysis of Survival Data. Chapman and Hall/CRC, London. Cox, D.R., Snell, E.J., 1989. Analysis of Binary Data, second ed. Chapman & Hall, London. Cox, N.J., 2002. Speaking Stata: how to face lists with fortitude. Stata J. 2 (2), 202–222. Cox, N.J., 2001. Speaking Stata: how to repeat yourself without going mad. Stata J. 1 (1), 86–97. Cox, N.J., 2003. Speaking Stata: problems with lists. Stata J. 3 (2), 185–202. Cox, N.J., 2005. Speaking Stata: smoothing in various directions. Stata J. 5 (4), 574–593. Cox, N.J., 2010. Speaking Stata: the limits of sample skewness and kurtosis. Stata J. 10 (3), 482–495. Coxon, A.P., The, M., 1982. 
User’s guide to multidimensional scaling: with special reference to the MDS (X library of computer programs). Heinemann Educational Books, London. Cronbach, L.J., 1951. Coefficient alpha and the internal structure of tests. Psychometrika 16 (3), 297–334. Crowther, M.J., Abrams, K.R., Lambert, P.C., 2013. Joint modeling of longitudinal and survival data. Stata J. 13 (1), 165–184. Czekanowski, J., 1932. Coefficient of racial “likeness” und “durchschnittliche differenz”. Anthropologischer Anzeiger 9 (3/4), 227–249. D’enza, A.I., Greenacre, M.J., 2012. Multiple correspondence analysis for the quantification and visualization of large categorical data sets. In: Di Ciaccio, A., Coli, M., Ibanez, J.M.A. (Eds.), Advanced Statistical Methods for the Analysis of Large Data-Sets. Studies in Theoretical and Applied Statistics. Springer-Verlag, Berlin, pp. 453–463. Danseco, E.R., Holden, E.W., 1998. Are there different types of homeless families? A typology of homeless families based on cluster analysis. Fam. Relat. 47 (2), 159–165. Dantas, C.A.B., 2008. Probabilidade: um curso introduto´rio, third ed. Edusp, Sa˜o Paulo. Dantas, R.A., Cordeiro, G.M., 1988. Uma nova metodologia para avaliac¸a˜o de imo´veis utilizando modelos lineares generalizados. Revista Brasileira de Estatı´stica 49 (191), 27–46. Dantzig, G.B., Fulkerson, D.R., Johnson, S.M., 1954. Solution of a large-scale traveling salesman problem. Oper. Res. 2, 393–410. Davidson, R., Mackinnon, J.G., 1993. Estimation and Inference in Econometrics. Oxford University Press, Oxford. Davis, P.B., 1977. Conjoint measurement and the canonical analysis of contingency tables. Sociol. Methods Res. 5 (3), 347–365. Day, G.S., Heeler, R.M., 1971. Using cluster analysis to improve marketing experiments. J. Market. Res. 8 (3), 340–347. De Irala, J., Ferna´ndez-Crehuet, N.R., Serranco, C.A., 1997. Intervalos de confianza anormalmente amplios en regresio´n logı´stica: interpretacio´n de resultados de programas estadı´sticos. 
Revista Panamericana de Salud Pu´blica 28, 235–243. De Leeuw, J., 1984. Canonical Analysis of Categorical Data. DSWO Press, Leiden.
1200
References
De Leeuw, J., 2008. Meijer, E. (Ed.), Handbook of Multilevel Analysis. Springer, New York. Deadrick, D.L., Bennett, N., Russell, C.J., 1997. Using hierarchical linear modeling to examine dynamic performance criteria over time. J. Manag. 23 (6), 745–757. Dean, C., Lawless, J., 1989. Tests for detecting overdispersion in Poisson regression models. J. Am. Stat. Assoc. 84 (406), 467–472. Deaton, A., 2010. Instruments, randomization, and learning about development. J. Econ. Lit. 48 (2), 424–455. Deb, P., Trivedi, P.K., 2006. Maximum simulated likelihood estimation of a negative binomial regression model with multinomial endogenous treatment. Stata J. 6 (2), 246–255. Demidenko, E., 2005. Mixed Models: Theory and Applications. John Wiley & Sons, New York. Desmarais, B.A., Harden, J.J., 2013. Testing for zero inflation in count models: bias correction for the Vuong test. Stata J. 13 (4), 810–835. Deus, J.E.R., 2001. Escalamiento multidimensional. Editorial La Muralla, Madrid. Deville, J.C., Saporta, G., 1983. Correspondence analysis, with an extension towards nominal time series. Journal of Econometrics 22, 169–189. Devore, J.L., 2006. Probabilidade e estatı´stica para engenharia. Thomson Pioneira, Sa˜o Paulo. Dice, L.R., 1945. Measures of the amount of ecologic association between species. Ecology 26 (3), 297–302. Digby, P.G.N., Kempton, R.A., 1987. Multivariate Analysis of Ecological Communities. Chapman & Hall/CRC Press, London. Dillon, W.R., Goldstein, M., 1984. Multivariate Analysis Methods and Applications. John Wiley & Sons, New York. Dobbie, M.J., Welsh, A.H., 2001. Modelling correlated zero-inflated count data. Aust. N. Z. J. Stat. 43 (4), 431–444. Dobson, A.J., 2001. An Introduction to Generalized Linear Models, second ed. Chapman & Hall/CRC Press, London. Dore, J.C., Ojasoo, T., 1996. Correspondence factor analysis of the publication patterns of 48 countries over the period 1981-1992. J. Am. Soc. Inf. Sci. 47, 588–602. Dougherty, C., 2011. 
Introduction to Econometrics, fourth ed. Oxford University Press, New York. Doutriaux, J., Crener, M.A., 1982. Which statistical technique should I use? A survey and marketing case study. Manag. Decis. Econ. 3 (2), 99–111. Draper, D., 1995. Inference and hierarchical modeling in the social sciences. J. Educ. Behav. Stat. 20 (2), 115–147. Driscoll, J.C., Kraay, A.C., 1998. Consistent covariance matrix estimation with spatially dependent panel data. Rev. Econ. Stat. 80 (4), 549–560. Driver, H.E., Kroeber, A.L., 1932. Quantitative expression of cultural relationships. Univ. Calif. Public. Am. Archaeol. Ethnol. 31 (4), 211–256. Drukker, D.M., 2003. Testing for serial correlation in linear panel-data models. Stata J. 3 (2), 168–177. Duncan, O.D., 1984. Notes on Social Measurement: Historical and Critical. Russell Sage Foundation, New York. Dunlop, D.D., 1994. Regression for longitudinal data: a bridge from least squares regression. Am. Stat. 48 (4), 299–303. Durbin, J., Watson, G.S., 1950. Testing for serial correlation in least squares regression: I. Biometrika 37 (¾), 409–428. Durbin, J., Watson, G.S., 1951. Testing for serial correlation in least squares regression: II. Biometrika 38 (½), 159–177. Dyke, G.V., Patterson, H.D., 1952. Analysis of factorial arrangements when the data are proportions. Biometrics 8 (1), 1–12. Dziuban, C.D., Shirkey, E.C., 1974. When is a correlation matrix appropriate for factor analysis? Some decision rules. Psychol. Bull. 81 (6), 358–361. Ekşiog˘lu, S.D., Ekşiog˘lu, B., Romeijn, H.E., 2007. A lagrangean heuristic for integrated production and transportation planning problems in a dynamic, multi-item, two-layer supply chain. IIE Trans. 39 (2), 191–201. Elhedhli, S., Goffin, J.L., 2005. Efficient production-distribution system design. Manag. Sci. 51 (7), 1151–1164. Embretson, S.E., Hershberger, S.L., 1999. The new Rules of Measurement. Lawrence Erlbaum Associates, Mahwah. Engle, R.F., 1984. 
Wald, likelihood ratio, and lagrange multiplier tests in econometrics. In: Griliches, Z., Intriligator, M.D. (Eds.), Handbook of Econometrics II. North Holland, Amsterdam, pp. 796–801. Eom, S., Kim, E., 2006. A survey of decision support system applications (1995-2001). J. Oper. Res. Soc. 57, 1264–1278. Epley, D.R.U.S., 2001. Real estate agent income and commercial/investment activities. J. Real Estate Res. 21 (3), 221–244. Espejo, L.G.A., Galva˜o, R.D., 2002. O uso das relaxac¸o˜es Lagrangeana e surrogate em problemas de programac¸a˜o inteira. Pesquisa Operacional 22 (3), 387–402. Espinoza, F.S., Hirano, A.S., 2003. As dimenso˜es de avaliac¸a˜o dos atributos importantes na compra de condicionadores de ar: um estudo aplicado. Revista de Administrac¸a˜o Contempor^anea (RAC) 7 (4), 97–117. Everitt, B.S., Landau, S., Leese, M., Stahl, D., 2011. Cluster Analysis, 5. ed. John Wiley & Sons, Chichester. Fabrigar, L.R., Wegener, D.T., MacCallum, R.C., Strahan, E.J., 1999. Evaluating the use of exploratory factor analysis in psychological research. Psychol. Methods 4 (3), 272–299. Famoye, F., 1993. Restricted generalized Poisson regression model. Commun. Stat. Theory Methods 22 (5), 1335–1354. Famoye, F., Singh, K.P., 2006. Zero-inflated generalized Poisson regression model with an application to domestic violence data. J. Data Sci. 4 (1), 117–130. Farnstrom, F., Lewis, J., Elkan, C., 2000. Scalability for clustering algorithms revisited. SIGKDD Explor. 2 (1), 51–57. Farrel, M.J., 1957. The measurement of productive efficiency. J. Roy. Stat. Soc. 120 (3), 253–290. ® ® ® Fa´vero, L.P., 2015. Ana´lise de dados: modelos de regressa˜o com Excel , Stata e SPSS . Campus Elsevier, Rio de Janeiro. Fa´vero, L.P., 2013. Dados em painel em contabilidade e financ¸as: teoria e aplicac¸a˜o. Brazil. Bus. Rev. 10 (1), 131–156. Fa´vero, L.P., 2010. Modelagem hiera´rquica com medidas repetidas. 
Associate Professor Thesis - Faculdade de Economia,Administrac¸a˜o e Contabilidade, Universidade de Sa˜o Paulo, Sa˜o Paulo. 202 f. Fa´vero, L.P., 2008a. Modelos de precificac¸a˜o hed^onica de imo´veis residenciais na Regia˜o Metropolitana de Sa˜o Paulo: uma abordagem sob as perspectivas da demanda e da oferta. Estudos Econ^omicos 38 (1), 73–96. Fa´vero, L.P., 2005. O mercado imobilia´rio residencial da regia˜o metropolitana de Sa˜o Paulo: uma aplicac¸a˜o de modelos de comercializac¸a˜o hed^onica de regressa˜o e correlac¸a˜o can^onica. PhD Thesis - Faculdade de EconomiaAdministrac¸a˜o e Contabilidade, Universidade de Sa˜o Paulo, Sa˜o Paulo. 319 f.
References
1201
Fa´vero, L.P., 2011a. Prec¸os hed^onicos no mercado imobilia´rio comercial de Sa˜o Paulo: a abordagem da modelagem multinı´vel com classificac¸a˜o cruzada. Estudos Econ^ omicos 41 (4), 777–810. Fa´vero, L.P., 2008b. Time, firm and country effects on performance: an analysis under the perspective of hierarchical modeling with repeated measures. Brazil. Bus. Rev. 5 (3), 163–180. Fa´vero, L.P., 2011b. Urban amenities and dwelling house prices in Sao Paulo, Brazil: a hierarchical modelling approach. Glob. Bus. Econ. Rev. 13 (2), 147–167. Fa´vero, L.P., Almeida, J.E.F., 2011. O comportamento dos ´ındices de ac¸o˜es em paı´ses emergentes: uma ana´lise com dados em painel e modelos hiera´rquicos. Revista Brasileira de Estatı´stica 72 (235), 97–137. Fa´vero, L.P., Angelo, C.F., Eunni, R.V., 2007. Impact of loyalty programs on customer retention: evidence from the retail apparel industry in Brazil. In: International Academy of Linguistics, Behavioral and Social Sciences. Anais do Congresso, Washington. ® ® Fa´vero, L.P., Belfiore, P., 2015. Ana´lise de dados: tecnicas multivariadas explorato´rias com SPSS e Stata . Campus Elsevier, Rio de Janeiro. Fa´vero, L.P., Belfiore, P., 2011. Cash flow, earnings ratio and stock returns in emerging global regions: evidence from longitudinal data. Glob. Econ. Financ. J. 4 (1), 32–43. ® ® ® Fa´vero, L.P., Belfiore, P., 2017. Manual de ana´lise de dados: estatı´stica e modelagem multivariada com Excel , SPSS e Stata . Elsevier, Rio de Janeiro. Fa´vero, L.P., Belfiore, P., Silva, F.L., Chan, B.L., 2009. Ana´lise de dados: modelagem multivariada para tomada de deciso˜es. Campus Elsevier, Rio de Janeiro. ® Fa´vero, L.P., Belfiore, P., Takamatsu, R.T., Suzart, J., 2014. Metodos quantitativos com Stata . Campus Elsevier, Rio de Janeiro. Fa´vero, L.P., Confortini, D., 2010. Modelos multinı´vel de coeficientes aleato´rios e os efeitos firma, setor e tempo no mercado aciona´rio brasileiro. Pesquisa Operacional 30 (3), 703–727. 
Fa´vero, L.P., Confortini, D., 2009. Qualitative assessment of stock prices listed on the Sa˜o Paulo Stock Exchange: an approach from the perspective of homogeneity analysis. Academia: Revista Latinoamericana de Administracio´n 42 (1), 20–33. Fa´vero, L.P., Santos, M.A., Serra, R.G., 2018. Cross-border branching in the Latin American banking sector. Int. J. Bank Market. 36 (3), 496–528. Fa´vero, L.P., Sotelino, F.B., 2011. Elasticities of stock prices in emerging markets. In: Batten, J.A., Szilagyi, P.G. (Eds.), The Impact of the Global Financial Crisis on Emerging Financial Markets. Contemporary Studies in Economic and Financial Analysis, vol. 93. Emerald Group Publishing Limited, pp. 473–493. Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., 1996. From data mining to knowledge discovery in databases. AI Magazine 17 (3), 37–54. Feigl, P., Zelen, M., 1965. Estimation of exponential survival probabilities with concomitant information. Biometrics 21 (4), 826–838. Fernandes, A.M.R., 2005. Intelig^encia artificial: noc¸o˜es gerais. Visual Books, Floriano´polis. Ferrando, P.J., 1993. Introduccio´n al ana´lisis factorial. Ppu, Barcelona. Ferra˜o, F., Reis, E., Vicente, P., 2001. Sondagens: a amostragem como factor decisivo de qualidade, second ed. Lisboa, Edic¸o˜es Sı´labo. Ferreira, J.M., 2007. Ana´lise de sobreviv^encia: uma visa˜o de risco comportamental na utilizac¸a˜o de carta˜o de credito. Masters Dissertation, Departamento de Estatı´stica e Informa´tica. Universidade Federal Rural de Pernambuco, Recife. 73 f. Ferreira, S.C.R., 2012. Ana´lise multivariada sobre bases de dados criminais. Masters Dissertation, Faculdade de Ci^encias e Tecnologia da Universidade de Coimbra, Coimbra. 81 f. Ferreira Filho, V.J.M., Igna´cio, A.A.V., 2004. O uso de software de modelagem AIMMS na soluc¸a˜o de problemas de programac¸a˜o matema´tica. Pesquisa Operacional 24 (1), 197–210. Fielding, A., 2004. 
The role of the Hausman test and whether higher level effects should be treated as random or fixed. Multilevel Modelling Newsletter 16 (2), 3–9. Fienberg, S.E., 2007. Analysis of Cross-Classified Categorical Data. Springer-Verlag, New York. Figueira, A.P.C., 2003. Procedimento HOMALS: instrumentalidade no estudo das orientac¸o˜es metodolo´gicas dos professores portugueses de lı´ngua estrangeira. In: V SNIP - Simpo´sio Nacional de Investigac¸a˜o em Psicologia. Anais do Congresso, Lisboa. Figueiredo Filho, D.B., Silva Ju´nior, J.A., Rocha, E.C., 2012. Classificando regimes polı´ticos utilizando ana´lise de conglomerados. Opinia˜o Pu´blica 18 (1), 109–128. Finney, D.J., 1952. Probit Analysis. Cambridge University Press, Cambridge. Finney, D.J., Stevens, W.L., 1948. A table for the calculation of working probits and weights in probit analysis. Biometrika 35 (1/2), 191–201. Firpo, S., 2007. Efficient semiparametric estimation of quantile treatment effects. Econometrica 75 (1), 259–276. Fischer, G., 1936. Ornithologische monatsberichte. Jahrgang, Berlin. Flannery, M.J., Hankins, K.W., 2013. Estimating dynamic panel models in corporate finance. J. Corp. Finance 19 (1), 1–19. Fleischer, G.A., 2011. Contingency Table Analysis for Road Safety Studies. Springer, New York. Fleishman, J.A., 1986. Types of political attitude structure: results of a cluster analysis. Publ. Opin. Quart. 50 (3), 371–386. Fourer, R., Gay, D.M., Kernighan, B.W., 2002. AMPL: A Modeling Language for Mathematical Programming, second ed. Duxbury. Fouto, N.M.M.D., 2004. Determinac¸a˜o de uma func¸a˜o de prec¸os hed^onicos para computadores pessoais no Brasil. Masters Dissertation, Faculdade de Economia, Administrac¸a˜o e Contabilidade, Universidade de Sa˜o Paulo, Sa˜o Paulo. 150 f. Fraley, C., Raftery, A.E., 2002. Model-based clustering, discriminant analysis and density estimation. J. Am. Stat. Assoc. 97 (458), 611–631. Frees, E.W., 1995. Assessing cross-sectional correlation in panel data. J. Econ. 
69 (2), 393–414. Frees, E.W., 2004. Longitudinal and Panel Data: Analysis and Applications in the Social Sciences. Cambridge University Press, Cambridge. Frei, F., 2006. Introduc¸a˜o a` ana´lise de agrupamentos: teoria e pra´tica. Editora Unesp, Sa˜o Paulo. Frei, F., Lessa, B.S., Nogueira, J.C.G., Zopello, R., Silva, S.R., Lessa, V.A.M., 2013. Ana´lise de agrupamentos para a classificac¸a˜o de pacientes submetidos a` cirurgia baria´trica Fobi-Capella. ABCD. Arquivos Brasileiros de Cirurgia Digestiva 26 (1), 33–38.
1202
References
Freund, J.E., 2006. Estatı´stica aplicada: economia, administrac¸a˜o e contabilidade, 11th ed. Bookman, Porto Alegre. Friedman, M., 1940. A comparison of alternative tests of significance for the problem of m rankings. Ann. Math. Stat. 11 (1), 86–92. Friedman, M., 1937. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. J. Am. Stat. Assoc. 32 (200), 675–701. Fr€ olich, M., Melly, B., 2010. Estimation of quantile treatment effects with Stata. Stata J. 10 (3), 423–457. Frome, E.L., Kurtner, M.H., Beauchamp, J.J., 1973. Regression analysis of Poisson-distributed data. J. Am. Stat. Assoc. 68 (344), 935–940. Froot, K.A., 1989. Consistent covariance matrix estimation with cross-sectional dependence and heteroskedasticity in financial data. J. Financ. Quant. Anal. 24 (3), 333–355. Fumes, G., Corrente, J.E., 2010. Modelos inflacionados de zeros: aplicac¸o˜es na ana´lise de um questiona´rio de frequ^encia alimentar. Revista Brasileira de Biometria 28 (1), 24–38. Galantucci, L.M., DI Gioia, E., Lavecchia, F., Percoco, G., 2014. Is principal component analysis an effective tool to predict face attractiveness? A contribution based on real 3D faces of highly selected attractive women, scanned with stereophotogrammetry. Med. Biol. Eng. Comput. 52 (5), 475–489. Galton, F., 1894. Natural Inheritance, fifth ed. Macmillan and Company, New York. GAMS - General Algebraic Modeling System, 2011. An introduction to GAMS. Disponı´vel em http://www.gams.com. [(Accessed 1 April 2011)]. Gardiner, J.C., Luo, Z., Roman, L.A., 2009. Fixed effects, random effects and GEE: what are the differences? Stat. Med. 28 (2), 221–239. Gardner, W., Mulvey, E.P., Shaw, E.C., 1995. Regression analyses of counts and rates: Poisson, overdispersed Poisson, and negative binomial models. Psychol. Bull. 118 (3), 392–404. Garson, G.D., 2013. Factor Analysis. Statistical Associates Publishers, Asheboro. Garson, G.D., 2012. Logistic Regression: Binary & Multinomial. 
Statistical Associates Publishing, Asheboro. Gelman, A., 2006. Multilevel (hierarchical) modeling: what it can and cannot do. Technometrics 48 (3), 432–435. Geoffrion, A.M., 1972. Generalized Benders decomposition. J. Optim. Theory Appl. 10 (4), 237–260. Geoffrion, A.M., Graves, G.W., 1974. Multicommodity distribution design by Benders decomposition. Manag. Sci. 20 (5), 822–844. Gessner, G., Malhotra, N.K., Kamakura, W.A., Zmijewski, M.E., 1988. Estimating models with binary dependent variables: some theoretical and empirical observations. J. Bus. Res. 16 (1), 49–65. Giffins, R., 1985. Canonical Analysis: A Review with Applications in Ecology. Springer-Verlag, Berlin. Gilbert, G.K., 1884. Finley’s tornado predictions. Am. Meteorol. J. (1), 166–172. Gimeno, S.G.A., Souza, J.M.P., 1995. Utilizac¸a˜o de estratificac¸a˜o e modelo de regressa˜o logı´stica na ana´lise de dados de estudos caso-controle. Revista de Sau´de Pu´blica 29 (4), 283–289. Glasser, G.L., Metzger, G.D., 1972. Random-digit dialing as a method of telephone sampling. J. Market. Res. 9 (1), 59–64. Glasser, M., 1967. Exponential survival with covariance. J. Am. Stat. Assoc. 62 (318), 561–568. Gnecco, G., Sanguineti, M., 2009. Accuracy of suboptimal solutions to kernel principal component analysis. Comput. Optim. Appl. 42 (2), 265–287. Gnedenko, B.V., 2008. A teoria da probabilidade. Ci^encia Moderna, Rio de Janeiro. Godfrey, L.G., 1988. Misspecification Tests in Econometrics. Cambridge University Press, Cambridge. Godfrey, L.G., 1978. Testing against general autoregressive and moving average error models when the regressors include lagged dependent variables. Econometrica 46 (6), 1293–1301. Goldbarg, M.C., Luna, H.P.L., 2005. Otimizac¸a˜o combinato´ria e programac¸a˜o linear, second ed. Campus Elsevier, Rio de Janeiro. Goldberger, A.S., 1962. Best linear unbiased prediction in the generalized linear regression model. J. Am. Stat. Assoc. 57 (298), 369–375. Goldstein, H., 2011. 
Multilevel Statistical Models, fourth ed. John Wiley & Sons, Chichester. Gomes Jr., A.C., Souza, M.J.F., 2004. Softwares de otimizac¸a˜o: manual de refer^encia. Departamento de Computac¸a˜o, Universidade Federal de Ouro Preto. Gomory, R.E., 1958. Outline of an algorithm for integer solutions to linear programs. Bull. Am. Math. Soc. 64 (5), 275–278. Gonc¸alez, P.U., Werner, L., 2009. Comparac¸a˜o dos ´ındices de capacidade do processo para distribuic¸o˜es na˜o normais. Gesta˜o & Produc¸a˜o 16 (1), 121–132. Gordon, A.D., 1987. A review of hierarchical classification. J. Roy. Stat. Soc. Ser. A 150 (2), 119–137. Gorsuch, R.L., 1990. Common factor analysis versus component analysis: some well and little known facts. Multivar. Behav. Res. 25 (1), 33–39. Gorsuch, R.L., 1983. Factor Analysis, second ed. Lawrence Erlbaum Associates, Mahwah. Gould, W., Pitblado, J., Poi, B., 2010. Maximum Likelihood Estimation with Stata, fourth ed. Stata Press, College Station. Gourieroux, C., Monfort, A., Trognon, A., 1984. Pseudo maximum likelihood methods: applications to Poisson models. Econometrica 52 (3), 701–772. Gower, J.C., 1967. A comparison of some methods of cluster analysis. Biometrics 23 (4), 623–637. Greenacre, M.J., 2007. Correspondence Analysis in Practice, second ed. Chapman & Hall/CRC Press, Boca Raton. Greenacre, M.J., 1988. Correspondence analysis of multivariate categorical data by weighted least-squares. Biometrika 75 (3), 457–467. Greenacre, M.J., 2000. Correspondence analysis of square asymmetric matrices. J. Roy. Stat. Soc. Ser. C Appl. Stat. 49 (3), 297–310. Greenacre, M.J., 2008. La pra´ctica del ana´lisis de correspondencias. Barcelona: Fundacio´n Bbva. Greenacre, M.J., 2003. Singular value decomposition of matched matrices. J. Appl. Stat. 30 (10), 1101–1113. Greenacre, M.J., 1989. The Carroll-Green-Schaffer scaling in correspondence analysis: a theoretical and empirical appraisal. J. Market. Res. 26 (3), 358–365. Greenacre, M.J., 1984. 
Theory and Applications of Correspondence Analysis. Academic Press, London. Greenacre, M.J., Blasius, J., 1994. Correspondence Analysis in the Social Sciences. Academic Press, London. Greenacre, M.J., Blasius, J., 2006. Multiple Correspondence Analysis and Related Methods. Chapman & Hall/CRC Press, Boca Raton. Greenacre, M.J., Hastie, T., 1987. The geometric interpretation of correspondence analysis. J. Am. Stat. Assoc. 82 (398), 437–447.
References
1203
Greenacre, M.J., Pardo, R., 2006. Subset correspondence analysis: visualization of selected response categories in a questionnaire survey. Sociol. Methods Res. 35 (2), 193–218. Greenberg, B.A., Goldstucker, J.L., Bellenger, D.N., 1977. What techniques are used by marketing researchers in business? J. Market. 41 (2), 62–68. Greene, W.H., 2012. Econometric Analysis, seventh ed. Pearson, Harlow. Greene, W.H., 2011. Fixed effects vector decomposition: a magical solution to the problem of time-invariant variables in fixed effects models? Polit. Anal. 19 (2), 135–146. Greenwood, M., Yule, G.U., 1920. An inquiry into the nature of frequency distributions representative of multiple happenings with particular reference to the occurrence of multiple attacks of disease or of repeated accidents. J. Roy. Stat. Soc. Ser. A 83 (2), 255–279. Gu, Y., Hole, A.R., 2013. Fitting the generalized multinomial logit model in Stata. Stata J. 13 (2), 382–397. Gujarati, D.N., 2011. Econometria ba´sica, fifth ed. Bookman, Porto Alegre. Gujarati, D.N., Porter, D.C., 2008. Econometria ba´sica, fifth ed. McGraw-Hill, New York. Gupta, P.L., Gupta, R.C., Tripathi, R.C., 1996. Analysis of zero-adjusted count data. Comput. Stat. Data Anal. 23 (2), 207–218. Gurmu, S., 1998. Generalized hurdle count data regressions models. Econ. Lett. 58 (3), 263–268. Gurmu, S., 1991. Tests for detecting overdispersion in the positive Poisson regression model. J. Bus. Econ. Stat. 9 (2), 215–222. Gurmu, S., Trivedi, P.K., 1996. Excess zeros in count models for recreational trips. J. Bus. Econ. Stat. 14 (4), 469–477. Gurmu, S., Trivedi, P.K., 1992. Overdispersion tests for truncated Poisson regression models. J. Econometrics 54 (1–3), 347–370. Gutierrez, R.G., 2002. Parametric frailty and shared frailty survival models. Stata J. 2 (1), 22–44. Guttman, L., 1941. The quantification of a class of attributes: a theory and method of scale construction. In: Horst, P. et al., (Ed.), The Prediction of Personal Adjustment. 
Social Science Research Council, New York. Guttman, L., 1977. What is not what in statistics. Statistician 26 (2), 81–107. Haberman, S.J., 1973. The analysis of residuals in cross-classified tables. Biometrics 29 (1), 205–220. Habib, F., Etesam, I., Ghoddusifar, S.H., Mohajeri, N., 2012. Correspondence analysis: a new method for analyzing qualitative data in architecture. Nexus Netw. J. 14 (3), 517–538. Haddad, R., Haddad, P., 2004. Crie planilhas inteligentes com o Microsoft Office Excel 2003 - Avanc¸ado. Erica, Sa˜o Paulo. Hadi, A.S., 1994. A modification of a method for the detection of outliers in multivariate samples. J. Roy. Stat. Soc. Ser. B 56 (2), 393–396. Hadi, A.S., 1992. Identifying multiple outliers in multivariate data. J. Roy. Stat. Soc. Ser. B 54 (3), 761–771. Hair Jr., J.F., Black, W.C., Babin, B.J., Anderson, R.E., Tatham, R.L., 2009. Ana´lise multivariada de dados, sixth ed. Bookman, Porto Alegre. Hall, D.B., 2000. Zero-inflated Poisson and binomial regression with random effects: a case study. Biometrics 56, 1030–1039. Halvorsen, R., Palmquist, R.B., 1980. The interpretation of dummy variables in semilogarithmic equations. Am. Econ. Rev. 70 (3), 474–475. Hamann, U., 1961. Merkmalsbestand und verwandtschaftsbeziehungen der Farinosae: ein beitrag zum system der monokotyledonen. Willdenowia 2 (5), 639–768. Hamilton, L.C., 2013. Statistics with Stata: version 12, eighth ed. Brooks/Cole Cengage Learning, Belmont. Han, J., Kamber, M., 2000. Data Mining: Concepts and Techniques. Morgan Kaufmann, Burlington. Hardin, J.W., Hilbe, J.M., 2013. Generalized Estimating equations, second ed. Chapman & Hall/CRC Press, Boca Raton. Hardin, J.W., Hilbe, J.M., 2012. Generalized Linear Models and Extensions, third ed. Stata Press, College Station. H€ardle, W.K., Simar, L., 2012. Applied Multivariate Statistical Analysis, third ed. Springer, Heidelberg. Hardy, A., 1996. On the number of clusters. Comput. Stat. Data Anal. 23 (1), 83–96. Hardy, M.A., 1993. 
Regression with Dummy Variables. Sage Publications, Thousand Oaks. Harman, H.H., 1976. Modern Factor Analysis, third ed. University of Chicago Press, Chicago. Hartley, H.O., 1950. The use of range in analysis of variance. Biometrika 37 (3-4), 271–280. Harvey, A.C., 1976. Estimating regression models with multiplicative heteroscedasticity. Econometrica 44 (3), 461–465. Hausman, J.A., 1978. Specification tests in econometrics. Econometrica 46 (6), 1251–1271. Hausman, J.A., Hall, B.H., Griliches, Z., 1984. Econometric models for count data with an application to the patents-R & D relationship. Econometrica 52 (4), 909–938. Hausman, J.A., Taylor, W.E., 1981. Panel data and unobservable individual effects. Econometrica 49 (6), 1377–1398. Hayashi, C., Sasaki, M., Suzuki, T., 1992. Data Analysis for Comparative Social Research: International Perspectives. North Holland, Amsterdam. Heck, R.H., Thomas, S.L., 2009. An Introduction to Multilevel Modeling Techniques, second ed. Routledge, New York. Heck, R.H., Thomas, S.L., Tabata, L.N., 2014. Multilevel and Longitudinal Modeling with IBM SPSS, second ed. Routledge, New York. Heckman, J., Vytlacil, E., 1998. Instrumental variables methods for the correlated random coefficient model: estimating the average rate of return to schooling when the return is correlated with schooling. J. Hum. Resour. 33 (4), 974–987. Heibron, D.C., 1994. Zero-altered and other regression models for count data with added zeros. Biometrical J. 36 (5), 531–547. Held, M., Karp, R.M., 1970. The traveling-salesman problem and minimum spanning trees. Oper. Res. 18 (6), 1138–1162. Herbst, A.F., 1974. A factor analysis approach to determining the relative endogeneity of trade credit. J. Finance 29 (4), 1087–1103. Higgs, N.T., 1991. Practical and innovative uses of correspondence analysis. Statistician 40 (2), 183–194. Hilbe, J.M., 2009. Logistic Regression Models. Chapman & Hall/CRC Press, London. Hill, C., Griffiths, W., Judge, G., 2000. Econometria. 
Saraiva, Sa˜o Paulo. Hill, P.W., Goldstein, H., 1998. Multilevel modeling of educational data with cross-classification and missing identification for units. J. Educ. Behav. Stat. 23 (2), 117–128.
1204
References
Hillier, D., Pindado, J., Queiroz, V., Torre, C., 2011. The impact of country-level corporate governance on research and development. J. Int. Bus. Stud. 42 (1), 76–98. Hillier, F.S., Lieberman, G.J., 2005. Introduction to Operations Research, eighth ed. McGraw-Hill, Boston. Hinde, J., Demetrio, C.G.B., 1998. Overdispersion: models and estimation. Comput. Stat. Data Anal. 27 (2), 151–170. Hindi, K.S., Basta, T., 1994. Computationally efficient solution of a multiproduct, two-stage distribution-location problem. J. Oper. Res. Soc. 45 (11), 1316–1323. Hindi, K.S., Basta, T., Pienkosz, K., 2006. Efficient solution of a multi-commodity, two-stage distribution problem with constraints on assignment of customers to distribution centers. Int. Trans. Oper. Res. 5 (6), 519–527. Hirschfeld, H.O., 1935. A connection between correlation and contingency. Math. Proc. Cambridge Philos. Soc. 31 (4), 520–524. Ho, H.F., Hung, C.C., 2008. Marketing mix formulation for higher education: an integrated analysis employing analytic hierarchy process, cluster analysis and correspondence analysis. Int. J. Educ. Manag. 22 (4), 328–340. Hoaglin, D.C., Mosteller, F., Tukey, J.W., 2000. Understanding Robust and Exploratory Data Analysis. John Wiley & Sons, New York. Hoechle, D., 2007. Robust standard errors for panel regressions with cross-sectional dependence. Stata J. 7 (3), 281–312. Hoffman, D., Franke, G.R., 1986. Correspondence analysis: graphical representation of categorical data in marketing research. J. Market. Res. 23 (3), 213–227. Hofmann, D.A., 1997. An overview of the logic and rationale of hierarchical linear models. J. Manag. 23 (6), 723–744. Holtz-Eakin, D., Newey, W., Rosen, H.S., 1988. Estimating vector auto regressions with panel data. Econometrica 56 (6), 1371–1395. Hoover, K.R., Donovan, T., 2014. The Elements of Social Scientific Thinking, 11th ed. Worth Publishers, New York. Hosmer, D.W., Lemeshow, S., 1980. Goodness-of-fit tests for the multiple logistic regression model. 
Commun. Statist. Theory Methods 9 (10), 1043–1069. Hosmer, D.W., Lemeshow, S., May, S., 2008. Applied Survival Analysis: Regression Modeling of Time to Event Data, second ed. John Wiley & Sons, Hoboken. Hosmer, D.W., Lemeshow, S., Sturdivant, R.X., 2013. Applied Logistic Regression, third ed. John Wiley & Sons, New York. Hosmer, D.W., Taber, S., Lemeshow, S., 1991. The importance of assessing the fit of logistic regression models: a case study. Am. J. Public Health 81, 1630–1635. Hotelling, H., 1933. Analysis of a complex of statistical variables into principal components. J. Educ. Psychol. 24 (6), 417–441. Hotelling, H., 1936. Relations between two sets of variates. Biometrika 28 (3/4), 321–377. Hotelling, H., 1935. The most predictable criterion. J. Educ. Psychol. 26, 139–142. Hough, J.R., 2006. Business segment performance redux: a multilevel approach. Strateg. Manag. J. 27 (1), 45–61. Hox, J.J., 2010. Multilevel Analysis: Techniques and Applications, second ed. Routledge, New York. Hoyos, R.E., Sarafidis, V., 2006. Testing for cross-sectional dependence in panel-data models. Stata J. 6 (4), 482–496. Hsiao, C., 2003. Analysis of Panel Data, second ed. Cambridge University Press, Cambridge. Hu, F.B., Goldberg, J., Hedeker, D., Flay, B.R., Pentz, M.A., 1998. Comparison of population-averaged and subject-specific approaches for analyzing repeated binary outcomes. Am. J. Epidemiol. 147 (7), 694–703. Hubbard, A.E., Ahern, J., Fleischer, N.L., Laan, M.V., Lippman, S.A., Jewell, N., Bruckner, T., Satariano, W.A., 2010. To GEE or not to GEE: comparing population average and mixed models for estimating the associations between neighborhood risk factors and health. Epidemiology 21 (4), 467–474. Huber, P.J., 1967. The behavior of maximum likelihood estimates under nonstandard conditions. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. vol. 1, pp. 221–233. Hubert, L., Arabie, P., 1985. Comparing partitions. J. Classif. 2 (1), 193–218. 
Hwang, H., Dillon, W.R., Takane, Y., 2006. An extension of multiple correspondence analysis for identifying heterogeneous subgroups of respondents. Psychometrika 71 (1), 161–171. Iezzi, D.F., 2005. A method to measure the quality on teaching evaluation of the university system: the Italian case. Soc. Indicat. Res. 73, 459–477. Ignácio, S.A., 2010. Importância da estatística para o processo de conhecimento e tomada de decisão. Revista Paranaense de Desenvolvimento 118, 175–192. Intriligator, M.D., Bodkin, R.G., Hsiao, C., 1996. Econometric Models, Techniques and Applications, second ed. Prentice Hall, Englewood Cliffs. Islam, N., 1995. Growth empirics: a panel data approach. Quart. J. Econ. 110 (4), 1127–1170. Israëls, A., 1987. Eigenvalue Techniques for Qualitative Data. DSWO Press, Leiden. Jaccard, J., 2001. Interaction Effects in Logistic Regression. Sage Publications, Thousand Oaks. Jaccard, P., 1901. Distribution de la flore alpine dans le Bassin des Dranses et dans quelques régions voisines. Bulletin de la Société Vaudoise des Sciences Naturelles 37 (140), 241–272. Jaccard, P., 1908. Nouvelles recherches sur la distribution florale. Bulletin de la Société Vaudoise des Sciences Naturelles 44 (163), 223–270. Jain, A.K., Murty, M.N., Flynn, P.J., 1999. Data clustering: a review. ACM Comput. Surv. 31 (3), 264–323. Jak, S., Oort, F.J., Dolan, C.V., 2014. Using two-level factor analysis to test for cluster bias in ordinal data. Multivar. Behav. Res. 49 (6), 544–553. Jann, B., 2007. Making regression tables simplified. Stata J. 7 (2), 227–244. Jansakul, N., Hinde, J.P., 2002. Score tests for zero-inflated Poisson models. Comput. Stat. Data Anal. 40 (1), 75–96. Jérôme, P., 2014. Multiple Factor Analysis by Example Using R. Chapman & Hall/CRC Press, London. Jiménez, E.G., Flores, J.G., Gómez, G.R., 2000. Análisis factorial. Editorial La Muralla, Madrid. Johnson, D.E., 1998. Applied Multivariate Methods for Data Analysts. Duxbury Press, Pacific Grove.
References
1205
Johnson, R.A., Wichern, D.W., 2007. Applied Multivariate Statistical Analysis, sixth ed. Pearson Education, Upper Saddle River. Johnson, S.C., 1967. Hierarchical clustering schemes. Psychometrika 32 (3), 241–254. Johnston, J., DiNardo, J., 2001. Métodos econométricos, fourth ed. McGraw-Hill, Lisboa. Jolliffe, I.T., Jones, B., Morgan, B.J.T., 1995. Identifying influential observations in hierarchical cluster analysis. J. Appl. Stat. 22 (1), 61–80. Jones, A.M., Rice, N., D’uva, T.B., Balia, S., 2013. Applied Health Economics, second ed. Routledge, New York. Jones, D.C., Kalmi, P., Mäkinen, M., 2010. The productivity effects of stock option schemes: evidence from Finnish panel data. J. Product. Anal. 33 (1), 67–80. Jones, K., Bullen, N., 1994. Contextual models of urban house prices: a comparison of fixed- and random-coefficient models developed by expansion. Econ. Geogr. 70 (3), 252–272. Jones, M.R., 2014. Identifying critical factors that predict quality management program success: data mining analysis of Baldrige award data. Qual. Manag. J. 21 (3), 49–61. Jones, R.H., 1975. Probability estimation using a multinomial logistic function. J. Stat. Comput. Simul. (3), 315–329. Jones, S.T., Banning, K., 2009. US elections and monthly stock market returns. J. Econ. Finance 33 (3), 273–287. Jöreskog, K.G., 1967. Some contributions to maximum likelihood factor analysis. Psychometrika 32 (4), 443–482. Kachigan, S., 1986. Statistical Analysis: An Interdisciplinary Introduction to Univariate & Multivariate Methods. Radius Press, New York. Kaiser, H.F., 1970. A second generation little jiffy. Psychometrika 35 (4), 401–415. Kaiser, H.F., 1974. An index of factorial simplicity. Psychometrika 39 (1), 31–36. Kaiser, H.F., 1958. The varimax criterion for analytic rotation in factor analysis. Psychometrika 23 (3), 187–200. Kaiser, H.F., Caffrey, J., 1965. Alpha factor analysis. Psychometrika 30 (1), 1–14. Kalbfleisch, J.D., Prentice, R.L., 2002. 
The Statistical Analysis of Failure Time Data, second ed. John Wiley & Sons, New York. Kanungo, T., Mount, D.M., Netanyahu, N.S., Piatko, C.D., Silverman, R., Wu, A.Y., 2002. The efficient k-means clustering algorithm: analysis and implementation. IEEE Trans. Pattern Anal. Mach. Intell. 24 (7), 881–892. Kaplan, E.L., Meier, P., 1958. Nonparametric estimation from incomplete observations. J. Am. Stat. Assoc. 53 (282), 457–481. Kaufman, L., Rousseeuw, P.J., 2005. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, Hoboken. Kaufman, R.L., 1996. Comparing effects in dichotomous logistic regression: a variety of standardized coefficients. Soc. Sci. Quart. 77, 90–109. Kelton, W.D., Sadowski, R.P., Swets, N.B., 2010. Simulation with Arena, fifth ed. McGraw-Hill, New York. Kelton, W.D., Sadowski, R.P., Swets, N.B., 1998. Simulation with Arena, first ed. McGraw-Hill, New York. Kennedy, P., 2008. A Guide to Econometrics, sixth ed. MIT Press, Cambridge. Keskin, B.B., Üster, H., 2007. A scatter search-based heuristic to locate capacitated transshipment points. Comput. Oper. Res. 34 (10), 3112–3125. Kim, B., Park, C., 1992. Some remarks on testing goodness of fit for the Poisson assumption. Commun. Statist. Theory Methods 21 (4), 979–995. Kim, J.O., Mueller, C.W., 1978a. Factor Analysis: Statistical Methods and Practical Issues. Sage Publications, Thousand Oaks. Kim, J.O., Mueller, C.W., 1978b. Introduction to Factor Analysis: What it Is and How to Do it. Sage Publications, Thousand Oaks. Kintigh, K.W., Ammerman, A.J., 1982. Heuristic approaches to spatial analysis in archaeology. Am. Ant. 47 (1), 31–63. Klastorin, T.D., 1983. Assessing cluster analysis results. J. Market. Res. 20 (1), 92–98. Klatzky, S.R., Hodge, R.W., 1971. A canonical correlation analysis of occupational mobility. J. Am. Stat. Assoc. 66 (333), 16–22. Klein, J.P., Moeschberger, M.L., 2003. Survival Analysis: Techniques for Censored and Truncated Data, second ed. 
Springer, New York. Kleinbaum, D.G., Klein, M., 2010. Logistic Regression: A Self-Learning Text, third ed. Springer, New York. Kleinbaum, D.G., Klein, M., 2012. Survival Analysis: A Self-Learning Text, third ed. Springer-Verlag, New York. Kleinbaum, D., Kupper, L., Nizam, A., Rosenberg, E.S., 2014. Applied Regression Analysis and Other Multivariable Methods, fifth ed. Cengage Learning, Boston. Klimkiewicz, A., Cervera-Padrell, A.E., Van den Berg, F.W.J., 2016. Multilevel modeling for data mining of downstream bio-industrial processes. Chemometr. Intell. Lab. Syst. 154 (15), 62–71. Kmenta, J., 1978. Elementos de econometria. Atlas, São Paulo. Koenker, R., 2004. Quantile regression for longitudinal data. J. Multivar. Anal. 91 (1), 74–89. Koenker, R., 2005. Quantile Regression. Cambridge University Press, Cambridge. Koenker, R., Bassett, G., 1978. Regression quantiles. Econometrica 46 (1), 33–50. Kohler, U., Kreuter, F., 2012. Data Analysis Using Stata, third ed. Stata Press, College Station. Kolmogorov, A., 1941. Confidence limits for an unknown distribution function. Ann. Math. Stat. 12 (4), 461–463. Konno, H., Yamazaki, H., 1991. Mean-absolute deviation portfolio optimization model and its applications to Tokyo stock market. Manag. Sci. 37 (5), 519–531. Kreft, I., De Leeuw, J., 1998. Introducing Multilevel Modeling. Sage Publications, London. Krishnakumar, J., Ronchetti, E. (Eds.), 2000. Panel Data Econometrics: Future Directions. North Holland, Amsterdam. Kruskal, J.B., 1964a. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika 29 (1), 1–27. Kruskal, J.B., 1964b. Nonmetric multidimensional scaling: a numerical method. Psychometrika 29 (2), 115–129. Kruskal, W.H., 1952. A nonparametric test for the several sample problem. Ann. Math. Stat. 23 (4), 525–540. Kruskal, W.H., Wallis, W.A., 1952. Use of ranks in one-criterion variance analysis. J. Am. Stat. Assoc. 47 (260), 583–621. 
Kutner, M.H., Nachtsheim, C.J., Neter, J., 2004. Applied Linear Regression Models, fourth ed. Irwin, Chicago. Lachtermacher, G., 2009. Pesquisa operacional na tomada de decisões, fourth ed. Prentice Hall do Brasil, São Paulo.
Laird, N.M., Ware, J.H., 1982. Random-effects models for longitudinal data. Biometrics 38 (4), 963–974. Lambert, D., 1992. Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics 34 (1), 1–14. Lambert, P.C., Royston, P., 2009. Further development of flexible parametric models for survival analysis. Stata J. 9 (2), 265–290. Lambert, Z., Durand, R., 1975. Some precautions in using canonical analysis. J. Market. Res. 12 (4), 468–475. Lance, G.N., Williams, W.T., 1967. A general theory of classificatory sorting strategies: 1. Hierarchical systems. Comput. J. 9 (4), 373–380. Land, A.H., Doig, A.G., 1960. An automatic method of solving discrete programming problems. Econometrica 28 (3), 497–520. Landau, S., Everitt, B.S., 2004. A Handbook of Statistical Analyses Using SPSS. Chapman & Hall/CRC Press, Boca Raton. Lane, W.R., Looney, S.W., Wansley, J.W., 1986. An application of the Cox proportional hazards model to bank failure. J. Bank. Finance 10 (4), 511–531. Larose, D.T., Larose, C.D., 2014. Discovering Knowledge in Data: An Introduction to Data Mining, second ed. John Wiley & Sons, New York. Lawless, J., 1987. Regression methods for Poisson process data. J. Am. Stat. Assoc. 82 (399), 808–815. Lawley, D.N., 1959. Tests of significance in canonical analysis. Biometrika 46 (1/2), 59–66. Lawson, D.M., Brossart, D.F., 2004. The association between current intergenerational family relationships and sibling structure. J. Counsel. Dev. 82 (4), 472–482. Le Foll, Y., Burtschy, B., 1983. Représentations optimales des matrices imports-exports. Revue de Statistique Appliquée 31 (3), 57–72. Le Roux, B., Rouanet, H., 2004. Geometric Data Analysis: From Correspondence Analysis to Structured Data Analysis. Kluwer, Dordrecht. Le Roux, B., Rouanet, H., 2010. Multiple Correspondence Analysis. Sage Publications, Thousand Oaks. Lebart, L., Piron, M., Morineau, A., 2000. Statistique exploratoire multidimensionnelle, third ed. Dunod, Paris. 
Lee, A.H., Wang, K., Scott, J.A., Yau, K., McLachlan, G.J., 2006. Multi-level zero-inflated Poisson regression modelling of correlated count data with excess zeros. Stat. Methods Med. Res. 15 (1), 47–61. Lee, A.H., Wang, K., Yau, K., 2001. Analysis of zero-inflated Poisson data incorporating extent of exposure. Biometrical J. 43 (8), 963–975. Lee, E.T., Wang, J.W., 2013. Statistical Methods for Survival Data Analysis, fourth ed. John Wiley & Sons, Hoboken. Lee, L., 1986. Specification test for Poisson regression models. Int. Econ. Rev. 27 (3), 689–706. Leech, N.L., Barrett, K.C., Morgan, G.A., 2005. SPSS for Intermediate Statistics: Use and Interpretation, second ed. Lawrence Erlbaum Associates, Mahwah. Levene, H., 1960. Robust tests for the equality of variance. In: Olkin, I. (Ed.), Contributions to Probability and Statistics. Stanford University Press, Palo Alto, pp. 278–292. Levine, R., 1997. Financial development and economic growth: views and agenda. J. Econ. Lit. 35 (2), 688–726. Levy, P.S., Lemeshow, S., 2009. Sampling of Populations: Methods and applications, fourth ed. John Wiley & Sons, New York. Liang, K.Y., Zeger, S.L., 1986. Longitudinal data analysis using generalized linear models. Biometrika 73 (1), 13–22. Liczbinski, C.R., 2002. Modelo de informações para o gerenciamento das atividades das pequenas indústrias de produtos alimentares do Rio Grande do Sul. Dissertação (Mestrado em Engenharia de Produção), Universidade Federal de Santa Catarina, Florianópolis. Likert, R., 1932. A technique for the measurement of attitudes. Arch. Psychol. 22 (140), 5–55. Lilliefors, H.W., 1967. On the Kolmogorov-Smirnov test for normality with mean and variance unknown. J. Am. Stat. Assoc. 62 (318), 399–402. Lindley, D., 1983. Reconciliation of probability distributions. Oper. Res. 31 (5), 866–880. Linneman, P., 1980. Some empirical results on the nature of hedonic price function for the urban housing market. J. Urban Econ. 8, 47–68. 
Linoff, G.S., Berry, M.J.A., 2011. Data Mining Techniques: for Marketing, Sales, and Customer Relationship Management, third ed. John Wiley & Sons, Indianapolis. Lisboa, E.F.A., 2010. Pesquisa operacional. Disponível em http://www.ericolisboa.eng.br. [(Accessed 28 September 2010)]. Lombardo, R., Beh, E.J., D’ambra, L., 2007. Non-symmetric correspondence analysis with ordinal variables using orthogonal polynomials. Comput. Statist. Data Anal. 52, 566–577. Long, J.S., Freese, J., 2006. Regression models for categorical dependent variables using Stata, second ed. Stata Press, College Station. Lopez, C.P., 2013. Principal Components, Factor Analysis, Correspondence Analysis and Scaling: Examples with SPSS. CreateSpace Independent Publishing Platform. López, M.J.R., Fidalgo, J.L., 2000. Análisis de supervivencia. Ed. La Muralla, Madrid. Lord, D., Park, P.Y.J., 2008. Investigating the effects of the fixed and varying dispersion parameters of Poisson-Gamma models on empirical Bayes estimates. Accid. Anal. Prevent. 40 (4), 1441–1457. Lu, Y., Thill, J.C., 2008. Cross-scale analysis of cluster correspondence using different operational neighborhoods. J. Geogr. Syst. 10 (3), 241–261. Lustosa, L., Mesquita, M.A., Quelhas, O., Oliveira, R., 2008. Planejamento e Controle da Produção. Campus Elsevier, Rio de Janeiro. MacCallum, R.C., Widaman, K.F., Zhang, S., Hong, S., 1999. Sample size in factor analysis. Psychol. Methods 4 (1), 84–99. Macedo, M.A.S., 2002. A utilização de programação matemática linear inteira binária (0-1) na seleção de projetos sob condição de restrição orçamentária. Anais do XXXIV SBPO, Rio de Janeiro, Ime. Machado, N.R.S., Ferreira, A.O., 2012. Método de Simulação de Monte Carlo em Planilha Excel: Desenvolvimento de uma ferramenta versátil para análise quantitativa de riscos em gestão de projetos. Revista de Ciências Gerenciais 16 (23), 223–244. Machin, D., Cheung, Y.B., Parmar, M.K.B., 2006. 
Survival Analysis: A Practical Approach, second ed. John Wiley & Sons, Hoboken. Maddala, G.S., 2003. Introdução à econometria, third ed. LTC Editora, Rio de Janeiro. Maddala, G.S., 1993. The Econometrics for Panel Data. Elgar, Brookfield. Magalhães, M.N., Lima, C.P., 2013. Noções de probabilidade e estatística, seventh ed. Edusp, São Paulo. Makles, A., 2012. Stata tip 110: how to get the optimal k-means cluster solution. Stata J. 12 (2), 347–351. Malhotra, N.K., 2012. Pesquisa de marketing: uma orientação aplicada, sixth ed. Bookman, Porto Alegre.
Mangiameli, P., Chen, S.K., West, D., 1996. A comparison of SOM neural network and hierarchical clustering methods. Eur. J. Oper. Res. 93 (2), 402–417. Manly, B.F.J., 2011. Statistics for Environmental Science and Management, second ed. Chapman and Hall/CRC Press, London. Manly, B.J.F., 2004. Multivariate Statistical Methods, third ed. Chapman and Hall, London. Mann, H.B., Whitney, D.R., 1947. On a test of whether one of two random variables is stochastically larger than the other. Ann. Math. Stat. 18 (1), 50–60. Marcoulides, G.A., Hershberger, S.L., 2014. Multivariate Statistical Methods: A First Course. Psychology Press, New York. Mardia, K.V., Kent, J.T., Bibby, J.M., 1997. Multivariate Analysis, sixth ed. Academic Press, London. Markowitz, H., 1952. Portfolio selection. J. Finance 7 (1), 77–91. Maroco, J., 2014. Análise estatística com o SPSS Statistics, sixth ed. Edições Sílabo, Lisboa. Marquardt, D.W., 1963. An algorithm for least-squares estimation of nonlinear parameters. J. Soc. Ind. Appl. Math. 11 (2), 431–441. Marques, L.D., 2000. Modelos dinâmicos com dados em painel: revisão da literatura. In: Série Working Papers do Centro de Estudos Macroeconômicos e Previsão (CEMPRE) da Faculdade de Economia do Porto, Portugal. 100. Marriott, F.H.C., 1971. Practical problems in a method of cluster analysis. Biometrics 27 (3), 501–514. Martín, J.M., 1990. Oportunidad relativa: reflexiones en torno a la traducción del término ‘odds ratio’. Gaceta Sanitaria (16), 37. Martins, G.A., Domingues, O., 2011. Estatística geral e aplicada, fourth ed. Atlas, São Paulo. Martins, M.S., Galli, O.C., 2007. A previsão de insolvência pelo modelo Cox: uma aplicação para a análise de risco de companhias abertas brasileiras. Revista Eletrônica de Administração (REAd UFRGS), ed. 55 13 (1), 1–18. Mason, R.L., Young, J.C., 2005. Multivariate tools: principal component analysis. Qual. Progr. 38 (2), 83–85. Matisziw, T.C., 2005. 
Modeling transnational surface freight flow and border crossing improvement. Dissertation (PhD in Philosophy), Ohio State University. Mátyás, L., Sevestre, P. (Eds.), 2008. The Econometrics of Panel Data: Fundamentals and Recent Developments in Theory and Practice, third ed. Springer, New York. Mazzarol, T.W., Soutar, G.N., 2008. Australian educational institutions’ international markets: a correspondence analysis. Int. J. Educ. Manag. 22 (3), 229–238. McClave, J.T., Benson, P.G., Sincich, T., 2009. Estatística para administração e economia. Pearson Prentice Hall, São Paulo. McCue, C., 2014. Data Mining and Predictive Analysis: Intelligence Gathering and Crime Analysis, second ed. Elsevier, Boston. McCullagh, P., 1983. Quasi-likelihood functions. Ann. Stat. 11 (1), 59–67. McCullagh, P., Nelder, J.A., 1989. Generalized Linear Models, second ed. Chapman & Hall, London. McCulloch, C.E., Searle, S.R., Neuhaus, J.M., 2008. Generalized, Linear, and Mixed Models, second ed. John Wiley & Sons, Hoboken. McGahan, A.M., Porter, M.E., 1997. How much does industry matter, really? Strateg. Manag. J. 18 (S1), 15–30. McGee, D.L., Reed, D., Yano, K., 1984. The results of logistic analyses when the variables are highly correlated. Am. J. Epidemiol. 37, 713–719. McIntyre, R.M., Blashfield, R.K., 1980. A nearest-centroid technique for evaluating the minimum-variance clustering procedure. Multivar. Behav. Res. 15, 225–238. McLaughlin, S.D., Otto, L.B., 1981. Canonical correlation analysis in family research. J. Marr. Fam. 43 (1), 7–16. McNemar, Q., 1969. Psychological Statistics, fourth ed. John Wiley & Sons, New York. Medri, W., 2015. Análise exploratória de dados. http://www.uel.br/pos/estatisticaeducacao/.../especializacao_estatistica.pdf. [(Accessed 3 August 2015)]. Melo, M.T., Nickel, S., Gama, F.S., 2009. Facility location and supply chain management: a review. Eur. J. Oper. Res. 196, 401–412. Menard, S.W., 2001. Applied Logistic Regression Analysis, second ed. 
Sage Publications, Thousand Oaks. Michell, J., 1986. Measurement scales and statistics: a clash of paradigms. Psychol. Bull. 100 (3), 398–407. Miguel, A., Pindado, J., 2001. Determinants of capital structure: new evidence from Spanish panel data. J. Corp. Finance 7 (1), 77–99. Miguel, A., Pindado, J., Torre, C., 2004. Ownership structure and firm value: new evidence from Spain. Strateg. Manag. J. 25 (12), 1199–1207. Miles, M.B., Huberman, A.M., Saldaña, J., 2014. Qualitative Data Analysis: A Methods Sourcebook, third ed. Sage Publications, Thousand Oaks. Milligan, G.W., 1981. A Monte Carlo study of thirty internal criterion measures for cluster analysis. Psychometrika 46, 325–342. Milligan, G.W., 1980. An examination of the effect of six types of error perturbation on fifteen clustering algorithms. Psychometrika 45 (3), 325–342. Milligan, G.W., Cooper, M.C., 1985. An examination of procedures for determining the number of clusters in a data set. Psychometrika 50, 159–179. Milligan, G.W., Cooper, M.C., 1987. Methodology review: clustering methods. Appl. Psychol. Meas. 11 (4), 329–354. Mills, T.C., 1993. The Econometric Modelling of Financial Time Series. Cambridge University Press. Min, Y., Agresti, A., 2005. Random effect models for repeated measures of zero-inflated count data. Stat. Modell. 5 (1), 1–19. Mingoti, S.A., 2005. Análise de dados através de métodos de estatística multivariada: uma abordagem aplicada. Editora UFMG, Belo Horizonte. Miranda, A., Rabe-Hesketh, S., 2006. Maximum likelihood estimation of endogenous switching and sample selection models for binary, ordinal, and count variables. Stata J. 6 (3), 285–308. Miranda, G.J., Martins, V.F., Faria, A.F., 2007. O uso da programação linear num contexto de laticínios com várias restrições na capacidade produtiva. Custos e @gronegócio on line 3, 40–58. Misangyi, V.F., Lepine, J.A., Algina, J., Goeddeke Jr., F., 2006. The adequacy of repeated-measures regression for multilevel research. 
Organ. Res. Methods 9 (1), 5–28. Mitchell, M.N., 2012a. A Visual Guide to Stata Graphics, third ed. Stata Press, College Station. Mitchell, M.N., 2012b. Interpreting and Visualizing Regression Models Using Stata. Stata Press, College Station. Mittböck, M., Schemper, M., 1996. Explained variation for logistic regression. Stat. Med. 15, 1987–1997. Molina, C.A., 2002. Predicting bank failures using a hazard model: the Venezuelan banking crisis. Emerg. Market Rev. 3 (1), 31–50.
Montgomery, D.C., 2013. Introduction to Statistical Quality Control, seventh ed. John Wiley & Sons, Inc, Arizona State University. Montgomery, D.C., Goldsman, D.M., Hines, W.W., Borror, C.M., 2006. Probabilidade e estatística na engenharia, fourth ed. LTC Editora, Rio de Janeiro. Montgomery, D.C., Peck, E.A., Vining, G.G., 2012. Introduction to Linear Regression Analysis, fifth ed. John Wiley & Sons, New Jersey. Montoya, A.G.M., 2009. Inferência e diagnóstico em modelos para dados de contagem com excesso de zeros. Masters Dissertation, Departamento de Estatística, Instituto de Matemática, Estatística e Computação Científica, Universidade Estadual de Campinas, Campinas. 95 f. Moore, D.S., McCabe, G.P., Duckworth, W.M., Sclove, S.L., 2006a. A prática da estatística empresarial: como usar dados para tomar decisões. LTC Editora, Rio de Janeiro. Moore, D.S., McCabe, G.P., Duckworth, W.M., Sclove, S.L., 2006b. Estatística empresarial: como usar dados para tomar decisões. LTC Editora, Rio de Janeiro. Morettin, L.G., 2000. Estatística básica: inferência. Makron Books, São Paulo. Morgan, G.A., Leech, N.L., Gloeckner, G.W., Barrett, K.C., 2004. SPSS for Introductory Statistics: Use and Interpretation, second ed. Lawrence Erlbaum Associates, Mahwah. Morgan, B.J.T., Ray, A.P.G., 1995. Non-uniqueness and inversions in cluster analysis. J. Roy. Stat. Soc. Ser. C 44 (1), 117–134. Moreira, D.A., 2006. Administração da produção e operações. Thomson Learning, São Paulo. Mulaik, S.A., 1990. Blurring the distinction between component analysis and common factor analysis. Multivar. Behav. Res. 25 (1), 53–59. Mulaik, S.A., 2011. Foundations of Factor Analysis, second ed. Chapman & Hall/CRC Press, Boca Raton. Mulaik, S.A., McDonald, R.P., 1978. The effect of additional variables on factor indeterminancy in models with a single common factor. Psychometrika 43 (2), 177–192. Mullahy, J., 1986. Specification and testing of some modified count data models. J. 
Econometrics 33 (3), 341–365. Muller, K.E., 1982. Understanding canonical correlation through the general linear model and principal components. Am. Statist. 36 (4), 342–354. Mundlak, Y., 1978. On the pooling of time series and cross section data. Econometrica 46 (1), 69–85. Myatt, G.J., Johnson, W.P., 2014. Making Sense of Data I: A Practical Guide to Exploratory Data Analysis and Data Mining, second ed. John Wiley & Sons, Hoboken. Myatt, G.J., Johnson, W.P., 2009. Making sense of data II: a practical guide to data visualization, advanced data mining methods, and applications. John Wiley & Sons, Hoboken. Naito, S.D.N.P., 2007. Análise de correspondências generalizada. Masters Dissertation, Faculdade de Ciências, Universidade de Lisboa, Lisboa. 156 f. Nance, C.R., de Leeuw, J., Weigand, P.C., Prado, K., Verity, D.S., 2013. Correspondence Analysis and West Mexico Archaeology: Ceramics from the Long-Glassow Collection. University of New Mexico Press, Albuquerque. Nascimento, A., Almeida, R.M.V.R., Castilho, S.R., Infantosi, A.F.C., 2013. Análise de correspondência múltipla na avaliação de serviços de farmácia hospitalar no Brasil. Cadernos de Saúde Pública 29 (6), 1161–1172. Natis, L., 2007. Modelos lineares hierárquicos. Masters Dissertation, Instituto de Matemática e Estatística, Universidade de São Paulo, São Paulo. 77 f. Navarro, A., Utzet, F., Caminal, J., Martin, M., 2001. La distribución binomial negativa frente a la de Poisson en el análisis de fenómenos recurrentes. Gaceta Sanitaria 15 (5), 447–452. Navidi, W., 2012. Probabilidade e estatística para ciências exatas. Bookman, Porto Alegre. Nasser, R.B., 2012. McCloud service framework: arcabouço para desenvolvimento de serviços baseados na simulação de Monte Carlo na cloud. Pontifícia Universidade Católica do Rio de Janeiro – PUC-Rio. Dissertação (Mestrado em Informática). Nelder, J.A., 1966. Inverse polynomials, a useful group of multi-factor response functions. 
Biometrics 22 (1), 128–141. Nelder, J.A., Wedderburn, R.W.M., 1972. Generalized linear models. J. Roy. Stat. Soc. Ser. A 135 (3), 370–384. Nelson, D., 1975. Some remarks on generalizations of the negative binomial and Poisson distributions. Technometrics 17 (1), 135–136. Nerlove, M., 2002. Essays in Panel Data Econometrics. Cambridge University Press, Cambridge. Neuenschwander, B.E., Flury, B.D., 1995. Common canonical variates. Biometrika 82 (3), 553–560. Neufeld, J.L., 2003. Estatística aplicada à administração usando Excel. Prentice Hall, São Paulo. Neuhaus, J.M., 1992. Statistical methods for longitudinal and clustered designs with binary responses. Stat. Methods Med. Res. 1 (3), 249–273. Neuhaus, J.M., Kalbfleisch, J.D., 1998. Between- and within-cluster covariate effects in the analysis of clustered data. Biometrics 54 (2), 638–645. Neuhaus, J.M., Kalbfleisch, J.D., Hauck, W.W., 1991. A comparison of cluster-specific and population-averaged approaches for analyzing correlated binary data. Int. Stat. Rev. 59 (1), 25–35. Newey, W.K., West, K.D., 1987. A simple, positive semi-definite, heteroskedasticity and autocorrelation consistent covariance matrix. Econometrica 55 (3), 703–708. Nishisato, S., 1993. On quantifying different types of categorical data. Psychometrika 58 (1), 617–629. Norton, E.C., Bieler, G.S., Ennett, S.T., Zarkin, G.A., 1996. Analysis of prevention program effectiveness with clustered data using generalized estimating equations. J. Consult. Clin. Psychol. 64 (5), 919–926. Norusis, M.J., 2012. IBM SPSS Statistics 19 Guide to Data Analysis. Pearson, Boston. Nunnally, J.C., Bernstein, I.H., 1994. Psychometric Theory, third ed. McGraw-Hill, New York. O’Rourke, D., Blair, J., 1983. Improving random respondent selection in telephone surveys. J. Market. Res. 20 (4), 428–432. Ochiai, A., 1957. Zoogeographic studies on the soleoid fishes found in Japan and its neighbouring regions [in Japanese]. Bull. Jpn. Soc. Sci. Fish. 22 (9), 522–525. 
Olariaga, L.J., Hernández, L.L., 2000. Análisis de correspondencias. Editorial La Muralla, Madrid. Oliveira, C.C.F., 2011. Uma priori beta para distribuição binomial negativa. Masters Dissertation, Departamento de Estatística e Informática, Universidade Federal Rural de Pernambuco, Recife. 54 f.
Oliveira, F.E.M., 2009. Estatística e probabilidade, second ed. Atlas, São Paulo. Oliveira, T.M.V., 2001. Amostragem não probabilística: adequação de situações para uso e limitações de amostras por conveniência, julgamento e quotas. Administração On Line 2 (3), 1–16. Oliveira Jr., P.A., Dantas, M.J.P., Machado, R.L., 2013. Aplicação da Simulação de Monte Carlo no Gerenciamento de Riscos em Projetos com o Cristal Ball. Simpósio de Administração da Produção, Logística e Operações Internacionais. Olshansky, S.J., Carnes, B.A., 1997. Ever since Gompertz. Demography 34 (1), 1–15. Olson, D.L., Delen, D., 2008. Advanced Data Mining Techniques. Springer, New York. Oneal, J.R., Russett, B., 2001. Clear and clean: the fixed effects of the liberal peace. Int. Org. 55 (2), 469–485. Orden, A., 1956. The transshipment problem. Manag. Sci. 2 (3), 276–285. Orsini, N., Bottai, M., 2011. Logistic quantile regression in Stata. Stata J. 11 (3), 327–344. Ortega, C.M., Cayuela, D.A., 2002. Regresión logística no condicionada y tamaño de muestra: una revisión bibliográfica. Revista Española de Salud Pública 76, 85–93. Ortega, E.M.M., Cordeiro, G.M., Carrasco, J.M.F., 2011. The log-generalized modified Weibull regression model. Brazil. J. Probab. Stat. 25 (1), 64–89. Ortega, E.M.M., Cordeiro, G.M., Kattan, M.W., 2012. The negative binomial-beta Weibull regression model to predict the cure of prostate cancer. J. Appl. Stat. 39 (6), 1191–1210. Ou, H., Wei, C., Deng, Y., Gao, N., Ren, Y., 2014. Principal component analysis to assess the efficiency and mechanism for enhanced coagulation of natural algae-laden water using a novel dual coagulant system. Environ. Sci. Pollut. Res. Int. 21 (3), 2122–2131. Page, M.C., Braver, S.L., Mackinnon, D.P., 2003. Levine’s Guide to SPSS for Analysis of Variance, second ed. Lawrence Erlbaum Associates, Mahwah. Pallant, J., 2010. SPSS Survival Manual: A Step by Step Guide to Data Analysis Using SPSS, fourth ed. 
Open University Press, Berkshire. Palmer, M.W., 1993. Putting things in even better order: the advantages of canonical correspondence analysis. Ecology 74 (8), 2215–2230. Pampel, F.C., 2000. Logistic Regression: A Primer. Sage Publications, Thousand Oaks. Pardoe, I., 2012. Applied Regression Modeling, second ed. John Wiley & Sons, Hoboken. Parzen, E., 1962. On estimation of a probability density function and mode. Ann. Math. Stat. 33 (3), 1065–1076. Pearson, K., 1896. Mathematical contributions to the theory of evolution. III. Regression, Heredity, and Panmixia. Philos. Trans. R. Soc. London 187, 253–318. Pearson, K., 1930. The Life, Letters and Labors of Francis Galton. Cambridge University Press, Cambridge. Pegden, C.D., Shannon, R.E., Sadowski, R.P., 1990. Introduction to Simulation Using SIMAN, second ed. McGraw-Hill, New York. Peña, J.M., Lozano, J.A., Larrañaga, P., 1999. An empirical comparison of four initialisation methods for the k-means algorithm. Pattern Recognit. Lett. 20 (10), 1027–1040. Pendergast, J.F., Gange, S.J., Newton, M.A., Lindstrom, M.J., Palta, M., Fisher, M.R., 1996. A survey of methods for analyzing clustered binary response data. Int. Stat. Rev. 64 (1), 89–118. Peduzzi, P., Concato, J., Kemper, E., Holford, T.R., Feinstein, A.R., 1996. A simulation study of the number of events per variable in logistic regression analysis. J. Clin. Epidemiol. 49, 1373–1379. Pereira, H.C., Sousa, A.J., Análise de dados para o tratamento de quadros multidimensionais. http://biomonitor.ist.utl.pt/ajsousa/AnalDadosTratQuadMult.html. [(Accessed 20 January 2015)]. Pereira, J.C.R., 2004. Análise de dados qualitativos: estratégias metodológicas para as ciências da saúde, humanas e sociais, third ed. Edusp, São Paulo. Pereira, M.A., Vidal, T.L., Amorim, T.N., Fávero, L.P., 2010. Decision process based on personal finance books: is there any direction to take? Revista de Economia e Administração 9 (3), 407–425. Pesaran, M.H., 2004. 
General diagnostic tests for cross section dependence in panels. Cambridge Working Papers in Economics, no. 0435, Faculty of Economics, University of Cambridge. Pessôa, L.A.M., Lins, M.P.E., Torres, N.T., 2009. Problema da dieta: uma aplicação prática para o navio hidroceanográfico "Tauros". In: Simpósio Brasileiro de Pesquisa Operacional, 2009, Porto Seguro, BA. Anais do XLI Simpósio Brasileiro de Pesquisa Operacional 1, 1460–1471. Pestana, M.H., Gageiro, J.N., 2008. Análise de dados para ciências sociais: a complementaridade do SPSS, fifth ed. Edições Sílabo, Lisboa. Peters, W.S., 1958. Cluster analysis in urban demography. Soc. Forces 37 (1), 38–44. Petersen, M.A., 2009. Estimating standard errors in finance panel data sets: comparing approaches. Rev. Financ. Stud. 22 (1), 435–480. Peto, R., Lee, P., 1973. Weibull distributions for continuous-carcinogenesis experiments. Biometrics 29 (3), 457–470. Peugh, J.L., Enders, C.K., 2005. Using the SPSS mixed procedure to fit cross-sectional and longitudinal multilevel models. Educ. Psychol. Meas. 65 (5), 714–741. Pylro, A.S., 2008. Modelo Linear Dinâmico de Harrison & Stevens Aplicado ao Controle Estatístico de Processos Autocorrelacionados. Pontifícia Universidade Católica do Rio de Janeiro. Tese (Doutorado em Engenharia de Produção). Pindado, J., Requejo, I., 2015. Panel data: a methodology for model specification and testing. In: Paudyal, K. (Ed.), Wiley Encyclopedia of Management. vol. 4, pp. 1–8. Pindado, J., Requejo, I., Torre, C., 2011. Family control and investment-cash flow sensitivity: empirical evidence from the euro zone. J. Corp. Finance 17 (5), 1389–1409. Pindado, J., Requejo, I., Torre, C., 2014. Family control, expropriation, and investor protection: a panel data analysis of Western European corporations. J. Empir. Finance 27 (C), 58–74. Pindyck, R.S., Rubinfeld, D.L., 2004. Econometria: modelos e previsões, fourth ed. Campus Elsevier, Rio de Janeiro. Pires, P.J., Marchetti, R.Z., 1997. 
O perfil dos usuários de caixas-automáticos em agências bancárias na cidade de Curitiba. Revista de Administração Contemporânea (RAC) 1 (3), 57–76.
1210
References
Plümper, T., Troeger, V.E., 2007. Efficient estimation of time-invariant and rarely changing variables in finite sample panel analyses with unit fixed effects. Polit. Anal. 15 (2), 124–139. Pollard, D., 1981. Strong consistency of k-means clustering. Ann. Stat. 9 (1), 135–140. Pregibon, D., 1981. Logistic regression diagnostics. Ann. Stat. 9, 704–724. Press, S.J., 2005. Applied Multivariate Analysis: Using Bayesian and Frequentist Methods of Inference, second ed. Dover Science, Mineola. Punj, G., Stewart, D.W., 1983. Cluster analysis in marketing research: review and suggestions for application. J. Market. Res. 20 (2), 134–148. Rabe-Hesketh, S., Everitt, B., 2000. A Handbook of Statistical Analyses Using Stata, second ed. Chapman & Hall, Boca Raton. Rabe-Hesketh, S., Skrondal, A., 2012a. Multilevel and Longitudinal Modeling Using Stata: Continuous Responses, third ed. vol. I. Stata Press, College Station. Rabe-Hesketh, S., Skrondal, A., 2012b. Multilevel and Longitudinal Modeling Using Stata: Categorical Responses, Counts, and Survival, third ed. vol. II. Stata Press, College Station. Rabe-Hesketh, S., Skrondal, A., Pickles, A., 2005. Maximum likelihood estimation of limited and discrete dependent variable models with nested random effects. J. Econometrics 128 (2), 301–323. Rabe-Hesketh, S., Skrondal, A., Pickles, A., 2002. Reliable estimation of generalized linear mixed models using adaptive quadrature. Stata J. 2 (1), 1–21. Ragsdale, C.T., 2009. Modelagem e análise de decisão. Cengage Learning, São Paulo. Rajan, R.G., Zingales, L., 1998. Financial dependence and growth. Am. Econ. Rev. 88 (3), 559–586. Ramalho, J.J.S., 1996. Modelos de regressão para dados de contagem. Masters Dissertation, Instituto Superior de Economia e Gestão, Universidade Técnica de Lisboa, Lisboa. 110 f. Rardin, R.L., 1998. Optimization in Operations Research. Prentice Hall, New Jersey. Rasch, G., 1960. Probabilistic Models for Some Intelligence and Attainment Tests. 
Paedagogike Institut, Copenhagen. Raudenbush, S., Bryk, A., 2002. Hierarchical Linear Models: Applications and Data Analysis Methods, second ed. Sage Publications, Thousand Oaks. Raudenbush, S., Bryk, A., Cheong, Y.F., Congdon, R., du Toit, M., 2004. HLM 6: hierarchical linear and nonlinear modeling. Scientific Software International, Inc, Lincolnwood. Raykov, T., Marcoulides, G.A., 2008. An Introduction to Applied Multivariate Analysis. Routledge, New York. Reis, E., 2001. Estatística multivariada aplicada, second ed. Edições Sílabo, Lisboa. Rencher, A.C., 1992. Interpretation of canonical discriminant functions, canonical variates and principal components. Am. Stat. 46 (3), 217–225. Rencher, A.C., 2002. Methods of Multivariate Analysis, second ed. John Wiley & Sons, New York. Rencher, A.C., 1988. On the use of correlations to interpret canonical functions. Biometrika 75 (2), 363–365. Rigau, J.G., 1990. Traducción del término 'odds ratio'. Gaceta Sanitaria (16), 35. Roberto, A.N., 2002. Modelos de rede de fluxo para alocação da água entre múltiplos usos em uma bacia hidrográfica. Escola Politécnica, Universidade de São Paulo, São Paulo. Dissertação (Mestrado em Engenharia Hidráulica e Sanitária). 105 p. Rodrigues, M.C.P., 2002. Potencial de desenvolvimento dos municípios fluminenses: uma metodologia alternativa ao IQM, com base na análise fatorial exploratória e na análise de clusters. Caderno de Pesquisas em Administração 9 (1), 75–89. Rodrigues, P.C., Lima, A.T., 2009. Analysis of an European Union election using principal component analysis. Stat. Papers 50 (4), 895–904. Rogers, D.J., Tanimoto, T.T., 1960. A computer program for classifying plants. Science 132 (3434), 1115–1118. Rogers, W., 2000. Errors in hedonic modeling regressions: compound indicator variables and omitted variables. Appraisal J. 208–213. Rogers, W.M., Schmitt, N., Mullins, M.E., 2002. 
Correction for unreliability of multifactor measures: comparison of alpha and parallel forms approaches. Organ. Res. Methods 5 (2), 184–199. Ross, G.J.S., Preece, D.A., 1985. The negative binomial distribution. Statistician 34 (3), 323–335. Roubens, M., 1982. Fuzzy clustering algorithms and their cluster validity. Eur. J. Oper. Res. 10 (3), 294–301. Rousseeuw, P.J., Leroy, A.M., 1987. Robust Regression and Outlier Detection. John Wiley & Sons, New York. Royston, P., 2006. Explained variation for survival models. Stata J. 6 (1), 83–96. Royston, P., Lambert, P.C., 2011. Flexible Parametric Survival Analysis Using Stata: Beyond the Cox Model. Stata Press, College Station. Royston, P., Parmar, M.K.B., 2002. Flexible parametric proportional-hazards and proportional-odds models for censored survival data, with application to prognostic modelling and estimation of treatment effects. Stat. Med. 21 (15), 2175–2197. Rummel, R.J., 1970. Applied Factor Analysis. Northwestern University Press, Evanston. Russell, P.F., Rao, T.R., 1940. On habitat and association of species of Anopheline larvae in South-eastern Madras. J. Malaria Instit. India 3 (1), 153–178. Rutemiller, H.C., Bowers, D.A., 1968. Estimation in a heteroscedastic regression model. J. Am. Stat. Assoc. 63, 552–557. Saaty, T.L., 2000. Fundamentals of Decision Making and Priority Theory with the Analytic Hierarchy Process. RWS Publications, Pittsburgh. Santos, M.A., Fávero, L.P., Distadio, L.F., 2016. Adoption of the International Financial Reporting Standards (IFRS) on companies' financing structure in emerging economies. Finance Res. Lett. 16 (1), 179–189. Santos, M.S., 2005. Cervejas e refrigerantes. In: Mateus Sales dos Santos e Flávio de Miranda Ribeiro. CETESB, São Paulo. Disponível em http://www.cetesb.sp.gov.br. [(Accessed 11 February 2017)]. Saporta, G., 1990. Probabilités, analyse des données et statistique. Technip, Paris. Saraiva Jr., A.F., Tabosa, C.M., Costa, R.P., 2011. 
Simulação de Monte Carlo aplicada à análise econômica de pedido. Produção 21 (1), 149–164. Sarkadi, K., 1975. The consistency of the Shapiro-Francia test. Biometrika 62 (2), 445–450. Sartoris Neto, A., 2013. Estatística e introdução à econometria, second ed. Saraiva, São Paulo. Schaffer, M.E., Stillman, S., XTOVERID: Stata module to calculate tests of overidentifying restrictions after xtreg, xtivreg, xtivreg2, xthtaylor. http://ideas.repec.org/c/boc/bocode/s456779.html. [(Accessed 21 February 2014)].
Scheffé, H., 1953. A method for judging all contrasts in the analysis of variance. Biometrika 40 (1/2), 87–104. Schmidt, C.M.C., 2003. Modelo de regressão de Poisson aplicado à área da saúde. Masters Dissertation, Universidade Regional do Noroeste do Estado do Rio Grande do Sul, Ijuí. 98 f. Schoenfeld, D., 1982. Partial residuals for the proportional hazards regression model. Biometrika 69 (1), 239–241. Schriber, T.J., 1974. Simulation Using GPSS. Ed. Ft. Belvoir Defense Technical Information, Wiley, New York. Schwartz Filho, A.J., 2006. Localização de indústrias de reciclagem na cadeia logística reversa do coco verde. Dissertação (Mestrado em Engenharia Civil – Transportes), Universidade Federal do Espírito Santo. 127 f. Scott, A.J., Symons, M.J., 1971. Clustering methods based on likelihood ratio criteria. Biometrics 27 (2), 387–397. Searle, S.R., Casella, G., McCulloch, C.E., 2006. Variance Components. John Wiley & Sons, New York. Sergio, V.F.N., 2012. Utilização das distribuições inflacionadas de zeros no monitoramento da qualidade do leite. Monografia (Bacharelado em Estatística), Departamento de Estatística, Universidade Federal de Juiz de Fora, Juiz de Fora. 43 f. Shafto, M.G., Degani, A., Kirlik, A., 1997. Canonical correlation analysis of data on human-automation interaction. In: 41st HFES – Annual Meeting of the Human Factors and Ergonomics Society. Anais do Congresso, Albuquerque. Shapiro, S.S., Francia, R.S., 1972. An approximate analysis of variance test for normality. J. Am. Stat. Assoc. 67, 215–216. Shapiro, S.S., Wilk, M.B., 1965. An analysis of variance test for normality (complete samples). Biometrika 52, 591–611. Sharma, S., 1996. Applied Multivariate Techniques. John Wiley & Sons, Hoboken. Sharpe, N.R., de Veaux, R.D., Velleman, P.F., 2015. Business Statistics, third ed. Pearson Education. Shazmeen, S.F., Baig, M.M.A., Pawar, M.R., 2013. Regression analysis and statistical approach on socio-economic data. Int. J. Adv. Comput. 
Res. 3 (3), 347. Sheu, C.F., 2000. Regression analysis of correlated binary outcomes. Behav. Res. Methods Instrum. Comput. 32 (2), 269–273. Sharpe, W.F., 1964. Capital asset prices: a theory of market equilibrium under conditions of risk. J. Finance 19 (3), 425–442. Shi, J., Malik, J., 2000. Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 22 (8), 888–905. Short, J.C., Ketchen, D.J., Bennett, N., du Toit, M., 2006. An examination of firm, industry, and time effects on performance using random coefficients modeling. Organ. Res. Methods 9 (3), 259–284. Short, J.C., Ketchen, D.J., Palmer, T.B., Hult, G.T.M., 2007. Firm, strategic group, and industry influences on performance. Strateg. Manag. J. 28 (2), 147–167. Siegel, S., Castellan Jr., N.J., 2006. Estatística não-paramétrica para ciências do comportamento, second ed. Bookman, Porto Alegre. Silva Filho, O.S., Cezarino, W., Ratto, J., 2009. Planejamento agregado da produção: modelagem e solução via planilha Excel & Solver. Revista Produção On Line 9 (3), 572–599. Silva Neto, A.J., Becceneri, J.C., 2009. Técnicas de inteligência computacional inspiradas na natureza: aplicação em problemas inversos em transferência radiativa. SBMAC, São Carlos. Simonson, D.G., Stowe, J.D., Watson, C.J., 1983. A canonical correlation analysis of commercial bank asset/liability structures. J. Financ. Quant. Anal. 18 (1), 125–140. Singer, J.M., Andrade, D.F., 1997. Regression models for the analysis of pretest/posttest data. Biometrics 53 (2), 729–735. Skrondal, A., Rabe-Hesketh, S., 2007. Latent variable modelling: a survey. Scand. J. Stat. 34 (4), 712–745. Skrondal, A., Rabe-Hesketh, S., 2003. Multilevel logistic regression for polytomous data and rankings. Psychometrika 68 (2), 267–287. Skrondal, A., Rabe-Hesketh, S., 2009. Prediction in multilevel generalized linear models. J. Roy. Stat. Soc. Ser. A 172 (3), 659–687. Smirnov, N., 1948. 
Table for estimating the goodness of fit of empirical distributions. Ann. Math. Stat. 19 (2), 279–281. Sneath, P.H.A., Sokal, R.R., 1962. Numerical taxonomy. Nature 193, 855–860. Snijders, T.A.B., Bosker, R.J., 2011. Multilevel Analysis: An Introduction to Basic and Advanced Multilevel Modeling, second ed. Sage Publications, London. Snook, S.C., Gorsuch, R.L., 1989. Component analysis versus common factor analysis: a Monte Carlo study. Psychol. Bull. 106 (1), 148–154. SOBRAPO – Sociedade Brasileira de Pesquisa Operacional, 2017. Disponível em http://www.sobrapo.org.br. [(Accessed 15 April 2017)]. Sokal, R.R., Michener, C.D., 1958. A statistical method for evaluating systematic relationships. Univ. Kansas Sci. Bull. 38 (22), 1409–1438. Sokal, R.R., Rohlf, F.J., 1962. The comparison of dendrograms by objective methods. Taxon 11 (2), 33–40. Sokal, R.R., Sneath, P.H.A., 1963. Principles of Numerical Taxonomy. W.H. Freeman and Company, San Francisco. Sørensen, T.J., 1948. A method of establishing groups of equal amplitude in plant sociology based on similarity of species content, and its application to analyses of the vegetation on Danish commons. Roy. Danish Acad. Sci. Lett. Biol. Ser. (5), 1–34. Soto, J.L.G., Morera, M.C., 2005. Modelos jerárquicos lineales. Editorial La Muralla, Madrid. Spearman, C.E., 1904. "General intelligence," objectively determined and measured. Am. J. Psychol. 15 (2), 201–292. Spiegel, M.R., Schiller, J., Srinivasan, R.A., 2013. Probabilidade e estatística, third ed. Bookman, Porto Alegre. Stanton, J.M., 2001. Galton, Pearson, and the peas: a brief history of linear regression for statistics instructors. J. Stat. Educ. 9(3). http://www.amstat.org/publications/jse/v9n3/stanton.html. [(Accessed 14 March 2014)]. StataCorp, 2009. Getting Started with Stata for Windows: Version 11. Stata Press, College Station. StataCorp, 2011. Stata Statistical Software: Release 12. Stata Press, College Station. StataCorp, 2013. 
Stata Statistical Software: Release 13. Stata Press, College Station. StataCorp, 2015. Stata Statistical Software: Release 14. Stata Press, College Station. Steenbergen, M.R., Jones, B.S., 2002. Modeling multilevel data structures. Am. J. Polit. Sci. 46 (1), 218–237. Stein, C.E., Loesch, C., 2011. Estatística descritiva e teoria das probabilidades, second ed. Edifurb, Blumenau.
Stein, C.M., 1981. Estimation of the mean of a multivariate normal distribution. Ann. Stat. 9 (6), 1135–1151. Stemmler, M., 2014. Person-Centered Methods: Configural Frequency Analysis (CFA) and Other Methods for the Analysis of Contingency Tables. Springer, Erlangen. Stephan, F.F., 1941. Stratification in representative sampling. J. Market. 6 (1), 38–46. Stevens, J.P., 2009. Applied Multivariate Statistics for the Social Sciences, fifth ed. Routledge, New York. Stevens, S.S., 1946. On the theory of scales of measurement. Science 103 (2684), 677–680. Stewart, D.K., Love, W.A., 1968. A general canonical correlation index. Psychol. Bull. 70 (3), 160–163. Stewart, D.W., 1981. The application and misapplication of factor analysis in marketing research. J. Market. Res. 18 (1), 51–62. Stock, J.H., Watson, M.W., 2004. Econometria. Pearson Education, São Paulo. Stock, J.H., Watson, M.W., 2008. Heteroskedasticity-robust standard errors for fixed effects panel data regression. Econometrica 76 (1), 155–174. Stock, J.H., Watson, M.W., 2006. Introduction to Econometrics, third ed. Pearson, Essex. Stowe, J.D., Watson, C.J., Robertson, T.D., 1980. Relationships between the two sides of the balance sheet: a canonical correlation analysis. J. Finance 35 (4), 973–980. Streiner, D.L., 2003. Being inconsistent about consistency: when coefficient alpha does and doesn't matter. J. Personal. Assess. 80 (3), 217–222. Stukel, T.A., 1988. Generalized logistic models. J. Am. Stat. Assoc. 83 (402), 426–431. Sudman, S., 1985. Efficient screening methods for the sampling of geographically clustered special populations. J. Market. Res. 22 (1), 20–29. Sudman, S., Sirken, M.G., Cowan, C.D., 1988. Sampling rare and elusive populations. Science 240 (4855), 991–996. Swets, J.A., 1996. Signal Detection Theory and ROC Analysis in Psychology and Diagnostics: Collected Papers. Lawrence Erlbaum Associates, Mahwah. Tabachnick, B.G., Fidell, L.S., 2001. Using Multivariate Statistics. 
Allyn and Bacon, New York. Tacq, J., 1996. Multivariate Analysis Techniques in Social Science Research. Sage Publications, Thousand Oaks. Tadano, Y.S., Ugaya, C.M.L., Franco, A.T., 2009. Método de regressão de Poisson: metodologia para avaliação do impacto da poluição atmosférica na saúde populacional. Ambiente & Sociedade XII (2), 241–255. Taha, H.A., 2010. Operations Research: An Introduction, ninth ed. Prentice Hall, Upper Saddle River. Taha, H.A., 2016. Operations Research: An Introduction, tenth ed. Pearson Higher Ed, USA. Takane, Y., Young, F.W., de Leeuw, J., 1977. Nonmetric individual differences multidimensional scaling: an alternating least squares method with optimal scaling features. Psychometrika 42 (1), 7–67. Tang, W., He, H., Tu, X.M., 2012. Applied Categorical and Count Data Analysis. Chapman & Hall/CRC Press, Boca Raton. Tapia, J.A., Nieto, F.J., 1993. Razón de posibilidades: una propuesta de traducción de la expresión odds ratio. Salud Pública de México 35, 419–424. Tate, W.F., 2012. Research on Schools, Neighborhoods, and Communities. Rowman & Littlefield Publishers Inc., Plymouth. Teerapabolarn, K., 2008. Poisson approximation to the beta-negative binomial distribution. Int. J. Contemp. Math. Sci. 3 (10), 457–461. Tenenhaus, M., Young, F., 1985. An analysis and synthesis of multiple correspondence analysis, optimal scaling, dual scaling, homogeneity analysis, and other methods for quantifying categorical multivariate data. Psychometrika 50 (1), 91–119. Thomas, W., Cook, R.D., 1990. Assessing influence on predictions from generalized linear models. Technometrics 32 (1), 59–65. Thompson, B., 1984. Canonical Correlation Analysis: Uses and Interpretation. Sage Publications, Thousand Oaks. Thurstone, L.L., 1969. Multiple Factor Analysis: A Development and Expansion of "The Vectors of the Mind". University of Chicago Press, Chicago. Thurstone, L.L., 1959. The Measurement of Values. University of Chicago Press, Chicago. 
Thurstone, L.L., 1935. The Vectors of the Mind. University of Chicago Press, Chicago. Thurstone, L.L., Thurstone, T.G., 1941. Factorial Studies of Intelligence. University of Chicago Press, Chicago. Timm, N.H., 2002. Applied Multivariate Analysis. Springer-Verlag, New York. Tobin, J., 1969. A general equilibrium approach to monetary theory. J. Money Credit Bank. 1 (1), 15–29. Traissac, P., Martin-Prevel, Y., 2012. Alternatives to principal components analysis to derive asset-based indices to measure socio-economic position in low- and middle-income countries: the case for multiple correspondence analysis. Int. J. Epidemiol. 41 (4), 1207–1208. Triola, M.F., 2013. Introdução à estatística: atualização da tecnologia, eleventh ed. LTC Editora, Rio de Janeiro. Troldahl, V.C., Carter Jr., R.E., 1964. Random selection of respondents within households in phone surveys. J. Market. Res. 1 (2), 71–76. Tryon, R.C., 1939. Cluster Analysis. McGraw-Hill, New York. Tsiatis, A.A., 1980. A note on a goodness-of-fit test for the logistic regression model. Biometrika 67, 250–251. Turkman, M.A.A., Silva, G.L., 2000. Modelos lineares generalizados: da teoria à prática. Edições SPE, Lisboa. UCLA, 2015. Statistical Consulting Group of the Institute for Digital Research and Education. http://www.ats.ucla.edu/stat/stata/faq/casummary.htm. [(Accessed 5 February 2015)]. UCLA, 2013a. Statistical Consulting Group of the Institute for Digital Research and Education. http://www.ats.ucla.edu/stat/stata/output/stata_mlogit_output.htm. [(Accessed 22 September 2013)]. UCLA, 2013b. Statistical Consulting Group of the Institute for Digital Research and Education. http://www.ats.ucla.edu/STAT/stata/seminars/stata_survival/default.htm. [(Accessed 13 November 2013)]. UCLA, 2013c. Statistical Consulting Group of the Institute for Digital Research and Education. http://www.ats.ucla.edu/stat/stata/webbooks/reg/chapter2/statareg2.htm. [(Accessed 2 September 2013)]. UCLA, 2013d. 
Statistical Consulting Group of the Institute for Digital Research and Education. http://www.ats.ucla.edu/stat/stata/dae/canonical.htm. [(Accessed 15 December 2013)]. Valentin, J.L., 2012. Ecologia numérica: uma introdução à análise multivariada de dados ecológicos, second ed. Interciência, Rio de Janeiro.
Van Auken, H.E., Doran, B.M., Yoon, K.J., 1993. A financial comparison between Korean and US firms: a cross-balance sheet canonical correlation analysis. J. Small Bus. Manag. 31 (3), 73–83. Vance, P.S., Fávero, L.P., Luppe, M.R., 2008. Franquia empresarial: um estudo das características do relacionamento entre franqueadores e franqueados no Brasil. Revista de Administração (RAUSP) 43 (1), 59–71. Vanneman, R., 1977. The occupational composition of American classes: results from cluster analysis. Am. J. Sociol. 82 (4), 783–807. Vasconcellos, M.A.S., Alves, D., 2000. Manual de econometria. Atlas, São Paulo. Velicer, W.F., Jackson, D.N., 1990. Component analysis versus common factor analysis: some issues in selecting an appropriate procedure. Multivar. Behav. Res. 25 (1), 1–28. Velleman, P.F., Wilkinson, L., 1993. Nominal, ordinal, interval, and ratio typologies are misleading. Am. Stat. 47 (1), 65–72. Verbeek, M., 2012. A Guide to Modern Econometrics, fourth ed. John Wiley & Sons, West Sussex. Verbeke, G., Molenberghs, G., 2000. Linear Mixed Models for Longitudinal Data. Springer-Verlag, New York. Vermunt, J.K., Anderson, C.J., 2005. Joint correspondence analysis (JCA) by maximum likelihood. Methodol. Eur. J. Res. Methods Behav. Soc. Sci. 1 (1), 18–26. Vicini, L., Souza, A.M., 2005. Análise multivariada da teoria à prática. Monografia (Especialização em Estatística e Modelagem Quantitativa), Centro de Ciências Naturais e Exatas, Universidade Federal de Santa Maria, Santa Maria. 215 f. Vieira, S., 2012. Estatística básica. Cengage Learning, São Paulo. Vittinghoff, E., Glidden, D.V., Shiboski, S.C., McCulloch, C.E., 2012. Regression Methods in Biostatistics: Linear, Logistic, Survival, and Repeated Measures Models, second ed. Springer-Verlag, New York. Vuong, Q.H., 1989. Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica 57 (2), 307–333. Ward Jr., J.H., 1963. Hierarchical grouping to optimize an objective function. J. Am. 
Stat. Assoc. 58 (301), 236–244. Wathier, J.L., Dell'Aglio, D.D., Bandeira, D.R., 2008. Análise fatorial do inventário de depressão infantil (CDI) em amostra de jovens brasileiros. Avaliação Psicológica 7 (1), 75–84. Watson, I., 2005. Further processing of estimation results: basic programming with matrices. Stata J. 5 (1), 83–91. Weber, S., 2010. Bacon: an effective way to detect outliers in multivariate data using Stata (and Mata). Stata J. 10 (3), 331–338. Wedderburn, R.W.M., 1974. Quasi-likelihood functions, generalized linear models, and the Gauss-Newton method. Biometrika 61 (3), 439–447. Weisberg, S., 1985. Applied Linear Regression. John Wiley & Sons, New York. Weller, S.C., Romney, A.K., 1990. Metric Scaling: Correspondence Analysis. Sage, London. Wen, C.H., Yeh, W.Y., 2010. Positioning of international air passenger carriers using multidimensional scaling and correspondence analysis. Transport. J. 49 (1), 7–23. Wermuth, N., Rüssmann, H., 1993. Eigenanalysis of symmetrizable matrix products: a result with statistical applications. Scand. J. Stat. 20, 361–367. West, B.T., Welch, K.B., Gałecki, A.T., 2015. Linear Mixed Models: A Practical Guide Using Statistical Software, second ed. Chapman & Hall/CRC Press, Boca Raton. White, H., 1980. A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica 48 (4), 817–838. White, H., 1982. Maximum likelihood estimation of misspecified models. Econometrica 50 (1), 1–25. Whitlark, D.B., Smith, S.M., 2001. Using correspondence analysis to map relationships. Market. Res. 13 (3), 22–27. Wilcoxon, F., 1945. Individual comparisons by ranking methods. Biometr. Bull. 1 (6), 80–83. Wilcoxon, F., 1947. Probability tables for individual comparisons by ranking methods. Biometrics 3 (3), 119–122. Williams, R., 2006. Generalized ordered logit/partial proportional odds models for ordinal dependent variables. Stata J. 6 (1), 58–82. Winkelmann, R., Zimmermann, K.F., 1991. 
A new approach for modeling economic count data. Econ. Lett. 37 (2), 139–143. Winston, W.L., 2004. Operations Research: Applications and Algorithms, fourth ed. Brooks/Cole – Thomson Learning, Belmont. Witten, I.H., Frank, E., Hall, M.A., Pal, C.J., 2016. Data Mining: Practical Machine Learning Tools and Techniques, fourth ed. Elsevier, Boston. Wolfe, J.H., 1978. Comparative cluster analysis of patterns of vocational interest. Multivar. Behav. Res. 13 (1), 33–44. Wolfe, J.H., 1970. Pattern clustering by multivariate mixture analysis. Multivar. Behav. Res. 5 (3), 329–350. Wong, M.A., Lane, T., 1983. A kth nearest neighbour clustering procedure. J. Roy. Stat. Soc. Ser. B 45 (3), 362–368. Wonnacott, T.H., Wonnacott, R.J., 1990. Introductory Statistics for Business and Economics, fourth ed. John Wiley & Sons, New York. Wooldridge, J.M., 2010. Econometric Analysis of Cross Section and Panel Data, second ed. MIT Press, Cambridge. Wooldridge, J.M., 2012. Introductory Econometrics: A Modern Approach, fifth ed. Cengage Learning, Mason. Wooldridge, J.M., 2005. Simple solutions to the initial conditions problem in dynamic, nonlinear panel data models with unobserved heterogeneity. J. Appl. Econ. 20 (1), 39–54. Wu, Z., et al., 2008. Optimization designs of the combined Shewhart CUSUM control charts. Comput. Stat. Data Anal. 53 (2), 496–506. Wulff, J.N., 2015. Interpreting results from the multinomial logit: demonstrated by foreign market entry. Organ. Res. Methods 18 (2), 300–325. Xie, F.C., Wei, B.C., Lin, J.G., 2008. Assessing influence for pharmaceutical data in zero-inflated generalized Poisson mixed models. Stat. Med. 27 (18), 3656–3673. Xie, M., He, B., Goh, T.N., 2001. Zero-inflated Poisson model in statistical process control. Comput. Stat. Data Anal. 38 (2), 191–201. Xue, D., Deddens, J., 1992. Overdispersed negative binomial regression models. Commun. Stat. Theory Methods 21 (8), 2215–2226. Yanai, H., Takane, Y., 2002. 
Generalized constrained canonical correlation analysis. Multivar. Behav. Res. 37 (2), 163–195. Yau, K., Wang, K., Lee, A., 2003. Zero-inflated negative binomial mixed regression modeling of over-dispersed count data with extra zeros. Biometr. J. 45 (4), 437–452.
Yavas, U., Shemwell, D.J., 1996. Bank image: exposition and illustration of correspondence analysis. Int. J. Bank Market. 14 (1), 15–21. Ye, N. (Ed.), 2004. The Handbook of Data Mining. Lawrence Erlbaum Associates, Mahwah. Young, F., 1981. Quantitative analysis of qualitative data. Psychometrika 46 (4), 357–388. Young, G., Householder, A.S., 1938. Discussion of a set of points in terms of their mutual distances. Psychometrika 3 (1), 19–22. Yule, G.U., 1900. On the association of attributes in statistics: with illustrations from the material of the Childhood Society, etc. Philos. Trans. Roy. Soc. London 194, 257–319. Yule, G.U., Kendall, M.G., 1950. An Introduction to the Theory of Statistics, fourteenth ed. Charles Griffin, London. Zeger, S.L., Liang, K.Y., Albert, P.S., 1988. Models for longitudinal data: a generalized estimating equation approach. Biometrics 44 (4), 1049–1060. Zhang, H., Liu, Y., Li, B., 2014. Notes on discrete compound Poisson model with applications to risk theory. Insurance Math. Econ. 59, 325–336. Zheng, X., Rabe-Hesketh, S., 2007. Estimating parameters of dichotomous and ordinal item response models using gllamm. Stata J. 7 (3), 313–333. Zhou, W., Jing, B.Y., 2006. Tail probability approximations for Student's t-statistics. Probab. Theory Relat. Fields 136 (4), 541–559. Zippin, C., Armitage, P., 1966. Use of concomitant variables and incomplete survival information in the estimation of an exponential survival parameter. Biometrics 22 (4), 665–672. Zorn, C.J.W., 2001. Generalized estimating equation models for correlated data: a review with applications. Am. J. Polit. Sci. 45 (2), 470–490. Zubin, J., 1938a. A technique for measuring like-mindedness. J. Abnormal Soc. Psychol. 33 (4), 508–516. Zubin, J., 1938b. Socio-biological types and methods for their isolation. Psychiatry J. Study Interpersonal Process. 2, 237–247. Zuccolotto, P., 2007. Principal components of sample estimates: an approach through symbolic data. Stat. Methods Appl. 
16 (2), 173–192. Zwilling, M.L., 2013. Negative binomial regression. Math. J. 15, 1–18.
Index
Note: Page numbers followed by f indicate figures, t indicate tables, and b indicate boxes.
A Absolute average deviation. See Average deviation Absolute frequency, 22, 32–33, 33f Absolute nesting data structure, 990 Adaptive quadrature process, 1005 Advanced Integrated Multidimensional Modeling Software (AIMMS), 774 Agglomeration schedules, 311–312, 324, 326f hierarchicals, 325 (see also Hierarchical agglomeration schedules) k-means procedure, 325 linkage methods, 325 nonhierarchicals, 325 (see also Nonhierarchical k-means agglomeration schedule) Aggregated planning problem, 734–736b, 734t binary programming (BP), 733 decision variables, 733 general formulation, 733 integer programming (IP) model, 734–736 mixed-integer programming (MIP) problem, 734–736 model parameters, 733 nonlinear programming (NLP), 733 resources, 732 Akaike information criterion (AIC), 695 Allocation/attribution problem. See Job assignment problem A Mathematical Programming Language (AMPL), 774 Analysis of variance (ANOVA), 525, 527f assumptions, 232 linear regression models, 457, 458f multiple interactions, 246 one-way ANOVA, 234–236b, 234f, 235–236t calculations, 233, 233t null hypothesis, 232 observations, 232, 232t residual sum of squares, 233 SPSS Software, 236–237, 237–238f Stata Software, 237–238, 238f two-way ANOVA, 241–242b, 242t calculations, 241, 241t factors, 240 observations, 239, 239t residual sum of squares (RSS), 240 SPSS Software, 242–244, 243–246f Stata Software, 244–245, 246f sum of squares of factor, 240 sum of total squares, 240
Anti-Dice similarity coefficient, 323 Arbitrary weighting procedure, 314 Arithmetic mean continuous data, 41–42, 41t, 41–42b grouped discrete data, 40–41, 40–41b, 40–41t ungrouped discrete and continuous data simple arithmetic mean, 38–39, 38t, 38–39b weighted arithmetic mean, 39–40, 39–40t, 39–40b Autocorrelation Breusch-Godfrey test, 493–494 causes, 492, 492f consequences, 493 data time evolution, 491 Durbin-Watson test, 493, 493f first-order autocorrelation, 492 generalized least squares method, 494 residuals problem, 492, 492f Average deviation continuous data, 56, 56b, 56t grouped discrete data, 54–55, 55t, 55b modulus/absolute deviation, 54 ungrouped discrete and continuous data, 54, 54t, 54b
B Bacon algorithm, 533 Balanced nested data structure, 988 Balanced transportation model, 839, 846, 846b Bar charts, 21, 26–27, 26t, 26–27b, 27f Bartlett's χ2 test, 210–212, 211–212b, 211t Bartlett's test of sphericity, 387, 389–390, 413, 413f, 423, 424f Basic solution (BS), 755 Basic variables (BV), 755, 755b Bayesian (Schwarz) information criterion (BIC), 695 Bayes' theorem, 132–133, 132–133b Bernoulli distribution, 142–144, 143–144b, 143f, 609, 691 Best linear unbiased predictions (BLUPS), 1005 Between-groups/average-linkage method, 328, 335–338, 337t, 338f Big Data, 983, 984f Binary logistic regression model, 539 confidence intervals, 556–557, 556–557t, 557b cutoff, 558, 559–560t dichotomic form, 540 event nonoccurrence, 541 event occurrence, 540–542, 541t
explanatory variables, 540 graph, 541–542, 541f logit, 540 maximum likelihood, 542–547, 542–544t, 545–547f overall model efficiency (OME), 560 pi values, 558, 558t probability model, 557–558 ROC curve, 561–562, 562f sensitivity analysis, 559–560, 561f specificity, 560 SPSS (see IBM SPSS Statistics Software) Stata (see Stata Software) statistical significance chi-square test, 548–550, 550f degrees of freedom, 550 Hosmer-Lemeshow test, 554 Insert Function, 551, 552f likelihood-ratio test, 552, 553f linear and logistic adjustments, 547–548, 548f logistic probability adjustment, 555, 555f McFadden pseudo R2, 548 null model, 548, 549f parameter estimation, 551, 556 Solver, 547–548, 549f, 553f Wald z test, 550–551 Binary programming (BP), 733 capital budgeting problem, 894t, 894b attributes, 895b decision variables, 895, 897f Excel spreadsheet, 894, 895b, 895f investment projects, 894 linear programming model, 894 Net Present Value (NPV), 894 optimal solution, 895, 897f Solver Parameters dialog box, 895, 896f Traveling Salesman Problem (TSP), 898–899t, 898–900b, 899f Excel Solver, 901f, 902, 902b, 903–904f formulations, 896–901 Hamiltonian problem, 896, 898f network programming, 896 Binomial distribution, 144–145, 144f, 145b, 1169–1176t Binomial test, 250–254, 251–253b, 252t, 253f SPSS Software, 253, 254–255f Stata Software, 253–254, 255f Bivariate descriptive statistics, 93, 94f perceptual maps, 93 qualitative variables, measures of association
Index
Bivariate descriptive statistics (Continued) chi-square statistic, 102–110, 102–110b, 102–103t, 104–106f, 107t, 109–110f joint frequency distribution tables (see Joint frequency distribution tables) Spearman’s coefficient, 110–113, 110f, 111–113b, 111t, 112–113f quantitative variable, measures of correlation covariance, 118, 118b Pearson’s correlation coefficient, 119–121, 119–121b, 119–121f scatter plot, 114–118, 114–118f, 115–116b Blending/mixing problem, 717–719, 717–719b, 718t Blocked Adaptive Computationally Efficient Outlier Nominators (BACON) algorithm, 52 Bowley’s coefficient of skewness, 63–64, 64b Box-Cox transformations IBM SPSS Statistics Software, 527 nonlinear regression models, 497–498, 497b ordinary least squares (OLS) method, 480–481 Stata, regression models estimation, 511, 512f results, 513, 515f Boxplots, 21, 37–38, 37f position/location measures, 53–54, 53f Breusch-Godfrey test, 493–494, 516, 517f Breusch-Pagan/Cook-Weisberg test, 489–490, 505, 506t, 506f Business Analytics, 983
C Canberra distance, 319 Capital budgeting problem, 721–724, 722–723t, 722–724b, 894t, 894b attributes, 895b decision variables, 895, 897f Excel spreadsheet, 894, 895b, 895f investment projects, 894 linear programming model, 894 Net Present Value (NPV), 894 optimal solution, 895, 897f Solver Parameters dialog box, 895, 896f C chart, 966–967, 966–968f Chebyshev distance, 319, 320t Chi-square statistic measures binary logistic regression model, 548–550, 550f contingency coefficient, 107, 108b, 109f Cramer’s V coefficient, 108, 108–110b, 109–110f definition, 102 distribution, 158–160, 159–160f, 160b, 196, 196f, 490, 1166–1167t example, 102–106b, 102–103t, 104–106f K independent samples, 295–299, 296f, 296b, 296t SPSS Software, 297, 297–299f Stata Software, 297–299, 299f one sample, 255–257, 255–256f, 256b, 256t SPSS Software, 256–257, 257–258f Stata Software, 257, 259f
Phi coefficient, 106, 106–108b, 107t, 109–110f two independent samples, 276–280, 277–278f, 277–278t, 277–278b SPSS Software, 279, 279–280f Stata Software, 279–280, 280f Cluster analysis agglomeration schedules, 311–312, 324, 326f hierarchicals, 325 (see also Hierarchical agglomeration schedules) k-means procedure, 325 linkage methods, 325 nonhierarchicals, 325 (see also Nonhierarchical k-means agglomeration schedule) arbitrary weighting procedure, 314 creation, 312, 313f definition, 311 discriminant analysis, 311 distance measures, 314, 319, 320–321t Canberra distance, 319 Chebyshev distance, 319, 320t data standardization procedure, 318 Euclidean squared distance, 318, 320t Manhattan distance, 319 metric variables, dataset, 315, 316t, 318, 318t Minkowski distance, 318 Pearson’s correlation coefficient, 315 Pythagorean distance formula, 316, 317f three-dimensional scatter plot, 315, 316–317f Z-scores procedure, 321 internal homogeneity, 313 Likert scale, 314 logic, 311, 312f multinomial logistic regression, 311 multivariate outliers, 379–382, 380–382f rearrangement, 314, 314f scatter plot, 312, 312f similarity measures, 314, 325t absolute frequencies, 322, 322t, 324, 324t anti-Dice similarity coefficient, 323 arbitrary weighting problems, 321 binary variable, 321 dataset, 321, 322t Dice similarity coefficient (DSC), 323 Euclidean distance, 322 Hamann similarity coefficient, 324 Jaccard index, 323 Ochiai similarity coefficient, 323 Rogers and Tanimoto similarity coefficient, 323 Russel and Rao similarity coefficient, 323 simple matching coefficient (SMC), 322 Sneath and Sokal similarity coefficient, 324 Yule similarity coefficient, 323 static procedures, 314 variability, 312, 313f Cochran’s C statistics, 212–213, 213b, 1191t K paired samples, 286–290, 287–288b, 287t, 288–290f Coefficient of determination, 453–456, 455t, 455–456f
Coefficient of kurtosis, 65, 66f on Stata, 66–68, 67–68b, 67t Coefficient of skewness on Stata, 64–65 Coefficient of variation, 61, 61b Combinations, 134, 134b Combinatorial analysis arrangement, 133–134, 133–134b combinations, 134, 134b definition, 133 permutations, 135, 135b Communalities, 394, 415, 415f Complementary events, 128, 128f, 130 Completely randomized design (CRD), 936–937, 937f Conditional probability, 131 multiplication rule, 131–132, 131–132b, 132t Confidence intervals binary logistic regression model, 556–557, 556–557t, 557b negative binomial regression model, 644, 644t Poisson regression model, 630–632, 631t, 632b population mean, 192 known population variance, 193–194, 193–194b, 193f unknown population variance, 194–195, 194f, 194–195b population variance, 196–197, 196f, 197b proportions, 195–196, 196b Confirmatory factor analysis, 383 Confirmatory techniques, 405 Contingency coefficient, 107, 108b, 109f Contingency tables creation chi-square statistic measures, 102–106b, 102–103t, 104–106f IBM SPSS Statistics Software Cell Display dialog box, 97, 100f cross tabulation, 97, 99–100f labels, 97, 98f variables selection, 96, 97f Stata Software, 101, 101f Continuous random variable, 139f cumulative distribution function (CDF), 140–141, 141b definition, 139 expected/average value, 140 probability density function, 139 probability distributions chi-square distribution, 159–160, 159–160f, 160b exponential distribution, 156–157, 156f, 157b gamma distribution, 157–158, 158f normal distribution (see Gaussian distribution) Snedecor’s F distribution, 162–164, 163b, 163f, 164t Student’s t distribution, 160–162, 161f, 162b uniform distribution, 151–152, 151t, 152b, 152f variance, 140, 140b Continuous variables, 8, 16 Control chart, 1194t
Convenience sampling, 175, 175b Convex and nonconvex sets, 748, 748f Correlation coefficients, 383 Correlation matrix, 391 Covariance, 118, 118b CPLEX, 774 Cramer’s V coefficient, 108, 108–110b, 109–110f Cronbach’s alpha calculation, 435t, 436 internal consistency, 434 reliability level, 434 SPSS, 436, 436–437f Stata, 437–438, 438f variance, 434 Cross-industry standard process for data mining (CRISP-DM), 984–985, 986f Cumulative distribution function (CDF) continuous random variable, 140–141, 141b discrete random variable, 138–139, 139b Cumulative frequency, 22 Czuber’s method, 46, 46–47b, 46t
D Data, information, and knowledge logic, 3, 4f Data mining Big Data, 983, 984f Business Analytics, 983 complexity, 983 cross-industry standard process for data mining (CRISP-DM), 984–985, 986f HLM (see Hierarchical linear models (HLM)) IBM SPSS Modeler, 986, 986f knowledge discovery in databases (KDD), 984, 985f multilevel modeling, 987–988 nested data structures, 988–991, 989–990f, 989–990t predictive capacity, 983 standard recognition, 983 tasks, 985–986 tools and software packages, 986 variety and variability, 983 volume and velocity, 983 Deciles, 48 continuous data, 50–52, 51–52b, 51t grouped discrete data, 50, 50b ungrouped discrete and continuous data, 48–50, 49–50b Decision-making process, 3, 5 Descriptive statistics, 7 Design of experiments (DOE), 935 blocking, 936 completely randomized design (CRD), 936–937, 937f control, 936 data analysis, 936 factorial ANOVA, 938 factorial design (FD), 937 factors and levels, 935 one-way analysis of variance (one-way ANOVA), 938 problem definition, 935 randomization, 936
randomized block design (RBD), 937, 937f replication, 936 response variable, 935 results and conclusions, 936 type, 936 Dice similarity coefficient (DSC), 323 Dichotomous/binary variable (dummy), 16, 17t Diet problem, 720–721, 720–721b, 720t Excel Solver, 788–790, 789b, 789–791f Directed network, 836–837, 837f Direct Oblimin methods, 398 Discrete random variable cumulative distribution function (CDF), 138–139, 139b definition, 137 expected/average value, 137–138 probability distributions Bernoulli distribution, 142–144, 143–144b, 143f binomial distribution, 144–145, 144f, 145b discrete uniform distribution, 141–142, 141f, 142t, 142b geometric distribution, 145–147, 146–147b, 146f hypergeometric distribution, 148–149, 148f, 149b negative binomial distribution, 147–148, 147f, 148b Poisson distribution, 149–151, 150f, 150–151b variance, 138, 138b, 138t Discrete uniform distribution, 141–142, 141f, 142t, 142b Discrete variables, 8, 16 Dispersion/variability measures average deviation continuous data, 56, 56b, 56t grouped discrete data, 54–55, 55t, 55b modulus/absolute deviation, 54 ungrouped discrete and continuous data, 54, 54t, 54b coefficient of variation, 61, 61b range, 54 standard deviation, 59–60, 59–60b standard error, 60–61, 60–61b, 60t variance continuous data, 58–59, 58–59b, 59t definition, 57 grouped discrete data, 57–58, 58t, 58b ungrouped discrete and continuous data, 57, 57b Durbin-Watson test, 1164–1165t autocorrelation, 493, 493f IBM SPSS Statistics Software, 528, 528f result, 529, 529f Stata, regression models estimation, 515, 516f
E Eigenvalues, 391, 408, 408t, 424, 424f Eigenvectors, 391–392, 401–402, 424, 424f Empty set, 129 Erlang distribution, 158
Estimation interval estimation, 190, 190b (see also Confidence intervals) parameter, definition, 189 point estimation, 189, 189b maximum likelihood estimation, 192 method of moments, 190–191, 191t, 191b ordinary least squares (OLS), 191–192 population parameters, 189 Euclidean squared distance, 318, 320t Events, 127, 129b independent, 128, 130 mutually excluding/exclusive, 128, 128f, 130 Excel Solver, 775–779, 775–779f classic transportation problem, 856–860, 856f, 857b, 858–861f diet problem, 788–790, 789b, 789–791f facility location problem, 907–908, 907–909f, 907–908b farmer’s problem, 790–792, 791b, 791–793f job assignment problem, 870, 871–872f, 871b knapsack problem, 891–893, 891–893f, 892b Lifestyle Natural Juices Manufacturer, 798, 802b, 802–804f maximum flow problem, 879–881, 880–881f, 880b Naturelat Dairy, 784–786, 785b, 785–787f Oil-South Refinery, 787–788, 787b, 788–789f portfolio selection, 793–797, 793–797f, 794b, 796b production and inventory problem, Fenix&Furniture, 798, 799–801f, 800b sensitivity analysis, 818–822, 818–823f shortest path problem, 875, 876b, 876–877f transhipment problem (TSP), 866–868, 866–867b, 866–868f Venix Toys, 779–783, 780b, 780–784f, 784b Experimental unit, 935 Explanatory variables, 935 Exploratory factor analysis, 383 Exploratory multivariate technique, 383 Exponential distribution, 156–158, 156f, 157b Extrapolations, 449–450
F Facility location problem, 905–906b, 905f, 906t candidate locations, 902 Excel Solver, 907–908, 907–909f, 907–908b modeling, 902–906 network programming problem, 902 Factor extraction method, 411–412, 411f Factorial analysis of variance, 239–246 Factorial ANOVA, 938 Factorial design (FD), 937 Failure rate, 157 Farmer’s problem, 790–792, 791b, 791–793f Feasible basic solution (FBS), 755 First-order correlation coefficients, 387–389, 399, 399t Fisher’s coefficient kurtosis, 66 skewness, 64 Fisher’s distribution. See Snedecor’s F distribution
Fixed effects parameters, 988 Friedman’s test, 1189t K paired samples, 290–295, 291–292t, 291–293b, 293–295f F-test, 456, 474 Furthest-neighbor/complete-linkage method, 327, 332–335, 334t, 335–336f
G Gamma distribution, 157–158, 158f, 634, 635t, 635f Gaussian distribution binomial approximation, 155 cumulative distribution function, 154, 154f Poisson approximation, 155–156 probability density function, 152, 153f standard deviations, 153, 153f standard normal distribution, 153–154, 154–155f Z-scores, 153–154 Gauss-Jordan elimination method, 769, 769t, 772, 772t General Algebraic Modeling System (GAMS), 774 Generalized linear latent and mixed model (GLLAMM), 1005 Geometric distribution, 145–147, 146–147b, 146f Geometric propagation/snowball sampling, 177, 177b Graph, 835, 836f
H Hamann similarity coefficient, 324 Hamiltonian path, 837 Hampered analysis, 326, 327f Hartley’s Fmax test, 213–214, 213–214b, 1193t Hierarchical agglomeration schedules between-groups/average-linkage method, 328, 335–338, 337t, 338f dendrogram, 327 Euclidian distance, 328, 329f furthest-neighbor/complete-linkage method, 327, 332–335, 334t, 335–336f Hampered analysis, 326, 327f linkage methods, 325, 326t nearest-neighbor/single-linkage method, 327–332, 331t, 331f, 333f phenogram, 327 SPSS (see IBM SPSS Statistics Software) Stata (see Stata Software) Hierarchical crossclassified models (HCM), 990 Hierarchical linear models (HLM), 988 count data, 1056–1057, 1058–1063f, 1058t, 1059–1062 generalized linear latent and mixed models (GLLAMM), 1052, 1052f hierarchical logistic models ceteris paribus, 1055 chart of, 1056, 1057f dataset, 1053, 1054f, 1054t mixed effects logistic regression models, 1052–1053 odds ratios, 1055 outputs, 1053, 1054f, 1056, 1056f
SPSS (see IBM SPSS Statistics Software) Stata (see Stata Software) three-level hierarchical linear models, repeated measures (HLM3), 998 individual models, 996, 996f intercepts and slopes randomness, 996–997 intraclass correlation, 997 temporal evolution, 995 variance-covariance matrix, 995 two-level hierarchical linear models, clustered data (HLM2) first-level model, 991 individual models, 991, 992f intercepts, 992 intraclass correlation, 993 likelihood-ratio tests, 993, 995 logarithmic likelihood function, 994–995 maximum likelihood estimation (MLE), 994 multiple linear regression model, 991 multivariate normal distribution, 993 random intercepts model, 993 random slopes model, 993 reduced maximum likelihood, 994 restricted maximum likelihood (REML), 994 slopes, 992 statistical significance, 993 Higher-order correlation coefficients, 387–389 Histogram, 21, 32–34, 32–33b, 32–33t, 33–34f absolute frequency, 32–33, 33f analysis tools, 32–33 continuous data, 33, 34f definition, 32 discrete data, 33, 34f Monte Carlo method, 921–922, 922f, 924f Horizontal bar charts, 26, 27f Hosmer-Lemeshow test, 579, 579f binary logistic regression model, 554 Huber-White robust standard error estimation, 491 Stata, regression models estimation, 507 Hypergeometric distribution, 148–149, 148f, 149b Hypotheses tests bilateral test, 199 critical region (CR), 199, 199f nonparametric tests, 201 (see also Nonparametric tests) parametric tests, 201 (see also Parametric tests) P-value, 201 statistical hypothesis, 199 type I error, 200, 200t type II error, 200, 200t unilateral test, 199 left-tailed test, 199, 200f right-tailed test, 200, 200f
I IBM SPSS Statistics Software ANOVA, 525, 527f binary logistic regression model, 591–601, 592–597f, 600–601f
Box-Cox transformations, 527 C chart, 967, 967–968f confidence levels and intercept exclusion, 519, 519f Cp, Cpk, Cpm and Cpmk indexes, 975–976, 975–976f Cronbach’s alpha, 436, 436–437f Dependent box, 516–518, 518f Descriptives Option, 74–76, 77f Descriptives dialog box, 76, 77f Options dialog box, 76, 78f Options, summary measures, 76, 78f Durbin-Watson test, 528, 528f result, 529, 529f estimation, 516, 517f excluded variables, 519–523 Explore Option, 77–78, 79f boxplot, 79, 82f, 83 Descriptives Option, results, 78, 81f Explore dialog box, 78, 79f histogram, 79, 82f, 83 Outliers option results, 79, 81f Percentiles option, results, 78–79, 81f Plots dialog box, 78, 80f Statistics dialog box, 78, 80f stem-and-leaf chart, 79, 82f, 83 Frequencies Option, 73f Charts, 74, 76f Charts dialog box, 74, 75f Frequencies dialog box, 73, 73f frequency distribution table, 74, 76f qualitative and quantitative variables, 72 Statistics dialog box, 73, 74f Statistics, summary measures, 74, 75f heteroskedasticity problem, 524 hierarchical agglomeration schedules, 350, 351f allocation, 356, 357f clustering stage, 354, 354f dendrogram, 350, 352f, 354, 355f distance measure, 361, 362f Euclidian distance, 350–351, 353, 353f, 360, 361f linkage method, 353, 353f matrix, 350, 352f means, 358, 360–361f multidimensional scaling, 360–361, 362f nominal (qualitative) classification, 356, 358f Number of clusters, 356, 356–357f one-way analysis of variance, 358, 358–359f two-dimensional chart, 361, 363f variables selection, 350, 351f HLM2 (see Two-level hierarchical linear model, clustered data) HLM3 (see Three-level hierarchical linear model, repeated measures) Independent(s) box, 516–518, 518f lndist variable, 527, 528f multicollinearity diagnostic, 523, 524f multinomial logistic regression model, 602–606, 602–604f
negative binomial regression model, 676–685, 680–685f nonhierarchical k-means agglomeration schedule, 364–367, 364–368f nonparametric tests binomial test, 253, 254–255f chi-square test, 256–257, 257–258f, 279, 279–280f, 297, 297–299f Cochran’s Q test, 288, 289–290f Friedman’s test, 293, 294–295f Kruskal-Wallis test, 302, 302–303f Mann-Whitney U test, 284, 284–285f McNemar test, 264, 265–266f sign test, 260, 260–261f, 268–269, 269–270f Wilcoxon test, 274, 274–275f normality plots with tests, 523, 523f np chart, 960–963, 963b, 964f outputs, 519, 520f, 522f parameter and confidence intervals selection, 518–519, 518f parametric tests one-way ANOVA, 236–237, 237–238f Student’s t-test, 221, 222–223f, 225–226, 226–227f, 229–230, 230–231f two-way ANOVA, 242–244, 243–246f P chart (defective fraction), 959–960, 960–963f Poisson regression model, 664–675, 665–679f predicted values, 519, 521f principal components factor analysis algebraic solution, 410 communalities, 415, 415f dataset, 418, 419f Display factor score coefficient matrix, 412, 412f eigenvalues and variance, 413, 413f, 416, 418f factor analysis, 410, 410f factor extraction method, 411–412, 411f factor loadings, 414, 414f factor scores, 414, 414f initial options, 411, 411f KMO statistic and Bartlett’s test of sphericity, 413, 413f loading plot, 415–416, 415f, 417f Pearson’s correlation coefficients, 413, 413f, 418–419, 420f ranking, 420, 422f rotated factor loadings, 416, 417f rotated factor scores, 416, 418f rotation angle, 416, 418f rotation method, 412, 412f Save as variables option, 416, 416f sorting, 421, 422f variable creation, 420, 421f variables selection, 410, 410f Varimax orthogonal rotation method, 416, 416f R chart, 948–952f, 949 residuals behavior, 524, 525f residual sum of squares, 524, 526f RESUP variable, 525, 527f Shapiro-Wilk normality test result, 523, 523f
square of the residuals, 524, 526f stepwise procedure selection, 519, 521f U chart, 970, 971f univariate tests for normality normality test selection, 206, 207f procedure, 206, 207f tests results, 207–208, 208f variable selection, 206, 207f VIF and Tolerance statistics, 523, 524f Independent events, 128, 130 Integer programming (IP), 714–717, 731, 734–736 binary integer programming (BIP), 887 (see also Binary programming (BP)) characteristics, 887, 888b facility location problem, 905–906b, 905f, 906t candidate locations, 902 Excel Solver, 907–908, 907–909f, 907–908b modeling, 902–906 network programming problem, 902 heuristic procedure, 887 knapsack problem, 890–891b, 890t decision variables, 890 Excel Solver, 891–893, 891–893f, 892b mathematical formulation, 890 model parameters, 890 linear relaxation, 888, 889b, 889f metaheuristic procedure, 887 mixed binary programming (MBP), 887 mixed integer programming (MIP), 887 rounding, 888–890 staff scheduling problem, 908–912, 909–912b, 911f, 913–914f Interpolations, 449–450 Interquartile range/interquartile interval (IQR/ IQI), 37, 52 Intersection, 128, 128f Interval estimation, 190, 190b. See also Confidence intervals Intraclass correlation, 993, 997
J Jaccard index, 323 Job assignment problem, 868, 868f Excel Solver, 870, 871–872f, 871b mathematical formulation, 869–870, 869–870b, 869t Joint frequency distribution tables qualitative variables contingency/crossed classification/ correspondence table, 93 example, 94–101b, 94–95t, 96–101f marginal totals, 94–101 quantitative variables, 114 Judgmental/purposive sampling, 175, 175b
K Kaiser criterion, 393 Kaiser-Meyer-Olkin (KMO) statistic, 387, 389, 389t, 399, 413, 413f, 423, 424f Karhunen-Loève transformation, 384
Kernel density estimate, 503, 503f Stata, regression models estimation, 511–513, 514f King’s method, 46–47, 47b Knapsack problem, 890–891b, 890t decision variables, 890 Excel Solver, 891–893, 891–893f, 892b mathematical formulation, 890 model parameters, 890 Knowledge discovery in databases (KDD), 984, 985f Kolmogorov-Smirnov (K-S) test, 1177t univariate tests for normality, 201–203, 202–203t, 202–203b Kruskal-Wallis test, 1190t K independent samples, 299–304, 300–302b, 301t, 301–304f
L Lagrange multiplier (LM), 489–490 Latent root criterion, 393 Levene’s F-Test, 214–216, 214–216b, 215–216t SPSS Software procedure, 216, 217–218f results, 216, 218, 218f variables selection, 216, 217f Stata Software, 218–219, 219f Lifestyle Natural Juices Manufacturer, 798, 802–804f, 802b Lifetime, 157 Likelihood-ratio tests, 993, 995 Likert scale, 17, 314, 384 Linear Interactive and Discrete Optimizer (LINDO) Systems, 773–774 Linear mixed models (LMM), 988 Linear programming (LP) problems additivity assumption, 713 Advanced Integrated Multidimensional Modeling Software (AIMMS), 774 aggregated planning problem, 734–736b, 734t binary programming (BP), 733 decision variables, 733 general formulation, 733 integer programming (IP) model, 734–736 mixed-integer programming (MIP) problem, 734–736 model parameters, 733 nonlinear programming (NLP), 733 resources, 732 basic solution (BS), 755 basic variables (BV), 755, 755b blending/mixing problem, 717–719, 717–719b, 718t canonical form, 710, 712b capital budget problems, 721–724, 722–724b, 722–723t certainty, 713 continuous function, 709 convex and nonconvex sets, 748, 748f CPLEX, 774 degenerate optimal solution, 753–754, 754f, 754b diet problem, 720–721, 720t, 720–721b divisibility and non-negativity, 713
Linear programming (LP) problems (Continued) equality constraint, 711 Excel Solver, 775–779, 775–779f diet problem, 788–790, 789–791f, 789b farmer’s problem, 790–792, 791b, 791–793f Lifestyle Natural Juices Manufacturer, 798, 802–804f, 802b Naturelat Dairy, 784–786, 785b, 785–787f Oil-South Refinery, 787–788, 787b, 788–789f portfolio selection, 793–797, 793–797f, 794b, 796b production and inventory problem, Fenix&Furniture, 798, 799–801f, 800b Venix Toys, 779–783, 780b, 780–784f, 784b feasible basic solution (FBS), 755 feasible solutions, 709 free variable, 711 General Algebraic Modeling System (GAMS), 774 inequality constraint, 711 linear function and constraints, 709 Linear Interactive and Discrete Optimizer (LINDO) Systems, 773–774 A Mathematical Programming Language (AMPL), 774 MINOS, 774 multiple optimal solutions, 751–752, 751–752b, 752f nonbasic variables (NBV), 755, 755b no optimal solution, 753, 753b optimal solution, 709, 747 maximization problem, 748–750, 748–750b, 749f minimization problem, 750–751, 750–751f, 750–751b Optimization Subroutine Library (OSL), 774 portfolio selection problem, 726–728b, 726–728t financial investments, 724 investment portfolio risk minimization, 725–728 investment portfolio’s expected return, 724–725 Markowitz’s model, 724 production and inventory problem costs and capacity, 730t decision variables, 729 demand per product and period, 730t general formulation, 729 integer programming (IP) problem, 731 inventory balance equations, 731 maximum inventory capacity, 732 maximum production capacity, 731 model parameters, 729 non-negativity constraints, 729–732 optimal solution, 732t production mix problem, 713–717, 714–717b, 715–716t proportionality assumption, 712–713, 713f resource optimization problems, 713 sensitivity analysis, 747, 807–808b
Excel Solver, 818–822, 818–823f independent constraints terms, 807 objective function coefficients, 808–812, 808–809f, 810–811b reduced cost, 816–818, 817–818b shadow price, 812–816, 813–816b Simplex method (see Simplex method) slack variable, 711 software packages, 773 Solver error messages, unlimited and infeasible solutions no optimal solution, 800, 805f Solver Results dialog box, 798 unlimited objective function z, 798–799, 804–805f Solver results Answer Report, 802–806, 806f Excel spreadsheets, 800 Limits Report, 806–807, 806f standard form, 709–710, 711–712b standard maximization problem, 711 surplus variable, 711 unlimited objective function z, 752–753, 752–753b, 753f viable/feasible solution, 747 XPRESS, 774 Linear regression models analysis of variance (ANOVA), 457, 458f confidence levels dataset, 465, 466t dispersion of points, 463, 464f inclusion/exclusion criteria, 464–465, 465b null hypothesis rejection, 464–465 for parameters, 462–463, 463–464f predicted time vs. distance traveled, 465–466, 466–467f degrees of freedom, 457 dummy variables ceteris paribus condition, 474 confidence interval amplitudes, 479 criteria, 476, 476t dataset, 473, 473t driving style variable, 474, 476, 476t F-test, 474 GDP growth, 472 joint selection, 473, 475f, 477, 477f outputs, 473, 475f, 477–478, 478–479f qualitative explanatory variable, 472–473, 474t random weighting, 472 substitution of, 476, 477t t-test, 474 explanatory power, 453, 454f coefficient of determination, 453–456, 455t, 455–456f residual sum of squares (RSS), 451 sum of squares due to regression (SSR), 451 total sum of squares (TSS), 451 explanatory variables, 443 F significance level, 457, 458f F statistic, 457 F-test, 456 functional form, 480 metric/quantitative variable, 443
multiple models, 443–444 (see also Multiple linear regression models) null hypothesis, nonrejection, 460, 462, 462f OLS method (see Ordinary least squares (OLS) method) predicted value and parameters, 444 P-values, 459–460 quantitative dependent variable, 443 residual error, 444 simple linear regression model, 443–444, 444f SPSS (see IBM SPSS Statistics Software) standard error, 459, 460f statistical tests, 443 t statistic, 457–459 t-test, 457–459 coefficients and significance, 459, 461f significance levels, 460, 461f Linear specification, 988 Linear trend model random intercepts, 1020–1023, 1021–1023f random intercepts and slopes, 1023–1027, 1024–1025f, 1027–1028f Line chart, 920, 922, 922–923f Line graphs, 21, 30–31, 30t, 30–31b, 31f Logarithmic likelihood function, 994–995
M Mahalanobis distance, 379 Manhattan distance, 319 Mann-Whitney U test, 1185–1188t two independent samples, 281–286, 282–283t, 282–284b, 284–286f Markowitz’s model, 724 Maximum flow problem destination node, 876–877 Excel Solver, 879–881, 880–881f, 880b mathematical formulation, 878–879, 878–879b, 878f Maximum likelihood estimation (MLE), 192, 994, 1005 binary logistic regression model, 542–547, 542–544t, 545–547f multinomial logistic regression model, 564–570, 565–566t, 567–568f, 569–570t, 570f negative binomial regression model dataset, 636, 637t histogram, 637, 638f mean and variance, 637, 637t parameters estimation, 639, 640f results, 638, 640t Solver window, 638, 639f Poisson regression model, 621t dependent variable mean and variance, 621, 621t, 622f Excel Solver tool, 622, 624, 624f log-linear model, 625 non-negative and discrete values, 620 overdispersion, 621–622 parameters estimation, 625, 626f rate of incidence, 622–623, 623t, 623f results, 624, 625t Stata, regression models estimation, 500 McFadden pseudo R2, 548
McNemar test, 262–264, 263t, 263–264b, 265–266f Mean absolute deviation (MAD), 725 Mean arrivals rate, 157 Measurement, definition, 9 Median continuous data, 44–45, 44t, 44–45b grouped discrete data, 43–44, 43t, 43–44b ungrouped discrete and continuous data, 42–43, 42t, 42–43b Method of moments, 190–191, 191b, 191t Minimum cost method, 849, 850–851t, 850–852b Minimum path problem. See Shortest path problem Minimum spanning tree, 837, 838f Minkowski distance, 318 MINOS, 774 Mixed binary programming (MBP), 887 Mixed effects logistic regression models, 1052–1053 Mixed-integer programming (MIP) problem, 734–736, 887 Mode continuous data, 46–47, 46–47b, 46t grouped qualitative/discrete data, 45–46, 45–46b, 46t ungrouped data, 45, 45t, 45b Monte Carlo method application, 920 Excel Data Analysis, 920 histogram, 921–922, 922f, 924f line chart, 920, 922, 922–923f profit and loss forecast, 926–928, 928–931f random number generation and probability distributions, 920–921, 921f, 923f red wine consumption, 923–925, 924–928f frequency distribution, 920 histogram, 920 Manhattan project, 919 probability density functions (PDF), 920 risks and uncertainties, 920 Multilevel modeling, 987–988 Multilevel negative binomial regression model, 1059 Multilevel Poisson regression model, 1059 Multinomial logistic regression model, 311, 539 confidence intervals, 574–575, 574–575t event, 563 logits, 563 maximum likelihood, 564–570, 565–566t, 567–568f, 569–570t, 570f occurrence probabilities, 563 SPSS (see IBM SPSS Statistics Software) Stata (see Stata Software) statistical significance, 570–574, 571–572f Multiple linear regression models, 393, 443–444, 991 ceteris paribus concept, 467 dataset, 468, 468t explanatory variables, 470, 471f multicollinearity, 472 null hypothesis, nonrejection, 472
outputs, 470, 471f parameters calculation, 469–470, 469–470t residual sum of squares, 468 Stata (see Stata, regression models estimation) time equation, 470 Multivariate normal distribution, 394, 993 Mutually excluding/exclusive events, 128, 128f, 130
N Naturelat Dairy, 784–786, 785b, 785–787f Nearest-neighbor/single-linkage method, 327–332, 331f, 331t, 333f Negative binomial distribution, 147–148, 147f, 148b Negative binomial regression model confidence intervals, 644, 644t Gamma distribution, 634, 635f, 635t maximum likelihood dataset, 636, 637t histogram, 637, 638f mean and variance, 637, 637t parameters estimation, 639, 640f results, 638, 640t Solver window, 638, 639f mean, 636 negative binomial type 1 (NB1) regression model, 636 negative binomial type 2 (NB2) regression model, 636 occurrence probability, 634 overdispersion, 634 Poisson distribution, 634 probability distribution function, 634 quantitative variable, 633 SPSS (see IBM SPSS Statistics Software) Stata (see Stata Software) statistical significance, 641–643, 641–642f variance, 636 Nested data structures, 987–991, 989–990f, 989–990t Network programming classic transportation problem, 838f, 839–841b, 840t, 840f algorithm (see Transportation algorithm) balanced transportation problem, 839 decision variables, 838 Excel Solver, 856–860, 856f, 857b, 858–861f general formulation, 839 model parameters, 838 Simplex method, 839 supply chain, 838 total supply capacity and total demand, 841–845, 841f, 841–845b, 842t, 843f, 844t, 845f demand nodes, 835 directed and undirected arc, 836, 836f directed and undirected cycle, 837 directed and undirected path, 837 directed network, 836–837, 837f graph, 835, 836f Hamiltonian path, 837
job assignment problem, 868, 868f Excel Solver, 870, 871–872f, 871b mathematical formulation, 869–870, 869–870b, 869t maximum flow problem destination node, 876–877 Excel Solver, 879–881, 880–881f, 880b mathematical formulation, 878–879, 878–879b, 878f minimum spanning tree, 837, 838f network, definition, 835, 836f shortest path problem Excel Solver, 875, 876b, 876–877f mathematical formulation, 873–875, 874–875b, 874f supply capacity node, 870–873 subgraph, 837 supply nodes/sources, 835 transhipment problem (TSP) Excel Solver, 866–868, 866–868f, 866–867b intermediate transhipment points, 860–862 mathematical formulation, 862–866, 862f, 864–865f, 864t, 864–866b stages, 860–862 transportation unit cost, 862 transshipment nodes, 835 tree structure, 837, 837f Nonbasic variables (NBV), 755, 755b Nonhierarchical k-means agglomeration schedule, 338–339 arbitrary allocation, 341, 341t, 342f Euclidian distance, 346, 346t explanatory variable, 349–350 F significance level, 348, 349f F-test, 340 logical sequence, 339 mean, 347, 347t one-way analysis of variance (ANOVA), 348, 349t procedure, 339, 339f reallocation, 342–345t, 343f, 344 solution, 345–346, 346f SPSS (see IBM SPSS Statistics Software) Stata (see Stata Software) variation and F statistic, 348t Z-scores, 340 Nonlinear programming (NLP), 733 Nonlinear regression models, 443 binary and multinomial logistic models, 497 Box-Cox transformation, 497–498, 497b exponential specification, 495, 496f, 496b linear specification, 495, 496f, 496b nonlinear behavior, 495, 495f Poisson and negative binomial regression models, 497 quadratic specification, 495, 496f, 496b semilogarithmic specification, 495, 496f, 496b Nonmetric/qualitative variables dichotomous/binary variable (dummy), 16, 17t nominal scale, 10, 10t arithmetic operations, 10
Index
Nonmetric/qualitative variables (Continued) Data View, 11, 11f descriptive statistics, 10, 12 labels, 11, 12t, 12f, 14f Value Labels, 11, 14f variable selection, 11, 13f Variable View, 10–11, 11f ordinal scale, 12–14, 15t, 15f polychotomous, 16 scales of accuracy, 16, 16f Nonparametric tests advantages, 249 classification, 250, 250t disadvantages, 249 K independent samples chi-square test, 295–299, 296t, 296b, 296–299f Kruskal-Wallis test, 299–304, 300–302b, 301t, 301–304f K paired samples Cochran’s Q test, 286–290, 287–288b, 287t, 288–290f Friedman’s test, 290–295, 291–292t, 291–293b, 293–295f one sample binomial test, 250–254, 251–253b, 252t, 253–255f chi-square test, 255–257, 255–259f, 256b, 256t sign test, 257–262, 259–260b, 259t, 260–262f two independent samples chi-square test, 276–280, 277–278t, 277–278b, 277–280f Mann-Whitney U test, 281–286, 282–283t, 282–284b, 284–286f two paired samples McNemar test, 262–264, 263t, 263–264b, 265–266f sign test, 264–270, 267–268b, 267–268t, 268–271f Wilcoxon test, 270–276, 272–273t, 272–274b, 273–276f Nonrandom sampling, 169, 170f advantages and disadvantages, 169–170 convenience sampling, 175, 175b geometric propagation/snowball sampling, 177, 177b judgmental/purposive sampling, 175, 175b quota sampling, 176–177, 176–177b, 176–177t Northwest corner method, 847, 848–849t, 848–850b, 854–856b np chart, 960–963, 963b, 964f
O Oblique rotation methods, 398 Ochiai similarity coefficient, 323 Odds ratios, 581, 581f, 1055 Oil-South Refinery, 787–788, 787b, 788–789f One-stage cluster sampling, 173–174, 174b, 183 finite population, sample size mean estimation, 184 proportion estimation, 185
infinite population, sample size mean estimation, 184 proportion estimation, 184–185 One-way analysis of variance (one-way ANOVA), 937–938 Optimization models business modeling, 736 classification, 708, 708f constraints, 708 decision and parameter variables, 707 decision concept, 707 elements, 707 linear programming (LP) (see Linear programming (LP)) objective function, 708 real system behavior, 707, 708f Optimization Subroutine Library (OSL), 774 Ordinary Gauss-Hermite quadrature, 1005 Ordinary least squares (OLS) method, 191–192 autocorrelation Breusch-Godfrey test, 493–494 causes, 492, 492f consequences, 493 data time evolution, 491 Durbin-Watson test, 493, 493f first-order autocorrelation, 492 generalized least squares method, 494 residuals problem, 492, 492f Box-Cox transformations, 480–481 calculation spreadsheet, 448, 448t conditional mean, 445 data analysis box, 450, 451f data insertion, 450, 452f dataset, 445, 445t, 448, 448f dependent variable, 444–445 Excel Regression tool, 450 expected value, 445 explanatory variable, 445 extrapolations, 449–450 heteroskedasticity Breusch-Pagan/Cook-Weisberg test, 489–490 chi-square distribution, 490 consequences, 489 discretionary income, 489, 489f Huber-White method, 491 learning process, 488 probability distribution, 488 problem, 488, 488f residual vector, 490 trial and error models, 488, 488f weighted least squares method, 490 interpolations, 449–450 linear regression estimation, 450, 451f linktest, 480–481, 494–495 multicollinearity auxiliary regressions, 487 causes of, 481–482 Class A model, 483, 483t, 484f Class B model, 484–485, 485f, 485t Class C model, 485–486, 486t, 486f consequences, 482–483 correlation matrix, 487 dependent variable, 481
matrix determinant, 487 matrix form, 481 orthogonal factors, 487 parameter estimation, 481 Tolerance, 487 t statistics, 487 VIF, 487 normal distribution of residuals, 480, 480f presuppositions, 479, 480b regression equation coefficients, 450, 453f RESET test, 480–481, 495 residuals conditions, 446, 447f residual sum of squares minimization, 449, 450f Shapiro-Wilk test/Shapiro-Francia test, 480 simple linear regression model, 445, 450, 450f equation, 448 outputs, 450, 452f Solver tool, 448–449, 449f travel time vs. distance traveled, 445, 446f Orthogonal rotation method, 397 Overall model efficiency (OME), 560 Overdispersion negative binomial regression model, 634 Poisson regression model, 632–633, 632t, 633f Stata Software, 648, 648f, 658
P Parametric tests ANOVA (see Analysis of variance (ANOVA)) population mean Student’s t-test (see Student’s t-test) Z test, 219–220, 219–220b univariate tests for normality Kolmogorov-Smirnov (K-S) test, 201–203, 202–203t, 202–203b Shapiro-Francia (S-F) tests, 205–206, 205–206b, 206t Shapiro-Wilk (S-W) test, 203–205, 204t, 204–205b SPSS Software (see IBM SPSS Statistics Software) Stata (see Stata Software) variance homogeneity tests Bartlett’s χ² test, 210–212, 211–212b, 211t Cochran’s C test, 212–213, 213b Hartley’s Fmax test, 213–214, 213–214b Levene’s F-test, 214–218, 214–216b, 215–216t null hypothesis, 210 population variance, 210 Pareto chart, 21, 28–30, 29t, 29–30b, 30f Partial correlation coefficients, 387–388 Pascal distribution, 147–148, 147f, 148b P chart, 959, 959f defective fraction, 959–960, 960–963f Pearson’s contingency coefficient, 107 Pearson’s correlation coefficient, 315, 398, 399t, 413, 413f, 418–419, 420f, 427, 428f bivariate descriptive statistics, 119–121, 119–121b, 119–121f
Pearson’s first coefficient of skewness, 62–63, 62–63b Pearson’s linear correlation, 384 correlation matrix, 385 dataset model, 385t factor extraction, 385, 388f latent dimensions, 386 linear adjustments, 385, 387f three-dimensional scatter plot, 385, 386f Pearson’s second coefficient of skewness, 63, 63b Percentile coefficient of kurtosis. See Coefficient of kurtosis Percentiles, 48–52 continuous data, 50–52, 51–52b, 51t grouped discrete data, 50, 50b ungrouped discrete and continuous data, 48–50, 49–50b Permutations, 135, 135b Phi coefficient, 106, 106–108b, 107t, 109–110f Pie charts, 21, 27–28, 28b, 28t, 28f Point estimation, 189, 189b maximum likelihood estimation, 192 method of moments, 190–191, 191b, 191t ordinary least squares (OLS), 191–192 Poisson distribution, 149–151, 150f, 150–151b Poisson regression model confidence intervals, 630–632, 631t, 632b dependent variable, 618 distribution, 619, 619f equidispersion of, 620 explanatory variable, 618 incidence rate ratio, 618 maximum likelihood, 621t dependent variable mean and variance, 621, 621t, 622f Excel Solver tool, 622, 624, 624f log-linear model, 625 non-negative and discrete values, 620 overdispersion, 621–622 parameters estimation, 625, 626f rate of incidence, 622–623, 623t, 623f results, 624, 625t mean, 620 overdispersion, 632–633, 632t, 633f probability of occurrence, 619, 619t SPSS (see IBM SPSS Statistics Software) Stata (see Stata Software) statistical significance, 626–630, 627–628f variance, 620 Polychotomous variable, 16 Population definition, 169 finite, 169 infinite, 169 moment of distribution, 190 Portfolio selection problem, 726–728b, 726–728t Excel Solver, 793–797, 793–797f, 794b, 796b financial investments, 724 investment portfolio risk minimization, 725–728 investment portfolio’s expected return, 724–725 Markowitz’s model, 724
Position/location measures BACON algorithm, 52 boxplot, 53–54, 53f central tendency arithmetic mean, 38–42, 38–41t, 38–42b median, 42–45, 42–44t, 42–45b mode, 45–47, 45–47b, 45–46t interquartile range (IQR), 52 outlier identification methods, 52, 53b quantiles, 48–52 deciles, 48 percentiles, 48–52 quartiles, 47–48 Principal components factor analysis Bartlett’s test of sphericity, 387, 389–390 clusters, 383, 390 coefficient of determination, 395 communality, 394 confirmatory factor analysis, 383 confirmatory techniques, 405 correlation coefficients, 383 correlation matrix, 391 Cronbach’s alpha’s magnitude, 390 dataset, 398, 398t eigenvalues, 391, 408, 408t eigenvectors, 391–392, 401–402 exploratory factor analysis, 383 exploratory multivariate technique, 383 factor loadings, 394, 394t, 404, 404t, 406–407t factor rotation Direct Oblimin methods, 398 loading plot, 395, 396f, 407f, 408 loadings, 397, 407, 407t oblique rotation methods, 398 original factors, 395, 396t, 396f orthogonal rotation method, 397 Promax methods, 398 scores, 397 factor scores, 390–393, 403, 404t first-order correlation coefficients, 387–389, 399, 399t higher-order correlation coefficients, 387–389 Kaiser criterion, 393 Kaiser-Meyer-Olkin (KMO) statistic, 387, 389, 389t, 399 Karhunen-Loève transformation, 384 latent root criterion, 393 Likert scale, 384 loading plot, 405, 405f mental factors, 384 middling, 400 multiple linear regression model, 393 multivariate normal distribution, 394 partial correlation coefficients, 387–388 Pearson’s correlation coefficients, 398, 399t Pearson’s linear correlation, 384 correlation matrix, 385 dataset model, 385t factor extraction, 385, 388f latent dimensions, 386 linear adjustments, 385, 387f three-dimensional scatter plot, 385, 386f
second-order correlation coefficients, 387–389, 399, 399t significance level, 400, 400f SPSS (see IBM SPSS Statistics Software) Stata (see Stata Software) structural equation modeling, 383 uncorrelated factors, 383 variance table, 401, 401t weighted rank-sum criterion, 408, 409t zero-order correlation coefficients, 387–388 Probability density functions (PDF), 920 negative binomial regression model, 634 Probability theory Bayes’ theorem, 132–133, 132–133b combinatorial analysis arrangement, 133–134, 133–134b combinations, 134, 134b definition, 133 permutations, 135, 135b complement, 128, 128f, 130 conditional probability, 131 multiplication rule, 131–132, 131–132b, 132t definition, 129 empty set, 129 events, 127, 129b independent, 128, 130 mutually excluding/exclusive, 128, 128f, 130 intersection, 128, 128f random experiment, 127 sample space, 127, 129 union, 127, 128f variation field, 129 Probability variation field, 129 Process, flowchart, 935, 936f Production and inventory problem costs and capacity, 730t decision variables, 729 demand per product and period, 730t Fenix&Furniture, Excel Solver, 798, 799–801f, 800b general formulation, 729 integer programming (IP) problem, 731 inventory balance equations, 731 maximum inventory capacity, 732 maximum production capacity, 731 model parameters, 729 non-negativity conditions, 729–732 non-negativity constraints, 732 optimal solution, 732t Production mix problem, 713–717, 714–717b, 715–716t Promax methods, 398 Proportional stratified sampling, 173 Pythagorean distance formula, 316, 317f
Q Qualitative variables bivariate descriptive statistics chi-square statistic, 102–110, 102–110b, 102–103t, 104–106f, 107t, 109–110f joint frequency distribution tables (see Joint frequency distribution tables)
Qualitative variables (Continued) Spearman’s coefficient, 110–113, 110f, 111–113b, 111t, 112–113f frequency distribution tables, 22–23, 22–23b, 22–23t univariate descriptive statistics bar charts, 21, 26–27, 26t, 26–27b, 27f Pareto chart, 21, 28–30, 29t, 29–30b, 30f pie charts, 21, 27–28, 28b, 28t, 28f Quantile regression models dependent variables, 533 leverage distances, 532 median regression models, 532 normality of residuals, 533 Stata bacon algorithm, 533 conditional distribution, 537 dependent variable, 533, 534f, 538, 538f median regression model outputs, 535, 535f nonconditional median, 535 OLS regression model, 534 parameter estimation, 536, 536–537f Quantiles, 48–52 deciles, 48 percentiles, 48–52 quartiles, 47–48 Quantitative variables bivariate descriptive statistics covariance, 118, 118b Pearson’s correlation coefficient, 119–121, 119–121b, 119–121f scatter plot, 114–118, 114–118f, 115–116b continuous, 16 discrete, 16 interval scale, 15 ratio scale, 15 scales of accuracy, 16, 16f univariate descriptive statistics boxplots/box-and-whisker diagram, 21, 37–38, 37f histograms, 21, 32–34, 32–33b, 32–33t, 33–34f line graphs, 21, 30–31, 30t, 30–31b, 31f scatter plot, 21, 31–32, 31–32b, 31t, 32f stem-and-leaf plots, 21, 34–37, 35–36t, 35–37b, 36–37f Quartiles, 47–48 continuous data, 50–52, 51–52b, 51t grouped discrete data, 50, 50b ungrouped discrete and continuous data, 48–50, 49–50b Quota sampling, 176–177, 176–177b, 176–177t
R Random coefficients models, 988 Random effects parameters, 988 Random experiment, 127 Random intercepts and slopes model, 1006–1011, 1007–1011f, 1037–1038, 1038f Random intercepts model, 993, 1004–1005, 1004f, 1006f, 1036, 1037f Randomized block design (RBD), 937, 937f Random sampling, 169, 170f advantages and disadvantages, 169
one-stage cluster sampling, 173–174, 174b simple random sampling (SRS), 170–172, 171t, 171–172b stratified sampling, 173, 173b systematic sampling, 172, 172b two-stage cluster sampling, 174, 175b Random slopes model, 993 Random variables continuous random variable, 139–141, 139f, 140–141b chi-square distribution, 159–160, 159–160f, 160b exponential distribution, 156–157, 156f, 157b gamma distribution, 157–158, 158f normal distribution (see Gaussian distribution) Snedecor’s F distribution, 162–164, 163b, 163f, 164t Student’s t distribution, 160–162, 161f, 162b uniform distribution, 151–152, 151t, 152f, 152b discrete random variable, 137–139, 138–139b Bernoulli distribution, 142–144, 143–144b, 143f binomial distribution, 144–145, 144f, 145b discrete uniform distribution, 141–142, 141f, 142t, 142b geometric distribution, 145–147, 146f, 146–147b hypergeometric distribution, 148–149, 148f, 149b negative binomial distribution, 147–148, 147f, 148b Poisson distribution, 149–151, 150f, 150–151b random experiment, 137 Range, 54 R chart, 947–952f, 948–949 Reduced cost, 816–818, 817–818b Reduced maximum likelihood, 994 Reduced normal distribution, 153 Regression models negative binomial regression model, 617, 618f Poisson model, 617, 618f (see also Poisson regression model) Regression specification error (RESET) test, 495 Relative cumulative frequency, 22 Relative frequency, 22 Residual error, 444 Residual sum of squares (RSS), 451, 468 minimization, 449, 450f two-way ANOVA, 240 Restricted maximum likelihood (REML), 994, 1007–1008 Robit regression models, 610f Bernoulli distribution, 609 definition, 608 event occurrence, 609–610, 610t logistic distribution, 609 sigmoid function, 609 Stata, 611–615, 611–614f Rogers and Tanimoto similarity coefficient, 323
Rule, definition, 9 Russel and Rao similarity coefficient, 323
S Sample moment of distribution, 190 Sample space, 127, 129 Sampling definition, 169 nonprobability sampling (see Nonrandom sampling) population definition, 169 finite, 169 infinite, 169 probability sampling (see Random sampling) types, 169 Scale, definition, 9 Scatter plot, 21, 31–32, 31–32b, 31t, 32f, 312, 312f negative linear relationship, 114, 115f positive linear relationship, 114, 114f SPSS, 116f chart type, 115, 116f Simple Scatterplot dialog box, 115, 117f variables, 115, 117f on Stata, 116, 118f Second-order correlation coefficients, 387–389, 399, 399t Shadow price, 812–816, 813–816b Shape measures kurtosis coefficient of kurtosis, 65, 66f coefficient of kurtosis on Stata, 66–68, 67–68b, 67t definition, 65 Fisher’s coefficient of kurtosis, 66 leptokurtic curve, 65, 66f mesokurtic curve, 65, 65f platykurtic curve, 65, 65f skewness Bowley’s coefficient of skewness, 63–64, 64b coefficient of skewness on Stata, 64–65 Fisher’s coefficient of skewness, 64 left/negative skewness, 61, 62f Pearson’s first coefficient of skewness, 62–63, 62–63b Pearson’s second coefficient of skewness, 63, 63b right/positive skewness, 61, 62f symmetrical distribution, 61, 62f Shapiro-Francia (S-F) tests, 480 Stata, regression models estimation, 509, 509f, 511, 512f univariate tests for normality, 205–206, 205–206b, 206t Shapiro-Wilk (S-W) test, 480, 1178–1179t result, 523, 523f Stata, regression models estimation, 503–504, 504f univariate tests for normality, 203–205, 204t, 204–205b Shortest path problem Excel Solver, 875, 876b, 876–877f
mathematical formulation, 873–875, 874–875b, 874f supply capacity node, 870–873 Sigmoid function, 609 Sign test one sample, 257–262, 259–260b, 259t SPSS Software, 260, 260–261f Stata Software, 261–262, 262f two paired samples, 264–270, 267–268b, 267–268t, 268f SPSS Software, 268–269, 269–270f Stata Software, 270, 271f Simple arithmetic mean, 38–39, 38t, 38–39b Simple linear regression model, 191, 443–445, 444f, 450, 450f equation, 448 outputs, 450, 452f Simple matching coefficient (SMC), 322 Simple random sampling (SRS), 179–180b finite population, sample size mean estimation, 178 proportion estimation, 179 infinite population, sample size mean estimation, 178 proportion estimation, 179 planning and selection, 170 with replacement, 171–172, 172b sample size factors, 177–178 without replacement, 170–171, 171t, 171b Simplex method degenerate optimal solution, 773 description, 758, 758f flowchart, 758, 758f iterative algebraic procedure, 757 maximization problems analytical solution, 758–762, 759–762b, 759f tabular form, 762–769, 763–769b minimization problems, 770–772b tabular form, 769–772, 770f transformation, 769 multiple optimal solutions, 772–773 no optimal solution, 773 unlimited objective function z, 773 Simulation definition, 919 Monte Carlo simulation (see Monte Carlo method) Sneath and Sokal similarity coefficient, 324 Snedecor’s F distribution, 162–164, 163b, 163f, 164t, 1157–1162t Snowball sampling, 177, 177b Spearman’s coefficient, 110–113, 110f, 111–113b, 111t, 112–113f Staff scheduling problem, 908–912, 909–912b, 911f, 913–914f Standard deviation, 59–60, 59–60b Standard error, 60–61, 60t, 60–61b, 459, 460f Standard maximization problem, 711 Standard normal distribution, 193, 193f, 1167–1169t Stata Software, 4 binary logistic regression model classification table, 583–584, 584f
dataset, 575, 576f dummies creation, 576, 577f frequencies distribution, 575–576, 576–577f Hosmer-Lemeshow test, 579, 579f likelihood-ratio test, 578, 578f linear adjustment, 581, 582f logistic adjustment, 581, 582–583f odds ratios, 581, 581f outputs, 577, 577–578f, 580, 580f probability estimation, 580, 580f ROC curve, 585–586, 586f sensitivity analysis, 582–584, 583–585f sensitivity curve, 585, 585f C chart, 966, 966f Cp, Cpk, Cpm and Cpmk indexes, 977 Cronbach’s alpha, 437–438, 438f hierarchical agglomeration schedules, 368–374, 368f, 369–370t, 370–374f HLM2 (see Two-level hierarchical linear model, clustered data) HLM3 (see Three-level hierarchical linear model, repeated measures) intermediate models (multilevel step-up strategy) and commands, 1033, 1033t multinomial logistic regression model, 586–591, 587f, 589–590f negative binomial regression model, 663–664f dataset, 653, 653f explanatory variables, 659, 659f frequency distribution, 653, 653f goodness-of-fit, 655, 655f histogram, 653, 654f mean and variance, 654, 654f null model, 656, 657f outputs, 655, 656f, 658f, 659, 660f overdispersion, 658 probability distribution, 661, 661–662f results, 654, 655f nonhierarchical k-means agglomeration schedule, 374–376, 375–376f nonparametric tests binomial test, 253–254, 255f chi-square test, 257, 259f, 279–280, 280f, 297–299, 299f Cochran’s Q test, 288–290, 290f Friedman’s test, 293–295, 295f Kruskal-Wallis test, 303–304, 304f Mann-Whitney U test, 285–286, 286f McNemar test, 264, 266f sign test, 261–262, 262f, 270, 271f Wilcoxon test, 275–276, 276f parametric tests Kolmogorov-Smirnov (K-S) test, 209, 209f one-way ANOVA, 237–238, 238f Shapiro-Francia (S-F) test, 210, 210f Shapiro-Wilk (S-W) test, 209–210, 210f Student’s t-test, 221–222, 223f, 227, 227f, 231, 231f two-way ANOVA, 244–245, 246f P chart, 959, 959f Poisson regression model dataset, 645, 645f explanatory variables, 650, 650f
frequency distribution, 645, 645f goodness-of-fit, 649, 649f graph of, 649, 649f, 651–653, 651–652f histogram, 645, 646f incidence rate ratios, 650, 650f maximum logarithmic likelihood function, 647 McFadden pseudo R2, 647 mean and variance, 645, 646f null model, 647, 647f outputs, 646, 646f overdispersion test, 648, 648f principal components factor analysis dataset, 421, 423f eigenvalues and eigenvectors, 424, 424f KMO statistic and Bartlett’s test of sphericity, 423, 424f loading plot, 426, 427f multiple linear regression models, 428, 429–430f outputs, 422, 423f, 424–426, 425–427f Pearson’s correlation coefficient, 427, 428f ranking, 429, 431f rotated factor scores, 427, 428f Z-scores, 427–428 R chart, 947f, 948 regression models estimation augmented component-plus-residuals, 509, 513, 515f Box-Cox transformation, 511, 512f, 513, 515f Breusch-Godfrey test results, 516, 517f Breusch-Pagan/Cook-Weisberg test, 505, 506t, 506f correlation matrix, 499, 500f dataset, 498, 498f distribution adherence, 513 dummy variable, 498, 499f Durbin-Watson test result, 515, 516f frequency distribution, 498, 498f Shapiro-Francia test, 504 heteroskedasticity, graphing method, 504–505, 505f Huber-White robust standard error estimation, 507 Kernel density estimate, 511–513, 514f leverage distance concept, 501–502, 502t, 503f linear adjustment and lowess adjustment, 509, 510f, 511, 512f linktest, 507, 507f logarithmic transformation, 510 maximum likelihood estimation, 500 mfx command, 501, 502f multicollinearity, 499, 504 nonparametric method, 509 null hypothesis, 505 outputs, 500–501, 500–501f, 509, 509f parameter estimation, 501 reg command, 499–500 RESET test, 500, 507–508, 508t, 508f residuals distribution and normal distribution, 503, 503f
Stata Software (Continued) Shapiro-Francia test results, 509, 509f, 511, 512f Shapiro-Wilk test, 503–504, 504f squared normalized residuals, 502 temporal model estimation results, 513, 515f temporal variable, 513, 515f variables—graph matrix, 498–499, 499f, 510, 511f VIF and Tolerance statistics, 504, 504f weighted least squares model, 506–507 White test, 505, 506f Statistical process control (SPC) attributes, 941 control charts, 945–946t, 945–949b, 952–957b, 953–954t C chart, 963–967, 965–966t, 965–967b confidence interval, 943 mean, 952 np chart, 960–963, 963b, 964f parameters, 944 P chart, 957–960, 958t, 958–960b probability, 943, 943f sample size, 944 sigma control limits, 943 SPSS Software, 949, 950–952f, 954–957, 955–956f standard deviations, 944, 950, 952 standard normal distribution, 942 Stata Software, 947–949f, 948 U chart, 967–971, 969–970b, 969–970t line chart, 941 normal distribution, 941–942 process capability Cp index, 972, 974–977b Cpk index, 972–973, 973t, 974–977b Cpm and Cpmk indexes, 973–977, 974–977b quality characteristics, 942 range, 941–942 sample mean, 941–942 sample size, 941 sampling method, 941 standard deviation, 941 variables, 941 Stem-and-leaf plots, 21, 34–37, 35–36t, 35–37b, 36–37f Stevens classification, 9 Stratified sampling, 173, 173b, 182t, 182–183b estimation error, 180 finite population, sample size mean estimation, 181 proportion estimation, 181–183 infinite population, sample size mean estimation, 180–181 proportion estimation, 181 Student’s t distribution, 160–162, 161f, 162b, 194, 194f, 1162–1163t Student’s t-test, 220–221, 220f, 221b independent random samples, 224–225b, 225f, 225t bilateral test, 223, 224f degrees of freedom, 224
SPSS Software, 225–226, 226–227f Stata Software, 227, 227f single sample SPSS Software, 221, 222–223f Stata Software, 221–222, 223f two paired random samples, 228–229t, 228–229b, 229f bilateral test, 228, 228f normal distribution, 227 null hypothesis, 227 SPSS Software, 229–230, 230–231f Stata Software, 231, 231f Sum of squares due to regression (SSR), 451 Systematic sampling, 172, 172b
T Three-level hierarchical linear model, repeated measures, 987, 989, 990f, 990t IBM SPSS Statistics Software linear trend model with random intercepts and slopes, 1042–1045, 1043f, 1045f, 1046t null model, 1040, 1041f, 1042 Stata Software dataset characteristics, 1015, 1015t, 1016f linear trend model, random intercepts, 1020–1023, 1021–1023f linear trend model, random intercepts and slopes, 1023–1027, 1024–1025f, 1027–1028f null model, 1018–1020, 1019f outputs, 1015, 1016f random effects variance-covariance matrix, 1027–1032, 1029–1032f students’ average school performance, 1016–1017, 1017f temporal evolution, 1015–1018, 1016f, 1018f Total sum of squares (TSS), 451 Transhipment problem (TSP) Excel Solver, 866–868, 866–868f, 866–867b intermediate transhipment points, 860–862 mathematical formulation, 862–866, 862f, 864–865f, 864t, 864–866b stages, 860–862 transportation unit cost, 862 Transportation algorithm, 846f balanced transportation model, 846, 846b elementary operations, 847 iteration, 854 minimum cost method, 849, 850–851t, 850–852b northwest corner method, 847, 848–849t, 848–850b, 854–856b optimality test, 853 Vogel approximation method, 851, 852–854b, 852–853t Traveling Salesman Problem (TSP), 898–899t, 898–900b, 899f Excel Solver, 901f, 902, 902b, 903–904f formulations, 896–901 Hamiltonian problem, 896, 898f network programming, 896 Tree structure, 837, 837f
t-test, 457–459, 474 coefficients and significance, 459, 461f significance levels, 460, 461f Two-level hierarchical linear model, clustered data, 987–988, 989t, 989f IBM SPSS Statistics Software complete final model, 1038–1040, 1039f null model, 1034–1036, 1034–1035f random intercepts and slopes model, 1037–1038, 1038f random intercepts model, 1036, 1037f Stata Software adaptive quadrature process, 1005 best linear unbiased predictions (BLUPS), 1005 complete random intercepts model, 1011–1014, 1012–1014f dataset characteristics, 998, 999f, 999t generalized linear latent and mixed model (GLLAMM), 1005 maximum likelihood estimation, 1005 null model, 999–1000, 1002–1003, 1002–1003f ordinary Gauss-Hermite quadrature, 1005 random intercepts and slopes model, 1006–1011, 1007–1011f random intercepts model, 1004–1005, 1004f, 1006f students’ average performance per school, 998, 1000–1001f unbalanced clustered data structure, 998, 1000f Two-stage cluster sampling, 174, 175b sample size, 183, 185–186, 186–187t, 186b Two-way ANOVA, 937
U U chart, 970, 971f Uniform distribution, 151–152, 151t, 152f, 152b Uniform stratified sampling, 173 Union, 127, 128f Univariate descriptive statistics, 22f Excel Add-ins dialog box, 68, 70f Data Analysis dialog box, 69, 71f dataset, 68, 68f Data tab, 69, 71f descriptive statistics, 69, 72f Descriptive Statistics dialog box, 69, 71f Excel Options dialog box, 68, 70f File tab, 68, 69f frequency distribution tables, 21 calculations, 22 continuous data, 24–25, 25b, 25t definition, 22 discrete data, 23–24, 23–24b, 23–24t qualitative variables, 22–23, 22–23b, 22–23t IBM SPSS Statistics Software, 69–72 dataset, 72, 72f Descriptives Option, 74–76, 77–78f Explore Option, 77–83, 79–82f Frequencies Option, 72–74, 73–76f qualitative variables
bar charts, 21, 26–27, 26t, 26–27b, 27f Pareto chart, 21, 28–30, 29t, 29–30b, 30f pie charts, 21, 27–28, 28b, 28t, 28f quantitative variables boxplots/box-and-whisker diagram, 21, 37–38, 37f histograms, 21, 32–34, 32–33b, 32–33t, 33–34f line graphs, 21, 30–31, 30t, 30–31b, 31f scatter plot, 21, 31–32, 31–32b, 31t, 32f stem-and-leaf plots, 21, 34–37, 35–36t, 35–37b, 36–37f Stata boxplot, 86–87, 87f frequency distribution table, 83–84, 83f histograms, 85–86, 86f percentiles calculation, 85, 85f stem-and-leaf plot, 86, 86f summary, 84, 84f summary measures dispersion/variability, 21 (see also Dispersion/variability measures) position/location, 21 (see also Position/ location measures) shape, 21 (see also Shape measures)
V
Variables definition, 7 descriptive statistics, 17 Likert scale, 17 types, 7, 8f metric/quantitative, 8, 9t, 9f (see also Quantitative variables) nonmetric/qualitative, 7–8, 8t (see also Nonmetric/qualitative variables) scales of measurement, 9–15, 10f Stevens classification, 9 Variance continuous data, 58–59, 58–59b, 59t continuous random variable, 140, 140b definition, 57 discrete random variable, 138, 138t, 138b grouped discrete data, 57–58, 58b, 58t ungrouped discrete and continuous data, 57, 57b Varimax orthogonal rotation method, 397, 416, 416f Venix Toys, 779–783, 780b, 780–784f, 784b Vertical bar charts, 26, 27f Vogel approximation method, 851, 852–854b, 852–853t Vuong test correction, 696, 696f
W
Wald z test, 550–551 Weighted arithmetic mean, 39–40, 39–40t, 39–40b Weighted least squares model, 490 Stata, regression models estimation, 506–507 Weighted rank-sum criterion, 408, 409t White test, 505, 506f Wilcoxon test, 1180–1184t two paired samples, 270–276, 272–273t, 272–274b, 273–276f
Y Yule similarity coefficient, 323
Z Zero-inflated regression models, 692b Bernoulli distribution, 691 logarithmic likelihood function, 691 quantitative variable, 690 sampling zeros, 691 Stata negative binomial regression model, 697–703, 698–703f Poisson regression model, 693–697, 693–697f structural zeros, 691 Zero-order correlation coefficients, 387–388 Z-scores, 427–428 Z test, 219–220, 219–220b
The use of the images from the IBM SPSS Statistics Software® has been authorized by International Business Machines Corporation© (Armonk, New York). SPSS® Inc. was acquired by IBM® in October 2009. IBM, the IBM logo, ibm.com, and SPSS are trademarks or registered trademarks of International Business Machines Corporation, registered in many jurisdictions worldwide. The use of the images from the Stata Statistical Software® has been authorized by StataCorp LP© (College Station, Texas).