Data Science for Business and Decision Making [1 ed.] 0128112166, 9780128112168

Data Science for Business and Decision Making covers both statistics and operations research, while most competing textbooks …


English | Pages: 1000 [1209] | Year: 2019


Table of contents :
Cover
Data Science for Business and Decision Making
Copyright
Dedication
Epigraph
1
Introduction to Data Analysis and Decision Making
Introduction: Hierarchy Between Data, Information, and Knowledge
Overview of the Book
Final Remarks
2
Types of Variables and Measurement and Accuracy Scales
Introduction
Types of Variables
Nonmetric or Qualitative Variables
Metric or Quantitative Variables
Types of Variables × Scales of Measurement
Nonmetric Variables-Nominal Scale
Nonmetric Variables-Ordinal Scale
Quantitative Variable-Interval Scale
Quantitative Variable-Ratio Scale
Types of Variables × Number of Categories and Scales of Accuracy
Dichotomous or Binary Variable (Dummy)
Polychotomous Variable
Discrete Quantitative Variable
Continuous Quantitative Variable
Final Remarks
Exercises
Part II: Descriptive Statistics
3
Univariate Descriptive Statistics
Introduction
Frequency Distribution Table
Frequency Distribution Table for Qualitative Variables
Frequency Distribution Table for Discrete Data
Frequency Distribution Table for Continuous Data Grouped into Classes
Graphical Representation of the Results
Graphical Representation for Qualitative Variables
Bar Chart
Pie Chart
Pareto Chart
Graphical Representation for Quantitative Variables
Line Graph
Scatter Plot
Histogram
Stem-and-Leaf Plot
Boxplot or Box-and-Whisker Diagram
The Most Common Summary-Measures in Univariate Descriptive Statistics
Measures of Position or Location
Measures of Central Tendency
Arithmetic Mean
Case 1: Simple Arithmetic Mean of Ungrouped Discrete and Continuous Data
Case 2: Weighted Arithmetic Mean of Ungrouped Discrete and Continuous Data
Case 3: Arithmetic Mean of Grouped Discrete Data
Case 4: Arithmetic Mean of Continuous Data Grouped into Classes
Median
Case 1: Median of Ungrouped Discrete and Continuous Data
Case 2: Median of Grouped Discrete Data
Case 3: Median of Continuous Data Grouped into Classes
Mode
Case 1: Mode of Ungrouped Data
Case 2: Mode of Grouped Qualitative or Discrete Data
Case 3: Mode of Continuous Data Grouped into Classes
Quantiles
Quartiles
Deciles
Percentiles
Case 1: Quartiles, Deciles, and Percentiles of Ungrouped Discrete and Continuous Data
Case 2: Quartiles, Deciles, and Percentiles of Grouped Discrete Data
Case 3: Quartiles, Deciles, and Percentiles of Continuous Data Grouped into Classes
Identifying the Existence of Univariate Outliers
Measures of Dispersion or Variability
Range
Average Deviation
Case 1: Average Deviation of Ungrouped Discrete and Continuous Data
Case 2: Average Deviation of Grouped Discrete Data
Case 3: Average Deviation of Continuous Data Grouped into Classes
Variance
Case 1: Variance of Ungrouped Discrete and Continuous Data
Case 2: Variance of Grouped Discrete Data
Case 3: Variance of Continuous Data Grouped into Classes
Standard Deviation
Standard Error
Coefficient of Variation
Measures of Shape
Measures of Skewness
Pearson's First Coefficient of Skewness
Pearson's Second Coefficient of Skewness
Bowley's Coefficient of Skewness
Fisher's Coefficient of Skewness
Coefficient of Skewness on Stata
Measures of Kurtosis
Coefficient of Kurtosis
Fisher's Coefficient of Kurtosis
Coefficient of Kurtosis on Stata
A Practical Example in Excel
A Practical Example on SPSS
Frequencies Option
Descriptives Option
Explore Option
A Practical Example on Stata
Univariate Frequency Distribution Tables on Stata
Summary of Univariate Descriptive Statistics on Stata
Calculating Percentiles on Stata
Charts on Stata: Histograms, Stem-and-Leaf, and Boxplots
Histogram
Stem-and-Leaf
Boxplot
Final Remarks
Exercises
4
Bivariate Descriptive Statistics
Introduction
Association Between Two Qualitative Variables
Joint Frequency Distribution Tables
Measures of Association
Chi-Square Statistic
Other Measures of Association Based on Chi-Square
Spearman's Coefficient
Correlation Between Two Quantitative Variables
Joint Frequency Distribution Tables
Graphical Representation Through a Scatter Plot
Measures of Correlation
Covariance
Pearson's Correlation Coefficient
Final Remarks
Exercises
Part III: Probabilistic Statistics
5
Introduction to Probability
Introduction
Terminology and Concepts
Random Experiment
Sample Space
Events
Unions, Intersections, and Complements
Independent Events
Mutually Exclusive Events
Definition of Probability
Basic Probability Rules
Probability Variation Field
Probability of the Sample Space
Probability of an Empty Set
Probability Addition Rule
Probability of a Complementary Event
Probability Multiplication Rule for Independent Events
Conditional Probability
Probability Multiplication Rule
Bayes' Theorem
Combinatorial Analysis
Arrangements
Combinations
Permutations
Final Remarks
Exercises
6
Random Variables and Probability Distributions
Introduction
Random Variables
Discrete Random Variable
Expected Value of a Discrete Random Variable
Variance of a Discrete Random Variable
Cumulative Distribution Function of a Discrete Random Variable
Continuous Random Variable
Expected Value of a Continuous Random Variable
Variance of a Continuous Random Variable
Cumulative Distribution Function of a Continuous Random Variable
Probability Distributions for Discrete Random Variables
Discrete Uniform Distribution
Bernoulli Distribution
Binomial Distribution
Relationship Between the Binomial and the Bernoulli Distributions
Geometric Distribution
Negative Binomial Distribution
Relationship Between the Negative Binomial and the Binomial Distributions
Relationship Between the Negative Binomial and the Geometric Distributions
Hypergeometric Distribution
Approximation of the Hypergeometric Distribution by the Binomial
Poisson Distribution
Approximation of the Binomial by the Poisson Distribution
Probability Distributions for Continuous Random Variables
Uniform Distribution
Normal Distribution
Approximation of the Binomial by the Normal Distribution
Approximation of the Poisson by the Normal Distribution
Exponential Distribution
Relationship Between the Poisson and the Exponential Distribution
Gamma Distribution
Special Cases of the Gamma Distribution
Relationship Between the Poisson and the Gamma Distribution
Chi-Square Distribution
Student's t Distribution
Snedecor's F Distribution
Relationship Between Student's t and Snedecor's F Distribution
Final Remarks
Exercises
Part IV: Statistical Inference
7
Sampling
Introduction
Probability or Random Sampling
Simple Random Sampling
Simple Random Sampling Without Replacement
Simple Random Sampling With Replacement
Systematic Sampling
Stratified Sampling
Cluster Sampling
Nonprobability or Nonrandom Sampling
Convenience Sampling
Judgmental or Purposive Sampling
Quota Sampling
Geometric Propagation or Snowball Sampling
Sample Size
Size of a Simple Random Sample
Sample Size to Estimate the Mean of an Infinite Population
Sample Size to Estimate the Mean of a Finite Population
Sample Size to Estimate the Proportion of an Infinite Population
Sample Size to Estimate the Proportion of a Finite Population
Size of the Systematic Sample
Size of the Stratified Sample
Sample Size to Estimate the Mean of an Infinite Population
Sample Size to Estimate the Mean of a Finite Population
Sample Size to Estimate the Proportion of an Infinite Population
Sample Size to Estimate the Proportion of a Finite Population
Size of a Cluster Sample
Size of a One-Stage Cluster Sample
Sample Size to Estimate the Mean of an Infinite Population
Sample Size to Estimate the Mean of a Finite Population
Sample Size to Estimate the Proportion of an Infinite Population
Sample Size to Estimate the Proportion of a Finite Population
Size of a Two-Stage Cluster Sample
Final Remarks
Exercises
8
Estimation
Introduction
Point and Interval Estimation
Point Estimation
Interval Estimation
Point Estimation Methods
Method of Moments
Ordinary Least Squares
Maximum Likelihood Estimation
Interval Estimation or Confidence Intervals
Confidence Interval for the Population Mean (μ)
Known Population Variance (σ2)
Unknown Population Variance (σ2)
Confidence Interval for Proportions
Confidence Interval for the Population Variance
Final Remarks
Exercises
9
Hypotheses Tests
Introduction
Parametric Tests
Univariate Tests for Normality
Kolmogorov-Smirnov Test
Shapiro-Wilk Test
Shapiro-Francia Test
Solving Tests for Normality by Using SPSS Software
Solving Tests for Normality by Using Stata
Kolmogorov-Smirnov Test on the Stata Software
Shapiro-Wilk Test on the Stata Software
Shapiro-Francia Test on the Stata Software
Tests for the Homogeneity of Variances
Bartlett's χ2 Test
Cochran's C Test
Hartley's Fmax Test
Levene's F-Test
Solving Levene's Test by Using SPSS Software
Solving Levene's Test by Using the Stata Software
Hypotheses Tests Regarding a Population Mean (μ) From One Random Sample
Z Test When the Population Standard Deviation (σ) Is Known and the Distribution Is Normal
Student's t-Test When the Population Standard Deviation (σ) Is Not Known
Solving Student's t-Test for a Single Sample by Using SPSS Software
Solving Student's t-Test for a Single Sample by Using Stata Software
Student's t-Test to Compare Two Population Means From Two Independent Random Samples
Case 1: σ₁² ≠ σ₂²
Case 2: σ₁² = σ₂²
Solving Student's t-Test From Two Independent Samples by Using SPSS Software
Solving Student's t-Test From Two Independent Samples by Using Stata Software
Student's t-Test to Compare Two Population Means From Two Paired Random Samples
Solving Student's t-Test From Two Paired Samples by Using SPSS Software
Solving Student's t-Test From Two Paired Samples by Using Stata Software
ANOVA to Compare the Means of More Than Two Populations
One-Way ANOVA
Solving the One-Way ANOVA Test by Using SPSS Software
Solving the One-Way ANOVA Test by Using Stata Software
Factorial ANOVA
Two-Way ANOVA
Solving the Two-Way ANOVA Test by Using SPSS Software
Solving the Two-Way ANOVA Test by Using Stata Software
ANOVA With More Than Two Factors
Final Remarks
Exercises
10
Nonparametric Tests
Introduction
Tests for One Sample
Binomial Test
Solving the Binomial Test Using SPSS Software
Solving the Binomial Test Using Stata Software
Chi-Square Test (χ2) for One Sample
Solving the χ2 Test for One Sample Using SPSS Software
Solving the χ2 Test for One Sample Using Stata Software
Sign Test for One Sample
Solving the Sign Test for One Sample Using SPSS Software
Solving the Sign Test for One Sample Using Stata Software
Tests for Two Paired Samples
McNemar Test
Solving the McNemar Test Using SPSS Software
Solving the McNemar Test Using Stata Software
Sign Test for Two Paired Samples
Solving the Sign Test for Two Paired Samples Using SPSS Software
Solving the Sign Test for Two Paired Samples Using Stata Software
Wilcoxon Test
Solving the Wilcoxon Test Using SPSS Software
Solving the Wilcoxon Test Using Stata Software
Tests for Two Independent Samples
Chi-Square Test (χ2) for Two Independent Samples
Solving the χ2 Statistic Using SPSS Software
Solving the χ2 Statistic by Using Stata Software
Mann-Whitney U Test
Solving the Mann-Whitney Test Using SPSS Software
Solving the Mann-Whitney Test Using Stata Software
Tests for k Paired Samples
Cochran's Q Test
Solving Cochran's Q Test by Using SPSS Software
Solution of Cochran's Q Test on Stata Software
Friedman's Test
Solving Friedman's Test by Using SPSS Software
Solving Friedman's Test by Using Stata Software
Tests for k Independent Samples
The χ2 Test for k Independent Samples
Solving the χ2 Test for k Independent Samples on SPSS
Solving the χ2 Test for k Independent Samples on Stata
Kruskal-Wallis Test
Solving the Kruskal-Wallis Test by Using SPSS Software
Solving the Kruskal-Wallis Test by Using Stata
Final Remarks
Exercises
Part V: Multivariate Exploratory Data Analysis
11
Cluster Analysis
Introduction
Cluster Analysis
Defining Distance or Similarity Measures in Cluster Analysis
Distance (Dissimilarity) Measures Between Observations for Metric Variables
Similarity Measures Between Observations for Binary Variables
Agglomeration Schedules in Cluster Analysis
Hierarchical Agglomeration Schedules
Notation
A Practical Example of Cluster Analysis With Hierarchical Agglomeration Schedules
Nearest-Neighbor or Single-Linkage Method
Furthest-Neighbor or Complete-Linkage Method
Between-Groups or Average-Linkage Method
Nonhierarchical K-Means Agglomeration Schedule
Notation
A Practical Example of a Cluster Analysis With the Nonhierarchical K-Means Agglomeration Schedule
Cluster Analysis with Hierarchical and Nonhierarchical Agglomeration Schedules in SPSS
Elaborating Hierarchical Agglomeration Schedules in SPSS
Elaborating Nonhierarchical K-Means Agglomeration Schedules in SPSS
Cluster Analysis With Hierarchical and Nonhierarchical Agglomeration Schedules in Stata
Elaborating Hierarchical Agglomeration Schedules in Stata
Elaborating Nonhierarchical K-Means Agglomeration Schedules in Stata
Final Remarks
Exercises
Appendix
Detecting Multivariate Outliers
12
Principal Component Factor Analysis
Introduction
Principal Component Factor Analysis
Pearson's Linear Correlation and the Concept of Factor
Overall Adequacy of the Factor Analysis: Kaiser-Meyer-Olkin Statistic and Bartlett's Test of Sphericity
Defining the Principal Component Factors: Determining the Eigenvalues and Eigenvectors of Correlation Matrix ρ and Calcula ...
Factor Loadings and Communalities
Factor Rotation
A Practical Example of the Principal Component Factor Analysis
Principal Component Factor Analysis in SPSS
Principal Component Factor Analysis in Stata
Final Remarks
Exercises
Appendix: Cronbach's Alpha
Brief Presentation
Determining Cronbach's Alpha Algebraically
Determining Cronbach's Alpha in SPSS
Determining Cronbach's Alpha in Stata
Part VI: Generalized Linear Models
13
Simple and Multiple Regression Models
Introduction
Linear Regression Models
Estimation of the Linear Regression Model by Ordinary Least Squares
Explanatory Power of the Regression Model: Coefficient of Determination R2
General Statistical Significance of the Regression Model and Each of Its Parameters
Construction of the Confidence Intervals of the Model Parameters and Elaboration of Predictions
Estimation of Multiple Linear Regression Models
Dummy Variables in Regression Models
Presuppositions of Regression Models Estimated by OLS
Normality of Residuals
The Multicollinearity Problem
Causes of Multicollinearity
Consequences of Multicollinearity
Application of Multicollinearity Examples in Excel
Multicollinearity Diagnostics
Possible Solutions for the Multicollinearity Problem
The Problem of Heteroskedasticity
Causes of Heteroskedasticity
Consequences of Heteroskedasticity
Heteroskedasticity Diagnostics: Breusch-Pagan/Cook-Weisberg Test
Weighted Least Squares Method: A Possible Solution
Huber-White Method for Robust Standard Errors
The Autocorrelation of Residuals Problem
Causes of the Autocorrelation of Residuals
Consequences of the Autocorrelation of Residuals
Autocorrelation of Residuals Diagnostic: The Durbin-Watson Test
Autocorrelation of Residuals Diagnostic: The Breusch-Godfrey Test
Possible Solutions for the Autocorrelation of Residuals Problem
Detection of Specification Problems: Linktest and RESET Test
Nonlinear Regression Models
The Box-Cox Transformation: The General Regression Model
Estimation of Regression Models in Stata
Estimation of Regression Models in SPSS
Final Remarks
Exercises
Appendix: Quantile Regression Models
A Brief Introduction
Example: Quantile Regression Model in Stata
14
Binary and Multinomial Logistic Regression Models
Introduction
The Binary Logistic Regression Model
Estimation of the Binary Logistic Regression Model by Maximum Likelihood
General Statistical Significance of the Binary Logistic Regression Model and Each of Its Parameters
Construction of the Confidence Intervals of the Parameters for the Binary Logistic Regression Model
Cutoff, Sensitivity Analysis, Overall Model Efficiency, Sensitivity, and Specificity
The Multinomial Logistic Regression Model
Estimation of the Multinomial Logistic Regression Model by Maximum Likelihood
General Statistical Significance of the Multinomial Logistic Regression Model and Each of Its Parameters
Construction of the Confidence Intervals of the Parameters for the Multinomial Logistic Regression Model
Estimation of Binary and Multinomial Logistic Regression Models in Stata
Binary Logistic Regression in Stata
Multinomial Logistic Regression in Stata
Estimation of Binary and Multinomial Logistic Regression Models in SPSS
Binary Logistic Regression in SPSS
Multinomial Logistic Regression in SPSS
Final Remarks
Exercises
Appendix: Probit Regression Models
A Brief Introduction
Example: Probit Regression Model in Stata
15
Regression Models for Count Data: Poisson and Negative Binomial
Introduction
The Poisson Regression Model
Estimation of the Poisson Regression Model by Maximum Likelihood
General Statistical Significance of the Poisson Regression Model and Each of Its Parameters
Construction of the Confidence Intervals of the Parameters for the Poisson Regression Model
Test to Verify Overdispersion in Poisson Regression Models
The Negative Binomial Regression Model
Estimation of the Negative Binomial Regression Model by Maximum Likelihood
General Statistical Significance of the Negative Binomial Regression Model and Each of Its Parameters
Construction of the Confidence Intervals of the Parameters for the Negative Binomial Regression Model
Estimating Regression Models for Count Data in Stata
Poisson Regression Model in Stata
Negative Binomial Regression Model in Stata
Regression Model Estimation for Count Data in SPSS
Poisson Regression Model in SPSS
Negative Binomial Regression Model in SPSS
Final Remarks
Exercises
Appendix: Zero-Inflated Regression Models
Brief Introduction
Example: Zero-Inflated Poisson Regression Model in Stata
Example: Zero-Inflated Negative Binomial Regression Model in Stata
Part VII: Optimization Models and Simulation
16
Introduction to Optimization Models: General Formulations and Business Modeling
Introduction to Optimization Models
Introduction to Linear Programming Models
Mathematical Formulation of a General Linear Programming Model
Linear Programming Model in the Standard and Canonical Forms
Linear Programming Model in the Standard Form
Linear Programming Model in the Canonical Form
Transformations Into the Standard or Canonical Form
Assumptions of the Linear Programming Model
Proportionality
Additivity
Divisibility and Non-negativity
Certainty
Modeling Business Problems Using Linear Programming
Production Mix Problem
Blending or Mixing Problem
Diet Problem
Capital Budget Problems
Portfolio Selection Problem
Model 1: Maximization of an Investment Portfolio's Expected Return
Model 2: Investment Portfolio Risk Minimization
Production and Inventory Problem
Aggregated Planning Problem
Final Remarks
Exercises
17
Solution of Linear Programming Problems
Introduction
Graphical Solution of a Linear Programming Problem
Linear Programming Maximization Problem with a Single Optimal Solution
Linear Programming Minimization Problem With a Single Optimal Solution
Special Cases
Multiple Optimal Solutions
Unlimited Objective Function z
There Is No Optimal Solution
Degenerate Optimal Solution
Analytical Solution of a Linear Programming Problem in Which m &lt; n
The Simplex Method
Logic of the Simplex Method
Analytical Solution of the Simplex method for Maximization Problems
Tabular Form of the Simplex Method for Maximization Problems
The Simplex Method for Minimization Problems
Special Cases of the Simplex Method
Multiple Optimal Solutions
Unlimited Objective Function z
There Is No Optimal Solution
Degenerate Optimal Solution
Solution by Using a Computer
Solver in Excel
Solution of the Examples found in Section 16.6 of Chapter 16 using Solver in Excel
Solution of Example 16.3 of Chapter 16 (Production Mix Problem at the Venix Toys)
Solution of Example 16.4 of Chapter 16 (Production Mix Problem at Naturelat Dairy)
Solution of Example 16.5 of Chapter 16 (Mix Problem of Oil-South Refinery)
Solution of Example 16.6 of Chapter 16 (Diet Problem)
Solution of Example 16.7 of Chapter 16 (Farmer's Problem)
Solution of Example 16.8 of Chapter 16 (Portfolio Selection-Maximization of the Expected Return)
Solution of Example 16.9 of Chapter 16 (Portfolio Selection-Minimization of the Portfolio's Mean Absolute Deviation)
Solution of Example 16.10 of Chapter 16 (Production and Inventory Problem of Fenix and Furniture)
Solution of Example 16.11 of Chapter 16 (Problem of Lifestyle Natural Juices Manufacturer)
Solver Error Messages for Unlimited and Infeasible Solutions
Unlimited Objective Function z
There Is No Optimal Solution
Result Analysis by Using the Solver Answer and Limits Reports
Answer Report
Limits Report
Sensitivity Analysis
Alteration in one of the Objective Function Coefficients (Graphical Solution)
Alteration in One of the Constants on the Right-Hand Side of the Constraint and Concept of Shadow Price (Graphica ...
Reduced Cost
Sensitivity Analysis With Solver in Excel
Special Case: Multiple Optimal Solutions
Special Case: Degenerate Optimal Solution
Exercises
18
Network Programming
Introduction
Terminology of Graphs and Networks
Classic Transportation Problem
Mathematical Formulation of the Classic Transportation Problem
Balancing the Transportation Problem When the Total Supply Capacity Is Not Equal to the Total Demand Consumed
Case 1: Total Supply Is Greater than Total Demand
Case 2: Total Supply Capacity Is Lower than Total Demand Consumed
Solution of the Classic Transportation Problem
The Transportation Algorithm
Solution of the Transportation Problem Using Excel Solver
Transhipment Problem
Mathematical Formulation of the Transhipment Problem
Solution of the Transhipment Problem Using Excel Solver
Job Assignment Problem
Mathematical Formulation of the Job Assignment Problem
Solution of the Job Assignment Problem Using Excel Solver
Shortest Path Problem
Mathematical Formulation of the Shortest Path Problem
Solution of the Shortest Path Problem Using Excel Solver
Maximum Flow Problem
Mathematical Formulation of the Maximum Flow Problem
Solution of the Maximum Flow Problem Using Excel Solver
Exercises
19
Integer Programming
Introduction
Mathematical Formulation of a General Model for Integer Programming and/or Binary and Linear Relaxation
The Knapsack Problem
Modeling of the Knapsack Problem
Solution of the Knapsack Problem Using Excel Solver
The Capital Budgeting Problem as a Model of Binary Programming
Solution of the Capital Budgeting Problem as a Model of Binary Programming Using Excel Solver
The Traveling Salesman Problem
Modeling of the Traveling Salesman Problem
Solution of the Traveling Salesman Problem Using Excel Solver
The Facility Location Problem
Modeling of the Facility Location Problem
Solution of the Facility Location Problem Using Excel Solver
The Staff Scheduling Problem
Solution of the Staff Scheduling Problem Using Excel Solver
Exercises
20
Simulation and Risk Analysis
Introduction to Simulation
The Monte Carlo Method
Monte Carlo Simulation in Excel
Generation of Random Numbers and Probability Distributions in Excel
Practical Examples
Case 1: Consumption of Red Wine
Case 2: Profit x Loss Forecast
Final Remarks
Exercises
Part VIII: Other Topics
21
Design and Analysis of Experiments
Introduction
Steps in the Design of Experiments
The Four Principles of Experimental Design
Types of Experimental Design
Completely Randomized Design (CRD)
Randomized Block Design (RBD)
Factorial Design (FD)
One-Way Analysis of Variance
Factorial ANOVA
Final Remarks
Exercises
22
Statistical Process Control
Introduction
Estimating the Process Mean and Variability
Control Charts for Variables
Control Charts for X̄ and R
Control Charts for X̄
Control Charts for R
Control Charts for X̄ and S
Control Charts for Attributes
P Chart (Defective Fraction)
np Chart (Number of Defective Products)
C Chart (Total Number of Defects per Unit)
U Chart (Average Number of Defects per Unit)
Process Capability
Cp Index
Cpk Index
Cpm and Cpmk Indexes
Final Remarks
Exercises
23
Data Mining and Multilevel Modeling
Introduction to Data Mining
Multilevel Modeling
Nested Data Structures
Hierarchical Linear Models
Two-Level Hierarchical Linear Models With Clustered Data (HLM2)
Three-Level Hierarchical Linear Models With Repeated Measures (HLM3)
Estimation of Hierarchical Linear Models in Stata
Estimation of a Two-Level Hierarchical Linear Model With Clustered Data in Stata
Estimation of a Three-Level Hierarchical Linear Model With Repeated Measures in Stata
Estimation of Hierarchical Linear Models in SPSS
Estimation of a Two-Level Hierarchical Linear Model With Clustered Data in SPSS
Estimation of a Three-Level Hierarchical Linear Model With Repeated Measures in SPSS
Final Remarks
Exercises
Appendix
Hierarchical Nonlinear Models
Answers
Answer Keys: Exercises: Chapter 2
Answer Keys: Exercises: Chapter 3
Answer Keys: Exercises: Chapter 4
Answer Keys: Exercises: Chapter 5
Answer Keys: Exercises: Chapter 6
Answer Keys: Exercises: Chapter 7
Answer Keys: Exercises: Chapter 8
Answer Keys: Exercises: Chapter 9
Answer Keys: Exercises: Chapter 10
Answer Keys: Exercises: Chapter 11
Answer Keys: Exercises: Chapter 12
Answer Keys: Exercises: Chapter 13
Answer Keys: Exercises: Chapter 14
Answer Keys: Exercises: Chapter 15
Answer Keys: Exercises: Chapter 16
Answer Keys: Exercises: Chapter 17
Answer Keys: Exercises: Chapter 18
Answer Keys: Exercises: Chapter 19
Answer Keys: Exercises: Chapter 20
Answer Keys: Exercises: Chapter 21
Answer Keys: Exercises: Chapter 22
Answer Keys: Exercises: Chapter 23
Appendices
References
Index
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
S
T
U
V
W
Y
Z


Data Science for Business and Decision Making

Luiz Paulo Fávero, School of Economics, Business and Accounting, University of São Paulo, São Paulo, SP, Brazil

Patrícia Belfiore, Center of Engineering, Modeling and Applied Social Sciences, Management Engineering, Federal University of ABC, São Bernardo do Campo, SP, Brazil

Academic Press is an imprint of Elsevier 125 London Wall, London EC2Y 5AS, United Kingdom 525 B Street, Suite 1650, San Diego, CA 92101, United States 50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, United Kingdom © 2019 Elsevier Inc. All rights reserved No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher’s permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/ permissions. This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein). Notices Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary. Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility. To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein. Library of Congress Cataloging-in-Publication Data A catalog record for this book is available from the Library of Congress British Library Cataloguing-in-Publication Data A catalogue record for this book is available from the British Library ISBN 978-0-12-811216-8 For information on all Academic Press publications visit our website at https://www.elsevier.com/books-and-journals

Publisher: Candice Janco Acquisition Editor: Scott Bentley Editorial Project Manager: Susan Ikeda Production Project Manager: Purushothaman Vijayaraj Cover Designer: Miles Hitchen Typeset by SPi Global, India

Dedication
We dedicate this book to Ovídio and Leonor, Antonio and Ana Vera, for the unconditional effort dedicated to our education and development. We dedicate this book to Gabriela and Luiz Felipe, who are the reason for our existence.

Epigraph
When a human awakens to a great dream and throws the full force of his soul over it, all the universe conspires in his favor. Johann Wolfgang von Goethe

Chapter 1

Introduction to Data Analysis and Decision Making
Everything in us is mortal, except the gifts of the spirit and of intelligence. Ovid

1.1 INTRODUCTION: HIERARCHY BETWEEN DATA, INFORMATION, AND KNOWLEDGE

In academic and business environments, the improved use of research techniques and modern software packages, together with an understanding, by researchers and managers in the most varied fields of knowledge, of the importance of statistics and data modeling for defining objectives and substantiating research hypotheses based on underlying theories, has been producing papers that are more consistent and rigorous from a methodological and scientific standpoint. Nevertheless, as the well-known Austrian philosopher Ludwig Joseph Johann Wittgenstein, later naturalized as a British citizen, used to say, methodological rigor alone, combined with authors who keep researching more of the same topic, can generate a deep lack of oxygen in the academic world. Besides the availability of data, adequate software packages, and an adequate underlying theory, it is essential that researchers also use their intuition and experience when defining their objectives and constructing their hypotheses, even when deciding to study the behavior of new and, sometimes, unimaginable variables in their models. This, believe it or not, may also generate interesting and innovative information for the decision-making process!

The basic principle of this book is to explain, at every turn, the hierarchy between data, information, and knowledge in this new scenario we live in. Whenever treated and analyzed, data are transformed into information. Knowledge, in turn, is generated the moment such information is recognized and applied to the decision-making process. Analogously, the reverse hierarchy can also apply: knowledge, whenever disseminated or explained, becomes information which, when broken up, can generate a dataset. Fig. 1.1 shows this logic.

1.2 OVERVIEW OF THE BOOK

The book is divided into 23 chapters, which are structured into eight major parts, as follows:

Part I: Foundations of Business Data Analysis
Chapter 1: Introduction to Data Analysis and Decision Making.
Chapter 2: Types of Variables and Measurement and Accuracy Scales.

Part II: Descriptive Statistics
Chapter 3: Univariate Descriptive Statistics.
Chapter 4: Bivariate Descriptive Statistics.

Part III: Probabilistic Statistics
Chapter 5: Introduction to Probability.
Chapter 6: Random Variables and Probability Distributions.

Part IV: Statistical Inference
Chapter 7: Sampling.
Chapter 8: Estimation.
Chapter 9: Hypotheses Tests.
Chapter 10: Nonparametric Tests.

Part V: Multivariate Exploratory Data Analysis
Chapter 11: Cluster Analysis.
Chapter 12: Principal Component Factor Analysis.

Part VI: Generalized Linear Models
Chapter 13: Simple and Multiple Regression Models.
Chapter 14: Binary and Multinomial Logistic Regression Models.
Chapter 15: Regression Models for Count Data: Poisson and Negative Binomial.

Part VII: Optimization Models and Simulation
Chapter 16: Introduction to Optimization Models: General Formulations and Business Modeling.
Chapter 17: Solution of Linear Programming Problems.
Chapter 18: Network Programming.
Chapter 19: Integer Programming.
Chapter 20: Simulation and Risk Analysis.

Part VIII: Other Topics
Chapter 21: Design and Analysis of Experiments.
Chapter 22: Statistical Process Control.
Chapter 23: Data Mining and Multilevel Modeling.

FIG. 1.1 Hierarchy between data, information, and knowledge. (The figure shows data being transformed into information through treatment and analysis, and information into knowledge through decision making; in the reverse direction, knowledge is diffused as information, which can be dismembered back into data.)

Each chapter follows the same didactic structure, which we believe favors learning. First, the concepts regarding each topic are introduced and are always followed by the algebraic solution, often in Excel, of practical exercises based on datasets developed primarily with an educational focus. Next, where applicable, the same exercises are solved in Stata Statistical Software® and IBM SPSS Statistics Software®. We believe that this logic facilitates the study and understanding of the correct use of each technique and of the analysis of the results. Moreover, the practical application of the models in Excel, Stata, and SPSS also benefits researchers, since the results can be compared, at every turn, with those already estimated or calculated algebraically in the previous sections of each chapter, in addition to providing an opportunity to use these important software packages. At the end of each chapter, additional exercises are proposed; their answers, presented through the outputs generated, are available at the end of the book. The datasets used are available at www.elsevier.com.

1.3 FINAL REMARKS

All the benefits and potential of the techniques discussed here will be felt by researchers and managers as the procedures are practiced repeatedly. Because there are several methods, great care must be taken when defining the technique, since choosing the best alternative for treating the data depends fundamentally on this practice. The adequate use of the techniques presented in this book by professors, students, and business managers can more powerfully underpin the research's initial perception and, in this way, support the decision-making process. Generating knowledge about a phenomenon depends on a well-structured research plan: the definition of the variables to be collected, the dimensions of the sample, the development of the dataset, and, most importantly, the choice of the technique to be used.


Thus, we believe that this book is meant for researchers who, for different reasons, are specifically interested in data science and decision making, as well as for those who want to deepen their knowledge by using the Excel, SPSS, and Stata software packages. This book is recommended to undergraduate and graduate students in the fields of Business Administration, Engineering, Economics, Accounting, Actuarial Science, Statistics, Psychology, Medicine and Health, and to students in other fields related to Human, Exact and Biomedical Sciences. It is also meant for students taking extension, lato sensu postgraduate, and MBA courses, as well as for company employees, consultants, and other researchers whose main objectives are to treat and analyze data, aiming at preparing data models, generating information, and improving knowledge through decision-making processes. To all the researchers and managers that use this book, we hope that adequate and ever more interesting research questions may arise, that analyses may be developed, and that reliable, robust, and useful models for decision-making processes may be constructed. We also hope that the interpretation of outputs may become friendlier and that the use of Excel, SPSS, and Stata may bear important and valuable fruit for new research and projects. We would like to thank everyone who contributed and made this book become a reality. We would also like to sincerely thank the professionals at Montvero Consulting and Training Ltd., at the International Business Machines Corporation (Armonk, New York), at StataCorp LP (College Station, Texas), and at Elsevier Publishing House, especially Andre Gerhard Wolff, J. Scott Bentley, and Susan E. Ikeda. Last, but not least, we would like to thank the professors, students, and employees of the Economics, Business Administration and Accounting College of the University of São Paulo (FEA/USP) and of the Federal University of the ABC (UFABC). Now it is time for you to get started! We would like to emphasize that any contributions, criticisms, and suggestions will always be welcome, so that, later on, they may be incorporated into this book and make it better. Luiz Paulo Fávero and Patrícia Belfiore

Chapter 2

Types of Variables and Measurement and Accuracy Scales
And God said: π, i, 0, and 1, and the Universe was created. Leonhard Euler

2.1 INTRODUCTION

A variable is a characteristic of the population (or sample) being studied, and it is possible to measure, count, or categorize it. The type of variable collected is crucial in the calculation of descriptive statistics and in the graphical representation of results, as well as in the selection of the statistical methods that will be used to analyze the data. According to Freund (2006), statistical data are the raw materials of statistical research, always appearing in cases of measurement or record of observations. This chapter discusses the existing types of variables (metric or quantitative and nonmetric or qualitative), as well as their respective scales of measurement (nominal and ordinal for qualitative variables, and interval and ratio for quantitative variables). Classifying the types of variables based on the number of categories and scales of accuracy is also discussed (binary and polychotomous for qualitative variables and discrete and continuous for quantitative variables).

2.2 TYPES OF VARIABLES

Variables can be classified as nonmetric, also known as qualitative or categorical, or metric, also known as quantitative (Fig. 2.1). Nonmetric or qualitative variables represent the characteristics of an individual, object, or element that cannot be measured or quantified. The answers are given in categories. In contrast, metric or quantitative variables represent the characteristics of an individual, object, or element that result from a count (a finite set of values) or from a measurement (an infinite set of values).

2.2.1 Nonmetric or Qualitative Variables

As we will study in Chapter 3, the characteristics of nonmetric or qualitative variables can be represented through frequency distribution tables or graphically, without having to calculate measures of position, dispersion, and shape. The only exception is the mode, a measure that provides the variable's most frequent value, which can also be applied to nonmetric variables. Imagine that a questionnaire will be used to collect data on family income from a sample of consumers, based on certain salary ranges. Table 2.1 shows the variable categories. Note that both variables are qualitative, since the data are represented by ranges. However, it is very common for researchers to classify them incorrectly, mainly when the variable has numerical values in the data. In this case, it is only possible to calculate the frequencies, and not summary measures such as the mean and standard deviation. The frequencies obtained for each income range can be seen in Table 2.2. A common error found in papers that use qualitative variables represented by numbers is the calculation of the sample mean, or any other summary measure. Typically, the researcher first calculates the mean of the limits of each range, assuming that this value corresponds to the real mean of the consumers found in that range. However, since the data distribution is not necessarily linear or symmetrical around the mean, this hypothesis is often violated.


FIG. 2.1 Types of variables.

TABLE 2.1 Family Income Ranges × Social Class

Class | Minimum Wage Salaries (MWS) | Family Income ($)
A | Above 20 MWS | Above $15,760.00
B | From 10 to 20 MWS | From $7,880.00 to $15,760.00
C | From 4 to 10 MWS | From $3,152.00 to $7,880.00
D | From 2 to 4 MWS | From $1,576.00 to $3,152.00
E | Up to 2 MWS | Up to $1,576.00

TABLE 2.2 Frequencies × Family Income Ranges

Frequency | Family Income ($)
10% | Above $15,760.00
18% | From $7,880.00 to $15,760.00
24% | From $3,152.00 to $7,880.00
36% | From $1,576.00 to $3,152.00
12% | Up to $1,576.00

In order to calculate summary measures such as the mean and standard deviation, the variable being studied must necessarily be quantitative.
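A minimal illustrative sketch of this point, not part of the book's Excel, SPSS, and Stata workflow: in Python with the pandas library (the series name and the sample of answers below are hypothetical assumptions), only frequencies, as in Table 2.2, are meaningful for an income-range variable.

import pandas as pd

# Hypothetical answers classified into the income ranges of Table 2.1
income_range = pd.Series(
    ["Up to $1,576.00",
     "From $1,576.00 to $3,152.00",
     "From $1,576.00 to $3,152.00",
     "From $3,152.00 to $7,880.00",
     "From $7,880.00 to $15,760.00",
     "Above $15,760.00"],
    dtype="category",
)

# Valid for a qualitative variable: absolute and relative frequencies (cf. Table 2.2)
print(income_range.value_counts())
print(income_range.value_counts(normalize=True))

# A mean of the range limits is not a meaningful summary for this variable;
# pandas itself refuses to compute income_range.mean() for a categorical series.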

2.2.2 Metric or Quantitative Variables

Quantitative variables can be represented graphically (line charts, scatter plots, histograms, stem-and-leaf plots, and boxplots), through measures of position or location (mean, median, mode, quartiles, deciles, and percentiles), measures of dispersion or variability (range, average deviation, variance, standard deviation, standard error, and coefficient of variation), or through measures of shape, such as skewness and kurtosis, as we will study in Chapter 3. These variables can be discrete or continuous. Discrete variables can take on a finite set of values that frequently come from a count, such as the number of children in a family (0, 1, 2…). Conversely, continuous variables take on values in an interval of real numbers, such as an individual's weight or income. Imagine a dataset with 20 people's names, age, weight, and height, as shown in Table 2.3. The data are available in the file VarQuanti.sav. To classify the variables in SPSS (Fig. 2.2), let's click on Variable View. Note that the variable Name is qualitative (a string), and it is measured on a nominal scale (column Measure). On the other hand, the variables Age, Weight, and Height are quantitative (Numeric), and they are measured in scale (Scale). The variable scales of measurement will be studied in more detail in Section 2.3.


TABLE 2.3 Dataset With Information on 20 People

Name | Age (Years) | Weight (kg) | Height (m)
Mariana | 48 | 62 | 1.60
Roberta | 41 | 56 | 1.62
Luiz | 54 | 84 | 1.76
Leonardo | 30 | 82 | 1.90
Felipe | 35 | 76 | 1.85
Marcelo | 60 | 98 | 1.78
Melissa | 28 | 54 | 1.68
Sandro | 50 | 70 | 1.72
Armando | 40 | 75 | 1.68
Heloisa | 24 | 50 | 1.59
Julia | 44 | 65 | 1.62
Paulo | 39 | 83 | 1.75
Manoel | 22 | 68 | 1.78
Ana Paula | 31 | 56 | 1.66
Amelia | 45 | 60 | 1.64
Horacio | 62 | 88 | 1.77
Pedro | 24 | 80 | 1.92
Joao | 28 | 75 | 1.80
Marcos | 49 | 92 | 1.76
Celso | 54 | 66 | 1.68

FIG. 2.2 Classification of the variables.
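As a complementary sketch, assuming the first rows of Table 2.3 are typed in directly, the same classification of metric and nonmetric variables can be reproduced in Python with pandas (this is only an illustration; the book itself works with the VarQuanti.sav file and the SPSS Variable View):

import pandas as pd

# First five people from Table 2.3; the remaining rows would be entered analogously
people = pd.DataFrame({
    "Name":   ["Mariana", "Roberta", "Luiz", "Leonardo", "Felipe"],  # qualitative (string)
    "Age":    [48, 41, 54, 30, 35],            # discrete quantitative (years)
    "Weight": [62, 56, 84, 82, 76],            # quantitative (kg)
    "Height": [1.60, 1.62, 1.76, 1.90, 1.85],  # continuous quantitative (m)
})

# Summary measures such as the mean and standard deviation are meaningful
# only for the metric (quantitative) columns
print(people.dtypes)
print(people[["Age", "Weight", "Height"]].describe())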

2.3 TYPES OF VARIABLES × SCALES OF MEASUREMENT

Variables can also be classified according to the level or scale of measurement. Measurement is the process of assigning numbers or labels to objects, people, states, or events, in accordance with specific rules, to represent the quantities or qualities of the attributes. A rule is a guide, a method, or a command that tells the researcher how to measure the attribute. A scale is a set of symbols or numbers, based on a rule, and it applies to individuals or to their behaviors or attitudes. An individual's position in the scale is based on whether this individual has the attribute that the scale is meant to measure. Several taxonomies are found in the existing literature for classifying the scales of measurement of all types of variables (Stevens, 1946; Hoaglin et al., 1983). We will use Stevens' classification because it is simple, widely used, and because its nomenclature is adopted in statistical software. According to Stevens (1946), the scales of measurement of nonmetric, categorical, or qualitative variables can be classified as nominal and ordinal, while metric or quantitative variables are classified on interval and ratio (or proportional) scales, as shown in Fig. 2.3.


FIG. 2.3 Types of variables × scales of measurement.

2.3.1 Nonmetric Variables—Nominal Scale

The nominal scale classifies the units into classes or categories regarding the characteristic represented, without establishing any magnitude or order relationship. It is called nominal because the categories are differentiated only by their names. We can assign numerical labels to the variable categories, but arithmetic operations such as addition, subtraction, multiplication, and division over these numbers are not allowed. The nominal scale allows only a few basic operations: for instance, we can count the number of elements in each class or apply hypotheses tests regarding the distribution of the population units across the classes. Thus, most of the usual statistics, such as the mean and standard deviation, do not make any sense for nominal-scale qualitative variables. As examples of nonmetric variables on nominal scales, we can mention professions, religion, color, marital status, geographic location, or country of origin. Imagine a nonmetric variable related to the country of origin of 10 large multinational companies. To represent the categories of the variable Country of origin, we can use numbers, assigning value 1 to the United States, 2 to the Netherlands, 3 to China, 4 to the United Kingdom, and 5 to Brazil, as shown in Table 2.4. In this case, the numbers are only labels or tags to help identify and classify objects. This scale of measurement is known as a nominal scale, that is, the numbers are arbitrarily assigned to the object categories, without any kind of order. To represent the behavior of nominal data, we can use descriptive statistics such as frequency distribution tables, bar or pie charts, or the calculation of the mode (Chapter 3). Next, we will discuss how to define labels for qualitative variables on a nominal scale by using the SPSS software (Statistical Package for the Social Sciences). After that, we will be able to construct absolute and relative frequency tables and charts. Before generating the dataset, let's define the characteristics of the variables being studied in Variable View (visualization of variables). In order to do that, click on the respective tab available in the lower left side of the Data Editor, or double-click on the column var.

TABLE 2.4 Companies and Country of Origin

Company | Country of Origin
Exxon Mobil | 1
JP Morgan Chase | 1
General Electric | 1
Royal Dutch Shell | 2
ICBC | 3
HSBC Holdings | 4
PetroChina | 3
Berkshire Hathaway | 1
Wells Fargo | 1
Petrobras | 5


The first variable, called Company, is a string, that is, its data are inserted as characters or letters. It was established that the maximum number of characters of the respective variable would be 18. In the column Measure, the scale of measurement of the variable Company is defined, which is nominal. The second variable, called Country, is numerical, since its data are inserted as numbers. However, the numbers are only used to categorize or label the objects, so the scale of measurement of the respective variable is also nominal (Fig. 2.4). To insert the data from Table 2.4, we go back to Data View. The information must be typed as shown in Fig. 2.5 (the columns represent the variables and the rows represent the observations or individuals). Since the variable Country is represented by numbers, it is necessary to assign labels to each variable category, as shown in Table 2.5. In order to do that, we must click on Data → Define Variable Properties… and select the variable Country, according to Figs. 2.6 and 2.7. Since the nominal scale of measurement of the variable Country has already been defined in the column Measure in Variable View, we can see that it already appears correctly in Fig. 2.8. Defining the labels for each category must be done at this moment, and it can also be seen in the same figure. The dataset is then displayed with the label names assigned, as shown in Fig. 2.9. By clicking on the Value Labels button located on the toolbar, it is possible to alternate between the numerical values of the nominal or ordinal variable and their respective labels. Having structured the dataset, it is possible to construct absolute and relative frequency tables and charts in SPSS.

FIG. 2.4 Defining the variable characteristics in Variable View.

FIG. 2.5 Inserting the data found in Table 2.4 into Data View.


TABLE 2.5 Categories Assigned to the Countries

Category | Country
1 | United States
2 | The Netherlands
3 | China
4 | The United Kingdom
5 | Brazil

FIG. 2.6 Defining labels for each nominal variable category.

The descriptive statistics to represent the behavior of a single qualitative variable and of two qualitative variables will be studied in Chapters 3 and 4, respectively.
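A brief sketch of the same idea outside SPSS, offered only as an illustration in Python with pandas (the object names are assumptions): the numeric codes of Table 2.4 are mapped to the labels of Table 2.5, playing the role of the Value Labels, and only counting per category is meaningful on a nominal scale.

import pandas as pd

# Country-of-origin codes of the ten companies in Table 2.4
country_code = pd.Series([1, 1, 1, 2, 3, 4, 3, 1, 1, 5])

# Labels from Table 2.5, playing the role of the SPSS "Value Labels"
labels = {1: "United States", 2: "The Netherlands", 3: "China",
          4: "The United Kingdom", 5: "Brazil"}
country = country_code.map(labels).astype("category")

# On a nominal scale the numbers are mere tags: counting per category (and the mode)
# is valid, but arithmetic on the codes, such as their mean, has no meaning
print(country.value_counts())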

2.3.2 Nonmetric Variables—Ordinal Scale

A nonmetric variable on an ordinal scale classifies the units into classes or categories regarding the characteristic being represented, establishing an order between the units of the different categories. An ordinal scale is a scale on which the data are shown in order, determining a relative position of the classes along one direction. Any set of values can be assigned to the variable categories, as long as the order between them is respected. As with the nominal scale, arithmetic operations (sums, subtractions, multiplications, and divisions) between these values do not make any sense. Thus, the application of the usual descriptive statistics is also limited, as it is for nominal variables. Since the scale numbers are only meant to classify the units, the descriptive statistics that can be used for ordinal data are frequency distribution tables, charts (including bar and pie charts), and the mode, as we will study in Chapter 3.


FIG. 2.7 Selecting the nominal variable Country.

Examples of ordinal variables include consumers' opinion and satisfaction scales, educational level, social class, age, etc. Imagine a nonmetric variable called Classification that measures a group of consumers' preference regarding a certain wine brand. The definition of labels for each ordinal variable category can be found in Table 2.6. Value 1 is assigned to the worst classification, value 2 to the second worst, and so on, until value 5, which is the best classification, as shown in this table. Instead of using a scale from 1 to 5, we could have assigned any other numerical scale, as long as the order of classification had been respected. Thus, the numerical values do not represent a score of the product's quality; they are only meant to classify it, and the differences between these values do not represent differences in the attribute analyzed. These scales of measurement are known as ordinal scales. Fig. 2.10 shows the characteristics of the variables being studied in Variable View in SPSS. The variable Customer is a string (its data are inserted as characters or letters) with a nominal scale of measurement. On the other hand, the variable Classification is numerical (numerical values were assigned to represent the variable categories) with an ordinal scale of measurement. The procedure for defining labels for qualitative variables on an ordinal scale is the same as the one already presented for nominal variables.
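The ordinal nature of the scale in Table 2.6 can also be represented outside SPSS. The sketch below, in Python with pandas, uses hypothetical answers and is only an illustration of the idea that, for ordinal data, frequencies and the mode are valid summaries while differences between the codes are not:

import pandas as pd

# Hypothetical classifications given by a few consumers, using the labels of Table 2.6
answers = ["Good", "Very good", "Average", "Good", "Bad", "Good"]
classification = pd.Series(pd.Categorical(
    answers,
    categories=["Very bad", "Bad", "Average", "Good", "Very good"],
    ordered=True,  # the order of the categories matters (ordinal scale)
))

# Valid descriptive statistics for ordinal data: frequencies and the mode
print(classification.value_counts(sort=False))
print(classification.mode())
# Differences between the underlying codes 1 to 5 would not measure differences in quality.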


FIG. 2.8 Defining the labels for the variable Country.

FIG. 2.9 Dataset with labels.


TABLE 2.6 Consumers' Classification of a Certain Wine Brand

Value | Label
1 | Very bad
2 | Bad
3 | Average
4 | Good
5 | Very good

FIG. 2.10 Defining the variable characteristics in Variable View.

2.3.3 Quantitative Variable—Interval Scale

According to Stevens' (1946) classification, metric or quantitative variables have data on an interval or ratio scale. Besides ordering the units based on the characteristic being measured, the interval scale has a constant unit of measure. The origin or point zero of this scale of measurement is arbitrary, and it does not express an absence of quantity. A classic example of an interval scale is temperature measured in Celsius (°C) or in Fahrenheit (°F). The choice of temperature zero is arbitrary, and equal temperature differences are determined by the identification of equal expansion volumes of the liquid inside the thermometer. Hence, the interval scale allows us to infer differences between the units being measured. However, we cannot state that a value at a specific point of the scale is a multiple of another one. For instance, assume that two objects are measured at 15°C and 30°C, respectively. Measuring the temperature allows us to determine how much hotter one object is than the other. However, we cannot state that the object at 30°C is twice as hot as the one at 15°C. The interval scale is invariant under positive linear transformations, so one interval scale can be transformed into another through a positive linear transformation. Transforming degrees Celsius into degrees Fahrenheit is an example of a linear transformation. Most descriptive statistics can be applied to data on an interval scale, except statistics based on the ratio scale, such as the coefficient of variation.
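A small worked sketch of this point (plain Python, purely illustrative): the Celsius-to-Fahrenheit conversion is a positive linear transformation, and, because the zero point is arbitrary, ratios between temperatures are not preserved.

# Interval scale: F = 1.8*C + 32 is a positive linear transformation of Celsius
def celsius_to_fahrenheit(c: float) -> float:
    return 1.8 * c + 32.0

c1, c2 = 15.0, 30.0
f1, f2 = celsius_to_fahrenheit(c1), celsius_to_fahrenheit(c2)

print(c2 / c1)  # 2.0   -> 30 degrees C is twice as many degrees Celsius as 15 degrees C
print(f2 / f1)  # ~1.46 -> measured in Fahrenheit (59 F and 86 F), the same two objects
                #          no longer stand in a 2:1 ratio, so "twice as hot" is meaningless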

2.3.4 Quantitative Variable—Ratio Scale

Analogous to the interval scale, the ratio scale orders the units based on the characteristic measured and has a constant unit of measure. In contrast, its origin (or point zero) is unique, and the value zero expresses the absence of quantity. Therefore, it is possible to know whether a value at a specific point of the scale is a multiple of another. Equal ratios between values of the scale correspond to equal ratios between the units measured. Thus, ratio scales are invariant under positive proportional transformations. For example, if one unit is 1 m high and another is 3 m high, we can say that the latter is three times as high as the former. Among the scales of measurement, the ratio scale is the most complete, because it allows us to use all arithmetic operations. In addition, all descriptive statistics can be applied to the data of a variable expressed on a ratio scale. Examples of variables whose data can be on the ratio scale include income, age, how many units of a certain product were manufactured, and distance traveled.

2.4 TYPES OF VARIABLES × NUMBER OF CATEGORIES AND SCALES OF ACCURACY

Qualitative or categorical variables can also be classified based on the number of categories: (a) dichotomous or binary (dummies), when they only take on two categories; (b) polychotomous, when they take on more than two categories. On the other hand, metric or quantitative variables can also be classified based on the scale of accuracy: discrete or continuous. This classification can be seen in Fig. 2.11.

2.4.1 Dichotomous or Binary Variable (Dummy)

A dichotomous or binary variable (dummy) can only take on two categories, and the values 0 or 1 are assigned to these categories. Value 1 is assigned when the characteristic of interest is present in the variable and value 0 otherwise. As examples, we have: smokers (1) and nonsmokers (0), a developed country (1) and an underdeveloped country (0), vaccinated patients (1) and nonvaccinated patients (0). Multivariate dependence techniques have as their main objective to specify a model that can explain and predict the behavior of one or more dependent variables through one or more explanatory variables. Many of these techniques, including simple and multiple regression analysis, binary and multinomial logistic regression, regression for count data, and multilevel modeling, among others, can easily and coherently be applied with the use of nonmetric explanatory variables, as long as they are transformed into binary variables that represent the categories of the original qualitative variable. In this regard, a qualitative variable with n categories, for example, can be represented by (n − 1) binary variables. For instance, imagine a variable called Evaluation, expressed by the categories good, average, or bad. Thus, two binary variables may be necessary to represent the original variable, depending on the researcher's objectives, as shown in Table 2.7. Further details about the definition of dummy variables in confirmatory models will be discussed in Chapter 13, including the presentation of the operations necessary to generate them in software such as Stata.
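A minimal sketch of this (n - 1) dummy coding, using Python with pandas on hypothetical data rather than the Stata commands discussed in Chapter 13; the category Good is taken as the reference and receives zeros on both dummies, as in Table 2.7:

import pandas as pd

# Hypothetical observations of the qualitative variable Evaluation (n = 3 categories)
df = pd.DataFrame({"Evaluation": ["Good", "Average", "Bad", "Good", "Bad"]})

# n - 1 = 2 binary (dummy) variables; "Good" is kept as the reference category
dummies = pd.get_dummies(df["Evaluation"], prefix="D").astype(int)
dummies = dummies[["D_Average", "D_Bad"]]  # correspond to D1 and D2 in Table 2.7

print(pd.concat([df, dummies], axis=1))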

2.4.2

Polychotomous Variable

A qualitative variable can take on more than two categories and, in this case, it is called polychotomous. As examples, we can mention social classes (lower, middle, and upper) and educational levels (elementary school, high school, college, and graduate school).

2.4.3

Discrete Quantitative Variable

As described in Section 2.2.2, discrete quantitative variables can take on a finite set of values that frequently come from a count, such as, for example, the number of children in a family (0, 1, 2…), the number of senators elected, or the number of cars manufactured in a certain factory.

2.4.4

Continuous Quantitative Variable

Continuous quantitative variables, on the other hand, are those whose possible values are in an interval with real numbers and result from a metric measurement, as, for example, weight, height, or an individual’s salary (Bussab and Morettin, 2011).

FIG. 2.11 Qualitative variables × number of categories and quantitative variables × scales of accuracy.


TABLE 2.7 Defining Binary Variables (Dummies) for the Variable Evaluation

Evaluation    D1    D2
Good          0     0
Average       1     0
Bad           0     1

2.5 FINAL REMARKS

Whenever data are treated and analyzed through appropriate statistical techniques, they are transformed into information and can support the decision-making process. These data can be metric (quantitative) or nonmetric (categorical or qualitative). Metric data represent characteristics of an individual, object, or element that result from a count or measurement (patients' weight, age, and interest rates, among other examples). Nonmetric data represent characteristics that cannot be measured or quantified (answers such as yes or no, educational levels, among others). According to Stevens (1946), the scales of measurement of nonmetric (categorical or qualitative) variables can be classified as nominal or ordinal, while metric or quantitative variables are measured on interval or ratio (proportional) scales.

A lot of data can be collected in a metric as well as in a nonmetric way. Assume that we wish to assess the quality of a certain product. To do that, scores from 1 to 10 can be assigned to certain attributes, or a Likert scale can be defined from information that has already been established. In general, and whenever possible, questions should be defined in a quantitative way, so that the researcher does not lose information. For Fávero et al. (2009), generating the questionnaire and defining the variables' scales of measurement will depend on several aspects, including the research objectives, the modeling to be adopted to achieve such objectives, the average time needed to apply the questionnaire, and how the data will be collected. A dataset can contain variables on metric as well as nonmetric scales; it does not need to restrict itself to only one type of scale. This combination can generate interesting research and, together with suitable modeling, produce information aimed at assisting the decision-making process. The type of variable collected is crucial in the calculation of descriptive statistics and in the graphical representation of results, as well as in the selection of the statistical methods that will be used to analyze the data.

2.6 EXERCISES

1) What is the difference between qualitative and quantitative variables?
2) What are scales of measurement and what are the main types of scales? What are the differences between them?
3) What is the difference between discrete and continuous variables?
4) Classify the variables below according to the following scales: nominal, ordinal, binary, discrete, or continuous.
a. A company's revenue.
b. A performance rank: good, average, and bad.
c. Time to process a part.
d. Number of cars sold.
e. Distance traveled in km.
f. Municipalities in the Greater Sao Paulo.
g. Family income ranges.
h. A student's grades: A, B, C, D, O, or R.
i. Hours worked.
j. Region: North, Northeast, Center-West, South, and Southeast.
k. Location: Sao Paulo or Seoul.
l. Size of the organization: small, medium, and large.


m. Number of bedrooms.
n. Classification of risk: high, average, speculative, substantial, in moratorium.
o. Married: yes or no.
5) A researcher wishes to study the impact of physical aptitude on the improvement of productivity in an organization. How would you describe the binary variables to be included in this model, so that the variable physical aptitude could be represented? The possible variable categories are: (a) active and healthy; (b) acceptable (could be better); (c) not good enough; (d) sedentary.

Chapter 3

Univariate Descriptive Statistics

Mathematics is the alphabet with which God has written the Universe.
Galileo Galilei

3.1 INTRODUCTION

Descriptive statistics describes and summarizes the main characteristics observed in a dataset through tables, charts, graphs, and summary measures, allowing the researcher to have a better understanding of the data's behavior. The analysis is based on the dataset being studied (the sample), without drawing conclusions or inferences about the population. Researchers can use descriptive statistics to study a single variable (univariate descriptive statistics), two variables (bivariate descriptive statistics), or more than two variables (multivariate descriptive statistics). In this chapter, we will study the concepts of descriptive statistics involving a single variable. Univariate descriptive statistics considers the following topics: (a) the frequency in which a set of data occurs, through frequency distribution tables; (b) the representation of the variable's distribution through charts; and (c) measures that represent a data series, such as measures of position or location, measures of dispersion or variability, and measures of shape (skewness and kurtosis).

The four main goals of this chapter are: (1) to introduce the most common concepts related to the tables, charts, and summary measures in univariate descriptive statistics; (2) to present their applications in real examples; (3) to construct tables, charts, and summary measures using Excel and the statistical software SPSS and Stata; and (4) to discuss the results achieved.

As described in the previous chapter, before we begin using descriptive statistics, it is necessary to identify the type of variable being studied. The type of variable is essential when calculating descriptive statistics and in the graphical representation of the results. Fig. 3.1 shows the univariate descriptive statistics that will be studied in this chapter, represented by tables, charts, graphs, and summary measures, for each type of variable. Fig. 3.1 summarizes the following information:

a) The descriptive statistics used to represent the behavior of a qualitative variable's data are frequency distribution tables and graphs/charts.
b) The frequency distribution table for a qualitative variable represents the frequency in which each variable category occurs.
c) The graphical representation of qualitative variables can be illustrated by bar charts (horizontal and vertical), pie charts, and the Pareto chart.
d) For quantitative variables, the most common descriptive statistics are charts and summary measures (measures of position or location, of dispersion or variability, and of shape). Frequency distribution tables can also be used to represent the frequency in which each possible value of a discrete variable occurs, or to represent the frequency of continuous data grouped into classes.
e) Line graphs, dot or dispersion plots, histograms, stem-and-leaf plots, and boxplots (box-and-whisker diagrams) are normally used for the graphical representation of quantitative variables.
f) Measures of position or location can be divided into measures of central tendency (mean, mode, and median) and quantiles (quartiles, deciles, and percentiles).
g) The most common measures of dispersion or variability are the range, average deviation, variance, standard deviation, standard error, and coefficient of variation.
h) The measures of shape include measures of skewness and kurtosis.


FIG. 3.1 A brief summary of univariate descriptive statistics. *The mode, which provides the most frequent value of the variable, is the only summary measure that can also be used for qualitative variables.

3.2 FREQUENCY DISTRIBUTION TABLE

Frequency distribution tables can be used to represent the frequency in which a set of data with qualitative or quantitative variables occurs. In the case of qualitative variables, the table represents the frequency in which each variable category occurs. For discrete quantitative variables, the frequency of occurrences is calculated for each distinct value of the variable. Continuous variable data, on the other hand, are first grouped into classes, and then we calculate the frequency in which each class occurs. A frequency distribution table contains the following calculations:

a) Absolute frequency (Fi): number of times each value i appears in the sample.
b) Relative frequency (Fri): percentage of the total corresponding to the absolute frequency.
c) Cumulative frequency (Fac): sum of the absolute frequencies of all values less than or equal to the value being analyzed.
d) Relative cumulative frequency (Frac): percentage corresponding to the cumulative frequency (sum of the relative frequencies of all values less than or equal to the value being considered).
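As a computational illustration only (an addition to this text, not part of the original), the four frequencies above can be obtained with a short Python sketch; the sample of categories used here is hypothetical.

from collections import Counter

# Hypothetical sample of a qualitative variable (blood types of donors)
sample = ["A+", "O+", "A+", "B+", "O+", "O-", "A+", "AB+", "O+", "B+"]

n = len(sample)
absolute = Counter(sample)            # Fi: number of occurrences of each category

fac = 0
print(f"{'Category':<10}{'Fi':>5}{'Fri(%)':>10}{'Fac':>6}{'Frac(%)':>10}")
for category in sorted(absolute):     # categories listed in a fixed (alphabetical) order
    fi = absolute[category]
    fri = 100 * fi / n                # Fri: relative frequency (%)
    fac += fi                         # Fac: cumulative frequency
    frac = 100 * fac / n              # Frac: relative cumulative frequency (%)
    print(f"{category:<10}{fi:>5}{fri:>10.2f}{fac:>6}{frac:>10.2f}")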

3.2.1 Frequency Distribution Table for Qualitative Variables

Through a practical example, we will build the frequency distribution table using the calculations of the absolute frequency, relative frequency, cumulative frequency, and relative cumulative frequency for each category of the qualitative variable being analyzed. Example 3.1 Saint August Hospital provides 3000 blood transfusions to hospitalized patients every month. In order for the hospital to be able to maintain its stocks, 60 blood donations a day are necessary. Table 3.E.1 shows the total number of donors for each blood type on a certain day. Build the frequency distribution table for this problem.

TABLE 3.E.1 Total Number of Donors of Each Blood Type

Blood Type    Donors
A+            15
A−            2
B+            6
B−            1
AB+           1
AB−           1
O+            32
O−            2

Solution The complete frequency distribution table for Example 3.1 is shown in Table 3.E.2:

TABLE 3.E.2 Frequency Distribution of Example 3.1

Blood Type    Fi    Fri (%)    Fac    Frac (%)
A+            15    25         15     25
A−            2     3.33       17     28.33
B+            6     10         23     38.33
B−            1     1.67       24     40
AB+           1     1.67       25     41.67
AB−           1     1.67       26     43.33
O+            32    53.33      58     96.67
O−            2     3.33       60     100
Sum           60    100

3.2.2 Frequency Distribution Table for Discrete Data

Through the frequency distribution table, we can calculate the absolute frequency, the relative frequency, the cumulative frequency, and the relative cumulative frequency for each possible value of the discrete variable. Different from qualitative variables, instead of the possible categories we must have the possible numeric values. To facilitate understanding, the data must be presented in ascending order. Example 3.2 A Japanese restaurant is defining the new layout for its tables and, in order to do that, it collected information on the number of people who have lunch and dinner at each table throughout one week. Table 3.E.3 shows the first 40 pieces of data collected. Build the frequency distribution table for these data.

TABLE 3.E.3 Number of People per Table

2    5    4    7    4    1    6    2    2    5
4    12   8    6    4    5    2    8    2    6
4    7    2    5    6    4    1    5    10   2
2    10   6    4    3    4    6    3    8    4


Solution In the next table, each row of the first column represents a possible numeric value of the variable being analyzed. The data are sorted in ascending order. The complete frequency distribution table for Example 3.2 is shown below.

TABLE 3.E.4 Frequency Distribution for Example 3.2

Number of People    Fi    Fri (%)    Fac    Frac (%)
1                   2     5          2      5
2                   8     20         10     25
3                   2     5          12     30
4                   9     22.5       21     52.5
5                   5     12.5       26     65
6                   6     15         32     80
7                   2     5          34     85
8                   3     7.5        37     92.5
10                  2     5          39     97.5
12                  1     2.5        40     100
Sum                 40    100

3.2.3 Frequency Distribution Table for Continuous Data Grouped into Classes

As described in Chapter 2, continuous quantitative variables are those whose possible values are in an interval of real numbers. Therefore, it makes no sense to calculate the frequency of each possible value, since values rarely repeat themselves; it is better to group the data into classes or ranges. The interval between classes is defined arbitrarily; however, if the number of classes is too small, a lot of information may be lost, and, if it is too large, the summarizing power of the table is compromised (Bussab and Morettin, 2011). The interval between classes does not need to be constant, but, to keep things simple, we will assume the same interval. The following steps must be taken to build a frequency distribution table for continuous data:

Step 1: Sort the data in ascending order.
Step 2: Determine the number of classes (k), using one of the options:
a) Sturges' rule: k = 1 + 3.3 · log(n)
b) The square-root rule: k = √n
where n is the sample size. The value of k must be rounded to an integer.
Step 3: Determine the interval between the classes (h), calculated as the range of the sample (A = maximum value − minimum value) divided by the number of classes:
h = A / k
The value of h is rounded up to the nearest integer.
Step 4: Build the frequency distribution table (calculate the absolute frequency, the relative frequency, the cumulative frequency, and the relative cumulative frequency) for each class, as illustrated in the sketch below. The lower limit of the first class corresponds to the minimum value of the sample. To determine the upper limit of each class, we add the value of h to the lower limit of the respective class. The lower limit of a new class corresponds to the upper limit of the previous class.
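The sketch below is an illustrative addition (not part of the original text) that applies these four steps in Python; it reuses the grades of Example 3.3, and the rounding conventions (round for k, ceiling for h) are one possible reading of the steps above.

import math

# Grades of Example 3.3 (Table 3.E.5), used here to illustrate the four steps
data = sorted([4.2, 3.9, 5.7, 6.5, 4.6, 6.3, 8.0, 4.4, 5.0, 5.5,
               6.0, 4.5, 5.0, 7.2, 6.4, 7.2, 5.0, 6.8, 4.7, 3.5,
               6.0, 7.4, 8.8, 3.8, 5.5, 5.0, 6.6, 7.1, 5.3, 4.7])  # Step 1
n = len(data)

k = round(1 + 3.3 * math.log10(n))          # Step 2: Sturges' rule -> 6 classes
h = math.ceil((max(data) - min(data)) / k)  # Step 3: class width rounded up -> 1

lower = min(data)
fac = 0
for _ in range(k):                          # Step 4: one row per class
    upper = lower + h
    # lower limit included, upper limit excluded (the |- notation of the text)
    fi = sum(1 for x in data if lower <= x < upper)
    fac += fi
    print(f"{lower:.1f} |- {upper:.1f}: Fi={fi}  Fri={100 * fi / n:.2f}%  "
          f"Fac={fac}  Frac={100 * fac / n:.2f}%")
    lower = upper

The printed frequencies reproduce Table 3.E.7 below.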


Example 3.3 Consider the data in Table 3.E.5 regarding the grades of 30 students enrolled in the subject Financial Market. Elaborate a frequency distribution table for this problem.

TABLE 3.E.5 Grades of 30 Students Enrolled in the Subject Financial Market

4.2    3.9    5.7    6.5    4.6    6.3    8.0    4.4    5.0    5.5
6.0    4.5    5.0    7.2    6.4    7.2    5.0    6.8    4.7    3.5
6.0    7.4    8.8    3.8    5.5    5.0    6.6    7.1    5.3    4.7

Note: To determine the number of classes, use Sturges' rule.

Solution Let’s apply the four steps to build the frequency distribution table of Example 3.3, whose variables are continuous: Step 1: Let’s sort the data in ascending order, as shown in Table 3.E.6.

TABLE 3.E.6 Data From Table 3.E.5 Sorted in Ascending Order

3.5    3.8    3.9    4.2    4.4    4.5    4.6    4.7    4.7    5
5      5      5      5.3    5.5    5.5    5.7    6      6      6.3
6.4    6.5    6.6    6.8    7.1    7.2    7.2    7.4    8      8.8

Step 2: Let's determine the number of classes (k) by using Sturges' rule:

$$k = 1 + 3.3 \cdot \log(30) = 5.87 \approx 6$$

Step 3: The interval between the classes (h) is given by:

$$h = \frac{A}{k} = \frac{8.8 - 3.5}{6} = 0.88 \approx 1$$

Step 4: Finally, let's build the frequency distribution table for each class. The lower limit of the first class corresponds to the minimum grade, 3.5. From this value, we add the interval between the classes (1), so that the upper limit of the first class will be 4.5. The second class starts from this value, and so on, until the last class is defined. We use the notation ├ to indicate that the lower limit is included in the class and the upper limit is not. The complete frequency distribution table for Example 3.3 (Table 3.E.7) is presented.

TABLE 3.E.7 Frequency Distribution for Example 3.3

Class        Fi    Fri (%)    Fac    Frac (%)
3.5 ├ 4.5    5     16.67      5      16.67
4.5 ├ 5.5    9     30         14     46.67
5.5 ├ 6.5    7     23.33      21     70
6.5 ├ 7.5    7     23.33      28     93.33
7.5 ├ 8.5    1     3.33       29     96.67
8.5 ├ 9.5    1     3.33       30     100
Sum          30    100

3.3 GRAPHICAL REPRESENTATION OF THE RESULTS

The behavior of qualitative and quantitative variable data can also be represented graphically. Charts are a representation of numeric data in the form of geometric figures (graphs, diagrams, drawings, or images), allowing the reader to interpret the data quickly and objectively. Section 3.3.1 illustrates the main graphical representations for qualitative variables: bar charts (horizontal and vertical), pie charts, and the Pareto chart. The graphical representation of quantitative variables is usually illustrated by line graphs, dot plots, histograms, stem-and-leaf plots, and boxplots (or box-and-whisker diagrams), as shown in Section 3.3.2.

Bar charts (horizontal and vertical), pie charts, the Pareto chart, line graphs, dot plots, and histograms will be generated in Excel. The boxplots and histograms will also be constructed using SPSS and Stata. To build a chart in Excel, the variables' data and names must first be standardized, codified, and selected in a spreadsheet. The next step consists of clicking on the Insert tab and, in the Charts group, selecting the type of chart we are interested in (Columns, Rows, Pie, Bar, Area, Scatter, or Other Charts). The chart is generated automatically on the screen and can be personalized according to the researcher's preferences.

Excel offers a variety of chart styles, layouts, and formats. To use them, the researcher just needs to select the plotted chart and click on the Design, Layout, or Format tab. On the Layout tab, for example, there are many resources available, such as Chart Title; Axis Titles (shows the names of the horizontal and vertical axes); Legend (shows or hides the legend); Data Labels (allows the researcher to insert the series name, the category name, or the values of the labels in the desired place); Data Table (shows the data table below the chart, with or without legend codes); Axes (allows the researcher to personalize the scale of the horizontal and vertical axes); and Gridlines (shows or hides horizontal and vertical gridlines). The Chart Title, Axis Titles, Legend, Data Labels, and Data Table icons are in the Labels group, while the Axes and Gridlines icons are in the Axes group.

3.3.1 Graphical Representation for Qualitative Variables

3.3.1.1 Bar Chart
This type of chart is widely used for nominal and ordinal qualitative variables, but it can also be used for discrete quantitative variables, because it allows us to investigate the presence of data trends. As its name indicates, this chart represents, through bars, the absolute or relative frequencies of each possible category (or numeric value) of a qualitative (or quantitative) variable. In vertical bar charts, each variable category is shown on the X-axis as a bar of constant width, and the height of the respective bar indicates the frequency of the category on the Y-axis. Conversely, in horizontal bar charts, each variable category is shown on the Y-axis as a bar of constant height, and the length of the respective bar indicates the frequency of the category on the X-axis. Let's now build horizontal and vertical bar charts from a practical example. Example 3.4 A bank created a satisfaction survey, which was used with 120 customers, trying to measure how agile its services were (excellent, good, satisfactory, and poor). The absolute frequencies for each category are presented in Table 3.E.8. Construct a vertical and a horizontal bar chart for this problem.

TABLE 3.E.8 Frequencies of Occurrences per Category

Satisfaction    Absolute Frequency
Excellent       58
Good            18
Satisfactory    32
Poor            12

Solution Let’s build the vertical and horizontal bar charts of Example 3.4 in Excel.

FIG. 3.2 Vertical bar chart for Example 3.4.

FIG. 3.3 Horizontal bar chart for Example 3.4.

First, the data in Table 3.E.8 must be standardized, codified, and selected in a spreadsheet. After that, we can click on the Insert tab and, in the Charts group, and select the option Columns. The chart is automatically generated on the screen. Next, to personalize the chart, while clicking on it, we must select the following icons on the Layout tab: (a) Axis Titles: let’s select the title for the horizontal axis (Satisfaction) and for the vertical axis (Frequency); (b) Legend: to hide the legend, we must click on None; (c) Data Labels: clicking on More Data Label Options, the option Value must be selected in Label Contains (or we can select the option Outside End). Fig. 3.2 shows the vertical bar chart of Example 3.4 generated in Excel. Based on Fig. 3.2, we can see that the categories of the variable being analyzed are presented on the X-axis by bars with the same width and their respective heights indicate the frequencies on the Y-axis. To construct the horizontal bar chart, we must select the option Bar instead of Columns. The other steps follow the same logic. Fig. 3.3 represents the frequency data from Table 3.E.8 through a horizontal bar chart constructed in Excel. The horizontal bar chart in Fig. 3.3 represents the categories of the variable on the Y-axis and their respective frequencies on the X-axis. For each variable category, we draw a bar with a length that corresponds to its frequency. Therefore, this chart only offers information related to the behavior of each category of the original variable and to the generation of investigations regarding the type of distribution, not allowing us to calculate position, dispersion, skewness or kurtosis measures, since the variable being studied is qualitative.

3.3.1.2 Pie Chart
Another way to represent qualitative data, in terms of relative frequencies (percentages), is the pie chart. The chart corresponds to a circle with an arbitrary radius (the whole) divided into sectors or slices of several different sizes (parts of the whole).


This chart allows the researcher to visualize the data as slices of a pie or parts of a whole. Let’s now build the pie chart from a practical example. Example 3.5 An election poll was carried out in the city of Sao Paulo to check voters’ preferences concerning the political parties running in the next elections for Mayor. The percentage of voters per political party can be seen in Table 3.E.9. Construct a pie chart for Example 3.5.

TABLE 3.E.9 Percentage of Voters per Political Party

Political Party    Percentage
PMDB               18
PSDB               22
PDT                12.5
PT                 24.5
PC do B            8
PV                 5
Others             10

Solution Let’s build the pie chart for Example 3.5 in Excel. The steps are similar to the ones in Example 3.4. However, we now have to select the option Pie in the Charts group, on the Insert tab. Fig. 3.4 presents the pie chart obtained in Excel for the data shown in Table 3.E.9. FIG. 3.4 Pie chart of Example 3.5.


3.3.1.3 Pareto Chart
The Pareto chart is a quality control tool whose main objective is to investigate the types of problems and, consequently, to identify their respective causes, so that action can be taken to reduce or eliminate them. The Pareto chart combines bars and a line graph: the bars represent the absolute frequencies of occurrence of the problems, and the line represents the relative cumulative frequencies. The problems are sorted in descending order of priority. Let's now illustrate a practical example with a Pareto chart.


Example 3.6 A manufacturer of credit and magnetic cards has as its main objective to reduce the number of defective cards. The quality inspector classified a sample of 1000 cards that were collected during one week of production, according to the types of defects found, as shown in Table 3.E.10. Construct a Pareto chart for this problem.

TABLE 3.E.10 Frequencies of the Occurrence of Each Defect

Type of Defect        Absolute Frequency (Fi)
Damaged/Bent          71
Perforated            28
Illegible printing    12
Wrong characters      20
Wrong numbers         44
Others                6
Total                 181

Solution The first step in generating a Pareto chart is to sort the defects in order of priority (from the highest to the lowest frequency). The bar chart represents the absolute frequency of each defect. To construct the line graph, it is necessary to calculate the relative cumulative frequency (%) up to the defect analyzed. Table 3.E.11 shows the absolute frequency for each type of defect, in descending order, and the relative cumulative frequency (%).

TABLE 3.E.11 Absolute Frequency for Each Defect and the Relative Cumulative Frequency (%)

Type of Defect        Number of Defects    Cumulative %
Damaged/Bent          71                   39.23
Wrong numbers         44                   63.54
Perforated            28                   79.01
Wrong characters      20                   90.06
Illegible printing    12                   96.69
Others                6                    100
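Purely as an illustration (not from the original text), the priority ordering and the cumulative percentages of Table 3.E.11 can be reproduced with a few lines of Python; the defect counts below are the ones given in Table 3.E.10.

defects = {"Damaged/Bent": 71, "Perforated": 28, "Illegible printing": 12,
           "Wrong characters": 20, "Wrong numbers": 44, "Others": 6}

total = sum(defects.values())
ordered = sorted(defects.items(), key=lambda kv: kv[1], reverse=True)  # priority order

cumulative = 0
for defect, count in ordered:
    cumulative += count
    # bar height = absolute frequency; line value = relative cumulative frequency (%)
    print(f"{defect:<20}{count:>5}{100 * cumulative / total:>10.2f}")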

Let’s now build a Pareto chart for Example 3.6 in Excel, using the data in Table 3.E.11. First, the data in Table 3.E.11 must be standardized, codified, and selected in an Excel spreadsheet. In the Charts group, on the Insert tab, let’s select the option Columns (and the clustered column subtype). Note that the chart is automatically generated on the screen. However, absolute frequency data as well as relative cumulative frequency data are presented as columns. To change the type of chart related to the cumulative percentage, we must click with the right button on any bar of the respective series and select the option Change Series Chart Type, followed by a line graph with markers. The resulting chart is a Pareto chart. To personalize the Pareto chart, we must use the following icons on the Layout tab: (a) Axis Titles: for the bar chart, we selected the title for the horizontal axis (Type of defect) and for the vertical axis (Frequency); for the line graph, we called the vertical axis Percentage; (b) Legend: to hide the legend, we must click on None; (c) Data Table: let’s select the option Show Data Table with Legend Keys; (d) Axes: the main unit of the vertical axes for both charts is set in 20 and the maximum value of the vertical axis for line graphs, in 100. Fig. 3.5 shows the chart constructed in Excel that corresponds to the Pareto chart for Example 3.6.


FIG. 3.5 The Pareto chart for Example 3.6. Legend: A, Damaged/Bent; B, Wrong numbers; C, Perforated; D, Wrong characters; E, Illegible printing; F, Others.

3.3.2 Graphical Representation for Quantitative Variables

3.3.2.1 Line Graph
In a line graph, points are represented by the intersection of the variables involved on the horizontal axis (X) and on the vertical axis (Y), and they are connected by straight lines. Although it considers two axes, the line graph will be used in this chapter to represent the behavior of a single variable. The graph shows the evolution or trend of a quantitative variable's data, which is usually continuous, at regular intervals. The numeric values of the variable are represented on the Y-axis, while the X-axis only shows the data distribution in a uniform way. Let's now illustrate a practical example of a line graph. Example 3.7 Cheap & Easy is a supermarket that registered the percentage of losses it had in the last 12 months (Table 3.E.12), based on which it will adopt new prevention measures. Build a line graph for Example 3.7.

TABLE 3.E.12 Percentage of Losses in the Last 12 Months

Month        Losses (%)
January      0.42
February     0.38
March        0.12
April        0.34
May          0.22
June         0.15
July         0.18
August       0.31
September    0.47
October      0.24
November     0.42
December     0.09


Solution To build the line graph for Example 3.7 in Excel, in the Charts group, on the Insert tab, we must select the option Lines. The other steps follow the same logic of the previous examples. The complete chart can be seen in Fig. 3.6.

FIG. 3.6 Line graph for Example 3.7.

3.3.2.2 Scatter Plot
A scatter plot is very similar to a line graph; the biggest difference between them is the way the data are plotted on the horizontal axis. As in a line graph, the points are represented by the intersection of the variables on the X-axis and the Y-axis; however, they are not connected by straight lines. The scatter plot studied in this chapter is used to show the evolution or trend of a single quantitative variable's data, similar to the line graph, but generally at irregular intervals. Analogous to a line graph, the numeric values of the variable are represented on the Y-axis, and the X-axis only represents the data behavior throughout time. In the next chapter, we will see how a scatter plot can be used to describe the behavior of two variables simultaneously (bivariate analysis), with the numeric values of one variable represented on the Y-axis and those of the other on the X-axis. Example 3.8 Papermisto supplies three types of raw materials for the production of paper: cellulose, mechanical pulp, and trimmings. In order to maintain its quality standards, the factory carries out a rigorous inspection of its products during each production phase. At irregular intervals, an operator must verify the esthetic and dimensional characteristics of the selected product with specialized instruments. For instance, in the cellulose storage phase, the product must be piled up in bales of approximately 250 kg each. Table 3.E.13 shows the weight of the bales collected in the last 5 hours, at irregular intervals varying between 20 and 45 minutes. Construct a scatter plot for Example 3.8.

TABLE 3.E.13 Evolution of the Weight of the Bales Throughout Time

Time (min)    Weight (kg)
30            250
50            255
85            252
106           248
138           250
178           249
198           252
222           251
252           250
297           245


Solution To build the scatter plot for Example 3.8 in Excel, in the Charts group, on the Insert tab, we must select the option Scatter. The other steps follow the same logic of the previous examples. The scatter plot can be seen in Fig. 3.7. FIG. 3.7 Scatter plot for Example 3.8.


3.3.2.3 Histogram
A histogram is a vertical bar chart that represents the frequency distribution of a quantitative variable (discrete or continuous). The values of the variable being studied are presented on the X-axis (the base of each bar, of constant width, represents each possible value of the discrete variable or each class of continuous values, sorted in ascending order), while the height of the bars on the Y-axis represents the frequency distribution (absolute, relative, or cumulative) of the respective values.

A histogram is very similar to a Pareto chart, and it is also one of the seven quality tools. A Pareto chart represents the frequency distribution of a qualitative variable (types of problem), whose categories on the X-axis are sorted in order of priority (from the category with the highest frequency to the one with the lowest), whereas a histogram represents the frequency distribution of a quantitative variable, whose values on the X-axis are sorted in ascending order.

Therefore, the first step in elaborating a histogram is building the frequency distribution table. As presented in Sections 3.2.2 and 3.2.3, for each possible value of a discrete variable or for each class of continuous data, we calculate the absolute frequency, the relative frequency, the cumulative frequency, and the relative cumulative frequency, with the data sorted in ascending order. The histogram is then constructed from this table: the first column, which contains the numeric values or the classes of the variable being studied, is presented on the X-axis, and the column of absolute frequencies (or relative, cumulative, or relative cumulative frequencies) is presented on the Y-axis. Many statistical software packages generate the histogram automatically from the original values of the quantitative variable being studied, without the need to calculate the frequencies. Even though Excel can build a histogram from its analysis tools, we will show how to build it from the column chart, due to its simplicity. Example 3.9 In order to improve its services, a national bank is hiring new managers to serve its corporate clients. Table 3.E.14 shows the number of companies dealt with daily in one of its main branches in the capital. Elaborate a histogram from these data using Excel.

TABLE 3.E.14 Number of Companies Dealt With Daily

13    11    13    10    11    12    8     12    9     10
12    10    8     11    9     11    14    11    10    9


Solution The first step is building the frequency distribution table: From the data in Table 3.E.15, we can build a histogram of absolute frequency, relative frequency, cumulative frequency, or relative cumulative frequency using Excel. The histogram generated will be the absolute frequency one. Thus, we must standardize, codify, and select the first two columns of Table 3.E.15 (except the last row: Sum) in an Excel spreadsheet. In the Charts group, on the Insert tab, let’s select the option Columns. Let’s click on the chart so that it can be personalized. On the Layout tab, we selected the following icons: (a) Axis Titles: select the title for the horizontal axis (Number of companies) and for the vertical axis (Absolute frequency); (b) Legend: to hide the legend, we must click on None. The histogram generated in Excel can be seen in Fig. 3.8.

TABLE 3.E.15 Frequency Distribution for Example 3.9

Number of Companies    Fi    Fri (%)    Fac    Frac (%)
8                      2     10         2      10
9                      3     15         5      25
10                     4     20         9      45
11                     5     25         14     70
12                     3     15         17     85
13                     2     10         19     95
14                     1     5          20     100
Sum                    20    100
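For readers working outside Excel, a rough Python equivalent of this absolute-frequency histogram is sketched below (an illustrative addition, not part of the original text); it simply draws one bar per value of Table 3.E.15 with matplotlib.

import matplotlib.pyplot as plt

# Values and absolute frequencies taken from Table 3.E.15
companies = [8, 9, 10, 11, 12, 13, 14]
fi = [2, 3, 4, 5, 3, 2, 1]

plt.bar(companies, fi, width=0.8, edgecolor="black")
plt.xlabel("Number of companies")
plt.ylabel("Absolute frequency")
plt.title("Histogram of absolute frequencies (Example 3.9)")
plt.show()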

FIG. 3.8 Histogram of absolute frequencies elaborated in Excel for Example 3.9.

As mentioned, many statistical computer packages, including SPSS and Stata, build the histogram automatically from the original data of the variable being studied (in this example, using the data in Table 3.E.14), without having to calculate the frequencies. Moreover, these packages have the option of plotting the normal curve. Fig. 3.9 shows the histogram generated using SPSS (with the option of a normal curve) using the data in Table 3.E.14. We will see this in detail in Sections 3.6 and 3.7, how it can be constructed using SPSS and Stata software, respectively. Note that the values of the discrete variable are presented in the middle of the base. For continuous variables, consider the data in Table 3.E.5 (Example 3.3), regarding the grades of the students enrolled in the subject Financial Market. These data were sorted in ascending order, as presented in Table 3.E.6. Fig. 3.10 shows the histogram generated using SPSS software (with the option of a normal curve) using the data in Table 3.E.5 or Table 3.E.6.


FIG. 3.9 Histogram constructed using SPSS for Example 3.9 (discrete data).

FIG. 3.10 Histogram generated using SPSS for Example 3.3 (continuous data).

Note that the data were grouped considering an interval between classes of h = 0.5, differently from Example 3.3, which considered h = 1. The lower limits of the classes are represented on the left side of the base of each bar, and the upper limits (not included in the class) on the right side. The height of the bar represents the total frequency in the class. For example, the first bar represents the 3.5 ├ 4.0 class, and there are three values in this interval (3.5, 3.8, and 3.9).

3.3.2.4 Stem-and-Leaf Plot
Both bar charts and histograms represent the shape of the variable's frequency distribution. The stem-and-leaf plot is an alternative way of representing the frequency distribution of discrete and continuous quantitative variables with few observations, with the advantage of maintaining the original value of each observation (it allows the visualization of all the information in the data).


In the plot, the representation of each observation is divided into two parts, separated by a vertical line: the stem, located on the left of the vertical line, represents the observation's first digit(s); the leaf, located on the right of the vertical line, represents the observation's last digit(s). The choice of how many initial digits form the stem (and, consequently, how many complementary digits form the leaf) is arbitrary; the stems usually contain the most significant digits and the leaves the least significant ones. The stems are represented in a single column, with their different values throughout many lines. For each stem on the left-hand side of the vertical line, the respective leaves are shown on the right-hand side throughout many columns. Both stems and leaves must be sorted in ascending order. When there are too many leaves per stem, we can use more than one line for the same stem; choosing the number of lines is arbitrary, just as defining the interval or the number of classes in a frequency distribution is. To build a stem-and-leaf plot, we can follow this sequence of steps (a short code sketch illustrating them appears after the solution of Example 3.10):

Step 1: Sort the data in ascending order, to make the visualization of the data easier.
Step 2: Define the number of initial digits that will form the stem; the complementary digits will form the leaf.
Step 3: Elaborate the stems, represented in a single column on the left of the vertical line, with their different values throughout many lines, in ascending order. When the number of leaves per stem is very high, we can define two or more lines for the same stem.
Step 4: Place the leaves that correspond to the respective stems on the right-hand side of the vertical line, throughout many columns, in ascending order.

Example 3.10 A small company collected its employees' ages, as shown in Table 3.E.16. Build a stem-and-leaf plot.

TABLE 3.E.16 Employees' Ages

44    60    22    49    31    58    42    63    33    37
54    55    40    71    55    62    35    45    59    54
50    51    24    31    40    73    28    35    75    48

Solution To construct the stem-and-leaf plot, let’s apply the four steps described: Step 1 First, we must sort the data in ascending order, as shown in Table 3.E.17.

TABLE 3.E.17 Employees' Ages in Ascending Order

22    24    28    31    31    33    35    35    37    40
40    42    44    45    48    49    50    51    54    54
55    55    58    59    60    62    63    71    73    75

Step 2 The next step in constructing a stem-and-leaf plot is to define the number of initial digits of the observation that will form the stem; the complementary digits will form the leaf. In this example, all of the observations have two digits, so the stems correspond to the tens and the leaves to the units. Step 3 The following step is to build the stems. Based on Table 3.E.17, we can see that there are observations beginning with the tens 2, 3, 4, 5, 6, and 7 (stems). Since the stem with the highest frequency is 5 (8 observations), it is possible to represent all of its leaves in a single line; therefore, we will have a single line per stem. The stems are presented in a single column on the left of the vertical line, in ascending order, as shown in Fig. 3.11.

FIG. 3.11 Building the stems for Example 3.10 (stems 2, 3, 4, 5, 6, and 7 listed in a single column).

Step 4 Finally, let’s place the leaves that correspond to each stem on the right-hand side of the vertical line. The leaves are represented in ascending order throughout many columns. For example, stem 2 contains leaves 2, 4, and 8. Stem 5 contains leaves 0, 1, 4, 4, 5, 5, 8, and 9, represented throughout 8 columns. If this stem were divided into two lines, the first line would have leaves 0 to 4, and the second line leaves 5 to 9. Fig. 3.12 illustrates the stem-and-leaf plot for Example 3.10. FIG. 3.12 Stem-and-Leaf plot for Example 3.10.

2 | 2 4 8
3 | 1 1 3 5 5 7
4 | 0 0 2 4 5 8 9
5 | 0 1 4 4 5 5 8 9
6 | 0 2 3
7 | 1 3 5
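The sketch below is an illustrative addition (not part of the original text) that automates Steps 1–4 for two-digit integer observations such as the ages in Table 3.E.16; the split into tens (stem) and units (leaf) mirrors this example.

from collections import defaultdict

ages = [44, 60, 22, 49, 31, 58, 42, 63, 33, 37, 54, 55, 40, 71, 55,
        62, 35, 45, 59, 54, 50, 51, 24, 31, 40, 73, 28, 35, 75, 48]

leaves_by_stem = defaultdict(list)
for value in sorted(ages):            # Step 1: ascending order
    stem, leaf = divmod(value, 10)    # Step 2: tens form the stem, units the leaf
    leaves_by_stem[stem].append(leaf)

for stem in sorted(leaves_by_stem):   # Steps 3 and 4: one line per stem
    leaves = " ".join(str(leaf) for leaf in leaves_by_stem[stem])
    print(f"{stem} | {leaves}")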

Example 3.11 The average temperature, in Celsius, registered in the last 40 days in the city of Porto Alegre can be found in Table 3.E.18. Elaborate the stem-and-leaf plot for Example 3.11.

TABLE 3.E.18 Average Temperature in Celsius

8.5     13.7    12.9    9.4     11.7    19.2    12.8    9.7     19.5    11.5
15.5    16.0    20.4    17.4    18.0    14.4    14.8    13.0    16.6    20.2
17.9    17.7    16.9    15.2    18.5    17.8    16.2    16.4    18.2    16.9
18.7    19.6    13.2    17.2    20.5    14.1    16.1    15.9    18.8    15.7

Solution Once again, let’s apply the four steps to construct the stem-and-leaf plot, but now we have to consider continuous variables. Step 1 First, let’s sort the data in ascending order, as shown in Table 3.E.19.

TABLE 3.E.19 Average Temperature in Ascending Order

8.5     9.4     9.7     11.5    11.7    12.8    12.9    13.0    13.2    13.7
14.1    14.4    14.8    15.2    15.5    15.7    15.9    16.0    16.1    16.2
16.4    16.6    16.9    16.9    17.2    17.4    17.7    17.8    17.9    18.0
18.2    18.5    18.7    18.8    19.2    19.5    19.6    20.2    20.4    20.5


Step 2 In this example, the leaves correspond to the last digit. The remaining digits (to the left) correspond to the stems. Steps 3 and 4 The stems vary from 8 to 20. The stem with the highest frequency is 16 (7 observations), and its leaves can be represented in a single line. For each stem, we place the respective leaves. Fig. 3.13 shows the stem-and-leaf plot for Example 3.11. FIG. 3.13 Stem-and-Leaf Plot for Example 3.11.

8  | 5
9  | 4 7
10 |
11 | 5 7
12 | 8 9
13 | 0 2 7
14 | 1 4 8
15 | 2 5 7 9
16 | 0 1 2 4 6 9 9
17 | 2 4 7 8 9
18 | 0 2 5 7 8
19 | 2 5 6
20 | 2 4 5

3.3.2.5 Boxplot or Box-and-Whisker Diagram
The boxplot (or box-and-whisker diagram) is a graphical representation of five measures of position or location of a certain variable: the minimum value, the first quartile (Q1), the second quartile (Q2) or median (Md), the third quartile (Q3), and the maximum value. In a sorted sample, the median corresponds to the central position and the quartiles to the subdivision of the sample into four equal parts, each one containing 25% of the data. Thus, the first quartile (Q1) bounds the first 25% of the data (organized in ascending order), the second quartile corresponds to the median (50% of the sorted data lie below it and the remaining 50% above it), and the third quartile (Q3) corresponds to 75% of the observations. The dispersion measure derived from these location measures is called the interquartile range (IQR), or interquartile interval (IQI), and corresponds to the difference between Q3 and Q1. This plot allows us to assess the symmetry and the distribution of the data. It also gives us a visual indication of whether or not there are discrepant observations (univariate outliers), since such data lie beyond the lower and upper limits. A representation of the diagram can be seen in Fig. 3.14. FIG. 3.14 Boxplot.


Calculating the median, the first, and third quartiles, and investigating the existence of univariate outliers will be discussed in Sections 3.4.1.1, 3.4.1.2, and 3.4.1.3, respectively. In Sections 3.6.3 and 3.7, we will study how to generate the box-and-whisker diagram on SPSS and Stata, respectively, using a practical example.
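As a preview of those sections, and purely as an illustrative addition (not from the original text), the five measures drawn in a boxplot and the outlier fences can be computed with a short Python sketch. The quartile convention of statistics.quantiles and the 1.5 × IQR fences used below are common choices and may differ slightly from the formulas presented later in the chapter; the sample reuses the employees' ages of Table 3.E.16.

import statistics

sample = [44, 60, 22, 49, 31, 58, 42, 63, 33, 37, 54, 55, 40, 71, 55,
          62, 35, 45, 59, 54, 50, 51, 24, 31, 40, 73, 28, 35, 75, 48]

q1, q2, q3 = statistics.quantiles(sample, n=4)   # first, second (median), and third quartiles
iqr = q3 - q1                                    # interquartile range
lower_fence = q1 - 1.5 * iqr                     # values below this are flagged as outliers
upper_fence = q3 + 1.5 * iqr                     # values above this are flagged as outliers

print(min(sample), q1, q2, q3, max(sample))
print("IQR =", iqr)
print("outliers:", [x for x in sample if x < lower_fence or x > upper_fence])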

3.4 THE MOST COMMON SUMMARY-MEASURES IN UNIVARIATE DESCRIPTIVE STATISTICS
Information found in a dataset can be summarized through suitable numerical measures, called summary measures. In univariate descriptive statistics, the most common summary measures have as their main objective to represent the behavior of the variable being studied through its central and noncentral values, its dispersion, or the way its values are distributed around the mean. The summary measures studied in this chapter are measures of position or location (measures of central tendency and quantiles), measures of dispersion or variability, and measures of shape, such as skewness and kurtosis. These measures are calculated for metric or quantitative variables. The only exception is the mode, a measure of central tendency that provides the most frequent value of a certain variable and that can therefore also be calculated for nonmetric or qualitative variables.

3.4.1 Measures of Position or Location

These measures provide values that characterize the behavior of a data series, indicating the data position or location in relation to the axis of the values assumed by the variable or characteristic being studied. The measures of position or location are subdivided into measures of central tendency (mean, median, and mode) and quantiles (quartiles, deciles, and percentiles).

3.4.1.1 Measures of Central Tendency
The most common measures of central tendency are the arithmetic mean, the median, and the mode.

3.4.1.1.1 Arithmetic Mean
The arithmetic mean can be a representative measure of a population with N elements, represented by the Greek letter $\mu$, or a representative measure of a sample with n elements, represented by $\bar{X}$.

3.4.1.1.1.1 Case 1: Simple Arithmetic Mean of Ungrouped Discrete and Continuous Data
The simple arithmetic mean, or simply the mean or average, is the sum of all the values of a certain variable (discrete or continuous) divided by the total number of observations. Thus, the sample arithmetic mean of a certain variable X is:

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} \qquad (3.1)$$

where n is the total number of observations in the dataset and Xi, for i = 1, …, n, represents each of variable X's values. Example 3.12 Calculate the simple arithmetic mean of the data in Table 3.E.20, regarding the grades of the graduate students enrolled in the subject Quantitative Methods.

TABLE 3.E.20 Students' Grades

5.7    6.5    6.9    8.3    8.0    4.2    6.3    7.4    5.8    6.9


Solution The mean is simply calculated as the sum of all the values in Table 3.E.20 divided by the total number of observations:

$$\bar{X} = \frac{5.7 + 6.5 + \cdots + 6.9}{10} = 6.6$$

The MEAN function in Excel calculates the simple arithmetic mean of the set of values selected. Assuming the data in Table 3.E.20 are available from cell A1 to cell A10, to calculate the mean we just need to insert the expression =MEAN(A1:A10). Another way to calculate the mean in Excel, as well as other descriptive measures, such as the median, mode, variance, standard deviation, standard error, skewness, and kurtosis, which will also be studied in this chapter, is the Analysis ToolPak supplement (Section 3.5).

3.4.1.1.1.2 Case 2: Weighted Arithmetic Mean of Ungrouped Discrete and Continuous Data
When calculating the simple arithmetic mean, all of the occurrences have the same importance or weight. When we are interested in assigning a different weight $p_i$ to each value i of variable X, we use the weighted arithmetic mean:

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i \cdot p_i}{\sum_{i=1}^{n} p_i} \qquad (3.2)$$

If the weights are expressed as percentages (relative weights, rw), Expression (3.2) becomes:

$$\bar{X} = \sum_{i=1}^{n} X_i \cdot rw_i \qquad (3.3)$$
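The two expressions can be checked with the short Python sketch below (an illustrative addition, not part of the original text); the grades and weights reproduce Example 3.13, which follows.

grades = [4.5, 7.0, 5.5, 6.5]   # one grade per quarter (Example 3.13)
weights = [1, 2, 3, 4]          # weight of each quarter

# Expression (3.2): weighted arithmetic mean
weighted_mean = sum(x * p for x, p in zip(grades, weights)) / sum(weights)

# Expression (3.3): same result using relative weights that sum to 1
relative = [p / sum(weights) for p in weights]
weighted_mean_rw = sum(x * rw for x, rw in zip(grades, relative))

print(round(weighted_mean, 2), round(weighted_mean_rw, 2))  # both print 6.1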

Example 3.13 At Vanessa’s school, the annual average of each subject is calculated based on the grades obtained throughout all four quarters, with their respective weights being: 1, 2, 3, and 4. Table 3.E.21 shows Vanessa’s grades in mathematics in each quarter. Calculate her annual average in the subject.

TABLE 3.E.21 Vanessa's Grades in Mathematics

Period         Grade    Weight
1st Quarter    4.5      1
2nd Quarter    7.0      2
3rd Quarter    5.5      3
4th Quarter    6.5      4

Solution The annual average is calculated by using the weighted arithmetic mean criterion. Applying Expression (3.2) to the data in Table 3.E.21, we have:

$$\bar{X} = \frac{4.5 \times 1 + 7.0 \times 2 + 5.5 \times 3 + 6.5 \times 4}{1 + 2 + 3 + 4} = 6.1$$


Example 3.14 There are five stocks in a certain investment portfolio. Table 3.E.22 shows the average yield of each stock in the previous month, as well as the respective percentage invested. Determine the portfolio’s average yield.

TABLE 3.E.22 Yield of Each Stock and Percentage Invested

Stock                Yield (%)    % Investment
Bank of Brazil ON    1.05         10
Bradesco PN          0.56         25
Eletrobras PNB       0.08         15
Gerdau PN            0.24         20
Vale PN              0.75         30

Solution The portfolio's average yield (%) corresponds to the sum of the products between each stock's average yield (%) and the respective percentage invested. Using Expression (3.3), we have:

$$\bar{X} = 1.05 \times 0.10 + 0.56 \times 0.25 + 0.08 \times 0.15 + 0.24 \times 0.20 + 0.75 \times 0.30 = 0.53\%$$

3.4.1.1.1.3 Case 3: Arithmetic Mean of Grouped Discrete Data
When the discrete values of Xi repeat themselves, the data are grouped into a frequency table. To calculate the arithmetic mean, we use the same criterion as for the weighted mean; however, the weight of each Xi is now its absolute frequency (Fi) and, instead of n observations with n different values, we have n observations with m distinct values (grouped data):

$$\bar{X} = \frac{\sum_{i=1}^{m} X_i \cdot F_i}{\sum_{i=1}^{m} F_i} = \frac{\sum_{i=1}^{m} X_i \cdot F_i}{n} \qquad (3.4)$$

If the frequency of the data is expressed as the percentage relative to the absolute frequency (relative frequency, Fr), Expression (3.4) becomes:

$$\bar{X} = \sum_{i=1}^{m} X_i \cdot Fr_i \qquad (3.5)$$

Example 3.15 A satisfaction survey with 120 participants evaluated the performance of a health insurance company through grades given to it, varying between 1 and 10. The survey's results can be seen in Table 3.E.23. Calculate the arithmetic mean for Example 3.15.

TABLE 3.E.23 Absolute Frequency Table

Grades    Number of Participants
1         9
2         12
3         15
4         18
5         24
6         26
7         5
8         7
9         3
10        1

Solution The arithmetic mean of Example 3.15 is calculated from Expression (3.4):

$$\bar{X} = \frac{1 \times 9 + 2 \times 12 + \cdots + 9 \times 3 + 10 \times 1}{120} = 4.62$$

3.4.1.1.1.4 Case 4: Arithmetic Mean of Continuous Data Grouped into Classes
In the calculation of the simple arithmetic mean, the weighted arithmetic mean, and the arithmetic mean of grouped discrete data, Xi represents each value i of variable X. For continuous data grouped into classes, each class does not have a single defined value but a set of values. For the arithmetic mean to be calculated in this case, we assume that Xi is the middle or central point of class i (i = 1, …, k), so Expressions (3.4) and (3.5) are rewritten in terms of the number of classes (k):

$$\bar{X} = \frac{\sum_{i=1}^{k} X_i \cdot F_i}{\sum_{i=1}^{k} F_i} = \frac{\sum_{i=1}^{k} X_i \cdot F_i}{n} \qquad (3.6)$$

$$\bar{X} = \sum_{i=1}^{k} X_i \cdot Fr_i \qquad (3.7)$$
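A minimal Python sketch of Expression (3.6), added here only as an illustration (not part of the original text), uses the class midpoints as the Xi values; the classes and frequencies reproduce Example 3.16, which follows.

# Salary classes (US$ 1000.00) as (lower, upper) limits, with absolute frequencies
classes = [(1, 3), (3, 5), (5, 7), (7, 9), (9, 11), (11, 13)]
fi = [240, 480, 320, 150, 130, 80]

n = sum(fi)
midpoints = [(low + up) / 2 for low, up in classes]   # Xi = central point of each class

mean = sum(x * f for x, f in zip(midpoints, fi)) / n  # Expression (3.6)
print(round(mean, 3))   # about 5.557, i.e., an average salary of roughly US$ 5,557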

Example 3.16 Table 3.E.24 shows the classes of salaries paid to the employees of a certain company and their respective absolute and relative frequencies. Calculate the average salary.

TABLE 3.E.24 Classes of Salaries (US$ 1000.00) and Their Respective Absolute and Relative Frequencies

Classes    Fi      Fri (%)
1 ├ 3      240     17.14
3 ├ 5      480     34.29
5 ├ 7      320     22.86
7 ├ 9      150     10.71
9 ├ 11     130     9.29
11 ├ 13    80      5.71
Sum        1400    100


Solution Considering Xi the central point of class i and applying Expression (3.6), we have:

$$\bar{X} = \frac{2 \times 240 + 4 \times 480 + 6 \times 320 + 8 \times 150 + 10 \times 130 + 12 \times 80}{1{,}400} = 5.557$$

or, using Expression (3.7):

$$\bar{X} = 2 \times 0.1714 + 4 \times 0.3429 + \cdots + 10 \times 0.0929 + 12 \times 0.0571 = 5.557$$

Therefore, the average salary is US$ 5,557.14.

3.4.1.1.2 Median
The median (Md) is a measure of location that locates the center of the distribution of a set of data sorted in ascending order. Its value separates the series into two equal parts, so that 50% of the elements are less than or equal to the median and the other 50% are greater than or equal to it.

3.4.1.1.2.1 Case 1: Median of Ungrouped Discrete and Continuous Data
The median of variable X (discrete or continuous) can be calculated as follows:

$$Md(X) = \begin{cases} \dfrac{X_{n/2} + X_{(n/2)+1}}{2}, & \text{if } n \text{ is an even number} \\ X_{(n+1)/2}, & \text{if } n \text{ is an odd number} \end{cases} \qquad (3.8)$$

where n is the total number of observations and $X_1 \le \dots \le X_n$, considering that X1 is the smallest observation (the value of the first element) and Xn is the largest observation (the value of the last element). Example 3.17 Table 3.E.25 shows the monthly production of treadmills of a company in a given year. Calculate the median.

TABLE 3.E.25 Monthly Production of Treadmills in a Given Year

Month    Production (units)
Jan.     210
Feb.     180
Mar.     203
April    195
May      208
June     230
July     185
Aug.     190
Sept.    200
Oct.     182
Nov.     205
Dec.     196


Solution To calculate the median, the observations are sorted in ascending order. Therefore, we have the ordered observations and their respective positions:

Value       180   182   185   190   195   196   200   203   205   208   210   230
Position    1st   2nd   3rd   4th   5th   6th   7th   8th   9th   10th  11th  12th

The median will be the mean between the sixth and the seventh elements, since n is an even number, that is:

$$Md = \frac{X_{12/2} + X_{12/2+1}}{2} = \frac{X_6 + X_7}{2} = \frac{196 + 200}{2} = 198$$

Excel calculates the median of a set of data through the MED function. Note that the median is not affected by the magnitude of the extreme values of the original variable: if, for instance, the highest value were 400 instead of 230, the median would be exactly the same, even though the mean would be much higher. The median is also known as the 2nd quartile (Q2), the 50th percentile (P50), or the 5th decile (D5). These definitions will be studied in more detail in the following sections.
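A direct Python translation of Expression (3.8) is sketched below (an illustrative addition, not part of the original text); it reproduces the result of Example 3.17.

def median(values):
    """Median of ungrouped data, following Expression (3.8)."""
    x = sorted(values)
    n = len(x)
    if n % 2 == 0:                         # even n: mean of the two central elements
        return (x[n // 2 - 1] + x[n // 2]) / 2
    return x[(n + 1) // 2 - 1]             # odd n: the single central element

production = [210, 180, 203, 195, 208, 230, 185, 190, 200, 182, 205, 196]
print(median(production))  # 198.0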

3.4.1.1.2.2 Case 2: Median of Grouped Discrete Data
Here, the calculation of the median is similar to the previous case; however, the data are grouped in a frequency distribution table. Analogous to Case 1, if n is an odd number, the position of the central element will be (n + 1)/2. We can see in the cumulative frequency column the group that contains this position and, consequently, its corresponding value in the first column (the median). If n is an even number, we verify the group(s) that contain(s) the central positions n/2 and (n/2) + 1 in the cumulative frequency column. If both positions correspond to the same group, we directly obtain the corresponding value in the first column (the median). If each position corresponds to a distinct group, the median will be the average between the corresponding values defined in the first column. Example 3.18 Table 3.E.26 shows the number of bedrooms in 70 real estate properties in a condominium located in the metropolitan area of Sao Paulo, and their respective absolute and cumulative frequencies. Calculate the median.

TABLE 3.E.26 Frequency Distribution

Number of Bedrooms    Fi    Fac
1                     6     6
2                     13    19
3                     20    39
4                     15    54
5                     7     61
6                     6     67
7                     3     70
Sum                   70


Since n is an even number, the median will be the average of the values that occupy positions n/2 and (n/2) + 1, that is:

$$Md = \frac{X_{n/2} + X_{n/2+1}}{2} = \frac{X_{35} + X_{36}}{2}$$

Based on Table 3.E.26, we can see that the third group contains all the elements between positions 20 and 39 (including 35 and 36), whose corresponding value is 3. Therefore, the median is:

$$Md = \frac{3 + 3}{2} = 3$$

(3.9)

Step 2: Identify the class that contains the median (median class) from the cumulative frequency column. Step 3: Calculate the median using the following expression: n   FacðMd1Þ  AMd Md ¼ LIMd + 2 FMd

(3.10)

where: LIMd ¼ lower limit of the median class; FMd ¼ absolute frequency of the median class; Fac(Md1)¼ cumulative frequency from the previous class to the median class; AMd ¼ range of the median class; n ¼ total number of observations.
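As a complement to these steps, the sketch below shows one possible way of coding Expression (3.10) in Python. The function grouped_median and the salaries list (which reproduces the classes of Table 3.E.27 used in Example 3.19 below) are illustrative names, not part of any package discussed in this book.

```python
# Illustrative sketch of Expression (3.10) for continuous data grouped into classes.
def grouped_median(classes):
    """classes: list of (lower_limit, upper_limit, absolute_frequency)."""
    n = sum(f for _, _, f in classes)
    pos = n / 2                            # Step 1: position of the median
    cum = 0
    for low, high, f in classes:           # Step 2: locate the median class
        if cum + f >= pos:
            # Step 3: interpolate within the median class
            return low + (pos - cum) / f * (high - low)
        cum += f

salaries = [(1, 3, 240), (3, 5, 480), (5, 7, 320),
            (7, 9, 150), (9, 11, 130), (11, 13, 80)]
print(round(grouped_median(salaries), 4))  # 4.9167, i.e., US$ 4,916.67
```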

Example 3.19 Consider the data in Example 3.16 regarding the classes of salaries paid to the employees of a company and their respective absolute and cumulative frequencies (Table 3.E.27). Calculate the median.

TABLE 3.E.27 Classes of Salaries (US$ 1000.00) and Their Respective Absolute and Cumulative Frequencies

Classes    Fi     Fac
1 ├ 3      240     240
3 ├ 5      480     720
5 ├ 7      320    1040
7 ├ 9      150    1190
9 ├ 11     130    1320
11 ├ 13     80    1400
Sum       1400


Solution
In the case of continuous data grouped into classes, let's apply the following steps to calculate the median:
Step 1: First, we calculate the position of the median:

Pos(Md) = n/2 = 1400/2 = 700

Step 2: Through the cumulative frequency column, we can see that the median is in the second class (3 ├ 5).
Step 3: Calculating the median:

Md = LI_Md + ((n/2 − Fac(Md−1)) / F_Md) · A_Md

where: LI_Md = 3, F_Md = 480, Fac(Md−1) = 240, A_Md = 2, n = 1400. Therefore, we have:

Md = 3 + ((700 − 240) / 480) · 2 = 4.9167 (US$ 4,916.67)

3.4.1.1.3 Mode

The mode (Mo) of a data series corresponds to the observation that occurs with the highest frequency. The mode is the only measure of position that can also be used for qualitative variables, since these variables only allow us to calculate frequencies.

3.4.1.1.3.1 Case 1: Mode of Ungrouped Data
Consider a set of observations X1, X2, …, Xn of a certain variable. The mode is the value that appears with the highest frequency. Excel gives us the mode of a set of data through the MODE function.

Example 3.20
The production of carrots in a certain company is divided into five phases, including the post-harvest handling phase. Table 3.E.28 shows the processing time (in seconds) in this phase for 20 observations. Calculate the mode.

TABLE 3.E.28 Processing Time in the Post-Harvest Handling Phase in Seconds

45.0   44.5   44.0   45.0   46.5
46.0   45.8   44.8   45.0   46.2
44.5   45.0   45.4   44.9   45.7
46.2   44.7   45.6   46.3   44.9

Solution The mode is 45.0, which is the most frequent value in the dataset (Table 3.E.28). This value could be determined directly in Excel by using the MODE function.

3.4.1.1.3.2 Case 2: Mode of Grouped Qualitative or Discrete Data For discrete qualitative or quantitative data grouped in a frequency distribution table, the mode can be obtained directly from the table. It is the value with the highest absolute frequency. Example 3.21 A TV station interviewed 500 viewers trying to analyze their preferences in terms of interest categories. The result of the survey can be seen in Table 3.E.29. Calculate the mode.


TABLE 3.E.29 Viewers' Preferences in Terms of Interest Categories

Interest Categories   Fi
Movies                 71
Soap Operas            46
News                   90
Comedy                 98
Sports                120
Concerts               35
Variety                40
Sum                   500

Solution Based on Table 3.E.29, we can see that the mode corresponds to the category Sports (the highest absolute frequency). Therefore, the mode is the only measure of position that can also be used for qualitative variables.

3.4.1.1.3.3 Case 3: Mode of Continuous Data Grouped into Classes
For continuous data grouped into classes, there are several procedures to calculate the mode, such as Czuber's and King's methods.
Czuber's method has the following phases:
Step 1: Identify the class that has the mode (modal class), which is the one with the highest absolute frequency.
Step 2: Calculate the mode (Mo):

Mo = LI_Mo + ((F_Mo − F_Mo−1) / (2·F_Mo − (F_Mo−1 + F_Mo+1))) · A_Mo     (3.11)

where:
LI_Mo = lower limit of the modal class;
F_Mo = absolute frequency of the modal class;
F_Mo−1 = absolute frequency of the class preceding the modal class;
F_Mo+1 = absolute frequency of the class following the modal class;
A_Mo = range of the modal class.

Example 3.22 A set of continuous data with 200 observations is grouped into classes with their respective absolute frequencies, as shown in Table 3.E.30. Determine the mode using Czuber’s method.

TABLE 3.E.30 Continuous Data Grouped into Classes and Their Respective Frequencies

Class      Fi
01 ├ 10    21
10 ├ 20    36
20 ├ 30    58
30 ├ 40    24
40 ├ 50    19
Sum       200


Solution
Considering continuous data grouped into classes, we can use Czuber's method to calculate the mode:
Step 1: Based on Table 3.E.30, we can see that the modal class is the third one (20 ├ 30), since it has the highest absolute frequency.
Step 2: Calculating the mode (Mo):

Mo = LI_Mo + ((F_Mo − F_Mo−1) / (2·F_Mo − (F_Mo−1 + F_Mo+1))) · A_Mo

where: LI_Mo = 20, F_Mo = 58, F_Mo−1 = 36, F_Mo+1 = 24, A_Mo = 10. Therefore, we have:

Mo = 20 + ((58 − 36) / (2·58 − (36 + 24))) · 10 = 23.9

On the other hand, King's method consists of the following phases:
Step 1: Identify the modal class (the one with the highest absolute frequency).
Step 2: Calculate the mode (Mo) using the following expression:

Mo = LI_Mo + (F_Mo+1 / (F_Mo−1 + F_Mo+1)) · A_Mo     (3.12)

where:
LI_Mo = lower limit of the modal class;
F_Mo−1 = absolute frequency of the class preceding the modal class;
F_Mo+1 = absolute frequency of the class following the modal class;
A_Mo = range of the modal class.

Example 3.23
Once again, consider the data from the previous example. Use King's method to determine the mode.
Solution
In Example 3.22, we saw that: LI_Mo = 20, F_Mo+1 = 24, F_Mo−1 = 36, A_Mo = 10. Applying Expression (3.12):

Mo = LI_Mo + (F_Mo+1 / (F_Mo−1 + F_Mo+1)) · A_Mo = 20 + (24 / (36 + 24)) · 10 = 24
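The two procedures can also be written as short functions. The Python sketch below is an illustrative rendering of Expressions (3.11) and (3.12), with missing neighboring classes treated as zero frequency; the names czuber_mode, king_mode, and data (the classes of Table 3.E.30) are ours.

```python
# Illustrative sketch of Czuber's (3.11) and King's (3.12) methods.
def _modal_class(classes):
    i = max(range(len(classes)), key=lambda k: classes[k][2])
    f_prev = classes[i - 1][2] if i > 0 else 0
    f_next = classes[i + 1][2] if i < len(classes) - 1 else 0
    return classes[i], f_prev, f_next

def czuber_mode(classes):
    (low, high, f), f_prev, f_next = _modal_class(classes)
    return low + (f - f_prev) / (2 * f - (f_prev + f_next)) * (high - low)

def king_mode(classes):
    (low, high, _), f_prev, f_next = _modal_class(classes)
    return low + f_next / (f_prev + f_next) * (high - low)

data = [(1, 10, 21), (10, 20, 36), (20, 30, 58), (30, 40, 24), (40, 50, 19)]
print(round(czuber_mode(data), 1))   # 23.9, as in Example 3.22
print(round(king_mode(data), 1))     # 24.0, as in Example 3.23
```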

3.4.1.2 Quantiles
According to Bussab and Morettin (2011), using only measures of central tendency may not be suitable to represent a set of data, since they are also impacted by extreme values. Moreover, these measures alone do not give the researcher a clear idea of the data dispersion and symmetry. As an alternative, we can use quantiles, such as quartiles, deciles, and percentiles. The 2nd quartile (Q2), 5th decile (D5), or 50th percentile (P50) correspond to the median; therefore, they are measures of central tendency.

3.4.1.2.1 Quartiles
Quartiles (Qi, i = 1, 2, 3) are measures of position that divide a set of data, sorted in ascending order, into four parts with equal dimensions.

Min. — Q1 — Md = Q2 — Q3 — Max.


Thus, the 1st quartile (Q1, or the 25th percentile) indicates that 25% of the data are less than Q1, or that 75% of the data are greater than Q1. The 2nd quartile (Q2, or the 5th decile, or the 50th percentile) corresponds to the median, indicating that 50% of the data are less or greater than Q2. The 3rd quartile (Q3, or the 75th percentile) indicates that 75% of the data are less than Q3, or that 25% of the data are greater than Q3.

3.4.1.2.2 Deciles
Deciles (Di, i = 1, 2, ..., 9) are measures of position that divide a set of data, sorted in ascending order, into 10 equal parts.

Min. — D1 — D2 — D3 — D4 — D5 (= Md) — D6 — D7 — D8 — D9 — Max.

Therefore, the 1st decile (D1 or 10th percentile) indicates that 10% of the data are less than D1 or that 90% of the data are greater than D1. The 2nd decile (D2 or 20th percentile) indicates that 20% of the data are less than D2 or that 80% of the data are greater than D2. And so on, until the 9th decile (D9 or 90th percentile), indicating that 90% of the data are less than D9 or that 10% of the data are greater than D9.

3.4.1.2.3 Percentiles
Percentiles (Pi, i = 1, 2, ..., 99) are measures of position that divide a set of data, sorted in ascending order, into 100 equal parts. Hence, the 1st percentile (P1) indicates that 1% of the data is less than P1 or that 99% of the data are greater than P1. The 2nd percentile (P2) indicates that 2% of the data are less than P2 or that 98% of the data are greater than P2. And so on, until the 99th percentile (P99), which indicates that 99% of the data are less than P99 or that 1% of the data is greater than P99.

3.4.1.2.3.1 Case 1: Quartiles, Deciles, and Percentiles of Ungrouped Discrete and Continuous Data
If the position of the quartile, decile, or percentile we are interested in is an integer, or lies exactly halfway between two positions, calculating the respective quartile, decile, or percentile is straightforward. However, this does not happen all the time (imagine a sample with 33 elements for which we want the 67th percentile). There are many methods proposed for this kind of calculation; they lead to close, but not identical, results. We will present a simple and generic method that can be applied to calculate any quartile, decile, or percentile of order i, considering ungrouped discrete and continuous data:
Step 1: Sort the observations in ascending order.
Step 2: Determine the position of the quartile, decile, or percentile of order i we are interested in:

Quartile → Pos(Qi) = (n/4) · i + 1/2, i = 1, 2, 3     (3.13)
Decile → Pos(Di) = (n/10) · i + 1/2, i = 1, 2, …, 9     (3.14)
Percentile → Pos(Pi) = (n/100) · i + 1/2, i = 1, 2, …, 99     (3.15)

Step 3: Calculate the value of the quartile, decile, or percentile that corresponds to the respective position. Assume that Pos(Q1) = 3.75, that is, the value of Q1 lies between the 3rd and 4th positions (75% closer to the 4th position, and 25% to the 3rd position). Therefore, Q1 will be the value in the 3rd position multiplied by 0.25 plus the value in the 4th position multiplied by 0.75.
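A compact way of expressing this positional method is sketched below in Python; quantile() is an illustrative helper (not an Excel, SPSS, or Stata routine) in which q is the order written as a fraction, so quartiles, deciles, and percentiles are all covered by Expressions (3.13)–(3.15).

```python
# Illustrative sketch of the positional method in Expressions (3.13)-(3.15).
def quantile(values, q):
    """q as a fraction: 0.25 for Q1, 0.2 for D2, 0.64 for P64, and so on."""
    x = sorted(values)                    # Step 1: sort in ascending order
    pos = len(x) * q + 0.5                # Step 2: position of the quantile
    lower = int(pos)                      # integer part
    frac = pos - lower                    # fractional part = weight of the upper neighbor
    if frac == 0 or lower >= len(x):      # integer position or beyond the last element
        return x[min(lower, len(x)) - 1]
    return (1 - frac) * x[lower - 1] + frac * x[lower]   # Step 3: interpolate

times = [45.0, 44.5, 44.0, 45.0, 46.5, 46.0, 45.8, 44.8, 45.0, 46.2,
         44.5, 45.0, 45.4, 44.9, 45.7, 46.2, 44.7, 45.6, 46.3, 44.9]
print(round(quantile(times, 0.25), 2))    # 44.85 (Q1 of Example 3.24)
print(round(quantile(times, 0.64), 2))    # 45.63 (P64 of Example 3.24)
```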


Example 3.24
Consider the data in Example 3.20 regarding the average carrot processing time in the post-harvest handling phase, as specified in Table 3.E.28. Determine Q1 (1st quartile), Q3 (3rd quartile), D2 (2nd decile), and P64 (64th percentile).
Solution
For ungrouped continuous data, we must apply the following steps to determine the quartiles, deciles, and percentiles we are interested in:
Step 1: Sort the observations in ascending order.

Position  1st   2nd   3rd   4th   5th   6th   7th   8th   9th   10th
Value     44.0  44.5  44.5  44.7  44.8  44.9  44.9  45.0  45.0  45.0

Position  11th  12th  13th  14th  15th  16th  17th  18th  19th  20th
Value     45.0  45.4  45.6  45.7  45.8  46.0  46.2  46.2  46.3  46.5

Step 2: Calculation of the positions of Q1, Q3, D2, and P64:
a) Pos(Q1) = (20/4) · 1 + 1/2 = 5.5
b) Pos(Q3) = (20/4) · 3 + 1/2 = 15.5
c) Pos(D2) = (20/10) · 2 + 1/2 = 4.5
d) Pos(P64) = (20/100) · 64 + 1/2 = 13.3

Step 3: Calculating Q1, Q3, D2, and P64:
a) Pos(Q1) = 5.5 means that its corresponding value is 50% near position 5 and 50% near position 6, that is, Q1 is simply the average of the values in both positions:

Q1 = (44.8 + 44.9) / 2 = 44.85

b) Pos(Q3) = 15.5 means that the value we are interested in lies between positions 15 and 16 (50% near the 15th position and 50% near the 16th position), so Q3 can be calculated as follows:

Q3 = (45.8 + 46.0) / 2 = 45.9

c) Pos(D2) = 4.5 means that the value we are interested in lies between positions 4 and 5, so D2 can be calculated as follows:

D2 = (44.7 + 44.8) / 2 = 44.75

d) Pos(P64) = 13.3 means that the value we are interested in is 70% closer to position 13 and 30% closer to position 14, so P64 can be calculated as follows:

P64 = (0.70 × 45.6) + (0.30 × 45.7) = 45.63

Interpretation
Q1 = 44.85 indicates that, in 25% of the observations (the first 5 observations listed in Step 1), the carrot processing time in the post-harvest handling phase is less than 44.85 seconds, or that in 75% of the observations (the remaining 15 observations), the processing time is greater than 44.85. Q3 = 45.9 indicates that, in 75% of the observations (15 of them), the processing time is less than 45.9 seconds, or that in 5 observations, the processing time is greater than 45.9. D2 = 44.75 indicates that, in 20% of the observations (4 of them), the processing time is less than 44.75 seconds, or that in 80% of the observations (16 of them), the processing time is greater than 44.75. P64 = 45.63 indicates that, in 64% of the observations (12.8 of them), the processing time is less than 45.63 seconds, or that in 36% of the observations (7.2 of them), the processing time is greater than 45.63.
Excel calculates the quartile of order i (i = 0, 1, 2, 3, 4) through the QUARTILE function. As arguments of the function, we must define the matrix or set of data for which we want the respective quartile (it does not need to be in ascending order), in addition to the quart we are interested in (minimum value = 0; 1st quartile = 1; 2nd quartile = 2; 3rd quartile = 3; maximum value = 4). The k-th percentile (k = 0, ..., 1) can also be calculated in Excel through the PERCENTILE function. As arguments of the function, we must define the matrix we are interested in, in addition to the value of k (for example, in the case of P64, k = 0.64).


The calculation of quartiles, deciles, and percentiles using SPSS and Stata statistical software will be demonstrated in Sections 3.6 and 3.7, respectively. SPSS and Stata use two methods to calculate quartiles, deciles, or percentiles. One of them is called Tukey's Hinges, and it is the method used in this book. The other is the Weighted Average method, whose calculations are more complex. Excel, on the other hand, implements yet another algorithm that yields similar results.

3.4.1.2.3.2 Case 2: Quartiles, Deciles, and Percentiles of Grouped Discrete Data
Here, the calculation of quartiles, deciles, and percentiles is similar to the previous case. However, the data are grouped in a frequency distribution table, sorted in ascending order, with their respective absolute and cumulative frequencies. First, we determine the position of the quartile, decile, or percentile of order i we are interested in through Expressions (3.13), (3.14), and (3.15), respectively. From the cumulative frequency column, we verify the group(s) that contain(s) this position. If the position is an integer, its corresponding value is obtained directly from the first column. If the position is a fractional number, such as 2.5, and the 2nd and 3rd positions are in the same group, its respective value is also obtained directly. On the other hand, if the position is a fractional number, such as 4.25, and positions 4 and 5 are in different groups, we must add the value corresponding to the 4th position multiplied by 0.75 to the value corresponding to the 5th position multiplied by 0.25 (similar to Case 1).

Example 3.25
Consider the data in Example 3.18 regarding the number of bedrooms in 70 real estate properties in a condominium located in the metropolitan area of Sao Paulo, and their respective absolute and cumulative frequencies (Table 3.E.26). Calculate Q1, D4, and P96.
Solution
Let's calculate the positions of Q1, D4, and P96 through Expressions (3.13), (3.14), and (3.15), respectively, and their corresponding values:
a) Pos(Q1) = (70/4) · 1 + 1/2 = 18. Based on Table 3.E.26, we can see that position 18 is in the second group (2 bedrooms), so Q1 = 2.
b) Pos(D4) = (70/10) · 4 + 1/2 = 28.5. Through the cumulative frequency column, we can see that positions 28 and 29 are in the third group (3 bedrooms), so D4 = 3.
c) Pos(P96) = (70/100) · 96 + 1/2 = 67.7, that is, P96 is 70% closer to position 68 and 30% closer to position 67. Through the cumulative frequency column, we can see that position 68 is in the seventh group (7 bedrooms) and position 67 is in the sixth group (6 bedrooms), so P96 can be calculated as follows:

P96 = (0.70 × 7) + (0.30 × 6) = 6.7

Interpretation
Q1 = 2 indicates that 25% of the real estate properties have fewer than 2 bedrooms, or that 75% of the real estate properties have more than 2 bedrooms. D4 = 3 indicates that 40% of the real estate properties have fewer than 3 bedrooms, or that 60% of the real estate properties have more than 3 bedrooms. P96 = 6.7 indicates that 96% of the real estate properties have fewer than 6.7 bedrooms, or that 4% of the real estate properties have more than 6.7 bedrooms.

3.4.1.2.3.3 Case 3: Quartiles, Deciles, and Percentiles of Continuous Data Grouped into Classes
For continuous data grouped into classes, in which the data are represented in a frequency distribution table, we must apply the following steps to calculate the quartiles, deciles, and percentiles:
Step 1: Calculate the position of the quartile, decile, or percentile of order i we are interested in through the following expressions:

Quartile → Pos(Qi) = (n/4) · i, i = 1, 2, 3     (3.16)
Decile → Pos(Di) = (n/10) · i, i = 1, 2, …, 9     (3.17)
Percentile → Pos(Pi) = (n/100) · i, i = 1, 2, …, 99     (3.18)


Step 2: Identify the class that contains the quartile, decile, or percentile of order i we are interested in (quartile class, decile class, or percentile class) from the cumulative frequency column.
Step 3: Calculate the quartile, decile, or percentile of order i we are interested in through the following expressions:

Quartile → Qi = LL_Qi + ((Pos(Qi) − Fcum(Qi−1)) / F_Qi) · R_Qi, i = 1, 2, 3     (3.19)

where:
LL_Qi = lower limit of the quartile class;
Fcum(Qi−1) = cumulative frequency of the class preceding the quartile class;
F_Qi = absolute frequency of the quartile class;
R_Qi = range of the quartile class.

Decile → Di = LL_Di + ((Pos(Di) − Fcum(Di−1)) / F_Di) · R_Di, i = 1, 2, …, 9     (3.20)

where:
LL_Di = lower limit of the decile class;
Fcum(Di−1) = cumulative frequency of the class preceding the decile class;
F_Di = absolute frequency of the decile class;
R_Di = range of the decile class.

Percentile → Pi = LL_Pi + ((Pos(Pi) − Fcum(Pi−1)) / F_Pi) · R_Pi, i = 1, 2, …, 99     (3.21)

where:
LL_Pi = lower limit of the percentile class;
Fcum(Pi−1) = cumulative frequency of the class preceding the percentile class;
F_Pi = absolute frequency of the percentile class;
R_Pi = range of the percentile class.

Example 3.26 A survey on the health conditions of 250 patients collected information about their weight. The data are grouped into classes, as shown in Table 3.E.31. Calculate the first quartile, the seventh decile, and the 60th percentile.

TABLE 3.E.31 Absolute and Cumulative Frequency Distribution of Patients' Weight Grouped into Classes

Class        Fi    Fac
50 ├ 60      18     18
60 ├ 70      28     46
70 ├ 80      49     95
80 ├ 90      66    161
90 ├ 100     40    201
100 ├ 110    33    234
110 ├ 120    16    250
Sum         250


Solution
Let's apply the three steps to calculate Q1, D7, and P60:
Step 1: Let's calculate the positions of the first quartile, the seventh decile, and the 60th percentile through Expressions (3.16), (3.17), and (3.18), respectively:

1st quartile → Pos(Q1) = (250/4) · 1 = 62.5
7th decile → Pos(D7) = (250/10) · 7 = 175
60th percentile → Pos(P60) = (250/100) · 60 = 150

Step 2: Let's identify the classes that contain Q1, D7, and P60 from the cumulative frequency column in Table 3.E.31:
Q1 is in the 3rd class (70 ├ 80), D7 is in the 5th class (90 ├ 100), and P60 is in the 4th class (80 ├ 90).
Step 3: Let's calculate Q1, D7, and P60 from Expressions (3.19), (3.20), and (3.21), respectively:

Q1 = LL_Q1 + ((Pos(Q1) − Fcum(Q1−1)) / F_Q1) · R_Q1 = 70 + ((62.5 − 46) / 49) · 10 = 73.37
D7 = LL_D7 + ((Pos(D7) − Fcum(D7−1)) / F_D7) · R_D7 = 90 + ((175 − 161) / 40) · 10 = 93.5
P60 = LL_P60 + ((Pos(P60) − Fcum(P60−1)) / F_P60) · R_P60 = 80 + ((150 − 95) / 66) · 10 = 88.33

Interpretation
Q1 = 73.37 indicates that 25% of the patients weigh less than 73.37 kg, or that 75% of the patients weigh more than 73.37 kg. D7 = 93.5 indicates that 70% of the patients weigh less than 93.5 kg, or that 30% of the patients weigh more than 93.5 kg. P60 = 88.33 indicates that 60% of the patients weigh less than 88.33 kg, or that 40% of the patients weigh more than 88.33 kg.
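For continuous data grouped into classes, the interpolation of Expressions (3.19)–(3.21) reduces to a single helper once the order is written as a fraction q (Q1 → 0.25, D7 → 0.70, P60 → 0.60). The Python fragment below is an illustrative sketch (grouped_quantile and weights are our names; weights reproduces Table 3.E.31).

```python
# Illustrative sketch of Expressions (3.16)-(3.21) for data grouped into classes.
def grouped_quantile(classes, q):
    """classes: list of (lower_limit, upper_limit, absolute_frequency); q in (0, 1)."""
    n = sum(f for _, _, f in classes)
    pos = n * q                            # Step 1: position of the quantile
    cum = 0
    for low, high, f in classes:           # Step 2: locate the quantile class
        if cum + f >= pos:
            # Step 3: interpolate within the quantile class
            return low + (pos - cum) / f * (high - low)
        cum += f

weights = [(50, 60, 18), (60, 70, 28), (70, 80, 49), (80, 90, 66),
           (90, 100, 40), (100, 110, 33), (110, 120, 16)]
print(round(grouped_quantile(weights, 0.25), 2))   # 73.37 (Q1)
print(round(grouped_quantile(weights, 0.70), 2))   # 93.5  (D7)
print(round(grouped_quantile(weights, 0.60), 2))   # 88.33 (P60)
```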

3.4.1.3 Identifying the Existence of Univariate Outliers
A dataset can contain observations that are extremely distant from most of the others or that are inconsistent. These observations are called outliers or atypical, discrepant, abnormal, or extreme values. Before deciding what will be done with the outliers, we must know the causes that lead to such an occurrence. In many cases, these causes can determine the most suitable treatment for the respective outliers. The main causes are measurement mistakes, execution/implementation mistakes, and variability inherent to the population.
There are many outlier identification methods: boxplots, discordance models, Dixon's test, Grubbs' test, Z-scores, among others. In the Appendix of Chapter 11 (Cluster Analysis), a very efficient method for detecting multivariate outliers will be presented (BACON algorithm—Blocked Adaptive Computationally Efficient Outlier Nominators).
The existence of outliers in boxplots (the construction of boxplots was studied in Section 3.3.2.5) is identified from the IQR (interquartile range), which corresponds to the difference between the third and first quartiles:

IQR = Q3 − Q1     (3.22)

Note that the IQR is the length of the box. Any value located more than 1.5·IQR below Q1 or above Q3 is considered a mild outlier and is represented by a circle. It may even be accepted in the population, but with some suspicion. Thus, the value X° of a variable is considered a mild outlier when:

X° < Q1 − 1.5 · IQR     (3.23)
X° > Q3 + 1.5 · IQR     (3.24)


FIG. 3.15 Boxplot with the identification of outliers.

Any value located more than 3·IQR below Q1 or above Q3 is considered an extreme outlier and is represented by an asterisk. Thus, the value X* of a variable is considered an extreme outlier when:

X* < Q1 − 3 · IQR     (3.25)
X* > Q3 + 3 · IQR     (3.26)

Fig. 3.15 illustrates the boxplot with the identification of outliers.

Example 3.27
Consider the sorted data in Example 3.24 regarding the average carrot processing time in the post-harvest handling phase:

44.0  44.5  44.5  44.7  44.8  44.9  44.9  45.0  45.0  45.0
45.0  45.4  45.6  45.7  45.8  46.0  46.2  46.2  46.3  46.5

where Q1 = 44.85, Q2 = 45, Q3 = 45.9, mean = 45.3, and mode = 45. Check whether there are mild and extreme outliers.
Solution
To verify whether there is a possible outlier, we must calculate:

Q1 − 1.5 · (Q3 − Q1) = 44.85 − 1.5 · (45.9 − 44.85) = 43.275
Q3 + 1.5 · (Q3 − Q1) = 45.9 + 1.5 · (45.9 − 44.85) = 47.475

Since there is no value in the distribution outside this interval, we conclude that there are no mild outliers. Obviously, it is not necessary to calculate the interval for extreme outliers.
In case only one outlier in a certain variable is identified, the researcher can treat it through some existing procedures, such as the complete elimination of this observation. On the other hand, if there is more than one outlier for one or more variables individually, the elimination of all the observations can reduce the sample size significantly. To avoid this problem, it is very common for observations considered outliers for a certain variable to have their atypical values replaced by the mean of the variable computed excluding the outliers (Fávero et al., 2009). The authors mention other procedures for dealing with outliers, such as replacing them with values from a regression, or winsorization, which, in an organized way, eliminates an equal number of observations from each side of the distribution. Fávero et al. (2009) also highlight the importance of dealing with outliers when the researcher is interested in investigating the behavior of a certain variable without the influence of observations with atypical values. On the other hand, if the main goal is to analyze the behavior of these atypical observations or to define subgroups through discrepancy criteria, eliminating these observations or substituting their values may not be the best solution.
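The same check can be automated. The Python sketch below applies the fences of Expressions (3.23)–(3.26) to the processing times, assuming the quartiles obtained in Example 3.24; the names are illustrative, not taken from any package discussed in the chapter.

```python
# Illustrative sketch of the outlier fences in Expressions (3.23)-(3.26).
def outlier_fences(q1, q3):
    iqr = q3 - q1
    return {"mild": (q1 - 1.5 * iqr, q3 + 1.5 * iqr),
            "extreme": (q1 - 3.0 * iqr, q3 + 3.0 * iqr)}

times = [44.0, 44.5, 44.5, 44.7, 44.8, 44.9, 44.9, 45.0, 45.0, 45.0,
         45.0, 45.4, 45.6, 45.7, 45.8, 46.0, 46.2, 46.2, 46.3, 46.5]
low, high = outlier_fences(44.85, 45.9)["mild"]      # Q1 and Q3 from Example 3.24
print(round(low, 3), round(high, 3))                 # 43.275 47.475
print([x for x in times if x < low or x > high])     # [] -> no mild outliers
```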

3.4.2 Measures of Dispersion or Variability

To study the behavior of a set of data, we use measures of central tendency, measures of dispersion, in addition to the nature or shape of the data distribution. Measures of central tendency determine a value that represents the set of data. In order to characterize the dispersion or variability of the data, measures of dispersion are necessary. The most common measures of dispersion are the range, average deviation, variance, standard deviation, standard error, and the coefficient of variation (CV).

3.4.2.1 Range
The simplest measure of variability is the total range, or simply range (R), which represents the difference between the highest and the lowest value of the set of data:

R = X_max − X_min     (3.27)

3.4.2.2 Average Deviation
Deviation is the difference between each observed value and the mean of the variable. Thus, for population data, it is denoted by (Xi − μ) and, for sample data, by (Xi − X̄). The modulus or absolute deviation ignores the sign and is represented by |Xi − μ| or |Xi − X̄|. The average deviation, or absolute average deviation, represents the arithmetic mean of the absolute deviations.

3.4.2.2.1 Case 1: Average Deviation of Ungrouped Discrete and Continuous Data
The average deviation (D) is the sum of the absolute deviations of all observations divided by the population size (N) or the sample size (n):

D = Σ_{i=1}^{N} |Xi − μ| / N (for the population)     (3.28)

D = Σ_{i=1}^{n} |Xi − X̄| / n (for samples)     (3.29)

Example 3.28 Table 3.E.32 shows the distances traveled (in km) by a vehicle in order to deliver 10 packages throughout the day. Calculate the average deviation.

TABLE 3.E.32 Distances Traveled (km)

12.4   22.6   18.9   9.7    14.5
22.5   26.3   17.7   31.2   20.4

Solution
For the data in Table 3.E.32, we have X̄ = 19.62. Applying Expression (3.29), we get the average deviation:

D = (|12.4 − 19.62| + |22.6 − 19.62| + ⋯ + |20.4 − 19.62|) / 10 = 4.98

The average deviation can be calculated directly in Excel using the AVEDEV function.
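For readers working outside Excel, Expression (3.29) translates directly into code; the sketch below is illustrative (average_deviation is our name) and reproduces the result of Example 3.28.

```python
# Illustrative sketch of Expression (3.29); mirrors Excel's AVEDEV function.
def average_deviation(values):
    n = len(values)
    mean = sum(values) / n
    return sum(abs(x - mean) for x in values) / n

distances = [12.4, 22.6, 18.9, 9.7, 14.5, 22.5, 26.3, 17.7, 31.2, 20.4]
print(round(average_deviation(distances), 2))   # 4.98
```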

3.4.2.2.2 Case 2: Average Deviation of Grouped Discrete Data
For grouped data, presented in a frequency distribution table with m groups, the average deviation is calculated as:

D = Σ_{i=1}^{m} |Xi − μ| · Fi / N (for the population)     (3.30)

D = Σ_{i=1}^{m} |Xi − X̄| · Fi / n (for samples)     (3.31)

bearing in mind that X̄ = Σ_{i=1}^{m} Xi · Fi / n.

Example 3.29 Table 3.E.33 shows the number of goals scored by the D.C. soccer team in their last 30 games, with their respective absolute frequencies. Calculate the average deviation.

TABLE 3.E.33 Frequency Distribution of Example 3.29

Number of Goals   Fi
0                  5
1                  8
2                  6
3                  4
4                  4
5                  2
6                  1
Sum               30

Solution
The mean is X̄ = (0·5 + 1·8 + ⋯ + 6·1) / 30 = 2.133. The average deviation can be determined from the calculations presented in Table 3.E.34:

TABLE 3.E.34 Calculations of the Average Deviation for Example 3.29

Number of Goals   Fi   |Xi − X̄|   |Xi − X̄|·Fi
0                  5     2.133       10.667
1                  8     1.133        9.067
2                  6     0.133        0.800
3                  4     0.867        3.467
4                  4     1.867        7.467
5                  2     2.867        5.733
6                  1     3.867        3.867
Sum               30                 41.067

Therefore, D = Σ_{i=1}^{m} |Xi − X̄|·Fi / n = 41.067 / 30 = 1.369.


3.4.2.2.3 Case 3: Average Deviation of Continuous Data Grouped into Classes
For continuous data grouped into classes, the average deviation is calculated as:

D = Σ_{i=1}^{k} |Xi − μ| · Fi / N (for the population)     (3.32)

D = Σ_{i=1}^{k} |Xi − X̄| · Fi / n (for samples)     (3.33)

Note that Expressions (3.32) and (3.33) are similar to Expressions (3.30) and (3.31), respectively, except that, instead of m groups, we consider k classes. Moreover, Xi represents the middle or central point of each class i, where X̄ = Σ_{i=1}^{k} Xi · Fi / n, as presented in Expression (3.6).

Example 3.30
In order to determine its variation due to genetic factors, a survey with 100 newborn babies collected information about their weight. Table 3.E.35 shows the data grouped into classes and their respective absolute frequencies. Calculate the average deviation.

TABLE 3.E.35 Newborn Babies' Weight (in kg) Grouped into Classes

Class        Fi
2.0 ├ 2.5    10
2.5 ├ 3.0    24
3.0 ├ 3.5    31
3.5 ├ 4.0    22
4.0 ├ 4.5    13
Sum         100

Solution
First, we must calculate X̄:

X̄ = Σ_{i=1}^{k} Xi · Fi / n = (2.25·10 + 2.75·24 + 3.25·31 + 3.75·22 + 4.25·13) / 100 = 3.270

The average deviation can be determined from the calculations presented in Table 3.E.36:

TABLE 3.E.36 Calculations of the Average Deviation for Example 3.30

Class        Fi    Xi     |Xi − X̄|   |Xi − X̄|·Fi
2.0 ├ 2.5    10    2.25     1.02       10.20
2.5 ├ 3.0    24    2.75     0.52       12.48
3.0 ├ 3.5    31    3.25     0.02        0.62
3.5 ├ 4.0    22    3.75     0.48       10.56
4.0 ├ 4.5    13    4.25     0.98       12.74
Sum         100                        46.6

Therefore, D = Σ_{i=1}^{k} |Xi − X̄|·Fi / n = 46.6 / 100 = 0.466.


3.4.2.3 Variance Variance is a measure of dispersion or variability that evaluates how much the data are dispersed in relation to the arithmetic mean. Thus, the higher the variance, the higher the data dispersion.

3.4.2.3.1 Case 1: Variance of Ungrouped Discrete and Continuous Data
Instead of considering the mean of absolute deviations, as discussed in the previous section, it is more common to calculate the mean of squared deviations. This measure is known as variance:

σ² = Σ_{i=1}^{N} (Xi − μ)² / N = (Σ_{i=1}^{N} Xi² − (Σ_{i=1}^{N} Xi)² / N) / N (for the population)     (3.34)

S² = Σ_{i=1}^{n} (Xi − X̄)² / (n − 1) = (Σ_{i=1}^{n} Xi² − (Σ_{i=1}^{n} Xi)² / n) / (n − 1) (for samples)     (3.35)

The relationship between the sample variance (S²) and the population variance (σ²) is given by:

S² = (N / (n − 1)) · σ²     (3.36)

Example 3.31
Consider the data in Example 3.28 regarding the distances traveled (in km) by a vehicle in order to deliver 10 packages throughout the day. Calculate the variance.
Solution
We saw in Example 3.28 that X̄ = 19.62. Applying Expression (3.35), we have:

S² = ((12.4 − 19.62)² + (22.6 − 19.62)² + ⋯ + (20.4 − 19.62)²) / 9 = 41.94

The sample variance can be calculated directly in Excel using the VAR.S function. To calculate the population variance, we must use the VAR.P function.
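The same calculation can be checked in code. The Python sketch below is illustrative: it applies Expression (3.35) (and, taking the square root, Expression (3.42) ahead) to the distances of Example 3.31; statistics.variance uses the same n − 1 divisor.

```python
# Illustrative sketch of Expressions (3.35) and (3.42) for the distances traveled.
import statistics

distances = [12.4, 22.6, 18.9, 9.7, 14.5, 22.5, 26.3, 17.7, 31.2, 20.4]

n = len(distances)
mean = sum(distances) / n
s2 = sum((x - mean) ** 2 for x in distances) / (n - 1)   # sample variance
s = s2 ** 0.5                                            # sample standard deviation

print(round(s2, 2), round(s, 3))                  # 41.94 6.476
print(round(statistics.variance(distances), 2))   # 41.94, same result
```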

3.4.2.3.2 Case 2: Variance of Grouped Discrete Data
For grouped data, represented in a frequency distribution table with m groups, the variance can be calculated as follows:

σ² = Σ_{i=1}^{m} (Xi − μ)² · Fi / N = (Σ_{i=1}^{m} Xi² · Fi − (Σ_{i=1}^{m} Xi · Fi)² / N) / N (for the population)     (3.37)

S² = Σ_{i=1}^{m} (Xi − X̄)² · Fi / (n − 1) = (Σ_{i=1}^{m} Xi² · Fi − (Σ_{i=1}^{m} Xi · Fi)² / n) / (n − 1) (for samples)     (3.38)

where X̄ = Σ_{i=1}^{m} Xi · Fi / n.


Example 3.32
Consider the data in Example 3.29 regarding the number of goals scored by the D.C. soccer team in the last 30 games, with their respective absolute frequencies. Calculate the variance.
Solution
As calculated in Example 3.29, the mean is X̄ = 2.133. The variance can be determined from the calculations presented in Table 3.E.37:

TABLE 3.E.37 Calculations of the Variance

Number of Goals   Fi   (Xi − X̄)²   (Xi − X̄)²·Fi
0                  5      4.551        22.756
1                  8      1.284        10.276
2                  6      0.018         0.107
3                  4      0.751         3.004
4                  4      3.484        13.938
5                  2      8.218        16.436
6                  1     14.951        14.951
Sum               30                   81.467

Therefore, S² = Σ_{i=1}^{m} (Xi − X̄)²·Fi / (n − 1) = 81.467 / 29 = 2.809.

3.4.2.3.3 Case 3: Variance of Continuous Data Grouped into Classes
For continuous data grouped into classes, we calculate the variance as follows:

σ² = Σ_{i=1}^{k} (Xi − μ)² · Fi / N = (Σ_{i=1}^{k} Xi² · Fi − (Σ_{i=1}^{k} Xi · Fi)² / N) / N (for the population)     (3.39)

S² = Σ_{i=1}^{k} (Xi − X̄)² · Fi / (n − 1) = (Σ_{i=1}^{k} Xi² · Fi − (Σ_{i=1}^{k} Xi · Fi)² / n) / (n − 1) (for samples)     (3.40)

Note that Expressions (3.39) and (3.40) are similar to Expressions (3.37) and (3.38), respectively, except that we consider k classes instead of m groups.

Example 3.33
Consider the data in Example 3.30 regarding the weight of newborn babies grouped into classes, with their respective absolute frequencies. Calculate the variance.
Solution
As calculated in Example 3.30, we have X̄ = 3.270.


The variance can be determined from the calculations presented in Table 3.E.38:

TABLE 3.E.38 Calculations of the Variance for Example 3.33

Class        Fi    Xi     (Xi − X̄)²   (Xi − X̄)²·Fi
2.0 ├ 2.5    10    2.25     1.0404       10.404
2.5 ├ 3.0    24    2.75     0.2704        6.4896
3.0 ├ 3.5    31    3.25     0.0004        0.0124
3.5 ├ 4.0    22    3.75     0.2304        5.0688
4.0 ├ 4.5    13    4.25     0.9604       12.4852
Sum         100                          34.46

Therefore, S² = Σ_{i=1}^{k} (Xi − X̄)²·Fi / (n − 1) = 34.46 / 99 = 0.348.

3.4.2.4 Standard Deviation
Since the variance considers the mean of squared deviations, its value tends to be very high and difficult to interpret. To solve this problem, we calculate the square root of the variance. This measure is known as the standard deviation. It is calculated as follows:

σ = √σ² (for the population)     (3.41)

S = √S² (for samples)     (3.42)

Example 3.34
Once again, consider the data in Examples 3.28 or 3.31 regarding the distances traveled (in km) by the vehicle. Calculate the standard deviation.
Solution
We have X̄ = 19.62. The standard deviation is the square root of the variance, which has already been calculated in Example 3.31:

S = √(((12.4 − 19.62)² + (22.6 − 19.62)² + ⋯ + (20.4 − 19.62)²) / 9) = √41.94 = 6.476

The standard deviation of a sample can be calculated directly in Excel using the STDEV.S function. To calculate the standard deviation of the population, we use the STDEV.P function.

Example 3.35
Consider the data in Examples 3.29 or 3.32 regarding the number of goals scored by the D.C. soccer team in the last 30 games, with their respective absolute frequencies. Calculate the standard deviation.
Solution
The mean is X̄ = 2.133. The standard deviation is the square root of the variance, so it can be determined from the calculations already presented in Example 3.32 (Table 3.E.37):

S = √(Σ_{i=1}^{m} (Xi − X̄)²·Fi / (n − 1)) = √(81.467 / 29) = √2.809 = 1.676


Example 3.36
Consider the data in Examples 3.30 or 3.33 regarding the weight of newborn babies grouped into classes, with their respective absolute frequencies. Calculate the standard deviation.
Solution
We have X̄ = 3.270. The standard deviation is the square root of the variance, so it can be determined from the calculations already presented in Example 3.33 (Table 3.E.38):

S = √(Σ_{i=1}^{k} (Xi − X̄)²·Fi / (n − 1)) = √(34.46 / 99) = √0.348 = 0.59

3.4.2.5 Standard Error
The standard error is the standard deviation of the mean. It is obtained by dividing the standard deviation by the square root of the population or sample size:

σ_X̄ = σ / √N (for the population)     (3.43)

S_X̄ = S / √n (for samples)     (3.44)

The higher the number of measurements, the better the determination of the average value will be (higher accuracy), due to the compensation of random errors. Example 3.37 One of the phases in the preparation of concrete is mixing it in a concrete mixer. Tables 3.E.39 and 3.E.40 show the concrete mixing times (in seconds), considering a sample with 10 and 30 elements, respectively. Calculate the standard error for both cases and interpret the results.

TABLE 3.E.39 Concrete Mixing Time for a Sample With 10 Elements

124   111   132   142   108   127   133   144   148   105

TABLE 3.E.40 Concrete Mixing Time for a Sample With 30 Elements

125   102   135   126   132   129   156   112   108   134
126   104   143   140   138   129   119   114   107   121
124   112   148   145   130   125   120   127   106   148

Solution
First, let's calculate the standard deviation for both samples:

S1 = √(((124 − 127.4)² + (111 − 127.4)² + ⋯ + (105 − 127.4)²) / 9) = 15.364
S2 = √(((125 − 126.167)² + (102 − 126.167)² + ⋯ + (148 − 126.167)²) / 29) = 14.227

To calculate the standard error, we must apply Expression (3.44):

S_X̄1 = S1 / √n1 = 15.364 / √10 = 4.858


S_X̄2 = S2 / √n2 = 14.227 / √30 = 2.598

Despite the small difference between the two standard deviations, the standard error of the first sample is almost double that of the second. Therefore, the higher the number of measurements, the higher the accuracy.

3.4.2.6 Coefficient of Variation
The coefficient of variation (CV) is a relative measure of dispersion that provides the variation of the data in relation to the mean. The smaller its value, the more homogeneous the data, that is, the smaller the dispersion around the mean. It can be calculated as follows:

CV = (σ / μ) · 100 (%) (for the population)     (3.45)

CV = (S / X̄) · 100 (%) (for samples)     (3.46)

A CV can be considered low, indicating a reasonably homogeneous set of data, when it is less than 30%. If this value is greater than 30%, the set of data can be considered heterogeneous. However, this standard varies according to the application.

Example 3.38
Calculate the coefficient of variation for both samples of the previous example.
Solution
Applying Expression (3.46), we have:

CV1 = (S1 / X̄1) · 100 = (15.364 / 127.4) · 100 = 12.06%
CV2 = (S2 / X̄2) · 100 = (14.227 / 126.167) · 100 = 11.28%

These results confirm the homogeneity of the data of the variable being studied for both samples. We conclude, therefore, that the mean is a good measure to represent the data. Let's now study the measures of skewness and kurtosis.
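Expressions (3.44) and (3.46) can also be verified in code. The sketch below is illustrative and uses only the 10-element sample of Table 3.E.39; the 30-element sample of Table 3.E.40 would be treated the same way.

```python
# Illustrative sketch of Expressions (3.44) and (3.46) for the 10-element sample.
import statistics

times = [124, 111, 132, 142, 108, 127, 133, 144, 148, 105]

mean = statistics.mean(times)          # 127.4
s = statistics.stdev(times)            # sample standard deviation (n - 1 divisor)
se = s / len(times) ** 0.5             # standard error of the mean, Expression (3.44)
cv = s / mean * 100                    # coefficient of variation in %, Expression (3.46)

print(round(s, 3), round(se, 3), round(cv, 2))   # 15.364 4.858 12.06
```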

3.4.3 Measures of Shape

Measures of asymmetry (skewness) and kurtosis characterize the shape of the distribution of the population elements sampled around the mean (Maroco, 2014).

3.4.3.1 Measures of Skewness
Measures of skewness describe the shape of a frequency distribution curve. For a symmetrical curve or frequency distribution, the mean, the mode, and the median are the same. For an asymmetrical curve, the mean gets farther away from the mode, and the median is located in an intermediate position. Fig. 3.16 shows a symmetrical distribution. If the frequency distribution is more concentrated on the left side, that is, the tail on the right is longer than the tail on the left, we have a positively skewed distribution (skewed to the right), as shown in Fig. 3.17. In this case, the mean is greater than the median, and the latter is greater than the mode (Mo < Md < X̄). Conversely, if the frequency distribution is more concentrated on the right side, that is, the tail on the left is longer than the tail on the right, we have a negatively skewed distribution (skewed to the left), as shown in Fig. 3.18. In this case, the mean is less than the median, and the latter is less than the mode (X̄ < Md < Mo).


FIG. 3.16 Symmetrical distribution.

FIG. 3.17 Skewness to the right or positive skewness.

FIG. 3.18 Skewness to the left or negative skewness.

3.4.3.1.1 Pearson's First Coefficient of Skewness
Pearson's first coefficient of skewness (Sk1) is a measure of skewness given by the difference between the mean and the mode, weighted by a measure of dispersion (the standard deviation):

Sk1 = (μ − Mo) / σ (for the population)     (3.47)

Sk1 = (X̄ − Mo) / S (for samples)     (3.48)

which has the following interpretation:
If Sk1 = 0, the distribution is symmetrical;
If Sk1 > 0, the distribution is positively skewed (to the right);
If Sk1 < 0, the distribution is negatively skewed (to the left).

Example 3.39
From one set of data, we obtained the following measures: X̄ = 34.7, Mo = 31.5, Md = 33.2, and S = 12.4. Determine the type of skewness and calculate Pearson's first coefficient of skewness.


Solution
Since Mo < Md < X̄, we have a positively skewed distribution (to the right). Applying Expression (3.48), we can determine Pearson's first coefficient of skewness:

Sk1 = (X̄ − Mo) / S = (34.7 − 31.5) / 12.4 = 0.258

Classifying the distribution as positively skewed can also be read off from the value Sk1 > 0.

3.4.3.1.2 Pearson's Second Coefficient of Skewness
To avoid using the mode to calculate the skewness, we can adopt the empirical relationship between the mean, the median, and the mode, X̄ − Mo ≈ 3·(X̄ − Md), which leads to Pearson's second coefficient of skewness (Sk2):

Sk2 = 3·(μ − Md) / σ (for the population)     (3.49)

Sk2 = 3·(X̄ − Md) / S (for samples)     (3.50)



3: X  Md 3:ð34:7  33:2Þ ¼ ¼ 0:363 S 12:4 Analogously, since Sk2 > 0, we confirm that the distribution is positively skewed. Sk 2 ¼

3.4.3.1.3 Bowley's Coefficient of Skewness
Another measure of skewness is Bowley's coefficient of skewness (SkB), also known as the quartile coefficient of skewness, calculated from quantiles (the first and third quartiles, in addition to the median):

SkB = (Q3 + Q1 − 2·Md) / (Q3 − Q1)     (3.51)

In the same way, we have:
If SkB = 0, the distribution is symmetrical;
If SkB > 0, the distribution is positively skewed (to the right);
If SkB < 0, the distribution is negatively skewed (to the left).


Example 3.41
Calculate Bowley's coefficient of skewness for the following dataset, which has already been sorted in ascending order:

Value     24   25   29   31   36   40   44   45   48   50   54   56
Position  1st  2nd  3rd  4th  5th  6th  7th  8th  9th  10th 11th 12th

Solution
We have Q1 = 30, Md = 42, and Q3 = 49. Therefore, we can determine Bowley's coefficient of skewness:

SkB = (Q3 + Q1 − 2·Md) / (Q3 − Q1) = (49 + 30 − 2·42) / (49 − 30) = −0.263

Since SkB < 0, we conclude that the distribution is negatively skewed (to the left).

3.4.3.1.4 Fisher's Coefficient of Skewness
The last measure of skewness we will study is known as Fisher's coefficient of skewness (g1), calculated from the third moment around the mean (M3), as presented in Maroco (2014):

g1 = n²·M3 / ((n − 1)·(n − 2)·S³)     (3.52)

where:

M3 = Σ_{i=1}^{n} (Xi − X̄)³ / n     (3.53)

which is interpreted the same way as the other coefficients of skewness, that is:
If g1 = 0, the distribution is symmetrical;
If g1 > 0, the distribution is positively skewed (to the right);
If g1 < 0, the distribution is negatively skewed (to the left).
Fisher's coefficient of skewness can be calculated in Excel using the SKEW function (see Example 3.42) or using the Analysis ToolPak add-in (Section 3.5). Its calculation through SPSS software will be presented in Section 3.6.

3.4.3.1.5 Coefficient of Skewness on Stata
The coefficient of skewness on Stata is calculated from the second and third moments around the mean, as presented by Cox (2010):

Sk = M3 / M2^(3/2)     (3.54)

where:

M2 = Σ_{i=1}^{n} (Xi − X̄)² / n     (3.55)


which is interpreted the same way as the other coefficients of skewness, that is:
If Sk = 0, the distribution is symmetrical;
If Sk > 0, the distribution is positively skewed (to the right);
If Sk < 0, the distribution is negatively skewed (to the left).

3.4.3.2 Measures of Kurtosis
In addition to measures of skewness, measures of kurtosis can also be used to characterize the shape of the distribution of the variable being studied. Kurtosis can be defined as the flatness level of a frequency distribution (the height of the peak of the curve) in relation to a theoretical distribution, usually the normal distribution. When the distribution is neither very flat nor very long, similar to a normal curve, it is called mesokurtic, as we can see in Fig. 3.19. In contrast, when the distribution shows a frequency curve that is flatter than a normal curve, it is called platykurtic, as shown in Fig. 3.20. When the distribution presents a frequency curve that is longer than a normal curve, it is called leptokurtic, as shown in Fig. 3.21.

3.4.3.2.1 Coefficient of Kurtosis
One of the most common coefficients to measure the flatness level or kurtosis of a distribution is the percentile coefficient of kurtosis, or simply coefficient of kurtosis (k). It is calculated from the interquartile interval, in addition to the 10th and 90th percentiles:

k = (Q3 − Q1) / (2·(P90 − P10))     (3.56)

which has the following interpretation:
If k = 0.263, we say that the curve is mesokurtic;
If k > 0.263, we say that the curve is platykurtic;
If k < 0.263, we say that the curve is leptokurtic.

FIG. 3.19 Mesokurtic curve.

FIG. 3.20 Platykurtic curve.


FIG. 3.21 Leptokurtic curve.

3.4.3.2.2 Fisher's Coefficient of Kurtosis
Another very common measure to determine the flatness level or kurtosis of a distribution is Fisher's coefficient of kurtosis (g2). It is calculated using the fourth moment around the mean (M4), as presented in Maroco (2014):

g2 = n²·(n + 1)·M4 / ((n − 1)·(n − 2)·(n − 3)·S⁴) − 3·(n − 1)² / ((n − 2)·(n − 3))     (3.57)

where:

M4 = Σ_{i=1}^{n} (Xi − X̄)⁴ / n     (3.58)

which has the following interpretation:
If g2 = 0, the curve has a normal distribution (mesokurtic);
If g2 < 0, the curve is very flat (platykurtic);
If g2 > 0, the curve is very long (leptokurtic).
Many statistical software packages, among them SPSS, use Fisher's coefficient of kurtosis to calculate the flatness level or kurtosis (Section 3.6). In Excel, the KURT function calculates Fisher's coefficient of kurtosis (Example 3.42), and it can also be obtained through the Analysis ToolPak add-in (Section 3.5).

3.4.3.2.3 Coefficient of Kurtosis on Stata
The coefficient of kurtosis on Stata is calculated from the second and fourth moments around the mean, as presented by Bock (1975) and Cox (2010):

kS = M4 / M2²     (3.59)

which has the following interpretation:
If kS = 3, the curve has a normal distribution (mesokurtic);
If kS < 3, the curve is very flat (platykurtic);
If kS > 3, the curve is very long (leptokurtic).


Example 3.42 Table 3.E.41 shows the prices of stock Y throughout a month, resulting in a sample with 20 periods (i.e., business days). Calculate: a) Fisher’s coefficient of skewness (g1); b) The coefficient of skewness used on Stata; c) Fisher’s coefficient of kurtosis (g2); d) The coefficient of kurtosis used on Stata;

TABLE 3.E.41 Prices of Stock Y Throughout the Month

18.7   18.3   18.4   18.7   18.8   18.8   19.1   18.9   19.1   19.9
18.5   18.5   18.1   17.9   18.2   18.3   18.1   18.8   17.5   16.9

Solution
The mean and the standard deviation of the data in Table 3.E.41 are X̄ = 18.475 and S = 0.6324, respectively. We have:
a) Fisher's coefficient of skewness g1: it is calculated using the third moment around the mean (M3):

M3 = Σ_{i=1}^{n} (Xi − X̄)³ / n = ((18.7 − 18.475)³ + ⋯ + (16.9 − 18.475)³) / 20 = −0.0788

Therefore, we have:

g1 = n²·M3 / ((n − 1)·(n − 2)·S³) = (20)² · (−0.0788) / (19 · 18 · (0.6324)³) = −0.3647

Since g1 < 0, we can conclude that the frequency curve is more concentrated on the right side and has a longer tail to the left, that is, the distribution is asymmetrical to the left (negative).
Excel calculates Fisher's coefficient of skewness (g1) through the SKEW function. File Stock_Market.xls shows the data from Table 3.E.41 in cells A1:A20. Thus, to calculate it, we just need to insert the expression =SKEW(A1:A20).
b) The coefficient of skewness used on Stata: it is calculated from the second and third moments around the mean:

M2 = Σ_{i=1}^{n} (Xi − X̄)² / n = ((18.7 − 18.475)² + ⋯ + (16.9 − 18.475)²) / 20 = 0.3799
M3 = −0.0788

It is calculated as follows:

Sk = M3 / M2^(3/2) = −0.3367,

which is interpreted the same way as Fisher's coefficient of skewness.
c) Fisher's coefficient of kurtosis g2: it is calculated using the fourth moment around the mean (M4):

M4 = Σ_{i=1}^{n} (Xi − X̄)⁴ / n = ((18.7 − 18.475)⁴ + ⋯ + (16.9 − 18.475)⁴) / 20 = 0.5857

Therefore, we calculate g2 as follows:

g2 = n²·(n + 1)·M4 / ((n − 1)·(n − 2)·(n − 3)·S⁴) − 3·(n − 1)² / ((n − 2)·(n − 3))
g2 = (20)² · 21 · 0.5857 / (19 · 18 · 17 · (0.6324)⁴) − 3 · (19)² / (18 · 17) = 1.7529

Thus, we can conclude that the curve is long or leptokurtic.


The KURT function in Excel calculates Fisher's coefficient of kurtosis (g2). To calculate it from the file Stock_Market.xls, we must insert the expression =KURT(A1:A20).
d) Coefficient of kurtosis on Stata: it is calculated from the second and fourth moments around the mean, M2 = 0.3799 and M4 = 0.5857, as already calculated. Thus:

kS = M4 / M2² = 0.5857 / (0.3799)² = 4.0586

Since kS > 3, the curve is long or leptokurtic.
In the next three sections, we will discuss how to construct tables, charts, graphs, and summary measures in Excel and in the statistical software packages SPSS and Stata, using the data in Example 3.42.
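Before moving to the software walkthroughs, the four coefficients of Example 3.42 can be reproduced with a few lines of code. The Python sketch below is illustrative only: it computes the central moments directly and then applies Expressions (3.52), (3.54), (3.57), and (3.59).

```python
# Illustrative sketch of Expressions (3.52), (3.54), (3.57), and (3.59).
prices = [18.7, 18.3, 18.4, 18.7, 18.8, 18.8, 19.1, 18.9, 19.1, 19.9,
          18.5, 18.5, 18.1, 17.9, 18.2, 18.3, 18.1, 18.8, 17.5, 16.9]

n = len(prices)
mean = sum(prices) / n
s = (sum((x - mean) ** 2 for x in prices) / (n - 1)) ** 0.5   # sample std. deviation
m2 = sum((x - mean) ** 2 for x in prices) / n                 # 2nd central moment
m3 = sum((x - mean) ** 3 for x in prices) / n                 # 3rd central moment
m4 = sum((x - mean) ** 4 for x in prices) / n                 # 4th central moment

g1 = n ** 2 * m3 / ((n - 1) * (n - 2) * s ** 3)               # Fisher skewness (3.52)
sk = m3 / m2 ** 1.5                                           # Stata skewness (3.54)
g2 = (n ** 2 * (n + 1) * m4 / ((n - 1) * (n - 2) * (n - 3) * s ** 4)
      - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))               # Fisher kurtosis (3.57)
ks = m4 / m2 ** 2                                             # Stata kurtosis (3.59)

print(round(g1, 4), round(sk, 4))   # approximately -0.3647 and -0.3367
print(round(g2, 4), round(ks, 4))   # approximately  1.7529 and  4.0586
```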

3.5 A PRACTICAL EXAMPLE IN EXCEL

Section 3.3.1 showed the graphical representation of qualitative variables through bar charts (horizontal and vertical), pie charts, and the Pareto chart. We demonstrated how each one of these charts can be obtained using Excel. Conversely, Section 3.3.2 showed the graphical representation of quantitative variables through line graphs, scatter plots, histograms, among others. Analogously, we presented how most of them can be obtained using Excel. Section 3.4 presented the main summary measures, including measures of central tendency (mean, mode, and median), quantiles (quartiles, deciles, and percentiles), measures of dispersion or variability (range, average deviation, variance, standard deviation, standard error, and coefficient of variation), in addition to the measures of shape as skewness and kurtosis. Then, we presented how they can be calculated using the Excel functions, except the ones that are not available. This section discusses how to obtain descriptive statistics (such as, the mean, standard error, median, mode, standard deviation, variance, kurtosis, skewness, among others), through the Analysis ToolPak add-in in Excel. In order to do that, let’s consider the problem presented in Example 3.42, whose data are available in Excel in the file Stock_Market.xls, presented in cells A1:A20, as shown in Fig. 3.22. To load the Analysis ToolPak add-in in Excel, we must first click on the File tab and on Options, as shown in Fig. 3.23. Now, the Excel Options dialog box will open, as shown in Fig. 3.24. From this box, we selected the option Add-ins. In Add-ins, we must select the option Analysis ToolPak and click on Go. Then, the Add-ins dialog box will appear, as shown in Fig. 3.25. Among the add-ins available, we must select the option Analysis ToolPak and click on OK.

FIG. 3.22 Dataset in Excel—Price of Stock Y.


FIG. 3.23 File tab, focusing more on Options.

Thus, the option Data Analysis will start being available on the Data tab, inside the Analysis group, as shown in Fig. 3.26. Fig. 3.27 shows the Data Analysis dialog box. Note that several analysis tools are available. Let’s select the option Descriptive Statistics and click on OK. From the Descriptive Statistics dialog box (Fig. 3.28), we must select the Input Range (A1:A20) and, as Output options, let’s select Summary statistics. The results can be presented in a new spreadsheet or in a new work folder. Finally, let’s click on OK. The descriptive statistics generated can be seen in Fig. 3.29 and include measures of central tendency (mean, mode, and median), measures of dispersion or variability (variance, standard deviation, and standard error), and measures of shape (skewness and kurtosis). The range can be calculated from the difference between the sample’s maximum and minimum values. As mentioned in Sections 3.4.3.1 and 3.4.3.2, the measure of skewness calculated by Excel (using the SKEW function or by Fig. 3.28) corresponds to Fisher’s coefficient of skewness (g1); and the measure of kurtosis calculated (using the KURT function or by Fig. 3.28) corresponds to Fisher’s coefficient of kurtosis (g2).

3.6 A PRACTICAL EXAMPLE ON SPSS

From a practical example, this section presents how to obtain the main univariate descriptive statistics studied in this chapter by using IBM SPSS Statistics Software. These include frequency distribution tables, charts (histograms, stem-and-leaf plots, boxplots, bar charts, and pie charts), measures of central tendency (mean, mode, and median), quantiles

FIG. 3.24 Excel Options dialog box.

FIG. 3.25 Add-ins dialog box.


FIG. 3.26 Availability of the Data Analysis command, from the Data tab.

FIG. 3.27 Data Analysis dialog box.

FIG. 3.28 Descriptive Statistics dialog box.


FIG. 3.29 Descriptive statistics in Excel.

FIG. 3.30 Dataset on SPSS—Price of Stock Y.

(quartiles and percentiles), measures of dispersion or variability (range, variance, standard deviation, standard error, among others), and measures of shape (skewness and kurtosis). The use of the images in this section has been authorized by the International Business Machines Corporation©. The data presented in Example 3.42 are the input basis on SPSS and are available in the file Stock_Market.sav, as shown in Fig. 3.30. To obtain such descriptive statistics, we must click on Analyze → Descriptive Statistics. After that, three options can be used: Frequencies, Descriptives, and Explore.

3.6.1 Frequencies Option

This option can be used for qualitative and quantitative variables, and it provides frequency distribution tables, as well as measures of central tendency (mean, median, and mode), quantiles (quartiles and percentiles), measures of dispersion or variability (range, variance, standard deviation, standard error, among others), and measures of skewness and kurtosis. The Frequencies option also plots bar charts, pie charts, or histograms (with or without a normal curve). Therefore, on the toolbar, click on Analyze → Descriptive Statistics and select Frequencies..., as shown in Fig. 3.31.


FIG. 3.31 Descriptive statistics on SPSS—Frequencies Option.

FIG. 3.32 Frequencies dialog box: selecting the variable and showing the frequency table.

Therefore, the Frequencies dialog box will open. The variable being studied (Stock price, called Price) must be selected in Variable(s), and the Display frequency tables option must be activated so that the frequency distribution table can be shown (Fig. 3.32). The following step consists of clicking on Statistics... to select the summary measures that interest us (Fig. 3.33). Among the quantiles, let's select the option Quartiles (which calculates the first and third quartiles, in addition to the median). To get the percentile of order i (i = 1, 2, ..., 99), we must select the option Percentile(s) and add the desired order. In this case, we chose to calculate the percentiles of order 10 and 60. The measures of central tendency that we have to select are the mean, median, and mode. As measures of dispersion, let's select Std. deviation (standard deviation), Variance,


FIG. 3.33 Frequencies: Statistics dialog box.

Range, and S.E. mean (standard error). Finally, let's select both measures of shape of a distribution: Skewness and Kurtosis. To go back to the Frequencies dialog box, we must click on Continue. Next, let's click on Charts... and select the chart that interests us. The options are Bar charts, Pie charts, or Histograms. Let's select the last one, with the option of plotting a normal curve (Fig. 3.34). Bar or pie charts can be shown in terms of absolute frequencies (Frequencies) or relative frequencies (Percentages). In order to go back to the Frequencies dialog box once again, we must click on Continue. Finally, click on OK. Fig. 3.35 shows the calculations of the summary measures selected in Fig. 3.33. As studied in Sections 3.4.3.1 and 3.4.3.2, the measure of skewness calculated by SPSS corresponds to Fisher's coefficient of skewness (g1), and the measure of kurtosis corresponds to Fisher's coefficient of kurtosis (g2). Also in Fig. 3.35, note that the percentiles of order 25, 50, and 75, which correspond to the first quartile, median, and third quartile, respectively, were calculated automatically. The method used to calculate the percentiles was the Weighted Average. The frequency distribution table can be seen in Fig. 3.36. The first column represents the absolute frequency of each element (Fi), the second and third columns represent the relative frequency of each element (Fri—%), and the last column represents the relative cumulative frequency (Frac—%). Also in Fig. 3.36, we can see that all the values occur only once. Since we have a continuous quantitative variable with 20 observations and no repetitions, constructing bar or pie charts would not give the researcher any additional information, that is, it would not allow a good visualization of how the stock prices behave in terms of bins. Hence, we chose to construct a histogram with previously defined bins. The histogram generated using SPSS with the option of plotting a normal curve can be seen in Fig. 3.37.

3.6.2

Descriptives Option

Unlike Frequencies..., which also offers the frequency distribution table, besides bar charts, pie charts, or histograms (with or without a normal curve), Descriptives... only makes summary measures available (therefore, it is recommended for quantitative variables). Even so, measures of central tendency such as the median and mode

FIG. 3.34 Frequencies: Charts dialog box.

FIG. 3.35 Summary measures obtained from Frequencies: Statistics.


FIG. 3.36 Frequency distribution.

FIG. 3.37 Histogram with a normal curve obtained from Frequencies: Charts.

[Chart area: histogram of Price with normal curve (Mean = 18.47, Std. Dev. = 0.632, N = 20), frequencies on the vertical axis and Price on the horizontal axis.]

are not made available; nor are quantiles, such as quartiles and percentiles. To use it, let's click on Analyze → Descriptive Statistics and select Descriptives..., as shown in Fig. 3.38. The Descriptives dialog box will open. The variable being studied must be selected in Variable(s), as shown in Fig. 3.39. Let's click on Options... and select the summary measures that interest us (Fig. 3.40). Note that the same summary measures chosen in Frequencies... were selected, except for the median and the mode, as well as the quartiles and percentiles, which are not available, as already mentioned. Let's click on Continue to go back to the Descriptives dialog box. Finally, click on OK. The results are available in Fig. 3.41.


FIG. 3.38 Descriptive statistics on SPSS—Descriptives Option.

FIG. 3.39 Descriptives dialog box: selecting the variable.

3.6.3

Explore Option

Like Descriptives..., Explore... does not provide the frequency distribution table either. Regarding charts, unlike Frequencies..., which offers bar charts, pie charts, and histograms, Explore... provides stem-and-leaf plots and boxplots, in addition to histograms; however, it does not have the option of plotting a normal curve. Regarding summary measures, Explore... provides measures of central tendency, such as the mean and median (there is no option for the mode); quantiles, such as percentiles (of order 5, 10, 25, 50, 75, 90, and 95); measures of dispersion, such as the range, variance, and standard deviation, among others (it does not calculate the standard error); besides measures of skewness and kurtosis.


FIG. 3.40 Descriptives: Options dialog box.

FIG. 3.41 Summary measures obtained from Descriptive: Options.

Therefore, this command is the best one to generate descriptive statistics for quantitative variables. Hence, from Analyze → Descriptive Statistics, select Explore..., as shown in Fig. 3.42. The Explore dialog box will open. The variable being studied must be selected from the list of dependent variables (Dependent List), as shown in Fig. 3.43. Next, we must click on Statistics... to open the Explore: Statistics box and select the options Descriptives, Outliers, and Percentiles, as shown in Fig. 3.44. Let's click on Continue to go back to the Explore box. Next, we must click on Plots... to open the Explore: Plots box and select the charts that interest us, as shown in Fig. 3.45. In this case, we have to select Boxplots: Factor levels together (the resulting boxplots will be plotted together in the same chart), Stem-and-leaf, and Histogram (note that there is no option for plotting the normal curve). Once again, we must click on Continue to go back to the Explore dialog box. Finally, click on OK. The results obtained are illustrated in the figures that follow. Fig. 3.46 shows the results obtained from Explore: Statistics, with the Descriptives option. Fig. 3.47 shows the results obtained from Explore: Statistics, with the Percentiles option. The percentiles of order 5, 10, 25 (Q1), 50 (median), 75 (Q3), 90, and 95 were calculated using two methods: the Weighted Average and Tukey's Hinges. The latter corresponds to the method proposed in this chapter (Section 3.4.1.2, Case 1). Thus, applying the expressions in


FIG. 3.42 Descriptive statistics on SPSS—Explore Option.

FIG. 3.43 Explore dialog box: selecting the variable.

Section 3.4.1.2 to this example, we get the same results seen in Fig. 3.47 as regards Tukey's Hinges method for calculating P25, P50, and P75. Coincidentally, in this example, the value of P75 was the same for both methods, but they are usually different. Fig. 3.48 shows the results obtained from Explore: Statistics, with the Outliers option. The extreme values of the distribution are presented here (the five highest and the five lowest), with their respective positions in the dataset. Next, the charts constructed from the options selected in Explore: Plots (histograms, stem-and-leaf plots, and boxplots) are presented in Figs. 3.49, 3.50, and 3.51, respectively.


FIG. 3.44 Explore: Statistics dialog box.

FIG. 3.45 Explore: Plots dialog box.

FIG. 3.46 Results Obtained from the Descriptives Option.

FIG. 3.47 Results obtained from the Percentiles option.

FIG. 3.48 Results obtained from the Outliers option.

FIG. 3.49 Histogram constructed from the Explore: Plots dialog box.

[Chart area: histogram of Price (Mean = 18.48, Std. Dev. = 0.632, N = 20), frequencies on the vertical axis and Price on the horizontal axis.]

FIG. 3.50 Stem-and-leaf chart generated from the Explore: Plots dialog box.

Price Stem-and-Leaf Plot

Frequency     Stem &  Leaf
 1.00 Extremes        (=<16.9)
 2.00            17 .  59
 6.00            18 .  112334
 8.00            18 .  55778889
 2.00            19 .  11
 1.00 Extremes        (>=19.9)

Stem width:  1.0
Each leaf:   1 case(s)

FIG. 3.51 Boxplot generated from the Explore: Plots dialog box.

[Chart area: boxplot of Price, vertical axis from 16.0 to 20.0, with observations 10 and 20 marked as mild outliers.]


Obviously, the histogram in Fig. 3.49 is the same as the one generated by Frequencies... (Fig. 3.37); however, it has no normal curve, since Explore... does not provide this function. Fig. 3.50 shows that the first two digits of each number (the integer part, before the decimal point) form the stem and the decimals correspond to the leaves. Moreover, stem 18 is represented in two lines because it contains several observations. In Section 3.4.1.3, we learned how to identify an extreme outlier through the expressions X* < Q1 − 3(Q3 − Q1) and X* > Q3 + 3(Q3 − Q1). If we consider that Q1 = 18.15 and Q3 = 18.8, we have X* < 16.2 or X* > 20.75. Since there are no observations outside these limits, we conclude that there are no extreme outliers. Repeating the same procedure for mild outliers, that is, applying the expressions X° < Q1 − 1.5(Q3 − Q1) and X° > Q3 + 1.5(Q3 − Q1), we can see that there is one observation with a value of less than 17.175 (the 20th observation), and another one with a value greater than 19.775 (the 10th observation). These values are therefore considered mild outliers. The boxplot in Fig. 3.51 shows that observations 10 and 20, with values 19.9 and 16.9, respectively, are mild outliers (represented by circles). Depending on their survey goals, this allows researchers to decide whether to keep them, exclude them (the analysis may be harmed because of the reduction in the sample size), or replace their values with the variable's mean. Continuing in Fig. 3.51, the values of Q1, Q2 (Md), and Q3 correspond to 18.15, 18.5, and 18.8, respectively, which are those obtained from Tukey's Hinges method (Fig. 3.47), considering all of the initial 20 observations. Therefore, the boxplot's measures of position (Q1, Md, and Q3), except for the minimum and maximum values, are calculated without excluding the outliers.
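To make the arithmetic explicit, the short sketch below recomputes these fences from the quartiles quoted above (Q1 = 18.15 and Q3 = 18.8). It is written in Python purely as an illustration and is not part of the book's Excel/SPSS/Stata workflow.

# Illustrative sketch: mild and extreme outlier fences from the reported quartiles.
q1, q3 = 18.15, 18.80
iqr = q3 - q1                                                   # interquartile range = 0.65

mild_lower, mild_upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr         # 17.175 and 19.775
extreme_lower, extreme_upper = q1 - 3.0 * iqr, q3 + 3.0 * iqr   # 16.20 and 20.75

print(f"Mild outliers:    below {mild_lower:.3f} or above {mild_upper:.3f}")
print(f"Extreme outliers: below {extreme_lower:.3f} or above {extreme_upper:.3f}")

Any observation outside the mild fences (here, 16.9 and 19.9) is a mild outlier; no observation falls outside the extreme fences, in agreement with the discussion above.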

3.7

A PRACTICAL EXAMPLE ON STATA

The same descriptive statistics obtained in the previous section through SPSS software will be calculated in this section through Stata Statistical Software. The results will be compared to those obtained in an algebraic way and also by using SPSS. The use of the images in this section has been authorized by StataCorp LP©. The data presented in Example 3.42 are the input basis on Stata, and are available in the file Stock_Market.dta.

3.7.1

Univariate Frequency Distribution Tables on Stata

Through command tabulate, or simply tab, as we will use throughout this book, we can obtain frequency distribution tables for a certain variable. The syntax of the command is: tab variable*

where the term variable* should be substituted for the name of the variable considered in the analysis. Fig. 3.52 shows the obtained output using the command tab price. Just as the frequency distribution table obtained through SPSS (Fig. 3.36), Fig. 3.52 provides the absolute, relative, and relative cumulative frequencies for each category of the variable price. FIG. 3.52 Frequency distribution on Stata using the command tab.
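As an aside, the same kind of table can be reproduced outside Stata. The sketch below uses Python with pandas and a small made-up series (the Stock_Market data are not reproduced here), only to illustrate how the absolute, relative, and cumulative relative frequencies reported by tab are formed.

# Illustrative sketch with hypothetical data: frequency distribution table in pandas.
import pandas as pd

values = pd.Series([7, 5, 9, 7, 5, 9, 9, 11, 7])     # hypothetical observations

freq = values.value_counts().sort_index()            # absolute frequency (Fi)
rel = 100 * freq / freq.sum()                        # relative frequency (Fri, %)
cum = rel.cumsum()                                   # cumulative relative frequency (%)

table = pd.DataFrame({"Freq.": freq, "Percent": rel.round(2), "Cum.": cum.round(2)})
print(table)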


Consider a case with more than one variable being studied in which the objective is to construct univariate frequency distribution tables (one-way tables), that is, one table for each variable being analyzed. In this case, we must use the command tab1, with the following syntax: tab1 variables*

where the term variables* should be substituted for the list of variables being considered in the analysis.

3.7.2

Summary of Univariate Descriptive Statistics on Stata

Through command summarize, or simply sum, as we will use throughout this book, we can obtain summary measures, such as, the mean, standard deviation, and minimum and maximum values. The syntax of this command is: sum variables*

where the term variables* should be substituted for the list of variables to be considered in the analysis. If no variable is specified, the statistics will be calculated for all of the variables in the dataset. Through the option detail, we can obtain additional statistics, such as, the coefficient of skewness, the coefficient of kurtosis, the four lowest and highest values, as well as several percentiles. The syntax of this command is: sum variables*, detail

Therefore, for the data in our example, available in the file Stock_Market.dta, first, we must type the following command: sum price

obtaining the statistics in Fig. 3.53. To obtain additional descriptive statistics, we must type the following command: sum price, detail

Fig. 3.54 shows the generated outputs. As shown in Fig. 3.54, the option detail provides the calculation of the percentiles of order 1, 5, 10, 25, 50, 75, 90, 95 and 99. These results are obtained by Tukey’s Hinges method. We have seen, through Fig. 3.47 on the SPSS software, the results of the percentiles of order 25, 50, and 75 obtained by the same method. Fig. 3.54 also provides the four lowest and highest values of the sample analyzed, as well as the coefficients of skewness and kurtosis. Note that these values coincide with the ones calculated in Sections 3.4.3.1.5 and 3.4.3.2.3, respectively.

FIG. 3.53 Summary measures using the command sum on Stata.

FIG. 3.54 Additional statistics using the option detail.


FIG. 3.55 Results obtained from the command centile on Stata.

3.7.3

Calculating Percentiles on Stata

The previous section discussed how to calculate the 1st, 5th, 10th, 25th, 50th, 75th, 90th, 95th, and 99th percentiles through Tukey’s Hinges method. On the other hand, by using the command centile, we can specify the percentiles to be calculated. The method used in this case is the Weighted Average. The syntax of this command is: centile variables*, centile (numbers*)

where the term variables* should be substituted for the list of variables to be considered in the analysis, and the term numbers* for the list of numbers that represent the order of the percentiles to be reported. Therefore, let’s suppose that we want to calculate the percentiles of order 5, 10, 25, 60, 64, 90, and 95 for the variable price, through the Weighted Average. In order to do that, we must use the following command: centile price, centile (5 10 25 60 64 90 95)

The results can be seen in Fig. 3.55. We have seen, through Fig. 3.35, the results of the SPSS software for the percentiles of order 10, 25, 50, 60, and 75 using the same method. Fig. 3.47 on SPSS also provided the calculation of the percentiles of order 5, 10, 25, 50, 75, 90, and 95 through the Weighted Average. The only percentile that had not been specified previously was the one of order 64; the others coincide with the results in Figs. 3.35 and 3.47.
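For readers who want to see what a "weighted average" percentile looks like computationally, the sketch below implements one common definition, which interpolates at position (n + 1)p/100 of the sorted sample. It is an illustrative Python sketch with made-up data; whether it reproduces a given package's output exactly depends on the convention that package adopts and should be checked against the software.

# Illustrative sketch: percentile by linear interpolation at position (n + 1) * p / 100.
def weighted_average_percentile(values, p):
    x = sorted(values)
    n = len(x)
    pos = (n + 1) * p / 100.0       # 1-based fractional position in the sorted data
    k = int(pos)                    # integer part
    frac = pos - k                  # fractional part used as interpolation weight
    if k < 1:
        return x[0]
    if k >= n:
        return x[-1]
    return x[k - 1] + frac * (x[k] - x[k - 1])

data = [12, 15, 15, 17, 20, 21, 24, 30]   # hypothetical sample
for p in (25, 50, 64, 75):
    print(p, weighted_average_percentile(data, p))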

3.7.4

Charts on Stata: Histograms, Stem-and-Leaf, and Boxplots

Stata makes a series of charts available, including bar charts, pie charts, scatter plots, histograms, stem-and-leaf, and boxplots, among others. Next, we will discuss how to obtain histograms, stem-and-leaf plots, and boxplots on Stata, for the data available in the file Stock_Market.dta.

3.7.4.1 Histogram Histograms on Stata can be obtained for continuous and discrete variables. In the case of continuous variables, to obtain a histogram of absolute frequencies, with the option of plotting a normal curve, we must type the following syntax: histogram variable*, normal frequency

or simply: hist variable*, norm freq

as we will use throughout this book. As mentioned before, the term variable* must be substituted for the name of the variable being studied. For discrete variables, we must include the term discrete: hist variable*, discrete norm freq


FIG. 3.56 Frequency histogram on Stata.


Going back to the data in Example 3.42, to obtain a frequency histogram, with the option of plotting a normal curve, we must type the following command: hist price, norm freq

The obtained output is shown in Fig. 3.56.

3.7.4.2 Stem-and-Leaf The stem-and-leaf plot on Stata can be obtained using the command stem, followed by the name of the variable being studied. For the data in the file Stock_Market.dta, we just need to type the following command: stem price

The obtained output is shown in Fig. 3.57.

3.7.4.3 Boxplot To obtain the boxplot on the Stata software, we must use the following syntax: graph box variables*

FIG. 3.57 Stem-and-Leaf plot on Stata.


FIG. 3.58 Boxplot on Stata.


where the term variables* should be substituted for the list of variables to be considered in the analysis, and, for each variable, one chart is constructed. For the data in Example 3.42, the command is: graph box price

The chart is shown in Fig. 3.58 which corresponds to the same chart as in Fig. 3.51 generated using SPSS.

3.8

FINAL REMARKS

In this chapter, we studied descriptive statistics for a single variable (univariate descriptive statistics), in order to acquire a better understanding of the behavior of each variable through tables, charts, graphs and summary measures, identifying trends, variability, and outliers. Before we start using descriptive statistics, it is necessary to identify the type of variable we will study. The type of variable is essential for calculating descriptive statistics and in the graphical representation of the results. The descriptive statistics used to represent the behavior of a qualitative variable’s data are frequency distribution tables and charts. The frequency distribution table for a qualitative variable represents the frequency in which each variable category occurs. The graphical representation of qualitative variables can be illustrated by bar charts (horizontal and vertical), pie charts, and a Pareto chart. For quantitative variables, the most common descriptive statistics are charts and summary measures (measures of position or location, dispersion or variability, and measures of shape). Frequency distribution tables can also be used to represent the frequency in which each possible value of a discrete variable occurs, or to represent the frequency of continuous variables’ data grouped into classes. Line graphs, dot or scatter plots, histograms, stem-and-leaf plots, and boxplots (or box-and-whisker diagrams) are normally used to graphically represent quantitative variables.

3.9

EXERCISES

1) What statistics can be used (and in which situations) to represent the behavior of a single quantitative or qualitative variable?
2) What are the limitations of only using measures of central tendency in the study of a certain variable?
3) How can we verify the existence of outliers in a certain variable?
4) Describe each one of the measures of dispersion or variability.
5) What is the difference between Pearson's first and second coefficients used as measures of skewness in a distribution?
6) What is the best chart to check the position, skewness, and discrepancy among the data?
7) In the case of bar charts and scatter plots, what kind of data should be used?
8) What are the most suitable charts to represent qualitative data?


9) Table 3.1 shows the number of vehicles sold by a dealership in the last 30 days. Construct a frequency distribution table for these data.

TABLE 3.1 Number of Vehicles Sold

7   5   9   11  10  8   9   6   8   10
8   5   7   11  9   11  6   7   10  9
8   5   6   8   6   7   6   5   10  8

10) A survey on patients' health was carried out and information regarding the weight of 50 patients was collected (Table 3.2). Build the frequency distribution table for this problem.

TABLE 3.2 Patients' Weight

60.4  78.9  65.7  82.1  80.9  92.3  85.7  86.6  90.3  93.2
75.2  77.3  80.4  62.0  90.4  70.4  80.5  75.9  55.0  84.3
81.3  78.3  70.5  85.6  71.9  77.5  76.1  67.7  80.6  78.0
71.6  74.8  92.1  87.7  83.8  93.4  69.3  97.8  81.7  72.2
69.3  80.2  90.0  76.9  54.7  78.4  55.2  75.5  99.3  66.7

11) At an electrical appliances factory, in the door component production phase, the quality inspector verifies the total number of parts rejected per type of defect (lack of alignment, scratches, deformation, discoloration, and oxygenation), as shown in Table 3.3.

TABLE 3.3 Total Number of Parts Rejected per Type of Defect

Type of Defect       Total
Lack of Alignment      98
Scratches              67
Deformation            45
Discoloration          28
Oxygenation            12
Total                 250

We would like you to:
a) Elaborate a frequency distribution table for this problem.
b) Construct a pie chart, in addition to a Pareto chart.
12) To preserve açaí, it is necessary to carry out several procedures, such as, whitening, pasteurization, freezing, and dehydration. The files Dehydration.xls, Dehydration.sav, and Dehydration.dta show the processing times (in seconds) in the dehydration phase throughout 100 periods. We would like you to:
a) Calculate the measures of position regarding the arithmetic mean, the median, and the mode.
b) Calculate the first and third quartiles and see if there are any outliers.
c) Calculate the 10th and 90th percentiles.
d) Calculate the 3rd and 6th deciles.
e) Calculate the measures of dispersion (range, average deviation, variance, standard deviation, standard error, and coefficient of variation).


f) Check if the distribution is symmetrical, positively skewed, or negatively skewed.
g) Calculate the coefficient of kurtosis and determine the flatness level of the distribution (mesokurtic, platykurtic, or leptokurtic).
h) Construct a histogram, a stem-and-leaf plot, and a boxplot for the variable being studied.
13) In a certain bank branch, we collected the average service time (in minutes) from a sample with 50 customers regarding three types of services. The data can be found in the files Services.xls, Services.sav, and Services.dta. Compare the results of the services based on the following measures:
a) Measures of position (mean, median, and mode).
b) Measures of dispersion (variance, standard deviation, and standard error).
c) First and third quartiles; check if there are any outliers.
d) Fisher's coefficient of skewness (g1) and Fisher's coefficient of kurtosis (g2). Classify the symmetry and the flatness level of each distribution.
e) For each one of the variables, construct a bar chart, a boxplot, and a histogram.
14) A passenger collected the average travel times (in minutes) of a bus in the district of Vila Mariana, on the Jabaquara route, for 120 days (Table 3.4). We would like you to:
a) Calculate the arithmetic mean, the median, and the mode.

TABLE 3.4 Average Travel Times in 120 Days

Time    Number of Days
30            4
32            7
33           10
35           12
38           18
40           22
42           20
43           15
45            8
50            4

b) Calculate Q1, Q3, D4, P61, and P84.
c) Are there any outliers?
d) Calculate the range, the variance, the standard deviation, and the standard error.
e) Calculate Fisher's coefficient of skewness (g1) and Fisher's coefficient of kurtosis (g2). Classify the symmetry and the flatness level of each distribution.
f) Construct a bar chart, a histogram, a stem-and-leaf plot, and a boxplot.
15) In order to improve the quality of its services, a retail company collected the average service time, in seconds, of 250 employees. The data were grouped into classes, with their respective absolute and relative frequencies, as shown in Table 3.5. We would like you to:
a) Calculate the arithmetic mean, the median, and the mode.
b) Calculate Q1, Q3, D2, P13, and P95.
c) Are there any outliers?
d) Calculate the range, the variance, the standard deviation, and the standard error.
e) Calculate Pearson's first coefficient of skewness and the coefficient of kurtosis. Classify the symmetry and the flatness level of each distribution.
f) Construct a histogram.


TABLE 3.5 Average Service Time

Class        Fi     Fri (%)
30 ├ 60      11       4.4
60 ├ 90      29      11.6
90 ├ 120     41      16.4
120 ├ 150    82      32.8
150 ├ 180    54      21.6
180 ├ 210    33      13.2
Sum         250     100.0

16) A financial analyst wants to compare the price of two stocks throughout the previous month. The data are listed in Table 3.6.

TABLE 3.6 Stock Price

Stock A    Stock B
31         25
30         33
24         27
24         34
28         32
22         26
24         26
34         28
24         34
28         28
23         31
30         28
31         34
32         16
26         28
39         29
25         27
42         28
29         33
24         29
22         34
23         33
32         27
29         26


Carry out a comparative analysis of the price of both stocks based on:
a) Measures of position, such as, the mean, median, and mode.
b) Measures of dispersion, such as, the range, variance, standard deviation, and standard error.
c) The existence of outliers.
d) The symmetry and flatness level of the distribution.
e) A line graph, scatter plot, stem-and-leaf plot, histogram, and boxplot.
17) Aiming to determine the standards of the investments made in hospitals in São Paulo (US$ millions), a state government agency collected data regarding 15 hospitals, as shown in Table 3.7.

TABLE 3.7 Investments in 15 Hospitals in the State of São Paulo

Hospital    Investment
A              44
B              12
C               6
D              22
E              60
F              15
G              30
H             200
I              10
J               8
K               4
L              75
M             180
N              50
O              64

We would like you to:
a) Calculate the sample's arithmetic mean and standard deviation.
b) Eliminate possible outliers.
c) Once again, calculate the sample's arithmetic mean and standard deviation (without the outliers).
d) What can we say about the standard deviation of the new sample without the outliers?

Chapter 4

Bivariate Descriptive Statistics

Numbers rule the world.
Plato

4.1

INTRODUCTION

The previous chapter discussed descriptive statistics for a single variable (univariate descriptive statistics). This chapter presents the concepts of descriptive statistics involving two variables (bivariate analysis). Therefore, a bivariate analysis has as its main objective to study the relationships (associations for qualitative variables and correlations for quantitative variables) between two variables. These relationships can be studied through the joint distribution of frequencies (contingency tables or crossed classification tables—cross tabulation), graphical representations, and summary measures. The bivariate analysis will be studied in two distinct situations:
a) When the two variables are qualitative;
b) When the two variables are quantitative.
Fig. 4.1 shows the bivariate descriptive statistics that will be studied in this chapter, represented by tables, charts, and summary measures, and presents the following situations:
a) The descriptive statistics used to represent the data behavior of two qualitative variables are: (i) joint frequency distribution tables, in this specific case also called contingency tables or crossed classification tables (cross tabulation); (ii) charts, such as perceptual maps resulting from the correspondence analysis technique (more details can be found in Fávero and Belfiore, 2017); and (iii) measures of association, such as the chi-square statistic (used for nominal and ordinal qualitative variables), the Phi coefficient, the contingency coefficient, and Cramer's V coefficient (all of them based on chi-square and used for nominal variables), in addition to Spearman's coefficient (for ordinal qualitative variables).
b) In the case of two quantitative variables, we are going to use joint frequency distribution tables, graphical representations, such as the scatter plot, besides measures of correlation, such as covariance and Pearson's correlation coefficient.

4.2

ASSOCIATION BETWEEN TWO QUALITATIVE VARIABLES

The main objective is to assess if there is a relationship between the qualitative or categorical variables studied, in addition to the level of association between them. This can be done through frequency distribution tables, summary measures, such as, the chi-square (used for nominal and ordinal variables), the Phi coefficient, the contingency coefficient, and Cramer's V coefficient (for nominal variables), and Spearman's coefficient (for ordinal variables), in addition to graphical representations, such as, perceptual maps resulting from the correspondence analysis, as presented in Fávero and Belfiore (2017).

4.2.1

Joint Frequency Distribution Tables

The simplest way to summarize a set of data resulting from two qualitative variables is through a joint frequency distribution table, which in this specific case is called a contingency table, a crossed classification table (cross tabulation), or even a correspondence table. In a joint way, it shows the absolute or relative frequencies of variable X's categories, represented on the X-axis, and of variable Y, represented on the Y-axis.


[Diagram: bivariate analysis — for two qualitative variables: tables (contingency tables), charts (perceptual maps), and measures of association (chi-square, Phi coefficient, contingency coefficient, Cramer's V coefficient, Spearman's coefficient); for two quantitative variables: tables (frequency distribution), charts (scatter plot), and measures of correlation (covariance, Pearson's correlation coefficient).]

FIG. 4.1 Bivariate descriptive statistics depending on the type of variable.

It is common to add the marginal totals to the contingency table, which correspond to the sum of variable X’s rows and to the sum of variable Y’s columns. We are going to illustrate this analysis through an example based on Bussab and Morettin (2011). Example 4.1 A study was done with 200 individuals trying to analyze the joint behavior of variable X (Health insurance agency) with variable Y (Level of satisfaction). The contingency table showing the variables’ joint absolute frequency distribution, in addition to the marginal totals, is shown in Table 4.E.1. These data are available on the SPSS software in the file HealthInsurance.sav.

TABLE 4.E.1 Joint Absolute Frequency Distribution of the Variables Being Studied

                      Level of Satisfaction
Agency          Dissatisfied   Neutral   Satisfied   Total
Total Health         40           16         12        68
Live Life            32           24         16        72
Mena Health          24           32          4        60
Total                96           72         32       200

The study can also be carried out based on the relative frequencies, as studied in Chapter 3 for univariate problems. Bussab and Morettin (2011) show three ways to illustrate the proportion of each category: a) In relation to the general total; b) In relation to the total of each row; c) In relation to the total of each column. The choice among these options depends on the objective of the problem. For example, Table 4.E.2 shows the joint relative frequency distribution of the variables being studied in relation to the general total.


TABLE 4.E.2 Joint Relative Frequency Distribution of the Variables Being Studied in Relation to the General Total

                      Level of Satisfaction
Agency          Dissatisfied   Neutral   Satisfied   Total
Total Health        20%           8%         6%       34%
Live Life           16%          12%         8%       36%
Mena Health         12%          16%         2%       30%
Total               48%          36%        16%      100%

First, we are going to analyze the marginal totals of the rows and columns that provide the unidimensional distributions of each variable. The marginal totals of the rows correspond to the sum of the relative frequencies of each category of the variable Agency and the marginal totals of the columns correspond to the sum of each category of the variable Level of satisfaction. Thus, we can conclude that 34% of the individuals are members of Total Health, 36% of Live Life, and 30% of Mena Health. Analogously, we can conclude that 48% of the individuals are dissatisfied with their health insurance agencies, 36% said they were neutral, and only 16% said they were satisfied. Regarding the joint relative frequency distribution of the variables being studied (a contingency table), we can state that 20% of the individuals are members of Total Health and are dissatisfied. The same logic is applied to the other categories of the contingency table. Conversely, Table 4.E.3 shows the joint relative frequency distribution of the variables being studied in relation to the total of each row.

TABLE 4.E.3 Joint Relative Frequency Distribution of the Variables Being Studied in Relation to the Total of Each Row

                      Level of Satisfaction
Agency          Dissatisfied   Neutral   Satisfied   Total
Total Health       58.8%        23.5%      17.6%      100%
Live Life          44.4%        33.3%      22.2%      100%
Mena Health        40%          53.3%       6.7%      100%
Total              48%          36%        16%        100%

From Table 4.E.3, we can see that the proportion of individuals who are members of Total Health and are dissatisfied is 58.8% (40/68), the proportion of those who are neutral is 23.5% (16/68), and the proportion of those who are satisfied is 17.6% (12/68). The sum of the proportions in the respective row is 100%. The same logic is applied to the other rows. Finally, Table 4.E.4 shows the joint relative frequency distribution of the variables being studied in relation to the total of each column. Therefore, the proportion of dissatisfied individuals who are members of Total Health is 41.7% (40/96), of Live Life, 33.3% (32/96), and of Mena Health, 25% (24/96). The sum of the proportions in the respective column is 100%. The same logic is applied to the other columns.

TABLE 4.E.4 Joint Relative Frequency Distribution of the Variables Being Studied in Relation to the Total of Each Column

                      Level of Satisfaction
Agency          Dissatisfied   Neutral   Satisfied   Total
Total Health       41.7%        22.2%      37.5%       34%
Live Life          33.3%        33.3%      50%         36%
Mena Health        25%          44.4%      12.5%       30%
Total             100%         100%       100%        100%
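The three relative frequency tables above follow directly from the absolute frequencies in Table 4.E.1. The sketch below, in Python with NumPy, is only an illustrative cross-check of that arithmetic and is not part of the book's SPSS or Stata procedures.

# Illustrative sketch: deriving Tables 4.E.2-4.E.4 from the counts in Table 4.E.1.
import numpy as np

counts = np.array([[40, 16, 12],     # Total Health
                   [32, 24, 16],     # Live Life
                   [24, 32,  4]])    # Mena Health

pct_of_total  = 100 * counts / counts.sum()                           # Table 4.E.2
pct_of_row    = 100 * counts / counts.sum(axis=1, keepdims=True)      # Table 4.E.3
pct_of_column = 100 * counts / counts.sum(axis=0, keepdims=True)      # Table 4.E.4

print(np.round(pct_of_total, 1))     # first row: [20.  8.  6.]
print(np.round(pct_of_row, 1))       # first row: [58.8 23.5 17.6]
print(np.round(pct_of_column, 1))    # first row: [41.7 22.2 37.5]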


Creating Contingency Tables on the SPSS Software

The contingency tables in Example 4.1 will be generated by using SPSS. The use of the images in this chapter has been authorized by the International Business Machines Corporation©. First, we are going to define the properties of each variable on SPSS. The variables Agency and Level of satisfaction are qualitative, but, initially, they are presented as numbers, as shown in the file HealthInsurance_NoLabel.sav. Thus, labels corresponding to each category of both variables must be created, so that:

Labels of the variable Agency:
1 = Total Health
2 = Live Life
3 = Mena Health

Labels of the variable Level of satisfaction, simply called Satisfaction:
1 = Dissatisfied
2 = Neutral
3 = Satisfied

Therefore, we must click on Data → Define Variable Properties… and select the variables that interest us, as seen in Figs. 4.2 and 4.3.

FIG. 4.2 Defining the properties of the variable on SPSS.


FIG. 4.3 Selecting the variables that interest us.

Next, we must click on Continue. Based on Figs. 4.4 and 4.5, note that the variables Agency and Satisfaction were defined as nominal. This definition can also be done in the Variable View environment. The labels must also be defined at this moment, as shown in Figs. 4.4 and 4.5. By clicking on OK, the numeric codes initially shown in the dataset are replaced by the respective labels. In the file HealthInsurance.sav, the data have already been labeled. To create contingency tables (cross tabulation), we are going to click on the menu Analyze → Descriptive Statistics → Crosstabs…, as shown in Fig. 4.6. We are going to select the variable Agency in Row(s) and the variable Satisfaction in Column(s). Next, we must click on Cells, as shown in Fig. 4.7. To create contingency tables that represent the joint absolute frequency distribution of the variables observed, the joint relative frequency distribution in relation to the general total, the joint relative frequency distribution in relation to the total of each row, and the joint relative frequency distribution in relation to the total of each column (Tables 4.E.1–4.E.4), we must, from the Crosstabs: Cell Display dialog box (opened after we clicked on Cells…), select the option Observed in Counts and the options Row, Column, and Total in Percentages, as shown in Fig. 4.8. Finally, we are going to click on Continue and OK. The contingency table (cross tabulation) generated by SPSS is shown in Fig. 4.9. Note that the data generated are exactly the same as those presented in Tables 4.E.1–4.E.4.


FIG. 4.4 Defining the labels of variable Agency.

FIG. 4.5 Defining the labels of variable Satisfaction.

FIG. 4.6 Creating contingency tables (cross tabulation) on SPSS.

FIG. 4.7 Creating a contingency table.


FIG. 4.8 Creating contingency tables from the Crosstabs: Cell Display dialog box.

FIG. 4.9 Cross classification table (cross tabulation) generated by SPSS.


Creating Contingency Tables on the Stata Software

In Chapter 3, we learned how to create frequency distribution tables for a single variable on Stata through the command tabulate, or simply tab. In the case of two or more variables, if the objective is to create univariate frequency distribution tables for each variable being analyzed, we must use the command tab1, followed by the list of variables. The same logic must be applied to create joint frequency distribution tables (contingency tables). To create a contingency table on Stata from the absolute frequencies of the variables being observed, we must use the following syntax: tabulate variable1* variable2*

or simply: tab variable1* variable2* where the terms variable1* and variable2* must be substituted for the names of the respective variables.

If, in addition to the joint absolute frequency distribution of the variables being observed, we want to obtain the joint relative frequency distribution in relation to the total of each row, to the total of each column, and to the general total, we must use the following syntax: tabulate variable1* variable2*, row column cell

or simply: tab variable1* variable2*, r co ce

Consider a case with more than two variables being studied, in which the objective is to construct bivariate frequency distribution tables (two-way tables), for all the combinations of variables, two by two. In this case, we must use the command tab2, with the following syntax: tab2 variables* where the term variables* should be substituted for the list of variables being considered in the analysis.

Analogously, to obtain both the joint absolute frequency distribution and the joint relative frequency distributions per row, per column, and per general total, we must use the following syntax: tab2 variables*, r co ce

The contingency tables in Example 4.1 will be generated now by using the Stata software. The data are available in the file HealthInsurance.dta. Hence, to obtain the table of joint absolute frequency distribution, relative frequencies per row, relative frequencies per column, and relative frequencies per general total, the command is: tab agency satisfaction, r co ce

The results can be seen in Fig. 4.10 and are similar to those presented in Fig. 4.9 (SPSS). FIG. 4.10 Contingency table constructed on Stata.

4.2.2

Measures of Association

The main measures that represent the association between two qualitative variables are:
a) the chi-square statistic (χ²)—used for nominal and ordinal qualitative variables;
b) the Phi coefficient, the contingency coefficient, and Cramer's V coefficient—applied to nominal variables and based on chi-square; and
c) Spearman's coefficient—used for ordinal variables.

4.2.2.1 Chi-Square Statistic

The chi-square statistic (χ²) measures the discrepancy between the contingency table observed and the contingency table expected, starting from the hypothesis that there is no association between the variables studied. If the frequency distribution observed is exactly equal to the frequency distribution expected, the result of the chi-square statistic is zero; therefore, values of χ² close to zero indicate independence between the variables. The χ² statistic is given by:

\chi^2 = \sum_{i=1}^{I} \sum_{j=1}^{J} \dfrac{(O_{ij} - E_{ij})^2}{E_{ij}}    (4.1)

where:
O_ij: number of observations in the ith position of variable X and in the jth position of variable Y;
E_ij: expected frequency of observations in the ith position of variable X and in the jth position of variable Y;
I: number of categories (rows) of variable X;
J: number of categories (columns) of variable Y.

Example 4.2 Calculate the χ² statistic for Example 4.1. Solution Table 4.E.5 shows the observed values in the distribution with the respective relative frequencies in relation to the general total of the row. The calculation could also be done in relation to the general total of the column, arriving at the same result of the χ² statistic.

TABLE 4.E.5 Observed Values of Each Category With the Respective Ratios in Relation to the General Total of the Row

                      Level of Satisfaction
Agency          Dissatisfied    Neutral      Satisfied    Total
Total Health    40 (58.8%)      16 (23.5%)   12 (17.6%)    68 (100%)
Live Life       32 (44.4%)      24 (33.3%)   16 (22.2%)    72 (100%)
Mena Health     24 (40%)        32 (53.3%)    4 (6.7%)     60 (100%)
Total           96 (48%)        72 (36%)     32 (16%)     200 (100%)

The data in Table 4.E.5 suggest dependence between the variables. Assuming that there was no association between the variables, we would expect, in relation to the total of each row, a proportion of 48% in the Dissatisfied column, 36% in the Neutral column, and 16% in the Satisfied column for all three health insurance companies. The calculation of the expected values can be seen in Table 4.E.6. For example, the calculation of the first cell is 0.48 × 68 = 32.64.


TABLE 4.E.6 Expected Values in Table 4.E.5, Assuming the Nonassociation Between the Variables

                      Level of Satisfaction
Agency          Dissatisfied    Neutral      Satisfied    Total
Total Health    32.6 (48%)      24.5 (36%)   10.9 (16%)    68 (100%)
Live Life       34.6 (48%)      25.9 (36%)   11.5 (16%)    72 (100%)
Mena Health     28.8 (48%)      21.6 (36%)    9.6 (16%)    60 (100%)
Total           96 (48%)        72 (36%)     32 (16%)     200 (100%)

To calculate the χ² statistic, we must apply expression (4.1) to the data in Tables 4.E.5 and 4.E.6. The calculation of each term (O_ij − E_ij)² / E_ij is shown in Table 4.E.7, jointly with the χ² measure resulting from the sum over all categories.

TABLE 4.E.7 Calculating the χ² Statistic

                      Level of Satisfaction
Agency          Dissatisfied   Neutral   Satisfied
Total Health        1.66         2.94       0.12
Live Life           0.19         0.14       1.74
Mena Health         0.80         5.01       3.27
Total                        χ² = 15.861

As we are going to study in Chapter 9, which discusses hypothesis tests, the significance level α indicates the probability of rejecting a certain hypothesis when it is true. The P-value, on the other hand, represents the probability associated with the observed sample value, indicating the lowest significance level that would lead to the rejection of the supposed hypothesis. In other words, the P-value represents a decreasing reliability index of a result: the lower the value, the less we can believe in the assumed hypothesis. In the case of the χ² statistic, whose test presupposes the nonassociation between the variables being studied, most statistical software, including SPSS and Stata, calculates the corresponding P-value. Thus, for a confidence level of 95%, if the P-value < 0.05, the hypothesis is rejected and we can state that there is an association between the variables. On the other hand, if the P-value > 0.05, we conclude that the variables are independent. All of these concepts will be studied in more detail in Chapter 9. Excel calculates the P-value of the χ² statistic through the CHITEST or CHISQ.TEST (Excel 2010 and later versions) functions. In order to do that, we just need to select the set of cells corresponding to the observed or real values and the set of cells with the expected values.

Solving the chi-square statistic on the SPSS software

Analogous to Example 4.1, calculating the chi-square statistic (χ²) on SPSS is also done on the tab Analyze → Descriptive Statistics → Crosstabs…. Once again, we are going to select the variable Agency in Row(s) and the variable Satisfaction in Column(s). Initially, to generate the observed values and the expected values in case of nonassociation between the variables (data in Tables 4.E.5 and 4.E.6), we must click on Cells… and select the options Observed and Expected in Counts, from the Crosstabs: Cell Display dialog box (Fig. 4.11). In the same box, to generate the adjusted standardized residuals, we must select the option Adjusted standardized in Residuals. The results can be seen in Fig. 4.12. To calculate the χ² statistic, in Statistics…, we must select the option Chi-square (Fig. 4.13). Finally, we are going to click on Continue and OK. The result can be seen in Fig. 4.14. Based on Fig. 4.14, we can see that the value of χ² is 15.861, similar to the one calculated in Table 4.E.7. We can also observe that the lowest significance level that would lead to the rejection of the nonassociation hypothesis between the variables (P-value) is 0.003. Since 0.003 < 0.05 (for a confidence level of 95%), the null hypothesis is rejected, which allows us to conclude that there is an association between the variables.
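As an additional cross-check outside Excel, SPSS, and Stata, the same χ² statistic, its degrees of freedom, the P-value, and the expected counts can be obtained from the observed table with a few lines of Python using SciPy; the sketch below assumes only the counts shown in Table 4.E.1.

# Illustrative sketch: chi-square test of independence for the observed counts.
import numpy as np
from scipy import stats

observed = np.array([[40, 16, 12],
                     [32, 24, 16],
                     [24, 32,  4]])

chi2, p_value, dof, expected = stats.chi2_contingency(observed)
print(f"chi-square = {chi2:.3f}, dof = {dof}, P-value = {p_value:.3f}")
# chi-square is about 15.86 with 4 degrees of freedom and P-value about 0.003,
# and the expected counts match Table 4.E.6 (e.g., 32.64 for Total Health / Dissatisfied).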


FIG. 4.11 Creating the contingency table with the observed frequencies, the expected frequencies, and the residuals.

FIG. 4.12 Contingency table with the observed values, the expected values, and the residuals, assuming the nonassociation between the variables.


FIG. 4.13 Selecting the w2 statistic.

Solving the χ² statistic on the Stata software

In Section 4.2.1, we learned how to create contingency tables on Stata through the command tabulate, or simply tab. Besides the observed frequencies, this command also gives us the expected frequencies through the option expected, or simply exp, as well as the calculation of the χ² statistic using the option chi2, or simply ch. For the data in Example 4.1, available in the file HealthInsurance.dta, to obtain the observed and expected frequency distribution tables, jointly with the χ² statistic, we are going to use the following command:

tab agency satisfaction, exp ch

However, the command tab does not allow residuals to be generated in the output. As an alternative, the command tabchi

was developed from a tabulation module created by Nicholas J. Cox, allowing the adjusted standardized residuals to be calculated too. In order for this command to be used, we must initially type:

FIG. 4.14 Result of the χ² statistic.


FIG. 4.15 Result of the χ² statistic on Stata.

findit tabchi

and install it in the link tab_chi from http://fmwww.bc.edu/RePEc/bocode/t. After doing this, we can type the following command: tabchi agency satisfaction, a

The result is shown in Fig. 4.15 and is similar to those presented in Figs. 4.12 and 4.14 on the SPSS software. Note that, differently from the command tab, which requires the option exp so that the expected frequencies can be generated, the command tabchi already gives them to us automatically.

4.2.2.2 Other Measures of Association Based on Chi-Square

The main measures of association based on the chi-square statistic (χ²) are Phi, Cramer's V coefficient, and the contingency coefficient (C), all of them applied to nominal qualitative variables. In general, an association or correlation coefficient is a measure that varies between 0 and 1, presenting value 0 when there is no relationship between the variables, and value 1 when they are perfectly related. We are going to see how each one of the coefficients studied in this section behaves in relation to these characteristics.

a) Phi Coefficient

The Phi coefficient is the simplest measure of association for nominal variables based on χ², and it can be expressed as follows:

Phi = \sqrt{\dfrac{\chi^2}{n}}    (4.2)

In order for Phi to vary only between 0 and 1, it is necessary for the contingency table to have a 2 x 2 dimension.

Example 4.3 In order to offer high-quality services and meet their customers’ expectations, Ivanblue, a company in the male fashion industry, is investing in strategies to segment the market. Currently, the company has four stores in Campinas, located in the north, center, south, and east regions of the city, and sells four types of clothes: ties, shirts, polo shirts, and pants. Table 4.E.8 shows the purchase data of 20 customers, such as, the type of clothes and the location of the store. Check if there is association between the two variables using the Phi coefficient.


TABLE 4.E.8 Purchase Data of 20 Customers

Customer   Clothes      Region
1          Tie          South
2          Polo shirt   North
3          Shirt        South
4          Pants        North
5          Tie          South
6          Polo shirt   Center
7          Polo shirt   East
8          Tie          South
9          Shirt        South
10         Tie          Center
11         Pants        North
12         Pants        Center
13         Tie          Center
14         Polo shirt   East
15         Pants        Center
16         Tie          Center
17         Pants        South
18         Pants        North
19         Polo shirt   East
20         Shirt        Center

Solution Using the procedure described in the previous section, the value of the chi-square statistic is χ² = 18.214. Therefore:

Phi = \sqrt{\dfrac{\chi^2}{n}} = \sqrt{\dfrac{18.214}{20}} = 0.954

Since both variables have four categories, in this case the condition 0 ≤ Phi ≤ 1 is not valid, making it difficult to interpret how strong the association is.

b) Contingency coefficient

The contingency coefficient (C), also known as Pearson's contingency coefficient, is another measure of association for nominal variables based on the χ² statistic, being represented by the following expression:

C = \sqrt{\dfrac{\chi^2}{n + \chi^2}}    (4.3)

where n is the sample size. The contingency coefficient (C) has as its lowest limit the value 0, indicating that there is no relationship between the variables; however, the highest limit of C varies depending on the number of categories, so:

0 \le C \le \sqrt{\dfrac{q - 1}{q}}    (4.4)


where:

q = \min(I, J)    (4.5)

where I is the number of rows and J is the number of columns of the contingency table. When C = \sqrt{(q - 1)/q}, there is a perfect association between the variables; however, this limit never assumes the value 1. Hence, two contingency coefficients can only be compared if both are defined from tables with the same number of rows and columns.

Example 4.4 Calculate the contingency coefficient (C) for the data in Example 4.3. Solution We calculate C as follows:

C = \sqrt{\dfrac{\chi^2}{n + \chi^2}} = \sqrt{\dfrac{18.214}{20 + 18.214}} = 0.690

Since the contingency table is 4 x 4 (q = min(4, 4) = 4), the values that C can assume are in the interval:

0 \le C \le \sqrt{3/4}, that is, 0 \le C \le 0.866

We can conclude that there is an association between the variables.

c) Cramer's V coefficient

Another measure of association for nominal variables based on the χ² statistic is Cramer's V coefficient, calculated by:

V = \sqrt{\dfrac{\chi^2}{n(q - 1)}}    (4.6)

where q = min(I, J), as presented in expression (4.5). For 2 x 2 contingency tables, expression (4.6) reduces to V = \sqrt{\chi^2 / n}, which corresponds to the Phi coefficient. Cramer's V coefficient is an alternative to the Phi coefficient and to the contingency coefficient (C), and its value is always limited to the interval [0, 1], regardless of the number of categories in the rows and columns:

0 \le V \le 1    (4.7)

Value 0 indicates that the variables do not have any kind of association and value 1 shows that they are perfectly associated. Therefore, Cramer’s V coefficient allows us to compare contingency tables that have different dimensions.

Example 4.5 Calculate Cramer's V coefficient for the data in Example 4.3. Solution

V = \sqrt{\dfrac{\chi^2}{n(q - 1)}} = \sqrt{\dfrac{18.214}{20 \times 3}} = 0.551

Since 0 ≤ V ≤ 1, there is an association between the variables; however, it is not considered very strong.

Solution of Examples 4.3, 4.4, and 4.5 (calculation of the Phi, contingency, and Cramer's V coefficients) by using SPSS

In Section 4.2.1, we discussed how to create labels that correspond to the variable categories from the menu Data → Define Variable Properties…. The same procedure must be applied to the data in Table 4.E.8 (we cannot forget to define the variables as nominal). The file Market_Segmentation.sav gives us these data already tabulated on SPSS.
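Before turning to the software output, the three coefficients obtained by hand in Examples 4.3, 4.4, and 4.5 can be cross-checked by re-applying expressions (4.2), (4.3), and (4.6) to χ² = 18.214 and n = 20. The sketch below does this in Python, purely as an illustration outside the book's SPSS/Stata solutions.

# Illustrative sketch: Phi, contingency coefficient, and Cramer's V from chi-square.
from math import sqrt

chi2, n, q = 18.214, 20, 4          # q = min(I, J) for the 4 x 4 table

phi = sqrt(chi2 / n)                 # Phi coefficient, about 0.954
c = sqrt(chi2 / (n + chi2))          # contingency coefficient, about 0.690
v = sqrt(chi2 / (n * (q - 1)))       # Cramer's V coefficient, about 0.551
c_max = sqrt((q - 1) / q)            # upper limit of C for this table, about 0.866

print(round(phi, 3), round(c, 3), round(v, 3), round(c_max, 3))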


FIG. 4.16 Selecting the contingency coefficient and Phi and Cramer’s V coefficients.

FIG. 4.17 Results of the contingency coefficient and Phi and Cramer’s V coefficients.

Similar to the calculation of the χ² statistic, calculating the Phi, contingency, and Cramer's V coefficients on SPSS can also be done on the menu Analyze → Descriptive Statistics → Crosstabs…. We are going to select the variable Clothes in Row(s) and the variable Region in Column(s). In Statistics…, we are going to select the options Contingency coefficient and Phi and Cramer's V (Fig. 4.16). Note that these coefficients are calculated for nominal variables. The results of the statistics can be seen in Fig. 4.17. For all three coefficients, the P-value of 0.033 (0.033 < 0.05) indicates that there is an association between the variables being studied.

Solution of Examples 4.3 and 4.5 (calculation of the Phi and Cramer's V coefficients) by using Stata

Stata calculates the Phi and Cramer's V coefficients through the command phi. Hence, they are going to be calculated for the data in Example 4.3, available in the file Market_Segmentation.dta.


FIG. 4.18 Calculating the Phi and Cramer’s V coefficients on Stata.

In order for the phi command to be used, initially, we must type: findit phi

and install it in the link snp3.pkg from http://www.stata.com/stb/stb3/. After doing this, we can type the following command: phi clothes region

The results can be seen in Fig. 4.18. Note that the Phi coefficient on Stata is called Cohen’s w. Cramer’s V coefficient, on the other hand, is called Cramer’s phi-prime.

4.2.2.3 Spearman's Coefficient

Spearman's coefficient (r_sp) is a measure of association between two ordinal qualitative variables. Initially, we must sort the data of variable X and of variable Y in ascending order. After sorting the data, it is possible to create ranks or rankings, denoted by k (k = 1, …, n). Assigning ranks is done separately for each variable. Rank 1 is then assigned to the smallest value of the variable, rank 2 to the second smallest value, and so on, up until rank n for the highest value. In case of a tie between values k and k + 1, we must assign rank k + 1/2 to both observations. Spearman's coefficient can be calculated by using the following expression:

r_{sp} = 1 - \dfrac{6 \sum_{k=1}^{n} d_k^2}{n(n^2 - 1)}    (4.8)

where:
n: number of observations (pairs of values);
d_k: difference between the rankings of order k.

Spearman's coefficient is a measure that varies between −1 and 1. If r_sp = 1, all the values of d_k are null, indicating that all the rankings are equal for variables X and Y (perfect positive association). The value r_sp = −1 is found when \sum_{k=1}^{n} d_k^2 = n(n^2 − 1)/3 reaches its maximum value (there is an inversion in the values of the variable rankings), indicating a perfect negative association. When r_sp = 0, there is no association between variables X and Y. Fig. 4.19 shows a summary of this interpretation. This interpretation is similar to that of Pearson's correlation coefficient, which will be studied in Section 4.3.3.2.

FIG. 4.19 Interpretation of Spearman's coefficient.


Example 4.6 The coordinator of the Business Administration course is analyzing if there is any kind of association between the grades of 10 students in two different subjects: Simulation and Finance. The data regarding this problem are presented in Table 4.E.9. Calculate Spearman’s coefficient.

TABLE 4.E.9 Grades in the Subjects Simulation and Finance of the 10 Students Being Analyzed

             Grades
Student   Simulation   Finance
1            4.7          6.6
2            6.3          5.1
3            7.5          6.9
4            5.0          7.1
5            4.4          3.5
6            3.7          4.6
7            8.5          6.8
8            8.2          7.5
9            3.5          4.2
10           4.0          3.3

Solution To calculate Spearman’s coefficient, first, we are going to assign rankings to each category of each variable depending on their respective values, as shown in Table 4.E.10.

TABLE 4.E.10 Ranks in the Subjects Simulation and Finance of the 10 Students

             Rankings
Student   Simulation   Finance    dk    dk²
1             5            6      -1     1
2             7            5       2     4
3             8            8       0     0
4             6            9      -3     9
5             4            2       2     4
6             2            4      -2     4
7            10            7       3     9
8             9           10      -1     1
9             1            3      -2     4
10            3            1       2     4
Sum                                     40


Applying expression (4.8), we have:

r_{sp} = 1 - \dfrac{6 \sum_{k=1}^{n} d_k^2}{n(n^2 - 1)} = 1 - \dfrac{6 \times 40}{10 \times (10^2 - 1)} = 0.7576

The value of 0.758 indicates a strong positive association between the variables.
Calculating Spearman's coefficient using SPSS software
File Grades.sav shows the data from Example 4.6 (grades in Table 4.E.9) tabulated in an ordinal scale (defined in the environment Variable View). Similar to the calculation of the χ² statistic and the Phi, contingency, and Cramer's V coefficients, Spearman's coefficient can also be generated by SPSS from the menu Analyze → Descriptive Statistics → Crosstabs…. We are going to select the variable Simulation in Row(s) and the variable Finance in Column(s). In Statistics…, we are going to select the option Correlations (Fig. 4.20). We are going to click on Continue and then, finally, on OK. The result of Spearman's coefficient is shown in Fig. 4.21. The P-value 0.011 < 0.05 (under the hypothesis of no association between the variables) indicates that there is a correlation between the grades in Simulation and Finance, with 95% confidence. Spearman's coefficient can also be calculated in the menu Analyze → Correlate → Bivariate…. We must select the variables that interest us, in addition to Spearman's coefficient, as shown in Fig. 4.22. We are going to click on OK, resulting in Fig. 4.23.
FIG. 4.20 Calculating Spearman's coefficient from the Crosstabs: Statistics dialog box.

FIG. 4.21 Result of Spearman’s coefficient from the Crosstabs: Statistics dialog box.


Calculating Spearman’s coefficient by using Stata software In Stata, Spearman’s coefficient is calculated using the command spearman. Therefore, for the data in Example 4.6, available in the file Grades.dta, we must type the following command: spearman simulation finance The results can be seen in Fig. 4.24.

FIG. 4.22 Calculating Spearman’s coefficient from the Bivariate Correlations dialog box.

FIG. 4.23 Result of Spearman’s coefficient from the Bivariate Correlations dialog box. FIG. 4.24 Result of Spearman’s coefficient on Stata.
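For readers who prefer to check the result outside SPSS or Stata, the coefficient of Example 4.6 can be reproduced in Python. The sketch below is only an illustration (not part of the original example): it ranks the grades, applies expression (4.8), and then confirms the value with scipy; the column names are our own choice.

import pandas as pd
from scipy import stats

# Grades of the 10 students (Table 4.E.9)
grades = pd.DataFrame({
    "simulation": [4.7, 6.3, 7.5, 5.0, 4.4, 3.7, 8.5, 8.2, 3.5, 4.0],
    "finance":    [6.6, 5.1, 6.9, 7.1, 3.5, 4.6, 6.8, 7.5, 4.2, 3.3],
})

# Rank each variable separately and apply expression (4.8)
d = grades["simulation"].rank() - grades["finance"].rank()
n = len(grades)
r_sp = 1 - 6 * (d ** 2).sum() / (n * (n ** 2 - 1))
print(round(r_sp, 4))                      # 0.7576

# scipy returns the same coefficient together with its P-value (about 0.011)
rho, p_value = stats.spearmanr(grades["simulation"], grades["finance"])
print(round(rho, 4), round(p_value, 3))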

4.3 CORRELATION BETWEEN TWO QUANTITATIVE VARIABLES

In this section, the main objective is to assess whether there is a relationship between the quantitative variables being studied, as well as the level of correlation between them. This can be done through frequency distribution tables, graphical representations such as scatter plots, and measures of correlation such as the covariance and Pearson's correlation coefficient.

4.3.1 Joint Frequency Distribution Tables

The same procedure presented for qualitative variables can be used to represent the joint distribution of quantitative variables and to analyze the possible relationships between the respective variables. Analogous to the study of the univariate descriptive statistic, continuous data that do not repeat themselves with a certain frequency can be grouped into class intervals.

4.3.2 Graphical Representation Through a Scatter Plot

The correlation between two quantitative variables can be represented graphically through a scatter plot, which plots the values of variables X and Y in a Cartesian plane. Therefore, a scatter plot allows us to assess:
a) Whether there is any relationship between the variables being studied or not;
b) The type of relationship between the two variables, that is, the direction in which variable Y increases or decreases depending on changes in X;
c) The level of relationship between the variables;
d) The nature of the relationship (linear, exponential, among others).
Fig. 4.25 shows a scatter plot in which the relationship between variables X and Y is strong positive linear, that is, variations in Y are directly proportional to variations in X. The level of relationship between the variables is strong and the nature is linear. If all the points are contained in a straight line, we have a case in which the relationship is perfect positive linear, as shown in Fig. 4.26. Figs. 4.27 and 4.28, on the other hand, show scatter plots in which the relationship between variables X and Y is strong negative linear and perfect negative linear, respectively.
FIG. 4.25 Strong positive linear relationship.

FIG. 4.26 Perfect positive linear relationship.


FIG. 4.27 Strong negative linear relationship.

FIG. 4.28 Perfect negative linear relationship.

FIG. 4.29 There is no relationship between variables X and Y.

Finally, we may also have a case in which there is no relationship between variables X and Y, as shown in Fig. 4.29.
Constructing a scatter plot on SPSS

Example 4.7
Let us open the file Income_Education.sav on SPSS. The objective is to analyze the correlation between the variables Family Income and Years of Education through a scatter plot. In order to do that, we are going to click on Graphs → Legacy Dialogs → Scatter/Dot… (Fig. 4.30). In the window Scatter/Dot in Fig. 4.31, we are going to select the type of chart (Simple Scatter). Clicking on Define, the Simple Scatterplot dialog box will open, as shown in Fig. 4.32. We are going to select the variable FamilyIncome in the Y-axis and the variable YearsofEducation in the X-axis. Next, we are going to click on OK. The scatter plot created is shown in Fig. 4.33. Based on Fig. 4.33, we can see a strong positive correlation between the variables Family Income and Years of Education. Therefore, the higher the number of years of education, the higher the family income will be, even if there is no cause and effect relationship.


FIG. 4.30 Constructing a scatter plot on SPSS.

FIG. 4.31 Selecting the type of chart.

The scatter plot can also be created in Excel by selecting the option Scatter.
Constructing a scatter plot on Stata
The data from Example 4.7 are also available on Stata in the file Income_Education.dta. The variables being studied are called income and education. The scatter plot on Stata is created using the command twoway scatter (or simply tw sc), followed by the variables we are interested in. Thus, to analyze the correlation between the variables Family Income and Years of Education through a scatter plot on Stata, we must type the following command: tw sc income education

The resulting scatter plot is shown in Fig. 4.34.

FIG. 4.32 Simple Scatterplot dialog box.

FIG. 4.33 Scatter plot of the variables Family Income and Years of Education.



FIG. 4.34 Scatter plot on Stata.
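The same chart can be drawn in Python with pandas and matplotlib, as a hedged alternative to the SPSS, Excel, and Stata routes. The sketch below assumes the spreadsheet columns are named income and education; adjust the names to whatever the actual file uses.

import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_excel("Income_Education.xls")   # column names assumed: income, education

plt.scatter(data["education"], data["income"])
plt.xlabel("Years of education")
plt.ylabel("Family income")
plt.title("Scatter plot of family income versus years of education")
plt.show()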


4.3.3 Measures of Correlation

The main measures of correlation, used for quantitative variables, are the covariance and Pearson’s correlation coefficient.

4.3.3.1 Covariance
Covariance measures the joint variation between two quantitative variables X and Y, and it is calculated by using the following expression:

cov(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}   (4.9)

where:
X_i: ith value of X;
Y_i: ith value of Y;
\bar{X}: mean of the values of X_i;
\bar{Y}: mean of the values of Y_i;
n: sample size.
One of the limitations of the covariance is that the measure depends on the sample size, and it may lead to a bad estimate in the case of small samples. Pearson's correlation coefficient is an alternative for this problem.
Example 4.8
Once again, consider the data in Example 4.7 regarding the variables Family Income and Years of Education. The data are also available in Excel in the file Income_Education.xls. Calculate the covariance of the data matrix of both variables.
Solution
Applying expression (4.9), we have:

cov(X, Y) = \frac{(7.6 - 7.08)(1,961 - 1,856.22) + \cdots + (5.4 - 7.08)(775 - 1,856.22)}{95} = \frac{72,326.93}{95} = 761.336

The covariance can be calculated in Excel by using the COVARIANCE.S (sample) function. In the following section, we are also going to discuss how the covariance can be calculated on SPSS, jointly with Pearson's correlation coefficient. SPSS considers the same expression presented in this section.
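A minimal Python sketch of expression (4.9) follows; it is an illustration only and assumes the columns of Income_Education.xls are named income and education. Like Excel's COVARIANCE.S, it uses the sample denominator n − 1.

import numpy as np
import pandas as pd

data = pd.read_excel("Income_Education.xls")
income = data["income"].to_numpy()
education = data["education"].to_numpy()

# Expression (4.9): sum of cross-products of deviations divided by n - 1
n = len(income)
cov_manual = ((education - education.mean()) * (income - income.mean())).sum() / (n - 1)

# numpy's covariance matrix gives the same value in the off-diagonal entry
cov_numpy = np.cov(education, income)[0, 1]
print(cov_manual, cov_numpy)    # about 761.3 for the book's data set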


FIG. 4.35 Interpretation of Pearson’s correlation coefficient.

4.3.3.2 Pearson's Correlation Coefficient
Pearson's correlation coefficient (r) is a measure that varies between −1 and 1. Through the sign, it is possible to verify the type of linear relationship between the two variables analyzed (the direction in which variable Y increases or decreases depending on how X changes); the closer it is to the extreme values, the stronger the correlation between them. Therefore:
– If r is positive, there is a directly proportional relationship between the variables; if r = 1, we have a perfect positive linear correlation.
– If r is negative, there is an inversely proportional relationship between the variables; if r = −1, we have a perfect negative linear correlation.
– If r is null, there is no correlation between the variables.
Fig. 4.35 shows a summary of the interpretation of Pearson's correlation coefficient.
Pearson's correlation coefficient (r) can be calculated as the ratio between the covariance of the two variables and the product of the standard deviations (S) of each one of them:

r = \frac{cov(X, Y)}{S_X \cdot S_Y} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{(n - 1) \cdot S_X \cdot S_Y}   (4.10)

Since S_X = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}} and S_Y = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n - 1}}, as we studied in Chapter 3, expression (4.10) becomes:

r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2} \cdot \sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}}   (4.11)

In Chapter 12, we are going to use Pearson's correlation coefficient a lot, when studying factorial analysis.
Example 4.9
Once again, open the file Income_Education.xls and calculate Pearson's correlation coefficient between the two variables.
Solution
Calculating Pearson's correlation coefficient through expression (4.10) is as follows:

r = \frac{cov(X, Y)}{S_X \cdot S_Y} = \frac{761.336}{970.774 \times 1.009} = 0.777

This calculation could also be done by using expression (4.11), which does not depend on the sample size. The result indicates a strong positive correlation between the variables Family Income and Years of Education.
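The coefficient can also be checked in Python. The sketch below (ours, with the column names income and education assumed) applies expression (4.10) directly and then compares the result with the built-in numpy and scipy calculations.

import numpy as np
import pandas as pd
from scipy import stats

data = pd.read_excel("Income_Education.xls")
x = data["education"].to_numpy()
y = data["income"].to_numpy()

# Expression (4.10): covariance divided by the product of the standard deviations
r_manual = np.cov(x, y)[0, 1] / (x.std(ddof=1) * y.std(ddof=1))

r_numpy = np.corrcoef(x, y)[0, 1]          # correlation matrix, off-diagonal entry
r_scipy, p_value = stats.pearsonr(x, y)    # coefficient and P-value
print(round(r_manual, 3), round(r_numpy, 3), round(r_scipy, 3))   # about 0.777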


FIG. 4.36 Bivariate Correlations dialog box.

Excel also calculates Pearson's correlation coefficient through the PEARSON function.
Solution of Examples 4.8 and 4.9 (calculation of the covariance and Pearson's correlation coefficient) on SPSS
Once again, open the file Income_Education.sav. To calculate the covariance and Pearson's correlation coefficient on SPSS, we are going to click on Analyze → Correlate → Bivariate…. The Bivariate Correlations window will open. We are going to select the variables Family Income and Years of Education, in addition to Pearson's correlation coefficient, as shown in Fig. 4.36. In Options…, we must select the option Cross-product deviations and covariances, according to Fig. 4.37. We are going to click on Continue and then on OK. The results of the statistics are presented in Fig. 4.38.
FIG. 4.37 Selecting the covariance statistic.


FIG. 4.38 Results of the covariance and of Pearson’s correlation coefficient on SPSS.

FIG. 4.39 Calculating Pearson’s correlation coefficient on Stata.

FIG. 4.40 Calculating the covariance on Stata.

Analogous to Spearman's coefficient, Pearson's correlation coefficient can also be generated on SPSS from the menu Analyze → Descriptive Statistics → Crosstabs… (option Correlations in the Statistics… button).
Solution of Examples 4.8 and 4.9 (calculation of the covariance and Pearson's correlation coefficient) on Stata
To calculate Pearson's correlation coefficient on Stata, we must use the command correlate, or simply corr, followed by the list of variables we are interested in. The result is the correlation matrix between the respective variables. Once again, open the file Income_Education.dta. Thus, for the data in this file, we can type the following command: corr income education

The result can be seen in Fig. 4.39. To calculate the covariance, we must use the option covariance, or only cov, at the end of the command correlate (or simply corr). Thus, to generate Fig. 4.40, we must type the following command: corr income education, cov

4.4 FINAL REMARKS

This chapter presented the main concepts of descriptive statistics with greater focus on the study of the relationship between two variables (bivariate analysis). We studied the relationships between two qualitative variables (associations) and between two quantitative variables (correlations). For each situation, several measures, tables, and charts were presented, which allow us to have a better understanding of the data behavior. Fig. 4.1 summarizes this information.


The construction and interpretation of frequency distributions, graphical representations, in addition to summary measures (measures of position or location and measures of dispersion or variability), allow the researcher to have a better understanding and visualization of the data behavior for two variables simultaneously. More advanced techniques can be applied in the future to the same set of data, so that researchers can go deeper in their studies on bivariate analysis, aiming at improving the quality of the decision making process.

4.5 EXERCISES

1) Which descriptive statistics can be used (and in which situations) to represent the behavior of two qualitative variables simultaneously?
2) And to represent the behavior of two quantitative variables?
3) In what situations should we use contingency tables?
4) What are the differences between the chi-square statistic (χ²), the Phi coefficient, the contingency coefficient (C), Cramer's V coefficient, and Spearman's coefficient?
5) What are the main summary measures to represent the data behavior between two quantitative variables? Describe each one of them.
6) Aiming at identifying the behavior of customers who are in default regarding their payments, a survey with information on the age and level of default of the respondents was carried out. The objective is to determine if there is an association between the variables. Based on the files Default.sav and Default.dta, we would like you to:
a) Create the joint frequency distribution tables for the variables age_group and default (absolute frequencies, relative frequencies in relation to the general total, relative frequencies in relation to the total of each row, relative frequencies in relation to the total of each column, and the expected frequencies).
b) Determine the percentage of individuals who are between 31 and 40 years of age.
c) Determine the percentage of individuals who are heavily indebted.
d) Determine the percentage of respondents who are 20 years old or younger and do not have debts.
e) Determine, among the individuals who are older than 60, the percentage of those who are a little indebted.
f) Determine, among the individuals who are relatively indebted, the percentage of those who are between 41 and 50 years old.
g) Verify if there are indications of dependence between the variables.
h) Confirm the previous item using the χ² statistic.
i) Calculate the Phi, contingency, and Cramer's V coefficients, confirming whether there is an association between the variables or not.
7) The files Motivation_Companies.sav and Motivation_Companies.dta show a database with the variables Company and Level of Motivation (Motivation), obtained through a survey carried out with 250 employees (50 respondents for each one of the 5 companies surveyed), aiming at assessing the employees' level of motivation in relation to the companies, considered to be large firms. Hence, we would like you to:
a) Create the contingency tables of absolute frequencies, relative frequencies in relation to the general total, relative frequencies in relation to the total of each row, relative frequencies in relation to the total of each column, and the expected frequencies.
b) Calculate the percentage of respondents who are very demotivated.
c) Calculate the percentage of respondents who are from Company A and are very demotivated.
d) Calculate the percentage of motivated respondents in Company D.
e) Calculate the percentage of little motivated respondents in Company C.
f) Among the respondents who are very motivated, determine the percentage of those who work for Company B.
g) Verify if there are indications of dependence between the variables.
h) Confirm the previous item using the χ² statistic.
i) Calculate the Phi, contingency, and Cramer's V coefficients, confirming whether there is an association between the variables or not.
8) The files Students_Evaluation.sav and Students_Evaluation.dta show the grades, from 0 to 10, of 100 students from a public university in the following subjects: Operational Research, Statistics, Operations Management, and Finance. Check and see if there is a correlation between the following pairs of variables, constructing the scatter plot and calculating Pearson's correlation coefficient:
a) Operational Research and Statistics;
b) Operations Management and Finance;
c) Operational Research and Operations Management.


9) The files Brazilian_Supermarkets.sav and Brazilian_Supermarkets.dta show revenue data and the number of stores of the 20 largest Brazilian supermarket chains in a given year (source: ABRAS - Brazilian Association of Supermarkets). We would like you to:
a) Create the scatter plot for the variables revenue × number of stores.
b) Calculate Pearson's correlation coefficient between the two variables.
c) Exclude the four largest supermarket chains in terms of revenue, as well as the chain AM/PM Food and Beverages Ltd., and once again create the scatter plot.
d) Once again, calculate Pearson's correlation coefficient between the two variables being studied.

Chapter 5
Introduction to Probability
Do you want to sell sugar water for the rest of your life, or do you want to come with me and change the world?
Steve Jobs

5.1 INTRODUCTION

In the previous part of this book, we studied descriptive statistics, which describes and summarizes the main characteristics observed in a dataset through frequency distribution tables, charts, graphs, and summary measures, allowing the researcher to have a better understanding of the data. Probabilistic statistics, on the other hand, uses the probability theory to explain how often certain uncertain events happen, in order to estimate or predict the occurrence of future events. For example, when rolling dice, we do not know for sure which value will appear, so, probability can be used to indicate the occurrence probability of a certain event. According to Bruni (2011), the history of probability presumably started with the cave men. They needed to understand nature’s uncertain phenomena better. In the 17th century, probability theory appeared to explain uncertain events. The study of probability evolved to help plan moves or develop strategies meant for gambling. Currently, it is also applied to the study of statistical inference, in order to generalize the data population. This chapter has as its main objective to present the concepts and terminologies related to the probability theory, as well as their practical application.

5.2 TERMINOLOGY AND CONCEPTS
5.2.1 Random Experiment

An experiment consists in any observation or measure process. A random experiment is one that generates unpredictable results, so, if the process is repeated several times, it becomes impossible to predict the result. Flipping a coin and/or rolling dice are examples of random experiments.

5.2.2 Sample Space

Sample space S consists of all the possible results of an experiment. For example, when flipping a coin, we can get head (H) or tail (T). Therefore, S = {H, T}. On the other hand, when rolling a die, the sample space is represented by S = {1, 2, 3, 4, 5, 6}.

5.2.3 Events

An event is any subset of a sample space. For example, event A only contains the even occurrences of rolling a die. Therefore, A = {2, 4, 6}.

5.2.4 Unions, Intersections, and Complements

Two or more events can form unions, intersections, and complements. The union of two events A and B, represented by A ∪ B, results in a new event containing all the elements of A, B, or both, and can be illustrated according to Fig. 5.1.


FIG. 5.1 Union of two events (A ∪ B).

The intersection of two events A and B, represented by A ∩ B, results in a new event containing all the elements that are simultaneously in A and B, and can be illustrated according to Fig. 5.2. The complement of an event A, represented by A^c, is the event that contains all the points of S that are not in A, as shown in Fig. 5.3.

5.2.5 Independent Events

Two events A and B are independent when the probability of B happening is not conditional on event A happening. The concept of conditional probability will be discussed in Section 5.5.

5.2.6 Mutually Exclusive Events

Mutually excluding or exclusive events are those that do not have any elements in common, so, they cannot happen simultaneously. Fig. 5.4 illustrates two events A and B that are mutually exclusive.

FIG. 5.2 Intersection of two events (A ∩ B).

FIG. 5.3 Complement of event A.

FIG. 5.4 Events A and B that are mutually exclusive.

5.3 DEFINITION OF PROBABILITY

The probability of a certain event A happening in sample space S is given by the ratio between the number of cases favorable to the event (n_A) and the total number of possible cases (n):

P(A) = \frac{n_A}{n} = \frac{\text{number of cases favorable to event A}}{\text{total number of possible cases}}   (5.1)

Example 5.1
When rolling a die, what is the probability of getting an even number?
Solution
The sample space is given by S = {1, 2, 3, 4, 5, 6}. The event we are interested in is A = {even numbers on a die}, so A = {2, 4, 6}. Therefore, the probability of A happening is:

P(A) = 3/6 = 1/2

Example 5.2
A gravity-pick machine contains three white balls, two red balls, four yellow balls, and two black balls. What is the probability of a red ball being drawn?
Solution
Given a total of 11 balls and considering A = {the ball is red}, the probability is:

P(A) = \frac{\text{number of red balls}}{\text{total number of balls}} = \frac{2}{11}

5.4 BASIC PROBABILITY RULES
5.4.1 Probability Variation Field
The probability of an event A happening is a number between 0 and 1:

0 ≤ P(A) ≤ 1   (5.2)

5.4.2 Probability of the Sample Space
Sample space S has probability equal to 1:

P(S) = 1   (5.3)

5.4.3 Probability of an Empty Set
The probability of an empty set (∅) occurring is null:

P(∅) = 0   (5.4)

5.4.4 Probability Addition Rule
The probability of event A, event B, or both happening can be calculated as follows:

P(A ∪ B) = P(A) + P(B) − P(A ∩ B)   (5.5)

If events A and B are mutually exclusive, that is, A ∩ B = ∅, the probability of one of them happening is equal to the sum of the individual probabilities:

P(A ∪ B) = P(A) + P(B)   (5.6)

Expression (5.6) can be extended to n events (A_1, A_2, …, A_n) that are mutually exclusive:

P(A_1 ∪ A_2 ∪ ⋯ ∪ A_n) = P(A_1) + P(A_2) + ⋯ + P(A_n)   (5.7)

5.4.5 Probability of a Complementary Event
If A^c is A's complementary event, then:

P(A^c) = 1 − P(A)   (5.8)

5.4.6 Probability Multiplication Rule for Independent Events
If A and B are two independent events, the probability of them happening together is equal to the product of their individual probabilities:

P(A ∩ B) = P(A) · P(B)   (5.9)

Expression (5.9) can be extended to n independent events (A_1, A_2, …, A_n):

P(A_1 ∩ A_2 ∩ … ∩ A_n) = P(A_1) · P(A_2) · … · P(A_n)   (5.10)

Example 5.3
A gravity-pick machine contains balls with numbers 1 through 60 that have the same probability of being drawn. We would like you to:
a) Define the sample space.
b) Calculate the probability of a ball with an odd number on it being drawn.
c) Calculate the probability of a ball with a multiple of 5 on it being drawn.
d) Calculate the probability of a ball with an odd number or with a multiple of 5 on it being drawn.
e) Calculate the probability of a ball with a multiple of 7 or a multiple of 10 on it being drawn.
f) Calculate the probability of a ball that does not have a multiple of 5 on it being drawn.
g) One ball is drawn randomly and put back into the gravity-pick machine. A new ball will be drawn. Calculate the probability of the first ball having an even number on it and the second one a number greater than 40.
Solution
a) S = {1, 2, 3, …, 60}.
b) A = {1, 3, 5, …, 59}, P(A) = 30/60 = 1/2.
c) A = {5, 10, 15, …, 60}, P(A) = 12/60 = 1/5.
d) Here A = {1, 3, 5, …, 59} and B = {5, 10, 15, …, 60}. Since A and B are not mutually exclusive events, because they have common elements (5, 15, 25, 35, 45, 55), we apply Expression (5.5):
P(A ∪ B) = P(A) + P(B) − P(A ∩ B) = 1/2 + 1/5 − 6/60 = 3/5
e) In this case, A = {7, 14, 21, 28, 35, 42, 49, 56} and B = {10, 20, 30, 40, 50, 60}. Since the events are mutually exclusive (A ∩ B = ∅), we apply Expression (5.6):
P(A ∪ B) = P(A) + P(B) = 8/60 + 6/60 = 7/30
f) In this case, A = {multiples of 5} and A^c = {numbers that are not multiples of 5}. Therefore, the probability of the complementary event A^c happening is:
P(A^c) = 1 − P(A) = 1 − 1/5 = 4/5
g) Since the events are independent, we apply Expression (5.9):
P(A ∩ B) = P(A) · P(B) = 1/2 × 20/60 = 1/6
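Because the sample space of Example 5.3 is small, the addition, complement, and multiplication rules can be verified by brute-force enumeration. The Python sketch below is ours, not the book's; it uses exact fractions so the answers can be compared directly.

from fractions import Fraction

S = set(range(1, 61))
odd = {x for x in S if x % 2 == 1}
mult5 = {x for x in S if x % 5 == 0}
mult7 = {x for x in S if x % 7 == 0}
mult10 = {x for x in S if x % 10 == 0}

def p(event):
    # Expression (5.1): favorable cases over possible cases
    return Fraction(len(event), len(S))

print(p(odd))                               # 1/2  (item b)
print(p(mult5))                             # 1/5  (item c)
print(p(odd) + p(mult5) - p(odd & mult5))   # 3/5  (item d, addition rule)
print(p(mult7 | mult10))                    # 7/30 (item e, mutually exclusive events)
print(1 - p(mult5))                         # 4/5  (item f, complementary event)
even = S - odd
greater_than_40 = {x for x in S if x > 40}
print(p(even) * p(greater_than_40))         # 1/6  (item g, independent draws with replacement)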

5.5 CONDITIONAL PROBABILITY

When events are not independent, we must use the concept of conditional probability. Considering two events A and B, the probability of A happening, given that B has already happened, is called the conditional probability of A given B, and is represented by P(A | B):

P(A | B) = \frac{P(A ∩ B)}{P(B)}   (5.11)

An event A is considered independent of B if:

P(A | B) = P(A)   (5.12)

Example 5.4
A die is rolled. What is the probability of getting number 4, given that the number drawn was an even number?
Solution
In this case, A = {number 4} and B = {an even number}. Applying Expression (5.11), we have:

P(A | B) = \frac{P(A ∩ B)}{P(B)} = \frac{1/6}{1/2} = \frac{1}{3}

5.5.1 Probability Multiplication Rule

From the definition of conditional probability, the multiplication rule allows the researcher to calculate the probability of the simultaneous occurrence of two events A and B as the probability of one of them multiplied by the conditional probability of the other, given that the first event has occurred:

P(A ∩ B) = P(A) · P(B | A) = P(B) · P(A | B)   (5.13)

The multiplication rule can be extended to three events A, B, and C:

P(A ∩ B ∩ C) = P(A) · P(B | A) · P(C | A ∩ B)   (5.14)

This is only one of the six ways in which Expression (5.14) can be written.
Example 5.5
A gravity-pick machine contains eight white balls, six red balls, and four black balls. Initially, we draw a ball that is not put back into the gravity-pick machine. A new ball will be drawn. What is the probability of both balls being red?
Solution
Differently from the previous example, which calculated the conditional probability of a single event, the objective in this case is to calculate the probability of two events occurring simultaneously. The events are also not independent, since the first ball is not put back into the gravity-pick machine. If event A = {the first ball is red} and B = {the second ball is red}, to calculate P(A ∩ B), we must apply Expression (5.13):

P(A ∩ B) = P(A) · P(B | A) = \frac{6}{18} \times \frac{5}{17} = \frac{5}{51}

Example 5.6
A company will give a car to one of its customers (who are located in different regions of Brazil). Table 5.E.1 shows the data regarding these customers, in terms of gender and city. Determine:
a) What is the probability of a male customer being drawn?
b) What is the probability of a female customer being drawn?
c) What is the probability of a customer from Curitiba being drawn?
d) What is the probability of a customer from Sao Paulo being drawn, given that it is a male customer?
e) What is the probability of a female customer being drawn, given that it is a customer from Aracaju?
f) What is the probability of a female customer from Salvador being drawn?

TABLE 5.E.1 Absolute Frequency Distribution According to Gender and City

City              Male    Female    Total
Goiania           12      14        26
Aracaju           8       12        20
Salvador          16      15        31
Curitiba          24      22        46
Sao Paulo         35      25        60
Belo Horizonte    10      12        22
Total             105     100       205

Solution
a) The probability of the customer being a man is 105/205 = 21/41.
b) The probability of the customer being a woman is 100/205 = 20/41.
c) The probability of the customer being from Curitiba is 46/205.
d) Considering that A = {Sao Paulo} and B = {male}, P(A | B) is calculated according to Expression (5.11):
P(A | B) = P(A ∩ B)/P(B) = (35/205)/(105/205) = 1/3
e) Considering that A = {female} and B = {Aracaju}, P(A | B) is:
P(A | B) = P(A ∩ B)/P(B) = (12/205)/(20/205) = 3/5
f) If A = {Salvador} and B = {female}, P(A ∩ B) is calculated according to Expression (5.13):
P(A ∩ B) = P(A) · P(B | A) = 31/205 × 15/31 = 3/41
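The calculations of Example 5.6 can also be reproduced directly from the contingency table. The Python sketch below is an illustration (not part of the original solution): it stores Table 5.E.1 in a pandas DataFrame and applies Expressions (5.11) and (5.13).

import pandas as pd

table = pd.DataFrame(
    {"Male":   [12, 8, 16, 24, 35, 10],
     "Female": [14, 12, 15, 22, 25, 12]},
    index=["Goiania", "Aracaju", "Salvador", "Curitiba", "Sao Paulo", "Belo Horizonte"],
)
total = table.values.sum()                              # 205 customers

p_male = table["Male"].sum() / total                    # item a: 105/205
p_curitiba = table.loc["Curitiba"].sum() / total        # item c: 46/205

# Item d: P(Sao Paulo | male) = P(Sao Paulo and male) / P(male) = 1/3
p_sp_given_male = (table.loc["Sao Paulo", "Male"] / total) / p_male

# Item f: P(Salvador and female) = P(Salvador) * P(female | Salvador) = 3/41
p_salvador = table.loc["Salvador"].sum() / total
p_female_given_salvador = table.loc["Salvador", "Female"] / table.loc["Salvador"].sum()
p_salvador_and_female = p_salvador * p_female_given_salvador

print(p_male, p_curitiba, p_sp_given_male, p_salvador_and_female)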

5.6 BAYES' THEOREM

Imagine that the probability of a certain event was calculated. However, new information was added to the process, so the probability must be recalculated. The probability calculated initially is called the a priori probability; the probability with the recently added information is called the a posteriori probability. The calculation of the a posteriori probability is based on Bayes' Theorem and is described here.
Consider B_1, B_2, …, B_n mutually exclusive events, with P(B_1) + P(B_2) + … + P(B_n) = 1. A, on the other hand, is any given event that will happen jointly or as a consequence of one of the B_i events (i = 1, 2, …, n). The probability of a B_i event happening, given that event A has already happened, is calculated as follows:

P(B_i | A) = \frac{P(B_i ∩ A)}{P(A)} = \frac{P(B_i) \cdot P(A | B_i)}{P(B_1) \cdot P(A | B_1) + P(B_2) \cdot P(A | B_2) + \cdots + P(B_n) \cdot P(A | B_n)}   (5.15)

where: P(Bi) is the a priori probability; P(Bi j A) is the a posteriori probability (probability of Bi after A has happened).

Example 5.7 Consider three identical gravity-pick machines U1, U2, and U3. Gravity-pick machine U1 contains two balls, one is yellow and the other is red. Gravity-pick machine U2, on the other hand, contains three blue balls, while machine U3 contains two red balls and a yellow one. We select one of the gravity-pick machines at random and draw one ball. We can see that the ball chosen is yellow. What is the probability of gravity-pick machine U1 having been chosen?


Solution
Let's define the following events:
B_1 = choosing gravity-pick machine U1;
B_2 = choosing gravity-pick machine U2;
B_3 = choosing gravity-pick machine U3;
A = choosing the yellow ball.
The objective is to calculate P(B_1 | A), knowing that:
P(B_1) = 1/3, P(A | B_1) = 1/2
P(B_2) = 1/3, P(A | B_2) = 0
P(B_3) = 1/3, P(A | B_3) = 1/3
Therefore, we have:

P(B_1 | A) = \frac{P(B_1 ∩ A)}{P(A)} = \frac{P(B_1) \cdot P(A | B_1)}{P(B_1) \cdot P(A | B_1) + P(B_2) \cdot P(A | B_2) + P(B_3) \cdot P(A | B_3)} = \frac{\frac{1}{3} \times \frac{1}{2}}{\frac{1}{3} \times \frac{1}{2} + 0 + \frac{1}{3} \times \frac{1}{3}} = \frac{3}{5}
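The a posteriori probability of Example 5.7 follows mechanically from expression (5.15). The short Python sketch below is ours; it keeps the probabilities as exact fractions so the answer 3/5 can be read off directly.

from fractions import Fraction

priors = {"U1": Fraction(1, 3), "U2": Fraction(1, 3), "U3": Fraction(1, 3)}
likelihood = {"U1": Fraction(1, 2), "U2": Fraction(0), "U3": Fraction(1, 3)}  # P(yellow | machine)

# Total probability of drawing a yellow ball
p_yellow = sum(priors[u] * likelihood[u] for u in priors)

# A posteriori probabilities given that the ball drawn is yellow
posterior = {u: priors[u] * likelihood[u] / p_yellow for u in priors}
print(posterior["U1"])   # 3/5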

5.7 COMBINATORIAL ANALYSIS

Combinatorial analysis is a set of procedures that calculates the number of different groups that can be formed by selecting a finite number of elements from a set. Arrangements, combinations, and permutations are the three main types of configurations and are applicable to the probability. The probability of an event is, therefore, the ratio between the number of results of the event we are interested in and the total number of results in the sample space (total number of arrangements, combinations, or permutations).

5.7.1 Arrangements

An arrangement calculates the number of possible configurations with distinct elements from a certain set. Bruni (2011) defines arrangement as the study of the number of ways in which the researcher can organize a sample of objects, removed from a larger population, in which the order of the organized objects is relevant. Given n different objects, if the objective is to select p of these objects (n and p are integers, n ≥ p), the number of arrangements or possible ways of doing this is represented by A_{n,p} and calculated as follows:

A_{n,p} = \frac{n!}{(n - p)!}   (5.16)

Example 5.8
Consider a set with three elements, A = {1, 2, 3}. If these elements were taken 2 by 2, how many arrangements would be possible? What is the probability of element 3 being in the second position?
Solution
From Expression (5.16), we have:

A_{3,2} = \frac{3!}{(3 - 2)!} = \frac{3 \times 2 \times 1}{1} = 6

These arrangements are (1, 2), (1, 3), (2, 1), (2, 3), (3, 1), and (3, 2). In an arrangement, the order in which the elements are organized is relevant. For example, (1, 2) ≠ (2, 1). After defining all the arrangements, it is easy to calculate the probability. Since we have two arrangements in which element 3 is in the second position, and the total number of arrangements is 6, the probability is 2/6 = 1/3.


Example 5.9
Calculate the number of ways in which it is possible to park six vehicles in three parking spaces. What is the probability of vehicle 1 being in the first parking space?
Solution
Through Expression (5.16), we have:

A_{6,3} = \frac{6!}{(6 - 3)!} = \frac{6 \times 5 \times 4 \times 3!}{3!} = 120

Of the 120 possible arrangements, in 20 of them vehicle 1 is in the first position: (1, 2, 3), (1, 2, 4), (1, 2, 5), (1, 2, 6), (1, 3, 2), (1, 3, 4), (1, 3, 5), (1, 3, 6), (1, 4, 2), (1, 4, 3), (1, 4, 5), (1, 4, 6), (1, 5, 2), (1, 5, 3), (1, 5, 4), (1, 5, 6), (1, 6, 2), (1, 6, 3), (1, 6, 4), (1, 6, 5). Therefore, the probability is 20/120 = 1/6.

5.7.2 Combinations
Combinations are a special case of arrangements in which the order in which the elements are organized does not matter. Given n different objects, the number of ways or combinations in which to organize p of these objects is represented by C_{n,p} (a combination of n elements arranged p by p), and calculated as follows:

C_{n,p} = \binom{n}{p} = \frac{n!}{p!(n - p)!}   (5.17)

Example 5.10
How many different ways can we form groups of four students in a class with 20 students?
Solution
Since the order of the elements in the group is not relevant, we must apply Expression (5.17):

C_{20,4} = \binom{20}{4} = \frac{20!}{4!(20 - 4)!} = \frac{20 \times 19 \times 18 \times 17 \times 16!}{24 \times 16!} = 4,845

Thus, 4,845 different groups can be formed.

Example 5.11
Marcelo, Felipe, Luiz Paulo, Rodrigo, and Ricardo went to an amusement park to have fun. The ride they chose to go on next only has three seats, so only three of them will be chosen randomly. What is the probability of Felipe and Luiz Paulo being on that ride?
Solution
The total number of combinations is:

C_{5,3} = \binom{5}{3} = \frac{5!}{3!2!} = \frac{5 \times 4 \times 3!}{3! \times 2} = 10

The 10 possibilities are:
Group 1: Marcelo, Felipe, and Luiz Paulo
Group 2: Marcelo, Felipe, and Rodrigo
Group 3: Marcelo, Felipe, and Ricardo
Group 4: Marcelo, Luiz Paulo, and Rodrigo
Group 5: Marcelo, Luiz Paulo, and Ricardo
Group 6: Marcelo, Rodrigo, and Ricardo
Group 7: Felipe, Luiz Paulo, and Rodrigo
Group 8: Felipe, Luiz Paulo, and Ricardo
Group 9: Felipe, Rodrigo, and Ricardo
Group 10: Luiz Paulo, Rodrigo, and Ricardo
Since Felipe and Luiz Paulo appear together in 3 of the 10 groups, the probability is 3/10.

5.7.3 Permutations

Permutation is an arrangement in which all the elements in the set are selected. Therefore, it is the number of ways in which n elements can be grouped by changing their order. The number of possible permutations is represented by P_n and can be calculated as follows:

P_n = n!   (5.18)

Example 5.12
Consider a set with three elements, A = {1, 2, 3}. What is the total number of permutations possible?
Solution
P_3 = 3! = 3 × 2 × 1 = 6. They are (1, 2, 3), (1, 3, 2), (2, 1, 3), (2, 3, 1), (3, 1, 2), and (3, 2, 1).

Example 5.13
A certain factory manufactures six different products. How many different ways can the production sequence occur?
Solution
To determine the number of possible production sequences, we just need to apply Expression (5.18):

P_6 = 6! = 6 × 5 × 4 × 3 × 2 × 1 = 720
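Arrangements, combinations, and permutations map directly onto functions of Python's standard math module (perm, comb, and factorial, available from Python 3.8 on). The sketch below, ours rather than the book's, reproduces the counts of Examples 5.8 through 5.13.

from math import comb, factorial, perm

print(perm(3, 2))       # 6     arrangements of 3 elements taken 2 by 2 (Example 5.8)
print(perm(6, 3))       # 120   six vehicles in three parking spaces (Example 5.9)
print(comb(20, 4))      # 4845  groups of 4 students out of 20 (Example 5.10)
print(comb(5, 3))       # 10    groups of 3 friends out of 5 (Example 5.11)
print(factorial(3))     # 6     permutations of {1, 2, 3} (Example 5.12)
print(factorial(6))     # 720   production sequences for 6 products (Example 5.13)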

5.8 FINAL REMARKS

This chapter discussed the concepts and terminologies related to probability theory, as well as their practical application. Probability theory is used to assess the possibility of uncertain events happening; it originated in attempts to understand uncertain natural phenomena, evolved through the planning of gambling strategies, and is currently applied to the study of statistical inference.

5.9 EXERCISES

1) Two soccer teams will play overtime until the Golden Goal is scored. Define the sample space.
2) What is the difference between mutually exclusive events and independent events?
3) In a deck of cards with 52 cards, determine:
a. The probability of a card of hearts being drawn;
b. The probability of a queen being drawn;
c. The probability of a face card (jack, queen, or king) being drawn;
d. The probability of any card, but not a face card, being drawn.
4) A production batch contains 240 parts and 12 of them are defective. One part is drawn randomly. What is the probability of this part being defective?
5) A number between 1 and 30 is chosen randomly. We would like you to:
a. Define the sample space.
b. What is the probability of this number being divisible by 3?
c. What is the probability of this number being a multiple of 5?
d. What is the probability of this number being divisible by 3 or a multiple of 5?
e. What is the probability of this number being even, given that it is a multiple of 5?
f. What is the probability of this number being a multiple of 5, given that it is divisible by 3?
g. What is the probability of this number not being divisible by 3?
h. Assuming that two numbers are chosen randomly, what is the probability of the first number being a multiple of 5 and the second one an odd number?


6) Two dice are rolled simultaneously. Determine:
a. The sample space.
b. What is the probability of both numbers being even?
c. What is the probability of the sum of the numbers being 10?
d. What is the probability of the multiplication of the numbers being 6?
e. What is the probability of the sum of the numbers being 10 or 6?
f. What is the probability of the number drawn in the first die being an odd number or of the number drawn in the second die being a multiple of 3?
g. What is the probability of the number drawn in the first die being an even number or of the number drawn in the second die being a multiple of 4?
7) What is the difference between arrangements, combinations, and permutations?

Chapter 6
Random Variables and Probability Distributions
What we call chance can only be the unknown cause of a known effect.
Voltaire

6.1 INTRODUCTION

In Chapters 3 and 4, we discussed several statistics to describe the behavior of quantitative and qualitative data, including sample frequency distributions. In this chapter, we are going to study population probability distributions (for quantitative variables). The frequency distribution of a sample is an estimate of the corresponding population probability distribution. When the sample size is large, the sample frequency distribution approximately follows the population probability distribution (Martins and Domingues, 2011). According to the authors, for the study of empirical research, as well as for solving several practical problems, the study of descriptive statistics is essential. However, when the main goal is to study a population's variables, the probability distribution is more suitable. This chapter discusses the concept of discrete and continuous random variables, the main probability distributions for each type of random variable, and also the calculation of the expected value and the variance of each probability distribution. For discrete random variables, the most common probability distributions are the discrete uniform, Bernoulli, binomial, geometric, negative binomial, hypergeometric, and Poisson. On the other hand, for continuous random variables, we are going to study the uniform, normal, exponential, gamma, chi-square (χ²), Student's t, and Snedecor's F distributions.

6.2 RANDOM VARIABLES

As studied in the previous chapter, the set of all possible results of a random experiment is called sample space. To describe a random experiment, it is convenient to associate numerical values to the elements of the sample space. A random variable can be characterized as being a variable that presents a single value for each element, and this value is determined randomly. Assume that e is a random experiment and S is the sample space associated to this experiment. The function X that associates to each element s ∈ S a real number X(s) is called a random variable. Random variables can be discrete or continuous.

6.2.1 Discrete Random Variable

A discrete random variable can only take on countable numbers of distinct values, usually counts. Therefore, it cannot assume decimal or noninteger values. As examples of discrete random variables, we can mention the number of children in a family, the number of employees in a company, or the number of vehicles produced in a certain factory.

6.2.1.1 Expected Value of a Discrete Random Variable
Let X be a discrete random variable that can take on the values {x_1, x_2, …, x_n} with the respective probabilities {p(x_1), p(x_2), …, p(x_n)}. The function {x_i, p(x_i), i = 1, 2, …, n} is called the probability function of random variable X and associates, to each value of x_i, its probability of occurrence:


p(x_i) = P(X = x_i) = p_i, i = 1, 2, …, n   (6.1)

so that p(x_i) ≥ 0 for every x_i and \sum_{i=1}^{n} p(x_i) = 1.
The expected or average value of X is given by the expression:

E(X) = \sum_{i=1}^{n} x_i \cdot P(X = x_i) = \sum_{i=1}^{n} x_i \cdot p_i   (6.2)

Expression (6.2) is similar to the one used for the mean in Chapter 3, in which instead of probabilities pi we had relative frequencies Fri. The difference between pi and Fri is that the former corresponds to the values from an assumed theoretical model and the latter to the variable values observed. Since pi and Fri have the same interpretation, all of the measures and charts presented in Chapter 3, based on the distribution of Fri, have a corresponding one in the distribution of a random variable. The same interpretation is valid for other measures of position and variability, such as, the median and the standard deviation (Bussab and Morettin, 2011).

6.2.1.2 Variance of a Discrete Random Variable
The variance of a discrete random variable X is a weighted mean of the squared distances between the values that X can take on and X's expected value, where the weights are the probabilities of the possible values of X. If X assumes the values {x_1, x_2, …, x_n} with the respective probabilities {p_1, p_2, …, p_n}, then its variance is given by:

Var(X) = \sigma^2(X) = E[(X - E(X))^2] = \sum_{i=1}^{n} [x_i - E(X)]^2 \cdot p_i   (6.3)

In some cases, it is convenient to use the standard deviation of a random variable as a measure of variability. The standard deviation of X is the square root of the variance:

\sigma(X) = \sqrt{Var(X)}   (6.4)

Example 6.1
Assume that the monthly real estate sales for a certain real estate agent follow the probability distribution seen in Table 6.E.1. Determine the expected value of monthly sales, as well as its variance.

TABLE 6.E.1 Monthly Real Estate Sales and Their Respective Probabilities

x_i (sales)    0       1       2       3
p(x_i)         2/10    4/10    3/10    1/10

Solution
The expected value of monthly sales is:

E(X) = 0 × 0.20 + 1 × 0.40 + 2 × 0.30 + 3 × 0.10 = 1.3

The variance can be calculated as:

Var(X) = (0 − 1.3)² × 0.2 + (1 − 1.3)² × 0.4 + (2 − 1.3)² × 0.3 + (3 − 1.3)² × 0.1 = 0.81
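Expressions (6.2) and (6.3) translate into one line of numpy each. The sketch below is an illustration (not part of the original solution) that reproduces the numbers of Example 6.1.

import numpy as np

x = np.array([0, 1, 2, 3])                 # number of monthly sales
p = np.array([0.2, 0.4, 0.3, 0.1])         # probabilities, which sum to 1

mean = (x * p).sum()                       # expression (6.2): E(X) = 1.3
variance = (((x - mean) ** 2) * p).sum()   # expression (6.3): Var(X) = 0.81
print(mean, variance)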

6.2.1.3 Cumulative Distribution Function of a Discrete Random Variable
The cumulative distribution function (c.d.f.) of a random variable X, denoted by F(x), corresponds to the sum of the probabilities of the values x_i that are less than or equal to x:

F(x) = P(X ≤ x) = \sum_{x_i \le x} p(x_i)   (6.5)

The following properties are valid for the cumulative distribution function of a discrete random variable:

0 ≤ F(x) ≤ 1   (6.6)

\lim_{x \to \infty} F(x) = 1   (6.7)

\lim_{x \to -\infty} F(x) = 0   (6.8)

a < b → F(a) ≤ F(b)   (6.9)

Example 6.2
For the data in Example 6.1, calculate F(0.5), F(1), F(2.5), F(3), F(4), and F(−0.5).
Solution
a) F(0.5) = P(X ≤ 0.5) = 2/10
b) F(1) = P(X ≤ 1) = 2/10 + 4/10 = 6/10
c) F(2.5) = P(X ≤ 2.5) = 2/10 + 4/10 + 3/10 = 9/10
d) F(3) = P(X ≤ 3) = 2/10 + 4/10 + 3/10 + 1/10 = 1
e) F(4) = P(X ≤ 4) = 1
f) F(−0.5) = P(X ≤ −0.5) = 0
In short, the cumulative distribution function of random variable X in Example 6.1 is given by:

F(x) = 0 if x < 0; 2/10 if 0 ≤ x < 1; 6/10 if 1 ≤ x < 2; 9/10 if 2 ≤ x < 3; 1 if x ≥ 3

6.2.2 Continuous Random Variable

A continuous random variable can take on several different values in an interval of real numbers. As examples of continuous random variables, we can mention a family's income, the revenue of a company, or the height of a certain child. A continuous random variable X is associated to an f(x) function, called a probability density function (p.d.f.) of X, which meets the following condition:

\int_{-\infty}^{+\infty} f(x)\,dx = 1, \quad f(x) \ge 0   (6.10)

For any a and b, such that −∞ < a < b < +∞, the probability of random variable X taking on values within this interval is:

P(a ≤ X ≤ b) = \int_{a}^{b} f(x)\,dx   (6.11)

which can be graphically represented as shown in Fig. 6.1.
FIG. 6.1 Probability of X assuming values within the interval [a, b].


6.2.2.1 Expected Value of a Continuous Random Variable
The mathematical expected or average value of a continuous random variable X with a probability density function f(x) is given by the expression:

E(X) = \int_{-\infty}^{+\infty} x f(x)\,dx   (6.12)

6.2.2.2 Variance of a Continuous Random Variable
The variance of a continuous random variable X with a probability density function f(x) is calculated as:

Var(X) = E(X^2) - [E(X)]^2 = \int_{-\infty}^{\infty} (x - E(X))^2 f(x)\,dx   (6.13)

Example 6.3
The probability density function of a continuous random variable X is given by:

f(x) = 2x for 0 < x < 1, and f(x) = 0 for any other values.

Calculate E(X) and Var(X).
Solution

E(X) = \int_{0}^{1} x \cdot 2x\,dx = \int_{0}^{1} 2x^2\,dx = \frac{2}{3}

E(X^2) = \int_{0}^{1} x^2 \cdot 2x\,dx = \int_{0}^{1} 2x^3\,dx = \frac{1}{2}

Var(X) = E(X^2) - [E(X)]^2 = \frac{1}{2} - \left(\frac{2}{3}\right)^2 = \frac{1}{18}
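The integrals of Example 6.3 can be verified with symbolic computation. The sketch below, ours rather than the book's, uses sympy to evaluate E(X), E(X²), and Var(X) for the density f(x) = 2x on (0, 1).

import sympy as sp

x = sp.Symbol("x")
f = 2 * x                                    # density on (0, 1); zero elsewhere

E_X = sp.integrate(x * f, (x, 0, 1))         # 2/3
E_X2 = sp.integrate(x ** 2 * f, (x, 0, 1))   # 1/2
Var_X = sp.simplify(E_X2 - E_X ** 2)         # 1/18
print(E_X, E_X2, Var_X)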

6.2.2.3 Cumulative Distribution Function of a Continuous Random Variable
As in the discrete case, we can calculate the probabilities associated to a continuous random variable X from a cumulative distribution function. The cumulative distribution function F(x) of a continuous random variable X with probability density function f(x) is defined by:

F(x) = P(X ≤ x), −∞ < x < ∞   (6.14)

Expression (6.14) is similar to the one presented for the discrete case, in Expression (6.5). The difference is that, for continuous variables, the cumulative distribution function is a continuous function, without jumps. From (6.11) we have:

F(x) = \int_{-\infty}^{x} f(t)\,dt   (6.15)

As in the discrete case, the following properties for the cumulative distribution function of a continuous random variable are valid:


0 ≤ F(x) ≤ 1   (6.16)

\lim_{x \to \infty} F(x) = 1   (6.17)

\lim_{x \to -\infty} F(x) = 0   (6.18)

a < b → F(a) ≤ F(b)   (6.19)

Example 6.4
Once again, let us consider the probability density function in Example 6.3: f(x) = 2x for 0 < x < 1, and f(x) = 0 for any other values. Calculate the cumulative distribution function of X.
Solution

F(x) = P(X ≤ x) = \int_{-\infty}^{x} f(t)\,dt = 0 if x ≤ 0; x² if 0 < x ≤ 1; 1 if x > 1

6.3 PROBABILITY DISTRIBUTIONS FOR DISCRETE RANDOM VARIABLES

For discrete random variables, the most common probability distributions are the discrete uniform, Bernoulli, binomial, geometric, negative binomial, hypergeometric, and Poisson.

6.3.1 Discrete Uniform Distribution

It is the simplest discrete probability distribution, and it receives the name uniform because all of the possible values of the random variable have the same probability of occurrence. A discrete random variable X that takes on the values x_1, x_2, …, x_n has a discrete uniform distribution with parameter n, denoted by X ~ Ud{x_1, x_2, …, x_n}, if its probability function is given by:

P(X = x_i) = p(x_i) = \frac{1}{n}, i = 1, 2, …, n   (6.20)

which may be graphically represented as shown in Fig. 6.2.
FIG. 6.2 Discrete uniform distribution.
The mathematical expected value of X is given by:

E(X) = \frac{1}{n} \sum_{i=1}^{n} x_i   (6.21)


The variance of X is calculated as:

Var(X) = \frac{1}{n} \left[ \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n} \right]   (6.22)

And the cumulative distribution function (c.d.f.) is:

F(X) = P(X ≤ x) = \sum_{x_i \le x} \frac{1}{n} = \frac{n(x)}{n}   (6.23)

where n(x) is the number of x_i ≤ x, as shown in Fig. 6.3.

FIG. 6.3 Cumulative distribution function.

Example 6.5 A totally balanced and clean die is thrown and random variable X represents the value on the face that is facing up. Determine the distribution of X, in addition to X’s expected value and variance. Solution The distribution of X is shown in Table 6.E.2.

TABLE 6.E.2 Distribution of X

X       1      2      3      4      5      6      Sum
f(x)    1/6    1/6    1/6    1/6    1/6    1/6    1

Therefore, we have:

E(X) = \frac{1}{6}(1 + 2 + 3 + 4 + 5 + 6) = 3.5

Var(X) = \frac{1}{6}\left[1^2 + 2^2 + \cdots + 6^2 - \frac{(21)^2}{6}\right] = \frac{35}{12} = 2.917

6.3.2 Bernoulli Distribution

The Bernoulli trial is a random experiment that only offers two possible results, conventionally called success or failure. As an example of a Bernoulli trial, we can mention tossing a coin, whose only possible results are head and tail. For a certain Bernoulli trial, we will consider the random variable X that takes on the value 1 in case of success, and 0 in case of failure. The probability of success is represented by p and the probability of failure by (1 − p) or q. The Bernoulli distribution, therefore, provides the probability of success or failure of variable X when carrying out a single experiment. Therefore, we can say that variable X follows a Bernoulli distribution with parameter p, denoted by X ~ Bern(p), if its probability function is given by:

P(X = x) = p(x) = q = 1 − p, if x = 0; p, if x = 1   (6.24)

which can also be represented in the following way:

P(X = x) = p(x) = p^x \cdot (1 - p)^{1 - x}, x = 0, 1   (6.25)

The probability function of random variable X is represented in Fig. 6.4.
FIG. 6.4 Probability function of the Bernoulli distribution.
It is easy to see that the expected value of X is:

E(X) = p   (6.26)

with X's variance being:

Var(X) = p \cdot (1 - p)   (6.27)

Bernoulli's cumulative distribution function (c.d.f.) is given by:

F(x) = P(X ≤ x) = 0, if x < 0; 1 − p, if 0 ≤ x < 1; 1, if x ≥ 1   (6.28)

which can be represented by Fig. 6.5.
FIG. 6.5 Bernoulli's distribution c.d.f.
It is important to mention that we are going to use all this knowledge on Bernoulli's distribution when discussing binary logistic regression models (Chapter 14).
Example 6.6
The Interclub Indoor Soccer Cup final match is going to be between teams A and B. Random variable X represents the team that will win the Cup. We know that the probability of team A winning is 0.60. Determine the distribution of X, in addition to X's expected value and variance.


Solution
Random variable X can only take on two values:

X = 1, if team A wins; X = 0, if team B wins

Since it is a single game, variable X follows a Bernoulli distribution with parameter p = 0.60, denoted by X ~ Bern(0.6), so:

P(X = x) = p(x) = q = 0.4, if x = 0 (team B); p = 0.6, if x = 1 (team A)

We have:

E(X) = p = 0.6
Var(X) = p(1 − p) = 0.6 × 0.4 = 0.24

6.3.3 Binomial Distribution

A binomial experiment consists of n independent repetitions of a Bernoulli trial with probability of success p, a probability that remains constant in all repetitions. The discrete random variable X of a binomial model corresponds to the number of successes (k) in the n repetitions of the experiment. Therefore, X follows a binomial distribution with parameters n and p, denoted by X ~ b(n, p), if its probability distribution function is given by:

f(k) = P(X = k) = \binom{n}{k} \cdot p^k \cdot (1 - p)^{n - k}, k = 0, 1, …, n   (6.29)

where \binom{n}{k} = \frac{n!}{k!(n - k)!}.
The mean of X is given by:

E(X) = n \cdot p   (6.30)

On the other hand, the variance of X can be expressed as:

Var(X) = n \cdot p \cdot (1 - p)   (6.31)

Note that the mean and variance of the binomial distribution are equal to the mean and variance of the Bernoulli distribution multiplied by n, the number of repetitions of the Bernoulli trial. Fig. 6.6 shows the probability function of the binomial distribution for n = 10 and p varying between 0.3, 0.5, and 0.7. From Fig. 6.6, we can see that, for p = 0.5, the probability function is symmetrical around the mean. If p < 0.5, the distribution is positively asymmetrical, with a higher frequency of smaller values of k and a longer tail to the right. If p > 0.5, the distribution is negatively asymmetrical, with a higher frequency of larger values of k and a longer tail to the left. It is important to mention that we are going to use all this knowledge on the binomial distribution when studying multinomial logistic regression models (Chapter 14).
FIG. 6.6 Probability function of the binomial distribution for n = 10.


6.3.3.1 Relationship Between the Binomial and the Bernoulli Distributions
A binomial distribution with parameter n = 1 is equivalent to a Bernoulli distribution:

X ~ b(1, p) ≡ X ~ Bern(p)

Example 6.7
A certain part is produced in a production line. The probability of the part not having defects is 99%. If 30 parts are produced, what is the probability of at least 28 of them being in good condition? Also determine the random variable's mean and variance.
Solution
We have:
X = random variable that represents the number of successes (parts in good condition) in the 30 repetitions
p = 0.99 = probability of the part being in good condition
q = 0.01 = probability of the part being defective
n = 30 repetitions
k = number of successes
The probability of at least 28 parts not being defective is given by:

P(X ≥ 28) = P(X = 28) + P(X = 29) + P(X = 30)

P(X = 28) = \frac{30!}{28!\,2!} \times \left(\frac{99}{100}\right)^{28} \times \left(\frac{1}{100}\right)^{2} = 0.0328

P(X = 29) = \frac{30!}{29!\,1!} \times \left(\frac{99}{100}\right)^{29} \times \left(\frac{1}{100}\right)^{1} = 0.224

P(X = 30) = \frac{30!}{30!\,0!} \times \left(\frac{99}{100}\right)^{30} \times \left(\frac{1}{100}\right)^{0} = 0.7397

P(X ≥ 28) = 0.0328 + 0.224 + 0.7397 = 0.997

The mean of X is expressed as:

E(X) = n \cdot p = 30 × 0.99 = 29.7

And the variance of X is:

Var(X) = n \cdot p \cdot (1 - p) = 30 × 0.99 × 0.01 = 0.297
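The same probabilities can be obtained from scipy's binomial distribution, which is convenient when n is large and the factorials become unwieldy. The sketch below is an illustration, not part of the original solution, and reproduces the values of Example 6.7.

from scipy.stats import binom

n, p = 30, 0.99
print(binom.pmf(28, n, p))                 # about 0.0328
print(binom.pmf(29, n, p))                 # about 0.2242
print(binom.pmf(30, n, p))                 # about 0.7397
print(binom.sf(27, n, p))                  # P(X >= 28) = 1 - F(27), about 0.997
print(binom.mean(n, p), binom.var(n, p))   # 29.7 and 0.297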

6.3.4 Geometric Distribution

The geometric distribution, as well as the binomial, considers successive independent Bernoulli trials, all of them with probability of success p. However, instead of using a fixed number of trials, the experiment is carried out until the first success is obtained. The geometric distribution presents two distinct parameterizations, described here.
The first parameterization considers successive independent Bernoulli trials, with probability of success p in each trial, until a success occurs. In this case, we cannot include zero as a possible result, so the domain is supported by the set {1, 2, 3, …}. For example, we can consider how many times we tossed a coin until we got the first head, the number of parts manufactured until a defective one was produced, among others.
The second parameterization of the geometric distribution counts the number of failures or unsuccessful attempts before the first success. Since here it is possible to obtain success in the first Bernoulli trial, we include zero as a possible result, so the domain is supported by the set {0, 1, 2, 3, …}.
Let X be the random variable that represents the number of trials until the first success. Variable X has a geometric distribution with parameter p, denoted by X ~ Geo(p), if its probability function is given by:

f(x) = P(X = x) = p \cdot (1 - p)^{x - 1}, x = 1, 2, 3, …   (6.32)

For the second case, let us consider Y the random variable that represents the number of failures or unsuccessful attempts before the first success. Variable Y has a geometric distribution with parameter p, denoted by Y ~ Geo(p), if its probability function is given by:

f(y) = P(Y = y) = p \cdot (1 - p)^{y}, y = 0, 1, 2, …   (6.33)


FIG. 6.7 Probability function of variable X with parameter p = 0.4.

In both cases, the sequence of probabilities is a geometric progression. The probability function of variable X is graphically represented in Fig. 6.7, for p = 0.4. X's expected value and variance are:
E(X) = 1/p (6.34)
Var(X) = (1 − p)/p² (6.35)

In a similar way, for variable Y, we have:
E(Y) = (1 − p)/p (6.36)
Var(Y) = (1 − p)/p² (6.37)

The geometric distribution is the only discrete distribution that has the memoryless property (in the case of continuous distributions, we will see that the exponential distribution also has this property). This means that, given that the first success has not happened yet, the conditional distribution of the number of additional trials needed does not depend on the number of failures observed so far. Thus, for any two positive integers s and t, the probability of X being greater than s + t, given that X is greater than s, is equal to the unconditional probability of X being greater than t:
P(X > s + t | X > s) = P(X > t) (6.38)

Example 6.8
A company manufactures a certain electronic component and, at the end of the process, each component is tested, one by one. Assume that the probability of an electronic component being defective is 0.05. Determine the probability of the first defect being found in the eighth component tested. Also determine the random variable's expected value and variance.
Solution
We have:
X = random variable that represents the number of electronic components tested until the first defect is found;
p = 0.05 = probability of the component being defective;
q = 0.95 = probability of the component being in good condition.
The probability of the first defect being found in the eighth component tested is given by:
P(X = 8) = 0.05·(1 − 0.05)^(8−1) = 0.035
The mean of X is expressed as:
E(X) = 1/p = 20


And the variance of X is:
Var(X) = (1 − p)/p² = 0.95/0.0025 = 380
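A minimal check of Example 6.8 with scipy.stats (an illustrative sketch; scipy's geom uses the same "trials until the first success" parameterization as variable X):

```python
from scipy.stats import geom

p = 0.05                      # probability of a defective component
dist = geom(p)                # counts trials until the first success

print(round(dist.pmf(8), 4))  # P(X = 8) ~ 0.0349
print(dist.mean())            # E(X) = 1/p = 20
print(dist.var())             # Var(X) = (1 - p)/p^2 = 380
```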

6.3.5 Negative Binomial Distribution

The negative binomial distribution, also known as the Pascal distribution, carries out successive independent Bernoulli trials (with a constant probability of success in all the trials) until a prefixed number of successes (k) is achieved, that is, the experiment continues until k successes are obtained. Let X be the random variable that represents the number of attempts (Bernoulli trials) carried out until the k-th success is reached. Variable X has a negative binomial distribution, denoted by X ~ nb(k, p), if its probability function is given by:
f(x) = P(X = x) = C(x − 1, k − 1)·p^k·(1 − p)^(x−k), x = k, k + 1, … (6.39)
The graphical representation of a negative binomial distribution with parameters k = 2 and p = 0.4 can be found in Fig. 6.8. The expected value of X is:
E(X) = k/p (6.40)
and the variance is:
Var(X) = k·(1 − p)/p² (6.41)

6.3.5.1 Relationship Between the Negative Binomial and the Binomial Distributions The negative binomial distribution is related to the binomial distribution. In the binomial, we must set the sample size (number of Bernoulli trials) and observe the number of successes (random variable). In the negative binomial, we must set the number of successes (k) and observe the number of Bernoulli trials necessary to obtain k successes.

6.3.5.2 Relationship Between the Negative Binomial and the Geometric Distributions
The negative binomial distribution with parameter k = 1 is equivalent to the geometric distribution:
X ~ nb(1, p) ≡ X ~ Geo(p)
Alternatively, a negative binomial random variable can be seen as the sum of k independent geometric random variables.

FIG. 6.8 Probability function of variable X with parameters k = 2 and p = 0.4.


It is important to mention that we are going to use all this knowledge of the negative binomial distribution when studying the regression models for count data (Chapter 15).
Example 6.9
Assume that a student gets three questions right out of every five. Let X be the number of attempts until the twelfth correct answer. Determine the probability of the student having to answer 20 questions in order to get 12 right.
Solution
We have:
k = 12, p = 3/5 = 0.6, q = 2/5 = 0.4
X = number of attempts until the twelfth correct answer, that is, X ~ nb(12, 0.6). Therefore:
f(20) = P(X = 20) = C(20 − 1, 12 − 1)·0.6^12·0.4^(20−12) = 0.1078 = 10.78%
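Example 6.9 can be reproduced with scipy.stats as a quick sketch; note that scipy's nbinom counts failures before the k-th success, not total attempts, so the argument is shifted:

```python
from scipy.stats import nbinom

k, p = 12, 0.6     # required successes and probability of success
x = 20             # total number of attempts of interest

# scipy's nbinom counts the number of FAILURES before the k-th success,
# so P(X = x) in the book's parameterization corresponds to pmf(x - k).
print(round(nbinom(k, p).pmf(x - k), 4))   # approximately 0.1078
```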

6.3.6

Hypergeometric Distribution

The hypergeometric distribution is also related to Bernoulli trials. However, differently from binomial sampling, in which the probability of success is constant, in the hypergeometric distribution the sampling is done without replacement: as elements are removed from the population to form the sample, the population size diminishes, making the probability of success vary. The hypergeometric distribution describes the number of successes in a sample with n elements drawn from a finite population without replacement. For example, let us consider a population with N elements, from which M have a certain attribute. The hypergeometric distribution describes the probability of exactly k elements having such attribute (k successes and n − k failures) in a sample with n distinct elements randomly drawn from the population without replacement. Let X be a random variable that represents the number of successes obtained among the n elements drawn in the sample. Variable X follows a hypergeometric distribution with parameters N, M, n, denoted by X ~ Hip(N, M, n), if its probability function is given by:
f(k) = P(X = k) = [C(M, k)·C(N − M, n − k)]/C(N, n), 0 ≤ k ≤ min(M, n) (6.42)
The graphical representation of a hypergeometric distribution with parameters N = 200, M = 50, and n = 30 can be found in Fig. 6.9.
FIG. 6.9 Probability function of variable X with parameters N = 200, M = 50, and n = 30.
The mean of X can be calculated as:
E(X) = n·M/N (6.43)


with variance:
Var(X) = (n·M/N)·[(N − M)/N]·[(N − n)/(N − 1)] (6.44)

6.3.6.1 Approximation of the Hypergeometric Distribution by the Binomial
Let X be a random variable that follows a hypergeometric distribution with parameters N, M, and n, denoted by X ~ Hip(N, M, n). If the population is large when compared to the sample size, the hypergeometric distribution can be approximated by a binomial distribution with parameters n and p = M/N (probability of success in a single trial):
X ~ Hip(N, M, n) ≈ X ~ b(n, p), with p = M/N

Example 6.10
A gravity-pick machine contains 15 balls and 5 of them are red. Seven balls are chosen randomly, without replacement. Determine:
a) The probability of exactly two red balls being drawn.
b) The probability of at least two red balls being drawn.
c) The expected number of red balls drawn.
d) The variance of the number of red balls drawn.
Solution
Let X be the random variable that represents the number of red balls drawn. We have N = 15, M = 5, and n = 7.
a) P(X = 2) = [C(5, 2)·C(10, 5)]/C(15, 7) = 39.16%
b) P(X ≥ 2) = 1 − P(X < 2) = 1 − [P(X = 0) + P(X = 1)] = 1 − [C(5, 0)·C(10, 7) + C(5, 1)·C(10, 6)]/C(15, 7) = 81.82%
c) E(X) = n·M/N = 7 × 5/15 = 2.33
d) Var(X) = (n·M/N)·[(N − M)/N]·[(N − n)/(N − 1)] = 2.33 × (10/15) × (8/14) = 0.8889
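The same quantities can be obtained with scipy.stats.hypergeom (an illustrative sketch; note that scipy's argument order is population size, number of success states, sample size):

```python
from scipy.stats import hypergeom

N, M, n = 15, 5, 7                  # population, red balls, draws
dist = hypergeom(N, M, n)           # scipy order: total, successes, sample size

print(round(dist.pmf(2), 4))        # a) P(X = 2)  ~ 0.3916
print(round(1 - dist.cdf(1), 4))    # b) P(X >= 2) ~ 0.8182
print(round(dist.mean(), 2))        # c) E(X) = 2.33
print(round(dist.var(), 4))         # d) Var(X) ~ 0.8889
```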

6.3.7

Poisson Distribution

The Poisson distribution is used to register the occurrence of rare events, with a very low probability of success (p → 0), in a certain interval of time or space. Differently from the binomial model, which provides the probability of the number of successes in a discrete interval (n repetitions of an experiment), the Poisson model provides the probability of the number of successes in a certain continuous interval (time, area, among other possibilities). As examples of variables that follow a Poisson distribution, we can mention the number of customers that arrive in a line per unit of time, the number of defects per unit of time, the number of accidents per unit of area, among others. Note that the measurement units (time and area, in these situations) are continuous, but the random variable (number of occurrences) is discrete. The Poisson distribution presents the following hypotheses:
(i) Events defined in nonoverlapping intervals are independent;
(ii) In intervals of the same length, the probabilities that the same number of successes will occur are equal;
(iii) In very small intervals, the probability that more than one success will occur is insignificant;
(iv) In very small intervals, the probability of one success is proportional to the length of the interval.

Let us consider a discrete random variable X that represents the number of successes (k) in a certain unit of time, unit of area, or another exposure measure. Random variable X, with parameter λ > 0, follows a Poisson distribution, denoted by X ~ Poisson(λ), if its probability function is given by:
f(k) = P(X = k) = (e^(−λ)·λ^k)/k!, k = 0, 1, 2, … (6.45)


FIG. 6.10 Poisson probability function.

where:
e: base of the Napierian (or natural) logarithm, e ≅ 2.718282;
λ: estimated average rate of occurrence of the event of interest for a certain exposure (time interval, area, among other examples).
Fig. 6.10 shows the Poisson probability function for λ = 1, 3, and 6. In the Poisson distribution, the mean is equal to the variance:
E(X) = Var(X) = λ (6.46)

It is important to mention that we are going to use all knowledge on the Poisson distribution when studying the regression models for count data (Chapter 15).

6.3.7.1 Approximation of the Binomial by the Poisson Distribution
Let X be a random variable that follows a binomial distribution with parameters n and p, denoted by X ~ b(n, p). When the number of repetitions of a random experiment is very high (n → ∞) and the probability of success is very low (p → 0), such that n·p = λ remains constant, the binomial distribution gets closer to the Poisson distribution:
X ~ b(n, p) ≈ X ~ Poisson(λ), with λ = n·p

Example 6.11
Assume that the number of customers that arrive at a bank follows a Poisson distribution. We verified that, on average, 12 customers arrive at the bank per minute. Calculate: (a) the probability of 10 customers arriving in the next minute; (b) the probability of 40 customers arriving in the next 5 minutes; (c) X's mean and variance.
Solution
We have λ = 12 customers per minute.
a) P(X = 10) = (e^(−12)·12^10)/10! = 0.1048
b) For a 5-minute interval the rate is λ = 12 × 5 = 60, so P(X = 40) = (e^(−60)·60^40)/40! ≈ 0.0014
c) E(X) = Var(X) = λ = 12
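Both parts of Example 6.11 can be checked with scipy.stats.poisson (an illustrative sketch; for item (b) the rate is rescaled to the 5-minute exposure):

```python
from scipy.stats import poisson

lam = 12                                        # average arrivals per minute
print(round(poisson(lam).pmf(10), 4))           # a) P(X = 10) ~ 0.1048
print(round(poisson(5 * lam).pmf(40), 4))       # b) 5-minute interval, rate 60: ~0.0014
print(poisson(lam).mean(), poisson(lam).var())  # c) both equal 12
```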

Example 6.12 A certain part is produced in a production line. The probability of the part being defective is 0.01. If 300 parts are produced, what is the probability of none of them being defective?


Solution
This example is characterized by a binomial distribution. Since the number of repetitions is high and the probability of success is low, the binomial distribution can be approximated by a Poisson distribution with parameter λ = n·p = 300 × 0.01 = 3, such that:
P(X = 0) = (e^(−3)·3^0)/0! = 0.0498 ≈ 0.05
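The quality of the approximation in Example 6.12 can be illustrated by computing both the exact binomial probability and its Poisson approximation (a sketch using scipy, not a tool from the book):

```python
from scipy.stats import binom, poisson

n, p = 300, 0.01
print(round(binom(n, p).pmf(0), 4))     # exact binomial: ~0.0490
print(round(poisson(n * p).pmf(0), 4))  # Poisson approximation, lambda = 3: ~0.0498
```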

6.4 PROBABILITY DISTRIBUTIONS FOR CONTINUOUS RANDOM VARIABLES

For continuous random variables, we are going to study the uniform, normal, exponential, gamma, chi-square (χ²), Student's t, and Snedecor's F distributions.

6.4.1

Uniform Distribution

The uniform model is the simplest model for continuous random variables. It is used to model the occurrence of events whose probability is constant in intervals with the same range. A random variable X follows a uniform distribution in the interval [a, b], denoted by X ~ U[a, b], if its probability density function is given by:
f(x) = 1/(b − a), if a ≤ x ≤ b; 0, otherwise (6.47)
which can be graphically represented as seen in Fig. 6.11. The expected value of X is calculated by the expression:
E(X) = ∫ from a to b of x·[1/(b − a)] dx = (a + b)/2 (6.48)

Table 6.1 presents a summary of the discrete distributions studied in this section, including each random variable's probability function, the distribution parameters, and the calculation of X's expected value and variance.

TABLE 6.1 Models for Discrete Variables

Distribution | Probability Function – P(X) | Parameters | E(X) | Var(X)
Discrete uniform | P(X = xi) = 1/n | n | (1/n)·Σ xi | (1/n)·Σ xi² − [(1/n)·Σ xi]²
Bernoulli | p^x·(1 − p)^(1−x), x = 0, 1 | p | p | p·(1 − p)
Binomial | C(n, k)·p^k·(1 − p)^(n−k), k = 0, 1, …, n | n, p | n·p | n·p·(1 − p)
Geometric | P(X = x) = p·(1 − p)^(x−1), x = 1, 2, 3, …; P(Y = y) = p·(1 − p)^y, y = 0, 1, 2, … | p | E(X) = 1/p; E(Y) = (1 − p)/p | Var(X) = Var(Y) = (1 − p)/p²
Negative binomial | C(x − 1, k − 1)·p^k·(1 − p)^(x−k), x = k, k + 1, … | k, p | k/p | k·(1 − p)/p²
Hypergeometric | C(M, k)·C(N − M, n − k)/C(N, n), 0 ≤ k ≤ min(M, n) | N, M, n | n·M/N | (n·M/N)·[(N − M)/N]·[(N − n)/(N − 1)]
Poisson | e^(−λ)·λ^k/k!, k = 0, 1, 2, … | λ | λ | λ


FIG. 6.11 Uniform distribution in the interval [a, b].

And the variance of X is:
Var(X) = E(X²) − [E(X)]² = (b − a)²/12 (6.49)
On the other hand, the cumulative distribution function of the uniform distribution is given by:
F(x) = 0, if x < a; (x − a)/(b − a), if a ≤ x < b; 1, if x ≥ b (6.50)

Example 6.13
Random variable X represents the time a bank's ATMs are used (in minutes), and it follows a uniform distribution in the interval [1, 5]. Determine:
a) P(X < 2)
b) P(X > 3)
c) P(3 < X < 4)
d) E(X)
e) Var(X)
Solution
a) P(X < 2) = F(2) = (2 − 1)/(5 − 1) = 1/4
b) P(X > 3) = 1 − P(X < 3) = 1 − F(3) = 1 − (3 − 1)/(5 − 1) = 1/2
c) P(3 < X < 4) = F(4) − F(3) = (4 − 1)/(5 − 1) − (3 − 1)/(5 − 1) = 1/4
d) E(X) = (1 + 5)/2 = 3
e) Var(X) = (5 − 1)²/12 = 4/3
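A quick check of Example 6.13 with scipy.stats (an illustrative sketch; scipy parameterizes the uniform by loc = a and scale = b − a):

```python
from scipy.stats import uniform

a, b = 1, 5
dist = uniform(loc=a, scale=b - a)        # uniform on [1, 5]

print(dist.cdf(2))                        # a) P(X < 2) = 0.25
print(1 - dist.cdf(3))                    # b) P(X > 3) = 0.5
print(dist.cdf(4) - dist.cdf(3))          # c) P(3 < X < 4) = 0.25
print(dist.mean(), dist.var())            # d) 3.0  e) ~1.3333
```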

6.4.2

Normal Distribution

The normal distribution, also known as the Gaussian distribution, is the most widely used and important probability distribution, because it allows us to model a myriad of natural phenomena, studies of human behavior, industrial processes, among others, and because it allows us to use approximations to calculate the probabilities of many random variables. A random variable X, with mean μ ∈ ℜ and standard deviation σ > 0, follows a normal or Gaussian distribution, denoted by X ~ N(μ, σ²), if its probability density function is given by:
f(x) = [1/(σ·√(2π))]·e^(−(x − μ)²/(2σ²)), −∞ < x < +∞ (6.51)
whose graphical representation is shown in Fig. 6.12.


FIG. 6.12 Normal distribution.

FIG. 6.13 Area under the normal curve.

Fig. 6.13 shows the area under the normal curve based on the number of standard deviations. From Fig. 6.13, we can see that the curve has the shape of a bell and is symmetrical around parameter μ, and the smaller parameter σ is, the more concentrated the curve is around μ. In a normal distribution, the mean of X is:
E(X) = μ (6.52)
And the variance of X is:
Var(X) = σ² (6.53)

In order to obtain the standard normal distribution (or reduced normal distribution), the original variable X is transformed into a new random variable Z, with mean 0 (μ = 0) and variance 1 (σ² = 1):
Z = (X − μ)/σ ~ N(0, 1) (6.54)

Score Z represents the number of standard deviations that separate a random variable X from the mean. This kind of transformation, known as Z-scores, is broadly used to standardize variables, because it does not change the shape of the original variable's normal distribution, and it generates a new variable with mean 0 and variance 1. Therefore,


FIG. 6.14 Standard normal distribution.

when many variables with different orders of magnitude are being used in a certain type of modeling, the Z-scores standardization process will make all the new standardized variables have the same distribution, with equal orders of magnitude (Fávero et al., 2009). The probability density function of random variable Z is reduced to:
f(z) = [1/√(2π)]·e^(−z²/2), −∞ < z < +∞ (6.55)

whose graphical representation is shown in Fig. 6.14. The cumulative distribution function F(xc) of a normal random variable X is obtained by integrating Expression (6.51) from −∞ to xc, that is:
F(xc) = P(X ≤ xc) = ∫ from −∞ to xc of f(x) dx (6.56)

Integral (6.56) corresponds to the area under f(x) from −∞ to xc, as shown in Fig. 6.15. In the specific case of the standard normal distribution, the cumulative distribution function is:
F(zc) = P(Z ≤ zc) = ∫ from −∞ to zc of f(z) dz = [1/√(2π)]·∫ from −∞ to zc of e^(−z²/2) dz (6.57)

For a random variable Z with a standard normal distribution, let us suppose that the main goal now is to calculate P(Z > zc). So, we have:
P(Z > zc) = ∫ from zc to +∞ of f(z) dz = [1/√(2π)]·∫ from zc to +∞ of e^(−z²/2) dz (6.58)

Fig. 6.16 represents this probability. Table E in the Appendix shows the value of P(Z > zc), that is, the cumulative probability from zc to +∞ (the gray area under the normal curve).
FIG. 6.15 Cumulative normal distribution.



FIG. 6.16 Graphical representation of P(Z > zc) for a standardized normal random variable.

6.4.2.1 Approximation of the Binomial by the Normal Distribution
Let X be a random variable that has a binomial distribution with parameters n and p, denoted by X ~ b(n, p). As the average number of successes and the average number of failures tend to infinity (n·p → ∞ and n·(1 − p) → ∞), the binomial distribution gets closer to a normal one with mean μ = n·p and variance σ² = n·p·(1 − p):
X ~ b(n, p) ≈ X ~ N(μ, σ²), with μ = n·p and σ² = n·p·(1 − p)
Some authors admit that the approximation of the binomial by the normal distribution is good when n·p > 5 and n·(1 − p) > 5, or when n·p·(1 − p) ≥ 3. A better and more conservative rule requires n·p > 10 and n·(1 − p) > 10. However, since a discrete distribution is being approximated by a continuous one, we recommend greater accuracy, carrying out a continuity correction that consists in, for example, transforming P(X = x) into the interval P(x − 0.5 < X < x + 0.5).

6.4.2.2 Approximation of the Poisson by the Normal Distribution
Analogous to the binomial distribution, the Poisson distribution can also be approximated by a normal one. Let X be a random variable that follows a Poisson distribution with parameter λ, denoted by X ~ Poisson(λ). As λ → ∞, the Poisson distribution gets closer to a normal one with mean μ = λ and variance σ² = λ:
X ~ Poisson(λ) ≈ X ~ N(μ, σ²), with μ = λ and σ² = λ
In general, we admit that the approximation of the Poisson distribution by the normal distribution is good when λ > 10. Once again, we recommend using the continuity correction with x − 0.5 and x + 0.5.
Example 6.14
We know that the average thickness of the hose storage units produced in a factory (X) follows a normal distribution with a mean of 3 mm and a standard deviation of 0.4 mm. Determine:
a) P(X > 4.1)
b) P(X > 3)
c) P(X ≥ 3)
d) P(X ≤ 3.5)
e) P(X < 2.3)
f) P(2 ≤ X ≤ 3.8)
Solution
The probabilities will be calculated based on Table E in the Appendix, which provides the value of P(Z > zc):
a) P(X > 4.1) = P(Z > (4.1 − 3)/0.4) = P(Z > 2.75) = 0.0030
b) P(X > 3) = P(Z > (3 − 3)/0.4) = P(Z > 0) = 0.5
c) P(X ≥ 3) = P(Z ≥ 0) = 0.5
d) P(X ≤ 3.5) = P(Z ≤ (3.5 − 3)/0.4) = P(Z ≤ 1.25) = 1 − P(Z > 1.25) = 1 − 0.1056 = 0.8944
e) P(X < 2.3) = P(Z < (2.3 − 3)/0.4) = P(Z < −1.75) = P(Z > 1.75) = 0.04


f) P(2 ≤ X ≤ 3.8) = P((2 − 3)/0.4 ≤ Z ≤ (3.8 − 3)/0.4) = P(−2.5 ≤ Z ≤ 2) = P(Z ≤ 2) − P(Z < −2.5) = [1 − P(Z > 2)] − P(Z > 2.5) = [1 − 0.0228] − 0.0062 = 0.971
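Example 6.14 can also be solved without the normal table, using scipy.stats.norm (an illustrative sketch; cdf gives the lower tail and sf the upper tail):

```python
from scipy.stats import norm

dist = norm(loc=3, scale=0.4)                 # mean 3 mm, standard deviation 0.4 mm

print(round(dist.sf(4.1), 4))                 # a) P(X > 4.1)  ~ 0.0030
print(dist.sf(3))                             # b) P(X > 3)    = 0.5
print(round(dist.cdf(3.5), 4))                # d) P(X <= 3.5) ~ 0.8944
print(round(dist.cdf(2.3), 4))                # e) P(X < 2.3)  ~ 0.0401
print(round(dist.cdf(3.8) - dist.cdf(2), 4))  # f) P(2 <= X <= 3.8) ~ 0.9710
```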

6.4.3

Exponential Distribution

Another important distribution, which has applications in system reliability and in queueing theory, is the exponential distribution. Its main characteristic is the memoryless property, that is, the future lifetime (t) of a certain object has the same distribution, regardless of its past lifetime (s), for any s, t > 0, as shown in Expression (6.38), repeated below:
P(X > s + t | X > s) = P(X > t)
A continuous random variable X has an exponential distribution with parameter λ > 0, denoted by X ~ exp(λ), if its probability density function is given by:
f(x) = λ·e^(−λx), if x ≥ 0; 0, if x < 0 (6.59)
Fig. 6.17 represents the probability density function of the exponential distribution for parameters λ = 0.5, λ = 1, and λ = 2. We can see that the exponential distribution is positively skewed (to the right), with a higher frequency for smaller values of x and a longer tail to the right. The density function assumes value λ when x = 0, and tends to zero as x → ∞. The higher the value of λ, the more quickly the function tends to zero.
FIG. 6.17 Exponential distribution for λ = 0.5, λ = 1, and λ = 2.
In the exponential distribution, the mean of X is:
E(X) = 1/λ (6.60)
and the variance of X is:
Var(X) = 1/λ² (6.61)
And the cumulative distribution function F(x) is given by:
F(x) = P(X ≤ x) = ∫ from 0 to x of f(u) du = 1 − e^(−λx), if x ≥ 0; 0, if x < 0 (6.62)


From (6.62) we can conclude that:
P(X > x) = e^(−λx) (6.63)

In system reliability, random variable X represents the lifetime, that is, the time during which a component or system remains operational, outside the interval for repairs and above a specified limit (yield, pressure, among other examples). Parameter λ represents the failure rate, that is, the number of components or systems that failed in a preestablished time interval:
λ = number of failures / operation time (6.64)
The main measures of reliability are (a) mean time to failure (MTTF) and (b) mean time between failures (MTBF). Mathematically, MTTF and MTBF are equal to the mean of the exponential distribution and represent the mean lifetime. Thus, the failure rate can also be calculated as:
λ = 1/MTTF (or 1/MTBF) (6.65)

In the queueing theory, random variable X represents the mean waiting time until the next arrival (mean time between two customers’ arrivals). On the other hand, parameter l represents the mean arrivals rate, that is, the expected number of arrivals per unit of time.

6.4.3.1 Relationship Between the Poisson and the Exponential Distribution
If the number of occurrences in a counting process follows a Poisson distribution with parameter λ, then the random variables "time until the first occurrence" and "time between any two successive occurrences" of that process follow an exp(λ) distribution.
Example 6.15
The life span of an electronic component follows an exponential distribution with a mean lifetime of 120 hours. Determine:
a) The probability of a component failing in the first 100 hours of use;
b) The probability of a component lasting more than 150 hours.
Solution
Assume that λ = 1/120 and X ~ exp(1/120). Therefore:
a) P(X ≤ 100) = ∫ from 0 to 100 of (1/120)·e^(−x/120) dx = 1 − e^(−100/120) = 0.5654
b) P(X > 150) = ∫ from 150 to ∞ of (1/120)·e^(−x/120) dx = e^(−150/120) = 0.2865
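A quick check of Example 6.15 with scipy.stats.expon (an illustrative sketch; scipy uses the scale, i.e., the mean 1/λ, rather than the rate):

```python
from scipy.stats import expon

mean_life = 120                        # hours, so lambda = 1/120
dist = expon(scale=mean_life)          # scipy takes the scale (mean), not the rate

print(round(dist.cdf(100), 4))         # a) P(X <= 100) ~ 0.5654
print(round(dist.sf(150), 4))          # b) P(X > 150)  ~ 0.2865
```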

6.4.4 Gamma Distribution

The gamma distribution is one of the most general distributions; others, such as the Erlang, exponential, and chi-square (χ²) distributions, are particular cases of it. Like the exponential distribution, it is widely used in system reliability. The gamma distribution also has applications in physical phenomena, meteorological processes, insurance risk theory, and economic theory. A continuous random variable X has a gamma distribution with parameters α > 0 and λ > 0, denoted by X ~ Gamma(α, λ), if its probability density function is given by:
f(x) = [λ^α/Γ(α)]·x^(α−1)·e^(−λx), if x ≥ 0; 0, if x < 0 (6.66)


FIG. 6.18 Density function of X for some values of α and λ. (Source: Navidi, W., 2012. Probabilidade e estatística para ciências exatas. Bookman, Porto Alegre.)

where Γ(α) is the gamma function, given by:
Γ(α) = ∫ from 0 to ∞ of e^(−x)·x^(α−1) dx, α > 0 (6.67)

The gamma probability density function for some values of α and λ is represented in Fig. 6.18. We can see that the gamma distribution is positively skewed (to the right), with a higher frequency for smaller values of x and a longer tail to the right. However, as α tends to infinity, the distribution becomes symmetrical. We can also observe that when α = 1, the gamma distribution is equal to the exponential. Moreover, the greater the value of λ, the more quickly the density function tends to zero. For the parameterization in (6.66), the expected value of X is:
E(X) = α/λ (6.68)
On the other hand, the variance of X is given by:
Var(X) = α/λ² (6.69)

The cumulative distribution function is:
F(x) = P(X ≤ x) = ∫ from 0 to x of f(u) du = [λ^α/Γ(α)]·∫ from 0 to x of u^(α−1)·e^(−λu) du (6.70)

6.4.4.1 Special Cases of the Gamma Distribution
A gamma distribution whose parameter α is a positive integer is called an Erlang distribution:
If α is a positive integer ⇒ X ~ Gamma(α, λ) ≡ X ~ Erlang(α, λ)
As mentioned before, a gamma distribution with parameter α = 1 is an exponential distribution:
If α = 1 ⇒ X ~ Gamma(1, λ) ≡ X ~ exp(λ)
Moreover, a gamma distribution with parameters α = n/2 and λ = 1/2 is a chi-square distribution with n degrees of freedom:
If α = n/2 and λ = 1/2 ⇒ X ~ Gamma(n/2, 1/2) ≡ X ~ χ²(n)

6.4.4.2 Relationship Between the Poisson and the Gamma Distribution In the Poisson distribution, we try to determine the number of occurrences of a certain event within a fixed period. On the other hand, the gamma distribution determines the time necessary to obtain a specified number of occurrences of the event.
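The special cases above can be verified numerically with scipy.stats (an illustrative sketch; note that scipy's gamma takes a shape a and a scale equal to 1/λ, and the rate λ = 0.5 and evaluation point below are arbitrary choices):

```python
from scipy.stats import gamma, expon, chi2

lam = 0.5      # illustrative rate parameter
x = 2.0        # illustrative evaluation point

# alpha = 1 reduces the gamma to an exponential with the same rate
print(gamma(a=1, scale=1/lam).pdf(x), expon(scale=1/lam).pdf(x))

# alpha = n/2 and lambda = 1/2 give a chi-square with n degrees of freedom
n = 4
print(gamma(a=n/2, scale=2).pdf(x), chi2(n).pdf(x))
```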

6.4.5 Chi-Square Distribution

A continuous random variable X has a chi-square distribution with n degrees of freedom, denoted by X ~ χ²(n), if its probability density function is given by:
f(x) = [1/(2^(n/2)·Γ(n/2))]·x^(n/2−1)·e^(−x/2), x > 0; 0, otherwise (6.71)
FIG. 6.19 χ² distribution for different values of n.
For the upper-tail probability P(X > xc), we have:
P(X > xc) = ∫ from xc to ∞ of f(x) dx (6.76)
which can be represented by Fig. 6.20.


FIG. 6.20 Graphical representation of P(X > xc) for a random variable with a χ² distribution.

The χ² distribution has several applications in statistical inference. Due to its importance, the χ² distribution can be found in Table D in the Appendix, for different values of parameter n. This table provides the critical values xc such that P(X > xc) = α. In other words, we can obtain the probabilities and the cumulative distribution function for different values of x of random variable X.
Example 6.16
Assume that random variable X follows a chi-square distribution (χ²) with 13 degrees of freedom. Determine:
a) P(X > 5)
b) The x value such that P(X ≤ x) = 0.95
c) The x value such that P(X > x) = 0.95
Solution
Through the χ² distribution table (Table D in the Appendix), for n = 13, we have:
a) P(X > 5) = 97.5%
b) 22.362
c) 5.892
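The table lookups in Example 6.16 can be reproduced with scipy.stats.chi2 (an illustrative sketch; sf gives the upper tail, ppf and isf the lower- and upper-tail quantiles):

```python
from scipy.stats import chi2

dist = chi2(13)                      # 13 degrees of freedom

print(round(dist.sf(5), 4))          # a) P(X > 5) ~ 0.975
print(round(dist.ppf(0.95), 3))      # b) x with P(X <= x) = 0.95 -> 22.362
print(round(dist.isf(0.95), 3))      # c) x with P(X > x) = 0.95  -> 5.892
```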

6.4.6

Student’s t Distribution

Student's t distribution was developed by William Sealy Gosset, and it is one of the main probability distributions, with several applications in statistical inference. We are going to assume a random variable Z that has a normal distribution with mean 0 and standard deviation 1, and a random variable X with a chi-square distribution with n degrees of freedom, such that Z and X are independent. Continuous random variable T is then defined as:
T = Z/√(X/n) (6.77)

We can say that variable T follows Student's t distribution with n degrees of freedom, denoted by T ~ t(n), if its probability density function is given by:
f(t) = [Γ((n + 1)/2)/(Γ(n/2)·√(π·n))]·(1 + t²/n)^(−(n+1)/2), −∞ < t < +∞ (6.78)
For the upper-tail probability P(T > tc), we have:
P(T > tc) = ∫ from tc to ∞ of f(t) dt (6.82)

as shown in Fig. 6.22. Just as the normal and chi-square (χ²) distributions, Student's t distribution has several applications in statistical inference, and there is a table to obtain the probabilities based on different values of parameter n (Table B in the Appendix). This table provides the critical values tc such that P(T > tc) = α. In other words, we can obtain the probabilities and the cumulative distribution function for different values of t of random variable T. We are going to use Student's t distribution when studying simple and multiple regression models (Chapter 13).

FIG. 6.22 Graphical representation of Student's t distribution, with area α/2 in each tail beyond −tc and tc.


Example 6.17
Assume that random variable T follows Student's t distribution with 7 degrees of freedom. Determine:
a) P(T > 3.5)
b) P(T < 3)
c) P(T < −0.711)
d) The t value such that P(T ≤ t) = 0.95
e) The t value such that P(T > t) = 0.10
Solution
a) 0.5%
b) 99%
c) 25%
d) 1.895
e) 1.415
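Example 6.17 can be checked without the t table, using scipy.stats.t (an illustrative sketch):

```python
from scipy.stats import t

dist = t(7)                          # 7 degrees of freedom

print(round(dist.sf(3.5), 4))        # a) P(T > 3.5)    ~ 0.005
print(round(dist.cdf(3), 4))         # b) P(T < 3)      ~ 0.99
print(round(dist.cdf(-0.711), 4))    # c) P(T < -0.711) ~ 0.25
print(round(dist.ppf(0.95), 3))      # d) t with P(T <= t) = 0.95 -> 1.895
print(round(dist.isf(0.10), 3))      # e) t with P(T > t) = 0.10  -> 1.415
```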

6.4.7

Snedecor’s F Distribution

Snedecor's F distribution, also known as Fisher's distribution, is frequently used in tests associated with the analysis of variance (ANOVA) to compare the means of more than two populations. Let us consider continuous random variables Y1 and Y2, such that:
– Y1 and Y2 are independent;
– Y1 follows a chi-square distribution with n1 degrees of freedom, denoted by Y1 ~ χ²(n1);
– Y2 follows a chi-square distribution with n2 degrees of freedom, denoted by Y2 ~ χ²(n2).

We are going to define a new continuous random variable X such that:
X = (Y1/n1)/(Y2/n2) (6.83)
So, we say that X has a Snedecor's F distribution with n1 and n2 degrees of freedom, denoted by X ~ F(n1, n2), if its probability density function is given by:
f(x) = [Γ((n1 + n2)/2)/(Γ(n1/2)·Γ(n2/2))]·(n1/n2)^(n1/2)·x^(n1/2−1)·[(n1/n2)·x + 1]^(−(n1+n2)/2), x > 0 (6.84)
where
Γ(a) = ∫ from 0 to ∞ of e^(−x)·x^(a−1) dx
Fig. 6.23 shows the behavior of Snedecor's F probability density function for different values of n1 and n2. We can see that Snedecor's F distribution is positively skewed (to the right), with a higher frequency for smaller values of x and a longer tail to the right. However, as n1 and n2 tend to infinity, the distribution becomes symmetrical. The expected value of X is calculated as:
E(X) = n2/(n2 − 2), for n2 > 2 (6.85)
On the other hand, the variance of X is given by:
Var(X) = [2·n2²·(n1 + n2 − 2)]/[n1·(n2 − 4)·(n2 − 2)²], for n2 > 4 (6.86)


FIG. 6.23 Probability density function for F(4, 12) and F(30, 30).

FIG. 6.24 Critical values of Snedecor’s F distribution.

Just as the normal, χ², and Student's t distributions, Snedecor's F distribution has several applications in statistical inference, and there is a table from which we can obtain the probabilities and the cumulative distribution function based on different values of parameters n1 and n2 (Table A in the Appendix). This table provides the critical values Fc such that P(X > Fc) = α (Fig. 6.24). We are going to use Snedecor's F distribution when studying simple and multiple regression models (Chapter 13).

6.4.7.1 Relationship Between Student's t and Snedecor's F Distribution
Let us consider a random variable T with Student's t distribution with n degrees of freedom. The square of variable T follows Snedecor's F distribution with n1 = 1 and n2 = n degrees of freedom, as shown by Fávero et al. (2009). Thus:
If T ~ t(n), then T² ~ F(1, n)
Example 6.18
Assume that random variable X follows Snedecor's F distribution with n1 = 6 degrees of freedom in the numerator and n2 = 12 degrees of freedom in the denominator, that is, X ~ F(6, 12). Determine:
a) P(X > 3)
b) The critical value of F(6, 12) for α = 10%
c) The x value such that P(X ≤ x) = 0.975
Solution
Through Snedecor's F distribution table (Table A in the Appendix), for n1 = 6 and n2 = 12, we have:
a) P(X > 3) = 5%
b) 2.33
c) 3.73
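The values in Example 6.18 can be reproduced with scipy.stats.f (an illustrative sketch):

```python
from scipy.stats import f

dist = f(6, 12)                      # 6 and 12 degrees of freedom

print(round(dist.sf(3), 4))          # a) P(X > 3) ~ 0.05
print(round(dist.isf(0.10), 2))      # b) critical value for alpha = 10% -> 2.33
print(round(dist.ppf(0.975), 2))     # c) x with P(X <= x) = 0.975 -> 3.73
```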


Table 6.2 shows a summary of the continuous distributions studied in this section, including each random variable's probability density function, the distribution parameters, and the calculation of X's expected value and variance.

TABLE 6.2 Models for Continuous Variables

Distribution | Probability Density Function | Parameters | E(X) | Var(X)
Uniform | 1/(b − a), a ≤ x ≤ b | a, b | (a + b)/2 | (b − a)²/12
Normal | [1/(σ·√(2π))]·e^(−(x − μ)²/(2σ²)), −∞ < x < +∞ | μ, σ | μ | σ²
Exponential | λ·e^(−λx), x ≥ 0 | λ | 1/λ | 1/λ²
Gamma | [λ^α/Γ(α)]·x^(α−1)·e^(−λx), x ≥ 0 | α, λ | α/λ | α/λ²
Chi-square (χ²) | [1/(2^(n/2)·Γ(n/2))]·x^(n/2−1)·e^(−x/2), x > 0 | n | n | 2n
Student's t | [Γ((n + 1)/2)/(Γ(n/2)·√(π·n))]·(1 + t²/n)^(−(n+1)/2), −∞ < t < +∞ | n | 0 | n/(n − 2)
Snedecor's F | [Γ((n1 + n2)/2)/(Γ(n1/2)·Γ(n2/2))]·(n1/n2)^(n1/2)·x^(n1/2−1)·[(n1/n2)·x + 1]^(−(n1+n2)/2), x > 0 | n1, n2 | n2/(n2 − 2) | 2·n2²·(n1 + n2 − 2)/[n1·(n2 − 4)·(n2 − 2)²]

6.5 FINAL REMARKS

This chapter discussed the main probability distributions used in statistical inference, including the distributions for discrete random variables (discrete uniform, Bernoulli, binomial, geometric, negative binomial, hypergeometric, and Poisson) and for continuous random variables (uniform, normal, exponential, gamma, chi-square (χ²), Student's t, and Snedecor's F). When characterizing probability distributions, it is extremely important to use measures that indicate the most relevant aspects of the distribution, such as measures of position (mean, median, and mode), measures of dispersion (variance and standard deviation), and measures of skewness and kurtosis. Understanding the concepts related to probability and to probability distributions helps the researcher in the study of topics related to statistical inference, including parametric and nonparametric hypothesis tests, multivariate analysis through exploratory techniques, and estimation of regression models.

6.6

EXERCISES

1) In a shoe production line, the probability of a defective item being produced is 2%. For a batch with 150 items, determine the probability of a maximum of two items being defective. Also determine the mean and the variance.
2) The probability of a student solving a certain problem is 12%. If 10 students are selected randomly, what is the probability of exactly one of them being successful?
3) A telemarketing salesman sells one product for every 8 customers he contacts. The salesman prepares a list of customers. Determine the probability of the first product being sold in the fifth call, in addition to the expected value and the respective variance.
4) The probability of a player scoring a penalty is 95%. Determine the probability of the player having to take 33 penalty kicks to score 30 goals, besides the mean number of penalty kicks.
5) Assume that, in a certain hospital, 3 patients undergo stomach surgery daily, following a Poisson distribution. Calculate the probability of 28 patients undergoing surgery next week (7 business days).


6) Assume that a certain random variable X follows a normal distribution with μ = 8 and σ² = 36. Determine the following probabilities:
a) P(X ≤ 12)
b) P(X < 5)
c) P(X > 2)
d) P(6 < X ≤ 11)
7) Consider random variable Z with a standardized normal distribution. Determine the critical value zc such that P(Z > zc) = 80%.
8) When tossing 40 balanced coins, determine the following probabilities:
a) Of getting exactly 22 heads.
b) Of getting more than 25 heads.
Solve this exercise by approximating the distribution through a normal distribution.
9) The time until a certain electronic device fails follows an exponential distribution with a failure rate per hour of 0.028. Determine the probability of a device chosen randomly remaining operational for:
a) 120 hours;
b) 60 hours.
10) A certain type of device follows an exponential distribution with a mean lifetime of 180 hours. Determine:
a) The probability of the device lasting more than 220 hours;
b) The probability of the device lasting a maximum of 150 hours.
11) The arrival of patients at a lab follows an exponential distribution with an average rate of 1.8 clients per minute. Determine:
a) The probability of the next client's arrival taking more than 30 seconds;
b) The probability of the next client's arrival taking a maximum of 1.5 minutes.
12) The time between clients' arrivals at a restaurant follows an exponential distribution with a mean of 3 minutes. Determine:
a) The probability of more than 3 clients arriving in 6 minutes;
b) The probability of the time until the fourth client arrives being less than 10 minutes.
13) A random variable X has a chi-square distribution with n = 12 degrees of freedom. What is the critical value xc such that P(X > xc) = 90%?
14) Now, assume that X follows a chi-square distribution with n = 16 degrees of freedom. Determine:
a) P(X > 25)
b) P(X ≤ 32)
c) P(25 < X ≤ 32)
d) The x value such that P(X ≤ x) = 0.975
e) The x value such that P(X > x) = 0.975
15) A random variable T follows Student's t distribution with n = 20 degrees of freedom. Determine:
a) The critical value tc such that P(−tc < T < tc) = 95%
b) E(T)
c) Var(T)
16) Now, assume that T follows Student's t distribution with n = 14 degrees of freedom. Determine:
a) P(T > 3)
b) P(T ≤ 2)
c) P(1.5 < T ≤ 2)
d) The t value such that P(T ≤ t) = 0.90
e) The t value such that P(T > t) = 0.025
17) Consider a random variable X that follows Snedecor's F distribution with n1 = 4 and n2 = 16 degrees of freedom, that is, X ~ F(4, 16). Determine:
a) P(X > 3)
b) The critical value of F(4, 16) for α = 2.5%
c) The x value such that P(X ≤ x) = 0.99
d) E(X)
e) Var(X)

Chapter 7

Sampling

Our reason becomes obscure when we consider that the countless fixed stars that shine in the sky do not have any other purpose besides illuminating worlds in which weeping and pain rule, and, in the best case scenario, only unpleasantness exists; at least, judging by the sample we know.
Arthur Schopenhauer

7.1

INTRODUCTION

As discussed in the Introduction of this book, population is the set that has all the individuals, objects, or elements to be studied, which have one or more characteristics in common. A census is the study of data related to all the elements of the population. According to Bruni (2011), populations can be finite or infinite. Finite populations have a limited size, allowing their elements to be counted; infinite populations, on the other hand, have an unlimited size, not allowing us to count their elements. As examples of finite populations, we can mention the number of employees in a certain company, the number of members in a club, the number of products manufactured during a certain period, etc. When the number of elements in a population, even though they can be counted, is too high, we assume that the population is infinite. Examples of populations considered infinite are the number of inhabitants in the world, the number of residences in Rio de Janeiro, the number of points on a straight line, etc. Therefore, there are situations in which a study with all the elements in a population is impossible or unwanted. Hence, the alternative is to extract a subset from the population under analysis, which is called a sample. The sample must be representative of the population being studied, therein is the importance of this chapter. From the information gathered in the sample and using suitable statistical procedures, the results obtained can be used to generalize, infer, or draw conclusions regarding the population (statistical inference). For Fa´vero et al. (2009) and Bussab and Morettin (2011), it is rarely possible to obtain the exact distribution of a variable, due to the high costs, the time needed and the difficulties in collecting the data. Hence, the alternative is to select part of the elements in the population (sample) and, after that, infer the properties for the whole (population). Essentially, there are two types of sampling: (1) probability or random sampling, and (2) nonprobability or nonrandom sampling. In random sampling, samples are obtained randomly, that is, the probability of each element of the population being a part of the sample is the same. In nonrandom sampling, on the other hand, the probability of some or all the elements of the population being in the sample is unknown. Fig. 7.1 shows the main random and nonrandom sampling techniques. Fa´vero et al. (2009) show the advantages and disadvantages of random and nonrandom techniques. Regarding random sampling techniques, the main advantages are: a) the selection criteria of the elements are rigorously defined, not allowing the researchers’ or the interviewer’s subjectivity to interfere in the selection of the elements; b) the possibility to mathematically determine the sample size based on accuracy and on the confidence level desired for the results. On the other hand, the main disadvantages are: a) difficulty in obtaining current and complete listings or regions of the population; b) geographically speaking, a random selection can generate a highly disperse sample, increasing the costs, the time needed for the study, and the difficulty in collecting the data. As regards nonrandom sampling techniques, the advantages are lower costs, less time to carry out the study, and less need of human resources. 
As disadvantages, we can mention: a) there are units in the population that cannot be chosen; b) a personal bias may happen; c) we do not know with what level of confidence the conclusions arrived at can be inferred for the population.


FIG. 7.1 Main sampling techniques: random sampling (simple, systematic, stratified, and cluster) and nonrandom sampling (convenience, judgmental, quota, and snowball).

These techniques do not use a random method to select the elements of the sample, so there is no guarantee that the sample selected is a good representative of the population (Fávero et al., 2009). Choosing the sampling technique must consider the goals of the survey, the acceptable error in the results, accessibility to the elements of the population, the desired representativeness, the time needed, and the availability of financial and human resources.

7.2

PROBABILITY OR RANDOM SAMPLING

In this type of sampling, samples are obtained randomly, that is, the probability of each element of the population being a part of the sample is the same, and all of the samples selected are equally probable. In this section, we will study the main probability or random sampling techniques: (a) simple random sampling, (b) systematic sampling, (c) stratified sampling, and (d) cluster sampling.

7.2.1

Simple Random Sampling

According to Bolfarine and Bussab (2005), simple random sampling (SRS) is the simplest and most important method for selecting a sample. Consider a population or universe (U) with N elements: U ¼ f1, 2, …, N g According to Bolfarine and Bussab (2005), planning and selecting the sample include the following steps: (a) Using a random procedure (as, for example, through a table with random numbers or a gravity-pick machine), we must draw an element from population U with the same probability; (b) We repeat the previous process until a sample with n observations is generated (the calculation of the size of a simple random sample will be studied in Section 7.4); (c) When the value drawn is removed from U before of the next draw, we have the SRS without replacement process. In case drawing a unit more than once is allowed, we have the SRS with replacement process. According to Bolfarine and Bussab (2005), from a practical point of view, an SRS without replacement is much more interesting, because it satisfies the intuitive principle that we do not gain more information in case the same unit appears more than once in the sample. On the other hand, an SRS with replacement has mathematical and statistical advantages, such as, the independence between the units drawn. Let’s now study each of them.

7.2.1.1 Simple Random Sampling Without Replacement
According to Bolfarine and Bussab (2005), an SRS without replacement works as follows:
(a) All of the elements in the population are numbered from 1 to N: U = {1, 2, …, N}


(b) Using a procedure to generate random numbers, we must draw, with the same probability, one of the N observations of the population;
(c) We draw the following element, with the previous value being removed from the population;
(d) We repeat the procedure until n observations have been drawn (how to calculate n is explained in Section 7.4.1).
In this type of sampling, there are C(N, n) = N!/[n!·(N − n)!] possible samples of n elements that can be obtained from the population, and each sample has the same probability of being selected, 1/C(N, n).

TABLE 7.E.1 Weight (kg) of 30 parts 6.4

6.2

7.0

6.8

7.2

6.4

6.5

7.1

6.8

6.9

7.0

7.1

6.6

6.8

6.7

6.3

6.6

7.2

7.0

6.9

6.8

6.7

6.5

7.2

6.8

6.9

7.0

6.7

6.9

6.8

Solution All 30 parts were numbered from 1 to 30, as shown in Table 7.E.2.

TABLE 7.E.2 Numbers given to the parts 1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

6.4

6.2

7.0

6.8

7.2

6.4

6.5

7.1

6.8

6.9

7.0

7.1

6.6

6.8

6.7

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

6.3

6.6

7.2

7.0

6.9

6.8

6.7

6.5

7.2

6.8

6.9

7.0

6.7

6.9

6.8

Through a random procedure (as, for example, the RANDBETWEEN function in Excel), the following numbers were selected: 02 03 14

24 28

The parts associated to these numbers form the random sample selected.   30  29  28  27  26 30 ¼ 142, 506 different samples. There are ¼ 5 5! The probability of a certain sample being selected is 1/142, 506.

7.2.1.2 Simple Random Sampling With Replacement According to Bolfarine and Bussab (2005), an SRS with replacement works as follows: (a) All of the elements in the population are numbered from 1 to N: U ¼ f1, 2, …, N g (b) Using a procedure to generate random numbers, we must draw, with the same probability, one of the N observations of the population; (c) We put this unit back into the population and draw the following value; (d) We repeat the procedure until n observations have been drawn (how to calculate n is explained in Section 7.4.1). In this type of sampling, there are Nn possible samples of n elements that can be obtained from the population, and each sample has the same probability 1/Nn of being selected.

172

PART

IV Statistical Inference

Example 7.2: Simple Random Sampling with Replacement Redo Example 7.1 considering a simple random sampling with replacement. Solution The 30 parts were numbered from 1 to 30. Through a random procedure (for example, we can use the RANDBETWEEN function in Excel), we drew the first part from the sample (12). This part is put back and the second element is drawn (33). The procedure is repeated until five parts have been drawn: 12 33 02 25 33 The parts associated to these numbers form the random sample selected. There are 305 ¼ 24,300,000 different samples. The probability of a certain sample being selected is 1/24,300,000.

7.2.2

Systematic Sampling

According to Costa Neto (2002), when the elements of the population are sorted and periodically removed, we have a systematic sampling. Hence, for example, in a production line, we can remove an element at every 50 items produced. As advantages of systematic sampling, in comparison to simple random sampling, we can mention that it is carried out in a much faster and cheaper way, besides being less susceptible to errors made by the interviewer during the survey. The main disadvantage is the possibility of having variation cycles, especially if these cycles coincide with the period when the elements are removed from the sample. For example, let’s suppose that at every 60 parts produced in a certain machine, one part is inspected; however, in this machine a certain flaw usually happens, so, at every 20 parts produced, one is defective. Assuming that the elements of the population are sorted from 1 to N and that we already know the sample size (n), systematic sampling works as follows: (a) We must determine the sampling interval (k), obtained by the quotient of the population size and the sample size: k¼

N n

This value must be rounded to the closest integer. (b) In this phase, we introduce an element of randomness, choosing the starting unit. The first element chosen {X1} can be an any element between 1 and k; (c) After choosing the first element, after each k element, a new element is removed from the population. The process is repeated until it reaches the sample size (n): X1 ,X1 + k, X1 + 2k, …,X1 + ðn  1Þk

Example 7.3: Systematic Sampling Imagine a population with N ¼ 500 sorted elements. We wish to remove a sample with n ¼ 20 elements from this population. Use the systematic sampling procedure. Solution (a) The sampling interval (k) is: k¼

N 500 ¼ ¼ 25 n 20

(b) The first element chosen {X} can be any element between 1 and 25; suppose that X ¼ 5; (c) Since the first element of the sample is X ¼ 5, the second element will be X ¼ 5 + 25 ¼ 30, the third element X ¼ 5 + 50 ¼ 55, and so on, and so forth, so, the last element of the sample will be X ¼ 5 + 19  25 ¼ 480: A ¼ f 5,30, 55, 80, 105,130, 155,180, 205,230,255, 280,305,330, 355,380, 405,430,455, 480 g

Sampling Chapter

7.2.3

7

173

Stratified Sampling

In this type of sampling, a heterogeneous population is stratified or divided into subpopulations or homogeneous strata, and, in each stratum, a sample is drawn. Hence, initially, we define the number of strata and, by doing that, we obtain the size of each stratum. For each stratum, we specify how many elements will be drawn from the subpopulation, and this can be a uniform or proportional allocation. According to Costa Neto (2002), uniform stratified sampling, from which we draw an equal number of elements in each stratum, is recommended when the strata are approximately the same size. In proportional stratified sampling, on the other hand, the number of elements in each stratum is proportional to the number of elements in the stratum. According to Freund (2006), if the elements selected in each stratum are simple random samples, the global process (stratification followed by random sampling) is called (simple) stratified random sampling. According to Freund (2006), stratified sampling works as follows: (a) A population of size N is divided into k strata of sizes N1, N2, …, Nk; (b) For each stratum, a random sample of size ni (i ¼ 1, 2, …, k) is selected, resulting in k subsamples of sizes n1, n2, …, nk. In uniform stratified sampling, we have: n1 ¼ n 2 ¼ … ¼ nk where the sample size obtained from each stratum is: n ni ¼ , para i ¼ 1, 2, …,k k where n ¼ n1 + n2 + … + nk In proportional stratified sampling, on the other hand, we have: n1 n2 nk ¼ ¼…¼ N1 N2 Nk

(7.1)

(7.2)

(7.3)

In proportional sampling, the sample size obtained from each stratum can be obtained according to the following expression: ni ¼

Ni  n, for i ¼ 1, 2, …,k N

(7.4)

As examples of stratified sampling, we can mention the stratification of a city into neighborhoods, of a population by gender or age group, of customers by social class or of students by school. The calculation of the size of a stratified sample will be studied in Section 7.4.3. Example 7.4: Stratified Sampling Consider a club that has N ¼ 5000 members. The population can be divided by age group, aiming at identifying the main activities practiced by each group: from 0 to 4 years of age; from 5 to 11; from 12 to 17; from 18 to 25; from 26 to 36; from 37 to 50; from 51 to 65; and over 65 years of age. We have N1 ¼ 330, N2 ¼ 350, N3 ¼ 400, N4 ¼ 520, N5 ¼ 650, N6 ¼ 1030, N7 ¼ 980, N8 ¼ 740. We would like to draw a stratified sample from the population of size n ¼ 80. What should be the size of the sample drawn from each stratum in case of uniform sampling and proportional sampling? Solution For uniform sampling, ni ¼ n/k ¼ 80/8 ¼ 10. Therefore, n1 ¼ … ¼ n8 ¼ 10. For proportional sampling, we calculate ni ¼ NNi  n, for i ¼ 1, 2,…, 8: 330 350  80 ¼ 5:3 ffi 6, n2 ¼ NN2  n ¼ 5;000  80 ¼ 5:6 ffi 6 n1 ¼ NN1  n ¼ 5;000 400 520 n3 ¼ NN3  n ¼ 5;000  80 ¼ 6:4 ffi 7, n4 ¼ NN4  n ¼ 5;000  80 ¼ 8:3 ffi 9 650  80 ¼ 10:4 ffi 11, n6 ¼ NN6  n ¼ 1:030 n5 ¼ NN5  n ¼ 5;000 5;000  80 ¼ 16:5 ffi 17 980 740  80 ¼ 15:7 ffi 16, n8 ¼ NN8  n ¼ 5;000  80 ¼ 11:8 ffi 12 n7 ¼ NN7  n ¼ 5;000

7.2.4

Cluster Sampling

In cluster sampling, the total population must be subdivided into groups of elementary units, called clusters. The sampling is done from the groups and not from the individuals in the population. Hence, we must randomly draw a sufficient number of clusters and the objects from these will form the sample. This type of sampling is called one-stage cluster sampling.

174

PART

IV Statistical Inference

According to Bolfarine and Bussab (2005), one of the inconveniences of cluster sampling is the fact that elements in the same cluster tend to have similar characteristics. The authors show that the more similar the elements in the cluster are, the less efficient the procedure is. Each cluster must be a good representative of the population, that is, it must be heterogeneous, containing all kinds of participants. It is the opposite of stratified sampling. According to Martins and Domingues (2011), cluster sampling is a simple random sampling in which the sample units are the clusters; however, it is less expensive. When we draw elements in the clusters selected, we have a two-stage cluster sampling: in the first stage, we draw the clusters and, in the second, we draw the elements. The number of elements to be drawn depends on the variability in the cluster. The higher the variability, the more elements must be drawn. On the other hand, when the units in the cluster are very similar, it is not advisable nor necessary to draw all the elements, because they will bring the same kind of information (Bolfarine and Bussab, 2005). Cluster sampling can be generalized to several stages. The main advantages that justify the wide use of cluster sampling are: a) many populations are already grouped into natural or geographic subgroups, facilitating its application; b) it allows a substantial reduction in the costs to obtain the sample, without compromising its accuracy. In short, it is fast, cheap, and efficient. The only disadvantage is that clusters are rarely the same size, making it difficult to control the range of the sample. However, to overcome this problem, we have to use certain statistical techniques. As examples of clusters, we can mention the production in a factory divided into assembly lines, company employees divided by area, students in a municipality divided by schools, or the population in a municipality divided into districts. Consider the following notation for cluster sampling: N: population size; M: number of clusters into which the population was divided; Ni: cluster size i (i ¼ 1, 2, ..., M); n: sample size; m: number of clusters drawn (m < M); ni: cluster size i of the sample (i ¼ 1, 2, ..., m), where ni ¼ Ni; bi: cluster size i of the sample (i ¼ 1, 2, ..., m), where bi < ni. In short, one-stage cluster sampling adopts the following procedure: (a) The population is divided into M clusters (C1, …, CM) with sizes that are not necessarily the same; (b) According to a sample plan, usually SRS, we draw m clusters (m <  M);  m P ni ¼ n . (c) All the elements of each cluster drawn constitute the global sample ni ¼ Ni and i¼1

The calculation of the number of clusters (m) will be studied in Section 7.4.4. On the other hand, two-stage cluster sampling works as follows: (a) The population is divided into M clusters (C1, …, CM) with sizes that are not necessarily the same; (b) We must draw m clusters in the first stage, according to some kind of sample plan, usually SRS; (c) From each cluster i drawn, of size ni, we draw bi elements in the second stage, according to the same or to another  m P sample plan bi < ni and n ¼ bi . i¼1

Example 7.5: One-Stage Cluster Sampling Consider a population with N ¼ 20 elements, U ¼ {1, 2, …, 20}. The population is divided into 7 clusters: C1 ¼ {1, 2}, C2 ¼ {3, 4, 5}, C3 ¼ {6, 7, 8}, C4 ¼ {9, 10, 11}, C5 ¼ {12, 13, 14}, C6 ¼ {15, 16}, C7 ¼ {17, 18, 19, 20}. The sample plan adopted says that we should draw three clusters (m ¼ 3) by simple random sampling without replacement. Assuming that clusters C1, C3, and C4 were drawn, determine the sample size, besides the elements that will constitute the one-stage cluster sampling. Solution In one-stage cluster sampling, all the elements of each cluster drawn constitute the sample, so, M ¼ {C1, C3, C4} ¼ {(1, 2), (6, 7, 8), 3 P (9, 10, 11)}. Therefore, n1 ¼ 2, n2 ¼ 3 and n3 ¼ 3, and n ¼ ni ¼ 8. i¼1


Example 7.6: Two-Stage Cluster Sampling
Example 7.5 will be extended to the case of two-stage cluster sampling. Thus, from the clusters drawn in the first stage, the sample plan adopted tells us to draw a single element with equal probability from each cluster (bi = 1, i = 1, 2, 3 and n = Σ_{i=1}^{3} bi = 3),

which results in the following:
Stage 1: M = {C1, C3, C4} = {(1, 2), (6, 7, 8), (9, 10, 11)}
Stage 2: M = {1, 8, 10}
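A minimal sketch of these two draws in Python (standard library only): the cluster partition is taken from Example 7.5, while the seed and variable names are our own, so the clusters actually drawn will differ from the example depending on the seed.

```python
import random

# Population U = {1, ..., 20} partitioned into 7 clusters (Example 7.5)
clusters = {
    "C1": [1, 2], "C2": [3, 4, 5], "C3": [6, 7, 8], "C4": [9, 10, 11],
    "C5": [12, 13, 14], "C6": [15, 16], "C7": [17, 18, 19, 20],
}

random.seed(42)  # arbitrary seed, only to make the sketch reproducible

# Stage 1: draw m = 3 clusters by SRS without replacement
drawn = random.sample(sorted(clusters), k=3)

# One-stage cluster sampling: every element of each drawn cluster enters the sample
one_stage_sample = [x for c in drawn for x in clusters[c]]
print("clusters drawn:", drawn)
print("one-stage sample (n = %d):" % len(one_stage_sample), one_stage_sample)

# Two-stage cluster sampling: draw b_i = 1 element from each drawn cluster
two_stage_sample = [random.choice(clusters[c]) for c in drawn]
print("two-stage sample (n = %d):" % len(two_stage_sample), two_stage_sample)
```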

7.3 NONPROBABILITY OR NONRANDOM SAMPLING

In nonprobability sampling methods, samples are obtained in a nonrandom way; that is, the probability of some or all elements of the population belonging to the sample is unknown. Thus, it is not possible to estimate the sampling error, nor to generalize the results of the sample to the population, since the sample is not representative of the population. For Costa Neto (2002), this type of sampling is often used due to its simplicity or because it is impossible to obtain probability samples, which would be the most desirable. Therefore, we must be careful when deciding to use this type of sampling, since it is subjective, based on the researcher's criteria and judgment, and the sample variability cannot be established with accuracy. In this section, we will study the main nonprobability or nonrandom sampling techniques: (a) convenience sampling, (b) judgmental or purposive sampling, (c) quota sampling, and (d) geometric propagation or snowball sampling.

7.3.1 Convenience Sampling

Convenience sampling is used when participation is voluntary or the sample elements are chosen due to convenience or simplicity, such as friends, neighbors, or students. The advantage this method offers is that it allows the researcher to obtain information in a quick and cheap way. However, the sampling process does not guarantee that the sample is representative of the population, so it should only be employed in extreme situations or in special cases that justify its use.
Example 7.7: Convenience Sampling
A researcher wishes to study customer behavior in relation to a certain brand and, in order to do that, he develops a sampling plan. The data are collected through interviews with friends, neighbors, and workmates. This represents convenience sampling, since the sample is not representative of the population. It is important to highlight that, if the population is very heterogeneous, the results of the sample cannot be generalized to the population.

7.3.2 Judgmental or Purposive Sampling

In judgmental or purposive sampling, the sample is chosen according to an expert’s opinion or previous judgment. It is a risky method due to possible mistakes made by the researcher in his prejudgment. Using this type of sampling requires knowledge of the population and of the elements selected. Example 7.8: Judgmental or Purposive Sampling A survey is trying to identify the reasons why a group of employees of a certain company went on strike. In order to do that, the researcher interviews the main leaders of the trade union and of political movements, as well as the employees that are not involved in such movements. Since the sample size is small, it is not possible to generalize the results to the population, since the sample is not representative of this population.

7.3.3 Quota Sampling

Quota sampling presents greater rigor than the other nonrandom sampling methods. For Martins and Domingues (2011), it is one of the most used sampling methods in market surveys and election polls. Quota sampling is a variation of judgmental sampling: initially, we set the quotas based on a certain criterion; within the quotas, the selection of the sample items depends on the interviewer's judgment. Quota sampling can also be considered a nonprobability version of stratified sampling. Quota sampling consists of three steps: (a) We select the control variables or the population's characteristics considered relevant for the study in question; (b) We determine the percentage of the population (%) for each one of the relevant variable categories; (c) We establish the size of the quotas (the number of people to be interviewed who have the characteristics needed) for each interviewer, so that the sample has the same proportions as the population. The main advantages of quota sampling are the low cost, speed, and the convenience or ease with which the interviewer can select elements. However, since the selection of elements is not random, there is no guarantee that the sample will be representative of the population. Hence, it is not possible to generalize the results of the survey to the population.
Example 7.9: Quota Sampling
We would like to carry out municipal election polls in a certain municipality with 14,253 voters. The survey's main objective is to identify how people intend to vote based on their gender and age group. Table 7.E.3 shows the absolute frequencies for each pair of variable categories analyzed. Apply quota sampling, considering that the sample size is 200 voters and that there are two interviewers.

TABLE 7.E.3 Absolute Frequencies for Each Pair of Categories

Age Group        Male    Female   Total
16 and 17        50      48       98
from 18 to 24    1097    1063     2160
from 25 to 44    3409    3411     6820
from 45 to 69    2269    2207     4476
> 69             359     331      690
Total            7184    7060     14,244

Solution (a) The variables that are relevant for the study are gender and age; (b) The percentage of the population (%) for each pair of categories of analyzed variables is shown in Table 7.E.4.

TABLE 7.E.4 Percentage of the Population for Each Pair of Categories

Age Group        Male      Female    Total
16 and 17        0.35%     0.34%     0.69%
from 18 to 24    7.70%     7.46%     15.16%
from 25 to 44    23.93%    23.95%    47.88%
from 45 to 69    15.93%    15.49%    31.42%
> 69             2.52%     2.32%     4.84%
% of the Total   50.44%    49.56%    100.00%

(c) If we multiply each cell in Table 7.E.4 by the sample size (200), we get the dimensions of the quotas that compose the global sample, as shown in Table 7.E.5.


TABLE 7.E.5 Dimensions of the Quotas

Age Group        Male   Female   Total
16 and 17        1      1        2
from 18 to 24    16     15       31
from 25 to 44    48     48       96
from 45 to 69    32     31       63
> 69             5      5        10
Total            102    100      202

Considering that there are two interviewers, the quota for each one will be:

TABLE 7.E.6 Dimensions of the Quotas per Interviewer

Age Group        Male   Female   Total
16 and 17        1      1        2
from 18 to 24    8      8        16
from 25 to 44    24     24       48
from 45 to 69    16     16       32
> 69             3      3        6
Total            52     52       104

Note: The data in Tables 7.E.5 and 7.E.6 were rounded up, resulting in a total number of 202 voters in Table 7.E.5 and 104 voters in Table 7.E.6.
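A minimal Python sketch of step (c) of the quota computation. The rounding rule is an assumption on our part: rounding each cell up can make the quotas total slightly more than the nominal 200, in the spirit of the note above, although individual cells may differ by one from Table 7.E.5 depending on the rule chosen.

```python
import math

# Absolute frequencies from Table 7.E.3 (male, female) per age group
population = {
    "16 and 17":     (50, 48),
    "from 18 to 24": (1097, 1063),
    "from 25 to 44": (3409, 3411),
    "from 45 to 69": (2269, 2207),
    "> 69":          (359, 331),
}
N = sum(m + f for m, f in population.values())   # 14,244 voters in total
sample_size = 200

# Quota for each (age group, gender) cell: population share times the sample size,
# each cell rounded up (assumed rule), which is why the total can exceed 200
quotas = {
    group: tuple(math.ceil(count / N * sample_size) for count in counts)
    for group, counts in population.items()
}
for group, (male_q, female_q) in quotas.items():
    print(f"{group:15s} male: {male_q:3d}  female: {female_q:3d}")
print("total:", sum(m + f for m, f in quotas.values()))
```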

7.3.4 Geometric Propagation or Snowball Sampling

Geometric propagation or snowball sampling is widely used when the elements of the population are rare, difficult to access, or unknown. In this method, we must identify one or more individuals from the target population, and these will identify the other individuals that are in the same population. The process is repeated until the objective proposed is achieved, that is, the point of saturation. The point of saturation is reached when the last respondents do not add new relevant information to the research, thus, repeating the content of previous interviews. As advantages, we can mention: a) it allows the researcher to find the desired characteristic in the population; b) it is easy to apply, because the recruiting is done through referrals from other people who are in the population; c) low cost, because we need less planning and people; and d) it is efficient to enter populations that are difficult to access. Example 7.10: Snowball Sampling A company is recruiting professionals with a specific profile. The group hired initially recommends other professionals with the same profile. The process is repeated until the number of employees needed is hired. Therefore, we have an example of snowball sampling.

7.4 SAMPLE SIZE

According to Cabral (2006), there are six decisive factors when calculating the sample size:
1) Characteristics of the population, such as the variance (σ²) and the dimension (N);
2) The sampling distribution of the estimator used;


3) The accuracy and reliability required in the results, which makes it necessary to specify the estimation error (B), the maximum difference that the researcher accepts between the population parameter and the estimate obtained from the sample;
4) The costs: the larger the sample size, the higher the costs;
5) Costs vs. sampling error: should we select a larger sample to reduce the sampling error, or should we reduce the sample size in order to minimize the resources and effort necessary, thus ensuring better control over the interviewers, a higher response rate, and more precise processing of the information?
6) The statistical techniques that will be used: some statistical techniques demand larger samples than others.
The sample selected must be representative of the population. Based on Ferrão et al. (2001), Bolfarine and Bussab (2005), and Martins and Domingues (2011), this section discusses how to calculate the sample size for the mean (a quantitative variable) and the proportion (a binary variable) of finite and infinite populations, with a maximum estimation error B, and for each type of random sampling (simple, systematic, stratified, and cluster). In the case of nonrandom samples, either we set the sample size based on the available budget or we adopt a dimension that has already been used successfully in previous studies with the same characteristics. A third alternative would be to calculate the size of a random sample and use that dimension as a reference.

7.4.1 Size of a Simple Random Sample

This section discusses how to calculate the size of a simple random sample to estimate the mean (a quantitative variable) and the proportion (a binary variable) of finite and infinite populations, with a maximum estimation error B. The estimation error (B) for the mean is the maximum difference that the researcher accepts between μ (population mean) and X̄ (sample mean), that is, B ≥ |μ − X̄|. On the other hand, the estimation error (B) for the proportion is the maximum difference that the researcher accepts between p (proportion of the population) and p̂ (proportion of the sample), that is, B ≥ |p − p̂|.

7.4.1.1 Sample Size to Estimate the Mean of an Infinite Population
If the variable chosen is quantitative and the population infinite, the size of a simple random sample, where P(|X̄ − μ| ≤ B) = 1 − α, can be calculated as follows:

n = σ² / (B²/z_α²)   (7.5)

where:
σ²: population variance;
B: maximum estimation error;
z_α: abscissa (coordinate) of the standard normal distribution, at the significance level α.

According to Bolfarine and Bussab (2005), to determine the sample size it is necessary to set the maximum estimation error (B), the significance level α (translated by the value of z_α), and to have some previous knowledge of the population variance (σ²). The first two are set by the researcher, while the third demands more work. When we do not know σ², its value must be replaced by a reasonable initial estimate. In many cases, a pilot sample can provide sufficient information about the population. In other cases, sample surveys done previously about the population can also provide satisfactory initial estimates for σ². Finally, some authors suggest the use of an approximate value for the standard deviation, given by σ ≅ range/4.

7.4.1.2 Sample Size to Estimate the Mean of a Finite Population
If the variable chosen is quantitative and the population finite, the size of a simple random sample, where P(|X̄ − μ| ≤ B) = 1 − α, can be calculated as follows:

n = N·σ² / [(N − 1)·B²/z_α² + σ²]   (7.6)

where:
N: size of the population;
σ²: population variance;
B: maximum estimation error;
z_α: coordinate of the standard normal distribution, at the significance level α.


7.4.1.3 Sample Size to Estimate the Proportion of an Infinite Population
If the variable chosen is binary and the population infinite, the size of a simple random sample, where P(|p̂ − p| ≤ B) = 1 − α, can be calculated as follows:

n = p·q / (B²/z_α²)   (7.7)

where:
p: proportion of the population that contains the characteristic desired;
q = 1 − p;
B: maximum estimation error;
z_α: coordinate of the standard normal distribution, at the significance level α.

In practice, we do not know the value of p and we must, therefore, use its estimate (p̂). If this value is also unknown, we must admit that p̂ = 0.50, hence obtaining a conservative size, that is, larger than what is necessary to ensure the accuracy required.

7.4.1.4 Sample Size to Estimate the Proportion of a Finite Population
If the variable chosen is binary and the population finite, the size of a simple random sample, where P(|p̂ − p| ≤ B) = 1 − α, can be calculated as follows:

n = N·p·q / [(N − 1)·B²/z_α² + p·q]   (7.8)

where:
N: size of the population;
p: proportion of the population that contains the characteristic desired;
q = 1 − p;
B: maximum estimation error;
z_α: coordinate of the standard normal distribution, at the significance level α.

Example 7.11: Calculating the Size of a Simple Random Sample
Consider the population of residents in a condominium (N = 540). We would like to estimate the average age of these residents. Based on previous surveys, we can obtain an estimate for σ² of 463.32. Assume that a simple random sample will be drawn from the population. Assuming that the difference between the sample mean and the real population mean is 4 years, at the most, with a confidence level of 95%, determine the sample size to be collected.
Solution
The value of z_α for α = 5% (a bilateral test) is 1.96. From Expression (7.6), the sample size is:

n = N·σ² / [(N − 1)·B²/z_α² + σ²] = (540 × 463.32) / (539 × 4²/1.96² + 463.32) = 92.38 ≅ 93

Therefore, if we collect a simple random sample of at least 93 residents from the population, we can infer, with a confidence level of 95%, that the sample mean (X̄) will differ by 4 years, at the most, from the real population mean (μ).
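A small helper in Python reproducing this calculation (Expressions (7.5) and (7.6)). The function name is our own, and scipy is used only to obtain z_α for the chosen confidence level.

```python
import math
from scipy.stats import norm

def sample_size_mean(sigma2, B, conf=0.95, N=None):
    """Sample size to estimate a mean with maximum error B (Expressions 7.5 and 7.6).

    sigma2 : population variance (or a prior estimate of it)
    B      : maximum estimation error accepted
    conf   : confidence level (1 - alpha), two-sided
    N      : population size; None means the population is treated as infinite
    """
    z = norm.ppf(1 - (1 - conf) / 2)          # e.g. 1.96 for 95%
    if N is None:                              # Expression (7.5)
        n = sigma2 / (B**2 / z**2)
    else:                                      # Expression (7.6)
        n = N * sigma2 / ((N - 1) * B**2 / z**2 + sigma2)
    return math.ceil(n)

print(sample_size_mean(463.32, B=4, conf=0.95, N=540))   # -> 93, as in Example 7.11
```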


Example 7.12: Calculating the Size of a Simple Random Sample
We would like to estimate the proportion of voters who are dissatisfied with a certain politician's administration. We admit that the real proportion is unknown, as well as its estimate. Assuming that a simple random sample will be drawn from an infinite population and admitting a sample error of 2% and a significance level of 5%, determine the sample size.
Solution
Since we do not know the real value of p nor its estimate, let us assume that p̂ = 0.50. Applying Expression (7.7) to estimate the proportion of an infinite population, we have:

n = p·q / (B²/z_α²) = (0.5 × 0.5) / (0.02²/1.96²) = 2401

Therefore, by randomly interviewing 2401 voters, we can infer the real proportion of voters who are dissatisfied, with a maximum estimation error of 2%, and a confidence level of 95%.
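The analogous calculation for a proportion (Expressions (7.7) and (7.8)), again as a short sketch with a function name of our own:

```python
import math
from scipy.stats import norm

def sample_size_proportion(p, B, conf=0.95, N=None):
    """Sample size to estimate a proportion with maximum error B (Expressions 7.7 and 7.8).
    Use p = 0.5 when no prior estimate of the proportion is available (conservative choice)."""
    z = norm.ppf(1 - (1 - conf) / 2)
    q = 1 - p
    if N is None:                              # Expression (7.7)
        n = p * q / (B**2 / z**2)
    else:                                      # Expression (7.8)
        n = N * p * q / ((N - 1) * B**2 / z**2 + p * q)
    return math.ceil(n)

print(sample_size_proportion(0.5, B=0.02, conf=0.95))   # -> 2401, as in Example 7.12
```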

7.4.2 Size of the Systematic Sample

In systematic sampling, we use the same expressions as in simple random sampling (as studied in Section 7.4.1), according to the type of variable (quantitative or qualitative) and population (infinite or finite).

7.4.3 Size of the Stratified Sample

This section discusses how to calculate the size of a stratified sample to estimate the mean (a quantitative variable) and the proportion (a binary variable) of finite and infinite populations, with a maximum estimation error B. The estimation error (B) for the mean is the maximum difference that the researcher accepts between μ (population mean) and X̄ (sample mean), that is, B ≥ |μ − X̄|. On the other hand, the estimation error (B) for the proportion is the maximum difference that the researcher accepts between p (proportion of the population) and p̂ (proportion of the sample), that is, B ≥ |p − p̂|.
Let's use the following notation to calculate the size of the stratified sample:
k: number of strata;
Ni: size of stratum i, i = 1, 2, ..., k;
N = N1 + N2 + … + Nk (population size);
Wi = Ni/N (weight or proportion of stratum i, with Σ_{i=1}^{k} Wi = 1);
μi: population mean of stratum i;
σi²: population variance of stratum i;
ni: number of elements randomly selected from stratum i;
n = n1 + n2 + … + nk (sample size);
X̄i: sample mean of stratum i;
Si²: sample variance of stratum i;
pi: proportion of elements that have the characteristic desired in stratum i;
qi = 1 − pi.

7.4.3.1 Sample Size to Estimate the Mean of an Infinite Population
If the variable chosen is quantitative and the population infinite, the size of the stratified sample, where P(|X̄ − μ| ≤ B) = 1 − α, can be calculated as:

n = Σ_{i=1}^{k} Wi·σi² / (B²/z_α²)   (7.9)

where:
Wi = Ni/N (weight or proportion of stratum i, where Σ_{i=1}^{k} Wi = 1);
σi²: population variance of stratum i;
B: maximum estimation error;
z_α: coordinate of the standard normal distribution, at the significance level α.

7.4.3.2 Sample Size to Estimate the Mean of a Finite Population
If the variable chosen is quantitative and the population finite, the size of the stratified sample, where P(|X̄ − μ| ≤ B) = 1 − α, can be calculated as:

n = Σ_{i=1}^{k} (Ni²·σi²/Wi) / [N²·B²/z_α² + Σ_{i=1}^{k} Ni·σi²]   (7.10)

where:
Ni: size of stratum i, i = 1, 2, ..., k;
σi²: population variance of stratum i;
Wi = Ni/N (weight or proportion of stratum i, where Σ_{i=1}^{k} Wi = 1);
N: size of the population;
B: maximum estimation error;
z_α: coordinate of the standard normal distribution, at the significance level α.

7.4.3.3 Sample Size to Estimate the Proportion of an Infinite Population
If the variable chosen is binary and the population infinite, the size of the stratified sample, where P(|p̂ − p| ≤ B) = 1 − α, can be calculated as:

n = Σ_{i=1}^{k} Wi·pi·qi / (B²/z_α²)   (7.11)

where:
Wi = Ni/N (weight or proportion of stratum i, where Σ_{i=1}^{k} Wi = 1);
pi: proportion of elements that have the characteristic desired in stratum i;
qi = 1 − pi;
B: maximum estimation error;
z_α: coordinate of the standard normal distribution, at the significance level α.

7.4.3.4 Sample Size to Estimate the Proportion of a Finite Population
If the variable chosen is binary and the population finite, the size of the stratified sample, where P(|p̂ − p| ≤ B) = 1 − α, can be calculated as:

n = Σ_{i=1}^{k} (Ni²·pi·qi/Wi) / [N²·B²/z_α² + Σ_{i=1}^{k} Ni·pi·qi]   (7.12)

where:
Ni: size of stratum i, i = 1, 2, ..., k;
pi: proportion of elements that have the characteristic desired in stratum i;
qi = 1 − pi;
Wi = Ni/N (weight or proportion of stratum i, where Σ_{i=1}^{k} Wi = 1);
N: size of the population;
B: maximum estimation error;
z_α: coordinate of the standard normal distribution, at the significance level α.

Example 7.13 Calculating the Size of a Stratified Sample A university has 11,886 students enrolled in 14 undergraduate courses, divided into three major areas: Exact Sciences, Human Sciences, and Biological Sciences. Table 7.E.7 shows the number of students enrolled per area. A survey will be carried out in order to estimate the average time students spend studying per week (in hours). Based on pilot samples, we obtain the following estimates for the variances in the areas of Exact, Human, and Biological Sciences: 124.36, 153.22, and 99.87, respectively. The samples selected must be proportional to the number of students per area. Determine the sample size, considering an estimation error of 0.8, and a confidence level of 95%.

TABLE 7.E.7 Number of Students Enrolled per Area

Area                    Number of students enrolled
Exact Sciences          5285
Human Sciences          3877
Biological Sciences     2724
Total                   11,886

Solution
From the data, we have: k = 3, N1 = 5285, N2 = 3877, N3 = 2724, N = 11,886, B = 0.8.
W1 = 5285/11,886 = 0.44, W2 = 3877/11,886 = 0.33, W3 = 2724/11,886 = 0.23
For α = 5%, we have z_α = 1.96. Based on the pilot samples, we use the estimates for σ1², σ2², and σ3². The sample size is calculated from Expression (7.10):

n = Σ_{i=1}^{k} (Ni²·σi²/Wi) / [N²·B²/z_α² + Σ_{i=1}^{k} Ni·σi²]
n = (5285² × 124.36/0.44 + 3877² × 153.22/0.33 + 2724² × 99.87/0.23) / [11,886² × 0.8²/1.96² + (5285 × 124.36 + 3877 × 153.22 + 2724 × 99.87)] = 722.52 ≅ 723

Since the sampling is proportional, we can obtain the size of each stratum by using the expression ni = Wi × n (i = 1, 2, 3):
n1 = W1 × n = 0.44 × 723 = 321.48 ≅ 322
n2 = W2 × n = 0.33 × 723 = 235.83 ≅ 236
n3 = W3 × n = 0.23 × 723 = 165.70 ≅ 166
Thus, to carry out the survey, we must select 322 students from the area of Exact Sciences, 236 from Human Sciences, and 166 from Biological Sciences. From the sample selected, we can infer, with a 95% confidence level, that the difference between the sample mean and the real population mean will be a maximum of 0.8 hours.

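A sketch of Expression (7.10) in Python, reproducing Example 7.13. The helper name, the use of scipy for z_α, and the rounding-up of the allocation are our own choices.

```python
import math
from scipy.stats import norm

def stratified_sample_size_mean(N_i, var_i, B, conf=0.95):
    """Stratified sample size for a mean with maximum error B (Expression 7.10),
    plus the proportional allocation n_i = W_i * n."""
    z = norm.ppf(1 - (1 - conf) / 2)
    N = sum(N_i)
    W = [Ni / N for Ni in N_i]
    numerator = sum(Ni**2 * v / w for Ni, v, w in zip(N_i, var_i, W))
    denominator = N**2 * B**2 / z**2 + sum(Ni * v for Ni, v in zip(N_i, var_i))
    n = numerator / denominator
    allocation = [math.ceil(w * math.ceil(n)) for w in W]
    return math.ceil(n), allocation

n, n_i = stratified_sample_size_mean([5285, 3877, 2724], [124.36, 153.22, 99.87], B=0.8)
print(n, n_i)   # roughly 723 and an allocation close to [322, 236, 166]
```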

Example 7.14: Calculating the Size of a Stratified Sample
Consider the same population from the previous example; however, the objective now is to estimate the proportion of students who work, for each area. Based on a pilot sample, we have the following estimates per area: p̂1 = 0.3 (Exact Sciences), p̂2 = 0.6 (Human Sciences), and p̂3 = 0.4 (Biological Sciences). The type of sampling used in this case is uniform. Determine the sample size, considering an estimation error of 3% and a 90% confidence level.
Solution
Since we do not know the real value of p for each area, we can use its estimate. For a 90% confidence level, we have z_α = 1.645. Applying Expression (7.12) from stratified sampling to estimate the proportion of a finite population, we have:

n = Σ_{i=1}^{k} (Ni²·pi·qi/Wi) / [N²·B²/z_α² + Σ_{i=1}^{k} Ni·pi·qi]
n = (5285² × 0.3 × 0.7/0.44 + 3877² × 0.6 × 0.4/0.33 + 2724² × 0.4 × 0.6/0.23) / [11,886² × 0.03²/1.645² + (5285 × 0.3 × 0.7 + 3877 × 0.6 × 0.4 + 2724 × 0.4 × 0.6)]
n = 644.54 ≅ 645

Since the sampling is uniform, we have n1 = n2 = n3 = 215. Therefore, to carry out the survey, we must randomly select 215 students from each area. From the sample selected, we can infer, with a 90% confidence level, that the difference between the sample proportion and the real population proportion will be a maximum of 3%.

7.4.4 Size of a Cluster Sample

This section discusses how to calculate the size of a one-stage and a two-stage cluster sample. Let's consider the following notation to calculate the size of a cluster sample:
N: population size;
M: number of clusters into which the population was divided;
Ni: size of cluster i (i = 1, 2, ..., M);
n: sample size;
m: number of clusters drawn (m < M);
ni: size of cluster i from the sample drawn in the first stage (i = 1, 2, ..., m), where ni = Ni;
bi: size of cluster i from the sample drawn in the second stage (i = 1, 2, ..., m), where bi < ni;
N̄ = N/M (average size of the population clusters);
n̄ = n/m (average size of the sample clusters);
Xij: j-th observation in cluster i;
σdc²: population variance within the clusters;
σec²: population variance between clusters;
σi²: population variance in cluster i;
μi: population mean in cluster i;
σc² = σdc² + σec² (total population variance).

According to Bolfarine and Bussab (2005), the calculation of σdc² and σec² is given by:

σdc² = (1/N)·Σ_{i=1}^{M} Σ_{j=1}^{Ni} (Xij − μi)² = (1/M)·Σ_{i=1}^{M} (Ni/N̄)·σi²   (7.13)

σec² = (1/N)·Σ_{i=1}^{M} Ni·(μi − μ)² = (1/M)·Σ_{i=1}^{M} (Ni/N̄)·(μi − μ)²   (7.14)

Assuming that all the clusters are the same size, the previous expressions can be summarized as follows:

σdc² = (1/M)·Σ_{i=1}^{M} σi²   (7.15)

σec² = (1/M)·Σ_{i=1}^{M} (μi − μ)²   (7.16)

7.4.4.1 Size of a One-Stage Cluster Sample
This section discusses how to calculate the size of a one-stage cluster sample to estimate the mean (a quantitative variable) of a finite and an infinite population, with a maximum estimation error B. The estimation error (B) for the mean is the maximum difference that the researcher accepts between μ (population mean) and X̄ (sample mean), that is, B ≥ |μ − X̄|.

7.4.4.1.1 Sample Size to Estimate the Mean of an Infinite Population
If the variable chosen is quantitative and the population infinite, the number of clusters drawn in the first stage (m), where P(|X̄ − μ| ≤ B) = 1 − α, can be calculated as follows:

m = σc² / (B²/z_α²)   (7.17)

where:
σc² = σdc² + σec², according to Expressions (7.13)–(7.16);
B: maximum estimation error;
z_α: coordinate of the standard normal distribution, at the significance level α.

If the clusters are the same size, Bolfarine and Bussab (2005) demonstrate that:

m = σec² / (B²/z_α²)   (7.18)

According to the authors, σc² is generally unknown and has to be estimated from pilot samples or obtained from previous sample surveys.

7.4.4.1.2 Sample Size to Estimate the Mean of a Finite Population
If the variable chosen is quantitative and the population finite, the number of clusters drawn in the first stage (m), where P(|X̄ − μ| ≤ B) = 1 − α, can be calculated as follows:

m = M·σc² / [M·B²·N̄²/z_α² + σc²]   (7.19)

where:
M: number of clusters into which the population was divided;
σc² = σdc² + σec², according to Expressions (7.13)–(7.16);
B: maximum estimation error;
N̄ = N/M (average size of the population clusters);
z_α: coordinate of the standard normal distribution, at the significance level α.

7.4.4.1.3 Sample Size to Estimate the Proportion of an Infinite Population
If the variable chosen is binary and the population infinite, the number of clusters drawn in the first stage (m), where P(|p̂ − p| ≤ B) = 1 − α, can be calculated as follows:

m = (1/M)·Σ_{i=1}^{M} (Ni/N̄)·pi·qi / (B²/z_α²)   (7.20)

where:
M: number of clusters into which the population was divided;
Ni: size of cluster i (i = 1, 2, ..., M);
N̄ = N/M (average size of the population clusters);
pi: proportion of elements that have the characteristic desired in cluster i;
qi = 1 − pi;
B: maximum estimation error;
z_α: coordinate of the standard normal distribution, at the significance level α.

7.4.4.1.4 Sample Size to Estimate the Proportion of a Finite Population
If the variable chosen is binary and the population finite, the number of clusters drawn in the first stage (m), where P(|p̂ − p| ≤ B) = 1 − α, can be calculated as follows:

m = Σ_{i=1}^{M} (Ni/N̄)·pi·qi / [M·B²·N̄²/z_α² + (1/M)·Σ_{i=1}^{M} (Ni/N̄)·pi·qi]   (7.21)

where:
M: number of clusters into which the population was divided;
Ni: size of cluster i (i = 1, 2, ..., M);
N̄ = N/M (average size of the population clusters);
pi: proportion of elements that have the characteristic desired in cluster i;
qi = 1 − pi;
B: maximum estimation error;
z_α: coordinate of the standard normal distribution, at the significance level α.

7.4.4.2 Size of a Two-Stage Cluster Sample
In this case, we assume that all the clusters are the same size. Based on Bolfarine and Bussab (2005), let's consider the following linear cost function:

C = c1·n + c2·b   (7.22)

where:
c1: observation cost of one unit from the first stage;
c2: observation cost of one unit from the second stage;
n: sample size in the first stage;
b: sample size in the second stage.

The optimal size of b that minimizes the linear cost function is given by:

b* = (σdc/σec)·√(c1/c2)   (7.23)


Example 7.15: Calculating the Size of a Cluster Sample
Consider the members of a certain club in Sao Paulo (N = 4500). We would like to estimate the average evaluation score (0 to 10) given by these members regarding the main features of the club. The population is divided into 10 groups of 450 elements each, based on their membership number. The estimates of the mean and of the population variance per group, based on previous surveys, can be seen in Table 7.E.8. Assuming that the cluster sampling is based on a single stage, determine the number of clusters that must be drawn, considering B = 2% and α = 1%.

TABLE 7.E.8 Mean and Population Variance per Group

i      1      2      3      4      5      6      7      8      9      10
μi     7.4    6.6    8.1    7.0    6.7    7.3    8.1    7.5    6.2    6.9
σi²    22.5   36.7   29.6   33.1   40.8   51.7   39.7   30.6   40.5   42.7

Solution
From the data given to us, we have: N = 4500, M = 10, N̄ = 4500/10 = 450, B = 0.02, and z_α = 2.575.
Since all the clusters are the same size, the calculation of σdc² and σec² is given by:

σdc² = (1/M)·Σ_{i=1}^{M} σi² = (22.5 + 36.7 + … + 42.7)/10 = 36.79

σec² = (1/M)·Σ_{i=1}^{M} (μi − μ)² = [(7.4 − 7.18)² + … + (6.9 − 7.18)²]/10 = 0.35

Therefore, σc² = σdc² + σec² = 36.79 + 0.35 = 37.14.
The number of clusters to be drawn in one stage, for a finite population, is given by Expression (7.19):

m = M·σc² / [M·B²·N̄²/z_α² + σc²] = (10 × 37.14) / (10 × 0.02² × 450²/2.575² + 37.14) = 2.33 ≅ 3

Therefore, the population of N = 4500 members is divided into M = 10 clusters of the same size (Ni = 450, i = 1, ..., 10). From the total number of clusters, we must randomly draw m = 3 clusters. In one-stage cluster sampling, all the elements of each cluster drawn constitute the global sample (n = 450 × 3 = 1350).

From the sample selected, we can infer, with a 99% confidence level, that the difference between the sample mean and the real population mean will be 2%, at the most. Table 7.1 shows a summary of the expressions used to calculate the sample size for the mean (a quantitative variable) and the proportion (a binary variable) of finite and infinite populations, with a maximum estimation error B, and for each type of random sampling (simple, systematic, stratified, and cluster).
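Before moving on, a quick numeric check of Example 7.15 in Python (plain lists and the formulas above; the value 2.575 for z_α at α = 1% is taken from the example, and the variable names are ours):

```python
# Group means and variances from Table 7.E.8
means = [7.4, 6.6, 8.1, 7.0, 6.7, 7.3, 8.1, 7.5, 6.2, 6.9]
variances = [22.5, 36.7, 29.6, 33.1, 40.8, 51.7, 39.7, 30.6, 40.5, 42.7]

M, N_bar, B, z = 10, 450, 0.02, 2.575
mu = sum(means) / M                                      # overall mean (= 7.18)

var_within = sum(variances) / M                          # Expression (7.15) -> 36.79
var_between = sum((m - mu) ** 2 for m in means) / M      # Expression (7.16) -> ~0.35
var_total = var_within + var_between                     # sigma_c^2

m_clusters = M * var_total / (M * B**2 * N_bar**2 / z**2 + var_total)   # Expression (7.19)
print(round(var_within, 2), round(var_between, 2), round(m_clusters, 2))  # ~36.79 0.35 2.33
```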

7.5 FINAL REMARKS

It is rarely possible to obtain the exact distribution of a variable when we select all the elements of the population, due to the high costs, the time needed, and the difficulties in collecting the data. Therefore, the alternative is to select part of the elements of the population (sample) and, after that, infer the properties for the whole (population). Since the sample must be a good representative of the population, choosing the sampling technique is essential in this process. Sampling techniques can be classified in two major groups: probability or random sampling and nonprobability or nonrandom sampling. Among the main random sampling techniques, we can highlight simple random sampling (with and without replacement), systematic, stratified, and cluster. The main nonrandom sampling techniques are convenience, judgmental or purposive, quota, and snowball sampling. Each one of these techniques has advantages and disadvantages, and choosing the best technique must take the characteristics of each study into consideration. This chapter also discussed how to calculate the sample size for the mean and the proportion of finite and infinite populations, for each type of random sampling. In the case of nonrandom samples, the researcher must either establish the sample size based on a possible budget or adopt a certain dimension that has already been used successfully in previous studies with similar characteristics. Another alternative would be to calculate the size of a random sample and use it as a reference.


TABLE 7.1 Expressions to Calculate the Size of Random Samples

Simple and systematic random sampling:
  Mean, infinite population:        n = σ²/(B²/z_α²)   (7.5)
  Mean, finite population:          n = N·σ²/[(N − 1)·B²/z_α² + σ²]   (7.6)
  Proportion, infinite population:  n = p·q/(B²/z_α²)   (7.7)
  Proportion, finite population:    n = N·p·q/[(N − 1)·B²/z_α² + p·q]   (7.8)

Stratified sampling:
  Mean, infinite population:        n = Σ_{i=1}^{k} Wi·σi²/(B²/z_α²)   (7.9)
  Mean, finite population:          n = Σ_{i=1}^{k} (Ni²·σi²/Wi)/[N²·B²/z_α² + Σ_{i=1}^{k} Ni·σi²]   (7.10)
  Proportion, infinite population:  n = Σ_{i=1}^{k} Wi·pi·qi/(B²/z_α²)   (7.11)
  Proportion, finite population:    n = Σ_{i=1}^{k} (Ni²·pi·qi/Wi)/[N²·B²/z_α² + Σ_{i=1}^{k} Ni·pi·qi]   (7.12)

One-stage cluster sampling:
  Mean, infinite population:        m = σc²/(B²/z_α²)   (7.17)
  Mean, finite population:          m = M·σc²/[M·B²·N̄²/z_α² + σc²]   (7.19)
  Proportion, infinite population:  m = (1/M)·Σ_{i=1}^{M} (Ni/N̄)·pi·qi/(B²/z_α²)   (7.20)
  Proportion, finite population:    m = Σ_{i=1}^{M} (Ni/N̄)·pi·qi/[M·B²·N̄²/z_α² + (1/M)·Σ_{i=1}^{M} (Ni/N̄)·pi·qi]   (7.21)

(In systematic sampling, the same expressions as in simple random sampling are used.)

7.6 EXERCISES
1) Why is sampling important?
2) What are the differences between random and nonrandom sampling techniques? In what cases must they be used?
3) What is the difference between stratified and cluster sampling?
4) What are the advantages and limitations of each sampling technique?
5) What type of sampling is used in the EuroMillions Lottery?
6) To verify if a part meets certain quality specification demands, from every batch with 150 parts produced, we randomly pick a unit and inspect all the quality characteristics. What type of sampling should be used in this case?
7) Assume that the population of the city of Youngstown (OH) is divided by educational level. Thus, for each level, a percentage of the population will be interviewed. What type of sampling should be used in this case?
8) In a production line, one batch with 1500 parts is produced every hour. From each batch, we randomly pick a sample with 125 units. In each sample unit, we inspect all the quality characteristics to check whether the part is defective or not. What type of sampling should be used in this case?
9) The population of the city of Sao Paulo is divided into 96 districts. From this total, 24 districts will be randomly drawn and, for each one of them, a small sample of the population will be interviewed in a public opinion survey. What type of sampling should be used in this case?
10) We would like to estimate the illiteracy rate in a municipality with 4000 inhabitants who are 15 or over 15 years of age. Based on previous surveys, we can estimate that p̂ = 0.24. A random sample will be drawn from the population. Assuming a maximum estimation error of 5% and a 95% confidence level, what should the sample size be?
11) The population of a certain municipality with 120,000 inhabitants is divided into five regions (North, South, Center, East, and West). The table shows the number of inhabitants per region. A random sample will be collected in each

Region    Inhabitants
North     14,060
South     19,477
Center    36,564
East      26,424
West      23,475


region in order to estimate the average age of its inhabitants. The samples selected must be proportional to the number of inhabitants per region. Based on pilot samples, we obtain the following estimates for the variances in the five regions: 44.5 (North), 59.3 (South), 82.4 (Center), 66.2 (East), and 69.5 (West). Determine the sample size, considering an estimation error of 0.6 and a 99% confidence level.
12) Consider a municipality with 120,000 inhabitants. We would like to estimate the percentage of the population that lives in urban and rural areas. The sampling plan used divides the municipality into 85 districts of different sizes. From all the districts, we would like to select some and, for each district chosen, all the inhabitants will be selected. The file Districts.xls shows the size of each district, as well as the estimated percentage of the urban and rural population. Determine the total number of districts to be drawn assuming a maximum estimation error of 10% and a 90% confidence level.

Chapter 8
Estimation

A comprehensive study of nature is the most fruitful source of mathematical discoveries.
Joseph Fourier

8.1 INTRODUCTION

As previously described, statistical inference has as its main objective to draw conclusions about the population based on data obtained from the sample. The sample must be representative of the population. One of the most important goals of statistical inference is the estimation of population parameters, which is the main goal of this chapter. For Bussab and Morettin (2011), a parameter can be defined as a function of a set of population values; a statistic as a function of a set of sample values; and an estimate as the value assumed by the estimator in a certain sample. Parameters can be estimated through a single point (point estimation) or through an interval of values (interval estimation). The main point estimation methods are the method of moments, ordinary least squares, and maximum likelihood estimation. Conversely, the main interval estimation methods or confidence intervals (CI) are the CI for the population mean when the variance is known, the CI for the population mean when the variance is unknown, the CI for the population variance, and the CI for the proportion.

8.2 POINT AND INTERVAL ESTIMATION

Population parameters can therefore be estimated through a single point or through an interval of values. As examples of population parameter estimators (point and interval), we can mention the mean, the variance, and the proportion.

8.2.1 Point Estimation

Point estimation is used when we want to estimate a single value of the population parameter we are interested in. The population parameter estimate is calculated from a sample. Hence, the sample mean (x̄) is a point estimate of the real population mean (μ). Analogously, the sample variance (S²) is a point estimate of the population parameter (σ²), just as the sample proportion (p̂) is a point estimate of the population proportion (p).
Example 8.1: Point Estimation
Consider a luxury condominium with 702 lots. We would like to estimate the average size of the lots, their variance, as well as the proportion of lots for sale. In order to do that, a random sample with 60 lots is collected, revealing an average size of 1750 m² per lot, a variance of 420 m², and a proportion of 8% of the lots for sale. Thus:
(a) x̄ = 1750 is a point estimate of the real population mean (μ);
(b) S² = 420 is a point estimate of the real population variance (σ²); and
(c) p̂ = 0.08 is a point estimate of the real population proportion (p).

8.2.2 Interval Estimation

Interval estimation is used when we are interested in finding an interval of possible values in which the estimated parameter is located, with a certain confidence level (1 − α), where α is the significance level.
Example 8.2: Interval Estimation
Consider the information in Example 8.1. However, instead of using a point estimate of the population parameter, let's use an interval estimate:
(a) The [1700–1800] interval contains the average size of the 702 condominium lots, with a 99% confidence level;
(b) With a 95% confidence level, the [400–440] interval contains the population variance of the size of the lots;
(c) The [6%–10%] interval contains the proportion of lots for sale in the condominium, with 90% confidence.

8.3 POINT ESTIMATION METHODS

The main point estimation methods are the method of moments, ordinary least squares, and maximum likelihood estimation.

8.3.1 Method of Moments

In the method of moments, the population parameters are estimated from sample estimators such as, for example, the sample mean and the sample variance. Consider a random variable X with probability density function (p.d.f.) f(x). Assume that X1, X2, ..., Xn is a random sample of size n drawn from population X. For k = 1, 2, ..., the k-th population moment of distribution f(x) is:

E(X^k)   (8.1)

Consider the random variable X with a p.d.f. f(x). Assume that X1, X2, ..., Xn is a random sample of size n drawn from population X. For k = 1, 2, ..., the k-th sample moment of distribution f(x) is:

Mk = (1/n)·Σ_{i=1}^{n} Xi^k   (8.2)

The estimation procedure of the method of moments is as follows. Assume that X is a random variable with a p.d.f. f(x, θ1, ..., θm), in which θ1, ..., θm are population parameters whose values are unknown. A random sample X1, X2, ..., Xn is drawn from population X. The moment estimators θ̂1, …, θ̂m are obtained by matching the first m sample moments to the corresponding m population moments and by solving the resulting equations for θ1, ..., θm. Thus, the first population moment is:

E(X) = μ   (8.3)

And the first sample moment is:

M1 = X̄ = (1/n)·Σ_{i=1}^{n} Xi   (8.4)

By matching the population and the sample moments, we have μ̂ = X̄. Therefore, the sample mean is the moment estimator of the population mean. Table 8.1 shows how to calculate E(X) and Var(X) for different probability distributions, as also studied in Chapter 6.


TABLE 8.1 Calculating E(X) and Var(X) for Different Probability Distributions

Distribution                   E(X)        Var(X)        M
Normal [X ~ N(μ, σ²)]          μ           σ²            2
Binomial [X ~ b(n, p)]         np          np(1 − p)     1
Poisson [X ~ Poisson(λ)]       λ           λ             1
Uniform [X ~ U(a, b)]          (a + b)/2   (b − a)²/12   2
Exponential [X ~ exp(λ)]       1/λ         1/λ²          1
Gamma [X ~ Gamma(a, λ)]        aλ          aλ²           2

Example 8.3: Method of Moments
Assume that a certain random variable X follows an exponential distribution with parameter λ. A random sample of 10 units is drawn from the population, whose data can be seen in Table 8.E.1. Calculate the moment estimate of λ.

TABLE 8.E.1 Data Obtained From the Sample

5.4   9.8   6.3   7.9   9.2   10.7   12.5   15.0   13.9   17.2

Solution
We have E(X) = X̄. For an exponential distribution, since E(X) = 1/λ, we have 1/λ = X̄. Therefore, the moment estimator of λ is given by λ̂ = 1/X̄. For the data in Example 8.3, since X̄ = 10.79, the moment estimate of λ is:

λ̂ = 1/X̄ = 1/10.79 = 0.093
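The same estimate in a few lines of Python (statistics is from the standard library; the variable names are ours):

```python
from statistics import mean

sample = [5.4, 9.8, 6.3, 7.9, 9.2, 10.7, 12.5, 15.0, 13.9, 17.2]  # Table 8.E.1

x_bar = mean(sample)            # first sample moment, M1 = 10.79
lambda_hat = 1 / x_bar          # moment estimator of the exponential rate
print(round(x_bar, 2), round(lambda_hat, 3))   # 10.79 0.093
```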

8.3.2 Ordinary Least Squares

A simple linear regression model is given by the following expression:

Yi = α + β·Xi + μi,  i = 1, 2, …, n   (8.5)

where:
Yi is the i-th observed value of the dependent variable;
α is the linear coefficient of the straight line (constant);
β is the angular coefficient of the straight line (slope);
Xi is the i-th observed value of the explanatory variable;
μi is the random error term of the linear relationship between Y and X.

Since parameters α and β of the regression model are unknown, we would like to estimate them by using the regression line:

Ŷi = a + b·Xi   (8.6)

where:
Ŷi is the i-th value estimated or predicted by the model;
a and b are the estimates of parameters α and β of the regression model;
Xi is the i-th observed value of the explanatory variable.


However, the observed values Yi are not always equal to the values Ŷi estimated by the regression model. The difference between the observed value and the estimated value for the i-th observation is the error term μi:

μi = Yi − Ŷi   (8.7)

Thus, the ordinary least squares method is used to determine the best straight line that fits the points of a diagram; that is, the method consists in estimating a and b so that the sum of the squared residuals is as small as possible:

min Σ_{i=1}^{n} μi² = Σ_{i=1}^{n} (Yi − a − b·Xi)²

The calculation of the estimators is given by:

b = Σ_{i=1}^{n} (Yi − Ȳ)(Xi − X̄) / Σ_{i=1}^{n} (Xi − X̄)² = (Σ_{i=1}^{n} Yi·Xi − n·X̄·Ȳ) / (Σ_{i=1}^{n} Xi² − n·X̄²)   (8.8)

a = Ȳ − b·X̄   (8.9)

In Chapter 13, we will study the estimation of a linear regression model by ordinary least squares in more detail.
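A minimal numeric sketch of Expressions (8.8) and (8.9). The toy data are invented for illustration only; in practice these estimates would typically come from packages such as statsmodels, or from the procedures of Chapter 13.

```python
# Toy data, for illustration only
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [2.1, 4.3, 6.2, 8.1, 9.9]

n = len(X)
x_bar = sum(X) / n
y_bar = sum(Y) / n

# Expression (8.8): slope estimate
b = sum((y - y_bar) * (x - x_bar) for x, y in zip(X, Y)) / sum((x - x_bar) ** 2 for x in X)
# Expression (8.9): intercept estimate
a = y_bar - b * x_bar

print(round(a, 3), round(b, 3))   # intercept and slope of the fitted line
```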

8.3.3 Maximum Likelihood Estimation

Maximum likelihood estimation is one of the procedures used to estimate the parameters of a model from the probability distribution of the variable that represents the phenomenon being studied. These parameters are chosen in order to maximize the likelihood function, which is the objective function of the underlying optimization problem (Fávero, 2015).
Consider a random variable X with a probability density function f(x, θ), in which the vector θ = (θ1, θ2, …, θk) is unknown. A random sample X1, X2, …, Xn of size n is drawn from population X; consider x1, x2, …, xn the values effectively observed. The likelihood function L associated with X is the joint probability density function given by the product of the densities of each of the observations:

L(θ; x1, x2, …, xn) = f(x1, θ) × f(x2, θ) × ⋯ × f(xn, θ) = ∏_{i=1}^{n} f(xi, θ)   (8.10)

The maximum likelihood estimator is the vector θ̂ that maximizes the likelihood function.
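For the exponential distribution, maximizing the (log-)likelihood gives λ̂ = 1/X̄, the same value the method of moments produced in Example 8.3. A small numerical sketch (scipy's scalar minimizer applied to the negative log-likelihood; the bounds and names are our own choices):

```python
import numpy as np
from scipy.optimize import minimize_scalar

sample = np.array([5.4, 9.8, 6.3, 7.9, 9.2, 10.7, 12.5, 15.0, 13.9, 17.2])

def neg_log_likelihood(lam):
    # Exponential density f(x; lambda) = lambda * exp(-lambda * x), x > 0
    return -np.sum(np.log(lam) - lam * sample)

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 10), method="bounded")
print(round(res.x, 3), round(1 / sample.mean(), 3))   # both approximately 0.093
```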

8.4 INTERVAL ESTIMATION OR CONFIDENCE INTERVALS

In Section 8.3, the population parameters that interested us were estimated through a single value (point estimation). The main limitation of point estimation is that, when a parameter is estimated through a single point, all the data information is summarized by this single numeric value. As an alternative, we can use interval estimation. Thus, instead of estimating the population parameter through a single point, an interval of likely estimates is given to us. Therefore, we define an interval of values that will contain the true population parameter, with a certain confidence level (1 − α), where α is the significance level.
Consider θ̂ an estimator of population parameter θ. An interval estimate for θ is obtained through the interval ]θ − k; θ + k[, such that P(θ − k < θ̂ < θ + k) = 1 − α.

8.4.1 Confidence Interval for the Population Mean (μ)

Estimating the population mean from a sample involves two cases: when the population variance (σ²) is known and when it is unknown.


FIG. 8.1 Standard normal distribution.

8.4.1.1 Known Population Variance (σ²)
Let X be a random variable with a normal distribution, mean μ, and known variance σ², that is, X ~ N(μ, σ²). Therefore, we have:

Z = (X̄ − μ)/(σ/√n) ~ N(0, 1)   (8.11)

that is, variable Z has a standard normal distribution. Consider that the probability of variable Z assuming values between −zc and zc is 1 − α, so the critical values −zc and zc are obtained from the standard normal distribution table (Table E in the Appendix), as shown in Fig. 8.1. NR and CR mean the nonrejection region and the critical region of the distribution, respectively. Therefore, we have:

P(−zc < Z < zc) = 1 − α   (8.12)

or:

P(−zc < (X̄ − μ)/(σ/√n) < zc) = 1 − α   (8.13)

Thus, the confidence interval for μ is:

P(X̄ − zc·σ/√n < μ < X̄ + zc·σ/√n) = 1 − α   (8.14)

Example 8.4: CI for the Population Mean When the Variance Is Known
We would like to estimate the average processing time of a certain part, with a 95% confidence interval. We know that σ = 1.2. In order to do that, a random sample with 400 parts was collected, obtaining a sample mean of X̄ = 5.4. Construct a 95% confidence interval for the true population mean.
Solution
We have σ = 1.2, n = 400, X̄ = 5.4, and CI = 95% (α = 5%). The critical values −zc and zc for α = 5% can be obtained from Table E in the Appendix (Fig. 8.2). Applying Expression (8.14):

P(5.4 − 1.96 × 1.2/√400 < μ < 5.4 + 1.96 × 1.2/√400) = 95%

that is:

FIG. 8.2 Critical values of zc and zc.


P(5.28 < μ < 5.52) = 95%

Therefore, the [5.28; 5.52] interval contains the average population value with 95% confidence.
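The same interval computed with scipy (norm.interval returns the two bounds directly; the numbers are those of Example 8.4):

```python
from math import sqrt
from scipy.stats import norm

sigma, n, x_bar, conf = 1.2, 400, 5.4, 0.95

# CI for the mean with known population variance, Expression (8.14)
lower, upper = norm.interval(conf, loc=x_bar, scale=sigma / sqrt(n))
print(round(lower, 2), round(upper, 2))   # 5.28 5.52
```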

8.4.1.2 Unknown Population Variance (σ²)
Let X be a random variable with a normal distribution, mean μ, and unknown variance σ², that is, X ~ N(μ, σ²). Since the variance is unknown, it is necessary to use an estimator (S²) instead of σ², which results in another random variable:

T = (X̄ − μ)/(S/√n) ~ t_{n−1}   (8.15)

that is, variable T follows Student's t-distribution with n − 1 degrees of freedom. Consider that the probability of variable T assuming values between −tc and tc is 1 − α, so the critical values −tc and tc are obtained from Student's t-distribution table (Table B in the Appendix), as shown in Fig. 8.3. Therefore, we have:

P(−tc < T < tc) = 1 − α   (8.16)

or:

P(−tc < (X̄ − μ)/(S/√n) < tc) = 1 − α   (8.17)

Therefore, the confidence interval for μ is:

P(X̄ − tc·S/√n < μ < X̄ + tc·S/√n) = 1 − α   (8.18)
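A small scipy sketch of Expression (8.18): t.interval does the work, using n − 1 degrees of freedom. The numbers plugged in are those of Example 8.5, which is worked by hand next.

```python
from math import sqrt
from scipy.stats import t

n, x_bar, s2, conf = 25, 78, 36, 0.95

# CI for the mean with unknown population variance, Expression (8.18)
lower, upper = t.interval(conf, df=n - 1, loc=x_bar, scale=sqrt(s2) / sqrt(n))
print(round(lower, 1), round(upper, 1))   # 75.5 80.5
```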

Example 8.5: CI for the Population Mean When the Variance Is Unknown
We would like to estimate the average weight of a given population, with a 95% confidence interval. The random variable analyzed has a normal distribution with mean μ and unknown variance σ². We pick a sample with 25 individuals from the population and

FIG. 8.3 Student’s t-distribution.

FIG. 8.4 Critical values of Student’s t-distribution.

calculate the sample mean (X̄ = 78) and the sample variance (S² = 36). Determine the interval that contains the average weight of the population.
Solution
Since the variance is unknown, we use the estimator S², which results in the variable T that follows Student's t-distribution. The critical values −tc and tc, obtained from Table B in the Appendix for a significance level of α = 5% and 24 degrees of freedom, can be seen in Fig. 8.4. Applying Expression (8.18):

P(78 − 2.064 × 6/√25 < μ < 78 + 2.064 × 6/√25) = 95%

that is:

P(75.5 < μ < 80.5) = 95%

Therefore, the [75.5; 80.5] interval contains the average population weight with 95% confidence.

8.4.2 Confidence Interval for Proportions

Consider X a random variable that represents whether a characteristic that interests us is present in the population or not. Thus, X follows a binomial distribution with parameter p, in which p represents the probability of an element in the population presenting the characteristic we are interested in: X ~ b(1, p), with mean μ = p and variance σ² = p(1 − p). A random sample X1, X2, …, Xn of size n is drawn from the population. Consider k the number of sample elements with the characteristic we are interested in. The estimator of the population proportion p is given by:

p̂ = k/n   (8.19)

If n is large, we can consider that the sample proportion p̂ follows, approximately, a normal distribution with mean p and variance p(1 − p)/n:

p̂ ~ N(p, p(1 − p)/n)   (8.20)

We consider that the variable Z = (p̂ − p)/√(p(1 − p)/n) ~ N(0, 1). Since n is large, we can substitute p̂ for p:

Z = (p̂ − p)/√(p̂(1 − p̂)/n) ~ N(0, 1)   (8.21)

Consider that the probability of variable Z assuming values between −zc and zc is 1 − α, so the critical values −zc and zc are obtained from the standard normal distribution table (Table E in the Appendix), as shown in Fig. 8.1. Thus, we have:

P(−zc < Z < zc) = 1 − α   (8.22)

or:

P(−zc < (p̂ − p)/√(p̂(1 − p̂)/n) < zc) = 1 − α   (8.23)


Therefore, the confidence interval for p is:

P(p̂ − zc·√(p̂(1 − p̂)/n) < p < p̂ + zc·√(p̂(1 − p̂)/n)) = 1 − α   (8.24)

Example 8.6: CI for Proportions
A factory found 230 defective products in one batch with 1000 parts. Construct a 95% confidence interval for the true proportion of defective products.
Solution
n = 1000
p̂ = k/n = 230/1000 = 0.23
zc = 1.96
Therefore, Expression (8.24) can be written as:

P(0.23 − 1.96 × √(0.23 × 0.77/1000) < p < 0.23 + 1.96 × √(0.23 × 0.77/1000)) = 95%
P(0.204 < p < 0.256) = 95%

Thus, the [20.4%; 25.6%] interval contains the true proportion of defective products with 95% confidence.
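The same interval via the normal approximation in Python. It is written from first principles to mirror Expression (8.24); statsmodels' proportion_confint would be an alternative ready-made helper.

```python
from math import sqrt
from scipy.stats import norm

k, n, conf = 230, 1000, 0.95
p_hat = k / n
z = norm.ppf(1 - (1 - conf) / 2)

half_width = z * sqrt(p_hat * (1 - p_hat) / n)   # Expression (8.24)
print(round(p_hat - half_width, 3), round(p_hat + half_width, 3))   # 0.204 0.256
```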

8.4.3 Confidence Interval for the Population Variance

Let Xi be a random variable with a normal distribution, mean μ, and variance σ², that is, Xi ~ N(μ, σ²). An estimator for σ² is the sample variance S². Thus, the random variable Q has a chi-square distribution with n − 1 degrees of freedom:

Q = (n − 1)·S²/σ² ~ χ²_{n−1}   (8.25)

FIG. 8.5 Chi-square distribution.

Consider that the probability of variable Q assuming values between χ²_low and χ²_upp is 1 − α, so the critical values χ²_low and χ²_upp are obtained from the chi-square distribution table (Table D in the Appendix), as shown in Fig. 8.5. Therefore, we have:

P(χ²_low < χ²_{n−1} < χ²_upp) = 1 − α   (8.26)

or:

P(χ²_low < (n − 1)·S²/σ² < χ²_upp) = 1 − α   (8.27)


Therefore, the confidence interval for σ² is:

P((n − 1)·S²/χ²_upp < σ² < (n − 1)·S²/χ²_low) = 1 − α   (8.28)

Example 8.7: CI for the Population Variance
Consider the population of Business Administration students at a public university whose variable of interest is the students' ages. A sample with 101 students was obtained from the normal population and provided S² = 18.22. Construct a 90% confidence interval for the population variance.
Solution
From the χ² distribution table (Table D in the Appendix), for 100 degrees of freedom, we have:
χ²_low = 77.929
χ²_upp = 124.342
Therefore, Expression (8.28) can be written as follows:

P(100 × 18.22/124.342 < σ² < 100 × 18.22/77.929) = 90%
P(14.65 < σ² < 23.38) = 90%

Thus, the [14.65; 23.38] interval contains the true population variance with 90% confidence.
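And a scipy check of Example 8.7 (chi2.ppf gives the two critical values used above):

```python
from scipy.stats import chi2

n, s2, conf = 101, 18.22, 0.90
alpha = 1 - conf
df = n - 1

chi_low = chi2.ppf(alpha / 2, df)        # ~77.93
chi_upp = chi2.ppf(1 - alpha / 2, df)    # ~124.34

lower = df * s2 / chi_upp                # Expression (8.28)
upper = df * s2 / chi_low
print(round(lower, 2), round(upper, 2))  # 14.65 23.38
```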

8.5 FINAL REMARKS

Statistical inference is divided into three main parts: sampling, estimation of population parameters, and hypotheses tests. This chapter discussed estimation methods. There are point and interval methods for estimating population parameters. Among the main point estimation methods, we can highlight the method of moments, ordinary least squares, and maximum likelihood estimation. Among the main interval estimation methods, we studied the confidence interval (CI) for the population mean (when the variance is known and when it is unknown), the CI for proportions, and the CI for the population variance.

8.6 EXERCISES

1) We would like to estimate the average age of a population that follows a normal distribution and has a standard deviation σ = 18. In order to do that, a sample with 120 individuals was drawn from the population and the mean obtained was 51 years old. Construct a 90% confidence interval for the true population mean.
2) We would like to estimate the average income of a certain population with a normal distribution and an unknown variance. A sample with 36 individuals was drawn from the population, presenting a mean of X̄ = 5400 and a standard deviation S = 200. Construct a 95% confidence interval for the population mean.
3) We would like to estimate the illiteracy rate of a certain municipality. A sample with 500 inhabitants was drawn from the population, presenting an illiteracy rate of 24%. Construct a 95% confidence interval for the proportion of illiterate individuals in the municipality.
4) We would like to estimate the variability of the average time in rendering services to customers in a bank branch. A sample with 61 customers was drawn from the population with a normal distribution and it gave us S² = 8. Construct a 95% confidence interval for the population variance.

Chapter 9
Hypotheses Tests

We must conduct research and then accept the results. If they don't stand up to experimentation, Buddha's own words must be rejected.
Tenzin Gyatso, 14th Dalai Lama

9.1 INTRODUCTION

As discussed previously, one of the problems to be solved by statistical inference is hypotheses testing. A statistical hypothesis is an assumption about a certain population parameter, such as the mean, the standard deviation, the correlation coefficient, etc. A hypothesis test is a procedure to decide the veracity or falsehood of a certain hypothesis. In order for a statistical hypothesis to be validated or rejected with accuracy, it would be necessary to examine the entire population, which in practice is not viable. As an alternative, we draw a random sample from the population we are interested in. Since the decision is made based on the sample, errors may occur (rejecting a hypothesis when it is true or not rejecting a hypothesis when it is false), as we will study later on. The procedures and concepts necessary to construct a hypothesis test are presented next.
Let's consider X a variable associated with a population and θ a certain parameter of this population. We must define the hypothesis to be tested about parameter θ of this population, which is called the null hypothesis:

H0: θ = θ0   (9.1)

Let's also define the alternative hypothesis (H1), in case H0 is rejected, which can be characterized as follows:

H1: θ ≠ θ0   (9.2)

and the test is called a bilateral test (or two-tailed test). The significance level of a test (α) represents the probability of rejecting the null hypothesis when it is true (it is one of the two errors that may occur, as we will see later). The critical region (CR) or rejection region (RR) of a bilateral test is represented by two tails of the same size, in the left and right extremities of the distribution curve, each corresponding to half of the significance level α, as shown in Fig. 9.1. Another way to define the alternative hypothesis (H1) would be:

H1: θ < θ0   (9.3)

and the test is called a unilateral test to the left (or left-tailed test). In this case, the critical region is in the left tail of the distribution and corresponds to the significance level α, as shown in Fig. 9.2. Or the alternative hypothesis could be:

FIG. 9.1 Critical region (CR) of a bilateral test, also emphasizing the nonrejection region (NR) of the null hypothesis.


H1: θ > θ0   (9.4)

and the test is called a unilateral test to the right (or right-tailed test). In this case, the critical region is in the right tail of the distribution and corresponds to the significance level α, as shown in Fig. 9.3. Thus, if the main objective is to check whether a parameter is significantly higher or lower than a certain value, we have to use a unilateral test. On the other hand, if the objective is to check whether a parameter is different from a certain value, we have to use a bilateral test. After defining the null hypothesis to be tested, we decide whether or not to reject it based on a random sample collected from the population. Since the decision is made based on the sample, two types of errors may happen:
Type I error: rejecting the null hypothesis when it is true. The probability of this type of error is represented by α:

P(type I error) = P(rejecting H0 | H0 is true) = α   (9.5)

Type II error: not rejecting the null hypothesis when it is false. The probability of this type of error is represented by β:

P(type II error) = P(not rejecting H0 | H0 is false) = β   (9.6)

Table 9.1 shows the types of errors that may happen in a hypothesis test. The procedure for defining hypotheses tests includes the following phases: Step Step Step Step Step Step

1: Choosing the most suitable statistical test, depending on the researcher’s intention. 2: Presenting the test’s null hypothesis H0 and its alternative hypothesis H1. 3: Setting the significance level a. 4: Calculating the value observed of the statistic based on the sample obtained from the population. 5: Determining the test’s critical region based on the value of a set in Step 3. 6: Decision: if the value of the statistic lies in the critical region, reject H0. Otherwise, do not reject H0.

According to Fávero et al. (2009), most statistical software packages, among them SPSS and Stata, calculate the P-value, which corresponds to the probability associated with the value of the statistic calculated from the sample. The P-value indicates the lowest observed significance level that would lead to the rejection of the null hypothesis. Thus, we reject H0 if P ≤ α.

FIG. 9.2 Critical region (CR) of a left-tailed test, also emphasizing the nonrejection region of the null hypothesis (NR).

FIG. 9.3 Critical region (CR) of a right-tailed test.

TABLE 9.1 Types of Errors

Decision            H0 Is True                 H0 Is False
Not rejecting H0    Correct decision (1 − α)   Type II error (β)
Rejecting H0        Type I error (α)           Correct decision (1 − β)
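To make the two error probabilities concrete, the sketch below (not part of the original text; it assumes a Python environment with numpy and scipy, and the population values are illustrative) simulates many samples drawn under a true H0 and applies a bilateral z-test at α = 0.05. The fraction of wrong rejections should stay close to α, the Type I error rate.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
mu0, sigma, n, alpha = 50.0, 10.0, 40, 0.05     # illustrative H0 parameters
z_crit = norm.ppf(1 - alpha / 2)                # two-tailed critical value (about 1.96)

trials, rejections = 10_000, 0
for _ in range(trials):
    sample = rng.normal(mu0, sigma, n)          # data generated with H0 true
    z = (sample.mean() - mu0) / (sigma / np.sqrt(n))
    if abs(z) > z_crit:                         # statistic falls in the critical region
        rejections += 1

print(rejections / trials)                      # close to alpha, i.e., about 0.05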


If we use the P-value instead of the statistic's critical value, Steps 5 and 6 of the construction of the hypotheses tests become:

Step 5: Determine the P-value that corresponds to the probability associated with the value of the statistic calculated in Step 4.
Step 6: Decision: if the P-value is less than the significance level α established in Step 3, reject H0. Otherwise, do not reject H0.

9.2

PARAMETRIC TESTS

Hypotheses tests are divided into parametric and nonparametric tests. In this chapter, we will study parametric tests; nonparametric tests will be studied in the next chapter. Parametric tests involve population parameters. A parameter is any numerical measure or quantitative characteristic that describes a population. Parameters are fixed values, usually unknown, and represented by Greek characters, such as the population mean (μ), the population standard deviation (σ), and the population variance (σ²), among others. When hypotheses are formulated about population parameters, the hypothesis test is called parametric. In nonparametric tests, hypotheses are formulated about qualitative characteristics of the population. Parametric methods are therefore applied to quantitative data and require strong assumptions in order to be valid, including:

(i) The observations must be independent;
(ii) The sample must be drawn from populations with a certain distribution, usually normal;
(iii) The populations must have equal variances for the comparison tests of two paired population means or of k population means (k ≥ 3);
(iv) The variables being studied must be measured on an interval or ratio scale, so that arithmetic operations can be applied to their values.

We will study the main parametric tests, including tests for normality, homogeneity of variance tests, Student's t-test and its applications, in addition to the analysis of variance (ANOVA) and its extensions. All of them will be solved analytically and also through the statistical software packages SPSS and Stata. To verify the univariate normality of the data, the most common tests are the Kolmogorov-Smirnov and Shapiro-Wilk tests. To compare variance homogeneity between populations, we have Bartlett's χ² (1937), Cochran's C (1947a,b), Hartley's Fmax (1950), and Levene's F (1960) tests. We will describe Student's t-test for three situations: to test hypotheses about one population mean, to compare two independent means, and to compare two paired means. ANOVA is an extension of Student's t-test and is used to compare the means of more than two populations. In this chapter, one-way ANOVA, two-way ANOVA, and its extension to more than two factors will be described.

9.3

UNIVARIATE TESTS FOR NORMALITY

Among all univariate tests for normality, the most common are Kolmogorov-Smirnov, Shapiro-Wilk, and Shapiro-Francia.

9.3.1

Kolmogorov-Smirnov Test

The Kolmogorov-Smirnov test (K-S) is an adherence (goodness-of-fit) test; that is, it compares the cumulative frequency distribution of a set of sample values (observed values) to a theoretical distribution. The main goal is to test whether the sample values come from a population with a supposed theoretical or expected distribution, in this case, the normal distribution. The statistic is given by the point with the largest difference (in absolute value) between the two distributions. To use the K-S test, the population mean and standard deviation must be known. For small samples the test loses power, so it should be used with large samples (n ≥ 30). The K-S test assumes the following hypotheses:

H0: the sample comes from a population with distribution N(μ, σ)
H1: the sample does not come from a population with distribution N(μ, σ)


As specified in Fávero et al. (2009), let Fexp(X) be the expected (normal) distribution function of cumulative relative frequencies of variable X, where Fexp(X) ~ N(μ, σ), and Fobs(X) the observed cumulative relative frequency distribution of variable X. The objective is to test whether Fobs(X) = Fexp(X), in contrast with the alternative that Fobs(X) ≠ Fexp(X). The statistic can be calculated through the following expression:

Dcal = max{|Fexp(Xi) − Fobs(Xi)|, |Fexp(Xi) − Fobs(Xi−1)|}, for i = 1, …, n    (9.7)

where:
Fexp(Xi): expected cumulative relative frequency in category i;
Fobs(Xi): observed cumulative relative frequency in category i;
Fobs(Xi−1): observed cumulative relative frequency in category i − 1.

The critical values of the Kolmogorov-Smirnov statistic (Dc) are shown in Table G in the Appendix. This table provides the critical values of Dc considering that P(Dcal > Dc) = α (for a right-tailed test). In order for the null hypothesis H0 to be rejected, the value of the Dcal statistic must be in the critical region, that is, Dcal > Dc. Otherwise, we do not reject H0. The P-value (the probability associated with the value of the Dcal statistic calculated from the sample) can also be obtained from Table G. In this case, we reject H0 if P ≤ α.

Example 9.1: Using the Kolmogorov-Smirnov Test
Table 9.E.1 shows the data on a company's monthly production of farming equipment in the last 36 months. Check and see if the data in Table 9.E.1 come from a population that follows a normal distribution, considering that α = 5%.

TABLE 9.E.1 Production of Farming Equipment in the Last 36 Months

52  50  44  50  42  30
36  34  48  40  55  40
30  36  40  42  55  44
38  42  40  38  52  44
52  34  38  44  48  36
36  55  50  34  44  42

Solution
Step 1: Since the objective is to verify whether the data in Table 9.E.1 come from a population with a normal distribution, the most suitable test is the Kolmogorov-Smirnov (K-S) test.
Step 2: The K-S test hypotheses for this example are:
H0: the production of farming equipment in the population follows distribution N(μ, σ)
H1: the production of farming equipment in the population does not follow distribution N(μ, σ)
Step 3: The significance level to be considered is 5%.
Step 4: All the steps necessary to calculate Dcal from Expression (9.7) are specified in Table 9.E.2.

TABLE 9.E.2 Calculating the Kolmogorov-Smirnov Statistic

Xi   Fabs(a)  Fac(b)  Fracobs(c)  Zi(d)     Fracexp(e)  |Fexp(Xi) − Fobs(Xi)|  |Fexp(Xi) − Fobs(Xi−1)|
30   2        2       0.056       −1.7801   0.0375      0.018                  0.036
34   3        5       0.139       −1.2168   0.1118      0.027                  0.056
36   4        9       0.250       −0.9351   0.1743      0.076                  0.035
38   3        12      0.333       −0.6534   0.2567      0.077                  0.007
40   4        16      0.444       −0.3717   0.3551      0.089                  0.022
42   4        20      0.556       −0.0900   0.4641      0.092                  0.020
44   5        25      0.694       0.1917    0.5760      0.118                  0.020
48   2        27      0.750       0.7551    0.7749      0.025                  0.081
50   3        30      0.833       1.0368    0.8501      0.017                  0.100
52   3        33      0.917       1.3185    0.9064      0.010                  0.073
55   3        36      1           1.7410    0.9592      0.041                  0.043

(a) Absolute frequency.
(b) Cumulative (absolute) frequency.
(c) Observed cumulative relative frequency of Xi.
(d) Standardized Xi values, according to the expression Zi = (Xi − X̄)/SX.
(e) Expected cumulative relative frequency of Xi; it corresponds to the probability obtained in Table E in the Appendix (standard normal distribution table) from the value of Zi.

Therefore, the value of the K-S statistic based on the sample is Dcal = 0.118.
Step 5: According to Table G in the Appendix, for n = 36 and α = 5%, the critical value of the Kolmogorov-Smirnov statistic is Dc = 0.23.
Step 6: Decision: since the calculated value is not in the critical region (Dcal < Dc), the null hypothesis is not rejected, which allows us to conclude, with a 95% confidence level, that the sample is drawn from a population that follows a normal distribution.
If we use the P-value instead of the statistic's critical value, Steps 5 and 6 become:
Step 5: According to Table G in the Appendix, for a sample size n = 36, the probability associated with Dcal = 0.118 has P = 0.20 as its lower limit.
Step 6: Decision: since P > 0.05, we do not reject H0.
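As a cross-check outside the book's SPSS/Stata workflow, a minimal Python sketch (not part of the original solution; it assumes numpy and scipy are available) reproduces the statistic of Example 9.1 with scipy's kstest, using the sample mean and standard deviation as the parameters of the theoretical normal distribution, just as in Table 9.E.2.

import numpy as np
from scipy.stats import kstest

production = np.array([52, 50, 44, 50, 42, 30, 36, 34, 48, 40, 55, 40,
                       30, 36, 40, 42, 55, 44, 38, 42, 40, 38, 52, 44,
                       52, 34, 38, 44, 48, 36, 36, 55, 50, 34, 44, 42])
mean, sd = production.mean(), production.std(ddof=1)      # about 42.64 and 7.10
stat, p = kstest(production, 'norm', args=(mean, sd))     # D as in Expression (9.7)
print(stat, p)   # D close to 0.118 and P well above 0.05, so H0 is not rejected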

9.3.2

Shapiro-Wilk Test

The Shapiro-Wilk test (S-W) is based on Shapiro and Wilk (1965) and can be applied to samples with 4 ≤ n ≤ 2000 observations; it is an alternative to the Kolmogorov-Smirnov test for normality (K-S) in the case of small samples (n < 30). Analogous to the K-S test, the S-W test for normality assumes the following hypotheses:

H0: the sample comes from a population with distribution N(μ, σ)
H1: the sample does not come from a population with distribution N(μ, σ)

The Shapiro-Wilk statistic (Wcal) is calculated as:

Wcal = b² / Σ_{i=1}^{n} (Xi − X̄)², for i = 1, …, n    (9.8)

b = Σ_{i=1}^{n/2} a_{i,n} · (X_(n−i+1) − X_(i))    (9.9)

where:
X_(i) are the sample order statistics, that is, the ith ordered observation, so X_(1) ≤ X_(2) ≤ … ≤ X_(n);
X̄ is the mean of X;
a_{i,n} are constants generated from the means, variances, and covariances of the order statistics of a random sample of size n from a normal distribution. Their values can be seen in Table H2 in the Appendix.

Small values of Wcal indicate that the distribution of the variable being studied is not normal. The critical values of the Shapiro-Wilk statistic (Wc) are shown in Table H1 in the Appendix. Unlike most tables, this table provides the critical values of Wc considering that P(Wcal < Wc) = α (for a left-tailed test). In order for the null hypothesis H0 to be rejected, the value of the Wcal statistic must be in the critical region, that is, Wcal < Wc. Otherwise, we do not reject H0. The P-value (the probability associated with the value of the Wcal statistic calculated from the sample) can also be obtained from Table H1. In this case, we reject H0 if P ≤ α.


Example 9.2: Using the Shapiro-Wilk Test
Table 9.E.3 shows the data on an aerospace company's monthly production of aircraft in the last 24 months. Check and see if the data in Table 9.E.3 come from a population with a normal distribution, considering that α = 1%.

TABLE 9.E.3 Production of Aircraft in the Last 24 Months

28  32  46  24  22  18  20  34
30  24  31  29  15  19  23  25
28  30  32  36  39  16  23  36

Solution
Step 1: For a normality test in which n < 30, the most recommended test is the Shapiro-Wilk (S-W) test.
Step 2: The S-W test hypotheses for this example are:
H0: the production of aircraft in the population follows normal distribution N(μ, σ)
H1: the production of aircraft in the population does not follow normal distribution N(μ, σ)
Step 3: The significance level to be considered is 1%.
Step 4: The calculation of the S-W statistic for the data in Table 9.E.3, according to Expressions (9.8) and (9.9), is shown next. First of all, to calculate b, we must sort the data in Table 9.E.3 in ascending order, as shown in Table 9.E.4. All the steps necessary to calculate b, from Expression (9.9), are specified in Table 9.E.5. The values of a_{i,n} were obtained from Table H2 in the Appendix.

TABLE 9.E.4 Values From Table 9.E.3 Sorted in Ascending Order

15  16  18  19  20  22  23  23
24  24  25  28  28  29  30  30
31  32  32  34  36  36  39  46

TABLE 9.E.5 Procedure to Calculate b

i    n − i + 1   a_{i,n}   X_(n−i+1)   X_(i)   a_{i,n} · (X_(n−i+1) − X_(i))
1    24          0.4493    46          15      13.9283
2    23          0.3098    39          16      7.1254
3    22          0.2554    36          18      4.5972
4    21          0.2145    36          19      3.6465
5    20          0.1807    34          20      2.5298
6    19          0.1512    32          22      1.5120
7    18          0.1245    32          23      1.1205
8    17          0.0997    31          23      0.7976
9    16          0.0764    30          24      0.4584
10   15          0.0539    30          24      0.3234
11   14          0.0321    29          25      0.1284
12   13          0.0107    28          28      0.0000
                                               b = 36.1675

We have

Σ_{i=1}^{n} (Xi − X̄)² = (28 − 27.5)² + ⋯ + (36 − 27.5)² = 1338

Therefore,

Wcal = b² / Σ_{i=1}^{n} (Xi − X̄)² = (36.1675)² / 1338 = 0.978

Step 5: According to Table H1 in the Appendix, for n = 24 and α = 1%, the critical value of the Shapiro-Wilk statistic is Wc = 0.884.


Step 6: Decision: the null hypothesis is not rejected, since Wcal > Wc (Table H1 provides the critical values of Wc considering that P(Wcal < Wc) = α), which allows us to conclude, with a 99% confidence level, that the sample is drawn from a population with a normal distribution.
If we use the P-value instead of the statistic's critical value, Steps 5 and 6 become:
Step 5: According to Table H1 in the Appendix, for a sample size n = 24, the probability associated with Wcal = 0.978 is between 0.50 and 0.90 (a probability of 0.90 is associated with Wcal = 0.981).
Step 6: Decision: since P > 0.01, we do not reject H0.
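For readers who want to reproduce Example 9.2 outside SPSS/Stata, a minimal sketch (not part of the original text; it assumes scipy is available) is shown below. scipy's shapiro routine computes the Shapiro-Wilk coefficients internally instead of reading them from Table H2.

from scipy.stats import shapiro

aircraft = [28, 32, 46, 24, 22, 18, 20, 34, 30, 24, 31, 29,
            15, 19, 23, 25, 28, 30, 32, 36, 39, 16, 23, 36]
w, p = shapiro(aircraft)
print(w, p)   # W close to 0.978 and P close to 0.86, so normality is not rejected at 1%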

9.3.3

Shapiro-Francia Test

This test is based on Shapiro and Francia (1972). According to Sarkadi (1975), the Shapiro-Wilk (S-W) and Shapiro-Francia (S-F) tests have the same format, differing only in how the coefficients are defined. Moreover, calculating the S-F test is much simpler, and it can be considered a simplified version of the S-W test. Despite its simplicity, it is as robust as the Shapiro-Wilk test, making it a substitute for the S-W. The Shapiro-Francia test can be applied to samples with 5 ≤ n ≤ 5000 observations, and it behaves like the Shapiro-Wilk test for large samples. Analogous to the S-W test, the S-F test assumes the following hypotheses:

H0: the sample comes from a population with distribution N(μ, σ)
H1: the sample does not come from a population with distribution N(μ, σ)

The Shapiro-Francia statistic (W′cal) is calculated as:

W′cal = [Σ_{i=1}^{n} mi · X_(i)]² / [Σ_{i=1}^{n} mi² · Σ_{i=1}^{n} (Xi − X̄)²], for i = 1, …, n    (9.10)

where:
X_(i) are the sample order statistics, that is, the ith ordered observation, so X_(1) ≤ X_(2) ≤ … ≤ X_(n);
mi is the approximate expected value of the ith order statistic (Z-score). The values of mi are estimated by:

mi = Φ⁻¹(i / (n + 1))    (9.11)

where Φ⁻¹ corresponds to the inverse of the standard normal distribution function, with mean zero and standard deviation 1. These values can be obtained from Table E in the Appendix.

Small values of W′cal indicate that the distribution of the variable being studied is not normal. The critical values of the Shapiro-Francia statistic (W′c) are shown in Table H1 in the Appendix. Unlike most tables, this table provides the critical values of W′c considering that P(W′cal < W′c) = α (for a left-tailed test). In order for the null hypothesis H0 to be rejected, the value of the W′cal statistic must be in the critical region, that is, W′cal < W′c. Otherwise, we do not reject H0.
The P-value (the probability associated with the W′cal statistic calculated from the sample) can also be obtained from Table H1. In this case, we reject H0 if P ≤ α.

Example 9.3: Using the Shapiro-Francia Test
Table 9.E.6 shows all the data regarding a company's daily production of bicycles in the last 60 months. Check and see if the data come from a population with a normal distribution, considering α = 5%.
Solution
Step 1: The normality of the data can be verified through the Shapiro-Francia test.
Step 2: The S-F test hypotheses for this example are:
H0: the production of bicycles in the population follows normal distribution N(μ, σ)
H1: the production of bicycles in the population does not follow normal distribution N(μ, σ)
Step 3: The significance level to be considered is 5%.


TABLE 9.E.6 Production of Bicycles in the Last 60 Months

85  70  74  49  67  88  80  91  57  63
66  60  72  81  73  80  55  54  93  77
80  64  60  63  67  54  59  78  73  84
91  57  59  64  68  67  70  76  78  75
80  81  70  77  65  63  59  60  61  74
76  81  79  78  60  68  76  71  72  84

Step 4: The procedure to calculate the S-F statistic for the data in Table 9.E.6 is shown in Table 9.E.7.
Therefore, W′cal = (574.6704)² / (53.1904 × 6278.8500) = 0.989

TABLE 9.E.7 Procedure to Calculate the Shapiro-Francia Statistic

i     X_(i)   i/(n + 1)   mi        mi · X_(i)   mi²      (Xi − X̄)²
1     49      0.0164      −2.1347   −104.5995    4.5569   481.8025
2     54      0.0328      −1.8413   −99.4316     3.3905   287.3025
3     54      0.0492      −1.6529   −89.2541     2.7319   287.3025
4     55      0.0656      −1.5096   −83.0276     2.2789   254.4025
5     57      0.0820      −1.3920   −79.3417     1.9376   194.6025
6     57      0.0984      −1.2909   −73.5841     1.6665   194.6025
7     59      0.1148      −1.2016   −70.8960     1.4439   142.8025
8     59      0.1311      −1.1210   −66.1380     1.2566   142.8025
…
60    93      0.9836      2.1347    198.5256     4.5569   486.2025
Sum                                 574.6704     53.1904  6278.8500

Step 5: According to Table H1 in the Appendix, for n = 60 and α = 5%, the critical value of the Shapiro-Francia statistic is W′c = 0.9625.
Step 6: Decision: the null hypothesis is not rejected because W′cal > W′c (Table H1 provides the critical values of W′c considering that P(W′cal < W′c) = α), which allows us to conclude, with a 95% confidence level, that the sample is drawn from a population that follows a normal distribution.
If we used the P-value instead of the statistic's critical value, Steps 5 and 6 would be:
Step 5: According to Table H1 in the Appendix, for a sample size n = 60, the probability associated with W′cal = 0.989 is greater than 0.10 (P-value).
Step 6: Decision: since P > 0.05, we do not reject H0.
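Common statistical libraries rarely ship a ready-made Shapiro-Francia routine, so the sketch below (not part of the original text; it assumes numpy and scipy are available) applies Expressions (9.10) and (9.11) directly to the 60 values of Table 9.E.6.

import numpy as np
from scipy.stats import norm

x = np.array([85, 70, 74, 49, 67, 88, 80, 91, 57, 63, 66, 60, 72, 81, 73,
              80, 55, 54, 93, 77, 80, 64, 60, 63, 67, 54, 59, 78, 73, 84,
              91, 57, 59, 64, 68, 67, 70, 76, 78, 75, 80, 81, 70, 77, 65,
              63, 59, 60, 61, 74, 76, 81, 79, 78, 60, 68, 76, 71, 72, 84], dtype=float)
n = len(x)
x_sorted = np.sort(x)                                   # order statistics X_(i)
m = norm.ppf(np.arange(1, n + 1) / (n + 1))             # Expression (9.11)
w_sf = (m @ x_sorted) ** 2 / ((m @ m) * ((x - x.mean()) ** 2).sum())   # Expression (9.10)
print(w_sf)   # approximately 0.989, as found in Example 9.3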

9.3.4

Solving Tests for Normality by Using SPSS Software

The Kolmogorov-Smirnov and Shapiro-Wilk tests for normality can be solved by using IBM SPSS Statistics Software. The Shapiro-Francia test, on the other hand, will be elaborated through the Stata software, as we will see in the next section. Based on the procedure that will be described, SPSS shows the results of the K-S and the S-W tests for the sample selected. The use of the images in this section has been authorized by the International Business Machines Corporation©. Let's consider the data presented in Example 9.1 that are available in the file Production_FarmingEquipment.sav. Let's open the file and select Analyze → Descriptive Statistics → Explore …, as shown in Fig. 9.4. From the Explore dialog box, we must select the variable we are interested in on the Dependent List, as shown in Fig. 9.5. Let's click on Plots … (the Explore: Plots dialog box will open) and select the option Normality plots with tests (Fig. 9.6). Finally, let's click on Continue and on OK.


FIG. 9.4 Procedure for elaborating a univariate normality test on SPSS for Example 9.1.

FIG. 9.5 Selecting the variable of interest.

The results of the Kolmogorov-Smirnov and Shapiro-Wilk tests for normality for the data in Example 9.1 are shown in Fig. 9.7. According to Fig. 9.7, the result of the K-S statistic was 0.118, similar to the value calculated in Example 9.1. Since the sample has more than 30 elements, we should only use the K-S test to verify the normality of the data (the S-W test was applied to Example 9.2). Nevertheless, SPSS also makes the result of the S-W statistic available for the sample selected.


FIG. 9.6 Selecting the normality test on SPSS.

FIG. 9.7 Results of the tests for normality for Example 9.1 on SPSS.

FIG. 9.8 Results of the tests for normality for Example 9.2 on SPSS.

As presented in the introduction of this chapter, SPSS calculates the P-value, which corresponds to the lowest observed significance level that would lead to the rejection of the null hypothesis. For the K-S and S-W tests, the P-value corresponds to the lowest value of P from which Dcal > Dc and Wcal < Wc. As shown in Fig. 9.7, the value of P for the K-S test was 0.200 (this probability can also be obtained from Table G in the Appendix, as shown in Example 9.1). Since P > 0.05, we do not reject the null hypothesis, which allows us to conclude, with a 95% confidence level, that the data distribution is normal. The S-W test also allows us to conclude that the data distribution follows a normal distribution. Applying the same procedure to verify the normality of the data in Example 9.2 (the data are available in the file Production_Aircraft.sav), we get the results shown in Fig. 9.8. Analogous to Example 9.2, the result of the S-W test was 0.978. The K-S test was not applied to this example due to the sample size (n < 30). The P-value of the S-W test is 0.857 (in Example 9.2, we saw that this probability would be between 0.50 and 0.90


and closer to 0.90) and, since P > 0.01, the null hypothesis is not rejected, which allows us to conclude that the data distribution in the population follows a normal distribution. We will use this test when estimating regression models in Chapter 13. For this example, we can also conclude from the K-S test that the data distribution follows a normal distribution.

9.3.5

Solving Tests for Normality by Using Stata

The Kolmogorov-Smirnov, Shapiro-Wilk, and Shapiro-Francia tests for normality can be solved by using Stata Statistical Software. The Kolmogorov-Smirnov test will be applied to Example 9.1, the Shapiro-Wilk test to Example 9.2, and the Shapiro-Francia test to Example 9.3. The use of the images in this section has been authorized by StataCorp LP©.

9.3.5.1 Kolmogorov-Smirnov Test on the Stata Software
The data presented in Example 9.1 are available in the file Production_FarmingEquipment.dta. Let's open this file and verify that the name of the variable being studied is production. To elaborate the Kolmogorov-Smirnov test on Stata, we must specify the mean and the standard deviation of the variable that interests us in the test syntax, so the command summarize, or simply sum, must be typed first, followed by the respective variable:
sum production

and we get Fig. 9.9. Therefore, we can see that the mean is 42.63889 and the standard deviation is 7.099911. The Kolmogorov-Smirnov test is given by the following command: ksmirnov production = normal((production-42.63889)/7.099911)

The result of the test can be seen in Fig. 9.10. We can see that the value of the statistic is similar to the one calculated in Example 9.1 and by SPSS software. Since P > 0.05, we conclude that the data distribution is normal.

9.3.5.2 Shapiro-Wilk Test on the Stata Software
The data presented in Example 9.2 are available in the file Production_Aircraft.dta. To elaborate the Shapiro-Wilk test on Stata, the syntax of the command is:
swilk variables*

where the term variables* should be substituted for the list of variables being considered. For the data in Example 9.2, we have a single variable called production, so, the command to be typed is: swilk production FIG. 9.9 Descriptive statistics of the variable production.

FIG. 9.10 Results of the Kolmogorov-Smirnov test on Stata.


FIG. 9.11 Results of the Shapiro-Wilk test for Example 9.2 on Stata.

FIG. 9.12 Results of the Shapiro-Francia test for Example 9.3 on Stata.

The result of the Shapiro-Wilk test can be seen in Fig. 9.11. Since P > 0.05, we can conclude that the sample comes from a population with a normal distribution.

9.3.5.3 Shapiro-Francia Test on the Stata Software
The data presented in Example 9.3 are available in the file Production_Bicycles.dta. To elaborate the Shapiro-Francia test on Stata, the syntax of the command is:
sfrancia variables*

where the term variables* should be substituted for the list of variables being considered. For the data in Example 9.3, we have a single variable called production, so, the command to be typed is: sfrancia production

The result of the Shapiro-Francia test can be seen in Fig. 9.12. We can see that the value is similar to the one calculated in Example 9.3 (W′ = 0.989). Since P > 0.05, we conclude that the sample comes from a population with a normal distribution. We will use this test when estimating regression models in Chapter 13.

9.4

TESTS FOR THE HOMOGENEITY OF VARIANCES

One of the conditions to apply a parametric test to compare k population means is that the population variances, estimated from k representative samples, be homogeneous or equal. The most common tests to verify variance homogeneity are Bartlett's χ² (1937), Cochran's C (1947a,b), Hartley's Fmax (1950), and Levene's F (1960) tests. In the null hypothesis of variance homogeneity tests, the variances of the k populations are homogeneous. In the alternative hypothesis, at least one population variance is different from the others. That is:

H0: σ1² = σ2² = … = σk²
H1: ∃ i, j: σi² ≠ σj² (i, j = 1, …, k)    (9.12)

9.4.1

Bartlett's χ² Test

The original test proposed to verify variance homogeneity among groups is Bartlett's χ² test (1937). This test is very sensitive to deviations from normality, and Levene's test is an alternative in this case. Bartlett's statistic is calculated from q:

q = (N − k) · ln(Sp²) − Σ_{i=1}^{k} (ni − 1) · ln(Si²)    (9.13)


where:
ni, i = 1, …, k, is the size of each sample i, with Σ_{i=1}^{k} ni = N;
Si², i = 1, …, k, is the variance in each sample i;

and

Sp² = Σ_{i=1}^{k} (ni − 1) · Si² / (N − k)    (9.14)

A correction factor c is applied to the q statistic, given by the following expression:

c = 1 + [1 / (3 · (k − 1))] · [Σ_{i=1}^{k} 1/(ni − 1) − 1/(N − k)]    (9.15)

and Bartlett's statistic (Bcal) approximately follows a chi-square distribution with k − 1 degrees of freedom:

Bcal = q / c ~ χ²_{k−1}    (9.16)

From the previous expressions, we can see that the higher the difference between the variances, the higher the value of B. On the other hand, if all the sample variances are equal, its value will be zero. To confirm whether the null hypothesis of variance homogeneity will be rejected or not, the calculated value must be compared to the statistic's critical value (χ²c), which is available in Table D in the Appendix. This table provides the critical values of χ²c considering that P(χ²cal > χ²c) = α (for a right-tailed test). Therefore, we reject the null hypothesis if Bcal > χ²c. On the other hand, if Bcal ≤ χ²c, we do not reject H0. The P-value (the probability associated with the χ²cal statistic) can also be obtained from Table D. In this case, we reject H0 if P ≤ α.

Example 9.4: Applying Bartlett's χ² Test
A chain of supermarkets wishes to study the number of customers they serve every day in order to make strategic operational decisions. Table 9.E.8 shows the data of three stores throughout two weeks. Check if the variances between the groups are homogeneous. Consider α = 5%.

TABLE 9.E.8 Number of Customers Served Per Day and Per Store

                     Store 1    Store 2     Store 3
Day 1                620        710         924
Day 2                630        780         695
Day 3                610        810         854
Day 4                650        755         802
Day 5                585        699         931
Day 6                590        680         924
Day 7                630        710         847
Day 8                644        850         800
Day 9                595        844         769
Day 10               603        730         863
Day 11               570        645         901
Day 12               605        688         888
Day 13               622        718         757
Day 14               578        702         712
Standard deviation   24.4059    62.2466     78.9144
Variance             595.6484   3874.6429   6227.4780


Solution
If we apply the Kolmogorov-Smirnov or the Shapiro-Wilk test for normality to the data in Table 9.E.8, we will verify that their distribution shows adherence to normality at a 5% significance level, so Bartlett's χ² test can be applied to compare the homogeneity of the variances between the groups.
Step 1: Since the main goal is to compare the equality of the variances between the groups, we can use Bartlett's χ² test.
Step 2: Bartlett's χ² test hypotheses for this example are:
H0: the population variances of all three groups are homogeneous
H1: the population variance of at least one group is different from the others
Step 3: The significance level to be considered is 5%.
Step 4: The complete calculation of Bartlett's χ² statistic is shown next. First, we calculate the value of Sp², according to Expression (9.14):

Sp² = 13 × (595.65 + 3874.64 + 6227.48) / (42 − 3) = 3565.92

Thus, we can calculate q through Expression (9.13):

q = 39 × ln(3565.92) − 13 × [ln(595.65) + ln(3874.64) + ln(6227.48)] = 14.94

The correction factor c for the q statistic is calculated from Expression (9.15):

c = 1 + [1 / (3 × (3 − 1))] × [3 × 1/13 − 1/(42 − 3)] = 1.0342

Finally, we calculate Bcal:

Bcal = q / c = 14.94 / 1.0342 = 14.44

Step 5: According to Table D in the Appendix, for ν = 3 − 1 = 2 degrees of freedom and α = 5%, the critical value of Bartlett's χ² test is χ²c = 5.991.
Step 6: Decision: since the calculated value lies in the critical region (Bcal > χ²c), the null hypothesis is rejected, which allows us to conclude, with a 95% confidence level, that the population variance of at least one group is different from the others.
If we use the P-value instead of the statistic's critical value, Steps 5 and 6 become:
Step 5: According to Table D in the Appendix, for ν = 2 degrees of freedom, the probability associated with χ²cal = 14.44 is less than 0.005 (a probability of 0.005 is associated with χ²cal = 10.597).
Step 6: Decision: since P < 0.05, we reject H0.
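A quick way to reproduce this result is scipy's built-in Bartlett test, which implements Expressions (9.13)–(9.16) directly. The sketch below is not part of the original text and assumes a Python environment with scipy; each list holds the 14 daily customer counts of one store from Table 9.E.8.

from scipy.stats import bartlett

store1 = [620, 630, 610, 650, 585, 590, 630, 644, 595, 603, 570, 605, 622, 578]
store2 = [710, 780, 810, 755, 699, 680, 710, 850, 844, 730, 645, 688, 718, 702]
store3 = [924, 695, 854, 802, 931, 924, 847, 800, 769, 863, 901, 888, 757, 712]
stat, p = bartlett(store1, store2, store3)
print(stat, p)   # B close to 14.4 and P well below 0.05 -> variances are not homogeneous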

9.4.2

Cochran’s C Test

Cochran's C test (1947a,b) compares the group with the highest variance to the others. The test requires that the data follow a normal distribution. Cochran's C statistic is given by:

Ccal = S²max / Σ_{i=1}^{k} Si²    (9.17)

where:
S²max is the highest sample variance;
Si² is the variance in sample i, i = 1, …, k.

According to Expression (9.17), if all the variances are equal, the value of the Ccal statistic is 1/k. The greater the difference of S²max in relation to the other variances, the closer the value of Ccal gets to 1. To confirm whether the null hypothesis will be rejected or not, the calculated value must be compared to the critical value of Cochran's statistic (Cc), which is available in Table M in the Appendix.


The values of Cc vary depending on the number of groups (k), the number of degrees of freedom ν = max(ni − 1), and the value of α. Table M provides the critical values of Cc considering that P(Ccal > Cc) = α (for a right-tailed test). Thus, we reject H0 if Ccal > Cc. Otherwise, we do not reject H0.

Example 9.5: Applying Cochran's C Test
Use Cochran's C test for the data in Example 9.4. The main objective here is to compare the group with the highest variability to the others.
Solution
Step 1: Since the objective is to compare the group with the highest variance (group 3—see Table 9.E.8) to the others, Cochran's C test is the most recommended.
Step 2: Cochran's C test hypotheses for this example are:
H0: the population variance of group 3 is equal to the others
H1: the population variance of group 3 is different from the others
Step 3: The significance level to be considered is 5%.
Step 4: From Table 9.E.8, we can see that S²max = 6227.48. Therefore, the calculation of Cochran's C statistic is:

Ccal = S²max / Σ Si² = 6227.48 / (595.65 + 3874.64 + 6227.48) = 0.582

Step 5: According to Table M in the Appendix, for k = 3, ν = 13, and α = 5%, the critical value of Cochran's C statistic is Cc = 0.575.
Step 6: Decision: since the calculated value lies in the critical region (Ccal > Cc), the null hypothesis is rejected, which allows us to conclude, with a 95% confidence level, that the population variance of group 3 is different from the others.

9.4.3

Hartley’s Fmax Test

Hartley's Fmax test (1950) has a statistic that represents the ratio between the group with the highest variance (S²max) and the group with the lowest variance (S²min):

Fmax,cal = S²max / S²min    (9.18)

The test assumes that the number of observations per group is the same (n1 = n2 = … = nk = n). If all the variances are equal, the value of Fmax will be 1. The higher the difference between S²max and S²min, the higher the value of Fmax. To confirm whether the null hypothesis of variance homogeneity will be rejected or not, the calculated value must be compared to the statistic's critical value (Fmax,c), which is available in Table N in the Appendix. The critical values vary depending on the number of groups (k), the number of degrees of freedom ν = n − 1, and the value of α, and this table provides the critical values of Fmax,c considering that P(Fmax,cal > Fmax,c) = α (for a right-tailed test). Therefore, we reject the null hypothesis H0 of variance homogeneity if Fmax,cal > Fmax,c. Otherwise, we do not reject H0. The P-value (the probability associated with the Fmax,cal statistic) can also be obtained from Table N in the Appendix. In this case, we reject H0 if P ≤ α.

Example 9.6: Applying Hartley's Fmax Test
Use Hartley's Fmax test for the data in Example 9.4. The goal here is to compare the group with the highest variability to the group with the lowest variability.
Solution
Step 1: Since the main objective is to compare the group with the highest variance (group 3—see Table 9.E.8) to the group with the lowest variance (group 1), Hartley's Fmax test is the most recommended.
Step 2: Hartley's Fmax test hypotheses for this example are:
H0: the population variance of group 3 is the same as that of group 1


H1: the population variance of group 3 is different from that of group 1
Step 3: The significance level to be considered is 5%.
Step 4: From Table 9.E.8, we can see that S²min = 595.65 and S²max = 6227.48. Therefore, the calculation of Hartley's Fmax statistic is:

Fmax,cal = S²max / S²min = 6227.48 / 595.65 = 10.45

Step 5: According to Table N in the Appendix, for k = 3, ν = 13, and α = 5%, the critical value of the test is Fmax,c = 3.953.
Step 6: Decision: since the calculated value lies in the critical region (Fmax,cal > Fmax,c), the null hypothesis is rejected, which allows us to conclude, with a 95% confidence level, that the population variance of group 3 is different from the population variance of group 1.
If we use the P-value instead of the statistic's critical value, Steps 5 and 6 become:
Step 5: According to Table N in the Appendix, the probability associated with Fmax,cal = 10.45, for k = 3 and ν = 13, is less than 0.01.
Step 6: Decision: since P < 0.05, we reject H0.
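Neither Cochran's C nor Hartley's Fmax has a ready-made scipy function, but both are simple ratios of the sample variances. The sketch below (not part of the original text; it assumes numpy is available) reproduces the statistics of Examples 9.5 and 9.6 from the Table 9.E.8 data.

import numpy as np

store1 = [620, 630, 610, 650, 585, 590, 630, 644, 595, 603, 570, 605, 622, 578]
store2 = [710, 780, 810, 755, 699, 680, 710, 850, 844, 730, 645, 688, 718, 702]
store3 = [924, 695, 854, 802, 931, 924, 847, 800, 769, 863, 901, 888, 757, 712]
variances = [np.var(s, ddof=1) for s in (store1, store2, store3)]

c_cal = max(variances) / sum(variances)      # Expression (9.17): about 0.582
f_max = max(variances) / min(variances)      # Expression (9.18): about 10.45
print(c_cal, f_max)                          # both exceed their critical values at 5%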

9.4.4

Levene’s F-Test

The advantage of Levene's F-test over the other homogeneity of variance tests is that it is less sensitive to deviations from normality, in addition to being considered a more robust test. Levene's statistic is given by Expression (9.19) and, under H0, it approximately follows an F-distribution with ν1 = k − 1 and ν2 = N − k degrees of freedom, for a significance level α:

Fcal = [(N − k) / (k − 1)] · [Σ_{i=1}^{k} ni · (Z̄i − Z̄)²] / [Σ_{i=1}^{k} Σ_{j=1}^{ni} (Zij − Z̄i)²] ~ F_{k−1, N−k, α}    (9.19)

where:
ni is the dimension of each one of the k samples (i = 1, …, k);
N is the dimension of the global sample (N = n1 + n2 + ⋯ + nk);
Zij = |Xij − X̄i|, i = 1, …, k and j = 1, …, ni;
Xij is observation j in sample i;
X̄i is the mean of sample i;
Z̄i is the mean of Zij in sample i;
Z̄ is the mean of Z̄i in the global sample.

An expansion of Levene's test can be found in Brown and Forsythe (1974). From the F-distribution table (Table A in the Appendix), we can determine the critical value of Levene's statistic (Fc = F_{k−1, N−k, α}). Table A provides the critical values of Fc considering that P(Fcal > Fc) = α (right-tailed table). In order for the null hypothesis H0 to be rejected, the value of the statistic must be in the critical region, that is, Fcal > Fc. If Fcal ≤ Fc, we do not reject H0. The P-value (the probability associated with the Fcal statistic) can also be obtained from Table A. In this case, we reject H0 if P ≤ α.

Example 9.7: Applying Levene's Test
Elaborate Levene's test for the data in Example 9.4.
Solution
Step 1: Levene's test can be applied to check variance homogeneity between the groups, and it is more robust than the other tests.


Step 2: Levene's test hypotheses for this example are:
H0: the population variances of all three groups are homogeneous
H1: the population variance of at least one group is different from the others
Step 3: The significance level to be considered is 5%.
Step 4: The calculation of the Fcal statistic, according to Expression (9.19), is shown in Table 9.E.9.

TABLE 9.E.9 Calculating the Fcal Statistic

i   X1j   Z1j = |X1j − X̄1|   Z1j − Z̄1   (Z1j − Z̄1)²
1   620   10.571             −9.429      88.898
1   630   20.571             0.571       0.327
1   610   0.571              −19.429     377.469
1   650   40.571             20.571      423.184
1   585   24.429             4.429       19.612
1   590   19.429             −0.571      0.327
1   630   20.571             0.571       0.327
1   644   34.571             14.571      212.327
1   595   14.429             −5.571      31.041
1   603   6.429              −13.571     184.184
1   570   39.429             19.429      377.469
1   605   4.429              −15.571     242.469
1   622   12.571             −7.429      55.184
1   578   31.429             11.429      130.612
    X̄1 = 609.429   Z̄1 = 20              Sum = 2143.429

i   X2j   Z2j = |X2j − X̄2|   Z2j − Z̄2   (Z2j − Z̄2)²
2   710   27.214             −23.204     538.429
2   780   42.786             −7.633      58.257
2   810   72.786             22.367      500.298
2   755   17.786             −32.633     1064.890
2   699   38.214             −12.204     148.940
2   680   57.214             6.796       46.185
2   710   27.214             −23.204     538.429
2   850   112.786            62.367      3889.686
2   844   106.786            56.367      3177.278
2   730   7.214              −43.204     1866.593
2   645   92.214             41.796      1746.899
2   688   49.214             −1.204      1.450
2   718   19.214             −31.204     973.695
2   702   35.214             −15.204     231.164
    X̄2 = 737.214   Z̄2 = 50.418          Sum = 14,782.192

i   X3j   Z3j = |X3j − X̄3|   Z3j − Z̄3   (Z3j − Z̄3)²
3   924   90.643             24.194      585.344
3   695   138.357            71.908      5170.784
3   854   20.643             −45.806     2098.201
3   802   31.357             −35.092     1231.437
3   931   97.643             31.194      973.058
3   924   90.643             24.194      585.344
3   847   13.643             −52.806     2788.487
3   800   33.357             −33.092     1095.070
3   769   64.357             −2.092      4.376
3   863   29.643             −36.806     1354.691
3   901   67.643             1.194       1.425
3   888   54.643             −11.806     139.385
3   757   76.357             9.908       98.172
3   712   121.357            54.908      3014.906
    X̄3 = 833.36    Z̄3 = 66.449          Sum = 19,140.678

Therefore, the calculation of Fcal is carried out as follows:

Fcal = [(42 − 3) / (3 − 1)] × [14 × (20 − 45.62)² + 14 × (50.418 − 45.62)² + 14 × (66.449 − 45.62)²] / [2143.429 + 14,782.192 + 19,140.678]

Fcal = 8.427

Step 5: According to Table A in the Appendix, for ν1 = 2, ν2 = 39, and α = 5%, the critical value of the test is Fc = 3.24.
Step 6: Decision: since the calculated value lies in the critical region (Fcal > Fc), the null hypothesis is rejected, which allows us to conclude, with a 95% confidence level, that the population variance of at least one group is different from the others.
If we use the P-value instead of the statistic's critical value, Steps 5 and 6 become:
Step 5: According to Table A in the Appendix, for ν1 = 2 and ν2 = 39, the probability associated with Fcal = 8.427 is less than 0.01 (P-value).
Step 6: Decision: since P < 0.05, we reject H0.
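The same result can be reproduced with scipy (a sketch, not part of the original text); center='mean' selects the original Levene statistic used here, while the default center='median' gives the Brown and Forsythe (1974) variant mentioned earlier.

from scipy.stats import levene

store1 = [620, 630, 610, 650, 585, 590, 630, 644, 595, 603, 570, 605, 622, 578]
store2 = [710, 780, 810, 755, 699, 680, 710, 850, 844, 730, 645, 688, 718, 702]
store3 = [924, 695, 854, 802, 931, 924, 847, 800, 769, 863, 901, 888, 757, 712]
stat, p = levene(store1, store2, store3, center='mean')
print(stat, p)   # F close to 8.43 and P close to 0.001 -> variances are not homogeneous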

9.4.5

Solving Levene’s Test by Using SPSS Software

The use of the images in this section has been authorized by the International Business Machines Corporation©. To test the variance homogeneity between the groups, SPSS uses Levene's test. The data presented in Example 9.4 are available in the file CustomerServices_Store.sav. In order to elaborate the test, we must click on Analyze → Descriptive Statistics → Explore …, as shown in Fig. 9.13. Let's include the variable Customer_services in the list of dependent variables (Dependent List) and the variable Store in the factor list (Factor List), as shown in Fig. 9.14. Next, we must click on Plots … and select the option Untransformed in Spread vs Level with Levene Test, as shown in Fig. 9.15. Finally, let's click on Continue and on OK. The result of Levene's test can also be obtained through the ANOVA test, by clicking on Analyze → Compare Means → One-Way ANOVA …. In Options …, we must select the option Homogeneity of variance test (Fig. 9.16).


FIG. 9.13 Procedure for elaborating Levene’s test on SPSS.

FIG. 9.14 Selecting the variables to elaborate Levene’s test on SPSS.



FIG. 9.15 Continuation of the procedure to elaborate Levene’s test on SPSS.

FIG. 9.16 Results of Levene’s test for Example 9.4 on SPSS.

The value of Levene’s statistic is 8.427, exactly the same as the one calculated previously. Since the significance level observed is 0.001, a value lower than 0.05, the test shows the rejection of the null hypothesis, which allows us to conclude, with a 95% confidence level, that the population variances are not homogeneous.

9.4.6

Solving Levene’s Test by Using the Stata Software

The use of the images in this section has been authorized by StataCorp LP©. Levene’s statistical test for equality of variances is calculated on Stata by using the command robvar (robust-test for equality of variances), which has the following syntax: robvar variable*, by(groups*)

in which the term variable* should be substituted for the quantitative variable studied and the term groups* by the categorical variable that represents them. Let’s open the file CustomerServices_Store.dta that contains the data of Example 9.7. The three groups are represented by the variable store and the number of customers served by the variable services. Therefore, the command to be typed is: robvar services, by(store)

The result of the test can be seen in Fig. 9.17. We can verify that the value of the statistic (8.427) is similar to the one calculated in Example 9.7 and to the one generated on SPSS, as well as the calculation of the probability associated to


FIG. 9.17 Results of Levene’s test for Example 9.7 on Stata.

the statistic (0.001). Since P < 0.05, the null hypothesis is rejected, which allows us to conclude, with a 95% confidence level, that the variances are not homogeneous.

9.5 HYPOTHESES TESTS REGARDING A POPULATION MEAN (μ) FROM ONE RANDOM SAMPLE
The main goal is to test whether a population mean assumes a certain value or not.

9.5.1 Z Test When the Population Standard Deviation (σ) Is Known and the Distribution Is Normal
This test is applied when a random sample of size n is obtained from a population with a normal distribution, whose mean (μ) is unknown and whose standard deviation (σ) is known. If the distribution of the population is not known, it is necessary to work with large samples (n > 30), because the central limit theorem guarantees that, as the sample size grows, the sampling distribution of the mean gets closer and closer to a normal distribution. For a bilateral test, the hypotheses are:
H0: the sample comes from a population with a certain mean (μ = μ0)
H1: it challenges the null hypothesis (μ ≠ μ0)
The statistical test used here refers to the sample mean (X̄). In order for the sample mean to be compared to the value in the table, it must be standardized, so:

Zcal = (X̄ − μ0) / σX̄ ~ N(0, 1), where σX̄ = σ / √n    (9.20)

The critical values of the zc statistic are shown in Table E in the Appendix. This table provides the critical values of zc considering that P(Zcal > zc) = α (for a right-tailed test). For a bilateral test, we must consider P(Zcal > zc) = α/2, since P(Zcal < −zc) + P(Zcal > zc) = α. The null hypothesis H0 of a bilateral test is rejected if the value of the Zcal statistic lies in the critical region, that is, if Zcal < −zc or Zcal > zc. Otherwise, we do not reject H0.
The unilateral probability associated with the Zcal statistic (P1) can also be obtained from Table E. For a unilateral test, we consider P = P1. For a bilateral test, this probability must be doubled (P = 2P1). Therefore, for both tests, we reject H0 if P ≤ α.

Example 9.8: Applying the z Test to One Sample
A cereal manufacturer states that the average quantity of food fiber in each portion of its product is at least 4.2 g, with a standard deviation of 1 g. A health care agency wishes to verify whether this statement is true, collecting a random sample of 42 portions, in which the average quantity of food fiber is 3.9 g. With a significance level of 5%, is there evidence to reject the manufacturer's statement?


Solution
Step 1: The suitable test for a population mean with known σ, considering a single sample of size n > 30 (normal distribution), is the z test.
Step 2: For this example, the z test hypotheses are:
H0: μ ≥ 4.2 g (information provided by the manufacturer)
H1: μ < 4.2 g
which corresponds to a left-tailed test.
Step 3: The significance level to be considered is 5%.
Step 4: The calculation of the Zcal statistic, according to Expression (9.20), is:

Zcal = (X̄ − μ0) / (σ/√n) = (3.9 − 4.2) / (1/√42) = −1.94

Step 5: According to Table E in the Appendix, for a left-tailed test with α = 5%, the critical value of the test is zc = −1.645.
Step 6: Decision: since the calculated value lies in the critical region (Zcal < −1.645), the null hypothesis is rejected, which allows us to conclude, with a 95% confidence level, that the manufacturer's average quantity of food fiber is less than 4.2 g.
If, instead of comparing the calculated value to the critical value of the standard normal distribution, we use the P-value, Steps 5 and 6 become:
Step 5: According to Table E in the Appendix, for a left-tailed test, the probability associated with Zcal = −1.94 is 0.0262 (P-value).
Step 6: Decision: since P < 0.05, the null hypothesis is rejected, which allows us to conclude, with a 95% confidence level, that the manufacturer's average quantity of food fiber is less than 4.2 g.
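Since scipy has no ready-made one-sample z-test, the sketch below (not part of the original text; it assumes scipy is available) applies Expression (9.20) directly and obtains the left-tailed P-value from the standard normal distribution.

import math
from scipy.stats import norm

x_bar, mu0, sigma, n = 3.9, 4.2, 1.0, 42
z_cal = (x_bar - mu0) / (sigma / math.sqrt(n))   # about -1.94
p_left = norm.cdf(z_cal)                          # about 0.026 (left-tailed P-value)
print(z_cal, p_left)                              # P < 0.05 -> reject H0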

9.5.2

Student's t-Test When the Population Standard Deviation (σ) Is Not Known

Student's t-test for one sample is applied when we do not know the population standard deviation (σ), so its value is estimated from the sample standard deviation (S). However, when σ is replaced by S in Expression (9.20), the distribution of the statistic is no longer normal; it becomes a Student's t-distribution with n − 1 degrees of freedom. Analogous to the z test, Student's t-test for one sample assumes the following hypotheses for a bilateral test:
H0: μ = μ0
H1: μ ≠ μ0
And the calculation of the statistic becomes:

Tcal = (X̄ − μ0) / (S/√n) ~ t_{n−1}    (9.21)

The calculated value must be compared to the value in Student's t-distribution table (Table B in the Appendix). This table provides the critical values of tc considering that P(Tcal > tc) = α (for a right-tailed test). For a bilateral test, we have P(Tcal < −tc) = α/2 = P(Tcal > tc), as shown in Fig. 9.18. Therefore, for a bilateral test, the null hypothesis is rejected if Tcal < −tc or Tcal > tc. If −tc ≤ Tcal ≤ tc, we do not reject H0.

FIG. 9.18 Nonrejection region (NR) and critical region (CR) of Student's t-distribution for a bilateral test.


The unilateral probability associated with the Tcal statistic (P1) can also be obtained from Table B. For a unilateral test, we have P = P1. For a bilateral test, this probability must be doubled (P = 2P1). Therefore, for both tests, we reject H0 if P ≤ α.

Example 9.9: Applying Student's t-Test to One Sample
The average processing time of a task using a certain machine has been 18 min. New concepts have been implemented in order to reduce the average processing time. Hence, after a certain period of time, a sample with 25 elements was collected, and an average time of 16.808 min was measured, with a standard deviation of 2.733 min. Check and see if this result represents an improvement in the average processing time. Consider α = 1%.
Solution
Step 1: The suitable test for a population mean with unknown σ is Student's t-test.
Step 2: For this example, Student's t-test hypotheses are:
H0: μ = 18
H1: μ < 18
which corresponds to a left-tailed test.
Step 3: The significance level to be considered is 1%.
Step 4: The calculation of the Tcal statistic, according to Expression (9.21), is:

Tcal = (X̄ − μ0) / (S/√n) = (16.808 − 18) / (2.733/√25) = −2.18

Step 5: According to Table B in the Appendix, for a left-tailed test with 24 degrees of freedom and α = 1%, the critical value of the test is tc = −2.492.
Step 6: Decision: since the calculated value is not in the critical region (Tcal > −2.492), the null hypothesis is not rejected, which allows us to conclude, with a 99% confidence level, that there was no improvement in the average processing time.
If, instead of comparing the calculated value to the critical value of Student's t-distribution, we use the P-value, Steps 5 and 6 become:
Step 5: According to Table B in the Appendix, for a left-tailed test with 24 degrees of freedom, the probability associated with Tcal = −2.18 is between 0.01 and 0.025 (P-value).
Step 6: Decision: since P > 0.01, we do not reject the null hypothesis.
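The same calculation can be sketched in Python from the summary statistics alone (the 25 individual times are only available in the book's data files; scipy.stats.ttest_1samp could be used instead when the raw observations are at hand). This sketch is not part of the original solution and assumes scipy is available.

import math
from scipy.stats import t

x_bar, mu0, s, n = 16.808, 18.0, 2.733, 25
t_cal = (x_bar - mu0) / (s / math.sqrt(n))   # about -2.18, Expression (9.21)
p_left = t.cdf(t_cal, df=n - 1)              # about 0.02 (left-tailed P-value)
print(t_cal, p_left)                          # P > 0.01 -> do not reject H0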

9.5.3

Solving Student’s t-Test for a Single Sample by Using SPSS Software

The use of the images in this section has been authorized by the International Business Machines Corporation©. If we wish to compare means from a single sample, SPSS makes Student's t-test available. The data in Example 9.9 are available in the file T_test_One_Sample.sav. The procedure to apply the test from Example 9.9 will be described. Initially, let's select Analyze → Compare Means → One-Sample T Test …, as shown in Fig. 9.19. We must select the variable Time and specify the value 18 that will be tested in Test Value, as shown in Fig. 9.20. Now, we must click on Options … to define the desired confidence level (Fig. 9.21). Finally, let's click on Continue and on OK. The results of the test are shown in Fig. 9.22. This figure shows the result of the t-test (similar to the value calculated in Example 9.9) and the associated probability (P-value) for a bilateral test. For a unilateral test, the associated probability is 0.0195 (we saw in Example 9.9 that this probability would be between 0.01 and 0.025). Since 0.0195 > 0.01, we do not reject the null hypothesis, which allows us to conclude, with a 99% confidence level, that there was no improvement in the average processing time.

9.5.4

Solving Student’s t-Test for a Single Sample by Using Stata Software

The use of the images in this section has been authorized by StataCorp LP©. Student’s t-test is elaborated on Stata by using the command ttest. For one population mean, the test syntax is: ttest variable* == #


FIG. 9.19 Procedure for elaborating the t-test from one sample on SPSS.

FIG. 9.20 Selecting the variable and specifying the value to be tested.

where the term variable* should be substituted for the name of the variable considered in the analysis and # for the value of the population mean to be tested. The data in Example 9.9 are available in the file T_test_One_Sample.dta. In this case, the variable being analyzed is called time and the goal is to verify if the average processing time is still 18 min, so, the command to be typed is: ttest time == 18

The result of the test can be seen in Fig. 9.23. We can see that the calculated value of the statistic (−2.180) is similar to the one calculated in Example 9.9 and also generated on SPSS, as well as the associated probability for a left-tailed test (0.0196). Since P > 0.01, we do not reject the null hypothesis, which allows us to conclude, with a 99% confidence level, that there was no improvement in the processing time.


FIG. 9.21 Options—defining the confidence level.

FIG. 9.22 Results of the t-test for one sample for Example 9.9 on SPSS.

FIG. 9.23 Results of the t-test for one sample for Example 9.9 on Stata.

9.6 STUDENT'S T-TEST TO COMPARE TWO POPULATION MEANS FROM TWO INDEPENDENT RANDOM SAMPLES
The t-test for two independent samples is applied to compare the means of two random samples (X1i, i = 1, …, n1; X2j, j = 1, …, n2) obtained from the same population. In this test, the population variance is unknown. For a bilateral test, the null hypothesis states that the population means are the same. If the population means are different, the null hypothesis is rejected, so:
H0: μ1 = μ2
H1: μ1 ≠ μ2
The calculation of the T statistic depends on the comparison of the population variances between the groups.

Case 1: σ1² ≠ σ2²
Considering that the population variances are different, the calculation of the T statistic is given by:


FIG. 9.24 Nonrejection region (NR) and critical region (CR) of Student’s t-distribution for a bilateral test.

Tcal = (X̄1 − X̄2) / √(S1²/n1 + S2²/n2)    (9.22)

with the following degrees of freedom:

ν = (S1²/n1 + S2²/n2)² / [ (S1²/n1)²/(n1 − 1) + (S2²/n2)²/(n2 − 1) ]    (9.23)

Case 2: σ1² = σ2²
When the population variances are homogeneous, to calculate the T statistic, the researcher has to use:

Tcal = (X̄1 − X̄2) / [ Sp · √(1/n1 + 1/n2) ]    (9.24)

where:

Sp = √[ ((n1 − 1) · S1² + (n2 − 1) · S2²) / (n1 + n2 − 2) ]    (9.25)

and Tcal follows Student's t-distribution with ν = n1 + n2 − 2 degrees of freedom.
The calculated value must be compared to the value in Student's t-distribution table (Table B in the Appendix). This table provides the critical values of tc considering that P(Tcal > tc) = α (for a right-tailed test). For a bilateral test, we have P(Tcal < −tc) = α/2 = P(Tcal > tc), as shown in Fig. 9.24. Therefore, for a bilateral test, if the value of the statistic lies in the critical region, that is, if Tcal < −tc or Tcal > tc, the test allows us to reject the null hypothesis. On the other hand, if −tc ≤ Tcal ≤ tc, we do not reject H0.
The unilateral probability associated with the Tcal statistic (P1) can also be obtained from Table B. For a unilateral test, we have P = P1. For a bilateral test, this probability must be doubled (P = 2P1). Therefore, for both tests, we reject H0 if P ≤ α.

Example 9.10: Applying Student's t-Test to Two Independent Samples
A quality engineer believes that the average time to manufacture a certain plastic product may depend on the raw materials used, which come from two different suppliers. A sample with 30 observations from each supplier is collected for a test, and the results are shown in Tables 9.E.10 and 9.E.11. For a significance level α = 5%, check if there is any difference between the means.
Solution
Step 1: The suitable test to compare two population means with unknown σ is Student's t-test for two independent samples.
Step 2: For this example, Student's t-test hypotheses are:
H0: μ1 = μ2
H1: μ1 ≠ μ2
Step 3: The significance level to be considered is 5%.


TABLE 9.E.10 Manufacturing Time Using Raw Materials From Supplier 1

22.8  23.4  26.2  24.3  22.0  24.8  26.7  25.1  23.1  22.8
25.6  25.1  24.3  24.2  22.8  23.2  24.7  26.5  24.5  23.6
23.9  22.8  25.4  26.7  22.9  23.5  23.8  24.6  26.3  22.7

TABLE 9.E.11 Manufacturing Time Using Raw Materials From Supplier 2

26.8  29.3  28.4  25.6  29.4  27.2  27.6  26.8  25.4  28.6
29.7  27.2  27.9  28.4  26.0  26.8  27.5  28.5  27.3  29.1
29.2  25.7  28.4  28.6  27.9  27.4  26.7  26.8  25.6  26.1

Step 4: For the data in Tables 9.E.10 and 9.E.11, we calculate X̄1 = 24.277, X̄2 = 27.530, S1² = 1.810, and S2² = 1.559. Considering that the population variances are homogeneous, according to the solution generated on SPSS, let's use Expressions (9.24) and (9.25) to calculate the Tcal statistic, as follows:

$$S_p = \sqrt{\frac{29 \cdot 1.810 + 29 \cdot 1.559}{30 + 30 - 2}} = 1.298$$

$$T_{cal} = \frac{24.277 - 27.530}{1.298 \cdot \sqrt{\dfrac{1}{30} + \dfrac{1}{30}}} = -9.708$$

with ν = 30 + 30 − 2 = 58 degrees of freedom.
Step 5: The critical region of the bilateral test, considering ν = 58 degrees of freedom and α = 5%, can be defined from Student's t-distribution table (Table B in the Appendix), as shown in Fig. 9.25. For a bilateral test, each one of the tails corresponds to half of the significance level α.
FIG. 9.25 Critical region of Example 9.10.

Step 6: Decision: since the calculated value lies in the critical region, that is, Tcal < −2.002, we must reject the null hypothesis, which allows us to conclude, with a 95% confidence level, that the population means are different.
If, instead of comparing the calculated value to the critical value of Student's t-distribution, we use the P-value, Steps 5 and 6 become:
Step 5: According to Table B in the Appendix, for a right-tailed test with ν = 58 degrees of freedom, the probability P1 associated with |Tcal| = 9.708 is less than 0.0005. For a bilateral test, this probability must be doubled (P = 2P1).
Step 6: Decision: since P < 0.05, the null hypothesis is rejected.

9.6.1 Solving Student's t-Test From Two Independent Samples by Using SPSS Software

The data in Example 9.10 are available in the file T_test_Two_Independent_Samples.sav. The procedure for solving Student's t-test to compare two population means from two independent random samples on SPSS is described below. The use of the images in this section has been authorized by the International Business Machines Corporation©. We must click on Analyze → Compare Means → Independent-Samples T Test …, as shown in Fig. 9.26.


Let's include the variable Time in Test Variable(s) and the variable Supplier in Grouping Variable. Next, let's click on Define Groups … to define the groups (categories) of the variable Supplier, as shown in Fig. 9.27. If the confidence level desired by the researcher is different from 95%, the button Options … must be selected to change it. Finally, let's click on OK. The results of the test are shown in Fig. 9.28. The value of the t statistic for the test is −9.708 and the associated bilateral probability is 0.000 (P < 0.05), which leads us to reject the null hypothesis and allows us to conclude, with a 95% confidence level, that the population means are different. We can notice that Fig. 9.28 also shows the result of Levene's test. Since the observed significance level is 0.694, which is greater than 0.05, we can also conclude, with a 95% confidence level, that the variances are homogeneous.

FIG. 9.26 Procedure for elaborating the t-test from two independent samples on SPSS.

FIG. 9.27 Selecting the variables and defining the groups.


FIG. 9.28 Results of the t-test for two independent samples for Example 9.10 on SPSS.

FIG. 9.29 Results of the t-test for two independent samples for Example 9.10 on Stata.

9.6.2 Solving Student's t-Test From Two Independent Samples by Using Stata Software

The use of the images in this section has been authorized by StataCorp LP©. The t-test to compare the means of two independent groups on Stata is elaborated by using the following syntax: ttest variable*, by(groups*)

where the term variable* must be substituted for the quantitative variable being analyzed, and the term groups* for the categorical variable that represents them. The data in Example 9.10 are available in the file T_test_Two_Independent_Samples.dta. The variable supplier shows the groups of suppliers. The values for each group of suppliers are specified in the variable time. Thus, we must type the following command: ttest time, by(supplier)

The result of the test can be seen in Fig. 9.29. The calculated value of the statistic (−9.708) matches the one calculated in Example 9.10 and also generated on SPSS, as does the associated probability for a bilateral test (0.000). Since P < 0.05, the null hypothesis is rejected, which allows us to conclude, with a 95% confidence level, that the population means are different.
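The same result can also be reproduced outside SPSS and Stata. The snippet below is our own illustration, applying SciPy's pooled-variance t-test to the data of Tables 9.E.10 and 9.E.11.

```python
# Cross-check of Example 9.10 with SciPy (our addition; not part of the original text).
from scipy import stats

supplier1 = [22.8, 23.4, 26.2, 24.3, 22.0, 24.8, 26.7, 25.1, 23.1, 22.8,
             25.6, 25.1, 24.3, 24.2, 22.8, 23.2, 24.7, 26.5, 24.5, 23.6,
             23.9, 22.8, 25.4, 26.7, 22.9, 23.5, 23.8, 24.6, 26.3, 22.7]
supplier2 = [26.8, 29.3, 28.4, 25.6, 29.4, 27.2, 27.6, 26.8, 25.4, 28.6,
             29.7, 27.2, 27.9, 28.4, 26.0, 26.8, 27.5, 28.5, 27.3, 29.1,
             29.2, 25.7, 28.4, 28.6, 27.9, 27.4, 26.7, 26.8, 25.6, 26.1]

t, p = stats.ttest_ind(supplier1, supplier2, equal_var=True)  # pooled-variance t-test
print(t, p)  # t close to -9.71; two-sided P-value far below 0.05
```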

9.7 STUDENT'S T-TEST TO COMPARE TWO POPULATION MEANS FROM TWO PAIRED RANDOM SAMPLES
This test is applied to check whether the means of two paired or related samples, obtained from the same population (before and after) with a normal distribution, are significantly different or not. Besides the normality of the data of each sample, the test requires the homogeneity of the variances between the groups. Differently from the t-test for two independent samples, first we must calculate the difference between each pair of values in position i (di = Xbefore,i − Xafter,i, i = 1, …, n) and, after that, test the null hypothesis that the mean of the differences in the population is zero.


For a bilateral test, we have:
H0: μd = 0, where μd = μbefore − μafter
H1: μd ≠ 0
The Tcal statistic for the test is given by:

$$T_{cal} = \frac{\bar{d} - \mu_d}{S_d / \sqrt{n}} \sim t_{\nu = n - 1} \quad (9.26)$$

where:

$$\bar{d} = \frac{\sum_{i=1}^{n} d_i}{n} \quad (9.27)$$

and

$$S_d = \sqrt{\frac{\sum_{i=1}^{n} \left(d_i - \bar{d}\right)^2}{n - 1}} \quad (9.28)$$

The calculated value must be compared to the value in Student's t-distribution table (Table B in the Appendix). This table provides the critical values of tc such that P(Tcal > tc) = α (for a right-tailed test). For a bilateral test, we have P(Tcal < −tc) = α/2 = P(Tcal > tc), as shown in Fig. 9.30.
FIG. 9.30 Nonrejection region (NR) and critical region (CR) of Student's t-distribution for a bilateral test.

Therefore, for a bilateral test, the null hypothesis is rejected if Tcal < −tc or Tcal > tc. If −tc ≤ Tcal ≤ tc, we do not reject H0. The unilateral probabilities associated with the Tcal statistic (P1) can also be obtained from Table B. For a unilateral test, we have P = P1. For a bilateral test, this probability must be doubled (P = 2P1). Therefore, for both tests, we reject H0 if P ≤ α.

Example 9.11: Applying Student's t-Test to Two Paired Samples
A group of 10 machine operators, responsible for carrying out a certain task, is trained to perform the same task more efficiently. To verify if there is a reduction in the time taken to perform the task, we measured the time spent by each operator, before and after the training course (Tables 9.E.12 and 9.E.13). Test the hypothesis that the population means of both paired samples are similar, that is, that there is no reduction in the time taken to perform the task after the training course. Consider α = 5%.

TABLE 9.E.12 Time Spent Per Operator Before the Training Course
3.2  3.6  3.4  3.8  3.4  3.5  3.7  3.2  3.5  3.9

TABLE 9.E.13 Time Spent Per Operator After the Training Course
3.0  3.3  3.5  3.6  3.4  3.3  3.4  3.0  3.2  3.6

Solution
Step 1: In this case, the most suitable test is Student's t-test for two paired samples. Since the test requires the normality of the data in each sample and the homogeneity of the variances between the groups, the K-S or S-W tests, besides Levene's test, must be applied for this verification. As we will see in the solution of this example on SPSS, all of these assumptions will be validated.


Step 2: For this example, Student's t-test hypotheses are:
H0: μd = 0
H1: μd ≠ 0
Step 3: The significance level to be considered is 5%.
Step 4: In order to calculate the Tcal statistic, first we must calculate di:

TABLE 9.E.14 Calculating di
Xbefore,i:  3.2   3.6   3.4   3.8   3.4   3.5   3.7   3.2   3.5   3.9
Xafter,i:   3.0   3.3   3.5   3.6   3.4   3.3   3.4   3.0   3.2   3.6
di:         0.2   0.3  −0.1   0.2   0.0   0.2   0.3   0.2   0.3   0.3

$$\bar{d} = \frac{\sum_{i=1}^{n} d_i}{n} = \frac{0.2 + 0.3 + \dots + 0.3}{10} = 0.19$$

$$S_d = \sqrt{\frac{(0.2 - 0.19)^2 + (0.3 - 0.19)^2 + \dots + (0.3 - 0.19)^2}{9}} = 0.137$$

$$T_{cal} = \frac{\bar{d}}{S_d / \sqrt{n}} = \frac{0.19}{0.137 / \sqrt{10}} = 4.385$$

Step 5: The critical region of the bilateral test can be defined from Student's t-distribution table (Table B in the Appendix), considering ν = 9 degrees of freedom and α = 5%, as shown in Fig. 9.31. For a bilateral test, each tail corresponds to half of the significance level α.
FIG. 9.31 Critical region of Example 9.11.

Step 6: Decision: since the calculated value lies in the critical region (Tcal > 2.262), the null hypothesis is rejected, which allows us to conclude, with a 95% confidence level, that there is a significant difference between the times spent by the operators before and after the training course.
If, instead of comparing the calculated value to the critical value of Student's t-distribution, we use the P-value, Steps 5 and 6 become:
Step 5: According to Table B in the Appendix, for a right-tailed test with ν = 9 degrees of freedom, the probability P1 associated with Tcal = 4.385 is between 0.0005 and 0.001. For a bilateral test, this probability must be doubled (P = 2P1), so 0.001 < P < 0.002.
Step 6: Decision: since P < 0.05, the null hypothesis is rejected.
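Before the software solutions, the arithmetic of Example 9.11 can be cross-checked with a short script (our own addition; scipy.stats.ttest_rel implements the paired statistic of Expression (9.26)).

```python
# Cross-check of Example 9.11 with SciPy (illustrative; not part of the original text).
from scipy import stats

before = [3.2, 3.6, 3.4, 3.8, 3.4, 3.5, 3.7, 3.2, 3.5, 3.9]
after  = [3.0, 3.3, 3.5, 3.6, 3.4, 3.3, 3.4, 3.0, 3.2, 3.6]

t, p = stats.ttest_rel(before, after)  # paired t-test on the 10 operators
print(t, p)  # t close to 4.39 with 9 degrees of freedom; two-sided P-value close to 0.002
```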

9.7.1 Solving Student's t-Test From Two Paired Samples by Using SPSS Software

First, we must test the normality of the data in each sample, as well as the variance homogeneity between the groups. Using the same procedures described in Sections 9.3.3 and 9.4.5 (the data must be placed in a table the same way as in Section 9.4.5), we obtain Figs. 9.32 and 9.33. Based on Fig. 9.32, we conclude that there is normality of the data for each sample. From Fig. 9.33, we can conclude that the variances between the samples are homogeneous.


The use of the images in this section has been authorized by the International Business Machines Corporation©. To solve Student's t-test for two paired samples on SPSS, we must open the file T_test_Two_Paired_Samples.sav. Then, we have to click on Analyze → Compare Means → Paired-Samples T Test …, as shown in Fig. 9.34. We must select the variable Before and move it to Variable1 and the variable After to Variable2, as shown in Fig. 9.35. If the desired confidence level is different from 95%, we must click on Options … to change it. Finally, let's click on OK. The results of the test are shown in Fig. 9.36. The value of the t-test is 4.385 and the significance level observed for a bilateral test is 0.002, which is less than 0.05; this leads us to reject the null hypothesis and allows us to conclude, with a 95% confidence level, that there is a significant difference between the times spent by the operators before and after the training course.

FIG. 9.32 Results of the normality tests on SPSS.

FIG. 9.33 Results of Levene’s test on SPSS.

FIG. 9.34 Procedure for elaborating the t-test from two paired samples on SPSS.


FIG. 9.35 Selecting the variables that will be paired.

FIG. 9.36 Results of the t-test for two paired samples.

FIG. 9.37 Results of Student’s t-test for two paired samples for Example 9.11 on Stata.

9.7.2 Solving Student's t-Test From Two Paired Samples by Using Stata Software

The t-test to compare the means of two paired groups will be solved on Stata for the data in Example 9.11. The use of the images in this section has been authorized by StataCorp LP©. Therefore, let’s open the file T_test_Two_Paired_Samples.dta. The paired variables are called before and after. In this case, we must type the following command: ttest before == after

The result of the test can be seen in Fig. 9.37. The calculated value of the statistic (4.385) matches the one calculated in Example 9.11 and on SPSS, as does the probability associated with the statistic for a bilateral test (0.0018). Since P < 0.05, we reject the null hypothesis that the times spent by the operators before and after the training course are the same, with a 95% confidence level.

9.8 ANOVA TO COMPARE THE MEANS OF MORE THAN TWO POPULATIONS

ANOVA is a test used to compare the means of three or more populations, through the analysis of sample variances. The test is based on a sample obtained from each population, aiming at determining whether the differences between the sample means suggest significant differences between the population means, or whether such differences are only a result of the implicit variability of the sample. ANOVA's assumptions are: (i) the samples must be independent from each other; (ii) the data in the populations must have a normal distribution; (iii) the population variances must be homogeneous.

9.8.1 One-Way ANOVA

One-way ANOVA is an extension of Student's t-test for two population means, allowing the researcher to compare three or more population means. The null hypothesis of the test states that the population means are the same. If at least one group has a mean that is different from the others, the null hypothesis is rejected. As stated in Fávero et al. (2009), the one-way ANOVA allows the researcher to verify the effect of a qualitative explanatory variable (factor) on a quantitative dependent variable. Each group includes the observations of the dependent variable in one category of the factor. Assuming that independent samples of size n are obtained from k populations (k ≥ 3) and that the means of these populations can be represented by μ1, μ2, …, μk, the analysis of variance tests the following hypotheses:

H0: μ1 = μ2 = … = μk
H1: ∃(i, j) such that μi ≠ μj, i ≠ j     (9.29)

According to Maroco (2014), in general, the observations for this type of problem can be represented according to Table 9.2, where Yij represents observation i of sample or group j (i = 1, …, nj; j = 1, …, k) and nj is the dimension of sample or group j. The dimension of the global sample is N = n1 + n2 + … + nk. Pestana and Gageiro (2008) present the following model:

$$Y_{ij} = \mu_i + \varepsilon_{ij} \quad (9.30)$$

$$Y_{ij} = \mu + (\mu_i - \mu) + \varepsilon_{ij} \quad (9.31)$$

$$Y_{ij} = \mu + \alpha_i + \varepsilon_{ij} \quad (9.32)$$

where:
μ is the global mean of the population;
μi is the mean of sample or group i;
αi is the effect of sample or group i;
εij is the random error.

TABLE 9.2 Observations of the One-Way ANOVA
Samples or Groups
  1       2       …      k
 Y11     Y12      …     Y1k
 Y21     Y22      …     Y2k
  ⋮       ⋮              ⋮
 Yn1,1   Yn2,2    …     Ynk,k


Therefore, ANOVA assumes that each group comes from a population with a normal distribution, mean μi, and a homogeneous variance, that is, Yij ~ N(μi, σ), resulting in the hypothesis that the errors (residuals) have a normal distribution with a mean equal to zero and a constant variance, that is, εij ~ N(0, σ), besides being independent (Fávero et al., 2009).
The technique's hypotheses are tested from the calculation of the group variances, and that is where the name ANOVA comes from. The technique involves the calculation of the variations between the groups (Ȳi − Ȳ) and within each group (Yij − Ȳi). The residual sum of squares within groups (RSS) is calculated by:

$$RSS = \sum_{i=1}^{k} \sum_{j=1}^{n_i} \left(Y_{ij} - \bar{Y}_i\right)^2 \quad (9.33)$$

The sum of squares between groups, or the sum of squares of the factor (SSF), is given by:

$$SSF = \sum_{i=1}^{k} n_i \cdot \left(\bar{Y}_i - \bar{Y}\right)^2 \quad (9.34)$$

Therefore, the total sum of squares is:

$$TSS = RSS + SSF = \sum_{i=1}^{k} \sum_{j=1}^{n_i} \left(Y_{ij} - \bar{Y}\right)^2 \quad (9.35)$$

According to Fávero et al. (2009) and Maroco (2014), the ANOVA statistic is given by the division between the variance of the factor (SSF divided by k − 1 degrees of freedom) and the variance of the residuals (RSS divided by N − k degrees of freedom):

$$F_{cal} = \frac{SSF / (k - 1)}{RSS / (N - k)} = \frac{MSF}{MSR} \quad (9.36)$$

where:
MSF represents the mean square between groups (estimate of the variance of the factor);
MSR represents the mean square within groups (estimate of the variance of the residuals).
Table 9.3 summarizes the calculations of the one-way ANOVA.
The value of F can be null or positive, but never negative. Therefore, ANOVA requires an F-distribution that is asymmetrical to the right. The calculated value (Fcal) must be compared to the value in the F-distribution table (Table A in the Appendix). This table provides the critical values of Fc = Fk−1, N−k, α where P(Fcal > Fc) = α (right-tailed test). Therefore, the one-way ANOVA's null hypothesis is rejected if Fcal > Fc. Otherwise, if Fcal ≤ Fc, we do not reject H0. We will use these concepts when we study the estimation of regression models in Chapter 13.

TABLE 9.3 Calculating the One-Way ANOVA
Source of Variation   Sum of Squares                Degrees of Freedom   Mean Squares        F
Between the groups    SSF = Σᵢ nᵢ·(Ȳᵢ − Ȳ)²         k − 1                MSF = SSF/(k − 1)   F = MSF/MSR
Within the groups     RSS = Σᵢ Σⱼ (Yᵢⱼ − Ȳᵢ)²       N − k                MSR = RSS/(N − k)
Total                 TSS = Σᵢ Σⱼ (Yᵢⱼ − Ȳ)²        N − 1

Source: Fávero, L.P., Belfiore, P., Silva, F.L., Chan, B.L., 2009. Análise de dados: modelagem multivariada para tomada de decisões. Campus Elsevier, Rio de Janeiro; Maroco, J., 2014. Análise estatística com o SPSS Statistics, sixth ed. Edições Sílabo, Lisboa.
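As an aside (our own sketch, not part of the original text), the quantities in Table 9.3 translate into only a few lines of Python; the function below computes the F statistic of Expression (9.36) from a list of groups.

```python
# Minimal sketch of Expressions (9.33)-(9.36): sums of squares and the one-way ANOVA F.
import numpy as np

def one_way_f(groups):
    """groups: list of 1-D sequences, one per sample/group; returns the F statistic."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_obs = np.concatenate(groups)
    grand_mean = all_obs.mean()
    k, n_total = len(groups), all_obs.size

    ssf = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)  # SSF, Expression (9.34)
    rss = sum(((g - g.mean()) ** 2).sum() for g in groups)            # RSS, Expression (9.33)

    msf = ssf / (k - 1)        # mean square between groups
    msr = rss / (n_total - k)  # mean square within groups
    return msf / msr           # F with (k - 1, N - k) degrees of freedom
```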


TABLE 9.E.15 Percentage of Sucrose for the Three Suppliers
Supplier 1 (n1 = 12):  0.33  0.79  1.24  1.75  0.94  2.42  1.97  0.87  0.33  0.79  1.24  3.12
Supplier 2 (n2 = 10):  1.54  1.11  0.97  2.57  2.94  3.44  3.02  3.55  2.04  1.67
Supplier 3 (n3 = 10):  1.47  1.69  1.55  2.04  2.67  3.07  3.33  4.01  1.52  2.03

Ȳ1 = 1.316, S1 = 0.850;  Ȳ2 = 2.285, S2 = 0.948;  Ȳ3 = 2.338, S3 = 0.886

Example 9.12: Applying the One-Way ANOVA Test
A sample with 32 products is collected to analyze the quality of the honey supplied by three different suppliers. One of the ways to test the quality of the honey is finding out how much sucrose it contains, which usually varies between 0.25% and 6.5%. Table 9.E.15 shows the percentage of sucrose in the sample collected from each supplier. Check if there are differences in this quality indicator among the three suppliers, considering a 5% significance level.
Solution
Step 1: In this case, the most suitable test is the one-way ANOVA. First, we must verify the assumptions of normality for each group and of variance homogeneity between the groups through the Kolmogorov-Smirnov, Shapiro-Wilk, and Levene tests. Figs. 9.38 and 9.39 show the results obtained by using SPSS software.

FIG. 9.38 Results of the tests for normality on SPSS.

FIG. 9.39 Results of Levene’s test on SPSS.


Since the significance level observed in the tests for normality for each group and in the variance homogeneity test between the groups is greater than 5%, we can conclude that each one of the groups shows data with a normal distribution and that the variances between the groups are homogeneous, with a 95% confidence level. Since the assumptions of the one-way ANOVA were met, the technique can be applied.
Step 2: For this example, ANOVA's null hypothesis states that there are no differences in the amount of sucrose coming from the three suppliers. If there is at least one supplier with a population mean that is different from the others, the null hypothesis will be rejected. Thus, we have:
H0: μ1 = μ2 = μ3
H1: ∃(i, j) such that μi ≠ μj, i ≠ j
Step 3: The significance level to be considered is 5%.
Step 4: The calculation of the Fcal statistic is specified here. For this example, we know that k = 3 groups and the global sample size is N = 32. The global sample mean is Ȳ = 1.938. The sum of squares between groups (SSF) is:

$$SSF = 12 \cdot (1.316 - 1.938)^2 + 10 \cdot (2.285 - 1.938)^2 + 10 \cdot (2.338 - 1.938)^2 = 7.449$$

Therefore, the mean square between groups (MSF) is:

$$MSF = \frac{SSF}{k - 1} = \frac{7.449}{2} = 3.725$$

The calculation of the sum of squares within groups (RSS) is shown in Table 9.E.16.

TABLE 9.E.16 Calculation of the Sum of Squares Within Groups (RSS)
Supplier   Sucrose   Yij − Ȳi   (Yij − Ȳi)²
1          0.33      −0.986     0.972
1          0.79      −0.526     0.277
1          1.24      −0.076     0.006
1          1.75       0.434     0.189
1          0.94      −0.376     0.141
1          2.42       1.104     1.219
1          1.97       0.654     0.428
1          0.87      −0.446     0.199
1          0.33      −0.986     0.972
1          0.79      −0.526     0.277
1          1.24      −0.076     0.006
1          3.12       1.804     3.255
2          1.54      −0.745     0.555
2          1.11      −1.175     1.381
2          0.97      −1.315     1.729
2          2.57       0.285     0.081
2          2.94       0.655     0.429
2          3.44       1.155     1.334
2          3.02       0.735     0.540
2          3.55       1.265     1.600
2          2.04      −0.245     0.060
2          1.67      −0.615     0.378
3          1.47      −0.868     0.753
3          1.69      −0.648     0.420
3          1.55      −0.788     0.621
3          2.04      −0.298     0.089
3          2.67       0.332     0.110
3          3.07       0.732     0.536
3          3.33       0.992     0.984
3          4.01       1.672     2.796
3          1.52      −0.818     0.669
3          2.03      −0.308     0.095
RSS                              23.100

Therefore, the mean square within groups is:

$$MSR = \frac{RSS}{N - k} = \frac{23.100}{29} = 0.797$$

Thus, the value of the Fcal statistic is:

$$F_{cal} = \frac{MSF}{MSR} = \frac{3.725}{0.797} = 4.676$$

Step 5: According to Table A in the Appendix, the critical value of the statistic is Fc = F2, 29, 5% = 3.33.
Step 6: Decision: since the calculated value lies in the critical region (Fcal > Fc), we reject the null hypothesis, which allows us to conclude, with a 95% confidence level, that there is at least one supplier with a population mean that is different from the others.
If, instead of comparing the calculated value to the critical value of Snedecor's F-distribution, we use the P-value, Steps 5 and 6 become:
Step 5: According to Table A in the Appendix, for ν1 = 2 degrees of freedom in the numerator and ν2 = 29 degrees of freedom in the denominator, the probability associated with Fcal = 4.676 is between 0.01 and 0.025 (P-value).
Step 6: Decision: since P < 0.05, the null hypothesis is rejected.

9.8.1.1 Solving the One-Way ANOVA Test by Using SPSS Software
The use of the images in this section has been authorized by the International Business Machines Corporation©. The data in Example 9.12 are available in the file One_Way_ANOVA.sav. First of all, let's click on Analyze → Compare Means → One-Way ANOVA …, as shown in Fig. 9.40. Let's include the variable Sucrose in the list of dependent variables (Dependent List) and the variable Supplier in the box Factor, according to Fig. 9.41. After that, we must click on Options … and select the option Homogeneity of variance test (Levene's test for variance homogeneity). Finally, let's click on Continue and on OK to obtain the result of Levene's test, besides the ANOVA table. Since the one-way ANOVA dialog does not make the normality test available, it must be obtained by applying the same procedure described in Section 9.3.3. According to Fig. 9.42, we can verify that each one of the groups has data that follow a normal distribution. Moreover, through Fig. 9.43, we can conclude that the variances between the groups are homogeneous.


FIG. 9.40 Procedure for the one-way ANOVA.

FIG. 9.41 Selecting the variables.

From the ANOVA table (Fig. 9.44), we can see that the value of the F-test is 4.676 and the respective P-value is 0.017 (we saw in Example 9.12 that this value would be between 0.01 and 0.025), which is less than 0.05. This leads us to reject the null hypothesis and allows us to conclude, with a 95% confidence level, that at least one of the population means is different from the others (there are differences in the percentage of sucrose in the honey of the three suppliers).

9.8.1.2 Solving the One-Way ANOVA Test by Using Stata Software
The use of the images in this section has been authorized by StataCorp LP©. The one-way ANOVA on Stata is generated from the following syntax:
anova variabley* factor*


FIG. 9.42 Results of the tests for normality for Example 9.12 on SPSS.

FIG. 9.43 Results of Levene’s test for Example 9.12 on SPSS.

FIG. 9.44 Results of the one-way ANOVA for Example 9.12 on SPSS.

FIG. 9.45 Results of the one-way ANOVA on Stata.

in which the term variabley* should be substituted for the quantitative dependent variable and the term factor* for the qualitative explanatory variable. The data in Example 9.12 are available in the file One_Way_Anova.dta. The quantitative dependent variable is called sucrose and the factor is represented by the variable supplier. Thus, we must type the following command: anova sucrose supplier

The result of the test can be seen in Fig. 9.45. We can see that the calculated value of the statistic (4.68) is similar to the one calculated in Example 9.12 and also generated on SPSS, as well as the probability associated to the value of the statistic (0.017). Since P < 0.05, the null hypothesis is rejected, which allows us to conclude, with a 95% confidence level, that at least one of the population means is different from the others.

9.8.2 Factorial ANOVA

Factorial ANOVA is an extension of the one-way ANOVA, with the same assumptions, but considering two or more factors. Factorial ANOVA presumes that the quantitative dependent variable is influenced by more than one qualitative explanatory variable (factor). It also tests the possible interactions between the factors, through the resulting effect of the combination of factor A's level i and factor B's level j, as discussed by Pestana and Gageiro (2008), Fávero et al. (2009), and Maroco (2014). For Pestana and Gageiro (2008) and Fávero et al. (2009), the main objective of the factorial ANOVA is to determine whether the means for each factor level are the same (the isolated effect of each factor on the dependent variable), and to verify the interaction between the factors (the joint effect of the factors on the dependent variable). For educational purposes, the factorial ANOVA will be described for the two-way model.

9.8.2.1 Two-Way ANOVA
According to Fávero et al. (2009) and Maroco (2014), the observations of the two-way ANOVA can be represented, in general, as shown in Table 9.4. In each cell, we can see the values of the dependent variable for the levels of factors A and B that are being studied, where Yijk represents observation k (k = 1, …, n) of factor A's level i (i = 1, …, a) and of factor B's level j (j = 1, …, b). First, in order to check the isolated effects of factors A and B, we must test the following hypotheses (Fávero et al., 2009; Maroco, 2014):

H0(A): μ1 = μ2 = … = μa
H1(A): ∃(i, j) such that μi ≠ μj, i ≠ j (i, j = 1, …, a)     (9.37)

and

H0(B): μ1 = μ2 = … = μb
H1(B): ∃(i, j) such that μi ≠ μj, i ≠ j (i, j = 1, …, b)     (9.38)

TABLE 9.4 Observations of the Two-Way ANOVA
                         Factor B
Factor A       1          2          …         b
1            Y111       Y121        …        Y1b1
             Y112       Y122        …        Y1b2
              ⋮           ⋮                    ⋮
             Y11n       Y12n        …        Y1bn
2            Y211       Y221        …        Y2b1
             Y212       Y222        …        Y2b2
              ⋮           ⋮                    ⋮
             Y21n       Y22n        …        Y2bn
⋮             ⋮           ⋮                    ⋮
a            Ya11       Ya21        …        Yab1
             Ya12       Ya22        …        Yab2
              ⋮           ⋮                    ⋮
             Ya1n       Ya2n        …        Yabn

Source: Fávero, L.P., Belfiore, P., Silva, F.L., Chan, B.L., 2009. Análise de dados: modelagem multivariada para tomada de decisões. Campus Elsevier, Rio de Janeiro; Maroco, J., 2014. Análise estatística com o SPSS Statistics, sixth ed. Edições Sílabo, Lisboa.


Now, in order to verify the joint effect of the factors on the dependent variable, we must test the following hypotheses (Fávero et al., 2009; Maroco, 2014):

H0: γij = 0, for i ≠ j (there is no interaction between factors A and B)
H1: γij ≠ 0, for i ≠ j (there is interaction between factors A and B)     (9.39)

The model presented by Pestana and Gageiro (2008) can be described as:

$$Y_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{ij} + \varepsilon_{ijk} \quad (9.40)$$

where:
μ is the population's global mean;
αi is the effect of factor A's level i, given by μi − μ;
βj is the effect of factor B's level j, given by μj − μ;
γij is the interaction between the factors;
εijk is the random error that follows a normal distribution with a mean equal to zero and a constant variance.
To standardize the effects of the levels chosen for both factors, we must assume that:

$$\sum_{i=1}^{a} \alpha_i = \sum_{j=1}^{b} \beta_j = \sum_{i=1}^{a} \gamma_{ij} = \sum_{j=1}^{b} \gamma_{ij} = 0 \quad (9.41)$$

Let's consider Ȳ, Ȳij, Ȳi, and Ȳj to be, respectively, the general mean of the global sample, the mean of each cell (sample), the mean of factor A's level i, and the mean of factor B's level j. We can describe the residual sum of squares (RSS) as:

$$RSS = \sum_{i=1}^{a} \sum_{j=1}^{b} \sum_{k=1}^{n} \left(Y_{ijk} - \bar{Y}_{ij}\right)^2 \quad (9.42)$$

On the other hand, the sum of squares of factor A (SSFA), the sum of squares of factor B (SSFB), and the sum of squares of the interaction (SSFAB) are represented in Expressions (9.43)–(9.45), respectively:

$$SSF_A = b \cdot n \cdot \sum_{i=1}^{a} \left(\bar{Y}_i - \bar{Y}\right)^2 \quad (9.43)$$

$$SSF_B = a \cdot n \cdot \sum_{j=1}^{b} \left(\bar{Y}_j - \bar{Y}\right)^2 \quad (9.44)$$

$$SSF_{AB} = n \cdot \sum_{i=1}^{a} \sum_{j=1}^{b} \left(\bar{Y}_{ij} - \bar{Y}_i - \bar{Y}_j + \bar{Y}\right)^2 \quad (9.45)$$

Therefore, the total sum of squares can be written as follows:

$$TSS = RSS + SSF_A + SSF_B + SSF_{AB} = \sum_{i=1}^{a} \sum_{j=1}^{b} \sum_{k=1}^{n} \left(Y_{ijk} - \bar{Y}\right)^2 \quad (9.46)$$

Thus, the ANOVA statistic for factor A is given by:

$$F_A = \frac{SSF_A / (a - 1)}{RSS / \left[(n - 1) \cdot a \cdot b\right]} = \frac{MSF_A}{MSR} \quad (9.47)$$

where:
MSFA is the mean square of factor A;
MSR is the mean square of the errors.


TABLE 9.5 Calculations of the Two-Way ANOVA
Source of Variation   Sum of Squares                               Degrees of Freedom   Mean Squares                          F
Factor A              SSF_A = b·n·Σᵢ (Ȳᵢ − Ȳ)²                     a − 1                MSF_A = SSF_A/(a − 1)                 F_A = MSF_A/MSR
Factor B              SSF_B = a·n·Σⱼ (Ȳⱼ − Ȳ)²                     b − 1                MSF_B = SSF_B/(b − 1)                 F_B = MSF_B/MSR
Interaction           SSF_AB = n·Σᵢ Σⱼ (Ȳᵢⱼ − Ȳᵢ − Ȳⱼ + Ȳ)²        (a − 1)·(b − 1)      MSF_AB = SSF_AB/[(a − 1)·(b − 1)]     F_AB = MSF_AB/MSR
Error                 RSS = Σᵢ Σⱼ Σₖ (Yᵢⱼₖ − Ȳᵢⱼ)²                 (n − 1)·a·b          MSR = RSS/[(n − 1)·a·b]
Total                 TSS = Σᵢ Σⱼ Σₖ (Yᵢⱼₖ − Ȳ)²                   N − 1

Source: Fávero, L.P., Belfiore, P., Silva, F.L., Chan, B.L., 2009. Análise de dados: modelagem multivariada para tomada de decisões. Campus Elsevier, Rio de Janeiro; Maroco, J., 2014. Análise estatística com o SPSS Statistics, sixth ed. Edições Sílabo, Lisboa.

On the other hand, the ANOVA statistic for factor B is given by:

$$F_B = \frac{SSF_B / (b - 1)}{RSS / \left[(n - 1) \cdot a \cdot b\right]} = \frac{MSF_B}{MSR} \quad (9.48)$$

where:
MSFB is the mean square of factor B.
And the ANOVA statistic for the interaction is represented by:

$$F_{AB} = \frac{SSF_{AB} / \left[(a - 1) \cdot (b - 1)\right]}{RSS / \left[(n - 1) \cdot a \cdot b\right]} = \frac{MSF_{AB}}{MSR} \quad (9.49)$$

where:
MSFAB is the mean square of the interaction.
The calculations of the two-way ANOVA are summarized in Table 9.5.
The calculated values of the statistics (FA, FB, and FAB) must be compared to the critical values obtained from the F-distribution table (Table A in the Appendix): Fc(A) = Fa−1, (n−1)·a·b, α; Fc(B) = Fb−1, (n−1)·a·b, α; and Fc(AB) = F(a−1)·(b−1), (n−1)·a·b, α. For each statistic, if the calculated value lies in the critical region (FA > Fc(A), FB > Fc(B), FAB > Fc(AB)), we must reject the null hypothesis. Otherwise, we do not reject H0.

Example 9.13: Using the Two-Way ANOVA
A sample with 24 passengers who travel from Sao Paulo to Campinas in a certain week is collected. The following variables are analyzed: (1) travel time in minutes, (2) the bus company chosen, and (3) the day of the week. The main objective is to verify if there is a relationship between the travel time and the bus company, between the travel time and the day of the week, and between the travel time and the interaction of bus company and day of the week. The levels considered in the variable bus company are Company A (1), Company B (2), and Company C (3). The levels regarding the day of the week are Monday (1), Tuesday (2), Wednesday (3), Thursday (4), Friday (5), Saturday (6), and Sunday (7). The results of the sample are shown in Table 9.E.17 and are also available in the file Two_Way_ANOVA.sav. Test these hypotheses, considering a 5% significance level.


TABLE 9.E.17 Data From Example 9.13 (Using the Two-Way ANOVA)
Time (Min)   Company   Day of the Week
90           2         4
100          1         5
72           1         6
76           3         1
85           2         2
95           1         5
79           3         1
100          2         4
70           1         7
80           3         1
85           2         3
90           1         5
77           2         7
80           1         2
85           3         4
74           2         7
72           3         6
92           1         5
84           2         4
80           1         3
79           2         1
70           3         6
88           3         5
84           2         4

9.8.2.1.1 Solving the Two-Way ANOVA Test by Using SPSS Software
The use of the images in this section has been authorized by the International Business Machines Corporation©.
Step 1: In this case, the most suitable test is the two-way ANOVA. First, we must verify if there is normality in the variable Time (metric) in the model (as shown in Fig. 9.46). According to this figure, we can conclude that the variable Time follows a normal distribution, with a 95% confidence level. The hypothesis of variance homogeneity will be verified in Step 4.
Step 2: The null hypothesis H0 of the two-way ANOVA for this example assumes that the population means of each level of the factor Company and of each level of the factor Day_of_the_week are equal, that is, H0(A): μ1 = μ2 = μ3 and H0(B): μ1 = μ2 = … = μ7. The null hypothesis also states that there is no interaction between the factor Company and the factor Day_of_the_week, that is, H0: γij = 0 for i ≠ j.
Step 3: The significance level to be considered is 5%.


FIG. 9.46 Results of the normality tests on SPSS.

FIG. 9.47 Procedure for elaborating the two-way ANOVA on SPSS.

Step 4: The F statistics in ANOVA for the factor Company, for the factor Day_of_the_week, and for the interaction Company * Day_of_the_week will be obtained through the SPSS software, according to the procedure specified below. In order to do that, let's click on Analyze → General Linear Model → Univariate …, as shown in Fig. 9.47. After that, let's include the variable Time in the box of dependent variables (Dependent Variable) and the variables Company and Day_of_the_week in the box Fixed Factor(s), as shown in Fig. 9.48. This example considers both factors as fixed; if one of the factors had been chosen randomly, it would be inserted into the box Random Factor(s) instead. The button Model … defines the variance analysis model to be tested. Through the button Contrasts …, we can assess if one category of a factor is significantly different from the other categories of the same factor. Charts can be constructed through the button Plots …, thus allowing the visualization of the existence or nonexistence of interactions between the factors. The button Post Hoc …, on the other hand, allows us to compare multiple means. Finally, from the button Options …, we can obtain descriptive statistics and the result of Levene's variance homogeneity test, as well as select the appropriate significance level (Fávero et al., 2009; Maroco, 2014).


FIG. 9.48 Selection of the variables to elaborate the two-way ANOVA.

Therefore, since we want to test variance homogeneity, we must select, in Options …, the option Homogeneity tests, as shown in Fig. 9.49. Finally, let's click on Continue and on OK to obtain Levene's variance homogeneity test and the two-way ANOVA table. In Fig. 9.50, we can see that the variances between groups are homogeneous (P = 0.451 > 0.05). Based on Fig. 9.51, we can conclude that there are no significant differences between the travel times of the companies analyzed, that is, the factor Company does not have a significant impact on the variable Time (P = 0.330 > 0.05). On the other hand, we conclude that there are significant differences between the days of the week, that is, the factor Day_of_the_week has a significant effect on the variable Time (P = 0.003 < 0.05). We finally conclude that there is no significant interaction, with a 95% confidence level, between the two factors Company and Day_of_the_week, since P = 0.898 > 0.05.

9.8.2.1.2 Solving the Two-Way ANOVA Test by Using Stata Software

The use of the images in this section has been authorized by StataCorp LP©. The command anova on Stata specifies the dependent variable being analyzed, as well as the respective factors. Interactions are specified using the character # between the factors. Thus, the two-way ANOVA is generated through the following syntax:
anova variabley* factorA* factorB* factorA*#factorB*
or simply:
anova variabley* factorA*##factorB*
in which the term variabley* should be substituted for the quantitative dependent variable and the terms factorA* and factorB* for the respective factors. If we type only the syntax anova variabley* factorA* factorB*, the ANOVA is elaborated for each factor alone, without the interaction between them.


FIG. 9.49 Test of variance homogeneity.

FIG. 9.50 Results of Levene’s test on SPSS.

The data presented in Example 9.13 are available in the file Two_Way_ANOVA.dta. The quantitative dependent variable is called time and the factors correspond to the variables company and day_of_the_week. Thus, we must type the following command: anova time company##day_of_the_week

The results can be seen in Fig. 9.52 and match those presented on SPSS, which allows us to conclude, with a 95% confidence level, that only the factor day_of_the_week has a significant effect on the variable time (P = 0.003 < 0.05), and that there is no significant interaction between the two factors analyzed (P = 0.898 > 0.05).
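A factorial ANOVA of this kind can also be fitted in Python with statsmodels. The sketch below is only an illustration on made-up, balanced data (it is not Example 9.13, whose unbalanced design with empty Company × Day cells is handled by the Type III approach used by SPSS and Stata); all names and values in it are ours.

```python
# Illustrative two-way ANOVA with statsmodels on invented, balanced data (2 x 3 design,
# two observations per cell); not the book's example.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.DataFrame({
    "time":    [90, 92, 85, 88, 80, 83, 95, 96, 84, 86, 78, 81],
    "company": ["A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B"],
    "day":     [1, 1, 2, 2, 3, 3, 1, 1, 2, 2, 3, 3],
})

model = ols("time ~ C(company) * C(day)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))  # F tests for each factor and for the interaction
```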


FIG. 9.51 Results of the two-way ANOVA for Example 9.13 on SPSS.

FIG. 9.52 Results of the two-way ANOVA for Example 9.13 on Stata.

9.8.2.2 ANOVA With More Than Two Factors
The two-way ANOVA can be generalized to three or more factors. According to Maroco (2014), the model becomes very complex, since the effect of multiple interactions can make the effect of the factors somewhat confusing. The generic model with three factors presented by the author is:

$$Y_{ijkl} = \mu + \alpha_i + \beta_j + \gamma_k + \alpha\beta_{ij} + \alpha\gamma_{ik} + \beta\gamma_{jk} + \alpha\beta\gamma_{ijk} + \varepsilon_{ijkl} \quad (9.50)$$

9.9 FINAL REMARKS

This chapter presented the concepts and objectives of parametric hypotheses tests and the general procedures for constructing each one of them. We studied the main types of tests and the situations in which each one of them must be used. Moreover, the advantages and disadvantages of each test were established, as well as their assumptions. We studied the tests for normality (Kolmogorov-Smirnov, Shapiro-Wilk, and Shapiro-Francia), variance homogeneity tests (Bartlett's χ², Cochran's C, Hartley's Fmax, and Levene's F), Student's t-test for one population mean, for two independent means, and for two paired means, as well as ANOVA and its extensions.


Regardless of the application's main goal, parametric tests can provide good and interesting research results that will be useful in the decision-making process. Together with a conscious choice of the modeling software, the correct use of each test must always be based on the underlying theory, without ever ignoring the researcher's experience and intuition.

9.10 EXERCISES
(1) In what situations should parametric tests be applied and what are the assumptions of these tests?
(2) What are the advantages and disadvantages of parametric tests?
(3) What are the main parametric tests to verify the normality of the data? In what situations must we use each one of them?
(4) What are the main parametric tests to verify the variance homogeneity between groups? In what situations must we use each one of them?
(5) To test a single population mean, we can use the z-test and Student's t-test. In what cases must each one of them be applied?
(6) What are the main mean comparison tests? What are the assumptions of each test?
(7) The monthly aircraft sales data throughout last year can be seen in the table below. Check and see if there is normality in the data. Consider α = 5%.

Jan.  Feb.  Mar.  Apr.  May  Jun.  Jul.  Aug.  Sept.  Oct.  Nov.  Dec.
48    52    50    49    47   50    51    54    39     56    52    55

(8) Test the normality of the temperature data listed below (α = 5%):

12.5  14.2  13.4  14.6  12.7  10.9  16.5  14.7  11.2  10.9  12.1  12.8
13.8  13.5  13.2  14.1  15.5  16.2  10.8  14.3  12.8  12.4  11.4  16.2
14.3  14.8  14.6  13.7  13.5  10.8  10.4  11.5  11.9  11.3  14.2  11.2
13.4  16.1  13.5  17.5  16.2  15.0  14.2  13.2  12.4  13.4  12.7  11.2

(9) The table shows the final grades of two students in nine subjects. Check and see if there is variance homogeneity between the students (α = 5%).

Student 1:  6.4  5.8  6.9  5.4  7.3  8.2  6.1  5.5  6.0
Student 2:  6.5  7.0  7.5  6.5  8.1  9.0  7.5  6.5  6.8

(10) A fat-free yogurt manufacturer states that the number of calories in each cup is 60 cal. In order to check if this information is true, a random sample with 36 cups is collected, and we observed that the average number of calories was 65 cal with a standard deviation of 3.5. Apply the appropriate test and check if the manufacturer's statement is true, considering a significance level of 5%.
(11) We would like to compare the average waiting time before being seen by a doctor (in minutes) in two hospitals. In order to do that, we collected a sample with 20 patients from each hospital. The data are available in the tables below. Check and see if there are differences between the average waiting times in both hospitals. Consider α = 1%.


Hospital 1:
72  58  91  88  70  76  98  101  65  73
79  82  80  91  93  88  97  83   71  74

Hospital 2:
66  40  55  70  76  61  53  50  47  61
52  48  60  72  57  70  66  55  46  51

(12) Thirty teenagers whose total cholesterol level is higher than what is advisable underwent treatment that consisted of a diet and physical activities. The tables show the levels of LDL cholesterol (mg/dL) before and after the treatment. Check if the treatment was effective (α = 5%).

Before the treatment:
220  212  227  234  204  209  211  245  237  250
208  224  220  218  208  205  227  207  222  213
210  234  240  227  229  224  204  210  215  228

After the treatment:
195  180  200  204  180  195  200  210  205  211
175  198  195  200  190  200  222  198  201  194
190  204  230  222  209  198  195  190  201  210

(13) An aerospace company produces civilian and military helicopters at its three factories. The tables show the monthly production of helicopters in the last 12 months at each factory. Check if there is a difference between the population means. Consider α = 5%.

Factory 1:  24  26  28  22  31  25  27  28  30  21  20  24
Factory 2:  28  26  24  30  24  27  25  29  30  27  26  25
Factory 3:  25  24  26  20  22  22  27  20  26  24  25  29

Chapter 10
Nonparametric Tests

Mathematics has wonderful strength that is capable of making us understand many mysteries of our faith.
Saint Jerome

10.1 INTRODUCTION
As studied in the previous chapter, hypotheses tests are divided into parametric and nonparametric. Applied to quantitative data, parametric tests formulate hypotheses about population parameters, such as the population mean (μ), the population standard deviation (σ), the population variance (σ²), the population proportion (p), etc. Parametric tests require strong assumptions regarding the data distribution. For example, in many cases, we should assume that the samples are collected from populations whose data follow a normal distribution. Moreover, for comparison tests of two paired population means or k population means (k ≥ 3), the population variances must be homogeneous. Conversely, nonparametric tests can formulate hypotheses about the qualitative characteristics of the population, so they can be applied to qualitative data, in nominal or ordinal scales. Since the assumptions regarding the data distribution are fewer and weaker than those of the parametric tests, they are also known as distribution-free tests.
Nonparametric tests are an alternative to parametric ones when their hypotheses are violated. Given that they require a smaller number of assumptions, they are simpler and easier to apply, but less robust when compared to parametric tests. In short, the main advantages of nonparametric tests are:
(a) They can be applied in a wide variety of situations, because they do not require strict premises concerning the population, as parametric methods do. Notably, nonparametric methods do not require that the populations have a normal distribution.
(b) Differently from parametric methods, nonparametric methods can be applied to qualitative data, in nominal and ordinal scales.
(c) They are easy to apply because they require simpler calculations when compared to parametric methods.
The main disadvantages are:
(a) With regard to quantitative data, since they must be transformed into qualitative data for the application of nonparametric tests, we lose too much information.
(b) Since nonparametric tests are less efficient than parametric tests, we need greater evidence (a larger sample or one with greater differences) to reject the null hypothesis.
Thus, since parametric tests are more powerful than nonparametric ones, that is, they have a higher probability of rejecting the null hypothesis when it is really false, they must be chosen as long as all the assumptions are confirmed. On the other hand, nonparametric tests are an alternative to parametric ones when the hypotheses are violated or in cases in which the variables are qualitative.
Nonparametric tests are classified according to the variables' level of measurement and to sample size. For a single sample, we will study the binomial, chi-square (χ²), and sign tests. The binomial test is applied to binary variables. The χ² test can be applied to nominal variables as well as to ordinal variables, while the sign test is only applied to ordinal variables. In the case of two paired samples, the main tests are the McNemar test, the sign test, and the Wilcoxon test. The McNemar test is applied to qualitative variables that assume only two categories (binary), while the sign test and the Wilcoxon test are applied to ordinal variables.


TABLE 10.1 Classification of Nonparametric Statistical Tests
Dimension                 Level of Measurement    Nonparametric Test
One sample                Binary                  Binomial
                          Nominal or ordinal      χ²
                          Ordinal                 Sign test
Two paired samples        Binary                  McNemar test
                          Ordinal                 Sign test, Wilcoxon test
Two independent samples   Nominal or ordinal      χ²
                          Ordinal                 Mann-Whitney U
K paired samples          Binary                  Cochran's Q
                          Ordinal                 Friedman's test
K independent samples     Nominal or ordinal      χ²
                          Ordinal                 Kruskal-Wallis test

Source: Fávero, L.P., Belfiore, P., Silva, F.L., Chan, B.L., 2009. Análise de dados: modelagem multivariada para tomada de decisões. Campus Elsevier, Rio de Janeiro.

Considering two independent samples, we can highlight the χ² test and the Mann-Whitney U test. The χ² test can be applied to nominal or ordinal variables, while the Mann-Whitney U test only considers ordinal variables. For k paired samples (k ≥ 3), we have Cochran's Q test, which considers binary variables, and Friedman's test, which considers ordinal variables. Finally, in the case of more than two independent samples, we will study the χ² test for nominal or ordinal variables and the Kruskal-Wallis test for ordinal variables. Table 10.1 shows this classification. Nonparametric tests whose variables' level of measurement is ordinal can also be applied to quantitative variables, but, in these cases, they should only be used when the hypotheses of the parametric tests are violated.

10.2 TESTS FOR ONE SAMPLE
In this case, a random sample is taken from the population and we test the hypothesis that the sample data have a certain characteristic or distribution. Among the nonparametric statistical tests for a single sample, we can highlight the binomial test, the χ² test, and the sign test. The binomial test is applied to binary data, the χ² test to nominal or ordinal data, and the sign test to ordinal data.

10.2.1 Binomial Test
The binomial test is applied to an independent sample in which the variable of interest (X) is binary (dummy) or dichotomous, that is, it only has two possibilities: success or failure. We usually call result X = 1 a success and result X = 0 a failure, because it is more convenient. The probability of success in choosing a certain observation is represented by p and the probability of failure by q, that is:

P[X = 1] = p and P[X = 0] = q = 1 − p

For a bilateral test, we must consider the following hypotheses:
H0: p = p0
H1: p ≠ p0
According to Siegel and Castellan (2006), the number of successes (Y), that is, the number of results of type [X = 1] in a sequence of N observations, is:

$$Y = \sum_{i=1}^{N} X_i$$

For the authors, in a sample of size N, the probability of obtaining k objects in one category and N − k objects in the other category is given by:

$$P[Y = k] = \binom{N}{k} \cdot p^k \cdot q^{N-k}, \quad k = 0, 1, \dots, N \quad (10.1)$$

where:
p: probability of success;
q: probability of failure;
and:

$$\binom{N}{k} = \frac{N!}{k! \, (N-k)!}$$

Table F1 in the Appendix provides the probability P[Y = k] for several values of N, k, and p. However, when we test hypotheses, we must use the probability of obtaining values that are greater than or equal to the value observed:

$$P(Y \geq k) = \sum_{i=k}^{N} \binom{N}{i} \cdot p^i \cdot q^{N-i} \quad (10.2)$$

or the probability of obtaining values that are less than or equal to the value observed:

$$P(Y \leq k) = \sum_{i=0}^{k} \binom{N}{i} \cdot p^i \cdot q^{N-i} \quad (10.3)$$

According to Siegel and Castellan (2006), when p = q = ½, instead of calculating the probabilities based on the expressions presented, it is more convenient to use Table F2 in the Appendix. This table provides the unilateral probabilities, under the null hypothesis H0: p = 1/2, of obtaining values that are as extreme as or more extreme than k, where k is the lowest of the frequencies observed (P(Y ≤ k)). Due to the symmetry of the binomial distribution, when p = ½, we have P(Y ≥ k) = P(Y ≤ N − k). A unilateral test is used when we predict, in advance, which of the two categories must contain the smallest number of cases. For a bilateral test (when the estimate simply refers to the fact that both frequencies will differ), we just need to double the values from Table F2 in the Appendix. This final value obtained is called the P-value, which, according to what was discussed in Chapter 9, corresponds to the probability (unilateral or bilateral) associated with the value observed in the sample. The P-value indicates the lowest significance level observed that would lead to the rejection of the null hypothesis. Thus, we reject H0 if P ≤ α.
In the case of large samples (N > 25), the sampling distribution of Y approaches a normal distribution, so the probability can be calculated from the following standardized statistic:

$$Z_{cal} = \frac{|N \cdot \hat{p} - N \cdot p| - 0.5}{\sqrt{N \cdot p \cdot q}} \quad (10.4)$$

where p̂ refers to the sample estimate of the proportion of successes used to test H0. The value of Zcal calculated by using Expression (10.4) must be compared to the critical value of the standard normal distribution (see Table E in the Appendix). This table provides the critical values of zc where P(Zcal > zc) = α (for a right-tailed unilateral test). For a bilateral test, we have P(Zcal < −zc) = α/2 = P(Zcal > zc). Therefore, for a right-tailed unilateral test, the null hypothesis is rejected if Zcal > zc. For a bilateral test, we reject H0 if Zcal < −zc or Zcal > zc.

Example 10.1: Applying the Binomial Test to Small Samples
A group of 18 students took an intensive English course and were submitted to two different learning methods. At the end of the course, each student chose his/her favorite teaching method, as shown in Table 10.E.1. We believe there are no differences between the two teaching methods. Test the null hypothesis with a significance level of 5%.


TABLE 10.E.1 Frequencies Obtained After Students Made Their Choice
Events        Method 1   Method 2   Total
Frequency     11         7          18
Proportion    0.611      0.389      1.0

Solution
Before we start the general procedure to construct the hypotheses tests, we will explain a few parameters in order to facilitate the understanding. Labeling the choice of method 1 as X = 1 and the choice of method 2 as X = 0, the probability of choosing method 1 is represented by P[X = 1] = p and that of method 2 by P[X = 0] = q. The number of successes (Y = k) corresponds to the total number of results of type X = 1, so k = 11.
Step 1: The most suitable test in this case is the binomial test, because the data are categorized into two classes.
Step 2: The null hypothesis states that there are no differences in the probabilities of choosing between the two methods:
H0: p = q = ½
H1: p ≠ q
Step 3: The significance level to be considered is 5%.
Step 4: We have N = 18, k = 11, p = ½, and q = ½. Due to the symmetry of the binomial distribution, when p = ½, P(Y ≥ k) = P(Y ≤ N − k), that is, P(Y ≥ 11) = P(Y ≤ 7). So, let's calculate P(Y ≤ 7) by using Expression (10.3) and show how this probability can be obtained directly from Table F2 in the Appendix. The probability of a maximum of seven students choosing method 2 is given by:

$$P(Y \leq 7) = P(Y = 0) + P(Y = 1) + \dots + P(Y = 7)$$

$$P(Y = 0) = \frac{18!}{0! \, 18!} \cdot \left(\frac{1}{2}\right)^0 \cdot \left(\frac{1}{2}\right)^{18} = 3.815 \times 10^{-6}$$

$$P(Y = 1) = \frac{18!}{1! \, 17!} \cdot \left(\frac{1}{2}\right)^1 \cdot \left(\frac{1}{2}\right)^{17} = 6.866 \times 10^{-5}$$

$$\vdots$$

$$P(Y = 7) = \frac{18!}{7! \, 11!} \cdot \left(\frac{1}{2}\right)^7 \cdot \left(\frac{1}{2}\right)^{11} = 0.121$$

Therefore:

$$P(Y \leq 7) = 3.815 \times 10^{-6} + \dots + 0.121 = 0.240$$

Since p = ½, the probability P(Y ≤ 7) could be obtained directly from Table F2 in the Appendix. For N = 18 and k = 7 (the lowest frequency observed), the associated unilateral probability is P1 = 0.240. Since it is a bilateral test, this value must be doubled (P = 2P1), so the associated bilateral probability is P = 0.480.
Note: In the general procedure of hypotheses tests, Step 4 corresponds to the calculation of the statistic based on the sample, and Step 5 determines the probability associated with the value of the statistic obtained in Step 4. In the case of the binomial test, Step 4 calculates the probability associated with the occurrence in the sample directly.
Step 5: Decision: since the associated probability is greater than α (P = 0.480 > 0.05), we do not reject H0, which allows us to conclude, with a 95% confidence level, that there are no differences in the probabilities of choosing method 1 or 2.
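The exact probabilities above can be checked with a couple of lines of Python (our own addition; the book itself relies on Table F2 and, in the next sections, on SPSS and Stata).

```python
# Cross-check of Example 10.1 (illustrative; not part of the original text).
from scipy import stats

p_one_tail = stats.binom.cdf(7, 18, 0.5)  # P(Y <= 7) for N = 18 and p = 1/2: about 0.240
p_two_tail = 2 * p_one_tail               # bilateral test: about 0.48
print(p_one_tail, p_two_tail)
```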

Example 10.2: Applying the Binomial Test to Large Samples
Redo the previous example considering the following results:

TABLE 10.E.2 Frequencies Obtained After Students Made Their Choice
Events        Method 1   Method 2   Total
Frequency     18         12         30
Proportion    0.6        0.4        1.0


FIG. 10.1 Critical region of Example 10.2.

Solution
Step 1: Let's apply the binomial test.
Step 2: The null hypothesis states that there are no differences between the probabilities of choosing the two methods, that is:
H0: p = q = ½
H1: p ≠ q
Step 3: The significance level to be considered is 5%.
Step 4: Since N > 25, we can consider that the sampling distribution of variable Y approaches a normal distribution, so the probability can be calculated from the Z statistic:

$$Z_{cal} = \frac{|N \cdot \hat{p} - N \cdot p| - 0.5}{\sqrt{N \cdot p \cdot q}} = \frac{|30 \cdot 0.6 - 30 \cdot 0.5| - 0.5}{\sqrt{30 \cdot 0.5 \cdot 0.5}} = 0.913$$

Step 5: The critical region of the standard normal distribution (Table E in the Appendix), for a bilateral test in which α = 5%, is shown in Fig. 10.1. For a bilateral test, each one of the tails corresponds to half of the significance level α.
Step 6: Decision: since the calculated value is not in the critical region, that is, −1.96 ≤ Zcal ≤ 1.96, the null hypothesis is not rejected, which allows us to conclude, with a 95% confidence level, that there are no differences in the probabilities of choosing between the methods (p = q = ½).
If we used the P-value instead of the critical value of the statistic, Steps 5 and 6 would be:
Step 5: According to Table E in the Appendix, the unilateral probability associated with the statistic Zcal = 0.913 is P1 = 0.1762. For a bilateral test, this probability must be doubled (P-value = 0.3564).
Step 6: Decision: since P > 0.05, we do not reject H0.

10.2.1.1 Solving the Binomial Test Using SPSS Software
Example 10.1 will be solved using IBM SPSS Statistics Software®. The use of the images in this section has been authorized by the International Business Machines Corporation©. The data are available in the file Binomial_Test.sav. The procedure for solving the binomial test using SPSS is described below. Let's select Analyze → Nonparametric Tests → Legacy Dialogs → Binomial … (Fig. 10.2). First, let's insert the variable Method into the Test Variable List. In Test Proportion, we must define p = 0.50, since the probabilities of success and failure are the same (Fig. 10.3). Finally, let's click on OK. The results can be seen in Fig. 10.4. The associated probability for a bilateral test is P = 0.481, similar to the value calculated in Example 10.1. Since P > α (0.481 > 0.05), we do not reject H0, which allows us to conclude, with a 95% confidence level, that p = q = ½.

10.2.1.2 Solving the Binomial Test Using Stata Software
Example 10.1 will also be solved using Stata Statistical Software®. The use of the images presented in this section has been authorized by StataCorp LP©. The data are available in the file Binomial_Test.dta. The syntax of the binomial test on Stata is:
bitest variable* = #p

where the term variable* must be replaced by the variable considered in the analysis and #p by the probability of success specified in the null hypothesis.


FIG. 10.2 Procedure for applying the binomial test on SPSS.

FIG. 10.3 Selecting the variable and the proportion for the binomial test.

In Example 10.1, our studied variable is method and, through the null hypothesis, there are no differences in the choice between both methods, so, the command to be typed is: bitest method = 0.5

The result of the binomial test is shown in Fig. 10.5. We can see that the associated probability for a bilateral test is P = 0.481, similar to the value calculated in Example 10.1 and also obtained via SPSS software. Since P > 0.05, we do not reject H0, which allows us to conclude, with a 95% confidence level, that p = q = ½.


FIG. 10.4 Results of the binomial test.

FIG. 10.5 Results of the binomial test for Example 10.1 on Stata.

10.2.2 Chi-Square Test (χ²) for One Sample

The χ² test presented in this section is an extension of the binomial test and is applied to a single sample in which the variable being studied assumes two or more categories. The variables can be nominal or ordinal. The test compares the frequencies observed to the frequencies expected in each category. The χ² test assumes the following hypotheses:
H0: there is no significant difference between the frequencies observed and the ones expected
H1: there is a significant difference between the frequencies observed and the ones expected
The statistic for the test, analogous to Expression (4.1) in Chapter 4, is given by:

χ²cal = Σ_{i=1}^{k} (O_i − E_i)² / E_i    (10.5)

where:
O_i: the number of observations in the ith category;
E_i: the expected frequency of observations in the ith category when H0 is not rejected;
k: the number of categories.
The values of χ²cal approximately follow a χ² distribution with ν = k − 1 degrees of freedom. The critical values of the chi-square statistic (χ²c) can be found in Table D in the Appendix, which provides the critical values of χ²c where P(χ²cal > χ²c) = α (for a right-tailed unilateral test). In order for the null hypothesis H0 to be rejected, the value of the χ²cal statistic must be in the critical region (CR), that is, χ²cal > χ²c. Otherwise, we do not reject H0 (Fig. 10.6). The P-value (the probability associated with the value of the χ²cal statistic calculated from the sample) can also be obtained from Table D. In this case, we reject H0 if P ≤ α.

FIG. 10.6 χ² distribution, highlighting the critical region (CR) and the nonrejection of H0 (NR) region.


Example 10.3: Applying the χ² Test to One Sample
A candy store would like to find out if the number of chocolate candies sold daily varies depending on the day of the week. In order to do that, a sample was collected throughout 1 week, chosen randomly, and the results can be seen in Table 10.E.3. Test the hypothesis that sales do not depend on the day of the week. Assume that α = 5%.

TABLE 10.E.3 Frequencies Observed Versus Frequencies Expected

Events                  Sunday  Monday  Tuesday  Wednesday  Thursday  Friday  Saturday
Frequencies observed    35      24      27       32         25        36      31
Frequencies expected    30      30      30       30         30        30      30

Solution
Step 1: The most suitable test to compare the frequencies observed to the ones expected from one sample with more than two categories is the χ² test for a single sample.
Step 2: Through the null hypothesis, there are no significant differences between the sales observed and the ones expected for each day of the week. On the other hand, through the alternative hypothesis, there is a difference on at least one day of the week:
H0: O_i = E_i
H1: O_i ≠ E_i
Step 3: The significance level to be considered is 5%.
Step 4: The value of the statistic is given by:

χ²cal = Σ_{i=1}^{k} (O_i − E_i)²/E_i = (35 − 30)²/30 + (24 − 30)²/30 + ⋯ + (31 − 30)²/30 = 4.533

Step 5: The critical region of the χ² test, considering α = 5% and ν = 6 degrees of freedom, is shown in Fig. 10.7.

FIG. 10.7 Critical Region of Example 10.3.

Step 6: Decision: since the value calculated is not in the critical region, that is, χ²cal < 12.592, the null hypothesis is not rejected, which allows us to conclude, with a 95% confidence level, that the number of chocolate candies sold daily does not vary depending on the day of the week.
If we use the P-value instead of the critical value of the statistic, Steps 5 and 6 of the construction of the hypotheses tests will be:
Step 5: According to Table D in the Appendix, for ν = 6 degrees of freedom, the probability associated with the statistic χ²cal = 4.533 (P-value) is between 0.1 and 0.9.
Step 6: Decision: since P > 0.05, we do not reject the null hypothesis.
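If the reader prefers to check this goodness-of-fit calculation in code, a minimal sketch in Python (assuming scipy, which the book itself does not use) is:

from scipy.stats import chisquare

# Example 10.3: observed daily sales against equal expected frequencies
observed = [35, 24, 27, 32, 25, 36, 31]
stat, p_value = chisquare(f_obs=observed, f_exp=[30] * 7)
print(round(stat, 3), round(p_value, 3))  # about 4.533 and 0.605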

10.2.2.1 Solving the χ² Test for One Sample Using SPSS Software
The use of the images in this section has been authorized by the International Business Machines Corporation©. The data in Example 10.3 are available in the file Chi-Square_One_Sample.sav. The procedure for applying the χ² test on SPSS is described. First, let's click on Analyze → Nonparametric Tests → Legacy Dialogs → Chi-Square …, as shown in Fig. 10.8.

FIG. 10.8 Procedure for elaborating the χ² test on SPSS.

After that, we should insert the variable Day_week into the Test Variable List. The variable being studied has seven categories. The options Get from data and Use specified range (Lower = 1 and Upper = 7) in Expected Range generate the same results. The frequencies expected for the seven categories are exactly the same. Thus, we must select the option All categories equal in Expected Values, as shown in Fig. 10.9. Finally, let's click on OK to obtain the results of the χ² test, as shown in Fig. 10.10. Therefore, the value of the χ² statistic is 4.533, similar to the value calculated in Example 10.3. Since the P-value = 0.605 > 0.05 (in Example 10.3, we saw that 0.1 < P < 0.9), we do not reject H0, which allows us to conclude, with a 95% confidence level, that the sales do not depend on the day of the week.

10.2.2.2 Solving the χ² Test for One Sample Using Stata Software
The use of the images presented in this section has been authorized by Stata Corp LP©. The data in Example 10.3 are available in the file Chi-Square_One_Sample.dta. The variable being studied is day_week. The χ² test for one sample on Stata can be obtained from the command csgof (chi-square goodness of fit), which allows us to compare the distribution of frequencies observed to the ones expected for a certain categorical variable with more than two categories. In order for this command to be used, first, we must type:
findit csgof

and install it through the link csgof from http://www.ats.ucla.edu/stat/stata/ado/analysis. After doing this, we can type the following command: csgof day_week

The result is shown in Fig. 10.11. We can see that the result of the test is similar to the one calculated in Example 10.3 and on SPSS, as well as to the probability associated to the statistic.

10.2.3 Sign Test for One Sample

The sign test is an alternative to the t-test for a single random sample when the data distribution of the population does not follow a normal distribution. The only assumption required by the sign test is that the distribution of the variable be continuous.


FIG. 10.9 Selecting the variable and the procedure to elaborate the χ² test.

FIG. 10.10 Results of the χ² test for Example 10.3 on SPSS.

The sign test is based on the population median (μ). The probability of obtaining a sample value that is less than the median and the probability of obtaining a sample value that is greater than the median are the same (p = ½). The null hypothesis of the test is that μ is equal to a certain value specified by the investigator (μ0). For a bilateral test, we have:
H0: μ = μ0
H1: μ ≠ μ0
The quantitative data are converted into signs, (+) or (−); that is, values greater than the median (μ0) are represented by (+) and values less than μ0 by (−). Data with values equal to μ0 are excluded from the sample. Thus, the sign test is applied to ordinal data and offers little power to the researcher, since this conversion results in a considerable loss of information regarding the original data.


FIG. 10.11 Results of the χ² test for Example 10.3 on Stata.

Small samples
Let's establish that N is the number of positive and negative signs (the sample size disregarding any ties) and k is the number of signs that corresponds to the lowest frequency. For small samples (N ≤ 25), we will use the binomial test with p = ½ to calculate P(Y ≤ k). This probability can be obtained directly from Table F2 in the Appendix.
Large samples
When N > 25, the binomial distribution is more similar to a normal distribution. The value of Z is given by:

Z = ((X ± 0.5) − N/2) / (0.5·√N) ~ N(0, 1)    (10.6)

where X corresponds to the lowest or highest frequency. If X represents the lowest frequency, we must calculate X + 0.5. On the other hand, if X represents the highest frequency, we must calculate X − 0.5.

Example 10.4: Applying the Sign Test to a Single Sample
We estimate that the median retirement age in a certain Brazilian city is 65. One random sample with 20 retirees was drawn from the population and the results can be seen in Table 10.E.4. Test the null hypothesis that μ = 65, at the significance level of 10%.

TABLE 10.E.4 Retirement Age

59  62  66  37  60  64  66  70  72  61
64  66  68  72  78  93  79  65  67  59

Solution
Step 1: Since the data do not follow a normal distribution, the most suitable test for testing the population median is the sign test.
Step 2: The hypotheses of the test are:
H0: μ = 65
H1: μ ≠ 65
Step 3: The significance level to be considered is 10%.
Step 4: Let's calculate P(Y ≤ k). To facilitate our understanding, let's sort the data in Table 10.E.4 in ascending order.

TABLE 10.E.5 Data From Table 10.E.4 Sorted in Ascending Order

37  59  59  60  61  62  64  64  65  66
66  66  67  68  70  72  72  78  79  93


Excluding the value 65 (a tie), the number of (−) signs is 8, the number of (+) signs is 11, and N = 19. From Table F2 in the Appendix, for N = 19, k = 8, and p = ½, the associated unilateral probability is P1 = 0.324. Since we are using a bilateral test, this value must be doubled, so the associated bilateral probability is 0.648 (P-value).
Step 5: Decision: since P > α (0.648 > 0.10), we do not reject H0, a fact that allows us to conclude, with a 90% confidence level, that μ = 65.
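Since the sign test with N ≤ 25 reduces to a binomial test with p = ½, the same P-value can be checked with a short Python/scipy sketch (an assumption; the book uses Table F2, SPSS, and Stata):

from scipy.stats import binomtest

# Example 10.4: sign test for H0: median = 65 (the tied value 65 is dropped)
ages = [59, 62, 66, 37, 60, 64, 66, 70, 72, 61, 64, 66, 68, 72, 78, 93, 79, 65, 67, 59]
neg = sum(a < 65 for a in ages)  # 8 values below the hypothesized median
pos = sum(a > 65 for a in ages)  # 11 values above it
print(round(binomtest(min(neg, pos), neg + pos, 0.5).pvalue, 3))  # about 0.648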

10.2.3.1 Solving the Sign Test for One Sample Using SPSS Software
The use of the images in this section has been authorized by the International Business Machines Corporation©. SPSS makes the sign test available only for two related samples (2 Related Samples). Thus, in order for us to use the test for a single sample, we must generate a new variable with n values (the sample size including ties), all of them equal to μ0. The data in Example 10.4 are available in the file Sign_Test_One_Sample.sav. The procedure for applying the sign test on SPSS is shown. First of all, we must click on Analyze → Nonparametric Tests → Legacy Dialogs → 2 Related Samples …, as shown in Fig. 10.12. After that, we must insert variable 1 (Age_pop) and variable 2 (Age_sample) into Test Pairs. Let's select the option regarding the sign test (Sign) in Test Type, as shown in Fig. 10.13. Next, let's click on OK to obtain the results of the sign test, as shown in Figs. 10.14 and 10.15. Fig. 10.14 shows the frequencies of negative and positive signs, the total number of ties, and the total frequency. Fig. 10.15 shows the associated probability for a bilateral test, which is similar to the value found in Example 10.4. Since P = 0.648 > 0.10, we do not reject the null hypothesis, which allows us to conclude, with a 90% confidence level, that the median retirement age is 65.

FIG. 10.12 Procedure for elaborating the sign test on SPSS.


FIG. 10.13 Selecting the variables and the sign test.

FIG. 10.14 Frequencies observed.

FIG. 10.15 Sign test for Example 10.4 on SPSS.

10.2.3.2 Solving the Sign Test for One Sample Using Stata Software
The use of the images presented in this section has been authorized by Stata Corp LP©. Different from SPSS software, Stata makes the sign test for one sample available. On Stata, the sign test for a single sample as well as for two paired samples can be obtained from the command signtest. The syntax of the test for one sample is:
signtest variable* = #


FIG. 10.16 Results of the sign test for Example 10.4 on Stata.

where the term variable* must be replaced by the variable considered in the analysis and # by the value of the population median to be tested. The data in Example 10.4 are available in the file Sign_Test_One_Sample.dta. The variable analyzed is age and the main objective is to verify if the median retirement age is 65. The command to be typed is: signtest age = 65

The result of the test is shown in Fig. 10.16. Analogous to the results presented in Example 10.4 and also generated on SPSS, the number of positive signs is 11, the number of negative signs is 8, and the associated probability for a bilateral test is 0.648. Since P > 0.10, we do not reject the null hypothesis, which allows us to conclude, with a 90% confidence level, that the median retirement age is 65.

10.3 TESTS FOR TWO PAIRED SAMPLES

These tests investigate if two samples are somehow related. The most common examples analyze a situation before and after a certain event. We will study the following tests: the McNemar test for binary variables and the sign and Wilcoxon tests for ordinal variables.

10.3.1 McNemar Test

The McNemar test is applied to assess the significance of changes in two related samples with qualitative or categorical variables that assume only two categories (binary variables). The main goal of the test is to verify whether there are any significant changes before and after the occurrence of a certain event. In order to do that, let's use a 2 × 2 contingency table, as shown in Table 10.2. According to Siegel and Castellan (2006), the + and − signs are used to represent the possible changes in the answers before and after. The frequencies of each occurrence are represented in their respective cells in Table 10.2. For example, if there are changes from the first answer (+) to the second answer (−), the result will be written in the upper right cell, so B represents the total number of observations that changed their behavior from (+) to (−). Analogously, if there are changes from the first answer (−) to the second answer (+), the result will be written in the lower left cell, so C represents the total number of observations that changed their behavior from (−) to (+).


TABLE 10.2 2 × 2 Contingency Table

                After
Before          +        −
+               A        B
−               C        D

On the other hand, while A represents the total number of observations that remained with the same answer (+) before and after, D represents the total number of observations with the same answer (−) in both periods. Thus, the total number of individuals that changed their answer can be represented by B + C. Through the null hypothesis of the test, the total number of changes in each direction is equally likely, that is:
H0: P(B → C) = P(C → B)
H1: P(B → C) ≠ P(C → B)
According to Siegel and Castellan (2006), the McNemar statistic is calculated based on the chi-square (χ²) statistic presented in Expression (10.5), that is:

χ²cal = Σ_{i=1}^{2} (O_i − E_i)²/E_i = (B − (B + C)/2)²/((B + C)/2) + (C − (B + C)/2)²/((B + C)/2) = (B − C)²/(B + C) ~ χ²₁    (10.7)

According to the same authors, a correction factor must be used in order for a continuous χ² distribution to become more similar to a discrete χ² distribution, so:

χ²cal = (|B − C| − 1)²/(B + C), with 1 degree of freedom    (10.8)

The value calculated must be compared to the critical value of the χ² distribution (Table D in the Appendix). This table provides the critical values of χ²c where P(χ²cal > χ²c) = α (for a right-tailed unilateral test). If the value of the statistic is in the critical region, that is, if χ²cal > χ²c, we reject H0. Otherwise, we should not reject H0. The probability associated with the χ²cal statistic (P-value) can also be obtained from Table D. In this case, the null hypothesis is rejected if P ≤ α. Otherwise, we do not reject H0.

Example 10.5: Applying the McNemar Test
A bill of law proposing the end of full retirement pensions for federal civil servants was being analyzed by the Senate. Aiming at verifying if this measure would bring any changes in the number of people taking public exams, an interview with 60 workers was carried out, before and after the reform, so that they could express their preference in working for a private or a public organization. The results can be seen in Table 10.E.6. Test the hypothesis that there were no significant changes in the workers' answers before and after the social security reform. Assume that α = 5%.

TABLE 10.E.6 Contingency Table

                    After the Reform
Before the Reform   Private    Public
Private             22         3
Public              21         14

Solution Step 1: McNemar is the most suitable test for evaluating the significance of before and after type changes in two related samples, applied to nominal or categorical variables. Step 2: Through the null hypothesis, the reform would not be efficient in changing people’s preferences towards the private sector. In other words, among the workers who changed their preferences, the probability of them changing their preference


from private to public organizations after the reform is the same as the probability of them changing from public to private organizations. That is:
H0: P(Private → Public) = P(Public → Private)
H1: P(Private → Public) ≠ P(Public → Private)
Step 3: The significance level to be considered is 5%.
Step 4: The value of the statistic, according to Expression (10.7), is:

χ²cal = (|B − C|)²/(B + C) = (|3 − 21|)²/(3 + 21) = 13.5, with ν = 1

If we use the correction factor, the value of the statistic from Expression (10.8) becomes:

χ²cal = (|B − C| − 1)²/(B + C) = (|3 − 21| − 1)²/(3 + 21) = 12.042, with ν = 1

Step 5: The value of the critical chi-square (χ²c) obtained from Table D in the Appendix, considering α = 5% and ν = 1 degree of freedom, is 3.841.
Step 6: Decision: since the value calculated is in the critical region, that is, χ²cal > 3.841, we reject the null hypothesis, which allows us to conclude, with a 95% confidence level, that there were significant changes in the choice of working at a private or a public organization after the social security reform.
If we use the P-value instead of the critical value of the statistic, Steps 5 and 6 will be:
Step 5: According to Table D in the Appendix, for ν = 1 degree of freedom, the probability associated with the statistic χ²cal = 12.042 or 13.5 (P-value) is less than 0.005 (a probability of 0.005 is associated with the statistic χ²cal = 7.879).
Step 6: Decision: since P < 0.05, we must reject H0.
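A minimal computational check of Expressions (10.7) and (10.8), assuming Python with scipy for the χ² tail probability (the book itself relies on Table D, SPSS, and Stata), could look like this:

from scipy.stats import chi2

# Example 10.5: B = changes from private to public, C = changes from public to private
B, C = 3, 21
stat = (B - C) ** 2 / (B + C)                      # Expression (10.7): 13.5
stat_corrected = (abs(B - C) - 1) ** 2 / (B + C)   # Expression (10.8): 12.042
print(chi2.sf(stat, df=1), chi2.sf(stat_corrected, df=1))  # both P-values are far below 0.05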

10.3.1.1 Solving the McNemar Test Using SPSS Software
Example 10.5 will be solved using SPSS software. The use of the images in this section has been authorized by the International Business Machines Corporation©. The data are available in the file McNemar_Test.sav. The procedure for applying the McNemar test on SPSS is presented. Let's click on Analyze → Nonparametric Tests → Legacy Dialogs → 2 Related Samples …, as shown in Fig. 10.17. After that, we should insert variable 1 (Before) and variable 2 (After) into Test Pairs. Let's select the McNemar test option in Test Type, as shown in Fig. 10.18. Finally, we must click on OK to obtain Figs. 10.19 and 10.20. Fig. 10.19 shows the frequencies observed before and after the reform (Contingency Table). The result of the McNemar test is shown in Fig. 10.20. According to Fig. 10.20, the significance level observed in the McNemar test is 0.000, a value lower than 5%, so the null hypothesis is rejected. Hence, we may conclude, with a 95% confidence level, that there was a significant change in choosing to work at a public or a private organization after the social security reform.

10.3.1.2 Solving the McNemar Test Using Stata Software
Example 10.5 will also be solved using Stata software. The use of the images presented in this section has been authorized by Stata Corp LP©. The data are available in the file McNemar_Test.dta. The McNemar test can be calculated on Stata by using the command mcc followed by the paired variables. In our example, the paired variables are called before and after, so the command to be typed is:
mcc before after

The result of the McNemar test is shown in Fig. 10.21. We can see that the value of the statistic is 13.5, similar to the value calculated by Expression (10.7), without the correction factor. The significance level observed from the test is 0.000, lower than 5%, which allows us to conclude, with a 95% confidence level, that there was a significant change before and after the reform. The result of the McNemar test could have also been obtained by using the command mcci 14 21 3 22.

10.3.2 Sign Test for Two Paired Samples

The sign test can also be applied to two paired samples. In this case, the sign is given by the difference between the pairs, that is, if the difference results in a positive number, each pair of values is replaced by a (+) sign. On the other hand, if the result of the difference is negative, each pair of values is replaced by a (−) sign. In case of a tie, the data will be excluded from the sample.


FIG. 10.17 Procedure for elaborating the McNemar test on SPSS.

FIG. 10.18 Selecting the variables and McNemar test.



FIG. 10.19 Frequencies observed.

FIG. 10.20 McNemar Test for Example 10.5 on SPSS. FIG. 10.21 Results of the McNemar test for Example 10.5 on Stata.

Analogous to the sign test for a single sample, the sign test presented in this section is also an alternative to the t-test for comparing two related samples when the data distribution is not normal. In this case, the quantitative data are transformed into ordinal data. Thus, the sign test is much less powerful than the t-test, because it only uses the sign of the difference between the pairs as information. Through the null hypothesis, the population median of the differences (μd) is zero. Therefore, for a bilateral test, we have:
H0: μd = 0
H1: μd ≠ 0
In other words, we test the hypothesis that there are no differences between both samples (the samples come from populations with the same median and the same continuous distribution), that is, the number of (+) signs is the same as the number of (−) signs. The same procedure presented in Section 10.2.3 for a single sample will be used in order to calculate the sign statistic in the case of two paired samples.
Small samples
We say that N is the number of positive and negative signs (the sample size disregarding the ties) and k is the number of signs that corresponds to the lowest frequency. If N ≤ 25, we will use the binomial test with p = ½ to calculate P(Y ≤ k). This probability can be obtained directly from Table F2 in the Appendix.


Large samples
When N > 25, the binomial distribution is more similar to a normal distribution, and the value of Z is given by Expression (10.6):

Z = ((X ± 0.5) − N/2) / (0.5·√N) ~ N(0, 1)

where X corresponds to the lowest or highest frequency. If X represents the lowest frequency, we must use X + 0.5. On the other hand, if X represents the highest frequency, we must use X − 0.5.

Example 10.6: Applying the Sign Test to Two Paired Samples
A group of 30 workers was submitted to a training course aiming at improving their productivity. The result, in terms of the average number of parts produced per hour per employee, before and after the training, is shown in Table 10.E.7. Test the null hypothesis that there were no alterations in productivity before and after the training course. Assume that α = 5%.

TABLE 10.E.7 Productivity Before and After the Training Course

Before  After  Difference Sign
36      40     +
39      41     +
27      29     +
41      45     +
40      39     −
44      42     −
38      39     +
42      40     −
40      42     +
43      45     +
37      35     −
41      40     −
38      38     0
45      43     −
40      40     0
39      42     +
38      41     +
39      39     0
41      40     −
36      38     +
38      36     −
40      38     −
36      35     −
40      42     +
40      41     +
38      40     +
37      39     +
40      42     +
38      36     −
40      40     0

Solution
Step 1: Since the data do not follow a normal distribution, the sign test can be an alternative to the t-test for two paired samples.
Step 2: The null hypothesis assumes that there is no difference in productivity before and after the training course, that is:
H0: μd = 0
H1: μd ≠ 0
Step 3: The significance level to be considered is 5%.
Step 4: Since N > 25, the binomial distribution is more similar to a normal distribution, and the value of Z is given by:

Zcal = ((X + 0.5) − N/2) / (0.5·√N) = ((11 + 0.5) − 13) / (0.5·√26) = −0.588

Step 5: By using the standard normal distribution table (Table E in the Appendix), we must determine the critical region (CR) for a bilateral test, as shown in Fig. 10.22.

FIG. 10.22 Critical region of Example 10.6.

Step 6: Decision: since the value calculated is not in the critical region, that is, −1.96 ≤ Zcal ≤ 1.96, the null hypothesis is not rejected, which allows us to conclude, with a 95% confidence level, that there is no difference in productivity before and after the training course.
If, instead of comparing the value calculated to the critical value of the standard normal distribution, we use the calculation of the P-value, Steps 5 and 6 will be:
Step 5: According to Table E in the Appendix, the unilateral probability associated with the statistic Zcal = −0.59 is P1 = 0.278. For a bilateral test, this probability must be doubled (P-value = 0.556).
Step 6: Decision: since P > 0.05, we do not reject the null hypothesis.
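The same large-sample sign test can be sketched in Python (an assumption; not the book's tooling), using the counts of signs from Table 10.E.7:

from math import sqrt
from scipy.stats import norm

# Example 10.6: 15 positive signs, 11 negative signs, 4 ties discarded
neg, pos = 11, 15
N = neg + pos
z = ((min(neg, pos) + 0.5) - N / 2) / (0.5 * sqrt(N))   # Expression (10.6), lowest frequency
print(round(z, 3), round(2 * norm.cdf(z), 3))           # about -0.588 and 0.556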

10.3.2.1 Solving the Sign Test for Two Paired Samples Using SPSS Software

The use of the images in this section has been authorized by the International Business Machines Corporation©. The data in Example 10.6 can be found in the file Sign_Test_Two_Paired_Samples.sav. The procedure for applying the sign test to two paired samples on SPSS is shown. We have to click on Analyze → Nonparametric Tests → Legacy Dialogs → 2 Related Samples …, as shown in Fig. 10.23. After that, let's insert variable 1 (Before) and variable 2 (After) into Test Pairs. Let's also select the option regarding the sign test (Sign) in Test Type, as shown in Fig. 10.24. Finally, let's click on OK to obtain the results of the sign test for two paired samples (Figs. 10.25 and 10.26). Fig. 10.25 shows the frequencies of negative and positive signs, the total number of ties, and the total frequency. Fig. 10.26 shows the result of the z test, besides the associated probability P for a bilateral test, values that are similar to the ones calculated in Example 10.6. Since P = 0.556 > 0.05, the null hypothesis is not rejected, which allows us to conclude, with a 95% confidence level, that there is no difference in productivity before and after the training course.


FIG. 10.23 Procedure for elaborating the sign test on SPSS.

FIG. 10.24 Selecting the variables and the sign test.



FIG. 10.25 Frequencies observed.

FIG. 10.26 Sign test (two paired samples) for Example 10.6 on SPSS.

10.3.2.2 Solving the Sign Test for Two Paired Samples Using Stata Software
The use of the images presented in this section has been authorized by Stata Corp LP©. The data in Example 10.6 are also available on Stata in the file Sign_Test_Two_Paired_Samples.dta. The paired variables are before and after. As discussed in Section 10.2.3.2 for a single sample, the sign test on Stata is carried out from the command signtest. In the case of two paired samples, we must use the same command; however, it must be followed by the names of the paired variables, with an equal sign between them, since the objective is to test the equality of the respective medians. Thus, the command to be typed for our example is:
signtest after = before

The result of the test is shown in Fig. 10.27 and includes the number of positive signs (15), the number of negative signs (11), as well as the probability associated with the statistic for a bilateral test (P = 0.557). These values are similar to the ones calculated in Example 10.6 and also generated on SPSS. Since P > 0.05, we do not reject the null hypothesis, which allows us to conclude, with a 95% confidence level, that there is no difference in productivity before and after the training course.

10.3.3 Wilcoxon Test

Analogous to the sign test for two paired samples, the Wilcoxon test is an alternative to the t-test when the data distribution does not follow a normal distribution. The Wilcoxon test is an extension of the sign test; however, it is more powerful. Besides the information about the direction of the differences for each pair, the Wilcoxon test considers the magnitude of the difference within the pairs (Fávero et al., 2009). The logical foundations and the method used in the Wilcoxon test are described, based on Siegel and Castellan (2006). Let's assume that di is the difference between the values for each pair of data. First of all, we have to place all of the di's in ascending order according to their absolute value (without considering the sign) and calculate the respective ranks using this order. For example, position 1 is attributed to the lowest |di|, position 2 to the second lowest, and so on. At the end, we must attach the sign of the difference di to each rank. The sum of all positive ranks is represented by Sp and the sum of all negative ranks by Sn. Occasionally, the values for a certain pair of data are the same (di = 0). In this case, they are excluded from the sample. It is the same procedure used in the sign test, so the value of N represents the sample size disregarding these ties.


FIG. 10.27 Results of the sign test (two paired samples) for Example 10.6 on Stata.

Another type of tie may happen, in which two or more differences have the same absolute value. In this case, the same rank will be attributed to the ties, which will correspond to the mean of the ranks that would have been attributed if the differences had been different. For example, suppose that three pairs of data indicate the following differences: 1, 1, and 1. Rank 2 is attributed to each pair, which corresponds to the average of 1, 2, and 3. In order, the next value will receive rank 4, since ranks 1, 2, and 3 have already been used.
The null hypothesis assumes that the median of the differences in the population (μd) is zero, that is, the populations do not differ in location. For a bilateral test, we have:
H0: μd = 0
H1: μd ≠ 0
In other words, we must test the hypothesis that there are no differences between both samples (the samples come from populations with the same median and the same continuous distribution), that is, the sum of the positive ranks (Sp) is the same as the sum of the negative ranks (Sn).
Small samples
If N ≤ 15, Table I in the Appendix shows the unilateral probabilities associated with the several critical values of Sc (P(Sp > Sc) = α). For a bilateral test, this value must be doubled. If the probability obtained (P-value) is less than or equal to α, we must reject H0.
Large samples
As N grows, the Wilcoxon distribution becomes more similar to a standard normal distribution. Thus, for N > 15, we must calculate the value of the variable z that, according to Siegel and Castellan (2006), Fávero et al. (2009), and Maroco (2014), is:

Zcal = (min(Sp, Sn) − N(N + 1)/4) / √( N(N + 1)(2N + 1)/24 − (Σ_{j=1}^{g} t_j³ − Σ_{j=1}^{g} t_j)/48 )    (10.9)

where:
(Σ_{j=1}^{g} t_j³ − Σ_{j=1}^{g} t_j)/48 is a correction factor whenever there are ties;
g: the number of groups of tied ranks;
t_j: the number of tied observations in group j.

The value calculated must be compared to the critical value of the standard normal distribution (Table E in the Appendix). This table provides the critical values of zc where P(Zcal > zc) = α (for a right-tailed unilateral test). For a bilateral test, we have P(Zcal < −zc) = P(Zcal > zc) = α/2. The null hypothesis H0 of a bilateral test is rejected if the value of the Zcal statistic is in the critical region, that is, if Zcal < −zc or Zcal > zc. Otherwise, we do not reject H0. The unilateral probabilities associated with the statistic Zcal (P1) can also be obtained from Table E. For a unilateral test, we consider P = P1. For a bilateral test, this probability must be doubled (P = 2P1). Thus, for both tests, we reject H0 if P ≤ α.

Example 10.7: Applying the Wilcoxon Test
A group of 18 students from the 12th grade took an English proficiency exam, without ever having taken an extracurricular course. The same group of students was submitted to an intensive English course for 6 months and, at the end, they took the proficiency exam again. The results can be seen in Table 10.E.8. Test the hypothesis that there was no improvement before and after the course. Assume that α = 5%.

TABLE 10.E.8 Students' Grades Before and After the Intensive Course

Before  After
56      60
65      62
70      74
78      79
47      53
52      59
64      65
70      75
72      75
78      88
80      78
26      26
55      63
60      59
71      71
66      75
60      71
17      24

Solution
Step 1: Since the data do not follow a normal distribution, the Wilcoxon test can be applied, because it is more powerful than the sign test for two paired samples.
Step 2: Through the null hypothesis, there is no difference in the students' performance before and after the course, that is:
H0: μd = 0
H1: μd ≠ 0


Step 3: The significance level to be considered is 5%. Step 4: Since N > 15, the Wilcoxon distribution is more similar to a normal distribution. In order to calculate the value of z, first of all, we have to calculate di and the respective ranks, as shown in Table 10.E.9.

TABLE 10.E.9 Calculation of di and the Respective Ranks

Before  After  di    di's Rank
56      60     4     7.5
65      62     −3    5.5
70      74     4     7.5
78      79     1     2
47      53     6     10
52      59     7     11.5
64      65     1     2
70      75     5     9
72      75     3     5.5
78      88     10    15
80      78     −2    4
26      26     0     –
55      63     8     13
60      59     −1    2
71      71     0     –
66      75     9     14
60      71     11    16
17      24     7     11.5

Since there are two pairs of data with equal values (di = 0), they are excluded from the sample, so N = 16. The sum of the positive ranks is Sp = 2 + ⋯ + 16 = 124.5. The sum of the negative ranks is Sn = 2 + 4 + 5.5 = 11.5. Thus, we can calculate the value of z by using Expression (10.9):

Zcal = (min(Sp, Sn) − N(N + 1)/4) / √( N(N + 1)(2N + 1)/24 − (Σ t_j³ − Σ t_j)/48 )
     = (11.5 − 16 × 17/4) / √( 16 × 17 × 33/24 − (59 − 11)/48 ) = −2.925

Step 5: By using the standard normal distribution table (Table E in the Appendix), we determine the critical region (CR) for the bilateral test, as shown in Fig. 10.28.

FIG. 10.28 Critical region of Example 10.7.


Step 6: Decision: since the value calculated is in the critical region, that is, Zcal < −1.96, the null hypothesis is rejected, which allows us to conclude, with a 95% confidence level, that there is a difference in the students' performance before and after the course.
If, instead of comparing the value calculated to the critical value of the standard normal distribution, we use the calculation of the P-value, Steps 5 and 6 will be:
Step 5: According to Table E in the Appendix, the unilateral probability associated with the statistic Zcal = −2.925 is P1 = 0.0017. For a bilateral test, this probability must be doubled (P-value = 0.0034).
Step 6: Decision: since P < 0.05, we must reject the null hypothesis.
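For reference, scipy's Wilcoxon signed-rank implementation (a Python sketch, not the book's procedure) drops the zero differences and, with tied ranks present, typically falls back to the same normal approximation, so its output should be close to the values above:

from scipy.stats import wilcoxon

# Example 10.7: paired grades before and after the intensive course
before = [56, 65, 70, 78, 47, 52, 64, 70, 72, 78, 80, 26, 55, 60, 71, 66, 60, 17]
after = [60, 62, 74, 79, 53, 59, 65, 75, 75, 88, 78, 26, 63, 59, 71, 75, 71, 24]
stat, p_value = wilcoxon(before, after)  # statistic = smaller rank sum = 11.5
print(stat, round(p_value, 4))           # bilateral P-value around 0.003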

10.3.3.1 Solving the Wilcoxon Test Using SPSS Software

The use of the images in this section has been authorized by the International Business Machines Corporation©. The data in Example 10.7 are available in the file Wilcoxon_Test.sav. The procedure for applying the Wilcoxon test to two paired samples on SPSS is shown. Let's click on Analyze → Nonparametric Tests → Legacy Dialogs → 2 Related Samples …, as shown in Fig. 10.29. First of all, let's insert variable 1 (Before) and variable 2 (After) into Test Pairs. Let's also select the option related to the Wilcoxon test in Test Type, as shown in Fig. 10.30. Finally, let's click on OK to obtain the results of the Wilcoxon test for two paired samples (Figs. 10.31 and 10.32). Fig. 10.31 shows the number of negative, positive, and tied ranks, besides the mean and the sum of all positive and negative ranks. Fig. 10.32 shows the result of the z test, besides the associated probability P for a bilateral test, values similar to the ones found in Example 10.7. Since P = 0.003 < 0.05, we must reject the null hypothesis, which allows us to conclude, with a 95% confidence level, that there is a difference in the students' performance before and after the course.

FIG. 10.29 Procedure for elaborating the Wilcoxon test on SPSS.


FIG. 10.30 Selecting the variables and Wilcoxon test.

FIG. 10.31 Ranks.

FIG. 10.32 Wilcoxon test for Example 10.7 on SPSS.

10.3.3.2 Solving the Wilcoxon Test Using Stata Software

The use of the images presented in this section has been authorized by Stata Corp LP©. The data in Example 10.7 are available in the file Wilcoxon_Test.dta. The paired variables are called before and after. The Wilcoxon test on Stata is carried out from the command signrank followed by the name of the paired variables with an equal sign between them. For our example, we must type the following command: signrank before = after


FIG. 10.33 Results of the Wilcoxon test for Example 10.7 on Stata.

The result of the test is shown in Fig. 10.33. Since P < 0.05, we reject the null hypothesis, which allows us to conclude, with a 95% confidence level, that there is a difference in the students’ performance before and after the course.

10.4 TESTS FOR TWO INDEPENDENT SAMPLES

In these tests, we try to compare two populations represented by their respective samples. Different from the tests for two paired samples, here, it is not necessary for the samples to have the same size. Among the tests for two independent samples, we can highlight the chi-square test (for nominal or ordinal variables) and the Mann-Whitney test for ordinal variables.

10.4.1 Chi-Square Test (χ²) for Two Independent Samples

In Section 10.2.2, the χ² test was applied to a single sample in which the variable being studied was qualitative (nominal or ordinal). Here the test will be applied to two independent samples, from nominal or ordinal qualitative variables. This test has already been studied in Chapter 4 (Section 4.2.2), in order to verify if there is an association between two qualitative variables, and it will be described once again in this section. The test compares the frequencies observed in each one of the cells of a contingency table to the frequencies expected. The χ² test for two independent samples assumes the following hypotheses:
H0: there is no significant difference between the frequencies observed and the ones expected
H1: there is a significant difference between the frequencies observed and the ones expected
Therefore, the χ² statistic measures the discrepancy between a table with the contingency observed and a table with the contingency expected, starting from the hypothesis that there is no connection between the categories of both variables studied. If the distribution of frequencies observed is exactly the same as the distribution of frequencies expected, the result of the χ² statistic is zero. Thus, a low value of χ² indicates independence between the variables. As already presented in Expression (4.1) in Chapter 4, the χ² statistic for two independent samples is given by:

χ² = Σ_{i=1}^{I} Σ_{j=1}^{J} (O_ij − E_ij)² / E_ij    (10.10)

where:
O_ij: the number of observations in the ith category of variable X and in the jth category of variable Y;
E_ij: the expected frequency of observations in the ith category of variable X and in the jth category of variable Y;
I: the number of categories (rows) of variable X;
J: the number of categories (columns) of variable Y.


FIG. 10.34 χ² distribution.

The values of χ²cal approximately follow a χ² distribution with ν = (I − 1)(J − 1) degrees of freedom. The critical values of the chi-square statistic (χ²c) can be found in Table D in the Appendix. This table provides the critical values of χ²c where P(χ²cal > χ²c) = α (for a right-tailed unilateral test). In order for the null hypothesis H0 to be rejected, the value of the χ²cal statistic must be in the critical region, that is, χ²cal > χ²c. Otherwise, we do not reject H0 (Fig. 10.34).

Example 10.8: Applying the χ² Test to Two Independent Samples
Let's consider Example 4.1 in Chapter 4 once again, which refers to a study carried out with 200 individuals aiming at analyzing the joint behavior of variable X (Health insurance agency) with variable Y (Level of satisfaction). The contingency table showing the joint distribution of the variables' absolute frequencies, besides the marginal totals, is presented in Table 10.E.10. Test the hypothesis that there is no association between the categories of both variables, considering α = 5%.

TABLE 10.E.10 Joint Distribution of the Absolute Frequencies of the Variables Being Studied

                Level of Satisfaction
Agency          Dissatisfied    Neutral    Satisfied    Total
Total Health    40              16         12           68
Live Life       32              24         16           72
Mena Health     24              32         4            60
Total           96              72         32           200

Solution
Step 1: The most suitable test to compare the frequencies observed in each cell of a contingency table to the frequencies expected is the χ² test for two independent samples.
Step 2: The null hypothesis states that there is no connection between the categories of the variables Agency and Level of satisfaction, that is, the frequencies observed and expected are the same for each pair of variable categories. The alternative hypothesis states that there are differences in at least one pair of categories:
H0: O_ij = E_ij
H1: O_ij ≠ E_ij
Step 3: The significance level to be considered is 5%.
Step 4: In order to calculate the statistic, it is necessary to compare the values observed and the values expected. Table 10.E.11 presents the distribution's values observed with their respective relative frequencies in relation to the row's general total. The calculation could also be done in relation to the column's general total, achieving the same result for the χ² statistic. The data in Table 10.E.11 suggest a dependence between the variables. Supposing that there was no connection between the variables, we would expect the Dissatisfied proportion to be 48% of each agency's row total, 36% for the Neutral level, and 16% for the Satisfied level. The calculations of the values expected can be found in Table 10.E.12. For example, the calculation of the first cell is 0.48 × 68 = 32.6. In order to calculate the χ² statistic, we must apply Expression (10.10) to the data in Tables 10.E.11 and 10.E.12. The calculation of each term (O_ij − E_ij)²/E_ij is represented in Table 10.E.13, jointly with the resulting χ²cal measure obtained from the sum over the categories.
Step 5: The critical region (CR) of the χ² distribution (Table D in the Appendix), considering α = 5% and ν = (I − 1)(J − 1) = 4 degrees of freedom, is shown in Fig. 10.35.


TABLE 10.E.11 Values Observed in Each Category With Their Respective Proportions in Relation to the Row's General Total

                Level of Satisfaction
Agency          Dissatisfied    Neutral       Satisfied     Total
Total Health    40 (58.8%)      16 (23.5%)    12 (17.6%)    68 (100%)
Live Life       32 (44.4%)      24 (33.3%)    16 (22.2%)    72 (100%)
Mena Health     24 (40%)        32 (53.3%)    4 (6.7%)      60 (100%)
Total           96 (48%)        72 (36%)      32 (16%)      200 (100%)

TABLE 10.E.12 Values Expected From Table 10.E.11 Assuming a Nonassociation Between the Variables

                Level of Satisfaction
Agency          Dissatisfied    Neutral       Satisfied     Total
Total Health    32.6 (48%)      24.5 (36%)    10.9 (16%)    68 (100%)
Live Life       34.6 (48%)      25.9 (36%)    11.5 (16%)    72 (100%)
Mena Health     28.8 (48%)      21.6 (36%)    9.6 (16%)     60 (100%)
Total           96 (48%)        72 (36%)      32 (16%)      200 (100%)

TABLE 10.E.13 Calculation of the χ² Statistic

                Level of Satisfaction
Agency          Dissatisfied    Neutral    Satisfied
Total Health    1.66            2.94       0.12
Live Life       0.19            0.14       1.74
Mena Health     0.80            5.01       3.27
Total           χ²cal = 15.861

FIG. 10.35 Critical region of Example 10.8.

Step 6: Decision: since the value calculated is in the critical region, that is, χ²cal > 9.488, we must reject the null hypothesis, which allows us to conclude, with a 95% confidence level, that there is an association between the variable categories.
If we use the P-value instead of the critical value of the statistic, Steps 5 and 6 will be:
Step 5: According to Table D in the Appendix, the probability associated with the statistic χ²cal = 15.861, for ν = 4 degrees of freedom, is less than 0.005.
Step 6: Decision: since P < 0.05, we reject H0.
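The whole contingency-table calculation can be verified with one call in Python's scipy (an assumption; the book performs it by hand and in SPSS/Stata):

from scipy.stats import chi2_contingency

# Example 10.8: observed frequencies (agencies in rows, satisfaction levels in columns)
observed = [[40, 16, 12],
            [32, 24, 16],
            [24, 32, 4]]
stat, p_value, dof, expected = chi2_contingency(observed)
print(round(stat, 3), round(p_value, 3), dof)  # about 15.861, 0.003, and 4 degrees of freedom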


10.4.1.1 Solving the χ² Statistic Using SPSS Software
The use of the images in this section has been authorized by the International Business Machines Corporation©. The data in Example 10.8 are available in the file HealthInsurance.sav. In order to calculate the χ² statistic for two independent samples, we must click on Analyze → Descriptive Statistics → Crosstabs … Let's insert the variable Agency in Row(s) and the variable Satisfaction in Column(s), as shown in Fig. 10.36. In Statistics …, let's select the option Chi-square, as shown in Fig. 10.37. Then, we must finally click on Continue and OK. The result is shown in Fig. 10.38. From Fig. 10.38, we can see that the value of χ² is 15.861, similar to what was calculated in Example 10.8. Since P = 0.003 < 0.05, we must reject the null hypothesis, which allows us to conclude, with a 95% confidence level, that there is an association between the variable categories, that is, the frequencies observed differ from the frequencies expected in at least one pair of categories.

10.4.1.2 Solving the χ² Statistic Using Stata Software
The use of the images presented in this section has been authorized by Stata Corp LP©. As presented in Chapter 4, the calculation of the χ² statistic on Stata is done by using the command tabulate, or simply tab, followed by the names of the variables being studied, with the option chi2, or simply ch. The syntax of the test is:
tab variable1* variable2*, ch

The data in Example 10.8 are also available in the file HealthCareInsurance.dta. The variables being studied are agency and satisfaction. Thus, we must type the following command: tab agency satisfaction, ch

The results can be seen in Fig. 10.39 and are similar to the ones presented in Example 10.8 and obtained on SPSS.

FIG. 10.36 Selecting the variables.


FIG. 10.37 Selecting the χ² statistic.

FIG. 10.38 Results of the χ² test for Example 10.8 on SPSS.

FIG. 10.39 Results of the χ² test for Example 10.8 on Stata.

10.4.2 Mann-Whitney U Test

The Mann-Whitney U test is one of the most powerful nonparametric tests, applied to quantitative or qualitative variables in an ordinal scale, and it aims at verifying if two nonpaired or independent samples are drawn from the same population. It is an alternative to Student's t-test when the normality hypothesis is violated or when the sample is small. In addition, it may be considered a nonparametric version of the t-test for two independent samples. Since the original data are transformed into ranks (orders), we lose some information, so the Mann-Whitney U test is not as powerful as the t-test. Different from the t-test, which verifies the equality of the means of two independent populations with continuous data, the Mann-Whitney U test verifies the equality of the medians. For a bilateral test, the null hypothesis is that the medians of both populations are equal, that is:
H0: μ1 = μ2
H1: μ1 ≠ μ2
The calculation of the Mann-Whitney U statistic is specified for small and large samples.
Small samples
Method:
(a) Let's consider N1 the size of the sample with the smallest number of observations and N2 the size of the sample with the largest number of observations. We assume that both samples are independent.
(b) In order to apply the Mann-Whitney U test, we must join both samples into a single combined sample that will be formed by N = N1 + N2 elements. However, we must identify the original sample of each observation in the combined sample. The combined sample must be sorted in ascending order and ranks are attributed to each observation. For example, rank 1 is attributed to the lowest observation and rank N to the highest observation. If there are ties, we attribute the mean of the corresponding ranks.
(c) After that, we must calculate the sum of the ranks for each sample, that is, calculate R1, which corresponds to the sum of the ranks in the sample with the smallest number of observations, and R2, which corresponds to the sum of the ranks in the sample with the largest number of observations.
(d) Thus, we can calculate the quantities U1 and U2 as follows:

U1 = N1·N2 + N1(N1 + 1)/2 − R1    (10.11)
U2 = N1·N2 + N2(N2 + 1)/2 − R2    (10.12)

(e) The Mann-Whitney U statistic is given by:

Ucal = min(U1, U2)

Table J in the Appendix shows the critical values of U such that P(Ucal < Uc) = α (for a left-tailed unilateral test), for values of N2 ≤ 20 and significance levels of 0.05, 0.025, 0.01, and 0.005. In order for the null hypothesis H0 of the left-tailed unilateral test to be rejected, the value of the Ucal statistic must be in the critical region, that is, Ucal < Uc. Otherwise, we do not reject H0. For a bilateral test, we must consider P(Ucal < Uc) = α/2, since P(Ucal < Uc) + P(Ucal > Uc) = α. The unilateral probabilities associated with the Ucal statistic (P1) can also be obtained from Table J. For a unilateral test, we have P = P1. For a bilateral test, this probability must be doubled (P = 2P1). Thus, we reject H0 if P ≤ α.
Large samples
As the sample size grows (N2 > 20), the Mann-Whitney distribution becomes more similar to a standard normal distribution.


The real value of the Z statistic is given by:

Zcal = (U − N1·N2/2) / √[ (N1·N2 / (N(N − 1))) · ( (N³ − N)/12 − Σ_{j=1}^{g} (t_j³ − t_j)/12 ) ]    (10.13)

where:
Σ_{j=1}^{g} (t_j³ − t_j)/12 is a correction factor when there are ties;
g: the number of groups with tied ranks;
t_j: the number of tied observations in group j.
The value calculated must be compared to the critical value of the standard normal distribution (see Table E in the Appendix). This table provides the critical values of zc where P(Zcal > zc) = α (for a right-tailed unilateral test). For a bilateral test, we have P(Zcal < −zc) = P(Zcal > zc) = α/2. Therefore, for a bilateral test, the null hypothesis is rejected if Zcal < −zc or Zcal > zc. Unilateral probabilities associated with the Zcal statistic (P1 = P) can also be obtained from Table E. For a bilateral test, this probability must be doubled (P = 2P1). Thus, the null hypothesis is rejected if P ≤ α.

Example 10.9: Applying the Mann-Whitney U Test to Small Samples
Aiming at assessing the quality of two machines, the diameters of the parts produced (in mm) in each one of them are compared, as shown in Table 10.E.14. Use the most suitable test, at a significance level of 5%, to test whether or not both samples come from populations with the same medians.

TABLE 10.E.14 Diameter of Parts Produced in Two Machines

Mach. A    48.50    48.65    48.58    48.55    48.66    48.64    48.50    48.72
Mach. B    48.75    48.64    48.80    48.85    48.78    48.79    49.20

Solution
Step 1: By applying the normality test to both samples, we can see that the data from machine B do not follow a normal distribution. So, the most suitable test to compare the medians of two independent populations is the Mann-Whitney U test.
Step 2: Through the null hypothesis, the median diameters of the parts produced by both machines are the same, so:
H0: μA = μB
H1: μA ≠ μB
Step 3: The significance level to be considered is 5%.
Step 4: Calculation of the U statistic:
(a) N1 = 7 (sample size from machine B); N2 = 8 (sample size from machine A).
(b) Combined sample and respective ranks (Table 10.E.15):

TABLE 10.E.15 Combined Data

Data     Machine    Ranks
48.50    A          1.5
48.50    A          1.5
48.55    A          3
48.58    A          4
48.64    A          5.5
48.64    B          5.5
48.65    A          7
48.66    A          8
48.72    A          9
48.75    B          10
48.78    B          11
48.79    B          12
48.80    B          13
48.85    B          14
49.20    B          15

(c) R1 = 80.5 (sum of the ranks from machine B, the sample with the smallest number of observations); R2 = 39.5 (sum of the ranks from machine A, the sample with the largest number of observations).
(d) Calculation of U1 and U2:

U1 = N1·N2 + N1(N1 + 1)/2 − R1 = 7 × 8 + (7 × 8)/2 − 80.5 = 3.5
U2 = N1·N2 + N2(N2 + 1)/2 − R2 = 7 × 8 + (8 × 9)/2 − 39.5 = 52.5

(e) Calculation of the Mann-Whitney U statistic:

Ucal = min(U1, U2) = 3.5

Step 5: According to Table J in the Appendix, for N1 = 7, N2 = 8, and P(Ucal < Uc) = α/2 = 0.025 (bilateral test), the critical value of the Mann-Whitney U statistic is Uc = 10.
Step 6: Decision: since the calculated statistic is in the critical region, that is, Ucal < 10, the null hypothesis is rejected, which allows us to conclude, with a 95% confidence level, that the medians of both populations are different.
If we use the P-value instead of the critical value of the statistic, Steps 5 and 6 will be:
Step 5: According to Table J in the Appendix, the unilateral probability P1 associated with Ucal = 3.5, for N1 = 7 and N2 = 8, is less than 0.005. For a bilateral test, this probability must be doubled (P < 0.01).
Step 6: Decision: since P < 0.05, we must reject H0.
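Steps (a) to (e) can be mirrored in a short Python sketch (an assumption, using scipy's rankdata for the mean ranks of ties; the book obtains the same quantities by hand):

from scipy.stats import rankdata

# Example 10.9: machine B is the smaller sample (N1 = 7), machine A the larger (N2 = 8)
machine_b = [48.75, 48.64, 48.80, 48.85, 48.78, 48.79, 49.20]
machine_a = [48.50, 48.65, 48.58, 48.55, 48.66, 48.64, 48.50, 48.72]
n1, n2 = len(machine_b), len(machine_a)
ranks = rankdata(machine_b + machine_a)      # tied values receive the mean of their ranks
r1, r2 = sum(ranks[:n1]), sum(ranks[n1:])    # 80.5 and 39.5
u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1        # Expression (10.11): 3.5
u2 = n1 * n2 + n2 * (n2 + 1) / 2 - r2        # Expression (10.12): 52.5
print(min(u1, u2))                           # Ucal = 3.5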

Example 10.10: Applying the Mann-Whitney U Test to Large Samples
As described previously, as the sample size grows (N2 > 20), the Mann-Whitney distribution becomes more similar to a standard normal distribution. Even though the data in Example 10.9 represent a small sample (N2 = 8), what would be the value of z in this case, by using Expression (10.13)? Interpret the result.
Solution

Zcal = (U − N1·N2/2) / √[ (N1·N2 / (N(N − 1))) · ( (N³ − N)/12 − Σ (t_j³ − t_j)/12 ) ]
     = (3.5 − 7 × 8/2) / √[ (7 × 8 / (15 × 14)) · ( (15³ − 15)/12 − (16 − 4)/12 ) ] ≅ −2.840


The critical value of the zc statistic for a bilateral test, at the significance level of 5%, is 1.96 (see Table E in the Appendix). Since Zcal < −1.96, the null hypothesis would also be rejected by the Z statistic, which allows us to conclude, with a 95% confidence level, that the population medians are different. Instead of comparing the value calculated to the critical value, we could obtain the P-value directly from Table E. Thus, the unilateral probability associated with the statistic Zcal = −2.840 is P1 = 0.0023. For a bilateral test, this probability must be doubled (P-value = 0.0046).
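The same result can be obtained from scipy's Mann-Whitney implementation (again a Python sketch and an assumption, not the book's tooling); turning off the continuity correction mirrors Expression (10.13), which does not use one:

from scipy.stats import mannwhitneyu

machine_a = [48.50, 48.65, 48.58, 48.55, 48.66, 48.64, 48.50, 48.72]
machine_b = [48.75, 48.64, 48.80, 48.85, 48.78, 48.79, 49.20]
u, p_value = mannwhitneyu(machine_a, machine_b, alternative='two-sided', use_continuity=False)
print(u, round(p_value, 4))  # U = 3.5; bilateral P-value close to 0.0045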

10.4.2.1 Solving the Mann-Whitney Test Using SPSS Software

The use of the images in this section has been authorized by the International Business Machines Corporation©. The data in Example 10.9 are available in the file Mann-Whitney_Test.sav. Since group 1 is the one with the smallest number of observations, in Data → Define Variable Properties …, we assign value 1 to group B and value 2 to group A for the variable Machine.
In order to elaborate the Mann-Whitney test on SPSS, we must click on Analyze → Nonparametric Tests → Legacy Dialogs → 2 Independent Samples …, as shown in Fig. 10.40. After that, we should insert the variable Diameter in the box Test Variable List and the variable Machine in Grouping Variable, defining the respective groups. Let's select the option Mann-Whitney U in Test Type, as shown in Fig. 10.41. Finally, let's click on OK to obtain Figs. 10.42 and 10.43. Fig. 10.42 shows the mean and the sum of the ranks for each group, while Fig. 10.43 shows the statistic of the test.
The results in Fig. 10.42 are similar to the ones calculated in Example 10.9. According to Fig. 10.43, the result of the Mann-Whitney U statistic is 3.50, similar to the value calculated in Example 10.9. The bilateral probability associated to the U statistic is P = 0.002 (we saw in Example 10.9 that this probability is less than 0.01). For the same data in Example 10.9, if we had to calculate the Z statistic and the respective associated bilateral probability, the result would be Zcal = −2.840 and P = 0.005, similar to the values calculated in Example 10.10. For both tests, as the associated bilateral probability is less than 0.05, the null hypothesis is rejected, which allows us to conclude that the medians of both populations are different.

FIG. 10.40 Procedure to elaborate the Mann-Whitney test on SPSS.


FIG. 10.41 Selecting the variables and Mann-Whitney test.

FIG. 10.42 Ranks.

FIG. 10.43 Mann-Whitney test for Example 10.9 on SPSS.

10.4.2.2 Solving the Mann-Whitney Test Using Stata Software

The use of the images presented in this section has been authorized by Stata Corp LP©. The Mann-Whitney test is elaborated on Stata from the command ranksum (equality test for nonpaired data), by using the following syntax: ranksum variable*, by (groups*)


FIG. 10.44 Results of the Mann-Whitney test for Examples 10.9 and 10.10 on Stata.

where the term variable* must be replaced by the quantitative variable studied and the term groups* by the categorical variable that represents the groups. Let’s open the file Mann-Whitney_Test.dta that contains the data from Examples 10.9 and 10.10. Both groups are represented by the variable machine and the quality characteristic by the variable diameter. Thus, the command to be typed is: ranksum diameter, by (machine)

The results obtained are shown in Fig. 10.44. We can see that the calculated value of the statistic (2.840) corresponds to the value calculated in Example 10.10, for large samples, from Expression (10.13). The probability associated to the statistic for a bilateral test is 0.0045. Since P < 0.05, we must reject the null hypothesis, which allows us to conclude, with a 95% confidence level, that the population medians are different.
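As a complement to the SPSS and Stata outputs, the Mann-Whitney U statistic of Example 10.9 can also be reproduced with the scipy.stats.mannwhitneyu function. This sketch is not part of the original text; it assumes the diameters of Table 10.E.15:

from scipy.stats import mannwhitneyu

# Diameters measured for each machine (Table 10.E.15)
machine_a = [48.50, 48.50, 48.55, 48.58, 48.64, 48.65, 48.66, 48.72]
machine_b = [48.64, 48.75, 48.78, 48.79, 48.80, 48.85, 49.20]

# U is reported for the first sample; because of the ties, scipy uses the
# normal approximation with tie correction to obtain the p-value
res = mannwhitneyu(machine_a, machine_b, alternative='two-sided')
print(res.statistic, res.pvalue)   # U = 3.5 and a bilateral p-value of about 0.005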

10.5 TESTS FOR K PAIRED SAMPLES

These tests analyze the differences between k (three or more) paired or related samples. According to Siegel and Castellan (2006), the null hypothesis to be tested is that the k samples have been drawn from the same population. The main tests for k paired samples are Cochran's Q test (for binary variables) and Friedman's test (for ordinal variables).

10.5.1 Cochran's Q Test

Cochran's Q test for k paired samples is an extension of the McNemar test for two samples, and it aims to test the hypothesis that the frequencies or proportions of three or more related groups differ significantly from one another. In the same way as in the McNemar test, the data are binary. According to Siegel and Castellan (2006), Cochran's Q test compares the characteristics of several individuals or characteristics of the same individual observed under different conditions. For example, we can analyze if k items differ significantly for N individuals. Or, we may have only one item to analyze, and the objective is to compare the answers of N individuals under k different conditions.
Let's suppose that the study data are organized in one table with N rows and k columns, in which N is the number of cases and k is the number of groups or conditions. Through the null hypothesis of Cochran's Q test, there are no differences between the frequencies or proportions of success (p) of the k related groups, that is, the proportion of a desired answer (success) is the same in each column. Through the alternative hypothesis, there are differences between at least two groups, so:
H0: p1 = p2 = … = pk
H1: ∃(i,j) pi ≠ pj, i ≠ j


Cochran's Q statistic is given by:

$$Q_{cal} = \frac{(k-1)\left[k\sum_{j=1}^{k} G_j^2 - \left(\sum_{j=1}^{k} G_j\right)^2\right]}{k\sum_{i=1}^{N} L_i - \sum_{i=1}^{N} L_i^2} = \frac{k\,(k-1)\sum_{j=1}^{k}\left(G_j - \bar{G}\right)^2}{k\sum_{i=1}^{N} L_i - \sum_{i=1}^{N} L_i^2} \qquad (10.14)$$

which approximately follows a χ² distribution with k − 1 degrees of freedom, where:
Gj: the total number of successes in the jth column;
Ḡ: mean of the Gj;
Li: the total number of successes in the ith row.
The value calculated must be compared to the critical value of the χ² distribution (Table D in the Appendix). This table provides the critical values χ²c where P(χ²cal > χ²c) = α (for a right-tailed unilateral test). If the value of the statistic is in the critical region, that is, if Qcal > χ²c, we must reject H0. Otherwise, we do not reject H0. The probability associated to the calculated value of the statistic (P-value) can also be obtained from Table D. In this case, the null hypothesis is rejected if P ≤ α; otherwise, we do not reject H0.

Example 10.11: Applying Cochran's Q Test
We are interested in assessing 20 consumers' level of satisfaction regarding three supermarkets, trying to investigate if their clients are satisfied (score 1) or not (score 0) with the quality, variety, and price of their products—for each supermarket. Check the hypothesis that the probability of receiving a good evaluation from clients is the same for all three supermarkets, considering a significance level of 10%. Table 10.E.16 shows the results of the evaluation.

TABLE 10.E.16 Results of the Evaluation for All Three Supermarkets

Consumer   A        B        C        Li        Li²
1          1        1        1        3         9
2          1        0        1        2         4
3          0        1        1        2         4
4          0        0        0        0         0
5          1        1        0        2         4
6          1        1        1        3         9
7          0        0        1        1         1
8          1        0        1        2         4
9          1        1        1        3         9
10         0        0        1        1         1
11         0        0        0        0         0
12         1        1        0        2         4
13         1        0        1        2         4
14         1        1        1        3         9
15         0        1        1        2         4
16         0        1        1        2         4
17         1        1        1        3         9
18         1        1        1        3         9
19         0        0        1        1         1
20         0        0        1        1         1
Total      G1 = 11  G2 = 11  G3 = 16  ΣLi = 38  ΣLi² = 90


FIG. 10.45 Critical region of Example 10.11.

Solution
Step 1: The most suitable test to compare proportions of three or more paired groups is Cochran's Q test.
Step 2: Through the null hypothesis, the proportion of successes (score 1) is the same for all three supermarkets. Through the alternative hypothesis, the proportion of satisfied clients differs for at least two supermarkets, so:
H0: p1 = p2 = p3
H1: ∃(i,j) pi ≠ pj, i ≠ j
Step 3: The significance level to be considered is 10%.
Step 4: The calculation of Cochran's Q statistic, from Expression (10.14), is given by:

$$Q_{cal} = \frac{(k-1)\left[k\sum_{j=1}^{k} G_j^2 - \left(\sum_{j=1}^{k} G_j\right)^2\right]}{k\sum_{i=1}^{N} L_i - \sum_{i=1}^{N} L_i^2} = \frac{(3-1)\left[3\left(11^2 + 11^2 + 16^2\right) - 38^2\right]}{3 \cdot 38 - 90} = 4.167$$

Step 5: The critical region (CR) of the χ² distribution (Table D in the Appendix), considering α = 10% and ν = k − 1 = 2 degrees of freedom, is shown in Fig. 10.45.
Step 6: Decision: since the value calculated is not in the critical region, that is, Qcal < 4.605, the null hypothesis is not rejected, which allows us to conclude, with a 90% level of confidence, that the proportion of satisfied clients is equal for all three supermarkets.
If we use the P-value instead of the statistic's critical value, Steps 5 and 6 will be:
Step 5: According to Table D, in the Appendix, for ν = 2 degrees of freedom, the probability associated to the statistic Qcal = 4.167 is greater than 0.10 (P-value > 0.10).
Step 6: Decision: since P > 0.10, we should not reject H0.

10.5.1.1 Solving Cochran’s Q Test by Using SPSS Software

The use of the images in this section has been authorized by the International Business Machines Corporation©. The data in Example 10.11 are available in the file Cochran_Q_Test.sav. The procedure for elaborating Cochran's Q test on SPSS is shown next. First of all, let's click on Analyze → Nonparametric Tests → Legacy Dialogs → K Related Samples …, as shown in Fig. 10.46. After that, we must insert variables A, B, and C in the box Test Variables, and select the option Cochran's Q in Test Type, as shown in Fig. 10.47. Finally, let's click on OK to obtain the results of the test.
Fig. 10.48 shows the frequencies of each group and Fig. 10.49 shows the result of the statistic. The value of Cochran's Q statistic is 4.167, similar to the value calculated in Example 10.11. The probability associated to the statistic is 0.125 (we saw in Example 10.11 that P > 0.10). Since P > α, the null hypothesis is not rejected, which allows us to conclude, with a 90% level of confidence, that there are no differences in the proportion of satisfied clients for all three supermarkets.

10.5.1.2 Solution of Cochran's Q Test on Stata Software

The use of the images presented in this section has been authorized by Stata Corp LP©. The data from Example 10.11 are also available in the file Cochran_Q_Test.dta. The command used to elaborate the test is cochran, followed by the k paired variables—in our case, the variables that represent the three supermarkets: a, b, and c, respectively. So, the command to be typed is:
cochran a b c


FIG. 10.46 Procedure for elaborating Cochran’s Q test on SPSS.

FIG. 10.47 Selecting the variables and Cochran’s Q test.



FIG. 10.48 Frequencies.

FIG. 10.49 Cochran’s Q test for Example 10.11 on SPSS.

FIG. 10.50 Results of Cochran’s Q test for Example 10.11 on Stata.

The results of Cochran's Q test on Stata are in Fig. 10.50. We can verify that the result of the statistic and the respective associated probability are similar to the results calculated in Example 10.11, and also generated on SPSS, which allows us to conclude, with a 90% level of confidence, that the proportion of satisfied clients is the same for all three supermarkets.
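The same result can be verified outside SPSS and Stata. The sketch below is not part of the original text; it reproduces Cochran's Q for the data in Table 10.E.16 using the cochrans_q function from the statsmodels package (the sketch assumes statsmodels is installed):

import numpy as np
from statsmodels.stats.contingency_tables import cochrans_q

# Binary evaluations from Table 10.E.16 (1 = satisfied, 0 = not satisfied);
# one row per consumer, columns refer to supermarkets A, B, and C
scores = np.array([
    [1, 1, 1], [1, 0, 1], [0, 1, 1], [0, 0, 0], [1, 1, 0],
    [1, 1, 1], [0, 0, 1], [1, 0, 1], [1, 1, 1], [0, 0, 1],
    [0, 0, 0], [1, 1, 0], [1, 0, 1], [1, 1, 1], [0, 1, 1],
    [0, 1, 1], [1, 1, 1], [1, 1, 1], [0, 0, 1], [0, 0, 1],
])

result = cochrans_q(scores)
print(result.statistic, result.pvalue)   # Q = 4.167 and p = 0.125, as in Fig. 10.49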

10.5.2 Friedman's Test

Friedman's test is applied to quantitative or qualitative variables in an ordinal scale, and its main objective is to verify whether k paired samples are drawn from the same population. It is an extension of the Wilcoxon test for three or more paired samples. It is also an alternative to the analysis of variance when its hypotheses (normality of the data and homogeneity of variances) are violated or when the sample size is too small.
The data are represented in a table with double entry, with N rows and k columns, in which the rows represent the several individuals or corresponding sets of individuals, and the columns represent the different conditions. Therefore, the null hypothesis of Friedman's test assumes that the k samples (columns) come from the same population or from populations with the same median (μ). For a bilateral test, we have:
H0: μ1 = μ2 = … = μk
H1: ∃(i,j) μi ≠ μj, i ≠ j
To apply Friedman's statistic, we must attribute ranks from 1 to k to each element of each row. For example, position 1 is attributed to the lowest observation in the row and position k to the highest observation. If there are ties, we attribute the mean of the corresponding ranks. Friedman's statistic is given by:

$$F_{cal} = \frac{12}{N \cdot k \cdot (k+1)}\sum_{j=1}^{k} R_j^2 - 3 \cdot N \cdot (k+1) \qquad (10.15)$$


where:
N: the number of rows;
k: the number of columns;
Rj: sum of the ranks in column j.
However, according to Siegel and Castellan (2006), whenever there are ties between the ranks of the same group or row, Friedman's statistic must be corrected in a way that considers the changes in the sample distribution, as follows:

$$F'_{cal} = \frac{12\sum_{j=1}^{k} R_j^2 - 3 \cdot N^2 \cdot k \cdot (k+1)^2}{N \cdot k \cdot (k+1) + \dfrac{N \cdot k - \sum_{i=1}^{N}\sum_{j=1}^{g_i} t_{ij}^3}{k-1}} \qquad (10.16)$$

where:
gi: the number of sets with tied observations in the ith group, including the sets of size 1;
tij: size of the jth set of ties in the ith group.
The value calculated must be compared to the critical value of the sample distribution. When N and k are small (k = 3 and 3 < N < 13, or k = 4 and 2 < N < 8, or k = 5 and 3 < N < 5), we must use Table K in the Appendix, which shows the critical values of Friedman's statistic (Fc), where P(Fcal > Fc) = α (for a right-tailed unilateral test). For high values of N and k, the sample distribution can be approximated by the χ² distribution with ν = k − 1 degrees of freedom. Therefore, if the value of the Fcal statistic is in the critical region, that is, if Fcal > Fc for small N and k, or Fcal > χ²c for high N and k, we must reject the null hypothesis. Otherwise, we do not reject H0.

Example 10.12: Applying Friedman's Test
A study is carried out to verify the efficacy of breakfast in weight loss and, in order to do that, 15 patients were followed up for 3 months. Data regarding the patients' weight were collected during three different periods, as shown in Table 10.E.17: before the treatment (BT), after the treatment (AT), and after 3 months of treatment (A3M). Check and see if the treatment had any results. Assume that α = 5%.

TABLE 10.E.17 Patients' Weight in Each Period

Patient   BT    AT    A3M
1         65    62    58
2         89    85    80
3         96    95    95
4         90    84    79
5         70    70    66
6         72    65    62
7         87    84    77
8         74    74    69
9         66    64    62
10        135   132   132
11        82    75    71
12        76    73    67
13        94    90    88
14        80    80    77
15        73    70    68


Solution
Step 1: Since the data do not follow a normal distribution, Friedman's test is an alternative to ANOVA to verify if the three paired samples are drawn from the same population.
Step 2: Through the null hypothesis, there is no difference among the treatments. Through the alternative hypothesis, the treatment had some results, so:
H0: μ1 = μ2 = μ3
H1: ∃(i,j) μi ≠ μj, i ≠ j
Step 3: The significance level to be considered is 5%.
Step 4: In order to calculate Friedman's statistic, first, we must attribute ranks from 1 to 3 to each element in each row, as shown in Table 10.E.18. If there are ties, we attribute the mean of the corresponding ranks.

TABLE 10.E.18 Attributing Ranks

Patient              BT     AT     A3M
1                    3      2      1
2                    3      2      1
3                    3      1.5    1.5
4                    3      2      1
5                    2.5    2.5    1
6                    3      2      1
7                    3      2      1
8                    2.5    2.5    1
9                    3      2      1
10                   3      1.5    1.5
11                   3      2      1
12                   3      2      1
13                   3      2      1
14                   2.5    2.5    1
15                   3      2      1
Rj                   43.5   30.5   16
Mean of the ranks    2.900  2.033  1.067

As shown in Table 10.E.18, there are two tied ranks for patient 3, two for patient 5, two for patient 8, two for patient 10, and two for patient 14. Therefore, the total number of ties of size 2 is 5, and the total number of ties of size 1 is 35. Thus:

$$\sum_{i=1}^{N}\sum_{j=1}^{g_i} t_{ij}^3 = 35 \cdot 1 + 5 \cdot 2^3 = 75$$

Since there are ties, the real value of Friedman's statistic is calculated from Expression (10.16), as follows:

$$F'_{cal} = \frac{12\sum_{j=1}^{k} R_j^2 - 3 \cdot N^2 \cdot k \cdot (k+1)^2}{N \cdot k \cdot (k+1) + \dfrac{N \cdot k - \sum_{i=1}^{N}\sum_{j=1}^{g_i} t_{ij}^3}{k-1}} = \frac{12\left(43.5^2 + 30.5^2 + 16^2\right) - 3 \cdot 15^2 \cdot 3 \cdot 4^2}{15 \cdot 3 \cdot 4 + \dfrac{15 \cdot 3 - 75}{2}} = 27.527$$


FIG. 10.51 Critical region of Example 10.12.

If we applied Expression (10.15) without the correction factor, the result of Friedman's test would be 25.233.
Step 5: Since k = 3 and N = 15, let's use the χ² distribution. The critical region (CR) of the χ² distribution (Table D in the Appendix), considering α = 5% and ν = k − 1 = 2 degrees of freedom, is shown in Fig. 10.51.
Step 6: Decision: since the value calculated is in the critical region, that is, F′cal > 5.991, we reject the null hypothesis, which allows us to conclude, with a 95% confidence level, that the treatment has good results.
If we use the P-value instead of the critical value of the statistic, Steps 5 and 6 will be:
Step 5: According to Table D in the Appendix, for ν = 2 degrees of freedom, the probability associated to the statistic F′cal = 27.527 is less than 0.005 (P-value < 0.005).
Step 6: Decision: since P < 0.05, we reject H0.

Step 6: Decision: since the value calculated is in the critical region, that is, χ²cal > 12.592, we must reject the null hypothesis, which allows us to conclude, with a 95% confidence level, that there is a difference in productivity among the four shifts.
If we use the P-value instead of the critical value of the statistic, Steps 5 and 6 will be:
Step 5: According to Table D in the Appendix, the probability associated to the statistic χ²cal = 13.143, for ν = 6 degrees of freedom, is between 0.025 and 0.05.
Step 6: Decision: since P < 0.05, we reject H0.
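Going back to Example 10.12, its Friedman statistic can also be checked in Python. The sketch below is not part of the original text; it uses scipy.stats.friedmanchisquare, which applies a tie correction that, for these data, matches Expression (10.16):

from scipy.stats import friedmanchisquare

# Patients' weights in each period (Table 10.E.17)
bt  = [65, 89, 96, 90, 70, 72, 87, 74, 66, 135, 82, 76, 94, 80, 73]
at  = [62, 85, 95, 84, 70, 65, 84, 74, 64, 132, 75, 73, 90, 80, 70]
a3m = [58, 80, 95, 79, 66, 62, 77, 69, 62, 132, 71, 67, 88, 77, 68]

stat, p = friedmanchisquare(bt, at, a3m)
print(round(stat, 3), p)   # 27.527 (tie-corrected statistic) and p well below 0.005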

FIG. 10.57 Critical region of Example 10.13.


FIG. 10.58 Selecting the variables.

10.6.1.1 Solving the χ² Test for k Independent Samples on SPSS

The use of the images in this section has been authorized by the International Business Machines Corporation©. The data from Example 10.13 are available in the file Chi-Square_k_Independent_Samples.sav. Let's click on Analyze → Descriptive Statistics → Crosstabs … After that, we should insert the variable Productivity in Row(s) and the variable Shift in Column(s), as shown in Fig. 10.58. In Statistics …, let's select the option Chi-square, as shown in Fig. 10.59. If we wish to obtain the observed and expected frequency distribution table, in Cells …, we must select the options Observed and Expected in Counts, as shown in Fig. 10.60. Finally, let's click on Continue and OK. The results can be seen in Figs. 10.61 and 10.62.
From Fig. 10.62, we can see that the value of χ² is 13.143, similar to the one calculated in Example 10.13. For a confidence level of 95%, since P = 0.041 < 0.05 (we saw in Example 10.13 that this probability is between 0.025 and 0.05), we must reject the null hypothesis, which allows us to conclude, with a 95% confidence level, that there is a difference in productivity among the four shifts.

10.6.1.2 Solving the χ² Test for k Independent Samples on Stata

The use of the images presented in this section has been authorized by Stata Corp LP©. The data in Example 10.13 are available in the file Chi-Square_k_Independent_Samples.dta. The variables being studied are productivity and shift.
The syntax of the χ² test for k independent samples is similar to the one presented in Section 10.4.1 for two independent samples. Thus, we must use the command tabulate, or simply tab, followed by the names of the variables being studied, besides the option chi2, or simply ch. The difference is that, in this case, the categorical variable that represents the groups has more than two categories. Therefore, the syntax of the test for the data in Example 10.13 is:
tabulate productivity shift, chi2


FIG. 10.59 Selecting the w2 statistic.

FIG. 10.60 Selecting the observed and expected frequencies distribution table.


FIG. 10.61 Distribution of the observed and expected frequencies.

FIG. 10.62 Results of the w2 test for Example 10.13 on SPSS.

FIG. 10.63 Results of the w2 test for Example 10.13 on Stata.

or simply: tab productivity shift, ch

The results can be seen in Fig. 10.63. The value of the χ² statistic and the probability associated to it are similar to the results presented in Example 10.13, and also generated on SPSS.
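The same crosstab-based χ² test can be reproduced in Python with scipy.stats.chi2_contingency. The sketch below is not part of the original text, and the observed-frequency table in it is purely hypothetical (it is not the table of Example 10.13); it only illustrates the mechanics of the test for a 3 × 4 cross-tabulation, which has the same ν = 6 degrees of freedom as Example 10.13:

import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical observed counts: rows = productivity levels, columns = shifts
observed = np.array([
    [12,  8,  6,  4],
    [10, 12,  9,  7],
    [ 5,  9, 11, 13],
])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(chi2, p_value, dof)   # dof = (3 - 1) * (4 - 1) = 6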

10.6.2 Kruskal-Wallis Test

The Kruskal-Wallis test aims at verifying whether k independent samples (k > 2) come from the same population. It is an alternative to the analysis of variance when the hypotheses of data normality and equality of variances are violated, or when the sample is small, or even when the variable is measured in an ordinal scale. For k = 2, the Kruskal-Wallis test is equivalent to the Mann-Whitney test.
The data are represented in a table with double entry, with N rows and k columns, in which the rows represent the observations and the columns represent the different samples or groups. The null hypothesis of the Kruskal-Wallis test assumes that all k samples come from the same population or from identical populations with the same median (μ). For a bilateral test, we have:
H0: μ1 = μ2 = … = μk
H1: ∃(i,j) μi ≠ μj, i ≠ j
In the Kruskal-Wallis test, all N observations (N is the total number of observations in the global sample) are organized in a single series, and we attribute ranks to each element in the series. Thus, position 1 is attributed to the lowest observation in the global sample, position 2 to the second lowest observation, and so on, up to position N. If there are ties, we attribute the mean of the corresponding ranks. The Kruskal-Wallis statistic (H) is given by:

$$H_{cal} = \frac{12}{N \cdot (N+1)}\sum_{j=1}^{k} \frac{R_j^2}{n_j} - 3 \cdot (N+1) \qquad (10.17)$$

where:
k: the number of samples or groups;
nj: the number of observations in the sample or group j;
N: the number of observations in the global sample;
Rj: sum of the ranks in the sample or group j.
However, according to Siegel and Castellan (2006), whenever there are ties between two or more ranks, regardless of the group, the Kruskal-Wallis statistic must be corrected in a way that considers the changes in the sample distribution, so:

$$H'_{cal} = \frac{H}{1 - \dfrac{\sum_{j=1}^{g}\left(t_j^3 - t_j\right)}{N^3 - N}} \qquad (10.18)$$

where:
g: the number of clusters with different tied ranks;
tj: the number of tied ranks in the jth cluster.
According to Siegel and Castellan (2006), the main objective of correcting for these ties is to increase the value of H, making the result more significant.
The value calculated must be compared to the critical value of the sample distribution. If k = 3 and n1, n2, n3 ≤ 5, we must use Table L in the Appendix, which shows the critical values of the Kruskal-Wallis statistic (Hc), where P(Hcal > Hc) = α (for a right-tailed unilateral test). Otherwise, the sample distribution can be approximated by the χ² distribution with ν = k − 1 degrees of freedom. Therefore, if the value of the Hcal statistic is in the critical region, that is, if Hcal > Hc for k = 3 and n1, n2, n3 ≤ 5, or Hcal > χ²c for other values, the null hypothesis is rejected, which allows us to conclude that there is a difference between the samples. Otherwise, we do not reject H0.

Example 10.14: Applying the Kruskal-Wallis Test
A group of 36 patients with the same level of stress was submitted to three different treatments, that is, 12 patients were submitted to treatment A, 12 patients to treatment B, and the remaining 12 to treatment C. At the end of the treatment, each patient answered a questionnaire that evaluates a person's stress level, which is classified in three phases: the resistance phase, for those who got a maximum of three points, the warning phase, for those who got more than 6 points, and the exhaustion phase, for those who got more than 8 points. The results can be seen in Table 10.E.20. Verify if the three treatments lead to the same results. Consider a significance level of 1%.


TABLE 10.E.20 Stress Level After the Treatment

Treatment A    6   5   4   5   3   4    5   2   4   3    5   2
Treatment B    6   7   5   8   7   8    6   9   8   6    8   8
Treatment C    5   9   8   7   9   11   7   8   9   10   7   8

Solution
Step 1: Since the variable is measured in an ordinal scale, the most suitable test to verify if the three independent samples are drawn from the same population is the Kruskal-Wallis test.
Step 2: Through the null hypothesis, there is no difference among the treatments. Through the alternative hypothesis, there is a difference between at least two treatments, so:
H0: μ1 = μ2 = μ3
H1: ∃(i,j) μi ≠ μj, i ≠ j
Step 3: The significance level to be considered is 1%.
Step 4: In order to calculate the Kruskal-Wallis statistic, first of all, we must attribute ranks from 1 to 36 to each element in the global sample, as shown in Table 10.E.21. In case of ties, we attribute the mean of the corresponding ranks.

TABLE 10.E.21 Attributing Ranks

     Ranks                                                                    Sum     Mean
A    15.5  10.5  6     10.5  3.5   6     10.5  1.5   6     3.5   10.5  1.5   85.5    7.13
B    15.5  20    10.5  26.5  20    26.5  15.5  32.5  26.5  15.5  26.5  26.5  262     21.83
C    10.5  32.5  26.5  20    32.5  36    20    26.5  32.5  35    20    26.5  318.5   26.54

Since there are ties, the Kruskal-Wallis statistic is calculated from Expression (10.18). First of all, we calculate the value of H:

$$H_{cal} = \frac{12}{N \cdot (N+1)}\sum_{j=1}^{k} \frac{R_j^2}{n_j} - 3 \cdot (N+1) = \frac{12}{36 \cdot 37}\left(\frac{85.5^2 + 262^2 + 318.5^2}{12}\right) - 3 \cdot 37 = 22.181$$

From Tables 10.E.20 and 10.E.21, we can verify that there are eight groups of tied ranks. For example, there are two observations with 2 points (with a rank of 1.5), two observations with 3 points (with a rank of 3.5), three observations with 4 points (with a rank of 6) and, thus, successively, up to four observations with 9 points (with a rank of 32.5). The Kruskal-Wallis statistic is corrected to:

$$H'_{cal} = \frac{H}{1 - \dfrac{\sum_{j=1}^{g}\left(t_j^3 - t_j\right)}{N^3 - N}} = \frac{22.181}{1 - \dfrac{\left(2^3 - 2\right) + \left(2^3 - 2\right) + \left(3^3 - 3\right) + \cdots + \left(4^3 - 4\right)}{36^3 - 36}} = 22.662$$

Step 5: Since n1, n2, n3 > 5, let's use the χ² distribution. The critical region (CR) of the χ² distribution (Table D in the Appendix), considering α = 1% and ν = k − 1 = 2 degrees of freedom, is shown in Fig. 10.64.

FIG. 10.64 Critical region of Example 10.14.


Step 6: Decision: since the value calculated is in the critical region, that is, H′cal > 9.210, we must reject the null hypothesis, which allows us to conclude, with a 99% confidence level, that there is a difference among the treatments.
If we use the P-value instead of the critical value of the statistic, Steps 5 and 6 will be:
Step 5: According to Table D in the Appendix, for ν = 2 degrees of freedom, the probability associated to the statistic H′cal = 22.662 is less than 0.005 (P < 0.005).
Step 6: Decision: since P < 0.01, we reject H0.

10.6.2.1 Solving the Kruskal-Wallis Test by Using SPSS Software

The use of the images in this section has been authorized by the International Business Machines Corporation©. The data in Example 10.14 are available in the file Kruskal-Wallis_Test.sav. In order to elaborate the Kruskal-Wallis test on SPSS, let’s click on Analyze → Nonparametric Tests → Legacy Dialogs → K Independent Samples …, as shown in Fig. 10.65. After that, we should insert the variable Result in the box Test Variable List, define the groups of the variable Treatment and select the Kruskal-Wallis test, as shown in Fig. 10.66. Let’s click on OK to obtain the results of the Kruskal-Wallis test. Fig. 10.67 shows the mean of the ranks for each group, similar to the values calculated in Table 10.E.21. The value of the Kruskal-Wallis statistic and the significance level of the test are in Fig. 10.68. The value of the test is 22.662, similar to the value calculated in Example 10.14. The probability associated to the statistic is 0.000 (we saw in Example 10.14 that this probability is less than 0.005). Since P < 0.01, we reject the null hypothesis, which allows us to conclude, with a 99% confidence level, that there is a difference among the treatments.

FIG. 10.65 Procedure for elaborating the Kruskal-Wallis test on SPSS.


FIG. 10.66 Selecting the variable and defining the groups for the Kruskal-Wallis test.

FIG. 10.67 Ranks.

FIG. 10.68 Results of the Kruskal-Wallis test for Example 10.14 on SPSS.

10.6.2.2 Solving the Kruskal-Wallis Test by Using Stata

The use of the images presented in this section has been authorized by Stata Corp LP©. On Stata, the Kruskal-Wallis test is elaborated through the command kwallis, using the following syntax: kwallis variable*, by(groups*)



FIG. 10.69 Results of the Kruskal-Wallis test for Example 10.14 on Stata.

where the term variable* must be replaced by the quantitative or ordinal variable being studied and the term groups* by the categorical variable that represents the groups. Let’s open the file Kruskal-Wallis_Test.dta that contains the data from Example 10.14. All three groups are represented by the variable treatment and the characteristic analyzed by the variable result. Thus, the command to be typed is: kwallis result, by(treatment)

The result of the test can be seen in Fig. 10.69. Analogous to the results presented in Example 10.14 and generated on SPSS, Stata calculates the original value of the statistic (22.181) and the value with the correction factor for ties (22.662). Since the probability associated to the statistic is 0.000, we reject the null hypothesis, which allows us to conclude, with a 99% confidence level, that there is a difference among the treatments.
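The Kruskal-Wallis statistic of Example 10.14 can likewise be reproduced with scipy.stats.kruskal, which already applies the tie correction of Expression (10.18). This sketch is not part of the original text:

from scipy.stats import kruskal

# Stress scores after each treatment (Table 10.E.20)
treatment_a = [6, 5, 4, 5, 3, 4, 5, 2, 4, 3, 5, 2]
treatment_b = [6, 7, 5, 8, 7, 8, 6, 9, 8, 6, 8, 8]
treatment_c = [5, 9, 8, 7, 9, 11, 7, 8, 9, 10, 7, 8]

stat, p = kruskal(treatment_a, treatment_b, treatment_c)
print(round(stat, 3), p)   # 22.662 (tie-corrected statistic), p < 0.005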

10.7 FINAL REMARKS

In the previous chapter, we studied parametric tests. This chapter, however, was totally dedicated to the study of nonparametric tests. Nonparametric tests are classified according to the variables' level of measurement and to the sample size. So, for each situation, the main types of nonparametric tests were studied. In addition, the advantages and disadvantages of each test, as well as their assumptions, were also established.
For each nonparametric test, the main inherent concepts, the null and alternative hypotheses, the respective statistics, and the solution of the examples proposed on SPSS and on Stata were presented.
Whatever the main objective of their application, nonparametric tests can provide good and interesting research results that will be useful in any decision-making process. The correct use of each test, together with a conscious choice of the modeling software, must always be based on the underlying theory, without ignoring the researcher's experience and intuition.

10.8 EXERCISES

(1) In what situations are nonparametric tests applied?
(2) What are the advantages and disadvantages of nonparametric tests?
(3) What are the differences between the sign test and the Wilcoxon test for two paired samples?
(4) Which test is an alternative to the t-test for one sample when the data distribution does not follow a normal distribution?
(5) A group of 20 consumers tasted two types of coffee (A and B). At the end, they chose one of the brands, as shown in the table. Test the null hypothesis that there is no difference in these consumers' preference, with a significance level of 5%.

Events        Brand A   Brand B   Total
Frequency     8         12        20
Proportion    0.40      0.60      1.00

(6) A group of 60 readers evaluated three novels and, at the end, they chose one of the three options, as shown in the table. Test the null hypothesis that there is no difference in these readers’ preference, with a significance level of 5%.

Events        Book A   Book B   Book C   Total
Frequency     29       15       16       60
Proportion    0.483    0.250    0.267    1.00

(7) A group of 20 teenagers went on the Points Diet for 30 days. Check and see if there was weight loss after the diet. Assume that a ¼ 5%.

Before   After
58       56
67       62
72       65
88       84
77       72
67       68
75       76
69       62
104      97
66       65
58       59
59       60
61       62
67       63
73       65
58       58
67       62
67       64
78       72
85       80


(8) Aiming to compare the average service times in two bank branches, data on 22 clients from each bank branch were collected, as shown in the table. Use the most suitable test, with a significance level of 5%, to test whether both samples come or do not come from populations with the same medians.

Bank Branch A   Bank Branch B
6.24            8.14
8.47            6.54
6.54            6.66
6.87            7.85
2.24            8.03
5.36            5.68
7.09            3.05
7.56            5.78
6.88            6.43
8.04            6.39
7.05            7.64
6.58            6.97
8.14            8.07
8.30            8.33
2.69            7.14
6.14            6.58
7.14            5.98
7.22            6.22
7.58            7.08
6.11            7.62
7.25            5.69
7.5             8.04

(9) A group of 20 Business Administration students evaluated their level of learning based on three subjects studied in the field of Applied Quantitative Methods, by answering if their level of learning was high (1) or low (0). The results can be seen in the table. Check and see if the proportion of students with a high level of learning is the same for each subject. Consider a significance level of 2.5%.

Student   A   B   C
1         0   1   1
2         1   1   1
3         0   0   0
4         0   1   0
5         0   1   1
6         1   1   1
7         1   0   1
8         0   1   1
9         0   0   0
10        0   0   0
11        1   1   1
12        0   0   1
13        1   0   1
14        0   1   1
15        0   0   1
16        1   1   1
17        0   0   1
18        1   1   1
19        0   1   1
20        1   1   1


(10) A group of 15 consumers evaluated their level of satisfaction (1—somewhat dissatisfied, 2—somewhat satisfied, and 3—very satisfied) with three different bank services. The results can be seen in the table. Verify if there is a difference between the three services. Assume a significance level of 5%.

Consumer   A   B   C
1          3   2   3
2          2   2   2
3          1   2   1
4          3   2   2
5          1   1   1
6          3   2   1
7          3   3   2
8          2   2   1
9          3   2   2
10         2   1   1
11         1   1   2
12         3   1   1
13         3   2   1
14         2   1   2
15         3   1   2

Part V

Multivariate Exploratory Data Analysis

Two or more variables can relate to one another in several different ways. While one researcher may be interested in the study of the interrelationship between categorical (or nonmetric) variables, for example, in order to assess the existence of possible associations between their categories, another researcher may wish to create performance indicators (new variables) from the existence of correlations between the original metric variables. A third researcher may be interested in identifying homogeneous groups possibly formed from the existence of similarities in the variables between the observations of a certain dataset. In all of these situations, researchers may use multivariate exploratory techniques.
Multivariate exploratory techniques, also known as interdependence methods, can probably be used in all fields of human knowledge in which researchers aim to study the relationship between the variables of a certain dataset, without intending to estimate confirmatory models. That is, without having to elaborate inferences regarding the findings for other observations, different from the ones considered in the analysis itself, since neither models nor equations are estimated to predict data behavior. This characteristic is crucial to distinguish the techniques studied in Part V of this book from those considered to be dependence methods, such as the simple and multiple regression models, binary and multinomial logistic regression models, and regression models for count data, all of them studied in Part VI.
Therefore, there is no definition of a predictor variable in exploratory models and, thus, their main objectives refer to the reduction or structural simplification of data, to the classification or clustering of observations and variables, to the investigation of the existence of correlation between metric variables, or association between categorical variables and between their categories, to the creation of performance rankings of observations from variables, and to the elaboration of perceptual maps. Exploratory techniques are considered extremely relevant for developing diagnostics regarding the behavior of the data being analyzed. Thus, their varied procedures are commonly adopted in a preliminary way, or even simultaneously, with the application of a certain confirmatory model.
Based on pedagogical and conceptual criteria, we have chosen to discuss the two main sets of existing multivariate exploratory techniques in Part V; therefore, the chapters are structured in the following way:
Chapter 11: Cluster Analysis
Chapter 12: Principal Component Factor Analysis

The decision about the technique to be used also goes through the measurement scale of the variables available in the dataset, which can be categorical or metric (or even binary, a special case of categorization). The type of question itself, when collecting the data, in some situations, may result in a categorical or metric response, which will favor the use of one or more techniques to the detriment of others. Hence, the clear, precise, and preliminary definition of the research objectives is essential to obtain variables in the measurement scale suitable for the application of a certain technique that will serve as a tool for achieving the objectives proposed.
While the cluster analysis techniques (Chapter 11), whose procedures can be hierarchical or nonhierarchical, are used when we wish to study similar behavior between the observations (individuals, companies, municipalities, countries, among other examples) regarding certain metric or binary variables and the possible existence of homogeneous clusters (clusters of observations), the principal component factor analysis (Chapter 12) can be chosen as the technique to be used when the main goal is the creation of new variables (factors, or clusters of variables) that capture the joint behavior of the original metric variables.


BOX V.1 Exploratory Techniques and Main Objectives

Cluster Analysis — Hierarchical (measurement scale: metric or binary): sorting and allocation of the observations into internally homogeneous groups that are heterogeneous between one another; definition of an interesting number of groups.
Cluster Analysis — Nonhierarchical (measurement scale: metric or binary): evaluation of the representativeness of each variable for the formation of a previously established number of groups; from a predefined number of groups, identification of the allocation of each observation.
Principal Component Factor Analysis (measurement scale: metric): identification of the correlations between the original variables for creating factors that represent the combination of those variables (reduction or structural simplification); verification of the validity of previously established constructs; construction of rankings through the creation of performance indicators from the factors; extraction of orthogonal factors for future use in multivariate confirmatory techniques that require the absence of multicollinearity.

Chapter 11 also presents the procedures for elaborating the multidimensional scaling technique in SPSS and in Stata. It can be considered a natural extension of the cluster analysis, and it has as its main objectives to determine the relative positions (coordinates) of each observation in the dataset and to construct two-dimensional charts in which these coordinates are plotted.
It is important to mention that, even though they are not discussed in this book, correspondence analysis techniques are very useful when researchers intend to study possible associations between the variables and between their respective categories. While the simple correspondence analysis is applied to the study of the interdependence relationship between only two categorical variables, which characterizes it as a bivariate technique, the multiple correspondence analysis can be used for a larger number of categorical variables, being, in fact, a multivariate technique. For more details on correspondence analysis techniques, we recommend Fávero and Belfiore (2017). Box V.1 shows the main objectives of each one of the exploratory techniques discussed in Part V.
Each chapter is structured according to the same presentation logic. First, we introduce the concepts regarding each technique, always followed by the algebraic solution of some practical exercises, from datasets elaborated primarily with a more educational focus. Next, the same exercises are solved in the statistical software packages IBM SPSS Statistics Software and Stata Statistical Software. We believe that this logic facilitates the study and understanding of the correct use of each of the techniques and the analysis of the results obtained. In addition to this, the practical application of the models in SPSS and Stata also offers benefits to researchers, because, at any given moment, the results can be compared to the ones already obtained algebraically in the initial sections of each chapter, besides providing an opportunity to use these important software packages. At the end of each chapter, additional exercises are proposed, whose answers, presented through the outputs generated in SPSS, are available at the end of the book.

Chapter 11
Cluster Analysis

Maybe Hamlet is right. We could be bounded in a nutshell, but counting ourselves kings of infinite space.
Stephen Hawking

11.1 INTRODUCTION

Cluster analysis represents a set of very useful exploratory techniques that can be applied whenever we intend to verify the existence of similar behavior between observations (individuals, companies, municipalities, countries, among other examples) in relation to certain variables, and there is the intention of creating groups or clusters, in which an internal homogeneity prevails. In this regard, this set of techniques has as its main objective to allocate observations to a relatively small number of clusters that are internally homogeneous and heterogeneous between themselves, and that represent the joint behavior of the observations from certain variables. That is, the observations of a certain group must be relatively similar to one another, in relation to the variables inserted in the analysis, and significantly different from the observations found in other groups.
Clustering techniques are considered exploratory, or interdependent, since their applications do not have a predictive nature for other observations not initially present in the sample. Moreover, the inclusion of new observations into the dataset makes it necessary to reapply the modeling, so that, possibly, new clusters can be generated. Besides, the inclusion of a new variable can also generate a complete rearrangement of the observations in the groups.
Researchers can choose to develop a cluster analysis when their main goal is to sort and allocate observations to groups and, from then on, to analyze what the ideal number of clusters formed is. Or they can, a priori, define the number of groups they wish to create, based on certain criteria, and verify how the sorting and allocation of observations behave in that specific number of groups. Regardless of the objective, clustering will continue being exploratory. If a researcher aims to use a technique to, in fact, confirm the creation of groups and to make the analysis predictive, he can use techniques such as, for example, discriminant analysis or multinomial logistic regression.
Elaborating a cluster analysis does not require vast knowledge of matrix algebra or statistics, different from techniques such as factor analysis and correspondence analysis. The researcher interested in applying a cluster analysis needs to, starting from the definition of the research objectives, choose a certain distance or similarity measure that will be the basis for the observations to be considered closer to or farther from one another, and a certain agglomeration schedule that will have to be defined between hierarchical and nonhierarchical methods. Therefore, he will be able to analyze, interpret, and compare the outcomes.
It is important to highlight that the outcomes obtained through hierarchical and nonhierarchical agglomeration schedules can be compared and, in this regard, the researcher is free to develop the technique, using one method or another, and to reapply it, if he deems necessary. While hierarchical schedules allow us to identify the sorting and allocation of observations, offering possibilities for researchers to study, assess, and decide the number of clusters formed, in nonhierarchical schedules we start with a known number of clusters and, from then on, we begin allocating the observations to these clusters, with a future evaluation of the representativeness of each variable when creating them. Therefore, the result of one method can serve as input to carry out the other, making the analysis cyclical.
Fig. 11.1 shows the logic from which a cluster analysis can be elaborated.
When choosing the distance or similarity measure and the agglomeration schedule, we must take some aspects into consideration, such as the previously desired number of clusters, defined based on some resource allocation criteria, as well as certain constraints that may lead the researcher to choose a specific solution.


FIG. 11.1 Logic for elaborating a cluster analysis.

According to Bussab et al. (1990), different criteria regarding distance measures and agglomeration schedules may lead to different cluster formations, and the homogeneity desired by the researcher fundamentally depends on the objectives set in the research.
Imagine that a researcher is interested in studying the interdependence between individuals living in a certain municipality based only on two metric variables (age, in years, and average family income, in R$). His main goal is to assess the effectiveness of social programs aimed at providing health care and then, based on these variables, to propose a still unknown number of new programs aimed at homogeneous groups of people. After collecting the data, the researcher constructed a scatter plot, as shown in Fig. 11.2. Based on the chart seen in Fig. 11.2, the researcher identified four clusters and highlighted them in a new chart (Fig. 11.3). From the creation of these clusters, the researcher decided to develop an analysis of the behavior of the observations in each group, or, more precisely, of the existing variability within the clusters and between them, so that he could clearly and consciously base his decision as regards the allocation of individuals to these four new social programs. In order to illustrate this issue, the researcher constructed the chart found in Fig. 11.4.

FIG. 11.2 Scatter plot with individuals’ Income and Age.


FIG. 11.3 Highlighting the creation of four clusters.

FIG. 11.4 Illustrating the variability within the clusters and between them.

Based on this chart, the researcher was able to notice that the groups formed showed a lot of internal homogeneity, with a certain individual being closer to other individuals in the same group than to individuals in other groups. This is the core of cluster analysis. If the number of social programs to be provided for the population (number of clusters) had already been given to the researcher, due to budgetary, legal, or political constraints, we could still use clustering, solely to determine the allocation of the individuals from the municipality to that number of programs (groups).


FIG. 11.5 Rearranging the clusters due to the presence of elderly billionaires.

Having concluded the research and allocated the individuals to the different social, health care programs, the following year, the researcher decided to carry out the same research with individuals from the same municipality. However, in the meantime, a group of elderly billionaires decided to move to that city and, when he constructed the new scatter plot, the researcher realized that those four clusters, clearly formed the previous year, did not exist anymore, since they fused when the billionaires were included. The new scatter plot can be seen in Fig. 11.5.
This new situation exemplifies the importance of always reapplying the cluster analysis whenever new observations (and also new variables) are included, which deprives the technique of any predictive power, as we have already discussed. Moreover, this example shows that, before elaborating any cluster analysis, it is advisable for the researcher to study the data behavior and to check the existence of discrepant observations in relation to certain variables, since the creation of clusters is very sensitive to the presence of outliers.
Excluding or retaining outliers in the dataset, however, will depend on the research objectives and on the type of data the researcher has. If certain observations represent anomalies in terms of variable values, when compared to the other observations, and end up forming small, insignificant, or even individual clusters, they can, in fact, be excluded. On the other hand, if these observations represent one or more relevant groups, even if they are different from the others, they must be considered in the analysis and, whenever the technique is reapplied, they can be separated so that other segmentations can be better structured in new groups, formed with higher internal homogeneity. We would like to emphasize that cluster analysis methods are considered static procedures, since the inclusion of new observations or variables may change the clusters, thus making it mandatory to develop a new analysis.
In this example, we realized that the original variables from which the groups are established are metric, since the clustering started from the study of the distance behavior (dissimilarity measures) between the observations. In some cases, as we will study throughout this chapter, cluster analyses can be elaborated from the similarity behavior (similarity measures) between observations that present binary variables. However, it is common for researchers to use the incorrect arbitrary weighting procedure with qualitative variables, as, for example, variables on the Likert scale, and, from then on, to apply a cluster analysis. This is a major error, since there are exploratory techniques meant exclusively for the study of the behavior of qualitative variables, as, for example, the correspondence analysis.
Historically speaking, even though many distance and similarity measures date back to the end of the 19th century and the beginning of the 20th century, cluster analyses, as a better structured set of techniques, began in the field of Anthropology with Driver and Kroeber (1932), and in Psychology with Zubin (1938a,b) and Tryon (1939), as discussed by Reis (2001) and Fávero et al. (2009).


With the acknowledgment that observation clustering and classification procedures are scientific methods, together with astonishing technological developments, mainly verified after the 1960s, cluster analyses started being used more frequently after the publication of Sokal and Sneath's (1963) relevant work, in which procedures are carried out to compare the biological similarities of organisms with similar characteristics and the respective species. Currently, cluster analysis offers several application possibilities in the fields of consumer behavior, market segmentation, strategy, political science, economics, finance, accounting, actuarial science, engineering, logistics, computer science, education, medicine, biology, genetics, biostatistics, psychology, anthropology, demography, geography, ecology, climatology, geology, archeology, criminology and forensics, among others.
In this chapter, we will discuss cluster analysis techniques, aiming at: (1) introducing the concepts; (2) presenting the step by step of the modeling, in an algebraic and practical way; (3) interpreting the results obtained; and (4) applying the technique in SPSS and in Stata. Following the logic proposed in the book, first, we will present the algebraic solution of an example jointly with the presentation of the concepts. Only after the introduction of the concepts will the procedures for elaborating the techniques in SPSS and Stata be presented.

11.2 CLUSTER ANALYSIS

There are many procedures for elaborating a cluster analysis, since there are different distance or similarity measures for metric or binary variables, respectively. Besides, after defining the distance or similarity measure, the researcher still needs to determine, among several possibilities, the observation clustering method, based on certain hierarchical or nonhierarchical criteria. Therefore, when one wishes to group observations into internally homogeneous clusters, what initially seems trivial can become quite complex, because there are multiple combinations between different distance or similarity measures and clustering methods. Hence, based on the underlying theory and on his research objectives, as well as on his experience and intuition, it is extremely important for the researcher to define the criteria from which the observations will be allocated to each one of the groups.
In the following sections, we will discuss the theoretical development of the technique, along with a practical example. In Sections 11.2.1 and 11.2.2, the concepts of distance and similarity measures and of clustering methods are presented and discussed, respectively, always followed by the algebraic solutions developed from a dataset.

11.2.1 Defining Distance or Similarity Measures in Cluster Analysis

As we have already discussed, the first phase for elaborating a cluster analysis consists in defining the distance (dissimilarity) or similarity measure that will be the basis for each observation to be allocated to a certain group. Distance measures are frequently used when the variables in the dataset are essentially metric, since, the greater the differences between the variable values of two observations the smaller the similarity between them or, in other words, the higher the dissimilarity. On the other hand, similarity measures are often used when the variables are binary, and what most interests us is the frequency of converging answer pairs 1-1 or 0-0 of two observations. In this case, the greater the frequency of converging pairs, the higher the similarity between the observations. An exception to this rule is Pearson’s correlation coefficient between two observations, calculated from metric variables, however, with similarity characteristics, as we will see in the following section. We will study the dissimilarity measures for metric variables in Section 11.2.1.1 and, in Section 11.2.1.2, we will discuss the similarity measures for binary variables.

11.2.1.1 Distance (Dissimilarity) Measures Between Observations for Metric Variables

As a hypothetical situation, imagine that we intend to calculate the distance between two observations i (i = 1, 2) from a dataset that has three metric variables (X1i, X2i, X3i), with values in the same unit of measure. These data can be found in Table 11.1.
It is possible to illustrate the configuration of both observations in a three-dimensional space from these data, since we have exactly three variables. Fig. 11.6 shows the relative position of each observation, emphasizing the distance between them (d12). Distance d12, which is a dissimilarity measure, can be easily calculated by using, for instance, its projection over the horizontal plane formed by axes X1 and X2, called distance d′12, as shown in Fig. 11.7.


TABLE 11.1 Part of a Dataset With Two Observations and Three Metric Variables

Observation i   X1i   X2i   X3i
1               3.7   2.7   9.1
2               7.8   8.0   1.5

FIG. 11.6 Three-dimensional scatter plot for the hypothetical situation with two observations and three variables.

Thus, based on the well-known Pythagorean distance formula for right-angled triangles, we can determine d12 through the following expression:

$$d_{12} = \sqrt{\left(d'_{12}\right)^2 + \left(X_{31} - X_{32}\right)^2} \qquad (11.1)$$

where |X31 − X32| is the distance between the vertical projections (axis X3) of points 1 and 2. However, distance d′12 is unknown to us, so, once again, we need to use the Pythagorean formula, now using the distances of the projections of points 1 and 2 over the other two axes (X1 and X2), as shown in Fig. 11.8. Thus, we can say that:

$$d'_{12} = \sqrt{\left(X_{11} - X_{12}\right)^2 + \left(X_{21} - X_{22}\right)^2} \qquad (11.2)$$

and, substituting (11.2) in (11.1), we have:

$$d_{12} = \sqrt{\left(X_{11} - X_{12}\right)^2 + \left(X_{21} - X_{22}\right)^2 + \left(X_{31} - X_{32}\right)^2} \qquad (11.3)$$

which is the expression of the distance (dissimilarity measure) between points 1 and 2, also known as the Euclidean distance formula.


FIG. 11.7 Three-dimensional chart highlighting the projection of d12 over the horizontal plane.

FIG. 11.8 Projection of the points over the plane formed by X1 and X2, with emphasis on d′12.


Therefore, for the data in our example, we have:

d_{12} = \sqrt{(3.7 - 7.8)^2 + (2.7 - 8.0)^2 + (9.1 - 1.5)^2} = 10.132

whose unit of measure is the same as for the original variables in the dataset. It is important to highlight that, if the variables do not have the same unit of measure, a data standardization procedure will have to be carried out previously, as we will discuss later. We can generalize this problem for a situation in which the dataset has n observations and, for each observation i (i = 1, ..., n), values corresponding to each one of the j (j = 1, ..., k) metric variables X, as shown in Table 11.2. So, Expression (11.4), based on Expression (11.3), presents the general definition of the Euclidian distance between any two observations p and q:

d_{pq} = \sqrt{(X_{1p} - X_{1q})^2 + (X_{2p} - X_{2q})^2 + \dots + (X_{kp} - X_{kq})^2} = \sqrt{\sum_{j=1}^{k} (X_{jp} - X_{jq})^2}    (11.4)
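For readers who wish to check this arithmetic computationally, the following minimal sketch (assuming NumPy is available; the variable names are ours, not part of the original dataset) applies Expression (11.4) to the two observations in Table 11.1:

```python
import numpy as np

# Observations from Table 11.1 (X1, X2, X3)
obs1 = np.array([3.7, 2.7, 9.1])
obs2 = np.array([7.8, 8.0, 1.5])

# Expression (11.4): Euclidean distance between observations p and q
d_12 = np.sqrt(np.sum((obs1 - obs2) ** 2))
print(round(d_12, 3))  # 10.132
```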

Although the Euclidian distance is the most commonly used in cluster analyses, there are other dissimilarity measures that can be used, and the use of each one of them will depend on the researcher's assumptions and objectives. Next, we discuss these other dissimilarity measures:

- Squared Euclidean distance: it can be used instead of the Euclidian distance when the variables show a small dispersion in value, and the use of the squared Euclidian distance makes it easier to interpret the outputs of the analysis and the allocation of the observations to the groups. Its expression is given by:

d_{pq} = (X_{1p} - X_{1q})^2 + (X_{2p} - X_{2q})^2 + \dots + (X_{kp} - X_{kq})^2 = \sum_{j=1}^{k} (X_{jp} - X_{jq})^2    (11.5)

- Minkowski distance: it is the most general dissimilarity measure expression, from which the others derive. It is given by:

d_{pq} = \left[ \sum_{j=1}^{k} \left| X_{jp} - X_{jq} \right|^{m} \right]^{1/m}    (11.6)

where m takes on positive integer values (m = 1, 2, ...). We can see that the Euclidian distance is a particular case of the Minkowski distance, when m = 2.

TABLE 11.2 General Model of a Dataset for Elaborating the Cluster Analysis

Observation i    Variable X1i    Variable X2i    ...    Variable Xki
1                X11             X21             ...    Xk1
2                X12             X22             ...    Xk2
...              ...             ...             ...    ...
p                X1p             X2p             ...    Xkp
...              ...             ...             ...    ...
q                X1q             X2q             ...    Xkq
...              ...             ...             ...    ...
n                X1n             X2n             ...    Xkn


- Manhattan distance: also referred to as the absolute or city block distance, it does not consider the triangular geometry that is inherent to Pythagoras' initial expression and only considers the differences between the values of each variable. Its expression, also a particular case of the Minkowski distance when m = 1, is given by:

d_{pq} = \sum_{j=1}^{k} \left| X_{jp} - X_{jq} \right|    (11.7)

- Chebyshev distance: also referred to as the infinite or maximum distance, it considers, for two observations, only the maximum difference among all the k variables being studied. Its expression is given by:

d_{pq} = \max_{j} \left| X_{jp} - X_{jq} \right|    (11.8)

It is also a particular case of the Minkowski distance, when m → ∞.

- Canberra distance: used for the cases in which the variables only have positive values, it assumes values between 0 and k (the number of variables). Its expression is given by:

d_{pq} = \sum_{j=1}^{k} \frac{\left| X_{jp} - X_{jq} \right|}{X_{jp} + X_{jq}}    (11.9)
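A short sketch of these alternative measures, assuming SciPy is available (scipy.spatial.distance implements each of them under the names indicated in the comments), applied to the same pair of observations from Table 11.1:

```python
import numpy as np
from scipy.spatial import distance

obs1 = np.array([3.7, 2.7, 9.1])
obs2 = np.array([7.8, 8.0, 1.5])

print(distance.sqeuclidean(obs1, obs2))     # squared Euclidean (11.5): 102.66
print(distance.minkowski(obs1, obs2, p=2))  # Minkowski (11.6) with m = 2, i.e., Euclidean: 10.132...
print(distance.cityblock(obs1, obs2))       # Manhattan (11.7): 17.0
print(distance.chebyshev(obs1, obs2))       # Chebyshev (11.8): 7.6
print(distance.canberra(obs1, obs2))        # Canberra (11.9): 1.569 (all values are positive here)
```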

Whenever there are metric variables, the researcher can also use Pearson's correlation, which, even though it is not a dissimilarity measure (in fact, it is a similarity measure), can provide important information when the aim is to group rows of the dataset. Pearson's correlation expression between the values of any two observations p and q, based on Expression (4.11) presented in Chapter 4, can be written as follows:

r_{pq} = \frac{\sum_{j=1}^{k} (X_{jp} - \bar{X}_p)(X_{jq} - \bar{X}_q)}{\sqrt{\sum_{j=1}^{k} (X_{jp} - \bar{X}_p)^2} \cdot \sqrt{\sum_{j=1}^{k} (X_{jq} - \bar{X}_q)^2}}    (11.10)

where X̄p and X̄q represent the means of all the variable values for observations p and q, respectively, that is, the mean of each one of the rows of the dataset. Therefore, we can see that we are dealing with a coefficient of correlation between rows, and not between columns (variables), which is the most common in data analysis. Its values vary between −1 and 1.

Pearson's correlation coefficient can be used as a similarity measure between the rows of the dataset in analyses that include time series, for example, that is, cases in which the observations represent periods. In this case, the researcher may intend to study the correlations between different periods, to investigate, for instance, a possible recurrence of behavior in the same row for the set of variables, which may cause certain periods, not necessarily subsequent ones, to be grouped by similarity of behavior.

Going back to the data presented in Table 11.1, we can calculate the different distance measures between observations 1 and 2, given by Expressions (11.4)–(11.9), as well as the correlational similarity measure, given by Expression (11.10). Table 11.3 shows these calculations and the respective results.

Based on the results shown in Table 11.3, we can see that different measures produce different results, which may cause the observations to be allocated to different homogeneous clusters, depending on which measure was chosen for the analysis, as discussed by Vicini and Souza (2005) and Malhotra (2012). Therefore, it is essential for the researcher to always underpin his or her choice and to bear in mind the reasons for deciding to use a certain measure instead of others. Simply using more than one measure when analyzing the same dataset can support this decision, since, in this case, the results can be compared. This becomes really clear when we include a third observation in the analysis, as shown in Table 11.4. While the Euclidian distance suggests that the most similar observations (the shortest distance) are 2 and 3, when we use the Chebyshev distance, observations 1 and 3 are the most similar. Table 11.5 shows these distances for each pair of observations, highlighting the smallest value of each distance.
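Returning to Expression (11.10): since it is simply Pearson's coefficient computed across a row, it can be obtained directly with NumPy (a minimal sketch, assuming NumPy is available):

```python
import numpy as np

obs1 = np.array([3.7, 2.7, 9.1])
obs2 = np.array([7.8, 8.0, 1.5])

# Expression (11.10): Pearson correlation computed across the rows
# (each observation is centered on its own row mean, 5.167 and 5.767)
r_12 = np.corrcoef(obs1, obs2)[0, 1]
print(round(r_12, 3))  # -0.993, a strong negative correlation between the two rows
```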


TABLE 11.3 Distance and Correlational Similarity Measures Between Observations 1 and 2

Observation i    X1i    X2i    X3i    Mean
1                3.7    2.7    9.1    5.167
2                7.8    8.0    1.5    5.767

Euclidian distance: d12 = sqrt((3.7 − 7.8)² + (2.7 − 8.0)² + (9.1 − 1.5)²) = 10.132

Squared Euclidean distance: d12 = (3.7 − 7.8)² + (2.7 − 8.0)² + (9.1 − 1.5)² = 102.660

Manhattan distance: d12 = |3.7 − 7.8| + |2.7 − 8.0| + |9.1 − 1.5| = 17.000

Chebyshev distance: d12 = |9.1 − 1.5| = 7.600

Canberra distance: d12 = |3.7 − 7.8|/(3.7 + 7.8) + |2.7 − 8.0|/(2.7 + 8.0) + |9.1 − 1.5|/(9.1 + 1.5) = 1.569

Pearson's correlation (similarity):
r12 = [(3.7 − 5.167)·(7.8 − 5.767) + (2.7 − 5.167)·(8.0 − 5.767) + (9.1 − 5.167)·(1.5 − 5.767)] / [sqrt((3.7 − 5.167)² + (2.7 − 5.167)² + (9.1 − 5.167)²) · sqrt((7.8 − 5.767)² + (8.0 − 5.767)² + (1.5 − 5.767)²)] = −0.993

TABLE 11.4 Part of the Dataset With Three Observations and Three Metric Variables

Observation i    X1i    X2i    X3i
1                3.7    2.7    9.1
2                7.8    8.0    1.5
3                8.9    1.0    2.7

TABLE 11.5 Euclidian and Chebyshev Distances Between the Pairs of Observations Seen in Table 11.4

Distance     Pair of Observations 1 and 2    Pair of Observations 1 and 3    Pair of Observations 2 and 3
Euclidian    d12 = 10.132                    d13 = 8.420                     d23 = 7.187 (smallest)
Chebyshev    d12 = 7.600                     d13 = 6.400 (smallest)          d23 = 7.000

Hence, in a certain clustering schedule, and only due to the dissimilarity measure chosen, we would have different initial clusters. Besides deciding which distance measure to choose, the researcher also has to verify whether the data need to be treated previously. So far, in the examples we have already discussed, we were careful to choose metric variables with values in the same unit of measure (as, for example, students' grades in Math, Physics, and Chemistry, which vary from 0 to 10). However, if the variables are measured in different units (as, for example, income in R$, educational level in years of study, and number of children), the intensity of the distances between the observations may be arbitrarily influenced by the variables that present greater magnitude in their values, to the detriment of the others. In these situations, the


researcher must standardize the data, so that the arbitrary nature of the measurement units may be eliminated, making each variable have the same contribution to the distance measure considered.

The Z-scores procedure is the most frequently used method to standardize variables. In it, for each observation i, the value of a new standardized variable ZXj is obtained by subtracting, from the corresponding original value of variable Xj, its mean, and then dividing the result by its standard deviation, as presented in Expression (11.11):

ZX_{ji} = \frac{X_{ji} - \bar{X}_j}{s_j}    (11.11)

where X̄j and sj represent the mean and the standard deviation of variable Xj, respectively. Hence, regardless of the magnitude of the values and of the type of measurement units of the original variables in a dataset, all the respective variables standardized by the Z-scores procedure will have a mean equal to zero and a standard deviation equal to 1, which ensures that possible arbitrary effects of the measurement units on the distance between each pair of observations will be eliminated. In addition, Z-scores have the advantage of not changing the distribution of the original variable. Therefore, if the original variables are in different units, the distance measure Expressions (11.4)–(11.9) must have the terms Xjp and Xjq replaced with ZXjp and ZXjq, respectively. Table 11.6 presents these expressions, based on the standardized variables.

Even though Pearson's correlation is not a dissimilarity measure (in fact, it is a similarity measure), it is important to mention that its use also requires that the variables be standardized by the Z-scores procedure in case they do not have the same measurement units. If the main goal were to group variables, which is the main goal of the following chapter (factor analysis), the standardization of variables through the Z-scores procedure would, in fact, be irrelevant, given that the analysis would consist in assessing the correlation between columns of the dataset. On the other hand, as the objective of this chapter is to group rows of the dataset, which represent the observations, the standardization of the variables is necessary for elaborating an accurate cluster analysis.
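A minimal sketch of the Z-scores procedure, assuming SciPy is available and using made-up values for the income, years-of-study, and number-of-children variables mentioned earlier:

```python
import numpy as np
from scipy.stats import zscore

# Hypothetical dataset with variables in different units (income, years of study, children)
X = np.array([[2500.0, 12, 2],
              [8000.0, 16, 1],
              [1200.0,  8, 3],
              [5500.0, 18, 0]])

# Expression (11.11): ZX = (X - mean) / standard deviation, column by column
ZX = zscore(X, axis=0, ddof=1)   # ddof=1 uses the sample standard deviation
print(ZX.mean(axis=0).round(6))  # each standardized variable now has mean 0
print(ZX.std(axis=0, ddof=1))    # and standard deviation 1
```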

11.2.1.2 Similarity Measures Between Observations for Binary Variables

Now, imagine that we intend to calculate the distance between two observations i (i = 1, 2) coming from a dataset that has seven variables (X1i, ..., X7i), all of them, however, related to the presence or absence of characteristics. In this situation, it is common for the presence or absence of a certain characteristic to be represented by a binary variable, or dummy, which takes on the value 1 in case the characteristic occurs, and 0 otherwise. These data can be found in Table 11.7.

It is important to highlight that the use of binary variables does not generate arbitrary weighting problems resulting from the variable categories, contrary to what would happen if discrete values (1, 2, 3, ...) were assigned to each category of each qualitative variable. In this regard, if a certain qualitative variable has k categories, (k − 1) binary variables will be necessary to represent the presence or absence of each one of the categories. Thus, all the binary variables will be equal to 0 in case the reference category occurs.
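As an illustration of this dummy-coding rule, the sketch below (assuming pandas is available; the variable region and its categories are hypothetical) creates k − 1 = 2 binary variables for a qualitative variable with three categories:

```python
import pandas as pd

# Hypothetical qualitative variable with k = 3 categories
region = pd.Series(["north", "south", "east", "south", "north"], name="region")

# k - 1 = 2 binary (dummy) variables; the dropped category ("east") is the reference,
# represented by 0 in every dummy
dummies = pd.get_dummies(region, prefix="region", drop_first=True).astype(int)
print(dummies)
```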

TABLE 11.6 Distance Measure Expressions With Standardized Variables

Distance Measure (Dissimilarity)    Expression
Euclidian                           d_{pq} = \sqrt{\sum_{j=1}^{k} (ZX_{jp} - ZX_{jq})^2}
Squared Euclidean                   d_{pq} = \sum_{j=1}^{k} (ZX_{jp} - ZX_{jq})^2
Minkowski                           d_{pq} = \left[\sum_{j=1}^{k} |ZX_{jp} - ZX_{jq}|^{m}\right]^{1/m}
Manhattan                           d_{pq} = \sum_{j=1}^{k} |ZX_{jp} - ZX_{jq}|
Chebyshev                           d_{pq} = \max_{j} |ZX_{jp} - ZX_{jq}|
Canberra                            d_{pq} = \sum_{j=1}^{k} |ZX_{jp} - ZX_{jq}| / (ZX_{jp} + ZX_{jq})


TABLE 11.7 Part of the Dataset With Two Observations and Seven Binary Variables

Observation i    X1i    X2i    X3i    X4i    X5i    X6i    X7i
1                0      0      1      1      0      1      1
2                0      1      1      1      1      0      1

Therefore, by using Expression (11.5), we can calculate the squared Euclidean distance between observations 1 and 2, as follows:

d_{12} = \sum_{j=1}^{7} (X_{j1} - X_{j2})^2 = (0-0)^2 + (0-1)^2 + (1-1)^2 + (1-1)^2 + (0-1)^2 + (1-0)^2 + (1-1)^2 = 3,

which represents the total number of variables with answer differences between observations 1 and 2. Therefore, for any two observations p and q, the greater the number of equal answers (0-0 or 1-1), the shorter the squared Euclidean distance between them will be, since:

(X_{jp} - X_{jq})^2 = \begin{cases} 0, & \text{if } X_{jp} = X_{jq} \\ 1, & \text{if } X_{jp} \neq X_{jq} \end{cases}    (11.12)

As discussed by Johnson and Wichern (2007), each term of the distance represented by Expression (11.12) is considered a dissimilarity measure, since the greater the number of answer discrepancies, the greater the squared Euclidean distance. On the other hand, this calculation weights the 0-0 and 1-1 answer pairs equally, without giving higher relative importance to the 1-1 pair, which, in many cases, is a stronger similarity indicator than the 0-0 pair. For example, when we group people, the fact that two of them eat lobster every day is stronger evidence of similarity than the absence of this characteristic for both. Hence, many authors, aiming at defining similarity measures between observations, proposed the use of coefficients that take the similarity of the 1-1 and 0-0 answers into consideration, without these pairs necessarily having the same relative importance. In order for us to be able to present these measures, it is necessary to construct an absolute frequency table of answers 0 and 1 for each pair of observations p and q, as shown in Table 11.8. Next, based on this table, we will discuss the main similarity measures, bearing in mind that the use of each one depends on the researcher's assumptions and objectives.

- Simple matching coefficient (SMC): it is the most frequently used similarity measure for binary variables, and it is discussed and used by Zubin (1938a) and by Sokal and Michener (1958). This coefficient, which gives equal weights to the converging 1-1 and 0-0 answers, has its expression given by:

s_{pq} = \frac{a + d}{a + b + c + d}    (11.13)

TABLE 11.8 Absolute Frequencies of Answers 0 and 1 for Two Observations p and q

                     Observation p = 1    Observation p = 0    Total
Observation q = 1    a                    b                    a + b
Observation q = 0    c                    d                    c + d
Total                a + c                b + d                a + b + c + d

- Jaccard index: even though it was first proposed by Gilbert (1884), it received this name because it was discussed and used in two extremely important papers developed by Jaccard (1901, 1908). This measure, also known as the Jaccard similarity coefficient, does not take the frequency of the 0-0 answer pair into consideration, which is considered irrelevant. However, it is possible to come across a situation in which all the variables are equal to 0 for two observations, that is, there is only frequency in cell d of Table 11.8. In this case, software packages such as Stata present the Jaccard index equal to 1, which makes sense from a similarity standpoint. Its expression is given by:

s_{pq} = \frac{a}{a + b + c}    (11.14)

- Dice similarity coefficient (DSC): although it is only known by this name, it was suggested and discussed by Czekanowski (1932), Dice (1945), and Sørensen (1948). It is similar to the Jaccard index; however, it doubles the weight of the frequency of converging 1-1 answer pairs. Just as in that case, software such as Stata present the Dice coefficient equal to 1 for the cases in which all the variables are equal to 0 for two observations, thus avoiding any uncertainty in the calculation. Its expression is given by:

s_{pq} = \frac{2a}{2a + b + c}    (11.15)

- Anti-Dice similarity coefficient: initially proposed by Sokal and Sneath (1963) and Anderberg (1973), the name anti-Dice comes from the fact that this coefficient doubles the weight of the frequencies of divergent answer pairs (1-0 and 0-1), that is, it doubles the weight of the answer divergences. Just as the Jaccard and the Dice coefficients, the anti-Dice coefficient also ignores the frequency of 0-0 answer pairs. Its expression is given by:

s_{pq} = \frac{a}{a + 2 \cdot (b + c)}    (11.16)

- Russell and Rao similarity coefficient: it is also widely used, and it only favors the similarities of 1-1 answers in the calculation of its coefficient. It was proposed by Russell and Rao (1940), and its expression is given by:

s_{pq} = \frac{a}{a + b + c + d}    (11.17)

- Ochiai similarity coefficient: even though it is known by this name, it was initially proposed by Driver and Kroeber (1932) and, later on, used by Ochiai (1957). This coefficient is undefined when one or both observations being studied present all the variable values equal to 0. However, if both vectors present all the values equal to 0, software such as Stata present the Ochiai coefficient equal to 1. If this happens for only one of the two vectors, the Ochiai coefficient is considered equal to 0. Its expression is given by:

s_{pq} = \frac{a}{\sqrt{(a + b) \cdot (a + c)}}    (11.18)

- Yule similarity coefficient: proposed by Yule (1900) and used by Yule and Kendall (1950), this similarity coefficient for binary variables varies from −1 to 1. As we can see through its expression, the coefficient is undefined if one or both vectors compared present all the values equal to 0 or 1. Software such as Stata generate the Yule coefficient equal to 1 if b = c = 0 (a total convergence of answers), and equal to −1 if a = d = 0 (a total divergence of answers). Its expression is given by:

s_{pq} = \frac{a \cdot d - b \cdot c}{a \cdot d + b \cdot c}    (11.19)

- Rogers and Tanimoto similarity coefficient: this coefficient, which doubles the weight of the discrepant answers 0-1 and 1-0 in relation to the weight of the combinations of converging 1-1 and 0-0 answers, was initially proposed by Rogers and Tanimoto (1960). Its expression, which becomes equal to the anti-Dice coefficient when the frequency of 0-0 answers is equal to 0 (d = 0), is given by:

s_{pq} = \frac{a + d}{a + d + 2 \cdot (b + c)}    (11.20)


- Sneath and Sokal similarity coefficient: different from the Rogers and Tanimoto coefficient, this coefficient, proposed by Sneath and Sokal (1962), doubles the weight of converging 1-1 and 0-0 answers in relation to the other answer combinations (1-0 and 0-1). Its expression, which becomes equal to the Dice coefficient when the frequency of 0-0 answers is equal to 0 (d = 0), is given by:

s_{pq} = \frac{2 \cdot (a + d)}{2 \cdot (a + d) + b + c}    (11.21)

- Hamann similarity coefficient: Hamann (1961) proposed this similarity coefficient for binary variables aiming at having the frequencies of discrepant answers (1-0 and 0-1) subtracted from the total of converging answers (1-1 and 0-0). This coefficient, which varies from −1 (total answer divergence) to 1 (total answer convergence), is equal to two times the simple matching coefficient minus 1. Its expression is given by:

s_{pq} = \frac{(a + d) - (b + c)}{a + b + c + d}    (11.22)

As was done in Section 11.2.1.1 for the dissimilarity measures applied to metric variables, let's go back to the data presented in Table 11.7, aiming at calculating the different similarity measures between observations 1 and 2, which only have binary variables. In order to do that, from that table, we must construct the absolute frequency table of answers 0 and 1 for the observations mentioned (Table 11.9). So, using Expressions (11.13)–(11.22), we are able to calculate the similarity measures themselves. Table 11.10 presents the calculations and the results of each coefficient.

Analogous to what was discussed when the dissimilarity measures were calculated, we can clearly see that different similarity measures generate different results, which may cause, when defining the clustering method, the observations to be allocated to different homogeneous clusters, depending on which measure was chosen for the analysis. Bear in mind that it does not make any sense to apply the Z-scores standardization procedure before calculating the similarity measures discussed in this section, since the variables used for the cluster analysis are binary.

At this moment, it is important to emphasize that, instead of using similarity measures to define the clusters whenever there are binary variables, it is very common to define clusters from the coordinates of each observation, which can be generated when elaborating simple or multiple correspondence analyses, for instance. This is an exploratory technique applied solely to datasets that have qualitative variables, aiming at creating perceptual maps, which are constructed based on the frequency of the categories of each one of the variables in analysis (Fávero and Belfiore, 2017).

After defining the coefficient that will be used, based on the research objectives, on the underlying theory, and on his or her experience and intuition, the researcher must move on to the definition of the agglomeration schedule. The main cluster analysis agglomeration schedules will be studied in the following section.

11.2.2 Agglomeration Schedules in Cluster Analysis

As discussed by Vicini and Souza (2005) and Johnson and Wichern (2007), in cluster analysis, choosing the clustering method, also known as the agglomeration schedule, is as important as defining the distance (or similarity) measure, and this decision must also be made based on what researchers intend to do in terms of their research objectives.

TABLE 11.9 Absolute Frequencies of Answers 0 and 1 for Observations 1 and 2

                     Observation 1 = 1    Observation 1 = 0    Total
Observation 2 = 1    3                    2                    5
Observation 2 = 0    1                    1                    2
Total                4                    3                    7


TABLE 11.10 Similarity Measures Between Observations 1 and 2

Simple matching: s12 = (3 + 1)/7 = 0.571
Jaccard: s12 = 3/6 = 0.500
Dice: s12 = 2·(3)/(2·(3) + 2 + 1) = 0.667
Anti-Dice: s12 = 3/(3 + 2·(2 + 1)) = 0.333
Russell and Rao: s12 = 3/7 = 0.429
Ochiai: s12 = 3/sqrt((3 + 2)·(3 + 1)) = 0.671
Yule: s12 = (3·1 − 2·1)/(3·1 + 2·1) = 0.200
Rogers and Tanimoto: s12 = (3 + 1)/(3 + 1 + 2·(2 + 1)) = 0.400
Sneath and Sokal: s12 = 2·(3 + 1)/(2·(3 + 1) + 2 + 1) = 0.727
Hamann: s12 = ((3 + 1) − (2 + 1))/7 = 0.143
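A minimal sketch, assuming NumPy is available, that obtains the frequencies a, b, c, and d from the two observations in Table 11.7 and evaluates Expressions (11.13)–(11.22):

```python
import numpy as np

obs1 = np.array([0, 0, 1, 1, 0, 1, 1])  # observation 1 from Table 11.7
obs2 = np.array([0, 1, 1, 1, 1, 0, 1])  # observation 2 from Table 11.7

# Frequencies of Tables 11.8 and 11.9
a = np.sum((obs1 == 1) & (obs2 == 1))   # 1-1 pairs -> 3
b = np.sum((obs1 == 0) & (obs2 == 1))   # divergent pairs -> 2
c = np.sum((obs1 == 1) & (obs2 == 0))   # divergent pairs -> 1
d = np.sum((obs1 == 0) & (obs2 == 0))   # 0-0 pairs -> 1

coefficients = {
    "simple matching (11.13)": (a + d) / (a + b + c + d),
    "Jaccard (11.14)":         a / (a + b + c),
    "Dice (11.15)":            2 * a / (2 * a + b + c),
    "anti-Dice (11.16)":       a / (a + 2 * (b + c)),
    "Russell and Rao (11.17)": a / (a + b + c + d),
    "Ochiai (11.18)":          a / np.sqrt((a + b) * (a + c)),
    "Yule (11.19)":            (a * d - b * c) / (a * d + b * c),
    "Rogers-Tanimoto (11.20)": (a + d) / (a + d + 2 * (b + c)),
    "Sneath-Sokal (11.21)":    2 * (a + d) / (2 * (a + d) + b + c),
    "Hamann (11.22)":          ((a + d) - (b + c)) / (a + b + c + d),
}
for name, value in coefficients.items():
    print(f"{name}: {value:.3f}")  # reproduces the values in Table 11.10
```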

Basically, agglomeration schedules can be classified into two types: hierarchical and nonhierarchical. While the former are characterized by favoring a hierarchical structure (step by step) when forming clusters, nonhierarchical schedules use algorithms that maximize the homogeneity within each cluster, without going through a hierarchical process to do so.

Hierarchical agglomeration schedules can be agglomerative or partitioning, depending on how the process starts. If all the observations are initially considered separate and, from their distances (or similarities), groups are formed until we reach a final stage with only one cluster, the process is known as agglomerative. Among all hierarchical agglomeration schedules, the most commonly used are those with the following linkage methods: nearest-neighbor or single-linkage, furthest-neighbor or complete-linkage, and between-groups or average-linkage. On the other hand, if all the observations are initially considered grouped and, stage after stage, smaller groups are formed by the separation of observations, until these subdivisions generate individual groups (that is, totally separated observations), then we have a partitioning process.

Conversely, nonhierarchical agglomeration schedules, among which the most popular is the k-means procedure, refer to processes in which clustering centers are defined, and the observations are allocated to them based on their proximity. Different from hierarchical schedules, in which the researcher can study the several possibilities for allocating observations and even define the ideal number of clusters based on each one of the clustering stages, a nonhierarchical agglomeration schedule requires that we previously stipulate the number of clusters from which the clustering centers will be defined and the observations allocated. That is why we recommend generating a hierarchical agglomeration schedule before constructing a nonhierarchical one whenever there is no reasonable estimate of the number of clusters that can be formed from the observations in the dataset, based on the variables in study. Fig. 11.9 shows the logic of agglomeration schedules in cluster analysis. We will study hierarchical agglomeration schedules in Section 11.2.2.1, and Section 11.2.2.2 will be used to discuss the nonhierarchical k-means agglomeration schedule.

11.2.2.1 Hierarchical Agglomeration Schedules

In this section, we will discuss the main hierarchical agglomeration schedules, in which larger and larger clusters are formed at each clustering stage as new observations or groups are added to them, according to a certain criterion (linkage method) and based on the distance measure chosen. In Section 11.2.2.1.1, the main concepts of these schedules will be presented, and, in Section 11.2.2.1.2, a practical example will be presented and solved algebraically.

11.2.2.1.1 Notation

There are three main linkage methods in hierarchical agglomeration schedules, as shown in Fig. 11.9: the nearest-neighbor or single-linkage, the furthest-neighbor or complete-linkage, and the between-groups or average-linkage. Table 11.11 illustrates the distance to be considered in each clustering stage, based on the linkage method chosen.


FIG. 11.9 Agglomeration schedules in cluster analysis: hierarchical schedules (agglomerative or partitioning), with the linkage methods nearest neighbor (single linkage), furthest neighbor (complete linkage), and between groups (average linkage); and nonhierarchical schedules (k-means).

TABLE 11.11 Distance to be Considered Based on the Linkage Method
(In the illustrations of the original table, observations 1 and 2 form one cluster and observations 3, 4, and 5 form another.)

Linkage Method                                      Distance (Dissimilarity)
Single (nearest-neighbor or single-linkage)         d23
Complete (furthest-neighbor or complete-linkage)    d15
Average (between-groups or average-linkage)         (d13 + d14 + d15 + d23 + d24 + d25)/6

The single-linkage method favors the shortest distances (thus, the nomenclature nearest neighbor) so that new clusters can be formed at each clustering stage through the incorporation of observations or groups. Therefore, applying it is advisable in cases in which the observations are relatively far apart, that is, different, and we would like to form clusters considering a minimum of homogeneity. On the other hand, its analysis may be hampered when there are observations or clusters just a little farther apart from each other, as shown in Fig. 11.10. The complete-linkage method, on the other hand, goes in the opposite direction, that is, it favors the greatest distances between the observations or groups so that new clusters can be formed (hence, the name furthest neighbor) and, in this regard, using it is advisable in cases in which there is no considerable distance between the observations, and the researcher needs to identify the heterogeneities between them. Finally, in the average-linkage method, two groups merge based on the average distance between all the pairs of observations that are in these groups (hence, the name average linkage). Accordingly, even though there are changes in the calculation of the distance measures between the clusters, the average-linkage method ends up preserving the order of the observations in each group, offered by the single-linkage method, in case there is a considerable distance between the observations. The same happens with the sorting solution provided by the complete-linkage method, if the observations are very close to each other.


FIG. 11.10 Single-linkage method—Hampered analysis when there are observations or clusters just a little further apart.

Johnson and Wichern (2007) proposed a logical sequence of steps in order to facilitate the understanding of a cluster analysis elaborated through a certain hierarchical agglomerative method:

1. If n is the number of observations in a dataset, we must start the agglomeration schedule with exactly n individual groups (stage 0), such that we will initially have a distances (or similarities) matrix D0 formed by the distances between each pair of observations.
2. In the first stage, we must choose the smallest distance among all of those that form matrix D0, that is, the one that connects the two most similar observations. At this exact moment, we will no longer have n individual groups; we will have (n − 1) groups, one of them formed by two observations.
3. In the following clustering stage, we must repeat the previous stage. However, we now have to take into consideration the distance between each pair of observations, and between the first group already formed and each one of the other observations, based on the linkage method adopted. In other words, after the first clustering stage, we will have matrix D1, with dimensions (n − 1) × (n − 1), in which one of the rows will be represented by the first grouped pair of observations. Consequently, in the second stage, a new group will be formed either by the grouping of two new observations or by adding a certain observation to the group formed in the first stage.
4. The previous process must be repeated (n − 1) times, until there is only a single group formed by all the observations. In other words, in stage (n − 2) we will have matrix Dn−2, which will only contain the distance between the last two remaining groups, before the final fusion.
5. Finally, from the clustering stages and the distances between the clusters formed, it is possible to develop a tree-shaped diagram that summarizes the clustering process and explains the allocation of each observation to each cluster. This diagram is known as a dendrogram or a phenogram.

Therefore, the values that form the D matrices of each one of the stages will be a function of the distance measure chosen and of the linkage method adopted. In a certain clustering stage s, imagine that a researcher groups two previously formed clusters M and N, containing m and n observations, respectively, so that cluster MN can be formed. Next, he intends to group MN with another cluster W, with w observations. Since we know that, in the hierarchical agglomerative methods, the next cluster to be formed is always defined by the smallest distance between each pair of observations or groups, the agglomeration schedule will be essential in order for the distances that will form each matrix Ds to be analyzed. Using this logic and based on Table 11.11, let's discuss the criterion to calculate the distance between the clusters MN and W, inserted in matrix Ds, based on the linkage method:

- Nearest-neighbor or single-linkage method:

d_{(MN)W} = \min\{d_{MW}; d_{NW}\}    (11.23)

where dMW and dNW are the distances between the closest observations in clusters M and W and in clusters N and W, respectively.

- Furthest-neighbor or complete-linkage method:

d_{(MN)W} = \max\{d_{MW}; d_{NW}\}    (11.24)

where dMW and dNW are the distances between the farthest observations in clusters M and W and in clusters N and W, respectively.


TABLE 11.12 Example: Grades in Math, Physics, and Chemistry on the College Entrance Exam

Student (Observation)    Grade in Mathematics (X1i)    Grade in Physics (X2i)    Grade in Chemistry (X3i)
Gabriela                 3.7                           2.7                       9.1
Luiz Felipe              7.8                           8.0                       1.5
Patricia                 8.9                           1.0                       2.7
Ovidio                   7.0                           1.0                       9.0
Leonor                   3.4                           2.0                       5.0

- Between-groups or average-linkage method:

d_{(MN)W} = \frac{\sum_{p=1}^{m+n} \sum_{q=1}^{w} d_{pq}}{(m + n) \cdot w}    (11.25)

where dpq represents the distance between any observation p in cluster MN and any observation q in cluster W, and m + n and w represent the number of observations in clusters MN and W, respectively.

In the following section, we will present a practical example that will be solved algebraically, and from which the concepts of hierarchical agglomerative methods will be established.

11.2.2.1.2 A Practical Example of Cluster Analysis With Hierarchical Agglomeration Schedules

Imagine that a college professor, who is very concerned about his students' capacity to learn the subject he teaches, Quantitative Methods, is interested in allocating them to groups with the highest homogeneity possible, based on the grades they obtained on the college entrance exams in subjects considered quantitative (Math, Physics, and Chemistry). In order to do that, the professor collected information on these grades, which vary from 0 to 10. In addition, since he will carry out the cluster analysis algebraically first, he decided, for pedagogical purposes, to work with only five students. This dataset can be seen in Table 11.12.

Based on the data obtained, the chart in Fig. 11.11 is constructed, and, since the variables are metric, the dissimilarity measure known as the Euclidian distance will be used for the cluster analysis. Besides, since all the variables have values in the same unit of measure (grades from 0 to 10), in this case, it will not be necessary to standardize them through Z-scores. In the following sections, hierarchical agglomeration schedules based on the Euclidian distance will be elaborated through the three linkage methods being studied.

11.2.2.1.2.1 Nearest-Neighbor or Single-Linkage Method

At this moment, from the data presented in Table 11.12, let's develop a cluster analysis through a hierarchical agglomeration schedule with the single-linkage method. First of all, we define matrix D0, formed by the Euclidian distances (dissimilarities) between each pair of observations.
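A minimal sketch, assuming NumPy and SciPy are available, of how matrix D0 can be computed from the data in Table 11.12 (the smallest off-diagonal entry identifies the first cluster):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

students = ["Gabriela", "Luiz Felipe", "Patricia", "Ovidio", "Leonor"]
grades = np.array([[3.7, 2.7, 9.1],
                   [7.8, 8.0, 1.5],
                   [8.9, 1.0, 2.7],
                   [7.0, 1.0, 9.0],
                   [3.4, 2.0, 5.0]])

# Matrix D0: Euclidean distances between every pair of students
D0 = squareform(pdist(grades, metric="euclidean"))
for name, row in zip(students, D0.round(3)):
    print(f"{name:12s}", row)
# The smallest off-diagonal distance, 3.713, is between Gabriela and Ovidio.
```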


FIG. 11.11 Three-dimensional chart with the relative position of the five students (Math, Physics, and Chemistry axes).

It is important to mention that, at this initial moment, each observation is considered an individual cluster, that is, in stage 0, we have five clusters (the sample size). Highlighted in matrix D0 is the smallest distance between all the observations; therefore, in the first stage, observations Gabriela and Ovidio are grouped and become a new cluster. We must construct matrix D1 so that we can go to the next clustering stage, in which the distances between the cluster Gabriela-Ovidio and the other observations, which are still isolated, are calculated. Thus, by using the single-linkage method and based on Expression (11.23), we have:

d(Gabriela-Ovidio)–Luiz Felipe = min{10.132; 10.290} = 10.132
d(Gabriela-Ovidio)–Patricia = min{8.420; 6.580} = 6.580
d(Gabriela-Ovidio)–Leonor = min{4.170; 5.474} = 4.170

Matrix D1 can be seen:


In the same way, the smallest distance among all of those in matrix D1 is highlighted. Therefore, in the second stage, observation Leonor is inserted into the already formed cluster Gabriela-Ovidio. Observations Luiz Felipe and Patricia still remain isolated. We must construct matrix D2 so that we can take the next step, in which the distances between the cluster Gabriela-Ovidio-Leonor and the two remaining observations are calculated. Analogously, we have:

d(Gabriela-Ovidio-Leonor)–Luiz Felipe = min{10.132; 8.223} = 8.223
d(Gabriela-Ovidio-Leonor)–Patricia = min{6.580; 6.045} = 6.045

Matrix D2 can be written as:

In the third clustering stage, observation Patricia is incorporated into the cluster Gabriela-Ovidio-Leonor, since the corresponding distance is the smallest among all the ones presented in matrix D2. Therefore, we can write matrix D3, which comes next, taking into consideration the following criterion:

d(Gabriela-Ovidio-Leonor-Patricia)–Luiz Felipe = min{8.223; 7.187} = 7.187

Finally, in the fourth and last stage, all the observations are allocated to the same cluster, thus concluding the hierarchical process. Table 11.13 presents a summary of this agglomeration schedule, constructed by using the single-linkage method.

Based on this agglomeration schedule, we can construct a tree-shaped diagram, known as a dendrogram or phenogram, whose main objective is to illustrate the step-by-step formation of the clusters and to facilitate the visualization of how each observation is allocated at each stage. The dendrogram can be seen in Fig. 11.12.

Through Figs. 11.13 and 11.14, we are able to interpret the dendrogram constructed. First of all, we drew three lines (I, II, and III) that are orthogonal to the dendrogram lines, as shown in Fig. 11.13, which allow us to identify the number of clusters in each clustering stage, as well as the observations in each cluster. Therefore, line I "cuts" the dendrogram immediately after the first clustering stage and, at this moment, we can verify that there are four clusters (four intersections with the dendrogram's horizontal lines), one of them formed by observations Gabriela and Ovidio, and the others by the individual observations.


TABLE 11.13 Agglomeration Schedule Through the Single-Linkage Method

Stage    Cluster                            Grouped Observation    Smallest Euclidian Distance
1        Gabriela                           Ovidio                 3.713
2        Gabriela-Ovidio                    Leonor                 4.170
3        Gabriela-Ovidio-Leonor             Patricia               6.045
4        Gabriela-Ovidio-Leonor-Patricia    Luiz Felipe            7.187

FIG. 11.12 Dendrogram—Single-linkage method (Euclidean distance on the horizontal axis; observations ordered Gabriela, Ovidio, Leonor, Patricia, Luiz Felipe).

FIG. 11.13 Interpreting the dendrogram—Number of clusters and allocation of observations.

FIG. 11.14 Interpreting the dendrogram—Distance leaps.


On the other hand, line II intersects three horizontal lines of the dendrogram, which means that, after the second stage, in which observation Leonor was incorporated into the already formed cluster Gabriela-Ovidio, there are three clusters. Finally, line III is drawn immediately after the third stage, in which observation Patricia merges with the cluster Gabriela-Ovidio-Leonor. Since two intersections between this line and the dendrogram's horizontal lines are identified, we can see that observation Luiz Felipe remains isolated, while the others form a single cluster.

Besides providing a study of the number of clusters in each clustering stage and of the allocation of observations, a dendrogram also allows the researcher to analyze the magnitude of the distance leaps needed to establish the clusters. A leap of high magnitude, in comparison to the others, can indicate that a considerably different observation or cluster is being incorporated into already formed clusters, which offers subsidies for establishing a solution regarding the number of clusters without the need for the next clustering stage. Although we know that setting an inflexible, mandatory number of clusters may hamper the analysis, at least having an idea of this number, given the distance measure used and the linkage method adopted, may help researchers better understand the characteristics of the observations that led to this fact. Moreover, since the number of clusters is important for constructing nonhierarchical agglomeration schedules, this piece of information (considered an output of the hierarchical schedule) may serve as input for the k-means procedure.

Fig. 11.14 presents three distance leaps (A, B, and C), corresponding to each one of the clustering stages, and, from their analysis, we can see that leap B, which represents the incorporation of observation Patricia into the already formed cluster Gabriela-Ovidio-Leonor, is the greatest of the three. Therefore, in case we intend to set the ideal number of clusters in this example, the researcher may choose the solution with three clusters (line II in Fig. 11.13), without the stage in which observation Patricia is incorporated, since it possibly has characteristics that are not so homogeneous and that make it unfeasible to include it in the previously formed cluster, given the large distance leap. Thus, in this case, we would have a cluster formed by Gabriela, Ovidio, and Leonor, another one formed only by Patricia, and a third one formed only by Luiz Felipe.

When dissimilarity measures are used in clustering methods, a very useful criterion for identifying the number of clusters consists in identifying a considerable distance leap (whenever possible) and defining the number of clusters formed in the clustering stage immediately before the great leap, since very high leaps may incorporate observations with characteristics that are not so homogeneous. Furthermore, it is also important to mention that, if the distance leaps from one stage to another are small, due to the existence of variables with values that are too close across the observations, which can make it difficult to read the dendrogram, the researcher may use the squared Euclidean distance, so that the leaps become clearer and better explained, making it easier to identify the clusters in the dendrogram and providing better arguments for the decision-making process.
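The same single-linkage agglomeration schedule can be reproduced computationally. The sketch below, assuming SciPy is available (and matplotlib for the optional dendrogram), returns the merge distances of Table 11.13 and the three-cluster solution discussed above:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

grades = np.array([[3.7, 2.7, 9.1],   # Gabriela
                   [7.8, 8.0, 1.5],   # Luiz Felipe
                   [8.9, 1.0, 2.7],   # Patricia
                   [7.0, 1.0, 9.0],   # Ovidio
                   [3.4, 2.0, 5.0]])  # Leonor

# Hierarchical agglomeration schedule, nearest-neighbor (single-linkage) method
Z = linkage(grades, method="single", metric="euclidean")
print(Z[:, 2].round(3))  # merge distances 3.713, 4.170, 6.045, 7.187 (Table 11.13)

# Cutting the tree before the largest distance leap yields the three-cluster solution:
# {Gabriela, Ovidio, Leonor}, {Patricia}, {Luiz Felipe}
print(fcluster(Z, t=3, criterion="maxclust"))

# scipy.cluster.hierarchy.dendrogram(Z, labels=students) would draw a diagram analogous
# to Fig. 11.12 (requires matplotlib)
```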
Software such as SPSS shows dendrograms with rescaled distance measures, in order to facilitate the interpretation of the allocation of each observation and the visualization of the large distance leaps. Fig. 11.15 illustrates how clusters can be established after the single-linkage method is elaborated.

Next, we will develop the same example; however, now, let's use the complete- and average-linkage methods, so that we can compare the order of the observations and the distance leaps.

11.2.2.1.2.2 Furthest-Neighbor or Complete-Linkage Method

Matrix D0, shown here, is obviously the same, and the smallest Euclidian distance, the one highlighted, is between observations Gabriela and Ovidio, which become the first cluster. It is important to emphasize that the first cluster will always be the same, regardless of the linkage method used, since the first stage will always consider the smallest distance between two observations, which are still isolated.


FIG. 11.15 Suggestion of clusters formed after the single-linkage method.

In the complete-linkage method, we must use Expression (11.24) to construct matrix D1, as follows:

d(Gabriela-Ovidio)–Luiz Felipe = max{10.132; 10.290} = 10.290
d(Gabriela-Ovidio)–Patricia = max{8.420; 6.580} = 8.420
d(Gabriela-Ovidio)–Leonor = max{4.170; 5.474} = 5.474

Matrix D1 can be seen and, by analyzing it, we can see that observation Leonor will be incorporated into the cluster formed by Gabriela and Ovidio. Once again, the smallest value among all the ones shown in matrix D1 is highlighted.

As verified when using the single-linkage method, here, observations Luiz Felipe and Patricia also remain isolated at this stage. The differences between the methods start arising now. Therefore, we will construct matrix D2 using the following criteria:

d(Gabriela-Ovidio-Leonor)–Luiz Felipe = max{10.290; 8.223} = 10.290
d(Gabriela-Ovidio-Leonor)–Patricia = max{8.420; 6.045} = 8.420


Matrix D2 can be written as follows:

In the third clustering stage, a new cluster is formed by the fusion of observations Patricia and Luiz Felipe, since the furthest-neighbor criterion adopted in the complete-linkage method makes the distance between these two observations become the smallest among all the ones calculated to construct matrix D2. Therefore, notice that, at this stage, differences in relation to the single-linkage method appear in terms of the sorting and allocation of the observations to groups. Hence, to construct matrix D3, we must take the following criterion into consideration:

d(Gabriela-Ovidio-Leonor)–(Luiz Felipe-Patricia) = max{10.290; 8.420} = 10.290

In the same way, in the fourth and last stage, all the observations are allocated to the same cluster, since there is the clustering between Gabriela-Ovidio-Leonor and Luiz Felipe-Patricia. Table 11.14 shows a summary of this agglomeration schedule, elaborated by using the complete-linkage method. This agglomeration schedule’s dendrogram can be seen in Fig. 11.16. We can initially see that the sorting of the observations is different from what was observed in the dendrogram seen in Fig. 11.12. Analogous to what was carried out in the previous method, we chose to draw two vertical lines (I and II) over the largest distance leap, as shown in Fig. 11.17. Thus, if the researcher chooses to consider three clusters, the solution will be the same as the one achieved previously through the single-linkage method, one formed by Gabriela, Ovidio, and Leonor, another one by Luiz Felipe, and a third one by Patricia (line I in Fig. 11.17). However, if he chooses to define two clusters (line II), the solution will be different since, in this case, the second cluster will be formed by Luiz Felipe and Patricia, while in the previous case, it was formed only by Luiz Felipe, since observation Patricia was allocated to the first cluster.

TABLE 11.14 Agglomeration Schedule Through the Complete-Linkage Method

Stage    Cluster                   Grouped Observation     Smallest Euclidian Distance
1        Gabriela                  Ovidio                  3.713
2        Gabriela-Ovidio           Leonor                  5.474
3        Luiz Felipe               Patricia                7.187
4        Gabriela-Ovidio-Leonor    Luiz Felipe-Patricia    10.290

FIG. 11.16 Dendrogram—Complete-linkage method (Euclidean distance on the horizontal axis; observations ordered Gabriela, Ovidio, Leonor, Luiz Felipe, Patricia).

FIG. 11.17 Interpreting the dendrogram—Clusters and distance leaps.

Similar to what was done in the previous method, Fig. 11.18 illustrates how the clusters can be established after the complete-linkage method is carried out.

The definition of the clustering method can also be supported by the application of the average-linkage method, in which two groups merge based on the average distance between all the pairs of observations that belong to these groups. Therefore, as we have already discussed, if the most suitable method is the single linkage, because there are observations considerably far apart from one another, the sorting and allocation of the observations will be maintained by the average-linkage method. On the other hand, the outputs of this method will show consistency with the solution achieved through the complete-linkage method, as regards the sorting and allocation of the observations, if the observations are very similar in the variables in study. Thus, it is advisable for the researcher to apply the three linkage methods when elaborating a cluster analysis through hierarchical agglomeration schedules. Therefore, let's move on to the average-linkage method.

11.2.2.1.2.3 Between-Groups or Average-Linkage Method

First of all, let's show the Euclidian distance matrix between each pair of observations (matrix D0), once again highlighting the smallest distance between them.


FIG. 11.18 Suggestion of clusters formed after the complete-linkage method.

By using Expression (11.25), we are able to calculate the terms of matrix D1, given that the first cluster Gabriela-Ovidio has already been formed. Thus, we have:

d(Gabriela-Ovidio)–Luiz Felipe = (10.132 + 10.290)/2 = 10.211
d(Gabriela-Ovidio)–Patricia = (8.420 + 6.580)/2 = 7.500
d(Gabriela-Ovidio)–Leonor = (4.170 + 5.474)/2 = 4.822

Matrix D1 can be seen and, through it, we can see that observation Leonor is once again incorporated into the cluster formed by Gabriela and Ovidio. The smallest value among all the ones presented in matrix D1 has also been highlighted.


In order to construct matrix D2, in which the distances between the cluster Gabriela-Ovidio-Leonor and the two remaining observations are calculated, we must perform the following calculations:

d(Gabriela-Ovidio-Leonor)–Luiz Felipe = (10.132 + 10.290 + 8.223)/3 = 9.548
d(Gabriela-Ovidio-Leonor)–Patricia = (8.420 + 6.580 + 6.045)/3 = 7.015

Note that the distances used to calculate the dissimilarities to be inserted into matrix D2 are the original Euclidian distances between each pair of observations, that is, they come from matrix D0. Matrix D2 can be seen:

As verified when the single-linkage method was elaborated, here, observation Patricia is also incorporated into the cluster already formed by Gabriela, Ovidio, and Leonor, and observation Luiz Felipe remains isolated. Finally, matrix D3 can be constructed from the following calculation:

d(Gabriela-Ovidio-Leonor-Patricia)–Luiz Felipe = (10.132 + 10.290 + 8.223 + 7.187)/4 = 8.958

Once again, in the fourth and last stage, all the observations are in the same cluster. Table 11.15 and Fig. 11.19 present a summary of this agglomeration schedule and the corresponding dendrogram, respectively, resulting from this average-linkage method.

TABLE 11.15 Agglomeration Schedule Through the Average-Linkage Method

Stage    Cluster                            Grouped Observation    Smallest Euclidian Distance
1        Gabriela                           Ovidio                 3.713
2        Gabriela-Ovidio                    Leonor                 4.822
3        Gabriela-Ovidio-Leonor             Patricia               7.015
4        Gabriela-Ovidio-Leonor-Patricia    Luiz Felipe            8.958
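Assuming SciPy is available, the sketch below reproduces the merge distances of the three linkage methods side by side, which is in line with the recommendation to apply and compare all of them:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

grades = np.array([[3.7, 2.7, 9.1],   # Gabriela
                   [7.8, 8.0, 1.5],   # Luiz Felipe
                   [8.9, 1.0, 2.7],   # Patricia
                   [7.0, 1.0, 9.0],   # Ovidio
                   [3.4, 2.0, 5.0]])  # Leonor

# Merge distances of each hierarchical agglomeration schedule
for method in ("single", "complete", "average"):
    Z = linkage(grades, method=method, metric="euclidean")
    print(method, Z[:, 2].round(3))
# single   -> 3.713, 4.170, 6.045, 7.187  (Table 11.13)
# complete -> 3.713, 5.474, 7.187, 10.290 (Table 11.14)
# average  -> 3.713, 4.822, 7.015, 8.958  (Table 11.15)
```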


FIG. 11.19 Dendrogram—Average-linkage method (Euclidean distance on the horizontal axis; observations ordered Gabriela, Ovidio, Leonor, Patricia, Luiz Felipe).

Despite having different distance values, we can see that Table 11.15 and Fig. 11.19 show the same sorting and the same allocation of observations in the clusters as those presented in Table 11.13 and Fig. 11.12, respectively, obtained when the single-linkage method was elaborated. Hence, we can state that the observations are significantly different from one another with respect to the variables studied, a fact proven by the consistency of the answers obtained from the single- and average-linkage methods. If the observations were more similar, which is not what is observed in the diagram seen in Fig. 11.11, the consistency of answers would occur between the complete- and average-linkage methods, as already discussed. Therefore, when possible, the initial construction of scatter plots may help researchers, even if in a preliminary way, choose the method to be adopted.

Hierarchical agglomeration schedules are very useful and offer us the possibility to analyze, in an exploratory way, the similarity between observations based on the behavior of certain variables. However, it is essential for researchers to understand that these methods are not conclusive by themselves, and more than one answer may be obtained, depending on what is desired and on the data behavior. Besides, it is necessary for researchers to be aware of how sensitive these methods are to the presence of outliers. The existence of a very discrepant observation may cause other observations, not so similar to one another, to be allocated to the same cluster simply because they are extremely different from the observation considered an outlier. Hence, it is advisable to apply the hierarchical agglomeration schedules with the linkage method chosen several times and, in each application, to identify one or more observations considered outliers. This procedure will make the cluster analysis more reliable, since more and more homogeneous clusters may be formed. Researchers are free to characterize the most discrepant observation as the one that ends up isolated until the penultimate clustering stage, that is, the one that only merges right before the total fusion. Nonetheless, there are many methods to define an outlier. Barnett and Lewis (1994), for instance, mention almost 1000 articles in the existing literature on outliers, and, for pedagogical purposes, in the Appendix of this chapter, we will discuss an efficient procedure in Stata for detecting outliers when a researcher is carrying out a multivariate data analysis.

It is also important to emphasize, as we have already discussed in this section, that different linkage methods, when elaborating hierarchical agglomeration schedules, must be applied to the same dataset, and the resulting dendrograms compared. This procedure will help researchers in their decision-making processes with regard to choosing the ideal number of clusters, as well as to sorting the observations and allocating each one of them to the different clusters formed. This will even allow researchers to make coherent decisions about the number of clusters that may be considered input in a possible nonhierarchical analysis.

Last but not least, it is worth mentioning that the agglomeration schedules presented in this section (Tables 11.13, 11.14, and 11.15) provide increasing values of the clustering measures because a dissimilarity measure (the Euclidian distance) was used as a comparison criterion between the observations.
If we had chosen Pearson’s correlation between the observations, a similarity measure also used for metric variables, as we discussed in Section 11.2.1.1, the values of the clustering measures in the agglomeration schedules would be decreasing. The latter is also true for cluster analyses in which similarity measures are used, as the ones studied in Section 11.2.1.2, to assess the behavior of observations based on binary variables. In the following section we will develop the same example, in an algebraic way, using the nonhierarchical k-means agglomeration schedule.

11.2.2.2 Nonhierarchical K-Means Agglomeration Schedule

Among all the nonhierarchical agglomeration schedules, the k-means procedure is the one most often used by researchers in several fields of knowledge.


Given that the number of clusters must be previously defined by the researcher, this procedure can be elaborated after the application of a hierarchical agglomeration schedule when we have no idea of the number of clusters that can be formed; in this situation, the output obtained from the hierarchical procedure can serve as input for the nonhierarchical one.

11.2.2.2.1 Notation

As with the sequence developed in Section 11.2.2.1.1, we now present a logical sequence of steps, based on Johnson and Wichern (2007), in order to facilitate the understanding of the cluster analysis with the k-means procedure:

1. We define the initial number of clusters and the respective centroids. The main objective is to divide the observations of the dataset into K clusters, such that those within each cluster are closer to one another than to any observation that belongs to a different cluster. To start, the observations are allocated arbitrarily to the K clusters, so that the respective centroids can be calculated.
2. We choose a certain observation that is closer to the centroid of another cluster and reallocate it to that cluster. At this moment, another cluster has just lost that observation, and, therefore, the centroids of the cluster that receives it and of the cluster that loses it must be recalculated.
3. We keep repeating the previous step until it is no longer possible to reallocate any observation due to its closer proximity to the centroid of another cluster.

Each centroid coordinate \bar{x} must be recalculated whenever a certain observation p is included in or excluded from the respective cluster, based on the following expressions:

\bar{x}_{new} = \frac{N \cdot \bar{x} + x_p}{N + 1}, if observation p is inserted into the cluster under analysis   (11.26)

\bar{x}_{new} = \frac{N \cdot \bar{x} - x_p}{N - 1}, if observation p is excluded from the cluster under analysis   (11.27)

where N and \bar{x} refer to the number of observations in the cluster and to its centroid coordinate, respectively, before the reallocation of that observation, and x_p refers to the coordinate of observation p, which changed clusters.

For two variables (X1 and X2), Fig. 11.20 shows a hypothetical situation that represents the end of the k-means procedure, in which it is no longer possible to reallocate any observation because there are no closer proximities to centroids of other clusters.

FIG. 11.20 Hypothetical situation that represents the end of the K-means procedure.

Unlike the hierarchical agglomeration schedules, the matrix with the distances between observations does not need to be defined at each step, which reduces the requirements in terms of computational capabilities and allows nonhierarchical agglomeration schedules to be applied to considerably larger datasets than those traditionally studied through hierarchical schedules.
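Although the chapter carries out these steps by hand and, later, in SPSS and Stata, the logic of steps 1-3 can also be sketched in a few lines of code. The fragment below is a minimal illustration in Python, not the book's own implementation: the function name, the guard that keeps singleton clusters from being emptied, and the stopping rule after a fixed number of passes are our assumptions.

```python
import numpy as np

def kmeans_exchange(points, labels, k, max_passes=100):
    """Sequential reallocation of observations among K clusters, updating the
    centroids with Expressions (11.26) and (11.27) after every move."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels, dtype=int).copy()
    centroids = np.array([points[labels == c].mean(axis=0) for c in range(k)])
    sizes = np.array([(labels == c).sum() for c in range(k)])

    for _ in range(max_passes):
        moved = False
        for p, x_p in enumerate(points):
            current = labels[p]
            if sizes[current] == 1:          # our guard: never empty a cluster
                continue
            # centroid of each cluster as if observation p belonged to it
            candidates = centroids.copy()
            for c in range(k):
                if c != current:             # Expression (11.26): simulated insertion
                    candidates[c] = (sizes[c] * centroids[c] + x_p) / (sizes[c] + 1)
            target = int(np.argmin(np.linalg.norm(candidates - x_p, axis=1)))
            if target != current:
                # Expression (11.27): the losing cluster's centroid without p
                centroids[current] = (sizes[current] * centroids[current] - x_p) / (sizes[current] - 1)
                centroids[target] = candidates[target]
                sizes[current] -= 1
                sizes[target] += 1
                labels[p] = target
                moved = True
        if not moved:                        # step 3: stop when no observation moves
            break
    return labels, centroids
```

Applied to the grades of Table 11.16 with the arbitrary starting allocation that will be adopted in Table 11.17, this sketch converges to the same three clusters obtained algebraically in the next section; as the chapter stresses, however, the path taken by the algorithm depends on the initial allocation chosen.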


In addition, bear in mind that the variables must be standardized before elaborating the k-means procedure (as in the hierarchical agglomeration schedules) whenever their values are not in the same unit of measure. Finally, after concluding this procedure, it is important for researchers to analyze whether the values of a certain metric variable differ between the groups defined, that is, whether the variability between the clusters is significantly higher than the internal variability of each cluster. The F-test of the one-way analysis of variance, or one-way ANOVA, allows us to develop this analysis, and its null and alternative hypotheses can be defined as follows:

H0: the variable under analysis has the same mean in all the groups formed.
H1: the variable under analysis has a different mean in at least one of the groups in relation to the others.

Therefore, a single F-test can be applied for each variable, aiming to assess the existence of at least one difference among all the comparison possibilities; the main advantage of applying it is that adjustments for the discrepant sizes of the groups do not need to be carried out in order to analyze several comparisons. On the other hand, rejecting the null hypothesis at a certain significance level does not tell the researcher which group(s) is(are) statistically different from the others in relation to the variable being analyzed. The F statistic corresponding to this test is given by the following expression:

F = \frac{\text{variability between the groups}}{\text{variability within the groups}} = \frac{\left[\sum_{k=1}^{K} N_k \left(\bar{X}_k - \bar{X}\right)^2\right] / (K - 1)}{\left[\sum_{k=1}^{K}\sum_{i} \left(X_{ki} - \bar{X}_k\right)^2\right] / (n - K)}   (11.28)

where N_k is the number of observations in the k-th cluster, \bar{X}_k is the mean of variable X in that same k-th cluster, \bar{X} is the general mean of variable X, and X_{ki} is the value that variable X takes on for a certain observation i present in the k-th cluster. In addition, K represents the number of clusters to be compared, and n the sample size. By using the F statistic, researchers are able to identify the variables whose means differ most between the groups, that is, those that most contribute to the formation of at least one of the K clusters (highest F statistic), as well as those that do not contribute to the formation of the suggested number of clusters, at a certain significance level.

In the following section, we will discuss a practical example that will be solved algebraically, and from which the concepts of the k-means procedure may be established.

11.2.2.2.2 A Practical Example of a Cluster Analysis With the Nonhierarchical K-Means Agglomeration Schedule

To solve the nonhierarchical k-means agglomeration schedule algebraically, let's use the data from our own example, which come from Table 11.12 and are shown in Table 11.16. Software packages such as SPSS use the Euclidian distance as the standard dissimilarity measure, which is why we will develop the algebraic procedures based on this measure. This criterion will also allow the results obtained to be compared to the ones found when elaborating the hierarchical agglomeration schedules in Section 11.2.2.1.2, since, in those situations, the Euclidian distance was also used. In the same way, it will not be necessary to standardize the variables through Z-scores, since all of them are in the same unit of measure (grades from 0 to 10). Otherwise, it would be crucial for researchers to standardize the variables before elaborating the k-means procedure.

TABLE 11.16 Example: Grades in Math, Physics, and Chemistry on the College Entrance Exams

Student (Observation)   Grade in Mathematics (X1i)   Grade in Physics (X2i)   Grade in Chemistry (X3i)
Gabriela                3.7                          2.7                      9.1
Luiz Felipe             7.8                          8.0                      1.5
Patricia                8.9                          1.0                      2.7
Ovidio                  7.0                          1.0                      9.0
Leonor                  3.4                          2.0                      5.0


TABLE 11.17 Arbitrary Allocation of the Observations in K = 3 Clusters and Calculation of the Centroid Coordinates—Initial Step of the K-Means Procedure

Centroid Coordinates per Variable
Cluster                   Grade in Mathematics      Grade in Physics         Grade in Chemistry
Gabriela, Luiz Felipe     (3.7 + 7.8)/2 = 5.75      (2.7 + 8.0)/2 = 5.35     (9.1 + 1.5)/2 = 5.30
Patricia, Ovidio          (8.9 + 7.0)/2 = 7.95      (1.0 + 1.0)/2 = 1.00     (2.7 + 9.0)/2 = 5.85
Leonor                    3.40                      2.00                     5.00

Using the logical sequence presented in Section 11.2.2.2.1, we will develop the k-means procedure with K = 3 clusters. This number of clusters may come from a decision made by the researcher based on a certain preliminary criterion, or it may be chosen based on the outputs of the hierarchical agglomeration schedules. In our case, the decision was made based on the comparison of the dendrograms that had already been constructed, and on the similarity of the outputs obtained by the single- and average-linkage methods. Thus, we need to arbitrarily allocate the observations to three clusters, so that the respective centroids can be calculated. Therefore, we can establish that observations Gabriela and Luiz Felipe form the first cluster, Patricia and Ovidio the second, and Leonor the third. Table 11.17 shows the arbitrary formation of these preliminary clusters, as well as the calculation of the respective centroid coordinates, which constitutes the initial step of the k-means procedure algorithm. Based on these coordinates, we constructed the chart seen in Fig. 11.21, which shows the arbitrary allocation of each observation to its cluster and the respective centroids.

Based on the second step of the logical sequence presented in Section 11.2.2.2.1, we must choose a certain observation and calculate the distance between it and all the cluster centroids, assuming that it is or is not reallocated to each cluster. Selecting the first observation (Gabriela), for example, we can calculate the distances between it and the centroids of the clusters that have already been formed (Gabriela-Luiz Felipe, Patricia-Ovidio, and Leonor) and, after that, assume that it leaves its cluster (Gabriela-Luiz Felipe) and is inserted into one of the other two clusters, forming the cluster Gabriela-Patricia-Ovidio or Gabriela-Leonor. Thus, from Expressions (11.26) and (11.27), we must recalculate the new centroid coordinates, simulating that, in fact, the reallocation of Gabriela to one of the two clusters takes place, as shown in Table 11.18. From Tables 11.16, 11.17, and 11.18, we can then calculate the following Euclidian distances:

Assumption that Gabriela is not reallocated:
d_{Gabriela-(Gabriela-Luiz Felipe)} = \sqrt{(3.70 - 5.75)^2 + (2.70 - 5.35)^2 + (9.10 - 5.30)^2} = 5.066
d_{Gabriela-(Patricia-Ovidio)} = \sqrt{(3.70 - 7.95)^2 + (2.70 - 1.00)^2 + (9.10 - 5.85)^2} = 5.614
d_{Gabriela-Leonor} = \sqrt{(3.70 - 3.40)^2 + (2.70 - 2.00)^2 + (9.10 - 5.00)^2} = 4.170

Assumption that Gabriela is reallocated:
d_{Gabriela-Luiz Felipe} = \sqrt{(3.70 - 7.80)^2 + (2.70 - 8.00)^2 + (9.10 - 1.50)^2} = 10.132
d_{Gabriela-(Gabriela-Patricia-Ovidio)} = \sqrt{(3.70 - 6.53)^2 + (2.70 - 1.57)^2 + (9.10 - 6.93)^2} = 3.743
d_{Gabriela-(Gabriela-Leonor)} = \sqrt{(3.70 - 3.55)^2 + (2.70 - 2.35)^2 + (9.10 - 7.05)^2} = 2.085
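As a quick numerical check of these values, the short snippet below recomputes the six distances with NumPy. The arrays simply reproduce Gabriela's grades (Table 11.16) and the centroid coordinates of Tables 11.17 and 11.18; small differences in the last decimal place for the Gabriela-Patricia-Ovidio centroid come from rounding its coordinates to two decimal places.

```python
import numpy as np

gabriela = np.array([3.7, 2.7, 9.1])
centroids = {
    "Gabriela-Luiz Felipe":         [5.75, 5.35, 5.30],   # Table 11.17
    "Patricia-Ovidio":              [7.95, 1.00, 5.85],
    "Leonor":                       [3.40, 2.00, 5.00],
    "Luiz Felipe (excl. Gabriela)": [7.80, 8.00, 1.50],   # Table 11.18
    "Gabriela-Patricia-Ovidio":     [6.53, 1.57, 6.93],
    "Gabriela-Leonor":              [3.55, 2.35, 7.05],
}
for name, coord in centroids.items():
    print(f"{name:30s} {np.linalg.norm(gabriela - np.array(coord)):.3f}")
# The smallest value (2.085, to Gabriela-Leonor) confirms the reallocation decision below.
```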


FIG. 11.21 Arbitrary allocation of the observations in K = 3 clusters and respective centroids—Initial step of the K-means procedure.

TABLE 11.18 Simulating the Reallocation of Gabriela and Calculating the New Centroid Coordinates

Centroid Coordinates per Variable
Cluster                       Simulation           Grade in Mathematics               Grade in Physics                   Grade in Chemistry
Luiz Felipe                   Excluding Gabriela   (2 x 5.75 - 3.70)/(2 - 1) = 7.80   (2 x 5.35 - 2.70)/(2 - 1) = 8.00   (2 x 5.30 - 9.10)/(2 - 1) = 1.50
Patricia, Ovidio, Gabriela    Including Gabriela   (2 x 7.95 + 3.70)/(2 + 1) = 6.53   (2 x 1.00 + 2.70)/(2 + 1) = 1.57   (2 x 5.85 + 9.10)/(2 + 1) = 6.93
Gabriela, Leonor              Including Gabriela   (1 x 3.40 + 3.70)/(1 + 1) = 3.55   (1 x 2.00 + 2.70)/(1 + 1) = 2.35   (1 x 5.00 + 9.10)/(1 + 1) = 7.05

Obs.: Note that the values calculated for the Luiz Felipe centroid coordinates are exactly the same as this observation's original coordinates, as shown in Table 11.16.

Since Gabriela is the closest to the Gabriela-Leonor centroid (the shortest Euclidian distance), we must reallocate this observation to the cluster initially formed only by Leonor. So, the cluster in which observation Gabriela was at first (Gabriela-Luiz Felipe) has just lost it, and now Luiz Felipe has become an individual cluster. Therefore, the centroids of the cluster that receives it and the one that loses it must be recalculated. Table 11.19 shows the creation of the new clusters and the calculation of the respective centroid coordinates too.


TABLE 11.19 New Centroids With the Reallocation of Gabriela

Centroid Coordinates per Variable
Cluster             Grade in Mathematics      Grade in Physics         Grade in Chemistry
Luiz Felipe         7.80                      8.00                     1.50
Patricia, Ovidio    7.95                      1.00                     5.85
Gabriela, Leonor    (3.7 + 3.4)/2 = 3.55      (2.7 + 2.0)/2 = 2.35     (9.1 + 5.0)/2 = 7.05

Based on these new coordinates, we can construct the chart shown in Fig. 11.22. Once again, let's repeat the previous step. At this moment, since observation Luiz Felipe is isolated, let's simulate the reallocation of the third observation (Patricia). We must calculate the distances between it and the centroids of the clusters that have already been formed (Luiz Felipe, Patricia-Ovidio, and Gabriela-Leonor) and, afterwards, assume that it leaves its cluster (Patricia-Ovidio) and is inserted into one of the other two clusters, forming the cluster Luiz Felipe-Patricia or Gabriela-Patricia-Leonor. Also based on Expressions (11.26) and (11.27), we must recalculate the new centroid coordinates, simulating that, in fact, the reallocation of Patricia to one of these two clusters happens, as shown in Table 11.20. Similar to what was carried out when simulating Gabriela's reallocation, based on Tables 11.16, 11.19, and 11.20, let's calculate the Euclidian distances between Patricia and each one of the centroids:

FIG. 11.22 New clusters and respective centroids—Reallocation of Gabriela.


TABLE 11.20 Simulation of Patricia's Reallocation—Next Step of the K-Means Procedure Algorithm

Centroid Coordinates per Variable
Cluster                       Simulation           Grade in Mathematics               Grade in Physics                   Grade in Chemistry
Luiz Felipe, Patricia         Including Patricia   (1 x 7.80 + 8.90)/(1 + 1) = 8.35   (1 x 8.00 + 1.00)/(1 + 1) = 4.50   (1 x 1.50 + 2.70)/(1 + 1) = 2.10
Ovidio                        Excluding Patricia   (2 x 7.95 - 8.90)/(2 - 1) = 7.00   (2 x 1.00 - 1.00)/(2 - 1) = 1.00   (2 x 5.85 - 2.70)/(2 - 1) = 9.00
Gabriela, Patricia, Leonor    Including Patricia   (2 x 3.55 + 8.90)/(2 + 1) = 5.33   (2 x 2.35 + 1.00)/(2 + 1) = 1.90   (2 x 7.05 + 2.70)/(2 + 1) = 5.60

Obs.: Note that the values calculated for the Ovidio centroid coordinates are exactly the same as this observation's original coordinates, as shown in Table 11.16.

Assumption that Patricia is not reallocated:
d_{Patricia-Luiz Felipe} = \sqrt{(8.90 - 7.80)^2 + (1.00 - 8.00)^2 + (2.70 - 1.50)^2} = 7.187
d_{Patricia-(Patricia-Ovidio)} = \sqrt{(8.90 - 7.95)^2 + (1.00 - 1.00)^2 + (2.70 - 5.85)^2} = 3.290
d_{Patricia-(Gabriela-Leonor)} = \sqrt{(8.90 - 3.55)^2 + (1.00 - 2.35)^2 + (2.70 - 7.05)^2} = 7.026

Assumption that Patricia is reallocated:
d_{Patricia-(Luiz Felipe-Patricia)} = \sqrt{(8.90 - 8.35)^2 + (1.00 - 4.50)^2 + (2.70 - 2.10)^2} = 3.593
d_{Patricia-Ovidio} = \sqrt{(8.90 - 7.00)^2 + (1.00 - 1.00)^2 + (2.70 - 9.00)^2} = 6.580
d_{Patricia-(Gabriela-Patricia-Leonor)} = \sqrt{(8.90 - 5.33)^2 + (1.00 - 1.90)^2 + (2.70 - 5.60)^2} = 4.684

Bearing in mind that the Euclidian distance between Patricia and the centroid of its own cluster (Patricia-Ovidio) is the shortest, there is no need to reallocate this observation to another cluster, and, at this moment, we maintain the solution presented in Table 11.19 and in Fig. 11.22. Next, we will develop the same procedure, however, simulating the reallocation of the fourth observation (Ovidio). Analogously, we must calculate the distances between this observation and the centroids of the clusters that have already been formed (Luiz Felipe, Patricia-Ovidio, and Gabriela-Leonor) and, after that, assume that it leaves its cluster (Patricia-Ovidio) and is inserted into one of the other two clusters, forming the cluster Luiz Felipe-Ovidio or Gabriela-Ovidio-Leonor. Once again by using Expressions (11.26) and (11.27), we can recalculate the new centroid coordinates, simulating that, in fact, the reallocation of Ovidio to one of these two clusters takes place, as shown in Table 11.21. After Table 11.21, we present the calculations of the Euclidian distances between Ovidio and each one of the centroids, defined from Tables 11.16, 11.19, and 11.21.

TABLE 11.21 Simulating Ovidio's Reallocation—New Step of the K-Means Procedure Algorithm

Centroid Coordinates per Variable
Cluster                       Simulation         Grade in Mathematics               Grade in Physics                   Grade in Chemistry
Luiz Felipe, Ovidio           Including Ovidio   (1 x 7.80 + 7.00)/(1 + 1) = 7.40   (1 x 8.00 + 1.00)/(1 + 1) = 4.50   (1 x 1.50 + 9.00)/(1 + 1) = 5.25
Patricia                      Excluding Ovidio   (2 x 7.95 - 7.00)/(2 - 1) = 8.90   (2 x 1.00 - 1.00)/(2 - 1) = 1.00   (2 x 5.85 - 9.00)/(2 - 1) = 2.70
Gabriela, Ovidio, Leonor      Including Ovidio   (2 x 3.55 + 7.00)/(2 + 1) = 4.70   (2 x 2.35 + 1.00)/(2 + 1) = 1.90   (2 x 7.05 + 9.00)/(2 + 1) = 7.70

Obs.: Note that the values calculated for the Patricia centroid coordinates are exactly the same as this observation's original coordinates, as shown in Table 11.16.

Assumption that Ovidio is not reallocated:
d_{Ovidio-Luiz Felipe} = \sqrt{(7.00 - 7.80)^2 + (1.00 - 8.00)^2 + (9.00 - 1.50)^2} = 10.290
d_{Ovidio-(Patricia-Ovidio)} = \sqrt{(7.00 - 7.95)^2 + (1.00 - 1.00)^2 + (9.00 - 5.85)^2} = 3.290
d_{Ovidio-(Gabriela-Leonor)} = \sqrt{(7.00 - 3.55)^2 + (1.00 - 2.35)^2 + (9.00 - 7.05)^2} = 4.187

Assumption that Ovidio is reallocated:
d_{Ovidio-(Luiz Felipe-Ovidio)} = \sqrt{(7.00 - 7.40)^2 + (1.00 - 4.50)^2 + (9.00 - 5.25)^2} = 5.145
d_{Ovidio-Patricia} = \sqrt{(7.00 - 8.90)^2 + (1.00 - 1.00)^2 + (9.00 - 2.70)^2} = 6.580
d_{Ovidio-(Gabriela-Ovidio-Leonor)} = \sqrt{(7.00 - 4.70)^2 + (1.00 - 1.90)^2 + (9.00 - 7.70)^2} = 2.791

In this case, since observation Ovidio is the closest to the centroid of Gabriela-Ovidio-Leonor (the shortest Euclidian distance), we must reallocate this observation to the cluster originally formed by Gabriela and Leonor. Therefore, observation Patricia becomes an individual cluster. Table 11.22 shows the centroid coordinates of clusters Luiz Felipe, Patricia, and Gabriela-Ovidio-Leonor. We will not carry out the proposed procedure for the fifth observation (Leonor), since it had already merged with observation Gabriela in the first reallocation step of the algorithm. We can consider that the k-means procedure is concluded, since it is no longer possible to reallocate any observation due to closer proximity to another cluster's centroid.

TABLE 11.22 New Centroids With Ovidio's Reallocation

Centroid Coordinates per Variable
Cluster                       Grade in Mathematics   Grade in Physics   Grade in Chemistry
Luiz Felipe                   7.80                   8.00               1.50
Patricia                      8.90                   1.00               2.70
Gabriela, Ovidio, Leonor      4.70                   1.90               7.70


FIG. 11.23 Solution of the K-means procedure.

Fig. 11.23 shows the allocation of each observation to its cluster and the respective centroids. Note that the solution achieved is equal to the one reached through the single- (Fig. 11.15) and average-linkage methods, when we elaborated the hierarchical agglomeration schedules. As we have already discussed, the matrix with the distances between the observations does not need to be defined at each step of the k-means procedure algorithm, unlike the hierarchical agglomeration schedules, which reduces the requirements in terms of computational capabilities and allows nonhierarchical agglomeration schedules to be applied to datasets significantly larger than the ones traditionally studied through hierarchical schedules. Table 11.23 shows the Euclidian distances between each observation of the original dataset and the centroids of each one of the clusters formed.

TABLE 11.23 Euclidian Distances Between Observations and Cluster Centroids

                         Cluster
Student (Observation)    Luiz Felipe   Patricia   Gabriela, Ovidio, Leonor
Gabriela                 10.132        8.420      1.897
Luiz Felipe              0.000         7.187      9.234
Patricia                 7.187         0.000      6.592
Ovidio                   10.290        6.580      2.791
Leonor                   8.223         6.045      2.998
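For readers who prefer to check this final solution computationally, the sketch below uses scikit-learn's KMeans, assuming that library is available. Note that scikit-learn implements Lloyd's algorithm with several random starts, not the sequential exchange procedure described in this section, so the agreement concerns the final partition, not the intermediate steps; the student list and the array are simply Table 11.16 retyped.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

students = ["Gabriela", "Luiz Felipe", "Patricia", "Ovidio", "Leonor"]
grades = np.array([[3.7, 2.7, 9.1],    # mathematics, physics, chemistry
                   [7.8, 8.0, 1.5],
                   [8.9, 1.0, 2.7],
                   [7.0, 1.0, 9.0],
                   [3.4, 2.0, 5.0]])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(grades)
print(dict(zip(students, km.labels_)))                  # Gabriela, Ovidio, and Leonor share a label
print(np.round(cdist(grades, km.cluster_centers_), 3))  # distances to the final centroids (cf. Table 11.23)
```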


TABLE 11.24 Means per Cluster and General Mean of the Variable mathematics

Cluster 1: X_{Luiz Felipe} = 7.80;  \bar{X}_1 = 7.80
Cluster 2: X_{Patricia} = 8.90;  \bar{X}_2 = 8.90
Cluster 3: X_{Gabriela} = 3.70, X_{Ovidio} = 7.00, X_{Leonor} = 3.40;  \bar{X}_3 = 4.70
General mean: \bar{X} = 6.16

We would like to emphasize that this algorithm can be elaborated with another preliminary allocation of the observations to the clusters besides the one chosen in this example. Reapplying the k-means procedure with several arbitrary starting allocations, for a given number K of clusters, allows the researcher to assess how stable the clustering procedure is and to underpin the allocation of the observations to the groups in a consistent way. After concluding this procedure, it is essential to check, through the F-test of the one-way ANOVA, whether the values of each one of the three variables considered in the analysis are statistically different between the three clusters. To make the calculation of the corresponding F statistics easier, we constructed Tables 11.24, 11.25, and 11.26, which show the means per cluster and the general mean of the variables mathematics, physics, and chemistry, respectively. Based on the values presented in these tables and by using Expression (11.28), we are able to calculate the variation between the groups and within them for each one of the variables, as well as the respective F statistics. Tables 11.27, 11.28, and 11.29 show these calculations. Now, let's analyze whether the null hypothesis of the F-test is rejected for each one of the variables. Since there are two degrees of freedom for the variability between the groups (K - 1 = 2) and two degrees of freedom for the variability within the groups (n - K = 2), by using Table A in the Appendix, we have Fc = 19.00 (critical F at a significance level of 0.05). Therefore, only for the variable physics can we reject the null hypothesis that all the groups formed have the same mean, since Fcal = 22.337 > Fc = F2,2,5% = 19.00.

TABLE 11.25 Means per Cluster and General Mean of the Variable physics

Cluster 1: X_{Luiz Felipe} = 8.00;  \bar{X}_1 = 8.00
Cluster 2: X_{Patricia} = 1.00;  \bar{X}_2 = 1.00
Cluster 3: X_{Gabriela} = 2.70, X_{Ovidio} = 1.00, X_{Leonor} = 2.00;  \bar{X}_3 = 1.90
General mean: \bar{X} = 2.94

TABLE 11.26 Means per Cluster and General Mean of the Variable chemistry

Cluster 1: X_{Luiz Felipe} = 1.50;  \bar{X}_1 = 1.50
Cluster 2: X_{Patricia} = 2.70;  \bar{X}_2 = 2.70
Cluster 3: X_{Gabriela} = 9.10, X_{Ovidio} = 9.00, X_{Leonor} = 5.00;  \bar{X}_3 = 7.70
General mean: \bar{X} = 5.46


TABLE 11.27 Variation and F Statistic for the Variable mathematics

Variability between the groups: [(7.80 - 6.16)^2 + (8.90 - 6.16)^2 + 3 x (4.70 - 6.16)^2] / (3 - 1) = 8.296
Variability within the groups: [(3.70 - 4.70)^2 + (7.00 - 4.70)^2 + (3.40 - 4.70)^2] / (5 - 3) = 3.990
F = 8.296 / 3.990 = 2.079

Note: The calculation of the variability within the groups only took cluster 3 into consideration, since the others show variability equal to 0, because they are formed by a single observation.

TABLE 11.28 Variation and F Statistic for the Variable physics

Variability between the groups: [(8.00 - 2.94)^2 + (1.00 - 2.94)^2 + 3 x (1.90 - 2.94)^2] / (3 - 1) = 16.306
Variability within the groups: [(2.70 - 1.90)^2 + (1.00 - 1.90)^2 + (2.00 - 1.90)^2] / (5 - 3) = 0.730
F = 16.306 / 0.730 = 22.337

Note: The same as the previous table.

TABLE 11.29 Variation and F Statistic for the Variable chemistry

Variability between the groups: [(1.50 - 5.46)^2 + (2.70 - 5.46)^2 + 3 x (7.70 - 5.46)^2] / (3 - 1) = 19.176
Variability within the groups: [(9.10 - 7.70)^2 + (9.00 - 7.70)^2 + (5.00 - 7.70)^2] / (5 - 3) = 5.470
F = 19.176 / 5.470 = 3.506

Note: The same as Table 11.27.

So, for this variable, there is at least one group that has a mean statistically different from the others. For the variables mathematics and chemistry, however, we cannot reject the test's null hypothesis at a significance level of 0.05. Software packages such as SPSS and Stata do not report the Fc for the defined degrees of freedom and a certain significance level. However, they report the significance level of Fcal for these degrees of freedom. Thus, instead of verifying whether Fcal > Fc, we must verify whether the significance level of Fcal is less than 0.05 (5%). Therefore:

If Sig. F (or Prob. F) < 0.05, there is at least one difference between the groups for the variable under analysis.

The Fcal significance level can be obtained in Excel by using the command Formulas → Insert Function → FDIST, which opens a dialog box like the one shown in Fig. 11.24. As we can see in this figure, sig. F for the variable physics is less than 0.05 (sig. F = 0.043), that is, there is at least one difference between the groups for this variable at a significance level of 0.05. An inquisitive researcher will be able to carry out the same procedure for the variables mathematics and chemistry. In short, Table 11.30 presents the results of the one-way ANOVA, with the variation of each variable, the F statistics, and the respective significance levels.

The one-way ANOVA table also allows the researcher to identify the variables that most contribute to the formation of at least one of the clusters, because they have a mean that is statistically different in at least one of the groups in relation to the others, and therefore greater F statistic values. It is important to mention that F statistic values are very sensitive to the sample size and, in this case, the variables mathematics and chemistry ended up not having statistically different means among the three groups mainly because the sample is small (only five observations).

We would like to emphasize that this one-way ANOVA can also be carried out soon after the application of a certain hierarchical agglomeration schedule, since it only depends on the classification of the observations into groups. The researcher must be careful about only one thing when comparing the results obtained by a hierarchical schedule to the ones obtained by a nonhierarchical schedule: to use the same distance measure in both situations. Different allocations of the observations to the same number of clusters may happen if different distance measures are used in a hierarchical schedule and in a nonhierarchical one.
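The same significance levels can also be obtained outside Excel. The snippet below, assuming SciPy is available, evaluates the right tail of the F distribution with the degrees of freedom and the F statistics calculated above; it is only a convenience check, equivalent to the FDIST call.

```python
from scipy.stats import f

df_between, df_within = 2, 2   # K - 1 and n - K
for variable, f_cal in [("mathematics", 2.079), ("physics", 22.337), ("chemistry", 3.506)]:
    sig_f = f.sf(f_cal, df_between, df_within)   # right-tail probability of F(2, 2)
    print(f"{variable:12s} F = {f_cal:6.3f}   sig. F = {sig_f:.3f}")
# Only physics yields sig. F = 0.043 < 0.05, as reported in Fig. 11.24 and Table 11.30.
```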


FIG. 11.24 Obtaining the F significance level (command Insert Function).

TABLE 11.30 One-Way Analysis of Variance (ANOVA)

Variable       Variability Between the Groups   Variability Within the Groups   F        Sig. F
mathematics    8.296                            3.990                           2.079    0.325
physics        16.306                           0.730                           22.337   0.043
chemistry      19.176                           5.470                           3.506    0.222

Therefore, different values of the F statistics can be calculated in the two situations. In general, if one or more variables do not contribute to the formation of the suggested number of clusters, we recommend that the procedure be reapplied without it (or them). In these situations, the number of clusters may change and, if the researcher feels the need to underpin the initial input regarding the number K of clusters, he may even use a hierarchical agglomeration schedule without those variables before reapplying the k-means procedure, which makes the analysis cyclical. Moreover, the existence of outliers may generate considerably disperse clusters, so treating the dataset in order to identify extremely discrepant observations is an advisable procedure before elaborating nonhierarchical agglomeration schedules. In the Appendix of this chapter, an important procedure in Stata for detecting multivariate outliers will be presented.

As with hierarchical agglomeration schedules, the nonhierarchical k-means schedule cannot be used as an isolated technique to make a conclusive decision about the clustering of observations. The allocation of observations and the formation of clusters may be extremely sensitive to the data behavior, the sample size, and the criteria adopted by the researcher. Combining the outputs found here with the ones coming from other techniques can more powerfully underpin the choices made by the researcher and provide greater transparency in the decision-making process.

At the end of the cluster analysis, since the clusters formed can be represented in the dataset by a new qualitative variable with a category assigned to each observation (cluster 1, cluster 2, ..., cluster K), other exploratory multivariate techniques can be elaborated from it, such as a correspondence analysis, so that, depending on the researcher's objectives, we can study a possible association between the clusters and the categories of other qualitative variables. This new qualitative variable, which represents the allocation of each observation, may also be used as an explanatory variable of a certain phenomenon in confirmatory multivariate models, such as multiple regression models, as long as it is transformed into dummy variables that represent the categories (clusters) of this new variable generated in the cluster analysis, as we will study in Chapter 13.


On the other hand, such a procedure only makes sense when we intend to propose a diagnostic regarding the behavior of the dependent variable, without aiming at forecasts. Since a new observation does not have a place in a certain cluster, obtaining its allocation is only possible by including that observation in a new cluster analysis, in order to obtain a new qualitative variable and, consequently, new dummies. In addition, this new qualitative variable can also be considered the dependent variable of a multinomial logistic regression model, allowing the researcher to evaluate the probability that each observation belongs to each one of the clusters formed, as a function of other explanatory variables not initially considered in the cluster analysis. We would also like to highlight that this procedure depends on the research objectives and the construct established, and has a diagnostic nature with regard to the behavior of the variables in the sample for the existing observations, without a predictive purpose. Finally, if the clusters formed are substantial in relation to the number of observations allocated, we may even apply specific confirmatory techniques for each cluster identified, using other variables, so that better adjusted models can possibly be generated.

Next, the same dataset will be used to run cluster analyses in SPSS and Stata. In Section 11.3, we will discuss the procedures for elaborating the techniques studied in SPSS, along with their results. In Section 11.4, we will study the commands to perform the procedures in Stata, with the respective outputs.

11.3 CLUSTER ANALYSIS WITH HIERARCHICAL AND NONHIERARCHICAL AGGLOMERATION SCHEDULES IN SPSS

In this section, we will discuss, step by step, how to elaborate our example in the IBM SPSS Statistics Software. The main objective is to offer the researcher an opportunity to run cluster analyses with hierarchical and nonhierarchical schedules in this software package, given how easy it is to use and how didactical the operations are. Every time an output is shown, we will mention the respective result obtained when performing the algebraic solution in the previous sections, so that the researcher can compare them and deepen his or her own knowledge of the topic. The use of the images in this section has been authorized by the International Business Machines Corporation©.

11.3.1 Elaborating Hierarchical Agglomeration Schedules in SPSS

Going back to the example presented in Section 11.2.2.1.2, remember that our professor is interested in grouping the students into homogeneous clusters based on their grades (from 0 to 10) obtained on the college entrance exams in Mathematics, Physics, and Chemistry. The data can be found in the file CollegeEntranceExams.sav, and they are exactly the same as the ones presented in Table 11.12. In this section, we will carry out the cluster analysis using the Euclidian distance between the observations and considering only the single-linkage method.

In order for a cluster analysis to be elaborated through a hierarchical method in SPSS, we must click on Analyze → Classify → Hierarchical Cluster.... A dialog box like the one shown in Fig. 11.25 will open. Next, we must insert the original variables from our example (mathematics, physics, and chemistry) into Variables and the variable that identifies the observations (student) into Label Cases by, as shown in Fig. 11.26. If the researcher does not have a variable that represents the name of the observations (in this case, a string), this last cell may be left blank.

First of all, in Statistics..., let's choose the options Agglomeration schedule and Proximity matrix, which make the outputs include, respectively, the table with the agglomeration schedule, constructed based on the distance measure to be chosen and on the linkage method to be defined, and the matrix with the distances between each pair of observations. Let's maintain the option None in Cluster Membership. Fig. 11.27 shows how this dialog box will look. When we click on Continue, we go back to the main dialog box of the hierarchical cluster analysis. Next, we must click on Plots.... As seen in Fig. 11.28, let's select the option Dendrogram and the option None in Icicle. In the same way, let's click on Continue, so that we can go back to the main dialog box.

In Method..., which is the most important dialog box of the hierarchical cluster analysis, we must choose the single-linkage method, also known as nearest neighbor. Thus, in Cluster Method, let's select the option Nearest neighbor. An inquisitive researcher will see that the complete (Furthest neighbor) and average (Between-groups linkage) linkage methods, discussed in Section 11.2.2.1, are also available in this option. Besides, since the variables in the dataset are metric, we have to choose one of the dissimilarity measures found in Measure → Interval. In order to maintain the same logic used when solving our example algebraically, we will choose the Euclidian distance as the dissimilarity measure and, therefore, we must select the option Euclidean distance. We can also see that, in this option, we can find the other dissimilarity measures studied in Section 11.2.1.1, such as the squared Euclidean distance, Minkowski, Manhattan (Block, in SPSS), and Chebyshev, as well as Pearson's correlation, which, even though it is a similarity measure, is also used for metric variables.


FIG. 11.25 Dialog box for elaborating the cluster analysis with a hierarchical method in SPSS.

FIG. 11.26 Selecting the original variables.

Although we do not use similarity measures in this example, because we are not working with binary variables, it is important to mention that they can be selected if necessary. Hence, as discussed in Section 11.2.1.2, in Measure → Binary, we can select the simple matching, Jaccard, Dice, Anti-Dice (Sokal and Sneath 2, in SPSS), Russell and Rao, Ochiai, Yule (Yule's Q, in SPSS), Rogers and Tanimoto, Sneath and Sokal (Sokal and Sneath 1, in SPSS), and Hamann coefficients, among others.


FIG. 11.27 Selecting the options that generate the agglomeration schedule and the matrix with the distances between the pairs of observations.

FIG. 11.28 Selecting the option that generates the dendrogram.


FIG. 11.29 Dialog box for selecting the linkage method and the distance measure.

Still in the same dialog box, the researcher may request that the cluster analysis be elaborated from standardized variables. If necessary, in situations in which the original variables have different units of measure, the option Z scores in Transform Values → Standardize can be selected, which makes all the calculations be elaborated from the standardized variables, which then come to have means equal to 0 and standard deviations equal to 1. After these considerations, the dialog box in our example will look like the one seen in Fig. 11.29. Next, we can click on Continue and on OK.

The first output (Fig. 11.30) shows the dissimilarity matrix D0, formed by the Euclidian distances between each pair of observations. We can also see that the legend says, "This is a dissimilarity matrix." If this matrix were formed by similarity measures, resulting from calculations elaborated from binary variables, it would say, "This is a similarity matrix."

FIG. 11.30 Matrix with Euclidian distances (dissimilarity measures) between pairs of observations.


FIG. 11.31 Hierarchical agglomeration schedule—Single-linkage method and Euclidian distance.

Through this matrix, which is equal to the one whose values were calculated and presented in Section 11.2.2.1.2, we can verify that observations Gabriela and Ovidio are the most similar (the smallest Euclidian distance) in relation to the variables mathematics, physics, and chemistry (d_{Gabriela-Ovidio} = 3.713). Therefore, in the hierarchical schedule shown in Fig. 11.31, the first clustering stage occurs exactly by joining these two students, with Coefficient (Euclidian distance) equal to 3.713. Note that the columns Cluster Combined Cluster 1 and Cluster 2 refer to the isolated observations, when they have still not been incorporated into a certain cluster, or to clusters that have already been formed. Obviously, in the first clustering stage, the first cluster is formed by the fusion of two isolated observations. Next, in the second stage, observation Leonor (5) is incorporated into the cluster previously formed by Gabriela (1) and Ovidio (4). With regard to the single-linkage method, we can see that the distance considered for the agglomeration of Leonor was the smallest between this observation and Gabriela or Ovidio, that is, the criterion adopted was:

d_{(Gabriela-Ovidio)-Leonor} = min{4.170; 5.474} = 4.170

We can also see that, while the columns Stage Cluster First Appears Cluster 1 and Cluster 2 indicate in which previous stage each corresponding observation was incorporated into a certain cluster, the column Next Stage shows in which future stage the respective cluster will receive a new observation or cluster, given that we are dealing with an agglomerative method. In the third stage, observation Patricia (3) is incorporated into the already formed cluster Gabriela-Ovidio-Leonor, respecting the following distance criterion:

d_{(Gabriela-Ovidio-Leonor)-Patricia} = min{8.420; 6.580; 6.045} = 6.045

And, finally, given that we have five observations, in the fourth and last stage, observation Luiz Felipe, which is still isolated (note that the last observation to be incorporated into a cluster corresponds to the last value equal to 0 in the column Stage Cluster First Appears Cluster 2), is incorporated into the cluster already formed by the other observations, concluding the agglomeration schedule. The distance considered at this stage is given by:

d_{(Gabriela-Ovidio-Leonor-Patricia)-Luiz Felipe} = min{10.132; 10.290; 8.223; 7.187} = 7.187

Based on how the observations are sorted in the agglomeration schedule and on the distances used as the clustering criterion, the dendrogram can be constructed, as seen in Fig. 11.32. Note that the distance measures are rescaled when SPSS constructs dendrograms, so that the interpretation of each observation's allocation to the clusters and, mainly, the visualization of the largest distance leaps are made easier, as discussed in Section 11.2.2.1.2.1. The way the observations are sorted in the dendrogram corresponds to what was presented in the agglomeration schedule (Fig. 11.31) and, from the analysis shown in Fig. 11.32, it is possible to see that the greatest distance leap occurs when Patricia merges with the already formed cluster Gabriela-Ovidio-Leonor. This leap could already have been identified in the agglomeration schedule of Fig. 11.31, since a large increase in distance occurs when we go from the second to the third stage, that is, when the Euclidian distance increases from 4.170 to 6.045 (44.96%) so that a new cluster can be formed by incorporating another observation.
Therefore, we can choose the existing configuration at the end of the second clustering stage, in which three clusters are formed. As discussed in Section 11.2.2.1.2.1, the criterion for identifying the number of clusters that considers the clustering stage immediately before a large leap is very useful and commonly used. Fig. 11.33 shows a vertical line (a dashed line) that “cuts” the dendrogram in the region where the highest leaps occur. At this moment, since three intersections with lines from the dendrogram happen, we can identify three corresponding clusters formed by Gabriela-Ovidio-Leonor, Patricia, and Luiz Felipe, respectively.
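The same hierarchical schedule can be reproduced outside SPSS. The sketch below, assuming SciPy is installed, applies single linkage with Euclidian distances to the grades of Table 11.16; the resulting linkage matrix mirrors the agglomeration schedule of Fig. 11.31, and the three-cluster cut corresponds to the allocation that SPSS will save as CLU3_1.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

students = ["Gabriela", "Luiz Felipe", "Patricia", "Ovidio", "Leonor"]
grades = np.array([[3.7, 2.7, 9.1],
                   [7.8, 8.0, 1.5],
                   [8.9, 1.0, 2.7],
                   [7.0, 1.0, 9.0],
                   [3.4, 2.0, 5.0]])

Z = linkage(grades, method="single", metric="euclidean")  # nearest-neighbor agglomeration schedule
print(np.round(Z, 3))                                      # merge distances 3.713, 4.170, 6.045, 7.187
print(fcluster(Z, t=3, criterion="maxclust"))              # three-cluster solution
# dendrogram(Z, labels=students)  # draws the diagram of Fig. 11.32 (requires matplotlib)
```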

FIG. 11.32 Dendrogram—Single-linkage method and rescaled Euclidian distances in SPSS.

FIG. 11.33 Dendrogram with cluster identification (cluster Gabriela-Ovidio-Leonor and the individual clusters Patricia and Luiz Felipe).

FIG. 11.34 Defining the number of clusters.

As discussed, it is common to find dendrograms that make it difficult to identify distance leaps, mainly because the dataset contains considerably similar observations in relation to all the variables under analysis. In these situations, it is advisable to use the squared Euclidean distance and the complete-linkage method (furthest neighbor). This combination of criteria is very popular in datasets with extremely homogeneous observations.

Having adopted the solution with three clusters, we can once again click on Analyze → Classify → Hierarchical Cluster... and, on Statistics..., select the option Single solution in Cluster Membership. In this option, we must insert the number 3 into Number of clusters, as shown in Fig. 11.34. When we click on Continue, we go back to the main dialog box of the cluster analysis. On Save..., let's choose the option Single solution and, in the same way, insert the number 3 into Number of clusters, as shown in Fig. 11.35, so that the new variable corresponding to the allocation of observations to the clusters becomes available in the dataset. Next, we can click on Continue and on OK.

Although the outputs generated are the same, it is important to notice that a new table of results is presented, corresponding to the allocation of the observations to the clusters itself. Fig. 11.36 shows, for three clusters, that, while observations Gabriela, Ovidio, and Leonor form a single cluster, called 1, observations Luiz Felipe and Patricia form two individual clusters, called 2 and 3, respectively. Even though these names are numerical, it is important to highlight that they only represent the labels (categories) of a qualitative variable. When elaborating the procedure described, we can see that a new variable is generated in the dataset. It is called CLU3_1 by SPSS, as shown in Fig. 11.37. This new variable is automatically classified by the software as Nominal, that is, qualitative, as shown in Fig. 11.38, which can be obtained when we click on Variable View, in the lower left-hand corner of the screen in SPSS. As we have already discussed, the variable CLU3_1 can be used in other exploratory techniques, such as the correspondence analysis, or in confirmatory techniques. In the latter case, it can be inserted, for example, into the vector of explanatory variables (as long as it is transformed into dummies) of a multiple regression model, or used as the dependent variable of a certain multinomial logistic regression model, in which researchers intend to study the behavior of other variables, not inserted into the cluster analysis, concerning the probability of each observation belonging to each one of the clusters formed. However, this decision depends on the research objectives.

At this moment, the researcher may consider the cluster analysis with hierarchical agglomeration schedules concluded. Nevertheless, based on the generation of the new variable CLU3_1 and by using the one-way ANOVA, he or she may still study whether the values of a certain variable differ between the clusters formed, that is, whether the variability between the groups is significantly higher than the variability within each one of them. Even though this analysis was not developed when solving the hierarchical schedules algebraically, since we chose to carry it out only after the k-means procedure in Section 11.2.2.2.2, we can show now how it can be applied, given that we have already allocated the observations to the groups.


FIG. 11.35 Selecting the option to save the allocation of observations to the clusters with the new variable in the dataset—Hierarchical procedure.

FIG. 11.36 Allocating the observations to the clusters.

FIG. 11.37 Dataset with the new variable CLU3_1—Allocation of each observation.


FIG. 11.38 Nominal (qualitative) classification of the variable CLU3_1.

In order to do that, let's click on Analyze → Compare Means → One-Way ANOVA.... In the dialog box that opens, we must insert the variables mathematics, physics, and chemistry into Dependent List and the variable CLU3_1 (Single Linkage) into Factor. The dialog box will look like the one shown in Fig. 11.39. In Options..., let's choose the options Descriptive (in Statistics) and Means plot, as shown in Fig. 11.40. Next, we can click on Continue and on OK.

While Fig. 11.41 shows the descriptive statistics of the clusters per variable, similar to Tables 11.24, 11.25, and 11.26, Fig. 11.42 uses these values and shows the calculation of the variation between the groups (Between Groups) and within them (Within Groups), as well as the F statistics for each variable and the respective significance levels. We can see that these values correspond to the ones calculated algebraically in Section 11.2.2.2.2 and shown in Table 11.30. From Fig. 11.42, we can see that sig. F for the variable physics is less than 0.05 (sig. F = 0.043), that is, there is at least one group with a statistically different mean, when compared to the others, at a significance level of 0.05. However, the same cannot be said about the variables mathematics and chemistry.

Although we already have an idea of which group has a statistically different mean compared to the others for the variable physics, based on the outputs seen in Fig. 11.41, constructing the diagrams may facilitate the analysis of the differences between the variable means per cluster even more. The charts generated by SPSS (Figs. 11.43, 11.44, and 11.45) allow us to see these differences between the groups for each variable analyzed. Therefore, from the chart seen in Fig. 11.44, it is possible to see that group 2, formed only by observation Luiz Felipe, in fact has a mean different from the others in relation to the variable physics. Besides, even though we can see from the diagrams in Figs. 11.43 and 11.45 that there are differences in the means of the variables mathematics and chemistry between the groups, these differences cannot be considered statistically significant at a significance level of 0.05, since we are dealing with a very small number of observations, and the F statistic values are very sensitive to the sample size. This graphical analysis becomes really useful when we are studying datasets with a larger number of observations and variables.

FIG. 11.39 Dialog box with the selection of the variables to run the one-way analysis of variance in SPSS.


FIG. 11.40 Selecting the options to carry out the one-way analysis of variance.

FIG. 11.41 Descriptive statistics of the clusters per variable.

FIG. 11.42 One-way analysis of variance—Between groups and within groups variation, F statistics, and significance levels per variable.


FIG. 11.43 Means of the variable mathematics in the three clusters.

FIG. 11.44 Means of the variable physics in the three clusters.

Finally, researchers can still complement their analysis by elaborating a procedure known as multidimensional scaling, since using the distance matrix may help them construct a chart that allows a two-dimensional visualization of the relative positions of each observation, regardless of the total number of variables. In order to do that, we must structure a new dataset, formed exactly by the distance matrix. For the data in our example, we can open the file CollegeEntranceExamMatrix.sav, which contains the Euclidian distance matrix shown in Fig. 11.46. Note that the columns of this new dataset refer to the observations in the original dataset, as well as the rows (squared distance matrix).


FIG. 11.45 Means of the variable chemistry in the three clusters.

FIG. 11.46 Dataset with the Euclidean distance matrix.

Let’s click on Analyze → Scale → Multidimensional Scaling (ASCAL).... In the dialog box that will open, we must insert the variables that represent the observations in Variables, as shown in Fig. 11.39. Since the data already correspond to the distances, nothing needs to be done regarding the field Distances (Fig. 11.47). In Model..., let’s select the option Ratio in Level of Measurement (note that the option Euclidean distance in Scaling Model has already been selected) and, in Options..., the option Group plots in Display, as shown in Figs. 11.48 and 11.49, respectively. Next, we can click on Continue and on OK. Fig. 11.50 shows the chart with the relative positions of the observations projected on a plane. This type of chart is really useful when researchers wish to prepare didactical presentations of observation clusters (individuals, companies, municipalities, countries, among other examples) and to make the interpretation of the clusters easier, mainly when there is a relatively large number of variables in the dataset.


FIG. 11.47 Dialog box with the selection of the variables to run the multidimensional scaling in SPSS.

FIG. 11.48 Defining the nature of the variable that corresponds to the distance measure.


FIG. 11.49 Selecting the option for constructing the twodimensional chart.

FIG. 11.50 Two-dimensional chart with the projected relative positions of the observations (derived stimulus configuration, Euclidean distance model).


FIG. 11.51 Dialog box for elaborating the cluster analysis with the nonhierarchical K-means method in SPSS.

11.3.2 Elaborating Nonhierarchical K-Means Agglomeration Schedules in SPSS

Maintaining the same logic proposed in the chapter, from the same dataset, we will develop a cluster analysis based on the nonhierarchical k-means agglomeration schedule. Thus, we must once again use the file CollegeEntranceExams.sav. In order to do that, we must click on Analyze → Classify → K-Means Cluster.... In the dialog box that will open, we must insert the variables mathematics, physics, and chemistry into Variables, and the variable student into Label Cases by. The main difference between this initial dialog box and the one corresponding to the hierarchical procedure is determining the number of clusters from which the k-means algorithm will be elaborated. In our example, let’s insert number 3 into Number of Clusters. Fig. 11.51 shows how the dialog box will be. We can see that we inserted the original variables into the field Variables. This procedure is acceptable, since, for our example, the values are in the same unit of measure. However, if this fact is not verified, before elaborating the k-means procedure, researchers must standardize them through the Z-scores procedure, in Analyze → Descriptive Statistics → Descriptives..., insert the original variables into Variables, and select the option Save standardized values as variables. When we click on OK, researchers will see that new standardized variables will become part of the dataset. Going back to the initial screen of the k-means procedure, we will click on Save.... In the dialog box that will open, we must select the option Cluster membership, as shown in Fig. 11.52. When we click on Continue, we will go back to the previous dialog box. In Options..., let’s select the options Initial cluster centers, ANOVA table, and Cluster information for each case, in Statistics, as shown in Fig. 11.53. Next, we can click on Continue and on OK. It is important to mention that SPSS already uses the Euclidian distance as a standard dissimilarity measure when elaborating the k-means procedure.
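Outside SPSS, the same Z-scores standardization described above can be obtained, for example, with SciPy; the snippet below is only an illustration of the transformation (mean 0 and standard deviation 1 per variable), applied to the grades of Table 11.16.

```python
import numpy as np
from scipy.stats import zscore

grades = np.array([[3.7, 2.7, 9.1], [7.8, 8.0, 1.5], [8.9, 1.0, 2.7],
                   [7.0, 1.0, 9.0], [3.4, 2.0, 5.0]])
z_grades = zscore(grades, axis=0, ddof=1)   # ddof=1 reproduces SPSS's sample standard deviation
print(np.round(z_grades, 3))
```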


FIG. 11.52 Selecting the option to save the allocation of observations to the clusters with the new variable in the dataset—Nonhierarchical procedure.

FIG. 11.53 Selecting the options to perform the K-means procedure.

The first two outputs generated refer to the initial step and to the iteration of the k-means algorithm. The centroid coordinates are presented in the initial step and, through them, we can notice that SPSS considers the three clusters to be formed by the first three observations in the dataset. Although this decision is different from the one we used in Section 11.2.2.2.2, this choice is purely arbitrary and, as we will see later, it does not impact the formation of clusters in the final step of the k-means algorithm at all. While Fig. 11.54 shows the values of the original variables for observations Gabriela, Luiz Felipe, and Patricia (as shown in Table 11.16) as the centroid coordinates of the three groups, in Fig. 11.55 we can see, after the first iteration of the algorithm, that the change in the centroid coordinate of the first cluster is 1.897, which corresponds exactly to the Euclidian distance between observation Gabriela and the cluster Gabriela-Ovidio-Leonor (as shown in Table 11.23). In the footnotes of Fig. 11.55, it is also possible to see the measure 7.187, which corresponds to the Euclidian distance between observations Luiz Felipe and Patricia, which remain isolated after the iteration.

FIG. 11.54 First step of the K-means algorithm—Centroids of the three groups as observation coordinates.

366

PART

V

Multivariate Exploratory Data Analysis

FIG. 11.55 First iteration of the K-means algorithm and change in the centroid coordinates.

FIG. 11.56 Final stage of the K-means algorithm—Allocation of the observations and distances to the respective cluster centroids.

figure, in the footnotes, it is also possible to see the measure 7.187 that corresponds to the Euclidian distance between observations Luiz Felipe and Patricia, which remain isolated after the iteration. The next three figures refer to the final stage of the k-means algorithm. While the output Cluster Membership (Fig. 11.56) shows the allocation of each observation to each one of the three clusters, as well as the Euclidian distances between each observation and the centroid of the respective group, the output Distances between Final Cluster Centers (Fig. 11.58) shows the Euclidian distances between the group centroids. These two outputs have values that were calculated algebraically in Section 11.2.2.2.2 and shown in Table 11.23. Moreover, the output Final Cluster Centers (Fig. 11.57) shows the centroid coordinates of the groups after the final stage of this nonhierarchical procedure, which correspond to the values already calculated and presented in Table 11.22.

FIG. 11.57 Final stage of the K-Means algorithm—Cluster centroid coordinates.


FIG. 11.58 Final stage of the K-means algorithm—Distances between the cluster centroids.

FIG. 11.59 One-way analysis of variance in the K-means procedure—Variation between groups and within groups, F statistics, and significance levels per variable.

The ANOVA output (Fig. 11.59) is analogous to the one presented in Table 11.30 in Section 11.2.2.2.2 and in Fig. 11.42 in Section 11.3.1, and, through it, we can see that only the variable physics has a statistically different mean in at least one of the groups formed, when compared to the others, at a significance level of 0.05. As we have previously discussed, if one or more variables are not contributing to the formation of the suggested number of clusters, we recommend that the algorithm be reapplied without these variables. The researcher can even use a hierarchical procedure without the aforementioned variables before reapplying the k-means procedure. For the data in our example, however, the analysis would become univariate due to the exclusion of the variables mathematics and chemistry, which demonstrates the risk researchers take when working with extremely small datasets in cluster analysis.

It is important to mention that the ANOVA output must only be used to study which variables most contribute to the formation of the specified number of clusters, since this number is chosen so that the differences between the observations allocated to different groups can be maximized. Thus, as explained in this output's footnotes, we cannot use the F statistic to verify whether or not the groups formed are equal. For this reason, it is common to find the term pseudo F for this statistic in the existing literature.

Finally, Fig. 11.60 shows the number of observations in each one of the clusters. Similar to the hierarchical procedure, we can see that a new variable (obviously qualitative) is generated in the dataset after the k-means procedure is run, which is called QCL_1 by SPSS, as shown in Fig. 11.61. This variable ended up being identical to the variable CLU3_1 (Fig. 11.37) in this example. Nonetheless, this will not always happen with a larger number of observations or in cases in which different dissimilarity measures are used in the hierarchical and nonhierarchical procedures. Having presented the procedures for the application of the cluster analysis in SPSS, let's discuss this technique in Stata.

FIG. 11.60 Number of observations in each cluster.


FIG. 11.61 Dataset with the new variable QCL_1—Allocation of each observation.

11.4 CLUSTER ANALYSIS WITH HIERARCHICAL AND NONHIERARCHICAL AGGLOMERATION SCHEDULES IN STATA

Now, we will present the step-by-step procedure for preparing our example in Stata Statistical Software®. In this section, our main objective is not to discuss the concepts related to cluster analysis once again, but to give the researcher an opportunity to prepare the technique by using the commands this software has to offer. At each presentation of an output, we will mention the respective result obtained when performing its algebraic solution and also when using SPSS. The use of the images in this section has been authorized by StataCorp LP©.

11.4.1 Elaborating Hierarchical Agglomeration Schedules in Stata

Let's begin with the dataset constructed by the professor, which contains the grades in Mathematics, Physics, and Chemistry obtained by five students in the college entrance exams. The dataset can be found in the file CollegeEntranceExams.dta and is exactly the same as the one presented in Table 11.12 in Section 11.2.2.1.2. Initially, we can type the command desc, which allows us to analyze the dataset characteristics, such as the number of observations, the number of variables, and the description of each one of them. Fig. 11.62 shows this first output in Stata. As discussed previously, since the original variables have values in the same unit of measure in this example, it is not necessary to standardize them by using the Z-scores procedure. However, if researchers wish to, they may obtain the standardized variables through the following commands:

egen zmathematics = std(mathematics)
egen zphysics = std(physics)
egen zchemistry = std(chemistry)
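If these standardized variables are created, a quick sanity check that each one has mean 0 and standard deviation 1 can be run with the summarize command (a minimal sketch, assuming the three egen commands above have just been executed):

* check that the standardized variables have mean 0 and standard deviation 1
summarize zmathematics zphysics zchemistry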

FIG. 11.62 Description of the CollegeEntranceExams.dta dataset.


TABLE 11.31 Terms in Stata Corresponding to the Measures for Metric Variables

Measure for Metric Variables    Term in Stata
Euclidian                       L2
Squared Euclidean               L2squared
Manhattan                       L1
Chebyshev                       Linf
Canberra                        Canberra
Pearson's Correlation           corr

First of all, let's obtain the matrix with distances between the pairs of observations. In general, the sequence of commands for obtaining distance or similarity matrices in Stata is:

matrix dissimilarity D = variables*, option*
matrix list D

where the term variables* will have to be substituted for the list of variables to be considered in the analysis, and the term option* will have to be substituted for the term corresponding to the distance or similarity measure that the researcher wishes to use. While Table 11.31 shows the terms in Stata that correspond to each one of the measures for the metric variables studied in Section 11.2.1.1, Table 11.32 shows the terms related to the measures used for the binary variables studied in Section 11.2.1.2. Therefore, since we wish to obtain the Euclidian distance matrix between the pairs of observations, in order to maintain the criterion used in the chapter, we must type the following sequence of commands:

matrix dissimilarity D = mathematics physics chemistry, L2
matrix list D
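Purely as a hypothetical illustration of the syntax, which is not used in the remainder of the chapter, the same matrix could be obtained with any other measure listed in Table 11.31 simply by changing the option term. For instance, with the Manhattan (L1) distance:

* hypothetical variation: Manhattan distance instead of the Euclidian distance
matrix dissimilarity DM = mathematics physics chemistry, L1
matrix list DM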

The output generated, which can be seen in Fig. 11.63, is in accordance with what was presented in matrix D0 in Section 11.2.2.1.2.1, and also in Fig. 11.30 when we elaborated the technique in SPSS (Section 11.3.1). Next, we will carry out the cluster analysis itself. The general command used to run a cluster analysis through a hierarchical schedule in Stata is given by:

cluster method* variables*, measure(option*)

where, besides the substitution of the terms variables* and option*, as discussed previously, we must substitute the term method* for the linkage method chosen by the researcher. Table 11.33 shows the terms in Stata related to the methods discussed in Section 11.2.2.1.

TABLE 11.32 Terms in Stata Corresponding to the Measures for Binary Variables

Measure for Binary Variables    Term in Stata
Simple matching                 matching
Jaccard                         Jaccard
Dice                            Dice
AntiDice                        antiDice
Russell and Rao                 Russell
Ochiai                          Ochiai
Yule                            Yule
Rogers and Tanimoto             Rogers
Sneath and Sokal                Sneath
Hamann                          Hamann


FIG. 11.63 Euclidean distance matrix between pairs of observations.

TABLE 11.33 Terms in Stata That Correspond to the Linkage Methods in Hierarchical Agglomeration Schedules

Linkage Method    Term in Stata
Single            singlelinkage
Complete          completelinkage
Average           averagelinkage

Therefore, for the data in our example and following the criterion adopted throughout this chapter (single-linkage method with Euclidian distance, term L2), we must type the following command:

cluster singlelinkage mathematics physics chemistry, measure(L2)
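Analogously, and only as a hypothetical sketch of the syntax that is not used in the rest of the chapter, the complete-linkage method with the squared Euclidean distance (terms completelinkage and L2squared in Tables 11.33 and 11.31) could be requested as follows; the optional name() term simply labels this additional analysis so that it does not overwrite the previous one:

* hypothetical variation: complete linkage with the squared Euclidean distance
cluster completelinkage mathematics physics chemistry, measure(L2squared) name(complete_l2sq)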

After that, we can type the command cluster list, which presents, in a summarized way, the criteria used by the researcher to develop the hierarchical cluster analysis. Fig. 11.64 shows the outputs generated.

FIG. 11.64 Elaboration of the hierarchical cluster analysis and summary of the criteria used.

From Fig. 11.64 and by analyzing the dataset, we can verify that three new variables are generated, regarding the identification of each observation (_clus_1_id), the sorting of the observations when creating the clusters (_clus_1_ord), and the Euclidian distances used in order to group a new observation in each one of the clustering stages (_clus_1_hgt). Fig. 11.65 shows how the dataset looks after this cluster analysis is elaborated. It is important to mention that Stata shows the variable _clus_1_hgt with the values shifted by one row, which can make the analysis a little confusing. Therefore, while distance 3.713 refers to the merger between observations Ovidio and Gabriela (first stage of the agglomeration schedule), distance 7.187 corresponds to the fusion between Luiz Felipe and the cluster already formed by all the other observations (last stage of the agglomeration schedule), as already shown in Table 11.13 and in Fig. 11.31.

FIG. 11.65 Dataset with the new variables.

Thus, in order for researchers to correct this discrepancy and to obtain the real behavior of the distances in each new clustering stage, they can type the following sequence of commands, whose output can be seen in Fig. 11.66. Note that a new variable is generated (dist) that corresponds to the correction of the discrepancy found in variable _clus_1_hgt (term [_n-1]), presenting the value of the Euclidian distance at which a new cluster is established in each stage of the agglomeration schedule.

gen dist = _clus_1_hgt[_n-1]
replace dist=0 if dist==.
sort dist
list student dist

FIG. 11.66 Stages of the agglomeration schedule and respective Euclidian distances.

Having carried out this phase, we can ask Stata to construct the dendrogram by typing one of the two equivalent commands:

cluster dendrogram, labels(student) horizontal

or

cluster tree, labels(student) horizontal

The diagram generated can be seen in Fig. 11.67. We can see that the dendrogram constructed by Stata, in terms of Euclidian distances, is equal to the one shown in Fig. 11.12, constructed when the modeling was solved algebraically. However, it differs from the one constructed by SPSS (Fig. 11.32) because it does not consider rescaled measures. Regardless of this fact, we will adopt three clusters as a possible solution, one of them formed by Leonor, Ovidio, and Gabriela, another by Patricia, and the third by Luiz Felipe, since the criteria discussed about large distance leaps coherently lead us toward this decision. In order to generate a new variable corresponding to the allocation of the observations to the three clusters, we must type the following sequence of commands. Note that we have named this new variable cluster. The output seen in Fig. 11.68 shows the allocation of the observations to the groups and is equivalent to the one shown in Fig. 11.36 (SPSS).

cluster generate cluster = groups(3), name(_clus_1)
sort _clus_1_id
list student cluster
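Although the choice of three clusters is supported here by the large distance leaps and the dendrogram, Stata also offers stopping rules that can help corroborate this kind of decision. As a hypothetical complement, not used in the chapter, the Calinski-Harabasz pseudo-F and the Duda-Hart indexes could be requested for the hierarchical analysis named _clus_1:

* hypothetical complement: stopping rules for choosing the number of clusters
cluster stop _clus_1
cluster stop _clus_1, rule(duda)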


FIG. 11.67 Dendrogram—Single-linkage method and Euclidian distances in Stata.


FIG. 11.68 Allocating the observations to the clusters.

Finally, by using the one-way analysis of variance (ANOVA), we will study if the values of a certain variable differ between the groups represented by the categories of the new qualitative variable cluster generated in the dataset, that is, if the variation between the groups is significantly higher than the variation within each one of them, following the logic proposed in Section 11.3.1. In order to do that, let's type the following commands, in which the three metric variables (mathematics, physics, and chemistry) are individually related to the variable cluster:

oneway mathematics cluster, tabulate
oneway physics cluster, tabulate
oneway chemistry cluster, tabulate

The results of the ANOVA for the three variables are in Fig. 11.69. The outputs in this figure, which show the results of the variation Between groups and Within groups, the F statistics, and the respective significance levels (Prob. F, or Prob > F in Stata) for each variable, are equal to the ones calculated algebraically and presented in Table 11.30 (Section 11.2.2.2.2) and also in Fig. 11.42, when this procedure was elaborated in SPSS (Section 11.3.1). Therefore, as we have already discussed, we can see that, while for the variable physics there is at least one cluster that has a statistically different mean, when compared to the others, at a significance level of 0.05 (Prob. F = 0.0429 < 0.05), the variables mathematics and chemistry do not have statistically different means between the three groups formed for this sample and at the significance level set. It is important to bear in mind that, if there is a greater number of variables that have Prob. F less than 0.05, the one considered the most discriminant of the groups is the one with the highest F statistic (that is, the lowest significance level Prob. F).


FIG. 11.69 ANOVA for the variables mathematics, physics, and chemistry.

Even though it would be possible to conclude the hierarchical analysis at this point, the researcher has the option to run a multidimensional scaling in order to see the projections of the relative positions of the observations in a two-dimensional chart, similar to what was done in Section 11.3.1. In order to do that, he may type the following command:

mds mathematics physics chemistry, id(student) method(modern) measure(L2) loss(sstress) config nolog

The outputs generated can be found in Figs. 11.70 and 11.71, and the chart in the latter is the same as the one shown in Fig. 11.50.
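If the researcher wishes to redraw the two-dimensional configuration afterward, the MDS postestimation command mdsconfig may be used; the run below is a hypothetical variation, assuming the mds command above has just been executed, that simply replots the configuration with an automatically adjusted aspect ratio:

* hypothetical variation: replot the MDS configuration
mdsconfig, autoaspect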


FIG. 11.70 Elaborating the multidimensional scaling in Stata.

FIG. 11.71 Chart with projections of the relative positions of the observations.

Having presented the commands to carry out the cluster analysis with hierarchical agglomeration schedules in Stata, let’s move on to the elaboration of the nonhierarchical k-means agglomeration schedule in the same software package.

11.4.2 Elaborating Nonhierarchical K-Means Agglomeration Schedules in Stata

In order to apply the k-means procedure to the data in the file CollegeEntranceExams.dta, we must type the following command:

cluster kmeans mathematics physics chemistry, k(3) name(kmeans) measure(L2) start(firstk)

where the term k(3) is the input for the algorithm to be elaborated with three clusters. Besides, we define that a new variable with the allocation of the observations to the three groups will be generated in the dataset with the name kmeans (term name(kmeans)), and the distance measure used will be the Euclidian distance (term L2). Moreover, the term firstk specifies that the coordinates of the first k observations of the sample will be used as centroids of the k clusters (in our case, k = 3), which corresponds exactly to the criterion adopted by SPSS, as discussed in Section 11.3.2.
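The start() option accepts other initialization criteria as well. Purely as a hypothetical illustration, not used in the chapter, k observations could be drawn at random as the initial centroids, with a fixed seed to keep the result reproducible:

* hypothetical variation: random initial centroids with a fixed seed
cluster kmeans mathematics physics chemistry, k(3) name(kmeans_rnd) measure(L2) start(krandom(1234))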


FIG. 11.72 Elaborating the nonhierarchical K-means procedure and a summary of the criteria used.

Next, we can type the command cluster list kmeans so that, in a summarized way, the criteria adopted for elaborating the k-means procedure can be presented. The outputs in Fig. 11.72 show what is generated by Stata after we type the last two commands. The next two commands generate, in the outputs of the software, two tables that refer to the number of observations in each one of the three clusters formed, as well as to the allocation of each observation in these groups, respectively:

table kmeans
list student kmeans

Fig. 11.73 shows these outputs. These results correspond to those found when the k-means procedure was solved algebraically in Section 11.2.2.2.2 (Fig. 11.23), and to those obtained when this procedure was elaborated using SPSS in Section 11.3.2 (Figs. 11.60 and 11.61).

FIG. 11.73 Number of observations in each cluster and allocation of observations.

Even though we are able to develop a one-way analysis of variance for the original variables in the dataset, based on the new qualitative variable generated (kmeans), we chose not to carry out this procedure here, since we have already done that for the variable cluster generated in Section 11.4.1 after the hierarchical procedure, which is exactly the same as the variable kmeans in this case. On the other hand, for pedagogical purposes, we present the command that allows the means of each variable in the three clusters to be generated, so that they can be compared:

tabstat mathematics physics chemistry, by(kmeans)
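If, besides the means, the researcher also wants the standard deviation and the number of observations per cluster, the statistics() option of tabstat can be added (a hypothetical variation of the previous command):

* hypothetical variation: means, standard deviations, and counts per cluster
tabstat mathematics physics chemistry, by(kmeans) statistics(mean sd n)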

The output generated can be seen in Fig. 11.74 and is equivalent to the one presented in Tables 11.24, 11.25, and 11.26.

FIG. 11.74 Means per cluster and general means of the variables mathematics, physics, and chemistry.

Finally, the researcher can also construct a chart to show the interrelationships between the variables, two at a time. This chart, known as a matrix chart, can give the researcher a better understanding of how the variables relate to one another and can even make suggestions regarding the relative position of the observations in each cluster in these interrelationships. To construct the chart shown in Fig. 11.75, we must type the following command:

graph matrix mathematics physics chemistry, mlabel(kmeans)

FIG. 11.75 Interrelationship between the variables and relative position of the observations in each cluster—matrix chart.

Obviously, this chart could have also been constructed in the previous section. However, we chose to present it only at the end of the preparation of the k-means procedure in Stata. By analyzing it, it is possible to verify, among other things, that considering only the variables mathematics and chemistry is not enough to keep observations Luiz Felipe and Patricia (clusters 2 and 3, respectively) apart. It is necessary to consider the variable physics so that these two students can, in fact, be allocated to different clusters when forming three clusters. Although this may seem fairly obvious when analyzing such a small dataset directly, the chart becomes extremely useful for larger samples with a considerable number of variables, a fact that multiplies these interrelationships.


11.5 FINAL REMARKS

Many are the situations in which researchers may wish to group observations (individuals, companies, municipalities, countries, political parties, plant species, among other examples) from certain metric or even binary variables. Creating homogeneous clusters, reducing data structurally, and verifying the validity of previously established constructs are some of the main reasons that make researchers choose to work with cluster analysis. This set of techniques allows decision-making mechanisms to be better structured and justified based on the behavior of, and the interdependence relationships between, the observations of a certain dataset. Since the variable that represents the clusters formed is qualitative, the outputs of the cluster analysis can serve as inputs in other multivariate techniques, both exploratory and confirmatory ones. It is strongly advisable for researchers to justify, clearly and transparently, the measure they chose and that will serve as the basis for the observations to be considered more or less similar, as well as the reasons that make them choose nonhierarchical or hierarchical agglomeration schedules and, in this last case, determine the linkage methods.

In the last few years, the evolution of technological capabilities and the development of new software, with extremely improved resources, have caused new and better cluster analysis techniques to arise, techniques that use more and more sophisticated algorithms and that are aimed at the decision-making process in several fields of knowledge, always with the main goal of grouping observations based on certain criteria. However, in this chapter, we tried to offer a general overview of the main cluster analysis methods, also considered to be the most popular. Lastly, we would like to highlight that the application of this important set of techniques must always be done by using the software chosen for the modeling correctly and sensibly, based on the underlying theory and on researchers' experience and intuition.

11.6 EXERCISES

1) The scholarship department of a certain college wishes to investigate the interdependence relationship between the students entering university in a certain school year, based only on two metric variables (age, in years, and average family income, in US$). The main objective is to propose a still unknown number of new scholarship programs aimed at homogeneous groups of students. In order to do that, data on 100 new students were collected and a dataset was constructed, which can be found in the files Scholarship.sav and Scholarship.dta, with the following variables:

Variable    Description
student     A string variable that identifies all freshmen in the college
age         Student's age (years)
income      Average family income (US$)

We would like you to:
a) Run a cluster analysis through a hierarchical agglomeration schedule, with the complete-linkage method (furthest neighbor) and the squared Euclidean distance. Only present the final part of the agglomeration schedule table and discuss the results. Reminder: Since the variables have different units of measure, it is necessary to apply the Z-scores standardization procedure to prepare the cluster analysis correctly.
b) Based on the table found in the previous item and on the dendrogram, we ask you: how many clusters of students will be formed?
c) Is it possible to identify one or more very discrepant students, in comparison to the others, regarding the two variables under analysis?
d) If the answer to the previous item is "yes," once again run the hierarchical cluster analysis with the same criteria, however, now, without the student(s) considered discrepant. From the analysis of the new results, can new clusters be identified?
e) Discuss how the presence of outliers can hamper the interpretation of results in a cluster analysis.

2) The marketing department of a retail company wants to study possible discrepancies in their 18 stores spread throughout three regional centers and distributed all over the country. In order to maintain and preserve its brand's image and identity, top management would like to know if their stores are homogeneous in terms of customers' perception of attributes, such as services, variety of goods, and organization. Thus, first, a survey with samples of clients was developed in each store, so that data regarding these attributes could be collected. These were defined based on the average score obtained (0 to 100) in each store. Next, a dataset was constructed and it contains the following variables:

Variable        Description
store           A string variable that varies from 01 to 18 and that identifies the commercial establishment (store)
regional        A string variable that identifies each regional center (Regional 1 to Regional 3)
services        Customers' average evaluation of services rendered (score from 0 to 100)
assortment      Customers' average evaluation of the variety of goods (score from 0 to 100)
organization    Customers' average evaluation of the organization (score from 0 to 100)

These data can be found in the files Retail Regional Center.sav and Retail Regional Center.dta. We would like you to:
a) Run a cluster analysis through a hierarchical agglomeration schedule, with the single-linkage method and the Euclidean distance. Present the matrix with distances between each pair of observations. Reminder: Since the variables are in the same unit, it is not necessary to apply the Z-scores standardization procedure.
b) Present and discuss the agglomeration schedule table.
c) Based on the table found in the previous item and on the dendrogram, we ask you: how many clusters of stores will be formed?
d) Run a multidimensional scaling and, after that, present and discuss the two-dimensional chart generated with the relative positions of the stores.
e) Run a cluster analysis by using the k-means procedure, with the number of clusters suggested in item (c), and interpret the one-way analysis of variance for each variable considered in the study, considering a significance level of 0.05. Which variable contributes the most to the creation of at least one of the clusters formed, that is, which of them is the most discriminant of the groups?
f) Is there any correspondence between the allocations of the observations to the groups obtained by the hierarchical and nonhierarchical methods?
g) Is it possible to identify an association between any regional center and a certain discrepant group of stores, which could justify the management's concern regarding the brand's image and identity? If the answer is "yes," once again run the hierarchical cluster analysis with the same criteria, however, now, without this discrepant group of stores. By analyzing the new results, is it possible to see the differences between the other stores more clearly?

3) A financial market analyst has decided to carry out a survey with CEOs and directors of large companies that operate in the health, education, and transport industries, in order to investigate how these companies' operations are carried out and the mechanisms that guide their decision-making processes. In order to do that, he structured a questionnaire with 50 questions, whose answers are only dichotomous, or binary. After applying the questionnaire, he got answers from 35 companies and, from then on, structured a dataset, which can be found in the files Binary Survey.sav and Binary Survey.dta. In a generic way, the variables are:

Variable     Description
q1 to q50    A list of 50 dummy variables that refer to the way the operations and the decision-making processes are carried out in these companies
sector       Company sector

The analyst's main goal is to verify whether companies in the same sector show similarities in relation to the way their operations and decision-making processes are carried out, at least from their own managers' perspective. In order to do that, after collecting the data, a cluster analysis can be elaborated. We would like you to:
a) Based on the hierarchical cluster analysis elaborated with the average-linkage method (between groups) and the simple matching similarity measure for binary variables, analyze the agglomeration schedule generated.
b) Interpret the dendrogram.


c) Check if there is any correspondence between the allocations of the companies to the clusters and the respective sectors, or, in other words, if the companies in the same sector show similarities regarding the way their operations and decision-making processes are carried out.

4) A greengrocer has decided to monitor the sales of his products for 16 weeks (4 months). The main objective is to verify if the sales behavior of three of his main products (bananas, oranges, and apples) is recurrent after a certain period, due to weekly wholesale price fluctuations, prices that are passed on to customers and may impact sales. These data can be found in the files Veggiefruit.sav and Veggiefruit.dta, which have the following variables:

Variable      Description
week          A string variable that varies from 1 to 16 and identifies the week in which the sales were monitored
week_month    A string variable that varies from 1 to 4 and identifies the week in each one of the months
banana        Number of bananas sold that week (un.)
orange        Number of oranges sold that week (un.)
apple         Number of apples sold that week (un.)

We would like you to:
a) Run a cluster analysis through a hierarchical agglomeration schedule, with the single-linkage method (nearest neighbor) and Pearson's correlation measure. Present the matrix of similarity measures (Pearson's correlation) between each row in the dataset (weekly periods). Reminder: Since the variables are in the same unit of measure, it is not necessary to apply the Z-scores standardization procedure.
b) Present and discuss the agglomeration schedule table.
c) Based on the table found in the previous item and on the dendrogram, we ask you: is there any indication that the joint sales behavior of bananas, oranges, and apples is recurrent in certain weeks?

APPENDIX A.1 Detecting Multivariate Outliers

Even though detecting outliers is extremely important when applying practically every single multivariate data analysis technique, we chose to add this Appendix to the present chapter because cluster analysis represents the first set of multivariate exploratory techniques being studied, whose outputs can be used as inputs in several other techniques, and because very discrepant observations may significantly interfere in the creation of clusters. Barnett and Lewis (1994) mention almost 1000 articles in the existing literature on outliers. However, we chose to show a very effective, computationally simple, and fast algorithm for detecting multivariate outliers, bearing in mind that the identification of outliers for each variable individually, that is, in a univariate way, has already been studied in Chapter 3.

A) Brief Presentation of the Blocked Adaptive Computationally Efficient Outlier Nominators Algorithm

Billor et al. (2000), in an extremely important work, present an interesting algorithm that has the purpose of detecting multivariate outliers. It is called Blocked Adaptive Computationally Efficient Outlier Nominators, or simply BACON. This algorithm, explained in a very clear and didactical way by Weber (2010), is defined based on a few steps, described briefly:

1. From a dataset with n observations and j (j = 1, ..., k) variables X, in which each observation is identified by i (i = 1, ..., n), the distance between one observation i, which has a vector with dimension k, x_i = (x_{i1}, x_{i2}, ..., x_{ik}), and the general mean of all sample values (group G), which also has a vector with dimension k, \bar{x} = (\bar{x}_1, \bar{x}_2, ..., \bar{x}_k), is given by the following expression, known as the Mahalanobis distance:

d_{iG} = \sqrt{(x_i - \bar{x})' \cdot S^{-1} \cdot (x_i - \bar{x})} \quad (11.29)


where S represents the covariance matrix of the n observations. Therefore, the first step of the algorithm consists in identifying m (m > k) homogeneous observations (initial group M) that have the smallest Mahalanobis distances in relation to the entire sample. It is important to mention that the dissimilarity measure known as the Mahalanobis distance, not discussed in this chapter, is adopted by the aforementioned authors because it is not susceptible to variables that are in different measurement units.

2. Next, the Mahalanobis distances between each observation i and the mean of the m observation values that belong to group M, which also has a vector with dimension k, \bar{x}_M = (\bar{x}_{M1}, \bar{x}_{M2}, ..., \bar{x}_{Mk}), are calculated, such that:

d_{iM} = \sqrt{(x_i - \bar{x}_M)' \cdot S_M^{-1} \cdot (x_i - \bar{x}_M)} \quad (11.30)

where S_M represents the covariance matrix of the m observations.

3. All the observations with Mahalanobis distances less than a certain threshold are added to group M. This threshold is defined as a corrected percentile of the χ² distribution (85% in the Stata standard). Steps 2 and 3 must be reapplied until there are no more modifications in group M, which will only contain observations that are not considered outliers. Hence, the ones excluded from the group will be considered multivariate outliers.

Weber (2010) codifies the algorithm proposed in the paper written by Billor et al. (2000) in Stata, thus proposing the command bacon. Next, we will present and discuss an example in which this command is used, and whose main advantage is that it is very fast computationally, even when applied to large datasets.

B) Example: The command bacon in Stata

Before the specific preparation of this procedure in Stata, we must install the command bacon by typing findit bacon and clicking on the link st0197 from http://www.stata-journal.com/software/sj10-3. After that, we must click on click here to install. Lastly, going back to the Stata command screen, we can type ssc install moremata and mata: mata mlib index. Having done this, we may apply the command bacon.

To apply this command, let's use the file Bacon.dta, which shows data on the median household income (US$) of 20,000 engineers, their age (years), and the time since they obtained a college degree (years). First of all, we can type the command desc, which allows us to analyze the dataset characteristics. Fig. 11.76 shows this first output. Next, we can type the following command that, based on the algorithm presented, identifies the observations considered multivariate outliers:

bacon income age tgrad, generate(outbacon)

where the term generate(outbacon) makes a new dummy variable be generated in the dataset, called outbacon, which has values equal to 0 for observations not considered outliers, and values equal to 1 for the ones considered outliers. This output can be seen in Fig. 11.77.

FIG. 11.76 Description of the Bacon.dta dataset.


FIG. 11.77 Applying the command bacon in Stata.

FIG. 11.78 Observations classified as multivariate outliers.

Through the figure, it is possible to see that four observations are classified as multivariate outliers. Besides, Stata considers the 85% percentile of the χ² distribution as the standard, used as the separation threshold between the observations considered outliers and nonoutliers, as previously discussed and highlighted by Weber (2010). This is the reason why the term BACON outliers (p = 0.15) appears in the outputs. This value may be altered according to a criterion established by the researcher. However, we would like to emphasize that the standard percentile(0.15) is very adequate for obtaining consistent answers. From the following command, which generates the output seen in Fig. 11.78, we can investigate which observations are classified as outliers:

list if outbacon == 1
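If the researcher does wish to tighten or loosen this threshold, the percentile() option of the bacon command can be changed. The run below is a hypothetical illustration, not part of the example, assuming a stricter 5% cutoff and saving the classification in a new dummy variable:

* hypothetical variation: stricter threshold for nominating outliers
bacon income age tgrad, generate(outbacon05) percentile(0.05)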

Even though we are working with three variables, we can construct two-dimensional scatter plots, which allow us to identify the positions of the observations considered outliers in relation to the others. In order to do that, let's type the following commands, which generate the mentioned charts for each pair of variables:

scatter income age, ml(outbacon) note("0 = not outlier, 1 = outlier")
scatter income tgrad, ml(outbacon) note("0 = not outlier, 1 = outlier")
scatter age tgrad, ml(outbacon) note("0 = not outlier, 1 = outlier")

These three charts can be seen in Figs. 11.79, 11.80, and 11.81.

FIG. 11.79 Variables income and age—Relative position of the observations.

FIG. 11.80 Variables income and tgrad—Relative position of the observations.

FIG. 11.81 Variables age and tgrad—Relative position of the observations.

Despite the fact that outliers have been identified, it is important to mention that the decision about what to do with these observations is entirely up to researchers, who must make it based on their research objectives. As already discussed throughout this chapter, excluding these outliers from the dataset may be an option. However, studying why they became multivariately discrepant can also result in many interesting research outcomes.

Chapter 12

Principal Component Factor Analysis

Love and truth are so intertwined that it is practically impossible to disentangle and separate them. They are like the two sides of a coin.
Mahatma Gandhi

12.1 INTRODUCTION

Exploratory factor analysis techniques are very useful when we intend to work with variables that have, between themselves, relatively high correlation coefficients and we wish to establish new variables that capture the joint behavior of the original variables. Each one of these new variables is called a factor, which can be understood as a cluster of variables grouped based on previously established criteria. Therefore, factor analysis is a multivariate technique that tries to identify a relatively small number of factors that represent the joint behavior of interdependent original variables. Thus, while cluster analysis, studied in the previous chapter, uses distance or similarity measures to group observations and form clusters, factor analysis uses correlation coefficients to group variables and generate factors.

Among the methods used to determine factors, the one known as principal components is, without a doubt, the most widely used in factor analysis, because it is based on the assumption that uncorrelated factors can be extracted from linear combinations of the original variables. Consequently, from a set of original variables correlated to one another, the principal component factor analysis allows another set of variables (factors), resulting from the linear combination of the first set, to be determined.

Even though, as we know, the term confirmatory factor analysis often appears in the existing literature, factor analysis is essentially an exploratory, or interdependence, multivariate technique, since it does not have a predictive nature for other observations not initially present in the sample, and the inclusion of new observations in the dataset makes it necessary to reapply the technique, so that more accurate and updated new factors can be generated. According to Reis (2001), factor analysis can be used with the main exploratory goal of reducing the data dimension, aiming at creating factors from the original variables, as well as with the objective of confirming an initial hypothesis that the data may be reduced to a certain factor, or a certain dimension, which was previously established. Regardless of the objective, factor analysis will continue to be exploratory. If researchers aim to use a technique to, in fact, confirm the relationships found in the factor analysis, they can use structural equation modeling, for instance.

The principal component factor analysis has four main objectives: (1) to identify correlations between the original variables in order to create factors that represent the linear combination of those variables (structural reduction); (2) to verify the validity of previously established constructs, bearing in mind the allocation of the original variables to each factor; (3) to prepare rankings by generating performance indexes from the factors; and (4) to extract orthogonal factors for future use in confirmatory multivariate techniques that require the absence of multicollinearity.

Imagine that a researcher is interested in studying the interdependence between several quantitative variables that translate the socioeconomic behavior of a nation's municipalities. In this situation, factors that may possibly explain the behavior of the original variables can be determined, and, in this regard, the factor analysis is used to reduce the data structurally and, later on, to create a socioeconomic index that captures the joint behavior of these variables.
From this index, we may even propose a performance ranking of the municipalities, and the factors themselves can be used in a possible cluster analysis. In another situation, factors extracted from the original variables can be used as explanatory variables of another variable (dependent), not initially considered in the analysis. For example, factors obtained from the joint behavior of grades in certain 12th grade subjects can be used as explanatory variables of students' general classification in the college entrance exams, or of whether students passed the exams or not. In these situations, note that the factors (orthogonal to one another) are used, instead of the original variables themselves, as explanatory variables of a certain phenomenon in confirmatory multivariate models, such as multiple or logistic regression, in order to eliminate possible multicollinearity problems. Nevertheless, it is important to highlight that this procedure only makes sense when we intend to elaborate a diagnostic regarding the dependent variable's behavior, without aiming at having forecasts for other observations not initially present in the sample. Since new observations do not have the corresponding values of the factors generated, obtaining these values is only possible if we include such observations in a new factor analysis.

In a third situation, imagine that a retailer is interested in assessing their clients' level of satisfaction by applying a questionnaire in which the questions have been previously classified into certain groups. For instance, questions A, B, and C were classified into the group quality of services rendered, questions D and E, into the group positive perception of prices, and questions F, G, H, and I, into the group variety of goods. After applying the questionnaire to a significant number of customers, in which these nine variables are collected by attributing scores that vary from 0 to 10, the retailer has decided to elaborate a principal component factor analysis to verify if, in fact, the combination of variables reflects the construct previously established. If this occurs, the factor analysis will have been used to validate the construct, presenting a confirmatory objective.

In all of these situations, we can see that the original variables from which the factors will be extracted are quantitative, because a factor analysis begins with the study of the behavior of Pearson's correlation coefficients between the variables. Nonetheless, it is common for researchers to use the incorrect arbitrary weighting procedure with qualitative variables, as, for example, variables on the Likert scale, and, from then on, to apply a factor analysis. This is a serious error! There are exploratory techniques meant exclusively for studying the behavior of qualitative variables, as, for instance, the correspondence analysis and homogeneity analysis, and a factor analysis is definitely not meant for such a purpose, as discussed by Fávero and Belfiore (2017).

In a historical context, the development of factor analysis is partly due to Pearson's (1896) and Spearman's (1904) pioneering work. While Karl Pearson developed a rigorous mathematical treatment regarding what we traditionally call correlation at the beginning of the 20th century, Charles Edward Spearman published highly original work in which the interrelationships between students' performance in several subjects, such as French, English, Mathematics, and Music, were evaluated. Since the grades in these subjects showed strong correlation, Spearman proposed that scores resulting from apparently incompatible tests shared a single general factor, and that students who got good grades had a more developed psychological or intelligence component. Generally speaking, Spearman excelled in applying mathematical methods and correlation studies to the analysis of the human mind. Decades later, in 1933, Harold Hotelling, a statistician, mathematician, and influential economics theoretician, decided to call Principal Component Analysis the analysis that determines components from the maximization of the original data's variance.
Also in the first half of the 20th century, psychologist Louis Leon Thurstone, from an investigation of Spearman's ideas and based on the application of certain psychological tests, whose results were submitted to a factor analysis, identified people's seven primary mental abilities: spatial visualization, verbal meaning, verbal fluency, perceptual speed, numerical ability, reasoning, and rote memory. In psychology, the term mental factors is even used for variables that have greater influence over a certain behavior. Currently, factor analysis is used in several fields of knowledge, such as marketing, economics, strategy, finance, accounting, actuarial science, engineering, logistics, psychology, medicine, ecology, and biostatistics, among others.

The principal component factor analysis must be defined based on the underlying theory and on the researcher's experience, so that it can be possible to apply the technique correctly and to analyze the results obtained. In this chapter, we will discuss the principal component factor analysis technique, with the following objectives: (1) to introduce the concepts; (2) to present the step-by-step modeling in an algebraic and practical way; (3) to interpret the results obtained; and (4) to show the application of the technique in SPSS and Stata. Following the logic proposed in the book, first, we develop the algebraic solution of an example linked to the presentation of the concepts. Only after introducing these concepts will the procedures for running the technique in SPSS and Stata be presented.

12.2 PRINCIPAL COMPONENT FACTOR ANALYSIS

Many are the procedures inherent to factor analysis, with different methods for determining (extracting) factors from Pearson's correlation matrix. The most frequently used method, which was adopted in this chapter for extracting factors, is known as principal components, in which the consequent structural reduction is also called the Karhunen-Loève transformation.


TABLE 12.1 General Dataset Model for Developing a Factor Analysis

Observation i    X1i    X2i    ...    Xki
1                X11    X21    ...    Xk1
2                X12    X22    ...    Xk2
3                X13    X23    ...    Xk3
...              ...    ...    ...    ...
n                X1n    X2n    ...    Xkn

In the following sections, we will discuss the theoretical development of the technique, as well as a practical example. While the main concepts will be presented in Sections 12.2.1–12.2.5, Section 12.2.6 is meant for solving a practical example algebraically, from a dataset.

12.2.1 Pearson's Linear Correlation and the Concept of Factor

Let's imagine a dataset that has n observations and, for each observation i (i = 1, ..., n), values corresponding to each one of the k metric variables X, as shown in Table 12.1. From the dataset, and given our intention of extracting factors from the k variables X, we must define correlation matrix r, which displays the values of Pearson's linear correlation between each pair of variables, as shown in Expression (12.1):

r = \begin{pmatrix} 1 & r_{12} & \cdots & r_{1k} \\ r_{21} & 1 & \cdots & r_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ r_{k1} & r_{k2} & \cdots & 1 \end{pmatrix} \quad (12.1)

Correlation matrix r is symmetrical in relation to the main diagonal that, obviously, shows values equal to 1. For example, for variables X1 and X2, Pearson's correlation r12 can be calculated by using Expression (12.2):

r_{12} = \frac{\sum_{i=1}^{n} \left(X_{1i} - \bar{X}_1\right) \cdot \left(X_{2i} - \bar{X}_2\right)}{\sqrt{\sum_{i=1}^{n} \left(X_{1i} - \bar{X}_1\right)^2} \cdot \sqrt{\sum_{i=1}^{n} \left(X_{2i} - \bar{X}_2\right)^2}} \quad (12.2)

where X̄1 and X̄2 represent the means of variables X1 and X2, respectively, and this expression is analogous to Expression (4.11), defined in Chapter 4. Thus, since Pearson's correlation is a measure of the level of linear relationship between two metric variables, which may vary between −1 and 1, a value closer to one of these extreme values indicates the existence of a linear relationship between the two variables under analysis, which, therefore, may significantly contribute to the extraction of a single factor. On the other hand, a Pearson correlation that is very close to 0 indicates that the linear relationship between the two variables is practically nonexistent. Therefore, different factors can be extracted.

Let's imagine a hypothetical situation in which a certain dataset only has three variables (k = 3). A three-dimensional scatter plot can be constructed from the values of each variable for each observation. The plot can be seen in Fig. 12.1. Based only on the visual analysis of the chart in Fig. 12.1, it is difficult to assess the behavior of the linear relationships between each pair of variables. Thus, Fig. 12.2 shows the projection of the points that correspond to each observation in each one of the planes formed by the pairs of variables, highlighting, in the dotted line, the adjustment that represents the linear relationship between the respective variables. While Fig. 12.2A shows that there is a significant linear relationship between variables X1 and X2 (a very high Pearson correlation), Fig. 12.2B and C make it very clear that there is no linear relationship between X3 and these variables. Fig. 12.3 displays these projections in a three-dimensional plot, with the respective linear adjustments in each plane (the dotted lines). Thus, in this hypothetical example, while variables X1 and X2 may be represented by a single factor in a very significant way, which we will call F1, variable X3 may be represented by another factor, F2, orthogonal to F1. Fig. 12.4 illustrates the extraction of these new factors in a three-dimensional way.
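In Stata, such a correlation matrix can be inspected directly, before any factor extraction, with the pwcorr command. A minimal sketch, assuming three hypothetical variables x1, x2, and x3 in memory (the sig option adds the significance level of each coefficient):

* minimal sketch: Pearson correlation matrix with significance levels
pwcorr x1 x2 x3, sig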


FIG. 12.1 Three-dimensional scatter plot for a hypothetical situation with three variables.

FIG. 12.2 Projection of the points in each plane formed by a certain pair of variables. (A) Relationship between X1 and X2: positive and very high Pearson correlation. (B) Relationship between X1 and X3: Pearson correlation very close to 0. (C) Relationship between X2 and X3: Pearson correlation very close to 0.

So, factors can be understood as representations of latent dimensions that explain the behavior of the original variables. Having presented these initial concepts, it is important to emphasize that in many cases researchers may choose to not extract a factor represented in a considerable way by only one variable (in this case, factor F2), and what will define the extraction of each one of the factors is the calculation of the eigenvalues from correlation matrix r, as we will study in Section 12.2.3. Nevertheless, before that, it will be necessary to check the overall adequacy of the factor analysis, which will be discussed in the following section.


FIG. 12.3 Projection of the points in a three-dimensional plot with linear adjustments per plane.

FIG. 12.4 Factor extraction.

12.2.2 Overall Adequacy of the Factor Analysis: Kaiser-Meyer-Olkin Statistic and Bartlett's Test of Sphericity

An adequate extraction of factors from the original variables requires correlation matrix r to have relatively high and statistically significant values. As discussed by Hair et al. (2009), even though visually analyzing correlation matrix r does not reveal whether the factor extraction will in fact be adequate, a significant number of values less than 0.30 represents a preliminary indication that the factor analysis may not be adequate. In order to verify the overall adequacy of the factor extraction itself, we must use the Kaiser-Meyer-Olkin statistic (KMO) and Bartlett's test of sphericity.

The KMO statistic gives us the proportion of variance considered common to all the variables in the sample under analysis, that is, the proportion that can be attributed to the existence of a common factor. This statistic varies from 0 to 1 and, while values closer to 1 indicate that the variables share a very high proportion of variance (high Pearson correlations), values closer to 0 are a result of low Pearson correlations between the variables, which may indicate that the factor analysis will not be adequate. The KMO statistic, presented initially by Kaiser (1970), can be calculated through Expression (12.3):

KMO = \frac{\sum_{l=1}^{k} \sum_{c=1}^{k} r_{lc}^{2}}{\sum_{l=1}^{k} \sum_{c=1}^{k} r_{lc}^{2} + \sum_{l=1}^{k} \sum_{c=1}^{k} \varphi_{lc}^{2}}, \quad l \neq c \quad (12.3)

where l and c represent the rows and columns of correlation matrix r, respectively, and the terms φ represent the partial correlation coefficients between two variables. While Pearson's correlation coefficients r are also called zero-order correlation coefficients, partial correlation coefficients φ are also known as higher-order correlation coefficients. For three variables, they are also called first-order correlation coefficients, for four variables, second-order correlation coefficients, and so on.

Let's imagine a hypothetical situation in which a certain dataset shows three variables once again (k = 3). Is it possible that, in fact, r12 reflects the level of linear relationship between X1 and X2 if variable X3 is related to the other two? In this situation, r12 may not represent the true level of linear relationship between X1 and X2 when X3 is present, which may provide a false impression regarding the nature of the relationship between the first two. Thus, partial correlation coefficients may contribute to the analysis, since, according to Gujarati and Porter (2008), they are used when researchers wish to find out the correlation between two variables, either by controlling or ignoring the effects of other variables present in the dataset. For our hypothetical situation, it is the correlation coefficient regardless of X3's influence over X1 and X2, if any. Hence, for three variables X1, X2, and X3, we can define the first-order correlation coefficients in the following way:

\varphi_{12,3} = \frac{r_{12} - r_{13} \cdot r_{23}}{\sqrt{\left(1 - r_{13}^{2}\right) \cdot \left(1 - r_{23}^{2}\right)}} \quad (12.4)

where φ12,3 represents the correlation between X1 and X2, maintaining X3 constant,

\varphi_{13,2} = \frac{r_{13} - r_{12} \cdot r_{23}}{\sqrt{\left(1 - r_{12}^{2}\right) \cdot \left(1 - r_{23}^{2}\right)}} \quad (12.5)

where φ13,2 represents the correlation between X1 and X3, maintaining X2 constant, and

\varphi_{23,1} = \frac{r_{23} - r_{12} \cdot r_{13}}{\sqrt{\left(1 - r_{12}^{2}\right) \cdot \left(1 - r_{13}^{2}\right)}} \quad (12.6)

where φ23,1 represents the correlation between X2 and X3, maintaining X1 constant. In general, a first-order correlation coefficient can be obtained through the following expression:

\varphi_{ab,c} = \frac{r_{ab} - r_{ac} \cdot r_{bc}}{\sqrt{\left(1 - r_{ac}^{2}\right) \cdot \left(1 - r_{bc}^{2}\right)}} \quad (12.7)

where a, b, and c can assume values 1, 2, or 3, corresponding to the three variables under analysis. Conversely, for a case in which there are four variables in the analysis, the general expression of a certain partial correlation coefficient (second-order correlation coefficient) is given by:

\varphi_{ab,cd} = \frac{\varphi_{ab,c} - \varphi_{ad,c} \cdot \varphi_{bd,c}}{\sqrt{\left(1 - \varphi_{ad,c}^{2}\right) \cdot \left(1 - \varphi_{bd,c}^{2}\right)}} \quad (12.8)

where φab,cd represents the correlation between Xa and Xb, maintaining Xc and Xd constant, bearing in mind that a, b, c, and d may take on values 1, 2, 3, or 4, which correspond to the four variables under analysis. Obtaining higher-order correlation coefficients, in which five or more variables are considered in the analysis, should always be done based on the determination of lower-order partial correlation coefficients. In Section 12.2.6, we will propose a practical example by using four variables, in which the algebraic solution of the KMO statistic will be obtained through Expression (12.8).

It is important to highlight that, even if Pearson's correlation coefficient between two variables is 0, the partial correlation coefficient between them may not be equal to 0, depending on the values of Pearson's correlation coefficients between each one of these variables and the others present in the dataset. In order for a factor analysis to be considered adequate, the partial correlation coefficients between the variables must be low. This fact denotes that the variables share a high proportion of variance, and disregarding one or more of them in the analysis may hamper the quality of the factor extraction. Therefore, according to a widely accepted criterion found in the existing literature, Table 12.2 gives us an indication of the relationship between the KMO statistic and the overall adequacy of the factor analysis.

On the other hand, Bartlett's test of sphericity (Bartlett, 1954) consists in comparing correlation matrix r to an identity matrix I of the same dimension. If the differences between the corresponding values outside the main diagonal of each matrix are not statistically different from 0, at a certain significance level, we may consider that the factor extraction will not be adequate. In other words, in this case, Pearson's correlations between each pair of variables are statistically equal to 0, which makes any attempt at performing a factor extraction from the original variables unfeasible. So, we can define the null and alternative hypotheses of Bartlett's test of sphericity in the following way:

H_0: r = \begin{pmatrix} 1 & r_{12} & \cdots & r_{1k} \\ r_{21} & 1 & \cdots & r_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ r_{k1} & r_{k2} & \cdots & 1 \end{pmatrix} = I = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix}

TABLE 12.2 Relationship Between the KMO Statistic and the Overall Adequacy of the Factor Analysis KMO Statistic

Overall Adequacy of the Factor Analysis

Between 1.00 and 0.90

Marvelous

Between 0.90 and 0.80

Meritorious

Between 0.80 and 0.70

Middling

Between 0.70 and 0.60

Mediocre

Between 0.60 and 0.50

Miserable

Less than 0.50

Unacceptable

390

PART

V Multivariate Exploratory Data Analysis

0

1 r12 B r21 1 H1 : r ¼ B @ ⋮ ⋮ rk1 rk2

⋯ ⋯ ⋱ ⋯

1 0 r1k 1 B0 r2k C C 6¼ I ¼ B @⋮ ⋮ A 1 0

0 1 ⋮ 0

⋯ ⋯ ⋱ ⋯

1 0 0C C ⋮A 1

The statistic corresponding to Bartlett’s test of sphericity is an w2 statistic, which has the following expression:  

2k+5  ln jDj w2Bartlett ¼  ðn  1Þ  6

(12.9)

Þ degrees of freedom. We know that n is the sample size and k is the number of variables. In addition, D represents with k  ðk1 2 the determinant of correlation matrix r. Thus, for a certain number of degrees of freedom and a certain significance level, Bartlett’s test of sphericity allows us to check if the total value of the w2Bartlett statistic is higher than the statistic’s critical value. If this is true, we may state that Pearson’s correlations between the pairs of variables are statistically different from 0 and that, therefore, factors can be extracted from the original variables and the factor analysis is adequate. When we develop a practical example in Section 12.2.6, we will also discuss the calculations of the w2Bartlett statistic and the result of Bartlett’s test of sphericity. It is important to emphasize that we should always favor Bartlett’s test of sphericity over the KMO statistic to take a decision about the factor analysis’s overall adequacy. Given that the former is a test with a certain significance level, and the latter is only a coefficient (a statistic) calculated without any set distribution of probabilities or hypotheses that allow us to evaluate the corresponding significance level to make a decision. In addition, it is important to mention that for only two original variables the KMO statistic will always be equal to 0.50. Conversely, the w2Bartlett statistic may indicate if the null hypothesis of the test of sphericity was rejected or not, depending on the magnitude of Pearson’s correlation between both variables. Thus, while the KMO statistic will be 0.50 in these situations, Bartlett’s test of sphericity will allow researchers to decide whether to extract one factor from the two original variables or not. In contrast, for three original variables, it is very common for researchers to extract two factors with the statistical significance of Bartlett’s test of sphericity, however, with the KMO statistic less than 0.50. These two situations emphasize even more the greater relevance of Bartlett’s test of sphericity in relation to the KMO statistic in the decisionmaking process. Finally, we must mention that the recommendation to study Cronbach’s alpha’s magnitude, before studying the overall adequacy of the factor analysis, is commonly found in the existing literature, so that the reliability with which a factor can be extracted from original variables can be evaluated. We would like to highlight that Cronbach’s alpha only offers researchers indications of the internal consistency of the variables in the dataset so that a single factor can be extracted. Therefore, determining it is not a mandatory requisite for developing the factor analysis, since this technique allows the extraction of most factors. Nevertheless, for pedagogical purposes, we will discuss the main concepts of Cronbach’s alpha in the Appendix of this chapter, with its algebraic determination and corresponding applications in SPSS and Stata software. Having discussed these concepts and verified the overall adequacy of the factor analysis, we can now move on to the definition of the factors.

12.2.3 Defining the Principal Component Factors: Determining the Eigenvalues and Eigenvectors of Correlation Matrix r and Calculating the Factor Scores Since a factor represents the linear combination of the original variables, for k variables, we can define a maximum number of k factors (F1, F2, …, Fk), analogous to the maximum number of clusters that can be defined from a sample with n observations, as we discussed in the previous chapter, since a factor can also be understood as the result of the clustering of variables. Therefore, for k variables, we have: F1i ¼ s11  X1i + s21  X2i + ⋯ + sk1  Xki F2i ¼ s12  X1i + s22  X2i + ⋯ + sk2  Xki ⋮ Fki ¼ s1k  X1i + s2k  X2i + ⋯ + skk  Xki

(12.10)

where the terms s are known as factor scores, which represent the parameters of a linear model that relates a certain factor to the original variables. Calculating the factor scores is essential in the context of the factor analysis technique and is elaborated by determining the eigenvalues and eigenvectors of correlation matrix r. In Expression (12.11), we once again show correlation matrix r, which has already been presented in Expression (12.1).

Principal Component Factor Analysis Chapter

0

1 r12 B r21 1 r¼B @ ⋮ ⋮ rk1 rk2

⋯ ⋯ ⋱ ⋯

1 r1k r2k C C ⋮ A 1

12

391

(12.11)

This correlation matrix, with dimensions k  k, shows k eigenvalues l2 (l21 l22 … l2k ), which can be obtained from solving the following equation:   det l2  I  r ¼ 0

(12.12)

where I is the identity matrix, also with dimensions k  k. Since a certain factor represents the result of the clustering of variables, it is important to highlight that: l21 + l22 + ⋯ + l2k ¼ k

(12.13)

2 l  1 r 12 ⋯ r1k 2 r l  1 ⋯ r 21 2k ¼ 0 ⋮ ⋮ ⋱ ⋮ 2 r r ⋯ l  1 k1 k2

(12.14)

Expression (12.12) can be rewritten as follows:

from which we can define the eigenvalue matrix L2 the following way: 0

l21 B 0 L2 ¼ B @⋮ 0

0 l22 ⋮ 0

⋯ ⋯ ⋱ ⋯

1 0 0C C ⋮A l2k

(12.15)

In order to define the eigenvectors of matrix r based on the eigenvalues, we must solve the following equation system for each eigenvalue l2 (l21, l22, …, l2k ): l

Determining eigenvectors v11, v21, …, vk1 from the first eigenvalue (l21): 0

from where we obtain:

l

1 0 1 0 1 v11 0 l21  1 r12 ⋯ r1k 2 B r C B C B C 21 l1  1 ⋯ r2k C  B v21 C ¼ B 0 C B @ ⋮ ⋮ ⋱ ⋮ A @ ⋮ A @⋮A 2 vk1 0 rk1 rk2 ⋯ l1  1

(12.16)

 8  2 l1  1  v11  r12  v21 …  r1k  vk1 ¼ 0 > >   < r21  v11 + l21  1  v21 …  r2k  vk1 ¼ 0 > ⋮  >  : rk1  v11  rk2  v21 … + l21  1  vk1 ¼ 0

(12.17)

Determining eigenvectors v12, v22, …, vk2 from the second eigenvalue (l22): 1 0 1 0 1 v12 0 l22  1 r12 ⋯ r1k 2 B r C B v22 C B 0 C l  1 ⋯ r 21 2k C  B 2 B C¼B C @ ⋮ ⋮ ⋱ ⋮ A @ ⋮ A @⋮A vk2 0 rk1 rk2 ⋯ l22  1 0

(12.18)

392

PART

V Multivariate Exploratory Data Analysis

from where we obtain:

l

 8  2 l2  1  v12  r12  v22 …  r1k  vk2 ¼ 0 > >   < r21  v12 + l22  1  v22 …  r2k  vk2 ¼ 0 > ⋮  >  : rk1  v12  rk2  v22 … + l22  1  vk2 ¼ 0

(12.19)

Determining eigenvectors v1k, v2k, …, vkk from the kth eigenvalue (l2k ): 0

1 0 1 0 1 v1k 0 l2k  1 r12 ⋯ r1k 2 B r C B v2k C B 0 C l  1 ⋯ r 21 2k k B CB C¼B C @ ⋮ ⋮ ⋱ ⋮ A @ ⋮ A @⋮A vkk 0 rk1 rk2 ⋯ l2k  1

(12.20)

 8  2 lk  1  v1k  r12  v2k …  r1k  vkk ¼ 0 > >   < r21  v1k + l2k  1  v2k …  r2k  vkk ¼ 0 > ⋮  >  : rk1  v1k  rk2  v2k … + l2k  1  vkk ¼ 0

(12.21)

from where we obtain:

Thus, we can calculate the factor scores of each factor by determining the eigenvalues and eigenvectors of correlation matrix r. The factor scores vectors can be defined as follows: l

Factor scores of the first factor: 0 v 1 q11 ffiffiffiffiffi B C l2 C B 0 1 B v 1C s11 B 21 C B s21 C B qffiffiffiffiffi C B C 2C S1 ¼ @ A ¼ B B l1 C ⋮ B C B ⋮ C sk1 B vk1 C @ qffiffiffiffiffi A l21

l

Factor scores of the second factor: 0 v 1 q12 ffiffiffiffiffi B l2 C C 0 1 B B v 2C s12 B q22 C B s22 C B ffiffiffiffiffi C C ¼ B l2 C S2 ¼ B @ ⋮ A B 2C B C ⋮ B C sk2 B vk2 C @ qffiffiffiffiffi A l22

l

(12.22)

(12.23)

Factor scores of the kth factor: 0 v 1 q1kffiffiffiffiffi B l2 C C 0 1 B B v kC s1k B q2kffiffiffiffiffi C C B s2k C B C B 2C Sk ¼ B @ ⋮ A ¼ B lk C B C B ⋮ C skk B vkk C @ qffiffiffiffiffi A l2k

(12.24)

Principal Component Factor Analysis Chapter

12

393

Since the factor scores of each factor are standardized by the respective eigenvalues, the factors of the set of equations presented in Expression (12.10) must be obtained by multiplying each factor score by the corresponding original variable, standardized by using the Z-scores procedure. Thus, we can obtain each one of the factors based on the following equations: v v v ffiffiffiffiffi  ZX1i + q21ffiffiffiffiffi  ZX2i + ⋯ + qk1ffiffiffiffiffi  ZXki F1i ¼ q11 2 2 l1 l1 l21 v v v ffiffiffiffiffi  ZX1i + q22ffiffiffiffiffi  ZX2i + ⋯ + qk2ffiffiffiffiffi  ZXki F2i ¼ q12 (12.25) l22 l22 l22 ⋮ v v v Fki ¼ q1kffiffiffiffiffi  ZX1i + q2kffiffiffiffiffi  ZX2i + ⋯ + qkkffiffiffiffiffi  ZXki l2k l2k l2k where ZXi represents the standardized value of each variable X for a certain observation i. It is important to emphasize that all the factors extracted show, between themselves, Pearson correlations equal to 0, that is, they are orthogonal to one another. A more perceptive researcher will notice that the factor scores of each factor correspond exactly to the estimated parameters of a multiple linear regression model that has, as a dependent variable, the factor itself and, as explanatory variables, the standardized variables. Mathematically, it is also possible to verify the existing relationship between the eigenvectors, correlation matrix r, and eigenvalue matrix L2. Consequently, defining eigenvector matrix V as follows: 0 1 v11 v12 ⋯ v1k B v21 v22 ⋯ v2k C C V¼B (12.26) @ ⋮ ⋮ ⋱ ⋮ A vk1 vk2 ⋯ vkk we can prove that: V’  r  V ¼ L2 or:

0

v11 B v12 B @ ⋮ v1k

v21 v22 ⋮ v2k

⋯ ⋯ ⋱ ⋯

1 0 vk1 1 r12 B r21 1 vk2 C CB ⋮ A @ ⋮ ⋮ vkk rk1 rk2

⋯ ⋯ ⋱ ⋯

1 0 r1k v11 B v21 r2k C CB ⋮ A @ ⋮ 1 vk1

(12.27)

v12 v22 ⋮ vk2

⋯ ⋯ ⋱ ⋯

1 0 2 v1k l1 B v2k C C¼B 0 ⋮ A @⋮ vkk 0

0 l22 ⋮ 0

⋯ ⋯ ⋱ ⋯

1 0 0C C ⋮A l2k

(12.28)

In Section 12.2.6, we will discuss a practical example from which this relationship may be demonstrated. While in Section 12.2.2, we discussed the factor analysis’s overall adequacy, in this section, we will discuss the procedures for carrying out the factor extraction, if the technique is considered adequate. Even knowing that the maximum number of factors is also equal to k for k variables, it is essential for researchers to define, based on a certain criterion, the adequate number of factors that, in fact, represent the original variables. In our hypothetical example in Section 12.2.1, we saw that only two factors (F1 and F2) would be enough to represent the three original variables (X1, X2, and X3). Although researchers are free to determine the number of factors to be extracted in the analysis, in a preliminary way, since they may wish to verify the validity of a previously established construct (procedure known as a priori criterion), for instance, it is essential to carry out an analysis based on the magnitude of the eigenvalues calculated from correlation matrix r. As the eigenvalues correspond to the proportion of variance shared by the original variables to form each factor, as we will discuss in Section 12.2.4, since l21 l22 … l2k and bearing in mind that factors F1, F2, …, Fk are obtained from the respective eigenvalues, factors extracted from smaller eigenvalues are formed from smaller proportions of variance shared by the original variables. Since a factor represents a certain cluster of variables, factors extracted from eigenvalues less than 1 may possibly not be able to represent the behavior of a single original variable (of course there are exceptions to this rule, which occur in cases in which a certain eigenvalue is less than, but also very close to 1). The criterion for choosing the number of factors, in which only the factors that correspond to eigenvalues greater than 1 are considered, is often used and known as the latent root criterion or Kaiser criterion. The factor extraction method presented in this chapter is known as principal components, and the first factor F1, formed by the highest proportion of variance shared by the original variables, is also called principal factor. This method is often mentioned in the existing literature and is used in practical applications whenever researchers wish to elaborate a structural reduction

394

PART

V Multivariate Exploratory Data Analysis

of the data in order to create orthogonal factors, to define observation rankings by using the factors generated, and even to confirm the validity of previously established constructs. Other factor extraction methods, such as, the generalized least squares, unweighted least squares, maximum likelihood, alpha factoring, and image factoring, have different criteria and certain specificities and, even though they can also be found in the existing literature, they will not be discussed in this book. Moreover, it is common to discuss the need to apply the factor analysis to variables that have multivariate normal distribution, in order to show consistency when determining the factor scores. Nevertheless, it is important to emphasize that multivariate normality is a very rigid assumption, only necessary for a few factor extraction methods, such as, the maximum likelihood method. Most factor extraction methods do not require the assumption of data multivariate normality and, as discussed by Gorsuch (1983), the principal component factor analysis seems to be, in practice, very robust against breaks in normality.

12.2.4

Factor Loadings and Communalities

Having established the factors, we can now define the factor loadings, which simply are Pearson correlations between the original variables and each one of the factors. Table 12.3 shows the factor loadings for each variable-factor pair. Based on the latent root criterion (in which only factors resulting from eigenvalues greater than 1 are considered), we assume that the factor loadings between the factors that correspond to eigenvalues less than 1 and all the original variables are low, since they will have already presented higher Pearson correlations (loadings) with factors previously extracted from greater eigenvalues. In the same way, original variables that only share a small portion of variance with the other variables will have high factor loadings in only a single factor. If this occurs for all original variables, there will not be significant differences between correlation matrix r and identity matrix I, making the w2Bartlett statistic very low. This fact allows us to state that the factor analysis will not be adequate, and, in this situation, researchers may choose not to extract factors from the original variables. As the factor loadings are Pearson’s correlations between each variable and each factor, the sum of the squares of these loadings in each row of Table 12.3 will always be equal to 1, since each variable shares part of its proportion of variance with all the k factors, and the sum of the proportions of variance (factor loadings or squared Pearson correlations) will be 100%. Conversely, if less than k factors are extracted, due to the latent root criterion, the sum of the squared factor loadings in each row will not be equal to 1. This sum is called communality, which represents the total shared variance of each variable in all the factors extracted from eigenvalues greater than 1. So, we can say that: c211 + c212 + ⋯ ¼ communalityX1 c221 + c222 + ⋯ ¼ communalityX2 ⋮ c2k1 + c2k2 + ⋯ ¼ communalityXk

(12.29)

The main objective of the analysis of communalities is to check if any variable ends up not sharing a significant proportion of variance with the factors extracted. Even though there is no cutoff point from which a certain communality can be considered high or low, since the sample size can interfere in this assessment, the existence of considerably low communalities in relation to the others can indicate to researchers that they may need to reconsider including the respective variable into the factor analysis.

TABLE 12.3 Factor Loadings Between Original Variables and Factors Factor Variable

F1

F2



Fk

X1

c11

c12



c1k

X2

c21

c22

c2k









Xk

ck1

ck2

ckk

Principal Component Factor Analysis Chapter

12

395

Therefore, after defining the factors based on the factor scores, we can state that the factor loadings will be exactly the same as the parameters estimated in a multiple linear regression model that shows, as a dependent variable, a certain standardized variable ZX and, as explanatory variables, the factors themselves, and the coefficient of determination R2 of each model is equal to the communality of the respective original variable. The sum of the squared factor loadings in each column of Table 12.3, on the other hand, will be equal to the respective eigenvalue, since the ratio between each eigenvalue and the total number of variables can be understood as the proportion of variance shared by all k original variables to form each factor. So, we can say that: c211 + c221 + ⋯ + c2k1 ¼ l21 c212 + c222 + ⋯ + c2k2 ¼ l22 ⋮ c21k + c22k + ⋯ + c2kk ¼ l2k

(12.30)

After establishing the factors and the calculation of the factor loadings, it is also possible for some variables to have intermediate (neither very high nor very low) Pearson correlations (factor loadings) with all the factors extracted, although its communality is relatively not so low. In this case, although the solution of the factor analysis has already been obtained in an adequate way and considered concluded, researchers can, in the cases in which the factor loadings table shows intermediate values for one or more variables in all the factors, elaborate a rotation of these factors, so that Pearson’s correlations between the original variables and the new factors generated can be increased. In the following section, we will discuss factor rotation.

12.2.5

Factor Rotation

Once again, let’s imagine a hypothetical situation in which a certain dataset only has three variables (k ¼ 3). After preparing the principal component factor analysis, two factors, orthogonal to one another, are extracted, with factor loadings (Pearson correlations) with each one of the three original variables, according to Table 12.4. In order to construct a chart with the relative positions of each variable in each factor (a chart known as loading plot), we can consider the factor loadings to be coordinates (abscissas and ordinates) of the variables in a Cartesian plane formed by both orthogonal factors. The plot can be seen in Fig. 12.5. In order to better visualize the variables better represented by a certain factor, we can think about a rotation around the origin of the originally extracted factors F1 and F2, so that we can bring the points corresponding to variables X1, X2, and X3 0 0 closer to one of the new factors. These are called rotated factors F1 and F2. Fig. 12.6 shows this process in a simplified way. Based on Fig. 12.6, for each variable under analysis, we can see that while the loading for one factor increases, for the other, it decreases. Table 12.5 shows the loading redistribution for our hypothetical situation. Thus, for a generic situation, we can say that rotation is a procedure that maximizes the loadings of each variable in a certain factor, to the detriment of the others. In this regard, the final effect of rotation is the redistribution of factor loadings to factors that initially had smaller proportions of variance shared by all the original variables. The main objective is to minimize the number of variables with high loadings in a certain factor, since each one of the factors will start having more significant loadings only with some of the original variables. Consequently, rotation may simplify the interpretation of the factors.

TABLE 12.4 Factor Loadings Between Three Variables and Two Factors Factor Variable

F1

F2

X1

c11

c12

X2

c21

c22

X3

c31

c32

396

PART

V Multivariate Exploratory Data Analysis

FIG. 12.5 Loading plot for a hypothetical situation with three variables and two factors.

FIG. 12.6 Defining the rotated factors from the factors original.

TABLE 12.5 Original and Rotated Factor Loadings for Our Hypothetical Situation Factor Original Factor Loadings

Rotated Factor Loadings 0

Variable

F1

F2

F1

X1

c11

c12

j c11j > jc11 j

X2

c21

c22

j c21j > jc21 j

X3

c31

c32

j c31j < jc31 j

0

F2

0

jc12j < jc12 j

0

0

jc22j < jc22 j

0

jc32j > jc32 j

0

0

Principal Component Factor Analysis Chapter

12

397

Despite the fact that communalities and the total proportion of variance shared by all the variables in all the factors are not modified by the rotation (and neither are the KMO statistic or w2Bartlett), the proportion of variance shared by the original 0 variables in each factor is redistributed and, therefore, modified. In other words, new eigenvalues are set l 0 0 0 (l1, l2, …, lk) from the rotated factor loadings. Thus, we can say that: c0 211 + c0 212 + ⋯ ¼ communalityX1

c0 221 + c0 222 + ⋯ ¼ communalityX2 ⋮ c0 2k1 + c0 2k2 + ⋯ ¼ communalityXk

(12.31)

and that: c0 211 + c0 221 + ⋯ + c0 2k1 ¼ l0 1 6¼ l21 2 c0 212 + c0 222 + ⋯ + c0 2k2 ¼ l0 2 6¼ l22 ⋮ 2 c0 21k + c0 22k + ⋯ + c0 2kk ¼ l0 k 6¼ l2k 2

(12.32)

even if Expression (12.13) is respected, that is: l21 + l22 + ⋯ + l2k ¼ l0 1 + l0 2 + ⋯ + l0 k ¼ k 2

2

2

(12.33)

Besides, new rotated factor scores are obtained from the rotation of factors, s0 , such that the final expressions of the rotated factors will be: F01i ¼ s011  ZX1i + s021  ZX2i + ⋯ + s0k1  ZXki F02i ¼ s012  ZX1i + s022  ZX2i + ⋯ + s0k2  ZXki ⋮ F0ki ¼ s01k  ZX1i + s02k  ZX2i + ⋯ + s0kk  ZXki

(12.34)

It is important to highlight that the overall adequacy of the factor analysis (KMO statistic and Bartlett’s test of sphericity) is not altered by the rotation, since correlation matrix r continues the same. Even though there are several factor rotation methods, the orthogonal rotation method, also known as Varimax, whose main purpose is to minimize the number of variables that have high loadings on a certain factor through the redistribution of the factor loadings and maximization of the variance shared in factors that correspond to lower eigenvalues, is the most frequently used and will be used in this chapter to solve a practical example. That is where the name Varimax comes from. This method was proposed by Kaiser (1958). The algorithm behind the Varimax rotation method consists in determining a rotation angle y in which pairs of factors are equally rotated. Thus, as discussed by Harman (1976), for a certain pair of factors F1 and F2, for example, the rotated factor loadings c’ between the two factors and the k original variables are obtained from the original factor loadings c, through the following matrix multiplication: 0 0 0 1 1 c11 c12 c11 c012   0 C B c21 c22 C B 0 B C  cos y seny ¼ B c21 c22 C (12.35) @ ⋮ ⋮ A @ ⋮ ⋮ A seny cos y 0 0 ck1 ck2 ck1 ck2 where y, the counterclockwise rotation angle, is obtained by the following expression:

2ð D  k  A  BÞ y ¼ 0:25  arctan C  k  ð A 2  B2 Þ

(12.36)

where: A¼

k  X l¼1

c21l c22l  communalityl communalityl



k  X 2 l¼1

c1l  c2l communalityl

 (12.37)

 (12.38)

398

PART

V Multivariate Exploratory Data Analysis



" k X l¼1



c21l c22l  communalityl communalityl

2

  2

c1l  c2l communalityl

2 # (12.39)

k  X l¼1

  

c21l c22l c1l  c2l  2  communalityl communalityl communalityl

(12.40)

In Section 12.2.6, we will use these Varimax rotation method expressions to determine the rotated factor loadings from the original loadings. Besides Varimax, we can also mention other orthogonal rotation methods, such as, Quartimax and Equamax, even though they are less frequently mentioned in the existing literature and less used in practice. In addition to them, the researcher may also use oblique rotation methods, in which nonorthogonal factors are generated. Although they are not discussed in this chapter, we should also mention the Direct Oblimin and Promax methods in this category. Since oblique rotation methods can sometimes be used when we wish to validate a certain construct, whose initial factors are not correlated, we recommend that an orthogonal rotation method be used so that factors extracted in other multivariate techniques can be used later, such as, certain confirmatory models, in which the lack of multicollinearity of the explanatory variables is a mandatory premise.

12.2.6

A Practical Example of the Principal Component Factor Analysis

Imagine that the same professor, deeply engaged in academic and pedagogical activities, is now interested in studying how his students’ grades behave so that, afterwards, he can propose the creation of a school performance ranking. In order to do that, he collected information on the final grades, which vary from 0 to 10, of each one of his 100 students in the following subjects: Finance, Costs, Marketing, and Actuarial Science. Part of the dataset can be seen in Table 12.6. The complete dataset can be found in the file FactorGrades.xls. Through this dataset, it is possible to construct Table 12.7, which shows Pearson’s correlation coefficients between each pair of variables, calculated by using the logic presented in Expression (12.2).

TABLE 12.6 Example: Final Grades in Finance, Costs, Marketing, and Actuarial Science

Student

Final Grade in Finance (X1i)

Final Grade in Costs (X2i)

Final Grade in Marketing (X3i)

Final Grade in Actuarial Science (X4i)

Gabriela

5.8

4.0

1.0

6.0

Luiz Felipe

3.1

3.0

10.0

2.0

Patricia

3.1

4.0

4.0

4.0

Gustavo

10.0

8.0

8.0

8.0

Leticia

3.4

2.0

3.2

3.2

Ovidio

10.0

10.0

1.0

10.0

Leonor

5.0

5.0

8.0

5.0

Dalila

5.4

6.0

6.0

6.0

Antonio

5.9

4.0

4.0

4.0

8.9

5.0

2.0

8.0

… Estela

Principal Component Factor Analysis Chapter

12

399

TABLE 12.7 Pearson’s Correlation Coefficients for Each Pair of Variables finance

costs

marketing

finance

1.000

0.756

0.030

0.711

costs

0.756

1.000

0.003

0.809

0.030

0.003

1.000

0.044

0.711

0.809

0.044

1.000

marketing actuarial science

Therefore, we can write the expression of the correlation matrix r as follows: 0 1 0 1 r12 r13 r14 1:000 0:756 0:030 B r21 1 r23 r24 C B 0:756 1:000 0:003 C B r¼B @ r31 r32 1 r34 A ¼ @ 0:030 0:003 1:000 r41 r42 r43 1 0:711 0:809 0:044

actuarial science

1 0:711 0:809 C C 0:044 A 1:000

which has determinant D ¼ 0.137. By analyzing correlation matrix r, it is possible to verify that only the grades corresponding to the variable marketing do not have correlations with the grades in the other subjects, represented by the other variables. On the other hand, these show relatively high correlations with one another (0.756 between finance and costs, 0.711 between finance and actuarial, and 0.809 between costs and actuarial), which indicates that they may share significant variance to form one factor. Although this preliminary analysis is important, it cannot represent more than a simple diagnostic, since the overall adequacy of the factor analysis needs to be evaluated based on the KMO statistic and, mainly, by using the result of Bartlett’s test of sphericity. As we discussed in Section 12.2.2, the KMO statistic provides the proportion of variance considered common to all the variables present in the analysis, and, in order to establish its calculation, we need to determine partial correlation coefficients ’ between each pair of variables. In this case, it will be second-order correlation coefficients, since we are working with four variables simultaneously. Consequently, based on Expression (12.7), first, we need to determine the first-order correlation coefficients used to calculate of the second-order correlation coefficients. Table 12.8 shows these coefficients. Hence, from these coefficients and by using Expression (12.8), we can calculate the second-order correlation coefficients considered in the KMO statistic’s expression. Table 12.9 shows these coefficients. TABLE 12.8 First-Order Correlation Coefficients r12 r13  r23 ’12, 3 ¼ pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ¼ 0:756 ð1r213 Þ  ð1r223 Þ r r

r

r13 r12  r23 ’13, 2 ¼ pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ¼ 0:049 ð1r212 Þ  ð1r223 Þ r r

r

r14 r12  r24 ’14, 2 ¼ pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ¼ 0:258 ð1r212 Þ  ð1r224 Þ r r

r

14 13 34 ¼ 0:711 ’14, 3 ¼ pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ð1r213 Þ  ð1r234 Þ

23 12 13 ’23, 1 ¼ pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ¼ 0:039 ð1r212 Þ  ð1r213 Þ

24 12 14 ’24, 1 ¼ pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ¼ 0:590 ð1r212 Þ  ð1r214 Þ

r24 r23  r34 ¼ 0:810 ’24, 3 ¼ pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ð1r223 Þ  ð1r234 Þ

r34 r13  r14 ’34, 1 ¼ pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ¼ 0:033 ð1r213 Þ  ð1r214 Þ

r34 r23  r24 ’34, 2 ¼ pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ¼ 0:080 ð1r223 Þ  ð1r224 Þ

TABLE 12.9 Second-Order Correlation Coefficients ’

’

’

12, 3 14, 3 24, 3 ’12, 34 ¼ qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi    ¼ 0:438

1’214, 3  1’224, 3



’

’

13, 2 14, 2 34, 2 ’13, 24 ¼ qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi    ¼ 0:029

1’214, 2  1’234, 2



’

’

14, 2 13, 2 34, 2 ’14, 23 ¼ qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi    ¼ 0:255

1’213, 2  1’234, 2



’

’

23, 1 24, 1 34, 1 ’23, 14 ¼ qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi    ¼ 0:072

1’224, 1  1’234, 1



’

’

24, 1 23, 1 34, 1 ’24, 13 ¼ qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi    ¼ 0:592

1’223, 1  1’234, 1



’

’

34, 1 23, 1 24, 1 ’34, 12 ¼ qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi    ¼ 0:069

1’223, 1  1’224, 1

400

PART

V Multivariate Exploratory Data Analysis

So, based on Expression (12.3), we can calculate the KMO statistic. The terms of the expression are given by: k X k X

r2lc ¼ ð0:756Þ2 + ð0:030Þ2 + ð0:711Þ2 + ð0:003Þ2 + ð0:809Þ2 + ð0:044Þ2 ¼ 1:734

l¼1 c¼1 k X k X

’2lc ¼ ð0:438Þ2 + ð0:029Þ2 + ð0:255Þ2 + ð0:072Þ2 + ð0:592Þ2 + ð0:069Þ2 ¼ 0:619

l¼1 c¼1

from where we obtain: KMO ¼

1:734 ¼ 0:737 1:734 + 0:619

Based on the criterion presented in Table 12.2, the value of the KMO statistic suggests that the overall adequacy of the factor analysis is middling. To test whether, in fact, correlation matrix r is statistically different from identity matrix I with the same dimension, we must use Bartlett’s test of sphericity, whose w2Bartlett statistic is given by Expression (12.9). For n ¼ 100 observations, k ¼ 4 variables, and correlation matrix r determinant D ¼ 0.137, we have:  

24+5 2  ln ð0:137Þ ¼ 192:335 wBartlett ¼  ð100  1Þ  6 Þ ¼ 6 degrees of freedom. Therefore, by using Table D in the Appendix, we have w2c ¼ 12.592 (critical w2 for 6 with 4  ð41 2 degrees of freedom and with a significance level of 0.05). Thus, since w2Bartlett ¼ 192.335 > w2c ¼ 12.592, we can reject the null hypothesis that correlation matrix r is statistically equal to identity matrix I, at a significance level of 0.05. Software packages like SPSS and Stata do not offer the w2c for the defined degrees of freedom and a certain significance level. However, they offer the significance level of w2Bartlett for these degrees of freedom. So, instead of analyzing if w2Bartlett > w2c , we must verify if the significance level of w2Bartlett is less than 0.05 (5%) so that we can continue performing the factor analysis. Thus: If P-value (either Sig. w2Bartlett, or Prob. w2Bartlett) < 0.05, correlation matrix r is not statistically equal to identity matrix I with the same dimension. The significance level of w2Bartlett can be obtained in Excel by using the command Formulas ! Insert Function ! CHIDIST, which will open a dialog box, as shown in Fig. 12.7. As we can see in Fig. 12.7, the P-value of the w2Bartlett statistic is considerably less than 0.05 (w2Bartlett Pvalue ¼ 8.11  1039), that is, Pearson’s correlations between the pairs of variables are statistically different from 0 and, therefore, factors can be extracted from the original variables, and the factor analysis very adequate.

FIG. 12.7 Obtaining the significance level of w2 (command Insert Function).

Principal Component Factor Analysis Chapter

12

401

Having verified the factor analysis’s overall adequacy, we can move on to the definition of the factors. In order to do that, we must initially determine the four eigenvalues l2 (l21 l22 l23 l24) of correlation matrix r, which can be obtained from solving Expression (12.12). Therefore, we have: 2 l  1 0:756 0:030 0:711 0:756 l2  1 0:003 0:809 0:030 0:003 l2  1 0:044 ¼ 0 0:711 0:809 0:044 l2  1 from where we obtain:

8 2 l1 ¼ 2:519 > > > > < l2 ¼ 1:000 2 > l23 ¼ 0:298 > > > : 2 l4 ¼ 0:183

Consequently, based on Expression (12.15), eigenvalue matrix L2 can be written as follows: 0 1 2:519 0 0 0 B 0 1:000 0 0 C C L2 ¼ B @ 0 0 0:298 0 A 0 0 0 0:183 Note that Expression (12.13) is satisfied, that is: l21 + l22 + ⋯ + l2k ¼ 2:519 + 1:000 + 0:298 + 0:183 ¼ 4 Since the eigenvalues correspond to the proportion of variance shared by the original variables to form each factor, we can construct a shared variance table (Table 12.10). By analyzing Table 12.10, we can say that while 62.975% of the total variance are shared to form the first factor, 25.010% are shared to form the second factor. The third and fourth factors, whose eigenvalues are less than 1, are formed through smaller proportions of shared variance. Since the most common criterion used to choose the number of factors is the latent root criterion (Kaiser criterion), in which only the factors that correspond to eigenvalues greater than 1 are taken into consideration, the researcher can choose to conduct all the subsequent analysis with only the first two factors, formed by sharing 87.985% of the total variance of the original variables, that is, with a total variance loss of 12.015%. Nonetheless, for pedagogical purposes, let’s discuss how to calculate the factor scores by determining the eigenvectors that correspond to the four eigenvalues. Consequently, in order to define the eigenvectors of matrix r based on the four eigenvalues calculated, we must solve the following equation systems for each eigenvalue, based on Expressions (12.16)–(12.21): Determining eigenvectors v11, v21, v31, v41 from the first eigenvalue (l21 ¼ 2.519):

l

8 ð2:519  1:000Þ  v11  0:756  v21 + 0:030  v31  0:711  v41 ¼ 0 > > > < 0:756  v + ð2:519  1:000Þ  v  0:003  v  0:809  v ¼ 0 11 21 31 41 > 0:030  v  0:003  v + ð 2:519  1:000 Þ  v + 0:044  v > 11 21 31 41 ¼ 0 > : 0:711  v11  0:809  v21 + 0:044  v31 + ð2:519  1:000Þ  v41 ¼ 0

TABLE 12.10 Variance Shared by the Original Variables to Form Each Factor Factor 1 2 3 4

Eigenvalue l2

Shared Variance (%)

2.519

2:519

1.000

1:000

0.298

0:298

0.183

0:183

Cumulative Shared Variance (%)

4

 100 ¼ 62:975

62.975

4

 100 ¼ 25, 010

87.985

4

 100 ¼ 7:444

95.428

4

 100 ¼ 4:572

100.000

402

PART

V Multivariate Exploratory Data Analysis

from where we obtain:

l

0

1 0 1 v11 0:5641 B v21 C B 0:5887 C B C¼B C @ v A @ 0:0267 A 31 v41 0:5783

Determining eigenvectors v12, v22, v32, v42 from the second eigenvalue (l22 ¼ 1.000): 8 ð1:000  1:000Þ  v12  0:756  v22 + 0:030  v32  0:711  v42 ¼ 0 > > > < 0:756  v + ð1:000  1:000Þ  v  0:003  v  0:809  v ¼ 0 12 22 32 42 > 0:030  v  0:003  v + ð 1:000  1:000 Þ  v + 0:044  v > 12 22 32 42 ¼ 0 > : 0:711  v12  0:809  v22 + 0:044  v32 + ð1:000  1:000Þ  v42 ¼ 0

from where we obtain:

l

0

1 0 1 v12 0:0068 B v22 C B 0:0487 C B C¼B C @ v A @ 0:9987 A 32 v42 0:0101

Determining eigenvectors v13, v23, v33, v43 from the third eigenvalue (l23 ¼ 0.298): 8 ð0:298  1:000Þ  v13  0:756  v23 + 0:030  v33  0:711  v43 ¼ 0 > > > < 0:756  v + ð0:298  1:000Þ  v  0:003  v  0:809  v ¼ 0 13 23 33 43 > 0:030  v  0:003  v + ð 0:298  1:000 Þ  v + 0:044  v ¼0 > 13 23 33 43 > : 0:711  v13  0:809  v23 + 0:044  v33 + ð0:298  1:000Þ  v43 ¼ 0

from where we obtain:

l

0

1 0 1 v13 0:8008 B v23 C B 0:2201 C B C¼B C @ v33 A @ 0:0003 A v43 0:5571

Determining eigenvectors v14, v24, v34, v44 from the fourth eigenvalue (l24 ¼ 0.183): 8 ð0:183  1:000Þ  v14  0:756  v24 + 0:030  v34  0:711  v44 ¼ 0 > > > < 0:756  v + ð0:183  1:000Þ  v  0:003  v  0:809  v ¼ 0 14 24 34 44 > 0:030  v  0:003  v + ð 0:183  1:000 Þ  v + 0:044  v > 14 24 34 44 ¼ 0 > : 0:711  v14  0:809  v24 + 0:044  v34 + ð0:183  1:000Þ  v44 ¼ 0

from where we obtain:

0

1 0 1 v14 0:2012 B v24 C B 0:7763 C B C¼B C @ v A @ 0:0425 A 34 v44 0:5959

Principal Component Factor Analysis Chapter

12

403

After having determined the eigenvectors, a more inquisitive researcher may prove the relationship presented in Expression (12.27), that is: V0  r  V ¼ L2 0 1 0 1 0:5641 0:5887 0:0267 0:5783 1:000 0:756 0:030 0:711 B 0:0068 0:0487 0:9987 0:0101 C B 0:756 1:000 0:003 0:809 C B C B C @ 0:8008 0:2201 0:0003 0:5571 A  @ 0:030 0:003 1:000 0:044 A 0:2012 0:7763 0:0425 0:5959 0:711 0:809 0:044 1:000 0 1 0 1 2:519 0 0 0 0:5641 0:0068 0:8008 0:2012 B 0:5887 0:0487 0:2201 0:7763 C B 0 1:000 0 0 C C B C B @ 0:0267 0:9987 0:0003 0:0425 A ¼ @ 0 0 0:298 0 A 0:5783 0:0101 0:5571 0:5959 0 0 0 0:183 Based on Expressions (12.22)–(12.24), we can calculate the factor scores that correspond to each one of the standardized variables for each one of the factors. Thus, from Expression (12.25), we are able to write the expressions for factors F1, F2, F3, and F4, as follows: 0:5641 0:5887 0:267 0:5783 F1i ¼ pffiffiffiffiffiffiffiffiffiffiffi  Zfinancei + pffiffiffiffiffiffiffiffiffiffiffi  Zcostsi  pffiffiffiffiffiffiffiffiffiffiffi  Zmarketingi + pffiffiffiffiffiffiffiffiffiffiffi  Zactuariali 2:519 2:519 2:519 2:519 0:0068 0:0487 0:9987 0:0101 F2i ¼ pffiffiffiffiffiffiffiffiffiffiffi  Zfinancei + pffiffiffiffiffiffiffiffiffiffiffi  Zcostsi + pffiffiffiffiffiffiffiffiffiffiffi  Zmarketingi  pffiffiffiffiffiffiffiffiffiffiffi  Zactuariali 1:000 1:000 1:000 1:000 0:8008 0:2201 0:0003 0:5571 F3i ¼ pffiffiffiffiffiffiffiffiffiffiffi  Zfinancei  pffiffiffiffiffiffiffiffiffiffiffi  Zcostsi  pffiffiffiffiffiffiffiffiffiffiffi  Zmarketingi  pffiffiffiffiffiffiffiffiffiffiffi  Zactuariali 0:298 0:298 0:298 0:298 0:2012 0:7763 0:0425 0:5959 F4i ¼ pffiffiffiffiffiffiffiffiffiffiffi  Zfinancei  pffiffiffiffiffiffiffiffiffiffiffi  Zcostsi + pffiffiffiffiffiffiffiffiffiffiffi  Zmarketingi + pffiffiffiffiffiffiffiffiffiffiffi  Zactuariali 0:183 0:183 0:183 0:183 from where we obtain: F1i ¼ 0:355  Zfinancei + 0:371  Zcostsi  0:017  Zmarketingi + 0:364  Zactuariali F2i ¼ 0:007  Zfinancei + 0:049  Zcostsi + 0:999  Zmarketingi  0:010  Zactuariali F3i ¼ 1:468  Zfinancei  0:403  Zcostsi  0:001  Zmarketingi  1:021  Zactuariali F4i ¼ 0:470  Zfinancei  1:815  Zcostsi + 0:099  Zmarketingi + 1:394  Zactuariali Based on the factor expressions and on the standardized variables, we can calculate the values corresponding to each factor for each observation. Table 12.11 shows these results for part of the dataset. For the first observation in the sample (Gabriela), for example, we can see that: F1Gabriela ¼ 0:355  ð0:011Þ + 0:371  ð0:290Þ  0:017  ð1:650Þ + 0:364  ð0:273Þ ¼ 0:016 F2Gabriela ¼ 0:007  ð0:011Þ + 0:049  ð0:290Þ + 0:999  ð1:650Þ  0:010  ð0:273Þ ¼ 1:665 F3Gabriela ¼ 1:468  ð0:011Þ  0:403  ð0:290Þ  0:001  ð1:650Þ  1:021  ð0:273Þ ¼ 0:176 F4Gabriela ¼ 0:470  ð0:011Þ  1:815  ð0:290Þ + 0:099  ð1:650Þ + 1:394  ð0:273Þ ¼ 0:739 It is important to emphasize that all the factors extracted have Pearson correlations equal to 0, between themselves, that is, they are orthogonal to one another. A more inquisitive researcher may also verify that the factor scores that correspond to each factor are exactly the estimated parameters of a multiple linear regression model that has, as a dependent variable, the factor itself, and as explanatory variables, the standardized variables. Having established the factors, we can define the factor loadings, which correspond to Pearson’s correlation coefficients between the original variables and each one of the factors. Table 12.12 shows the factor loadings for the data in our example. 
For each original variable, the highest value of the factor loading was highlighted in Table 12.12. Consequently, while the variables finance, costs, and actuarial show stronger correlations with the first factor, we can see that only the variable marketing shows stronger correlation with the second factor. This proves the need for a second factor in order for all the

404

PART

V Multivariate Exploratory Data Analysis

TABLE 12.11 Calculation of the Factors for Each Observation Student

Zfinancei

Zcostsi

Zmarketingi

F2i

F3i

Gabriela

0.011

0.290

1.650

0.273

0.016

1.665

0.176

0.739

Luiz Felipe

0.876

0.697

1.532

1.319

1.076

1.503

0.342

0.831

Patricia

0.876

0.290

0.590

0.523

0.600

0.603

0.634

0.672

1.334

1.337

0.825

1.069

1.346

0.887

0.327

0.228

Leticia

0.779

1.104

0.872

0.841

0.978

0.922

0.161

0.379

Ovidio

1.334

2.150

1.650

1.865

1.979

1.553

0.812

0.841

Leonor

0.267

0.116

0.825

0.125

0.111

0.829

0.312

0.429

Dalila

0.139

0.523

0.118

0.273

0.242

0.139

0.694

0.623

0.021

0.290

0.590

0.523

0.281

0.597

0.682

0.250

Estela

0.982

0.113

1.297

1.069

0.802

1.293

0.305

1.616

Mean

0.000

0.000

0.000

0.000

0.000

0.000

0.000

0.000

Standard deviation

1.000

1.000

1.000

1.000

1.000

1.000

1.000

1.000

Gustavo

Antonio

Zactuariali

F1i

F4i



TABLE 12.12 Factor Loadings (Pearson’s Correlation Coefficients) Between Variables and Factors Factor Variable

F1

F2

F3

F4

finance

0.895

0.007

0.437

0.086

costs

0.934

0.049

0.120

0.332

0.042

0.999

0.000

0.018

0.918

0.010

0.304

0.255

marketing actuarial science

variables to share significant proportions of variance. However, the third and fourth factors present relatively low correlations with the original variables, which explains the fact that the respective eigenvalues are less than 1. If the variable marketing had not been inserted into the analysis, only the first factor would be necessary to explain the joint behavior of the other variables, and the other factors would also have respective eigenvalues less than 1. Therefore, as discussed in Section 12.2.4, we can verify that factor loadings between factors corresponding to eigenvalues less than 1 are relatively low, since they have already shown stronger Pearson correlations with factors previously extracted from greater eigenvalues. Based on Expression (12.30), we can see that the sum of the squared factor loadings in each column in Table 12.12 will be the respective eigenvalue that, as discussed before, can be understood as the proportion of variance shared by the four original variables to form each factor. Therefore, we have: ð0:895Þ2 + ð0:934Þ2 + ð0:042Þ2 + ð0:918Þ2 ¼ 2:519 ð0:007Þ2 + ð0:049Þ2 + ð0:999Þ2 + ð0:010Þ2 ¼ 1:000 ð0:437Þ2 + ð0:120Þ2 + ð0:000Þ2 + ð0:304Þ2 ¼ 0:298 ð0:086Þ2 + ð0:332Þ2 + ð0:018Þ2 + ð0:255Þ2 ¼ 0:183 from which we can prove that the second eigenvalue only reached value 1 due to the high existing factor loading for the variable marketing.

Principal Component Factor Analysis Chapter

12

405

Furthermore, from the factor loadings presented in Table 12.12, we can also calculate the communalities, which represent the total shared variance of each variable in all the factors extracted from eigenvalues greater than 1. So, based on Expression (12.29), we can write: communalityfinance ¼ ð0:895Þ2 + ð0:007Þ2 ¼ 0:802 communalitycosts ¼ ð0:934Þ2 + ð0:049Þ2 ¼ 0:875 communalitymarketing ¼ ð0:042Þ2 + ð0:999Þ2 ¼ 1:000 communalityactuarial ¼ ð0:918Þ2 + ð0:010Þ2 ¼ 0:843 Consequently, even though the variable marketing is the only one that has a high factor loading with the second factor, it is the variable in which the lowest proportion of variance is lost to form both factors. On the other hand, the variable finance is the one that presents the highest loss of variance to form these two factors (around 19.8%). If we had considered the factor loadings of the four factors, surely, all the communalities would be equal to 1. As we discussed in Section 12.2.4, we can see that the factor loadings are exactly the parameters estimated in a multiple linear regression model, which shows, as a dependent variable, a certain standardized variable and, as explanatory variables, the factors themselves, in which the coefficient of determination R2 of each model is equal to the communality of the respective original variable. Therefore, for the first two factors, we can construct a chart in which the factor loadings of each variable are plotted in each one of the orthogonal axes that represent factors F1 and F2, respectively. This chart, known as a loading plot, can be seen in Fig. 12.8. By analyzing the loading plot, the behavior of the correlations becomes clear. While the variables finance, costs, and actuarial show high correlation with the first factor (X-axis), the variable marketing shows strong correlation with the second factor (Y-axis). More inquisitive researchers may investigate the reasons why this phenomenon occurs, since, sometimes, while the subjects Finance, Costs, and Actuarial Science are taught in a more quantitative way, Marketing can be taught in a more qualitative and behavioral manner. However, it is important to mention that the definition of factors does not force researchers to name them, because, normally, this is not a simple task. Factor analysis does not have “naming factors” as one of its goals and, in case we intend to do that, researchers need to have vast knowledge about the phenomenon being studied, and confirmatory techniques can help them in this endeavor. At this moment, we can consider the preparation of the principal component factor analysis concluded. Nevertheless, as discussed in Section 12.2.5, if researchers wish to obtain a clearer visualization of the variables better represented by a certain factor, they can elaborate a rotation using the Varimax orthogonal method, which maximizes the loadings of each variable in a certain factor. In our example, since we already have an excellent idea of the variables with high loadings in each factor, being the loading plot (Fig. 12.8) already very clear, rotation may be considered unnecessary. Therefore, it will only be elaborated for pedagogical purposes, since, sometimes, researchers may find themselves in situations in which such phenomenon is not so clear. Consequently, based on the factor loadings for the first two factors (first two columns of Table 12.12), we will obtain rotated factor loadings c0 after rotating both factors for an angle y. Thus, based on Expression (12.35), we can write:

1

FIG. 12.8 Loading plot.

marketing

0.5

finance

costs

0

actuarial –0.5

–1 –1

–0.5

0

0.5

1

406

PART

V Multivariate Exploratory Data Analysis

0

0:895 B 0:934 B @ 0:042 0:918

0 0 1 c11 0:007   B c021 0:049 C cos y seny C ¼B @ ⋮ 0:999 A seny cos y c0k1 0:010

1 c012 c022 C C ⋮ A c0k2

where the counterclockwise rotation angle y is obtained from Expression (12.36). Nevertheless, before that, we must determine the values of terms A, B, C, and D present in Expressions (12.37)–(12.40). Constructing Tables 12.13–12.16 helps us for this purpose. So, taking the k ¼ 4 variables into consideration and based on Expression (12.36), we can calculate the counterclockwise rotation angle y as follows: 9 8

< 2  ð D  k  A  BÞ 2  ½ð0:181Þ  4  ð1:998Þ  ð0:012Þ = h i ¼ 0:029rad ¼ 0  25: arctan y ¼ 0:25  arctan :ð3:963Þ  4  ð1:998Þ2  ð0:012Þ2 ; C  k  ðA2  B2 Þ

TABLE 12.13 Obtaining Term A to Calculate Rotation Angle u  Variable

c1

c2

communality

c21l communalityl

l

communality

c1l  c2l 2  communality

l

finance

0.895

0.007

0.802

1.000

costs

0.934

0.049

0.875

0.995

0.042

0.999

1.000

0.996

0.918

0.010

0.843

1.000

A (sum)

1.998

marketing actuarial science



c2

2l  communality

TABLE 12.14 Obtaining Term B to Calculate Rotation Angle u  Variable

c1

c2

finance

0.895

0.007

0.802

0.015

costs

0.934

0.049

0.875

0.104

0.042

0.999

1.000

0.085

0.918

0.010

0.843

0.022

marketing actuarial science

B (sum)



0.012

TABLE 12.15 Obtaining Term C to Calculate Rotation Angle u  Variable

c1

c2

communality

c21l c22l  communalityl communalityl

finance

0.895

0.007

0.802

1.000

costs

0.934

0.049

0.875

0.978

0.042

0.999

1.000

0.986

0.918

0.010

0.843

0.999

C (sum)

3.963

marketing actuarial science

2

  2

c1l  c2l communalityl

2

Principal Component Factor Analysis Chapter

12

407

TABLE 12.16 Obtaining Term D to Calculate Rotation Angle u  Variable

c1

c2

communality

c21l communalityl

finance

0.895

0.007

0.802

0.015

costs

0.934

0.049

0.875

0.103

0.042

0.999

1.000

0.084

0.918

0.010

0.843

0.022

marketing actuarial science

D (sum)

   c22l c1l  c2l  2  communality  communality l

l

0.181

And, finally, we can calculate the rotated factor loadings: 0 0 0 1 c11 0:895 0:007   B c021 B 0:934 0:049 C cos 0:029 sen0:029 B B C @ 0:042 0:999 A  sen0:029 cos 0:029 ¼ @ c031 c041 0:918 0:010

1 0 c012 0:895 B 0:935 c022 C C¼B c032 A @ 0:013 c042 0:917

1 0:019 0:021 C C 1:000 A 0:037

Table 12.17 shows, in a consolidated way, the rotated factor loadings through the Varimax method for the data in our example. As we have already mentioned, even though the results without the rotation already showed which variables presented high loadings in each factor, rotation ended up distributing, even if lightly for the data in our example, the variable loadings to each one of the rotated factors. A new loading plot (now with rotated loadings) can also demonstrate this situation (Fig. 12.9).

TABLE 12.17 Rotated Factor Loadings Through the Varimax Method Factor Variable

F2

F1

finance

0.895

0.019

costs

0.935

0.021

0.013

1.000

0.917

0.037

marketing actuarial science

1

0

0

FIG. 12.9 Loading plot with rotated loadings.

marketing

0.5

costs 0

finance actuarial –0.5

–1 –1

–0.5

0

0.5

1

408

PART

V Multivariate Exploratory Data Analysis

Even though the plots in Figs. 12.8 and 12.9 are very similar, since rotation angle y is very small in this example, it is common for the researcher to find situations in which the rotation will contribute considerably for an easier understanding of the loadings, which can, consequently, simplify the interpretation of the factors. It is important to emphasize that the rotation does not change the communalities, that is, Expression (12.31) can be verified: communalityfinance ¼ ð0:895Þ2 + ð0:019Þ2 ¼ 0:802 communalitycosts ¼ ð0:935Þ2 + ð0:021Þ2 ¼ 0:875 communalitymarketing ¼ ð0:013Þ2 + ð1:000Þ2 ¼ 1:000 communalityactuarial ¼ ð0:917Þ2 + ð0:037Þ2 ¼ 0:843 Nonetheless, rotation changes the eigenvalues corresponding to each factor. Thus, for the two rotated factors, we have: ð0:895Þ2 + ð0:935Þ2 + ð0:013Þ2 + ð0:917Þ2 ¼ l0 1 ¼ 2:518 2 ð0:019Þ2 + ð0:021Þ2 + ð1:000Þ2 + ð0:037Þ2 ¼ l0 2 ¼ 1:002 2

0

0

Table 12.18 shows, based on the new eigenvalues l21 and l22, the proportions of variance shared by the original variables to form both rotated factors. In comparison to Table 12.10, we can see that even though there is no change in the sharing of 87.985% of the total variance of the original variables to form the rotated factors, the rotation redistributes the variance shared by the variables in each factor. As we have already discussed, the factor loadings correspond to the parameters estimated in a multiple linear regression model that shows, as a dependent variable, a certain standardized variable and, as explanatory variables, the factors themselves. Therefore, through algebraic operations, we can arrive at the factor scores expressions from the loadings, since they represent the estimated parameters of the respective regression models that have, as a dependent variable, the factors and, as explanatory variables, the standardized variables. Consequently, from the rotated factor loadings (Table 12.17), we arrive at 0 0 the following rotated factors expressions F1 and F2. F01i ¼ 0:355  Zfinancei + 0:372  Zcostsi + 0:012  Zmarketingi + 0:364  Zactuariali F02i ¼ 0:004  Zfinancei + 0:038  Zcostsi + 0:999  Zmarketingi  0:021  Zactuariali 0

Finally, the professor wishes to develop a school performance ranking of his students. Since the two rotated factors, F1 and 0 F2, are formed by the higher proportions of variance shared by the original variables (in this case, 62.942% and 25.043% of the total variance, respectively, as shown in Table 12.18) and correspond to eigenvalues greater than 1, they will be used to create the desired school performance ranking. A well-accepted criterion that is used to form rankings from factors is known as weighted rank-sum criterion, in which, for each observation, the values of all factors obtained (that have eigenvalues greater than 1) weighted by the respective proportions of shared variance are added, with the subsequent ranking of the observations based on the results obtained. This criterion is well accepted because it considers the performance of all the original variables, since only considering the first factor (principal factor criterion) may not consider the positive performance, for instance, obtained in a certain variable that may possibly share a considerable proportion of variance with the second factor. For 10 students chosen from the sample, Table 12.19 shows the result of the school performance ranking resulting from the ranking created after the sum of the values obtained from the factors weighted by the respective proportions of shared variance. The complete ranking can be found in the file FactorGradesRanking.xls. It is essential to highlight that the creation of performance rankings from original variables is considered to be a static procedure, since the inclusion of new observations or variables may alter the factor scores, which makes the preparation of a

TABLE 12.18 Variance Shared by the Original Variables to Form Both Rotated Factors Factor 1 2

0

Eigenvalue l 2

Shared Variance (%)

Cumulative Shared Variance (%)

2.518

2:518

1.002

1:002

4

 100 ¼ 62:942

62.942

4

 100 ¼ 25:043

87.985

Principal Component Factor Analysis Chapter

12

409

TABLE 12.19 School Performance Ranking Through the Weighted Rank-Sum Criterion 0

Student

Zfinancei

Zcostsi

Zmarketingi

Zactuariali

0

F1i

0

F2i

(F1i 0.62942) 0 + (F2i 0.25043)

Ranking

Adelino

1.30

2.15

1.53

1.86

1.959

1.568

1.626

1

Renata

0.60

2.15

1.53

1.86

1.709

1.570

1.469

2

Ovidio

1.33

2.15

1.65

1.86

1.932

1.611

0.813

13

Kamal

1.33

2.07

1.65

1.86

1.902

1.614

0.793

14

Itamar

1.29

0.55

1.53

1.04

1.022

1.536

0.259

57

Luiz Felipe

0.88

0.70

1.53

1.32

1.032

1.535

0.265

58

0.01

0.29

1.65

0.27

0.032

1.665

0.437

73

0.50

0.50

0.94

1.16

0.443

0.939

0.514

74

Viviane

1.64

1.16

1.01

1.00

1.390

1.029

1.133

99

Gilmar

1.52

1.16

1.40

1.44

1.512

1.409

1.304

100





⋮ Gabriela Marina ⋮

new factor analysis mandatory. As time goes by, the evolution of the phenomena represented by the variables may change the correlation matrix, which makes it necessary to reapply the technique in order to generate new factors obtained from more precise and updated scores. Here, therefore, we express a criticism against socioeconomic indexes that use previously established static scores for each variable when calculating the factor to be used to define the ranking in situations in which new observations are constantly included; more than this, in situations in which there is an evolution throughout time, which changes the correlation matrix of the original variables in each period. Finally, it is worth mentioning that the factors extracted are quantitative variables and, therefore, from them, other multivariate exploratory techniques can be elaborated, such as, a cluster analysis, depending on the researcher’s objectives. Besides, each factor can also be transformed into a qualitative variable as, for example, through its categorization into ranges, established based on a certain criterion and, from then on, a correspondence analysis could be elaborated, in order to assess a possible association between the generated categories and the categories of other qualitative variables. Factors can also be used as explanatory variables of a certain phenomenon in confirmatory multivariate models as, for instance, multiple regression models, since orthogonality eliminates multicollinearity problems. On the other hand, such procedure only makes sense when we intend to elaborate a diagnostic regarding the behavior of the dependent variable, without aiming at having forecasts. Since new observations do not show the corresponding values of the factors generated, obtaining it is only possible if we include such observations in a new factor analysis, in order to obtain new factor scores, since it is an exploratory technique. Furthermore, a qualitative variable obtained through the categorization of a certain factor into ranges can also be inserted as a dependent variable of a multinomial logistic regression model, allowing researchers to evaluate the probabilities each observation has of being in each range, due to the behavior of other explanatory variables not initially considered in the factor analysis. We would also like to highlight that this procedure has a diagnostic nature, trying to find out the behavior of the variables in the sample for the existing observations, without a predictive purpose. Next, this same example will be elaborated in the software packages SPSS and Stata. In Section 12.3, the procedures for preparing the principal component factor analysis in SPSS will be presented, as well as their results. In Section 12.4, the commands for running the technique in Stata will be presented, with their respective outputs.

12.3 PRINCIPAL COMPONENT FACTOR ANALYSIS IN SPSS

In this section, we will discuss, step by step, the development of our example in the IBM SPSS Statistics Software. Following the logic proposed in this book, the main objective is to give researchers an opportunity to elaborate the principal component factor analysis in this software package, given how easy it is to use and how didactical its operations are. Every time we present an output, we will mention the respective result obtained when performing the algebraic solution of the technique in the previous section, so that researchers can compare them and broaden their own knowledge and understanding of the technique. The use of the images in this section has been authorized by the International Business Machines Corporation©. Going back to the example presented in Section 12.2.6, remember that the professor is interested in creating a school performance ranking of his students based on the joint behavior of their final grades in four subjects. The data can be found in the file FactorGrades.sav and are exactly the same as the ones partially presented in Table 12.6 in Section 12.2.6. Therefore, in order for the factor analysis to be elaborated, let's click on Analyze → Dimension Reduction → Factor …. A dialog box like the one shown in Fig. 12.10 will open. Next, we must insert the original variables finance, costs, marketing, and actuarial into Variables, as shown in Fig. 12.11.

FIG. 12.10 Dialog box for running a factor analysis in SPSS.

FIG. 12.11 Selecting the original variables.


Different from what was discussed in the previous chapter, when developing the cluster analysis, it is important to mention that the researcher does not need to worry about the Z-scores standardization of the original variables to elaborate the factor analysis, since the correlations between the original variables or between their corresponding standardized variables are exactly the same. Even so, if researchers choose to standardize each one of the variables, they will see that the outputs will be exactly the same. In Descriptives …, first, let's select the option Initial solution in Statistics …, which makes all the eigenvalues of the correlation matrix be presented in the outputs, even the ones that are less than 1. In addition, let's select the options Coefficients, Determinant, and KMO and Bartlett's test of sphericity in Correlation Matrix, as shown in Fig. 12.12. When we click on Continue, we will go back to the main dialog box of the factor analysis. Next, we must click on Extraction …. As shown in Fig. 12.13, we will maintain the selected options regarding the factor extraction method

FIG. 12.12 Selecting the initial options for running the factor analysis.

FIG. 12.13 Choosing the factor extraction method and the criterion for determining the number of factors.


(Method: Principal components) and the criterion for choosing the number of factors. In this case, as discussed in Section 12.2.3, only the factors that correspond to eigenvalues greater than 1 will be considered (latent root criterion or Kaiser criterion), and, therefore, we must maintain the option Based on Eigenvalue → Eigenvalues greater than: 1 selected in Extract. Moreover, we will also maintain the options Unrotated factor solution, in Display, and Correlation matrix, in Analyze, selected. In the same way, let's click on Continue so that we can go back to the main dialog box of the factor analysis. In Rotation …, for now, let's select the option Loading plot(s) in Display, while still maintaining the option None in Method selected, as shown in Fig. 12.14. Choosing the extraction of unrotated factors at this moment is didactical, since the outputs generated may be compared to the ones obtained algebraically in Section 12.2.6. Nevertheless, researchers can choose to extract rotated factors at this opportunity. After clicking on Continue, we can select the button Scores … in the technique's main dialog box. At this moment, let's select the option Display factor score coefficient matrix, as shown in Fig. 12.15, which makes the factor scores that correspond to each factor extracted be presented in the outputs. Next, we can click on Continue and on OK.

FIG. 12.14 Dialog box for selecting the rotation method and the loading plot.

FIG. 12.15 Selecting the option to present the factor scores.


FIG. 12.16 Pearson’s correlation coefficients.

The first output (Fig. 12.16) shows correlation matrix r, equal to the one in Table 12.7 in Section 12.2.6, through which we can see that the variable marketing is the only one that shows low Pearson's correlation coefficients with all the other variables. As we have already discussed, it is a first indication that the variables finance, costs, and actuarial can be correlated with a certain factor, while the variable marketing can correlate strongly with another one. We can also verify that the output seen in Fig. 12.16 shows the value of the determinant of correlation matrix r, used to calculate the χ²_Bartlett statistic, as discussed when we presented Expression (12.9). In order to study the overall adequacy of the factor analysis, let's analyze the outputs in Fig. 12.17, which shows the results of the calculations that correspond to the KMO statistic and χ²_Bartlett. While the first suggests that the overall adequacy of the factor analysis is considered middling (KMO = 0.737), based on the criterion presented in Table 12.2, the statistic χ²_Bartlett = 192.335 (Sig. χ²_Bartlett < 0.05 for 6 degrees of freedom) allows us to reject the hypothesis that correlation matrix r is statistically equal to identity matrix I with the same dimension, at a significance level of 0.05 and based on the hypotheses of Bartlett's test of sphericity. Thus, we can conclude that the factor analysis is adequate. The values of the KMO and χ²_Bartlett statistics are calculated through Expressions (12.3) and (12.9), respectively, presented in Section 12.2.2, and are exactly the same as the ones obtained algebraically in Section 12.2.6.

Next, Fig. 12.18 shows the four eigenvalues of correlation matrix r that correspond to each one of the factors extracted initially, with the respective proportions of variance shared by the original variables. Note that the eigenvalues are exactly the same as the ones obtained algebraically in Section 12.2.6, such that:

λ1 + λ2 + λ3 + λ4 = 2.519 + 1.000 + 0.298 + 0.183 = 4

FIG. 12.17 Results of the KMO statistic and Bartlett’s test of sphericity.
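As a reminder of how the statistic reported in Fig. 12.17 is obtained from that determinant, Bartlett's test of sphericity, in its usual form, computes

χ²_Bartlett = −[(n − 1) − (2k + 5)/6] × ln(D)

where n is the sample size, k is the number of variables, and D is the determinant of the correlation matrix, with k(k − 1)/2 degrees of freedom; for this example, n = 100 and k = 4, which gives the 6 degrees of freedom mentioned above.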

FIG. 12.18 Eigenvalues and variance shared by the original variables to form each factor.


Since in the analysis we will only consider the factors whose eigenvalues are greater than 1, the right-hand side of Fig. 12.18 shows the proportion of variance shared by the original variables to form only these factors. Therefore, analogous to what was presented in Table 12.10, we can state that, while 62.975% of the total variance is shared to form the first factor, 25.010% is shared to form the second. Thus, to form these two factors, the total loss of variance of the original variables is equal to 12.015%. Having extracted two factors, Fig. 12.19 shows the factor scores that correspond to each one of the standardized variables for each one of these factors. Hence, we are able to write the expressions of factors F1 and F2 as follows:

F1i = 0.355 · Zfinancei + 0.371 · Zcostsi − 0.017 · Zmarketingi + 0.364 · Zactuariali
F2i = 0.007 · Zfinancei + 0.049 · Zcostsi + 0.999 · Zmarketingi − 0.010 · Zactuariali

Note that the expressions are identical to the ones obtained in Section 12.2.6 from the algebraic definition of unrotated factor scores. Fig. 12.20 shows the factor loadings, which correspond to Pearson's correlation coefficients between the original variables and each one of the factors. The values shown in Fig. 12.20 are equal to the ones presented in the first two columns of Table 12.12. The highest factor loading is highlighted for each variable and, therefore, we can verify that, while the variables finance, costs, and actuarial show stronger correlations with the first factor, only the variable marketing shows stronger correlation with the second factor. As we also discussed in Section 12.2.6, the sum of the squared factor loadings in the columns results in the eigenvalue of the corresponding factor, that is, it represents the proportion of variance shared by the four original variables to form each factor. Thus, we can verify that:

(0.895)² + (0.934)² + (0.042)² + (0.918)² = 2.519
(0.007)² + (0.049)² + (0.999)² + (0.010)² = 1.000

FIG. 12.19 Factor scores.

FIG. 12.20 Factor loadings.


On the other hand, the sum of the squared factor loadings in the rows results in the communality of the respective variable, that is, it represents the proportion of shared variance of each original variable in the two factors extracted. Therefore, we can also see that:

communality_finance = (0.895)² + (0.007)² = 0.802
communality_costs = (0.934)² + (0.049)² = 0.875
communality_marketing = (0.042)² + (0.999)² = 1.000
communality_actuarial = (0.918)² + (0.010)² = 0.843

In the SPSS outputs, the communalities table is also presented, as shown in Fig. 12.21. The loading plot that shows the relative position of each variable in each factor, based on the respective factor loadings, is also shown in the outputs, as shown in Fig. 12.22 (equivalent to Fig. 12.8 in Section 12.2.6), in which the X-axis represents factor F1, and the Y-axis, factor F2. Even though the relative position of the variables in each axis is very clear, that is, the magnitude of the correlations between each one of them and each factor, for pedagogical purposes, we chose to elaborate the rotation of the axes, which

FIG. 12.21 Communalities.

FIG. 12.22 Loading plot.


can sometimes facilitate the interpretation of the factors because it provides a better distribution of the variables' factor loadings in each factor. Thus, once again, let's click on Analyze → Dimension Reduction → Factor … and, on the button Rotation …, select the option Varimax, as shown in Fig. 12.23. When we click on Continue, we will go back to the main dialog box of the factor analysis. In Scores …, let's select the option Save as variables, as shown in Fig. 12.24, so that the factors generated, now rotated, can be made available in the dataset as new variables. From these factors, the students' school performance ranking will be created. Next, we can click on Continue and on OK. Figs. 12.25–12.29 show the outputs that present differences in relation to the previous ones, due to the rotation. In this regard, the results of the correlation matrix, of the KMO statistic, of Bartlett's test of sphericity, and of the communalities table are not presented again because, even though they could be recalculated from the rotated loadings, their values do not change. Fig. 12.25 shows these rotated factor loadings and, through them, it is possible to verify, even if very tenuously, a certain redistribution of the variable loadings in each factor. Note that the rotated factor loadings in Fig. 12.25 are exactly the same as the ones obtained algebraically in Section 12.2.6, from Expressions (12.35) to (12.40), and presented in Table 12.17.

FIG. 12.23 Selecting the Varimax orthogonal rotation method.

FIG. 12.24 Selecting the option to save the factors as new variables in the dataset.


FIG. 12.25 Rotated factor loadings through the Varimax method.

FIG. 12.26 Loading plot with rotated loadings.

The new loading plot, constructed from the rotated factor loadings and equivalent to Fig. 12.9, can be seen in Fig. 12.26. The rotation angle calculated algebraically in Section 12.2.6 is also a part of the SPSS outputs and can be found in Fig. 12.27. As we have already discussed, from the rotated factor loadings, we can verify that there are no changes in the communality values of the variables considered in the analysis, that is:

communality_finance = (0.895)² + (0.019)² = 0.802
communality_costs = (0.935)² + (0.021)² = 0.875
communality_marketing = (0.013)² + (1.000)² = 1.000
communality_actuarial = (0.917)² + (0.037)² = 0.843

On the other hand, the new eigenvalues can be obtained as follows:

(0.895)² + (0.935)² + (0.013)² + (0.917)² = λ′1 = 2.518
(0.019)² + (0.021)² + (1.000)² + (0.037)² = λ′2 = 1.002


FIG. 12.27 Rotation angle (in radians).

FIG. 12.28 Eigenvalues and variance shared by the original variables to form both rotated factors.

FIG. 12.29 Rotated factor scores.

Fig. 12.28 shows the results of the eigenvalues for the first two rotated factors in Rotation Sums of Squared Loadings, with their respective proportions of variance shared by the four original variables. The results are in accordance with the ones presented in Table 12.18. In comparison to the results obtained before the rotation, we can see that, even though there is no change in the sharing of 87.985% of the total variance of the original variables to form both rotated factors, the rotation redistributed the variance shared by the variables to each factor. Fig. 12.29 shows the rotated factor scores, from which the expressions of the new factors can be obtained. Therefore, we can write the following rotated factor expressions:

F′1i = 0.355 · Zfinancei + 0.372 · Zcostsi + 0.012 · Zmarketingi + 0.364 · Zactuariali
F′2i = 0.004 · Zfinancei + 0.038 · Zcostsi + 0.999 · Zmarketingi − 0.021 · Zactuariali

When developing the procedure described, we can verify that two new variables are generated in the dataset, called FAC1_1 and FAC2_1 by SPSS, as shown in Fig. 12.30 for the first 20 observations. These new variables, which show the values of both rotated factors for each one of the observations in the dataset, are orthogonal to one another, that is, they have a Pearson's correlation coefficient equal to 0. This can be verified when we click on Analyze → Correlate → Bivariate …. In the dialog box that will open, we must insert the four original variables, together with the factors FAC1_1 and FAC2_1,


FIG. 12.30 Dataset with the F′1 (FAC1_1) and F′2 (FAC2_1) values per observation.

into Variables and select the options Pearson (in Correlation Coefficients) and Two-tailed (in Test of Significance), as shown in Fig. 12.31. When we click on OK, the output seen in Fig. 12.32 will be presented, in which it is possible to verify that Pearson's correlation coefficient between both rotated factors is equal to 0. According to what was studied in Sections 12.2.4 and 12.2.6, a more inquisitive researcher may also verify that the rotated factor scores can be obtained through the estimation of two multiple linear regression models, in which a certain factor is considered to be the dependent variable in each one of them, and the standardized variables, the explanatory variables. The factor scores will be the parameters estimated in each model. In the same way, it is also possible to verify that the rotated factor loadings can be obtained by using the estimation of four multiple linear regression models as well, in which, in each one of them, a certain standardized variable is considered to be the dependent variable, and the factors, the explanatory variables. While the factor loadings will be the parameters estimated in each model, the communalities will be the respective coefficients of determination R². Therefore, the following expressions can be obtained:

Zfinancei = 0.895 · F′1i − 0.019 · F′2i + ui ,  R² = 0.802
Zcostsi = 0.935 · F′1i + 0.021 · F′2i + ui ,  R² = 0.875
Zmarketingi = 0.013 · F′1i + 1.000 · F′2i + ui ,  R² = 1.000
Zactuariali = 0.917 · F′1i − 0.037 · F′2i + ui ,  R² = 0.843

in which the terms ui represent additional sources of variance, besides factors F′1 and F′2, to explain the behavior of each variable, and they are also called error terms or residuals.


FIG. 12.31 Dialog box for determining Pearson’s correlation coefficient between both rotated factors.

FIG. 12.32 Pearson’s correlation coefficient between both rotated factors.

In case there is any interest in verifying these facts, we must obtain the standardized variables by clicking on Analyze → Descriptive Statistics → Descriptives …. When we select all the original variables, we must click on Save standardized values as variables. Although this specific procedure is not shown here, after clicking on OK, the standardized variables will be generated in the dataset itself. Therefore, based on the factors generated, we are able to create the desired school performance ranking. In order to do that, we will use the criterion described in Section 12.2.6, known as the weighted rank-sum criterion, in which a new variable is generated from the multiplication of the values of each factor by the respective proportions of variance shared by the original variables. Thus, this new variable, which we call ranking, has the following expression:

rankingi = 0.62942 · F′1i + 0.25043 · F′2i

in which the parameters 0.62942 and 0.25043 correspond to the proportions of variance shared by the first two factors, respectively, as shown in Fig. 12.28. In order for the variable to be generated in the dataset, we must click on Transform → Compute Variable …. In Target Variable, we must type the name of the new variable (ranking) and, in Numeric Expression, we must type the weighted sum expression (FAC1_1*0.62942) + (FAC2_1*0.25043), as shown in Fig. 12.33. When we click on OK, the variable ranking will appear in the dataset.
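As a quick arithmetic check, using the values reported earlier for the first-ranked student (Adelino, with F′1i = 1.959 and F′2i = 1.568):

ranking_Adelino = 0.62942 × 1.959 + 0.25043 × 1.568 ≈ 1.233 + 0.393 ≈ 1.626

which is precisely the value that places him at the top of the ranking.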


FIG. 12.33 Creating the new variable (ranking).

Finally, to sort the variable ranking, we must click on Data → Sort Cases …. In addition to selecting the option Descending, we must insert the variable ranking into Sort by, as shown in Fig. 12.34. When we click on OK, the observations will appear sorted in the dataset, from the highest to the lowest value of the variable ranking, as shown in Fig. 12.35 for the 20 observations with the best school performance. We can see that the ranking constructed through the weighted rank-sum criterion points to Adelino as the student with the best school performance in that set of subjects, followed by Renata, Giulia, Felipe, and Cecilia. Having presented the procedures for applying the principal component factor analysis in SPSS, let's now discuss the technique in Stata, following the standard used in this book.

12.4 PRINCIPAL COMPONENT FACTOR ANALYSIS IN STATA

We now present, step by step, the preparation of our example in the Stata Statistical Software. In this section, our main goal is not to discuss the concepts of the principal component factor analysis once again; instead, it is to give researchers an opportunity to elaborate the technique by using the commands in this software. Every time we present an output, we will mention the respective result obtained when applying the technique in an algebraic way and also by using SPSS. The use of the images in this section has been authorized by StataCorp LP©. Therefore, right away, we begin with the dataset constructed by the professor starting from the questions asked to each one of his 100 students. This dataset can be found in the file FactorGrades.dta and is exactly the same as the one partially presented in Table 12.6 in Section 12.2.6.


FIG. 12.34 Dialog box for sorting the observations by variable ranking.

FIG. 12.35 Dataset with the school performance ranking.

First of all, we can type the command desc, which makes it possible to analyze the characteristics of the dataset, such as the number of observations, the number of variables, and the description of each one of them. Fig. 12.36 shows this first output in Stata. The command pwcorr ..., sig generates Pearson's correlation coefficients between each pair of variables, with their respective significance levels. Therefore, we must type the following command:

pwcorr finance costs marketing actuarial, sig

Fig. 12.37 shows the output generated.


FIG. 12.36 Description of the FactorGrades.dta dataset.

FIG. 12.37 Pearson’s correlation coefficients and respective significance levels.

The outputs seen in Fig. 12.37 show that the correlations between the variable marketing and each one of the other variables are relatively low and not statistically significant, at a significance level of 0.05. On the other hand, the other variables have high and statistically significant correlations, between one another, at this significance level, which is a first indication that the factor analysis may group them in a certain factor without any substantial loss of their variances, while the variable marketing may show high correlation with another factor. This figure is in accordance with the one presented in Table 12.7 in Section 12.2.6, and also in Fig. 12.16, when we elaborated the technique in SPSS (Section 12.3). The factor analysis’s overall adequacy can be evaluated through the results of the KMO statistic and Bartlett’s test of sphericity, which can be obtained by using the command factortest. Thus, let’s type: factortest finance costs marketing actuarial

The outputs generated can be seen in Fig. 12.38. Based on the result of the KMO statistic, the overall adequacy of the factor analysis can be considered middling. However, more important than this piece of information is the result of Bartlett's test of sphericity. From the result of the χ²_Bartlett statistic, with a significance level of 0.05 and 6 degrees of freedom, we can say that Pearson's correlation matrix is statistically different from the identity matrix with the same dimension, since χ²_Bartlett = 192.335 (χ² calculated for 6 degrees of freedom) and Prob. χ²_Bartlett (P-value) < 0.05. Note that the results of these statistics are in accordance with the ones calculated algebraically in Section 12.2.6 and also shown in Fig. 12.17 of Section 12.3. Fig. 12.38 also shows the value of the determinant of the correlation matrix, used to calculate the χ²_Bartlett statistic. Stata also allows us to obtain the eigenvalues and eigenvectors of the correlation matrix. In order to do that, we must type the following command:

pca finance costs marketing actuarial
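As a brief aside (these commands do not appear in the chapter's figures), once pca or factor has been estimated, part of this diagnostic output can also be reproduced with Stata's postestimation commands, for instance:

estat kmo
screeplot

in which estat kmo reports the KMO statistic and screeplot plots the eigenvalues of the correlation matrix.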


FIG. 12.38 Results of the KMO statistic and Bartlett’s test of sphericity.

FIG. 12.39 Eigenvalues and eigenvectors of the correlation matrix.

Fig. 12.39 shows these eigenvalues and eigenvectors, and they are exactly the same as the ones calculated algebraically in Section 12.2.6. Since we have not elaborated the procedure for rotating the factors generated yet, we can verify that the proportions of variance shared by the original variables to form each factor correspond to the ones presented in Table 12.10. After having presented these first outputs, we can now elaborate the principal component factor analysis itself by typing the following command, whose results are shown in Fig. 12.40. factor finance costs marketing actuarial, pcf

where the term pcf refers to the principal-component factor method. While the upper part of Fig. 12.40 shows the eigenvalues of the correlation matrix once again, with the respective proportions of shared variance of the original variables, since researchers can choose not to use the command pca, the lower part of the figure shows the factor loadings, which represent the correlations between each variable and the factors that only have eigenvalues greater than 1. Therefore, we can see that Stata automatically considers the latent root criterion (Kaiser criterion) when choosing the number of factors. If for some reason researchers choose to extract a number of factors considering a smaller eigenvalue so that more factors can be extracted, they must type the term mineigen(#) at the end of the command factor, in which # will be a number that corresponds to the eigenvalue from which factors will be extracted.
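For instance (an illustrative variation using this example's own eigenvalues, shown in Fig. 12.39), lowering the threshold to 0.25 would also retain the third factor, whose eigenvalue is 0.298:

factor finance costs marketing actuarial, pcf mineigen(0.25)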


FIG. 12.40 Outputs of the principal component factor analysis in Stata.

The factor loadings shown in Fig. 12.40 are equal to the first two columns of Table 12.12 in Section 12.2.6, and in Fig. 12.20 of Section 12.3. Through them, we can see that, while the variables finance, costs, and actuarial show high correlations with the first factor, the variable marketing shows strong correlation with the second factor. Besides, the factor loadings matrix also presents a column called Uniqueness (or exclusivity), whose values represent, for each variable, the proportion of variance lost to form the factors extracted, that is, they correspond to (1 − communality) of each variable. Therefore, we have:

uniqueness_finance = 1 − [(0.8953)² + (0.0068)²] = 0.1983
uniqueness_costs = 1 − [(0.9343)² + (0.0487)²] = 0.1246
uniqueness_marketing = 1 − [(0.0424)² + (0.9989)²] = 0.0003
uniqueness_actuarial = 1 − [(0.9179)² + (0.0101)²] = 0.1573

Consequently, because the variable marketing has low correlations with each one of the other original variables, it ends up having a high Pearson's correlation with the second factor. This makes its uniqueness value very low, since its proportion of variance shared with the second factor is almost equal to 100%. Knowing that two factors are extracted, at this moment we will carry out the rotation by using the Varimax method. In order to do that, we must type the following command:

rotate, varimax horst

where the term horst defines the rotation angle from the standardized factor loadings. This procedure is in accordance with the one elaborated algebraically in Section 12.2.6. The outputs generated can be seen in Fig. 12.41. From Fig. 12.41, as we have already discussed, we can verify that the proportion of variance shared by all the variables to form both factors is equal to 87.98%, even though the eigenvalue of each factor rotated is different from the one obtained previously. The same can be said regarding the uniqueness values of each variable, even if the rotated factor loadings are different in relation to their unrotated corresponding ones, since the Varimax method maximizes the loadings of each variable in a certain factor. Fig. 12.41 also shows the rotation angle at the end. All of these outputs are identical to the ones calculated in Section 12.2.6 and they were also presented when we elaborated the technique in SPSS, in Figs. 12.25, 12.27, and 12.28.


Thus, we can say that:

uniqueness_finance = 1 − [(0.8951)² + (0.0195)²] = 0.1983
uniqueness_costs = 1 − [(0.9354)² + (0.0213)²] = 0.1246
uniqueness_marketing = 1 − [(0.0131)² + (0.9997)²] = 0.0003
uniqueness_actuarial = 1 − [(0.9172)² + (0.0370)²] = 0.1573

and that:

(0.8951)² + (0.9354)² + (0.0131)² + (0.9172)² = λ′1 = 2.51768
(0.0195)² + (0.0213)² + (0.9997)² + (0.0370)² = λ′2 = 1.00170

FIG. 12.41 Rotation of factors through the Varimax method.

If the researcher wishes to, Stata also allows us to compare the rotated factor loadings to the ones obtained before the rotation in the same table. In order to do that, it is necessary to type the following command, after preparing the rotation: estat rotatecompare

The outputs generated can be seen in Fig. 12.42. At this moment, the loading plot of the rotated factor loadings can be obtained by typing the command loadingplot. This chart, which corresponds to the ones presented in Figs. 12.9 and 12.26, can be seen in Fig. 12.43. After developing these procedures, the researcher may want to generate two new variables in the dataset, which correspond to the rotated factors obtained through the factor analysis. Therefore, it is necessary to type the following command: predict f1 f2


FIG. 12.42 Comparison of the rotated and unrotated factor loadings.

FIG. 12.43 Loading plot with rotated loadings.

where f1 and f2 are the names of the variables corresponding to the first and second factors, respectively. When we type the command, in addition to creating these two new variables in the dataset, an output similar to the one in Fig. 12.44 will also be generated, in which the rotated factor scores are presented. The results shown in Fig. 12.44 are equivalent to the ones in SPSS (Fig. 12.29). Besides, it is also possible to verify that both factors generated are orthogonal, that is, they have a Pearson's correlation coefficient equal to 0. In order to do that, let's type:

estat common

which results in the output seen in Fig. 12.45. Only for pedagogical purposes, we can also obtain the scores and the rotated factor loadings from multiple linear regression models. In order to do that, first of all, we have to generate the standardized variables by using the Z-scores procedure in the dataset, from each one of the original variables, by typing the following sequence of commands: egen zfinance = std(finance) egen zcosts = std(costs) egen zmarketing = std(marketing) egen zactuarial = std(actuarial)
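Purely as an illustration of what the function std() computes, the same Z-scores standardization could be reproduced manually for one of the variables (the variable name zfinance_check below is hypothetical, created only for this check):

quietly summarize finance
gen zfinance_check = (finance - r(mean))/r(sd)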


FIG. 12.44 Generating the factors in the dataset and the rotated factor scores.

FIG. 12.45 Pearson’s correlation coefficient between both rotated factors.

Having done this, we can type the two following commands, which represent two multiple linear regression models, in which each one of them shows a certain factor as a dependent variable and the standardized variables as explanatory variables. reg f1 zfinance zcosts zmarketing zactuarial reg f2 zfinance zcosts zmarketing zactuarial

The results of these models can be seen in Fig. 12.46. By analyzing Fig. 12.46, we note that the parameters estimated in each model correspond to the rotated factor scores for each variable, according to what has already been shown in Fig. 12.44. Thus, since all the parameters of the intercept are practically equal to 0, we can write:

F′1i = 0.3554795 · Zfinancei + 0.3721907 · Zcostsi + 0.0124719 · Zmarketingi + 0.3639452 · Zactuariali
F′2i = 0.0036389 · Zfinancei + 0.0377955 · Zcostsi + 0.9986053 · Zmarketingi − 0.020781 · Zactuariali

Obviously, since the four variables share variances to form each factor, the coefficients of determination R² of each model are equal to 1. On the other hand, to obtain the rotated factor loadings, we must type the following four commands, which represent four multiple linear regression models, in which each one of them has a certain standardized variable as the dependent variable, and the rotated factors as the explanatory variables.

reg zfinance f1 f2
reg zcosts f1 f2
reg zmarketing f1 f2
reg zactuarial f1 f2

The results of these models can be seen in Fig. 12.47. By analyzing this figure, note that the parameters estimated in each model correspond to the rotated factor loadings for each factor, according to what has already been shown in Fig. 12.41. Therefore, since all the parameters of the intercept are practically equal to 0, we can write:

Zfinancei = 0.895146 · F′1i − 0.0194694 · F′2i + ui ,  R² = 1 − uniqueness = 0.8017
Zcostsi = 0.935375 · F′1i + 0.0212916 · F′2i + ui ,  R² = 1 − uniqueness = 0.8754


FIG. 12.46 Outputs of the multiple linear regression models with factors as dependent variables.

Zmarketingi = 0.013053 · F′1i + 0.9997495 · F′2i + ui ,  R² = 1 − uniqueness = 0.9997
Zactuariali = 0.917223 · F′1i − 0.0370175 · F′2i + ui ,  R² = 1 − uniqueness = 0.8427

where the terms ui represent additional sources of variance, besides factors F′1 and F′2, to explain the behavior of each variable, since two other factors with eigenvalues less than 1 could also have been extracted. The coefficients of determination R² of each model that are different from 1 correspond to the communality values of each variable, that is, to (1 − uniqueness). Although researchers can choose not to estimate multiple linear regression models when applying the factor analysis, since it is only a verification procedure, we believe that its didactical nature is essential for fully understanding the technique. From the rotated factors extracted (variables f1 and f2), we can define the desired school performance ranking. As elaborated when applying the technique in SPSS, we will use the criterion described in Section 12.2.6, known as the weighted rank-sum criterion, in which a new variable is generated from the multiplication of the values of each factor by the respective proportions of variance shared by the original variables. Let's type the following command:

gen ranking = f1*0.6294+f2*0.2504

where the terms 0.6294 and 0.2504 correspond to the proportions of variance shared by the first two factors, respectively, as shown in Fig. 12.41. The new variable generated in the dataset is called ranking. Next, we can sort the observations, from the highest to the lowest value of variable ranking, by typing the following command: gsort -ranking

After that, just as an example, we can list the school performance ranking of the best 20 students, based on the joint behavior of the final grades in all four subjects. In order to do that, we can type the following command: list student ranking in 1/20

Fig. 12.48 shows the ranking of the top 20 students.


FIG. 12.47 Outputs of the multiple linear regression models with standardized variables as dependent variables.


FIG. 12.48 School performance ranking of the best 20 students.

12.5 FINAL REMARKS

Many are the situations in which researchers wish to group variables into one or more factors, to verify the validity of previously established constructs, to create orthogonal factors for future use in confirmatory multivariate techniques that require the absence of multicollinearity, or to create rankings by developing performance indexes. In these situations, factor analysis procedures are highly recommended, and the most frequently used is known as principal components. Therefore, factor analysis allows us to improve decision-making processes based on the behavior of, and the interdependence between, quantitative variables that have a relative correlation intensity. Since the factors generated from the original variables are also quantitative variables, the outputs of the factor analysis can be inputs in other multivariate techniques, such as cluster analysis. The stratification of each factor into ranges may allow the association between these ranges and the categories of other qualitative variables to be evaluated through a correspondence analysis. The use of factors in confirmatory multivariate techniques may also make sense when researchers intend to elaborate diagnostics about the behavior of a certain dependent variable and use the factors extracted as explanatory variables, a fact that eliminates possible multicollinearity problems because the factors are orthogonal. A qualitative variable obtained from the stratification of a certain factor into ranges can be used, for example, in a multinomial logistic regression model, which allows the preparation of a diagnostic on the probabilities each observation has of being in each range, due to the behavior of other explanatory variables not initially considered in the factor analysis. Regardless of the main goal for applying the technique, factor analysis may bear good and interesting research fruits that can be useful for the decision-making process. Its preparation must always be carried out through the correct and conscious use of the software package chosen for the modeling, based on the underlying theory and on researchers' experience and intuition.

12.6 EXERCISES

(1) From a dataset that contains certain clients' variables (individuals), analysts from a bank's Customer Relationship Management (CRM) department elaborated a principal component factor analysis aiming to study the joint behavior of these variables so that, afterwards, they can propose the creation of an investment profile index. The variables used to elaborate the modeling were:


Variable      Description
age           Client i's age (in years)
fixedif       Percentage of resources invested in fixed-income funds (%)
variableif    Percentage of resources invested in variable-income funds (%)
people        Number of people who live in the residence

In a certain management report, these analysts presented the factor loadings (Pearson's correlation coefficients) between each original variable and both factors extracted by using the latent root criterion or Kaiser criterion. These factor loadings can be found in the table:

Variable      Factor 1   Factor 2
age           0.917      0.047
fixedif       0.874      0.077
variableif    0.844      0.197
people        0.031      0.979

We would like you to answer the following questions:
(a) Which eigenvalues correspond to the two factors extracted?
(b) What are the proportions of variance shared by all the variables to form each factor? What is the total proportion of variance lost by the four variables to extract these two factors?
(c) For each variable, what is the proportion of shared variance to form both factors (communality)?
(d) What is the expression of each standardized variable based on the two factors extracted?
(e) Construct a loading plot from the factor loadings.
(f) Interpret both factors based on the distribution of the loadings of each variable.

(2) A researcher specialized in analyzing the behavior of nations' socioeconomic indexes would like to investigate the possible relationship between variables related to corruption, violence, income, and education, and, in order to do that, he collected data on 50 countries considered to be developed or emerging two years in a row. The data can be found in the files CountriesIndexes.sav and CountriesIndexes.dta, which have the following variables:

country: a string variable that identifies country i
cpi1 (Year 1) and cpi2 (Year 2): corruption perception index, which corresponds to citizens' perception of abuses committed by the public sector as regards a nation's private assets, including administrative and political aspects; the lower the index, the higher the perception of corruption in the country (Source: Transparency International)
violence1 (Year 1) and violence2 (Year 2): number of murders per 100,000 inhabitants (Sources: World Health Organization, United Nations Office on Drugs and Crime, and GIMD Global Burden of Injuries)
capita_gdp1 (Year 1) and capita_gdp2 (Year 2): per capita GDP in US$ adjusted for inflation, using 2000 as the base year (Source: World Bank)
school1 (Year 1) and school2 (Year 2): average number of years in school per person over 25 years of age, including primary, secondary, and higher education (Source: Institute for Health Metrics and Evaluation)

In order to create a socioeconomic index that generates a country ranking for each year, the researcher has decided to elaborate a principal component factor analysis using the variables of each period. Based on the results obtained, we would like you to answer the following questions:
(a) By using the KMO statistic and Bartlett's test of sphericity, is it possible to state that the principal component factor analysis is adequate for each one of the years of study? In the case of Bartlett's test of sphericity, use a significance level of 0.05.


(b) How many factors are extracted in the analysis in each of the years, considering the latent root criterion? Which eigenvalue(s) correspond to the factor(s) extracted each year, and what proportion(s) of variance are shared by all the variables to form this(these) factor(s)?
(c) For each variable, what is the proportion of shared variance to form the factor(s) each year? Did any alterations in the communalities of each variable occur from one year to the next?
(d) What are the expression(s) of the factor(s) extracted each year, based on the standardized variables? From one year to the next, did any alterations in the factor scores of the variables occur in each factor? Discuss the importance of developing a specific factor analysis each year in order to create indexes.
(e) Considering the principal factor extracted as a socioeconomic index, create a country ranking from this index for each one of the years. From one year to the next, were there any changes regarding the countries' positions in the ranking?

(3) The general manager of a store, which belongs to a chain of drugstores, wishes to find out its consumers' perception of eight attributes, which are described below:

Attribute (Variable)   Description
assortment             Perception of the variety of goods
replacement            Perception of the quality and speed of inventory replacement
layout                 Perception of the store's layout
comfort                Perception of thermal, acoustic, and visual comfort inside the store
cleanliness            Perception of the store's general cleanliness
services               Perception of the quality of the services rendered
prices                 Perception of the store's prices compared to the competition
discounts              Perception of the store's discount policy

In order to do that, he carried out a survey with 1700 clients at the store over a certain period. The questionnaire was structured based on groups of attributes, and each question corresponding to an attribute asked the consumer to assign a score from 0 to 10 depending on his or her perception of that attribute, where 0 corresponded to an entirely negative perception and 10 to the best perception possible. Since the store's general manager is rather experienced, he decided, in advance, to gather the questions into three groups, such that the complete questionnaire would be as follows:

Based on your perception, fill out the questionnaire below with scores from 0 to 10, in which 0 means that your perception is entirely negative in relation to a certain attribute, and 10 means that your perception is the best possible.

Products and store environment
Please rate the store's variety of goods on a scale of 0–10
Please rate the store's quality and speed of inventory replacement on a scale of 0–10
Please rate the store's layout on a scale of 0–10
Please rate the store's thermal, acoustic, and visual comfort on a scale of 0–10
Please rate the store's general cleanliness on a scale of 0–10

Services
Please rate the quality of the services rendered in our store on a scale of 0–10

Prices and discount policy
Please rate the store's prices compared to the competition on a scale of 0–10
Please rate our discount policy on a scale of 0–10


The complete dataset developed by the store's general manager can be seen in the files DrugstorePerception.sav and DrugstorePerception.dta. We would like you to:
(a) Present the correlation matrix between each pair of variables. Based on the magnitude of the values of Pearson's correlation coefficients, is it possible to identify any indication that the factor analysis may group the variables into factors?
(b) By using the result of Bartlett's test of sphericity, is it possible to state, at a significance level of 0.05, that the principal component factor analysis is adequate?
(c) How many factors are extracted in the analysis considering the latent root criterion? Which eigenvalue(s) correspond to the factor(s) extracted, and what proportion(s) of variance are shared by all the variables to form this(these) factor(s)?
(d) What is the total percentage of variance loss of the original variables resulting from the extraction of the factor(s) based on the latent root criterion?
(e) For each variable, what are the loading and the proportion of shared variance to form the factor(s)?
(f) By demanding the extraction of three factors, to the detriment of the latent root criterion, and based on the new factor loadings, is it possible to confirm the construct of the questionnaire proposed by the store's general manager? In other words, do the variables of each group in the questionnaire, in fact, end up showing greater sharing of variance with a common factor?
(g) Discuss the impact of the decision to extract three factors on the communality values.
(h) Construct a Varimax rotation and, based on the redistribution of the factor loadings, discuss once again the construct initially proposed in the questionnaire by the store's general manager.
(i) Present the 3D loading plot with the rotated factor loadings.

APPENDIX: CRONBACH'S ALPHA

A.1 Brief Presentation

The alpha statistic, proposed by Cronbach (1951), is a measure used to assess the internal consistency of the variables in a dataset, that is, it is a measure of the level of reliability with which a certain scale, adopted to define the original variables, produces consistent results about the relationship between these variables. According to Nunnally and Bernstein (1994), the level of reliability is defined from the behavior of the correlations between the original variables (or the standardized ones), and, therefore, Cronbach's alpha can be used to evaluate the reliability with which a factor can be extracted from variables, thus being related to the factor analysis. According to Rogers et al. (2002), even though Cronbach's alpha is not the only existing measure of reliability, since it has constraints related to multidimensionality, that is, to the identification of multiple factors, it can be defined as the measure that makes it possible to assess the intensity with which a certain construct or factor is present in the original variables. Therefore, a dataset with variables that share a single factor tends to have a high Cronbach's alpha. Hence, Cronbach's alpha cannot be used to assess the overall adequacy of the factor analysis, differently from the KMO statistic and Bartlett's test of sphericity, since its magnitude offers the researcher an indication only of the internal consistency of the scale used to extract a single factor. If its value is low, not even the first factor will be adequately extracted, which is the main reason why some researchers choose to study the magnitude of Cronbach's alpha before running the factor analysis, even though this decision is not a mandatory requisite for developing the technique. Cronbach's alpha can be defined by the following expression:

α = [k / (k − 1)] × [1 − (Σk Vark) / Varsum]   (12.41)

where Vark is the variance of the kth variable, and

Varsum = [Σ(i=1..n) (Σk Xki)² − (Σ(i=1..n) Σk Xki)² / n] / (n − 1)   (12.42)


which represents the variance of the sum of each row in the dataset, that is, the variance of the sum of the values corresponding to each observation. Besides, we know that n is the sample size and k is the number of variables X. So, we can state that, if the variable values are consistent with one another, the term Varsum will be large enough for alpha (α) to tend to 1. On the other hand, variables that have low correlations, possibly due to the presence of random observation values, will make the term Varsum approach the sum of the variances of each variable (Vark), which will make alpha (α) tend to 0. Although there is no consensus in the existing literature about the value of alpha from which there is internal consistency of the variables in the dataset, it is desirable that the result obtained be greater than 0.6 when we apply exploratory techniques. Next, we will discuss the calculation of Cronbach's alpha for the data in the example used throughout this chapter.

A.2 Determining Cronbach's Alpha Algebraically

From the standardized variables in the example studied throughout this chapter, we can construct Table 12.20, which helps us calculate Cronbach's alpha. Thus, based on Expression (12.42), we have:

Varsum = 832.570 / 99 = 8.410

and, by using Expression (12.41), we can calculate Cronbach's alpha:

α = (4 / 3) × (1 − 4 / 8.410) = 0.699

We can consider this value acceptable for the internal consistency of the variables in our dataset. Nevertheless, as we will see when determining Cronbach's alpha in SPSS and in Stata, there is a considerable loss of reliability because the original variables are not measuring the same factor, that is, the same dimension, since this statistic has constraints related to multidimensionality. That is, if we did not include the variable marketing when calculating Cronbach's alpha, its value would be

TABLE 12.20 Procedure for Calculating Cronbach's Alpha

Student        Zfinancei   Zcostsi   Zmarketingi   Zactuariali   Σ(k=1..4) Xki   [Σ(k=1..4) Xki]²
Gabriela       0.011       0.290     1.650         0.273         1.679           2.817
Luiz Felipe    0.876       0.697     1.532         1.319         1.360           1.849
Patricia       0.876       0.290     0.590         0.523         2.278           5.191
Gustavo        1.334       1.337     0.825         1.069         4.564           20.832
Leticia        0.779       1.104     0.872         0.841         3.597           12.939
Ovidio         1.334       2.150     1.650         1.865         3.699           13.682
Leonor         0.267       0.116     0.825         0.125         0.549           0.301
Dalila         0.139       0.523     0.118         0.273         0.775           0.600
Antonio        0.021       0.290     0.590         0.523         1.382           1.909
⋮
Estela         0.982       0.113     1.297         1.069         0.868           0.753
Variance       1.000       1.000     1.000         1.000

Σ(i=1..100) [Σ(k=1..4) Xki] = 0          Σ(i=1..100) [Σ(k=1..4) Xki]² = 832.570


considerably higher, which indicates that this variable does not contribute to the construct, or to the first factor, formed by the other variables (finance, costs, and actuarial). The complete spreadsheet with the calculation of Cronbach’s alpha can be found in the file AlphaCronbach.xls. Analogous to what was done throughout this chapter, next, we will present the procedures for obtaining Cronbach’s alpha in SPSS and in Stata.

A.3 Determining Cronbach's Alpha in SPSS

Once again, let’s use the file FactorGrades.sav. In order for us to determine Cronbach’s alpha based on the standardized variables, first, we must standardize them by using the Z-scores procedure. To do that, let’s click on Analyze → Descriptive Statistics → Descriptives …. When we select all the original variables, we must click on Save standardized values as variables. Although this specific procedure is not shown here, after clicking on OK, the standardized variables will be generated in the dataset itself. After that, let’s click on Analyze → Scale → Reliability Analysis …. A dialog box will open. We must insert the standardized variables into Items, as shown in Fig. 12.49. Next, in Statistics …, we must select the option Scale if item deleted, as shown in Fig. 12.50. This option calculates the different values of Cronbach’s alpha when each variable in the analysis is eliminated. The term item is often mentioned in Cronbach’s work (1951), and it is used as a synonym for variable. Next, we can click on Continue and on OK. Fig. 12.51 shows the result of Cronbach’s alpha, whose value is exactly the same as the one calculated through Expressions (12.41) and (12.42) and shown in the previous section. Furthermore, Fig. 12.52 also shows, in the last column, Cronbach’s alpha values that would be obtained if a certain variable were excluded from the analysis. Therefore, we can see that the presence of the variable marketing contributes negatively to the identification of only one factor, because, as we know, this variable shows strong correlation with the second factor extracted by the principal component factor analysis elaborated throughout this chapter. Since Cronbach’s alpha is a one-dimensional measure of reliability, excluding the variable marketing would make its value get to 0.904. Next, we will obtain the same outputs by using specific commands in Stata.

FIG. 12.49 Dialog box for determining Cronbach’s alpha in SPSS.


FIG. 12.50 Selecting the option to calculate alpha when excluding a certain variable.

FIG. 12.51 Result of Cronbach’s alpha in SPSS.

FIG. 12.52 Cronbach’s alpha when excluding each variable.

A.4 Determining Cronbach's Alpha in Stata

Now, let’s open the file FactorGrades.dta. In order to calculate Cronbach’s alpha, we must type the following command: alpha finance costs marketing actuarial, asis std

where the term std makes Cronbach’s alpha be calculated from the standardized variables, even if the original variables were considered in the command alpha. The output generated can be seen in Fig. 12.53.


FIG. 12.53 Result of Cronbach’s alpha in Stata.

FIG. 12.54 Internal consistency when excluding each variable—last column.

If researchers choose to obtain Cronbach’s alpha values when excluding each one of the variables, as what is done in SPSS, they may type the following command: alpha finance costs marketing actuarial, asis std item

The new outputs are shown in Fig. 12.54, in which the values of the last column are exactly the same as the ones presented in Fig. 12.52, which corroborates the fact that the variables finance, costs, and actuarial show high internal consistency for determining a single factor.
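This can also be checked directly by recalculating the statistic without the variable marketing (a quick verification consistent with the last column of Fig. 12.54):

alpha finance costs actuarial, asis std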

Part VI

Generalized Linear Models

The study of statistical distributions is not recent and, from the beginning of the 19th century until approximately the beginning of the 20th century, linear models that follow a normal distribution practically dominated the data-modeling scenario. Nonetheless, starting in the period between the two World Wars, models began to arise to represent situations that normal linear models could not represent satisfactorily. McCullagh and Nelder (1989), Turkman and Silva (2000), and Cordeiro and Demetrio (2007) mention, in this context, Berkson's (1944), Dyke and Patterson's (1952), and Rasch's (1960) work on the logistic models that involve the Bernoulli and binomial distributions; Birch's (1963) work on the models for count data involving the Poisson distribution; Feigl and Zelen's (1965), Zippin and Armitage's (1966), and Glasser's (1967) work on the exponential models; and Nelder's (1966) work on polynomial models that include the Gamma distribution. All of these models ended up being consolidated, from a theoretical and conceptual perspective, through Nelder and Wedderburn's (1972) extremely important work, in which the Generalized Linear Models were defined. They represent a group of linear regression models and nonlinear exponential models in which the dependent variable follows, for example, a normal, Bernoulli, binomial, Poisson, or Poisson-Gamma distribution. The following models are special cases of Generalized Linear Models:

– Linear Regression Models and Models with Box-Cox Transformation;
– Binary and Multinomial Logistic Regression Models;
– Poisson and Negative Binomial Regression Models for Count Data;

and the estimation of each one of them must be done respecting the characteristics of the data and the distribution of the variable that represents the phenomenon we wish to study, called the dependent variable. A Generalized Linear Model is defined as follows:

ηi = α + β1 · X1i + β2 · X2i + … + βk · Xki   (VI.1)

where η is known as the canonical link function, α represents the constant, βj (j = 1, 2, ..., k) are the coefficients of each explanatory variable and correspond to the parameters to be estimated, Xj are the explanatory variables (metric or dummies), and the subscript i represents each one of the observations of the sample being analyzed (i = 1, 2, ..., n, where n is the sample size). Box VI.1 relates each specific case of the generalized linear models to the characteristic of the dependent variable, its distribution, and the respective canonical link function.

BOX VI.1 Generalized Linear Models, Characteristics of the Dependent Variable, and Canonical Link Functions

Regression Model               Characteristic of the Dependent Variable                           Distribution                       Canonical Link Function (η)
Linear                         Quantitative                                                       Normal                             Ŷ
With Box-Cox Transformation    Quantitative                                                       Normal after the transformation    (Ŷ^λ − 1)/λ
Binary Logistic                Qualitative with 2 categories (dummy)                              Bernoulli                          ln[p/(1 − p)]
Multinomial Logistic           Qualitative with M (M > 2) categories                              Binomial                           ln[pm/(1 − pm)]
Poisson                        Quantitative with integer and non-negative values (count data)     Poisson                            ln(λ)
Negative Binomial              Quantitative with integer and non-negative values (count data)     Poisson-Gamma                      ln(u)


Therefore, for a given dependent variable Y that represents the phenomenon being studied (outcome variable), we can specify each one of the models presented in Box VI.1 in the following way:

Linear Regression Model

\hat{Y}_i = a + b_1 \cdot X_{1i} + b_2 \cdot X_{2i} + \ldots + b_k \cdot X_{ki}

(VI.2)

where Ŷ is the expected value of the dependent variable Y.

Regression Model with Box-Cox Transformation

\frac{\hat{Y}_i^{\lambda} - 1}{\lambda} = a + b_1 \cdot X_{1i} + b_2 \cdot X_{2i} + \ldots + b_k \cdot X_{ki}

(VI.3)

where Ŷ is the expected value of the dependent variable Y and λ is the Box-Cox transformation parameter that maximizes the adherence to normality of the distribution of the new variable generated from the original variable Y.

Binary Logistic Regression Model

\ln\left(\frac{p_i}{1 - p_i}\right) = a + b_1 \cdot X_{1i} + b_2 \cdot X_{2i} + \ldots + b_k \cdot X_{ki}

(VI.4)

where p is the probability of occurrence of the event of interest, defined by Y = 1, given that the dependent variable Y is a dummy.

Multinomial Logistic Regression Model

\ln\left(\frac{p_{im}}{1 - p_{im}}\right) = a_m + b_{1m} \cdot X_{1i} + b_{2m} \cdot X_{2i} + \ldots + b_{km} \cdot X_{ki}

(VI.5)

where p_m (m = 0, 1, ..., M − 1) is the probability of occurrence of each one of the M categories of the dependent variable Y.

Poisson Regression Model for Count Data

\ln(\lambda_i) = a + b_1 \cdot X_{1i} + b_2 \cdot X_{2i} + \ldots + b_k \cdot X_{ki}

(VI.6)

where λ is the expected value of the number of occurrences of the phenomenon represented by the dependent variable Y, which presents count data with a Poisson distribution.

Negative Binomial Regression Model for Count Data

\ln(u_i) = a + b_1 \cdot X_{1i} + b_2 \cdot X_{2i} + \ldots + b_k \cdot X_{ki}

(VI.7)

where u is the expected value of the number of occurrences of the phenomenon represented by the dependent variable Y, which presents count data with a Poisson-Gamma distribution. Thus, Part VI discusses the Generalized Linear Models. While Chapter 13 discusses the linear regression models and the models with Box-Cox transformation, Chapters 14 and 15 discuss the binary and multinomial logistic regression models and the Poisson and negative binomial regression models for count data, respectively, which are nonlinear exponential models, also called log-linear or semilogarithmic (to the left) models. Fig. VI.1 represents this logic.
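For readers who want to see, in advance, how these special cases translate into estimation commands, the hedged Stata sketch below only illustrates the command syntax for each case in Box VI.1; the variables y, x1, and x2 are hypothetical placeholders (they are not from the book's datasets), and each chapter of Part VI presents the actual step-by-step estimation.

* Hypothetical variables: y (dependent), x1 and x2 (explanatory)
regress y x1 x2                               // linear model (normal distribution, identity link)
logit   y x1 x2                               // binary logistic (Bernoulli, logit link)
mlogit  y x1 x2                               // multinomial logistic
poisson y x1 x2                               // Poisson regression for count data
nbreg   y x1 x2                               // negative binomial (Poisson-Gamma)
glm     y x1 x2, family(poisson) link(log)    // same Poisson model via the generic glm command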


FIG. VI.1 Generalized Linear Models and Structure of the Chapters in Part VI. (The figure links the GLM framework to the chapters: Chapter 13—simple and multiple regression models and Box-Cox transformation; Chapter 14—binary and multinomial logistic regression models; Chapter 15—Poisson and negative binomial regression models for count data.)

The chapters in Part VI are structured in the same presentation logic, in which, initially, the concepts regarding each model and the criteria for estimating its parameters are presented, always using datasets that allow us to solve practical exercises in Excel. After that, the same exercises are solved, step by step, in Stata and in SPSS. At the end of each chapter, additional exercises are proposed, whose answers are available at the end of the book.

Chapter 13

Simple and Multiple Regression Models

… because politics is for the present, but an equation is something for eternity.
Albert Einstein

13.1 INTRODUCTION

Of the techniques studied in this book, those known as simple and multiple linear regression models are, without a doubt, the most used in the different fields of knowledge. Imagine that a group of researchers is interested in studying how the rate of return of a financial asset behaves in relation to the market, or how a company's expenses vary when the factory increases its productive capability or its number of work hours, or, yet, how the number of bedrooms and the amount of floor space in a sample of residential real estate can influence the formation of sales prices. Notice that, in all these examples, the main phenomenon of interest is represented, in each case, by a metric or quantitative variable and, therefore, can be studied by means of the estimation of linear regression models, whose main goal is to analyze how a set of explanatory variables, metric or dummies, relates to a metric dependent variable (the outcome variable that represents the phenomenon under study), provided that some conditions are respected and some presuppositions are met, as we shall see in this chapter. It is important to emphasize that any and all linear regression models should be defined based on the subjacent theory and the experience of the researcher, so that it is possible to estimate the desired model, analyze the results obtained by means of statistical tests, and prepare forecasts. In this chapter, we will consider the simple and multiple linear regression models, with the following objectives: (1) introduce the concepts of simple and multiple linear regression, (2) interpret the results obtained and prepare forecasts, (3) discuss the presuppositions of the technique, and (4) present the application of the technique in Excel, Stata, and SPSS. Initially, the solution to an example will be prepared in Excel simultaneously with the presentation of the concepts and the manual solution of the example. Only after the introduction of the concepts will the procedures for the preparation of the regression technique be presented in Stata and SPSS.

13.2 LINEAR REGRESSION MODELS

First, we will address linear regression models and their presuppositions. An analysis of nonlinear regressions will be covered in Section 13.4. According to Fávero et al. (2009), the linear regression technique offers, primarily, the ability to study the relation between one or more explanatory variables, presented in linear form, and a quantitative dependent variable. As such, a general linear regression model can be defined as follows:

Y_i = a + b_1 \cdot X_{1i} + b_2 \cdot X_{2i} + \ldots + b_k \cdot X_{ki} + u_i

(13.1)

where Y represents the phenomenon under study (quantitative dependent variable), a represents the intercept (constant or linear coefficient), bj (j = 1, 2, …, k) are the coefficients of each variable (angular coefficients), Xj are the explanatory variables (metric or dummies), and u is the error term (the difference between the real value of Y and the value of Y predicted by the model for each observation). The subscript i represents each of the observations of the sample under analysis (i = 1, 2, …, n, where n is the size of the sample). The equation presented by means of Expression (13.1) represents a multiple linear regression model, since it considers the inclusion of several explanatory variables for the study of the phenomenon in question.

FIG. 13.1 Estimated simple linear regression model.

On the other hand, if only one X variable is inserted, we have before us a simple linear regression model. For didactic reasons, we will introduce the concepts and present the step-by-step process of estimating the parameters by means of a simple regression model. Following that, we will broaden the discussion to estimation in multiple regression models, including the consideration of dummy variables on the right side of the equation. It is important to emphasize, therefore, that the simple linear regression model to be estimated presents the following expression:

\hat{Y}_i = a + b \cdot X_i

(13.2)

where Ŷi represents the predicted value of the dependent variable, which will be obtained by means of the model estimation for each observation i, and a and b represent the estimated parameters of the intercept and the slope of the proposed model, respectively. Fig. 13.1 presents, graphically, the general configuration of an estimated simple linear regression model. We can, therefore, verify that, while the estimated parameter a shows the point on the regression model where X = 0, the estimated parameter b represents the slope of the model, or rather, the increase (or decrease) of Y for each additional unit of X, on average. The inclusion of the error term u in Expression (13.1), also known as the residual, is justified by the fact that any relation that can be proposed will rarely present itself perfectly. In other words, very probably the phenomenon under study, represented by the variable Y, will present a relation with some other X variable not included in the proposal and that, therefore, will need to be represented by the error term u. As such, the error term u, for each observation i, can be written as:

u_i = Y_i - \hat{Y}_i

(13.3)

According to Kennedy (2008), Fávero et al. (2009), and Wooldridge (2012), error terms occur due to some reasons that need to be known and considered by researchers, such as:

– existence of aggregated and/or nonrandom variables;
– failures in the specification of the model (nonlinear forms and omission of relevant explanatory variables);
– errors in data gathering.

More considerations regarding error terms will be made in the study of the regression model presuppositions, in Section 13.3. Having discussed the preliminary concepts, we shall now begin the study of the estimation of linear regression models.

13.2.1 Estimation of the Linear Regression Model by Ordinary Least Squares

We often glimpse, in a rational or intuitive way, the relation between variable behaviors that appear either directly or indirectly. If I swim more often at my club, will I increase my muscle mass? If I change jobs, will I have more time to spend with my children? If I save a greater portion of my wages, will I be able to retire at a younger age? These questions offer clear


TABLE 13.1 Example: Travel Time × Distance Traveled

Student | Time to Get to School (min) | Distance Traveled to School (km)
Gabriela | 15 | 8
Dalila | 20 | 6
Gustavo | 20 | 15
Leticia | 40 | 20
Luiz Ovidio | 50 | 25
Leonor | 25 | 11
Ana | 10 | 5
Antonio | 55 | 32
Julia | 35 | 28
Mariana | 30 | 20

relations between a certain dependent variable, which represents the phenomenon we wish to study, and, in this case, a single explanatory variable. The objective of regression analysis is, therefore, to provide conditions for the researcher to evaluate how a variable Y behaves based on the behavior of one or more X variables, without, necessarily, the occurrence of a cause and effect relationship. We will introduce the concepts of regression by means of an example that considers only one explanatory variable (simple linear regression). Imagine that, on a certain class day, for a group of 10 students, the professor is interested in discovering the influence of the distance traveled to get to school on the travel time. The professor completes a questionnaire with each of the 10 students and prepares a dataset, which can be found in Table 13.1. Actually, the professor wants to know the equation that governs the phenomenon "travel time to school" as a function of the "distance traveled by the students." It is known that other variables influence the time of a certain route, such as the route taken, the type of transportation, or the time at which the student left for school that day. However, the professor knows that such variables will not be part of the model, given that they were not collected for the formation of the dataset. The problem can therefore be modeled in the following manner:

time = f(dist)

As such, the equation, or simple regression model, will be:

time_i = a + b \cdot dist_i + u_i

and, in this way, the expected value (estimate) of the dependent variable, for each observation i, will be given as:

\widehat{time}_i = a + b \cdot dist_i

where a and b are the estimates of the parameters a and b, respectively. This last equation shows that the expected value of the time variable (Ŷ), also known as the conditional mean, is calculated for each sample observation as a function of the behavior of the dist variable, where the subscript i represents, for our example data, the school students (i = 1, 2, …, 10). Our objective here is, therefore, to study whether the behavior of the dependent variable time presents a relation with the variation of the distance, in kilometers, that each student travels to arrive at school on a certain class day. In our example, it does not make much sense to discuss travel time when the distance to school is zero (parameter a). Parameter b, on the other hand, will inform us of the increase in the time to arrive at school for each additional kilometer traveled, on average. We shall, as such, prepare a graph (Fig. 13.2) that relates the travel time (Y) with the distance traveled (X), where each point represents one of the students.


FIG. 13.2 Travel time × distance traveled for each student.

As previously commented, it is not only the distance traveled that affects the time needed to get to school, since it can also be affected by other variables related to traffic, means of transportation, or the individual. As such, the error term u should capture the effect of the remaining variables not included in the model. Now, in order to estimate the equation that best adjusts to this cloud of points, we should establish two fundamental conditions related to the residuals.

(1) The sum of the residuals should be zero: \sum_{i=1}^{n} u_i = 0, where n is the sample size.

With only this first condition, several lines of regression can be found for which the sum of the residuals is zero, as shown in Fig. 13.3. Notice that, for the same dataset, several lines can respect the condition that the sum of the residuals is equal to zero. Therefore, it becomes necessary to establish a second condition.

(2) The residual sum of squares is the least possible: \sum_{i=1}^{n} u_i^2 = min.

With this condition, we choose the model that presents the best possible adjustment to the cloud of points, which gives us the definition of least squares. In other words, a and b should be determined in such a way that the sum of the squares of the residuals is the least possible (ordinary least squares—OLS method). As such:

\sum_{i=1}^{n} (Y_i - b \cdot X_i - a)^2 = min   (13.4)

The minimization occurs by differentiating Expression (13.4) with respect to a and b and setting the resulting expressions equal to zero. As such:

\frac{\partial\left[\sum_{i=1}^{n}(Y_i - b \cdot X_i - a)^2\right]}{\partial a} = -2\sum_{i=1}^{n}(Y_i - b \cdot X_i - a) = 0   (13.5)

\frac{\partial\left[\sum_{i=1}^{n}(Y_i - b \cdot X_i - a)^2\right]}{\partial b} = -2\sum_{i=1}^{n} X_i \cdot (Y_i - b \cdot X_i - a) = 0   (13.6)

Distributing and dividing Expression (13.5) by 2n, where n is the sample size, we have that:

\frac{-2\sum_{i=1}^{n} Y_i}{2n} + \frac{2\sum_{i=1}^{n} b \cdot X_i}{2n} + \frac{2\sum_{i=1}^{n} a}{2n} = \frac{0}{2n}   (13.7)

from which comes:

-\bar{Y} + b \cdot \bar{X} + a = 0   (13.8)

and, therefore:

a = \bar{Y} - b \cdot \bar{X}   (13.9)

where \bar{Y} and \bar{X} represent the sample averages of Y and of X, respectively.


FIG. 13.3 (A–C) Three examples of lines of regression where the sum of residuals is zero.


Substituting this result in Expression (13.6), we have that:

-2\sum_{i=1}^{n} X_i \cdot \left(Y_i - b \cdot X_i - \bar{Y} + b \cdot \bar{X}\right) = 0   (13.10)

which, when developed:

\sum_{i=1}^{n} X_i \cdot \left(Y_i - \bar{Y}\right) - b \cdot \sum_{i=1}^{n}\left(X_i - \bar{X}\right) \cdot X_i = 0   (13.11)

which therefore generates:

b = \frac{\sum_{i=1}^{n}\left(X_i - \bar{X}\right)\cdot\left(Y_i - \bar{Y}\right)}{\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2}   (13.12)
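As a numerical check of Expressions (13.9) and (13.12), the hedged Stata sketch below enters the example data from Table 13.1 and computes a and b from the definitional formulas before comparing them with the built-in regress command; the variable and scalar names (time, dist, mx, my, num, den) are chosen here only for illustration.

clear
input time dist
15 8
20 6
20 15
40 20
50 25
25 11
10 5
55 32
35 28
30 20
end
quietly summarize dist
scalar mx = r(mean)                          // mean of X (17)
quietly summarize time
scalar my = r(mean)                          // mean of Y (30)
generate double num = (dist - mx)*(time - my)
generate double den = (dist - mx)^2
quietly summarize num
scalar sxy = r(sum)                          // 1155
quietly summarize den
scalar sxx = r(sum)                          // 814
display "b = " sxy/sxx "   a = " my - (sxy/sxx)*mx
regress time dist                            // should reproduce a = 5.8784 and b = 1.4189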


TABLE 13.2 Calculation Spreadsheet for the Determination of a and b

Observation (i) | Time (Yi) | Distance (Xi) | Yi − Ȳ | Xi − X̄ | (Xi − X̄)·(Yi − Ȳ) | (Xi − X̄)²
1 | 15 | 8 | −15 | −9 | 135 | 81
2 | 20 | 6 | −10 | −11 | 110 | 121
3 | 20 | 15 | −10 | −2 | 20 | 4
4 | 40 | 20 | 10 | 3 | 30 | 9
5 | 50 | 25 | 20 | 8 | 160 | 64
6 | 25 | 11 | −5 | −6 | 30 | 36
7 | 10 | 5 | −20 | −12 | 240 | 144
8 | 55 | 32 | 25 | 15 | 375 | 225
9 | 35 | 28 | 5 | 11 | 55 | 121
10 | 30 | 20 | 0 | 3 | 0 | 9
Sum | 300 | 170 | | | 1155 | 814
Average | 30 | 17 | | | |

Returning to our example, the professor then prepares a calculation spreadsheet in order to obtain the linear regression model, as shown in Table 13.2. By means of the spreadsheet presented in Table 13.2, we can calculate the estimators a and b as follows:

b = \frac{\sum_{i=1}^{n}(X_i - \bar{X})\cdot(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2} = \frac{1155}{814} = 1.4189

a = \bar{Y} - b \cdot \bar{X} = 30 - 1.4189 \cdot 17 = 5.8784

And the simple linear regression equation can be written as:

\widehat{time}_i = 5.8784 + 1.4189 \cdot dist_i

The estimation of our example model can also be done by means of the Solver tool in Excel, respecting the conditions that \sum_{i=1}^{10} u_i = 0 and \sum_{i=1}^{10} u_i^2 = min. In this way, we can initially open the file TimeLeastSquares.xls, which contains our example data, besides the columns referring to Ŷ, to u, and to u² for each observation. Fig. 13.4 presents this file, before the preparation of the Solver procedure. According to the logic proposed by Belfiore and Fávero (2012), we now open the Excel Solver tool. The objective function is in cell E13, which is our destination cell and which should be minimized (residual sum of squares).

FIG. 13.4 TimeLeastSquares.xls dataset.


FIG. 13.5 Solver—minimization of the residual sum of squares.

Besides this, parameters a and b, whose values are in cells H3 and H5, respectively, are the variable cells. Finally, we should impose that the value of cell D13 be equal to zero (the restriction that the sum of the residuals be equal to zero). The Solver window will be as shown in Fig. 13.5. By clicking on Solve and then OK, we obtain the best solution to the minimization of the residual sum of squares. Fig. 13.6 presents the results obtained by the model. Therefore, the intercept a is 5.8784 and the angular coefficient b is 1.4189, exactly as we estimated by means of the analytical solution. In an elementary way, the average time to get to school for students who did not travel any distance, or rather, who were already at school, is 5.8784 min, which does not make much sense from a physical point of view. This type of situation, in which the value of a is not in keeping with reality, can occur frequently. From the mathematical point of view, this is not incorrect. However, the researcher should always analyze the physical or economic sense of the situation under study, as well as the subjacent theory used. In analyzing the graph in Fig. 13.2, we notice that there is no student with a distance traveled near zero, and the intercept only reflects an extension, projection, or extrapolation of the regression model up to the Y axis. It is even common for some models to present a negative a in the study of phenomena


FIG. 13.6 Obtaining the parameters of the sum minimization of u2 by Solver.

that cannot take negative values. Therefore, the researcher should always be aware of this fact, given that a regression model can be quite useful to elaborate inferences regarding the behavior of a variable Y within the limits of the variation of X, or rather, for the elaboration of interpolations. Extrapolations, on the other hand, can produce inconsistencies due to eventual changes in the behavior of the variable Y outside the limits of the variation of X in the study sample. Continuing the analysis, each additional kilometer in the distance between the departure point and the school increases travel time by 1.4189 min, on average. As such, a student who lives 10 km farther from school than another will tend to spend, on average, a little more than 14 min (1.4189 × 10) longer to get to school than the classmate who lives closer. Fig. 13.7 presents the simple linear regression model from our example. Concomitant with the discussion of each of the concepts and with the solution of the proposed example by means of the analytical form and Solver, we will also present the systematic solution by means of the Excel Regression tool. In Sections 13.5 and 13.6, we will embark on the final solution by means of Stata and SPSS, respectively. In this way, we will now open the file Timedist.xls, which contains the data from our example, or rather, the fictitious travel time and distance covered by a group of students to the school location. By clicking on Data → Data Analysis, the dialog box in Fig. 13.8 will appear. We now click on Regression and then OK. The dialog box for the insertion of the data to be considered in the regression will now appear (Fig. 13.9). For our example, the time (in minutes) variable is the dependent (Y) and the dist (in kilometers) variable is the explanatory (X). Therefore, we must insert their data in the respective entry intervals, as shown in Fig. 13.10. Besides the insertion of the data, we will also select the Residuals option, as shown in Fig. 13.10. Following this, we click on OK. A new spreadsheet will be generated with the regression outputs. We will analyze each of them as the concepts are introduced, as well as perform the calculations manually. As we can observe by means of Fig. 13.11, four groups of outputs are generated: regression statistics, analysis of variance (ANOVA), table of regression coefficients, and residuals table. We will discuss each one. As calculated previously, we can verify the regression equation coefficients in the outputs (Fig. 13.12).

FIG. 13.7 Simple linear regression model between time and distance traveled.


FIG. 13.8 Dialog box for data analysis in Excel.

FIG. 13.9 Dialog box for estimation of linear regression in Excel.

13.2.2 Explanatory Power of the Regression Model: Coefficient of Determination R²

According to Fa´vero et al. (2009), to measure the explanatory power of a certain regression model, or the percentage of variability of the Y variable, which is explained by the variation of behavior of the explanatory variables, we need to understand some important concepts. While the total sum of squares (TSS) shows the variation in Y in regards to its own average, the sum of squares due to regression (SSR) offers a variation of Y considering the X variables used in the model. Besides this, the residual sum of squares (RSS) presents the variation of Y, which is not explained in the prepared model. We can therefore define that:

TSS = SSR + RSS   (13.13)

being:

(Y_i - \bar{Y}) = (\hat{Y}_i - \bar{Y}) + (Y_i - \hat{Y}_i)   (13.14)


FIG. 13.10 Insertion of data for estimation of linear regression in Excel.

FIG. 13.11 Simple linear regression outputs in Excel.


FIG. 13.12 Linear regression equation coefficients.

where Yi is the value of Y in each observation i of the sample, Ȳ is the average of Y, and Ŷi represents the adjusted value of the regression model for each observation i. As such, we have that:

– (Y_i − Ȳ): total deviation of the value of each observation in relation to the average;
– (Ŷ_i − Ȳ): deviation of the value of the regression model for each observation in relation to the average;
– (Y_i − Ŷ_i): deviation of the value of each observation in relation to the regression model;

which results in:

\sum_{i=1}^{n}\left(Y_i - \bar{Y}\right)^2 = \sum_{i=1}^{n}\left(\hat{Y}_i - \bar{Y}\right)^2 + \sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2   (13.15)

or:

\sum_{i=1}^{n}\left(Y_i - \bar{Y}\right)^2 = \sum_{i=1}^{n}\left(\hat{Y}_i - \bar{Y}\right)^2 + \sum_{i=1}^{n}(u_i)^2   (13.16)

which is the very Expression (13.13). Fig. 13.13 graphically shows this relation. With these considerations made and the regression equation defined, we embark on the study of the explanatory power of the regression model, also known as the coefficient of determination R2. Stock and Watson (2004) define R2 as the fraction of the variance of the Yi sample explained (or predicted) by the explanatory variables. In the same way, Wooldridge (2012) considers R2 as the proportion of sample variation of the dependent variable explained by the set of explanatory variables, able to be used as a measure of degree of adjustment for the proposed model.


FIG. 13.13 Deviations of Y for two observations.

According to Fávero et al. (2009), the explanatory capacity of the model is analyzed by the coefficient of determination R² of the regression. For a simple regression model, this measure shows how much of the behavior of the variable Y is explained by the variation in the behavior of the variable X, always remembering that there is not, necessarily, a cause and effect relationship between the X and Y variables. For the multiple regression model, this measure shows how much of the behavior of the variable Y is explained by the joint variation of the X variables considered in the model. The R² is obtained in the following manner:

R^2 = \frac{SSR}{SSR + RSS} = \frac{SSR}{TSS}   (13.17)

or

R^2 = \frac{\sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2}{\sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^{n}(u_i)^2}   (13.18)

Also according to Fávero et al. (2009), the R² can vary between 0 and 1 (0%–100%); however, it is practically impossible to obtain an R² equal to 1, since it would be very difficult for all the points to fall on a line. In other words, if the R² were 1, there would be no residuals for any of the observations in the sample under study, and the variability of the variable Y would be totally explained by the vector of X variables considered in the regression model. The more disperse the cloud of points, the less the X and Y variables will relate, the greater the residuals will be, and the closer the R² will be to zero. In an extreme case, if the variation of X does not correspond to any variation in Y, the R² will be zero. Fig. 13.14 presents, in an illustrative manner, the behavior of the R² in different cases. Returning to our example, where the professor intends to study the behavior of the time students take to get to school and whether this phenomenon is influenced by the distance traveled by the students, we present the following spreadsheet (Table 13.3), which will aid us in calculating the R². The spreadsheet presented in Table 13.3 allows us to calculate the R² of the simple linear regression model for our example. As such:

R^2 = \frac{\sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2}{\sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^{n}(u_i)^2} = \frac{1638.85}{1638.85 + 361.15} = 0.8194

In this way, we can now affirm that, for the sample studied, 81.94% of the variability of the time to get to school is due to the variable referring to the distance traveled on the route taken by each of the students. Therefore, a little more than 18% of the variability is due to other variables not included in the model, which, consequently, ended up in the residuals.
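The same decomposition can be verified in Stata with the minimal sketch below; it assumes the time/dist data are already in memory (for instance, entered with input as in the earlier sketch), and the variable names timehat, u, ssr_i, and rss_i are illustrative choices.

quietly regress time dist
display "R2 reported by regress = " e(r2)         // about 0.8194
predict double timehat, xb
predict double u, residuals
generate double ssr_i = (timehat - 30)^2           // 30 is the sample mean of time
generate double rss_i = u^2
quietly summarize ssr_i
scalar SSR = r(sum)                                // about 1638.85
quietly summarize rss_i
scalar RSS = r(sum)                                // about 361.15
display "R2 = SSR/(SSR + RSS) = " SSR/(SSR + RSS)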


FIG. 13.14 R2 behavior for different simple linear regressions.

TABLE 13.3 Spreadsheet for the Calculation of the Coefficient of Determination R² of the Regression Model

Observation (i) | Time (Yi) | Distance (Xi) | Ŷi | ui = Yi − Ŷi | (Ŷi − Ȳ)² | (ui)²
1 | 15 | 8 | 17.23 | −2.23 | 163.08 | 4.97
2 | 20 | 6 | 14.39 | 5.61 | 243.61 | 31.45
3 | 20 | 15 | 27.16 | −7.16 | 8.05 | 51.30
4 | 40 | 20 | 34.26 | 5.74 | 18.12 | 32.98
5 | 50 | 25 | 41.35 | 8.65 | 128.85 | 74.80
6 | 25 | 11 | 21.49 | 3.51 | 72.48 | 12.34
7 | 10 | 5 | 12.97 | −2.97 | 289.92 | 8.84
8 | 55 | 32 | 51.28 | 3.72 | 453.00 | 13.81
9 | 35 | 28 | 45.61 | −10.61 | 243.61 | 112.53
10 | 30 | 20 | 34.26 | −4.26 | 18.12 | 18.12
Sum | 300 | 170 | | | 1638.85 | 361.15
Average | 30 | 17 | | | |

Obs.: Ŷi = time-hat_i = 5.8784 + 1.4189·dist_i.

The outputs generated by Excel also bring out this information, according to what can be seen in Fig. 13.15. Note that the outputs also supply the values of Y^ and the residuals for each observation, as well as the minimum value of the sum of the squares of the residuals, which are exactly equal to those obtained by estimation of the parameters by means of the Excel Solver tool (Fig. 13.6) and also calculated and presented in Table 13.3. By means of these values, we can now calculate the R2. According to Stock and Watson (2004) and Fa´vero et al. (2009), the coefficient of determination R2 does not tell researchers if a certain explanatory variable is statistically significant and if this variable is the true cause of the change in behavior for the dependent variable. More than that, the R2 does not provide the ability to evaluate the existence of an eventual bias in the omission of explanatory variables and if the choice of those inserted into the proposed model was appropriate.


FIG. 13.15 Coefficient of determination R2 of the regression.

The importance given to the dimension of the R² is often excessive. In different situations, researchers highlight the adequacy of their models because of the high R² values obtained, even emphasizing a cause and effect relationship between the explanatory variables and the dependent variable, which is quite erroneous, since this measure merely captures the relation between the variables used in the model. Wooldridge (2012) is even more emphatic, highlighting that it is fundamental not to give considerable importance to the R² value in the evaluation of regression models. According to Fávero et al. (2009), if we are able, for example, to find a single variable that explains 40% of the behavior of a stock's returns, this could at first seem like a low capacity of explanation. However, if a single variable is able to capture this entire relationship in a situation where innumerable other economic, financial, perceptual, and social factors exist, the model could be quite satisfactory. The general statistical significance of the model and of its estimated parameters is not given by the R², but by means of appropriate statistical tests, which we will study in the next section.

13.2.3 General Statistical Significance of the Regression Model and Each of Its Parameters

To begin, it is of fundamental importance to study the general statistical significance of the estimated model. With this in mind, we should make use of the F-test, whose null and alternative hypotheses, for a general regression model, are, respectively:

H0: b1 = b2 = … = bk = 0
H1: there is at least one bj ≠ 0

And, for a simple regression model, these hypotheses are expressed as:

H0: b = 0
H1: b ≠ 0


This test allows the researcher to verify whether the model being estimated does in fact exist, since, if all the bj (j = 1, 2, …, k) are statistically equal to zero, the behavior of each of the explanatory variables will not influence in any way the behavior of the dependent variable. The F statistic is presented in the following expression:

F = \frac{SSR/(k-1)}{RSS/(n-k)} = \frac{\sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2/(k-1)}{\sum_{i=1}^{n}(u_i)^2/(n-k)}   (13.19)

where k represents the number of parameters of the estimated model (including the intercept) and n, the size of the sample. Therefore, we can obtain an F statistic expression based on the R² expression presented in Expression (13.17). As such, we have that:

F = \frac{R^2/(k-1)}{(1-R^2)/(n-k)}   (13.20)

Returning, then, to our initial example, we obtain:

F = \frac{1638.85/(2-1)}{361.15/(10-2)} = 36.30

where, for 1 degree of freedom for the regression (k − 1 = 1) and 8 degrees of freedom for the residuals (n − k = 10 − 2 = 8), we have, by means of Table A in the Appendix, that Fc = 5.32 (critical F at the significance level of 5%). In this way, as the calculated F, Fcal = 36.30 > Fc = F1,8,5% = 5.32, we can reject the null hypothesis that all the bj parameters are statistically equal to zero. At least one X variable is statistically significant to explain the variability of Y, and we will have a statistically significant regression model for forecasting purposes. As, in this case, we have only one X variable (simple regression), it will be statistically significant, at the significance level of 5%, to explain the behavior of the variation of Y. The outputs offer, by means of the analysis of variance (ANOVA), the F statistic and its corresponding significance level (Fig. 13.16). Software packages such as Excel, Stata, and SPSS do not directly offer Fc for the defined degrees of freedom and the determined significance level. However, they do offer the significance level of Fcal for these degrees of freedom. As such, instead of analyzing whether Fcal > Fc, we should verify whether the significance level of Fcal is less than 0.05 (5%) so as to give continuity to the regression analysis. Excel calls this significance level F significance; if the F significance is less than 0.05, the analysis may continue.

As the calculated t statistic for parameter b exceeds tc = t8,2.5% = 2.306, we can, therefore, reject the null hypothesis in this case, or rather, at the significance level of 5%, we cannot affirm that this parameter is statistically equal to zero. These outputs are shown in Fig. 13.19.


FIG. 13.18 Standard error calculation.

Analogous to the F-test, instead of analyzing whether tcal > tc for each parameter, we directly verify whether the significance level (P-value) of each tcal is less than 0.05 (5%), so as to maintain the parameter in the final model. The P-value of each tcal can be obtained in Excel by means of the command Formulas → Insert Function → DISTT, which will open a dialog box as shown in Fig. 13.20. In this figure, the dialog boxes corresponding to parameters a and b are already presented. It is important to mention that, for simple regressions, the statistic F = t² for parameter b, as shown by Fávero et al. (2009). In our example, therefore, we can verify that:

t²_b = (6.0252)² = 36.30 = F

Being that hypothesis H1 of the F-test tells us that at least one b parameter is statistically different from zero at a certain significance level, and being that a simple regression presents only one b parameter, if H0 is rejected for the F-test, it will also be rejected for the t-test of this b parameter. For parameter a, however, being that tcal < tc (the P-value of tcal for parameter a is greater than 0.05) in our example, we could consider estimating a new regression that forces the intercept to be equal to zero. This can be done by means of the Excel Regression dialog box, by selecting the option Constant is zero. However, we will not carry out such a procedure, since the nonrejection of the null hypothesis that parameter a is statistically equal to zero is due to the small sample used, and it does not prevent the researcher from preparing forecasts by means of the model obtained. Imposing that a be zero could generate forecast bias by producing another model that would not be the most adequate for elaborating interpolations in the data. Fig. 13.21 illustrates this fact.
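As a quick numerical check of the F = t² identity (and of the t statistics reported by the software), a minimal Stata sketch is shown below; it assumes the example data (time, dist) are already in memory, for instance entered with input as in the earlier sketch.

quietly regress time dist
display "F statistic       = " e(F)                        // about 36.30
display "t statistic for b = " _b[dist]/_se[dist]          // about 6.03
display "t squared         = " (_b[dist]/_se[dist])^2      // equals F in a simple regression
display "t statistic for a = " _b[_cons]/_se[_cons]        // not significant at 5% here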


FIG. 13.19 Calculation of coefficients and significance t-test of parameters.

FIG. 13.20 Obtaining the levels of significance of t for parameters a and b (command Insert Function).



FIG. 13.21 Original regression model and with the intercept equal to zero.


In this way, the fact that we cannot reject that parameter a is equal to zero at a certain significance level does not necessarily imply that we should exclude it from the model. However, if this is the researchers' decision, it is important that they at least be aware that the result will be a model different from the original, with consequences for the preparation of forecasts. The nonrejection of the null hypothesis for a parameter b at a certain significance level, on the other hand, indicates that the corresponding X variable does not correlate with the Y variable and, therefore, should be excluded from the final model. When, later in this chapter, we present the regression analysis by means of the Stata (Section 13.5) and SPSS (Section 13.6) software, the Stepwise procedure will be introduced. This procedure automatically excludes or maintains the b parameters in the model as a function of the criteria presented and offers a final model with the b parameters statistically different from zero at the determined significance level.

13.2.4 Construction of the Confidence Intervals of the Model Parameters and Elaboration of Predictions

The confidence intervals for the a and bj (j = 1, 2, …, k) parameters, at the 95% confidence level, can be written, respectively, as follows:

P\left[a - t_{\alpha/2}\sqrt{\frac{\sum_{i=1}^{n}(u_i)^2}{n-k}\left(\frac{1}{n} + \frac{\bar{X}^2}{\sum_{i=1}^{n}(X_i-\bar{X})^2}\right)} \;\leq\; a \;\leq\; a + t_{\alpha/2}\sqrt{\frac{\sum_{i=1}^{n}(u_i)^2}{n-k}\left(\frac{1}{n} + \frac{\bar{X}^2}{\sum_{i=1}^{n}(X_i-\bar{X})^2}\right)}\right] = 95\%

P\left[b_j - t_{\alpha/2}\frac{s.e.}{\sqrt{\sum_{i=1}^{n}X_i^2 - \frac{\left(\sum_{i=1}^{n}X_i\right)^2}{n}}} \;\leq\; b_j \;\leq\; b_j + t_{\alpha/2}\frac{s.e.}{\sqrt{\sum_{i=1}^{n}X_i^2 - \frac{\left(\sum_{i=1}^{n}X_i\right)^2}{n}}}\right] = 95\%   (13.22)

Therefore, for our example, we have that:

Parameter a:

P\left[5.8784 - 2.306\sqrt{\frac{361.1486}{8}\left(\frac{1}{10} + \frac{289}{814}\right)} \;\leq\; a \;\leq\; 5.8784 + 2.306\sqrt{\frac{361.1486}{8}\left(\frac{1}{10} + \frac{289}{814}\right)}\right] = 95\%

P[-4.5731 \leq a \leq 16.3299] = 95\%

Being that the confidence interval for parameter a contains zero, we cannot reject, at the 95% confidence level, that this parameter is statistically equal to zero, in accordance with what was verified when calculating the t statistic.

Parameter b:

P\left[1.4189 - 2.306\frac{6.7189}{\sqrt{3704 - \frac{(170)^2}{10}}} \;\leq\; b \;\leq\; 1.4189 + 2.306\frac{6.7189}{\sqrt{3704 - \frac{(170)^2}{10}}}\right] = 95\%

P[0.8758 \leq b \leq 1.9619] = 95\%

Being that the confidence interval for parameter b does not contain zero, we can reject, at the 95% confidence level, that this parameter is statistically equal to zero, also in accordance with what was verified when calculating the t statistic. These intervals are also generated in the Excel outputs. Since the software standard is to use the 95% confidence level, these intervals are shown twice, so as to allow the researcher to manually alter the desired confidence level by selecting the Confidence Level option in the Excel Regression dialog box, while still being able to analyze the intervals for the most commonly used confidence level (95%). In other words, the 95% confidence intervals in Excel will always be presented, giving the researcher the ability to analyze, in parallel, the intervals for another confidence level. We will, therefore, alter the regression dialog box (Fig. 13.22) in order to allow the software to also calculate the parameter intervals for a confidence level of, for example, 90%. These outputs are presented in Fig. 13.23. It can be seen that the lower and upper bands are symmetrical in relation to the estimated average parameter and offer the researcher the ability to prepare forecasts with a certain confidence level. In the case of parameter b from our example, since the extremes of the lower and upper bands are positive, we can say that this parameter is positive, with 95% confidence. Besides this, we can also say that the interval [0.8758; 1.9619] contains b with 95% confidence. Different from what we did for the 95% confidence level, we will not manually calculate the intervals for the 90% confidence level. However, an analysis of the Excel outputs allows us to affirm that the interval [0.9810; 1.8568] contains b with 90% confidence. In this way, we can say that the lower the confidence level, the narrower (smaller amplitude) the interval that contains a certain parameter will be. On the other hand, the higher the confidence level, the greater the amplitude of the interval that contains this parameter. Fig. 13.24 illustrates what happens when we have a dispersed cloud of points surrounding a regression model.
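If the same confidence intervals are wanted outside of Excel, a hedged Stata sketch follows; the level() option controls the confidence level, and the time/dist data are assumed to be already in memory.

quietly regress time dist
regress, level(95)    // replays the results with 95% intervals for a and b
regress, level(90)    // narrower intervals, e.g., about [0.98; 1.86] for b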

FIG. 13.22 Alteration of the confidence level of the intervals of the parameters to 90%.


FIG. 13.23 Intervals with confidence levels of 95% and 90% for each of the parameters.

FIG. 13.24 Confidence intervals for a dispersion of points surrounding a regression model.


We can note that, although parameter a is positive and mathematically equal to 5.8784, we cannot affirm that it is statistically different from zero for this small sample, since the confidence interval contains an intercept equal to zero (origin). A larger sample could solve this problem. For parameter b, however, we can note that the slope has always been positive, with an average value mathematically calculated and equal to 1.4189. We can visually notice that its confidence interval does not contain a slope equal to zero. As has already been discussed, the rejection of the null hypothesis for parameter b, at a certain significance level, indicates that the corresponding X variable is correlated with the Y variable and, consequently, should remain in the final model. Therefore, we can conclude that the decision to exclude an X variable from a certain regression model can be made by means of


BOX 13.1 Decision to Include bj Parameters in Regression Models

Parameter bj | t Statistic (for significance level α) | t-Test (analysis of the P-value for significance level α) | Analysis of the Confidence Interval | Decision
 | tcal < tc α/2 | P-value > significance level α | Confidence interval contains zero | Exclude parameter from the model
 | tcal > tc α/2 | P-value < significance level α | Confidence interval does not contain zero | Maintain parameter in the model

Obs.: The most common practice in the applied social sciences is the adoption of significance level α = 5%.

a direct analysis of the t statistic of its respective parameter b (if tcal < tc → P-value > 0.05 → we cannot reject that the parameter is statistically equal to zero) or by means of an analysis of the confidence interval (if it contains zero). Box 13.1 presents the inclusion or exclusion criteria for the parameters bj (j = 1, 2, …, k) in regression models. After a discussion of these concepts, the professor proposed the following exercise to his students: What is the average travel time forecast (estimated Y, or Ŷ) for a student who travels 17 km to get to school? What would be the minimum and maximum values that this travel time could assume, with 95% confidence? The first part of the exercise can be solved by a simple substitution of the value Xi = 17 into the equation initially obtained. Like this:

\widehat{time}_i = 5.8784 + 1.4189 \cdot dist_i = 5.8784 + 1.4189 \cdot (17) = 29.9997 \text{ min}

The second part of the exercise takes us to the outputs in Fig. 13.23, where the a and b parameters assume the intervals [−4.5731; 16.3299] and [0.8758; 1.9619], respectively, at the 95% confidence level. As such, the equations that determine the minimum and maximum travel time values for this confidence level are:

Minimum time: \widehat{time}_{min} = -4.5731 + 0.8758 \cdot (17) = 10.3155 \text{ min}

Maximum time: \widehat{time}_{max} = 16.3299 + 1.9619 \cdot (17) = 49.6822 \text{ min}

We can therefore say that there is 95% confidence that a student who travels 17 km to get to school will take between 10.3155 and 49.6822 min, with an average estimated time of 29.9997 min. Obviously, the amplitude of these values is not small, because the confidence interval of parameter a is quite wide. This can be corrected by increasing the sample size or by including new, statistically significant X variables in the model (which would then become a multiple regression model); in the latter case, the R² value would also increase. After the professor presented the results of the model to the class, a curious student raised his hand and asked: "Professor, is there any influence of the coefficient of determination R² of the regression model on the amplitude of the confidence intervals? If we set up this linear regression again and substituted Y for Ŷ, what would the results be? Would the equation change? And the R²? And the confidence intervals?" The professor then substituted Y for Ŷ and again set up the regression by means of the dataset presented in Table 13.4. The first step taken by the professor was to prepare a new scatter plot, with the estimated regression model. This graph is presented in Fig. 13.25. As we can see, obviously all the points are now located on the regression model, since this procedure forced this situation by the fact that each Ŷi was calculated using the regression model. As such, we can state in advance that the R² for this new regression is 1. Let's look at the new outputs (Fig. 13.26). As expected, the R² is 1. Moreover, the model equation is exactly the one previously calculated, since it is the same line. However, we can see that the F- and t-tests cause us to strongly reject their respective null hypotheses. Even parameter a, which previously could not be considered statistically different from zero, now presents a t-test that tells us that we can reject, at the 95% confidence level (or higher), that this parameter is statistically equal to zero.
This occurs because previously the small sample used (n ¼ 10 observations) did not allow us to affirm that the intercept was different from zero, being that the dispersion of points generated a confidence interval that had an intercept equal to zero (Fig. 13.24).
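To reproduce the professor's forecast exercise outside of Excel, a minimal Stata sketch follows (assuming time and dist are in memory); the minimum and maximum bounds simply reuse the extremes of the 95% intervals of a and b reported in the text, as in the manual solution above.

quietly regress time dist
* average forecast for a student who travels 17 km
display "expected time = " _b[_cons] + _b[dist]*17          // about 30.0 min
* rough bounds built from the 95% interval extremes reported in the text
display "minimum time  = " -4.5731 + 0.8758*17              // about 10.3 min
display "maximum time  = " 16.3299 + 1.9619*17              // about 49.7 min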


TABLE 13.4 Dataset for Preparation of the New Regression

Observation (i) | Predicted Time (Ŷi) | Distance (Xi)
1 | 17.23 | 8
2 | 14.39 | 6
3 | 27.16 | 15
4 | 34.26 | 20
5 | 41.35 | 25
6 | 21.49 | 11
7 | 12.97 | 5
8 | 51.28 | 32
9 | 45.61 | 28
10 | 34.26 | 20

FIG. 13.25 Scatter plot and linear regression model between predicted time (Ŷ) and distance traveled (X).


On the other hand, when all the points are on the model, each of the residual terms comes to be zero, which causes the R2 to become 1. Besides, the obtained equation is no longer an adjusted model to a dispersion of points, but the very line that passes through all the points and completely explains the sample behavior. Being such, we do not have a dispersion surrounding the regression model and the confidence intervals come to represent a null amplitude, as we can also see in Fig. 13.26. In this case, for any confidence level, the values for each parameter interval are no longer altered, which causes us to declare with 100% confidence that the [5.8784; 5.8784] interval contains a and the [1.4189; 1.4189] interval contains b. In other words, in this extreme case, a is mathematically equal to 5.8784 and b is mathematically equal to 1.4189. Being as such, R2 is an indicator of just how ample the parameter confidence intervals are. Therefore, models with higher R2 levels will give the researcher the ability to make more accurate forecasts, given that the cloud of points is less dispersed along the regression model, which will reduce the amplitude of the parameter confidence intervals. On the other hand, models with low R2 values can impair the preparation of forecasts in that the greater amplitude of the parameter confidence intervals, but does not invalidate the existence of the model as such. As we have already discussed, many researchers give too much importance to the R2; however, it will be the F-test that will truly confirm that a regression model exists (at least a considered X variable is statistically significant to explain Y). As such, it is not rare to find very low R2 values and statistically significant F-values in Administration, Accounting, or Economics models, which shows that the Y phenomenon studied underwent changes in its behavior due to some X variables adequately included in the model. However, there will be a low forecast accuracy due to the impossibility of monitoring all variables that effectively explain the variation of that Y phenomenon. Within the aforementioned knowledge areas, such a fact can be easily found in works on Finance and the Stock Market.


FIG. 13.26 Outputs of the linear regression model between predicted time (Ŷ) and distance traveled (X).

13.2.5 Estimation of Multiple Linear Regression Models

According to Fávero et al. (2009), multiple linear regression presents the same logic as simple linear regression, however now with the inclusion of more than one explanatory X variable in the model. The use of several explanatory variables depends on the subjacent theory and previous studies, as well as on the experience and good sense of the researcher, in order to give foundation to the decision. Initially, the ceteris paribus concept (holding the remaining conditions constant) should be used in the multiple regression analysis, since the parameter of each variable should be interpreted in isolation. As such, in a model that has two explanatory variables, X1 and X2, the respective coefficients will be analyzed while considering the other factors as constants. To illustrate multiple linear regression, we will use the same example that we have used in this chapter. However, we will now imagine that the professor has decided to collect one more variable from each of the students, referring to the number of traffic lights, or semaphores, each student must pass. We will call this variable sem. As such, the theoretical model becomes:

time_i = a + b_1 \cdot dist_i + b_2 \cdot sem_i + u_i


TABLE 13.5 Example: Travel Time × Distance Traveled and Number of Traffic Lights

Student | Time to Get to School (min) (Yi) | Distance Traveled to School (km) (X1i) | Number of Traffic Lights (X2i)
Gabriela | 15 | 8 | 0
Dalila | 20 | 6 | 1
Gustavo | 20 | 15 | 0
Leticia | 40 | 20 | 1
Luiz Ovidio | 50 | 25 | 2
Leonor | 25 | 11 | 1
Ana | 10 | 5 | 0
Antonio | 55 | 32 | 3
Julia | 35 | 28 | 1
Mariana | 30 | 20 | 1

which, analogous to what was presented for the simple regression, gives us:

\widehat{time}_i = a + b_1 \cdot dist_i + b_2 \cdot sem_i

where a, b1, and b2 are the estimates of the parameters a, b1, and b2, respectively. The new dataset is found in Table 13.5, as well as in the file Timedistsem.xls. We will now algebraically develop the procedures for calculating the model parameters, as we did for the simple regression model. By means of the following expression:

Y_i = a + b_1 \cdot X_{1i} + b_2 \cdot X_{2i} + u_i

we also define that the residual sum of squares is minimum. Therefore:

\sum_{i=1}^{n}(Y_i - b_1 \cdot X_{1i} - b_2 \cdot X_{2i} - a)^2 = min

The minimization occurs by differentiating the previous expression with respect to a, b1, and b2 and setting the resulting expressions equal to zero. Therefore:

\frac{\partial\left[\sum_{i=1}^{n}(Y_i - b_1 X_{1i} - b_2 X_{2i} - a)^2\right]}{\partial a} = -2\sum_{i=1}^{n}(Y_i - b_1 X_{1i} - b_2 X_{2i} - a) = 0   (13.23)

\frac{\partial\left[\sum_{i=1}^{n}(Y_i - b_1 X_{1i} - b_2 X_{2i} - a)^2\right]}{\partial b_1} = -2\sum_{i=1}^{n} X_{1i}\cdot(Y_i - b_1 X_{1i} - b_2 X_{2i} - a) = 0   (13.24)

\frac{\partial\left[\sum_{i=1}^{n}(Y_i - b_1 X_{1i} - b_2 X_{2i} - a)^2\right]}{\partial b_2} = -2\sum_{i=1}^{n} X_{2i}\cdot(Y_i - b_1 X_{1i} - b_2 X_{2i} - a) = 0   (13.25)

which generates the following system of three equations and three unknowns:

\sum_{i=1}^{n} Y_i = n \cdot a + b_1\sum_{i=1}^{n} X_{1i} + b_2\sum_{i=1}^{n} X_{2i}
\sum_{i=1}^{n} Y_i X_{1i} = a\sum_{i=1}^{n} X_{1i} + b_1\sum_{i=1}^{n} X_{1i}^2 + b_2\sum_{i=1}^{n} X_{1i}X_{2i}
\sum_{i=1}^{n} Y_i X_{2i} = a\sum_{i=1}^{n} X_{2i} + b_1\sum_{i=1}^{n} X_{1i}X_{2i} + b_2\sum_{i=1}^{n} X_{2i}^2   (13.26)


Dividing the first equation of Expression (13.26) by n, we arrive at:

a = \bar{Y} - b_1 \cdot \bar{X}_1 - b_2 \cdot \bar{X}_2

(13.27)

By substituting Expression (13.27) into the last two equations of Expression (13.26), we arrive at the following system of two equations and two unknowns:

\sum_{i=1}^{n} Y_i X_{1i} - \frac{\sum_{i=1}^{n} Y_i \sum_{i=1}^{n} X_{1i}}{n} = b_1\left[\sum_{i=1}^{n} X_{1i}^2 - \frac{\left(\sum_{i=1}^{n} X_{1i}\right)^2}{n}\right] + b_2\left[\sum_{i=1}^{n} X_{1i}X_{2i} - \frac{\sum_{i=1}^{n} X_{1i}\sum_{i=1}^{n} X_{2i}}{n}\right]

\sum_{i=1}^{n} Y_i X_{2i} - \frac{\sum_{i=1}^{n} Y_i \sum_{i=1}^{n} X_{2i}}{n} = b_1\left[\sum_{i=1}^{n} X_{1i}X_{2i} - \frac{\sum_{i=1}^{n} X_{1i}\sum_{i=1}^{n} X_{2i}}{n}\right] + b_2\left[\sum_{i=1}^{n} X_{2i}^2 - \frac{\left(\sum_{i=1}^{n} X_{2i}\right)^2}{n}\right]   (13.28)

We will now manually calculate the parameters for our example model. To do this, we need to use the spreadsheet in Table 13.6. Substituting the values into the system represented by Expression (13.28), we have:

6255 - \frac{300 \cdot 170}{10} = b_1\left[3704 - \frac{(170)^2}{10}\right] + b_2\left[231 - \frac{170 \cdot 10}{10}\right]

415 - \frac{300 \cdot 10}{10} = b_1\left[231 - \frac{170 \cdot 10}{10}\right] + b_2\left[18 - \frac{(10)^2}{10}\right]

which results in:

1155 = 814 \cdot b_1 + 61 \cdot b_2
115 = 61 \cdot b_1 + 8 \cdot b_2

Solving the system, we arrive at:

b_1 = 0.7972 \quad \text{and} \quad b_2 = 8.2963

We then have that:

a = \bar{Y} - b_1 \cdot \bar{X}_1 - b_2 \cdot \bar{X}_2 = 30 - 0.7972 \cdot (17) - 8.2963 \cdot (1) = 8.1512
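A hedged Stata sketch of the same multiple regression follows; it enters the dataset of Table 13.5 and lets regress reproduce the parameters obtained algebraically (a = 8.1512, b1 = 0.7972, b2 = 8.2963), along with the R² and adjusted R² discussed later in this section.

clear
input time dist sem
15 8 0
20 6 1
20 15 0
40 20 1
50 25 2
25 11 1
10 5 0
55 32 3
35 28 1
30 20 1
end
regress time dist sem
display "R2 = " e(r2) "   adjusted R2 = " e(r2_a)   // about 0.9374 and 0.9195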

TABLE 13.6 Spreadsheet to Calculate the Parameters for the Multiple Linear Regression

Obs. (i) | Yi | X1i | X2i | Yi·X1i | Yi·X2i | X1i·X2i | (Yi)² | (X1i)² | (X2i)²
1 | 15 | 8 | 0 | 120 | 0 | 0 | 225 | 64 | 0
2 | 20 | 6 | 1 | 120 | 20 | 6 | 400 | 36 | 1
3 | 20 | 15 | 0 | 300 | 0 | 0 | 400 | 225 | 0
4 | 40 | 20 | 1 | 800 | 40 | 20 | 1600 | 400 | 1
5 | 50 | 25 | 2 | 1250 | 100 | 50 | 2500 | 625 | 4
6 | 25 | 11 | 1 | 275 | 25 | 11 | 625 | 121 | 1
7 | 10 | 5 | 0 | 50 | 0 | 0 | 100 | 25 | 0
8 | 55 | 32 | 3 | 1760 | 165 | 96 | 3025 | 1024 | 9
9 | 35 | 28 | 1 | 980 | 35 | 28 | 1225 | 784 | 1
10 | 30 | 20 | 1 | 600 | 30 | 20 | 900 | 400 | 1
Sum | 300 | 170 | 10 | 6255 | 415 | 231 | 11,000 | 3704 | 18
Average | 30 | 17 | 1 | | | | | |


TABLE 13.7 Spreadsheet to Calculate the Remaining Statistics

Observation (i) | Time (Yi) | Distance (X1i) | Traffic Lights (X2i) | Ŷi | ui = Yi − Ŷi | (Ŷi − Ȳ)² | (ui)²
1 | 15 | 8 | 0 | 14.53 | 0.47 | 239.36 | 0.22
2 | 20 | 6 | 1 | 21.23 | −1.23 | 76.90 | 1.51
3 | 20 | 15 | 0 | 20.11 | −0.11 | 97.83 | 0.01
4 | 40 | 20 | 1 | 32.39 | 7.61 | 5.72 | 57.89
5 | 50 | 25 | 2 | 44.67 | 5.33 | 215.32 | 28.37
6 | 25 | 11 | 1 | 25.22 | −0.22 | 22.88 | 0.05
7 | 10 | 5 | 0 | 12.14 | −2.14 | 319.08 | 4.57
8 | 55 | 32 | 3 | 58.55 | −3.55 | 815.14 | 12.61
9 | 35 | 28 | 1 | 38.77 | −3.77 | 76.90 | 14.21
10 | 30 | 20 | 1 | 32.39 | −2.39 | 5.72 | 5.72
Sum | 300 | 170 | 10 | | | 1874.85 | 125.15
Average | 30 | 17 | 1 | | | |

Therefore, the estimated equation for the time to get to school now becomes:

\widehat{time}_i = 8.1512 + 0.7972 \cdot dist_i + 8.2963 \cdot sem_i

It should be remembered that the estimation of these parameters can also be obtained by means of the Excel Solver tool, as shown in Section 13.2.1. The calculations of the coefficient of determination R², the F and t statistics, and the extreme values of the confidence intervals will not be performed manually again, given that they follow exactly the same procedures already carried out in Sections 13.2.2–13.2.4 and can be done by means of the respective expressions presented so far. Table 13.7 can be of help in this sense. Let's go directly to the preparation of this multiple linear regression in Excel (file Timedistsem.xls). In the regression dialog box, we should jointly select the variables referring to the distance traveled and the number of traffic lights, as shown in Fig. 13.27. Fig. 13.28 presents the generated outputs. Within these outputs, we find the parameters of our multiple linear regression model determined algebraically. At this point, it is important to introduce the concept of the adjusted R². According to Fávero et al. (2009), when we wish to compare the coefficient of determination (R²) between two models with different sample sizes or distinct quantities of parameters, the use of the adjusted R² becomes necessary; it is a measure of the R² of the regression estimated by the OLS method adjusted by the number of degrees of freedom, since the sample estimate of R² tends to overestimate the population parameter. The adjusted R² expression is:

R^2_{adjust} = 1 - \frac{n-1}{n-k}\cdot\left(1 - R^2\right)   (13.29)

where n is the size of the sample and k is the number of regression model parameters (number of explanatory variables plus the intercept). When the number of observations is very large, the adjustment by degrees of freedom becomes negligible; however, when there is a significantly different number of X variables for the two samples, the adjusted R² should be used for the comparison between models, and the model with the higher adjusted R² should be chosen. R² increases when a new variable is added to the model; however, the adjusted R² will not always increase, and could well decrease or become negative. For this last case, Stock and Watson (2004) explain that the adjusted R² can become negative when the explanatory variables, taken as a set, reduce the residual sum of squares by such a small amount that this reduction is unable to compensate for the factor (n − 1)/(n − k).
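Expression (13.29) can be checked directly; for instance, with the values of this example reported below (n = 10, k = 3, R² = 0.9374), a one-line Stata calculation gives the adjusted R²:

display 1 - (10 - 1)/(10 - 3)*(1 - 0.9374)    // adjusted R2, about 0.9195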


FIG. 13.27 Multiple linear regression—joint selection of set of explanatory variables.

FIG. 13.28 Multiple linear regression outputs in Excel.


For our example, we have that:

$$R^2_{adjust} = 1 - \left(\frac{10-1}{10-3}\right)\left(1 - 0.9374\right) = 0.9195$$
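For readers who prefer to check these figures outside Excel, the short sketch below refits the model of Table 13.7 with the OLS routine of statsmodels in Python. The array names time, dist, and sem are simply labels chosen here for illustration; the printed values should reproduce the coefficients, R², and adjusted R² reported in the text.

import numpy as np
import statsmodels.api as sm

# Data from Table 13.7: travel time (min), distance traveled, and number of traffic lights.
time = np.array([15, 20, 20, 40, 50, 25, 10, 55, 35, 30])
dist = np.array([8, 6, 15, 20, 25, 11, 5, 32, 28, 20])
sem = np.array([0, 1, 0, 1, 2, 1, 0, 3, 1, 1])

X = sm.add_constant(np.column_stack([dist, sem]))  # intercept, dist, sem
fit = sm.OLS(time, X).fit()

print(fit.params)        # approx. [8.1512, 0.7972, 8.2963]
print(fit.rsquared)      # approx. 0.9374
print(fit.rsquared_adj)  # approx. 0.9195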

Therefore, instead of the simple regression initially applied, we should opt for this multiple regression as a better model to study the behavior of the travel time to school, since the adjusted R² is higher in this case. Let's continue with the analysis of the remaining outputs. Initially, the F-test informed us that at least one of the X variables is statistically significant to explain the behavior of Y. Besides this, we can also verify, at the 5% significance level, that all the parameters (a, b1, and b2) are statistically different from zero (P-value < 0.05).

$$P(Y_i = m) = \begin{cases} p_{logit_i} + \left(1 - p_{logit_i}\right)\left(\dfrac{1}{1 + \phi\, u_i}\right)^{1/\phi}, & m = 0 \\[2ex] \left(1 - p_{logit_i}\right)\dfrac{\left(m + \phi^{-1} - 1\right)!}{m!\left(\phi^{-1} - 1\right)!}\left(\dfrac{1}{1 + \phi\, u_i}\right)^{1/\phi}\left(\dfrac{\phi\, u_i}{1 + \phi\, u_i}\right)^{m}, & m = 1, 2, \ldots \end{cases} \qquad (15.36)$$

with Y_i following a ZINB(φ, u_i, p_logit_i) distribution, where ZINB means zero-inflated negative binomial and φ represents the inverse of the shape parameter of a determined Gamma distribution. Analogous to what was presented for the zero-inflated Poisson regression models, we have:

$$p_{logit_i} = \frac{1}{1 + e^{-(\gamma + \delta_1 W_{1i} + \delta_2 W_{2i} + \cdots + \delta_q W_{qi})}} \qquad (15.37)$$

and

$$u_i = e^{\alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki}} \qquad (15.38)$$

We can again see that, if p_logit_i = 0, the probability distribution in Expression (15.36) reduces to the Poisson-Gamma distribution, including in cases where Yi = 0. Thus, zero-inflated negative binomial regression models also present two processes that generate zeros, resulting from the binary distribution and from the Poisson-Gamma distribution. Therefore, based on Expression (15.36) and on the logarithmic likelihood function defined in Expression (15.29), we arrive at the following objective function, whose intent is to estimate the parameters φ, α, β1, β2, …, βk and γ, δ1, δ2, …, δq of a determined zero-inflated negative binomial regression model:

$$LL = \sum_{Y_i = 0} \ln\!\left[p_{logit_i} + \left(1 - p_{logit_i}\right)\left(\frac{1}{1 + \phi\, u_i}\right)^{1/\phi}\right] + \sum_{Y_i > 0}\left[\ln\!\left(1 - p_{logit_i}\right) + Y_i \ln\!\left(\frac{\phi\, u_i}{1 + \phi\, u_i}\right) - \frac{\ln\!\left(1 + \phi\, u_i\right)}{\phi} + \ln\Gamma\!\left(Y_i + \frac{1}{\phi}\right) - \ln\Gamma\!\left(\frac{1}{\phi}\right) - \ln\Gamma\!\left(Y_i + 1\right)\right] = \max \qquad (15.39)$$

whose solution can also be obtained by means of optimization tools. Next, we will present examples prepared in Stata in which the parameters of a Poisson and of a negative binomial regression model, both zero-inflated, are estimated. First, the significance of the amount of zeros in the dependent variable Y is evaluated (Vuong test) and then, afterward, the significance of the parameter φ (likelihood-ratio test for φ), that is, the existence of overdispersion in the data. Box 15.2 presents the relation between the regression models for count data and the existence of overdispersion and excess of zeros in the data of the dependent variable.
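A minimal sketch of this objective function in Python is shown below, assuming that y is the vector of counts and that X and W already contain a column of ones for the intercepts; maximizing it (for instance, by minimizing its negative with scipy.optimize.minimize) reproduces, up to numerical tolerance, the kind of estimation that Solver or Stata carry out internally.

import numpy as np
from scipy.special import gammaln

def zinb_loglik(params, y, X, W):
    # params = [alpha, b_1..b_k, gamma, d_1..d_q, ln_phi]; phi is kept positive via exp().
    k, q = X.shape[1], W.shape[1]
    beta = params[:k]
    delta = params[k:k + q]
    phi = np.exp(params[-1])
    u = np.exp(X @ beta)                          # count component mean (Expression 15.38)
    p_logit = 1.0 / (1.0 + np.exp(-(W @ delta)))  # structural-zero probability (Expression 15.37)
    nb_zero = (1.0 / (1.0 + phi * u)) ** (1.0 / phi)
    ll_zero = np.log(p_logit + (1.0 - p_logit) * nb_zero)
    ll_pos = (np.log(1.0 - p_logit)
              + gammaln(y + 1.0 / phi) - gammaln(1.0 / phi) - gammaln(y + 1.0)
              + y * np.log(phi * u / (1.0 + phi * u))
              - np.log(1.0 + phi * u) / phi)
    return np.sum(np.where(y == 0, ll_zero, ll_pos))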

BOX 15.2 Regression Models for Count Data, Overdispersion, and Excess of Zeros in the Data of the Dependent Variable

Verification                                                      Poisson   Negative Binomial   Zero-Inflated Poisson (ZIP)   Zero-Inflated Negative Binomial (ZINB)
Overdispersion in the data of the dependent variable              No        Yes                 No                            Yes
Excessive amount of zeros in the data of the dependent variable   No        No                  Yes                           Yes


In this way, while the zero-inflated models of the Poisson and negative binomial types are more appropriate when there is an excessive amount of zeros in the dependent variable, the zero-inflated negative binomial model is recommended when, besides the excess of zeros, there is also overdispersion in the data.

A.2 Example: Zero-Inflated Poisson Regression Model in Stata

So as to prepare zero-inflated regression models, we will use the Accidents.dta dataset. To build this dataset, the amount of traffic accidents that occurred in 100 cities in a determined country was investigated, which represents a dependent variable with count data. Besides this, the urban population of each municipality, the average age of its inhabitants with a current driver's license, and whether the municipality had adopted a dry law for after 10:00 p.m. were also recorded. The desc command allows us to study the dataset characteristics, as shown in Fig. 15.76.

FIG. 15.76 Description of the Accidents.dta dataset.

In this example, we will define the pop variable as the X variable, and the age and drylaw variables as the W1 and W2 variables. In other words, our goal is to investigate whether the probability of no accidents occurring, that is, the occurrence of structural zeros, is influenced by the average age of drivers and by the existence of a dry law after 10:00 p.m. in each municipality and, besides this, whether the occurrence of a determined accident count in the week under study is influenced by the population of each municipality i (i = 1, …, 100). Therefore, for the zero-inflated Poisson regression model, the parameters of the following expressions should be estimated:

$$p_{logit_i} = \frac{1}{1 + e^{-(\gamma + \delta_1 \cdot age_i + \delta_2 \cdot drylaw_i)}}$$

and

$$\lambda_i = e^{\alpha + \beta \cdot pop_i}$$

First, let's analyze the distribution of the accidents variable, typing in the following commands:

tab accidents
hist accidents, discrete freq

Figs. 15.77 and 15.78 present the table of frequencies and the histogram, respectively, and, through them, it is possible to see that, for the country under study, 58% of the municipalities analyzed did not present any traffic accident in the week researched, which indicates, even if only preliminarily, the existence of an excessive amount of zeros in the dependent variable.
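As a side note, the same preliminary check can be reproduced outside Stata; the sketch below assumes only that the Accidents.dta file is available in the working directory and that the dependent variable is named accidents, as in the commands above.

import pandas as pd

df = pd.read_stata("Accidents.dta")

print(df["accidents"].value_counts().sort_index())  # frequency table, analogous to -tab accidents-
print((df["accidents"] == 0).mean())                # share of zero counts (0.58 for the data described in the text)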


FIG. 15.77 Frequency distribution for count data of accidents variable.

FIG. 15.78 Histogram of accidents dependent variable.

To elaborate the zero-inflated Poisson regression model, we should type in the following command:

zip accidents pop, inf(age drylaw) vuong nolog

where the explanatory variable X (pop) should come immediately after the dependent variable (accidents), and the W1 and W2 variables (age and drylaw) should come in parentheses, immediately after the term inf, which stands for inflate and corresponds to the inflation of structural zeros. The term vuong causes the Vuong (1989) test to be executed, which verifies the adequacy of the zero-inflated model in relation to the specified traditional model (in this case, Poisson); in other words, its goal is


to verify the existence of an excessive amount of zeros in the dependent variable. The term nolog omits the outputs referring to the modeling iterations, so that only the maximum value of the logarithmic likelihood function is presented. Besides this, it is important to mention that the command presented implicitly adopts, as standard, the logit probability expression to verify the existence of structural zeros coming from the Bernoulli distribution. However, in case the researcher opts to work with the probit probability expression, studied in the Appendix of Chapter 14, the term probit should be added to the end of the command. The outputs are found in Fig. 15.79.
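For reference, a similar zero-inflated Poisson estimation can be sketched in Python with statsmodels, again assuming the variable names accidents, pop, age, and drylaw from the Stata commands (drylaw coded 0/1); note that this routine does not report the Vuong test, and the parameterization of the inflation coefficients may differ from the Stata output.

import pandas as pd
import statsmodels.api as sm
from statsmodels.discrete.count_model import ZeroInflatedPoisson

df = pd.read_stata("Accidents.dta")

y = df["accidents"]
X = sm.add_constant(df[["pop"]])             # count (lambda) component
W = sm.add_constant(df[["age", "drylaw"]])   # inflation (structural zeros) component

zip_res = ZeroInflatedPoisson(y, X, exog_infl=W, inflation="logit").fit(method="bfgs", maxiter=200)
print(zip_res.summary())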

FIG. 15.79 Zero-inflated Poisson regression model outputs in Stata.

The first result that should be analyzed refers to the Vuong test, whose statistic is normally distributed, with positive and significant values indicating the adequacy of the zero-inflated Poisson model, and with negative and significant values indicating the adequacy of the traditional Poisson model. For the data in our example, we can see that the Vuong test indicates the better adequacy of the zero-inflated model over the traditional model, with z = 4.19 and Pr > z = 0.000. Before analyzing the remaining outputs, it is important to mention that Desmarais and Harden (2013) propose a correction to the Vuong test, based on the Akaike information criterion (AIC) and the Bayesian (Schwarz) information criterion (BIC), which should be applied so as to eliminate eventual biases that can affect the choice of the more adequate model. To do this, one only needs to substitute the term zip with zipcv (which means zero-inflated Poisson with corrected Vuong), and the new command will be as follows:

zipcv accidents pop, inf(age drylaw) vuong nolog

However, before running it in Stata, we should install the command zipcv, typing findit zipcv and clicking on the link st0319 from http://www.stata-journal.com/software/sj13-4. Next, we should click on click here to install. The new outputs are found in Fig. 15.80. For the data in our example, while the Vuong test statistic is z = 4.19, the AIC- and BIC-corrected statistics are z = 4.13 and z = 4.04, respectively, all of them with Pr > z = 0.000. In other words, the results of the Vuong test with AIC and BIC correction continue to allow us to state that, in this case, the zero-inflated model is the most appropriate.


FIG. 15.80 Zero-inflated Poisson regression model with Vuong test correction outputs in Stata.

Notice that the remaining outputs presented in Figs. 15.79 and 15.80 are exactly the same. Based on these outputs, we can see that the estimated parameters are statistically different from zero, at 95% confidence, and the final expressions of p_logit_i and λ_i are given by:

$$p_{logit_i} = \frac{1}{1 + e^{-(-11.729 + 0.225 \cdot age_i + 1.726 \cdot drylaw_i)}}$$

and

$$\lambda_i = e^{0.933 + 0.504 \cdot pop_i}$$

A more curious researcher could obtain these same outputs by means of the Accidents ZIP Maximum Likelihood.xls file, using the Excel Solver tool, as has been the standard adopted throughout this chapter and book. In this file, the Solver criteria have already been defined.

Therefore, using Expression (15.32) and the estimated parameters, we can algebraically calculate, in the following way, the average expected number of weekly traffic accidents in a municipality of 700,000 inhabitants, with an average driver age of 40, and that does not adopt a dry law for after 10:00 p.m.:

$$\lambda_{inflate} = \left\{1 - \frac{1}{1 + e^{-[-11.729 + 0.225 \cdot (40) + 1.726 \cdot (0)]}}\right\} \cdot e^{[0.933 + 0.504 \cdot (0.700)]} = 3.39$$

The researcher can find the same result by typing the following command, whose output is found in Fig. 15.81:

mfx, at(pop=0.7 age=40 drylaw=0)
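The same arithmetic can be verified in a few lines; the sketch below simply plugs the rounded coefficients from Fig. 15.79 into the expression above, so a small difference from 3.39 is expected.

import numpy as np

gamma, d_age, d_drylaw = -11.729, 0.225, 1.726   # logit (inflate) equation
alpha, b_pop = 0.933, 0.504                      # Poisson (count) equation

pop, age, drylaw = 0.7, 40, 0                    # 700,000 inhabitants, mean driver age 40, no dry law

p_logit = 1 / (1 + np.exp(-(gamma + d_age * age + d_drylaw * drylaw)))
lam_inflate = (1 - p_logit) * np.exp(alpha + b_pop * pop)
print(round(lam_inflate, 2))                     # approx. 3.4 (3.39 with the unrounded estimates)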

Finally, by means of a graph, we can compare the predicted values for the mean number of weekly traffic accidents obtained by the zero-inflated Poisson regression model with those obtained by a traditional Poisson regression model, without


FIG. 15.81 Calculation of expected amount of weekly traffic accidents for values of explanatory variables—mfx command.

considering, therefore, the variables that influence the occurrence of structural zeros, or rather, the dichotomic component (age and drylaw variables). To do this, we can type in the following sequence of commands:

quietly zipcv accidents pop, inf(age drylaw) vuong nolog
predict lambda_inf
quietly poisson accidents pop
predict lambda
graph twoway scatter accidents pop || mspline lambda_inf pop || mspline lambda pop ||, legend(label(2 "ZIP") label(3 "Poisson"))

The generated graph is found in Fig. 15.82 and, through it, we can see that the predicted values for the zero-inflated Poisson regression model (ZIP) are adjusted more adequately to the excessive amount of zeros in the dependent variable.

FIG. 15.82 Expected number of weekly traffic accidents × municipality population (pop) for the ZIP and Poisson regression models.

Next, we will analyze, based on the same dataset, the results obtained by means of the zero-inflated negative binomial regression model.

A.3 Example: Zero-Inflated Negative Binomial Regression Model in Stata

Following the same logic, we will again use the Accidents.dta dataset; however, we will now focus on the estimation of a zero-inflated negative binomial model. Therefore, the parameters for the following expressions will be estimated.


$$p_{logit_i} = \frac{1}{1 + e^{-(\gamma + \delta_1 \cdot age_i + \delta_2 \cdot drylaw_i)}}$$

and

$$u_i = e^{\alpha + \beta \cdot pop_i}$$

As has been done throughout the chapter, we first analyze the mean and the variance of the accidents variable, typing in the following command:

tabstat accidents, stats(mean var)

Fig. 15.83 presents the generated result.

FIG. 15.83 Mean and variance of the accidents dependent variable.

As we can see, the dependent variable variance is about 14 times greater than its mean, which gives a strong indication of the existence of overdispersion in the data. Let's, therefore, go on to estimate the zero-inflated negative binomial regression model. To do this, we should type in the following command:

zinbcv accidents pop, inf(age drylaw) vuong nolog zip

which follows the same logic as the command used to estimate the ZIP model. Notice that we opted to use the term zinbcv (zero-inflated negative binomial with corrected Vuong) instead of the term zinb; even though the estimated parameters are exactly the same, the former also presents the Vuong test with AIC and BIC correction. Besides this, the term zip at the end of the command causes the likelihood-ratio test for the φ parameter (alpha in Stata) to be carried out, that is, it provides a comparison of the adequacy of the ZINB model in relation to the ZIP model. The outputs are presented in Fig. 15.84.
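A corresponding zero-inflated negative binomial estimation can be sketched with statsmodels as below (a rough Python analogue, not a reproduction of the zinbcv routine); the dispersion parameter that the output labels alpha plays the role of φ in the text, and the corrected Vuong test is not available there.

import pandas as pd
import statsmodels.api as sm
from statsmodels.discrete.count_model import ZeroInflatedNegativeBinomialP

df = pd.read_stata("Accidents.dta")

y = df["accidents"]
print(y.mean(), y.var())                     # variance far above the mean points to overdispersion

X = sm.add_constant(df[["pop"]])             # count component (u)
W = sm.add_constant(df[["age", "drylaw"]])   # inflation component (structural zeros)

zinb_res = ZeroInflatedNegativeBinomialP(y, X, exog_infl=W, inflation="logit", p=2).fit(method="bfgs", maxiter=500)
print(zinb_res.summary())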

FIG. 15.84 Zero-inflated negative binomial regression model outputs in Stata.


First, we can see that the confidence interval for the φ parameter, which is the inverse of the shape parameter of the Gamma distribution and which Stata calls alpha, does not contain zero; that is, at the 95% confidence level, we can state that φ is statistically different from zero, with an estimated value equal to 1.271. By means of the likelihood-ratio test for the φ parameter, we can conclude that the null hypothesis that this parameter is statistically equal to zero can be rejected at the 5% significance level (Sig. χ² = 0.000 < 0.05), which proves the existence of overdispersion in the data and indicates that the ZINB model is preferable to the ZIP model. Besides this, the Vuong test with AIC and BIC correction, by presenting significant z statistics at the 95% confidence level, indicates that the zero-inflated negative binomial regression model is preferable to the traditional negative binomial model, since it confirms the existence of an excessive amount of zeros.

We can also see that the estimated parameter of the pop variable is statistically different from zero at the 95% confidence level; that is, this variable is significant to explain the behavior of the weekly amount of traffic accidents (count component). In the same way, the age and drylaw variables are statistically significant to explain the excessive amount of zeros (structural zeros) in the accidents variable (dichotomic component). Based on these outputs, we arrive at the final expressions of p_logit_i and u_i, given by:

$$p_{logit_i} = \frac{1}{1 + e^{-(-16.237 + 0.288 \cdot age_i + 2.859 \cdot drylaw_i)}}$$

and

$$u_i = e^{0.025 + 0.866 \cdot pop_i}$$

Then, a curious researcher can obtain these same outputs by means of the Accidents ZINB Maximum Likelihood.xls file, using the Excel Solver tool, according to the standard adopted throughout this chapter and book. In this file, the Solver criteria have been previously defined.

Using Expression (15.36) and the estimated parameters, we can again calculate, algebraically, the average expected amount of weekly traffic accidents for a municipality of 700,000 inhabitants, with an average driver age of 40, and that does not have a dry law for after 10:00 p.m., as follows:

$$u_{inflate} = \left\{1 - \frac{1}{1 + e^{-[-16.237 + 0.288 \cdot (40) + 2.859 \cdot (0)]}}\right\} \cdot e^{[0.025 + 0.866 \cdot (0.700)]} = 1.86$$

The researcher can also find the same result by typing in the following command, whose output is presented in Fig. 15.85:

mfx, at(pop=0.7 age=40 drylaw=0)

FIG. 15.85 Calculation of expected amount of weekly traffic accidents for values of explanatory variables—mfx command.

Theoretically, the modeling could be finalized at this point. However, if the researcher is also interested in estimating the parameters of a ZIP model, so as to compare them with those obtained by the ZINB model, the following sequence of commands can be typed:

eststo: quietly zip accidents pop, inf(age drylaw) vuong
prcounts lambda_inflate, plot
eststo: quietly zinb accidents pop, inf(age drylaw) vuong
prcounts u_inflate, plot
esttab, scalars(ll) se


FIG. 15.86 Main results obtained in ZIP and ZINB estimations.

which generates the outputs presented in Fig. 15.86. These consolidated outputs allow us to see, besides the differences between the estimated parameters of both models, that the value obtained for the logarithmic likelihood function (ll) is considerably higher for the ZINB model (model 2 in Fig. 15.86), which is another indication of the better adequacy of this model over the ZIP model for the data in our example. Another way to compare the ZINB and ZIP estimations is by analyzing the distributions of the observed and predicted probabilities of weekly accident occurrences for the two estimations, analogous to what we discussed throughout the chapter, using the variables generated by the prcounts commands. To do this, we must enter the following command, which will generate the graph in Fig. 15.87:

graph twoway (scatter u_inflateobeq u_inflatepreq lambda_inflatepreq u_inflateval, connect(1 1 1))

where the variables u_inflatepreq and lambda_inflatepreq correspond to the predicted occurrence probabilities of accident counts from 0 to 9 obtained, respectively, by the ZINB and ZIP models. Besides this, while the variable u_inflateobeq corresponds to the observed probabilities of the dependent variable and, therefore, presents the same probability distribution shown in Fig. 15.77 for up to 9 traffic accidents, the variable u_inflateval contains the actual count values from 0 to 9, to which the observed probabilities refer.
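The probabilities behind such a comparison can also be computed directly from the fitted expressions. The sketch below is a generic helper, not the prcounts routine, and assumes that u (the predicted count means), p_logit (the predicted structural-zero probabilities), and, for the ZINB case, phi have already been evaluated for each municipality; comparing these averages with the observed relative frequencies mirrors what Fig. 15.87 displays.

import numpy as np
from scipy.special import gammaln

def zip_count_probs(u, p_logit, m_max=9):
    # Mean predicted ZIP probabilities for counts 0..m_max, averaged over the sample.
    m = np.arange(m_max + 1)
    pr = np.exp(m * np.log(u[:, None]) - u[:, None] - gammaln(m + 1))
    pr = (1 - p_logit[:, None]) * pr
    pr[:, 0] += p_logit
    return pr.mean(axis=0)

def zinb_count_probs(u, p_logit, phi, m_max=9):
    # Mean predicted ZINB probabilities for counts 0..m_max, averaged over the sample.
    m = np.arange(m_max + 1)
    ln_nb = (gammaln(m + 1 / phi) - gammaln(1 / phi) - gammaln(m + 1)
             + m * np.log(phi * u[:, None] / (1 + phi * u[:, None]))
             - np.log(1 + phi * u[:, None]) / phi)
    pr = (1 - p_logit[:, None]) * np.exp(ln_nb)
    pr[:, 0] += p_logit
    return pr.mean(axis=0)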


FIG. 15.87 Observed and predicted probability distributions of weekly traffic accidents for the ZINB and ZIP models.

By analyzing the graph in Fig. 15.87, we see that the estimated (predicted) probability distribution of the ZINB model is better adjusted to the observed distribution than the one estimated for the ZIP model, for counts of up to 9 traffic accidents per week. Alternatively, as we have discussed throughout the chapter, this fact can also be verified by applying the countfit command, which offers, besides the observed and predicted probabilities for each count (from 0 to 9) of the dependent variable, the error terms resulting from the difference between the probabilities obtained by the ZINB and ZIP models. To do this, we can type the following command:

countfit accidents pop, zip zinb noestimates

which generates the outputs in Fig. 15.88 and the graph in Fig. 15.89. Figs. 15.88 and 15.89 show us, once again, that the ZINB adjustment is better than the ZIP adjustment, for the following reasons:

– While the maximum difference between the observed and predicted probabilities for the ZIP model is, in module, equal to 0.070, for the ZINB model it is, in module, equal to 0.016.
– The average of these differences is 0.024 for the ZIP model and 0.006 for the ZINB model.
– The total Pearson value is lower in the ZINB model (1.789) than in the ZIP model (61.233).

The graph in Fig. 15.89 allows a visual comparison of the generated error terms, highlighting the ZINB model adjustment, whose error curve is consistently closer to zero. As was done previously, we can also graphically compare the predicted values of the mean quantity of weekly traffic accidents obtained by the ZIP and ZINB models with those obtained by the corresponding traditional Poisson and negative binomial regression models (nbreg command), without considering the variables that influence the occurrence of structural zeros (age and drylaw variables). To do this, we should type in the following sequence of commands:

quietly poisson accidents pop
predict lambda
quietly nbreg accidents pop
predict u
graph twoway mspline lambda_inflaterate pop || mspline u_inflaterate pop || mspline lambda pop || mspline u pop ||, legend(label(1 "ZIP") label(2 "ZINB") label(3 "Poisson") label(4 "Negative Binomial"))

FIG. 15.88 Observed and predicted probabilities for each count of the dependent variable and the respective error terms.

FIG. 15.89 Error terms resulting from the difference between the observed and predicted probabilities (ZINB and ZIP models).


The generated graph is found in Fig. 15.90.

FIG. 15.90 Expected number of weekly traffic accidents × municipality population (pop) for the ZIP, ZINB, Poisson, and negative binomial regression models.

Two considerations can be made in relation to this graph. The first concerns the variance of the predicted number of weekly traffic accidents, which causes the ZINB and negative binomial curves to be more elongated at the upper right side of the graph than those generated by the corresponding ZIP and Poisson models, which are not able to capture the existence of overdispersion in the data. Besides this, we can also see that the predicted values generated by the ZINB and ZIP models are better adjusted to the excessive amount of zeros than those of the Poisson and negative binomial models, since they present smaller slopes, especially for a lower number of expected accidents. As such, it is important for the researcher to have a complete notion of the regression models for count data, so as to estimate, in the best manner possible, the model parameters, always considering the nature and behavior of the dependent variable that represents the phenomenon under study.

Chapter 16

Introduction to Optimization Models: General Formulations and Business Modeling

Education is the most powerful weapon which you can use to change the world.
Nelson Mandela

16.1 INTRODUCTION TO OPTIMIZATION MODELS

Optimization models are used to solve problems in several industrial and commercial sectors (strategy, marketing, finance, operations and logistics, human resources, among others) and to make decisions on the most effective use of resources. This chapter describes how optimization models can help researchers and managers in the decision-making process. First of all, it is important to study the main concepts involved in this process.

There are many definitions for the concept of decision. One of them is that decision making refers to the process of analyzing the many alternative courses of action available, that is, the process that results in the selection of a belief or a course of action among several alternative possibilities. Some examples of decisions can be listed here: choosing one location among many others that are available, determining the best stock portfolio, or choosing among several alternatives that balance the company's production resources, such as personnel available, hiring, firing, and inventory. Thus, we can see that the organization's goals are directly linked to the decision-making process. In order to minimize the uncertainties, risks, and complexities that are inherent to the process, and aiming at making the most effective decision among the several alternatives available, the value and quality of the information available become essential. Communication among the stakeholders involved in the process, during the information collection phase, as well as when defining the objective and the reasoning of the group, also influences the decisions to be made. And it is exactly with greater focus on an effective decision-making process, considering the several interfaces and exogeneities of systems and markets, that optimization models insert themselves as a field of knowledge, in order to provide the decision-making agent with a stronger foundation and better knowledge of the problem being analyzed, be it in finance, economics, logistics, or marketing.

On the other hand, according to Lisboa (2002), a model is a simplified representation of a real system. It can be an existing project or a future project. In the former, we intend to replicate the operations of a real existing system, in order to increase productivity. In the latter, the main goal is to define the ideal structure of the future system. The behavior of a real system is influenced by several variables involved in the decision-making process. Due to the high complexity of this system, it becomes necessary to simplify it through a model, in such a way that the main variables involved in the system or project that we aim to understand or control are considered in its construction, as shown in Fig. 16.1.

A model is made of three main elements: (a) decision and parameter variables; (b) an objective function; and (c) constraints.

(a) Decision and parameter variables: Decision variables are the unknown values that will be determined by solving the model. The optimization models studied consider the following decision variable measurement and precision scales: continuous, discrete, or


FIG. 16.1 Modeling from a real system. Source: Andrade, E.L., 2015. Introdução à Pesquisa Operacional: Métodos e Modelos Para Análise de Decisões, fifth ed. LTC, Rio de Janeiro.

binary. Decision variables should assume non-negative values. The types of variables and their respective measurement and precision scales were studied in Chapter 2. Parameters are the previously known fixed values of the problem. As examples of parameters within a mathematical model, we can mention: (a) the demand for each product in a production mix problem; (b) the variable cost to produce a certain kind of furniture; (c) the profit or cost per unit of product manufactured; (d) the cost per employee hired; (e) the unit contribution margin whenever a certain electrical appliance is manufactured and sold.

(b) Objective function: The objective function is a mathematical function that determines the target value that we intend to achieve, or the quality of the solution, based on the decision variables and on the parameters. It can be a maximization function (profit, revenue, usefulness, service level, wealth, life expectancy, among other attributes) or a minimization function (cost, risk, error, among others). As examples, we can mention the: (a) minimization of the total production cost of several types of chocolates; (b) minimization of the credit risk in a client portfolio; (c) minimization of the number of employees involved in a certain service; (d) maximization of the return on investment in stock and fixed income funds; (e) maximization of the net profit in the production of several types of soft drinks.

(c) Constraints: Constraints can be defined as a set of equations (mathematical expressions of equality) and inequalities (mathematical expressions of inequality) that the decision variables of the model should meet. Constraints are added to the model in order to consider the system's physical limitations, and they directly impact the values of the decision variables. As examples of constraints to be considered in a mathematical model, we can mention: (a) maximum production capacity; (b) maximum risk a certain investor is willing to take; (c) maximum number of vehicles available; (d) minimum acceptable demand for a product.

Modeling a decision-making process has the advantage of forcing decision makers to clearly define their goals. Furthermore, it facilitates the identification and storing of the different decisions that impact the goals, it allows us to define the main variables involved in the decision-making process and the system's own limitations, besides allowing greater interaction among the work group.

Optimization models can be divided into: linear programming, network programming, integer programming, nonlinear programming, goal or multiobjective programming, and dynamic programming (see Fig. 16.2). In this chapter, we will discuss the modeling of linear programming problems. The solution of linear programming models will be presented in Chapter 17. Network programming and integer programming models will be studied in Chapters 18 and 19, respectively. Nonlinear programming, multiobjective programming, and dynamic programming models are not the focus of this book, but can be found in Belfiore and Fávero (2012, 2013).

FIG. 16.2 Classification of optimization models.


16.2 INTRODUCTION TO LINEAR PROGRAMMING MODELS

In a linear programming problem (LP), the model's objective function and all its constraints are represented by linear functions. Moreover, all the decision variables must be continuous, that is, they can assume any value in an interval of real numbers. The main goal is to maximize or minimize a certain linear function of the decision variables, subject to a set of constraints represented by linear equations or inequalities, including the non-negativity of the decision variables. After constructing the mathematical model that represents the real LP problem being studied, the next step is to determine the optimal solution for the model, which is the one with the highest value (if it is a maximization problem) or the lowest value (if it is a minimization problem) of the objective function that meets the linear constraints established. Many algorithms or methods can be applied to find the optimal solution of the model; however, the Simplex method is the best known and the most common.

Since George B. Dantzig developed the Simplex method in 1947, LP has been used to optimize real problems in several sectors. As examples, we can mention trade, services, banking, transportation, automobile, aviation, naval, food, beverages, agriculture and livestock, health, real estate, metallurgy, mining, paper and cellulose, electrical energy, oil, gas and fuels, computers, and the communication sector, among others. Therefore, the use of linear programming techniques in organizational environments has been helping several industries in many countries save millions and sometimes even billions of dollars. According to Winston (2004), a survey of the 500 largest American companies listed by Fortune magazine reported that 85% of the respondents used or had already used the linear programming technique.

16.3 MATHEMATICAL FORMULATION OF A GENERAL LINEAR PROGRAMMING MODEL

Linear programming problems try to determine optimal values for the decision variables x1, x2, …, xn, which must be continuous, in order to maximize or minimize the linear function z, subject to a set of m linear constraints of equality (equations with an = sign) and/or inequality (inequalities with a ≤ or a ≥ sign). The solutions that meet all the constraints, including the non-negativity of the decision variables, are called feasible solutions. The feasible solution that presents the best value of the objective function is called the optimal solution. The formulation of a general linear programming model can be mathematically represented as:

$$\begin{aligned}
\max \text{ or } \min \quad & z = f(x_1, x_2, \ldots, x_n) = c_1 x_1 + c_2 x_2 + \cdots + c_n x_n\\
\text{subject to:} \quad & a_{11} x_1 + a_{12} x_2 + \cdots + a_{1n} x_n \;\{\leq, =, \geq\}\; b_1\\
& a_{21} x_1 + a_{22} x_2 + \cdots + a_{2n} x_n \;\{\leq, =, \geq\}\; b_2\\
& \qquad \vdots\\
& a_{m1} x_1 + a_{m2} x_2 + \cdots + a_{mn} x_n \;\{\leq, =, \geq\}\; b_m\\
& x_1, x_2, \ldots, x_n \geq 0 \quad (\text{non-negativity constraint})
\end{aligned} \qquad (16.1)$$

where:
z is the objective function;
xj are the decision variables, main or controllable, j = 1, 2, …, n;
aij is the constant or coefficient of the ith constraint of the jth variable, i = 1, 2, …, m, j = 1, 2, …, n;
bi is the independent term or quantity of resources available of the ith constraint, i = 1, 2, …, m; and
cj is the constant or coefficient of the jth variable of the objective function, j = 1, 2, …, n.
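For readers who want to experiment numerically, the sketch below shows how this general structure maps onto scipy.optimize.linprog in Python, which expects a minimization objective, inequality constraints in ≤ form, equality constraints, and variable bounds. The small arrays are placeholder values chosen for illustration, not data from the book.

from scipy.optimize import linprog

# Placeholder problem: min z = c'x subject to A_ub x <= b_ub, A_eq x = b_eq, x >= 0.
c = [1.0, 2.0]
A_ub = [[1.0, 1.0]]
b_ub = [10.0]
A_eq = [[1.0, -1.0]]
b_eq = [2.0]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * 2)
print(res.x, res.fun)   # optimal decision variables and objective value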

16.4 LINEAR PROGRAMMING MODEL IN THE STANDARD AND CANONICAL FORMS

The previous section presented the general formulation of a linear programming problem. This section discusses the formulation in the standard and canonical forms, in addition to elementary operations that can change the formulation of linear programming problems.

16.4.1 Linear Programming Model in the Standard Form

To solve a linear programming problem, be it by the analytical method or by the Simplex algorithm, the formulation of the model should be in the standard form, that is, it must meet the following requirements:


– The independent terms of the constraints must be non-negative;
– All the constraints must be represented by linear equations and presented as equalities;
– The decision variables must be non-negative.

The standard form can be mathematically represented as:

$$\begin{aligned}
\max \text{ or } \min \quad & z = f(x_1, x_2, \ldots, x_n) = c_1 x_1 + c_2 x_2 + \cdots + c_n x_n\\
\text{subject to:} \quad & a_{11} x_1 + a_{12} x_2 + \cdots + a_{1n} x_n = b_1\\
& a_{21} x_1 + a_{22} x_2 + \cdots + a_{2n} x_n = b_2\\
& \qquad \vdots\\
& a_{m1} x_1 + a_{m2} x_2 + \cdots + a_{mn} x_n = b_m\\
& x_j \geq 0, \quad j = 1, 2, \ldots, n
\end{aligned} \qquad (16.2)$$

The standard linear programming problem can also be written in matrix form:

$$\min f(x) = c\,x \quad \text{subject to:} \quad A x = b, \quad x \geq 0$$

where:

$$A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n}\\ a_{21} & a_{22} & \cdots & a_{2n}\\ \vdots & & & \vdots\\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix}, \quad x = \begin{bmatrix} x_1\\ x_2\\ \vdots\\ x_n \end{bmatrix}, \quad b = \begin{bmatrix} b_1\\ b_2\\ \vdots\\ b_m \end{bmatrix}, \quad c = \begin{bmatrix} c_1 & c_2 & \cdots & c_n \end{bmatrix}, \quad 0 = \begin{bmatrix} 0\\ 0\\ \vdots\\ 0 \end{bmatrix}$$

16.4.2 Linear Programming Model in the Canonical Form

In a linear programming model in the canonical form, the constraints must be presented as inequalities, and z can be a maximization or a minimization objective function. If z is a maximization function, all the constraints must be represented with a ≤ sign. If z is a minimization function, the constraints should have a ≥ sign. For a maximization problem, the canonical form can be mathematically represented as:

$$\begin{aligned}
\max \quad & z = f(x_1, x_2, \ldots, x_n) = c_1 x_1 + c_2 x_2 + \cdots + c_n x_n\\
\text{subject to:} \quad & a_{11} x_1 + a_{12} x_2 + \cdots + a_{1n} x_n \leq b_1\\
& a_{21} x_1 + a_{22} x_2 + \cdots + a_{2n} x_n \leq b_2\\
& \qquad \vdots\\
& a_{m1} x_1 + a_{m2} x_2 + \cdots + a_{mn} x_n \leq b_m\\
& x_j \geq 0, \quad j = 1, 2, \ldots, n
\end{aligned} \qquad (16.3)$$

Now, if it is a minimization problem, the canonical form becomes:

$$\begin{aligned}
\min \quad & z = f(x_1, x_2, \ldots, x_n) = c_1 x_1 + c_2 x_2 + \cdots + c_n x_n\\
\text{subject to:} \quad & a_{11} x_1 + a_{12} x_2 + \cdots + a_{1n} x_n \geq b_1\\
& a_{21} x_1 + a_{22} x_2 + \cdots + a_{2n} x_n \geq b_2\\
& \qquad \vdots\\
& a_{m1} x_1 + a_{m2} x_2 + \cdots + a_{mn} x_n \geq b_m\\
& x_j \geq 0, \quad j = 1, 2, \ldots, n
\end{aligned} \qquad (16.4)$$

16.4.3 Transformations Into the Standard or Canonical Form

In order for a linear programming problem to have one of the forms presented in Sections 16.4.1 and 16.4.2, some elementary operations can be carried out from a general formulation, as described here.

(1) A maximization problem can be transformed into a minimization linear programming problem:

$$\max z = f(x_1, x_2, \ldots, x_n) \;\Leftrightarrow\; \min -z = -f(x_1, x_2, \ldots, x_n) \qquad (16.5)$$

Analogously, a minimization problem can be transformed into a maximization problem:

$$\min z = f(x_1, x_2, \ldots, x_n) \;\Leftrightarrow\; \max -z = -f(x_1, x_2, \ldots, x_n) \qquad (16.6)$$

(2) An inequality constraint of the ≤ type can be transformed into another one of the ≥ type by multiplying both sides by (−1):

$$a_{i1} x_1 + a_{i2} x_2 + \cdots + a_{in} x_n \leq b_i \;\;\text{is equivalent to}\;\; -a_{i1} x_1 - a_{i2} x_2 - \cdots - a_{in} x_n \geq -b_i \qquad (16.7)$$

Analogously, an inequality constraint of the ≥ type can be transformed into another one of the ≤ type:

$$a_{i1} x_1 + a_{i2} x_2 + \cdots + a_{in} x_n \geq b_i \;\;\text{is equivalent to}\;\; -a_{i1} x_1 - a_{i2} x_2 - \cdots - a_{in} x_n \leq -b_i \qquad (16.8)$$

(3) An equality constraint can be transformed into two inequality constraints:

$$a_{i1} x_1 + a_{i2} x_2 + \cdots + a_{in} x_n = b_i \;\;\text{is equivalent to}\;\; \begin{cases} a_{i1} x_1 + a_{i2} x_2 + \cdots + a_{in} x_n \leq b_i\\ a_{i1} x_1 + a_{i2} x_2 + \cdots + a_{in} x_n \geq b_i \end{cases} \qquad (16.9)$$

(4) An inequality constraint of the ≤ type can be rewritten as an expression of equality by including a new non-negative variable on the left-hand side (LHS), xk ≥ 0, called a slack variable:

$$a_{i1} x_1 + a_{i2} x_2 + \cdots + a_{in} x_n \leq b_i \;\;\text{is equivalent to}\;\; a_{i1} x_1 + a_{i2} x_2 + \cdots + a_{in} x_n + x_k = b_i \qquad (16.10)$$

Analogously, an inequality constraint of the ≥ type can also be transformed into an expression of equality by subtracting a new non-negative variable from the left-hand side, xk ≥ 0, called a surplus variable:

$$a_{i1} x_1 + a_{i2} x_2 + \cdots + a_{in} x_n \geq b_i \;\;\text{is equivalent to}\;\; a_{i1} x_1 + a_{i2} x_2 + \cdots + a_{in} x_n - x_k = b_i \qquad (16.11)$$

(5) An xj variable that is unrestricted in sign, called a free variable, can be expressed as the difference between two non-negative variables:

$$x_j = x_j^1 - x_j^2, \quad x_j^1, x_j^2 \geq 0 \qquad (16.12)$$

Example 16.1
For the following linear programming problem, rewrite it in the standard form, starting from a minimization objective function.

$$\begin{aligned}
\max \quad & z = f(x_1, x_2, x_3, x_4) = 5x_1 + 2x_2 - 4x_3 - x_4\\
\text{subject to:} \quad & x_1 + 2x_2 - x_4 \leq 12\\
& 2x_1 + x_2 + 3x_3 \geq 6\\
& x_1 \text{ free}, \; x_2, x_3, x_4 \geq 0
\end{aligned}$$

Solution
In order for the model to be rewritten in the standard form, the inequality constraints must be expressed as equalities (Expressions 16.10 and 16.11), and the free variable x1 can be expressed as the difference between two non-negative variables (Expression 16.12). Considering a minimization objective function, we have:


$$\begin{aligned}
\min \quad & -z = f(x_1, x_2, x_3, x_4) = -5x_1^1 + 5x_1^2 - 2x_2 + 4x_3 + x_4\\
\text{subject to:} \quad & x_1^1 - x_1^2 + 2x_2 - x_4 + x_5 = 12\\
& 2x_1^1 - 2x_1^2 + x_2 + 3x_3 - x_6 = 6\\
& x_1^1, x_1^2, x_2, x_3, x_4, x_5, x_6 \geq 0
\end{aligned}$$

Example 16.2
Convert the following problem into the canonical form.

$$\begin{aligned}
\max \quad & z = f(x_1, x_2, x_3) = 3x_1 + 4x_2 + 5x_3\\
\text{subject to:} \quad & 2x_1 + 2x_2 + 4x_3 \geq 320\\
& 3x_1 + 4x_2 + 5x_3 = 580\\
& x_1, x_2, x_3 \geq 0
\end{aligned}$$

In order for the maximization model to be written in the canonical form, the constraints must be expressed as inequalities of the ≤ type. To do that, the expression of equality must be transformed into two inequality constraints (Expression 16.9), and the inequalities with the ≥ sign must be multiplied by (−1), as specified in Expression (16.8). The final model in the canonical form is:

$$\begin{aligned}
\max \quad & z = f(x_1, x_2, x_3) = 3x_1 + 4x_2 + 5x_3\\
\text{subject to:} \quad & -2x_1 - 2x_2 - 4x_3 \leq -320\\
& -3x_1 - 4x_2 - 5x_3 \leq -580\\
& 3x_1 + 4x_2 + 5x_3 \leq 580\\
& x_1, x_2, x_3 \geq 0
\end{aligned}$$

16.5 ASSUMPTIONS OF THE LINEAR PROGRAMMING MODEL

In a linear programming problem, the objective function and the model constraints must be linear, the decision variables must be continuous (divisible, so that they can assume fractional values) and non-negative, and the model parameters must be deterministic, in order to satisfy the following assumptions:

1. Proportionality
2. Additivity
3. Divisibility and non-negativity
4. Certainty

16.5.1 Proportionality

The proportionality assumption requires that, for each decision variable considered in the model, its contribution to the objective function and to the model constraints be directly proportional to the value of the decision variable. Let's imagine the following example. A company tries to maximize its production of chairs (x1) and tables (x2), and the profit per chair and per table is 4 and 7, respectively. So, the objective function z is expressed as max z = 4x1 + 7x2. Fig. 16.3, adapted from Hillier and Lieberman (2005), shows the contribution of variable x1 to the objective function z. We can see that, in order for the proportionality assumption to be respected, for every chair produced, the objective function must increase $4. Now imagine that an initial setup cost of $20 is incurred before the production of chairs (x1) can begin (case 1). In this case, the contribution of variable x1 to the objective function would be written as z = 4x1 − 20, instead of z = 4x1, not meeting the proportionality assumption. On the other hand, imagine that there are economies of scale, in a way that production costs diminish and, consequently, the marginal contribution increases as the total amount produced grows (case 2), also violating the proportionality assumption. In this case, the profit function becomes nonlinear.


FIG. 16.3 Contribution of variable x1 to objective function z (curves: meets proportionality, case 1, and case 2).

In the same way, regarding the constraints, we assume that the aij coefficients or constants are proportional to production level xj.

16.5.2 Additivity

The additivity assumption states that the total value of the objective function or of each constraint function of a linear programming model is expressed by the sum of the individual contributions of each decision variable. Thus, the contribution of each decision variable does not depend on the contribution of the other variables, so that there are no crossed terms, either in the objective function or in the model constraints. Considering the previous example, the objective function is expressed as max z = 4x1 + 7x2. Through the additivity assumption, the total value of the objective function is obtained through the sum of the individual contributions of x1 and x2, that is, z = 4 + 7 = 11 for x1 = x2 = 1. If the objective function were expressed as max z = 4x1 + 7x2 + x1x2, the additivity assumption would be violated (z = 4 + 7 + 1 = 12 for x1 = x2 = 1), since the model's decision variables would be interdependent. In the same way, regarding each model constraint, we assume that the function's total value is expressed by the sum of each variable's individual contributions.

16.5.3 Divisibility and Non-negativity

Each one of the decision variables considered in the model can assume any non-negative value within an interval, including fractional values, as long as it meets the model's constraints. When the variables being studied can only be integers, the model is called integer (linear) programming (ILP or simply IP).

16.5.4 Certainty

This assumption states that the objective function coefficients, the constraint coefficients, and the independent terms of a linear programming model are deterministic (constants and known with certainty).

16.6 MODELING BUSINESS PROBLEMS USING LINEAR PROGRAMMING

This section discusses the description and modeling of the main resource optimization problems that are studied in linear programming in the fields of engineering, business management, economics, and accounting. They are the production mix problem, capital budgeting, investment portfolio selection, production and inventory, and aggregate planning.

16.6.1 Production Mix Problem

The production mix problem aims to find the ideal quantity of certain lines of products to be manufactured that will maximize the company's results (net profit, total profit, etc.) or minimize the production costs, respecting its limitations as


regards productive and market resources (raw materials constraints, maximum production capacity, availability of human resources, maximum and minimum market demand, among others). When the amount of a certain product to be manufactured can only be an integer (cars, electrical appliances, electronic devices, etc.), we have an integer programming problem (IP). An alternative solution for this kind of problem is to relax or eliminate the integrality constraints of the decision variables, using a linear programming problem. Luckily, sometimes the optimal solution of the relaxed problem corresponds to the optimal solution of the original model, that is, it meets the integrality constraints of the decision variables. When the solution of the relaxed problem is not integer, rounding or integer programming algorithms must be applied to find the solution of the original problem. Further details can be found in Chapter 19, which discusses integer programming.

Example 16.3
Venix is a toy company and it is reviewing its toy cars and tricycles production planning. The net profit per toy car and tricycle unit produced is US$ 12.00 and US$ 60.00, respectively. The raw materials and inputs necessary to manufacture each one of these products are outsourced, and the company is in charge of the machining, painting, and assembly processes. The machining process requires 15 minutes of specialized labor per car unit and 30 minutes per tricycle unit produced. The painting process requires 6 minutes of specialized labor per car unit and 45 minutes per tricycle unit produced. The assembly process needs 6 and 24 minutes per car and tricycle unit produced, respectively. Per week, the time available for machining, painting, and assembly is 36, 22, and 15 hours, respectively. The company would like to determine how much it should produce of each product per week, respecting its resource limitations, in order to maximize its weekly net profit. Formulate the linear programming problem that maximizes the company's net profit.

Solution
First of all, we define the model's decision variables:
xj = amount of product j to be manufactured per week, j = 1, 2.
Therefore, we have:
x1 = amount of toy cars to be manufactured per week.
x2 = amount of tricycles to be manufactured per week.

We can see that the decision variables should be integers (it is impossible to produce fractional amounts of toy cars or tricycles), so this is, in principle, an integer programming (IP) problem. Luckily, in this problem, the integrality constraints can be relaxed or eliminated, since the relaxed problem's optimal solution still meets the integrality conditions. Thus, the formulation of the problem will be presented as a linear programming (LP) model.

The net profit per toy car unit produced is US$ 12.00, while the net profit per tricycle is US$ 60.00. We are trying to maximize the weekly net profit generated from the amount of cars and tricycles manufactured. Therefore, the objective function can be written as follows:

Fobj = max z = 12x1 + 60x2

Considering that, for the machining process, producing one car and/or one tricycle requires 15 minutes (0.25 hours) and 30 minutes (0.5 hours) of specialized labor, respectively, the total weekly machining time is 0.25x1 + 0.50x2. However, the total labor time for the machining activity cannot be higher than 36 hours/week, which generates the following constraint:

0.25x1 + 0.5x2 ≤ 36

Analogously, for the painting activity, each car and/or tricycle produced requires 6 minutes (0.1 hours) and 45 minutes (0.75 hours) of specialized labor, respectively (0.1x1 + 0.75x2). However, the maximum amount of labor available for this activity is 22 hours/week:

0.1x1 + 0.75x2 ≤ 22

Now, the assembly process requires 6 minutes (0.1 hours) and 24 minutes (0.4 hours) of labor per toy car and tricycle produced, respectively (0.1x1 + 0.4x2). The availability of human resources for this activity is 15 hours/week:

0.1x1 + 0.4x2 ≤ 15

Finally, we have the non-negativity constraints of the decision variables. All the model's constraints are:

(1) Labor availability constraints for the three activities:
0.25x1 + 0.5x2 ≤ 36 (machining)
0.1x1 + 0.75x2 ≤ 22 (painting)
0.1x1 + 0.4x2 ≤ 15 (assembly)


(2) Non-negativity constraints of the decision variables:
xj ≥ 0, j = 1, 2

The model's complete formulation can be represented as:

$$\begin{aligned}
\max \quad & z = 12x_1 + 60x_2\\
\text{subject to:} \quad & 0.25x_1 + 0.50x_2 \leq 36\\
& 0.10x_1 + 0.75x_2 \leq 22\\
& 0.10x_1 + 0.40x_2 \leq 15\\
& x_j \geq 0, \quad j = 1, 2
\end{aligned}$$

The optimal solution can be obtained in graphical form, in analytical form, through the Simplex method, or directly from software such as Solver (in Excel), as presented in the next chapter. The current model's optimal solution is x1 = 70 (toy cars per week) and x2 = 20 (tricycles per week), with z = 2040 (a weekly net profit of US$ 2040.00).
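Beyond the graphical, analytical, and Solver approaches mentioned above, the same relaxed LP can be checked with any linear programming routine; a minimal sketch with scipy.optimize.linprog follows (the objective is negated because linprog minimizes).

from scipy.optimize import linprog

c = [-12, -60]                 # maximize 12*x1 + 60*x2 by minimizing its negative
A_ub = [[0.25, 0.50],          # machining hours
        [0.10, 0.75],          # painting hours
        [0.10, 0.40]]          # assembly hours
b_ub = [36, 22, 15]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * 2)
print(res.x, -res.fun)         # approx. [70. 20.] and 2040.0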

Example 16.4
Naturelat is a dairy company that manufactures the following products: yogurt, fresh white cheese, Mozzarella, Parmesan, and Provolone. Due to some strategic changes resulting from the competition in the market, the company is redefining its production mix. To manufacture each one of these five products, three types of raw materials are necessary: raw milk, cream line, and cream. Table 16.E.1 shows the amount of raw materials necessary to manufacture 1 kg of each product. The daily amount of raw materials available is limited (1200 L of raw milk, 460 L of cream line, and 650 kg of cream). The daily availability of specialized labor is also limited (170 employee-hours per day). The company needs 0.05 employee-hours to manufacture 1 kg of yogurt, 0.12 employee-hours to manufacture 1 kg of fresh white cheese, 0.09 employee-hours for Mozzarella, 0.04 employee-hours for Parmesan, and 0.16 employee-hours for Provolone cheese. Due to contractual clauses, the company needs to produce a minimum daily quantity of 320 kg of yogurt, 380 kg of fresh white cheese, 450 kg of Mozzarella, 240 kg of Parmesan, and 180 kg of Provolone. The company's commercial department guarantees that there is enough demand to absorb any production level, regardless of the product. Table 16.E.2 shows the net profit per unit of each product (US$/kg), calculated as the difference between the sales price and the total variable costs. The company aims to determine the quantity of each product it has to manufacture in order to maximize its results. Formulate the linear programming problem that maximizes the expected result.

Solution
First of all, we define the model's decision variables:
xj = amount of product j (in kg) to be manufactured per day, j = 1, 2, …, 5.
Therefore, we have:
x1 = amount of yogurt (in kg) to be manufactured per day.
x2 = amount of fresh white cheese (in kg) to be manufactured per day.
x3 = amount of Mozzarella (in kg) to be manufactured per day.
x4 = amount of Parmesan (in kg) to be manufactured per day.
x5 = amount of Provolone (in kg) to be manufactured per day.

TABLE 16.E.1 Raw Materials Necessary to Manufacture 1 kg of Each Product

Product              Raw Milk (L)   Cream Line (L)   Cream (kg)
Yogurt               0.70           0.16             0.25
Fresh white cheese   0.40           0.22             0.33
Mozzarella           0.40           0.32             0.33
Parmesan             0.60           0.19             0.40
Provolone            0.60           0.23             0.47


TABLE 16.E.2 Net Profit per Product Unit (US$/kg)

Product              Sales Price (US$/kg)   Total Variable Costs (US$/kg)   Contribution Margin (US$/kg)
Yogurt               3.20                   2.40                            0.80
Fresh white cheese   4.10                   3.40                            0.70
Mozzarella           6.30                   5.15                            1.15
Parmesan             8.25                   6.95                            1.30
Provolone            7.50                   6.80                            0.70

The total net profit per product is obtained by multiplying the net profit per unit (in this case, US$/kg) by the respective quantity sold. The problem's objective function seeks to maximize the total net profit of all the company's products, which is obtained by adding the total net profits of each product:

Fobj = max z = 0.80x1 + 0.70x2 + 1.15x3 + 1.30x4 + 0.70x5

Regarding the raw materials availability constraints, let's first consider the amount of raw milk (liters) used daily to manufacture each product. To manufacture 1 kg of yogurt, the company needs 0.7 L of raw milk (0.70x1 represents the total amount of raw milk used every day to manufacture yogurt). The amount of raw milk (liters) used daily to manufacture fresh white cheese is represented by 0.40x2. Analogously, the daily use of raw milk is 0.40x3 to manufacture Mozzarella, 0.60x4 to manufacture Parmesan, and 0.60x5 to make Provolone. The total amount of raw milk (liters) used daily to manufacture all of these products can be represented by 0.70x1 + 0.40x2 + 0.40x3 + 0.60x4 + 0.60x5. However, this total cannot be higher than 1200 L (daily amount of raw milk available), and this constraint is represented by:

0.70x1 + 0.40x2 + 0.40x3 + 0.60x4 + 0.60x5 ≤ 1200

Likewise, the quantity of cream line (in liters) used daily to manufacture yogurt, fresh white cheese, Mozzarella, Parmesan, and Provolone cannot be higher than the maximum amount available, which is 460 L:

0.16x1 + 0.22x2 + 0.32x3 + 0.19x4 + 0.23x5 ≤ 460

Still with regard to the raw materials availability constraints, in the case of cream, the quantity (kg) used daily to manufacture the five products cannot be higher than the maximum quantity available, which is 650 kg:

0.25x1 + 0.33x2 + 0.33x3 + 0.40x4 + 0.47x5 ≤ 650

We must also take the daily availability of specialized labor into consideration. Each kilo of yogurt manufactured requires 0.05 employee-hours, so 0.05x1 represents the total employee-hours used daily in the manufacturing of yogurt. The number of employee-hours used daily to manufacture fresh white cheese is represented by 0.12x2. Similarly, 0.09x3 employee-hours are used daily to make Mozzarella, 0.04x4 to produce Parmesan, and 0.16x5 to make Provolone. The total number of employee-hours used daily to manufacture all these products can be represented by 0.05x1 + 0.12x2 + 0.09x3 + 0.04x4 + 0.16x5. However, this total cannot be higher than 170 employee-hours, which is the daily availability of specialized human resources:

0.05x1 + 0.12x2 + 0.09x3 + 0.04x4 + 0.16x5 ≤ 170

We must finally consider the daily minimum market demand constraint for each product: 320 kg for yogurt (x1 ≥ 320), 380 for fresh white cheese (x2 ≥ 380), 450 for Mozzarella (x3 ≥ 450), 240 for Parmesan (x4 ≥ 240), and 180 for Provolone (x5 ≥ 180), besides the non-negativity constraints of the decision variables. All the constraints of the model can be represented as:

(1) Amount of raw materials used daily to produce yogurt, fresh white cheese, Mozzarella, Parmesan, and Provolone:
0.70x1 + 0.40x2 + 0.40x3 + 0.60x4 + 0.60x5 ≤ 1200 (raw milk)
0.16x1 + 0.22x2 + 0.32x3 + 0.19x4 + 0.23x5 ≤ 460 (cream line)
0.25x1 + 0.33x2 + 0.33x3 + 0.40x4 + 0.47x5 ≤ 650 (cream)

(2) Daily availability of specialized labor for producing yogurt, fresh white cheese, Mozzarella, Parmesan, and Provolone:
0.05x1 + 0.12x2 + 0.09x3 + 0.04x4 + 0.16x5 ≤ 170


(3) Daily minimum demand for each product:
x1 ≥ 320 (yogurt)
x2 ≥ 380 (fresh white cheese)
x3 ≥ 450 (Mozzarella)
x4 ≥ 240 (Parmesan)
x5 ≥ 180 (Provolone)

(4) Non-negativity constraints of the decision variables:
xj ≥ 0, j = 1, 2, …, 5

The complete problem is modeled here:

$$\begin{aligned}
\max \quad & z = 0.80x_1 + 0.70x_2 + 1.15x_3 + 1.30x_4 + 0.70x_5\\
\text{subject to:} \quad & 0.70x_1 + 0.40x_2 + 0.40x_3 + 0.60x_4 + 0.60x_5 \leq 1200\\
& 0.16x_1 + 0.22x_2 + 0.32x_3 + 0.19x_4 + 0.23x_5 \leq 460\\
& 0.25x_1 + 0.33x_2 + 0.33x_3 + 0.40x_4 + 0.47x_5 \leq 650\\
& 0.05x_1 + 0.12x_2 + 0.09x_3 + 0.04x_4 + 0.16x_5 \leq 170\\
& x_1 \geq 320\\
& x_2 \geq 380\\
& x_3 \geq 450\\
& x_4 \geq 240\\
& x_5 \geq 180\\
& x_j \geq 0, \quad j = 1, \ldots, 5
\end{aligned}$$

Using Solver (in Excel), the model's optimal solution is x1 = 320 (kg/day of yogurt), x2 = 380 (kg/day of fresh white cheese), x3 = 690.96 (kg/day of Mozzarella), x4 = 329.95 (kg/day of Parmesan), and x5 = 180 (kg/day of Provolone), with z = 1871.55 (a total daily contribution margin of US$ 1871.55).
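The same solution can be reproduced outside Excel; the sketch below encodes the model for scipy.optimize.linprog, with the minimum daily demands handled through variable bounds.

from scipy.optimize import linprog

c = [-0.80, -0.70, -1.15, -1.30, -0.70]        # maximize the total contribution margin
A_ub = [[0.70, 0.40, 0.40, 0.60, 0.60],        # raw milk (L)
        [0.16, 0.22, 0.32, 0.19, 0.23],        # cream line (L)
        [0.25, 0.33, 0.33, 0.40, 0.47],        # cream (kg)
        [0.05, 0.12, 0.09, 0.04, 0.16]]        # specialized labor (employee-hours)
b_ub = [1200, 460, 650, 170]
bounds = [(320, None), (380, None), (450, None), (240, None), (180, None)]  # minimum daily demands

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
print(res.x)       # approx. [320, 380, 690.96, 329.95, 180]
print(-res.fun)    # approx. 1871.55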

16.6.2 Blending or Mixing Problem

The blending or mixing problem aims to find the solution with the minimum cost or maximum profit, starting from the combination of several ingredients in order to produce one or many products. The raw materials can be ore, metals, chemical products, oil or crude oil, or water, while the final products can be metal ingots, steel, paint, gasoline, or other chemical products. Among the several mixing problems, we can mention a few examples:
1. Mixing many types of oil or crude oil to produce different types of gasoline.
2. Mixing chemical products to create other products.
3. Mixing different types of paper to generate recycled paper.

Example 16.5
Petrisul is an oil refinery that uses three types of crude oil (oil 1, oil 2, and oil 3) to produce three types of gasoline: regular, super, and extra. To ensure its quality, each type of gasoline requires certain specifications based on the composition of the several types of crude oil, as shown in Table 16.E.3. In order to meet its clients' demands, the refinery needs to produce at least 5000 barrels a day of regular gasoline and 3000 barrels a day of super and of extra gasoline. The daily capacity available is 10,000 barrels of oil 1, 8000 of oil 2, and 7000 of oil 3. The refinery can produce up to 20,000 barrels of gasoline a day. The refinery makes a profit of $5 per barrel of regular gasoline produced, $7 per barrel of super gasoline, and $8 per barrel of extra gasoline. The costs per barrel of crude oil 1, 2, and 3 are $2, $3, and $3, respectively. Formulate the linear programming problem aiming to maximize the company's daily profit.


TABLE 16.E.3 Specifications for Each Type of Gasoline

Type of Gasoline   Specifications
Regular            Not more than 70% of oil 1
Super              Not more than 50% of oil 1; not less than 10% of oil 2
Extra              Not more than 50% of oil 2; not less than 40% of oil 3

Solution
First of all, we must define the model's decision variables: xij = barrels of crude oil i used daily to produce gasoline j, i = 1, 2, 3; j = 1, 2, 3. Therefore, we have:
Daily production of regular gasoline = x11 + x21 + x31
Daily production of super gasoline = x12 + x22 + x32
Daily production of extra gasoline = x13 + x23 + x33
Barrels of crude oil 1 used daily = x11 + x12 + x13
Barrels of crude oil 2 used daily = x21 + x22 + x23
Barrels of crude oil 3 used daily = x31 + x32 + x33
The problem's objective function maximizes the refinery's daily profit (revenue minus costs). The model constraints should guarantee that the minimum specifications required for each type of gasoline are respected, that the clients' demands are met, and that the gasoline production and the crude oil supply capacities are not exceeded.
The total daily revenue from the gasoline produced is:
5 × (x11 + x21 + x31) + 7 × (x12 + x22 + x32) + 8 × (x13 + x23 + x33)
On the other hand, the total daily cost of the crude oil purchased is:
2 × (x11 + x12 + x13) + 3 × (x21 + x22 + x23) + 3 × (x31 + x32 + x33)
The objective function can be written as:
Fobj = max z = (5 − 2)x11 + (5 − 3)x21 + (5 − 3)x31 + (7 − 2)x12 + (7 − 3)x22 + (7 − 3)x32 + (8 − 2)x13 + (8 − 3)x23 + (8 − 3)x33
All the model's constraints are:
(1) The regular gasoline should contain a maximum of 70% of oil 1:
x11 / (x11 + x21 + x31) ≤ 0.70
which can be rewritten as:
0.30x11 − 0.70x21 − 0.70x31 ≤ 0
(2) The super gasoline should contain a maximum of 50% of oil 1:
x12 / (x12 + x22 + x32) ≤ 0.50
which can be rewritten as:
0.50x12 − 0.50x22 − 0.50x32 ≤ 0
(3) The super gasoline should contain at least 10% of oil 2:
x22 / (x12 + x22 + x32) ≥ 0.10
which can be rewritten as:
−0.10x12 + 0.90x22 − 0.10x32 ≥ 0


(4) The extra gasoline should contain a maximum of 50% of oil 2:
x23 / (x13 + x23 + x33) ≤ 0.50
which can be rewritten as:
−0.50x13 + 0.50x23 − 0.50x33 ≤ 0
(5) The extra gasoline should contain at least 40% of oil 3:
x33 / (x13 + x23 + x33) ≥ 0.40
which can be rewritten as:
−0.40x13 − 0.40x23 + 0.60x33 ≥ 0
(6) The daily demands for regular, super, and extra gasoline must be met:
x11 + x21 + x31 ≥ 5000 (regular)
x12 + x22 + x32 ≥ 3000 (super)
x13 + x23 + x33 ≥ 3000 (extra)
(7) The maximum number of barrels of crude oil 1 (10,000), crude oil 2 (8000), and crude oil 3 (7000) available daily must be respected:
x11 + x12 + x13 ≤ 10,000 (crude oil 1)
x21 + x22 + x23 ≤ 8000 (crude oil 2)
x31 + x32 + x33 ≤ 7000 (crude oil 3)
(8) The refinery's daily production capacity is 20,000 barrels of gasoline a day:
x11 + x21 + x31 + x12 + x22 + x32 + x13 + x23 + x33 ≤ 20,000
(9) The model's decision variables are non-negative:
xij ≥ 0, i = 1, 2, 3; j = 1, 2, 3

The complete problem is modeled here:

Fobj = max z = 3x11 + 2x21 + 2x31 + 5x12 + 4x22 + 4x32 + 6x13 + 5x23 + 5x33
subject to:
0.30x11 − 0.70x21 − 0.70x31 ≤ 0
0.50x12 − 0.50x22 − 0.50x32 ≤ 0
−0.10x12 + 0.90x22 − 0.10x32 ≥ 0
−0.50x13 + 0.50x23 − 0.50x33 ≤ 0
−0.40x13 − 0.40x23 + 0.60x33 ≥ 0
x11 + x21 + x31 ≥ 5000
x12 + x22 + x32 ≥ 3000
x13 + x23 + x33 ≥ 3000
x11 + x12 + x13 ≤ 10,000
x21 + x22 + x23 ≤ 8000
x31 + x32 + x33 ≤ 7000
x11 + x21 + x31 + x12 + x22 + x32 + x13 + x23 + x33 ≤ 20,000
x11, x21, x31, x12, x22, x32, x13, x23, x33 ≥ 0

By using Solver, the model's optimal solution is x11 = 1300, x21 = 3700, x31 = 0, x12 = 1500, x22 = 1500, x32 = 0, x13 = 7200, x23 = 0, x33 = 4800, with z = 92,000 (a total daily profit of US$ 92,000.00).
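As a cross-check of this formulation, the sketch below (assuming SciPy is available) feeds the same data to linprog; note that each ratio specification enters only after being rewritten as a linear row, and that every "≥" row is multiplied by −1 so that all rows read "≤".

```python
# A sketch of Example 16.5 with SciPy's linprog (an alternative to Excel Solver).
# Variable order: x11, x21, x31, x12, x22, x32, x13, x23, x33 (oil i into gasoline j).
from scipy.optimize import linprog

c = [-3, -2, -2, -5, -4, -4, -6, -5, -5]       # unit profits, negated for maximization

A_ub = [
    [0.3, -0.7, -0.7, 0, 0, 0, 0, 0, 0],       # regular: at most 70% oil 1
    [0, 0, 0, 0.5, -0.5, -0.5, 0, 0, 0],       # super: at most 50% oil 1
    [0, 0, 0, 0.1, -0.9, 0.1, 0, 0, 0],        # super: at least 10% oil 2 (flipped)
    [0, 0, 0, 0, 0, 0, -0.5, 0.5, -0.5],       # extra: at most 50% oil 2
    [0, 0, 0, 0, 0, 0, 0.4, 0.4, -0.6],        # extra: at least 40% oil 3 (flipped)
    [-1, -1, -1, 0, 0, 0, 0, 0, 0],            # regular demand >= 5000 (flipped)
    [0, 0, 0, -1, -1, -1, 0, 0, 0],            # super demand >= 3000 (flipped)
    [0, 0, 0, 0, 0, 0, -1, -1, -1],            # extra demand >= 3000 (flipped)
    [1, 0, 0, 1, 0, 0, 1, 0, 0],               # crude oil 1 supply
    [0, 1, 0, 0, 1, 0, 0, 1, 0],               # crude oil 2 supply
    [0, 0, 1, 0, 0, 1, 0, 0, 1],               # crude oil 3 supply
    [1, 1, 1, 1, 1, 1, 1, 1, 1],               # refinery capacity
]
b_ub = [0, 0, 0, 0, 0, -5000, -3000, -3000, 10_000, 8000, 7000, 20_000]

res = linprog(c, A_ub=A_ub, b_ub=b_ub)         # default bounds keep every x >= 0
print(-res.fun)   # expected: 92,000 (the production plan may differ if alternative optima exist)
```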

16.6.3 Diet Problem

The diet problem is a classic linear programming problem that tries to determine the best food combination to be ingested per meal, at the lowest possible cost, while meeting a person's nutritional needs. Several nutrients can be analyzed, for example: calories, protein, fat, carbs, fibers, calcium, iron, magnesium, phosphorus, potassium, sodium, zinc, copper, manganese, selenium, vitamin A, vitamin C, vitamin B1, vitamin B2, vitamin B12, niacin, folic acid, cholesterol, among others (Pessôa et al., 2009).

Example 16.6
Anemia is a disease caused by low levels of hemoglobin in the blood, the protein responsible for carrying oxygen. According to Dr. Adriana Ferreira, a hematologist, iron-deficiency anemia is the most common kind of anemia, and it is caused by the lack of iron in the body. In order to prevent it, we must adopt a diet that is rich in iron, vitamin A, vitamin B12, and folic acid. These nutrients can be found in several kinds of food, such as spinach, broccoli, watercress, tomatoes, carrots, eggs, beans, chickpeas, soybeans, beef, liver, and fish. Table 16.E.4 shows the daily needs of each nutrient, the respective quantity in each one of the food items, and their price. In order to prevent its patients from having this kind of anemia, Hospital Metropolis is studying a new diet. The goal is to choose the ingredients with the lowest possible cost. These ingredients will be a part of both main daily meals (lunch and dinner), in a way that 100% of a person's daily needs of each of these nutrients will be met in both meals. In addition, the total amount ingested in both meals cannot be higher than 1.5 kg.

TABLE 16.E.4 Nutrients, Daily Needs, and Cost Per Food Item (100 g Servings)

Food           Iron (mg)   Vitamin A (IU)   Vitamin B12 (mcg)   Folic Acid (mg)   Price (US$)
Spinach        3.00        7400             0                   0.400             0.30
Broccoli       1.20        138.8            0                   0.500             0.20
Watercress     0.20        4725             0                   0.100             0.18
Tomatoes       0.49        1130             0                   0.250             0.16
Carrots        1.00        14,500           0.10                0.005             0.30
Eggs           0.90        3215             1.00                0.050             0.30
Beans          7.10        0                0                   0.056             0.40
Chickpeas      4.86        41               0                   0.400             0.40
Soybeans       3.00        1000             0                   0.080             0.45
Beef           1.50        0                3.00                0.060             0.75
Liver          10.00       32,000           100.00              0.380             0.80
Fish           1.10        140              2.14                0.002             0.85
Daily intake   8           4500             2                   0.4

Solution
First of all, we have to define the model's decision variables: xj = quantity (kg) of food j consumed daily, j = 1, 2, …, 12. Therefore, we have:
x1 = quantity (kg) of spinach consumed daily.
x2 = quantity (kg) of broccoli consumed daily.
x3 = quantity (kg) of watercress consumed daily.
⋮
x12 = quantity (kg) of fish consumed daily.


The model's objective function tries to minimize the total cost spent on food, and it may be written as follows:

Fobj = min z = 3x1 + 2x2 + 1.8x3 + 1.6x4 + 3x5 + 3x6 + 4x7 + 4x8 + 4.5x9 + 7.5x10 + 8x11 + 8.5x12

The constraints related to the minimum daily intake of each nutrient must be met. Furthermore, we must consider the maximum weight constraint allowed in both meals.
(1) The minimum daily intake of iron must be met:
30x1 + 12x2 + 2x3 + 4.9x4 + 10x5 + 9x6 + 71x7 + 48.6x8 + 30x9 + 15x10 + 100x11 + 11x12 ≥ 80
(2) The minimum daily intake of vitamin A must be met:
74,000x1 + 1388x2 + 47,250x3 + 11,300x4 + 145,000x5 + 32,150x6 + 410x8 + 10,000x9 + 320,000x11 + 1400x12 ≥ 45,000
(3) The minimum daily intake of vitamin B12 must be met:
x5 + 10x6 + 30x10 + 1000x11 + 21.4x12 ≥ 20
(4) The minimum daily intake of folic acid must be met:
4x1 + 5x2 + x3 + 2.5x4 + 0.05x5 + 0.5x6 + 0.56x7 + 4x8 + 0.8x9 + 0.6x10 + 3.8x11 + 0.02x12 ≥ 4
(5) The total amount consumed in both meals cannot be higher than 1.5 kg:
x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 + x11 + x12 ≤ 1.5
(6) The model's decision variables are nonnegative:
x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12 ≥ 0

The complete model can be described as follows:

Fobj = min z = 3x1 + 2x2 + 1.8x3 + 1.6x4 + 3x5 + 3x6 + 4x7 + 4x8 + 4.5x9 + 7.5x10 + 8x11 + 8.5x12
s.t.
30x1 + 12x2 + 2x3 + ⋯ + 15x10 + 100x11 + 11x12 ≥ 80
74,000x1 + 1388x2 + 47,250x3 + ⋯ + 320,000x11 + 1400x12 ≥ 45,000
x5 + ⋯ + 30x10 + 1000x11 + 21.40x12 ≥ 20
4x1 + 5x2 + x3 + ⋯ + 0.6x10 + 3.8x11 + 0.02x12 ≥ 4
x1 + x2 + x3 + ⋯ + x10 + x11 + x12 ≤ 1.5
xj ≥ 0, j = 1, …, 12

The model's optimal solution is x2 = 0.427 (kg of broccoli), x7 = 0.698 (kg of beans), x8 = 0.237 (kg of chickpeas), x11 = 0.138 (kg of liver), and x1, x3, x4, x5, x6, x9, x10, x12 = 0, with z = 5.70 (total cost of US$ 5.70).
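The diet model can be checked computationally as well. The sketch below is a non-authoritative illustration (it assumes SciPy is available) and uses the per-kilogram coefficients derived from Table 16.E.4, in the same food order as the decision variables.

```python
# A sketch of Example 16.6 with SciPy's linprog. Coefficients are per kg of food,
# in the order: spinach, broccoli, watercress, tomatoes, carrots, eggs, beans,
# chickpeas, soybeans, beef, liver, fish (from Table 16.E.4).
from scipy.optimize import linprog

cost  = [3, 2, 1.8, 1.6, 3, 3, 4, 4, 4.5, 7.5, 8, 8.5]               # US$/kg
iron  = [30, 12, 2, 4.9, 10, 9, 71, 48.6, 30, 15, 100, 11]           # mg/kg,  >= 80
vit_a = [74_000, 1388, 47_250, 11_300, 145_000, 32_150, 0, 410,
         10_000, 0, 320_000, 1400]                                   # IU/kg,  >= 45,000
b12   = [0, 0, 0, 0, 1, 10, 0, 0, 0, 30, 1000, 21.4]                 # mcg/kg, >= 20
folic = [4, 5, 1, 2.5, 0.05, 0.5, 0.56, 4, 0.8, 0.6, 3.8, 0.02]      # mg/kg,  >= 4

# ">=" constraints are flipped to "<=" by negating; the last row is the 1.5 kg limit.
A_ub = [[-v for v in iron], [-v for v in vit_a], [-v for v in b12],
        [-v for v in folic], [1] * 12]
b_ub = [-80, -45_000, -20, -4, 1.5]

res = linprog(cost, A_ub=A_ub, b_ub=b_ub)      # default bounds keep x >= 0
print(res.x.round(3), res.fun)                 # expected: broccoli, beans, chickpeas, liver; cost about 5.70
```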

16.6.4 Capital Budget Problems

Optimization models, including linear programming, are being widely used to solve several financial investment problems, such as, the capital budgeting problem, investment or portfolio selection, cash flow management, risk analysis, among others. This section discusses a linear programming model that can be used to solve a capital budgeting problem. In the following section, we will discuss the investment portfolio selection problem. The capital budgeting problem aims at selecting, from a set of alternatives, financially feasible investment projects, respecting the investing company’s budgetary constraints.


The capital budgeting problem uses the concept of NPV (net present value), which aims to define which investment is the most attractive. The NPV is the present value (at period t = 0) of the cash inflows (receivables) minus the cash outflows (payables/investments) of each period t = 0, 1, …, n. Considering different investment projects, the most attractive is the one with the highest net present value. The NPV is calculated as:

NPV = Σ_{t=0}^{n} CIFt / (1 + i)^t − Σ_{t=0}^{n} COFt / (1 + i)^t        (16.13)

where:
CIFt = cash inflow at the beginning of period t, t = 0, 1, …, n
COFt = cash outflow at the beginning of period t, t = 0, 1, …, n
i = rate of return

We will analyze two types of investment (A and B) in order to determine which one is the most attractive. Investment A requires an initial investment of US$ 100,000 plus a US$ 50,000 investment in 1 year, with a US$ 200,000 return in 2 years. The interest rate is 12% per year. The NPV of investment A is:

NPV = −100,000 − 50,000/(1 + 0.12)^1 + 200,000/(1 + 0.12)^2 = 14,795.92

Investment B requires an initial investment of US$ 150,000 plus a US$ 70,000 investment in 2 years, with a US$ 130,000 return in 1 year and US$ 120,000 in 3 years. The interest rate is 12% per year. The NPV of investment B is:

NPV = −150,000 + 130,000/(1 + 0.12)^1 − 70,000/(1 + 0.12)^2 + 120,000/(1 + 0.12)^3 = −4,318.51
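A few lines of Python are enough to reproduce these two applications of Expression (16.13); the helper function below is only an illustration, with cash flows indexed from t = 0 and investments entered as negative values.

```python
# A small helper to reproduce the two NPV calculations above.
def npv(cash_flows, rate):
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cash_flows))

print(round(npv([-100_000, -50_000, 200_000], 0.12), 2))            # investment A:  14,795.92
print(round(npv([-150_000, 130_000, -70_000, 120_000], 0.12), 2))   # investment B: about -4,318
```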

Therefore, we can see that investment B is not profitable. So, investment A is the most attractive. The example showed us how to calculate the NPV of one or more investments and, from that, how to determine which one was the most attractive. However, many times, the resources available are limited, so, to choose one or more investment projects, a linear programming or a binary programming model should be used. Example 16.7 A farmer is analyzing five types of investment based on different crops (soybeans, cassava, corn, wheat, and beans) for his new farm, which has a total of 1000 hectares available. Each crop requires capital investments that will result in future benefits. The initial investment and the payables in the following 3 years for each crop are specified in Table 16.E.5. The return expected in the following 3 years for each crop investment is specified in Table 16.E.6. The farmer has limited resources that he can invest in each period (last column of Table 16.E.5), and he expects a minimum cash flow each period (last column of Table 16.E.6). The interest rate for each crop is 10% per year. From the total area available for the investment, the farmer would like to determine how much he should invest in each crop (in hectares), in order to maximize the NPV of the set of investment projects being analyzed, respecting the minimum expected flows and the maximum outflows in each period. Elaborate the farmer’s linear programming problem.

TABLE 16.E.5 Cash Outflow for Each Year

       Initial Investment/Payables for Each Year (US$ Thousands Per Hectare)
Year   Soybeans   Cassava   Corn   Wheat   Beans   Maximum Cash Outflow (US$ Thousands)
0      5.00       4.00      3.50   3.50    3.00    3800.00
1      1.00       1.00      0.50   1.50    0.50    3500.00
2      1.20       0.50      0.50   0.50    1.00    3200.00
3      0.80       0.50      1.00   0.50    0.50    2500.00


TABLE 16.E.6 Cash Inflow for Each Year

       Expected Return for Each Year (US$ Thousands Per Hectare)
Year   Soybeans   Cassava   Corn   Wheat   Beans   Minimum Cash Inflow (US$ Thousands)
1      5.00       4.20      2.20   6.60    3.00    6000.00
2      7.70       6.50      3.70   8.00    3.50    5000.00
3      7.90       7.20      2.90   6.10    4.10    6500.00

Solution
First of all, we have to define the model's decision variables: xj = total area (in hectares) to be invested in planting crop j, j = 1, 2, …, 5. Therefore, we have:
x1 = total area (in hectares) to be invested in planting soybeans.
x2 = total area (in hectares) to be invested in planting cassava.
x3 = total area (in hectares) to be invested in planting corn.
x4 = total area (in hectares) to be invested in planting wheat.
x5 = total area (in hectares) to be invested in planting beans.
The model's objective function maximizes the NPV (US$ thousands) of the set of crop investments being analyzed, that is, the sum of the NPV of each crop (US$ thousands per hectare) multiplied by the total area to be invested in the respective crop (hectares). According to Expression (16.13), the NPV of the soybean crop (US$ thousands per hectare) is:

NPV = −5.0 + 5.0/(1 + 0.10)^1 + 7.7/(1 + 0.10)^2 + 7.9/(1 + 0.10)^3 − 1.0/(1 + 0.10)^1 − 1.2/(1 + 0.10)^2 − 0.8/(1 + 0.10)^3 = 9.343 (US$ 9342.60 per hectare)

The NPVs of the other crops, calculated with the same procedure, are listed in Table 16.E.7.

TABLE 16.E.7 Net Present Value (NPV) of Each Crop (US$ Thousands Per Hectare)

Soybeans   Cassava   Corn    Wheat    Beans
9.343      8.902     2.118   11.542   4.044

Thus, objective function z can be described as follows:
Fobj = max z = 9.343x1 + 8.902x2 + 2.118x3 + 11.542x4 + 4.044x5
The maximum and minimum cash flow constraints for each year, besides the total area available, must be considered and are shown here.
(1) Maximum area available (hectares) for planting the crops:
x1 + x2 + x3 + x4 + x5 ≤ 1000
(2) Minimum cash inflow for each year (US$ thousands):
5.0x1 + 4.2x2 + 2.2x3 + 6.6x4 + 3.0x5 ≥ 6000 (1st year)
7.7x1 + 6.5x2 + 3.7x3 + 8.0x4 + 3.5x5 ≥ 5000 (2nd year)
7.9x1 + 7.2x2 + 2.9x3 + 6.1x4 + 4.1x5 ≥ 6500 (3rd year)
(3) Maximum cash outflow for each year (US$ thousands):
5.0x1 + 4.0x2 + 3.5x3 + 3.5x4 + 3.0x5 ≤ 3800 (initial investment)
1.0x1 + 1.0x2 + 0.5x3 + 1.5x4 + 0.5x5 ≤ 3500 (1st year)


1.2x1 + 0.5x2 + 0.5x3 + 0.5x4 + 1.0x5 ≤ 3200 (2nd year)
0.8x1 + 0.5x2 + 1.0x3 + 0.5x4 + 0.5x5 ≤ 2500 (3rd year)
(4) Non-negativity constraints of the decision variables:
xj ≥ 0, j = 1, 2, …, 5

The complete model can be formulated as follows:

max z = 9.343x1 + 8.902x2 + 2.118x3 + 11.542x4 + 4.044x5
subject to:
x1 + x2 + x3 + x4 + x5 ≤ 1000
5.0x1 + 4.2x2 + 2.2x3 + 6.6x4 + 3.0x5 ≥ 6000
7.7x1 + 6.5x2 + 3.7x3 + 8.0x4 + 3.5x5 ≥ 5000
7.9x1 + 7.2x2 + 2.9x3 + 6.1x4 + 4.1x5 ≥ 6500
5.0x1 + 4.0x2 + 3.5x3 + 3.5x4 + 3.0x5 ≤ 3800
1.0x1 + 1.0x2 + 0.5x3 + 1.5x4 + 0.5x5 ≤ 3500
1.2x1 + 0.5x2 + 0.5x3 + 0.5x4 + 1.0x5 ≤ 3200
0.8x1 + 0.5x2 + 1.0x3 + 0.5x4 + 0.5x5 ≤ 2500
xj ≥ 0, j = 1, 2, …, 5

The optimal solution of the linear programming model is x1 = 173.33 (hectares for soybeans), x2 = 80 (hectares for cassava), x3 = 0 (hectares for corn), x4 = 746.67 (hectares for wheat), x5 = 0 (hectares for beans), with z = 10,949.59 (US$ 10,949,590.00). This example used a linear programming (LP) model to solve the capital budgeting problem. Often, however, given a set of investment project alternatives, we want to decide whether each project i will be approved (xi = 1) or rejected (xi = 0), which turns the problem into a binary programming (BP) problem, since the decision variables are binary. This case will be discussed in Chapter 19.
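The crop-selection model can also be checked computationally. The sketch below (assuming SciPy is available) restates the NPVs of Table 16.E.7 and the cash-flow limits of Tables 16.E.5 and 16.E.6, with "≥" rows flipped to "≤".

```python
# A sketch of Example 16.7 with SciPy's linprog. Variable order: soybeans,
# cassava, corn, wheat, beans (hectares).
from scipy.optimize import linprog

npv = [9.343, 8.902, 2.118, 11.542, 4.044]      # US$ thousand per hectare

A_ub = [
    [1, 1, 1, 1, 1],                             # total area <= 1000 ha
    [-5.0, -4.2, -2.2, -6.6, -3.0],              # year-1 inflow >= 6000 (flipped)
    [-7.7, -6.5, -3.7, -8.0, -3.5],              # year-2 inflow >= 5000 (flipped)
    [-7.9, -7.2, -2.9, -6.1, -4.1],              # year-3 inflow >= 6500 (flipped)
    [5.0, 4.0, 3.5, 3.5, 3.0],                   # initial outflow <= 3800
    [1.0, 1.0, 0.5, 1.5, 0.5],                   # year-1 outflow <= 3500
    [1.2, 0.5, 0.5, 0.5, 1.0],                   # year-2 outflow <= 3200
    [0.8, 0.5, 1.0, 0.5, 0.5],                   # year-3 outflow <= 2500
]
b_ub = [1000, -6000, -5000, -6500, 3800, 3500, 3200, 2500]

res = linprog([-v for v in npv], A_ub=A_ub, b_ub=b_ub)
print(res.x.round(2), -res.fun)   # expected: about [173.33, 80, 0, 746.67, 0], z = 10,949.59
```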

16.6.5 Portfolio Selection Problem

Markowitz (1952) developed a mathematical model to optimize portfolios that tries to choose, among a set of financial investments, the best combination that maximizes the portfolio’s expected return and minimizes its risk. The model is a quadratic programming problem that tries to find the portfolio’s efficient frontier. The risks of the portfolio are measured by using the variance of the return on assets, calculated as the sum of the individual variances of each asset and the covariances between the pairs of assets. Sharpe (1964) proposed a simplified portfolio optimization model aiming at facilitating the calculation of the covariance matrix. Similar to Markowitz’s model, Sharpe’s model also tries to determine the portfolio’s optimal composition that will result in the highest return possible with the lowest risk. Markowitz’s model requires the extensive calculation of the covariance matrix, so, it is highly complex. In order to facilitate its application, alternative models to Markowitz’s original model have been proposed. From Markowitz’s (1952) and Sharpe’s (1964) theory, we can develop a linear programming model that determines the investment portfolio’s optimal composition and that minimizes the risks, with an expected level of return. Similarly, we can search for the best portfolio composition that maximizes the portfolio’s expected return, subject to the requirement of a minimum level of this value and to the maximum risk allowed.

Model 1: Maximization of an Investment Portfolio's Expected Return
A linear programming model that maximizes an investment portfolio's expected return, subject to a minimum required level of this return and to a maximum allowed risk, is proposed.
Model parameters:
E(R) = investment portfolio's expected return
mj = expected return of asset j, j = 1, …, n
r = minimum level required by the investor regarding the portfolio's expected return
xj^max = maximum percentage of asset j allowed in the portfolio, j = 1, …, n
sj = standard deviation of asset j, j = 1, …, n
s = the portfolio's standard deviation or average risk


Decision variables:
xj = percentage of asset j allocated in the portfolio, j = 1, …, n
General formulation:

max E(R) = Σ_{j=1}^{n} mj xj
s.t.
Σ_{j=1}^{n} xj = 1                    (1)
Σ_{j=1}^{n} mj xj ≥ r                 (2)
xj ≤ xj^max, j = 1, …, n              (3)
Σ_{j=1}^{n} sj xj ≤ s                 (4)
xj ≥ 0, j = 1, …, n                   (5)
(16.14)

The model's objective function maximizes the average return of an investment portfolio with n financial assets. Constraint 1 guarantees that all the capital will be invested. Constraint 2 guarantees that the portfolio's average return reaches the minimum level r required by the investor. Constraint 3 states that the percentage of asset j allocated in the portfolio cannot be higher than xj^max, so that the portfolio can be diversified and its risk minimized; it is important to mention that some expected return maximization models do not consider this constraint. An alternative to constraint 3 can be represented by Equation (4) of Expression (16.14), which ensures that the portfolio's average risk cannot be higher than s; the risks of each asset and of the portfolio are measured by the standard deviation. Finally, the decision variables must meet the non-negativity condition.

Model 2: Investment Portfolio Risk Minimization
An alternative to Markowitz's model was proposed by Konno and Yamazaki (1991), who introduced the mean absolute deviation (MAD) as a risk measure. The model that minimizes the MAD is shown now.
Model parameters:
MAD = the portfolio's mean absolute deviation
rjt = return of asset j in period t, j = 1, …, n; t = 1, …, T
mj = expected return of asset j, j = 1, …, n
r = minimum level required by the investor regarding the portfolio's expected return
xj^max = maximum percentage of asset j allowed in the portfolio, j = 1, …, n
Decision variables:
xj = percentage of asset j allocated in the portfolio, j = 1, …, n
General formulation:

min MAD = (1/T) Σ_{t=1}^{T} | Σ_{j=1}^{n} (rjt − mj) xj |
s.t.
Σ_{j=1}^{n} xj = 1                    (1)
Σ_{j=1}^{n} mj xj ≥ r                 (2)
0 ≤ xj ≤ xj^max, j = 1, …, n          (3)
(16.15)

The model's objective function minimizes the portfolio's mean absolute deviation. Constraint 1 guarantees that all the capital will be invested.


Constraint 2 guarantees that the portfolio's average return reaches the minimum level r required by the investor. Constraint 3 states that the percentage of asset j allocated in the portfolio is non-negative and cannot be higher than xj^max.

Example 16.8
Investor Paul Smith operates daily on Forinvest's home broker system. Paul wants to select a new investment portfolio that maximizes its expected return under a certain risk. Based on a historical analysis of the most highly traded and representative assets in the Brazilian stock market, Forinvest's financial analyst selected a set of 10 stocks trading on B3 (Brazilian Stock Exchange) that could form Paul's portfolio, as shown in Table 16.E.8. The financial analyst tried to select a set of stocks from many different sectors, which were chosen according to Paul's preferences. Table 16.E.9 shows a partial history of the daily return of each stock during 247 days. The complete data can be found in the file Forinvest.xls. In order to diversify the portfolio and, consequently, minimize its risks, the financial analyst advised Paul to invest 30%, at the most, in each stock. Besides, the portfolio's risk, measured by the standard deviation, could not be greater than 2.5%. Elaborate the linear programming model that maximizes Paul's portfolio's expected return.

TABLE 16.E.8 Stocks That Could Be in Paul's Portfolio

     Stock             Code
1    Banco Brasil ON   BBAS3
2    Bradesco PN       BBDC4
3    Eletrobras PNB    ELET6
4    Gerdau PN         GGBR4
5    Itausa PN         ITSA4
6    Petrobras PN      PETR4
7    Sid Nacional ON   CSNA3
8    Telemar PN        TNLP4
9    Usiminas PNA      USIM5
10   Vale PNA          VALE5

TABLE 16.E.9 Partial History of the Assets' Daily Return

Period   BBAS3 (%)   BBDC4 (%)   ELET6 (%)   GGBR4 (%)   ITSA4 (%)   PETR4 (%)   CSNA3 (%)   TNLP4 (%)   USIM5 (%)   VALE5 (%)
1        6.74        6.04        1.47        4.48        6.50        2.71        2.06        3.19        4.40        3.93
2        6.31        3.05        4.23        5.00        2.14        3.43        4.34        0.22        3.42        2.72
3        4.00        2.08        1.47        1.67        3.27        0.75        2.45        2.19        3.06        0.76
4        0.28        0.14        3.66        1.64        0.81        1.85        1.01        1.29        0.63        0.79
5        6.86        5.28        3.79        4.76        5.50        3.23        6.66        0.11        4.87        4.13
6        2.23        4.87        2.96        3.25        3.69        5.20        7.05        0.97        3.89        2.65
7        1.45        0.90        1.04        4.12        2.47        2.56        0.92        0.07        0.41        0.46
8        1.85        1.05        1.17        1.77        2.39        0.21        2.82        3.67        4.13        1.74
9        6.09        0.14        1.39        0.90        0.82        0.89        1.42        3.75        2.90        2.47
10       1.70        1.94        1.21        3.44        1.38        0.42        2.34        0.14        0.40        3.64


Solution First of all, we calculated the average return and the standard deviation of the daily returns of each investment during the period analyzed, as shown in Table 16.E.10.

TABLE 16.E.10 Average Return and Standard Deviation of Each Stock During the Period Analyzed

                     BBAS3 (%)   BBDC4 (%)   ELET6 (%)   GGBR4 (%)   ITSA4 (%)   PETR4 (%)   CSNA3 (%)   TNLP4 (%)   USIM5 (%)   VALE5 (%)
Average return       6.74        6.04        1.47        4.48        6.50        2.71        2.06        3.19        4.40        3.93
Standard deviation   6.31        3.05        4.23        5.00        2.14        3.43        4.34        0.22        3.42        2.72

The second step consists of defining the model's decision variables: xj = percentage of stock j to be allocated in the portfolio, j = 1, …, 10. Therefore, we have:
x1 = percentage of stock BBAS3 to be allocated in the portfolio.
x2 = percentage of stock BBDC4 to be allocated in the portfolio.
x3 = percentage of stock ELET6 to be allocated in the portfolio.
x4 = percentage of stock GGBR4 to be allocated in the portfolio.
x5 = percentage of stock ITSA4 to be allocated in the portfolio.
x6 = percentage of stock PETR4 to be allocated in the portfolio.
x7 = percentage of stock CSNA3 to be allocated in the portfolio.
x8 = percentage of stock TNLP4 to be allocated in the portfolio.
x9 = percentage of stock USIM5 to be allocated in the portfolio.
x10 = percentage of stock VALE5 to be allocated in the portfolio.
The model's objective function tries to maximize Paul's portfolio's expected return during the period analyzed. Therefore, the objective function can be expressed as:
Fobj = max z = 0.0037x1 + 0.0024x2 + 0.0014x3 + 0.0030x4 + 0.0024x5 + 0.0019x6 + 0.0028x7 + 0.0018x8 + 0.0025x9 + 0.0024x10
The model's constraints are described now.
(1) The first constraint guarantees that 100% of the capital will be invested, that is, the sum of the composition of the stocks is equal to 1:
x1 + x2 + ⋯ + x10 = 1
(2) Constraint 2 states that the maximum percentage to be invested in each stock is 30% of the total capital invested:
x1, x2, …, x10 ≤ 0.30
(3) Constraint 3 guarantees that the portfolio's risk, for the period analyzed, will not be greater than the maximum risk stipulated, that is, 2.5%:
0.0248x1 + 0.0216x2 + ⋯ + 0.0247x10 ≤ 0.0250
(4) Finally, the non-negativity conditions of the decision variables must be met:
x1, x2, …, x10 ≥ 0

The complete model can be formulated as follows:

max E(R) = 0.0037x1 + 0.0024x2 + 0.0014x3 + 0.0030x4 + 0.0024x5 + 0.0019x6 + 0.0028x7 + 0.0018x8 + 0.0025x9 + 0.0024x10
s.t.
x1 + x2 + ⋯ + x10 = 1                              (1)
x1, x2, …, x10 ≤ 0.30                              (2)
0.0248x1 + 0.0216x2 + ⋯ + 0.0247x10 ≤ 0.0250       (3)
x1, x2, …, x10 ≥ 0                                 (4)


The optimal solution of the linear programming model is x1 = 30% (Banco do Brasil ON—BBAS3), x2 = 30% (Bradesco PN—BBDC4), x4 = 18.17% (Gerdau PN—GGBR4), x7 = 21.83% (Sid Nacional ON—CSNA3), and x3, x5, x6, x8, x9, x10 = 0, with z = 0.3% (daily average return of 0.3%).

Example 16.9
Consider the same portfolio optimization problem as Paul Smith's problem described in Example 16.8. Now, instead of maximizing the expected return, we want to minimize the portfolio's mean absolute deviation (MAD). Differently from the previous example, instead of considering the maximum allowed risk constraint, we will require a minimum limit of 0.15% for the portfolio's expected daily return. As in the previous example, we can invest at most 30% of the total capital in each asset. Consider the same assets (Table 16.E.8) and the same history of daily returns (Table 16.E.9) of the previous example (see file Forinvest.xls). Elaborate the linear programming problem that minimizes the portfolio's MAD.

Solution
First of all, we must calculate the MAD of each asset in the portfolio. Let's consider the first stock (BBAS3). The first step is to calculate the absolute deviation in each period. As calculated in Example 16.8, the average return of stock BBAS3 for the period analyzed is 0.37%. Since the return of the first period is −6.74%, we can conclude that |r11 − m1| = |−0.0674 − 0.0037| = 0.0711. Now, for period 2, we have |r12 − m1| = |0.0631 − 0.0037| = 0.0594. Then, we do the same for the other periods. For the last period, we have |r1,247 − m1| = |0.0128 − 0.0037| = 0.0091. The second step consists in calculating the mean absolute deviation of BBAS3, that is, the arithmetic mean of the absolute deviations of all the periods:

(1/247) × (0.0711 + 0.0594 + ⋯ + 0.0091) = 0.0187

Then, we do the same for the other stocks. Table 16.E.11 shows the mean absolute deviation of each asset.

TABLE 16.E.11 Mean Absolute Deviation of Each Stock

      BBAS3 (%)   BBDC4 (%)   ELET6 (%)   GGBR4 (%)   ITSA4 (%)   PETR4 (%)   CSNA3 (%)   TNLP4 (%)   USIM5 (%)   VALE5 (%)
MAD   1.87        1.65        1.47        2.28        1.69        1.50        1.99        1.66        2.11        1.79

The objective function tries to minimize the portfolio's MAD, and it can be written as follows:
Fobj = min MAD = 0.0187x1 + 0.0165x2 + 0.0147x3 + 0.0228x4 + 0.0169x5 + 0.0150x6 + 0.0199x7 + 0.0166x8 + 0.0211x9 + 0.0179x10
The constraint that the portfolio's daily average return should reach the minimum limit required by the investor must be considered:
0.0037x1 + 0.0024x2 + ⋯ + 0.0024x10 ≥ 0.0015

The complete model can be formulated as follows:

min MAD = 0.0187x1 + 0.0165x2 + 0.0147x3 + 0.0228x4 + 0.0169x5 + 0.0150x6 + 0.0199x7 + 0.0166x8 + 0.0211x9 + 0.0179x10
s.t.
x1 + x2 + ⋯ + x10 = 1                              (1)
0.0037x1 + 0.0024x2 + ⋯ + 0.0024x10 ≥ 0.0015       (2)
0 ≤ x1, x2, …, x10 ≤ 0.30                          (3)

The optimal solution of the linear programming model is x2 = 30% (Bradesco PN—BBDC4), x3 = 30% (Eletrobras PNB—ELET6), x6 = 30% (Petrobras PN—PETR4), x8 = 10% (Telemar PN—TNLP4), and x1, x4, x5, x7, x9, x10 = 0, with z = 1.55% (the portfolio's mean absolute deviation).
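The per-asset MAD computation described above is easy to reproduce. The sketch below uses NumPy and a short, made-up return series; the actual 247-day series comes from the file Forinvest.xls.

```python
# Mean absolute deviation of a return series, as described for stock BBAS3.
import numpy as np

returns = np.array([-0.0674, 0.0631, 0.0400, -0.0028, 0.0686])   # hypothetical daily returns
mad = np.abs(returns - returns.mean()).mean()                    # mean of |r_t - m|
print(mad)
```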

16.6.6 Production and Inventory Problem

In this section, we will consider a linear programming model that integrates production and inventory decisions. The period of time can be short, medium, or long. A general linear programming model to solve the production and inventory problem, based on Taha (2010, 2016) and Ahuja et al. (2007), will be presented for a case with m products (i = 1, …, m) and T periods (t = 1, …, T).
Model parameters:
Dit = demand for product i in period t
cit = production cost per unit of product i in period t
iit = inventory cost per unit of product i in period t
xit^max = maximum production capacity of product i in period t
Iit^max = maximum inventory capacity of product i in period t
Decision variables:
xit = amount of product i to be produced in period t
Iit = final inventory of product i in period t
General formulation:

min z = Σ_{t=1}^{T} Σ_{i=1}^{m} (cit xit + iit Iit)
s.t.
Iit = Ii,t−1 + xit − Dit,   i = 1, …, m; t = 1, …, T     (1)
xit ≤ xit^max,              i = 1, …, m; t = 1, …, T     (2)
Iit ≤ Iit^max,              i = 1, …, m; t = 1, …, T     (3)
xit, Iit ≥ 0,               i = 1, …, m; t = 1, …, T     (4)
(16.16)

The model's objective function minimizes the sum of the production and inventory costs over the T periods. For each product, constraint 1 represents the inventory balance equation: the final inventory in period t equals the final inventory of the previous period, plus the total produced in the period, minus the demand of the period. So, in order for the demand for product i to be met in period t, the inventory level of the same product in the previous period, added to what is produced in the period, must be greater than or equal to the demand; this condition is implied in the model, since decision variable Iit can only assume non-negative values. Constraint 2 guarantees that the maximum production capacity will not be exceeded. Constraint 3 guarantees that the maximum inventory capacity will not be exceeded. Finally, the non-negativity conditions of the model's decision variables must also be met. As in the production mix problem, when the decision variables can only assume integer values (the manufacturing and storing of products such as cars and electrical appliances cannot be fractioned), we have an integer programming (IP) problem.

Example 16.10
Fenix & Furniture is launching their new collection of sofas and armchairs for next semester. This new collection includes 2- and 3-seat sofas, sofa-beds, armchairs, and poufs. Table 16.E.12 shows the data on the production and inventory costs and capacity for each product, which are constant for all the periods. The demand for each product for next semester is listed in Table 16.E.13. The initial inventory for all the products is 200 units. Determine the optimal production and inventory plan that minimizes the total production and storage costs, meets the expected demand, and respects the production and storage capacity limitations.

Solution
The mathematical formulation of Example 16.10 follows the general formulation of the production and inventory problem presented in Expression (16.16). The complete model is shown now. First of all, we have to define the model's decision variables:
xit = number of pieces of furniture i to be produced in month t (units), i = 1, …, 5; t = 1, …, 6
Iit = final inventory of piece of furniture i in month t (units), i = 1, …, 5; t = 1, …, 6


TABLE 16.E.12 Production and Inventory Costs and Capacity for Each Product

                              2-Seat Sofa   3-Seat Sofa   Sofa-Bed   Armchair   Pouf
Production cost (US$/unit)    320           440           530        66         48
Inventory cost (US$/unit)     8             8             9          3          3
Production capacity (units)   1800          1600          1500       2000       2000
Inventory capacity (units)    20,000        18,000        15,000     22,000     22,000

TABLE 16.E.13 Demand Per Product and Period

              Jan.   Feb.   March   April   May    Jun.
2-Seat sofa   1200   1250   1400    1860    2000   1700
3-Seat sofa   1250   1430   1650    1700    1450   1500
Sofa-bed      1400   1500   1200    1350    1600   1450
Armchair      1800   1750   2100    2000    1850   1630
Pouf          1850   1700   2050    1950    2050   1740

Therefore, we have:
x11 = number of 2-seat sofas to be produced in January.
⋮
x16 = number of 2-seat sofas to be produced in June.
x21 = number of 3-seat sofas to be produced in January.
⋮
x26 = number of 3-seat sofas to be produced in June.
x31 = number of sofa-beds to be produced in January.
⋮
x36 = number of sofa-beds to be produced in June.
x41 = number of armchairs to be produced in January.
⋮
x46 = number of armchairs to be produced in June.
x51 = number of poufs to be produced in January.
⋮
x56 = number of poufs to be produced in June.

I11 = final inventory of 2-seat sofas in January.
⋮
I16 = final inventory of 2-seat sofas in June.
I21 = final inventory of 3-seat sofas in January.
⋮
I26 = final inventory of 3-seat sofas in June.
I31 = final inventory of sofa-beds in January.
⋮
I36 = final inventory of sofa-beds in June.


I41 = final inventory of armchairs in January.
⋮
I46 = final inventory of armchairs in June.
I51 = final inventory of poufs in January.
⋮
I56 = final inventory of poufs in June.

Since the decision variables are discrete, we have an integer programming (IP) problem. Luckily, in this problem, the integrality constraints can be relaxed or eliminated, since the relaxed problem's optimal solution still meets the integrality conditions. Thus, the formulation of the problem will be presented as a linear programming (LP) model. The objective function can be written as:

min z = 320(x11 + x12 + x13 + x14 + x15 + x16) + 8(I11 + I12 + I13 + I14 + I15 + I16)
      + 440(x21 + x22 + x23 + x24 + x25 + x26) + 8(I21 + I22 + I23 + I24 + I25 + I26)
      + 530(x31 + x32 + x33 + x34 + x35 + x36) + 9(I31 + I32 + I33 + I34 + I35 + I36)
      + 66(x41 + x42 + x43 + x44 + x45 + x46) + 3(I41 + I42 + I43 + I44 + I45 + I46)
      + 48(x51 + x52 + x53 + x54 + x55 + x56) + 3(I51 + I52 + I53 + I54 + I55 + I56)

The model's constraints are described here.
(1) Inventory balance equations, for each piece of furniture i (i = 1, …, 5) and each month t (t = 1, …, 6):
I11 = 200 + x11 − 1200, I12 = I11 + x12 − 1250, I13 = I12 + x13 − 1400, I14 = I13 + x14 − 1860, I15 = I14 + x15 − 2000, I16 = I15 + x16 − 1700
I21 = 200 + x21 − 1250, I22 = I21 + x22 − 1430, I23 = I22 + x23 − 1650, I24 = I23 + x24 − 1700, I25 = I24 + x25 − 1450, I26 = I25 + x26 − 1500
I31 = 200 + x31 − 1400, I32 = I31 + x32 − 1500, I33 = I32 + x33 − 1200, I34 = I33 + x34 − 1350, I35 = I34 + x35 − 1600, I36 = I35 + x36 − 1450
I41 = 200 + x41 − 1800, I42 = I41 + x42 − 1750, I43 = I42 + x43 − 2100, I44 = I43 + x44 − 2000, I45 = I44 + x45 − 1850, I46 = I45 + x46 − 1630
I51 = 200 + x51 − 1850, I52 = I51 + x52 − 1700, I53 = I52 + x53 − 2050, I54 = I53 + x54 − 1950, I55 = I54 + x55 − 2050, I56 = I55 + x56 − 1740
(2) Maximum production capacity:
x11, x12, x13, x14, x15, x16 ≤ 1800
x21, x22, x23, x24, x25, x26 ≤ 1600
x31, x32, x33, x34, x35, x36 ≤ 1500


x41, x42, x43, x44, x45, x46 ≤ 2000
x51, x52, x53, x54, x55, x56 ≤ 2000
(3) Maximum inventory capacity:
I11, I12, I13, I14, I15, I16 ≤ 20,000
I21, I22, I23, I24, I25, I26 ≤ 18,000
I31, I32, I33, I34, I35, I36 ≤ 15,000
I41, I42, I43, I44, I45, I46 ≤ 22,000
I51, I52, I53, I54, I55, I56 ≤ 22,000
(4) Non-negativity constraints:
xit, Iit ≥ 0, i = 1, …, 5; t = 1, …, 6
The production and inventory model's optimal solution is shown in Table 16.E.14.

TABLE 16.E.14 Production and Inventory Model's Optimal Solution

Solution   Jan.   Feb.   March   April   May    Jun.   z
x1t        1000   1250   1660    1800    1800   1700   US$ 12,472,680.00
x2t        1050   1580   1600    1600    1450   1500
x3t        1200   1500   1200    1450    1500   1450
x4t        1600   1850   2000    2000    1850   1630
x5t        1650   1750   2000    2000    2000   1740
I1t        0      0      260     200     0      0
I2t        0      150    100     0       0      0
I3t        0      0      0       100     0      0
I4t        0      100    0       0       0      0
I5t        0      50     0       50      0      0
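To make the structure of formulation (16.16) concrete, the sketch below builds a single-product instance with made-up data (demand, costs, and capacities are illustrative only) and shows how the inventory balance equations become equality rows in a solver such as SciPy's linprog.

```python
# Single-product instance of formulation (16.16), with hypothetical data.
# Variable order: x_1..x_T followed by I_1..I_T.
from scipy.optimize import linprog

D = [120, 150, 180, 140]            # hypothetical demand per period
c, h = 10.0, 1.5                    # production and inventory cost per unit
x_max, I_max, I0 = 160, 100, 30     # capacities and initial inventory
T = len(D)

obj = [c] * T + [h] * T
A_eq, b_eq = [], []
for t in range(T):                  # balance: I_t - I_{t-1} - x_t = -D_t
    row = [0.0] * (2 * T)
    row[t] = -1.0                   # -x_t
    row[T + t] = 1.0                # +I_t
    if t > 0:
        row[T + t - 1] = -1.0       # -I_{t-1}
    A_eq.append(row)
    b_eq.append(I0 - D[t] if t == 0 else -D[t])

bounds = [(0, x_max)] * T + [(0, I_max)] * T
res = linprog(obj, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
print(res.x[:T].round(1), res.fun)  # production plan and total cost
```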

16.6.7 Aggregated Planning Problem

Aggregated planning studies the balance between production and demand. The period of time considered is the medium run. In order to meet a fluctuating demand at a minimum cost, we can change the company's resources (employees, production, and inventory levels), we can influence the demand, or we can try to find a combination of both strategies. As strategies to influence the demand, we have: advertising, sales, development of alternative products, etc. As strategies to influence production, we can highlight:
– Controlling inventory levels;
– Hiring and firing employees;
– Overtime or reducing the number of working hours;
– Outsourcing.


Linear programming (LP) models are widely used to solve aggregated planning problems, in order to find the best combination of productive resources that minimizes the total labor, production, and storage costs. For T periods of time, the objective function can minimize the sum of costs related to: regular production, regular labor, hiring and firing employees, overtime, inventory, and/or outsourcing. The constraints are related to the total production and storage capacity, besides the use of labor. The problem can also be characterized as a nonlinear programming (NLP) model (nonlinear costs) or as a binary programming (BP) model (a choice among n alternative plans). Buffa and Sarin (1987), Moreira (2006), and Silva Filho et al. (2009) present a general linear programming model for the aggregated planning problem. An adjusted formulation, for T periods of time (t = 1, …, T), is shown.
Model parameters:
Pt = total production in period t
Dt = demand in period t
rt = production cost per unit (regular hours) in period t
ot = production cost per unit (overtime) in period t
st = production cost per unit (with subcontracted/outsourced labor) in period t
ht = cost of an additional unit (regular hours) in period t by hiring employees from period t − 1 to period t
ft = cost of a cancelled unit in period t by firing employees from period t − 1 to period t
it = inventory cost per unit from period t to period t + 1
It^max = maximum inventory capacity in period t (units)
Rt^max = maximum production capacity at regular hours in period t (units)
Ot^max = maximum production capacity during overtime in period t (units)
St^max = maximum subcontracted production capacity in period t (units)
Decision variables:
It = final inventory in period t (units)
Rt = regular production (regular hours) in period t (units)
Ot = overtime production in period t (units)
St = production with subcontracted labor in period t (units)
Ht = additional production in period t by hiring employees from period t − 1 to period t (units)
Ft = cancelled production in period t by firing employees from period t − 1 to period t (units)
General formulation:

min z = Σ_{t=1}^{T} (rt Rt + ot Ot + st St + ht Ht + ft Ft + it It)
s.t.
It = It−1 + Pt − Dt                 (1)
Pt = Rt + Ot + St                   (2)
Rt = Rt−1 + Ht − Ft                 (3)
It ≤ It^max                         (4)
Rt ≤ Rt^max                         (5)
Ot ≤ Ot^max                         (6)
St ≤ St^max                         (7)
Rt, Ot, St, Ht, Ft, It ≥ 0          (8)
for t = 1, …, T
(16.17)

For T periods of time, the model’s objective function tries to minimize the sum of costs related to regular production, overtime production, subcontracting or outsourcing, and hiring and firing employees, besides the costs with inventory maintenance. Equation (1) of Expression (16.17) states that the final inventory in period t is the same as the final inventory in the previous period, added to the total produced in the same period, subtracting the demand for the current period. The production capacity is specified in Equation (2) of Expression (16.17) as the sum of the total produced regularly in period t, the overtime production, and the total of subcontracted units for the same period. Equation (3) of Expression (16.17) states that the total number of units produced with regular labor in period t is the same as the previous period (t  1), adding the additional units produced with possible hiring, and subtracting the units cancelled due to possible dismissals from period t  1 for period t. Constraint 4 stipulates the maximum inventory capacity allowed for period t.


Constraint 5 guarantees that the regular production in period t will not be greater than the maximum limit allowed. Constraint 6 stipulates the maximum production limit allowed using overtime in period t. Constraint 7 sets a maximum production limit using outsourced labor for period t. Finally, the non-negativity conditions of the model’s decision variables must also be met. The formulation is based on a linear programming (LP) model to solve the respective aggregated planning problem. However, if we considered as a decision variable the number of employees to be hired and fired in each period, instead of the variation in production due to the hiring or firing of employees, we would find ourselves in a mixed-integer programming (MIP) problem, in which part of the decision variables is discrete. Similar to the production mix problem and to the production and inventory problem, when all the model’s decision variables are discrete (the quantities produced and stored can only assume integer values), we have an integer programming (IP) model. Example 16.11 Lifestyle, a company that produces natural juices, was analyzing several alternative aggregated planning options that could be adopted to produce cranberry juice in the second semester of the following year. However, they verified that an optimal solution for the problem could be obtained from a linear programming model. According to the sales department, the demand expected for the period being analyzed is listed in Table 16.E.15.

TABLE 16.E.15 Expected Demand (in Liters) for Cranberry Juice in the Second Semester of the Following Year

Month       Demand (L)
July        4500
August      5200
September   4780
October     5700
November    5820
December    4480

The production sector provided the following data:

Regular production cost (regular hours)                   US$ 1.50 per L
Production cost using overtime                             US$ 2.00 per L
Production cost using subcontracted labor                  US$ 2.70 per L
Cost of increasing production by hiring new employees      US$ 3.00 per L
Cost of decreasing production by firing employees          US$ 1.20 per L
Inventory maintenance costs                                US$ 0.40 per L-month
Initial inventory                                          1000 L
Regular production in the previous month                   4000 L
Maximum inventory capacity                                 1500 L/month
Maximum regular production capacity                        5000 L/month
Maximum production capacity using overtime                 50 L/month
Maximum production capacity using subcontracted labor      500 L/month

Determine the mathematical formulation of Lifestyle’s aggregated planning problem so that they can minimize their total production costs, respecting the problem’s capacity constraints.


Solution
The mathematical formulation of Example 16.11 follows the general formulation of the aggregated planning problem presented in Expression (16.17). The complete model is shown here. First of all, we have to define the model's decision variables:
It = final inventory of cranberry juice in month t (liters), t = 1 (July), …, 6 (December)
Rt = regular production (regular hours) of juice in month t (liters), t = 1, …, 6
Ot = production of juice using overtime in month t (liters), t = 1, …, 6
St = production of juice using subcontracted labor in month t (liters), t = 1, …, 6
Ht = additional production of juice in month t by hiring employees from month t − 1 to month t (liters), t = 1, …, 6
Ft = cancelled production of juice in month t by firing employees from month t − 1 to month t (liters), t = 1, …, 6

The objective function can be written as:

min z = 1.5R1 + 2O1 + 2.7S1 + 3H1 + 1.2F1 + 0.4I1 + 1.5R2 + 2O2 + 2.7S2 + 3H2 + 1.2F2 + 0.4I2 + 1.5R3 + 2O3 + 2.7S3 + 3H3 + 1.2F3 + 0.4I3 + 1.5R4 + 2O4 + 2.7S4 + 3H4 + 1.2F4 + 0.4I4 + 1.5R5 + 2O5 + 2.7S5 + 3H5 + 1.2F5 + 0.4I5 + 1.5R6 + 2O6 + 2.7S6 + 3H6 + 1.2F6 + 0.4I6

The model's constraints are:
(1) Inventory balance equations for each month t (t = 1, …, 6):
I1 = 1000 + R1 + O1 + S1 − 4500
I2 = I1 + R2 + O2 + S2 − 5200
I3 = I2 + R3 + O3 + S3 − 4780
I4 = I3 + R4 + O4 + S4 − 5700
I5 = I4 + R5 + O5 + S5 − 5820
I6 = I5 + R6 + O6 + S6 − 4480
Notice that Equation (2) of Expression (16.17) (Pt = Rt + Ot + St) is already represented.
(2) Quantity produced each month t with regular labor:
R1 = 4000 + H1 − F1
R2 = R1 + H2 − F2
R3 = R2 + H3 − F3
R4 = R3 + H4 − F4
R5 = R4 + H5 − F5
R6 = R5 + H6 − F6
(3) Maximum inventory capacity allowed for each period t:
I1, I2, I3, I4, I5, I6 ≤ 1500
(4) Maximum regular production capacity in period t:
R1, R2, R3, R4, R5, R6 ≤ 5000
(5) Maximum production capacity using overtime in period t:
O1, O2, O3, O4, O5, O6 ≤ 50
(6) Maximum production capacity using outsourced labor in period t:
S1, S2, S3, S4, S5, S6 ≤ 500
(7) Non-negativity constraints:
Rt, Ot, St, Ht, Ft, It ≥ 0   for t = 1, …, 6


The optimal solution of the aggregated planning model is:
I1 = 1270, I2 = 840, I3 = 880, I4 = 500, I5 = 0, I6 = 0
R1 = 4770, R2 = 4770, R3 = 4770, R4 = 4770, R5 = 4770, R6 = 4480
O1 = 0, O2 = 0, O3 = 50, O4 = 50, O5 = 50, O6 = 0
S1 = 0, S2 = 0, S3 = 0, S4 = 500, S5 = 500, S6 = 0
H1 = 770, H2 = 0, H3 = 0, H4 = 0, H5 = 0, H6 = 0
F1 = 0, F2 = 0, F3 = 0, F4 = 0, F5 = 0, F6 = 290
z = 49,549 (US$ 49,549.00)
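As a quick sanity check, plugging the optimal values above into the objective function reproduces the reported total cost; the short script below is only an illustration.

```python
# Evaluating the objective of Example 16.11 at the reported optimal solution.
R = [4770, 4770, 4770, 4770, 4770, 4480]
O = [0, 0, 50, 50, 50, 0]
S = [0, 0, 0, 500, 500, 0]
H = [770, 0, 0, 0, 0, 0]
F = [0, 0, 0, 0, 0, 290]
I = [1270, 840, 880, 500, 0, 0]

z = sum(1.5 * r + 2 * o + 2.7 * s + 3 * h + 1.2 * f + 0.4 * i
        for r, o, s, h, f, i in zip(R, O, S, H, F, I))
print(z)   # expected: 49,549
```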

16.7 FINAL REMARKS

Optimization models can help researchers and managers in the business decision-making process. Among the existing optimization models, we can mention linear programming, network programming, integer programming, nonlinear programming, goal or multiobjective programming, and dynamic programming. Linear programming is one of the most widely used tools. This chapter introduced the main concepts of optimization models, especially the modeling of linear programming problems (general formulations in the standard and canonical forms and business modeling problems). The use of optimization models, mainly linear programming, is increasingly disseminated in academia and in the business world. It can be applied to several areas (strategy, marketing, finance, operations and logistics, human resources, among others) and to several sectors (transportation, automobile, aviation, naval, trade, services, banking, food, beverages, agribusiness, health, real estate, metallurgy, paper and cellulose, electrical energy, oil, gas and fuels, computers, telecommunications, mining, among others). The greatest motivation is the huge savings, of millions or even billions of dollars, that these models can generate for the industries that use them. Several real problems can be formulated through a linear programming model, including the production mix problem, the blending or mixture problem, the capital budgeting problem, investment portfolio selection, production and inventory, and aggregated planning, among others. The methods used to solve a linear programming problem (graphical, analytical, by using the Simplex algorithm, or by using computerized solutions) will be discussed in the next chapter.

16.8 EXERCISES

(1) Describe the main characteristics present in a linear programming model. (2) Give examples of the main fields and sectors in which the linear programming technique can be applied. (3) Transform the problems into the standard form: 2 X (a) xj max j¼1

s:t: 2x1  5x2 ¼ 10 x1 + 2x2  50 x1 , x2  0 (b)

ð 1Þ ð 2Þ ð 3Þ

min 24x1 + 12x2 s:t: 3x1 + 2x2  4 2x1  4x2  26 x2  3 x1 , x2  0

ð1Þ ð 2Þ ð 3Þ ð4Þ


max 10x1  x2

(c)

s:t: 6x1 + x2  10 x2  6

ð1Þ ð 2Þ

x1 ,x2  0

ð 3Þ

max 3x1 + 3x2  2x3

(d)

s:t: 6x1 + 3x2  x3  10 x2 + x3  20 4 x1 , x2 ,x3  0

ð 1Þ ð 2Þ ð3Þ

(4) Do the same here, but now into the canonical form.
(5) Transform the maximization problems into minimization problems:
(a) max z = 10x1 − x2
(b) max z = 3x1 + 3x2 − 2x3

(6) What are the hypotheses of a linear programming model? Describe each one of them.
(7) KMX is an American company in the automobile industry. It will launch three new car models next year: Arlington, Marilandy, and Lagoon. The production of each one of these models goes through the following processes: injection, foundry, machining, upholstery, and final assembly. The average operation times (minutes) of one unit of each component can be found in Table 16.1. Each one of these operations is 100% automated. The number of machines available for each sector can also be found in the same table. It is important to mention that each machine works 16 hours a day, from Monday to Friday. According to the commercial department, besides the minimum sales potential per week, the profit per unit of each automobile model can be seen in Table 16.2. Assuming that 100% of the models will be sold,

TABLE 16.1 Average Operation Time (Minutes) of 1 Unit of Each Component and Total Number of Machines Available

                 Average Operation Time (Minutes)
Sector           Arlington   Marilandy   Lagoon   Machines Available
Injection        3           4           3        6
Foundry          5           5           4        8
Machining        2           4           4        5
Upholstery       4           5           5        8
Final assembly   2           3           3        5

TABLE 16.2 Profit Per Unit and Weekly Minimum Sales Potential Per Product

Model       Profit Per Unit (US$)   Minimum Sales Potential (Units/Week)
Arlington   2500                    50
Marilandy   3000                    30
Lagoon      2800                    30


formulate the linear programming problem that will try to determine the number of automobiles of each model to be manufactured, in order to maximize the company's weekly net profit.
(8) Refresh is a company in the beverage industry that is rethinking its production mix of beers and soft drinks. The production of beer goes through the following processes: the extraction of malt (which can be manufactured internally or not), processing the wort, which produces the alcohol, fermenting (the main phase), processing the beer, and filling the bottles up (packaging). The production of soft drinks goes through the following processes: preparation of the simple syrup, preparation of the compound syrup, dilution, carbonation, and packaging. Each one of the beer and soft drink processing phases is 100% automated. Besides the total number of machines available for each activity, the average operation times (in minutes) of each beer component can be found in Table 16.3. The same data regarding the processing of soft drinks can be found in Table 16.4. It is important to mention that each machine works 8 hours a day, 20 business days a month. Due to the existing competition, the total demand for beer and soft drinks is not greater than 42,000 L a month. The net profit is US$ 0.50 per liter of beer produced and US$ 0.40 per liter of soft drink. Formulate the linear programming problem that maximizes the total monthly profit margin.

TABLE 16.3 Average Beer Operation Time and Number of Machines Available

Sector                Operation Time (Minutes)   Number of Machines
Extraction of malt    2                          6
Processing the wort   4                          12
Fermenting            3                          10
Processing the beer   4                          12
Packaging the beer    5                          13

TABLE 16.4 Average Soft Drink Operation Time and Number of Machines Available

Sector                     Operation Time (Minutes)   Number of Machines
Simple syrup               1                          6
Compound syrup             3                          7
Dilution                   4                          8
Carbonation                5                          10
Packaging the soft drink   2                          5

(9) Golmobilec is a company in the electrical appliance industry that is reviewing its production mix regarding the main household equipment used in the kitchen: refrigerators, freezers, stoves, dishwashers, and microwave ovens. The manufacturing of each one of these devices starts in the pressing process that molds, perforates, adjusts, and cuts each component. The next phase consists in the painting, followed by the molding process that gives the product its final shape. The last two phases consist in the assembly and packaging of the product final. Table 16.5 shows the time required (in hours/machine) to manufacture one unit of each component in each manufacturing process, besides the total time available for each sector. Table 16.6 shows the total number of labor hours (hours/employee) necessary to manufacture one unit of each component in each manufacturing process, in addition to the total number of employees available who work in each sector. It is important to highlight that each employee works 8 hours a day, from Monday to Friday. Due to storage capacity limitations, there is a maximum production capacity per product, as specified in Table 16.7. The same table also shows the minimum demand for each product that must be met, besides the net profit per unit sold. Formulate the linear programming problem that maximizes the total net profit.


TABLE 16.5 Time Necessary (in Hours/Machine) to Manufacture 1 Unit of Each Component in Each Sector

            Time Necessary (Hours/Machine) to Manufacture 1 Unit
Sector      Refrigerator   Freezer   Stove   Dishwasher   Microwave Oven   Time Available (Hours/Machine/Week)
Pressing    0.2            0.2       0.4     0.4          0.3              400
Painting    0.2            0.3       0.3     0.3          0.2              350
Molding     0.4            0.3       0.3     0.3          0.2              250
Assembly    0.2            0.4       0.4     0.4          0.4              200
Packaging   0.1            0.2       0.2     0.2          0.3              200

TABLE 16.6 Total Number of Labor Hours Necessary to Produce 1 Unit of Each Product in Each Sector, Besides the Total Number of Employees Available

            Total Number of Labor Hours to Manufacture 1 Unit
Sector      Refrigerator   Freezer   Stove   Dishwasher   Microwave Oven   Employees Available
Pressing    0.5            0.4       0.5     0.4          0.2              12
Painting    0.3            0.4       0.4     0.4          0.3              10
Molding     0.5            0.5       0.3     0.4          0.3              8
Assembly    0.6            0.5       0.4     0.5          0.6              10
Packaging   0.4            0.4       0.4     0.3          0.2              8

TABLE 16.7 Maximum Capacity, Minimum Demand, and Unit Profit Per Product

Product          Maximum Capacity (Units/Week)   Minimum Demand (Units/Week)   Profit Per Unit (US$/Unit)
Refrigerator     1000                            200                           52
Freezer          800                             50                            37
Stove            500                             50                            35
Dishwasher       500                             50                            40
Microwave oven   200                             40                            29

(10) A refinery produces three types of gasoline: regular, green, and yellow. Each type of gasoline can be produced from the mixture of four types of petroleum: petroleum 1, petroleum 2, petroleum 3, and petroleum 4. Each type of gasoline requires certain specifications of octane and benzene: – A liter of regular gasoline requires, at least, 0.20 L of octane and 0.18 L of benzene – A liter of green gasoline requires, at least, 0.25 L of octane and 0.20 L of benzene – A liter of yellow gasoline requires, at least, 0.30 L of octane and 0.22 L of benzene The octane and benzene compositions of each type of petroleum are: – A liter of petroleum 1 contains 0.20 of octane and 0.25 of benzene – A liter of petroleum 2 contains 0.30 of octane and 0.20 of benzene – A liter of petroleum 3 contains 0.15 of octane and 0.30 of benzene – A liter of petroleum 4 contains 0.40 of octane and 0.15 of benzene


Due to contracts that have already been signed, the refinery needs to produce 12,000 L of regular gasoline, 10,000 L of green gasoline, and 8000 L of yellow gasoline daily. The refinery has a maximum production capacity of 60,000 L of gasoline a day, and can purchase up to 15,000 L of each type of petroleum daily. Each liter of regular, green, and yellow gasoline has a net profit of $ 0.40, $ 0.45 and $ 0.50, respectively. The purchase prices per liter of petroleum 1, petroleum 2, petroleum 3, and petroleum 4 are $ 0.20, $ 0.25, $ 0.30, and $ 0.30, respectively. Formulate the linear programming problem aiming at maximizing the daily net profit. (11) Model Adrianne Medici Torres is upset about some localized fat and would like to lose a few kilos in a few weeks. Her nutritionist recommended a diet that is rich in carbs, moderate in fruit, vegetables, protein, legumes, milk and dairy products, and low in fats and sugar. Table 16.8 shows the food options that can be part of Adrianne’s diet and their respective compositions and characteristics. The data in Table 16.8 can also be found in the file AdrianneTorres’Diet.xls. According to her nutritionist, a balanced diet should contain between 4 and 9 portions of carbs, 3 to 5 portions of fruit, 4 to 5 portions of vegetables, 1 portion of legumes, 2 portions of protein, 2 to 3 portions of milk and dairy products, 1 to 2 portions of sugar and sweets, and 1 to 2 portions of fat. We tried to determine how many portions of each food must be ingested daily, at each meal, in order to minimize the total number of calories consumed, meeting the following requisites: (a) The ideal number of portions ingested, of each type of food, must be respected. (b) Each food can only be ingested at the meal specified in Table 16.8. For example, in the case of cereal, we tried to determine how many portions must be ingested daily at breakfast. Now, in the case of cereal bars, we tried to specify how many portions can be ingested daily at breakfast and as part of the morning and afternoon snacks. (c) The total number of calories ingested at breakfast cannot be higher than 300 calories. (d) The total number of calories ingested as a morning snack cannot be higher than 200 calories. (e) The total number of calories ingested at lunch cannot be higher than 550 calories. (f) The total number of calories ingested as an afternoon snack cannot be higher than 200 calories. (g) The total number of calories ingested at dinner cannot be higher than 350 calories. (h) At breakfast, she must ingest, at least, 1 portion of carbs, 2 of fruit and 1 of milk and/or dairy products. (i) Lunch should contain, at least, 1 portion of the following types of food: carbs, protein, legumes, and vegetables. (j) The morning and afternoon snacks should contain, at least, 1 fruit each. (k) Dinner should contain, at least, 1 portion of carbs, 1 of protein, 1 of milk and dairy products, and 1 of vegetables. (l) A balanced diet should contain, at least, 25 g of fibers a day. (m) 100% of our daily needs of the main vitamins and minerals (iron, zinc, vitamins A, C, B1, B2, B6, B12, niacine, folic acid, etc.) must be met in order for our bodies to work properly. Table 16.8 shows the percentage guaranteed by each portion of food with regard to our daily needs of vitamins and minerals. Formulate the linear programming model for Adrianne’s diet problem.

TABLE 16.8 Composition and Characteristics of Each Food That Can Be Part of Adrianne's Diet (File AdrianneTorres'Diet.xls)

Food | Energy (cal/Portion) | Fibers (g/Portion) | % Vitamins and Minerals | Type of Food | Meals
Lettuce | 1 | 1 | 9 | V | 3, 5
Plums/prunes | 30 | 2.4 | 4 | F | 1, 2, 4
Rice | 130 | 1.2 | 0.5 | C | 3
Brown rice | 110 | 1.6 | 1 | C | 3
Olive oil | 90 | 0 | 0 | TF | 3, 5
Banana | 80 | 2.6 | 13 | F | 1, 2, 4
Cereal bar | 90 | 0.9 | 11 | C | 1, 2, 4
Crackers | 90 | 0.4 | 0.4 | C | 1, 2, 4
Broccoli | 10 | 2.7 | 15 | V | 3, 5
Meat | 132 | 0 | 1 | P | 3
Carrots | 31 | 2 | 19 | V | 3, 5
Cereal | 120 | 1.3 | 20 | C | 1
Chocolate | 150 | 0.2 | 0.5 | SS | 3, 5
Spinach | 18 | 2 | 28 | V | 3, 5
Beans | 95 | 7.9 | 6 | L | 3
Chicken | 112 | 0 | 1.5 | P | 3
Jello | 30 | 0.2 | 0 | SS | 3, 5
Chickpeas | 92 | 3.5 | 4 | L | 3
Yoghurt | 70 | 1.1 | 0.7 | MD | 1, 2, 4
Apples | 60 | 3 | 0.9 | F | 1, 2, 4
Papayas | 56 | 2.4 | 3.1 | F | 1, 2, 4
Eggs | 60 | 0.6 | 8.5 | P | 3
Butter | 100 | 0 | 0 | TF | 1, 5
Bread | 140 | 0.5 | 3.3 | C | 1, 5
Wholewheat bread | 142 | 0.8 | 12 | C | 1, 5
Turkey ham | 75 | 0.4 | 0.4 | P | 1, 5
Fish | 104 | 0.7 | 11 | P | 3
Pears | 88 | 4 | 1.2 | F | 1, 2, 4
Cottage cheese | 80 | 0.4 | 0.6 | MD | 1, 5
Arugula | 4 | 1 | 9.5 | V | 3, 5
Natural sandwiches | 240 | 1.4 | 19 | Mixed | 5
Soya | 85 | 3.9 | 8 | L | 3
Soup | 120 | 3.5 | 16 | Mixed | 5
Tomatoes | 26 | 1.5 | 5 | V | 3, 5

C, carbs; V, vegetables; F, fruit; P, protein; L, legumes; MD, milk and dairy products; TF, total fat; SS, sugar and sweets; 1, food that can be eaten at breakfast; 2, food that can be eaten as a morning snack; 3, food that can be eaten at lunch; 4, food that can be eaten as an afternoon snack; 5, food that can be eaten at dinner.

Note: The soup contains 1 portion of carbs, 1 of protein, 1 of vegetables, and 1 of fat. A natural sandwich, in turn, contains 2 portions of carbs, 1 of protein, 1 of milk and dairy products, 1 of vegetables, and 1 of fat.
(12) Company GWX is trying to obtain a competitive differential in the market and, in order to do this, it is considering five new investment projects for the following 3 years: the development of new products, investment in IT, investment in training courses, factory expansion, and warehouse expansion. Each project requires an initial investment and generates an expected return in the following 3 years, as shown in Table 16.9. Currently, the company has a maximum budget of US$ 1,000,000 to invest. For each investment project, the interest rate is 10% per year. It is important to highlight that the investment project in IT depends on the investment project in training, that is, it will only be undertaken if the investment project in training is accepted. Besides, the factory and warehouse expansion projects are mutually exclusive, that is, only one of them can be selected. Formulate the problem whose main objective is to determine in which projects the company should invest, in order to maximize the current wealth generated from the set of investment projects being analyzed.


TABLE 16.9 Initial Investment and Return Expected in the Following 3 Years for Each Project: Cash Flow Each Year (US$ Thousand)

Year | Product Development | Investment in IT | Training | Factory Expansion | Warehouse Expansion
0 | 360 | 240 | 180 | 480 | 320
1 | 250 | 100 | 120 | 220 | 180
2 | 300 | 150 | 180 | 350 | 200
3 | 320 | 180 | 180 | 330 | 310

(13) A financial analyst from a major brokerage firm is selecting a certain portfolio for a group of clients. The analyst intends to invest in several sectors, having as options five companies from the financial sector, including banks and insurance companies, two from the metallurgy sector, one from the mining sector, one from the paper and cellulose sector, and another one from the electrical energy sector. Table 16.10 shows the monthly return history of each one of these stocks in a period of 36 months. These data are available in the file Stocks.xls. In order to increase diversification, it was established that the portfolio can contain at most 50% of stocks from the financial sector (banks and insurance companies) and at most 40% in each individual asset. Besides, the portfolio should contain, at least, 20% of stocks from the banking sector, 20% from the metallurgy or mining sectors, and 20% from the paper and cellulose or electrical energy sectors. Investors expect the average return of the portfolio to reach a minimum value of 0.80% per month. Furthermore, the portfolio's risk, measured by the standard deviation, cannot be more than 5%. Elaborate the linear programming model that minimizes the portfolio's mean absolute deviation.
(14) Redo the previous exercise considering the period from month 1 to month 24. In this case, however, the model must be formulated for three distinct goals: (a) to minimize the MAD (mean absolute deviation), as in the previous case; (b) to minimize the square root of the squared deviations from the mean; (c) min-max (to minimize the highest absolute deviation).
(15) CTA Investment Bank manages third parties' financial resources, operating in several different investment modalities and ensuring its clients the best return with the lowest risk. Robert Johnson, a client of CTA Investments, wishes to invest US$ 500,000.00 in investment funds. According to Robert's profile, his bank account manager selected 11 types of investment funds that could be part of his portfolio. Table 16.11 shows a description of each fund, its annual profitability, its risk, and the necessary initial investment. The expected annual return was calculated as the weighted moving mean of the last five years. The risk of each fund, measured from the standard deviation of the return history, is also specified in Table 16.11. The maximum risk allowed for Robert's portfolio is 6%. Besides, due to his conservative profile, Robert would like to invest at least 50% of his capital in index-pegged funds and fixed income funds, and at most 25% in each one of the other investments. Formulate the linear programming problem that determines how much to invest in each fund, in order to maximize the expected annual return, respecting the portfolio's maximum risk constraint, the minimum investment in fixed income, and the minimum initial investment in each fund.
(16) Company Arts & Chemical, a leader in the chemical sector, manufactures m products, including plastic, rubber, paints, and polyurethane, among others. The company plans to integrate production, inventory, and transportation decisions. The merchandise can be produced in n different facilities that distribute these products to p different retailers located in the regions of Washington, Baltimore, Philadelphia, New York, and Pittsburgh. The planning horizon is T periods. In each period, we intend to determine which of the n facility alternatives should produce and deliver each one of the m products to each one of the p different retailers. Each facility can cater to more than one retailer. However, the total demand of each retailer must be met by a single facility. The production and storage capacities of the facilities are limited and differ from one another depending on the product and period. Unit production, transportation, and inventory maintenance costs also differ per product, facility, and period. The main objective is to assign retailers to facilities and to determine how much to produce and the level of inventories of each product in each facility and period, in such a way that the sum of the total production, transportation, and inventory maintenance costs is minimized, the demand of each retailer is met, and capacity constraints are not violated. From the general production and inventory model proposed in Section 16.6.6, elaborate a general model that integrates production, inventory, and distribution decisions. Note: Since we have a binary decision variable (its value will be 1 if product i is delivered by facility j to retailer k in period t, and 0 otherwise), we have a mixed-integer programming problem.

TABLE 16.10 Monthly Return History of 10 Stocks From Different Industries in a Period of 36 Months (File Stocks.xls)

Month | Stock 1 Banking (%) | Stock 2 Banking (%) | Stock 3 Banking (%) | Stock 4 Banking (%) | Stock 5 Insurance (%) | Stock 6 Metallurgy (%) | Stock 7 Metallurgy (%) | Stock 8 Mining (%) | Stock 9 Paper-Cellulose (%) | Stock 10 Electrical Energy (%)
1 | 2.57 | 4.47 | 1.08 | 4.78 | 4.19 | 2.54 | 0.57 | 0.60 | 4.07 | 2.78
2 | 3.14 | 4.33 | 0.87 | 3.41 | 3.08 | 2.69 | 0.98 | 5.78 | 3.57 | 3.69
3 | 6.00 | 2.67 | 4.87 | 2.81 | 6.47 | 1.98 | 5.69 | 3.25 | 2.69 | 2.14
4 | 2.14 | 3.59 | 3.57 | 6.70 | 8.05 | 3.14 | 3.10 | 0.88 | 2.02 | 4.01
5 | 5.44 | 3.34 | 2.78 | 2.08 | 5.04 | 7.58 | 3.28 | 4.52 | 1.57 | 1.33
6 | 11.30 | 2.09 | 5.69 | 3.00 | 3.47 | 6.85 | 8.07 | 2.88 | 2.33 | 4.21
7 | 8.07 | 7.80 | 6.44 | 3.54 | 2.09 | 4.70 | 2.67 | 0.58 | 2.87 | 0.74
8 | 2.77 | 6.14 | 6.87 | 2.97 | 2.56 | 11.02 | 3.69 | 3.69 | 0.05 | 0.65
9 | 2.37 | 5.77 | 10.07 | 5.90 | 4.44 | 5.99 | 6.47 | 1.44 | 1.69 | 2.47
10 | 2.14 | 3.23 | 5.64 | 7.01 | 6.07 | 0.14 | 0.22 | 4.22 | 5.87 | 3.54
11 | 4.40 | 1.04 | 3.30 | 2.04 | 5.30 | 2.36 | 3.11 | 0.47 | 2.14 | 2.58
12 | 2.10 | 3.02 | 2.27 | 3.50 | 2.07 | 2.14 | 4.55 | 0.05 | 1.01 | 5.47
13 | 2.14 | 2.01 | 5.47 | 9.33 | 4.44 | 1.34 | 0.24 | 6.95 | 3.99 | 3.54
14 | 4.69 | 3.67 | 2.10 | 8.07 | 6.14 | 0.98 | 3.50 | 8.41 | 1.47 | 2.57
15 | 11.32 | 5.69 | 2.07 | 2.77 | 3.07 | 0.66 | 2.78 | 5.41 | 2.58 | 4.78
16 | 4.69 | 2.00 | 3.47 | 5.48 | 2.05 | 2.89 | 8.40 | 0.22 | 3.57 | 1.23
17 | 2.01 | 6.75 | 3.78 | 3.50 | 2.67 | 13.47 | 7.55 | 9.54 | 0.88 | 0.27
18 | 7.65 | 9.47 | 3.89 | 6.41 | 3.07 | 4.23 | 0.07 | 11.02 | 2.34 | 3.55
19 | 2.36 | 5.33 | 5.68 | 3.04 | 4.08 | 0.28 | 9.56 | 2.55 | 1.09 | 2.67
20 | 11.47 | 6.01 | 3.46 | 2.08 | 4.99 | 2.63 | 5.04 | 12.23 | 7.03 | 0.74
21 | 3.39 | 2.01 | 3.09 | 3.64 | 3.70 | 3.63 | 3.66 | 2.00 | 4.33 | 3.69
22 | 8.43 | 5.03 | 1.01 | 6.80 | 8.02 | 2.47 | 4.40 | 4.47 | 5.87 | 0.25
23 | 4.16 | 5.33 | 5.61 | 5.47 | 7.35 | 0.50 | 2.57 | 6.58 | 2.67 | 0.98
24 | 2.37 | 3.36 | 7.43 | 6.17 | 2.44 | 7.99 | 3.01 | 8.80 | 7.80 | 4.36
25 | 7.00 | 11.04 | 6.40 | 5.55 | 11.07 | 6.01 | 9.77 | 5.96 | 2.22 | 1.66
26 | 3.22 | 4.64 | 6.43 | 4.58 | 2.47 | 14.15 | 6.41 | 3.22 | 1.49 | 0.20
27 | 4.67 | 2.07 | 2.98 | 2.07 | 2.60 | 5.47 | 2.60 | 4.74 | 1.42 | 1.59
28 | 3.20 | 3.68 | 3.10 | 2.65 | 3.18 | 3.14 | 3.01 | 2.33 | 0.77 | 5.67
29 | 0.74 | 0.58 | 2.73 | 6.47 | 3.08 | 3.25 | 7.78 | 4.01 | 0.59 | 4.90
30 | 5.02 | 7.04 | 9.40 | 6.07 | 2.00 | 1.08 | 8.36 | 4.32 | 3.07 | 3.92
31 | 4.30 | 2.99 | 6.81 | 5.88 | 6.47 | 5.47 | 2.04 | 6.77 | 2.55 | 2.14
32 | 2.64 | 7.66 | 6.90 | 0.47 | 6.13 | 11.01 | 2.15 | 2.64 | 0.84 | 0.71
33 | 6.77 | 7.16 | 5.87 | 8.09 | 2.47 | 5.71 | 3.19 | 5.74 | 5.98 | 2.04
34 | 6.70 | 3.41 | 6.80 | 6.47 | 2.08 | 14.33 | 2.03 | 9.12 | 0.25 | 4.33
35 | 2.98 | 2.01 | 5.32 | 5.00 | 4.43 | 5.44 | 6.07 | 8.40 | 0.50 | 2.36
36 | 5.70 | 11.52 | 6.00 | 0.27 | 2.29 | 2.47 | 5.73 | 6.47 | 1.00 | 1.60

TABLE 16.11 Characteristics of Each Fund

Fund | Annual Yield/Return (%) | Risk (%) | Initial Investment (US$)
Index-pegged fund A | 11.74 | 1.07 | 30,000.00
Index-pegged fund B | 12.19 | 1.07 | 100,000.00
Index-pegged fund C | 12.66 | 1.07 | 250,000.00
Fixed income fund A | 12.22 | 1.62 | 30,000.00
Fixed income fund B | 12.87 | 1.62 | 100,000.00
Fixed income fund C | 12.96 | 1.62 | 250,000.00
Commercial paper A | 16.04 | 5.89 | 20,000.00
Commercial paper B | 17.13 | 5.89 | 100,000.00
Multimarket fund | 18.10 | 5.92 | 10,000.00
Stock fund A | 19.53 | 6.54 | 1000.00
Stock fund B | 22.16 | 7.23 | 1000.00

(17) From the previous exercise, consider a case in which each retailer can receive supplies/products from more than one facility. Elaborate the adapted general model. Note: In this case, we must define a new decision variable that determines the amount of product i to be transported from facility j to retailer k in period t.
(18) Pharmabelz, a company in the cosmetics and cleaning products industry, would like to define the aggregate production planning of Leveza, a type of soap, for the first semester of the following year. In order to do that, the sales department provided the demand expected for the period being studied, as shown in Table 16.12.

TABLE 16.12 Soap Demand Expected (kg) for the First Semester of the Following Year

Month | Demand (kg)
January | 9600
February | 10,600
March | 12,800
April | 10,650
May | 11,640
June | 10,430


The production data are:

Costs of regular production: US$ 1.50 per kg
Cost to outsource labor: US$ 2.00 per kg
Costs of regular labor: US$ 600.00/employee-month
Cost to hire a worker: US$ 1000.00/worker
Cost to fire a worker: US$ 900.00/worker
Cost per overtime hour: US$ 7.00/hour
Inventory maintenance costs: US$ 1.00/kg-month
Regular labor in the previous month: 10 workers
Initial inventory: 600 kg
Average productivity per employee: 16 kg/employee-hour
Average productivity per overtime hour: 14 kg/hour
Maximum outsourced production capacity: 1000 kg/month
Maximum regular labor capacity: 20 workers
Maximum inventory capacity: 2500 kg/month

Each employee usually works 6 business hours a day, 20 business days a month, and is allowed to work at most 20 overtime hours a month. Formulate the aggregate planning model (mixed-integer program) that minimizes the total production, labor, and storage costs for the period analyzed, respecting the system's capacity constraints.

Chapter 17

Solution of Linear Programming Problems There is geometry everywhere. However, it is necessary to have eyes to see it, intelligence to understand it, and a soul to admire it. Malba Tahan in “The Man Who Counted”

17.1 INTRODUCTION In this chapter, we will discuss several ways of solving a linear programming problem (LP): a) in a graphical way; b) through the analytical method; c) by using the Simplex method; d) by using a computer. A simple linear programming problem with only two decision variables can be easily solved in a graphical way or through the analytical method. The graphical solution can be applied to problems with, at most, three decision variables, although with greater complexity. Similarly, the analytical solution becomes impractical for problems with many variables and equations, since it calculates all the possible basic solutions. As an alternative to these procedures, we can use the Simplex algorithm or any existing software directly (GAMS, AMPL, AIMMS, or spreadsheet-based software, such as Solver in Excel and What's Best, among others) to solve any linear programming problem. In this chapter, we will solve each one of the management problems modeled in the previous chapter (Examples 16.3 to 16.12) by using Solver in Excel. Some linear programming problems do not have a single nondegenerate optimal solution. They may fall into one of four categories: a) multiple optimal solutions; b) unlimited objective function z; c) there is no optimal solution; d) degenerate optimal solution. Throughout this chapter, we will discuss how to identify each one of these special cases in a graphical way, through the Simplex method, and by using a computer. Many times, the estimation of model parameters is based on future forecasts, and changes may occur until the final solution is implemented in the real world. As examples of changes, we can mention changes in the quantity of resources available, the introduction of a new product, variation in a product's price, and increases or decreases in production costs, among others. Therefore, sensitivity analysis is essential in the study of linear programming problems, since its main goal is to investigate the impacts that certain changes in the model parameters would have on the optimal solution. The sensitivity analysis presented at the end of the chapter discusses the variation that the objective function coefficients and the constants on the right-hand side of each constraint can assume without changing the initial model's optimal solution, or without changing the feasibility region. This analysis can be carried out graphically, by using algebraic calculations, or directly through Solver in Excel or other software packages, such as Lindo, considering one alteration at a time.

17.2 GRAPHICAL SOLUTION OF A LINEAR PROGRAMMING PROBLEM A simple linear programming problem that has two decision variables can be easily solved in a graphical way. According to Hillier and Lieberman (2005), any LP problem that has two decision variables can be solved graphically. Problems with up to three decision variables can also be solved in a graphical way, but with greater complexity. In the graphical solution for a linear programming model, first of all, we must determine the feasible solution space or feasible region along a Cartesian axis. A viable or feasible solution is one that satisfies all the model constraints, including the non-negativity ones. If a certain solution violates at least one of the model constraints, it is called an unfeasible solution. The next step consists of determining the model's optimal solution, that is, the feasible solution that has the best objective function value. For a maximization problem, after the set of feasible solutions is established, the optimal solution is the one that gives the highest value to the objective function within this set, whereas for a minimization problem, the optimal solution is the one that minimizes the objective function.


FIG. 17.1 An example of convex and nonconvex sets.

The set of feasible solutions for a linear programming problem is represented by K. This leads to the first theorem:
Theorem 1 The K set is convex.
Definition A K set is convex when all the line segments that connect any two points of K are included in K. A convex set is bounded if it includes its bounds.
As an illustrative example, the graphical representation of convex and nonconvex sets can be seen in Fig. 17.1. The graphical solution for a linear programming maximization and minimization problem with a single optimal solution will be illustrated through Examples 17.1 and 17.2, respectively. The special cases (multiple optimal solutions, unlimited objective function, unfeasible solution, and degenerate optimal solution) will also be presented, through Examples 17.3, 17.4, 17.5, and 17.6.

17.2.1

Linear Programming Maximization Problem with a Single Optimal Solution

The graphical solution for an LP maximization problem with a single optimal solution will be illustrated through Example 17.1.
Example 17.1 Consider the following LP maximization problem:

max z = 6x1 + 4x2
subject to:
2x1 + 3x2 ≤ 18
5x1 + 4x2 ≤ 40
x1 ≤ 6
x2 ≤ 8
x1, x2 ≥ 0          (17.1)

Determine the set of feasible solutions, in addition to the model's optimal solution.
Solution
Feasible region

In the Cartesian axes x1 and x2, we determine the feasible solution space that represents the constraints of the maximization model being studied. First, for each constraint, we plot the line that represents the equality equation (without considering the signs ≤ or ≥) and, from then on, we determine the direction of the line that satisfies the inequality. Thus, for the first constraint, the line that represents the equation 2x1 + 3x2 = 18 can be plotted from two points. If x1 = 0, then x2 = 6. Analogously, if x2 = 0, then x1 = 9. To determine the solution space or the direction of the line that satisfies the inequality 2x1 + 3x2 ≤ 18, we can consider any point outside the line. We usually use the point of origin (x1, x2) = (0, 0), due to its simplicity. We can see that the point of origin satisfies the first inequality, because 0 + 0 ≤ 18. Therefore, we can identify the direction of the line that has feasible solutions, as shown in Fig. 17.2. In the same way, for the second constraint, the line that represents the equality equation 5x1 + 4x2 = 40 is plotted from two points. If x1 = 0, then x2 = 10. Analogously, if x2 = 0, then x1 = 8. We can also see that the point of origin satisfies the inequality 5x1 + 4x2 ≤ 40, because 0 + 0 ≤ 40, representing the direction of the straight line that contains the feasible solutions, consistent with Fig. 17.2. Similarly, we can determine the feasible solution space for the other constraints x1 ≤ 6, x2 ≤ 8, x1 ≥ 0, and x2 ≥ 0. Constraints 5x1 + 4x2 ≤ 40 and x2 ≤ 8 are redundant, that is, if they were excluded from the model, the feasible solution space would not be affected. The feasible region is represented by the four-sided polygon ABCD. Any point on the boundary of the polygon or inside it belongs to the feasible region. On the other hand, any point outside the polygon does not satisfy at least one of the model constraints.


FIG. 17.2 Feasible region of Example 17.1.

FIG. 17.3 Optimal solution for Example 17.1.

Optimal solution

The next step tries to determine the model's optimal solution that maximizes the function z = 6x1 + 4x2, within the feasible solution space shown in Fig. 17.2. Since the solution space contains an infinite number of points, it is necessary to carry out a formal procedure to identify the optimal solution (Taha, 2016). First, we need to identify the correct direction in which the function increases (maximization function). In order to do that, we will draw different straight lines based on the objective function equation, assigning different values to z, by trial and error. Having identified the direction in which the objective function increases, it is possible to identify the model's optimal solution, within the feasible solution space. First, we assigned a value of z = 24, followed by z = 36, and obtained the equations 6x1 + 4x2 = 24 and 6x1 + 4x2 = 36, respectively. From these two equations, it was possible to identify the direction, within the feasible solution space, that maximizes the objective function, concluding that point C is optimal. Since vertex C is the intersection of the lines 2x1 + 3x2 = 18 and x1 = 6, the values of x1 and x2 can be calculated algebraically from these two equations. Therefore, we have x1 = 6 and x2 = 2 with z = 6 × 6 + 4 × 2 = 44. The complete procedure is presented in Fig. 17.3. Since all the lines are represented by the equation z = 6x1 + 4x2, changing only the value of z, we can conclude, from Fig. 17.3, that the lines are parallel. Another important theorem states that an optimal solution for a linear programming problem is always associated with a vertex or extreme point of the solution space:


Theorem 2 For linear programming problems with a single optimal solution, the objective function reaches its maximum or minimum at an extreme point of the convex set K.

17.2.2

Linear Programming Minimization Problem With a Single Optimal Solution

Example 17.2 Consider the following minimization problem:

min z = 10x1 + 6x2
subject to:
4x1 + 3x2 ≥ 24
2x1 + 5x2 ≥ 20
x1 ≤ 8
x2 ≤ 6
x1, x2 ≥ 0          (17.2)

Determine the set of feasible solutions and the model's optimal solution.
Solution
Feasible region

The same procedure of Example 17.1 is used to obtain the feasible solution space of the minimization problem. First, we must determine the feasible region from the minimization model constraints. Considering the first constraint, 4x1 + 3x2 ≥ 24, and the second, 2x1 + 5x2 ≥ 20, we can see that the point of origin (x1, x2) = (0, 0) does not satisfy either of the inequalities. Thus, the feasible direction of these two lines does not contain this point. Including constraints x1 ≤ 8 and x2 ≤ 6, the feasible solution space became limited, as shown in Fig. 17.4. Different from Example 17.1, in this case, we can see that all the constraints are nonredundant, that is, they are all responsible for defining the model's feasibility region. The feasibility region is represented by polygon ABCD, which is highlighted in Fig. 17.4.
Optimal solution

The same procedure of Example 17.1 is used to find the optimal solution for the minimization problem. Therefore, we are trying to determine the model's optimal solution that minimizes the function z = 10x1 + 6x2, within the feasible solution space shown in Fig. 17.4. To analyze the direction in which the objective function decreases (minimization function), different values of z were assigned, by trial and error. First, we assigned a value of z = 72, obtaining the equation 10x1 + 6x2 = 72, followed by z = 60, where 10x1 + 6x2 = 60. Hence, it was possible to identify the direction that minimizes the objective function, concluding that point D represents the model's optimal solution (see Fig. 17.5).
FIG. 17.4 Feasible region of Example 17.2.


FIG. 17.5 Optimal solution for Example 17.2.

Coordinates x1 and x2 of point D can be calculated algebraically from the equations 4x1 + 3x2 = 24 and x2 = 6, because point D is the intersection of these two lines. Thus, we have x1 = 1.5 and x2 = 6 with z = 10 × 1.5 + 6 × 6 = 51. As in the previous example, the LP problem shows a single optimal solution that is associated with an optimal vertex of the solution space (Theorem 2).
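The optimal vertices found graphically in Examples 17.1 and 17.2 can also be checked numerically. The short sketch below uses SciPy's linprog routine, which is not part of the book's Excel-based workflow; it is shown only as an illustrative cross-check, and the constraint matrices simply transcribe Expressions (17.1) and (17.2).

# Minimal sketch: cross-checking Examples 17.1 and 17.2 with scipy.optimize.linprog
# (illustrative only; the book itself solves these models with Solver in Excel).
from scipy.optimize import linprog

# Example 17.1: max z = 6x1 + 4x2 -> minimize -z
res_max = linprog(
    c=[-6, -4],                              # negated objective for maximization
    A_ub=[[2, 3], [5, 4]],                   # 2x1 + 3x2 <= 18, 5x1 + 4x2 <= 40
    b_ub=[18, 40],
    bounds=[(0, 6), (0, 8)],                 # 0 <= x1 <= 6, 0 <= x2 <= 8
)
print(res_max.x, -res_max.fun)               # expected: [6. 2.] and z = 44

# Example 17.2: min z = 10x1 + 6x2; the >= constraints are multiplied by -1
res_min = linprog(
    c=[10, 6],
    A_ub=[[-4, -3], [-2, -5]],               # 4x1 + 3x2 >= 24, 2x1 + 5x2 >= 20
    b_ub=[-24, -20],
    bounds=[(0, 8), (0, 6)],                 # 0 <= x1 <= 8, 0 <= x2 <= 6
)
print(res_min.x, res_min.fun)                # expected: [1.5 6. ] and z = 51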

17.2.3

Special Cases

Sections 17.2.1 and 17.2.2 presented the graphical solution for a maximization (Example 17.1) and a minimization problem (Example 17.2), respectively, with a single nondegenerate optimal solution. The graphical concept of a degenerate solution will be presented in Section 17.2.3.4. However, some linear programming problems do not have a single nondegenerate optimal solution, and they may fall into one of the four cases presented:
1. Multiple optimal solutions
2. Unlimited objective function z
3. There is no optimal solution
4. Degenerate optimal solution

This section aims to identify, in a graphical way, each one of the special cases listed, which may happen in a linear programming problem. We will also study how to identify them through the Simplex method (see Section 17.4.5) and through a computer (cases 2 and 3 in Section 17.5.3, and cases 1 and 4 in Section 17.6.4).

17.2.3.1 Multiple Optimal Solutions A linear programming problem can have more than one optimal solution. In this case, considering a problem with two decision variables, different values of x1 and x2 achieve the same optimal value in the objective function. This case is graphically illustrated through Example 17.3. According to Taha (2016), when the objective function is parallel to an active constraint, we have a case with multiple optimal solutions. An active constraint is the one responsible for determining the model’s optimal solution. Example 17.3 Determine the set of feasible solutions and the model’s optimal solutions for the following linear programming problem:


max z = 8x1 + 4x2
subject to:
4x1 + 2x2 ≤ 16
x1 + x2 ≤ 6
x1, x2 ≥ 0          (17.3)

Solution The same procedure used in the previous examples to find the optimal solution was applied in this case. Fig. 17.6 shows the feasible region determined from the constraints of the model analyzed. We can see that the feasible solution space is represented by the four-sided polygon ABCD. FIG. 17.6 Feasible region with multiple optimal solutions.

To determine the model's optimal solution, first of all, we assigned a value of z = 16, and obtained the line presented in Fig. 17.6. Since the objective function is a maximization one, the higher the values of x1 and x2, the higher the value of function z, such that the direction in which the function increases can be easily identified. We can see that the lines represented by the equations z = 16 = 8x1 + 4x2 and 4x1 + 2x2 = 16 are parallel. Therefore, this is a case with multiple optimal solutions, represented by the segment BC. For example, for point B, x1 = 4 and x2 = 0, the value of z is 8 × 4 + 4 × 0 = 32. Point C is the intersection of the lines 4x1 + 2x2 = 16 and x1 + x2 = 6. Calculating it algebraically, we obtain x1 = 2 and x2 = 4 with z = 8 × 2 + 4 × 4 = 32. Any other point in this segment is an alternative optimal solution and also presents z = 32. Hence, a new theorem is born:
Theorem 3 For linear programming problems with more than one optimal solution, the objective function assumes this value in at least two extreme points of the convex set K and in all the convex linear combinations of these extreme points (all the points of the line segment that joins these two extremes, that is, the edge of the polygon that contains these extreme points).
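Theorem 3 can be illustrated with a few lines of arithmetic. The sketch below is a small numerical check written in Python (not taken from the book): it evaluates z = 8x1 + 4x2 at the extreme points B and C of Example 17.3 and at some of their convex combinations, all of which return the same optimal value of 32.

# Multiple optimal solutions in Example 17.3: z is constant along the segment BC.
import numpy as np

c = np.array([8.0, 4.0])                 # objective coefficients of z = 8x1 + 4x2
B = np.array([4.0, 0.0])                 # extreme point B
C = np.array([2.0, 4.0])                 # extreme point C

for lam in (0.0, 0.25, 0.5, 1.0):        # convex combinations lam*B + (1 - lam)*C
    point = lam * B + (1 - lam) * C
    print(point, c @ point)              # every point on BC yields z = 32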

17.2.3.2 Unlimited Objective Function z In this case, there is no limit for how much the value of at least one decision variable can increase, resulting in a feasible region and an unlimited objective function z. For a maximization problem, the value of the objective function increases unlimitedly, while for a minimization problem, the value decreases in an unlimited way. Example 17.4 illustrates, in a graphical way, a case that shows an unlimited set of solutions, resulting in an unlimited value of the objective function.
Example 17.4 Determine the feasible solution space and the model's optimal solution for the following linear programming problem:

max z = 4x1 + 3x2
subject to:
2x1 + 5x2 ≥ 20
x1 ≤ 8
x1, x2 ≥ 0          (17.4)


Solution
From the constraints in Example 17.4, we obtain the feasible solution space, which in this case is unlimited, because there is no limit for the increase of x2, as shown in Fig. 17.7. Consequently, objective function z can also increase in an unlimited way. The complete procedure is shown in Fig. 17.7.

FIG. 17.7 Unlimited set of feasible solutions and unlimited maximization function z.


17.2.3.3 There Is No Optimal Solution In this case, it is not possible to find a feasible solution for the problem being studied, that is, there is no optimal solution. The set of feasible solutions is empty. Example 17.5 illustrates, in a graphical way, a case in which there is no optimal solution.
Example 17.5 Consider the following linear programming problem:

max z = x1 + x2
subject to:
5x1 + 4x2 ≥ 40
2x1 + x2 ≤ 6
x1, x2 ≥ 0          (17.5)

Determine the feasibility region and the optimal solution for the linear programming model.
Solution
Fig. 17.8 shows the graphical solution for Example 17.5, considering each one of the model constraints, besides the objective function with an arbitrary value of z = 7. From Fig. 17.8, we can see that no point satisfies all the problem constraints. This means that the feasible solution space in Example 17.5 is empty, resulting in an unfeasible LP problem that has no optimal solution.

17.2.3.4 Degenerate Optimal Solution Graphically, we can see a special case of degenerate solution whenever one of the vertices of the feasible region is obtained from the intersection of more than two distinct lines. Therefore, we have a degenerate vertex. If there is degeneration in the optimal solution, we have a case known as degenerate optimal solution. The concept of degenerate solution and a degeneration problem will be discussed in depth in Sections 17.4.5.4 (identification of a degenerate optimal solution through the Simplex method) and 17.6.4.2 (identification of a degenerate optimal solution through the Sensitivity Report in the Solver in Excel).


FIG. 17.8 Empty set of feasible solutions without an optimal solution.


Example 17.6 Consider the following linear programming problem:

min z = x1 + 5x2
subject to:
2x1 + 4x2 ≥ 16
x1 + x2 ≤ 6
x1 ≤ 4
x1, x2 ≥ 0          (17.6)

Determine the feasibility region and the optimal solution for the linear programming model.
Solution
The feasible solution space of Example 17.6 is shown in Fig. 17.9, and it is represented by triangle ABC. We can see that constraint x1 ≤ 4 is redundant. Since vertex B is the intersection of three lines, we have a degenerate vertex. The minimization function consists of the equation z = x1 + 5x2, for which we try to find the minimum point that satisfies all the model constraints. Thus, from a value of z = 50, it is possible to identify the direction of the line that minimizes function z, as shown in Fig. 17.9. Therefore, we can see that point B is the degenerate optimal solution. Since point B is the intersection of the lines 2x1 + 4x2 = 16 and x1 + x2 = 6, coordinates x1 and x2 can be calculated algebraically from these equations. So, we have x1 = 4 and x2 = 2 with z = 4 + 5 × 2 = 14.
FIG. 17.9 Feasible region with a degenerate optimal solution.


17.3 ANALYTICAL SOLUTION OF A LINEAR PROGRAMMING PROBLEM IN WHICH m < n The graphical procedure for solving LP problems was presented in Section 17.2. This section discusses the analytical procedure for solving a linear programming problem. Consider a system Ax = b with m linear equations and n variables, in which m < n. According to Taha (2016), if m = n and the equations are consistent, the system has a single solution. In cases in which m > n, at least m − n equations must be redundant. However, if m < n and the equations are also consistent, the system will have an infinite number of solutions. To find a solution for the system Ax = b, in which m < n, first, we have to choose a set of n − m variables from x, called nonbasic variables (NBV), to which values equal to zero are assigned. The m remaining variables of the system, called basic variables (BV), are then determined. This solution is called a basic solution (BS). The set of basic variables is called the base. If the basic solution meets the non-negativity constraints, that is, the basic variables are non-negative, it is called a feasible basic solution (FBS). According to Winston (2004), a basic variable can also be defined as one that has coefficient 1 in only one equation, and 0 in the others. All the remaining variables are nonbasic. To calculate the optimal solution, we just need to calculate the value of objective function z for all the possible basic solutions and choose the best alternative. The maximum number of basic solutions to be calculated is:

C(n, m) = n! / [m! (n − m)!]          (17.7)

Hence, the analytical method applied in this section analyzes all the possible combinations of the n variables taken m at a time, choosing the best one. Solving it through a linear equation system is feasible in cases in which m and n are small. However, for high values of m and n, the calculation becomes impractical. As an alternative, we can use the Simplex method that will be studied in Section 17.4.
Example 17.7 Consider the following system with three variables and two equations:

x1 + 2x2 + 3x3 = 28
3x1 − x3 = 4

Determine all the basic solutions for this system.
Solution
For a system with three variables and two equations, we have n − m = 3 − 2 = 1 nonbasic variable and m = 2 basic variables. In this example, the total number of possible basic solutions is 3.
Solution 1
NBV = {x1} and BV = {x2, x3}. We assigned the value zero to the nonbasic variable, that is, x1 = 0. Thus, we calculate the values of variables x2 and x3 of the basic solution algebraically, from the equation system seen in the heading of the exercise. So, x2 = 20 and x3 = −4. Since x3 < 0, the solution is unfeasible.
Solution 2
NBV = {x2} and BV = {x1, x3}. If x2 = 0, the basic solution is x1 = 4 and x3 = 8. Therefore, we have a feasible basic solution (FBS).
Solution 3
NBV = {x3} and BV = {x1, x2}. If x3 = 0, the basic solution is x1 = 1.33 and x2 = 13.33. As in the previous case, we have an FBS here.
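For larger systems, this enumeration is tedious to do by hand. The fragment below is a hypothetical sketch, written in Python with NumPy (tools the book does not use), of how the three basic solutions of Example 17.7 could be enumerated automatically; the variable names are illustrative assumptions.

# Hypothetical sketch: enumerating the basic solutions of Example 17.7 with NumPy.
import itertools
import numpy as np

A = np.array([[1.0, 2.0, 3.0],     # x1 + 2x2 + 3x3 = 28
              [3.0, 0.0, -1.0]])   # 3x1       -  x3 =  4
b = np.array([28.0, 4.0])
m, n = A.shape                     # m = 2 equations, n = 3 variables

for basic in itertools.combinations(range(n), m):   # choose the m basic variables
    B = A[:, list(basic)]
    if abs(np.linalg.det(B)) < 1e-12:                # singular base: no basic solution
        continue
    x_b = np.linalg.solve(B, b)                      # values of the basic variables
    status = "FBS" if np.all(x_b >= 0) else "unfeasible"
    print([f"x{j + 1}" for j in basic], x_b.round(2), status)

Running it reproduces the three solutions above, including the unfeasible one with x3 = −4.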


Example 17.8 Consider the following linear programming problem:

max 3x1 + 2x2
subject to:
x1 + x2 ≤ 6
5x1 + 2x2 ≤ 20
x1, x2 ≥ 0          (17.8)

Solve the problem in an analytical way.
Solution
In order for the analytical solution procedure to be applied, the problem must be in standard form (see Section 16.4.1 of the previous chapter). In order for the inequality constraints to be rewritten as equalities, slack variables x3 and x4 must be included. Thus, the original problem rewritten in standard form becomes:

max 3x1 + 2x2
subject to:
x1 + x2 + x3 = 6
5x1 + 2x2 + x4 = 20
x1, x2, x3, x4 ≥ 0          (17.9)

The system has m = 2 equations and n = 4 variables. In order for a basic solution to be found, values equal to zero will be assigned to n − m = 4 − 2 = 2 nonbasic variables, such that the values of the m = 2 remaining basic variables can be determined by the equation system represented by Expression (17.9). In this example, the total number of basic solutions is:

C(4, 2) = 4! / [2! (4 − 2)!] = 6

Solution A

NBV = {x1, x2} and BV = {x3, x4}. First, we assigned value zero to the nonbasic variables x1 and x2, such that the values of the basic variables x3 and x4 can be calculated algebraically from Expression (17.9). Therefore, we have:
Nonbasic solution: x1 = 0 and x2 = 0
Basic solution: x3 = 6 and x4 = 20
Objective function: z = 0
The same calculation will be carried out to obtain the other basic solutions. At every new solution, a variable from the set of nonbasic variables goes into the set of basic variables (base) and, as a result, one will leave the base.
Solution B
In this case, variable x1 enters the base instead of variable x4, which starts to be part of the set of nonbasic variables.
NBV = {x2, x4} and BV = {x1, x3}
Nonbasic solution: x2 = 0 and x4 = 0
Basic solution: x1 = 4 and x3 = 2
Objective function: z = 12
Solution C
In this case, variable x4 goes into the base instead of variable x3.
NBV = {x2, x3} and BV = {x1, x4}
Nonbasic solution: x2 = 0 and x3 = 0
Basic solution: x1 = 6 and x4 = −10
Since x4 < 0, the solution is unfeasible.
Solution D
In this case, variable x2 goes into the base instead of variable x4.
NBV = {x3, x4} and BV = {x1, x2}
Nonbasic solution: x3 = 0 and x4 = 0
Basic solution: x1 = 2.67 and x2 = 3.33
Objective function: z = 14.67
Solution E
In this case, variable x4 goes into the base instead of variable x1.
NBV = {x1, x3} and BV = {x2, x4}
Nonbasic solution: x1 = 0 and x3 = 0
Basic solution: x2 = 6 and x4 = 8
Objective function: z = 12


FIG. 17.10 Graphical representation of Example 17.8.


Solution F

In this case, variable x3 goes into the base instead of variable x4.
NBV = {x1, x4} and BV = {x2, x3}
Nonbasic solution: x1 = 0 and x4 = 0
Basic solution: x2 = 10 and x3 = −4
Since x3 < 0, the solution is unfeasible.
Thus, the optimal solution is D, with x1 = 2.67, x2 = 3.33, x3 = 0, x4 = 0, and z = 14.67. Fig. 17.10 shows the graphical representation of each one of the six solutions obtained on the Cartesian axes x1 and x2. Solutions A, B, D, and E correspond to extreme points of the feasible region. On the other hand, since they are unfeasible, solutions C and F do not belong to the set of feasible solutions. Thus, a new theorem is born:
Theorem 4 Every feasible basic solution for a linear programming problem is an extreme point of the convex set K.
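The brute-force enumeration that produced Solutions A through F lends itself directly to a short script. The sketch below is an illustrative Python/NumPy version of this analytical method applied to Example 17.8 (the names and tolerances are assumptions; the book itself performs these steps by hand or in Excel).

# Minimal sketch of the brute-force analytical method for Example 17.8:
# enumerate all C(4, 2) = 6 basic solutions of the standard-form system and
# keep the best feasible one.
import itertools
import numpy as np

A = np.array([[1.0, 1.0, 1.0, 0.0],    # x1 +  x2 + x3      = 6
              [5.0, 2.0, 0.0, 1.0]])   # 5x1 + 2x2      + x4 = 20
b = np.array([6.0, 20.0])
c = np.array([3.0, 2.0, 0.0, 0.0])     # z = 3x1 + 2x2 (slacks have zero cost)
m, n = A.shape

best = None
for basic in itertools.combinations(range(n), m):
    B = A[:, list(basic)]
    if abs(np.linalg.det(B)) < 1e-12:                # singular base: skip it
        continue
    x = np.zeros(n)
    x[list(basic)] = np.linalg.solve(B, b)           # basic variables; the rest stay 0
    if np.all(x >= -1e-9):                           # feasible basic solution (FBS)
        z = float(c @ x)
        print("FBS", x.round(2), "z =", round(z, 2))
        if best is None or z > best[1]:
            best = (x, z)

print("optimal:", best[0].round(2), "z =", round(best[1], 2))

It prints the four feasible basic solutions (A, B, D, and E) and reports the optimum x1 = 2.67, x2 = 3.33 with z = 14.67, matching Solution D.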

17.4 THE SIMPLEX METHOD As presented in Section 17.2, a graphical solution can be used to solve linear programming problems with two or, at most, three decision variables (the latter with greater complexity). In the same way, the analytical solution presented in Section 17.3 becomes impractical for problems with many variables and equations, because it calculates all the possible basic solutions and only afterwards determines the optimal solution. As an alternative, the Simplex method can be applied to solve any LP problem. The Simplex method for solving linear programming problems was developed in 1947, with the dissemination of operational research in the United States after World War II, by a team led by George B. Dantzig. For Goldbarg and Luna (2005), the Simplex algorithm is the most commonly used method for solving linear programming problems. The Simplex method is an iterative algebraic procedure that starts from an initial feasible basic solution and tries to find, at each iteration, a new feasible basic solution with a better value in the objective function, until the optimal value is achieved. More details of the algorithm will be discussed in the following section. This section is divided into three parts. The logic of the Simplex method is presented in Section 17.4.1. In Section 17.4.2, the Simplex method is described in an analytical way. The tabular form of the Simplex method is discussed in Section 17.4.3.

17.4.1

Logic of the Simplex Method

The Simplex algorithm is an iterative method that starts from an initial feasible basic solution and tries to find, at each iteration, a new feasible basic solution, called adjacent feasible basic solution, with a better value in the objective function, until the optimal value is achieved. The concept of an adjacent FBS is described.


FIG. 17.11 General description of the Simplex algorithm.

Beginning: The problem must be in the standard form.
Step 1: Find an initial FBS for the LP problem.
    Initial FBS = current FBS
Step 2: Verify if the current FBS is the optimal solution for the LP problem.
    While the current FBS is not the optimal solution for the LP problem do
        Find an adjacent FBS with a better value in the objective function
        Adjacent FBS = current FBS
    End while
End

FIG. 17.12 Flowchart with the general description of the Simplex algorithm. (Source: Lachtermacher, G., 2009. Pesquisa operacional na tomada de deciso˜es. 4th ed. Prentice Hall do Brasil, Sa˜o Paulo.)


From a current basic solution, a nonbasic variable goes into the base instead of another basic variable, which becomes nonbasic, generating a new solution called an adjacent basic solution. For a problem with m basic variables and n − m nonbasic variables, two basic solutions are adjacent if they have m − 1 basic variables in common, even though these may have different numerical values. This also implies that n − m − 1 nonbasic variables are common. If the adjacent basic solution satisfies the non-negativity constraints, it is called an adjacent feasible basic solution (adjacent FBS). According to Theorem 4, every feasible basic solution is an extreme point (vertex) of the feasible region. Thus, two vertices are adjacent if they are connected by a line segment called an edge, which means that they share n − 1 constraints. The general description of the Simplex algorithm is presented in Fig. 17.11. Analogous to the analytical procedure, in order for the Simplex method to be applied, the problem must be in the standard form (see Section 16.4.1 of the previous chapter). The algorithm can also be described through a flowchart, as seen in Fig. 17.12.

17.4.2

Analytical Solution of the Simplex method for Maximization Problems

Each of the steps of the general algorithm described in Figs. 17.11 and 17.12 is rewritten in Fig. 17.13 in a very detailed way, based on Hillier and Lieberman (2005), for the analytical solution of the Simplex method for linear programming problems in which the objective function z is a maximization one (max z = c1x1 + c2x2 + ⋯ + cnxn). In Example 17.8 of Section 17.3, to calculate the model's optimal solution, all the possible basic solutions were calculated, and the best of them was chosen. The same exercise is solved in Example 17.9, however, through the analytical solution for the Simplex method.

FIG. 17.13 Detailed steps of the general algorithm of Figs. 17.11 and 17.12 for solving LP maximization problems through the analytical form of the Simplex method.

Beginning: The problem must be in the standard form.
Step 1: Find an initial feasible basic solution (FBS) for the LP problem. An initial FBS can be obtained by assigning values equal to zero to the decision variables. In order for this solution to be feasible, none of the problem constraints can be violated.
Step 2: Optimality test. A feasible basic solution is optimal if there are no better adjacent feasible basic solutions. An adjacent FBS is better than the current FBS if there is a positive increase in the value of objective function z. Similarly, an adjacent FBS is worse than the current FBS if the increase in z is negative. While at least one of the nonbasic variables of objective function z has a positive coefficient, there is a better adjacent FBS.
Iteration: Determine a better adjacent FBS. The direction with the greatest increase in z must be identified, so that a better feasible basic solution can be determined. In order to do that, three steps must be taken:
1. Determine the nonbasic variable that will go into the set of basic variables (base). It must be the one that provides the greatest increase in z, that is, the one with the highest positive coefficient in z.
2. Choose the basic variable that will leave the base and go into the set of nonbasic variables. The variable chosen to leave the base must be the one that limits the increase of the nonbasic variable selected in the previous step to go into the base.
3. Solve the equation system, recalculating the values of the new adjacent basic solution. Before that, the equation system must be converted into a more convenient form, through basic algebraic operations, by using the Gauss-Jordan elimination method. In the new equation system, each equation must have only one basic variable with a coefficient equal to 1, each basic variable must appear in only one equation, and the objective function must be written based on the nonbasic variables, such that the values of the new basic variables and of objective function z can be obtained directly, and the optimality test can easily be verified.

Example 17.9 Solve the problem by using the analytical solution for the Simplex method.

max z = 3x1 + 2x2
subject to:
x1 + x2 ≤ 6
5x1 + 2x2 ≤ 20
x1, x2 ≥ 0          (17.10)

Solution
Each of the steps of the algorithm will be discussed in depth, based on Hillier and Lieberman (2005).
Beginning: The problem must be in the standard form.

max z = 3x1 + 2x2          (0)
subject to:
x1 + x2 + x3 = 6          (1)
5x1 + 2x2 + x4 = 20          (2)
x1, x2, x3, x4 ≥ 0          (3)          (17.11)

Step 1: Find an initial FBS for the LP problem. An initial basic solution can be obtained by assigning values equal to zero to decision variables x1 and x2 (nonbasic variables). Note that the values of the basic variables (x3, x4) can be obtained immediately from the equation system represented by Expression (17.11), since each equation has only one basic variable with coefficient 1, and each basic variable appears in only one equation.


Moreover, since the objective function is written based on each one of the nonbasic variables, the optimality test can easily be applied in step 2. The complete result of the initial solution is:
NBV = {x1, x2} and BV = {x3, x4}
Nonbasic solution: x1 = 0 and x2 = 0
Feasible basic solution: x3 = 6 and x4 = 20
Solution: {x1, x2, x3, x4} = {0, 0, 6, 20}
Objective function: z = 0
This solution corresponds to vertex A of the feasible region shown in Example 17.8 of Section 17.3, as presented in Fig. 17.10.
Step 2: Optimality test
We can say that the initial FBS obtained in step 1 is not optimal, since the coefficients of the nonbasic variables x1 and x2 in the objective function of the equation system represented by Expression (17.11) are positive. If either of these variables stops taking on the value zero and starts taking on a positive value, there will be a positive increase in the value of objective function z. Thus, it is possible to obtain a better adjacent FBS.
Iteration: Determine a better adjacent FBS.
Each of the three steps to be implemented in this iteration is shown next.
1. Nonbasic variable that will go into the base. According to the equation system represented by Expression (17.11), we can see that variable x1 has a greater positive coefficient in the objective function when compared to variable x2, thus generating a greater positive increase in z if the same measurement units are considered for x1 and x2. So, the nonbasic variable chosen to go into the base is x1 (NBV = {x1, x2}, with x1 marked to enter the base).
2. Basic variable that will leave the base. To select the basic variable that will leave the base, we must choose the one that limits the increase of the nonbasic variable chosen in the previous step to go into the base (x1). In order to do that, first, we must assign the value zero to the variables that remained nonbasic (in this case, only x2) in all the equations. From then on, we can obtain the equation of each one of the basic variables based on the nonbasic variable chosen to go into the base (x1). Since all the basic variables must take on non-negative values, by imposing the inequality ≥ 0 on each one of these equations, we can identify the basic variable that limits the increase of x1. Therefore, by assigning value zero to variable x2 in Equations (1) and (2) of the equation system represented by Expression (17.11), we obtain the equations of basic variables x3 and x4 based on x1:

x3 = 6 − x1
x4 = 20 − 5x1

Since variables x3 and x4 must take on non-negative values, then:

x3 = 6 − x1 ≥ 0 ⇒ x1 ≤ 6
x4 = 20 − 5x1 ≥ 0 ⇒ x1 ≤ 4

We can conclude that the variable that limits the increase of x1 is variable x4, since the maximum value that x1 can reach from x4 is smaller when compared to variable x3 (4 < 6). Hence, the basic variable chosen to leave the base is x4 (BV = {x3, x4}, with x4 marked to leave the base).
3. Transform the equation system by using the Gauss-Jordan elimination method and recalculate the basic solution. As shown in the two previous steps, variable x1 enters the base instead of variable x4, and this will generate a better adjacent basic solution. Therefore, the sets of nonbasic and basic variables become:
NBV = {x2, x4} and BV = {x1, x3}
In this phase, we recalculate the values of the new feasible basic solution. Since x4 represents the new nonbasic variable in the adjacent solution, jointly with x2 that remained nonbasic, we have x2 = 0 and x4 = 0. From then on, the values of basic variables x1 and x3 of the adjacent solution must be recalculated, besides the value of objective function z.
First, the equation system must be converted, through basic operations, into a more convenient form, by using the Gauss-Jordan elimination method, such that each equation has only one basic variable (x1 or x3) with a coefficient equal to 1, each basic variable appears in only one equation, and the objective function is written based on nonbasic variables x2 and x4. In order to do that, the coefficients of variable x1 in the current equation system represented by Expression (17.11) must be transformed from 3, 1, and 5 (Equations (0), (1), and (2), respectively) to 0, 0, and 1 (the coefficients of variable x4 in the current equation system). According to Hillier and Lieberman (2005), the two basic algebraic operations to be used are: (a) multiply (or divide) an equation by a constant different from zero; (b) add (or subtract) a multiple of one equation to (or from) another equation. First, let's convert the coefficient of variable x1 in Equation (2) of Expression (17.11) from 5 to 1. In order to do that, we just need to divide Equation (2) by 5, such that the new Expression (17.12) is written based on a single basic variable (x1) with coefficient 1:

x1 + (2/5)x2 + (1/5)x4 = 4          (17.12)


Another transformation must be carried out so that we can convert the coefficient of variable x1 in Equation (1) of Expression (17.11) from 1 to 0. To do that, we just need to subtract Expression (17.12) from Equation (1) of Expression (17.11), such that the new Expression (17.13) is written based on a single basic variable (x3) with coefficient 1:

(3/5)x2 + x3 − (1/5)x4 = 2          (17.13)

Finally, we must convert the coefficient of variable x1 in the objective function [Equation (0) of Expression (17.11)] from 3 to 0. To do that, we just need to multiply Expression (17.12) by 3 and subtract it from Equation (0) of Expression (17.11), such that the new Expression (17.14) is written based on x2 and x4:

z = (4/5)x2 − (3/5)x4 + 12          (17.14)

The complete equation system, obtained after we applied the Gauss-Jordan elimination method, is:

(0) z = (4/5)x2 − (3/5)x4 + 12
(1) (3/5)x2 + x3 − (1/5)x4 = 2
(2) x1 + (2/5)x2 + (1/5)x4 = 4          (17.15)

From the new equation system represented by Expression (17.15), it is possible to obtain the new values of x1, x3, and z immediately. The complete result of the new solution is:
NBV = {x2, x4} and BV = {x1, x3}
Nonbasic solution: x2 = 0 and x4 = 0
Feasible basic solution: x1 = 4 and x3 = 2
Solution: {x1, x2, x3, x4} = {4, 0, 2, 0}
Objective function: z = 12
This solution corresponds to vertex B of the feasible region shown in Example 17.8 of Section 17.3 (Fig. 17.10). Thus, there was a movement from extreme point A to point B (A → B). Therefore, it was possible to obtain a better adjacent FBS, since there was a positive increase in z compared to the current FBS. The adjacent FBS obtained in this iteration becomes the current FBS.
Step 2: Optimality test
The current FBS is not optimal yet, since the coefficient of nonbasic variable x2 in Equation (0) of Expression (17.15) is positive. If this variable starts to take on any positive value, there will be a positive increase in the value of objective function z. Thus, it is possible to obtain a better adjacent FBS.
Iteration 2: Determine a better adjacent FBS.
The three steps to be implemented to determine a new adjacent FBS are:
1. Nonbasic variable that will go into the base. According to the new equation system represented by Expression (17.15), we can see that variable x2 is the only one with a positive coefficient in Equation (0), and it will generate a positive increase in objective function z for any positive value that variable x2 can assume. So, the variable chosen to go from the set of nonbasic variables to the set of basic variables is x2 (NBV = {x2, x4}, with x2 marked to enter the base).
2. Basic variable that will leave the base. The basic variable that will leave the base is the one that limits the increase of the nonbasic variable chosen in the previous step to go into the base (x2). By assigning value zero to the variable that remained nonbasic (x4 = 0) in each one of Equations (1) and (2) of Expression (17.15), it is possible to obtain the equations of each one of the basic variables x1 and x3 of the current basic solution based on the nonbasic variable chosen to go into the base (x2):

x1 = 4 − (2/5)x2
x3 = 2 − (3/5)x2

Since variables x1 and x3 must take on non-negative values, then:

x1 = 4 − (2/5)x2 ≥ 0 ⇒ x2 ≤ 10
x3 = 2 − (3/5)x2 ≥ 0 ⇒ x2 ≤ 10/3

We can conclude that the variable that limits the increase of x2 is variable x3, since the maximum value that x2 can assume from x3 is smaller when compared to variable x1. Therefore, the basic variable chosen to leave the base is x3 (BV = {x1, x3}, with x3 marked to leave the base).


3. Transform the equation system by using the Gauss-Jordan elimination method and recalculate the basic solution. As shown in the two previous steps, variable x2 enters the base instead of variable x3, and it will generate a better adjacent basic solution. Thus, the set of nonbasic variables and the set of basic variables become:

NBV = {x3, x4} and BV = {x1, x2}

Before calculating the values of the new basic solution, the equation system must be converted by using the Gauss-Jordan elimination method. In this case, the coefficients of variable x2 in the current equation system represented by Expression (17.15) must be transformed from 4/5, 3/5, and 2/5 (Equations (0), (1), and (2), respectively) to 0, 1, and 0 (the coefficients of variable x3 in the current equation system), through basic algebraic operations.

First, let's convert the coefficient of variable x2 in Equation (1) of Expression (17.15) from 3/5 to 1. To do that, we just need to multiply Equation (1) by 5/3, such that the new Expression (17.16) is written based on a single basic variable (x2) with coefficient 1:

x2 + (5/3)x3 - (1/3)x4 = 10/3                                     (17.16)

Analogously, we must convert the coefficient of variable x2 in Equation (2) of Expression (17.15) from 2/5 to 0. In order to do that, we just need to multiply Expression (17.16) by 2/5 and subtract it from Equation (2) of Expression (17.15), such that the new Expression (17.17) is written based on a single basic variable (x1) with coefficient 1:

x1 - (2/3)x3 + (1/3)x4 = 8/3                                      (17.17)

Finally, we must convert the coefficient of variable x2 in the objective function [Equation (0) of Expression (17.15)] from 4/5 to 0. To do that, we just need to multiply Expression (17.16) by 4/5 and subtract it from Equation (0) of Expression (17.15), such that the new Expression (17.18) is written based on x3 and x4:

z = -(4/3)x3 - (1/3)x4 + 44/3                                     (17.18)

The complete equation system is represented in (17.19):

(0)  z = -(4/3)x3 - (1/3)x4 + 44/3
(1)  x2 + (5/3)x3 - (1/3)x4 = 10/3                                (17.19)
(2)  x1 - (2/3)x3 + (1/3)x4 = 8/3

From the new equation system represented by Expression (17.19), it is possible to obtain the new values of x1, x2, and z immediately. The complete result of the new solution is:

NBV = {x3, x4} and BV = {x1, x2}
Nonbasic solution: x3 = 0 and x4 = 0
Feasible basic solution: x1 = 8/3 = 2.67 and x2 = 10/3 = 3.33
Solution: {x1, x2, x3, x4} = {8/3, 10/3, 0, 0}
Objective function: z = 44/3 = 14.67

This solution corresponds to vertex D of the feasible region shown in Fig. 17.10. The direction in which z increases, from the initial solution, goes through vertices A → B → D of the graphical solution. Hence, it was possible to obtain a better adjacent FBS, since there was a positive increase in z compared to the current FBS. The adjacent FBS obtained in this iteration becomes the current FBS.

Step 2: Optimality test
The current FBS is the optimal one, since the coefficients of nonbasic variables x3 and x4 in Equation (0) of Expression (17.19) are negative. Therefore, it is no longer possible to have a positive increase in the value of objective function z, concluding the algorithm of Example 17.9 here.
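As a quick check of this result (a verification added here, not part of the original walk-through), substituting the optimal solution back into the original constraints of Example 17.9 gives x1 + x2 = 8/3 + 10/3 = 6 and 5x1 + 2x2 = 40/3 + 20/3 = 20, so both constraints are binding at the optimum, and the objective value is z = 3(8/3) + 2(10/3) = 44/3 ≈ 14.67, in agreement with Equation (0) of Expression (17.19).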

17.4.3 Tabular Form of the Simplex Method for Maximization Problems

The previous section presented the analytical procedure of the Simplex method to solve a linear programming maximization problem. This section shows the Simplex method in tabular form. To understand the logic of the Simplex algorithm, it is important to work through the method analytically; however, when the calculation is done manually, it is more convenient to use the tabular form. The tabular form uses the same concepts presented in Section 17.4.2, but organizes them in a more practical way.


As presented in Section 16.4.1 of the previous chapter, the standard form of a general linear programming maximization model is:

max z = c1x1 + c2x2 + ... + cnxn
subject to:
a11x1 + a12x2 + ... + a1nxn = b1
a21x1 + a22x2 + ... + a2nxn = b2                                  (17.20)
   ...
am1x1 + am2x2 + ... + amnxn = bm
xi ≥ 0, i = 1, 2, ..., n

This same model can be represented in tabular form:

BOX 17.1 General Linear Programming Model in Tabular Form

                          Coefficients
Equation    z     x1      x2      ...    xn      Constant
0           1     -c1     -c2     ...    -cn     0
1           0     a11     a12     ...    a1n     b1
2           0     a21     a22     ...    a2n     b2
...         ...   ...     ...            ...     ...
m           0     am1     am2     ...    amn     bm

According to Box 17.1, we can see that maximization function z, in tabular form, is rewritten as z - c1x1 - c2x2 - ... - cnxn = 0. The columns in the middle show the coefficients of the variables on the left-hand side of each equation, in addition to the coefficient of z. The constants on the right-hand side of each equation are represented in the last column.
Each of the steps of the general algorithm described in Figs. 17.11 and 17.12 is rewritten in Fig. 17.14, in a very detailed way, for solving linear programming maximization problems through the tabular form of the Simplex method. The logic presented in this section is the same as the analytical solution for the Simplex method; however, instead of using an algebraic equation system, the solution is calculated directly in the tabular form, by using the concepts of pivot column, pivot row, and pivot number that will be defined throughout the algorithm.
Example 17.9 of the previous section presented the solution for a linear programming problem through the analytical form of the Simplex method. The same exercise will be solved in Example 17.10 through the tabular form of the Simplex method.

Example 17.10
Solve the problem by using the tabular form of the Simplex method.

max z = 3x1 + 2x2
subject to:
x1 + x2 ≤ 6                                                       (17.21)
5x1 + 2x2 ≤ 20
x1, x2 ≥ 0

Solution
The maximization problem must also be in the standard form:

max z = 3x1 + 2x2                     (0)
subject to:
x1 + x2 + x3 = 6                      (1)                         (17.22)
5x1 + 2x2 + x4 = 20                   (2)
x1, x2, x3, x4 ≥ 0                    (3)

In the tabular form, maximization function z is written as:

z = 3x1 + 2x2  ⇒  z - 3x1 - 2x2 = 0


FIG. 17.14 Detailed steps of the general algorithm in Figs. 17.11 and 17.12 for solving LP maximization problems in the tabular form of the Simplex method.

Beginning: The problem must be in the standard form.
Step 1: Find an initial FBS for the LP problem. Analogously to the analytical form of the Simplex method presented in Section 17.4.2, an initial basic solution can be obtained by assigning values equal to zero to the decision variables. The initial FBS corresponds to the current FBS.
Step 2: Optimality test. The current FBS is optimal if, and only if, the coefficients of all the non-basic variables in Equation (0) of the tabular form are non-negative (≥ 0). While there is at least one non-basic variable with a negative coefficient in Equation (0), there is a better adjacent FBS.
Iteration: Determine a better adjacent FBS. The direction in which z increases the most must be identified, in order for a better feasible basic solution to be determined. To do that, three steps must be taken:
1. Determine the non-basic variable that will go into the base. It must be the one that provides the greatest increase in z, that is, the one with the most negative coefficient in Equation (0). The column of the non-basic variable chosen to go into the base is called the pivot column.
2. Determine the basic variable that will leave the base. Similarly to the analytical form, the variable chosen must be the one that limits the increase of the non-basic variable selected in the previous step to go into the base. As presented by Hillier and Lieberman (2005), three phases are necessary in order to choose this variable:
a) Select the positive coefficients of the pivot column that represent the coefficients of the new basic variable in each constraint of the current model.
b) For each positive coefficient selected in the previous phase, divide the constant of the same row by it.
c) Identify the row with the smallest quotient. This row contains the variable that will leave the base.
The row that contains the basic variable chosen to leave the base is called the pivot row. The pivot number is the value at the intersection of the pivot row and the pivot column.
3. Transform the current tabular form by using the Gauss-Jordan elimination method and recalculate the basic solution. Similarly to the analytical solution, the current tabular form must be converted into a more convenient form, through basic operations, such that the values of the new basic variables and of objective function z can be obtained directly from the new tabular form. The objective function is rewritten based on the new non-basic variables of the adjacent solution, such that the optimality test can easily be verified. According to Taha (2016), the new tabular form is obtained after the following basic operations:
a) New pivot row = current pivot row ÷ pivot number
b) For the other rows, including z: New row = (current row) - (coefficient of the pivot column of the current row) × (new pivot row)
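The following is a minimal computational sketch of the tabular Simplex algorithm just described. It is an illustration added here, not part of the book: the function name simplex_max, the use of Python with NumPy, and the example call are our own assumptions. The sketch builds the tableau of Box 17.1 for a maximization problem with ≤ constraints, applies the optimality test, the ratio test, and the Gauss-Jordan pivot exactly as listed above, and is run on the data of Example 17.10.

# A minimal sketch (not from the book) of the tabular Simplex method of Fig. 17.14.
import numpy as np

def simplex_max(c, A, b):
    """Maximize c.x subject to A x <= b, x >= 0, using the tabular Simplex method."""
    m, n = A.shape
    # Build the tableau: row 0 is Equation (0), z - c1*x1 - ... - cn*xn = 0;
    # rows 1..m are the constraints with slack variables forming the initial base.
    tableau = np.zeros((m + 1, n + m + 1))
    tableau[0, :n] = -c
    tableau[1:, :n] = A
    tableau[1:, n:n + m] = np.eye(m)
    tableau[1:, -1] = b
    basis = list(range(n, n + m))            # indices of the current basic variables

    while True:
        # Optimality test: stop when no coefficient in Equation (0) is negative.
        pivot_col = int(np.argmin(tableau[0, :-1]))
        if tableau[0, pivot_col] >= 0:
            break
        # Ratio test: smallest constant / coefficient among positive pivot-column entries.
        ratios = [tableau[i, -1] / tableau[i, pivot_col] if tableau[i, pivot_col] > 0
                  else np.inf for i in range(1, m + 1)]
        pivot_row = 1 + int(np.argmin(ratios))
        if ratios[pivot_row - 1] == np.inf:
            raise ValueError("Objective function is unbounded")
        # Gauss-Jordan elimination on the pivot number.
        tableau[pivot_row] /= tableau[pivot_row, pivot_col]
        for i in range(m + 1):
            if i != pivot_row:
                tableau[i] -= tableau[i, pivot_col] * tableau[pivot_row]
        basis[pivot_row - 1] = pivot_col

    x = np.zeros(n + m)
    x[basis] = tableau[1:, -1]
    return x[:n], tableau[0, -1]

# Example 17.10: max z = 3x1 + 2x2, s.t. x1 + x2 <= 6 and 5x1 + 2x2 <= 20.
x, z = simplex_max(np.array([3.0, 2.0]),
                   np.array([[1.0, 1.0], [5.0, 2.0]]),
                   np.array([6.0, 20.0]))
print(x, z)   # expected: x1 = 8/3 ≈ 2.67, x2 = 10/3 ≈ 3.33, z = 44/3 ≈ 14.67

The two iterations performed by this sketch reproduce the tableaus derived by hand in the remainder of this example.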


Table 17.E.1 shows the tabular form of the equation system represented by Expression (17.22):

TABLE 17.E.1 Initial Tabular Form of Example 17.10

                                   Coefficients
Basic Variable   Equation   z     x1    x2    x3    x4    Constant
z                0          1     -3    -2    0     0     0
x3               1          0     1     1     1     0     6
x4               2          0     5     2     0     1     20

From Table 17.E.1, we can see that a new column was added when we compare it to Box 17.1. The first column shows the basic variables considered in each phase (the initial basic variables will be x3 and x4).

Step 1: Find an initial FBS
Decision variables x1 and x2 were selected for the initial set of nonbasic variables, thus representing the origin of the feasible region (x1, x2) = (0, 0). On the other hand, the set of basic variables is represented by x3 and x4:
NBV = {x1, x2} and BV = {x3, x4}
The feasible basic solution can immediately be obtained from Table 17.E.1:
Feasible basic solution: x3 = 6 and x4 = 20 with z = 0
Solution: {x1, x2, x3, x4} = {0, 0, 6, 20}

Step 2: Optimality test
Since the coefficients of x1 and x2 are negative in row 0, the current FBS is not optimal, because a positive increase in x1 or x2 will result in an adjacent FBS better than the current one.

Iteration 1: Determine a better adjacent FBS. Each iteration has three steps:
1. Determine the nonbasic variable that will go into the base. We have to choose the variable with the most negative coefficient in Equation (0) of Table 17.E.1. For the problem in question, the variable with the highest unit contribution to objective function z is x1 (3 > 2). Therefore, variable x1 is selected to enter the base, and its column is the pivot column.
2. Determine the basic variable that will leave the base. Here, we select the basic variable that will leave the base (and become null), which is the one that limits the increase of x1. The results of the three phases needed to choose the variable, from Table 17.E.1, are listed below, and can also be seen in Table 17.E.2:
(a) The positive coefficients selected from the pivot column (column of variable x1) are 1 and 5 (Equations (1) and (2), respectively).
(b) For Equation (1), divide constant 6 by coefficient 1 from the pivot column. For Equation (2), divide constant 20 by coefficient 5.
(c) The row with the smallest quotient is Equation (2) (4 < 6). Therefore, the variable chosen to leave the base is x4.
Step 1 (determining the variable that enters the base) and the three phases of Step 2 listed to determine the variable that leaves the base are shown in Table 17.E.2.

TABLE 17.E.2 Determining the Variable that Enters and Leaves the Base in the First Iteration


Quotient 4 of Equation (2) represents the maximum value that variable x1 can take on in this equation (5x1 + x4 = 20), if variable x4 starts taking on value zero (x4 = 0). On the other hand, quotient 6 represents the maximum value that variable x1 can take on in Equation (1) (x1 + x3 = 6), considering x3 = 0. Since we want to maximize the value of x1, we choose, to leave the base and assume a null value, variable x4, which limits its increase. The pivot row and the pivot column are highlighted in Table 17.E.2. The pivot number (intersection of the pivot row and the pivot column) is 5.

3. Transform the current tabular form by using the Gauss-Jordan elimination method and recalculate the basic solution. In the same way as in the analytical procedure, the coefficients of variable x1 in the current tabular form (Table 17.E.2) must be transformed from -3, 1, and 5 (Equations (0), (1), and (2), respectively) to 0, 0, and 1 (the coefficients of variable x4 in the current tabular form), through basic operations (Gauss-Jordan elimination method), such that the values of the new basic variables and of objective function z can be obtained directly from the new tabular form. The new tabular form is obtained after the following basic operations:
(a) New pivot row = current pivot row ÷ pivot number
(b) For the other rows, including z: New row = (current row) - (coefficient of the pivot column of the current row) × (new pivot row)
By applying the first operation in the current tabular form (divide Equation (2) by 5), we obtain the new pivot row, as shown in Table 17.E.3:

TABLE 17.E.3 New Pivot Row (iteration 1)

Since variable x1 entered the base instead of variable x4, the column of the basic variable in the new pivot row must be altered, as shown in Table 17.E.3. Phase (b) will be applied to the other rows [Equations (0) and (1)]. We will begin with Equation (1), which has a positive coefficient in the pivot column (+1). First, we multiply this coefficient (+1) by the new pivot row (Equation (2) from Table 17.E.3). This product is then subtracted from current Equation (1), resulting in new Equation (1):

                       x1    x2     x3    x4      Constant
Equation (1)           1     1      1     0       6
Equation (2) × (+1)    1     2/5    0     1/5     4
Subtraction:           0     3/5    1     -1/5    2

New Equation (1) is rewritten in Table 17.E.4:

TABLE 17.E.4 Phase (b) for Obtaining New Equation (1) (iteration 1)

Phase (b) is also applied to Equation (0), which has a negative coefficient in the pivot column (-3). First, we multiply this coefficient's value (-3) by the new pivot row (Equation (2) from Table 17.E.3). This product is then subtracted from current Equation (0), resulting in new Equation (0):

                       x1    x2      x3    x4      Constant
Equation (0)           -3    -2      0     0       0
Equation (2) × (-3)    -3    -6/5    0     -3/5    -12
Subtraction:           0     -4/5    0     3/5     12

The new tabular form, after applying the basic operations described, is presented in Table 17.E.5:

TABLE 17.E.5 New Tabular Form After the Gauss-Jordan Elimination Method Is Used (Iteration 1)

                                   Coefficients
Basic Variable   Equation   z     x1    x2      x3    x4      Constant
z                0          1     0     -4/5    0     3/5     12
x3               1          0     0     3/5     1     -1/5    2
x1               2          0     1     2/5     0     1/5     4

From the new tabular form (Table 17.E.5), it is possible to obtain the new values of x1, x3, and z immediately. The new feasible basic solution is x1 = 4 and x3 = 2 with z = 12. The new solution is {x1, x2, x3, x4} = {4, 0, 2, 0}.

Step 2: Optimality test
As shown in Table 17.E.5, Equation (0) is now written based on the new nonbasic variables (x2 and x4), such that the optimality test can easily be verified. The current FBS is not optimal yet, because the coefficient of x2 in Equation (0) in Table 17.E.5 is negative. Any positive increase in x2 will result in a positive increase in the value of objective function z, such that a better adjacent FBS can be obtained.

Iteration 2: Determine a better adjacent FBS. The three steps to be implemented in this iteration are:
1. Determine the nonbasic variable that will go into the base. According to the new tabular form (Table 17.E.5), variable x2 is the only one with a negative coefficient in Equation (0). Thus, the variable chosen to go into the base is x2. The column of variable x2 becomes the pivot column.
2. Determine the basic variable that will leave the base. The basic variable chosen to leave the base (and become null) is the one that limits the increase of x2. The results of the three phases needed to choose the variable, from Table 17.E.5, are listed below, and can also be seen in Table 17.E.6:
(a) The positive coefficients selected from the pivot column (column of variable x2) are 3/5 and 2/5 (Equations (1) and (2), respectively).
(b) For Equation (1), divide constant 2 by coefficient 3/5. For Equation (2), divide constant 4 by coefficient 2/5.
(c) The row with the smallest quotient is Equation (1) (10/3 < 10). Therefore, the variable chosen to leave the base is x3.
Step 1 (determining the variable that enters the base) and the three phases of Step 2 listed to determine the variable that leaves the base are shown in Table 17.E.6.

TABLE 17.E.6 Determining the Variable that Enters and Leaves the Base in the Second Iteration


The row of variable x3 [Equation (1)] becomes the pivot row. The pivot number is 3/5.

3. Transform the current tabular form by using the Gauss-Jordan elimination method and recalculate the basic solution. The coefficients of variable x2 in the current tabular form (Table 17.E.6) must be transformed from -4/5, 3/5, and 2/5 (Equations (0), (1), and (2), respectively) to 0, 1, and 0 (the coefficients of variable x3 in the current tabular form), such that the values of the new basic variables and of objective function z can be obtained directly in the new tabular form. In the same way as in the first iteration, the new tabular form is obtained after the following basic operations:
(a) New pivot row = current pivot row ÷ pivot number
(b) For the other rows, including z: New row = (current row) - (coefficient of the pivot column of the current row) × (new pivot row)
By applying the first operation in the current tabular form (divide Equation (1) by 3/5), we obtain the new pivot row, as shown in Table 17.E.7:

TABLE 17.E.7 New Pivot Row (iteration 2)

Since variable x2 entered the base instead of variable x3, the column of the basic variable in the new pivot row must be altered, as shown in Table 17.E.7. Phase (b) will be applied to the other rows [Equations (0) and (2)]. Let's begin with Equation (2), which has a positive coefficient in the pivot column (2/5). First, we multiply this coefficient (2/5) by the new pivot row (Equation (1) from Table 17.E.7). This product is then subtracted from current Equation (2), resulting in new Equation (2):

                        x1    x2     x3      x4       Constant
Equation (2)            1     2/5    0       1/5      4
Equation (1) × (2/5)    0     2/5    2/3     -2/15    4/3
Subtraction:            1     0      -2/3    1/3      8/3

New Equation (2) appears in Table 17.E.8:

TABLE 17.E.8 Tabular Form With New Equation (2) (iteration 2)

Phase (b) is also applied to Equation (0), which has a negative coefficient in the pivot column (-4/5). First, we multiply the value of this coefficient (-4/5) by the new pivot row (Equation (1) from Table 17.E.7). This product is then subtracted from current Equation (0), resulting in new Equation (0):

                         x1    x2      x3      x4      Constant
Equation (0)             0     -4/5    0       3/5     12
Equation (1) × (-4/5)    0     -4/5    -4/3    4/15    -8/3
Subtraction:             0     0       4/3     1/3     44/3

The new tabular form, after the application of the Gauss-Jordan elimination method, is presented in Table 17.E.9:

TABLE 17.E.9 New Tabular Form After the Gauss-Jordan Elimination Method Is Used (Iteration 2)

                                   Coefficients
Basic Variable   Equation   z     x1    x2    x3      x4      Constant
z                0          1     0     0     4/3     1/3     44/3
x2               1          0     0     1     5/3     -1/3    10/3
x1               2          0     1     0     -2/3    1/3     8/3

From Table 17.E.9, we can obtain the new values of x1, x2, and z immediately. The new feasible basic solution is x1 = 8/3 and x2 = 10/3 with z = 44/3. The new solution is {x1, x2, x3, x4} = {8/3, 10/3, 0, 0}.

Step 2: Optimality test
The current FBS is the optimal one, because the coefficients of nonbasic variables x3 and x4 in Equation (0) in Table 17.E.9 are positive.

17.4.4 The Simplex Method for Minimization Problems

The Simplex method can also be used for solving linear programming minimization problems. The minimization problems discussed in this section will be solved through the tabular form. There are two ways of solving a minimization problem through the Simplex method:

Solution 1
Transform the minimization problem into a maximization problem and use the same procedure described in Section 17.4.3. As presented in Expression (16.6) in Section 16.4.3 of the previous chapter, a minimization problem can be converted into an equivalent maximization problem through the following transformation:

min z = f(x1, x2, ..., xn)  ⇔  max (-z) = -f(x1, x2, ..., xn)     (17.23)

Solution 2
Adapt the procedure described in Section 17.4.3 to linear programming minimization problems. Fig. 17.14 presented the detailed steps of the general algorithm described in Figs. 17.11 and 17.12 for solving linear programming maximization problems in the tabular form. To solve LP minimization problems through the tabular form, Step 2 (optimality test) and Step 1 of each iteration (determining the nonbasic variable that will go into the base) must be adapted, since these decisions are based on Equation (0) of the objective function. Fig. 17.15 shows the corresponding adjustments in relation to Fig. 17.14. Except for these two steps, all the other steps are the same as the ones presented in Fig. 17.14 for maximization problems.


FIG. 17.15 Adjustment of the steps shown in Fig. 17.14 for solving LP minimization problems through the tabular form of the Simplex method.

Beginning: The problem must be in the standard form.
Step 1: Find an initial FBS for the LP problem. We can use the same procedure of Fig. 17.14 for maximization problems.
Step 2: Optimality test. The current FBS is optimal if, and only if, the coefficients of all the non-basic variables in Equation (0) of the tabular form are non-positive (≤ 0). While there is at least one non-basic variable with a positive coefficient in Equation (0), there is a better adjacent FBS.
Iteration: Determine a better adjacent FBS.
1. Determine the non-basic variable that will go into the base. It must be the one that provides the greatest decrease in z, that is, the one with the highest positive coefficient in Equation (0).
2. Determine the basic variable that will leave the base. We can use the same procedure of Fig. 17.14 for maximization problems.
3. Transform the current tabular form by using the Gauss-Jordan elimination method and recalculate the basic solution.

Example 17.11
Consider the following linear programming minimization problem:

min z = 4x1 - 2x2
subject to:
2x1 + x2 ≤ 10                                                     (17.24)
x1 - x2 ≤ 8
x1, x2 ≥ 0

Determine the optimal solution for the problem.

Solution 1
First, in order for the problem represented by Expression (17.24) to be in the standard form, slack variables must be introduced into each of the model constraints. The problem can also be rewritten as a maximization problem by using Expression (17.23):

max (-z) = -4x1 + 2x2
subject to:
2x1 + x2 + x3 = 10                                                (17.25)
x1 - x2 + x4 = 8
x1, x2, x3, x4 ≥ 0

The initial tabular form that represents the equation system in Expression (17.25) is:

TABLE 17.E.10 Initial Tabular Form of Equation System Represented by Expression (17.25)

                                   Coefficients
Basic Variable   Equation   z     x1    x2    x3    x4    Constant
z                0          1     4     -2    0     0     0
x3               1          0     2     1     1     0     10
x4               2          0     1     -1    0     1     8


The initial set of nonbasic variables is represented by x1 and x2, while the initial set of basic variables is represented by x3 and x4. Initial solution {x1, x2, x3, x4} = {0, 0, 10, 8} is not optimal, since the coefficient of nonbasic variable x2 in Equation (0) is negative. To determine a better adjacent FBS, variable x2 enters the base (most negative coefficient) instead of variable x3, which is the only one that limits the increase of x2, as seen in Table 17.E.11.

TABLE 17.E.11 Variable That Enters and Leaves the Base in the First Iteration

The new tabular form, after the Gauss-Jordan elimination method is used, is:

TABLE 17.E.12 New Tabular Form Using the Gauss-Jordan Elimination Method (Iteration 1)

                                   Coefficients
Basic Variable   Equation   z     x1    x2    x3    x4    Constant
z                0          1     8     0     2     0     20
x2               1          0     2     1     1     0     10
x4               2          0     3     0     1     1     18

From Table 17.E.12, we can obtain the new values of x2, x4, and z immediately. The results of the new feasible basic solution are x2 = 10 and x4 = 18 with z = -20. The new solution can be represented by {x1, x2, x3, x4} = {0, 10, 0, 18}. The new basic solution obtained is the optimal one, since all the coefficients of the nonbasic variables in Equation (0) are nonnegative.

Solution 2
In order for the Simplex method to be applied, the initial minimization problem described in Expression (17.24) must be in the standard form:

min z = 4x1 - 2x2
subject to:
2x1 + x2 + x3 = 10                                                (17.26)
x1 - x2 + x4 = 8
x1, x2, x3, x4 ≥ 0

The initial tabular form of the equation system in Expression (17.26) is represented in Table 17.E.13.

TABLE 17.E.13 Initial Tabular Form of Equation System Represented by Expression (17.26)

                                   Coefficients
Basic Variable   Equation   z     x1    x2    x3    x4    Constant
z                0          1     -4    2     0     0     0
x3               1          0     2     1     1     0     10
x4               2          0     1     -1    0     1     8


Analogous to Solution 1, the initial set of nonbasic variables is represented by x1 and x2, while the initial set of basic variables is represented by x3 and x4. For a minimization problem, the solution is optimal if all the coefficients of the nonbasic variables in Equation (0) are nonpositive (≤ 0). Therefore, initial solution {x1, x2, x3, x4} = {0, 0, 10, 8} is not optimal, since the coefficient of nonbasic variable x2 in Equation (0) is positive. As shown in Table 17.E.14, variable x2 goes into the base (greatest positive coefficient) instead of variable x3, whose row is the only one with a positive coefficient in the pivot column.

TABLE 17.E.14 Variable That Enters and Leaves the Base in the First Iteration

After the Gauss-Jordan elimination method is used, the new tabular form is:

TABLE 17.E.15 New Tabular Form Using the Gauss-Jordan Elimination Method (Iteration 1)

                                   Coefficients
Basic Variable   Equation   z     x1    x2    x3    x4    Constant
z                0          1     -8    0     -2    0     -20
x2               1          0     2     1     1     0     10
x4               2          0     3     0     1     1     18

According to Table 17.E.15, the new adjacent solution is {x1, x2, x3, x4} = {0, 10, 0, 18} with z = -20. The basic solution obtained is optimal, because the coefficients of all the nonbasic variables in Equation (0) are nonpositive.
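As an aside (not part of the book's procedure), the same minimization problem can be cross-checked with SciPy's linprog routine, which minimizes by default and therefore needs no min-to-max transformation. The snippet below assumes a recent version of SciPy is available.

# Cross-check of Example 17.11 with SciPy (an assumption; SciPy is not used in the book).
from scipy.optimize import linprog

res = linprog(c=[4, -2],                 # min z = 4*x1 - 2*x2
              A_ub=[[2, 1], [1, -1]],    # 2*x1 + x2 <= 10 ;  x1 - x2 <= 8
              b_ub=[10, 8],
              bounds=[(0, None), (0, None)],
              method="highs")
print(res.x, res.fun)                    # expected: [0. 10.] and z = -20.0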

17.4.5 Special Cases of the Simplex Method

As presented in Section 17.2.3, an LP problem may not present a single nondegenerate optimal solution and may be characterized as one of the following special cases:
1. Multiple optimal solutions
2. Unlimited objective function z
3. There is no optimal solution
4. Degenerate optimal solution

Section 17.2.3 discussed the graphical solution for each one of the special cases listed. This section discusses how to identify the peculiarities of each special case in one of the tabular forms (initial, intermediate, or final).

17.4.5.1 Multiple Optimal Solutions
As discussed in Section 17.2.3.1, in a linear programming problem with infinite optimal solutions, several points reach the same optimal value in the objective function. Graphically, when the objective function is parallel to an active constraint, we have a case with multiple optimal solutions.


Through the Simplex method, we can identify a case with multiple optimal solutions when, in the optimal tabular form, the coefficient of one of the nonbasic variables is null in row 0 of the objective function.
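For instance (an illustration added here, not taken from the book): in max z = 3x1 + 2x2 subject to 3x1 + 2x2 ≤ 12, x1 ≤ 3, and x1, x2 ≥ 0, the objective function is parallel to the first (active) constraint. Following the entering rule of Fig. 17.14, the Simplex method stops at the vertex (x1, x2) = (3, 1.5) with z = 12, and in the final row 0 the slack variable of the constraint x1 ≤ 3 is nonbasic with a null coefficient; letting it enter the base moves the solution along the edge toward (0, 6) without changing z, so every point on that edge is also optimal.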

17.4.5.2 Unlimited Objective Function z
As described in Section 17.2.3.2, in this case there is no limit to the increase of the value of at least one decision variable, resulting in an unbounded feasible region and an unlimited objective function z. For a maximization problem, the value of the objective function increases unlimitedly, while for a minimization problem, the value decreases in an unlimited way. Through the Simplex method, we can identify a case whose objective function is unlimited when, in one of the tabular forms, a candidate nonbasic variable is prevented from entering the base because the rows of all the basic variables have nonpositive coefficients in the column of the candidate nonbasic variable.
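A minimal illustration (added here, not from the book) using the NumPy sketch shown after Fig. 17.14: in max z = x1 subject to x1 - x2 ≤ 1 and x1, x2 ≥ 0, after x1 enters the base, the candidate entering variable x2 has no positive coefficient in any constraint row, so the ratio test cannot limit its increase and the sketch stops with an error at exactly this point.

# Unboundedness illustration, reusing the simplex_max sketch defined earlier.
try:
    simplex_max(np.array([1.0, 0.0]), np.array([[1.0, -1.0]]), np.array([1.0]))
except ValueError as err:
    print(err)   # "Objective function is unbounded"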

17.4.5.3 There Is No Optimal Solution
According to Taha (2016), this case never occurs with constraints of the ≤ type with non-negative constants on the right-hand side, since the slack variables guarantee a feasible solution. During the implementation of the Simplex algorithm in the tabular form, whenever the basic variables take on nonnegative values, we have a feasible basic solution. In contrast, when at least one of the basic variables assumes a negative value, we have an unfeasible basic solution.

17.4.5.4 Degenerate Optimal Solution
As discussed in Section 17.2.3.4, we can identify a special case of degenerate solution, graphically, when one of the vertices of the feasible region is obtained by the intersection of more than two distinct lines. Through the Simplex method, on the other hand, we can identify a case with a degenerate solution when, in one of the solutions of the Simplex method, the value of one of the basic variables is null. This variable is called a degenerate variable. When all the basic variables take on positive values, we say that the feasible basic solution is nondegenerate. If there is degeneration in the optimal solution, we have a case known as a degenerate optimal solution.
The degenerate solution is obtained when there is a tie between at least two basic variables when choosing which one of them must leave the base (rows with the same positive quotient). When this happens, we can choose any one of them randomly. The basic variable that is not chosen remains in the base; however, its value becomes null in the new adjacent solution. Analogously, during the solution of any linear programming problem through the Simplex method, if there is a tie when choosing the nonbasic variable that will go into the base, we should choose one of the variables randomly.
The problem with degeneration is that, in some cases, the Simplex algorithm can go into a loop, returning to the same basic solutions, since it cannot leave that solution space. In this case, the optimal solution will never be reached.
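For instance (an illustration added here, not taken from the book): in max z = 3x1 + 2x2 subject to x1 ≤ 2, x1 + x2 ≤ 2, and x1, x2 ≥ 0, when x1 is chosen to enter the base both constraint rows yield the same quotient (2/1 = 2), so there is a tie in the choice of the leaving variable. Whichever slack variable is kept in the base takes on value zero in the next solution, and the method ends at the degenerate optimal solution x1 = 2, x2 = 0 with z = 6, a vertex where three constraint lines (x1 = 2, x1 + x2 = 2, and x2 = 0) intersect.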

17.5 SOLUTION BY USING A COMPUTER

In this chapter, we have discussed several ways of solving an LP problem: (a) graphically, for problems with two decision variables; (b) by using the analytical method, for cases in which m < n; (c) through the Simplex method. Understanding each one of these methods theoretically is essential; however, to minimize the time spent solving an LP model, the same problems can be solved using a computer, without having to do the calculations and construct the charts manually.
Currently, there are several software packages in the market for solving linear programming problems, such as GAMS, AMPL, AIMMS, and software with electronic spreadsheets (Solver in Excel, What's Best!), among others. GAMS, AMPL, and AIMMS are algebraic modeling languages or systems (Algebraic Modeling Language—AML), that is, high-level programming languages used for solving complex and large-scale mathematical programming problems. These languages have an open interface that makes it possible to connect them to several optimization packages or solvers (LINDO, LINGO, CPLEX, XPRESS, MINOS, OSL, etc.), which find the model's solution. These optimization packages can also be used separately, but many of them usually run within a development environment. Let's now discuss the main characteristics of each of these software packages.
LINDO (Linear Interactive and Discrete Optimizer), developed by LINDO Systems (http://www.lindo.com/), solves linear, nonlinear, and integer programming problems. It is very easy to use and also very fast. The complete version of the software does not have limitations regarding the number of constraints or of real and integer variables. To solve linear programming problems, the Solver in LINDO uses more than one optimization method, including the Simplex method, revised Simplex, dual Simplex, and interior-point methods. Different from the Simplex algorithm, in the interior-point methods,


new solutions can be found in the interior of the feasible region. The Solver in LINDO has interface with the following programming languages: Visual Basic, C, C ++, among others. A free version of the software can be downloaded directly from the website http://www.lindo.com/. Software LINGO (Language for Interactive General Optimizer), also developed by LINDO Systems, solves linear, nonlinear, and integer programming problems quickly and effectively. The complete version of the software does not have limitations regarding the number of constraints, real and integer variables either. The Solver in LINGO also uses the Simplex method, revised Simplex, dual Simplex, and interior-points method to determine the optimal solution for a linear programming model. All input data can be read directly from LINGO. However, many times, the software uses electronic spreadsheets as interface, such as Excel. The Solver in LINGO also has interface with the following programming languages: Visual Basic, C, C ++, among others. A free version of the software can also be downloaded from the website http://www.lindo.com/. Also developed by LINDO Systems, software What’s Best! is a module to be installed inside electronic spreadsheets as Excel, and it is used to solve linear, nonlinear, and integer programming problems. The complete version of the software does not have limitations regarding the number of constraints, real and integer variables either. The Solver in What’s Best! uses the same optimization methods as LINDO and LINGO. What’s Best! is also compatible with the VBA (Visual Basic for Applications) in Excel, thus, allowing the application of macros and programming codes. A free version of the software can also be downloaded from the website http://www.lindo.com/. CPLEX is an optimization package that was originally developed by Robert Bixby at CPLEX Optimization. In 1997, it was purchased by ILOG and, later on (2009), by IBM. CPLEX has been widely used to solve large-scale linear, integer, and nonlinear programming problems, many times, serving as a Solver within algebraic modeling systems, such as, GAMS, AMPL, and AIMMS. The Solver in CPLEX uses the Simplex method and the interior-points method to determine the optimal solution for a linear programming problem. It has interface with the following programming languages C, C ++, C#, and Java. A free version of CPLEX can be downloaded on the website https://www.ibm.com/analytics/ cplex-optimizer. Developed by Dash Optimization Ltd., XPRESS is an optimization software that solves complex linear, nonlinear, and integer programming problems. The Solver in XPRESS allows us to choose a solution method (Simplex, dual Simplex, or interior-points method) to determine the best solution for a linear programming problem. XPRESS has an interface with the following programming languages C, C ++, Java, Visual Basic, and Net. Further information on the software can be found on the website http://www.dashoptimization.com/. MINOS, developed by Bruce Murtagh and Michael Saunders from Stanford University, is an optimization software that solves large-scale linear and nonlinear programming problems. To solve linear programming problems, MINOS uses the Simplex method. MINOS has also been widely used as a Solver for algebraic modeling languages. It has interface with the following programming languages Fortran, C, and Matlab. Further information on MINOS can be found on the website http://www.sbsi-sol-optimize.com/asp/sol_product_minos.htm/. 
Software GAMS (General Algebraic Modeling System), developed by GAMS Development Corporation, is a highlevel algebraic modeling system that is being used to solve complex and large-scale linear, nonlinear, and integer programming problems. As specified previously, GAMS has an interface that connects to several optimization packages, including CPLEX, MINOS, OSL, XPRESS, LINGO, LINDO, among others. A free version of the software can be downloaded directly from the website http://www.gams.com/. AMPL (A Mathematical Programming Language) is also an algebraic modeling language, developed by Bell laboratory to solve high complexity linear, integer, and nonlinear programming problems. AMPL has an open interface that makes it possible to connect it to several types of Solver (such as, CPLEX, MINOS, OSL, XPRESS, among others) that find the model’s optimal solution. A free version of AMPL can be downloaded directly from the website http://www. ampl.com/. AIMMS (Advanced Integrated Multidimensional Modeling Software), developed by Paragon Decision Technology, is also a high-level algebraic modeling language that solves high complexity linear, nonlinear and integer programming problems. It uses optimization packages, such as, CPLEX, MINOS, XPRESS, among others, to determine the optimal solution for a linear programming model. It interfaces with the following programming languages C, C ++, Visual Basic, and Excel. A free version of the software can be downloaded from the website http://www.aimms.com/. IBM’s OSL (Optimization Subroutine Library) is an optimization software that solves large-scale linear, nonlinear, and integer programming problems. The Solver in OSL uses the Simplex method and interior-point methods to determine the optimal solution for a linear programming problem. It interfaces with the programming languages C and Fortran.


Solver is an Excel add-in that, due to its popularity and simplicity, has been widely used for solving small-scale linear, nonlinear, and integer programming problems. Solver uses the Simplex algorithm to determine the optimal solution for a linear programming model. To solve nonlinear problems, Solver uses the GRG2 (Generalized Reduced Gradient) algorithm, while for integer programming problems it uses the branch-and-bound algorithm. Solver has an interface with other programming languages, so that the final solution can be exported to another package. In the following section, we will discuss how to use it step by step.

17.5.1 Solver in Excel

Solver is capable of solving problems with up to 200 decision variables and 100 constraints. To use it, it is necessary to activate the Solver add-in in Excel. First, we may click on the File tab and select Options (Fig. 17.16). From the dialog box Excel Options (Fig. 17.17), choose the option Add-Ins and select the name Solver Add-in. Also in Fig. 17.17, the next step consists in clicking on Go, which will open the Add-Ins dialog box shown in Fig. 17.18. Finally, confirm by clicking on Solver Add-in and OK. Thus, Solver in Excel will become available in the Data tab, in the Analyze column, as shown in Fig. 17.19.
Having selected the Solver command, the Solver Parameters dialog box will appear (Fig. 17.20). Let's now discuss each one of its cells.
1. Set Objective
The objective cell (target cell in earlier Excel versions) is the one that contains the value of the objective function.
2. To (Equal to in earlier Excel versions)
We must choose whether the objective function is a maximization (Max) or a minimization (Min) one. Solver can also use the option Value Of. In this case, Solver will search for a solution whose objective function value (objective cell) is the same as or as close as possible to the value stipulated.
3. By Changing Variable Cells
Variable cells (changing cells or adjustable cells in earlier versions) refer to the model's decision variables. They are the cells whose values vary until the model's optimal solution is reached.

FIG. 17.16 Activating Solver from Excel Options.


FIG. 17.17 Activating Solver from the Add-Ins option.

FIG. 17.18 Add-Ins dialog box.


FIG. 17.19 Availability of Solver on the Data tab.

FIG. 17.20 Solver Parameters.

FIG. 17.21 Add Constraint dialog box.

4. Subject to the Constraints
Each of the model constraints must be included by using the Add button seen in Fig. 17.20, thus showing the Add Constraint dialog box, as shown in Fig. 17.21. First of all, in the Cell Reference box, we must select the cell that represents the left-hand side of the constraint to be added. Then select the constraint symbol (≤, =, or ≥), int (integer variable), or bin (binary variable). In the Constraint box, select a constant, a reference cell, or a formula with a numerical value that represents the constraint's right-hand side. While there are new constraints to be included in the model, click on Add. The non-negativity constraints of the decision variables must also be included in this phase. In the case of the last constraint, press OK to go back to the Solver Parameters dialog box.


As shown in Fig. 17.20, for each constraint that has already been added, it is possible to change or delete it. In order to do that, we just need to select the constraint desired and click on the button Change or Delete. Additionally, the Reset All button clears all the data regarding the current model. Another alternative for including the non-negativity constraints is to select the Make Unconstrained Variables Non-Negative check box.
5. Select a Solving Method
For linear programming problems, we must select the Simplex LP engine. Select the GRG Nonlinear engine for smooth nonlinear problems, and select the Evolutionary engine for nonsmooth problems.
6. Options
In the Solver Parameters dialog box, it is also possible to activate the Options button, which makes the Options window available (Fig. 17.22).

FIG. 17.22 Options dialog box.

From Fig. 17.22, on the All Methods tab, we can change options for all solving methods. In the Constraint Precision box, the degree of precision can be specified: the smaller the number, the higher the precision. If the Use Automatic Scaling check box is selected, Solver internally rescales the values of variables, constraints, and the objective to similar magnitudes, to reduce the impact of extremely large or small values on the accuracy of the solution process. The Show Iteration Results check box displays the values of each trial solution.
In the Solving with Integer Constraints box, if the Ignore Integer Constraints check box is selected, all integer, binary, and all different constraints are ignored. This is known as the relaxation of the integer programming problem. In the Integer Optimality (%) box, we can type the maximum percentage difference Solver should accept between the objective value of the best integer solution found and the best known bound on the true optimal objective value before stopping.
In the Solving Limits box, we can specify the maximum CPU time and the maximum number of iterations, respectively, in the Max Time (Seconds) and Iterations boxes. Finally, the last two options are used only for problems that include integer constraints on variables, or problems that use the Evolutionary Solving Method. In the Max Subproblems box, we can specify the maximum number of subproblems and


FIG. 17.23 Solver Results dialog box—feasible solution.

in the Max Feasible Solutions box, we can specify the maximum number of feasible solutions (https://www.solver.com/excel-solver-change-options-all-solving-methods).
7. Solve
Going back to the Solver Parameters dialog box, click on Solve, obtaining the Solver Results dialog box. In cases in which Solver finds a feasible solution for the problem in question, which satisfies all the model constraints, a corresponding message will appear in the Solver Results dialog box, as seen in Fig. 17.23. In this case, the Solver result will appear automatically in the spreadsheet under analysis; we just need to click on OK to maintain the optimal solution found. To restore the model's initial values, select the Restore Original Values option and, finally, click on OK. The current scenario can also be saved by using the Save Scenario button.
Solver has three types of report: Answer, Sensitivity, and Limits. In order for these reports to be made available in new Excel spreadsheets, we just need to select the option desired before clicking on the OK button in the Solver Results dialog box. The Answer report provides the results of the model's optimal solution. The Limits report shows the lower and upper limits of each one of the variable cells. The Answer and Limits reports will be discussed in Section 17.5.4 and the Sensitivity analysis report in Section 17.6.4.

17.5.2 Solution of the Examples found in Section 16.6 of Chapter 16 using Solver in Excel

Each of the examples found in Section 16.6 of the previous chapter (modeling of real linear programming problems) will be solved by the Solver in Excel.

17.5.2.1 Solution of Example 16.3 of Chapter 16 (Production Mix Problem at Venix Toys)
Example 16.3 presented in Section 16.6.1 of the previous chapter, regarding the production mix problem at Venix, a company in the toy sector, will be solved through Solver in Excel. Fig. 17.24 shows how the linear programming model must be edited in an Excel spreadsheet, so that it can be solved by Solver (see file Example3_Venix.xls).
First, we can see that the unit profits of each product are represented by cells B5 and C5, while the decision variables (weekly number of toy cars and tricycles to be manufactured) are represented by cells B14 and C14, respectively. The objective function is represented by cell D14 (see formula in Box 17.2). The labor availability constraints for the machining, painting, and assembly activities are represented in rows 8, 9, and 10, respectively. For instance, the labor availability constraint for the machining activity is represented by the equation 0.25x1 + 0.5x2 ≤ 36. In order for it to be added to the Subject to the Constraints box in the Solver Parameters dialog box, the left-hand side of the constraint must be represented in a single cell. Therefore, the term 0.25x1 + 0.5x2 is represented by cell D8 (see formula in Box 17.2). We repeat the same procedure for the other constraints. The initial solution has the following values: x1 = 0, x2 = 0, and z = 0.


Venix Toys

                 x1 (cars)   x2 (tricycles)   Hours used        Hours available
Unit profit      12          60
Machining        0.25        0.50             0.0          ≤    36
Painting         0.10        0.75             0.0          ≤    22
Assembly         0.10        0.40             0.0          ≤    15

Solution (quantities produced): x1 (cars) = 0, x2 (tricycles) = 0, z (total profit) = $0.00

FIG. 17.24 Production mix model at Venix Toys in Excel.

BOX 17.2 Formula of the Objective Function and of the Total Number of Hours Used in Each Activity

Cell    Formula
D8      =B8*$B$14+C8*$C$14
D9      =B9*$B$14+C9*$C$14
D10     =B10*$B$14+C10*$C$14
D14     =B5*$B$14+C5*$C$14

BOX 17.3 Alternative to Box 17.2 When Using the SUMPRODUCT Function

Cell    Formula
D8      =SUMPRODUCT(B8:C8,$B$14:$C$14)
D9      =SUMPRODUCT(B9:C9,$B$14:$C$14)
D10     =SUMPRODUCT(B10:C10,$B$14:$C$14)
D14     =SUMPRODUCT(B5:C5,$B$14:$C$14)

For complex problems, we can use the SUMPRODUCT function directly, which multiplies the components that correspond to the intervals or matrices provided and returns the sum of these products, as shown in Box 17.3.
The problem is ready to be solved by the Solver in Excel. Clicking on the Solver command, we obtain the Solver Parameters dialog box, as shown in Fig. 17.25. First, we must select the objective cell (D14), which is the one that contains the value of the objective function. Since it is a maximization problem, select the option Max. The next step consists in selecting the variable cells (B14:C14) that represent the model's decision variables. Lastly, we must add each one of the model constraints to the Subject to the Constraints box.
Regarding the labor availability constraints for each of the activities, since they are all of the same type, instead of adding them separately, they can be added simultaneously. To do that, first of all, click on the Add button. Select the cell range D8:D10 in the Cell Reference box, the symbol ≤, and the cell interval F8:F10 in the Constraint box, as shown in Fig. 17.26. To conclude the inclusion of the current constraint, let's click on OK. Nevertheless, since the non-negativity constraint of the model's decision variables will also be included, click on Add. Once again, the Add Constraint dialog box appears, such that the new model constraint can be included. Therefore, we select the variable cells (B14:C14) in the Cell Reference box, the symbol ≥, and value 0 to be included in the Constraint box, as seen in Fig. 17.27. Since this is the last constraint, conclude by clicking on OK. The non-negativity constraints can also be activated by selecting the Make Unconstrained Variables Non-Negative check box.


FIG. 17.25 Solver Parameters dialog box for Example 16.3.

FIG. 17.26 Adding the labor availability constraint for the three activities.

FIG. 17.27 Adding the non-negativity constraint of the decision variables.

Since we have a linear programming problem, select the Simplex LP engine in the Select a Solving Method box. The next step consists in clicking on the Options button, thus, obtaining the Options dialog box (Fig. 17.28). The values regarding the Constraint Precision, Max Time, and Iterations parameters will be kept. If Solver cannot find a viable solution, we must test it with a smaller precision in order to find a feasible solution. Another alternative is to increase the maximum time and the number of iterations. If the problem persists, the model must be unfeasible (Taha, 2016). Finally, click on OK to go back to the Solver Parameters dialog box. From then on, Solver is ready to be solved. Thus, click on Solve. In case of a feasible solution, we can update the current Excel spreadsheet with the new solution by clicking on Keep Solver Solution. Fig. 17.29 shows the result of the optimal solution for Example 16.3 for Venix Toys.


FIG. 17.28 Options dialog box for Example 16.3.

FIG. 17.29 Optimal solution for the production mix problem for Venix Toys.

Therefore, we can see that the model's optimal solution is x1 = 70 and x2 = 20 with z = 2,040 ($2,040.00). The same results can be seen, in a more detailed way, in the Answer Report (see Section 17.5.4). The Limits Report of company Venix Toys will also be discussed in Section 17.5.4.

Solution using Names in a Cell or Cell Range
According to Haddad and Haddad (2004), using names in a cell or cell range makes it easier to understand the formula. To name it, we just need to click on a cell or cell range, place the desired name in the Name Box, which appears on the left-hand side of the Formula bar, and conclude by typing ENTER. Hence, the cells will be referenced by name and not by the corresponding columns and rows any longer. For example, the set of cells B5:C5 will be named Unit_profit, as seen in Fig. 17.30.
Another way of doing this is by right-clicking and choosing Define Name. Thus, the New Name dialog box will appear (Fig. 17.31), such that the desired name must be added in the Name box. A third alternative is to select the Formulas tab and


FIG. 17.30 Inserting a name into a set of cells.

FIG. 17.31 New Name dialog box.

FIG. 17.32 Name Manager dialog box.

choose the Name Manager option (CTRL + F3), resulting in Fig. 17.32. As shown in Fig. 17.32, we can include a new name by using the New button (once again, the New Name dialog box will appear), or select an already existing cell or set of cells and click on Edit to change the current name, or on Delete. Note that Fig. 17.32 shows the name of each cell or cell range and its corresponding value, in addition to its respective rows and columns. Therefore, each cell or cell range included in the Name Manager is referenced by its name instead of its corresponding rows and columns. For example, the formulas in Box 17.3 regarding cells D8, D9, D10, and D14 can be written based on the new names, as shown in Box 17.4.
Fig. 17.33 is an adaptation of Fig. 17.25, in which each cell or cell range is called by its respective name. Notice that objective cell D14 is referred to as Total_profit, the variable cells (B14:C14) and the left-hand side of the second constraint as Quantities_produced, the left-hand side of the first constraint (D8:D10) as Hours_used, and the right-hand side of the first constraint (F8:F10) as Hours_available.


BOX 17.4 Formulas From Box 17.3 Written Based on Their New Names

Cell    Formula
D8      =SUMPRODUCT(B8:C8,Quantities_produced)
D9      =SUMPRODUCT(B9:C9,Quantities_produced)
D10     =SUMPRODUCT(B10:C10,Quantities_produced)
D14     =SUMPRODUCT(Unit_profit,Quantities_produced)

FIG. 17.33 Solver Parameters after the inclusion of new names.
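For readers who want to double-check the spreadsheet result outside Excel, the Venix model can also be reproduced with SciPy's linprog routine (an aside; SciPy is not part of the book's workflow). Since linprog minimizes, the profit coefficients are negated.

# Cross-check of the Venix Toys production mix (Example 16.3), assuming SciPy is available.
from scipy.optimize import linprog

res = linprog(c=[-12, -60],                  # maximize 12*x1 + 60*x2
              A_ub=[[0.25, 0.50],            # machining hours <= 36
                    [0.10, 0.75],            # painting hours  <= 22
                    [0.10, 0.40]],           # assembly hours  <= 15
              b_ub=[36, 22, 15],
              bounds=[(0, None), (0, None)],
              method="highs")
print(res.x, -res.fun)   # expected: [70. 20.] and total profit 2040.0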

17.5.2.2 Solution of Example 16.4 of Chapter 16 (Production Mix Problem at Naturelat Dairy)
Example 16.4 presented in Section 16.6.1 of the previous chapter, regarding the production mix problem at Naturelat, a dairy product company, will also be solved through the Solver in Excel. Fig. 17.34 illustrates the representation of the model in an Excel spreadsheet (see file Example4_Naturelat.xls). The formulas used in Fig. 17.34 are shown in Box 17.5.
Analogous to the Venix Toys example, names were assigned to the cells and cell ranges in Fig. 17.34, which will be used in the Solver to facilitate the understanding of the model. Fig. 17.35 shows the names assigned to the respective cells. The representation of the problem at Naturelat Dairy in the Solver Parameters dialog box is shown in Fig. 17.36. Since names were assigned to the model cells, Fig. 17.36 refers to them by their respective names. In Fig. 17.36, note that the constraints are sorted in alphabetical order. The same happens with the Name Manager (Fig. 17.35).
Similar to the Venix Toys example, we selected the Make Unconstrained Variables Non-Negative check box and the Simplex LP engine in the Select a Solving Method box. The Options command remained unaltered. Finally, click on Solve and select the option Keep Solver Solution in the Solver Results dialog box. Fig. 17.37 shows the optimal solution for the production mix model at Naturelat Dairy.


Naturelat Dairy

                            x1       x2      x3           x4        x5
                            yogurt   minas   mozzarella   parmesan  provolone   Used    Available
Unit contribution margin    0.80     0.70    1.15         1.30      0.70
Raw milk                    0.70     0.40    0.40         0.60      0.60        0.00    ≤ 1200
Whey                        0.16     0.22    0.32         0.19      0.23        0.00    ≤ 460
Fat                         0.25     0.33    0.33         0.40      0.47        0.00    ≤ 650
Labour (MH)                 0.05     0.12    0.09         0.04      0.16        0.00    ≤ 170
Minimum demand (≥)          320      380     450          240       180

Initial solution (quantities produced): x1 = x2 = x3 = x4 = x5 = 0 with z (total contribution margin) = $0.00

FIG. 17.34 Representation of the production mix model in Excel for Naturelat Dairy.

BOX 17.5 Formulas From Fig. 17.34

Cell    Formula
G8      =SUMPRODUCT(B8:F8,$B$24:$F$24)
G9      =SUMPRODUCT(B9:F9,$B$24:$F$24)
G10     =SUMPRODUCT(B10:F10,$B$24:$F$24)
G13     =SUMPRODUCT(B13:F13,$B$24:$F$24)
G16     =SUMPRODUCT(B16:F16,$B$24:$F$24)
G17     =SUMPRODUCT(B17:F17,$B$24:$F$24)
G18     =SUMPRODUCT(B18:F18,$B$24:$F$24)
G19     =SUMPRODUCT(B19:F19,$B$24:$F$24)
G20     =SUMPRODUCT(B20:F20,$B$24:$F$24)
G24     =SUMPRODUCT(B5:F5,$B$24:$F$24)

FIG. 17.35 Name Manager for the problem at Naturelat Dairy.


FIG. 17.36 Solver Parameters regarding the problem at Naturelat Dairy.

FIG. 17.37 Result of the Naturelat Dairy model.


FIG. 17.38 Representation of the mix problem in Excel of Oil-South Refinery. The spreadsheet (see file Example5_OilSouth.xls) contains: the unit profit of each decision variable (3, 2, and 2 for x11, x21, x31; 5, 4, and 4 for x12, x22, x32; 6, 5, and 5 for x13, x23, x33); the oil-composition constraints of rows 6 to 10 (Common-oil1, Super-oil1, Super-oil2, Extra-oil2, and Extra-oil3), all written with constant 0 on the right-hand side; the minimum demand for each type of gasoline (Common ≥ 5,000, Super ≥ 3,000, and Extra ≥ 3,000 barrels); the capacity of each type of oil (Oil 1 ≤ 10,000, Oil 2 ≤ 8,000, and Oil 3 ≤ 7,000 barrels); the total refinery capacity of 20,000 barrels; and the initial solution, in which all quantities produced are zero and z (total profit) = $0.00.

17.5.2.3 Solution of Example 16.5 of Chapter 16 (Mix Problem of Oil-South Refinery) Example 16.5 presented in Section 16.6.2 of the previous chapter, regarding the mix problem of Oil-South Refinery, will also be solved through the Solver in Excel. Fig. 17.38 illustrates the representation of the model in an Excel spreadsheet (see file Example5_OilSouth.xls). Rows 6 to 10 represent the constraints of the maximum or minimum percentage allowed of a certain type of oil in the composition of a certain type of gasoline. In order for all the constraints to have the same symbol (), the type  inequalities were multiplied by (-1). The formulas used in Fig. 17.38 are shown in Box 17.6. In order to facilitate the understanding of the model, names were assigned to the main cells and cell ranges in Fig. 17.38, as seen in Fig. 17.39. Having assigned the names to the cells or cell ranges of Fig. 17.38, they will be referenced by their respective names. Therefore, the representation of the problem of Oil-South Refinery in the Solver Parameters dialog box is illustrated in Fig. 17.40. BOX 17.6 Formulas From Fig. 17.38 Cell

Cell   Formula
K6     =SUMPRODUCT(B6:J6,$B$26:$J$26)
K7     =SUMPRODUCT(B7:J7,$B$26:$J$26)
K8     =SUMPRODUCT(B8:J8,$B$26:$J$26)
K9     =SUMPRODUCT(B9:J9,$B$26:$J$26)
K10    =SUMPRODUCT(B10:J10,$B$26:$J$26)
K13    =SUMPRODUCT(B13:J13,$B$26:$J$26)
K14    =SUMPRODUCT(B14:J14,$B$26:$J$26)
K15    =SUMPRODUCT(B15:J15,$B$26:$J$26)
K18    =SUMPRODUCT(B18:J18,$B$26:$J$26)
K19    =SUMPRODUCT(B19:J19,$B$26:$J$26)
K20    =SUMPRODUCT(B20:J20,$B$26:$J$26)
K23    =SUMPRODUCT(B23:J23,$B$26:$J$26)
K26    =SUMPRODUCT(B4:J4,$B$26:$J$26)
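The same sign convention is needed in most LP software, not only in the Excel Solver. As a minimal sketch (not taken from the book, with purely illustrative data), the Python snippet below builds a tiny mix model for SciPy's linprog, which only accepts inequality rows of the form A_ub·x ≤ b_ub; the ≥ rows are therefore multiplied by -1, mirroring what was done in Fig. 17.38.

# Minimal sketch (illustrative data): solvers such as SciPy's linprog accept
# only "<=" rows, so ">=" rows are multiplied by -1, as in Fig. 17.38.
import numpy as np
from scipy.optimize import linprog

c = np.array([3.0, 5.0])                                 # maximize 3*x1 + 5*x2
A_le, b_le = np.array([[1.0, 2.0]]), np.array([10.0])    # x1 + 2*x2 <= 10 (capacity)
A_ge, b_ge = np.array([[1.0, 0.0]]), np.array([2.0])     # x1 >= 2 (minimum demand)

A_ub = np.vstack([A_le, -A_ge])                          # flip the ">=" rows
b_ub = np.concatenate([b_le, -b_ge])

# linprog minimizes, so the objective is negated to maximize.
res = linprog(-c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * 2, method="highs")
print(res.x, -res.fun)                                   # optimal plan and profit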


FIG. 17.39 Name Manager for the problem of Oil-South Refinery.

FIG. 17.40 Solver Parameters regarding the problem of Oil-South Refinery.

Note that the non-negativity constraints were activated by selecting the Make Unconstrained Variables Non-Negative check box, and the Simplex LP method was selected. The Options command remained unaltered. Finally, click on Solve and then OK to keep the Solver solution. Fig. 17.41 shows the optimal solution for the mix problem of Oil-South Refinery.

17.5.2.4 Solution of Example 16.6 of Chapter 16 (Diet Problem) Example 16.6 presented in Section 16.6.3 of the previous chapter, regarding a diet problem, will also be solved through the Solver in Excel. Fig. 17.42 represents the model in an Excel spreadsheet (see file Example6_Diet.xls).


FIG. 17.41 Result of the mix problem of Oil-South Refinery.

FIG. 17.42 Representation of the diet problem in Excel.

BOX 17.7 Formulas From Fig. 17.42

Cell   Formula
N8     =SUMPRODUCT(B8:M8,$B$18:$M$18)
N9     =SUMPRODUCT(B9:M9,$B$18:$M$18)
N10    =SUMPRODUCT(B10:M10,$B$18:$M$18)
N11    =SUMPRODUCT(B11:M11,$B$18:$M$18)
N14    =SUMPRODUCT(B14:M14,$B$18:$M$18)
N18    =SUMPRODUCT(B5:M5,$B$18:$M$18)

The formulas used in Fig. 17.42 are shown in Box 17.7. The names assigned to the main cells and cell ranges of Fig. 17.42 are listed in Fig. 17.43. Therefore, the Solver Parameters dialog box regarding the diet problem shows the names assigned to the respective cells or cell ranges, as shown in Fig. 17.44.


FIG. 17.43 Name Manager for the diet problem.

FIG. 17.44 Solver Parameters related to the diet problem.

Note that the non-negativity constraints were activated by selecting the Make Unconstrained Variables Non-Negative check box, and the Simplex LP method was selected. The Options command remained unaltered. Finally, click on Solve and then OK to keep the Solver solution. Fig. 17.45 shows the optimal solution for the diet problem.
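For readers working outside Excel, the structure of this diet model can be reproduced in a few lines of Python. The sketch below is only illustrative: it uses three hypothetical foods and two nutrients rather than the full table stored in Example6_Diet.xls, and it converts the "at least" nutrient rows into ≤ rows as discussed earlier.

# Illustrative diet-problem sketch (made-up coefficients, not the book's full
# table): minimize cost subject to minimum nutrient intake and a daily cap on
# the amount of each food.
from scipy.optimize import linprog

cost = [3.0, 4.0, 7.5]                      # $/kg of three hypothetical foods
nutrient = [[30.0, 48.6, 15.0],             # iron per kg of each food
            [4.0, 4.0, 0.6]]                # folic acid per kg of each food
min_need = [80.0, 4.0]                      # minimum daily intake of each nutrient
max_kg = [(0, 1.5), (0, 1.0), (0, 1.0)]     # maximum daily consumption per food

A_ub = [[-a for a in row] for row in nutrient]   # ">=" rows flipped to "<="
b_ub = [-m for m in min_need]

res = linprog(cost, A_ub=A_ub, b_ub=b_ub, bounds=max_kg, method="highs")
print(res.x, res.fun)                       # cheapest feasible diet and its cost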

17.5.2.5 Solution of Example 16.7 of Chapter 16 (Farmer’s Problem) In order for Example 16.7 in Section 16.6.4 of the previous chapter (farmer’s problem) to be solved by Solver in Excel, it must be represented in an Excel spreadsheet, as shown in Fig. 17.46 (see file Example7_Farmer.xls). Box 17.8 shows the formulas used in Fig. 17.46.


FIG. 17.45 Result of the diet problem.

FIG. 17.46 Representation of the farmer's problem in an Excel spreadsheet.

BOX 17.8 Formulas Used in Fig. 17.46

Cell   Formula
G8     =SUMPRODUCT(B8:F8,$B$25:$F$25)
G11    =SUMPRODUCT(B11:F11,$B$25:$F$25)
G12    =SUMPRODUCT(B12:F12,$B$25:$F$25)
G13    =SUMPRODUCT(B13:F13,$B$25:$F$25)
G16    =SUMPRODUCT(B16:F16,$B$25:$F$25)
G19    =SUMPRODUCT(B19:F19,$B$25:$F$25)
G20    =SUMPRODUCT(B20:F20,$B$25:$F$25)
G21    =SUMPRODUCT(B21:F21,$B$25:$F$25)
G25    =SUMPRODUCT(B5:F5,$B$25:$F$25)


FIG. 17.47 Name Manager for the farmer’s problem.

Fig. 17.47 shows the names assigned to the cells and cell ranges in Fig. 17.46, which will be referenced in the Solver. Fig. 17.48, in turn, shows the Solver parameters for the farmer's problem; note that the cells and cell ranges are represented by their respective names. The non-negativity constraints were activated by selecting the Make Unconstrained Variables Non-Negative check box, and the Simplex LP method was selected. The Options command remained unaltered. Finally, click on Solve and then OK to keep the Solver solution. Fig. 17.49 shows the optimal solution for the farmer's problem.

FIG. 17.48 Solver Parameters as regards the farmer’s problem.


FIG. 17.49 Optimal solution for the farmer’s problem.

17.5.2.6 Solution of Example 16.8 of Chapter 16 (Portfolio Selection—Maximization of the Expected Return)
Example 16.8 in Section 16.6.5 of the previous chapter will also be solved through the Solver in Excel. Fig. 17.50 shows the representation of the portfolio optimization problem in an Excel spreadsheet (see file Example8_Portfolio.xls). Box 17.9 shows the formulas used in Fig. 17.50. Fig. 17.51 shows the names assigned to the cells and cell ranges in Fig. 17.50, and Fig. 17.52 shows the Solver parameters regarding the portfolio selection problem (maximization of the expected return), in which each cell or cell range is referenced by its respective name. Note that the non-negativity constraints were activated by selecting the Make Unconstrained Variables Non-Negative check box, and the Simplex LP method was selected. The Options command remained unaltered. Finally, click on Solve and then OK to keep the Solver solution. Fig. 17.53 shows the optimal solution for the portfolio selection problem.
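The same model can also be checked quickly outside Excel. The Python sketch below, offered only as a cross-check and not as the book's procedure, rebuilds the structure of Fig. 17.50 from the values displayed there: the average returns and standard deviations of the ten assets, the 30% cap per asset, full investment of the capital, and the linear risk proxy used in the spreadsheet (the weighted average of the individual standard deviations limited to 2.50%).

# Sketch of the expected-return maximization of Example 16.8 with SciPy's
# linprog; data are the per-asset figures shown in Fig. 17.50.
import numpy as np
from scipy.optimize import linprog

mean_ret = np.array([0.37, 0.24, 0.14, 0.30, 0.24, 0.19, 0.28, 0.18, 0.25, 0.24]) / 100
std_dev  = np.array([2.48, 2.16, 1.95, 2.93, 2.40, 2.00, 2.63, 2.14, 2.73, 2.47]) / 100

res = linprog(
    -mean_ret,                              # maximize the expected return
    A_ub=[std_dev], b_ub=[0.025],           # weighted std. deviation <= 2.50%
    A_eq=[np.ones(10)], b_eq=[1.0],         # the weights must add up to 100%
    bounds=[(0, 0.30)] * 10,                # at most 30% invested in each asset
    method="highs",
)
print(np.round(res.x, 3), -res.fun)         # optimal weights and expected return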

FIG. 17.50 Representation of Example 16.8 in an Excel spreadsheet (average return and standard deviation of each asset, 30% maximum share per asset, full investment of the capital, and the 2.50% cap on the portfolio's weighted standard deviation).


BOX 17.9 Formulas Used in Fig. 17.50

Cell   Formula
B9     =SUM(B18:K18)
B14    =SUMPRODUCT(B6:K6,B18:K18)
L18    =SUMPRODUCT(B5:K5,B18:K18)

FIG. 17.51 Name Manager for the portfolio selection problem (maximization of the expected return).
FIG. 17.52 Solver Parameters regarding the portfolio selection problem (maximization of the expected return).


FIG. 17.53 Optimal solution for the portfolio selection problem.

17.5.2.7 Solution of Example 16.9 of Chapter 16 (Portfolio Selection—Minimization of the Portfolio's Mean Absolute Deviation)
Example 16.9 in Section 16.6.5 of the previous chapter, which determines the portfolio composition that minimizes its mean absolute deviation (MAD), will be solved through the Solver in Excel. The representation of the problem in an Excel spreadsheet can be seen in Fig. 17.54 (see file Example9_Portfolio.xls). The calculation of the portfolio's MAD is shown in Fig. 17.55 and can also be found in the same file Example9_Portfolio.xls. For a portfolio that has 10% of each asset in its composition, row 250 in Fig. 17.55 shows the mean absolute deviation of each asset and of the portfolio. As shown in row 250 and in the formula of cell P250 in Box 17.10, the percentage of each asset in the portfolio can be multiplied directly by its mean absolute deviation, since the percentage is constant in all the periods. Box 17.10 shows the main formulas used in Figs. 17.54 and 17.55. The formulas in cells P248 and P250 show the calculation of the mean absolute deviation of the first asset (BBAS3); for the other assets, the same procedure is used. Fig. 17.56 shows the names assigned to the cells and cell ranges in Fig. 17.54.
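The bookkeeping performed in rows 248-250 of Fig. 17.55 can be summarized in a few lines of Python. The sketch below uses random numbers in place of the 246 observed returns stored in the spreadsheet, so only the structure of the calculation, not the values, corresponds to the workbook.

# Sketch of the MAD bookkeeping of Fig. 17.55: each asset's MAD is the average
# absolute deviation of its returns from its own mean (analogous to cell P248),
# and, because the weights are constant over the periods, the portfolio MAD is
# the weight-weighted sum of the per-asset MADs (analogous to the sum in AI250).
import numpy as np

rng = np.random.default_rng(0)
returns = rng.normal(0.002, 0.02, size=(246, 10))   # periods x assets (illustrative)
weights = np.full(10, 0.10)                          # 10% invested in each asset

asset_mad = np.mean(np.abs(returns - returns.mean(axis=0)), axis=0)
portfolio_mad = weights @ asset_mad
print(asset_mad.round(4), round(portfolio_mad, 4))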

FIG. 17.54 Representation of Example 16.9 in an Excel spreadsheet.


FIG. 17.55 Mean absolute deviation of each asset and of the portfolio.

BOX 17.10 Main Formulas Used in Figs. 17.54 and 17.55

Cell    Formula
B9      =SUM(B18:K18)
B12     =SUMPRODUCT(B5:K5,B18:K18)
O248    =AVERAGE(O2:O247)
P2      =ABS(O2-$O$248)
P248    =AVERAGE(P2:P247)
P250    =P248*B18
AI250   =SUM(P250:AH250)
L18     =AI250

FIG. 17.56 Name Manager for the portfolio selection problem (minimization of the MAD).


FIG. 17.57 Solver Parameters regarding the portfolio selection problem (minimization of the MAD).

Fig. 17.57, in turn, presents the Solver parameters regarding the portfolio selection problem (minimization of the MAD). As in the previous models, the variables are assumed to be non-negative and the model is assumed to be linear. Finally, click on Solve and then OK to keep the Solver solution. Fig. 17.58 shows the optimal solution for the portfolio selection problem (minimization of the MAD).

FIG. 17.58 Optimal solution for the portfolio selection problem (minimization of the MAD).


17.5.2.8 Solution of Example 16.10 of Chapter 16 (Production and Inventory Problem of Fenix&Furniture)
Example 16.10 in Section 16.6.6 of the previous chapter, regarding the production and inventory problem of Fenix&Furniture, will be solved through Solver in Excel. Fig. 17.59 illustrates the representation of the problem in an Excel spreadsheet (see file Example10_Fenix&Furniture.xls). Note that, in the initial solution presented in Fig. 17.59, 2,000 units of each product were produced in each period. Applying the formula I_it = I_i,t-1 + x_it - D_it, we obtain the inventory levels of each product in each period, which must take on non-negative values (a small numerical check of this recursion is sketched below). It is important to mention that this solution is not the optimal one. Box 17.11 shows the formulas used in Fig. 17.59. Fig. 17.60 shows the names assigned to the main cells and cell ranges in Fig. 17.59. The Solver parameters regarding the production and inventory problem of Fenix&Furniture are represented in Fig. 17.61. Since names were assigned to the main cells and cell ranges, they are referred to by their respective names. As in the previous models, the variables are assumed to be non-negative and the model is assumed to be linear. Finally, click on Solve and then OK to keep the Solver solution. Fig. 17.62 shows the optimal solution for the production and inventory problem of Fenix&Furniture.
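As a small sketch of that recursion, the Python snippet below replays the trial plan for the first product using the figures displayed in the spreadsheet of Fig. 17.59 (200 units in stock in December, 2,000 units produced per month, and the monthly demands of the first product); the resulting inventory levels reproduce the I_1t column of the initial solution.

# Inventory recursion I_it = I_i,t-1 + x_it - D_it for the first product of
# Fig. 17.59, evaluated on the (non-optimal) trial plan.
initial_inventory = 200                               # I_1,Dec
production = [2_000] * 6                              # x_1t for Jan..Jun (trial plan)
demand = [1_200, 1_250, 1_400, 1_860, 2_000, 1_700]   # D_1t for Jan..Jun

inventory, levels = initial_inventory, []
for produced, demanded in zip(production, demand):
    inventory = inventory + produced - demanded       # must stay >= 0 to be feasible
    levels.append(inventory)
print(levels)   # [1000, 1750, 2350, 2490, 2490, 2790]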

17.5.2.9 Solution of Example 16.11 of Chapter 16 (Problem of Lifestyle Natural Juices Manufacturer)
Example 16.11 in Section 16.6.7 of the previous chapter, regarding the aggregate planning problem at Lifestyle Natural Juices, will be solved through Solver in Excel. Fig. 17.63 illustrates the representation of the problem in an Excel spreadsheet (see file Example11_Lifestyle.xls). In the initial solution presented in Fig. 17.63, note that 1,000 additional liters were produced in July after new employees were hired in the previous month, such that the values of R_t from July to December become 5,000 (R_t = R_t-1 + H_t - F_t). Applying the formula I_t = I_t-1 + R_t + O_t + S_t - D_t, we obtain the inventory levels in each period (see the short sketch after this paragraph). The values of the other decision variables remain null. It is important to mention that this solution is not the optimal one. Box 17.12 shows the formulas used in Fig. 17.63. Fig. 17.64 shows the names assigned to the cells and cell ranges in Fig. 17.63. The Solver parameters regarding the aggregate planning problem at Lifestyle company are represented in Fig. 17.65. Since names were assigned to the main cells and cell ranges, they are referenced by their respective names. As in the previous models, the variables are assumed to be non-negative and the model is assumed to be linear. Finally, click on Solve and then OK to keep the Solver solution. Fig. 17.66 shows the optimal solution for the aggregate planning problem of Lifestyle company.
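A minimal Python sketch of those two balance equations, replaying the trial plan with the demands and the June starting values displayed in Fig. 17.63 (no overtime, subcontracting, or firing in the trial plan):

# Regular-production and inventory balances of the Lifestyle trial plan:
# R_t = R_t-1 + H_t - F_t and I_t = I_t-1 + R_t + O_t + S_t - D_t.
regular, inventory = 4_000, 1_000                      # R_Jun and I_Jun
hired = [1_000, 0, 0, 0, 0, 0]                         # H_t for Jul..Dec
demand = [4_500, 5_200, 4_780, 5_700, 5_820, 4_480]    # D_t for Jul..Dec
for h, d in zip(hired, demand):
    regular = regular + h                              # F_t = 0 in the trial plan
    inventory = inventory + regular - d                # O_t = S_t = 0 in the trial plan
    print(regular, inventory)                          # ends with I_Dec = 520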

17.5.3

Solver Error Messages for Unlimited and Infeasible Solutions

Section 17.2.3 and Section 17.4.5 presented how to identify, graphically and through the Simplex method, respectively, each one of the special cases that may happen in a linear programming problem:
1. Multiple optimal solutions
2. Unlimited objective function z
3. There is no optimal solution
4. Degenerate optimal solution

In this section, we will analyze the error messages generated in the Solver Results dialog box for cases 2 and 3 (unlimited objective function z and unfeasible solution). Special cases 1 and 4 will be discussed in Sections 17.6.4.1 and 17.6.4.2, respectively, by using the Solver Sensitivity Report.

17.5.3.1 Unlimited Objective Function z
For the maximization problem (max z = 4x1 + 3x2) presented in Section 17.2.3.2 (Example 17.4), the graphical solution was obtained from Fig. 17.67. Solving the same example through Solver in Excel, an error message appears in the Solver Results dialog box: "The Objective Cell values do not converge." Therefore, whenever we come across a problem with an unlimited objective function, the message in Fig. 17.68 will appear.


FIG. 17.59 Representation of the production and inventory problem of Fenix&Furniture in an Excel spreadsheet.


BOX 17.11 Formulas Used in Fig. 17.59

Cell   Formula
C35    =B35+C42-C18
C37    =B37+C44-C20
C39    =B39+C46-C22
D35    =C35+D42-D18
D37    =C37+D44-D20
D39    =C39+D46-D22
E35    =D35+E42-E18
E37    =D37+E44-E20
E39    =D39+E46-E22
F35    =E35+F42-F18
F37    =E37+F44-F20
F39    =E39+F46-F22
G35    =F35+G42-G18
G37    =F37+G44-G20
G39    =F39+G46-G22
H35    =G35+H42-H18
H37    =G37+H44-H20
H39    =G39+H46-H22
C36    =B36+C43-C19
C38    =B38+C45-C21
D36    =C36+D43-D19
D38    =C38+D45-D21
E36    =D36+E43-E19
E38    =D38+E45-E21
F36    =E36+F43-F19
F38    =E38+F45-F21
G36    =F36+G43-G19
G38    =F38+G45-G21
H36    =G36+H43-H19
H38    =G38+H45-H21

FIG. 17.60 Name Manager for the production and inventory problem at Fenix&Furniture.

17.5.3.2 There Is No Optimal Solution
For the maximization problem (max z = x1 + x2) presented in Section 17.2.3.3 (Example 17.5), the graphical solution was obtained from Fig. 17.69. By solving Example 17.5 through the Solver in Excel, a new error message appears in the Solver Results dialog box, since we have a case in which it is not possible to find a feasible solution (see Fig. 17.70).
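Other solvers flag the same two special cases through status codes rather than dialog boxes. As an illustration outside the book's Excel workflow, SciPy's linprog returns status 3 for an unbounded objective and status 2 for an infeasible model; the sketch below applies it to the two examples above, with the constraints as read from Figs. 17.67 and 17.69.

# How the two special cases surface in SciPy's linprog (status codes).
from scipy.optimize import linprog

# Example 17.4: max 4x1 + 3x2, x1 <= 8, 2x1 + 5x2 >= 20, x >= 0 -> unbounded
unbounded = linprog([-4, -3], A_ub=[[1, 0], [-2, -5]], b_ub=[8, -20], method="highs")
print(unbounded.status, unbounded.message)    # status 3 (unbounded)

# Example 17.5: max x1 + x2, 5x1 + 4x2 >= 40, 2x1 + x2 <= 6, x >= 0 -> infeasible
infeasible = linprog([-1, -1], A_ub=[[-5, -4], [2, 1]], b_ub=[-40, 6], method="highs")
print(infeasible.status, infeasible.message)  # status 2 (infeasible)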

17.5.4

Result Analysis by Using the Solver Answer and Limits Reports

Section 17.5.2 presented the Solver results for each one of the examples presented in Section 16.6 of the previous chapter (modeling of real linear programming problems), directly in Excel spreadsheets. A detailed analysis of the results can also be done through the Solver Answer, Limits, and Sensitivity Reports. As mentioned before, the Sensitivity Report will be discussed in Section 17.6. The Answer and Limits Reports, in turn, will be discussed in this section for the Venix Toys problem (Example 16.3 of the previous chapter), whose modeling and solution were presented in Section 17.5.2.1 of this chapter in the same Excel spreadsheet.


FIG. 17.61 Solver parameters regarding the problem at Fenix&Furniture.

FIG. 17.62 Optimal solution for Fenix&Furniture.


FIG. 17.63 Representation of the aggregate planning problem at Lifestyle in an Excel spreadsheet.

BOX 17.12 Formulas Used in Fig. 17.63

Cell   Formula
C22    =B22+C27+C28+C29-C14
D22    =C22+D27+D28+D29-D14
E22    =D22+E27+E28+E29-E14
F22    =E22+F27+F28+F29-F14
G22    =F22+G27+G28+G29-G14
H22    =G22+H27+H28+H29-H14
C23    =B23+C30-C31
D23    =C23+D30-D31
E23    =D23+E30-E31
F23    =E23+F30-F31
G23    =F23+G30-G31
H23    =G23+H30-H31
I26    =SUMPRODUCT(C6:H11,C26:H31)

17.5.4.1 Answer Report The Answer Report provides the results of the optimal solution found by Solver in a new Excel spreadsheet. Fig. 17.71 shows the Answer Report of the problem faced by Venix Toys.


FIG. 17.64 Name Manager for the aggregate planning problem at Lifestyle.

FIG. 17.65 Solver parameters regarding the problem at Lifestyle.


FIG. 17.66 Optimal solution for Lifestyle company.
FIG. 17.67 Graphical solution for Example 17.4 with an unlimited objective function (constraints x1 ≤ 8 and 2x1 + 5x2 ≥ 20, with x1, x2 ≥ 0; the solution space is unlimited in the direction of maximization of z = 4x1 + 3x2).

According to Fig. 17.71, we can see that the results of the Answer Report are divided into three main parts: objective cell, variable cells, and constraints. As shown in Fig. 17.29 of Section 17.5.2.1, the maximization function z of Venix Toys' problem is represented by objective cell D14 (Total_profit). Row 8 in Fig. 17.71 shows the original value and the final value (maximum profit) of the objective cell. The model's decision variables are represented by variable cells B14 and C14 in Fig. 17.29 of Section 17.5.2.1. Rows 13 and 14 in Fig. 17.71 show the original and final values of each variable cell. From column E, we can see that the optimal number of toy cars to be produced is x1 = 70 and the optimal number of tricycles to be produced is x2 = 20.

FIG. 17.68 Error message for a problem with an unlimited objective function.
FIG. 17.69 Graphical solution for Example 17.5 with an unfeasible solution (constraints 5x1 + 4x2 ≥ 40 and 2x1 + x2 ≤ 6, with x1, x2 ≥ 0, have no point in common).
FIG. 17.70 Error message for a problem with an unfeasible solution.


FIG. 17.71 Answer Report for the problem at Venix Toys.

The machining, painting, and assembly human resources availability constraints are represented by rows 19, 20, and 21, respectively, while the non-negativity constraints of each decision variable are represented by rows 22 and 23. Cells D8, D9, and D10 represent the total number of hours used or the amount of resources necessary for the machining, painting, and assembly departments, respectively. The value of each cell (column D) can be obtained by substituting the optimal values of each decision variable (x1 = 70 and x2 = 20) into the left-hand side of each constraint. Column E shows the formula used to represent each constraint. Column F, in turn, presents the status of each constraint: binding or not binding. The Binding status occurs when the total amount of resources used (column D) is equal to the maximum limit available, that is, there is no slack or idleness of resources. As shown in Fig. 17.29 in Section 17.5.2.1, the quantity of resources available for the machining, painting, and assembly activities is represented by cells F8, F9, and F10, respectively. The Not Binding status indicates that the maximum capacity of resources has not been used. The Slack field, in turn, indicates the difference between the total amount of resources available and the total amount of resources used, that is, the amount of idle resources. For example, for the machining sector, of the total amount of resources available (36 hours), only 27.5 hours have been used, which generates an idleness of 8.5 hours (slack). For the painting and assembly activities, since the maximum capacity of the resources was used, the slack is zero.
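The Slack column is simply the difference between each constraint's right-hand side and its left-hand side evaluated at the optimal plan. The Python sketch below reproduces that arithmetic; the coefficient matrix and two of the availability figures are placeholders (the actual values appear in Fig. 17.29, not reproduced here), chosen so that the machining row shows the 8.5-hour slack reported above.

# Slack of "<=" constraints at the optimal solution: slack_i = b_i - (A @ x*)_i,
# with zero slack for binding rows. A and most of b are hypothetical values.
import numpy as np

A = np.array([[0.25, 0.50],        # hypothetical hours per unit, by department
              [0.10, 0.75],
              [0.20, 0.30]])
b = np.array([36.0, 22.0, 20.0])   # hours available (36 is the machining figure)
x_opt = np.array([70.0, 20.0])     # optimal plan of Venix Toys (x1, x2)

slack = b - A @ x_opt
print(slack, np.isclose(slack, 0.0))   # [8.5 0. 0.] and the binding flags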

17.5.4.2 Limits Report The main results provided by the Limits Report refer to the lower and upper limits of each variable cell (decision variable). Fig. 17.72 shows the Limits Report for the problem faced by Venix Toys. According to Fig. 17.72, we can see that the results of the Limits Report are divided into two main parts: objective cell and variable cells. Analogous to the Answer Report, the Limits Report also provides the optimal value of the objective cell.

FIG. 17.72 Limits Report for the problem at Venix Toys.


The data regarding each one of the variable cells, in turn, are represented in rows 13 and 14 in Fig. 17.72. Analogous to the Answer Report, the optimal value of each variable cell is also provided by the Limits Report (column D). Column F presents the lower limits of the variable cells, which refer to the minimum value each variable can take on. By assigning the lower limit to one of the variables (x1 = 0) and holding the other at its optimal value (x2 = 20), we obtain a feasible solution with z = 1,200 (cell G13). In contrast, if x1 = 70 and x2 = 0, we obtain another feasible solution with z = 840 (cell G14). Finally, column I shows the upper limits of the variable cells, that is, the maximum values each variable can achieve. In this case, the value of objective function z is 2,040.

17.6 SENSITIVITY ANALYSIS
As presented in Section 16.5 of the previous chapter, one of the hypotheses of a linear programming model is to assume that all the model parameters (objective function coefficients cj, constraint coefficients aij, and independent terms bi) are deterministic, that is, constant and known with certainty. However, many times the estimation of these parameters is based on future forecasts, such that changes may happen until the final solution is implemented in the real world. As examples of changes, we can mention changes in the amount of resources available, the launch of a new product, variation in a product's price, and increases or decreases in production costs, among others. Therefore, sensitivity analysis is essential in the study of linear programming problems, since its main objective is to investigate the effects that certain changes in the model parameters would have on the optimal solution. The sensitivity analysis discusses the variation that the objective function coefficients and the constants on the right-hand side of each constraint can assume (lower and upper limits) without changing the initial model's optimal solution or without changing the feasibility region. This analysis can be done graphically, by using algebraic calculations, or directly through the Solver in Excel or other software packages, such as Lindo, considering one alteration at a time. Therefore, the sensitivity analysis being studied considers two cases: (a) the model's sensitivity analysis based on alterations in one of the objective function coefficients, without changing the model's original basic solution (the basic solution remains optimal) — since one of the objective function coefficients is altered, the value of objective function z also changes; and (b) the model's sensitivity analysis based on alterations in the independent terms of the constraints, without changing the optimal solution or the feasibility region. Thus, we eliminate the need to recalculate a model's new optimal solution after changes in its parameters. Section 17.6.1 graphically analyzes the possible changes in the model's objective function coefficients. The same analysis, based on the independent terms of the constraints, will be studied in Section 17.6.2. Both cases can also be analyzed by Solver in Excel (see Section 17.6.4), always considering one alteration at a time. The sensitivity analysis studied in this section will be described based on Example 17.12.
Example 17.12
Romes Shoes is a shoe company interested in planning its production of flip-flops and clogs for next summer. Its products go through the following processes: cutting, assembly, and finishing. Table 17.E.16 shows the total number of labor hours (man-hours) necessary to produce a unit of each component in each manufacturing process, as well as the total time available per week, also in man-hours. The unit profit per pair of flip-flops and clogs manufactured is $15.00 and $20.00, respectively. Determine the graphical solution for the model.

TABLE 17.E.16 Time Necessary to Produce a Unit of Each Component in Each Manufacturing Process and Total Time Available per Week (man-hours)

Sector      Flip-flops   Clogs   Time available (man-hours/week)
Cutting     5            4       240
Assembly    4            8       360
Finishing   0            7.5     300


Solution
The model's decision variables are:
x1 = number of flip-flops to be produced weekly.
x2 = number of clogs to be produced weekly.
The model's mathematical formulation can be represented as:
max z = 15x1 + 20x2
subject to:
5x1 + 4x2 ≤ 240 (1)
4x1 + 8x2 ≤ 360 (2)
7.5x2 ≤ 300 (3)
xj ≥ 0, j = 1, 2
(17.27)
The current model's optimal solution is x1 = 20 (flip-flops per week) and x2 = 35 (clogs per week), with z = 1,000 (weekly net profit of $1,000.00). Graphically, it is represented in Fig. 17.73.
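Readers who prefer a quick numerical cross-check of this graphical solution can solve the same model with any LP routine; the short Python sketch below (not part of the book's workflow) uses SciPy's linprog and reproduces x1 = 20, x2 = 35, and z = 1,000.

# Numerical check of Example 17.12 (Romes Shoes) with SciPy's linprog.
from scipy.optimize import linprog

c = [-15, -20]                                  # negate to maximize 15x1 + 20x2
A = [[5, 4], [4, 8], [0, 7.5]]                  # cutting, assembly, finishing rows
b = [240, 360, 300]
res = linprog(c, A_ub=A, b_ub=b, method="highs")
print(res.x, -res.fun)                          # [20. 35.] 1000.0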

17.6.1

Alteration in One of the Objective Function Coefficients (Graphical Solution)

Fig. 17.73 presented the graphical solution for Example 17.12, in which extreme point C represents the model's optimal solution (x1 = 20, x2 = 35 with z = 1,000). Now, let's carry out a sensitivity analysis based on changes in the values of the objective function coefficients, performing one alteration at a time. The main objective is to determine the value range that each objective function coefficient can take on, maintaining the other coefficients constant, without impacting the model's basic solution (the basic solution remains optimal). This analysis is based on the comparison between the angular coefficients of the active constraints (treated as equality constraints) and the angular coefficient of the objective function. The active constraints are the ones that define the model's optimal solution. If the model's inactive constraints are eliminated, the optimal solution will not be affected.
Slope and angular coefficient of a line
Let α be the angle formed by the line and the X-axis, measured counterclockwise, called the line slope. The angular coefficient m determines the direction of the line, that is, the trigonometric tangent of slope α:
m = tan(α)

(17.28)

In a graphical way, Fig. 17.74 specifies four cases for m from different values of α. From Fig. 17.74, we can see that every nonvertical line has a real number m that specifies its direction. Case (d) is a special case, since there is no tangent for slope α = 90° and, consequently, there is no angular coefficient m for a vertical line. This relationship can be better visualized in Fig. 17.75, which shows the tangent chart for different values of α. For instance, for case (a) in the previous figure (0° < α < 90°), we can see that tan 0° = 0 and tan 45° = 1. As α gets closer to 90°, the value of m tends to infinity.
FIG. 17.73 Graphical solution for Example 17.12.


FIG. 17.74 Relationship between line slope (a) and angular coefficient (m).

FIG. 17.75 Tangent for different values of a. (Source: https://www.quora.com/Is-tan-x-a-continous-function.)

The angular coefficient can also be calculated from two points on the line. Given two distinct points in the Cartesian plane, A(x1, y1) and B(x2, y2), there is a single line r that goes through these two points. The angular coefficient of r can be calculated as:
m = tan(α) = opposite leg / adjacent leg = Δy/Δx = (y2 - y1)/(x2 - x1)
(17.29)

Angular coefficient of the objective function from its reduced equation
Consider the general equation of an objective function with two decision variables (x1 and x2):
z = c1x1 + c2x2

(17.30)


To determine the reduced equation of Expression (17.30), we have to isolate variable x2 in the general equation:
x2 = -(c1/c2)x1 + z/c2   (17.31)
where m = -c1/c2 is the angular coefficient of the objective function.

Value range for c1 or c2 that does not change the model's original basic solution
By calculating the angular coefficient of each active constraint in the model (treated as an equality constraint) and the angular coefficient of the objective function, be it from the reduced equation of the line or from two points on the line [Expression (17.29)], we can determine the value range for c1 or c2 that does not change the model's original basic solution. Let's illustrate this condition through an example. Going back to Example 17.12, from the graphical solution presented in Fig. 17.73, we can see that only the first two constraints of Expression (17.27) are active. Hence, to carry out the sensitivity analysis from variations in one of the objective function coefficients, we must calculate the angular coefficients of the first and second constraints, treated as equality equations. First, we can see that the slopes of the first equation 5x1 + 4x2 = 240 and of the second equation 4x1 + 8x2 = 360 are within the interval 90° < α < 180°. Therefore, the angular coefficient of both lines will be negative. From Fig. 17.75, we conclude that the value of m1 (angular coefficient of the first equation) will be smaller when compared to m2 (angular coefficient of the second equation). The value of the angular coefficient of each equation will be determined from the reduced equation of the line. The first equation 5x1 + 4x2 = 240 can be written in the reduced form:
x2 = -(5/4)x1 + 60

(17.32)

So, the angular coefficient of the first equation is m1 = -5/4. Analogously, the reduced form of the second equation can be expressed as:
x2 = -(1/2)x1 + 45

(17.33)

We can conclude that the angular coefficient of the second equation is m2 = -1/2. According to Fig. 17.73, we can see that, while the angular coefficient of the objective function (-c1/c2) is between -5/4 and -1/2, that is, between the angular coefficients of the model's first and second equations (active equations), the original problem's basic solution does not change, remaining optimal. Mathematically, the original basic solution remains constant while:
-5/4 ≤ -c1/c2 ≤ -1/2   or   0.5 ≤ c1/c2 ≤ 1.25

(17.34)

Example 17.13
From the problem of company Romes Shoes (Example 17.12), carry out a sensitivity analysis considering the following changes:
(a) Let's assume that there was an increase in the unit profits of flip-flops and clogs to $20.00 and $25.00, respectively, based on reductions in production costs, mainly in terms of human resources. Verify whether the basic solution remains optimal. If so, what is the new value of z?
(b) Which possible variations in c2 would maintain the original model's basic solution? Note: the other parameters remain constant.
(c) Do the same for c1.
(d) Imagine that there was an increase in the price of leather, the main raw material for producing clogs, diminishing its unit profit to $18.00. In order for the original model's basic solution to remain unaltered, which interval must the unit profit of flip-flops satisfy?
Solution
(a) Considering the new objective function equation (z = 20x1 + 25x2), we can determine the ratio c1/c2 = 20/25 = 0.8 directly. Therefore, the condition 0.5 ≤ c1/c2 ≤ 1.25 continues to be satisfied, such that the original model's basic solution (x1 = 20 and x2 = 35) remains optimal. The new value of z is z = 20 × 20 + 25 × 35 = 1,275.
(b) Substituting c1^0 = 15 (original value of c1) in the condition 0.5 ≤ c1/c2 ≤ 1.25, we have:
0.5 ≤ 15/c2 ⟹ c2 ≤ 30 and 15/c2 ≤ 1.25 ⟹ c2 ≥ 12
⟹ 12 ≤ c2 ≤ 30, or c2^0 - 8 ≤ c2 ≤ c2^0 + 10


Thus, while c2 continues satisfying the interval specified here, the original model's optimal basic solution (x1 = 20 and x2 = 35) will remain unaltered.
(c) Substituting c2^0 = 20 (original value of c2) in the condition 0.5 ≤ c1/c2 ≤ 1.25, we have:
0.5 × 20 ≤ c1 ≤ 1.25 × 20 ⟹ 10 ≤ c1 ≤ 25, or c1^0 - 5 ≤ c1 ≤ c1^0 + 10
Therefore, while the conditions specified for c1 continue to be satisfied, the original model's basic solution will remain unaltered.
(d) By substituting c2 = 18 in the condition 0.5 ≤ c1/c2 ≤ 1.25, we have:
0.5 × 18 ≤ c1 ≤ 1.25 × 18 ⟹ 9 ≤ c1 ≤ 22.5
Therefore, while c1 continues satisfying this interval, for a value of c2 = 18, the basic solution remains optimal.

Example 17.14
Consider the following maximization problem:
max z = 15x1 + 20x2
subject to:
4x1 + 8x2 ≤ 360 (1)
x1 ≤ 60 (2)
xj ≥ 0, j = 1, 2

(17.35)

Determine:
(a) The graphical solution for the original model represented in Expression (17.35).
(b) The possible variations in c1 that would maintain the original model's basic solution (the basic solution remains optimal), maintaining the other parameters constant. Do the same for c2.
Solution
Fig. 17.76 shows the graphical solution of the model in Expression (17.35). The current model's optimal solution, represented by extreme point C, is x1 = 60 and x2 = 15 with z = 1,200. From Fig. 17.76, we can see that we have a special case, since the model's second constraint in Expression (17.35) corresponds to a vertical line (slope of 90° in relation to axis x1). When one of the active constraints is vertical, there will be no upper or lower limit for the angular coefficient of the objective function, since there is no tangent for α = 90°. Now, let's carry out a sensitivity analysis from variations in c1 or c2. In order to do that, we need to calculate the angular coefficient of the first constraint (treated in the equality form), be it from its reduced equation or from two points on the line. Let's use the first case. The reduced form of the first equation in Expression (17.35) is:
x2 = -(1/2)x1 + 45
So, its angular coefficient is -1/2.

FIG. 17.76 Graphical solution for Example 17.14.


According to Fig. 17.76, we can see that, while the angular coefficient of the objective function is between the angular coefficients of the vertical equation and of equation 4x1 + 8x2 = 360, that is, between -∞ (there is no lower limit for the 90° tangent) and -1/2, the basic solution remains optimal. Mathematically, the original basic solution remains constant while:
-∞ ≤ -c1/c2 ≤ -1/2   or   0.5 ≤ c1/c2 ≤ ∞
By setting c2 = 20 (original value) in the condition 0.5 ≤ c1/c2 ≤ ∞, we have:
0.5 × 20 ≤ c1 ≤ ∞ ⟹ 10 ≤ c1 ≤ ∞
Thus, while c1 is within the interval specified, the original model's basic solution (x1 = 60 and x2 = 15) will remain unaltered.
By setting c1 = 15 (original value) in the condition 0.5 ≤ c1/c2 ≤ ∞, we have:
0.5 ≤ 15/c2 ⟹ c2 ≤ 30 and c2 ≥ 15/∞ ≅ 0 ⟹ c2 ≥ 0, therefore 0 ≤ c2 ≤ 30
Therefore, while c2 continues satisfying this condition, the basic solution will remain optimal.

17.6.2 Alteration in One of the Constants on the Right-Hand Side of the Constraint and Concept of Shadow Price (Graphical Solution)
The sensitivity analysis from alterations in the value of one of the constants on the right-hand side of a constraint (availability of resources) is based on the concept of shadow price, which can be defined as the increase (or decrease) of the objective function in case 1 unit is added to (or removed from) the current amount of resources available in the i-th constraint (bi0). The shadow price (Pi) in case 1 unit of resources is added to bi0 is calculated as:
Pi = Δz+1/Δbi,+1 = (z+1 - z0)/(+1)

(17.36)

where:
Δz+1 = increase in the value of objective function z in case 1 unit of resources is added to bi0.
z0 = initial value of the objective function.
z+1 = new value of the objective function after 1 unit is added to bi0.
Δbi,+1 = increase in bi0.
The definition of the shadow price considers an increase of 1 unit in the amount of resource i. The shadow price (Pi) in case 1 unit of resources is removed from bi0 is calculated as:
Pi = Δz-1/Δbi,-1 = (z-1 - z0)/(-1) = (z0 - z-1)/1

(17.37)

where:
Δz-1 = decrease in the value of objective function z in case 1 unit of resources is removed from bi0.
z0 = initial value of the objective function.
z-1 = new value of the objective function after 1 unit is removed from bi0.
Δbi,-1 = decrease in bi0.
The definition of the shadow price considers a decrease of 1 unit in the amount of resource i. The shadow price can be interpreted as the fair price to pay for using 1 unit of resource i, or as the opportunity cost due to the loss of 1 unit of resource i. After defining the shadow price for resource i (Pi), the main goal of this sensitivity analysis is to determine the value range in which bi can vary (maximum increase allowed of p units in bi0 or decrease allowed of q units in bi0), that is, bi0 - q ≤ bi ≤ bi0 + p, in which the shadow price remains constant. The interval must be determined in order to satisfy the following relationship:
Pi = Δz+p/Δbi,+p = (z+p - z0)/p = Δz-q/Δbi,-q = (z0 - z-q)/q
where:

(17.38)


Δz+p = increase in the value of objective function z in case p units of resources are added to bi0.
Δz-q = decrease in the value of objective function z in case q units of resources are removed from bi0.
z0 = initial value of the objective function.
z+p = new value of the objective function after p units are added to bi0.
z-q = new value of the objective function after q units are removed from bi0.
Δbi,+p = increase of p units in bi0.
Δbi,-q = decrease of q units in bi0.
Thus, for the interval specified in which the shadow price remains constant, if p units were added to bi0, the value of the objective function would increase by Δz+p = Pi × p. This quantity can also be interpreted as the fair price to pay for using p units of resource i, being proportional to the shadow price. Analogously, if q units were removed from bi0, the value of the objective function would decrease by Δz-q = Pi × q. This quantity can also be interpreted as the opportunity cost due to the loss of q units of resource i. The calculation of the shadow price is only valid for active constraints (which define the model's optimal solution). Otherwise, variations in bi within the feasibility region will not impact the model's optimal solution, so the shadow price of nonactive constraints is zero. Solving the problem in an analytical or algebraic way, this means that the current model's optimal solution (which contains a value of bi different from the original one, yet within the interval specified above in which the shadow price remains constant) will have the same basic variables as the original model's optimal solution (the original basic solution remains optimal); however, the values of the decision variables and of the objective function are altered due to the changes in bi. Solving the problem in a graphical way, as the amount of resources bi varies, the i-th constraint moves parallel to the i-th original constraint. Nevertheless, the current model's optimal solution continues to be determined by the intersection of the same active lines as in the original model (the intersection between the altered i-th constraint and another active constraint from the initial model). While the intersection between these lines happens inside the feasibility region, that is, between the extreme points that limit the feasible solution space analyzed, the increase in the value of the objective function due to the use of p units of resource i, or the decrease due to the loss of q units of the same resource, will be proportional to the shadow price (the shadow price will remain constant for the interval bi0 - q ≤ bi ≤ bi0 + p, in which bi0 represents its original value). For any value of bi out of this range, it will be necessary to recalculate the model's new optimal solution, since the feasible region is altered.
Example 17.15
Similar to Example 17.13, the sensitivity analysis based on changes in the resources will also be applied to the case of Romes Shoes (Example 17.12). From the model's graphical solution, determine:
(a) The shadow price for each sector (cutting, assembly, and finishing).
(b) The maximum permissible decrease and increase for each bi that would maintain its shadow price constant (when it is positive), or that would not change the initial model's optimal solution (when the shadow price is null).
Solution
As presented in Expression (17.27), Example 17.12 of company Romes Shoes can be mathematically represented as:
max z = 15x1 + 20x2
subject to:
5x1 + 4x2 ≤ 240 (1) (cutting)
4x1 + 8x2 ≤ 360 (2) (assembly)
7.5x2 ≤ 300 (3) (finishing)
xj ≥ 0, j = 1, 2

(17.39)

The original model's optimal solution is x1 = 20 and x2 = 35 with z = 1,000.
Changes in the availability of resources in the cutting sector
(a) Shadow price
If the time available for the cutting sector increases by one man-hour, the first constraint in Expression (17.39) becomes 5x1 + 4x2 ≤ 241. The new optimal solution is then determined by the intersection between the active lines 5x1 + 4x2 = 241 and 4x1 + 8x2 = 360, being represented by point H (x1 = 20.333 and x2 = 34.833 with z = 1,001.667), as shown in Fig. 17.77. The shadow price for the cutting sector (P1), considering an increase of 1 man-hour in the availability of resources, can be calculated as:
P1 = (1,001.667 - 1,000)/(241 - 240) = 1.667


FIG. 17.77 Sensitivity analysis after adding 1 man-hour to the availability of resources in the cutting sector.

Thus, for each man-hour added to the cutting sector, the objective function increases by 1.667; in other words, the fair price paid for each man-hour used in the cutting sector is 1.667. If there were a reduction of 1 man-hour in the cutting sector, we would obtain the same result for the shadow price. By changing the value of the constant of the first constraint to b1 = 239, the model's new optimal solution becomes x1 = 19.667 and x2 = 35.167 with z = 998.333. The calculation of the shadow price, considering a decrease of 1 man-hour for the cutting sector, is:
P1 = (1,000 - 998.333)/(240 - 239) = 1.667
Hence, for each man-hour removed from the cutting sector, the objective function decreases by 1.667; in other words, the opportunity cost for each man-hour lost in the cutting sector is 1.667.
(b) Maximum permissible decrease and increase for b1
The main objective is to determine the value range in which b1 can vary (b1^0 - q ≤ b1 ≤ b1^0 + p) and in which the shadow price remains constant. While this happens, the price to be paid for the use of p man-hours in the cutting sector will be P1 × p = 1.667 × p. Analogously, the opportunity cost due to the loss of q man-hours in the same sector will be P1 × q = 1.667 × q. From Fig. 17.77, we can see that the original model's optimal solution is determined by the intersection of the lines 5x1 + 4x2 = 240 and 4x1 + 8x2 = 360. Also note that the new constraint 5x1 + 4x2 ≤ 241 is parallel to the original constraint 5x1 + 4x2 ≤ 240. As the value of b1 increases, the line moves in the direction of extreme point G, always parallel to the original constraint. Similarly, as the value of b1 decreases, the line moves in the direction of extreme point D. While the intersection of the equations 5x1 + 4x2 = b1 and 4x1 + 8x2 = 360 occurs within the feasibility region (segment DG), the shadow price will remain constant. Extreme points D and G represent the lower and upper limits for b1. Any point outside this segment will result in a new basic solution. Therefore, the lower and upper limits for b1 can be determined by substituting the coordinates of points D (x1 = 10 and x2 = 40) and G (x1 = 90 and x2 = 0), respectively, into 5x1 + 4x2:
Lower limit for b1 (point D): 5 × 10 + 4 × 40 = 210
Upper limit for b1 (point G): 5 × 90 + 4 × 0 = 450
We can conclude that, while the value of b1 is within the interval 210 ≤ b1 ≤ 450, its shadow price remains constant. The interval can also be specified based on the maximum permissible decrease and increase for b1^0 = 240 (its original value), being expressed as:

b1^0 - 30 ≤ b1 ≤ b1^0 + 210
For example, for a value of p = 210, the price to be paid for the use of these 210 man-hours in the cutting sector will be P1 × 210 = 1.667 × 210 = 350 (if 210 man-hours were added to the total time available for the cutting sector, the objective function would increase by $350.00). Conversely, for a value of q = 30, the opportunity cost due to the loss of these 30 man-hours in the total time available in the cutting sector will be P1 × 30 = 1.667 × 30 = 50 (if 30 man-hours were removed from the total time available for the cutting sector, the objective function would decrease by $50.00). For any value of b1 outside this interval, it is necessary to recalculate the new optimal solution, because the feasible region is altered. These results can be better visualized in Fig. 17.78. Polygon AEDB represents the feasibility region for a value of b1 = 210 (lower limit for b1), whereas polygon AEDG represents the feasibility region for a value of b1 = 450 (upper limit for b1).
Changes in the availability of resources in the assembly sector
(a) Shadow price
By adding 1 man-hour to the assembly sector, the second constraint of Expression (17.39) becomes 4x1 + 8x2 ≤ 361. The new optimal solution is then determined by the intersection between the active lines 5x1 + 4x2 = 240 and 4x1 + 8x2 = 361, being represented by point I (x1 = 19.833 and x2 = 35.208 with z = 1,001.667), as illustrated in Fig. 17.79.


FIG. 17.78 Sensitivity analysis based on changes in the availability of resources in the cutting sector.

FIG. 17.79 Sensitivity analysis after adding 1 man-hour to the availability of resources in the assembly sector.

Since the new value of the objective function is also 1,001.667 (similar to the cutting sector), we obtain the same shadow price (P2 = 1.667). Therefore, for each man-hour added in the assembly sector, the objective function also increases by 1.667. By reducing the time available for the assembly sector by 1 man-hour (b2 = 359), the model's new optimal solution becomes x1 = 20.167 and x2 = 34.792 with z = 998.333. Thus, the shadow price in this case is also P2 = 1.667. Therefore, for each man-hour removed from the assembly sector, the objective function also decreases by 1.667.
(b) Maximum permissible decrease and increase for b2
Fig. 17.79 illustrates the new constraint 4x1 + 8x2 ≤ 361 for the assembly sector, parallel to the original constraint 4x1 + 8x2 ≤ 360. As the value of b2 increases, the line moves in the direction of extreme point J, always parallel to the original constraint. Similarly, as the value of b2 decreases, the line moves in the direction of extreme point B. While the intersection of the equations 5x1 + 4x2 = 240 and 4x1 + 8x2 = b2 happens within the feasibility region (segment BJ), the shadow price will remain constant. Extreme points B and J represent the lower and upper limits for b2. Any point outside this segment will result in a new basic solution. Hence, the lower and upper limits for b2 can be determined by substituting the coordinates of points B (x1 = 48 and x2 = 0) and J (x1 = 16 and x2 = 40), respectively, into 4x1 + 8x2:
Lower limit for b2 (point B): 4 × 48 + 8 × 0 = 192
Upper limit for b2 (point J): 4 × 16 + 8 × 40 = 384
We can conclude that, while the value of b2 is within the interval 192 ≤ b2 ≤ 384, its shadow price remains constant. The interval can also be specified based on the maximum permissible decrease and increase for b2^0 = 360 (its initial value), being expressed as:
b2^0 - 168 ≤ b2 ≤ b2^0 + 24
For example, for a value of p = 24, the price to be paid for the use of these 24 man-hours in the assembly sector will be P2 × 24 = 1.667 × 24 = 40 (if 24 man-hours were added to the total time available for the assembly sector, the objective function would increase by $40.00).


FIG. 17.80 Sensitivity analysis based on changes in the availability of resources in the assembly sector.

On the other hand, for a value of q = 168, the opportunity cost due to the loss of these 168 man-hours in the total time available for the assembly sector will be P2 × 168 = 1.667 × 168 = 280 (if 168 man-hours are removed from the total time available for the assembly sector, the objective function would decrease by $280.00). For any value of b2 outside this interval, it is necessary to recalculate the new optimal solution. These results can be better visualized in Fig. 17.80. Polygon ABJE represents the feasibility region for a value of b2 = 384 (upper limit for b2), whereas triangle ABK represents the feasibility region for a value of b2 = 192 (lower limit for b2).
Changes in the availability of resources in the finishing sector
(a) Shadow price
As presented in Fig. 17.77, the original model's optimal solution is determined by the intersection of the first two equations, 5x1 + 4x2 = 240 and 4x1 + 8x2 = 360. Since the finishing constraint (7.5x2 ≤ 300) is not active, changes in the value of b3 within the feasibility region will not impact the original model's optimal solution. So, its shadow price is zero.
(b) Maximum permissible decrease and increase for b3
Since the constraint 7.5x2 ≤ 300 is not active, the main goal here is to determine the value range in which b3 can vary without changing the initial model's optimal solution (x1 = 20 and x2 = 35 with z = 1,000). To determine the lower limit for b3, we just need to substitute the value of coordinate x2 of optimal point C (x2 = 35) into 7.5x2 = b3 (intersection of the lines 4x1 + 8x2 = 360 and 7.5x2 = b3). So, the lower limit for b3 is 7.5 × 35 = 262.5. In contrast, any value of b3 above its initial value will continue not to impact the initial model's optimal solution, such that the value range for b3 can be written as:
262.5 ≤ b3
The interval can also be specified based on the maximum permissible decrease and increase for b3^0 = 300 (its initial value), being expressed as:
b3^0 - 37.5 ≤ b3
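The same shadow prices can be confirmed numerically by applying the definition in Expression (17.36) with any LP routine: re-solve the model with one additional man-hour in a single sector and take the change in z. The Python sketch below (a cross-check, not the book's procedure) does this for the three sectors of Example 17.12 and returns approximately 1.667, 1.667, and 0, in agreement with the graphical analysis above.

# Finite-difference shadow prices for Example 17.12: P_i = z(b_i + 1) - z(b).
from scipy.optimize import linprog

c, A, b0 = [-15, -20], [[5, 4], [4, 8], [0, 7.5]], [240.0, 360.0, 300.0]

def max_z(b):
    return -linprog(c, A_ub=A, b_ub=b, method="highs").fun

z0 = max_z(b0)
for i, sector in enumerate(["cutting", "assembly", "finishing"]):
    b_plus = list(b0)
    b_plus[i] += 1.0                                   # one extra man-hour in sector i
    print(sector, round(max_z(b_plus) - z0, 3))        # 1.667, 1.667, 0.0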

17.6.3

Reduced Cost

The reduced cost of a certain nonbasic variable xj can be interpreted as the value by which its original coefficient cj must improve in the objective function before the current basic solution becomes suboptimal and xj becomes a basic variable. For a maximization problem, the reduced cost of the nonbasic variable xj in the optimal tabular form (cj*) corresponds to the maximum increase in the value of its original coefficient in the objective function (addition of cj* units to the value of cj) that will keep the current basic solution optimal and variable xj nonbasic. Any increase greater than cj* will make the current basic solution suboptimal, such that xj will enter the basis. Alternatively, for a minimization problem, cj* corresponds to the minimum decrease in the value of its original coefficient in the objective function (subtraction of cj* units from the value of cj) that will keep the current basic solution optimal and variable xj nonbasic. Any decrease greater than cj* will make the current basic solution suboptimal and variable xj basic. According to Winston (2004), when the coefficient of the nonbasic variable xj improves by a value exactly equal to its reduced cost, we have a case with multiple optimal solutions. In this case, there is at least one solution in which the variable


xj becomes basic and another in which the variable xj continues to be nonbasic. By contrast, for any improvement greater than its reduced cost (an increase, in a maximization problem, or a decrease, in a minimization problem), the variable xj will always be basic in any optimal solution. A special case may happen when the nonbasic variables simply cannot become basic, that is, their reduced cost will continue to be null, since they are not active (they do not influence the model's optimal solution).
Example 17.16
Consider the following maximization problem:
max z = 3x1 + 6x2
subject to:
2x1 + 3x2 ≤ 60 (1)
4x1 + 2x2 ≤ 120 (2)
x1, x2 ≥ 0

(17.40)

Through a sensitivity analysis of the problem under study, we obtain the reduced cost of each variable, based on the optimal solution of the model in Expression (17.40), as shown in Table 17.E.17:

TABLE 17.E.17 Optimal Solution and Reduced Cost of Each Variable

Variable   Optimal Solution   Reduced Cost
x1         0                  1
x2         20                 0

Interpret the results presented in Table 17.E.17.
Solution
The model's basic solution represented by Expression (17.40) is x1 = 0 and x2 = 20. First, we can see that the reduced cost of the variable x2 is null, since it is a basic variable. Conversely, the reduced cost of the variable x1 represents the maximum increase (maximization problem) in the value of c1 that keeps the current basic solution optimal and variable x1 nonbasic. Thus, if the coefficient of x1 in the objective function goes from 3 to 4, the current basic solution will remain optimal, such that the variable x1 will continue to be nonbasic. On the other hand, if the coefficient of x1 is greater than 4, the current solution will become suboptimal and the variable x1 will be basic in the new optimal solution. It is important to mention that, if the problem represented by Expression (17.40) is solved by Solver in Excel, the reduced cost of x1 will appear with a negative sign in the sensitivity report.
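This behavior can be verified numerically by re-solving the model with a perturbed coefficient; the short Python sketch below (a cross-check, not the book's method) solves Example 17.16 with c1 = 3 and then with c1 = 4.5, showing x1 entering the basis once the increase exceeds the reduced cost of 1 (at exactly c1 = 4 the two vertices tie).

# Reduced-cost check for Example 17.16: raise c1 past 3 + 1 and x1 turns basic.
from scipy.optimize import linprog

A, b = [[2, 3], [4, 2]], [60, 120]
for c1 in (3.0, 4.5):
    res = linprog([-c1, -6.0], A_ub=A, b_ub=b, method="highs")
    print(c1, res.x, -res.fun)   # c1=3.0 -> x=(0,20), z=120 ; c1=4.5 -> x=(30,0), z=135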

Example 17.17

Consider the following minimization problem:

min z = 4x1 + 8x2
subject to:
6x1 + 3x2 ≥ 140   (1)
8x1 + 5x2 ≥ 120   (2)
x1, x2 ≥ 0

(17.41)

Through a sensitivity analysis of the problem in study, we obtain the optimal solution and the reduced cost of each variable in the model represented by Expression (17.41), as seen in Table 17.E.18:

TABLE 17.E.18 Optimal Solution and Reduced Cost of the Variables

Variable    Optimal Solution    Reduced Cost
x1          23.333              0
x2          0                   6


Interpret the results presented in Table 17.E.18.

Solution
The model's basic solution represented by Expression (17.41) is x1 = 23.333 and x2 = 0. First, we can see that the reduced cost of the variable x1 is null, since it is a basic variable. In contrast, the reduced cost of the variable x2 specifies the maximum decrease in the value of c2 (minimization problem) that maintains the current basic solution optimal and variable x2 nonbasic. Hence, if the coefficient of x2 in the objective function goes from 8 to 2, the current basic solution will remain optimal and variable x2 nonbasic. On the other hand, if the coefficient of x2 is less than 2, the current solution will become suboptimal and variable x2 will be basic in the new optimal solution. If the problem represented by Expression (17.41) is solved with Solver in Excel, the reduced cost of x2 will appear with a positive sign in the sensitivity report.
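An analogous check can be made for the minimization case. The sketch below is an added illustration that solves Expression (17.41) with SciPy's linprog; since linprog only accepts constraints of the form Ax ≤ b, the ≥ constraints are multiplied by -1.

# Added illustration: checking the reduced cost of x2 in Example 17.17.
from scipy.optimize import linprog

res = linprog(c=[4, 8],
              A_ub=[[-6, -3], [-8, -5]],   # 6*x1 + 3*x2 >= 140 and 8*x1 + 5*x2 >= 120
              b_ub=[-140, -120],
              method="highs")
print(res.x.round(3), round(res.fun, 2))   # [23.333  0.   ] 93.33
print(res.lower.marginals)                 # approx. [0. 6.] -> reduced cost of x2 is 6

# Dropping c2 from 8 to 2 (a decrease equal to the reduced cost) keeps this
# solution optimal; any larger decrease makes x2 basic, as discussed above.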

17.6.4

Sensitivity Analysis With Solver in Excel

The sensitivity analysis through Solver in Excel will be presented in this section for the problem of company Romes Shoes (Example 17.12). Fig. 17.81 (see file Example17.12_Romes.xls) shows the modeling of the problem in Excel, already with the model’s optimal solution. Having solved the model through the Solver in Excel, the window Solver Results appears. Select the option Sensitivity Report, as shown in Fig. 17.82. The results of the sensitivity analysis for the problem of Rome Shoes, considering the changes in one of the objective function coefficients (Section 17.6.1), changes in one of the constants on the right-hand side and concept of shadow price (Section 17.6.2), and the concept of reduced cost (Section 17.6.3), from the Solver in Excel, are consolidated in Fig. 17.83. Lines 4 and 5 show the results of the sensitivity analysis based on changes in one of the objective function coefficients (Section 17.6.1), based on the final value of variable cells (B14 and C14), which represent the model’s decision variables. According to column D, these values are x1 ¼ 20 and x2 ¼ 35. Column E shows the reduced cost of each variable. Since both are basic, their values are null. If one of the variables were nonbasic, its reduced cost would appear with a negative sign in the sensitivity report in Excel, that is, with the opposite sign to the one presented in this book, as already mentioned previously. Analogously, for a minimization problem, the costs reduced of the nonbasic variables are presented with a positive sign in the sensitivity report in Excel. The initial values of the coefficients of each variable in the objective function are presented in column F. On the other hand, columns G and H show the maximum permissible increase and decrease for each coefficient, from its initial value, the other parameters remaining constant, without changing the original model’s optimal basic solution. Lines 10, 11, and 12 show the results of the sensitivity analysis based on changes in the amount of resources of each one of the model constraints (Section 17.6.2). By substituting the optimal values of each variable on the left-hand side of each constraint, we obtain the optimal number of resources necessary for each sector, as shown in column D. These values can also be updated in Fig. 17.81, if the option Keep Solver Solution is chosen in the window Solver Results. The shadow price

FIG. 17.81 Modeling of the problem of Romes Shoes in Excel (unit cost: x1 slippers 15, x2 clogs 20; hours used per sector: cutting 5x1 + 4x2 = 240 ≤ 240 hours available, assembly 4x1 + 8x2 = 360 ≤ 360, finishing 7.5x2 = 262.5 ≤ 300; quantities produced: x1 = 20, x2 = 35; z total profit $1,000.00).


FIG. 17.82 Option Sensitivity Report in Solver Results.

FIG. 17.83 Sensitivity report of the problem at Romes Shoes.

(price paid for the use or opportunity cost due to the loss of 1 unit of each resource) is presented in column E. The initial amount available of each resource is presented in column F. Alternatively, the maximum permissible increase and decrease for each resource that maintains its shadow price constant or within the feasibility region, from its initial value, are specified in columns G and H, respectively.
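The same quantities reported by Solver can be reproduced with an ordinary LP solver. The sketch below is an added illustration using SciPy's linprog (SciPy 1.7 or later with the HiGHS methods is assumed for the multipliers); the data are those of the Romes Shoes model in Fig. 17.81.

# Added illustration: an analog of the sensitivity report of Fig. 17.83.
from scipy.optimize import linprog

res = linprog(c=[-15, -20],                        # maximize 15*x1 + 20*x2
              A_ub=[[5, 4], [4, 8], [0, 7.5]],
              b_ub=[240, 360, 300],
              method="highs")

print("x =", res.x, " z =", -res.fun)              # [20. 35.]  1000.0
print("slack =", res.slack)                        # [ 0.  0. 37.5] -> finishing inactive
# The constraint multipliers refer to the minimization of -z, hence the sign change:
print("shadow prices =", -res.ineqlin.marginals)   # approx. [1.667 1.667 0.]
print("reduced costs =", res.lower.marginals)      # [0. 0.] -> both variables basic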

17.6.4.1 Special Case: Multiple Optimal Solutions

As presented in Section 17.2.3.1, in a graphical solution, we can identify a special case of linear programming with multiple optimal solutions when the objective function is parallel to an active constraint. Alternatively, through the Simplex method (see Section 17.4.6.1), we can identify a case with multiple optimal solutions when, in the optimal tabular form, the coefficient of one of the nonbasic variables is null in row 0 of the objective function. According to Ragsdale (2009), it is also possible to identify a case with multiple optimal solutions through the Sensitivity Report of Solver in Excel. This case happens when the permissible increase or decrease in the coefficient of one or more variables in the objective function is zero and we do not have a degenerate solution (see Section 17.6.4.2). For the maximization problem (max z = 8x1 + 4x2) presented in Section 17.2.3.1 (Example 17.3), the graphical solution was obtained from Fig. 17.84. The representation of this problem in Excel can be seen in Fig. 17.85.


FIG. 17.84 Graphical solution for Example 17.3 with multiple optimal solutions.

FIG. 17.85 Representation of Example 17.3 in Excel.

FIG. 17.86 Sensitivity Report for a case with multiple optimal solutions.

By solving this example with Solver in Excel, it was possible to find a feasible solution, obtaining the same message presented in Fig. 17.82, that Solver found a solution and all constraints and optimality conditions were satisfied. However, Solver provided only one of the optimal solutions, x1 = 4 and x2 = 0 with z = 32 (vertex B), without displaying any message about the special case of multiple optimal solutions. By solving the problem through Solver in Excel and selecting the option Sensitivity Report in Solver Results, we obtain Fig. 17.86. As shown in rows 9 and 10 of Fig. 17.86, the permissible decrease in the coefficient of x1 in the objective function and the permissible increase in the coefficient of x2 in the objective function are null. Since there is no degeneration (see Section 17.6.4.2), we have a case with multiple optimal solutions.


Ragsdale (2009) recommends two strategies that can be applied to determine a new optimal solution through Solver in Excel: 1) insert a new constraint into the model that does not change the optimal value of the objective function and maintains the model's feasibility; and 2) when the permissible increase for one of the decision variables is null, maximize the value of this variable (the objective function must be changed to a maximization problem whose objective cell is the respective variable rather than function z); when the permissible decrease for one of the variables is null, minimize the value of this variable (the objective function must be changed to a minimization problem whose objective cell is that variable). For example, if we only use the first strategy, inserting the new constraint x1 − x2 ≥ 1 into the model, a new optimal solution is determined: x1 = 3 and x2 = 2 with z = 32. Through Fig. 17.86, note that the maximum permissible increase in the coefficient of variable x2 in the objective function is zero. Analogously, the permissible decrease in the coefficient of variable x1 in the objective function is also zero. Therefore, regarding the second strategy proposed by Ragsdale, we would have two alternatives: maximize the new objective cell $C$11 that represents variable x2, or minimize the new objective cell $B$11 that represents variable x1. Thus, if we got the same initial solution (x1 = 4 and x2 = 0 with z = 32) by only using the first strategy, then, in addition to inserting the constraint x1 − x2 ≥ 1, we should use one of the alternatives listed for the second strategy. For instance, if we inserted the constraint x1 − x2 ≥ 1 and changed the objective function to a maximization problem that has cell $C$11 as its target, we would also obtain the new optimal solution x1 = 3 and x2 = 2 with z = 32, as shown in Fig. 17.87. On the other hand, instead of constraint x1 − x2 ≥ 1, if we inserted constraint 2x1 + x2 ≥ 8 into the original model and changed the objective function to a minimization problem that has as a target the cell $B$11 that represents the variable x1, we would obtain a new optimal solution, x1 = 2 and x2 = 4 with z = 32, as shown in Fig. 17.88.
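The idea behind these strategies can also be expressed with a generic LP solver. The sketch below is only an added illustration: because Example 17.3's constraints are not reproduced in this section, it uses a hypothetical feasible region (2x1 + x2 ≤ 8, x1 ≤ 4, x2 ≤ 6) that makes z = 8x1 + 4x2 parallel to an active constraint; it then fixes z at its optimal value (a constraint that does not change the optimum) and maximizes x2 to reach a different optimal vertex.

# Added illustration (hypothetical constraints): finding an alternative optimum
# in two stages with SciPy's linprog.
from scipy.optimize import linprog

A, b = [[2, 1], [1, 0], [0, 1]], [8, 4, 6]

# Stage 1: solve the original model; the solver returns one optimal vertex.
res1 = linprog([-8, -4], A_ub=A, b_ub=b, method="highs")
z_star = -res1.fun                              # 32.0

# Stage 2: keep z fixed at z* and maximize x2 (the variable whose permissible
# increase is zero) to move to another vertex of the optimal face.
res2 = linprog([0, -1], A_ub=A, b_ub=b,
               A_eq=[[8, 4]], b_eq=[z_star], method="highs")
print(res1.x, res2.x)   # e.g. [4. 0.] and [1. 6.]: two alternative optima with z = 32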

17.6.4.2 Special Case: Degenerate Optimal Solution

As discussed in Section 17.2.3.4, in a graphical solution, we can identify a degenerate solution when one of the vertices of the feasible region is obtained by the intersection of more than two distinct lines. Through the Simplex method (see Section 17.4.5.4), we can identify a case with a degenerate solution when, in one of the solutions of the Simplex method, the value of one of the basic variables is null. If there is degeneration in the optimal solution, we have a case known as a degenerate optimal solution. As presented in Section 17.4.5.4, the problem with degeneration is that, in some cases, the Simplex algorithm can go into a loop, generating the same basic solutions repeatedly, since it cannot leave that solution space. In this case, the optimal solution will never be achieved. We can detect a case of degeneration through the Sensitivity Report in Solver in Excel when the permissible increase or decrease in the amount of resources of one of the constraints is zero. The same Example 17.6 presented in Section 17.2.3.4 for the case with a degenerate optimal solution will be solved in this section with Solver in Excel. The graphical solution for this example (min z = x1 + 5x2) is shown in Fig. 17.89.

FIG. 17.87 Answer report adding the constraint x1 − x2 ≥ 1 and maximizing x2.


FIG. 17.88 Answer report adding the constraint 2x1 + x2 ≥ 8 and minimizing x1.

FIG. 17.89 Graphical solution for Example 17.6 with a degenerate optimal solution.

Analogous to the case with multiple optimal solutions, by solving Example 17.6 through Solver in Excel, it was also possible to find a feasible solution, obtaining the same message as the one in Fig. 17.82: that Solver found a solution and all constraints and optimality conditions were satisfied. However, Solver does not display any message about the special case of a degenerate solution. By solving the problem above through Solver in Excel and selecting the option Sensitivity Report in Solver Results, we obtain Fig. 17.90. As shown in rows 15 and 17 of Fig. 17.90, the permissible increase in the amount of resources available in the first and third constraints is null, whereas row 16 shows that the permissible decrease in the amount of resources of the second constraint is also zero. Therefore, we have a case with a degenerate optimal solution. In this case, the analysis of the Sensitivity Report may be compromised. Ragsdale (2009) and Lachtermacher (2009) highlight the precautions that must be taken when we identify a case with a degenerate optimal solution:
1. When the permissible increase or decrease in the coefficient of one of the variables in the objective function is also zero, the statement that multiple optimal solutions have occurred is no longer reliable.
2. The reduced costs of the variables may not be unique. Moreover, in order for the optimal solution to change, the coefficients of the variables in the objective function must improve by at least their respective reduced costs.
3. The permissible changes in the coefficients of the variables in the objective function remain valid; however, values outside this interval may also leave the current optimal solution unchanged.
4. The shadow prices, as well as the permissible increase or decrease in the availability of resources of each constraint, may not be unique.


FIG. 17.90 Sensitivity Report for a case with a degenerate solution.

17.7 EXERCISES Section 17.2 (ex.1). Determine the feasible solution space that satisfies each one of the constraints separately, considering x1, x2 0: (a) (b) (c) (d) (e) (f)

3x1 + 2x2  12 2x1 + 3x2 24 3x1  2x2  6 x1  x2 4  x1 + 4x2  16  x1 + 2x2 10

Section 17.2.1 (ex.1). For each maximization function z, determine the direction in which the objective function increases: (a) (b) (c) (d)

max z ¼ 5x1 + 3x2 max z ¼ 4x1  2x2 max z ¼  2x1 + 6x2 max z ¼  x1  2x2

Section 17.2.1 (ex.2). Determine the graphical solution (feasible solution space and the optimal solution) of the following LP maximization problems: (a) max z ¼ 3x1 + 4x2 subject to : 2x1 + 5x2  18 4x1 + 4x2  12 5x1 + 10x2  20 x1 , x2  0 (b) max z ¼ 2x1 + 3x2 subject to : 2x1 + 2x2  10 3x1 + 4x2  24 x2  4 x1 , x2  0 (c) max z ¼ 4x1 + 2x2 subject to : x1 + x2  16 3x1  2x2  36  10 x1 x2  6 x1 , x2  0


Section 17.2.1 (ex.3). Graphically solve Venix Toys’ production mix problem (Example 16.3 presented in Section 16.5.1 of the previous chapter, solved through the Solver in Excel in Section 17.5.2.1 of this chapter). Section 17.2.1 (ex.4). Are the solutions in the feasible region of Venix Toys’ problem? (a) (b) (c) (d) (e) (f) (g) (h) (i)

x1 ¼ 30, x2 ¼ 25 x1 ¼ 30, x2 ¼ 30 x1 ¼ 44, x2 ¼ 24 x1 ¼ 45, x2 ¼ 28 x1 ¼ 75, x2 ¼ 15 x1 ¼ 90, x2 ¼ 14 x1 ¼ 100, x2 ¼ 14 x1 ¼ 120, x2 ¼ 10 x1 ¼ 130, x2 ¼ 5

Section 17.2.2 (ex.1). For each minimization function z, determine the direction in which the objective function decreases: (a) (b) (c) (d)

min z ¼ 5x1 + 8x2 min z ¼ 2x1  3x2 min z ¼  4x1 + 5x2 max z ¼  7x1  5x2

Section 17.2.2 (ex.2). Determine the graphical solution for the following LP minimization problems: (a) min z ¼ 2x1 + x2 subject to : x1  x2  10 2x1 + 3x2  30 x1 ,x2  0 (b) min z ¼ 2x1  x2 subject to : x1  2x2  2  x1 + 3x2  6 x1 , x2  0 (c) min z ¼ 6x1 + 4x2 subject to : 2x1 + 2x2  40 x1 + 3x2  30 4x1 + 2x2  60 x1 , x2  0 Section 17.2.3 (ex.1). Graphically identify in which of the special cases each LP problem finds itself: a) multiple optimal solutions; b) unlimited objective function z; c) there is no optimal solution; d) degenerate optimal solution. (a) max z ¼ 2x1 + x2 subject to : x1 + 4x2  12 4x1 + 2x2  20 3x2  6 x1 ,x2  0 (b) min z ¼ 2x1 + x2 subject to : 4x1 + 5x2  20 x1 + x2  3 x1 , x2  0


(c) max z ¼ 2x1 + 3x2 subject to : 4x1 + 2x2  20 x1  x2  10 x1 ,x2  0 (d) max z ¼ 6x1 + 4x2 subject to : 4x1  4x2  20 3x1 + 2x2  30 x2  12 x1 , x2  0 (e) min z ¼ 2x1 + 3x2 subject to :  x1 + x2  10 4x1 + 2x2  20 4x2  40 x1 ,x2  0 (f) min z ¼ 2x1 + 3x2 subject to :  x1 + x2  10 4x1 + 2x2  20 x1 + x2  4 x1 , x2  0 Section 17.2.3 (ex.2). Graphically determine the alternative optimal solutions for the following LP problems: (a) max z ¼ 6x1 + 4x2 subject to : 3x1 + 2x2  90 2x1 + x2  50 x1 ,x2  0 (b) min z ¼ 2x1 + 3x2 subject to : 4x1  x2  11 4x1 + 6x2  32 x1 , x2  0 Section 17.3 (ex.1) Consider the following LP minimization problem: min z ¼ 3x1 + 2x2 subject to : 8x1 + 5x2  140 4x1 + 3x2  80 x1 , x2  0 By solving the problem in an analytical way, determine: (a) The number of possible basic solutions for this system. (b) The feasible basic solutions for the problem, and graphically represent them. (c) The optimal solution. Section 17.3 (ex.2) Do the same for the following LP maximization problem: max z ¼ 4x1 + 3x2 + 5x3 subject to : 3x1  x2 + 2x3  10 4x1 + 2x2 + 5x3  50 x1 ,x2 ,x3  0


Section 17.4.2 (ex.1) Consider the following LP maximization problem: max z ¼ 4x1 + 5x2 + 3x3 subject to : 2x1 + 3x2  x3  48 x1 + 2x2 + 5x3  60 3x1 + x2 + 2x3  30 x1 ,x2 ,x3  0 Solve the problem through the analytical form of the Simplex method. Section 17.4.3 (ex.1) Solve the production mix problem of Venix Toys through the Simplex method. Section 17.4.3 (ex.2) Use the Simplex method to solve the following LP maximization problems: (a) max z ¼ 3x1 + 2x2 subject to : 3x1  x2  6 x1 + 3x2  12 x1 ,x2  0 (b) max z ¼ 2x1 + 4x2 + 3x3 subject to : x1 + x2 + 2x3  6 2x1 + 2x2 + 3x3  16 x1 + 4x2 + x3  18 x1 ,x2 ,x3  0 (c) max z ¼ 3x1 + x2 + 2x3 subject to : 2x1 + 2x2 + x3  20 3x1 + x2 + 4x3  60 x1 + x2 + 2x3  30 x1 , x2 , x3  0 Section 17.4.3 (ex.3) What is the biggest difficulty in solving the farmer’s problem (Example 16.7 in Section 16.5.4 of the previous chapter, solved through the Solver in Excel in Section 17.5.2.5 of this chapter) through the Simplex method? Section 17.4.4 (ex.1) Use the Simplex method to solve the following LP minimization problems: (a) min z ¼ 2x1  x2 subject to :  2x1 + 6x2  24 8x1 + 2x2  40 x1 , x2  0 (b) min z ¼ 5x1  6x2 subject to :  4x1 + 2x2  10 x1 + 3x2  22 x1 , x2  0 (c) min z ¼ 2x1  x2  x3 subject to : 3x1 + 5x2 + 4x3  120  x1 + 2x2 + 4x3  90 2x1  x2 + 2x3  60 x1 ,x2 ,x3  0 (d) min z ¼ x1 + 3x2  x3 subject to : 4x1  2x2 + 2x3  160 2x1 + 5 x2 + 10x3  200 x1  x2 + x3  50 x1 , x2 ,x3  0


Section 17.4.5.1 (ex.1) Solve the following LP maximization problem through the Simplex method: max z ¼ 4x1  x2 subject to : 3x1  3x2  175 8x1  2x2  460  60 x1 x1 , x2  0 (a) Demonstrate that we have a special case with multiple optimal solutions here. (b) Determine at least two of the alternative optimal solutions. (c) Solve the problem graphically and compare the results obtained. Section 17.4.5.1 (ex.2) Do the same for the LP minimization problem: min z ¼ 3x1 + 6x2 subject to : 2x1 + 4x2  620 7x1 + 3x2  630 x1 ,x2  0 Section 17.4.5.1 (ex.3) Determine all the optimal FBS of the following maximization problem: max z ¼ 4x1 + 4x2 subject to : x1 + x2  1 x1 ,x2  0 Section 17.4.5.2 (ex.1) Demonstrate that the LP maximization problem has an unlimited objective function z. max z ¼ 5x1 + 2x2 subject to : 2x1  3x2  66  9x1  3x2  99 x1 , x2  0 Section 17.4.5.2 (ex.2) Demonstrate that the LP maximization problem has an unlimited objective function z. min z ¼ 3x1  2x2 subject to :  2x1 + x2  12  3x1  2x2  24 x1 , x2  0 Determine a feasible basic solution with z ¼ 90. Section 17.4.5.3 (ex.1) Demonstrate that the LP maximization problem has an unfeasible solution. max z ¼ 18x1 + 12x2 subject to : 4x1 + 16x2  1850  8x1  5x2  4800 x1 ,x2  0 Section 17.4.5.3 (ex.2) Do the same for the following minimization problem: min z ¼ 7x1 + 5x2 subject to : 6x1 + 4x2  24 x1 + x2  3 x1 ,x2  0


Section 17.4.5.4 (ex.1) Demonstrate that the LP maximization problem has a degenerate optimal solution. max z ¼ 2x1 + 3x2 subject to : x1  x2  10 2x1 + 3x2  90  24 x1 x1 , x2  0 Section 17.4.5.4 (ex.2) Do the same for the minimization problem. min z ¼ 6x1 + 8x2 subject to : 2x1 + 4x2  60 5x1  4x2  80 3x1 + 8x2  100 x1 , x2  0 Section 17.4.5 (ex.1) Through the Simplex method, identify in which of the special cases each LP problem finds itself: a) multiple optimal solutions; b) unlimited objective function z; c) there is no optimal solution; d) degenerate optimal solution. (a) max z ¼ x1 + 3x2 subject to : 2x1 + 6x2  48 3x1 + 5x2  60 x1 + 8x2  6 x1 , x2  0 (b) min z ¼ 2x1  6x2 subject to : 3x1 + 2x2  24 2x1 + 6x2  30 x1 , x2  0 (c) max z ¼ 2x1 + x2 subject to : 8x1 + 4x2  600 4x1 + 2x2  300 x1 , x2  0 Section 17.4.5 (ex.2) From each one of the tabular forms presented, identify if we have a special case of the Simplex method or not. If yes, determine the special case in which the problem analyzed finds itself: a) multiple optimal solutions; b) unlimited objective function z; c) there is no optimal solution; d) degenerate optimal solution. In each tabular form, we specify if the original problem is a maximization or a minimization. (a) Maximization problem Coefficients

Basic Variable   Equation   z   x1   x2   x3   x4   Constant
z                0          1   8    0    2    0    20
x2               1          0   2    1    1    0    10
x4               2          0   3    0    1    1    18


(b) Minimization problem

Basic Variable   Equation   z   x1   x2   x3    x4   x5    Constant
z                0          1   0    10   1     0    2     60
x4               1          0   0    2    7/3   1    1/3   14
x1               2          0   1    0    2/3   0    1/3   10

(c) Minimization problem

Basic Variable   Equation   z   x1   x2    x3   x4   a1      a2      Constant
z                0          1   0    7/4   0    0    M+7/4   M+3/4   86
x1               1          0   1    1/4   0    0    1/4     1/4     2
x3               2          0   0    1/4   1    0    1/4     1/4     0
x4               3          0   0    1/5   0    1    1/5     3/4     5

(d) Maximization problem

Basic Variable   Equation   z   x1   x2    x3   x4    Constant
z                0          1   0    0     0    2     3,000
x3               1          0   0    8/3   1    2/3   1,120
x1               2          0   1    2/3   0    1/3   500

(e) Minimization problem

Basic Variable   Equation   z   x1   x2   x3   x4   Constant
z                0          1   2    5    0    0    0
x3               1          0   3    6    1    0    840
x4               2          0   1    5    0    1    500


Section 17.5.2 (ex.1). Consider Exercise 1 proposed in Section 16.5.1 of the previous chapter, regarding the production mix problem of company KMX: (a) Represent the problem in an Excel spreadsheet. (b) Determine the optimal solution through the Solver in Excel. Section 17.5.2 (ex.2). Do the same for Exercise 2 proposed in Section 16.5.1 of the previous chapter, regarding the production mix problem of company Refresh. Section 17.5.2 (ex.3). Do the same for Exercise 3 of Section 16.5.1 of the previous chapter, regarding the company Golmobile. Section 17.5.2 (ex.4). Do the same for Exercise 1 of Section 16.5.2 of the previous chapter, regarding the petroleum mix problem. Section 17.5.2 (ex.5). Do the same for Exercise 1 of Section 16.5.4 of the previous chapter, regarding the capital budget problem of company GWX. Section 17.5.2 (ex.6). Do the same for Exercise 1 of Section 16.5.5 of the previous chapter, regarding the portfolio optimization problem. Section 17.5.2 (ex.7). Do the same for Exercise 3 of Section 16.5.5 of the previous chapter, regarding the portfolio optimization problem of CTA Investment Bank. Section 17.5.2 (ex.8). Do the same for Exercise 1 of Section 16.5.7 of the previous chapter, regarding the aggregate planning problem of company Pharmabelz. Section 17.6.1 (ex.1). Company Solutions manufactures two types of thermometers: digital and mercury ones. Each digital thermometer guarantees a net unit profit of $7.00, while a mercury thermometer generates a net unit profit of $5.00. Manufacturing these 2 types of thermometers requires 3 types of operations. To manufacture one digital thermometer, 4, 5, and 2 minutes in each one of the operations are necessary. Whereas a mercury thermometer requires 2, 3, and 3 minutes for each operation. The availability for each operation is 300, 360, and 180 minutes. (a) Determine the model’s graphical solution. (b) Determine the maximum permissible increase in the net unit profit of a digital thermometer that would maintain the original basic solution unaltered. Assume that the other model parameters remain constant. (c) Determine the maximum permissible decrease in the unit profit of a mercury thermometer that would maintain the original basic solution unaltered, assuming that the other parameters remain constant. (d) Assuming that there was a reduction in the unit profit of digital thermometers to $3.00, check and see if the original model’s optimal solution remains optimal. (e) Assuming that there was an increase in the unit profit of mercury thermometers to $10.00, verify if the original model’s optimal solution remains optimal. Section 17.6.1 (ex.2). Consider the following maximization problem: max z ¼ 8x1 + 6x2 subject to : 2x1 + 5x2  30 3x1 + 6x2  54 2x1 + 8x2  64 x1 , x2  0 (a) Determine the model’s graphical solution. (b) What is the value range in which c1 can vary that would maintain the original basic solution unaltered, assuming that c2 remains constant? (c) Determine the value range in which c2 can vary that would maintain the original basic solution unaltered, assuming that c1 remains constant. Section 17.6.1 (ex.3). Consider the following minimization problem: min z ¼ 8x1 + 6x2 subject to : 2x1 + 5x2  60 3x1 + 6x2  102 2x1 + 8x2  128 x1 , x2  0

Solution of Linear Programming Problems Chapter

17

831

(a) Determine the model’s graphical solution. (b) In the original model’s optimal solution, we verified that variable x2 is basic. Thus, if the non-negativity constraint is not established over the possible variations of c2, which problem will occur? (c) What is the value range in which c1 can vary that would maintain the original basic solution unaltered, assuming that c2 remains constant? (d) Determine the value range in which c2 can vary that would maintain the original basic solution unaltered, assuming that c1 remains constant. Section 17.6.1 (ex.4). Consider Venix Toys’s production mix problem (Example 16.3 of the previous chapter): (a) Determine the optimality condition (c1/c2) that would maintain the original model’s basic solution unaltered. (b) Let’s assume that there was a simultaneous reduction in the unit profits of toy cars and tricycles to $10.00 and $50.00, respectively, due to the competition in the market. Verify if the original model’s basic solution remains optimal and determine the new value of z. (c) What are the possible changes in the unit profit of toy cars that would maintain the original model’s basic solution unaltered? Assume that the other model parameters remain constant. (d) What is the value range in which the unit profit of tricycles can vary without impacting the original model’s basic solution? Assume that the other parameters remain constant. (e) If there is a reduction in the unit profit of toy cars to $9.00, will the original model’s basic solution remain optimal? In this case, what is the new value of the objective function? (f) If there is an increase in the unit profit of tricycles to $80.00, will the original model’s basic solution be affected (the other parameters remain constant)? What is the new value of the objective function after these changes? (g) Imagine that there has been a significant reduction in the production costs of tricycles, increasing their unit profit to $100.00. In order for the original model’s basic solution to remain unaltered, which interval must the unit profit of toy cars satisfy? Section 17.6.2 (ex.1). Once again, consider the production mix problem of Venix Toys: (a) Determine the shadow price for the machining, painting, and assembly departments. (b) Determine the value range in which each bi can vary that maintains the shadow price constant. (c) If the availability in the machining sector increases to 40 hours, what will be the increase in the value of objective function? (d) If the availability in the painting sector is reduced to 18 hours, what will be the decrease in the value of the objective function? Also determine the new values of x1 and x2. Section 17.6.2 (ex.2). Consider Exercise 1 proposed in Section 17.6.1 of company Solutions: (a) If the availability of each operation increases in 1 minute, which one of them must have priority? (b) Determine the maximum permissible increase and decrease (p and q minutes, respectively) in b1, b2, and b3 that maintains the shadow price constant. (c) What is the fair price to pay for using p minutes of b2? (d) What is the opportunity cost due to the loss of q minutes of b3? Section 17.6.3 (ex.1). Consider the following maximization problem: max z ¼ 3x1 + 2x2 subject to : x1 + x2  6 5x1 + 2x2  20 x1 , x2  0 Table 17.1 shows the initial tabular form of the model, Table 17.2 shows the tabular form in the first iteration, and Table 17.3 the optimal tabular form of the same problem. TABLE 17.1 Row 0 of the Initial Tabular Form Coefficients Equation

Equation   z   x1   x2   x3   x4   Constant
0          1   3    2    0    0    0


We would like you to: (a) Interpret the reduced costs of Tables 17.2 and 17.3. (b) Determine the values of z11, z12, z∗1, and z∗2. Section 17.6.3 (ex.2). Consider the following minimization problem: min z = 4x1 − 2x2 subject to: 2x1 + x2 ≤ 10, x1 − x2 ≤ 8, x1, x2 ≥ 0. Tables 17.4 and 17.5 show the initial tabular form and the optimal tabular form of the model, respectively. We would like you to: (a) Interpret the reduced costs of Table 17.5. (b) Determine the values of z∗1 and z∗2.

TABLE 17.2 Row 0 of the Tabular Form in the First Iteration

Equation   z   x1   x2    x3   x4    Constant
0          1   0    4/5   0    3/5   12

TABLE 17.3 Row 0 of the Optimal Tabular Form

Equation   z   x1   x2   x3    x4    Constant
0          1   0    0    4/3   1/3   44/3

TABLE 17.4 Row 0 of the Initial Tabular Form

Equation   z   x1   x2   x3   x4   Constant
0          1   4    2    0    0    0

TABLE 17.5 Row 0 of the Optimal Tabular Form

Equation   z   x1   x2   x3   x4   Constant
0          1   8    0    2    0    20


Section 17.6.4 (ex.1). Consider Exercise 1 of Section 17.6.1 of company Solutions: (a) Solve it through the Solver in Excel. (b) Through the Solver Sensitivity Report, determine the maximum permissible increase and decrease in c1 that maintains the original basic solution unaltered. (c) Through the Solver Sensitivity Report, determine the maximum permissible increase and decrease in c2 that maintains the original basic solution unaltered. (d) Through the Solver Sensitivity Report, determine the shadow price for each operation. (e) Through the Solver Sensitivity Report, determine the maximum permissible increase and decrease in b1, b2, and b3 that maintains the shadow price constant. Section 17.6.4 (ex.2). Do the same for the production mix problem of company Venix Toys. Section 17.6.4 (ex.3). Through the Solver Sensitivity Report, identify if the problems belong to the special case “multiple optimal solutions” or “degenerate optimal solution.” (a) max z ¼ 4x1 + 2x2 subject to : 6x1 + 2x2  240 2x1 + 3x2  200 3x1 + x2  120 x1 ,x2  0 (b) max z ¼ 3x1 + 8x2 subject to : 2x1 + 2x2  300 5x1 + 4x2  800 9x1 + 24x2  1, 080 x1 ,x2  0 (c) max z ¼ 2x1 + 6x2 subject to : 2x1 + 2x2  600 2x1 + 8x2  800 x1  8x2  0 x1 ,x2  0 (d) max z ¼ 4x1 + 2x2 subject to : 6x1 + 2x2  240 2x1 + 3x2  200 8x1 + 4x2  240 x1 , x2  0 (e) min z ¼ 6x1 + 3x2 subject to : 4x1 + 2x2  832 7x1 + 3x2  714 2x1 + 9x2  900 x1 ,x2  0 (f) min z ¼ 4x1 + 5x2 subject to : 2x1 + 3x2  675 2x1 + 5x2  1, 125 3x1 + 4x2  900 x1 ,x2  0 (g) min z ¼ 2x1 + x2 subject to : 4x1 + 8x2  1, 920 3x1 + 2x2  600 7x1 + 3x2  1, 050 x1 ,x2  0

Chapter 18

Network Programming

I understand reason to be, not the ability to ratiocinate, which may be well or poorly employed, but the sequencing of truths that can only produce truths, and one truth cannot be contrary to the other.
Gottfried Wilhelm von Leibniz

18.1 INTRODUCTION A network programming problem is modeled through a graph structure or network that consists of various nodes, in which each node must be connected to one or more arcs. Network models are increasingly utilized in various business areas, such as production, transportation, facility location, project management, finances, and others. Many of them may be formulated as linear programming problems (LP) and, therefore, may be solved by the Simplex method. Network modeling facilitates visualization and understanding of system characteristics. Thus, simplified versions of the Simplex method may be used for solving LP problems in networks. Additionally, other more efficient algorithms and software are being proposed and utilized for solving models in networks. Among the main problems in network programming are the classic transportation problem, the transshipment problem, the job assignment problem, the shortest path problem, and the maximum flow problem. Each one of the problems listed here will be studied in this chapter. We will initially present the mathematical modeling of each problem, as well as its solution using Excel Solver. In the case of the classic transportation problem, we will also describe how to solve it by using the transportation algorithm, which is a simplification of the Simplex method.

18.2 TERMINOLOGY OF GRAPHS AND NETWORKS A graph is defined as a set of nodes or vertices and a set of arcs or edges interconnecting these nodes. The nodes, drawn as circles or points, may represent facilities (such as factories, distribution centers, terminals, or seaports), or workstations, or intersections. The arcs, illustrated as line segments, make connections between pairs of nodes, and can represent paths, routes, wires, cables, channels, among others. The notation for a graph is G ¼ (N, A), in which N is a set of nodes and A is a set of arcs. Fig. 18.1 shows an example of a graph with five nodes and eight arcs. Many times, the arcs of a graph that make connections between nodes are associated to a numerical variable called a flow that represents a measurable characteristic of that connection as a distance between nodes, transportation cost, time expended, dimensions of the wire, number of parts transported, and other factors. Analogously, the nodes of a graph may be associated with a numerical variable called capacity, and may represent the loading and unloading capacity, supplies, demand, between other. A graph whose arcs and/or nodes are associated with the numerical flow variable and/or capacity is called a network. Fig. 18.2 shows an example of networks. The nodes represent the cities and the flows represent the distances (km) between them. For simplicity’s sake, we will henceforth no longer make a distinction between the terms “graphs” and “networks,” and will employ only the term “network.” The nodes of a network may be subdivided into three types: a) supply nodes or sources that represent entities that produce or distribute a given product; b) demand nodes that represent entities that consume the product; c) transshipment nodes that are the intermediate points between the supply nodes and demand and represent the waypoints for those products. Data Science for Business and Decision Making. https://doi.org/10.1016/B978-0-12-811216-8.00018-5 © 2019 Elsevier Inc. All rights reserved.


FIG. 18.1 Example of a graph.

FIG. 18.2 Example of a network.


The arcs may have an arrow indicating the direction of the arc. When the flow between the respective nodes occurs in a single direction indicated by an arrow, we have a directed arc. When the flow occurs in both directions, it is called an undirected arc. In cases where there is a single connection between the nodes, but without an arrow indicating the direction of the arc, it is presumed that the arc is undirected. Each one of those cases may be visualized in Fig. 18.3. The arcs of Figs. 18.1 and 18.2 are also examples of undirected arcs. One may assume in these cases that the distances are symmetrical. When all of the arcs of the network are directed, we have a directed network. Analogously, when all of the arcs are undirected, we say that the network is undirected. Fig. 18.2 is an example of an undirected network. As for Fig. 18.4, it FIG. 18.3 Differences between directed and undirected arc.



FIG. 18.4 Example of a directed network.

refers to a directed network, whose nodes represent a set of physical activities with their respective durations (minutes) and whose directed arcs represent the precedence relations between the activities. Other definitions from graph theory, such as path, Hamiltonian path, cycle, tree, spanning tree, and minimum spanning tree, will be presented next. Hillier and Lieberman (2005) define a path between two nodes as a sequence of distinct arcs connecting these nodes. For example, the sequence of arcs AB − BC − CE (A → B → C → E) of Fig. 18.1 is considered a path. In a directed network, one may have a directed or an undirected path. A path in which all arcs follow a single direction is called a directed path. On the other hand, if at least one arc of the path has a direction opposite to the others, the path is said to be undirected. For example, the path AC − CD − DE (A → C → D → E) of Fig. 18.4 is considered a directed path, given that all of its arcs follow the same direction. On the other hand, the path AB − BD − DC (A → B → D → C) of the same figure is considered an undirected path, given that the direction of arc DC is contrary to that of the other arcs. A Hamiltonian path is one that visits each node a single time. For example, the path AB − BC − CE of Fig. 18.1 is also considered a Hamiltonian path. On the other hand, the path AB − BC − CE − ED − DC (A → B → C → E → D → C) of the same figure is not a Hamiltonian path. A path that begins and ends at the same node forms a cycle. The path AB − BC − CE − EA (A → B → C → E → A) of Fig. 18.1 is an example of a cycle. In a directed network, one may have a directed or an undirected cycle. When the path traversed in a cycle is directed, we have a directed cycle. Analogously, an undirected path that begins and ends at the same node is called an undirected cycle. For example, the cycle AB − BD − DE − EA (A → B → D → E → A) of Fig. 18.4 is directed, whereas the cycle AB − BC − CA (A → B → C → A) of the same figure is an example of an undirected cycle. An undirected network G = (N, A) is said to be connected when there is a path between any pair of nodes. A network G has a tree structure if it is connected and acyclic (without cycles). In addition, as part of the tree concept, the following properties hold:
– A tree with n nodes contains n − 1 arcs.
– If an arc is added to a tree, a cycle is formed.
– If an arc is eliminated from a tree, the network ceases to be connected (instead of a single connected network there will be two connected components).
The network presented in Fig. 18.5 is an example of a tree based on the network presented in Fig. 18.1. Before defining the concept of a spanning tree, we define the concept of a subgraph. G' = (N', A') is a subgraph of G = (N, A) if the set of nodes of G' is a subset of the set of nodes of G (N' ⊆ N), if the set of arcs of G' is a subset of the set of arcs of G (A' ⊆ A), and if G' is a graph. Given a network G = (N, A), a spanning tree, also called a generating tree, is a subgraph of G that has a tree structure and contains all of the nodes of G. Fig. 18.6 shows an example of a spanning tree for the network drawn in Fig. 18.1. A minimum spanning tree of G is a spanning tree with the lowest total cost; a sketch of how one can be computed is shown after Fig. 18.6. FIG. 18.5 Example of a tree.


FIG. 18.6 Example of a spanning tree.
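The last definition can be made concrete with a short routine. The sketch below is an added illustration (not from the book): it applies Kruskal's algorithm, one of the classic greedy methods for the minimum spanning tree, to a small hypothetical undirected network whose node names and arc costs are invented.

# Added illustration: Kruskal's algorithm for a minimum spanning tree.
def kruskal(nodes, edges):
    """edges is a list of (weight, u, v); returns the arcs of a minimum spanning tree."""
    parent = {n: n for n in nodes}

    def find(n):                           # union-find with path compression
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    tree = []
    for w, u, v in sorted(edges):          # examine arcs in order of increasing cost
        ru, rv = find(u), find(v)
        if ru != rv:                       # the arc joins two components: no cycle
            parent[ru] = rv
            tree.append((u, v, w))
    return tree                            # contains n - 1 arcs, as noted above

edges = [(4, "A", "B"), (2, "A", "C"), (5, "B", "C"),
         (7, "B", "E"), (3, "C", "D"), (6, "D", "E")]
print(kruskal("ABCDE", edges))
# [('A', 'C', 2), ('C', 'D', 3), ('A', 'B', 4), ('D', 'E', 6)] -> total cost 15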

18.3

CLASSIC TRANSPORTATION PROBLEM

The classic transportation problem has the objective of determining the quantities of products to be transported going from a set of suppliers to a set of consumers, so that the total transportation cost is minimized. Each supplier manufactures a fixed number of products, and each consumer has a known demand that will be met. The problem is modeled using two links in the supply chain; in other words, not considering intermediate facilities (distribution centers, terminal, seaport, or factory). The mathematical notation and the network representation of the classic transportation problem are presented as follows. Consider a set of m suppliers that provide goods to a set of n consumers. The maximum amount to be transported from a given supplier i (i ¼ 1, …, m) corresponds to its capacity of Csi units. On the other hand, the demand of each consumer j (j ¼ 1, …, n) must be met and is represented by dj. The unit cost for transportation for the supplier i to the consumer j is represented by cij. The objective is to determine the quantities to be transported from the supplier i to the consumer j (xij) in order to minimize the total transportation cost (z). Fig. 18.7 presents the network representation of the classic transportation problem.

18.3.1

Mathematical Formulation of the Classic Transportation Problem

The model parameters, the decision variables, and the general mathematical formulation of the classic transportation problem are specified as follows. Parameters of the model: cij ¼ unit cost for transportation from the supplier i (i ¼ 1, …, m) to the consumer j (j ¼ 1, …, n) Csi ¼ supply capacity of the supplier i (i ¼ 1, …, m) dj ¼ demand by the consumer j (j ¼ 1, …, n) Decision variables: xij ¼ quantities transported from the supplier i (i ¼ 1, …, m) to the consumer j (j ¼ 1, …, n) FIG. 18.7 Network representation of the classic transportation problem.



General formulation:

min z = Σ (i=1..m) Σ (j=1..n) cij xij

subject to:
Σ (j=1..n) xij ≤ Csi,   i = 1, 2, ..., m
Σ (i=1..m) xij ≥ dj,    j = 1, 2, ..., n
xij ≥ 0,                i = 1, 2, ..., m; j = 1, 2, ..., n          (18.1)

which corresponds to a linear programming problem. Thus, the problem could be solved by the Simplex method. However, the special structure of the network problem makes it possible to obtain more efficient solution algorithms, such as the transportation algorithm that will be described in Section 18.3.3.1. In order for the problem represented by Expression (18.1) to have a feasible basic solution, the total supply capacity should be greater than or equal to the total demand of the consumers, that is, Σ (i=1..m) Csi ≥ Σ (j=1..n) dj. If the total supply capacity is exactly equal to the total demand consumed, that is, Σ (i=1..m) Csi = Σ (j=1..n) dj (balancing equation), the problem is known as the balanced transportation problem, and may be rewritten as:

min z = Σ (i=1..m) Σ (j=1..n) cij xij

subject to:
Σ (j=1..n) xij = Csi,   i = 1, 2, ..., m
Σ (i=1..m) xij = dj,    j = 1, 2, ..., n
xij ≥ 0,                i = 1, 2, ..., m; j = 1, 2, ..., n          (18.2)

We can have a third case in which the total supply capacity is less than the total demand consumed, Σ (i=1..m) Csi < Σ (j=1..n) dj, so that the total demand of some consumers will not be met. On the other hand, the suppliers will utilize their maximum capacity. This case may be mathematically formulated as:

min z = Σ (i=1..m) Σ (j=1..n) cij xij

subject to:
Σ (j=1..n) xij = Csi,   i = 1, 2, ..., m
Σ (i=1..m) xij ≤ dj,    j = 1, 2, ..., n
xij ≥ 0,                i = 1, 2, ..., m; j = 1, 2, ..., n          (18.3)

Example 18.1 Karpet Ltd. is an automotive parts manufacturer, whose units are located in the Brazilian cities of Osasco, Sorocaba, and Sao Sebastiao. Its clients are found in Sao Paulo, Rio de Janeiro, and Curitiba, as presented in Fig. 18.8. The unit transportation costs from each origin to each destination, as well as the capacity of each supplier and the demand of each consumer, are found in Table 18.E.1. The objective is to meet the demand of each final consumer, respecting the supply capacities, so as to minimize the total transportation cost. Model the transportation problem.


FIG. 18.8 Pool of Karpet Ltd. suppliers and consumers.

TABLE 18.E.1 Transportation Data of the Karpet Ltd. Company

Transportation unit cost per consumer
Supplier         Sao Paulo   Rio de Janeiro   Curitiba   Capacity
Osasco           12          22               30         100
Sorocaba         18          24               32         140
Sao Sebastiao    22          15               34         160
Demand           120         130              150

Solution
Since the total supply capacity is exactly equal to the total demand consumed, we have a balanced transportation problem. First, the decision variables of the model are defined: xij = number of parts transported from the supplier i to the consumer j, i = 1, 2, 3; j = 1, 2, 3. Thus, we have:
x11 = parts transported from the supplier in Osasco to the consumer in Sao Paulo.
x12 = parts transported from the supplier in Osasco to the consumer in Rio de Janeiro.
x13 = parts transported from the supplier in Osasco to the consumer in Curitiba.
⋮
x31 = parts transported from the supplier in Sao Sebastiao to the consumer in Sao Paulo.
x32 = parts transported from the supplier in Sao Sebastiao to the consumer in Rio de Janeiro.
x33 = parts transported from the supplier in Sao Sebastiao to the consumer in Curitiba.
The objective function seeks to minimize the total transportation cost:
min z = 12x11 + 22x12 + 30x13 + 18x21 + 24x22 + 32x23 + 22x31 + 15x32 + 34x33
The constraints of the model are specified as follows:
1. The capacity of each supplier will be fully utilized to meet consumer demand:
x11 + x12 + x13 = 100
x21 + x22 + x23 = 140
x31 + x32 + x33 = 160


2. The demand of each consumer should be met:
x11 + x21 + x31 = 120
x12 + x22 + x32 = 130
x13 + x23 + x33 = 150
3. The decision variables of the model are non-negative:
xij ≥ 0, i = 1, 2, 3; j = 1, 2, 3
The optimal solution, obtained by the transportation algorithm (see Section 18.3.3.1) or using Excel Solver (Section 18.3.3.2), is x11 = 100, x12 = 0, x13 = 0, x21 = 20, x22 = 0, x23 = 120, x31 = 0, x32 = 130, x33 = 30 with z = 8,370.
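As a check, the same balanced model of Expression (18.2) can be solved with a general-purpose LP solver. The sketch below is an added illustration (the book itself uses the transportation algorithm of Section 18.3.3.1 or Excel Solver); it builds the supply and demand equality constraints for the Karpet Ltd. data and solves them with SciPy's linprog.

# Added illustration: the Karpet Ltd. transportation problem as an LP.
# Decision variables are flattened row by row: x = (x11, x12, x13, x21, ..., x33).
import numpy as np
from scipy.optimize import linprog

cost = np.array([[12, 22, 30],
                 [18, 24, 32],
                 [22, 15, 34]])
supply = [100, 140, 160]
demand = [120, 130, 150]
m, n = cost.shape

A_eq, b_eq = [], []
for i in range(m):                    # each supplier ships exactly its capacity
    row = np.zeros(m * n)
    row[i * n:(i + 1) * n] = 1
    A_eq.append(row); b_eq.append(supply[i])
for j in range(n):                    # each consumer receives exactly its demand
    col = np.zeros(m * n)
    col[j::n] = 1
    A_eq.append(col); b_eq.append(demand[j])

res = linprog(cost.ravel(), A_eq=A_eq, b_eq=b_eq, method="highs")
print(res.x.reshape(m, n))            # quantities x_ij, matching the solution above
print(res.fun)                        # 8370.0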

18.3.2 Balancing the Transportation Problem When the Total Supply Capacity Is Not Equal to the Total Demand Consumed In the next section, we will present several methods for solving the classic transportation problem. Most require that the transportation problem be balanced, so that a ghost supplier or customer (dummy) should be added when the total supply is not equal to the total demand. We will see each one of the cases.

Case 1: Total Supply Is Greater than Total Demand Consider an unbalanced transportation problem whose total supply capacity is greater than the total demand consumed. To restore the balance, one must create a ghost customer (dummy) that will absorb the excess supplied. Thus, the demand of this new destination will correspond to the difference between the total supply and the total demand consumed, indicating the nonutilized supply capacity. The transportation unit cost for any supplier to the ghost customer created will be null, since the same is not real. We thus guarantee a feasible basic solution using the solution procedures presented in Sections 18.3.3.1 and 18.3.3.2, given that the total supply capacity has become exactly equal to the total demand. Example 18.2 The Caramel Candy & Confetti company has been involved in the candy sector since 1990 and owns three stores located in the Greater Sao Paulo area. Its main clients are located in the Sao Paulo Capital, Baixada Santista, and Vale do Paraiba, as shown in Fig. 18.9. The production capacity of the stores, the consumer demand, and the costs per unit distributed by each store for each

FIG. 18.9 Stores and clients of the Caramel Candy & Confetti company (Store 1, Store 2, and Store 3 serving the consumers in Sao Paulo, Baixada Santista, and Vale do Paraiba).


TABLE 18.E.2 Transportation Data of the Caramel Candy & Confetti Company

Transportation unit cost per consumer
Supplier    Sao Paulo   Baixada Santista   Vale do Paraiba   Capacity
Store 1     8           12                 10                50
Store 2     4           10                 6                 100
Store 3     6           15                 12                40
Demand      60          70                 30

consumer are illustrated in Table 18.E.2. In order to minimize the total transportation cost, the company wants to determine how much to distribute from each store to the respective consumers, respecting the production capacity and ensuring that the demands will be met. Formulate the Caramel Candy & Confetti company transportation problem. Solution We can verify that the Caramel Candy & Confetti company transportation problem is unbalanced, given that the total supply capacity (190) is greater than the total demand consumed (160). Solution (a)

One way to represent the mathematical model of the Caramel Candy & Confetti company is by an Expression (18.1) in which the constraints are written in an unequal form. In that model, one has the following decision variables: xij ¼ number of candies transported from store i to the consumer j, i ¼ 1, 2, 3; j ¼ 1, 2, 3. Thus, we have: x11 ¼ candies transported from store 1 to the consumer in Sao Paulo (SP). x12 ¼ candies transported from store 1 to the consumer in Baixada Santista (BS). x13 ¼ candies transported from store 1 to the consumer in Vale do Paraiba (VP). ⋮ x31 ¼ candies transported from store 3 to the consumer in Sao Paulo (SP). x32 ¼ candies transported from store 3 to the consumer in Baixada Santista (BS). x33 ¼ candies transported from store 3 to the consumer in Vale do Paraiba (VP). The objective function seeks to minimize the total transportation cost: min z ¼ 8x11 + 12x12 + 10x13 + 4x21 + 10x22 + 6x23 + 6x31 + 15x32 + 12x33 The constraints of the model are specified as follows: 1. The production capacity of each store should be respected: x11 + x12 + x13  50 x21 + x22 + x23  100 x31 + x32 + x33  40 2. The demand of each consumer should be met: x11 + x21 + x31  60 x12 + x22 + x32  70 x13 + x23 + x33  30 3. The decision variables of the model are non-negative: xij  0,

i ¼ 1,2, 3, j ¼ 1,2,3

The optimal solution of this model, obtained using Excel Solver (see Section 18.3.3.2), is x11 ¼ 0, x12 ¼ 50, x13 ¼ 0, x21 ¼ 50, x22 ¼ 20, x23 ¼ 30, x31 ¼ 10, x32 ¼ 0, x33 ¼ 0 with z ¼ 1,240. One may see, from that result, that store 3 did not use its maximum capacity of 40 units, but only 10 units.

FIG. 18.10 Network modeling of the balanced problem for the Caramel Candy & Confetti company (production capacities of 50, 100, and 40 for stores S1, S2, and S3; demands of 60, 70, and 30 for Sao Paulo, Baixada Santista, and Vale do Paraiba, plus the ghost consumer C4 with demand 30).

Solution (b)

In order for the transportation algorithm presented in Section 18.3.3.1 to be applied, we must have a balanced transportation problem, so that the total supply capacity is equal to the total demand. To restore the balance for the Caramel Candy & Confetti company problem, one must create a ghost customer (dummy) that will absorb the excess supply of 30 units. Network modeling of the balanced problem is illustrated in Fig. 18.10. The mathematical formulation of the balanced Caramel Candy & Confetti company problem is described as follows. Since the new consumer was added, xij may be rewritten as: xij ¼ number of candies transported from store i to the consumer j, i ¼ 1, 2, 3; j ¼ 1, 2, 3, 4. The new decision variables are: x14 ¼ candies transported from store 1 to the new ghost customer (dummy). x24 ¼ candies transported from store 2 to the new ghost customer (dummy). x34 ¼ candies transported from store 3 to the new ghost customer (dummy). Since the transportation unit cost for any supplier to the new consumer is null, the objective function is not changed: min z ¼ 8x11 + 12x12 + 10x13 + 4x21 + 10x22 + 6x23 + 6x31 + 15x32 + 12x33 The constraints on supply capacity and on demand consumed are changed: 1. Supply constraints by the stores: x11 + x12 + x13 + x14 ¼ 50 x21 + x22 + x23 + x24 ¼ 100 x31 + x32 + x33 + x34 ¼ 40 2. Demand constraints: x11 + x21 + x31 ¼ 60 x12 + x22 + x32 ¼ 70 x13 + x23 + x33 ¼ 30 x14 + x24 + x34 ¼ 30 3. Nonnegativity constraints: xij  0, i ¼ 1, 2,3, j ¼ 1, 2,3, 4 From solution (a), we already know that the nonutilized capacity of 30 units comes only from store 3. Since the new ghost consumer was created to absorb that surplus supply, we can affirm that x34 ¼ 30. Therefore, the optimal solution of the balanced model is x11 ¼ 0, x12 ¼ 50, x13 ¼ 0, x14 ¼ 0, x21 ¼ 50, x22 ¼ 20, x23 ¼ 30, x24 ¼ 0, x31 ¼ 10, x32 ¼ 0, x33 ¼ 0, and x34 ¼ 30 with z ¼ 1,240.
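The balancing step itself is easy to automate. The sketch below is an added illustration: it appends the ghost consumer, with null transportation costs, to the data of Table 18.E.2; the balanced cost matrix and demand vector can then be passed to the same equality-constrained model used for Karpet Ltd. above.

# Added illustration: balancing Example 18.2 with a dummy (ghost) consumer.
import numpy as np

cost = np.array([[8, 12, 10],
                 [4, 10,  6],
                 [6, 15, 12]])
supply = np.array([50, 100, 40])
demand = np.array([60, 70, 30])

excess = supply.sum() - demand.sum()              # 30 units of unused capacity
cost_bal = np.hstack([cost, np.zeros((3, 1))])    # dummy column with unit cost 0
demand_bal = np.append(demand, excess)            # the dummy demand absorbs the excess
print(demand_bal)                                 # [60 70 30 30] -> 190 = 190, balanced
# Solving the balanced model reproduces x34 = 30 and z = 1,240 found above.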


Case 2: Total Supply Capacity Is Lower than Total Demand Consumed Consider an unbalanced transportation problem whose total supply capacity is less than the total demand consumed. To restore the balance, one must create a ghost supplier (dummy) that will meet the remaining demand. Thus, the amount offered by this new supplier will correspond to the difference between the total demand consumed and the total supply capacity, indicating the unmet demand. The transportation unit cost for the ghost supplier created for any consumer will be null, since the supplier is not real. Analogous to Case 1, the balancing equation between supply and demand guarantees that a feasible basic solution will be found.

Example 18.3 Consider Example 18.2 of the Caramel Candy & Confetti company, with, however, distinct production capacities of the stores and consumer demand, as shown in Table 18.E.3. Formulate the new Caramel Candy & Confetti company transportation problem.

TABLE 18.E.3 New Transportation Data of the Caramel Candy & Confetti Company

Transportation unit cost per consumer
Supplier    Sao Paulo   Baixada Santista   Vale do Paraiba   Capacity
Store 1     8           12                 10                60
Store 2     4           10                 6                 40
Store 3     6           15                 12                50
Demand      50          120                80

Solution Once again, we are faced with an unbalanced transportation problem; however, this time, the total supply capacity (150) is less than the total demand consumed (250). Solution (a)

One way to represent that model is through Expression (18.3), in which the suppliers utilize its maximum capacity; however, the total demand of some consumers is not met. The decision variables are not changed in relation to the unbalanced model presented in Example 18.2: xij ¼ number of candies transported from store i to the consumer j, i ¼ 1, 2, 3; j ¼ 1, 2, 3. The same occurs in relation to the objective function: min z ¼ 8x11 + 12x12 + 10x13 + 4x21 + 10x22 + 6x23 + 6x31 + 15x32 + 12x33 The constraints on Example 18.3 are specified as follows. 1. The suppliers will utilize its maximum capacity: x11 + x12 + x13 ¼ 60 x21 + x22 + x23 ¼ 40 x31 + x32 + x33 ¼ 50 2. The total demand of each consumer may not be met: x11 + x21 + x31  50 x12 + x22 + x32  120 x13 + x23 + x33  80 3. The decision variables of the model are nonnegative: xij  0,

i ¼ 1,2, 3, j ¼ 1,2,3

The optimal solution of this model, obtained using Excel Solver (see Section 18.3.3.2), is x11 ¼ 0, x12 ¼ 20, x13 ¼ 40, x21 ¼ 0, x22 ¼ 0, x23 ¼ 40, x31 ¼ 50, x32 ¼ 0, x33 ¼ 0 with z ¼ 1,180.

FIG. 18.11 Network modeling of Example 18.3 balanced (production capacities of 60, 40, and 50 for stores S1, S2, and S3, plus the ghost supplier S4 with capacity 100; demands of 50, 120, and 80 for Sao Paulo, Baixada Santista, and Vale do Paraiba).

One may see, from that result, that the total demand of 120 units from the consumer in Baixada Santista was not met, only part of it (20 units). Solution (b)

Analogous to Example 18.2, in order for the transportation algorithm presented in Section 18.3.3.1 to be applied, we must have a balanced transportation problem before us. To restore the balance, one must create a ghost supplier (dummy) that will meet the unmet demand for 100 units. Network modeling of the new balanced problem is illustrated in Fig. 18.11. The mathematical formulation of Example 18.3 balanced is described as follows. Since the new supplier has been added, xij may be rewritten as: xij ¼ number of candies transported from store i to the consumer j, i ¼ 1, 2, 3, 4; j ¼ 1, 2, 3. The new decision variables are: x41 ¼ candies transported from the new ghost store (dummy) to consumer 1. x42 ¼ candies transported from the new ghost store (dummy) to consumer 2. x43 ¼ candies transported from the new ghost store (dummy) to consumer 3. Since the transportation unit cost to the new supplier for any consumer is null, the objective function is not changed: min z ¼ 8x11 + 12x12 + 10x13 + 4x21 + 10x22 + 6x23 + 6x31 + 15x32 + 12x33 The balanced model of Example 18.3 presents the following constraints: 1. Supply constraints by the stores: x11 + x12 + x13 ¼ 60 x21 + x22 + x23 ¼ 40 x31 + x32 + x33 ¼ 50 x41 + x42 + x43 ¼ 100 2. Demand constraints: x11 + x21 + x31 + x41 ¼ 50 x12 + x22 + x32 + x42 ¼ 120 x13 + x23 + x33 + x43 ¼ 80 3. Nonnegativity constraints: xij  0,

i ¼ 1,2,3, j ¼ 1, 2,3, 4

From solution (a) of that same example, we already know that the unmet demand for 100 units comes from the consumer in Baixada Santista. Because the new ghost supplier was created to meet that remaining demand, we can affirm that x42 ¼ 100. Therefore, the optimal solution of the model is x11 ¼ 0, x12 ¼ 20, x13 ¼ 40, x21 ¼ 0, x22 ¼ 0, x23 ¼ 40, x31 ¼ 50, x32 ¼ 0, x33 ¼ 0, x41 ¼ 0, x42 ¼ 100, x43 ¼ 0 with z ¼ 1,180.

18.3.3

Solution of the Classic Transportation Problem

The classic transportation problem will be solved in two ways. First, we will use the transportation algorithm, which is a simplification of the Simplex method presented in Chapter 17; then, in Section 18.3.3.2, we will solve the problem using Excel Solver.

18.3.3.1 The Transportation Algorithm
In order to facilitate solving the classic transportation problem using the methods presented in this section, the problem should be represented in tabular form. Box 18.1 presents the tabular form of the general balanced transportation model expressed in Expression (18.2). The transportation algorithm follows the same logic as the Simplex method presented in Chapter 17, with some simplifications that take advantage of the special structure of the transportation problem. Fig. 18.12 presents each of the steps of the transportation algorithm.
BOX 18.1 General Tabular Form of the Balanced Transportation Problem

FIG. 18.12 Transportation algorithm.

Initially: The problem must be balanced (total supply equal to total demand) and represented in tabular form, as specified in Box 18.1.
Step 1. Find an initial feasible basic solution (FBS). To do that, we will present three methods: the northwest corner method, the minimum cost method, and the Vogel approximation method.
Step 2. Optimality test. To verify whether the solution found is optimal, we employ the multiplier method, which is based on duality theory. We apply the optimality condition of the Simplex method to the transportation problem. If the condition is satisfied, the algorithm ends here. If not, we determine a better adjacent FBS.
Iteration: Determine a better adjacent FBS. To find the new feasible basic solution, three steps should be taken:
1. Determine the nonbasic variable that will enter the basis, using the multiplier method.
2. Choose the basic variable that will leave the basis (that is, join the set of nonbasic variables), using the feasibility condition of the Simplex method.
3. Recalculate the new basic solution.


The elementary operations utilized in the Simplex method to recalculate the values of the new adjacent basic solution are not necessary, given that the new solution may be easily obtained using the table form of the transportation problem. Each one of the steps of the transportation algorithm presented in Fig. 18.12 will be detailed and later applied to solve the transportation problem for the Karpet Ltd. company (Example 18.1). Example 18.4 Represent the Karpet Ltd. company transportation problem (Example 18.1) in table form expressed in Box 18.1. Solution Using data from the balanced Karpet Ltd. company transportation problem presented in Table 18.E.1, one may easily obtain its tabular form, as shown in Table 18.E.4.

TABLE 18.E.4 Representation of the Balanced Karpet Ltd. Company Transportation Problem in Tabular Form

                Consumer 1    Consumer 2    Consumer 3    Capacity
Supplier 1          12            22            30           100
Supplier 2          18            24            32           140
Supplier 3          22            15            34           160
Demand             120           130           150

Step 1. Determining the Initial Feasible Basic Solution
The classic transportation problem considers a set of m suppliers and n consumers. Using Expression (18.2), it was found that the balanced transportation problem contains m + n equality constraints. Because in the balanced transportation problem the total supply is equal to the total demand, we can affirm that one of those constraints is redundant, so that the model contains m + n − 1 independent equations and, consequently, m + n − 1 basic variables. For the Karpet Ltd. company transportation problem, in which m = 3 and n = 3, we therefore have 5 basic variables. We first see how to find an initial FBS using the northwest corner method, followed by the minimum cost method and the Vogel approximation method.
Northwest Corner Method
The northwest corner method follows these steps:
Initially: Represent the transportation problem in initial tabular form (see Box 18.1). In this method, the transportation costs need not be specified, given that they are not used by the algorithm.
Step 1: Select the cell located in the upper left corner (northwest), among the cells not yet allocated in Step 2 and the cells not yet blocked in Step 3. Therefore, x11 will always be the first variable selected.
Step 2: Allocate the largest possible amount of product to that cell, so that the sum of the corresponding cells in the same row and in the same column does not exceed the total supply and total demand capacities, respectively.
Step 3: Using the cell selected in the previous step, block (marking with an x) the cells corresponding to the same row or column that reached the maximum limit of supply or demand, respectively, given that no value other than zero may be attributed to those cells. If both the row and the column reach their maximum limits, only one of them must be blocked. That condition guarantees that there will be basic variables with null values.
The algorithm ends when all of the cells have been allocated or blocked. Otherwise, return to Step 1.
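Before working through Example 18.5, it may help to see the northwest corner rule written out as a short procedure. The Python sketch below is only an illustration (it is not part of the book); it takes the supply and demand vectors of the balanced problem and returns the allocation matrix, reproducing the initial FBS found in Example 18.5.

def northwest_corner(supply, demand):
    supply, demand = supply[:], demand[:]          # work on copies
    alloc = [[0] * len(demand) for _ in supply]
    i = j = 0
    while i < len(supply) and j < len(demand):
        q = min(supply[i], demand[j])              # Step 2: largest feasible amount
        alloc[i][j] = q
        supply[i] -= q
        demand[j] -= q
        if supply[i] == 0:                         # Step 3: block the exhausted row...
            i += 1
        else:                                      # ...or the exhausted column
            j += 1
    return alloc

# Karpet Ltd.: capacities 100, 140, 160 and demands 120, 130, 150
print(northwest_corner([100, 140, 160], [120, 130, 150]))
# Expected: [[100, 0, 0], [20, 120, 0], [0, 10, 150]]  (z = 9,690)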


Example 18.5 Apply the northwest corner method to the Karpet Ltd. company problem to obtain an initial FBS. Solution The initial tabular form of the Karpet Ltd. company problem, for applying the northwest corner method, is represented in Table 18.E.5 and is similar to Table 18.E.4, but without the unit transportation costs.

TABLE 18.E.5 Initial Tabular Form of the Karpet Ltd. Company Problem for the Northwest Corner Method

                Consumer 1    Consumer 2    Consumer 3    Capacity
Supplier 1                                                   100
Supplier 2                                                   140
Supplier 3                                                   160
Demand             120           130           150

The three steps of the first round are described and represented in Table 18.E.6. Step 1: Select the cell x11, located in the upper left corner (northwest). Step 2: One may see, from Table 18.E.5, that the total capacity of supplier 1 (Osasco) is 100. In turn, the demand of consumer 1 (Sao Paulo) is 120, so the maximum value to be allocated in that cell is the minimum of those two values, x11 = min {100, 120} = 100. Step 3: The maximum capacity limit of supplier 1 has been reached, so the cells corresponding to the same row (x12 and x13) should be blocked.

TABLE 18.E.6 The Three Steps of the First Round (x = blocked cell)

                Consumer 1    Consumer 2    Consumer 3    Capacity
Supplier 1         100            x             x           100
Supplier 2                                                   140
Supplier 3                                                   160
Demand             120           130           150

The same logic is applied to the second round. Step 1: Among the remaining cells, select the one located in the northwest corner (x21). Step 2: In setting column 1 relating to the Sao Paulo consumer, the maximum amount that may be allocated in the cell x21 is 20, since the sum of the quantities allocated to the cells of that column may not exceed the demand for 120 units of that consumer. In setting row 2 one may allocate up to 140 units in that same cell. Therefore, x21 ¼ min {20, 140} ¼ 20. Step 3: the cell x31 should be blocked, since the maximum demand limit of column 1 has been reached. For the third round, one has the following steps: Step 1: Among the remaining cells, select the one located in the northwest corner (x22). Step 2: In setting row 2 referring to the Sorocaba supplier, the maximum amount that may be allocated in x22 is 120, given that the sum of the quantities allocated to all of the cells of that same row may not exceed the 140-unit capacity of that supplier. In setting column 2, one may allocate up to 130 units in that same cell. Therefore, x22 ¼ min {120, 130} ¼ 120. Step 3: the cell x23 should be blocked, since the maximum capacity limit of row 2 has been reached.


TABLE 18.E.7 Result of the Second Round (x = blocked cell)

                Consumer 1    Consumer 2    Consumer 3    Capacity
Supplier 1         100            x             x           100
Supplier 2          20                                       140
Supplier 3           x                                       160
Demand             120           130           150

TABLE 18.E.8 Result of the Third Round (x = blocked cell)

                Consumer 1    Consumer 2    Consumer 3    Capacity
Supplier 1         100            x             x           100
Supplier 2          20           120            x           140
Supplier 3           x                                       160
Demand             120           130           150

In the last two rounds, Step 3 will not be applied, given that the other cells belonging to the row or column that reached the maximum capacity or demand, respectively, have already been blocked in previous rounds. In the case of the next-to-last round, one selects cell x32 and allocates to it the maximum amount possible of 10 units. Finally, in the last round, one allocates the remaining amount of 150 units to cell x33. The initial FBS of the northwest corner method is listed in Table 18.E.9.

TABLE 18.E.9 Final Result of the Northwest Corner Method

                Consumer 1    Consumer 2    Consumer 3    Capacity
Supplier 1         100             0             0          100
Supplier 2          20           120             0          140
Supplier 3           0            10           150          160
Demand             120           130           150

The basic solution is, therefore, x11 = 100, x21 = 20, x22 = 120, x32 = 10, and x33 = 150, with z = 9,690. Nonbasic variables: x12 = 0, x13 = 0, x23 = 0, and x31 = 0.
Minimum Cost Method
The minimum cost method is an adaptation of the northwest corner method in which, instead of selecting the cell closest to the northwest corner, one selects the cell with the lowest unit cost. The complete minimum cost algorithm is detailed as follows.
Initially: Represent the transportation problem in the initial tabular form of Box 18.1.


Step 1: Select the cell with lowest possible cost, among the cells not yet allocated in Step 2 and the cells not yet blocked in Step 3. Step 2: Allocate the largest possible amount of product to that cell, so that the sum of the corresponding cells in the same row and in the same column does not exceed the total supply and total demand capacities, respectively. Step 3: Starting with the cell selected in the previous step, block (marking with an x) the cells corresponding to the same row or column that reached the maximum limit of supply or demand, respectively. Analogous to the northwest corner method, if the researcher uses the maximum limit on both the row and the column, only one of them must be blocked. The algorithm finalizes when all of the cells have been allocated or blocked. Otherwise, return to Step 1.
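The following Python sketch (an illustration, not part of the book) implements the minimum cost rule just described by visiting the cells in increasing order of unit cost; applied to the Karpet Ltd. data it reproduces the initial FBS of Example 18.6.

def minimum_cost(cost, supply, demand):
    supply, demand = supply[:], demand[:]
    m, n = len(supply), len(demand)
    alloc = [[0] * n for _ in range(m)]
    # Visit cells in increasing order of unit cost (Step 1)
    cells = sorted(((cost[i][j], i, j) for i in range(m) for j in range(n)))
    for _, i, j in cells:
        if supply[i] == 0 or demand[j] == 0:       # row or column already blocked
            continue
        q = min(supply[i], demand[j])              # Step 2: largest feasible amount
        alloc[i][j] = q
        supply[i] -= q                             # Step 3 happens implicitly when a
        demand[j] -= q                             # supply or demand reaches zero
    return alloc

# Karpet Ltd. data (Table 18.E.4)
cost = [[12, 22, 30], [18, 24, 32], [22, 15, 34]]
print(minimum_cost(cost, [100, 140, 160], [120, 130, 150]))
# Expected: [[100, 0, 0], [20, 0, 120], [0, 130, 30]]  (z = 8,370)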

Example 18.6 Apply the minimum cost method to the Karpet Ltd. company problem to obtain an initial FBS. Solution Consider the balanced Karpet Ltd. company transportation problem in its initial tabular form (Table 18.E.4). The three steps of the first round are described now and represented in Table 18.E.10. Step 1: Select the cell x11, which is the one with the lowest cost. Step 2: Analogous to the northwest corner method, the largest possible amount to be allocated in that cell is 100 = min {100, 120}. Step 3: The maximum capacity limit of supplier 1 has been reached, so the cells corresponding to the same row (x12 and x13) should be blocked.

TABLE 18.E.10 The Three Steps of the First Round (allocated quantity, with the unit cost in parentheses; x = blocked cell)

                Consumer 1    Consumer 2    Consumer 3    Capacity
Supplier 1      100 (12)       x (22)        x (30)          100
Supplier 2          (18)         (24)          (32)          140
Supplier 3          (22)         (15)          (34)          160
Demand             120           130           150

The same logic is applied to the second round and the result is presented in Table 18.E.11. Step 1: Among the remaining cells, select the one with the lowest unit cost (x32). Step 2: The maximum amount that may be allocated in that cell is 130 = min {130, 160}. Step 3: The cell x22 should be blocked since the maximum demand limit of column 2 has been reached.

TABLE 18.E.11 Result of the Second Round (allocated quantity, with the unit cost in parentheses; x = blocked cell)

                Consumer 1    Consumer 2    Consumer 3    Capacity
Supplier 1      100 (12)       x (22)        x (30)          100
Supplier 2          (18)       x (24)          (32)          140
Supplier 3          (22)     130 (15)          (34)          160
Demand             120           130           150


For the third round, one has the following steps: Step 1: Among the remaining cells, select the one with the lowest cost (x21). Step 2: In setting row 2, referring to the Sorocaba supplier, one could allocate up to 140 units in x21. However, in setting column 1, relating to the Sao Paulo consumer, the maximum limit is 20 units, given that the sum of the quantities allocated to all of the cells of that column may not exceed the demand for 120 units of that consumer. Therefore, x21 = min {20, 140} = 20. Step 3: The cell x31 should be blocked since the maximum limit of column 1 has been reached.

TABLE 18.E.12 Result of the Third Round (allocated quantity, with the unit cost in parentheses; x = blocked cell)

                Consumer 1    Consumer 2    Consumer 3    Capacity
Supplier 1      100 (12)       x (22)        x (30)          100
Supplier 2       20 (18)       x (24)          (32)          140
Supplier 3        x (22)     130 (15)          (34)          160
Demand             120           130           150

Analogous to the northwest corner method, in the last two rounds Step 3 is not applied, given that the cells belonging to the row or column that reached the maximum capacity or demand, respectively, have already been blocked. In the next-to-last round, one selects cell x23 and allocates 120 units to it. Finally, in the last round, one allocates the remaining 30 units to cell x33. The initial FBS of the minimum cost method is represented in Table 18.E.13.

TABLE 18.E.13 Result of the Minimum Cost Method (allocated quantity, with the unit cost in parentheses)

                Consumer 1    Consumer 2    Consumer 3    Capacity
Supplier 1      100 (12)       0 (22)        0 (30)          100
Supplier 2       20 (18)       0 (24)      120 (32)          140
Supplier 3        0 (22)     130 (15)       30 (34)          160
Demand             120           130           150

The basic solution is, therefore, x11 = 100, x21 = 20, x23 = 120, x32 = 130, and x33 = 30, with z = 8,370. Nonbasic variables: x12 = 0, x13 = 0, x22 = 0, and x31 = 0.
Vogel Approximation Method
According to Taha (2016), the Vogel approximation method is an improved version of the minimum cost method that generally leads to better initial solutions. The detailed steps of the algorithm are:
Initially: Represent the transportation problem in the initial tabular form of Box 18.1.
Step 1: For each row (and column), calculate the penalty, which corresponds to the difference between the two smallest unit transportation costs in the respective row (and column). The penalty of a row (column) is calculated as long as there are at least two cells in the same row (column) not yet allocated and not blocked.
Step 2: Choose the row or column with the highest penalty. In case of a tie, randomly choose any one of them. In the row or column selected, choose the cell with the lowest cost.
Step 3: As with the northwest corner and minimum cost methods, allocate the largest possible amount of product to this cell, so that the sum of the corresponding cells in the same row and in the same column does not exceed the total supply and total demand capacities, respectively.


Step 4: Analogous to the northwest corner method and that of the lowest cost, using the cell selected in the previous step, block (marking with an x) the cells corresponding to the same row or column that reached the maximum limit of supply or demand, respectively. If the researcher uses the maximum limit on both the row and the column, only one of them must be blocked. As long as there is more than one cell that is not allocated and not blocked, return to Step 1. Otherwise, go to Step 5. Step 5: Allocate the capacity or remaining demand to this last cell.
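The penalty logic of the Vogel approximation method can also be sketched in a few lines. The Python code below is only an illustration (not part of the book); ties are broken implicitly by Python's max and min functions rather than randomly, which is enough to reproduce the initial FBS obtained in Example 18.7.

def vogel(cost, supply, demand):
    supply, demand = supply[:], demand[:]
    m, n = len(supply), len(demand)
    alloc = [[0] * n for _ in range(m)]

    def penalty(costs):
        # difference between the two smallest costs of an open row or column
        if len(costs) < 2:
            return -1 if not costs else 0
        c = sorted(costs)
        return c[1] - c[0]

    while any(supply) and any(demand):
        open_rows = [i for i in range(m) if supply[i] > 0]
        open_cols = [j for j in range(n) if demand[j] > 0]
        # Step 1: penalties of the open rows and columns
        row_pen = {i: penalty([cost[i][j] for j in open_cols]) for i in open_rows}
        col_pen = {j: penalty([cost[i][j] for i in open_rows]) for j in open_cols}
        # Step 2: pick the row/column with the highest penalty, then its cheapest cell
        best_row = max(row_pen, key=row_pen.get)
        best_col = max(col_pen, key=col_pen.get)
        if row_pen[best_row] >= col_pen[best_col]:
            i = best_row
            j = min(open_cols, key=lambda col: cost[i][col])
        else:
            j = best_col
            i = min(open_rows, key=lambda row: cost[row][j])
        q = min(supply[i], demand[j])              # Steps 3-5: allocate and block
        alloc[i][j] = q
        supply[i] -= q
        demand[j] -= q
    return alloc

cost = [[12, 22, 30], [18, 24, 32], [22, 15, 34]]
print(vogel(cost, [100, 140, 160], [120, 130, 150]))
# Expected: [[100, 0, 0], [20, 0, 120], [0, 130, 30]]  (z = 8,370)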

Example 18.7 Apply the Vogel approximation method to the Karpet Ltd. company problem to obtain an initial FBS. Solution All of the steps of the first round are represented in Table 18.E.14. First, the penalties for each row and column were calculated. One may see that the largest penalty occurred in row 1. One selects cell x11, which is the one with the lowest cost in row 1. The next step consists of allocating the largest possible amount of product to this cell, which is 100 = min {100, 120}. The other cells of row 1 are blocked, since the capacity limit of that supplier has been reached. The result of the first round is highlighted in gray.

TABLE 18.E.14 First Round of the Vogel Approximation Method (allocated quantity, with the unit cost in parentheses; x = blocked cell)

                  Consumer 1    Consumer 2    Consumer 3    Capacity    Row penalty
Supplier 1        100 (12)       x (22)        x (30)          100      22 - 12 = 10
Supplier 2            (18)         (24)          (32)          140      24 - 18 = 6
Supplier 3            (22)         (15)          (34)          160      22 - 15 = 7
Demand               120           130           150
Column penalty   18 - 12 = 6   22 - 15 = 7   32 - 30 = 2

The same process is repeated for the second round (see Table 18.E.15). First, the new penalties for each column and for rows 2 and 3 are calculated. The largest penalty this time is in column 2. Cell x32 is selected as the one with the lowest cost in column 2 and is allocated the largest possible amount of product, which is 130 = min {130, 160}. Cell x22 is also blocked, given that the total demand from consumer 2 was met. The new cells allocated and blocked in the second round are highlighted in gray.

TABLE 18.E.15 Second Round of the Vogel Approximation Method (allocated quantity, with the unit cost in parentheses; x = blocked cell)

                  Consumer 1    Consumer 2    Consumer 3    Capacity    Row penalty
Supplier 1        100 (12)       x (22)        x (30)          100
Supplier 2            (18)       x (24)          (32)          140      24 - 18 = 6
Supplier 3            (22)     130 (15)          (34)          160      22 - 15 = 7
Demand               120           130           150
Column penalty   22 - 18 = 4   24 - 15 = 9   34 - 32 = 2


In the third round (see Table 18.E.16), one first calculates the new penalties for rows 2 and 3 and for columns 1 and 3. One may see that the largest penalty is in row 2. Among the remaining cells, the cell with the lowest cost in row 2 is x21. In setting column 1, the maximum amount that may be allocated in x21 is 20, given that the sum of the quantities allocated to all of the cells of that column may not exceed the demand for 120 units by that consumer. In setting row 2, one may allocate up to 140 units in that same cell. Therefore, x21 = min {20, 140} = 20. Cell x31 is blocked, given that the total demand from consumer 1 was met. The result of the third round is highlighted in gray.

TABLE 18.E.16 Third Round of the Vogel Approximation Method

There now only remains calculation of the penalty for column 3. One chooses the cell with lowest cost in that column; x23 is chosen and allocated 120 units. Finally, one allocates 30 units to the last cell, x33. The initial FBS of the Vogel approximation method is illustrated in Table 18.E.17.

TABLE 18.E.17 Initial Feasible Basic Solution Obtained by the Vogel Approximation Method (allocated quantity, with the unit cost in parentheses)

                Consumer 1    Consumer 2    Consumer 3    Capacity
Supplier 1      100 (12)       0 (22)        0 (30)          100
Supplier 2       20 (18)       0 (24)      120 (32)          140
Supplier 3        0 (22)     130 (15)       30 (34)          160
Demand             120           130           150

The basic solution is, therefore, x11 = 100, x21 = 20, x23 = 120, x32 = 130, and x33 = 30, with z = 8,370. Note that this solution is the same as the one obtained by the minimum cost method.
Step 2. Optimality Test
To verify whether the solution found is optimal, we employ the method of multipliers, which is based on duality theory. We associate with each row i and each column j the multipliers ui and vj, respectively. The reduced cost c̄ij of variable xij in the objective function is given by the following equation:

c̄ij = ui + vj − cij    (18.4)

Since the reduced costs of the basic variables are null, Expression (18.4) states that:

ui + vj = cij, for each basic variable xij    (18.5)

Since the model contains m + n − 1 independent equations and, consequently, m + n − 1 basic variables, to solve the system of equations represented by Expression (18.5) with m + n unknowns one must arbitrarily attribute a value of zero to one of the multipliers; for example, u1 = 0.


After calculating the multipliers, one may determine the reduced costs of the nonbasic variables from Expression (18.4). For the transportation problem (a minimization problem), the current solution is optimal if, and only if, the reduced costs of all the nonbasic variables are nonpositive:

ui + vj − cij ≤ 0, for each nonbasic variable xij    (18.6)

As long as there is at least one nonbasic variable with a positive reduced cost, there is a better adjacent feasible basic solution (FBS).
Iteration. Determine a better adjacent FBS
To find the new feasible basic solution, three steps should be taken:
1. Determine the nonbasic variable that will enter the basis, using the method of multipliers. The nonbasic variable xij selected is the one with the greatest reduced cost (greatest value of ui + vj − cij).
2. Choose the basic variable that will leave the basis (see the explanation later).
3. Recalculate the new basic solution. Unlike the Simplex method, this calculation may be done directly in the tabular form of the transportation problem.
The choice of the variable that leaves the basis and the calculation of the new basic solution are obtained by constructing a closed cycle that begins and ends at the nonbasic variable chosen to enter the basis (Step 1). The cycle consists of a sequence of horizontal and vertical segments connected to each other (diagonal movements are not permitted), in which each corner corresponds to a basic variable, with the exception of the selected nonbasic variable. There is only one closed cycle that may be constructed under these conditions. With the closed cycle constructed, the next step consists of determining the variable that will leave the basis. Among the corners adjacent to the nonbasic variable xij (horizontally or vertically), one chooses the basic variable with the lowest value, so that the capacity constraint of supplier i and the demand constraint of consumer j remain respected. In case of a tie, one of them is chosen randomly. Finally, one recalculates the new basic solution. First, the value of the basic variable chosen to leave the basis is attributed to the new basic variable xij. The variable that leaves the basis thus assumes the value of zero. The new values of the basic variables in the closed cycle should also be recalculated, so that the required supply capacities and demands continue to be satisfied.
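To make the multiplier calculations concrete, the Python sketch below (not part of the book; it assumes the basic cells form a spanning tree, so the multipliers can be filled in by simple propagation) computes ui, vj, and the reduced costs for the initial northwest corner solution of the Karpet Ltd. problem, anticipating the numbers obtained in Example 18.8.

import numpy as np

cost = np.array([[12, 22, 30],
                 [18, 24, 32],
                 [22, 15, 34]])
# Basic cells of the initial northwest corner solution (Table 18.E.18)
basis = [(0, 0), (1, 0), (1, 1), (2, 1), (2, 2)]   # x11, x21, x22, x32, x33

m, n = cost.shape
u = [None] * m
v = [None] * n
u[0] = 0.0                                         # arbitrary choice: u1 = 0
for _ in range(m + n):                             # solve u_i + v_j = c_ij by propagation
    for i, j in basis:
        if u[i] is not None and v[j] is None:
            v[j] = cost[i, j] - u[i]
        elif v[j] is not None and u[i] is None:
            u[i] = cost[i, j] - v[j]

# Reduced costs u_i + v_j - c_ij of the nonbasic cells (optimal only if all <= 0)
reduced = {(i + 1, j + 1): u[i] + v[j] - cost[i, j]
           for i in range(m) for j in range(n) if (i, j) not in basis}
print(u, v)      # expected: u = [0, 6, -3], v = [12, 18, 37]
print(reduced)   # expected: {(1,2): -4, (1,3): 7, (2,3): 11, (3,1): -13}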

Example 18.8 Starting from the basic initial solution obtained by the northwest corner method for the Karpet Ltd. company problem (Example 18.5), determine the optimal solution using the transportation algorithm. Solution Each one of the steps of the transportation algorithm will be applied to determine the optimal solution of the problem studied. As the initial FBS, we will use the one obtained by the northwest corner method. Step 1. Initial FBS Obtained by the Northwest Corner Method The initial solution of the northwest corner method obtained in Example 18.5, including the unit transportation costs of each cell, is represented in Table 18.E.18.

TABLE 18.E.18 Initial FBS of the Northwest Corner Method, Including the Unit Transportation Costs (in parentheses)

                Consumer 1    Consumer 2    Consumer 3    Capacity
Supplier 1      100 (12)       0 (22)        0 (30)          100
Supplier 2       20 (18)     120 (24)        0 (32)          140
Supplier 3        0 (22)      10 (15)      150 (34)          160
Demand             120           130           150

Step 2. Optimality Test
For each basic variable xij, write the equation ui + vj = cij (Expression (18.5)):
For x11: u1 + v1 = 12
For x21: u2 + v1 = 18


For x22: u2 + v2 = 24
For x32: u3 + v2 = 15
For x33: u3 + v3 = 34
Setting u1 = 0, one obtains the following results: v1 = 12, u2 = 6, v2 = 18, u3 = −3, and v3 = 37.
Using those multipliers, one determines the reduced costs of the nonbasic variables from Expression (18.4):
c̄12 = u1 + v2 − c12 = 0 + 18 − 22 = −4
c̄13 = u1 + v3 − c13 = 0 + 37 − 30 = 7
c̄23 = u2 + v3 − c23 = 6 + 37 − 32 = 11
c̄31 = u3 + v1 − c31 = −3 + 12 − 22 = −13
Since the reduced costs of the nonbasic variables x13 and x23 are positive, there is a better adjacent feasible basic solution (FBS). The nonbasic variable that will enter the basis is x23, because it has the greatest reduced cost.
Iteration. Determine a Better Adjacent FBS
The closed cycle should be constructed to determine the variable that will leave the basis and to calculate the new basic solution. That closed cycle must satisfy the following conditions: (a) begin and end at x23; (b) be formed by a sequence of horizontal and vertical segments connected to each other; and (c) each corner must correspond to a basic variable, with the exception of the selected nonbasic variable x23.

TABLE 18.E.19 Construction of the Closed Cycle in the First Iteration (allocated quantity, with the unit cost in parentheses; the cycle links cells x23, x22, x32, and x33)

                Consumer 1    Consumer 2    Consumer 3    Capacity
Supplier 1      100 (12)       0 (22)        0 (30)          100
Supplier 2       20 (18)     120 (24)        0 (32)          140
Supplier 3        0 (22)      10 (15)      150 (34)          160
Demand             120           130           150

Table 18.E.19 presents the closed cycle that satisfies those conditions. With the closed cycle constructed, the next step consists of determining the variable that will leave the basis. Among the corners adjacent to the nonbasic variable x23 (horizontally or vertically), one chooses the basic variable x22, which has the lowest value (120 < 150), so that the capacity constraint of supplier 2 is respected. Finally, one recalculates the new basic solution. First, the value of 120 from the outgoing basic variable x22 is attributed to the new basic variable x23. The variable x22 that leaves the basis assumes, therefore, the value of zero. To restore the balance of the closed cycle, one recalculates the values of the basic variables x32 and x33 (130 and 30, respectively). Table 18.E.20 illustrates the new adjacent FBS.

TABLE 18.E.20 Adjacent Basic Solution Obtained in the First Iteration (allocated quantity, with the unit cost in parentheses)

                Consumer 1    Consumer 2    Consumer 3    Capacity
Supplier 1      100 (12)       0 (22)        0 (30)          100
Supplier 2       20 (18)       0 (24)      120 (32)          140
Supplier 3        0 (22)     130 (15)       30 (34)          160
Demand             120           130           150


Step 2. Optimality Test
For each basic variable xij, write the equation ui + vj = cij (Expression (18.5)):
For x11: u1 + v1 = 12
For x21: u2 + v1 = 18
For x23: u2 + v3 = 32
For x32: u3 + v2 = 15
For x33: u3 + v3 = 34
Setting u1 = 0, one obtains the following results: v1 = 12, u2 = 6, v3 = 26, u3 = 8, and v2 = 7.
Using those multipliers, one determines the reduced costs of the nonbasic variables through Expression (18.4):
c̄12 = u1 + v2 − c12 = 0 + 7 − 22 = −15
c̄13 = u1 + v3 − c13 = 0 + 26 − 30 = −4
c̄22 = u2 + v2 − c22 = 6 + 7 − 24 = −11
c̄31 = u3 + v1 − c31 = 8 + 12 − 22 = −2
Since the reduced costs of all the nonbasic variables are nonpositive, the current solution is optimal. The optimal solution is, therefore:
Basic solution: x11 = 100, x21 = 20, x23 = 120, x32 = 130, and x33 = 30, with z = 8,370.
Nonbasic variables: x12 = 0, x13 = 0, x22 = 0, and x31 = 0.
Note that this solution is the same as the initial solution obtained by the minimum cost and Vogel approximation methods.

18.3.3.2 Solution of the Transportation Problem Using Excel Solver
Examples 18.1, 18.2, and 18.3, referring to the classic transportation problem, will be solved in this section using Excel Solver.
Solution of the Karpet Ltd. company problem (Example 18.1): Fig. 18.13 illustrates the representation of the Karpet Ltd. company transportation problem in an Excel spreadsheet (see the file Example18.1_Karpet.xls). The equations used in Fig. 18.13 are specified in Box 18.2. Analogous to the examples from Chapter 17, names were attributed to the cells and cell ranges of Fig. 18.13 that will be referenced in Solver, in order to facilitate understanding of the model. Box 18.3 presents the names attributed to the respective cells.

FIG. 18.13 Representation in Excel of the Karpet Ltd. company transportation problem: the unit transportation cost table (suppliers Osasco, Sorocaba, and Sao Sebastiao versus consumers Sao Paulo, Rio de Janeiro, and Curitiba), the Quantities_transported cells (initially zero), the Quantities_supplied and Quantities_delivered totals compared against the Capacity cells (100, 140, 160) and Demand cells (120, 130, 150), and the Total_cost cell (z).


BOX 18.2 Equations of Fig. 18.13

Cell    Equation
E16     =SUM(B16:D16)
E17     =SUM(B17:D17)
E18     =SUM(B18:D18)
B20     =SUM(B16:B18)
C20     =SUM(C16:C18)
D20     =SUM(D16:D18)
G22     =SUMPRODUCT(B7:D9,B16:D18)

BOX 18.3 Names Attributed to the Cells of Fig. 18.13

Name                       Cells
Quantities_transported     B16:D18
Quantities_supplied        E16:E18
Capacity                   G16:G18
Quantities_delivered       B20:D20
Demand                     B22:D22
Total_cost                 G22

The representation of the Karpet Ltd. company problem in the Solver Parameters dialog box is illustrated in Fig. 18.14. Since names were attributed to the cells of the model, they are referenced by those names in Fig. 18.14. Note that the nonnegativity constraints were activated by selecting the Make Unconstrained Variables Non-Negative check box, and the Simplex LP engine was selected in the Solving Method box. The Options command remained unaltered. Finally, click Solve and select the option Keep Solver Solution in the Solver Results dialog box. Fig. 18.15 presents the optimal solution of the Karpet Ltd. company transportation problem.
Solution of the Caramel Candy & Confetti company problem (Example 18.2): The representation of the Caramel Candy & Confetti company transportation problem in an Excel spreadsheet is shown in Fig. 18.16 (see the file Example18.2_Confetti.xls). Analogous to the Karpet Ltd. company problem (Example 18.1), this problem also considers three suppliers and three consumers. Note that the unit transportation costs, the quantities transported, supplied, and delivered, besides the capacity, demand, and total cost, are represented in the same cells as in Fig. 18.13. Thus, the equations and the names attributed to the cells of Fig. 18.16 are similar to those of Fig. 18.13 from the previous example. Because we are not faced with a balanced problem (the total supply capacity is greater than the total demand), the constraints cannot all be written in equality form. The new constraints may be visualized in Fig. 18.16 and in the Solver Parameters dialog box, as shown in Fig. 18.17. Analogous to previous models, one assumes that the variables are nonnegative and that the model is linear. The optimal solution of the Caramel Candy & Confetti company transportation problem (Example 18.2) is illustrated in Fig. 18.18.
Solution of the Modified Caramel Candy & Confetti company problem (Example 18.3): Example 18.3 is an adaptation of the previous Caramel Candy & Confetti company example, in which the production capacities of the stores and the clients' demands are changed, focusing on Case 2, in which the total supply capacity is less than the total demand. Thus, the supply constraints are represented in equality form, given that the entire supply capacity will be utilized.


FIG. 18.14 Solver Parameters regarding the Karpet Ltd. company problem.

FIG. 18.15 Solution of the Karpet Ltd. company transportation problem (Example 18.1) using Excel Solver.


[FIG. 18.16 Representation in Excel of the Caramel Candy & Confetti company transportation problem (Example 18.2), analogous to Fig. 18.13, with the unit transportation cost table (Stores 1 to 3 versus the consumers in Sao Paulo, Baixada Santista, and Vale do Paraiba), the Quantities_transported cells, and the Demand and Capacity cells with the new inequality constraints.]

... (4,1), (4,2), (4,3), (4,4), (4,5), (4,6), (5,1), (5,2), (5,3), (5,4), (5,5), (5,6), (6,1), (6,2), (6,3), (6,4), (6,5), (6,6)}
1/4; 1/12; 1/9; 7/36; 2/3; 1/12

ANSWER KEYS: EXERCISES: CHAPTER 6
1)
P(X ≤ 2) = C(150,0) · 0.02^0 · 0.98^150 + C(150,1) · 0.02^1 · 0.98^149 + C(150,2) · 0.02^2 · 0.98^148 = 0.42
E(X) = 150 × 0.02 = 3
Var(X) = 150 × 0.02 × 0.98 = 2.94
2)
P(X = 1) = C(10,1) · 0.12^1 · 0.88^9 = 0.38
3)
P(X = 5) = 0.125 × 0.875^4 = 0.073
E(X) = 8
Var(X) = 56
4)
P(X = 33) = C(32,29) × 0.95^30 × 0.05^3 = 1.33%


E(X) = 31.6 ≅ 32
5) P(X = 4) = 16.8%
6)
a) P(X ≤ 12) = P(Z ≤ 0.67) = 1 − P(Z > 0.67) = 0.75
b) P(X < 5) = P(Z < −0.5) = P(Z > 0.5) = 0.3085
c) P(X > 2) = P(Z > −1) = P(Z < 1) = 1 − P(Z > 1) = 0.8413
d) P(6 < X ≤ 11) = P(−0.33 < Z ≤ 0.5) = [1 − P(Z > 0.5)] − P(Z > 0.33) = 0.3208
7) zc = −0.84
8)
a) μ = np = 40 × 0.5 = 20
σ = √(np(1 − p)) = √(40 × 0.5 × 0.5) = 3.16
P(X = 22) ≅ P(21.5 < X < 22.5) = P(0.474 < Z < 0.791) = 0.103
b) P(X > 25.5) = P(Z > 1.74) = 4.09%
9)
a) P(X > 120) = e^(−0.028 × 120) = 0.0347
b) P(X > 60) = e^(−0.028 × 60) = 0.1864
10)
a) P(X > 220) = e^(−220/180) = 0.2946
b) P(X ≤ 150) = 1 − e^(−150/180) = 0.5654
11)
a) P(X > 0.5) = e^(−1.8 × 0.5) = 0.4066
b) P(X ≤ 1.5) = 1 − e^(−1.8 × 1.5) = 0.9328
12)
a) P(X > 2) = e^(−0.33 × 2) = 0.5134
b) P(X ≤ 2.5) = 1 − e^(−0.33 × 2.5) = 0.5654

13) 6.304
14)
a) P(X > 25) = 0.07
b) P(X ≤ 32) = 0.99
c) P(25 < X ≤ 32) = P(X > 25) − P(X > 32) = 0.06
d) 28.845
e) 6.908
15)
a) 2.086
b) E(T) = 0
c) Var(T) = 1.111
16)
a) P(T > 3) = 0.0048
b) P(T ≤ 2) = 1 − P(T > 2) = 1 − 0.0344 = 0.9656
c) P(1.5 < T ≤ 2) = P(T > 1.5) − P(T > 2) = 0.0814 − 0.0344 = 0.0469
d) 1.345
e) 2.145
17)
a) P(X > 3) = 0.05
b) 3.73


c) 4.77
d) E(X) = 1.14
e) Var(X) = 0.98
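For readers who prefer to verify these results numerically, the short Python sketch below (an illustration, not part of the book) recomputes two of the Chapter 6 answers with scipy.stats.

from scipy import stats

# Exercise 1: X ~ Binomial(n = 150, p = 0.02)
print(stats.binom.cdf(2, 150, 0.02))                              # ~0.42 -> P(X <= 2)
print(stats.binom.mean(150, 0.02), stats.binom.var(150, 0.02))    # 3.0 and 2.94

# Exercise 9: exponential distribution with rate 0.028 (scale = 1/rate)
print(stats.expon.sf(120, scale=1 / 0.028))                       # ~0.0347 -> P(X > 120)
print(stats.expon.sf(60, scale=1 / 0.028))                        # ~0.1864 -> P(X > 60)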

ANSWER KEYS: EXERCISES: CHAPTER 7
5) Simple random sampling without replacement.
6) Systematic sampling.
7) Stratified sampling.
8) Stratified sampling.
9) Two-stage cluster sampling.
10) Using Expression (7.8) (SRS to estimate the proportion of a finite population), we have n = 262.
11) Using Expression (7.9) (stratified sampling to estimate the mean of an infinite population), we have n = 1,255.
12) Using Expression (7.20) (one-stage cluster sampling to estimate the proportion of an infinite population), we have m = 35.

ANSWER KEYS: EXERCISES: CHAPTER 8
1) P(51 − 1.645 · 18/√120 < μ < 51 + 1.645 · 18/√120) = 90%
2) P(5,400 − 2.030 · 200/√36 < μ < 5,400 + 2.030 · 200/√36) = 95%
3) P(0.24 − 1.96 · √(0.24 × 0.76/500) < p < 0.24 + 1.96 · √(0.24 × 0.76/500)) = 95%
4) P(60 × 8/83.298 < σ² < 60 × 8/40.482) = 95%
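The same intervals can be recomputed programmatically. The sketch below (not part of the book) reproduces the first and the fourth confidence intervals with scipy.stats.

import numpy as np
from scipy import stats

# 1) 90% confidence interval for the mean with sigma known: 51 +/- 1.645 * 18/sqrt(120)
z = stats.norm.ppf(0.95)
print(51 - z * 18 / np.sqrt(120), 51 + z * 18 / np.sqrt(120))

# 4) 95% confidence interval for the variance with n - 1 = 60 and s^2 = 8
lower = 60 * 8 / stats.chi2.ppf(0.975, df=60)    # chi-square upper tail value 83.298
upper = 60 * 8 / stats.chi2.ppf(0.025, df=60)    # chi-square lower tail value 40.482
print(lower, upper)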

ANSWER KEYS: EXERCISES: CHAPTER 9
7) For the K-S and S-W tests, we have P = 0.200 and 0.151, respectively. Therefore, since P > 0.05, the distribution of the data is normal.
8) The data follow a normal distribution (P = 0.200 > 0.05).
9) The variances are homogeneous (P = 0.876 > 0.05, Levene's test).
10) Since σ is unknown, the most suitable test is Student's t:
Tcal = (65 − 60)/(3.5/√36) = 8.571; tc = 2.030; since Tcal > tc, we reject H0 (μ ≠ 60).
11) Tcal = 6.921 and P-value = 0.000 < 0.005, so we reject H0 (μ1 ≠ μ2).
12) Tcal = 11.953 and P-value = 0.000 < 0.025, so we reject H0 (μbefore ≠ μafter), i.e., there was improvement after the treatment.
13) Fcal = 2.476 and P-value = 0.1 > 0.05, so we do not reject H0 (there is no difference between the population means).

ANSWER KEYS: EXERCISES: CHAPTER 10
4) Sign test.
5) By applying the binomial test for small samples, since P = 0.503 > 0.05, we do not reject H0, concluding that there is no difference in consumers' preferences.
6) By applying the chi-square test, since χ²cal > χ²c (6.100 > 5.991) or P < α (0.047 < 0.05), we reject H0, concluding that there are differences in readers' preferences.
7) By applying the Wilcoxon test, since zcal < −zc (−3.135 < −1.645) or P < α (0.0085 < 0.05), we reject H0, concluding that the diet resulted in weight loss.
8) By applying the Mann-Whitney U test (the data do not follow a normal distribution), since zcal > −zc (−0.129 > −1.96) or P > α (0.897 > 0.05), we do not reject H0, concluding that the samples come from populations with equal medians.
9) By applying Cochran's Q test, since Qcal > Qc (8.727 > 7.378) or P < α (0.013 < 0.025), we reject H0, concluding that the proportion of students with high learning levels is not the same in each subject.
10) By applying the Friedman test, since F′cal > Fc (9.190 > 5.991) or P < α (0.010 < 0.05), we reject H0, concluding that there are differences between the three services.


ANSWER KEYS: EXERCISES: CHAPTER 11 1) a) Agglomeration Schedule Cluster Combined

Stage Cluster First Appears

Stage

Cluster 1

Cluster 2

Coefficients

Cluster 1

Cluster 2

Next Stage

77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99

5 40 25 30 38 1 2 6 4 30 5 29 31 2 1 5 2 1 5 2 1 1 1

13 56 58 55 48 15 14 83 7 42 39 40 38 3 30 25 31 4 6 29 5 2 9

.006 .014 .014 .014 .014 .024 .024 .024 .024 .038 .038 .055 .075 .075 .153 .209 .246 .246 .723 .760 2.764 8.466 173.124

39 56 0 62 75 71 72 74 76 80 77 65 69 83 82 87 90 91 92 93 94 97 98

64 53 26 61 36 55 58 0 68 0 70 78 81 73 86 79 89 85 84 88 95 96 0

87 88 92 86 89 91 90 95 94 91 92 96 93 93 94 95 96 97 97 98 98 99 0

From the agglomeration schedule, it is possible to verify that a big Euclidean distance leap occurs from the 98th stage (when only two clusters remain) to the 99th stage. Analyzing the dendrogram also helps in this interpretation.

b)

In fact, the solution with two clusters is highly advisable at this moment.


c) Yes. From the agglomeration schedule, it is possible to verify that observation 9 (Antonio) had not clustered in until the moment exactly before the last stage. From the dendrogram, it is also possible to verify that this student differs from the others considerably, which, in this case, results in the generation of only two clusters. d) Agglomeration Schedule Cluster Combined

Stage Cluster First Appears

Stage

Cluster 1

Cluster 2

Coefficients

Cluster 1

Cluster 2

Next Stage

77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98

13 27 1 41 6 30 5 16 1 13 2 14 2 13 1 5 9 1 1 5 1 1

34 29 4 46 82 55 74 57 38 39 15 16 28 30 27 6 13 41 2 14 5 9

.537 .537 .537 .754 1.103 1.103 1.584 1.584 1.584 1.584 2.045 2.149 2.149 3.091 3.091 4.411 4.835 7.134 10.292 12.374 18.848 26.325

67 62 63 0 72 58 68 55 79 77 74 61 87 86 85 83 75 91 94 92 95 97

0 60 69 0 0 53 0 73 66 64 76 84 71 82 78 81 90 80 89 88 96 93

86 91 85 94 92 90 92 88 91 90 89 96 95 93 94 96 98 95 97 97 98 0


Yes, the new results show that there is one cluster rearrangement in the absence of observation Antonio. e) The existence of an outlier may cause other observations, not so similar to one another, to be allocated in the same cluster because they are extremely different from the first one. Therefore, reapplying the technique, with the exclusion or maintenance of outliers, makes the new clusters better structured, and makes them be generated with higher internal homogeneity.


2) a)


b) Agglomeration Schedule

Stage   Cluster 1   Cluster 2   Coefficients   First Appears (C1)   First Appears (C2)   Next Stage
  1         8          18           2.000               0                    0                5
  2         4          16           2.000               0                    0                4
  3         1          10           2.000               0                    0                6
  4         4           9           2.000               2                    0                5
  5         4           8           2.000               4                    1                7
  6         1           5           2.000               3                    0                7
  7         1           4           2.000               6                    5                8
  8         1           3           2.828               7                    0                9
  9         1           2           6.633               8                    0               17
 10        11          12          12.329               0                    0               11
 11         6          11          14.697               0                   10               15
 12        15          17          23.409               0                    0               13
 13         7          15          24.495               0                   12               14
 14         7          14          32.802              13                    0               16
 15         6          13          35.665              11                    0               16
 16         6           7          40.497              15                   14               17
 17         1           6          78.256               9                   16                0

From the agglomeration schedule, it is possible to verify that a big Euclidian distance leap occurs from the 16th stage (when only two clusters remain) to the 17th stage. Analyzing the dendrogram also helps in this interpretation. c) Dendrogram using single linkage

Y

0 Regional 3

8

Regional 3

18

Regional 3

4

Regional 3

16

Regional 3

9

Regional 3

1

Regional 3

10

Regional 3

5

Regional 3

3

Regional 3

2

Regional 1

15

Regional 1

17

Regional 1

7

Regional 1

14

Regional 2

11

Regional 2

12

Regional 2

6

Regional 2

13

5

In fact, there are indications of two clusters of stores.

Rescaled distance cluster combine 10 15

20

25


d) Derived stimulus configuration (Euclidean distance model): two-dimensional chart obtained through multidimensional scaling, with the stores positioned along Dimension 1 and Dimension 2.

The two-dimensional chart generated through the multidimensional scaling allows us to see these two clusters, and that one is more homogeneous than the other.
e) ANOVA

                                                          Cluster              Error
                                                    Mean Square   df    Mean Square   df        F      Sig.
Customers' average evaluation of services rendered   10802.178     1       99.600     16    108.456    .000
  (0 to 100)
Customers' average evaluation of the variety of       12626.178     1      199.100     16     63.416    .000
  goods (0 to 100)
Customers' average evaluation of the organization      18547.378     1      314.900     16     58.899    .000
  (0 to 100)

The F tests should be used only for descriptive purposes because the clusters have been chosen to maximize the differences among cases in different clusters. The observed significance levels are not corrected for this and thus cannot be interpreted as tests of the hypothesis that the cluster means are equal.

It is possible to state that both clusters formed present statistically different means for the three variables considered in the study, at a significance level of 0.05 (Prob. F < 0.05). Among the groups, the variable considered most discriminating is the one with the highest F statistic, that is, the variable services rendered (F ¼ 108.456).


f) Single Linkage × Cluster Number of Case Crosstabulation (Count)

                       Cluster Number of Case
Single Linkage            1        2      Total
  1                      10        0       10
  2                       0        8        8
Total                    10        8       18

Yes, there is correspondence between the allocations of the observations to the groups obtained through the hierarchical and k-means methods.
g) Yes, based on the dendrogram generated, it is possible to verify that all the stores that belong to regional center 3 form cluster 1, which has the lowest means for all the variables. This fact may determine some specific management action at these stores. After preparing a new cluster analysis, without the stores from cluster 1 (regional center 3), the new agglomeration schedule and its corresponding dendrogram are obtained. From it, we can see the differences between the stores from regional centers 1 and 2 more clearly.

Agglomeration Schedule

Stage   Cluster 1   Cluster 2   Coefficients   First Appears (C1)   First Appears (C2)   Next Stage
  1        11          12          12.329               0                    0                2
  2         6          11          14.697               0                    1                6
  3        15          17          23.409               0                    0                4
  4         7          15          24.495               0                    3                5
  5         7          14          32.802               4                    0                7
  6         6          13          35.665               2                    0                7
  7         6           7          40.497               6                    5                0


Dendrogram using single linkage (rescaled distance cluster combine) for the reapplied analysis, showing more clearly the separation between the stores from regional center 2 and those from regional center 1.

3) a) Agglomeration Schedule Cluster Combined Stage

Cluster 1

Cluster 2

1 2 3 4 5 6 7 8 9

18 19 17 16 20 23 17 18 21

33 34 32 31 35 27 19 26 23

Stage Cluster First Appears Coefficients 1.000 .980 .980 .980 .960 .880 .880 .860 .860

Cluster 1

Cluster 2

Next Stage

0 0 0 0 0 0 3 1 0

0 0 0 0 0 0 2 0 6

8 7 7 21 17 9 20 11 18


10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34

11 15 13 22 2 4 6 12 11 2 17 3 1 2 9 17 4 9 2 7 1 7 4 2 1

14 18 30 29 13 5 24 20 21 15 25 16 10 3 11 22 8 12 6 28 9 17 7 4 2

.860 .853 .840 .840 .820 .820 .800 .800 .797 .793 .790 .790 .780 .770 .768 .764 .750 .749 .742 .740 .728 .727 .703 .513 .484

0 0 0 0 0 0 0 0 10 14 7 0 0 19 0 20 15 24 23 0 22 29 26 28 30

0 8 0 0 12 0 0 5 9 11 0 4 0 21 18 13 0 17 16 0 27 25 31 32 33


18 19 14 25 19 26 28 27 24 23 25 23 30 28 27 31 32 30 33 31 34 32 33 34 0

Since it is a similarity measure, the values of the coefficients are in descending order in the agglomeration schedule. From this table, it is possible to verify that a considerable leap, in relation to the others, occurs from the 32nd stage (when three clusters are formed) to the 33rd clustering stage. Analyzing the dendrogram also helps in this interpretation.


b) Dendrogram using average linkage (between groups), rescaled distance cluster combine, for the 35 companies.

In fact, the solution with three clusters is highly advisable.
c) Sector × Average Linkage (Between Groups) Crosstabulation (Count)

              Cluster 1    Cluster 2    Cluster 3
Health            11            0            0
Education          0           12            0
Transport          0            0           12

Yes, there is correspondence between the industries and the allocations of companies in the clusters. That is, for the sample under analysis, we can state that companies from the same industry have similarities in relation to how their operations and decision-making processes are carried out. At least as regards the managers’ perception.


4) a) Proximity Matrix: Correlation between Vectors of Values (this is a similarity matrix)

Case      1      2      3      4      5      6      7      8      9     10     11     12     13     14     15     16
  1    1.000   .866 -1.000   .000   .998   .945  -.996   .000  1.000   .971 -1.000  -.500   .999   .997 -1.000   .327
  2     .866  1.000  -.866  -.500   .896   .655  -.908  -.500   .866   .721  -.856  -.866   .891   .822  -.881  -.189
  3   -1.000  -.866  1.000   .000  -.998  -.945   .996   .000 -1.000  -.971  1.000   .500  -.999  -.997  1.000  -.327
  4     .000  -.500   .000  1.000  -.064   .327   .091  1.000   .000   .240  -.020   .866  -.052   .082   .030   .945
  5     .998   .896  -.998  -.064  1.000   .922 -1.000  -.064   .998   .953  -.996  -.554  1.000   .989  -.999   .266
  6     .945   .655  -.945   .327   .922  1.000  -.911   .327   .945   .996  -.951  -.189   .926   .969  -.935   .619
  7    -.996  -.908   .996   .091 -1.000  -.911  1.000   .091  -.996  -.945   .994   .577  -.999  -.985   .998  -.240
  8     .000  -.500   .000  1.000  -.064   .327   .091  1.000   .000   .240  -.020   .866  -.052   .082   .030   .945
  9    1.000   .866 -1.000   .000   .998   .945  -.996   .000  1.000   .971 -1.000  -.500   .999   .997 -1.000   .327
 10     .971   .721  -.971   .240   .953   .996  -.945   .240   .971  1.000  -.975  -.277   .957   .987  -.963   .545
 11   -1.000  -.856  1.000  -.020  -.996  -.951   .994  -.020 -1.000  -.975  1.000   .483  -.997  -.998   .999  -.346
 12    -.500  -.866   .500   .866  -.554  -.189   .577   .866  -.500  -.277   .483  1.000  -.545  -.427   .526   .655
 13     .999   .891  -.999  -.052  1.000   .926  -.999  -.052   .999   .957  -.997  -.545  1.000   .991 -1.000   .277
 14     .997   .822  -.997   .082   .989   .969  -.985   .082   .997   .987  -.998  -.427   .991  1.000  -.994   .404
 15   -1.000  -.881  1.000   .030  -.999  -.935   .998   .030 -1.000  -.963   .999   .526 -1.000  -.994  1.000  -.298
 16     .327  -.189  -.327   .945   .266   .619  -.240   .945   .327   .545  -.346   .655   .277   .404  -.298  1.000


b) Agglomeration Schedule

Stage   Cluster 1   Cluster 2   Coefficients   First Appears (C1)   First Appears (C2)   Next Stage
  1         1           9          1.000               0                    0                6
  2         4           8          1.000               0                    0               11
  3         5          13          1.000               0                    0                6
  4         3          11          1.000               0                    0                5
  5         3          15          1.000               4                    0                7
  6         1           5           .999               1                    3                8
  7         3           7           .998               5                    0               15
  8         1          14           .997               6                    0               10
  9         6          10           .996               0                    0               10
 10         1           6           .987               8                    9               12
 11         4          16           .945               2                    0               13
 12         1           2           .896              10                    0               14
 13         4          12           .866              11                    0               14
 14         1           4           .619              12                   13               15
 15         1           3           .577              14                    7                0

Since Pearson’s correlation is being used as a similarity measure between observations, the values of the coefficients are in descending order in the agglomeration schedule. From this Table, it is possible to verify that a relevant leap, in relation to the others, occurs from the 13th stage (when three clusters with weekly periods are formed) to the 14th clustering stage. Analyzing the dendrogram also helps in this interpretation.


c) Dendrogram using single linkage (rescaled distance cluster combine) for the 16 weekly periods, showing three clusters of weeks.

In fact, the solution with three weekly clusters is highly advisable at this moment. Moreover, it is possible to verify that the second and third clusters are formed exclusively by the periods related to the third and fourth weeks of each month, respectively. This may offer evidence that there is recurrence of the joint behavior of banana, orange, and apple sales in these periods, for the data in this example. The following table shows the association between the variable week_month and the allocation of each observation to a cluster.

week_month × Single Linkage Crosstabulation (Count)

                  Cluster 1    Cluster 2    Cluster 3
week_month 1          4            0            0
week_month 2          4            0            0
week_month 3          0            4            0
week_month 4          0            0            4


ANSWER KEYS: EXERCISES: CHAPTER 12
1) a) For each factor, we have the following eigenvalues:
Factor 1: (0.917)² + (0.874)² + (0.844)² + (0.031)² = 2.318
Factor 2: (0.047)² + (0.077)² + (0.197)² + (0.979)² = 1.005
b) The proportions of variance shared by all the variables to form each factor are:
Factor 1: 2.318/4 = 0.580 (58.00%)
Factor 2: 1.005/4 = 0.251 (25.10%)
The total proportion of variance lost by the four variables to extract these two factors is:
1 − 0.580 − 0.251 = 0.169 (16.90%)
c) The proportions of variance shared to form both factors (communalities) are:
communality_age = (0.917)² + (0.047)² = 0.843
communality_fixedif = (0.874)² + (0.077)² = 0.770
communality_variableif = (0.844)² + (0.197)² = 0.751
communality_people = (0.031)² + (0.979)² = 0.959
d) Based on the two factors extracted, the expressions of each standardized variable are:
Zage_i = 0.917 · F1i + 0.047 · F2i + ui, R² = 0.843
Zfixedif_i = 0.874 · F1i + 0.077 · F2i + ui, R² = 0.770
Zvariableif_i = −0.844 · F1i + 0.197 · F2i + ui, R² = 0.751
Zpeople_i = 0.031 · F1i + 0.979 · F2i + ui, R² = 0.959
e)

[Loading plot of the two extracted factors: age, fixedif, and variableif load mainly on the first factor (horizontal axis), and people loads mainly on the second factor (vertical axis).]

f) While the variables age, fixedif, and variableif have a high correlation with the first factor (X-axis), the variable people has a strong correlation with the second factor (Y-axis). This phenomenon can result from the fact that older customers, since they do not like taking risks, invest much more in fixed-income funds, such as savings accounts or CDBs (Bank Deposit Certificates). On the other hand, even though the variable variableif has a high correlation with the first factor, its factor loading is negative. This shows that younger customers invest much more in variable-income funds, such as stocks. Finally, the number of people who live in the household (variable people) has a low correlation with the other variables. Thus, it ends up having a high factor loading on the second factor.
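The calculations in items (a) to (c) can be reproduced directly from the loading matrix. The Python sketch below is only an illustration (not part of the book); the sign of the variableif loading on the first factor is taken as negative, following the interpretation given in item (f), although only squared loadings matter for these particular quantities.

import numpy as np

# Factor loadings (rows: age, fixedif, variableif, people; columns: F1, F2)
loadings = np.array([[ 0.917, 0.047],
                     [ 0.874, 0.077],
                     [-0.844, 0.197],     # negative sign on F1 assumed, per item (f)
                     [ 0.031, 0.979]])

eigenvalues = (loadings ** 2).sum(axis=0)          # ~[2.318, 1.005]
variance_share = eigenvalues / loadings.shape[0]   # ~[0.580, 0.251]
communalities = (loadings ** 2).sum(axis=1)        # ~[0.843, 0.770, 0.751, 0.959]
print(eigenvalues, variance_share, communalities)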


2) a)
YEAR 1 KMO and Bartlett's Test
Kaiser-Meyer-Olkin Measure of Sampling Adequacy: .719
Bartlett's Test of Sphericity: Approx. Chi-Square = 89.637, df = 6, Sig. = .000

YEAR 2 KMO and Bartlett's Test
Kaiser-Meyer-Olkin Measure of Sampling Adequacy: .718
Bartlett's Test of Sphericity: Approx. Chi-Square = 86.483, df = 6, Sig. = .000

Based on the KMO statistic, we can state that the overall adequacy of the factor analysis is considered average for each of the years of study (KMO = 0.719 for the first year and KMO = 0.718 for the second year). In both periods, the χ²Bartlett statistics allow us to reject, at a significance level of 0.05 and based on the hypotheses of Bartlett's test of sphericity, that the correlation matrices are statistically equal to the identity matrix of the same dimension, since χ²Bartlett = 89.637 (Sig. χ²Bartlett < 0.05 for 6 degrees of freedom) for the first year and χ²Bartlett = 86.483 (Sig. χ²Bartlett < 0.05 for 6 degrees of freedom) for the second year. Therefore, the principal component analysis is adequate for each of the years of study. b) YEAR 1 Total Variance Explained Initial Eigenvalues

Extraction Sums of Squared Loadings

Component

Total

% of Variance

Cumulative %

Total

% of Variance

Cumulative %

1 2 3 4

2.589 .730 .536 .146

64.718 18.247 13.391 3.643

64.718 82.965 96.357 100.000

2.589

64.718

64.718

Extraction Method: Principal Component Analysis.

YEAR 2 Total Variance Explained Initial Eigenvalues

Extraction Sums of Squared Loadings

Component

Total

% of Variance

Cumulative %

Total

% of Variance

Cumulative %

1 2 3 4

2.566 .737 .543 .154

64.149 18.435 13.577 3.838

64.149 82.584 96.162 100.000

2.566

64.149

64.149

Extraction Method: Principal Component Analysis.


Based on the latent root criterion, only one factor is extracted in each of the years, with their respective eigenvalue: Year 1: 2.589 Year 2: 2.566 The proportion of variance shared by all the variables to form the factor each year is: Year 1: 64.718% Year 2: 64.149% c) YEAR 1 Component Matrixa Component 1 Corruption Perception Index - year 1 (Transparency International) Number of murders per 100,000 inhabitants: year 1 (OMS, UNODC and GIMD) Per capita GDP - year 1, using 2000 as the base year (in US$ adjusted for inflation) (World Bank) Average number of years in school per person over 25 years of age - year 1 (IHME)

.900 .614 .911 .755

a 1 component extracted. Extraction Method: Principal Component Analysis.

YEAR 1 Communalities

Corruption Perception Index - year 1 (Transparency International) Number of murders per 100,000 inhabitants: year 1 (OMS, UNODC and GIMD) Per capita GDP - year 1, using 2000 as the base year (in US$ adjusted for inflation) (World Bank) Average number of years in school per person over 25 years of age - year 1 (IHME)

Initial

Extraction

1.000 1.000 1.000 1.000

.810 .378 .830 .571

Extraction Method: Principal Component Analysis.

YEAR 2 Component Matrixa Component 1 Corruption Perception Index - year 2 (Transparency International) Number of murders per 100,000 inhabitants: year 2 (OMS, UNODC and GIMD) Per capita GDP - year 2, using 2000 as the base year (in US$ adjusted for inflation) (World Bank) Average number of years in school per person over 25 years of age - year 2 (IHME) a 1 component extracted. Extraction Method: Principal Component Analysis.

.899 .608 .908 .750


YEAR 2 Communalities

Corruption Perception Index - year 2 (Transparency International) Number of murders per 100,000 inhabitants: year 2 (OMS, UNODC and GIMD) Per capita GDP - year 2, using 2000 as the base year (in US$ adjusted for inflation) (World Bank) Average number of years in school per person over 25 years of age - year 2 (IHME)

Initial

Extraction

1.000 1.000 1.000 1.000

.808 .370 .825 .563

Extraction Method: Principal Component Analysis.

We can see that slight reductions occurred in the communalities of all the variables from the first to the second year. d) YEAR 1 Component Score Coefficient Matrix Component 1 Corruption Perception Index - year 1 (Transparency International) Number of murders per 100,000 inhabitants: year 1 (OMS, UNODC and GIMD) Per capita GDP - year 1, using 2000 as the base year (in US$ adjusted for inflation) (World Bank) Average number of years in school per person over 25 years of age - year 1 (IHME)

.348 .237 .352 .292

Extraction Method: Principal Component Analysis. Component Scores.

YEAR 2 Component Score Coefficient Matrix Component 1 Corruption Perception Index - year 2 (Transparency International) Number of murders per 100,000 inhabitants: year 2 (OMS, UNODC and GIMD) Per capita GDP - year 2, using 2000 as the base year (in US$ adjusted for inflation) (World Bank) Average number of years in school per person over 25 years of age - year 2 (IHME)

.350 .237 .354 .292

Extraction Method: Principal Component Analysis. Component Scores.

Based on the standardized variables, the expression of the factor extracted each year is:
Year 1: Fi = 0.348 · Zcpi1i − 0.237 · Zviolence1i + 0.352 · Zcapita_gdp1i + 0.292 · Zschool1i
Year 2: Fi = 0.350 · Zcpi2i − 0.237 · Zviolence2i + 0.354 · Zcapita_gdp2i + 0.292 · Zschool2i
Even if small changes occurred in the factor scores from one year to the next, this only reinforces the importance of reapplying the technique to obtain factors with more precise and updated scores, mainly when they are used to create indexes and rankings.


e) Year 1

Year 2

Country

Index

Ranking

Country

Index

Ranking

Switzerland Norway Denmark Sweden Japan United States Canada United Kingdom Netherlands Australia Germany Austria Ireland New Zealand Singapore Belgium Israel France Cyprus United Arab Emirates Czech Rep. Italy Poland Spain Chile Greece Kuwait Portugal Romania Oman Saudi Arabia Serbia Argentina Turkey Ukraine Kazakhstan Malaysia Lebanon Russia Mexico China Egypt Thailand Indonesia India Brazil Philippines Venezuela South Africa Colombia

1.6923 1.6794 1.4327 1.4040 1.3806 1.3723 1.3430 1.1560 1.1086 1.0607 1.0297 0.9865 0.9439 0.9269 0.8781 0.8175 0.6322 0.5545 0.5099 0.3157 0.2244 0.0859 0.0373 0.0303 0.0517 0.1432 0.2276 0.2980 0.3028 0.4742 0.5111 0.5407 0.5556 0.6476 0.7109 0.7423 0.7459 0.7966 0.8534 0.8803 0.8840 0.9792 1.0632 1.2245 1.2272 1.3294 1.3466 1.3916 1.8215 1.8534

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50

Norway Switzerland Sweden Denmark Japan Canada United States United Kingdom Netherlands Australia Germany Austria Ireland Singapore New Zealand Belgium Israel France Cyprus United Arab Emirates Czech Rep. Poland Spain Chile Italy Kuwait Greece Portugal Romania Saudi Arabia Oman Argentina Serbia Malaysia Turkey Ukraine Kazakhstan Lebanon Russia China Mexico Egypt Thailand Indonesia India Brazil Philippines Venezuela Colombia South Africa

1.6885 1.6594 1.4388 1.4225 1.3848 1.3844 1.3026 1.1321 1.1007 1.0660 1.0401 0.9903 0.9411 0.9184 0.9063 0.8265 0.6444 0.5448 0.4606 0.2849 0.1857 0.0868 0.0334 0.0170 0.0064 0.1462 0.2247 0.2794 0.3150 0.4321 0.5034 0.5342 0.5544 0.6098 0.6401 0.6807 0.6970 0.8060 0.8513 0.8982 0.9323 0.9485 1.0800 1.2431 1.2533 1.3468 1.3885 1.4149 1.7697 1.9173

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50

From the first to the second year, there were some changes in the relative positions of the countries in the ranking.


3) a) Correlation Matrix

                Variety   Replacement   Layout   Comfort   Cleanliness   Services   Prices   Discounts
Variety          1.000       .753        .898     .733        .640         .193      .084      .053
Replacement       .753      1.000        .429     .633        .548         .208     -.449     -.367
Layout            .898       .429       1.000     .641        .567         .142      .413      .318
Comfort           .733       .633        .641    1.000        .864         .227      .235      .174
Cleanliness       .640       .548        .567     .864       1.000         .194      .220      .173
Services          .193       .208        .142     .227        .194        1.000      .137      .113
Prices            .084      -.449        .413     .235        .220         .137     1.000      .906
Discounts         .053      -.367        .318     .174        .173         .113      .906     1.000

Variables (all measured from 0 to 10): Variety = perception of the variety of goods; Replacement = perception of the quality and speed of inventory replacement; Layout = perception of the store's layout; Comfort = perception of thermal, acoustic, and visual comfort inside the store; Cleanliness = perception of the store's general cleanliness; Services = perception of the quality of the services rendered; Prices = perception of the store's prices compared to the competition; Discounts = perception of the store's discount policy.

Yes. Based on the magnitude of some of Pearson's correlation coefficients, there is a first indication that the factor analysis may group the variables into factors.

b)

KMO and Bartlett's Test
Kaiser-Meyer-Olkin Measure of Sampling Adequacy                .610
Bartlett's Test of Sphericity    Approx. Chi-Square       13752.938
                                 df                              28
                                 Sig.                          .000

Yes. From the result of the χ²_Bartlett statistic, it is possible to reject the hypothesis that the correlation matrix is statistically equal to the identity matrix of the same dimension, at a significance level of 0.05, based on Bartlett's test of sphericity, since χ²_Bartlett = 13,752.938 (Sig. χ²_Bartlett < 0.05 for 28 degrees of freedom). Therefore, the principal component analysis can be considered adequate.
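As a quick check, Bartlett's statistic can be computed directly from the correlation matrix with the usual chi-square approximation. This is only a sketch (not the software output reproduced in the book), assuming R is the 8 x 8 correlation matrix above stored as a NumPy array and n is the sample size of the survey.

```python
# Bartlett's test of sphericity from a correlation matrix R and sample size n.
import numpy as np
from scipy import stats

def bartlett_sphericity(R, n):
    p = R.shape[0]
    chi2 = -(n - 1 - (2 * p + 5) / 6.0) * np.log(np.linalg.det(R))
    df = p * (p - 1) / 2                       # 28 degrees of freedom for p = 8
    return chi2, df, stats.chi2.sf(chi2, df)   # statistic, df, significance
```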

c)

Total Variance Explained

                 Initial Eigenvalues                        Extraction Sums of Squared Loadings
Component   Total   % of Variance   Cumulative %      Total   % of Variance   Cumulative %
1           3.825      47.812          47.812         3.825      47.812          47.812
2           2.254      28.174          75.986         2.254      28.174          75.986
3            .944      11.794          87.780
4            .597       7.458          95.238
5            .214       2.679          97.917
6            .126       1.570          99.486
7            .025        .313          99.799
8            .016        .201         100.000

Extraction Method: Principal Component Analysis.
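The latent root (Kaiser) criterion used below can also be reproduced numerically. A minimal sketch, assuming R is the correlation matrix of the eight perception variables as a NumPy array:

```python
# Eigenvalues of R, proportion of shared variance, and the latent root criterion.
import numpy as np

eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]   # eigenvalues in descending order
pct = 100 * eigvals / eigvals.sum()              # % of variance per component
cum = pct.cumsum()                               # cumulative % of variance
keep = eigvals > 1                               # latent root criterion
# For this exercise, eigvals[:2] should be close to (3.825, 2.254)
# and cum[1] close to 75.986 (two factors retained).
```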


Considering the latent root criterion, two factors are extracted, with the respective eigenvalues: Factor 1: 3.825; Factor 2: 2.254. The proportion of variance shared by all the variables to form each factor is: Factor 1: 47.812%; Factor 2: 28.174%. Thus, the total proportion of variance shared by all the variables to form both factors is equal to 75.986%.

d) The total proportion of variance lost by all the variables to extract these two factors is:

1 - 0.75986 = 0.24014 (24.014%)

e)

Communalities

                                                                          Initial   Extraction
Perception of the variety of goods (0 to 10)                               1.000       .873
Perception of the quality and speed of inventory replacement (0 to 10)     1.000       .914
Perception of the store's layout (0 to 10)                                 1.000       .766
Perception of thermal, acoustic and visual comfort inside the store
(0 to 10)                                                                  1.000       .827
Perception of the store's general cleanliness (0 to 10)                    1.000       .721
Perception of the quality of the services rendered (0 to 10)               1.000       .101
Perception of the store's prices compared to the competition (0 to 10)     1.000       .978
Perception of the store's discount policy (0 to 10)                        1.000       .900

Extraction Method: Principal Component Analysis.

Note that the loadings and the communality of the variable services rendered are considerably low. This may indicate the need to extract a third factor, which would mean setting aside the latent root criterion.


f)

Communalities

                                                                          Initial   Extraction
Perception of the variety of goods (0 to 10)                               1.000       .887
Perception of the quality and speed of inventory replacement (0 to 10)     1.000       .917
Perception of the store's layout (0 to 10)                                 1.000       .804
Perception of thermal, acoustic and visual comfort inside the store
(0 to 10)                                                                  1.000       .828
Perception of the store's general cleanliness (0 to 10)                    1.000       .722
Perception of the quality of the services rendered (0 to 10)               1.000       .987
Perception of the store's prices compared to the competition (0 to 10)     1.000       .978
Perception of the store's discount policy (0 to 10)                        1.000       .900

Extraction Method: Principal Component Analysis.

Yes, it is possible to confirm the construct of the questionnaire proposed by the store's general manager: the variables variety of goods, replacement, layout, comfort, and cleanliness are more strongly correlated with one factor; the variables prices and discounts with another factor; and, finally, the variable services rendered with a third factor.

g) Although it departs from the latent root criterion, the decision to extract three factors increases the communalities of the variables, most notably for the variable services rendered, which is now more strongly correlated with the third factor.

h)


Varimax rotation redistributes the variable loadings across the factors, which facilitates the confirmation of the construct proposed by the store's general manager.
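A minimal sketch of the three-factor extraction with Varimax rotation discussed in items (f) through (h), assuming the eight 0-10 perception answers are in a DataFrame df and that the factor_analyzer package is available (this is not the software used in the book):

```python
# Three-factor principal component extraction with Varimax rotation.
from factor_analyzer import FactorAnalyzer

fa = FactorAnalyzer(n_factors=3, rotation="varimax", method="principal")
fa.fit(df)                            # df: the eight perception variables
loadings = fa.loadings_               # rotated loadings, one column per factor
communalities = fa.get_communalities()
```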

i) Component plot in rotated space

[Figure: three-dimensional plot of the rotated loadings of Assortment (variety of goods), Replacement, Layout, Comfort, Cleanliness, Services, Prices, and Discounts on components 1, 2, and 3.]


[Figure: a second view of the component plot in rotated space, with the same eight variables plotted on components 1, 2, and 3.]

ANSWER KEYS: EXERCISES: CHAPTER 13

1)
a) Ŷ = -3.8563 + 0.3872·X
b) R² = 0.9250
c) Yes (P-value of t = 0.000 < 0.05).
d) 9.9595 billion dollars (we must make Y = 0 and solve the equation).
e) -3.8563% (we must make X = 0).
f) 0.4024% (mean); -1.2505% (minimum); 2.0554% (maximum)

2)
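As an illustration of how the estimates in answer 1 are obtained, a short statsmodels sketch follows. The file and column names (X, Y) are hypothetical placeholders for the exercise's two variables.

```python
# Simple linear regression: coefficients, R-squared and t-test P-values.
import pandas as pd
import statsmodels.api as sm

data = pd.read_excel("chapter13_ex1.xlsx")           # hypothetical file
model = sm.OLS(data["Y"], sm.add_constant(data["X"])).fit()
print(model.params)      # intercept and slope (about -3.8563 and 0.3872 here)
print(model.rsquared)    # R-squared (about 0.9250)
print(model.pvalues)     # P-values of the t tests
```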


a) Yes, since the P-value of the F statistic < 0.05, we can state that at least one of the explanatory variables is statistically significant to explain the behavior of the variable cpi, at a significance level of 0.05.
b) Yes, since the P-values of both t statistics < 0.05, we can state that their parameters are statistically different from zero, at a significance level of 0.05. Therefore, the Stepwise procedure would not exclude any of the explanatory variables from the final model.
c) ĉpi_i = 15.1589 + 0.0700·age_i - 0.4245·hours_i
d) R² = 0.3177
e) By analyzing the signs of the final model's coefficients, for this cross-section, we can state that countries whose billionaires have lower average ages have lower cpi indexes, that is, a higher perception of corruption by society. Besides, on average, a greater number of hours worked per week has a negative relationship with the variable cpi; that is, countries with a higher corruption perception (lower cpis) have a higher weekly workload. It is important to mention that countries with lower cpis are those considered emerging countries.
f)

By using the Shapiro-Francia test, the most suitable one for this sample size, we can see that the residuals follow a normal distribution, at a significance level of 0.05. We would have arrived at the same conclusion had the Shapiro-Wilk test been used.

g)

From the Breusch-Pagan/Cook-Weisberg test, it is possible to verify whether there is homoskedasticity in the proposed model.

h)

Since the final model obtained does not have very high VIF statistics (1 - Tolerance = 0.058), we may consider that there are no multicollinearity problems.
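The diagnostics of items (f) to (h) can be reproduced from a fitted statsmodels OLS result. A sketch, assuming `model` is that fitted result (the book itself reports Stata output):

```python
# Residual normality, Breusch-Pagan test and VIFs for a fitted OLS model.
from scipy import stats
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor

w, p_norm = stats.shapiro(model.resid)                  # Shapiro-Wilk on residuals
lm, lm_p, f_stat, f_p = het_breuschpagan(model.resid, model.model.exog)
vifs = [variance_inflation_factor(model.model.exog, j)
        for j in range(1, model.model.exog.shape[1])]   # skip the constant
```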


3)

a) The difference between the average cpi value for emerging and for developed countries is 3.6318. That is, while emerging countries have an average cpi = 4.0968, developed countries have an average cpi = 7.7286. The latter is exactly the value of the intercept in the regression of cpi on the variable emerging, since the dummy emerging = 0 for developed countries. Yes, this difference is statistically significant, at a significance level of 0.05, since the P-value of the t statistic < 0.05 for the variable emerging. b)

ĉpi_i = 13.1701 - 0.1734·hours_i - 3.2238·emerging_i

c)

ĉpi = 13.1701 - 0.1734·(37) - 3.2238·(1) = 3.5305


d)


ĉpi_min = 8.0092 - 0.3369·(37) - 4.0309·(1) = -8.4870
ĉpi_max = 18.3310 - 0.0099·(37) - 2.4168·(1) = 15.5479

Obviously, the confidence interval is extremely broad and makes no sense. This happened because the value of R2 is not so high. e)

ĉpi_i = 27.4049 - 5.7138·ln(hours_i) - 3.2133·emerging_i

f) Since the adjusted R² is slightly higher in the model with the nonlinear functional form (logarithmic functional form for the variable hours) than in the model with the linear functional form, we choose the nonlinear model estimated in item (e). Since, in both cases, neither the number of variables nor the sample size changes, this analysis could also be carried out directly from the values of R².

4)
a) ĉholesterol_t = 136.7161 + 1.9947·bmi_t - 5.1635·sport_t
b) We can see that the body mass index has a positive relationship with the LDL cholesterol index, such that every time the index increases by one unit there is, on average, an increase of almost 2 mg/dL of the cholesterol commonly known as bad cholesterol, ceteris paribus. Analogously, increasing the frequency of weekly physical activity by one unit makes the LDL cholesterol index drop, on average, by more than 5 mg/dL, ceteris paribus. Therefore, maintaining one's weight, or even losing weight, plus establishing a routine of weekly physical activity, may contribute to a healthier life.


c)

Since, at a significance level of 0.05 and for a model with 3 parameters and 48 observations, we have 0.938 < dL = 1.45, we can state that there is a positive first-order autocorrelation between the error terms.

d)

By analyzing the Breusch-Godfrey test, we can see that, besides the first-order autocorrelation between the error terms, there are also autocorrelation problems between the 3rd-, 4th-, and 12th-order residuals. This reveals the seasonality in the executive's behavior regarding his body mass and commitment to physical activity.
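The two serial-correlation checks of items (c) and (d) are available in statsmodels. A sketch, assuming `model` is the fitted OLS result for the cholesterol regression:

```python
# Durbin-Watson statistic and Breusch-Godfrey test up to lag 12.
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.diagnostic import acorr_breusch_godfrey

dw = durbin_watson(model.resid)                     # compare with dL/dU at 0.05
bg_lm, bg_p, bg_f, bg_fp = acorr_breusch_godfrey(model, nlags=12)
```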

ANSWER KEYS: EXERCISES: CHAPTER 14 1)

a) Yes. Since the P-value of the χ² statistic < 0.05, we can state that at least one of the explanatory variables is statistically significant to explain the probability of default, at a significance level of 0.05.
b) Yes. Since the P-values of all Wald z statistics < 0.05, we can state that their respective parameters are statistically different from zero, at a significance level of 0.05. Therefore, no explanatory variable will be excluded from the final model.
c)

p_i = 1 / {1 + e^-(2.97507 - 0.02433·age_i + 0.74149·gender_i - 0.00025·income_i)}

d) Yes. Since the parameter estimated for the variable gender is positive, on average, male individuals (dummy ¼ 1) have higher probabilities of default than female individuals, as long as the other conditions are kept constant. The chances of the event occurring will be multiplied by a factor greater than 1.


e) No. Older people, on average, tend to have smaller probabilities of default, keeping the remaining conditions constant, since the parameter of the variable age is negative; that is, the chance of the event occurring is multiplied by a factor less than 1 as age increases.
f)

p = 1 / {1 + e^-[2.97507 - 0.02433·(37) + 0.74149·(1) - 0.00025·(6,850)]} = 0.7432

The average probability of default estimated for this individual is 74.32%. g)

The chance of being default as the income increases by one unit is, on average and maintaining the remaining conditions constant, multiplied by a factor of 0.99974 (a chance 0.026% lower). h)

While the overall model efficiency is 77.40%, the sensitivity is 93.80% and the specificity is 30.23% (for a cutoff of 0.5).
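The classification summary of item (h) can be computed directly from the fitted probabilities. A sketch, assuming `p_hat` holds the fitted probabilities and `y` the observed 0/1 default indicator as NumPy arrays:

```python
# Sensitivity, specificity and overall efficiency for a given cutoff.
import numpy as np

def classification_summary(y, p_hat, cutoff=0.5):
    pred = (p_hat >= cutoff).astype(int)
    tp = np.sum((pred == 1) & (y == 1)); tn = np.sum((pred == 0) & (y == 0))
    fp = np.sum((pred == 1) & (y == 0)); fn = np.sum((pred == 0) & (y == 1))
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / len(y)       # overall model efficiency
    return sensitivity, specificity, accuracy
```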


2) a)

Only the category bad of variable price was not statistically significant, at a significance level of 0.05, to explain the probability of the event occurring - the event we are interested in. That is, there are no differences that would change the probability of someone becoming loyal to the retailer when they answer terrible or bad on their perception of the prices, maintaining the remaining conditions constant. b)


c)

For a cutoff of 0.5, the overall model efficiency is 86.00%. d)

[Graph: sensitivity and specificity plotted against the probability cutoff (from 0.00 to 1.00).]


The cutoff from which the specificity becomes slightly higher than the sensitivity is equal to 0.57.
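The crossover cutoff of item (d) can be located by sweeping the cutoff and reusing the classification_summary helper sketched above (again only an illustration, with `y` and `p_hat` assumed available):

```python
# Find the first cutoff at which specificity exceeds sensitivity.
import numpy as np

cutoffs = np.linspace(0, 1, 101)
sens = [classification_summary(y, p_hat, c)[0] for c in cutoffs]
spec = [classification_summary(y, p_hat, c)[1] for c in cutoffs]
crossover = next(c for c, se, sp in zip(cutoffs, sens, spec) if sp > se)
```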

e) On average, the chance of becoming loyal to the establishment is multiplied by a factor of 5.39 when their perception of services rendered changes from terrible to bad. Whereas, from terrible to regular, this chance is multiplied by a factor of 6.17. From terrible to good, it is multiplied by a factor of 27.78, and, finally, from terrible to excellent, by a factor of 75.60. These answers will only be valid if the other conditions are kept constant. f) On average, the chance of becoming loyal to the establishment is multiplied by a factor of 6.43 when their perception of variety of goods changes from terrible to bad. Whereas, from terrible to regular, this chance is multiplied by a factor of 7.83. From terrible to good, it is multiplied by a factor of 28.09, and, finally, from terrible to excellent, by a factor of 381.88. Conversely, for the variable accessibility, on average, the chance of becoming loyal to the establishment is multiplied by a factor of 10.49 when their perception changes from terrible to bad. From terrible to regular, this chance is multiplied by a factor of 18.55. From terrible to good, it is multiplied by a factor of 127.40, and, finally, from terrible to excellent, by a factor of 213.26. Finally, for the variable price, on average, the chance of becoming loyal to the establishment is multiplied by a factor of 18.47 when their perception changes from terrible or bad to regular. From terrible or bad to good, this chance is multiplied by a factor of 20.82. Lastly, from terrible or bad to excellent, the chance of becoming loyal to the establishment is multiplied by a factor of 49.87. These answers will only be valid if the other conditions are kept constant in each case. g) Based on the analysis of these chances, if the establishment wishes to invest in a single perceptual variable to increase the probability of consumers becoming loyal, such that, they leave their terrible perceptions behind and begin, with higher frequency, to have excellent perceptions of this issue, it must invest in the variable variety of goods, since this variable is the one that shows the highest odds ratio (381.88). In other words, the chances of becoming loyal to the establishment, when their perception changes from terrible variety of goods to excellent, are, on average, multiplied by a factor of 381.88 (38,088% higher), maintaining the remaining conditions constant.


3) a)

b)

Yes. Since the P-value of the χ² statistic < 0.05, we can reject the null hypothesis that all parameters β_jm (j = 1, 2; m = 1, 2, 3, 4) are statistically equal to zero, at a significance level of 0.05. That is, at least one of the explanatory variables is statistically significant to form the occurrence probability expression of at least one of the classifications proposed for the LDL cholesterol index. c) Since all the parameters are statistically significant for all the logits (Wald z tests at a significance level of 0.05), the final equations estimated for the average occurrence probabilities of the classifications proposed for the LDL cholesterol index can be written as follows:


Probability of an individual i having a very high LDL cholesterol index:

p_i(very high) = 1 / D_i

Probability of an individual i having a high LDL cholesterol index:

p_i(high) = e^(-0.42 - 0.31·cigarette_i + 0.16·sport_i) / D_i

Probability of an individual i having a borderline LDL cholesterol index:

p_i(borderline) = e^(-2.62 - 0.41·cigarette_i + 1.01·sport_i) / D_i

Probability of an individual i having a near optimal LDL cholesterol index:

p_i(near optimal) = e^(-2.46 - 1.41·cigarette_i + 1.13·sport_i) / D_i

Probability of an individual i having an optimal LDL cholesterol index:

p_i(optimal) = e^(-2.86 - 1.67·cigarette_i + 1.16·sport_i) / D_i

where the common denominator is

D_i = 1 + e^(-0.42 - 0.31·cigarette_i + 0.16·sport_i) + e^(-2.62 - 0.41·cigarette_i + 1.01·sport_i) + e^(-2.46 - 1.41·cigarette_i + 1.13·sport_i) + e^(-2.86 - 1.67·cigarette_i + 1.16·sport_i)

d) For an individual who does not smoke and only practices sports once a week, we have:
Probability of having a very high LDL cholesterol index = 41.32%
Probability of having a high LDL cholesterol index = 31.99%
Probability of having a borderline LDL cholesterol index = 8.23%
Probability of having a near optimal LDL cholesterol index = 10.92%
Probability of having an optimal LDL cholesterol index = 7.54%
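The item (d) probabilities can be reproduced from the estimated logits (reference category: very high), using the coefficients rounded as reported above. A sketch:

```python
# Multinomial logit probabilities for given values of cigarette and sport.
import numpy as np

def mlogit_probs(cigarette, sport):
    logits = {
        "high":         -0.42 - 0.31 * cigarette + 0.16 * sport,
        "borderline":   -2.62 - 0.41 * cigarette + 1.01 * sport,
        "near optimal": -2.46 - 1.41 * cigarette + 1.13 * sport,
        "optimal":      -2.86 - 1.67 * cigarette + 1.16 * sport,
    }
    denom = 1 + sum(np.exp(v) for v in logits.values())
    probs = {"very high": 1 / denom}
    probs.update({k: np.exp(v) / denom for k, v in logits.items()})
    return probs

print(mlogit_probs(cigarette=0, sport=1))  # roughly 41%, 32%, 8%, 11%, 8%
```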


e)

If people start practicing sports twice a week, they will considerably increase their probability of having near-optimal or optimal levels of LDL cholesterol.

f) The chances of having a high cholesterol index, in comparison to a level considered very high, are, on average, multiplied by a factor of 1.1745 (17.45% higher), when we increase the number of times physical activities are done weekly by one unit and maintaining the remaining conditions constant. g) The chances of having an optimal cholesterol index, on average and in comparison to a level considered near optimal, are multiplied by a factor of 1.2995 (0.2450047 / 0.1885317), when people stop smoking and maintaining the remaining conditions constant. That is, the chances are 29.95% higher.


Tip: For those who are in doubt about this procedure, you just need to change the reference category of variable cigarette (now, smokes ¼ 0) and estimate the model with the category near optimal of the dependent variable as the reference category. h) and i)

ANSWER KEYS: EXERCISES: CHAPTER 15

1) a)

Statistic    Value
Mean         1.020
Variance     1.125

Even if only in a preliminary way, we can see that the mean and the variance of the variable purchases are quite close.


b)

Since the P-value of the t-test that corresponds to the β parameter of lambda is greater than 0.05, we can state that the data of the dependent variable purchases do not present overdispersion. So, the estimated Poisson regression model is suitable, given the presence of equidispersion in the data.

c)

The result of the χ² test suggests that the adjustment of the estimated Poisson regression model is of good quality. That is, there are no statistically significant differences, at a significance level of 0.05, between the observed and the predicted probability distributions of the annual use incidence of closed-end credit.

d) Since all the z_cal values < -1.96 or > 1.96, the P-values of the Wald z statistics < 0.05 for all the parameters estimated; thus, we arrive at the final Poisson regression model. Therefore, the final expression for the estimated average annual number of uses of closed-end credit financing when purchasing durable goods, for a consumer i, is:

purchases_i = e^(7.048 - 0.001·income_i - 0.086·age_i)

e) purchases = e^[7.048 - 0.001·(2,600) - 0.086·(47)] = 1.06

We recommend that this calculation be carried out with a larger number of decimal places.
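A sketch of the point prediction and the incidence-rate ratios of items (e) to (g), using the coefficients rounded as reported above (with these rounded values the prediction comes out near 1.5; the 1.06 in the book comes from the unrounded estimates, which is exactly why more decimal places are recommended):

```python
# Predicted count and incidence-rate ratios from the Poisson coefficients.
import numpy as np

b0, b_income, b_age = 7.048, -0.001, -0.086
purchases = np.exp(b0 + b_income * 2600 + b_age * 47)
irr_income = np.exp(b_income)   # rate ratio per extra US$ 1.00 of income
irr_age = np.exp(b_age)         # rate ratio per extra year of age
```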

f) The annual use incidence rate of closed-end credit financing when there is an increase of US$ 1.00 in the customer's monthly income is, on average and as long as the other conditions are kept constant, multiplied by a factor of 0.9988 (0.1124% lower). Consequently, at each increase of US$ 100.00 in the customer's monthly income, we expect


the annual use incidence rate of closed-end credit financing to be 11.24% lower, on average and provided the other conditions are kept constant. g) The annual use incidence rate of closed-end credit financing when there is an increase of 1 year in consumers’ average age is, on average and as long as the other conditions are kept constant, multiplied by a factor of 0.9171 (8.29% lower). h)

In the constructed chart, it is possible to see that higher monthly incomes lead to a decrease in the expected annual use of closed-end credit financing when purchasing durable goods, with an average reduction rate of 12.0% at each increase of US$100.00 in income. i)


j) Young people with lower monthly incomes.

2) a)

Statistic    Value
Mean         2.760
Variance     8.467

Even if only in a preliminary way, there are indications of overdispersion in the data of the variable property, since its variance is much higher than its mean.


b)

Since the P-value of the t-test that corresponds to the β parameter of lambda is lower than 0.05, we can state that the data of the dependent variable property present overdispersion, making the estimated Poisson regression model unsuitable.

Furthermore, the result of the χ² test suggests a lack of adjustment quality in the estimated Poisson regression model. That is, there are statistically significant differences, at a significance level of 0.05, between the observed and predicted probability distributions of the number of real estate properties for sale per square. c)

d) Since the confidence interval for φ (alpha in Stata) does not include zero, we can state that, at a 95% confidence level, φ is statistically different from zero, with an estimated value of 0.230. The result of the likelihood-ratio test for the parameter φ (alpha) also suggests that the null hypothesis that this parameter is statistically equal to zero can be rejected at a significance level of 0.05. This confirms that there is overdispersion in the data and, therefore, we must choose the estimation of the negative binomial model.
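A sketch of the overdispersion check of items (b) to (d) in Python rather than Stata, assuming a DataFrame df with the exercise's columns (property, distpark, mall); the likelihood-ratio test on the dispersion parameter uses the usual one-sided boundary correction:

```python
# Poisson versus negative binomial, with an LR test for the dispersion parameter.
import statsmodels.formula.api as smf
from scipy import stats

poisson = smf.poisson("property ~ distpark + mall", data=df).fit()
negbin = smf.negativebinomial("property ~ distpark + mall", data=df).fit()

lr = 2 * (negbin.llf - poisson.llf)        # LR statistic for H0: alpha = 0
p_value = stats.chi2.sf(lr, df=1) / 2      # halved: alpha = 0 is on the boundary
```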


e) Since all the z_cal values < -1.96 or > 1.96, the P-values of the Wald z statistics < 0.05 for all the parameters estimated; thus, we arrive at the final negative binomial regression model. Therefore, the expression for the estimated average number of real estate properties for sale in a certain square ij is:

property_ij = e^(0.608 + 0.001·distpark_ij - 0.687·mall_ij)

f) property = e^[0.608 + 0.001·(820) - 0.687·(0)] = 5.07

We recommend that this calculation be carried out with a larger number of decimal places.

g) The number of real estate properties for sale per square is multiplied, on average and provided the other conditions are kept constant, by a factor of 1.0012 at each 1 meter further away from the municipal park. Hence, when there is an approximation of 1 meter from the park, we must divide the average amount of real estate properties for sale per square by this same factor. That is, the number will be multiplied by a factor of 0.9987 (0.1237% lower). Thus, at each approximation of 100 meters from the park, we expect that the average amount of real estate properties for sale to be, on average and as long as the other conditions are kept constant, 12.37% lower. h) The expected number of real estate properties for sale when a commercial center or mall is built in the microregion (square) is, as long as the other conditions are kept constant, multiplied by a factor of 0.5031. That is, on average, it becomes 49.69% lower. i)


j)

k) Yes, we can state that proximity to parks and green spaces and the existence of malls and commercial centers in the microregion make the number of real estate properties for sale go down. That is, these features may be helping reduce the intention of selling residential real estate. l)


m)

We can see that the adjustment of the negative binomial regression model is better than the adjustment of the Poisson regression model, since: – the maximum difference between the probabilities observed and the ones predicted is lower for the negative binomial model; – Pearson’s total value is also lower for the negative binomial regression model.


n)

ANSWER KEYS: EXERCISES: CHAPTER 16 Ex. 3 max x1 + x2 s:t: ¼ 10 2x1  5x2 a) x1 + 2x2 + x3 ¼ 50 x1 ,x2 , x3  0

ð1Þ ð 2Þ ð3Þ

min 24x1 + 12x2 s:t: ¼4 ð1Þ 3x1 + 2x2  x3 b) + x4 ¼ 26 ð2Þ 2x1  4x2  x5 ¼ 3 ð 3Þ x2 x1 , x2 , x3 ,x4 ,x5  0 ð4Þ


max 10x1  x2 s:t: 6x1 + x2 + x3 ¼ 10 c) x2  x4 ¼ 6 x1 ,x2 ,x3 , x4  0

ð 1Þ ð2Þ ð 3Þ

max 3x1 + 3x2  2x3 s:t: 6x1 + 3x2  x3 + x4 ¼ 10 d) x2 + x3  x5 ¼ 20 4 x1 , x2 , x3 ,x4 ,x5  0

ð 1Þ ð 2Þ ð3Þ

Ex. 4 max x1 + x2 s:t: 2x1  5x2  10 a)  2x1 + 5x2  10 x1 + 2x2  50 x1 ,x2  0

ð 1Þ ð2Þ ð 3Þ ð4Þ

min 24x1 + 12x2 s:t: 3x1 + 2x2  4 b)  2x1 + 4x2  26 x2  3 x1 ,x2  0

ð 1Þ ð 2Þ ð 3Þ ð 4Þ

max 10x1  x2 s:t: 6x1 + x2  10 c)  x2  6 x1 ,x2  0

ð 1Þ ð 2Þ ð3Þ

max 3x1 + 3x2  2x3 s:t: ð 1Þ 6x1 + 3x2  x3  10 d) x2   x3  20 ð2Þ 4 ð 3Þ x1 , x 2 , x 3  0 Ex. 5 a) min  z ¼  10x1 + x2 b) min  z ¼  3x1  3x2 + 2x3 Ex. 7 xi ¼ number of vehicles of model i to be manufactured per week, i ¼ 1, 2, 3. x1 ¼ number of vehicles of model Arlington to be manufactured per week. x2 ¼ number of vehicles of model Marilandy to be manufactured per week. x3 ¼ number of vehicles of model Lagoinha to be manufactured per week. Fobj ¼ max z ¼ 2, 500x1 + 3, 000x2 + 2,800x3 subject to 3x1 + 4x2 + 3x3  480 ðminutes  machine=week available for injectionÞ 5x1 + 5x2 + 4x3  640 ðminutes  machine=week available for foundryÞ 2x1 + 4x2 + 4x3  400 ðminutes  machine=week available for machiningÞ 4x1 + 5x2 + 5x3  640 ðminutes  machine=week available for upholsteryÞ 2x1 + 3x2 + 3x3  320 ðminutes  machine=week available for final assemblyÞ  50 ðminimum sales potential of the Arlington modelÞ x1 x2  30 ðminimum sales potential of the Marilandy modelÞ x3  30 ðminimum sales potential of the Lagoinha modelÞ x1 ,x2 , x3  0
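The Ex. 7 production-mix model stated above (maximize 2,500x1 + 3,000x2 + 2,800x3 subject to the five machine-time constraints and the minimum sales potentials) can be checked numerically. A sketch with scipy, which is not the solver used in the book; maximization is handled by negating the objective:

```python
# Linear programming check for Ex. 7 (Arlington, Marilandy, Lagoinha models).
from scipy.optimize import linprog

c = [-2500, -3000, -2800]              # negated profits (max -> min)
A_ub = [[3, 4, 3],                     # injection       (<= 480 machine-min/week)
        [5, 5, 4],                     # foundry         (<= 640)
        [2, 4, 4],                     # machining       (<= 400)
        [4, 5, 5],                     # upholstery      (<= 640)
        [2, 3, 3]]                     # final assembly  (<= 320)
b_ub = [480, 640, 400, 640, 320]
bounds = [(50, None), (30, None), (30, None)]   # minimum sales potentials
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
print(res.x, -res.fun)                 # optimal plan and weekly profit
```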



Ex. 8 xi ¼ liters of product i to be manufactured per month, i ¼ 1, 2 x1 ¼ liters of beer to be manufactured per month. x2 ¼ liters of soft drink to be manufactured per month. Fobj ¼ max z ¼ 0:5x1 + 0:4x2 subject to 2x1  57,600 ðminutes=month available to extract beer maltÞ 4x1  115,200 ðminutes=month available to process wortÞ 3x1  96,000 ðminutes=month available to ferment beerÞ 4x1  115,200 ðminutes=month available to process beerÞ 5x1  96,000 ðminutes=month available to bottle beerÞ 1x2  57,600 ðminutes=month available to prepare simple syrupÞ 3x2  67,200 ðminutes=month available to prepare compound syrupÞ 4x2  76,800 ðminutes=month available to dilute soft drinksÞ 5x2  96,000 ðminutes=month available to carbonate soft drinksÞ 2x2  48,000 ðminutes=month available to bottle soft drinksÞ x1 + x2  42, 000 ðmaximum demand of beer and soft drinksÞ x1 ,x2  0 Ex. 9 xi ¼ quantity of product i to be manufactured per week, i ¼ 1, 2, …, 5. x1 ¼ number of refrigerators to be manufactured per week. x2 ¼ number of freezers to be manufactured per week. x3 ¼ number of stoves to be manufactured per week. x4 ¼ number of dishwashers to be manufactured per week. x5 ¼ number of microwave ovens to be manufactured per week. Fobj ¼ max z ¼ 52x1 + 37x2 + 35x3 + 40x4 + 29x5 subject to 0:2x1 + 0:2x2 + 0:4x3 + 0:4x4 + 0:3x5  400 ðh  machine=week pressingÞ 0:2x1 + 0:3x2 + 0:3x3 + 0:3x4 + 0:2x5  350 ðh  machine=week paintingÞ 0:4x1 + 0:3x2 + 0:3x3 + 0:3x4 + 0:2x5  250 ðh  machine=week moldingÞ 0:2x1 + 0:4x2 + 0:4x3 + 0:4x4 + 0:4x5  200 ðh  machine=week assemblyÞ 0:1x1 + 0:2x2 + 0:2x3 + 0:2x4 + 0:3x5  200 ðh  machine=week packagingÞ 0:5x1 + 0:4x2 + 0:5x3 + 0:4x4 + 0:2x5  480 ðh  employee=week pressingÞ 0:3x1 + 0:4x2 + 0:4x3 + 0:4x4 + 0:3x5  400 ðh  employee=week paintingÞ 0:5x1 + 0:5x2 + 0:3x3 + 0:4x4 + 0:3x5  320 ðh  employee=week moldingÞ 0:6x1 + 0:5x2 + 0:4x3 + 0:5x4 + 0:6x5  400 ðh  employee=week assemblyÞ 0:4x1 + 0:4x2 + 0:4x3 + 0:3x4 + 0:2x5  1, 280 ðh  employee=week packagingÞ 200  x1  1,000 ð min :demand; max :capac:refrigeratorÞ ð min :demand; max :capac:freezerÞ 50  x2  800 50  x3  500 ð min :demand; max :capac:stoveÞ ð min :demand; max :capac:dishwasherÞ 50  x4  500 ð min :demand; max :capac:microwaveÞ 40  x5  200 Ex. 10 xij ¼ liters of type i petroleum used daily to produce gasoline j, i ¼ 1, 2, 3, 4; j ¼ 1, 2, 3. x11 ¼ liters of petroleum 1 used daily to produce regular gasoline. ⋮ x41 ¼ liters of petroleum 4 used daily to produce regular gasoline. x12 ¼ liters of petroleum 1 used daily to produce green gasoline. ⋮ x42 ¼ liters of petroleum 4 used daily to produce green gasoline. x13 ¼ liters of petroleum 1 used daily to produce yellow gasoline. ⋮ x43 ¼ liters of petroleum 4 used daily to produce yellow gasoline.


Fobj ¼ max z ¼ ð0:40  0:20Þx11 + ð0:40  0:25Þx21 + ð0:40  0:30Þx31 + ð0:40  0:30Þx41 + ð0:45  0:20Þx12 + ð0:45  0:25Þx22 + ð0:45  0:30Þx32 + ð0:45  0:30Þx42 + ð0:50  0:20Þx13 + ð0:50  0:25Þx23 + ð0:50  0:30Þx33 + ð0:50  0:30Þx43 subject to 0:10x21  0:05x31 + 0:20x41  0 0:07x11 + 0:02x21  0:12x31  0:03x41  0  0:05x12 + 0:05x22  0:10x32  0:15x42  0 + 0:10x32  0:05x42  0 0:05x12  0:15x33 + 0:10x43  0  0:10x13 0:03x13  0:02x23 + 0:08x33  0:07x43  0 x11 + x21 + x31 + x41  12, 000 x12 + x22 + x32 + x42  10, 000 x13 + x23 + x33 + x43  8, 000 x11 + x12 + x13  15, 000 x21 + x22 + x23  15, 000 x31 + x32 + x33  15, 000 x41 + x42 + x43  15, 000 x11 + x21 + x31 + x41 + x12 + x22 + x32 + x42 + x13 + x23 + x33 + x43  60,000 x11 , x21 ,x31 , x41 ,x12 , x22 ,x32 , x42 ,x13 , x23 ,x33 , x43  0 Ex. 12 xi ¼ 1 if the company invests in project i 0 otherwise x1 ¼ if the company invests in the development of new products or not. x2 ¼ if the company invests in capacity building or not. x3 ¼ if the company invests in Information Technology or not. x4 ¼ if the company invests in expanding the factory or not. x5 ¼ if the company invests in expanding the depot or not. Fobj ¼ max z ¼ 355:627x1 + 110:113x2 + 213:088x3 + 257:190x4 + 241:833x5 subject to : 360x1 + 240x2 + 180x3 + 480x4 + 320x5  1,000 ðBudget constraintÞ 0 ðProject 2 depends on 3Þ x2  x3 ðMutually excluding projectsÞ x4 + x 5  1 xi ¼ 0 or 1 Ex. 13 xi ¼ percentage of stock i to be allocated in the portfolio, i ¼ 1, …, 10. x1 ¼ percentage of stock 1 from the banking sector to be allocated in the portfolio. x2 ¼ percentage of stock 2 from the banking sector to be allocated in the portfolio. ⋮ x10 ¼ percentage of stock 10 from the electrical sector to be allocated in the portfolio. Fobj ¼ 0:0439x1 + 0:0453x2 + 0:0455x3 + 0:0439x4 + 0:0402x5 + 0:0462x6 + 0:0421x7 + 0:0473x8 + 0:0233x9 + 0:0221x10 s:t: x1 + x2 + ⋯ + x10 ¼ 1 ð 1Þ 0:0122x1 + 0:0121x2 + ⋯ + 0:0148x10  0:008 ð2Þ ð 3Þ 0:0541x1 + 0:0528x2 + ⋯ + 0:0267x10  0:05 ð 4Þ x1 + x2 + x3 + x4 + x5  0:50 ð 5Þ x1 + x2 + x3 + x4  0:20 ð 6Þ x6 + x7 + x8  0:20 ð 7Þ x9 + x10  0:20 ð 8Þ 0  x1 ,x2 ,⋯,x10  0:40 Ex. 16 Decision variables: xijt ¼ quantity of product i to be manufactured in facility j in period t Iijt ¼ final stock of product i in facility j in period t 1 if product i is delivered by facility j to retailer k in period t zijkt ¼ 0 otherwise



Model parameters: Dikt ¼ demand of product i by retailer k in period t cijt ¼ unit production cost of product i in facility j in period t iijt ¼ unit storage cost of product i in facility j in period t yijkt ¼ total transportation cost of product i from facility j to retailer k in period t xmax ijt ¼ maximum production capacity of product i in facility j in period t Imax ijt ¼ maximum storage capacity of product i in facility j in period t General formulation ! p m X n X T X X Fobj ¼ min z ¼ cijt xijt + iijt Iijt + yijkt zijkt i¼1 j¼1 t¼1

s:t:

k¼1

p X Dikt zijkt + Iijt ¼ Iij, t1 + xijt , k¼1

n X zijkt ¼ 1,

i ¼ 1, …, m; j ¼ 1,…, n; t ¼ 1, …, T

ð 1Þ

k ¼ 1, …,p;

ð2Þ

j¼1

xijt  xmax ijt , max Iijt  Iijt , zijkt 2 f0, 1g, xijt , Iijt  0

i ¼ 1, …,m; i ¼ 1, …,m; i ¼ 1, …,m; i ¼ 1, …,m;

j ¼ 1, …,n; j ¼ 1, …, n; j ¼ 1, …, n; j ¼ 1, …, n;

t ¼ 1, …, T t ¼ 1, …, T k ¼ 1, …,p; t ¼ 1, …,T t ¼ 1, …, T

Ex. 17 Decision variables: xijt ¼ quantity of product i to be manufactured in facility j in period t Iijt ¼ final stock of product i in facility j in period t Yijkt ¼ quantity of product i to be transported from facility j to retailer k in period t 1 if the manufacturing of product i in period t occurs in facility j zijt ¼ 0 otherwise Model parameters: Dikt ¼ demand of product i by retailer k in period t cijt ¼ unit production cost of product i in facility j in period t iijt ¼ unit storage cost of product i in facility j in period t yijkt ¼ unit transportation cost of product i from facility j to retailer k in period t xmax ijt ¼ maximum production capacity of product i in facility j in period t Imax ijt ¼ maximum storage capacity of product i in facility j in period t General formulation ! p m X n X T X X cijt xijt + iijt Iijt + yijkt Yijkt minz ¼ i¼1 j¼1 t¼1

s:t: Iijt ¼ Iij, t1 + xijt  n X Yijkt ¼ Dikt ,

k¼1 p X

Yijkt , i ¼ 1,…, m; j ¼ 1,…, n; t ¼ 1,…, T

ð 1Þ

i ¼ 1, …,m; k ¼ 1, …,p; t ¼ 1,…, T

ð 2Þ

k¼1

j¼1

xijt 

p X Dikt zijt

k¼1 xijt  xmax ijt , max Iijt  Iijt ,

zijt 2 f0, 1g, xijt , Iijt , Yijt  0

i ¼ 1, …,m; j ¼ 1, …, n; t ¼ 1, …,T

ð 3Þ

i ¼ 1, …,m; j ¼ 1, …,n; i ¼ 1,…, m; j ¼ 1,…, n; i ¼ 1,…, m; j ¼ 1, …,n; i ¼ 1,…, m; j ¼ 1,…,n;

ð 4Þ ð 5Þ ð6Þ

Ex. 18 Time frame with T ¼ 6 periods, t ¼ 1, …, 6 (Jan., Feb., March, April, May, Jun.). Pt ¼ production in period t (kg) St ¼ production with outsourced labor in period t (kg)

t ¼ 1, …, T t ¼ 1, …,T t ¼ 1, …,T t ¼ 1, …,T

ð 3Þ ð 4Þ ð 5Þ


NRt ¼ number of regular employees in period t NCt ¼ number of employees hired from period t  1 to period t NDt ¼ number of employees fired from period t  1 to period t HEt ¼ total amount of overtime in period t It ¼ final stock in period t (kg) minz ¼ 1:5P1 + 2S1 + 600NR1 + 1, 000NC1 + 900ND1 + 7HE1 + 1I1 + 1:5P2 + 2S2 + 600NR2 + 1,000NC2 + 900ND2 + 7HE2 + 1I2 + ⋮ ⋮ 1:5P6 + 2S6 + 600NR6 + 1, 000NC6 + 900ND6 + 7HE6 + 1I6 s:t: I1 ¼ 600 + P1  9, 600 I2 ¼ I1 + P2  10, 600 ⋮ ⋮ I6 ¼ I5 + P6  10, 430

ANSWER KEYS: EXERCISES: CHAPTER 17 Section 17.2.1 (ex.2) a) Optimal solution: x1 ¼ 2, x2 ¼ 1 and z ¼ 10 b) Optimal solution: x1 ¼ 1, x2 ¼ 4 and z ¼ 14 c) Optimal solution: x1 ¼ 10, x2 ¼ 6 and z ¼ 52 Section 17.2.1 (ex.4) a) yes b) no c) yes d) no e) yes f) yes g) no h) no i) yes Section 17.2.2 (ex.2) a) Optimal solution: x1 ¼ 12, x2 ¼ 2 and z ¼ 26 b) Optimal solution: x1 ¼ 18, x2 ¼ 8 and z ¼ 28 c) Optimal solution: x1 ¼ 10, x2 ¼ 10 and z ¼ 100 Section 17.2.3 (ex.1) e) Multiple optimal solutions. f) There is no optimal solution. g) Unlimited objective function z. h) Multiple optimal solutions. i) Degenerate optimal solution. j) There is no optimal solution. Section 17.2.3 (ex.2) a) Any point of the segment CD (C (10, 30); D (0, 45)). b) Any point of the segment AB (A (8, 0); B (7/2, 3)). Section 17.3 (ex.1) a) Six basic solutions. c) Optimal solution: x1 ¼ 5, x2 ¼ 20 and z ¼ 55 Section 17.3 (ex.2) a) Ten basic solutions. c) Optimal solution: x1 ¼ 7, x2 ¼ 11, x3 ¼ 0 and z ¼ 61 Section 17.4.2 (ex.1) a) Optimal solution: x1 ¼ 1, x2 ¼ 17, x3 ¼ 5 and z ¼ 104



Section 17.4.3 (ex.2) a) Optimal solution: x1 ¼ 3, x2 ¼ 3 and z ¼ 15 b) Optimal solution: x1 ¼ 2, x2 ¼ 4, x3 ¼ 0 and z ¼ 20 c) Optimal solution: x1 ¼ 4, x2 ¼ 0, x3 ¼ 12 and z ¼ 36 Section 17.4.4 (ex.1) a) Optimal solution: b) Optimal solution: c) Optimal solution: d) Optimal solution:

x1 ¼ 0, x2 ¼ 4 and z ¼  4 x1 ¼ 1, x2 ¼ 7 and z ¼  37 x1 ¼ 0, x2 ¼ 10, x3 ¼ 35/2 and z ¼  55/2 x1 ¼ 100/3, x2 ¼ 0, x3 ¼ 40/3 and z ¼  140/3

Section 17.4.5.1 (ex.1) b) Solution 1: x1 ¼ 115/2, x2 ¼ 0 and z ¼ 230 Solution 2: x1 ¼ 60, x2 ¼ 10 and z ¼ 230 Section 17.4.5.1 (ex.2) b) Solution 1: x1 ¼ 310, x2 ¼ 0 and z ¼ 930 Solution 2: x1 ¼ 30, x2 ¼ 140 and z ¼ 930 Section 17.4.5.2 (ex.2) Solution 1: x1 ¼ 10, x2 ¼ 30 Solution 2: x1 ¼ 30, x2 ¼ 0 Section 17.4.5 (ex.1) a) Multiple optimal solutions. b) Unlimited objective function z. c) Multiple optimal solutions/degenerate optimal solution. Section 17.4.5 (ex.2) a) No. b) Unfeasible solution. c) Degenerate optimal solution. d) Multiple optimal solutions. e) Unlimited objective function z. Section 17.5.2 (ex.1) b) Optimal solution: x1 ¼ 70, x2 ¼ 30, x3 ¼ 35 and z ¼ 363, 000. Section 17.5.2 (ex.2) b) Optimal solution: x1 ¼ 24,960, x2 ¼ 17,040 and z ¼ 19,296. Section 17.5.2 (ex.3) b) Optimal solution: x1 ¼ 475, x2 ¼ 50, x3 ¼ 50, x4 ¼ 50, x5 ¼ 75 and z ¼ 32,475. Section 17.5.2 (ex.4) b) Optimal solution:

x11 ¼ 3,600, x21 ¼ 0, x22 ¼ 10,000 x12 ¼ 0, x23 ¼ 0, x13 ¼ 0, z ¼ 5,160

x31 ¼ 0, x41 ¼ 8,400, x32 ¼ 0, x42 ¼ 0, x33 ¼ 3,200, x43 ¼ 4,800 and

Section 17.5.2 (ex.5) b) Optimal solution: x1 ¼ 1, x2 ¼ 0, x3 ¼ 1, x4 ¼ 0, x5 ¼ 1 and z ¼ 810,548 ($810,548.00). Section 17.5.2 (ex.6) b) Optimal solution: x1 ¼ 20%, x7 ¼ 20%, x9 ¼ 20%, x10 ¼ 40%, x2, x3, x4, x5, x6, x8, x11 ¼ 0% and z ¼ 3.07%. Section 17.5.2 (ex.7) b) Optimal solution: 50% ($250,000.00) in the RF_C fund 25% ($125,000.00) in the Petrobras stock fund 25% ($125,000.00) in the Vale stock fund Objective function z ¼ 16.90% per year.


Section 17.5.2 (ex.8) b) Optimal solution: z = 126,590 ($126,590.00).

Solution    Jan.     Feb.      Mar.      Apr.      May       Jun.
Pt          9,600    10,000    12,800    11,520    10,770    10,430
St          0        0         0         0         0         0
NRt         5        5         6         6         5         5
NCt         0        0         1         0         0         0
NDt         5        0         0         0         1         0
HEt         0        28.57     91.43     0         83.57     59.29
It          600      0         0         870       0         0

Section 17.6.1 (ex.1) a) x1 ¼ 60, x2 ¼ 20 with z ¼ 520 b) 1.333 c) 0.8 d) No. e) The basic solution remains optimal. Section 17.6.1 (ex.2) a) x1 ¼ 15, x2 ¼ 0 with z ¼ 120 b) c1 2.4 or c1 c01  5.6 c) c2  20 or c2  c02 + 14 Section 17.6.1 (ex.3) a) x1 ¼ 0, x2 ¼ 17 with z ¼ 102 b) Unlimited objective function z. c) c1 3 or c1 c01  5 d) 0  c2  16 or c02  6  c2  c02 + 10 Section 17.6.1 (ex.4) a) 0:133  cc1  0:25 2 b) The basic solution remains optimal with z ¼ 1,700. c) 8  c1  15 or c01  4  c1  c01 + 3 d) 48  c2  90 or c02  12  c2  c02 + 30 e) The basic solution remains optimal with z ¼ 1,830. f) The basic solution remains optimal with z ¼ 2,440. g) 13.333  c1  25 Section 17.6.2 (ex.1) a) P1 ¼ 0, P2 ¼ 34.286, P3 ¼ 85.714 b) b1 b01  8.5 b02  5.95  b2  b02 + 6.125 b03  3.267  b3  b03 + 2.164 c) 0 d) $137.14 (z¼ 1,902.86), x1 ¼ 115.71 and x2 ¼ 8.57 Section 17.6.2 (ex.2) a) P1 ¼ 0, P2 ¼ 1.222, P3 ¼ 0.444 (2nd operation) b) b1 b01  20 b02  180  b2  b02 + 22.5 b03  36  b3  b03 + 180 c) $27.50 d) $16.00 Section 17.6.3 (ex.1) b) z11 ¼ 3, z12 ¼ 65, z∗1 ¼ 3 and z∗2 ¼ 2 Section 17.6.3 (ex.2) b) z∗1 ¼  4 and z∗2 ¼  2


Section 17.6.4 (ex.3) a) Degenerate optimal solution. b) Multiple optimal solutions. c) Degenerate optimal solution. d) Multiple optimal solutions. e) Multiple optimal solutions. f) Degenerate optimal solution. g) Degenerate optimal solution.

ANSWER KEYS: EXERCISES: CHAPTER 18 Ex.1 a) b) c) d) e) f) g) h)

N = {1, 2, 3, 4, 5, 6}; A = {(1, 2), (1, 3), (2, 3), (3, 4), (3, 5), (4, 2), (4, 5), (4, 6), (5, 6)}; directed network; 1→2→3→4→2; 1→3→5→4; 1→3→4→6; 2→3→4→2; 3→4→5→3

Ex.2 a) b) c) d) e) f) g) h)

N = {1, 2, 3, 4, 5, 6}; A = {(1, 2), (1, 3), (2, 3), (2, 4), (3, 5), (4, 6), (5, 2), (5, 4), (6, 5)}; directed network; 2→3→5→4→6→5; 1→2→5→4→6→5; 1→3→5→4; 2→3→5→2; 1→2→3→1

Ex.3
a) [Figure: a tree connecting nodes 1, 3, 4, and 5.]
b) [Figure: a spanning (cover) tree connecting nodes 1, 2, 3, 4, 5, and 6.]

Ex.4
[Figure: the resulting network drawn over nodes 1 to 8.]

Ex.5 Classic transportation problem:
[Figure: transportation network linking the supply nodes to the demand nodes, with the supplies, demands, and unit transportation costs on the arcs.]

Optimal FBS: x11 = 40, x14 = 30, x22 = 60, x24 = 20, x33 = 50 with z = 1,110.

Ex.6 Maximum flow problem:
[Figure: directed network from source node 1 to sink node 7, with the capacity of each arc.]

Optimal solution: x12 = 6, x13 = 2, x14 = 7, x24 = 3, x25 = 3, x34 = 2, x36 = 0, x45 = 3, x46 = 3, x47 = 6, x57 = 6, x67 = 3 with z = 15.

Ex.7 Shortest route problem:
[Figure: directed network from node 1 to node 8, with the length of each arc.]

Optimal FBS: x13 = 1, x36 = 1, x68 = 1 (route 1–3–6–8) with z = 11.

Ex.8 x11 = 50, x22 = 10, x23 = 20, x33 = 20.


Ex.9 x11 ¼ 80, x13 ¼ 70, x22 ¼ 50, x23 ¼ 80 with z ¼ 4,590. Ex.10 x13 ¼ 150, x21 ¼ 80, x22 ¼ 50 with z ¼ 4,110. Ex.11 a) Optimal FBS: x12 ¼ 100, x13 ¼ 100, x23 ¼ 100, x31 ¼ 150, x32 ¼ 50 with z ¼ 6,800. b) Optimal FBS: x13 ¼ 50, x31 ¼ 100, x41 ¼ 20, x42 ¼ 150, x43 ¼ 30 with z ¼ 1,250. c) Optimal FBS: x12 ¼ 20, x14 ¼ 30, x21 ¼ 20, x24 ¼ 10, x32 ¼ 20, x33 ¼ 60 with z ¼ 1,490. Alternative solution: x11 ¼ 20, x12 ¼ 20, x14 ¼ 10, x24 ¼ 30, x32 ¼ 20, x33 ¼ 60 with z ¼ 1,490. Ex.12 Indexes: Suppliers i 2 I Consolidating centers j 2 J Factory k 2 K Products p 2 P Model parameters: Cmax, j Dpk Sip cpij cpjk cpik

maximum capacity of consolidating center j. demand of product p in factory k. capacity of supplier i to produce product p. unit transportation cost of p from supplier i to consolidating center j. unit transportation cost of p from consolidating center j to factory k. unit transportation cost of p from supplier i to factory k.

Model’s decision variables: xpij ypjk zpik

amount of product p transferred from supplier i to consolidating center j. amount of product p transferred from consolidating center j to factory k. amount of product p transferred from supplier i to factory k.

The problem can be formulated as follows: XXX XXX XXX min cpij xpij + cpjk ypjk + cpik zpik p

s.t.:

i

p

j

X j

ypjk +

XX p

X j

i

xpij +

X i

j

X

p

k

zpik ¼ Dpk ,

i

k

8p,k

(1)

8j

(2)

8i,p

(3)

i

xpij  C max , j , X k

xpij ¼

zpik  Sip ,

X

ypjk ,

8p, j

(4)

8p,i, j, k

(5)

k

xpij , ypjk , zpik  0,

In the objective function, the first term represents suppliers’ transportation costs up to the consolidation terminals, the second refers to the transportation costs from the consolidation terminals to the final client (factory in Harbin), and the third represents suppliers’ transportation costs directly to the final client. Constraint (1) ensures that client k’s demand for product p will be met. Constraint (2) refers to the maximum capacity of each consolidation terminal. Constraint (3) represents supplier i’s capacity to supply product p. Whereas constraint (4) refers to the preservation of the input and output flows in each transshipment point. Finally, we have the non-negativity constraints.


Ex.13

xij ¼

1 if task i is designated to machine j, i ¼ 1,…, 4, j ¼ 1, …, 4 0 otherwise

a) Optimal FBS: x12 ¼ 1, x24 ¼ 1, x33 ¼ 1, x41 ¼ 1 with z ¼ 37. b) Optimal FBS: x13 ¼ 1, x24 ¼ 1, x33 ¼ 1, x41 ¼ 1 with z ¼ 35. Ex.14

xij ¼

1 if route ði, jÞ is included in the shortest route, 8i, j 0 otherwise

min 6x12 + 9x13 + 4x23 + 4x24 + 7x25 + 6x35 + 2x45 + 7x46 + 3x56
s.t.
x12 + x13 = 1
x46 + x56 = 1
x12 - x23 - x24 - x25 = 0
x13 + x23 - x35 = 0
x24 - x45 - x46 = 0
x25 + x35 + x45 - x56 = 0
x_ij ∈ {0, 1} or x_ij ≥ 0

Optimal FBS: x12 = 1, x24 = 1, x45 = 1, x56 = 1 (route 1–2–4–5–6) with z = 15; a quick check of this result with a graph library is sketched below.

Ex.15 Optimal FBS: xAB = 1, xBD = 1, xDE = 1 (A–B–D–E) with z = 64.

Ex.16 x12 = 6, x13 = 4, x23 = 0, x24 = 6, x34 = 1, x35 = 3, x45 = 0, x46 = 7, x56 = 3 with z = 10.
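The Ex. 14 shortest route can also be verified with a graph library instead of solving the binary formulation directly. A sketch using networkx and the arc lengths stated in the model above:

```python
# Dijkstra check of the Ex. 14 shortest route (expected: 1-2-4-5-6, length 15).
import networkx as nx

G = nx.DiGraph()
G.add_weighted_edges_from([(1, 2, 6), (1, 3, 9), (2, 3, 4), (2, 4, 4),
                           (2, 5, 7), (3, 5, 6), (4, 5, 2), (4, 6, 7),
                           (5, 6, 3)])
path = nx.dijkstra_path(G, 1, 6)            # [1, 2, 4, 5, 6]
cost = nx.dijkstra_path_length(G, 1, 6)     # 15
```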

ANSWER KEYS: EXERCISES: CHAPTER 19 Section 19.1 (ex.1) a) BP b) MIP c) IP d) BIP e) BP f) MBP g) MIP Section 19.2 (ex.1) a) No b) Yes (x1 ¼ 10,x2 ¼ 0 with z ¼ 20) c) No d) Yes (x1 ¼ 0, x2 ¼ 4 with z ¼ 32) e) Yes (x1 ¼ 1,x2 ¼ 0 with z ¼ 4) f) No g) Yes (x1 ¼ 6,x2 ¼ 5 with z ¼ 58) Section 19.2((ex.2) ) ð0, 0Þ;ð0, 1Þ;ð0, 2Þ;ð0, 3Þ;ð0, 4Þ;ð1, 0Þ;ð1, 1Þ;ð1, 2Þ;ð1, 3Þ;ð2, 0Þ;ð2, 1Þ; b) SF ¼ ð2, 2Þ;ð2, 3Þ;ð3, 0Þ;ð3, 1Þ;ð3, 2Þ;ð4, 0Þ;ð4, 1Þ;ð4, 2Þ;ð5, 0Þ;ð5, 1Þ;ð6, 0Þ c) Optimal solution: x1 ¼ 4 and x2 ¼ 2 with z ¼ 14. Section 19.2 (ex.3) b) SF ¼ {(0, 0); (0, 1); (0, 2); (1, 0); (1, 1); (2, 0); (2, 1); (3, 0)} c) Optimal solution: x1 ¼ 2 and x2 ¼ 1 with z ¼ 4.


Section 19.2 (ex.4) b) {(0, 0); (0, 1); (0, 2); (1, 0); (1, 1); (1, 2); (2, 0); (2, 1); (2, 2); (3, 0); (3, 1); (3, 2); (4, 0)} c) Optimal solution: x1 ¼ 3 and x2 ¼ 2 with z ¼ 13. Section 19.3 (ex.1) Optimal FBS ¼ {x3 ¼ 1, x4 ¼ 1, x6 ¼ 1, x8 ¼ 1} with z ¼ 172. Section 19.4 (ex.1) max z ¼ 7x1 + 12x2 + 8x3 + 10x4 + 7x5 + 6x6 s:t: 4x1 + 7x2 + 5x3 + 6x4 + 4x5 + 3x6  20 x5 + x6  1 x3  x2  0 x1 , x2 , x3 , x4 ,x5 ,x6 2 f0, 1g Optimal solution: x1 ¼ 1, x2 ¼ 1, x3 ¼ 0, x4 ¼ 1, x5 ¼ 0, x6 ¼ 1 with z ¼ 35. Section 19.5 (ex.1) Indexes i, j ¼ 1, .., n that represent the customers (index 0 represents the depot) v ¼ 1, …, NV that represent the vehicles Parameters Cmax,v ¼ maximum capacity of vehicle n di ¼ demand of client i cij ¼ travel cost from client i to client j Decision variables

xvij

¼

yvi Model formulation min s.t.

¼

PPP i

j

1 if the arc from i to j is traveled by vehicle v 0 otherwise 1 if order of client i is delivered by vehicle v 0 otherwise

v v cij xij

X

yvi ¼ 1,

i ¼ 1,…, n

(1)

v

X

X X

yvi ¼ NV,

i¼0

(2)

v

di yvi  C max , v , v ¼ 1, …, NV

(3)

i

xvij ¼ yvj ,

j ¼ 0,…, n,

v ¼ 1, …, NV

(4)

xvij ¼ yvi ,

i ¼ 0,…, n,

v ¼ 1, …, NV

(5)

i

X X

j

xvij ¼ xvij  jSj  1, S f1, …, ng, 2  jSj  n  1, v ¼ 1, …, NV

(6)

ij2S

xvij 2 f0, 1g, i ¼ 0, …, n j ¼ 0,…,n, v ¼ 1, …, NV

(7)

yvi 2 f0, 1g, i ¼ 0, …,n, v ¼ 1,…, NV

(8)

The main objective of the model is to minimize the total travel costs. Constraint (1) guarantees that each node (client) will be visited by only one vehicle. Whereas constraint (2) guarantees that all the routes will begin and end at the depot (i ¼ 0).


Constraint (3) guarantees that vehicle capacity will not be exceeded. Constraints (4) and (5) guarantee that vehicles will not interrupt their routes at one client. They are the constraints for the preservation of the input and output flows. Constraint (6) guarantees that subroutes will not be formed. Finally, constraints (7) and (8) guarantee that variables xvij and yvi will be binary. Section 19.6 (ex.1) Indexes i ¼ 1, …, m that represent the distribution centers (DCs) j ¼ 1, …, n that represent the consumers Model parameters fi ¼ fixed costs to maintain DC i open cij ¼ transportation costs from DC i to consumer j Dj ¼ demand of customer j Cmax, i ¼ maximum capacity of DC i Decision variables

( yi ¼ ( xij ¼

1 if DC i opens 0 otherwise

1 if consumer j is supplied by DC i 0 otherwise

General formulation Fobj ¼ min z ¼

m X

fi yi +

i¼1

s:t:

n X

m X n X

cij xij Dj

i¼1 j¼1

xij Dj  Cmax,i  yi ,

i ¼ 1, …,m

ð 1Þ

j ¼ 1,…, n

ð 2Þ

j¼1 m X xij ¼ 1, i¼1

xij ,yi 2 f0, 1g,

i ¼ 1,…, m, j ¼ 1,…, n

ð3Þ

which corresponds to a binary programming problem. For this problem, index i corresponds to: i ¼ 1 (Belem), i ¼ 2 (Palmas), i ¼ 3 (Sao Luis), i ¼ 4 (Teresina), and i ¼ 5 (Fortaleza); and index j corresponds to: j ¼ 1 (Belo Horizonte), j ¼ 2 (Vitoria), j ¼ 3 (Rio de Janeiro), j ¼ 4 (Sao Paulo), and j ¼ 5 (Campo Grande). Optimal FBS: x22 ¼ 1, x24 ¼ 1, x45 ¼ 1, x51 ¼ 1, x53 ¼ 1, y2 ¼ 1, y4 ¼ 1, y5 ¼ 1 with z ¼ 459,400.00. Section 19.6 (ex.2) Indexes: Suppliers i 2 I Consolidating centers j 2 J Factory k 2 K Products p 2 P Model parameters: Cmax, j fj Dpk Sip cpij cpjk cpik

maximum capacity of consolidating center j. fixed costs to open consolidating center j. demand of product p in factory k. capacity of supplier i to produce product p. unit transportation cost of p from supplier i to consolidating center j. unit transportation cost of p from consolidating center j to factory k. unit transportation cost of p from supplier i to factory k.


Model’s decision variables: xpij ypjk zpik zj

amount of product p transported from supplier i to consolidating center j. amount of product p transported from consolidating center j to factory k. amount of product p transported from supplier i to factory k. binary variable that assumes value 1 if center j operates, and 0 otherwise.

The problem can be formulated as follows: XXX XXX XXX X min cpij xpij + cpjk ypjk + cpik zpik + f j zj p

i

p

j

s.t.:

X

j

ypjk +

j

XX p

X j

X

k

j

8p,k

(1)

8j

(2)

8i, p

(3)

xpij  C max , j  zj ,

xpij +

X k

xpij ¼

i

zpik ¼ Dpk ,

i

i

i

X

p

k

zpik  Sip ,

X

8p, j

(4)

8p, i, j, k

(5)

8z

(6)

ypjk ,

k

xpij , ypjk , zpik  0, zj 2 f0, 1g,

In the objective function, the first term represents suppliers’ transportation costs up to the consolidation terminals, the second refers to the transportation costs from the consolidation terminals to the final client (factory in Harbin), and the third represents suppliers’ transportation costs directly to the final client, and the last one the fixed costs related to the consolidation terminals’ location. Constraint (1) ensures that client k’s demand for product p will be met. Constraint (2) refers to the maximum capacity of each consolidation terminal. Constraint (3) represents supplier i’s capacity to supply product p. Whereas constraint (4) refers to the preservation of the input and output flows in each transshipment point. Finally, we have the non-negativity constraints and that variable zj is binary. Section 19.7 (ex.1) xi ¼ number of buses that start working in shift i, i ¼ 1, 2, …, 9.

Therefore, we have:
x1 = number of buses that start working at 6:01.
x2 = number of buses that start working at 8:01.
x3 = number of buses that start working at 10:01.
x4 = number of buses that start working at 12:01.
x5 = number of buses that start working at 14:01.
x6 = number of buses that start working at 16:01.
x7 = number of buses that start working at 18:01.
x8 = number of buses that start working at 20:01.
x9 = number of buses that start working at 22:01.

Shift    Period
1        6:01–14:00
2        8:01–16:00
3        10:01–18:00
4        12:01–20:00
5        14:01–22:00
6        16:01–24:00
7        18:01–02:00
8        20:01–04:00
9        22:01–06:00

Fobj ¼ min z ¼ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 subject to  20 ð6 : 01  8 : 00Þ x1 x1 + x2  24 ð8 : 01  10 : 00Þ  18 ð10 : 01  12 : 00Þ x1 + x 2 + x 3 x1 + x2 + x3 + x4  15 ð12 : 01  14 : 00Þ x2 + x3 + x4 + x5  16 ð14 : 01  16 : 00Þ  27 ð16 : 01  18 : 00Þ x3 + x4 + x5 + x6 x4 + x5 + x6 + x7  18 ð18 : 01  20 : 00Þ  12 ð20 : 01  22 : 00Þ x5 + x6 + x7 + x8 x6 + x7 + x8 + x9  10 ð22 : 01  24 : 00Þ x7 + x8 + x9  4 ð00 : 01  02 : 00Þ  3 ð02 : 01  04 : 00Þ x8 + x 9 x9  8 ð04 : 01  06 : 00Þ  0, i ¼ 1, 2,…, 9 xi Optimal solution: x1 ¼ 24, x2 ¼ 0, x3 ¼ 0, x4 ¼ 0, x5 ¼ 16, x6 ¼ 11, x7 ¼ 0, x8 ¼ 0, x9 ¼ 8 with z ¼ 59. Section 19.7 (ex.2) xi ¼ number of employees that start working on day i, i ¼ 1, 2, …, 7. x1 ¼ number of employees that start working on Monday. x2 ¼ number of employees that start working on Tuesday. ⋮ x7 ¼ number of employees that start working on Sunday. min z ¼ x1 + x2 + x3 + x4 + x5 + x6 + x7 subject to + x4 + x5 + x6 + x7  15 ðMondayÞ x1 + x5 + x6 + x7  20 ðTuesdayÞ x1 + x2 x6 + x7  17 ðWednesdayÞ x1 + x2 + x3 + x7  22 ðThursdayÞ x1 + x2 + x3 + x4 + x1 + x2 + x3 + x4 + x5  25 ðFridayÞ  15 ðSaturdayÞ x2 + x3 + x4 + x5 + x6 x3 + x4 + x5 + x6 + x7  10 ðSundayÞ xi  0, i ¼ 1,…, 7 Alternative optimal solution: x1 ¼ 10, x2 ¼ 6, x3 ¼ 0, x4 ¼ 5, x5 ¼ 4, x6 ¼ 0, x7 ¼ 1 with z ¼ 26.

ANSWER KEYS: EXERCISES: CHAPTER 20
4) P(I < 0) = 15.92% by using the NORM.DIST function in Excel, or P(I < 0) = 12.19% analyzing the data generated in the Monte Carlo simulation for the variable I. Note: The results can change at each new simulation.
5) P(Index > 0.07) = 22.50% by using the NORM.DIST function in Excel, or P(Index > 0.07) = 20.43% analyzing the values generated in the simulation. Note: The results can change at each new simulation.
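A sketch of how these probabilities are read off a Monte Carlo run in Python; `sim_I` is assumed to be a NumPy array holding the simulated values of I (the exercise's own simulated draws), so both the empirical share and the normal approximation used by NORM.DIST can be computed:

```python
# Empirical probability and normal-approximation probability of I < 0.
import numpy as np
from scipy import stats

p_empirical = np.mean(sim_I < 0)                             # share of draws below 0
p_normal = stats.norm.cdf(0, loc=sim_I.mean(), scale=sim_I.std(ddof=1))
# p_normal plays the role of Excel's NORM.DIST(0, mean, sd, TRUE)
```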

ANSWER KEYS: EXERCISES: CHAPTER 21
1) F_cal = 2.476 (sig. 0.100); that is, there are no differences in the production of helicopters among the three factories.
2) There are no significant differences between the hardness measures of the different converters. That is, the "Type of Converter" factor does not have a significant effect on the variable "Hardness." On the other hand, we can conclude


that there are significant differences in the hardness of the different types of ore. That is, the "Type of Ore" factor has a significant effect on the variable "Hardness." We can also conclude that there is a significant interaction between the two factors.

Tests of Between-Subjects Effects (Dependent Variable: Hardness)

Source            Type III Sum of Squares    df    Mean Square      F           Sig.
Corrected Model         15006.222 (a)          8      1875.778       41.547      .000
Intercept             2023032.111              1   2023032.111    44808.751      .000
Converter                  66.074              2        33.037         .732      .485
Ore                     14433.852              2      7216.926      159.850      .000
Converter * Ore           506.296              4       126.574        2.804      .032
Error                    3250.667             72        45.148
Total                 2041289.000             81
Corrected Total         18256.889             80

(a) R Squared = .822 (Adjusted R Squared = .802).

3) There are significant differences between the octane rating indexes of the different types of petroleum and between the octane rating indexes of the different oil refining processes. That is, both factors have a significant effect on the octane rating index. Finally, we can conclude that there is a significant interaction between the two factors.

Tests of Between-Subjects Effects (Dependent Variable: Octane Rating)

Source                 Type III Sum of Squares    df    Mean Square      F            Sig.
Corrected Model               450.229 (a)          11        40.930       41.801       .000
Intercept                  399857.521               1    399857.521   408365.128       .000
Petroleum                     402.792               2       201.396      205.681       .000
Refining                       31.729               3        10.576       10.801       .000
Petroleum * Refining           15.708               6         2.618        2.674       .030
Error                          35.250              36          .979
Total                      400343.000              48
Corrected Total               485.479              47

(a) R Squared = .927 (Adjusted R Squared = .905).

ANSWER KEYS: EXERCISES: CHAPTER 22
1) a) Control charts for X̄: UCL = 17.4035, Average = 16.5318, LCL = 15.6600. Control charts for R: UCL = 2.7305, Average = 1.1965, LCL = 0.0000.
   b) Cp = 0.860; Cpk = 0.842; Cpm = 0.859


2) a) Control charts for X̄: UCL = 17.3895, Average = 16.5318, LCL = 15.6740. Control charts for S: UCL = 1.1938, Average = 0.5268, LCL = 0.0000.
   b) Cp = 0.9491; Cpk = 0.9290
3) a) Control charts for X̄: UCL = 6.7113, Average = 6.0625, LCL = 5.4137. Control charts for R: UCL = 2.0322, Average = 0.8905, LCL = 0.0000.
   b) Cp = 0.771; Cpk = 0.722; Cpm = 0.542
4) a) Control charts for X̄: UCL = 6.7162, Average = 6.0625, LCL = 5.4088. Control charts for S: UCL = 0.9098, Average = 0.4015, LCL = 0.0000.
   b) Cp = 0.8302; Cpk = 0.7783
5) P chart: UCL = 0.1748, Average = 0.0680, LCL = 0.0000
6) UCL = 8.7403, Average = 3.4000, LCL = 0.0000
7) UCL = 11.9996, Average = 5.1750, LCL = 0.0000
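As an illustration of the P chart limits in exercise 5, the following sketch reproduces the reported values; the subgroup size n = 50 is an assumption (it is the value implied by the published limits, not stated here):

```python
# 3-sigma limits for a P chart with p-bar = 0.0680 and assumed subgroups of n = 50.
import numpy as np

p_bar, n = 0.0680, 50                     # n = 50 is an assumption
sigma_p = np.sqrt(p_bar * (1 - p_bar) / n)
ucl = p_bar + 3 * sigma_p                 # about 0.1748
lcl = max(p_bar - 3 * sigma_p, 0.0)       # negative, so truncated to 0.0000
```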


8) [Control chart for the defects variable: fraction of nonconformities per sample, with Center = 1.1357 and the UCL and LCL limit lines plotted at the 3-sigma level.]

ANSWER KEYS: EXERCISES: CHAPTER 23 1) a)

3


In fact, this is a balanced clustered data structure. b)


c)

d) Yes. Since the estimated variance component τ00, which corresponds to the random intercept u0j, is considerably higher than its standard error, it is possible to verify that there is variability, at a significance level of 0.05, in the scores obtained by students from different countries. Statistically, z = 422.619/125.284 = 3.373 > 1.96, where 1.96 is the critical value of the standard normal distribution at a significance level of 0.05. e) Since Sig. χ² = 0.000, it is possible to reject the null hypothesis that the random intercepts are equal to zero (H0: u0j = 0), which rules out the estimation of a traditional linear regression model for these clustered data. f)

rho = τ00 / (τ00 + σ²) = 422.619 / (422.619 + 11.196) = 0.974

which suggests that approximately 97% of the total variance in students’ grades in science is due to differences between the participants’ countries of origin. g)
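A sketch of how the null (intercept-only) multilevel model and this intraclass correlation could be obtained in Python, assuming a DataFrame df with columns named score and country (the book itself works with Stata output):

```python
# Null two-level model and intraclass correlation (ICC).
import statsmodels.formula.api as smf

m0 = smf.mixedlm("score ~ 1", data=df, groups=df["country"]).fit(reml=True)
tau00 = m0.cov_re.iloc[0, 0]     # variance of the random intercepts
sigma2 = m0.scale                # level-1 residual variance
icc = tau00 / (tau00 + sigma2)   # about 422.619 / (422.619 + 11.196) = 0.974
```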


h)

i) The parameters estimated of the fixed- and random-effects components are statistically different from zero, at a significance level of 0.05. j)


k)

l)

Because the logarithms of both restricted likelihood functions are identical (LLr = 357.501), the significance level of the test is equal to 1.000 (much greater than 0.05); hence the model with random effects only in the intercept is favored, since the random error terms u1j are statistically equal to zero.


m)

n) score_ij = 13.22 + 0.0028·income_ij + 0.0008·resdevel_j·income_ij + u_0j + r_ij

o)


2) a)

In fact, this is an unbalanced clustered data structure of real estate properties in districts. b)

This is also an unbalanced data panel. c)


d)

e)


f)

g)
• Level-2 intraclass correlation:
rho_property|district = (τu000 + τr000) / (τu000 + τr000 + σ²) = (0.1228 + 0.0368) / (0.1228 + 0.0368 + 0.0007) = 0.996
• Level-3 intraclass correlation:
rho_district = τu000 / (τu000 + τr000 + σ²) = 0.1228 / (0.1228 + 0.0368 + 0.0007) = 0.766
The correlation between the natural logarithms of the rental prices per square meter of the properties in the same district is equal to 76.6% (rho_district), and the correlation between these annual observations, for the same property in a certain district, is equal to 99.6% (rho_property|district). Thus, we estimate that the property- and district-level random effects account for more than 99% of the total variance of the residuals!
h) Given the statistical significance of the estimated variances τu000, τr000, and σ² (the ratios between the estimated values and their respective standard errors are higher than 1.96, the critical value of the standard normal distribution at a significance level of 0.05), we can state that there is variability in the rental price of the commercial properties throughout the period analyzed. Moreover, there is variability in the rental price, over time, between real estate properties in the same district and between properties located in different districts.
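A minimal sketch, not from the book, that reproduces the two intraclass correlations in item g) from the variance components quoted above.

```python
# Minimal sketch: level-2 and level-3 ICCs for a three-level model, using the
# variance components reported in the answer (0.1228, 0.0368, 0.0007).
def three_level_iccs(tau_district, tau_property, sigma2):
    total = tau_district + tau_property + sigma2
    rho_property_within_district = (tau_district + tau_property) / total
    rho_district = tau_district / total
    return rho_property_within_district, rho_district

print(three_level_iccs(0.1228, 0.0368, 0.0007))   # approximately (0.996, 0.766)
```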


i) Since Sig. χ² = 0.000, it is possible to reject the null hypothesis that the random intercepts are equal to zero (H0: u00k = r0jk = 0), which rules out the estimation of a traditional linear regression model for these data.
j)

k) First, we can see that the variable that corresponds to the year (linear trend) with fixed effects is statistically significant at a significance level of 0.05 (Sig. z = 0.000 < 0.05), which demonstrates that, each year, rental prices of commercial properties increase, on average, by 1.1% (e^0.011 = 1.011), ceteris paribus. In relation to the random-effects components, it is also possible to verify that there is statistical significance in the variances of u00k, r0jk, and etjk, at a significance level of 0.05, because the estimates of τu000, τr000, and σ² are considerably higher than their respective standard errors.
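A small sketch, not from the book, of the arithmetic behind the 1.1% figure quoted in item k): with the outcome in natural logs, the average yearly percentage increase is e^beta - 1.

```python
# Minimal sketch: yearly percentage growth implied by the log-linear trend.
import math

beta_year = 0.011
print(f"{math.exp(beta_year) - 1:.4f}")   # ~0.0111, i.e. about 1.1% per year
```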


l)


m)

n)
• Level-2 intraclass correlation:
rho_property|district = (τu000 + τu100 + τr000 + τr100) / (τu000 + τu100 + τr000 + τr100 + σ²) = (0.142444 + 0.000043 + 0.039638 + 0.000047) / (0.142444 + 0.000043 + 0.039638 + 0.000047 + 0.000103) = 0.9994
• Level-3 intraclass correlation:
rho_district = (τu000 + τu100) / (τu000 + τu100 + τr000 + τr100 + σ²) = (0.142444 + 0.000043) / (0.142444 + 0.000043 + 0.039638 + 0.000047 + 0.000103) = 0.7817
For this model, we estimate that the property- and district-level random effects account for more than 99.9% of the total variance of the residuals!


o)

Since Sig. χ²(2) = 0.000, we choose the linear trend model with random intercepts and slopes.
p)

q) ln(p)tjk = 4.134 + 0.015 · yearjk + 0.231 · foodjk + 0.189 · space4jk - 0.004 · valetjk · yearjk + u00k + u10k · yearjk + r0jk + r1jk · yearjk + etjk
Note: At this point, we decided to keep the parameter of the variable space4 in the expression, as it is statistically significant at a significance level of 0.10.


r) Yes, it is possible to state that the natural logarithm of the rental price per square meter of the real estate properties follows a linear trend over time. In addition, there is significant variance in the intercepts and slopes both between properties located in the same district and between properties located in different districts. Yes, the existence of restaurants or food courts in the building, the availability of four or more parking spaces, and valet parking in the building where the property is located explain part of the evolution of the variability of the natural logarithm of the rental price per square meter of the properties.
s)

t)
• Random-effects variance-covariance matrix for level district:
var(u00k, u10k) =
[ 0.037004   0        ]
[ 0          0.000016 ]
• Random-effects variance-covariance matrix for level property:
var(r0jk, r1jk) =
[ 0.030961   0        ]
[ 0          0.000044 ]
u)
v)
• Random-effects variance-covariance matrix for level district:
var(u00k, u10k) =
[ 0.037253   0.000653 ]
[ 0.000653   0.000014 ]
• Random-effects variance-covariance matrix for level property:
var(r0jk, r1jk) =
[ 0.031679   0.000484 ]
[ 0.000484   0.000046 ]
w)

Since Sig. χ²(2) = 0.000, the structure of the random-terms variance-covariance matrices is considered unstructured; that is, we can conclude that the error terms u00k and u10k are correlated (cov(u00k, u10k) ≠ 0), and that the error terms r0jk and r1jk are also correlated (cov(r0jk, r1jk) ≠ 0).
x) ln(p)tjk = 3.7807 + 0.0144 · yearjk + 0.2314 · foodjk + 0.2071 · space4jk + 0.5111 · subwayk - 0.0031 · valetjk · yearjk - 0.0072 · subwayk · yearjk + 0.0001 · violencek · yearjk + u00k + u10k · yearjk + r0jk + r1jk · yearjk + etjk
y) Yes, it is possible to state that the existence of a subway station and the violence index in the district explain part of the variability of the evolution of the natural logarithm of the rental price per square meter between real estate properties located in different districts.
z)
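A minimal sketch, not from the book, showing how the unstructured covariance matrices from items v) and w) translate into correlations between the random intercepts and slopes.

```python
# Minimal sketch: correlation implied by a 2x2 random-effects covariance matrix.
import numpy as np

def re_correlation(cov):
    sd = np.sqrt(np.diag(cov))
    return cov[0, 1] / (sd[0] * sd[1])

district = np.array([[0.037253, 0.000653],
                     [0.000653, 0.000014]])
prop = np.array([[0.031679, 0.000484],
                 [0.000484, 0.000046]])
print(re_correlation(district), re_correlation(prop))
```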

Appendices

TABLE A Snedecor's F-Distribution: P(Fcal > Fc) = 0.10 (α = 0.10)

Fc

Numerator Degrees of Freedom (n1) n2 Denominator

1

2

3

4

5

6

7

8

9

10

1

39.86

49.50

53.59

55.83

57.24

58.20

58.91

59.44

59.86

60.19

2

8.53

9.00

9.16

9.24

9.29

9.33

9.35

9.37

9.38

9.39

3

5.54

5.46

5.39

5.34

5.31

5.28

5.27

5.25

5.24

5.23

4

4.54

4.32

4.19

4.11

4.05

4.01

3.98

3.95

3.94

3.92

5

4.06

3.78

3.62

3.52

3.45

3.40

3.37

3.34

3.32

3.30

6

3.78

3.46

3.29

3.18

3.11

3.05

3.01

2.98

2.96

2.94

7

3.59

3.26

3.07

2.96

2.88

2.83

2.78

2.75

2.72

2.70

8

3.46

3.11

2.92

2.81

2.73

2.67

2.62

2.59

2.56

2.54

9

3.36

3.01

2.81

2.69

2.61

2.55

2.51

2.47

2.44

2.42

10

3.29

2.92

2.73

2.61

2.52

2.46

2.41

2.38

2.35

2.32

11

3.23

2.86

2.66

2.54

2.45

2.39

2.34

2.30

2.27

2.25

12

3.18

2.81

2.61

2.48

2.39

2.33

2.28

2.24

2.21

2.19

13

3.14

2.76

2.56

2.43

2.35

2.28

2.23

2.20

2.16

2.14

14

3.10

2.73

2.52

2.39

2.31

2.24

2.19

2.15

2.12

2.10

15

3.07

2.70

2.49

2.36

2.27

2.21

2.16

2.12

2.09

2.06

16

3.05

2.67

2.46

2.33

2.24

2.18

2.13

2.09

2.06

2.03

17

3.03

2.64

2.44

2.31

2.22

2.15

2.10

2.06

2.03

2.00

18

3.01

2.62

2.42

2.29

2.20

2.13

2.08

2.04

2.00

1.98

19

2.99

2.61

2.40

2.27

2.18

2.11

2.06

2.02

1.98

1.96

20

2.97

2.59

2.38

2.25

2.16

2.09

2.04

2.00

1.96

1.94

21

2.96

2.57

2.36

2.23

2.14

2.08

2.02

1.98

1.95

1.92

22

2.95

2.56

2.35

2.22

2.13

2.06

2.01

1.97

1.93

1.90 Continued


TABLE A Snedecor’s F-Distribution—cont’d Numerator Degrees of Freedom (n1) n2 Denominator

2

1

3

4

5

6

7

8

9

10

23

2.94

2.55

2.34

2.21

2.11

2.05

1.99

1.95

1.92

1.89

24

2.93

2.54

2.33

2.19

2.10

2.04

1.98

1.94

1.91

1.88

25

2.92

2.53

2.32

2.18

2.09

2.02

1.97

1.93

1.89

1.87

26

2.91

2.52

2.31

2.17

2.08

2.01

1.96

1.92

1.88

1.86

27

2.90

2.51

2.30

2.17

2.07

2.00

1.95

1.91

1.87

1.85

28

2.89

2.50

2.29

2.16

2.06

2.00

1.94

1.90

1.87

1.84

29

2.89

2.50

2.28

2.15

2.06

1.99

1.93

1.89

1.86

1.83

30

2.88

2.49

2.28

2.14

2.05

1.98

1.93

1.88

1.85

1.82

35

2.85

2.46

2.25

2.11

2.02

1.95

1.90

1.85

1.82

1.79

40

2.84

2.44

2.23

2.09

2.00

1.93

1.87

1.83

1.79

1.76

45

2.82

2.42

2.21

2.07

1.98

1.91

1.85

1.81

1.77

1.74

50

2.81

2.41

2.20

2.06

1.97

1.90

1.84

1.80

1.76

1.73

100

2.76

2.36

2.14

2.00

1.91

1.83

1.78

1.73

1.69

1.66

P(Fcal > Fc) = 0.05 (α = 0.05)

Fc

Numerator Degrees of Freedom (n1) n2 Denominator

1

2

3

4

5

6

7

8

9

10

1

161.45

199.50

215.71

224.58

230.16

233.99

236.77

238.88

240.54

241.88

2

18.51

19.00

19.16

19.25

19.30

19.33

19.35

19.37

19.38

19.40

3

10.13

9.55

9.28

9.12

9.01

8.94

8.89

8.85

8.81

8.79

4

7.71

6.94

6.59

6.39

6.26

6.16

6.09

6.04

6.00

5.96

5

6.61

5.79

5.41

5.19

5.05

4.95

4.88

4.82

4.77

4.74

6

5.99

5.14

4.76

4.53

4.39

4.28

4.21

4.15

4.10

4.06

7

5.59

4.74

4.35

4.12

3.97

3.87

3.79

3.73

3.68

3.64

8

5.32

4.46

4.07

3.84

3.69

3.58

3.50

3.44

3.39

3.35

9

5.12

4.26

3.86

3.63

3.48

3.37

3.29

3.23

3.18

3.14

10

4.96

4.10

3.71

3.48

3.33

3.22

3.14

3.07

3.02

2.98

11

4.84

3.98

3.59

3.36

3.20

3.09

3.01

2.95

2.90

2.85

12

4.75

3.89

3.49

3.26

3.11

3.00

2.91

2.85

2.80

2.75

13

4.67

3.81

3.41

3.18

3.03

2.92

2.83

2.77

2.71

2.67


P(Fcal > Fc) 5 0.05—cont’d Numerator Degrees of Freedom (n1) n2 Denominator

2

1

3

4

5

6

7

8

9

10

14

4.60

3.74

3.34

3.11

2.96

2.85

2.76

2.70

2.65

2.60

15

4.54

3.68

3.29

3.06

2.90

2.79

2.71

2.64

2.59

2.54

16

4.49

3.63

3.24

3.01

2.85

2.74

2.66

2.59

2.54

2.49

17

4.45

3.59

3.20

2.96

2.81

2.70

2.61

2.55

2.49

2.45

18

4.41

3.55

3.16

2.93

2.77

2.66

2.58

2.51

2.46

2.41

19

4.38

3.52

3.13

2.90

2.74

2.63

2.54

2.48

2.42

2.38

20

4.35

3.49

3.10

2.87

2.71

2.60

2.51

2.45

2.39

2.35

21

4.32

3.47

3.07

2.84

2.68

2.57

2.49

2.42

2.37

2.32

22

4.30

3.44

3.05

2.82

2.66

2.55

2.46

2.40

2.34

2.30

23

4.28

3.42

3.03

2.80

2.64

2.53

2.44

2.37

2.32

2.27

24

4.26

3.40

3.01

2.78

2.62

2.51

2.42

2.36

2.30

2.25

25

4.24

3.39

2.99

2.76

2.60

2.49

2.40

2.34

2.28

2.24

26

4.23

3.37

2.98

2.74

2.59

2.47

2.39

2.32

2.27

2.22

27

4.21

3.35

2.96

2.73

2.57

2.46

2.37

2.31

2.25

2.20

28

4.20

3.34

2.95

2.71

2.56

2.45

2.36

2.29

2.24

2.19

29

4.18

3.33

2.93

2.70

2.55

2.43

2.35

2.28

2.22

2.18

30

4.17

3.32

2.92

2.69

2.53

2.42

2.33

2.27

2.21

2.16

35

4.12

3.27

2.87

2.64

2.49

2.37

2.29

2.22

2.16

2.11

40

4.08

3.23

2.84

2.61

2.45

2.34

2.25

2.18

2.12

2.08

45

4.06

3.20

2.81

2.58

2.42

2.31

2.22

2.15

2.10

2.05

50

4.03

3.18

2.79

2.56

2.40

2.29

2.20

2.13

2.07

2.03

100

3.94

3.09

2.70

2.46

2.31

2.19

2.10

2.03

1.97

1.93

P(Fcal > Fc) = 0.025 (α = 0.025)

Fc

Numerator Degrees of Freedom (n1) n2 Denominator

1

2

3

4

5

6

7

8

9

10

1

647.8

799.5

864.2

899.6

921.8

937.1

948.2

956.7

963.3

963.3

2

38.51

39.00

39.17

39.25

39.30

39.33

39.36

39.37

39.39

3

17.44

16.04

15.44

15.10

14.88

14.73

14.62

14.54

14.47

39.40 14.42 Continued


P(Fcal > Fc) 5 0.025—cont’d Numerator Degrees of Freedom (n1) n2 Denominator

2

1

3

4

5

6

7

8

9

10

4

12.22

10.65

9.98

9.60

9.36

9.20

9.07

8.98

8.90

8.84

5

10.01

8.43

7.76

7.39

7.15

6.98

6.85

6.76

6.68

6.62

6

8.81

7.26

6.60

6.23

5.99

5.82

5.70

5.60

5.52

5.46

7

8.07

6.54

5.89

5.52

5.29

5.12

4.99

4.90

4.82

4.76

8

7.57

6.06

5.42

5.05

4.82

4.65

4.53

4.43

4.36

4.30

9

7.21

5.71

5.08

4.72

4.48

4.32

4.20

4.10

4.03

3.96

10

6.94

5.46

4.83

4.47

4.24

4.07

3.95

3.85

3.78

3.72

11

6.72

5.26

4.63

4.28

4.04

3.88

3.76

3.66

3.59

3.53

12

6.55

5.10

4.47

4.12

3.89

3.73

3.61

3.51

3.44

3.37

13

6.41

4.97

4.35

4.00

3.77

3.60

3.48

3.39

3.31

3.25

14

6.30

4.86

4.24

3.89

3.66

3.50

3.38

3.29

3.21

3.15

15

6.20

4.77

4.15

3.80

3.58

3.41

3.29

3.20

3.12

3.06

16

6.12

4.69

4.08

3.73

3.50

3.34

3.22

3.12

3.05

2.99

17

6.04

4.62

4.01

3.66

3.44

3.28

3.16

3.06

2.98

2.92

18

5.98

4.56

3.95

3.61

3.38

3.22

3.10

3.01

2.93

2.87

19

5.92

4.51

3.90

3.56

3.33

3.17

3.05

2.96

2.88

2.82

20

5.87

4.46

3.86

3.51

3.29

3.13

3.01

2.91

2.84

2.77

21

5.83

4.42

3.82

3.48

3.25

3.09

2.97

2.87

2.80

2.73

22

5.79

4.38

3.78

3.44

3.22

3.05

2.93

2.84

2.76

2.70

23

5.75

4.35

3.75

3.41

3.18

3.02

2.90

2.81

2.73

2.67

24

5.72

4.32

3.72

3.38

3.15

2.99

2.87

2.78

2.70

2.64

25

5.69

4.29

3.69

3.35

3.13

2.97

2.85

2.75

2.68

2.61

26

5.66

4.27

3.67

3.33

3.10

2.94

2.82

2.73

2.65

2.59

27

5.63

4.24

3.65

3.31

3.08

2.92

2.80

2.71

2.63

2.57

28

5.61

4.22

3.63

3.29

3.06

2.90

2.78

2.69

2.61

2.55

29

5.59

4.20

3.61

3.27

3.04

2.88

2.76

2.67

2.59

2.53

30

5.57

4.18

3.59

3.25

3.03

2.87

2.75

2.65

2.57

2.51

40

5.42

4.05

3.46

3.13

2.90

2.74

2.62

2.53

2.45

2.39

60

5.29

3.93

3.34

3.01

2.79

2.63

2.51

2.41

2.33

2.27

120

5.15

3.80

3.23

2.89

2.67

2.52

2.39

2.30

2.22

2.16


P(Fcal > Fc) = 0.01 (α = 0.01)

Fc

n2 Denominator 1

Numerator Degrees of Freedom (n1) 1

2

3

4

5

6

7

8

9

10

4,052.2

4,999.3

5,403.5

5,624.3

5,764.0

5,859.0

5,928.3

5,981.0

6,022.4

6,055.9

2

98.50

99.00

99.16

99.25

99.30

99.33

99.36

99.38

99.39

99.40

3

34.12

30.82

29.46

28.71

28.24

27.91

27.67

27.49

27.34

27.23

4

21.20

18.00

16.69

15.98

15.52

15.21

14.98

14.80

14.66

14.55

5

16.26

13.27

12.06

11.39

10.97

10.67

10.46

10.29

10.16

10.05

6

13.75

10.92

9.78

9.15

8.75

8.47

8.26

8.10

7.98

7.87

7

12.25

9.55

8.45

7.85

7.46

7.19

6.99

6.84

6.72

6.62

8

11.26

8.65

7.59

7.01

6.63

6.37

6.18

6.03

5.91

5.81

9

10.56

8.02

6.99

6.42

6.06

5.80

5.61

5.47

5.35

5.26

10

10.04

7.56

6.55

5.99

5.64

5.39

5.20

5.06

4.94

4.85

11

9.65

7.21

6.22

5.67

5.32

5.07

4.89

4.74

4.63

4.54

12

9.33

6.93

5.95

5.41

5.06

4.82

4.64

4.50

4.39

4.30

13

9.07

6.70

5.74

5.21

4.86

4.62

4.44

4.30

4.19

4.10

14

8.86

6.51

5.56

5.04

4.69

4.46

4.28

4.14

4.03

3.94

15

8.68

6.36

5.42

4.89

4.56

4.32

4.14

4.00

3.89

3.80

16

8.53

6.23

5.29

4.77

4.44

4.20

4.03

3.89

3.78

3.69

17

8.40

6.11

5.19

4.67

4.34

4.10

3.93

3.79

3.68

3.59

18

8.29

6.01

5.09

4.58

4.25

4.01

3.84

3.71

3.60

3.51

19

8.18

5.93

5.01

4.50

4.17

3.94

3.77

3.63

3.52

3.43

20

8.10

5.85

4.94

4.43

4.10

3.87

3.70

3.56

3.46

3.37

21

8.02

5.78

4.87

4.37

4.04

3.81

3.64

3.51

3.40

3.31

22

7.95

5.72

4.82

4.31

3.99

3.76

3.59

3.45

3.35

3.26

23

7.88

5.66

4.76

4.26

3.94

3.71

3.54

3.41

3.30

3.21

24

7.82

5.61

4.72

4.22

3.90

3.67

3.50

3.36

3.26

3.17

25

7.77

5.57

4.68

4.18

3.85

3.63

3.46

3.32

3.22

3.13

26

7.72

5.53

4.64

4.14

3.82

3.59

3.42

3.29

3.18

3.09

27

7.68

5.49

4.60

4.11

3.78

3.56

3.39

3.26

3.15

3.06

28

7.64

5.45

4.57

4.07

3.75

3.53

3.36

3.23

3.12

3.03

29

7.60

5.42

4.54

4.04

3.73

3.50

3.33

3.20

3.09

3.00

30

7.56

5.39

4.51

4.02

3.70

3.47

3.30

3.17

3.07

2.98

35

7.42

5.27

4.40

3.91

3.59

3.37

3.20

3.07

2.96

2.88 Continued


P(Fcal > Fc) 5 0.01—cont’d n2 Denominator

Numerator Degrees of Freedom (n1) 2

1

3

4

5

6

7

8

9

10

40

7.31

5.18

4.31

3.83

3.51

3.29

3.12

2.99

2.89

2.80

45

7.23

5.11

4.25

3.77

3.45

3.23

3.07

2.94

2.83

2.74

50

7.17

5.06

4.20

3.72

3.41

3.19

3.02

2.89

2.78

2.70

100

6.90

4.82

3.98

3.51

3.21

2.99

2.82

2.69

2.59

2.50

Critical values of Snedecor's F-distribution.
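The critical values in Table A can also be reproduced in software; a minimal sketch with scipy (not part of the original appendix):

```python
# Minimal sketch: F critical value for P(F > Fc) = 0.05 with 3 and 10 d.f.
from scipy.stats import f

Fc = f.ppf(1 - 0.05, dfn=3, dfd=10)
print(round(Fc, 2))   # ~3.71, matching the alpha = 0.05 table
```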

TABLE B Student's t-distribution: P(Tcal > tc) = α

[Figure: right-tail area α beyond tc under the t density.]

Associated Probability for a Right-Tailed Test

Degrees of Freedom n

0.25

0.10

0.05

0.025

0.01

0.005

0.0025

0.001

0.0005

1

1.000

3.078

6.314

12.706

31.821

63.657

127.3

318.309

636.619

2

0.816

1.886

2.920

4.303

6.965

9.925

14.09

22.33

31.60

3

0.765

1.638

2.353

3.182

4.541

5.841

7.453

10.21

12.92

4

0.741

1.533

2.132

2.776

3.747

4.604

5.598

7.173

8.610

5

0.727

1.476

2.015

2.571

3.365

4.032

4.773

5.894

6.869

6

0.718

1.440

1.943

2.447

3.143

3.707

4.317

5.208

5.959

7

0.711

1.415

1.895

2.365

2.998

3.499

4.029

4.785

5.408

8

0.706

1.397

1.860

2.306

2.896

3.355

3.833

4.501

5.041

9

0.703

1.383

1.833

2.262

2.821

3.250

3.690

4.297

4.781

10

0.700

1.372

1.812

2.228

2.764

3.169

3.581

4.144

4.587

11

0.697

1.363

1.796

2.201

2.718

3.106

3.497

4.025

4.437

12

0.695

1.356

1.782

2.179

2.681

3.055

3.428

3.930

4.318

13

0.694

1.350

1.771

2.160

2.650

3.012

3.372

3.852

4.221

14

0.692

1.345

1.761

2.145

2.624

2.977

3.326

3.787

4.140

15

0.691

1.341

1.753

2.131

2.602

2.947

3.286

3.733

4.073

16

0.690

1.337

1.746

2.120

2.583

2.921

3.252

3.686

4.015

17

0.689

1.333

1.740

2.110

2.567

2.898

3.222

3.646

3.965

18

0.688

1.330

1.734

2.101

2.552

2.878

3.197

3.610

3.922

19

0.688

1.328

1.729

2.093

2.539

2.861

3.174

3.579

3.883



TABLE B Student’s t-distribution—cont’d Associated Probability for a Right-Tailed Test

Degrees of Freedom n

0.25

0.10

0.05

20

0.687

1.325

1.725

2.086

2.528

2.845

3.153

3.552

3.850

21

0.686

1.323

1.721

2.080

2.518

2.831

3.135

3.527

3.819

22

0.686

1.321

1.717

2.074

2.508

2.819

3.119

3.505

3.792

23

0.685

1.319

1.714

2.069

2.500

2.807

3.104

3.485

3.768

24

0.685

1.318

1.711

2.064

2.492

2.797

3.091

3.467

3.745

25

0.684

1.316

1.708

2.060

2.485

2.787

3.078

3.450

3.725

26

0.684

1.315

1.706

2.056

2.479

2.779

3.067

3.435

3.707

27

0.684

1.314

1.703

2.052

2.473

2.771

3.057

3.421

3.689

28

0.683

1.313

1.701

2.048

2.467

2.763

3.047

3.408

3.674

29

0.683

1.311

1.699

2.045

2.462

2.756

3.038

3.396

3.660

30

0.683

1.310

1.697

2.042

2.457

2.750

3.030

3.385

3.646

35

0.682

1.306

1.690

2.030

2.438

2.724

2.996

3.340

3.591

40

0.681

1.303

1.684

2.021

2.423

2.704

2.971

3.307

3.551

45

0.680

1.301

1.679

2.014

2.412

2.690

2.952

3.281

3.520

50

0.679

1.299

1.676

2.009

2.403

2.678

2.937

3.261

3.496

z

0.674

1.282

1.645

1.960

2.326

2.576

2.807

3.090

3.291

Critical values of the Student’s t-distribution.

0.025   0.01   0.005   0.0025   0.001   0.0005
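A Table B entry can likewise be reproduced with scipy; a minimal sketch (not part of the original appendix):

```python
# Minimal sketch: right-tail t critical value with 20 d.f. and alpha = 0.025.
from scipy.stats import t

tc = t.ppf(1 - 0.025, df=20)
print(round(tc, 3))   # ~2.086, matching the table
```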

TABLE C Durbin-Watson Distribution (DW)

[Diagram: decision regions on the DW scale: positive autocorrelation below dL; inconclusive between dL and dU; no autocorrelation between dU and 4-dU (around 2); inconclusive between 4-dU and 4-dL; negative autocorrelation above 4-dL.]

DW statistics for models with intercept, significance level α = 5%

k (Number of Parameters – Includes Intercept) 2

3

4

5

6

7

8

9

10

n

dL

dU

dL

dU

dL

dU

dL

dU

dL

dU

dL

dU

dL

dU

dL

dU

dL

dU

6

0.610

1.400

































7

0.700

1.356

0.467

1.896





























8

0.763

1.332

0.559

1.777

0.367

2.287

























9

0.824

1.320

0.629

1.699

0.455

2.128

0.296

2.588





















10

0.879

1.320

0.697

1.641

0.525

2.016

0.376

2.414

0.243

2.822

















11

0.927

1.324

0.758

1.604

0.595

1.928

0.444

2.283

0.315

2.645

0.203

3.004













12

0.971

1.331

0.812

1.579

0.658

1.864

0.512

2.177

0.380

2.506

0.268

2.832

0.171

3.149









13

1.010

1.340

0.861

1.562

0.715

1.816

0.574

2.094

0.444

2.390

0.328

2.692

0.230

2.985

0.147

3.266





14

1.045

1.350

0.905

1.551

0.767

1.779

0.632

2.030

0.505

2.296

0.389

2.572

0.286

2.848

0.200

3.111

0.127

3.360

15

1.077

1.361

0.946

1.543

0.814

1.750

0.685

1.977

0.562

2.220

0.447

2.471

0.343

2.727

0.251

2.979

0.175

3.216

16

1.106

1.371

0.982

1.539

0.857

1.728

0.734

1.935

0.615

2.157

0.502

2.388

0.398

2.624

0.304

2.860

0.222

3.090

17

1.133

1.381

1.015

1.536

0.897

1.710

0.779

1.900

0.664

2.104

0.554

2.318

0.451

2.537

0.356

2.757

0.272

2.975

18

1.158

1.391

1.046

1.535

0.933

1.696

0.820

1.872

0.710

2.060

0.603

2.258

0.502

2.461

0.407

2.668

0.321

2.873

19

1.180

1.401

1.074

1.536

0.967

1.685

0.859

1.848

0.752

2.023

0.649

2.206

0.549

2.396

0.456

2.589

0.369

2.783

20

1.201

1.411

1.100

1.537

0.998

1.676

0.894

1.828

0.792

1.991

0.691

2.162

0.595

2.339

0.502

2.521

0.416

2.704

21

1.221

1.420

1.125

1.538

1.026

1.669

0.927

1.812

0.829

1.964

0.731

2.124

0.637

2.290

0.546

2.461

0.461

2.633

22

1.239

1.429

1.147

1.541

1.053

1.664

0.958

1.797

0.863

1.940

0.769

2.090

0.677

2.246

0.588

2.407

0.504

2.571

23

1.257

1.437

1.168

1.543

1.078

1.660

0.986

1.785

0.895

1.920

0.804

2.061

0.715

2.208

0.628

2.360

0.545

2.514

24

1.273

1.446

1.188

1.546

1.101

1.656

1.013

1.775

0.925

1.902

0.837

2.035

0.750

2.174

0.666

2.318

0.584

2.464

25

1.288

1.454

1.206

1.550

1.123

1.654

1.038

1.767

0.953

1.886

0.868

2.013

0.784

2.144

0.702

2.280

0.621

2.419

26

1.302

1.461

1.224

1.553

1.143

1.652

1.062

1.759

0.979

1.873

0.897

1.992

0.816

2.117

0.735

2.246

0.657

2.379

27

1.316

1.469

1.240

1.556

1.162

1.651

1.084

1.753

1.004

1.861

0.925

1.974

0.845

2.093

0.767

2.216

0.691

2.342

28

1.328

1.476

1.255

1.560

1.1181

1.650

1.104

1.747

1.028

1.850

0.951

1.959

0.874

2.071

0.798

2.188

0.723

2.309

29

1.341

1.483

1.270

1.563

1.198

1.650

1.124

1.743

1.050

1.841

0.975

1.944

0.900

2.052

0.826

2.164

0.753

2.278

30

1.352

1.489

1.284

1.567

1.214

1.650

1.143

1.739

1.071

1.833

0.998

1.931

0.926

2.034

0.854

2.141

0.782

2.251

31

1.363

1.496

1.297

1.570

1.229

1.650

1.160

1.735

1.090

1.825

1.020

1.920

0.950

2.018

0.879

2.120

0.810

2.226

32

1.373

1.502

1.309

1.574

1.244

1.650

1.177

1.732

1.109

1.819

1.041

1.909

0.972

2.004

0.904

2.102

0.836

2.203

33

1.383

1.508

1.321

1.577

1.258

1.651

1.193

1.730

1.127

1.813

1.061

1.900

0.994

1.991

0.927

2.085

0.861

2.181

34

1.393

1.514

1.333

1.580

1.271

1.652

1.208

1.728

1.144

1.808

1.079

1.891

1.015

1.978

0.950

2.069

0.885

2.162

35

1.402

1.519

1.343

1.584

1.283

1.653

1.222

1.726

1.160

1.803

1.097

1.884

1.034

1.967

0.971

2.054

0.908

2.144

36

1.411

1.525

1.354

1.587

1.295

1.654

1.236

1.724

1.175

1.799

1.114

1.876

1.053

1.957

0.991

2.041

0.930

2.127

37

1.419

1.530

1.364

1.590

1.307

1.655

1.249

1.723

1.190

1.795

1.131

1.870

1.071

1.948

1.011

2.029

0.951

2.112

38

1.427

1.535

1.373

1.594

1.318

1.656

1.261

1.722

1.204

1.792

1.146

1.864

1.088

1.939

1.029

2.017

0.970

2.098

39

1.435

1.540

1.382

1.597

1.328

1.658

1.273

1.722

1.218

1.789

1.161

1.859

1.104

1.932

1.047

2.007

0.990

2.085

40

1.442

1.544

1.391

1.600

1.338

1.659

1.285

1.721

1.230

1.786

1.175

1.854

1.120

1.924

1.064

1.997

1.008

2.072

45

1.475

1.566

1.430

1.615

1.383

1.666

1.336

1.720

1.287

1.776

1.238

1.835

1.189

1.895

1.139

1.958

1.089

2.022

50

1.503

1.585

1.462

1.628

1.421

1.674

1.378

1.721

1.335

1.771

1.291

1.822

1.246

1.875

1.201

1.930

1.156

1.986

55

1.528

1.601

1.490

1.641

1.452

1.611

1.414

1.724

1.374

1.768

1.334

1.814

1.294

1.861

1.253

1.909

1.212

1.959

60

1.549

1.616

1.514

1.652

1.480

1.689

1.444

1.727

1.408

1.767

1.372

1.808

1.335

1.850

1.298

1.894

1.260

1.939

65

1.567

1.629

1.536

1.662

1.503

1.696

1.471

1.731

1.438

1.767

1.404

1.805

1.170

1.843

1.336

1.882

1.301

1.923

70

1.583

1.641

1.554

1.672

1.525

1.703

1.494

1.735

1.464

1.768

1.433

1.802

1.401

1.838

1.369

1.874

1.337

1.910

75

1.598

1.652

1.571

1.680

1.543

1.709

1.515

1.739

1.487

1.770

1.458

1.801

1.428

1.834

1.399

1.867

1.369

1.901

80

1.611

1.662

1.586

1.688

1.560

1.715

1.534

1.743

1.507

1.772

1.480

1.801

1.453

1.831

1.425

1.861

1.397

1.893

85

1.624

1.671

1.600

1.696

1.575

1.721

1.550

1.747

1.525

1.774

1.500

1.801

1.474

1.829

1.448

1.857

1.422

1.886

90

1.635

1.679

1.612

1.703

1.589

1.726

1.566

1.751

1.542

1.776

1.518

1.801

1.494

1.827

1.469

1.854

1.445

1.881

95

1.645

1.687

1.623

1.709

1.602

1.732

1.579

1.755

1.557

1.778

1.535

1.802

1.512

1.827

1.489

1.852

1.465

1.877

100

1.654

1.694

1.634

1.715

1.613

1.736

1.592

1.758

1.571

1.780

1.550

1.803

1.528

1.827

1.489

1.852

1.465

1.877

150

1.720

1.747

1.706

1.760

1.693

1.774

1.679

1.788

1.665

1.802

1.651

1.817

1.637

1.832

1.622

1.846

1.608

1.862

200

1.758

1.779

1.748

1.789

1.738

1.799

1.728

1.809

1.718

1.820

1.707

1.831

1.697

1.841

1.686

1.852

1.675

1.863


TABLE D Chi-Square Distribution: P(χ²cal with n degrees of freedom > χ²c) = α

Degrees of Freedom n

0.99

0.975

0.95

0.9

0.1

0.05

0.025

0.01

0.005

1

0.000

0.001

0.004

0.016

2.706

3.841

5.024

6.635

7.879

2

0.020

0.051

0.103

0.211

4.605

5.991

7.378

9.210

10.597

3

0.115

0.216

0.352

0.584

6.251

7.815

9.348

11.345

12.838

4

0.297

0.484

0.711

1.064

7.779

9.488

11.143

13.277

14.860

5

0.554

0.831

1.145

1.610

9.236

11.070

12.832

15.086

16.750

6

0.872

1.237

1.635

2.204

10.645

12.592

14.449

16.812

18.548

7

1.239

1.690

2.167

2.833

12.017

14.067

16.013

18.475

20.278

8

1.647

2.180

2.733

3.490

13.362

15.507

17.535

20.090

21.955

9

2.088

2.700

3.325

4.168

14.684

16.919

19.023

21.666

23.589

10

2.558

3.247

3.940

4.865

15.987

18.307

20.483

23.209

25.188

11

3.053

3.816

4.575

5.578

17.275

19.675

21.920

24.725

26.757

12

3.571

4.404

5.226

6.304

18.549

21.026

23.337

26.217

28.300

13

4.107

5.009

5.892

7.041

19.812

22.362

24.736

27.688

29.819

14

4.660

5.629

6.571

7.790

21.064

23.685

26.119

29.141

31.319

15

5.229

6.262

7.261

8.547

22.307

24.996

27.488

30.578

32.801

16

5.812

6.908

7.962

9.312

23.542

26.296

28.845

32.000

34.267

17

6.408

7.564

8.672

10.085

24.769

27.587

30.191

33.409

35.718

18

7.015

8.231

9.390

10.865

25.989

28.869

31.526

34.805

37.156

19

7.633

8.907

10.117

11.651

27.204

30.144

32.852

36.191

38.582

20

8.260

9.591

10.851

12.443

28.412

31.410

34.170

37.566

39.997

21

8.897

10.283

11.591

13.240

29.615

32.671

35.479

38.932

41.401

22

9.542

10.982

12.338

14.041

30.813

33.924

36.781

40.289

42.796

23

10.196

11.689

13.091

14.848

32.007

35.172

38.076

41.638

44.181

24

10.856

12.401

13.848

15.659

33.196

36.415

39.364

42.980

45.558

25

11.524

13.120

14.611

16.473

34.382

37.652

40.646

44.314

46.928

26

12.198

13.844

15.379

17.292

35.563

38.885

41.923

45.642

48.290

27

12.878

14.573

16.151

18.114

36.741

40.113

43.195

46.963

49.645

28

13.565

15.308

16.928

18.939

37.916

41.337

44.461

48.278

50.994

29

14.256

16.047

17.708

19.768

39.087

42.557

45.722

49.588

52.335

30

14.953

16.791

18.493

20.599

40.256

43.773

46.979

50.892

53.672

31

15.655

17.539

19.281

21.434

41.422

44.985

48.232

52.191

55.002

32

16.362

18.291

20.072

22.271

42.585

46.194

49.480

53.486

56.328


TABLE D Chi-Square Distribution—cont’d Degrees of Freedom n

0.99

0.975

0.95

0.9

0.1

0.05

0.025

0.01

0.005

33

17.073

19.047

20.867

23.110

43.745

47.400

50.725

54.775

57.648

34

17.789

19.806

21.664

23.952

44.903

48.602

51.966

56.061

58.964

35

18.509

20.569

22.465

24.797

46.059

49.802

53.203

57.342

60.275

36

19.233

21.336

23.269

25.643

47.212

50.998

54.437

58.619

61.581

37

19.960

22.106

24.075

26.492

48.363

52.192

55.668

59.893

62.883

38

20.691

22.878

24.884

27.343

49.513

53.384

56.895

61.162

64.181

39

21.426

23.654

25.695

28.196

50.660

54.572

58.120

62.428

65.475

40

22.164

24.433

26.509

29.051

51.805

55.758

59.342

63.691

66.766

41

22.906

25.215

27.326

29.907

52.949

56.942

60.561

64.950

68.053

42

23.650

25.999

28.144

30.765

54.090

58.124

61.777

66.206

69.336

43

24.398

26.785

28.965

31.625

55.230

59.304

62.990

67.459

70.616

44

25.148

27.575

29.787

32.487

56.369

60.481

64.201

68.710

71.892

45

25.901

28.366

30.612

33.350

57.505

61.656

65.410

69.957

73.166

46

26.657

29.160

31.439

34.215

58.641

62.830

66.616

71.201

74.437

47

27.416

29.956

32.268

35.081

59.774

64.001

67.821

72.443

75.704

48

28.177

30.754

33.098

35.949

60.907

65.171

69.023

73.683

76.969

49

28.941

31.555

33.930

36.818

62.038

66.339

70.222

74.919

78.231

50

29.707

32.357

34.764

37.689

63.167

67.505

71.420

76.154

79.490

Critical values (for a right-tailed unilateral test) of the chi-square distribution.
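A Table D entry can also be reproduced with scipy; a minimal sketch (not part of the original appendix):

```python
# Minimal sketch: right-tail chi-square critical value, 10 d.f., alpha = 0.05.
from scipy.stats import chi2

xc = chi2.ppf(1 - 0.05, df=10)
print(round(xc, 3))   # ~18.307, matching the table
```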

TABLE E Standard Normal Distribution: P(zcal > zc) = α

[Figure: right-tail area α beyond zc under the standard normal density.]

Second Decimal of zc zc

0.00

0.01

0.02

0.03

0.04

0.05

0.06

0.07

0.08

0.09

0.0

0.5000

0.4960

0.4920

0.4880

0.4840

0.4801

0.4761

0.4721

0.4681

0.4641

0.1

0.4602

0.4562

0.4522

0.4483

0.4443

0.4404

0.4364

0.4325

0.4286

0.4247

0.2

0.4207

0.4168

0.4129

0.4090

0.4052

0.4013

0.3974

0.3936

0.3897

0.3859

0.3

0.3821

0.3783

0.3745

0.3707

0.3669

0.3632

0.3594

0.3557

0.3520

0.3483

0.4

0.3446

0.3409

0.3372

0.3336

0.3300

0.3264

0.3228

0.3192

0.3156

0.3121

0.5

0.3085

0.3050

0.3015

0.2981

0.2946

0.2912

0.2877

0.2842

0.2810

0.2776 Continued


TABLE E Standard Normal Distribution—cont’d Second Decimal of zc zc

0.00

0.01

0.02

0.03

0.04

0.05

0.06

0.07

0.08

0.09

0.6

0.2743

0.2709

0.2676

0.2643

0.2611

0.2578

0.2546

0.2514

0.2483

0.2451

0.7

0.2420

0.2389

0.2358

0.2327

0.2296

0.2266

0.2236

0.2206

0.2177

0.2148

0.8

0.2119

0.2090

0.2061

0.2033

0.2005

0.1977

0.1949

0.1922

0.1894

0.1867

0.9

0.1841

0.1814

0.1788

0.1762

0.1736

0.1711

0.1685

0.1660

0.1635

0.1611

1.0

0.1587

0.1562

0.1539

0.1515

0.1492

0.1469

0.1446

0.1423

0.1401

0.1379

1.1

0.1357

0.1335

0.1314

0.1292

0.1271

0.1251

0.1230

0.1210

0.1190

0.1170

1.2

0.1151

0.1131

0.1112

0.1093

0.1075

0.1056

0.1038

0.1020

0.1003

0.0985

1.3

0.0968

0.0951

0.0934

0.0918

0.0901

0.0885

0.0869

0.0853

0.0838

0.0823

1.4

0.0808

0.0793

0.0778

0.0764

0.0749

0.0735

0.0722

0.0708

0.0694

0.0681

1.5

0.0668

0.0655

0.0643

0.0630

0.0618

0.0606

0.0594

0.0582

0.0571

0.0559

1.6

0.0548

0.0537

0.0526

0.0516

0.0505

0.0495

0.0485

0.0475

0.0465

0.0455

1.7

0.0446

0.0436

0.0427

0.0418

0.0409

0.0401

0.0392

0.0384

0.0375

0.0367

1.8

0.0359

0.0352

0.0344

0.0336

0.0329

0.0322

0.0314

0.0307

0.0301

0.0294

1.9

0.0287

0.0281

0.0274

0.0268

0.0262

0.0256

0.0250

0.0244

0.0239

0.0233

2.0

0.0228

0.0222

0.0217

0.0212

0.0207

0.0202

0.0197

0.0192

0.0188

0.0183

2.1

0.0179

0.0174

0.0170

0.0166

0.0162

0.0158

0.0154

0.0150

0.0146

0.0143

2.2

0.0139

0.0136

0.0132

0.0129

0.0125

0.0122

0.0119

0.0116

0.0113

0.0110

2.3

0.0107

0.0104

0.0102

0.0099

0.0096

0.0094

0.0091

0.0089

0.0087

0.0084

2.4

0.0082

0.0080

0.0078

0.0075

0.0073

0.0071

0.0069

0.0068

0.0066

0.0064

2.5

0.0062

0.0060

0.0059

0.0057

0.0055

0.0054

0.0052

0.0051

0.0049

0.0048

2.6

0.0047

0.0045

0.0044

0.0043

0.0041

0.0040

0.0039

0.0038

0.0037

0.0036

2.7

0.0035

0.0034

0.0033

0.0032

0.0031

0.0030

0.0029

0.0028

0.0027

0.0026

2.8

0.0026

0.0025

0.0024

0.0023

0.0023

0.0022

0.0021

0.0021

0.0020

0.0019

2.9

0.0019

0.0018

0.0017

0.0017

0.0016

0.0016

0.0015

0.0015

0.0014

0.0014

3.0

0.0013

0.0013

0.0013

0.0012

0.0012

0.0011

0.0011

0.0011

0.0010

0.0010

3.1

0.0010

0.0009

0.0009

0.0009

0.0008

0.0008

0.0008

0.0008

0.0007

0.0007

3.2

0.0007

3.3

0.0005

3.4

0.0003

3.5

0.00023

3.6

0.00016

3.7

0.00011


TABLE E Standard Normal Distribution—cont’d Second Decimal of zc zc

0.00

3.8

0.00007

3.9

0.00005

4.0

0.00003

0.01

0.02

0.03

0.04

0.05

0.06

0.07

0.08

0.09

Associated probability for a right-tailed test.
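A Table E entry can also be reproduced with scipy; a minimal sketch (not part of the original appendix):

```python
# Minimal sketch: right-tail probability P(Z > zc) for zc = 1.96.
from scipy.stats import norm

p = norm.sf(1.96)      # survival function = right-tail area
print(round(p, 4))     # ~0.0250, matching the table
```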

TABLE F1 Binomial Distribution: P[Y = k] = C(N, k) · p^k · (1 - p)^(N - k)

p

N

k

0.01

0.05

0.10

0.15

0.20

0.25

0.30

1/3

0.40

0.45

0.50

2

0

9801

9025

8100

7225

6400

5625

4900

4444

3600

3025

2500

2

1

198

950

1800

2550

3200

3750

4200

4444

4800

4950

5000

1

2

1

25

100

225

400

625

900

1111

1600

2025

2500

0

0

9703

8574

7290

6141

5120

4219

3430

2963

2160

1664

1250

3

1

294

1354

2430

3251

3840

4219

4410

4444

4320

4084

3750

2

2

3

71

270

574

960

1406

1890

2222

2880

3341

3750

1

3

0

1

10

34

80

156

270

370

640

911

1250

0

0

9606

8145

6561

5220

4096

3164

2401

1975

1296

915

625

4

1

388

1715

2916

3685

4096

4219

4116

3951

3456

2995

2500

3

2

6

135

486

975

1536

2109

2646

2963

3456

3675

3750

2

3

0

5

36

115

256

469

756

988

1536

2005

2500

1

4

0

0

1

5

16

39

81

123

256

410

625

0

0

9510

7738

5905

4437

3277

2373

1681

1317

778

503

312

5

1

480

2036

3280

3915

4096

3955

3602

3292

2592

2059

1562

4

2

10

214

729

1382

2048

2637

3087

3292

3456

3369

3125

3

3

0

11

81

244

512

879

1323

1646

2304

2757

3125

2

4

0

0

4

22

64

146

283

412

768

1128

1562

1

5

0

0

0

1

3

10

24

41

102

185

312

0

0

9415

7351

5314

3771

2621

1780

1176

878

467

277

156

6

1

571

2321

3543

3993

3932

3560

3025

2634

1866

1359

938

5

3

4

5

6

2

3

4

5

6

Continued


TABLE F1 Binomial Distribution—cont’d p N

7

k

0.01

0.05

0.10

0.15

0.20

0.25

0.30

1/3

0.40

0.45

0.50

2

14

305

984

1762

2458

2966

3241

3292

3110

2780

2344

4

3

0

21

146

415

819

1318

1852

2195

2765

3032

3125

3

4

0

1

12

55

154

330

595

823

1382

1861

2344

2

5

0

0

1

4

15

44

102

165

369

609

938

1

6

0

0

0

0

1

2

7

14

41

83

156

0

0

9321

6983

4783

3206

2097

1335

824

585

280

152

78

7

1

659

2573

3720

3960

3670

3115

2471

2048

1306

872

547

6

2

20

406

1240

2097

2753

3115

3177

3073

2613

2140

1641

5

3

0

36

230

617

1147

1730

2269

2561

2903

2918

2734

4

4

0

2

26

109

287

577

972

1280

1935

2388

2734

3

5

0

0

2

12

43

115

250

384

774

1172

1641

2

6

0

0

0

1

4

13

36

64

172

320

547

1

7

0

0

0

0

0

1

2

5

16

37

78

0

0.99

0.95

0.90

0.85

0.80

0.75

0.70

2/3

0.60

0.55

0.50

7

k

N

8

p

p N

k

0.01

0.05

0.10

0.15

0.20

0.25

0.30

8

0

9227

6634

4305

2725

1678

1001

576

1

746

2793

3826

3847

3355

2670

2

26

515

1488

2376

2936

3

1

54

331

839

4

0

4

46

5

0

0

6

0

7

9

1/3

0.40

0.45

0.50

390

168

84

39

8

1977

1561

896

548

312

7

3115

2965

2731

2090

1569

1094

6

1468

2076

2541

2731

2787

2568

2188

5

185

459

865

1361

1707

2322

2627

2734

4

4

26

92

231

467

683

1239

1719

2188

3

0

0

2

11

38

100

171

413

703

1094

2

0

0

0

0

1

4

12

24

79

164

312

1

8

0

0

0

0

0

0

1

2

7

17

39

0

0

9135

6302

3874

2316

1342

751

404

260

101

46

20

9

1

830

2985

3874

3679

3020

2253

1556

1171

605

339

176

8

2

34

629

1722

2597

3020

3003

2668

2341

1612

1110

703

7

3

1

77

446

1069

1762

2336

2668

2731

2508

2119

1641

6

4

0

6

74

283

661

1168

1715

2048

2508

2600

2461

5

5

0

0

8

50

165

389

735

1024

1672

2128

2461

4

6

0

0

1

6

28

87

210

341

743

1160

1641

3

7

0

0

0

0

3

12

39

73

212

407

703

2

9


TABLE F1 Binomial Distribution—cont’d p N

k

0.01

0.05

0.10

0.15

0.20

0.25

0.30

8

0

0

0

0

0

1

4

9

0

0

0

0

0

0

10 0

9044

5987

3487

1969

1074

1

914

3151

3874

3474

2

42

746

1937

3

1

105

4

0

5

1/3

0.40

0.45

0.50

9

35

83

176

1

0

1

3

8

20

0

563

282

173

60

25

10

10

2684

1877

1211

867

403

207

98

9

2759

3020

2816

2335

1951

1209

763

439

5

574

1298

2013

2503

2668

2601

2150

1665

1172

7

10

112

401

881

1460

2001

2276

2508

2384

2051

6

0

1

15

85

264

584

1029

1366

2007

2340

2461

5

6

0

0

1

12

55

162

368

569

1115

1596

2051

4

7

0

0

0

1

8

31

90

163

425

746

1172

3

8

0

0

0

0

1

4

14

30

106

229

439

2

9

0

0

0

0

0

0

1

3

16

42

98

1

10

0

0

0

0

0

0

0

0

1

3

10

0

15 0

8601

4633

2059

874

352

134

47

23

5

1

0

15

1

1303

3658

3432

2312

1319

668

305

171

47

16

5

14

2

92

1348

2669

2856

2309

1559

916

599

219

90

32

13

3

4

307

1285

2184

2501

2252

1700

1299

634

318

139

12

4

0

49

428

1156

1876

2252

2186

1948

1268

780

417

11

10

15

5

0

6

105

449

1032

1651

2061

2143

1859

1404

916

10

6

0

0

19

132

430

917

1472

1786

2066

1914

1527

9

7

0

0

3

30

138

393

811

1148

1771

2013

1964

8

8

0

0

0

5

35

131

348

574

1181

1647

1964

7

9

0

0

0

1

7

34

116

223

612

1048

1527

6

10

0

0

0

0

1

7

30

67

245

515

916

5

11

0

0

0

0

0

1

6

15

74

191

417

4

12

0

0

0

0

0

0

1

3

16

52

139

3

13

0

0

0

0

0

0

0

0

3

10

32

2

14

0

0

0

0

0

0

0

0

0

1

5

1

15

0

0

0

0

0

0

0

0

0

0

0

0

0.99

0.95

0.90

0.85

0.80

0.75

0.70

0.60

0.55

0.50

k

N

20

20

2/3

p

p N

k

0.01

0.05

0.10

0.15

0.20

0.25

0.30

20 0

8179

3585

1216

388

115

32

8

1/3 3

0.40

0.45

0.50

0

0

0

Continued


TABLE F1 Binomial Distribution—cont’d p N

k

0.01

0.05

0.10

0.15

0.20

0.25

0.30

1

1652

3774

2702

1368

576

211

68

2

159

1887

2852

2293

1369

669

3

10

596

1901

2428

2054

4

0

133

898

1821

5

0

22

319

6

0

3

7

0

8

1/3

0.40

0.45

0.50

30

5

1

0

19

278

143

31

8

2

18

1339

716

429

123

40

11

17

2182

1897

1304

911

350

139

46

16

1028

1746

2023

1789

1457

746

365

148

15

89

454

1091

1686

1916

1821

1244

746

370

14

0

20

160

545

1124

1643

1821

1659

1221

739

13

0

0

4

46

222

609

1144

1480

1797

1623

1201

12

9

0

0

1

11

74

271

654

987

1597

1771

1602

11

10

0

0

0

2

20

99

308

543

1171

1593

1762

10

11

0

0

0

0

5

30

120

247

710

1185

1602

9

12

0

0

0

0

1

8

39

92

355

727

1201

8

13

0

0

0

0

0

2

10

28

146

366

739

7

14

0

0

0

0

0

0

2

7

49

150

370

6

15

0

0

0

0

0

0

0

1

13

49

148

5

16

0

0

0

0

0

0

0

0

3

13

46

4

17

0

0

0

0

0

0

0

0

0

2

11

3

18

0

0

0

0

0

0

0

0

0

0

2

2

19

0

0

0

0

0

0

0

0

0

0

0

1

20

0

0

0

0

0

0

0

0

0

0

0

0

25 0

7778

2774

718

172

38

8

1

0

0

0

0

25

1

1964

3650

1994

759

236

63

14

5

0

0

0

24

2

238

2305

2659

1607

708

251

74

30

4

1

0

23

3

18

930

2265

2174

1358

641

243

114

19

4

1

22

4

1

269

1384

2110

1867

1175

572

313

71

18

4

21

5

0

60

646

1564

1960

1645

1030

658

199

63

16

20

6

0

10

239

920

1633

1828

1472

1096

442

172

53

19

7

0

1

72

441

1108

1654

1712

1487

800

381

143

18

8

0

0

18

175

623

1241

1651

1673

1200

701

322

17

9

0

0

4

58

294

781

1336

1580

1511

1084

609

16

10

0

0

1

16

118

417

916

1264

1612

1419

974

15

11

0

0

0

4

40

189

536

862

1465

1583

1328

14

12

0

0

0

1

12

74

268

503

1140

1511

1550

13

25


TABLE F1 Binomial Distribution—cont’d p N

k

0.01

0.05

0.10

0.15

0.20

0.25

0.30

13

0

0

0

0

3

25

115

14

0

0

0

0

1

7

15

0

0

0

0

0

16

0

0

0

0

17

0

0

0

18

0

0

19

0

20

1/3

0.40

0.45

0.50

251

760

1236

1550

12

42

108

434

867

1328

11

2

13

40

212

520

974

10

0

0

4

12

88

266

609

9

0

0

0

1

3

31

115

322

8

0

0

0

0

0

1

9

42

143

7

0

0

0

0

0

0

0

2

13

53

6

0

0

0

0

0

0

0

0

0

3

16

5

21

0

0

0

0

0

0

0

0

0

1

4

4

22

0

0

0

0

0

0

0

0

0

0

1

3

23

0

0

0

0

0

0

0

0

0

0

0

2

24

0

0

0

0

0

0

0

0

0

0

0

1

25

0

0

0

0

0

0

0

0

0

0

0

0

0.99

0.95

0.90

0.85

0.80

0.75

0.70

0.60

0.55

0.50

k

N

30

2/3

p

p N

k

0.01

0.05

0.10

0.15

0.20

0.25

0.30

30 0

7397

2146

424

76

12

2

0

1

2242

3389

1413

404

93

18

2

328

2586

2277

1034

337

3

31

1270

2361

1703

4

2

451

1771

5

0

124

6

0

7

1/3

0.40

0.45

0.50

0

0

0

0

30

3

1

0

0

0

29

86

18

6

0

0

0

28

785

269

72

26

3

0

0

27

2028

1325

604

208

89

12

2

0

26

1023

1861

1723

1047

464

232

41

8

1

25

27

474

1368

1795

1455

829

484

115

29

6

24

0

5

180

828

1538

1662

1219

829

263

81

19

23

8

0

1

58

420

1106

1593

1501

1192

505

191

55

22

9

0

0

16

181

676

1298

1573

1457

823

382

133

21

10

0

0

4

67

355

909

1416

1530

1152

656

280

20

11

0

0

1

22

161

551

1103

1391

1396

976

509

19

12

0

0

0

6

64

291

749

1101

1474

1265

805

18

13

0

0

0

1

22

134

444

762

1360

1433

1115

17

14

0

0

0

0

7

54

231

436

1101

1424

1354

16

15

0

0

0

0

2

19

106

247

783

1242

1445

15

16

0

0

0

0

0

6

42

116

489

953

1354

14

17

0

0

0

0

0

2

15

48

269

642

1115

13

18

0

0

0

0

0

0

5

17

129

379

805

12 Continued


TABLE F1 Binomial Distribution—cont’d p N

k

0.01

0.05

0.10

0.15

0.20

0.25

0.30

19

0

0

0

0

0

0

1

20

0

0

0

0

0

0

21

0

0

0

0

0

22

0

0

0

0

23

0

0

0

24

0

0

25

0

26.

0.40

0.45

0.50

5

54

196

509

11

0

1

20

88

280

10

0

0

0

6

34

133

9

0

0

0

0

1

12

55

8

0

0

0

0

0

0

3

19

7

0

0

0

0

0

0

0

1

6

6

0

0

0

0

0

0

0

0

0

1

5

0

0

0

0

0

0

0

0

0

0

0

4

27

0

0

0

0

0

0

0

0

0

0

0

3

28

0

0

0

0

0

0

0

0

0

0

0

2

29

0

0

0

0

0

0

0

0

0

0

0

1

30

0

0

0

0

0

0

0

0

0

0

0

0

0.99

0.95

0.90

0.85

0.80

0.75

0.70

0.60

0.55

0.50

k

The decimal point was omitted. All entries should be read as .nnn. For p ≤ .5, use the upper line for p and the left column for k. For p > .5, use the bottom line for p and the right column for k.

1/3

2/3

N

TABLE F2 Binomial Distribution: P(Y ≤ k) = sum over i = 0, ..., k of C(N, i) · p^i · (1 - p)^(N - i)

k

N

0

1

2

3

4

4

062

312

688

938

1.0

5

031

188

500

812

969

1.0

6

016

109

344

656

891

984

1.0

7

008

062

227

500

773

938

992

1.0

8

004

035

145

363

637

855

965

996

1.0

9

002

020

090

254

500

746

910

980

998

1.0

10

001

011

055

172

377

623

828

945

989

999

1.0

11

006

033

113

274

500

726

887

967

994

999 +

1.0

12

003

019

073

194

387

613

806

927

981

997

999 +

1.0

13

002

011

046

133

291

500

709

867

954

989

998

999+

1.0

14

001

006

029

090

212

395

605

788

910

971

994

999+

999+

1.0

15

004

018

059

151

304

500

696

849

941

982

996

999+

999+

1.0

16

002

011

038

105

227

402

598

773

895

962

989

998

999+

999 +

1.0

17

001

006

025

072

166

315

500

685

834

928

975

994

999

999 +

999 +

1.0

18

001

004

015

048

119

240

407

593

760

881

952

985

996

999

999 +

999+

19

002

010

032

084

180

324

500

676

820

916

968

990

998

999 +

999+

20

001

006

021

058

132

252

412

588

748

868

942

979

994

999

999+

21

001

004

013

039

095

192

332

500

668

808

905

961

987

996

999

22

002

008

026

067

143

262

416

584

738

857

933

974

992

998

23

001

005

017

047

105

202

339

500

661

798

895

953

983

995

24

001

003

011

032

076

154

271

419

581

729

846

924

968

989

002

007

022

054

115

212

345

500

655

788

885

946

6

7

8

9

10

11

12

13

14

15

16

17

978


25

5

Continued


TABLE F2 Binomial Distribution—cont’d k N

0

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

26

001

005

014

038

084

163

279

423

577

721

837

916

962

27

001

003

010

026

061

124

221

351

500

649

779

876

939

28

002

006

018

044

092

172

286

425

575

714

828

908

29

001

004

012

031

068

132

229

356

500

644

771

868

30

001

003

008

021

049

100

181

292

428

572

708

819

31

002

005

015

035

075

141

237

360

500

640

763

32

001

004

010

025

055

108

189

298

430

570

702

33

001

002

007

018

040

081

148

243

364

500

636

34

001

005

012

029

061

115

196

304

432

568

35

001

003

008

020

045

088

155

250

368

500

Unilateral probabilities for the binomial test when p = q = 1/2. Note: Decimal points and values less than 0.0005 were omitted.
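Entries of Tables F1 and F2 can also be reproduced with scipy; a minimal sketch (not part of the original appendix):

```python
# Minimal sketch: binomial point and cumulative probabilities.
from scipy.stats import binom

print(round(binom.pmf(2, 5, 0.30), 4))   # P[Y = 2], N = 5, p = 0.30 -> 0.3087 (Table F1)
print(round(binom.cdf(2, 5, 0.50), 3))   # P[Y <= 2], N = 5, p = 0.50 -> 0.500 (Table F2)
```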


TABLE G Critical Values of Dc for the Kolmogorov-Smirnov Test, Considering P(Dcal > Dc) = α

Significance Level α

Sample Size (N)

0.20

0.15

0.10

0.05

0.01

1

0.900

0.925

0.950

0.975

0.995

2

0.684

0.726

0.776

0.842

0.929

3

0.565

0.597

0.642

0.708

0.828

4

0.494

0.525

0.564

0.624

0.733

5

0.446

0.474

0.510

0.565

0.669

6

0.410

0.436

0.470

0.521

0.618

7

0.381

0.405

0.438

0.486

0.577

8

0.358

0.381

0.411

0.457

0.543

9

0.339

0.360

0.388

0.432

0.514

10

0.322

0.342

0.368

0.410

0.490

11

0.307

0.326

0.352

0.391

0.468

12

0.295

0.313

0.338

0.375

0.450

13

0.284

0.302

0.325

0.361

0.433

14

0.274

0.292

0.314

0.349

0.418

15

0.266

0.283

0.304

0.338

0.404

16

0.258

0.274

0.295

0.328

0.392

17

0.250

0.266

0.286

0.318

0.381

18

0.244

0.259

0.278

0.309

0.371

19

0.237

0.252

0.272

0.301

0.363

20

0.231

0.246

0.264

0.294

0.356

25

0.21

0.22

0.24

0.27

0.32

30

0.19

0.20

0.22

0.24

0.29

35

0.18

0.19

0.21

0.23

0.27

Greater than 50

1.07/√N

1.14/√N

1.22/√N

1.36/√N

1.63/√N


TABLE H1 Critical Values of the Shapiro-Wilk Wc Statistic, Considering P(Wcal < Wc) = α

Significance Level α

Sample Size N

0.01

0.02

0.05

0.10

0.50

0.90

0.95

0.98

0.99

3

0.753

0.758

0.767

0.789

0.959

0.998

0.999

1.000

1.000

4

0.687

0.707

0.748

0.792

0.935

0.987

0.992

0.996

0.997

5

0.686

0.715

0.762

0.806

0.927

0.979

0.986

0.991

0.993

6

0.713

0.743

0.788

0.826

0.927

0.974

0.981

0.986

0.989

7

0.730

0.760

0.803

0.838

0.928

0.972

0.979

0.985

0.988

8

0.749

0.778

0.818

0.851

0.932

0.972

0.978

0.984

0.987

9

0.764

0.791

0.829

0.859

0.935

0.972

0.978

0.984

0.986

10

0.781

0.806

0.842

0.869

0.938

0.972

0.978

0.983

0.986

11

0.792

0.817

0.850

0.876

0.940

0.973

0.979

0.984

0.986

12

0.805

0.828

0.859

0.883

0.943

0.973

0.979

0.984

0.986

13

0.814

0.837

0.866

0.889

0.945

0.974

0.979

0.984

0.986

14

0.825

0.846

0.874

0.895

0.947

0.975

0.980

0.984

0.986

15

0.835

0.855

0.881

0.901

0.950

0.976

0.980

0.984

0.987

16

0.844

0.863

0.887

0.906

0.952

0.975

0.981

0.985

0.987

17

0.851

0.869

0.892

0.910

0.954

0.977

0.981

0.985

0.987

18

0.858

0.874

0.897

0.914

0.956

0.978

0.982

0.986

0.988

19

0.863

0.879

0.901

0.917

0.957

0.978

0.982

0.986

0.988

20

0.868

0.884

0.905

0.920

0.959

0.979

0.983

0.986

0.988

21

0.873

0.888

0.908

0.923

0.960

0.980

0.983

0.987

0.989

22

0.878

0.892

0.911

0.926

0.961

0.980

0.984

0.987

0.989

23

0.881

0.895

0.914

0.928

0.962

0.981

0.984

0.987

0.989

24

0.884

0.898

0.916

0.930

0.963

0.981

0.984

0.987

0.989

25

0.888

0.901

0.918

0.931

0.964

0.981

0.985

0.988

0.989

26

0.891

0.904

0.920

0.933

0.965

0.982

0.985

0.988

0.989

27

0.894

0.906

0.923

0.935

0.965

0.982

0.985

0.988

0.990

28

0.896

0.908

0.924

0.936

0.966

0.982

0.985

0.988

0.990

29

0.898

0.910

0.926

0.937

0.966

0.982

0.985

0.988

0.990

30

0.900

0.912

0.927

0.939

0.967

0.983

0.985

0.988

0.990


TABLE H2

Coefficients ai.n for the Shapiro-Wilk Normality Test

i/n

2

3

4

5

6

7

8

9

10

1

0.7071

0.7071

0.6872

0.6646

0.6431

0.6233

0.6052

0.5888

0.5739

0.0000

0.1677

0.2413

0.2806

0.3031

0.3164

0.3244

0.3291

0.0000

0.0875

0.1401

0.1743

0.1976

0.2141

0.0000

0.0561

0.0947

0.1224

0.0000

0.0399

2 3 4 5

i/n

11

12

13

14

15

16

17

18

19

20

1

0.5601

0.5475

0.5359

0.5251

0.5150

0.5056

0.4968

0.4886

0.4808

0.4734

2

0.3315

0.3325

0.3325

0.3318

0.3306

0.3290

0.3273

0.3253

0.3232

0.3211

3

0.2260

0.2347

0.2412

0.2460

0.2495

0.2521

0.2540

0.2553

0.2561

0.2565

4

0.1429

0.1586

0.1707

0.1802

0.1878

0.1939

0.1988

0.2027

0.2059

0.2085

5

0.0695

0.0922

0.1099

0.1240

0.1353

0.1447

0.1524

0.1587

0.1641

0.1686

6

0.0000

0.0303

0.0539

0.0727

0.0880

0.1005

0.1109

0.1197

0.1271

0.1334

0.0000

0.0240

0.0433

0.0593

0.0725

0.0837

0.0932

0.1013

0.0000

0.0196

0.0359

0.0496

0.0612

0.0711

0.0000

0.0163

0.0303

0.0422

0.0000

0.0140

7 8 9 10

i/n

21

22

23

24

25

26

27

28

29

30

1

0.4643

0.4590

0.4542

0.4493

0.4450

0.4407

0.4366

0.4328

0.4291

0.4254

2

0.3185

0.3156

0.3126

0.3098

0.3069

0.3043

0.3018

0.2992

0.2968

0.2944

3

0.2578

0.2571

0.2563

0.2554

0.2543

0.2533

0.2522

0.2510

0.2499

0.2487

4

0.2119

0.2131

0.2139

0.2145

0.2148

0.2151

0.2152

0.2151

0.2150

0.2148

5

0.1736

0.1764

0.1787

0.1807

0.1822

0.1836

0.1848

0.1857

0.1864

0.1870

6

0.1399

0.1443

0.1480

0.1512

0.1539

0.1563

0.1584

0.1601

0.1616

0.1630

7

0.1092

0.1150

0.1201

0.1245

0.1283

0.1316

0.1346

0.1372

0.1395

0.1415

8

0.0804

0.0878

0.0941

0.0997

0.1046

0.1089

0.1128

0.1162

0.1192

0.1219

9

0.0530

0.0618

0.0696

0.0764

0.0823

0.0876

0.0923

0.0965

0.1002

0.1036

10

0.0263

0.0368

0.0459

0.0539

0.0610

0.0672

0.0728

0.0778

0.0822

0.0862

11

0.0000

0.0122

0.0228

0.0321

0.0403

0.0476

0.0540

0.0598

0.0650

0.0697

0.0000

0.0107

0.0200

0.0284

0.0358

0.0424

0.0483

0.0537

0.0000

0.0094

0.0178

0.0253

0.0320

0.0381

0.0000

0.0084

0.0159

0.0227

0.0000

0.0076

12 13 14 15


TABLE I Wilcoxon Test: P(Sp > Sc) = α

Sc

3

3

0.6250

4

0.3750

5

0.2500

0.5625

6

0.1250

0.4375

4

5

7

0.3125

8

0.1875

0.5000

9

0.1250

0.4063

10

0.0625

0.3125

6

7

8

9

11

0.2188

0.5000

12

0.1563

0.4219

13

0.0938

0.3438

14

0.0625

0.2813

0.5313

15

0.0313

0.2188

0.4688

16

0.1563

0.4063

17

0.1094

0.3438

18

0.0781

0.2891

0.5273

19

0.0469

0.2344

0.4727

20

0.0313

0.1875

0.4219

21

0.0156

0.1484

0.3711

22

0.1094

0.3203

23

0.0781

0.2734

0.5000

24

0.0547

0.2305

0.4551

25

0.0391

0.1914

0.4102

26

0.0234

0.1563

0.3672

27

0.0156

0.1250

0.3262

28

0.0078

0.0977

0.2852

10

0.5000

11

12

13

14

15


N

0.2480

0.4609

30

0.0547

0.2129

0.4229

31

0.0391

0.1797

0.3848

32

0.0273

0.1504

0.3477

33

0.0195

0.1250

0.3125

0.5171

34

0.0117

0.1016

0.2783

0.4829

35

0.0078

0.0820

0.2461

0.4492

36

0.0039

0.0645

0.2158

0.4155

37

0.0488

0.1875

0.3823

38

0.0371

0.1611

0.3501

39

0.0273

0.1377

0.3188

0.5151

40

0.0195

0.1162

0.2886

0.4849

41

0.0137

0.0967

0.2598

0.4548

42

0.0098

0.0801

0.2324

0.4250

43

0.0059

0.0654

0.2065

0.3955

44

0.0039

0.0527

0.1826

0.3667

45

0.0020

0.0420

0.1602

0.3386

46

0.0322

0.1392

0.3110

0.5000

47

0.0244

0.1201

0.2847

0.4730

48

0.0186

0.1030

0.2593

0.4463

49

0.0137

0.0874

0.2349

0.4197

50

0.0098

0.0737

0.2119

0.3934

51

0.0068

0.0615

0.1902

0.3677

52

0.0049

0.0508

0.1697

0.3424

53

0.0029

0.0415

0.1506

0.3177

0.5000

54

0.0020

0.0337

0.1331

0.2939

0.4758

55

0.0010

0.0269

0.1167

0.2709

0.4516

56

0.0210

0.1018

0.2487

0.4276

57

0.0161

0.0881

0.2274

0.4039

58

0.0122

0.0757

0.2072

0.3804 Continued


0.0742


29


TABLE I Wilcoxon Test—cont’d N 11

12

13

14

59

0.0093

0.0647

0.1879

0.3574

60

0.0068

0.0549

0.1698

0.3349

0.5110

61

0.0049

0.0461

0.1527

0.3129

0.4890

62

0.0034

0.0386

0.1367

0.2915

0.4670

63

0.0024

0.0320

0.1219

0.2708

0.4452

64

0.0015

0.0261

0.1082

0.2508

0.4235

65

0.0010

0.0212

0.0955

0.2316

0.4020

66

0.0005

0.0171

0.0839

0.2131

0.3808

67

0.0134

0.0732

0.1955

0.3599

68

0.0105

0.0636

0.1788

0.3394

69

0.0081

0.0549

0.1629

0.3193

70

0.0061

0.0471

0.1479

0.2997

71

0.0046

0.0402

0.1338

0.2807

72

0.0034

0.0341

0.1206

0.2622

73

0.0024

0.0287

0.1083

0.2444

74

0.0017

0.0239

0.0969

0.2271

75

0.0012

0.0199

0.0863

0.2106

76

0.0007

0.0164

0.0765

0.1947

77

0.0005

0.0133

0.0676

0.1796

78

0.0002

0.0107

0.0594

0.1651

79

0.0085

0.0520

0.1514

80

0.0067

0.0453

0.1384

81

0.0052

0.0392

0.1262

3

4

5

6

7

8

9

10

15


Sc

82

0.0040

0.0338

0.1147

83

0.0031

0.0290

0.1039

84

0.0023

0.0247

0.0938

85

0.0017

0.0209

0.0844

86

0.0012

0.0176

0.0757

87

0.0009

0.0148

0.0677

88

0.0006

0.0123

0.0603

89

0.0004

0.0101

0.0535

90

0.0002

0.0083

0.0473

91

0.0001

0.0067

0.0416

92

0.0054

0.0365

93

0.0043

0.0319

94

0.0034

0.0277

95

0.0026

0.0240

96

0.0020

0.0206

97

0.0015

0.0177

98

0.0012

0.0151

99

0.0009

0.0128

100

0.0006

0.0108

101

0.0004

0.0090

102

0.0003

0.0075

103

0.0002

0.0062

104

0.0001

0.0051 0.0042

106

0.0034

107

0.0027

108

0.0021

109

0.0017


105

Continued


TABLE I Wilcoxon Test—cont’d N Sc

3

4

5

6

7

8

9

10

11

12

13

14

15

110

0.0013

111

0.0010

112

0.0008

113

0.0006

114

0.0004

115

0.0003

116

0.0002

117

0.0002

118

0.0001

119

0.0001

120

0.0000

Right-tailed unilateral probabilities for the Wilcoxon test.

TABLE J Critical Values of Uc for the Mann-Whitney U Test, Considering P(Ucal < Uc) = α

P(Ucal < Uc) = 0.05

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

0

0

1

2

2

3

4

4

5

5

6

7

7

8

9

9

10

11

4

0

1

2

3

4

5

6

7

8

9

10

11

12

14

15

16

17

18

5

1

2

4

5

6

8

9

11

12

13

15

16

18

19

20

22

23

25

6

2

3

5

7

8

10

12

14

16

17

19

21

23

25

26

28

30

32

7

2

4

6

8

11

13

15

17

19

21

24

26

28

30

33

35

37

39

8

3

5

8

10

13

15

18

20

23

26

28

31

33

36

39

41

44

47

9

4

6

9

12

15

18

21

24

27

30

33

36

39

42

45

48

51

54

10

4

7

11

14

17

20

24

27

31

34

37

41

44

48

51

55

58

62

11

5

8

12

16

19

23

27

31

34

38

42

46

50

54

57

61

65

69

12

5

9

13

17

21

26

30

34

38

42

47

51

55

60

64

68

72

77

13

6

10

15

19

24

28

33

37

42

47

51

56

61

65

70

75

80

84

14

7

11

16

21

26

31

36

41

46

51

56

61

66

71

77

82

87

92

15

7

12

18

23

28

33

39

44

50

55

61

66

72

77

83

88

94

100

16

8

14

19

25

30

36

42

48

54

60

65

71

77

83

89

95

101

107

17

9

15

20

26

33

39

45

51

57

64

70

77

83

89

96

102

109

115

18

9

16

22

28

35

41

48

55

61

68

75

82

88

95

102

109

116

123

19

10

17

23

30

37

44

51

58

65

72

80

87

94

101

109

116

123

130

20

11

18

25

32

39

47

54

62

69

77

84

92

100

107

115

123

130

138


3


P(Ucal < Uc) = 0.025

3

3



0

0

1

1

2

4



0

1

2

3

5

0

1

2

3

5

6

1

2

3

5

7

1

3

5

6

8

2

4

6

9

2

4

10

3

11

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

2

3

3

4

4

5

5

6

6

7

7

8

4

4

5

6

7

8

9

10

11

11

12

13

14

6

7

8

9

11

12

13

14

15

17

18

19

20

6

8

10

11

13

14

16

17

19

21

22

24

25

27

8

10

12

14

16

18

20

22

24

26

28

30

32

34

8

10

13

15

17

19

22

24

26

29

31

34

36

38

41

7

10

12

15

17

20

23

26

28

31

34

37

39

42

45

48

5

8

11

14

17

20

23

26

29

33

36

39

42

45

48

52

55

3

6

9

13

16

19

23

26

30

33

37

40

44

47

51

55

58

62

12

4

7

11

14

18

22

26

29

33

37

41

45

49

53

57

61

65

69

13

4

8

12

16

20

24

28

33

37

41

45

50

54

59

63

67

72

76

14

5

9

13

17

22

26

31

36

40

45

50

55

59

64

67

74

78

83

15

5

10

14

19

24

29

34

39

44

49

54

59

64

70

75

80

85

90

16

6

11

15

21

26

31

37

42

47

53

59

64

70

75

81

86

92

98

17

6

11

17

22

28

34

39

45

51

57

63

67

75

81

87

93

99

105

18

7

12

18

24

30

36

42

48

55

61

67

74

80

86

93

99

103

112

19

7

13

19

25

32

38

45

52

58

65

72

78

85

92

99

106

113

119

20

8

14

20

27

34

41

48

55

62

69

76

83

90

98

105

112

119

127

P(Ucal < Uc) = 0.01

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

3



0

0

0

0

0

1

1

1

2

2

2

3

3

4

4

4

5

4





0

1

1

2

3

3

4

5

5

6

7

7

8

9

9

10

5



0

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

6



1

2

3

4

6

7

8

9

11

12

13

15

16

18

19

20

22

7

0

1

3

4

6

7

9

11

12

14

16

17

19

21

23

24

26

28

8

0

2

4

6

7

9

11

13

15

17

20

22

24

26

28

30

32

34

9

1

3

5

7

9

11

14

16

18

21

23

26

28

31

33

36

38

40

10

1

3

6

8

11

13

16

19

22

24

27

30

33

36

38

41

44

47

11

1

4

7

9

12

15

18

22

25

29

31

34

37

41

44

47

50

53

12

2

5

8

11

14

17

21

24

28

31

35

38

42

46

49

53

56

60

13

2

5

9

12

16

20

23

27

31

35

39

43

47

51

55

59

63

67

14

2

6

10

13

17

22

26

30

34

38

43

47

51

56

60

65

69

73

15

3

7

11

15

19

24

28

33

37

42

47

51

56

61

66

70

75

80

16

3

7

12

16

21

26

31

36

41

46

51

56

61

66

71

76

82

87

17

4

8

13

18

23

28

33

38

44

49

55

60

66

71

77

82

88

93

18

4

9

14

19

24

30

36

41

47

53

59

65

70

76

82

88

94

100

19

4

9

15

20

26

32

38

44

50

56

63

69

75

82

88

94

101

107

20

5

10

16

22

28

34

40

47

53

60

67

73

80

87

93

100

107

114

Appendices 1187

1188 Appendices

P(Ucal < Uc ) 5 0.005 N2\ N1

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

3



0

0

0

0

0

0

0

0

1

1

1

2

2

2

2

3

3

4





0

0

0

1

1

2

2

3

3

4

5

5

6

6

7

8

5





0

1

1

2

3

4

5

6

7

7

8

9

10

11

12

13

6



0

1

2

3

4

5

6

7

9

10

11

12

13

15

16

17

18

7



0

1

3

4

6

7

9

10

12

13

15

16

18

19

21

22

24

8



1

2

4

6

7

9

11

13

15

17

18

20

22

24

26

28

30

9

0

1

3

5

7

9

11

13

16

18

20

22

24

27

29

31

33

36

10

0

2

4

6

9

11

13

16

18

21

24

26

29

31

34

37

39

42

11

0

2

5

7

10

13

16

18

21

24

27

30

33

36

39

42

45

48

12

1

3

6

9

12

15

18

21

24

27

31

34

37

41

44

47

51

54

13

1

3

7

10

13

17

20

24

27

31

34

38

42

45

49

53

56

60

14

1

4

7

11

15

18

22

26

30

34

38

42

46

50

54

58

63

67

15

2

5

8

12

16

20

24

29

33

37

42

46

51

55

60

64

69

73

16

2

5

9

13

18

22

27

31

36

41

45

50

55

60

65

70

74

79

17

2

6

10

15

19

24

29

34

39

44

49

54

60

65

70

75

81

86

18

2

6

11

16

21

26

31

37

42

47

53

58

64

70

75

81

87

92

19

3

7

12

17

22

28

33

39

45

51

56

63

69

74

81

87

93

99

20

3

8

13

18

24

30

36

42

48

54

60

67

73

79

86

92

99

105

Appendices
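For small N1 and N2 these critical values can be reproduced by enumerating the exact null distribution of U. A minimal sketch, assuming the usual convention "largest u with P(U ≤ u) ≤ α"; depending on how strictly the inequality in the table heading is read, a tabulated entry may differ by one unit:

```python
from itertools import combinations
from math import comb

def mann_whitney_critical(n1, n2, alpha):
    """Largest u such that P(U <= u) <= alpha under H0, where U is the
    Mann-Whitney statistic of sample 1 and all C(n1+n2, n1) rank splits
    are equally likely. Returns None when no such u exists."""
    N = n1 + n2
    counts = {}
    for ranks in combinations(range(1, N + 1), n1):    # ranks assigned to sample 1
        u = sum(ranks) - n1 * (n1 + 1) // 2            # U1 = R1 - n1(n1+1)/2
        counts[u] = counts.get(u, 0) + 1
    total = comb(N, n1)
    cum, uc = 0, None
    for u in sorted(counts):
        cum += counts[u]
        if cum / total <= alpha:
            uc = u
        else:
            break
    return uc

# e.g. N1 = N2 = 10 at alpha = 0.05 gives 27, as in the first panel of Table J
print(mann_whitney_critical(10, 10, 0.05))
```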

TABLE K  Critical Values for Friedman's Test Considering P(Fcal > Fc) = α

 k     N     α ≤ 0.10    α ≤ 0.05    α ≤ 0.01
 3     3       6.00        6.00          —
 3     4       6.00        6.50        8.00
 3     5       5.20        6.40        8.40
 3     6       5.33        7.00        9.00
 3     7       5.43        7.14        8.86
 3     8       5.25        6.25        9.00
 3     9       5.56        6.22        8.67
 3    10       5.00        6.20        9.60
 3    11       4.91        6.54        8.91
 3    12       5.17        6.17        8.67
 3    13       4.77        6.00        9.39
 3     ∞       4.61        5.99        9.21
 4     2       6.00        6.00          —
 4     3       6.60        7.40        8.60
 4     4       6.30        7.80        9.60
 4     5       6.36        7.80        9.96
 4     6       6.40        7.60       10.00
 4     7       6.26        7.80       10.37
 4     8       6.30        7.50       10.35
 4     ∞       6.25        7.82       11.34
 5     3       7.47        8.53       10.13
 5     4       7.60        8.80       11.00
 5     5       7.68        8.96       11.52
 5     ∞       7.78        9.49       13.28
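The ∞ rows of Table K are the chi-square quantiles with k − 1 degrees of freedom, the usual large-N approximation for the Friedman statistic. A brief check with scipy (the friedmanchisquare call appears only as a usage comment, since it needs observed data):

```python
from scipy.stats import chi2, friedmanchisquare

# Large-N critical values: chi-square quantiles with k - 1 degrees of freedom
for k in (3, 4, 5):
    print(k, [round(chi2.ppf(1 - a, k - 1), 2) for a in (0.10, 0.05, 0.01)])
# k = 3 -> [4.61, 5.99, 9.21]; k = 4 and k = 5 agree with the table to within rounding

# With observed data (each argument = one treatment measured over the N blocks):
# stat, p = friedmanchisquare(treatment1, treatment2, treatment3)
```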

TABLE L  Critical Values for the Kruskal-Wallis Test Considering P(Hcal > Hc) = α

Exact critical values for three samples with sizes n1, n2, n3 up to 5, at α = 0.10, 0.05, 0.01, 0.005, and 0.001. For large samples the critical values are 4.61, 5.99, 9.21, 10.60, and 13.82, respectively (the chi-square quantiles with k − 1 = 2 degrees of freedom).
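The large-sample row can be checked directly against the chi-square distribution, and scipy's kruskal already applies that approximation when computing a p-value. A short sketch:

```python
from scipy.stats import chi2, kruskal

# Large-sample critical values of H: chi-square quantiles with k - 1 = 2 df
print([round(chi2.ppf(1 - a, 2), 2) for a in (0.10, 0.05, 0.01, 0.005, 0.001)])
# -> [4.61, 5.99, 9.21, 10.6, 13.82]

# With observed data (three samples), scipy reports H and a chi-square p-value;
# for n1, n2, n3 <= 5 the exact critical values of Table L should be used instead:
# stat, p = kruskal(sample1, sample2, sample3)
```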

TABLE M  Critical Values of Cochran's C Statistic Considering P(Ccal > Cc) = α

Two panels, for α = 5% and α = 1%. Columns give the number of variances k = 2 to 10, 12, 15, 20, 24, 30, 40, 60, and 120; rows give n = 1 to 10, 16, 36, 144, and ∞. In the last row the critical value equals 1/k.
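Cochran's C is the largest of the k sample variances divided by their sum. The tabulated critical values can be reproduced from the F distribution, taking the row index n as the degrees of freedom of each variance estimate (an assumption, but one consistent with the tabulated entries); a minimal sketch:

```python
import numpy as np
from scipy.stats import f

def cochran_c(variances):
    """Cochran's C: largest sample variance divided by the sum of the k variances."""
    v = np.asarray(variances, dtype=float)
    return v.max() / v.sum()

def cochran_c_critical(alpha, df, k):
    """Upper critical value of C from the F distribution, with df = degrees of
    freedom of each of the k variance estimates."""
    f_crit = f.ppf(1 - alpha / k, df, (k - 1) * df)
    return 1.0 / (1.0 + (k - 1) / f_crit)

# Reproduces, e.g., the alpha = 5%, n = 2, k = 4 entry of Table M (0.7679)
print(round(cochran_c_critical(0.05, 2, 4), 4))
```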

TABLE N  Critical Values of Hartley's Fmax Statistic Considering P(Fmax.cal > Fmax.c) = α

Two panels, for α = 5% and α = 1%. Columns give the number of groups k = 2 to 12; rows give n = 2 to 10, 12, 15, 20, 30, 60, and ∞ (in the last row all critical values equal 1).
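Hartley's statistic itself is straightforward to compute: the ratio of the largest to the smallest of the k sample variances, compared against the tabulated critical value. A minimal sketch (the simulated data are illustrative only):

```python
import numpy as np

def hartley_fmax(*samples):
    """Hartley's Fmax: largest sample variance divided by the smallest,
    across k groups of equal size."""
    variances = [np.var(s, ddof=1) for s in samples]
    return max(variances) / min(variances)

# Three groups of size 10; homogeneity of variances is rejected at level alpha
# if the computed Fmax exceeds the critical value in Table N for k = 3.
rng = np.random.default_rng(0)
groups = [rng.normal(loc=0.0, scale=1.0, size=10) for _ in range(3)]
print(hartley_fmax(*groups))
```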

TABLE O  Control Chart Constants
(X̄ and R charts: d2, d3, A2, D3, D4; X̄ and S charts: c4, A3, B3, B4)

 n     d2      d3      c4      A2      D3      D4      A3      B3      B4
 2    1.128   0.853   0.798   1.880     —     3.267   2.659     —     3.267
 3    1.693   0.888   0.886   1.023     —     2.574   1.954     —     2.568
 4    2.059   0.880   0.921   0.729     —     2.282   1.628     —     2.266
 5    2.326   0.864   0.940   0.577     —     2.114   1.427     —     2.089
 6    2.534   0.848   0.952   0.483     —     2.004   1.287   0.030   1.970
 7    2.704   0.833   0.959   0.419   0.076   1.924   1.182   0.118   1.882
 8    2.847   0.820   0.965   0.373   0.136   1.864   1.099   0.185   1.815
 9    2.970   0.808   0.969   0.337   0.184   1.816   1.032   0.239   1.761
10    3.078   0.797   0.973   0.308   0.223   1.777   0.975   0.284   1.716
11    3.173   0.787   0.975   0.285   0.256   1.744   0.927   0.321   1.679
12    3.258   0.779   0.978   0.266   0.283   1.717   0.886   0.354   1.646
13    3.336   0.770   0.979   0.249   0.307   1.693   0.850   0.382   1.618
14    3.407   0.763   0.981   0.235   0.328   1.672   0.817   0.406   1.594
15    3.472   0.756   0.982   0.223   0.347   1.653   0.789   0.428   1.572
16    3.532   0.750   0.984   0.212   0.363   1.637   0.763   0.448   1.552
17    3.588   0.744   0.985   0.203   0.378   1.622   0.739   0.466   1.534
18    3.640   0.739   0.985   0.194   0.391   1.607   0.718   0.482   1.518
19    3.689   0.734   0.986   0.187   0.403   1.597   0.698   0.497   1.503
20    3.735   0.729   0.987   0.180   0.415   1.585   0.680   0.510   1.490
21    3.778   0.724   0.988   0.173   0.425   1.575   0.663   0.523   1.477
22    3.819   0.720   0.988   0.167   0.434   1.566   0.647   0.534   1.466
23    3.858   0.716   0.989   0.162   0.443   1.557   0.633   0.545   1.455
24    3.895   0.712   0.989   0.157   0.451   1.548   0.619   0.555   1.445
25    3.931   0.708   0.990   0.153   0.459   1.541   0.606   0.565   1.435

For n > 25:
A = 3/√n        A3 = 3/(c4·√n)        c4 ≈ 4(n − 1)/(4n − 3)
B3 = 1 − 3/(c4·√(2(n − 1)))        B4 = 1 + 3/(c4·√(2(n − 1)))
B5 = c4 − 3/√(2(n − 1))        B6 = c4 + 3/√(2(n − 1))
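The approximations at the foot of Table O translate directly into code; a minimal sketch that also notes how A3 enters the X̄-chart limits (variable names are illustrative):

```python
import math

def s_chart_constants(n):
    """Approximate constants for X-bar/S charts when n > 25 (Table O formulas)."""
    c4 = 4 * (n - 1) / (4 * n - 3)
    a3 = 3 / (c4 * math.sqrt(n))
    b3 = 1 - 3 / (c4 * math.sqrt(2 * (n - 1)))
    b4 = 1 + 3 / (c4 * math.sqrt(2 * (n - 1)))
    return c4, a3, b3, b4

# X-bar chart limits from the grand mean and the average subgroup standard deviation:
# UCL = xbarbar + A3 * sbar,  CL = xbarbar,  LCL = xbarbar - A3 * sbar
c4, a3, b3, b4 = s_chart_constants(30)
print(round(c4, 4), round(a3, 4), round(b3, 4), round(b4, 4))
```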

Testing against general autoregressive and moving average error models when the regressors include lagged dependent variables. Econometrica 46 (6), 1293–1301. Goldbarg, M.C., Luna, H.P.L., 2005. Otimizac¸a˜o combinato´ria e programac¸a˜o linear, second ed. Campus Elsevier, Rio de Janeiro. Goldberger, A.S., 1962. Best linear unbiased prediction in the generalized linear regression model. J. Am. Stat. Assoc. 57 (298), 369–375. Goldstein, H., 2011. Multilevel Statistical Models, fourth ed. John Wiley & Sons, Chichester. Gomes Jr., A.C., Souza, M.J.F., 2004. Softwares de otimizac¸a˜o: manual de refer^encia. Departamento de Computac¸a˜o, Universidade Federal de Ouro Preto. Gomory, R.E., 1958. Outline of an algorithm for integer solutions to linear programs. Bull. Am. Math. Soc. 64 (5), 275–278. Gonc¸alez, P.U., Werner, L., 2009. Comparac¸a˜o dos ´ındices de capacidade do processo para distribuic¸o˜es na˜o normais. Gesta˜o & Produc¸a˜o 16 (1), 121–132. Gordon, A.D., 1987. A review of hierarchical classification. J. Roy. Stat. Soc. Ser. A 150 (2), 119–137. Gorsuch, R.L., 1990. Common factor analysis versus component analysis: some well and little known facts. Multivar. Behav. Res. 25 (1), 33–39. Gorsuch, R.L., 1983. Factor Analysis, second ed. Lawrence Erlbaum Associates, Mahwah. Gould, W., Pitblado, J., Poi, B., 2010. Maximum Likelihood Estimation with Stata, fourth ed. Stata Press, College Station. Gourieroux, C., Monfort, A., Trognon, A., 1984. Pseudo maximum likelihood methods: applications to Poisson models. Econometrica 52 (3), 701–772. Gower, J.C., 1967. A comparison of some methods of cluster analysis. Biometrics 23 (4), 623–637. Greenacre, M.J., 2007. Correspondence Analysis in Practice, second ed. Chapman & Hall/CRC Press, Boca Raton. Greenacre, M.J., 1988. Correspondence analysis of multivariate categorical data by weighted least-squares. Biometrika 75 (3), 457–467. Greenacre, M.J., 2000. Correspondence analysis of square asymmetric matrices. J. Roy. Stat. Soc. Ser. C Appl. Stat. 49 (3), 297–310. Greenacre, M.J., 2008. La pra´ctica del ana´lisis de correspondencias. Barcelona: Fundacio´n Bbva. Greenacre, M.J., 2003. Singular value decomposition of matched matrices. J. Appl. Stat. 30 (10), 1101–1113. Greenacre, M.J., 1989. The Carroll-Green-Schaffer scaling in correspondence analysis: a theoretical and empirical appraisal. J. Market. Res. 26 (3), 358–365. Greenacre, M.J., 1984. Theory and Applications of Correspondence Analysis. Academic Press, London. Greenacre, M.J., Blasius, J., 1994. Correspondence Analysis in the Social Sciences. Academic Press, London. Greenacre, M.J., Blasius, J., 2006. Multiple Correspondence Analysis and Related Methods. Chapman & Hall/CRC Press, Boca Raton. Greenacre, M.J., Hastie, T., 1987. The geometric interpretation of correspondence analysis. J. Am. Stat. Assoc. 82 (398), 437–447.

Greenacre, M.J., Pardo, R., 2006. Subset correspondence analysis: visualization of selected response categories in a questionnaire survey. Sociol. Methods Res. 35 (2), 193–218. Greenberg, B.A., Goldstucker, J.L., Bellenger, D.N., 1977. What techniques are used by marketing researchers in business? J. Market. 41 (2), 62–68. Greene, W.H., 2012. Econometric Analysis, seventh ed. Pearson, Harlow. Greene, W.H., 2011. Fixed effects vector decomposition: a magical solution to the problem of time-invariant variables in fixed effects models? Polit. Anal. 19 (2), 135–146. Greenwood, M., Yule, G.U., 1920. An inquiry into the nature of frequency distributions representative of multiple happenings with particular reference to the occurrence of multiple attacks of disease or of repeated accidents. J. Roy. Stat. Soc. Ser. A 83 (2), 255–279. Gu, Y., Hole, A.R., 2013. Fitting the generalized multinomial logit model in Stata. Stata J. 13 (2), 382–397. Gujarati, D.N., 2011. Econometria ba´sica, fifth ed. Bookman, Porto Alegre. Gujarati, D.N., Porter, D.C., 2008. Econometria ba´sica, fifth ed. McGraw-Hill, New York. Gupta, P.L., Gupta, R.C., Tripathi, R.C., 1996. Analysis of zero-adjusted count data. Comput. Stat. Data Anal. 23 (2), 207–218. Gurmu, S., 1998. Generalized hurdle count data regressions models. Econ. Lett. 58 (3), 263–268. Gurmu, S., 1991. Tests for detecting overdispersion in the positive Poisson regression model. J. Bus. Econ. Stat. 9 (2), 215–222. Gurmu, S., Trivedi, P.K., 1996. Excess zeros in count models for recreational trips. J. Bus. Econ. Stat. 14 (4), 469–477. Gurmu, S., Trivedi, P.K., 1992. Overdispersion tests for truncated Poisson regression models. J. Econometrics 54 (1–3), 347–370. Gutierrez, R.G., 2002. Parametric frailty and shared frailty survival models. Stata J. 2 (1), 22–44. Guttman, L., 1941. The quantification of a class of attributes: a theory and method of scale construction. In: Horst, P. et al., (Ed.), The Prediction of Personal Adjustment. Social Science Research Council, New York. Guttman, L., 1977. What is not what in statistics. Statistician 26 (2), 81–107. Haberman, S.J., 1973. The analysis of residuals in cross-classified tables. Biometrics 29 (1), 205–220. Habib, F., Etesam, I., Ghoddusifar, S.H., Mohajeri, N., 2012. Correspondence analysis: a new method for analyzing qualitative data in architecture. Nexus Netw. J. 14 (3), 517–538.  Haddad, R., Haddad, P., 2004. Crie planilhas inteligentes com o Microsoft Office Excel 2003 - Avanc¸ado. Erica, Sa˜o Paulo. Hadi, A.S., 1994. A modification of a method for the detection of outliers in multivariate samples. J. Roy. Stat. Soc. Ser. B 56 (2), 393–396. Hadi, A.S., 1992. Identifying multiple outliers in multivariate data. J. Roy. Stat. Soc. Ser. B 54 (3), 761–771. Hair Jr., J.F., Black, W.C., Babin, B.J., Anderson, R.E., Tatham, R.L., 2009. Ana´lise multivariada de dados, sixth ed. Bookman, Porto Alegre. Hall, D.B., 2000. Zero-inflated Poisson and binomial regression with random effects: a case study. Biometrics 56, 1030–1039. Halvorsen, R., Palmquist, R.B., 1980. The interpretation of dummy variables in semilogarithmic equations. Am. Econ. Rev. 70 (3), 474–475. Hamann, U., 1961. Merkmalsbestand und verwandtschaftsbeziehungen der Farinosae: ein beitrag zum system der monokotyledonen. Willdenowia 2 (5), 639–768. Hamilton, L.C., 2013. Statistics with Stata: version 12, eighth ed. Brooks/Cole Cengage Learning, Belmont. Han, J., Kamber, M., 2000. Data Mining: Concepts and Techniques. Morgan Kaufmann, Burlington. 
Hardin, J.W., Hilbe, J.M., 2013. Generalized Estimating equations, second ed. Chapman & Hall/CRC Press, Boca Raton. Hardin, J.W., Hilbe, J.M., 2012. Generalized Linear Models and Extensions, third ed. Stata Press, College Station. H€ardle, W.K., Simar, L., 2012. Applied Multivariate Statistical Analysis, third ed. Springer, Heidelberg. Hardy, A., 1996. On the number of clusters. Comput. Stat. Data Anal. 23 (1), 83–96. Hardy, M.A., 1993. Regression with Dummy Variables. Sage Publications, Thousand Oaks. Harman, H.H., 1976. Modern Factor Analysis, third ed. University of Chicago Press, Chicago. Hartley, H.O., 1950. The use of range in analysis of variance. Biometrika 37 (3-4), 271–280. Harvey, A.C., 1976. Estimating regression models with multiplicative heteroscedasticity. Econometrica 44 (3), 461–465. Hausman, J.A., 1978. Specification tests in econometrics. Econometrica 46 (6), 1251–1271. Hausman, J.A., Hall, B.H., Griliches, Z., 1984. Econometric models for count data with an application to the patents-R & D relationship. Econometrica 52 (4), 909–938. Hausman, J.A., Taylor, W.E., 1981. Panel data and unobservable individual effects. Econometrica 49 (6), 1377–1398. Hayashi, C., Sasaki, M., Suzuki, T., 1992. Data Analysis for Comparative Social Research: International Perspectives. North Holland, Amsterdam. Heck, R.H., Thomas, S.L., 2009. An Introduction to Multilevel Modeling Techniques, second ed. Routledge, New York. Heck, R.H., Thomas, S.L., Tabata, L.N., 2014. Multilevel and Longitudinal Modeling with IBM SPSS, second ed. Routledge, New York. Heckman, J., Vytlacil, E., 1998. Instrumental variables methods for the correlated random coefficient model: estimating the average rate of return to schooling when the return is correlated with schooling. J. Hum. Resour. 33 (4), 974–987. Heibron, D.C., 1994. Zero-altered and other regression models for count data with added zeros. Biometrical J. 36 (5), 531–547. Held, M., Karp, R.M., 1970. The traveling-salesman problem and minimum spanning trees. Oper. Res. 18 (6), 1138–1162. Herbst, A.F., 1974. A factor analysis approach to determining the relative endogeneity of trade credit. J. Finance 29 (4), 1087–1103. Higgs, N.T., 1991. Practical and innovative uses of correspondence analysis. Statistician 40 (2), 183–194. Hilbe, J.M., 2009. Logistic Regression Models. Chapman & Hall/CRC Press, London. Hill, C., Griffiths, W., Judge, G., 2000. Econometria. Saraiva, Sa˜o Paulo. Hill, P.W., Goldstein, H., 1998. Multilevel modeling of educational data with cross-classification and missing identification for units. J. Educ. Behav. Stat. 23 (2), 117–128.

Pl€ umper, T., Troeger, V.E., 2007. Efficient estimation of time-invariant and rarely changing variables in finite sample panel analyses with unit fixed effects. Polit. Anal. 15 (2), 124–139. Pollard, D., 1981. Strong consistency of k-means clustering. Ann. Stat. 9 (1), 135–140. Pregibon, D., 1981. Logistic regression diagnostics. Ann. Stat. (9), 704–724. Press, S.J., 2005. Applied Multivariate Analysis: Using Bayesian and Frequentist Methods of Inference, second ed. Dover Science, Mineola. Punj, G., Stewart, D.W., 1983. Cluster analysis in marketing research: review and suggestions for application. J. Market. Res. 20 (2), 134–148. Rabe-Hesketh, S., Everitt, B., 2000. A Handbook of Statistical Analyses Using Stata, second ed. Chapman & Hall, Boca Raton. Rabe-Hesketh, S., Skrondal, A., 2012b. Multilevel and Longitudinal Modeling Using Stata: Categorical Responses, Counts, and Survival, third ed. vol. II. Stata Press, College Station. Rabe-Hesketh, S., Skrondal, A., 2012a. Multilevel and Longitudinal Modeling Using Stata: Continuous Responses, third ed. vol. I. Stata Press, College Station. Rabe-Hesketh, S., Skrondal, A., Pickles, A., 2005. Maximum likelihood estimation of limited and discrete dependent variable models with nested random effects. J. Econometrics 128 (2), 301–323. Rabe-Hesketh, S., Skrondal, A., Pickles, A., 2002. Reliable estimation of generalized linear mixed models using adaptive quadrature. Stata J. 2 (1), 1–21. Ragsdale, C.T., 2009. Modelagem e ana´lise de decisa˜o. Cengage Learning, Sa˜o Paulo. Rajan, R.G., Zingales, L., 1998. Financial dependence and growth. Am. Econ. Rev. 88 (3), 559–586. Ramalho, J.J.S., 1996. Modelos de regressa˜o para dados de contagem. Masters Dissertation, Instituto Superior de Economia e Gesta˜o, Universidade Tecnica de Lisboa, Lisboa. 110 f. Rardin, R.L., 1998. Optimization in Operations Research. Prentice Hall, New Jersey. Rasch, G., 1960. Probabilistic Models for Some Intelligence and Attainment Tests. Paedagogike Institut, Copenhagen. Raudenbush, S., Bryk, A., 2002. Hierarchical Linear Models: Applications and Data Analysis Methods, second ed. Sage Publications, Thousand Oaks. Raudenbush, S., Bryk, A., Cheong, Y.F., Congdon, R., du Toit, M., 2004. HLM 6: hierarchical linear and nonlinear modeling. Scientific Software International, Inc, Lincolnwood. Raykov, T., Marcoulides, G.A., 2008. An Introduction to Applied Multivariate Analysis. Routledge, New York. Reis, E., 2001. Estatı´stica multivariada aplicada, second ed. Edic¸o˜es Sı´labo, Lisboa. Rencher, A.C., 1992. Interpretation of canonical discriminant functions, canonical variates and principal components. Am. Stat. 46 (3), 217–225. Rencher, A.C., 2002. Methods of Multivariate Analysis, second ed. John Wiley & Sons, New York. Rencher, A.C., 1988. On the use of correlations to interpret canonical functions. Biometrika 75 (2), 363–365. Rigau, J.G., 1990. Traduccio´n del termino ‘odds ratio. Gaceta Sanitaria (16), 35. Roberto, A.N., 2002. Modelos de rede de fluxo para alocac¸a˜o da a´gua entre mu´ltiplos usos em uma bacia hidrogra´fica. Escola Politecnica, Universidade de Sa˜o Paulo, Sa˜o Paulo Dissertac¸a˜o (Mestrado em Engenharia Hidra´ulica e Sanita´ria). 105 p. Rodrigues, M.C.P., 2002. Potencial de desenvolvimento dos municı´pios fluminenses: uma metodologia alternativa ao Iqm, com base na ana´lise fatorial explorato´ria e na ana´lise de clusters. Caderno de Pesquisas em Administrac¸a˜o 9 (1), 75–89. Rodrigues, P.C., Lima, A.T., 2009. 
Analysis of an European union election using principal component analysis. Stat. Papers 50 (4), 895–904. Rogers, D.J., Tanimoto, T.T., 1960. A computer program for classifying plants. Science 132 (3434), 1115–1118. Rogers, W., 2000. Errors in hedonic modeling regressions: compound indicator variables and omitted variables. Appraisal J. 208–213. Rogers, W.M., Schmitt, N., Mullins, M.E., 2002. Correction for unreliability of multifactor measures: comparison of alpha and parallel forms approaches. Organization. Res. Methods 5 (2), 184–199. Ross, G.J.S., Preence, D.A., 1985. The negative binomial distribution. Statistician 34 (3), 323–335. Roubens, M., 1982. Fuzzy clustering algorithms and their cluster validity. Eur. J. Oper. Res. 10 (3), 294–301. Rousseeuw, P.J., Leroy, A.M., 1987. Robust Regression and Outlier Detection. John Wiley & Sons, New York. Royston, P., 2006. Explained variation for survival models. Stata J. 6 (1), 83–96. Royston, P., Lambert, P.C., 2011. Flexible Parametric Survival Analysis Using Stata: Beyond the Cox Model. Stata Press, College Station. Royston, P., Parmar, M.K.B., 2002. Flexible parametric proportional-hazards and proportional-odds models for censored survival data, with application to prognostic modelling and estimation of treatment effects. Stat. Med. 21 (15), 2175–2197. Rummel, R.J., 1970. Applied Factor Analysis. Northwestern University Press, Evanston. Russell, P.F., Rao, T.R., 1940. On habitat and association of species of Anopheline Larvae in South-eastern Madras. J. Malaria Instit. India 3 (1), 153–178. Rutemiller, H.C., Bowers, D.A., 1968. Estimation in a heterocedastic regression model. J. Am. Stat. Assoc. 63, 552–557. Saaty, T.L., 2000. Fundamentals of Decision Making and Priority Theory with the Analytic Hierarchy Process. RWS Publications, Pittsburgh. Santos, M.A., Fa´vero, L.P., Distadio, L.F., 2016. Adoption of the International Financial Reporting Standards (IFRS) on companies’ financing structure in emerging economies. Finance Res. Lett. 16 (1), 179–189. Santos, M.S., 2005. Cervejas e refrigerantes. In: Mateus Sales dos Santos e Fla´vio de Miranda Ribeiro. CETESB, Sa˜o Paulo Disponı´vel em http://www. cetesb.sp.gov.br. [(Accessed 11 February 2017)]. Saporta, G., 1990. Probabilites, analyse des donnees et statistique. Technip, Paris. Saraiva Jr., A.F., Tabosa, C.M., Costa, R.P., 2011. Simulac¸a˜o de Monte Carlo aplicada a` ana´lise econ^omica de pedido. Produc¸a˜o 21 (1), 149–164. Sarkadi, K., 1975. The consistency of the Shapiro-Francia test. Biometrika 62 (2), 445–450. Sartoris Neto, A., 2013. Estatı´stica e introduc¸a˜o a` econometria, second ed. Saraiva, Sa˜o Paulo. Schaffer, M.E., Stillman, S., XTOVERID: Stata module to calculate tests of overidentifying restrictions after xtreg, xtivreg, xtivreg2, xthtaylor. http:// ideas.repec.org/c/boc/bocode/s456779.html. [(Accessed 21 February 2014)].

1212

References

Stein, C.M., 1981. Estimation of the mean of a multivariate normal distribution. Ann. Stat. 9 (6), 1135–1151. Stemmler, M., 2014. Person-centered methods: configural frequency analysis (CFA) and other methods for the analysis of contingency tables. Springer, Erlangen. Stephan, F.F., 1941. Stratification in representative sampling. J. Market. 6 (1), 38–46. Stevens, J.P., 2009. Applied Multivariate Statistics for the Social Sciences, fifth ed. Routledge, New York. Stevens, S.S., 1946. On the theory of scales of measurement. Science 103 (2684), 677–680. Stewart, D.K., Love, W.A., 1968. A general canonical correlation index. Psychol. Bull. 70 (3), 160–163. Stewart, D.W., 1981. The application and misapplication of factor analysis in marketing research. J. Market. Res. 18 (1), 51–62. Stock, J.H., Watson, M.W., 2004. Econometria. Pearson Education, Sa˜o Paulo. Stock, J.H., Watson, M.W., 2008. Heteroskedasticity-robust standard errors for fixed effects panel data regression. Econometrica 76 (1), 155–174. Stock, J.H., Watson, M.W., 2006. Introduction to econometrics, third ed. Pearson, Essex. Stowe, J.D., Watson, C.J., Robertson, T.D., 1980. Relationships between the two sides of the balance sheet: a canonical correlation analysis. J. Finance 35 (4), 973–980. Streiner, D.L., 2003. Being inconsistent about consistency: when coefficient alpha does and doesn´t matter. J. Personal. Assess. 80 (3), 217–222. Stukel, T.A., 1988. Generalized logistic models. J. Am. Stat. Assoc. 83 (402), 426–431. Sudman, S., 1985. Efficient screening methods for the sampling of geographically clustered special populations. J. Market. Res. 22 (20), 20–29. Sudman, S., Sirken, M.G., Cowan, C.D., 1988. Sampling rare and elusive populations. Science 240 (4855), 991–996. Swets, J.A., 1996. Signal Detection Theory and ROC Analysis in Psychology and Diagnostics: Collected Papers. Lawrence Erlbaum Associates, Mahwah. Tabachnick, B.G., Fidell, L.S., 2001. Using Multivariate Statistics. Allyn and Bacon, New York. Tacq, J., 1996. Multivariate Analysis Techniques in Social Science Research. Sage Publications, Thousand Oaks. Tadano, Y.S., Ugaya, C.M.L., Franco, A.T., 2009. Metodo de regressa˜o de Poisson: metodologia para avaliac¸a˜o do impacto da poluic¸a˜o atmosferica na sau´de populacional. Ambiente & Sociedade Xii (2), 241–255. Taha, H.A., 2010. Operations Research: An Introduction, nineth ed. Prentice Hall, Upper Saddle River. Taha, H.A., 2016. Operations Research: An Introduction, tenth ed. Pearson Higher Ed, USA. Takane, Y., Young, F.W., DE Leeuw, J., 1977. Nonmetric individual differences multidimensional scaling: an alternating least squares method with optimal scaling features. Psychometrika 42 (1), 7–67. Tang, W., He, H., Tu, X.M., 2012. Applied Categorical and Count Data analysis. Chapman & Hall/CRC Press, Boca Raton. Tapia, J.A., Nieto, F.J., 1993. Razo´n de posibilidades: una propuesta de traduccio´n de la expresio´n odds ratio. Salud Pu´blica de Mexico 35, 419–424. Tate, W.F., 2012. Research on schools, neighborhoods, and communities. Rowman & Littlefield Publishers Inc., Plymouth. Teerapabolarn, K., 2008. Poisson approximation to the beta-negative binomial distribution. Int. J. Contemp. Math. Sci. 3 (10), 457–461. Tenenhaus, M., Young, F., 1985. An analysis and synthesis of multiple correspondence analysis, optimal scaling, dual scaling, homogeneity analysis, and other methods for quantifying categorical multivariate data. Psychometrika 50 (1), 91–119. Thomas, W., Cook, R.D., 1990. 
Assessing influence on predictions from generalized linear models. Technometrics 32 (1), 59–65. Thompson, B., 1984. Canonical Correlation analysis: Uses and Interpretation. Sage Publications, Thousand Oaks. Thurstone, L.L., 1969. Multiple Factor Analysis: A Development and Expansions of “The Vectors of the Mind” University of Chicago Press, Chicago. Thurstone, L.L., 1959. The Measurement of Values. University of Chicago Press, Chicago. Thurstone, L.L., 1935. The Vectors of the Mind. University of Chicago Press, Chicago. Thurstone, L.L., Thurstone, T.G., 1941. Factorial Studies of Intelligence. University of Chicago Press, Chicago. Timm, N.H., 2002. Applied Multivariate Analysis. Springer-Verlag, New York. Tobin, J., 1969. A general equilibrium approach to monetary theory. J. Money Credit Bank. 1 (1), 15–29. Traissac, P., Martin-Prevel, Y., 2012. Alternatives to principal components analysis to derive asset-based indices to measure socio-economic position in low- and middle-income countries: the case for multiple correspondence analysis. Int. J. Epidemiol. 41 (4), 1207–1208. Triola, M.F., 2013. Introduc¸a˜o a` estatı´stica: atualizac¸a˜o da tecnologia, 11th ed. LTC Editora, Rio de Janeiro. Troldahl, V.C., Carter Jr., R.E., 1964. Random selection of respondents within households in phone surveys. J. Market. Res. 1 (2), 71–76. Tryon, R.C., 1939. Cluster analysis. McGraw-Hill, New York. Tsiatis, A.A., 1980. A note on a goodness-of-fit test for the logistic regression model. Biometrika 67, 250–251. Turkman, M.A.A., Silva, G.L., 2000. Modelos lineares generalizados: da teoria a` pra´tica. Edic¸o˜es Spe, Lisboa. UCLA, 2015. Statistical Consulting Group of the Institute for Digital Research and Education. http://www.ats.ucla.edu/stat/stata/faq/casummary.htm. [(Accessed 5 February 2015)]. UCLA, 2013a. Statistical Consulting Group of the Institute for Digital Research and Education. http://www.ats.ucla.edu/stat/stata/output/stata_mlogit_ output.htm. [(Accessed 22 September 2013)]. UCLA, 2013b. Statistical Consulting Group of the Institute for Digital Research and Education. http://www.ats.ucla.edu/STAT/stata/seminars/stata_sur vival/default.htm. [(Accessed 13 November 2013)]. UCLA, 2013c. Statistical Consulting Group of the Institute for Digital Research and Education. http://www.ats.ucla.edu/stat/stata/webbooks/reg/chapter2/ statareg2.htm. [(Accessed 2 September 2013)]. UCLA, 2013d. Statistical Consulting Group of the Institute for Digital Research and Education. http://www.ats.ucla.edu/stat/stata/dae/canonical.htm. [(Accessed 15 December 2013)]. Valentin, J.L., 2012. Ecologia numerica: uma introduc¸a˜o a` ana´lise multivariada de dados ecolo´gicos, second ed. Interci^encia, Rio de Janeiro.

References

1213

Van Auken, H.E., Doran, B.M., Yoon, K.J., 1993. A financial comparison between Korean and US firms: a cross-balance sheet canonical correlation analysis. J. Small Bus. Manag. 31 (3), 73–83. Vance, P.S., Fa´vero, L.P., Luppe, M.R., 2008. Franquia empresarial: um estudo das caracterı´sticas do relacionamento entre franqueadores e franqueados no Brasil. Revista de Administrac¸a˜o (RAUSP) 43 (1), 59–71. Vanneman, R., 1977. The occupational composition of American classes: results from cluster analysis. Am. J. Sociol. 82 (4), 783–807. Vasconcellos, M.A.S., Alves, D., 2000. Manual de econometria. Atlas, Sa˜o Paulo. Velicer, W.F., Jackson, D.N., 1990. Component analysis versus common factor analysis: some issues in selecting an appropriate procedure. Multivar. Behav. Res. 25 (1), 1–28. Velleman, P.F., Wilkinson, L., 1993. Nominal, ordinal, interval, and ratio typologies are misleading. Am. Stat. 47 (1), 65–72. Verbeek, M., 2012. A Guide to Modern Econometrics, fourth ed. John Wiley & Sons, West Sussex. Verbeke, G., Molenberghs, G., 2000. Linear Mixed Models for Longitudinal Data. Springer-Verlag, New York. Vermunt, J.K., Anderson, C.J., 2005. Joint correspondence analysis (JCA) by maximum likelihood. Methodol. Eur. J. Res. Methods Behav. Soc. Sci. 1 (1), 18–26. Vicini, L., Souza, A.M., 2005. Ana´lise multivariada da teoria a` pra´tica. Monografia (Especializac¸a˜o em Estatı´stica e Modelagem Quantitativa), Centro de Ci^encias Naturais e Exatas, Universidade Federal de Santa Maria, Santa Maria. 215 f. Vieira, S., 2012. Estatı´stica ba´sica. Cengage Learning, Sa˜o Paulo. Vittinghoff, E., Glidden, D.V., Shiboski, S.C., McCulloch, C.E., 2012. Regression Methods in Biostatistics: Linear, Logistic, Survival, and Repeated Measures Models, second ed. Springer-Verlag, New York. Vuong, Q.H., 1989. Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica 57 (2), 307–333. Ward Jr., J.H., 1963. Hierarchical grouping to optimize an objective function. J. Am. Stat. Assoc. 58 (301), 236–244. Wathier, J.L., Dell’aglio, D.D., Bandeira, D.R., 2008. Ana´lise fatorial do inventa´rio de depressa˜o infantil (CDI) em amostra de jovens brasileiros. Avaliac¸a˜o Psicolo´gica 7 (1), 75–84. Watson, I., 2005. Further processing of estimation results: basic programming with matrices. Stata J. 5 (1), 83–91. Weber, S., 2010. Bacon: an effective way to detect outliers in multivariate data using Stata (and Mata). Stata J. 10 (3), 331–338. Wedderburn, R.W.M., 1974. Quasi-likelihood functions, generalized linear models, and the Gauss-Newton method. Biometrika 61 (3), 439–447. Weisberg, S., 1985. Applied Linear Regression. John Wiley & Sons, New York. Weller, S.C., Romney, A.K., 1990. Metric Scaling: Correspondence Analysis. Sage, London. Wen, C.H., Yeh, W.Y., 2010. Positioning of international air passenger carriers using multidimensional scaling and correspondence analysis. Transport. J. 49 (1), 7–23. Wermuth, N., R€ ussmann, H., 1993. Eigenanalysis of symmetrizable matrix products: a result with statistical applications. Scand. J. Stat. 20, 361–367. West, B.T., Welch, K.B., Gałecki, A.T., 2015. Linear Mixed Models: A Pratical Guide Using Statistical Software, second ed. Chapman & Hall/CRC Press, Boca Raton. White, H., 1980. A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica 48 (4), 817–838. White, H., 1982. Maximum likelihood estimation of misspecified models. Econometrica 50 (1), 1–25. Whitlark, D.B., Smith, S.M., 2001. 
Using correspondence analysis to map relationships. Market. Res. 13 (3), 22–27. Wilcoxon, F., 1945. Individual comparisons by ranking methods. Biometr. Bull. 1 (6), 80–83. Wilcoxon, F., 1947. Probability tables for individual comparisons by ranking methods. Biometrics 3 (3), 119–122. Williams, R., 2006. Generalized ordered logit/partial proportional odds models for ordinal dependent variables. Stata J. 6 (1), 58–82. Winkelmann, R., Zimmermann, K.F., 1991. A new approach for modeling economic count data. Econ. Lett. 37 (2), 139–143. Winston, W.L., 2004. Operations Research: Applications and Algorithms, fourth ed. Brooks/Cole – Thomson Learning, Belmont. Witten, I.H., Frank, E., Hall, M.A., Pal, C.J., 2016. Data Mining: Practical Machine Learning Tools and Techniques, fourth ed. Elsevier, Boston. Wolfe, J.H., 1978. Comparative cluster analysis of patterns of vocational interest. Multivar. Behav. Res. 13 (1), 33–44. Wolfe, J.H., 1970. Pattern clustering by multivariate mixture analysis. Multivar. Behav. Res. 5 (3), 329–350. Wong, M.A., Lane, T., 1983. A kth nearest neighbour clustering procedure. J. Roy. Stat. Soc. Ser. B 45 (3), 362–368. Wonnacott, T.H., Wonnacott, R.J., 1990. Introductory Statistics for Business and Economics, 4. ed. John Wiley & Sons, New York. Wooldridge, J.M., 2010. Econometric Analysis of Cross Section and Panel Data, second ed. MIT Press, Cambridge. Wooldridge, J.M., 2012. Introductory Econometrics: A Modern Approach, fifth ed. Cengage Learning, Mason. Wooldridge, J.M., 2005. Simple solutions to the initial conditions problem in dynamic, nonlinear panel data models with unobserved heterogeneity. J. Appl. Econ. 20 (1), 39–54. Wu, Z., et al., 2008. Optimization designs of the combined Shewhart CUSUM control charts. Comput. Stat. Data Anal. 53 (2), 496–506. Wulff, J.N., 2015. Interpreting results from the multinomial logit: demonstrated by foreign market entry. Organization. Res. Methods 18 (2), 300–325. Xie, F.C., Wei, B.C., Lin, J.G., 2008. Assessing influence for pharmaceutical data in zero-inflated generalized Poisson mixed models. Stat. Med. 27 (18), 3656–3673. Xie, M., He, B., Goh, T.N., 2001. Zero-inflated Poisson model in statistical process control. Comput. Stat. Data Anal. 38 (2), 191–201. Xue, D., Deddens, J., 1992. Overdispersed negative binomial regression models. Commun. Stat. Theory Methods 21 (8), 2215–2226. Yanai, H., Takane, Y., 2002. Generalized constrained canonical correlation analysis. Multivar. Behav. Res. 37 (2), 163–195. Yau, K., Wang, K., Lee, A., 2003. Zero-inflated negative binomial mixed regression modeling of over-dispersed count data with extra zeros. Biometr. J. 45 (4), 437–452.

1214

References

Yavas, U., Shemwell, D.J., 1996. Bank image: exposition and illustration of correspondence analysis. Int. J. Bank Market. 14 (1), 15–21. Ye, N. (Ed.), 2004. The Handbook of Data Mining. Lawrence Erlbaum Associates, Mahwah. Young, F., 1981. Quantitative analysis of qualitative data. Psychometrika 46 (4), 357–388. Young, G., Householder, A.S., 1938. Discussion of a set of points in terms of their mutual distances. Psychometrika 3 (1), 19–22. Yule, G.U., 1900. On the association of attributes in statistics: with illustrations from the material of the childhood society, etc. Philos. Trans. Roy. Soc. London 194, 257–319. Yule, G.U., Kendall, M.G., 1950. An Introduction to the Theory of Statistics, fourteen ed. Charles Griffin, London. Zeger, S.L., Liang, K.Y., Albert, P.S., 1988. Models for longitudinal data: a generalized estimating equation approach. Biometrics 44 (4), 1049–1060. Zhang, H., Liu, Y., Li, B., 2014. Notes on discrete compound Poisson model with applications to risk theory. Insurance Math. Econ. 59, 325–336. Zheng, X., Rabe-Hesketh, S., 2007. Estimating parameters of dichotomous and ordinal item response models using gllamm. Stata J. 7 (3), 313–333. Zhou, W., Jing, B.Y., 2006. Tail probability approximations for Student’s t-statistics. Probab. Theory Relat. Fields 136 (4), 541–559. Zippin, C., Armitage, P., 1966. Use of concomitant variables and incomplete survival information in the estimation of an exponential survival parameter. Biometrics 22 (4), 665–672. Zorn, C.J.W., 2001. Generalized estimating equation models for correlated data: a review with applications. Am. J. Polit. Sci. 45 (2), 470–490. Zubin, J., 1938a. A technique for measuring like-mindedness. J. Abnormal Soc. Psychol. 33 (4), 508–516. Zubin, J., 1938b. Socio-biological types and methods for their isolation. Psychiatry J. Study Interpersonal Process. 2, 237–247. Zuccolotto, P., 2007. Principal components of sample estimates: an approach through symbolic data. Stat. Methods Appl. 16 (2), 173–192. Zwilling, M.L., 2013. Negative binomial regression. Math. J. 15, 1–18.

Index

Note: Page numbers followed by f indicate figures, t indicate tables, and b indicate boxes.

A Absolute average deviation. See Average deviation Absolute frequency, 22, 32–33, 33f Absolute nesting data structure, 990 Adaptive quadrature process, 1005 Advanced Integrated Multidimensional Modeling Software (AIMMS), 774 Agglomeration schedules, 311–312, 324, 326f hierarchicals, 325 (see also Hierarchical agglomeration schedules) k-means procedure, 325 linkage methods, 325 nonhierarchicals, 325 (see also Nonhierarchical k-means agglomeration schedule) Aggregated planning problem, 734–736b, 734t binary programming (BP), 733 decision variables, 733 general formulation, 733 integer programming (IP) model, 734–736 mixed-integer programming (MIP) problem, 734–736 model parameters, 733 nonlinear programming (NLP), 733 resources, 732 Akaike information criterion (AIC), 695 Allocation/attribution problem. See Job assignment problem A Mathematical Programming Language (AMPL), 774 Analysis of variance (ANOVA), 525, 527f assumptions, 232 linear regression models, 457, 458f multiple interactions, 246 one-way ANOVA, 234–236b, 234f, 235–236t calculations, 233, 233t null hypothesis, 232 observations, 232, 232t residual sum of squares, 233 SPSS Software, 236–237, 237–238f Stata Software, 237–238, 238f two-way ANOVA, 241–242b, 242t calculations, 241, 241t factors, 240 observations, 239, 239t residual sum of squares (RSS), 240 SPSS Software, 242–244, 243–246f Stata Software, 244–245, 246f sum of squares of factor, 240 sum of total squares, 240

Anti-Dice similarity coefficient, 323 Arbitrary weighting procedure, 314 Arithmetic mean continuous data, 41–42, 41t, 41–42b grouped discrete data, 40–41, 40–41b, 40–41t ungrouped discrete and continuous data simple arithmetic mean, 38–39, 38t, 38–39b weighted arithmetic mean, 39–40, 39–40t, 39–40b Autocorrelation Breusch-Godfrey test, 493–494 causes, 492, 492f consequences, 493 data time evolution, 491 Durbin-Watson test, 493, 493f first-order autocorrelation, 492 generalized least squares method, 494 residuals problem, 492, 492f Average deviation continuous data, 56, 56b, 56t grouped discrete data, 54–55, 55t, 55b modulus/absolute deviation, 54 ungrouped discrete and continuous data, 54, 54t, 54b

B Bacon algorithm, 533 Balanced nested data structure, 988 Balanced transportation model, 839, 846, 846b Bar charts, 21, 26–27, 26t, 26–27b, 27f Bartlett’s χ2 test, 210–212, 211–212b, 211t Bartlett’s test of sphericity, 387, 389–390, 413, 413f, 423, 424f Basic solution (BS), 755 Basic variables (BV), 755, 755b Bayesian (Schwarz) information criterion (BIC), 695 Bayes’ theorem, 132–133, 132–133b Bernoulli distribution, 142–144, 143–144b, 143f, 609, 691 Best linear unbiased predictions (BLUPS), 1005 Between-groups/average-linkage method, 328, 335–338, 337t, 338f Big Data, 983, 984f Binary logistic regression model, 539 confidence intervals, 556–557, 556–557t, 557b cutoff, 558, 559–560t dichotomic form, 540 event nonoccurrence, 541 event occurrence, 540–542, 541t

explanatory variables, 540 graph, 541–542, 541f logit, 540 maximum likelihood, 542–547, 542–544t, 545–547f overall model efficiency (OME), 560 pi values, 558, 558t probability model, 557–558 ROC curve, 561–562, 562f sensitivity analysis, 559–560, 561f specificity, 560 SPSS (see IBM SPSS Statistics Software) Stata (see Stata Software) statistical significance chi-square test, 548–550, 550f degrees of freedom, 550 Hosmer-Lemeshow test, 554 Insert Function, 551, 552f likelihood-ratio test, 552, 553f linear and logistic adjustments, 547–548, 548f logistic probability adjustment, 555, 555f McFadden pseudo R2, 548 null model, 548, 549f parameter estimation, 551, 556 Solver, 547–548, 549f, 553f Wald z test, 550–551 Binary programming (BP), 733 capital budgeting problem, 894t, 894b attributes, 895b decision variables, 895, 897f Excel spreadsheet, 894, 895b, 895f investment projects, 894 linear programming model, 894 Net Present Value (NPV), 894 optimal solution, 895, 897f Solver Parameters dialog box, 895, 896f Traveling Salesman Problem (TSP), 898–899t, 898–900b, 899f Excel Solver, 901f, 902, 902b, 903–904f formulations, 896–901 Hamiltonian problem, 896, 898f network programming, 896 Binomial distribution, 144–145, 144f, 145b, 1169–1176t Binomial test, 250–254, 251–253b, 252t, 253f SPSS Software, 253, 254–255f Stata Software, 253–254, 255f Bivariate descriptive statistics, 93, 94f perceptual maps, 93 qualitative variables, measures of association

Bivariate descriptive statistics (Continued) chi-square statistic, 102–110, 102–110b, 102–103t, 104–106f, 107t, 109–110f joint frequency distribution tables (see Joint frequency distribution tables) Spearman’s coefficient, 110–113, 110f, 111–113b, 111t, 112–113f quantitative variable, measures of correlation covariance, 118, 118b Pearson’s correlation coefficient, 119–121, 119–121b, 119–121f scatter plot, 114–118, 114–118f, 115–116b Blending/mixing problem, 717–719, 717–719b, 718t Blocked Adaptive Computationally Efficient Outlier Nominators (BACON) algorithm, 52 Bowley’s coefficient of skewness, 63–64, 64b Box-Cox transformations IBM SPSS Statistics Software, 527 nonlinear regression models, 497–498, 497b ordinary least squares (OLS) method, 480–481 Stata, regression models estimation, 511, 512f results, 513, 515f Boxplots, 21, 37–38, 37f position/location measures, 53–54, 53f Breusch-Godfrey test, 493–494, 516, 517f Breusch-Pagan/Cook-Weisberg test, 489–490, 505, 506t, 506f Business Analytics, 983

C Canberra distance, 319 Capital budgeting problem, 721–724, 722–723t, 722–724b, 894t, 894b attributes, 895b decision variables, 895, 897f Excel spreadsheet, 894, 895b, 895f investment projects, 894 linear programming model, 894 Net Present Value (NPV), 894 optimal solution, 895, 897f Solver Parameters dialog box, 895, 896f C chart, 966–967, 966–968f Chebyshev distance, 319, 320t Chi-square statistic measures binary logistic regression model, 548–550, 550f contingency coefficient, 107, 108b, 109f Cramer’s V coefficient, 108, 108–110b, 109–110f definition, 102 distribution, 158–160, 159–160f, 160b, 196, 196f, 490, 1166–1167t example, 102–106b, 102–103t, 104–106f K independent samples, 295–299, 296f, 296b, 296t SPSS Software, 297, 297–299f Stata Software, 297–299, 299f one sample, 255–257, 255–256f, 256b, 256t SPSS Software, 256–257, 257–258f Stata Software, 257, 259f

Phi coefficient, 106, 106–108b, 107t, 109–110f two independent samples, 276–280, 277–278f, 277–278t, 277–278b SPSS Software, 279, 279–280f Stata Software, 279–280, 280f Cluster analysis agglomeration schedules, 311–312, 324, 326f hierarchicals, 325 (see also Hierarchical agglomeration schedules) k-means procedure, 325 linkage methods, 325 nonhierarchicals, 325 (see also Nonhierarchical k-means agglomeration schedule) arbitrary weighting procedure, 314 creation, 312, 313f definition, 311 discriminant analysis, 311 distance measures, 314, 319, 320–321t Canberra distance, 319 Chebyshev distance, 319, 320t data standardization procedure, 318 Euclidean squared distance, 318, 320t Manhattan distance, 319 metric variables, dataset, 315, 316t, 318, 318t Minkowski distance, 318 Pearson’s correlation coefficient, 315 Pythagorean distance formula, 316, 317f three-dimensional scatter plot, 315, 316–317f Z-scores procedure, 321 internal homogeneity, 313 Likert scale, 314 logic, 311, 312f multinomial logistic regression, 311 multivariate outliers, 379–382, 380–382f rearrangement, 314, 314f scatter plot, 312, 312f similarity measures, 314, 325t absolute frequencies, 322, 322t, 324, 324t anti-Dice similarity coefficient, 323 arbitrary weighting problems, 321 binary variable, 321 dataset, 321, 322t Dice similarity coefficient (DSC), 323 Euclidean distance, 322 Hamann similarity coefficient, 324 Jaccard index, 323 Ochiai similarity coefficient, 323 Rogers and Tanimoto similarity coefficient, 323 Russel and Rao similarity coefficient, 323 simple matching coefficient (SMC), 322 Sneath and Sokal similarity coefficient, 324 Yule similarity coefficient, 323 static procedures, 314 variability, 312, 313f Cochran’s C statistics, 212–213, 213b, 1191t K paired samples, 286–290, 287–288b, 287t, 288–290f Coefficient of determination, 453–456, 455t, 455–456f

Coefficient of kurtosis, 65, 66f on Stata, 66–68, 67–68b, 67t Coefficient of skewness on Stata, 64–65 Coefficient of variation, 61, 61b Combinations, 134, 134b Combinatorial analysis arrangement, 133–134, 133–134b combinations, 134, 134b definition, 133 permutations, 135, 135b Communalities, 394, 415, 415f Complementary events, 128, 128f, 130 Completely randomized design (CRD), 936–937, 937f Conditional probability, 131 multiplication rule, 131–132, 131–132b, 132t Confidence intervals binary logistic regression model, 556–557, 556–557t, 557b negative binomial regression model, 644, 644t Poisson regression model, 630–632, 631t, 632b population mean, 192 known population variance, 193–194, 193–194b, 193f unknown population variance, 194–195, 194f, 194–195b population variance, 196–197, 196f, 197b proportions, 195–196, 196b Confirmatory factor analysis, 383 Confirmatory techniques, 405 Contingency coefficient, 107, 108b, 109f Contingency tables creation chi-square statistic measures, 102–106b, 102–103t, 104–106f IBM SPSS Statistics Software Cell Display dialog box, 97, 100f cross tabulation, 97, 99–100f labels, 97, 98f variables selection, 96, 97f Stata Software, 101, 101f Continuous random variable, 139f cumulative distribution function (CDF), 140–141, 141b definition, 139 expected/average value, 140 probability density function, 139 probability distributions chi-square distribution, 159–160, 159–160f, 160b exponential distribution, 156–157, 156f, 157b gamma distribution, 157–158, 158f normal distribution (see Gaussian distribution) Snedecor’s F distribution, 162–164, 163b, 163f, 164t Student’s t distribution, 160–162, 161f, 162b uniform distribution, 151–152, 151t, 152b, 152f variance, 140, 140b Continuous variables, 8, 16 Control chart, 1194t

Convenience sampling, 175, 175b Convex and nonconvex sets, 748, 748f Correlation coefficients, 383 Correlation matrix, 391 Covariance, 118, 118b CPLEX, 774 Cramer’s V coefficient, 108, 108–110b, 109–110f Cronbach’s alpha calculation, 435t, 436 internal consistency, 434 reliability level, 434 SPSS, 436, 436–437f Stata, 437–438, 438f variance, 434 Crossindustry standard process of data mining (CRISP-DM), 984–985, 986f Cumulative distribution function (CDF) continuous random variable, 140–141, 141b discrete random variable, 138–139, 139b Cumulative frequency, 22 Czuber’s method, 46, 46–47b, 46t

D Data, information, and knowledge logic, 3, 4f Data mining Big Data, 983, 984f Business Analytics, 983 complexity, 983 crossindustry standard process of data mining (CRISP-DM), 984–985, 986f HLM (see Hierarchical linear models (HLM)) IBM SPSS Modeler, 986, 986f knowledge discovery in databases (KDD), 984, 985f multilevel modeling, 987–988 nested data structures, 988–991, 989–990f, 989–990t predictive capacity, 983 standard recognition, 983 tasks, 985–986 tools and software packages, 986 variety and variability, 983 volume and velocity, 983 Deciles, 48 continuous data, 50–52, 51–52b, 51t grouped discrete data, 50, 50b ungrouped discrete and continuous data, 48–50, 49–50b Decision-making process, 3, 5 Descriptive statistics, 7 Design of experiments (DOE), 935 blocking, 936 completely randomized design (CRD), 936–937, 937f control, 936 data analysis, 936 factorial ANOVA, 938 factorial design (FD), 937 factors and levels, 935 one-way analysis of variance (one-way ANOVA), 938 problem definition, 935 randomization, 936

randomized block design (RBD), 937, 937f replication, 936 response variable, 935 results and conclusions, 936 type, 936 Dice similarity coefficient (DSC), 323 Dichotomous/binary variable (dummy), 16, 17t Diet problem, 720–721, 720–721b, 720t Excel Solver, 788–790, 789b, 789–791f Directed network, 836–837, 837f Direct Oblimin methods, 398 Discrete random variable cumulative distribution function (CDF), 138–139, 139b definition, 137 expected/average value, 137–138 probability distributions Bernoulli distribution, 142–144, 143–144b, 143f binomial distribution, 144–145, 144f, 145b discrete uniform distribution, 141–142, 141f, 142t, 142b geometric distribution, 145–147, 146–147b, 146f hypergeometric distribution, 148–149, 148f, 149b negative binomial distribution, 147–148, 147f, 148b Poisson distribution, 149–151, 150f, 150–151b variance, 138, 138b, 138t Discrete uniform distribution, 141–142, 141f, 142t, 142b Discrete variables, 8, 16 Dispersion/variability measures average deviation continuous data, 56, 56b, 56t grouped discrete data, 54–55, 55t, 55b modulus/absolute deviation, 54 ungrouped discrete and continuous data, 54, 54t, 54b coefficient of variation, 61, 61b range, 54 standard deviation, 59–60, 59–60b standard error, 60–61, 60–61b, 60t variance continuous data, 58–59, 58–59b, 59t definition, 57 grouped discrete data, 57–58, 58t, 58b ungrouped discrete and continuous data, 57, 57b Durbin-Watson test, 1164–1165t autocorrelation, 493, 493f IBM SPSS Statistics Software, 528, 528f result, 529, 529f Stata, regression models estimation, 515, 516f

E Eigenvalues, 391, 408, 408t, 424, 424f Eigenvectors, 391–392, 401–402, 424, 424f Empty set, 129 Erlang distribution, 158

Estimation interval estimation, 190, 190b (see also Confidence intervals) parameter, definition, 189 point estimation, 189, 189b maximum likelihood estimation, 192 method of moments, 190–191, 191t, 191b ordinary least squares (OLS), 191–192 population parameters, 189 Euclidean squared distance, 318, 320t Events, 127, 129b independent, 128, 130 mutually excluding/exclusive, 128, 128f, 130 Excel Solver, 775–779, 775–779f classic transportation problem, 856–860, 856f, 857b, 858–861f diet problem, 788–790, 789b, 789–791f facility location problem, 907–908, 907–909f, 907–908b farmer’s problem, 790–792, 791b, 791–793f job assignment problem, 870, 871–872f, 871b knapsack problem, 891–893, 891–893f, 892b Lifestyle Natural Juices Manufacturer, 798, 802b, 802–804f maximum flow problem, 879–881, 880–881f, 880b Naturelat Dairy, 784–786, 785b, 785–787f Oil-South Refinery, 787–788, 787b, 788–789f portfolio selection, 793–797, 793–797f, 794b, 796b production and inventory problem, Fenix&Furniture, 798, 799–801f, 800b sensitivity analysis, 818–822, 818–823f shortest path problem, 875, 876b, 876–877f transhipment problem (TSP), 866–868, 866–867b, 866–868f Venix Toys, 779–783, 780b, 780–784f, 784b Experimental unit, 935 Explanatory variables, 935 Exploratory factor analysis, 383 Exploratory multivariate technique, 383 Exponential distribution, 156–158, 156f, 157b Extrapolations, 449–450

F Facility location problem, 905–906b, 905f, 906t candidate locations, 902 Excel Solver, 907–908, 907–909f, 907–908b modeling, 902–906 network programming problem, 902 Factor extraction method, 411–412, 411f Factorial analysis of variance, 239–246 Factorial ANOVA, 938 Factorial design (FD), 937 Failure rate, 157 Farmer’s problem, 790–792, 791b, 791–793f Feasible basic solution (FBS), 755 First-order correlation coefficients, 387–389, 399, 399t Fisher’s coefficient kurtosis, 66 skewness, 64 Fisher’s distribution. See Snedecor’s F distribution

Fixed effects parameters, 988 Friedman’s test, 1189t K paired samples, 290–295, 291–292t, 291–293b, 293–295f F-test, 456, 474 Furthest-neighbor/complete-linkage method, 327, 332–335, 334t, 335–336f

G Gamma distribution, 157–158, 158f, 634, 635t, 635f Gaussian distribution binomial approximation, 155 cumulative distribution function, 154, 154f Poisson approximation, 155–156 probability density function, 152, 153f standard deviations, 153, 153f standard normal distribution, 153–154, 154–155f Z-scores, 153–154 Gauss-Jordan elimination method, 769, 769t, 772, 772t General Algebraic Modeling System (GAMS), 774 Generalized linear latent and mixed model (GLLAMM), 1005 Geometric distribution, 145–147, 146–147b, 146f Geometric propagation/snowball sampling, 177, 177b Graph, 835, 836f

H Hamann similarity coefficient, 324 Hamiltonian path, 837 Hampered analysis, 326, 327f Hartley’s Fmax test, 213–214, 213–214b, 1193t Hierarchical agglomeration schedules between-groups/average-linkage method, 328, 335–338, 337t, 338f dendrogram, 327 Euclidian distance, 328, 329f furthest-neighbor/complete-linkage method, 327, 332–335, 334t, 335–336f Hampered analysis, 326, 327f linkage methods, 325, 326t nearest-neighbor/single-linkage method, 327–332, 331t, 331f, 333f phenogram, 327 SPSS (see IBM SPSS Statistics Software) Stata (see Stata Software) Hierarchical crossclassified models (HCM), 990 Hierarchical linear models (HLM), 988 count data, 1056–1057, 1058–1063f, 1058t, 1059–1062 generalized linear latent and mixed models (GLLAMM), 1052, 1052f hierarchical logistic models ceteris paribus, 1055 chart of, 1056, 1057f dataset, 1053, 1054f, 1054t mixed effects logistic regression models, 1052–1053 odds ratios, 1055 outputs, 1053, 1054f, 1056, 1056f

SPSS (see IBM SPSS Statistics Software) Stata (see Stata Software) three-level hierarchical linear models, repeated measures (HLM3), 998 individual models, 996, 996f intercepts and slopes randomness, 996–997 intraclass correlation, 997 temporal evolution, 995 variance-covariance matrix, 995 two-level hierarchical linear models, clustered data (HLM2) first-level model, 991 individual models, 991, 992f intercepts, 992 intraclass correlation, 993 likelihood-ratio tests, 993, 995 logarithmic likelihood function, 994–995 maximum likelihood estimation (MLE), 994 multiple linear regression model, 991 multivariate normal distribution, 993 random intercepts model, 993 random slopes model, 993 reduced maximum likelihood, 994 restricted maximum likelihood (REML), 994 slopes, 992 statistical significance, 993 Higher-order correlation coefficients, 387–389 Histogram, 21, 32–34, 32–33b, 32–33t, 33–34f absolute frequency, 32–33, 33f analysis tools, 32–33 continuous data, 33, 34f definition, 32 discrete data, 33, 34f Monte Carlo method, 921–922, 922f, 924f Horizontal bar charts, 26, 27f Hosmer-Lemeshow test, 579, 579f binary logistic regression model, 554 Huber-White robust standard error estimation, 491 Stata, regression models estimation, 507 Hypergeometric distribution, 148–149, 148f, 149b Hypotheses tests bilateral test, 199 critical region (CR), 199, 199f nonparametric tests, 201 (see also Nonparametric tests) parametric tests, 201 (see also Parametric tests) P-value, 201 statistical hypothesis, 199 type I error, 200, 200t type II error, 200, 200t unilateral test, 199 left-tailed test, 199, 200f right-tailed test, 200, 200f

I IBM SPSS Statistics Software ANOVA, 525, 527f binary logistic regression model, 591–601, 592–597f, 600–601f

Box-Cox transformations, 527 C chart, 967, 967–968f confidence levels and intercept exclusion, 519, 519f Cp, Cpk, Cpm and Cpmk indexes, 975–976, 975–976f Cronbach’s alpha, 436, 436–437f Dependent box, 516–518, 518f Descriptives Option, 74–76, 77f Descriptives dialog box, 76, 77f Options dialog box, 76, 78f Options, summary measures, 76, 78f Durbin-Watson test, 528, 528f result, 529, 529f estimation, 516, 517f excluded variables, 519–523 Explore Option, 77–78, 79f boxplot, 79, 82f, 83 Descriptives Option, results, 78, 81f Explore dialog box, 78, 79f histogram, 79, 82f, 83 Outliers option results, 79, 81f Percentiles option, results, 78–79, 81f Plots dialog box, 78, 80f Statistics dialog box, 78, 80f stem-and-leaf chart, 79, 82f, 83 Frequencies Option, 73f Charts, 74, 76f Charts dialog box, 74, 75f Frequencies dialog box, 73, 73f frequency distribution table, 74, 76f qualitative and quantitative variables, 72 Statistics dialog box, 73, 74f Statistics, summary measures, 74, 75f heteroskedasticity problem, 524 hierarchical agglomeration schedules, 350, 351f allocation, 356, 357f clustering stage, 354, 354f dendrogram, 350, 352f, 354, 355f distance measure, 361, 362f Euclidian distance, 350–351, 353, 353f, 360, 361f linkage method, 353, 353f matrix, 350, 352f means, 358, 360–361f multidimensional scaling, 360–361, 362f nominal (qualitative) classification, 356, 358f Number of clusters, 356, 356–357f one-way analysis of variance, 358, 358–359f two-dimensional chart, 361, 363f variables selection, 350, 351f HLM2 (see Two-level hierarchical linear model, clustered data) HLM3 (see Three-level hierarchical linear model, repeated measures) Independent(s) box, 516–518, 518f lndist variable, 527, 528f multicollinearity diagnostic, 523, 524f multinomial logistic regression model, 602–606, 602–604f

negative binomial regression model, 676–685, 680–685f nonhierarchical k-means agglomeration schedule, 364–367, 364–368f nonparametric tests binomial test, 253, 254–255f chi-square test, 256–257, 257–258f, 279, 279–280f, 297, 297–299f Cochran’s Q test, 288, 289–290f Friedman’s test, 293, 294–295f Kruskal-Wallis test, 302, 302–303f Mann-Whitney U test, 284, 284–285f McNemar test, 264, 265–266f sign test, 260, 260–261f, 268–269, 269–270f Wilcoxon test, 274, 274–275f normality plots with tests, 523, 523f np chart, 960–963, 963b, 964f outputs, 519, 520f, 522f parameter and confidence intervals selection, 518–519, 518f parametric tests one-way ANOVA, 236–237, 237–238f Student’s t-test, 221, 222–223f, 225–226, 226–227f, 229–230, 230–231f two-way ANOVA, 242–244, 243–246f P chart (defective fraction), 959–960, 960–963f Poisson regression model, 664–675, 665–679f predicted values, 519, 521f principal components factor analysis algebraic solution, 410 communalities, 415, 415f dataset, 418, 419f Display factor score coefficient matrix, 412, 412f eigenvalues and variance, 413, 413f, 416, 418f factor analysis, 410, 410f factor extraction method, 411–412, 411f factor loadings, 414, 414f factor scores, 414, 414f initial options, 411, 411f KMO statistic and Bartlett’s test of sphericity, 413, 413f loading plot, 415–416, 415f, 417f Pearson’s correlation coefficients, 413, 413f, 418–419, 420f ranking, 420, 422f rotated factor loadings, 416, 417f rotated factor scores, 416, 418f rotation angle, 416, 418f rotation method, 412, 412f Save as variables option, 416, 416f sorting, 421, 422f variable creation, 420, 421f variables selection, 410, 410f Varimax orthogonal rotation method, 416, 416f R chart, 948–952f, 949 residuals behavior, 524, 525f residual sum of squares, 524, 526f RESUP variable, 525, 527f Shapiro-Wilk normality test result, 523, 523f

square of the residuals, 524, 526f stepwise procedure selection, 519, 521f U chart, 970, 971f univariate tests for normality normality test selection, 206, 207f procedure, 206, 207f tests results, 207–208, 208f variable selection, 206, 207f VIF and Tolerance statistics, 523, 524f Independent events, 128, 130 Integer programming (IP), 714–717, 731, 734–736 binary integer programming (BIP), 887 (see also Binary programming (BP)) characteristics, 887, 888b facility location problem, 905–906b, 905f, 906t candidate locations, 902 Excel Solver, 907–908, 907–909f, 907–908b modeling, 902–906 network programming problem, 902 heuristic procedure, 887 knapsack problem, 890–891b, 890t decision variables, 890 Excel Solver, 891–893, 891–893f, 892b mathematical formulation, 890 model parameters, 890 linear relaxation, 888, 889b, 889f metaheuristic procedure, 887 mixed binary programming (MBP), 887 mixed integer programming (MIP), 887 rounding, 888–890 staff scheduling problem, 908–912, 909–912b, 911f, 913–914f Interpolations, 449–450 Interquartile range/interquartile interval (IQR/ IQI), 37, 52 Intersection, 128, 128f Interval estimation, 190, 190b. See also Confidence intervals Intraclass correlation, 993, 997

J Jaccard index, 323 Job assignment problem, 868, 868f Excel Solver, 870, 871–872f, 871b mathematical formulation, 869–870, 869–870b, 869t Joint frequency distribution tables qualitative variables contingency/crossed classification/ correspondence table, 93 example, 94–101b, 94–95t, 96–101f marginal totals, 94–101 quantitative variables, 114 Judgmental/purposive sampling, 175, 175b

K Kaiser criterion, 393 Kaiser-Meyer-Olkin (KMO) statistic, 387, 389, 389t, 399, 413, 413f, 423, 424f Karhunen-Loève transformation, 384

Kernel density estimate, 503, 503f Stata, regression models estimation, 511–513, 514f King’s method, 46–47, 47b Knapsack problem, 890–891b, 890t decision variables, 890 Excel Solver, 891–893, 891–893f, 892b mathematical formulation, 890 model parameters, 890 Knowledge discovery in databases (KDD), 984, 985f Kolmogorov-Smirnov (K-S) test, 1177t univariate tests for normality, 201–203, 202–203t, 202–203b Kruskal-Wallis test, 1190t K independent samples, 299–304, 300–302b, 301t, 301–304f

L Lagrange multiplier (LM), 489–490 Latent root criterion, 393 Levene’s F-Test, 214–216, 214–216b, 215–216t SPSS Software procedure, 216, 217–218f results, 216, 218, 218f variables selection, 216, 217f Stata Software, 218–219, 219f Lifestyle Natural Juices Manufacturer, 798, 802–804f, 802b Lifetime, 157 Likelihood-ratio tests, 993, 995 Likert scale, 17, 314, 384 Linear Interactive and Discrete Optimizer (LINDO) Systems, 773–774 Linear mixed models (LMM), 988 Linear programming (LP) problems additivity assumption, 713 Advanced Integrated Multidimensional Modeling Software (AIMMS), 774 aggregated planning problem, 734–736b, 734t binary programming (BP), 733 decision variables, 733 general formulation, 733 integer programming (IP) model, 734–736 mixed-integer programming (MIP) problem, 734–736 model parameters, 733 nonlinear programming (NLP), 733 resources, 732 basic solution (BS), 755 basic variables (BV), 755, 755b blending/mixing problem, 717–719, 717–719b, 718t canonical form, 710, 712b capital budget problems, 721–724, 722–724b, 722–723t certainty, 713 continuous function, 709 convex and nonconvex sets, 748, 748f CPLEX, 774 degenerate optimal solution, 753–754, 754f, 754b diet problem, 720–721, 720t, 720–721b divisibility and non-negativity, 713

Linear programming (LP) problems (Continued) equality constraint, 711 Excel Solver, 775–779, 775–779f diet problem, 788–790, 789–791f, 789b farmer’s problem, 790–792, 791b, 791–793f Lifestyle Natural Juices Manufacturer, 798, 802–804f, 802b Naturelat Dairy, 784–786, 785b, 785–787f Oil-South Refinery, 787–788, 787b, 788–789f portfolio selection, 793–797, 793–797f, 794b, 796b production and inventory problem, Fenix&Furniture, 798, 799–801f, 800b Venix Toys, 779–783, 780b, 780–784f, 784b feasible basic solution (FBS), 755 feasible solutions, 709 free variable, 711 General Algebraic Modeling System (GAMS), 774 inequality constraint, 711 linear function and constraints, 709 Linear Interactive and Discrete Optimizer (LINDO) Systems, 773–774 A Mathematical Programming Language (AMPL), 774 MINOS, 774 multiple optimal solutions, 751–752, 751–752b, 752f nonbasic variables (NBV), 755, 755b no optimal solution, 753, 753b optimal solution, 709, 747 maximization problem, 748–750, 748–750b, 749f minimization problem, 750–751, 750–751f, 750–751b Optimization Subroutine Library (OSL), 774 portfolio selection problem, 726–728b, 726–728t financial investments, 724 investment portfolio risk minimization, 725–728 investment portfolio’s expected return, 724–725 Markowitz’s model, 724 production and inventory problem costs and capacity, 730t decision variables, 729 demand per product and period, 730t general formulation, 729 integer programming (IP) problem, 731 inventory balance equations, 731 maximum inventory capacity, 732 maximum production capacity, 731 model parameters, 729 non-negativity constraints, 729–732 optimal solution, 732t production mix problem, 713–717, 714–717b, 715–716t proportionality assumption, 712–713, 713f resource optimization problems, 713 sensitivity analysis, 747, 807–808b

Excel Solver, 818–822, 818–823f independent constraints terms, 807 objective function coefficients, 808–812, 808–809f, 810–811b reduced cost, 816–818, 817–818b shadow price, 812–816, 813–816b Simplex method (see Simplex method) slack variable, 711 software packages, 773 Solver error messages, unlimited and infeasible solutions no optimal solution, 800, 805f Solver Results dialog box, 798 unlimited objective function z, 798–799, 804–805f Solver results Answer Report, 802–806, 806f Excel spreadsheets, 800 Limits Report, 806–807, 806f standard form, 709–710, 711–712b standard maximization problem, 711 surplus variable, 711 unlimited objective function z, 752–753, 752–753b, 753f viable/feasible solution, 747 XPRESS, 774 Linear regression models analysis of variance (ANOVA), 457, 458f confidence levels dataset, 465, 466t dispersion of points, 463, 464f inclusion/exclusion criteria, 464–465, 465b null hypothesis rejection, 464–465 for parameters, 462–463, 463–464f predicted time vs. distance traveled, 465–466, 466–467f degrees of freedom, 457 dummy variables ceteris paribus condition, 474 confidence interval amplitudes, 479 criteria, 476, 476t dataset, 473, 473t driving style variable, 474, 476, 476t F-test, 474 GDP growth, 472 joint selection, 473, 475f, 477, 477f outputs, 473, 475f, 477–478, 478–479f qualitative explanatory variable, 472–473, 474t random weighting, 472 substitution of, 476, 477t t-test, 474 explanatory power, 453, 454f coefficient of determination, 453–456, 455t, 455–456f residual sum of squares (RSS), 451 sum of squares due to regression (SSR), 451 total sum of squares (TSS), 451 explanatory variables, 443 F significance level, 457, 458f F statistic, 457 F-test, 456 functional form, 480 metric/quantitative variable, 443

multiple models, 443–444 (see also Multiple linear regression models) null hypothesis, nonrejection, 460, 462, 462f OLS method (see Ordinary least squares (OLS) method) predicted value and parameters, 444 P-values, 459–460 quantitative dependent variable, 443 residual error, 444 simple linear regression model, 443–444, 444f SPSS (see IBM SPSS Statistics Software) standard error, 459, 460f statistical tests, 443 t statistic, 457–459 t-test, 457–459 coefficients and significance, 459, 461f significance levels, 460, 461f Linear specification, 988 Linear trend model random intercepts, 1020–1023, 1021–1023f random intercepts and slopes, 1023–1027, 1024–1025f, 1027–1028f Line chart, 920, 922, 922–923f Line graphs, 21, 30–31, 30t, 30–31b, 31f Logarithmic likelihood function, 994–995

M Mahalanobis distance, 379 Manhattan distance, 319 Mann-Whitney U test, 1185–1188t two independent samples, 281–286, 282–283t, 282–284b, 284–286f Markowitz’s model, 724 Maximum flow problem destination node, 876–877 Excel Solver, 879–881, 880–881f, 880b mathematical formulation, 878–879, 878–879b, 878f Maximum likelihood estimation (MLE), 192, 994, 1005 binary logistic regression model, 542–547, 542–544t, 545–547f multinomial logistic regression model, 564–570, 565–566t, 567–568f, 569–570t, 570f negative binomial regression model dataset, 636, 637t histogram, 637, 638f mean and variance, 637, 637t parameters estimation, 639, 640f results, 638, 640t Solver window, 638, 639f Poisson regression model, 621t dependent variable mean and variance, 621, 621t, 622f Excel Solver tool, 622, 624, 624f log-linear model, 625 non-negative and discrete values, 620 overdispersion, 621–622 parameters estimation, 625, 626f rate of incidence, 622–623, 623t, 623f results, 624, 625t Stata, regression models estimation, 500 McFadden pseudo R2, 548

McNemar test, 262–264, 263t, 263–264b, 265–266f Mean absolute deviation (MAD), 725 Mean arrivals rate, 157 Measurement, definition, 9 Median continuous data, 44–45, 44t, 44–45b grouped discrete data, 43–44, 43t, 43–44b ungrouped discrete and continuous data, 42–43, 42t, 42–43b Method of moments, 190–191, 191b, 191t Minimum cost method, 849, 850–851t, 850–852b Minimum path problem. See Shortest path problem Minimum spanning tree, 837, 838f Minkowski distance, 318 MINOS, 774 Mixed binary programming (MBP), 887 Mixed effects logistic regression models, 1052–1053 Mixed-integer programming (MIP) problem, 734–736, 887 Mode continuous data, 46–47, 46–47b, 46t grouped qualitative/discrete data, 45–46, 45–46b, 46t ungrouped data, 45, 45t, 45b Monte Carlo method application, 920 Excel Data Analysis, 920 histogram, 921–922, 922f, 924f line chart, 920, 922, 922–923f profit and loss forecast, 926–928, 928–931f random number generation and probability distributions, 920–921, 921f, 923f red wine consumption, 923–925, 924–928f frequency distribution, 920 histogram, 920 Manhattan project, 919 probability density functions (PDF), 920 risks and uncertainties, 920 Multilevel modeling, 987–988 Multilevel negative binomial regression model, 1059 Multilevel Poisson regression model, 1059 Multinomial logistic regression model, 311, 539 confidence intervals, 574–575, 574–575t event, 563 logits, 563 maximum likelihood, 564–570, 565–566t, 567–568f, 569–570t, 570f occurrence probabilities, 563 SPSS (see IBM SPSS Statistics Software) Stata (see Stata Software) statistical significance, 570–574, 571–572f Multiple linear regression models, 393, 443–444, 991 ceteris paribus concept, 467 dataset, 468, 468t explanatory variables, 470, 471f multicollinearity, 472 null hypothesis, nonrejection, 472

outputs, 470, 471f parameters calculation, 469–470, 469–470t residual sum of squares, 468 Stata (see Stata, regression models estimation) time equation, 470 Multivariate normal distribution, 394, 993 Mutually excluding/exclusive events, 128, 128f, 130

N Naturelat Dairy, 784–786, 785b, 785–787f Nearest-neighbor/single-linkage method, 327–332, 331f, 331t, 333f Negative binomial distribution, 147–148, 147f, 148b Negative binomial regression model confidence intervals, 644, 644t Gamma distribution, 634, 635f, 635t maximum likelihood dataset, 636, 637t histogram, 637, 638f mean and variance, 637, 637t parameters estimation, 639, 640f results, 638, 640t Solver window, 638, 639f mean, 636 negative binomial type 1 (NB1) regression model, 636 negative binomial type 2 (NB2) regression model, 636 occurrence probability, 634 overdispersion, 634 Poisson distribution, 634 probability distribution function, 634 quantitative variable, 633 SPSS (see IBM SPSS Statistics Software) Stata (see Stata Software) statistical significance, 641–643, 641–642f variance, 636 Nested data structures, 987–991, 989–990f, 989–990t Network programming classic transportation problem, 838f, 839–841b, 840t, 840f algorithm (see Transportation algorithm) balanced transportation problem, 839 decision variables, 838 Excel Solver, 856–860, 856f, 857b, 858–861f general formulation, 839 model parameters, 838 Simplex method, 839 supply chain, 838 total supply capacity and total demand, 841–845, 841f, 841–845b, 842t, 843f, 844t, 845f demand nodes, 835 directed and undirected arc, 836, 836f directed and undirected cycle, 837 directed and undirected path, 837 directed network, 836–837, 837f graph, 835, 836f Hamiltonian path, 837

job assignment problem, 868, 868f Excel Solver, 870, 871–872f, 871b mathematical formulation, 869–870, 869–870b, 869t maximum flow problem destination node, 876–877 Excel Solver, 879–881, 880–881f, 880b mathematical formulation, 878–879, 878–879b, 878f minimum spanning tree, 837, 838f network, definition, 835, 836f shortest path problem Excel Solver, 875, 876b, 876–877f mathematical formulation, 873–875, 874–875b, 874f supply capacity node, 870–873 subgraph, 837 supply nodes/sources, 835 transhipment problem (TSP) Excel Solver, 866–868, 866–868f, 866–867b intermediate transhipment points, 860–862 mathematical formulation, 862–866, 862f, 864–865f, 864t, 864–866b stages, 860–862 transportation unit cost, 862 transshipment nodes, 835 tree structure, 837, 837f Nonbasic variables (NBV), 755, 755b Nonhierarchical k-means agglomeration schedule, 338–339 arbitrary allocation, 341, 341t, 342f Euclidian distance, 346, 346t explanatory variable, 349–350 F significance level, 348, 349f F-test, 340 logical sequence, 339 mean, 347, 347t one-way analysis of variance (ANOVA), 348, 349t procedure, 339, 339f reallocation, 342–345t, 343f, 344 solution, 345–346, 346f SPSS (see IBM SPSS Statistics Software) Stata (see Stata Software) variation and F statistic, 348t Z-scores, 340 Nonlinear programming (NLP), 733 Nonlinear regression models, 443 binary and multinomial logistic models, 497 Box-Cox transformation, 497–498, 497b exponential specification, 495, 496f, 496b linear specification, 495, 496f, 496b nonlinear behavior, 495, 495f Poisson and negative binomial regression models, 497 quadratic specification, 495, 496f, 496b semilogarithmic specification, 495, 496f, 496b Nonmetric/qualitative variables dichotomous/binary variable (dummy), 16, 17t nominal scale, 10, 10t arithmetic operations, 10

Nonmetric/qualitative variables (Continued) Data View, 11, 11f descriptive statistics, 10, 12 labels, 11, 12t, 12f, 14f Value Labels, 11, 14f variable selection, 11, 13f Variable View, 10–11, 11f ordinal scale, 12–14, 15t, 15f polychotomous, 16 scales of accuracy, 16, 16f Nonparametric tests advantages, 249 classification, 250, 250t disadvantages, 249 K independent samples chi-square test, 295–299, 296t, 296b, 296–299f Kruskal-Wallis test, 299–304, 300–302b, 301t, 301–304f K paired samples Cochran’s Q test, 286–290, 287–288b, 287t, 288–290f Friedman’s test, 290–295, 291–292t, 291–293b, 293–295f one sample binomial test, 250–254, 251–253b, 252t, 253–255f chi-square test, 255–257, 255–259f, 256b, 256t sign test, 257–262, 259–260b, 259t, 260–262f two independent samples chi-square test, 276–280, 277–278t, 277–278b, 277–280f Mann-Whitney U test, 281–286, 282–283t, 282–284b, 284–286f two paired samples McNemar test, 262–264, 263t, 263–264b, 265–266f sign test, 264–270, 267–268b, 267–268t, 268–271f Wilcoxon test, 270–276, 272–273t, 272–274b, 273–276f Nonrandom sampling, 169, 170f advantages and disadvantages, 169–170 convenience sampling, 175, 175b geometric propagation/snowball sampling, 177, 177b judgmental/purposive sampling, 175, 175b quota sampling, 176–177, 176–177b, 176–177t Northwest corner method, 847, 848–849t, 848–850b, 854–856b np chart, 960–963, 963b, 964f

O Oblique rotation methods, 398 Ochiai similarity coefficient, 323 Odds ratios, 581, 581f, 1055 Oil-South Refinery, 787–788, 787b, 788–789f One-stage cluster sampling, 173–174, 174b, 183 finite population, sample size mean estimation, 184 proportion estimation, 185

infinite population, sample size mean estimation, 184 proportion estimation, 184–185 One-way analysis of variance (one-way ANOVA), 937–938 Optimization models business modeling, 736 classification, 708, 708f constraints, 708 decision and parameter variables, 707 decision concept, 707 elements, 707 linear programming (LP) (see Linear programming (LP)) objective function, 708 real system behavior, 707, 708f Optimization Subroutine Library (OSL), 774 Ordinary Gauss-Hermite quadrature, 1005 Ordinary least squares (OLS) method, 191–192 autocorrelation Breusch-Godfrey test, 493–494 causes, 492, 492f consequences, 493 data time evolution, 491 Durbin-Watson test, 493, 493f first-order autocorrelation, 492 generalized least squares method, 494 residuals problem, 492, 492f Box-Cox transformations, 480–481 calculation spreadsheet, 448, 448t conditional mean, 445 data analysis box, 450, 451f data insertion, 450, 452f dataset, 445, 445t, 448, 448f dependent variable, 444–445 Excel Regression tool, 450 expected value, 445 explanatory variable, 445 extrapolations, 449–450 heteroskedasticity Breusch-Pagan/Cook-Weisberg test, 489–490 chi-square distribution, 490 consequences, 489 discretionary income, 489, 489f Huber-White method, 491 learning process, 488 probability distribution, 488 problem, 488, 488f residual vector, 490 trial and error models, 488, 488f weighted least squares method, 490 interpolations, 449–450 linear regression estimation, 450, 451f linktest, 480–481, 494–495 multicollinearity auxiliary regressions, 487 causes of, 481–482 Class A model, 483, 483t, 484f Class B model, 484–485, 485f, 485t Class C model, 485–486, 486t, 486f consequences, 482–483 correlation matrix, 487 dependent variable, 481

matrix determinant, 487 matrix form, 481 orthogonal factors, 487 parameter estimation, 481 Tolerance, 487 t statistics, 487 VIF, 487 normal distribution of residuals, 480, 480f presuppositions, 479, 480b regression equation coefficients, 450, 453f RESET test, 480–481, 495 residuals conditions, 446, 447f residual sum of squares minimization, 449, 450f Shapiro-Wilk test/Shapiro-Francia test, 480 simple linear regression model, 445, 450, 450f equation, 448 outputs, 450, 452f Solver tool, 448–449, 449f travel time vs. distance traveled, 445, 446f Orthogonal rotation method, 397 Overall model efficiency (OME), 560 Overdispersion negative binomial regression model, 634 Poisson regression model, 632–633, 632t, 633f Stata Software, 648, 648f, 658

P Parametric tests ANOVA (see Analysis of variance (ANOVA)) population mean Student’s t-test (see Student’s t-test) Z test, 219–220, 219–220b univariate tests for normality Kolmogorov-Smirnov (K-S) test, 201–203, 202–203t, 202–203b Shapiro-Francia (S-F) tests, 205–206, 205–206b, 206t Shapiro-Wilk (S-W) test, 203–205, 204t, 204–205b SPSS Software (see IBM SPSS Statistics Software) Stata (see Stata Software) variance homogeneity tests Bartlett’s χ2 test, 210–212, 211–212b, 211t Cochran’s C test, 212–213, 213b Hartley’s Fmax test, 213–214, 213–214b Levene’s F-Test, 214–218, 214–216b, 215–216t null hypothesis, 210 population variance, 210 Pareto chart, 21, 28–30, 29t, 29–30b, 30f Partial correlation coefficients, 387–388 Pascal distribution, 147–148, 147f, 148b P chart, 959, 959f defective fraction, 959–960, 960–963f Pearson’s contingency coefficient, 107 Pearson’s correlation coefficient, 315, 398, 399t, 413, 413f, 418–419, 420f, 427, 428f bivariate descriptive statistics, 119–121, 119–121b, 119–121f

Pearson’s first coefficient of skewness, 62–63, 62–63b Pearson’s linear correlation, 384 correlation matrix, 385 dataset model, 385t factor extraction, 385, 388f latent dimensions, 386 linear adjustments, 385, 387f three-dimensional scatter plot, 385, 386f Pearson’s second coefficient of skewness, 63, 63b Percentile coefficient of kurtosis. See Coefficient of kurtosis Percentiles, 48–52 continuous data, 50–52, 51–52b, 51t grouped discrete data, 50, 50b ungrouped discrete and continuous data, 48–50, 49–50b Permutations, 135, 135b Phi coefficient, 106, 106–108b, 107t, 109–110f Pie charts, 21, 27–28, 28b, 28t, 28f Point estimation, 189, 189b maximum likelihood estimation, 192 method of moments, 190–191, 191b, 191t ordinary least squares (OLS), 191–192 Poisson distribution, 149–151, 150f, 150–151b Poisson regression model confidence intervals, 630–632, 631t, 632b dependent variable, 618 distribution, 619, 619f equidispersion of, 620 explanatory variable, 618 incidence rate ratio, 618 maximum likelihood, 621t dependent variable mean and variance, 621, 621t, 622f Excel Solver tool, 622, 624, 624f log-linear model, 625 non-negative and discrete values, 620 overdispersion, 621–622 parameters estimation, 625, 626f rate of incidence, 622–623, 623t, 623f results, 624, 625t mean, 620 overdispersion, 632–633, 632t, 633f probability of occurrence, 619, 619t SPSS (see IBM SPSS Statistics Software) Stata (see Stata Software) statistical significance, 626–630, 627–628f variance, 620 Polychotomous variable, 16 Population definition, 169 finite, 169 infinite, 169 moment of distribution, 190 Portfolio selection problem, 726–728b, 726–728t Excel Solver, 793–797, 793–797f, 794b, 796b financial investments, 724 investment portfolio risk minimization, 725–728 investment portfolio’s expected return, 724–725 Markowitz’s model, 724

Position/location measures BACON algorithm, 52 boxplot, 53–54, 53f central tendency arithmetic mean, 38–42, 38–41t, 38–42b median, 42–45, 42–44t, 42–45b mode, 45–47, 45–47b, 45–46t interquartile range (IQR), 52 outlier identification methods, 52, 53b quantiles, 48–52 deciles, 48 percentiles, 48–52 quartiles, 47–48 Principal components factor analysis Bartlett’s test of sphericity, 387, 389–390 clusters, 383, 390 coefficient of determination, 395 communality, 394 confirmatory factor analysis, 383 confirmatory techniques, 405 correlation coefficients, 383 correlation matrix, 391 Cronbach’s alpha’s magnitude, 390 dataset, 398, 398t eigenvalues, 391, 408, 408t eigenvectors, 391–392, 401–402 exploratory factor analysis, 383 exploratory multivariate technique, 383 factor loadings, 394, 394t, 404, 404t, 406–407t factor rotation Direct Oblimin methods, 398 loading plot, 395, 396f, 407f, 408 loadings, 397, 407, 407t oblique rotation methods, 398 original factors, 395, 396t, 396f orthogonal rotation method, 397 Promax methods, 398 scores, 397 factor scores, 390–393, 403, 404t first-order correlation coefficients, 387–389, 399, 399t higher-order correlation coefficients, 387–389 Kaiser criterion, 393 Kaiser-Meyer-Olkin (KMO) statistic, 387, 389, 389t, 399 Karhunen-Loève transformation, 384 latent root criterion, 393 Likert scale, 384 loading plot, 405, 405f mental factors, 384 middling, 400 multiple linear regression model, 393 multivariate normal distribution, 394 partial correlation coefficients, 387–388 Pearson’s correlation coefficients, 398, 399t Pearson’s linear correlation, 384 correlation matrix, 385 dataset model, 385t factor extraction, 385, 388f latent dimensions, 386 linear adjustments, 385, 387f three-dimensional scatter plot, 385, 386f

second-order correlation coefficients, 387–389, 399, 399t significance level, 400, 400f SPSS (see IBM SPSS Statistics Software) Stata (see Stata Software) structural equation modeling, 383 uncorrelated factors, 383 variance table, 401, 401t weighted rank-sum criterion, 408, 409t zero-order correlation coefficients, 387–388 Probability density functions (PDF), 920 negative binomial regression model, 634 Probability theory Bayes’ theorem, 132–133, 132–133b combinatorial analysis arrangement, 133–134, 133–134b combinations, 134, 134b definition, 133 permutations, 135, 135b complement, 128, 128f, 130 conditional probability, 131 multiplication rule, 131–132, 131–132b, 132t definition, 129 empty set, 129 events, 127, 129b independent, 128, 130 mutually excluding/exclusive, 128, 128f, 130 intersection, 128, 128f random experiment, 127 sample space, 127, 129 union, 127, 128f variation field, 129 Probability variation field, 129 Process, flowchart, 935, 936f Production and inventory problem costs and capacity, 730t decision variables, 729 demand per product and period, 730t Fenix&Furniture, Excel Solver, 798, 799–801f, 800b general formulation, 729 integer programming (IP) problem, 731 inventory balance equations, 731 maximum inventory capacity, 732 maximum production capacity, 731 model parameters, 729 non-negativity conditions, 729–732 non-negativity constraints, 732 optimal solution, 732t Production mix problem, 713–717, 714–717b, 715–716t Promax methods, 398 Proportional stratified sampling, 173 Pythagorean distance formula, 316, 317f

Q Qualitative variables bivariate descriptive statistics chi-square statistic, 102–110, 102–110b, 102–103t, 104–106f, 107t, 109–110f joint frequency distribution tables (see Joint frequency distribution tables)

Qualitative variables (Continued) Spearman’s coefficient, 110–113, 110f, 111–113b, 111t, 112–113f frequency distribution tables, 22–23, 22–23b, 22–23t univariate descriptive statistics bar charts, 21, 26–27, 26t, 26–27b, 27f Pareto chart, 21, 28–30, 29t, 29–30b, 30f pie charts, 21, 27–28, 28b, 28t, 28f Quantile regression models dependent variables, 533 leverage distances, 532 median regression models, 532 normality of residuals, 533 Stata bacon algorithm, 533 conditional distribution, 537 dependent variable, 533, 534f, 538, 538f median regression model outputs, 535, 535f nonconditional median, 535 OLS regression model, 534 parameter estimation, 536, 536–537f Quantiles, 48–52 deciles, 48 percentiles, 48–52 quartiles, 47–48 Quantitative variables bivariate descriptive statistics covariance, 118, 118b Pearson’s correlation coefficient, 119–121, 119–121b, 119–121f scatter plot, 114–118, 114–118f, 115–116b continuous, 16 discrete, 16 interval scale, 15 ratio scale, 15 scales of accuracy, 16, 16f univariate descriptive statistics boxplots/box-and-whisker diagram, 21, 37–38, 37f histograms, 21, 32–34, 32–33b, 32–33t, 33–34f line graphs, 21, 30–31, 30t, 30–31b, 31f scatter plot, 21, 31–32, 31–32b, 31t, 32f stem-and-leaf plots, 21, 34–37, 35–36t, 35–37b, 36–37f Quartiles, 47–48 continuous data, 50–52, 51–52b, 51t grouped discrete data, 50, 50b ungrouped discrete and continuous data, 48–50, 49–50b Quota sampling, 176–177, 176–177b, 176–177t

R Random coefficients models, 988 Random effects parameters, 988 Random experiment, 127 Random intercepts and slopes model, 1006–1011, 1007–1011f, 1037–1038, 1038f Random intercepts model, 993, 1004–1005, 1004f, 1006f, 1036, 1037f Randomized block design (RBD), 937, 937f Random sampling, 169, 170f advantages and disadvantages, 169

one-stage cluster sampling, 173–174, 174b simple random sampling (SRS), 170–172, 171t, 171–172b stratified sampling, 173, 173b systematic sampling, 172, 172b two-stage cluster sampling, 174, 175b Random slopes model, 993 Random variables continuous random variable, 139–141, 139f, 140–141b chi-square distribution, 159–160, 159–160f, 160b exponential distribution, 156–157, 156f, 157b gamma distribution, 157–158, 158f normal distribution (see Gaussian distribution) Snedecor’s F distribution, 162–164, 163b, 163f, 164t Student’s t distribution, 160–162, 161f, 162b uniform distribution, 151–152, 151t, 152f, 152b discrete random variable, 137–139, 138–139b Bernoulli distribution, 142–144, 143–144b, 143f binomial distribution, 144–145, 144f, 145b discrete uniform distribution, 141–142, 141f, 142t, 142b geometric distribution, 145–147, 146f, 146–147b hypergeometric distribution, 148–149, 148f, 149b negative binomial distribution, 147–148, 147f, 148b Poisson distribution, 149–151, 150f, 150–151b random experiment, 137 Range, 54 R chart, 947–952f, 948–949 Reduced cost, 816–818, 817–818b Reduced maximum likelihood, 994 Reduced normal distribution, 153 Regression models negative binomial regression model, 617, 618f Poisson model, 617, 618f (see also Poisson regression model) Regression specification error (RESET) test, 495 Relative cumulative frequency, 22 Relative frequency, 22 Residual error, 444 Residual sum of squares (RSS), 451, 468 minimization, 449, 450f two-way ANOVA, 240 Restricted maximum likelihood (REML), 994, 1007–1008 Robit regression models, 610f Bernoulli distribution, 609 definition, 608 event occurrence, 609–610, 610t logistic distribution, 609 sigmoid function, 609 Stata, 611–615, 611–614f Rogers and Tanimoto similarity coefficient, 323

Rule, definition, 9 Russell and Rao similarity coefficient, 323

S Sample moment of distribution, 190 Sample space, 127, 129 Sampling definition, 169 nonprobability sampling (see Nonrandom sampling) population definition, 169 finite, 169 infinite, 169 probability sampling (see Random sampling) types, 169 Scale, definition, 9 Scatter plot, 21, 31–32, 31–32b, 31t, 32f, 312, 312f negative linear relationship, 114, 115f positive linear relationship, 114, 114f SPSS, 116f chart type, 115, 116f Simple Scatterplot dialog box, 115, 117f variables, 115, 117f on Stata, 116, 118f Second-order correlation coefficients, 387–389, 399, 399t Shadow price, 812–816, 813–816b Shape measures kurtosis coefficient of kurtosis, 65, 66f coefficient of kurtosis on Stata, 66–68, 67–68b, 67t definition, 65 Fisher’s coefficient of kurtosis, 66 leptokurtic curve, 65, 66f mesokurtic curve, 65, 65f platykurtic curve, 65, 65f skewness Bowley’s coefficient of skewness, 63–64, 64b coefficient of skewness on Stata, 64–65 Fisher’s coefficient of skewness, 64 left/negative skewness, 61, 62f Pearson’s first coefficient of skewness, 62–63, 62–63b Pearson’s second coefficient of skewness, 63, 63b right/positive skewness, 61, 62f symmetrical distribution, 61, 62f Shapiro-Francia (S-F) tests, 480 Stata, regression models estimation, 509, 509f, 511, 512f univariate tests for normality, 205–206, 205–206b, 206t Shapiro-Wilk (S-W) test, 480, 1178–1179t result, 523, 523f Stata, regression models estimation, 503–504, 504f univariate tests for normality, 203–205, 204t, 204–205b Shortest path problem Excel Solver, 875, 876b, 876–877f

mathematical formulation, 873–875, 874–875b, 874f supply capacity node, 870–873 Sigmoid function, 609 Sign test one sample, 257–262, 259–260b, 259t SPSS Software, 260, 260–261f Stata Software, 261–262, 262f two paired samples, 264–270, 267–268b, 267–268t, 268f SPSS Software, 268–269, 269–270f Stata Software, 270, 271f Simple arithmetic mean, 38–39, 38t, 38–39b Simple linear regression model, 191, 443–445, 444f, 450, 450f equation, 448 outputs, 450, 452f Simple matching coefficient (SMC), 322 Simple random sampling (SRS), 179–180b finite population, sample size mean estimation, 178 proportion estimation, 179 infinite population, sample size mean estimation, 178 proportion estimation, 179 planning and selection, 170 with replacement, 171–172, 172b sample size factors, 177–178 without replacement, 170–171, 171t, 171b Simplex method degenerate optimal solution, 773 description, 758, 758f flowchart, 758, 758f iterative algebraic procedure, 757 maximization problems analytical solution, 758–762, 759–762b, 759f tabular form, 762–769, 763–769b minimization problems, 770–772b tabular form, 769–772, 770f transformation, 769 multiple optimal solutions, 772–773 no optimal solution, 773 unlimited objective function z, 773 Simulation definition, 919 Monte Carlo simulation (see Monte Carlo method) Sneath and Sokal similarity coefficient, 324 Snedecor’s F distribution, 162–164, 163b, 163f, 164t, 1157–1162t Snowball sampling, 177, 177b Spearman’s coefficient, 110–113, 110f, 111–113b, 111t, 112–113f Staff scheduling problem, 908–912, 909–912b, 911f, 913–914f Standard deviation, 59–60, 59–60b Standard error, 60–61, 60t, 60–61b, 459, 460f Standard maximization problem, 711 Standard normal distribution, 193, 193f, 1167–1169t Stata Software, 4 binary logistic regression model classification table, 583–584, 584f

dataset, 575, 576f dummies creation, 576, 577f frequencies distribution, 575–576, 576–577f Hosmer-Lemeshow test, 579, 579f likelihood-ratio test, 578, 578f linear adjustment, 581, 582f logistic adjustment, 581, 582–583f odds ratios, 581, 581f outputs, 577, 577–578f, 580, 580f probability estimation, 580, 580f ROC curve, 585–586, 586f sensitivity analysis, 582–584, 583–585f sensitivity curve, 585, 585f C chart, 966, 966f Cp, Cpk, Cpm and Cpmk indexes, 977 Cronbach’s alpha, 437–438, 438f hierarchical agglomeration schedules, 368–374, 368f, 369–370t, 370–374f HLM2 (see Two-level hierarchical linear model, clustered data) HLM3 (see Three-level hierarchical linear model, repeated measures) intermediate models (multilevel step-up strategy) and commands, 1033, 1033t multinomial logistic regression model, 586–591, 587f, 589–590f negative binomial regression model, 663–664f dataset, 653, 653f explanatory variables, 659, 659f frequency distribution, 653, 653f goodness-of-fit, 655, 655f histogram, 653, 654f mean and variance, 654, 654f null model, 656, 657f outputs, 655, 656f, 658f, 659, 660f overdispersion, 658 probability distribution, 661, 661–662f results, 654, 655f nonhierarchical k-means agglomeration schedule, 374–376, 375–376f nonparametric tests binomial test, 253–254, 255f chi-square test, 257, 259f, 279–280, 280f, 297–299, 299f Cochran’s Q test, 288–290, 290f Friedman’s test, 293–295, 295f Kruskal-Wallis test, 303–304, 304f Mann-Whitney U test, 285–286, 286f McNemar test, 264, 266f sign test, 261–262, 262f, 270, 271f Wilcoxon test, 275–276, 276f parametric tests Kolmogorov-Smirnov (K-S) test, 209, 209f one-way ANOVA, 237–238, 238f Shapiro-Francia (S-F) test, 210, 210f Shapiro-Wilk (S-W) test, 209–210, 210f Student’s t-test, 221–222, 223f, 227, 227f, 231, 231f two-way ANOVA, 244–245, 246f P chart, 959, 959f Poisson regression model dataset, 645, 645f explanatory variables, 650, 650f

frequency distribution, 645, 645f goodness-of-fit, 649, 649f graph of, 649, 649f, 651–653, 651–652f histogram, 645, 646f incidence rate ratios, 650, 650f maximum logarithmic likelihood function, 647 McFadden pseudo R2, 647 mean and variance, 645, 646f null model, 647, 647f outputs, 646, 646f overdispersion test, 648, 648f principal components factor analysis dataset, 421, 423f eigenvalues and eigenvectors, 424, 424f KMO statistic and Bartlett’s test of sphericity, 423, 424f loading plot, 426, 427f multiple linear regression models, 428, 429–430f outputs, 422, 423f, 424–426, 425–427f Pearson’s correlation coefficient, 427, 428f ranking, 429, 431f rotated factor scores, 427, 428f Z-scores, 427–428 R chart, 947f, 948 regression models estimation augmented component-plus-residuals, 509, 513, 515f Box-Cox transformation, 511, 512f, 513, 515f Breusch-Godfrey test results, 516, 517f Breusch-Pagan/Cook-Weisberg test, 505, 506t, 506f correlation matrix, 499, 500f dataset, 498, 498f distribution adherence, 513 dummy variable, 498, 499f Durbin-Watson test result, 515, 516f frequency distribution, 498, 498f Shapiro-Francia test, 504 heteroskedasticity, graphing method, 504–505, 505f Huber-White robust standard error estimation, 507 Kernel density estimate, 511–513, 514f leverage distance concept, 501–502, 502t, 503f linear adjustment and lowess adjustment, 509, 510f, 511, 512f linktest, 507, 507f logarithmic transformation, 510 maximum likelihood estimation, 500 mfx command, 501, 502f multicollinearity, 499, 504 nonparametric method, 509 null hypothesis, 505 outputs, 500–501, 500–501f, 509, 509f parameter estimation, 501 reg command, 499–500 RESET test, 500, 507–508, 508t, 508f residuals distribution and normal distribution, 503, 503f

Stata Software (Continued) Shapiro-Francia test results, 509, 509f, 511, 512f Shapiro-Wilk test, 503–504, 504f squared normalized residuals, 502 temporal model estimation results, 513, 515f temporal variable, 513, 515f variables—graph matrix, 498–499, 499f, 510, 511f VIF and Tolerance statistics, 504, 504f weighted least squares model, 506–507 White test, 505, 506f Statistical process control (SPC) attributes, 941 control charts, 945–946t, 945–949b, 952–957b, 953–954t C chart, 963–967, 965–966t, 965–967b confidence interval, 943 mean, 952 np chart, 960–963, 963b, 964f parameters, 944 P chart, 957–960, 958t, 958–960b probability, 943, 943f sample size, 944 sigma control limits, 943 SPSS Software, 949, 950–952f, 954–957, 955–956f standard deviations, 944, 950, 952 standard normal distribution, 942 Stata Software, 947–949f, 948 U chart, 967–971, 969–970b, 969–970t line chart, 941 normal distribution, 941–942 process capability Cp index, 972, 974–977b Cpk index, 972–973, 973t, 974–977b Cpm and Cpmk indexes, 973–977, 974–977b quality characteristics, 942 range, 941–942 sample mean, 941–942 sample size, 941 sampling method, 941 standard deviation, 941 variables, 941 Stem-and-leaf plots, 21, 34–37, 35–36t, 35–37b, 36–37f Stevens classification, 9 Stratified sampling, 173, 173b, 182t, 182–183b estimation error, 180 finite population, sample size mean estimation, 181 proportion estimation, 181–183 infinite population, sample size mean estimation, 180–181 proportion estimation, 181 Student’s t distribution, 160–162, 161f, 162b, 194, 194f, 1162–1163t Student’s t-test, 220–221, 220f, 221b independent random samples, 224–225b, 225f, 225t bilateral test, 223, 224f degrees of freedom, 224

SPSS Software, 225–226, 226–227f Stata Software, 227, 227f single sample SPSS Software, 221, 222–223f Stata Software, 221–222, 223f two paired random samples, 228–229t, 228–229b, 229f bilateral test, 228, 228f normal distribution, 227 null hypothesis, 227 SPSS Software, 229–230, 230–231f Stata Software, 231, 231f Sum of squares due to regression (SSR), 451 Systematic sampling, 172, 172b

T Three-level hierarchical linear model, repeated measures, 987, 989, 990f, 990t IBM SPSS Statistics Software linear trend model with random intercepts and slopes, 1042–1045, 1043f, 1045f, 1046t null model, 1040, 1041f, 1042 Stata Software dataset characteristics, 1015, 1015t, 1016f linear trend model, random intercepts, 1020–1023, 1021–1023f linear trend model, random intercepts and slopes, 1023–1027, 1024–1025f, 1027–1028f null model, 1018–1020, 1019f outputs, 1015, 1016f random effects variance-covariance matrix, 1027–1032, 1029–1032f students’ average school performance, 1016–1017, 1017f temporal evolution, 1015–1018, 1016f, 1018f Total sum of squares (TSS), 451 Transhipment problem (TSP) Excel Solver, 866–868, 866–868f, 866–867b intermediate transhipment points, 860–862 mathematical formulation, 862–866, 862f, 864–865f, 864t, 864–866b stages, 860–862 transportation unit cost, 862 Transportation algorithm, 846f balanced transportation model, 846, 846b elementary operations, 847 iteration, 854 minimum cost method, 849, 850–851t, 850–852b northwest corner method, 847, 848–849t, 848–850b, 854–856b optimality test, 853 vogel approximation method, 851, 852–854b, 852–853t Traveling Salesman Problem (TSP), 898–899t, 898–900b, 899f Excel Solver, 901f, 902, 902b, 903–904f formulations, 896–901 Hamiltonian problem, 896, 898f network programming, 896 Tree structure, 837, 837f

t-test, 457–459, 474 coefficients and significance, 459, 461f significance levels, 460, 461f Two-level hierarchical linear model, clustered data, 987–988, 989t, 989f IBM SPSS Statistics Software complete final model, 1038–1040, 1039f null model, 1034–1036, 1034–1035f random intercepts and slopes model, 1037–1038, 1038f random intercepts model, 1036, 1037f Stata Software adaptive quadrature process, 1005 best linear unbiased predictions (BLUPS), 1005 complete random intercepts model, 1011–1014, 1012–1014f dataset characteristics, 998, 999f, 999t generalized linear latent and mixed model (GLLAMM), 1005 maximum likelihood estimation, 1005 null model, 999–1000, 1002–1003, 1002–1003f ordinary Gauss-Hermite quadrature, 1005 random intercepts and slopes model, 1006–1011, 1007–1011f random intercepts model, 1004–1005, 1004f, 1006f students’ average performance per school, 998, 1000–1001f unbalanced clustered data structure, 998, 1000f Two-stage cluster sampling, 174, 175b sample size, 183, 185–186, 186–187t, 186b Two-way ANOVA, 937

U U chart, 970, 971f Uniform distribution, 151–152, 151t, 152f, 152b Uniform stratified sampling, 173 Union, 127, 128f Univariate descriptive statistics, 22f Excel Add-ins dialog box, 68, 70f Data Analysis dialog box, 69, 71f dataset, 68, 68f Data tab, 69, 71f descriptive statistics, 69, 72f Descriptive Statistics dialog box, 69, 71f Excel Options dialog box, 68, 70f File tab, 68, 69f frequency distribution tables, 21 calculations, 22 continuous data, 24–25, 25b, 25t definition, 22 discrete data, 23–24, 23–24b, 23–24t qualitative variables, 22–23, 22–23b, 22–23t IBM SPSS Statistics Software, 69–72 dataset, 72, 72f Descriptives Option, 74–76, 77–78f Explore Option, 77–83, 79–82f Frequencies Option, 72–74, 73–76f qualitative variables

bar charts, 21, 26–27, 26t, 26–27b, 27f Pareto chart, 21, 28–30, 29t, 29–30b, 30f pie charts, 21, 27–28, 28b, 28t, 28f quantitative variables boxplots/box-and-whisker diagram, 21, 37–38, 37f histograms, 21, 32–34, 32–33b, 32–33t, 33–34f line graphs, 21, 30–31, 30t, 30–31b, 31f scatter plot, 21, 31–32, 31–32b, 31t, 32f stem-and-leaf plots, 21, 34–37, 35–36t, 35–37b, 36–37f Stata boxplot, 86–87, 87f frequency distribution table, 83–84, 83f histograms, 85–86, 86f percentiles calculation, 85, 85f stem-and-leaf plot, 86, 86f summary, 84, 84f summary measures dispersion/variability, 21 (see also Dispersion/variability measures) position/location, 21 (see also Position/ location measures) shape, 21 (see also Shape measures)

V Variables definition, 7 descriptive statistics, 17 Likert scale, 17 types, 7, 8f metric/quantitative, 8, 9t, 9f (see also Quantitative variables) nonmetric/qualitative, 7–8, 8t (see also Nonmetric/qualitative variables) scales of measurement, 9–15, 10f Stevens classification, 9 Variance continuous data, 58–59, 58–59b, 59t continuous random variable, 140, 140b definition, 57 discrete random variable, 138, 138t, 138b grouped discrete data, 57–58, 58b, 58t ungrouped discrete and continuous data, 57, 57b Varimax orthogonal rotation method, 397, 416, 416f Venix Toys, 779–783, 780b, 780–784f, 784b Vertical bar charts, 26, 27f Vogel approximation method, 851, 852–854b, 852–853t Vuong test correction, 696, 696f

W Wald z test, 550–551 Weighted arithmetic mean, 39–40, 39–40t, 39–40b Weighted least squares model, 490 Stata, regression models estimation, 506–507 Weighted rank-sum criterion, 408, 409t White test, 505, 506f Wilcoxon test, 1180–1184t two paired samples, 270–276, 272–273t, 272–274b, 273–276f

Y Yule similarity coefficient, 323

Z Zero-inflated regression models, 692b Bernoulli distribution, 691 logarithmic likelihood function, 691 quantitative variable, 690 sampling zeros, 691 Stata negative binomial regression model, 697–703, 698–703f Poisson regression model, 693–697, 693–697f structural zeros, 691 Zero-order correlation coefficients, 387–388 Z-scores, 427–428 Z test, 219–220, 219–220b

The use of the images from the IBM SPSS Statistics Software® has been authorized by the International Business Machines Corporation© (Armonk, New York). SPSS® Inc. was purchased by IBM® in October of 2009. IBM, the IBM logo, ibm.com and SPSS are commercial brands or trademarks that belong to the International Business Machines Corporation, registered in several jurisdictions around the world. The use of the images from the Stata Statistical Software® has been authorized by StataCorp LP© (College Station, Texas).