

Business Analytics: Methods, Models, and Decisions
James R. Evans, University of Cincinnati
Global Edition, Second Edition

Boston Columbus Indianapolis New York San Francisco Amsterdam Cape Town Dubai London Madrid Milan Munich Paris Montréal Toronto Delhi Mexico City São Paulo Sydney Hong Kong Seoul Singapore Taipei Tokyo

Editorial Director: Chris Hoag
Editor in Chief: Deirdre Lynch
Acquisitions Editor: Patrick Barbera
Editorial Assistant: Justin Billing
Program Manager: Tatiana Anacki
Project Manager: Kerri Consalvo
Associate Project Editor, Global Edition: Amrita Kar
Assistant Acquisitions Editor, Global Edition: Debapriya Mukherjee
Project Manager, Global Edition: Vamanan Namboodiri
Manager, Media Production, Global Edition: Vikram Kumar
Senior Manufacturing Controller, Production, Global Edition: Trudy Kimber
Project Management Team Lead: Christina Lepre
Program Manager Team Lead: Marianne Stepanian
Media Producer: Nicholas Sweeney
MathXL Content Developer: Kristina Evans
Marketing Manager: Erin Kelly
Marketing Assistant: Emma Sarconi
Senior Author Support/Technology Specialist: Joe Vetere
Rights and Permissions Project Manager: Diahanne Lucas Dowridge
Procurement Specialist: Carole Melville
Associate Director of Design: Andrea Nix
Program Design Lead: Beth Paquin
Text Design: 10/12 TimesLTStd
Composition: Lumina Datamatics, Inc.
Cover Design: Lumina Datamatics, Inc.
Cover Image: ©bagiuiani/Shutterstock

Pearson Education Limited
Edinburgh Gate
Harlow
Essex CM20 2JE
England

and Associated Companies throughout the world

Visit us on the World Wide Web at: www.pearsonglobaleditions.com

© Pearson Education Limited 2017

The rights of James R. Evans to be identified as the author of this work have been asserted by them in accordance with the Copyright, Designs and Patents Act 1988.

Authorized adaptation from the United States edition, entitled Business Analytics, 2nd edition, ISBN 978-0-321-99782-1, by James R. Evans, published by Pearson Education © 2017.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without either the prior written permission of the publisher or a license permitting restricted copying in the United Kingdom issued by the Copyright Licensing Agency Ltd, Saffron House, 6–10 Kirby Street, London EC1N 8TS.

All trademarks used herein are the property of their respective owners. The use of any trademark in this text does not vest in the author or publisher any trademark ownership rights in such trademarks, nor does the use of such trademarks imply any affiliation with or endorsement of this book by such owners.

ISBN-10: 1-292-09544-X
ISBN-13: 978-1-292-09544-8

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library

10 9 8 7 6 5 4 3 2 1

Typeset by Lumina Datamatics, Inc.
Printed and bound by Vivar, Malaysia

Brief Contents

Preface
About the Author
Credits

Part 1: Foundations of Business Analytics
Chapter 1: Introduction to Business Analytics
Chapter 2: Analytics on Spreadsheets

Part 2: Descriptive Analytics
Chapter 3: Visualizing and Exploring Data
Chapter 4: Descriptive Statistical Measures
Chapter 5: Probability Distributions and Data Modeling
Chapter 6: Sampling and Estimation
Chapter 7: Statistical Inference

Part 3: Predictive Analytics
Chapter 8: Trendlines and Regression Analysis
Chapter 9: Forecasting Techniques
Chapter 10: Introduction to Data Mining
Chapter 11: Spreadsheet Modeling and Analysis
Chapter 12: Monte Carlo Simulation and Risk Analysis

Part 4: Prescriptive Analytics
Chapter 13: Linear Optimization
Chapter 14: Applications of Linear Optimization
Chapter 15: Integer Optimization
Chapter 16: Decision Analysis

Supplementary Chapter A (online): Nonlinear and Non-Smooth Optimization
Supplementary Chapter B (online): Optimization Models with Uncertainty

Appendix A
Glossary
Index

Contents

Preface
About the Author
Credits

Part 1: Foundations of Business Analytics

Chapter 1: Introduction to Business Analytics
Learning Objectives
What Is Business Analytics?
Evolution of Business Analytics
Impacts and Challenges

Scope of Business Analytics
Software Support

Data for Business Analytics
Data Sets and Databases • Big Data • Metrics and Data Classification • Data Reliability and Validity

Models in Business Analytics
Decision Models • Model Assumptions • Uncertainty and Risk • Prescriptive Decision Models

Problem Solving with Analytics
Recognizing a Problem • Defining the Problem • Structuring the Problem • Analyzing the Problem • Interpreting Results and Making a Decision • Implementing the Solution
Key Terms • Fun with Analytics • Problems and Exercises • Case: Drout Advertising Research Project • Case: Performance Lawn Equipment

Chapter 2: Analytics on Spreadsheets
Learning Objectives
Basic Excel Skills
Excel Formulas • Copying Formulas • Other Useful Excel Tips

Excel Functions
Basic Excel Functions • Functions for Specific Applications • Insert Function • Logical Functions

Using Excel Lookup Functions for Database Queries
Spreadsheet Add-Ins for Business Analytics
Key Terms • Problems and Exercises • Case: Performance Lawn Equipment

Part 2: Descriptive Analytics

Chapter 3: Visualizing and Exploring Data
Learning Objectives
Data Visualization
Dashboards • Tools and Software for Data Visualization

Creating Charts in Microsoft Excel
Column and Bar Charts • Data Labels and Data Tables Chart Options • Line Charts • Pie Charts • Area Charts • Scatter Chart • Bubble Charts • Miscellaneous Excel Charts • Geographic Data

Other Excel Data Visualization Tools
Data Bars, Color Scales, and Icon Sets • Sparklines • Excel Camera Tool

Data Queries: Tables, Sorting, and Filtering
Sorting Data in Excel • Pareto Analysis • Filtering Data

Statistical Methods for Summarizing Data
Frequency Distributions for Categorical Data • Relative Frequency Distributions • Frequency Distributions for Numerical Data • Excel Histogram Tool • Cumulative Relative Frequency Distributions • Percentiles and Quartiles • Cross-Tabulations

Exploring Data Using PivotTables
PivotCharts • Slicers and PivotTable Dashboards
Key Terms • Problems and Exercises • Case: Drout Advertising Research Project • Case: Performance Lawn Equipment

Chapter 4: Descriptive Statistical Measures
Learning Objectives
Populations and Samples
Understanding Statistical Notation

Measures of Location
Arithmetic Mean • Median • Mode • Midrange • Using Measures of Location in Business Decisions

Measures of Dispersion
Range • Interquartile Range • Variance • Standard Deviation • Chebyshev’s Theorem and the Empirical Rules • Standardized Values • Coefficient of Variation

Measures of Shape
Excel Descriptive Statistics Tool
Descriptive Statistics for Grouped Data
Descriptive Statistics for Categorical Data: The Proportion
Statistics in PivotTables

Measures of Association
Covariance • Correlation • Excel Correlation Tool
Outliers

Statistical Thinking in Business Decisions
Variability in Samples
Key Terms • Problems and Exercises • Case: Drout Advertising Research Project • Case: Performance Lawn Equipment

Chapter 5: Probability Distributions and Data Modeling
Learning Objectives
Basic Concepts of Probability
Probability Rules and Formulas • Joint and Marginal Probability • Conditional Probability

Random Variables and Probability Distributions
Discrete Probability Distributions
Expected Value of a Discrete Random Variable • Using Expected Value in Making Decisions • Variance of a Discrete Random Variable • Bernoulli Distribution • Binomial Distribution • Poisson Distribution

Continuous Probability Distributions
Properties of Probability Density Functions • Uniform Distribution • Normal Distribution • The NORM.INV Function • Standard Normal Distribution • Using Standard Normal Distribution Tables • Exponential Distribution • Other Useful Distributions • Continuous Distributions

Random Sampling from Probability Distributions
Sampling from Discrete Probability Distributions • Sampling from Common Probability Distributions • Probability Distribution Functions in Analytic Solver Platform

Data Modeling and Distribution Fitting
Goodness of Fit • Distribution Fitting with Analytic Solver Platform
Key Terms • Problems and Exercises • Case: Performance Lawn Equipment

Chapter 6: Sampling and Estimation
Learning Objectives
Statistical Sampling
Sampling Methods

Estimating Population Parameters
Unbiased Estimators • Errors in Point Estimation

Sampling Error
Understanding Sampling Error

Sampling Distributions
Sampling Distribution of the Mean • Applying the Sampling Distribution of the Mean

Interval Estimates
Confidence Intervals
Confidence Interval for the Mean with Known Population Standard Deviation • The t-Distribution • Confidence Interval for the Mean with Unknown Population Standard Deviation • Confidence Interval for a Proportion • Additional Types of Confidence Intervals

Using Confidence Intervals for Decision Making
Prediction Intervals
Confidence Intervals and Sample Size
Key Terms • Problems and Exercises • Case: Drout Advertising Research Project • Case: Performance Lawn Equipment

Chapter 7: Statistical Inference
Learning Objectives
Hypothesis Testing
Hypothesis-Testing Procedure

One-Sample Hypothesis Tests
Understanding Potential Errors in Hypothesis Testing • Selecting the Test Statistic • Drawing a Conclusion

Two-Tailed Test of Hypothesis for the Mean
p-Values • One-Sample Tests for Proportions • Confidence Intervals and Hypothesis Tests

Two-Sample Hypothesis Tests
Two-Sample Tests for Differences in Means • Two-Sample Test for Means with Paired Samples • Test for Equality of Variances

Analysis of Variance (ANOVA)
Assumptions of ANOVA

Chi-Square Test for Independence
Cautions in Using the Chi-Square Test
Key Terms • Problems and Exercises • Case: Drout Advertising Research Project • Case: Performance Lawn Equipment

Part 3: Predictive Analytics

Chapter 8: Trendlines and Regression Analysis
Learning Objectives
Modeling Relationships and Trends in Data
Simple Linear Regression
Finding the Best-Fitting Regression Line • Least-Squares Regression • Simple Linear Regression with Excel • Regression as Analysis of Variance • Testing Hypotheses for Regression Coefficients • Confidence Intervals for Regression Coefficients

Residual Analysis and Regression Assumptions
Checking Assumptions

Multiple Linear Regression
Building Good Regression Models
Correlation and Multicollinearity • Practical Issues in Trendline and Regression Modeling

Regression with Categorical Independent Variables
Categorical Variables with More Than Two Levels

Regression Models with Nonlinear Terms
Advanced Techniques for Regression Modeling using XLMiner
Key Terms • Problems and Exercises • Case: Performance Lawn Equipment

Chapter 9: Forecasting Techniques
Learning Objectives
Qualitative and Judgmental Forecasting
Historical Analogy • The Delphi Method • Indicators and Indexes

Statistical Forecasting Models
Forecasting Models for Stationary Time Series
Moving Average Models • Error Metrics and Forecast Accuracy • Exponential Smoothing Models

Forecasting Models for Time Series with a Linear Trend
Double Exponential Smoothing • Regression-Based Forecasting for Time Series with a Linear Trend

Forecasting Time Series with Seasonality
Regression-Based Seasonal Forecasting Models • Holt-Winters Forecasting for Seasonal Time Series • Holt-Winters Models for Forecasting Time Series with Seasonality and Trend

Selecting Appropriate Time-Series-Based Forecasting Models
Regression Forecasting with Causal Variables
The Practice of Forecasting
Key Terms • Problems and Exercises • Case: Performance Lawn Equipment

Chapter 10: Introduction to Data Mining
Learning Objectives
The Scope of Data Mining
Data Exploration and Reduction
Sampling • Data Visualization • Dirty Data • Cluster Analysis

Classification
An Intuitive Explanation of Classification • Measuring Classification Performance • Using Training and Validation Data • Classifying New Data

Classification Techniques
k-Nearest Neighbors (k-NN) • Discriminant Analysis • Logistic Regression • Association Rule Mining

Cause-and-Effect Modeling
Key Terms • Problems and Exercises • Case: Performance Lawn Equipment

Chapter 11: Spreadsheet Modeling and Analysis
Learning Objectives
Strategies for Predictive Decision Modeling
Building Models Using Simple Mathematics • Building Models Using Influence Diagrams

Spreadsheet Applications in Business Analytics
Models Involving Multiple Time Periods • Single-Period Purchase Decisions • Overbooking Decisions

Model Assumptions, Complexity, and Realism
Data and Models

Developing User-Friendly Excel Applications
Data Validation • Range Names • Form Controls

Analyzing Uncertainty and Model Assumptions
What-If Analysis • Data Tables • Scenario Manager • Goal Seek

Model Analysis Using Analytic Solver Platform
Parametric Sensitivity Analysis • Tornado Charts
Key Terms • Problems and Exercises • Case: Performance Lawn Equipment

Chapter 12: Monte Carlo Simulation and Risk Analysis
Learning Objectives
Spreadsheet Models with Random Variables
Monte Carlo Simulation

Monte Carlo Simulation Using Analytic Solver Platform
Defining Uncertain Model Inputs • Defining Output Cells • Running a Simulation • Viewing and Analyzing Results

New-Product Development Model
Confidence Interval for the Mean • Sensitivity Chart • Overlay Charts • Trend Charts • Box-Whisker Charts • Simulation Reports

Newsvendor Model
The Flaw of Averages • Monte Carlo Simulation Using Historical Data • Monte Carlo Simulation Using a Fitted Distribution

Overbooking Model
The Custom Distribution in Analytic Solver Platform

Cash Budget Model
Correlating Uncertain Variables
Key Terms • Problems and Exercises • Case: Performance Lawn Equipment

Part 4: Prescriptive Analytics

Chapter 13: Linear Optimization
Learning Objectives
Building Linear Optimization Models
Identifying Elements for an Optimization Model • Translating Model Information into Mathematical Expressions • More about Constraints • Characteristics of Linear Optimization Models

Implementing Linear Optimization Models on Spreadsheets
Excel Functions to Avoid in Linear Optimization

Solving Linear Optimization Models
Using the Standard Solver • Using Premium Solver • Solver Answer Report

Graphical Interpretation of Linear Optimization
How Solver Works
How Solver Creates Names in Reports

Solver Outcomes and Solution Messages
Unique Optimal Solution • Alternative (Multiple) Optimal Solutions • Unbounded Solution • Infeasibility

Using Optimization Models for Prediction and Insight
Solver Sensitivity Report • Using the Sensitivity Report • Parameter Analysis in Analytic Solver Platform
Key Terms • Problems and Exercises • Case: Performance Lawn Equipment

Chapter 14: Applications of Linear Optimization
Learning Objectives
Types of Constraints in Optimization Models
Process Selection Models
Spreadsheet Design and Solver Reports
Solver Output and Data Visualization

Blending Models
Dealing with Infeasibility

Portfolio Investment Models
Evaluating Risk versus Reward • Scaling Issues in Using Solver

Transportation Models
Formatting the Sensitivity Report • Degeneracy

Multiperiod Production Planning Models
Building Alternative Models

Multiperiod Financial Planning Models

Models with Bounded Variables
Auxiliary Variables for Bound Constraints

A Production/Marketing Allocation Model
Using Sensitivity Information Correctly
Key Terms • Problems and Exercises • Case: Performance Lawn Equipment

Chapter 15: Integer Optimization
Learning Objectives
Solving Models with General Integer Variables
Workforce-Scheduling Models • Alternative Optimal Solutions

Integer Optimization Models with Binary Variables
Project-Selection Models • Using Binary Variables to Model Logical Constraints • Location Models • Parameter Analysis • A Customer-Assignment Model for Supply Chain Optimization

Mixed-Integer Optimization Models
Plant Location and Distribution Models • Binary Variables, IF Functions, and Nonlinearities in Model Formulation • Fixed-Cost Models
Key Terms • Problems and Exercises • Case: Performance Lawn Equipment

Chapter 16: Decision Analysis  579 Learning Objectives  579 Formulating Decision Problems  581 Decision Strategies without Outcome Probabilities  582 Decision Strategies for a Minimize Objective  582  •  Decision Strategies for a Maximize Objective  583  •  Decisions with Conflicting Objectives  584

Decision Strategies with Outcome Probabilities  586 Average Payoff Strategy  586  •  Expected Value Strategy  586  •  Evaluating Risk  587

Decision Trees  588 Decision Trees and Monte Carlo Simulation  592  •  Decision Trees and Risk 592 •  Sensitivity Analysis in Decision Trees  594

The Value of Information  595 Decisions with Sample Information  596  •  Bayes’s Rule  596

Utility and Decision Making  598 Constructing a Utility Function  599  •  Exponential Utility Functions  602 Key Terms  604  •  Problems and Exercises  604  •  Case: Performance Lawn Equipment 608


Supplementary Chapter A (online) Nonlinear and Non-Smooth Optimization
Supplementary Chapter B (online) Optimization Models with Uncertainty
Online chapters are available for download at www.pearsonglobaleditions.com/Evans.

Appendix A  611
Glossary  635
Index  643

Preface

In 2007, Thomas H. Davenport and Jeanne G. Harris wrote a groundbreaking book, Competing on Analytics: The New Science of Winning (Boston: Harvard Business School Press). They described how many organizations are using analytics strategically to make better decisions and improve customer and shareholder value. Over the past several years, we have seen remarkable growth in analytics among all types of organizations. The Institute for Operations Research and the Management Sciences (INFORMS) noted that analytics software as a service is predicted to grow at three times the rate of other business segments in upcoming years.1 In addition, the MIT Sloan Management Review in collaboration with the IBM Institute for Business Value surveyed a global sample of nearly 3,000 executives, managers, and analysts.2 This study concluded that top-performing organizations use analytics five times more than lower performers, that improvement of information and analytics was a top priority in these organizations, and that many organizations felt they were under significant pressure to adopt advanced information and analytics approaches. Since these reports were published, the interest in and the use of analytics have grown dramatically.

In reality, business analytics has been around for more than a half-century. Business schools have long taught many of the core topics in business analytics: statistics, data analysis, information and decision support systems, and management science. However, these topics have traditionally been presented in separate and independent courses and supported by textbooks with little topical integration. This book is uniquely designed to present the emerging discipline of business analytics in a unified fashion consistent with the contemporary definition of the field.

About the Book

This book provides undergraduate business students and introductory graduate students with the fundamental concepts and tools needed to understand the emerging role of business analytics in organizations, to apply basic business analytics tools in a spreadsheet environment, and to communicate with analytics professionals to effectively use and interpret analytic models and results for making better business decisions. We take a balanced, holistic approach in viewing business analytics from descriptive, predictive, and prescriptive perspectives that today define the discipline.

1 Anne Robinson, Jack Levis, and Gary Bennett, INFORMS News: INFORMS to Officially Join Analytics Movement. http://www.informs.org/ORMS-Today/Public-Articles/October-Volume-37-Number-5/INFORMS-News-INFORMS-to-Officially-Join-Analytics-Movement.
2 “Analytics: The New Path to Value,” MIT Sloan Management Review Research Report, Fall 2010.

This book is organized in five parts.

1. Foundations of Business Analytics
   The first two chapters provide the basic foundations needed to understand business analytics, and to manipulate data using Microsoft Excel.
2. Descriptive Analytics
   Chapters 3 through 7 focus on the fundamental tools and methods of data analysis and statistics, focusing on data visualization, descriptive statistical measures, probability distributions and data modeling, sampling and estimation, and statistical inference. We subscribe to the American Statistical Association’s recommendations for teaching introductory statistics, which include emphasizing statistical literacy and developing statistical thinking, stressing conceptual understanding rather than mere knowledge of procedures, and using technology for developing conceptual understanding and analyzing data. We believe these goals can be accomplished without introducing every conceivable technique into an 800–1,000 page book as many mainstream books currently do. In fact, we cover all essential content that the state of Ohio has mandated for undergraduate business statistics across all public colleges and universities.
3. Predictive Analytics
   In this section, Chapters 8 through 12 develop approaches for applying regression, forecasting, and data mining techniques, building and analyzing predictive models on spreadsheets, and simulation and risk analysis.
4. Prescriptive Analytics
   Chapters 13 through 15, along with two online supplementary chapters, explore linear, integer, and nonlinear optimization models and applications, including optimization with uncertainty.
5. Making Decisions
   Chapter 16 focuses on philosophies, tools, and techniques of decision analysis.

The second edition has been carefully revised to improve both the content and pedagogical organization of the material. Specifically, this edition has a much stronger emphasis on data visualization, incorporates the use of additional Excel tools, new features of Analytic Solver Platform for Education, and many new data sets and problems. Chapters 8 through 12 have been re-ordered from the first edition to improve the logical flow of the topics and provide a better transition to spreadsheet modeling and applications.

Features of the Book

• Numbered Examples: numerous, short examples throughout all chapters illustrate concepts and techniques and help students learn to apply the techniques and understand the results.
• “Analytics in Practice”: at least one per chapter, this feature describes real applications in business.
• Learning Objectives: lists the goals the students should be able to achieve after studying the chapter.
• Key Terms: bolded within the text and listed at the end of each chapter, these words will assist students as they review the chapter and study for exams. Key terms and their definitions are contained in the glossary at the end of the book.
• End-of-Chapter Problems and Exercises: help to reinforce the material covered through the chapter.
• Integrated Cases: allows students to think independently and apply the relevant tools at a higher level of learning.
• Data Sets and Excel Models: used in examples and problems and are available to students at www.pearsonglobaleditions.com/evans

Software Support

While many different types of software packages are used in business analytics applications in the industry, this book uses Microsoft Excel and Frontline Systems’ powerful Excel add-in, Analytic Solver Platform for Education, which together provide extensive capabilities for business analytics. Many statistical software packages are available and provide very powerful capabilities; however, they often require special (and costly) licenses and additional learning requirements. These packages are certainly appropriate for analytics professionals and students in master’s programs dedicated to preparing such professionals. However, for the general business student, we believe that Microsoft Excel with proper add-ins is more appropriate. Although Microsoft Excel may have some deficiencies in its statistical capabilities, the fact remains that every business student will use Excel throughout their careers. Excel has good support for data visualization, basic statistical analysis, what-if analysis, and many other key aspects of business analytics. In fact, in using this book, students will gain a high level of proficiency with many features of Excel that will serve them well in their future careers. Furthermore, Frontline Systems’ Analytic Solver Platform for Education Excel add-in is integrated throughout the book. This add-in, which is used among the top business organizations in the world, provides comprehensive coverage of many other business analytics topics in a common platform. It provides support for data modeling, forecasting, Monte Carlo simulation and risk analysis, data mining, optimization, and decision analysis. Together with Excel, it provides a comprehensive basis to learn business analytics effectively.

To the Students

To get the most out of this book, you need to do much more than simply read it! Many examples describe in detail how to use and apply various Excel tools or add-ins. We highly recommend that you work through these examples on your computer to replicate the outputs and results shown in the text. You should also compare mathematical formulas with spreadsheet formulas and work through basic numerical calculations by hand. Only in this fashion will you learn how to use the tools and techniques effectively, gain a better understanding of the underlying concepts of business analytics, and increase your proficiency in using Microsoft Excel, which will serve you well in your future career.

Visit the Companion Web site (www.pearsonglobaleditions.com/evans) for access to the following:

• Online Files: Data Sets and Excel Models (files for use with the numbered examples and the end-of-chapter problems). For easy reference, the relevant file names are italicized and clearly stated when used in examples.

To register and download the software successfully, you will need a Textbook Code and a Course Code. The Textbook Code is EBA2 and your instructor will provide the Course Code. This download includes a 140-day license to use the software. Visit www.pearsonglobaleditions.com/Evans for complete download instructions.

To the Instructors

• Instructor’s Resource Center: Reached through a link at www.pearsonglobaleditions.com/Evans, the Instructor’s Resource Center contains the electronic files for the complete Instructor’s Solutions Manual, PowerPoint lecture presentations, and the Test Item File.
• Register, redeem, log in: At www.pearsonglobaleditions.com/Evans, instructors can access a variety of print, media, and presentation resources that are available with this book in downloadable digital format. Resources are also available for course management platforms such as Blackboard, WebCT, and CourseCompass.
• Need help? Pearson Education’s dedicated technical support team is ready to assist instructors with questions about the media supplements that accompany this text. Visit http://247pearsoned.com for answers to frequently asked questions and toll-free user support phone numbers. The supplements are available to adopting instructors. Detailed descriptions are provided at the Instructor’s Resource Center.
• Instructor’s Solutions Manual: The Instructor’s Solutions Manual, updated and revised for the second edition by the author, includes Excel-based solutions for all end-of-chapter problems, exercises, and cases. The Instructor’s Solutions Manual is available for download by visiting www.pearsonglobaleditions.com/Evans and clicking on the Instructor Resources link.
• PowerPoint presentations: The PowerPoint slides, revised and updated by the author, are available for download by visiting www.pearsonglobaleditions.com/Evans and clicking on the Instructor Resources link. The PowerPoint slides provide an instructor with individual lecture outlines to accompany the text. The slides include nearly all of the figures, tables, and examples from the text. Instructors can use these lecture notes as they are or can easily modify the notes to reflect specific presentation needs.
• Test Bank: The Test Bank, prepared by Paolo Catasti from Virginia Commonwealth University, is available for download by visiting www.pearsonglobaleditions.com/Evans and clicking on the Instructor Resources link.
• Analytic Solver Platform for Education (ASPE): This is a special version of Frontline Systems’ Analytic Solver Platform software for Microsoft Excel.

Acknowledgements

I would like to thank the staff at Pearson Education for their professionalism and dedication to making this book a reality. In particular, I want to thank Kerri Consalvo, Tatiana Anacki, Erin Kelly, Nicholas Sweeney, and Patrick Barbera; Jen Carley at Lumina Datamatics, Inc.; accuracy checker Annie Puciloski; and solutions checker Regina Krahenbuhl for their outstanding contributions to producing this book. I also want to acknowledge Daniel Fylstra and his staff at Frontline Systems for working closely with me to allow this book to have been the first to include XLMiner with Analytic Solver Platform. If you have any suggestions or corrections, please contact the author via email at [email protected]

James R. Evans
Department of Operations, Business Analytics, and Information Systems
University of Cincinnati
Cincinnati, Ohio

Pearson would also like to thank Sahil Raj (Punjabi University) and Loveleen Gaur (Amity University, Noida) for their contribution to the Global Edition, and Ruben Garcia Berasategui (Jakarta International College), Ahmed R. ElMelegy (The American University, Dubai) and Hyelim Oh (National University of Singapore) for reviewing the Global Edition.

James R. Evans
Professor, University of Cincinnati College of Business

James R. Evans is professor in the Department of Operations, Business Analytics, and Information Systems in the College of Business at the University of Cincinnati. He holds BSIE and MSIE degrees from Purdue and a PhD in Industrial and Systems Engineering from Georgia Tech. Dr. Evans has published numerous textbooks in a variety of business disciplines, including statistics, decision models, and analytics, simulation and risk analysis, network optimization, operations management, quality management, and creative thinking. He has published over 90 papers in journals such as Management Science, IIE Transactions, Decision Sciences, Interfaces, the Journal of Operations Management, the Quality Management Journal, and many others, and wrote a series of columns in Interfaces on creativity in management science and operations research during the 1990s. He has also served on numerous journal editorial boards and is a past-president and Fellow of the Decision Sciences Institute. In 1996, he was an INFORMS Edelman Award Finalist as part of a project in supply chain optimization with Procter & Gamble that was credited with helping P&G save over \$250,000,000 annually in their North American supply chain, and consulted on risk analysis modeling for Cincinnati 2012’s Olympic Games bid proposal. A recognized international expert on quality management, he served on the Board of Examiners and the Panel of Judges for the Malcolm Baldrige National Quality Award. Much of his current research focuses on organizational performance excellence and measurement practices.


Credits

Photo Credits

Chapter 1  Page 27 Analytics Business Analysis: Mindscanner/Fotolia; Page 56 Computer, calculator, and spreadsheet: Hans12/Fotolia
Chapter 2  Page 63 Computer with Spreadsheet: Gunnar Pippel/Shutterstock
Chapter 3  Page 79 Spreadsheet with magnifying glass: Poles/Fotolia; Page 98 Data Analysis: 2jenn/Shutterstock
Chapter 4  Page 121 Pattern of colorful numbers: JonnyDrake/Shutterstock; Page 151 Computer screen with financial data: NAN728/Shutterstock
Chapter 5  Page 157 Faded spreadsheet: Fantasista/Fotolia; Page 177 Probability and cost graph with pencil: Fantasista/Fotolia; Page 198 Business concepts: Victor Correia/Shutterstock
Chapter 6  Page 207 Series of bar graphs: Kalabukhava Iryna/Shutterstock; Page 211 Brewery truck: Stephen Finn/Shutterstock
Chapter 7  Page 231 Business man solving problems with illustrated graph display: Serg Nvns/Fotolia; Page 253 People working at a helpdesk: StockLite/Shutterstock
Chapter 8  Page 259 Trendline 3D graph: Sheelamohanachandran/Fotolia; Page 279 Computer and Risk: Gunnar Pippel/Shutterstock; Page 280C 4 blank square shape navigation web 2.0 button slider: Claudio Divizia/Shutterstock; Page 280L Graph chart illustrations of growth and recession: Vector Illustration/Shutterstock; Page 280R Audio gauge: Shutterstock
Chapter 9  Page 299 Past and future road sign: Karen Roach/Fotolia; Page 324 NBC Studios: Sean Pavone/Dreamstime
Chapter 10  Page 327 Data Mining Technology Strategy Concept: Kentoh/Shutterstock; Page 363 Business man drawing a marketing diagram: Helder Almeida/Shutterstock
Chapter 11  Page 367 3D spreadsheet: Dmitry/Fotolia; Page 375 Buildings: ZUMA Press/Newscom; Page 381 Health Clinic: Poprostskiy Alexey/Shutterstock
Chapter 12  Page 403 Analyzing Risk in Business: iQoncept/Shutterstock; Page 432 Office Building: Verdeskerde/Shutterstock
Chapter 13  Page 441 3D spreadsheet, graph, pen: Archerix/Shutterstock; Page 475 Television acting sign: Bizoo_n/Fotolia
Chapter 14  Page 483 People working on spreadsheets: Pressmaster/Shutterstock; Page 515 Colored Stock Market Chart: 2jenn/Shutterstock
Chapter 15  Page 539 Brainstorming Concept: Dusit/Shutterstock; Page 549 Qantas Airbus A380: Gordon Tipene/Dreamstime; Page 559 Supply chain concept: Kheng Guan Toh/Shutterstock
Chapter 16  Page 579 Person at crossroads: Michael D Brown/Shutterstock; Page 604 Collage of several images from a drug store: Sokolov/Shutterstock
Supplementary Chapter A (online)  Page 27 Various discount tags and labels: little Whale/Shutterstock; Page 35 Red Cross facility: Littleny/Dreamstime
Supplementary Chapter B (online)  Page 27 Confused man thinking over right decision: StockThings/Shutterstock; Page 33 Lockheed Constellation Cockpit: Brad Whitsitt/Shutterstock

Chapter 1   Introduction to Business Analytics

Learning Objectives
After studying this chapter, you will be able to:

• Define business analytics.
• Explain why analytics is important in today’s business environment.
• State some typical examples of business applications in which analytics would be beneficial.
• Summarize the evolution of business analytics and explain the concepts of business intelligence, operations research and management science, and decision support systems.
• Explain and provide examples of descriptive, predictive, and prescriptive analytics.
• State examples of how data are used in business.
• Explain the difference between a data set and a database.
• Define a metric and explain the concepts of measurement and measures.
• Explain the difference between a discrete metric and continuous metric, and provide examples of each.
• Describe the four groups of data classification, categorical, ordinal, interval, and ratio, and provide examples of each.
• Explain the concept of a model and various ways a model can be characterized.
• Define and list the elements of a decision model.
• Define and provide an example of an influence diagram.
• Use influence diagrams to build simple mathematical models.
• Use predictive models to compute model outputs.
• Explain the difference between uncertainty and risk.
• Define the terms optimization, objective function, and optimal solution.
• Explain the difference between a deterministic and stochastic decision model.
• List and explain the steps in the problem-solving process.

Most of you have likely been to a zoo, seen the animals, had something to eat, and bought some souvenirs. You probably wouldn’t think that managing a zoo is very difficult; after all, it’s just feeding and taking care of the animals, right? A zoo might be the last place that you would expect to find business analytics being used, but not anymore. The Cincinnati Zoo & Botanical Garden has been an “early adopter” and one of the first organizations of its kind to exploit business analytics.1

Despite generating more than two-thirds of its budget through its own fund-raising efforts, the zoo wanted to reduce its reliance on local tax subsidies even further by increasing visitor attendance and revenues from secondary sources such as membership, food and retail outlets. The zoo’s senior management surmised that the best way to realize more value from each visit was to offer visitors a truly transformed customer experience. By using business analytics to gain greater insight into visitors’ behavior and tailoring operations to their preferences, the zoo expected to increase attendance, boost membership, and maximize sales.

The project team, which consisted of consultants from IBM and BrightStar Partners as well as senior executives from the zoo, began translating the organization’s goals into technical solutions. The zoo worked to create a business analytics platform that was capable of delivering the desired goals by combining data from ticketing and point-of-sale systems throughout the zoo with membership information and geographical data gathered from the ZIP codes of all visitors. This enabled the creation of reports and dashboards that give everyone from senior managers to zoo staff access to real-time information that helps them optimize operational management and transform the customer experience.

By integrating weather forecast data, the zoo is able to compare current forecasts with historic attendance and sales data, supporting better decision making for labor scheduling and inventory planning. Another area where the solution delivers new insight is food service. By opening food outlets at specific times of day when demand is highest (for example, keeping ice cream kiosks open in the final hour before the zoo closes), the zoo has been able to increase sales significantly. The zoo has been able to increase attendance and revenues dramatically, resulting in annual ROI of 411%. The business analytics initiative paid for itself within three months, and delivers, on average, benefits of \$738,212 per year. Specifically,

• The zoo has seen a 4.2% rise in ticket sales by targeting potential visitors who live in specific ZIP codes.
• Food revenues increased by 25% by optimizing the mix of products on sale and adapting selling practices to match peak purchase times.
• Eliminating slow-selling products and targeting visitors with specific promotions enabled an 18% increase in merchandise sales.
• The zoo cut marketing expenditure, saving \$40,000 in the first year, and reduced advertising expenditure by 43% by eliminating ineffective campaigns and segmenting customers for more targeted marketing.

1 Source: IBM Software Business Analytics, “Cincinnati Zoo transforms customer experience and boosts profits,” © IBM Corporation 2012.

Because of the zoo’s success, other organizations such as Point Defiance Zoo & Aquarium, in Washington state, and History Colorado, a museum in Denver, have embarked on similar initiatives.

In recent years, analytics has become increasingly important in the world of business, particularly as organizations have access to more and more data. Managers today no longer make decisions based on pure judgment and experience; they rely on factual data and the ability to manipulate and analyze data to support their decisions. As a result, many companies have recently established analytics departments; for instance, IBM reorganized its consulting business and established a new 4,000-person organization focusing on analytics.2 Companies are increasingly seeking business graduates with the ability to understand and use analytics. In fact, in 2011, the U.S. Bureau of Labor Statistics predicted a 24% increase in demand for professionals with analytics expertise.

No matter what your academic business concentration is, you will most likely be a future user of analytics to some extent and work with analytics professionals. The purpose of this book is to provide you with a basic introduction to the concepts, methods, and models used in business analytics so that you will develop not only an appreciation for its capabilities to support and enhance business decisions, but also the ability to use business analytics at an elementary level in your work. In this chapter, we introduce you to the field of business analytics, and set the foundation for many of the concepts and techniques that you will learn.

2 Matthew J. Liberatore and Wenhong Luo, “The Analytics Movement: Implications for Operations Research,” Interfaces, 40, 4 (July–August 2010): 313–324.

in “Brain Trust—Enabling the Confident Enterprise with Business Analytics” (Cary, NC: SAS Institute, Inc., 2010): 27–29. www.sas.com/bareport


• merchandising (for example, determining brands to buy, quantities, and allocations),
• location (for example, finding the best location for bank branches and ATMs, or where to service industrial equipment),

and many others in operations and supply chains, finance, marketing, and human resources; in fact, in every discipline of business.5

Various research studies have discovered strong relationships between a company’s performance in terms of profitability, revenue, and shareholder return and its use of analytics. Top-performing organizations (those that outperform their competitors) are three times more likely to be sophisticated in their use of analytics than lower performers and are more likely to state that their use of analytics differentiates them from competitors.6 However, research has also suggested that organizations are overwhelmed by data and struggle to understand how to use data to achieve business results and that most organizations simply don’t understand how to use analytics to improve their businesses. Thus, understanding the capabilities and techniques of analytics is vital to managing in today’s business environment.

One of the emerging applications of analytics is helping businesses learn from social media and exploit social media data for strategic advantage.7 Using analytics, firms can integrate social media data with traditional data sources such as customer surveys, focus groups, and sales data; understand trends and customer perceptions of their products; and create informative reports to assist marketing managers and product designers.

Evolution of Business Analytics

Analytical methods, in one form or another, have been used in business for more than a century. However, the modern evolution of analytics began with the introduction of computers in the late 1940s and their development through the 1960s and beyond. Early computers provided the ability to store and analyze data in ways that were either very difficult or impossible to do manually. This facilitated the collection, management, analysis, and reporting of data, which is often called business intelligence (BI), a term that was coined in 1958 by an IBM researcher, Hans Peter Luhn.8 Business intelligence software can answer basic questions such as “How many units did we sell last month?” “What products did customers buy and how much did they spend?” “How many credit card transactions were completed yesterday?” Using BI, we can create simple rules to flag exceptions automatically; for example, a bank can easily identify transactions greater than \$10,000 to report to the Internal Revenue Service.9 BI has evolved into the modern discipline we now call information systems (IS).
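The exception-flagging rule mentioned above can be sketched in a few lines of code. This is only an illustrative sketch in Python rather than in the spreadsheet environment used in this book: the \$10,000 threshold comes from the example in the text, while the `Transaction` record, its field names, and the sample amounts are hypothetical.

```python
from dataclasses import dataclass

# Threshold taken from the text's example; everything else here is assumed.
REPORTING_THRESHOLD = 10_000.00


@dataclass
class Transaction:
    """Hypothetical transaction record used only for this illustration."""
    transaction_id: str
    amount: float


def flag_exceptions(transactions, threshold=REPORTING_THRESHOLD):
    """Return the transactions whose amount is strictly greater than the threshold."""
    return [t for t in transactions if t.amount > threshold]


# Made-up sample data, for illustration only.
sample = [
    Transaction("T001", 2_500.00),
    Transaction("T002", 10_000.00),
    Transaction("T003", 12_750.50),
]

flagged = flag_exceptions(sample)
print([t.transaction_id for t in flagged])
```

Because the rule as stated is "greater than" the threshold, a transaction of exactly \$10,000 is not flagged in this sketch; a real reporting rule may define the boundary differently.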

5 Thomas H. Davenport, “How Organizations Make Better Decisions,” edited excerpt of an article distributed by the International Institute for Analytics published in “Brain Trust—Enabling the Confident Enterprise with Business Analytics” (Cary, NC: SAS Institute, Inc., 2010): 8–11. www.sas.com/bareport
6 Thomas H. Davenport and Jeanne G. Harris, Competing on Analytics (Boston: Harvard Business School Press, 2007): 46; Michael S. Hopkins, Steve LaValle, Fred Balboni, Nina Kruschwitz, and Rebecca Shockley, “10 Data Points: Information and Analytics at Work,” MIT Sloan Management Review, 52, 1 (Fall 2010): 27–31.
7 Jim Davis, “Convergence—Taking Social Media from Talk to Action,” SASCOM (First Quarter 2011): 17.
8 H. P. Luhn, “A Business Intelligence System.” IBM Journal (October 1958).
9 Jim Davis, “Business Analytics: Helping You Put an Informed Foot Forward,” in “Brain Trust—Enabling the Confident Enterprise with Business Analytics,” (Cary, NC: SAS Institute, Inc., 2010): 4–7. www.sas.com/bareport


Statistics has a long and rich history, yet only rather recently has it been recognized as an important element of business, driven to a large extent by the massive growth of data in today’s world. Google’s chief economist stated that statisticians surely have the “really sexy job” for the next decade.10 Statistical methods allow us to gain a richer understanding of data that goes beyond business intelligence reporting by not only summarizing data succinctly but also finding unknown and interesting relationships among the data. Statistical methods include the basic tools of description, exploration, estimation, and inference, as well as more advanced techniques like regression, forecasting, and data mining.

Much of modern business analytics stems from the analysis and solution of complex decision problems using mathematical or computer-based models, a discipline known as operations research, or management science. Operations research (OR) was born from efforts to improve military operations prior to and during World War II. After the war, scientists recognized that the mathematical tools and techniques developed for military applications could be applied successfully to problems in business and industry. A significant amount of research was carried on in public and private think tanks during the late 1940s and through the 1950s. As the focus on business applications expanded, the term management science (MS) became more prevalent. Many people use the terms operations research and management science interchangeably, and the field became known as Operations Research/Management Science (OR/MS). Many OR/MS applications use modeling and optimization, that is, techniques for translating real problems into mathematics, spreadsheets, or other computer languages, and using them to find the best (“optimal”) solutions and decisions. INFORMS, the Institute for Operations Research and the Management Sciences, is the leading professional society devoted to OR/MS and analytics, and publishes a bimonthly magazine called Analytics (http://analytics-magazine.com/). Digital subscriptions may be obtained free of charge at the Web site.

Decision support systems (DSS) began to evolve in the 1960s by combining business intelligence concepts with OR/MS models to create analytical-based computer systems to support decision making. DSSs include three components:

1. Data management. The data management component includes databases for storing data and allows the user to input, retrieve, update, and manipulate data.
2. Model management. The model management component consists of various statistical tools and management science models and allows the user to easily build, manipulate, analyze, and solve models.
3. Communication system. The communication system component provides the interface necessary for the user to interact with the data and model management components.11

DSSs have been used for many applications, including pension fund management, portfolio management, work-shift scheduling, global manufacturing and facility location, advertising-budget allocation, media planning, distribution planning, airline operations planning, inventory control, library management, classroom assignment, nurse scheduling, blood distribution, water pollution control, ski-area design, police-beat design, and energy planning.12

10 James J. Swain, “Statistical Software in the Age of the Geek,” Analytics-magazine.org, March/April 2013, pp. 48–55. www.informs.org
11 William E. Leigh and Michael E. Doherty, Decision Support and Expert Systems (Cincinnati, OH: South-Western Publishing Co., 1986).
12 H. B. Eom and S. M. Lee, “A Survey of Decision Support System Applications (1971–April 1988),” Interfaces, 20, 3 (May–June 1990): 65–79.

Chapter 1   Introduction to Business Analytics

Figure 1.1  A Visual Perspective of Business Analytics
[Diagram; labels recovered from the figure: Statistics, Modeling and Optimization]

Modern business analytics can be viewed as an integration of BI/IS, statistics, and modeling and optimization as illustrated in Figure 1.1. While the core topics are traditional and have been used for decades, the uniqueness lies in their intersections. For example, data mining is focused on better understanding characteristics and patterns among variables in large databases using a variety of statistical and analytical tools. Many standard statistical tools as well as more advanced ones are used extensively in data mining. Simulation and risk analysis rely on spreadsheet models and statistical analysis to examine the impacts of uncertainty in the estimates and their potential interaction with one another on the output variable of interest. Spreadsheets and formal models allow one to manipulate data to perform what-if analysis: how specific combinations of inputs that reflect key assumptions will affect model outputs. What-if analysis is also used to assess the sensitivity of optimization models to changes in data inputs and provide better insight for making good decisions.

Perhaps the most useful component of business analytics, which makes it truly unique, is the center of Figure 1.1: visualization. Visualizing data and results of analyses provides a way of easily communicating data at all levels of a business and can reveal surprising patterns and relationships. Software such as IBM’s Cognos system exploits data visualization for query and reporting, data analysis, dashboard presentations, and scorecards linking strategy to operations. The Cincinnati Zoo, for example, has used this software on an iPad to display hourly, daily, and monthly reports of attendance, food and retail location revenues and sales, and other metrics for prediction and marketing strategies. UPS uses telematics to capture vehicle data and display them to help make decisions to improve efficiency and performance.
You may have seen a tag cloud (see the graphic at the beginning of this chapter), which is a visualization of text that shows words that appear more frequently using larger fonts.

The most influential developments that propelled the use of business analytics have been the personal computer and spreadsheet technology. Personal computers and spreadsheets provide a convenient way to manage data, calculations, and visual graphics simultaneously, using intuitive representations instead of abstract mathematical notation. Although the early applications of spreadsheets were primarily in accounting and finance, spreadsheets have developed into powerful general-purpose managerial tools for applying techniques of business analytics. The power of analytics in a personal computing environment was noted some 20 years ago by business consultants Michael Hammer and James Champy, who said, “When accessible data is combined with easy-to-use analysis and modeling tools, frontline workers—when properly trained—suddenly have sophisticated decision-making capabilities.”¹⁴ Although many good analytics software packages are available to professionals, we use Microsoft Excel and a powerful add-in called Analytic Solver Platform throughout this book.

Analytics in Practice: Harrah’s Entertainment¹³

One of the most cited examples of the use of analytics in business is Harrah’s Entertainment. Harrah’s owns numerous hotels and casinos and uses analytics to support revenue management activities, which involve selling the right resources to the right customer at the right price to maximize revenue and profit. The gaming industry views hotel rooms as incentives or rewards to support casino gaming activities and revenues, not as revenue-maximizing assets. Therefore, Harrah’s objective is to set room rates and accept reservations to maximize the expected gaming profits from customers. They begin by collecting and tracking customers’ gaming activities (playing slot machines and casino games) using Harrah’s “Total Rewards” card program, a customer loyalty program that provides rewards such as meals, discounted rooms, and other perks to customers based on the amount of money and time they spend at Harrah’s. The data collected are used to segment customers into more than 20 groups based on their expected gaming activities. For each customer segment, analytics forecasts demand for hotel rooms by arrival date and length of stay. Then Harrah’s uses a prescriptive model to set prices and allocate rooms to these customer segments. For example, the system might offer complimentary rooms to customers who are expected to generate a gaming profit of at least \$400 but charge \$325 for a room if the profit is expected to be only \$100. Marketing can use the information to send promotional offers to targeted customer segments if it identifies low-occupancy rates for specific dates.
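The what-if analysis described earlier in this section, which a spreadsheet performs by changing input cells and watching an output cell, can also be sketched in a few lines of Python. This is only an illustration: the profit formula, the base-case input values, and the scenario names are hypothetical assumptions and are not taken from the book's Excel models.

```python
# A minimal what-if sketch. The model and all numbers are hypothetical.

def profit(unit_price, unit_cost, fixed_cost, demand):
    """Simple profit model: contribution margin times demand, less fixed cost."""
    return (unit_price - unit_cost) * demand - fixed_cost

# Base-case assumptions (the "input cells" of a spreadsheet model).
base = {"unit_price": 40.0, "unit_cost": 25.0, "fixed_cost": 50_000.0, "demand": 5_000}

# Each scenario overrides one input and leaves the others at the base case.
scenarios = {
    "base case": {},
    "demand falls 10%": {"demand": 4_500},        # 5,000 reduced by 10%
    "unit cost rises 5%": {"unit_cost": 26.25},   # 25.00 increased by 5%
}

# Recompute the output for every scenario.
results = {name: profit(**{**base, **changes}) for name, changes in scenarios.items()}

for name, value in results.items():
    print(f"{name:>20}: {value:>10,.2f}")
```

Changing one input at a time, as here, shows how sensitive the output is to each assumption; a fuller analysis would also vary inputs in combination.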

Impacts and Challenges

The impact of applying business analytics can be significant. Companies report reduced costs, better risk management, faster decisions, better productivity, and enhanced bottom-line performance such as profitability and customer satisfaction. For example, 1-800-flowers.com uses analytic software to target print and online promotions with greater accuracy; change prices and offerings on its Web site (sometimes hourly); and optimize its marketing, shipping, distribution, and manufacturing operations, resulting in a \$50 million cost savings in one year.¹⁵

Business analytics is changing how managers make decisions.¹⁶ To thrive in today’s business world, organizations must continually innovate to differentiate themselves from competitors, seek ways to grow revenue and market share, reduce costs, retain existing customers and acquire new ones, and become faster and leaner. IBM suggests that traditional management approaches are evolving in today’s analytics-driven environment to include more fact-based decisions as opposed to judgment and intuition, more prediction rather than reactive decisions, and the use of analytics by everyone at the point where decisions are made rather than relying on skilled experts in a consulting group.¹⁷

Nevertheless, organizations face many challenges in developing analytics capabilities, including lack of understanding of how to use analytics, competing business priorities, insufficient analytical skills, difficulty in getting good data and sharing information, and not understanding the benefits versus perceived costs of analytics studies. Successful application of analytics requires more than just knowing the tools; it requires a high-level understanding of how analytics supports an organization’s competitive strategy and effective execution that crosses multiple disciplines and managerial levels.

A 2011 survey by Bloomberg Businessweek Research Services and SAS concluded that business analytics is still in the “emerging stage” and is used only narrowly within business units, not across entire organizations. The study also noted that many organizations lack analytical talent, and those that do have analytical talent often don’t know how to apply the results properly. While analytics is used as part of the decision-making process in many organizations, most business decisions are still based on intuition.¹⁸ Therefore, while many challenges are apparent, many more opportunities exist. These opportunities are reflected in the job market for analytics professionals, or “data scientists,” as some call them. The Harvard Business Review called data scientist “the sexiest job of the 21st century,” and McKinsey & Company predicted a 50 to 60% shortfall in data scientists in the United States by 2018.¹⁹

¹³ Based on Liberatore and Luo, “The Analytics Movement”; and Richard Metters et al., “The ‘Killer Application’ of Revenue Management: Harrah’s Cherokee Casino & Hotel,” Interfaces, 38, 3 (May–June 2008): 161–175.
¹⁴ Michael Hammer and James Champy, Reengineering the Corporation (New York: HarperBusiness, 1993): 96.
¹⁵ Jim Goodnight, “The Impact of Business Analytics on Performance and Profitability,” in “Brain Trust: Enabling the Confident Enterprise with Business Analytics” (Cary, NC: SAS Institute, Inc., 2010): 4–7. www.sas.com/bareport
¹⁶ Analytics: The New Path to Value, a joint MIT Sloan Management Review and IBM Institute for Business Value study.

Scope of Business Analytics

Business analytics begins with the collection, organization, and manipulation of data and is supported by three major components:²⁰

1. Descriptive analytics. Most businesses start with descriptive analytics: the use of data to understand past and current business performance and make informed decisions. Descriptive analytics is the most commonly used and most well-understood type of analytics. These techniques categorize, characterize, consolidate, and classify data to convert it into useful information for the purposes of understanding and analyzing business performance. Descriptive analytics summarizes data into meaningful charts and reports, for example, about budgets, sales, revenues, or cost. This process allows managers to obtain standard and customized reports and then drill down into the data and make queries to understand the impact of an advertising campaign, for example, review business performance to find problems or areas of opportunity, and identify patterns and trends in data. Typical questions that descriptive analytics helps answer are “How much did we sell in each region?” “What was our revenue and profit last quarter?” “How many and what types of complaints did we resolve?” “Which factory has the lowest productivity?” Descriptive analytics also helps companies to classify customers into different segments, which enables them to develop specific marketing campaigns and advertising strategies.

2. Predictive analytics. Predictive analytics seeks to predict the future by examining historical data, detecting patterns or relationships in these data, and then extrapolating these relationships forward in time. For example, a marketer might wish to predict the response of different customer segments to an advertising campaign, a commodities trader might wish to predict short-term movements in commodities prices, or a skiwear manufacturer might want to predict next season’s demand for skiwear of a specific color and size. Predictive analytics can predict risk and find relationships in data not readily apparent with traditional analyses. Using advanced techniques, predictive analytics can help to detect hidden patterns in large quantities of data to segment and group data into coherent sets to predict behavior and detect trends. For instance, a bank manager might want to identify the most profitable customers or predict the chances that a loan applicant will default, or alert a credit-card customer to a potential fraudulent charge. Predictive analytics helps to answer questions such as “What will happen if demand falls by 10% or if supplier prices go up 5%?” “What do we expect to pay for fuel over the next several months?” “What is the risk of losing money in a new business venture?”

3. Prescriptive analytics. Many problems, such as aircraft or employee scheduling and supply chain design, for example, simply involve too many choices or alternatives for a human decision maker to effectively consider. Prescriptive analytics uses optimization to identify the best alternatives to minimize or maximize some objective. Prescriptive analytics is used in many areas of business, including operations, marketing, and finance. For example, we may determine the best pricing and advertising strategy to maximize revenue, the optimal amount of cash to store in ATMs, or the best mix of investments in a retirement portfolio to manage risk. The mathematical and statistical techniques of predictive analytics can also be combined with optimization to make decisions that take into account the uncertainty in the data. Prescriptive analytics addresses questions such as “How much should we produce to maximize profit?” “What is the best way of shipping goods from our factories to minimize costs?” “Should we change our plans if a natural disaster closes a supplier’s factory, and if so, by how much?”

¹⁷ “Business Analytics and Optimization for the Intelligent Enterprise” (April 2009). www.ibm.com/qbs/intelligent-enterprise
¹⁸ Bloomberg Businessweek Research Services and SAS, “The Current State of Business Analytics: Where Do We Go From Here?” (2011).
¹⁹ Andrew Jennings, “What Makes a Good Data Scientist?” Analytics Magazine (July–August 2013): 8–13. www.analytics-magazine.org
²⁰ Parts of this section are adapted from Irv Lustig, Brenda Dietrich, Christer Johnson, and Christopher Dziekan, “The Analytics Journey,” Analytics (November/December 2010). www.analytics-magazine.org
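The core idea of predictive analytics described above, detecting a relationship in historical data and extrapolating it forward in time, can be illustrated with a straight-line trend fitted by least squares. This is a minimal sketch under stated assumptions: the monthly sales figures are invented, the function name `fit_trend` is hypothetical, and a linear trend is only one of many possible predictive models.

```python
# A minimal trend-extrapolation sketch. Data and names are hypothetical.

def fit_trend(values):
    """Fit y = intercept + slope * t by ordinary least squares, t = 0, 1, 2, ..."""
    n = len(values)
    times = range(n)
    t_mean = sum(times) / n
    y_mean = sum(values) / n
    # Slope = covariance of (t, y) divided by variance of t.
    s_ty = sum((t - t_mean) * (y - y_mean) for t, y in zip(times, values))
    s_tt = sum((t - t_mean) ** 2 for t in times)
    slope = s_ty / s_tt
    intercept = y_mean - slope * t_mean
    return slope, intercept

# Six months of historical unit sales (invented for illustration).
monthly_sales = [100, 103, 109, 112, 115, 121]

slope, intercept = fit_trend(monthly_sales)

# Extrapolate the fitted line one period beyond the data (t = 6).
next_period = len(monthly_sales)
forecast = intercept + slope * next_period

print(f"estimated growth per month: {slope:.2f} units")
print(f"forecast for month {next_period + 1}: {forecast:.1f} units")
```

The forecast is only as good as the assumption that the historical pattern continues, which is why predictive results are usually reported with some measure of uncertainty.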

Analytics in Practice: Analytics in the Home Lending and Mortgage Industry²¹

Sometime during their lives, most Americans will receive a mortgage loan for a house or condominium. The process starts with an application. The application contains all pertinent information about the borrower that the lender will need. The bank or mortgage company then initiates a process that leads to a loan decision. It is here that key information about the borrower is provided by third-party providers. This information includes a credit report, verification of income, verification of assets, verification of employment, and an appraisal of the property, among others. The result of the processing function is a complete loan file that contains all the information and documents needed to underwrite the loan, which is the next step in the process.

Underwriting is where the loan application is evaluated for its risk. Underwriters evaluate whether the borrower can make payments on time, can afford to pay back the loan, and has sufficient collateral in the property to back up the loan. In the event the borrower defaults on their loan, the lender can sell the property to recover the amount of the loan. But, if the amount of the loan is greater than the value of the property, then the lender cannot recoup their money. If the underwriting process indicates that the borrower is creditworthy, has the capacity to repay the loan, and the value of the property in question is greater than the loan amount, then the loan is approved and will move to closing. Closing is the step where the borrower signs all the appropriate papers agreeing to the terms of the loan.

In reality, lenders have a lot of other work to do. First, they must perform a quality control review on a sample of the loan files that involves a manual examination of all the documents and information gathered. This process is designed to identify any mistakes that may have been made or information that is missing from the loan file. Because lenders do not have unlimited money to lend to borrowers, they frequently sell the loan to a third party so that they have fresh capital to lend to others. This occurs in what is called the secondary market. Freddie Mac and Fannie Mae are the two largest purchasers of mortgages in the secondary market.

The final step in the process is servicing. Servicing includes all the activities associated with providing the customer service on the loan like processing payments, managing property taxes held in escrow, and answering questions about the loan. In addition, the institution collects various operational data on the process to track its performance and efficiency, including the number of applications, loan types and amounts, cycle times (time to close the loan), bottlenecks in the process, and so on. Many different types of analytics are used:

Descriptive Analytics: This focuses on historical reporting, addressing such questions as:
• How many loan apps were taken each of the past 12 months?
• What was the total cycle time from app to close?
• What was the distribution of loan profitability by credit score and loan-to-value (LTV), which is the mortgage amount divided by the appraised value of the property?

Predictive Analytics: Predictive modeling uses mathematical, spreadsheet, and statistical models, and addresses questions such as:
• What impact on loan volume will a given marketing program have?
• How many processors or underwriters are needed for a given loan volume?
• Will a given process change reduce cycle time?

Prescriptive Analytics: This involves the use of simulation or optimization to drive decisions. Typical questions include:
• What is the optimal staffing to achieve a given profitability constrained by a fixed cycle time?
• What is the optimal product mix to maximize profit constrained by fixed staffing?

The mortgage market has become much more dynamic in recent years due to rising home values, falling interest rates, new loan products, and an increased desire by home owners to utilize the equity in their homes as a financial resource. This has increased the complexity and variability of the mortgage process and created an opportunity for lenders to proactively use the data that are available to them as a tool for managing their business. To ensure that the process is efficient, effective, and performed with quality, data and analytics are used every day to track what is done, who is doing it, and how long it takes.

²¹ Contributed by Craig Zielazny, BlueNote Analytics, LLC.
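The loan-to-value (LTV) ratio defined in the mortgage example above, and the collateral condition that the property be worth more than the loan, can be written out directly. The sketch below is illustrative only: the function names and the two loan applications are hypothetical, and real underwriting weighs many more factors than this single ratio.

```python
# A minimal LTV sketch. Function names and loan figures are hypothetical.

def loan_to_value(mortgage_amount, appraised_value):
    """LTV = mortgage amount divided by the appraised value of the property."""
    if appraised_value <= 0:
        raise ValueError("appraised value must be positive")
    return mortgage_amount / appraised_value

def collateral_covers_loan(mortgage_amount, appraised_value):
    """True when the property is worth more than the loan, i.e. LTV < 1."""
    return loan_to_value(mortgage_amount, appraised_value) < 1.0

# (mortgage amount, appraised value) in dollars, invented for illustration.
applications = {
    "A": (180_000, 225_000),
    "B": (260_000, 250_000),
}

ltv_by_app = {name: loan_to_value(*terms) for name, terms in applications.items()}
covered = {name: collateral_covers_loan(*terms) for name, terms in applications.items()}

for name in applications:
    print(f"application {name}: LTV = {ltv_by_app[name]:.2f}, collateral sufficient: {covered[name]}")
```

Application B has an LTV above 1, which is the situation the text describes in which the lender could not recoup the loan by selling the property.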

A wide variety of tools are used to support business analytics. These include:

• Database queries and analysis
• “Dashboards” to report key performance measures
• Data visualization
• Statistical methods
• Spreadsheets and predictive models
• Scenario and “what-if” analyses
• Simulation
• Forecasting
• Data and text mining
• Optimization
• Social media, Web, and text analytics

Although the tools used in descriptive, predictive, and prescriptive analytics are different, many applications involve all three. Here is a typical example in retail operations.

Example 1.1  Retail Markdown Decisions²²

As you probably know from your shopping experiences, most department stores and fashion retailers clear their seasonal inventory by reducing prices. The key question they face is what prices should they set, and when should they set them, to meet inventory goals and maximize revenue? For example, suppose that a store has 100 bathing suits of a certain style that go on sale from April 1 and wants to sell all of them by the end of June. Over each week of the 12-week selling season, they can make a decision to discount the price. Each week they face two decisions: whether to reduce the price and by how much. Over 12 weeks, this results in 24 decisions to make. For a major national chain that may carry thousands of products, this can easily result in millions of decisions that store managers have to make. Descriptive analytics can be used to examine historical data for similar products, such as the number of units sold, price at each point of sale, starting and ending inventories, and special promotions, newspaper ads, direct marketing ads, and so on, to understand what past decisions achieved. Predictive analytics can be used to predict sales based on pricing decisions. Finally, prescriptive analytics can be applied to find the best set of pricing decisions to maximize the total revenue.
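The decision count in Example 1.1 is simple arithmetic, and writing it out shows how quickly it grows for a chain. In the sketch below, the 12 weeks and the two decisions per week come from the example; the numbers of products and stores are hypothetical, as is the assumption that every store decides separately for every product.

```python
# Decision-count arithmetic for Example 1.1.

WEEKS_IN_SEASON = 12     # from the example: a 12-week selling season
DECISIONS_PER_WEEK = 2   # from the example: when to reduce the price, and by how much

decisions_per_product = WEEKS_IN_SEASON * DECISIONS_PER_WEEK

# Hypothetical chain size, chosen only to illustrate the scale.
products_per_store = 5_000
stores = 200

# Assumes each store makes its own markdown decisions for each product.
total_decisions = decisions_per_product * products_per_store * stores

print(f"decisions per product per season: {decisions_per_product}")
print(f"decisions across the chain:       {total_decisions:,}")
```

Even with modest assumed figures the total runs into the millions, which is why the text turns to prescriptive models rather than manual judgment for this kind of problem.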

Software Support

Many companies, such as IBM, SAS, and Tableau, have developed a variety of software and hardware solutions to support business analytics. For example, IBM’s Cognos Express, an integrated business intelligence and planning solution designed to meet the needs of midsize companies, provides reporting, analysis, dashboard, scorecard, planning, budgeting, and forecasting capabilities. It’s made up of several modules, including Cognos Express Reporter, for self-service reporting and ad hoc query; Cognos Express Advisor, for analysis and visualization; and Cognos Express Xcelerator, for Excel-based planning and business analysis. Information is presented to the business user in a business context that makes it easy to understand; with an easy-to-use interface, users can quickly gain the insight they need from their data to make the right decisions and then take action for effective and efficient business optimization and outcomes. SAS provides a variety of software that integrates data management, business intelligence, and analytics tools. SAS Analytics covers a wide range of capabilities, including predictive modeling and data mining, visualization, forecasting, optimization and model management, statistical analysis, text analytics, and more. Tableau Software provides simple drag-and-drop tools for visualizing data from spreadsheets and other databases. We encourage you to explore many of these products as you learn the basic principles of business analytics in this book.

²² Inspired by a presentation by Radhika Kulkarni, SAS Institute, “Data-Driven Decisions: Role of Operations Research in Business Analytics,” INFORMS Conference on Business Analytics and Operations Research, April 10–12, 2011.

Data for Business Analytics

Since the dawn of the electronic age and the Internet, both individuals and organizations have had access to an enormous wealth of data and information. Data are numerical facts and figures that are collected through some type of measurement process. Information comes from analyzing data, that is, extracting meaning from data to support evaluation and decision making.

Data are used in virtually every major function in a business. Modern organizations, which include not only for-profit businesses but also nonprofit organizations, need good data to support a variety of company purposes, such as planning, reviewing company performance, improving operations, and comparing company performance with competitors’ or best-practice benchmarks. Some examples of how data are used in business include the following:

• Annual reports summarize data about companies’ profitability and market share both in numerical form and in charts and graphs to communicate with shareholders.
• Accountants conduct audits to determine whether figures reported on a firm’s balance sheet fairly represent the actual data by examining samples (that is, subsets) of accounting data, such as accounts receivable.
• Financial analysts collect and analyze a variety of data to understand the contribution that a business provides to its shareholders. These typically include profitability, revenue growth, return on investment, asset utilization, operating margins, earnings per share, economic value added (EVA), shareholder value, and other relevant measures.
• Economists use data to help companies understand and predict population trends, interest rates, industry performance, consumer spending, and international trade. Such data are often obtained from external sources such as Standard & Poor’s Compustat data sets, industry trade associations, or government databases.
• Marketing researchers collect and analyze extensive customer data. These data often consist of demographics, preferences and opinions, transaction and payment history, shopping behavior, and a lot more. Such data may be collected by surveys, personal interviews, focus groups, or from shopper loyalty cards.
• Operations managers use data on production performance, manufacturing quality, delivery times, order accuracy, supplier performance, productivity, costs, and environmental compliance to manage their operations.
• Human resource managers measure employee satisfaction, training costs, turnover, market innovation, training effectiveness, and skills development.

Such data may be gathered from primary sources such as internal company records and business transactions, automated data-capturing equipment, or customer market surveys and from secondary sources such as government and commercial data sources, custom research providers, and online research.

Perhaps the most important source of data today is data obtained from the Web. With today’s technology, marketers collect extensive information about Web behaviors, such as the number of page views, visitor’s country, time of view, length of time, origin and destination paths, products they searched for and viewed, products purchased, what reviews they read, and many others. Using analytics, marketers can learn what content is being viewed most often, what ads were clicked on, who the most frequent visitors are, and what types of visitors browse but don’t buy. Not only can marketers understand what customers have done, but they can better predict what they intend to do in the future. For example, if a bank knows that a customer has browsed for mortgage rates and homeowner’s insurance, they can target the customer with homeowner loans rather than credit cards or automobile loans. Traditional Web data are now being enhanced with social media data from Facebook, cell phones, and even Internet-connected gaming devices.

As one example, a home furnishings retailer wanted to increase the rate of sales for customers who browsed their Web site. They developed a large data set that covered more than 7,000 demographic, Web, catalog, and retail behavioral attributes for each customer. They used predictive analytics to determine how well a customer would respond to different e-mail marketing offers and customized promotions to individual customers. This not only helped them to determine where to most effectively spend marketing resources but doubled the response rate compared to previous marketing campaigns, with a projected multimillion dollar increase in sales.²³

Data Sets and Databases

A data set is simply a collection of data. Marketing survey responses, a table of historical stock prices, and a collection of measurements of dimensions of a manufactured item are examples of data sets. A database is a collection of related files containing records on people, places, or things. The people, places, or things for which we store and maintain information are called entities.²⁴ A database for an online retailer that sells instructional fitness books and DVDs, for instance, might consist of a file for each of three entities: publishers from which goods are purchased, customer sales transactions, and product inventory. A database file is usually organized in a two-dimensional table, where the columns correspond to each individual element of data (called fields, or attributes), and the rows represent records of related data elements. A key feature of computerized databases is the ability to quickly relate one set of files to another.

Databases are important in business analytics for accessing data, making queries, and other data and information management activities. Software such as Microsoft Access provides powerful analytical database capabilities. However, in this book, we won’t be delving deeply into databases or database management systems but will work with individual database files or simple data sets. Because spreadsheets are convenient tools for storing and manipulating data sets and database files, we will use them for all examples and problems.

Example 1.2  A Sales Transaction Database File²⁵

Figure 1.2 shows a portion of sales transactions on an Excel worksheet for a particular day for an online seller of instructional fitness books and DVDs. The fields are shown in row 3 of the spreadsheet and consist of the customer ID, region, payment type, transaction code, source of the sale, amount, product purchased, and time of day. Each record (starting in row 4) has a value for each of these fields.
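The field-and-record structure of the database file in Example 1.2 maps naturally onto a list of records keyed by field name, and a descriptive query such as total sales by region becomes a short loop. The sketch below is only an illustration: the field identifiers are hypothetical names for the fields listed in the example, and the four records are invented rather than taken from Figure 1.2.

```python
from collections import defaultdict

# Hypothetical identifiers for the fields listed in Example 1.2.
FIELDS = ("cust_id", "region", "payment", "transaction_code",
          "source", "amount", "product", "time_of_day")

# Invented rows; each tuple holds one value per field, in field order.
rows = [
    (1, "East",  "Credit", 5001, "Web",   20.0, "DVD",  "09:15"),
    (2, "West",  "Paypal", 5002, "Web",   15.5, "Book", "10:40"),
    (3, "East",  "Credit", 5003, "Email", 30.0, "Book", "13:05"),
    (4, "North", "Paypal", 5004, "Web",   24.5, "DVD",  "21:30"),
]

# Each record pairs every field name with that row's value.
records = [dict(zip(FIELDS, row)) for row in rows]

# Descriptive query 1: total sales amount in each region.
totals = defaultdict(float)
for record in records:
    totals[record["region"]] += record["amount"]
sales_by_region = dict(totals)

# Descriptive query 2: which customers bought a DVD?
dvd_customers = [r["cust_id"] for r in records if r["product"] == "DVD"]

print(sales_by_region)
print(dvd_customers)
```

In a spreadsheet the same questions would be answered with filtering or a PivotTable; the point here is only that columns act as fields and rows act as records.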

Figure 1.2  A Portion of Excel File Sales Transactions Database

²³ Based on a presentation by Bill Franks of Teradata, “Optimizing Customer Analytics: How Customer Level Web Data Can Help,” INFORMS Conference on Business Analytics and Operations Research, April 10–12, 2011.
²⁴ Kenneth C. Laudon and Jane P. Laudon, Essentials of Management Information Systems, 9th ed. (Upper Saddle River, NJ: Prentice Hall, 2011): 159.
²⁵ Adapted and modified from Kenneth C. Laudon and Jane P. Laudon, Essentials of Management Information Systems.

Big Data

Today, nearly all data are captured digitally. As a result, data have been growing at an overwhelming rate, being measured by terabytes (10¹² bytes), petabytes (10¹⁵ bytes), exabytes (10¹⁸ bytes), and even larger units. Just think of the amount of data stored on Facebook, Twitter, or Amazon servers, or the amount of data acquired daily from scanning items at a national grocery chain such as Kroger and its affiliates. Walmart, for instance, has over one million transactions each hour, yielding more than 2.5 petabytes of data. Analytics professionals have coined the term big data to refer to massive amounts of business data from a wide variety of sources, much of which is available in real time, and much of which is uncertain or unpredictable. IBM calls these characteristics volume, variety, velocity, and veracity. Most often, big data revolves around customer behavior and customer experiences. Big data provides an opportunity for organizations to gain a competitive advantage, if the data can be understood and analyzed effectively to make better business decisions.

The volume of data continues to increase; what is considered “big” today will be even bigger tomorrow. In one study of information technology (IT) professionals in 2010, nearly half of survey respondents ranked data growth among their top three challenges. Big data come from many sources, and can be numerical, textual, and even audio and video data. Big data are captured using sensors (for example, supermarket scanners), click streams from the Web, customer transactions, e-mails, tweets and social media, and other ways. Big data sets are unstructured and messy, requiring sophisticated analytics to integrate and process the data, and understand the information contained in them. Not only are big data being captured in real time, but they must be incorporated into business decisions at a faster rate. Data for processes such as fraud detection must be analyzed quickly to have value. IBM has added a fourth dimension, veracity: the level of reliability associated with data. Having high-quality data and understanding the uncertainty in data are essential for good decision making. Assessing data veracity is an important role for statistical methods.

Big data can help organizations better understand and predict customer behavior and improve customer service. A study by the McKinsey Global Institute noted that “The effective use of big data has the potential to transform economies, delivering a new wave of productivity growth and consumer surplus. Using big data will become a key basis of competition for existing companies, and will create new competitors who are able to attract employees that have the critical skills for a big data world.”²⁶ However, understanding big data requires advanced analytics tools such as data mining and text analytics, and new technologies such as cloud computing, faster multi-core processors, large memory spaces, and solid-state drives.

²⁶ James Manyika, Michael Chui, Brad Brown, Jacques Bughin, Richard Dobbs, Charles Roxburgh, and Angela Hung Byers, “Big Data: The Next Frontier for Innovation, Competition, and Productivity,” McKinsey & Company, May 2011.
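The storage units mentioned in this section are powers of ten, so converting between them is a matter of dividing by the right constant. The short sketch below applies this to the 2.5-petabyte figure quoted for Walmart; the constant and variable names are hypothetical, and the decimal (power-of-ten) definitions are the ones given in the text.

```python
# Decimal storage units as defined in the text.
TERABYTE = 10**12   # bytes
PETABYTE = 10**15   # bytes
EXABYTE = 10**18    # bytes

# The Walmart figure quoted in the text, expressed in bytes.
walmart_bytes = 2.5 * PETABYTE

# Convert the same quantity into other units by dividing by the unit size.
in_terabytes = walmart_bytes / TERABYTE
in_exabytes = walmart_bytes / EXABYTE

print(f"{in_terabytes:,.0f} terabytes")
print(f"{in_exabytes} exabytes")
```

Each step up the scale is a factor of 1,000, so a quantity that sounds large in terabytes is still a small fraction of an exabyte.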

Metrics and Data Classification

A metric is a unit of measurement that provides a way to objectively quantify performance. For example, senior managers might assess overall business performance using such metrics as net profit, return on investment, market share, and customer satisfaction. A plant manager might monitor such metrics as the proportion of defective parts produced or the number of inventory turns each month. For a Web-based retailer, some useful metrics are the percentage of orders filled accurately and the time taken to fill a customer’s order. Measurement is the act of obtaining data associated with a metric. Measures are numerical values associated with a metric.

Metrics can be either discrete or continuous. A discrete metric is one that is derived from counting something. For example, a delivery is either on time or not; an order is complete or incomplete; or an invoice can have one, two, three, or any number of errors. Some discrete metrics associated with these examples would be the proportion of on-time deliveries, the number of incomplete orders each day, and the number of errors per invoice. Continuous metrics are based on a continuous scale of measurement. Any metrics involving dollars, length, time, volume, or weight, for example, are continuous.

Another classification of data is by the type of measurement scale. Data may be classified into four groups:

1. Categorical (nominal) data, which are sorted into categories according to specified characteristics. For example, a firm’s customers might be classified by their geographical region (North America, South America, Europe, and Pacific); employees might be classified as managers, supervisors, and associates. The categories bear no quantitative relationship to one another, but we usually assign an arbitrary number to each category to ease the process of managing the data and computing statistics. Categorical data are usually counted or expressed as proportions or percentages.

2. Ordinal data, which can be ordered or ranked according to some relationship to one another. College football or basketball rankings are ordinal; a higher ranking signifies a stronger team but does not specify any numerical measure of strength. Ordinal data are more meaningful than categorical data because data can be compared to one another. A common example in business is data from survey scales, such as rating a service as poor, average, good, very good, or excellent. Such data are categorical but also have a natural order (excellent is better than very good) and, consequently, are ordinal. However, ordinal data have no fixed units of measurement, so we cannot make meaningful numerical statements about differences between categories. Thus, we cannot say that the difference between excellent and very good is the same as between good and average, for example. Similarly, a team ranked number 1 may be far superior to the number 2 team, whereas there may be little difference between teams ranked 9th and 10th.

3. Interval data, which are ordinal but have constant differences between observations and have arbitrary zero points. Common examples are time and temperature. Time is relative to global location, and calendars have arbitrary starting dates (compare, for example, the standard Gregorian calendar with the Chinese calendar). Both the Fahrenheit and Celsius scales represent a specified measure of distance (degrees) but have arbitrary zero points. Thus we cannot take meaningful ratios; for example, we cannot say that 50 degrees is twice as hot as 25 degrees. However, we can compare differences. Another example is SAT or GMAT scores. The scores can be used to rank students, but only differences between scores provide information on how much better one student performed over another; ratios make little sense. In contrast to ordinal data, interval data allow meaningful comparison of ranges, averages, and other statistics. In business, data from survey scales, while technically ordinal, are often treated as interval data when numerical scales are associated with the categories (for instance, 1 = poor, 2 = average, 3 = good, 4 = very good, 5 = excellent). Strictly speaking, this is not correct because the “distance” between categories may not be perceived as the same (respondents might perceive a larger gap between poor and average than between good and very good, for example). Nevertheless, many users of survey data treat them as interval when analyzing the data, particularly when only a numerical scale is used without descriptive labels.

4. Ratio data, which are continuous and have a natural zero. Most business and economic data, such as dollars and time, fall into this category. For example, the measure dollars has an absolute zero. Ratios of dollar figures are meaningful. For example, knowing that the Seattle region sold \$12 million in March whereas the Tampa region sold \$6 million means that Seattle sold twice as much as Tampa.

This classification is hierarchical in that each level includes all the information content of the one preceding it. For example, ordinal data are also categorical, and ratio information can be converted to any of the other types of data. Interval information can be converted to ordinal or categorical data but cannot be converted to ratio data without the knowledge of the absolute zero point. Thus, a ratio scale is the strongest form of measurement.
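The hierarchy of measurement scales can be illustrated with a short Python sketch. It takes the ratio-scale sales figures quoted above for Seattle and Tampa and converts them downward into ordinal ranks and categorical labels. The \$10 million threshold used for the labels is a hypothetical cutoff chosen only for illustration.

```python
# Ratio-scale data from the text: March sales in millions of dollars.
sales = {"Seattle": 12.0, "Tampa": 6.0}

# Ratio scale: ratios are meaningful because dollars have a natural zero.
ratio = sales["Seattle"] / sales["Tampa"]

# Differences are meaningful on interval and ratio scales alike.
difference = sales["Seattle"] - sales["Tampa"]

# Ordinal: rank regions from highest to lowest sales (1 = highest).
ordered = sorted(sales, key=sales.get, reverse=True)
ranks = {region: position for position, region in enumerate(ordered, start=1)}

# Categorical: label each region with a hypothetical threshold (not from the text).
THRESHOLD = 10.0
labels = {region: ("high" if value >= THRESHOLD else "low")
          for region, value in sales.items()}

print(ratio, difference, ranks, labels)
```

The conversion only works in one direction: the ranks and labels cannot be turned back into dollar figures, which is why the ratio scale carries the most information.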

Example 1.3  Classifying Data Elements in a Purchasing Database27

Figure 1.3 shows a portion of a data set containing all items that an aircraft component manufacturing company has purchased over the past 3 months. The data provide the supplier; order number; item number, description, and cost; quantity ordered; cost per order; the suppliers’ accounts payable (A/P) terms; and the order and arrival dates. We may classify each of these types of data as follows:

• Supplier: categorical
• Order Number: ordinal
• Item Number: categorical
• Item Description: categorical
• Item Cost: ratio
• Quantity: ratio
• Cost per Order: ratio
• A/P Terms: ratio
• Order Date: interval
• Arrival Date: interval

We might use these data to evaluate the average speed of delivery and rank the suppliers (thus creating ordinal data) by this metric. (We see how to do this in the next chapter.)

27 Based on Laudon and Laudon, Essentials of Management Information Systems.

Figure 1.3 Portion of Excel File Purchase Orders Data

Data Reliability and Validity Poor data can result in poor decisions. In one situation, a distribution system design model relied on data obtained from the corporate finance department. Transportation costs were determined using a formula based on the latitude and longitude of the locations of plants and customers. But when the solution was represented on a geographic information system (GIS) mapping program, one of the customers was in the Atlantic Ocean. Thus, data used in business decisions need to be reliable and valid. Reliability means that data are accurate and consistent. Validity means that data correctly measure what they are supposed to measure. For example, a tire pressure gauge that consistently reads several pounds of pressure below the true value is not reliable, although it is valid because it does measure tire pressure. The number of calls to a customer service desk might be counted correctly each day (and thus is a reliable measure), but not valid if it is used to assess customer dissatisfaction, as many calls may be simple queries. Finally, a survey question that asks a customer to rate the quality of the food in a restaurant may be neither reliable (because different customers may have conflicting perceptions) nor valid (if the intent is to measure customer satisfaction, as satisfaction generally includes other elements of service besides food).

Models in Business Analytics To make a decision, we must be able to specify the decision alternatives that represent the choices that can be made and criteria for evaluating the alternatives. Specifying decision alternatives might be very simple; for example, you might need to choose one of three corporate health plan options. Other situations can be more complex; for example, in locating a new distribution center, it might not be possible to list just a small number of alternatives. The set of potential locations might be anywhere in the United States or even within a large geographical region such as Asia. Decision criteria might be to maximize discounted net profits, customer satisfaction, or social benefits or to minimize costs, environmental impact, or some measure of loss. Many decision problems can be formalized using a model. A model is an abstraction or representation of a real system, idea, or object. Models capture the most important features of a problem and present them in a form that is easy to interpret. A model can be as simple as a written or verbal description of some phenomenon, a visual representation such as a graph or a flowchart, or a mathematical or spreadsheet representation (see ­Example 1.4). Models can be descriptive, predictive, or prescriptive, and therefore are used in a wide variety of business analytics applications. In Example 1.4, note that the first two


Example 1.4  Three Forms of a Model

The sales of a new product, such as a first-generation iPad, Android phone, or 3-D television, often follow a common pattern. We might represent this in one of the following three ways:

1. A simple verbal description of sales might be: The rate of sales starts small as early adopters begin to evaluate a new product and then begins to grow at an increasing rate over time as positive customer feedback spreads. Eventually, the market begins to become saturated and the rate of sales begins to decrease.

2. A sketch of sales as an S-shaped curve over time, as shown in Figure 1.4, is a visual model that conveys this phenomenon.

3. Finally, analysts might identify a mathematical model that characterizes this curve. Several different mathematical functions do this; one is called a Gompertz curve and has the formula $S = ae^{be^{ct}}$, where S = sales, t = time, e is the base of natural logarithms, and a, b, and c are constants. Of course, you would not be expected to know this; that’s what analytics professionals do. Such a mathematical model provides the ability to predict sales quantitatively, and to analyze potential decisions by asking “what if?” questions.
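To make the third form of the model concrete, the following Python sketch evaluates a Gompertz curve at several points in time. The constants are hypothetical and were chosen only so that the curve rises in an S shape toward a saturation level of 1,000 units; they are not estimates from any data in the text.

```python
import math

def gompertz_sales(t, a, b, c):
    """Gompertz curve S = a * exp(b * exp(c * t))."""
    return a * math.exp(b * math.exp(c * t))

# Hypothetical constants: a is the saturation level; b < 0 and c < 0
# make sales start small, accelerate, and then level off.
A, B, C = 1000.0, -5.0, -0.5

for t in range(0, 21, 4):
    print(t, round(gompertz_sales(t, A, B, C), 1))
```

Changing the constants and rerunning the loop is a simple way to ask the “what if?” questions the example mentions.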

forms of the model are purely descriptive; they simply explain the phenomenon. While the mathematical model also describes the phenomenon, it can be used to predict sales at a future time. Models are usually developed from theory or observation and establish relationships between actions that decision makers might take and results that they might expect, thereby allowing the decision makers to predict what might happen based on the model. Models complement decision makers’ intuition and often provide insights that intuition cannot. For example, one early application of analytics in marketing involved a study of sales operations. Sales representatives had to divide their time between large and small customers and between acquiring new customers and keeping old ones. The problem was to determine how the representatives should best allocate their time. Intuition suggested that they should concentrate on large customers and that it was much harder to acquire a new customer than to keep an old one. However, intuition could not tell whether they should concentrate on the 100 largest or the 1,000 largest customers, or how much effort to spend on acquiring new customers. Models of sales force effectiveness and customer response patterns provided the insight to make these decisions. However, it is important to understand that all models are only representations of the real world and, as such, cannot capture every nuance that decision makers face in reality. Decision makers must often

Figure 1.4 New Product Sales Over Time

modify the policies that models suggest to account for intangible factors that they might not have been able to incorporate into the model. A simple descriptive model is a visual representation called an influence diagram because it describes how various elements of the model influence, or relate to, others. An influence diagram is a useful approach for conceptualizing the structure of a model and can assist in building a mathematical or spreadsheet model. The elements of the model are represented by circular symbols called nodes. Arrows called branches connect the nodes and show which elements influence others. Influence diagrams are quite useful in the early stages of model building when we need to understand and characterize key relationships. Example 1.5 shows how to construct simple influence diagrams, and Example 1.6 shows how to build a mathematical model, drawing upon the influence diagram.

Example 1.5  An Influence Diagram for Total Cost

From basic business principles, we know that the total cost of producing a fixed volume of a product is comprised of fixed costs and variable costs. Thus, a simple influence diagram that shows these relationships is given in Figure 1.5. We can develop a more detailed model by noting that the variable cost depends on the unit variable cost as well as the quantity produced. The expanded model is shown in Figure 1.6. In this figure, all the nodes that have no branches pointing into them are inputs to the model. We can see that the unit variable cost and fixed costs are data inputs in the model. The quantity produced, however, is a decision variable because it can be controlled by the manager of the operation. The total cost is the output (note that it has no branches pointing out of it) that we would be interested in calculating. The variable cost node links some of the inputs with the output and can be considered as a “building block” of the model for total cost.

Figure 1.5 An Influence Diagram Relating Total Cost to Its Key Components (nodes: Fixed Cost and Variable Cost, each pointing to Total Cost)

Figure 1.6 An Expanded Influence Diagram for Total Cost (nodes: Unit Variable Cost and Quantity Produced, each pointing to Variable Cost; Fixed Cost and Variable Cost, each pointing to Total Cost)

Example 1.6  Building a Mathematical Model from an Influence Diagram

We can develop a mathematical model from the influence diagram in Figure 1.6. First, we need to specify the precise nature of the relationships among the various quantities. For example, we can easily state that

Total Cost = Fixed Cost + Variable Cost  (1.1)

Logic also suggests that the variable cost is the unit variable cost times the quantity produced. Thus,

Variable Cost = Unit Variable Cost × Quantity Produced  (1.2)

By substituting this into equation (1.1), we have

Total Cost = Fixed Cost + Variable Cost = Fixed Cost + Unit Variable Cost × Quantity Produced  (1.3)

Using these relationships, we may develop a mathematical representation by defining symbols for each of these quantities:

TC = total cost
V = unit variable cost
F = fixed cost
Q = quantity produced

This results in the model

TC = F + VQ  (1.4)
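The structure of the influence diagram carries over directly to code. In the Python sketch below, each function corresponds to a node that has branches pointing into it, and the function arguments are the inputs. The numerical values are hypothetical and serve only to show the model being evaluated.

```python
def variable_cost(unit_variable_cost, quantity):
    """Equation (1.2): Variable Cost = Unit Variable Cost x Quantity Produced."""
    return unit_variable_cost * quantity

def total_cost(fixed_cost, unit_variable_cost, quantity):
    """Equation (1.4): TC = F + VQ."""
    return fixed_cost + variable_cost(unit_variable_cost, quantity)

# Hypothetical inputs, chosen only for illustration.
F = 10_000.0   # fixed cost (data input)
V = 20.0       # unit variable cost (data input)

for Q in (0, 100, 500):   # quantity produced (decision variable)
    print(Q, total_cost(F, V, Q))
```

Keeping the variable cost as its own function mirrors its role as a “building block” between the inputs and the output.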

Decision Models

A decision model is a logical or mathematical representation of a problem or business situation that can be used to understand, analyze, or facilitate making a decision. Most decision models have three types of input:

1. Data, which are assumed to be constant for purposes of the model. Some examples would be costs, machine capacities, and intercity distances.

2. Uncontrollable variables, which are quantities that can change but cannot be directly controlled by the decision maker. Some examples would be customer demand, inflation rates, and investment returns. Often, these variables are uncertain.

3. Decision variables, which are controllable and can be selected at the discretion of the decision maker. Some examples would be production quantities (see Example 1.5), staffing levels, and investment allocations.

Decision models characterize the relationships among the data, uncontrollable variables, and decision variables, and the outputs of interest to the decision maker (see Figure 1.7). Decision models can be represented in various ways, most typically with mathematical functions and spreadsheets. Spreadsheets are ideal vehicles for implementing decision models because of their versatility in managing data, evaluating different scenarios, and presenting results in a meaningful fashion.

Figure 1.7 Nature of Decision Models (Inputs: Data, Uncontrollable Variables, and Decision Variables → Decision Model → Outputs: Measures of Performance or Behavior)

How might we use the model in Example 1.6 to help make a decision? Suppose that a manufacturer has the option of producing a part in-house or outsourcing it from a supplier (the decision variables). Should the firm produce the part or outsource it? The decision depends on the anticipated volume of demand (an uncontrollable variable); for high volumes, the cost to manufacture in-house will be lower than outsourcing, because the fixed costs can be spread over a large number of units. For small volumes, it would be more economical to outsource. Knowing the total cost of both alternatives (based on data for fixed and variable manufacturing costs and purchasing costs) and the break-even point would facilitate the decision. A numerical example is provided in Example 1.7.

Example 1.7  A Break-Even Decision Model

Suppose that a manufacturer can produce a part for \$125/unit with a fixed cost of \$50,000. The alternative is to outsource production to a supplier at a unit cost of \$175. The total manufacturing cost is expressed by using equation (1.4):

TC (manufacturing) = \$50,000 + \$125 × Q

and the total outsourcing cost can be written as

TC (outsourcing) = \$175 × Q

Mathematical models are easy to manipulate; for example, it is easy to find the break-even volume by setting TC (manufacturing) = TC (outsourcing) and solving for Q:

\$50,000 + \$125 × Q = \$175 × Q
\$50,000 = \$50 × Q
Q = 1,000

Thus, if the anticipated production volume is greater than 1,000, it is more economical to manufacture the part; if it is less than 1,000, then it should be outsourced. This is shown graphically in Figure 1.8. We may also develop a general formula for the break-even point by letting C be the unit cost of outsourcing the part and setting TC (manufacturing) = TC (outsourcing) using the formulas:

$F + VQ = CQ$

$Q = \dfrac{F}{C - V}$  (1.5)
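The break-even calculation is easy to check in code. The following Python sketch implements equation (1.5) and applies it to the numbers in Example 1.7. The function names are illustrative choices, not part of the text.

```python
def break_even_quantity(fixed_cost, unit_cost_make, unit_cost_buy):
    """Equation (1.5): Q = F / (C - V). Requires C > V."""
    if unit_cost_buy <= unit_cost_make:
        raise ValueError("outsourcing unit cost must exceed in-house unit cost")
    return fixed_cost / (unit_cost_buy - unit_cost_make)

def cheaper_option(quantity, fixed_cost, unit_cost_make, unit_cost_buy):
    """Compare TC (manufacturing) with TC (outsourcing) at a given volume."""
    make = fixed_cost + unit_cost_make * quantity
    buy = unit_cost_buy * quantity
    if make < buy:
        return "manufacture"
    if buy < make:
        return "outsource"
    return "indifferent"

q_star = break_even_quantity(50_000, 125, 175)
print(q_star)                                   # 1000.0
print(cheaper_option(1_500, 50_000, 125, 175))  # manufacture
print(cheaper_option(500, 50_000, 125, 175))    # outsource
```

The guard against C ≤ V reflects the fact that a break-even volume exists only when outsourcing costs more per unit than in-house production.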

Many models are developed by analyzing historical data. Example 1.8 shows how historical data might be used to develop a decision model that can be used to predict the impact of pricing and promotional strategies in the grocery industry.

Figure 1.8 Graphical Illustration of Break-Even Analysis

Example 1.8  A Sales-Promotion Decision Model

In the grocery industry, managers typically need to know how best to use pricing, coupons, and advertising strategies to influence sales. Grocers often study the relationship of sales volume to these strategies by conducting controlled experiments to identify the relationship between them and sales volumes.28 That is, they implement different combinations of pricing, coupons, and advertising, observe the sales that result, and use analytics to develop a predictive model of sales as a function of these decision strategies. For example, suppose that a grocer who operates three stores in a small city varied the price, coupons (yes = 1, no = 0), and advertising expenditures in a local newspaper over a 16-week period and observed the following sales:

| Week | Price (\$) | Coupon (0,1) | Advertising (\$) | Store 1 Sales (Units) | Store 2 Sales (Units) | Store 3 Sales (Units) |
|------|-----------|--------------|------------------|-----------------------|-----------------------|-----------------------|
| 1    | 6.99      | 0            | 0                | 501                   | 510                   | 481                   |
| 2    | 6.99      | 0            | 150              | 772                   | 748                   | 775                   |
| 3    | 6.99      | 1            | 0                | 554                   | 528                   | 506                   |
| 4    | 6.99      | 1            | 150              | 838                   | 785                   | 834                   |
| 5    | 6.49      | 0            | 0                | 521                   | 519                   | 500                   |
| 6    | 6.49      | 0            | 150              | 723                   | 790                   | 723                   |
| 7    | 6.49      | 1            | 0                | 510                   | 556                   | 520                   |
| 8    | 6.49      | 1            | 150              | 818                   | 773                   | 800                   |
| 9    | 7.59      | 0            | 0                | 479                   | 491                   | 486                   |
| 10   | 7.59      | 0            | 150              | 825                   | 822                   | 757                   |
| 11   | 7.59      | 1            | 0                | 533                   | 513                   | 540                   |
| 12   | 7.59      | 1            | 150              | 839                   | 791                   | 832                   |
| 13   | 5.49      | 0            | 0                | 484                   | 480                   | 508                   |
| 14   | 5.49      | 0            | 150              | 686                   | 683                   | 708                   |
| 15   | 5.49      | 1            | 0                | 543                   | 531                   | 530                   |
| 16   | 5.49      | 1            | 150              | 767                   | 743                   | 779                   |

To better understand the relationships among price, coupons, and advertising, the grocer might have developed the following model using business analytics tools:

sales = 500 − 0.05 × price + 30 × coupons + 0.08 × advertising + 0.25 × price × advertising

In this model, the decision variables are price, coupons, and advertising. The values 500, −0.05, 30, 0.08, and 0.25 are effects of the input data to the model that are estimated from the data obtained from the experiment. They reflect the impact on sales of changing the decision variables. For example, an increase in price of \$1 results in a 0.05-unit decrease in weekly sales; using coupons results in a 30-unit increase in weekly sales. In this example, there are no uncontrollable input variables. The output of the model is the sales units of the product. For example, if the price is \$6.99, no coupons are offered, and no advertising is done (the experiment corresponding to week 1), the model estimates sales as

sales = 500 − 0.05 × \$6.99 + 30 × 0 + 0.08 × 0 + 0.25 × \$6.99 × 0 ≈ 500 units

We see that the actual sales in week 1 varied between 481 and 510 in the three stores. Thus, this model predicts a good estimate for sales; however, it does not tell us anything about the potential variability or prediction error. Nevertheless, the manager can use this model to evaluate different pricing, promotion, and advertising strategies, and help choose the best strategy to maximize sales or profitability.

28 Roger J. Calantone, Cornelia Droge, David S. Litvack, and C. Anthony di Benedetto. “Flanking in a Price War,” Interfaces, 19, 2 (1989): 1–12.
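The sales-promotion model in Example 1.8 can be evaluated with a few lines of Python. The sketch below codes the stated equation and compares its estimates with the observed sales for weeks 1 and 2 of the experiment. The mean absolute error is used here as one simple measure of prediction error; that choice is an assumption of this sketch, not a method described in the example.

```python
def predicted_sales(price, coupon, advertising):
    """Sales model from Example 1.8."""
    return (500 - 0.05 * price + 30 * coupon
            + 0.08 * advertising + 0.25 * price * advertising)

def mean_absolute_error(estimate, actuals):
    """Average absolute gap between one estimate and several observations."""
    return sum(abs(actual - estimate) for actual in actuals) / len(actuals)

# Observed sales for the three stores, taken from the experiment data.
week1_actual = [501, 510, 481]   # price 6.99, no coupon, no advertising
week2_actual = [772, 748, 775]   # price 6.99, no coupon, advertising 150

week1_estimate = predicted_sales(6.99, 0, 0)
week2_estimate = predicted_sales(6.99, 0, 150)

print(week1_estimate, mean_absolute_error(week1_estimate, week1_actual))
print(week2_estimate, mean_absolute_error(week2_estimate, week2_actual))
```

Looping the same comparison over all 16 weeks would give a rough picture of the variability that the single-number prediction does not convey.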


Model Assumptions All models are based on assumptions that reflect the modeler’s view of the “real world.” Some assumptions are made to simplify the model and make it more tractable; that is, able to be easily analyzed or solved. Other assumptions might be made to better characterize historical data or past observations. The task of the modeler is to select or build an appropriate model that best represents the behavior of the real situation. For example, economic theory tells us that demand for a product is negatively related to its price. Thus, as prices increase, demand falls, and vice versa (a phenomenon that you may recognize as price elasticity—the ratio of the percentage change in demand to the percentage change in price). Different mathematical models can describe this phenomenon. In the following examples, we illustrate two of them. (Both of these examples can be found in the Excel file Demand Prediction Models. We introduce the use of spreadsheets in analytics in the next chapter.)

Example 1.9  A Linear Demand Prediction Model

A simple model to predict demand as a function of price is the linear model

$D = a - bP$  (1.6)

where D is the demand rate, P is the unit price, a is a constant that estimates the demand when the price is zero, and b is the slope of the demand function. This model is most applicable when we want to predict the effect of small changes around the current price. For example, suppose we know that when the price is \$100, demand is 19,000 units and that demand falls by 10 for each dollar of price increase. Using simple algebra, we can determine that a = 20,000 and b = 10. Thus, if the price is \$80, the predicted demand is

$D = 20{,}000 - 10(80) = 19{,}200$ units

If the price increases to \$90, the model predicts demand as

$D = 20{,}000 - 10(90) = 19{,}100$ units

If the price is \$100, demand would be

$D = 20{,}000 - 10(100) = 19{,}000$ units

and so on. A chart of demand as a function of price is shown in Figure 1.9 as price varies between \$80 and \$120. We see that there is a constant decrease in demand for each \$10 increase in price, a characteristic of a linear model.

Figure 1.9 Graph of Linear Demand Model $D = a - bP$
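The “simple algebra” in Example 1.9 can be written out as a short Python sketch. Given one known price and demand pair and the drop in demand per dollar, it recovers a and b and then reproduces the predictions in the example. The function name is an illustrative choice.

```python
def fit_linear_demand(price, demand, drop_per_dollar):
    """Return (a, b) for D = a - b*P from one known point and the slope."""
    b = drop_per_dollar
    a = demand + b * price   # rearranged from demand = a - b * price
    return a, b

a, b = fit_linear_demand(price=100, demand=19_000, drop_per_dollar=10)
print(a, b)   # 20000 10

for p in (80, 90, 100):
    print(p, a - b * p)   # 19200, 19100, 19000
```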

Example 1.10  A Nonlinear Demand Prediction Model

An alternative model assumes that price elasticity is constant. In this case, the appropriate model is

$D = cP^{-d}$  (1.7)

where c is the demand when the price is 0 and d > 0 is the price elasticity. To be consistent with Example 1.9, we assume that when the price is zero, demand is 20,000. Therefore, c = 20,000. We will also, as in Example 1.9, assume that when the price is \$100, D = 19,000. Using these values in equation (1.7), we can determine the value for d (we can do this mathematically using logarithms, but we’ll see how to do this very easily using Excel in Chapter 11); this is d = 0.0111382. Thus, if the price is \$80, then the predicted demand is

$D = 20{,}000(80)^{-0.0111382} = 19{,}047$

If the price is \$90, the demand would be

$D = 20{,}000(90)^{-0.0111382} = 19{,}022$

If the price is \$100, demand is

$D = 20{,}000(100)^{-0.0111382} = 19{,}000$

A graph of demand as a function of price is shown in Figure 1.10. The predicted demand falls in a slight nonlinear fashion as price increases. For example, demand decreases by 25 units when the price increases from \$80 to \$90, but only by 22 units when the price increases from \$90 to \$100. If the price increases to \$110, you would see a smaller decrease in demand. Therefore, we see a nonlinear relationship in contrast to Example 1.9.
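The logarithm step mentioned in Example 1.10 can be sketched in Python. Starting from 19,000 = 20,000 × 100^(−d), taking logarithms of both sides gives d = −ln(19,000/20,000) / ln(100). The code below computes d this way and then evaluates the model at the prices used in the example. The variable and function names are illustrative.

```python
import math

c = 20_000                 # constant used in Example 1.10
known_price = 100
known_demand = 19_000

# Solve known_demand = c * known_price**(-d) for d using logarithms.
d = -math.log(known_demand / c) / math.log(known_price)

def nonlinear_demand(price):
    """Constant-elasticity model D = c * P**(-d)."""
    return c * price ** (-d)

print(round(d, 7))
for p in (80, 90, 100):
    print(p, round(nonlinear_demand(p)))
```

The rounded demands match the values of 19,047, 19,022, and 19,000 reported in the example.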

The models in Examples 1.9 and 1.10 make different predictions of demand for different prices (other than \$100, the price at which both were calibrated). Which model is best? The answer may be neither. First of all, the development of realistic models requires many price point changes within a carefully designed experiment. Secondly, it should also include data on competition and customer disposable income, both of which are hard to determine. Nevertheless, it is possible to develop price elasticity models with limited price ranges and narrow customer segments. A good starting point would be to create a historical database with detailed information on all past pricing actions. Unfortunately, practitioners have observed that such models are not widely used in retail marketing, suggesting a lot of opportunity to apply business analytics.29

Figure 1.10 Graph of Nonlinear Demand Model $D = cP^{-d}$

29 Ming Zhang, Clay Duan, and Arun Muthupalaniappan, “Analytics Applications in Consumer Credit and Retail Marketing,” analytics-magazine.org, November/December 2011, pp. 27–33.

Uncertainty and Risk As we all know, the future is always uncertain. Thus, many predictive models incorporate uncertainty and help decision makers analyze the risks associated with their decisions. ­Uncertainty is imperfect knowledge of what will happen; risk is associated with the consequences and likelihood of what might happen. For example, the change in the stock price of Apple on the next day of trading is uncertain. However, if you own Apple stock, then you face the risk of losing money if the stock price falls. If you don’t own any stock, the price is still uncertain although you would not have any risk. Risk is evaluated by the magnitude of the consequences and the likelihood that they would occur. For example, a 10% drop in the stock price would incur a higher risk if you own \$1 million than if you only owned \$1,000. Similarly, if the chances of a 10% drop were 1 in 5, the risk would be higher than if the chances were only 1 in 100. The importance of risk in business has long been recognized. The renowned management writer, Peter Drucker, observed in 1974: To try to eliminate risk in business enterprise is futile. Risk is inherent in the commitment of present resources to future expectations. Indeed, economic progress can be defined as the ability to take greater risks. The attempt to eliminate risks, even the attempt to minimize them, can only make them irrational and unbearable. It can only result in the greatest risk of all: rigidity.30 Consideration of risk is a vital element of decision making. For instance, you would probably not choose an investment simply on the basis of the return you might expect because, typically, higher returns are associated with higher risk. Therefore, you have to make a trade-off between the benefits of greater rewards and the risks of potential losses. Analytic models can help assess this. We will address this in later chapters.
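The stock examples above can be put into numbers. One common simplification, used here as an assumption rather than as a definition from the text, is to summarize risk as an expected loss: the magnitude of the consequence multiplied by its likelihood. The Python sketch below applies that idea to the three situations described.

```python
def expected_loss(holding, drop_fraction, probability):
    """Expected loss = amount held x size of the drop x chance of the drop.

    A simplifying summary of risk; it ignores gains and other outcomes.
    """
    return holding * drop_fraction * probability

large_holding = expected_loss(1_000_000, 0.10, 1 / 5)
small_holding = expected_loss(1_000, 0.10, 1 / 5)
unlikely_drop = expected_loss(1_000_000, 0.10, 1 / 100)

print(large_holding, small_holding, unlikely_drop)
```

The ordering of the three results matches the text: a larger holding or a more likely drop means a higher risk.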

Prescriptive Decision Models A prescriptive decision model helps decision makers to identify the best solution to a decision problem. Optimization is the process of finding a set of values for decision variables that minimize or maximize some quantity of interest—profit, revenue, cost, time, and so on—called the objective function. Any set of decision variables that optimizes the objective function is called an optimal solution. In a highly competitive world where one percentage point can mean a difference of hundreds of thousands of dollars or more, knowing the best solution can mean the difference between success and failure.

Example 1.11  A Prescriptive Model for Pricing

To illustrate an example of a prescriptive model, suppose that a firm wishes to determine the best pricing for one of its products to maximize revenue over the next year. A market research study has collected data that estimate the expected annual sales for different levels of pricing. Analysts determined that sales can be expressed by the following model:

sales = −2.9485 × price + 3,240.9

Because revenue equals price × sales, a model for total revenue is

total revenue = price × sales = price × (−2.9485 × price + 3,240.9) = −2.9485 × price² + 3,240.9 × price

The firm would like to identify the price that maximizes the total revenue. One way to do this would be to try different prices and search for the one that yields the highest total revenue. This would be quite tedious to do by hand or even with a calculator. We will see how to do this easily on a spreadsheet in Chapter 11.

30 P. F. Drucker, The Manager and the Management Sciences in Management: Tasks, Responsibilities, Practices (London: Harper and Row, 1974).
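The trial-and-error search described in Example 1.11 is tedious by hand but trivial in code. The Python sketch below tries every price in one-cent steps over a hypothetical range of \$0 to \$1,000 and keeps the one with the highest total revenue. As a cross-check it also computes the price at which the derivative of the revenue function is zero; this calculus step is an addition of this sketch and is not part of the example.

```python
def total_revenue(price):
    """Total revenue = price x sales, with sales = -2.9485 * price + 3240.9."""
    return price * (-2.9485 * price + 3240.9)

# Exhaustive search in one-cent steps over an assumed range of $0 to $1,000.
candidate_prices = (cents / 100 for cents in range(0, 100_001))
best_price = max(candidate_prices, key=total_revenue)

# Cross-check: setting -2 * 2.9485 * price + 3240.9 = 0 gives the maximizer.
analytic_price = 3240.9 / (2 * 2.9485)

print(best_price, round(analytic_price, 2), round(total_revenue(best_price), 2))
```

Because this model has no constraints and a single decision variable, both approaches agree to the nearest cent.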

Although the pricing model did not, most optimization models have constraints— limitations, requirements, or other restrictions that are imposed on any solution, such as “do not exceed the allowable budget” or “ensure that all demand is met.” For instance, a consumer products company manager would probably want to ensure that a specified level of customer service is achieved with the redesign of the distribution system. The presence of constraints makes modeling and solving optimization problems more challenging; we address constrained optimization problems later in this book, starting in Chapter 13. For some prescriptive models, analytical solutions—closed-form mathematical expressions or simple formulas—can be obtained using such techniques as calculus or other types of mathematical analyses. In most cases, however, some type of computer-based procedure is needed to find an optimal solution. An algorithm is a systematic procedure that finds a solution to a problem. Researchers have developed effective algorithms to solve many types of optimization problems. For example, Microsoft Excel has a built-in add-in called Solver that allows you to find optimal solutions to optimization problems formulated as spreadsheet models. We use Solver in later chapters. However, we will not be concerned with the detailed mechanics of these algorithms; our focus will be on the use of the algorithms to solve and analyze the models we develop. If possible, we would like to ensure that an algorithm such as the one Solver uses finds the best solution. However, some models are so complex that it is impossible to solve them optimally in a reasonable amount of computer time because of the extremely large number of computations that may be required or because they are so complex that finding the best solution cannot be guaranteed. In these cases, analysts use search algorithms—solution procedures that generally find good solutions without guarantees of finding the best one. 
Powerful search algorithms exist to obtain good solutions to extremely difficult optimization problems. These are discussed in the supplementary online Chapter A. Prescriptive decision models can be either deterministic or stochastic. A deterministic model is one in which all model input information is either known or assumed to be known with certainty. A stochastic model is one in which some of the model input information is uncertain. For instance, suppose that customer demand is an important element of some model. We can make the assumption that the demand is known with certainty; say, 5,000 units per month. In this case we would be dealing with a deterministic model. On the other hand, suppose we have evidence to indicate that demand is uncertain, with an average value of 5,000 units per month, but which typically varies between 3,200 and 6,800 units. If we make this assumption, we would be dealing with a stochastic model. These situations are discussed in the supplementary online Chapter B.
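The difference between the deterministic and stochastic views of demand can be sketched in Python. The deterministic model uses the single value of 5,000 units per month. For the stochastic model, this sketch assumes demand is normally distributed with a standard deviation of 600 units, so that the range of 3,200 to 6,800 corresponds to three standard deviations on either side of the mean. Both the normal distribution and the standard deviation are assumptions made for illustration; the text specifies only the average and the typical range.

```python
import random

random.seed(42)   # fixed seed so the simulation is repeatable

MEAN_DEMAND = 5_000
STD_DEMAND = 600  # assumed: (6,800 - 5,000) / 3

# Deterministic model: demand is treated as known with certainty.
deterministic_demand = MEAN_DEMAND

# Stochastic model: demand is drawn from an assumed normal distribution.
simulated = [random.gauss(MEAN_DEMAND, STD_DEMAND) for _ in range(10_000)]

average = sum(simulated) / len(simulated)
share_in_range = sum(3_200 <= x <= 6_800 for x in simulated) / len(simulated)

print(deterministic_demand, round(average), share_in_range)
```

A model built on the simulated values yields a distribution of outcomes rather than a single number, which is what allows the risk of a decision to be assessed.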

Problem Solving with Analytics

The fundamental purpose of analytics is to help managers solve problems and make decisions. The techniques of analytics represent only a portion of the overall problem-solving and decision-making process. Problem solving is the activity associated with defining, analyzing, and solving a problem and selecting an appropriate solution that solves a problem. Problem solving consists of several phases:

1. recognizing a problem
2. defining the problem
3. structuring the problem
4. analyzing the problem
5. interpreting results and making a decision
6. implementing the solution


Chapter 1   Introduction to Business Analytics

Recognizing a Problem
Managers at different organizational levels face different types of problems. In a manufacturing firm, for instance, top managers face decisions of allocating financial resources, building or expanding facilities, determining product mix, and strategically sourcing production. Middle managers in operations develop distribution plans, production and inventory schedules, and staffing plans. Finance managers analyze risks, determine investment strategies, and make pricing decisions. Marketing managers develop advertising plans and make sales force allocation decisions. In manufacturing operations, problems involve the size of daily production runs, individual machine schedules, and worker assignments. Whatever the problem, the first step is to realize that it exists. How are problems recognized? Problems exist when there is a gap between what is happening and what we think should be happening. For example, a consumer products manager might feel that distribution costs are too high. This recognition might result from comparing performance with a competitor or from observing an increasing trend compared to previous years.

Defining the Problem
The second step in the problem-solving process is to clearly define the problem. Finding the real problem and distinguishing it from symptoms that are observed is a critical step. For example, high distribution costs might stem from inefficiencies in routing trucks, poor location of distribution centers, or external factors such as increasing fuel costs. The problem might be defined as improving the routing process, redesigning the entire distribution system, or optimally hedging fuel purchases. Defining problems is not a trivial task. The complexity of a problem increases when the following occur:
• The number of potential courses of action is large.
• The problem belongs to a group rather than to an individual.
• The problem solver has several competing objectives.
• External groups or individuals are affected by the problem.
• The problem solver and the true owner of the problem (the person who experiences the problem and is responsible for getting it solved) are not the same.
• Time limitations are important.
These factors make it difficult to develop meaningful objectives and characterize the range of potential decisions. In defining problems, it is important to involve all people who make the decisions or who may be affected by them.

Structuring the Problem
Structuring the problem usually involves stating goals and objectives, characterizing the possible decisions, and identifying any constraints or restrictions. For example, if the problem is to redesign a distribution system, decisions might involve new locations for manufacturing plants and warehouses (where?), new assignments of products to plants (which ones?), and the amount of each product to ship from different warehouses to customers (how much?). The goal of cost reduction might be measured by the total delivered cost of the product. The manager would probably want to ensure that a specified level of customer service (for instance, being able to deliver orders within 48 hours) is achieved with the redesign. This is an example of a constraint. Structuring a problem often involves developing a formal model.


Analyzing the Problem
Here is where analytics plays a major role. Analysis involves some sort of experimentation or solution process, such as evaluating different scenarios, analyzing risks associated with various decision alternatives, finding a solution that meets certain goals, or determining an optimal solution. Analytics professionals have spent decades developing and refining a variety of approaches to address different types of problems. Much of this book is devoted to helping you understand these techniques and gain a basic facility in using them.

Interpreting Results and Making a Decision
Interpreting the results from the analysis phase is crucial in making good decisions. Models cannot capture every detail of the real problem, and managers must understand the limitations of models and their underlying assumptions and often incorporate judgment into making a decision. For example, in locating a facility, we might use an analytical procedure to find a “central” location; however, many other considerations must be included in the decision, such as highway access, labor supply, and facility cost. Thus, the location specified by an analytical solution might not be the exact location the company actually chooses.

Implementing the Solution
Implementing the solution simply means making it work in the organization, or translating the results of a model back to the real world. This generally requires providing adequate resources, motivating employees, eliminating resistance to change, modifying organizational policies, and developing trust. Problems and their solutions affect people: customers, suppliers, and employees. All must be an important part of the problem-solving process. Sensitivity to political and organizational issues is an important skill that managers and analytical professionals alike must possess when solving problems.
In each of these steps, good communication is vital. Analytics professionals need to be able to communicate with managers and clients to understand the business context of the problem and be able to explain results clearly and effectively. Such skills as constructing good visual charts and spreadsheets that are easy to understand are vital to users of analytics. We emphasize these skills throughout this book.

Analytics in Practice: Developing Effective Analytical Tools at Hewlett-Packard [31]
Hewlett-Packard (HP) uses analytics extensively. Many applications are used by managers with little knowledge of analytics. These require that analytical tools be easily understood. Based on years of experience, HP analysts compiled some key lessons. Before creating an analytical decision tool, HP asks three questions:
1. Will analytics solve the problem? Will the tool enable a better solution? Should other nonanalytical solutions be used? Are there organizational or other issues that must be resolved? Often, what may appear to be an analytical problem may actually be rooted in problems of incentive misalignment, unclear ownership and accountability, or business strategy.
2. Can we leverage an existing solution? Before “reinventing the wheel,” can existing solutions address the problem? What are the costs and benefits?
3. Is a decision model really needed? Can simple decision guidelines be used instead of a formal decision tool?
Once a decision is made to develop an analytical tool, HP analysts use several guidelines to increase the chances of successful implementation:
• Use prototyping, that is, a quick working version of the tool designed to test its features and gather feedback.
• Build insight, not black boxes. A “black box” tool is one that generates an answer but may not provide confidence to the user. Interactive tools that create insights to support a decision provide better information.
• Remove unneeded complexity. Simpler is better. A good tool can be used without expert support.
• Partner with end users in discovery and design. Decision makers who will actually use the tool should be involved in its development.
• Develop an analytic champion. Someone (ideally, the actual decision maker) who is knowledgeable about the solution and close to it must champion the process.

[31] Based on Thomas Olavson and Chris Fry, “Spreadsheet Decision-Support Tools: Lessons Learned at Hewlett-Packard,” Interfaces, 38, 4, July–August 2008: 300–310.

Key Terms
Algorithm
Big data
Business analytics (analytics)
Business intelligence (BI)
Categorical (nominal) data
Constraint
Continuous metric
Data mining
Data set
Database
Decision model
Decision support systems (DSS)
Descriptive analytics
Deterministic model
Discrete metric
Influence diagram
Information systems (IS)
Interval data
Measure
Measurement
Metric
Model
Modeling and optimization
Objective function
Operations Research/Management Science (OR/MS)
Optimal solution
Optimization
Ordinal data
Predictive analytics
Prescriptive analytics
Price elasticity
Problem solving
Ratio data
Reliability
Risk
Search algorithm
Simulation and risk analysis
Statistics
Stochastic model
Tag cloud
Uncertainty
Validity
Visualization
What-if analysis


Fun with Analytics
Mr. John Toczek, an analytics manager at ARAMARK Corporation, maintains a Web site called the PuzzlOR (OR being “Operations Research”) at www.puzzlor.com. Each month he posts a new puzzle. Many of these can be solved using techniques in this book; however, even if you cannot develop a formal model, the puzzles can be fun and competitive challenges for students. We encourage you to explore these, in addition to the formal problems, exercises, and cases in this book. A good one to start with is “SurvivOR” from June 2010. Have fun!

Problems and Exercises

1. Discuss how business analytics can be used in sports, such as tennis, cricket, football, and so on. Identify as many opportunities as you can for each.

2. A multinational hotel chain has been implementing analytics-driven digital marketing to its customers. However, the responses to the digital campaigns have not been favorable, and the revenue generation has not been as expected. Currently, they are trying to solve this problem by focusing on similar campaigns that use the same promotional content, and changing these campaigns to suit the specific tastes of the consumers in each nation. Discuss how business analytics can be utilized by the hotel management in this scenario. What is the data required to facilitate good decisions?

3. Suggest some metrics that a hotel might want to collect about their guests. How might these metrics be used with business analytics to support decisions at the hotel?

4. Suggest some metrics that a railway or bus ticketing agency might want to collect. Describe how a manager might utilize this data to facilitate better decisions.

5. Classify each of the data elements in the Sales Transactions database (Figure 1.1) as categorical, ordinal, interval, or ratio data and explain why.

6. Identify each of the variables in the Excel file Credit Approval Decisions as categorical, ordinal, interval, or ratio and explain why.

7. Classify each of the variables in the Excel file Weddings as categorical, ordinal, interval, or ratio and explain why.

8. A survey handed out to individuals at a major shopping mall in a small Florida city in July asked the following:
• gender
• age
• ethnicity
• length of residency
• overall satisfaction with city services (using a scale of 1–5, going from poor to excellent)
• quality of schools (using a scale of 1–5, going from poor to excellent)
What types of data (categorical, ordinal, interval, or ratio) would each of the survey items represent and why?

9. A bank developed a model for predicting the average checking and savings account balance as
balance = −17,732 + 367 × age + 1,300 × years education + 0.116 × household wealth.
a. Explain how to interpret the numbers in this model.
b. Suppose that a customer is 32 years old, is a college graduate (so that years education = 16), and has a household wealth of \$150,000. What is the predicted bank balance?

10. Four key marketing decision variables are price (P), advertising (A), transportation (T), and product quality (Q). Consumer demand (D) is influenced by these variables. The simplest model for describing demand in terms of these variables is
$D = k - pP + aA + tT + qQ$
where k, p, a, t, and q are positive constants.
a. How does a change in each variable affect demand?
b. How do the variables influence each other?
c. What limitations might this model have? Can you think of how this model might be made more realistic?

11. A firm installs 1500 air conditioners which need to be serviced every six months. The firm can hire a team from its logistics department at a fixed cost of \$6,000. Each unit will be serviced by the team at \$15.00. The firm can also outsource this at a cost of \$17.00 inclusive of all charges.
a. For the given number of units, compute the total cost of servicing for both options. Which is a better decision?
b. Find the break-even volume and characterize the range of volumes for which it is more economical to outsource.

12. Return on investment (ROI) is computed in the following manner: ROI is equal to turnover multiplied by earnings as a percent of sales. Turnover is sales divided by total investment. Total investment is current assets (inventories, accounts receivable, and cash) plus fixed assets. Earnings equal sales minus the cost of sales. The cost of sales consists of variable production costs, selling expenses, freight and delivery, and administrative costs.
a. Construct an influence diagram that relates these variables.
b. Define symbols and develop a mathematical model.

13. Total marketing effort is a term used to describe the critical decision factors that affect demand: price, advertising, distribution, and product quality. Let the variable x represent total marketing effort. A typical model that is used to predict demand as a function of total marketing effort is
$D = ax^b$
Suppose that a is a positive number. Different model forms result from varying the constant b. Sketch the graphs of this model for $b = 0$, $b = 1$, $0 < b < 1$, $b < 0$, and $b > 1$. What does each model tell you about the relationship between demand and marketing effort? What assumptions are implied? Are they reasonable? How would you go about selecting the appropriate model?

14. Automobiles have different fuel economies (mpg), and commuters drive different distances to work or school. Suppose that a state Department of Transportation (DOT) is interested in measuring the average monthly fuel consumption of commuters in a certain city. The DOT might sample a group of commuters and collect information on the number of miles driven per day, number of driving days per month, and the fuel economy of their cars. Develop a predictive model for calculating the amount of gasoline consumed, using the following symbols for the data.
G = gallons of fuel consumed per month
m = miles driven per day to and from work or school
d = number of driving days per month
f = fuel economy in miles per gallon
Suppose that a commuter drives 30 miles round trip to work 20 days each month and achieves a fuel economy of 34 mpg. How many gallons of gasoline are used?

15. A manufacturer of mp3 players is preparing to set the price on a new model. Demand is thought to depend on the price and is represented by the model
D = 2,500 − 3P
The accounting department estimates that the total costs can be represented by
C = 5,000 + 5D
Develop a model for the total profit in terms of the price, P.

16. The demand for airline travel is quite sensitive to price. Typically, there is an inverse relationship between demand and price; when price decreases, demand increases and vice versa. One major airline has found that when the price (P) for a round trip between Chicago and Los Angeles is \$600, the demand (D) is 500 passengers per day. When the price is reduced to \$400, demand is 1,200 passengers per day.
a. Plot these points on a coordinate system and develop a linear model that relates demand to price.
b. Develop a prescriptive model that will determine what price to charge to maximize the total revenue.
c. By trial and error, can you find the optimal solution that maximizes total revenue?


Case: Drout Advertising Research Project [32]
Jamie Drout is interested in perceptions of gender stereotypes within beauty product advertising, which includes soap, deodorant, shampoo, conditioner, lotion, perfume, cologne, makeup, chemical hair color, razors, skin care, feminine care, and salon services; as well as the perceived benefits of empowerment advertising. Gender stereotypes specifically use cultural perceptions of what constitutes an attractive, acceptable, and desirable man or woman, frequently exploiting specific gender roles, and are commonly employed in advertisements for beauty products. Women are represented as delicately feminine, strikingly beautiful, and physically flawless, occupying small amounts of physical space that generally exploit their sexuality; men as strong and masculine with chiseled physical bodies, occupying large amounts of physical space to maintain their masculinity and power. In contrast, empowerment advertising strategies negate gender stereotypes and visually communicate the unique differences in each individual. In empowerment advertising, men and women are to represent the diversity in beauty, body type, and levels of perceived femininity and masculinity. Her project is focused on understanding consumer perceptions of these advertising strategies. Jamie conducted a survey using the following questionnaire:

1. What is your gender?
   Male
   Female
2. What is your age?
3. What is the highest level of education you have completed?
   Some High School Classes
   High School Diploma
   Some Undergraduate Courses
   Associate Degree
   Bachelor Degree
   Master Degree
   J.D.
   M.D.
   Doctorate Degree
4. What is your annual income?
   \$0 to < \$10,000
   \$10,000 to < \$20,000
   \$20,000 to < \$30,000
   \$30,000 to < \$40,000
   \$40,000 to < \$50,000
   \$50,000 to < \$60,000
   \$60,000 to < \$70,000
   \$70,000 to < \$80,000
   \$80,000 to < \$90,000
   \$90,000 to < \$110,000
   \$110,000 to < \$130,000
   \$130,000 to < \$150,000
   \$150,000 or More
5. On average, how much do you pay for beauty and hygiene products or services per year? Include references to the following products: soap, deodorant, shampoo, conditioner, lotion, perfume, cologne, makeup, chemical hair color, razors, skin care, feminine care, and salon services.
6. On average, how many beauty and hygiene advertisements, if at all, do you think you view or hear per day? Include references to the following advertisements: television, billboard, Internet, radio, newspaper, magazine, and direct mail.
7. On average, how many of those advertisements, if at all, specifically subscribe to gender roles and stereotypes?
8. On the following scale, what role, if any, do these advertisements have in reinforcing specific gender stereotypes?
   Drastic
   Influential
   Limited
   Trivial
   None
9. To what extent do you agree that empowerment advertising, which explicitly communicates the unique differences in each individual, would help transform cultural gender stereotypes?
   Strongly agree
   Agree
   Somewhat agree
   Neutral
   Somewhat disagree
   Disagree
   Strongly disagree
10. On average, what percentage of advertisements that you view or hear per day currently utilize empowerment advertising?

[32] I express my appreciation to Jamie Drout for providing this original material from her class project as the basis for this case.


Assignment: Jamie received 105 responses, which are given in the Excel file Drout Advertising Survey. Review the questionnaire and classify the data collected from each question as categorical, ordinal, interval, or ratio. Next, explain how the data and subsequent analysis using business analytics might lead to a better understanding of stereotype versus empowerment advertising. Specifically, state some of the key insights that you would hope to gain by analyzing the data.

An important aspect of business analytics is good communication. Write up your answers to this case formally in a well-written report as if you were a consultant to Ms. Drout. This case will continue in Chapters 3, 4, 6, and 7, and you will be asked to use a variety of descriptive analytics tools to analyze the data and interpret the results. As you do this, add your insights to the report, culminating in a complete project report that fully analyzes the data and draws appropriate conclusions.

Case: Performance Lawn Equipment
In each chapter of this book, we use a database for a fictitious company, Performance Lawn Equipment (PLE), within a case exercise for applying the tools and techniques introduced in the chapter. [33] To put the database in perspective, we first provide some background about the company, so that the applications of business analytic tools will be more meaningful.

PLE, headquartered in St. Louis, Missouri, is a privately owned designer and producer of traditional lawn mowers used by homeowners. In the past 10 years, PLE has added another key product, a medium-size diesel power lawn tractor with front and rear power takeoffs, Class I three-point hitches, four-wheel drive, power steering, and full hydraulics. This equipment is built primarily for a niche market consisting of large estates, including golf and country clubs, resorts, private estates, city parks, large commercial complexes, lawn care service providers, private homeowners with five or more acres, and government (federal, state, and local) parks, building complexes, and military bases. PLE provides most of the products to dealerships, which, in turn, sell directly to end users. PLE employs 1,660 people worldwide. About half the workforce is based in St. Louis; the remainder is split among their manufacturing plants.

In the United States, the focus of sales is on the eastern seaboard, California, the Southeast, and the south central states, which have the greatest concentration of customers. Outside the United States, PLE’s sales include a European market, a growing South American market, and developing markets in the Pacific Rim and China. The market is cyclical, but the different products and regions balance some of this, with just less than 30% of total sales in the spring and summer (in the United States), about 25% in the fall, and about 20% in the winter. Annual sales are approximately \$180 million.

Both end users and dealers have been established as important customers for PLE. Collection and analysis of end-user data showed that satisfaction with the products depends on high quality, easy attachment/dismount of implements, low maintenance, price value, and service. For dealers, key requirements are high quality, parts and feature availability, rapid restock, discounts, and timeliness of support.

PLE has several key suppliers: Mitsitsiu, Inc., the sole source of all diesel engines; LANTO Axles, Inc., which provides tractor axles; Schorst Fabrication, which provides subassemblies; Cuberillo, Inc., supplier of transmissions; and Specialty Machining, Inc., a supplier of precision machine parts.

To help manage the company, PLE managers have developed a “balanced scorecard” of measures. These data, which are summarized shortly, are stored in the form of a Microsoft Excel workbook (Performance Lawn Equipment) accompanying this book. The database contains various measures captured on a monthly or quarterly basis and used by various managers to evaluate business performance. Data for each of the key measures are stored in a separate worksheet. A summary of these worksheets is given next:
● Dealer Satisfaction, measured on a scale of 1–5 (1 = poor, 2 = less than average, 3 = average, 4 = above average, and 5 = excellent). Each year, dealers in each region are surveyed about their overall satisfaction with PLE. The worksheet contains summary data from surveys for the past 5 years.
● End-User Satisfaction, measured on the same scale as dealers. Each year, 100 users from each region are surveyed. The worksheet contains summary data for the past 5 years.
● 2014 Customer Survey, results from a survey for customer ratings of specific attributes of PLE tractors: quality, ease of use, price, and service on the same 1–5 scale. This sheet contains 200 observations of customer ratings.
● Complaints, which shows the number of complaints registered by all customers each month in each of PLE’s five regions (North America, South America, Europe, the Pacific, and China).
● Mower Unit Sales and Tractor Unit Sales, which provide sales by product by region on a monthly basis. Unit sales for each region are aggregated to obtain world sales figures.
● Industry Mower Total Sales and Industry Tractor Total Sales, which list the number of units sold by all producers by region.
● Unit Production Costs, which provides monthly accounting estimates of the variable cost per unit for manufacturing tractors and mowers over the past 5 years.
● Operating and Interest Expenses, which provides monthly administrative, depreciation, and interest expenses at the corporate level.
● On-Time Delivery, which provides the number of deliveries made each month from each of PLE’s major suppliers, number on time, and the percent on time.
● Defects After Delivery, which shows the number of defects in supplier-provided material found in all shipments received from suppliers.
● Time to Pay Suppliers, which provides measurements in days from the time the invoice is received until payment is sent.
● Response Time, which gives samples of the times taken by PLE customer-service personnel to respond to service calls by quarter over the past 2 years.
● Employee Satisfaction, which provides data for the past 4 years of internal surveys of employees to determine their overall satisfaction with their jobs, using the same scale used for customers. Employees are surveyed quarterly, and results are stratified by employee category: design and production, managerial, and sales/administrative support.

In addition to these business measures, the PLE database contains worksheets with data from special studies:
● Engines, which lists 50 samples of the time required to produce a lawn-mower blade using a new technology.
● Transmission Costs, which provides the results of 30 samples each for the current process used to produce tractor transmissions and two proposed new processes.
● Blade Weight, which provides samples of mower-blade weights to evaluate the consistency of the production process.
● Mower Test, which lists test results of mower functional performance after assembly for 30 samples of 100 units each.
● Employee Retention, data from a study of employee duration (length of hire) with PLE. The 40 subjects were identified by reviewing hires from 10 years prior and identifying those who were involved in managerial positions (either hired into management or promoted into management) at some time in this 10-year period.
● Shipping Cost, which gives the unit shipping cost for mowers and tractors from existing and proposed plants for a supply-chain-design study.
● Fixed Cost, which lists the fixed cost to expand existing plants or build new facilities, also as part of the supply-chain-design study.
● Purchasing Survey, which provides data obtained from a third-party survey of purchasing managers of customers of Performance Lawn Care.

Elizabeth Burke has recently joined the PLE management team to oversee production operations. She has reviewed the types of data that the company collects and has assigned you the responsibility to be her chief analyst in the coming weeks. To prepare for this task, you have decided to review each worksheet and determine whether the data were gathered from internal sources, external sources, or have been generated from special studies. Also, you need to know whether the measures are categorical, ordinal, interval, or ratio. Prepare a report summarizing the characteristics of the metrics used in each worksheet.

[33] The case scenario was based on Gateway Estate Lawn Equipment Co. Case Study, used for the 1997 Malcolm Baldrige National Quality Award Examiner Training course. This material is in the public domain. The database, however, was developed by the author.

Chapter 2


Learning Objectives
After studying this chapter, you will be able to:
• Find buttons and menus in the Excel 2013 ribbon.
• Write correct formulas in an Excel worksheet.
• Apply relative and absolute addressing in Excel formulas.
• Copy formulas from one cell to another or to a range of cells.
• Use Excel features such as split screen, paste special, show formulas, and displaying grid lines and headers in your applications.
• Use basic and advanced Excel functions.
• Use Excel functions for business intelligence queries in databases.


Many commercial software packages are available to facilitate the application of business analytics. Although they often have unique features and capabilities, they can be expensive, generally require advanced training to understand and apply, and may work only on specific computer platforms. Spreadsheet software, on the other hand, is widely used across all areas of business and is standard on nearly every employee’s computer. Spreadsheets are an effective platform for manipulating data and developing and solving models; they support powerful commercial add-ins and facilitate communication of results. Spreadsheets provide a flexible modeling environment and are particularly useful when the end user is not the designer of the model. Teams can easily use spreadsheets and understand the logic upon which they are built. Information in spreadsheets can easily be copied from Excel into other documents and presentations. A recent survey identified more than 180 commercial spreadsheet products that support analytics efforts, including data management and reporting, data- and model-driven analytical techniques, and implementation. [1] Many organizations have used spreadsheets extremely effectively to support decision making in marketing, finance, and operations. Some illustrative applications include the following: [2]
• Analyzing supply chains (Hewlett-Packard)
• Determining optimal inventory levels to meet customer service objectives (Procter & Gamble)
• Selecting internal projects (Lockheed Martin Space Systems Company)
• Planning for emergency clinics in response to a sudden epidemic or bioterrorism attack (Centers for Disease Control)
• Analyzing the default risk of a portfolio of real estate loans (Hypo International)
• Assigning medical residents to on-call and emergency rotations (University of Vermont College of Medicine)
• Performance measurement and evaluation (American Red Cross)

The purpose of this chapter is to provide a review of the basic features of Microsoft Excel that you need to know to use spreadsheets for analyzing and solving problems with techniques of business analytics. In this text, we use Microsoft Excel 2013 for Windows to perform spreadsheet calculations and analyses. Excel files for all text examples and data used in problems and exercises are provided with this book (see the Preface). This review is not intended to be a complete tutorial; many good Excel tutorials can be found online, and we also encourage you to use the Excel help capability (by clicking the question mark button at the top right of the screen). Also, for any reader who may be a Mac user, we caution you that Mac versions of Excel do not have the full functionality that Windows versions have, particularly statistical features, although most of the basic capabilities are the same. In particular, the Excel add-in that we use in later chapters, Analytic Solver Platform, only runs on Windows. Thus, if you use a Mac, you should either run Bootcamp with Windows or use a third-party software product such as Parallels or VMWare.

[1] Thomas A. Grossman, “Resources for Spreadsheet Analysts,” Analytics (May/June 2010): 8. analytics magazine.com
[2] Larry J. LeBlanc and Thomas A. Grossman, “Introduction: The Use of Spreadsheet Software in the Application of Management Science and Operations Research,” Interfaces, 38, 4 (July–August 2008): 225–227.

Basic Excel Skills
To be able to apply the procedures and techniques that you will learn in this book, it is necessary for you to be relatively proficient in using Excel. We assume that you are familiar with the most elementary spreadsheet concepts and procedures, such as
• opening, saving, and printing files;
• using workbooks and worksheets;
• moving around a spreadsheet;
• selecting cells and ranges;
• inserting/deleting rows and columns;
• entering and editing text, numerical data, and formulas in cells;
• formatting data (number, currency, decimal places, etc.);
• working with text strings;
• formatting data and text; and
• modifying the appearance of the spreadsheet using borders, shading, and so on.
Menus and commands in Excel 2013 reside in the “ribbon” shown in Figure 2.1. Menus and commands are arranged in logical groups under different tabs (File, Home, Insert, and so on); small triangles pointing downward indicate menus of additional choices. We often refer to certain commands or options and where they may be found in the ribbon.

Figure 2.1 Excel 2013 Ribbon

66

Excel Formulas

Formulas in Excel use common mathematical operators:

• addition (+)
• subtraction (-)
• multiplication (*)
• division (/)

Exponentiation uses the ^ symbol; for example, $2^5$ is written as 2^5 in an Excel formula. Cell references in formulas can be written either with relative addresses or absolute addresses. A relative address uses just the row and column label in the cell reference (for example, A4 or C21); an absolute address uses a dollar sign (\$ sign) before either the row or column label or both (for example, \$A2, C\$21, or \$B\$15). Which one we choose makes a critical difference if you copy the cell formulas. If only relative addressing is used, then copying a formula to another cell changes the cell references by the number of rows or columns in the direction that the formula is copied. For instance, if we enter the formula =B4-B5*A8 in cell B8 and copy it to cell C9 (one column to the right and one row down), every row and column reference increases by one and the formula becomes =C5-C6*B9. Using a \$ sign before a row label (for example, B\$4) keeps the reference fixed to row 4 but allows the column reference to change if the formula is copied to another cell. Similarly, using a \$ sign before a column label (for example, \$B4) keeps the reference to column B fixed but allows the row reference to change. Finally, using a \$ sign before both the row and column labels (for example, \$B\$4) keeps the reference to cell B4 fixed no matter where the formula is copied. You should be very careful to use relative and absolute addressing appropriately in your models, especially when copying formulas.
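The copying rule for relative and absolute addresses can be sketched in a few lines of Python. This is only an illustration of the rule described above, not how Excel is implemented; the function names (`copy_formula`, `col_to_num`, `num_to_col`) are hypothetical, and the simple pattern used here assumes the formula contains only plain cell references (a function name such as LOG10 would be mistaken for a cell).

```python
import re

def col_to_num(col):
    """Convert a column label such as 'A' or 'AB' to a 1-based number."""
    n = 0
    for ch in col:
        n = n * 26 + (ord(ch) - ord("A") + 1)
    return n

def num_to_col(n):
    """Convert a 1-based column number back to a column label."""
    label = ""
    while n > 0:
        n, rem = divmod(n - 1, 26)
        label = chr(ord("A") + rem) + label
    return label

# Optional $, column letters, optional $, row digits (for example $B4 or C$21).
CELL_REF = re.compile(r"(\$?)([A-Z]+)(\$?)(\d+)")

def copy_formula(formula, rows_down, cols_right):
    """Return the formula text as it would read after being copied."""
    def shift(match):
        col_abs, col, row_abs, row = match.groups()
        if not col_abs:  # relative column: moves with the copy
            col = num_to_col(col_to_num(col) + cols_right)
        if not row_abs:  # relative row: moves with the copy
            row = str(int(row) + rows_down)
        return f"{col_abs}{col}{row_abs}{row}"
    return CELL_REF.sub(shift, formula)

# Copying from B8 to C9 is one row down and one column to the right.
print(copy_formula("=B4-B5*A8", 1, 1))
# Absolute references stay put; only A8 changes when copied one row down.
print(copy_formula("=$B$4-$B$5*A8", 1, 0))
```

Mixed references behave as the text describes: in `copy_formula("=B$4+$B4", 2, 3)` the row of B\$4 and the column of \$B4 are held fixed while the other part of each reference shifts.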

Example 2.1  Implementing Price-Demand Models in Excel

In Chapter 1, we described two models for predicting demand as a function of price:

$$D = a - bP$$

and

$$D = cP^{-d}$$

Figure 2.2 shows a spreadsheet (Excel file Demand Prediction Models) for calculating demand for different prices using each of these models. For example, to calculate the demand in cell B8 for the linear model, we use the formula

=\$B\$4-\$B\$5*A8

To calculate the demand in cell E8 for the nonlinear model, we use the formula

=\$E\$4*D8^-\$E\$5

Note how the absolute addresses are used so that as these formulas are copied down, the demand is computed correctly.
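The two demand models can also be written as ordinary functions, which makes the role of each parameter explicit. The sketch below is illustrative only: the parameter values for a, b, c, and d are hypothetical and are not the values in the Demand Prediction Models file.

```python
def linear_demand(price, a, b):
    """Linear model D = a - bP."""
    return a - b * price

def nonlinear_demand(price, c, d):
    """Nonlinear model D = c * P^(-d)."""
    return c * price ** (-d)

# Hypothetical parameters, chosen only to make the arithmetic easy to follow.
a, b = 20000.0, 10.0
c, d = 20000.0, 0.5

# The same parameters are reused for every price, which is what the
# absolute addresses ($B$4, $B$5, $E$4, $E$5) accomplish in the worksheet.
for price in (100, 200, 400):
    print(price, linear_demand(price, a, b), round(nonlinear_demand(price, c, d), 2))
```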

Copying Formulas

Excel provides several ways of copying formulas to different cells. This is extremely useful in building decision models, because many models require replication of formulas for different periods of time, similar products, and so on. One way is to select the cell with the formula to be copied, click the Copy button from the Clipboard group under the Home tab (or simply press Ctrl-C on your keyboard), click on the cell you wish to copy to, and then click the Paste button (or press Ctrl-V). You may also enter a formula directly in a range of cells without copying and pasting by selecting the range, typing in the formula, and pressing Ctrl-Enter.

Figure 2.2 Excel Models for Demand Prediction

To copy a formula from a single cell or range of cells down a column or across a row, first select the cell or range, click and hold the mouse on the small square in the lower right-hand corner of the cell (the “fill handle”), and drag the formula to the “target” cells to which you wish to copy.

Other Useful Excel Tips

• Split Screen. You may split the worksheet horizontally and/or vertically to view different parts of the worksheet at the same time. The vertical splitter bar is just to the right of the bottom scroll bar, and the horizontal splitter bar is just above the right-hand scroll bar. Position your cursor over one of these until it changes shape, click, and drag the splitter bar to the left or down.
• Paste Special. When you normally copy (one or more) cells and paste them in a worksheet, Excel places an exact copy of the formulas or data in the cells (except for relative addressing). Often you simply want the result of formulas, so the data will remain constant even if other parameters used in the formulas change. To do this, use the Paste Special option found within the Paste menu in the Clipboard group under the Home tab instead of the Paste command. Choosing Paste Values will paste the result of the formulas from which the data were calculated.
• Column and Row Widths. Many times a cell contains a number that is too large to display properly because the column width is too small. You may change the column width to fit the largest value or text string anywhere in the column by positioning the cursor to the right of the column label so that it changes to a cross with horizontal arrows and then double-clicking. You may also move the arrow to the left or right to manually change the column width. You may change the row heights in a similar fashion by moving the cursor below the row number label. This can be especially useful if you have a very long formula to display. To break a formula within a cell, position the cursor at the break point in the formula bar and press Alt-Enter.
• Displaying Formulas in Worksheets. Choose Show Formulas in the Formula Auditing group under the Formulas tab. You often need to change the column width to display the formulas properly.
• Displaying Grid Lines and Row and Column Headers for Printing. Check the Print boxes for gridlines and headings in the Sheet Options group under the Page Layout tab. Note that the Print command can be found by clicking on the File tab.
• Filling a Range with a Series of Numbers. Suppose you want to build a worksheet for entering 100 data values. It would be tedious to have to enter the numbers from 1 to 100 one at a time. Simply fill in the first few values in the series and highlight them. Then click and drag the small square (fill handle) in the lower right-hand corner down (Excel will show a small pop-up window that tells you the last value in the range) until you have filled in the column to 100; then release the mouse.

Excel Functions

Functions are used to perform special calculations in cells and are used extensively in business analytics applications. All Excel functions require an equal sign and a function name followed by parentheses, in which you specify arguments for the function.

Basic Excel Functions

Some of the more common functions that we will use in applications include the following:

• MIN(range): finds the smallest value in a range of cells
• MAX(range): finds the largest value in a range of cells
• SUM(range): finds the sum of values in a range of cells
• AVERAGE(range): finds the average of the values in a range of cells
• COUNT(range): finds the number of cells in a range that contain numbers
• COUNTIF(range, criteria): finds the number of cells within a range that meet a specified criterion

The COUNTIF function counts the number of cells within a range that meet a criterion that you specify. For example, you can count all the cells that start with a certain letter, or you can count all the cells that contain a number that is larger or smaller than a number you specify. Examples of criteria are 100, ">100", a cell reference such as A4, and a text string such as "Facebook". Note that text and logical formulas must be enclosed in quotes. See Excel Help for other examples.

Excel has other useful COUNT-type functions: COUNTA counts the number of nonblank cells in a range, and COUNTBLANK counts the number of blank cells in a range. In addition, COUNTIFS(range1, criterion1, range2, criterion2, …, range_n, criterion_n) finds the number of cells within multiple ranges that meet specific criteria for each range. We illustrate these functions using the Purchase Orders data set in Example 2.2.
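For readers who find it helpful to see the same summaries outside a spreadsheet, the following Python sketch mirrors MIN, MAX, SUM, AVERAGE, COUNT, COUNTIF, and COUNTIFS. The four records are hypothetical stand-ins and are not rows from the Purchase Orders data set; the helper `count_if` is likewise an assumption made for illustration.

```python
# Hypothetical purchase-order records (not the actual Purchase Orders data).
orders = [
    {"supplier": "Spacetime Technologies", "item": "O-Ring",
     "quantity": 1300, "cost": 3250.0, "ap_months": 25},
    {"supplier": "Supplier A", "item": "Gasket",
     "quantity": 500, "cost": 1500.0, "ap_months": 30},
    {"supplier": "Supplier B", "item": "O-Ring",
     "quantity": 900, "cost": 2250.0, "ap_months": 45},
    {"supplier": "Spacetime Technologies", "item": "Bolt-nut package",
     "quantity": 2000, "cost": 800.0, "ap_months": 15},
]

def count_if(rows, *conditions):
    """Count rows satisfying every condition: one condition behaves like
    COUNTIF, several conditions behave like COUNTIFS."""
    return sum(1 for row in rows if all(cond(row) for cond in conditions))

quantities = [row["quantity"] for row in orders]
ap_months = [row["ap_months"] for row in orders]

summary = {
    "min_quantity": min(quantities),                        # MIN
    "max_quantity": max(quantities),                        # MAX
    "total_cost": sum(row["cost"] for row in orders),       # SUM
    "average_ap_months": sum(ap_months) / len(ap_months),   # AVERAGE
    # COUNT: only numeric entries are counted
    "order_count": sum(isinstance(q, (int, float)) for q in quantities),
    "o_ring_orders": count_if(orders, lambda r: r["item"] == "O-Ring"),
    "ap_under_30": count_if(orders, lambda r: r["ap_months"] < 30),
    "o_ring_from_spacetime": count_if(
        orders,
        lambda r: r["item"] == "O-Ring",
        lambda r: r["supplier"] == "Spacetime Technologies",
    ),
}

for name, value in summary.items():
    print(name, value)
```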

Example 2.2  Using Basic Excel Functions

In the Purchase Orders data set, we will find the following:

• smallest and largest quantity of any item ordered
• total order costs
• average number of months per order for accounts payable
• number of purchase orders placed
• number of orders placed for O-rings
• number of orders with A/P terms shorter than 30 months
• number of O-ring orders from Spacetime Technologies

The results are shown in Figure 2.3. In this figure, we used the split-screen feature in Excel to reduce the number of rows shown in the spreadsheet. To find the smallest and largest quantity of any item ordered, we use the MIN and MAX functions for the data in column F. Thus, the formula in cell B99 is =MIN(F4:F97) and the formula in cell B100 is =MAX(F4:F97). To find the total order costs, we sum the data in column G using the SUM function: =SUM(G4:G97); this is the formula in cell B101. To find the average number of A/P months, we use the AVERAGE function for the data in column H. The formula in cell B102 is =AVERAGE(H4:H97). To find the number of purchase orders placed, use the COUNT function. Note that the COUNT function counts only the number of cells in a range that contain numbers, so we could not use it in columns A, B, or D; however, any other column would be acceptable. Using the item numbers in column C, the formula in cell B103 is =COUNT(C4:C97). To find the number of orders placed for O-rings, we use the COUNTIF function. For this example, the formula used in cell B104 is =COUNTIF(D4:D97, "O-Ring"). We could have also used the cell reference for any cell containing the text O-Ring, such as =COUNTIF(D4:D97, D12). To find the number of orders with A/P terms less than 30 months, use the formula =COUNTIF(H4:H97, " …

… $(1 + i)^t$ dollars today, where i is the discount rate. The discount rate reflects the opportunity costs of spending funds now versus achieving a return through another investment, as well as the risks associated with not receiving returns until a later time. The sum of the present values of all cash flows over a stated time horizon is the net present value:

$$\text{NPV} = \sum_{t=0}^{n} \frac{F_t}{(1 + i)^t} \qquad (2.1)$$

where $F_t$ = cash flow in period t. A positive NPV means that the investment will provide added value because the projected return exceeds the discount rate. The Excel function NPV(rate, value1, value2, …) calculates the net present value of an investment by using a discount rate and a series of future payments (negative values) and income (positive values). Rate is the value of the discount rate i over the length of one period, and value1, value2, … are 1 to 254 arguments representing the payments and income for each period. The values must be equally spaced in time and are assumed to occur at the end of each period. The NPV investment begins one period before the date of the value1 cash flow and ends with the last cash flow in the list. The NPV calculation is based on future cash flows. If the first cash flow (such as an initial investment or fixed cost) occurs at the beginning of the first period, then it must be added to the NPV result and not included in the function arguments.

Example 2.3  Using the NPV Function

A company is introducing a new product. The fixed cost for marketing and distribution is \$25,000 and is incurred just prior to launch. The forecasted net sales revenues for the first six months are shown in Figure 2.4. The formula in cell B8 computes the net present value of these cash flows as =NPV(B6,C4:H4)-B5. Note that the fixed cost is not a future cash flow and is not included in the NPV function arguments.
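The timing convention of the NPV function is easy to misread, so a short Python sketch may help. It mimics the convention described above (each value is assumed to occur at the end of its period, so the first value is discounted one period). The function name `excel_npv`, the monthly rate, and the revenue figures are hypothetical; only the \$25,000 fixed cost comes from the example.

```python
def excel_npv(rate, values):
    """Discount each cash flow to the present, treating the k-th value
    (k = 1, 2, ...) as occurring at the end of period k."""
    return sum(v / (1 + rate) ** k for k, v in enumerate(values, start=1))

fixed_cost = 25_000.0      # incurred just prior to launch (from the example)
monthly_rate = 0.01        # hypothetical discount rate per month
revenues = [2_000, 4_000, 6_000, 8_000, 10_000, 12_000]  # hypothetical

# The fixed cost occurs now, so it is subtracted outside the function,
# just as in =NPV(B6,C4:H4)-B5.
net_present_value = excel_npv(monthly_rate, revenues) - fixed_cost
print(round(net_present_value, 2))
```

A quick check of the convention: at a 10% rate, cash flows of 110 and 121 received at the ends of periods 1 and 2 are each worth 100 today, so `excel_npv(0.10, [110, 121])` is 200 up to rounding.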

Figure 2.4 Net Present Value Calculation

Insert Function

The easiest way to locate a particular function is to select a cell and click on the Insert Function button (fx), which can be found under the ribbon next to the formula bar and also in the Function Library group in the Formulas tab. You may either type in a description in the search field, such as “net present value,” or select a category, such as “Financial,” from the drop-down box. This feature is particularly useful if you know what function to use but are not sure of what arguments to enter because it will guide you in entering the appropriate data for the function arguments. Figure 2.5 shows the dialog from which you may select the function you wish to use. For example, if we choose the COUNTIF function, the dialog in Figure 2.6 appears. When you click in an input cell, a description of the argument is shown. Thus, if you are not sure what to enter for the range, the explanation in Figure 2.6 will help you. For further information, you could click on the Help button in the lower left-hand corner.

Figure 2.5 Insert Function Dialog

Figure 2.6 Function Arguments Dialog for COUNTIF

Logical Functions

Logical functions return only one of two values: TRUE or FALSE. Three useful logical functions in business analytics applications are

• IF(condition, value if true, value if false): a logical function that returns one value if the condition is true and another if the condition is false,
• AND(condition 1, condition 2, …): a logical function that returns TRUE if all conditions are true and FALSE if not,
• OR(condition 1, condition 2, …): a logical function that returns TRUE if any condition is true and FALSE if not.

The IF function, IF(condition, value if true, value if false), allows you to choose one of two values to enter into a cell. If the specified condition is true, value if true will be put in the cell. If the condition is false, value if false will be entered. Value if true and value if false can be a number or a text string enclosed in quotes. Note that if a blank is used between quotes, " ", then the result will simply be a blank cell. This is often useful to create a clean spreadsheet. For example, if cell C2 contains the function =IF(A8=2,7,12), it states that if the value in cell A8 is 2, the number 7 will be assigned to cell C2; if the value in cell A8 is not 2, the number 12 will be assigned to cell C2. Conditions may include the following:

=   equal to
>   greater than
<   less than
>=  greater than or equal to
<=  less than or equal to
<>  not equal to

You may “nest” up to 64 IF functions by replacing value-if-true or value-if-false in an IF function with another IF function: =IF(A8=2,(IF(B3=5,"YES" …

… N as the probability of $x_i$. Then this expression for the mean has the same basic form as the expected value formula.

Example 5.19  Computing the Expected Value

We may apply formula (5.9) to the probability distribution of rolling two dice. We multiply the outcome 2 by its probability 1/36, add this to the product of the outcome 3 and its probability, and so on. Continuing in this fashion, the expected value is

$$E[X] = 2(0.0278) + 3(0.0556) + 4(0.0833) + 5(0.1111) + 6(0.1389) + 7(0.1667) + 8(0.1389) + 9(0.1111) + 10(0.0833) + 11(0.0556) + 12(0.0278) = 7$$

Figure 5.8 shows these calculations in an Excel spreadsheet (worksheet Expected Value in the Dice Rolls Excel file). As expected (no pun intended), the average value of the roll of two dice is 7.
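The same calculation can be checked with exact fractions, which avoids the rounding in the four-decimal probabilities. The following is a minimal sketch; the function names are hypothetical and it simply enumerates the 36 equally likely outcomes of two dice.

```python
from fractions import Fraction
from itertools import product

def two_dice_distribution():
    """Probability of each total from 2 to 12 when rolling two fair dice."""
    dist = {}
    for d1, d2 in product(range(1, 7), repeat=2):
        total = d1 + d2
        dist[total] = dist.get(total, Fraction(0)) + Fraction(1, 36)
    return dist

def expected_value(dist):
    """Weighted average of the outcomes, using probabilities as weights."""
    return sum(x * p for x, p in dist.items())

dist = two_dice_distribution()
print(float(dist[2]), float(dist[7]))   # probabilities of rolling 2 and 7
print(expected_value(dist))             # exact expected value
```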

Using Expected Value in Making Decisions

Expected value can be helpful in making a variety of decisions, even those we see in daily life.

Example 5.20  Expected Value on Television

One of the author’s favorite examples stemmed from a task in season 1 of Donald Trump’s TV show, The Apprentice. Teams were required to select an artist and sell his or her art for the highest total amount of money. One team selected a mainstream artist who specialized in abstract art that sold for between \$1,000 and \$2,000; the second team chose an avant-garde artist whose surrealist and rather controversial art was priced much higher. Guess who won? The first team did, because the probability of selling a piece of mainstream art was much higher than that of selling a piece by the avant-garde artist, whose bizarre art (the team members themselves didn’t even like it!) had a very low probability of a sale. A back-of-the-envelope expected value calculation would have easily predicted the winner.

A popular game show that took TV audiences by storm several years ago was called Deal or No Deal. The game involved a set of numbered briefcases that contain amounts of money from 1 cent to \$1,000,000. Contestants begin choosing cases to be opened and removed, and their amounts are shown. After each set of cases is opened, the banker offers the contestant an amount of money to quit the game, which the contestant may either choose or reject. Early in the game, the banker’s offer is usually less than the expected value of the remaining cases, providing an incentive to continue. However, as the number of remaining cases becomes small, the banker’s offers approach or may even exceed the average of the remaining cases. Most people press on until the bitter end and often walk away with a smaller amount than they could have had they been able to estimate the expected value of the remaining cases and make a more rational decision. In one case, a contestant had five briefcases left with \$100, \$400, \$1,000, \$50,000, and \$300,000. Because the choice of each case is equally likely, the expected value was 0.2(\$100 + \$400 + \$1,000 + \$50,000 + \$300,000) = \$70,300 and the banker offered \$80,000 to quit. Instead, she said “No Deal” and proceeded to open the \$300,000 suitcase, eliminating it from the game, and took the next banker’s offer of \$21,000, which was more than 60% larger than the expected value of the remaining cases.¹

¹ “Deal or No Deal: A Statistical Deal.” www.pearsonified.com/2006/03/deal_or_no_deal_the_real_deal.php
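The Deal or No Deal arithmetic is a plain average because every remaining case is equally likely. A small sketch (the function name `mean_of_cases` is hypothetical; the amounts are those quoted in the example):

```python
def mean_of_cases(cases):
    """Expected value when each remaining case is equally likely."""
    return sum(cases) / len(cases)

cases_before = [100, 400, 1_000, 50_000, 300_000]
cases_after = [100, 400, 1_000, 50_000]   # after the $300,000 case is opened

ev_before = mean_of_cases(cases_before)
ev_after = mean_of_cases(cases_after)

# How far each banker's offer sits above the corresponding expected value.
first_offer_premium = 80_000 / ev_before - 1
second_offer_premium = 21_000 / ev_after - 1

print(ev_before, ev_after)
print(round(first_offer_premium, 3), round(second_offer_premium, 3))
```

Both offers exceed the expected value of the cases still in play, which is why accepting either one would have been defensible on expected value grounds.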

Chapter 5  Probability Distributions and Data Modeling

Figure 5.8 Expected Value Calculations for Rolling Two Dice

It is important to understand that the expected value is a “long-run average” and is appropriate for decisions that occur on a repeated basis. For one-time decisions, however, you need to consider the downside risk and the upside potential of the decision. The following example illustrates this.

Example 5.21  Expected Value of a Charitable Raffle

Suppose that you are offered the chance to buy one of 1,000 tickets sold in a charity raffle for \$50, with the prize being \$25,000. Clearly, the probability of winning is 1/1,000, or 0.001, whereas the probability of losing is 1 − 0.001 = 0.999. The random variable X is your net winnings, and its probability distribution is

x          f(x)
−\$50       0.999
\$24,950    0.001

The expected value, E[X], is −\$50(0.999) + \$24,950(0.001) = −\$25.00. This means that if you played this game repeatedly over the long run, you would lose an average of \$25.00 each time you play. Of course, for any one game, you would either lose \$50 or win \$24,950. So the question becomes, Is the risk of losing \$50 worth the potential of winning \$24,950? Although the expected value is negative, you might take the chance because the upside potential is large relative to what you might lose, and, after all, it is for charity. However, if your potential loss is large, you might not take the chance, even if the expected value were positive.
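As a check on the raffle arithmetic, the sketch below builds the distribution of net winnings from the ticket price, prize, and number of tickets, and then applies the expected value formula. The variable and function names are hypothetical.

```python
def expected_value(distribution):
    """Sum of outcome times probability over (outcome, probability) pairs."""
    return sum(x * p for x, p in distribution)

ticket_price = 50
prize = 25_000
tickets_sold = 1_000

p_win = 1 / tickets_sold
raffle = [
    (-ticket_price, 1 - p_win),        # lose the price of the ticket
    (prize - ticket_price, p_win),     # win the prize, net of the ticket
]

print(round(expected_value(raffle), 2))
```

The result is an average loss per ticket over many plays; any single play still ends in one of only two outcomes, a \$50 loss or a \$24,950 gain.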

Decisions based on expected values are common in real estate development, day trading, and pharmaceutical research projects. Drug development is a good example. The cost of research and development projects in the pharmaceutical industry is generally in the hundreds of millions of dollars and often approaches \$1 billion. Many projects never make it to clinical trials or might not get approved by the Food and Drug Administration. Statistics indicate that 7 of 10 products fail to return the cost of the company’s capital. However, large firms can absorb such losses because the return from one or two blockbuster drugs can easily offset these losses. On an average basis, drug companies make a net profit from these decisions.


Example 5.22  Airline Revenue Management

Let us consider a simplified version of the typical revenue management process that airlines use. At any date prior to a scheduled flight, airlines must make a decision as to whether to reduce ticket prices to stimulate demand for unfilled seats. If the airline does not discount the fare, empty seats might not be sold and the airline will lose revenue. If the airline discounts the remaining seats too early (and could have sold them at the higher fare), it would lose profit. The decision depends on the probability p of selling a full-fare ticket if the airline chooses not to discount the price. Because an airline makes hundreds or thousands of such decisions each day, the expected value approach is appropriate.

Assume that only two fares are available: full and discount. Suppose that a full-fare ticket is \$560, the discount fare is \$400, and p = 0.75. For simplification, assume that if the price is reduced, then any remaining seats would be sold at that price. The expected value of not discounting the price is 0.25(0) + 0.75(\$560) = \$420. Because this is higher than the discounted price, the airline should not discount at this time. In reality, airlines constantly update the probability p based on the information they collect and analyze in a database. When the value of p drops below the break-even point, \$400 = p(\$560), or p = 0.714, it is beneficial to discount. It can also work in reverse: if demand is such that the probability of selling a higher-fare ticket is high, then the price may be adjusted upward. This is why published fares constantly change and why you may receive last-minute discount offers or may pay higher prices if you wait too long to book a reservation. Other industries such as hotels and cruise lines use similar decision strategies.
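The discount decision reduces to comparing two numbers, which the following sketch makes explicit. It uses the simplifying assumptions of the example (two fares, and a discounted seat is sold with certainty); the function names are hypothetical.

```python
def expected_full_fare_revenue(p, full_fare):
    """Expected revenue from holding the full fare: sold with probability p,
    otherwise the seat goes empty and earns nothing."""
    return p * full_fare + (1 - p) * 0.0

def break_even_probability(full_fare, discount_fare):
    """Value of p at which holding the fare and discounting are equal."""
    return discount_fare / full_fare

def should_discount(p, full_fare, discount_fare):
    """Discount only when the expected full-fare revenue is below the
    (assumed certain) discount fare."""
    return expected_full_fare_revenue(p, full_fare) < discount_fare

full_fare, discount_fare = 560.0, 400.0
print(expected_full_fare_revenue(0.75, full_fare))                # hold the fare
print(round(break_even_probability(full_fare, discount_fare), 3))
print(should_discount(0.75, full_fare, discount_fare),
      should_discount(0.70, full_fare, discount_fare))
```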

Variance of a Discrete Random Variable

We may compute the variance, Var[X], of a discrete random variable X as a weighted average of the squared deviations from the expected value:

$$\operatorname{Var}[X] = \sum_{j=1}^{\infty} \left(x_j - E[X]\right)^2 f(x_j) \qquad (5.10)$$

Example 5.23  Computing the Variance of a Random Variable

We may apply formula (5.10) to calculate the variance of the probability distribution of rolling two dice. Figure 5.9 shows these calculations in an Excel spreadsheet (worksheet Variance in Random Variable Calculations Excel file).
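Formula (5.10) can be applied to the two-dice distribution directly. The sketch below uses exact fractions so that no rounding enters the calculation; the variable names are hypothetical.

```python
from fractions import Fraction
from itertools import product
from math import sqrt

# Distribution of the total of two fair dice.
dist = {}
for d1, d2 in product(range(1, 7), repeat=2):
    total = d1 + d2
    dist[total] = dist.get(total, Fraction(0)) + Fraction(1, 36)

mean = sum(x * p for x, p in dist.items())

# Weighted average of squared deviations from the expected value.
variance = sum((x - mean) ** 2 * p for x, p in dist.items())
std_dev = sqrt(variance)

print(mean, variance, round(std_dev, 4))
```

The variance comes out to exactly 35/6 (about 5.83), so the standard deviation is roughly 2.4 points on either side of the mean of 7.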

Similar to our discussion in Chapter 4, the variance measures the uncertainty of the random variable; the higher the variance, the higher the uncertainty of the outcome. Although variances are easier to work with mathematically, we usually measure the variability of a random variable by its standard deviation, which is simply the square root of the variance.

Figure 5.9 Variance Calculations for Rolling Two Dice

Bernoulli Distribution

The Bernoulli distribution characterizes a random variable having two possible outcomes, each with a constant probability of occurrence. Typically, these outcomes represent “success” (x = 1), having probability p, and “failure” (x = 0), having probability 1 − p. A success can be any outcome you define. For example, in attempting to boot a new computer just off the assembly line, we might define a success as “does not boot up” in defining a Bernoulli random variable to characterize the probability distribution of a defective product. Thus, success need not be a favorable result in the traditional sense. The probability mass function of the Bernoulli distribution is

$$f(x) = \begin{cases} p & \text{if } x = 1 \\ 1 - p & \text{if } x = 0 \end{cases} \qquad (5.11)$$

where p represents the probability of success. The expected value is p, and the variance is p(1 − p).

Example 5.24  Using the Bernoulli Distribution

A Bernoulli distribution might be used to model whether an individual responds positively (x = 1) or negatively (x = 0) to a telemarketing promotion. For example, if you estimate that 20% of customers contacted will make a purchase, the probability distribution that describes whether or not a particular individual makes a purchase is Bernoulli with p = 0.2. Think of the following experiment. Suppose that you have a box with 100 marbles, 20 red and 80 white. For each customer, select one marble at random (and then replace it). The outcome will have a Bernoulli distribution. If a red marble is chosen, then that customer makes a purchase; if it is white, the customer does not make a purchase.
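The marble experiment is easy to simulate. The sketch below draws with replacement many times and compares the sample mean and variance with the theoretical values p = 0.2 and p(1 − p) = 0.16. The function names, the seed, and the number of trials are arbitrary choices made for illustration.

```python
import random

def bernoulli_pmf(x, p):
    """Probability mass function of formula (5.11)."""
    if x == 1:
        return p
    if x == 0:
        return 1 - p
    return 0.0

def draw_marble(rng, red=20, white=80):
    """Return 1 (purchase) if a red marble is drawn, else 0.
    Each draw is independent, which models replacing the marble."""
    return 1 if rng.randrange(red + white) < red else 0

rng = random.Random(42)          # fixed seed so the run is repeatable
trials = 100_000
draws = [draw_marble(rng) for _ in range(trials)]

sample_mean = sum(draws) / trials
sample_variance = sum((d - sample_mean) ** 2 for d in draws) / trials

print(round(sample_mean, 3), round(sample_variance, 3))
```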

Binomial Distribution

The binomial distribution models n independent replications of a Bernoulli experiment, each with a probability p of success. The random variable X represents the number of successes in these n experiments. In the telemarketing example, suppose that we call n = 10 customers, each of whom has a probability p = 0.2 of making a purchase. Then the probability distribution of the number of positive responses obtained from 10 customers is binomial. Using the binomial distribution, we can calculate the probability that exactly x customers out of the 10 will make a purchase for any value of x between 0 and 10. A binomial distribution might also be used to model the results of sampling inspection in a production operation or the effects of drug research on a sample of patients. The probability mass function for the binomial distribution is

$$f(x) = \begin{cases} \dbinom{n}{x} p^x (1 - p)^{n - x}, & \text{for } x = 0, 1, 2, \ldots, n \\ 0, & \text{otherwise} \end{cases} \qquad (5.12)$$

The notation $\binom{n}{x}$ represents the number of ways of choosing x distinct items from a group of n items and is computed as

$$\binom{n}{x} = \frac{n!}{x!\,(n - x)!} \qquad (5.13)$$

where n! (n factorial) = n(n − 1)(n − 2) ⋯ (2)(1), and 0! is defined to be 1.


Example 5.25  Computing Binomial Probabilities

We may use formula (5.12) to compute binomial probabilities. For example, if the probability that any individual will make a purchase from a telemarketing solicitation is 0.2, then the probability distribution that x individuals out of 10 calls will make a purchase is

$$f(x) = \begin{cases} \dbinom{10}{x} (0.2)^x (0.8)^{10 - x}, & \text{for } x = 0, 1, 2, \ldots, 10 \\ 0, & \text{otherwise} \end{cases}$$

Thus, to find the probability that 3 people will make a purchase among the 10 calls, we compute

$$f(3) = \binom{10}{3} (0.2)^3 (0.8)^{10 - 3} = \frac{10!}{3!\,7!}(0.008)(0.2097152) = 120(0.008)(0.2097152) = 0.20133$$
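The hand calculation can be reproduced step by step in Python. This is a minimal sketch of formulas (5.12) and (5.13); `binomial_pmf` is a hypothetical name, and `math.comb` supplies the binomial coefficient.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    if x < 0 or x > n:
        return 0.0
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 10, 0.2
print(comb(n, 3))                       # number of ways to choose 3 of 10
print(round(binomial_pmf(3, n, p), 5))  # f(3)

# The probabilities over x = 0, 1, ..., n should sum to 1.
print(round(sum(binomial_pmf(x, n, p) for x in range(n + 1)), 10))
```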

The formula for the probability mass function for the binomial distribution is rather complex, and binomial probabilities are tedious to compute by hand; however, they can easily be computed in Excel using the function

BINOM.DIST(number_s, trials, probability_s, cumulative)

In this function, number_s plays the role of x, and probability_s is the same as p. If cumulative is set to TRUE, then this function will provide cumulative probabilities; if it is set to FALSE, it provides values of the probability mass function, f(x).

Example 5.26  Using Excel’s Binomial Distribution Function

Figure 5.10 shows the results of using this function to compute the distribution for the previous example (Excel file Binomial Probabilities). For instance, the probability that exactly 3 individuals will make a purchase is BINOM.DIST(A10,\$B\$3,\$B\$4,FALSE) = 0.20133 = f(3). The probability that 3 or fewer individuals will make a purchase is BINOM.DIST(A10,\$B\$3,\$B\$4,TRUE) = 0.87913 = F(3). Correspondingly, the probability that more than 3 out of 10 individuals will make a purchase is 1 − F(3) = 1 − 0.87913 = 0.12087.

Figure 5.10 Computing Binomial Probabilities in Excel
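To see what the cumulative argument does, the following sketch imitates the four-argument form of BINOM.DIST in Python. It is an illustration of the calculation only, not Excel's implementation, and the name `binom_dist` is hypothetical.

```python
from math import comb

def binom_dist(number_s, trials, probability_s, cumulative):
    """Return f(number_s) when cumulative is False, or
    F(number_s) = f(0) + f(1) + ... + f(number_s) when it is True."""
    def pmf(k):
        return comb(trials, k) * probability_s ** k * (1 - probability_s) ** (trials - k)
    if cumulative:
        return sum(pmf(k) for k in range(number_s + 1))
    return pmf(number_s)

f3 = binom_dist(3, 10, 0.2, False)    # exactly 3 purchases
F3 = binom_dist(3, 10, 0.2, True)     # 3 or fewer purchases
more_than_3 = 1 - F3                  # complement of the cumulative value

print(round(f3, 5), round(F3, 5), round(more_than_3, 5))
```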


Figure 5.11 Example of the Binomial Distribution with p = 0.8

The expected value of the binomial distribution is np, and the variance is np(1 − p). The binomial distribution can assume different shapes and amounts of skewness, depending on the parameters. Figure 5.11 shows an example when p = 0.8. For larger values of p, the binomial distribution is negatively skewed; for smaller values, it is positively skewed. When p = 0.5, the distribution is symmetric.

Poisson Distribution

The Poisson distribution is a discrete distribution used to model the number of occurrences in some unit of measure; for example, the number of customers arriving at a Subway store during a weekday lunch hour, the number of failures of a machine during a month, the number of visits to a Web page during 1 minute, or the number of errors per line of software code. The Poisson distribution assumes no limit on the number of occurrences (meaning that the random variable X may assume any nonnegative integer value), that occurrences are independent, and that the average number of occurrences per unit is a constant, λ (Greek lowercase lambda). The expected value of the Poisson distribution is λ, and the variance also is equal to λ. The probability mass function for the Poisson distribution is

$$f(x) = \begin{cases} \dfrac{e^{-\lambda} \lambda^x}{x!}, & \text{for } x = 0, 1, 2, \ldots \\ 0, & \text{otherwise} \end{cases} \qquad (5.14)$$

Example 5.27  Computing Poisson Probabilities

Suppose that, on average, the number of customers arriving at Subway during lunch hour is 12 customers per hour. The probability that exactly x customers will arrive during the hour is given by a Poisson distribution with a mean of 12 and is calculated using formula (5.14):

$$f(x) = \begin{cases} \dfrac{e^{-12}\, 12^x}{x!}, & \text{for } x = 0, 1, 2, \ldots \\ 0, & \text{otherwise} \end{cases}$$

Substituting x = 5 in this formula, the probability that exactly 5 customers will arrive is f(5) = 0.0127.
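Formula (5.14) translates directly into code. The sketch below evaluates it for a mean of 12 arrivals per hour; `poisson_pmf` is a hypothetical name.

```python
from math import exp, factorial

def poisson_pmf(x, mean):
    """Probability of exactly x occurrences when the average is `mean`."""
    if x < 0:
        return 0.0
    return exp(-mean) * mean ** x / factorial(x)

mean_arrivals = 12
f5 = poisson_pmf(5, mean_arrivals)
print(round(f5, 4))   # about 0.0127

# Values near the mean are far more likely than values well below it.
print(round(poisson_pmf(12, mean_arrivals), 4))
```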

Like the binomial, Poisson probabilities are cumbersome to compute by hand. Probabilities can easily be computed in Excel using the function POISSON.DIST(x, mean, cumulative).


Example 5.28  Using Excel’s Poisson Distribution Function

Figure 5.12 shows the results of using this function to compute the distribution for Example 5.27 with λ = 12 (see the Excel file Poisson Probabilities). Thus, the probability of exactly one arrival during the lunch hour is calculated by the Excel function =POISSON.DIST(A7,\$B\$3,FALSE) = 0.00007 = f(1); the probability of 4 arrivals or fewer is calculated by =POISSON.DIST(A10,\$B\$3,TRUE) = 0.00760 = F(4), and so on. Because the possible values of a Poisson random variable are infinite, we have not shown the complete distribution. As x gets large, the probabilities become quite small. Like the binomial, the specific shape of the distribution depends on the value of the parameter λ; the distribution is more skewed for smaller values.
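As with the binomial case, the cumulative argument simply switches between a single probability and a running total. The following sketch imitates the three-argument form of POISSON.DIST in Python; it illustrates the calculation and is not Excel's implementation, and `poisson_dist` is a hypothetical name.

```python
from math import exp, factorial

def poisson_dist(x, mean, cumulative):
    """Return f(x) when cumulative is False, or
    F(x) = f(0) + f(1) + ... + f(x) when it is True."""
    def pmf(k):
        return exp(-mean) * mean ** k / factorial(k)
    if cumulative:
        return sum(pmf(k) for k in range(x + 1))
    return pmf(x)

f1 = poisson_dist(1, 12, False)   # exactly one arrival
F4 = poisson_dist(4, 12, True)    # four arrivals or fewer

print(f"{f1:.5f} {F4:.5f}")
```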

Figure 5.12 Computing Poisson Probabilities in Excel

Continuous Probability Distributions

As we noted earlier, a continuous random variable is defined over one or more intervals of real numbers and, therefore, has an infinite number of possible outcomes. Suppose that the expert who predicted the probabilities associated with next year's change in the DJIA in Figure 5.6 kept refining the estimates over larger and larger ranges of values. Figure 5.13 shows what such a probability distribution might look like using 2.5% increments rather than 5%. Notice that the distribution is similar in shape to the one in Figure 5.6 but simply has more outcomes. If this refinement process continues, then the distribution will approach the shape of a smooth curve, as shown in the figure. Such a curve that characterizes outcomes of a continuous random variable is called a probability density function and is described by a mathematical function $f(x)$.

Figure 5.13 Refined Probability Distribution of DJIA Change

Analytics in Practice: Using the Poisson Distribution for Modeling Bids on Priceline²

Priceline is well known for allowing customers to name their own prices (but not the service providers) in bidding for services such as airline flights or hotel stays. Some hotels take advantage of Priceline's approach to fill empty rooms for leisure travelers while not diluting the business market by offering discount rates through traditional channels. In one study using business analytics to develop a model to optimize pricing strategies for Kimpton Hotels, which develops, owns, or manages more than 40 independent boutique lifestyle hotels in the United States and Canada, the distribution of the number of bids for a given number of days before arrival was modeled as a Poisson distribution because it corresponded well with data that were observed. For example, the average number of bids placed per day 3 days before arrival on a weekend (the random variable X) was 6.3. Therefore, the distribution used in the model was $f(x) = e^{-6.3}\,6.3^{x}/x!$, where x is the number of bids placed. The analytic model helped to determine the prices to post on Priceline and the inventory allocation for each price. After using the model, rooms sold via Priceline increased 11% in 1 year, and the average rate for these rooms increased 3.7%.
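The Poisson model quoted in the Priceline box can be evaluated directly from its formula. The following is a minimal Python sketch, not the study's actual implementation; the function names `poisson_pmf` and `poisson_cdf` are illustrative choices, and only the rate of 6.3 bids per day comes from the text.

```python
import math


def poisson_pmf(x: int, lam: float) -> float:
    """Poisson probability mass function: f(x) = e^(-lam) * lam^x / x!"""
    return math.exp(-lam) * lam**x / math.factorial(x)


def poisson_cdf(x: int, lam: float) -> float:
    """Cumulative probability F(x) = P(X <= x), summed term by term."""
    return sum(poisson_pmf(k, lam) for k in range(x + 1))


lam = 6.3  # average number of bids per day, taken from the Priceline example

# Tabulate a few values of f(x) and F(x); the full distribution is infinite.
for x in range(0, 13, 3):
    print(x, round(poisson_pmf(x, lam), 4), round(poisson_cdf(x, lam), 4))
```

With the cumulative flag set to TRUE, Excel's POISSON.DIST plays the role of `poisson_cdf` here; with FALSE it corresponds to `poisson_pmf`.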

² Based on Chris K. Anderson, "Setting Prices on Priceline," Interfaces, 39, 4 (July–August 2009): 307–315.

Properties of Probability Density Functions

A probability density function has the following properties:

1. $f(x) \ge 0$ for all values of x. This means that a graph of the density function must lie at or above the x-axis.
2. The total area under the density function above the x-axis is 1.0. This is analogous to the property that the sum of all probabilities of a discrete random variable must add to 1.0.
3. $P(X = x) = 0$. For continuous random variables, it does not make mathematical sense to attempt to define a probability for a specific value of x because there are an infinite number of values.
4. Probabilities of continuous random variables are only defined over intervals. Thus, we may calculate probabilities between two numbers a and b, $P(a \le X \le b)$, or to the left or right of a number c, for example, $P(X < c)$ and $P(X > c)$.
5. $P(a \le X \le b)$ is the area under the density function between a and b.

The cumulative distribution function for a continuous random variable is denoted the same way as for discrete random variables, $F(x)$, and represents the probability that the random variable X is less than or equal to x, $P(X \le x)$. Intuitively, $F(x)$ represents the area under the density function to the left of x. $F(x)$ can often be derived mathematically from $f(x)$. Knowing $F(x)$ makes it easy to compute probabilities over intervals for continuous distributions. The probability that X is between a and b is equal to the difference of the cumulative distribution function evaluated at these two points; that is,

$$P(a \le X \le b) = P(X \le b) - P(X \le a) = F(b) - F(a) \tag{5.15}$$

For continuous distributions we need not be concerned about the endpoints, as we were with discrete distributions, because $P(a \le X \le b)$ is the same as $P(a < X < b)$. The formal definitions of expected value and variance for a continuous random variable are similar to those for a discrete random variable; however, to understand them, we must rely on notions of calculus, so we do not discuss them in this book. We simply state them when appropriate.

Uniform Distribution

The uniform distribution characterizes a continuous random variable for which all outcomes between some minimum and maximum value are equally likely. The uniform distribution is often assumed in business analytics applications when little is known about a random variable other than reasonable estimates for minimum and maximum values. The parameters a and b are chosen judgmentally to reflect a modeler's best guess about the range of the random variable. For a uniform distribution with a minimum value a and a maximum value b, the density function is

$$f(x) = \begin{cases} \dfrac{1}{b - a}, & \text{for } a \le x \le b \\ 0, & \text{otherwise} \end{cases} \tag{5.16}$$

and the cumulative distribution function is

$$F(x) = \begin{cases} 0, & \text{if } x < a \\ \dfrac{x - a}{b - a}, & \text{if } a \le x \le b \\ 1, & \text{if } b < x \end{cases} \tag{5.17}$$

Although Excel does not provide a function to compute uniform probabilities, the formulas are simple enough to incorporate into a spreadsheet. Probabilities are also easy to compute for the uniform distribution because of the simple geometric shape of the density function, as Example 5.29 illustrates.


Example 5.29  Computing Uniform Probabilities

Suppose that sales revenue, X, for a product varies uniformly each week between a = \$1,000 and b = \$2,000. The density function is $f(x) = 1/(2{,}000 - 1{,}000) = 1/1{,}000$ and is shown in Figure 5.14. Note that the area under the density function is 1.0, which you can easily verify by multiplying the height by the width of the rectangle.

Suppose we wish to find the probability that sales revenue will be less than x = \$1,300. We could do this in two ways. First, compute the area under the density function using geometry, as shown in Figure 5.15. The area is $(1/1{,}000)(300) = 0.30$. Alternatively, we could use formula (5.17) to compute $F(1{,}300)$:

$$F(1{,}300) = \frac{1{,}300 - 1{,}000}{2{,}000 - 1{,}000} = 0.30$$

In either case, the probability is 0.30.

Now suppose we wish to find the probability that revenue will be between \$1,500 and \$1,700. Again, using geometrical arguments (see Figure 5.16), the area of the rectangle between \$1,500 and \$1,700 is $(1/1{,}000)(200) = 0.2$. We may also use formula (5.15) and compute it as follows:

$$P(1{,}500 \le X \le 1{,}700) = P(X \le 1{,}700) - P(X \le 1{,}500) = F(1{,}700) - F(1{,}500) = \frac{1{,}700 - 1{,}000}{2{,}000 - 1{,}000} - \frac{1{,}500 - 1{,}000}{2{,}000 - 1{,}000} = 0.7 - 0.5 = 0.2$$
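Because Excel has no built-in uniform probability function, formula (5.17) has to be written out by hand. The same calculation can be sketched in Python as follows; the function name `uniform_cdf` is an illustrative choice, and the figures are those of the weekly sales revenue example.

```python
def uniform_cdf(x: float, a: float, b: float) -> float:
    """F(x) from formula (5.17) for a uniform distribution on [a, b]."""
    if x < a:
        return 0.0
    if x > b:
        return 1.0
    return (x - a) / (b - a)


a, b = 1000.0, 2000.0  # minimum and maximum weekly revenue from the example

# P(X < 1,300) is the cumulative probability at 1,300.
p_below_1300 = uniform_cdf(1300, a, b)

# P(1,500 <= X <= 1,700) = F(1,700) - F(1,500), as in formula (5.15).
p_between = uniform_cdf(1700, a, b) - uniform_cdf(1500, a, b)

print(round(p_below_1300, 2), round(p_between, 2))  # 0.3 0.2
```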

The expected value and variance of a uniform random variable X are computed as follows:

$$E[X] = \frac{a + b}{2} \tag{5.18}$$

$$\mathrm{Var}[X] = \frac{(b - a)^2}{12} \tag{5.19}$$

A variation of the uniform distribution is one for which the random variable is restricted to integer values between a and b (also integers); this is called a discrete uniform distribution. An example of a discrete uniform distribution is the roll of a single die. Each of the numbers 1 through 6 has a 1/6 probability of occurrence.

Figure 5.14 Uniform Probability Density Function

Figure 5.15 Probability that X < \$1,300

Figure 5.16 P(\$1,500 < X < \$1,700)

Normal Distribution

The normal distribution is a continuous distribution that is described by the familiar bell-shaped curve and is perhaps the most important distribution used in statistics. The normal distribution is observed in many natural phenomena. Test scores such as the SAT, deviations from specifications of machined items, human height and weight, and many other measurements are often normally distributed. The normal distribution is characterized by two parameters: the mean, $\mu$, and the standard deviation, $\sigma$. Thus, as $\mu$ changes, the location of the distribution on the x-axis also changes, and as $\sigma$ is decreased or increased, the distribution becomes narrower or wider, respectively. Figure 5.17 shows some examples.

Figure 5.17 Examples of Normal Distributions

The normal distribution has the following properties:

1. The distribution is symmetric, so its measure of skewness is zero.
2. The mean, median, and mode are all equal. Thus, half the area falls above the mean and half falls below it.
3. The range of X is unbounded, meaning that the tails of the distribution extend to negative and positive infinity.
4. The empirical rules apply exactly for the normal distribution; the area under the density function within ±1 standard deviation of the mean is 68.3%, the area within ±2 standard deviations is 95.4%, and the area within ±3 standard deviations is 99.7%.


Normal probabilities cannot be computed using a simple closed-form formula. Instead, we may use the Excel function NORM.DIST(x, mean, standard_deviation, cumulative). NORM.DIST(x, mean, standard_deviation, TRUE) calculates the cumulative probability $F(x) = P(X \le x)$ for a specified mean and standard deviation. (If cumulative is set to FALSE, the function simply calculates the value of the density function $f(x)$, which has little practical application other than tabulating values of the density function. This was used to draw the distributions in Figure 5.17.)

Example 5.30  Using the NORM.DIST Function to Compute Normal Probabilities

Suppose that a company has determined that the distribution of customer demand (X) is normal with a mean of 750 units/month and a standard deviation of 100 units/month. Figure 5.18 shows some cumulative probabilities calculated with the NORM.DIST function (see the Excel file Normal Probabilities). The company would like to know the following:

1. What is the probability that demand will be at most 900 units?
2. What is the probability that demand will exceed 700 units?
3. What is the probability that demand will be between 700 and 900 units?

Figure 5.18 Normal Probability Calculations in Excel

To answer the questions, first draw a picture. This helps to ensure that you know what area you are trying to calculate and how to use the formulas for working with a cumulative distribution correctly.

Question 1. Figure 5.19(a) shows the probability that demand will be at most 900 units, or $P(X \le 900)$. This is simply the cumulative probability for x = 900, which can be calculated using the Excel function =NORM.DIST(900,750,100,TRUE) = 0.9332.

Question 2. Figure 5.19(b) shows the probability that demand will exceed 700 units, $P(X > 700)$. Using the principles we have previously discussed, this can be found by subtracting $P(X \le 700)$ from 1:

$$P(X > 700) = 1 - P(X \le 700) = 1 - F(700) = 1 - 0.3085 = 0.6915$$

This can be computed in Excel using the formula =1-NORM.DIST(700,750,100,TRUE).

Question 3. The probability that demand will be between 700 and 900, $P(700 \le X \le 900)$, is illustrated in Figure 5.19(c). This is calculated by

$$P(700 \le X \le 900) = P(X \le 900) - P(X \le 700) = F(900) - F(700) = 0.9332 - 0.3085 = 0.6247$$

In Excel, we would use the formula =NORM.DIST(900,750,100,TRUE)-NORM.DIST(700,750,100,TRUE).
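For readers working outside Excel, the three questions in this example can be checked with Python's standard library. This is a sketch under the assumption that Python 3.8 or later is available, since `statistics.NormalDist` was added in that version; its `cdf` method plays the role of NORM.DIST with the cumulative argument set to TRUE.

```python
from statistics import NormalDist

# Customer demand: normal with mean 750 and standard deviation 100.
demand = NormalDist(mu=750, sigma=100)

# Question 1: P(X <= 900) is the cumulative probability at 900.
p_at_most_900 = demand.cdf(900)

# Question 2: P(X > 700) = 1 - F(700).
p_exceeds_700 = 1 - demand.cdf(700)

# Question 3: P(700 <= X <= 900) = F(900) - F(700).
p_between = demand.cdf(900) - demand.cdf(700)

print(round(p_at_most_900, 4), round(p_exceeds_700, 4), round(p_between, 4))
# 0.9332 0.6915 0.6247
```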

Figure 5.19 Computing Normal Probabilities. Each panel shows a normal curve with mean 750 and standard deviation 100: (a) P(Demand < 900); (b) P(X > 700); (c) P(700 < X < 900); (d) an unknown value x with upper-tail area 0.10 and area 1 − 0.10 to its left.

The NORM.INV Function

With the NORM.DIST function, we are given a value of the random variable X and can find the cumulative probability to the left of x. Now let's reverse the problem. Suppose that we know the cumulative probability but don't know the value of x. How can we find it? Such questions arise in many applications. The Excel function NORM.INV(probability, mean, standard_dev) can be used to do this. In this function, probability is the cumulative probability value corresponding to the value of x we seek. ("INV" stands for inverse.)

Example 5.31  Using the NORM.INV Function

In the previous example, what level of demand would be exceeded at most 10% of the time? Here, we need to find the value of x so that $P(X > x) = 0.10$. This is illustrated in Figure 5.19(d). Because the area in the upper tail of the normal distribution is 0.10, the cumulative probability must be 1 − 0.10 = 0.90. From Figure 5.18, we can see that the correct value must be somewhere between 850 and 900 because F(850) = 0.8413 and F(900) = 0.9332. We can find the exact value using the Excel function =NORM.INV(0.90,750,100) = 878.155. Therefore, a demand of approximately 878 will satisfy the criterion.
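The inverse calculation can be sketched in Python in the same way. The example below assumes Python 3.8 or later and uses `NormalDist.inv_cdf` as the counterpart of NORM.INV; the variable names are illustrative.

```python
from statistics import NormalDist

demand = NormalDist(mu=750, sigma=100)

# Demand level exceeded only 10% of the time: the upper-tail area is 0.10,
# so the cumulative probability to the left of x is 1 - 0.10 = 0.90.
upper_tail = 0.10
x = demand.inv_cdf(1 - upper_tail)

print(round(x, 3))  # 878.155
```

Passing the result back through `demand.cdf(x)` returns 0.90, which is a quick way to confirm that the correct tail was used.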

Standard Normal Distribution

Figure 5.20 provides a sketch of a special case of the normal distribution called the standard normal distribution, that is, the normal distribution with $\mu = 0$ and $\sigma = 1$. This distribution is important in performing many probability calculations. A standard normal random variable is usually denoted by Z, and its density function by $f(z)$. The scale along the z-axis represents the number of standard deviations from the mean of zero. The Excel function NORM.S.DIST(z, cumulative) finds probabilities for the standard normal distribution; setting cumulative to TRUE returns the cumulative probability $F(z)$.


Figure 5.20 Standard Normal Distribution

Example 5.32  Computing Probabilities with the Standard Normal Distribution

We have previously noted that the empirical rules apply to any normal distribution. Let us find the areas under the standard normal distribution within 1, 2, and 3 standard deviations of the mean. These can be found by using the function NORM.S.DIST(z, TRUE). Figure 5.21 shows a tabulation of the cumulative probabilities for z ranging from −3 to +3 and calculations of the areas within 1, 2, and 3 standard deviations of the mean. We apply formula (5.15) to find the difference between the cumulative probabilities, $F(b) - F(a)$. For example, the area within 1 standard deviation of the mean is found by calculating $P(-1 \le Z \le 1) = F(1) - F(-1)$ = NORM.S.DIST(1, TRUE) − NORM.S.DIST(−1, TRUE) = 0.84134 − 0.15866 = 0.6827 (this differs from the 68.3% stated earlier only by decimal rounding). As the empirical rules stated, about 68% of the area falls within 1 standard deviation; 95%, within 2 standard deviations; and more than 99%, within 3 standard deviations of the mean.

Figure 5.21 Computing Standard Normal Probabilities
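The areas within 1, 2, and 3 standard deviations can also be verified with a short Python sketch. It assumes Python 3.8 or later; `NormalDist()` with no arguments is the standard normal distribution, and the dictionary name `areas` is an illustrative choice.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Area within k standard deviations of the mean: F(k) - F(-k).
areas = {k: Z.cdf(k) - Z.cdf(-k) for k in (1, 2, 3)}

for k, area in areas.items():
    print(k, round(area, 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```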


Using Standard Normal Distribution Tables

Although it is quite easy to use Excel to compute normal probabilities, tables of the standard normal distribution are commonly found in textbooks and professional references for use when a computer is not available. Such a table is provided in Table A.1 of Appendix A at the end of this book. The table allows you to look up the cumulative probability for any value of z between −3.00 and +3.00. One of the advantages of the standard normal distribution is that we may compute probabilities for any normal random variable X having a mean $\mu$ and standard deviation $\sigma$ by converting it to a standard normal random variable Z. We introduced the concept of standardized values (z-scores) for sample data in Chapter 4. Here, we use a similar formula to convert a value x from an arbitrary normal distribution into an equivalent standard normal value, z:

$$z = \frac{x - \mu}{\sigma} \tag{5.20}$$

Example 5.33  Computing Probabilities with Standard Normal Tables

We will answer the first question posed in Example 5.30: What is the probability that demand will be at most x = 900 units if the distribution of customer demand (X) is normal with a mean of 750 units/month and a standard deviation of 100 units/month? Using formula (5.20), convert x to a standard normal value:

$$z = \frac{900 - 750}{100} = 1.5$$

Note that 900 is 150 units higher than the mean of 750; since the standard deviation is 100, this simply means that 900 is 1.5 standard deviations above the mean, which is the value of z. Using Table A.1 in Appendix A, we see that the cumulative probability for z = 1.5 is 0.9332, which is the same answer we found for Example 5.30.
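The standardization step can be sketched in Python, with the table lookup replaced by a call to the standard normal cumulative distribution. The function name `z_score` is an illustrative choice, and `statistics.NormalDist` requires Python 3.8 or later.

```python
from statistics import NormalDist


def z_score(x: float, mean: float, std_dev: float) -> float:
    """Convert x to a standard normal value: z = (x - mean) / std_dev."""
    return (x - mean) / std_dev


z = z_score(900, 750, 100)

# Stands in for looking up z in a standard normal table such as Table A.1.
p = NormalDist().cdf(z)

print(z, round(p, 4))  # 1.5 0.9332
```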

Exponential Distribution

The exponential distribution is a continuous distribution that models the time between randomly occurring events. Thus, it is often used in such applications as modeling the time between customer arrivals to a service system or the time to or between failures of machines, lightbulbs, hard drives, and other mechanical or electrical components. Similar to the Poisson distribution, the exponential distribution has one parameter, $\lambda$. In fact, the exponential distribution is closely related to the Poisson; if the number of events occurring during an interval of time has a Poisson distribution, then the time between events is exponentially distributed. For instance, if the number of arrivals at a bank is Poisson-distributed, say with mean $\lambda = 12$/hour, then the time between arrivals is exponential, with mean $\mu = 1/12$ hour, or 5 minutes. The exponential distribution has the density function

$$f(x) = \lambda e^{-\lambda x}, \quad \text{for } x \ge 0 \tag{5.21}$$

and its cumulative distribution function is

$$F(x) = 1 - e^{-\lambda x}, \quad \text{for } x \ge 0 \tag{5.22}$$


Sometimes, the exponential distribution is expressed in terms of the mean $\mu$ rather than the rate $\lambda$. To do this, simply substitute $1/\mu$ for $\lambda$ in the preceding formulas. The expected value of the exponential distribution is $1/\lambda$ and the variance is $(1/\lambda)^2$. Figure 5.22 provides a sketch of the exponential distribution. The exponential distribution has the properties that it is bounded below by 0, it has its greatest density at 0, and the density declines as x increases. The Excel function EXPON.DIST(x, lambda, cumulative) can be used to compute exponential probabilities. As with other Excel probability distribution functions, cumulative is either TRUE or FALSE, with TRUE providing the cumulative distribution function.

Example 5.34  Using the Exponential Distribution

Suppose that the mean time to failure of a critical component of an engine is $\mu = 8{,}000$ hours. Therefore, $\lambda = 1/\mu = 1/8{,}000$ failures/hour. The probability that the component will fail before x hours is given by the cumulative distribution function $F(x)$. Figure 5.23 shows a portion of the cumulative distribution function, which may be found in the Excel file Exponential Probabilities. For example, the probability of failing before 5,000 hours is F(5,000) = 0.4647.
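Formula (5.22) is simple enough to evaluate directly. The following minimal Python sketch reproduces the failure probability in this example; the function name `exponential_cdf` is an illustrative choice that mirrors EXPON.DIST with cumulative set to TRUE.

```python
import math


def exponential_cdf(x: float, lam: float) -> float:
    """F(x) = 1 - e^(-lam * x) for x >= 0 (formula 5.22); 0 for x < 0."""
    if x < 0:
        return 0.0
    return 1.0 - math.exp(-lam * x)


mean_time_to_failure = 8000.0      # hours, from the example
lam = 1.0 / mean_time_to_failure   # failures per hour

p_fail_before_5000 = exponential_cdf(5000, lam)
print(round(p_fail_before_5000, 4))  # 0.4647
```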

Figure 5.22 Example of an Exponential Distribution ($\lambda = 1$)


Figure 5.23 Computing Exponential Probabilities in Excel

Other Useful Distributions

Many other probability distributions, especially those that can assume a wide variety of shapes, find application in decision models for characterizing a wide variety of phenomena. Such distributions provide a great amount of flexibility, both for representing empirical data and for situations in which judgment is needed to define an appropriate distribution. We provide a brief description of these distributions; however, you need not know the mathematical details about them to use them in applications.

Continuous Distributions

Triangular Distribution. The triangular distribution is defined by three parameters: the minimum, a; maximum, b; and most likely, c. Outcomes near the most likely value have a higher chance of occurring than those at the extremes. By varying the most likely value, the triangular distribution can be symmetric or skewed in either direction, as shown in Figure 5.24. The triangular distribution is often used when no data are available to characterize an uncertain variable and the distribution must be estimated judgmentally.

Figure 5.24 Examples of Triangular Distributions (symmetric, positively skewed, and negatively skewed; each panel plots f(x) against x with a, c, and b marked on the x-axis)

Lognormal Distribution. If the natural logarithm of a random variable X is normal, then X has a lognormal distribution. Because the lognormal distribution is positively skewed and bounded below by zero, it finds applications in modeling phenomena that have low probabilities of large values and cannot have negative values, such as the time to complete a task. Other common examples include stock prices and real estate prices. The lognormal distribution is also often used for "spiked" service times, that is, when the probability of zero is very low, but the most likely value is just greater than zero.

Beta Distribution. One of the most flexible distributions for modeling variation over a fixed interval from 0 to a positive value is the beta. The beta distribution is a function of two parameters, $\alpha$ and $\beta$, both of which must be positive. If $\alpha$ and $\beta$ are equal, the distribution is symmetric. If either parameter is 1.0 and the other is greater than 1.0, the distribution is in the shape of a J. If $\alpha$ is less than $\beta$, the distribution is positively skewed; otherwise, it is negatively skewed. These properties can help you to select appropriate values for the shape parameters.

Random Sampling from Probability Distributions

Many applications in business analytics require random samples from specific probability distributions. For example, in a financial model, we might be interested in the distribution of the cumulative discounted cash flow over several years when sales, sales growth rate, operating expenses, and inflation factors are all uncertain and are described by probability distributions. The outcome variables of such decision models are complicated functions of the random input variables. Understanding the probability distribution of such variables can be accomplished only by sampling procedures called Monte Carlo simulation, which we address in Chapter 12.

The basis for generating random samples from probability distributions is the concept of a random number. A random number is one that is uniformly distributed between 0 and 1. Technically speaking, computers cannot generate truly random numbers since they must use a predictable algorithm. However, the algorithms are designed to generate a sequence of numbers that appear to be random.

In Excel, we may generate a random number within any cell using the function RAND( ). This function has no arguments; therefore, nothing should be placed within the parentheses (but the parentheses are required). Figure 5.25 shows a table of 10 random numbers generated in Excel. You should be aware that unless the automatic recalculation feature is suppressed, whenever any cell in the spreadsheet is modified, the values in any cell containing the RAND( ) function will change. Automatic recalculation can be changed to manual by choosing Calculation Options in the Calculation group under the Formulas tab. Under manual recalculation mode, the worksheet is recalculated only when the F9 key is pressed.


Figure 5.25 A Sample of Random Numbers

Sampling from Discrete Probability Distributions

Sampling from discrete probability distributions using random numbers is quite easy. We will illustrate this process using the probability distribution for rolling two dice.

Example 5.35  Sampling from the Distribution of Dice Outcomes

The probability mass function and cumulative distribution in decimal form are as follows:

 x     f(x)      F(x)
 2    0.0278    0.0278
 3    0.0556    0.0833
 4    0.0833    0.1667
 5    0.1111    0.2778
 6    0.1389    0.4167
 7    0.1667    0.5833
 8    0.1389    0.7222
 9    0.1111    0.8333
10    0.0833    0.9167
11    0.0556    0.9722
12    0.0278    1.0000

Notice that the values of F(x) divide the interval from 0 to 1 into smaller intervals that correspond to the probabilities of the outcomes. For example, the interval from (but not including) 0 and up to and including 0.0278 has a probability of 0.0278 and corresponds to the outcome x = 2; the interval from (but not including) 0.0278 and up to and including 0.0833 has a probability of 0.0556 and corresponds to the outcome x = 3; and so on. This is summarized as follows:

Interval              Outcome
0      to 0.0278         2
0.0278 to 0.0833         3
0.0833 to 0.1667         4
0.1667 to 0.2778         5
0.2778 to 0.4167         6
0.4167 to 0.5833         7
0.5833 to 0.7222         8
0.7222 to 0.8333         9
0.8333 to 0.9167        10
0.9167 to 0.9722        11
0.9722 to 1.0000        12

Any random number, then, must fall within one of these intervals. Thus, to generate an outcome from this distribution, all we need to do is to select a random number and determine the interval into which it falls. Suppose we use the data in Figure 5.25. The first random number is 0.326510048. This falls in the interval corresponding to the sample outcome of 6. The second random number is 0.743390121. This number falls in the interval corresponding to an outcome of 9. Essentially, we have developed a technique to roll dice on a computer. If this is done repeatedly, the frequency of occurrence of each outcome should be proportional to the size of the random number range (i.e., the probability associated with the outcome) because random numbers are uniformly distributed.
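The interval lookup described in this example can be sketched in Python. This is one possible implementation, not the book's spreadsheet: it builds the cumulative distribution from the 36 equally likely dice combinations and uses the standard library's `bisect` module to find the interval containing a random number. The function name `sample_outcome` is an illustrative choice.

```python
import bisect
import random

outcomes = list(range(2, 13))

# Number of ways to roll each total with two dice: 1, 2, ..., 6, ..., 2, 1.
ways = [6 - abs(x - 7) for x in outcomes]

# Cumulative distribution F(x) for each outcome.
cumulative = []
running = 0
for w in ways:
    running += w
    cumulative.append(running / 36)


def sample_outcome(u: float) -> int:
    """Return the outcome x whose interval (F(x - 1), F(x)] contains u."""
    return outcomes[bisect.bisect_left(cumulative, u)]


# The two random numbers quoted in the example.
print(sample_outcome(0.326510048), sample_outcome(0.743390121))  # 6 9

# Repeating the process many times "rolls the dice" on a computer.
random.seed(1)  # arbitrary seed, used only to make the run repeatable
rolls = [sample_outcome(random.random()) for _ in range(10_000)]
```

Over many repetitions the relative frequency of each total in `rolls` should be close to the corresponding f(x).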

We can easily use this approach to generate outcomes from any discrete distribution; the VLOOKUP function in Excel can be used to implement this on a spreadsheet.

Example 5.36  Using the VLOOKUP Function for Random Sampling

Suppose that we want to sample from the probability distribution of the predicted change in the Dow Jones Industrial Average index shown in Figure 5.6. We first construct the cumulative distribution $F(x)$. Then assign intervals to the outcomes based on the values of the cumulative distribution, as shown in Figure 5.26. This specifies the table range for the VLOOKUP function, namely, \$E\$2:\$G\$10. List the random numbers in a column using the RAND( ) function. The formula in cell J2 is =VLOOKUP(I2,\$E\$2:\$G\$10,3), which is copied down that column. Because the fourth argument is omitted, VLOOKUP performs an approximate match, which requires the first column of the table range to be sorted in ascending order. The function takes the value of the random number in cell I2, finds the last number in the first column of the table range that is less than or equal to the random number, and returns the value in the third column of the table range. In this case, 0.49 is the last number in column E that does not exceed 0.530612386, so the function returns 5% as the outcome.

Sampling from Common Probability Distributions

This approach of generating random numbers and transforming them into outcomes from a probability distribution may be used to sample from almost any distribution. A value randomly generated from a specified probability distribution is called a random variate. For example, it is quite easy to transform a random number into a random variate from a uniform distribution between a and b. Consider the formula:

$$U = a + (b - a) \times \text{RAND( )} \tag{5.23}$$
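As a hedged illustration, formula (5.23) can be sketched in Python, with `random.random()` standing in for RAND( ). The function name `uniform_variate`, the seed value, and the bounds of 1,000 and 2,000 are assumptions made for the example.

```python
import random


def uniform_variate(a: float, b: float) -> float:
    """U = a + (b - a) * R, where R is a random number between 0 and 1."""
    return a + (b - a) * random.random()


random.seed(42)  # arbitrary seed, used only to make the run repeatable

# Five random variates from a uniform distribution between 1,000 and 2,000.
samples = [uniform_variate(1000, 2000) for _ in range(5)]
print(samples)
```

Python's standard library also provides `random.uniform(a, b)`, which performs the same transformation.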

Note that when RAND( ) = 0, U = a, and when RAND( ) approaches 1, U approaches b. For any other value of RAND( ) between 0 and 1, (b − a)*RAND( ) represents the same proportion of the interval (a, b) as RAND( ) does of the interval (0, 1). Thus, all real numbers between a and b can occur. Since RAND( ) is uniformly distributed, so also is U.

Figure 5.26 Using the VLOOKUP Function to Sample from a Discrete Distribution

Although this is quite easy, it is certainly not obvious how to generate random variates from other distributions such as normal or exponential. We do not describe the technical details of how this is done but rather just describe the capabilities available in Excel. Excel allows you to generate random variates from discrete distributions and certain others using the Random Number Generation option in the Analysis ToolPak. From the Data tab in the ribbon, select Data Analysis in the Analysis group and then Random Number Generation. The Random Number Generation dialog, shown in Figure 5.27, will appear. From the Random Number Generation dialog, you may select from seven distributions: uniform, normal, Bernoulli, binomial, Poisson, and patterned, as well as discrete. (The patterned distribution is characterized by a lower and upper bound, a step, a repetition rate for values, and a repetition rate for the sequence.) If you select the Output Range option, you are asked to specify the upper-left cell reference of the output table that will store the outcomes, the number of variables (columns of values you want generated), number of random numbers (the number of data points you want generated for each variable), and the type of distribution. The default distribution is the discrete distribution.

Figure 5.27 Excel Random Number Generation Dialog

Example 5.37  Using Excel's Random Number Generation Tool

We will generate 100 outcomes from a Poisson distribution with a mean of 12. In the Random Number Generation dialog, set the Number of Variables to 1 and the Number of Random Numbers to 100 and select Poisson from the drop-down Distribution box. The dialog will change and prompt you for the value of Lambda, the mean of the Poisson distribution; enter 12 in the box and click OK. The tool will display the random numbers in a column. Figure 5.28 shows a histogram of the results.
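The text does not describe how Excel generates Poisson variates internally, so the following Python sketch should be read as one possible approach rather than Excel's method. It applies the interval idea from the dice example: draw a random number and step through the cumulative Poisson distribution until it is reached. The function name `poisson_variate` and the seed value are assumptions.

```python
import math
import random


def poisson_variate(lam: float, rng: random.Random) -> int:
    """Inverse-transform sampling: return the smallest k with F(k) >= u."""
    u = rng.random()
    k = 0
    pmf = math.exp(-lam)  # f(0)
    cdf = pmf             # F(0)
    # The pmf > 0 check guards against an endless loop from rounding error.
    while u > cdf and pmf > 0.0:
        k += 1
        pmf *= lam / k    # f(k) = f(k - 1) * lam / k
        cdf += pmf
    return k


rng = random.Random(2024)  # arbitrary seed for a repeatable run

# 100 outcomes from a Poisson distribution with a mean of 12.
samples = [poisson_variate(12, rng) for _ in range(100)]
print(sum(samples) / len(samples))  # sample mean, expected to be near 12
```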

Figure 5.28 Histogram of Samples from a Poisson Distribution

The dialog in Figure 5.27 also allows you the option of specifying a random number seed. A random number seed is a value from which a stream of random numbers is generated. By specifying the same seed, you can produce the same random numbers at a later time. This is desirable when we wish to reproduce an identical sequence of "random" events in a simulation to test the effects of different policies or decision variables under the same circumstances. However, one disadvantage with using the Random Number Generation tool is that you must repeat the process to generate a new set of sample values; pressing the recalculation (F9) key will not change the values. This can make it difficult to use this tool to analyze decision models.

Excel also has several inverse functions of probability distributions that may be used to generate random variates. For the normal distribution, use

• NORM.INV(probability, mean, standard_deviation): normal distribution with a specified mean and standard deviation,
• NORM.S.INV(probability): standard normal distribution.

For some advanced distributions, you might see

• LOGNORM.INV(probability, mean, standard_deviation): lognormal distribution, where ln(X) has the specified mean and standard deviation,
• BETA.INV(probability, alpha, beta, A, B): beta distribution.

To use these functions, simply enter RAND( ) in place of probability in the function. For example, NORM.INV(RAND( ), 5, 2) will generate random variates from a normal distribution with mean 5 and standard deviation 2. Because these functions may be embedded in cell formulas, each time the worksheet is recalculated, a new random number and, hence, a new random variate, are generated.
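The NORM.INV(RAND( ), 5, 2) idea and the role of a seed can both be sketched in Python. This is an illustration under stated assumptions: `NormalDist.inv_cdf` (Python 3.8 or later) stands in for NORM.INV, `random.Random(seed)` stands in for a seeded random number stream, and the function name and seed value are hypothetical.

```python
import random
from statistics import NormalDist


def normal_variate(mean: float, std_dev: float, rng: random.Random) -> float:
    """Analogue of NORM.INV(RAND(), mean, std_dev)."""
    u = rng.random()
    while u == 0.0:  # inv_cdf requires a probability strictly between 0 and 1
        u = rng.random()
    return NormalDist(mean, std_dev).inv_cdf(u)


# Two streams started from the same seed produce identical variates.
rng = random.Random(7)
first_run = [normal_variate(5, 2, rng) for _ in range(1000)]

rng = random.Random(7)
second_run = [normal_variate(5, 2, rng) for _ in range(1000)]

print(first_run == second_run)  # True
print(sum(first_run) / len(first_run))  # sample mean, expected to be near 5
```

Reusing a seed in this way makes it possible to compare different policies against the same sequence of "random" events.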


The following example shows how sampling from probability distributions can provide insights about business decisions that would be difficult to analyze mathematically.

Example 5.38  A Sampling Experiment for Evaluating Capital Budgeting Projects

In finance, one way of evaluating capital budgeting projects is to compute a profitability index (PI), which is defined as the ratio of the present value of future cash flows (PV) to the initial investment (I):

$$PI = \frac{PV}{I} \tag{5.24}$$

Because the cash flow and initial investment that may be required for a particular project are often uncertain, the profitability index is also uncertain. If we can characterize PV and I by some probability distributions, then we would like to know the probability distribution for PI. For example, suppose that PV is estimated to be normally distributed with a mean of \$12 million and a standard deviation of \$2.5 million, and the initial investment is also estimated to be normal with a mean of \$3.0 million and standard deviation of \$0.8 million. Intuitively, we might believe that the profitability index is also normally distributed with a mean of \$12 million / \$3 million = 4; however, as we shall see, this is not the case. We can use a sampling experiment to identify the probability distribution of PI for these assumptions.

Figure 5.29 shows a simple model from the Excel file Profitability Index Experiment. For each experiment, the values of PV and I are sampled from their assumed normal distributions using the NORM.INV function. PI is calculated in column D, and the average value for 1,000 experiments is shown in cell E8. We clearly see that this is not equal to 4 as previously suspected. The histogram in Figure 5.30 also demonstrates that the distribution of PI is not normal but is skewed to the right. This experiment confirms that the ratio of two normally distributed random variables is not normally distributed. We encourage you to create this spreadsheet and replicate this experiment (note that your results will not be exactly the same as these because you are generating random values!).
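The sampling experiment can be replicated outside Excel. The following Python sketch is an approximation of the spreadsheet model, not a copy of it: it samples PV and I with `random.gauss` rather than NORM.INV(RAND( ), ...), and the seed and variable names are assumptions. As the text notes, results vary from run to run, so no specific output is shown.

```python
import random
import statistics

rng = random.Random(12345)  # arbitrary seed so the experiment can be repeated
n = 1000                    # number of experiments, as in the text

# PV ~ normal(12, 2.5) and I ~ normal(3.0, 0.8), both in millions of dollars.
pv = [rng.gauss(12.0, 2.5) for _ in range(n)]
investment = [rng.gauss(3.0, 0.8) for _ in range(n)]

# Profitability index for each experiment: PI = PV / I (formula 5.24).
pi_values = [p / i for p, i in zip(pv, investment)]

print("average PI:", statistics.mean(pi_values))
print("median PI: ", statistics.median(pi_values))
```

Comparing the average with the median gives a quick indication of the right skew discussed in the example; a histogram of `pi_values` shows it more fully.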

Probability Distribution Functions in Analytic Solver Platform

Analytic Solver Platform (see the section on Spreadsheet Add-ins in Chapter 2) provides custom Excel functions that generate random samples from specified probability distributions. Table 5.1 shows a list of these for distributions we have discussed. These functions return random values from the specified distributions in worksheet cells. These functions will be very useful in business analytics applications in later chapters, especially Chapter 12 on simulation and risk analysis.

Figure 5.29 Sampling Experiment for Profitability Index

Chapter 5  Probability Distributions and Data Modeling


Figure 5.30 Frequency Distribution and Histogram of Profitability Index

Table 5.1 Analytic Solver Platform Probability Distribution Functions

Distribution         Analytic Solver Platform Function
Bernoulli            PsiBernoulli(probability)
Binomial             PsiBinomial(trials, probability)
Poisson              PsiPoisson(mean)
Uniform              PsiUniform(lower, upper)
Normal               PsiNormal(mean, standard deviation)
Exponential          PsiExponential(mean)
Discrete Uniform     PsiDisUniform(values)
Geometric            PsiGeometric(probability)
Negative Binomial    PsiNegBinomial(successes, probability)
Hypergeometric       PsiHyperGeo(trials, success, population size)
Triangular           PsiTriangular(minimum, most likely, maximum)
Lognormal            PsiLognormal(mean, standard deviation)
Beta                 PsiBeta(alpha, beta)

Example 5.39  Using Analytic Solver Platform Distribution Functions

An energy company was considering offering a new product and needed to estimate the growth in PC ownership. Using the best data and information available, they determined that the minimum growth rate was 5.0%, the most likely value was 7.7%, and the maximum value was 10.0%. These parameters characterize a triangular distribution. Figure 5.31 (Excel file PC Ownership Growth Rates) shows a portion of 500 samples that were generated using the function PsiTriangular(5%, 7.7%, 10%). Notice that the histogram exhibits a clear triangular shape.
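The same sampling idea can be sketched in Python as a rough analogue of the PsiTriangular call. This is an illustration only: the function name sample_growth_rates and the seed are assumptions, and note that Python's random.triangular takes its arguments in the order (low, high, mode), which differs from the (minimum, most likely, maximum) order used by PsiTriangular.

```python
import random
import statistics

def sample_growth_rates(n=500, seed=7):
    """Draw n growth rates from a triangular distribution (illustrative helper)."""
    rng = random.Random(seed)
    # random.triangular expects (low, high, mode): the most likely value goes last
    return [rng.triangular(0.05, 0.10, 0.077) for _ in range(n)]

rates = sample_growth_rates()
print("min :", round(min(rates), 4))
print("max :", round(max(rates), 4))
# The theoretical mean of a triangular distribution is (min + mode + max) / 3
print("mean:", round(statistics.mean(rates), 4))
```

Every sampled value falls between 5% and 10%, and the sample mean should be close to (5% + 7.7% + 10%) / 3, or about 7.57%.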


Figure 5.31 Samples from a Triangular Distribution

Data Modeling and Distribution Fitting

In many applications of business analytics, we need to collect sample data of important variables such as customer demand, purchase behavior, machine failure times, and service activity times, to name just a few, to gain an understanding of the distributions of these variables. Using the tools we have studied, we may construct frequency distributions and histograms and compute basic descriptive statistical measures to better understand the nature of the data. However, sample data are just that: samples. Using sample data may limit our ability to predict uncertain events that may occur because potential values outside the range of the sample data are not included. A better approach is to identify the underlying probability distribution from which sample data come by “fitting” a theoretical distribution to the data and verifying the goodness of fit statistically.

To select an appropriate theoretical distribution that fits sample data, we might begin by examining a histogram of the data to look for the distinctive shapes of particular distributions. For example, normal data are symmetric, with a peak in the middle. Exponential data are very positively skewed, with no negative values. Lognormal data are also very positively skewed, but the density drops to zero at 0. Various forms of the gamma, Weibull, or beta distributions could be used for distributions that do not seem to fit one of the other common forms. This approach is not, of course, always accurate or valid, and sometimes it can be difficult to apply, especially if sample sizes are small. However, it may narrow the search down to a few potential distributions.

Summary statistics can also provide clues about the nature of a distribution. The mean, median, standard deviation, and coefficient of variation often provide information about the nature of the distribution. For instance, normally distributed data tend to have a fairly low coefficient of variation (however, this may not be true if the mean is small). For normally distributed data, we would also expect the median and mean to be approximately the same. For exponentially distributed data, however, the median will be less than the mean. Also, we would expect the mean to be about equal to the standard deviation, or, equivalently, the coefficient of variation would be close to 1. We could also look at the skewness index. Normal data are not skewed, whereas lognormal and exponential data are positively skewed. The following examples illustrate some of these ideas.
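The clues described above can be checked numerically. The following Python sketch is an illustration under stated assumptions: the data are simulated rather than taken from any of the book's Excel files, the helper name summarize is hypothetical, and the skewness value is a simple moment-based measure that may differ slightly from the one Excel reports.

```python
import random
import statistics

def summarize(data):
    """Return summary statistics that hint at the shape of a distribution."""
    mean = statistics.mean(data)
    median = statistics.median(data)
    stdev = statistics.stdev(data)
    n = len(data)
    # Simple moment-based skewness: average cubed standardized deviation
    skew = sum(((x - mean) / stdev) ** 3 for x in data) / n
    return {"mean": mean, "median": median, "stdev": stdev,
            "cv": stdev / mean, "skewness": skew}

rng = random.Random(2024)
normal_like = [rng.gauss(50, 5) for _ in range(2000)]        # simulated normal data
expo_like = [rng.expovariate(1 / 10) for _ in range(2000)]   # simulated exponential data, mean 10

for name, data in [("normal", normal_like), ("exponential", expo_like)]:
    stats = summarize(data)
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

For the simulated normal data, the mean and median should nearly coincide and the skewness should be near zero; for the simulated exponential data, the median should fall below the mean and the coefficient of variation should be close to 1.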


Example 5.40  Analyzing Airline Passenger Data

An airline operates a daily route between two medium-sized cities using a 70-seat regional jet. The flight is rarely booked to capacity but often accommodates business travelers who book at the last minute at a high price. Figure 5.32 shows the number of passengers for a sample of 25 flights (Excel file Airline Passengers). The histogram shows a relatively symmetric distribution. The mean, median, and mode are all similar, although there is some degree of positive skewness. From our discussion in Chapter 4 about the variability of samples, it is important to recognize that this is a relatively small sample that can exhibit a lot of variability compared with the population from which it is drawn. Thus, based on these characteristics, it would not be unreasonable to assume a normal distribution for the purpose of developing a predictive or prescriptive analytics model.

Example 5.41  Analyzing Airport Service Times

Figure 5.33 shows a portion of the data and statistical analysis of 812 samples of service times at an airport’s ticketing counter (Excel file Airport Service Times). It is not clear what the distribution might be. It does not appear to be exponential, but it might be lognormal or even another distribution with which you might not be familiar.

From the descriptive statistics, we can see that the mean is not close to the standard deviation, suggesting that the data are probably not exponential. The data are positively skewed, suggesting that a lognormal distribution might be appropriate. However, it is difficult to make a definitive conclusion.

The examination of histograms and summary statistics might provide some idea of the appropriate distribution; however, a better approach is to analytically fit the data to the best type of probability distribution.

Figure 5.32 Data and Statistics for Passenger Demand


Figure 5.33 Airport Service Times Statistics

Goodness of Fit

The basis for fitting data to a probability distribution is a statistical procedure called goodness of fit. Goodness of fit attempts to draw a conclusion about the nature of the distribution. For instance, in Example 5.40 we suggested that it might be reasonable to assume that the distribution of passenger demand is normal. Goodness of fit would provide objective, analytical support for this assumption. Understanding the details of this procedure requires concepts that we will learn in Chapter 7. However, software exists (which we illustrate shortly) that runs statistical procedures to determine how well a theoretical distribution fits a set of data and also finds the best-fitting distribution.

How well sample data fit a distribution is typically measured using one of three types of statistics, called chi-square, Kolmogorov-Smirnov, and Anderson-Darling statistics. Essentially, these statistics provide a measure of how well the histogram of the sample data compares with a specified theoretical probability distribution. The chi-square approach breaks down the theoretical distribution into areas of equal probability and compares the number of data points within each area to the number that would be expected for that distribution. The Kolmogorov-Smirnov procedure compares the cumulative distribution of the data with the theoretical distribution and bases its conclusion on the largest vertical distance between them. The Anderson-Darling method is similar but puts more weight on the differences between the tails of the distributions. This approach is useful when you need a better fit at the extreme tails of the distribution. If you use chi-square, you should have at least 50 data points; for small samples, the Kolmogorov-Smirnov test generally works better.
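To make the Kolmogorov-Smirnov idea concrete, the sketch below computes only the statistic itself, that is, the largest vertical distance between the empirical cumulative distribution of a sample and a candidate theoretical distribution. It is a simplified illustration, not the procedure Analytic Solver Platform runs: the helper name ks_statistic and the passenger counts are hypothetical, and no critical value or p-value is computed.

```python
from statistics import NormalDist

def ks_statistic(data, cdf):
    """Largest vertical distance between the empirical CDF of data and cdf."""
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The empirical CDF jumps from (i - 1)/n to i/n at x, so check both sides
        d = max(d, abs(i / n - f), abs(f - (i - 1) / n))
    return d

# Hypothetical passenger counts, used only to show the call
passengers = [48, 52, 55, 57, 58, 60, 61, 63, 66, 71]
fitted = NormalDist.from_samples(passengers)  # normal with the sample mean and standard deviation

print("K-S statistic vs. fitted normal:", round(ks_statistic(passengers, fitted.cdf), 3))
```

A smaller statistic indicates that the theoretical cumulative distribution tracks the data more closely; comparing the statistic across several candidate distributions is the essence of choosing a best fit by this criterion.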

Distribution Fitting with Analytic Solver Platform

Analytic Solver Platform has the capability of “fitting” a probability distribution to data using one of the three goodness-of-fit procedures. This is often done to analyze and define inputs to simulation models that we discuss in Chapter 12. However, you need not understand simulation at this time to use this capability. We illustrate this procedure using the airport service time data.


Example 5.42  Fitting a Distribution to Airport Service Times

Step 1: Highlight the range of the data in the Airport Service Times worksheet. Click on the Tools button in the Analytic Solver Platform ribbon and then click Fit. This displays the Fit Options dialog shown in Figure 5.34.

Step 2: In the Fit Options dialog, choose whether to fit the data to a continuous or discrete distribution. In this example, we select Continuous. You may also choose the statistical procedure used to evaluate the results: chi-square, Kolmogorov-Smirnov, or Anderson-Darling. We choose the default option, Kolmogorov-Smirnov. Click the Fit button.

Analytic Solver Platform displays a window with the results as shown in Figure 5.35. In this case, the best-fitting distribution is called an Erlang distribution. If you want to compare the results to a different distribution, simply check the box on the left side. You don’t have to know the mathematical details to use the distribution in a spreadsheet application because the formula for the Psi function corresponding to this distribution is shown in the panel on the right side of the output. When you exit the dialog, you have the option to accept the result; if so, it asks you to select a cell to place the Psi function for the distribution, in this case, the function:

=PsiErlang(1.46504838280818, 80.0576462180289, PsiShift(8.99))

We could use this function to generate samples from this distribution, similar to the way we used the NORM.INV function in Example 5.38.

Figure 5.34 Fit Options Dialog

Figure 5.35 Analytic Solver Platform Distribution Fitting Results


Analytics in Practice: The Value of Good Data Modeling in Advertising

Since the data observed on ad effectiveness was clearly skewed, other researchers examined ad effectiveness by studying standard industry data on ad recall without requiring the assumption of normally distributed effects. This analysis found that the best of a number of ads was more effective than any single ad. Further analysis revealed that the optimal number of ads to commission can vary significantly, depending on the shape of the distribution of effectiveness for a single ad. The researchers developed an alternative to Gross’s model. From their analyses, they found that as the number of draft ads was increased, the effectiveness of the best ad also increased. Both the optimal number of draft ads and the payoff from creating multiple independent drafts were higher when the correct distribution was used than the results reported in Gross’s original study.

3 Based on G. C. O’Connor, T. R. Willemain, and J. MacLachlan, “The Value of Competition Among Agencies in Developing Ad Campaigns: Revisiting Gross’s Model,” Journal of Advertising, 25, 1 (1996): 51–62.

Key Terms

Bernoulli distribution
Binomial distribution
Complement
Conditional probability
Continuous random variable
Cumulative distribution function
Discrete random variable
Discrete uniform distribution
Empirical probability distribution
Event
Expected value
Experiment
Exponential distribution
Goodness of fit
Independent events
Intersection
Joint probability
Joint probability table
Marginal probability
Multiplication law of probability
Mutually exclusive
Normal distribution
Outcome
Poisson distribution
Probability
Probability density function
Probability distribution
Probability mass function
Random number
Random number seed
Random variable
Random variate
Sample space
Standard normal distribution
Uniform distribution
Union

Problems and Exercises

1. a. A die is rolled. Find the probability that the number obtained is greater than 4.
   b. Two coins are tossed. Find the probability that only one head is obtained.
   c. Two dice are rolled. Find the probability that the sum is equal to 5.
   d. A card is drawn at random from a deck of cards. Find the probability of getting the King of Hearts.

2. Consider the experiment of drawing two cards without replacement from a deck consisting of only the ace through 10 of a single suit (e.g., only hearts).
   a. Describe the outcomes of this experiment. List the elements of the sample space.
   b. Define the event A_i to be the set of outcomes for which the sum of the values of the cards is i (with an ace = 1). List the outcomes associated with A_i for i = 3 to 19.
   c. What is the probability of obtaining a sum of the two cards equaling from 3 to 19?

3. Find the probability of getting each of the following total values when two dice are rolled: 1, 2, 5, 6, 7, 10, and 11.

4. The students of a class have elected five candidates to represent them on the college management council:

   S.No.    Gender    Age
   1        Male      18
   2        Male      19
   3        Female    22
   4        Female    20
   5        Male      23

   This group decides to elect a spokesperson by randomly drawing a name from a hat. Calculate the probability of the spokesperson being either female or over 21.

5. Refer to the card scenario described in Problem 2.
   a. Let A be the event “total card value is odd.” Find P(A) and P(A^c).
   b. What is the probability that the sum of the two cards will be more than 14?

6. The latest nationwide political poll in a particular country indicates that the probability for the candidate to be a republican is 0.55, a communist is 0.30, and a supporter of the patriots of that country is 0.15. Assuming that these probabilities are accurate, within a randomly chosen group of 10 citizens:
   a. What is the probability that four are communists?
   b. What is the probability that none are republican?

7. Roulette is played at a table similar to the one in Figure 5.36. A wheel with the numbers 1 through 36 (evenly distributed with the colors red and black) and two green numbers 0 and 00 rotates in a shallow bowl with a curved wall. A small ball is spun on the inside of the wall and drops into a pocket corresponding to one of the numbers. Players may make 11 different types of bets by placing chips on different areas of the table. These include bets on a single number, two adjacent numbers, a row of three numbers, a block of four numbers, two adjacent rows of six numbers, and the five-number combination of 0, 00, 1, 2, and 3; bets on the numbers 1–18 or 19–36; the first, second, or third group of 12 numbers; a column of 12 numbers; even or odd; and red or black. Payoffs differ by bet. For instance, a single-number bet pays 35 to 1 if it wins; a three-number bet pays 11 to 1; a column bet pays 2 to 1; and a color bet pays even money. Define the following events: C1 = column 1 number, C2 = column 2 number, C3 = column 3 number, O = odd number, E = even number, G = green number, F12 = first 12 numbers, S12 = second 12 numbers, and T12 = third 12 numbers.
   a. Find the probability of each of these events.
   b. Find P(G or O), P(O or F12), P(C1 or C3), P(E and F12), P(E or F12), P(S12 and T12), P(O or C2).

   Figure 5.36 Layout of a Typical Roulette Table

8. From a bag full of colored balls (red, blue, green, and orange), some are picked out and replaced. This is done a thousand times, and the number of times each colored ball is picked out is as follows: Blue: 300, Red: 200, Green: 450, and Orange: 50.
   a. What is the probability of picking a green ball?
   b. What is the probability of picking a blue ball?
   c. If there are 100 balls in the bag, how many of them are likely to be green?
   d. If there are 10,000 balls in the bag, how many of them are likely to be orange?

9. A box contains marbles of three different colors: 8 black, 6 white, and 4 red. Three marbles are selected at random without replacement. Find the probability that the selection contains each of the outcomes listed.
   a. Three black marbles
   b. A red, a black, and a white marble, in that order
   c. A red marble and two white marbles, in any order

10. A survey of 200 college graduates who have been working for at least 3 years found that 90 owned only mutual funds, 20 owned only stocks, and 70 owned both.
   a. What is the probability that an individual owns a stock? A mutual fund?
   b. What is the probability that an individual owns neither stocks nor mutual funds?
   c. What is the probability that an individual owns either a stock or a mutual fund?

11. Row 26 of the Excel file Census Education Data gives the number of employed persons having a specific educational level.
   a. Find the probability that an employed person has attained each of the educational levels listed in the data.
   b. Suppose that A is the event “has at least an Associate’s Degree” and B is the event “is at least a high school graduate.” Find the probabilities of these events. Are they mutually exclusive? Why or why not? Find the probability P(A or B).

12. A survey of shopping habits found the percentage of respondents that use technology for shopping as shown in Figure 5.37. For example, 17.39% only use online coupons; 21.74% use online coupons and check prices online before shopping, and so on.
   a. What is the probability that a shopper will check prices online before shopping?
   b. What is the probability that a shopper will use a smart phone to save money?
   c. What is the probability that a shopper will use online coupons?
   d. What is the probability that a shopper will not use any of these technologies?
   e. What is the probability that a shopper will check prices online and use online coupons but not use a smart phone?
   f. If a shopper checks prices online, what is the probability that he or she will use a smart phone?
   g. What is the probability that a shopper will check prices online but not use online coupons or a smart phone?

   Figure 5.37 [Venn diagram with three overlapping sets: Check Prices Online Before Shopping, Use Online Coupons, and Use a Smart Phone to Save Money. Region percentages shown in the figure: 4.35%, 21.74%, 17.39%, 17.39%, 4.35%, 4.35%, 4.35%.]

13. A Canadian business school summarized the gender and residency of its incoming class as follows:

              Residency
   Gender     Canada    United States    Europe    Asia    Other
   Male       123       24               17        52      8
   Female     86        8                10        73      4

   a. Construct the joint probability table.
   b. Calculate the marginal probabilities.
   c. What is the probability that a female student is from outside Canada or the United States?

14. In an example in Chapter 3, we developed the following cross-tabulation of sales transaction data:

   Region    Book    DVD    Total
   East      56      42     98
   North     43      42     85
   South     62      37     99
   West      100     90     190
   Total     261     211    472

   a. Find the marginal probabilities that a sale originated in each of the four regions and the marginal probability of each type of sale (book or DVD).
   b. Find the conditional probabilities of selling a book given that the customer resides in each region.

15. Use the Civilian Labor Force data in the Excel file Census Education Data to find the following:
   a. P(unemployed and advanced degree)
   b. P(unemployed | advanced degree)
   c. P(not a high school grad | unemployed)
   d. Are the events “unemployed” and “at least a high school graduate” independent?

16. Using the data in the Excel file Consumer Transportation Survey, develop a contingency table for Gender and Vehicle Driven; then convert this table into probabilities.
   a. What is the probability that a respondent is female?
   b. What is the probability that a respondent drives an SUV?
   c. What is the probability that a respondent is male and drives a minivan?
   d. What is the probability that a female respondent drives either a truck or an SUV?
   e. If it is known that an individual drives a car, what is the probability that the individual is female?
   f. If it is known that an individual is male, what is the probability that he drives an SUV?
   g. Determine whether the random variables “gender” and the event “vehicle driven” are statistically independent. What would this mean for advertisers?


17. A home pregnancy test is not always accurate. Suppose the probability is 0.015 that the test indicates that a woman is pregnant when she actually is not, and the probability is 0.025 that the test indicates that a woman is not pregnant when she really is. Assume that the probability that a woman who takes the test is actually pregnant is 0.7. What is the probability that a woman is pregnant if the test yields a not-pregnant result?

18. A political candidate running for local office is considering the votes she can get in an upcoming election. Assume that the votes can take on only four possible values. If the candidate’s assessment is as given in the Excel sheet Votes, construct the probability distribution graph.

   Number of Votes    Probability This Will Happen
   1000               0.2
   2000               0.4
   3000               0.3
   4000               0.1

19. In the roulette example described in Problem 7, what is the probability that the outcome will be green twice in a row? What is the probability that the outcome will be black twice in a row?

20. A consumer products company found that 48% of successful products also received favorable results from test market research, whereas 12% had unfavorable results but nevertheless were successful. They also found that 28% of unsuccessful products had unfavorable research results, whereas 12% of them had favorable research results. That is, P(successful product and favorable test market) = 0.48, P(successful product and unfavorable test market) = 0.12, P(unsuccessful product and favorable test market) = 0.12, and P(unsuccessful product and unfavorable test market) = 0.28. Find the probabilities of successful and unsuccessful products given known test market results.

21. A particular training program has been designed to upgrade the administrative skills of managers. The program is self-administered, and managers require different numbers of hours to complete it. Input from previous participants indicates that the mean length of time spent on the program is 500 hours and that this normally distributed random variable has a standard deviation of 100 hours. Calculate the probability that a randomly selected participant will require more than 500 hours.

22. The weekly demand of a slow-moving product has the following probability mass function:

   Demand, x    Probability, f(x)
   0            0.2
   1            0.4
   2            0.3
   3            0.1
   4 or more    0

   Find the expected value, variance, and standard deviation of weekly demand.

23. The Excel sheet Baseball contains information about a team that is using an automatic pitching machine. If the machine is correctly set up and properly adjusted, it will throw strikes 85 percent of the time. If it is incorrectly set up, it will throw strikes only 35 percent of the time. Past data indicate that 75 percent of the setups of the machine are done correctly. After the machine has been set up, at batting practice one day, it throws three strikes on the first three pitches. What is the revised probability that the setup has been done correctly?

   Event        P(event)    P(1 strike | event)
   Correct      0.75        0.85
   Incorrect    x           0.35

24. Based on the data in the Excel file Consumer Transportation Survey, develop a probability mass function and cumulative distribution function (both tabular and as charts) for the random variable Number of Children. What is the probability that an individual in this survey has fewer than three children? At least one child? Five or more children?

25. A major application of analytics in marketing is determining the attrition of customers. Suppose that the probability of a long-distance carrier’s customer leaving for another carrier from one month to the next is 0.12. What distribution models the retention of an individual customer? What are the expected value and standard deviation?

26. The Excel file Call Center Data shows that in a sample of 70 individuals, 27 had prior call center experience. If we assume that any potential hire will also have experience with a probability of 27/70, what is the probability that among 10 potential hires, more than half of them will have experience? Define the parameter(s) for this distribution based on the data.


27. If a cell phone company conducted a telemarketing campaign to generate new clients and the probability of successfully gaining a new customer was 0.07, what is the probability that contacting 50 potential customers would result in at least 5 new customers?

28. During 1 year, a particular mutual fund has outperformed the S&P 500 index 33 out of 52 weeks. Find the probability that this performance or better would happen again.

29. A popular resort hotel has 300 rooms and is usually fully booked. About 6% of the time a reservation is canceled before the 6:00 p.m. deadline with no penalty. What is the probability that at least 280 rooms will be occupied? Use the binomial distribution to find the exact value.

30. A telephone call center where people place marketing calls to customers has a probability of success of 0.08. The manager is very harsh on those who do not get a sufficient number of successful calls. Find the number of calls needed to ensure that there is a probability of 0.90 of obtaining 5 or more successful calls.

31. Ravi sells an average of three life insurance policies per week. Use the Poisson distribution to calculate the probability that in a given week he will sell
   a. some policies.
   b. two or more policies but fewer than 5 policies.
   c. one policy, assuming that there are 5 working days per week.

32. The number and frequency of Atlantic hurricanes annually from 1940 through 2012 is shown here.

   Number    Frequency
   0         5
   1         16
   2         19
   3         14
   4         3
   5         5
   6         4
   7         3
   8         2
   10        1
   12        1

   a. Find the probabilities of 0–12 hurricanes each season using these data.
   b. Assuming a Poisson distribution and using the mean number of hurricanes per season from the empirical data, compute the probabilities of experiencing 0–12 hurricanes in a season. Compare these to your answer to part (a). How well does a Poisson distribution model this phenomenon? Construct a chart to visualize these results.

33. Verify that the function corresponding to the following figure is a valid probability density function. Then find the following probabilities:
   a. P(x < 8)
   b. P(x > 7)
   c. P(6 < x < 10)
   d. P(8 < x < 11)

34. The time required to play a game of Battleship™ is uniformly distributed between 15 and 60 minutes.
   a. Find the expected value and variance of the time to complete the game.
   b. What is the probability of finishing within 30 minutes?
   c. What is the probability that the game would take longer than 40 minutes?

35. A contractor has estimated that the minimum number of days to remodel a bathroom for a client is 10 days. He also estimates that 80% of similar jobs are completed within 18 days. If the remodeling time is uniformly distributed, what should be the parameters of the uniform distribution?

36. In determining automobile-mileage ratings, it was found that the mpg (X) for a certain model is normally distributed, with a mean of 33 mpg and a standard deviation of 1.7 mpg. Find the following:
   a. P(X < 30)
   b. P(28 < X < 32)
   c. P(X > 35)
   d. P(X > 31)
   e. The mileage rating that the upper 5% of cars achieve.

37. The distribution of the SAT scores in math for an incoming class of business students has a mean of 590 and standard deviation of 22. Assume that the scores are normally distributed.
   a. Find the probability that an individual’s SAT score is less than 550.
   b. Find the probability that an individual’s SAT score is between 550 and 600.
   c. Find the probability that an individual’s SAT score is greater than 620.
   d. What percentage of students will have scored better than 700?
   e. Find the standardized values for students scoring 550, 600, 650, and 700 on the test.

38. A popular soft drink is sold in 2-liter (2,000-milliliter) bottles. Because of variation in the filling process, bottles have a mean of 2,000 milliliters and a standard deviation of 20, normally distributed.
   a. If the process fills the bottle by more than 50 milliliters, the overflow will cause a machine malfunction. What is the probability of this occurring?
   b. What is the probability of underfilling the bottles by at least 30 milliliters?

39. A supplier contract calls for a key dimension of a part to be between 1.96 and 2.04 centimeters. The supplier has determined that the standard deviation of its process, which is normally distributed, is 0.04 centimeter.
   a. If the actual mean of the process is 1.98, what fraction of parts will meet specifications?
   b. If the mean is adjusted to 2.00, what fraction of parts will meet specifications?
   c. How small must the standard deviation be to ensure that no more than 2% of parts are nonconforming, assuming the mean is 2.00?

40. Dev scored 940 on a national mathematics test. The mean test score was 850 with a standard deviation of 100. What proportion of students had a higher score than Dev? (Assume that the test scores are normally distributed.)

41. A lightbulb is warranted to last for 5,000 hours. If the time to failure is exponentially distributed with a true mean of 4,750 hours, what is the probability that it will last at least 5,000 hours?

42. The actual delivery time from Giodanni’s Pizza is exponentially distributed with a mean of 20 minutes.
   a. What is the probability that the delivery time will exceed 30 minutes?
   b. What proportion of deliveries will be completed within 20 minutes?

43. Develop a procedure to sample from the probability distribution of soft-drink choices in Problem 1. Implement your procedure on a spreadsheet and use the VLOOKUP function to sample 10 outcomes from the distribution.

44. Develop a procedure to sample from the probability distribution of two-card hands in Problem 2. Implement your procedure on a spreadsheet and use the VLOOKUP function to sample 20 outcomes from the distribution.

45. Use formula (5.23) to obtain a sample of 25 outcomes for a game of Battleship™ as described in Problem 34. Find the average and standard deviation for these 25 outcomes.

46. Use the Excel Random Number Generation tool to generate 100 samples of the number of customers that the financial consultant in Problem 31 will have on a daily basis. What percentage will meet his target of at least 5?

47. A formula in financial analysis is: Return on equity = net profit margin × total asset turnover × equity multiplier. Suppose that the equity multiplier is fixed at 4.0, but that the net profit margin is normally distributed with a mean of 3.8% and a standard deviation of 0.4%, and that the total asset turnover is normally distributed with a mean of 1.5 and a standard deviation of 0.2. Set up and conduct a sampling experiment to find the distribution of the return on equity. Show your results as a histogram to help explain your analysis and conclusions. Use the empirical rules to predict the return on equity.

48. A government agency is putting a large project out for low bid. Bids are expected from 10 different contractors and will have a normal distribution with a mean of \$3.5 million and a standard deviation of \$0.25 million. Devise and implement a sampling experiment for estimating the distribution of the minimum bid and the expected value of the minimum bid.

49. Use Analytic Solver Platform to fit the hurricane data in Problem 32 to a discrete distribution. Does the Poisson distribution give the best fit?

50. Use Analytic Solver Platform to fit a distribution to the data in the Excel file Computer Repair Times. Try the three different statistical measures for evaluating goodness of fit and see if they result in different best-fitting distributions.

51. The Excel file Investment Returns provides sample data for the annual return of the S&P 500, and monthly returns of a stock portfolio and bond portfolio. Construct histograms for each data set and use Analytic Solver Platform to find the best-fitting distribution.

Case: Performance Lawn Equipment

PLE collects a variety of data from special studies, many of which are related to the quality of its products. The company collects data about functional test performance of its mowers after assembly; results from the past 30 days are given in the worksheet Mower Test. In addition, many in-process measurements are taken to ensure that manufacturing processes remain in control and can produce according to design specifications. The worksheet Blade Weight shows 350 measurements of blade weights taken from the manufacturing process that produces mower blades during the most recent shift. Elizabeth Burke has asked you to study these data from an analytics perspective. Drawing upon your experience, you have developed a number of questions.

1. For the mower test data, what distribution might be appropriate to model the failure of an individual mower?
2. What fraction of mowers fails the functional performance test using all the mower test data?
3. What is the probability of having x failures in the next 100 mowers tested, for x from 0 to 20?
4. What is the average blade weight and how much variability is occurring in the measurements of blade weights?
5. Assuming that the data are normal, what is the probability that blade weights from this process will exceed 5.20?
6. What is the probability that weights will be less than 4.80?
7. What is the actual percent of weights that exceed 5.20 or are less than 4.80 from the data in the worksheet?
8. Is the process that makes the blades stable over time? That is, are there any apparent changes in the pattern of the blade weights?
9. Could any of the blade weights be considered outliers, which might indicate a problem with the manufacturing process or materials?
10. Was the assumption that blade weights are normally distributed justified? What is the best-fitting probability distribution for the data?

Summarize all your findings to these questions in a well-written report.

Chapter 6  Sampling and Estimation

KALABUKHAVA IRYNA/Shutterstock.com

Learning Objectives

After studying this chapter, you will be able to:

• Describe the elements of a sampling plan.
• Explain the difference between subjective and probabilistic sampling.
• State two types of subjective sampling.
• Explain how to conduct simple random sampling and use Excel to find a simple random sample from an Excel database.
• Explain systematic, stratified, and cluster sampling, and sampling from a continuous process.
• Explain the importance of unbiased estimators.
• Describe the difference between sampling error and nonsampling error.
• Explain how the average, standard deviation, and distribution of means of samples change as the sample size increases.
• Define the sampling distribution of the mean.
• Calculate the standard error of the mean.
• Explain the practical importance of the central limit theorem.
• Use the standard error in probability calculations.
• Explain how an interval estimate differs from a point estimate.
• Define and give examples of confidence intervals.
• Calculate confidence intervals for population means and proportions using the formulas in the chapter and the appropriate Excel functions.
• Explain how confidence intervals change as the level of confidence increases or decreases.
• Describe the difference between the t-distribution and the normal distribution.
• Use confidence intervals to draw conclusions about population parameters.
• Compute a prediction interval and explain how it differs from a confidence interval.
• Compute sample sizes needed to ensure a confidence interval for means and proportions with a specified margin of error.


We discussed the difference between population and samples in Chapter 4. Sampling is the foundation of statistical analysis. We use sample data in business analytics applications for many purposes. For example, we might wish to estimate the mean, variance, or proportion of a very large or unknown population; provide values for inputs in decision models; understand customer satisfaction; reach a conclusion as to which of several sales strategies is more effective; or understand if a change in a process resulted in an improvement. In this chapter, we discuss sampling methods, how they are used to estimate population parameters, and how we can assess the error inherent in sampling.

Statistical Sampling

The first step in sampling is to design an effective sampling plan that will yield representative samples of the populations under study. A sampling plan is a description of the approach that is used to obtain samples from a population prior to any data collection activity. A sampling plan states

• the objectives of the sampling activity,
• the target population,
• the population frame (the list from which the sample is selected),
• the method of sampling,
• the operational procedures for collecting the data, and
• the statistical tools that will be used to analyze the data.

Example 6.1  A Sampling Plan for a Market Research Study

Suppose that a company wants to understand how golfers might respond to a membership program that provides discounts at golf courses in the golfers’ locality as well as across the country. The objective of a sampling study might be to estimate the proportion of golfers who would likely subscribe to this program. The target population might be all golfers over 25 years old. However, identifying all golfers in America might be impossible. A practical population frame might be a list of golfers who have purchased equipment from national golf or sporting goods companies through which the discount card will be sold. The operational procedures for collecting the data might be an e-mail link to a survey site or direct-mail questionnaire. The data might be stored in an Excel database; statistical tools such as PivotTables and simple descriptive statistics would be used to segment the respondents into different demographic groups and estimate their likelihood of responding positively.

Sampling Methods

Many types of sampling methods exist. Sampling methods can be subjective or probabilistic. Subjective methods include judgment sampling, in which expert judgment is used to select the sample (survey the “best” customers), and convenience sampling, in which samples are selected based on the ease with which the data can be collected (survey all customers who happen to visit this month). Probabilistic sampling involves selecting the items in the sample using some random procedure. Probabilistic sampling is necessary to draw valid statistical conclusions.

The most common probabilistic sampling approach is simple random sampling. Simple random sampling involves selecting items from a population so that every subset of a given size has an equal chance of being selected. If the population data are stored in a database, simple random samples can generally be easily obtained.

Figure 6.1  Excel Sampling Tool Dialog

Example 6.2  Simple Random Sampling with Excel

Suppose that we wish to sample from the Excel database Sales Transactions. Excel provides a tool to generate a random set of values from a given population size. Click on Data Analysis in the Analysis group of the Data tab and select Sampling. This brings up the dialog shown in Figure 6.1. In the Input Range box, we specify the data range from which the sample will be taken. This tool requires that the data sampled be numeric, so in this example we sample from the first column of the data set, which corresponds to the customer ID number. There are two options for sampling:

1. Sampling can be periodic, and we will be prompted for the Period, which is the interval between sample observations from the beginning of the data set. For instance, if a period of 5 is used, observations 5, 10, 15, and so on, will be selected as samples.

2. Sampling can also be random, and we will be prompted for the Number of Samples. Excel will then randomly select this number of samples from the specified data set. However, this tool generates random samples with replacement, so we must be careful to check for duplicate observations in the sample created.

Figure 6.2 shows 20 samples generated by the tool. We sorted them in ascending order to make it easier to identify duplicates. As you can see, two of the customers were duplicated by the tool.
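The two options in Example 6.2 can also be sketched outside Excel. The following Python fragment is only an illustration under stated assumptions: the function names and the customer ID range are hypothetical, and the code mimics the behavior described above (periodic selection, and random selection with replacement) rather than reproducing the Excel tool itself. It also shows sampling without replacement, which avoids duplicates.

```python
import random


def sample_with_replacement(population, k, rng):
    """Random draws that may repeat, as described for the Excel Sampling tool."""
    return [rng.choice(population) for _ in range(k)]


def sample_without_replacement(population, k, rng):
    """Simple random sample: every subset of size k is equally likely."""
    return rng.sample(population, k)


def periodic_sample(population, period):
    """Select observations period, 2*period, 3*period, ... (1-based positions)."""
    return population[period - 1::period]


def find_duplicates(sample):
    """Return the sorted values that occur more than once in the sample."""
    seen, duplicates = set(), set()
    for item in sample:
        if item in seen:
            duplicates.add(item)
        seen.add(item)
    return sorted(duplicates)


if __name__ == "__main__":
    rng = random.Random(42)                    # fixed seed so the run is repeatable
    customer_ids = list(range(10001, 10473))   # hypothetical customer ID numbers
    drawn = sorted(sample_with_replacement(customer_ids, 20, rng))
    print("with replacement:", drawn)
    print("duplicates to check:", find_duplicates(drawn))
    print("without replacement:", sorted(sample_without_replacement(customer_ids, 20, rng)))
    print("periodic, period 5:", periodic_sample(customer_ids, 5)[:4])
```

Sorting the drawn values, as in the example, makes any duplicates easy to spot; sampling without replacement removes the need for that check.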

Figure 6.2  Samples Generated Using the Excel Sampling Tool

Other methods of sampling include the following:

• Systematic (Periodic) Sampling. Systematic, or periodic, sampling is a sampling plan (one of the options in the Excel Sampling tool) that selects every nth item from the population. For example, to sample 250 names from a list of 400,000, the first name could be selected at random from the first 1,600, and then every 1,600th name could be selected. This approach can be used for telephone sampling when supported by an automatic dialer that is programmed to dial numbers in a systematic manner. However, systematic sampling is not the same as simple random sampling because for any sample, every possible sample of a given size in the population does not have an equal chance of being selected. In some situations, this approach can induce significant bias if the population has some underlying pattern. For instance, sampling orders received every 7 days may not yield a representative sample if customers tend to send orders on certain days every week.

• Stratified Sampling. Stratified sampling applies to populations that are divided into natural subsets (called strata) and allocates the appropriate proportion of samples to each stratum. For example, a large city may be divided into political districts called wards. Each ward has a different number of citizens. A stratified sample would choose a sample of individuals in each ward proportionate to its size. This approach ensures that each stratum is weighted by its size relative to the population and can provide better results than simple random sampling if the items in each stratum are not homogeneous. However, issues of cost or significance of certain strata might make a disproportionate sample more useful. For example, the ethnic or racial mix of each ward might be significantly different, making it difficult for a stratified sample to obtain the desired information.

• Cluster Sampling. Cluster sampling is based on dividing a population into subgroups (clusters), sampling a set of clusters, and (usually) conducting a complete census within the clusters sampled. For instance, a company might segment its customers into small geographical regions. A cluster sample would consist of a random sample of the geographical regions, and all customers within these regions would be surveyed (which might be easier because regional lists might be easier to produce and mail).

• Sampling from a Continuous Process. Selecting a sample from a continuous manufacturing process can be accomplished in two main ways. First, select a time at random; then select the next n items produced after that time. Second, select n times at random; then select the next item produced after each of these times. The first approach generally ensures that the observations will come from a homogeneous population; however, the second approach might include items from different populations if the characteristics of the process should change over time, so caution should be used.
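The first three methods above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the function names, the ward sizes, and the region lists are all hypothetical, and the stratified version assumes proportional allocation with simple rounding (so the total can be off by one or two when the proportions do not divide evenly).

```python
import random


def systematic_sample(items, k, rng):
    """Random start within the first interval, then every step-th item."""
    step = len(items) // k
    start = rng.randrange(step)
    return items[start::step][:k]


def stratified_sample(strata, total, rng):
    """Proportional allocation: each stratum is sampled in proportion to its size."""
    population_size = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        k = round(total * len(members) / population_size)  # rounding is an assumption
        sample[name] = rng.sample(members, k)
    return sample


def cluster_sample(clusters, num_clusters, rng):
    """Choose whole clusters at random and take a census within each one."""
    chosen = rng.sample(sorted(clusters), num_clusters)
    return {name: list(clusters[name]) for name in chosen}


if __name__ == "__main__":
    rng = random.Random(7)

    # 250 names from a list of 400,000: a step of 1,600, as in the text.
    names = list(range(400_000))
    picked = systematic_sample(names, 250, rng)
    print(len(picked), picked[:3])

    # Hypothetical wards of different sizes.
    wards = {"Ward A": list(range(500)), "Ward B": list(range(300)), "Ward C": list(range(200))}
    print({ward: len(people) for ward, people in stratified_sample(wards, 50, rng).items()})

    # Hypothetical customer lists by region.
    regions = {"north": [1, 2, 3], "south": [4, 5], "east": [6], "west": [7, 8]}
    print(cluster_sample(regions, 2, rng))
```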


Analytics in Practice: Using Sampling Techniques to Improve Distribution¹

Stephen Finn/Shutterstock.com

U.S. breweries rely on a three-tier distribution system to deliver product to retail outlets, such as supermarkets and convenience stores, and on-premise accounts, such as bars and restaurants. The three tiers are the manufacturer, wholesaler (distributor), and retailer. A distribution network must be as efficient and cost effective as possible to deliver to the market a fresh product that is damage free and is delivered at the right place at the right time.

To understand distributor performance related to overall effectiveness, MillerCoors brewery defined seven attributes of proper distribution and collected data from 500 of its distributors. A field quality specialist (FQS) audits distributors within an assigned region of the country and collects data on these attributes. The FQS uses a handheld device to scan the universal product code on each package to identify the product type and amount. When audits are complete, data are summarized and uploaded from the handheld device into a master database.

This distributor auditing uses stratified random sampling with proportional allocation of samples based on the distributor’s market share. In addition to providing a more representative sample and better logistical control of sampling, stratified random sampling enhances statistical precision when data are aggregated by market area served by the distributor. This enhanced precision is a consequence of smaller and typically homogeneous market regions, which are able to provide realistic estimates of variability, especially when compared to another market region that is markedly different.

Randomization of retail accounts is achieved through a specially designed program based on the GPS location of the distributor and serviced retail accounts. The sampling strategy ultimately addresses a specific distributor’s performance related to out-of-code product, damaged product, and out-of-rotation product at the retail level. All in all, more than 6,000 of the brewery’s national retail accounts are audited during a sampling year. Data collected by the FQSs during the year are used to develop a performance ranking of distributors and identify opportunities for improvement.

Estimating Population Parameters

Sample data provide the basis for many useful analyses to support decision making. Estimation involves assessing the value of an unknown population parameter (such as a population mean, population proportion, or population variance) using sample data. Estimators are the measures used to estimate population parameters; for example, we use the sample mean $\bar{x}$ to estimate a population mean $\mu$. The sample variance $s^2$ estimates a population variance $\sigma^2$, and the sample proportion $\hat{p}$ estimates a population proportion $\pi$. A point estimate is a single number derived from sample data that is used to estimate the value of a population parameter.

¹ Based on Tony Gojanovic and Ernie Jimenez, “Brewed Awakening: Beer Maker Uses Statistical Methods to Improve How Its Products Are Distributed,” Quality Progress (April 2010).


Unbiased Estimators

It seems quite intuitive that the sample mean should provide a good point estimate for the population mean. However, it may not be clear why the formula for the sample variance that we introduced in Chapter 4 has a denominator of $n - 1$, particularly because it is different from the formula for the population variance (see formulas (4.4) and (4.5) in Chapter 4). In these formulas, the population variance is computed by

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$

whereas the sample variance is computed by the formula

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

Why is this so? Statisticians develop many types of estimators, and from a theoretical as well as a practical perspective, it is important that they “truly estimate” the population parameters they are supposed to estimate. Suppose that we perform an experiment in which we repeatedly sample from a population and compute a point estimate for a population parameter. Each individual point estimate will vary from the population parameter; however, we would hope that the long-term average (expected value) of all possible point estimates would equal the population parameter. If the expected value of an estimator equals the population parameter it is intended to estimate, the estimator is said to be unbiased. If this is not true, the estimator is called biased and will not provide correct results.

Fortunately, all the estimators we have introduced are unbiased and, therefore, are meaningful for making decisions involving the population parameter. In particular, statisticians have shown that the denominator $n - 1$ used in computing $s^2$ is necessary to provide an unbiased estimator of $\sigma^2$. If we simply divided by the number of observations, the estimator would tend to underestimate the true variance.
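The effect of the denominator can be seen in a small simulation. The sketch below is an illustration only: it assumes a population that is uniformly distributed between 0 and 10 (true variance $100/12 \approx 8.33$), and the function name, sample size, and number of repetitions are arbitrary choices. It averages both versions of the variance estimate over many repeated samples.

```python
import random
import statistics


def average_variance_estimates(n, reps, rng, low=0.0, high=10.0):
    """Average the divide-by-n and divide-by-(n - 1) variance estimates over many samples."""
    biased_total = 0.0
    unbiased_total = 0.0
    for _ in range(reps):
        sample = [rng.uniform(low, high) for _ in range(n)]
        xbar = statistics.fmean(sample)
        sum_sq = sum((x - xbar) ** 2 for x in sample)
        biased_total += sum_sq / n          # dividing by the number of observations
        unbiased_total += sum_sq / (n - 1)  # the sample variance formula
    return biased_total / reps, unbiased_total / reps


if __name__ == "__main__":
    rng = random.Random(0)
    biased, unbiased = average_variance_estimates(n=5, reps=20_000, rng=rng)
    print(f"true variance          : {100 / 12:.3f}")
    print(f"average with n         : {biased:.3f}")
    print(f"average with n - 1     : {unbiased:.3f}")
```

With small samples the divide-by-n average settles noticeably below the true variance (near $(n-1)/n$ times it), while the $n - 1$ version settles near the true value.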

Errors in Point Estimation

One of the drawbacks of using point estimates is that they do not provide any indication of the magnitude of the potential error in the estimate. A major metropolitan newspaper reported that, based on a Bureau of Labor Statistics survey, college professors were the highest-paid workers in the region, with an average salary of \$150,004. Actual averages for two local universities were less than \$70,000. What happened? As reported in a follow-up story, the sample size was very small and included a large number of highly paid medical school faculty; as a result, there was a significant error in the point estimate that was used.

When we sample, the estimators we use (such as a sample mean, sample proportion, or sample variance) are actually random variables that are characterized by some distribution. By knowing what this distribution is, we can use probability theory to quantify the uncertainty associated with the estimator. To understand this, we first need to discuss sampling error and sampling distributions.


Sampling Error

In Chapter 4, we observed that different samples from the same population have different characteristics, for example, variations in the mean, standard deviation, frequency distribution, and so on. Sampling (statistical) error occurs because samples are only a subset of the total population. Sampling error is inherent in any sampling process, and although it can be minimized, it cannot be totally avoided. Another type of error, called nonsampling error, occurs when the sample does not represent the target population adequately. This is generally a result of poor sample design, such as using a convenience sample when a simple random sample would have been more appropriate or choosing the wrong population frame. It may also result from inadequate data reliability, which we discussed in Chapter 1. To draw good conclusions from samples, analysts need to eliminate nonsampling error and understand the nature of sampling error.

Sampling error depends on the size of the sample relative to the population. Thus, determining the number of samples to take is essentially a statistical issue that is based on the accuracy of the estimates needed to draw a useful conclusion. We discuss this later in this chapter. However, from a practical standpoint, one must also consider the cost of sampling and sometimes make a trade-off between cost and the information that is obtained.

Understanding Sampling Error

Suppose that we estimate the mean of a population using the sample mean. How can we determine how accurate we are? In other words, can we make an informed statement about how far the sample mean might be from the true population mean? We could gain some insight into this question by performing a sampling experiment.

Example 6.3  A Sampling Experiment

Let us choose a population that is uniformly distributed between a = 0 and b = 10. Formulas (5.17) and (5.18) state that the expected value is $(0 + 10)/2 = 5$, and the variance is $(10 - 0)^2/12 = 8.333$. We use the Excel Random Number Generation tool described in Chapter 5 to generate 25 samples, each of size 10 from this population. Figure 6.3 shows a portion of a spreadsheet for this experiment, along with a histogram of the data (on the left side) that shows that the 250 observations are approximately uniformly distributed. (This is available in the Excel file Sampling Experiment.) In row 12 we compute the mean of each sample. These statistics vary a lot from the population values because of sampling error. The histogram on the right shows the distribution of the 25 sample means, which vary from less than 4 to more than 6. Now let’s compute the average and standard deviation of the sample means in row 12 (cells AB12 and AB13). Note that the average of all the sample means is quite close to the true population mean of 5.0.

Now let us repeat this experiment for larger sample sizes. Table 6.1 shows some results. Notice that as the sample size gets larger, the averages of the 25 sample means are all still close to the expected value of 5; however, the standard deviation of the 25 sample means becomes smaller for increasing sample sizes, meaning that the means of samples are clustered closer together around the true expected value. Figure 6.4 shows comparative histograms of the sample means for each of these cases. These illustrate the conclusions we just made and, perhaps even more surprisingly, show that the distribution of the sample means appears to assume the shape of a normal distribution for larger sample sizes. In our experiment, we used only 25 sample means. If we had used a much larger number, the distributions would have been more well defined.
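The same kind of experiment can be run outside a spreadsheet. The sketch below is a rough Python counterpart, not the Excel procedure itself: the function name and seed are hypothetical, and because the random numbers differ, the printed values will not match those in the Excel file.

```python
import random
import statistics


def sampling_experiment(sample_size, num_samples=25, low=0.0, high=10.0, rng=None):
    """Draw num_samples samples of the given size from a uniform population.

    Returns the average and the standard deviation of the sample means.
    """
    if rng is None:
        rng = random.Random()
    means = []
    for _ in range(num_samples):
        sample = [rng.uniform(low, high) for _ in range(sample_size)]
        means.append(statistics.fmean(sample))
    return statistics.fmean(means), statistics.stdev(means)


if __name__ == "__main__":
    rng = random.Random(2024)  # arbitrary seed for repeatability
    for n in (10, 25, 100, 500):
        average, std_dev = sampling_experiment(n, rng=rng)
        print(f"n = {n:>3}: average of means = {average:.4f}, "
              f"std dev of means = {std_dev:.4f}")
```

Raising `num_samples` well above 25 gives a steadier picture of how the spread of the sample means shrinks as the sample size grows.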

Figure 6.3  Portion of Spreadsheet for Sampling Experiment

Table 6.1  Results from Sampling Experiment

| Sample Size | Average of 25 Sample Means | Standard Deviation of 25 Sample Means |
|---|---|---|
| 10 | 5.0108 | 0.816673 |
| 25 | 5.0779 | 0.451351 |
| 100 | 4.9173 | 0.301941 |
| 500 | 4.9754 | 0.078993 |

Figure 6.4  Histograms of Sample Means for Increasing Sample Sizes


If we apply the empirical rules to these results, we can estimate the sampling error associated with one of the sample sizes we have chosen.

Example 6.4  Estimating Sampling Error Using the Empirical Rules

Using the results in Table 6.1 and the empirical rule for three standard deviations around the mean, we could state, for example, that using a sample size of 10, the distribution of sample means should fall approximately from 5.0 − 3(0.816673) = 2.55 to 5.0 + 3(0.816673) = 7.45. Thus, there is considerable error in estimating the mean using a sample of only 10. For a sample of size 25, we would expect the sample means to fall between 5.0 − 3(0.451351) = 3.65 and 5.0 + 3(0.451351) = 6.35. Note that as the sample size increased, the error decreased. For sample sizes of 100 and 500, the intervals are [4.09, 5.91] and [4.76, 5.24].
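The arithmetic in Example 6.4 is easy to reproduce. The short Python sketch below applies the three-standard-deviation rule to each row of Table 6.1; the dictionary and function names are hypothetical labels chosen for illustration.

```python
# Standard deviations of the 25 sample means, copied from Table 6.1.
TABLE_6_1 = {10: 0.816673, 25: 0.451351, 100: 0.301941, 500: 0.078993}


def three_sigma_interval(center, std_dev):
    """Empirical rule: nearly all values fall within 3 standard deviations."""
    return (center - 3 * std_dev, center + 3 * std_dev)


intervals = {n: three_sigma_interval(5.0, sd) for n, sd in TABLE_6_1.items()}

for n, (lower, upper) in intervals.items():
    print(f"n = {n:>3}: [{lower:.2f}, {upper:.2f}]")
```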

Sampling Distributions

We can quantify the sampling error in estimating the mean for any unknown population. To do this, we need to characterize the sampling distribution of the mean.

Sampling Distribution of the Mean

The means of all possible samples of a fixed size n from some population will form a distribution that we call the sampling distribution of the mean. The histograms in Figure 6.4 are approximations to the sampling distributions of the mean based on 25 samples. Statisticians have shown two key results about the sampling distribution of the mean. First, the standard deviation of the sampling distribution of the mean, called the standard error of the mean, is computed as

$$\text{Standard Error of the Mean} = \sigma/\sqrt{n} \tag{6.1}$$

where $\sigma$ is the standard deviation of the population from which the individual observations are drawn and $n$ is the sample size. From this formula, we see that as $n$ increases, the standard error decreases, just as our experiment demonstrated. This suggests that the estimates of the mean that we obtain from larger sample sizes provide greater accuracy in estimating the true population mean. In other words, larger sample sizes have less sampling error.

Example 6.5  Computing the Standard Error of the Mean

For our experiment, we know that the variance of the population is 8.33 (because the values were uniformly distributed). Therefore, the standard deviation of the population is $\sigma = 2.89$. We may compute the standard error of the mean for each of the sample sizes in our experiment using formula (6.1). For example, with n = 10, we have

$$\text{Standard Error of the Mean} = \sigma/\sqrt{n} = 2.89/\sqrt{10} = 0.914$$

For the remaining data in Table 6.1 we have the following:

| Sample Size, n | Standard Error of the Mean |
|---|---|
| 10 | 0.914 |
| 25 | 0.577 |
| 100 | 0.289 |
| 500 | 0.129 |
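Formula (6.1) can be checked with a few lines of Python. This is a small sketch with a hypothetical function name; it uses the unrounded population standard deviation $\sqrt{100/12} \approx 2.887$, so the value for n = 10 comes out as 0.913 rather than the 0.914 obtained above from the rounded value 2.89.

```python
import math


def standard_error(sigma, n):
    """Standard error of the mean, formula (6.1): sigma / sqrt(n)."""
    return sigma / math.sqrt(n)


# Uniform population between 0 and 10: variance (10 - 0)^2 / 12.
sigma = math.sqrt((10 - 0) ** 2 / 12)

for n in (10, 25, 100, 500):
    print(f"n = {n:>3}: standard error = {standard_error(sigma, n):.3f}")
```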

The standard deviations shown in Table 6.1 are simply estimates of the standard error of the mean based on the limited number of 25 samples. If we compare these estimates with the theoretical values in the previous example, we see that they are close but not exactly the same. This is because the true standard error is based on all possible sample means in the sampling distribution, whereas we used only 25. If you repeat the experiment with a larger number of samples, the observed values of the standard error would be closer to these theoretical values.

In practice, we will never know the true population standard deviation and generally take only a limited sample of n observations. However, we may estimate the standard error of the mean using the sample data by simply dividing the sample standard deviation by the square root of n.

The second result that statisticians have shown is called the central limit theorem, one of the most important practical results in statistics because it makes systematic inference possible. The central limit theorem states that if the sample size is large enough, the sampling distribution of the mean is approximately normally distributed, regardless of the distribution of the population, and that the mean of the sampling distribution will be the same as that of the population. This is exactly what we observed in our experiment. The distribution of the population was uniform, yet the sampling distribution of the mean converges to the shape of a normal distribution as the sample size increases. Moreover, if the population is itself normally distributed, then the sampling distribution of the mean will also be normal for any sample size. The central limit theorem allows us to use the theory we learned about calculating probabilities for normal distributions to draw conclusions about sample means.
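One informal way to see the central limit theorem at work is to check how often a sample mean falls within 1.96 standard errors of the population mean; for a normal distribution this share is about 95%. The following sketch assumes the same uniform population between 0 and 10 used in the experiment; the function name, sample sizes, and repetition counts are hypothetical choices for illustration.

```python
import math
import random
import statistics


def share_within_z(sample_size, reps, z, rng, low=0.0, high=10.0):
    """Fraction of sample means lying within z standard errors of the population mean."""
    mu = (low + high) / 2
    sigma = (high - low) / math.sqrt(12)   # standard deviation of a uniform population
    std_error = sigma / math.sqrt(sample_size)
    hits = 0
    for _ in range(reps):
        xbar = statistics.fmean([rng.uniform(low, high) for _ in range(sample_size)])
        if abs(xbar - mu) <= z * std_error:
            hits += 1
    return hits / reps


if __name__ == "__main__":
    rng = random.Random(3)
    for n in (1, 2, 5, 30):
        share = share_within_z(n, reps=10_000, z=1.96, rng=rng)
        print(f"n = {n:>2}: share within 1.96 standard errors = {share:.3f}")
```

For n = 1 the share is 100%, because every value of a uniform population between 0 and 10 lies within 1.96 population standard deviations of 5; as n grows, the share moves toward the 95% expected under a normal distribution.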

Applying the Sampling Distribution of the Mean

The key to applying the sampling distribution of the mean correctly is to understand whether the probability that you wish to compute relates to an individual observation or to the mean of a sample. If it relates to the mean of a sample, then you must use the sampling distribution of the mean, whose standard deviation is the standard error, $\sigma/\sqrt{n}$.

Example 6.6  Using the Standard Error in Probability Calculations

Suppose that the size of individual customer orders (in dollars), X, from a major discount book publisher Web site is normally distributed with a mean of \$36 and standard deviation of \$8. The probability that the next individual who places an order at the Web site will make a purchase of more than \$40 can be found by calculating

1 − NORM.DIST(40,36,8,TRUE) = 1 − 0.6915 = 0.3085

Now suppose that a sample of 16 customers is chosen. What is the probability that the mean purchase for these 16 customers will exceed \$40? To find this, we must use the sampling distribution of the mean to carry out the appropriate calculations. The sampling distribution of the mean will have a mean of \$36 but a standard error of $\sigma/\sqrt{n} = 8/\sqrt{16} = 2$, that is, \$2. Then the probability that the mean purchase exceeds \$40 for a sample size of n = 16 is

1 − NORM.DIST(40,36,2,TRUE) = 1 − 0.9772 = 0.0228

Although about 30% of individuals will make purchases exceeding \$40, the chance that 16 customers will collectively average more than \$40 is much smaller. It would be very unlikely for all 16 customers to make high-volume purchases, because an individual purchase is as likely to be less than \$36 as more, making the variability of the mean purchase amount for the sample of 16 much smaller than for individuals.
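The two probabilities in Example 6.6 can be reproduced with Python's standard library, whose `statistics.NormalDist` class provides a cumulative distribution function comparable to NORM.DIST with the cumulative argument set to TRUE. The helper name `prob_exceeds` is a hypothetical label for this sketch.

```python
import math
from statistics import NormalDist


def prob_exceeds(threshold, mean, std_dev):
    """P(X > threshold) for a normal distribution with the given mean and std dev."""
    return 1 - NormalDist(mean, std_dev).cdf(threshold)


# An individual order: standard deviation of $8.
p_individual = prob_exceeds(40, 36, 8)

# The mean of 16 orders: standard error of 8 / sqrt(16) = $2.
p_sample_mean = prob_exceeds(40, 36, 8 / math.sqrt(16))

print(f"individual order over $40 : {p_individual:.4f}")
print(f"mean of 16 orders over $40: {p_sample_mean:.4f}")
```

The only difference between the two calls is the standard deviation supplied, which is the point of the example: probabilities about a sample mean use the standard error, not the population standard deviation.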

Interval Estimates

An interval estimate provides a range for a population characteristic based on a sample. Intervals are quite useful in statistics because they provide more information than a point estimate. Intervals specify a range of plausible values for the characteristic of interest and a way of assessing “how plausible” they are. In general, a $100(1 - \alpha)$% probability interval is any interval [A, B] such that the probability of falling between A and B is $1 - \alpha$. Probability intervals are often centered on the mean or median. For instance, in a normal distribution, the mean plus or minus 1 standard deviation describes an approximate 68% probability interval around the mean. As another example, the 5th and 95th percentiles in a data set constitute a 90% probability interval.

Example 6.7  Interval Estimates in the News

We see interval estimates in the news all the time when trying to estimate the mean or proportion of a population. Interval estimates are often constructed by taking a point estimate and adding and subtracting a margin of error that is based on the sample size. For example, a Gallup poll might report that 56% of voters support a certain candidate with a margin of error of ±3%. We would conclude that the true percentage of voters that support the candidate is most likely between 53% and 59%. Therefore, we would have a lot of confidence in predicting that the candidate would win a forthcoming election. If, however, the poll showed a 52% level of support with a margin of error of ±4%, we might not be as confident in predicting a win because the true percentage of supportive voters is likely to be somewhere between 48% and 56%.

The question you might be asking at this point is how to calculate the error associated with a point estimate. In national surveys and political polls, such margins of error are usually stated, but they are never properly explained. To understand them, we need to introduce the concept of confidence intervals.

Confidence Intervals

Confidence interval estimates provide a way of assessing the accuracy of a point estimate. A confidence interval is a range of values between which the value of the population parameter is believed to be, along with a probability that the interval correctly estimates the true (unknown) population parameter. This probability is called the level of confidence, denoted by $1 - \alpha$, where $\alpha$ is a number between 0 and 1. The level of confidence is usually expressed as a percent; common values are 90%, 95%, or 99%. (Note that if the level of confidence is 90%, then $\alpha = 0.1$.)

The margin of error depends on the level of confidence and the sample size. For example, suppose that the margin of error for some sample size and a level of confidence of 95% is calculated to be 2.0. One sample might yield a point estimate of 10. Then, a 95% confidence interval would be [8, 12]. However, this interval may or may not include the true population mean. If we take a different sample, we will most likely have a different point estimate, say, 10.4, which, given the same margin of error, would yield the interval estimate [8.4, 12.4]. Again, this may or may not include the true population mean. If we chose 100 different samples, leading to 100 different interval estimates, we would expect about 95 of them (the level of confidence) to contain the true population mean. We would say we are "95% confident" that the interval we obtain from sample data contains the true population mean.

The higher the confidence level, the more assurance we have that the interval contains the true population parameter. As the confidence level increases, the confidence interval becomes wider to provide higher levels of assurance. You can view $\alpha$ as the risk of incorrectly concluding that the confidence interval contains the true mean.

When national surveys or political polls report an interval estimate, they are actually reporting confidence intervals. However, the level of confidence is generally not stated because the average person would probably not understand the concept or terminology. While not stated, you can probably assume that the level of confidence is 95%, as this is the most common value used in practice (however, the Bureau of Labor Statistics tends to use 90% quite often).
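The repeated-sampling interpretation of the level of confidence can be checked with a small simulation. The sketch below is only illustrative: it uses Python rather than Excel, and the population values (a normal population with mean 10 and standard deviation 5), the sample size, and the function name `coverage_rate` are assumptions chosen for the demonstration. Each simulated interval is the sample mean plus or minus a normal-based margin of error computed from the known standard deviation.

```python
import random
from math import sqrt
from statistics import NormalDist

def coverage_rate(mu=10.0, sigma=5.0, n=25, confidence=0.95,
                  trials=5_000, seed=1):
    """Fraction of simulated intervals that contain the true mean mu."""
    rng = random.Random(seed)                      # fixed seed for repeatability
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sigma / sqrt(n)                   # same margin for every sample
    hits = 0
    for _ in range(trials):
        x_bar = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        if x_bar - margin <= mu <= x_bar + margin:
            hits += 1
    return hits / trials

rate = coverage_rate()
print(f"Share of intervals containing the true mean: {rate:.3f}")
```

The printed share should be close to 0.95. Any single interval either contains the true mean or it does not; the 95% describes the long-run behavior of the procedure.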


Chapter 6  Sampling and Estimation

Many different types of confidence intervals may be developed. The formulas used depend on the population parameter we are trying to estimate and possibly other characteristics or assumptions about the population. We illustrate a few types of confidence intervals.

Confidence Interval for the Mean with Known Population Standard Deviation

The simplest type of confidence interval is for the mean of a population where the standard deviation is assumed to be known. You should realize, however, that in nearly all practical sampling applications, the population standard deviation will not be known. In some applications, though, such as measurements of parts from an automated machine, a process might have a very stable variance that has been established over a long history, and it can reasonably be assumed that the standard deviation is known. A $100(1 - \alpha)\%$ confidence interval for the population mean $\mu$ based on a sample of size $n$ with a sample mean $\bar{x}$ and a known population standard deviation $\sigma$ is given by

$$\bar{x} \pm z_{\alpha/2}\left(\sigma/\sqrt{n}\right) \tag{6.2}$$

Note that this formula is simply the sample mean (point estimate) plus or minus a margin of error. The margin of error is a number $z_{\alpha/2}$ multiplied by the standard error of the sampling distribution of the mean, $\sigma/\sqrt{n}$. The value $z_{\alpha/2}$ represents the value of a standard normal random variable that has an upper tail probability of $\alpha/2$ or, equivalently, a cumulative probability of $1 - \alpha/2$. It may be found from the standard normal table (see Table A.1 in Appendix A at the end of the book) or may be computed in Excel using the value of the function NORM.S.INV(1 - α/2). For example, if $\alpha = 0.05$ (for a 95% confidence interval), then NORM.S.INV(0.975) = 1.96; if $\alpha = 0.10$ (for a 90% confidence interval), then NORM.S.INV(0.95) = 1.645, and so on.

Although formula (6.2) can easily be implemented in a spreadsheet, the Excel function CONFIDENCE.NORM(alpha, standard_deviation, size) can be used to compute the margin of error term, $z_{\alpha/2}\,\sigma/\sqrt{n}$; thus, the confidence interval is the sample mean ± CONFIDENCE.NORM(alpha, standard_deviation, size).

Example 6.8  Computing a Confidence Interval with a Known Standard Deviation

In a production process for filling bottles of liquid detergent, historical data have shown that the variance in the volume is constant; however, clogs in the filling machine often affect the average volume. The historical standard deviation is 15 milliliters. In filling 800-milliliter bottles, a sample of 25 found an average volume of 796 milliliters. Using formula (6.2), a 95% confidence interval for the population mean is

$$\bar{x} \pm z_{\alpha/2}\left(\sigma/\sqrt{n}\right) = 796 \pm 1.96\left(15/\sqrt{25}\right) = 796 \pm 5.88, \text{ or } [790.12,\ 801.88]$$

The worksheet Population Mean Sigma Known in the Excel workbook Confidence Intervals computes this interval using the CONFIDENCE.NORM function to compute the margin of error in cell B9, as shown in Figure 6.5.
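For readers working outside Excel, the calculation in Example 6.8 can be sketched in Python. This is an illustration, not the workbook's implementation: the function name `z_interval` is hypothetical, and `statistics.NormalDist().inv_cdf` is used here as a stand-in for NORM.S.INV.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(x_bar, sigma, n, confidence=0.95):
    """Confidence interval for the mean when sigma is known (formula 6.2)."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # z-value with upper tail alpha/2
    margin = z * sigma / sqrt(n)              # margin of error
    return x_bar - margin, x_bar + margin, margin

# Example 6.8: sample mean 796 ml, sigma 15 ml, n = 25
lower, upper, margin = z_interval(796, 15, 25)
print(f"margin of error = {margin:.2f}")
print(f"95% interval = [{lower:.2f}, {upper:.2f}]")
```

Passing `confidence=0.90` or `confidence=0.99` shows how the interval narrows or widens with the level of confidence.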

As the level of confidence, $1 - \alpha$, decreases, $z_{\alpha/2}$ decreases, and the confidence interval becomes narrower. For example, a 90% confidence interval will be narrower than a 95% confidence interval. Similarly, a 99% confidence interval will be wider than a 95% confidence interval. Essentially, you must trade off a more precise (narrower) interval against the risk that the confidence interval does not contain the true mean. Smaller risk will result in a wider confidence interval. However, you can also see that as the sample size increases, the standard error decreases, making the confidence interval narrower and providing a more accurate interval estimate for the same level of risk. So if you wish to reduce the risk without widening the interval, you should consider increasing the sample size.

Figure 6.5 Confidence Interval for Mean Liquid Detergent Filling Volume

The t-Distribution

In most practical applications, the standard deviation of the population is unknown, and we need to calculate the confidence interval differently. Before we can discuss how to compute this type of confidence interval, we need to introduce a new probability distribution called the t-distribution. The t-distribution is actually a family of probability distributions with a shape similar to the standard normal distribution. Different t-distributions are distinguished by an additional parameter, degrees of freedom (df). The t-distribution has a larger variance than the standard normal, thus making confidence intervals wider than those obtained from the standard normal distribution, in essence correcting for the uncertainty about the true standard deviation, which is not known. As the number of degrees of freedom increases, the t-distribution converges to the standard normal distribution (Figure 6.6). When sample sizes get to be as large as 120, the distributions are virtually identical; even for sample sizes as low as 30 to 35, it becomes difficult to distinguish between the two. Thus, for large sample sizes, many people use z-values to establish confidence intervals even when the standard deviation is unknown. We must point out, however, that whenever the standard deviation is estimated from the sample, the appropriate distribution for the standardized sample mean is the t-distribution for any sample size (strictly, when sampling from a normal population), so when in doubt, use the t.

The concept of degrees of freedom can be puzzling. It can best be explained by examining the formula for the sample variance:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

Note that to compute $s^2$, we first need to compute the sample mean, $\bar{x}$. If we know the value of the mean, then we need know only $n - 1$ distinct observations; the $n$th is completely determined. (For instance, if the mean of three values is 4 and you know that two of the values are 2 and 4, you can easily determine that the third number must be 6.) The number of sample values that are free to vary defines the number of degrees of freedom; in general, df equals the number of sample values minus the number of estimated parameters. Because the sample variance uses one estimated parameter, the mean, the t-distribution used in confidence interval calculations has $n - 1$ degrees of freedom. Because the t-distribution explicitly accounts for the effect of the sample size in estimating the population variance, it is the proper one to use for any sample size. However, for large samples, the difference between t- and z-values is very small, as we noted earlier.

Figure 6.6 Comparison of the t-Distribution to the Standard Normal Distribution

Confidence Interval for the Mean with Unknown Population Standard Deviation

The formula for a $100(1 - \alpha)\%$ confidence interval for the mean $\mu$ when the population standard deviation is unknown is

$$\bar{x} \pm t_{\alpha/2,\,n-1}\left(s/\sqrt{n}\right) \tag{6.3}$$

where $t_{\alpha/2,\,n-1}$ is the value from the t-distribution with $n - 1$ degrees of freedom, giving an upper-tail probability of $\alpha/2$. We may find t-values in Table A.2 in Appendix A at the end of the book or by using the Excel function T.INV(1 - α/2, n - 1) or the function T.INV.2T(α, n - 1). The Excel function CONFIDENCE.T(alpha, standard_deviation, size) can be used to compute the margin of error term, $t_{\alpha/2,\,n-1}(s/\sqrt{n})$; thus, the confidence interval is the sample mean ± CONFIDENCE.T.

Example 6.9  Computing a Confidence Interval with Unknown Standard Deviation

In the Excel file Credit Approval Decisions, a large bank has sample data used in making credit approval decisions (see Figure 6.7). Suppose that we want to find a 95% confidence interval for the mean revolving balance for the population of applicants who own a home. First, sort the data by homeowner and compute the mean and standard deviation of the revolving balance for the sample of homeowners. This results in $\bar{x}$ = \$12,630.37 and $s$ = \$5,393.38. The sample size is $n = 27$, so the standard error is $s/\sqrt{n}$ = \$1,037.96. The t-distribution has 26 degrees of freedom; therefore, $t_{0.025,26} = 2.056$. Using formula (6.3), the confidence interval is \$12,630.37 ± 2.056(\$1,037.96) or [\$10,496, \$14,764]. The worksheet Population Mean Sigma Unknown in the Excel workbook Confidence Intervals computes this interval using the CONFIDENCE.T function to compute the margin of error in cell B10, as shown in Figure 6.8.
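The arithmetic of Example 6.9 can be reproduced with a short Python sketch. The function name `t_interval` is hypothetical, and because the Python standard library has no inverse t-distribution function, the t-value 2.056 is passed in as read from the example (or from Table A.2 or T.INV.2T in Excel); that is an assumption of this sketch, not a general-purpose solution.

```python
from math import sqrt

def t_interval(x_bar, s, n, t_value):
    """Confidence interval for the mean with unknown sigma (formula 6.3).

    t_value is t_{alpha/2, n-1}, supplied by the caller from a table or Excel.
    """
    std_error = s / sqrt(n)          # estimated standard error of the mean
    margin = t_value * std_error     # margin of error
    return x_bar - margin, x_bar + margin, std_error

# Example 6.9: homeowners' revolving balance, n = 27, t_{0.025,26} = 2.056
lower, upper, std_error = t_interval(12630.37, 5393.38, 27, t_value=2.056)
print(f"standard error = {std_error:.2f}")
print(f"95% interval = [{lower:.0f}, {upper:.0f}]")
```

If a statistics package with a t-distribution is available, the table lookup can be replaced by a computed value; the rest of the calculation is unchanged.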

Confidence Interval for a Proportion

For categorical variables such as gender (male or female), education (high school, college, post-graduate), and so on, we are usually interested in the proportion of observations in a sample that has a certain characteristic. An unbiased estimator of a population proportion $\pi$ (this is not the number pi = 3.14159 . . . ) is the statistic $\hat{p} = x/n$ (the sample proportion), where $x$ is the number in the sample having the desired characteristic and $n$ is the sample size.

Figure 6.7 Portion of Excel File Credit Approval Decisions

Figure 6.8 Confidence Interval for Mean Revolving Balance of Homeowners

A $100(1 - \alpha)\%$ confidence interval for the proportion is

$$\hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} \tag{6.4}$$

Notice that as with the mean, the confidence interval is the point estimate plus or minus some margin of error. In this case, $\sqrt{\hat{p}(1 - \hat{p})/n}$ is the standard error for the sampling distribution of the proportion. Excel does not have a function for computing the margin of error, but it can easily be implemented on a spreadsheet.

Example 6.10  Computing a Confidence Interval for a Proportion

The last column in the Excel file Insurance Survey (see Figure 6.9) describes whether a sample of employees would be willing to pay a lower premium for a higher deductible for their health insurance. Suppose we are interested in the proportion of individuals who answered yes. We may easily confirm that 6 out of the 24 employees, or 25%, answered yes. Thus, a point estimate for the proportion answering yes is $\hat{p} = 0.25$. Using formula (6.4), we find that a 95% confidence interval for the proportion of employees answering yes is

$$0.25 \pm 1.96\sqrt{\frac{0.25(0.75)}{24}} = 0.25 \pm 0.173, \text{ or } [0.077,\ 0.423]$$

The worksheet Population Mean Sigma Unknown in the Excel workbook Confidence Intervals computes this interval, as shown in Figure 6.10. Notice that this is a fairly wide confidence interval, suggesting that we have quite a bit of uncertainty as to the true value of the population proportion. This is because of the relatively small sample size.
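Because Excel has no built-in function for this margin of error, formula (6.4) has to be written out by hand. A minimal Python sketch of the same calculation follows; the function name `proportion_interval` is hypothetical, and `NormalDist().inv_cdf` stands in for NORM.S.INV.

```python
from math import sqrt
from statistics import NormalDist

def proportion_interval(successes, n, confidence=0.95):
    """Confidence interval for a proportion (formula 6.4)."""
    p_hat = successes / n                              # sample proportion
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / n)         # z times standard error
    return p_hat - margin, p_hat + margin, margin

# Example 6.10: 6 of 24 employees answered yes
lower, upper, margin = proportion_interval(6, 24)
print(f"margin of error = {margin:.3f}")
print(f"95% interval = [{lower:.3f}, {upper:.3f}]")
```

Rerunning the function with the same proportion but a larger sample, for instance 60 of 240, shows how much of the interval's width is due to the small sample size.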

Figure 6.9 Portion of Excel File Insurance Survey

Figure 6.10 Confidence Interval for the Proportion

Additional Types of Confidence Intervals

Confidence intervals may be calculated for other population parameters such as a variance or standard deviation and also for differences in the means or proportions of two populations. The concepts are similar to the types of confidence intervals we have discussed, but many of the formulas are rather complex and more difficult to implement on a spreadsheet. Some advanced software packages and spreadsheet add-ins provide additional support. Therefore, we do not discuss them in this book, but we do suggest that you consult other books and statistical references should you need to use them, now that you understand the basic concepts underlying them.

Using Confidence Intervals for Decision Making

Confidence intervals can be used in many ways to support business decisions.

Example 6.11  Drawing a Conclusion about a Population Mean Using a Confidence Interval

In packaging a commodity product such as laundry detergent, the manufacturer must ensure that the packages contain the stated amount to meet government regulations. In Example 6.8, we saw an example where the required volume is 800 milliliters, yet the sample average was only 796 milliliters. Does this indicate a serious problem? Not necessarily. The 95% confidence interval for the mean we computed in Figure 6.5 was [790.12, 801.88]. Although the sample mean is less than 800, the sample does not provide sufficient evidence to draw the conclusion that the population mean is less than 800 because 800 is contained within the confidence interval. In fact, it is just as plausible that the population mean is 801. We cannot tell definitively because of the sampling error. However, suppose that the sample average is 792. Using the Excel worksheet Population Mean Sigma Known in the workbook Confidence Intervals, we find that the confidence interval for the mean would be [786.12, 797.88]. In this case, we would conclude that it is highly unlikely that the population mean is 800 milliliters because the confidence interval falls completely below 800; the manufacturer should check and adjust the equipment to meet the standard.
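The reasoning in Example 6.11 amounts to asking whether the required value lies inside the confidence interval. The sketch below expresses that check in Python under the same assumptions as the example (known standard deviation of 15 milliliters, samples of 25); the function name `mean_is_plausible` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def mean_is_plausible(target, x_bar, sigma, n, confidence=0.95):
    """True if target lies inside the z-based confidence interval for the mean."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sigma / sqrt(n)
    return x_bar - margin <= target <= x_bar + margin

# Example 6.11: is a true mean of 800 ml consistent with each sample average?
ok_796 = mean_is_plausible(800, x_bar=796, sigma=15, n=25)
ok_792 = mean_is_plausible(800, x_bar=792, sigma=15, n=25)
print(ok_796)  # True: 800 is inside [790.12, 801.88]
print(ok_792)  # False: [786.12, 797.88] lies entirely below 800
```

A result of True does not show that the mean equals 800; it only means the sample gives no strong evidence against it at the chosen level of confidence.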

The next example shows how to interpret a confidence interval for a proportion.

Example 6.12  Using a Confidence Interval to Predict Election Returns

Suppose that an exit poll of 1,300 voters found that 692 voted for a particular candidate in a two-person race. This represents a proportion of 53.23% of the sample. Could we conclude that the candidate will likely win the election? A 95% confidence interval for the proportion is [0.505, 0.559]. This suggests that the population proportion of voters who favor this candidate is highly likely to exceed 50%, so it is safe to predict the winner. On the other hand, suppose that only 670 of the 1,300 voters voted for the candidate, a sample proportion of 0.515. The confidence interval for the population proportion is [0.488, 0.543]. Even though the sample proportion is larger than 50%, the sampling error is large, and the confidence interval suggests that it is reasonably likely that the true population proportion could be less than 50%, so it would not be wise to predict the winner based on this information.

Prediction Intervals

Another type of interval used in estimation is a prediction interval. A prediction interval is one that provides a range for predicting the value of a new observation from the same population. This is different from a confidence interval, which provides an interval estimate of a population parameter, such as the mean or proportion. A confidence interval is associated with the sampling distribution of a statistic, but a prediction interval is associated with the distribution of the random variable itself.

When the population standard deviation is unknown, a $100(1 - \alpha)\%$ prediction interval for a new observation is

$$\bar{x} \pm t_{\alpha/2,\,n-1}\left(s\sqrt{1 + \frac{1}{n}}\right) \tag{6.5}$$

Note that this interval is wider than the confidence interval in formula (6.3) by virtue of the additional value of 1 under the square root. This is because, in addition to estimating the population mean, we must also account for the variability of the new observation around the mean. One important thing to realize is that in formula (6.3) for a confidence interval, as $n$ gets large, the error term tends to zero, so the confidence interval converges on the mean. However, in the prediction interval formula (6.5), as $n$ gets large, the error term converges to $t_{\alpha/2,\,n-1}(s)$, which is simply a $100(1 - \alpha)\%$ probability interval. Because we are trying to predict a new observation from the population, there will always be uncertainty.


Example 6.13  Computing a Prediction Interval

In estimating the revolving balance in the Excel file Credit Approval Decisions in Example 6.9, we may use formula (6.5) to compute a 95% prediction interval for the revolving balance of a new homeowner as

$$\$12{,}630.37 \pm 2.056(\$5{,}393.38)\sqrt{1 + \frac{1}{27}}, \text{ or } [\$1{,}338.10,\ \$23{,}922.64]$$

Note that compared with Example 6.9, the prediction interval is considerably wider than the confidence interval.
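To see how much wider the prediction interval is, formulas (6.3) and (6.5) can be evaluated side by side. The Python sketch below uses the summary statistics from Example 6.9; the function names are hypothetical, and the t-value 2.056 is taken from the example rather than computed.

```python
from math import sqrt

def prediction_interval(x_bar, s, n, t_value):
    """Prediction interval for one new observation (formula 6.5)."""
    margin = t_value * s * sqrt(1 + 1 / n)
    return x_bar - margin, x_bar + margin

def confidence_interval(x_bar, s, n, t_value):
    """Confidence interval for the mean (formula 6.3), for comparison."""
    margin = t_value * s / sqrt(n)
    return x_bar - margin, x_bar + margin

x_bar, s, n, t_value = 12630.37, 5393.38, 27, 2.056
pi_lower, pi_upper = prediction_interval(x_bar, s, n, t_value)
ci_lower, ci_upper = confidence_interval(x_bar, s, n, t_value)
print(f"prediction interval: [{pi_lower:,.2f}, {pi_upper:,.2f}]")
print(f"confidence interval: [{ci_lower:,.2f}, {ci_upper:,.2f}]")
print(f"width ratio: {(pi_upper - pi_lower) / (ci_upper - ci_lower):.2f}")
```

The ratio of the two widths is $\sqrt{n + 1}$, so the gap between the intervals grows as the sample size increases: the confidence interval shrinks while the prediction interval does not.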

Confidence Intervals and Sample Size

An important question in sampling is the size of the sample to take. Note that in all the formulas for confidence intervals, the sample size plays a critical role in determining the width of the confidence interval. As the sample size increases, the width of the confidence interval decreases, providing a more accurate estimate of the true population parameter. In many applications, we would like to control the margin of error in a confidence interval. For example, in reporting voter preferences, we might wish to ensure that the margin of error is ±2%. Fortunately, it is relatively easy to determine the appropriate sample size needed to estimate the population parameter within a specified level of precision.

The formulas for determining sample sizes to achieve a given margin of error are based on the confidence interval half-widths. For example, consider the confidence interval for the mean with a known population standard deviation we introduced in formula (6.2):

$$\bar{x} \pm z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right)$$

Suppose we want the width of the confidence interval on either side of the mean (i.e., the margin of error) to be at most $E$. In other words,

$$E \ge z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right)$$

Solving for $n$, we find:

$$n \ge (z_{\alpha/2})^2\,\frac{\sigma^2}{E^2} \tag{6.6}$$

In a similar fashion, we can compute the sample size required to achieve a desired confidence interval half-width for a proportion by solving the following equation (based on formula (6.4) using the population proportion $\pi$ in the margin of error term) for $n$:

$$E \ge z_{\alpha/2}\sqrt{\pi(1 - \pi)/n}$$

This yields

$$n \ge (z_{\alpha/2})^2\,\frac{\pi(1 - \pi)}{E^2} \tag{6.7}$$

In practice, the value of $\pi$ will not be known. You could use the sample proportion from a preliminary sample as an estimate of $\pi$ to plan the sample size, but this might require several iterations and additional samples to find the sample size that yields the required precision. When no information is available, the most conservative estimate is to set $\pi = 0.5$. This maximizes the quantity $\pi(1 - \pi)$ in the formula, resulting in the sample size that will guarantee the required precision no matter what the true proportion is.


Figure 6.11 Confidence Interval for the Mean Using a Sample Size = 97

Example 6.14  Sample Size Determination for the Mean

In the liquid detergent example (Example 6.8), the confidence interval we computed in Figure 6.5 was [790.12, 801.88]. The half-width of the confidence interval is ±5.88 milliliters, which represents the sampling error. Suppose the manufacturer would like the sampling error to be at most 3 milliliters. Using formula (6.6), we may compute the required sample size as follows:

$$n \ge (z_{\alpha/2})^2\,\frac{\sigma^2}{E^2} = (1.96)^2\,\frac{15^2}{3^2} = 96.04$$

Rounding up, we find that a sample of 97 would be needed. To verify this, Figure 6.11 shows that if a sample of 97 is used along with the same sample mean and standard deviation, the confidence interval does indeed have a sampling error of less than 3 milliliters.
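Formula (6.6) and the verification step can be sketched in Python as follows. The function name `sample_size_for_mean` is hypothetical, and the sketch uses the unrounded z-value from `NormalDist().inv_cdf` rather than 1.96, which does not change the rounded-up answer here.

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_for_mean(sigma, max_error, confidence=0.95):
    """Smallest n whose margin of error is at most max_error (formula 6.6)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil((z * sigma / max_error) ** 2)   # always round up

# Example 6.14: sigma = 15 ml, desired sampling error of at most 3 ml
n_required = sample_size_for_mean(sigma=15, max_error=3)
print(n_required)

# Check the resulting margin of error, as Figure 6.11 does in the workbook
z = NormalDist().inv_cdf(0.975)
achieved_margin = z * 15 / sqrt(n_required)
print(f"margin of error with n = {n_required}: {achieved_margin:.3f}")
```

Because the required sample size grows with the square of $1/E$, halving the allowed error roughly quadruples the number of observations needed.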

Of course, we generally do not know the population standard deviation prior to finding the sample size. A commonsense approach would be to take an initial sample to estimate the population standard deviation using the sample standard deviation s and determine the required sample size, collecting additional data if needed. If the half-width of the resulting confidence interval is within the required margin of error, then we clearly have achieved our goal. If not, we can use the new sample standard deviation s to determine a new sample size and collect additional data as needed. Note that if s changes significantly, we still might not have achieved the desired precision and might have to repeat the process. Usually, however, this will be unnecessary.

Example 6.15  Sample Size Determination for a Proportion

For the voting example we discussed, suppose that we wish to determine the number of voters to poll to ensure a sampling error of at most ±2%. As we stated, when no information is available, the most conservative approach is to use 0.5 for the estimate of the true proportion. Using formula (6.7) with $\pi = 0.5$, the number of voters to poll to obtain a 95% confidence interval on the proportion of voters that choose a particular candidate with a precision of ±0.02 or less is

$$n \ge (z_{\alpha/2})^2\,\frac{\pi(1 - \pi)}{E^2} = (1.96)^2\,\frac{(0.5)(1 - 0.5)}{0.02^2} = 2{,}401$$
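The same approach works for formula (6.7). The sketch below, with the hypothetical function name `sample_size_for_proportion`, also tabulates the required sample size for several planning values of the proportion to illustrate why 0.5 is the conservative choice.

```python
from math import ceil
from statistics import NormalDist

def sample_size_for_proportion(max_error, p=0.5, confidence=0.95):
    """Smallest n giving a margin of error of at most max_error (formula 6.7).

    p is the planning value for the population proportion; 0.5 is the
    conservative default because it maximizes p * (1 - p).
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil(z ** 2 * p * (1 - p) / max_error ** 2)

# Example 6.15: margin of error of at most 0.02 at 95% confidence
n_voters = sample_size_for_proportion(0.02)
print(n_voters)

# Required sample size for other planning values of the proportion
sizes = {p: sample_size_for_proportion(0.02, p=p) for p in (0.1, 0.3, 0.5, 0.7, 0.9)}
for p, n in sizes.items():
    print(f"p = {p:.1f}: n = {n}")
```

Planning values far from 0.5 call for noticeably smaller samples, which is why a preliminary estimate of the proportion can reduce the cost of a survey when one is available.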


Key Terms

Central limit theorem
Cluster sampling
Confidence interval
Convenience sampling
Degrees of freedom (df)
Estimation
Estimators
Interval estimate
Judgment sampling
Level of confidence
Nonsampling error
Point estimate
Population frame
Prediction interval
Probability interval
Sample proportion
Sampling (statistical) error
Sampling distribution of the mean
Sampling plan
Simple random sampling
Standard error of the mean
Stratified sampling
Systematic (or periodic) sampling
t-Distribution

Problems and Exercises

1. Your college or university wishes to obtain reliable information about student perceptions of administrative communication. Describe how to design a sampling plan for this situation based on your knowledge of the structure and organization of your college or university. How would you implement simple random sampling, stratified sampling, and cluster sampling for this study? What would be the pros and cons of using each of these methods?

2. Number the rows in the Excel file Credit Risk Data to identify each record. The bank wants to sample from this database to conduct a more-detailed audit. Use the Excel Sampling tool to find a simple random sample of 20 unique records.

3. Describe how to apply stratified sampling to sample from the Credit Risk Data file based on the different types of loans. Implement your process in Excel to choose a random sample consisting of 10% of the records for each type of loan.

4. Find the current 30 stocks that comprise the Dow Jones Industrial Average. Set up an Excel spreadsheet for their names, market capitalization, and one or two other key financial statistics (search Yahoo! Finance or a similar Web source). Using the Excel Sampling tool, obtain a random sample of 5 stocks, compute point estimates for the mean and standard deviation, and compare them to the population parameters.

5. Repeat the sampling experiment in Example 6.3 for sample sizes 50, 100, 250, and 500. Compare your results to the example and use the empirical rules to analyze the sampling error. For each sample, also find the standard error of the mean using formula (6.1).

6. Uncle's Pizza is doing good business in Delhi due to its prompt home delivery system. It guarantees that the pizza will be delivered within 30 minutes from the time the order was placed or the order is free. The time that it takes to deliver each order is maintained in the Pizza Time System. Fourteen random entries from the Pizza Time System are listed.

   10.1   19.6   12.2   32.6   18.2   29.5   13.2
   30     10.8   14.8   22.1   15.6   45.6   15.6

   a. Find the mean for the sample.
   b. Explain whether this sample can be used to estimate the average time that it takes for Uncle's Pizza to deliver the pizza.

7. A soft drink bottle filling machine is known to have a mean of 200 ml and a standard deviation of 10 ml. The quality control manager took a random sample of the filled bottles and found the sample mean to be 215 ml. She assumed the sample must not be representative. Do you agree with the conclusion made by the quality control manager? Justify your answer.

8. A sample of 33 airline passengers found that the average check-in time is 2.167. Based on long-term data, the population standard deviation is known to be 0.48. Find a 95% confidence interval for the mean check-in time. Use the appropriate formula and verify your result using the Confidence Intervals workbook.

9. A sample of 20 international students attending an urban U.S. university found that the average amount budgeted for expenses per month was \$1612.50 with a standard deviation of \$1179.64. Find a 95% confidence interval for the mean monthly expense budget of the population of international students. Use the appropriate formula and verify your result using the Confidence Intervals workbook.

10. A sample of 25 individuals at a shopping mall found that the mean number of visits to a restaurant per week was 2.88 with a standard deviation of 1.59. Find a 99% confidence interval for the mean number of restaurant visits. Use the appropriate formula and verify your result using the Confidence Intervals workbook.

11. A bank sampled its customers to determine the proportion of customers who use their debit card at least once each month. A sample of 50 customers found that only 12 use their debit card monthly. Find 95% and 99% confidence intervals for the proportion of customers who use their debit card monthly. Use the appropriate formula and verify your result using the Confidence Intervals workbook.

12. If, based on a sample size of 850, a political candidate finds that 458 people would vote for him in a two-person race, what is the 95% confidence interval for his expected proportion of the vote? Would he be confident of winning based on this poll? Use the appropriate formula and verify your result using the Confidence Intervals workbook.

13. If, based on a sample size of 200, a political candidate found that 125 people would vote for her in a two-person race, what is the 99% confidence interval for her expected proportion of the vote? Would she be confident of winning based on this poll?

14. Using the data in the Excel file Accounting Professionals, find and interpret 95% confidence intervals for the following:
   a. mean years of service
   b. proportion of employees who have a graduate degree

15. Find the standard deviation of the total assets held by the bank in the Excel file Credit Risk Data.
   a. Treating the records in the database as a population, use your sample in Problem 2 and compute 90%, 95%, and 99% confidence intervals for the total assets held in the bank by loan applicants using formula (6.2) and any appropriate Excel functions. Explain the differences as the level of confidence increases.
   b. How do your confidence intervals differ if you assume that the population standard deviation is not known but estimated using your sample data?

16. The Excel file Restaurant Sales provides sample information on lunch, dinner, and delivery sales for a local Italian restaurant. Develop 95% confidence intervals for the mean of each of these variables, as well as total sales for weekdays and weekends. What conclusions can you reach?

17. Using the data in the worksheet Consumer Transportation Survey, develop 95% confidence intervals for the following:
   a. the proportion of individuals who are satisfied with their vehicle
   b. the proportion of individuals who have at least one child

18. The monthly sales of a mobile phone shop have been distributed with a standard deviation of \$900. A statistical study of sales in the last nine months has found a confidence interval for the mean of monthly sales with extremes of \$5663 and \$6839.
   a. What were the average sales over the nine-month period?
   b. What is the confidence level for this interval?

19. Using data in the Excel file Colleges and Universities, find 95% confidence intervals for the median SAT for each of the two groups, liberal arts colleges and research universities. Based on these confidence intervals, does there appear to be a difference in the median SAT scores between the two groups?

20. The Excel file Baseball Attendance shows the attendance in thousands at San Francisco Giants' baseball games for the 10 years before the Oakland A's moved to the Bay Area in 1968, as well as the combined attendance for both teams for the next 11 years. Develop 95% confidence intervals for the mean attendance of each of the two groups. Based on these confidence intervals, would you conclude that attendance has changed after the move?

21. A random sample of 100 teenagers was surveyed, and the mean number of songs that they had downloaded from the iTunes store in the past month was 9.4, with the results considered accurate to within 1.4 (18 times out of 20).
   a. What percent of confidence level is the result?
   b. What is the margin of error?
   c. What is the confidence interval? Explain.

22. A study of nonfatal occupational injuries in the United States found that about 31% of all injuries in the service sector involved the back. The National Institute for Occupational Safety and Health (NIOSH) recommended conducting a comprehensive ergonomics assessment of jobs and workstations. In response to this information, Mark Glassmeyer developed a unique ergonomic handcart to help field service engineers be more productive and also to reduce back injuries from lifting parts and equipment during service calls. Using a sample of 382 field service engineers who were provided with these carts, Mark collected the following data:

   | | Year 1 (without Cart) | Year 2 (with Cart) |
   |---|---|---|
   | Average call time | 8.27 hours | 7.98 hours |
   | Standard deviation call time | 1.36 hours | 1.21 hours |
   | Proportion of back injuries | 0.018 | 0.010 |

   Find 95% confidence intervals for the average call times and proportion of back injuries in each year. What conclusions would you reach based on your results?

23. Using the data in the worksheet Consumer Transportation Survey, develop 95% and 99% prediction intervals for the following:
   a. the hours per week that an individual will spend in his or her vehicle
   b. the number of miles driven per week

24. The Excel file Restaurant Sales provides sample information on lunch, dinner, and delivery sales for a local Italian restaurant. Develop 95% prediction intervals for the daily dollar sales of each of these variables and also for the total sales dollars on a weekend day.

25. For the Excel file Credit Approval Decisions, find 95% confidence and prediction intervals for the credit scores and revolving balance of homeowners and nonhomeowners. How do they compare?

26. Trade associations, such as the United Dairy Farmers Association, frequently conduct surveys to identify characteristics of their membership. If this organization conducted a survey to estimate the annual per-capita consumption of milk and wanted to be 95% confident that the estimate was no more than ±0.5 gallon away from the actual average, what sample size is needed? Past data have indicated that the standard deviation of consumption is approximately 6 gallons.

27. If a manufacturer conducted a survey among randomly selected target market households and wanted to be 95% confident that the difference between the sample estimate and the actual market share for its new product was no more than ±2%, what sample size would be needed?

28. After regular complaints of tire blowouts on the Yamuna Expressway, in an automotive test conducted by the authorities, the average tire pressure in a sample of 62 tires was found to be 24 pounds per square inch and the standard deviation was 2.1 pounds per square inch.
   a. What is the estimated population standard deviation for this population?
   b. Calculate the estimated standard error of the mean.

29. A music company wants to know how the illegal downloading of music online affects CD sales. 600 families are randomly chosen from various parts of a particular country and the number of songs that are downloaded in an hour are noted. The sample mean is 3947 with a sample standard deviation of 104. Determine a 90% confidence interval for the population mean. (Assume that the population variance is not known.)

Case: Drout Advertising Research Project

The background for this case was introduced in Chapter 1. This is a continuation of the case in Chapter 4. For this part of the case, compute confidence intervals for means and proportions, and analyze the sampling errors, possibly suggesting larger sample sizes to obtain more precise estimates. Write up your findings in a formal report or add your findings to the report you completed for the case in Chapter 4, depending on your instructor's requirements.

Chapter 6  Sampling and Estimation


Case: Performance Lawn Equipment

In reviewing your previous reports, several questions came to Elizabeth Burke’s mind. Use point and interval estimates to help answer these questions.

1. What proportion of customers rate the company with “top box” survey responses (which is defined as scale levels 4 and 5) on quality, ease of use, price, and service in the 2012 Customer Survey worksheet? How do these proportions differ by geographic region?
2. What estimates, with reasonable assurance, can PLE give customers for response times to customer service calls?
3. Engineering has collected data on alternative process costs for building transmissions in the worksheet Transmission Costs. Can you determine whether one of the proposed processes is better than the current process?
4. What would be a confidence interval for an additional sample of mower test performance as in the worksheet Mower Test?
5. For the data in the worksheet Blade Weight, what is the sampling distribution of the mean, the overall mean, and the standard error of the mean? Is a normal distribution an appropriate assumption for the sampling distribution of the mean?
6. How many blade weights must be measured to find a 95% confidence interval for the mean blade weight with a sampling error of at most 0.2? What if the sampling error is specified as 0.1?

Answer these questions and summarize your results in a formal report to Ms. Burke.

Chapter 7  Statistical Inference

Benis Arapovic/Shutterstock.com

Learning Objectives

After studying this chapter, you will be able to:

• Explain the purpose of hypothesis testing.
• Explain the difference between the null and alternative hypotheses.
• List the steps in the hypothesis-testing procedure.
• State the proper forms of hypotheses for one-sample hypothesis tests.
• Correctly formulate hypotheses.
• List the four possible outcome results from a hypothesis test.
• Explain the difference between Type I and Type II errors.
• State how to increase the power of a test.
• Choose the proper test statistic for hypothesis tests involving means and proportions.
• Explain how to draw a conclusion for one- and two-tailed hypothesis tests.
• Use p-values to draw conclusions about hypothesis tests.
• State the proper forms of hypotheses for two-sample hypothesis tests.
• Select and use Excel Analysis Toolpak procedures for two-sample hypothesis tests.
• Explain the purpose of analysis of variance.
• Use the Excel ANOVA tool to conduct an analysis of variance test.
• List the assumptions of ANOVA.
• Conduct and interpret the results of a chi-square test for independence.


Managers need to know if the decisions they have made or are planning to make are effective. For example, they might want to answer questions like the following: Did an advertising campaign increase sales? Will product placement in a grocery store make a difference? Did a new assembly method improve productivity or quality in a factory? Many applications of business analytics involve seeking statistical evidence that decisions or process changes have met their objectives. Statistical inference focuses on drawing conclusions about populations from samples. It includes estimation of population parameters and hypothesis testing, a technique that allows you to draw valid statistical conclusions, based on sample data, about the value of population parameters or differences among them.

Hypothesis Testing

Hypothesis testing involves drawing inferences about two contrasting propositions (each called a hypothesis) relating to the value of one or more population parameters, such as the mean, proportion, standard deviation, or variance. One of these propositions (called the null hypothesis) describes the existing theory or a belief that is accepted as valid unless strong statistical evidence exists to the contrary. The second proposition (called the alternative hypothesis) is the complement of the null hypothesis; it must be true if the null hypothesis is false. The null hypothesis is denoted by H0, and the alternative hypothesis is denoted by H1. Using sample data, we either

1. reject the null hypothesis and conclude that the sample data provide sufficient statistical evidence to support the alternative hypothesis, or
2. fail to reject the null hypothesis and conclude that the sample data do not support the alternative hypothesis.

If we fail to reject the null hypothesis, then we can only accept as valid the existing theory or belief, but we can never prove it.

Example 7.1  A Legal Analogy for Hypothesis Testing

A good analogy for hypothesis testing is the U.S. legal system. In our system of justice, a defendant is innocent until proven guilty. The null hypothesis (our belief in the absence of any contradictory evidence) is not guilty, whereas the alternative hypothesis is guilty. If the evidence (sample data) strongly indicates that the defendant is guilty, then we reject the assumption of innocence. If the evidence is not sufficient to indicate guilt, then we cannot reject the not guilty hypothesis; however, we haven’t proven that the defendant is innocent. In reality, you can only conclude that a defendant is guilty from the evidence; you still have not proven it!


Hypothesis-Testing Procedure

Conducting a hypothesis test involves several steps:

1. Identifying the population parameter of interest and formulating the hypotheses to test
2. Selecting a level of significance, which defines the risk of drawing an incorrect conclusion when the assumed hypothesis is actually true
3. Determining a decision rule on which to base a conclusion
4. Collecting data and calculating a test statistic
5. Applying the decision rule to the test statistic and drawing a conclusion

We apply this procedure to two different types of hypothesis tests: first, tests involving a single population (called one-sample tests) and, later, tests involving more than one population (multiple-sample tests).

One-Sample Hypothesis Tests

A one-sample hypothesis test is one that involves a single population parameter, such as the mean, proportion, standard deviation, and so on. To conduct the test, we use a single sample of data from the population. We may conduct three types of one-sample hypothesis tests:

H0: population parameter ≥ constant vs. H1: population parameter < constant
H0: population parameter ≤ constant vs. H1: population parameter > constant
H0: population parameter = constant vs. H1: population parameter ≠ constant

Notice that one-sample tests always compare a population parameter to some constant. For one-sample tests, the statements of the null hypotheses are expressed as either ≥, ≤, or =. It is not correct to formulate a null hypothesis using >, <, or ≠.

How do we determine the proper form of the null and alternative hypotheses? Hypothesis testing always assumes that H0 is true and uses sample data to determine whether H1 is more likely to be true. Statistically, we cannot “prove” that H0 is true; we can only fail to reject it. Thus, if we cannot reject the null hypothesis, we have shown only that there is insufficient evidence to conclude that the alternative hypothesis is true. However, rejecting the null hypothesis provides strong evidence (in a statistical sense) that the null hypothesis is not true and that the alternative hypothesis is true. Therefore, what we wish to provide evidence for statistically should be identified as the alternative hypothesis.

Example 7.2  Formulating a One-Sample Test of Hypothesis

CadSoft, a producer of computer-aided design software for the aerospace industry, receives numerous calls for technical support. In the past, the average response time has been at least 25 minutes. The company has upgraded its information systems and believes that this will help reduce response time. As a result, it believes that the average response time can be reduced to less than 25 minutes. The company collected a sample of 44 response times in the Excel file CadSoft Technical Support Response Times (see Figure 7.1).

Figure 7.1  Portion of Technical Support Response-Time Data

If the new information system makes a difference, then data should be able to confirm that the mean response time is less than 25 minutes; this defines the alternative hypothesis, H1. Therefore, the proper statements of the null and alternative hypotheses are:

H0: population mean response time ≥ 25 minutes
H1: population mean response time < 25 minutes

We would typically write this using the proper symbol for the population parameter. In this case, letting μ be the mean response time, we would write:

H0: μ ≥ 25
H1: μ < 25

Understanding Potential Errors in Hypothesis Testing

We already know that sample data can show considerable variation; therefore, conclusions based on sample data may be wrong. Hypothesis testing can result in one of four different outcomes:

1. The null hypothesis is actually true, and the test correctly fails to reject it.
2. The null hypothesis is actually false, and the hypothesis test correctly reaches this conclusion.
3. The null hypothesis is actually true, but the hypothesis test incorrectly rejects it (called Type I error).
4. The null hypothesis is actually false, but the hypothesis test incorrectly fails to reject it (called Type II error).

The probability of making a Type I error, that is, P(rejecting H0 | H0 is true), is denoted by α and is called the level of significance. This defines the risk that you are willing to take of incorrectly concluding that the alternative hypothesis is true when, in fact, the null hypothesis is true. The value of α can be controlled by the decision maker and is selected before the test is conducted. Commonly used levels for α are 0.10, 0.05, and 0.01. The probability of correctly failing to reject the null hypothesis, or P(not rejecting H0 | H0 is true), is called the confidence coefficient and is calculated as 1 − α. A confidence coefficient of 0.95 means that we expect 95 out of 100 samples to support the null hypothesis rather than the alternative hypothesis when H0 is actually true.

Unfortunately, we cannot control the probability of a Type II error, P(not rejecting H0 | H0 is false), which is denoted by β. Unlike α, β cannot be specified in advance but depends on the true value of the (unknown) population parameter.
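The interpretation of the level of significance as a long-run error rate can be illustrated with a short simulation. The Python sketch below is not part of the original text. It assumes a normally distributed population with a known standard deviation, so that a simple lower-tailed z-test applies; the hypothesized mean of 25 and the sample size of 44 follow the CadSoft example, while the standard deviation of 20, the number of trials, and the function name are illustrative assumptions.

```python
import random
from statistics import NormalDist

def type_i_error_rate(mu0=25.0, sigma=20.0, n=44, alpha=0.05,
                      trials=10_000, seed=1):
    """Estimate how often a lower-tailed z-test rejects H0 when H0 is true."""
    rng = random.Random(seed)              # fixed seed so the run is repeatable
    z_crit = NormalDist().inv_cdf(alpha)   # lower-tail critical value (negative)
    std_error = sigma / n ** 0.5
    rejections = 0
    for _ in range(trials):
        # Draw a sample from a population whose true mean equals mu0 (H0 is true).
        sample_mean = sum(rng.gauss(mu0, sigma) for _ in range(n)) / n
        z = (sample_mean - mu0) / std_error
        if z < z_crit:                     # test statistic falls in the rejection region
            rejections += 1
    return rejections / trials

rate = type_i_error_rate()
print(f"Estimated Type I error rate: {rate:.3f}")  # should be close to alpha
```

Under these assumptions, the fraction of samples that wrongly reject H0 should settle near the chosen level of significance, which is the sense in which the decision maker controls the Type I error.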


Example 7.3  How β Depends on the True Population Mean

Consider the hypotheses in the CadSoft example:

H0: mean response time ≥ 25 minutes
H1: mean response time < 25 minutes

If the true mean response from which the sample is drawn is, say, 15 minutes, we would expect to have a much smaller probability of incorrectly concluding that the null hypothesis is true than when the true mean response is 24 minutes, for example. If the true mean were 15 minutes, the sample mean would very likely be much less than 25, leading us to reject H0. If the true mean were 24 minutes, even though it is less than 25, we would have a much higher probability of failing to reject H0 because a higher likelihood exists that the sample mean would be greater than 25 due to sampling error. Thus, the farther away the true mean response time is from the hypothesized value, the smaller is β. Generally, as α decreases, β increases, so the decision maker must consider the trade-offs of these risks. So, if you choose a level of significance of 0.01 instead of 0.05 and keep the sample size constant, you would reduce the probability of a Type I error but increase the probability of a Type II error.

The value 1 − β is called the power of the test and represents the probability of correctly rejecting the null hypothesis when it is indeed false, or P(rejecting H0 | H0 is false). We would like the power of the test to be high (equivalently, we would like the probability of a Type II error to be low) to allow us to make a valid conclusion. The power of the test is sensitive to the sample size; small sample sizes generally result in a low value of 1 − β. The power of the test can be increased by taking larger samples, which enable us to detect small differences between the sample statistics and population parameters with more accuracy. However, a larger sample size incurs higher costs, giving new meaning to the adage, there is no such thing as a free lunch. This suggests that if you choose a small level of significance, you should try to compensate by having a large sample size when you conduct the test.
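The dependence of power on the true mean, the level of significance, and the sample size can be made concrete with a rough calculation. The following Python sketch is an illustration only: it treats the population standard deviation as known (an assumed value of 20) so that the normal distribution can be used, whereas the text's examples rely on the t-distribution. The function name and default values are assumptions chosen to resemble the CadSoft setting.

```python
from statistics import NormalDist

def power_lower_tail(true_mean, mu0=25.0, sigma=20.0, n=44, alpha=0.05):
    """Power of a lower-tailed z-test: P(reject H0 | the mean is true_mean)."""
    std_error = sigma / n ** 0.5
    z_crit = NormalDist().inv_cdf(alpha)     # negative for a lower-tailed test
    cutoff = mu0 + z_crit * std_error        # reject H0 when the sample mean < cutoff
    # Probability that the sample mean falls below the cutoff under the true mean.
    return NormalDist(true_mean, std_error).cdf(cutoff)

for mean in (15, 24):
    power = power_lower_tail(mean)
    print(f"true mean {mean}: power = {power:.3f}, beta = {1 - power:.3f}")

# A smaller alpha lowers power; a larger sample raises it.
print(f"alpha = 0.01, true mean 15: power = {power_lower_tail(15, alpha=0.01):.3f}")
print(f"n = 176, true mean 24: power = {power_lower_tail(24, n=176):.3f}")
```

Under these assumptions, a true mean far from 25 gives high power, a true mean close to 25 gives low power, and the last two lines show the trade-offs with α and with the sample size described above.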

Selecting the Test Statistic

The next step is to collect sample data and use the data to draw a conclusion. The decision to reject or fail to reject a null hypothesis is based on computing a test statistic from the sample data. The test statistic used depends on the type of hypothesis test. Different types of hypothesis tests use different test statistics, and it is important to use the correct one. The proper test statistic often depends on certain assumptions about the population (for example, whether or not the standard deviation is known). The following formulas show two types of one-sample hypothesis tests for means and their associated test statistics. The value of μ0 is the hypothesized value of the population mean; that is, the “constant” in the hypothesis formulation.

Type of Test: One-sample test for mean, σ known
Test Statistic:
$$ z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}} \tag{7.1} $$

Type of Test: One-sample test for mean, σ unknown
Test Statistic:
$$ t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} \tag{7.2} $$


Example 7.4  Computing the Test Statistic

For the CadSoft example, the average response time for the sample of 44 customers is $\bar{x} = 21.91$ minutes and the sample standard deviation is s = 19.49. The hypothesized mean is μ0 = 25. You might wonder why we even have to test the hypothesis statistically when the sample average of 21.91 is clearly less than 25. The reason is sampling error. It is quite possible that the population mean truly is 25 or more and that we were just lucky to draw a sample whose mean was smaller. Because of potential sampling error, it would be dangerous to conclude that the company was meeting its goal just by looking at the sample mean without better statistical evidence. Because we don’t know the value of the population standard deviation, the proper test statistic to use is formula (7.2):

$$ t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} $$

Therefore, the value of the test statistic is

$$ t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} = \frac{21.91 - 25}{19.49 / \sqrt{44}} = \frac{-3.09}{2.938} = -1.05 $$

Observe that the numerator is the distance between the sample mean (21.91) and the hypothesized value (25). By dividing by the standard error, the value of t represents the number of standard errors the sample mean is from the hypothesized value. In this case, the sample mean is 1.05 standard errors below the hypothesized value of 25. This notion provides the fundamental basis for the hypothesis test—if the sample mean is “too far” away from the hypothesized value, then the null hypothesis should be rejected.
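The arithmetic in Example 7.4 can be checked with a few lines of Python. This is a minimal sketch of formula (7.2); the function name is an illustrative choice, not something defined in the text.

```python
import math

def one_sample_t_statistic(sample_mean, hypothesized_mean, sample_sd, n):
    """Formula (7.2): standard errors between the sample mean and mu0."""
    standard_error = sample_sd / math.sqrt(n)
    return (sample_mean - hypothesized_mean) / standard_error

# CadSoft values from Example 7.4
t_stat = one_sample_t_statistic(21.91, 25, 19.49, 44)
print(round(t_stat, 2))  # -1.05
```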

Drawing a Conclusion

The conclusion to reject or fail to reject H0 is based on comparing the value of the test statistic to a “critical value” from the sampling distribution of the test statistic when the null hypothesis is true and the chosen level of significance, α. The sampling distribution of the test statistic is usually the normal distribution, t-distribution, or some other well-known distribution. For example, the sampling distribution of the z-test statistic in formula (7.1) is a standard normal distribution; the t-test statistic in formula (7.2) has a t-distribution with n − 1 degrees of freedom. For a one-tailed test, the critical value is the number of standard errors away from the hypothesized value for which the probability of exceeding the critical value is α. If α = 0.05, for example, then we are saying that there is only a 5% chance that a sample mean will be that far away from the hypothesized value purely because of sampling error, and should this occur, it suggests that the true population mean is different from what was hypothesized.

The critical value divides the sampling distribution into two parts, a rejection region and a nonrejection region. If the null hypothesis is false, it is more likely that the test statistic will fall into the rejection region. If it does, we reject the null hypothesis; otherwise, we fail to reject it. The rejection region is chosen so that the probability of the test statistic falling into it if H0 is true is the probability of a Type I error, α.

The rejection region occurs in the tails of the sampling distribution of the test statistic and depends on the structure of the hypothesis test, as shown in Figure 7.2. If the null hypothesis is structured as = and the alternative hypothesis as ≠, then we would reject H0 if the test statistic is either significantly high or low. In this case, the rejection region will occur in both the upper and lower tail of the distribution [see Figure 7.2(a)]. This is called a two-tailed test of hypothesis. Because the probability that the test statistic falls into the rejection region, given that H0 is true, must be α, the combined area of both tails is α; each tail has an area of α/2.

Figure 7.2  Illustration of Rejection Regions in Hypothesis Testing. (a) Two-tailed test: rejection regions of area α/2 below the lower critical value and above the upper critical value. (b) One-tailed tests: a single rejection region below the critical value (lower one-tailed test) or above the critical value (upper one-tailed test).

The other types of hypothesis tests, which specify a direction of relationship (where H0 is either ≥ or ≤), are called one-tailed tests of hypothesis. In this case, the rejection region occurs only in one tail of the distribution [see Figure 7.2(b)]. Determining the correct tail of the distribution to use as the rejection region for a one-tailed test is easy. If H1 is stated as <, the rejection region is in the lower tail; if H1 is stated as >, the rejection region is in the upper tail (just think of the inequality as an arrow pointing to the proper tail direction).

Two-tailed tests have both upper and lower critical values, whereas one-tailed tests have either a lower or upper critical value. For standard normal and t-distributions, which have a mean of zero, lower-tail critical values are negative; upper-tail critical values are positive. Critical values make it easy to determine whether or not the test statistic falls in the rejection region of the proper sampling distribution. For example, for an upper one-tailed test, if the test statistic is greater than the critical value, the decision would be to reject the null hypothesis. Similarly, for a lower one-tailed test, if the test statistic is less than the critical value, we would reject the null hypothesis. For a two-tailed test, if the test statistic is either greater than the upper critical value or less than the lower critical value, the decision would be to reject the null hypothesis.

Example 7.5  Finding the Critical Value and Drawing a Conclusion

For the CadSoft example, if the level of significance is 0.05, then the critical value for a one-tail test is the value of the t-distribution with n − 1 degrees of freedom that provides a tail area of 0.05, that is, $t_{\alpha,n-1}$. We may find t-values in Table A.2 in Appendix A at the end of the book or by using the Excel function T.INV(1 − α, n − 1). Thus, the critical value is $t_{0.05,43}$ = T.INV(0.95, 43) = 1.68. Because the t-distribution is symmetric with a mean of 0 and this is a lower-tail test, we use the negative of this number (−1.68) as the critical value.

By comparing the value of the t-test statistic with this critical value, we see that the test statistic does not fall below the critical value (i.e., −1.05 > −1.68) and is not in the rejection region. Therefore, we cannot reject H0 and cannot conclude that the mean response time has improved to less than 25 minutes. Figure 7.3 illustrates the conclusion we reached. Even though the sample mean is less than 25, we cannot conclude that the population mean response time is less than 25 because of the large amount of sampling error.

Figure 7.3  t-Test for Mean Response Time (rejection region to the left of the critical value −1.68; test statistic at −1.05)

Two-Tailed Test of Hypothesis for the Mean

Basically, all hypothesis tests are similar; you just have to ensure that you select the correct test statistic, critical value, and rejection region, depending on the type of hypothesis. The following example illustrates a two-tailed test of hypothesis for the mean.

Example 7.6  Conducting a Two-Tailed Hypothesis Test for the Mean

Figure 7.4 shows a portion of data collected in a survey of 34 respondents by a travel agency (provided in the Excel file Vacation Survey). Suppose that the travel agency wanted to target individuals who were approximately 35 years old. Thus, we wish to test whether the average age of respondents is equal to 35. The hypothesis to test is

H0: mean age = 35
H1: mean age ≠ 35

The sample mean is computed to be 38.677, and the sample standard deviation is 7.858. We use the t-test statistic:

$$ t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} = \frac{38.677 - 35}{7.858 / \sqrt{34}} = 2.73 $$

In this case, the sample mean is 2.73 standard errors above the hypothesized mean of 35. However, because this is a two-tailed test, the rejection region and decision rule are different. For a level of significance α, we reject H0 if the t-test statistic falls either below the negative critical value, $-t_{\alpha/2,n-1}$, or above the positive critical value, $t_{\alpha/2,n-1}$. Using either Table A.2 in Appendix A at the back of this book or the Excel function T.INV.2T(0.05, 33) to calculate $t_{0.025,33}$, we obtain 2.0345. Thus, the critical values are ±2.0345. Because the t-test statistic does not fall between these values, we must reject the null hypothesis that the average age is 35 (see Figure 7.5).
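The critical-value decision rules for lower-tailed, upper-tailed, and two-tailed tests can be summarized in a small function. This Python sketch is illustrative only: the function name and the `tail` labels are assumptions, and the critical values are simply the table values quoted in Examples 7.5 and 7.6 rather than values computed by the code.

```python
def reject_null(test_stat, critical_value, tail):
    """Apply the critical-value decision rule.

    critical_value is the positive table value (for example, 1.68 or 2.0345);
    tail is "lower", "upper", or "two".
    """
    if tail == "lower":
        return test_stat < -critical_value      # rejection region in the lower tail
    if tail == "upper":
        return test_stat > critical_value       # rejection region in the upper tail
    if tail == "two":
        return abs(test_stat) > critical_value  # rejection region in both tails
    raise ValueError("tail must be 'lower', 'upper', or 'two'")

print(reject_null(-1.05, 1.68, "lower"))  # False: fail to reject H0 (Example 7.5)
print(reject_null(2.73, 2.0345, "two"))   # True: reject H0 (Example 7.6)
```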

Figure 7.4  Portion of Vacation Survey Data

Figure 7.5  Illustration of a Two-Tailed Test for Example 7.6 (rejection regions below −2.0345 and above 2.0345; test statistic at 2.73)

p-Values

An alternative approach to comparing a test statistic to a critical value in hypothesis testing is to find the probability of obtaining a test statistic value equal to or more extreme than that obtained from the sample data when the null hypothesis is true. This probability is commonly called a p-value, or observed significance level. To draw a conclusion, compare the p-value to the chosen level of significance α; whenever p < α, reject the null hypothesis and otherwise fail to reject it. p-Values make it easy to draw conclusions about hypothesis tests. For a lower one-tailed test, the p-value is the probability to the left of the test statistic t in the t-distribution, and is found by T.DIST(t, n − 1, TRUE). For an upper one-tailed test, the p-value is the probability to the right of the test statistic t, and is found by 1 − T.DIST(t, n − 1, TRUE). For a two-tailed test, the p-value is found by T.DIST.2T(t, n − 1) if t > 0; if t < 0, use T.DIST.2T(−t, n − 1).

Example 7.7  Using p-Values

For the CadSoft example, the t-test statistic for the hypothesis test in the response-time example is −1.05. If the true mean is really 25, then the p-value is the probability of obtaining a test statistic of −1.05 or less (the area to the left of −1.05 in Figure 7.3). We can calculate the p-value using the Excel function T.DIST(−1.05, 43, TRUE) = 0.1498. Because p = 0.1498 is not less than α = 0.05, we do not reject H0. In other words, there is about a 15% chance that the test statistic would be −1.05 or smaller if the null hypothesis were true. This is a fairly high probability, so it would be difficult to conclude that the true mean is less than 25. We could attribute the fact that the sample mean is less than the hypothesized value to sampling error alone and not reject the null hypothesis.

For the Vacation Survey two-tailed hypothesis test in Example 7.6, the p-value is 0.010, which can be computed by the Excel function T.DIST.2T(2.73, 33); therefore, since 0.010 < 0.05, we reject H0.
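Outside Excel, the same p-values can be approximated directly from the definition of the t-distribution. The Python sketch below is an illustration, not the method used in the text: because the Python standard library has no t-distribution, it integrates the t density numerically with Simpson's rule as a stand-in for T.DIST and T.DIST.2T. The function names, the `tail` labels, and the number of integration steps are assumptions.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(x, df, steps=2000):
    """P(T <= x), integrating the density over [0, |x|] with Simpson's rule."""
    a = abs(x)
    h = a / steps                        # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3                 # area between 0 and |x|
    return 0.5 + area if x >= 0 else 0.5 - area

def p_value(t_stat, df, tail):
    """p-value for a lower-tailed, upper-tailed, or two-tailed t-test."""
    if tail == "lower":
        return t_cdf(t_stat, df)
    if tail == "upper":
        return 1 - t_cdf(t_stat, df)
    if tail == "two":
        return 2 * (1 - t_cdf(abs(t_stat), df))
    raise ValueError("tail must be 'lower', 'upper', or 'two'")

p_cadsoft = p_value(-1.05, 43, "lower")   # CadSoft response times
p_vacation = p_value(2.73, 33, "two")     # Vacation Survey, Example 7.6
print(round(p_cadsoft, 2), round(p_vacation, 2))  # roughly 0.15 and 0.01
```

Comparing each result with α = 0.05 leads to the same conclusions as in Example 7.7: do not reject H0 for the CadSoft test, and reject H0 for the Vacation Survey test.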

One-Sample Tests for Proportions

Many important business measures, such as market share or the fraction of deliveries received on time, are expressed as proportions. We may conduct a test of hypothesis about a population proportion in a similar fashion as we did for means. The test statistic for a one-sample test for proportions is

$$ z = \frac{\hat{p} - \pi_0}{\sqrt{\pi_0 (1 - \pi_0) / n}} \tag{7.3} $$

where π0 is the hypothesized value and $\hat{p}$ is the sample proportion. Similar to the test statistic for means, the z-test statistic shows the number of standard errors that the sample proportion is from the hypothesized value. The sampling distribution of this test statistic has a standard normal distribution.

Example 7.8  A One-Sample Test for the Proportion

CadSoft also sampled 44 customers and asked them to rate the overall quality of the company’s software product using a scale of

0 = very poor
1 = poor
2 = good
3 = very good
4 = excellent

These data can be found in the Excel file CadSoft Product Satisfaction Survey. The firm tracks customer satisfaction of quality by measuring the proportion of responses in the top two categories. In the past, this proportion has averaged about 75%. For these data, 35 of the 44 responses, or 79.5%, are in the top two categories. Is there sufficient evidence to conclude that this satisfaction measure has significantly exceeded 75% using a significance level of 0.05? Answering this question involves testing the hypotheses about the population proportion π:

H0: π ≤ 0.75
H1: π > 0.75

This is an upper one-tailed test. The test statistic is computed using formula (7.3):

$$ z = \frac{0.795 - 0.75}{\sqrt{0.75 (1 - 0.75) / 44}} = 0.69 $$

In this case, the sample proportion of 0.795 is 0.69 standard errors above the hypothesized value of 0.75. Because this is an upper-tailed test, we reject H0 if the value of the test statistic is larger than the critical value. Because the sampling distribution of z is a standard normal, the critical value of z for a level of significance of 0.05 is found by the Excel function NORM.S.INV(0.95) = 1.645. Because the test statistic does not exceed the critical value, we cannot reject the null hypothesis that the proportion is no greater than 0.75. Thus, even though the sample proportion exceeds 0.75, we cannot conclude statistically that the customer satisfaction ratings have significantly improved. We could attribute this to sampling error and the relatively small sample size. The p-value can be found by computing the area to the right of the test statistic in the standard normal distribution: 1 − NORM.S.DIST(0.69, TRUE) = 0.245. Note that the p-value is greater than the significance level of 0.05, leading to the same conclusion of not rejecting the null hypothesis.

For a lower-tailed test, the p-value would be computed by the area to the left of the test statistic; that is, NORM.S.DIST(z, TRUE). If we had a two-tailed test, the p-value is 2*NORM.S.DIST(z, TRUE) if z < 0; otherwise, the p-value is 2*(1 − NORM.S.DIST(z, TRUE)).
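The proportion test in Example 7.8, together with the p-value rules for each tail, can be sketched in Python using the standard library's `NormalDist` class in place of NORM.S.DIST and NORM.S.INV. The function name and the `tail` labels are illustrative assumptions; the rounded sample proportion 0.795 is used to match the hand calculation in the example.

```python
import math
from statistics import NormalDist

def proportion_z_test(p_hat, pi0, n, tail="upper"):
    """One-sample z-test for a proportion, formula (7.3). Returns (z, p_value)."""
    z = (p_hat - pi0) / math.sqrt(pi0 * (1 - pi0) / n)
    cdf = NormalDist().cdf(z)            # area to the left of z
    if tail == "lower":
        p_value = cdf
    elif tail == "upper":
        p_value = 1 - cdf
    elif tail == "two":
        p_value = 2 * min(cdf, 1 - cdf)  # twice the smaller tail area
    else:
        raise ValueError("tail must be 'lower', 'upper', or 'two'")
    return z, p_value

z, p = proportion_z_test(0.795, 0.75, 44, tail="upper")
z_crit = NormalDist().inv_cdf(0.95)      # same role as NORM.S.INV(0.95)
print(round(z, 2), round(z_crit, 3))     # 0.69 1.645
print(z > z_crit, p < 0.05)              # False False: fail to reject H0
```

Both the critical-value comparison and the p-value comparison point to the same decision, as the example concludes.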

Confidence Intervals and Hypothesis Tests

A close relationship exists between confidence intervals and hypothesis tests. For example, suppose we construct a 95% confidence interval for the mean. If we wish to test the hypotheses

H0: μ = μ0
H1: μ ≠ μ0

at a 5% level of significance, we simply check whether the hypothesized value μ0 falls within the confidence interval. If it does not, then we reject H0; if it does, then we cannot reject H0.


For one-tailed tests, we need to examine on which side of the hypothesized value the confidence interval falls. For a lower-tailed test, if the confidence interval falls entirely below the hypothesized value, we reject the null hypothesis. For an upper-tailed test, if the confidence interval falls entirely above the hypothesized value, we also reject the null hypothesis.
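The two-tailed check against a confidence interval can be illustrated with the Vacation Survey figures from Example 7.6. In this Python sketch, the sample mean (38.677), sample standard deviation (7.858), sample size (34), and critical value (2.0345) are taken from that example; the function name is an illustrative assumption, and the critical value is supplied from the table rather than computed.

```python
import math

def t_confidence_interval(sample_mean, sample_sd, n, t_critical):
    """Confidence interval for the mean: sample mean +/- t * s / sqrt(n)."""
    margin = t_critical * sample_sd / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin

lower, upper = t_confidence_interval(38.677, 7.858, 34, 2.0345)
mu0 = 35                                  # hypothesized mean age
print(round(lower, 2), round(upper, 2))   # 35.94 41.42
print("reject H0" if not (lower <= mu0 <= upper) else "fail to reject H0")
```

Because 35 lies outside the 95% interval, the check rejects H0 at the 5% level, which agrees with the critical-value and p-value conclusions reached for the same data.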

Two-Sample Hypothesis Tests

Many practical applications of hypothesis testing involve comparing two populations for differences in means, proportions, or other population parameters. Such tests can confirm differences between suppliers, performance at two different factory locations, new and old work methods or reward and recognition programs, and many other situations. Similar to one-sample tests, two-sample hypothesis tests for differences in population parameters have one of the following forms:

1. Lower-tailed test H0: population parameter (1) − population parameter (2) ≥ D0 vs. H1: population parameter (1) − population parameter (2) < D0. This test seeks evidence that the difference between population parameter (1) and population parameter (2) is less than some value, D0. When D0 = 0, the test simply seeks to conclude whether population parameter (1) is smaller than population parameter (2).
2. Upper-tailed test H0: population parameter (1) − population parameter (2) ≤ D0 vs. H1: population parameter (1) − population parameter (2) > D0. This test seeks evidence that the difference between population parameter (1) and population parameter (2) is greater than some value, D0. When D0 = 0, the test simply seeks to conclude whether population parameter (1) is larger than population parameter (2).
3. Two-tailed test H0: population parameter (1) − population parameter (2) = D0 vs. H1: population parameter (1) − population parameter (2) ≠ D0. This test seeks evidence that the difference between the population parameters is not equal to D0. When D0 = 0, we are seeking evidence that population parameter (1) differs from population parameter (2).

In most applications D0 = 0, and we are simply seeking to compare the population parameters. However, there are situations when we might want to determine if the parameters differ by some non-zero amount; for example, “job classification A makes at least $5,000 more than job classification B.”

The hypothesis-testing procedures are similar to those previously discussed in the sense of computing a test statistic and comparing it to a critical value. However, the test statistics for two-sample tests are more complicated than for one-sample tests and we will not delve into the mathematical details. Fortunately, Excel provides several tools for conducting two-sample tests, and we will use these in our examples. Table 7.1 summarizes the Excel Analysis Toolpak procedures that we will use.

Two-Sample Tests for Differences in Means

In a two-sample test for differences in means, we always test hypotheses of the form

H0: μ1 − μ2 {≥, ≤, or =} 0
H1: μ1 − μ2 {<, >, or ≠} 0    (7.4)


Table 7.1  Excel Analysis Toolpak Procedures for Two-Sample Hypothesis Tests

Type of Test                                                  Excel Procedure
Two-sample test for means, σ² known                           Excel z-test: Two-sample for means
Two-sample test for means, σ² unknown, assumed unequal        Excel t-test: Two-sample assuming unequal variances
Two-sample test for means, σ² unknown, assumed equal          Excel t-test: Two-sample assuming equal variances
Paired two-sample test for means                              Excel t-test: Paired two-sample for means
Two-sample test for equality of variances                     Excel F-test Two-sample for variances

Example 7.9  Comparing Supplier Performance

The last two columns in the Purchase Orders data file provide the order date and arrival date of all orders placed with each supplier. The time between placement of an order and its arrival is commonly called the lead time. We may compute the lead time by subtracting the Excel date function values from each other (Arrival Date − Order Date), as shown in Figure 7.6. Figure 7.7 shows a pivot table for the average lead time for each supplier. Purchasing managers have noted that they order many of the same types of items from Alum Sheeting and Durrable Products and are considering dropping Alum Sheeting from their supplier base if its lead time is significantly longer than that of Durrable Products. Therefore, they would like to test the hypothesis

H0: μ1 − μ2 ≤ 0
H1: μ1 − μ2 > 0

where μ1 = mean lead time for Alum Sheeting and μ2 = mean lead time for Durrable Products. Rejecting the null hypothesis suggests that the average lead time for Alum Sheeting is statistically longer than that of Durrable Products. However, if we cannot reject the null hypothesis, then even though the mean lead time for Alum Sheeting is longer, the difference would most likely be due to sampling error, and we could not conclude that there is a statistically significant difference.

Selection of the proper test statistic and Excel procedure for a two-sample test for means depends on whether the population standard deviations are known, and if not, whether they are assumed to be equal.

Figure 7.6 Portion of Purchase Orders Database with Lead Time Calculations

1. Population variance is known. In Excel, choose z-Test: Two-Sample for Means from the Data Analysis menu. This test uses a test statistic that is based on the standard normal distribution.
2. Population variance is unknown and assumed unequal. From the Data Analysis menu, choose t-test: Two-Sample Assuming Unequal Variances. The test statistic for this case has a t-distribution.

Figure 7.7 Pivot Table for Average Supplier Lead Time

3. Population variance is unknown but assumed equal. In Excel, choose t-test: Two-Sample Assuming Equal Variances. The test statistic also has a t-distribution, but it is different from the unequal variance case.

These tools calculate the test statistic, the p-value for both a one-tail and two-tail test, and the critical values for one-tail and two-tail tests. For the z-test with known population variances, these are called z, P(Z<=z) one-tail or P(Z<=z) two-tail, and z Critical one-tail or z Critical two-tail, respectively. For the t-tests, these are called t Stat, P(T<=t) one-tail or P(T<=t) two-tail, and t Critical one-tail or t Critical two-tail, respectively. Caution: You must be very careful in interpreting the output information from these Excel tools and apply the following rules:

1. If the test statistic is negative, the one-tailed p-value is the correct p-value for a lower-tail test; however, for an upper-tail test, you must subtract this number from 1.0 to get the correct p-value.
2. If the test statistic is nonnegative (positive or zero), then the p-value in the output is the correct p-value for an upper-tail test; but for a lower-tail test, you must subtract this number from 1.0 to get the correct p-value.
3. For a lower-tail test, you must change the sign of the one-tailed critical value.

Only rarely are the population variances known; also, it is often difficult to justify the assumption that the variances of each population are equal. Therefore, in most practical situations, we use the t-test: Two-Sample Assuming Unequal Variances. This procedure also works well with small sample sizes if the populations are approximately normal. It is recommended that the size of each sample be approximately the same and total 20 or more. If the populations are highly skewed, then larger sample sizes are recommended.
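The three interpretation rules above can be written down as a small helper. The following Python sketch is only an illustration of those rules, not part of Excel; the function names and the sample numbers are hypothetical.

```python
def one_tail_p_value(test_stat, reported_one_tail_p, tail):
    """Rules 1 and 2: return the p-value that matches the tail being tested.

    test_stat           -- the z or t Stat value from the Excel output
    reported_one_tail_p -- the one-tail p-value printed by the tool
    tail                -- "upper" or "lower", the direction of H1
    """
    if tail not in ("upper", "lower"):
        raise ValueError("tail must be 'upper' or 'lower'")
    # The printed value is usable as-is only when the sign of the statistic
    # agrees with the tail: negative for lower-tail, nonnegative for upper-tail.
    sign_matches_tail = (test_stat < 0) == (tail == "lower")
    if sign_matches_tail:
        return reported_one_tail_p
    return 1.0 - reported_one_tail_p


def one_tail_critical_value(reported_critical, tail):
    """Rule 3: a lower-tail test uses the negative of the printed critical value."""
    if tail not in ("upper", "lower"):
        raise ValueError("tail must be 'upper' or 'lower'")
    if tail == "lower":
        return -abs(reported_critical)
    return abs(reported_critical)


# Hypothetical output values, for illustration only.
t_stat = 2.4
p_printed = 0.018
print(one_tail_p_value(t_stat, p_printed, "upper"))  # printed value used as-is
print(one_tail_p_value(t_stat, p_printed, "lower"))  # 1.0 minus the printed value
print(one_tail_critical_value(1.812, "lower"))       # sign changed for lower tail
```

Keeping the tail direction as an explicit argument makes it harder to read the one-tail p-value off the output without checking the sign of the test statistic first.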

Example 7.10  Testing the Hypotheses for Supplier Lead-Time Performance

To conduct the hypothesis test for comparing the lead times for Alum Sheeting and Durrable Products, first sort the data by supplier and then select t-test: Two-Sample Assuming Unequal Variances from the Data Analysis menu. The dialog is shown in Figure 7.8. The dialog prompts you for the range of the data for each variable, hypothesized mean difference, whether the ranges have labels, and the level of significance $\alpha$. If you leave the box Hypothesized Mean Difference blank or enter zero, the test is for equality of means. However, the tool allows you to specify a value $D_0$ to test the hypothesis $H_0: \mu_1 - \mu_2 = D_0$ if you want to test whether the population means have a certain distance between them. In this example, the Variable 1 range defines the lead times for Alum Sheeting, and the Variable 2 range for Durrable Products.

Figure 7.9 shows the results from the tool. The tool provides information for both one-tailed and two-tailed tests. Because this is a one-tailed test, we use the highlighted information in Figure 7.9 to draw our conclusions. For this example, t Stat is positive and we have an upper-tailed test; therefore, using the rules stated earlier, the p-value is 0.00166. Based on this alone, we reject the null hypothesis and must conclude that Alum Sheeting has a statistically longer average lead time than Durrable Products. We may draw the same conclusion by comparing the value of t Stat with the critical value t Critical one-tail. Being an upper-tail test, the value of t Critical one-tail is 1.812. We would reject $H_0$ only if t Stat > t Critical one-tail. Since t Stat is greater than t Critical one-tail, we reject the null hypothesis.
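Although we rely on Excel for the calculations, it can help to see what the unequal-variance test computes. The sketch below uses the standard Welch form of the test statistic and its approximate degrees of freedom; we assume, without having checked the tool's internals, that this is the statistic reported as t Stat. The function name and the two small samples are hypothetical and are not the Purchase Orders data.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample1, sample2, hypothesized_diff=0.0):
    """Return (t statistic, approximate degrees of freedom) for two samples
    whose population variances are unknown and not assumed equal."""
    n1, n2 = len(sample1), len(sample2)
    v1, v2 = variance(sample1), variance(sample2)  # sample variances (n - 1)
    se_sq = v1 / n1 + v2 / n2                      # squared standard error
    t_stat = (mean(sample1) - mean(sample2) - hypothesized_diff) / sqrt(se_sq)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = se_sq ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t_stat, df


# Hypothetical lead times (days) for two suppliers, for illustration only.
supplier_a = [8, 9, 7, 10, 9, 8]
supplier_b = [5, 6, 4, 6, 5, 4]

t_stat, df = welch_t(supplier_a, supplier_b)
print(round(t_stat, 3), round(df, 2))
```

For an upper-tail test, a positive t statistic larger than the one-tail critical value for these degrees of freedom would lead us to reject the null hypothesis, exactly as in the example.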

Figure 7.8 Dialog for Two-Sample t-Test, Sigma Unknown

Figure 7.9 Results for Two-Sample Test for Lead-Time Performance

Two-Sample Test for Means with Paired Samples

In the previous example for testing differences in the mean supplier lead times, we used independent samples; that is, the orders in each supplier’s sample were not related to each other. In many situations, data from two samples are naturally paired or matched. For example, suppose that a sample of assembly line workers perform a task using two different types of work methods, and the plant manager wants to determine if any differences exist between the two methods. In collecting the data, each worker will have performed the task using each method. Had we used independent samples, we would have randomly selected two different groups of employees and assigned one work method to one group and the alternative method to the second group. Each worker would have performed the task using only one of the methods. As another example, suppose that we wish to compare retail prices of grocery items between two competing grocery stores. It makes little sense to compare different samples of items from each store. Instead, we would select a sample of grocery items and find the price charged for the same items by each store. In this case, the samples are paired because each item would have a price from each of the two stores. When paired samples are used, a paired t-test is more accurate than assuming that the data come from independent populations. The null hypothesis we test revolves around the mean difference ($\mu_D$) between the paired samples; that is,

$$H_0: \mu_D \;\{\geq, \leq, \text{ or } =\}\; 0$$

$$H_1: \mu_D \;\{<, >, \text{ or } \neq\}\; 0$$

The test uses the average difference between the paired data and the standard deviation of the differences, similar to a one-sample test. Excel has a Data Analysis tool, t-Test: Paired Two-Sample for Means, for conducting this type of test. In the dialog, you need to enter only the variable ranges and hypothesized mean difference.
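Because the paired test reduces to a one-sample test on the differences, the calculation is short. The following Python sketch illustrates that idea under the usual textbook formula (mean difference divided by its standard error); the function name and the task times are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(first, second, hypothesized_diff=0.0):
    """Return (t statistic, degrees of freedom) for paired observations."""
    if len(first) != len(second):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(first, second)]  # one difference per pair
    n = len(diffs)
    std_error = stdev(diffs) / sqrt(n)              # sample std. dev. of differences
    t_stat = (mean(diffs) - hypothesized_diff) / std_error
    return t_stat, n - 1


# Hypothetical task times (minutes) for five workers using two work methods.
method_1 = [12, 15, 11, 14, 13]
method_2 = [10, 14, 10, 11, 12]

t_stat, df = paired_t(method_1, method_2)
print(round(t_stat, 3), df)
```

Note that the order of the two variables determines the sign of the statistic, which matters when applying the one-tail rules stated earlier.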

Example 7.11  Using the Paired Two-Sample Test for Means

The Excel file Pile Foundation contains the estimates used in a bid and actual auger-cast pile lengths that engineers ultimately had to use for a foundation-engineering project. The contractor’s past experience suggested that the bid information was generally accurate, so the average of the paired differences between the actual pile lengths and estimated lengths should be close to zero. After this project was completed, the contractor found that the average difference between the actual lengths and the estimated lengths was 6.38. Could the contractor conclude that the bid information was poor?

Figure 7.10 shows a portion of the data and the Excel dialog for the paired two-sample test. Figure 7.11 shows the output from the Excel tool using a significance level of 0.05, where Variable 1 is the estimated lengths, and Variable 2 is the actual lengths. This is a two-tailed test, so in Figure 7.11 we interpret the results using only the two-tail information that is highlighted. The critical values are ±1.968, and because t Stat is much smaller than the lower critical value, we must reject the null hypothesis and conclude that the mean of the differences between the estimates and the actual pile lengths is statistically significant. Note that the p-value is essentially zero, verifying this conclusion.

Figure 7.10 Portion of Excel File Pile Foundation

Figure 7.11 Excel Output for Paired Two-Sample Test for Means

Test for Equality of Variances

Understanding variation in business processes is very important, as we have stated before. For instance, does one location or group of employees show higher variability than others? We can test for equality of variances between two samples using a new type of test, the F-test. To use this test, we must assume that both samples are drawn from normal populations. The hypotheses we test are

$$H_0: \sigma_1^2 - \sigma_2^2 = 0$$

$$H_1: \sigma_1^2 - \sigma_2^2 \neq 0 \qquad (7.5)$$

To test these hypotheses, we collect samples of $n_1$ observations from population 1 and $n_2$ observations from population 2. The test uses an F-test statistic, which is the ratio of the variances of the two samples:

$$F = \frac{s_1^2}{s_2^2} \qquad (7.6)$$

The sampling distribution of this statistic is called the F-distribution. Similar to the t-distribution, it is characterized by degrees of freedom; however, the F-distribution has two degrees of freedom, one associated with the numerator of the F-statistic, $n_1 - 1$, and one associated with the denominator of the F-statistic, $n_2 - 1$. Table A.4 in Appendix A at the end of the book provides only upper-tail critical values, and the distribution is not symmetric, unlike the standard normal or the t-distribution. Therefore, although the hypothesis test is really a two-tailed test, we will simplify it as a one-tailed test to make it easy to use tables of the F-distribution and interpret the results of the Excel tool that we will use. We do this by ensuring that when we compute F, we take the ratio of the larger sample variance to the smaller sample variance.

If the variances differ significantly from each other, we would expect F to be much larger than 1; the closer F is to 1, the more likely it is that the variances are the same. Therefore, we need only to compare F to the upper-tail critical value. Hence, for a level of significance $\alpha$, we find the critical value $F_{\alpha/2,\,df_1,\,df_2}$ of the F-distribution, and then we reject the null hypothesis if the F-test statistic exceeds the critical value. Note that we are using $\alpha/2$ to find the critical value, not $\alpha$. This is because we are using only the upper tail information on which to base our conclusion.
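The convention of placing the larger sample variance in the numerator is easy to get wrong by hand, so a short sketch may help. The Python function below is a hypothetical helper, not an Excel feature; it computes formula (7.6) with the larger variance on top and returns the matching numerator and denominator degrees of freedom. The data are invented for illustration.

```python
from statistics import variance


def f_statistic(sample1, sample2):
    """Return (F, numerator df, denominator df) with the larger sample
    variance in the numerator, so that F is always at least 1."""
    v1, v2 = variance(sample1), variance(sample2)
    if min(v1, v2) == 0:
        raise ValueError("F is undefined when a sample variance is zero")
    if v1 >= v2:
        return v1 / v2, len(sample1) - 1, len(sample2) - 1
    # Swap roles: the second sample has the larger variance.
    return v2 / v1, len(sample2) - 1, len(sample1) - 1


# Hypothetical samples, for illustration only.
steady = [4, 5, 6, 5, 5]
erratic = [2, 8, 5, 9, 1]

f_value, df_num, df_den = f_statistic(steady, erratic)
print(f_value, df_num, df_den)
```

The resulting F would then be compared with the upper-tail critical value found using $\alpha/2$, as described above.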

Example 7.12  Applying the F-Test for Equality of Variances

To illustrate the F-test, suppose that we wish to determine whether the variance of lead times is the same for Alum Sheeting and Durrable Products in the Purchase Orders data. The F-test can be applied using the Excel Data Analysis tool F-Test Two-Sample for Variances. The dialog prompts you to enter the range of the sample data for each variable. As we noted, you should ensure that the first variable has the larger variance; this might require you to calculate the variances before you use the tool. In this case, the variance of the lead times for Alum Sheeting is larger than the variance for Durrable Products (see Figure 7.9), so this is assigned to Variable 1. Note also that if we choose $\alpha = 0.05$, we must enter 0.025 for the level of significance in the Excel dialog. The results are shown in Figure 7.12. The value of the F-statistic, F, is 3.467. We compare this with the upper-tail critical value, F Critical one-tail, which is 3.607. Because F < F Critical one-tail, we cannot reject the null hypothesis and conclude that the variances are not significantly different from each other. Note that the p-value is P(F<=f) one-tail = 0.0286. Although the level of significance is 0.05, remember that we must compare this to $\alpha/2 = 0.025$ because we are using only upper-tail information; since 0.0286 > 0.025, the p-value leads to the same conclusion.

Figure 7.12 Results for Two-Sample F-Test for Equality of Variances

The F-test for equality of variances is often used before testing for the difference in means so that the proper test (population variance is unknown and assumed unequal or population variance is unknown and assumed equal, which we discussed earlier in this chapter) is selected.

Analysis of Variance (ANOVA)

To this point, we have discussed hypothesis tests that compare a population parameter to a constant value or that compare the means of two different populations. Often, we would like to compare the means of several different groups to determine if all are equal or if any are significantly different from the rest.

Example 7.13  Differences in Insurance Survey Data

In the Excel data file Insurance Survey, we might be interested in whether any significant differences exist in satisfaction among individuals with different levels of education. We could sort the data by educational level and then create a table similar to the following.

| | College Graduate | Graduate Degree | Some College |
| --- | --- | --- | --- |
| | 5 | 3 | 4 |
| | 3 | 4 | 1 |
| | 5 | 5 | 4 |
| | 3 | 5 | 2 |
| | 3 | 5 | 3 |
| | 3 | 4 | 4 |
| | 3 | 5 | 4 |
| | 4 | 5 | |
| | 2 | | |
| Average | 3.444 | 4.500 | 3.143 |
| Count | 9 | 8 | 7 |

Although the average satisfaction for each group is somewhat different and it appears that the mean satisfaction of individuals with a graduate degree is higher, we cannot tell conclusively whether or not these differences are significant because of sampling error.

In statistical terminology, the variable of interest is called a factor. In this example, the factor is the educational level, and we have three categorical levels of this factor: college graduate, graduate degree, and some college. Thus, it would appear that we will have to perform three different pairwise tests to establish whether any significant differences exist among them. As the number of factor levels increases, you can easily see that the number of pairwise tests grows large very quickly. Fortunately, other statistical tools exist that eliminate the need for such a tedious approach. Analysis of variance (ANOVA) is one of them.

The null hypothesis for ANOVA is that the population means of all groups are equal; the alternative hypothesis is that at least one mean differs from the rest:

$$H_0: \mu_1 = \mu_2 = \cdots = \mu_m$$

$$H_1: \text{at least one mean is different from the others}$$

ANOVA derives its name from the fact that we are analyzing variances in the data; essentially, ANOVA computes a measure of the variance between the means of each group and a measure of the variance within the groups and examines a test statistic that is the ratio of these measures. This test statistic can be shown to have an F-distribution (similar to the test for equality of variances). If the F-statistic is large enough based on the level of significance chosen and exceeds a critical value, we would reject the null hypothesis. Excel provides a Data Analysis tool, ANOVA: Single Factor, to conduct analysis of variance.

Example 7.14  Applying the Excel ANOVA Tool

To test the null hypothesis that the mean satisfaction for all educational levels in the Excel file Insurance Survey are equal against the alternative hypothesis that at least one mean is different, select ANOVA: Single Factor from the Data Analysis options. First, you must set up the worksheet so that the data you wish to use are displayed in contiguous columns as shown in Example 7.13. In the dialog shown in Figure 7.13, specify the input range of the data (which must be in contiguous columns) and whether it is stored in rows or columns (i.e., whether each factor level or group is a row or column in the range). The sample size for each factor level need not be the same, but the input range must be a rectangular region that contains all data. You must also specify the level of significance ($\alpha$).

The results for this example are given in Figure 7.14. The output report begins with a summary report of basic statistics for each group. The ANOVA section reports the details of the hypothesis test. You needn’t worry about all the mathematical details. The important information to interpret the test is given in the columns labeled F (the F-test statistic), P-value (the p-value for the test), and F crit (the critical value from the F-distribution). In this example, F = 3.92, and the critical value from the F-distribution is 3.4668. Here F > F crit; therefore, we must reject the null hypothesis and conclude that there are significant differences in the means of the groups; that is, the mean satisfaction is not the same among the three educational levels. Alternatively, we see that the p-value is smaller than the chosen level of significance, 0.05, leading to the same conclusion.
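For readers curious about what the tool computes, the following Python sketch forms the ratio of between-group to within-group variance for the satisfaction scores listed in Example 7.13. The grouping of the scores into the three educational levels is our reading of that table (it reproduces the reported averages and counts), and the function name is hypothetical.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, between-groups df, within-groups df) for a single factor."""
    all_values = [x for group in groups for x in group]
    grand_mean = mean(all_values)
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within


# Satisfaction scores from Example 7.13, grouped by educational level.
college_graduate = [5, 3, 5, 3, 3, 3, 3, 4, 2]
graduate_degree = [3, 4, 5, 5, 5, 4, 5, 5]
some_college = [4, 1, 4, 2, 3, 4, 4]

f_value, df_between, df_within = one_way_anova_f(
    [college_graduate, graduate_degree, some_college]
)
print(round(f_value, 2), df_between, df_within)
```

The statistic agrees with the value F = 3.92 reported in the example, and since it exceeds the critical value 3.4668, the conclusion is the same.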


Figure 7.13 ANOVA Single Factor Dialog

Figure 7.14 ANOVA Results for Insurance Survey Data

Although ANOVA can identify a difference among the means of multiple populations, it cannot determine which means are different from the rest. To do this, we may use the Tukey-Kramer multiple comparison procedure. Unfortunately, Excel does not provide this tool, but it may be found in other statistical software.

Assumptions of ANOVA

ANOVA requires assumptions that the m groups or factor levels being studied represent populations whose outcome measures

1. are randomly and independently obtained,
2. are normally distributed, and
3. have equal variances.

If these assumptions are violated, then the level of significance and the power of the test can be affected. Usually, the first assumption is easily validated when random samples are chosen for the data. ANOVA is fairly robust to departures from normality, so in most cases this isn’t a serious issue. If sample sizes are equal, violation of the third assumption does not have serious effects on the statistical conclusions; however, with unequal sample sizes, it can. When the assumptions underlying ANOVA are violated, you may use a nonparametric test that does not require these assumptions; we refer you to more comprehensive texts on statistics for further information and examples.


Finally, we wish to point out that students often use ANOVA to compare the equality of means of exactly two populations. It is important to realize that by doing this, you are making the assumption that the populations have equal variances (assumption 3). Thus, you will find that the p-values for both ANOVA and the t-Test: Two-Sample Assuming Equal Variances will be the same and lead to the same conclusion. However, if the variances are unequal as is generally the case with sample data, ANOVA may lead to an erroneous conclusion. We recommend that you do not use ANOVA for comparing the means of two populations, but instead use the appropriate t-test that assumes unequal variances.
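The reason the two p-values coincide is that, with exactly two groups, the ANOVA F-statistic equals the square of the equal-variance (pooled) t statistic. The Python sketch below checks this identity numerically; the function names and the two small samples are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(sample1, sample2):
    """t statistic for two samples assuming equal population variances."""
    n1, n2 = len(sample1), len(sample2)
    pooled_var = ((n1 - 1) * variance(sample1) + (n2 - 1) * variance(sample2)) / (
        n1 + n2 - 2
    )
    return (mean(sample1) - mean(sample2)) / sqrt(pooled_var * (1 / n1 + 1 / n2))


def two_group_anova_f(sample1, sample2):
    """Single-factor ANOVA F statistic for exactly two groups."""
    groups = [sample1, sample2]
    grand_mean = mean(sample1 + sample2)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_within = len(sample1) + len(sample2) - 2
    return ss_between / (ss_within / df_within)  # between-groups df is 1


# Hypothetical samples, for illustration only.
group_1 = [5, 3, 5, 3, 3]
group_2 = [3, 4, 5, 5, 5, 4]

t_value = pooled_t(group_1, group_2)
f_value = two_group_anova_f(group_1, group_2)
print(round(t_value ** 2, 6), round(f_value, 6))  # the two values agree
```

Because ANOVA behaves like the equal-variance t-test here, it inherits that test's sensitivity to unequal variances, which is why the unequal-variance t-test is the safer choice for two groups.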

Chi-Square Test for Independence

A common problem in business is to determine whether two categorical variables are independent. We introduced the concept of independent events in Chapter 5. In the energy drink survey example (Example 5.9), we used conditional probabilities to determine whether brand preference was independent of gender. However, with sample data, sampling error can make it difficult to properly assess the independence of categorical variables. We would never expect the joint probabilities to be exactly the same as the product of the marginal probabilities because of sampling error even if the two variables are statistically independent. Testing for independence is important in marketing applications.

Example 7.15  Independence and Marketing Strategy

Figure 7.15 shows a portion of the sample data used in Chapter 5 for brand preferences of energy drinks (Excel file Energy Drink Survey) and the cross-tabulation of the results. A key marketing question is whether the proportion of males who prefer a particular brand is no different from the proportion of females. For instance, of the 63 male students, 25 (40%) prefer brand 1. If gender and brand preference are indeed independent, we would expect that about the same proportion of the sample of female students would also prefer brand 1. In actuality, only 9 of 37 (24%) prefer brand 1. However, we do not know whether this is simply due to sampling error or represents a significant difference. Knowing whether gender and brand preference are independent can help marketing personnel better target advertising campaigns. If they are not independent, then advertising should be targeted differently to males and females, whereas if they are independent, it would not matter.

We can test for independence by using a hypothesis test called the chi-square test for independence. The chi-square test for independence tests the following hypotheses:

$H_0$: the two categorical variables are independent

$H_1$: the two categorical variables are dependent

The chi-square test is an example of a nonparametric test; that is, one that does not depend on restrictive statistical assumptions, as ANOVA does. This makes it a widely applicable and popular tool for understanding relationships among categorical data. The first step in the procedure is to compute the expected frequency in each cell of the cross-tabulation if the two variables are independent. This is easily done using the following:

$$\text{expected frequency in row } i \text{ and column } j = \frac{(\text{grand total row } i)(\text{grand total column } j)}{\text{total number of observations}} \qquad (7.7)$$


Figure 7.15 Portion of Energy Drink Survey and Cross-Tabulation

Figure 7.16 Expected Frequencies for the Chi-Square Test

Example 7.16  Computing Expected Frequencies

For the Energy Drink Survey data, we may compute the expected frequencies using the data from the cross-tabulation and formula (7.7). For example, the expected frequency of females who prefer brand 1 is (37)(34)/100 = 12.58. This can easily be implemented in Excel. Figure 7.16 shows the results (see the Excel file Chi-Square Test). The formula in cell F11, for example, is =\$I5*F\$7/\$I\$7, which can be copied to the other cells to complete the calculations.

Next, we compute a test statistic, called a chi-square statistic, which is the sum of the squares of the differences between observed frequency, $f_o$, and expected frequency, $f_e$, divided by the expected frequency in each cell:

$$\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e} \qquad (7.8)$$

The closer the observed frequencies are to the expected frequencies, the smaller will be the value of the chi-square statistic. The sampling distribution of $\chi^2$ is a special distribution called the chi-square ($\chi^2$) distribution. The chi-square distribution is characterized by degrees of freedom, similar to the t-distribution. Table A.3 in Appendix A in the back of this book provides critical values of the chi-square distribution for selected values of $\alpha$. We compare the chi-square statistic for a specified level of significance $\alpha$ to the critical value from a chi-square distribution with $(r - 1)(c - 1)$ degrees of freedom, where r and c are the number of rows and columns in the cross-tabulation table, respectively.

The Excel function CHISQ.INV.RT(probability, deg_freedom) returns the value of $\chi^2$ that has a right-tail area equal to probability for a specified degree of freedom. By setting probability equal to the level of significance, we can obtain the critical value for the hypothesis test. If the test statistic exceeds the critical value for a specified level of significance, we reject $H_0$. The Excel function CHISQ.TEST(actual_range, expected_range) computes the p-value for the chi-square test.


Figure 7.17 Excel Implementation of Chi-Square Test

Example 7.17  Conducting the Chi-Square Test

For the Energy Drink Survey data, Figure 7.17 shows the calculations of the chi-square statistic using formula (7.8). For example, the formula in cell F17 is =(F5-F11)^2/F11, which can be copied to the other cells. The grand total in the lower right cell is the value of $\chi^2$. In this case, the chi-square test statistic is 6.4924. Since the cross-tabulation has r = 2 rows and c = 3 columns, we have (2 − 1)(3 − 1) = 2 degrees of freedom for the chi-square distribution. Using $\alpha = 0.05$, the Excel function CHISQ.INV.RT(0.05,2) returns the critical value 5.99146. Because the test statistic exceeds the critical value, we reject the null hypothesis that the two categorical variables are independent. Alternatively, we could simply use the CHISQ.TEST function to find the p-value for the test and base our conclusion on that without computing the chi-square statistic. For this example, the function CHISQ.TEST(F5:H6,F11:H12) returns the p-value of 0.0389, which is less than $\alpha = 0.05$; therefore, we reject the null hypothesis.
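The whole procedure (expected frequencies, the chi-square statistic, critical value, and p-value) can be sketched in a few lines of Python. Only the brand 1 counts (9 of 37 females, 25 of 63 males) are stated in the text; the brand 2 and brand 3 counts below are assumed for illustration and were chosen to be consistent with those totals and with the reported statistic of 6.4924. The function names are hypothetical. The closed-form p-value and critical value used here are valid only for 2 degrees of freedom, where the right-tail area of the chi-square distribution is exp(−x/2).

```python
from math import exp, log

# Rows: female, male. Columns: brand 1, brand 2, brand 3.
# Brand 2 and brand 3 counts are assumed for illustration.
observed = [
    [9, 6, 22],
    [25, 17, 21],
]


def expected_frequencies(table):
    """Formula (7.7): (row total)(column total) / (total observations)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]


def chi_square_statistic(table):
    """Formula (7.8): sum of (observed - expected)^2 / expected over all cells."""
    expected = expected_frequencies(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )


expected = expected_frequencies(observed)
chi2 = chi_square_statistic(observed)
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Closed forms that hold only when df == 2.
assert df == 2
p_value = exp(-chi2 / 2)      # right-tail area beyond chi2
critical = -2 * log(0.05)     # value with right-tail area 0.05

print(round(expected[0][0], 2), round(chi2, 4), round(critical, 5), round(p_value, 4))
```

For tables with other degrees of freedom, the critical value and p-value would come from Table A.3 or from the Excel functions CHISQ.INV.RT and CHISQ.TEST rather than from these closed forms.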

Cautions in Using the Chi-Square Test

First, when using PivotTables to construct a cross-tabulation and implement the chi-square test in Excel similar to Figure 7.17, be extremely cautious of blank cells in the PivotTable. Blank cells will not be counted in the chi-square calculations and will lead to errors. If you have blank cells in the PivotTable, simply replace them by zeros, or right-click in the PivotTable, choose PivotTable Options, and enter 0 in the field for the checkbox For empty cells show.

Second, the chi-square test assumes adequate expected cell frequencies. A rule of thumb is that there be no more than 20% of cells with expected frequencies smaller than 5, and no expected frequencies of zero. More advanced statistical procedures exist to handle this, but you might consider aggregating some of the rows or columns in a logical fashion to enforce this assumption. This, of course, results in fewer rows or columns.
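The rule of thumb for expected cell frequencies is simple to check before running the test. The following Python sketch is one way to express it; the function name, its default thresholds as parameters, and the example tables are hypothetical.

```python
def expected_counts_adequate(expected, min_count=5, max_share_small=0.20):
    """Check the rule of thumb: no expected frequency of zero, and no more
    than 20% of cells with an expected frequency smaller than 5."""
    cells = [value for row in expected for value in row]
    if any(value <= 0 for value in cells):
        return False
    small_cells = sum(1 for value in cells if value < min_count)
    return small_cells / len(cells) <= max_share_small


# Hypothetical expected-frequency tables, for illustration only.
well_filled = [[12.58, 8.51, 15.91], [21.42, 14.49, 27.09]]
too_sparse = [[2.0, 3.0], [10.0, 20.0]]

print(expected_counts_adequate(well_filled))
print(expected_counts_adequate(too_sparse))
```

If the check fails, aggregating rows or columns as suggested above and recomputing the expected frequencies is one way to proceed.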


Analytics in Practice: Using Hypothesis Tests and Business Analytics in a Help Desk Service Improvement Project¹

Schlumberger is an international oilfield-services provider headquartered in Houston, Texas. Through an outsourcing contract, they supply help-desk services for a global telecom company that offers wireline communications and integrated telecom services to more than 2 million cellular subscribers. The help desk, located in Ecuador, faced increasing customer complaints and losses in dollars and cycle times. The company drew upon the analytics capability of one of the help-desk managers to investigate and solve the problem. The data showed that the average solution time for issues reported to the help desk was 9.75 hours. The company set a goal to reduce the average solution time by 50%. In addition, the number of issues reported to the help desk had reached an average of 30,000 per month. Reducing the total number of issues reported to the help desk would allow the company to address those issues that hadn’t been resolved because of a lack of time, and to reduce the number of abandoned calls. They set a goal to identify preventable issues so that customers would not have to contact the help desk in the first place, and set a target of 15,000 issues.

As part of their analysis, they observed that the average solution time for help-desk technicians working at the call center seemed to be lower than the average for technicians working on site with clients. They conducted a hypothesis test structured around the question: Is there a difference between having help desk employees working at an off-site facility rather than on site within the client’s main office? The null hypothesis was that there was no significant difference; the alternative hypothesis was that there was a significant difference. Using a two-sample t-test to assess whether the call center and the help desk are statistically different from each other, they found no statistically significant advantage in keeping help-desk employees working at the call center. As a result, they moved help-desk agents to the client’s main office area. Using a variety of other analytical techniques, they were able to make changes to their process, resulting in the following:

• a decrease in the number of help-desk issues of 32%
• improved capability to meet the target of 15,000 total issues
• a reduction in the average desktop solution time from 9.75 hours to 1 hour, an improvement of 89.5%
• a reduction in the call-abandonment rate from 44% to 26%
• a reduction of 69% in help-desk operating costs

Key Terms

Alternative hypothesis
Analysis of variance (ANOVA)
Chi-square distribution
Chi-square statistic
Confidence coefficient
Factor
Hypothesis
Hypothesis testing
Level of significance
Null hypothesis
One-sample hypothesis test
One-tailed test of hypothesis
p-Value (observed significance level)
Power of the test
Statistical inference
Two-tailed test of hypothesis
Type I error
Type II error

¹ Based on Francisco, Endara M. “Help Desk Improves Service and Saves Money with Six Sigma,” American Society for Quality, http://asq.org/economic-case/markets/pdf/help-desk-24490.pdf, accessed 8/19/11.


Problems and Exercises

For all hypothesis tests, assume that the level of significance is 0.05 unless otherwise stated.

1. Create an Excel workbook with worksheet templates (similar to the Excel workbook Confidence Intervals) for one-sample hypothesis tests for means and proportions. Apply your templates to the example problems in this chapter. (For subsequent problems, you should use the formulas in this chapter to perform the calculations, and use this template only to verify your results!)
2. A company is considering two different campaigns, A and B, for the promotion of their product. Two tests are conducted in two market areas with identical consumer characteristics, and in a random sample of 60 customers who saw campaign A, 18 tried the product. In a random sample of 100 customers who saw campaign B, 22 tried the product. What conclusion can management reach? (Assume that the population variance is not known.)
3. A management institute checked the past records of applicants and the mean score calculated was 350. The administration is interested to know whether the quality of new applicants has changed or not. From the recent scores of 100 applicants, the mean is 365 with a standard deviation of 38. Does this data provide statistical evidence that the quality of recent applicants has improved?
4. A retailer believes that its new advertising strategy will increase sales. Previously, the mean spending in 15 categories of consumer items in both the 18–34 and 35+ age groups was \$70.00.
   a. Formulate a hypothesis test to determine if the mean spending in these categories has statistically increased.
   b. After the new advertising campaign was launched, a marketing study found that the mean spending for 300 respondents in the 18–34 age group was \$75.86, with a standard deviation of \$50.90. Is there sufficient evidence to conclude that the advertising strategy significantly increased sales in this age group?
   c. For 700 respondents in the 35+ age group, the mean and standard deviation were \$68.53 and \$45.29, respectively. Is there sufficient evidence to conclude that the advertising strategy significantly increased sales in this age group?
5. A financial advisor believes that the proportion of investors who are risk-averse (i.e., try to avoid risk in their investment decisions) is at least 0.7. A survey of 32 investors found that 20 of them were risk-averse. Formulate and test the appropriate hypotheses to determine whether his belief is valid.
6. Metropolitan Press hypothesizes that the average life of its largest Web press is 14,500 hours. They know that the standard deviation of press life is 2,100 hours. From a sample of 25 presses, the company finds a sample mean of 13,000 hours. At a 0.01 significance level, should the company conclude that the average life of the presses is less than the hypothesized 14,500 hours?
7. Ice Cream Manufacture is to produce a new ice cream flavor. The company’s marketing research department surveyed 6,000 families and 335 of them showed interest in purchasing the new flavor. A similar study made two years ago showed that 5% of the families would purchase the flavor. What should the company conclude regarding the new flavor?
8. Call centers typically have high turnover. The director of human resources for a large bank has compiled data on about 70 former employees at one of the bank’s call centers in the Excel file Call Center Data. In writing an article about call center working conditions, a reporter has claimed that the average tenure is no more than 2 years. Formulate and test a hypothesis using these data to determine if this claim can be disputed.
9. The manager of a store claims that 60% of the shoppers entering the store leave without making a purchase. Out of a sample of 50, it is found that 35 shoppers left without buying. Is the result consistent with the claim?
10. A sample of 400 athletes is found to have a mean height of 171.38 cm. Can we call it a sample from a large population with a mean height of 171.17 cm and a standard deviation of 3.30 cm?
11. The State of Ohio Department of Education has a mandated ninth-grade proficiency test that covers writing, reading, mathematics, citizenship (social studies), and science. The Excel file Ohio Education Performance provides data on success rates (defined as the percent of students passing) in school districts in the greater Cincinnati metropolitan area along with state averages. Test null hypotheses that the average scores in the Cincinnati area are equal to the state averages in each test and also for the composite score.
12. Formulate and test hypotheses to determine if statistical evidence suggests that the graduation rate for (1) top liberal arts colleges or (2) research universities in the sample Colleges and Universities exceeds 90%. Do the data support a conclusion that the graduation rates exceed 85%? Would your conclusions

Chapter 7  Statistical Inference

change if the level of significance was 0.01 instead of 0.05?

13. The Excel file Sales Data provides data on a sample of customers. An industry trade publication stated that the average profit per customer for this industry was at least \$4,500. Using a test of hypothesis, do the data support this claim or not?

14. The Excel file Room Inspection provides data for 100 room inspections at each of 25 hotels in a major chain. Management would like the proportion of nonconforming rooms to be less than 2%. Test an appropriate hypothesis to determine if management can make this claim.

15. An employer is considering negotiating its pricing structure for health insurance with its provider if there is sufficient evidence that customers will be willing to pay a lower premium for a higher deductible. Specifically, they want at least 30% of their employees to be willing to do this. Using the sample data in the Excel file Insurance Survey, determine what decision they should make.

16. Using the data in the Excel file Consumer Transportation Survey, test the following null hypotheses:
a. Individuals spend at least 8 hours per week in their vehicles.
b. Individuals drive an average of 600 miles per week.
c. The average age of SUV drivers is no greater than 35.
d. At least 80% of individuals are satisfied with their vehicles.

17. Using the Excel file Facebook Survey, determine if the mean number of hours spent online per week is the same for males as it is for females.

18. Determine if there is evidence to conclude that the mean number of vacations taken by married individuals is less than the number taken by single/divorced individuals using the data in the Excel file Vacation Survey. Use a level of significance of 0.05. Would your conclusion change if the level of significance is 0.01?

19. The Excel file Accounting Professionals provides the results of a survey of 27 employees in a tax division of a Fortune 100 company.
a. Test the null hypothesis that the average number of years of service is the same for males and females.
b. Test the null hypothesis that the average years of undergraduate study is the same for males and females.

20. In the Excel file Cell Phone Survey, test the hypothesis that the mean responses for Value for the Dollar and Customer Service do not differ by gender.

21. For a sample of size 22 with a mean of 8 and a standard deviation of 12.5, test the hypothesis that the value of the population mean is 70 against the alternative that it is more than 70. Use the 0.025 significance level.

22. Determine if there is evidence to conclude that the mean GPA of males who plan to attend graduate school is larger than that of females who plan to attend graduate school using the data in the Excel file Graduate School Survey.

23. The director of human resources for a large bank has compiled data on about 70 former employees at one of the bank's call centers (see the Excel file Call Center Data). For each of the following, assume equal variances of the two populations.
a. Test the null hypothesis that the average length of service for males is the same as for females.
b. Test the null hypothesis that the average length of service for individuals without prior call center experience is the same as those with experience.
c. Test the null hypothesis that the average length of service for individuals with a college degree is the same as for individuals without a college degree.
d. Now conduct tests of hypotheses for equality of variances. Were your assumptions of equal variances valid? If not, repeat the test(s) for means using the unequal variance test.

24. A producer of computer-aided design software for

the aerospace industry receives numerous calls for technical support. Tracking software is used to monitor response and resolution times. In addition, the company surveys customers who request support using the following scale: 0 = did not exceed expectations; 1 = marginally met expectations; 2 = met expectations; 3 = exceeded expectations; 4 = greatly exceeded expectations. The questions are as follows:
Q1: Did the support representative explain the process for resolving your problem?
Q2: Did the support representative keep you informed about the status of progress in resolving your problem?
Q3: Was the support representative courteous and professional?
Q4: Was your problem resolved?
Q5: Was your problem resolved in an acceptable amount of time?
Q6: Overall, how did you find the service provided by our technical support department?
A final question asks the customer to rate the overall quality of the product using a scale of 0 = very poor; 1 = poor; 2 = good; 3 = very good; 4 = excellent. A sample of survey responses and associated resolution and response data are provided in the Excel file Customer Support Survey.
a. The company has set a service standard of 1 day for the mean resolution time. Does evidence exist that the response time is more than 1 day? How do the outliers in the data affect your result? What should you do about them?
b. Test the hypothesis that the average service index is equal to the average engineer index.

25. Using the data in the Excel file Ohio Education Performance, test the hypotheses that the mean difference in writing and reading scores is zero and that the mean difference in math and science scores is zero. Use the paired-sample procedure.

26. The Excel file Unions and Labor Law Data reports the percent of public- and private-sector employees in unions in 1982 for each state, along with indicators whether the states had a bargaining law that covered public employees or right-to-work laws.
a. Test the hypothesis that the mean percent of employees in unions for both the public sector and private sector is the same for states having bargaining laws as for those who do not.
b. Test the hypothesis that the mean percent of employees in unions for both the public sector and private sector is the same for states having right-to-work laws as for those who do not.

27. Using the data in the Excel file Student Grades, which represent exam scores in one section of a large statistics course, test the hypothesis that the variance in grades is the same for both tests.

28. In the Excel file Restaurant Sales, determine if the variance of weekday sales is the same as that of weekend sales for each of the three variables (lunch, dinner, and delivery).

29. A college is trying to determine if there is a significant difference in the mean GMAT score of students from different undergraduate backgrounds who apply to the MBA program. The Excel file GMAT Scores contain data from a sample of students. What conclusion can be reached using ANOVA?

30. Using the data in the Excel file Cell Phone Survey, apply ANOVA to determine if the mean response for Value for the Dollar is the same for different types of cell phones.

31. Using the data in the Excel file Freshman College Data, use ANOVA to determine whether significant differences exist in the mean retention rate for the different colleges over the 4-year period. Second, use ANOVA to determine if significant differences exist in the mean ACT and SAT scores among the different colleges.

32. A car manufacturing firm is bringing out a new model. To figure out its advertising campaign, they want to determine whether the model appeal will be dependent on a particular age group. A sample of a customer survey revealed the following:

| | Under 20 | 20–40 | 40–50 | 50 and over | Total |
|---|---|---|---|---|---|
| Liked | 140 | 70 | 70 | 25 | 305 |
| Disliked | 60 | 40 | 30 | 65 | 195 |
| Total | 200 | 110 | 100 | 90 | 500 |

What can the manufacturer conclude?

33. A survey of college students determined the preference for cell phone providers. The following data were obtained:

| Gender | T-Mobile | AT&T | Verizon | Other |
|---|---|---|---|---|
| Male | 12 | 39 | 27 | 16 |
| Female | 8 | 22 | 24 | 12 |

Can we conclude that gender and cell phone provider are independent? If not, what implications does this have for marketing?

34. For the data in the Excel file Accounting Professionals, perform a chi-square test of independence to determine if age group is independent of having a graduate degree.

35. For the data in the Excel file Graduate School Survey, perform a chi-square test for independence to determine if plans to attend graduate school are independent of gender.

36. For the data in the Excel file New Account Processing, perform chi-square tests for independence to determine if certification is independent of gender, and if certification is independent of having prior industry background.


Case: Drout Advertising Research Project

The background for this case was introduced in Chapter 1. This is a continuation of the case in Chapter 6. For this part of the case, propose and test some meaningful hypotheses that will help Ms. Drout understand and explain the results. Include two-sample tests, ANOVA, and/or Chi-Square tests for independence as appropriate. Write up your conclusions in a formal report, or add your findings to the report you completed for the case in Chapter 6 as per your instructor's requirements. If you have accumulated all sections of this case into one report, polish it up so that it is as professional as possible, drawing final conclusions about the perceptions of the role of advertising in the reinforcement of gender stereotypes and the impact of empowerment advertising.

Case: Performance Lawn Equipment

Elizabeth Burke has identified some additional questions she would like you to answer.

1. Are there significant differences in ratings of specific product/service attributes in the 2014 Customer Survey worksheet?
2. In the worksheet On-Time Delivery, has the proportion of on-time deliveries in 2014 significantly improved since 2010?
3. Have the data in the worksheet Defects After Delivery changed significantly over the past 5 years?
4. Although engineering has collected data on alternative process costs for building transmissions in the worksheet Transmission Costs, why didn't they reach a conclusion as to whether one of the proposed processes is better than the current process?
5. Are there differences in employee retention due to gender, college graduation status, or whether the employee is from the local area in the data in the worksheet Employee Retention?

Conduct appropriate statistical analyses and hypothesis tests to answer these questions and summarize your results in a formal report to Ms. Burke.

Chapter 8  Trendlines and Regression Analysis

Learning Objectives

After studying this chapter, you will be able to:

• Explain the purpose of regression analysis and provide examples in business.
• Use a scatter chart to identify the type of relationship between two variables.
• List the common types of mathematical functions used in predictive modeling.
• Use the Excel Trendline tool to fit models to data.
• Explain how least-squares regression finds the best-fitting regression model.
• Use Excel functions to find least-squares regression coefficients.
• Use the Excel Regression tool for both single and multiple linear regressions.
• Interpret the regression statistics of the Excel Regression tool.
• Interpret significance of regression from the Excel Regression tool output.
• Draw conclusions for tests of hypotheses about regression coefficients.
• Interpret confidence intervals for regression coefficients.
• Calculate standard residuals.
• List the assumptions of regression analysis and describe methods to verify them.
• Explain the differences in the Excel Regression tool output for simple and multiple linear regression models.
• Apply a systematic approach to build good regression models.
• Explain the importance of understanding multicollinearity in regression models.
• Build regression models for categorical data using dummy variables.
• Test for interactions in regression models with categorical variables.
• Identify when curvilinear regression models are more appropriate than linear models.


Many applications of business analytics involve modeling relationships between one or more independent variables and some dependent variable. For example, we might wish to predict the level of sales based on the price we set, or extrapolate a trend into the future. As other examples, a company may wish to predict sales based on the U.S. GDP (gross domestic product) and the 10-year treasury bond rate to capture the influence of the business cycle,1 or a marketing researcher might want to predict the intent of buying a particular automobile model based on a survey that measured consumer attitudes toward the brand, negative word-of-mouth, and income level.2 Trendlines and regression analysis are tools for building such models and predicting future results. Our principal focus is to gain a basic understanding of how to use and interpret trendlines and regression models, statistical issues associated with interpreting regression analysis results, and practical issues in using trendlines and regression as tools for making and evaluating decisions.

Modeling Relationships and Trends in Data

Understanding both the mathematics and the descriptive properties of different functional relationships is important in building predictive analytical models. We often begin by creating a chart of the data to understand it and choose the appropriate type of functional relationship to incorporate into an analytical model. For cross-sectional data, we use a scatter chart; for time-series data, we use a line chart. Common types of mathematical functions used in predictive analytical models include the following:

• Linear function: $y = a + bx$. Linear functions show steady increases or decreases over the range of x. This is the simplest type of function used in predictive models. It is easy to understand, and over small ranges of values, can approximate behavior rather well.
• Logarithmic function: $y = \ln(x)$. Logarithmic functions are used when the rate of change in a variable increases or decreases quickly and then levels out, such as with diminishing returns to scale. Logarithmic functions are often used in marketing models where constant percentage increases in advertising, for instance, result in constant, absolute increases in sales.
• Polynomial function: $y = ax^2 + bx + c$ (second order, a quadratic function), $y = ax^3 + bx^2 + dx + e$ (third order, a cubic function), and so on. A second-order polynomial is parabolic in nature and has only one hill or valley; a third-order polynomial has one or two hills or valleys. Revenue models that incorporate price elasticity are often polynomial functions.
• Power function: $y = ax^b$. Power functions define phenomena that increase at a specific rate. Learning curves that express improving times in performing a task are often modeled with power functions having $a > 0$ and $b < 0$.
• Exponential function: $y = ab^x$. Exponential functions have the property that y rises or falls at constantly increasing rates. For example, the perceived brightness of a lightbulb grows at a decreasing rate as the wattage increases. In this case, a would be a positive number and b would be between 0 and 1. The exponential function is often defined as $y = ae^x$, where $b = e$, the base of natural logarithms (approximately 2.71828).

1 James R. Morris and John P. Daley, Introduction to Financial Models for Management and Planning (Boca Raton, FL: Chapman & Hall/CRC, 2009): 257.
2 Alvin C. Burns and Ronald F. Bush, Basic Marketing Research Using Microsoft Excel Data Analysis, 2nd ed. (Upper Saddle River, NJ: Prentice Hall, 2008): 450.

The Excel Trendline tool provides a convenient method for determining the best-fitting functional relationship among these alternatives for a set of data. First, click the chart to which you wish to add a trendline; this will display the Chart Tools menu. Select the Chart Tools Design tab, and then click Add Chart Element from the Chart Layouts group. From the Trendline submenu, you can select one of the options (Linear is the most common) or More Trendline Options.... If you select More Trendline Options, you will get the Format Trendline pane in the worksheet (see Figure 8.1). A simpler way of doing all this is to right-click on the data series in the chart and choose Add trendline from the pop-up menu. Try it! Select the radio button for the type of functional relationship you wish to fit to the data. Check the boxes for Display Equation on chart and Display R-squared value on chart. You may then close the Format Trendline pane. Excel will display the results on the chart you have selected; you may move the equation and R-squared value for better readability by dragging them to a different location. To clear a trendline, right-click on it and select Delete.

$R^2$ (R-squared) is a measure of the "fit" of the line to the data. The value of $R^2$ will be between 0 and 1. The larger the value of $R^2$, the better the fit. We will discuss this further in the context of regression analysis.

Trendlines can be used to model relationships between variables and understand how the dependent variable behaves as the independent variable changes. For example, the demand-prediction models that we introduced in Chapter 1 (Examples 1.9 and 1.10) would generally be developed by analyzing data.
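To make the idea of R-squared as a measure of fit more concrete, the following Python sketch computes it as one minus the ratio of the squared errors around a trendline to the squared deviations around the mean. This is one common definition and is offered only as an illustration; the data values and the candidate trendline y = 2x are hypothetical and do not come from any Excel file used in this chapter.

```python
def r_squared(observed, predicted):
    """Return 1 - (sum of squared errors) / (total sum of squares)."""
    mean_y = sum(observed) / len(observed)
    ss_total = sum((y - mean_y) ** 2 for y in observed)
    ss_error = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    return 1 - ss_error / ss_total


# Hypothetical data, made up for illustration only.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

# A candidate linear trendline y = 2x (an assumed model, not a fitted one).
trend = [2.0 * xi for xi in x]

r2 = r_squared(y, trend)
print(f"R-squared of the candidate trendline: {r2:.4f}")
```

A value close to 1 indicates that the trendline tracks the data closely; a perfect fit, where every predicted value equals the observed value, gives exactly 1.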

Figure 8.1  Excel Format Trendline Pane

Example 8.1  Modeling a Price-Demand Function

A market research study has collected data on sales volumes for different levels of pricing of a particular product. The data and a scatter diagram are shown in Figure 8.2 (Excel file Price-Sales Data). The relationship between price and sales clearly appears to be linear, so a linear trendline was fit to the data. The resulting model is

sales = 20,512 − 9.5116 × price

This model can be used as the demand function in other marketing or financial analyses.
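As a brief illustration of how the fitted demand function might be reused in another analysis, the Python sketch below evaluates the model from Example 8.1 at a chosen price and multiplies by price to obtain revenue. The function names, the revenue calculation, and the price of 1,000 are assumptions added here for illustration; only the two coefficients come from the example.

```python
def predicted_sales(price):
    """Demand function from Example 8.1: sales = 20,512 - 9.5116 * price."""
    return 20512 - 9.5116 * price


def predicted_revenue(price):
    """Illustrative use of the demand function: revenue = price * sales."""
    return price * predicted_sales(price)


price = 1000  # hypothetical price, chosen only for illustration
print(f"Predicted sales at price {price}: {predicted_sales(price):,.1f}")
print(f"Predicted revenue at price {price}: {predicted_revenue(price):,.0f}")
```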

Trendlines are also used extensively in modeling trends over time—that is, when the variable x in the functional relationships represents time. For example, an analyst for an airline needs to predict where fuel prices are going, and an investment analyst would want to predict the price of stocks or key economic indicators.

Figure 8.2  Price-Sales Data and Scatter Diagram with Fitted Linear Function

Example 8.2  Predicting Crude Oil Prices

Figure 8.3 shows a chart of historical data on crude oil prices on the first Friday of each month from January 2006 through June 2008 (data are in the Excel file Crude Oil Prices). Using the Trendline tool, we can try to fit the various functions to these data (here x represents the number of months starting with January 2006). The results are as follows:

exponential: $y = 50.49e^{0.021x}$, $R^2 = 0.664$
logarithmic: $y = 13.02\ln(x) + 39.60$, $R^2 = 0.382$
polynomial (second order): $y = 0.130x^2 - 2.399x + 68.01$, $R^2 = 0.905$
polynomial (third order): $y = 0.005x^3 - 0.111x^2 + 0.648x + 59.497$, $R^2 = 0.928$
power: $y = 45.96x^{0.0169}$, $R^2 = 0.397$

The best-fitting model is the third-order polynomial, shown in Figure 8.4.

Figure 8.3  Chart of Crude Oil Prices

Figure 8.4  Polynomial Fit of Crude Oil Prices

Be cautious when using polynomial functions. The $R^2$ value will continue to increase as the order of the polynomial increases; that is, a third-order polynomial will provide a better fit than a second-order polynomial, and so on. Higher-order polynomials will generally not be very smooth and will be difficult to interpret visually. Thus, we don't recommend going beyond a third-order polynomial when fitting data. Use your eye to make a good judgment!

Of course, the proper model to use depends on the scope of the data. As the chart shows, crude oil prices were relatively stable until early 2007 and then began to increase rapidly. By including the early data, the long-term functional relationship might not adequately express the short-term trend. For example, fitting a model to only the data beginning with January 2007 yields these models:

exponential: $y = 50.56e^{0.044x}$, $R^2 = 0.969$
polynomial (second order): $y = 0.121x^2 + 1.232x + 53.48$, $R^2 = 0.968$
linear: $y = 3.548x + 45.76$, $R^2 = 0.944$


The difference in prediction can be significant. For example, predicting the price 6 months after the last data point (x = 36) yields \$172.24 for the third-order polynomial fit with all the data and \$246.45 for the exponential model with only the recent data. Thus, the analyst must be careful to select the proper amount of data for the analysis. The question then becomes one of choosing the best assumptions for the model. Is it reasonable to assume that prices would increase exponentially or perhaps at a slower rate, such as with the linear model fit? Or, would they level off and start falling? Clearly, factors other than historical trends would enter into this choice. As we now know, oil prices plunged in the latter half of 2008; thus, all predictive models are risky.
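The comparison above can be checked with a few lines of Python. The sketch below evaluates the third-order polynomial (fit to all the data) and the exponential model (fit to the data beginning with January 2007) at x = 36, as the text does. Because it uses the rounded coefficients printed in this chapter, the polynomial result may differ from the reported figure by about a cent; the function names are hypothetical.

```python
import math


def cubic_all_data(x):
    """Third-order polynomial trendline fit to all months (rounded coefficients)."""
    return 0.005 * x**3 - 0.111 * x**2 + 0.648 * x + 59.497


def exponential_recent(x):
    """Exponential trendline fit to the data beginning with January 2007."""
    return 50.56 * math.exp(0.044 * x)


# Following the text, both models are evaluated at x = 36.
x_future = 36
cubic_forecast = cubic_all_data(x_future)
exp_forecast = exponential_recent(x_future)

print(f"Third-order polynomial forecast: {cubic_forecast:.2f}")
print(f"Exponential forecast:            {exp_forecast:.2f}")
print(f"Difference:                      {exp_forecast - cubic_forecast:.2f}")
```

The gap of roughly \$74 between the two forecasts shows how strongly the choice of model and data window drives the prediction.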

Simple Linear Regression

Regression analysis is a tool for building mathematical and statistical models that characterize relationships between a dependent variable (which must be a ratio variable and not categorical) and one or more independent, or explanatory, variables, all of which are numerical (but may be either ratio or categorical). Two broad categories of regression models are used often in business settings: (1) regression models of cross-sectional data and (2) regression models of time-series data, in which the independent variables are time or some function of time and the focus is on predicting the future. Time-series regression is an important tool in forecasting, which is the subject of Chapter 9.

A regression model that involves a single independent variable is called simple regression. A regression model that involves two or more independent variables is called multiple regression. In the remainder of this chapter, we describe how to develop and analyze both simple and multiple regression models.

Simple linear regression involves finding a linear relationship between one independent variable, X, and one dependent variable, Y. The relationship between two variables can assume many forms, as illustrated in Figure 8.5. The relationship may be linear or nonlinear, or there may be no relationship at all. Because we are focusing our discussion on linear regression models, the first thing to do is to verify that the relationship is linear, as in Figure 8.5(a). We would not expect to see the data line up perfectly along a straight line; we simply want to verify that the general relationship is linear. If the relationship is clearly nonlinear, as in Figure 8.5(b), then alternative approaches must be used, and if no relationship is evident, as in Figure 8.5(c), then it is pointless to even consider developing a linear regression model.

To determine if a linear relationship exists between the variables, we recommend that you create a scatter chart that can show the relationship between variables visually.

Figure 8.5  Examples of Variable Relationships: (a) Linear, (b) Nonlinear, (c) No relationship

Example 8.3  Home Market Value Data

The market value of a house is typically related to its size. In the Excel file Home Market Value (see Figure 8.6), data obtained from a county auditor provides information about the age, square footage, and current market value of houses in a particular subdivision. We might wish to investigate the relationship between the market value and the size of the home. The independent variable, X, is the number of square feet, and the dependent variable, Y, is the market value.

Figure 8.7 shows a scatter chart of the market value in relation to the size of the home. In general, we see that higher market values are associated with larger house sizes and the relationship is approximately linear. Therefore, we could conclude that simple linear regression would be an appropriate technique for predicting market value based on house size.

Figure 8.6  Portion of Home Market Value

Figure 8.7  Scatter Chart of Market Value versus Home Size

Finding the Best-Fitting Regression Line

The idea behind simple linear regression is to express the relationship between the dependent and independent variables by a simple linear equation, such as

market value = a + b × square feet

where a is the y-intercept and b is the slope of the line. If we draw a straight line through the data, some of the points will fall above the line, some will fall below it, and a few might fall on the line itself. Figure 8.8 shows two possible straight lines that pass through the data. Clearly, you would choose A as the better-fitting line over B because all the points are closer to the line and the line appears to be in the middle of the data. The only difference between the lines is the value of the slope and intercept; thus, we seek to determine the values of the slope and intercept that provide the best-fitting line.

Figure 8.8  Two Possible Regression Lines

Example 8.4  Using Excel to Find the Best Regression Line

When using the Trendline tool for simple linear regression in the Home Market Value example, be sure the linear function option is selected (it is the default option when you use the tool). Figure 8.9 shows the best-fitting regression line. The equation is

market value = \$32,673 + \$35.036 × square feet

The value of the regression line can be explained as follows. Suppose we wanted to estimate the home market value for any home in the population from which the sample data were gathered. If all we knew were the market values, then the best estimate of the market value for any home would simply be the sample mean, which is \$92,069. Thus, no matter if the house has 1,500 square feet or 2,200 square feet, the best estimate of market value would still be \$92,069. Because the market values vary from about \$75,000 to more than \$120,000, there is quite a bit of uncertainty in using the mean as the estimate. However, from the scatter chart, we see that larger homes tend to have higher market values. Therefore, if we know that a home has 2,200 square feet, we would expect the market value estimate to be higher than for one that has only 1,500 square feet. For example, the estimated market value of a home with 2,200 square feet would be

market value = \$32,673 + \$35.036 × 2,200 = \$109,752

whereas the estimated value for a home with 1,500 square feet would be

market value = \$32,673 + \$35.036 × 1,500 = \$85,227

The regression model explains the differences in market value as a function of the house size and provides better estimates than simply using the average of the sample data.

One important caution: it is dangerous to extrapolate a regression model outside the ranges covered by the observations. For instance, if you want to predict the market value of a house that has 3,000 square feet, the results may or may not be accurate, because the regression model estimates did not use any observations greater than 2,400 square feet. We cannot be sure that a linear extrapolation will hold and should not use the model to make such predictions.
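The two estimates in Example 8.4, together with the caution about extrapolation, can be sketched in Python as follows. The coefficients and the 2,400-square-foot upper limit come from the example; the function name and the decision to raise an error for larger homes are illustrative assumptions, and the smallest observed home size is not stated in the text, so no lower limit is checked.

```python
INTERCEPT = 32673.0        # estimated intercept from Example 8.4, in dollars
SLOPE = 35.036             # estimated dollars per additional square foot
MAX_OBSERVED_SQFT = 2400   # largest home size covered by the observations


def estimate_market_value(square_feet):
    """Estimate market value, refusing to extrapolate beyond the observed sizes."""
    if square_feet > MAX_OBSERVED_SQFT:
        raise ValueError(
            f"{square_feet} sq ft exceeds the observed range; "
            "a linear extrapolation may not hold."
        )
    return INTERCEPT + SLOPE * square_feet


for size in (1500, 2200):
    print(f"{size} sq ft -> ${estimate_market_value(size):,.0f}")
```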

Figure 8.9  Best-fitting Simple Linear Regression Line

We can find the best-fitting line using the Excel Trendline tool (with the linear option chosen), as described earlier in this chapter.

Least-Squares Regression

The mathematical basis for the best-fitting regression line is called least-squares regression. In regression analysis, we assume that the values of the dependent variable, Y, in the sample data are drawn from some unknown population for each value of the independent variable, X. For example, in the Home Market Value data, the first and fourth observations come from a population of homes having 1,812 square feet; the second observation comes from a population of homes having 1,914 square feet; and so on.

Because we are assuming that a linear relationship exists, the expected value of Y is $\beta_0 + \beta_1 X$ for each value of X. The coefficients $\beta_0$ and $\beta_1$ are population parameters that represent the intercept and slope, respectively, of the population from which a sample of observations is taken. The intercept is the mean value of Y when X = 0, and the slope is the change in the mean value of Y as X changes by one unit. Thus, for a specific value of X, we have many possible values of Y that vary around the mean. To account for this, we add an error term, $\varepsilon$ (the Greek letter epsilon), to the mean. This defines a simple linear regression model:

$$Y = \beta_0 + \beta_1 X + \varepsilon \qquad (8.1)$$

However, because we don't know the entire population, we don't know the true values of $\beta_0$ and $\beta_1$. In practice, we must estimate these as best we can from the sample data. Define $b_0$ and $b_1$ to be estimates of $\beta_0$ and $\beta_1$. Thus, the estimated simple linear regression equation is

$$\hat{Y} = b_0 + b_1 X \qquad (8.2)$$

Let $X_i$ be the value of the independent variable of the ith observation. When the value of the independent variable is $X_i$, then $\hat{Y}_i = b_0 + b_1 X_i$ is the estimated value of Y for $X_i$. One way to quantify the relationship between each point and the estimated regression equation is to measure the vertical distance between them, as illustrated in Figure 8.10. We

Figure 8.10  Measuring the Errors in a Regression Model (errors associated with individual observations)

can think of these differences, $e_i$, as the observed errors (often called residuals) associated with estimating the value of the dependent variable using the regression line. Thus, the error associated with the ith observation is

$$e_i = Y_i - \hat{Y}_i \qquad (8.3)$$

The best-fitting line should minimize some measure of these errors. Because some errors will be negative and others positive, we might take their absolute value or simply square them. Mathematically, it is easier to work with the squares of the errors. Adding the squares of the errors, we obtain the following function:

$$\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left(Y_i - \hat{Y}_i\right)^2 = \sum_{i=1}^{n} \left(Y_i - [b_0 + b_1 X_i]\right)^2 \qquad (8.4)$$
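To see how the sum of squared errors distinguishes a better line from a worse one, the short Python sketch below evaluates it for two candidate lines through the same points. The data and the two candidate lines (labeled A and B, loosely in the spirit of Figure 8.8) are hypothetical and chosen only so the arithmetic is easy to follow.

```python
def sum_squared_errors(xs, ys, b0, b1):
    """Sum of squared vertical distances between each Y and the line b0 + b1*X."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical observations, made up for illustration.
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 8]

sse_a = sum_squared_errors(xs, ys, b0=0.0, b1=1.9)  # candidate line A
sse_b = sum_squared_errors(xs, ys, b0=3.0, b1=0.5)  # candidate line B

print(f"Line A sum of squared errors: {sse_a:.2f}")
print(f"Line B sum of squared errors: {sse_b:.2f}")
```

Line A has the much smaller sum of squared errors, so under this criterion it is the better of the two candidates.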

If we can find the best values of the slope and intercept that minimize the sum of squares (hence the name "least squares") of the observed errors $e_i$, we will have found the best-fitting regression line. Note that $X_i$ and $Y_i$ are the values of the sample data and that $b_0$ and $b_1$ are unknowns in equation (8.4). Using calculus, we can show that the solution that minimizes the sum of squares of the observed errors is

$$b_1 = \frac{\sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y}}{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2} \qquad (8.5)$$

$$b_0 = \bar{Y} - b_1\bar{X} \qquad (8.6)$$

Although the calculations for the least-squares coefficients appear to be somewhat complicated, they can easily be performed on an Excel spreadsheet. Even better, Excel has built-in capabilities for doing this. For example, you may use the functions INTERCEPT(known_y's, known_x's) and SLOPE(known_y's, known_x's) to find the least-squares coefficients $b_0$ and $b_1$.
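For readers who want to see the calculation outside a spreadsheet, the following Python sketch applies equations (8.5) and (8.6) directly. It is intended as an illustration of the formulas rather than a description of how Excel computes INTERCEPT and SLOPE internally; the function name and the four data points are hypothetical.

```python
def least_squares(xs, ys):
    """Return (b0, b1) computed from equations (8.5) and (8.6)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator and denominator of equation (8.5).
    numerator = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
    denominator = sum(x * x for x in xs) - n * x_bar ** 2
    b1 = numerator / denominator
    # Equation (8.6).
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Hypothetical data that lie exactly on the line Y = 1 + 2X.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

b0, b1 = least_squares(xs, ys)
print(f"intercept b0 = {b0:.3f}, slope b1 = {b1:.3f}")
```

Because these points fall exactly on a straight line, the formulas recover that line's intercept and slope; with real data, the same function returns the coefficients of the line with the smallest sum of squared errors.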

Example 8.5  Using Excel Functions to Find Least-Squares Coefficients

For the Home Market Value Excel file, the range of the dependent variable Y (market value) is C4:C45; the range of the independent variable X (square feet) is B4:B45. The function INTERCEPT(C4:C45, B4:B45) yields $b_0$ = 32,673 and SLOPE(C4:C45, B4:B45) yields $b_1$ = 35.036, as we saw in Example 8.4. The slope tells us that for every additional square foot, the market value increases by \$35.036. We may use the Excel function TREND(known_y’s, known_x’s, new_x’s) to estimate Y for any value of X; for example, for a house with 1,750 square feet, the estimated market value is TREND(C4:C45, B4:B45, 1750) = \$93,986.
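The estimate in Example 8.5 can be reproduced from the reported coefficients alone. The sketch below is not a replacement for TREND, which refits the line from the data; it simply evaluates the fitted line using the rounded intercept and slope quoted in the example. The function name `predict_market_value` is an assumption for illustration.

```python
# Coefficients reported in Example 8.5 for the Home Market Value data
B0 = 32673.0   # intercept, in dollars
B1 = 35.036    # slope, in dollars per square foot


def predict_market_value(square_feet, b0=B0, b1=B1):
    """Evaluate the fitted line Y-hat = b0 + b1 * X."""
    return b0 + b1 * square_feet


estimate = predict_market_value(1750)
print(round(estimate))  # 93986
```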

Chapter 8  Trendlines and Regression Analysis


We could stop at this point, because we have found the best-fitting line for the observed data. However, there is a lot more to regression analysis from a statistical perspective, because we are working with sample data—and usually rather small samples—which we know have a lot of variation as compared with the full population. Therefore, it is important to understand some of the statistical properties associated with regression analysis.

Simple Linear Regression with Excel

Regression-analysis software tools available in Excel provide a variety of information about the statistical properties of regression analysis. The Excel Regression tool can be used for both simple and multiple linear regressions. For now, we focus on using the tool just for simple linear regression.

From the Data Analysis menu in the Analysis group under the Data tab, select the Regression tool. The dialog box shown in Figure 8.11 is displayed. In the box for the Input Y Range, specify the range of the dependent variable values. In the box for the Input X Range, specify the range for the independent variable values. Check Labels if your data range contains a descriptive label (we highly recommend using this). You have the option of forcing the intercept to zero by checking Constant is Zero; however, you will usually not check this box because adding an intercept term allows a better fit to the data. You also can set a Confidence Level (the default of 95% is commonly used) to provide confidence intervals for the intercept and slope parameters. In the Residuals section, you have the option of including a residuals output table by checking the boxes for Residuals, Standardized Residuals, Residual Plots, and Line Fit Plots. Residual Plots generates a chart for each independent variable versus the residual, and Line Fit Plots generates a scatter chart with the values predicted by the regression model included (however, creating a scatter chart with an added trendline is visually superior to what this tool provides). Finally, you may also choose to have Excel construct a normal probability plot for the dependent variable, which transforms the cumulative probability scale (vertical axis) so that the graph of the cumulative normal distribution is a straight line. The closer the points are to a straight line, the better the fit to a normal distribution.

Figure 8.12 shows the basic regression analysis output provided by the Excel Regression tool for the Home Market Value data. The output consists of three sections: Regression Statistics (rows 3–8), ANOVA (rows 10–14), and an unlabeled section at the bottom (rows 16–18) with other statistical information. The least-squares estimates of the slope and intercept are found in the Coefficients column in the bottom section of the output.

Figure 8.11 Excel Regression Tool Dialog


Figure 8.12 Basic Regression Analysis Output for Home Market Value Example

In the Regression Statistics section, Multiple R is another name for the sample correlation coefficient, r, which was introduced in Chapter 4. Values of r range from -1 to 1, where the sign is determined by the sign of the slope of the regression line. A value of r greater than 0 indicates positive correlation; that is, as the independent variable increases, the dependent variable does also. A value less than 0 indicates negative correlation: as X increases, Y decreases. A value of 0 indicates no linear correlation. (Excel reports Multiple R as a nonnegative number, the square root of R Square, so the direction of the relationship should be read from the sign of the slope.)

R-squared ($R^2$) is called the coefficient of determination. Earlier we noted that $R^2$ is a measure of how well the regression line fits the data; this value is also provided by the Trendline tool. Specifically, $R^2$ gives the proportion of variation in the dependent variable that is explained by the independent variable of the regression model. The value of $R^2$ is between 0 and 1. A value of 1.0 indicates a perfect fit, with all data points lying on the regression line, whereas a value of 0 indicates that no linear relationship exists. Although we would like high values of $R^2$, it is difficult to specify a “good” value that signifies a strong relationship because this depends on the application. For example, in scientific applications such as calibrating physical measurement equipment, $R^2$ values close to 1 would be expected; in marketing research studies, an $R^2$ of 0.6 or more is considered very good; however, in many social science applications, values in the neighborhood of 0.3 might be considered acceptable.

Adjusted R Square is a statistic that modifies the value of $R^2$ by incorporating the sample size and the number of explanatory variables in the model. Although it does not give the actual percent of variation explained by the model as $R^2$ does, it is useful when comparing this model with other models that include additional explanatory variables. We discuss it more fully in the context of multiple linear regression later in this chapter.

Standard Error in the Excel output measures the variability of the observed Y-values around the predicted values ($\hat{Y}$). This is formally called the standard error of the estimate, $S_{YX}$. If the data are clustered close to the regression line, then the standard error will be small; the more scattered the data are, the larger the standard error.

Example 8.6  Interpreting Regression Statistics for Simple Linear Regression

After running the Excel Regression tool, the first things to look for are the values of the slope and intercept, namely, the estimates $b_1$ and $b_0$ in the regression model. In the Home Market Value example, we see that the intercept is 32,673, and the slope (coefficient of the independent variable, Square Feet) is 35.036, just as we had computed earlier. In the Regression Statistics section, $R^2$ = 0.5347. This means that approximately 53% of the variation in Market Value is explained by Square Feet. The remaining variation is due to other factors that were not included in the model. The standard error of the estimate is \$7,287.72. If we compare this to the standard deviation of the market value, which is \$10,553, we see that the variation around the regression line (\$7,287.72) is less than the variation around the sample mean (\$10,553). This is because the independent variable in the regression model explains some of the variation.
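The numbers in Example 8.6 are tied together by standard identities, which the following Python sketch checks. It assumes n = 42 observations (rows 4 through 45 in Example 8.5) and uses the usual definitions: the sum of squared residuals is $(n - k - 1)S_{YX}^2$, the total sum of squares is $(n - 1)s_Y^2$, and adjusted $R^2$ is $1 - (1 - R^2)(n - 1)/(n - k - 1)$. Because the inputs are rounded values quoted in the text, the results match the Excel output only approximately; the adjusted value printed here is computed from those rounded inputs and is not read from Figure 8.12.

```python
n = 42            # observations: rows 4 through 45 (assumed from Example 8.5)
k = 1             # one independent variable (Square Feet)
s_yx = 7287.72    # standard error of the estimate, from Example 8.6
s_y = 10553.0     # sample standard deviation of market value, from Example 8.6

sse = (n - k - 1) * s_yx ** 2   # sum of squared residuals around the regression line
sst = (n - 1) * s_y ** 2        # total sum of squares around the sample mean

r_squared = 1 - sse / sst
adjusted_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

print(round(r_squared, 3))            # close to the reported 0.5347
print(round(adjusted_r_squared, 3))   # slightly smaller than R-squared
```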

Regression as Analysis of Variance

In Chapter 7, we introduced analysis of variance (ANOVA), which conducts an F-test to determine whether variation due to a particular factor, such as the differences in sample means, is significantly greater than that due to error. ANOVA is commonly applied to regression to test for significance of regression. For a simple linear regression model, significance of regression is simply a hypothesis test of whether the population regression coefficient $\beta_1$ (slope of the independent variable) is zero:

$$H_0\colon \beta_1 = 0 \qquad H_1\colon \beta_1 \neq 0 \tag{8.7}$$

If we reject the null hypothesis, then we may conclude that the slope of the independent variable is not zero and, therefore, is statistically significant in the sense that it explains some of the variation of the dependent variable around the mean. Similar to our discussion in Chapter 7, you needn’t worry about the mathematical details of how F is computed, or even its value, especially since the tool does not provide the critical value for the test. What is important is the value of Significance F, which is the p-value for the F-test. If Significance F is less than the level of significance (typically 0.05), we would reject the null hypothesis.

Example 8.7  Interpreting Significance of Regression

For the Home Market Value example, the ANOVA test is shown in rows 10–14 in Figure 8.12. Significance F, that is, the p-value associated with the hypothesis test

$$H_0\colon \beta_1 = 0 \qquad H_1\colon \beta_1 \neq 0$$

is essentially zero ($3.798 \times 10^{-8}$). Therefore, assuming a level of significance of 0.05, we must reject the null hypothesis and conclude that the slope (the coefficient for Square Feet) is not zero. This means that home size is a statistically significant variable in explaining the variation in market value.

Testing Hypotheses for Regression Coefficients

Rows 17–18 of the Excel output, in addition to specifying the least-squares coefficients, provide additional information for testing hypotheses associated with the intercept and slope. Specifically, we may test the null hypothesis that the population intercept $\beta_0$ or slope $\beta_1$ equals zero. Usually, it makes little sense to test or interpret the hypothesis that $\beta_0 = 0$ unless the intercept has a significant physical meaning in the context of the application. For simple linear regression, testing the null hypothesis $H_0\colon \beta_1 = 0$ is the same as the significance of regression test that we described earlier. The t-test for the slope is similar to the one-sample test for the mean that we described in Chapter 7. The test statistic is

$$t = \frac{b_1 - 0}{\text{standard error}} \tag{8.8}$$

and is given in the column labeled t Stat in the Excel output. Although the critical value of the t-distribution is not provided, the output does provide the p-value for the test.


Example 8.8  Interpreting Hypothesis Tests for Regression Coefficients

For the Home Market Value example, note that the value of t Stat is computed by dividing the coefficient by the standard error using formula (8.8). For instance, t Stat for the slope is 35.03637258/5.16738385 = 6.780292234. Because Excel does not provide the critical value with which to compare the t Stat value, we may use the p-value to draw a conclusion. Because the p-values for both coefficients are essentially zero, we would conclude that both coefficients are statistically different from zero. Note that the p-value associated with the test for the slope coefficient, Square Feet, is equal to the Significance F value. This will always be true for a regression model with one independent variable because it is the only explanatory variable. However, as we shall see, this will not be the case for multiple regression models.
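The t Stat calculation in Example 8.8 is a single division, sketched below in Python with the values quoted in the example. The sketch also squares the result: in a regression with one independent variable, the F statistic of the ANOVA test equals the square of the slope's t statistic, which is why the two p-values coincide. The F value computed here is derived from that identity and is not read from Figure 8.12.

```python
slope = 35.03637258        # coefficient for Square Feet, from Example 8.8
std_error = 5.16738385     # standard error of the slope, from Example 8.8

t_stat = (slope - 0) / std_error   # equation (8.8)
f_stat = t_stat ** 2               # identity for simple linear regression: F = t^2

print(round(t_stat, 4))  # 6.7803
print(round(f_stat, 2))
```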

Confidence Intervals for Regression Coefficients

Confidence intervals (Lower 95% and Upper 95% values in the output) provide information about the unknown values of the true regression coefficients, accounting for sampling error. They tell us what we can reasonably expect to be the ranges for the population intercept and slope at a 95% confidence level. We may also use confidence intervals to test hypotheses about the regression coefficients. For example, in Figure 8.12, we see that neither confidence interval includes zero; therefore, we can conclude that $\beta_0$ and $\beta_1$ are statistically different from zero. Similarly, we can use them to test the hypotheses that the regression coefficients equal some value other than zero. For example, to test the hypotheses

$$H_0\colon \beta_1 = B_1 \qquad H_1\colon \beta_1 \neq B_1$$

we need only check whether $B_1$ falls within the confidence interval for the slope. If it does not, then we reject the null hypothesis; otherwise, we fail to reject it.

Example 8.9  Interpreting Confidence Intervals for Regression Coefficients

For the Home Market Value data, a 95% confidence interval for the intercept is [14,823, 50,523]. Similarly, a 95% confidence interval for the slope is [24.59, 45.48]. Although the regression model is $\hat{Y}$ = 32,673 + 35.036X, the confidence intervals suggest a bit of uncertainty about predictions using the model. Thus, although we estimated that a house with 1,750 square feet has a market value of 32,673 + 35.036(1,750) = \$93,986, if the true population parameters are at the extremes of the confidence intervals, the estimate might be as low as 14,823 + 24.59(1,750) = \$57,855 or as high as 50,523 + 45.48(1,750) = \$130,113. Narrower confidence intervals provide more accuracy in our predictions.
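The two uses of confidence intervals described above, testing a hypothesized coefficient value and bounding an estimate, can be sketched in Python with the interval endpoints from Example 8.9. The helper names `rejects_null` and `line_value`, and the hypothesized slope of 30, are assumptions for illustration. Combining both interval extremes mirrors the informal calculation in the example; it is not a formal prediction interval.

```python
intercept_ci = (14823.0, 50523.0)   # 95% confidence interval for the intercept
slope_ci = (24.59, 45.48)           # 95% confidence interval for the slope


def rejects_null(hypothesized_value, ci):
    """Return True if the hypothesized value falls outside the confidence interval."""
    lower, upper = ci
    return not (lower <= hypothesized_value <= upper)


def line_value(b0, b1, x):
    """Evaluate b0 + b1 * x."""
    return b0 + b1 * x


square_feet = 1750
low = line_value(intercept_ci[0], slope_ci[0], square_feet)
high = line_value(intercept_ci[1], slope_ci[1], square_feet)
print(low, high)  # about 57,855.5 and 130,113

print(rejects_null(0.0, slope_ci))    # True: zero lies outside the interval
print(rejects_null(30.0, slope_ci))   # False: 30 (hypothetical B1) lies inside
```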

Residual Analysis and Regression Assumptions

Recall that residuals are the observed errors, which are the differences between the actual values and the estimated values of the dependent variable using the regression equation. Figure 8.13 shows a portion of the residual table generated by the Excel Regression tool. The residual output includes, for each observation, the predicted value using the estimated regression equation, the residual, and the standard residual. The residual is simply the difference between the actual value of the dependent variable and the predicted value, or $Y_i - \hat{Y}_i$. Figure 8.14 shows the residual plot generated by the Excel tool. This chart is actually a scatter chart of the residuals with the values of the independent variable on the x-axis.


Figure 8.13 Portion of Residual Output

Figure 8.14 Residual Plot for Square Feet

Standard residuals are residuals divided by their standard deviation. Standard residuals describe how far each residual is from its mean in units of standard deviations (similar to a z-value for a standard normal distribution). Standard residuals are useful in checking assumptions underlying regression analysis, which we will address shortly, and to detect outliers that may bias the results. Recall that an outlier is an extreme value that is different from the rest of the data. A single outlier can make a significant difference in the regression equation, changing the slope and intercept and, hence, how they would be interpreted and used in practice. Some consider a standardized residual outside of ±2 standard deviations as an outlier. A more conservative rule of thumb would be to consider outliers outside of a ±3 standard deviation range. (Commercial software packages have more sophisticated techniques for identifying outliers.)

Example 8.10  Interpreting Residual Output

For the Home Market Value data, the first observation has a market value of \$90,000 and the regression model predicts \$96,159.13. Thus, the residual is 90,000 − 96,159.13 = −\$6,159.13. The standard deviation of the residuals can be computed as 7,198.299. By dividing the residual by this value, we have the standardized residual for the first observation. The value of −0.8556 tells us that the first observation is about 0.86 standard deviations below the regression line. If you check the values of all the standardized residuals, you will find that the value of the last data point is 4.53, meaning that the market value of this home, having only 1,581 square feet, is more than 4 standard deviations above the predicted value and would clearly be identified as an outlier. (If you look back at Figure 8.7, you may have noticed that this point appears to be quite different from the rest of the data.) You might question whether this observation belongs in the data, because the house has a large value despite a relatively small size. The explanation might be an outdoor pool or an unusually large plot of land. Because this value will influence the regression results and may not be representative of the other homes in the neighborhood, you might consider dropping this observation and recomputing the regression model.
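The standardized residual in Example 8.10 and the ±3 rule of thumb can be sketched in Python as follows. The function names and the default threshold argument are assumptions for illustration; the numeric inputs are the ones quoted in the example.

```python
def standardized_residual(actual, predicted, residual_std_dev):
    """Residual (actual minus predicted) expressed in standard deviations."""
    return (actual - predicted) / residual_std_dev


def is_outlier(std_residual, threshold=3.0):
    """Flag a standardized residual beyond +/- threshold standard deviations."""
    return abs(std_residual) > threshold


# First observation in Example 8.10
z_first = standardized_residual(90000.0, 96159.13, 7198.299)
print(round(z_first, 4))     # -0.8556

print(is_outlier(z_first))   # False
print(is_outlier(4.53))      # True: the last data point in the example
```

Passing `threshold=2.0` applies the less conservative ±2 rule mentioned earlier.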


Checking Assumptions

The statistical hypothesis tests associated with regression analysis are predicated on some key assumptions about the data.

1. Linearity. This is usually checked by examining a scatter diagram of the data or examining the residual plot. If the model is appropriate, then the residuals should appear to be randomly scattered about zero, with no apparent pattern. If the residuals exhibit some well-defined pattern, such as a linear trend, a parabolic shape, and so on, then there is good evidence that some other functional form might better fit the data.

2. Normality of errors. Regression analysis assumes that the errors for each individual value of X are normally distributed, with a mean of zero. This can be verified either by examining a histogram of the standard residuals and inspecting for a bell-shaped distribution or by using more formal goodness-of-fit tests. It is usually difficult to evaluate normality with small sample sizes. However, regression analysis is fairly robust against departures from normality, so in most cases this is not a serious issue.

3. Homoscedasticity. The third assumption is homoscedasticity, which means that the variation about the regression line is constant for all values of the independent variable. This can also be evaluated by examining the residual plot and looking for large differences in the variances at different values of the independent variable. Caution should be exercised when looking at residual plots. In many applications, the model is derived from limited data, and multiple observations for different values of X are not available, making it difficult to draw definitive conclusions about homoscedasticity. If this assumption is seriously violated, then techniques other than least squares should be used for estimating the regression model.

4. Independence of errors. Finally, residuals should be independent for each value of the independent variable. For cross-sectional data, this assumption is usually not a problem. However, when time is the independent variable, this is an important assumption. If successive observations appear to be correlated (for example, by becoming larger over time or exhibiting a cyclical type of pattern), then this assumption is violated. Correlation among successive observations over time is called autocorrelation and can be identified by residual plots having clusters of residuals with the same sign. Autocorrelation can be evaluated more formally using a statistical test based on a measure called the Durbin–Watson statistic. The Durbin–Watson statistic is

$$D = \frac{\sum_{i=2}^{n} \left(e_i - e_{i-1}\right)^2}{\sum_{i=1}^{n} e_i^2} \tag{8.9}$$

This is the ratio of the sum of squared differences in successive residuals to the sum of the squares of all residuals. D will range from 0 to 4. When successive residuals are positively autocorrelated, D will approach 0. Critical values of the statistic have been tabulated based on the sample size and number of independent variables; they allow you to conclude that there is evidence of autocorrelation, that there is no evidence of autocorrelation, or that the test is inconclusive. For most practical purposes, values below 1 suggest positive autocorrelation; values above 1.5 and below 2.5 suggest no autocorrelation; and values above 2.5 suggest negative autocorrelation. This can become an issue when using regression in forecasting, which we discuss in the next chapter. Some software packages compute this statistic; however, Excel does not.

When assumptions of regression are violated, then statistical inferences drawn from the hypothesis tests may not be valid. Thus, before drawing inferences about regression models and performing hypothesis tests, these assumptions should be checked. However, other than linearity, these assumptions are not needed solely for model fitting and estimation purposes.

Figure 8.15 Histogram of Standard Residuals
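Because Excel does not compute the Durbin–Watson statistic, equation (8.9) is easy to evaluate separately once the residuals are available. The Python sketch below is one way to do it; the function name and both residual sequences are hypothetical and chosen only to show the two ends of the scale.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic D from equation (8.9) for residuals in time order."""
    if len(residuals) < 2:
        raise ValueError("at least two residuals are required")
    numerator = sum(
        (residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals))
    )
    denominator = sum(e * e for e in residuals)
    if denominator == 0:
        raise ValueError("all residuals are zero, so D is undefined")
    return numerator / denominator


# Hypothetical residuals that drift steadily upward (same-sign clusters)
trending = [-2.0, -1.0, -1.0, 1.0, 1.0, 2.0]
# Hypothetical residuals that flip sign at every step
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]

d_trending = durbin_watson(trending)
d_alternating = durbin_watson(alternating)
print(d_trending)      # 0.5, below 1: suggests positive autocorrelation
print(d_alternating)   # above 2.5: suggests negative autocorrelation
```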

Example 8.11  Checking Regression Assumptions for the Home Market Value Data

Linearity: The scatter diagram of the market value data appears to be linear; looking at the residual plot in Figure 8.14 also confirms no pattern in the residuals.

Normality of errors: Figure 8.15 shows a histogram of the standard residuals for the market value data. The distribution appears to be somewhat positively skewed (particularly with the outlier) but does not appear to be a serious departure from normality, particularly as the sample size is small.

Homoscedasticity: In the residual plot in Figure 8.14, we see no serious differences in the spread of the data for different values of X, particularly if the outlier is eliminated.

Independence of errors: Because the data are cross-sectional, we can assume that this assumption holds.

Multiple Linear Regression

Many colleges try to predict student performance as a function of several characteristics. In the Excel file Colleges and Universities (see Figure 8.16), suppose that we wish to predict the graduation rate as a function of the other variables: median SAT, acceptance rate, expenditures/student, and percent in the top 10% of their high school class. It is logical to propose that schools with students who have higher SAT scores, a lower acceptance rate, a larger budget, and a higher percentage of students in the top 10% of their high school classes will tend to retain and graduate more students.

Figure 8.16 Portion of Excel File Colleges and Universities

A linear regression model with more than one independent variable is called a multiple linear regression model. Simple linear regression is just a special case of multiple linear regression. A multiple linear regression model has the form:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \varepsilon \tag{8.10}$$

where
Y is the dependent variable,
$X_1, \ldots, X_k$ are the independent (explanatory) variables,
$\beta_0$ is the intercept term,
$\beta_1, \ldots, \beta_k$ are the regression coefficients for the independent variables,
$\varepsilon$ is the error term.

Similar to simple linear regression, we estimate the regression coefficients (called partial regression coefficients) $b_0, b_1, b_2, \ldots, b_k$, then use the model:

$$\hat{Y} = b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_k X_k \tag{8.11}$$

to predict the value of the dependent variable. The partial regression coefficients represent the expected change in the dependent variable when the associated independent variable is increased by one unit while the values of all other independent variables are held constant. For the college and university data, the proposed model would be

$$\text{Graduation\%} = b_0 + b_1\,\text{SAT} + b_2\,\text{ACCEPTANCE} + b_3\,\text{EXPENDITURES} + b_4\,\text{TOP10\% HS}$$

Thus, $b_2$ would represent an estimate of the change in the graduation rate for a unit increase in the acceptance rate while holding all other variables constant. As with simple linear regression, multiple linear regression uses least squares to estimate the intercept and slope coefficients that minimize the sum of squared error terms over all observations. The principal assumptions discussed for simple linear regression also hold here. The Excel Regression tool can easily perform multiple linear regression; you need to specify only the full range for the independent variable data in the dialog. One caution when using the tool: the independent variables in the spreadsheet must be in contiguous columns. So, you may have to manually move the columns of data around before applying the tool.


The results from the Regression tool are in the same format as we saw for simple linear regression. However, some key differences exist. Multiple R and R Square (or $R^2$) are called the multiple correlation coefficient and the coefficient of multiple determination, respectively, in the context of multiple regression. They indicate the strength of association between the dependent and independent variables. Similar to simple linear regression, $R^2$ gives the percentage of variation in the dependent variable that is explained by the set of independent variables in the model.

The interpretation of the ANOVA section is quite different from that in simple linear regression. For multiple linear regression, ANOVA tests for significance of the entire model. That is, it computes an F-statistic for testing the hypotheses

$$H_0\colon \beta_1 = \beta_2 = \cdots = \beta_k = 0 \qquad H_1\colon \text{at least one } \beta_j \text{ is not } 0$$

The null hypothesis states that no linear relationship exists between the dependent variable and any of the independent variables, whereas the alternative hypothesis states that the dependent variable has a linear relationship with at least one independent variable. If the null hypothesis is rejected, we cannot conclude that a relationship exists with every independent variable individually.

The multiple linear regression output also provides information to test hypotheses about each of the individual regression coefficients. Specifically, we may test the null hypothesis that $\beta_0$ (the intercept) or any $\beta_i$ equals zero. If we reject the null hypothesis that the slope associated with independent variable i is zero, $H_0\colon \beta_i = 0$, then we may state that independent variable i is significant in the regression model; that is, it contributes to reducing the variation in the dependent variable and improves the ability of the model to predict the dependent variable. However, if we cannot reject $H_0$, then that independent variable is not significant and probably should not be included in the model. We will see how to use this information to identify the best model in the next section.

Finally, for multiple regression models, a residual plot is generated for each independent variable. This allows you to assess the linearity and homoscedasticity assumptions of regression.

Example 8.12  Interpreting Regression Results for the Colleges and Universities Data

The multiple regression results for the college and university data are shown in Figure 8.17. From the Coefficients section, we see that the model is:

$$\text{Graduation\%} = 17.92 + 0.072\,\text{SAT} - 24.859\,\text{ACCEPTANCE} - 0.000136\,\text{EXPENDITURES} - 0.163\,\text{TOP10\% HS}$$

The signs of some coefficients make sense; higher SAT scores and lower acceptance rates suggest higher graduation rates. However, we might expect that larger student expenditures and a higher percentage of top high school students would also positively influence the graduation rate. Perhaps the problem occurred because some of the best students are more demanding and change schools if their needs are not being met, some entrepreneurial students might pursue other interests before graduation, or there is sampling error. As with simple linear regression, the model should be used only for values of the independent variables within the range of the data.

The value of $R^2$ (0.53) indicates that 53% of the variation in the dependent variable is explained by these independent variables. This suggests that other factors not included in the model, perhaps campus living conditions, social opportunities, and so on, might also influence the graduation rate.

From the ANOVA section, we may test for significance of regression. At a 5% significance level, we reject the null hypothesis because Significance F is essentially zero. Therefore, we may conclude that at least one slope is statistically different from zero. Looking at the p-values for the independent variables in the last section, we see that all are less than 0.05; therefore, for each independent variable, we reject the null hypothesis that its partial regression coefficient is zero and conclude that each of them is statistically significant.

Figure 8.18 shows one of the residual plots from the Excel output. The assumptions appear to be met, and the other residual plots (not shown) also validate these assumptions. The normal probability plot (also not shown) does not suggest any serious departures from normality.

Figure 8.17 Multiple Regression Results for Colleges and Universities Data

Figure 8.18 Residual Plot for Top 10% HS Variable
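To show how the fitted model in Example 8.12 is used for prediction, the Python sketch below evaluates it for one school. The coefficients are those quoted in the example. The input values are hypothetical and are not taken from the Colleges and Universities file, and entering the acceptance rate as a proportion (0.25 rather than 25) is an assumption about how the file stores that variable.

```python
# Coefficients quoted in Example 8.12
INTERCEPT = 17.92
COEFFICIENTS = {
    "SAT": 0.072,
    "ACCEPTANCE": -24.859,
    "EXPENDITURES": -0.000136,
    "TOP10% HS": -0.163,
}


def predict_graduation_rate(school):
    """Evaluate Y-hat = b0 + sum of b_j * X_j for one school (a dict of X values)."""
    return INTERCEPT + sum(
        coefficient * school[name] for name, coefficient in COEFFICIENTS.items()
    )


# Hypothetical school; acceptance rate assumed to be stored as a proportion
school = {"SAT": 1300, "ACCEPTANCE": 0.25, "EXPENDITURES": 25000, "TOP10% HS": 80}
prediction = predict_graduation_rate(school)
print(round(prediction, 2))  # 88.87
```

Each coefficient multiplies only its own variable, so changing one input while holding the others fixed shifts the prediction by that coefficient times the change, which is the "held constant" interpretation of partial regression coefficients.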


Analytics in Practice: Using Linear Regression and Interactive Risk Simulators to Predict Performance at ARAMARK³

ARAMARK is a leader in professional services, providing award-winning food services, facilities management, and uniform and career apparel to health care institutions, universities and school districts, stadiums and arenas, and businesses around the world. Headquartered in Philadelphia, ARAMARK has approximately 255,000 employees serving clients in 22 countries.

ARAMARK’s Global Risk Management Department (GRM) needed a way to determine the statistical relationships between key business metrics (e.g., employee tenure, employee engagement, a trained workforce, account tenure, service offerings) and risk metrics (e.g., OSHA rate, workers’ compensation rate, customer injuries) to understand the impact of these risks on the business. GRM also needed a simple tool that field operators and the risk management team could use to predict the impact of business decisions on risk metrics before those decisions were implemented. Typical questions they would want to ask were, “What would happen to our OSHA rate if we increased the percentage of part-time labor?” and “How could we impact turnover if operations improved safety performance?”

ARAMARK maintains extensive historical data. For example, the Global Risk Management group keeps track of data such as OSHA rates, slip/trip/fall rates, injury costs, and level of compliance with safety standards; the Human Resources department monitors turnover and percentage of part-time labor; the Payroll department keeps data on average wages; and the Training and Organizational Development department collects data on employee engagement. Excel-based linear regression was used to determine the relationships between the dependent variables (such as OSHA rate, slip/trip/fall rate, claim cost, and turnover) and the independent variables (such as the percentage of part-time labor, average wage, employee engagement, and safety compliance).

Although the regression models provided the basic analytical support that ARAMARK needed, the GRM team used a novel approach to implement the models for use by their clients. They developed “Interactive Risk Simulators,” which are simple online tools that allowed users to manipulate the values of the independent variables in the regression models using interactive sliders that correspond to the business metrics and instantaneously view the values of the dependent variables (the risk metrics) on gauges similar to those found on the dashboard of a car. Figure 8.19 illustrates the structure of the simulators. The gauges are updated instantly as the user adjusts the sliders, showing how changes in the business environment affect the risk metrics. This visual representation made the models easy to use and understand, particularly for nontechnical employees.


GRM sent out more than 200 surveys to multiple levels of the organization to assess the usefulness of Interactive Risk Simulators. One hundred percent of respondents answered “Yes” to “Were the simulators easy to use?” and 78% of respondents answered “Yes” to “Would these simulators be useful in running your business and helping you make decisions?” The deployment of Interactive Risk Simulators to the field has been met with an overwhelmingly positive response and recognition from leadership within all lines of business, including frontline managers, food-service directors, district managers, and general managers.

³ The author expresses his appreciation to John Toczek, Manager of Decision Support and Analytics at ARAMARK Corporation.

Figure 8.19 Structure of an Interactive Risk Simulator (Inputs: Independent Variables → Regression Models → Outputs: Dependent Variables)

Building Good Regression Models In the colleges and universities regression example, all the independent variables were found to be significant by evaluating the p-values of the regression analysis. This will not always be the case and leads to the question of how to build good regression models that include the “best” set of variables. Figure 8.20 shows a portion of the Excel file Banking Data, which provides data acquired from banking and census records for different zip codes in the bank’s current market. Such information can be useful in targeting advertising for new customers or for choosing locations for branch offices. The data show the median age of the population, median years of education, median income, median home value, median household wealth, and average bank balance. Figure 8.21 shows the results of regression analysis used to predict the average bank balance as a function of the other variables. Although the independent variables explain more than 94% of the variation in the average bank balance, you can see that at a 0.05 significance level, the p-values indicate that both Education and Home Value do not appear to be significant. A good regression model should include only significant independent variables. However, it is not always clear exactly what will happen when we add or remove variables from a model; variables that are (or are not) significant in one model may (or may not) be significant in another. Therefore, you should not consider dropping all insignificant variables at one time, but rather take a more structured approach. Adding an independent variable to a regression model will always result in R2 equal to or greater than the R2 of the original model. This is true even when the new independent

Figure 8.20 Portion of Banking Data


Figure 8.21 Regression Analysis Results for Banking Data

variable has little true relationship with the dependent variable. Thus, trying to maximize R² is not a useful criterion. A better way of evaluating the relative fit of different models is to use adjusted R². Adjusted R² reflects both the number of independent variables and the sample size and may either increase or decrease when an independent variable is added or dropped, thus providing an indication of the value of adding or removing independent variables in the model. An increase in adjusted R² indicates that the model has improved.

This suggests a systematic approach to building good regression models:

1. Construct a model with all available independent variables. Check for significance of the independent variables by examining the p-values.
2. Identify the independent variable having the largest p-value that exceeds the chosen level of significance.
3. Remove the variable identified in step 2 from the model and evaluate adjusted R². (Don’t remove all variables with p-values that exceed α at the same time; remove only one at a time.)
4. Continue until all variables are significant.

In essence, this approach seeks to find a significant model that has the highest adjusted R².
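The four steps can be sketched as a short loop. This is only an illustration under stated assumptions: `fit_p_values` is a hypothetical callback that stands in for rerunning a regression tool, and the p-values in `FAKE_RUNS` are made up for the demonstration (they are not the values in the chapter's figures). The sketch also omits the adjusted R² check in step 3, which you would still inspect after each refit.

```python
from typing import Callable, Dict, FrozenSet, List

# Hypothetical callback type: given a set of variable names, "run the regression"
# and return the p-value of each variable in that model.
PValueFn = Callable[[FrozenSet[str]], Dict[str, float]]


def backward_eliminate(variables: List[str], fit_p_values: PValueFn,
                       alpha: float = 0.05) -> List[str]:
    """Remove the least significant variable, one per pass, until all p-values <= alpha."""
    current = list(variables)
    while current:
        p = fit_p_values(frozenset(current))      # steps 1 and 3: (re)fit the model
        worst = max(current, key=lambda v: p[v])  # step 2: largest p-value
        if p[worst] <= alpha:                     # step 4: every variable is significant
            break
        current.remove(worst)                     # drop only this one variable
    return current


# Made-up p-values standing in for two runs of a regression tool.
# They are NOT the values reported in the chapter's figures.
FAKE_RUNS = {
    frozenset({"Age", "Education", "Income", "Home Value", "Wealth"}): {
        "Age": 0.001, "Education": 0.06, "Income": 0.002,
        "Home Value": 0.37, "Wealth": 0.001,
    },
    frozenset({"Age", "Education", "Income", "Wealth"}): {
        "Age": 0.001, "Education": 0.03, "Income": 0.002, "Wealth": 0.001,
    },
}

kept = backward_eliminate(
    ["Age", "Education", "Income", "Home Value", "Wealth"],
    FAKE_RUNS.__getitem__,
)
print(kept)
```

Because the model is refit after every removal, a variable whose p-value exceeded α in the first run gets a second chance once a correlated variable has been dropped.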

Example 8.13  Identifying the Best Regression Model

We will apply the preceding approach to the Banking Data example. The first step is to identify the variable with the largest p-value exceeding 0.05; in this case, it is Home Value, and we remove it from the model and rerun the Regression tool. Figure 8.22 shows the results after removing Home Value. Note that the adjusted R² has increased slightly, whereas the R² value decreased slightly because we removed a variable from the model. All the p-values are now less than 0.05, so this now appears to be the best model. Notice that the p-value for Education, which was larger than 0.05 in the first regression analysis, dropped below 0.05 after Home Value was removed. This phenomenon often occurs when multicollinearity (discussed in the next section) is present and emphasizes the importance of not removing all variables with large p-values from the original model at the same time.
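To see how R² can fall while adjusted R² rises, the sketch below uses the standard definition, adjusted R² = 1 − (1 − R²)(n − 1)/(n − k − 1), where n is the sample size and k the number of independent variables. The sample size and R² values are hypothetical and are not read from Figures 8.21 and 8.22.

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R^2 for a model with k independent variables and n observations."""
    if n - k - 1 <= 0:
        raise ValueError("need more observations than estimated coefficients")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


# Hypothetical numbers: dropping one weak variable lowers R^2 slightly...
n = 30
full = adjusted_r2(0.901, n, k=5)     # five independent variables
reduced = adjusted_r2(0.900, n, k=4)  # one variable removed

# ...but adjusted R^2 goes up because the penalty for k shrinks.
print(f"full model:    adjusted R^2 = {full:.4f}")
print(f"reduced model: adjusted R^2 = {reduced:.4f}")
```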


Figure 8.22 Regression Results without Home Value

Another criterion used to determine if a variable should be removed is the t-statistic. If |t| < 1, then the standard error will decrease and adjusted R² will increase if the variable is removed. If |t| > 1, then the opposite will occur. In the banking regression results, we see that the t-statistic for Home Value is less than 1; therefore, we expect the adjusted R² to increase if we remove this variable. You can follow the same iterative approach outlined before, except using t-values instead of p-values.

These approaches using the p-values or t-statistics may involve considerable experimentation to identify the best set of variables that result in the largest adjusted R². For large numbers of independent variables, the number of potential models can be overwhelming. For example, there are 2¹⁰ = 1,024 possible models that can be developed from a set of 10 independent variables. This can make it difficult to effectively screen out insignificant variables. Fortunately, automated methods (stepwise regression and best subsets) exist that facilitate this process.
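The count of 1,024 follows from each of the 10 variables being either in or out of the model. A minimal sketch that enumerates the subsets, using placeholder variable names X1 to X10 (the count includes the model with no independent variables, that is, the intercept alone):

```python
from itertools import combinations

variables = [f"X{i}" for i in range(1, 11)]  # placeholder names for 10 variables

# Every subset of size 0, 1, ..., 10 defines one candidate model.
subsets = [
    combo
    for size in range(len(variables) + 1)
    for combo in combinations(variables, size)
]

print(len(subsets))          # total number of candidate models
print(2 ** len(variables))   # the same count from the formula
```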

Correlation and Multicollinearity

As we have learned previously, correlation, a numerical value between −1 and +1, measures the linear relationship between pairs of variables. The higher the absolute value of the correlation, the greater the strength of the relationship. The sign simply indicates whether the variables tend to increase together (positive) or one tends to decrease as the other increases (negative). Therefore, examining correlations between the dependent and independent variables, which can be done using the Excel Correlation tool, can be useful in selecting variables to include in a multiple regression model because a strong correlation indicates a strong linear relationship.

However, strong correlations among the independent variables can be problematic. This can potentially signify a phenomenon called multicollinearity, a condition occurring when two or more independent variables in the same regression model contain high levels of the same information and, consequently, are strongly correlated with one another and can predict each other better than they predict the dependent variable. When significant multicollinearity is present, it becomes difficult to isolate the effect of one independent variable on the dependent variable, and the signs of coefficients may be the opposite of what they should be, making it difficult to interpret regression coefficients. Also, p-values can be inflated, resulting in the conclusion not to reject the null hypothesis for significance of regression when it should be rejected.
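For readers working outside Excel, the correlation between two columns can be computed directly from its definition. The sketch below is a plain-Python illustration; the `income` and `wealth` figures are hypothetical and are not taken from the Banking Data file.

```python
from math import sqrt
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length columns."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical columns (in thousands of dollars) for two independent variables.
income = [30, 40, 50, 60, 70]
wealth = [100, 130, 150, 190, 210]

r = pearson(income, wealth)
print(f"r = {r:.3f}")  # close to +1: the two columns carry much the same information
```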


Some experts suggest that correlations between independent variables exceeding an absolute value of 0.7 may indicate multicollinearity. However, multicollinearity is best measured using a statistic called the variance inflation factor (VIF) for each independent variable. More-sophisticated software packages usually compute these; unfortunately, Excel does not.
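The VIF for independent variable j is conventionally defined as 1/(1 − Rj²), where Rj² is the R² obtained by regressing that variable on all the other independent variables. With only two independent variables, Rj² is just the square of their correlation, which makes a quick sketch possible. The correlation values below are illustrative only; commonly cited rules of thumb treat a VIF above roughly 5 or 10 as a warning sign, but that cutoff is a convention rather than something stated in this chapter.

```python
def vif(r2_j: float) -> float:
    """Variance inflation factor given R^2 from regressing X_j on the other X's."""
    if not 0.0 <= r2_j < 1.0:
        raise ValueError("R^2 must be in [0, 1)")
    return 1.0 / (1.0 - r2_j)


# Two-predictor case: R_j^2 equals the squared correlation between the predictors.
for r in (0.3, 0.7, 0.95):
    print(f"correlation {r:.2f} -> VIF {vif(r * r):.2f}")
```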

Example 8.14  Identifying Potential Multicollinearity

Figure 8.23 shows the correlation matrix for the variables in the Colleges and Universities data. You can see that SAT and Acceptance Rate have moderate correlations with the dependent variable, Graduation%, but the correlations of Expenditures/Student and Top 10% HS with Graduation% are relatively low. The strongest correlation, however, is between two independent variables: Top 10% HS and Acceptance Rate. However, the value of −0.6097 does not exceed the recommended threshold of 0.7 in absolute value, so we can likely assume that multicollinearity is not a problem here (a more advanced analysis using VIF calculations does indeed confirm that multicollinearity does not exist).

In contrast, Figure 8.24 shows the correlation matrix for all the data in the banking example. Note that large correlations exist between Education and Home Value and also between Wealth and Income (in fact, the variance inflation factors do indicate significant multicollinearity). If we remove Wealth from the model, the adjusted R² drops to 0.9201, but we discover that Education is no longer significant. Dropping Education and leaving only Age and Income in the model results in an adjusted R² of 0.9202. However, if we remove Income from the model instead of Wealth, the adjusted R² drops only to 0.9345, and all remaining variables (Age, Education, and Wealth) are significant (see Figure 8.25). The R² value for the model with these three variables is 0.936.

Practical Issues in Trendline and Regression Modeling

Example 8.14 clearly shows that it is not easy to identify the best regression model simply by examining p-values. It often requires some experimentation and trial and error. From a practical perspective, the independent variables selected should make some sense in attempting to explain the dependent variable (i.e., you should have some reason to believe that changes in the independent variable will cause changes in the dependent variable even though causation cannot be proven statistically). Logic should guide your model

Figure 8.23 Correlation Matrix for Colleges and Universities Data

Figure 8.24 Correlation Matrix for Banking Data


Figure 8.25 Regression Results for Age, Education, and Wealth as Independent Variables

development. In many applications, behavioral, economic, or physical theory might suggest that certain variables should belong in a model. Remember that additional variables do contribute to a higher R² and, therefore, help to explain a larger proportion of the variation. Even though a variable with a large p-value is not statistically significant, it could simply be the result of sampling error and a modeler might wish to keep it.

Good modelers also try to have as simple a model as possible (an age-old principle known as parsimony), with the fewest explanatory variables that will provide an adequate interpretation of the dependent variable. In the physical and management sciences, some of the most powerful theories are the simplest. Thus, a model for the banking data that includes only age, education, and wealth is simpler than one with four variables; because of the multicollinearity issue, there would be little to gain by including income in the model. Whether the model explains 93% or 94% of the variation in bank deposits would probably make little difference. Therefore, building good regression models relies as much on experience and judgment as it does on technical analysis.

One issue that one often faces in using trendlines and regression is overfitting the model. It is important to realize that sample data may have unusual variability that is different from the population; if we fit a model too closely to the sample data, we risk not fitting it well to the population in which we are interested. For instance, in fitting the crude oil prices in Example 8.2, we noted that the R² value will increase if we fit higher-order polynomial functions to the data. While this might provide a better mathematical fit to the sample data, doing so can make it difficult to explain the phenomena rationally. The same thing can happen with multiple regression. If we add too many terms to the model, then the model may not adequately predict other values from the population. Overfitting can be mitigated by using good logic, intuition, physical or behavioral theory, and parsimony as we have discussed.

Regression with Categorical Independent Variables

Some data of interest in a regression study may be ordinal or nominal. This is common when including demographic data in marketing studies, for example. Because regression analysis requires numerical data, we could include categorical variables by coding the variables. For example, if one variable represents whether an individual is a college graduate or not, we might code No as 0 and Yes as 1. Such variables are often called dummy variables.


Example 8.15  A Model with Categorical Variables

The Excel file Employee Salaries, shown in Figure 8.26, provides salary and age data for 35 employees, along with an indicator of whether or not the employees have an MBA (Yes or No). The MBA indicator variable is categorical; thus, we code it by replacing No by 0 and Yes by 1. If we are interested in predicting salary as a function of the other variables, we would propose the model

Y = β0 + β1X1 + β2X2 + ε

where
Y = salary
X1 = age
X2 = MBA indicator (0 or 1)

After coding the MBA indicator column in the data file, we begin by running a regression on the entire data set, yielding the output shown in Figure 8.27. Note that the model explains about 95% of the variation, and the p-values of both variables are significant. The model is

salary = 893.59 + 1,044.15 × age + 14,767.23 × MBA

Figure 8.26 Portion of Excel File Employee Salaries

Figure 8.27 Initial Regression Model for Employee Salaries

Thus, a 30-year-old with an MBA would have an estimated salary of

salary = 893.59 + 1,044.15 × 30 + 14,767.23 × 1 = \$46,985.32

This model suggests that having an MBA increases the salary of this group of employees by almost \$15,000. Note that by substituting either 0 or 1 for MBA, we obtain two models:

No MBA: salary = 893.59 + 1,044.15 × age
MBA: salary = 15,660.82 + 1,044.15 × age

The only difference between them is the intercept. The models suggest that the rate of salary increase for age is the same for both groups. Of course, this may not be true. Individuals with MBAs might earn relatively higher salaries as they get older. In other words, the slope of Age may depend on the value of MBA.
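The arithmetic in this example can be checked with a few lines of code. The sketch below uses the fitted coefficients reported in the example; the function and constant names are illustrative and are not part of the Excel workbook.

```python
# Fitted coefficients reported in Example 8.15.
INTERCEPT = 893.59
AGE_COEF = 1044.15
MBA_COEF = 14767.23


def code_mba(answer: str) -> int:
    """Dummy-code the categorical MBA column: No -> 0, Yes -> 1."""
    return 1 if answer.strip().lower() == "yes" else 0


def predict_salary(age: float, mba: int) -> float:
    """Estimated salary from the fitted model with a 0/1 MBA indicator."""
    return INTERCEPT + AGE_COEF * age + MBA_COEF * mba


print(round(predict_salary(30, code_mba("Yes")), 2))  # 30-year-old with an MBA
print(round(INTERCEPT + MBA_COEF, 2))                 # intercept of the MBA model
```

Setting the indicator to 0 or 1 changes only the constant term, which is why the two group models share the same slope on age.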


An interaction occurs when the effect of one variable (i.e., the slope) is dependent on another variable. We can test for interactions by defining a new variable as the product of the two variables, X3 = X1 * X2, and testing whether this variable is significant, leading to an alternative model.

Example 8.16  Incorporating Interaction Terms in a Regression Model

For the Employee Salaries example, we define an interaction term as the product of age (X1) and MBA (X2) by defining X3 = X1 × X2. The new model is

Y = β0 + β1X1 + β2X2 + β3X3 + ε

In the worksheet, we need to create a new column (called Interaction) by multiplying MBA by Age for each observation (see Figure 8.28). The regression results are shown in Figure 8.29. From Figure 8.29, we see that the adjusted R² increases; however, the p-value for the MBA indicator variable is 0.33, indicating that this variable is not significant. Therefore, we drop this variable and run a regression using only age and the interaction term. The results are shown in Figure 8.30. Adjusted R² increased slightly, and both age and the interaction term are significant. The final model is

Figure 8.28 Portion of Employee Salaries Modified for Interaction Term

Figure 8.29 Regression Results with Interaction Term

salary = 3,323.11 + 984.25 × age + 425.58 × MBA × age

The models for employees with and without an MBA are:

No MBA: salary = 3,323.11 + 984.25 × age + 425.58(0) × age = 3,323.11 + 984.25 × age
MBA: salary = 3,323.11 + 984.25 × age + 425.58(1) × age = 3,323.11 + 1,409.83 × age

Here, we see that the effect of an MBA on salary depends on age: both groups share the same intercept, but salary rises faster with age for employees who hold an MBA. This is more realistic than the original model.
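A brief sketch of the same calculation in code, using the coefficients of the final model. The three (age, MBA) rows are hypothetical records used only to show how the Interaction column is built; they are not rows from the Employee Salaries file.

```python
# Coefficients of the final model in Example 8.16.
B0, B_AGE, B_INTERACTION = 3323.11, 984.25, 425.58

# Hypothetical (age, MBA) records; the interaction column is their product.
records = [(25, 0), (30, 1), (45, 1)]
with_interaction = [(age, mba, age * mba) for age, mba in records]


def predict_salary(age: float, mba: int) -> float:
    """Estimated salary from the model with age and the age-by-MBA interaction."""
    return B0 + B_AGE * age + B_INTERACTION * (mba * age)


for age, mba, interaction in with_interaction:
    print(age, mba, interaction, round(predict_salary(age, mba), 2))

# The MBA group's slope on age is the sum of the two age-related coefficients.
print(round(B_AGE + B_INTERACTION, 2))
```

The gap between the two groups at a given age is 425.58 × age, so it widens as employees get older instead of staying fixed as in Example 8.15.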


Figure 8.30 Final Regression Model for Salary Data

Categorical Variables with More Than Two Levels

When a categorical variable has only two levels, as in the previous example, we coded the levels as 0 and 1 and added a new variable to the model. However, when a categorical variable has k > 2 levels, we need to add k − 1 additional variables to the model.

Example 8.17  A Regression Model with Multiple Levels of Categorical Variables

The Excel file Surface Finish provides measurements of the surface finish of 35 parts produced on a lathe, along with the revolutions per minute (RPM) of the spindle and one of four types of cutting tools used (see Figure 8.31). The engineer who collected the data is interested in predicting the surface finish as a function of RPM and type of tool. Intuition might suggest defining a dummy variable for each tool type; however, the four dummy variables would then sum to 1 for every observation, duplicating the intercept term. The independent variables would be perfectly collinear, so the regression tool could not estimate the coefficients uniquely (and may fail or report unreliable results). Instead, we will need k − 1 = 3 dummy variables corresponding to three of the levels of the categorical variable. The level left out will correspond to a reference, or baseline, value. Therefore, because we have k = 4 levels of tool type, we will define a regression model of the form

Y = β0 + β1X1 + β2X2 + β3X3 + β4X4 + ε

where
Y = surface finish
X1 = RPM
X2 = 1 if tool type is B and 0 if not
X3 = 1 if tool type is C and 0 if not
X4 = 1 if tool type is D and 0 if not

Note that when X2 = X3 = X4 = 0, then, by default, the tool type is A. Substituting these values for each tool type into the model, we obtain:

Tool type A: Y = β0 + β1X1 + ε
Tool type B: Y = β0 + β1X1 + β2 + ε
Tool type C: Y = β0 + β1X1 + β3 + ε
Tool type D: Y = β0 + β1X1 + β4 + ε

For a fixed value of RPM (X1), the coefficients of the dummy variables represent the difference between the surface finish using that tool type and the baseline using tool type A. To incorporate these dummy variables into the regression model, we add three columns to the data, as shown in Figure 8.32. Using these data, we obtain the regression results shown in Figure 8.33. The resulting model is

surface finish = 24.49 + 0.098 RPM − 13.31 type B − 20.49 type C − 26.04 type D

Almost 99% of the variation in surface finish is explained by the model, and all variables are significant. The models for each individual tool are

Tool A: surface finish = 24.49 + 0.098 RPM − 13.31(0) − 20.49(0) − 26.04(0) = 24.49 + 0.098 RPM
Tool B: surface finish = 24.49 + 0.098 RPM − 13.31(1) − 20.49(0) − 26.04(0) = 11.18 + 0.098 RPM
Tool C: surface finish = 24.49 + 0.098 RPM − 13.31(0) − 20.49(1) − 26.04(0) = 4.00 + 0.098 RPM
Tool D: surface finish = 24.49 + 0.098 RPM − 13.31(0) − 20.49(0) − 26.04(1) = −1.55 + 0.098 RPM

Figure 8.31 Portion of Excel File Surface Finish

Figure 8.32 Data Matrix for Surface Finish with Dummy Variables

Note that the only differences among these models are the intercepts; the slopes associated with RPM are the same. This suggests that we might wish to test for interactions between the type of cutting tool and RPM; we leave this to you as an exercise.
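The coding scheme and the four tool-specific intercepts can be reproduced with a short sketch. The coefficients are those reported in the example; the helper names `encode` and `predict_finish` are illustrative. The last lines show why a dummy for every level is redundant: the four indicators always add up to 1, the same value as the intercept column.

```python
TOOLS = ["A", "B", "C", "D"]  # tool A is the baseline level


def encode(tool: str) -> tuple:
    """Return (X2, X3, X4): one 0/1 dummy for each non-baseline tool type."""
    return tuple(1 if tool == level else 0 for level in TOOLS[1:])


def predict_finish(rpm: float, tool: str) -> float:
    """Surface finish from the fitted model in Example 8.17."""
    x2, x3, x4 = encode(tool)
    return 24.49 + 0.098 * rpm - 13.31 * x2 - 20.49 * x3 - 26.04 * x4


# Intercept for each tool: evaluate the model at RPM = 0.
intercepts = {tool: round(predict_finish(0, tool), 2) for tool in TOOLS}
print(intercepts)

# With a dummy for all four levels, each row's dummies sum to 1,
# which is exactly the intercept column (perfect collinearity).
all_four = {tool: [1 if tool == level else 0 for level in TOOLS] for tool in TOOLS}
print({tool: sum(row) for tool, row in all_four.items()})
```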


Figure 8.33 Surface Finish Regression Model Results

Regression Models with Nonlinear Terms

Linear regression models are not appropriate for every situation. A scatter chart of the data might show a nonlinear relationship, or the residuals for a linear fit might result in a nonlinear pattern. In such cases, we might propose a nonlinear model to explain the relationship. For instance, a second-order polynomial model would be

Y = β0 + β1X + β2X² + ε

Sometimes, this is called a curvilinear regression model. In this model, β1 represents the linear effect of X on Y, and β2 represents the curvilinear effect. However, although this model appears to be quite different from ordinary linear regression models, it is still linear in the parameters (the betas, which are the unknowns that we are trying to estimate). In other words, all terms are a product of a beta coefficient and some function of the data, which are simply numerical values. In such cases, we can still apply least squares to estimate the regression coefficients.

Curvilinear regression models are also often used in forecasting when the independent variable is time. This and other applications of regression in forecasting are discussed in the next chapter.
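Because the model is linear in the betas, ordinary least squares applies once X² is treated as just another column. The sketch below illustrates this in plain Python by solving the normal equations (XᵀX)b = Xᵀy for the three columns 1, X, and X². The data are synthetic, generated from Y = 2 + 3X + 0.5X² with no noise, so the fit should recover those coefficients; the function names are illustrative.

```python
from typing import List, Sequence


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_quadratic(xs: Sequence[float], ys: Sequence[float]) -> List[float]:
    """Least-squares estimates (b0, b1, b2) for Y = b0 + b1*X + b2*X^2."""
    rows = [[1.0, x, x * x] for x in xs]  # design matrix columns: 1, X, X^2
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
    return solve(xtx, xty)


# Synthetic, noise-free data from a known curve.
xs = [0, 1, 2, 3, 4, 5]
ys = [2 + 3 * x + 0.5 * x * x for x in xs]

b0, b1, b2 = fit_quadratic(xs, ys)
print(round(b0, 6), round(b1, 6), round(b2, 6))
```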

Example 8.18  Modeling Beverage Sales Using Curvilinear Regression

The Excel file Beverage Sales provides data on the sales of cold beverages at a small restaurant with a large outdoor patio during the summer months (see Figure 8.34). The owner has observed that sales tend to increase on hotter days. Figure 8.35 shows linear regression results for these data. The U-shape of the residual plot (a second-order polynomial trendline was fit to the residual data) suggests that a linear relationship is not appropriate. To apply a curvilinear regression model, add a column to the data matrix by squaring the temperatures. Now, both temperature and temperature squared are the independent variables. Figure 8.36 shows the results for the curvilinear regression model. The model is:

sales = 142,850 − 3,643.17 × temperature + 23.3 × temperature²

Note that the adjusted R² has increased significantly from the linear model and that the residual plots now show more random patterns.
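As a small illustration of the extra column and of how the fitted model is used, the sketch below squares a few temperatures and evaluates the reported equation. The three temperatures are hypothetical inputs, not rows from the Beverage Sales file.

```python
def predict_sales(temperature: float, temperature_sq: float) -> float:
    """Sales from the curvilinear model reported in Example 8.18."""
    return 142850 - 3643.17 * temperature + 23.3 * temperature_sq


# Hypothetical temperatures; the second column is the added squared term.
temperatures = [85, 90, 95]
data_matrix = [(t, t ** 2) for t in temperatures]

for temp, temp_sq in data_matrix:
    print(temp, temp_sq, round(predict_sales(temp, temp_sq), 2))
```

Over this range the predicted increase per degree grows as the temperature rises, which is the upward curvature that the linear model could not capture.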


Figure 8.34 Portion of Excel File Beverage Sales

Figure 8.35 Linear Regression Results for Beverage Sales

Figure 8.36 Curvilinear Regression Results for Beverage Sales


Advanced Techniques for Regression Modeling using XLMiner

XLMiner is an Excel add-in for data mining that accompanies Analytic Solver Platform. Data mining is the subject of Chapter 10 and includes a wide variety of statistical procedures for exploring data, including regression analysis. The regression analysis tool in XLMiner has some advanced options not available in Excel’s Regression tool, which we discuss in this section.

Best-subsets regression evaluates either all possible regression models for a set of independent variables or the best subsets of models for a fixed number of independent variables. It helps you to find the best model based on the adjusted R². Best-subsets regression also evaluates models using a statistic called Cp, commonly known as Mallows’s Cp. Cp estimates the bias introduced in the estimates of the responses by having an underspecified model (a model with important predictors missing). If Cp is much greater than k + 1 (the number of independent variables in the model being evaluated plus 1), there is substantial bias. The full model always has Cp = k + 1. If all models except the full model have large Cp values, it suggests that important predictor variables are missing. Models with a minimum value or having Cp less than or at least close to k + 1 are good models to consider.

XLMiner offers five different procedures for selecting the best subsets of variables. Backward Elimination begins with all independent variables in the model and deletes one at a time until the best model is identified. Forward Selection begins with a model having no independent variables and successively adds one at a time until no additional variable makes a significant contribution. Stepwise Selection is similar to Forward Selection except that at each step, the procedure considers dropping variables that are not statistically significant. Sequential Replacement replaces variables sequentially, retaining those that improve performance. These procedures might terminate with different models. Exhaustive Search looks at all combinations of variables to find the one with the best fit, but it can be time consuming for large numbers of variables.
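The Cp statistic is commonly computed as Cp = SSE_p / MSE_full − (n − 2p), where SSE_p is the error sum of squares of the subset model, p is its number of coefficients (including the intercept), MSE_full is the mean squared error of the full model, and n is the sample size. The sketch below uses that textbook formula with hypothetical numbers; it is not a description of how XLMiner computes its output.

```python
def mallows_cp(sse_subset: float, mse_full: float, n: int, p: int) -> float:
    """Cp for a subset model with p coefficients (intercept included)."""
    return sse_subset / mse_full - (n - 2 * p)


# Hypothetical full model: n = 30 observations, k = 5 variables, SSE = 240.
n, k_full, sse_full = 30, 5, 240.0
mse_full = sse_full / (n - (k_full + 1))

print(mallows_cp(sse_full, mse_full, n, p=k_full + 1))  # full model: equals k + 1
print(mallows_cp(245.0, mse_full, n, p=5))  # 4 variables: below k + 1 = 5
print(mallows_cp(400.0, mse_full, n, p=4))  # 3 variables: far above k + 1 = 4
```

In this made-up case the four-variable subset would be a good candidate, while the three-variable subset shows the large Cp that signals a missing predictor.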

Example 8.19  Using XLMiner for Regression

We will use the Banking Data example. After installation, XLMiner will appear as a new tab in the Excel ribbon. The XLMiner ribbon is shown in Figure 8.37. To use the basic regression tool, click the Predict button in the Data Mining group and choose Multiple Linear Regression. The first of two dialogs will then be displayed, as shown in Figure 8.38. First, enter the data range (including headers) in the box near the top and check the box First row contains headers. All the variables will be listed in the left pane (Variables in input data). Select the independent variables and move them using the arrow button to the Input variables pane; then select the dependent variable and move it to the Output variable pane as shown in the figure. Click Next. The second dialog shown in Figure 8.39 will appear. Select the output options and check the Summary report box. However, before clicking Finish, click on the Best subsets button. In the dialog shown in Figure 8.40, check the box at the top and choose the selection procedure. Click OK and then click Finish in the Step 2 dialog.

XLMiner creates a new worksheet with an “Output Navigator” that allows you to click on hyperlinks to see various portions of the output (see Figure 8.41). The regression model and ANOVA output are shown in Figure 8.42. Note that this is the same as the output shown in Figure 8.21. The Best subsets results, shown in Figure 8.43, appear below the ANOVA output. RSS is the residual sum of squares, that is, the sum of squared deviations between the predicted and actual values of the dependent variable. Probability is a quasi-hypothesis test that a given subset is acceptable; if this is less than 0.05, you can rule out that subset. Note that the model with 5 coefficients (including the intercept) is the only one that has a Cp value less than k + 1 = 5 (here k = 4 independent variables), and its adjusted R² is the largest. If you click “Choose Subset,” XLMiner will create a new worksheet with the results for this model, which is the same as we found in Figure 8.22; that is, the model without the Home Value variable.


Figure 8.37 XLMiner Ribbon

Figure 8.38 XLMiner Linear Regression Dialog, Step 1 of 2

Figure 8.39 XLMiner Linear Regression Dialog, Step 2 of 2


Figure 8.40 XLMiner Best Subset Dialog

Figure 8.41 XLMiner Output Navigator

Figure 8.42 XLMiner Regression Output

Figure 8.43 XLMiner Best Subsets Results


XLMiner also provides cross-validation, a process that uses two sets of sample data: one to build the model (called the training set) and a second to assess the model’s performance (called the validation set). This will be explained in Chapter 10 when we study data mining in more depth, but is not necessary for standard regression analysis.
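The idea of holding out part of the sample can be sketched in a few lines. This is a generic illustration, not XLMiner's procedure: the 60/40 split, the fixed random seed, and the function name are all assumptions made for the example.

```python
import random
from typing import List, Sequence, Tuple


def train_validation_split(rows: Sequence, validation_fraction: float = 0.4,
                           seed: int = 0) -> Tuple[List, List]:
    """Randomly partition rows into a training set and a validation set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_validation = int(round(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]


# Placeholder observation IDs standing in for rows of a data set.
observations = list(range(1, 11))
training, validation = train_validation_split(observations)

print(len(training), len(validation))
```

The model would be fit using only the training rows and then judged by how well it predicts the validation rows, which guards against the overfitting discussed earlier in the chapter.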

Key Terms

Autocorrelation
Best-subsets regression
Coefficient of determination (R²)
Cross-validation
Coefficient of multiple determination
Curvilinear regression model
Dummy variables
Exponential function
Homoscedasticity
Interaction
Least-squares regression
Linear function
Logarithmic function
Multicollinearity
Multiple correlation coefficient
Multiple linear regression
Overfitting
Parsimony
Partial regression coefficient
Polynomial function
Power function
R² (R-squared)
Regression analysis
Residuals
Significance of regression
Simple linear regression
Standard error of the estimate, S_YX
Standard residuals

Problems and Exercises 1. Each worksheet in the Excel file LineFit Data con-

tains a set of data that describes a functional relationship between the dependent variable y and the independent variable x. Construct a line chart of each data set, and use the Excel Trendline tool to determine the best-fitting functions to model these data sets. 2. A consumer products company has collected some

to be any outliers? If so, delete them and then find the best-fitting linear regression line using the Excel Trendline tool. What would you conclude about the strength of any relationship? Would you use regression to make predictions of the unemployment rate based on the cost of living? 4. Using the data in the Excel file Weddings construct

Sales

\$300

\$7000

\$350

\$9000

\$400

\$10000

scatter charts to determine whether any linear relationship appears to exist between (1) the wedding cost and attendance, (2) the wedding cost and the value rating, and (3) the couple’s income and wedding cost only for the weddings paid for by the bride and groom. Then find the best-fitting linear regression lines using the Excel Trendline tool for each of these charts.

\$450

\$10600

5. Using the data in Excel file Loans, construct a scat-

data relating to the advertising expenditure and sales of one of its products:

What type of model would best represent the data? Use the Excel Trendline tool to find the best among the options provided.

ter chart for monthly income versus loan amount and add a linear trendline. What is the regression model? If an individual has 7336 as monthly income, what would you predict the loan amount to be?

3. Using the data in the Excel file Demographics, de-

6. Using the results of fitting the Home Market Value

termine if a linear relationship exists between unemployment rates and cost of living indexes by constructing a scatter chart. Visually, do there appear

regression line in Example 8.4, compute the errors associated with each observation using formula (8.3) and construct a histogram.

Chapter 8  Trendlines and Regression Analysis

295

7. Set up an Excel worksheet to apply formulas (8.5)

a. Interpret all key regression results, hypothesis

and (8.6) to compute the values of b 0 and b 1 for the data in the Excel file Home Market Value and verify that you obtain the same values as in Examples 8.4 and 8.5.

b. Analyze the residuals to determine if the assump-

8. The managing director of a consulting group has the

following monthly data on total overhead costs and professional labor hours to bill to clients:4 Overhead Costs

Billable Hours

\$365,000

3,000

\$400,000

4,000

\$430,000

5,000

\$477,000

6,000

\$560,000

7,000

\$587,000

8,000

a. Develop a trendline to identify the relationship

between billable hours and overhead costs. b. Interpret the coefficients of your regression model. Specifically, what does the fixed component of the model mean to the consulting firm? c. If a special job requiring 1,000 billable hours that would contribute a margin of \$38,000 before overhead was available, would the job be attractive? 9. Using the Excel file Weddings, apply the ­Excel Re-

tests, and confidence intervals in the output. tions underlying the regression analysis are valid. c. Use the standard residuals to determine if any possible outliers exist. d. If a couple makes \$70,000 together, how much would they probably budget for the wedding? 11. Using the data in Excel file Crime, apply the Excel

regression tool using crime rate (CRIM) as the dependent variable and pupil-teacher ratio (PTRATIO) in the region as the independent variable. a. Interpret all key regression results, hypothesis tests, and confidence intervals in the output. b. Use the standard residuals to determine if any

outliers exist. 12. Using the data in the Excel file Student Grades, ap-

ply the Excel Regression tool using the midterm grade as the independent variable and the final exam grade as the dependent variable. a. Interpret all key regression results, hypothesis tests, and confidence intervals in the output. b. Analyze the residuals to determine if the assumptions underlying the regression analysis are valid. c. Use the standard residuals to determine if any possible outliers exist.

gression tool using the wedding cost as the dependent variable and attendance as the independent variable. a. Interpret all key regression results, hypothesis tests, and confidence intervals in the output. b. Analyze the residuals to determine if the assumptions underlying the regression analysis are valid. c. Use the standard residuals to determine if any possible outliers exist. d. If a couple is planning a wedding for 175 guests, how much should they budget?

13. The Excel file National Football League provides

10. Using the Excel file Weddings, apply the ­Excel Re-

14. A deep-foundation engineering contractor has bid

gression tool using the wedding cost as the d­ ependent variable and the couple’s income as the independent variable, only for those weddings paid for by the bride and groom.

various data on professional football for one season. a. Construct a scatter diagram for Points/Game and Yards/Game in the Excel file. Does there appear to be a linear relationship? b. Develop a regression model for predicting Points/Game as a function of Yards/Game. ­Explain the statistical significance of the model. c. Draw conclusions about the validity of the regression analysis assumptions from the residual plot and standard residuals. on a foundation system for a new building housing the world headquarters for a Fortune 500 company.

4Modified from Charles T. Horngren, George Foster, and Srikant M. Datar, Cost Accounting: A Managerial Emphasis, 9th ed. (Englewood Cliffs, NJ: Prentice Hall, 1997): 371.

296

Chapter 8  Trendlines and Regression Analysis

A part of the project consists of installing 311 auger cast piles. The contractor was given bid information for cost-estimating purposes, which consisted of the estimated depth of each pile; however, actual drill footage of each pile could not be determined exactly until construction was performed. The Excel file Pile Foundation contains the estimates and actual pile lengths after the project was completed. Develop a linear regression model to estimate the actual pile length as a function of the estimated pile lengths. What do you conclude?

15. The Excel file Concert Sales provides data on sales dollars and the number of radio, TV, and newspaper ads promoting the concerts for a group of cities. Develop simple linear regression models for predicting sales as a function of the number of each type of ad. Compare these results to a multiple linear regression model using both independent variables. Examine the residuals of the best model for regression assumptions and possible outliers.

16. Using the data in the Excel file Credit Card Spending, develop a multiple linear regression model for estimating the average credit card expenditure as a function of both the income and family size. Predict the average expense of a family that has two members and an income of \$188,000 per annum, and another that has three members and an income of \$39,000 per annum.

17. The Excel file Cereal Data provides a variety of nutritional information about 67 cereals and their shelf location in a supermarket. Use regression analysis to find the best model that explains the relationship between calories and the other variables. Investigate the model assumptions and clearly explain your conclusions. Keep in mind the principle of parsimony!

18. The Excel file Salary Data provides information on current salary, beginning salary, previous experience (in months) when hired, and total years of education for a sample of 100 employees in a firm.
a. Develop a multiple regression model for predicting current salary as a function of the other variables.
b. Find the best model for predicting current salary using the t-value criterion.

19. The Excel file Credit Approval Decisions provides information on credit history for a sample of banking customers. Use regression analysis to identify the best model for predicting the credit score as a function of the other numerical variables. For the model you select, conduct further analysis to check for significance of the independent variables and for multicollinearity.

20. Using the data in the Excel file Freshman College Data, identify the best regression model for predicting the first year retention rate. For the model you select, conduct further analysis to check for significance of the independent variables and for multicollinearity.

21. The Excel file Major League Baseball provides data on the 2010 season.
a. Construct and examine the correlation matrix. Is multicollinearity a potential problem?
b. Suggest an appropriate set of independent variables that predict the number of wins by examining the correlation matrix.
c. Find the best multiple regression model for predicting the number of wins. How good is your model? Does it use the same variables you thought were appropriate in part (b)?

22. The Excel file Golfing Statistics provides data for a portion of the 2010 professional season for the top 25 golfers.
a. Find the best multiple regression model for predicting earnings/event as a function of the remaining variables.
b. Find the best multiple regression model for predicting average score as a function of the other variables except earnings and events.

23. Use the p-value criterion to find a good model for predicting the number of points scored per game by football teams using the data in the Excel file National Football League.

24. The State of Ohio Department of Education has a mandated ninth-grade proficiency test that covers writing, reading, mathematics, citizenship (social studies), and science. The Excel file Ohio Education Performance provides data on success rates (defined as the percent of students passing) in school districts in the greater Cincinnati metropolitan area along with state averages.
a. Suggest the best regression model to predict math success as a function of success in the other subjects by examining the correlation matrix; then run the regression tool for this set of variables.
b. Develop a multiple regression model to predict math success as a function of success in all other subjects using the systematic approach described in this chapter. Is multicollinearity a problem?
c. Compare the models in parts (a) and (b). Are they the same? Why or why not?

25. A leading car manufacturer tracks the data of its used cars for resale. The Excel file Car Sales contains information on the selling price of the car, fuel type (diesel or petrol), horsepower (HP), and manufacture year.
a. Develop a multiple linear regression model for the selling price as a function of fuel type and HP without any interaction term.
b. Determine if any interaction exists between fuel type and HP and find the best model. What is the predicted price for either a petrol or diesel car with a horsepower of 69?

26. For the Car Sales data described in Problem 25, develop a regression model for selling price as a function of horsepower and manufacture year, incorporating an interaction term. What would be the predicted price for a car manufactured in either 2002 or 2003 with a horsepower of 69? How do these predictions compare to the overall average price in each year?

27. For the Excel file Auto Survey,
a. Find the best regression model to predict miles/gallon as a function of vehicle age and mileage.
b. Using your result from part (a), add the categorical variable Purchased to the model. Does this change your result?
c. Determine whether any significant interaction exists between Vehicle Age and Purchased variables.

28. Cost functions are often nonlinear with volume because production facilities are often able to produce larger quantities at lower rates than smaller quantities.⁵ Using the following data, apply simple linear regression, and examine the residual plot. What do you conclude? Construct a scatter chart and use the Excel Trendline feature to identify the best type of curvilinear trendline that maximizes $R^2$.

| Units Produced | Costs |
|---|---|
| 500 | \$12,500 |
| 1,000 | \$25,000 |
| 1,500 | \$32,500 |
| 2,000 | \$40,000 |
| 2,500 | \$45,000 |
| 3,000 | \$50,000 |

29. A product manufacturer wishes to determine the relationship between the shelf space of the product and its sales. Past data indicates the following sales and shelf space in its stores.

| Sales | Shelf Space (square feet) |
|---|---|
| \$25,000 | 5 |
| \$15,000 | 3.2 |
| \$28,000 | 5.4 |
| \$30,000 | 6.1 |
| \$17,000 | 4.3 |
| \$16,000 | 3.1 |
| \$12,000 | 2.6 |
| \$21,000 | 6.4 |
| \$19,000 | 4.9 |
| \$27,000 | 5.7 |

Using these data points, apply simple linear regression, and examine the residual plot. What do you conclude? Construct a scatter chart and use the Excel Trendline feature to identify the best type of curvilinear trendline that maximizes $R^2$.

30. For the Excel file Cereal Data, use XLMiner and best subsets with backward selection to find the best model.

31. Use XLMiner and best subsets with stepwise selection to find the best model for points per game for the National Football League data (see Problem 23).

⁵ Horngren, Foster, and Datar, Cost Accounting: A Managerial Emphasis, 9th ed.: 349.
⁶ Horngren, Foster, and Datar, Cost Accounting: A Managerial Emphasis, 9th ed.: 349.


Case: Performance Lawn Equipment

In reviewing the PLE data, Elizabeth Burke noticed that defects received from suppliers have decreased (worksheet Defects After Delivery). Upon investigation, she learned that in 2010, PLE experienced some quality problems due to an increasing number of defects in materials received from suppliers. The company instituted an initiative in August 2011 to work with suppliers to reduce these defects, to more closely coordinate deliveries, and to improve materials quality through reengineering supplier production policies. Elizabeth noted that the program appeared to reverse an increasing trend in defects; she would like to predict what might have happened had the supplier initiative not been implemented and how the number of defects might further be reduced in the near future.

In meeting with PLE’s human resources director, Elizabeth also discovered a concern about the high rate of turnover in its field service staff. Senior managers have suggested that the department look closer at its recruiting policies, particularly to try to identify the characteristics of individuals that lead to greater retention. However, in a recent staff meeting, HR managers could not agree on these characteristics. Some argued that years of education and grade point averages were good predictors. Others argued that hiring more mature applicants would lead to greater retention. To study these factors, the staff agreed to conduct a statistical study to determine the effect that years of education, college grade point average, and age when hired have on retention. A sample of 40 field service engineers hired 10 years ago was selected to determine the influence of these variables on how long each individual stayed with the company. Data are compiled in the Employee Retention worksheet.

Finally, as part of its efforts to remain competitive, PLE tries to keep up with the latest in production technology. This is especially important in the highly competitive lawn-mower line, where competitors can gain a real advantage if they develop more cost-effective means of production. The lawn-mower division therefore spends a great deal of effort in testing new technology. When new production technology is introduced, firms often experience learning, resulting in a gradual decrease in the time required to produce successive units. Generally, the rate of improvement declines until the production time levels off. One example is the production of a new design for lawnmower engines. To determine the time required to produce these engines, PLE produced 50 units on its production line; test results are given on the worksheet Engines in the database. Because PLE is continually developing new technology, understanding the rate of learning can be useful in estimating future production costs without having to run extensive prototype trials, and Elizabeth would like a better handle on this.

Use techniques of regression analysis to assist her in evaluating the data in these three worksheets and reaching useful conclusions. Summarize your work in a formal report with all appropriate results and analyses.

Chapter

9

Forecasting Techniques

iQoncept/Shutterstock.com

Learning Objectives

After studying this chapter, you will be able to:

• Explain how judgmental approaches are used for forecasting.
• List different types of statistical forecasting models.
• Apply moving average and exponential smoothing models to stationary time series.
• State three error metrics used for measuring forecast accuracy and explain the differences among them.
• Apply double exponential smoothing models to time series with a linear trend.
• Use Holt-Winters and regression models to forecast time series with seasonality.
• Apply Holt-Winters forecasting models to time series with both trend and seasonality.
• Identify the appropriate choice of forecasting model based on the characteristics of a time series.
• Explain how regression techniques can be used to forecast with explanatory or causal variables.
• Apply XLMiner to different types of forecasting models.


Chapter 9  Forecasting Techniques

Managers require good forecasts of future events to make good decisions. For example, forecasts of interest rates, energy prices, and other economic indicators are needed for financial planning; sales forecasts are needed to plan production and workforce capacity; and forecasts of trends in demographics, consumer behavior, and technological innovation are needed for long-term strategic planning. The government also invests significant resources in predicting short-run U.S. business performance using the Index of Leading Indicators. This index focuses on the performance of individual businesses, which often is highly correlated with the performance of the overall economy and is used to forecast economic trends for the nation as a whole. In this chapter, we introduce some common methods and approaches to forecasting, including both qualitative and quantitative techniques.

Business analysts may choose from a wide range of forecasting techniques to support decision making. Selecting the appropriate method depends on the characteristics of the forecasting problem, such as the time horizon of the variable being forecast, as well as available information on which the forecast will be based. Three major categories of forecasting approaches are qualitative and judgmental techniques, statistical time-series models, and explanatory/causal methods. In this chapter, we introduce forecasting techniques in each of these categories and use basic Excel tools, XLMiner, and linear regression to implement them in a spreadsheet environment.

Qualitative and Judgmental Forecasting

Qualitative and judgmental techniques rely on experience and intuition; they are necessary when historical data are not available or when the decision maker needs to forecast far into the future. For example, a forecast of when the next generation of a microprocessor will be available and what capabilities it might have will depend greatly on the opinions and expertise of individuals who understand the technology. Another use of judgmental methods is to incorporate nonquantitative information, such as the impact of government regulations or competitor behavior, in a quantitative forecast. Judgmental techniques range from such simple methods as a manager’s opinion or a group-based jury of executive opinion to more structured approaches such as historical analogy and the Delphi method.

Historical Analogy

One judgmental approach is historical analogy, in which a forecast is obtained through a comparative analysis with a previous situation. For example, if a new product is being introduced, the response of consumers to marketing campaigns for similar, previous products can be used as a basis to predict how the new marketing campaign might fare. Of course, temporal changes or other unique factors might not be fully considered in such an approach. However, a great deal of insight can often be gained through an analysis of past experiences.

Example 9.1  Predicting the Price of Oil

In early 1998, the price of oil was about \$22 a barrel. However, in mid-1998, the price of a barrel of oil dropped to around \$11. The reasons for this price drop included an oversupply of oil from new production in the Caspian Sea region, high production in non-OPEC regions, and lower-than-normal demand. In similar circumstances in the past, OPEC would meet and take action to raise the price of oil. Thus, from historical analogy, we might forecast a rise in the price of oil. OPEC members did, in fact, meet in mid-1998 and agreed to cut their production, but nobody believed that they would actually cooperate effectively, and the price continued to drop for a time. Subsequently, in 2000, the price of oil rose dramatically, falling again in late 2001.

Analogies often provide good forecasts, but you need to be careful to recognize new or different circumstances. Another analogy is international conflict relative to the price of oil. Should war break out, the price would be expected to rise, analogous to what it has done in the past.

The Delphi Method

A popular judgmental forecasting approach, called the Delphi method, uses a panel of experts, whose identities are typically kept confidential from one another, to respond to a sequence of questionnaires. After each round of responses, individual opinions, edited to ensure anonymity, are shared, allowing each to see what the other experts think. Seeing other experts’ opinions helps to reinforce those in agreement and to influence those who did not agree to possibly consider other factors. In the next round, the experts revise their estimates, and the process is repeated, usually for no more than two or three rounds. The Delphi method promotes unbiased exchanges of ideas and discussion and usually results in some convergence of opinion. It is one of the better approaches to forecasting long-range trends and impacts.

Indicators and Indexes

Indicators and indexes generally play an important role in developing judgmental forecasts. Indicators are measures that are believed to influence the behavior of a variable we wish to forecast. By monitoring changes in indicators, we expect to gain insight about the future behavior of the variable to help forecast the future.

Example 9.2  Economic Indicators

One variable that is important to the nation’s economy is the Gross Domestic Product (GDP), which is a measure of the value of all goods and services produced in the United States. Despite its shortcomings (for instance, unpaid work such as housekeeping and child care is not measured; production of poor-quality output inflates the measure, as does work expended on corrective action), it is a practical and useful measure of economic performance. Like most time series, the GDP rises and falls in a cyclical fashion. Predicting future trends in the GDP is often done by analyzing leading indicators, which are series that tend to rise and fall for some predictable length of time prior to the peaks and valleys of the GDP. One example of a leading indicator is the formation of business enterprises; as the rate of new businesses grows, we would expect the GDP to increase in the future. Other examples of leading indicators are the percent change in the money supply (M1) and net change in business loans. Other indicators, called lagging indicators, tend to have peaks and valleys that follow those of the GDP. Some lagging indicators are the Consumer Price Index, prime rate, business investment expenditures, or inventories on hand. The GDP can be used to predict future trends in these indicators.

Indicators are often combined quantitatively into an index, a single measure that weights multiple indicators, thus providing a measure of overall expectation. For example, financial analysts use the Dow Jones Industrial Average as an index of general stock market performance. Indexes do not provide a complete forecast but rather a better picture of direction of change and thus play an important role in judgmental forecasting.

Example 9.3  Leading Economic Indicators

The Department of Commerce initiated an Index of Leading Indicators to help predict future economic performance. Components of the index include the following:

• average weekly hours, manufacturing
• average weekly initial claims, unemployment insurance
• new orders, consumer goods, and materials
• vendor performance: slower deliveries
• new orders, nondefense capital goods
• building permits, private housing
• stock prices, 500 common stocks (Standard & Poor)
• money supply
• interest rate spread
• index of consumer expectations (University of Michigan)

Business Conditions Digest included more than 100 time series in seven economic areas. This publication was discontinued in March 1990, but information related to the Index of Leading Indicators was continued in Survey of Current Business. In December 1995, the U.S. Department of Commerce sold this data source to The Conference Board, which now markets the information under the title Business Cycle Indicators; information can be obtained at its Web site (www.conference-board.org). The site includes excellent current information about the calculation of the index as well as its current components.

Statistical Forecasting Models

Statistical time-series models find greater applicability for short-range forecasting problems. A time series is a stream of historical data, such as weekly sales. We characterize the values of a time series over T periods as $A_t$, $t = 1, 2, \ldots, T$. Time-series models assume that whatever forces have influenced sales in the recent past will continue into the near future; thus, forecasts are developed by extrapolating these data into the future.

Time series generally have one or more of the following components: random behavior, trends, seasonal effects, or cyclical effects. Time series that do not have trend, seasonal, or cyclical effects but are relatively constant and exhibit only random behavior are called stationary time series. Many forecasts are based on analysis of historical time-series data and are predicated on the assumption that the future is an extrapolation of the past. A trend is a gradual upward or downward movement of a time series over time.


Example 9.4  Identifying Trends in a Time Series

Figure 9.1 shows a chart of total energy consumption from the data in the Excel file Energy Production & Consumption. This time series shows an upward trend. However, we see that energy consumption was rising quite rapidly in a linear fashion during the 1960s, then leveled off for a while and began increasing at a slower rate through the 1980s and 1990s. In the past decade, we actually see a slight downward trend. This time series, then, is composed of several different short trends.

Time series may also exhibit short-term seasonal effects (over a year, month, week, or even a day) as well as longer-term cyclical effects, or nonlinear trends. A seasonal effect is one that repeats at fixed intervals of time, typically a year, month, week, or day. At a neighborhood grocery store, for instance, short-term seasonal patterns may occur over a week, with the heaviest volume of customers on weekends; seasonal patterns may also be evident during the course of a day, with higher volumes in the mornings and late afternoons. Figure 9.2 shows seasonal changes in natural gas usage for a homeowner over the course of a year (Excel file Gas & Electric). Cyclical effects describe ups and downs over a much longer time frame, such as several years. Figure 9.3 shows a chart of the data in the Excel file Federal Funds Rates. We see some evidence of long-term cycles in the time series driven by economic factors, such as periods of inflation and recession.

Figure 9.1  Total Energy Consumption Time Series

Figure 9.2  Seasonal Effects in Natural Gas Usage

Figure 9.3  Cyclical Effects in Federal Funds Rates

Although visual inspection of a time series to identify trends, seasonal, or cyclical effects may work in a naïve fashion, such unscientific approaches may be a bit unsettling to a manager making important decisions. Subtle effects and interactions of seasonal and cyclical factors may not be evident from simple visual extrapolation of data. Statistical methods, which involve more formal analyses of time series, are invaluable in developing good forecasts.

A variety of statistically-based forecasting methods for time series are commonly used. Among the most popular are moving average methods, exponential smoothing, and regression analysis. These can be implemented very easily on a spreadsheet using basic functions and Data Analysis tools available in Microsoft Excel, as well as with more powerful software such as XLMiner. Moving average and exponential smoothing models work best for time series that do not exhibit trends or seasonal factors. For time series that involve trends and/or seasonal factors, other techniques have been developed. These include double moving average and exponential smoothing models, seasonal additive and multiplicative models, and Holt-Winters additive and multiplicative models.

Forecasting Models for Stationary Time Series

Two simple approaches that are useful over short time periods when trend, seasonal, or cyclical effects are not significant are moving average and exponential smoothing models.

Moving Average Models

The simple moving average method is a smoothing method based on the idea of averaging random fluctuations in the time series to identify the underlying direction in which the time series is changing. Because the moving average method assumes that future observations will be similar to the recent past, it is most useful as a short-range forecasting method. Although this method is very simple, it has proven to be quite useful in stable environments, such as inventory management, in which it is necessary to develop forecasts for a large number of items.

Specifically, the simple moving average forecast for the next period is computed as the average of the most recent k observations. The value of k is somewhat arbitrary, although its choice affects the accuracy of the forecast. The larger the value of k, the more the current forecast is dependent on older data, and the forecast will not react as quickly to fluctuations in the time series. The smaller the value of k, the quicker the forecast responds to changes in the time series. Also, when k is larger, extreme values have less effect on the forecasts. (In the next section, we discuss how to select k by examining errors associated with different values.)

Example 9.5  Moving Average Forecasting

The Excel file Tablet Computer Sales contains data for the number of units sold for the past 17 weeks. Figure 9.4 shows a chart of these data. The time series appears to be relatively stable, without trend, seasonal, or cyclical effects; thus, a moving average model would be appropriate. Setting k = 3, the three-period moving average forecast for week 18 is

$$\text{week 18 forecast} = \frac{82 + 71 + 50}{3} = 67.67$$

Moving average forecasts can be generated easily on a spreadsheet. Figure 9.5 shows the computations for a three-period moving average forecast of tablet computer sales. Figure 9.6 shows a chart that contrasts the data with the forecasted values.
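The week 18 calculation can also be sketched outside a spreadsheet. The short Python function below is only an illustration; the name `moving_average_forecast` is an assumption made for this sketch, and the only data used are the three most recent sales figures quoted in Example 9.5.

```python
def moving_average_forecast(series, k):
    """Forecast the next period as the mean of the k most recent observations."""
    if k < 1 or k > len(series):
        raise ValueError("k must be between 1 and the number of observations")
    return sum(series[-k:]) / k


# The three most recent weekly sales values quoted in Example 9.5.
recent_sales = [82, 71, 50]
week_18 = moving_average_forecast(recent_sales, k=3)
print(round(week_18, 2))  # 67.67
```

Only the last k values matter for the next-period forecast, which is why older observations stop influencing the result as soon as they fall outside the window.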

Moving average forecasts can also be obtained from Excel’s Data Analysis options.

Figure 9.4  Chart of Weekly Tablet Computer Sales

Example 9.6  Using Excel’s Moving Average Tool

For the Tablet Computer Sales Excel file, select Data Analysis and then Moving Average from the Analysis group. Excel displays the dialog box shown in Figure 9.7. You need to enter the Input Range of the data, the Interval (the value of k), and the first cell of the Output Range. To align the actual data with the forecasted values in the worksheet, select the first cell of the Output Range to be one row below the first value. You may also obtain a chart of the data and the moving averages, as well as a column of standard errors, by checking the appropriate boxes. However, we do not recommend using the chart or error options because the forecasts generated by this tool are not properly aligned with the data (the forecast value aligned with a particular data point represents the forecast for the next period) and, thus, can be misleading. Rather, we recommend that you generate your own chart as we did in Figure 9.6. Figure 9.8 shows the results produced by the Moving Average tool (with some customization of the formatting). Note that the forecast for week 18 is aligned with the actual value for week 17 on the chart. Compare this to Figure 9.6 and you can see the difference.
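The alignment issue can be made concrete with a small sketch. It assumes, as the example describes, that each k-period average sits in the row of the last observation it uses, so it must be shifted one period forward to act as a forecast. The function names and the four data values are hypothetical and chosen only for illustration.

```python
def moving_averages(series, k):
    """k-period averages, one for each period that has k observations available."""
    return [sum(series[i - k + 1:i + 1]) / k for i in range(k - 1, len(series))]


def aligned_forecasts(series, k):
    """Map each period number (1-based) to the forecast made for that period.

    The average of periods t-k+1 through t is the forecast for period t+1,
    so every average is shifted one period forward.
    """
    averages = moving_averages(series, k)
    return dict(zip(range(k + 1, len(series) + 2), averages))


sales = [10, 20, 30, 40]             # hypothetical data for periods 1 to 4
print(moving_averages(sales, 3))     # [20.0, 30.0], averages ending at periods 3 and 4
print(aligned_forecasts(sales, 3))   # {4: 20.0, 5: 30.0}
```

Plotting the first list against periods 3 and 4 reproduces the misalignment; plotting the second against periods 4 and 5 compares each forecast with the observation it was meant to predict.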

Figure 9.5  Excel Implementation of Moving Average Forecast

Figure 9.6  Chart of Units Sold and Moving Average Forecast

Figure 9.7  Excel Moving Average Tool Dialog

XLMiner also provides a tool for forecasting with moving averages. XLMiner is an Excel add-on that is available from Frontline Systems, developers of Analytic Solver Platform. See the Preface for installation instructions. XLMiner will be discussed more thoroughly in Chapter 10.

Figure 9.8  Results of Excel Moving Average Tool (Note misalignment of forecasts with actual sales in the chart.)

Example 9.7  Moving Average Forecasting with XLMiner

To use XLMiner for the Tablet Computer Sales data, first click on any value in the data. Then select Smoothing from the Time Series group and select Moving Average. The dialog in Figure 9.9 appears. Next, move the variables from the Variables in input data field to the Time Variable and Selected variable fields using the arrow buttons (this was already done in Figure 9.9). In the Weights panel, adjust the value of Interval, the number of periods to use for the moving average. In the Output options panel, you may click Give Forecast and enter the number of forecasts to generate from the procedure. When you click OK, XLMiner generates the output on a new worksheet, as shown in Figure 9.10. The forecasts are shown in rows 24 through 40 along with a chart of the data and forecasts (without the initial periods that do not have corresponding forecasts). The forecast for week 18 is shown at the bottom of the figure. We discuss other parts of the output next.

Figure 9.9  XLMiner Moving Average Dialog


Figure 9.10 XLMiner Moving Average Results

Error Metrics and Forecast Accuracy

The quality of a forecast depends on how accurate it is in predicting future values of a time series. In the simple moving average model, different values for k will produce different forecasts. How do we know which is the best value for k? The error, or residual, in a forecast is the difference between the forecast and the actual value of the time series (once it is known). In Figure 9.6, the forecast error is simply the vertical distance between the forecast and the data for the same time period.

To analyze the effectiveness of different forecasting models, we can define error metrics, which compare quantitatively the forecast with the actual observations. Three metrics that are commonly used are the mean absolute deviation, mean square error, and mean absolute percentage error. The mean absolute deviation (MAD) is the absolute difference between the actual value and the forecast, averaged over a range of forecasted values:

$$\text{MAD} = \frac{\sum_{t=1}^{n} |A_t - F_t|}{n} \tag{9.1}$$

where $A_t$ is the actual value of the time series at time t, $F_t$ is the forecast value for time t, and n is the number of forecast values (not the number of data points since we do not have a forecast value associated with the first k data points). MAD provides a robust measure of error and is less affected by extreme observations.


Mean square error (MSE) is probably the most commonly used error metric. It penalizes larger errors because squaring larger numbers has a greater impact than squaring smaller numbers. The formula for MSE is

$$\text{MSE} = \frac{\sum_{t=1}^{n} (A_t - F_t)^2}{n} \tag{9.2}$$

Again, n represents the number of forecast values used in computing the average. Sometimes the square root of MSE, called the root mean square error (RMSE), is used:

$$\text{RMSE} = \sqrt{\frac{\sum_{t=1}^{n} (A_t - F_t)^2}{n}} \tag{9.3}$$

Note that unlike MSE, RMSE is expressed in the same units as the data (similar to the difference between a standard deviation and a variance), allowing for more practical comparisons. A fourth commonly used metric is mean absolute percentage error (MAPE). MAPE is the average of absolute errors divided by actual observation values:

$$\text{MAPE} = \frac{\sum_{t=1}^{n} \left| \dfrac{A_t - F_t}{A_t} \right|}{n} \times 100 \tag{9.4}$$

The values of MAD and MSE depend on the measurement scale of the time-series data. For example, forecasting profit in the range of millions of dollars would result in very large MAD and MSE values, even for very accurate forecasting models. On the other hand, market share is measured in proportions; therefore, even bad forecasting models will have small values of MAD and MSE. Thus, these measures have no meaning except in comparison with other models used to forecast the same data. Generally, MAD is less affected by extreme observations and is preferable to MSE if such extreme observations are considered rare events with no special meaning. MAPE is different in that the measurement scale is eliminated by dividing the absolute error by the time-series data value. This allows a better relative comparison. Although these comments provide some guidelines, there is no universal agreement on which measure is best. Note that the output from XLMiner in Figure 9.10 calculates residuals for the forecasts and provides the values of MAPE, MAD, and MSE.
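The four metrics in equations (9.1) through (9.4) translate directly into code. The following Python sketch is one possible implementation; the function names and the three actual/forecast pairs are hypothetical and serve only to show the arithmetic.

```python
import math


def _errors(actual, forecast):
    """Forecast errors A_t - F_t for paired actual and forecast values."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("actual and forecast must be non-empty and equal in length")
    return [a - f for a, f in zip(actual, forecast)]


def mad(actual, forecast):
    """Mean absolute deviation, equation (9.1)."""
    errors = _errors(actual, forecast)
    return sum(abs(e) for e in errors) / len(errors)


def mse(actual, forecast):
    """Mean square error, equation (9.2)."""
    errors = _errors(actual, forecast)
    return sum(e ** 2 for e in errors) / len(errors)


def rmse(actual, forecast):
    """Root mean square error, equation (9.3)."""
    return math.sqrt(mse(actual, forecast))


def mape(actual, forecast):
    """Mean absolute percentage error, equation (9.4); actual values must be nonzero."""
    errors = _errors(actual, forecast)
    return 100 * sum(abs(e / a) for e, a in zip(errors, actual)) / len(errors)


# Hypothetical values, not taken from the Tablet Computer Sales file.
actual = [100, 200, 400]
forecast = [110, 190, 380]
print(round(mad(actual, forecast), 2))   # 13.33
print(round(mse(actual, forecast), 2))   # 200.0
print(round(rmse(actual, forecast), 2))  # 14.14
print(round(mape(actual, forecast), 2))  # 6.67
```

In this small case the error of 20 contributes 400 of the 600 total squared error, which illustrates why MSE reacts more strongly than MAD to a single large miss, while MAPE treats the misses of 10 on 200 and 20 on 400 as equally serious.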

Example 9.8  Using Error Metrics to Compare Moving Average Forecasts

The metrics we have described can be used to compare different moving average forecasts for the Tablet Computer Sales data. A spreadsheet that shows the forecasts as well as the calculations of the error metrics for two-, three-, and four-period moving average models is given in Figure 9.11. The error is the difference between the actual value of the units sold and the forecast. To compute MAD, we first compute the absolute values of the errors and then average them. For MSE, we compute the squared errors and then find the average. For MAPE, we find the absolute values of the errors divided by the actual observation multiplied by 100 and then average them. The results suggest that a two-period moving average model provides the best forecast among these alternatives because the error metrics are all smaller than for the other models.
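The comparison in Example 9.8 can also be sketched outside a spreadsheet. The following is a minimal Python illustration, assuming a k-period moving average whose forecast for a period is the mean of the previous k observations. The function names and the `sales` values are hypothetical and are not the Tablet Computer Sales data.

```python
def moving_average_forecasts(actual, k):
    """Return (index, forecast) pairs; the forecast for period t is the
    mean of the k observations immediately before it."""
    return [(t, sum(actual[t - k:t]) / k) for t in range(k, len(actual))]


def error_metrics(actual, k):
    """Compute MAD, MSE, RMSE, and MAPE for a k-period moving average."""
    pairs = moving_average_forecasts(actual, k)
    errors = [actual[t] - forecast for t, forecast in pairs]
    n = len(errors)
    mad = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = mse ** 0.5
    mape = 100 * sum(abs(e / actual[t]) for e, (t, _) in zip(errors, pairs)) / n
    return {"MAD": mad, "MSE": mse, "RMSE": rmse, "MAPE": mape}


# Hypothetical weekly sales, for illustration only.
sales = [100, 110, 105, 120, 115, 125, 130, 128]
for k in (2, 3, 4):
    print(k, error_metrics(sales, k))
```

Note that each value of k drops a different number of initial periods, so the metrics are averaged over different numbers of forecasts, just as in a spreadsheet comparison.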


Chapter 9  Forecasting Techniques

Figure 9.11 Error Metrics Alternative Moving Average Forecasts

Exponential Smoothing Models

A versatile, yet highly effective, approach for short-range forecasting is simple exponential smoothing. The basic simple exponential smoothing model is

$$F_{t+1} = (1 - \alpha)F_t + \alpha A_t = F_t + \alpha(A_t - F_t) \tag{9.5}$$

where $F_{t+1}$ is the forecast for time period $t + 1$, $F_t$ is the forecast for period $t$, $A_t$ is the observed value in period $t$, and $\alpha$ is a constant between 0 and 1 called the smoothing constant. To begin, set $F_1$ and $F_2$ equal to the actual observation in period 1, $A_1$. Using the two forms of the forecast equation just given, we can interpret the simple exponential smoothing model in two ways. In the first form, the forecast for the next period, $F_{t+1}$, is a weighted average of the forecast made for period $t$, $F_t$, and the actual observation in period $t$, $A_t$. The second form of the model, obtained by simply rearranging terms, states that the forecast for the next period, $F_{t+1}$, equals the forecast for the last period, $F_t$, plus a fraction $\alpha$ of the forecast error made in period $t$, $A_t - F_t$. Thus, to make a forecast once we have selected the smoothing constant, we need to know only the previous forecast and the actual value. By repeated substitution for $F_t$ in the equation, it is easy to demonstrate that $F_{t+1}$ is a decreasingly weighted average of all past time-series data. Thus, the forecast actually reflects all the data, provided that $\alpha$ is strictly between 0 and 1.

Example 9.9  Using Exponential Smoothing to Forecast Tablet Computer Sales

For the tablet computer sales data, the forecast for week 2 is 88, the actual observation for week 1. Suppose we choose $\alpha = 0.7$; then the forecast for week 3 would be

week 3 forecast = (1 − 0.7)(88) + (0.7)(44) = 57.2

The actual observation for week 3 is 60; thus, the forecast for week 4 would be

week 4 forecast = (1 − 0.7)(57.2) + (0.7)(60) = 59.16

Because the simple exponential smoothing model requires only the previous forecast and the current time-series value, it is very easy to calculate; thus, it is highly suitable for environments such as inventory systems, where many forecasts must be made.
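The recursion in Equation (9.5) is short enough to sketch in a few lines of Python. This is only an illustration under the initialization described above ($F_1 = F_2 = A_1$); the function name is hypothetical, and the three observations (88, 44, 60) are the week 1 to week 3 values used in Example 9.9.

```python
def exponential_smoothing(actual, alpha):
    """Return [F_1, F_2, ..., F_{n+1}] using F_{t+1} = F_t + alpha * (A_t - F_t).

    F_1 is set to A_1, which also makes F_2 equal to A_1.
    """
    forecasts = [actual[0]]  # F_1 = A_1
    for a_t in actual:
        f_t = forecasts[-1]
        forecasts.append(f_t + alpha * (a_t - f_t))
    return forecasts


# Weeks 1-3 from Example 9.9.
forecasts = exponential_smoothing([88, 44, 60], alpha=0.7)
print(forecasts[2])  # week 3 forecast, approximately 57.2
print(forecasts[3])  # week 4 forecast, approximately 59.16
```

Only the previous forecast and the latest observation are needed at each step, which is why the method scales well to settings that require many forecasts.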


The smoothing constant $\alpha$ is usually chosen by experimentation in the same manner as choosing the number of periods to use in the moving average model. Different values of $\alpha$ affect how quickly the model responds to changes in the time series. For instance, a value of $\alpha = 0$ would simply repeat last period’s forecast, whereas $\alpha = 1$ would forecast last period’s actual demand. The closer $\alpha$ is to 1, the quicker the model responds to changes in the time series, because it puts more weight on the actual current observation than on the forecast. Likewise, the closer $\alpha$ is to 0, the more weight is put on the prior forecast, so the model would respond to changes more slowly.

Example 9.10  Finding the Best Exponential Smoothing Model for Tablet Computer Sales

An Excel spreadsheet for evaluating exponential smoothing models for the Tablet Computer Sales data using values of $\alpha$ between 0.1 and 0.9 is shown in Figure 9.12. Note that in computing the error measures, the first row is not included because we do not have a forecast for the first period, Week 1. A smoothing constant of $\alpha = 0.6$ provides the lowest error for all three metrics.
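The same kind of search can be sketched in Python. The sketch below assumes a grid of smoothing constants from 0.1 to 0.9, scores each by MSE, and skips period 1 as described in Example 9.10. The helper names and the `demand` series are hypothetical and are not taken from the Tablet Computer Sales file.

```python
def ses_forecasts(actual, alpha):
    """Return [F_1, ..., F_{n+1}] for simple exponential smoothing, F_1 = A_1."""
    forecasts = [actual[0]]
    for a_t in actual:
        forecasts.append(forecasts[-1] + alpha * (a_t - forecasts[-1]))
    return forecasts


def mse_for_alpha(actual, alpha):
    """Mean square error over periods 2..n (period 1 has no real forecast)."""
    forecasts = ses_forecasts(actual, alpha)
    errors = [a - f for a, f in zip(actual[1:], forecasts[1:])]
    return sum(e * e for e in errors) / len(errors)


def best_alpha(actual, grid=None):
    """Return the grid value of alpha with the smallest MSE."""
    grid = grid or [i / 10 for i in range(1, 10)]
    return min(grid, key=lambda alpha: mse_for_alpha(actual, alpha))


# Hypothetical, steadily increasing series: larger alpha tracks it more closely.
demand = [10, 20, 30, 40, 50]
print(best_alpha(demand))
```

Swapping `mse_for_alpha` for a MAD or MAPE function gives the other two comparisons; the three metrics need not agree on the best constant for every data set.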

Excel has a Data Analysis tool for exponential smoothing.

Figure 9.12 Exponential Smoothing Forecasts for Tablet Computer Sales

Example 9.11  Using Excel’s Exponential Smoothing Tool

In the Tablet Computer Sales example, from the Analysis group, select Data Analysis and then Exponential Smoothing. In the dialog (Figure 9.13), as in the Moving Average dialog, you must enter the Input Range of the time-series data, the Damping Factor, and the first cell of the Output Range, which should be adjacent to the first data point. Note that the Damping Factor is $(1 - \alpha)$, not the smoothing constant as we have defined it. You also have options for labels, to chart output, and to obtain standard errors. As opposed to the Moving Average tool, the chart generated by this tool does correctly align the forecasts with the actual data, as shown in Figure 9.14. You can see that the exponential smoothing model follows the pattern of the data quite closely, although it tends to lag with an increasing trend in the data.


Figure 9.13 Exponential Smoothing Tool Dialog

Figure 9.14 Excel Exponential Smoothing Forecasts for $\alpha = 0.6$

XLMiner also has an exponential smoothing capability. The dialog (which appears when Exponential . . . is selected from the Time Series/Smoothing menu) is similar to the one for moving averages in Figure 9.9. However, within the Weights pane, it provides options either to enter the smoothing constant, Level (Alpha), or to check an Optimize box, which will find the best value of the smoothing constant.

Example 9.12  Optimizing Exponential Smoothing Forecasts Using XLMiner

Select Exponential Smoothing from the Smoothing menu in XLMiner. For the Tablet Computer Sales data, enter the data (similar to the dialog in Figure 9.9), and check the Optimize box in the Weights pane. Figure 9.15 shows the results. In row 16, we see that the optimized smoothing constant is 0.63. You can see that this is close to the value of 0.6 that we estimated in Figure 9.12; the error measures shown in rows 48–50 are slightly lower than those in Figure 9.12.

Forecasting Models for Time Series with a Linear Trend

For time series with a linear trend but no significant seasonal components, double moving average and double exponential smoothing models are more appropriate than using simple moving average or exponential smoothing models. Both methods are based on the linear trend equation:

$$F_{t+k} = a_t + b_t k \tag{9.6}$$


Figure 9.15 XLMiner Exponential Smoothing Results for Tablet Computer Sales

That is, the forecast for $k$ periods into the future from period $t$ is a function of a base value $a_t$, also known as the level, and a trend, or slope, $b_t$. Double moving average and double exponential smoothing differ in how the data are used to arrive at appropriate values for $a_t$ and $b_t$. Because the calculations are more complex than for simple moving average and exponential smoothing models, it is easier to use forecasting software than to try to implement the models directly on a spreadsheet. Therefore, we do not discuss the theory or formulas underlying the methods. XLMiner does not support a procedure for double moving average; however, it does provide one for double exponential smoothing.

Double Exponential Smoothing

In double exponential smoothing, the estimates of $a_t$ and $b_t$ are obtained from the following equations:

$$\begin{aligned} a_t &= \alpha A_t + (1 - \alpha)(a_{t-1} + b_{t-1}) \\ b_t &= \beta(a_t - a_{t-1}) + (1 - \beta)b_{t-1} \end{aligned} \tag{9.7}$$

In essence, we are smoothing both parameters of the linear trend model. From the first equation, the estimate of the level in period $t$ is a weighted average of the observed value at time $t$ and the predicted value at time $t$, $a_{t-1} + b_{t-1}$, based on simple exponential smoothing. For large values of $\alpha$, more weight is placed on the observed value. Lower values of $\alpha$ put more weight on the smoothed predicted value. Similarly, from the second equation, the estimate of the trend in period $t$ is a weighted average of the difference in the estimated levels in periods $t$ and $t - 1$ and the estimate of the trend in period $t - 1$. Larger values of $\beta$ place more weight on the differences in the levels, but lower values of $\beta$ put more emphasis on the previous estimate of the trend. Initial values are chosen for $a_1$ as $A_1$ and $b_1$ as $A_2 - A_1$. Equations (9.7) must then be used to compute $a_t$ and $b_t$ for the entire time series to be able to generate forecasts into the future. As with simple exponential smoothing, we are free to choose the values of $\alpha$ and $\beta$. However, it is easier to let XLMiner optimize these values using historical data.
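Although software is the practical route, the level and trend updates are compact enough to sketch. The following Python illustration assumes the level update weights the observed value $A_t$ (as the description above states), uses the initial values $a_1 = A_1$ and $b_1 = A_2 - A_1$, and forecasts with the linear trend equation $F_{t+k} = a_t + b_t k$. The function name and sample data are hypothetical, and this is not XLMiner's implementation.

```python
def double_exponential_smoothing(actual, alpha, beta, horizon=1):
    """Return forecasts for 1..horizon periods beyond the end of the series."""
    level = actual[0]              # a_1 = A_1
    trend = actual[1] - actual[0]  # b_1 = A_2 - A_1
    for a_t in actual[1:]:
        prev_level = level
        # Level: weighted average of the observation and the prior prediction.
        level = alpha * a_t + (1 - alpha) * (prev_level + trend)
        # Trend: weighted average of the change in level and the prior trend.
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return [level + trend * k for k in range(1, horizon + 1)]


# Hypothetical, perfectly linear series: the forecasts continue the line.
print(double_exponential_smoothing([10, 12, 14, 16], alpha=0.5, beta=0.5, horizon=2))
```

When $\beta = 0$, the trend estimate never changes from its initial value, so successive forecasts differ by a constant amount.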

Example 9.13  Double Exponential Smoothing with XLMiner

Figure 9.16 shows a portion of the Excel file Coal Production, which provides data on total tons produced from 1960 through 2011. The data appear to follow a linear trend. The XLMiner dialog is similar to the one used for single exponential smoothing. Using the optimization feature to find the best values of $\alpha$ and $\beta$, XLMiner produces the output, a portion of which is shown in Figure 9.17. We see that the best values of $\alpha$ and $\beta$ are 0.684 and 0.00, respectively. Forecasts generated by XLMiner for the next 3 years (not shown in Figure 9.17) are

2012: 1,115,563,804
2013: 1,130,977,341
2014: 1,146,390,878

Regression-Based Forecasting for Time Series with a Linear Trend

Equation 9.6 may look familiar from simple linear regression. We introduced regression in the previous chapter as a means of developing relationships between a dependent variable and one or more independent variables. Simple linear regression can be applied to forecasting using time as the independent variable.

Example 9.14  Forecasting Using Trendlines

For the data in the Excel file Coal Production, a linear trendline, shown in Figure 9.18, gives an $R^2$ value of 0.95 (the fitted model assumes that the years are numbered 1 through 52, not as actual dates). The model is

tons = 438,819,885.29 + 15,413,536.97 × year

Thus, a forecast for 2012 would be

tons = 438,819,885.29 + 15,413,536.97 × (53) = 1,255,737,345

Note, however, that the linear model does not adequately predict the recent drop in production after 2008.

Figure 9.16 Portion of Excel File Coal Production
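A trendline of this kind is an ordinary least-squares fit with the period number as the independent variable. The sketch below shows one way to compute it in plain Python and then applies the coefficients reported in Example 9.14. The function names and the small sample series are hypothetical; the coal-production data themselves are not reproduced here.

```python
def fit_trendline(values):
    """Least-squares intercept and slope, with periods numbered 1..n."""
    n = len(values)
    xs = range(1, n + 1)
    x_bar = sum(xs) / n
    y_bar = sum(values) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, values))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    return intercept, slope


def trend_forecast(intercept, slope, period):
    """Forecast for a given period number from the fitted line."""
    return intercept + slope * period


# Hypothetical series lying exactly on the line y = 1 + 2x.
print(fit_trendline([3, 5, 7, 9]))

# Coefficients from Example 9.14; 2012 is period 53.
print(round(trend_forecast(438_819_885.29, 15_413_536.97, 53)))  # 1255737345
```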


Figure 9.17 Portion of XLMiner Output for Double Exponential Smoothing of Coal-Production Data

Figure 9.18 Trendline-Based Forecast for Coal Production

In Chapter 8, we noted that an important assumption for using regression analysis is the lack of autocorrelation among the data. When autocorrelation is present, successive observations are correlated with one another; for example, large observations tend to follow other large observations, and small observations also tend to follow one another. This can often be seen by examining the residual plot when the data are ordered by time. Figure 9.19 shows the time-ordered residual plot from the Excel Regression tool for the coal-production example. The residuals do not appear to be random; rather, successive observations seem to be related to one another. This suggests autocorrelation, indicating that other approaches, called autoregressive models, are more appropriate. However, these are more advanced than the level of this book and are not discussed here.

Figure 9.19 Residual Plot for Linear Regression Forecasting Model
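Besides inspecting the residual plot, one rough numerical check is the lag-1 autocorrelation of the time-ordered residuals. The sketch below is an illustration only; the statistic is a standard one but is not part of the procedure described in this chapter, and the function name and residual values are hypothetical. Values well above 0 suggest that residuals of the same sign tend to follow one another.

```python
def lag1_autocorrelation(residuals):
    """Correlation between each residual and the one immediately before it."""
    n = len(residuals)
    mean = sum(residuals) / n
    numerator = sum((residuals[t] - mean) * (residuals[t - 1] - mean)
                    for t in range(1, n))
    denominator = sum((e - mean) ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residuals: long runs of the same sign versus alternating signs.
print(lag1_autocorrelation([2, 2, 2, -2, -2, -2]))  # positive
print(lag1_autocorrelation([1, -1, 1, -1]))         # negative
```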

Forecasting Time Series with Seasonality

Quite often, time-series data exhibit seasonality, especially on an annual basis. We saw an example of this in Figure 9.2. When time series exhibit seasonality, different techniques provide better forecasts than the ones we have described.

Regression-Based Seasonal Forecasting Models

One approach is to use linear regression. Multiple linear regression models with categorical variables can be used for time series with seasonality. To do this, we use dummy categorical variables for the seasonal components.

Example 9.15  Regression-Based Forecasting for Natural Gas Usage

With monthly data, as we have for natural gas usage in the Gas & Electric Excel file, we have a seasonal categorical variable with $k = 12$ levels. As discussed in Chapter 8, we construct the regression model using $k - 1$ dummy variables. We will use January as the reference month; therefore, this variable does not appear in the model:

gas usage = $\beta_0$ + $\beta_1$ time + $\beta_2$ February + $\beta_3$ March + $\beta_4$ April + $\beta_5$ May + $\beta_6$ June + $\beta_7$ July + $\beta_8$ August + $\beta_9$ September + $\beta_{10}$ October + $\beta_{11}$ November + $\beta_{12}$ December

This coding scheme results in the data matrix shown in Figure 9.20. This model picks up trends from the regression coefficient for time and seasonality from the dummy variables for each month. The forecast for the next January will be $\beta_0 + \beta_1(25)$. The variable coefficients (betas) for each of the other 11 months will show the adjustment relative to January. For example, the forecast for next February will be $\beta_0 + \beta_1(26) + \beta_2(1)$, and so on. Figure 9.21 shows the results of using the Regression tool in Excel after eliminating insignificant variables (time and Feb). Because the data show no clear linear trend, the variable time could not explain any significant variation in the data. The dummy variable for February was probably insignificant because the historical gas usage values for January and February were very close to each other. The $R^2$ for this model is 0.971, which is very good. The final regression model is

gas usage = 236.75 − 36.75 March − 99.25 April − 192.25 May − 203.25 June − 208.25 July − 209.75 August − 208.25 September − 196.75 October − 149.75 November − 43.25 December

Figure 9.20 Data Matrix for Seasonal Regression Model

Figure 9.21 Final Regression Model for Forecasting Gas Usage
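To make the dummy-variable coding concrete, the following Python sketch builds the 0/1 coding for a month (with January as the reference) and evaluates the final model from Example 9.15. The coefficients are those reported above; the names `dummy_row` and `forecast_gas_usage` are hypothetical helpers introduced only for illustration.

```python
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

# Final model from Example 9.15: intercept plus monthly adjustments.
# January is the reference month; time and February were dropped as insignificant.
INTERCEPT = 236.75
ADJUSTMENT = {"March": -36.75, "April": -99.25, "May": -192.25,
              "June": -203.25, "July": -208.25, "August": -209.75,
              "September": -208.25, "October": -196.75,
              "November": -149.75, "December": -43.25}


def dummy_row(month):
    """Return the 11 dummy values (February..December) for a given month."""
    return [1 if month == m else 0 for m in MONTHS[1:]]


def forecast_gas_usage(month):
    """Forecast from the final model; months without a term get the intercept."""
    return INTERCEPT + ADJUSTMENT.get(month, 0.0)


print(dummy_row("March"))           # a single 1 in the March position
print(forecast_gas_usage("March"))  # 236.75 - 36.75 = 200.0
```

Because time and February do not appear in the final model, January and February receive the same forecast, 236.75, in every year.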


Holt-Winters Forecasting for Seasonal Time Series

The methods we describe here and in the next section are based on the work of two researchers, C.C. Holt, who developed the basic approach, and P.R. Winters, who extended Holt’s work. Hence, these approaches are commonly referred to as Holt-Winters models. Holt-Winters models are similar to exponential smoothing models in that smoothing constants are used to smooth out variations in the level and seasonal patterns over time. For time series with seasonality but no trend, XLMiner supports a Holt-Winters method but does not have the ability to optimize the parameters.

Example 9.16  Forecasting Natural Gas Usage Using Holt-Winters No-Trend Model

Figure 9.22 shows the dialog for the Holt-Winters smoothing model with no trend for the natural gas data in the Gas & Electric Excel file in Figure 9.2. In the Parameters pane, the value of Period is the length of the season, in this case, 12 months. Note that we have two complete seasons of data. Because the procedure does not optimize the parameters, you will generally have to experiment with the smoothing constants $\alpha$ and $\gamma$ (gamma) that apply to the level and seasonal factors in the model. Figure 9.23 shows a portion of the output. We see that this choice of parameters results in a fairly close forecast with low error metrics. The forecasts at the bottom of the output provide point estimates along with confidence intervals.

Holt-Winters Models for Forecasting Time Series with Seasonality and Trend

Many time series exhibit both trend and seasonality. Such might be the case for growing sales of a seasonal product. These models combine elements of both the trend and seasonal models. Two types of Holt-Winters smoothing models are often used.

Figure 9.22 XLMiner Holt-Winters Smoothing No-Trend Model Dialog


Figure 9.23 Portion of XLMiner Output for Forecasting Natural Gas Usage

The Holt-Winters additive model is based on the equation

$$F_{t+1} = a_t + b_t + S_{t-s+1} \tag{9.8}$$

and the Holt-Winters multiplicative model is

$$F_{t+1} = (a_t + b_t)S_{t-s+1} \tag{9.9}$$

The additive model applies to time series with relatively stable seasonality, whereas the multiplicative model applies to time series whose amplitude increases or decreases over time. Therefore, a chart of the time series should be viewed first to identify the appropriate type of model to use. Three parameters, $\alpha$, $\beta$, and $\gamma$, are used to smooth the level, trend, and seasonal factors in the time series. XLMiner supports both models.
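The difference between Equations (9.8) and (9.9) is easiest to see in a one-step forecast. The sketch below assumes that the level $a_t$, the trend $b_t$, and the seasonal factors have already been estimated, and that $s$ is the season length; it does not show how the three smoothing constants update those estimates. The function name and all numbers are hypothetical.

```python
def holt_winters_one_step(level, trend, seasonals, t, s, multiplicative=False):
    """One-step-ahead forecast F_{t+1}.

    seasonals[j] holds the seasonal factor S_{j+1}, so S_{t-s+1} is
    seasonals[t - s] (periods are numbered from 1).
    """
    seasonal = seasonals[t - s]
    if multiplicative:
        return (level + trend) * seasonal  # Equation (9.9)
    return level + trend + seasonal        # Equation (9.8)


# Hypothetical quarterly example (s = 4) forecasting period 5 from period 4.
additive = holt_winters_one_step(100, 5, [10, -5, -10, 5], t=4, s=4)
multiplicative = holt_winters_one_step(100, 5, [1.1, 0.9, 0.8, 1.2], t=4, s=4,
                                       multiplicative=True)
print(additive, multiplicative)
```

In the additive form the seasonal factor is an amount in the units of the data, so its effect stays the same as the level grows; in the multiplicative form it is a ratio, so the seasonal swing scales with the level.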

Example 9.17  Forecasting New Car Sales Using Holt-Winters Models

Figure 9.24 shows a portion of the Excel file New Car Sales, which contains 3 years of monthly retail sales data. There is clearly a stable seasonal factor in the time series, along with an increasing trend; therefore, the Holt-Winters additive model would appear to be the most appropriate. In XLMiner, choose Smoothing/Holt-Winters/Additive from the Time-Series group. As with other procedures, some experimentation is necessary to identify the best parameters for the model. The dialog in Figure 9.25 shows the default values. In the results shown in Figure 9.26, you can see that the forecasts do not track the data very well. This may be due to the low value of $\gamma$ used to smooth out the seasonal factor. We leave it to you to experiment to find a better model.


Figure 9.24 Portion of Excel File New Car Sales

Figure 9.25 Holt-Winters Smoothing Additive Model Dialog

Selecting Appropriate Time-Series-Based Forecasting Models

Table 9.1 summarizes the choice of forecasting approaches that can be implemented by XLMiner based on characteristics of the time series.

Table 9.1 Forecasting Model Choice

| | No Seasonality | Seasonality |
|---|---|---|
| No trend | Simple moving average or simple exponential smoothing | Holt-Winters no-trend smoothing model or multiple regression |
| Trend | Double exponential smoothing | Holt-Winters additive or Holt-Winters multiplicative model |


Figure 9.26 Results from Holt-Winters Additive Model for Forecasting New-Car Sales

Regression Forecasting with Causal Variables

In many forecasting applications, other independent variables besides time, such as economic indexes or demographic factors, may influence the time series. For example, a manufacturer of hospital equipment might include such variables as hospital capital spending and changes in the proportion of people over the age of 65 in building models to forecast future sales. Explanatory/causal models, often called econometric models, seek to identify factors that explain statistically the patterns observed in the variable being forecast, usually with regression analysis. We will use a simple example of forecasting gasoline sales to illustrate econometric modeling.

Example 9.18  Forecasting Gasoline Sales Using Simple Linear Regression

Figure 9.27 shows gasoline sales over 10 weeks during June through August along with the average price per gallon and a chart of the gasoline sales time series with a fitted trendline (Excel file Gasoline Sales). During the summer months, it is not unusual to see an increase in sales as more people go on vacations. The chart shows a linear trend, although $R^2$ is not very high. The trendline is:

sales = 4,790.1 + 812.99 week

Using this model, we would predict sales for week 11 as

sales = 4,790.1 + 812.99(11) = 13,733 gallons


Figure 9.27 Gasoline Sales Data and Trendline

In the gasoline sales data, we also see that the average price per gallon changes each week, and this may influence consumer sales. Therefore, the sales trend might not simply be a factor of steadily increasing demand, but it might also be influenced by the average price per gallon. The average price per gallon can be considered as a causal variable. Multiple linear regression provides a technique for building forecasting models that incorporate not only time, but other potential causal variables also.

Example 9.19  Incorporating Causal Variables in a Regression-Based Forecasting Model

For the gasoline sales data, we can incorporate the price/gallon by using two independent variables. This results in the multiple regression model

sales = $\beta_0$ + $\beta_1$ week + $\beta_2$ price/gallon

The results are shown in Figure 9.28, and the regression model is

sales = 72333.08 + 508.67 week − 16463.2 price/gallon

Notice that the $R^2$ value is higher when both variables are included, explaining more than 86% of the variation in the data. If the company estimates that the average price for the next week will drop to \$3.80, the model would forecast the sales for week 11 as

sales = 72333.08 + 508.67(11) − 16463.2(3.80) = 15,368 gallons
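The two week-11 forecasts can be reproduced directly from the fitted coefficients. The short Python sketch below uses the coefficients reported in Examples 9.18 and 9.19; the function names are hypothetical and are introduced only for illustration.

```python
def trend_only_forecast(week):
    """Trendline from Example 9.18: sales as a function of week alone."""
    return 4790.1 + 812.99 * week


def causal_forecast(week, price_per_gallon):
    """Multiple regression from Example 9.19: week plus price per gallon."""
    return 72333.08 + 508.67 * week - 16463.2 * price_per_gallon


print(round(trend_only_forecast(11)))    # 13733 gallons
print(round(causal_forecast(11, 3.80)))  # 15368 gallons
```

The negative price coefficient means that, for a given week, each additional dollar in price per gallon lowers the forecast by roughly 16,463 gallons, which is why the expected price drop raises the week 11 forecast above the trend-only value.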

Figure 9.28 Regression Results for Gasoline Sales

The Practice of Forecasting

Surveys of forecasting practices have shown that both judgmental and quantitative methods are used for forecasting sales of product lines or product families as well as for broad company and industry forecasts. Simple time-series models are used for short- and medium-range forecasts, whereas regression analysis is the most popular method for long-range forecasting. However, many companies rely on judgmental methods far more than quantitative methods, and almost half judgmentally adjust quantitative forecasts. In this chapter, we focus on these three approaches to forecasting.

In practice, managers use a variety of judgmental and quantitative forecasting techniques. Statistical methods alone cannot account for such factors as sales promotions, unusual environmental disturbances, new product introductions, large one-time orders, and so on. Many managers begin with a statistical forecast and adjust it to account for intangible factors. Others may develop independent judgmental and statistical forecasts and then combine them, either objectively by averaging or in a subjective manner. It is important to compare quantitatively generated forecasts to judgmental forecasts to see if the forecasting method is adding value in terms of an improved forecast. It is impossible to provide universal guidance as to which approaches are best, because they depend on a variety of factors, including the presence or absence of trends and seasonality, the number of data points available, length of the forecast time horizon, and the experience and knowledge of the forecaster. Often, quantitative approaches will miss significant changes in the data, such as reversal of trends, whereas qualitative forecasts may catch them, particularly when using indicators as discussed earlier in this chapter.

Analytics in Practice: Forecasting at NBC Universal¹

NBC Universal (NBCU), a subsidiary of the General Electric Company (GE), is one of the world’s leading media and entertainment companies in the distribution, production, and marketing of entertainment, news, and information. The television broadcast year in the United States starts in the third week of September. The major broadcast networks announce their programming schedules for the new broadcast year in the middle of May. Shortly thereafter, the sale of advertising time, which generates the majority of revenues, begins. The broadcast networks sell 60% to 80% of their airtime inventory during a brief period starting in late May and lasting 2 to 3 weeks. This sales period is known as the upfront market. Immediately after announcing their program schedules, the networks finalize their ratings forecasts and estimate the market demand. The ratings forecasts are projections of the numbers of people in each of several demographic groups who are expected to watch each airing of the shows in the program schedule for the entire broadcast year. After they finalize their ratings projections and market-demand estimates, the networks set the rate cards that contain the prices for commercials on all their shows and develop pricing strategies.

Forecasting upfront market demand has always been a challenge. NBCU initially relied on historical patterns, expert knowledge, and intuition for estimating demand. Later, it tried time-series forecasting models based on historical demand and leading economic indicator data and implemented the models in a Microsoft Excel–based system. However, these models proved to be unsatisfactory because of the unique nature of NBCU’s demand population. The time-series models had fit and prediction errors in the range of 5% to 12% based on the historical data. These errors were considered reasonable, but the sales executives were reluctant to use the models because the models did not consider several qualitative factors that they believed influence demand. As a result, they did not trust the forecasts that these models generated and never used them.

Analytics staff at NBCU then decided to develop a qualitative demand forecasting model that captures the knowledge of the sales experts. Their approach incorporates the Delphi method and “grass-roots forecasting,” which is based on the concept of asking those who are close to the end consumer, such as salespeople, about the customers’ purchasing plans, along with historical data to develop forecasts. Since 2004, more than 200 sales and finance personnel at NBCU have been using the system to support sales decisions during the upfront market when NBCU signs advertising deals worth more than \$4.5 billion. The system enables NBCU to sell and analyze pricing scenarios across all NBCU’s television properties with ease and sophistication while predicting demand with a high accuracy. NBCU’s sales leaders credit the system with having given them a unique competitive advantage.

¹ Based on Srinivas Bollapragada, Salil Gupta, Brett Hurwitz, Paul Miles, and Rajesh Tyagi, “NBC-Universal Uses a Novel Qualitative Forecasting Technique to Predict Advertising Demand,” Interfaces, 38, 2 (March–April 2008): 103–111.

Key Terms

Cyclical effect
Delphi method
Double exponential smoothing
Double moving average
Econometric model
Historical analogy
Holt-Winters additive model
Holt-Winters models
Holt-Winters multiplicative model
Index
Indicator
Mean absolute deviation (MAD)
Mean absolute percentage error (MAPE)
Mean square error (MSE)
Root mean square error (RMSE)
Seasonal effect
Simple exponential smoothing
Simple moving average
Smoothing constant
Stationary time series
Time series
Trend

Problems and Exercises

1. Identify some business applications in which judgmental forecasting techniques such as historical analogy and the Delphi method would be useful.

2. Search the Conference Board’s Web site to find business cycle indicators, and the components and methods adopted to compute the same. Write a short report about your findings.

3. The Excel file Energy Production & Consumption provides data on production, imports, exports, and consumption. Develop line charts for each variable and identify key characteristics of the time series (e.g., trends or cycles). Are any of these time series stationary? In forecasting the future, discuss whether all or only a portion of the data should be used.

4. The Excel file New Registered Users provides data on monthly new registrations on a Web site for four years. Compare the three-month and twelve-month moving average forecasts using the MAD criterion. Explain which model yields better results and why.

5. The Excel file Closing Stock Prices provides data for four stocks and the Dow Jones Industrials Index over a 1-month period.
   a. Develop spreadsheet models for forecasting each of the stock prices using simple 2-period moving average and simple exponential smoothing with a smoothing constant of 0.3.
   b. Compare your results to the outputs from Excel’s Data Analysis tools.
   c. Using MAD, MSE, and MAPE as guidance, find the best number of moving average periods and best smoothing constant for exponential smoothing.
   d. Use XLMiner to find the best number of periods for the moving average forecast and optimal exponential smoothing constant.

6. For the data in the Excel file Gasoline Prices do the following:
   a. Develop spreadsheet models for forecasting prices using simple moving average and simple exponential smoothing.
   b. Compare your results to the outputs from Excel’s Data Analysis tools.
   c. Using MAD, MSE, and MAPE as guidance, find the best number of moving average periods and best smoothing constant for exponential smoothing.
   d. Use XLMiner to find the best number of periods for the moving average forecast and optimal exponential smoothing constant.

7. Consider the prices for the DJ Industrials in the Excel file Closing Stock Prices. The data appear to have a linear trend over the time period provided.
   a. Use simple linear regression to forecast the data. What would be the forecasts for the next 3 days?
   b. Use the double exponential smoothing procedure in XLMiner to find forecasts for the next 3 days.

8. Consider the data in the Excel file Consumer Price Index.
   a. Use simple linear regression to forecast the data. What would be the forecasts for the next 2 years?
   b. Use the double exponential smoothing procedure in XLMiner to find forecasts for the next 2 years.

9. Consider the data in the Excel file Internet Users. Use simple linear regression to forecast the data. What would be the forecast for the next three years?

10. Develop a multiple linear regression model with categorical variables that incorporate seasonality for forecasting the deaths caused by accidents in the U.S. Use the data for years 1976 and 1977 in the Excel file Accidental Deaths. Use the model to generate forecasts for the next nine months, and compare the forecasts to actual observations noted in the data for the year 1978.

11. Develop a multiple regression model with categorical variables that incorporate seasonality for forecasting sales using the last three years of data in the Excel file New Car Sales.

12. Develop a multiple regression model with categorical variables that incorporate seasonality for forecasting housing starts beginning in June 2006 using the data in the Excel file Housing Starts.

13. The Excel file Census Data provides annual average expenditures and income levels of the people in the U.S. Develop forecasting models for each of the data types. What do your models predict for the next two years?

14. Use the Holt-Winters no-trend model to find the best model to find forecasts for the next 12 months in the Excel file Housing Starts.

15. The Excel file CD Interest Rates provides annual average interest rates on 6-month certificates of deposit. Compare the Holt-Winters additive and multiplicative models using XLMiner with the default parameters and a season of 6 years. Why does the multiplicative model provide better results?

16. The Excel file Olympic Track and Field Data provides the gold medal-winning distances for the high jump, discus, and long jump for the modern Olympic Games. Develop forecasting models for each of the events. What do your models predict for the next Olympics?

17. Choose an appropriate forecasting technique for the data in the Excel file Coal Consumption and find the best forecasting model. Explain how you would use the model to forecast and how far into the future it would be appropriate to forecast.

18. Choose an appropriate forecasting technique for the data in the Excel file DJIA December Close and find the best forecasting model. Explain how you would use the model to forecast and how far into the future it would be appropriate to forecast.

19. Choose an appropriate forecasting technique for the data in the Excel file Inflation Rates US and find the best forecasting model. Explain how you would use the model to forecast and how far into the future it would be appropriate to forecast.

20. Choose an appropriate forecasting technique for the data in the Excel file Mortgage Rates and find the best forecasting model. Explain how you would use the model to forecast and how far into the future it would be appropriate to forecast.

21. Choose an appropriate forecasting technique for the data in the Excel file Gaussian Response and find the best forecasting model. Explain how you would use the model to forecast and how far into the future it would be appropriate to forecast.

22. Choose an appropriate forecasting technique for the data in the Excel file Treasury Yield Rates and find the best forecasting model. Explain how you would use the model to forecast and how far into the future it would be appropriate to forecast.

23. Data in the Excel file Microprocessor Data shows the demand for one type of chip used in industrial equipment from a small manufacturer.
   a. Construct a chart of the data. What appears to happen when a new chip is introduced?
   b. Develop a causal regression model to forecast demand that includes both time and the introduction of a new chip as explanatory variables.
   c. What would the forecast be for the next month if a new chip is introduced? What would it be if a new chip is not introduced?

Case: Performance Lawn Equipment

An important part of planning manufacturing capacity is having a good forecast of sales. Elizabeth Burke is interested in forecasting sales of mowers and tractors in each marketing region as well as industry sales to assess future changes in market share. She also wants to forecast future increases in production costs. Develop forecasting models for these data and prepare a report of your results with appropriate charts and output from Excel.

Chapter 10  Introduction to Data Mining


Learning Objectives

After studying this chapter, you will be able to:

• Define data mining and some common approaches used in data mining.
• Explain how cluster analysis is used to explore and reduce data.
• Apply cluster analysis techniques using XLMiner.
• Explain the purpose of classification methods, how to measure classification performance, and the use of training and validation data.
• Apply k-Nearest Neighbors, discriminant analysis, and logistic regression for classification using XLMiner.
• Describe association rule mining and its use in market basket analysis.
• Use XLMiner to develop association rules.
• Use correlation analysis for cause-and-effect modeling.


In an article in Analytics magazine, Talha Omer observed that using a cell phone to make a voice call leaves behind a significant amount of data. “The cell phone provider knows every person you called, how long you talked, what time you called and whether your call was successful or if was dropped. It also knows where you are, where you make most of your calls from, which promotion you are responding to, how many times you have bought before, and so on.”1 Considering the fact that the vast majority of people today use cell phones, a huge amount of data about consumer behavior is available. Similarly, many stores now use loyalty cards. At supermarkets, drugstores, retail stores, and other outlets, loyalty cards enable consumers to take advantage of sale prices available only to those who use the card. However, when they do, the cards leave behind a digital trail of data about purchasing patterns. How can a business exploit these data? If they can better understand patterns and hidden relationships in the data, they can not only understand buying habits but also customize advertisements, promotions, coupons, and so on, for each individual customer and send targeted text messages and e-mail offers (we’re not talking spam here, but registered users who opt into such messages).

Data mining is a rapidly growing field of business analytics that is focused on better understanding characteristics and patterns among variables in large databases using a variety of statistical and analytical tools. Many of the tools that we have studied in previous chapters, such as data visualization, data summarization, PivotTables, correlation and regression analysis, and other techniques, are used extensively in data mining. However, as the amount of data has grown exponentially, many other statistical and analytical methods have been developed to identify relationships among variables in large data sets and understand hidden patterns that they may contain.

In this chapter, we introduce some of the more popular methods and use XLMiner software to implement them in a spreadsheet environment. Many data-mining procedures require advanced statistical knowledge to understand the underlying theory. Therefore, our focus is on simple applications and understanding the purpose and application of the techniques rather than their theoretical underpinnings.2 In addition, we note that this chapter is not intended to cover all aspects of data mining. Many other techniques are available in XLMiner that are not described in this chapter.

1 Talha Omer, “From Business Intelligence to Analytics,” Analytics (January/February 2011): 20. www.analyticsmagazine.com.
2 Many of the descriptions of techniques supported by XLMiner have been adapted from the XLMiner help files. Please note that the example output screen shots in this chapter may differ from the newest release of XLMiner.


The Scope of Data Mining

Data mining can be considered part descriptive and part prescriptive analytics. In descriptive analytics, data-mining tools help analysts to identify patterns in data. Excel charts and PivotTables, for example, are useful tools for describing patterns and analyzing data sets; however, they require manual intervention. Regression analysis and forecasting models help us to predict relationships or future values of variables of interest. As some researchers observe, “the boundaries between prediction and description are not sharp (some of the predictive models can be descriptive, to the degree that they are understandable, and vice versa).”3 In most business applications, the purpose of descriptive analytics is to help managers predict the future or make better decisions that will impact future performance, so we can generally state that data mining is primarily a predictive analytic approach. Some common approaches in data mining include the following:

• Data Exploration and Reduction. This often involves identifying groups in which the elements of the groups are in some way similar. This approach is often used to understand differences among customers and segment them into homogeneous groups. For example, Macy’s department stores identified four lifestyles of its customers: “Katherine,” a traditional, classic dresser who doesn’t take a lot of risks and likes quality; “Julie,” neotraditional and slightly more edgy but still classic; “Erin,” a contemporary customer who loves newness and shops by brand; and “Alex,” the fashion customer who wants only the latest and greatest (they have male versions also).4 Such segmentation is useful in design and marketing activities to better target product offerings. These techniques have also been used to identify characteristics of successful employees and improve recruiting and hiring practices.
• Classification. Classification is the process of analyzing data to predict how to classify a new data element. An example of classification is spam filtering in an e-mail client. By examining textual characteristics of a message (subject header, key words, and so on), the message is classified as junk or not. Classification methods can help predict whether a credit-card transaction may be fraudulent, whether a loan applicant is high risk, or whether a consumer will respond to an advertisement.
• Association. Association is the process of analyzing databases to identify natural associations among variables and create rules for target marketing or buying recommendations. For example, Netflix uses association to understand what types of movies a customer likes and provides recommendations based on the data. Amazon.com also makes recommendations based on past purchases. Supermarket loyalty cards collect data on customers’ purchasing habits and print coupons at the point of purchase based on what was currently bought.
• Cause-and-effect modeling. Cause-and-effect modeling is the process of developing analytic models to describe the relationship between metrics that drive business performance—for instance, profitability, customer satisfaction, or employee satisfaction. Understanding the drivers of performance can lead to better decisions to improve performance. For example, the controls group of Johnson Controls, Inc., examined the relationship between satisfaction and contract-renewal rates. They found that 91% of contract renewals came from customers who were either satisfied or very satisfied, and customers who were not satisfied had a much higher defection rate. Their model predicted that a one-percentage-point increase in the overall satisfaction score was worth \$13 million in service contract renewals annually. As a result, they identified decisions that would improve customer satisfaction.5 Regression and correlation analysis are key tools for cause-and-effect modeling.

3 Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth, “From Data Mining to Knowledge Discovery in Databases,” AI Magazine, American Association for Artificial Intelligence (Fall 1996): 37–54.
4 “Here’s Mr. Macy,” Fortune (November 28, 2005): 139–142.

Data Exploration and Reduction

Some basic techniques in data mining involve exploring data and “data reduction”—that is, breaking down large sets of data into more-manageable groups or segments that provide better insight. We have seen numerous techniques earlier in this book for exploring data and data reduction. For example, charts, frequency distributions and histograms, and summary statistics provide basic information about the characteristics of data. PivotTables, in particular, are very useful in exploring data from different perspectives and for data reduction. XLMiner provides a variety of tools and techniques for data exploration that complement or extend the concepts and tools we have studied in previous chapters. These are found in the Data Analysis group of the XLMiner ribbon, shown in Figure 10.1.

Sampling

When dealing with large data sets and “big data,” it might be costly or time-consuming to process all the data. Instead, we might have to use a sample. We introduced sampling procedures in Chapter 6. XLMiner can sample from an Excel worksheet or from a Microsoft Access database.

Example 10.1  Using XLMiner to Sample from a Worksheet

Figure 10.2 shows a portion of the Base Data worksheet in the Excel file Credit Risk Data. While certainly not “big data,” it consists of 425 records. From the Data Analysis group in the XLMiner ribbon, click the Sample button and choose Sample from Worksheet. Make sure the Data range is correct and includes headers. Select all variables in the left window pane and move them to the right using the > button (which changes to a < if all variables are moved to the right). Choose the options in the Sampling Options section; in this case, we selected 20 samples (without replacement unless the Sample with replacement box is checked—this avoids duplicates) using simple random sampling. By entering a value in the Set seed box, you can obtain the same results at another time for control purposes; otherwise a different random sample will be selected. Figure 10.3 shows the completed dialog and Figure 10.4 shows the results.

5 Steve Hoisington and Earl Naumann, “The Loyalty Elephant,” Quality Progress (February 2003): 33–41.
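The sampling choices in Example 10.1 (sample size, with or without replacement, and a seed for repeatability) can be sketched outside of XLMiner. The following is a minimal Python illustration, not XLMiner's own procedure; the function name `simple_random_sample` and the stand-in list of 425 record numbers are assumptions made for this sketch, and the records drawn will not match those in Figure 10.4.

```python
import random


def simple_random_sample(records, k, seed=None, replace=False):
    """Draw k records by simple random sampling.

    A fixed seed makes the draw repeatable; replace=False avoids duplicates.
    """
    rng = random.Random(seed)
    if replace:
        # Sampling with replacement: the same record may be chosen twice.
        return [rng.choice(records) for _ in range(k)]
    # Sampling without replacement: every chosen record is distinct.
    return rng.sample(records, k)


# Hypothetical stand-in for the 425 records of the Base Data worksheet.
records = list(range(1, 426))

sample_a = simple_random_sample(records, 20, seed=12345)
sample_b = simple_random_sample(records, 20, seed=12345)
print(sample_a == sample_b)  # the same seed reproduces the same sample
```

Leaving `seed` as `None` gives a different sample on each run, which mirrors leaving the Set seed box empty.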


Figure 10.1 XLMiner Ribbon

Figure 10.2 Portion of Excel File Credit Risk Data

Figure 10.3 XLMiner Sampling Dialog


Figure 10.4 XLMiner Sampling Results

Data Visualization

XLMiner offers numerous charts to visualize data. We have already seen many of these, such as bar, line, and scatter charts, and histograms. However, XLMiner also has the capability to produce boxplots, parallel coordinate charts, scatterplot matrix charts, and variable charts. These are found from the Explore button in the Data Analysis group.

Example 10.2  A Boxplot for Credit Risk Data

We will construct a boxplot for the number of months employed for each marital status value from the Credit Risk Data. First, select the Chart Wizard from the Explore button in the Data Analysis group in the XLMiner tab. Select Boxplot; in the second dialog, choose Months Employed as the variable to plot on the vertical axis. In the next dialog, choose Marital Status as the variable to plot on the horizontal axis. Click Finish. The result is shown in Figure 10.5. The box range shows the 25th and 75th percentiles (the interquartile range, IQR), the solid line within the box is the median, and the dotted line within the box is the mean. The “whiskers” extend on either side of the box to represent the minimum and maximum values in a data set. If you hover the cursor over any box, the chart will display these values. Very long whiskers suggest possible outliers in the data. You can easily see the differences in the data between those who are single as compared with those married or divorced. You can also filter the data by checking or unchecking the boxes in the filter pane to display the boxplots for only a portion of the data, for example, to compare those with a high credit risk with those with a low credit risk classification.


Figure 10.5 Boxplot for Months Employed by Marital Status

Boxplots (sometimes called box-and-whisker plots) graphically display five key statistics of a data set—the minimum, first quartile, median, third quartile, and maximum—and are very useful in identifying the shape of a distribution and outliers in the data.

A parallel coordinates chart consists of a set of vertical axes, one for each variable selected. For each observation, a line is drawn connecting the vertical axes. The point at which the line crosses an axis represents the value for that variable. A parallel coordinates chart creates a “multivariate profile,” and helps an analyst to explore the data and draw basic conclusions.
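The five statistics that a boxplot displays can be computed directly. The sketch below assumes the inclusive quartile convention (the one used by Excel's QUARTILE.INC); XLMiner's charts may use a different convention, and the `months_employed` values are hypothetical, not taken from the Credit Risk Data file.

```python
import statistics


def five_number_summary(values):
    """Return the minimum, quartiles, and maximum that a boxplot displays."""
    data = sorted(values)
    # method="inclusive" treats the data as the whole population of interest.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {"min": data[0], "q1": q1, "median": median, "q3": q3, "max": data[-1]}


# Hypothetical months-employed values for one marital-status group.
months_employed = [3, 12, 5, 48, 20, 9, 15]

summary = five_number_summary(months_employed)
iqr = summary["q3"] - summary["q1"]  # the height of the box
print(summary, iqr)
```

A maximum far above the third quartile relative to the IQR, as with the value 48 here, is the numeric counterpart of the "very long whisker" described in Example 10.2.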

Example 10.3  A Parallel Coordinates Chart for Credit Risk Data

First, select the Chart Wizard from the Explore button in the Data Analysis group in the XLMiner tab. Select Parallel Coordinates. In the second dialog, choose Checking, Savings, Months Employed, and Age as the variables to include. Figure 10.6 shows the results. In the small drop-down box at the top, you can choose to color the lines by one of the variables; in this case, we chose to color by credit risk. Yellow represents low credit risk, and blue represents high. We see that individuals with a low number of months employed and lower ages tend to have high credit risk as shown by the density of the blue lines. As with boxplots, you can easily filter the data to explore other combinations of variables or subsets of the data.

A scatterplot matrix combines several scatter charts into one panel, allowing the user to visualize pairwise relationships between variables.


Figure 10.6 Example of a Parallel Coordinates Plot

Example 10.4  A Scatterplot Matrix for Credit Risk Data

Select the Chart Wizard from the Explore button in the Data Analysis group in the XLMiner tab. Select Scatterplot Matrix. In the next dialog, check the boxes for Months Customer, Months Employed, and Age and click Finish. Figure 10.7 shows the result. Along the diagonal are histograms of the individual variables. Off the diagonal are scatterplots of pairs of variables. For example, the chart in the third row and second column of the figure shows the scatter chart of Months Employed versus Age. Note that months employed is on the x-axis and age on the y-axis. The data appear to have a slight upward linear trend, signifying that older individuals have been employed for a longer time. Note that there are two charts for each pair of variables with the axes flipped. For example, the chart in the second row and third column is the same as the one we discussed, but with age on the x-axis. As before, you can easily filter the data to create different views.

Finally, a variable plot simply plots a matrix of histograms for the variables selected.

Example 10.5  A Variable Plot of Credit Risk Data

Select the Chart Wizard from the Explore button in the Data Analysis group in the XLMiner tab. Select Variable. In the next dialog, check the boxes for the variables you wish to include (we kept them all) and click Finish. Figure 10.8 shows the results. This tool is much easier to use than Excel's Histogram tool, especially when a data set has many variables, and you can easily filter the data to create different perspectives.

Figure 10.7 Example of a Scatterplot Matrix

Figure 10.8 Example of a Variable Plot

Dirty Data

It is not unusual to find real data sets that have missing values or errors. Such data sets are called “dirty” and need to be “cleaned” prior to analyzing them. Several approaches are used for handling missing data. For example, we could simply eliminate the records that contain missing data; estimate reasonable values for missing observations, such as the mean or median value; or use a data mining procedure to deal with them. XLMiner has the capability to deal with missing data in the Transform menu in the Data Analysis group. We suggest that you consult the XLMiner User Guide from the Help menu for further information. In any event, you should try to understand whether missing data are simply random events or if there is a logical reason why they are missing. Eliminating sample data indiscriminately could result in misleading information and conclusions about the data.
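Two of the approaches mentioned above, eliminating incomplete records and substituting the median for a missing value, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names, the four records, and the use of `None` to mark a missing value are all hypothetical, and this is not how XLMiner's Transform menu is implemented.

```python
import statistics


def drop_incomplete(rows):
    """Keep only the records that have no missing (None) values."""
    return [row for row in rows if all(v is not None for v in row.values())]


def impute_median(rows, field):
    """Replace missing values of one field with the median of the observed values."""
    observed = [row[field] for row in rows if row[field] is not None]
    fill = statistics.median(observed)
    return [{**row, field: fill if row[field] is None else row[field]} for row in rows]


# Hypothetical records; None marks a missing observation.
rows = [
    {"age": 23, "savings": 1200},
    {"age": None, "savings": 300},
    {"age": 41, "savings": 5000},
    {"age": 35, "savings": 750},
]

complete = drop_incomplete(rows)     # three records remain
filled = impute_median(rows, "age")  # the missing age becomes 35
print(len(complete), filled[1])
```

Dropping records shrinks the sample, while imputation keeps every record but understates variability; which is preferable depends on why the values are missing.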


Data errors can often be identified from outliers (see the discussion in Chapter 3). A typical approach is to evaluate the data with and without outliers and determine whether their impact will significantly change the conclusions, and whether more effort should be spent on trying to understand and explain them.

Cluster Analysis

Cluster analysis, also called data segmentation, is a collection of techniques that seek to group or segment a collection of objects (i.e., observations or records) into subsets or clusters, such that those within each cluster are more closely related to one another than objects assigned to different clusters. The objects within clusters should exhibit a high amount of similarity, whereas those in different clusters will be dissimilar. Cluster analysis is a data-reduction technique in the sense that it can take a large number of observations, such as customer surveys or questionnaires, and reduce the information into smaller, homogeneous groups that can be interpreted more easily. The segmentation of customers into smaller groups, for example, can be used to customize advertising or promotions. As opposed to many other data-mining techniques, cluster analysis is primarily descriptive, and we cannot draw statistical inferences about a sample using it. In addition, the clusters identified are not unique and depend on the specific procedure used; therefore, it does not result in a definitive answer but only provides new ways of looking at data. Nevertheless, it is a widely used technique.

There are two major methods of clustering—hierarchical clustering and k-means clustering. In hierarchical clustering, the data are not partitioned into a particular cluster in a single step. Instead, a series of partitions takes place, which may run from a single cluster containing all objects to n clusters, each containing a single object. Hierarchical clustering is subdivided into agglomerative clustering methods, which proceed by series of fusions of the n objects into groups, and divisive clustering methods, which separate n objects successively into finer groupings. Figure 10.9 illustrates the differences between these two types of methods. Agglomerative techniques are more commonly used, and this is the method implemented in XLMiner.

Figure 10.9 Agglomerative versus Divisive Clustering

Hierarchical clustering may be represented by a two-dimensional diagram known as a dendrogram, which illustrates the fusions or divisions made at each successive stage of analysis. An agglomerative hierarchical clustering procedure produces a series of partitions of the data, Pn, Pn−1, . . . , P1. Pn consists of n single-object clusters, and P1 consists of a single group containing all n observations. At each particular stage, the method joins together the two clusters that are closest together (most similar). At the first stage, this consists of simply joining together the two objects that are closest together. Different methods use different ways of defining distance (or similarity) between clusters. The most commonly used measure of distance between objects is Euclidean distance. This is an extension of the way in which the distance between two points on a plane is computed as the hypotenuse of a right triangle (see Figure 10.10). The Euclidean distance measure between two points (x1, x2, . . . , xn) and (y1, y2, . . . , yn) is

$$\sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_n - y_n)^2} \qquad (10.1)$$

Some clustering methods use the squared Euclidean distance (i.e., without the square root) because it speeds up the calculations. One of the simplest agglomerative hierarchical clustering methods is single linkage clustering, also known as the nearest-neighbor technique. The defining feature of the method is that distance between groups is defined as the distance between the closest pair of objects, where only pairs consisting of one object from each group are considered. In the single linkage method, the distance between two clusters, r and s, D(r, s), is defined as the minimum distance between any object in cluster r and any object in cluster s. In other words, the distance between two clusters is given by the value of the shortest link between the clusters. At each stage of hierarchical clustering, we find the two clusters with the minimum distance between them and merge them together.

Another method that is basically the opposite of single linkage clustering is called complete linkage clustering. In this method, the distance between groups is defined as the distance between the most distant pair of objects, one from each group. A third method is average linkage clustering. Here the distance between two clusters is defined as the average of distances between all pairs of objects, where each pair is made up of one object from each group. Other methods are average group linkage clustering, which uses the mean values for each variable to compute distances between clusters, and Ward’s hierarchical clustering method, which uses a sum-of-squares criterion. Different methods generally yield different results, so it is best to experiment and compare the results.

Figure 10.10 Computing the Euclidean Distance Between Two Points
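To make the distance definitions concrete, the following Python sketch computes the Euclidean distance of formula (10.1) and the single, complete, and average linkage distances between two small clusters. The function names and the two clusters of two-dimensional points are illustrative assumptions, not part of XLMiner.

```python
import math
from itertools import product


def euclidean(x, y):
    """Euclidean distance between two points, as in formula (10.1)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pair_distances(r, s):
    """Distances for all pairs made of one object from r and one from s."""
    return [euclidean(p, q) for p, q in product(r, s)]


def single_linkage(r, s):
    return min(pair_distances(r, s))  # closest pair


def complete_linkage(r, s):
    return max(pair_distances(r, s))  # most distant pair


def average_linkage(r, s):
    d = pair_distances(r, s)
    return sum(d) / len(d)            # average over all pairs


# Two hypothetical clusters of two-dimensional observations.
cluster_r = [(0.0, 0.0), (0.0, 1.0)]
cluster_s = [(3.0, 4.0), (6.0, 8.0)]

print(single_linkage(cluster_r, cluster_s))
print(complete_linkage(cluster_r, cluster_s))
print(average_linkage(cluster_r, cluster_s))
```

For any two clusters the single linkage distance is never larger than the average linkage distance, which in turn is never larger than the complete linkage distance; this is one reason the methods can merge clusters in a different order.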

Example 10.6  Clustering Colleges and Universities Data

Figure 10.11 shows a portion of the Excel file Colleges and Universities. The characteristics of these institutions differ quite widely. Suppose that we wish to cluster them into more homogeneous groups based on the median SAT, acceptance rate, expenditures/student, percentage of students in the top 10% of their high school, and graduation rate. In XLMiner, choose Hierarchical Clustering from the Cluster menu in the Data Analysis group. In the dialog shown in Figure 10.12, specify the data range and move the variables that are of interest into the Selected Variables list. Note that we are clustering the numerical variables, so School and Type are not included. After clicking Next, the Step 2 dialog appears (see Figure 10.13). Check the box Normalize input data; this is important to ensure that the distance measure accords equal weight to each variable; without normalization, the variable with the largest scale will dominate the measure. Hierarchical clustering uses the Euclidean distance as the similarity measure for numeric data. The other two options apply only for binary (0 or 1) data. Select the clustering method you wish to use. In this case, we choose Group Average Linkage. In the final dialog (Figure 10.14), select the number of clusters. The agglomerative method of hierarchical clustering keeps forming clusters until only one cluster is left. This option lets you stop the process at a given number of clusters. We selected four clusters. The output is saved on multiple worksheets. Figure 10.15 shows the summary of the inputs. You may use the Output Navigator bar at the top of the worksheet to display various parts of the output rather than trying to navigate through the worksheets yourself.

Clustering Stages output details the history of the cluster formation, showing how the clusters are formed at each stage of the algorithm. At various stages of the clustering process, there are different numbers of clusters. A dendrogram lets you visualize this. This is shown in Figure 10.16. The y-axis measures intercluster distance. Because of the size of the problem, each individual observation is not shown, and some of them are already clustered in the “subclusters.” The Sub Cluster IDs are listed along the x-axis, with a legend below it. For example, during the clustering procedure, records 20 and 25, and records 14 and 16 were merged; these subclusters were then merged together. At the top of the diagram, we see that all clusters are merged into a single cluster. If you draw a horizontal line through the dendrogram at any value of the y-axis, you can identify the number of clusters and the observations in each of them. For example, drawing the line at the distance value of 3, you can see that we have four clusters; just follow the subclusters at the ends of the branches to identify the individual observations in each of them. The Predicted Clusters shows the assignment of observations to the number of clusters we specified in the input dialog, in this case four. This is shown in Figure 10.17. For instance, cluster 3 consists of only three schools, records 4, 28, and 29; and cluster 4 consists of only one observation, record 6. (You may sort the data in Excel to see this more easily.) These schools and their data are extracted in the following database:

We can see that the schools in cluster 3 have quite similar profiles, whereas Cal Tech stands out considerably from the others.
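The steps of Example 10.6 (normalize the inputs, measure Euclidean distance, merge the two closest clusters by average linkage, and stop at a chosen number of clusters) can be sketched as follows. This is a small illustrative implementation under stated assumptions, not XLMiner's algorithm: normalization is taken to mean z-scores, the function names are invented for the sketch, and the five rows of (median SAT, acceptance rate) values are hypothetical rather than taken from the Colleges and Universities file.

```python
import math
import statistics


def normalize(rows):
    """Rescale each column to mean 0 and standard deviation 1 (z-scores).

    Assumes every column has some variation; a constant column would
    give a zero standard deviation.
    """
    columns = list(zip(*rows))
    means = [statistics.mean(c) for c in columns]
    sds = [statistics.stdev(c) for c in columns]
    return [tuple((v - m) / s for v, m, s in zip(row, means, sds)) for row in rows]


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def average_linkage(points, r, s):
    """Average distance over all pairs with one record index from r and one from s."""
    return sum(euclidean(points[i], points[j]) for i in r for j in s) / (len(r) * len(s))


def agglomerate(rows, n_clusters):
    """Merge the two closest clusters repeatedly until n_clusters remain."""
    points = normalize(rows)
    clusters = [[i] for i in range(len(points))]  # start: one record per cluster
    while len(clusters) > n_clusters:
        pairs = ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters)))
        a, b = min(pairs, key=lambda ab: average_linkage(points, clusters[ab[0]], clusters[ab[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Hypothetical (median SAT, acceptance rate %) values for five schools.
rows = [(1300, 25), (1310, 27), (1100, 60), (1090, 62), (1400, 10)]

clusters = agglomerate(rows, 3)
print(clusters)  # lists of record indices, one list per cluster
```

Without the call to `normalize`, the SAT column would dominate the distance simply because its values are larger, which is the reason the example checks the Normalize input data box.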


Figure 10.11 Portion of the Excel File Colleges and Universities

Figure 10.12 Hierarchical Clustering Dialog, Step 1

Figure 10.13 Hierarchical Clustering Dialog, Step 2


Figure 10.14 Hierarchical Clustering Dialog, Step 3

Figure 10.15 Hierarchical Clustering Results: Inputs

Figure 10.16 Hierarchical Clustering Results: Dendrogram and Partial Cluster Legend


Figure 10.17 Portion of Hierarchical Clustering Results: Predicted Clusters

Classification

Classification methods seek to classify a categorical outcome into one of two or more categories based on various data attributes. For each record in a database, we have a categorical variable of interest (e.g., purchase or not purchase, high risk or no risk), and a number of additional predictor variables (age, income, gender, education, assets, etc.). For a given set of predictor variables, we would like to assign the best value of the categorical variable. We will be illustrating various classification techniques using the Excel database Credit Approval Decisions. A portion of this database is shown in Figure 10.18. In this database, the categorical variable of interest is the decision to approve or reject a credit application. The remaining variables are the predictor variables. Because we are working with numerical data, however, we need to code the Homeowner and Decision fields numerically. We code the Homeowner attribute “Y” as 1 and “N” as 0; similarly, we code the Decision attribute “Approve” as 1 and “Reject” as 0. Figure 10.19 shows a portion of the modified database (Excel file Credit Approval Decisions Coded).

Figure 10.18 Portion of the Excel File Credit Approval Decisions

Figure 10.19 Modified Excel File with Numerically Coded Variables
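The numerical coding of the Homeowner and Decision fields is a simple lookup. A minimal Python sketch follows; the dictionary-based record, the function name `code_record`, and the single sample row (including its credit score) are assumptions made for illustration and are not taken from the Credit Approval Decisions file.

```python
# Coding scheme described in the text: Y/N and Approve/Reject become 1/0.
HOMEOWNER_CODES = {"Y": 1, "N": 0}
DECISION_CODES = {"Approve": 1, "Reject": 0}


def code_record(record):
    """Return a copy of the record with Homeowner and Decision coded as 1 or 0."""
    coded = dict(record)
    coded["Homeowner"] = HOMEOWNER_CODES[record["Homeowner"]]
    coded["Decision"] = DECISION_CODES[record["Decision"]]
    return coded


# Hypothetical record in the layout of the credit-approval data.
record = {"Homeowner": "Y", "Credit Score": 725, "Decision": "Approve"}

coded = code_record(record)
print(coded)
```

Using explicit lookup tables means that an unexpected value such as "Yes" raises an error instead of being coded silently, which helps catch dirty data early.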

An Intuitive Explanation of Classification

To develop an intuitive understanding of classification, we consider only the credit score and years of credit history as predictor variables.

Example 10.7  Classifying Credit-Approval Decisions Intuitively

Figure 10.20 shows a chart of the credit scores and years of credit history in the Credit Approval Decisions data. The chart plots the credit scores of loan applicants on the x-axis and the years of credit history on the y-axis. The large bubbles represent the applicants whose credit applications were rejected; the small bubbles represent those that were approved. With a few exceptions (the points at the bottom right corresponding to high credit scores with just a few years of credit history that were rejected), there appears to be a clear separation of the points. When the credit score is greater than 640, the applications were approved, but most applications with credit scores of 640 or less were rejected. Thus, we might propose a simple classification rule: approve an application with a credit score greater than 640.

Another way of classifying the groups is to use both the credit score and years of credit history by visually drawing a straight line to separate the groups, as shown in Figure 10.21. This line passes through the points (763, 2) and (595, 18). Using a little algebra, we can calculate the equation of the line as

years = −0.095 × credit score + 74.66

Therefore, we can propose a different classification rule: whenever years + 0.095 × credit score ≤ 74.66, the application is rejected; otherwise, it is approved. Here again, however, we see some misclassification.
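The "little algebra" behind the separating line, and the two classification rules of Example 10.7, can be written out as a short Python sketch. The function names are invented for this illustration, and the code keeps the unrounded slope and intercept, so results for applicants very close to the line may differ slightly from those obtained with the rounded values −0.095 and 74.66.

```python
def line_through(p1, p2):
    """Slope and intercept of the straight line through two points."""
    (x1, y1), (x2, y2) = p1, p2
    slope = (y2 - y1) / (x2 - x1)
    intercept = y1 - slope * x1
    return slope, intercept


# The two points read from Figure 10.21: (credit score, years of credit history).
slope, intercept = line_through((763, 2), (595, 18))


def classify_by_score(credit_score, cutoff=640):
    """First rule: approve (1) only if the credit score exceeds the cutoff."""
    return 1 if credit_score > cutoff else 0


def classify_by_line(credit_score, years):
    """Second rule: reject (0) on or below the line, approve (1) above it."""
    return 0 if years <= slope * credit_score + intercept else 1


print(round(slope, 3), round(intercept, 2))
print(classify_by_score(700), classify_by_line(700, 15), classify_by_line(600, 5))
```

The slope is (18 − 2)/(595 − 763) = −16/168, and substituting the point (763, 2) gives the intercept, which is where the coefficients in the text come from.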

Although this is easy to do intuitively for only two predictor variables, it is more difficult to do when we have more predictor variables. Therefore, more-sophisticated procedures are needed as we will discuss.

Measuring Classification Performance

As we saw in the previous example, errors may occur with any classification rule, resulting in misclassification. One way to judge the effectiveness of a classification rule is to find the probability of making a misclassification error and summarize the results in a classification matrix, which shows the number of cases that were classified either correctly or incorrectly.


Figure 10.20 Chart of Credit-Approval Decisions

Figure 10.21 Alternate Credit-Approval Classification Scheme

Example 10.8  Classification Matrix for Credit-Approval Classification Rules

In the credit-approval decision example, using just the credit score to classify the applications, we see that in two cases, applicants with credit scores exceeding 640 were rejected, out of a total of 50 data points. Table 10.1 shows a classification matrix for the credit score rule in Figure 10.20. The off-diagonal elements are the frequencies of misclassification, whereas the diagonal elements are the numbers that were correctly classified. Therefore, the probability of misclassification was 2/50, or 0.04. We leave it as an exercise for you to develop a classification matrix for the second rule.


Table 10.1 Classification Matrix for Credit Score Rule

                          Predicted Classification
Actual Classification     Decision = 1    Decision = 0
Decision = 1                   23               2
Decision = 0                    0              25
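A classification matrix is a table of counts keyed by the actual and predicted classes, and the probability of misclassification is the share of off-diagonal counts. The Python sketch below uses the counts reported in Table 10.1; the function names and the (actual, predicted) dictionary layout are assumptions made for this illustration.

```python
def classification_matrix(actual, predicted, classes=(1, 0)):
    """Count the cases for every (actual, predicted) combination."""
    counts = {(a, p): 0 for a in classes for p in classes}
    for a, p in zip(actual, predicted):
        counts[(a, p)] += 1
    return counts


def misclassification_rate(counts):
    """Off-diagonal counts (actual differs from predicted) divided by the total."""
    total = sum(counts.values())
    errors = sum(n for (a, p), n in counts.items() if a != p)
    return errors / total


# Counts from Table 10.1, keyed as (actual, predicted).
table_10_1 = {(1, 1): 23, (1, 0): 2, (0, 1): 0, (0, 0): 25}

rate = misclassification_rate(table_10_1)
print(rate)  # 2 misclassified cases out of 50
```

Given lists of actual and predicted codes, for example from applying a rule to the coded Decision column, `classification_matrix` builds the same kind of table from raw records.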

Using Training and Validation Data

Most data-mining projects use large volumes of data. Before building a model, we typically partition the data into a training data set and a validation data set. Training data sets have known outcomes and are used to “teach” a data-mining algorithm. To get a more realistic estimate of how the model would perform with unseen data, you need to set aside a part of the original data into a validation data set and not use it in the training process. If you were to use the training data set to compute the accuracy of the model fit, you would get an overly optimistic estimate of the accuracy of the model. This is because the training or model-fitting process ensures that the accuracy of the model for the training data is as high as possible—the model is specifically suited to the training data.

The validation data set is often used to fine-tune models. When a model is finally chosen, its accuracy with the validation data set is still an optimistic estimate of how it would perform with unseen data. This is because the final model has come out as the winner among the competing models based on the fact that its accuracy with the validation data set is highest. Thus, data miners often set aside another portion of data, which is used neither in training nor in validation. This set is known as the test data set. The accuracy of the model on the test data gives a realistic estimate of the performance of the model on completely unseen data.

Example 10.8 Partitioning Data Sets in XLMiner To partition the data into training and validation sets in XLMiner, select Partition from the Data Mining group and then choose Standard Partition. The Standard Data Partition dialog prompts you for basic information; Figure 10.22 shows the completed dialog. The dialog first allows you to specify the data range and whether it contains headers in the Excel file, as well as the variables to include in the partition. To select a variable for the partition, click on it and then click the > button (which changes to a < button if all variables have been moved to the right pane). You may use the Ctrl key to select multiple variables. The random number seed defaults to 12345, but this can be changed. XLMiner provides three options:

1. Automatic percentages: If you select this, 60% of the total number of records in the data set are assigned randomly to the training set and the rest to the validation set. If the data set is large, then 60% will perhaps exceed the limit on number of records in the training partition. In that case, XLMiner will allocate a maximum percentage to the training set that will be just within the limits. It will then assign the remaining percentage to the validation set.

2. Specify percentages: You can specify the required partition percentages. In case of large data sets, XLMiner will suggest the maximum possible percentage to the training set, so that the training partition is within the specified limits. It will then allocate the remaining records to the validation and test sets in the proportion 60:40. You may change these and specify percentages. XLMiner will execute your specifications as long as the limits are met.

3. Equal percentages: XLMiner will divide the records equally among the training, validation, and test sets. If the data set is large, it will assign the maximum possible number of records to training, so that the number is within the specified limit for the training partition, and assign the same percentage to the validation and test sets. This means that some records may not be accommodated. So, for large data sets, specify percentages if required.

Figure 10.23 shows a portion of the output for the Credit Approval Decisions example. You may display the training data and validation data using the Output Navigator links at the top of the worksheet.

Figure 10.22 Standard Data Partition Dialog

Figure 10.23 Portion of Data Partition Output

Figure 10.24 Additional Data in the Excel File Credit Approval Decisions Coded

XLMiner provides two ways of standard partitioning: random partitioning and user-defined partitioning. Random partitioning uses simple random sampling, in which every observation in the main data set has equal probability of being selected for the partition data set. For example, if you specify 60% for the training data set, then 60% of the total observations would be randomly selected and would make up the training data set. Random partitioning uses random numbers to generate the sample. You can specify any nonnegative random number seed to generate the random sample. Using the same seed allows you to replicate the partitions exactly for different runs.
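The effect of a fixed random number seed can be sketched outside XLMiner. The function below is a hypothetical illustration using Python's standard `random` module; it is not XLMiner's sampling algorithm, so the same seed value will not reproduce XLMiner's partitions. The name `partition` and the stand-in record IDs are assumptions.

```python
import random


def partition(records, train_fraction=0.6, seed=12345):
    """Randomly split records into (training, validation) lists."""
    rng = random.Random(seed)   # private generator: the seed alone determines the split
    shuffled = list(records)    # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


records = list(range(1, 51))    # stand-in record IDs 1..50
train, valid = partition(records)
train_again, valid_again = partition(records)

print(len(train), len(valid))
print(train == train_again)     # same seed, same partition
```

Passing a different `seed` would generally give a different training set of the same size, which is why recording the seed matters when results need to be replicated.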

Classifying New Data The purpose of developing a classification model is to be able to classify new data. After a classification scheme is chosen and the best model is developed based on existing data, we use the predictor variables as inputs to the model to predict the output.

Example 10.9 Classifying New Data for Credit Decisions Using Credit Scores and Years of Credit History The Excel files Credit Approval Decisions and Credit Approval Decisions Coded include a small set of new data that we wish to classify in the worksheet Additional Data. These data are shown in Figure 10.24. If we use the simple credit-score rule from Example 10.7 that a score of more than 640 is needed to approve an application, then we would classify the decision for the first, third, and sixth records to be 1 and the rest to be 0. If we use the rule developed in Example 10.7 that includes both the credit score and years of credit history (that is, reject the application if years + 0.095 × credit score ≤ 74.66), then the decisions would be as follows:

Homeowner   Credit Score   Years of Credit History   Revolving Balance   Revolving Utilization   Years + 0.095*Credit Score   Decision
1           700            8                         \$21,000.00         15%                     74.50                        0
0           520            1                         \$4,000.00          90%                     50.40                        0
1           650            10                        \$8,500.00          25%                     71.75                        0
0           602            7                         \$16,300.00         70%                     64.19                        0
0           549            2                         \$2,500.00          90%                     54.16                        0
1           742            15                        \$16,700.00         18%                     85.49                        1

Only the last record would be approved.
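Both rules from this example can be applied to the six new records with a short script. This is a sketch only: the function names are hypothetical, and treating a value of exactly 74.66 as a rejection is an assumption about the boundary (no record in the table lies on it).

```python
# New applicants from the Additional Data worksheet:
# (credit score, years of credit history).
new_records = [(700, 8), (520, 1), (650, 10), (602, 7), (549, 2), (742, 15)]


def score_rule(credit_score):
    """Approve (1) when the credit score is more than 640."""
    return 1 if credit_score > 640 else 0


def combined_rule(credit_score, years):
    """Reject (0) when years + 0.095 * credit score does not exceed 74.66.

    Rejecting at exactly 74.66 is an assumption about the boundary case.
    """
    return 0 if years + 0.095 * credit_score <= 74.66 else 1


simple = [score_rule(score) for score, _ in new_records]
combined = [combined_rule(score, years) for score, years in new_records]

print(simple)
print(combined)
```

The first list marks the first, third, and sixth records as approved, while the second approves only the sixth, matching the Decision column of the table.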

Classification Techniques We will describe three different data-mining approaches used for classification: k-Nearest Neighbors, discriminant analysis, and logistic regression.

k-Nearest Neighbors (k-NN) The k-Nearest Neighbors (k-NN) algorithm is a classification scheme that attempts to find records in a database that are similar to one we wish to classify. Similarity is based on the “closeness” of a record to numerical predictors in the other records. In the Credit Approval Decisions database, we have the predictors Homeowner, Credit Score, Years of Credit History, Revolving Balance, and Revolving Utilization. We seek to classify the decision to approve or reject the credit application. Suppose that the values of the predictors of two records X and Y are labeled $(x_1, x_2, \ldots, x_n)$ and $(y_1, y_2, \ldots, y_n)$. We measure the distance between two records by the Euclidean distance in formula (10.1). Because predictors often have different scales, they are often standardized before computing the distance. Suppose we have a record X that we want to classify. The nearest neighbor to that record in the training data set is the one that has the smallest distance from it. The 1-NN rule then classifies record X in the same category as its nearest neighbor. We can extend this idea to a k-NN rule by finding the k nearest neighbors in the training data set to each record we want to classify and then assigning the classification of the majority of those k neighbors. The choice of k is somewhat arbitrary. If k is too small, the classification of a record is very sensitive to the classification of the single record to which it is closest. A larger k reduces this variability, but making k too large introduces bias into the classification decisions. For example, if k equals the number of records in the entire training data set, all records will be classified the same way. Like the smoothing constants for moving average or exponential smoothing forecasting, some experimentation is needed to find the value of k that minimizes the misclassification rate in the validation data.
XLMiner provides the ability to select a maximum value for k and evaluate the performance of the algorithm on all values of k up to the maximum specified value. Typically, values of k between 1 and 20 are used, depending on the size of the data sets, and odd numbers are often used to avoid ties in computing the majority classification of the nearest neighbors.
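The k-NN rule described above (standardize the predictors, measure Euclidean distance, take a majority vote) can be sketched in plain Python. This is a minimal illustration, not XLMiner's implementation; the training records are hypothetical and only two predictors are used to keep the example short.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def make_scaler(rows):
    """Return a function that standardizes a record using the training columns."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stdevs = [pstdev(col) or 1.0 for col in columns]  # guard against constant columns

    def scale(row):
        return tuple((v - m) / s for v, m, s in zip(row, means, stdevs))

    return scale


def knn_classify(train_x, train_y, query, k):
    """Assign the majority class among the k training records nearest to query."""
    ranked = sorted(zip(train_x, train_y), key=lambda pair: math.dist(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training records: (credit score, years of credit history) -> decision
raw_x = [(725, 20), (573, 9), (677, 11), (625, 15), (527, 12), (795, 22)]
train_y = [1, 0, 1, 0, 0, 1]

scale = make_scaler(raw_x)
train_x = [scale(row) for row in raw_x]
query = scale((700, 8))

for k in (1, 3):
    print(k, knn_classify(train_x, train_y, query, k))
```

With these made-up records the single nearest neighbor is an approved application, but two of the three nearest are rejections, so the 1-NN and 3-NN rules disagree. That is the sensitivity to k that the text describes, and an odd k avoids tied votes when there are two classes.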

Example 10.10 Classifying Credit Decisions Using the k-NN Algorithm First, partition the data in the Credit Approval Decisions Coded Excel file into training and validation data sets, as described in Example 10.8. Next, select Classify from the XLMiner Data Mining group and choose k-Nearest Neighbors. In the dialog shown in Figure 10.25, ensure that the Data source worksheet matches the name of the worksheet with the data partition, not the original data. Move the input variables (the predictor variables) and output variable (the one being classified) into the proper panes using the arrow buttons. Click on Next to proceed. In the second dialog (see Figure 10.26), we recommend checking the box Normalize input data. Normalizing the data is important to ensure that the distance measure gives equal weight to each variable; without normalization, the variable with the largest scale will dominate the measure. In the field below, enter the value of k. In the Scoring Option section, if you select Score on specified value of k as above, the output is displayed by scoring on the specified value of k. If you select Score on best k between 1 and specified value, XLMiner evaluates models for all values of k up to the maximum specified value, and scoring is done on the best of these models. In this example, we set k = 5 and evaluate all models from k = 1 to 5. We leave Prior Class Probabilities at its default selection. Leave the Step 3 dialog as is and click Finish. The output of the k-NN algorithm is displayed in a separate sheet (see Figure 10.27), and various sections of the output can be navigated using the Output Navigator bar at the top of the worksheet by clicking on the highlighted titles. The Validation error log for different k lists the percentage errors for all values of k for the training and validation data sets and selects as the best k the value for which the validation percentage error is smallest (in this case, k = 2). The scoring is performed later using this value. Of particular interest are the Training Data Scoring and Validation Data Scoring summary reports, which tally the actual and computed classifications. Correct classification counts are along the diagonal from upper left to lower right in the Classification Confusion Matrix. In this case, there were no misclassifications in the training data and two misclassifications in the validation data.

Figure 10.25 k-NN Dialog, Step 1 of 2

Figure 10.26 k-NN Dialog, Steps 2 and 3

Figure 10.27 Portion of k-NN Output

Example 10.11 Classifying New Data Using k-NN We use the Credit Approval Decisions Coded database that we used in Example 10.9 to classify the new data in the Additional Data worksheet. First, partition the data or use the data partition worksheet that was analyzed in the previous example. In Step 2 of the k-NN procedure (see Figure 10.26), normalize the input data, set the number of nearest neighbors (k) to 2, since this was the best value identified in the previous example, and choose Score on specified value of k as above. In the Step 3 dialog, click on In worksheet in the Score new data pane of the dialog. In the Match Variables in the New Range dialog, select the Additional Data worksheet in the Worksheet field and highlight the range of the new data in the Data range field, including headers. Because we use the same headers, click on Match By Name; this results in the dialog shown in Figure 10.28. Click Finish in the Step 3 dialog. In the Output Navigator, choose New Data Detail Rpt. Figure 10.29 shows the results. The first, third, and fourth records are classified as “Approved.”

Discriminant Analysis Discriminant analysis is a technique for classifying a set of observations into predefined classes. The purpose is to determine the class of an observation based on a set of predictor variables. Based on the training data set, the technique constructs a set of linear functions of the predictors, known as discriminant functions, which have the form:

$$L = b_1 X_1 + b_2 X_2 + \cdots + b_n X_n + c \tag{10.2}$$

where the bs are weights, or discriminant coefficients, the Xs are the input variables, or predictors, and c is a constant or the intercept. The weights are determined by maximizing the between-group variance relative to the within-group variance. These discriminant functions are used to predict the category of a new observation. For k categories, k discriminant functions are constructed. For a new observation, each of the k discriminant functions is evaluated, and the observation is assigned to class i if the ith discriminant function has the highest value.

Figure 10.28 Match Variables in the New Range Dialog for Scoring New Data

Figure 10.29 The k-NN Procedure Classification of New Data

Example 10.12 Classifying Credit Decisions Using Discriminant Analysis In the Credit Approval Decisions Coded database, first partition the data into training and validation sets, as described earlier. From the XLMiner options, select Discriminant Analysis from the Classify menu in the Data Mining group. The first dialog that appears is shown in Figure 10.30. Make sure the worksheet specified is the one with the data partition. Specify the input variables and the output variable. The “success” class corresponds to the outcome value that you consider a success; in this case, it is the approval of the loan, to which we assigned the value 1. The cutoff probability defaults to 0.5, and this is typically used. The second dialog is shown in Figure 10.31. The discriminant analysis procedure incorporates prior assumptions about how frequently the different classes occur. Three options are available:

1. According to relative occurrences in training data. This option assumes that the probability of encountering a particular category is the same as the frequency with which it occurs in the training data.

2. Use equal prior probabilities. This option assumes that all categories occur with equal probability.

3. User specified prior probabilities. This option is available only if the output variable has two categories. If you have information about the probabilities that an observation will belong to a particular category (regardless of the training sample), then you may specify probability values for the two categories.

Figure 10.30 Discriminant Analysis Dialog, Step 1

This dialog also allows you to specify the cost of misclassification when there are two categories. If the costs are equal for the two groups, then the method will attempt to misclassify the fewest observations across all groups. If the misclassification costs are unequal, XLMiner takes into consideration the relative costs and attempts to fit a model that minimizes the total cost of misclassification. The third dialog (Figure 10.32) allows you to specify the output options. These include some advanced statistical information and more detailed reports; check the box for the Classification Function. Figure 10.33 shows the classification (discriminant) functions for the two categories from the worksheet DA_Stored. For category 1 (approve the loan application), the discriminant function is

$$L(1) = -149.871 + 10.66073 \times \text{homeowner} + 0.355209 \times \text{credit score} + 0.858509 \times \text{years of credit history} - 0.00015 \times \text{revolving balance} + 115.9978 \times \text{revolving utilization}$$

For category 0 (reject the loan application), the discriminant function is

$$L(0) = -174.22 + 7.589715 \times \text{homeowner} + 0.364829 \times \text{credit score} + 0.54185 \times \text{years of credit history} - 0.00023 \times \text{revolving balance} + 170.6218 \times \text{revolving utilization}$$

For example, for the first record in the database,

$$L(1) = -149.871 + 10.66073 \times 1 + 0.355209 \times 725 + 0.858509 \times 20 - 0.00015 \times 11{,}320 + 115.9978 \times 0.25 = 162.7879$$

$$L(0) = -174.22 + 7.589715 \times 1 + 0.364829 \times 725 + 0.54185 \times 20 - 0.00023 \times 11{,}320 + 170.6218 \times 0.25 = 148.7596$$

Therefore, this record would be assigned to category 1. Figure 10.34 shows the scoring reports for the training and validation data sets. We see that there is an overall misclassification rate of 15%.
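The arithmetic for the first record can be reproduced with a short script. The coefficients below are copied from the two classification functions in the text; the dictionary layout and the names `discriminant_scores` and `classify` are illustrative assumptions, not XLMiner output.

```python
# Coefficients as reported in the text for the two classification functions.
# Order: constant, homeowner, credit score, years of credit history,
#        revolving balance, revolving utilization.
COEFFICIENTS = {
    1: (-149.871, 10.66073, 0.355209, 0.858509, -0.00015, 115.9978),
    0: (-174.22, 7.589715, 0.364829, 0.54185, -0.00023, 170.6218),
}


def discriminant_scores(record):
    """Evaluate every classification function for one record of predictor values."""
    return {
        category: coefs[0] + sum(b * x for b, x in zip(coefs[1:], record))
        for category, coefs in COEFFICIENTS.items()
    }


def classify(record):
    """Assign the category whose classification function is largest."""
    scores = discriminant_scores(record)
    return max(scores, key=scores.get)


# First record: homeowner = 1, score 725, 20 years, balance 11,320, utilization 0.25
first_record = (1, 725, 20, 11320, 0.25)
scores = discriminant_scores(first_record)
print(round(scores[1], 4), round(scores[0], 4), classify(first_record))
```

Because the record goes to whichever function scores highest, the same `classify` call would extend to more than two categories if further coefficient rows were added.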

Figure 10.31 Discriminant Analysis Dialog, Step 2

Figure 10.32 Discriminant Analysis Dialog, Step 3

Figure 10.33 Discriminant Analysis Results—Classification Function Data

Figure 10.34 Discriminant Analysis Results—Training and Validation Data

Example 10.13 Using Discriminant Analysis to Classify New Data We will use the Credit Approval Decisions Coded database that we introduced earlier to classify the new data. Follow the same process as in Example 10.12. However, in the dialog for Step 3 (see Figure 10.32), click on Detailed report in the Score new data in Worksheet pane of the dialog. The same dialog, Match Variables in the New Range, which we saw in Example 10.11, appears (see Figure 10.28). Select the Additional Data worksheet in the Worksheet field and highlight the range of the new data in the Data range field, including headers. Because we use the same headers, click on Match By Name. Click OK and then click Finish in the Step 3 dialog. XLMiner creates a new worksheet labeled DA_NewScore, shown in Figure 10.35, that provides the predicted classification for each new record. Records 1, 3, and 6 are assigned to category 1 (approve the application) and the remaining records are assigned to category 0 (reject the application).

Figure 10.35 Discriminant Analysis Classification of New Data

Like many statistical procedures, discriminant analysis requires certain assumptions, such as normality of the independent variables as well as other assumptions, to apply properly. The normality assumption is often violated in practice, but the method is generally robust to violations of the assumptions. The next technique, called logistic regression, does not rely on these assumptions, making it preferred by many analytics practitioners.

Logistic Regression In Chapter 8, we studied linear regression, in which the dependent variable is continuous and numerical. Logistic regression is a variation of ordinary regression in which the dependent variable is categorical. The independent variables may be continuous or categorical, as in the case of ordinary linear regression. However, whereas multiple linear regression seeks to predict the numerical value of the dependent variable Y based on the values of the independent variables, logistic regression seeks to predict the probability that the output variable will fall into a category based on the values of the independent (predictor) variables. This probability is used to classify an observation into a category. Logistic regression is generally used when the dependent variable is binary; that is, it takes on two values, 0 or 1, as in the credit-approval decision example that we have been using, in which Y = 1 if the loan is approved and Y = 0 if it is rejected. This situation is common in business, such as when we wish to classify customers as buyers or nonbuyers or credit-card transactions as fraudulent or not. To classify an observation using logistic regression, we first estimate the probability p that it belongs to category 1, P(Y = 1), and, consequently, the probability 1 − p that it belongs to category 0, P(Y = 0). Then we use a cutoff value, typically 0.5, with which to compare p and classify the observation into one of the two categories. For instance, if p > 0.5, the observation would be classified into category 1; otherwise it would be classified into category 0. You may recall from Chapter 8 that a multiple linear regression model has the form $Y = b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_k X_k$. In logistic regression, we use a different dependent variable, called the logit, which is the natural logarithm of p/(1 − p). Thus, the form of a logistic regression model is

$$\ln\left(\frac{p}{1 - p}\right) = b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_k X_k \tag{10.3}$$

where p is the probability that the dependent variable Y = 1, and $X_1, X_2, \ldots, X_k$ are the independent variables (predictors). The parameters $b_0, b_1, b_2, \ldots, b_k$ are the unknown regression coefficients, which have to be estimated from the data. The ratio p/(1 − p) is called the odds of belonging to category 1 (Y = 1). This is a common notion in gambling. For example, if the probability of winning a game is p = 0.2, then 1 − p = 0.8, so the odds of winning are 0.2/0.8 = 1/4, or one to four. That is, you would win once for every four times you would lose, on average. The logit is continuous over the range from −∞ to +∞ and, from equation (10.3), is a linear function of the predictor variables. The values of the logit are then transformed into probabilities by a logistic function:

$$p = \frac{1}{1 + e^{-(b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_k X_k)}} \tag{10.4}$$
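The relationship between probabilities, odds, the logit, and the logistic function can be illustrated numerically. The sketch below uses only Python's standard `math` module; the function names are chosen here for illustration, and no coefficients are estimated.

```python
import math


def logistic(logit_value):
    """Convert a logit (log-odds) into a probability, as in equation (10.4)."""
    return 1.0 / (1.0 + math.exp(-logit_value))


def logit(p):
    """Natural log of the odds p / (1 - p), the left side of equation (10.3)."""
    return math.log(p / (1.0 - p))


def classify(p, cutoff=0.5):
    """Category 1 when the estimated probability exceeds the cutoff, else 0."""
    return 1 if p > cutoff else 0


odds = 0.2 / (1 - 0.2)              # the gambling example: probability 0.2
round_trip = logistic(logit(0.2))   # the two functions undo each other

print(odds, round_trip, classify(round_trip))
print(logistic(0.0))                # a logit of 0 corresponds to even odds
```

In a fitted model the argument passed to `logistic` would be the linear combination of the estimated coefficients and a record's predictor values.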

Example 10.14 Classifying Credit Approval Decisions Using Logistic Regression In the Credit Approval Decisions Coded database, first partition the data into training and validation sets. In XLMiner, select Logistic Regression from the Classify menu in the Data Mining group. The dialog shown in Figure 10.36 appears, where you need to specify the data range, the input variables, and the output variable. The “success” class corresponds to the outcome value that you consider a success; in this case, it is the approval of the loan, to which we assigned the value 1. The second logistic regression dialog is shown in Figure 10.37. You can choose to force the constant term to zero and omit it from the regression. You can also change the confidence level for the confidence intervals displayed in the results for the odds ratio. Typically this is set to 95%. The Advanced button allows you to change or select some additional options; for our purposes we leave these alone. The Variable Selection button allows XLMiner to evaluate all possible models with subsets of the independent variables. This is useful in choosing models that eliminate insignificant independent variables. Figure 10.38 shows the dialog. Several options are available for the selection procedure that the algorithm uses to choose the variables in the models:

• Backward elimination: Variables are eliminated one at a time, starting with the least significant.
• Forward selection: Variables are added one at a time, starting with the most significant.
• Exhaustive search: All combinations of variables are searched for the best fit (can be quite time consuming, depending on the number of variables).
• Sequential replacement: For a given number of variables, variables are sequentially replaced and replacements that improve performance are retained.
• Stepwise selection: Like forward selection, but at each stage, variables can be dropped or added.

Each option may yield different results, so it is usually wise to experiment with the different options. For our purposes, we will use the default values in this dialog. Figure 10.39 shows the third dialog. Check the appropriate options. For simple problems, the summary reports for scoring the training and validation data will suffice. The logistic regression output is displayed on a new worksheet, and you can use the Output Navigator links to display different sections of the worksheet. Figure 10.40 shows the regression model and best subsets output. The output contains the beta coefficients, their standard errors, the p-value, the odds ratio for each variable (which is simply $e^x$, where x is the value of the coefficient), and the confidence interval for the odds. Summary statistics to the right show the residual degrees of freedom (number of observations − number of predictors), a standard deviation-type measure (Residual Dev.) for the model (which typically has a chi-square distribution), the percentage of successes (1s) in the training data, the number of iterations required to fit the model, and the multiple R-squared value. If we select the best subsets option, then XLMiner shows the best regression model (see Figure 10.40). The coefficients are the betas in Equation 10.3. The choice of the best model depends on the calculated values of various error measures and the probability.

RSS is the residual sum of squares, or the sum of squared deviations between the predicted probability of success and the actual value (1 or 0). Cp is a measure of the error in the best subset model, relative to the error incorporating all variables. Adequate models are those for which Cp is roughly equal to the number of parameters in the model (including the constant), and/or Cp is at a minimum. Probability is a quasi-hypothesis test of the proposition that a given subset is acceptable; if Probability < 0.05, we can rule out that subset. The training and validation summary reports are shown in Figure 10.41. We see that all cases were classified correctly for the training data, and there was an overall error rate of 15% for the validation data.

Figure 10.36 Logistic Regression Dialog, Step 1

Figure 10.37 Logistic Regression Dialog, Step 2

Figure 10.38 Logistic Regression Best Subset Variable Selection Dialog

Figure 10.39 Logistic Regression Dialog, Step 3

Figure 10.40 Logistic Regression Model and Best Subsets Output

Figure 10.41 Logistic Regression Training and Validation Data Summaries

Example 10.15 Using Logistic Regression to Classify New Data We use the Credit Approval Decisions Coded database that contains the new data. First, partition the data or use the existing data partition worksheet that was analyzed in the previous example. In Step 3 of the logistic regression procedure (see Figure 10.39), click on In worksheet in the Score new data pane of the dialog. The information in the Match Variables in the New Range dialog should be the same as in previous examples (see Figure 10.28). After you return to the Step 3 dialog, click Finish. XLMiner creates a new worksheet labeled LR_NewScore, shown in Figure 10.42, that provides the predicted classification for each new record.

Figure 10.42 Logistic Regression Classification of New Data

Association Rule Mining Association rule mining, often called affinity analysis, seeks to uncover interesting associations and/or correlation relationships among large sets of data. Association rules identify attributes that occur frequently together in a given data set. A typical and widely used example of association rule mining is market basket analysis. For example, supermarkets routinely collect data using bar-code scanners. Each record lists all items bought by a customer for a single-purchase transaction. Such databases consist of a large number of transaction records. Managers would be interested to know if certain groups of items are consistently purchased together. They could use these data for adjusting store layouts (placing items optimally with respect to each other), for cross-selling, for promotions, for catalog design, and to identify customer segments based on buying patterns. Association rule mining is how companies such as Netflix and Amazon.com make recommendations based on past movie rentals or item purchases, for example.

Example 10.16 Custom Computer Configuration Figure 10.43 shows a portion of the Excel file PC Purchase Data. The data represent the configurations for a small number of orders of laptops placed over the Web. The main options from which customers can choose are the type of processor, screen size, memory, and hard drive. A “1” signifies that a customer selected a particular option. If the manufacturer can better understand what types of components are often ordered together, it can speed up final assembly by having partially completed laptops with the most popular combinations of components configured prior to order, thereby reducing delivery time and improving customer satisfaction.

Figure 10.43 Portion of the Excel File PC Purchase Data

Association rules provide information in the form of if-then statements. These rules are computed from the data but, unlike the if-then rules of logic, association rules are probabilistic in nature. In association analysis, the antecedent (the “if” part) and consequent (the “then” part) are sets of items (called item sets) that are disjoint (do not have any items in common). To measure the strength of association, an association rule has two numbers that express the degree of uncertainty about the rule. The first number is called the support for the (association) rule. The support is simply the number of transactions that include all items in the antecedent and consequent parts of the rule. (The support is sometimes expressed as a percentage of the total number of records in the database.) One way to think of support is that it is the probability that a randomly selected transaction from the database will contain all items in the antecedent and the consequent. The second number is the confidence of the (association) rule. Confidence is the ratio of the number of transactions that include all items in the consequent as well as the antecedent (namely, the support) to the number of transactions that include all items in the antecedent. The confidence is the conditional probability that a randomly selected transaction will include all the items in the consequent given that the transaction includes all the items in the antecedent:

$$\text{confidence} = P(\text{consequent} \mid \text{antecedent}) = \frac{P(\text{antecedent and consequent})}{P(\text{antecedent})} \tag{10.5}$$

The higher the confidence, the more confident we are that the association rule provides useful information. Another measure of the strength of an association rule is lift, which is defined as the ratio of confidence to expected confidence. Expected confidence is the number of transactions that include the consequent divided by the total number of transactions. Expected confidence assumes independence between the consequent and the antecedent. Lift provides information about the increase in probability of the then (consequent) given the if (antecedent) part. The higher the lift ratio, the stronger the association rule; a value greater than 1.0 is usually a good minimum.

Example 10.17 Measuring Strength of Association Suppose that a supermarket database has 100,000 point-of-sale transactions, out of which 2,000 include both items A and B and 800 of these include item C. The association rule “If A and B are purchased, then C is also purchased” has a support of 800 transactions (alternatively, 0.8% = 800/100,000) and a confidence of 40% (= 800/2,000). Suppose the total number of transactions that include C is 5,000. Then expected confidence is 5,000/100,000 = 5%, and lift = confidence/expected confidence = 40%/5% = 8.
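The three measures in this example follow directly from four counts, as the sketch below shows. The function name `rule_metrics` and its parameter names are illustrative assumptions.

```python
def rule_metrics(n_total, n_antecedent, n_both, n_consequent):
    """Return (support, confidence, expected confidence, lift) for one rule.

    n_antecedent: transactions containing every antecedent item
    n_both:       transactions containing antecedent and consequent items
    n_consequent: transactions containing every consequent item
    """
    support = n_both / n_total
    confidence = n_both / n_antecedent
    expected_confidence = n_consequent / n_total
    lift = confidence / expected_confidence
    return support, confidence, expected_confidence, lift


# Counts from Example 10.17: "If A and B are purchased, then C is also purchased."
support, confidence, expected, lift = rule_metrics(
    n_total=100_000, n_antecedent=2_000, n_both=800, n_consequent=5_000
)
print(f"support={support:.1%} confidence={confidence:.0%} "
      f"expected={expected:.0%} lift={lift:.1f}")
```

A lift near 1 would mean the antecedent tells us little about the consequent; here the consequent is about eight times as likely when A and B are both in the basket.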

We next illustrate how XLMiner is used for the PC purchase data.

Example 10.18 Identifying Association Rules for PC Purchase Data In XLMiner, select Association Rules from the Associate menu in the Data Mining group. In the dialog shown in Figure 10.44, specify the data range to be processed, the input data format desired, and the minimum support and confidence that a rule must have to be reported. Two input options are available:

1. Data in binary matrix format: Choose this option if each column in the data represents a distinct item and the data are expressed as 0s and 1s. All nonzeros are treated as 1s. A 0 under a variable name means that item is absent in that transaction, and a 1 means it is present.

2. Data in item list format: Choose this option if each row of data consists of item codes or names that are present in that transaction.

In the Parameters pane, specify the minimum number of transactions in which a particular item set must appear for it to qualify for inclusion in an association rule in the Minimum support (# transactions) field. For a small data set, as in this example, we set this number to be 5. In the Minimum confidence (%) field, specify the minimum confidence threshold for rule generation. If this is set too high, the algorithm might not find any association rules; low values will result in many rules, which may be difficult to interpret. We selected 80%. Figure 10.45 shows the results. Rule 1 states that if a customer purchased an Intel Core i7 processor and 4 GB of memory, then a 12-inch screen was also purchased. This particular rule has a confidence of 83.33%, meaning that of the people who bought a Core i7 processor and 4 GB of memory, 83.33% also bought a 12-inch screen. The value in the column Support for A indicates that the antecedent has a support of 6 transactions, meaning that 6 customers bought a Core i7 processor with 4 GB of memory. The value in the column Support for C indicates the total number of transactions that include the consequent (here, a 12-inch screen). The value in the column Support (a & c) is the number of transactions in which a 12-inch screen, Intel Core i7, and 4 GB of memory were ordered. The value in the Lift Ratio column indicates how much more likely we are to encounter a 12-inch screen transaction if we consider just those transactions where an Intel Core i7 and 4 GB of memory are purchased, as compared to the entire population of transactions.

Figure 10.44 Association Rule Dialog
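To make the column meanings concrete, the sketch below computes the same quantities for one rule from data in binary matrix format. The eight orders are hypothetical (they are not the PC Purchase Data), and the helper names are assumptions; XLMiner's rule-generation algorithm, which searches over many candidate rules, is not reproduced here.

```python
# Hypothetical orders in binary matrix format: one row per transaction,
# one column per option (1 = selected, 0 = not selected).
ITEMS = ("Core i7", "4 GB", "12 inch screen")
orders = [
    (1, 1, 1),
    (1, 1, 1),
    (1, 1, 0),
    (0, 1, 0),
    (1, 0, 0),
    (0, 0, 1),
    (0, 0, 0),
    (0, 1, 0),
]


def count_with(rows, item_names):
    """Number of transactions that contain every named item."""
    columns = [ITEMS.index(name) for name in item_names]
    return sum(all(row[c] for c in columns) for row in rows)


def evaluate_rule(rows, antecedent, consequent):
    """Return Support for A, Support for C, Support (a & c), confidence, lift."""
    support_a = count_with(rows, antecedent)
    support_c = count_with(rows, consequent)
    support_ac = count_with(rows, antecedent + consequent)
    confidence = support_ac / support_a          # assumes the antecedent occurs at least once
    lift = confidence / (support_c / len(rows))  # confidence relative to expected confidence
    return support_a, support_c, support_ac, confidence, lift


result = evaluate_rule(orders, ("Core i7", "4 GB"), ("12 inch screen",))
print(result)
```

A rule would be reported only if its Support (a & c) and confidence meet the minimums entered in the Parameters pane.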

Chapter 10  Introduction to Data Mining


Figure 10.45  Association Results for PC Purchase Data

Cause-and-Effect Modeling

Managers are always interested in results, such as profit, customer satisfaction and retention, production yield, and so on. Lagging measures, or outcomes, tell what has happened and are often external business results, such as profit, market share, or customer satisfaction. Leading measures (performance drivers) predict what will happen and usually are internal metrics, such as employee satisfaction, productivity, turnover, and so on. For example, customer satisfaction results in regard to sales or service transactions would be a lagging measure; employee satisfaction, sales representative behavior, billing accuracy, and so on, would be examples of leading measures that might influence customer satisfaction. If employees are not satisfied, their behavior toward customers could be negatively affected, and customer satisfaction could be low. If this can be explained using business analytics, managers can take steps to improve employee satisfaction, leading to improved customer satisfaction. Therefore, it is important to understand what controllable factors significantly influence key business performance measures that managers cannot directly control.

Correlation analysis can help to identify these influences and lead to the development of cause-and-effect models that can help managers make better decisions today that will influence results tomorrow. Recall from Chapter 4 that correlation is a measure of the linear relationship between two variables. High values of the correlation coefficient indicate strong relationships between the variables. The following example shows how correlation can be useful in cause-and-effect modeling.


Example 10.19  Using Correlation for Cause-and-Effect Modeling

The Excel file Ten Year Survey shows the results of 40 quarterly surveys conducted by a major electronics device manufacturer, a portion of which is shown in Figure 10.46 (footnote 6). The data provide average scores on a 1–5 scale for customer satisfaction, overall employee satisfaction, employee job satisfaction, employee satisfaction with their supervisor, and employee perception of training and skill improvement. Figure 10.47 shows the correlation matrix. All the correlations except the one between job satisfaction and customer satisfaction are relatively strong, with the highest correlations between overall employee satisfaction and employee job satisfaction, employee satisfaction with their supervisor, and employee perception of training and skill improvement. Although correlation analysis does not prove any cause and effect, we can logically infer that a cause-and-effect relationship exists. The data indicate that customer satisfaction, the key external business result, is strongly influenced by internal factors that drive employee satisfaction. Logically, we could propose the model shown in Figure 10.48. This suggests that if managers want to improve customer satisfaction, they need to start by ensuring good relations between supervisors and their employees and focus on improving training and skills.
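As a rough illustration of how a correlation matrix like the one in this example is computed, the following Python sketch calculates pairwise Pearson correlation coefficients. The scores below are hypothetical and are not the Ten Year Survey data; the function name `pearson` and the variable labels are likewise assumptions made for this sketch.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical quarterly average scores on a 1-5 scale (NOT the book's data).
scores = {
    "Customer satisfaction": [3.9, 4.0, 4.1, 4.3, 4.2, 4.4],
    "Employee satisfaction": [3.5, 3.6, 3.8, 4.0, 3.9, 4.1],
    "Satisfaction with supervisor": [3.2, 3.4, 3.5, 3.8, 3.6, 3.9],
}

names = list(scores)
# Correlation matrix stored as a dictionary keyed by (row, column) labels.
corr = {(a, b): pearson(scores[a], scores[b]) for a in names for b in names}

for a in names:
    print(a.ljust(30), "  ".join(f"{corr[(a, b)]:6.3f}" for b in names))
```

Strong correlations identified this way only suggest candidate links for a cause-and-effect model; the direction of each link still has to come from business logic.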

Figure 10.46 Portion of Ten Year Survey Data

Figure 10.47 Correlation Matrix of Ten Year Survey Data

6 Based on a description of a real application by Steven H. Hoisington and Tse-His Huang, “Customer Satisfaction and Market Share: An Empirical Case Study of IBM’s AS/400 Division,” in Earl Naumann and Steven H. Hoisington (eds.) Customer-Centered Six Sigma (Milwaukee, WI: ASQ Quality Press, 2001). The data used in this example are fictitious, however.


Figure 10.48 Cause-and-Effect Model (Satisfaction with Supervisor, Job Satisfaction, and Training and Skill Improvement influence Employee Satisfaction, which in turn influences Customer Satisfaction)

Analytics in Practice: Successful Business Applications of Data Mining (footnote 7)

A wide range of companies have deployed data mining successfully. Although early adopters of this technology have tended to be in information-intensive industries such as financial services and direct-mail marketing, data mining is applicable to any company looking to leverage a large data warehouse to better manage its customer relationships. Two critical factors for success with data mining are a large, well-integrated data warehouse and a well-defined understanding of the business process within which data mining is to be applied (such as customer prospecting, retention, campaign management, and so on). Some successful application areas of data mining include the following:

• A pharmaceutical company analyzes its recent sales force activity and uses the results to improve targeting of high-value physicians and determine which marketing activities will have the greatest impact in the near future. The results are distributed to the sales force via a wide-area network that enables the representatives to review the recommendations from the perspective of the key attributes in the decision process. The ongoing, dynamic analysis of the data warehouse allows best practices from throughout the organization to be applied in specific sales situations.
• A credit-card company leverages its vast warehouse of customer transaction data to identify customers most likely to be interested in a new credit product. Using a small test mailing, the attributes of customers with an affinity for the product are identified. Recent projects have indicated more than a 20-fold decrease in costs for targeted mailing campaigns over conventional approaches.
• A diversified transportation company with a large direct sales force uses data mining to identify the best prospects for its services. Using data mining to analyze its own customer experience, this company builds a unique segmentation identifying the attributes of high-value prospects. Applying this segmentation to a general business database such as those provided by Dun & Bradstreet can yield a prioritized list of prospects by region.
• A large consumer package goods company applies data mining to improve its sales process to retailers. Data from consumer panels, shipments, and competitor activity are used to understand the reasons for brand and store switching. Through this analysis, the manufacturer can select promotional strategies that best reach its target customer segments.

In each of these examples, companies have leveraged their knowledge about customers to reduce costs and improve the value of customer relationships. These organizations can now focus their efforts on the most important (profitable) customers and prospects and design targeted marketing strategies to best reach them.

7 Based on Kurt Thearling, “An Introduction to Data Mining,” White Paper from Thearling.com. http://www.thearling.com/text/dmwhite/dmwhite.htm.

Key Terms

Agglomerative clustering methods
Association rule mining
Average group linkage clustering
Average linkage clustering
Boxplot
Classification matrix
Cluster analysis
Complete linkage clustering
Confidence of the (association) rule
Data mining
Dendrogram
Discriminant analysis
Discriminant function
Divisive clustering methods
Euclidean distance
Hierarchical clustering
k-nearest neighbors (k-NN) algorithm
Lagging measures
Leading measures
Lift
Logistic regression
Logit
Market basket analysis
Odds
Parallel coordinates chart
Scatterplot matrix
Single linkage clustering
Support for the (association) rule
Training data set
Validation data set
Variable plot
Ward’s hierarchical clustering

Problems and Exercises

1. Use XLMiner to generate a simple random sample of 10 records from the Excel file Banking Data.
2. Use the Excel file Banking Data.
   a. Construct a boxplot for the Median Income, Median Home Value, Median Household Wealth, and Average Bank Balance.
   b. What observations can you make about these data?
3. Construct a parallel coordinates chart for Median Income, Median Home Value, Median Household Wealth, and Average Bank Balance in the Excel file Banking Data. What conclusions can you reach?
4. Construct a scatterplot matrix for Median Income, Median Home Value, Median Household Wealth, and Average Bank Balance in the Excel file Banking Data. What conclusions can you reach?
5. Construct a variable plot for all the variables in the Excel file Banking Data.
6. Compute the Euclidean distance between the following sets of points:
   a. (1.06, 9.2) and (0.89, 10.3)
   b. (1.6, 0.628, 9.077) and (2.2, 1.555, 5.088)
7. For the Excel file Pharmaceuticals, normalize each column of the numerical data (i.e., compute a Z-score for each of the values) and then compute the Euclidean distances between the following pharmaceutical companies: ABT, CHTT, and MRX.
8. For the four clusters identified in Example 10.6, find the average and standard deviations of each numerical variable for the schools in each cluster and compare them with the averages and standard deviations for the entire data set. Does the clustering show distinct differences among the clusters?
9. For the Colleges and Universities data, use XLMiner to find four clusters using each of the other clustering methods (see Figure 10.13); compare the results with Example 10.6.
10. Apply cluster analysis to the numerical data in the Excel file Credit Approval Decisions. Analyze the clusters and determine if cluster analysis would be a useful classification method for approving or rejecting loan applications.
11. Apply cluster analysis to the Excel file Sales Data, using the input variables Percent Gross Profit, Industry Code, and Competitive Rating. Create four clusters and draw conclusions about the groupings.
12. Cluster the records in the Excel file Ten Year Survey. Create up to five clusters and analyze the results to draw conclusions about the survey.
13. Use the k-NN algorithm to classify the new data in the Excel file Mortgage Defaulters Additional using only credit score and value of loan as input variables.
14. Use discriminant analysis to classify the new data in the Excel file Credit Approval Decisions Coded using only credit score and years of credit history as input variables.
15. Use logistic regression to classify the new data in the Excel file Credit Approval Decisions Coded using only credit score and years of credit history as input variables.
16. The Excel file Credit Risk Data provides a database of information about loan applications along with a classification of credit risk in column L. Convert the categorical data into numerical codes as appropriate. Sample 200 records from the data set. Then apply the k-NN algorithm to classify training and validation data sets and the additional data in the file. Summarize your findings.
17. The Excel file Credit Risk Data provides a database of information about loan applications along with a classification of credit risk in column L. Convert the categorical data into numerical codes as appropriate. Sample 200 records from the data set. Then apply discriminant analysis to classify training and validation data sets and the new data in the file. Summarize your findings.
18. The Excel file Credit Risk Data provides a database of information about loan applications, along with a classification of credit risk in column L. Convert the categorical data into numerical codes as appropriate. Then apply logistic regression to classify training and validation data sets and the new data in the file. Summarize your findings.
19. For the PC Purchase Data, identify association rules with the following input parameters for the XLMiner association rules procedure:
   a. support = 3; confidence = 90
   b. support = 7; confidence = 90
   c. support = 3; confidence = 70
   d. support = 7; confidence = 70
   Compare your results with those in Example 10.18.
20. The Excel file Cosmetics Data provides data on purchases of different cosmetic items at a large chain store. Develop a market basket analysis using the XLMiner association rules procedure with the input parameters support = 35 and confidence = 80.
21. The Excel file Myatt Steak House provides 5 years of data on key business results for a restaurant. Identify the leading and lagging measures, find the correlation matrix, and propose a cause-and-effect model using the strongest correlations.


Case: Performance Lawn Equipment

The worksheet Purchasing Survey in the Performance Lawn Care database provides data related to predicting the level of business (Usage Level) obtained from a third-party survey of purchasing managers of customers of Performance Lawn Care (footnote 8). The seven PLE attributes rated by each respondent are

1. Delivery speed: the amount of time it takes to deliver the product once an order is confirmed
2. Price level: the perceived level of price charged by PLE
3. Price flexibility: the perceived willingness of PLE representatives to negotiate price on all types of purchases
4. Manufacturing image: the overall image of the manufacturer
5. Overall service: the overall level of service necessary for maintaining a satisfactory relationship between PLE and the purchaser
6. Sales force image: the overall image of PLE’s sales force
7. Product quality: perceived level of quality

Responses to these seven variables were obtained using a graphic rating scale, where a 10-centimeter line was drawn between endpoints labeled “poor” and “excellent.” Respondents indicated their perceptions using a mark on the line, which was measured from the left endpoint. The result was a scale from 0 to 10 rounded to one decimal place.

Two measures were obtained that reflected the outcomes of the respondent’s purchase relationships with PLE:

• Usage level: how much of the firm’s total product is purchased from PLE, measured on a 100-point scale, ranging from 0% to 100%
• Satisfaction level: how satisfied the purchaser is with past purchases from PLE, measured on the same graphic rating scale as perceptions 1 through 7

The data also include four characteristics of the responding firms:

• Size of firm: size relative to others in this market (0 = small; 1 = large)
• Purchasing structure: the purchasing method used in a particular company (1 = centralized procurement, 0 = decentralized procurement)
• Industry: the industry classification of the purchaser [1 = retail (resale such as Home Depot), 0 = private (nonresale, such as a landscaper)]
• Buying type: a variable that has three categories (1 = new purchase, 2 = modified rebuy, 3 = straight rebuy)

Elizabeth Burke would like to understand what she learned from these data. Apply appropriate data-mining techniques to analyze the data. For example, can PLE segment customers into groups with similar perceptions about the company? Can cause-and-effect models provide insight about the drivers of satisfaction and usage level? Summarize your results in a report to Ms. Burke.

8 The data and description of this case are based on the HATCO example on pages 28–29 in Joseph F. Hair, Jr., Rolph E. Anderson, Ronald L. Tatham, and William C. Black, Multivariate Data Analysis, 5th ed. (Upper Saddle River, NJ: Prentice Hall, 1998).

Chapter 11


Learning Objectives

After studying this chapter, you will be able to:

• Explain how to use simple mathematics and influence diagrams to help develop predictive analytic models.
• Apply principles of spreadsheet engineering to designing and implementing spreadsheet models.
• Use Excel features and spreadsheet engineering to ensure the quality of spreadsheet models.
• Develop and implement analytic models for multiple-time-period problems.
• Describe the newsvendor problem and implement it on a spreadsheet.
• Describe how overbooking decisions can be modeled on spreadsheets.
• Explain how model validity can be assessed.
• Perform what-if analysis on spreadsheet models.
• Construct one- and two-way data tables.
• Use data tables to analyze uncertainty in decision models.
• Use the Excel Scenario Manager to evaluate different model scenarios.
• Apply the Excel Goal Seek tool for break-even analysis and other types of models.
• Create data tables and tornado charts using Analytic Solver Platform.
• Use Excel tools to create user-friendly Excel models and applications.


Chapter 11  Spreadsheet Modeling and Analysis

The late management and quality guru Dr. W. Edwards Deming once stated that all management is prediction. What he was implying is that when managers make decisions, they do so with an eye to the future and essentially are predicting that their decisions will achieve certain results. Predictive modeling is the heart and soul of business analytics. We introduced the concept of a decision model in Figure 1.7 in Chapter 1. Decision models transform inputs (data, uncontrollable variables, and decision variables) into outputs, or measures of performance or behavior. When we build a decision model, we are essentially predicting what outputs will occur based on the model inputs. The model itself is simply a set of assumptions that characterize the relationships between the inputs and the outputs. For instance, in Examples 1.9 and 1.10, we presented two different models for predicting demand as a function of price, each based on different assumptions. The first model assumes that demand is a linear function of price, whereas the second assumes a nonlinear price-elasticity relationship. Which model more accurately predicts demand can be verified only by observing data in the future. Since the future is unknown, the choice of the model must be driven either by sound logic and experience or the analysis of historical data that may be available. These are the two basic approaches that we develop in this chapter. We also describe approaches for analyzing models to evaluate future scenarios and ask what-if types of questions to facilitate better business decisions.

Strategies for Predictive Decision Modeling

Building decision models is more of an art than a science. Creating good decision models requires a solid understanding of basic business principles in all functional areas, such as accounting, finance, marketing, and operations, knowledge of business practice and research, and logical skills. Models often evolve from simple to complex and from deterministic to stochastic (see the definitions in Chapter 1), so it is generally best to start simple and enrich models as necessary.

Building Models Using Simple Mathematics

Sometimes a simple “back-of-the-envelope” calculation can help managers make better decisions and lead to the development of useful models.

Example 11.1  The Economic Value of a Customer

Few companies take the time to estimate the value of a good customer (and often spend little effort to keep one). Suppose that a customer at a restaurant spends, on average, \$50 per visit and comes six times each year. Assuming that the restaurant realizes a 40% margin on the average bill for food and drinks, then their gross profit would be (\$50)(6)(0.40) = \$120. If 30% of customers do not return each year, then the average lifetime of a customer is 1/0.3 = 3.33 years. Therefore, the average nondiscounted gross profit during a customer’s lifetime is \$120(3.33) = \$400.


Although this example calculated the economic value of a customer for one particular scenario, what we’ve really done is to set the stage for constructing a general decision model. Suppose we define the following variables:

R = revenue per purchase
F = purchase frequency in number per year (e.g., if a customer purchases once every 2 years, then F = 1/2 = 0.5)
M = gross profit margin (expressed as a fraction)
D = defection rate (fraction of customers defecting each year)

Then, the value of a loyal customer, V, would be

$$V = \frac{R \times F \times M}{D} \tag{11.1}$$

In the previous example, R = \$50, F = 6, M = 0.4, and D = 0.3. We can use this model to evaluate different scenarios systematically.
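One way to evaluate scenarios systematically is to express the customer value model as a small function. The Python sketch below is an illustration under our own naming assumptions (`customer_value` and its parameter names are not from the text); the alternative defection rates in the loop are hypothetical what-if values.

```python
def customer_value(revenue_per_purchase, purchases_per_year, margin, defection_rate):
    """Value of a loyal customer: V = (R * F * M) / D."""
    if not 0 < defection_rate <= 1:
        raise ValueError("defection_rate must be a fraction in (0, 1]")
    return revenue_per_purchase * purchases_per_year * margin / defection_rate


# Base case from Example 11.1: R = $50, F = 6, M = 0.4, D = 0.3.
base = customer_value(50, 6, 0.4, 0.3)
print(f"Base case value: ${base:,.2f}")

# Hypothetical what-if scenarios: vary the defection rate only.
for d in (0.2, 0.3, 0.4):
    print(f"D = {d:.1f} -> V = ${customer_value(50, 6, 0.4, d):,.2f}")
```

Because the defection rate appears in the denominator, small improvements in retention have a disproportionately large effect on customer value.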

Building Models Using Influence Diagrams

Although it can be easy to develop a model from simple numerical calculations, as we illustrated in the previous example, most model development requires a more formal approach. Influence diagrams were introduced in Chapter 1, and are a logical and visual representation of key model relationships, which can be used as a basis for developing a mathematical decision model.

Example 11.2  Developing a Decision Model Using an Influence Diagram

We will develop a decision model for predicting profit in the face of uncertain future demand. To help develop the model, we use the influence-diagram approach. We all know that profit = revenue − cost. Using a little “Business 101” logic, revenue depends on the unit price and the quantity sold, and cost depends on the unit cost, quantity produced, and fixed costs of production. However, if demand is uncertain, then the amount produced may be less than or greater than the actual demand. Thus, the quantity sold depends on both the demand and the quantity produced. Putting these facts together, we can build the influence diagram shown in Figure 11.1. The next step is to translate the influence diagram into a more formal model. Define

P = profit
R = revenue
C = cost
p = unit price
c = unit cost
F = fixed cost
S = quantity sold
Q = quantity produced
D = demand

First, note that cost consists of the fixed cost (F) plus the variable cost of producing Q units (cQ):

$$C = F + cQ$$

Next, revenue equals the unit price (p) multiplied by the quantity sold (S):

$$R = pS$$

The quantity sold, however, must be the smaller of the demand (D) and the quantity produced (Q), or

$$S = \min\{D, Q\}$$

Therefore, R = pS = p × min{D, Q}. Substituting these results into the basic formula for profit P = R − C, we have

$$P = p \times \min\{D, Q\} - (F + cQ) \tag{11.2}$$


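Figure 11.1 An Influence Diagram for Profit (Profit depends on Revenue and Cost; Revenue depends on Unit Price and Quantity Sold; Cost depends on Unit Cost, Quantity Produced, and Fixed Cost; Quantity Sold depends on Quantity Produced and Demand)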

Implementing Models on Spreadsheets

We may creatively apply various Excel tools and capabilities to improve the structure and use of spreadsheet models. In this section, we discuss approaches for developing good, useful, and correct spreadsheet models. Good spreadsheet analytic applications should also be user-friendly; that is, it should be easy to input or change data and see key results, particularly for users who may not be as proficient in using spreadsheets. Good design reduces the potential for errors and misinterpretation of information, leading to more insightful decisions and better results.

Spreadsheet Design

In Chapter 1, Example 1.7, we developed a simple decision model for a break-even analysis situation. Recall that the scenario involves a manufacturer who can produce a part for \$125/unit with a fixed cost of \$50,000. The alternative is to outsource production to a supplier at a unit cost of \$175. We developed mathematical models for the total manufacturing cost and the total cost of outsourcing as a function of the production volume, Q:

TC (manufacturing) = \$50,000 + \$125 × Q
TC (outsourcing) = \$175 × Q

Example 11.3  A Spreadsheet Model for the Outsourcing Decision

Figure 11.2 shows a spreadsheet for implementing the outsourcing decision model (Excel file Outsourcing Decision Model). The input data consist of the costs associated with manufacturing the product in-house or purchasing it from an outside supplier and the production volume. The model calculates the total cost for manufacturing and outsourcing. The key outputs in the model are the difference in these costs and the decision that results in the lowest cost. The data are clearly separated from the model component of the spreadsheet. Observe how the IF function is used in cell B20 to identify the best decision. If the cost difference is negative or zero, then the function returns “Manufacture” as the best decision; otherwise it returns “Outsource.” Also observe the correspondence between the spreadsheet formulas and the mathematical model:

TC (manufacturing) = \$50,000 + \$125 × Q = B6 + B7*B12
TC (outsourcing) = \$175 × Q = B12*B10

Thus, if you can write a spreadsheet formula, you can develop a mathematical model by substituting symbols or numbers into the Excel formulas.
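The same logic can be sketched outside a spreadsheet. The Python snippet below mirrors the structure described above, with inputs kept separate from the model calculations and a conditional that plays the role of the IF function. The names are our own, and the production volume of 1,500 units is a hypothetical input, since the value used in Figure 11.2 is not given in the text.

```python
# Inputs (kept separate from the model, as in the spreadsheet design).
FIXED_COST = 50_000           # fixed cost of manufacturing in-house
UNIT_COST_MANUFACTURE = 125   # unit cost if manufactured
UNIT_COST_OUTSOURCE = 175     # unit cost if purchased from the supplier


def outsourcing_decision(volume):
    """Return (TC manufacturing, TC outsourcing, cost difference, decision)."""
    tc_manufacturing = FIXED_COST + UNIT_COST_MANUFACTURE * volume
    tc_outsourcing = UNIT_COST_OUTSOURCE * volume
    difference = tc_manufacturing - tc_outsourcing
    # Mirrors the IF logic: a negative or zero difference favors manufacturing.
    decision = "Manufacture" if difference <= 0 else "Outsource"
    return tc_manufacturing, tc_outsourcing, difference, decision


# Hypothetical production volume of 1,500 units.
tc_m, tc_o, diff, best = outsourcing_decision(1500)
print(f"Manufacturing: ${tc_m:,}  Outsourcing: ${tc_o:,}  "
      f"Difference: ${diff:,}  Best decision: {best}")
```

Changing only the volume argument corresponds to editing the production volume cell in the spreadsheet; none of the model formulas needs to change.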

Figure 11.2 Outsourcing Decision Model Spreadsheet


Because decision models characterize the relationships between inputs and outputs, it is useful to separate the data, model calculations, and model outputs clearly in designing a spreadsheet. It is particularly important not to hard-code input data in model formulas, but instead to reference the spreadsheet cells that contain the data. In this way, if the data change or you want to experiment with the model, you need not edit any of the formulas, a step that can easily introduce errors.

Example 11.4  Pricing Decision Spreadsheet Model

Another model we developed in Chapter 1 is one in which a firm wishes to determine the best pricing for one of its products to maximize revenue. The model was developed by incorporating an equation for sales into a total revenue calculation:

sales = −2.9485 × price + 3,240.9
total revenue = price × sales = price × (−2.9485 × price + 3,240.9) = −2.9485 × price² + 3,240.9 × price

Figure 11.3 shows a spreadsheet for calculating both sales and revenue as a function of price.
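As a sketch of what such a calculation involves, the following Python code evaluates the sales and revenue equations over a range of prices. The function names and the trial prices are our own choices, not values from Figure 11.3. The final step, locating the revenue-maximizing price at the vertex of the quadratic, is a standard algebraic result that we add here for illustration.

```python
SLOPE = -2.9485      # change in sales per $1 change in price
INTERCEPT = 3240.9   # sales when price is zero


def sales(price):
    """Predicted sales as a linear function of price."""
    return SLOPE * price + INTERCEPT


def total_revenue(price):
    """Total revenue = price x sales."""
    return price * sales(price)


# Evaluate the model for a few trial prices (hypothetical choices).
for price in (400, 500, 600, 700):
    print(f"price = ${price}: sales = {sales(price):,.1f}, "
          f"revenue = ${total_revenue(price):,.2f}")

# Revenue is a downward-opening quadratic in price, so it peaks at the vertex.
best_price = -INTERCEPT / (2 * SLOPE)
print(f"Revenue-maximizing price is approximately ${best_price:,.2f}")
```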

Figure 11.3 Pricing Decision Spreadsheet Model

Mathematical models are easy to manipulate; for example, we showed in Chapter 1 that it was easy to find the break-even point by setting TC (manufacturing) = TC (outsourcing) and solving for Q. In contrast, it is more difficult to find the break-even volume using trial and error on the spreadsheet without knowing some advanced tools and approaches. However, spreadsheets have the advantage of allowing you to easily modify the model inputs and calculate the numerical results. We will use both spreadsheets and analytical modeling approaches in our model-building applications; it is important to be able to “speak both languages.”
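To make the analytical route concrete, the break-even volume for the outsourcing model follows from equating the two total cost functions, using the cost figures given earlier (\$50,000 fixed cost, \$125 unit manufacturing cost, \$175 unit outsourcing cost):

$$
\begin{aligned}
50{,}000 + 125Q &= 175Q \\
50{,}000 &= 50Q \\
Q &= 1{,}000
\end{aligned}
$$

For volumes above 1,000 units manufacturing is cheaper, and below 1,000 units outsourcing is cheaper.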

Example 11.5  Spreadsheet Implementation of the Profit Model

The analytical model we developed in Example 11.2 can easily be implemented in an Excel spreadsheet to evaluate profit (Excel file Profit Model). Let us assume that unit price = \$40, unit cost = \$24, fixed cost = \$400,000, and demand = 50,000. The decision variable is the quantity produced; for the purposes of building a spreadsheet model, we assume a value of 40,000 units. Figure 11.4 shows a spreadsheet implementation of this model. To better understand the model, study the relationships between the spreadsheet formulas, the influence diagram, and the mathematical model. A manager might use the spreadsheet to evaluate how profit would be expected to change for different values of the uncertain future demand and/or the quantity produced, which is a decision variable that the manager can control. We do this later in this chapter.
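The profit model can also be written as a short function, which makes the link to equation (11.2) explicit. The sketch below uses the inputs stated in this example; the function and parameter names are our own, and the alternative production quantities in the loop are hypothetical what-if values rather than results from the book.

```python
def profit(unit_price, unit_cost, fixed_cost, demand, quantity_produced):
    """Profit model (11.2): P = p * min(D, Q) - (F + c * Q)."""
    quantity_sold = min(demand, quantity_produced)       # S = min{D, Q}
    revenue = unit_price * quantity_sold                 # R = p * S
    cost = fixed_cost + unit_cost * quantity_produced    # C = F + c * Q
    return revenue - cost


# Inputs from Example 11.5.
base = profit(unit_price=40, unit_cost=24, fixed_cost=400_000,
              demand=50_000, quantity_produced=40_000)
print(f"Profit at Q = 40,000: ${base:,}")

# Hypothetical what-if analysis: vary the quantity produced, holding demand fixed.
for q in (40_000, 50_000, 60_000):
    p = profit(40, 24, 400_000, demand=50_000, quantity_produced=q)
    print(f"Q = {q:,}: profit = ${p:,}")
```

Producing more than demand raises cost without raising revenue, which is why profit falls once the quantity produced exceeds 50,000 units.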

Spreadsheet Quality

Building spreadsheet models, often called spreadsheet engineering, is part art and part science. The quality of a spreadsheet can be assessed both by its logical accuracy and its design. Spreadsheets need to be accurate, understandable, and user-friendly. First and foremost, spreadsheets should be accurate. Verification is the process of ensuring that a model is accurate and free from logical errors. Spreadsheet errors can be disastrous. A large investment company once made a \$2.6 billion error. They notified holders of one mutual fund to expect a large dividend; fortunately, they caught the error before sending the checks. One research study of 50 spreadsheets found that fewer than 10% were error free (footnote 1). Significant errors in business have resulted from mistakes in copying and pasting, sorting, numerical input, and spreadsheet-formula references. Industry research has found that more than 90% of spreadsheets with more than 150 rows were incorrect by at least 5%. There are three basic approaches to spreadsheet engineering that can improve spreadsheet quality:

Figure 11.4 Spreadsheet Implementation of Profit Model

1 S. Powell, K. Baker, and B. Lawson, “Errors in Operational Spreadsheets,” Journal of End User Computing, 21 (July–September 2009): 24–36.


1. Improve the design and format of the spreadsheet itself. After the inputs, outputs, and key model relationships are well understood, you should sketch a logical design of the spreadsheet. For example, you might want the spreadsheet to resemble a financial statement to make it easier for managers to read. It is good practice to separate the model inputs from the model itself and to reference the input cells in the model formulas; that way, any changes in the inputs will be automatically reflected in the model. We have done this in the examples. Another useful approach is to break complex formulas into smaller pieces. This reduces typographical errors, makes it easier to check your results, and also makes the spreadsheet easier to read for the user. Finally, it is also important to set up the spreadsheet in a form that the end user (who may be a financial manager, for example) can easily interpret and use. Example 11.6 illustrates these ideas.

2. Improve the process used to develop a spreadsheet. If you sketched out a conceptual design of the spreadsheet, work on each part individually before moving on to the others to ensure that each part is correct. As you enter formulas, check the results with simple numbers (such as 1) to determine if they make sense, or use inputs with known results. Be careful in using the Copy and Paste commands in Excel, particularly with respect to relative and absolute addresses. Use the Excel function wizard (the fx button on the formula bar) to ensure that you are entering the correct values in the correct fields of the function.

3. Inspect your results carefully and use appropriate tools available in Excel. For example, the Excel Formula Auditing tools (in the Formulas tab) help you validate the logic of formulas and check for errors. Using Trace Precedents and Trace Dependents, you can visually show what cells affect or are affected by the value of a selected cell, similar to an influence diagram. The Formula Auditing tools also include Error Checking, which checks for common errors that occur when using formulas, and Evaluate Formula, which helps to debug a complex formula by evaluating each part of the formula individually. We encourage you to learn how to use these tools.

Example 11.6  Modeling Net Income on a Spreadsheet

The calculation of net income is based on the following formulas:

• gross profit = sales − cost of goods sold
• operating expenses = administrative expenses + selling expenses + depreciation expenses
• net operating income = gross profit − operating expenses
• earnings before taxes = net operating income − interest expense
• net income = earnings before taxes − taxes

We could develop a simple model to compute net income using these formulas by substitution:

net income = sales − cost of goods sold − administrative expenses − selling expenses − depreciation expenses − interest expense − taxes

We can implement this model on a spreadsheet, as shown in Figure 11.5. This spreadsheet provides only the end result and, from a financial perspective, provides little information to the end user. An alternative is to break down the model by writing the preceding formulas in separate cells in the spreadsheet using a data-model format, as shown in Figure 11.6. This clearly shows the individual calculations and provides better information. However, although both of these models are technically correct, neither is in the form to which most accounting and financial employees are accustomed. A third alternative is to express the calculations as a pro forma income statement using the structure and formatting that accountants are used to, as shown in Figure 11.7. Although this has the same calculations as in Figure 11.6, note that the use of negative dollar amounts requires a change in the formulas (i.e., addition of negative amounts rather than subtraction of positive amounts). The Excel workbook Net Income Models contains each of these examples in separate worksheets.
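The difference between the single-formula model and the data-model format can be sketched in Python. The code below keeps the inputs in one structure and computes each intermediate result separately, then checks that the stepwise result agrees with the substituted formula. All dollar amounts are hypothetical and are not the values in the Net Income Models workbook; the class and function names are likewise our own.

```python
from dataclasses import dataclass


@dataclass
class IncomeInputs:
    """Model inputs, kept separate from the calculations."""
    sales: float
    cost_of_goods_sold: float
    administrative_expenses: float
    selling_expenses: float
    depreciation_expenses: float
    interest_expense: float
    taxes: float


def income_statement(x):
    """Compute each intermediate result separately (data-model format)."""
    gross_profit = x.sales - x.cost_of_goods_sold
    operating_expenses = (x.administrative_expenses + x.selling_expenses
                          + x.depreciation_expenses)
    net_operating_income = gross_profit - operating_expenses
    earnings_before_taxes = net_operating_income - x.interest_expense
    net_income = earnings_before_taxes - x.taxes
    return {
        "gross_profit": gross_profit,
        "operating_expenses": operating_expenses,
        "net_operating_income": net_operating_income,
        "earnings_before_taxes": earnings_before_taxes,
        "net_income": net_income,
    }


# Hypothetical inputs for illustration only.
inputs = IncomeInputs(sales=5_000_000, cost_of_goods_sold=3_200_000,
                      administrative_expenses=250_000, selling_expenses=450_000,
                      depreciation_expenses=325_000, interest_expense=35_000,
                      taxes=296_000)
statement = income_statement(inputs)
for label, value in statement.items():
    print(f"{label.replace('_', ' '):<24} ${value:>12,.0f}")

# The single substituted formula gives the same end result, but no detail.
single_formula = (inputs.sales - inputs.cost_of_goods_sold
                  - inputs.administrative_expenses - inputs.selling_expenses
                  - inputs.depreciation_expenses - inputs.interest_expense
                  - inputs.taxes)
print("Single-formula net income matches:", single_formula == statement["net_income"])
```

Exposing the intermediate values makes each step easy to check against known results, which is the same benefit the data-model format provides on a spreadsheet.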


Figure 11.5 Simple Spreadsheet Model for Net Income

Figure 11.6 Data-Model Format for Net Income

Figure 11.7 Pro Forma Income Statement Format
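The three layouts in Example 11.6 can also be compared outside of Excel. The following is a minimal Python sketch, not the contents of the Net Income Models workbook; the function names and all dollar amounts are hypothetical and chosen only for illustration. It shows that the step-by-step (data-model) version, the substituted single formula, and the pro forma version with negative amounts all return the same net income.

```python
def net_income_stepwise(sales, cogs, admin, selling, depreciation, interest, taxes):
    """Data-model style: compute each intermediate quantity separately."""
    gross_profit = sales - cogs
    operating_expenses = admin + selling + depreciation
    net_operating_income = gross_profit - operating_expenses
    earnings_before_taxes = net_operating_income - interest
    return earnings_before_taxes - taxes


def net_income_substituted(sales, cogs, admin, selling, depreciation, interest, taxes):
    """Single formula obtained by substitution (only the end result is visible)."""
    return sales - cogs - admin - selling - depreciation - interest - taxes


def net_income_pro_forma(line_items):
    """Pro forma style: expenses are entered as negative amounts and added."""
    return sum(line_items)


# Hypothetical inputs for illustration only (not taken from the workbook).
inputs = dict(sales=5_000_000, cogs=3_200_000, admin=250_000, selling=450_000,
              depreciation=325_000, interest=35_000, taxes=296_000)

stepwise = net_income_stepwise(**inputs)
substituted = net_income_substituted(**inputs)
pro_forma = net_income_pro_forma(
    [inputs["sales"]] + [-amount for name, amount in inputs.items() if name != "sales"]
)
print(stepwise, substituted, pro_forma)
```

The stepwise version is the easiest to audit because each intermediate value can be inspected, which is the same argument the text makes for the data-model format.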

Chapter 11  Spreadsheet Modeling and Analysis


Analytics in Practice: Spreadsheet Engineering at Procter & Gamble²

In the mid-1980s, Procter & Gamble (P&G) needed an easy and consistent way to manage safety stock inventory. P&G’s Western European Business Analysis group created a spreadsheet model that eventually grew into a suite of global inventory models. The model was designed to help supply chain planners better understand inventories in supply chains and to provide a quick method for setting safety stock levels. P&G also developed several spin-off models based on this application that are used around the world.

In designing the model, analysts used many of the principles of spreadsheet engineering. For example, they separated the input sections from the calculation and results sections by grouping the appropriate cells and using different formatting. This sped up the data entry process. In addition, the spreadsheet was designed to display all relevant data on one screen so that the user did not need to switch between different sections of the model. Analysts also used a combination of data validation and conditional formatting to highlight errors in the data input. They also provided a list of warnings and errors that a user should resolve before using the results of the model. The list flags obvious mistakes, such as negative transit times, as well as input data that may require checking and forecast errors that fall outside the boundaries of the model’s statistical validity.

At the basic level, all input fields had comments attached; this served as a quick online help function for the planners. For each model, they also provided a user manual that describes every input and result and explains the formulas in detail. The model templates and all documentation were posted on an intranet site that was accessible to all P&G employees. This ensured that all employees had access to the most current versions of the models, supporting material, and training schedules.

Spreadsheet Applications in Business Analytics

A wide variety of practical problems in business analytics can be modeled using spreadsheets. In this section, we present several examples and families of models that illustrate different applications. One thing to note is that a useful spreadsheet model need not be complex; often, simple models can provide managers with the information they need to make good decisions. Example 11.7 is adapted from a real application in the banking industry.

Example 11.7  A Predictive Model for Staffing³

Staffing is an area of any business where making changes can be expensive and time-consuming. Thus, it is quite important to understand staffing requirements well in advance. In many cases, the time to hire and train new employees can be 90 to 180 days, so it is not always possible to react quickly to staffing needs. Hence, advance planning is vital so that managers can make good decisions about overtime or reductions in work hours, or adding or reducing temporary or permanent staff. Planning for staffing requirements is an area where analytics can be of tremendous benefit.

Suppose that the manager of a loan-processing department wants to know how many employees will be needed over the next several months to process a certain number of loan files per month so she can better plan capacity. Let’s also suppose that there are different types of products that require processing. A product could be a 30-year fixed rate mortgage, 7/1 ARM, FHA loan, or a construction loan. Each of these loan types varies in its complexity, requires a different level of documentation, and, consequently, takes a different amount of time to complete. Assume that the manager forecasts 700 loan applications in May, 750 in June, 800 in July, and 825 in August. Each employee works productively for 6.5 hours each day, and there are 22 working days in May, 20 in June, 22 in July, and 22 in August. The manager also knows, based on historical loan data, the percentage of each product type and how long it takes to process one loan of each type. These data are presented next:

Products      Product Mix (%)   Hours Per File
Product 1     22                3.50
Product 2     17                2.00
Product 3     13                1.50
Product 4     12                5.50
Product 5      9                4.00
Product 6      9                3.00
Product 7      6                2.00
Product 8      5                2.00
Product 9      3                1.50
Product 10     1                3.50
Misc           3                3.00
Total        100

The manager would like to predict the number of full-time-equivalent (FTE) staff needed each month to ensure that all loans can be processed. Figure 11.8 shows a simple predictive model on a spreadsheet to calculate the FTEs required (Excel file Staffing Model). For each month, we take the desired throughput and convert this to the number of files for each product based on the product mix percentages. By multiplying by the hours per file, we then calculate the number of hours required for each product. Finally, we divide the total number of hours required each month by the number of working hours each month (hours worked per day × days in the month). This yields the number of FTEs required.

Figure 11.8 Staffing Model Spreadsheet Implementation

² Based on Ingrid Farasyn, Koray Perkoz, Wim Van de Velde, “Spreadsheet Models for Inventory Target Setting at Procter & Gamble,” Interfaces, 38, 4 (July–August 2008): 241–250.
³ The author is indebted to Mr. Craig Zielanzy of BlueNote Analytics, LLC, for providing this example.
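The FTE calculation in Example 11.7 can be sketched in a few lines of Python. This is an illustrative translation of the logic described in the text, not the Staffing Model workbook itself; the names PRODUCTS, FORECAST, and ftes_required are our own. The data are the product mix, hours per file, monthly forecasts, and working days given in the example.

```python
# Product mix (fraction of files) and processing hours per file, from the table.
PRODUCTS = {
    "Product 1": (0.22, 3.50),
    "Product 2": (0.17, 2.00),
    "Product 3": (0.13, 1.50),
    "Product 4": (0.12, 5.50),
    "Product 5": (0.09, 4.00),
    "Product 6": (0.09, 3.00),
    "Product 7": (0.06, 2.00),
    "Product 8": (0.05, 2.00),
    "Product 9": (0.03, 1.50),
    "Product 10": (0.01, 3.50),
    "Misc": (0.03, 3.00),
}
HOURS_PER_DAY = 6.5  # productive hours per employee per day

# Forecast loan files and working days for each month.
FORECAST = {"May": (700, 22), "June": (750, 20), "July": (800, 22), "August": (825, 22)}


def ftes_required(files, working_days, products=PRODUCTS, hours_per_day=HOURS_PER_DAY):
    """Total processing hours for the month divided by hours one FTE can work."""
    hours_needed = sum(files * mix * hours for mix, hours in products.values())
    hours_per_fte = hours_per_day * working_days
    return hours_needed / hours_per_fte


for month, (files, days) in FORECAST.items():
    print(f"{month}: {ftes_required(files, days):.2f} FTEs")
```

Note that June needs more FTEs than May partly because it has fewer working days, which is the kind of insight the spreadsheet model is meant to surface.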


Models Involving Multiple Time Periods

Most practical models used in business analytics are more complex and involve basic financial analysis similar to the profit model. One example is the decision to launch a new product. In the pharmaceutical industry, for example, research and development is a long and arduous process (see Example 11.8); total development expenses can approach \$1 billion. Models for these types of applications typically incorporate multiple time periods that are logically linked together, and predictive analytical capabilities are vital to making good business decisions. However, taking a systematic approach to putting the pieces together logically can often make a seemingly difficult problem much easier.

Example 11.8  New-Product Development

Suppose that Moore Pharmaceuticals has discovered a potential drug breakthrough in the laboratory and needs to decide whether to go forward to conduct clinical trials and seek FDA approval to market the drug. Total R&D costs are expected to reach \$700 million, and the cost of clinical trials will be about \$150 million. The current market size is estimated to be 2 million people and is expected to grow at a rate of 3% each year. In the first year, Moore estimates gaining an 8% market share, which is anticipated to grow by 20% each year. It is difficult to estimate beyond 5 years because new competitors are expected to be entering the market. A monthly prescription is anticipated to generate revenue of \$130 while incurring variable costs of \$40. A discount rate of 9% is assumed for computing the net present value of the project. The company needs to know how long it will take to recover its fixed expenses and the net present value over the first 5 years.

Figure 11.9 shows a spreadsheet model for this situation (Excel file Moore Pharmaceuticals). The model is based on a variety of known data, estimates, and assumptions. If you examine the model closely, you will see that some of the inputs in the model are easily obtained from corporate accounting (e.g., discount rate, unit revenue, and unit cost), whereas others come from historical data (e.g., project costs), forecasts, or judgmental estimates based on preliminary market research or previous experience (e.g., market size, market share, and yearly growth rates). The model itself is a straightforward application of accounting and financial logic; you should examine the Excel formulas to see how the model is built.

The assumptions used represent the “most likely” estimates, and the spreadsheet shows that the product will begin to be profitable by the fourth year. However, the model is based on some rather tenuous assumptions about the market size and market-share growth rates. In reality, much of the data used in the model are uncertain, and the corporation would be remiss if it simply used the results of this one scenario. The real value of the model would be in analyzing a variety of scenarios that use different values for these assumptions.
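To see how the time periods in Example 11.8 link together, the following Python sketch rebuilds the logic under stated assumptions; it is not a copy of the Moore Pharmaceuticals workbook. We assume that each customer fills one prescription per month for 12 months, that market share grows by 20% of its previous value each year (8%, 9.6%, and so on), that the \$850 million in R&D and clinical trial costs is incurred up front, and that each year's profit is received at the end of that year for discounting purposes. The function name moore_model is hypothetical.

```python
def moore_model(market_size=2_000_000, market_growth=0.03,
                initial_share=0.08, share_growth=0.20,
                monthly_revenue=130.0, monthly_cost=40.0,
                rd_cost=700e6, trial_cost=150e6,
                discount_rate=0.09, years=5):
    """Return annual profits, cumulative net profit, NPV, and payback year."""
    fixed_costs = rd_cost + trial_cost
    annual_profits, cumulative = [], []
    running_total = -fixed_costs  # assumption: fixed costs are paid before year 1
    for year in range(years):
        size = market_size * (1 + market_growth) ** year
        share = initial_share * (1 + share_growth) ** year
        customers = size * share
        # assumption: one prescription per customer per month
        profit = customers * 12 * (monthly_revenue - monthly_cost)
        running_total += profit
        annual_profits.append(profit)
        cumulative.append(running_total)
    # assumption: profits arrive at year end, so year t is discounted t times
    npv = -fixed_costs + sum(
        p / (1 + discount_rate) ** (t + 1) for t, p in enumerate(annual_profits)
    )
    payback_year = next((t + 1 for t, c in enumerate(cumulative) if c >= 0), None)
    return annual_profits, cumulative, npv, payback_year


annual_profits, cumulative, npv, payback_year = moore_model()
for year, (profit, total) in enumerate(zip(annual_profits, cumulative), start=1):
    print(f"Year {year}: profit {profit:,.0f}, cumulative {total:,.0f}")
print(f"Payback year: {payback_year}, NPV: {npv:,.0f}")
```

Under these assumptions the cumulative net profit first turns positive in year 4, consistent with the statement in the example. Because the growth rates are keyword arguments, alternative scenarios can be evaluated by calling the function with different values.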


Figure 11.9 Spreadsheet Implementation of Moore Pharmaceuticals Model


Single-Period Purchase Decisions

Banana Republic, a division of Gap, Inc., was trying to build a name for itself in fashion circles as parent Gap shifted its product line to basics such as cropped pants, jeans, and khakis. In one recent holiday season, the company had bet that blue would be the top-selling color in stretch merino wool sweaters. They were wrong; as the company president noted, “The number 1 seller was moss green. We didn’t have enough.”⁴

This is one of many practical situations in which a one-time purchase decision must be made in the face of uncertain demand. Department store buyers must purchase seasonal clothing well in advance of the buying season, and candy shops must decide on how many special holiday gift boxes to assemble. The general scenario is commonly known as the newsvendor problem: A street newsvendor sells daily newspapers and must make a decision about how many to purchase. Purchasing too few results in lost opportunity to increase profits, but purchasing too many results in a loss since the excess must be discarded at the end of the day.

We first develop a general model for this problem and then illustrate it with an example. Let us assume that each item costs \$C to purchase and is sold for \$R. At the end of the period, any unsold items can be disposed of at \$S each (the salvage value). Clearly, it makes sense to assume that R > C > S. Let D be the number of units demanded during the period and Q be the quantity purchased. Note that D is an uncontrollable input, whereas Q is a decision variable. If demand is known, then the optimal decision is obvious: Choose Q = D. However, if D is not known in advance, we run the risk of overpurchasing or underpurchasing. If Q < D, then we lose the opportunity of realizing additional profit (since we assume that R > C), and if Q > D, we incur a loss (because C > S). Notice that we cannot sell more than the minimum of the actual demand and the amount purchased. Thus, the quantity sold at the regular price is the smaller of D and Q. Also, the surplus quantity is the larger of 0 and Q − D. The net profit is calculated as:

net profit = R × quantity sold + S × surplus quantity − C × Q    (11.3)

In reality, the demand D is uncertain and can be modeled using a probability distribution based on approaches that we described in Chapter 5. For now, we do not deal with models that involve probability distributions (building the models is enough of a challenge at this point); however, we learn how to deal with them in the next chapter. Another example of an application of predictive analytics that involves probability distributions is overbooking.

Example 11.9  A Single-Period Purchase Decision Model

Suppose that a small candy store makes Valentine’s Day gift boxes that cost \$12.00 and sell for \$18.00. In the past, at least 40 boxes have been sold by Valentine’s Day, but the actual amount is uncertain, and the owner has often run short or made too many. After the holiday, any unsold boxes are discounted 50% and are eventually sold. The net profit can be calculated using formula (11.3) for any values of Q and D:

net profit = \$18.00 × min{D, Q} + \$9.00 × max{0, Q − D} − \$12.00 × Q

Figure 11.10 shows a spreadsheet that implements this model assuming a demand of 41 and a purchase quantity of 44 (Excel file Newsvendor Model).

⁴ Louise Lee, “Yes, We Have a New Banana,” BusinessWeek (May 31, 2004): 70–72.


Figure 11.10 Spreadsheet Implementation of Newsvendor Model
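Formula (11.3) translates directly into code. The sketch below is a Python rendering of the newsvendor logic using the candy store data from Example 11.9; the function name newsvendor_profit and its parameter names are our own and do not come from the Newsvendor Model workbook.

```python
def newsvendor_profit(demand, quantity, price=18.00, cost=12.00, salvage=9.00):
    """Net profit from formula (11.3) for a given demand D and purchase quantity Q."""
    quantity_sold = min(demand, quantity)   # cannot sell more than we have
    surplus = max(0, quantity - demand)     # leftover units sold at salvage value
    return price * quantity_sold + salvage * surplus - cost * quantity


# Scenario from the example: demand of 41 and a purchase quantity of 44.
print(newsvendor_profit(demand=41, quantity=44))  # 237.0

# A simple what-if: hold demand at 41 and vary the purchase quantity.
for q in range(38, 47, 2):
    print(q, newsvendor_profit(demand=41, quantity=q))
```

With demand fixed at 41, profit peaks when Q = D and falls by \$6.00 for each unit short (lost margin) but only \$3.00 for each unit over (cost less salvage), which reflects the asymmetry between underpurchasing and overpurchasing described in the text.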

Overbooking Decisions

An important operations decision for service businesses such as hotels, airlines, and car-rental companies is the number of reservations to accept to effectively fill capacity, knowing that some customers may not use their reservations and may not notify the business. If a hotel, for example, holds rooms for customers who do not show up, it loses revenue opportunities. (Even if it charges a night’s lodging as a guarantee, rooms held for additional days may go unused.) A common practice in these industries is to overbook reservations. When more customers arrive than can be handled, the business usually incurs some cost to satisfy them (by putting them up at another hotel or, for most airlines, providing extra compensation such as ticket vouchers). Therefore, the decision becomes how much to overbook to balance the costs of overbooking against the lost revenue for underuse.

Example 11.10  A Hotel Overbooking Model

Figure 11.11 shows a spreadsheet model (Excel file Hotel Overbooking Model) for a popular resort hotel that has 300 rooms and is usually fully booked. The hotel charges \$120 per room. Reservations may be canceled by the 6:00 p.m. deadline with no penalty. The hotel has estimated that the average overbooking cost is \$100.

The logic of the model is straightforward. In the model section of the spreadsheet, cell B12 represents the decision variable of how many reservations to accept. In this example, we assume that the hotel is willing to accept 310 reservations; that is, to overbook by 10 rooms. Cell B13 represents the actual customer demand (the number of customers who want a reservation). Here we assume that 312 customers tried to make a reservation. The hotel cannot accept more reservations than its predetermined limit, so the number of reservations made in cell B14 is the smaller of the customer demand and the reservation limit. Cell B15 is the number of customers who decide to cancel their reservation. In this example, we assume that only 6 of the 310 reservations are canceled. Therefore, the actual number of customers who arrive (cell B16) is the difference between the number of reservations made and the number of cancellations. If the actual number of customer arrivals exceeds the room capacity, overbooking occurs. This is modeled by the MAX function in cell B17. Net revenue is computed in cell B18.

A manager would probably want to use this model to analyze how the number of overbooked customers and net revenue would be influenced by changes in the reservation limit, customer demand, and cancellations. As with the newsvendor model, the customer demand and the number of cancellations are, in reality, random variables that we cannot specify with certainty. We also show how to incorporate randomness into the model in the next chapter.


Figure 11.11 Hotel Overbooking Model Spreadsheet
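The chain of calculations in Example 11.10 can be sketched as follows. This Python version mirrors the MIN and MAX logic described in the text; the net revenue formula is our assumption (room revenue for occupied rooms minus the overbooking cost for each customer who cannot be accommodated), because the text states only that net revenue is computed in the spreadsheet. The function name hotel_overbooking is hypothetical.

```python
def hotel_overbooking(reservation_limit, demand, cancellations,
                      rooms=300, price=120.0, overbooking_cost=100.0):
    """Evaluate one scenario of the hotel overbooking model."""
    reservations_made = min(demand, reservation_limit)  # cannot exceed the limit
    arrivals = reservations_made - cancellations
    overbooked = max(0, arrivals - rooms)                # customers without a room
    # Assumed net revenue logic: revenue from occupied rooms less overbooking costs.
    net_revenue = price * min(arrivals, rooms) - overbooking_cost * overbooked
    return {
        "reservations_made": reservations_made,
        "arrivals": arrivals,
        "overbooked": overbooked,
        "net_revenue": net_revenue,
    }


# Scenario from the example: limit of 310, demand of 312, and 6 cancellations.
result = hotel_overbooking(reservation_limit=310, demand=312, cancellations=6)
print(result)
```

Calling the function with different reservation limits, demands, or cancellation counts gives the kind of what-if comparison the example suggests a manager would want.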

Analytics in Practice: Using an Overbooking Model at a Student Health Clinic

The East Carolina University (ECU) Student Health Service (SHS) provides health-care services and wellness education to enrolled students.⁵ Patient volume consists almost entirely of scheduled appointments for nonurgent health-care needs. In a recent academic year, 35,050 appointments were scheduled. Patients failed to arrive for over 10% of these appointments. The no-show problem is not unique. Various studies report that no-show rates for health service providers often range as high as 30% to 50%.

To address this problem, a quality-improvement (QI) team was formed to analyze an overbooking option. Their efforts resulted in developing a novel overbooking model that included the effects of employee burnout resulting from the need to see more patients than the normal capacity allowed. The model provided strong evidence that a 10% to 15% overbooking level produces the highest value. The overbooking model was also instrumental in alleviating staff concerns about disruption and pressures that result from large numbers of overscheduled patients. At a 5% overbooking rate, the staff was reassured by model results that predicted 95% of the operating days with no patients being overscheduled; in the worst case, 8 patients would be overscheduled a few days each month. In addition, at a 10% overbooking rate the model predicted that during 85% of the operating days per month, no patients would be overscheduled; a maximum of 16 overscheduled patients would rarely ever occur.

Based on the model predictions, the SHS implemented an overbooking policy and overbooked by 7.3% with plans to increase to 10% in future semesters. The SHS director estimated the actual savings from overbooking during the first semester of implementation would be approximately \$95,000.

⁵ Based on John Kros, Scott Dellana, and David West, “Overbooking Increases Patient Access at East Carolina University’s Student Health Services Clinic,” Interfaces, Vol. 39, No. 3, May–June 2009, pp. 271–287.


Model Assumptions, Complexity, and Realism

Models cannot capture every detail of the real problem, and managers must understand the limitations of models and their underlying assumptions. Validity refers to how well a model represents reality. One approach for judging the validity of a model is to identify and examine the assumptions made in a model to see how they agree with our perception of the real world; the closer the agreement, the higher the validity. Another approach is to compare model results to observed results; the closer the agreement, the more valid the model. A “perfect” model corresponds to the real world in every respect; unfortunately, no such model has ever existed or ever will, because it is impossible to include every detail of real life in one model. Adding more realism to a model generally requires more complexity, and analysts have to know how to balance the two.

Example 11.11  A Retirement-Planning Model

Consider modeling a typical retirement plan. Suppose that an employee starts working after college at age 22 at a starting salary of \$50,000. She expects an average salary increase of 3% each year. Her retirement plan requires that she contribute 8% of her salary, and her employer adds an additional 35% of her contribution. She anticipates an annual return of 8% on her retirement portfolio. Figure 11.12 shows a spreadsheet model of her retirement investments through age 50 (Excel file Retirement Plan).

There are two validity issues with this model. One, of course, is whether the assumptions of the annual salary increase and return on investment are reasonable and whether they should be assumed to be the same each year. Assuming the same rate of salary increases and investment returns each year simplifies the model but detracts from the realism because these variables will clearly vary each year. A second validity issue is how the model calculates the return on investment. The model in Figure 11.12 assumes that the return on investment is applied to the previous year’s balance and not to the current year’s contributions (examine the formula used in cell E15). An alternative would be to calculate the investment return based on the end-of-year balance, including current-year contributions, using the formula =(E14 + C15 + D15)*(1 + \$B\$8) in cell E15 and copying it down the spreadsheet. This will produce a different result. Neither of these assumptions is quite correct, since the contributions would normally be made on a monthly basis. To reflect this would require a much larger and more complex spreadsheet model. Thus, building realistic models requires careful thought and creativity, and a good working knowledge of the capabilities of Excel.

Data and Models

Data used in models can come from subjective judgment based on past experience, existing databases and other data sources, analysis of historical data, or surveys, experiments, and other methods of data collection. For example, in the profit model we might query accounting records for values of the unit cost and fixed costs. Statistical methods that we have studied are often used to estimate data required in predictive models. For example, we might use historical data to compute the mean demand; we might also use quartiles or percentiles in the model to evaluate different scenarios. However, even if data are not available, using a good subjective estimate is better than sacrificing the completeness of a model that may be useful to managers.⁶

⁶ Glen L. Urban, “Building Models for Decision Makers,” Interfaces, 4, 3 (May 1974): 1–11.


Figure 11.12 Portion of Retirement Plan Spreadsheet
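The two ways of applying the investment return discussed in Example 11.11 can be compared with a short Python sketch. This is an illustration under stated assumptions, not the Retirement Plan workbook: we assume a zero balance before the first contribution at age 22, annual contributions, and the rates given in the example. The function name retirement_balances and the flag return_on_current_contributions are hypothetical.

```python
def retirement_balances(start_age=22, end_age=50, salary=50_000.0, raise_rate=0.03,
                        employee_rate=0.08, employer_match=0.35, annual_return=0.08,
                        return_on_current_contributions=False):
    """Return a list of (age, salary, contributions, balance) rows."""
    balance = 0.0  # assumption: nothing invested before the first year
    rows = []
    for age in range(start_age, end_age + 1):
        employee = salary * employee_rate
        employer = employee * employer_match
        contributions = employee + employer
        if return_on_current_contributions:
            # Alternative: return is earned on the balance including new contributions.
            balance = (balance + contributions) * (1 + annual_return)
        else:
            # Base case: return is earned on the previous year's balance only.
            balance = balance * (1 + annual_return) + contributions
        rows.append((age, salary, contributions, balance))
        salary *= 1 + raise_rate
    return rows


base = retirement_balances()
alternative = retirement_balances(return_on_current_contributions=True)
print(f"Balance at age 50 (return on prior balance only): {base[-1][3]:,.0f}")
print(f"Balance at age 50 (return includes contributions): {alternative[-1][3]:,.0f}")
```

Under these assumptions the alternative balance is exactly (1 + annual return) times the base balance in every year, so the two modeling choices differ by a full year of growth. Monthly contributions would fall somewhere in between.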

Let’s develop a simple example based on retail markdown pricing decisions that we described in Example 1.1 in Chapter 1.

Example 11.12  Modeling Retail Markdown Pricing Decisions

A chain of department stores is introducing a new brand of bathing suit for \$70. The prime selling season is 50 days during the late spring and early summer; after that, the store has a clearance sale around July 4 and marks down the price by 70% (to \$21.00), typically selling any remaining inventory at the clearance price. Merchandise buyers have purchased 1,000 units and allocated them to the stores prior to the selling season. After a few weeks, the stores reported average sales of 7 units/day, and past experience suggests that this constant level of sales will continue over the remainder of the selling season. Thus, over the 50-day selling season, the stores would be expected to sell 50 × 7 = 350 units at the full retail price and earn a revenue of \$70.00 × 350 = \$24,500. The remaining 650 units would be sold at \$21.00, for a clearance revenue of \$13,650. Therefore, the total revenue would be predicted as \$24,500 + \$13,650 = \$38,150.

As an experiment, the store reduced the price to \$49 for one weekend and found that the average daily sales were 32.2 units. Assuming a linear trend model for sales as a function of price, as in Example 1.9,

daily sales = a − b × price

we can find values for a and b by solving these two equations simultaneously based on the data the store obtained:

7 = a − b × \$70.00
32.2 = a − b × \$49.00

Subtracting the first equation from the second gives 25.2 = 21b, so b = 1.2 and a = 7 + 1.2 × 70 = 91. This leads to the linear demand model:

daily sales = 91 − 1.2 × price

We may also use Excel’s SLOPE and INTERCEPT functions to find the slope and intercept of the straight line between the two points (\$70, 7) and (\$49, 32.2); this is incorporated into the Excel model that follows. Because this model suggests that higher sales can be driven by price discounts, the marketing department has the basis for making improved discounting decisions. For instance, suppose they decide to sell at full retail price for x days and then discount the price by y% for the remainder of the selling season, followed by the clearance sale. What total revenue could they predict? We can compute this easily. Selling at the full retail price for x days yields revenue of

full retail price revenue = 7 units/day × x days × \$70.00 = \$490.00x

The markdown price applies for the remaining 50 − x days:

markdown price = \$70.00 × (100% − y%)
daily sales = a − b × markdown price = 91 − 1.2 × \$70.00 × (100% − y%)
units sold at markdown = daily sales × (50 − x)

as long as this is less than or equal to the number of units remaining in inventory from full retail sales. If not, this number needs to be adjusted. Then we can compute the markdown revenue as

markdown revenue = units sold at markdown × markdown price

Finally, the remaining inventory after 50 days is

clearance inventory = 1,000 − units sold at full retail − units sold at markdown
= 1,000 − 7x − [91 − 1.2 × \$70.00 × (100% − y%)] × (50 − x)

This amount is sold at a price of \$21.00, resulting in revenue of

clearance price revenue = {1,000 − 7x − [91 − 1.2 × \$70.00 × (100% − y%)] × (50 − x)} × \$21.00

The total revenue would be found by adding the models developed for full retail price revenue, markdown revenue, and clearance price revenue. Figure 11.13 shows a spreadsheet implementation of this model (Excel file Markdown Pricing Model). By changing the values in cells B7 and B8, the marketing manager could predict the revenue that could be achieved for different markdown decisions.

Figure 11.13 Markdown Pricing Model Spreadsheet
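The markdown pricing logic of Example 11.12 can be sketched in Python as follows. This is an illustrative implementation of the formulas in the text, not the Markdown Pricing Model workbook; the function names fit_demand and total_revenue are our own. The two-point line fit plays the role of Excel's SLOPE and INTERCEPT functions, and the min function applies the adjustment the text mentions when markdown sales would exceed the remaining inventory.

```python
RETAIL_PRICE = 70.00
CLEARANCE_PRICE = 21.00
INVENTORY = 1000
SEASON_DAYS = 50
FULL_PRICE_DAILY_SALES = 7.0


def fit_demand(point_a=(70.00, 7.0), point_b=(49.00, 32.2)):
    """Fit daily sales = intercept + slope * price through two (price, sales) points."""
    (price_a, sales_a), (price_b, sales_b) = point_a, point_b
    slope = (sales_b - sales_a) / (price_b - price_a)
    intercept = sales_a - slope * price_a
    return intercept, slope


def total_revenue(days_full_price, discount):
    """Predicted revenue when selling x days at full price, then at a y discount."""
    intercept, slope = fit_demand()
    full_units = min(FULL_PRICE_DAILY_SALES * days_full_price, INVENTORY)
    markdown_price = RETAIL_PRICE * (1 - discount)
    daily_sales = intercept + slope * markdown_price
    # Markdown sales cannot exceed the inventory left after full-price sales.
    markdown_units = min(daily_sales * (SEASON_DAYS - days_full_price),
                         INVENTORY - full_units)
    clearance_units = INVENTORY - full_units - markdown_units
    return (RETAIL_PRICE * full_units
            + markdown_price * markdown_units
            + CLEARANCE_PRICE * clearance_units)


intercept, slope = fit_demand()
print(f"daily sales = {intercept:.1f} + ({slope:.1f}) * price")
print(f"No markdown (x = 50): {total_revenue(50, 0.0):,.2f}")
print(f"40 days at full price, then 30% off: {total_revenue(40, 0.30):,.2f}")
```

With no markdown period the function reproduces the \$38,150 baseline computed in the example. The second call (40 days at full price followed by a 30% discount) is a hypothetical scenario chosen for illustration.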


Developing User-Friendly Excel Applications

Using business analytics requires good communication between analysts and the clients or managers who use the tools. In many cases, users may not be very familiar with Excel. Thus, developing user-friendly spreadsheets is vital to gaining acceptance of the tools and making them useful.

Data Validation

One useful Excel tool is the data validation feature, which allows you to define acceptable input values in a spreadsheet and provides an error alert if an invalid entry is made. This can help to avoid inadvertent user errors. The feature is found in the Data Tools group within the Data tab on the Excel ribbon. Select the cell range, click on Data Validation, and then specify the criteria that Excel will use to flag invalid data.

Range Names

Use cell and range names to simplify formulas and make them more user-friendly. For example, suppose that the unit price is stored in cell B13 and quantity sold is in cell B14. Suppose you wish to calculate revenue in cell C15. Instead of writing the formula =B13*B14, you could define the name of cell B13 in Excel as “UnitPrice” and the name of cell B14 as “QuantitySold.” Then in cell C15, you could simply write the formula =UnitPrice*QuantitySold. (In this book, however, we use cell references so that you can more easily trace formulas in the examples.)

Example 11.13  Using Data Validation

Let us use the Outsourcing Decision Model spreadsheet as an example. Suppose that an employee is asked to use the spreadsheet to evaluate the manufacturing and purchase cost options and best decisions for a large number of parts used in an automobile system assembly. She is given lists of data that cost accountants and purchasing managers have compiled and printed and must look up the data and enter them into the spreadsheet. Such a manual process leaves plenty of opportunity for error. However, suppose that we know that the unit cost of any item is at least \$10 but no more than \$100. If a cost is \$47.50, for instance, a misplaced decimal would result in either \$4.75 or \$475, which would clearly be out of range. In the Data Validation dialog, you can specify that the value must be a decimal number between 10 and 100 as shown in Figure 11.14. On the Error Alert tab, you can also create an alert box that pops up when an invalid entry is made (see Figure 11.15). On the Input Message tab, you can create a prompt to display a comment in the cell about the correct input format. Data validation has other customizable options that you might want to explore.

Figure 11.14 Data Validation Dialog


Figure 11.15 Example of an Error Alert
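The range check in Example 11.13 is simple enough to express in code. The following Python sketch is only an analogy to Excel's data validation feature, not a description of how Excel implements it; the function name validate_unit_cost and the wording of the messages are our own. It shows why a decimal range of 10 to 100 catches a misplaced decimal point in \$47.50.

```python
def validate_unit_cost(value, minimum=10.0, maximum=100.0):
    """Return (is_valid, message) for a proposed unit cost entry."""
    try:
        cost = float(value)
    except (TypeError, ValueError):
        return False, "Unit cost must be a number."
    if not (minimum <= cost <= maximum):
        return False, f"Unit cost must be between {minimum:g} and {maximum:g}."
    return True, ""


# The correct entry and the two misplaced-decimal versions from the example.
for entry in (47.50, 4.75, 475):
    is_valid, message = validate_unit_cost(entry)
    print(entry, "accepted" if is_valid else f"rejected: {message}")
```

Note that a range check of this kind cannot catch every error; an entry of 74.50 instead of 47.50 would still be accepted.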

Form Controls

Form controls are buttons, boxes, and other mechanisms for easily inputting or changing data on spreadsheets; they can be used to design user-friendly spreadsheets. To use form controls, you must first activate the Developer tab on the ribbon. Click the File tab, then Options, and then Customize Ribbon. Under Customize the Ribbon, make sure that Main Tabs is displayed in the drop-down box, and then click the check box next to Developer (which is typically unchecked in a standard Excel installation). You will see the new tab in the Excel ribbon as shown in Figure 11.16. If you click the Insert button in the Controls group, you will see the form controls available (do not confuse these with the ActiveX Controls in the same menu). Form controls include

• Button
• Combo box
• Check box
• Spin button
• List box
• Option button
• Group box
• Label
• Scroll bar

These allow the user to more easily interface with models to enter or change data without the potential of inadvertently introducing errors in formulas. With form controls, you can keep the spreadsheets hidden and make them easier to use, especially for individuals without much spreadsheet knowledge. To insert a form control, click the Insert button in the Controls group on the Developer tab, click on the control you want to use, and then click within your worksheet. The following example shows how to use both a spin button and scroll bar in the Outsourcing Decision Model Excel file.

Figure 11.16 Excel Developer Tab


Example 11.14  Using Form Controls for the Outsourcing Decision Model

We will design a simple spreadsheet interface to allow a user to evaluate different values of the supplier cost and production volume in the Outsourcing Decision Model spreadsheet. We will use a spin button for the supplier unit cost (which we will assume might vary between \$150 and \$200 in increments of \$5) and a scroll bar for the production volume (in unit increments between 500 and 3000 units). The completed spreadsheet is shown in Figure 11.17.

First, click the Insert button in the Controls group of the Developer tab, select the spin button, click it, and then click somewhere in the worksheet. The spin button (and any form control) can be resized by dragging the handles along the edge and moved within the worksheet. Move it to a convenient location, and enter the name you wish to use (such as Supplier Unit Cost) adjacent to it. Next, right-click the spin button and select Format Control. You will see the dialog box shown in Figure 11.18. Enter the values shown and click OK. Now if you click the up or down buttons, the value in cell D3 will change within the specified range. Next, repeat this process by inserting the scroll bar next to the production volume in column D. The next step is to link the values in column D to the model by replacing the value in cell B10 with =D3, and the value in cell B12 with =D8. (We could have assigned the cell link references in the Format Control dialogs to cells B10 and B12, but it is easier to see the values next to the form controls.) Now, using the controls, you can easily see how the model outputs change without having to type in new values.

Form controls only allow integer increments, so we have to make some modifications to a spreadsheet if we want to change a number by a fractional value. For example, suppose that we want to use a spin button to change an interest rate in cell B8 from 0% to 10% in increments of 0.1% (i.e., 0.001). Choose some empty cell, say C8, and enter a value between 0 and 100 in it. Then enter the formula =C8/1000 in cell B8. Note that if the value in C8 = 40, for example, then the value in cell B8 will be 40/1000 = 0.04, or 4%. Then as the value in cell C8 changes by 1, the value in cell B8 changes by 1/1000, or 0.1%. In the Format Control dialog, specify the minimum value at 0 and the maximum value at 100 and link the button to cell C8. Now as you click the up or down arrows on the spin button, the value in cell C8 changes by 1 and the value in cell B8 changes by 0.1%.

Other form controls can also be used; we encourage you to experiment and identify creative ways to use them. Excel also has many other features that can be used to improve the design and implementation of spreadsheet models. The serious analyst should consider learning about macro recording and Visual Basic for Applications (VBA), but these topics are well beyond the scope of this book.

Figure 11.17 Outsourcing Decision Model Spreadsheet with Form Controls
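The scaling trick for fractional increments in Example 11.14 amounts to storing an integer and dividing it by a constant. The short Python sketch below illustrates the arithmetic only; it does not interact with Excel, and the function name rate_from_spin is hypothetical. The integer argument plays the role of the linked cell C8, and the returned value plays the role of cell B8.

```python
def rate_from_spin(spin_value, minimum=0, maximum=100, divisor=1000):
    """Convert an integer spin-button value into a fractional interest rate."""
    if not isinstance(spin_value, int):
        raise ValueError("Spin buttons only produce integer values.")
    if not (minimum <= spin_value <= maximum):
        raise ValueError(f"Value must be between {minimum} and {maximum}.")
    return spin_value / divisor


print(rate_from_spin(40))                       # 0.04, i.e., 4%
print(rate_from_spin(41) - rate_from_spin(40))  # one click changes the rate by about 0.001
```

The same idea works for any step size: choose the divisor as the reciprocal of the desired increment and set the control's maximum to the largest rate multiplied by that divisor.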


Chapter 11  Spreadsheet Modeling and Analysis

Figure 11.18 Format Control Dialog

Analyzing Uncertainty and Model Assumptions

Because predictive analytical models are based on assumptions about the future and incorporate variables that most likely will not be known with certainty, it is usually important to investigate how these assumptions and uncertainty affect the model outputs. This is one of the most important and valuable activities for using predictive models to gain insights and make good decisions. In this section, we describe several different approaches for doing this.

What-If Analysis

Spreadsheet models allow you to easily evaluate what-if questions: how specific combinations of inputs that reflect key assumptions will affect model outputs. What-if analysis is as easy as changing values in a spreadsheet and recalculating the outputs. However, systematic approaches make this process easier and more useful. In Example 11.2, we developed a model for profit and suggested how a manager might use the model to change inputs and evaluate different scenarios. A more informative way of evaluating a wider range of scenarios is to build a table in the spreadsheet to vary the input or inputs in which we are interested over some range, and calculate the output for this range of values. The following example illustrates this.

Example 11.15  Using Excel for What-If Analysis

In the profit model used in Example 11.2, we stated that demand is uncertain. A manager might be interested in the following question: For any fixed quantity produced, how will profit change as demand changes? In Figure 11.19, we created a table for varying levels of demand, and computed the profit. This shows that a loss is incurred for low levels of demand, whereas profit is limited to \$240,000 whenever the demand exceeds the quantity produced, no matter how high it is. Notice that the formula refers to cells in the model; thus, the user could change the quantity produced or any of the other model inputs and still have a correct evaluation of the profit for these values of demand. One of the advantages of evaluating what-if questions for a range of values rather than one at a time is the ability to visualize the results in a chart, as shown in Figure 11.20. This clearly shows that profit increases as demand increases until it hits the value of the quantity produced.
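To make the shape of this what-if table concrete, here is a minimal Python sketch. It assumes a profit model in which sales are capped at the quantity produced; the model structure and the input values (unit price 40, unit cost 24, fixed cost 400,000, quantity produced 40,000) are assumptions chosen so that the capped profit equals the \$240,000 quoted above, not values taken from this section.

```python
def profit(demand, quantity=40_000, unit_price=40.0,
           unit_cost=24.0, fixed_cost=400_000.0):
    """Assumed profit model: sales cannot exceed the quantity produced."""
    units_sold = min(demand, quantity)
    revenue = unit_price * units_sold
    cost = fixed_cost + unit_cost * quantity
    return revenue - cost


# Build the what-if table: one row per demand level.
demands = range(10_000, 70_001, 10_000)
what_if_table = [(d, profit(d)) for d in demands]

for d, p in what_if_table:
    print(f"{d:>7,} {p:>12,.0f}")
```

Because `profit` reads its other inputs from arguments rather than hard-coded numbers, changing the quantity produced and rebuilding the table plays the same role as the cell references in the spreadsheet formula.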

Figure 11.19 What-If Table for Uncertain Demand

Figure 11.20 Chart of What-If Analysis


Conducting what-if analysis in this fashion can be quite tedious. Fortunately, Excel provides several tools—data tables, Scenario Manager, and Goal Seek—that facilitate what-if and other types of decision model analyses. These can be found within the What-If Analysis menu in the Data tab.

Data Tables

Data tables summarize the impact of one or two inputs on a specified output. Excel allows you to construct two types of data tables. A one-way data table evaluates an output variable over a range of values for a single input variable. Two-way data tables evaluate an output variable over a range of values for two different input variables.

To create a one-way data table, first create a range of values for some input cell in your model that you wish to vary. The input values must be listed either down a column (column oriented) or across a row (row oriented). If the input values are column oriented, enter the cell reference for the output variable in your model that you wish to evaluate in the row above the first value and one cell to the right of the column of input values. Reference any other output variable cells to the right of the first formula. If the input values are listed across a row, enter the cell reference of the output variable in the column to the left of the first value and one cell below the row of values. Type any additional output cell references below the first one.

Next, select the range of cells that contains both the formulas and values you want to substitute. From the Data tab in Excel, select Data Table under the What-If Analysis menu. In the dialog box (see Figure 11.21), if the input range is column oriented, type the cell reference for the input cell in your model in the Column input cell box. If the input range is row oriented, type the cell reference for the input cell in the Row input cell box.

Example 11.16  A One-Way Data Table for Uncertain Demand

In this example, we create a one-way data table for profit for varying levels of demand. First, create a column of demand values in column E exactly as we did in Example 11.15. Then in cell F3, enter the formula =C22. This simply references the output of the profit model. Highlight the range E3:F11 (note that this range includes both the column of demand as well as the cell reference to profit), and select Data Table from the What-If Analysis menu. In the Column input cell field, enter B8; this tells the tool that the values in column E are different values of demand in the model. When you click OK, the tool produces the results (which we formatted as currency) shown in Figure 11.22.

We may evaluate multiple outputs using one-way data tables.

Example 11.17  One-Way Data Tables with Multiple Outputs

Suppose that we want to examine the impact of the uncertain demand on revenue in addition to profit. We simply add another column to the data table. For this case, insert the formula =C15 into cell G3. Also, add the labels “Profit” in F2 and “Revenue” in G2 to identify the results. Then, highlight the range E3:G11 and proceed as described in the previous example. This process results in the data table shown in Figure 11.23.

Figure 11.21 Data Table Dialog
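What the one-way Data Table tool does can be imitated in a few lines of Python: substitute each trial value into one input, recompute the model, and record the chosen outputs. The sketch below is an illustration only; `profit_model`, `one_way_table`, and the base-case numbers are assumptions standing in for the spreadsheet, not its actual contents.

```python
def profit_model(inputs):
    """Assumed stand-in for the spreadsheet model; returns several outputs."""
    units_sold = min(inputs["demand"], inputs["quantity"])
    revenue = inputs["unit_price"] * units_sold
    cost = inputs["fixed_cost"] + inputs["unit_cost"] * inputs["quantity"]
    return {"profit": revenue - cost, "revenue": revenue}


def one_way_table(model, base, input_name, values, outputs):
    """Vary one input over `values`; report each requested output per row."""
    rows = []
    for v in values:
        trial = {**base, input_name: v}  # like substituting into the input cell
        result = model(trial)
        rows.append((v, *[result[name] for name in outputs]))
    return rows


base = {"demand": 50_000, "quantity": 40_000, "unit_price": 40.0,
        "unit_cost": 24.0, "fixed_cost": 400_000.0}

table = one_way_table(profit_model, base, "demand",
                      range(10_000, 90_001, 10_000), ["profit", "revenue"])
for row in table:
    print(row)
```

Adding a second name to `outputs` corresponds to adding another formula column to the data table, as in Example 11.17. The base case is copied for each trial, so it is left unchanged.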


Figure 11.22 One-Way Data Table for Uncertain Demand

Figure 11.23 One-Way Data Table with Two Outputs

To create a two-way data table, type a list of values for one input variable in a column and a list of input values for the second input variable in a row, starting one row above and one column to the right of the column list. In the cell in the upper left-hand corner immediately above the column list and to the left of the row list, enter the cell reference of the output variable you wish to evaluate. Select the range of cells that contain this cell reference and both the row and column of values. On the What-If Analysis menu, click Data Table. In the Row input cell of the dialog box, enter the reference for the input cell in the model that corresponds to the input values in the row. In the Column input cell box, enter the reference for the input cell in the model that corresponds to the input values in the column. Then click OK. Two-way data tables can evaluate only one output variable. To evaluate multiple output variables, you must construct multiple two-way tables.

Example 11.18  A Two-Way Data Table for the Profit Model

In most models, the assumptions used for the input data are often uncertain. For example, in the profit model, the unit cost might be affected by supplier price changes and inflationary factors. Marketing might be considering price adjustments to meet profit goals. We use a two-way data table to evaluate the impact of changing these assumptions. First, create a column for the unit prices you wish to evaluate and a row for the unit costs in the form of a matrix. In the upper left corner enter the formula =C22, which references the profit in the model. Select the range of all the data (not including the descriptive titles) and then select the data table tool in the What-If Analysis menu. In the Data Table dialog, enter B6 for the Row input cell since the unit cost corresponds to cell B6 in the model, and enter B5 for the Column input cell since the unit price corresponds to cell B5. Figure 11.24 shows the completed result.

Figure 11.24 Two-Way Data Table
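A two-way data table is simply the output evaluated on a grid of two inputs. As a rough Python sketch (the profit function and all numeric values are assumptions for illustration, and `two_way_table` is a hypothetical helper), with unit prices listed down the column and unit costs across the row:

```python
def profit(unit_price, unit_cost, demand=50_000, quantity=40_000,
           fixed_cost=400_000.0):
    """Assumed profit model with sales capped at the quantity produced."""
    return unit_price * min(demand, quantity) - fixed_cost - unit_cost * quantity


def two_way_table(output, row_values, column_values):
    """One result row per column value; entries across follow row_values."""
    return [[output(col_v, row_v) for row_v in row_values]
            for col_v in column_values]


unit_costs = [22.0, 24.0, 26.0]    # listed across the row (Row input cell)
unit_prices = [35.0, 40.0, 45.0]   # listed down the column (Column input cell)

grid = two_way_table(profit, unit_costs, unit_prices)
for price, row in zip(unit_prices, grid):
    print(price, row)
```

Only one output can occupy the grid, which is why evaluating a second output requires building a second table.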

Scenario Manager

The Excel Scenario Manager tool allows you to create scenarios, sets of values that are saved and can be substituted automatically on your worksheet. Scenarios are useful for conducting what-if analyses when you have more than two input variables (which data tables cannot handle). The Excel Scenario Manager is found under the What-If Analysis menu in the Data Tools group on the Data tab. When the tool is started, click the Add button to open the Add Scenario dialog and define a scenario (see Figure 11.25). Enter the name of the scenario in the Scenario name box. In the Changing cells box, enter the references, separated by commas, for the cells in your model that you want to include in the scenario (or hold down the Ctrl key and click on the cells). In the Scenario Values dialog that appears next, enter values for each of the changing cells. If you have put these into your spreadsheet, you can simply reference them.

After all scenarios are added, they can be selected by clicking on the name of the scenario and then the Show button. Excel will change all values of the cells in your spreadsheet to correspond to those defined by the scenario for you to see the results within the model. When you click the Summary button on the Scenario Manager dialog, you will be prompted to enter the result cells and choose either a summary or a PivotTable report. The Scenario Manager can handle up to 32 variables.


Example 11.19  Using the Scenario Manager for the Markdown Pricing Model

In the Markdown Pricing Model spreadsheet, suppose that we wish to evaluate four different strategies, which are shown in Figure 11.26. In the Add Scenario dialog, enter Ten/ten as the scenario name, and specify the changing cells as B7 and B8 (that is, the number of days at full retail price and the intermediate markdown). In the Scenario Values dialog, enter the values for these variables in the appropriate fields, or enter the formulas for the cell references; for instance, enter =E2 for the changing cell B7 or =E3 for the changing cell B8. Repeat this process for each scenario. Click the Summary button. In the Scenario Summary dialog that appears next, enter C33 (the total revenue) as the result cell. The Scenario Manager evaluates the model for each combination of values and creates the summary report shown in Figure 11.27. The results indicate that the largest total revenue can be obtained using the twenty/twenty markdown strategy.
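The logic behind a scenario summary is to apply each named set of changing values to a base case and record the result cell. The sketch below illustrates this in Python. Because the formulas of the Markdown Pricing Model are not reproduced here, it uses an assumed profit model instead; the scenario names, the input values, and the helper `scenario_summary` are all hypothetical.

```python
def profit_model(inputs):
    """Assumed stand-in model; the result plays the role of the result cell."""
    units_sold = min(inputs["demand"], inputs["quantity"])
    return (inputs["unit_price"] * units_sold
            - inputs["fixed_cost"]
            - inputs["unit_cost"] * inputs["quantity"])


base = {"demand": 50_000, "quantity": 40_000, "unit_price": 40.0,
        "unit_cost": 24.0, "fixed_cost": 400_000.0}

# Each scenario lists only its changing cells; other inputs keep base values.
scenarios = {
    "Base case": {},
    "Low demand": {"demand": 30_000},
    "Price cut, high volume": {"unit_price": 36.0, "quantity": 50_000},
}


def scenario_summary(model, base, scenarios):
    """Evaluate the model once per scenario, like the Summary report."""
    return {name: model({**base, **changes})
            for name, changes in scenarios.items()}


summary = scenario_summary(profit_model, base, scenarios)
best = max(summary, key=summary.get)
for name, value in summary.items():
    print(f"{name:<24} {value:>12,.0f}")
print("Best scenario:", best)
```

Unlike a data table, each scenario may change any number of inputs at once, which is the main reason to use scenarios when more than two inputs are involved.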

Goal Seek

If you know the result that you want from a formula but are not sure what input value the formula needs to get that result, use the Goal Seek feature in Excel. Goal Seek works only with one variable input value. If you want to consider more than one input value or wish to maximize or minimize some objective, you must use the Solver add-in, which is discussed in other chapters.

On the Data tab, in the Data Tools group, click What-If Analysis, and then click Goal Seek. The dialog shown in Figure 11.28 will appear. In the Set cell box, enter the reference for the cell that contains the formula that you want to resolve. In the To value box, type the formula result that you want. In the By changing cell box, enter the reference for the cell that contains the value that you want to adjust.

Figure 11.26 Markdown Pricing Model with Scenarios

Figure 11.27 Scenario Summary for the Markdown Pricing Model

Figure 11.28 Goal Seek Dialog


Figure 11.29 Break-Even Analysis Using Goal Seek

Example 11.20  Finding the Break-Even Point in the Outsourcing Model

In the outsourcing decision model that we introduced in Chapter 1 and implemented on a spreadsheet in Example 11.3, we might wish to find the break-even point. The break-even point is the value of demand volume for which total manufacturing cost equals total purchased cost, or, equivalently, for which the difference is zero. Therefore, you seek to find the value of production volume in cell B12 that yields a value of zero in cell B19. In the Goal Seek dialog, enter B19 for the Set cell, enter 0 in the To value box, and enter B12 in the By changing cell box. The Goal Seek tool determines that the break-even volume is 1,000 and enters this value in cell B12 in the model, as shown in Figure 11.29.
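Goal Seek is a one-variable root-finding search. The Python sketch below shows the same idea with a simple bisection search, which is a substitute technique chosen for clarity and is not a description of Excel's internal algorithm. The cost figures (fixed cost 50,000, in-house unit cost 125, supplier unit cost 175) are assumptions chosen so that the break-even volume matches the 1,000 units reported in the example.

```python
def cost_difference(volume, fixed_cost=50_000.0, unit_cost=125.0,
                    supplier_cost=175.0):
    """Total manufacturing cost minus total purchased cost (like cell B19)."""
    manufacturing = fixed_cost + unit_cost * volume
    purchased = supplier_cost * volume
    return manufacturing - purchased


def goal_seek(f, target, low, high, tol=1e-6, max_iter=200):
    """Bisection search for x in [low, high] such that f(x) equals target."""
    g_low = f(low) - target
    g_high = f(high) - target
    if g_low * g_high > 0:
        raise ValueError("target is not bracketed by low and high")
    for _ in range(max_iter):
        mid = (low + high) / 2
        g_mid = f(mid) - target
        if abs(g_mid) < tol or (high - low) / 2 < tol:
            return mid
        if g_low * g_mid <= 0:
            high = mid               # the root lies in the lower half
        else:
            low, g_low = mid, g_mid  # the root lies in the upper half
    return (low + high) / 2


break_even = goal_seek(cost_difference, 0.0, 500, 3000)
print(round(break_even, 2))
```

Here `cost_difference` corresponds to the Set cell, the target of 0 to the To value box, and the searched volume to the By changing cell.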

Model Analysis Using Analytic Solver Platform

Analytic Solver Platform (see the section in Chapter 2 regarding spreadsheet add-ins) provides sensitivity analysis capabilities to explore a spreadsheet model and identify and visualize the key input parameters that have the greatest impact on model results.

Parametric Sensitivity Analysis

Parametric sensitivity analysis is the term used by Analytic Solver Platform for systematic methods of what-if analysis. A parameter is simply a piece of input data in a model. With Analytic Solver Platform you can easily create one- and two-way data tables and a special type of chart, called a tornado chart, that provides useful what-if information.

Example 11.21  Creating Data Tables with Analytic Solver Platform

Suppose that we wish to create a one-way data table to evaluate the profit as the unit price in cell B5 is varied between \$35 and \$45 in the profit model (see Figure 11.4). First, define this cell as a parameter in Analytic Solver Platform. Select cell B5 and then click the Parameters button in the ribbon (Figure 11.30), and select Sensitivity. This opens a Function Arguments dialog (Figure 11.31), in which you specify a set of values or a range. To create the data table, select the result cell that corresponds to the model output (in this case, cell C22). Then click the Reports button and click on Parameter Analysis from the Sensitivity menu. This displays a Sensitivity Report dialog (Figure 11.32). You may use the arrows to move cells into the panes on the right; this is useful if you have defined multiple input parameters and want to conduct different sensitivity analyses. Analytic Solver Platform will create a new worksheet with the data table, as shown in Figure 11.33.

To create a two-way data table, define two inputs as parameters and select both in the Sensitivity Report dialog. For example, we might want to change both the unit price as well as the unit cost. With two parameters, be sure to check the box Vary Parameters Independently near the bottom. You can also create charts to visualize the data tables by selecting the results cell, clicking the Charts button, and then clicking Parameter Analysis from the Sensitivity menu. Figure 11.34 shows a two-way data table and a three-dimensional chart when both the unit price and unit cost are varied. We encourage you to replace the cell references (\$B\$5, \$B\$6, and \$C\$22) by descriptive names to facilitate understanding the results.

Figure 11.30 Analytic Solver Platform Ribbon

Figure 11.31 Analytic Solver Platform Function Arguments Dialog

Figure 11.32 Sensitivity Report Dialog


Figure 11.33 Sensitivity Analysis Report— One-Way Data Table

Figure 11.34 Two-Way Data Table and Chart

As we have seen, charts, graphs, and other visual aids play an important part in analyzing data and models. One useful tool is a tornado chart. A tornado chart graphically shows the impact that variation in a model input has on some output while holding all other inputs constant. Typically, we choose a base case and then vary the inputs by some percentage, say plus or minus 10% or 20%. As each input is varied, we record the values of the output and chart the ranges of the output in a bar chart, ordered from the widest range to the narrowest. This usually results in a funnel shape, hence the name. A tornado chart shows which inputs are the most influential on the output and which are the least influential. If these inputs are uncertain, then you would probably want to study the more influential ones to reduce uncertainty and its effect on the output. If the effects are small, you might ignore any uncertainty or eliminate those effects from the model. Tornado charts are also useful in helping you select the inputs that you would want to analyze further with data tables or scenarios.

Example 11.22  Creating a Tornado Chart in Analytic Solver Platform

Creating a tornado chart in Analytic Solver Platform is extremely easy to do. Analytic Solver Platform automatically identifies all the data input cells on which the output cell depends and creates the chart. In the Profit Model spreadsheet, select cell C22; then click the Parameters button and choose Identify. Figure 11.35 shows the results. We see that a 10% change in cell B5, the unit price, affects profit the most, followed by the unit cost, quantity produced, fixed cost, and demand. If you don’t want to vary all parameters by the same percentage, you may define ranges in the same fashion as we did for the data table examples.
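The calculation behind a tornado chart can be sketched directly: vary one input at a time by plus or minus 10%, hold the others at their base values, and sort the inputs by the width of the resulting output range. The Python sketch below is a hand-rolled illustration, not Analytic Solver Platform's procedure; the model structure and base-case values are assumptions.

```python
def profit_model(inputs):
    """Assumed profit model with sales capped at the quantity produced."""
    units_sold = min(inputs["demand"], inputs["quantity"])
    return (inputs["unit_price"] * units_sold
            - inputs["fixed_cost"]
            - inputs["unit_cost"] * inputs["quantity"])


def tornado(model, base, pct=0.10):
    """Return (input, low output, high output), widest range first."""
    bars = []
    for name, value in base.items():
        down = model({**base, name: value * (1 - pct)})
        up = model({**base, name: value * (1 + pct)})
        bars.append((name, min(down, up), max(down, up)))
    return sorted(bars, key=lambda bar: bar[2] - bar[1], reverse=True)


base = {"demand": 50_000, "quantity": 40_000, "unit_price": 40.0,
        "unit_cost": 24.0, "fixed_cost": 400_000.0}

bars = tornado(profit_model, base)
for name, low, high in bars:
    print(f"{name:<11} {low:>12,.0f} {high:>12,.0f}")
```

With these assumed inputs the ranking comes out as unit price, unit cost, quantity produced, fixed cost, and demand, the same order reported in Example 11.22. Demand has no effect here because both the low and high demand values still exceed the quantity produced.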


Figure 11.35 Tornado Sensitivity Chart for the Profit Model

Key Terms

Data table
Data validation
Form controls
Newsvendor problem
One-way data table
Overbook
Parametric sensitivity analysis
Pro forma income statement
Scenarios
Spreadsheet engineering
Tornado chart
Two-way data table
Validity
Verification
What-if analysis

Problems and Exercises 1. Develop a spreadsheet model for gasoline usage

scenario, Problem 4 in Chapter 1, using the data provided. Apply the principles of spreadsheet engineering in developing your model. 2. Develop a spreadsheet model for Problem 5 in Chap-

ter 1. Apply the principles of spreadsheet engineering in developing your model. Use the spreadsheet to create a table for a range of prices to help you identify the price that results in the maximum revenue. 3. Develop a spreadsheet model to determine how

much a person or a couple can afford to spend on a house.7 Lender guidelines suggest that the allowable monthly housing expenditure should be no more than 28% of monthly gross income. From this, you must subtract total nonmortgage housing expenses, which would include insurance and property taxes and any other additional expenses. This defines the 7Based

affordable monthly mortgage payment. In addition, guidelines also suggest that total affordable monthly debt payments, including housing expenses, should not exceed 36% of gross monthly income. This is calculated by subtracting total nonmortgage housing expenses and any other installment debt, such as car loans, student loans, credit-card debt, and so on, from 36% of total monthly gross income. The smaller of the affordable monthly mortgage payment and the total affordable monthly debt payments is the affordable monthly mortgage. To calculate the maximum that can be ­b orrowed, find the monthly payment per \$1,000 mortgage based on the current interest rate and duration of the loan. Divide the affordable monthly mortgage amount by this monthly payment to find the affordable mortgage. A ­ ssuming a 20% down payment, the maximum price of a house would be the affordable mortgage divided by 0.8. Use the

on Ralph R. Frasca, Personal Finance, 8th ed. (Boston: Prentice Hall, 2009).

398

Chapter 11  Spreadsheet Modeling and Analysis

following data to test your model: total monthly gross income = \$6,500; nonmortgage housing expenses =  \$350; monthly installment debt = \$500; monthly payment per \$1,000 mortgage = \$7.25. 4. A company records the following components of

fixed and variable costs for a product. Fixed Cost (in dollars): Plaint Maintenance - 15,000 Salaries - 40,000 Depreciation - 100,000 Rent - 8,000 Manufacturing expenses - 12,000 Advertising - 5,000 Administrative expenses - 20,000 Variable Cost per unit: Labor - 3.00, Materials - 5.00, Sales Commission - 2.00 Assuming Sales Price per unit = \$15, develop a spreadsheet model to calculate the break-even point using the above. Design your spreadsheet using effective spreadsheet-engineering principles. 5. For inventory problems, the cost is a function of the

order size. A company has collected the following data for one of its product. Annual requirement, R = 12,000 Ordering cost per order, S = 150 Cost per unit, C = 4 Carrying cost per unit, I = 0.20 Quantity ordered per order, Q = 100 Develop a general model for computing ordering cost, carrying cost, and total cost functions. Use the following formulas: Ordering cost = (R/Q)*S Carrying Cost = (Q/2)*I*C Total Cost = Ordering cost + Carrying Cost 6. A (greatly) simplified model of the national econ-

omy can be described as follows. The national income is the sum of three components: consumption, investment, and government spending. Consumption is related to the total income of all individuals and to the taxes they pay on income. Taxes depend on total income and the tax rate. Investment is also related to the size of the total income.

8Based

a. Use this information to draw an influence diagram by recognizing that the phrase “A is related to B” implies that A influences B in the model. . If we assume that the phrase “A is related to B” b can be translated into mathematical terms as A = kB, where k is some constant, develop a mathematical model for the information provided. 7. Thomas wants to predict the sales figures of his

company for the upcoming year. On the basis of historical data, he concludes that a linear function passes through the observed data points for the first and sixth years. The sales figure for the first year is \$24,000, and for the sixth year is \$2,000. Develop a spreadsheet model to find intercept and slope of the linear function and predict sales for the seventh year. 8. The Radio Shop sells two popular models of portable

sport radios, model A and model B. The sales of these products are not independent of each other (in economics, we call these substitutable products, because if the price of one increases, sales of the other will i­ncrease). The store wishes to establish a pricing policy to maximize revenue from these products. A study of price and sales data shows the following relationships between the quantity sold (N) and prices (P) of each model: NA = 20 - 0.62PA + 0.30PB NB = 29 + 0.10PA - 0.60PB a. Construct a model for the total revenue and implement it on a spreadsheet. . What is the predicted revenue if PA = +18 and b PB = +30? What if the prices are PA = +25 and PB = +50? 9. For a new product, sales volume in the first year is

estimated to be 80,000 units and is projected to grow at a rate of 4% per year. The selling price is \$12 and will increase by \$0.50 each year. Per-unit variable costs are \$3, and annual fixed costs are \$400,000. Per-unit costs are expected to increase 5% per year. Fixed costs are expected to increase 8% per year. Develop a spreadsheet model to calculate the net present value of profit over a 3-year period, assuming a 4% discount rate. 10. A stockbroker calls on potential clients from refer-

rals. For each call, there is a 10% chance that the client will decide to invest with the firm. Fifty-five

on an example of the Parfitt-Collins model in Gary L. Lilien, Philip Kotler, and K. Sridhar Moorthy, Marketing Models (Englewood Cliffs, NJ: Prentice Hall, 1992): 483.

Chapter 11  Spreadsheet Modeling and Analysis

percent of those interested are found not to be qualified, based on the brokerage firm’s screening criteria. The remaining are qualified. Of these, half will invest an average of \$5,000, 25% will invest an average of \$20,000, 15% will invest an average of \$50,000, and the remainder will invest \$100,000. The commission schedule is as follows: Transaction Amount

Commission

Up to \$25,000

\$50 + 0.5% of the amount

\$25,001 to \$50,000

\$75 + 0.4% of the amount

\$50,001 to \$100,000

\$ 125 + 0.3% of the amount

The broker keeps half the commission. Develop

a spreadsheet to calculate the broker’s commission based on the number of calls per month made. What is the expected commission based on making 600 calls? 11. The director of a nonprofit ballet company in a me-

dium-sized U.S. city is planning its next fundraising campaign. In recent years, the program has found the following percentages of donors and gift levels: Gift Level

Amount

Benefactor

\$10,000

3

Philanthropist

\$5,000

10

Producer’s Circle

\$1,000

25

Director’s Circle

\$500

50

Principal

\$100

7% of solicitations

\$50

12% of solicitations

Soloist

Develop a spreadsheet model to calculate the to-

tal amount donated based on this information if the number of the company contacts 1000 potential donors to donate at the \$100 level or below. 12. A gasoline mini-mart orders 25 copies of a monthly

magazine. Depending on the cover story, demand for the magazine varies. The gasoline mini-mart purchases the magazines for \$1.50 and sells them for \$4.00. Any magazines left over at the end of the month are donated to hospitals and other health-care facilities. ­Modify the newsvendor example spreadsheet to model this situation. Investigate the ­financial implications of this policy if the demand is expected

399

to vary between 10 and 30 copies each month. How many must be sold to at least break even? 13. Koehler Vision Associates (KVA) specializes in

laser-assisted corrective eye surgery. Prospective patients make appointments for prescreening exams to determine their candidacy for the surgery: if they qualify, a \$250 charge is applied as a deposit for the actual procedure. The weekly demand is 150, and about 12% of prospective patients fail to show up or cancel their exam at the last minute. Patients that do not show up are refunded the prescreening fee less a \$25 processing fee. KVA can handle 125 patients per week and is considering overbooking its appointments to reduce the lost revenue associated with cancellations. However, any patient that is overbooked may spread unfavorable comments about the company; thus, the overbooking cost is estimated to be \$125. Develop a spreadsheet model for calculating net revenue. Find the net revenue and number overbooked if 140 through 150 appointments are taken. 14. Tanner Park is a small amusement park that provides

a variety of rides and outdoor activities for children and teens. In a typical summer season, the number of adult and children’s tickets sold are 20,000 and 10,000, respectively. Adult ticket prices are \$18 and the children’s price is \$10. Revenue from food and beverage concessions is estimated to be \$60,000, and souvenir revenue is expected to be \$25,000. Variable costs per person (adult or child) are \$3, and fixed costs amount to \$150,000. Determine the profitability of this business. 15. With the growth of digital photography, a young

entrepreneur is considering establishing a new business, Cruz Wedding Photography. He believes that the average number of wedding bookings per year is 15. One of the key variables in developing his business plan is the life he can expect from a single digital single lens reflex (DSLR) camera before it needs to be replaced. Due to heavy usage, the shutter life expectancy is estimated to be 150,000 clicks. For each booking, the average number of photographs taken is assumed to be 2,000. Develop a model to determine the camera life (in years). 16. The Executive Committee of Reder Electric Vehicles

is debating whether to replace its original model, the REV-Touring, with a new model, the REV-Sport, which would appeal to a younger audience. Whatever vehicle chosen will be produced for the next 4 years,

400

Chapter 11  Spreadsheet Modeling and Analysis

after which time a reevaluation will be necessary. The REV-Sport has passed through the concept and initial design phases and is ready for final design and manufacturing. Final development costs are estimated to be \$75 million, and the new fixed costs for tooling and manufacturing are estimated to be \$600 million. The REV-Sport is expected to sell for \$30,000. The first year sales for the REV-Sport is estimated to be 60,000, with a sales growth for the subsequent years of 6% per year. The variable cost per vehicle is uncertain until the design and supply-chain decisions are finalized, but is estimated to be \$22,000. Next-year sales for the REV-Touring are estimated to be 50,000, but the sales are expected to decrease at a rate of 10% for each of the next 3 years. The selling price is \$28,000. Variable costs per vehicle are \$21,000. Since the model has been in production, the fixed costs for development have already been recovered. Develop a 4-year model to recommend the best decision using a net present value discount rate of 5%. How sensitive is the result to the estimated variable cost of the REVSport? How might this affect the decision? 17. The Schoch Museum is embarking on a 5-year fun-

draising campaign. As a nonprofit institution, the museum finds it challenging to acquire new donors as many donors do not contribute every year. Suppose that the museum has identified a pool of 8,000 potential donors. The actual number of donors in the first year of the campaign is estimated to be 65% of this pool. For each subsequent year, the museum expects that 35% of current donors will discontinue their contributions. In addition, the museum expects to attract some percentage of new donors. This is assumed to be 10% of the pool. The average contribution in the first year is assumed to be \$50, and will increase at a rate of 2.5%. Develop a model to predict the total funds that will be raised over the 5-year period, and investigate the impacts of the percentage assumptions used in the model. 18. Apply the data-validation tool to the Bank Data Ex-

cel file with an error alert message box to ensure that a two-digit number is correctly entered under Age, the data entered under ZipCode should not exceed 5 digits, and the Education field takes the values 1, 2 and 3 for ‘undergraduate’, ‘graduate’ and ‘post graduate’ respectively. Enter some fictitious additional data to verify that your results are correct. 19. Insert a spin button and scroll bar in the Outsourcing

Decision Model to allow the user to easily change the production volume in cell B12 from 500 to 3000. Which one is easier to use? Discuss the pros and cons of each.

20. Insert a spin button in the car lease purchase model to change the discount rate in cell F8 from 1% to 10% in increments of one-tenth.

21. For the Pro Forma Income Statement model in the Excel file Net Income Models (Figure 11.7), add a scroll bar form control to allow the user to easily change the level of sales from 3,000,000 to 10,000,000 in increments of 1,000 and recalculate the spreadsheet. (Hint: the scroll values must be between 0 and 30,000 so you will need to modify the spreadsheet to make it work correctly.)

22. Create a new worksheet in the Retirement Portfolio workbook. In this worksheet, add a list box form control to allow the user to select one of the mutual funds on the original worksheet, and display a summary of the net asset value, number of shares, and total value using the VLOOKUP function. (Hint: your list box should show the fund names, but you will need to modify the original spreadsheet to use VLOOKUP correctly!)

23. The Excel sheet Travelling Salesman contains data on the cost incurred by a salesman in travelling from one city to another. Using this data matrix, add list box controls so that the manager can choose two cities and find the cost of travelling between them. (Hint: Set the cell links to be any blank cells, as the list boxes return the position number in the list; then use VLOOKUP to find the cost.)

24. Problem 15 in Chapter 1 posed the following situation: A manufacturer of mp3 players is preparing to set the price on a new model. Demand is thought to depend on the price and is represented by the model D = 2,500 - 3P. The accounting department estimates that the total costs can be represented by C = 5,000 + 5D. Implement your model on a spreadsheet and construct a one-way data table to estimate the price for which profit is maximized.

25. Problem 16 in Chapter 1 posed the following situation: The demand for airline travel is quite sensitive to price. Typically, there is an inverse relationship between demand and price; when price decreases, demand increases, and vice versa. One major airline has found that when the price (p) for a round trip between Chicago and Los Angeles is \$600, the demand (D) is 500 passengers per day. When the price is reduced to \$400, demand is 1,200 passengers per day. You were asked to develop an appropriate model. Implement your model on a spreadsheet and use a data table to estimate the price that maximizes total revenue.

26. Develop a spreadsheet model for determining value, using the simple valuation function Value = D/(r - g), where r is the discount rate = 10%, g is the growth rate = 4%, and D is the dividend = 1.25. Use a two-way data table to determine value if g varies from 1% to 5% in increments of 1%, and r varies from 8% to 16% in increments of 2%.

27. The booking price for a motivational seminar (held every week) is \$650 per booking, with maximum seats = 100. The total cost of arranging such a seminar comes to \$35,000 per week. The manager offers a 10% discount on group bookings, allowing 5 seats per group. On average, he receives 2 to 10 (the maximum allowed in a seminar) group booking orders. Construct a spreadsheet model to determine the profit if all seats are booked and none of them is a group booking.
a. Use data tables to evaluate the profit for the specified range of booked group seats.
b. Suppose the manager is considering lowering or increasing the group booking discount by 5%. How will profit be affected?

28. For the Koehler Vision Associates model you developed in Problem 13, use data tables to study how revenue is affected by changes in the number of appointments accepted and patient demand.

29. For the stockbroker model you developed in Problem 10, use data tables to show how the commission is a function of the number of calls made.

30. For the nonprofit ballet company fundraising model you developed in Problem 11, use a data table to show how the amount varies based on the number of solicitations.

31. For the garage-band model you developed in Problem 7, define and run some reasonable scenarios using the Scenario Manager to evaluate profitability for the following scenarios:

Scenarios for Problem 31

| | Likely | Optimistic | Pessimistic |
|---|---|---|---|
| Expected Crowd | 3000 | 4500 | 2500 |
| Concession Expenditure | \$15 | \$20 | \$12.50 |
| Fixed cost | \$10,000 | \$8,500 | \$12,500 |

32. Think of any retailer that operates many stores throughout the country, such as Old Navy, Hallmark Cards, or Radio Shack, to name just a few. The retailer is often seeking to open new stores and needs to evaluate the profitability of a proposed location that would be leased for 5 years. An Excel model is provided in the New Store Financial Model spreadsheet. Use Scenario Manager to evaluate the cumulative discounted cash flow for the fifth year under the following scenarios:

Scenarios for Problem 32

| | Scenario 1 | Scenario 2 | Scenario 3 |
|---|---|---|---|
| Inflation rate | 1% | 5% | 3% |
| Cost of merchandise (% of sales) | 25% | 30% | 26% |
| Labor cost | \$150,000 | \$225,000 | \$200,000 |
| Other expenses | \$300,000 | \$350,000 | \$325,000 |
| First-year sales revenue | \$600,000 | \$600,000 | \$800,000 |
| Sales growth year 2 | 15% | 22% | 25% |
| Sales growth year 3 | 10% | 15% | 18% |
| Sales growth year 4 | 6% | 11% | 14% |
| Sales growth year 5 | 3% | 5% | 8% |


33. The Hyde Park Surgery Center specializes in high-risk cardiovascular surgery. The center needs to forecast its profitability over the next 3 years to plan for capital growth projects. For the first year, the hospital anticipates serving 1,200 patients, which is expected to grow by 8% per year. Based on current reimbursement formulas, each patient provides an average billing of \$125,000, which will grow by 3% each year. However, because of managed care, the center collects only 25% of billings. Variable costs for supplies and drugs are calculated to be 10% of billings. Fixed costs for salaries, utilities, and so on, will amount to \$20,000,000 in the first year and are assumed to increase by 5% per year. Develop a spreadsheet model to calculate the net present value of profit over the next 3 years. Use a discount rate of 4%. Define three reasonable scenarios that the center director might wish to evaluate and use the Scenario Manager to compare them.

34. For the garage-band model in Problem 7, construct a tornado chart and explain the sensitivity of each of the model’s parameters on total profit.

35. For the new-product model in Problem 9, construct a tornado chart and explain the sensitivity of each of the model’s parameters on the NPV of profit.

36. The admissions director of an engineering college has \$500,000 in scholarships each year from an endowment to offer to high-achieving applicants. The value of each scholarship offered is \$25,000 (thus, 20 scholarships are offered). The benefactor who provided the money would like to see all of it used each year for new students. However, not all students accept the money; some take offers from competing schools. If they wait until the admissions deadline to decline the scholarship, it cannot be offered to someone else because any other good students would already have committed to other programs. Consequently, the admissions director offers more money than is available in anticipation that a percentage of offers will be declined. If more than 20 students accept the offers, the college is committed to honoring them, and the additional amount has to come out of the dean’s budget. Based on prior history, the percentage of applicants that accept the offer is about 70%. Develop a spreadsheet model for this situation, and apply whatever analysis tools you deem appropriate to help the admissions director make a decision on how many scholarships to offer. Explain your results in a business memo to the director, Mr. P. Woolston.

Case: Performance Lawn Equipment

Part 1: The Performance Lawn Equipment database contains data needed to develop a pro forma income statement. Dealers selling PLE products all receive 18% of sales revenue for their part of doing business, and this is accounted for as the selling expense. The tax rate is 50%. Develop an Excel worksheet to extract and summarize the data needed to develop the income statement for 2014 and implement an Excel model in the form of a pro forma income statement for the company.

Part 2: The CFO of Performance Lawn Equipment, J. Kenneth Valentine, would like to have a model to predict the net income for the next 3 years. To do this, you need to determine how the variables in the pro forma income statement will likely change in the future. Using the calculations and worksheet that you developed along with other historical data in the database, estimate the annual rate of change in sales revenue, cost of goods sold, operating expense, and interest expense. Use these rates to modify the pro forma income statement to predict the net income over the next 3 years. Because the estimates you derived from the historical data may not hold in the future, conduct appropriate what-if, scenario, and/or parametric sensitivity analyses to investigate how the projections might change if these assumptions don’t hold. Construct a tornado chart to show how the assumptions impact the net income in your model. Summarize your results and conclusions in a report to Mr. Valentine.

Chapter 12  Monte Carlo Simulation and Risk Analysis


Learning Objectives

After studying this chapter, you will be able to:

• Explain the concept and importance of analyzing risk in business decisions.
• Use data tables to conduct simple Monte Carlo simulations.
• Use Analytic Solver Platform to develop, implement, and analyze Monte Carlo simulation models.
• Compute confidence intervals for the mean value of an output in a simulation model.
• Construct and interpret sensitivity, overlay, trend, and box-whisker charts for a simulation model.
• Explain the significance of the “flaw of averages.”
• Conduct Monte Carlo simulation using historical data and resampling techniques.
• Use fitted distributions to define uncertain variables in a simulation.
• Define and use custom distributions in Monte Carlo simulations.
• Correlate uncertain variables in a simulation model using Analytic Solver Platform.


For many of the predictive decision models we developed in Chapter 11, all the data, particularly the uncontrollable inputs, were assumed to be known and constant. Other models, such as the newsvendor, overbooking, and retirement-planning models, incorporated uncontrollable inputs, such as customer demand, hotel cancellations, and annual returns on investments, which exhibit random behavior. We often assume such variables to be constant to simplify the model and the analysis. However, many situations dictate that randomness be explicitly incorporated into our models. This is usually done by specifying probability distributions for the appropriate uncontrollable inputs. As we noted earlier in this book, models that include randomness are called stochastic, or probabilistic, models. These types of models help to evaluate risks associated with undesirable consequences and to find optimal decisions under uncertainty.

Risk is the likelihood of an undesirable outcome. It can be assessed by evaluating the probability that the outcome will occur along with the severity of the outcome. For example, an investment that has a high probability of losing money is riskier than one with a lower probability. Similarly, an investment that may result in a \$10 million loss is certainly riskier than one that might result in only a \$10,000 loss. In assessing risk, we could answer questions such as, What is the probability that we will incur a financial loss? How do the probabilities of different potential losses compare? What is the probability that we will run out of inventory? What are the chances that a project will be completed on time?

Risk analysis is an approach for developing “a comprehensive understanding and awareness of the risk associated with a particular variable of interest (be it a payoff measure, a cash flow profile, or a macroeconomic forecast).”¹ Hertz and Thomas present a simple scenario to illustrate the concept of risk analysis:

The executives of a food company must decide whether to launch a new packaged cereal. They have come to the conclusion that five factors are the determining variables: advertising and promotion expense, total cereal market, share of market for this product, operating costs, and new capital investment. On the basis of the “most likely” estimate for each of these variables, the picture looks very bright: a healthy 30% return, indicating a significantly positive expected net present value. This future, however, depends on each of the “most likely” estimates coming true in the actual case. If each of these “educated guesses” has, for example, a 60% chance of being correct, there is only an 8% chance that all five will be correct (0.60 × 0.60 × 0.60 × 0.60 × 0.60) if the factors are assumed to be independent. So the “expected” return, or present value measure, is actually dependent on a rather unlikely coincidence. The decision maker needs to know a great deal more about the other values used to make each of the five estimates and about what he stands to gain or lose from various combinations of these values.²

¹ David B. Hertz and Howard Thomas, Risk Analysis and Its Applications (Chichester, UK: John Wiley & Sons, Ltd., 1983): 1.

Thus, risk analysis seeks to examine the impacts of uncertainty in the estimates and their potential interaction with one another on the output variable of interest. Hertz and Thomas also note that the challenge to risk analysts is to frame the output of risk analysis procedures in a manner that makes sense to the manager and provides clear insight into the problem, suggesting that simulation has many advantages.

In this chapter, we discuss how to build and analyze models involving uncertainty and risk using Excel. We then introduce Analytic Solver Platform to implement Monte Carlo simulation. We wish to point out that the topic of simulation can fill an entire book. An entirely different area of simulation, which we do not address in this book, is the simulation of dynamic systems, such as waiting lines, inventory systems, manufacturing systems, and so on. This requires different modeling and implementation tools, and is best approached using commercial software. Systems simulation is an important tool for analyzing operations, whereas Monte Carlo simulation, as we describe it, is focused more on financial risk analysis.

Spreadsheet Models with Random Variables

In Chapter 5, we described how to sample randomly from probability distributions and to generate certain random variates using Excel tools and functions. We will use these techniques to show how to incorporate uncertainty into decision models.

Example 12.1  Incorporating Uncertainty in the Outsourcing Decision Model

Refer back to the outsourcing decision model we introduced in Chapter 1 and for which we developed an Excel model in Chapter 11. The model is shown again in Figure 12.1. Assume that the production volume is uncertain. We can model the demand as a random variable having some probability distribution. Suppose the manufacturer has enough data and information to assume that demand (production volume) will be normally distributed with a mean of 1,000 and a standard deviation of 100. We could use the Excel function NORM.INV(probability, mean, standard_deviation), as described in Chapter 5, to generate random values of the demand (Production Volume) by replacing the input in cell B12 of the spreadsheet with the formula =ROUND(NORM.INV(RAND( ), 1000, 100),0). The ROUND function is used to ensure that the values will be whole numbers. Whenever the F9 key is pressed (on a Windows PC) or the Calculate Now button is clicked from the Calculation group in the Formula tab, the worksheet will be recalculated, and the value of demand will change randomly.
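For readers who want to see the same idea outside Excel, the following is a minimal Python sketch of the formula =ROUND(NORM.INV(RAND( ), 1000, 100),0). It assumes only the Python standard library; the function name sample_demand and the seed value are illustrative choices, not part of the spreadsheet model.

```python
import random
from statistics import NormalDist

def sample_demand(rng, mean=1000.0, sd=100.0):
    """Inverse-transform sample: the analogue of ROUND(NORM.INV(RAND(), mean, sd), 0)."""
    u = rng.random()          # plays the role of RAND()
    while u == 0.0:           # inv_cdf requires 0 < u < 1
        u = rng.random()
    # Python's round() and Excel's ROUND differ only on exact .5 ties,
    # which essentially never occur for a continuous variate.
    return round(NormalDist(mean, sd).inv_cdf(u))

rng = random.Random(42)       # fixed seed (an arbitrary choice) so the sketch is repeatable
demands = [sample_demand(rng) for _ in range(5)]
print(demands)                # five whole-number demands scattered around 1,000
```

Each call to sample_demand corresponds to one press of F9 in the worksheet: a new uniform random number is drawn and converted into a demand value.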

Monte Carlo Simulation

Monte Carlo simulation is the process of generating random values for uncertain inputs in a model, computing the output variables of interest, and repeating this process for many trials to understand the distribution of the output results. For example, in the outsourcing decision model, we can randomly generate the production volume and compute the cost difference and associated decision and then repeat this for some number of trials. Monte Carlo simulation can easily be accomplished on a spreadsheet using a data table.

Figure 12.1 Outsourcing Decision Model Spreadsheet

² Ibid., 24.

Example 12.2  Using Data Tables for Monte Carlo Spreadsheet Simulation

Figure 12.2 shows a Monte Carlo simulation for the outsourcing decision model (Excel file Outsourcing Decision Monte Carlo Simulation Model). First, construct a data table (see Chapter 11) by listing the number of trials down a column (here we used 20 trials) and referencing the cells associated with demand, the difference, and the decision in cells E3, F3, and G3, respectively (i.e., the formula in cell E3 is =B12; in cell F3, =B19; and in cell G3, =B20). Select the range of the table (D3:G23) and, here’s the trick, in the Column Input Cell field in the Data Table dialog, enter any blank cell in the spreadsheet. This is done because the trial number does not relate to any parameter in the model; we simply want to repeat the spreadsheet recalculation independently for each row of the data table, knowing that the demand will change each time because of the use of the RAND function in the demand formula. As you can see from the results, each trial has a randomly generated demand. The data table process substitutes these demands into cell B12 and finds the associated difference and decision in columns F and G. The average difference is \$535, and 55% of the trials resulted in outsourcing as the best decision; the histogram shows the distribution of the results. These results might suggest that, although the future demand is not known, the manufacturer’s best choice might be to outsource. However, there is a risk that this may not be the best decision.
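The data-table procedure can be mimicked with a short loop. The sketch below is an illustration only: the cell formulas are not reproduced in this section, so the fixed cost (50,000), the in-house unit cost (125), the supplier unit cost (175), and the decision rule (outsource when the difference is positive) are assumptions chosen to be consistent with the difference being defined as manufacturing cost minus purchase cost.

```python
import random

FIXED_COST = 50_000        # assumed fixed cost of manufacturing in-house
UNIT_VAR_COST = 125        # assumed in-house variable cost per unit
UNIT_PURCHASE_COST = 175   # assumed supplier price per unit

def run_trial(rng):
    """One row of the data table: demand, cost difference, decision."""
    demand = round(rng.gauss(1000, 100))             # random production volume
    manufacture = FIXED_COST + UNIT_VAR_COST * demand
    outsource = UNIT_PURCHASE_COST * demand
    difference = manufacture - outsource             # positive means outsourcing is cheaper
    decision = "Outsource" if difference > 0 else "Manufacture"
    return demand, difference, decision

rng = random.Random(1)                               # arbitrary seed for repeatability
trials = [run_trial(rng) for _ in range(20)]         # 20 trials, as in the example

avg_difference = sum(t[1] for t in trials) / len(trials)
share_outsource = sum(t[2] == "Outsource" for t in trials) / len(trials)
print(f"average difference: {avg_difference:,.0f}")
print(f"share of trials favoring outsourcing: {share_outsource:.0%}")
```

Changing the seed plays the same role as pressing F9: with only 20 trials, the share of trials favoring outsourcing moves noticeably from run to run.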

The small number of trials that we used in this example makes sampling error an important issue. We could easily obtain significantly different results if we repeat the simulation (by pressing the F9 key on a Windows PC). For example, repeated simulations yielded the following percentages for outsourcing as the best decision: 40%, 60%, 65%, 45%, 75%, 45%, and 35%. There is considerable variability in the results, but this can be reduced by using a larger number of trials. To understand this variability better, let us construct a confidence interval for the proportion of decisions that result in an outsourcing recommendation with the sample size (number of trials) n = 20 using the data in Figure 12.2. Using formula (6.4) from Chapter 6, a 95% confidence interval for the proportion is $0.55 \pm 1.96\sqrt{\frac{0.55(0.45)}{20}} = 0.55 \pm 0.22$, or [0.33, 0.77]. Because the CI includes values below and above 0.5, this suggests that we have little certainty as to the best decision. However, if we obtained the same proportion using 1,000 trials, the confidence interval would be $0.55 \pm 1.96\sqrt{\frac{0.55(0.45)}{1000}} = 0.55 \pm 0.03$, or [0.52, 0.58]. This would indicate that we would have confidence that outsourcing would be the better decision more than half the time.
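The two intervals above can be checked with a few lines of Python. This is a small sketch of the normal-approximation interval for a proportion; the function name proportion_ci is an illustrative choice.

```python
import math

def proportion_ci(p_hat, n, z=1.96):
    """Normal-approximation confidence interval for a proportion."""
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

for n in (20, 1000):
    low, high = proportion_ci(0.55, n)
    print(f"n = {n}: [{low:.2f}, {high:.2f}]")
# n = 20:   [0.33, 0.77]
# n = 1000: [0.52, 0.58]
```

Because the half-width shrinks with the square root of n, going from 20 to 1,000 trials (a factor of 50) narrows the interval by a factor of about 7.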

Figure 12.2 Monte Carlo Simulation of the Outsourcing Decision Model


Although the use of a data table illustrates how we can apply Monte Carlo simulation to a decision model, it is impractical to apply to more complex problems. For example, in the Moore Pharmaceuticals model in Chapter 11, many of the model parameters, such as the initial market size, project costs, market-size growth factors, and market-share growth rates, may all be uncertain. In addition, we need to be able to capture and save the results of thousands of trials to obtain good statistical results, and it would be useful to construct a histogram of the results and calculate a variety of statistics to conduct further analyses. Fortunately, sophisticated software approaches that easily perform these functions are available. The remainder of this chapter is focused on learning to use Analytic Solver Platform software to perform large-scale Monte Carlo simulation. We will start with the simple outsourcing decision model.

Monte Carlo Simulation Using Analytic Solver Platform

To use Analytic Solver Platform, you must perform the following steps:

1. Develop the spreadsheet model.
2. Determine the probability distributions that describe the uncertain inputs in your model.
3. Identify the output variables that you wish to predict.
4. Set the number of trials or repetitions for the simulation.
5. Run the simulation.
6. Interpret the results.

Defining Uncertain Model Inputs

When model inputs are uncertain, we need to characterize them by some probability distribution. For many decision models, empirical data may be available, either in historical records or collected through special efforts. For example, maintenance records might provide data on machine failure rates and repair times, or observers might collect data on service times in a bank or post office. This provides a factual basis for choosing the appropriate probability distribution to model the input variable. We can identify an appropriate distribution by fitting historical data to a theoretical model, as we illustrated in Chapter 5.

In other situations, historical data are not available, and we can draw upon the properties of common probability distributions and typical applications that we discussed in Chapter 5 to help choose a representative distribution that has the shape that would most reasonably represent the analyst’s understanding about the uncertain variable. For example, a normal distribution is symmetric, with a peak in the middle. Exponential data are very positively skewed, with no negative values. A triangular distribution has a limited range and can be skewed in either direction.

Very often, uniform or triangular distributions are used in the absence of data. These distributions depend on simple parameters that one can easily identify based on managerial knowledge and judgment. For example, to define the uniform distribution, we need to know only the smallest and largest possible values that the variable might assume. For the triangular distribution, we also include the most likely value. In the construction industry, for instance, experienced supervisors can easily tell you the fastest, most likely, and slowest times for performing a task such as framing a house, taking into account possible weather and material delays, labor absences, and so on.

There are two ways to define uncertain variables in Analytic Solver Platform. One is to use the custom Excel functions for generating random samples from probability distributions that we described in Table 5.1 in Chapter 5.
This is similar to the method that we used for the outsourcing example when we used the NORM.INV function in the Monte Carlo spreadsheet simulation. For example, the Analytic Solver Platform function that is equivalent to NORM.INV(RAND( ), mean, standard deviation) is PsiNormal(mean, standard deviation).

Example 12.3  Using Analytic Solver Platform Probability Distribution Functions

For the Outsourcing Decision Model, we assume that the production volume is normally distributed with a mean of 1,000 and a standard deviation of 100, as in the previous example. However, we make the problem a bit more complicated by assuming that the unit cost of purchasing from the supplier is also uncertain and has a triangular distribution with a minimum value of \$160, most likely value of \$175, and maximum value of \$200. To model the distribution of the production volume in the outsourcing decision model, we could use the PsiNormal(mean, standard deviation) function. Thus, we could enter the formula =PsiNormal(1000, 100) into cell B12. To ensure that the result is a whole number, we could modify the formula to be =ROUND(PsiNormal(1000,100),0). To model the unit cost, we could enter the formula =PsiTriangular(160, 175, 200) in cell B10.
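To make the two uncertain inputs concrete, here is a rough Python analogue of the model with both distributions in place. It is a sketch under stated assumptions, not Analytic Solver Platform itself: the fixed cost of 50,000 and the in-house unit cost of 125 are assumed values, and simulate is a hypothetical helper name. Note that Python's random.triangular takes its arguments in the order (low, high, mode), whereas PsiTriangular is written above as (minimum, most likely, maximum).

```python
import random
import statistics

FIXED_COST = 50_000    # assumed fixed cost of manufacturing in-house
UNIT_VAR_COST = 125    # assumed in-house variable cost per unit

def simulate(n_trials, seed):
    """Return the simulated cost differences (manufacture minus outsource)."""
    rng = random.Random(seed)
    differences = []
    for _ in range(n_trials):
        volume = round(rng.gauss(1000, 100))        # like =ROUND(PsiNormal(1000,100),0)
        unit_cost = rng.triangular(160, 200, 175)   # low, high, mode
        manufacture = FIXED_COST + UNIT_VAR_COST * volume
        outsource = unit_cost * volume
        differences.append(manufacture - outsource)
    return differences

differences = simulate(5000, seed=12345)            # arbitrary seed
mean_diff = statistics.fmean(differences)
share_negative = sum(d < 0 for d in differences) / len(differences)
print(f"mean cost difference: {mean_diff:,.0f}")
print(f"share of trials with a negative difference: {share_negative:.0%}")
```

Because the triangular distribution is skewed toward higher supplier prices (its mean is above the most likely value of 175), the average cost difference under these assumptions comes out negative.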

The second way to define an uncertain variable is to use the Distributions button in the Analytic Solver Platform ribbon. First, select the cell in the spreadsheet for which you want to define a distribution. Click on the Distributions button as shown in Figure 12.3. Choose a distribution from one of the categories in the list that pops up. This will display a dialog in which you may define the parameters of the distribution.


Figure 12.3 Analytic Solver Platform Distributions Options

Example 12.4  Using the Distributions Button in Analytic Solver Platform

In the Outsourcing Decision Model spreadsheet, select cell B12, the production volume. Click the Distributions button in the Analytic Solver Platform ribbon and select the normal distribution from the Common category. This displays the dialog shown in Figure 12.4. In the pane on the right, change the values of the mean and stdev under Parameters to reflect the distribution you wish to model; in this case, set mean to 1,000 and stdev to 100. Click the Save button at the top of the dialog. Analytic Solver Platform will enter the correct Psi function into the cell in the spreadsheet and you may close the dialog. For the unit cost, select cell B10 and select the triangular distribution from the list. Figure 12.5 shows the completed dialog after the min, likely, and max parameters have been entered. If you double-click an uncertain cell, you can bring up this dialog to perform additional editing if necessary.

Figure 12.4 Analytic Solver Platform Normal Distribution Dialog


Figure 12.5 Analytic Solver Platform Triangular Distribution Dialog

Defining Output Cells

To define a cell you wish to predict and create a distribution of output values from your model (which Analytic Solver Platform calls an uncertain function cell), first select it, and then click on the Results button in the Simulation Model group in the Analytic Solver Platform ribbon. Choose the Output option and then In Cell.

Example 12.5  Using the Results Button in Analytic Solver Platform

For the Outsourcing Decision Model, select cell B19 (the cost difference value) and then choose the In Cell option, as we described. Figure 12.6 shows the process. Analytic Solver Platform modifies the formula in the cell to be =B16 − B17 + PsiOutput( ). You may also add + PsiOutput( ) manually to the cell formula to designate it as an output cell. However, you may choose only output cells that are numerical; thus, you could not choose cell B20, which displays a text result.

Running a Simulation

To run a simulation, first click on the Options button in the Options group in the Analytic Solver Platform ribbon. This displays a dialog (see Figure 12.7) in which you can specify the number of trials and other options to run the simulation (make sure the Simulation tab is selected). Trials per Simulation allows you to choose the number of times that Analytic Solver Platform will generate random values for the uncertain cells in the model and recalculate the entire spreadsheet. Because Monte Carlo simulation is essentially statistical sampling, the larger the number of trials you use, the more precise will be the result. Unless the model is extremely complex, a large number of trials will not unduly tax today’s computers, so we recommend that you use at least 5,000 trials (the educational version restricts this to a maximum of 10,000 trials). You should use a larger number of trials as the number of uncertain cells in your model increases so that the simulation can generate representative samples from all distributions for assumptions. You may run more than one simulation if you wish to examine the variability in the results.

Figure 12.6 Analytic Solver Platform Results Options

Figure 12.7 Analytic Solver Platform Options Dialog

The procedure that Analytic Solver Platform uses generates a stream of random numbers from which the values of the uncertain inputs are selected from their probability distributions. Every time you run the model, you will get slightly different results because of sampling error. However, you may control this by setting a value for Sim. Random Seed in the dialog. If you choose a nonzero number, then the same sequence of random numbers will be used for generating the random values for the uncertain inputs; this will guarantee that the same values will be used each time you run the model. This is useful when you wish to change a controllable variable in your model and compare results for the same assumption values. As long as you use the same number, the assumptions generated will be the same for all simulations.

Analytic Solver Platform has alternative sampling methods; the two most common are Monte Carlo and Latin Hypercube sampling. Monte Carlo sampling selects random variates independently over the entire range of possible values of the distribution. With Latin Hypercube sampling, the uncertain variable’s probability distribution is divided into intervals of equal probability, and a value is generated randomly within each interval. Latin Hypercube sampling results in a more even distribution of output values because it samples the entire range of the distribution in a more consistent manner, thus achieving more accurate forecast statistics (particularly the mean) for a fixed number of Monte Carlo trials. However, Monte Carlo sampling is more representative of reality and should be used if you are interested in evaluating the model performance under various what-if scenarios. Unless you are an advanced user, we recommend leaving the other options at their default values.

The last step is to run the simulation by clicking the Simulate button in the Solve Action group. When the simulation finishes, you will see a message “Simulation finished successfully” in the lower-left corner of the Excel window.
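The effect of a fixed seed and the difference between the two sampling methods can be sketched in Python. This is an illustration of the general ideas for a single normally distributed input, not a description of how Analytic Solver Platform implements them; the function names and the seed value 123 are illustrative.

```python
import random
from statistics import NormalDist, fmean

def monte_carlo_sample(n, dist, rng):
    """Plain Monte Carlo: n independent draws over the whole range."""
    return [rng.gauss(dist.mean, dist.stdev) for _ in range(n)]

def latin_hypercube_sample(n, dist, rng):
    """One draw inside each of n equal-probability intervals, in shuffled order."""
    eps = 1e-12  # keeps probabilities strictly inside (0, 1) for inv_cdf
    probs = [min(max((i + rng.random()) / n, eps), 1 - eps) for i in range(n)]
    rng.shuffle(probs)
    return [dist.inv_cdf(p) for p in probs]

demand = NormalDist(1000, 100)

# The same seed gives the same stream of random numbers, hence identical samples.
run_a = monte_carlo_sample(1000, demand, random.Random(123))
run_b = monte_carlo_sample(1000, demand, random.Random(123))
same_stream = run_a == run_b

mc_mean = fmean(run_a)
lhs_mean = fmean(latin_hypercube_sample(1000, demand, random.Random(123)))
print(same_stream, round(mc_mean, 1), round(lhs_mean, 1))
```

With the same number of trials, the Latin Hypercube sample mean typically lands much closer to the true mean of 1,000 than the plain Monte Carlo sample mean does, which is the "more accurate forecast statistics" property described above.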

Viewing and Analyzing Results

You may specify whether you want output charts to automatically appear after a simulation is run by clicking the Options button in the Analytic Solver Platform ribbon, and either checking or unchecking the box Show charts after simulation in the Charts tab. You may also view the results of the simulation at any time by double-clicking on an output cell that contains the PsiOutput() function or by choosing Simulation from the Reports button in the Analysis group in the Analytic Solver Platform ribbon. This displays a window with various tabs showing different charts to analyze results.

Example 12.6 Analyzing Simulation Results for the Outsourcing Decision Model Figure 12.8 shows the Frequency tab in the simulation results window. This is a frequency distribution of the cost difference for the 5,000 trials using the Monte Carlo sampling method. You can see that the distribution is somewhat negatively skewed. In the Statistics pane on the right, we see that the mean cost difference is − \$3,068, which suggests that, on average, it would be better to manufacture in-house than to outsource. We also see that the minimum cost difference was − \$43,222 and the maximum difference was \$24,367. These are estimates of the best- and worst-case results that can be expected, lending further evidence that it might be better to manufacture in-house. In the Chart Statistics section of the Statistics pane, you may specify a Lower Cutoff, Likelihood, or ­Upper ­Cutoff value. These options help you analyze the ­frequency chart. For example, if we set the Upper ­Cutoff to 0, we obtain the chart shown in Figure 12.9. This illustrates the probability of a negative (as well as a positive) cost

­ ifference. From the chart, we see that there is about a d 59% chance of a negative value for outsourcing, whereby in-house manufacturing would be best. The red line that divides the ­regions in the chart is called a marker line. You can move it with your mouse to calculate different areas of probability. As you do, the values in the Chart Statistics section will change. You may right-click on a marker line to remove it; you may also add new marker lines by ­right-clicking to show probabilities between marker lines in the chart. If you specify both a Lower Cutoff and Upper Cutoff value, marker lines will be added at both values, and the ­Likelihood statistic will be the probability between them. The other tabs in the results window display a cumulative f­requency distribution and a reverse cumulative frequency distribution, as well as a sensitivity chart and scatter plots, which we discuss in other examples. The best way to learn to analyze the charts is by experimenting. In addition, you can change the display in the right pane by selecting other o ­ ptions in the drop-down menu

Chapter 12  Monte Carlo Simulation and Risk Analysis

413

Figure 12.8 Simulation Results— Cost Difference Frequency Distribution

Figure 12.9 Probability of a Negative Cost Difference

by clicking on the down arrow to the right of the Statistics header. The options are Percentiles, Chart Type, Chart Options, Axis Options, and Markers. The Percentiles option displays percentiles of the simulation results and is essentially a numerical tabulation of the cumulative distribution of the output; for example, the 10th percentile in these simulation results was −\$16,550 (not shown). This means that 10% of the simulated cost differences were less than or equal to −\$16,550. The other options are simply for customizing the charts.
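The Likelihood and Percentiles statistics are simple counts over the simulated trials. The following is a minimal Python sketch of that idea, not Analytic Solver Platform code: the function name `likelihood` is our own, and the simulated cost differences are synthetic stand-ins rather than the outsourcing model's results.

```python
import numpy as np

def likelihood(outcomes, lower=-np.inf, upper=np.inf):
    """Fraction of simulated outcomes that fall between the two cutoffs."""
    outcomes = np.asarray(outcomes, dtype=float)
    return np.mean((outcomes >= lower) & (outcomes <= upper))

# Synthetic stand-in for 5,000 simulated cost differences (NOT the book's data).
rng = np.random.default_rng(seed=1)
cost_difference = rng.normal(loc=-3_000, scale=12_000, size=5_000)

# Upper Cutoff = 0: share of trials with a nonpositive cost difference.
p_negative = likelihood(cost_difference, upper=0)

# Lower and Upper Cutoff together: probability between two marker lines.
p_between = likelihood(cost_difference, lower=-10_000, upper=10_000)

# Percentiles tabulate the cumulative distribution: 10% of trials fall
# at or below the 10th percentile.
tenth_percentile = np.percentile(cost_difference, 10)

print(f"P(difference <= 0)            = {p_negative:.3f}")
print(f"P(-10,000 <= difference <= 10,000) = {p_between:.3f}")
print(f"10th percentile               = {tenth_percentile:,.0f}")
```

Moving a marker line in the chart corresponds to calling `likelihood` with a different cutoff.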

In the remainder of this chapter, we present several additional examples of Monte Carlo simulation using Analytic Solver Platform. These serve to illustrate the wide range of applications in which the approach may be used and also various features of Analytic Solver Platform and tools for analyzing simulation models.


New-Product Development Model

The Moore Pharmaceuticals spreadsheet model to support a new-product development decision was introduced in Chapter 11; Figure 12.10 shows the model again. Although the values used in the spreadsheet suggest that the new drug would become profitable by the fourth year, much of the data in this model are uncertain. Thus, we might be interested in evaluating the risk associated with the project. Three questions we might be interested in are as follows:

1. What is the risk that the net present value over the 5 years will not be positive?
2. What are the chances that the product will show a cumulative net profit in the third year?
3. What cumulative profit in the fifth year are we likely to realize with a probability of at least 0.90?

Suppose that the project manager of Moore Pharmaceuticals has identified the following uncertain variables in the model and the distributions and parameters that describe them:

• Market size: normal with mean of 2,000,000 units and standard deviation of 400,000 units
• R&D costs: uniform between \$600,000,000 and \$800,000,000
• Clinical trial costs: lognormal with mean of \$150,000,000 and standard deviation of \$30,000,000
• Annual market growth factor: triangular with minimum = 2%, maximum = 6%, and most likely = 3%
• Annual market share growth rate: triangular with minimum = 15%, maximum = 25%, and most likely = 20%

Figure 12.10 Moore Pharmaceuticals Spreadsheet Model


Example 12.7 Setting Up the Simulation Model for Moore Pharmaceuticals As we learned earlier, we may use either the Psi functions or the Distribution buttons in the Analytic Solver Platform ­r ibbon to specify the uncertain variables. Although the result is the same, the Psi functions are often easier to use. To model the market size, we could use the PsiNormal(mean, standard deviation) function. Thus, we could enter the formula = PsiNormal(2000000, 400000) into cell B5. Similarly, we could use the following functions for the remaining uncertain variables: R&D Costs (cell B11): =PsiUniform(600000000, • 800000000) trial costs (cell B12): • Clinical =PsiLognormal(150000000, 30000000)

market growth factor (cells C18 to F18): • Annual =PsiTriangular(2%, 3%, 6%) market share growth rate (cells C20 to • Annual F20): =PsiTriangular(15%, 20%, 25%) Because the annual market-growth factors and ­market-share-growth rates use the same distributions, we need enter them only once and then copy them to the other cells. We define the cumulative net profit for each year (cells B28 through F28) and the net present value (cell B30) as the output cells.
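For readers who want to see what sampling from these five distributions involves outside of Excel, here is a rough NumPy sketch. It is an illustration only and does not reproduce the Moore Pharmaceuticals spreadsheet logic. It assumes, as the description above suggests, that the lognormal parameters are the mean and standard deviation of the cost itself, so they must be converted to the parameters of the underlying normal distribution that NumPy expects. All variable names are our own.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
n_trials = 10_000

# Market size: normal(mean, standard deviation)
market_size = rng.normal(2_000_000, 400_000, n_trials)

# R&D costs: uniform(low, high)
rd_costs = rng.uniform(600_000_000, 800_000_000, n_trials)

# Clinical trial costs: lognormal specified by its own mean and standard deviation.
# NumPy's lognormal takes the mean and sigma of the underlying normal, so convert.
ct_mean, ct_sd = 150_000_000, 30_000_000
sigma_sq = np.log(1 + (ct_sd / ct_mean) ** 2)
mu = np.log(ct_mean) - sigma_sq / 2
clinical_trial_costs = rng.lognormal(mu, np.sqrt(sigma_sq), n_trials)

# Growth rates: triangular(minimum, most likely, maximum), one independent draw
# per trial for each of the four cells C18:F18 and C20:F20.
market_growth = rng.triangular(0.02, 0.03, 0.06, size=(n_trials, 4))
share_growth = rng.triangular(0.15, 0.20, 0.25, size=(n_trials, 4))

print(f"Mean clinical trial cost: {clinical_trial_costs.mean():,.0f}")
print(f"Std. dev. of clinical trial cost: {clinical_trial_costs.std():,.0f}")
```

Each row of these arrays plays the role of one simulation trial; a full model would pass each row through the spreadsheet's profit calculations.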

Now we are prepared to run the simulation and analyze the results. If your simulation model contains more than one output function, then a Variables Chart containing frequency graphs of up to 9 output functions and uncertain variables will appear as shown in Figure 12.11. In this case, the Variables Chart shows the frequency charts for all 6 output cells, which Analytic Solver Platform calls uncertain functions (cells B28:F28 and B30), and 3 of the uncertain inputs (B5, B11, and B12) in the Moore Pharmaceuticals model. You may customize this by checking or unchecking the boxes in the Filters pane; for example, you can remove the uncertain input distributions and only show the six outputs. As noted earlier in this chapter, you may also suppress the automatic display of the chart in the Charts tab after clicking the Options button. In this example, we used 10,000 trials. We may use the frequency charts in the simulation results to answer the risk analysis questions we posed earlier.

Figure 12.11 Variables Chart for Simulation Results


Example 12.8 Risk Analysis for Moore Pharmaceuticals

1. What is the probability that the net present value over the 5 years will not be positive? Double-click on cell B30 to display the simulation results for the net present value output. Enter the number 0 for the Upper Cutoff value in the Statistics pane. The results are shown in Figure 12.12; this shows about an 18% chance that the NPV will not be positive.

2. What are the chances that the product will show a cumulative net profit in the third year? Double-click cell D28, the cumulative net profit in year 3. Enter the value 0 for the Lower Cutoff value, as illustrated in Figure 12.13. This shows that the probability of a positive cumulative net profit in the third year is only about 9%.

3. What cumulative profit in the fifth year are we likely to realize with a probability of at least 0.90? An easy way to answer this question is to view the Percentiles results (see Figure 12.14). Because 90% of the simulated outcomes exceed the 10th percentile, we can expect a cumulative net profit of about \$180,000 or more with 90% certainty. Another way is to set the lower cutoff in the Chart Statistics field to some number smaller than the minimum value and then set the likelihood to 10%. Analytic Solver Platform will calculate the upper cutoff and draw a marker line at the value for which 10% of the outcomes fall below it and, consequently, 90% fall above it.

Figure 12.12 Probability of a Nonpositive Net Present Value

Figure 12.13 Probability of a Non-Positive Cumulative Third-Year Net Profit


Figure 12.14 Percentiles for Fifth-Year Cumulative Net Profit

Confidence Interval for the Mean Monte Carlo simulation is essentially a sampling experiment. Each time you run a simulation, you will obtain slightly different results as we observed in Example 12.2 for the outsourcing decision model. Therefore, statistics such as the mean are a single observation from a sample of n trials from some unknown population. In Chapter 6, we discussed how to construct a confidence interval for the population mean to measure the error in estimating the true population mean. We may use the statistical information to construct a confidence interval for the mean using a variant of formula (6.3) in Chapter 6:

$$\bar{x} \pm z_{\alpha/2}\left(s/\sqrt{n}\right) \tag{12.1}$$

Because a Monte Carlo simulation will generally have a very large number of trials (we used 10,000), we may use the standard normal z-value instead of the t-distribution in the confidence interval formula.

Example 12.9 A Confidence Interval for the Mean Net Present Value We will construct a 95% confidence interval for the mean NPV using the simulation results from the Moore Pharmaceuticals example. From statistics shown in Figure 12.12, we have mean = \$200,608,120 standard deviation = \$220,980,564 n = 10,000 For a 95% confidence interval, zA>2 = 1.96. Therefore, ­ sing formula (12.1), a 95% confidence interval for the u mean would be \$200,608,120 ± 1.961220,980,564 , !10,000 2, or [ \$196,276,901, \$204,939,339]

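This means that we can be 95% confident that the true mean NPV, the value that the simulation average would approach over a very large number of trials, lies within this interval; if we ran the simulation again with different random inputs, we could expect the mean NPV to fall close to it. To reduce the size of the confidence interval, we would need to run the simulation for a larger number of trials; because the width is proportional to 1/√n, halving it requires four times as many trials. For most risk analysis applications, however, the mean is less important than the actual distribution of outcomes.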
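The arithmetic behind formula (12.1) is easy to check in a few lines of Python. The sketch below uses the mean, standard deviation, and number of trials reported for the NPV output; the function name is our own, and the second call simply illustrates how the interval narrows as the number of trials grows.

```python
import math

def mean_confidence_interval(mean, std_dev, n_trials, z=1.96):
    """Large-sample confidence interval for the mean: mean +/- z * s / sqrt(n)."""
    half_width = z * std_dev / math.sqrt(n_trials)
    return mean - half_width, mean + half_width

# Statistics reported for the NPV output with 10,000 trials.
lower, upper = mean_confidence_interval(200_608_120, 220_980_564, 10_000)
print(f"95% CI with 10,000 trials: [{lower:,.0f}, {upper:,.0f}]")

# Same mean and standard deviation, four times as many trials:
# the interval is half as wide.
lower4, upper4 = mean_confidence_interval(200_608_120, 220_980_564, 40_000)
print(f"95% CI with 40,000 trials: [{lower4:,.0f}, {upper4:,.0f}]")
```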


Sensitivity Chart

The sensitivity chart feature allows you to determine the influence that each uncertain model input has individually on an output variable based on its correlation with the output variable. The sensitivity chart displays the rankings of each uncertain variable according to its impact on an output cell as a tornado chart. A sensitivity chart provides three benefits:

1. It tells which uncertain variables influence output variables the most and which would benefit from better estimates.
2. It tells which uncertain variables influence output variables the least and can be ignored or discarded altogether.
3. By providing understanding of how the uncertain variables affect your model, it allows you to develop more realistic spreadsheet models and improve the accuracy of your results.

The sensitivity chart can be viewed by clicking the Sensitivity tab in the results window (see Figure 12.15).

Example 12.10 Interpreting the Sensitivity Chart for NPV

Figure 12.15 shows the sensitivity chart for the net present value output cell (B30). The uncertain variable cells are ranked from top to bottom, beginning with the one having the highest absolute value of correlation with NPV. In this example, we see that cell B5, the market size, has a correlation of about 0.95 with NPV; the R&D cost (cell B11) has a correlation of −0.255, and the clinical trial cost (cell B12) has a correlation of −0.130 with NPV. The other uncertain variable cells have a negligible effect. This means that if you want to reduce the variability in the distribution of NPV the most, you would need to obtain better information about the estimated market size and use a probability distribution that has a smaller variance. The small correlations between NPV and the market-growth factors suggest that using constant values instead of uncertain probability distributions would have little effect on the results.
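The ranking in a sensitivity chart can be imitated by correlating each simulated input with the simulated output and sorting by absolute value. The sketch below is an illustration under stated assumptions: it uses the ordinary (Pearson) correlation from NumPy, which may differ from the correlation measure Analytic Solver Platform reports, and the toy model and its coefficients are hypothetical, not the Moore Pharmaceuticals model.

```python
import numpy as np

def rank_by_correlation(inputs, output):
    """Return (name, correlation) pairs sorted by |correlation|, largest first."""
    correlations = {
        name: np.corrcoef(values, output)[0, 1] for name, values in inputs.items()
    }
    return sorted(correlations.items(), key=lambda item: abs(item[1]), reverse=True)

rng = np.random.default_rng(seed=7)
n_trials = 10_000

# Hypothetical uncertain inputs (units are arbitrary).
inputs = {
    "market_size": rng.normal(100, 20, n_trials),
    "rd_cost": rng.uniform(60, 80, n_trials),
    "growth": rng.triangular(0.02, 0.03, 0.06, n_trials),
}

# Hypothetical output: rises with market size, falls with R&D cost,
# and is barely affected by the growth factor.
output = 5 * inputs["market_size"] - 3 * inputs["rd_cost"] + 10 * inputs["growth"]

ranking = rank_by_correlation(inputs, output)
for name, corr in ranking:
    print(f"{name:12s} {corr:+.3f}")
```

Inputs at the bottom of such a ranking are candidates for replacement by constants, as the example above suggests for the market-growth factors.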

Overlay Charts If a simulation has multiple related forecasts, the overlay chart feature allows you to superimpose the frequency distributions from selected forecasts on one chart to compare differences and similarities that might not be apparent.

Figure 12.15 Sensitivity Chart for Net Present Value


Example 12.11 Creating an Overlay Chart

To create an overlay chart, click the Charts button in the Analysis group in the Analytic Solver Platform ribbon. Click Multiple Simulation Results (do not choose Multiple Simulations!) and then choose Overlay. In the Reports dialog that appears, select the output variable cells you wish to include in the chart and move them to the right side of the dialog using the arrow buttons (see Figure 12.16). In this example, we selected cells B28 and F28, which correspond to the cumulative net profit for years 1 and 5. Figure 12.17 shows the overlay chart for the distributions of cumulative net profit for years 1 and 5. This chart makes it clear that the mean value for year 1 is smaller than for year 5, and the variance in year 5 is much larger than that in year 1. This is to be expected because there is more uncertainty in predicting farther in the future, and the model captures this.

Figure 12.16 Reports Dialog for Selecting Output Cells for an Overlay Chart

Figure 12.17 Overlay Chart for Year 1 and Year 5 Cumulative Net Profit


Trend Charts

If a simulation has multiple output variables that are related to one another (such as over time), you can view the distributions of all output variables on a single chart, called a trend chart. In Analytic Solver Platform, the trend chart shows the mean values as well as 75% and 90% bands (probability intervals) around the mean. For example, the 90% band shows the range of values into which the output variable has a 90% chance of falling.

Example 12.12 Creating a Trend Chart

To create a trend chart for the Moore Pharmaceuticals example, click the Charts button in the Analysis group in the Analytic Solver Platform ribbon. Click Multiple Simulation Results and then choose Trend. (Be careful not to confuse "Multiple Simulation Results" with "Multiple Simulations" in the drop-down menu; these are different options.) In the Reports dialog that appears, select the output variable cells you wish to include in the chart and move them to the right side of the dialog using the arrow buttons. In this example, we selected cells B28 through F28, which correspond to the cumulative net profit for all years. Figure 12.18 shows a trend chart for these variables. We see that although the mean net cumulative profit increases over time, so does the variation, indicating that the uncertainty in forecasting the future also increases with time.
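The numbers behind a trend chart are just per-period means and percentile bands. The following Python sketch shows one way to compute them. It assumes equal-tailed bands (for example, the 90% band runs from the 5th to the 95th percentile); the exact construction used by Analytic Solver Platform is not specified here, and the five-year data are synthetic.

```python
import numpy as np

def trend_bands(outputs, levels=(0.75, 0.90)):
    """Per-period mean and equal-tailed probability bands.

    outputs: array of shape (n_trials, n_periods).
    Returns (means, {level: (lower, upper)}).
    """
    outputs = np.asarray(outputs, dtype=float)
    bands = {}
    for level in levels:
        tail = (1 - level) / 2 * 100  # percent left in each tail
        lower, upper = np.percentile(outputs, [tail, 100 - tail], axis=0)
        bands[level] = (lower, upper)
    return outputs.mean(axis=0), bands

# Synthetic cumulative profit for 5 years: mean and spread both grow over time.
rng = np.random.default_rng(seed=5)
year_means = np.array([-500.0, -300.0, -100.0, 150.0, 400.0])
year_sds = np.array([50.0, 100.0, 150.0, 200.0, 250.0])
simulated = rng.normal(year_means, year_sds, size=(10_000, 5))

means, bands = trend_bands(simulated)
lower90, upper90 = bands[0.90]
for year in range(5):
    print(f"Year {year + 1}: mean {means[year]:8.1f}, "
          f"90% band [{lower90[year]:8.1f}, {upper90[year]:8.1f}]")
```

A widening band from one year to the next is the numerical counterpart of the growing uncertainty visible in a trend chart.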

Box-Whisker Charts

Finally, Analytic Solver Platform can create box-whisker charts to illustrate the statistical properties of the output variable distributions in an alternate fashion. A box-whisker chart shows the minimum, first quartile, median, third quartile, and maximum values in a data set graphically. The first and third quartiles form a box around the median, showing the middle 50% of the data, and the whiskers extend to the minimum and maximum values. Box-whisker charts are created from the Charts button in the same way as overlay and trend charts. Figure 12.19 shows an example for the cumulative net profits in the Moore Pharmaceuticals simulation.

Figure 12.18 Trend Chart for Cumulative Net Profit Over 5 Years


Figure 12.19 Example of Analytic Solver Platform Box-Whisker Chart

Simulation Reports Analytic Solver Platform allows you to create reports in the form of Excel worksheets that summarize a simulation. To do this, click the Reports button in the Analysis group in the Analytic Solver Platform ribbon, and choose Simulation from the options that appear. The report summarizes basic statistical information about the model, simulation options, uncertain variables, and output variables, most of which we have already seen in the charts. It is useful to provide a record of the simulation for quick reference.

Newsvendor Model In Chapter 11, we developed the newsvendor model to analyze a single-period purchase decision. Here we apply Monte Carlo simulation to forecast the profitability of different purchase quantities when the future demand is uncertain. Let us suppose that the store owner kept records for the past 20 years on the number of boxes sold at full price, as shown in the spreadsheet in Figure 12.20 (Excel file Newsvendor Model with Historical Data). The distribution of sales seems to be some type of positively skewed unimodal distribution.

The Flaw of Averages You might wonder why we cannot simply use average values for the uncertain inputs in a decision model and eliminate the need for Monte Carlo simulation. Let’s see what happens if we do this for the newsvendor model.

Example 12.13 Using Average Values in the Newsvendor Model If we find the average of the historical candy sales, we obtain 44.05, or, rounded to a whole number, 44. Using this value for demand and purchase quantity, the model predicts a profit of \$264 (see ­Figure 12.21). However, if we

construct a data table to evaluate the profit for each of the historical values (also shown in ­Figure 12.21), we see that the average profit is only \$255.00.


Figure 12.20 Newsvendor Model with Historical Data

Figure 12.21 Example of the Flaw of Averages

Dr. Sam Savage, a strong proponent of spreadsheet modeling, coined the term the flaw of averages to describe this phenomenon. It says that a model output evaluated at the average value of an input is not necessarily equal to the average of the outputs evaluated at each of the input values. This occurs in the newsvendor example because the quantity sold is limited to the smaller of the demand and purchase quantity, so even when demand exceeds the purchase quantity, the profit is limited. Using averages in models can conceal risk, and this is a common error among users of analytic models. This is why Monte Carlo simulation is valuable.
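A few lines of Python make the flaw of averages concrete. The sketch below is hypothetical throughout: the price, cost, and discount price are assumed values chosen so that selling all 44 units gives the \$264 profit quoted above, and the ten demand values are invented (they are not the 20 historical observations in Figure 12.20), so the average profit will not match the \$255.00 in the example.

```python
import numpy as np

# Hypothetical parameters (assumptions, not taken from this section):
# a $6 margin per unit sold at full price, unsold units cleared at a discount.
PRICE, COST, DISCOUNT_PRICE = 18.0, 12.0, 9.0

def profit(demand, quantity):
    """Newsvendor profit: full-price sales are capped at the purchase quantity."""
    demand = np.asarray(demand, dtype=float)
    sold = np.minimum(demand, quantity)
    surplus = np.maximum(quantity - demand, 0)
    return PRICE * sold + DISCOUNT_PRICE * surplus - COST * quantity

# Invented demand history with an average of exactly 44.
demand_history = np.array([30, 35, 38, 40, 42, 44, 46, 50, 55, 60])
quantity = 44

profit_at_average_demand = float(profit(demand_history.mean(), quantity))
average_profit = float(profit(demand_history, quantity).mean())

print(f"Profit at average demand: {profit_at_average_demand:.2f}")
print(f"Average of the profits:   {average_profit:.2f}")
```

High-demand years cannot earn more than the sold-out profit, while low-demand years earn less, so the average of the profits falls below the profit at average demand.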

Monte Carlo Simulation Using Historical Data We can perform a Monte Carlo simulation by resampling from the historical sales ­distribution—that is, by selecting a value randomly from the historical data as the demand in the model.


Example 12.14 Simulating the Newsvendor Model Using Resampling In the Newsvendor Model with Historical Data spreadsheet, we have the historical data listed in the range D2:D21. All we need to do is to define the distribution of demand in cell B11 using the PsiDisUniform function in Analytic Solver Platform. This function will sample a value from the historical data for each trial of the simulation. Enter the formula = PsiDisUniform(D2:D21) into cell B11. Now, you may set up the simulation model by defining the

profit cell B17 as an uncertain function cell, set the simulation options (we chose 5,000 trials), and run the simulation. Figure 12.22 shows the results; for the purchase quantity of 44, the mean profit is \$255.00. The frequency chart, also shown in Figure 12.22, looks somewhat odd. However, recall that if demand exceeds the purchase quantity, then sales are limited to the number purchased, which ­explains the large spike at the right of the distribution.
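Resampling can be sketched in Python by drawing, on every trial, one value at random from the historical list with equal probability. As before, this is an illustration with assumed inputs: the prices and the demand history below are hypothetical, so the resulting mean profit is not the \$255.00 reported in the example.

```python
import numpy as np

# Hypothetical parameters and demand history (assumptions, not the book's data).
PRICE, COST, DISCOUNT_PRICE = 18.0, 12.0, 9.0
demand_history = np.array([30, 35, 38, 40, 42, 44, 46, 50, 55, 60])
quantity, n_trials = 44, 5_000

rng = np.random.default_rng(seed=3)

# Each trial picks one historical value, all equally likely (sampling with replacement).
simulated_demand = rng.choice(demand_history, size=n_trials, replace=True)

sold = np.minimum(simulated_demand, quantity)
surplus = np.maximum(quantity - simulated_demand, 0)
simulated_profit = PRICE * sold + DISCOUNT_PRICE * surplus - COST * quantity

# Trials in which demand meets or exceeds the purchase quantity all earn the
# same maximum profit, which produces the spike at the right of the histogram.
share_sold_out = np.mean(simulated_demand >= quantity)

print(f"Mean profit over {n_trials} trials: {simulated_profit.mean():.2f}")
print(f"Share of trials at the maximum profit: {share_sold_out:.3f}")
```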

Monte Carlo Simulation Using a Fitted Distribution While sampling from empirical data is easy to do, it does have some drawbacks. First, the empirical data may not adequately represent the true underlying population because of sampling error. Second, using an empirical distribution precludes sampling values outside the range of the actual data. Therefore, it is usually advisable to fit a distribution and use it for the uncertain variable. We can do this by fitting a distribution to the data using the techniques we described in Chapter 5.

Example 12.15 Using a Fitted Distribution for Monte Carlo Simulation Following the steps in Example 5.42, first highlight the range of the data in the Newsvendor Model with Historical Data spreadsheet, and click Fit from the Tools group in the ­Analytic Solver Platform ribbon. Because the number of sales is discrete, select the Discrete radio button in the Fit Options dialog and click Fit. Figure 12.23 shows the best-­ fitting distribution, a negative ­binomial distribution. When you attempt to close the dialog, Analytic Solver Platform will ask

Figure 12.22 Newsvendor Model Simulation Results Using Resampling for Purchase Quantity = 44

if you wish to accept the fitted distribution. Click Yes, and a pop-up will allow you to drag and place the function into a cell in the spreadsheet. Place the Psi function for the negative binomial distribution in the first cell of the data (cell D2). To use this for the simulation, simply reference cell D2 in cell B11, corresponding to the demand in the model. Figure 12.24 shows the results, which are quite similar to the results found by resampling in Example 12.14.
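To give a feel for what fitting a negative binomial distribution involves, the sketch below matches the sample mean and variance (the method of moments). This is a deliberately simple substitute and is not the fitting procedure used by Analytic Solver Platform; the demand history is hypothetical, and the function name is our own.

```python
import numpy as np

def fit_negative_binomial_moments(data):
    """Method-of-moments estimates (n, p) in NumPy's parameterization,
    where mean = n(1 - p)/p and variance = n(1 - p)/p**2."""
    data = np.asarray(data, dtype=float)
    mean, var = data.mean(), data.var(ddof=1)
    if var <= mean:
        raise ValueError("Method of moments requires variance greater than the mean.")
    p = mean / var
    n = mean * p / (1 - p)
    return n, p

# Hypothetical demand history (NOT the book's 20 observations).
demand_history = np.array([30, 35, 38, 40, 42, 44, 46, 50, 55, 60])
n_hat, p_hat = fit_negative_binomial_moments(demand_history)

rng = np.random.default_rng(seed=9)
fitted_demand = rng.negative_binomial(n_hat, p_hat, size=20_000)

print(f"Estimated n = {n_hat:.2f}, p = {p_hat:.3f}")
print(f"Historical range: {demand_history.min()} to {demand_history.max()}")
print(f"Simulated range:  {fitted_demand.min()} to {fitted_demand.max()}")
```

Unlike resampling, the fitted distribution can generate demand values outside the range of the historical data.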


Figure 12.23 Best-Fitting Distribution for Historical Candy Sales

Figure 12.24 Newsvendor Simulation Results Using the Negative Binomial Distribution for Purchase Quantity = 44

Analytic Solver Platform has a feature called Interactive Simulation. Whenever the Simulate button is clicked, you will notice that the lightbulb in the icon turns bright. If you change any number in the model, Analytic Solver Platform will automatically run the simulation for that quantity; this makes it easy to conduct what-if analyses. For example, changing the purchase quantity to 50 yields the results shown in Figure 12.25. The mean profit drops to \$246.05. You could use this approach to identify the best purchase quantity; however, a more systematic method is described in the online Supplementary Chapter B.

Overbooking Model In Chapter 11, we developed a model for overbooking decisions (Hotel Overbooking Model). In any realistic overbooking situation, the actual customer demand as well as the number of cancellations would be random variables. We illustrate how a simulation model can help in making the best overbooking decision and introduce a new type of distribution in Analytic Solver Platform, a custom distribution.


Figure 12.25 Newsvendor Simulation Results for Purchase Quantity = 50

Figure 12.26 Hotel Overbooking Simulation Model and Demand Distribution

The Custom Distribution in Analytic Solver Platform Let us assume that historical data for the demand have been collected and summarized in a relative frequency distribution, but that the actual data are no longer available. These are shown in columns D and E in Figure 12.26 (Excel file Hotel Overbooking Monte Carlo Simulation Model with Custom Demand). We also assume that each reservation has a ­constant probability p = 0.04 of being canceled; therefore, the number of cancellations (cell B14) can be modeled using a binomial distribution with n = number of reservations made and p = probability of cancellation.

Example 12.16 Defining a Custom Distribution in Analytic Solver Platform To use the relative frequency distribution to define the uncertain demand in the Hotel Overbooking Model with Custom Demand (note that this spreadsheet is already completed; to follow along, copy columns D and E to the original ­Hotel Overbooking Model worksheet) first select cell B12 that

corresponds to the demand, then click on the Distributions button in the Analytic Solver Platform ribbon and choose Discrete from the Custom category. In the dialog, edit the range for “values” and “weights” in the Parameters section in the fields on the right. Values correspond to the range (continued )

426

Chapter 12  Monte Carlo Simulation and Risk Analysis

Figure 12.27 Custom Discrete Distribution Dialog

Figure 12.28 Binomial Distribution Dialog

of demand in cells D2:D13, and weights are the relative frequencies or probabilities in cells E2:E13. The dialog will then display the actual form of the distribution, as shown in Figure 12.27. Alternatively, you could use the function =Psi Discrete(\$D\$2:\$D\$13,\$E\$2:\$E\$13) in cell B12. To model the number of cancellations in cell B14, choose the binomial distribution from the Discrete category in the Distributions list. Note that the number of

trials must be the value in cell B13. This is critical in this example, because the number of reservations made will change, depending on the customer demand in cell B12. Therefore, in the Parameters section of the dialog, we must reference cell B13 and not use a constant value, as shown in F ­ igure 12.28. Alternatively, we could use the function =PsiBinomial(B13, 0.04) in cell B14. Define cells B17 and B18 as output cells and run the model.

Figures 12.29 and 12.30 show frequency charts of the two output variables—number of overbooked customers and net revenue—for accepting 310 reservations. There is about a 14% chance of overbooking at least one customer. Observe that there seem to be two different distributions superimposed over one another in the net revenue frequency distribution. Can you explain why this is so? As with the newsvendor problem, we can easily change the number of reservations made, and the Interactive Simulation capability will quickly run a new simulation and change the results in the frequency charts.
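The key modeling point, that the number of binomial trials changes from one simulation trial to the next, can be seen in a short Python sketch. Everything specific here is an assumption for illustration: the room capacity of 300, the demand values, and their weights are hypothetical and are not the contents of columns D and E in Figure 12.26; only the 310-reservation limit and the 0.04 cancellation probability come from the text.

```python
import numpy as np

rng = np.random.default_rng(seed=11)
n_trials = 5_000

ROOMS = 300              # hypothetical capacity
RESERVATION_LIMIT = 310  # reservations accepted, as in the text
P_CANCEL = 0.04          # cancellation probability, as in the text

# Hypothetical custom discrete demand distribution (values and weights).
demand_values = np.array([290, 295, 300, 305, 310, 315, 320])
weights = np.array([0.05, 0.10, 0.15, 0.20, 0.20, 0.15, 0.15])

# Custom discrete distribution: sample values with the given probabilities.
demand = rng.choice(demand_values, size=n_trials, p=weights)

# Reservations made depend on demand, so they vary from trial to trial.
reservations = np.minimum(demand, RESERVATION_LIMIT)

# Binomial cancellations: the number of trials is each trial's own reservation count.
cancellations = rng.binomial(reservations, P_CANCEL)

arrivals = reservations - cancellations
overbooked = np.maximum(arrivals - ROOMS, 0)

print(f"P(at least one overbooked customer) = {np.mean(overbooked > 0):.3f}")
print(f"Mean number of overbooked customers = {overbooked.mean():.3f}")
```

Passing the array `reservations` as the first argument to `rng.binomial` mirrors referencing cell B13, rather than a constant, in the binomial distribution dialog.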

Figure 12.29 Frequency Chart of Number of Overbooked Customers

Figure 12.30 Frequency Chart of Net Revenue

Cash Budget Model

Cash budgeting is the process of projecting and summarizing a company's cash inflows and outflows expected during a planning horizon, usually 6 to 12 months.³ The cash budget also shows the monthly cash balances and any short-term borrowing used to cover cash shortfalls. Positive cash flows can increase cash, reduce outstanding loans, or be used elsewhere in the business; negative cash flows can reduce cash available or be offset with additional borrowing. Most cash budgets are based on sales forecasts. With the inherent uncertainty in such forecasts, Monte Carlo simulation is an appropriate tool to analyze cash budgets.

Figure 12.31 shows an example of a cash budget spreadsheet (Excel file Cash Budget Model). The highlighted cells represent the uncertain variables and outputs we want to predict from the simulation model. The budget begins in April (thus, sales for April and subsequent months are uncertain). These are assumed to be normally distributed with a standard deviation of 10% of the mean. In addition, we assume that sales in adjacent months are correlated with one another, with a correlation coefficient of 0.6. On average, 20% of sales are collected in the month of sale, 50% in the month following the sale, and 30% in the second month following the sale. However, these figures are uncertain, so a uniform distribution is used to model the first two values (15% to 20% and 40% to 50%, respectively), with the assumption that all remaining revenues are collected in the second month following the sale.

Purchases are 60% of sales and are paid for 1 month prior to the sale. Wages and salaries are 12% of sales and are paid in the same month as the sale. Rent of \$10,000 is paid each month. Additional cash operating expenses of \$30,000 per month will be incurred for April through July, decreasing to \$25,000 for August and September. Tax payments of \$20,000 and \$30,000 are expected in April and July, respectively. A capital expenditure of \$150,000 will occur in June, and the company has a mortgage payment of \$60,000 in May. The cash balance at the end of March is \$150,000, and managers want to maintain a minimum balance of \$100,000 at all times. The company will borrow the amounts necessary to ensure that the minimum balance is achieved. Any cash above the minimum will be used to pay off any loan balance until it is eliminated. The available cash balances in row 25 of the spreadsheet are the output variables we wish to predict.

³ Douglas R. Emery, John D. Finnerty, and John D. Stowe, Principles of Financial Management (Upper Saddle River, NJ: Prentice Hall, 1998): 652–654.

Figure 12.31 Cash Budget Model

Example 12.17 Simulating the Cash Budget Model without Correlations

Build the basic simulation model by defining distributions for each of the uncertain variables. First, specify the sales for April through October (cells E5:K5) to be normally distributed with means equal to the values in the spreadsheet and standard deviations equal to 10% of the means. For example, use the function =PsiNormal(600000,60000) in cell E5. For the current collections rate in cell B7, use the uniform distribution =PsiUniform(15%, 20%), and for the previous month collections rate in cell B8, use =PsiUniform(40%, 50%). Define the available balances in row 25 as output variables in the simulation model. The Excel file Cash Budget Monte Carlo Simulation Model provides the completed simulation model.

Figure 12.32 shows the results of Example 12.17 in the form of a trend chart. We see that there is a high likelihood that the cash balances for the first 3 months will be negative before increasing. Viewing the frequency charts and statistics for the individual months will provide the details of the distributions of likely cash balances and the probabilities of requiring loans. For example, in April, the probability that the balance will not exceed the minimum of \$100,000, so that an additional loan is required, is about 0.70 (see Figure 12.33). This probability actually increases in May and June and falls to zero by July.

Figure 12.32 Cash Balance Simulation Trend Chart

Correlating Uncertain Variables Unless you specify otherwise, Monte Carlo simulation assumes that each of the uncertain variables is independent of all the others. This may not be the case. In the cash budget model, if the sales in April are high, then it would make sense that the sales in May would be high also. Thus, we might expect a positive correlation between these variables. In this scenario, we assume a correlation coefficient of 0.6 between sales in successive months. The following example shows how to incorporate this assumption into the simulation model.

Figure 12.33 Likelihood of Not Meeting Minimum Balance in April


Example 12.18 Incorporating Correlations in Analytic Solver Platform To correlate the uncertain variables in the Cash Budget Monte Carlo Simulation Model, first click the C ­ orrelations button in the Simulation Model group in the Analytic Solver Platform ribbon. This brings up the Create new correlation matrix dialog shown in Figure 12.34 that lists the uncertain variables in the model. In this example, we are only correlating the variables in the range E5:K5. In the left pane, hold the Ctrl key and click on each of the distributions in the range E5:K5, or click on \$E5\$, hold the Shift key and then click on \$K\$5 to select them. Then click on the right arrow. (The double right arrow selects all of them, which we do not want in this example.) This creates an initial correlation matrix as shown in Figure 12.35. The numerical values show the correlations (initially set to zero); the green distributions are those used in the uncertain cells, and the blue scatterplots show visual representations of the correlations between the variables. Replace the zeros by the correlations you want in the model. In this example, we will assume a 0.6 correlation between each successive month. In boxes 2 and 3, you can name the correlation matrix and specify the location to place it in the spreadsheet. This is shown in Figure 12.36. Now, it is very important to ensure that the correlations are mathematically consistent with each other (a mathematical property called positive semidefinite). You can select the Validate button in the Manage Correlations dialog, or Analytic Solver Platform will perform an ­automatic check for this when you try to close the dialog. If the correlation m ­ atrix

Figure 12.34 Create New Correlation Matrix Dialog

does not satisfy this property, it will ask you if you want to adjust the correlations so that it does. Always choose Yes. Click the Update Matrix button (you can make changes manually, but we recommend this only for advanced users) and then Accept Update. The adjusted matrix is shown in Figure 12.37. Note that the correlations between successive months are close to 0.6, but that the matrix now includes some small correlations between other months. This ensures the mathematical consistency needed to run the simulation. You may now close the dialog. The cell range of the correlation matrix is used in the function PsiCorrMatrix(cell range, position, instance), where position corresponds to the number of the uncertain variables in the correlation matrix and instance refers to the name given to the correlation matrix. Analytic Solver Platform adds these functions to the distributions for the uncertain variables that are correlated. For example, the f­ormula in cell E5 for April sales is changed to: = PsiNormal (600000,60000,PsiCorrMatrix(\$B\$33:\$H\$39,1, “Monthly Correlations”)). The ­f ormula in cell F5 for May sales is changed to: = PsiNormal(700000,70000,PsiCorrMatrix (\$B\$33:\$H\$39,2, “Monthly Correlations”)), and so on. Now set the simulation options and run the model. The Excel file Cash Budget Monte Carlo Simulation Model with Correlations provides the completed model for this example.
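The consistency requirement can be checked numerically: a correlation matrix is positive semidefinite when none of its eigenvalues is negative. The Python sketch below builds a 7 × 7 matrix with 0.6 between successive months and zero elsewhere, shows that it fails the check, applies a simple repair, and then samples correlated normal sales. The repair used here (raising negative eigenvalues to a small positive floor and rescaling to a unit diagonal) is our own simple choice and is not claimed to be the adjustment Analytic Solver Platform performs. Only the April and May means come from the example; the other monthly means are hypothetical.

```python
import numpy as np

def tridiagonal_correlation(size, rho):
    """Correlation rho between successive variables, zero elsewhere."""
    matrix = np.eye(size)
    idx = np.arange(size - 1)
    matrix[idx, idx + 1] = rho
    matrix[idx + 1, idx] = rho
    return matrix

def repair_correlation(matrix, floor=1e-6):
    """Raise negative eigenvalues to a small positive floor, then rescale
    so the diagonal is exactly 1 (a simple repair, assumed for illustration)."""
    eigenvalues, eigenvectors = np.linalg.eigh(matrix)
    clipped = np.clip(eigenvalues, floor, None)
    repaired = (eigenvectors * clipped) @ eigenvectors.T
    scale = np.sqrt(np.diag(repaired))
    return repaired / np.outer(scale, scale)

target = tridiagonal_correlation(7, 0.6)
smallest_eigenvalue = np.linalg.eigvalsh(target).min()  # negative: not consistent
adjusted = repair_correlation(target)

# Correlated normal sales with standard deviation equal to 10% of the mean.
# April and May means are from the example; the rest are hypothetical.
means = np.array([600_000, 700_000, 800_000, 800_000, 900_000, 900_000, 1_000_000.0])
rng = np.random.default_rng(seed=21)
cholesky_factor = np.linalg.cholesky(adjusted)
z = rng.standard_normal((20_000, 7)) @ cholesky_factor.T
sales = means + 0.10 * means * z

print(f"Smallest eigenvalue of the requested matrix: {smallest_eigenvalue:.4f}")
print(f"Adjusted April-May correlation:  {adjusted[0, 1]:.4f}")
print(f"Adjusted April-June correlation: {adjusted[0, 2]:.4f}")
print(f"Sample April-May correlation:    {np.corrcoef(sales[:, 0], sales[:, 1])[0, 1]:.4f}")
```

As in the adjusted matrix described in the example, the repaired correlations between successive months stay close to 0.6, and small nonzero correlations appear between months that are not adjacent.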

Figure 12.35 Initial Correlation Matrix

Figure 12.36 Completed Correlation Matrix

Chapter 12  Monte Carlo Simulation and Risk Analysis


You will observe some slight differences in the results when uncertain variables are correlated. For example, the standard deviation for the September balance is lower when correlations are included in the model than when they are not. Generally, inducing correlations into a simulation model tends to reduce the variance of the predicted outputs.

Analytics in Practice: Implementing Large-Scale Monte Carlo Spreadsheet Models⁴

Implementing large-scale Monte Carlo models in spreadsheets in practice can be challenging. This example shows how one company used Monte Carlo simulation for commercial real estate credit-risk analysis but had to develop new approaches to effectively implementing spreadsheet analytics across the company. Based in Stuttgart, Germany, Hypo Real Estate Bank International (Hypo), with a large portfolio in commercial real estate lending, undertakes some of the world’s largest real estate transactions. Hypo was faced with the challenge of complying with Basel II banking regulations in Europe. Basel II was a new regulation for setting the minimum capital to be held in reserve by internationally active banks. If a bank is able to comply with the more demanding requirements of the regulation, it can potentially save €20–€60 million per year in capital costs. To qualify, however, Hypo needed new risk models and reporting systems. The company also wished to upgrade its internal reporting and management framework to provide better analytical tools to its lending officers, who were responsible for structuring new loans, and to provide its managers with better insights into the risks of the overall portfolio.

Monte Carlo simulation is the only practical approach for analyzing the risk models the bank needed. For example, in one commercial real estate application, 200 different macroeconomic and market variables are typically simulated over 20 years. The cash-flow modeling process can be even more complex, particularly if the effects of all the intricate details of the transaction must be quantified. However, the computational process of Monte Carlo simulation is numerically intensive because the entire spreadsheet must be recalculated both for each iteration of the simulation and each individual asset (or transaction) within the portfolio. This pushes the limits of stand-alone Excel models, even for a single asset. Moreover, because the bank is usually interested in analyzing its entire portfolio of thousands of assets, in practice, it becomes impossible to do so using stand-alone Excel. Therefore, Hypo needed a way to implement the complex analytics of simulation in a way that its global offices could use on all their thousands of loans.

In addition to the computational intensity of simulation analytics, the option to build the entire simulation framework in Excel can lead to human error, which they called spreadsheet risk. Spreadsheet risks that Hypo wished to minimize included the following:

• Proliferation of spreadsheet models that are stored on individual users’ desktop computers throughout the organization, are untested, and lack version data, and the unsanctioned manipulation of the results of spreadsheet calculations.
• Potential for serious mistakes resulting from typographical and “cut and copy-and-paste” errors when entering data from other applications or spreadsheets.
• Accidental acceptance of results from incomplete calculations.
• Errors associated with running an insufficient number of Monte Carlo iterations because of data or time constraints.

Given these potential problems, Hypo deemed a pure Excel solution impractical. Instead, they used a consulting firm’s proprietary software, called the Specialized Finance System (SFS), that embeds spreadsheets within a high-performance, server-based system for enterprise applications. This eliminated the spreadsheet risks but allowed users to exploit the flexible programming power that spreadsheets provide, while giving confidence and trust in the results. The new system has improved management reporting and the efficiency of internal processes and has also provided insights into structuring new loans to make them less risky and more profitable.

⁴Based on Yusuf Jafry, Christopher Marrison, and Ulrike Umkehrer-Neudeck, “Hypo International Strengthens Risk Management with a Large-Scale, Secure Spreadsheet-Management Framework,” Interfaces, 38, 4 (July–August 2008): 281–288.

Key Terms

• Box-whisker chart
• Flaw of averages
• Marker line
• Monte Carlo simulation
• Overlay chart
• Risk
• Risk analysis
• Sensitivity chart
• Trend chart
• Uncertain function

Problems and Exercises

1. For the market share model in Problem 5 of Chapter 11, suppose that the estimate of the percentage of new purchasers who will ultimately try the brand is uncertain and assumed to be normally distributed with a mean of 35% and a standard deviation of 4%. Use the NORM.INV function and a one-way data table to conduct a Monte Carlo simulation with 25 trials to find the distribution of the long-run market share.

2. For the garage-band model in Problem 7 of Chapter 11, suppose that the expected crowd is normally distributed with a mean of 3,000 and standard deviation of 200. Use the NORM.INV function and a one-way data table to conduct a Monte Carlo simulation with 25 trials to find the distribution of the expected profit.

3. A professional football team is preparing its budget for the next year. One component of the budget is the revenue that they can expect from ticket sales. The home venue, Dylan Stadium, has five different seating zones with different prices. Key information is given below. The demands are all assumed to be normally distributed.

| Seating Zone | Seats Available | Ticket Price | Mean Demand | Standard Deviation |
|---|---|---|---|---|
| First Level Sideline | 15,000 | \$100.00 | 14,500 | 750 |
| Second Level | 5,000 | \$90.00 | 4,750 | 500 |
| First Level End Zone | 10,000 | \$80.00 | 9,000 | 1,250 |
| Third Level Sideline | 21,000 | \$70.00 | 17,000 | 2,500 |
| Third Level End Zone | 14,000 | \$60.00 | 8,000 | 3,000 |

Determine the distribution of total revenue under these assumptions using an Excel data table with 50 simulated trials. Summarize your results with a histogram.

4. For the new-product model in Problem 9 of Chapter 11, suppose that the first-year sales volume is normally distributed with a mean of 100,000 units and a standard deviation of 10,000. Use the NORM.INV function and a one-way data table to conduct a Monte Carlo simulation to find the distribution of the net present value profit over the 3-year period.

5. Financial analysts often use the following model to characterize changes in stock prices:

$$P_t = P_0 e^{(\mu - 0.5\sigma^2)t + \sigma Z \sqrt{t}}$$

where
$P_0$ = current stock price
$P_t$ = price at time $t$
$\mu$ = mean (logarithmic) change of the stock price per unit time
$\sigma$ = (logarithmic) standard deviation of price change
$Z$ = standard normal random variable

This model assumes that the logarithm of a stock’s price is a normally distributed random variable (see the discussion of the lognormal distribution and note that the first term of the exponent is the mean of the lognormal distribution). Using historical data, we can estimate values for $\mu$ and $\sigma$. Suppose that the average daily change for a stock is \$0.003227, and the standard deviation is 0.026154. Develop a spreadsheet to simulate the price of the stock over the next 30 days if the current price is \$53. Use the Excel function NORM.S.INV(RAND( )) to generate values for $Z$. Construct a chart showing the movement in the stock price.

6. Use Analytic Solver Platform to simulate the Outsourcing Decision Model under the assumptions that the production volume will be triangular with a minimum of 800, maximum of 1,700, and most likely value of 1,400, and that the unit supplier cost is normally distributed with a mean of \$175 and a standard deviation of \$12. Find the probability that outsourcing will result in the best decision.

7. For the Outsourcing Decision Model, suppose that the demand volume is lognormally distributed with a mean of 1,500 and standard deviation of 500. What is the distribution of the cost differences between manufacturing in-house and purchasing? What decision would you recommend? Define both the cost difference and decision as output cells. Because output cells in Analytic Solver Platform must be numeric, replace the formula in cell B20 with =IF(B19

26, or 0.58, which represents the utility of this payoff. The decision of accepting \$600 outright or taking the gamble could be made by flipping a coin. These individuals tend to ignore risk measures and base their decisions on the average payoffs.

A utility function may be used instead of the actual monetary payoffs in a decision analysis by simply replacing the payoffs with their equivalent utilities and then computing expected values. The expected utilities and the corresponding optimal decision strategy then reflect the decision maker’s preferences toward risk. For example, if we use the average payoff strategy (because no probabilities of events are given) for the data in Table 16.2, the best decision would be to choose the stock fund. However, if we replace the payoffs in Table 16.2 with the (risk-averse) utilities that we defined and again use the average payoff strategy, the best decision would be to choose the bank CD as opposed to the stock fund, as shown in the following table.


Chapter 16  Decision Analysis

| Decision/Event | Rates Rise | Rates Stable | Rates Fall | Average Utility |
|---|---|---|---|---|
| Bank CD | 0.75 | 0.75 | 0.75 | 0.75 |
| Bond fund | 0.35 | 0.85 | 0.9 | 0.70 |
| Stock fund | 0 | 0.80 | 1.0 | 0.60 |

If assessments of event probabilities are available, these can be used to compute the expected utility and identify the best decision.
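As a minimal sketch of that calculation, the Python snippet below weights each utility in the preceding table by an event probability and picks the decision with the highest expected utility. The probabilities (0.2, 0.5, 0.3) are hypothetical values chosen only for illustration; they do not come from the text, and the variable and function names are likewise assumptions.

```python
# Utilities from the preceding table, ordered (rates rise, rates stable, rates fall).
utilities = {
    "Bank CD": [0.75, 0.75, 0.75],
    "Bond fund": [0.35, 0.85, 0.90],
    "Stock fund": [0.00, 0.80, 1.00],
}

# Hypothetical event probabilities (an assumption for illustration only).
probabilities = [0.2, 0.5, 0.3]


def expected_utility(row, probs):
    """Probability-weighted sum of the utilities for one decision."""
    return sum(p * u for p, u in zip(probs, row))


expected = {
    decision: expected_utility(row, probabilities)
    for decision, row in utilities.items()
}
best_decision = max(expected, key=expected.get)

for decision, value in expected.items():
    print(f"{decision}: {value:.3f}")
print("Best decision under these assumed probabilities:", best_decision)
```

With different probabilities the ranking can change, which is why the choice depends on both the utility function and the probability assessments.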

Exponential Utility Functions

It can be rather difficult to compute a utility function, especially for situations involving a large number of payoffs. Because most decision makers typically are risk averse, we may use an exponential utility function to approximate the true utility function. The exponential utility function is

$$U(x) = 1 - e^{-x/R} \tag{16.2}$$

where $e$ is the base of the natural logarithm (2.71828 …) and $R$ is a shape parameter that is a measure of risk tolerance. Figure 16.14 shows several examples of $U(x)$ for different values of $R$. Notice that all these functions are concave and that as $R$ increases, the functions become flatter, indicating more tendency toward risk neutrality. One approach to estimating a reasonable value of $R$ is to find the maximum payoff \$R for which the decision maker is willing to take an equal chance on winning \$R or losing \$R/2. The smaller the value of $R$, the more risk averse is the individual. For instance, would you take a bet on winning \$10 versus losing \$5? How about winning \$10,000 versus losing \$5,000? Most people probably would not worry about taking the first gamble but might definitely think twice about the second. Finding one’s maximum comfort level establishes the utility function.
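A short worked calculation, added here as an illustration rather than taken from the text, suggests why this gamble is a reasonable way to calibrate $R$. Under the exponential utility function, the expected utility of an equal chance of winning \$R or losing \$R/2 is

$$\tfrac{1}{2}U(R) + \tfrac{1}{2}U\!\left(-\tfrac{R}{2}\right) = \tfrac{1}{2}\left(1 - e^{-1}\right) + \tfrac{1}{2}\left(1 - e^{1/2}\right) \approx \tfrac{1}{2}(0.6321) + \tfrac{1}{2}(-0.6487) \approx -0.008,$$

which is very close to $U(0) = 1 - e^{0} = 0$, the utility of declining the gamble. A decision maker with risk tolerance $R$ is therefore roughly indifferent between taking and refusing this bet, so the largest stake at which the bet is still acceptable gives an approximate value for $R$.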

Figure 16.14 Examples of Exponential Utility Functions


Example 16.19 Using an Exponential Utility Function

For the personal investment decision example, suppose that $R$ = \$400. The utility function is $U(x) = 1 - e^{-x/400}$, resulting in the following utility values:

| Payoff, X | Utility, U(X) |
|---|---|
| \$1,700 | 0.9857 |
| \$1,000 | 0.9179 |
| \$840 | 0.8775 |
| \$600 | 0.7769 |
| \$400 | 0.6321 |
| −\$500 | −2.4903 |
| −\$900 | −8.4877 |

Using the utility values in the payoff table, we find that the bank CD remains the best decision, as shown in the following table, as it has the highest average utility.

| Decision/Event | Rates Rise | Rates Stable | Rates Fall | Average Utility |
|---|---|---|---|---|
| Bank CD | 0.6321 | 0.6321 | 0.6321 | 0.6321 |
| Bond fund | −2.4903 | 0.8775 | 0.9179 | −0.2316 |
| Stock fund | −8.4877 | 0.7769 | 0.9857 | −2.2417 |
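The calculations in this example can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the payoff rows are inferred by matching the utilities in the two tables above to their payoffs (Table 16.2 itself is not reproduced here), and the names `exponential_utility`, `RISK_TOLERANCE`, and `payoffs` are introduced only for illustration.

```python
import math


def exponential_utility(payoff, risk_tolerance):
    """Exponential utility U(x) = 1 - exp(-x / R) from equation (16.2)."""
    return 1.0 - math.exp(-payoff / risk_tolerance)


RISK_TOLERANCE = 400.0  # R = $400, as in Example 16.19

# Payoffs ordered (rates rise, rates stable, rates fall); inferred from the
# utility tables in the example, so treat them as an assumption.
payoffs = {
    "Bank CD": [400, 400, 400],
    "Bond fund": [-500, 840, 1000],
    "Stock fund": [-900, 600, 1700],
}

average_utility = {}
for decision, row in payoffs.items():
    row_utilities = [exponential_utility(x, RISK_TOLERANCE) for x in row]
    average_utility[decision] = sum(row_utilities) / len(row_utilities)

best_decision = max(average_utility, key=average_utility.get)

for decision, value in average_utility.items():
    print(f"{decision}: average utility = {value:.4f}")
print("Best decision by average utility:", best_decision)
```

Changing `RISK_TOLERANCE` is a quick way to explore how the preferred decision shifts as the decision maker becomes more or less risk averse.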

Analytics in Practice: Using Decision Analysis in Drug Development

Drug development in the United States is time consuming, resource intensive, risky, and heavily regulated.² On average, it takes nearly 15 years to research and develop a drug in the United States, with an after-tax cost in 1990 dollars of approximately \$200 million. In July 1999, the biological products leadership committee, composed of the senior managers within Bayer Biological Products (BP), a business unit of Bayer Pharmaceuticals (Pharma), made its newly formed strategic-planning department responsible for the commercial evaluation of a new blood-clot-busting drug. To ensure that it made the best drug-development decisions, Pharma used a structured process based on the principles of decision analysis to evaluate the technical feasibility and market potential of its new drugs. Previously, BP had analyzed a few business cases for review by Pharma. This commercial evaluation was BP’s first decision analysis project.

Probability distributions of uncertain variables were assessed by obtaining estimates of the 10th percentile and 90th percentile from experts, each of whom was asked to review the results to make sure they accurately reflected his or her judgment. Pharma used net present value (NPV) as its decision-making criterion. Given the complexity and inherent structure of decisions concerning new drugs, the new-drug-development decision making was defined as a sequence of six decision points, with identified key market-related and scientific deliverables so that senior managers could assess the likelihood of success versus the company’s exposure to risk, costs, and strategic fit. Decision point 1 was whether to begin preclinical development. After successful preclinical animal testing, Bayer can decide (decision point 2) to begin testing the drug in humans. Decision point 3 and decision point 4 are both decisions to invest or not in continuing clinical development. Following successful completion of development, Bayer can choose to file a biological license application with the FDA (decision point 5). If the FDA approves it, Bayer can decide (decision point 6) to launch the new drug in the marketplace.

The project team presented their input assumptions and recommendations for the commercial evaluation of the drug to the three levels of Pharma decision makers, who eventually approved preclinical development. External validation of the data inputs and assumptions demonstrated their rigor and defensibility. Senior managers could compare the evaluation results for the proposed drug with those for other development drugs with confidence. The international committees lauded the project team’s effort as top-notch, and the decision-analysis approach set new standards for subsequent BP analyses.

²Based on Jeffrey S. Stonebraker, “How Bayer Makes Decisions to Develop New Drugs,” Interfaces, 32, 6 (November–December 2002): 77–90.

Key Terms

• Average payoff strategy
• Branches
• Certainty equivalent
• Decision alternatives
• Decision making
• Decision node
• Decision strategy
• Decision tree
• Event node
• Expected opportunity loss
• Expected value of perfect information (EVPI)
• Expected value of sample information (EVSI)
• Expected value strategy
• Laplace, or average payoff, strategy
• Maximax strategy
• Maximin strategy
• Minimax regret strategy
• Minimax strategy
• Minimin strategy
• Nodes
• Outcomes
• Payoffs
• Payoff table
• Perfect information
• Risk premium
• Risk profile
• Sample information
• States of nature
• Uncertain events
• Utility theory
• Value of information

Problems and Exercises

Note: Data for selected problems can be found in the Excel file Chapter 16 Problem Data to facilitate your problem-solving efforts. Worksheet tabs correspond to the problem numbers.

1. Use the Outsourcing Decision Model Excel file to compute the cost of in-house manufacturing and outsourcing for the following levels of demand: 800, 1000, 1200, and 1400. Use this information to set up a payoff table for the decision problem, and apply the aggressive, conservative, and opportunity loss strategies.

2. The DoorCo Corporation is a leading manufacturer of garage doors. All doors are manufactured in their plant in Carmel, Indiana, and shipped to distribution centers or major customers. DoorCo recently acquired another manufacturer of garage doors, Wisconsin Door, and is considering moving its wood-door operations to the Wisconsin plant. Key considerations in this decision are the transportation, labor, and production costs at the two plants. Complicating matters is the fact that marketing is predicting a decline in the demand for wood doors. The company developed three scenarios:
   1. Demand falls slightly, with no noticeable effect on production.
   2. Demand and production decline 20%.
   3. Demand and production decline 40%.

   The following table shows the total costs under each decision and scenario.

| Decision | Slight Decline | 20% Decline | 40% Decline |
|---|---|---|---|
| Stay in Carmel | \$1,000,000 | \$900,000 | \$840,000 |
| Move to Wisconsin | \$1,200,000 | \$915,000 | \$800,000 |

   What decision should DoorCo make using each strategy?
   a. aggressive strategy
   b. conservative strategy
   c. opportunity-loss strategy

3. Suppose that a car-rental agency offers insurance for a week that costs \$75. A minor fender bender will cost \$2,000, whereas a major accident might cost \$16,000 in repairs. Without the insurance, you would be personally liable for any damages. What should you do? Clearly, there are two decision alternatives: take the insurance, or do not take the insurance. The uncertain consequences, or events that might occur, are that you would not be involved in an accident, that you would be involved in a fender bender, or that you would be involved in a major accident. Develop a payoff table for this situation. What decision should you make using each strategy?
   a. aggressive strategy
   b. conservative strategy
   c. opportunity-loss strategy

4. Slaggert Systems is considering becoming certified to the ISO 9000 series of quality standards. Becoming certified is expensive, but the company could lose a substantial amount of business if its major customers suddenly demand ISO certification and the company does not have it. At a management retreat, the senior executives of the firm developed the following payoff table, indicating the net present value of profits over the next 5 years.

| Decision | Customer Response: Standards Required | Customer Response: Standards Not Required |
|---|---|---|
| Become certified | \$575,000 | \$500,000 |
| Stay uncertified | \$450,000 | \$600,000 |

   What decision should the company make using each strategy?
   a. aggressive strategy
   b. conservative strategy
   c. opportunity-loss strategy

5. For the DoorCo Corporation decision in Problem 2, compute the standard deviation of the payoffs for each decision. What does this tell you about the risk in making the decision?

6. For the car-rental situation in Problem 3, compute the standard deviation of the payoffs for each decision. What does this tell you about the risk in making the decision?

7. For the Slaggert Systems decision in Problem 4, compute the standard deviation of the payoffs for each decision. What does this tell you about the risk in making the decision?

8. What decisions should be made using the average payoff strategy in Problems 2, 3, and 4?

9. For the DoorCo Corporation decision in Problem 2, suppose that the probabilities of the three scenarios are estimated to be 0.15, 0.40, and 0.45, respectively. Find the best expected value decision.

10. For the car-rental situation described in Problem 3, assume that you researched insurance industry statistics and found out that the probability of a major accident is 0.05% and that the probability of a fender bender is 0.16%. What is the expected value decision? Would you choose this? Why or why not?

11. For the DoorCo Corporation decision in Problems 2 and 9, construct a decision tree and compute the rollback values to find the best expected value decision.

12. For the car-rental decision in Problems 3 and 10, construct a decision tree and compute the rollback values to find the best expected value decision.

13. For the car-rental decision in Problems 3 and 10, suppose that the cost of a minor fender bender is normally distributed with a mean of \$2000 and standard deviation of \$100, and the cost of a major accident is triangular with a minimum of \$10,000, maximum of \$25,000, and most likely value of \$16,000. Use Analytic Solver Platform to simulate the decision tree and find the distribution of the expected value of not taking the insurance.

14. An information system consultant is bidding on a project that involves some uncertainty. Based on past experience, if all went well (probability 0.1), the project would cost \$1.2 million to complete. If moderate debugging were required (probability 0.7), the project would probably cost \$1.4 million. If major problems were encountered (probability 0.2), the project could cost \$1.8 million. Assume that the firm is bidding competitively and the expectation of successfully gaining the job at a bid of \$2.2 million is 0, at \$2.1 million is 0.1, at \$2.0 million is 0.2, at \$1.9 million is 0.3, at \$1.8 million is 0.5, at \$1.7 million is 0.8, and at \$1.6 million is practically certain.
   a. Calculate the expected value for the given bids.
   b. What is the best bidding decision?

15. IM Retail sells all items of a popular cosmetic brand, Beau. For a particular item, the stocking cost, selling price, and cost price vary with the season. The cost price of the item in season is \$12, while its selling price in season is \$18. After the season, the bargain price is \$9 and the cost of stocking the item after the season is \$1. From past data, IM Retail has developed the following probability distribution for demand:

| Demand (units) | Probability |
|---|---|
| 7 | .20 |
| 8 | .20 |
| 9 | .25 |
| 10 | .15 |
| 11 | .20 |

   a. Construct a payoff table for IM Retail’s decision problem of how many units to stock. What is the best decision from an expected value basis?
   b. Find the expected value of perfect information.
   c. What is the expected demand? What is the expected profit if the retailer stocks the expected demand?

16. Bev’s Bakery specializes in sourdough bread. Early each morning, Bev must decide how many loaves to bake for the day. Each loaf costs \$1.25 to make and sells for \$3.50. Bread left over at the end of the day can be sold the next day for \$1.00. Past data indicate that demand is distributed as follows:

| Number of Loaves | Probability |
|---|---|
| 15 | 0.02 |
| 16 | 0.05 |
| 17 | 0.10 |
| 18 | 0.16 |
| 19 | 0.28 |
| 20 | 0.20 |
| 21 | 0.15 |
| 22 | 0.04 |

   a. Construct a payoff table and determine the optimal quantity for Bev to bake each morning using expected values.
   b. What is the optimal quantity for Bev to bake if the unsold loaves are not sold the next day but are instead donated to a food bank?

17. Ravex Yacht has developed a new cabin cruiser, which they have earmarked for the medium to large boat market. A market analysis suggests a 30% probability of annual sales being 5000 boats, 40% probability of 4000 annual sales, and 30% probability of 3000 annual sales. The firm can go into limited production, where variable costs are \$10,000 per boat and fixed costs are \$800,000 annually. Or the firm can go into full-scale production, where variable costs are \$9000 per boat and fixed costs are \$5,000,000 annually.
   a. Construct a decision tree for the situation.
   b. Compute payoffs and probabilities.
   c. If the boat is to be sold at \$11,000, should the company go into limited or full-scale production such that the profits are maximized?

18. Midwestern Hardware must decide how many snow shovels to order for the coming snow season. Each shovel costs \$15.00 and is sold for \$29.95. No inventory is carried from one snow season to the next. Shovels unsold after February are sold at a discount price of \$10.00. Past data indicate that sales are highly dependent on the severity of the winter season. Past seasons have been classified as mild or harsh, and the following distribution of regular price demand has been tabulated:

| Mild Winter: No. of Shovels | Mild Winter: Probability | Harsh Winter: No. of Shovels | Harsh Winter: Probability |
|---|---|---|---|
| 250 | 0.5 | 1,500 | 0.2 |
| 300 | 0.4 | 2,500 | 0.3 |
| 350 | 0.1 | 3,000 | 0.5 |

   Shovels must be ordered from the manufacturer in lots of 200; thus, possible order sizes are 200, 400, 1,400, 1,600, 2,400, 2,600, and 3,000 units. Construct a decision tree to illustrate the components of the decision model, and find the optimal quantity for Midwestern to order if the forecast calls for a 40% chance of a harsh winter.

19. Perform a sensitivity analysis of the Midwestern Hardware scenario (Problem 18). Find the optimal order quantity and optimal expected profit for probabilities of a harsh winter ranging from 0.2 to 0.8 in increments of 0.2. Plot optimal expected profit as a function of the probability of a harsh winter.

20. Dean Kuroff started a business of rehabbing old homes. He recently purchased a circa-1800 Victorian mansion and converted it into a three-family residence. Recently, one of his tenants complained that the refrigerator was not working properly. Dean’s cash flow was not extensive, so he was not excited about purchasing a new refrigerator. He is considering two other options: purchase a used refrigerator or repair the current unit. He can purchase a new one for \$400, and it will easily last 3 years. If he repairs the current one, he estimates a repair cost of \$150, but he also believes that there is only a 30% chance that it will last a full 3 years and he will end up purchasing a new one anyway. If he buys a used refrigerator for \$200, he estimates that there is a 0.6 probability that it will last at least 3 years. If it breaks down, he will still have the option of repairing it for \$150 or buying a new one. Develop a decision tree for this situation and determine Dean’s optimal strategy.

21. Many automobile dealers advertise lease options for new cars. Suppose that you are considering three alternatives:
   1. Purchase the car outright with cash.
   2. Purchase the car with 20% down and a 48-month loan.
   3. Lease the car.

   Select an automobile whose leasing contract is advertised in a local paper. Using current interest rates and advertised leasing arrangements, perform a decision analysis of these options. Make, but clearly define, any assumptions that may be required.

22. Drilling decisions by oil and gas operators involve intensive capital expenditures made in an environment characterized by limited information and high risk. A well site is dry, wet, or gushing. Historically, 50% of all wells have been dry, 30% wet, and 20% gushing. The value (net of drilling costs) for each type of well is as follows:

| Well Type | Value |
|---|---|
| Dry | −\$80,000 |
| Wet | \$100,000 |
| Gushing | \$200,000 |

   Wildcat operators often investigate oil prospects in areas where deposits are thought to exist by making geological and geophysical examinations of the area before obtaining a lease and drilling permit. This often includes recording shock waves from detonations by a seismograph and using a magnetometer to measure the intensity of Earth’s magnetic effect to detect rock formations below the surface. The cost of doing such studies is approximately \$15,000. Of course, one may choose to drill in a location based on “gut feel” and avoid the cost of the study. The geological and geophysical examination classifies an area into one of three categories: no structure (NS), which is a bad sign; open structure (OS), which is an “OK” sign; and closed structure (CS), which is hopeful. Historically, 40% of the tests resulted in NS, 35% resulted in OS, and 25% resulted in CS readings. After the result of the test is known, the company may decide not to drill. The following table shows probabilities that the well will actually be dry, wet, or gushing based on the classification provided by the examination (in essence, the examination cannot accurately predict the actual event):

| Classification | Dry | Wet | Gushing |
|---|---|---|---|
| NS | 0.73 | 0.22 | 0.05 |
| OS | 0.43 | 0.34 | 0.23 |
| CS | 0.23 | 0.372 | 0.398 |

   a. Construct a decision tree of this problem that includes the decision of whether or not to perform the geological tests.
   b. What is the optimal decision under expected value when no experimentation is conducted?
   c. Find the overall optimal strategy by rolling back the tree.

23. Hahn Engineering is planning on bidding on a job and often competes against a major competitor, Sweigart and Associates (S&A), as well as other firms. Historically, S&A has bid for the same jobs 80% of the time; thus the probability that S&A will bid on this job is 0.80. If S&A bids on a job, the probability that Hahn Engineering will win it is 0.30. If S&A does not bid on a job, the probability that Hahn will win the bid is 0.60. Apply Bayes’s rule to find the probability that Hahn Engineering will win the bid. If they do, what is the probability that S&A did bid on it?

24. MJ Logistics has decided to build a new warehouse to support its supply chain activities. They have the option of building either a large warehouse or a small one. Construction costs for the large facility are \$8 million versus \$3 million for the small facility. The profit (excluding construction cost) depends on the volume of work the company expects to contract for in the future. This is summarized in the following table (in millions of dollars):

| Decision | High Volume | Low Volume |
|---|---|---|
| Large warehouse | \$35 | \$20 |
| Small warehouse | \$25 | \$15 |

   The company believes that there is a 60% chance that the volume of demand will be high.
   a. Construct a decision tree to identify the best choice.
   b. Suppose that the company engages an economic expert to provide an opinion about the volume of work based on a forecast of economic conditions. Historically, the expert’s upside predictions have been 75% accurate, whereas the downside predictions have been 90% accurate. In contrast to the company’s assessment, the expert believes that the chance for high demand is 70%. Determine the best strategy if their predictions suggest that the economy will improve or will deteriorate. Given the information, what is the probability that the volume will be high?

25. Consider the car-rental insurance scenario in Problems 3 and 10. Use the approach described in this chapter to develop your personal utility function for the payoffs associated with this decision. Determine the decision that would result using the utilities instead of the payoffs. Is the decision consistent with your choice?

26. A college football team is trailing 14–0 late in the game. The team just made a touchdown. If they can hold the opponent and score one more time, they can tie or win the game. The coach is wondering whether to go for an extra-point kick or a two-point conversion now and what to do if they can score again.
   a. Develop a decision tree for the coach’s decision.
   b. Estimate probabilities for successful kicks or two-point conversions and a last-minute score. (You might want to do this by doing some group brainstorming or by calling on experts, such as your school’s coach or a sports journalist.) Using these probabilities, determine the optimal strategy.
   c. Why would utility theory be a better approach than using the points for making a decision? Propose a utility function and compare your results.

Case: Performance Lawn Equipment

PLE has developed a prototype for a new snow blower for the consumer market. This can exploit the company’s expertise in small-gasoline-engine technology and also balance seasonal demand cycles in the North American and European markets to provide additional revenues during the winter months. Initially, PLE faces two possible decisions: introduce the product globally at a cost of \$850,000 or evaluate it in a North American test market at a cost of \$200,000. If it introduces the product globally, PLE might find either a high or low response to the product. Probabilities of these events are estimated to be 0.6 and 0.4, respectively. With a high response, gross revenues of \$2,000,000 are expected; with a low response, the figure is \$450,000. If it starts with a North American test market, it might find a low response or a high response with probabilities 0.3 and 0.7, respectively. This may or may not reflect the global market potential.

In any case, after conducting the marketing research, PLE next needs to decide whether to keep sales only in North America, market globally, or drop the product. If the North American response is high and PLE stays only in North America, the expected revenue is \$1,200,000. If it markets globally (at an additional cost of \$200,000), the probability of a high global response is 0.9 with revenues of \$2,000,000 (\$450,000 if the global response is low). If the North American response is low and it remains in North America, the expected revenue is \$200,000. If it markets globally (at an additional cost of \$600,000), the probability of a high global response is 0.05, with revenues of \$2,000,000 (\$450,000 if the global response is low). Construct a decision tree, determine the optimal strategy, and develop a risk profile associated with the optimal strategy. Evaluate the sensitivity of the optimal strategy to changes in the probability estimates. Summarize all your results, including your recommendation and justification for it, in a formal report to the executive committee, who will ultimately make this decision.
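As a starting point for the case, the roll-back of the decision tree by expected monetary value can be sketched in a few lines of Python. This is only an illustrative sketch, not the intended solution: it assumes that dropping the product yields zero revenue and that the global revenue figures replace (rather than add to) the North American revenue, neither of which the case states explicitly. The function and variable names are hypothetical.

```python
# Roll back the PLE snow-blower decision tree by expected monetary value (EMV).
# Assumptions (not stated in the case): dropping the product yields 0 revenue,
# and the global revenue figures already include any North American sales.

def expected(outcomes):
    """Expected value of a chance node given (probability, payoff) pairs."""
    return sum(p * payoff for p, payoff in outcomes)

HIGH_REV, LOW_REV = 2_000_000, 450_000

# Option 1: introduce globally right away (cost 850,000).
emv_global = expected([(0.6, HIGH_REV), (0.4, LOW_REV)]) - 850_000

# Option 2: test market first, then pick the best follow-up for each response.
def best_follow_up(stay_revenue, global_cost, p_high_global):
    options = {
        "stay in North America": stay_revenue,
        "market globally": expected([(p_high_global, HIGH_REV),
                                     (1 - p_high_global, LOW_REV)]) - global_cost,
        "drop the product": 0,
    }
    choice = max(options, key=options.get)
    return choice, options[choice]

after_high = best_follow_up(1_200_000, 200_000, 0.90)
after_low = best_follow_up(200_000, 600_000, 0.05)
emv_test = expected([(0.7, after_high[1]), (0.3, after_low[1])]) - 200_000

print(f"Global launch EMV: {emv_global:,.0f}")
print(f"Test market EMV:   {emv_test:,.0f}")
print("After high response:", after_high[0])
print("After low response: ", after_low[0])
```

The sketch covers only the roll-back step. The risk profile and the sensitivity analysis requested in the case still have to be built on top of it, for example by re-running the calculation over a range of probability values.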

Appendix A: Statistical Tables

Table A.1 The Cumulative Standard Normal Distribution

| z | .00 | .01 | .02 | .03 | .04 | .05 | .06 | .07 | .08 | .09 |
|---|---|---|---|---|---|---|---|---|---|---|
| −3.9 | .00005 | .00005 | .00004 | .00004 | .00004 | .00004 | .00004 | .00004 | .00003 | .00003 |
| −3.8 | .00007 | .00007 | .00007 | .00006 | .00006 | .00006 | .00006 | .00005 | .00005 | .00005 |
| −3.7 | .00011 | .00010 | .00010 | .00010 | .00009 | .00009 | .00008 | .00008 | .00008 | .00008 |
| −3.6 | .00016 | .00015 | .00015 | .00014 | .00014 | .00013 | .00013 | .00012 | .00012 | .00011 |
| −3.5 | .00023 | .00022 | .00022 | .00021 | .00020 | .00019 | .00019 | .00018 | .00017 | .00017 |
| −3.4 | .00034 | .00032 | .00031 | .00030 | .00029 | .00028 | .00027 | .00026 | .00025 | .00024 |
| −3.3 | .00048 | .00047 | .00045 | .00043 | .00042 | .00040 | .00039 | .00038 | .00036 | .00035 |
| −3.2 | .00069 | .00066 | .00064 | .00062 | .00060 | .00058 | .00056 | .00054 | .00052 | .00050 |
| −3.1 | .00097 | .00094 | .00090 | .00087 | .00084 | .00082 | .00079 | .00076 | .00074 | .00071 |
| −3.0 | .00135 | .00131 | .00126 | .00122 | .00118 | .00114 | .00111 | .00107 | .00103 | .00100 |
| −2.9 | .0019 | .0018 | .0018 | .0017 | .0016 | .0016 | .0015 | .0015 | .0014 | .0014 |
| −2.8 | .0026 | .0025 | .0024 | .0023 | .0023 | .0022 | .0021 | .0021 | .0020 | .0019 |
| −2.7 | .0035 | .0034 | .0033 | .0032 | .0031 | .0030 | .0029 | .0028 | .0027 | .0026 |
| −2.6 | .0047 | .0045 | .0044 | .0043 | .0041 | .0040 | .0039 | .0038 | .0037 | .0036 |
| −2.5 | .0062 | .0060 | .0059 | .0057 | .0055 | .0054 | .0052 | .0051 | .0049 | .0048 |
| −2.4 | .0082 | .0080 | .0078 | .0075 | .0073 | .0071 | .0069 | .0068 | .0066 | .0064 |
| −2.3 | .0107 | .0104 | .0102 | .0099 | .0096 | .0094 | .0091 | .0089 | .0087 | .0084 |
| −2.2 | .0139 | .0136 | .0132 | .0129 | .0125 | .0122 | .0119 | .0116 | .0113 | .0110 |
| −2.1 | .0179 | .0174 | .0170 | .0166 | .0162 | .0158 | .0154 | .0150 | .0146 | .0143 |
| −2.0 | .0228 | .0222 | .0217 | .0212 | .0207 | .0202 | .0197 | .0192 | .0188 | .0183 |
| −1.9 | .0287 | .0281 | .0274 | .0268 | .0262 | .0256 | .0250 | .0244 | .0239 | .0233 |
| −1.8 | .0359 | .0351 | .0344 | .0336 | .0329 | .0322 | .0314 | .0307 | .0301 | .0294 |
| −1.7 | .0446 | .0436 | .0427 | .0418 | .0409 | .0401 | .0392 | .0384 | .0375 | .0367 |
| −1.6 | .0548 | .0537 | .0526 | .0516 | .0505 | .0495 | .0485 | .0475 | .0465 | .0455 |
| −1.5 | .0668 | .0655 | .0643 | .0630 | .0618 | .0606 | .0594 | .0582 | .0571 | .0559 |
| −1.4 | .0808 | .0793 | .0778 | .0764 | .0749 | .0735 | .0721 | .0708 | .0694 | .0681 |
| −1.3 | .0968 | .0951 | .0934 | .0918 | .0901 | .0885 | .0869 | .0853 | .0838 | .0823 |
| −1.2 | .1151 | .1131 | .1112 | .1093 | .1075 | .1056 | .1038 | .1020 | .1003 | .0985 |
| −1.1 | .1357 | .1335 | .1314 | .1292 | .1271 | .1251 | .1230 | .1210 | .1190 | .1170 |
| −1.0 | .1587 | .1562 | .1539 | .1515 | .1492 | .1469 | .1446 | .1423 | .1401 | .1379 |
| −0.9 | .1841 | .1814 | .1788 | .1762 | .1736 | .1711 | .1685 | .1660 | .1635 | .1611 |
| −0.8 | .2119 | .2090 | .2061 | .2033 | .2005 | .1977 | .1949 | .1922 | .1894 | .1867 |
| −0.7 | .2420 | .2388 | .2358 | .2327 | .2296 | .2266 | .2236 | .2206 | .2177 | .2148 |
| −0.6 | .2743 | .2709 | .2676 | .2643 | .2611 | .2578 | .2546 | .2514 | .2482 | .2451 |
| −0.5 | .3085 | .3050 | .3015 | .2981 | .2946 | .2912 | .2877 | .2843 | .2810 | .2776 |
| −0.4 | .3446 | .3409 | .3372 | .3336 | .3300 | .3264 | .3228 | .3192 | .3156 | .3121 |
| −0.3 | .3821 | .3783 | .3745 | .3707 | .3669 | .3632 | .3594 | .3557 | .3520 | .3483 |
| −0.2 | .4207 | .4168 | .4129 | .4090 | .4052 | .4013 | .3974 | .3936 | .3897 | .3859 |
| −0.1 | .4602 | .4562 | .4522 | .4483 | .4443 | .4404 | .4364 | .4325 | .4286 | .4247 |
| −0.0 | .5000 | .4960 | .4920 | .4880 | .4840 | .4801 | .4761 | .4721 | .4681 | .4641 |
| 0.0 | .5000 | .5040 | .5080 | .5120 | .5160 | .5199 | .5239 | .5279 | .5319 | .5359 |
| 0.1 | .5398 | .5438 | .5478 | .5517 | .5557 | .5596 | .5636 | .5675 | .5714 | .5753 |
| 0.2 | .5793 | .5832 | .5871 | .5910 | .5948 | .5987 | .6026 | .6064 | .6103 | .6141 |
| 0.3 | .6179 | .6217 | .6255 | .6293 | .6331 | .6368 | .6406 | .6443 | .6480 | .6517 |
| 0.4 | .6554 | .6591 | .6628 | .6664 | .6700 | .6736 | .6772 | .6808 | .6844 | .6879 |
| 0.5 | .6915 | .6950 | .6985 | .7019 | .7054 | .7088 | .7123 | .7157 | .7190 | .7224 |
| 0.6 | .7257 | .7291 | .7324 | .7357 | .7389 | .7422 | .7454 | .7486 | .7518 | .7549 |
| 0.7 | .7580 | .7612 | .7642 | .7673 | .7704 | .7734 | .7764 | .7794 | .7823 | .7852 |
| 0.8 | .7881 | .7910 | .7939 | .7967 | .7995 | .8023 | .8051 | .8078 | .8106 | .8133 |
| 0.9 | .8159 | .8186 | .8212 | .8238 | .8264 | .8289 | .8315 | .8340 | .8365 | .8389 |
| 1.0 | .8413 | .8438 | .8461 | .8485 | .8508 | .8531 | .8554 | .8577 | .8599 | .8621 |
| 1.1 | .8643 | .8665 | .8686 | .8708 | .8729 | .8749 | .8770 | .8790 | .8810 | .8830 |
| 1.2 | .8849 | .8869 | .8888 | .8907 | .8925 | .8944 | .8962 | .8980 | .8997 | .9015 |
| 1.3 | .9032 | .9049 | .9066 | .9082 | .9099 | .9115 | .9131 | .9147 | .9162 | .9177 |
| 1.4 | .9192 | .9207 | .9222 | .9236 | .9251 | .9265 | .9279 | .9292 | .9306 | .9319 |
| 1.5 | .9332 | .9345 | .9357 | .9370 | .9382 | .9394 | .9406 | .9418 | .9429 | .9441 |
| 1.6 | .9452 | .9463 | .9474 | .9484 | .9495 | .9505 | .9515 | .9525 | .9535 | .9545 |
| 1.7 | .9554 | .9564 | .9573 | .9582 | .9591 | .9599 | .9608 | .9616 | .9625 | .9633 |
| 1.8 | .9641 | .9649 | .9656 | .9664 | .9671 | .9678 | .9686 | .9693 | .9699 | .9706 |
| 1.9 | .9713 | .9719 | .9726 | .9732 | .9738 | .9744 | .9750 | .9756 | .9761 | .9767 |
| 2.0 | .9772 | .9778 | .9783 | .9788 | .9793 | .9798 | .9803 | .9808 | .9812 | .9817 |
| 2.1 | .9821 | .9826 | .9830 | .9834 | .9838 | .9842 | .9846 | .9850 | .9854 | .9857 |
| 2.2 | .9861 | .9864 | .9868 | .9871 | .9875 | .9878 | .9881 | .9884 | .9887 | .9890 |
| 2.3 | .9893 | .9896 | .9898 | .9901 | .9904 | .9906 | .9909 | .9911 | .9913 | .9916 |
| 2.4 | .9918 | .9920 | .9922 | .9925 | .9927 | .9929 | .9931 | .9932 | .9934 | .9936 |
| 2.5 | .9938 | .9940 | .9941 | .9943 | .9945 | .9946 | .9948 | .9949 | .9951 | .9952 |
| 2.6 | .9953 | .9955 | .9956 | .9957 | .9959 | .9960 | .9961 | .9962 | .9963 | .9964 |
| 2.7 | .9965 | .9966 | .9967 | .9968 | .9969 | .9970 | .9971 | .9972 | .9973 | .9974 |
| 2.8 | .9974 | .9975 | .9976 | .9977 | .9977 | .9978 | .9979 | .9979 | .9980 | .9981 |
| 2.9 | .9981 | .9982 | .9982 | .9983 | .9984 | .9984 | .9985 | .9985 | .9986 | .9986 |
| 3.0 | .99865 | .99869 | .99874 | .99878 | .99882 | .99886 | .99889 | .99893 | .99897 | .99900 |
| 3.1 | .99903 | .99906 | .99910 | .99913 | .99916 | .99918 | .99921 | .99924 | .99926 | .99929 |
| 3.2 | .99931 | .99934 | .99936 | .99938 | .99940 | .99942 | .99944 | .99946 | .99948 | .99950 |
| 3.3 | .99952 | .99953 | .99955 | .99957 | .99958 | .99960 | .99961 | .99962 | .99964 | .99965 |
| 3.4 | .99966 | .99968 | .99969 | .99970 | .99971 | .99972 | .99973 | .99974 | .99975 | .99976 |
| 3.5 | .99977 | .99978 | .99978 | .99979 | .99980 | .99981 | .99981 | .99982 | .99983 | .99983 |
| 3.6 | .99984 | .99985 | .99985 | .99986 | .99986 | .99987 | .99987 | .99988 | .99988 | .99989 |
| 3.7 | .99989 | .99990 | .99990 | .99990 | .99991 | .99991 | .99992 | .99992 | .99992 | .99992 |
| 3.8 | .99993 | .99993 | .99993 | .99994 | .99994 | .99994 | .99994 | .99995 | .99995 | .99995 |
| 3.9 | .99995 | .99995 | .99996 | .99996 | .99996 | .99996 | .99996 | .99996 | .99997 | .99997 |

Entry represents area under the cumulative standardized normal distribution from −∞ to z.
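The entries of Table A.1 can also be computed directly rather than looked up. The following is a minimal sketch using the `NormalDist` class from Python's standard `statistics` module (Python 3.8 or later); the helper name `table_a1_entry` is hypothetical and chosen only for this illustration.

```python
from statistics import NormalDist

# Standard normal distribution (mean 0, standard deviation 1).
STANDARD_NORMAL = NormalDist()

def table_a1_entry(z):
    """Area under the standard normal curve from -infinity to z."""
    return STANDARD_NORMAL.cdf(z)

for z in (-1.96, 0.0, 1.0, 1.645):
    print(f"z = {z:6.3f}  cumulative area = {table_a1_entry(z):.4f}")
```

A computed value is convenient for z values that fall between the tabulated rows and columns, where the table would otherwise require interpolation.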

Table A.2 Critical Values of t

Upper Tail Areas

| Degrees of Freedom | .25 | .10 | .05 | .025 | .01 | .005 |
|---|---|---|---|---|---|---|
| 1 | 1.0000 | 3.0777 | 6.3138 | 12.7062 | 31.8207 | 63.6574 |
| 2 | 0.8165 | 1.8856 | 2.9200 | 4.3027 | 6.9646 | 9.9248 |
| 3 | 0.7649 | 1.6377 | 2.3534 | 3.1824 | 4.5407 | 5.8409 |
| 4 | 0.7407 | 1.5332 | 2.1318 | 2.7764 | 3.7469 | 4.6041 |
| 5 | 0.7267 | 1.4759 | 2.0150 | 2.5706 | 3.3649 | 4.0322 |
| 6 | 0.7176 | 1.4398 | 1.9432 | 2.4469 | 3.1427 | 3.7074 |
| 7 | 0.7111 | 1.4149 | 1.8946 | 2.3646 | 2.9980 | 3.4995 |
| 8 | 0.7064 | 1.3968 | 1.8595 | 2.3060 | 2.8965 | 3.3554 |
| 9 | 0.7027 | 1.3830 | 1.8331 | 2.2622 | 2.8214 | 3.2498 |
| 10 | 0.6998 | 1.3722 | 1.8125 | 2.2281 | 2.7638 | 3.1693 |
| 11 | 0.6974 | 1.3634 | 1.7959 | 2.2010 | 2.7181 | 3.1058 |
| 12 | 0.6955 | 1.3562 | 1.7823 | 2.1788 | 2.6810 | 3.0545 |
| 13 | 0.6938 | 1.3502 | 1.7709 | 2.1604 | 2.6503 | 3.0123 |
| 14 | 0.6924 | 1.3450 | 1.7613 | 2.1448 | 2.6245 | 2.9768 |
| 15 | 0.6912 | 1.3406 | 1.7531 | 2.1315 | 2.6025 | 2.9467 |
| 16 | 0.6901 | 1.3368 | 1.7459 | 2.1199 | 2.5835 | 2.9208 |
| 17 | 0.6892 | 1.3334 | 1.7396 | 2.1098 | 2.5669 | 2.8982 |
| 18 | 0.6884 | 1.3304 | 1.7341 | 2.1009 | 2.5524 | 2.8784 |
| 19 | 0.6876 | 1.3277 | 1.7291 | 2.0930 | 2.5395 | 2.8609 |
| 20 | 0.6870 | 1.3253 | 1.7247 | 2.0860 | 2.5280 | 2.8453 |
| 21 | 0.6864 | 1.3232 | 1.7207 | 2.0796 | 2.5177 | 2.8314 |
| 22 | 0.6858 | 1.3212 | 1.7171 | 2.0739 | 2.5083 | 2.8188 |
| 23 | 0.6853 | 1.3195 | 1.7139 | 2.0687 | 2.4999 | 2.8073 |
| 24 | 0.6848 | 1.3178 | 1.7109 | 2.0639 | 2.4922 | 2.7969 |
| 25 | 0.6844 | 1.3163 | 1.7081 | 2.0595 | 2.4851 | 2.7874 |
| 26 | 0.6840 | 1.3150 | 1.7056 | 2.0555 | 2.4786 | 2.7787 |
| 27 | 0.6837 | 1.3137 | 1.7033 | 2.0518 | 2.4727 | 2.7707 |
| 28 | 0.6834 | 1.3125 | 1.7011 | 2.0484 | 2.4671 | 2.7633 |
| 29 | 0.6830 | 1.3114 | 1.6991 | 2.0452 | 2.4620 | 2.7564 |
| 30 | 0.6828 | 1.3104 | 1.6973 | 2.0423 | 2.4573 | 2.7500 |
| 31 | 0.6825 | 1.3095 | 1.6955 | 2.0395 | 2.4528 | 2.7440 |
| 32 | 0.6822 | 1.3086 | 1.6939 | 2.0369 | 2.4487 | 2.7385 |
| 33 | 0.6820 | 1.3077 | 1.6924 | 2.0345 | 2.4448 | 2.7333 |
| 34 | 0.6818 | 1.3070 | 1.6909 | 2.0322 | 2.4411 | 2.7284 |
| 35 | 0.6816 | 1.3062 | 1.6896 | 2.0301 | 2.4377 | 2.7238 |
| 36 | 0.6814 | 1.3055 | 1.6883 | 2.0281 | 2.4345 | 2.7195 |
| 37 | 0.6812 | 1.3049 | 1.6871 | 2.0262 | 2.4314 | 2.7154 |
| 38 | 0.6810 | 1.3042 | 1.6860 | 2.0244 | 2.4286 | 2.7116 |
| 39 | 0.6808 | 1.3036 | 1.6849 | 2.0227 | 2.4258 | 2.7079 |
| 40 | 0.6807 | 1.3031 | 1.6839 | 2.0211 | 2.4233 | 2.7045 |
| 41 | 0.6805 | 1.3025 | 1.6829 | 2.0195 | 2.4208 | 2.7012 |
| 42 | 0.6804 | 1.3020 | 1.6820 | 2.0181 | 2.4185 | 2.6981 |
| 43 | 0.6802 | 1.3016 | 1.6811 | 2.0167 | 2.4163 | 2.6951 |
| 44 | 0.6801 | 1.3011 | 1.6802 | 2.0154 | 2.4141 | 2.6923 |
| 45 | 0.6800 | 1.3006 | 1.6794 | 2.0141 | 2.4121 | 2.6896 |
| 46 | 0.6799 | 1.3002 | 1.6787 | 2.0129 | 2.4102 | 2.6870 |
| 47 | 0.6797 | 1.2998 | 1.6779 | 2.0117 | 2.4083 | 2.6846 |
| 48 | 0.6796 | 1.2994 | 1.6772 | 2.0106 | 2.4066 | 2.6822 |
| 49 | 0.6795 | 1.2991 | 1.6766 | 2.0096 | 2.4049 | 2.6800 |
| 50 | 0.6794 | 1.2987 | 1.6759 | 2.0086 | 2.4033 | 2.6778 |
| 51 | 0.6793 | 1.2984 | 1.6753 | 2.0076 | 2.4017 | 2.6757 |
| 52 | 0.6792 | 1.2980 | 1.6747 | 2.0066 | 2.4002 | 2.6737 |
| 53 | 0.6791 | 1.2977 | 1.6741 | 2.0057 | 2.3988 | 2.6718 |
| 54 | 0.6791 | 1.2974 | 1.6736 | 2.0049 | 2.3974 | 2.6700 |
| 55 | 0.6790 | 1.2971 | 1.6730 | 2.0040 | 2.3961 | 2.6682 |
| 56 | 0.6789 | 1.2969 | 1.6725 | 2.0032 | 2.3948 | 2.6665 |
| 57 | 0.6788 | 1.2966 | 1.6720 | 2.0025 | 2.3936 | 2.6649 |
| 58 | 0.6787 | 1.2963 | 1.6716 | 2.0017 | 2.3924 | 2.6633 |
| 59 | 0.6787 | 1.2961 | 1.6711 | 2.0010 | 2.3912 | 2.6618 |
| 60 | 0.6786 | 1.2958 | 1.6706 | 2.0003 | 2.3901 | 2.6603 |
| 61 | 0.6785 | 1.2956 | 1.6702 | 1.9996 | 2.3890 | 2.6589 |
| 62 | 0.6785 | 1.2954 | 1.6698 | 1.9990 | 2.3880 | 2.6575 |
| 63 | 0.6784 | 1.2951 | 1.6694 | 1.9983 | 2.3870 | 2.6561 |
| 64 | 0.6783 | 1.2949 | 1.6690 | 1.9977 | 2.3860 | 2.6549 |
| 65 | 0.6783 | 1.2947 | 1.6686 | 1.9971 | 2.3851 | 2.6536 |
| 66 | 0.6782 | 1.2945 | 1.6683 | 1.9966 | 2.3842 | 2.6524 |
| 67 | 0.6782 | 1.2943 | 1.6679 | 1.9960 | 2.3833 | 2.6512 |
| 68 | 0.6781 | 1.2941 | 1.6676 | 1.9955 | 2.3824 | 2.6501 |
| 69 | 0.6781 | 1.2939 | 1.6672 | 1.9949 | 2.3816 | 2.6490 |
| 70 | 0.6780 | 1.2938 | 1.6669 | 1.9944 | 2.3808 | 2.6479 |
| 71 | 0.6780 | 1.2936 | 1.6666 | 1.9939 | 2.3800 | 2.6469 |
| 72 | 0.6779 | 1.2934 | 1.6663 | 1.9935 | 2.3793 | 2.6459 |
| 73 | 0.6779 | 1.2933 | 1.6660 | 1.9930 | 2.3785 | 2.6449 |
| 74 | 0.6778 | 1.2931 | 1.6657 | 1.9925 | 2.3778 | 2.6439 |
| 75 | 0.6778 | 1.2929 | 1.6654 | 1.9921 | 2.3771 | 2.6430 |
| 76 | 0.6777 | 1.2928 | 1.6652 | 1.9917 | 2.3764 | 2.6421 |
| 77 | 0.6777 | 1.2926 | 1.6649 | 1.9913 | 2.3758 | 2.6412 |
| 78 | 0.6776 | 1.2925 | 1.6646 | 1.9908 | 2.3751 | 2.6403 |
| 79 | 0.6776 | 1.2924 | 1.6644 | 1.9905 | 2.3745 | 2.6395 |
| 80 | 0.6776 | 1.2922 | 1.6641 | 1.9901 | 2.3739 | 2.6387 |
| 81 | 0.6775 | 1.2921 | 1.6639 | 1.9897 | 2.3733 | 2.6379 |
| 82 | 0.6775 | 1.2920 | 1.6636 | 1.9893 | 2.3727 | 2.6371 |
| 83 | 0.6775 | 1.2918 | 1.6634 | 1.9890 | 2.3721 | 2.6364 |
| 84 | 0.6774 | 1.2917 | 1.6632 | 1.9886 | 2.3716 | 2.6356 |
| 85 | 0.6774 | 1.2916 | 1.6630 | 1.9883 | 2.3710 | 2.6349 |
| 86 | 0.6774 | 1.2915 | 1.6628 | 1.9879 | 2.3705 | 2.6342 |
| 87 | 0.6773 | 1.2914 | 1.6626 | 1.9876 | 2.3700 | 2.6335 |
| 88 | 0.6773 | 1.2912 | 1.6624 | 1.9873 | 2.3695 | 2.6329 |
| 89 | 0.6773 | 1.2911 | 1.6622 | 1.9870 | 2.3690 | 2.6322 |
| 90 | 0.6772 | 1.2910 | 1.6620 | 1.9867 | 2.3685 | 2.6316 |
| 91 | 0.6772 | 1.2909 | 1.6618 | 1.9864 | 2.3680 | 2.6309 |
| 92 | 0.6772 | 1.2908 | 1.6616 | 1.9861 | 2.3676 | 2.6303 |
| 93 | 0.6771 | 1.2907 | 1.6614 | 1.9858 | 2.3671 | 2.6297 |
| 94 | 0.6771 | 1.2906 | 1.6612 | 1.9855 | 2.3667 | 2.6291 |
| 95 | 0.6771 | 1.2905 | 1.6611 | 1.9853 | 2.3662 | 2.6286 |
| 96 | 0.6771 | 1.2904 | 1.6609 | 1.9850 | 2.3658 | 2.6280 |
| 97 | 0.6770 | 1.2903 | 1.6607 | 1.9847 | 2.3654 | 2.6275 |
| 98 | 0.6770 | 1.2902 | 1.6606 | 1.9845 | 2.3650 | 2.6269 |
| 99 | 0.6770 | 1.2902 | 1.6604 | 1.9842 | 2.3646 | 2.6264 |
| 100 | 0.6770 | 1.2901 | 1.6602 | 1.9840 | 2.3642 | 2.6259 |
| 110 | 0.6767 | 1.2893 | 1.6588 | 1.9818 | 2.3607 | 2.6213 |
| 120 | 0.6765 | 1.2886 | 1.6577 | 1.9799 | 2.3578 | 2.6174 |
| ∞ | 0.6745 | 1.2816 | 1.6449 | 1.9600 | 2.3263 | 2.5758 |

For a particular number of degrees of freedom, entry represents the critical value of t corresponding to a specified upper tail area (A).
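The final row of Table A.2, below 120 degrees of freedom, is the limiting case of infinite degrees of freedom, where the t distribution coincides with the standard normal distribution. As a hedged illustration, that row can be reproduced with the standard library's `NormalDist.inv_cdf`; the helper name `limiting_t_critical` is hypothetical.

```python
from statistics import NormalDist

def limiting_t_critical(upper_tail_area):
    """Critical value for infinite degrees of freedom (a standard normal quantile)."""
    return NormalDist().inv_cdf(1 - upper_tail_area)

for area in (0.25, 0.10, 0.05, 0.025, 0.01, 0.005):
    print(f"upper tail area {area:<5}  critical value {limiting_t_critical(area):.4f}")
```

Critical values for finite degrees of freedom need the t distribution itself, which the Python standard library does not provide, so the sketch is limited to the last row.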

Table A.3 Critical Values of $\chi^2$

Upper Tail Areas (A)

| Degrees of Freedom | .995 | .99 | .975 | .95 | .90 | .75 | .25 | .10 | .05 | .025 | .01 | .005 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | - | - | 0.001 | 0.004 | 0.016 | 0.102 | 1.323 | 2.706 | 3.841 | 5.024 | 6.635 | 7.879 |
| 2 | 0.010 | 0.020 | 0.051 | 0.103 | 0.211 | 0.575 | 2.773 | 4.605 | 5.991 | 7.378 | 9.210 | 10.597 |
| 3 | 0.072 | 0.115 | 0.216 | 0.352 | 0.584 | 1.213 | 4.108 | 6.251 | 7.815 | 9.348 | 11.345 | 12.838 |
| 4 | 0.207 | 0.297 | 0.484 | 0.711 | 1.064 | 1.923 | 5.385 | 7.779 | 9.488 | 11.143 | 13.277 | 14.860 |
| 5 | 0.412 | 0.554 | 0.831 | 1.145 | 1.610 | 2.675 | 6.626 | 9.236 | 11.071 | 12.833 | 15.086 | 16.750 |
| 6 | 0.676 | 0.872 | 1.237 | 1.635 | 2.204 | 3.455 | 7.841 | 10.645 | 12.592 | 14.449 | 16.812 | 18.548 |
| 7 | 0.989 | 1.239 | 1.690 | 2.167 | 2.833 | 4.255 | 9.037 | 12.017 | 14.067 | 16.013 | 18.475 | 20.278 |
| 8 | 1.344 | 1.646 | 2.180 | 2.733 | 3.490 | 5.071 | 10.219 | 13.362 | 15.507 | 17.535 | 20.090 | 21.955 |
| 9 | 1.735 | 2.088 | 2.700 | 3.325 | 4.168 | 5.899 | 11.389 | 14.684 | 16.919 | 19.023 | 21.666 | 23.589 |
| 10 | 2.156 | 2.558 | 3.247 | 3.940 | 4.865 | 6.737 | 12.549 | 15.987 | 18.307 | 20.483 | 23.209 | 25.188 |
| 11 | 2.603 | 3.053 | 3.816 | 4.575 | 5.578 | 7.584 | 13.701 | 17.275 | 19.675 | 21.920 | 24.725 | 26.757 |
| 12 | 3.074 | 3.571 | 4.404 | 5.226 | 6.304 | 8.438 | 14.845 | 18.549 | 21.026 | 23.337 | 26.217 | 28.299 |
| 13 | 3.565 | 4.107 | 5.009 | 5.892 | 7.042 | 9.299 | 15.984 | 19.812 | 22.362 | 24.736 | 27.688 | 29.819 |
| 14 | 4.075 | 4.660 | 5.629 | 6.571 | 7.790 | 10.165 | 17.117 | 21.064 | 23.685 | 26.119 | 29.141 | 31.319 |
| 15 | 4.601 | 5.229 | 6.262 | 7.261 | 8.547 | 11.037 | 18.245 | 22.307 | 24.996 | 27.488 | 30.578 | 32.801 |
| 16 | 5.142 | 5.812 | 6.908 | 7.962 | 9.312 | 11.912 | 19.369 | 23.542 | 26.296 | 28.845 | 32.000 | 34.267 |
| 17 | 5.697 | 6.408 | 7.564 | 8.672 | 10.085 | 12.792 | 20.489 | 24.769 | 27.587 | 30.191 | 33.409 | 35.718 |
| 18 | 6.265 | 7.015 | 8.231 | 9.390 | 10.865 | 13.675 | 21.605 | 25.989 | 28.869 | 31.526 | 34.805 | 37.156 |
| 19 | 6.844 | 7.633 | 8.907 | 10.117 | 11.651 | 14.562 | 22.718 | 27.204 | 30.144 | 32.852 | 36.191 | 38.582 |
| 20 | 7.434 | 8.260 | 9.591 | 10.851 | 12.443 | 15.452 | 23.828 | 28.412 | 31.410 | 34.170 | 37.566 | 39.997 |
| 21 | 8.034 | 8.897 | 10.283 | 11.591 | 13.240 | 16.344 | 24.935 | 29.615 | 32.671 | 35.479 | 38.932 | 41.401 |
| 22 | 8.643 | 9.542 | 10.982 | 12.338 | 14.042 | 17.240 | 26.039 | 30.813 | 33.924 | 36.781 | 40.289 | 42.796 |
| 23 | 9.260 | 10.196 | 11.689 | 13.091 | 14.848 | 18.137 | 27.141 | 32.007 | 35.172 | 38.076 | 41.638 | 44.181 |
| 24 | 9.886 | 10.856 | 12.401 | 13.848 | 15.659 | 19.037 | 28.241 | 33.196 | 36.415 | 39.364 | 42.980 | 45.559 |
| 25 | 10.520 | 11.524 | 13.120 | 14.611 | 16.473 | 19.939 | 29.339 | 34.382 | 37.652 | 40.646 | 44.314 | 46.928 |
| 26 | 11.160 | 12.198 | 13.844 | 15.379 | 17.292 | 20.843 | 30.435 | 35.563 | 38.885 | 41.923 | 45.642 | 48.290 |
| 27 | 11.808 | 12.879 | 14.573 | 16.151 | 18.114 | 21.749 | 31.528 | 36.741 | 40.113 | 43.194 | 46.963 | 49.645 |
| 28 | 12.461 | 13.565 | 15.308 | 16.928 | 18.939 | 22.657 | 32.620 | 37.916 | 41.337 | 44.461 | 48.278 | 50.993 |
| 29 | 13.121 | 14.257 | 16.047 | 17.708 | 19.768 | 23.567 | 33.711 | 39.087 | 42.557 | 45.722 | 49.588 | 52.336 |
| 30 | 13.787 | 14.954 | 16.791 | 18.493 | 20.599 | 24.478 | 34.800 | 40.256 | 43.773 | 46.979 | 50.892 | 53.672 |

For a particular number of degrees of freedom, entry represents the critical value of $\chi^2$ corresponding to a specified upper tail area (A). For larger values of degrees of freedom (df), the expression $Z = \sqrt{2\chi^2} - \sqrt{2(df) - 1}$ may be used, and the resulting upper tail area can be obtained from the table of the standard normal distribution (Table A.1).
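The large-sample approximation in the note to Table A.3, Z equal to the square root of 2χ² minus the square root of 2(df) − 1, can be tried out numerically. The sketch below is an illustration only: the function name `approx_upper_tail_area` is hypothetical, and it uses the standard library's `NormalDist` in place of a lookup in Table A.1.

```python
from math import sqrt
from statistics import NormalDist

def approx_upper_tail_area(chi_square, df):
    """Normal approximation: Z = sqrt(2 * chi_square) - sqrt(2 * df - 1)."""
    z = sqrt(2 * chi_square) - sqrt(2 * df - 1)
    return 1 - NormalDist().cdf(z)

# Check against a tabulated entry: for df = 30, the value 43.773
# is listed under upper tail area .05.
print(f"{approx_upper_tail_area(43.773, 30):.3f}")
```

At 30 degrees of freedom the approximation lands close to, but not exactly on, the tabulated tail area of .05, which is consistent with the note's advice to use it for larger degrees of freedom.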

Table A.4 Critical Values of the F Distribution

Upper critical values of the F distribution for numerator degrees of freedom N1 and denominator degrees of freedom N2, 5% significance level

| N2 \ N1 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 161.448 | 199.500 | 215.707 | 224.583 | 230.162 | 233.986 | 236.768 | 238.882 | 240.543 | 241.882 |
| 2 | 18.513 | 19.000 | 19.164 | 19.247 | 19.296 | 19.330 | 19.353 | 19.371 | 19.385 | 19.396 |
| 3 | 10.128 | 9.552 | 9.277 | 9.117 | 9.013 | 8.941 | 8.887 | 8.845 | 8.812 | 8.786 |
| 4 | 7.709 | 6.944 | 6.591 | 6.388 | 6.256 | 6.163 | 6.094 | 6.041 | 5.999 | 5.964 |
| 5 | 6.608 | 5.786 | 5.409 | 5.192 | 5.050 | 4.950 | 4.876 | 4.818 | 4.772 | 4.735 |
| 6 | 5.987 | 5.143 | 4.757 | 4.534 | 4.387 | 4.284 | 4.207 | 4.147 | 4.099 | 4.060 |
| 7 | 5.591 | 4.737 | 4.347 | 4.120 | 3.972 | 3.866 | 3.787 | 3.726 | 3.677 | 3.637 |
| 8 | 5.318 | 4.459 | 4.066 | 3.838 | 3.687 | 3.581 | 3.500 | 3.438 | 3.388 | 3.347 |
| 9 | 5.117 | 4.256 | 3.863 | 3.633 | 3.482 | 3.374 | 3.293 | 3.230 | 3.179 | 3.137 |
| 10 | 4.965 | 4.103 | 3.708 | 3.478 | 3.326 | 3.217 | 3.135 | 3.072 | 3.020 | 2.978 |
| 11 | 4.844 | 3.982 | 3.587 | 3.357 | 3.204 | 3.095 | 3.012 | 2.948 | 2.896 | 2.854 |
| 12 | 4.747 | 3.885 | 3.490 | 3.259 | 3.106 | 2.996 | 2.913 | 2.849 | 2.796 | 2.753 |
| 13 | 4.667 | 3.806 | 3.411 | 3.179 | 3.025 | 2.915 | 2.832 | 2.767 | 2.714 | 2.671 |
| 14 | 4.600 | 3.739 | 3.344 | 3.112 | 2.958 | 2.848 | 2.764 | 2.699 | 2.646 | 2.602 |
| 15 | 4.543 | 3.682 | 3.287 | 3.056 | 2.901 | 2.790 | 2.707 | 2.641 | 2.588 | 2.544 |
| 16 | 4.494 | 3.634 | 3.239 | 3.007 | 2.852 | 2.741 | 2.657 | 2.591 | 2.538 | 2.494 |
| 17 | 4.451 | 3.592 | 3.197 | 2.965 | 2.810 | 2.699 | 2.614 | 2.548 | 2.494 | 2.450 |
| 18 | 4.414 | 3.555 | 3.160 | 2.928 | 2.773 | 2.661 | 2.577 | 2.510 | 2.456 | 2.412 |
| 19 | 4.381 | 3.522 | 3.127 | 2.895 | 2.740 | 2.628 | 2.544 | 2.477 | 2.423 | 2.378 |
| 20 | 4.351 | 3.493 | 3.098 | 2.866 | 2.711 | 2.599 | 2.514 | 2.447 | 2.393 | 2.348 |
| 21 | 4.325 | 3.467 | 3.072 | 2.840 | 2.685 | 2.573 | 2.488 | 2.420 | 2.366 | 2.321 |
| 22 | 4.301 | 3.443 | 3.049 | 2.817 | 2.661 | 2.549 | 2.464 | 2.397 | 2.342 | 2.297 |
| 23 | 4.279 | 3.422 | 3.028 | 2.796 | 2.640 | 2.528 | 2.442 | 2.375 | 2.320 | 2.275 |
| 24 | 4.260 | 3.403 | 3.009 | 2.776 | 2.621 | 2.508 | 2.423 | 2.355 | 2.300 | 2.255 |
| 25 | 4.242 | 3.385 | 2.991 | 2.759 | 2.603 | 2.490 | 2.405 | 2.337 | 2.282 | 2.236 |
| 26 | 4.225 | 3.369 | 2.975 | 2.743 | 2.587 | 2.474 | 2.388 | 2.321 | 2.265 | 2.220 |
| 27 | 4.210 | 3.354 | 2.960 | 2.728 | 2.572 | 2.459 | 2.373 | 2.305 | 2.250 | 2.204 |
| 28 | 4.196 | 3.340 | 2.947 | 2.714 | 2.558 | 2.445 | 2.359 | 2.291 | 2.236 | 2.190 |
| 29 | 4.183 | 3.328 | 2.934 | 2.701 | 2.545 | 2.432 | 2.346 | 2.278 | 2.223 | 2.177 |
| 30 | 4.171 | 3.316 | 2.922 | 2.690 | 2.534 | 2.421 | 2.334 | 2.266 | 2.211 | 2.165 |
| 31 | 4.160 | 3.305 | 2.911 | 2.679 | 2.523 | 2.409 | 2.323 | 2.255 | 2.199 | 2.153 |
| 32 | 4.149 | 3.295 | 2.901 | 2.668 | 2.512 | 2.399 | 2.313 | 2.244 | 2.189 | 2.142 |
| 33 | 4.139 | 3.285 | 2.892 | 2.659 | 2.503 | 2.389 | 2.303 | 2.235 | 2.179 | 2.133 |
| 34 | 4.130 | 3.276 | 2.883 | 2.650 | 2.494 | 2.380 | 2.294 | 2.225 | 2.170 | 2.123 |
| 35 | 4.121 | 3.267 | 2.874 | 2.641 | 2.485 | 2.372 | 2.285 | 2.217 | 2.161 | 2.114 |
| 36 | 4.113 | 3.259 | 2.866 | 2.634 | 2.477 | 2.364 | 2.277 | 2.209 | 2.153 | 2.106 |
| 37 | 4.105 | 3.252 | 2.859 | 2.626 | 2.470 | 2.356 | 2.270 | 2.201 | 2.145 | 2.098 |
| 38 | 4.098 | 3.245 | 2.852 | 2.619 | 2.463 | 2.349 | 2.262 | 2.194 | 2.138 | 2.091 |
| 39 | 4.091 | 3.238 | 2.845 | 2.612 | 2.456 | 2.342 | 2.255 | 2.187 | 2.131 | 2.084 |
| 40 | 4.085 | 3.232 | 2.839 | 2.606 | 2.449 | 2.336 | 2.249 | 2.180 | 2.124 | 2.077 |
| 41 | 4.079 | 3.226 | 2.833 | 2.600 | 2.443 | 2.330 | 2.243 | 2.174 | 2.118 | 2.071 |
| 42 | 4.073 | 3.220 | 2.827 | 2.594 | 2.438 | 2.324 | 2.237 | 2.168 | 2.112 | 2.065 |
| 43 | 4.067 | 3.214 | 2.822 | 2.589 | 2.432 | 2.318 | 2.232 | 2.163 | 2.106 | 2.059 |
| 44 | 4.062 | 3.209 | 2.816 | 2.584 | 2.427 | 2.313 | 2.226 | 2.157 | 2.101 | 2.054 |
| 45 | 4.057 | 3.204 | 2.812 | 2.579 | 2.422 | 2.308 | 2.221 | 2.152 | 2.096 | 2.049 |
| 46 | 4.052 | 3.200 | 2.807 | 2.574 | 2.417 | 2.304 | 2.216 | 2.147 | 2.091 | 2.044 |
| 47 | 4.047 | 3.195 | 2.802 | 2.570 | 2.413 | 2.299 | 2.212 | 2.143 | 2.086 | 2.039 |
| 48 | 4.043 | 3.191 | 2.798 | 2.565 | 2.409 | 2.295 | 2.207 | 2.138 | 2.082 | 2.035 |
| 49 | 4.038 | 3.187 | 2.794 | 2.561 | 2.404 | 2.290 | 2.203 | 2.134 | 2.077 | 2.030 |
| 50 | 4.034 | 3.183 | 2.790 | 2.557 | 2.400 | 2.286 | 2.199 | 2.130 | 2.073 | 2.026 |
| 51 | 4.030 | 3.179 | 2.786 | 2.553 | 2.397 | 2.283 | 2.195 | 2.126 | 2.069 | 2.022 |
| 52 | 4.027 | 3.175 | 2.783 | 2.550 | 2.393 | 2.279 | 2.192 | 2.122 | 2.066 | 2.018 |
| 53 | 4.023 | 3.172 | 2.779 | 2.546 | 2.389 | 2.275 | 2.188 | 2.119 | 2.062 | 2.015 |
| 54 | 4.020 | 3.168 | 2.776 | 2.543 | 2.386 | 2.272 | 2.185 | 2.115 | 2.059 | 2.011 |
| 55 | 4.016 | 3.165 | 2.773 | 2.540 | 2.383 | 2.269 | 2.181 | 2.112 | 2.055 | 2.008 |
| 56 | 4.013 | 3.162 | 2.769 | 2.537 | 2.380 | 2.266 | 2.178 | 2.109 | 2.052 | 2.005 |
| 57 | 4.010 | 3.159 | 2.766 | 2.534 | 2.377 | 2.263 | 2.175 | 2.106 | 2.049 | 2.001 |
| 58 | 4.007 | 3.156 | 2.764 | 2.531 | 2.374 | 2.260 | 2.172 | 2.103 | 2.046 | 1.998 |
| 59 | 4.004 | 3.153 | 2.761 | 2.528 | 2.371 | 2.257 | 2.169 | 2.100 | 2.043 | 1.995 |
| 60 | 4.001 | 3.150 | 2.758 | 2.525 | 2.368 | 2.254 | 2.167 | 2.097 | 2.040 | 1.993 |
| 61 | 3.998 | 3.148 | 2.755 | 2.523 | 2.366 | 2.251 | 2.164 | 2.094 | 2.037 | 1.990 |
| 62 | 3.996 | 3.145 | 2.753 | 2.520 | 2.363 | 2.249 | 2.161 | 2.092 | 2.035 | 1.987 |
| 63 | 3.993 | 3.143 | 2.751 | 2.518 | 2.361 | 2.246 | 2.159 | 2.089 | 2.032 | 1.985 |
| 64 | 3.991 | 3.140 | 2.748 | 2.515 | 2.358 | 2.244 | 2.156 | 2.087 | 2.030 | 1.982 |
| 65 | 3.989 | 3.138 | 2.746 | 2.513 | 2.356 | 2.242 | 2.154 | 2.084 | 2.027 | 1.980 |
| 66 | 3.986 | 3.136 | 2.744 | 2.511 | 2.354 | 2.239 | 2.152 | 2.082 | 2.025 | 1.977 |
| 67 | 3.984 | 3.134 | 2.742 | 2.509 | 2.352 | 2.237 | 2.150 | 2.080 | 2.023 | 1.975 |
| 68 | 3.982 | 3.132 | 2.740 | 2.507 | 2.350 | 2.235 | 2.148 | 2.078 | 2.021 | 1.973 |
| 69 | 3.980 | 3.130 | 2.737 | 2.505 | 2.348 | 2.233 | 2.145 | 2.076 | 2.019 | 1.971 |
| 70 | 3.978 | 3.128 | 2.736 | 2.503 | 2.346 | 2.231 | 2.143 | 2.074 | 2.017 | 1.969 |
| 71 | 3.976 | 3.126 | 2.734 | 2.501 | 2.344 | 2.229 | 2.142 | 2.072 | 2.015 | 1.967 |
| 72 | 3.974 | 3.124 | 2.732 | 2.499 | 2.342 | 2.227 | 2.140 | 2.070 | 2.013 | 1.965 |
| 73 | 3.972 | 3.122 | 2.730 | 2.497 | 2.340 | 2.226 | 2.138 | 2.068 | 2.011 | 1.963 |
| 74 | 3.970 | 3.120 | 2.728 | 2.495 | 2.338 | 2.224 | 2.136 | 2.066 | 2.009 | 1.961 |
| 75 | 3.968 | 3.119 | 2.727 | 2.494 | 2.337 | 2.222 | 2.134 | 2.064 | 2.007 | 1.959 |
| 76 | 3.967 | 3.117 | 2.725 | 2.492 | 2.335 | 2.220 | 2.133 | 2.063 | 2.006 | 1.958 |
| 77 | 3.965 | 3.115 | 2.723 | 2.490 | 2.333 | 2.219 | 2.131 | 2.061 | 2.004 | 1.956 |
| 78 | 3.963 | 3.114 | 2.722 | 2.489 | 2.332 | 2.217 | 2.129 | 2.059 | 2.002 | 1.954 |
| 79 | 3.962 | 3.112 | 2.720 | 2.487 | 2.330 | 2.216 | 2.128 | 2.058 | 2.001 | 1.953 |
| 80 | 3.960 | 3.111 | 2.719 | 2.486 | 2.329 | 2.214 | 2.126 | 2.056 | 1.999 | 1.951 |
| 81 | 3.959 | 3.109 | 2.717 | 2.484 | 2.327 | 2.213 | 2.125 | 2.055 | 1.998 | 1.950 |
| 82 | 3.957 | 3.108 | 2.716 | 2.483 | 2.326 | 2.211 | 2.123 | 2.053 | 1.996 | 1.948 |
| 83 | 3.956 | 3.107 | 2.715 | 2.482 | 2.324 | 2.210 | 2.122 | 2.052 | 1.995 | 1.947 |
| 84 | 3.955 | 3.105 | 2.713 | 2.480 | 2.323 | 2.209 | 2.121 | 2.051 | 1.993 | 1.945 |
| 85 | 3.953 | 3.104 | 2.712 | 2.479 | 2.322 | 2.207 | 2.119 | 2.049 | 1.992 | 1.944 |
| 86 | 3.952 | 3.103 | 2.711 | 2.478 | 2.321 | 2.206 | 2.118 | 2.048 | 1.991 | 1.943 |
| 87 | 3.951 | 3.101 | 2.709 | 2.476 | 2.319 | 2.205 | 2.117 | 2.047 | 1.989 | 1.941 |
| 88 | 3.949 | 3.100 | 2.708 | 2.475 | 2.318 | 2.203 | 2.115 | 2.045 | 1.988 | 1.940 |
| 89 | 3.948 | 3.099 | 2.707 | 2.474 | 2.317 | 2.202 | 2.114 | 2.044 | 1.987 | 1.939 |
| 90 | 3.947 | 3.098 | 2.706 | 2.473 | 2.316 | 2.201 | 2.113 | 2.043 | 1.986 | 1.938 |
| 91 | 3.946 | 3.097 | 2.705 | 2.472 | 2.315 | 2.200 | 2.112 | 2.042 | 1.984 | 1.936 |
| 92 | 3.945 | 3.095 | 2.704 | 2.471 | 2.313 | 2.199 | 2.111 | 2.041 | 1.983 | 1.935 |
| 93 | 3.943 | 3.094 | 2.703 | 2.470 | 2.312 | 2.198 | 2.110 | 2.040 | 1.982 | 1.934 |
| 94 | 3.942 | 3.093 | 2.701 | 2.469 | 2.311 | 2.197 | 2.109 | 2.038 | 1.981 | 1.933 |
| 95 | 3.941 | 3.092 | 2.700 | 2.467 | 2.310 | 2.196 | 2.108 | 2.037 | 1.980 | 1.932 |
| 96 | 3.940 | 3.091 | 2.699 | 2.466 | 2.309 | 2.195 | 2.106 | 2.036 | 1.979 | 1.931 |
| 97 | 3.939 | 3.090 | 2.698 | 2.465 | 2.308 | 2.194 | 2.105 | 2.035 | 1.978 | 1.930 |
| 98 | 3.938 | 3.089 | 2.697 | 2.465 | 2.307 | 2.193 | 2.104 | 2.034 | 1.977 | 1.929 |
| 99 | 3.937 | 3.088 | 2.696 | 2.464 | 2.306 | 2.192 | 2.103 | 2.033 | 1.976 | 1.928 |
| 100 | 3.936 | 3.087 | 2.696 | 2.463 | 2.305 | 2.191 | 2.103 | 2.032 | 1.975 | 1.927 |

11

12

13

14

15

16

17

18

19

20

1

242.983

243.906

244.690

245.364

245.950

246.464

246.918

247.323

247.686

248.013

2

19.405

19.413

19.419

19.424

19.429

19.433

19.437

19.440

19.443

19.446

3

8.763

8.745

8.729

8.715

8.703

8.692

8.683

8.675

8.667

8.660

4

5.936

5.912

5.891

5.873

5.858

5.844

5.832

5.821

5.811

5.803

5

4.704

4.678

4.655

4.636

4.619

4.604

4.590

4.579

4.568

4.558

6

4.027

4.000

3.976

3.956

3.938

3.922

3.908

3.896

3.884

3.874

7

3.603

3.575

3.550

3.529

3.511

3.494

3.480

3.467

3.455

3.445

8

3.313

3.284

3.259

3.237

3.218

3.202

3.187

3.173

3.161

3.150

9

3.102

3.073

3.048

3.025

3.006

2.989

2.974

2.960

2.948

2.936

10

2.943

2.913

2.887

2.865

2.845

2.828

2.812

2.798

2.785

2.774

11

2.818

2.788

2.761

2.739

2.719

2.701

2.685

2.671

2.658

2.646

N2

N2

N1

621

Appendix A  Statistical Tables

N1

11

12

13

14

15

16

17

18

19

20

12

2.717

2.687

2.660

2.637

2.617

2.599

2.583

2.568

2.555

2.544

13

2.635

2.604

2.577

2.554

2.533

2.515

2.499

2.484

2.471

2.459

14

2.565

2.534

2.507

2.484

2.463

2.445

2.428

2.413

2.400

2.388

15

2.507

2.475

2.448

2.424

2.403

2.385

2.368

2.353

2.340

2.328

16

2.456

2.425

2.397

2.373

2.352

2.333

2.317

2.302

2.288

2.276

17

2.413

2.381

2.353

2.329

2.308

2.289

2.272

2.257

2.243

2.230

18

2.374

2.342

2.314

2.290

2.269

2.250

2.233

2.217

2.203

2.191

19

2.340

2.308

2.280

2.256

2.234

2.215

2.198

2.182

2.168

2.155

20

2.310

2.278

2.250

2.225

2.203

2.184

2.167

2.151

2.137

2.124

21

2.283

2.250

2.222

2.197

2.176

2.156

2.139

2.123

2.109

2.096

22

2.259

2.226

2.198

2.173

2.151

2.131

2.114

2.098

2.084

2.071

23

2.236

2.204

2.175

2.150

2.128

2.109

2.091

2.075

2.061

2.048

24

2.216

2.183

2.155

2.130

2.108

2.088

2.070

2.054

2.040

2.027

25

2.198

2.165

2.136

2.111

2.089

2.069

2.051

2.035

2.021

2.007

26

2.181

2.148

2.119

2.094

2.072

2.052

2.034

2.018

2.003

1.990

27

2.166

2.132

2.103

2.078

2.056

2.036

2.018

2.002

1.987

1.974

28

2.151

2.118

2.089

2.064

2.041

2.021

2.003

1.987

1.972

1.959

29

2.138

2.104

2.075

2.050

2.027

2.007

1.989

1.973

1.958

1.945

30

2.126

2.092

2.063

2.037

2.015

1.995

1.976

1.960

1.945

1.932

31

2.114

2.080

2.051

2.026

2.003

1.983

1.965

1.948

1.933

1.920

32

2.103

2.070

2.040

2.015

1.992

1.972

1.953

1.937

1.922

1.908

33

2.093

2.060

2.030

2.004

1.982

1.961

1.943

1.926

1.911

1.898

34

2.084

2.050

2.021

1.995

1.972

1.952

1.933

1.917

1.902

1.888

35

2.075

2.041

2.012

1.986

1.963

1.942

1.924

1.907

1.892

1.878

36

2.067

2.033

2.003

1.977

1.954

1.934

1.915

1.899

1.883

1.870

37

2.059

2.025

1.995

1.969

1.946

1.926

1.907

1.890

1.875

1.861

38

2.051

2.017

1.988

1.962

1.939

1.918

1.899

1.883

1.867

1.853

39

2.044

2.010

1.981

1.954

1.931

1.911

1.892

1.875

1.860

1.846

40

2.038

2.003

1.974

1.948

1.924

1.904

1.885

1.868

1.853

1.839

41

2.031

1.997

1.967

1.941

1.918

1.897

1.879

1.862

1.846

1.832

42

2.025

1.991

1.961

1.935

1.912

1.891

1.872

1.855

1.840

1.826

43

2.020

1.985

1.955

1.929

1.906

1.885

1.866

1.849

1.834

1.820

44

2.014

1.980

1.950

1.924

1.900

1.879

1.861

1.844

1.828

1.814

45

2.009

1.974

1.945

1.918

1.895

1.874

1.855

1.838

1.823

1.808

46

2.004

1.969

1.940

1.913

1.890

1.869

1.850

1.833

1.817

1.803

47

1.999

1.965

1.935

1.908

1.885

1.864

1.845

1.828

1.812

1.798

48

1.995

1.960

1.930

1.904

1.880

1.859

1.840

1.823

1.807

1.793

49

1.990

1.956

1.926

1.899

1.876

1.855

1.836

1.819

1.803

1.789

50

1.986

1.952

1.921

1.895

1.871

1.850

1.831

1.814

1.798

1.784

51

1.982

1.947

1.917

1.891

1.867

1.846

1.827

1.810

1.794

1.780

N2

(continued)

622

Appendix A  Statistical Tables

N1

11

12

13

14

15

16

17

18

19

20

52

1.978

1.944

1.913

1.887

1.863

1.842

1.823

1.806

1.790

1.776

53

1.975

1.940

1.910

1.883

1.859

1.838

1.819

1.802

1.786

1.772

54

1.971

1.936

1.906

1.879

1.856

1.835

1.816

1.798

1.782

1.768

55

1.968

1.933

1.903

1.876

1.852

1.831

1.812

1.795

1.779

1.764

56

1.964

1.930

1.899

1.873

1.849

1.828

1.809

1.791

1.775

1.761

57

1.961

1.926

1.896

1.869

1.846

1.824

1.805

1.788

1.772

1.757

58

1.958

1.923

1.893

1.866

1.842

1.821

1.802

1.785

1.769

1.754

59

1.955

1.920

1.890

1.863

1.839

1.818

1.799

1.781

1.766

1.751

60

1.952

1.917

1.887

1.860

1.836

1.815

1.796

1.778

1.763

1.748

61

1.949

1.915

1.884

1.857

1.834

1.812

1.793

1.776

1.760

1.745

62

1.947

1.912

1.882

1.855

1.831

1.809

1.790

1.773

1.757

1.742

63

1.944

1.909

1.879

1.852

1.828

1.807

1.787

1.770

1.754

1.739

N2 \ N1      11      12      13      14      15      16      17      18      19      20
64        1.942   1.907   1.876   1.849   1.826   1.804   1.785   1.767   1.751   1.737
65        1.939   1.904   1.874   1.847   1.823   1.802   1.782   1.765   1.749   1.734
66        1.937   1.902   1.871   1.845   1.821   1.799   1.780   1.762   1.746   1.732
67        1.935   1.900   1.869   1.842   1.818   1.797   1.777   1.760   1.744   1.729
68        1.932   1.897   1.867   1.840   1.816   1.795   1.775   1.758   1.742   1.727
69        1.930   1.895   1.865   1.838   1.814   1.792   1.773   1.755   1.739   1.725
70        1.928   1.893   1.863   1.836   1.812   1.790   1.771   1.753   1.737   1.722
71        1.926   1.891   1.861   1.834   1.810   1.788   1.769   1.751   1.735   1.720
72        1.924   1.889   1.859   1.832   1.808   1.786   1.767   1.749   1.733   1.718
73        1.922   1.887   1.857   1.830   1.806   1.784   1.765   1.747   1.731   1.716
74        1.921   1.885   1.855   1.828   1.804   1.782   1.763   1.745   1.729   1.714
75        1.919   1.884   1.853   1.826   1.802   1.780   1.761   1.743   1.727   1.712
76        1.917   1.882   1.851   1.824   1.800   1.778   1.759   1.741   1.725   1.710
77        1.915   1.880   1.849   1.822   1.798   1.777   1.757   1.739   1.723   1.708
78        1.914   1.878   1.848   1.821   1.797   1.775   1.755   1.738   1.721   1.707
79        1.912   1.877   1.846   1.819   1.795   1.773   1.754   1.736   1.720   1.705
80        1.910   1.875   1.845   1.817   1.793   1.772   1.752   1.734   1.718   1.703
81        1.909   1.874   1.843   1.816   1.792   1.770   1.750   1.733   1.716   1.702
82        1.907   1.872   1.841   1.814   1.790   1.768   1.749   1.731   1.715   1.700
83        1.906   1.871   1.840   1.813   1.789   1.767   1.747   1.729   1.713   1.698
84        1.905   1.869   1.838   1.811   1.787   1.765   1.746   1.728   1.712   1.697
85        1.903   1.868   1.837   1.810   1.786   1.764   1.744   1.726   1.710   1.695
86        1.902   1.867   1.836   1.808   1.784   1.762   1.743   1.725   1.709   1.694
87        1.900   1.865   1.834   1.807   1.783   1.761   1.741   1.724   1.707   1.692
88        1.899   1.864   1.833   1.806   1.782   1.760   1.740   1.722   1.706   1.691
89        1.898   1.863   1.832   1.804   1.780   1.758   1.739   1.721   1.705   1.690
90        1.897   1.861   1.830   1.803   1.779   1.757   1.737   1.720   1.703   1.688

Appendix A  Statistical Tables

N2 \ N1      11      12      13      14      15      16      17      18      19      20
91        1.895   1.860   1.829   1.802   1.778   1.756   1.736   1.718   1.702   1.687
92        1.894   1.859   1.828   1.801   1.776   1.755   1.735   1.717   1.701   1.686
93        1.893   1.858   1.827   1.800   1.775   1.753   1.734   1.716   1.699   1.684
94        1.892   1.857   1.826   1.798   1.774   1.752   1.733   1.715   1.698   1.683
95        1.891   1.856   1.825   1.797   1.773   1.751   1.731   1.713   1.697   1.682
96        1.890   1.854   1.823   1.796   1.772   1.750   1.730   1.712   1.696   1.681
97        1.889   1.853   1.822   1.795   1.771   1.749   1.729   1.711   1.695   1.680
98        1.888   1.852   1.821   1.794   1.770   1.748   1.728   1.710   1.694   1.679
99        1.887   1.851   1.820   1.793   1.769   1.747   1.727   1.709   1.693   1.678
100       1.886   1.850   1.819   1.792   1.768   1.746   1.726   1.708   1.691   1.676

Upper critical values of the F distribution for numerator degrees of freedom N1 and denominator degrees of freedom N2, 10% significance level

N2 \ N1       1       2       3       4       5       6       7       8       9      10
1        39.863  49.500  53.593  55.833  57.240  58.204  58.906  59.439  59.858  60.195
2         8.526   9.000   9.162   9.243   9.293   9.326   9.349   9.367   9.381   9.392
3         5.538   5.462   5.391   5.343   5.309   5.285   5.266   5.252   5.240   5.230
4         4.545   4.325   4.191   4.107   4.051   4.010   3.979   3.955   3.936   3.920
5         4.060   3.780   3.619   3.520   3.453   3.405   3.368   3.339   3.316   3.297
6         3.776   3.463   3.289   3.181   3.108   3.055   3.014   2.983   2.958   2.937
7         3.589   3.257   3.074   2.961   2.883   2.827   2.785   2.752   2.725   2.703
8         3.458   3.113   2.924   2.806   2.726   2.668   2.624   2.589   2.561   2.538
9         3.360   3.006   2.813   2.693   2.611   2.551   2.505   2.469   2.440   2.416
10        3.285   2.924   2.728   2.605   2.522   2.461   2.414   2.377   2.347   2.323
11        3.225   2.860   2.660   2.536   2.451   2.389   2.342   2.304   2.274   2.248
12        3.177   2.807   2.606   2.480   2.394   2.331   2.283   2.245   2.214   2.188
13        3.136   2.763   2.560   2.434   2.347   2.283   2.234   2.195   2.164   2.138
14        3.102   2.726   2.522   2.395   2.307   2.243   2.193   2.154   2.122   2.095
15        3.073   2.695   2.490   2.361   2.273   2.208   2.158   2.119   2.086   2.059
16        3.048   2.668   2.462   2.333   2.244   2.178   2.128   2.088   2.055   2.028
17        3.026   2.645   2.437   2.308   2.218   2.152   2.102   2.061   2.028   2.001
18        3.007   2.624   2.416   2.286   2.196   2.130   2.079   2.038   2.005   1.977
19        2.990   2.606   2.397   2.266   2.176   2.109   2.058   2.017   1.984   1.956
20        2.975   2.589   2.380   2.249   2.158   2.091   2.040   1.999   1.965   1.937
21        2.961   2.575   2.365   2.233   2.142   2.075   2.023   1.982   1.948   1.920
22        2.949   2.561   2.351   2.219   2.128   2.060   2.008   1.967   1.933   1.904
23        2.937   2.549   2.339   2.207   2.115   2.047   1.995   1.953   1.919   1.890
24        2.927   2.538   2.327   2.195   2.103   2.035   1.983   1.941   1.906   1.877

N2 \ N1       1       2       3       4       5       6       7       8       9      10
25        2.918   2.528   2.317   2.184   2.092   2.024   1.971   1.929   1.895   1.866
26        2.909   2.519   2.307   2.174   2.082   2.014   1.961   1.919   1.884   1.855
27        2.901   2.511   2.299   2.165   2.073   2.005   1.952   1.909   1.874   1.845
28        2.894   2.503   2.291   2.157   2.064   1.996   1.943   1.900   1.865   1.836
29        2.887   2.495   2.283   2.149   2.057   1.988   1.935   1.892   1.857   1.827
30        2.881   2.489   2.276   2.142   2.049   1.980   1.927   1.884   1.849   1.819
31        2.875   2.482   2.270   2.136   2.042   1.973   1.920   1.877   1.842   1.812
32        2.869   2.477   2.263   2.129   2.036   1.967   1.913   1.870   1.835   1.805
33        2.864   2.471   2.258   2.123   2.030   1.961   1.907   1.864   1.828   1.799
34        2.859   2.466   2.252   2.118   2.024   1.955   1.901   1.858   1.822   1.793
35        2.855   2.461   2.247   2.113   2.019   1.950   1.896   1.852   1.817   1.787
36        2.850   2.456   2.243   2.108   2.014   1.945   1.891   1.847   1.811   1.781
37        2.846   2.452   2.238   2.103   2.009   1.940   1.886   1.842   1.806   1.776
38        2.842   2.448   2.234   2.099   2.005   1.935   1.881   1.838   1.802   1.772
39        2.839   2.444   2.230   2.095   2.001   1.931   1.877   1.833   1.797   1.767
40        2.835   2.440   2.226   2.091   1.997   1.927   1.873   1.829   1.793   1.763
41        2.832   2.437   2.222   2.087   1.993   1.923   1.869   1.825   1.789   1.759
42        2.829   2.434   2.219   2.084   1.989   1.919   1.865   1.821   1.785   1.755
43        2.826   2.430   2.216   2.080   1.986   1.916   1.861   1.817   1.781   1.751
44        2.823   2.427   2.213   2.077   1.983   1.913   1.858   1.814   1.778   1.747
45        2.820   2.425   2.210   2.074   1.980   1.909   1.855   1.811   1.774   1.744
46        2.818   2.422   2.207   2.071   1.977   1.906   1.852   1.808   1.771   1.741
47        2.815   2.419   2.204   2.068   1.974   1.903   1.849   1.805   1.768   1.738
48        2.813   2.417   2.202   2.066   1.971   1.901   1.846   1.802   1.765   1.735
49        2.811   2.414   2.199   2.063   1.968   1.898   1.843   1.799   1.763   1.732
50        2.809   2.412   2.197   2.061   1.966   1.895   1.840   1.796   1.760   1.729
51        2.807   2.410   2.194   2.058   1.964   1.893   1.838   1.794   1.757   1.727
52        2.805   2.408   2.192   2.056   1.961   1.891   1.836   1.791   1.755   1.724
53        2.803   2.406   2.190   2.054   1.959   1.888   1.833   1.789   1.752   1.722
54        2.801   2.404   2.188   2.052   1.957   1.886   1.831   1.787   1.750   1.719
55        2.799   2.402   2.186   2.050   1.955   1.884   1.829   1.785   1.748   1.717
56        2.797   2.400   2.184   2.048   1.953   1.882   1.827   1.782   1.746   1.715
57        2.796   2.398   2.182   2.046   1.951   1.880   1.825   1.780   1.744   1.713
58        2.794   2.396   2.181   2.044   1.949   1.878   1.823   1.779   1.742   1.711
59        2.793   2.395   2.179   2.043   1.947   1.876   1.821   1.777   1.740   1.709
60        2.791   2.393   2.177   2.041   1.946   1.875   1.819   1.775   1.738   1.707
61        2.790   2.392   2.176   2.039   1.944   1.873   1.818   1.773   1.736   1.705
62        2.788   2.390   2.174   2.038   1.942   1.871   1.816   1.771   1.735   1.703
63        2.787   2.389   2.173   2.036   1.941   1.870   1.814   1.770   1.733   1.702

N2 \ N1       1       2       3       4       5       6       7       8       9      10
64        2.786   2.387   2.171   2.035   1.939   1.868   1.813   1.768   1.731   1.700
65        2.784   2.386   2.170   2.033   1.938   1.867   1.811   1.767   1.730   1.699
66        2.783   2.385   2.169   2.032   1.937   1.865   1.810   1.765   1.728   1.697
67        2.782   2.384   2.167   2.031   1.935   1.864   1.808   1.764   1.727   1.696
68        2.781   2.382   2.166   2.029   1.934   1.863   1.807   1.762   1.725   1.694
69        2.780   2.381   2.165   2.028   1.933   1.861   1.806   1.761   1.724   1.693
70        2.779   2.380   2.164   2.027   1.931   1.860   1.804   1.760   1.723   1.691
71        2.778   2.379   2.163   2.026   1.930   1.859   1.803   1.758   1.721   1.690
72        2.777   2.378   2.161   2.025   1.929   1.858   1.802   1.757   1.720   1.689
73        2.776   2.377   2.160   2.024   1.928   1.856   1.801   1.756   1.719   1.687
74        2.775   2.376   2.159   2.022   1.927   1.855   1.800   1.755   1.718   1.686
75        2.774   2.375   2.158   2.021   1.926   1.854   1.798   1.754   1.716   1.685
76        2.773   2.374   2.157   2.020   1.925   1.853   1.797   1.752   1.715   1.684
77        2.772   2.373   2.156   2.019   1.924   1.852   1.796   1.751   1.714   1.683
78        2.771   2.372   2.155   2.018   1.923   1.851   1.795   1.750   1.713   1.682
79        2.770   2.371   2.154   2.017   1.922   1.850   1.794   1.749   1.712   1.681
80        2.769   2.370   2.154   2.016   1.921   1.849   1.793   1.748   1.711   1.680
81        2.769   2.369   2.153   2.016   1.920   1.848   1.792   1.747   1.710   1.679
82        2.768   2.368   2.152   2.015   1.919   1.847   1.791   1.746   1.709   1.678
83        2.767   2.368   2.151   2.014   1.918   1.846   1.790   1.745   1.708   1.677
84        2.766   2.367   2.150   2.013   1.917   1.845   1.790   1.744   1.707   1.676
85        2.765   2.366   2.149   2.012   1.916   1.845   1.789   1.744   1.706   1.675
86        2.765   2.365   2.149   2.011   1.915   1.844   1.788   1.743   1.705   1.674
87        2.764   2.365   2.148   2.011   1.915   1.843   1.787   1.742   1.705   1.673
88        2.763   2.364   2.147   2.010   1.914   1.842   1.786   1.741   1.704   1.672
89        2.763   2.363   2.146   2.009   1.913   1.841   1.785   1.740   1.703   1.671
90        2.762   2.363   2.146   2.008   1.912   1.841   1.785   1.739   1.702   1.670
91        2.761   2.362   2.145   2.008   1.912   1.840   1.784   1.739   1.701   1.670
92        2.761   2.361   2.144   2.007   1.911   1.839   1.783   1.738   1.701   1.669
93        2.760   2.361   2.144   2.006   1.910   1.838   1.782   1.737   1.700   1.668
94        2.760   2.360   2.143   2.006   1.910   1.838   1.782   1.736   1.699   1.667
95        2.759   2.359   2.142   2.005   1.909   1.837   1.781   1.736   1.698   1.667
96        2.759   2.359   2.142   2.004   1.908   1.836   1.780   1.735   1.698   1.666
97        2.758   2.358   2.141   2.004   1.908   1.836   1.780   1.734   1.697   1.665
98        2.757   2.358   2.141   2.003   1.907   1.835   1.779   1.734   1.696   1.665
99        2.757   2.357   2.140   2.003   1.906   1.835   1.778   1.733   1.696   1.664
100       2.756   2.356   2.139   2.002   1.906   1.834   1.778   1.732   1.695   1.663

N2 \ N1      11      12      13      14      15      16      17      18      19      20
1        60.473  60.705  60.903  61.073  61.220  61.350  61.464  61.566  61.658  61.740
2         9.401   9.408   9.415   9.420   9.425   9.429   9.433   9.436   9.439   9.441
3         5.222   5.216   5.210   5.205   5.200   5.196   5.193   5.190   5.187   5.184
4         3.907   3.896   3.886   3.878   3.870   3.864   3.858   3.853   3.849   3.844
5         3.282   3.268   3.257   3.247   3.238   3.230   3.223   3.217   3.212   3.207
6         2.920   2.905   2.892   2.881   2.871   2.863   2.855   2.848   2.842   2.836
7         2.684   2.668   2.654   2.643   2.632   2.623   2.615   2.607   2.601   2.595
8         2.519   2.502   2.488   2.475   2.464   2.455   2.446   2.438   2.431   2.425
9         2.396   2.379   2.364   2.351   2.340   2.329   2.320   2.312   2.305   2.298
10        2.302   2.284   2.269   2.255   2.244   2.233   2.224   2.215   2.208   2.201
11        2.227   2.209   2.193   2.179   2.167   2.156   2.147   2.138   2.130   2.123
12        2.166   2.147   2.131   2.117   2.105   2.094   2.084   2.075   2.067   2.060
13        2.116   2.097   2.080   2.066   2.053   2.042   2.032   2.023   2.014   2.007
14        2.073   2.054   2.037   2.022   2.010   1.998   1.988   1.978   1.970   1.962
15        2.037   2.017   2.000   1.985   1.972   1.961   1.950   1.941   1.932   1.924
16        2.005   1.985   1.968   1.953   1.940   1.928   1.917   1.908   1.899   1.891
17        1.978   1.958   1.940   1.925   1.912   1.900   1.889   1.879   1.870   1.862
18        1.954   1.933   1.916   1.900   1.887   1.875   1.864   1.854   1.845   1.837
19        1.932   1.912   1.894   1.878   1.865   1.852   1.841   1.831   1.822   1.814
20        1.913   1.892   1.875   1.859   1.845   1.833   1.821   1.811   1.802   1.794
21        1.896   1.875   1.857   1.841   1.827   1.815   1.803   1.793   1.784   1.776
22        1.880   1.859   1.841   1.825   1.811   1.798   1.787   1.777   1.768   1.759
23        1.866   1.845   1.827   1.811   1.796   1.784   1.772   1.762   1.753   1.744
24        1.853   1.832   1.814   1.797   1.783   1.770   1.759   1.748   1.739   1.730
25        1.841   1.820   1.802   1.785   1.771   1.758   1.746   1.736   1.726   1.718
26        1.830   1.809   1.790   1.774   1.760   1.747   1.735   1.724   1.715   1.706
27        1.820   1.799   1.780   1.764   1.749   1.736   1.724   1.714   1.704   1.695
28        1.811   1.790   1.771   1.754   1.740   1.726   1.715   1.704   1.694   1.685
29        1.802   1.781   1.762   1.745   1.731   1.717   1.705   1.695   1.685   1.676
30        1.794   1.773   1.754   1.737   1.722   1.709   1.697   1.686   1.676   1.667
31        1.787   1.765   1.746   1.729   1.714   1.701   1.689   1.678   1.668   1.659
32        1.780   1.758   1.739   1.722   1.707   1.694   1.682   1.671   1.661   1.652
33        1.773   1.751   1.732   1.715   1.700   1.687   1.675   1.664   1.654   1.645
34        1.767   1.745   1.726   1.709   1.694   1.680   1.668   1.657   1.647   1.638
35        1.761   1.739   1.720   1.703   1.688   1.674   1.662   1.651   1.641   1.632
36        1.756   1.734   1.715   1.697   1.682   1.669   1.656   1.645   1.635   1.626
37        1.751   1.729   1.709   1.692   1.677   1.663   1.651   1.640   1.630   1.620
38        1.746   1.724   1.704   1.687   1.672   1.658   1.646   1.635   1.624   1.615
39        1.741   1.719   1.700   1.682   1.667   1.653   1.641   1.630   1.619   1.610
40        1.737   1.715   1.695   1.678   1.662   1.649   1.636   1.625   1.615   1.605

N2 \ N1      11      12      13      14      15      16      17      18      19      20
41        1.733   1.710   1.691   1.673   1.658   1.644   1.632   1.620   1.610   1.601
42        1.729   1.706   1.687   1.669   1.654   1.640   1.628   1.616   1.606   1.596
43        1.725   1.703   1.683   1.665   1.650   1.636   1.624   1.612   1.602   1.592
44        1.721   1.699   1.679   1.662   1.646   1.632   1.620   1.608   1.598   1.588
45        1.718   1.695   1.676   1.658   1.643   1.629   1.616   1.605   1.594   1.585
46        1.715   1.692   1.672   1.655   1.639   1.625   1.613   1.601   1.591   1.581
47        1.712   1.689   1.669   1.652   1.636   1.622   1.609   1.598   1.587   1.578
48        1.709   1.686   1.666   1.648   1.633   1.619   1.606   1.594   1.584   1.574
49        1.706   1.683   1.663   1.645   1.630   1.616   1.603   1.591   1.581   1.571
50        1.703   1.680   1.660   1.643   1.627   1.613   1.600   1.588   1.578   1.568
51        1.700   1.677   1.658   1.640   1.624   1.610   1.597   1.586   1.575   1.565
52        1.698   1.675   1.655   1.637   1.621   1.607   1.594   1.583   1.572   1.562
53        1.695   1.672   1.652   1.635   1.619   1.605   1.592   1.580   1.570   1.560
54        1.693   1.670   1.650   1.632   1.616   1.602   1.589   1.578   1.567   1.557
55        1.691   1.668   1.648   1.630   1.614   1.600   1.587   1.575   1.564   1.555
56        1.688   1.666   1.645   1.628   1.612   1.597   1.585   1.573   1.562   1.552
57        1.686   1.663   1.643   1.625   1.610   1.595   1.582   1.571   1.560   1.550
58        1.684   1.661   1.641   1.623   1.607   1.593   1.580   1.568   1.558   1.548
59        1.682   1.659   1.639   1.621   1.605   1.591   1.578   1.566   1.555   1.546
60        1.680   1.657   1.637   1.619   1.603   1.589   1.576   1.564   1.553   1.543
61        1.679   1.656   1.635   1.617   1.601   1.587   1.574   1.562   1.551   1.541
62        1.677   1.654   1.634   1.616   1.600   1.585   1.572   1.560   1.549   1.540
63        1.675   1.652   1.632   1.614   1.598   1.583   1.570   1.558   1.548   1.538
64        1.673   1.650   1.630   1.612   1.596   1.582   1.569   1.557   1.546   1.536
65        1.672   1.649   1.628   1.610   1.594   1.580   1.567   1.555   1.544   1.534
66        1.670   1.647   1.627   1.609   1.593   1.578   1.565   1.553   1.542   1.532
67        1.669   1.646   1.625   1.607   1.591   1.577   1.564   1.552   1.541   1.531
68        1.667   1.644   1.624   1.606   1.590   1.575   1.562   1.550   1.539   1.529
69        1.666   1.643   1.622   1.604   1.588   1.574   1.560   1.548   1.538   1.527
70        1.665   1.641   1.621   1.603   1.587   1.572   1.559   1.547   1.536   1.526
71        1.663   1.640   1.619   1.601   1.585   1.571   1.557   1.545   1.535   1.524
72        1.662   1.639   1.618   1.600   1.584   1.569   1.556   1.544   1.533   1.523
73        1.661   1.637   1.617   1.599   1.583   1.568   1.555   1.543   1.532   1.522
74        1.659   1.636   1.616   1.597   1.581   1.567   1.553   1.541   1.530   1.520
75        1.658   1.635   1.614   1.596   1.580   1.565   1.552   1.540   1.529   1.519
76        1.657   1.634   1.613   1.595   1.579   1.564   1.551   1.539   1.528   1.518
77        1.656   1.632   1.612   1.594   1.578   1.563   1.550   1.538   1.527   1.516
78        1.655   1.631   1.611   1.593   1.576   1.562   1.548   1.536   1.525   1.515
79        1.654   1.630   1.610   1.592   1.575   1.561   1.547   1.535   1.524   1.514
80        1.653   1.629   1.609   1.590   1.574   1.559   1.546   1.534   1.523   1.513

N2 \ N1      11      12      13      14      15      16      17      18      19      20
81        1.652   1.628   1.608   1.589   1.573   1.558   1.545   1.533   1.522   1.512
82        1.651   1.627   1.607   1.588   1.572   1.557   1.544   1.532   1.521   1.511
83        1.650   1.626   1.606   1.587   1.571   1.556   1.543   1.531   1.520   1.509
84        1.649   1.625   1.605   1.586   1.570   1.555   1.542   1.530   1.519   1.508
85        1.648   1.624   1.604   1.585   1.569   1.554   1.541   1.529   1.518   1.507
86        1.647   1.623   1.603   1.584   1.568   1.553   1.540   1.528   1.517   1.506
87        1.646   1.622   1.602   1.583   1.567   1.552   1.539   1.527   1.516   1.505
88        1.645   1.622   1.601   1.583   1.566   1.551   1.538   1.526   1.515   1.504
89        1.644   1.621   1.600   1.582   1.565   1.550   1.537   1.525   1.514   1.503
90        1.643   1.620   1.599   1.581   1.564   1.550   1.536   1.524   1.513   1.503
91        1.643   1.619   1.598   1.580   1.564   1.549   1.535   1.523   1.512   1.502
92        1.642   1.618   1.598   1.579   1.563   1.548   1.534   1.522   1.511   1.501
93        1.641   1.617   1.597   1.578   1.562   1.547   1.534   1.521   1.510   1.500
94        1.640   1.617   1.596   1.578   1.561   1.546   1.533   1.521   1.509   1.499
95        1.640   1.616   1.595   1.577   1.560   1.545   1.532   1.520   1.509   1.498
96        1.639   1.615   1.594   1.576   1.560   1.545   1.531   1.519   1.508   1.497
97        1.638   1.614   1.594   1.575   1.559   1.544   1.530   1.518   1.507   1.497
98        1.637   1.614   1.593   1.575   1.558   1.543   1.530   1.517   1.506   1.496
99        1.637   1.613   1.592   1.574   1.557   1.542   1.529   1.517   1.505   1.495
100       1.636   1.612   1.592   1.573   1.557   1.542   1.528   1.516   1.505   1.494
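A few entries in the 10% table can be spot-checked without statistical software. The sketch below is not part of the original appendix. It relies on a standard special case: when N1 = 2, the upper-tail probability of the F distribution is P(F > x) = (1 + 2x/N2)^(-N2/2), so the critical value has a closed form. The function name `f_upper_critical_n1_2` and the dictionary of sample entries are illustrative choices, and the check covers only the N1 = 2 column.

```python
import math


def f_upper_critical_n1_2(alpha: float, n2: int) -> float:
    """Upper critical value of F with 2 numerator and n2 denominator df.

    For N1 = 2 the upper-tail probability is
        P(F > x) = (1 + 2 * x / n2) ** (-n2 / 2),
    so setting it equal to alpha and solving for x gives a closed form.
    This shortcut does NOT apply to other values of N1.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be strictly between 0 and 1")
    if n2 < 1:
        raise ValueError("n2 must be a positive integer")
    return (n2 / 2.0) * (math.pow(alpha, -2.0 / n2) - 1.0)


# Sample entries copied from the N1 = 2 column of the 10% table: {N2: value}
TABLE_10_PERCENT_N1_2 = {
    1: 49.500,
    2: 9.000,
    3: 5.462,
    4: 4.325,
    10: 2.924,
    20: 2.589,
    100: 2.356,
}

# Print the tabulated value next to the closed-form value for comparison.
for n2, tabulated in TABLE_10_PERCENT_N1_2.items():
    computed = f_upper_critical_n1_2(0.10, n2)
    print(f"N2={n2:>3}  table={tabulated:.3f}  formula={computed:.3f}")
```

For other values of N1 no such shortcut exists, and the critical values are normally obtained from statistical software or a spreadsheet function rather than by hand.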

Upper critical values of the F distribution for numerator degrees of freedom N1 and denominator degrees of freedom N2, 1% significance level

N2 \ N1       1       2       3       4       5       6       7       8       9      10
1       4052.19 4999.52 5403.34 5624.62 5763.65 5858.97 5928.33 5981.10 6022.50 6055.85
2        98.502  99.000  99.166  99.249  99.300  99.333  99.356  99.374  99.388  99.399
3        34.116  30.816  29.457  28.710  28.237  27.911  27.672  27.489  27.345  27.229
4        21.198  18.000  16.694  15.977  15.522  15.207  14.976  14.799  14.659  14.546
5        16.258  13.274  12.060  11.392  10.967  10.672  10.456  10.289  10.158  10.051
6        13.745  10.925   9.780   9.148   8.746   8.466   8.260   8.102   7.976   7.874
7        12.246   9.547   8.451   7.847   7.460   7.191   6.993   6.840   6.719   6.620
8        11.259   8.649   7.591   7.006   6.632   6.371   6.178   6.029   5.911   5.814
9        10.561   8.022   6.992   6.422   6.057   5.802   5.613   5.467   5.351   5.257
10       10.044   7.559   6.552   5.994   5.636   5.386   5.200   5.057   4.942   4.849
11        9.646   7.206   6.217   5.668   5.316   5.069   4.886   4.744   4.632   4.539
12        9.330   6.927   5.953   5.412   5.064   4.821   4.640   4.499   4.388   4.296
13        9.074   6.701   5.739   5.205   4.862   4.620   4.441   4.302   4.191   4.100
14        8.862   6.515   5.564   5.035   4.695   4.456   4.278   4.140   4.030   3.939

N2 \ N1       1       2       3       4       5       6       7       8       9      10
15        8.683   6.359   5.417   4.893   4.556   4.318   4.142   4.004   3.895   3.805
16        8.531   6.226   5.292   4.773   4.437   4.202   4.026   3.890   3.780   3.691
17        8.400   6.112   5.185   4.669   4.336   4.102   3.927   3.791   3.682   3.593
18        8.285   6.013   5.092   4.579   4.248   4.015   3.841   3.705   3.597   3.508
19        8.185   5.926   5.010   4.500   4.171   3.939   3.765   3.631   3.523   3.434
20        8.096   5.849   4.938   4.431   4.103   3.871   3.699   3.564   3.457   3.368
21        8.017   5.780   4.874   4.369   4.042   3.812   3.640   3.506   3.398   3.310
22        7.945   5.719   4.817   4.313   3.988   3.758   3.587   3.453   3.346   3.258
23        7.881   5.664   4.765   4.264   3.939   3.710   3.539   3.406   3.299   3.211
24        7.823   5.614   4.718   4.218   3.895   3.667   3.496   3.363   3.256   3.168
25        7.770   5.568   4.675   4.177   3.855   3.627   3.457   3.324   3.217   3.129
26        7.721   5.526   4.637   4.140   3.818   3.591   3.421   3.288   3.182   3.094
27        7.677   5.488   4.601   4.106   3.785   3.558   3.388   3.256   3.149   3.062
28        7.636   5.453   4.568   4.074   3.754   3.528   3.358   3.226   3.120   3.032
29        7.598   5.420   4.538   4.045   3.725   3.499   3.330   3.198   3.092   3.005
30        7.562   5.390   4.510   4.018   3.699   3.473   3.305   3.173   3.067   2.979
31        7.530   5.362   4.484   3.993   3.675   3.449   3.281   3.149   3.043   2.955
32        7.499   5.336   4.459   3.969   3.652   3.427   3.258   3.127   3.021   2.934
33        7.471   5.312   4.437   3.948   3.630   3.406   3.238   3.106   3.000   2.913
34        7.444   5.289   4.416   3.927   3.611   3.386   3.218   3.087   2.981   2.894
35        7.419   5.268   4.396   3.908   3.592   3.368   3.200   3.069   2.963   2.876
36        7.396   5.248   4.377   3.890   3.574   3.351   3.183   3.052   2.946   2.859
37        7.373   5.229   4.360   3.873   3.558   3.334   3.167   3.036   2.930   2.843
38        7.353   5.211   4.343   3.858   3.542   3.319   3.152   3.021   2.915   2.828
39        7.333   5.194   4.327   3.843   3.528   3.305   3.137   3.006   2.901   2.814
40        7.314   5.179   4.313   3.828   3.514   3.291   3.124   2.993   2.888   2.801
41        7.296   5.163   4.299   3.815   3.501   3.278   3.111   2.980   2.875   2.788
42        7.280   5.149   4.285   3.802   3.488   3.266   3.099   2.968   2.863   2.776
43        7.264   5.136   4.273   3.790   3.476   3.254   3.087   2.957   2.851   2.764
44        7.248   5.123   4.261   3.778   3.465   3.243   3.076   2.946   2.840   2.754
45        7.234   5.110   4.249   3.767   3.454   3.232   3.066   2.935   2.830   2.743
46        7.220   5.099   4.238   3.757   3.444   3.222   3.056   2.925   2.820   2.733
47        7.207   5.087   4.228   3.747   3.434   3.213   3.046   2.916   2.811   2.724
48        7.194   5.077   4.218   3.737   3.425   3.204   3.037   2.907   2.802   2.715
49        7.182   5.066   4.208   3.728   3.416   3.195   3.028   2.898   2.793   2.706
50        7.171   5.057   4.199   3.720   3.408   3.186   3.020   2.890   2.785   2.698
51        7.159   5.047   4.191   3.711   3.400   3.178   3.012   2.882   2.777   2.690
52        7.149   5.038   4.182   3.703   3.392   3.171   3.005   2.874   2.769   2.683
53        7.139   5.030   4.174   3.695   3.384   3.163   2.997   2.867   2.762   2.675
54        7.129   5.021   4.167   3.688   3.377   3.156   2.990   2.860   2.755   2.668

N2 \ N1       1       2       3       4       5       6       7       8       9      10
55        7.119   5.013   4.159   3.681   3.370   3.149   2.983   2.853   2.748   2.662
56        7.110   5.006   4.152   3.674   3.363   3.143   2.977   2.847   2.742   2.655
57        7.102   4.998   4.145   3.667   3.357   3.136   2.971   2.841   2.736   2.649
58        7.093   4.991   4.138   3.661   3.351   3.130   2.965   2.835   2.730   2.643
59        7.085   4.984   4.132   3.655   3.345   3.124   2.959   2.829   2.724   2.637
60        7.077   4.977   4.126   3.649   3.339   3.119   2.953   2.823   2.718   2.632
61        7.070   4.971   4.120   3.643   3.333   3.113   2.948   2.818   2.713   2.626
62        7.062   4.965   4.114   3.638   3.328   3.108   2.942   2.813   2.708   2.621

63

7.055

4.959

4.109

3.632