This print textbook is available for students to rent for their classes.
MyLab Statistics: Support You Need, When You Need It

MyLab Statistics has helped millions of students succeed in their math courses. Take advantage of the resources it provides.
[Screenshot: a MyLab Statistics homework exercise, HW 2.1.39]
Interactive Exercises
MyLab Statistics' interactive exercises mirror those in the textbook but are programmed to allow you unlimited practice, leading to mastery. Most exercises include learning aids, such as "Help Me Solve This," "View an Example," and "Video," and they offer helpful feedback when you enter incorrect answers.
Instructional Videos
[Screenshot: a sample exercise — "Solve for X when Z = … and Z = 3, μ = 18.3, and σ = 2.2."]
Instructional videos cover key examples from the text and can conveniently be played on any mobile device. These videos are especially helpful if you miss a class or just need further explanation.
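The sample exercise shown in the screenshot rests on the z-score relationship X = μ + Zσ (inverting Z = (X − μ)/σ). Purely as an illustration — this snippet is not part of MyLab or the textbook, and the Z value of 3 is the only one legible in the screenshot — a short Python sketch:

```python
# Solve for X given a z-score, by inverting Z = (X - mu) / sigma.
# mu and sigma are taken from the sample exercise; Z = 3 is the
# legible value shown there.
mu, sigma = 18.3, 2.2

def x_from_z(z, mu, sigma):
    """Return the X value whose z-score is z: X = mu + z * sigma."""
    return mu + z * sigma

# Rounded to the nearest tenth, as such exercises typically request.
print(round(x_from_z(3, mu, sigma), 1))  # 24.9
```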
Interactive eText
The Pearson eText gives you access to your textbook anytime, anywhere. In addition to note taking, highlighting, and bookmarking, the Pearson eText App offers access even when offline.
pearson.com/mylab/statistics
Data Analysis Task: Describing a group or several groups
For numerical variables: ordered array, stem-and-leaf display, frequency distribution, relative frequency distribution, percentage distribution, cumulative percentage distribution, histogram, polygon, cumulative percentage polygon (Sections 2.2, 2.4); mean, median, mode, geometric mean, quartiles, range, interquartile range, standard deviation, variance, coefficient of variation, skewness, kurtosis, boxplot, normal probability plot (Sections 3.1, 3.2, 3.3, 6.3); index numbers (online Section 16.7); dashboards (Section 17.2)
For categorical variables: summary table, bar chart, pie chart, doughnut chart, Pareto chart (Sections 2.1 and 2.3)

Data Analysis Task: Inference about one group
Confidence interval estimate of the mean (Sections 8.1 and 8.2)
Confidence interval estimate of the proportion (Section 8.3)
t test for the mean (Section 9.2)
Z test for the proportion (Section 9.4)
Chi-square test for a variance or standard deviation (online Section 12.7)

Data Analysis Task: Comparing two groups
Tests for the difference in the means of two independent populations (Section 10.1)
Paired t test (Section 10.2)
Z test for the difference between two proportions (Section 10.3)
F test for the difference between two variances (Section 10.4)
Chi-square test for the difference between two proportions (Section 12.1)
Wilcoxon rank sum test (Section 12.4)
McNemar test for two related samples (online Section 12.6)
Wilcoxon signed ranks test (online Section 12.8)

Data Analysis Task: Comparing more than two groups
One-way analysis of variance for comparing several means (Section 11.1)
Two-way analysis of variance (Section 11.2)
Randomized block design (online Section 11.3)
Chi-square test for differences among more than two proportions (Section 12.2)
Kruskal-Wallis test (Section 12.5)

Data Analysis Task: Analyzing the relationship between two variables
Scatter plot, time-series plot (Section 2.5)
Contingency table, side-by-side bar chart, PivotTables (Sections 2.1, 2.3, 2.6)
Sparklines (Section 2.7)
Covariance, coefficient of correlation (Section 3.5)
Chi-square test of independence (Section 12.3)
Simple linear regression (Chapter 13)
t test of correlation (Section 13.7)
Time-series forecasting (Chapter 16)

Data Analysis Task: Analyzing the relationship between two or more variables
Multidimensional contingency tables (Section 2.6)
Colored scatter plots, bubble chart, treemap (Section 2.7)
Drilldown and slicers (Section 2.7)
Multiple regression (Chapters 14 and 15)
Logistic regression (Section 14.7)
Dynamic bubble charts (Section 17.2)
Regression trees (Section 17.3)
Classification trees (Section 17.3)
Cluster analysis (Section 17.4)
Multiple correspondence analysis (Section 17.5)
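The book carries out the "describing a group" tasks above in Excel (with Tableau asides). Purely as an outside illustration of a few of the listed measures — the data values and variable names here are hypothetical, not from the book — a small Python sketch using only the standard library:

```python
import statistics

# Hypothetical sample of one numerical variable (e.g., wait times in minutes).
data = [4.2, 5.1, 3.8, 6.0, 4.9, 5.5, 4.4, 5.0]

mean = statistics.mean(data)        # central tendency: the mean
median = statistics.median(data)    # central tendency: the middle value
stdev = statistics.stdev(data)      # variation: sample standard deviation
data_range = max(data) - min(data)  # variation: range
cv = stdev / mean * 100             # coefficient of variation, in percent

print(f"mean={mean:.4f} median={median:.2f} "
      f"stdev={stdev:.4f} range={data_range:.1f} CV={cv:.1f}%")
```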
Statistics for Managers Using Microsoft® Excel® NINTH EDITION
David M. Levine Department of Information Systems and Statistics Zicklin School of Business, Baruch College, City University of New York
David F. Stephan Two Bridges Instructional Technology
Kathryn A. Szabat Department of Business Systems and Analytics School of Business, La Salle University
Pearson
Please contact https://support.pearson.com/getsupport/s/contactsupport with any queries on this content.

Microsoft and/or its respective suppliers make no representations about the suitability of the information contained in the documents and related graphics published as part of the services for any purpose. All such documents and related graphics are provided "as is" without warranty of any kind. Microsoft and/or its respective suppliers hereby disclaim all warranties and conditions with regard to this information, including all warranties and conditions of merchantability, whether express, implied or statutory, fitness for a particular purpose, title and non-infringement. In no event shall Microsoft and/or its respective suppliers be liable for any special, indirect or consequential damages or any damages whatsoever resulting from loss of use, data or profits, whether in an action of contract, negligence or other tortious action, arising out of or in connection with the use or performance of information available from the services. The documents and related graphics contained herein could include technical inaccuracies or typographical errors. Changes are periodically added to the information herein. Microsoft and/or its respective suppliers may make improvements and/or changes in the product(s) and/or the program(s) described herein at any time. Partial screen shots may be viewed in full within the software version specified.

Microsoft®, Windows®, and Excel® are registered trademarks of the Microsoft Corporation in the U.S.A. and other countries. This book is not sponsored or endorsed by or affiliated with the Microsoft Corporation.

Copyright © 2021, 2017, 2014 by Pearson Education, Inc. or its affiliates, 221 River Street, Hoboken, NJ 07030. All Rights Reserved. Manufactured in the United States of America.
This publication is protected by copyright, and permission should be obtained from the publisher prior to any prohibited reproduction, storage in a retrieval system, or transmission in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise. For information regarding permissions, request forms, and the appropriate contacts within the Pearson Education Global Rights and Permissions department, please visit www.pearsoned.com/permissions/.

Acknowledgments of third-party content appear on page 720, which constitutes an extension of this copyright page.

PEARSON, ALWAYS LEARNING, and MYLAB are exclusive trademarks owned by Pearson Education, Inc. or its affiliates in the U.S. and/or other countries. Unless otherwise indicated herein, any third-party trademarks, logos, or icons that may appear in this work are the property of their respective owners, and any references to third-party trademarks, logos, icons, or other trade dress are for demonstrative or descriptive purposes only. Such references are not intended to imply any sponsorship, endorsement, authorization, or promotion of Pearson's products by the owners of such marks, or any relationship between the owner and Pearson Education, Inc., or its affiliates, authors, licensees, or distributors.

Library of Congress Cataloging-in-Publication Data
Names: Levine, David M., author. | Stephan, David, author. | Szabat, Kathryn A., author.
Title: Statistics for Managers Using Microsoft Excel / David M. Levine, Department of Information Systems and Statistics, Zicklin School of Business, Baruch College, City University of New York, David F. Stephan, Two Bridges Instructional Technology, Kathryn A. Szabat, Department of Business Systems and Analytics, School of Business, La Salle University.
Description: Ninth edition. | [Hoboken] : Pearson, [2019] | Includes index.
Summary: Statistics for Managers Using Microsoft Excel features a more concise style for understanding statistical concepts. -- Provided by publisher.
Identifiers: LCCN 2019043379 | ISBN 9780135969854 (paperback)
Subjects: LCSH: Microsoft Excel (Computer file) | Management--Statistical methods. | Commercial statistics. | Electronic spreadsheets. | Management--Statistical methods--Computer programs. | Commercial statistics--Computer programs.
Classification: LCC HD30.215 .S73 2020 | DDC 519.50285/554--dc23
LC record available at https://lccn.loc.gov/2019043379
Instructor's Review Copy ISBN 10: 0-13-662494-4; ISBN 13: 978-0-13-662494-3
Rental ISBN 10: 0-13-596985-9; ISBN 13: 978-0-13-596985-4
To our spouses and children, Marilyn, Mary, Sharyn, and Mark and to our parents, in loving memory, Lee, Reuben, Ruth, Francis J., Mary, and William
About the Authors

David M. Levine, David F. Stephan, and Kathryn A. Szabat are all experienced business school educators committed to innovation and improving instruction in business statistics and related subjects.
[Photo: Kathryn Szabat, David Levine, and David Stephan]

David Levine, Professor Emeritus of Statistics and CIS at Baruch College, CUNY, has been a nationally recognized innovator in statistics education for more than three decades. Levine has coauthored 14 books, including several business statistics textbooks; textbooks and professional titles that explain and explore quality management and the Six Sigma approach; and, with David Stephan, a trade paperback that explains statistical concepts to a general audience. Levine has presented or chaired numerous sessions about business education at leading conferences conducted by the Decision Sciences Institute (DSI) and the American Statistical Association, and he and his coauthors have been active participants in the annual DSI Data, Analytics, and Statistics Instruction (DASI) mini-conference. During his many years teaching at Baruch College, Levine was recognized for his contributions to teaching and curriculum development with the College's highest distinguished teaching honor. He earned B.B.A. and M.B.A. degrees from CCNY and a Ph.D. in industrial engineering and operations research from New York University.

Advances in computing have always shaped David Stephan's professional life. As an undergraduate, he helped professors use statistics software that was considered advanced even though it could compute only several things discussed in Chapter 3, thereby gaining an early appreciation for the benefits of using software to solve problems (and perhaps positively influencing his grades). An early advocate of using computers to support instruction, he developed a prototype of a mainframe-based system that anticipated features found today in Pearson's MathXL and served as special assistant for computing to the Dean and Provost at Baruch College.
In his many years teaching at Baruch, Stephan implemented the first computer-based classroom, helped redevelop the CIS curriculum, and, as part of a FIPSE project team, designed and implemented a multimedia learning environment. He was also nominated for teaching honors. Stephan has presented at SEDSI and DSI DASI mini-conferences, sometimes with his coauthors. Stephan earned a B.A. from Franklin & Marshall College and an M.S. from Baruch College, CUNY, and completed the instructional technology graduate program at Teachers College, Columbia University.

As Associate Professor of Business Systems and Analytics at La Salle University, Kathryn Szabat has transformed several business school majors into one interdisciplinary major that better supports careers in new and emerging disciplines of data analysis, including analytics. Szabat strives to inspire, stimulate, challenge, and motivate students through innovation and curricular enhancements and shares her coauthors' commitment to teaching excellence and the continual improvement of statistics presentations. Beyond the classroom she has provided statistical advice to numerous business, nonbusiness, and academic communities, with particular interest in the areas of education, medicine, and nonprofit capacity building. Her research activities have led to journal publications, chapters in scholarly books, and conference presentations. Szabat is a member of the American Statistical Association (ASA), DSI, the Institute for Operations Research and the Management Sciences (INFORMS), and DSI DASI. She received a B.S. from SUNY-Albany, an M.S. in statistics from the Wharton School of the University of Pennsylvania, and a Ph.D. in statistics, with a cognate in operations research, from the Wharton School of the University of Pennsylvania.

For all three coauthors, continuous improvement is a natural outcome of their curiosity about the world.
Their varied backgrounds and many years of teaching experience have come together to shape this book in ways discussed in the Preface.
Brief Contents

Preface xxi
First Things First 1
1 Defining and Collecting Data 16
2 Organizing and Visualizing Variables 38
3 Numerical Descriptive Measures 108
4 Basic Probability 152
5 Discrete Probability Distributions 176
6 The Normal Distribution and Other Continuous Distributions 198
7 Sampling Distributions 224
8 Confidence Interval Estimation 244
9 Fundamentals of Hypothesis Testing: One-Sample Tests 275
10 Two-Sample Tests 311
11 Analysis of Variance 352
12 Chi-Square and Nonparametric Tests 389
13 Simple Linear Regression 430
14 Introduction to Multiple Regression 478
15 Multiple Regression Model Building 526
16 Time-Series Forecasting 556
17 Business Analytics 600
18 Getting Ready to Analyze Data in the Future 618
19 Statistical Applications in Quality Management (online) 19-1
20 Decision Making (online) 20-1
Appendices 625
Self-Test Solutions and Answers to Selected Even-Numbered Problems 673
Index 713
Credits 720
CONTENTS

Preface xxi

First Things First 1
USING STATISTICS: "The Price of Admission" 1
FTF.1 Think Differently About Statistics 2
  Statistics: A Way of Thinking 3
  Statistics: An Important Part of Your Business Education 4
FTF.2 Business Analytics: The Changing Face of Statistics 4
  "Big Data" 4
FTF.3 Starting Point for Learning Statistics 5
  Statistic 5
  Can Statistics (pl., statistic) Lie? 5
FTF.4 Starting Point for Using Software 6
  Using Software Properly 7
FTF.5 Starting Point for Using Microsoft Excel 8
  More About the Excel Guide Workbooks 9
  Excel Skills That Readers Need 9
REFERENCES 10
KEY TERMS 10
EXCEL GUIDE 11
  EG.1 Getting Started with Excel 11
  EG.2 Entering Data 11
  EG.3 Open or Save a Workbook 11
  EG.4 Working with a Workbook 12
  EG.5 Print a Worksheet 12
  EG.6 Reviewing Worksheets 12
  EG.7 If You Use the Workbook Instructions 12
TABLEAU GUIDE 13
  TG.1 Getting Started with Tableau 13
  TG.2 Entering Data 14
  TG.3 Open or Save a Workbook 14
  TG.4 Working with Data 15
  TG.5 Print a Workbook 15

1 Defining and Collecting Data 16
USING STATISTICS: Defining Moments 16
1.1 Defining Variables 17
  Classifying Variables by Type 17
  Measurement Scales 18
1.2 Collecting Data 19
  Populations and Samples 19
  Data Sources 20
1.3 Types of Sampling Methods 21
  Simple Random Sample 21
  Systematic Sample 22
  Stratified Sample 22
  Cluster Sample 22
1.4 Data Cleaning 24
  Invalid Variable Values 24
  Coding Errors 24
  Data Integration Errors 24
  Missing Values 25
  Algorithmic Cleaning of Extreme Numerical Values 25
1.5 Other Data Preprocessing Tasks 25
  Data Formatting 25
  Stacking and Unstacking Data 26
  Recoding Variables 26
1.6 Types of Survey Errors 27
  Coverage Error 27
  Nonresponse Error 27
  Sampling Error 28
  Measurement Error 28
  Ethical Issues About Surveys 28
CONSIDER THIS: New Media Surveys/Old Survey Errors 29
USING STATISTICS: Defining Moments, Revisited 30
SUMMARY 30
REFERENCES 30
KEY TERMS 31
CHECKING YOUR UNDERSTANDING 31
CHAPTER REVIEW PROBLEMS 31
CASES FOR CHAPTER 1 32
  Managing Ashland MultiComm Services 32
  CardioGood Fitness 33
  Clear Mountain State Student Survey 33
  Learning With the Digital Cases 33
CHAPTER 1 EXCEL GUIDE 35
  EG1.1 Defining Variables 35
  EG1.3 Types of Sampling Methods 35
  EG1.4 Data Cleaning 36
  EG1.5 Other Data Preprocessing 36
CHAPTER 1 TABLEAU GUIDE 37
  TG1.1 Defining Variables 37
  TG1.4 Data Cleaning 37
mJ Organizing and Visualizing Variables
38
EG2.3 Visualizing Categorical Variables 92 EG2.4 Visualizing Numerical Variables 94 EG2.5 Visualizing Two Numerical Variables 97 EG2.6 Organizing a Mix of Variables 98
USING STATISTICS: "The Choice Is Yours" 38
EG2.7 Visualizing a Mix of Variables 99
2.1
EG2.8 Filtering and Querying Data 101
2.2
Organizing Categorical Variables 39 The Summary Table 39 The Contingency Table 40 Organizing Numerical Variables 43 The Frequency Distribution 44 The Relative Frequency Distribution and t he Percentage Distribution 46 The Cumulative Distribution 48
2.3
Visualizing Categorical Variables 51 The Bar Chart 51 The Pie Chart and the Doughnut Chart 52 The Pareto Chart 53 Visualizing Two Categorical Variables 55
2.4
Visualizing Numerical Variables 58 The Stem-and-Leaf Display 58 The Histogram 59 The Percentage Polygon 60 The Cumulative Percentage Polygon (Ogive) 61
2.5
Visualizing Two Numerical Variables 65 The Scatter Plot 65 The Time-Series Plot 66
2.6
Organizing a Mix of Variables 68 Drill-down 69
2. 7
Visualizing a Mix of Variables 70
CHAPTER 2 TABLEAU GUIDE 101 TG2.1 Organizing Categorical Variables 101 TG2.2 Organizing Numerical Variables 102 TG2.3 Visualizing Categorical Variables 102 TG2.4 Visualizing Numerical Variables 104 TG2.5 Visualizing Two Numerical Variables 105 TG2.6 Organizing a Mix of Variables 105 TG2. 7 Visualizing a Mix of Variables 106
WNumerical Descriptive Measures
USING STATISTICS: More Descriptive Choices 108 3.1
Measures of Central Tendency 109 The Mean 109 The Median 111 The Mode 112 The Geometric Mean 11 3
3.2
Measures of Variation and Shape 114 The Range 114 The Variance and the Standard Deviation 115 The Coefficient of Variation 117
Colored Scatter Plot (Tableau) 70 Bubble Chart 71 PivotChart 71
Z Scores 118 Shape: Skewness 120 Shape: Kurtosis 120
Treemap 71 Sparklines 72
2.8
Fi ltering and Querying Data 73 Excel Slicers 73
2.9
Pitfalls in Organizing and Visualizing Variables 75 Obscuring Data 75 Creating False Impressions 76 Chartjunk 77
3.3
Exploring Numerical Variables 125 Quartiles 125 The Interquartile Range 127 The Five-Number Summary 128 The Boxplot 129
3.4
Numerical Descriptive Measures for a Population 132 The Population Mean 132
USING STATISTICS: "The Choice Is Yours," Revisited 79 SUMMARY 79
The Population Variance and Standard Deviation 133 The Empirical Rule 134 Chebyshev's Theorem 134
REFERENCES 80 KEY EQUATIONS 80 KEYTERMS 81
3.5
The Covariance and the Coefficient of Correlation 136 The Covariance 136 The Coefficient of Correlation 137
3 .6
Descriptive Statistics: Pitfalls and Ethical Issues 141
CHECKING YOUR UNDERSTANDING 81 CHAPTER REVIEW PROBLEMS 81
CASES FOR CHAPTER 2 86 Managing Ashland MultiComm Services 86 Digital Case 86 CardioGood Fitness 87 The Choice Is Yours Follow-Up 87 Clear Mountain State Student Survey 87
CHAPTER 2 EXCEL GUIDE 88
108
USING STATISTICS: More Descriptive Choices, Revisited 141 SUMMARY 142 REFERENCES 142 KEY EQUATIONS 142 KEY TERMS 143
EG2.1 Organizing Categorical Variables 88
CHECKING YOUR UNDERSTANDING 143
EG2.2 Organizing Numerical Variables 90
CHAPTER REVIEW PROBLEMS 144
CONTENTS
CASES FOR CHAPTER 3 14 7 Managing Ashland MultiComm Services 147 Digital Case 147 CardioGood Fitness 147 More Descriptive Choices Follow-up 147 Clear Mountain State Student Survey 147
CHAPTER 3 EXCEL GUIDE 148
WDiscrete Probability Distributions
5.1
The Probability Distribution for a Discrete Variable 177 Expected Value of a Discrete Variable 177 Variance and Standard Deviation of a Discrete Variable 178
5.2
Binomial Distribution 181
EG3.2 Measures of Variation and Shape 149
Histograms for Discrete Variables 184 Summary Measures for the Binomial Distribution 185
EG3.4 Numerical Descriptive Measures for a Population 150 EG3.5 The Covariance and t he Coefficient of Correlation 150
CHAPTER 3 TABLEAU GUIDE 151 TG3.3 Exploring Numerical Variables 151
~
Basic Probability
152
USING STATISTICS: Probable Outcomes at Fredco Warehouse Club 152 4.1
4.2
Basic Probability Co ncepts 153 Events and Sample Spaces 153 Types of Probability 154 Summarizing Sample Spaces 155 Simple Probability 155 Joint Probability 157 Marginal Probability 157 General Addition Rule 158 Conditional Probability 161 Calculating Conditional Probabilities 161 Decision Trees 163 Independence 164 Multiplication Rules 165 Marginal Probability Using the General Mult iplication Rule 166
4.3
Ethical Issues and Probability 169
4.4
Bayes' Theorem 169
116
USING STATISTICS: Events of Interest at Ricknel Home Centers 176
EG3.1 Measures of Central Tendency 148 EG3.3 Exploring Numerical Variables 149
xiii
5.3
Poisson Distribution 188
5.4
Covariance of a Probability Distribution and Its Application in Finance 191
5.5
Hypergeometric Distribution 191
USING STATISTICS: Events oflnterest ... , Revisited 191 SUMMARY 192 REFERENCES 192 KEY EQUATIONS 192 KEY TERMS 1 92 CHECKING YOUR UNDERSTANDING 192 CHAPTER REVIEW PROBLEMS 193
CASES FOR CHAPTER 5 195 Managing Ashland MultiComm Services 195 Digital Case 195
CHAPTER 5 EXCEL GUIDE 196 EG5.1 The Probability Distribution for a Discrete Variable 196 EG5.2 Binomial Distribution 196 EG5.3 Poisson Distribution 196
m
The Normal Distribution and Other Continuous Distributions 198
CONSIDER THIS: Divine Providence and Spam 170
4.5
Counting Rules 171
USING STATISTICS: Probable Outcomes at Fredco Warehouse Club, Revisited 171 SUMMARY 171
USING STATISTICS: Normal Load Times at MyTVLab 198
6 .1
Continuous Probability Distributions 199
6.2
The Normal Distribution 200 Role of the Mean and the Standard Deviation 201 Calculating Normal Probabilities 202
REFERENCES 172
Finding X Values 207
KEY EQUATIONS 172 KEY TERMS 1 72
CONSIDER THIS: What Is Normal? 210
CHECKING YOUR UNDERSTANDING 172
6.3
CHAPTER REVIEW PROBLEMS 173
CASES FOR CHAPTER 4 174 Digital Case 174 CardioGood Fitness 174 The Choice Is Yours Follow-Up 174 Clear Mountain State Student Survey 174
CHAPTER 4 EXCEL GUIDE 175 EG4.1 Basic Probability Concepts 175 EG4.4 Bayes' Theorem 175
Evaluating Normality 212 Comparing Data Characteristics to Theoretical Properties 212 Constructing the Normal Probability Plot 213
6.4
The Uniform Distribution 215
6.5
The Exponential Distribution 217
6.6
The Normal Approximation to the Binomial Distribution 217
USING STATISTICS: Normal Load Times ... , Revisited 218 SUMMARY 218
xiv
CONTENTS
REFERENCES 218
8.2
KEY EQUATIONS 219 KEY TERMS 219
The Concept of Degrees of Freedom 251 Properties of the t Distributi on 251
CHECKING YOUR UNDERSTANDING 219 CHAPTER REVIEW PROBLEMS 219
CASES FOR CHAPTER 6 221 Managing Ashland MultiComm Services 221 CardioGood Fitness 221 More Descriptive Choices Follow-up 221 Clear Mountain State Student Survey 221 Digital Case 221
CHAPTER 6 EXCEL GUIDE 222
The Confidence Interval Statement 253
8.3
Confidence Interval Estimate for the Proportion 258
8.4
Determining Sample Size 261 Sample Size Determination for the Mean 261 Sample Size Determination for the Proportion 263
8.5
Confidence Interval Estimation and Ethical Issues 266
8.6
Application of Confidence Interval Estimation in Auditing 266
8. 7
Estimation and Sample Size Determination for Finite Populations 266
8.8
Bootstrapping 266
EG6.2 The Normal Distribution 222 EG6.3 Evaluating Normality 222
Confidence Interval Estimate for the Mean (a Unknown) 250 Student's t Distribution 250
USING STATISTICS: Getting Estimates at Ricknel Home Centers, Revisited 267 USING STATISTICS: Sampling Oxford Cereals 224
SUMMARY 267
7 .1
Sampling Distributions 225
REFERENCES 267
7 .2
Sampling Distribution of the Mean 225
KEY EQUATIONS 268
The Unbiased Property of the Sample Mean 225 Standard Error of the Mean 227 Sampling from Normally Distributed Populations 228 Sampling from Non-normally Distributed Populations-The Central Limit Theorem 231 VISUAL EXPLORATIONS: Exploring Sampling Distributions 235
7 .3
Sampling Distribution of the Proportion 236
7.4
Sampling from Finite Populations 239
USING STATISTICS: Sampling Oxford Cereals, Revisited 239 SUMMARY 240 REFERENCES 240 KEY EQUATIONS 240 KEY TERMS 240 CHECKING YOUR UNDERSTANDING 240 CHAPTER REVIEW PROBLEMS 241
KEY TERMS 268 CHECKING YOUR UNDERSTANDING 268 CHAPTER REVIEW PROBLEMS 268
CASES FOR CHAPTER 8 271 Managing Ashland MultiComm Services 271 Digital Case 272 Sure Value Convenience Stores 272 CardioGood Fitness 272 More Descriptive Choices Follow-Up 272 Clear Mountain State Student Survey 272
CHAPTER 8 EXCEL GUIDE 273 EG8.1 Confidence Interval Estimate for the Mean {er Known) 273 EG8.2 Confidence Interval Estimate for the Mean (er Unknown) 273 EG8.3 Confidence Interval Estimate for the Proportion 27 4 EG8.4 Determining Sample Size 274
ml Fundamentals of Hypothesis Testing: One-Sample Tests
CASES FOR CHAPTER 7 242 Managing Ashland MultiComm Services 242 Digital Case 242
CHAPTER 7 EXCEL GUIDE 243 EG7.2 Sampling Distribution of the Mean 243
USING STATISTICS: Significant Testing at Oxford Cereals 275 9.1
Z Test for the Mean (a Known) 280 Hypothesis Testing Using the Critical Value Approach 280 Hypothesis Testing Using the p-Value Approach 284 A Connection Between Confidence Interval Estimation and Hypothesis Testing 286 Can You Ever Know the Population Standard Deviation? 287
USING STATISTICS: Getting Estimates at Ricknel Home Centers 244 Confidence Interval Estimate for the Mean (a Known) 245 Sampling Error 246 Can You Ever Know the Population Standard Deviation? 249
Fundamentals of Hypothesis Testing 276 The Critical Value of the Test Statistic 277 Regions of Rejection and Nonrejection 278 Risks in Decision Making Using Hypothesis Testing 278
8 Confidence Interval Estimation 244
8.1
21s
9.2
t Test of Hypothesis for the Mean (a Unknown) 288 Using the Critical Value Approach 289 Using the p-Value Approach 290
CONTENTS Checking the Normality Assumption 291
9.3
One-Tail Tests 294
SUMMARY 341
Using the Critical Value Approach 294 Using the p-Value Approach 296
9.4
REFERENCES 342
Z Test of Hypothesis for the Proportion 298 Using the Critical Value Approach 299 Using the p-Value Approach 300
9.5
KEY EQUATIONS 342 KEY TERMS 342 CHECKING YOUR UNDERSTANDING 343
Potential Hypothesis-Testing Pitfalls and Ethical Issues 302 Important Planning Stage Questions 302 Statistical Significance Versus Practical Significance 303 Statistical Insignificance Versus Importance 303 Reporting of Findings 303 Ethical Issues 303
9.6
USING STATISTICS: Differing Means for Selling ... , Revisited 340
Power of the Test 303
USING STATISTICS: Significant Testing ... , Revisited 304 SUMMARY 304
CHAPTER REVIEW PROBLEMS 343
CASES FOR CHAPTER 10 345 Managing Ashland MultiComm Services 345 Digital Case 345 Sure Value Convenience Stores 346 CardioGood Fitness 346 More Descriptive Choices Follow-Up 346 Clear Mountain State Student Survey 346
CHAPTER 10 EXCEL GUIDE 347 EG10.1 Comparing the Means of Two Independent Populations 347
REFERENCES 304
EG10.2 Comparing the Means of Two Related Populations 349 EG10.3 Comparing the Proportions of Two Independent Populations 350
KEY EQUATIONS 305 KEY TERMS 305 CHECKING YOUR UNDERSTANDING 305
EG10.4 FTest For The Ratio of Two Variances 351
CHAPTER REVIEW PROBLEMS 305
CASES FOR CHAPTER 9 307 Managing Ashland MultiComm Services 307 Digital Case 307 Sure Value Convenience Stores 308
USING STATISTICS: The Means to Find Differences at Arlingtons 352
CHAPTER 9 EXCEL GUIDE 309 EG9.1 Fundamentals of Hypothesis Testing 309 EG9.2 t Test of Hypothesis for the Mean (a Unknown) 309 EG9.3 One-Tail Tests 310 EG9.4 Z Test of Hypothesis for the Proportion 310
~O
Two-Sample Tests
11.1 One-Way ANOVA 353 F Test for Differences Among More Than Two Means 356 One-Way ANOVA F Test Assumptions 360
Levene Test for Homogeneity of Variance 361 Multiple Comparisons: The Tukey-Kramer Procedure 362
11.2 Two-Way ANOVA 367 311
USING STATISTICS: Differing Means for Selling Streaming Media Players at Arlingtons? 311 10.1 Comparing the Means of Two Independent Populations 312 Pooled-Variance t Test for the Difference Between Two Means Assuming Equal Variances 312 Evaluating the Normality Assumption 315 Confidence Interval Estimate for the Difference Between Two Means 3 17 Separate-Variance t Test for the Difference Between Two Means, Assuming Unequal Variances 318
Factor and Interaction Effects 367 Testing for Factor and Interaction Effects 369 Multiple Comparisons: The Tukey Procedure 373 Visualizing Interaction Effects: The Cell Means Plot 374 Interpreting Interaction Effects 374
11.3 The Randomized Block Design 379 11.4 Fixed Effects, Random Effects, and Mixed Effects Models 379 USING STATISTICS: The Means to Find Differences at Arlingtons, Revisited 379 SUMMARY 379 REFERENCES 380
CONSIDER THIS: Do People Really Do This? 319
KEY EQUATIONS 380
10.2 Comparing the Means of Two Related Populations 321 Paired t Test 322
CHECKING YOUR UNDERSTANDING 381
Confidence Interval Estimate for the Mean Difference 327
10.3 Comparing the Proportions of Two Independent Populations 329 Z Test for the Difference Between Two Proportions 329 Confidence Interval Estimate for the Difference Between Two Proportions 333
10.4 F Test for the Ratio of Two Variances 336 10.5 Effect Size 340
KEY TERMS 381 CHAPTER REVIEW PROBLEMS 381
CASES FOR CHAPTER 11 383 Managing Ashland MultiComm Services 383 PHASE 1 383 PHASE 2 383 Digital Case 384 Sure Value Convenience Stores 384 CardioGood Fitness 384 More Descriptive Choices Follow-Up 384 Clear Mountain State Student Survey 384

CHAPTER 11 EXCEL GUIDE 385 EG11.1 The Completely Randomized Design: One-Way ANOVA 385 EG11.2 The Factorial Design: Two-Way ANOVA 387

12 Chi-Square and Nonparametric Tests 389

USING STATISTICS: Avoiding Guesswork About Resort Guests 389
12.1 Chi-Square Test for the Difference Between Two Proportions 390
12.2 Chi-Square Test for Differences Among More Than Two Proportions 397 The Marascuilo Procedure 400 The Analysis of Proportions (ANOP) 402
12.3 Chi-Square Test of Independence 403
12.4 Wilcoxon Rank Sum Test for Two Independent Populations 409
12.5 Kruskal-Wallis Rank Test for the One-Way ANOVA 415 Assumptions of the Kruskal-Wallis Rank Test 418
12.6 McNemar Test for the Difference Between Two Proportions (Related Samples) 419
12.7 Chi-Square Test for the Variance or Standard Deviation 419
12.8 Wilcoxon Signed Ranks Test for Two Related Populations 420
USING STATISTICS: Avoiding Guesswork ..., Revisited 420 SUMMARY 420 REFERENCES 420 KEY EQUATIONS 421 KEY TERMS 422 CHECKING YOUR UNDERSTANDING 422 CHAPTER REVIEW PROBLEMS 422

CASES FOR CHAPTER 12 424 Managing Ashland MultiComm Services 424 PHASE 1 424 PHASE 2 424 Digital Case 425 Sure Value Convenience Stores 425 CardioGood Fitness 425 More Descriptive Choices Follow-Up 425 Clear Mountain State Student Survey 425

CHAPTER 12 EXCEL GUIDE 427 EG12.1 Chi-Square Test for the Difference Between Two Proportions 427 EG12.2 Chi-Square Test for Differences Among More Than Two Proportions 427 EG12.3 Chi-Square Test of Independence 428 EG12.4 Wilcoxon Rank Sum Test: A Nonparametric Method for Two Independent Populations 428 EG12.5 Kruskal-Wallis Rank Test: A Nonparametric Method for the One-Way ANOVA 429

13 Simple Linear Regression 430

USING STATISTICS: Knowing Customers at Sunflowers Apparel 430 Preliminary Analysis 431
13.1 Simple Linear Regression Models 432
13.2 Determining the Simple Linear Regression Equation 433 The Least-Squares Method 433 Predictions in Regression Analysis: Interpolation Versus Extrapolation 436 Calculating the Slope, b1, and the Y Intercept, b0 437
13.3 Measures of Variation 441 Computing the Sum of Squares 441 The Coefficient of Determination 443 Standard Error of the Estimate 444
13.4 Assumptions of Regression 445
13.5 Residual Analysis 446 Evaluating the Assumptions 446
13.6 Measuring Autocorrelation: The Durbin-Watson Statistic 450 Residual Plots to Detect Autocorrelation 450 The Durbin-Watson Statistic 451
13.7 Inferences About the Slope and Correlation Coefficient 454 t Test for the Slope 454 F Test for the Slope 455 Confidence Interval Estimate for the Slope 457 t Test for the Correlation Coefficient 457
13.8 Estimation of Mean Values and Prediction of Individual Values 460 The Confidence Interval Estimate for the Mean Response 461 The Prediction Interval for an Individual Response 462
13.9 Potential Pitfalls in Regression 464
USING STATISTICS: Knowing Customers ..., Revisited 466 SUMMARY 467 REFERENCES 468 KEY EQUATIONS 468 KEY TERMS 469 CHECKING YOUR UNDERSTANDING 469 CHAPTER REVIEW PROBLEMS 470

CASES FOR CHAPTER 13 473 Managing Ashland MultiComm Services 473 Digital Case 473 Brynne Packaging 473

CHAPTER 13 EXCEL GUIDE 474 EG13.2 Determining the Simple Linear Regression Equation 474 EG13.3 Measures of Variation 475 EG13.5 Residual Analysis 475 EG13.6 Measuring Autocorrelation: The Durbin-Watson Statistic 476 EG13.7 Inferences About the Slope and Correlation Coefficient 476 EG13.8 Estimation of Mean Values and Prediction of Individual Values 476

CHAPTER 13 TABLEAU GUIDE 477 TG13.2 Determining the Simple Linear Regression Equation 477 TG13.3 Measures of Variation 477
14 Introduction to Multiple Regression 478

USING STATISTICS: The Multiple Effects of OmniPower Bars 478
14.1 Developing a Multiple Regression Model 479 Interpreting the Regression Coefficients 479 Predicting the Dependent Variable Y 481
14.2 Evaluating Multiple Regression Models 483 Coefficient of Multiple Determination, r2 484 Adjusted r2 484 F Test for the Significance of the Overall Multiple Regression Model 485
14.3 Multiple Regression Residual Analysis 487
14.4 Inferences About the Population Regression Coefficients 489 Tests of Hypothesis 489 Confidence Interval Estimation 490
14.5 Testing Portions of the Multiple Regression Model 492 Coefficients of Partial Determination 495
14.6 Using Dummy Variables and Interaction Terms 497 Interactions 503
CONSIDER THIS: What Is Not Normal? (Using a Categorical Dependent Variable) 508
14.7 Logistic Regression 509
14.8 Cross-Validation 514
USING STATISTICS: The Multiple Effects ..., Revisited 515 SUMMARY 515 REFERENCES 517 KEY EQUATIONS 517 KEY TERMS 518 CHECKING YOUR UNDERSTANDING 518 CHAPTER REVIEW PROBLEMS 518

CASES FOR CHAPTER 14 521 Managing Ashland MultiComm Services 521 Digital Case 521

CHAPTER 14 EXCEL GUIDE 522 EG14.1 Developing a Multiple Regression Model 522 EG14.2 Evaluating Multiple Regression Models 523 EG14.3 Multiple Regression Residual Analysis 523 EG14.4 Inferences About the Population Regression Coefficients 524 EG14.5 Testing Portions of the Multiple Regression Model 524 EG14.6 Using Dummy Variables and Interaction Terms 524 EG14.7 Logistic Regression 524

15 Multiple Regression Model Building 526

USING STATISTICS: Valuing Parsimony at WSTA-TV 526
15.1 The Quadratic Regression Model 527 Finding the Regression Coefficients and Predicting Y 528 Testing for the Significance of the Quadratic Model 530 Testing the Quadratic Effect 530 The Coefficient of Multiple Determination 533
15.2 Using Transformations in Regression Models 534 The Square-Root Transformation 535 The Log Transformation 536
15.3 Collinearity 538
15.4 Model Building 540 The Stepwise Regression Approach to Model Building 541 The Best Subsets Approach to Model Building 542
15.5 Pitfalls in Multiple Regression and Ethical Issues 547 Pitfalls in Multiple Regression 547 Ethical Issues 547
USING STATISTICS: Valuing Parsimony ..., Revisited 547 SUMMARY 548 REFERENCES 549 KEY EQUATIONS 549 KEY TERMS 549 CHECKING YOUR UNDERSTANDING 549 CHAPTER REVIEW PROBLEMS 549

CASES FOR CHAPTER 15 551 The Mountain States Potato Company 551 Sure Value Convenience Stores 552 Digital Case 552 The Graybill Instrumentation Company Case 552 More Descriptive Choices Follow-Up 553

CHAPTER 15 EXCEL GUIDE 554 EG15.1 The Quadratic Regression Model 554 EG15.2 Using Transformations in Regression Models 554 EG15.3 Collinearity 555 EG15.4 Model Building 555

16 Time-Series Forecasting 556

USING STATISTICS: Is the ByYourDoor Service Trending? 556
16.1 Time-Series Component Factors 557
16.2 Smoothing an Annual Time Series 559 Moving Averages 559 Exponential Smoothing 561
16.3 Least-Squares Trend Fitting and Forecasting 564 The Linear Trend Model 564 The Quadratic Trend Model 566 The Exponential Trend Model 567 Model Selection Using First, Second, and Percentage Differences 569
16.4 Autoregressive Modeling for Trend Fitting and Forecasting 573 Selecting an Appropriate Autoregressive Model 574 Determining the Appropriateness of a Selected Model 575
16.5 Choosing an Appropriate Forecasting Model 582 Residual Analysis 582 The Magnitude of the Residuals Through Squared or Absolute Differences 583 The Principle of Parsimony 584 A Comparison of Four Forecasting Methods 584
16.6 Time-Series Forecasting of Seasonal Data 586
Least-Squares Forecasting with Monthly or Quarterly Data 586
16.7 Index Numbers 592
CONSIDER THIS: Let the Model User Beware 592
USING STATISTICS: Is the ByYourDoor Service Trending?, Revisited 592 SUMMARY 592 REFERENCES 593 KEY EQUATIONS 593 KEY TERMS 594 CHECKING YOUR UNDERSTANDING 595 CHAPTER REVIEW PROBLEMS 595

CASES FOR CHAPTER 16 596 Managing Ashland MultiComm Services 596 Digital Case 596

CHAPTER 16 EXCEL GUIDE 597 EG16.2 Smoothing an Annual Time Series 597 EG16.3 Least-Squares Trend Fitting and Forecasting 598 EG16.4 Autoregressive Modeling for Trend Fitting and Forecasting 598 EG16.5 Choosing an Appropriate Forecasting Model 599 EG16.6 Time-Series Forecasting of Seasonal Data 599

17 Business Analytics 600

USING STATISTICS: Back to Arlingtons for the Future 600
17.1 Business Analytics Overview 601 Business Analytics Categories 601 Business Analytics Vocabulary 602
CONSIDER THIS: What's My Major If I Want to Be a Data Miner? 602 Inferential Statistics and Predictive Analytics 603 Microsoft Excel and Business Analytics 604 Remainder of This Chapter 604
17.2 Descriptive Analytics 604 Dashboards 605 Data Dimensionality and Descriptive Analytics 606
17.3 Decision Trees 606 Regression Trees 607 Classification Trees 608 Subjectivity and Interpretation 609
17.4 Clustering 609
17.5 Association Analysis 610
17.6 Text Analytics 611
17.7 Prescriptive Analytics 612 Optimization and Simulation 613
USING STATISTICS: Back to Arlingtons ..., Revisited 613 REFERENCES 613 KEY TERMS 614 CHECKING YOUR UNDERSTANDING 614

CHAPTER 17 SOFTWARE GUIDE 615 EG17.2 Descriptive Analytics 615 EG17.5 Predictive Analytics for Clustering 616

18 Getting Ready to Analyze Data in the Future 618

USING STATISTICS: Mounting Future Analyses 618
18.1 Analyzing Numerical Variables 619 Describe the Characteristics of a Numerical Variable? 619 Reach Conclusions About the Population Mean or the Standard Deviation? 619 Determine Whether the Mean and/or Standard Deviation Differs Depending on the Group? 620 Determine Which Factors Affect the Value of a Variable? 620 Predict the Value of a Variable Based on the Values of Other Variables? 621 Classify or Associate Items? 621 Determine Whether the Values of a Variable Are Stable Over Time? 621
18.2 Analyzing Categorical Variables 621 Describe the Proportion of Items of Interest in Each Category? 621 Reach Conclusions About the Proportion of Items of Interest? 622 Determine Whether the Proportion of Items of Interest Differs Depending on the Group? 622 Predict the Proportion of Items of Interest Based on the Values of Other Variables? 622 Classify or Associate Items? 622 Determine Whether the Proportion of Items of Interest Is Stable Over Time? 622
USING STATISTICS: The Future to Be Visited 623 CHAPTER REVIEW PROBLEMS 623
19 Statistical Applications in Quality Management (online) 19-1

USING STATISTICS: Finding Quality at the Beachcomber 19-1
19.1 The Theory of Control Charts 19-2 The Causes of Variation 19-2
19.2 Control Chart for the Proportion: The p Chart 19-4
19.3 The Red Bead Experiment: Understanding Process Variability 19-10
19.4 Control Chart for an Area of Opportunity: The c Chart 19-12
19.5 Control Charts for the Range and the Mean 19-15 The R Chart 19-15 The X̄ Chart 19-18
19.6 Process Capability 19-21 Customer Satisfaction and Specification Limits 19-21 Capability Indices 19-22 CPL, CPU, and CPK 19-23
19.7 Total Quality Management 19-26
19.8 Six Sigma 19-27 The DMAIC Model 19-28 Roles in a Six Sigma Organization 19-29 Lean Six Sigma 19-29
USING STATISTICS: Finding Quality at the Beachcomber, Revisited 19-30 SUMMARY 19-30 REFERENCES 19-31 KEY EQUATIONS 19-31 KEY TERMS 19-32 CHAPTER REVIEW PROBLEMS 19-32

CASES FOR CHAPTER 19 19-35 The Harnswell Sewing Machine Company Case 19-35 PHASE 1 19-35 PHASE 2 19-35 PHASE 3 19-36 PHASE 4 19-36 PHASE 5 19-36 Managing Ashland MultiComm Services 19-37

CHAPTER 19 EXCEL GUIDE 19-38 EG19.2 Control Chart for the Proportion: The p Chart 19-38 EG19.4 Control Chart for an Area of Opportunity: The c Chart 19-39 EG19.5 Control Charts for the Range and the Mean 19-40 EG19.6 Process Capability 19-41

20 Decision Making (online) 20-1

USING STATISTICS: Reliable Decision Making 20-1
20.1 Payoff Tables and Decision Trees 20-2
20.2 Criteria for Decision Making 20-5 Maximax Payoff 20-7 Maximin Payoff 20-7 Expected Monetary Value 20-7 Expected Opportunity Loss 20-9 Return-to-Risk Ratio 20-11
20.3 Decision Making with Sample Information 20-16
20.4 Utility 20-21
CONSIDER THIS: Risky Business 20-22
USING STATISTICS: Reliable Decision Making, Revisited 20-22 SUMMARY 20-23 REFERENCES 20-23 KEY EQUATIONS 20-23 KEY TERMS 20-23 CHAPTER REVIEW PROBLEMS 20-23

CASES FOR CHAPTER 20 20-26 Digital Case 20-26

CHAPTER 20 EXCEL GUIDE 20-27 EG20.1 Payoff Tables and Decision Trees 20-27 EG20.2 Criteria for Decision Making 20-27

Appendices 625

A. Basic Math Concepts and Symbols 626
A.1 Operators 626
A.2 Rules for Arithmetic Operations 626
A.3 Rules for Algebra: Exponents and Square Roots 626
A.4 Rules for Logarithms 627
A.5 Summation Notation 628
A.6 Greek Alphabet 631

B. Important Software Skills and Concepts 632
B.1 Identifying the Software Version 632
B.2 Formulas 632
B.3 Excel Cell References 633
B.4 Excel Worksheet Formatting 635
B.5E Excel Chart Formatting 636
B.5T Tableau Chart Formatting 637
B.6 Creating Histograms for Discrete Probability Distributions (Excel) 638
B.7 Deleting the "Extra" Histogram Bar (Excel) 638

C. Online Resources 640
C.1 About the Online Resources for This Book 640
C.2 Data Files 640
C.3 Microsoft Excel Files Integrated With This Book 646
C.4 Supplemental Files 646

D. Configuring Software 647
D.1 Microsoft Excel Configuration 647
D.2 Supplemental Files 648

E. Tables 649
E.1 Table of Random Numbers 649
E.2 The Cumulative Standardized Normal Distribution 651
E.3 Critical Values of t 653
E.4 Critical Values of χ² 655
E.5 Critical Values of F 656
E.6 Lower and Upper Critical Values, T1, of the Wilcoxon Rank Sum Test 660
E.7 Critical Values of the Studentized Range, Q 661
E.8 Critical Values, dL and dU, of the Durbin-Watson Statistic, D (Critical Values Are One-Sided) 663
E.9 Control Chart Factors 664
E.10 The Standardized Normal Distribution 665

F. Useful Knowledge 666
F.1 Keyboard Shortcuts 666
F.2 Understanding the Nonstatistical Excel Functions 666

G. Software FAQs 668
G.1 Microsoft Excel FAQs 668
G.2 PHStat FAQs 668
G.3 Tableau FAQs 669

H. All About PHStat 670
H.1 What Is PHStat? 670
H.2 Obtaining and Setting Up PHStat 671
H.3 Using PHStat 671
H.4 PHStat Procedures, by Category 672

Self-Test Solutions and Answers to Selected Even-Numbered Problems 673
Index 713
Credits 720
As business statistics evolves and becomes an increasingly important part of one's business education, which topics get taught and how those topics are presented becomes all the more important. As authors, we think about these issues as we seek ways to continuously improve the quality of business statistics education. We actively participate in conferences and meetings sponsored by the Decision Sciences Institute, American Statistical Association (ASA), and INFORMS, the Institute for Operations Research and the Management Sciences. We use the ASA's Guidelines for Assessment and Instruction in Statistics Education (GAISE) reports and combine them with our experiences teaching business statistics to a diverse student body at several universities.

When writing a book for introductory business statistics students, four learning principles guide us.

Help students see the relevance of statistics to their own careers by using examples from the functional areas that may become their areas of specialization. Students need to learn statistics in the context of the functional areas of business. We discuss every statistical method using an example from a functional area, such as accounting, finance, management, or marketing, and explain the application of methods to specific business activities.

Emphasize interpretation and analysis of statistical results over calculation. We emphasize the interpretation of results, the evaluation of the assumptions, and the discussion of what should be done if the assumptions are violated. We believe that these activities are more important to students' futures and will serve them better than emphasizing tedious manual calculations.

Give students ample practice in understanding how to apply statistics to business. We believe that both classroom examples and homework exercises should involve actual or realistic data, using small and large sets of data, to the extent possible.

Integrate data analysis software with statistical learning.
We integrate Microsoft Excel into every statistics method that the book discusses in full. This integration illustrates how software can assist the business decision-making process. In this edition, we also integrate using Tableau into selected topics, where such integration makes best sense. (Integrating data analysis software also supports our second principle about emphasizing interpretation over calculation.)

When thinking about introductory business statistics students using data analysis software, three additional principles guide us.

Using software should model business best practices. We emphasize reusable templates and model solutions over building unaudited solutions from scratch that may contain errors. Using preconstructed and previously validated solutions not only models best practice but reflects regulatory requirements that businesses face today.

Provide detailed sets of instructions that accommodate various levels of software use and familiarity. Instruction sets should accommodate casual software users as well as users keen to use software to a deeper level. For most topics, we present PHStat and Workbook instructions, two different sets that create identical statistical results.

Software instruction sets should be complete and contain known starting points. Vague instructions that present statements such as "Use command X" and presume students can figure out how to "get to" command X are distracting to learning. We provide instruction sets that have a known starting point, typically in the form of "open to a specific worksheet in a specific workbook."
PREFACE
What's New in This Edition?

This ninth edition of Statistics for Managers Using Microsoft Excel features many passages rewritten in a more concise style that emphasizes definitions as the foundation for understanding statistical concepts. In addition to changes that readers of past editions have come to expect, such as new examples and Using Statistics case scenarios and an extensive number of new end-of-section and end-of-chapter problems, the edition debuts:
• Tabular Summaries that state hypothesis test and regression example results along with the conclusions that those results support now appear in Chapters 10 through 13.
• Tableau Guides that explain how to use the data visualization software Tableau Public as a complement to Microsoft Excel for visualizing data and regression analysis.
• A New Business Analytics Chapter (Chapter 17) that provides a complete introduction to the field of business analytics. The chapter defines terms and categories that introductory business statistics students may encounter in other courses or outside the classroom. This chapter benefits from the insights the authors have gained from teaching and lecturing about business analytics as well as research the authors have done for a forthcoming companion book on business analytics.
Continuing Features That Readers Have Come to Expect

This edition of Statistics for Managers Using Microsoft Excel continues to incorporate a number of distinctive features that have led to its wide adoption over the previous editions. Table 1 summarizes these carry-over features:

TABLE 1 Distinctive Features Continued in the Ninth Edition

Using Statistics case scenarios — Each chapter begins with a Using Statistics case scenario that presents a business problem or goal that illustrates the application of business statistics to provide actionable information. For many chapters, scenarios also provide the scaffolding for learning a series of related statistical methods. End-of-chapter "Revisited" sections reinforce the statistical learning of a chapter by discussing how the methods and techniques can be applied to the goal or problem that the case scenario considers. In this edition, seven chapters have new or revised case scenarios.

Emphasis on interpretation of the data analysis results — Statistics for Managers Using Microsoft Excel was among the first introductory business statistics textbooks to focus on the interpretation of Microsoft Excel statistical results. This tradition continues, now supplemented by Tableau (Public) results for selected methods in which Tableau can enhance or complement Excel results.

Software integration and flexibility — Software instructions feature chapter examples and were personally written by the authors, who collectively have more than one hundred years of experience teaching the application of business software. With modularized Workbook, PHStat, and, where applicable, Analysis ToolPak instructions, both instructors and students can switch among these instruction sets as they use this book with no loss of statistical learning.

Unique Excel workbooks — Statistics for Managers Using Microsoft Excel comes with Excel Guide workbooks that illustrate model solutions and provide template solutions for selected methods, and Visual Explorations, macro-enhanced workbooks that demonstrate selected basic concepts. This book is fully integrated with PHStat, the Pearson statistical add-in for Excel that places the focus on statistical learning and that the authors designed and developed. See Appendix H for a complete description of PHStat.

In-chapter and end-of-chapter reinforcements — Exhibits summarize key processes throughout the book. A key terms list provides an index to the definitions of the important vocabulary of a chapter. "Learning the Basics" questions test the basic concepts of a chapter. "Applying the Concepts" problems test the learner's ability to apply statistical methods to business problems. And, for the more mathematically minded, "Key Equations" lists the boxed numbered equations that appear in a chapter.

End-of-chapter cases — End-of-chapter cases include a case that continues through most chapters and several cases that reoccur throughout the book. "Digital Cases" require students to examine business documents and other information sources to sift through various claims and discover the data most relevant to a business case problem. Many of these cases also illustrate common misuses of statistical information. The Instructor's Solutions Manual provides instructional tips for using cases as well as solutions to the Digital Cases.

Answers to even-numbered problems — An appendix provides additional self-study opportunities by providing answers to the "Self-Test" problems and most of the even-numbered problems in this book.

Opportunities for additional learning — In-margin student tips and LearnMore references reinforce important points and direct students to additional learning resources. In-chapter Consider This essays reinforce important concepts, examine side issues, or answer questions that arise while studying business statistics, such as "What is so 'normal' about the normal distribution?"

Highly tailorable content — With an extensive library of separate online topics, sections, and even two full chapters, instructors can combine these materials and the opportunities for additional learning to meet their curricular needs.
Chapter-by-Chapter Changes Made for This Edition

Because the authors believe in continuous quality improvement, every chapter of Statistics for Managers Using Microsoft Excel contains changes to enhance, update, or just freshen this book. Table 2 provides a chapter-by-chapter summary of these changes.

TABLE 2 Chapter-by-Chapter Change Matrix (percentages show the proportion of problems changed; • marks a new or revised Using Statistics scenario)

First Things First (n.a.) — Updates opening section. Retitles, revises, and expands old Section FTF.4 as Section FTF.4 and new Section FTF.5 "Starting Point for Using Microsoft Excel." Expands the First Things First Excel Guide. Adds a First Things First Tableau Guide.

Chapter 1 (26%) — Old Sections 1.3 and 1.4 revised and expanded as new Section 1.4 "Data Preparation." Adds a Chapter 1 Tableau Guide.

Chapter 2 (57%) • — Uses new samples of 479 retirement funds and 100 restaurant meal costs for in-chapter examples. Includes new examples for organizing and visualizing categorical variables. Uses updated scatter plot and time-series plot examples. Adds new Section 2.8 "Filtering and Querying Data." Adds coverage of bubble charts, treemaps, and (Tableau) colored scatter plots. Revises and expands the Chapter 2 Excel Guide. Adds a Chapter 2 Tableau Guide.

Chapter 3 (52%) — Uses new samples of 479 retirement funds and 100 restaurant meal costs for in-chapter examples. Includes updated Dogs of the Dow and NBA team values data sets. Adds a Chapter 3 Tableau Guide.

Chapter 4 (41%) • — Uses the updated Using Statistics scenario for in-chapter examples.

Chapter 5 (32%) — Adds a new exhibit that summarizes the binomial distribution.

Chapter 6 (31%) — Uses new samples of 479 retirement funds for the normal probability plot example.

Chapter 7 (43%) • — Enhances selected figures for additional clarity.

Chapter 8 (36%) — Presents a more realistic chapter opening illustration. Revises Example 8.3 in Section 8.2 "Confidence Interval Estimate for the Mean (σ Unknown)."

Chapter 9 (29%) — Adds a new exhibit that summarizes fundamental hypothesis testing concepts. Revises the Section 9.2 "t Test of Hypothesis for the Mean (σ Unknown)" example such that the normality assumption is not violated. Revises examples in Section 9.3 "One-Tail Tests" and Section 9.4 "Z Test of Hypothesis for the Proportion."

Chapter 10 (37%) — Uses the new tabular test summaries for the two-sample t test, paired t test, and the Z test for the difference between two proportions. Includes new Section 10.1 passage "Evaluating the Normality Assumption." Uses new (market basket) data for the paired t test example. Enhances selected figures for additional clarity. Contains general writing improvements throughout the chapter.

Chapter 11 (17%) — Presents an updated chapter opening illustration. Uses revised data for the Using Statistics scenario for in-chapter examples. Uses the new tabular test summaries for the one-way ANOVA results. Presents discussion of the Levene test for the homogeneity of variance before the Tukey-Kramer multiple comparisons procedure. Revises Example 11.1 in Section 11.1 "One-Way ANOVA."

Chapter 12 (37%) — Uses the new tabular test summaries for the chi-square tests, Wilcoxon rank sum test, and Kruskal-Wallis rank test. Changes categories in the test of independence example. Uses revised data for the Section 12.4 and 12.5 examples for the Wilcoxon and Kruskal-Wallis tests. Contains general writing improvements throughout the chapter.

Chapter 13 (46%) — Reorganizes presentation of basic regression concepts with a new "Preliminary Analysis" passage and a revised Section 13.1 retitled as "Simple Linear Regression Models." Revises the exhibit in Section 13.9 that summarizes avoiding potential regression pitfalls. Presents an updated chapter opening illustration. Enhances selected figures for additional clarity. Adds a Chapter 13 Tableau Guide.

Chapter 14 (29%) — Retitles Section 14.2 as "Evaluating Multiple Regression Models." Uses tables to summarize the net effects in multiple regression and residual analysis. Uses the new tabular test summaries for the overall F test, the t test for the slope, and logistic regression. Adds the new "What Is Not Normal? (Using a Categorical Dependent Variable)" Consider This feature. Adds a new section on cross-validation. Enhances selected figures for additional clarity. Contains general writing improvements throughout the chapter.

Chapter 15 (36%) — Uses the new tabular test summaries for the quadratic regression results. Revises Section 15.2 "Using Transformations in Regression Models." Replaces Example 15.2 in Section 15.2 with a new sales analysis business example. Reorganizes Section 15.4 "Model Building." Updates the Section 15.4 exhibit concerning steps for successful model building. Contains general writing improvements throughout the chapter.

Chapter 16 (67%) — Combines old Sections 16.1 and 16.2 into a revised Section 16.1 "Time-Series Component Factors." Section 16.1 presents a new illustration of time-series components. Uses updated movie attendance time-series data for the Section 16.2 example. Uses new annual time-series revenue data for Alphabet Inc. for the Sections 16.3 and 16.5 examples. Uses updated annual time-series revenue data for the Coca-Cola Company in Section 16.4. Uses new quarterly time-series revenue data for Amazon.com, Inc. for the Section 16.6 example. Uses updated data for moving averages and exponential smoothing. Uses updated data for the online Section 16.7 "Index Numbers."

Chapter 17 (100%) • — Completely new "Business Analytics" chapter that expands and updates old Sections 17.3 through 17.5. Includes the new "What's My Major If I Want to Be a Data Miner?" Consider This feature.

Chapter 18 (55%) — Updates old Chapter 17 Sections 17.1 and 17.2 to form the new version of the "Getting Ready to Analyze Data in the Future" chapter.

Appendices (n.a.) — Adds new Tableau sections to Appendices B, D, and G. Adds new Appendix B section about using nonnumerical labels in time-series plots. Includes updated data files listing in Appendix C.
Serious About Writing Improvements

Ever read a textbook preface that claims writing improvements but offers no evidence? Among the writing improvements in this edition of Statistics for Managers Using Microsoft Excel, the authors have used tabular summaries to guide readers to reaching conclusions and making decisions based on statistical information. The authors believe that this writing improvement, which appears in Chapters 9 through 15, adds clarity to the purpose of the statistical method being discussed and better illustrates the role of statistics in business decision-making processes.

For example, consider the following sample passage from Example 10.1 in Chapter 10 that illustrates the use of the new tabular summaries. Previously, part of the Example 10.1 solution was presented as:

You do not reject the null hypothesis because tSTAT = -1.6341 > -1.7341. The p-value (as computed in Figure 10.5) is 0.0598. This p-value indicates that the probability that tSTAT < -1.6341 is equal to 0.0598. In other words, if the population means are equal, the probability that the sample mean delivery time for the local pizza restaurant is at least 2.18 minutes faster than the national chain is 0.0598. Because the p-value is greater than α = 0.05, there is insufficient evidence to reject the null hypothesis. Based on these results, there is insufficient evidence for the local pizza restaurant to make the advertising claim that it has a faster delivery time.

In this edition, we present the equivalent solution (on page 316):

Table 10.4 summarizes the results of the pooled-variance t test for the pizza delivery data using the calculation above (not shown in this sample) and the Figure 10.5 results. Based on the conclusions, the local branch of the national chain and a local pizza restaurant have similar delivery times. Therefore, as part of the last step of the DCOVA framework, you and your friends exclude delivery time as a decision criterion when choosing from which store to order pizza.

TABLE 10.4 Pooled-variance t test summary for the delivery times for the two pizza restaurants

Result: The tSTAT = -1.6341 is greater than -1.7341. The t test p-value = 0.0598 is greater than the level of significance, α = 0.05.

Conclusions: 1. Do not reject the null hypothesis H0. 2. Conclude that insufficient evidence exists that the mean delivery time is lower for the local restaurant than for the branch of the national chain. 3. There is a probability of 0.0598 that tSTAT < -1.6341.
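For readers curious about the arithmetic that produces a tSTAT value like the one in the table above, here is a minimal sketch of the pooled-variance t statistic in Python. The delivery-time values below are made-up illustrative numbers, not the book's actual data set; in the book itself this computation is performed with Excel or PHStat rather than by hand.

```python
import math

def pooled_variance_t(sample1, sample2):
    """Pooled-variance t statistic and degrees of freedom for two
    independent samples (assumes normal populations with equal variances,
    as the pooled-variance t test requires)."""
    n1, n2 = len(sample1), len(sample2)
    mean1, mean2 = sum(sample1) / n1, sum(sample2) / n2
    # Sample variances, each with n - 1 in the denominator
    var1 = sum((x - mean1) ** 2 for x in sample1) / (n1 - 1)
    var2 = sum((x - mean2) ** 2 for x in sample2) / (n2 - 1)
    # Pool the two variances, weighting each by its degrees of freedom
    sp2 = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    t_stat = (mean1 - mean2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return t_stat, n1 + n2 - 2

# Hypothetical delivery times in minutes (illustrative values only)
local = [32.5, 34.1, 30.8, 33.2, 31.9]
chain = [34.8, 36.2, 33.9, 35.5, 34.4]
t_stat, df = pooled_variance_t(local, chain)
print(f"tSTAT = {t_stat:.4f} with {df} degrees of freedom")
# Decision rule for a lower-tail test: reject H0 at level alpha if
# t_stat is less than -t(alpha, df), the critical value from Table E.3;
# equivalently, if the p-value (e.g., from Excel's T.DIST) is below alpha.
```

The decision rule mirrors the tabular summary: compare tSTAT against the critical value from the t table (Appendix E.3) or compare the p-value against the level of significance.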
A Note of Thanks Creating a new edition of a textbook and its ancillaries requires a team effort with our publisher and the authors are fortunate to work with Suzanna Bainbridge, Karen Montgomery, Rachel Reeve, Alicia Wilson, Bob Carroll, Jean Choe, and Aimee Thorne of Pearson Education. Readers who note the quality of the illustrations and software screen shots that appear throughout the book should know that the authors are indebted to Joe Vetere of Pearson Education for his expertise. We also thank Sharon Cahill of SPi Global US, Inc., and Julie Kidd for their efforts to maintain the highest standard of book production quality (and for being gracious in handling the mistakes we make). We thank Gail Illich of McLennan Community College for preparing the instructor resources and the solutions manual for this book and for helping refine problem questions throughout it. And we thank our colleagues in the Data, Analytics, and Statistics Instruction Specific Interest Group (DASI SIG) of the Decision Sciences Institute for their ongoing encouragement and exchange of ideas that assist us to continuously improve our content. We strive to publish books without any errors and appreciate the efforts of Jennifer Blue, James Lapp, Sheryl Nelson, and Scott Bennett to copy edit, accuracy check, and proofread our book and to get us closer to our goal. We thank the RAND Corporation and the American Society for Testing and Materials for their kind permission to publish various tables in Appendix E, and to the American Statistical Association for its permission to publish diagrams from the American Statistician. Finally, we would like to thank our families for their patience, understanding, love, and assistance in making this book a reality.
Contact Us!

Please email us at [email protected] with your questions about the contents of this book. Please include the hashtag #SMUME9 in the subject line of your email. We always welcome suggestions you may have for a future edition of this book. And while we have strived to make this book as error-free as possible, we also appreciate those who share with us any issues or errors that they discover. If you need assistance using software, please contact your academic support person or Pearson Technical Support at support.pearson.com/getsupport/. They have the resources to resolve and walk you through a solution to many technical issues in a way we do not. As you use this book, be sure to make use of the "Resources for Success" that Pearson Education supplies for this book (described on the following pages). We also invite you to visit smume9.davidlevinestatistics.com, where we may post additional information, corrections, and updated or new content to support this book.

David M. Levine
David F. Stephan
Kathryn A. Szabat
Get the most out of Pearson MyLab

MyLab Statistics
MyLab Statistics Online Course for Statistics for Managers Using Microsoft® Excel®, 9th Edition by Levine, Stephan, Szabat (access code required)

MyLab Statistics is the teaching and learning platform that empowers instructors to reach every student. By combining trusted author content with digital tools and a flexible platform, MyLab Statistics personalizes the learning experience and improves results for each student. MyLab makes learning and using a variety of statistical programs as seamless and intuitive as possible. Download the data files that this book uses (see Appendix C) in Microsoft Excel. Download supplemental files that support in-book cases or extend learning.

- Download the Excel Data Workbooks that contain the data used in chapter examples or named in problems and end-of-chapter cases.
- Download the Excel Guide Workbooks that contain the model templates and solutions for statistical methods discussed in the textbook.
- Download the JMP Data Tables and Projects that contain the data used in chapter examples or named in problems and end-of-chapter cases.
- Download the Minitab Worksheets and Projects that contain the data used in chapter examples or named in problems and end-of-chapter cases.
- Download the PHStat readme pdf that explains the technical requirements and getting started instructions for using this Microsoft Excel add-in. To download PHStat, visit the PHStat download page. (Download requires an access code as explained on that page.)
- Download the Explorations Workbooks that interactively demonstrate various key statistical concepts.
Instructional Videos
Access instructional support videos, including Pearson's Business Insight and StatTalk videos, available with assessment questions. Reference technology study cards and instructional videos for Microsoft Excel.
Diverse Question Libraries
Build homework assignments, quizzes, and tests to support your course learning outcomes. From Getting Ready (GR) questions to the Conceptual Question Library (CQL), we have your assessment needs covered from the mechanics to the critical understanding of statistics. The exercise libraries include technology-led instruction, including new Microsoft Excel-based exercises, and learning aids.
PHStat inserts two new worksheets, each of which contains a frequency distribution and a histogram. To relocate the histograms to their own chart sheets, use the instructions in Appendix Section B.5.

Because you cannot define an explicit lower boundary for the first bin, there can be no midpoint defined for that bin. Therefore, the Midpoints Cell Range you enter must have one fewer cell than the Bins Cell Range. PHStat uses the first midpoint for the second bin and uses "--" as the label for the first bin. The example uses the workaround that "Classes and Excel Bins" discusses on page 46. When you use this workaround, the histogram bar labeled "--" will always be a zero bar. Appendix Section B.7 explains how you can delete this unnecessary bar from the histogram, as was done for the Section 2.4 examples.

Workbook: Use the Histogram workbook as a model. For the example, first construct frequency distributions for the growth and value funds. Open to the Unstacked worksheet of the Retirement Funds workbook. This worksheet contains the retirement funds data unstacked in columns A and B and a set of bin numbers and midpoints appropriate for those variables in columns D and E. Click the insert worksheet icon to insert a new worksheet. In the new worksheet:
1. Enter a title in cell A1, Bins in cell A3, Frequency in cell B3, and Midpoints in cell C3.
2. Copy the bin number list in the cell range D2:D11 of the Unstacked worksheet and paste this list into cell A4 of the new worksheet.
3. Enter '-- in cell C4. Copy the midpoints list in the cell range E2:E10 of the Unstacked worksheet and paste this list into cell C5 of the new worksheet.
4. Select the cell range B4:B13 that will hold the array formula.
5. Type, but do not press the Enter or Tab key, the formula =FREQUENCY(Unstacked!$A$2:$A$307, $A$4:$A$13). Then, while holding down the Ctrl and Shift keys, press the Enter key to enter the array formula into the cell range B4:B13.
6. Adjust the worksheet formatting as necessary.
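Outside Excel, the tally that the FREQUENCY array formula performs — counting each value into the bin whose boundary is the first one greater than or equal to it — can be sketched in plain Python. The data and bin boundaries below are illustrative stand-ins, not the workbook's actual retirement-fund values:

```python
from bisect import bisect_left

def excel_frequency(data, bins):
    """Mimic Excel's FREQUENCY: counts[i] tallies values v with
    bins[i-1] < v <= bins[i]; values above the last bin are ignored here."""
    counts = [0] * len(bins)
    for v in data:
        i = bisect_left(bins, v)  # index of first bin boundary >= v
        if i < len(bins):
            counts[i] += 1
    return counts

# Hypothetical return percentages and bin upper boundaries
returns = [-8.2, -2.5, 1.4, 3.9, 4.1, 7.7, 12.3]
boundaries = [-5, 0, 5, 10, 15]
print(excel_frequency(returns, boundaries))  # [1, 1, 3, 1, 1]
```

Note that Excel's FREQUENCY also returns one extra element counting values above the last bin; this sketch simply drops them.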
Steps 1 through 6 construct a frequency distribution for the growth retirement funds. To construct a frequency distribution for the value retirement funds, insert another worksheet and repeat steps 1 through 6, entering =FREQUENCY(Unstacked!$B$2:$B$174, $A$4:$A$13) as the array formula in step 5.

Having constructed the two frequency distributions, continue by constructing the two histograms. Open to the worksheet that contains the frequency distribution for the growth funds and:
1. Select the cell range B3:B13 (the cell range of the frequencies).
2. Select Insert → Insert Column or Bar Chart (#1 in the labeled Charts group) and select the Clustered Column gallery item, the first selection in the 2-D Column row.
CHAPTER 2 | Organizing and Visualizing Variables
3. Right-click the chart and click Select Data in the shortcut menu. In the Select Data Source display:
4. Click Edit under the Horizontal (Categories) Axis Labels heading. In the Axis Labels display, drag the mouse to select and enter $C$3:$C$13 as the midpoints cell range and click OK. In Excel for Mac, in the Select Data Source display, click the icon inside the Horizontal (Category) axis labels box and drag the mouse to select and enter the midpoints cell range $C$3:$C$13.
5. Click OK.
Do not type the axis label cell range in step 4 for the reasons that Appendix Section B.3 explains. In the new chart:
6. Right-click inside a bar and click Format Data Series in the shortcut menu.
7. In the Format Data Series display, click Series Options. In the Series Options, click Series Options, enter 0 as the Gap Width and then close the display. (To see the second Series Options, you may have to first click the chart [third] icon near the top of the task pane.) In Excel for Mac, there is only one Series Options label, and the Gap Width setting is displayed without having to click Series Options. In older Excels, click Series Options in the left pane, and in the Series Options right pane, change the Gap Width slider to No Gap and then click Close.
8. Relocate the chart to a chart sheet, turn off the chart legend and gridlines, add axis titles, and modify the chart title by using the Appendix Section B.5 instructions.
This example uses the workaround that "Classes and Excel Bins" discusses on page 46. Appendix Section B.7 explains how the unnecessary histogram bar that this workaround creates, which will always be a zero bar, can be deleted, as was done for the Section 2.4 examples.

Analysis ToolPak: Use Histogram.
For the example, open to the Unstacked worksheet of the Retirement Funds workbook and:
1. Select Data → Data Analysis. In the Data Analysis dialog box, select Histogram from the Analysis Tools list and then click OK.
In the Histogram dialog box:
2. Enter A1:A307 as the Input Range and enter D1:D11 as the Bin Range.
3. Check Labels, click New Worksheet Ply, and check Chart Output.
4. Click OK to create the frequency distribution and histogram on a new worksheet.
In the new worksheet:
5. Follow steps 5 through 9 of the Analysis ToolPak instructions in "The Frequency Distribution" on pages 91 and 92.
These steps construct a frequency distribution and histogram for the growth funds. To construct a frequency distribution and histogram for the value funds, repeat the nine steps, but in step 2 enter B1:B174 as the Input Range. You will need to correct several formatting errors in the histograms that Excel constructs. For each histogram, first change the gap widths between bars to 0. Follow steps 6 and 7 of the Workbook instructions of this section, noting the special instructions that appear after step 8. Histogram bars are labeled by bin numbers. To change the labeling to midpoints, open to each of the new worksheets and:
1. Enter Midpoints in cell C3 and '-- in cell C4. Copy the cell range E2:E10 of the Unstacked worksheet and paste this list into cell C5 of the new worksheet.
2. Right-click the histogram and click Select Data.
In the Select Data Source display:
3. Click Edit under the Horizontal (Categories) Axis Labels heading. In the Axis Labels display, drag the mouse to select and enter the cell range $C$4:$C$13 and click OK. In Excel for Mac, in the Select Data Source display, click the icon inside the Horizontal (Category) axis labels box and drag the mouse to select and enter the cell range $C$4:$C$13.
4. Click OK.
5. Relocate the chart to a chart sheet, turn off the chart legend, and modify the chart title by using the instructions in Appendix Section B.5.
Do not type the axis label cell range in step 3 as you would otherwise do, for the reasons explained in Appendix Section B.3. This example uses the workaround that "Classes and Excel Bins" discusses on page 46. Appendix Section B.7 explains how the unnecessary histogram bar that this workaround creates can be deleted, as was done for the Section 2.4 examples.
The Percentage Polygon and the Cumulative Percentage Polygon (Ogive)
Key Technique: Modify an Excel line chart that is based on a frequency distribution.
Example: Construct percentage polygons and cumulative percentage polygons for the three-year return percentages for the growth and value retirement funds, similar to Figure 2.14 on page 61 and Figure 2.16 on page 62.

PHStat: Use Histogram & Polygons.
For the example, use the PHStat instructions for creating a histogram on page 95, but in step 3 of those instructions, also check Percentage Polygon and Cumulative Percentage Polygon (Ogive) before clicking OK.

Workbook: Use the Polygons workbook as a model.
For the example, open to the Unstacked worksheet of the Retirement Funds workbook and follow steps 1 through 6 of the Workbook "The Histogram" instructions on page 95 to construct a frequency distribution for the growth funds.
EXCEL Guide
Repeat the steps to construct a frequency distribution for the value funds, using the instructions that immediately follow step 6. Open to the worksheet that contains the growth funds frequency distribution and:
1. Select column C. Right-click and click Insert in the shortcut menu. Right-click and click Insert in the shortcut menu a second time. (The worksheet contains new, blank columns C and D, and the midpoints column is now column E.)
2. Enter Percentage in cell C3 and Cumulative Pctage. in cell D3.
3. Enter =B4/SUM($B$4:$B$13) in cell C4 and copy this formula down through row 13.
4. Enter =C4 in cell D4.
5. Enter =C5 + D4 in cell D5 and copy this formula down through row 13.
6. Select the cell range C4:D13, right-click, and click Format Cells in the shortcut menu.
7. In the Number tab of the Format Cells dialog box, click Percentage in the Category list and click OK.
Open to the worksheet that contains the value funds frequency distribution and repeat steps 1 through 7.
To construct the percentage polygons, open to the worksheet that contains the growth funds distribution and:
1. Select cell range C4:C13.
2. Select Insert → Line (#3 in the labeled Charts group), and select the Line with Markers gallery item in the 2-D Line group.
3. Right-click the chart and click Select Data in the shortcut menu. In the Select Data Source display:
4. Click Edit under the Legend Entries (Series) heading. In the Edit Series dialog box, enter the formula ="Growth Funds" as the Series name and click OK. In Excel for Mac, enter the formula as the Name. 5. Click Edit under the Horizontal (Categories) Axis Labels heading. In the Axis Labels display, drag the mouse to select and enter the cell range $E$4:$E$13 and click OK. In Excel for Mac, in the Select Data Source display, click the icon inside the Horizontal (Category) axis labels box and drag the mouse to select and enter the cell range $E$4:$E$13. 6. Click OK. 7. Relocate the chart to a chart sheet, turn off the chart gridlines, add axis titles, and modify the chart title by using the instructions in Appendix Section B.5. In the new chart sheet: 8. Right-click the chart and click Select Data in the shortcut menu. In the Select Data Source display:
9. Click Add under the Legend Entries (Series) heading. In the Edit Series dialog box, enter the formula ="Value Funds" as the Series name and press Tab. In Excel for Mac, click the "+" icon below the Legend entries (Series) list. Enter the formula ="Value Funds" as the Name.
10. With the placeholder value in Series values highlighted, click the sheet tab for the worksheet that contains the value funds distribution. In that worksheet, drag the mouse to select and enter the cell range $C$4:$C$13 and click OK. In Excel for Mac, click the icon in the Y values box. Click the sheet tab for the worksheet that contains the value funds distribution and, in that worksheet, drag the mouse to select and enter the cell range $C$4:$C$13.
11. Click Edit under the Horizontal (Categories) Axis Labels heading. In the Axis Labels display, drag the mouse to select and enter the cell range $E$4:$E$13 and click OK. In Excel for Mac, in the Select Data Source display, click the icon inside the Horizontal (Category) axis labels box and drag the mouse to select and enter the cell range $E$4:$E$13. Click OK.
Do not type the axis label cell range in steps 10 and 11 for the reasons explained in Appendix Section B.3.
To construct the cumulative percentage polygons, open to the worksheet that contains the growth funds distribution and repeat steps 1 through 12, but in step 1, select the cell range D4:D13; in step 5, drag the mouse to select and enter the cell range $A$4:$A$13; and in step 11, drag the mouse to select and enter the cell range $D$4:$D$13.
If the Y axis of the cumulative percentage polygon extends past 100%, right-click the axis and click Format Axis in the shortcut menu. In the Format Axis display, click Axis Options. In the Axis Options, enter 0 as the Minimum and then close the display. In Excels with two-panel dialog boxes, in the Axis Options right pane, click the first Fixed (for Minimum), enter 0 as the value, and then click Close.
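Steps 3 through 5 of the worksheet construction above amount to dividing each class frequency by the total and keeping a running sum. The same arithmetic can be sketched in Python, using hypothetical frequencies in place of the retirement-fund distribution:

```python
from itertools import accumulate

# Hypothetical class frequencies (stand-ins for a fund distribution)
freqs = [2, 5, 9, 4, 1]
total = sum(freqs)
pcts = [f / total * 100 for f in freqs]  # step 3: percentage of total per class
cum_pcts = list(accumulate(pcts))        # steps 4-5: running cumulative percentage
print([round(p, 1) for p in pcts])       # [9.5, 23.8, 42.9, 19.0, 4.8]
print([round(c, 1) for c in cum_pcts])   # [9.5, 33.3, 76.2, 95.2, 100.0]
```

The percentage list drives the percentage polygon, and the running-sum list drives the ogive, which always ends at 100%.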
EG2.5 VISUALIZING TWO NUMERICAL VARIABLES

The Scatter Plot
Key Technique: Use the Excel scatter chart.
Example: Construct a scatter plot of revenue and value for NBA teams, similar to Figure 2.17 on page 66.

PHStat: Use Scatter Plot. For the example, open to the DATA worksheet of the NBAValues workbook. Select PHStat → Descriptive Statistics → Scatter Plot. In the procedure's dialog box:
1. Enter D1:D31 as the Y Variable Cell Range.
2. Enter C1:C31 as the X Variable Cell Range.
3. Check First cells in each range contains label.
4. Enter a Title and click OK.
To add a superimposed line like the one shown in Figure 2.17, click the chart and use step 3 of the Workbook instructions.

Workbook: Use the Scatter Plot workbook as a model. For the example, open to the DATA worksheet of the NBAValues workbook and:
1. Select the cell range C1:D31.
2. Select Insert → Scatter (X, Y) or Bubble Chart (#6 in the labeled Charts group) and select the Scatter gallery item. Excel for Mac labels the #6 icon X Y (Scatter).
3. Select Design (or Chart Design) → Add Chart Element → Trendline → Linear.
4. Relocate the chart to a chart sheet, turn off the chart legend and gridlines, add axis titles, and modify the chart title by using the Appendix Section B.5 instructions.
When constructing Excel scatter charts with other variables, make sure that the X variable column precedes (is to the left of) the Y variable column. (If the worksheet is arranged Y then X, cut and paste so that the Y variable column appears to the right of the X variable column.)

The Time-Series Plot
Key Technique: Use the Excel scatter chart.
Example: Construct a time-series plot of movie revenue per year from 1995 to 2018, similar to Figure 2.18 on page 67.

Workbook: Use the Time Series workbook as a model. For the example, open to the DATA worksheet of the Movie Revenues workbook and:
1. Select the cell range A1:B25.
2. Select Insert → Scatter (X, Y) or Bubble Chart (#6 in the labeled Charts group) and select the Scatter with Straight Lines and Markers gallery item. Excel for Mac labels the #6 icon X Y (Scatter).
3. Relocate the chart to a chart sheet, turn off the chart legend and gridlines, add axis titles, and modify the chart title by using the Appendix Section B.5 instructions.
When constructing time-series charts with other variables, make sure that the X variable column precedes (is to the left of) the Y variable column. (If the worksheet is arranged Y then X, cut and paste so that the Y variable column appears to the right of the X variable column.)

EG2.6 ORGANIZING A MIX OF VARIABLES

Multidimensional Contingency Tables
Key Technique: Use the Excel PivotTable feature.
Example: Construct a PivotTable showing percentage of overall total for Fund Type, Risk Level, and Market Cap for the retirement funds sample, similar to the one shown at the right in Figure 2.19 on page 69.

Workbook: Use the MCT workbook as a model. For the example, open to the DATA worksheet of the Retirement Funds workbook and:
1. Select Insert → PivotTable.
In the Create PivotTable display:
2. Click Select a table or range and enter A1:N480 as the Table/Range.
3. Click New Worksheet and then click OK.
In the PivotTable Fields (PivotTable Builder or PivotTable Field List in older Excels) display:
4. Drag Fund Type in the Choose fields to add to report box and drop it in the Rows (or Row Labels) box.
5. Drag Market Cap in the Choose fields to add to report box and drop it in the Rows (or Row Labels) box.
6. Drag Risk Level in the Choose fields to add to report box and drop it in the Columns (or Column Labels) box.
7. Drag Fund Type in the Choose fields to add to report box a second time and drop it in the Σ Values box. The dropped label changes to Count of Fund Type.
8. Click (not right-click) the dropped label Count of Fund Type and then click Value Field Settings in the shortcut menu. In the Value Field Settings display, click the Show Values As tab and select % of Grand Total from the Show values as drop-down list. In Excel for Mac, click the "i" icon to the right of the dropped label Count of Fund Type. In the PivotTable Field dialog box, click the Show data as tab and select % of Grand Total (% of total in older Excels for Mac) from the drop-down list.
9. Click OK.
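Step 8's % of Grand Total view simply divides each cell's count by the total number of records. A minimal sketch of that computation, with hypothetical fund records rather than the actual retirement funds sample:

```python
from collections import Counter

# Hypothetical (fund type, risk level) records, not the actual sample
funds = [("Growth", "Low"), ("Growth", "High"), ("Value", "Low"),
         ("Growth", "Low"), ("Value", "Average")]
counts = Counter(funds)
total = len(funds)
# "% of Grand Total": each cell count divided by the overall record count
pct_of_grand_total = {cell: n / total * 100 for cell, n in counts.items()}
print(pct_of_grand_total[("Growth", "Low")])  # 40.0
```

By construction, the cell percentages sum to 100% across the whole table, which is what makes this view useful for spotting which combinations dominate the sample.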
In the PivotTable:
10. Enter a title in cell A1.
11. Follow steps 6 through 10 of the Workbook (untallied data) "The Contingency Table" instructions on page 90 that relabel the rows and columns and rearrange the order of the risk category columns.

Adding a Numerical Variable
Key Technique: Alter the contents of the Σ Values box in the PivotTable Field List pane.
Example: Construct the Figure 2.20 PivotTable of Fund Type, Risk Level, and Market Cap on page 69 that shows the mean ten-year return percentage for the retirement funds sample.

Workbook: Use the MCT workbook as a model. For the example, first construct the PivotTable showing percentage of overall total for Fund Type, Risk Level, and Market Cap for the retirement funds sample, using the 11-step "Multidimensional Contingency Tables" Workbook instructions that start on page 98. Then continue with these steps:
12. If the PivotTable Fields pane is not visible, right-click cell A3 and click Show Field List in the shortcut menu. In older Excels for Mac, if the PivotTable Builder (or Field List) display is not visible, select PivotTable Analyze → Field List. In the display:
13. Drag the blank label (changed from Count of Fund Type in a prior step) in the Σ Values box and drop it outside the display to delete it. In the PivotTable, all of the percentages disappear.
14. Drag 10YrReturn in the Choose fields to add to report box and drop it in the Σ Values box. The dropped label changes to Sum of 10YrReturn.
15. Click (not right-click) Sum of 10YrReturn and then click Value Field Settings in the shortcut menu. In the Value Field Settings display, click the Summarize Values By tab and select Average from the Summarize value field by drop-down list. In Excel for Mac, click the "i" icon to the right of the label Sum of 10YrReturn. In the PivotTable Field dialog box, click the Summarize by tab and select Average from the list.
16. Click OK. The label in the Σ Values box changes to Average of 10YrReturn.
In the PivotTable:
17. Select cell range B5:E13, right-click, and click Format Cells in the shortcut menu. In the Number tab of the Format Cells dialog box, click Number, set the Decimal places to 2, and click OK.

EG2.7 VISUALIZING A MIX OF VARIABLES

PivotChart
Key Technique: Use the PivotChart feature with a previously constructed PivotTable. (The PivotChart feature is not included in older Excels for Mac.)
Example: Construct the PivotChart based on the Figure 2.20 PivotTable of type, risk, and market cap showing mean ten-year return percentage, shown in Figure 2.23 on page 71.

Workbook: Use the MCT workbook as a model. For the example, open to the MCT worksheet of the MCT workbook and:
1. Select cell A3 (or any other cell inside the PivotTable).
2. Select Insert → PivotChart.
3. In the Insert Chart display, click Bar in the All Charts tab, select the Clustered Bar gallery item, and click OK. In Excel for Mac, with the newly inserted chart still selected, select Insert Column and select the Clustered Bar gallery item in the 2-D Bar group.
4. Relocate the chart to a chart sheet, turn off the gridlines, and add chart and axis titles by using the instructions in Appendix Section B.5.
In the PivotTable, collapse the Growth and Value categories, hiding the Market Cap categories. Note that the contents of the PivotChart change to reflect changes made to the PivotTable.
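Steps 13 through 16 of the "Adding a Numerical Variable" instructions replace the cell counts with per-group means of a numerical variable, which is ordinary grouped averaging. A sketch with hypothetical records (the groups and return values below are placeholders, not the actual sample):

```python
from collections import defaultdict

# Hypothetical (fund type, ten-year return %) records
records = [("Growth", 12.0), ("Growth", 10.0), ("Value", 8.0), ("Value", 6.0)]
totals = defaultdict(lambda: [0.0, 0])  # group -> [running sum, count]
for fund_type, ret in records:
    totals[fund_type][0] += ret
    totals[fund_type][1] += 1
# "Average of ..." summarization: sum divided by count within each group
means = {group: s / n for group, (s, n) in totals.items()}
print(means)  # {'Growth': 11.0, 'Value': 7.0}
```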
Treemap
Key Technique: Use the Excel treemap feature with a specially prepared tabular summary that includes columns that express hierarchical (tree) relationships. (The treemap feature is not included in older Excel versions.)
Example: Construct the Figure 2.24 treemap on page 72 that summarizes the sample of 479 retirement funds by Fund Type and Market Cap.

Workbook: Use Treemap. For the example, open to the StackedSummary worksheet of the Retirement Funds workbook. This worksheet contains sorted fund type categories in column A, market cap categories in column B, and frequencies in column C. Select the cell range A1:C7 and:
1. Select Insert → Insert Hierarchy Chart (#4 in the labeled Charts group) and select the Treemap gallery item.
2. Click the chart title and enter a new title for the chart.
3. Click one of the tile labels and increase the point size to improve readability. (This change affects all treemap labels.)
4. Select Design (or Chart Design) → Quick Layout and select the Layout 2 gallery item (which adds the "branch" data labels).
5. Select Design (or Chart Design) → Add Chart Element → Legend → Bottom.
6. Right-click in the whitespace near the title and select Move Chart.
7. In the Move Chart dialog box, click New Sheet and click OK.

PHStat: Use Treemap. As an example, open to the StackedSummary worksheet of the Retirement Funds workbook. This worksheet contains sorted fund type categories in column A, market cap categories in column B, and frequencies in column C. Select PHStat → Descriptive Statistics → Treemap. In the procedure's dialog box:
Sparklines
Key Technique: Use the sparklines feature.
Example: Construct the sparklines for movie revenues per month for the period 2005 to 2018 shown in Figure 2.25 on page 72.

Workbook: Use the Sparklines workbook as a model. For the example, open to the DATA worksheet of the Monthly Movie Revenues workbook and:
1. Select Insert → Line (in the Sparklines group).
2. In the Create Sparklines dialog box, enter B2:M13 as the Data Range and P2:P13 as the Location Range.
3. Click OK.
Sunburst Chart
Workbook: Use Sunburst. As an example, open to the StackedHierarchy worksheet of the Retirement Funds workbook. This worksheet contains sorted fund type categories in column A, sorted market cap categories in column B, sorted star rating categories in column C, and frequencies in column D. Select the cell range A1:D31 and:
1. Select Insert → Insert Hierarchy Chart (#4 in the labeled Charts group) and select the Sunburst gallery item.
2. Click the chart title and enter a new title for the chart.
3. Click one of the segments and increase the point size to improve readability. (This will change the point size of all labels.)
4. Right-click in the whitespace near the title and select Move Chart.
5. In the Move Chart dialog box, click New Sheet and click OK.
[Figure: Excel results for the variance and standard deviation of the number of calories in the sample of cereals. Column A lists the calories for the seven cereals: 80, 100, 100, 110, 130, 190, and 200. The variance, calculated directly using the VAR.S function in the formula =VAR.S(A2:A8), is 2,200. The standard deviation, calculated directly using the STDEV.S function in the formula =STDEV.S(A2:A8), is 46.90.]
Alternatively, using Equation (3.6) on page 115:

\[
S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}
    = \frac{(80 - 130)^2 + (100 - 130)^2 + \cdots + (200 - 130)^2}{7 - 1}
    = \frac{13{,}200}{6} = 2{,}200
\]
Using Equation (3.7) on page 115, the sample standard deviation, S, is

\[
S = \sqrt{S^2} = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}} = \sqrt{2{,}200} = 46.9042
\]
The standard deviation of 46.9042 indicates that the calories in the cereals are clustering within ±46.9042 around the mean of 130 (clustering between \(\bar{X} - 1S = 83.0958\) and \(\bar{X} + 1S = 176.9042\)). In fact, 57.1% (four out of seven) of the calories lie within this interval.
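The variance and standard deviation calculations above (like Excel's VAR.S and STDEV.S functions) use the n − 1 divisor, which Python's statistics module also uses. A quick check of the cereal figures:

```python
import statistics

calories = [80, 100, 100, 110, 130, 190, 200]  # calories for the seven cereals
mean = statistics.mean(calories)               # 130
s2 = statistics.variance(calories)             # sample variance (n - 1 divisor): 2,200
s = statistics.stdev(calories)                 # sample standard deviation: 46.9042...
# Count values within one standard deviation of the mean
within = [x for x in calories if mean - s <= x <= mean + s]
print(round(s, 4), len(within))  # 46.9042 4
```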
CHAPTER 3 | Numerical Descriptive Measures

The Coefficient of Variation
The coefficient of variation is equal to the standard deviation divided by the mean, multiplied by 100%. Unlike the measures of variation presented previously, the coefficient of variation (CV) measures the scatter in the data relative to the mean. The coefficient of variation is a relative measure of variation that is always expressed as a percentage rather than in terms of the units of the particular data. Equation (3.8) defines the coefficient of variation.
COEFFICIENT OF VARIATION
The coefficient of variation is equal to the standard deviation divided by the mean, multiplied by 100%.

\[
CV = \left(\frac{S}{\bar{X}}\right) 100\% \tag{3.8}
\]

where
S = sample standard deviation
\(\bar{X}\) = sample mean

student TIP: The coefficient of variation is always expressed as a percentage and not as units of a variable.
learnMORE The Sharpe ratio, another relative measure of variation, is often used in financial analysis. Read the SHORT TAKES for Chapter 3 to learn more about this ratio.
For the sample of 10 get-ready times, because \(\bar{X} = 39.6\) and \(S = 6.77\), the coefficient of variation is

\[
CV = \left(\frac{S}{\bar{X}}\right) 100\% = \left(\frac{6.77}{39.6}\right) 100\% = 17.10\%
\]

For the get-ready times, the standard deviation is 17.1% of the size of the mean. The coefficient of variation is especially useful when comparing two or more sets of data that are measured in different units, as Example 3.7 illustrates.

EXAMPLE 3.7 Comparing Two Coefficients of Variation When the Two Variables Have Different Units of Measurement
Which varies more from cereal to cereal-the number of calories or the amount of sugar (in grams)? SOLUTION Because calories and the amount of sugar have different units of measurement, you need to compare the relative variability in the two measurements. For calories, using the mean and variance that Examples 3.1 and 3.6 on pages 111 and 117 calculate, the coefficient of variation is
\[
CV_{\text{calories}} = \left(\frac{46.9042}{130}\right) 100\% = 36.08\%
\]

For the amount of sugar in grams, the values for the seven cereals are

6  2  4  4  4  11  10

For these data, \(\bar{X} = 5.8571\) and \(S = 3.3877\). Therefore, the coefficient of variation is

\[
CV_{\text{sugar}} = \left(\frac{3.3877}{5.8571}\right) 100\% = 57.84\%
\]

You conclude that relative to the mean, the amount of sugar is much more variable than the calories.
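Both coefficients of variation follow from the same one-line computation. A sketch using Python's statistics module (sample standard deviation, n − 1 divisor):

```python
import statistics

def cv(values):
    """Coefficient of variation: (S / mean) * 100%."""
    return statistics.stdev(values) / statistics.mean(values) * 100

calories = [80, 100, 100, 110, 130, 190, 200]  # calories for the seven cereals
sugar = [6, 2, 4, 4, 4, 11, 10]                # grams of sugar for the same cereals
print(round(cv(calories), 2))  # 36.08
print(round(cv(sugar), 2))     # 57.84
```

Because each CV is unitless, the two percentages can be compared directly even though calories and grams are measured on different scales.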
3.2 Measures of Variation and Shape

Z Scores
The Z score of a value is the difference between that value and the mean, divided by the standard deviation. A Z score of 0 indicates that the value is the same as the mean. A positive or negative Z score indicates whether the value is above or below the mean and by how many standard deviations. Z scores help identify outliers, the values that seem excessively different from most of the rest of the values (see Section 1.4). Values that are very different from the mean will have either very small (negative) Z scores or very large (positive) Z scores. As a general rule, a Z score that is less than −3.0 or greater than +3.0 indicates an outlier value.
Z SCORE

\[
Z = \frac{X - \bar{X}}{S} \tag{3.9}
\]
To further analyze the sample of 10 get-ready times, you can calculate the Z scores. Because the mean is 39.6 minutes, the standard deviation is 6.77 minutes, and the time to get ready on the first day is 39.0 minutes, the Z score for Day 1 using Equation (3.9) is

\[
Z = \frac{X - \bar{X}}{S} = \frac{39.0 - 39.6}{6.77} = -0.09
\]
The Z score of -0.09 for the first day indicates that the time to get ready on that day is very close to the mean. Figure 3.2 presents the Z scores for all 10 days.
FIGURE 3.2 Excel worksheet containing the Z scores for 10 get-ready times

Get-Ready Time: 39, 29, 43, 52, 39, 44, 40, 31, 44, 35
Z Score: −0.09, −1.57, 0.50, 1.83, −0.09, 0.65, 0.06, −1.27, 0.65, −0.68
In The Worksheet
Column B contains formulas that use the STANDARDIZE function to calculate the Z scores for the get-ready times. STANDARDIZE requires that the mean and standard deviation be known for the data set. The Z Scores and Variation worksheets of the Excel Guide Descriptive workbook illustrate how STANDARDIZE can use the mean and standard deviation that another worksheet calculates.
The largest Z score is 1.83 for Day 4, on which the time to get ready was 52 minutes. The lowest Z score is −1.57 for Day 2, on which the time to get ready was 29 minutes. Because none of the Z scores are less than −3.0 or greater than +3.0, you conclude that the get-ready times include no apparent outliers.
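The Z scores in Figure 3.2, and the ±3.0 outlier screen, can be reproduced with the same arithmetic as Excel's STANDARDIZE function:

```python
import statistics

times = [39, 29, 43, 52, 39, 44, 40, 31, 44, 35]  # get-ready times (minutes)
mean = statistics.mean(times)                      # 39.6
s = statistics.stdev(times)                        # about 6.77
z = [(x - mean) / s for x in times]                # same arithmetic as STANDARDIZE
print(round(z[0], 2))                              # -0.09 for Day 1
print(any(abs(v) > 3.0 for v in z))                # False: no apparent outliers
```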
EXAMPLE 3.8 Calculating the Z Scores of the Number of Calories in Cereals
A sample of seven breakfast cereals (stored in Cereals) includes nutritional data about the number of calories per serving (see Example 3.1 on page 111). Calculate the Z scores of the calories in breakfast cereals.
SOLUTION Figure 3.3 presents the Z scores of the calories for the cereals. The largest Z score is 1.49, for a cereal with 200 calories. The lowest Z score is −1.07, for a cereal with 80 calories. There are no apparent outliers in these data because none of the Z scores are less than −3.0 or greater than +3.0.
FIGURE 3.3 Excel worksheet containing the Z scores for the seven cereals

Calories: 80, 100, 100, 110, 130, 190, 200
Z Score: −1.07, −0.64, −0.64, −0.43, 0.00, 1.28, 1.49
Shape: Skewness

Mean < median: negative, or left-skewed distribution
Mean = median: symmetrical distribution (zero skewness)
Mean > median: positive, or right-skewed distribution
In a symmetrical distribution, the values below the mean are distributed in exactly the same way as the values above the mean, and the skewness is zero. In a skewed distribution, the data values below and above the mean are imbalanced, and the skewness is a nonzero value (less than zero for a left-skewed distribution, greater than zero for a right-skewed distribution). Figure 3.4 visualizes these possibilities.

FIGURE 3.4 The shapes of three data distributions
Panel A Negative, or left-skewed
Panel B Symmetrical
Panel C Positive, or right-skewed
Panel A displays a left-skewed distribution. In a left-skewed distribution, most of the values are in the upper portion of the distribution. Some extremely small values cause the long tail and distortion to the left and cause the mean to be less than the median. Because the skewness statistic for such a distribution will be less than zero, some use the term negative skew to describe this distribution.

Panel B displays a symmetrical distribution. In a symmetrical distribution, values are equally distributed in the upper and lower portions of the distribution. This equality causes the portion of the curve below the mean to be the mirror image of the portion of the curve above the mean and makes the mean equal to the median.

Panel C displays a right-skewed distribution. In a right-skewed distribution, most of the values are in the lower portion of the distribution. Some extremely large values cause the long tail and distortion to the right and cause the mean to be greater than the median. Because the skewness statistic for such a distribution will be greater than zero, some use the term positive skew to describe this distribution.
Shape: Kurtosis
1 Several different operational definitions exist for kurtosis. The definition here, used by Excel, is sometimes called excess kurtosis to distinguish it from other definitions. Read the SHORT TAKES for Chapter 3 to learn how Excel calculates kurtosis (and skewness).
Kurtosis measures the peakedness of the curve of the distribution, that is, how sharply the curve rises approaching the center of the distribution. Kurtosis compares the shape of the peak to the shape of the peak of a bell-shaped normal distribution (see Chapter 6), which, by definition, has a kurtosis of zero.1 A distribution that has a sharper-rising center peak than the peak of a normal distribution has positive kurtosis, a kurtosis value that is greater than zero, and is called leptokurtic. A distribution that has a slower-rising (flatter) center peak than the peak of a normal distribution has negative kurtosis, a kurtosis value that is less than zero, and is called platykurtic. A leptokurtic distribution has a higher concentration of values near the mean of the distribution compared to a normal distribution, while a platykurtic distribution has a lower concentration compared to a normal distribution.
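Excel's SKEW and KURT worksheet functions implement the adjusted formulas this chapter describes. As a rough sketch, not the text's own code, the same statistics can be computed directly; the function names below are made up for illustration.

```python
import math

def sample_sd(data):
    """Sample standard deviation (divisor n - 1)."""
    n = len(data)
    mean = sum(data) / n
    return math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))

def skewness(data):
    """Adjusted Fisher-Pearson skewness, the formula Excel's SKEW uses."""
    n = len(data)
    mean = sum(data) / n
    s = sample_sd(data)
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in data)

def kurtosis(data):
    """Excess kurtosis as in Excel's KURT: zero for a normal distribution."""
    n = len(data)
    mean = sum(data) / n
    s = sample_sd(data)
    fourth = sum(((x - mean) / s) ** 4 for x in data)
    return (n * (n + 1)) / ((n - 1) * (n - 2) * (n - 3)) * fourth \
        - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 2, 3, 10]
print(skewness(symmetric))     # approximately 0 (symmetrical)
print(skewness(right_skewed))  # positive (right-skewed)
print(kurtosis(symmetric))     # negative (platykurtic, flatter than normal)
```

A symmetrical data set returns a skewness near zero, a data set with one extremely large value returns a positive skewness, and a flat, spread-out data set returns a negative (platykurtic) excess kurtosis.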
3.2 Measures of Variation and Shape
In affecting the shape of the central peak, the relative concentration of values near the mean also affects the ends, or tails, of the curve of a distribution. A leptokurtic distribution has fatter tails, many more values in the tails, than a normal distribution has. When an analysis mistakenly assumes that a set of data forms a normal distribution, that analysis will underestimate the occurrence of extreme values if the data actually forms a leptokurtic distribution. Some suggest that such a mistake can explain the unanticipated reverses and collapses that led to the financial crisis of 2007-2008 (Taleb).
EXAMPLE 3.9 Computing Descriptive Statistics for Growth and Value Funds
In the More Descriptive Choices scenario, you are interested in comparing the past performance of the growth and value funds from a sample of 479 funds. One measure of past performance is the three-year return percentage variable. Compute descriptive statistics for the growth and value funds.

SOLUTION Figure 3.5 presents descriptive summary measures for the two types of funds. The results include the mean, median, mode, minimum, maximum, range, variance, standard deviation, coefficient of variation, skewness, kurtosis, count (the sample size), and standard error. The standard error (see Section 7.2) is the standard deviation divided by the square root of the sample size.

In examining the results, you see that there are some differences in the three-year return for the growth and value funds. The growth funds had a mean three-year return of 8.51 and a median return of 8.70. This compares to a mean of 6.84 and a median of 7.07 for the value funds. The medians indicate that half of the growth funds had three-year returns of 8.70 or better, and half of the value funds had three-year returns of 7.07 or better. You conclude that the value funds had a lower return than the growth funds.
FIGURE 3.5 Excel descriptive statistics results for the three-year return percentages for the growth and value funds
11. Back in the main Tableau window, select Analysis → Swap Rows and Columns. Boxplots change from a vertical to a horizontal layout. Moving the mouse pointer over each boxplot displays the values of the five-number summary of the boxplot. Tableau labels the first quartile, Q1, the Lower Hinge, and labels the third quartile, Q3, the Upper Hinge. (The lower and upper hinges are equivalent to these quartiles by the method this book uses to calculate Q1 and Q3.) To construct a boxplot for a numerical variable, the variable must be a measure and not a dimension, the reason for step 1.
CONTENTS
USING STATISTICS: Probable Outcomes at Fredco Warehouse Club
4.1 Basic Probability Concepts
4.2 Conditional Probability
4.3 Ethical Issues and Probability
4.4 Bayes' Theorem
CONSIDER THIS: Divine Providence and Spam
4.5 Counting Rules (online)
Probable Outcomes at Fredco Warehouse Club, Revisited
EXCEL GUIDE

OBJECTIVES
Understand basic probability concepts
Understand conditional probability

152
As a Fredco Warehouse Club electronics merchandise manager, you oversee the process that selects, purchases, and markets electronics items that the club sells, and you seek to satisfy customers' needs. Noting new trends in the television marketplace, you have worked with company marketers to conduct an intent-to-purchase study. The study asked the heads of 1,000 households about their intentions to purchase a large TV, a screen size of at least 65 inches, sometime during the next 12 months. Now 12 months later, you plan a follow-up survey with the same people to see if they did purchase a large TV in the intervening time. For households that made such a purchase, you would like to know if the TV purchased was HDR (high dynamic range) capable, whether they also purchased a streaming media player in the intervening time, and whether they were satisfied with their large TV purchase. You plan to use survey results to form a new marketing strategy that will enhance sales and better target those households likely to purchase multiple or more expensive products. What questions can you ask in this survey? How can you express the relationships among the various intent-to-purchase responses of individual households?
The principles of probability help bridge the worlds of descriptive statistics and inferential statistics. Probability principles are the foundation for the probability distribution, the concept of mathematical expectation, and the binomial and Poisson distributions. Applying probability to intent-to-purchase survey responses answers purchase behavior questions such as:

• What is the probability that a household is planning to purchase a large TV in the next year?
• What is the probability that a household will actually purchase a large TV?
• What is the probability that a household is planning to purchase a large TV and actually purchases the TV?
• Given that the household is planning to purchase a large TV, what is the probability that the purchase is made?
• Does knowledge of whether a household plans to purchase a large TV change the likelihood of predicting whether the household will purchase a large TV?
• What is the probability that a household that purchases a large TV will purchase an HDR-capable TV?
• What is the probability that a household that purchases a large TV with HDR will also purchase a streaming media player?
• What is the probability that a household that purchases a large TV will be satisfied with the purchase?

Answers to these questions will provide a basis to form a marketing strategy. One can consider whether to target households that have indicated an intent to purchase or to focus on selling TVs that have HDR or both. One can also explore whether households that purchase large TVs with HDR can be easily persuaded to also purchase streaming media players.
4.1 Basic Probability Concepts

In everyday usage, probability, according to the Oxford English Dictionary, indicates the extent to which something is likely to occur or exist but can also mean the most likely cause of something. If storm clouds form, the wind shifts, and the barometric pressure drops, the probability of rain coming soon increases (first meaning). If one observes people entering an office building with wet clothes or otherwise drenched, there is a strong probability that it is currently raining outside (second meaning).

In statistics, probability is a numerical value that expresses the ratio between the value sought and the set of all possible values that could occur. A six-sided die has faces for 1, 2, 3, 4, 5, and 6. Therefore, for one roll of a fair six-sided die, the set of all possible values is the values 1 through 6. If the value sought is "a value greater than 4," then the values 5 or 6 would be sought. One would say the probability of this event is 2 outcomes divided by 6 outcomes, or 1/3.

Consider tossing a fair coin heads or tails two times. What is the probability of tossing two tails? The set of possible values for tossing a fair coin twice is HH, TT, HT, and TH. Therefore, the probability of tossing two tails is 1/4 because only one value (TT) matches what is being sought, and the set of all possible values has four values.
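These ratio calculations can be checked by brute-force enumeration of the possible values. The Python sketch below is illustrative, not from the text; the helper prob simply counts matching outcomes over all outcomes.

```python
from fractions import Fraction
from itertools import product

# Set of possible values for tossing a fair coin twice: HH, HT, TH, TT
tosses = list(product("HT", repeat=2))

def prob(event, space):
    """Ratio of outcomes matching the event to all possible outcomes."""
    favorable = [o for o in space if event(o)]
    return Fraction(len(favorable), len(space))

print(prob(lambda o: o == ("T", "T"), tosses))   # 1/4, tossing two tails

# One roll of a fair six-sided die: P(value greater than 4)
die = list(range(1, 7))
print(prob(lambda v: v > 4, die))                # 1/3
```

Using Fraction keeps the results as exact ratios (1/4 and 1/3) rather than rounded decimals.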
Events and Sample Spaces

student TIP Events are represented by letters of the alphabet.
When discussing probability, one uses outcomes in place of values and calls the set of all possible outcomes the sample space. Events are subsets of the sample space, the set of all outcomes that produce a specific result. For tossing a fair coin twice, the event "toss at least one head" is the subset of outcomes HH, HT, and TH, and the event "toss two tails" is the subset TT. Both events are examples of a joint event, an event that has two or more characteristics. In contrast, a simple event has only one characteristic, an outcome that cannot be further subdivided. The event "rolling a value greater than 4" in the first example results in the subset of outcomes 5 and 6 and is an example of a simple event because "5" and "6" represent one characteristic and cannot be further divided.
154 CHAPTER 4 | Basic Probability
student TIP By definition, an event and its complement are always both mutually exclusive and collectively exhaustive.
student TIP A probability cannot be negative or greater than 1.
The complement of an event A, noted by the symbol A', is the subset of outcomes that are not part of the event. For tossing a fair coin twice, the complement of the event "toss at least one head" is the subset TT, while the complement of the event "toss two tails" is HH, HT, and TH.

A set of events are mutually exclusive if they cannot occur at the same time. The events "roll a value greater than 4" and "roll a value less than 3" are mutually exclusive when rolling one fair die. However, the events "roll a value greater than 4" and "roll a value greater than 5" are not because both share the outcome of rolling a 6.

A set of events are collectively exhaustive if one of the events must occur. For rolling a fair six-sided die, the events "roll a value of 3 or less" and "roll a value of 4 or more" are collectively exhaustive because these two subsets include all possible outcomes in the sample space. However, the set of events "roll a value of 3 or less" and "roll a value greater than 4" is not because this set does not include the outcome of rolling a 4. Not all sets of collectively exhaustive events are mutually exclusive. For rolling a fair six-sided die, the set of events "roll a value of 3 or less," "roll an even-numbered value," and "roll a value greater than 4" is collectively exhaustive but is not mutually exclusive as, for example, "a value of 3 or less" and "an even-numbered value" could both occur if a 2 is rolled.

Certain and impossible events represent special cases. A certain event is an event that is sure to occur, such as "roll a value greater than 0" for rolling one fair die. Because the subset of outcomes for a certain event is the entire set of outcomes in the sample space, a certain event has a probability of 1. An impossible event is an event that has no chance of occurring, such as "roll a value greater than 6" for rolling one fair die.
Because the subset of outcomes for an impossible event is empty-contains no outcomes-an impossible event has a probability of 0.
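These definitions map naturally onto set operations. The sketch below, with illustrative names not taken from the text, checks the mutually exclusive and collectively exhaustive examples for one roll of a fair die.

```python
# Sample space for one roll of a fair six-sided die
sample_space = set(range(1, 7))

greater_than_4 = {5, 6}
less_than_3 = {1, 2}
three_or_less = {1, 2, 3}
four_or_more = {4, 5, 6}

# Complement of an event: every outcome not in the event
complement_of_greater_than_4 = sample_space - greater_than_4   # {1, 2, 3, 4}

def mutually_exclusive(*events):
    """True if no outcome belongs to two of the events."""
    seen = set()
    for e in events:
        if seen & e:          # a shared outcome was found
            return False
        seen |= e
    return True

def collectively_exhaustive(*events):
    """True if the events together cover the entire sample space."""
    return set().union(*events) == sample_space

print(mutually_exclusive(greater_than_4, less_than_3))          # True
print(mutually_exclusive(greater_than_4, {6}))                  # False, both contain 6
print(collectively_exhaustive(three_or_less, four_or_more))     # True
print(collectively_exhaustive(three_or_less, greater_than_4))   # False, no 4
```

The last check fails precisely because, as in the text, the pair of events does not include the outcome of rolling a 4.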
Types of Probability

The concepts and vocabulary related to events and sample spaces are helpful to understanding how to calculate probabilities. Also affecting such calculations is the type of probability being used: a priori, empirical, or subjective.

In a priori probability, the probability of an occurrence is based on having prior knowledge of the outcomes that can occur. Consider a standard deck of cards that has 26 red cards and 26 black cards. The probability of selecting a black card is 26/52 = 0.50 because there are 26 black cards and 52 total cards. What does this probability mean? If each card is replaced after it is selected, this probability does not mean that one out of the next two cards selected will be black. One cannot say for certain what will happen on the next several selections. However, one can say that in the long run, if this selection process is continually repeated, the proportion of black cards selected will approach 0.50. Example 4.1 shows another example of computing an a priori probability.
EXAMPLE 4.1 Finding A Priori Probabilities
A standard six-sided die has six faces. Each face of the die contains either one, two, three, four, five, or six dots. If you roll a die, what is the probability that you will get a face with five dots?

SOLUTION Each face is equally likely to occur. Because there are six faces, the probability of getting a face with five dots is 1/6.
The preceding examples use the a priori probability approach because the number of ways the event occurs and the total number of possible outcomes are known from the composition of the deck of cards or the faces of the die. In the empirical probability approach, the probabilities are based on observed data, not on prior knowledge of how the outcomes can occur. Surveys are often used to generate empirical probabilities. Examples of this type of probability are the proportion of individuals in the Fredco Warehouse Club scenario who actually purchase a large TV, the proportion of registered voters who prefer a certain political candidate, and the proportion of students who have part-time jobs. For example, if one conducts a survey of students, and 60% state that they have part-time jobs, then there is a 0.60 probability that an individual student has a part-time job.
The third approach to probability, subjective probability, differs from the other two approaches because subjective probability differs from person to person. For example, the development team for a new product may assign a probability of 0.60 to the chance of success for the product, while the president of the company may be less optimistic and assign a probability of 0.30. The assignment of subjective probabilities to various outcomes is usually based on a combination of an individual's past experience, personal opinion, and analysis of a particular situation. Subjective probability is especially useful in making decisions in situations in which one cannot use a priori probability or empirical probability.
Summarizing Sample Spaces

Sample spaces can be presented in tabular form using contingency tables (see Section 2.1). Table 4.1 in Example 4.2 summarizes a sample space as a contingency table. When used for probability, each cell in a contingency table represents one joint event, analogous to the one joint response when these tables are used to summarize categorical variables. For example, 200 of the respondents correspond to the joint event "planned to purchase a large TV and subsequently did purchase the large TV."
EXAMPLE 4.2 Events and Sample Spaces

The Fredco Warehouse Club scenario on page 152 involves analyzing the results of an intent-to-purchase study. Table 4.1 presents the results of the sample of 1,000 households surveyed in terms of purchase behavior for large TVs.

TABLE 4.1 Purchase behavior for large TVs
                        ACTUALLY PURCHASED
PLANNED TO PURCHASE     Yes       No      Total
Yes                     200        50       250
No                      100       650       750
Total                   300       700     1,000
What is the sample space? Give examples of simple events and joint events.

SOLUTION The sample space consists of the 1,000 respondents. Simple events are "planned to purchase," "did not plan to purchase," "purchased," and "did not purchase." The complement of the event "planned to purchase" is "did not plan to purchase." The event "planned to purchase and actually purchased" is a joint event because in this joint event, the respondent must plan to purchase the TV and actually purchase it.
Simple Probability

Simple probability is the probability of occurrence of a simple event A, P(A), in which each outcome is equally likely to occur. Equation (4.1) defines the probability of occurrence for simple probability.
PROBABILITY OF OCCURRENCE

Probability of occurrence = X/T    (4.1)

where
X = number of outcomes in which the event occurs
T = total number of possible outcomes
Equation (4.1) represents what some people wrongly think is the probability of occurrence for all probability problems. (Not all probability problems can be solved by Equation (4.1), as later examples in this chapter illustrate.) In the Fredco Warehouse Club scenario, the collected survey data represent an example of empirical probability. Therefore, one can use Equation (4.1) to determine answers to questions that can be expressed as a simple probability. For example, one question asked respondents if they planned to purchase a large TV. Using the responses to this question, how can one determine the probability of selecting a household that planned to purchase a large TV? From the Table 4.1 contingency table, determine the value of X as 250, the total of the Planned-to-purchase Yes row, and determine the value of T as 1,000, the overall total of respondents located in the lower right corner cell of the table. Using Equation (4.1) and Table 4.1:
Probability of occurrence = X/T

P(Planned to purchase) = Number who planned to purchase / Total number of households
                       = 250/1,000 = 0.25

Thus, there is a 0.25 (or 25%) chance that a household planned to purchase a large TV. Example 4.3 illustrates another application of simple probability.
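The same simple probability can be computed directly from the Table 4.1 counts. The Python sketch below is illustrative; the dictionary layout is not from the text.

```python
# Table 4.1: (planned to purchase, actually purchased) -> number of households
table = {
    ("Yes", "Yes"): 200, ("Yes", "No"): 50,
    ("No", "Yes"): 100, ("No", "No"): 650,
}
T = sum(table.values())   # 1,000 households in all

# X: outcomes in which the event "planned to purchase" occurs (the Yes row total)
X = sum(count for (planned, _), count in table.items() if planned == "Yes")

p_planned = X / T
print(p_planned)   # 0.25
```

Summing the Yes row gives X = 250, and dividing by the grand total T = 1,000 reproduces the 0.25 result.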
EXAMPLE 4.3 Computing the Probability That the Large TV Purchased Is HDR-capable
TABLE 4.2 Purchase behavior about purchasing an HDR-capable television and a streaming media player
In another Fredco Warehouse Club follow-up survey, additional questions were asked of the 300 households that actually purchased large TVs. Table 4.2 indicates the consumers' responses to whether the TV purchased was HDR-capable and whether they also purchased a streaming media player in the past 12 months. Find the probability that if a household that purchased a large TV is randomly selected, the television purchased was HDR-capable.
                      STREAMING MEDIA PLAYER
HDR FEATURE           Yes       No      Total
HDR-capable            38       42         80
Not HDR-capable        70      150        220
Total                 108      192        300
SOLUTION Using the following definitions:

A  = purchased an HDR-capable TV
A' = purchased a television that is not HDR-capable
B  = purchased a streaming media player
B' = did not purchase a streaming media player

P(HDR-capable) = Number of HDR-capable TVs purchased / Total number of TVs
               = 80/300 = 0.267
There is a 26.7% chance that a randomly selected large TV purchased is HDR-capable.
Joint Probability

Whereas simple probability refers to the probability of occurrence of simple events, joint probability refers to the probability of an occurrence involving two or more events. An example of joint probability is the probability that one will get heads on the first toss of a coin and heads on the second toss of a coin. In Table 4.1 on page 155, the count of the group of individuals who planned to purchase and actually purchased a large TV corresponds to the cell that represents Planned to purchase Yes and Actually purchased Yes, the upper left numerical cell. Because this group consists of 200 households, the probability of picking a household that planned to purchase and actually purchased a large TV is

P(Planned to purchase and actually purchased)
  = Planned to purchase and actually purchased / Total number of respondents
  = 200/1,000 = 0.20
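In code, a joint probability is just one cell's count over the grand total. Continuing the illustrative Table 4.1 sketch (not the text's own code):

```python
# Table 4.1: (planned to purchase, actually purchased) -> number of households
table = {
    ("Yes", "Yes"): 200, ("Yes", "No"): 50,
    ("No", "Yes"): 100, ("No", "No"): 650,
}
total = sum(table.values())   # 1,000

# Joint event: planned to purchase AND actually purchased (a single cell)
p_joint = table[("Yes", "Yes")] / total
print(p_joint)   # 0.2
```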
Example 4.4 also demonstrates how to determine joint probability.
EXAMPLE 4.4 Determining the Joint Probability That a Household Purchased an HDR-capable Large TV and Purchased a Streaming Media Player
In Table 4.2 on page 156, the purchases are cross-classified as being HDR-capable or not and whether the household purchased a streaming media player. Find the probability that a randomly selected household that purchased a large TV also purchased an HDR-capable TV and purchased a streaming media player.

SOLUTION Using Equation (4.1) on page 155 and Table 4.2,

P(HDR-capable TV and streaming media player)
  = Number that purchased an HDR-capable TV and purchased a streaming media player / Total number of large TV purchasers
  = 38/300 = 0.127

Therefore, there is a 12.7% chance that a randomly selected household that purchased a large TV purchased an HDR-capable TV and purchased a streaming media player.
Marginal Probability

The marginal probability of an event consists of a set of joint probabilities. You can determine the marginal probability of a particular event by using the concept of joint probability just discussed. For example, if B consists of two events, B1 and B2, then P(A), the probability of event A, consists of the joint probability of event A occurring with event B1 and the joint probability of event A occurring with event B2. Use Equation (4.2) to calculate marginal probabilities.
student TIP Mutually exclusive events cannot occur simultaneously. In a collectively exhaustive set of events, one of the events must occur.
MARGINAL PROBABILITY

P(A) = P(A and B1) + P(A and B2) + ... + P(A and Bk)    (4.2)

where B1, B2, ..., Bk are k mutually exclusive and collectively exhaustive events
Using Equation (4.2) to calculate the marginal probability of "planned to purchase" a large TV gets the same result as adding the number of outcomes that make up the simple event "planned to purchase":

P(Planned to purchase) = P(Planned to purchase and purchased)
                       + P(Planned to purchase and did not purchase)
                       = 200/1,000 + 50/1,000 = 250/1,000 = 0.25
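Equation (4.2) can be verified against the Table 4.1 counts with a small sketch (illustrative Python, not the text's own code):

```python
# Table 4.1: (planned to purchase, actually purchased) -> number of households
table = {
    ("Yes", "Yes"): 200, ("Yes", "No"): 50,
    ("No", "Yes"): 100, ("No", "No"): 650,
}
total = sum(table.values())   # 1,000

# Equation (4.2): P(A) = sum of P(A and B_k) over the mutually
# exclusive and collectively exhaustive events B_1, ..., B_k
p_planned = sum(table[("Yes", b)] / total for b in ("Yes", "No"))
print(round(p_planned, 2))   # 0.25
```

Summing the two joint probabilities 200/1,000 and 50/1,000 recovers the simple probability 0.25 found earlier.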
General Addition Rule

student TIP The key word when using the addition rule is or.
The probability of event "A or B" considers the occurrence of either event A or event B or both A and B. For example, how can one determine the probability that a household planned to purchase or actually purchased a large TV? The event "planned to purchase or actually purchased" includes all households that planned to purchase and all households that actually purchased a large TV. Examine each cell of the Table 4.1 contingency table on page 155 to determine whether it is part of this event. From Table 4.1, the cell "planned to purchase and did not actually purchase" is part of the event because it includes respondents who planned to purchase. The cell "did not plan to purchase and actually purchased" is included because it contains respondents who actually purchased. Finally, the cell "planned to purchase and actually purchased" has both characteristics of interest. Therefore, one way to calculate the probability of "planned to purchase or actually purchased" is P(Planned to purchase or actually purchased)
P(Planned to purchase or actually purchased)
  = P(Planned to purchase and did not actually purchase)
  + P(Did not plan to purchase and actually purchased)
  + P(Planned to purchase and actually purchased)
  = 50/1,000 + 100/1,000 + 200/1,000
  = 350/1,000 = 0.35

Often, it is easier to determine P(A or B), the probability of the event A or B, by using the general addition rule, defined in Equation (4.3).
GENERAL ADDITION RULE

The probability of A or B is equal to the probability of A plus the probability of B minus the probability of A and B.

P(A or B) = P(A) + P(B) - P(A and B)    (4.3)
Applying Equation (4.3) to the previous example produces the following result:

P(Planned to purchase or actually purchased)
  = P(Planned to purchase) + P(Actually purchased)
  - P(Planned to purchase and actually purchased)
  = 250/1,000 + 300/1,000 - 200/1,000
  = 350/1,000 = 0.35
The general addition rule consists of taking the probability of A and adding it to the probability of B and then subtracting the probability of the joint event A and B from this total because the joint event has already been included in computing both the probability of A and the probability of B. For example, in Table 4.1, if the outcomes of the event "planned to purchase" are added to those of the event "actually purchased," the joint event "planned to purchase and actually purchased" has been included in each of these simple events. Therefore, because this joint event has been included twice, you must subtract it to compute the correct result. Example 4.5 illustrates another application of the general addition rule.
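This double-counting argument is easy to check numerically against the Table 4.1 counts. The sketch below is illustrative Python, not the text's own code:

```python
# Table 4.1: (planned to purchase, actually purchased) -> number of households
table = {
    ("Yes", "Yes"): 200, ("Yes", "No"): 50,
    ("No", "Yes"): 100, ("No", "No"): 650,
}
total = sum(table.values())   # 1,000

p_planned = (table[("Yes", "Yes")] + table[("Yes", "No")]) / total    # 250/1,000
p_purchased = (table[("Yes", "Yes")] + table[("No", "Yes")]) / total  # 300/1,000
p_both = table[("Yes", "Yes")] / total                                # 200/1,000

# Equation (4.3): subtract the joint event, which was counted once in
# p_planned and once again in p_purchased
p_either = p_planned + p_purchased - p_both
print(round(p_either, 2))   # 0.35
```

The result matches the direct cell-by-cell calculation, 350/1,000 = 0.35.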
EXAMPLE 4.5 Using the General Addition Rule for the Households That Purchased Large TVs
In Example 4.3 on page 156, the purchases were cross-classified in Table 4.2 as TVs that were HDR-capable or not and whether the household purchased a streaming media player. Find the probability that a household that purchased a large TV purchased an HDR-capable TV or purchased a streaming media player.

SOLUTION Using Equation (4.3):

P(HDR-capable TV or purchased a streaming media player)
  = P(HDR-capable TV) + P(purchased a streaming media player)
  - P(HDR-capable TV and purchased a streaming media player)
  = 80/300 + 108/300 - 38/300
  = 150/300 = 0.50

Therefore, of households that purchased a large TV, there is a 50% chance that a randomly selected household purchased an HDR-capable TV or purchased a streaming media player.
LEARNING THE BASICS
4.1 Three coins are tossed.
a. Give an example of a simple event.
b. Give an example of a joint event.
c. What is the complement of a head on the first toss?
d. What does the sample space consist of?

4.2 An urn contains 12 red balls and 8 white balls. One ball is to be selected from the urn.
a. Give an example of a simple event.
b. What is the complement of a red ball?
c. What does the sample space consist of?

4.3 Consider the following contingency table:

        B     B'
A      10     20
A'     20     40

What is the probability of event
a. A?
b. A'?
c. A and B?
d. A or B?

4.4 Consider the following contingency table:

        B     B'
A      10     30
A'     25     35

What is the probability of event
a. A'?
b. A and B?
c. A' and B'?
d. A' or B'?

APPLYING THE CONCEPTS
4.5 For each of the following, indicate whether the type of probability involved is an example of a priori probability, empirical probability, or subjective probability.
a. The next toss of a fair coin will land on heads.
b. Italy will win soccer's World Cup the next time the competition is held.
c. The sum of the faces of two dice will be seven.
d. The train taking a commuter to work will be more than 10 minutes late.
4.6 For each of the following, state whether the events created are mutually exclusive and whether they are collectively exhaustive. a. Undergraduate business students were asked whether they were sophomores or juniors. b. Each respondent was classified by the type of car he or she drives: sedan, SUV, American, European, Asian, or none. c. People were asked, "Do you currently live in (i) an apartment or (ii) a house?" d. A product was classified as defective or not defective.
4.7 Which of the following events occur with a probability of zero? For each, state why or why not.
a. A company is listed on the New York Stock Exchange and NASDAQ.
b. A consumer owns a smartphone and a tablet.
c. A cell phone is an Apple and a Samsung.
d. An automobile is a Toyota and was manufactured in the United States.
4.8 Financial experts say retirement accounts are one of the best tools for saving and investing for the future. Are millennials and Gen Xers financially prepared for the future? A survey of American adults conducted by INSIDER and Morning Consult revealed the following:
a. Give an example of a simple event. b. Give an example of a joint event. c. What is the complement of a marketer who plans to increase Linkedin organic social media postings? d. Why is a marketer who plans to increase Linkedln organic social media postings and is a B2C marketer a joint event?
HAS A RETIREMENT SAVINGS ACCOUNT

AGE GROUP        Yes    No
Millennials      579    628
Gen Xers         599    532

Source: Data extracted from "Millennials are delusional about the future, but they aren't the only ones," bit.ly/2KDeDK9.

a. Give an example of a simple event.
b. Give an example of a joint event.
c. What is the complement of "Has a retirement savings account"?
d. Why is "Millennial and has a retirement savings account" a joint event?

4.9 Referring to the contingency table in Problem 4.8, if an American adult is selected at random, what is the probability that
a. the American adult has a retirement savings account?
b. the American adult was a millennial who has a retirement savings account?
c. the American adult was a millennial or has a retirement savings account?
d. Explain the difference in the results in (b) and (c).

4.10 How will marketers change their organic (unpaid) social media postings in the near future? A survey by Social Media Examiner reported that 65% of B2B marketers (marketers who focus primarily on attracting businesses) plan to increase LinkedIn organic social media postings, as compared to 43% of B2C marketers (marketers who primarily target consumers). The survey was based on 1,947 B2B marketers and 3,779 B2C marketers. The following table summarizes the results.

                              BUSINESS FOCUS
INCREASE LINKEDIN POSTINGS?   B2B     B2C     Total
Yes                           1,266   1,625   2,891
No                              681   2,154   2,835
Total                         1,947   3,779   5,726

Source: Data extracted from "2018 Social Media Marketing Industry Report," socialmediaexaminer.com.

4.11 Referring to the contingency table in Problem 4.10, if a marketer is selected at random, what is the probability that
a. he or she plans to increase LinkedIn organic social media postings?
b. he or she is a B2C marketer?
c. he or she plans to increase LinkedIn organic social media postings or is a B2C marketer?
d. Explain the difference in the results in (b) and (c).

4.12 Is there full support for increased use of educational technologies in higher ed? As part of Inside Higher Ed's 2018 Survey of Faculty Attitudes on Technology, academic professionals, namely, full-time faculty members and administrators who oversee their institutions' online learning or instructional technology efforts (digital learning leaders), were asked that question. The following table summarizes their responses.

                        ACADEMIC PROFESSIONAL
FULL SUPPORT   Faculty Member   Digital Learning Leader   Total
Yes                  511                175                  686
No                 1,086                 31                1,117
Total              1,597                206                1,803

Source: Data extracted from "The 2018 Inside Higher Ed Survey of Faculty Attitudes on Technology," bit.ly/30JjUit.

If an academic professional is selected at random, what is the probability that he or she
a. fully supports increased use of educational technologies in higher ed?
b. is a digital learning leader?
c. fully supports increased use of educational technologies in higher ed or is a digital learning leader?
d. Explain the difference in the results in (b) and (c).
4.13 Are small and mid-sized business (SMB) owners concerned about limited access to money affecting their businesses? A sample of 150 female SMB owners and 146 male SMB owners revealed the following results:

                          SMB OWNER GENDER
CONCERNED ABOUT LIMITED
ACCESS TO MONEY?        Female   Male   Total
Yes                        38      20      58
No                        112     126     238
Total                     150     146     296

Source: Data extracted from "2019 Small Business Owner Report: The Male vs Female Divide," bit.ly/2WvByIN.

If a SMB owner is selected at random, what is the probability that the SMB owner
a. is concerned about limited access to money affecting the business?
b. is a female SMB owner and is concerned about limited access to money affecting the business?
c. is a female SMB owner or is concerned about limited access to money affecting the business?
d. Explain the difference in the results of (b) and (c).

4.14 Consumers are aware that companies share and sell their personal data in exchange for free services, but is it important to consumers to have a clear understanding of a company's privacy policy before signing up for its service online? According to an Axios-SurveyMonkey poll, 911 of 1,001 older adults (aged 65+) and 195 of 260 younger adults (aged 18-24) indicate that it is important to have a clear understanding of a company's privacy policy before signing up for its service online.

Source: Data extracted from "Privacy policies are read by an aging few," bit.ly/2NyPAWx.

Construct a contingency table to evaluate the probabilities. What is the probability that a respondent chosen at random
a. indicates it is important to have a clear understanding of a company's privacy policy before signing up for its service online?
b. is an older adult and indicates it is important to have a clear understanding of a company's privacy policy before signing up for its service online?
c. is an older adult or indicates it is important to have a clear understanding of a company's privacy policy before signing up for its service online?
d. is an older adult or a younger adult?

4.15 Each year, ratings are compiled concerning the performance of new cars during the first 90 days of use. Suppose that the cars have been categorized according to whether a car needs a warranty-related repair (yes or no) and the country in which the company manufacturing a car is based (United States or not United States). Based on the data collected, the probability that the new car needs a warranty repair is 0.04, the probability that the car was manufactured by a U.S.-based company is 0.60, and the probability that the new car needs a warranty repair and was manufactured by a U.S.-based company is 0.025.

Construct a contingency table to evaluate the probabilities of a warranty-related repair. What is the probability that a new car selected at random
a. needs a warranty repair?
b. needs a warranty repair and was manufactured by a U.S.-based company?
c. needs a warranty repair or was manufactured by a U.S.-based company?
d. needs a warranty repair or was not manufactured by a U.S.-based company?
4.2 Conditional Probability

Each Section 4.1 example involves finding the probability of an event when sampling from the entire sample space. How does one determine the probability of an event if one knows certain information about the events involved?
Calculating Conditional Probabilities

Conditional probability refers to the probability of event A, given information about the occurrence of another event, B.
CONDITIONAL PROBABILITY
The probability of A given B is equal to the probability of A and B divided by the probability of B.
P(A | B) = P(A and B) / P(B)    (4.4a)
The probability of B given A is equal to the probability of A and B divided by the probability of A.
P(B | A) = P(A and B) / P(A)    (4.4b)

where
P(A and B) = joint probability of A and B
P(A) = marginal probability of A
P(B) = marginal probability of B
CHAPTER 4 | Basic Probability
studentTIP
The variable that is given goes in the Equation (4.4) denominator. Because planned to purchase was the given event, planned to purchase goes in the denominator.
In the Fredco Warehouse Club scenario, suppose one had been told that a specific household planned to purchase a large TV. What then would be the probability that the household actually purchased the TV? In this example, the objective is to find P(Actually purchased | Planned to purchase), given the information that a household planned to purchase a large TV. Therefore, the sample space does not consist of all 1,000 households in the survey. It consists of only those households that planned to purchase the large TV. Of 250 such households, 200 actually purchased the large TV. Therefore, based on Table 4.1 on page 155, the probability that a household actually purchased the large TV given that they planned to purchase is

P(Actually purchased | Planned to purchase)
    = (Planned to purchase and actually purchased) / (Planned to purchase)
    = 200/250 = 0.80

Defining event A as Planned to purchase and event B as Actually purchased, Equation (4.4b) also calculates this result:
P(B | A) = P(A and B) / P(A)

P(Actually purchased | Planned to purchase) = (200/1,000) / (250/1,000) = 200/250 = 0.80
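To make Equation (4.4b) concrete, here is a minimal Python sketch (the variable names are ours, not the text's) that recovers the conditional probability from the Table 4.1 counts:

```python
# Conditional probability from the Table 4.1 counts (variable names are
# illustrative). A = planned to purchase, B = actually purchased.
total = 1000
planned = 250                 # marginal count for A
planned_and_purchased = 200   # joint count for A and B

p_a = planned / total                      # P(A) = 0.25
p_a_and_b = planned_and_purchased / total  # P(A and B) = 0.20

# Equation (4.4b): P(B | A) = P(A and B) / P(A)
p_b_given_a = p_a_and_b / p_a
print(p_b_given_a)  # 0.8
```

Dividing the joint probability by the marginal probability of the given event is all that Equation (4.4) requires.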
Example 4.6 further illustrates conditional probability.
EXAMPLE 4.6 Finding the Conditional Probability of Purchasing a Streaming Media Player
Table 4.2 on page 156 is a contingency table for whether a household purchased an HDR-capable TV and whether the household purchased a streaming media player. If a household purchased an HDR-capable TV, what is the probability that it also purchased a streaming media player?
SOLUTION Because you know that the household purchased an HDR-capable TV, the sample space is reduced to 80 households. Of these 80 households, 38 also purchased a streaming media player. Therefore, the probability that a household purchased a streaming media player, given that the household purchased an HDR-capable TV, is

P(Purchased streaming media player | Purchased HDR-capable TV)
    = (Number purchasing HDR-capable TV and streaming media player) / (Number purchasing HDR-capable TV)
    = 38/80 = 0.475
Using Equation (4.4b) on page 161 and the following definitions:

A = Purchased an HDR-capable TV
B = Purchased a streaming media player

then

P(B | A) = P(A and B) / P(A) = (38/300) / (80/300) = 0.475
Therefore, given that the household purchased an HDR-capable TV, there is a 47.5% chance that the household also purchased a streaming media player. You can compare this conditional probability to the marginal probability of purchasing a streaming media player, which is 108/300 = 0.36, or 36%. These results tell you that households that purchased HDR-capable TVs are more likely to purchase a streaming media player than are households that purchased large TVs that are not HDR-capable.
Decision Trees

In Table 4.1, households are classified according to whether they planned to purchase and whether they actually purchased large TVs. A decision tree is an alternative to the contingency table. Figure 4.1 represents the decision tree for this example.
FIGURE 4.1
Decision tree for planned to purchase and actually purchased

[Figure: from the entire set of households, the first branches split on planned to purchase, with P(A) = 250/1,000 and P(A') = 750/1,000. Each branch then splits on actually purchased: P(A and B) = 200/1,000, P(A and B') = 50/1,000, P(A' and B) = 100/1,000, and P(A' and B') = 650/1,000.]
In Figure 4.1, beginning at the left with the entire set of households, there are two "branches" for whether or not the household planned to purchase a large TV. Each of these branches has two subbranches, corresponding to whether the household actually purchased or did not actually purchase the large TV. The probabilities at the end of the initial branches represent the marginal probabilities of A and A'. The probabilities at the end of each of the four subbranches represent the joint probability for each combination of events A and B. You compute the conditional probability by dividing the joint probability by the appropriate marginal probability. For example, to compute the probability that the household actually purchased, given that the household planned to purchase the large TV, you take P(Planned to purchase and actually purchased) and divide by P(Planned to purchase). From Figure 4.1,
P(Actually purchased | Planned to purchase) = (200/1,000) / (250/1,000) = 200/250 = 0.80
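The decision-tree logic can be mirrored in a few lines of Python. This is an illustrative sketch (the dictionary layout and names are our own, not part of the text's presentation) using the four joint probabilities at the ends of the Figure 4.1 subbranches:

```python
# Joint probabilities from the Figure 4.1 decision tree, where
# A = planned to purchase and B = actually purchased.
joint = {
    ("A", "B"): 200 / 1000,
    ("A", "B'"): 50 / 1000,
    ("A'", "B"): 100 / 1000,
    ("A'", "B'"): 650 / 1000,
}

# Marginal probability of A: sum the joint probabilities on the A branch.
p_a = joint[("A", "B")] + joint[("A", "B'")]

# Conditional probability: divide the joint by the appropriate marginal.
p_b_given_a = joint[("A", "B")] / p_a
print(round(p_b_given_a, 2))  # 0.8
```

The same two steps, sum down a branch for a marginal, then divide a joint by that marginal, reproduce any conditional probability the tree can answer.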
Example 4.7 illustrates how to construct a decision tree.
EXAMPLE 4.7 Constructing the Decision Tree for the Households That Purchased Large TVs
Using the cross-classified data in Table 4.2 on page 156, construct the decision tree. Use the decision tree to find the probability that a household purchased a streaming media player, given that the household purchased an HDR-capable television.

SOLUTION The decision tree for purchased a streaming media player and an HDR-capable TV is displayed in Figure 4.2.
FIGURE 4.2
Decision tree for purchased an HDR-capable TV and a streaming media player

[Figure: from the entire set of households, the first branches split on purchased an HDR-capable TV, with P(A) = 80/300 and P(A') = 220/300. Each branch then splits on purchased a streaming media player: P(A and B) = 38/300, P(A and B') = 42/300, P(A' and B) = 70/300, and P(A' and B') = 150/300.]
Using Equation (4.4b) on page 161 and the following definitions:

A = Purchased an HDR-capable TV
B = Purchased a streaming media player

then

P(B | A) = P(A and B) / P(A) = (38/300) / (80/300) = 0.475
Independence

In the example concerning the purchase of large TVs, the conditional probability is 200/250 = 0.80 that the selected household actually purchased the large TV, given that the household planned to purchase. The simple probability of selecting a household that actually purchased is 300/1,000 = 0.30. This result shows that the prior knowledge that the household planned to purchase affected the probability that the household actually purchased the TV. In other words, the outcome of one event is dependent on the outcome of a second event. When the outcome of one event does not affect the probability of occurrence of another event, the events are said to be independent. Independence can be determined by using Equation (4.5).
INDEPENDENCE

Two events, A and B, are independent if and only if

P(A | B) = P(A)    (4.5)

where
P(A | B) = conditional probability of A given B
P(A) = marginal probability of A

Example 4.8 on the next page demonstrates the use of Equation (4.5).
EXAMPLE 4.8
Determining Independence
In the follow-up survey of the 300 households that actually purchased large TVs, the households were asked if they were satisfied with their purchases. Table 4.3 cross-classifies the responses to the satisfaction question with the responses to whether the TV was HDR-capable.
TABLE 4.3
Satisfaction with purchase of large TVs

                     SATISFIED WITH PURCHASE?
HDR FEATURE        Yes     No    Total
HDR-capable         64     16       80
Not HDR-capable    176     44      220
Total              240     60      300
Determine whether being satisfied with the purchase and the HDR feature of the TV purchased are independent.

SOLUTION For these data:
P(Satisfied | HDR-capable) = (64/300) / (80/300) = 64/80 = 0.80

which is equal to

P(Satisfied) = 240/300 = 0.80
Thus, being satisfied with the purchase and the HDR feature of the TV purchased are independent. Knowledge of one event does not affect the probability of the other event.
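The Equation (4.5) independence check can be verified numerically. Below is a hedged sketch using the Table 4.3 counts (the dictionary structure and variable names are our own):

```python
# Independence check for the Table 4.3 data: is
# P(Satisfied | HDR-capable) equal to P(Satisfied)?
table = {
    ("HDR-capable", "Satisfied"): 64,
    ("HDR-capable", "Not satisfied"): 16,
    ("Not HDR-capable", "Satisfied"): 176,
    ("Not HDR-capable", "Not satisfied"): 44,
}
n = sum(table.values())  # 300 households in the follow-up survey

p_satisfied = (table[("HDR-capable", "Satisfied")]
               + table[("Not HDR-capable", "Satisfied")]) / n   # 240/300
p_hdr = (table[("HDR-capable", "Satisfied")]
         + table[("HDR-capable", "Not satisfied")]) / n         # 80/300
p_satisfied_given_hdr = (table[("HDR-capable", "Satisfied")] / n) / p_hdr

# Events are independent if the conditional equals the marginal
# (allowing a tiny tolerance for floating-point arithmetic).
independent = abs(p_satisfied_given_hdr - p_satisfied) < 1e-9
print(independent)  # True
```

With real survey data the two probabilities rarely match exactly, so in practice a formal test (such as the chi-square test of independence covered later in the book) is used rather than an exact equality check.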
Multiplication Rules

The general multiplication rule is derived using Equation (4.4a) on page 161:

P(A | B) = P(A and B) / P(B)

and solving for the joint probability P(A and B).
GENERAL MULTIPLICATION RULE
The probability of A and B is equal to the probability of A given B times the probability of B.
P(A and B) = P(A | B)P(B)    (4.6)
Example 4.9 demonstrates the use of the general multiplication rule.
EXAMPLE 4.9 Using the General Multiplication Rule
Consider the 80 households that purchased HDR-capable TVs. In Table 4.3 on this page, you see that 64 households are satisfied with their purchase, and 16 households are dissatisfied. Suppose two households are randomly selected from the 80 households. Find the probability that both households are satisfied with their purchase. SOLUTION Here you can use the multiplication rule in the following way. If
A = second household selected is satisfied
B = first household selected is satisfied
then, using Equation (4.6),

P(A and B) = P(A | B)P(B)
The probability that the first household is satisfied with the purchase is 64/80. However, the probability that the second household is also satisfied with the purchase depends on the result of the first selection. If the first household is not returned to the sample after the satisfaction level is determined (i.e., sampling without replacement), the number of households remaining is 79. If the first household is satisfied, the probability that the second is also satisfied is 63/79 because 63 satisfied households remain in the sample. Therefore,

P(A and B) = (63/79)(64/80) = 0.6380
There is a 63.80% chance that both of the households sampled will be satisfied with their purchase.
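The without-replacement calculation in Example 4.9 can be checked with exact arithmetic. The sketch below uses Python's standard fractions module (the structure is our own choice, made to avoid rounding until the final step):

```python
from fractions import Fraction

# Two satisfied-household draws without replacement, per Equation (4.6).
# Counts come from Table 4.3: 64 satisfied among the 80 HDR-capable buyers.
satisfied, total = 64, 80

p_first = Fraction(satisfied, total)                        # P(B) = 64/80
p_second_given_first = Fraction(satisfied - 1, total - 1)   # P(A | B) = 63/79

# General multiplication rule: P(A and B) = P(A | B) P(B)
p_both = p_second_given_first * p_first
print(p_both, float(p_both))  # 252/395, approximately 0.6380
```

Keeping the intermediate values as exact fractions makes it easy to see why sampling with replacement would instead give (64/80)(64/80) = 0.64.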
The multiplication rule for independent events is derived by substituting P(A) for P(A | B) in Equation (4.6).
MULTIPLICATION RULE FOR INDEPENDENT EVENTS

If A and B are independent, the probability of A and B is equal to the probability of A times the probability of B.

P(A and B) = P(A)P(B)    (4.7)
If this rule holds for two events, A and B, then A and B are independent. Therefore, there are two ways to determine independence:

1. Events A and B are independent if, and only if, P(A | B) = P(A).
2. Events A and B are independent if, and only if, P(A and B) = P(A)P(B).
Marginal Probability Using the General Multiplication Rule

Section 4.1 defines the marginal probability using Equation (4.2). One can also state the equation for marginal probability by using the general multiplication rule. If

P(A) = P(A and B1) + P(A and B2) + ... + P(A and Bk)

then, using the general multiplication rule, Equation (4.8) defines the marginal probability.
MARGINAL PROBABILITY USING THE GENERAL MULTIPLICATION RULE

P(A) = P(A | B1)P(B1) + P(A | B2)P(B2) + ... + P(A | Bk)P(Bk)    (4.8)

where B1, B2, ..., Bk are k mutually exclusive and collectively exhaustive events.
To illustrate Equation (4.8), refer to Table 4.1 on page 155 and let:

P(A) = probability of planned to purchase
P(B1) = probability of actually purchased
P(B2) = probability of did not actually purchase

Then, using Equation (4.8), the probability of planned to purchase is

P(A) = P(A | B1)P(B1) + P(A | B2)P(B2)
     = (200/300)(300/1,000) + (50/700)(700/1,000)
     = 200/1,000 + 50/1,000
     = 250/1,000 = 0.25
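Equation (4.8) can be verified against the Table 4.1 values in a few lines. This is an illustrative sketch with our own variable names:

```python
# Equation (4.8) with the Table 4.1 values, where A = planned to purchase,
# B1 = actually purchased (300 of 1,000), B2 = did not (700 of 1,000).
p_b1, p_b2 = 300 / 1000, 700 / 1000
p_a_given_b1 = 200 / 300   # planned, among those who actually purchased
p_a_given_b2 = 50 / 700    # planned, among those who did not purchase

# P(A) = P(A | B1)P(B1) + P(A | B2)P(B2)
p_a = p_a_given_b1 * p_b1 + p_a_given_b2 * p_b2
print(round(p_a, 2))  # 0.25
```

Each product undoes the conditioning, so the terms collapse back to the joint probabilities 200/1,000 and 50/1,000 before they are summed.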
LEARNING THE BASICS

4.16 Consider the following contingency table:

       B    B'
A     10    20
A'    20    40

What is the probability of
a. A | B?
b. A | B'?
c. A' | B'?
d. Are events A and B independent?

4.17 Consider the following contingency table:

       A    A'
B     10    30
B'    25    35

What is the probability of
a. A | B?
b. A' | B'?
c. A | B'?
d. Are events A and B independent?

4.18 If P(A and B) = 0.4 and P(B) = 0.8, find P(A | B).

4.19 If P(A) = 0.7, P(B) = 0.6, and A and B are independent, find P(A and B).

4.20 If P(A) = 0.3, P(B) = 0.4, and P(A and B) = 0.2, are A and B independent?

APPLYING THE CONCEPTS

4.21 Financial experts say retirement accounts are one of the best tools for saving and investing for the future. Are millennials and Gen Xers financially prepared for the future? A survey of American adults conducted by INSIDER and Morning Consult revealed the following:

          HAS A RETIREMENT SAVINGS ACCOUNT
AGE GROUP        Yes    No
Millennials      579    628
Gen Xers         599    532

Source: Data extracted from "Millennials are delusional about the future, but they aren't the only ones," bit.ly/2KDeDK9.

a. Given that the American adult has a retirement savings account, what is the probability that the American adult is a millennial?
b. Given that the American adult is a millennial, what is the probability that the American adult has a retirement savings account?
c. Explain the difference in the results in (a) and (b).
d. Is having a retirement savings account and age group independent?

4.22 How will marketers change their organic (unpaid) social media postings in the near future? A survey by Social Media Examiner reported that 65% of B2B marketers (marketers who focus primarily on attracting businesses) plan to increase LinkedIn organic social media postings, as compared to 43% of B2C marketers (marketers who primarily target consumers). The survey was based on 1,947 B2B marketers and 3,779 B2C marketers. The following table summarizes the results.

                              BUSINESS FOCUS
INCREASE LINKEDIN POSTINGS?   B2B     B2C     Total
Yes                           1,266   1,625   2,891
No                              681   2,154   2,835
Total                         1,947   3,779   5,726

Source: Data extracted from "2018 Social Media Marketing Industry Report," socialmediaexaminer.com.

a. Suppose you know that the marketer is a B2B marketer. What is the probability that he or she plans to increase LinkedIn organic social media postings?
b. Suppose you know that the marketer is a B2C marketer. What is the probability that he or she plans to increase LinkedIn organic social media postings?
c. Are the two events, increase LinkedIn postings and business focus, independent? Explain.
4.23 Are small and mid-sized business (SMB) owners concerned about limited access to money affecting their businesses? A sample of 150 female SMB owners and 146 male SMB owners revealed the following results:

                          SMB OWNER GENDER
CONCERNED ABOUT LIMITED
ACCESS TO MONEY?        Female   Male   Total
Yes                        38      20      58
No                        112     126     238
Total                     150     146     296

Source: Data extracted from "2019 Small Business Owner Report: The Male vs Female Divide," bit.ly/2WvByIN.
a. If a SMB owner selected is female, what is the probability that she is concerned about limited access to money affecting the business?
b. If a SMB owner selected is male, what is the probability that he is concerned about limited access to money affecting the business?
c. Is concern about limited access to money independent of SMB owner gender?

4.24 Is there full support for increased use of educational technologies in higher ed? As part of Inside Higher Ed's 2018 Survey of Faculty Attitudes on Technology, academic professionals, namely, full-time faculty members and administrators who oversee their institutions' online learning or instructional technology efforts (digital learning leaders), were asked that question. The following table summarizes their responses:

                        ACADEMIC PROFESSIONAL
FULL SUPPORT   Faculty Member   Digital Learning Leader   Total
Yes                  511                175                  686
No                 1,086                 31                1,117
Total              1,597                206                1,803

Source: Data extracted from "The 2018 Inside Higher Ed Survey of Faculty Attitudes on Technology," bit.ly/30JjUit.

a. Given that an academic professional is a faculty member, what is the probability that the academic professional fully supports increased use of educational technologies in higher ed?
b. Given that an academic professional is a faculty member, what is the probability that the academic professional does not fully support increased use of educational technologies in higher ed?
c. Given that an academic professional is a digital learning leader, what is the probability that the academic professional fully supports increased use of educational technologies in higher ed?
d. Given that an academic professional is a digital learning leader, what is the probability that the academic professional does not fully support increased use of educational technologies in higher ed?

4.25 Consumers are aware that companies share and sell their personal data in exchange for free services, but is it important to consumers to have a clear understanding of a company's privacy policy before signing up for its service online? According to an Axios-SurveyMonkey poll, 911 of 1,001 older adults (aged 65+) and 195 of 260 younger adults (aged 18-24) indicate that it is important to have a clear understanding of a company's privacy policy before signing up for its service online.

Source: Data extracted from "Privacy policies are read by an aging few," bit.ly/2NyPAWx.

a. Suppose that the respondent chosen is an older adult. What is the probability that the respondent indicates that it is important to have a clear understanding of a company's privacy policy before signing up for its service online?
b. Suppose that the respondent chosen does indicate that it is important to have a clear understanding of a company's privacy policy before signing up for its service online. What is the probability that the respondent is a younger adult?
c. Are importance of a clear understanding and adult age independent? Explain.

4.26 Each year, ratings are compiled concerning the performance of new cars during the first 90 days of use. Suppose that the cars have been categorized according to whether a car needs warranty-related repair (yes or no) and the country in which the company manufacturing a car is based (United States or not United States). Based on the data collected, the probability that the new car needs a warranty repair is 0.04, the probability that the car is manufactured by a U.S.-based company is 0.60, and the probability that the new car needs a warranty repair and was manufactured by a U.S.-based company is 0.025.
a. Suppose you know that a company based in the United States manufactured a particular car. What is the probability that the car needs a warranty repair?
b. Suppose you know that a company based in the United States did not manufacture a particular car. What is the probability that the car needs a warranty repair?
c. Are the need for a warranty repair and location of the company manufacturing the car independent?

4.27 In 44 of the 68 years from 1950 through 2018 (in 2011 there was virtually no change), the S&P 500 finished higher after the first five days of trading. In 36 out of 44 years, the S&P 500 finished higher for the year. Is a good first week a good omen for the upcoming year? The following table gives the first-week and annual performance over this 68-year period:

              S&P 500'S ANNUAL PERFORMANCE
FIRST WEEK    Higher   Lower
Higher          36       8
Lower           12      12

a. If a year is selected at random, what is the probability that the S&P 500 finished higher for the year?
b. Given that the S&P 500 finished higher after the first five days of trading, what is the probability that it finished higher for the year?
c. Are the two events "first-week performance" and "annual performance" independent? Explain.
d. Look up the performance after the first five days of 2019 and the 2019 annual performance of the S&P 500 at finance.yahoo.com. Comment on the results.
4.28 A standard deck of cards is being used to play a game. There are four suits (hearts, diamonds, clubs, and spades), each having 13 faces (ace, 2, 3, 4, 5, 6, 7, 8, 9, 10, jack, queen, and king), making a total of 52 cards. This complete deck is thoroughly mixed, and you will receive the first 2 cards from the deck, without replacement (the first card is not returned to the deck after it is selected).
a. What is the probability that both cards are queens?
b. What is the probability that the first card is a 10 and the second card is a 5 or 6?
c. If you were sampling with replacement (the first card is returned to the deck after it is selected), what would be the answer in (a)?
d. In the game of blackjack, the face cards (jack, queen, king) count as 10 points, and the ace counts as either 1 or 11 points. All other cards are counted at their face value. Blackjack is achieved if two cards total 21 points. What is the probability of getting blackjack in this problem?

4.29 A box of nine iPhone 7 cell phones contains two red cell phones and seven black cell phones.
a. If two cell phones are randomly selected from the box, without replacement (the first cell phone is not returned to the box after it is selected), what is the probability that both cell phones selected will be red?
b. If two cell phones are randomly selected from the box, without replacement, what is the probability that there will be one red cell phone and one black cell phone selected?
c. If three cell phones are selected, with replacement (the cell phones are returned to the box after they are selected), what is the probability that all three will be red?
d. If you were sampling with replacement (the first cell phone is returned to the box after it is selected), what would be the answers to (a) and (b)?
4.3 Ethical Issues and Probability

Ethical issues can arise when any statements related to probability are presented to the public, particularly when these statements are part of an advertising campaign for a product or service. Unfortunately, many people are not comfortable with numerical concepts (Paulos) and tend to misinterpret the meaning of the probability. In some instances, the misinterpretation is not intentional, but in other cases, advertisements may unethically try to mislead potential customers.

One example of a potentially unethical application of probability relates to advertisements for state lotteries. When purchasing a lottery ticket, the customer selects a set of numbers (such as 6) from a larger list of numbers (such as 54). Although virtually all participants know that they are unlikely to win the lottery, they also have very little idea of how unlikely it is for them to select all six winning numbers from the list of 54 numbers. They have even less of an idea of the probability of not selecting any winning numbers. Given this background, one might consider a state lottery commercial that stated, "We won't stop until we have made everyone a millionaire" to be deceptive and possibly unethical. Is it possible that the lottery can make everyone a millionaire? Is it ethical to suggest that the purpose of the lottery is to make everyone a millionaire?

Another example of a potentially unethical application of probability relates to an investment newsletter promising a 90% probability of a 20% annual return on investment. To make the claim in the newsletter an ethical one, the investment service needs to (a) explain the basis on which this probability estimate rests, (b) provide the probability statement in another format, such as 9 chances in 10, and (c) explain what happens to the investment in the 10% of the cases in which a 20% return is not achieved (e.g., is the entire investment lost?). Probability-related claims should be stated ethically.
Readers can consider how one would write an ethical description of the probability of winning a certain prize in a state lottery, or how one can ethically explain a "90% probability of a 20% annual return" without creating a misleading inference.
4.4 Bayes' Theorem

Developed by Thomas Bayes in the eighteenth century, Bayes' theorem builds on conditional probability concepts that Section 4.2 discusses. Bayes' theorem revises previously calculated probabilities using additional information and forms the basis for Bayesian analysis (Anderson-Cook, Bellhouse, Hooper). In recent years, Bayesian analysis has gained new prominence for its application to analyzing big data using predictive analytics (see Chapter 17). However, Bayesian analysis does
not require big data and can be used in a variety of problems to better determine the revised probability of certain events. The Bayesian Analysis online topic contains examples that apply Bayes' theorem to a marketing problem and a diagnostic problem and presents a set of additional study problems. The Consider This for this section explores an application of Bayes' theorem that many use every day, and references Equation (4.9) that the online topic discusses.
CONSIDER THIS
Divine Providence and Spam

Would you ever guess that the essays Divine Benevolence: Or, An Attempt to Prove That the Principal End of the Divine Providence and Government Is the Happiness of His Creatures and An Essay Towards Solving a Problem in the Doctrine of Chances were written by the same person? Probably not, and in doing so, you illustrate a modern-day application of Bayesian statistics: spam, or junk mail, filters.

In not guessing correctly, you probably looked at the words in the titles of the essays and concluded that they were talking about two different things. An implicit rule you used was that word frequencies vary by subject matter. A statistics essay would very likely contain the word statistics as well as words such as chance, problem, and solving. An eighteenth-century essay about theology and religion would be more likely to contain the uppercase forms of Divine and Providence.

Likewise, there are words you would guess to be very unlikely to appear in either book, such as technical terms from finance, and words that are most likely to appear in both, such as the common words a, and, and the. That words would be either likely or unlikely suggests an application of probability theory. Of course, likely and unlikely are fuzzy concepts, and we might occasionally misclassify an essay if we kept things too simple, such as relying solely on the occurrence of the words Divine and Providence. For example, a profile of the late Harris Milstead, better known as Divine, the star of Hairspray and other films, visiting Providence (Rhode Island), would most certainly not be an essay about theology. But if we widened the number of words we examined and found such words as movie or the name John Waters (Divine's director in many films), we probably would quickly realize the essay had something to do with twentieth-century cinema and little to do with theology and religion.
We can use a similar process to try to classify a new email message in your in-box as either spam or a legitimate message (called "ham," in this context). We would first need to add to your email program a "spam filter" that has the ability to track word frequencies associated with spam and ham messages as you identify them on a day-to-day basis. This would allow the filter to constantly update the prior
probabilities necessary to use Bayes' theorem. With these probabilities, the filter can ask, "What is the probability that an email is spam, given the presence of a certain word?"

Applying the terms of Equation (4.9), such a Bayesian spam filter would multiply the probability of finding the word in a spam email, P(A | B), by the probability that the email is spam, P(B), and then divide by the probability of finding the word in an email, the denominator in Equation (4.9). Bayesian spam filters also use shortcuts by focusing on a small set of words that have a high probability of being found in a spam message as well as on a small set of other words that have a low probability of being found in a spam message.

As spammers (people who send junk email) learned of such new filters, they tried to outfox them. Having learned that Bayesian filters might be assigning a high P(A | B) value to words commonly found in spam, such as Viagra, spammers thought they could fool the filter by misspelling the word as Vi@gr@ or V1agra. What they overlooked was that the misspelled variants were even more likely to be found in a spam message than the original word. Thus, the misspelled variants made the job of spotting spam easier for the Bayesian filters.

Other spammers tried to fool the filters by adding "good" words, words that would have a low probability of being found in a spam message, or "rare" words, words not frequently encountered in any message. But these spammers overlooked the fact that the conditional probabilities are constantly updated and that words once considered "good" would soon be discarded from the good list by the filter as their P(A | B) value increased. Likewise, as "rare" words grew more common in spam and yet stayed rare in ham, such words acted like the misspelled variants that others had tried earlier.
Even then, and perhaps after reading about Bayesian statistics, spammers thought that they could "break" Bayesian filters by inserting random words in their messages. Those random words would affect the filter by causing it to see many words whose P(A|B) value would be low. The Bayesian filter would begin to label many spam
messages as ham and end up being of no practical use. Spammers again overlooked that conditional probabilities are constantly updated. Other spammers decided to eliminate all or most of the words in their messages and replace them with graphics so that Bayesian filters would have very few words with which to form conditional probabilities. But this approach failed, too, as Bayesian filters were rewritten to consider things other than words in a message. After all, Bayes' theorem concerns events, and "graphics present with no text" is as valid an event as "some word, X, present in a message."
Other future tricks will ultimately fail for the same reason. (By the way, spam filters use non-Bayesian techniques as well, which makes spammers' lives even more difficult.) Bayesian spam filters are an example of the unexpected way that applications of statistics can show up in your daily life. You will discover more examples as you read the rest of this book. By the way, the author of the two essays mentioned earlier was Thomas Bayes, who is a lot more famous for the second essay than the first essay, a failed attempt to use mathematics and logic to prove the existence of God.
4.5 Counting Rules In many cases, a large number of outcomes is possible and determining the exact number of outcomes can be difficult. In these situations, rules have been developed for counting the exact number of possible outcomes. The Section 4.5 online topic discusses these rules and illustrates their use.
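Although the counting rules themselves are covered in the online topic, Python's standard library can evaluate each of them directly. The scenarios in the comments below are hypothetical illustrations, not examples from the text.

```python
# Hypothetical illustrations of the counting rules the Section 4.5
# online topic covers, using only Python's standard library.
import math

# Rule: k possible outcomes on each of n trials gives k**n outcomes.
flips = 2 ** 10
print(flips)              # 1024 possible results of 10 coin flips

# Permutations: ordered arrangements of x items chosen from n.
medals = math.perm(6, 3)
print(medals)             # 120 ways to award gold/silver/bronze among 6

# Combinations: selections of x items from n when order is irrelevant.
committees = math.comb(6, 3)
print(committees)         # 20 distinct three-person committees from 6
```

`math.perm` and `math.comb` require Python 3.8 or later; on older versions the same values follow from `math.factorial`.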
As a Fredco Warehouse Club electronics merchandise manager, you analyzed an intent-to-purchase-a-large-TV study of the heads of 1,000 households as well as a follow-up study done 12 months later. In that later survey, respondents who answered that they had purchased a large TV were asked additional questions concerning whether the large TV was HDR-capable and whether the respondents had purchased a streaming media player in the past 12 months. By analyzing the results of these surveys, you were able to uncover many pieces of valuable information that will help you plan a marketing strategy to enhance sales and better target those households likely to purchase multiple or more expensive products.

Whereas only 30% of the households actually purchased a large TV, if a household indicated that it planned to purchase a large TV in the next 12 months, there was an 80% chance that the household actually made the purchase. Thus the marketing strategy should target those households that have indicated an intent to purchase. You determined that for households that purchased an HDR-capable TV, there was a 47.5% chance that the household also purchased a streaming media player. You then compared this conditional probability to the marginal probability of purchasing a streaming media player, which was 36%. Thus, households that purchased HDR-capable TVs are more likely to purchase a streaming media player than are households that purchased large TVs that were not HDR-capable.
SUMMARY
This chapter develops the basic concepts of probability that serve as a foundation for other concepts that later chapters discuss. Probability is a numeric value from 0 to 1 that represents the chance, likelihood, or possibility that a particular event will occur. In addition to simple probability, the chapter discusses conditional probabilities and independent events. The chapter shows how contingency tables and decision trees summarize and present probability information. The chapter also introduces Bayes' theorem.
CHAPTER 4 | Basic Probability
REFERENCES
Anderson-Cook, C. M. "Unraveling Bayes' Theorem." Quality Progress, March 2014, pp. 52–54.
Bellhouse, D. R. "The Reverend Thomas Bayes, FRS: A Biography to Celebrate the Tercentenary of His Birth." Statistical Science, 19 (2004), 3–43.
Hooper, W. "Probing Probabilities." Quality Progress, March 2014, pp. 18–22.
Lowd, D., and C. Meek. "Good Word Attacks on Statistical Spam Filters." Presented at the Second Conference on Email and Anti-Spam, 2005.
Paulos, J. A. Innumeracy. New York: Hill and Wang, 1988.
Silberman, S. "The Quest for Meaning." Wired 8.02, February 2000.
Zeller, T. "The Fight Against V1@gra (and Other Spam)." The New York Times, May 21, 2006, pp. B1, B6.
KEY EQUATIONS

Probability of Occurrence
Probability of occurrence = X/T   (4.1)

Marginal Probability
P(A) = P(A and B1) + P(A and B2) + ··· + P(A and Bk)   (4.2)

General Addition Rule
P(A or B) = P(A) + P(B) − P(A and B)   (4.3)

Conditional Probability
P(A|B) = P(A and B)/P(B)   (4.4a)
P(B|A) = P(A and B)/P(A)   (4.4b)

Independence
P(A|B) = P(A)   (4.5)

General Multiplication Rule
P(A and B) = P(A|B)P(B)   (4.6)

Multiplication Rule for Independent Events
P(A and B) = P(A)P(B)   (4.7)

Marginal Probability Using the General Multiplication Rule
P(A) = P(A|B1)P(B1) + P(A|B2)P(B2) + ··· + P(A|Bk)P(Bk)   (4.8)
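As a check on these equations, the following Python sketch applies them to the chapter's Table 4.1 purchase behavior data (200 households planned and purchased a large TV, 50 planned but did not purchase, 100 purchased without planning, and 650 did neither, out of 1,000 surveyed). The variable names are ours, not the text's.

```python
# Apply Equations (4.1)-(4.4a) to the Table 4.1 purchase behavior data.
planned_and_bought = 200
planned_not_bought = 50
bought_not_planned = 100
neither = 650
total = 1000

# Marginal probabilities (4.1)/(4.2)
p_planned = (planned_and_bought + planned_not_bought) / total  # 0.25
p_bought = (planned_and_bought + bought_not_planned) / total   # 0.30
p_both = planned_and_bought / total                            # 0.20

# General addition rule (4.3): P(planned or bought)
p_planned_or_bought = p_planned + p_bought - p_both            # 0.35

# Conditional probability (4.4a): P(bought | planned)
p_bought_given_planned = p_both / p_planned                    # 0.80

print(round(p_planned_or_bought, 2), round(p_bought_given_planned, 2))
```

The 0.80 conditional probability is exactly the "80% chance that the household actually made the purchase" cited in the Using Statistics summary, so the code reproduces the text's numbers.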
TKEYTERMS a priori probability 154 Bayes' theorem 169 certain event 154 collectively exhaustive 154 complement 154 conditional probability 161 decision tree 163 empirical probability 154 events 153
general addition rule 158 general multiplication rule 165 impossible event 154 independence 164 joint event 153 joint probability 157 marginal probability 157 multiplication rule for independent events 166
mutually exclusive 154 outcomes 153 probability 153 sample space 153 simple event 153 simple probability 155 subjective probability 155
CHECKING YOUR UNDERSTANDING
4.30 What are the differences between a priori probability, empirical probability, and subjective probability?
4.31 What is the difference between a simple event and a joint event?
4.32 How can you use the general addition rule to find the probability of occurrence of event A or B?
4.33 What is the difference between mutually exclusive events and collectively exhaustive events?
4.34 How does conditional probability relate to the concept of independence?
4.35 How does the multiplication rule differ for events that are and are not independent?
CHAPTER REVIEW PROBLEMS

4.36 A survey by Accenture indicated that 64% of millennials as compared to 28% of baby boomers prefer "hybrid" investment advice, a combination of traditional advisory services and low-cost digital tools, over either a dedicated human advisor or conventional robo-advisory services (computer-generated advice and services without human advisors) alone.
Source: Data extracted from Business Wire, "Majority of Wealthy Investors Prefer a Mix of Human and Robo-Advice, According to Accenture Research," bit.ly/2qZY90u.
Suppose that the survey was based on 500 respondents from each of the two generation groups.
a. Construct a contingency table.
b. Give an example of a simple event and a joint event.
c. What is the probability that a randomly selected respondent prefers hybrid investment advice?
d. What is the probability that a randomly selected respondent prefers hybrid investment advice and is a baby boomer?
e. Are the events "generation group" and "prefers hybrid investment advice" independent? Explain.
4.37 G2 Crowd provides commentary and insight about employee engagement trends and challenges within organizations in its G2 Crowd Employee Engagement Report. The report represents the results of an online survey conducted in 2019 with employees located across the United States. G2 Crowd was interested in examining differences between HR and non-HR employees. One area of focus was on employees' response to important metrics to consider when evaluating the effectiveness of employee engagement programs. The findings are summarized in the following tables.
Source: Data extracted from "Employee Engagement," G2 Crowd, bit.ly/2WEQkgk

PRESENTEEISM IS AN IMPORTANT METRIC
EMPLOYEE    Yes    No    Total
HR           53     79    132
Non-HR       43    225    268
Total        96    304    400

ABSENTEEISM IS AN IMPORTANT METRIC
EMPLOYEE    Yes    No    Total
HR           54     78    132
Non-HR       72    196    268
Total       126    274    400

What is the probability that a randomly chosen employee
a. is an HR employee?
b. is an HR employee or indicates that absenteeism is an important metric to consider when evaluating the effectiveness of employee engagement programs?
c. does not indicate that presenteeism is an important metric to consider when evaluating the effectiveness of employee engagement programs and is a non-HR employee?
d. does not indicate that presenteeism is an important metric to consider when evaluating the effectiveness of employee engagement programs or is a non-HR employee?
e. Suppose the randomly chosen employee does indicate that presenteeism is an important metric to consider when evaluating the effectiveness of employee engagement programs. What is the probability that the employee is a non-HR employee?
f. Are "presenteeism is an important metric" and "employee" independent?
g. Is "absenteeism is an important metric" independent of "employee"?

4.38 To better understand the website builder market, Clutch surveyed individuals who had created a website using a do-it-yourself (DIY) website builder. Respondents, categorized by the type of website they built, business or personal, were asked to indicate the primary purpose for building their website. The following table summarizes the findings:

                            TYPE OF WEBSITE
PRIMARY PURPOSE           Business  Personal  Total
Online Business Presence        52         4     56
Online Sales                    32        13     45
Creative Display                28        54     82
Informational Resources          9        24     33
Blog                             8        52     60
Total                          129       147    276

Source: Data extracted from "How Businesses Use DIY Web Builders: Clutch 2017 Survey," bit.ly/2qQjXiq.
If a website builder is selected at random, what is the probability that he or she
a. indicated creative display as the primary purpose for building his/her website?
b. indicated creative display or informational resources as the primary purpose for building his/her website?
c. is a business website builder or indicated online sales as the primary purpose for building his/her website?
d. is a business website builder and indicated online sales as the primary purpose for building his/her website?
e. Given that the website builder selected is a personal website builder, what is the probability that he/she indicated online business presence as the primary purpose for building his/her website?

4.39 The CMO Survey collects and disseminates the opinions of top marketers in order to predict the future of markets, track marketing excellence, and improve the value of marketing in firms and in society. Part of the survey is devoted to the topic of marketing analytics and understanding what factors prevent companies from using more marketing analytics. The following findings are based on responses from 272 senior marketers within B2B firms and 114 senior marketers within B2C firms.
Source: Data extracted from "Results by Firm & Industry Characteristics," The CMO Survey, February 2017, p. 148, bit.ly/2qY3Qvk.
LACK OF PROCESS/TOOLS TO MEASURE SUCCESS
FIRM     Yes     No    Total
B2B       90    182      272
B2C       35     79      114
Total    125    261      386
LACK OF PEOPLE WHO CAN LINK TO PRACTICE
FIRM     Yes     No    Total
B2B       75    197      272
B2C       36     78      114
Total    111    275      386
a. What is the probability that a randomly selected senior marketer indicates that lack of process/tools to measure success through analytics is a factor that prevents his/her company from using more marketing analytics? b. Given that a randomly selected senior marketer is within a B2B firm, what is the probability that the senior marketer indicates that lack of process/tools to measure success through analytics is a factor that prevents his/her company from using more marketing analytics?
c. Given that a randomly selected senior marketer is within a B2C firm, what is the probability that the senior marketer indicates that lack of process/tools to measure success through analytics is a factor that prevents his/her company from using more marketing analytics?
d. What is the probability that a randomly selected senior marketer indicates that lack of people who can link to marketing practice is a factor that prevents his/her company from using more marketing analytics?
e. Given that a randomly selected senior marketer is within a B2B firm, what is the probability that the senior marketer indicates that lack of people who can link to marketing practice is a factor that prevents his/her company from using more marketing analytics?
f. Given that a randomly selected senior marketer is within a B2C firm, what is the probability that the senior marketer indicates that lack of people who can link to marketing practice is a factor that prevents his/her company from using more marketing analytics?
g. Comment on the results in (a) through (f).
CASES

Digital Case
Apply your knowledge about contingency tables and the proper application of simple and joint probabilities in this continuing Digital Case from Chapter 3.
Open EndRunGuide.pdf, the EndRun Financial Services "Guide to Investing," and read the information about the Guaranteed Investment Package (GIP). Read the claims and examine the supporting data. Then answer the following questions: How accurate is the claim of the probability of success for EndRun's GIP? In what ways is the claim misleading? How would you calculate and state the probability of having an annual rate of return not less than 15%?
1. Using the table found under the "Show Me the Winning Probabilities" subhead, calculate the proper probabilities for the group of investors. What mistake was made in reporting the 7% probability claim?
2. Are there any probability calculations that would be appropriate for rating an investment service? Why or why not?

CardioGood Fitness
1. For each CardioGood Fitness treadmill product line (see CardioGood Fitness ), construct two-way contingency tables of gender, education in years, relationship status, and self-rated fitness. (There will be a total of six tables for each treadmill product.)
2. For each table you construct, compute all conditional and marginal probabilities.
3. Write a report detailing your findings to be presented to the management of CardioGood Fitness.

The Choice Is Yours Follow-Up
1. Follow up the "Using Statistics: 'The Choice Is Yours,' Revisited" on page 79 by constructing contingency tables of market cap and type, market cap and risk, market cap and rating, type and risk, type and rating, and risk and rating for the sample of 479 retirement funds stored in Retirement Funds .
2. For each table you construct, compute all conditional and marginal probabilities.
3. Write a report summarizing your conclusions.

Clear Mountain State Student Survey
The Student News Service at Clear Mountain State University (CMSU) has decided to gather data about the undergraduate students who attend CMSU. CMSU creates and distributes a survey of 14 questions (see CMUndergradSurvey.pdf) and receives responses from 111 undergraduates (stored in Student Survey ). For these data, construct contingency tables of gender and major, gender and graduate school intention, gender and employment status, gender and computer preference, class and graduate school intention, class and employment status, major and graduate school intention, major and employment status, and major and computer preference.
1. For each of these contingency tables, compute all the conditional and marginal probabilities.
2. Write a report summarizing your conclusions.
EXCEL GUIDE

EG4.1 BASIC PROBABILITY CONCEPTS

Simple Probability, Joint Probability, and the General Addition Rule
Key Technique Use Excel arithmetic formulas.
Example Compute simple and joint probabilities for the purchase behavior data in Table 4.1 on page 155.

PHStat Use Simple & Joint Probabilities. For the example, select PHStat, then Probability & Prob. Distributions, then Simple & Joint Probabilities. In the new template, similar to the worksheet shown below, fill in the Sample Space area with the data.

Workbook Use the COMPUTE worksheet of the Probabilities workbook as a template. The worksheet (shown below) already contains the Table 4.1 purchase behavior data. For other problems, change the sample space table entries in the cell ranges C3:D4 and A5:D6. As you change the event names in cells B5, B6, C5, and C6, the column A row labels for the simple and joint probabilities and the addition rule change as well. These column A labels are formulas that use the concatenation operator (&) to form row labels from the event names you enter. For example, the cell A10 formula ="P(" & B5 & ")" combines the two characters P( with the Yes B5 cell value and the character ) to form the label P(Yes). To examine all of the COMPUTE worksheet formulas shown below, open to the COMPUTE_FORMULAS worksheet.

EG4.4 BAYES' THEOREM

Key Technique Use Excel arithmetic formulas.
Example Apply Bayes' theorem to the television marketing example that the Bayesian Analysis online topic discusses.

Workbook Use the COMPUTE worksheet of the Bayes workbook as a template. The worksheet already contains the probabilities for the online section example. For other problems, change those probabilities in the cell range B5:C6.

[Figure (EG4.4): COMPUTE worksheet of the Bayes workbook, showing the prior and conditional probabilities, the joint probabilities computed from them, and the revised probabilities.]
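Outside of Excel, the revision the Bayes workbook performs can be sketched in Python. The prior and conditional probabilities below are the values the COMPUTE worksheet shows for the online-section example (priors 0.4 and 0.6; conditionals 0.8 and 0.3); the structure, not the particular numbers, is the point.

```python
# Bayes' theorem revision, mirroring the Bayes workbook's COMPUTE
# worksheet layout: priors -> joints -> total -> revised probabilities.
priors = [0.4, 0.6]          # P(S), P(S')
conditionals = [0.8, 0.3]    # P(A | S), P(A | S')

# Joint probabilities P(A and S), P(A and S')
joints = [p * c for p, c in zip(priors, conditionals)]   # 0.32, 0.18

# Denominator of Bayes' theorem: P(A)
total = sum(joints)                                      # 0.50

# Revised (posterior) probabilities P(S | A), P(S' | A)
revised = [j / total for j in joints]                    # 0.64, 0.36
print([round(r, 2) for r in revised])
```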
[Figure (EG4.1): COMPUTE worksheet of the Probabilities workbook, showing the Table 4.1 sample space (Yes/Yes 200, Yes/No 50, No/Yes 100, No/No 650, total 1,000), the simple probabilities P(Yes) = 0.25, P(No) = 0.75, P(Yes) = 0.30, and P(No) = 0.70, the joint probabilities P(Yes and Yes) = 0.20, P(Yes and No) = 0.05, P(No and Yes) = 0.10, and P(No and No) = 0.65, and the addition rule result, each computed with arithmetic division formulas.]
P(X > 2) = P(X = 3) + P(X = 4) + ···
Because in a probability distribution all the probabilities must sum to 1, the terms on the right side of the equation for P(X > 2) also represent the complement of the probability that X is less than or equal to 2 [i.e., 1 − P(X ≤ 2)]. Thus,

P(X > 2) = 1 − P(X ≤ 2) = 1 − [P(X = 0) + P(X = 1) + P(X = 2)]
Now, using Equation (5.8),

P(X > 2) = 1 − [e^(−3.0)(3.0)^0/0! + e^(−3.0)(3.0)^1/1! + e^(−3.0)(3.0)^2/2!]
         = 1 − [0.0498 + 0.1494 + 0.2240]
         = 1 − 0.4232
         = 0.5768

Thus, there is a 57.68% chance that more than two customers will arrive in the same minute. Excel can automate Poisson probability calculations, which can be tedious. Figure 5.7 contains the computed Poisson probabilities for the bank customer arrival example.

learn MORE The Poisson Table online topic contains a table of Poisson probabilities and explains how to use the table to compute Poisson probabilities.

FIGURE 5.7 Excel worksheet results for computing Poisson probabilities with λ = 3
In the Worksheet The worksheet uses the POISSON.DIST function to compute the column B probabilities. The COMPUTE worksheet of the Excel Guide Poisson workbook is a copy of the Figure 5.7 worksheet.

[Figure 5.7: a Poisson Probabilities worksheet with mean/expected number of events of interest = 3, listing P(X) for X = 0 through 15, each computed with a formula of the form =POISSON.DIST(x, $E$4, FALSE); for example, P(0) = 0.0498, P(1) = 0.1494, and P(2) = 0.2240.]

EXAMPLE 5.5
Computing Poisson Probabilities
Assume that the number of new visitors to a website in one minute follows a Poisson distribution with a mean of 2.5. What is the probability that in a given minute, there are no new visitors to the website? That there is at least one new visitor to the website?

SOLUTION Using Equation (5.8) on page 188 with λ = 2.5 (or Excel or a Poisson table lookup), the probability that there are no new visitors to the website is

P(X = 0 | λ = 2.5) = e^(−2.5)(2.5)^0/0! = 1/(2.71828)^2.5 = 0.0821
The probability that there will be no new visitors to the website in a given minute is 0.0821, or 8.21%. Thus,

P(X ≥ 1) = 1 − P(X = 0) = 1 − 0.0821 = 0.9179
The probability that there will be at least one new visitor to the website in a given minute is 0.9179, or 91.79%. Figure 5.8 shows the Example 5.5 Excel results. The answers to the questions can be found in the boldface cells in the "Poisson Probabilities Table."

FIGURE 5.8 Excel worksheet results for computing the Poisson probability for Example 5.5
[Figure 5.8: a Poisson Probabilities for Website Visitors worksheet with mean/expected number of events of interest = 2.5.]
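For readers working outside Excel, the section's two Poisson results can be reproduced directly from Equation (5.8) with a short Python sketch. This is an illustration only, not part of the text's Excel workflow.

```python
# Reproduce the section's Poisson results with Equation (5.8):
# P(X = x | lam) = e**(-lam) * lam**x / x!
import math

def poisson_pmf(x, lam):
    return math.exp(-lam) * lam ** x / math.factorial(x)

# Bank arrivals, lam = 3.0: P(X > 2) = 1 - [P(0) + P(1) + P(2)]
p_more_than_2 = 1 - sum(poisson_pmf(x, 3.0) for x in range(3))
print(round(p_more_than_2, 4))   # 0.5768

# Example 5.5 website visitors, lam = 2.5
p_zero = poisson_pmf(0, 2.5)
print(round(p_zero, 4))          # 0.0821
print(round(1 - p_zero, 4))      # 0.9179
```

The three printed values match the 57.68%, 8.21%, and 91.79% results computed by hand in the text.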
PROBLEMS FOR SECTION 5.3

LEARNING THE BASICS

5.18 Assume a Poisson distribution.
a. If λ = 2.5, find P(X = 2).
b. If λ = 8.0, find P(X = 8).
c. If λ = 0.5, find P(X = 1).
d. If λ = 3.7, find P(X = 0).

5.19 Assume a Poisson distribution.
a. If λ = 2.0, find P(X ≥ 2).
b. If λ = 8.0, find P(X ≥ 3).
c. If λ = 0.5, find P(X ≤ 1).
d. If λ = 4.0, find P(X ≥ 1).
e. If λ = 5.0, find P(X ≤ 3).

5.20 Assume a Poisson distribution with λ = 5.0. What is the probability that
a. X = 1?
b. X < 1?
c. X > 1?
d. X ≤ 1?

APPLYING THE CONCEPTS

5.21 Assume that the number of airline customer service complaints filed with the Department of Transportation's Office of Aviation Enforcement and Proceedings (OAEP) in one day is distributed as a Poisson variable. The mean number of customer service complaints in January 2019 is 3.13 per day.
Source: Data extracted from U.S. Department of Transportation, Air Travel Consumer Report, January 2019.
What is the probability that in any given day
a. zero customer service complaints against travel agents will be filed?
b. exactly one customer service complaint against travel agents will be filed?
c. two or more customer service complaints against travel agents will be filed?
d. fewer than three customer service complaints against travel agents will be filed?

5.22 The quality control manager of Marilyn's Cookies is inspecting a batch of chocolate chip cookies that has just been baked. If the production process is in control, the mean number of chocolate chips per cookie is 6.0. What is the probability that in any particular cookie being inspected
a. fewer than five chocolate chips will be found?
b. exactly five chocolate chips will be found?
c. five or more chocolate chips will be found?
d. either four or five chocolate chips will be found?

5.23 Refer to Problem 5.22. How many cookies in a batch of 100 should the manager expect to discard if company policy requires that all chocolate chip cookies sold have at least four chocolate chips?

5.24 The U.S. Department of Transportation maintains statistics for mishandled bags. In January 2019, Delta mishandled 0.32 bags per day. What is the probability that in the next month, Delta will have
a. no mishandled bags?
b. at least one mishandled bag?
c. at least two mishandled bags?

5.25 The U.S. Department of Transportation maintains statistics for mishandled bags. In January 2019, American Airlines mishandled 0.90 bags per day. What is the probability that in the next month, American Airlines will have
a. no mishandled bags?
b. at least one mishandled bag?
c. at least two mishandled bags?

5.26 The Consumer Financial Protection Bureau's Consumer Response team hears directly from consumers about the challenges they face in the marketplace, brings their concerns to the attention of financial institutions, and assists in addressing their complaints. An analysis of complaints registered recently indicates that the mean number of vehicle lease complaints registered by consumers is 3.5 per day.
Source: Data extracted from bit.ly/2nGDsc7.
Assume that the number of vehicle lease complaints registered by consumers is distributed as a Poisson random variable. What is the probability that in a given day
a. no vehicle lease complaint will be registered by consumers?
b. exactly one vehicle lease complaint will be registered by consumers?
c. more than one vehicle lease complaint will be registered by consumers?
d. fewer than two vehicle lease complaints will be registered by consumers?

5.27 J.D. Power and Associates calculates and publishes various statistics concerning car quality. The dependability score measures problems experienced during the past 12 months by owners of vehicles (2019). For these models of cars, Ford had 1.48 problems per car and Toyota had 1.08 problems per car.
Source: Data extracted from M. Bomey, "Lexus top brand as auto reliability hits an all-time high," USA Today, February 14, 2009, p. 1B.
Let X be equal to the number of problems with a Ford.
a. What assumptions must be made in order for X to be distributed as a Poisson random variable? Are these assumptions reasonable?
Making the assumptions as in (a), if you purchased a Ford in the 2019 model year, what is the probability that in the past 12 months, the car had
b. zero problems?
c. two or fewer problems?
d. Give an operational definition for problem. Why is the operational definition important in interpreting the initial quality score?

5.28 Refer to Problem 5.27. If you purchased a Toyota in the 2019 model year, what is the probability that in the past 12 months the car had
a. zero problems?
b. two or fewer problems?
c. Compare your answers in (a) and (b) to those for the Ford in Problem 5.27 (b) and (c).

5.29 A toll-free phone number is available from 9 A.M. to 9 P.M. for your customers to register complaints about a product purchased from your company. Past history indicates that a mean of 0.8 calls is received per minute.
a. What properties must be true about the situation described here in order to use the Poisson distribution to calculate probabilities concerning the number of phone calls received in a one-minute period?
Assuming that this situation matches the properties discussed in (a), what is the probability that during a one-minute period
b. zero phone calls will be received?
c. three or more phone calls will be received?
d. What is the maximum number of phone calls that will be received in a one-minute period 99.99% of the time?
5.4 Covariance of a Probability Distribution and Its Application in Finance Section 5.1 defines the expected value, variance, and standard deviation for the probability distribution of a single variable. The Section 5.4 online topic discusses covariance between two variables and explores how financial analysts apply this method as a tool for modern portfolio management.
5.5 Hypergeometric Distribution The hypergeometric distribution determines the probability of x events of interest when sample data are selected without replacement from a finite population. The Section 5.5 online topic discusses the hypergeometric distribution and illustrates its use.
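Although the details are left to the online topic, the hypergeometric probability can be sketched with Python's `math.comb`. The scenario in the comments (tagged forms sampled without replacement) is a hypothetical illustration, not an example from the text.

```python
# Sketch of the hypergeometric probability: x events of interest in a
# sample of n drawn WITHOUT replacement from a population of N that
# contains A events of interest.
import math

def hypergeom_pmf(x, n, A, N):
    return math.comb(A, x) * math.comb(N - A, n - x) / math.comb(N, n)

# Hypothetical example: 4 tagged forms among 10 total; draw 3 without
# replacement; probability that exactly 1 of the 3 is tagged.
print(round(hypergeom_pmf(1, 3, 4, 10), 4))  # 0.5
```

Sampling without replacement is what distinguishes this model from the binomial distribution, where the probability of an event of interest stays constant from trial to trial.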
In the Ricknel Home Centers scenario at the beginning of this chapter, you were an accountant for the Ricknel Home Centers, LLC. The company's accounting information system automatically reviews order forms from online customers for possible mistakes. Any questionable invoices are tagged and included in a daily exceptions report. Knowing that the probability that an order will be tagged is 0.10, the binomial distribution could be used to determine the chance of finding a certain number of tagged forms in a sample of size four. There was a 65.6% chance that none of the forms would be tagged, a 29.2% chance that one would be tagged, and a 5.2% chance that two or more would be tagged.

Other calculations determined that, on average, one would expect 0.4 forms to be tagged, and the standard deviation of the number of tagged order forms would be 0.6. Because the binomial distribution can be applied for any known probability and sample size, Ricknel staffers will be able to make inferences about the online ordering process and, more importantly, evaluate any changes or proposed changes to that process.
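The Ricknel figures quoted above can be verified with a short Python sketch of the binomial formulas (Equations 5.5 through 5.7); the function name is ours, not the text's.

```python
# Check the Ricknel figures with the binomial distribution:
# n = 4 order forms, pi = 0.10 probability a form is tagged.
import math

def binom_pmf(x, n, pi):
    return math.comb(n, x) * pi ** x * (1 - pi) ** (n - x)

n, pi = 4, 0.10
p0 = binom_pmf(0, n, pi)
p1 = binom_pmf(1, n, pi)
print(round(p0, 3))               # 0.656 -> the 65.6% chance
print(round(p1, 3))               # 0.292 -> the 29.2% chance
print(round(1 - p0 - p1, 3))      # 0.052 -> the 5.2% chance
print(n * pi)                     # mean: 0.4 tagged forms
print(round(math.sqrt(n * pi * (1 - pi)), 1))   # std dev: 0.6
```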
CHAPTER 5 | Discrete Probability Distributions
SUMMARY
This chapter discusses two important discrete probability distributions: the binomial and Poisson distributions. The chapter explains the following selection rules that govern which discrete distribution to select for a particular situation:
• If there is a fixed number of observations, n, each of which is classified as an event of interest or not an event of interest, use the binomial distribution.
• If there is an area of opportunity, use the Poisson distribution.
REFERENCES
Hogg, R. V., J. T. McKean, and A. V. Craig. Introduction to Mathematical Statistics, 7th ed. New York: Pearson Education, 2013.
Levine, D. M., P. Ramsey, and R. Smidt. Applied Statistics for Engineers and Scientists Using Microsoft Excel and Minitab. Upper Saddle River, NJ: Prentice Hall, 2001.
McGinty, J. "The Science Behind Your Long Wait in Line." Wall Street Journal, October 8, 2016, p. A2.
KEY EQUATIONS

Expected Value, μ, of a Discrete Variable
μ = E(X) = Σ_{i=1}^{N} x_i P(X = x_i)   (5.1)

Variance of a Discrete Variable
σ² = Σ_{i=1}^{N} [x_i − E(X)]² P(X = x_i)   (5.2)

Standard Deviation of a Discrete Variable
σ = √σ² = √( Σ_{i=1}^{N} [x_i − E(X)]² P(X = x_i) )   (5.3)

Combinations
nCx = n!/(x!(n − x)!)   (5.4)

Binomial Distribution
P(X = x | n, π) = [n!/(x!(n − x)!)] π^x (1 − π)^(n−x)   (5.5)

Mean of the Binomial Distribution
μ = E(X) = nπ   (5.6)

Standard Deviation of the Binomial Distribution
σ = √σ² = √(nπ(1 − π))   (5.7)

Poisson Distribution
P(X = x | λ) = e^(−λ) λ^x / x!   (5.8)
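As a worked illustration of Equations (5.1) through (5.3), the following Python sketch computes the expected value, variance, and standard deviation of a small discrete distribution. The distribution itself (number of daily interruptions and their probabilities) is hypothetical.

```python
# Equations (5.1)-(5.3) applied to a hypothetical discrete distribution:
# X = number of daily interruptions, with probabilities P(X = x).
import math

xs = [0, 1, 2, 3]
ps = [0.35, 0.25, 0.20, 0.20]

# Expected value (5.1): mu = sum of x_i * P(X = x_i)
mu = sum(x * p for x, p in zip(xs, ps))

# Variance (5.2): weighted squared deviations from E(X)
var = sum((x - mu) ** 2 * p for x, p in zip(xs, ps))

# Standard deviation (5.3)
sd = math.sqrt(var)

print(mu)               # 1.25
print(round(var, 4))    # 1.2875
print(round(sd, 4))
```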
TKEY TERMS area of opportunity 188 binomial distribution 181 expected value 177 mathematical model 181
Poisson distribution 188 probability distribution for a discrete variable 177 probability distribution function 181
rule of combinations 181 standard deviation of a discrete variable 178 variance of a discrete variable 178
TCHECKING YOUR UNDERSTANDING 5.30 What is the meaning of the expected value of a variable? 5.31 What are the four properties that must be present in order to use the binomial distribution?
5.32 What are the four properties that must be present in order to use the Poisson distribution?
CHAPTER REVIEW PROBLEMS

5.33 Darwin Head, a 35-year-old sawmill worker, won $1 million and a Chevrolet Malibu Hybrid by scoring 15 goals within 24 seconds at the Vancouver Canucks National Hockey League game (B. Ziemer, "Darwin Evolves into an Instant Millionaire," Vancouver Sun, February 28, 2008, p. 1). Head said he would use the money to pay off his mortgage and provide for his children, and he had no plans to quit his job. The contest was part of the Chevrolet Malibu Million Dollar Shootout, sponsored by General Motors Canadian Division. Did GM-Canada risk the $1 million? No! GM-Canada purchased event insurance from a company specializing in promotions at sporting events such as a half-court basketball shot or a hole-in-one giveaway at the local charity golf outing. The event insurance company estimates the probability of a contestant winning the contest and, for a modest charge, insures the event. The promoters pay the insurance premium but take on no added risk as the insurance company will make the large payout in the unlikely event that a contestant wins. To see how it works, suppose that the insurance company estimates that the probability a contestant would win a million-dollar shootout is 0.001 and that the insurance company charges $4,000.
a. Calculate the expected value of the profit made by the insurance company.
b. Many call this kind of situation a win-win opportunity for the insurance company and the promoter. Do you agree? Explain.
5.34 Between 1896, when the Dow Jones index was created, and 2018, the index rose in 66.4% of the years. Based on this information, and assuming a binomial distribution, what do you think is the probability that the stock market will rise
a. next year?
b. the year after next?
c. in four of the next five years?
d. in none of the next five years?
e. For this situation, what assumption of the binomial distribution might not be valid?
5.35 The Internet has become a crucial marketing channel for business. According to a recent study, 91% of retail brands use two or more social media channels for business.
Source: Data extracted from Brandwatch, "123 Amazing Social Media Statistics and Facts," bit.ly/20h2Crg.
If a sample of 10 businesses is selected, what is the probability that:
a. 8 use two or more social media channels for business?
b. At least 8 use two or more social media channels for business?
c. At most 6 use two or more social media channels for business?
d. If you selected 10 businesses in a certain geographical area and only three use two or more social media channels for business, what conclusion might you reach about businesses in this geographical area?

5.36 One theory concerning the Dow Jones Industrial Average is that it is likely to increase during U.S. presidential election years. From 1964 through 2016, the Dow Jones Industrial Average increased in 11 of the 14 U.S. presidential election years. Assuming that this indicator is a random event with no predictive value, you would expect that the indicator would be correct 50% of the time.
a. What is the probability of the Dow Jones Industrial Average increasing in 11 or more of the 14 U.S. presidential election years if the probability of an increase in the Dow Jones Industrial Average is 0.50?
b. What is the probability that the Dow Jones Industrial Average will increase in 11 or more of the 14 U.S. presidential election years if the probability of an increase in the Dow Jones Industrial Average in any year is 0.75?

5.37 Medical billing errors and fraud are on the rise. According to Medical Billing Advocates of America, three out of four times, the medical bills that they review contain errors.
Source: Kelly Gooch, "Medical billing errors growing, says Medical Billing Advocates of America," Becker's Hospital Review, bit.ly/2qkA8mR.
If a sample of 10 medical bills is selected, what is the probability that
a. 0 medical bills will contain errors?
b. exactly 5 medical bills will contain errors?
c. more than 5 medical bills will contain errors?
d. What are the mean and standard deviation of the probability distribution?

5.38 Refer to Problem 5.37. Suppose that a quality improvement initiative has reduced the percentage of medical bills containing errors to 40%. If a sample of 10 medical bills is selected, what is the probability that
a. 0 medical bills will contain errors?
b. exactly 5 medical bills will contain errors?
c. more than 5 medical bills contain errors?
d. What are the mean and standard deviation of the probability distribution?
e. Compare the results of (a) through (c) to those of Problem 5.37 (a) through (c).

5.39 In a recent year, 46% of Google searches were one or two words, while 21% of Google searches were five or six words.
Source: Wordstream Blog, "27 Google Search Statistics You Should Know in 2019 (+ Insights!)," bit.ly/2GxpUlv.
If a sample of 10 Google searches is selected, what is the probability that
a. more than 5 are one- or two-word searches?
b. more than 5 are five- or six-word searches?
c. none are one- or two-word searches?
d. What assumptions did you have to make to answer (a) through (c)?

5.40 The Consumer Financial Protection Bureau's Consumer Response Team hears directly from consumers about the challenges they face in the marketplace, brings their concerns to the attention of financial institutions, and assists in addressing their complaints. Of the consumers who registered a bank account and service complaint, 46% cited "account management," complaints related to the marketing or management of an account, as their complaint.
Source: Consumer Response Annual Report, bit.ly/2x4CN5w.
Consider a sample of 20 consumers who registered bank account and service complaints. Use the binomial model to answer the following questions:
a. What is the expected value, or mean, of the binomial distribution?
b. What is the standard deviation of the binomial distribution?
c. What is the probability that 10 of the 20 consumers cited "account management" as the type of complaint?
d. What is the probability that no more than 5 of the consumers cited "account management" as the type of complaint?
e. What is the probability that 5 or more of the consumers cited "account management" as the type of complaint?
5.41 Refer to Problem 5.40. In the same time period, 24% of the consumers registering a bank account and service complaint cited "deposit and withdrawal" as the type of complaint; these are issues such as transaction holds and unauthorized transactions.
a. What is the expected value, or mean, of the binomial distribution?
b. What is the standard deviation of the binomial distribution?
c. What is the probability that none of the 20 consumers cited "deposit and withdrawal" as the type of complaint?
d. What is the probability that no more than 2 of the consumers cited "deposit and withdrawal" as the type of complaint?
e. What is the probability that 3 or more of the consumers cited "deposit and withdrawal" as the type of complaint?

5.42 One theory concerning the S&P 500 Index is that if it increases during the first five trading days of the year, it is likely to increase during the entire year. From 1950 through 2018, the S&P 500 Index had these early gains in 44 years (in 2011 there was virtually no change). In 36 of these 44 years, the S&P 500 Index increased for the entire year. Assuming that this indicator is a random event with no predictive value, you would expect that the indicator would be correct 50% of the time. What is the probability of the S&P 500 Index increasing in 36 or more years if the true probability of an increase in the S&P 500 Index is
a. 0.50?
b. 0.70?
c. 0.90?
d. Based on the results of (a) through (c), what do you think is the probability that the S&P 500 Index will increase if there is an early gain in the first five trading days of the year? Explain.
5.43 Spurious correlation refers to the apparent relationship between variables that either have no true relationship or are related to other variables that have not been measured. One widely publicized stock market indicator in the United States that is an example of spurious correlation is the relationship between the winner of the National Football League Super Bowl and the performance of the Dow Jones Industrial Average in that year. The "indicator" states that when a team that existed before the National Football League merged with the American Football League wins the Super Bowl, the Dow Jones Industrial Average will increase in that year. (Of course, any correlation between these is spurious as one thing has absolutely nothing to do with the other!) Since the first Super Bowl was held in 1967 through 2018, the indicator has been correct 38 out of 52 times. Assuming that this indicator is a random event with no predictive value, you would expect that the indicator would be correct 50% of the time.
a. What is the probability that the indicator would be correct 38 or more times in 52 years?
b. What does this tell you about the usefulness of this indicator?
5.44 The United Auto Courts Reports blog notes that the National Insurance Crime Bureau says that Miami-Dade, Broward, and Palm Beach counties account for a substantial number of questionable insurance claims referred to investigators. Assume that the number of questionable insurance claims referred to investigators by Miami-Dade, Broward, and Palm Beach counties is distributed as a Poisson random variable with a mean of 7 per day.
a. What assumptions need to be made so that the number of questionable insurance claims referred to investigators by Miami-Dade, Broward, and Palm Beach counties is distributed as a Poisson random variable?
Making the assumptions given in (a), what is the probability that
b. 5 questionable insurance claims will be referred to investigators by Miami-Dade, Broward, and Palm Beach counties in a day?
c. 10 or fewer questionable insurance claims will be referred to investigators by Miami-Dade, Broward, and Palm Beach counties in a day?
d. 11 or more questionable insurance claims will be referred to investigators by Miami-Dade, Broward, and Palm Beach counties in a day?
CASES

Managing Ashland MultiComm Services
The Ashland MultiComm Services (AMS) marketing department wants to increase subscriptions for its 3-For-All telephone, cable, and Internet combined service. AMS marketing has been conducting an aggressive direct-marketing campaign that includes postal and electronic mailings and telephone solicitations. Feedback from these efforts indicates that including premium channels in this combined service is a very important factor for both current and prospective subscribers. After several brainstorming sessions, the marketing department has decided to add premium cable channels as a no-cost benefit of subscribing to the 3-For-All service.
The research director, Mona Fields, is planning to conduct a survey among prospective customers to determine how many premium channels need to be added to the 3-For-All service in order to generate a subscription to the service. Based on past campaigns and on industry-wide data, she estimates the following:

Number of Free Premium Channels    Probability of Subscriptions
0                                  0.02
1                                  0.04
2                                  0.06
3                                  0.07
4                                  0.08
5                                  0.085
1. If a sample of 50 prospective customers is selected and no free premium channels are included in the 3-For-All service offer, given past results, what is the probability that
a. fewer than 3 customers will subscribe to the 3-For-All service offer?
b. 0 customers or 1 customer will subscribe to the 3-For-All service offer?
c. more than 4 customers will subscribe to the 3-For-All service offer?
d. Suppose that in the actual survey of 50 prospective customers, 4 customers subscribe to the 3-For-All service offer. What does this tell you about the previous estimate of the proportion of customers who would subscribe to the 3-For-All service offer?

2. Instead of offering no free premium channels as in Problem 1, suppose that two free premium channels are included in the 3-For-All service offer. Given past results, what is the probability that
a. fewer than 3 customers will subscribe to the 3-For-All service offer?
b. 0 customers or 1 customer will subscribe to the 3-For-All service offer?
c. more than 4 customers will subscribe to the 3-For-All service offer?
d. Compare the results of (a) through (c) to those of Problem 1.
e. Suppose that in the actual survey of 50 prospective customers, 6 customers subscribe to the 3-For-All service offer. What does this tell you about the previous estimate of the proportion of customers who would subscribe to the 3-For-All service offer?
f. What do the results in (e) tell you about the effect of offering free premium channels on the likelihood of obtaining subscriptions to the 3-For-All service?

3. Suppose that additional surveys of 50 prospective customers were conducted in which the number of free premium channels was varied. The results were as follows:

Number of Free Premium Channels
3 4 5
Number of Subscriptions
5 6 6 7

How many free premium channels should the research director recommend for inclusion in the 3-For-All service? Explain.

Digital Case
Apply your knowledge about expected value in this continuing Digital Case from Chapters 3 and 4.
Open BullsAndBears.pdf, a marketing brochure from EndRun Financial Services. Read the claims and examine the supporting data. Then answer the following:
1. Are there any "catches" about the claims the brochure makes for the rate of return of the Happy Bull and Worried Bear funds?
2. What subjective data influence the rate-of-return analyses of these funds? Could EndRun be accused of making false and misleading statements? Why or why not?
3. The expected-return analysis seems to show that the Worried Bear fund has a greater expected return than the Happy Bull fund. Should a rational investor never invest in the Happy Bull fund? Why or why not?
CHAPTER 5 EXCEL GUIDE

EG5.1 The PROBABILITY DISTRIBUTION for a DISCRETE VARIABLE
Key Technique Use SUMPRODUCT(X cell range, P(X) cell range) to compute the expected value. Use SUMPRODUCT(squared differences cell range, P(X) cell range) to compute the variance.
Example Compute the expected value, variance, and standard deviation for the number of interruptions per day data of Table 5.1 on page 177.
Workbook Use the Discrete Variable workbook as a model.
For the example, open to the DATA worksheet of the Discrete Variable workbook. The worksheet contains the column A and B entries needed to compute the expected value, variance, and standard deviation for the example. Unusual for a DATA worksheet in this book, column C contains formulas. These formulas use the expected value that cell B4 in the COMPUTE worksheet of the same workbook computes (first three rows shown below) and are equivalent to the fourth column calculations in Table 5.3.

   A   B      C
2  0   0.35   =(A2 - COMPUTE!$B$4)^2
3  1   0.25   =(A3 - COMPUTE!$B$4)^2
4  2   0.20   =(A4 - COMPUTE!$B$4)^2

For other problems, modify the DATA worksheet. Enter the probability distribution data into columns A and B and, if necessary, extend column C by first selecting cell C7 and then copying that cell down as many rows as necessary. If the probability distribution has fewer than six outcomes, select the rows that contain the extra, unwanted outcomes, right-click, and then click Delete in the shortcut menu. Appendix F further explains the SUMPRODUCT function that the COMPUTE worksheet uses to compute the expected value and variance.
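For readers working outside Excel, the same SUMPRODUCT-style calculations can be sketched in Python. This snippet is an addition, not part of the book's workbooks, and the four-outcome distribution shown is a made-up example rather than the Table 5.1 interruptions data.

```python
from math import sqrt

def discrete_stats(xs, ps):
    """Expected value, variance, and standard deviation of a discrete
    probability distribution, mirroring the SUMPRODUCT formulas:
    E[X] = sum of X*P(X);  Var(X) = sum of (X - mu)^2 * P(X)."""
    assert abs(sum(ps) - 1.0) < 1e-9, "probabilities must sum to 1"
    mean = sum(x * p for x, p in zip(xs, ps))
    var = sum((x - mean) ** 2 * p for x, p in zip(xs, ps))
    return mean, var, sqrt(var)

# Hypothetical four-outcome distribution (not from the text):
mean, var, sd = discrete_stats([0, 1, 2, 3], [0.35, 0.25, 0.20, 0.20])
```

The three values returned correspond to the expected value, variance, and standard deviation that the COMPUTE worksheet reports.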
EG5.2 BINOMIAL DISTRIBUTION

Key Technique Use the BINOM.DIST(number of events of interest, sample size, probability of an event of interest, FALSE) function.
Example Compute the binomial probabilities for n = 4 and π = 0.1, and construct a histogram of that probability distribution, similar to Figures 5.2 and 5.3 on page 184.
PHStat Use Binomial.
For the example, select PHStat → Probability & Prob. Distributions → Binomial. In the procedure's dialog box:
1. Enter 4 as the Sample Size.
2. Enter 0.1 as the Prob. of an Event of Interest.
3. Enter 0 as the Outcomes From value and enter 4 as the (Outcomes) To value.
4. Enter a Title, check Histogram, and click OK.
Check Cumulative Probabilities before clicking OK in step 4 to have the procedure include columns for P(≤X), P(<X), P(>X), and P(≥X) in the binomial probabilities table.
Workbook Use the Binomial workbook as a template and model.
For the example, open to the COMPUTE worksheet of the Binomial workbook, shown in Figure 5.2 on page 184. The worksheet already contains the entries needed for the example. For other problems, change the sample size in cell B4 and the probability of an event of interest in cell B5. If necessary, extend the binomial probabilities table by first selecting cell range A18:B18 and then copying that cell range down as many rows as necessary. To construct a histogram of the probability distribution, use the Appendix Section B.6 instructions. For problems that require cumulative probabilities, use the CUMULATIVE worksheet in the Binomial workbook. The Short Takes for Chapter 5 explains and documents this worksheet.
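As a cross-check on the BINOM.DIST results, the binomial formula can also be computed directly; the Python sketch below is an addition to the Excel Guide, using the example's n = 4 and π = 0.1.

```python
from math import comb

def binom_pmf(x, n, p):
    # P(X = x) = C(n, x) * p^x * (1 - p)^(n - x)
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

# Binomial probabilities for n = 4, pi = 0.1, as in the EG5.2 example
probs = [binom_pmf(x, 4, 0.1) for x in range(5)]

# A cumulative probability such as P(X <= 1) is just a partial sum
p_at_most_1 = sum(probs[:2])
```

Each entry of `probs` matches the corresponding BINOM.DIST(x, 4, 0.1, FALSE) value.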
EG5.3 POISSON DISTRIBUTION

Key Technique Use the POISSON.DIST(number of events of interest, the average or expected number of events of interest, FALSE) function.
Example Compute the Poisson probabilities for the Figure 5.7 customer arrival problem on page 189.
PHStat Use Poisson.
For the example, select PHStat → Probability & Prob. Distributions → Poisson. In this procedure's dialog box:
1. Enter 3 as the Mean/Expected No. of Events of Interest.
2. Enter a Title and click OK.
Check Cumulative Probabilities before clicking OK in step 2 to have the procedure include columns for P(≤X), P(<X), P(>X), and P(≥X) in the Poisson probabilities table. Check Histogram to construct a histogram of the Poisson probability distribution.
Workbook Use the Poisson workbook as a template.
For the example, open to the COMPUTE worksheet of the Poisson workbook, shown in Figure 5.7 on page 189. The worksheet already contains the entries for the example. For other problems, change the mean or expected number of events of interest in cell E4. To construct a histogram of the probability distribution, use the Appendix Section B.6 instructions. For problems that require cumulative probabilities, use the CUMULATIVE worksheet in the Poisson workbook. The Short Takes for Chapter 5 explains and documents this worksheet.
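The POISSON.DIST calculation can likewise be reproduced outside Excel. This sketch is an addition, using the example's mean of λ = 3 events.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    # P(X = x) = e^(-lam) * lam^x / x!
    return exp(-lam) * lam ** x / factorial(x)

# Poisson probabilities for lambda = 3, as in the EG5.3 example
p0 = poisson_pmf(0, 3)
p3 = poisson_pmf(3, 3)

# Cumulative probability P(X <= 2), as the CUMULATIVE worksheet would report
p_le_2 = sum(poisson_pmf(x, 3) for x in range(3))
```

Each value matches the corresponding POISSON.DIST(x, 3, FALSE) result.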
CONTENTS
USING STATISTICS: Normal Load Times at MyTVLab
6.1 Continuous Probability Distributions
6.2 The Normal Distribution
VISUAL EXPLORATIONS: Exploring the Normal Distribution
CONSIDER THIS: What Is Normal?
6.3 Evaluating Normality
6.4 The Uniform Distribution
6.5 The Exponential Distribution (online)
6.6 The Normal Approximation to the Binomial Distribution (online)
USING STATISTICS: Normal Load Times at MyTVLab, Revisited
EXCEL GUIDE
OBJECTIVES
• Compute probabilities from the normal distribution
• Use the normal distribution to solve business problems
• Use the normal probability plot to determine whether a set of data is approximately normally distributed
• Compute probabilities from the uniform distribution
You are the vice president in charge of sales and marketing for MyTVLab, a web-based business that has evolved into a full-fledged, subscription-based streaming video service. To differentiate MyTVLab from the other companies that sell similar services, you decide to create a "Why Choose Us" web page to help educate new and prospective subscribers about all that MyTVLab offers. As part of that page, you have produced a new video that samples the content MyTVLab streams as well as demonstrates the relative ease of setting up MyTVLab on many types of devices. You want this video to download with the page so that a visitor can jump to different segments immediately or view the video later, when offline.
You know from research (Kishnan and Sitaraman) and past observations that Internet visitors will not tolerate waiting too long for a web page to load. One wait time measure is load time, the time in seconds that passes from first pointing a browser to a web page until the web page is fully loaded and content such as video is ready to be viewed. You have set a goal that the load time for the new sales page should rarely exceed 10 seconds (too long for visitors to wait) and, ideally, should rarely be less than 1 second (a waste of company Internet resources).
To measure this time, you point a web browser at the MyTVLab corporate test center to the new sales web page and record the load time. In your first test, you record a time of 6.67 seconds. You repeat the test and record a time of 7.52 seconds. Though consistent with your goal, you realize that two load times do not constitute strong proof of anything, especially as your assistant has performed his own test and recorded a load time of 8.83 seconds. Could you use a method based on probability theory to ensure that most load times will be within the range you seek?
MyTVLab has recorded past load times of a similar page with a similar video and determined that the mean load time of that page is 7 seconds, that the standard deviation of those times is 2 seconds, that approximately two-thirds of the load times are between 5 and 9 seconds, and that about 95% of the load times are between 3 and 11 seconds. Could you use these facts to assure yourself that the load time goal you have set for the new sales page is likely to be met?
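Those empirical facts line up with a normal model. As a quick sketch (an addition to the text, not part of the scenario), the Python standard library's statistics.NormalDist can check them, assuming load times are normally distributed with a mean of 7 seconds and a standard deviation of 2 seconds:

```python
from statistics import NormalDist

load = NormalDist(mu=7, sigma=2)   # assumed model: mean 7 s, SD 2 s

within_1sd = load.cdf(9) - load.cdf(5)    # P(5 < X < 9), "about two-thirds"
within_2sd = load.cdf(11) - load.cdf(3)   # P(3 < X < 11), "about 95%"
over_goal = 1 - load.cdf(10)              # P(X > 10), loads slower than the goal
```

Under this model, roughly 68% of load times fall between 5 and 9 seconds and about 95% between 3 and 11 seconds, matching the recorded history.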
Chapter 5 discusses how to use probability distributions for a discrete numerical variable. In the MyTVLab scenario, you are examining the load time, a continuous numerical variable. You are no longer considering a table of discrete (specific) values, but a continuous range of values. For example, the phrase "load times are between 5 and 9 seconds" includes any value between 5 and 9 and not just the values 5, 6, 7, 8, and 9. If you plotted the phrase on a graph, you would draw a continuous line from 5 to 9 and not just plot five discrete points.
When you add information about the shape of the range of values, such as two-thirds of the load times are between 5 and 9 seconds or about 95% of the load times are between 3 and 11 seconds, you can visualize the plot of all values as an area under a curve. If that area under the curve follows the well-known pattern of certain continuous distributions, you can use the continuous probability distribution for that pattern to estimate the likelihood that a load time is within a range of values. In the MyTVLab scenario, the past load times of a similar page describe a pattern that conforms to the pattern associated with the normal distribution, the subject of Section 6.2. That would allow you, as the vice president for sales and marketing, to use the normal distribution with the statistics given to determine if your load time goal is likely to be met.
6.1 Continuous Probability Distributions

Continuous probability distributions vary by the shape of the area under the curve. Figure 6.1 visualizes the normal, uniform, and exponential probability distributions.

FIGURE 6.1 Three continuous probability distributions
Section 6.4 further discusses the uniform distribution. The online Section 6.5 further discusses the exponential distribution.
[Three density curves, each plotted over Values of X: Normal Distribution, Uniform Distribution, Exponential Distribution]
Some distributions, including the normal and uniform distributions in Figure 6.1, show a symmetrical shape. Distributions such as the right-skewed exponential distribution do not. In symmetrical distributions the mean equals the median, whereas in a right-skewed distribution the mean is greater than the median.
Each of the three distributions also has unique properties. The normal distribution is not only symmetrical, but bell-shaped, a shape that (loosely) suggests the profile of a bell. Being bell-shaped means that most values of the continuous variable will cluster around the mean. Although the values in a normal distribution can range from negative infinity to positive infinity, the shape of the normal distribution makes it very unlikely that extremely large or extremely small values will occur.
The uniform distribution, also known as the rectangular distribution, contains values that are equally distributed in the range between the smallest value and the largest value. In a uniform distribution, every value is equally likely.
The exponential distribution contains values from zero to positive infinity and is right-skewed, making the mean greater than the median. Its shape makes it unlikely that extremely large values will occur.
Besides visualizations such as those in Figure 6.1, a continuous probability distribution can be expressed mathematically as a probability density function. A probability density function for a specific continuous probability distribution, represented by the symbol f(X), defines the distribution of the values for a continuous variable and can be used as the basis for calculations that determine the likelihood or probability that a value will be within a certain range.
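To make the density-function idea concrete, here is a small Python sketch (an addition, not from the text) for the simplest case, the uniform distribution, whose density is the constant f(x) = 1/(b − a) on the interval from a to b:

```python
def uniform_prob(a, b, lo, hi):
    """P(lo <= X <= hi) for X uniform on [a, b]: the area of a
    rectangle of height 1/(b - a) over the overlapping interval."""
    lo, hi = max(lo, a), min(hi, b)
    return max(0.0, hi - lo) / (b - a)

# Hypothetical example: X uniform on [0, 1]
p = uniform_prob(0.0, 1.0, 0.2, 0.5)   # rectangle area 0.3 * 1 = 0.3
```

Because probability is area under the density curve, the probability of any single exact value is zero; only ranges of values carry probability.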
CHAPTER 6 The Normal Distribution and Other Continuous Distributions
6.2 The Normal Distribution
By the rule the previous paragraph states, the probability that the load time is exactly 7, or any other specific value, is zero.
The most commonly used continuous probability distribution, the normal distribution, plays an important role in statistics and business. Because of its relationship to the Central Limit Theorem (see Section 7.2), the distribution provides the basis for classical statistical inference and can be used to approximate various discrete probability distributions. For business, many continuous variables used in decision making have distributions that closely resemble the normal distribution. The normal distribution can be used to estimate the probability that values occur within a specific range or interval. This probability corresponds to an area under a curve that the normal distribution defines. Because a single point on a curve, representing a specific value, cannot define an area, the area under any single point/specific value will be 0. Therefore, when using the normal distribution to estimate values of a continuous variable, the probability that the variable will be exactly a specified value is always zero. For the MyTVLab scenario, the load time for the new sales page would be an example of a continuous variable whose distribution approximates the normal distribution. This approximation enables one to estimate probabilities such as the probability that the load time would be between 7 and 10 seconds, the probability that the load time would be between 8 and 9 seconds, or the probability that the load time would be between 7.99 and 8.01 seconds. Exhibit 6.1 presents four important theoretical properties of the normal distribution. The distributions of many business decision-making continuous variables share the first three properties, sufficient to allow the use of the normal distribution to estimate the probability for specific ranges or intervals of values.
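The three probabilities just mentioned can be sketched numerically with the standard library's statistics.NormalDist, using the MyTVLab figures (mean 7 seconds, standard deviation 2 seconds) as the assumed model. This code is an illustration added here, not part of the text:

```python
from statistics import NormalDist

load = NormalDist(mu=7, sigma=2)   # assumed load-time model

p_7_10 = load.cdf(10) - load.cdf(7)        # P(7 < X < 10)
p_8_9 = load.cdf(9) - load.cdf(8)          # P(8 < X < 9)
p_tiny = load.cdf(8.01) - load.cdf(7.99)   # P(7.99 < X < 8.01): very small
p_exact = load.cdf(8) - load.cdf(8)        # P(X = exactly 8) is zero
```

Narrowing the interval shrinks the area under the curve, and a single point has no area at all, which is why the probability of any exact value is zero.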
EXHIBIT 6.1 Normal Distribution Important Theoretical Properties

Symmetrical distribution. Its mean and median are equal.
Bell-shaped. Values cluster around the mean.
Interquartile range is roughly 1.33 standard deviations. Therefore, the middle 50% of the values are contained within an interval that is approximately two-thirds of a standard deviation below and two-thirds of a standard deviation above the mean.
The distribution has an infinite range (−∞ < X < ∞). Six standard deviations approximate this range (see page 205).
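Two of these properties can be verified numerically. The sketch below, an addition using Python's statistics.NormalDist, checks the 1.33-standard-deviation interquartile range and the six-standard-deviation span:

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Interquartile range in standard-deviation units: Q3 - Q1
iqr = z.inv_cdf(0.75) - z.inv_cdf(0.25)   # roughly 1.33 (more precisely about 1.349)

# Probability captured by the mean plus or minus 3 standard deviations
six_sd_span = z.cdf(3) - z.cdf(-3)        # about 0.9973
```

The six-standard-deviation interval captures about 99.7% of all values, which is why it serves as a practical stand-in for the infinite range.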
Table 6.1 presents the fill amounts, the volume of liquid placed inside a bottle, for a production run of 10,000 one-liter water bottles. Due to minor irregularities in the machinery and the water pressure, the fill amounts will vary slightly from the desired target amount, which is a bit more than 1.0 liters to prevent underfilling of bottles and the subsequent consumer unhappiness that such underfilling would cause.

TABLE 6.1 Fill amounts for 10,000 one-liter water bottles

Fill Amount (liters)    Relative Frequency
below 1.025             48/10,000    = 0.0048
1.025 < 1.030           122/10,000   = 0.0122
1.030 < 1.035           325/10,000   = 0.0325
1.035 < 1.040           695/10,000   = 0.0695
1.040 < 1.045           1,198/10,000 = 0.1198
1.045 < 1.050           1,664/10,000 = 0.1664
1.050 < 1.055           1,896/10,000 = 0.1896
1.055 < 1.060           1,664/10,000 = 0.1664
1.060 < 1.065           1,198/10,000 = 0.1198
1.065 < 1.070           695/10,000   = 0.0695
1.070 < 1.075           325/10,000   = 0.0325
1.075 < 1.080           122/10,000   = 0.0122
1.080 or above          48/10,000    = 0.0048
Total                                  1.0000
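A quick script, added here as an aside, confirms the internal consistency of Table 6.1: the frequencies total 10,000, the relative frequencies sum to 1, and the distribution is symmetric about the modal class:

```python
# Class frequencies for the 13 fill-amount intervals in Table 6.1
counts = [48, 122, 325, 695, 1198, 1664, 1896, 1664, 1198, 695, 325, 122, 48]

total = sum(counts)
rel_freq = [c / total for c in counts]
is_symmetric = counts == counts[::-1]   # mirror image around the middle class
```

The symmetry and central clustering are exactly the features that make the normal distribution a reasonable approximation for these data.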
student TIP Section 2.4 discusses histograms and relative frequency polygons.
FIGURE 6.2 Relative frequency histogram and polygon of the amount filled in 10,000 water bottles
The fill amounts for the 10,000-bottle run cluster in the interval 1.05 to 1.055 liters. The fill amounts distribute symmetrically around that grouping, forming a bell-shaped pattern, which the relative frequency polygon that has been superimposed over the Figure 6.2 histogram highlights. These properties of the fill amount permit the normal distribution to be used to estimate values. Note that the distribution of fill amounts does not have an infinite range, as fill amounts can never be less than 0 or more than the entire, fixed volume of a bottle. Therefore, the normal distribution can only be an approximation of the fill amount distribution, a distribution that does not contain an infinite range, an important theoretical property of true normal distributions.
FIGURE 9.12 Determining the p-value for a one-tail test
[Worksheet excerpt: tSTAT = −2.8416; one-tail calculations: T.DIST.RT value = 0.0019, computed as =T.DIST.RT(ABS(B13), B12); 1 − T.DIST.RT value = 0.9981; decision: Reject the null hypothesis]
Example 9.5 illustrates a one-tail test in which the rejection region is in the upper tail.
EXAMPLE 9.5 A One-Tail Test for the Mean
A company that manufactures chocolate bars is particularly concerned that the mean weight of a chocolate bar is not greater than 6.03 ounces. A sample of 50 chocolate bars is selected; the sample mean is 6.034 ounces, and the sample standard deviation is 0.02 ounce. Using the α = 0.01 level of significance, is there evidence that the population mean weight of the chocolate bars is greater than 6.03 ounces?
SOLUTION Using the Exhibit 9.2 critical value approach on page 282:

Step 1 First, define the null and alternative hypotheses:
H0: μ ≤ 6.03
H1: μ > 6.03
Step 2 Collect the data from a sample of n = 50. You decide to use α = 0.01.
Step 3 Because σ is unknown, use the t distribution and the tSTAT test statistic.
Step 4 The rejection region consists of values of tSTAT greater than 2.4049, the upper-tail critical value of the t distribution with 49 degrees of freedom at the 0.01 level of significance. Reject H0 if tSTAT > 2.4049; otherwise, do not reject H0.
Step 5 From your sample of 50 chocolate bars, you find that the sample mean weight is 6.034 ounces and the sample standard deviation is 0.02 ounce. Using n = 50, X̄ = 6.034, S = 0.02, and Equation (9.2) on page 288,

tSTAT = (X̄ − μ)/(S/√n) = (6.034 − 6.03)/(0.02/√50) = 1.414
Step 6 Because tSTAT = 1.414 < 2.4049, or because the p-value (from Excel) is 0.0818 > 0.01, you do not reject the null hypothesis. There is insufficient evidence to conclude that the population mean weight is greater than 6.03 ounces.
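The Example 9.5 arithmetic can be replicated in a few lines of Python. This sketch, not part of the text, recomputes the test statistic and applies the critical-value decision rule, taking the 2.4049 critical value as given:

```python
from math import sqrt

# Example 9.5 data: n = 50 chocolate bars, sample mean 6.034 oz, sample SD 0.02 oz
n, xbar, s, mu0 = 50, 6.034, 0.02, 6.03

t_stat = (xbar - mu0) / (s / sqrt(n))   # (6.034 - 6.03) / (0.02 / sqrt(50))
t_crit = 2.4049                         # upper-tail t critical value, 49 df, alpha = 0.01
reject_h0 = t_stat > t_crit             # False: do not reject H0
```

Since the computed statistic of about 1.414 falls short of the 2.4049 critical value, the null hypothesis is not rejected, matching the conclusion in Step 6.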
To perform one-tail tests of hypotheses, one must properly formulate H0 and H1. Exhibit 9.4 summarizes the key points about the null and alternative hypotheses for one-tail tests.
EXHIBIT 9.4 The Null and Alternative Hypotheses in One-Tail Tests
The null hypothesis, H0, states a status quo claim. The alternative hypothesis, H1, states a claim that is contrary to the null hypothesis and often represents a research claim or specific inference that an analyst seeks to prove. A null and alternative pair of hypotheses are always collectively exhaustive. If one rejects the null hypothesis, one has strong statistical evidence that the alternative hypothesis is correct. If one does not reject the null hypothesis, one has not proven the null hypothesis. (Rather, one has only failed to prove the alternative hypothesis.)
The null hypothesis always refers to a population parameter such as μ and not a sample statistic such as X̄. The null hypothesis always includes an equals sign when stating a claim about the population parameter, for example, H0: μ ≥ 208.16 grams. The alternative hypothesis never includes an equals sign when stating a claim about the population parameter, for example, H1: μ < 208.16 grams.
LEARNING THE BASICS
9.36 In a one-tail hypothesis test where you reject H0 only in the upper tail, what is the p-value if ZSTAT = +2.00?
9.37 In Problem 9.36, what is your statistical decision if you test the null hypothesis at the 0.01 level of significance?
9.38 In a one-tail hypothesis test where you reject H0 only in the lower tail, what is the p-value if ZSTAT = −1.38?
9.39 In Problem 9.38, what is your statistical decision if you test the null hypothesis at the 0.05 level of significance?
CHAPTER 9 Fundamentals of Hypothesis Testing: One-Sample Tests
9.40 In a one-tail hypothesis test where you reject H0 only in the lower tail, what is the p-value if ZSTAT = +1.38?
9.41 In Problem 9.40, what is the statistical decision if you test the null hypothesis at the 0.05 level of significance?
9.42 In a one-tail hypothesis test where you reject H0 only in the upper tail, what is the critical value of the t-test statistic with 10 degrees of freedom at the 0.01 level of significance?
9.43 In Problem 9.42, what is your statistical decision if tSTAT = +2.79?
9.44 In a one-tail hypothesis test where you reject H0 only in the lower tail, what is the critical value of the tSTAT test statistic with 20 degrees of freedom at the 0.01 level of significance?
9.45 In Problem 9.44, what is your statistical decision if tSTAT = −3.15?
APPLYING THE CONCEPTS

9.46 The Washington Metropolitan Area Transit Authority has set a bus fleet reliability goal of 8,000 bus-miles. Bus reliability is measured specifically as the number of bus-miles traveled before a mechanical breakdown that requires the bus to be removed from service or deviate from the schedule. Suppose a sample of 64 buses resulted in a sample mean of 8,210 bus-miles and a sample standard deviation of 625 bus-miles.
a. Is there evidence that the population mean bus-miles is greater than 8,000 bus-miles? (Use a 0.05 level of significance.)
b. Determine the p-value and interpret its meaning.

9.47 CarMD reports that the mean repair cost of a Toyota in 2018 for a "check engine" light was $482. Suppose a sample of 100 "check engine" light repairs completed in the last month was selected. The sample mean repair cost was $271 with a sample standard deviation of $100.
a. Is there evidence that the population mean repair cost is less than $482? (Use a 0.05 level of significance.)
b. Determine the p-value and interpret its meaning.
Source: www.carmd.com/wp/vehicle-health-index-introduction/2018-carmdmanufacturer-vehicle-rankings

9.48 Patient waiting is a common phenomenon in the doctor's waiting room. One acceptable standard of practice states that waiting time for patients to be seen by the first provider in hospital outpatient and public health clinics should be less than 30 minutes. A study was conducted to assess patient waiting at a primary healthcare clinic. Data were collected on a sample of 860 patients. In this sample, the mean wait time was 24.05 minutes, with a standard deviation of 16.5 minutes.
a. If you test the null hypothesis at the 0.01 level of significance, is there evidence that the population mean wait time is less than 30 minutes?
b. Interpret the meaning of the p-value in this problem.
Source: Data extracted from B. A. Ahmad, K. Khairatul, and A. Farnazza, "An assessment of patient waiting and consultation time in a primary healthcare clinic," Malaysian Family Practice, 2017, 12(1), pp. 14–21.

9.49 You are the manager of a restaurant that delivers pizza to college dormitory rooms. You have just changed your delivery process in an effort to reduce the mean time between the order and completion of delivery from the current 25 minutes. A sample of 36 orders using the new delivery process yields a sample mean of 22.4 minutes and a sample standard deviation of 6 minutes.
a. Using the six-step critical value approach, at the 0.05 level of significance, is there evidence that the population mean delivery time has been reduced below the previous population mean value of 25 minutes?
b. At the 0.05 level of significance, use the six-step p-value approach.
c. Interpret the meaning of the p-value in (b).
d. Compare your conclusions in (a) and (b).

9.50 A survey of nonprofit organizations showed that online fundraising has increased in the past year. Based on a random sample of 133 nonprofits, the mean one-time gift donation resulting from email outreach in the past year was $87. Assume that the sample standard deviation is $9.
a. If you test the null hypothesis at the 0.01 level of significance, is there evidence that the mean one-time gift donation is greater than $85.50?
b. Interpret the meaning of the p-value in this problem.

9.51 The population mean waiting time to check out of a supermarket has been 4 minutes. Recently, in an effort to reduce the waiting time, the supermarket has experimented with a system in which infrared cameras use body heat and in-store software to determine how many lanes should be opened. A sample of 100 customers was selected, and their mean waiting time to check out was 3.10 minutes, with a sample standard deviation of 2.5 minutes.
a. At the 0.05 level of significance, using the critical value approach to hypothesis testing, is there evidence that the population mean waiting time to check out is less than 4 minutes?
b. At the 0.05 level of significance, using the p-value approach to hypothesis testing, is there evidence that the population mean waiting time to check out is less than 4 minutes?
c. Interpret the meaning of the p-value in this problem.
d. Compare your conclusions in (a) and (b).
9.4 Z Test of Hypothesis for the Proportion

In some situations, one seeks to test a hypothesis about the proportion of events of interest in the population, π, rather than test the population mean. To begin, one selects a random sample and calculates the sample proportion, p = X/n. Then, one compares the value of this statistic to the hypothesized value of the parameter, π, in order to decide whether to reject the null hypothesis. If the number of events of interest (X) and the number of events that are not of interest (n − X) are each at least five, the sampling distribution of a proportion approximately follows a normal distribution, and one can use the Z test for the proportion. Equation (9.3) defines this hypothesis test for the difference between the sample proportion, p, and the hypothesized population proportion, π. Equation (9.4) provides an alternate definition that uses the number of events of interest, X, instead of the sample proportion, p.

student TIP
Do not confuse this use of the Greek letter pi, π, to represent the population proportion with the constant that is the ratio of the circumference to the diameter of a circle, approximately 3.14159.

Z TEST FOR THE PROPORTION

ZSTAT = (p − π) / √(π(1 − π)/n)   (9.3)

where
p = sample proportion = X/n
X = number of events of interest in the sample
n = sample size
π = hypothesized proportion of events of interest in the population

Z TEST FOR THE PROPORTION IN TERMS OF THE NUMBER OF EVENTS OF INTEREST

ZSTAT = (X − nπ) / √(nπ(1 − π))   (9.4)
The ZSTAT test statistic approximately follows a standardized normal distribution when X and (n − X) are each at least 5 for this test.
¹ As reported by L. Petrecca in "Always 'on': How you can disconnect from work," USA Today, January 16, 2017.
Either the critical value or p-value approach can be used to evaluate the Z test for the proportion. To illustrate both approaches, consider a 2016 survey conducted by CareerBuilder in which 45% of American workers reported that they work during nonbusiness hours.¹ Suppose a research firm decides to take a new survey to determine whether the proportion has changed since 2016. The firm conducts this new survey and finds that 208 of 400 American workers reported that they work during nonbusiness hours. To determine whether the proportion has changed, the firm defines the null and alternative hypotheses:

H0: π = 0.45 (the proportion of American workers who reported that they work during nonbusiness hours has not changed since 2016)
H1: π ≠ 0.45 (the proportion of American workers who reported that they work during nonbusiness hours has changed since 2016)

Because the firm seeks to determine whether the population proportion of American workers who reported that they work during nonbusiness hours has changed from 0.45 since 2016, the firm uses a two-tail test with either the critical value approach or the p-value approach.
Using the Critical Value Approach

The firm selects an α = 0.05 level of significance that defines the rejection and nonrejection regions that Figure 9.13 visualizes. The decision rule is

Reject H0 if ZSTAT < −1.96 or if ZSTAT > +1.96; otherwise, do not reject H0.

FIGURE 9.13 Two-tail test of hypothesis for the proportion at the 0.05 level of significance (regions of rejection below −1.96 and above +1.96, region of nonrejection between)
Having determined the null and alternative hypotheses and the Z test statistic, the research firm first calculates the sample proportion, p. In the new survey, 208 of 400 American workers reported that they work during nonbusiness hours, which makes p = 0.52 (208 ÷ 400). Because X = 208 and n − X = 192, and each value is greater than 5, either Equation (9.3) or (9.4) can be used:

ZSTAT = (p − π) / √(π(1 − π)/n) = (0.52 − 0.45) / √(0.45(1 − 0.45)/400) = 0.0700/0.0249 = 2.8141

ZSTAT = (X − nπ) / √(nπ(1 − π)) = (208 − (400)(0.45)) / √((400)(0.45)(0.55)) = 28/9.9499 = 2.8141
The Z test statistic can also be determined by software results, as the Figure 9.14 Excel results for this test illustrate. Because ZSTAT = 2.8141 > 1.96, the firm rejects H0. There is evidence that the population proportion of American workers who reported that they work during nonbusiness hours has changed from 0.45 since 2016.

FIGURE 9.14 Excel Z test results for whether the proportion of American workers who reported that they work during nonbusiness hours has changed from 0.45 since 2016
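The two equivalent ZSTAT calculations, and the two-tail p-value that the Excel results report, can be checked with a few lines of Python (scipy in place of the text's Excel worksheet; the survey counts are from the example above):

```python
from math import sqrt
from scipy import stats

# Z test for the proportion: 208 of 400 workers, hypothesized pi = 0.45.
X, n, pi = 208, 400, 0.45
p = X / n                                     # sample proportion = 0.52

z1 = (p - pi) / sqrt(pi * (1 - pi) / n)       # Equation (9.3)
z2 = (X - n * pi) / sqrt(n * pi * (1 - pi))   # Equation (9.4)
p_value = 2 * (1 - stats.norm.cdf(abs(z1)))   # two-tail p-value

print(round(z1, 4), round(z2, 4), round(p_value, 4))
```

Both forms yield the same statistic, 2.8141, because Equation (9.4) is Equation (9.3) with numerator and denominator multiplied by n.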
L9600 :NORM.SJ NV(8S/2) 1.9600 =NORM.S.INV(1 - 85/2) 0.0049 =2 • (1 - NORM.s.DIST(A85(812), TRUE)) Rejed the null hypothesis =IF(817 < 85,"RejeCopy21$A:$A) 246.4 =AVEIAGE(DlltaCopy2!$A:$A) 5.1228 or Fa;2; otherwise, do not reject H0 . To illustrate how to use the F test to determine whether the two variances are equal, recall the Using Statistics scenario on page 311 that concerns the sales of VLABGo players in two different sales locations. To determine whether to use the pooled-variance t test or the
10.4 F Test for the Ratio of Two Variances
337
separate-variance t test, one first tests the equality of the two population variances. The null and alternative hypotheses are

H0: σ₁² = σ₂²
H1: σ₁² ≠ σ₂²
Because sample 1 is defined as having the larger sample variance, the rejection region in the upper tail of the F distribution contains α/2. Using the level of significance α = 0.05, the rejection region in the upper tail contains 0.025 of the distribution. Because there are samples of 10 stores for each of the two sales locations, there are 10 − 1 = 9 degrees of freedom for both the numerator (the sample with the larger variance) and the denominator (the sample with the smaller variance). Use Table E.5 to determine Fα/2, the upper-tail critical value of the F distribution. From Table E.5, a portion of which Table 10.12 shows, the upper-tail critical value, Fα/2, is 4.03. Therefore, the decision rule is

Reject H0 if FSTAT > F0.025 = 4.03; otherwise, do not reject H0.

TABLE 10.12 Finding the upper-tail critical value of F with 9 denominator and numerator degrees of freedom for an upper-tail area of 0.025
Cumulative Probabilities = 0.975
Upper-Tail Area = 0.025

                            Numerator df1
Denominator df2       1       2       3   ...       7       8       9
1                647.80  799.50  864.20   ...  948.20  956.70  963.30
2                 38.51   39.00   39.17   ...   39.36   39.37   39.39
3                 17.44   16.04   15.44   ...   14.62   14.54   14.47
⋮
7                  8.07    6.54    5.89   ...    4.99    4.90    4.82
8                  7.57    6.06    5.42   ...    4.53    4.43    4.36
9                  7.21    5.71    5.08   ...    4.20    4.10    4.03

Source: Extracted from Table E.5.
Using the Table 10.1 VLABGo sales data on page 313 and Equation (10.7) on page 336,

s₁² = (42.5420)² = 1,809.8222
s₂² = (32.5271)² = 1,058.0111

the result is

FSTAT = s₁²/s₂² = 1,809.8222/1,058.0111 = 1.7106
Because FSTAT = 1.7106 < 4.03, you do not reject H0. Figure 10.13 on page 338 shows the results for this test, including the p-value, 0.4361. Because 0.4361 > 0.05, one concludes that there is no evidence of a significant difference in the variability of the sales of the VLABGo players for the two sales locations.

In testing for a difference between two variances using the F test, one assumes that each of the two populations is normally distributed. The F test is very sensitive to the normality assumption. If boxplots or normal probability plots suggest even a mild departure from normality for either of the two populations, one should not use the F test. Instead, use the Levene test (see Section 11.1) or a nonparametric approach (Corder and Daniel). In testing for the equality of variances as part of assessing the appropriateness of the pooled-variance t test procedure, the F test is a two-tail test with α/2 in the upper tail. However, when one examines the variability in situations other than the pooled-variance t test, the F test is often a one-tail test. Example 10.4 illustrates a one-tail test.
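The VLABGo F test can be reproduced in Python (scipy's F distribution in place of Table E.5); the sample standard deviations and sample sizes are those given above:

```python
from scipy import stats

# F test for the ratio of two variances, VLABGo sales example:
# s1 is the larger sample standard deviation; 10 stores per location.
s1, s2, n1, n2 = 42.5420, 32.5271, 10, 10

f_stat = s1 ** 2 / s2 ** 2                           # S1^2 / S2^2
upper_crit = stats.f.isf(0.05 / 2, n1 - 1, n2 - 1)   # F(0.025; 9, 9)
p_value = 2 * stats.f.sf(f_stat, n1 - 1, n2 - 1)     # two-tail p-value

# FSTAT = 1.7106 < 4.03, so H0 is not rejected
print(round(f_stat, 4), round(upper_crit, 2), round(p_value, 4))
```

Doubling the upper-tail area mirrors the two-tail decision rule with α/2 placed in the upper tail, which is why the computed p-value matches the 0.4361 the text reports.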
CHAPTER 10 | Two-Sample Tests
FIGURE 10.13 Excel F test results for the two different sales locations data.

In the Worksheet: The worksheet uses the F.INV.RT function to compute the upper critical value in cell B18 and the F.DIST.RT function to help compute the p-value.

One rejects the null hypothesis at the α level of significance if

FSTAT = MSB/MSE > Fα
where Fα is the upper-tail critical value from an F distribution with c − 1 and rc(n′ − 1) degrees of freedom.

To test the hypothesis of no interaction of factors A and B, one defines the null and alternative hypotheses

H0: The interaction of A and B is equal to zero
H1: The interaction of A and B is not equal to zero

and uses the FSTAT test statistic in Equation (11.15).
F TEST FOR INTERACTION EFFECT

FSTAT = MSAB/MSE   (11.15)

One rejects the null hypothesis at the α level of significance if

FSTAT = MSAB/MSE > Fα

where Fα is the upper-tail critical value from an F distribution with (r − 1)(c − 1) and rc(n′ − 1) degrees of freedom.

student TIP: In each of these F tests, the denominator of the FSTAT statistic is MSE.
11.2 Two-Way ANOVA
Table 11.7 presents the entire two-way ANOVA table.

TABLE 11.7 Analysis of variance table for the two-factor factorial design

Source   Degrees of Freedom   Sum of Squares   Mean Square (Variance)          F
A        r − 1                SSA              MSA = SSA/(r − 1)               FSTAT = MSA/MSE
B        c − 1                SSB              MSB = SSB/(c − 1)               FSTAT = MSB/MSE
AB       (r − 1)(c − 1)       SSAB             MSAB = SSAB/[(r − 1)(c − 1)]    FSTAT = MSAB/MSE
Error    rc(n′ − 1)           SSE              MSE = SSE/[rc(n′ − 1)]
Total    n − 1                SST
To illustrate two-way ANOVA, return to the Arlingtons scenario on page 352. As a member of the sales team, you first explored how different in-store locations might affect the sales of mobile electronics items using one-way ANOVA. Now, to explore the effects of permitting mobile payment methods to buy mobile electronics items, you design an experiment that examines this second factor, mobile payments (factor A), as it studies the effects of in-store location (factor B) using two-way ANOVA. Two-way ANOVA will allow you to determine if there is a significant difference in mobile electronics sales among the four in-store locations and whether permitting mobile payment methods makes a difference. To test the effects of the two factors, you conduct a 60-day experiment at 40 same-sized stores that have similar storewide net sales. You randomly assign ten stores to use the current in-aisle location, ten stores to use the special front location, ten stores to use the kiosk location, and ten stores to use the expert counter. In five stores in each of the four groups, you permit mobile payment methods (for the other five in each group, mobile payment methods are not permitted). At the end of the experiment, you organize the mobile electronics sales data by group and store the data in Mobile Electronics 2 . Table 11.8 presents the data of the experiment.
TABLE 11.8 Mobile electronics sales ($000) at four in-store locations with mobile payments permitted and not permitted

                             IN-STORE LOCATION
MOBILE PAYMENTS   In-Aisle    Front    Kiosk   Expert
No                   30.06    32.22    30.78    30.33
No                   29.96    31.47    30.91    30.29
No                   30.19    32.13    30.79    30.25
No                   29.96    31.86    30.95    30.25
No                   27.74    32.29    31.13    30.55
Yes                  30.66    32.81    31.34    31.03
Yes                  29.99    32.65    31.80    31.77
Yes                  30.73    32.81    32.00    30.97
Yes                  30.72    32.42    31.07    31.43
Yes                  30.73    33.12    31.69    30.72
CHAPTER 11 | Analysis of Variance

Figure 11.10 presents the results for this example. In the Excel worksheet, the A, B, and Error sources of variation in Table 11.7 on page 371 are labeled Sample, Columns, and Within, respectively.
FIGURE 11.10 Excel two-way ANOVA results for the in-store location sales and mobile payment experiment (cell summaries plus the ANOVA table: Sample FSTAT = 31.4760, Columns FSTAT = 44.3154, Interaction FSTAT = 0.2103, MSE = 0.2123)

² Table E.5 does not provide the upper-tail critical values from the F distribution with 32 degrees of freedom in the denominator. When the desired degrees of freedom are not provided in the table, use the p-value computed by Excel.
FIGURE 11.11 Regions of rejection and nonrejection at the 0.05 level of significance, with 3 and 32 degrees of freedom
To interpret the results, one starts by testing whether there is an interaction effect between factor A (mobile payments) and factor B (in-store locations). If the interaction effect is significant, further analysis will focus on this interaction. If the interaction effect is not significant, one can focus on the main effects: the potential effect of permitting mobile payment methods (factor A) and the potential differences among in-store locations (factor B). Using the 0.05 level of significance to determine whether there is evidence of an interaction effect, one rejects the null hypothesis of no interaction between mobile payments and in-store locations if the computed FSTAT statistic is greater than 2.9011, the upper-tail critical value from the F distribution with 3 and 32 degrees of freedom (see Figures 11.10 and 11.11).²
Because FSTAT = 0.2103 < 2.9011 or the p-value = 0.8885 > 0.05, one does not reject H0. One concludes that there is insufficient evidence of an interaction effect between the two factors, mobile payment and in-store location. Finding insufficient evidence of an interaction effect between factors enables one to continue and focus on the main effects.

Using the 0.05 level of significance and testing whether there is an effect due to mobile payment options (yes or no) (factor A), one rejects the null hypothesis if the computed FSTAT test statistic is greater than 4.1491, the upper-tail critical value from the F distribution with 1 and 32 degrees of freedom (see Figures 11.10 and 11.12). Because FSTAT = 31.4760 > 4.1491 or the p-value = 0.0000 < 0.05, one rejects H0. One concludes that there is evidence of a difference in the mean sales when mobile payment methods are permitted as compared to when they
are not. Because the mean sales when mobile payment methods are permitted is 31.523 and is 30.7055 when they are not, one can conclude that permitting mobile payment methods has led to an increase in mean sales.

FIGURE 11.12 Regions of rejection and nonrejection at the 0.05 level of significance, with 1 and 32 degrees of freedom (region of rejection above the critical value 4.1491)
Using the 0.05 level of significance and testing for a difference among the in-store locations (factor B), one rejects the null hypothesis of no difference if the computed FSTAT test statistic is greater than 2.9011, the upper-tail critical value from the F distribution with 3 degrees of freedom in the numerator and 32 degrees of freedom in the denominator (see Figures 11.10 and 11.11). Because FSTAT = 44.3154 > 2.9011 or the p-value = 0.0000 < 0.05, one rejects H0 and concludes that there is evidence of a difference in the mean sales among the four in-store locations.
Multiple Comparisons: The Tukey Procedure

If one or both of the factor effects are significant and there is no significant interaction effect, one can determine the particular levels that are significantly different by using the Tukey multiple comparisons procedure for two-way ANOVA when there are more than two levels of a factor (Kutner and Montgomery). Equation (11.16) gives the critical range for factor A.
CRITICAL RANGE FOR FACTOR A

Critical range = Qα √(MSE/(c n′))   (11.16)

where Qα is the upper-tail critical value from a Studentized range distribution having r and rc(n′ − 1) degrees of freedom. (Values for the Studentized range distribution are found in Table E.7.)

Equation (11.17) gives the critical range for factor B.

CRITICAL RANGE FOR FACTOR B

Critical range = Qα √(MSE/(r n′))   (11.17)

where Qα is the upper-tail critical value from a Studentized range distribution having c and rc(n′ − 1) degrees of freedom. (Values for the Studentized range distribution are found in Table E.7.)
To use the Tukey procedure, return to the Table 11.8 mobile electronics sales data on page 371. In the ANOVA summary table in Figure 11.10 on page 372, the interaction effect is not significant. Because there are only two categories for mobile payment (yes and no), there are no multiple comparisons to be constructed for factor A. Using α = 0.05, there is evidence of a significant difference among the four in-store locations that comprise factor B. Thus, one
can use the Tukey multiple comparisons procedure to determine which of the four in-store locations differ. Because there are four in-store locations, there are 4(4 − 1)/2 = 6 pairwise comparisons. Using the calculations presented in Figure 11.10, the absolute mean differences are as follows:

1. |X̄.1. − X̄.2.| = |30.074 − 32.378| = 2.304
2. |X̄.1. − X̄.3.| = |30.074 − 31.246| = 1.172
3. |X̄.1. − X̄.4.| = |30.074 − 30.759| = 0.685
4. |X̄.2. − X̄.3.| = |32.378 − 31.246| = 1.132
5. |X̄.2. − X̄.4.| = |32.378 − 30.759| = 1.619
6. |X̄.3. − X̄.4.| = |31.246 − 30.759| = 0.487
To determine the critical range, refer to Figure 11.10 to find MSE = 0.2123, r = 2, c = 4, and n′ = 5. From Table E.7 [for α = 0.05, c = 4, and rc(n′ − 1) = 32], Qα, the upper-tail critical value of the Studentized range distribution with 4 and 32 degrees of freedom, is approximately 3.84. Using Equation (11.17),

Critical range = 3.84 √(0.2123/10) = 0.5595
Because five of the six comparisons are greater than the critical range of 0.5595, one concludes that the population mean sales is different for the in-store locations except for the kiosk and expert locations. The front location is estimated to have higher mean sales than the other three in-store locations. The kiosk location is estimated to have higher mean sales than the in-aisle location. The expert location is estimated to have higher mean sales than the in-aisle location. Note that by using α = 0.05, one can make all six comparisons with an overall error rate of only 5%. Consistent with the results of the one-factor experiment, one has additional evidence that selling mobile electronics items at the front location will increase sales the most. In addition, one also has evidence that enabling mobile payment will also lead to increased sales.
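The Equation (11.17) calculation and the six pairwise comparisons above can be sketched in Python, using the MSE, Q, and location means given in the text (Q = 3.84 is taken as given from Table E.7):

```python
from math import sqrt
from itertools import combinations

# Tukey procedure for factor B (in-store locations): MSE and the four
# location means are from Figure 11.10; Q = 3.84 is the Table E.7 value.
means = {"In-Aisle": 30.074, "Front": 32.378, "Kiosk": 31.246, "Expert": 30.759}
mse, q, r, nprime = 0.2123, 3.84, 2, 5

critical_range = q * sqrt(mse / (r * nprime))        # Equation (11.17)
different = [(a, b) for a, b in combinations(means, 2)
             if abs(means[a] - means[b]) > critical_range]
print(round(critical_range, 4), len(different))
```

Five of the six pairs exceed the critical range of 0.5595; only the kiosk and expert pair does not, matching the conclusion above.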
Visualizing Interaction Effects: The Cell Means Plot

One can gain a better understanding of the interaction effect by plotting the cell means, the means of all possible factor-level combinations. Figure 11.13 presents a cell means plot that uses the cell means for the mobile payments permitted/in-store location combinations shown in Figure 11.10 on page 372. From the plot of the mean sales for each combination of mobile payments permitted and in-store location, observe that the two lines (representing the two levels of mobile payments, yes and no) are roughly parallel. This indicates that the difference between the mean sales for stores that permit mobile payment methods and those that do not is virtually the same for the four in-store locations. In other words, there is no interaction between these two factors, as was indicated by the F test.

FIGURE 11.13 Excel cell means plot for mobile electronics sales based on mobile payments permitted and in-store location
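A cell means plot in the style of Figure 11.13 can be drawn with matplotlib (a sketch, not the text's Excel chart); the cell means are the Figure 11.10 values:

```python
import matplotlib
matplotlib.use("Agg")              # render off-screen
import matplotlib.pyplot as plt

# Cell means from Figure 11.10, one line per mobile-payments level.
locations = ["In-Aisle", "Front", "Kiosk", "Expert"]
cell_means = {
    "No Mobile Payments": [29.582, 31.994, 30.912, 30.334],
    "Mobile Payments":    [30.566, 32.762, 31.580, 31.184],
}
for label, means in cell_means.items():
    plt.plot(locations, means, marker="o", label=label)
plt.ylabel("Mean Sales ($000)")
plt.title("Cell Means Plot for In-Store Location and Mobile Payments")
plt.legend()
plt.savefig("cell_means_plot.png")
# Roughly parallel lines suggest little or no interaction effect.
```

The vertical gap between the two lines stays within about 0.67 to 0.98 across the four locations, which is the near-parallelism the text describes.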
Interpreting Interaction Effects How does one interpret an interaction? When there is an interaction, some levels of factor A respond better with certain levels of factor B. For example, with respect to mobile electronics sales, suppose that some in-store locations were better when mobile payment methods were
permitted and other in-store locations were better when mobile payment methods were not permitted. If this were true, the lines of Figure 11.13 would not be nearly as parallel, and the interaction effect might be statistically significant. In such a situation, the difference between whether mobile payment methods were permitted is no longer the same for all in-store locations. Such an outcome would also complicate the interpretation of the main effects because differences in one factor (whether mobile payment methods were permitted) would not be consistent across the other factor (the in-store locations). Example 11.2 illustrates a situation with a significant interaction effect.
EXAMPLE 11.2 Interpreting Significant Interaction Effects

A nationwide company specializing in preparing students for college and graduate school entrance exams, such as the SAT, ACT, GRE, and LSAT, had the business objective of improving its ACT preparatory course. Two factors of interest to the company are the length of the course (a condensed 10-day period or a regular 30-day period) and the type of course (traditional classroom or online distance learning). The company collected data by randomly assigning 10 clients to each of the four cells that represent a combination of length of the course and type of course. Table 11.9 presents the results. What are the effects of the type of course and the length of the course on ACT scores?
TABLE 11.9 ACT scores for different types and lengths of courses

(Ten ACT scores for each combination of type of course, Traditional or Online, and length of course, Condensed or Regular; the cell means are Traditional/Condensed 21.4, Traditional/Regular 27.9, Online/Condensed 27.0, and Online/Regular 21.3.)
SOLUTION The Figure 11.14 cell means plot shows a strong interaction between the type of course and the length of the course. The nonparallel lines indicate that the effect of condensing the course depends on whether the course is taught in the traditional classroom or by online distance learning. The online mean score is higher when the course is condensed to a 10-day period, whereas the traditional mean score is higher when the course takes place over the regular 30-day period.

FIGURE 11.14 Excel cell means plot of mean ACT scores
To verify the visual analysis provided by interpreting the cell means plot, one tests whether there is a statistically significant interaction between factor A (length of course) and factor B (type of course). Using a 0.05 level of significance, one rejects the null hypothesis because FSTAT = 24.2569 > 4.1132 or the p-value equals 0.0000 < 0.05 (see Figure 11.15). Thus, the hypothesis test confirms the interaction evident in the cell means plot.
FIGURE 11.15 Excel two-way ANOVA results for the ACT scores (interaction: SS = 342.2250, df = 1, FSTAT = 24.2569, p-value = 0.0000; error: SS = 507.9000, df = 36, MSE = 14.1083)
Given that the interaction is significant, one can reanalyze the data with the two factors collapsed into four groups of a single factor rather than a two-way ANOVA with two levels of each of the two factors. One reorganizes the data as follows: Group 1 is traditional condensed, Group 2 is traditional regular, Group 3 is online condensed, and Group 4 is online regular. Figure 11.16 shows the results for these data.

FIGURE 11.16 Excel one-way ANOVA and Tukey-Kramer results for the ACT scores
Ll.878 4..5373 Means are dlffttent L1878 4.5373 Means are not different 1.1878 4.S373 Means are d1ffe 2.8663 or p-value = 0.0003 < 0.05, there is evidence of a significant difference in the four groups (traditional condensed, traditional regular, online condensed, and online regular). Using the Tukey-Kramer multiple comparisons procedure, traditional condensed is different from traditional regular and from online condensed. Traditional regular is also different from online regular, and online condensed is also different from online regular. Therefore, whether condensing a course is a good idea depends on whether the course is offered in a traditional classroom or as an online distance learning course. To ensure the highest mean ACT scores, the company should use the traditional approach for courses that are given over a 30-day period but use the online approach for courses that are condensed into a 10-day period.
PROBLEMS FOR SECTION 11.2

LEARNING THE BASICS

11.15 Consider a two-factor factorial design with three levels for factor A, three levels for factor B, and four replicates in each of the nine cells.
a. How many degrees of freedom are there in determining the factor A variation and the factor B variation?
b. How many degrees of freedom are there in determining the interaction variation?
c. How many degrees of freedom are there in determining the random variation?
d. How many degrees of freedom are there in determining the total variation?

11.16 Assume that you are working with the results from Problem 11.15, and SSA = 120, SSB = 110, SSE = 270, and SST = 540.
a. What is SSAB?
b. What are MSA and MSB?
c. What is MSAB?
d. What is MSE?

11.17 Assume that you are working with the results from Problems 11.15 and 11.16.
a. What is the value of the FSTAT test statistic for the interaction effect?
b. What is the value of the FSTAT test statistic for the factor A effect?
c. What is the value of the FSTAT test statistic for the factor B effect?
d. Form the ANOVA summary table and fill in all values in the body of the table.

11.18 Given the results from Problems 11.15 through 11.17,
a. at the 0.05 level of significance, is there an effect due to factor A?
b. at the 0.05 level of significance, is there an effect due to factor B?
c. at the 0.05 level of significance, is there an interaction effect?
11.19 Given a two-way ANOVA with two levels for factor A, five levels for factor B, and four replicates in each of the 10 cells, with SSA = 18, SSB = 64, SSE = 60, and SST = 150,
a. form the ANOVA summary table and fill in all values in the body of the table.
b. at the 0.05 level of significance, is there an effect due to factor A?
c. at the 0.05 level of significance, is there an effect due to factor B?
d. at the 0.05 level of significance, is there an interaction effect?

11.20 Given a two-factor factorial experiment and the ANOVA summary table that follows, fill in all the missing results:
Source    Degrees of Freedom      Sum of Squares    Mean Square (Variance)    F
A         r - 1 = 2               SSA = ?           MSA = 80                  FSTAT = ?
B         c - 1 = ?               SSB = 220         MSB = ?                   FSTAT = 11.0
AB        (r - 1)(c - 1) = 8      SSAB = ?          MSAB = 10                 FSTAT = ?
Error     rc(n' - 1) = 30         SSE = ?           MSE = ?
Total     n - 1 = ?               SST = ?
11.21 Given the results from Problem 11.20,
a. at the 0.05 level of significance, is there an effect due to factor A?
b. at the 0.05 level of significance, is there an effect due to factor B?
c. at the 0.05 level of significance, is there an interaction effect?
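Tables like the one in Problem 11.20 are completed by working the identities in both directions: the df entries pin down r, c, and n', each mean square equals its sum of squares divided by its df, and the sums of squares add up to SST. The sketch below illustrates the method on hypothetical numbers (deliberately not those of Problem 11.20):

```python
# Completing a two-way ANOVA summary table from partial information,
# using SST = SSA + SSB + SSAB + SSE and MS = SS / df.
# All numbers here are hypothetical, not the Problem 11.20 values.
r, c, n_prime = 3, 4, 2           # assumed design dimensions
SSA, SSB, SSE = 90.0, 36.0, 60.0  # assumed known entries
SST = 210.0

SSAB = SST - SSA - SSB - SSE          # recover the missing SS: 24.0

MSA = SSA / (r - 1)                   # 45.0
MSB = SSB / (c - 1)                   # 12.0
MSAB = SSAB / ((r - 1) * (c - 1))     # 4.0
MSE = SSE / (r * c * (n_prime - 1))   # 5.0

F_A, F_B, F_AB = MSA / MSE, MSB / MSE, MSAB / MSE
print(F_A, F_B, F_AB)  # 9.0 2.4 0.8
```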
APPLYING THE CONCEPTS

11.22 An experiment was conducted to study the extrusion process of biodegradable packaging foam. Two of the factors considered for their effect on the unit density (mg/ml) were the die temperature (145°C vs. 155°C) and the die diameter (3 mm vs. 4 mm). The results are stored in Packaging Foam 1.
Source: Data extracted from W. Y. Koh, K. M. Eskridge, and M. A. Hanna, "Supersaturated Split-Plot Designs," Journal of Quality Technology, 45, January 2013, pp. 61-72.
CHAPTER 11 | Analysis of Variance
At the 0.05 level of significance,
a. is there an interaction between die temperature and die diameter?
b. is there an effect due to die temperature?
c. is there an effect due to die diameter?
d. Plot the mean unit density for each die temperature for each die diameter.
e. What can you conclude about the effect of die temperature and die diameter on mean unit density?

11.23 Referring to Problem 11.22, the effect of die temperature and die diameter on the foam diameter was also measured and the results stored in Packaging Foam 2.
At the 0.05 level of significance,
a. is there an interaction between die temperature and die diameter?
b. is there an effect due to die temperature?
c. is there an effect due to die diameter?
d. Plot the mean foam diameter for each die temperature and die diameter.
e. What conclusions can you reach concerning the importance of each of these two factors on the foam diameter?
11.24 A plastic injection molding process is often used in manufacturing because of its ability to mold complicated shapes. An experiment was conducted on the manufacture of a television remote part, and the warpage (mm) of the part was measured and stored in the accompanying data file. Two factors were considered: the filling time (1, 2, or 3 sec) and the mold temperature (60, 72.5, or 85°C).
Source: Data extracted from M. A. Barghash and F. A. Alkaabneh, "Shrinkage and Warpage Detailed Analysis and Optimization for the Injection Molding Process Using Multistage Experimental Design," Quality Engineering, 26, 2014, pp. 319-334.
At the 0.05 level of significance,
a. is there an interaction between filling time and mold temperature?
b. is there an effect due to filling time?
c. is there an effect due to mold temperature?
d. Plot the mean warpage for each filling time for each mold temperature.
e. If appropriate, use the Tukey multiple comparison procedure to determine which of the filling times and mold temperatures differ.
f. Discuss the results of (a) through (e).

11.25 A glass manufacturing company wanted to investigate the effect of breakoff pressure and stopper height on the percentage of breaking off chips. The results, stored in the accompanying data file, were as follows:

                      BREAK OFF PRESSURE
STOPPER HEIGHT      Twenty      Twenty-Five
Two                 1.75        0.75
Two                 1.00        0.50
Two                 0.00        0.00
Two                 1.00        0.25
Three               2.25        1.50
Three               1.50        1.25
Three               0.25        0.25
Three               0.75        0.75

Source: K. Kumar and S. Yadav, "Breakthrough Solution," Six Sigma Forum Magazine, November 2016, pp. 7-22.

At the 0.05 level of significance,
a. is there an interaction between the breakoff pressure and the stopper height?
b. is there an effect due to the breakoff pressure?
c. is there an effect due to the stopper height?
d. Plot the percentage breakoff for each breakoff pressure for each stopper height.
e. Discuss the results of (a) through (d).

11.26 A glass manufacturing company wanted to investigate the effect of zone 1 lower temperature (630 vs. 650) and zone 3 upper temperature (695 vs. 715) on the roller imprint of glass. The results, stored in the accompanying data file, were as follows:

                  ZONE 3 UPPER
ZONE 1 LOWER    695      715
630             50       100
630             25       0
630             50       25
630             125      75
650             25       75
650             25       25
650             50       0
650             20       125

Source: K. Kumar and S. Yadav, "Breakthrough Solution," Six Sigma Forum Magazine, November 2016, pp. 7-22.

At the 0.05 level of significance,
a. is there an interaction between zone 1 lower and zone 3 upper?
b. is there an effect due to zone 1 lower?
c. is there an effect due to zone 3 upper?
d. Plot the roller imprint for each level of zone 1 lower for each level of zone 3 upper.
e. Discuss the results of (a) through (d).
11.3 The Randomized Block Design Section 10.2 discusses how to use the paired t test to evaluate the difference between the means of two groups when you have repeated measurements or matched samples. The randomized block design evaluates differences among more than two groups that contain matched samples or repeated measures that have been placed in blocks. The Section 11.3 online topic discusses this method and illustrates its use.
11.4 Fixed Effects, Random Effects, and Mixed Effects Models Sections 11.1 through 11.3 do not consider the distinction between how the levels of a factor were selected. The equation for the F test depends on whether the levels of a factor were specifically selected or randomly selected from a population. The Section 11.4 online topic presents the appropriate F tests to use when the levels of a factor are either specifically selected or randomly selected from a population of levels.
In the Arlingtons scenario, you sought to determine whether there were differences in mobile electronics sales among four in-store locations as well as to determine whether permitting mobile payments had an effect on those sales. Using the one-way ANOVA, you determined that there was a difference in the mean sales for the four in-store locations. Using the two-way ANOVA, you determined that there was no interaction between in-store location and permitting mobile payment methods and that mean sales were higher when mobile payment methods were permitted than when such methods were not. In addition, you concluded that the population mean sales is different for the four in-store locations and reached these other conclusions:

• The front location is estimated to have higher mean sales than the other three locations.
• The kiosk location is estimated to have higher mean sales than the current in-aisle location.
• The expert location is estimated to have higher mean sales than the current in-aisle location.

Your next step as a member of the sales team might be to further investigate the differences among the sales locations as well as to examine other factors that could influence mobile electronics sales.
SUMMARY
This chapter explains the statistical procedures used to analyze the effect of one or two factors of interest. The chapter details the assumptions required for each procedure and emphasizes the need to investigate the validity of the assumptions underlying the hypothesis-testing procedures. Table 11.10 summarizes the designs and procedures that this chapter discusses.

TABLE 11.10 Chapter 11 designs and procedures

Design                          Procedure
Completely randomized design    One-way analysis of variance (Section 11.1)
Factorial design                Two-way analysis of variance (Section 11.2)
Randomized block design         Randomized block design (online Section 11.3)
KEY EQUATIONS

Total Variation in One-Way ANOVA

SST = \sum_{j=1}^{c} \sum_{i=1}^{n_j} (X_{ij} - \bar{\bar{X}})^2    (11.1)

Among-Group Variation in One-Way ANOVA

SSA = \sum_{j=1}^{c} n_j (\bar{X}_j - \bar{\bar{X}})^2    (11.2)

Within-Group Variation in One-Way ANOVA

SSW = \sum_{j=1}^{c} \sum_{i=1}^{n_j} (X_{ij} - \bar{X}_j)^2    (11.3)

Mean Squares in One-Way ANOVA

MSA = \frac{SSA}{c - 1}    (11.4a)

MSW = \frac{SSW}{n - c}    (11.4b)

MST = \frac{SST}{n - 1}    (11.4c)

One-Way ANOVA F_{STAT} Test Statistic

F_{STAT} = \frac{MSA}{MSW}    (11.5)

Critical Range for the Tukey-Kramer Procedure

\text{Critical range} = Q_\alpha \sqrt{\frac{MSW}{2}\left(\frac{1}{n_j} + \frac{1}{n_{j'}}\right)}    (11.6)

Total Variation in Two-Way ANOVA

SST = \sum_{i=1}^{r} \sum_{j=1}^{c} \sum_{k=1}^{n'} (X_{ijk} - \bar{\bar{X}})^2    (11.7)

Factor A Variation in Two-Way ANOVA

SSA = cn' \sum_{i=1}^{r} (\bar{X}_{i..} - \bar{\bar{X}})^2    (11.8)

Factor B Variation in Two-Way ANOVA

SSB = rn' \sum_{j=1}^{c} (\bar{X}_{.j.} - \bar{\bar{X}})^2    (11.9)

Interaction Variation in Two-Way ANOVA

SSAB = n' \sum_{i=1}^{r} \sum_{j=1}^{c} (\bar{X}_{ij.} - \bar{X}_{i..} - \bar{X}_{.j.} + \bar{\bar{X}})^2    (11.10)

Random Variation in Two-Way ANOVA

SSE = \sum_{i=1}^{r} \sum_{j=1}^{c} \sum_{k=1}^{n'} (X_{ijk} - \bar{X}_{ij.})^2    (11.11)

Mean Squares in Two-Way ANOVA

MSA = \frac{SSA}{r - 1}    (11.12a)

MSB = \frac{SSB}{c - 1}    (11.12b)

MSAB = \frac{SSAB}{(r - 1)(c - 1)}    (11.12c)

MSE = \frac{SSE}{rc(n' - 1)}    (11.12d)

F Test for Factor A Effect

F_{STAT} = \frac{MSA}{MSE}    (11.13)

F Test for Factor B Effect

F_{STAT} = \frac{MSB}{MSE}    (11.14)

F Test for Interaction Effect

F_{STAT} = \frac{MSAB}{MSE}    (11.15)

Critical Range for Factor A

\text{Critical range} = Q_\alpha \sqrt{\frac{MSE}{cn'}}    (11.16)

Critical Range for Factor B

\text{Critical range} = Q_\alpha \sqrt{\frac{MSE}{rn'}}    (11.17)
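The one-way quantities in Equations (11.1) through (11.5) can be checked numerically. The sketch below computes SST, SSA, and SSW on a small made-up three-group sample and cross-checks the resulting F statistic against scipy's built-in one-way ANOVA:

```python
import numpy as np
from scipy import stats

# Verifying Equations (11.1)-(11.5) on a small made-up three-group sample.
groups = [np.array([4.0, 6.0, 5.0]),
          np.array([7.0, 9.0, 8.0]),
          np.array([6.0, 5.0, 7.0])]

all_x = np.concatenate(groups)
n, c = all_x.size, len(groups)
grand_mean = all_x.mean()

SST = ((all_x - grand_mean) ** 2).sum()                           # (11.1)
SSA = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)  # (11.2)
SSW = sum(((g - g.mean()) ** 2).sum() for g in groups)            # (11.3)
assert np.isclose(SST, SSA + SSW)   # the variation partitions exactly

MSA = SSA / (c - 1)                 # (11.4a)
MSW = SSW / (n - c)                 # (11.4b)
F_stat = MSA / MSW                  # (11.5)

f_scipy, p_scipy = stats.f_oneway(*groups)
assert np.isclose(F_stat, f_scipy)  # hand computation matches scipy
print(round(F_stat, 4), round(p_scipy, 4))  # 7.0 0.027
```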
KEY TERMS
among-group variation 353
analysis of variance (ANOVA) 353
ANOVA summary table 357
cell means 374
completely randomized design 353
critical range 362
F distribution 356
factor 353
factorial design 353
grand mean, \bar{\bar{X}} 354
groups 353
homogeneity of variance 360
interaction 367
levels 353
Levene test 361
main effect 372
mean squares 355, 369
multiple comparisons 362
normality 360
one-way ANOVA 353
randomized block design 353
randomness and independence 360
replicates 367
Studentized range distribution 362
sum of squares among groups (SSA) 355
sum of squares due to factor A (SSA) 368
sum of squares due to factor B (SSB) 368
sum of squares due to interaction (SSAB) 368
sum of squares error (SSE) 369
sum of squares total (SST) 353
sum of squares within groups (SSW) 355
total variation 354
Tukey multiple comparisons procedure for two-way ANOVA 373
Tukey-Kramer multiple comparisons procedure for one-way ANOVA 362
two-way ANOVA 367
within-group variation 355
CHECKING YOUR UNDERSTANDING
11.27 In a one-way ANOVA, what is the difference between the among-groups variance MSA and the within-groups variance MSW?
11.28 What are the distinguishing features of the completely randomized design and two-factor factorial designs?
11.29 What are the assumptions of ANOVA?
11.30 Under what conditions should you use the one-way ANOVA F test to examine possible differences among the means of c independent populations?
11.31 When and how should you use multiple comparison procedures for evaluating pairwise combinations of the group means?
11.32 What is the difference between the one-way ANOVA F test and the Levene test?
11.33 Under what conditions should you use the two-way ANOVA F test to examine possible differences among the means of each factor in a factorial design?
11.34 What is meant by the concept of interaction in a two-factor factorial design?
11.35 How can you determine whether there is an interaction in the two-factor factorial design?
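One concrete answer to Question 11.35 is that the cell-mean profiles are parallel exactly when the interaction sum of squares is zero. The sketch below, on made-up numbers for a 2 x 2 design with two replicates per cell, computes SSAB directly from Equation (11.10):

```python
import numpy as np

# Checking for interaction by computing SSAB from Equation (11.10).
# data[i][j] holds the n' = 2 replicates for level i of factor A and
# level j of factor B. All values are made up for illustration.
data = np.array([[[10.0, 12.0], [20.0, 22.0]],
                 [[14.0, 16.0], [16.0, 18.0]]])
r, c, n_prime = data.shape

grand = data.mean()
a_means = data.mean(axis=(1, 2))   # row means, X-bar_i..
b_means = data.mean(axis=(0, 2))   # column means, X-bar_.j.
cell_means = data.mean(axis=2)     # cell means, X-bar_ij.

SSAB = n_prime * sum(
    (cell_means[i, j] - a_means[i] - b_means[j] + grand) ** 2
    for i in range(r) for j in range(c))

# The mean profiles here are not parallel (B changes A1's mean by 10
# but A2's mean by only 2), so SSAB is positive.
print(round(SSAB, 4))  # 32.0
```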
CHAPTER REVIEW PROBLEMS
11.36 You are the production manager at a parachute manufacturing company. Parachutes are woven in your factory using a synthetic fiber purchased from one of four different suppliers. The strength of these fibers is an important characteristic that ensures quality parachutes. You need to decide whether the synthetic fibers from each of your four suppliers result in parachutes of equal strength. Furthermore, to produce parachutes your factory uses two types of looms, the Jetta and the Turk. You need to determine if the parachutes woven on each type of loom are equally strong. You also want to know if any differences in the strength of the parachute that can be attributed to the four suppliers depend on the type of loom used. You conduct an experiment in which five different parachutes from each supplier are manufactured on each of the two different looms and collect and store the data in Parachute Two-way.
At the 0.05 level of significance,
a. is there an interaction between supplier and loom?
b. is there an effect due to loom?
c. is there an effect due to supplier?
d. Plot the mean strength for each supplier for each loom.
e. If appropriate, use the Tukey procedure to determine differences between suppliers.
f. Repeat the analysis, using the file with suppliers as the only factor. Compare your results to those of (e).
11.37 Medical wires are used in the manufacture of cardiovascular devices. A study was conducted to determine the effect of several factors on the ratio of the load on a test specimen (YS) to the ultimate tensile strength (UTS). The accompanying data file contains the study results, which examined factors including the machine (W95 vs. W96) and the reduction angle (narrow vs. wide).
Source: Data extracted from B. Nepal, S. Mohanty, and L. Kay, "Quality Improvement of Medical Wire Manufacturing Process," Quality Engineering, 25, 2013, pp. 151-163.
At the 0.05 level of significance,
a. is there an interaction between machine type and reduction angle?
b. is there an effect due to machine type?
c. is there an effect due to reduction angle?
d. Plot the mean ratio of the load on a test specimen (YS) to the ultimate tensile strength (UTS) for each machine type for each reduction angle.
e. What can you conclude about the effects of machine type and reduction angle on the ratio of the load on a test specimen (YS) to the ultimate tensile strength (UTS)? Explain.
f. Repeat the analysis, using reduction angle as the only factor and the accompanying data file. Compare your results to those of (c) and (e).
11.38 An operations manager wants to examine the effect of air-jet pressure (in pounds per square inch [psi]) on the breaking strength of yarn. Three different levels of air-jet pressure are to be considered: 30 psi, 40 psi, and 50 psi. A random sample of 18 yarns is selected from the same batch, and the yarns are randomly assigned, 6 each, to the 3 levels of air-jet pressure. The breaking-strength scores are stored in the accompanying data file.
a. Is there evidence of a significant difference in the variances of the breaking strengths for the three air-jet pressures? (Use α = 0.05.)
b. At the 0.05 level of significance, is there evidence of a difference among mean breaking strengths for the three air-jet pressures?
c. If appropriate, use the Tukey-Kramer procedure to determine which air-jet pressures significantly differ with respect to mean breaking strength. (Use α = 0.05.)
d. What should the operations manager conclude?

11.39 Suppose that, when setting up the experiment in Problem 11.38, the operations manager is able to study the effect of side-to-side aspect in addition to air-jet pressure. Thus, instead of the one-factor completely randomized design in Problem 11.38, a two-factor factorial design was used, with the first factor, side-to-side aspect, having two levels (nozzle and opposite) and the second factor, air-jet pressure, having three levels (30 psi, 40 psi, and 50 psi). A sample of 18 yarns is randomly assigned, three to each of the six side-to-side aspect and pressure level combinations. The breaking-strength scores, stored in the accompanying data file, are as follows:

                         AIR-JET PRESSURE
SIDE-TO-SIDE ASPECT    30 psi    40 psi    50 psi
Nozzle                 25.5      24.8      23.2
Nozzle                 24.9      23.7      23.7
Nozzle                 26.1      24.4      22.7
Opposite               24.7      23.6      22.6
Opposite               24.2      23.3      22.8
Opposite               23.6      21.4      24.9
At the 0.05 level of significance,
a. is there an interaction between side-to-side aspect and air-jet pressure?
b. is there an effect due to side-to-side aspect?
c. is there an effect due to air-jet pressure?
d. Plot the mean yarn breaking strength for each level of side-to-side aspect for each level of air-jet pressure.
e. If appropriate, use the Tukey procedure to study differences among the air-jet pressures.
f. On the basis of the results of (a) through (e), what conclusions can you reach concerning yarn breaking strength? Discuss.
g. Compare your results in (a) through (f) with those from the completely randomized design in Problem 11.38. Discuss fully.

11.40 A hotel wanted to develop a new system for delivering room service breakfasts. In the current system, an order form is left on the bed in each room. If the customer wishes to receive a room service breakfast, he or she places the order form on the doorknob before 11 P.M. The current system requires customers to select a 15-minute interval for desired delivery time (6:30-6:45 A.M., 6:45-7:00 A.M., etc.). The new system is designed to allow the customer to request a specific delivery time. The hotel wants to measure the difference (in minutes) between the actual delivery time and the requested delivery time of room service orders for breakfast. (A negative time means that the order was delivered before the requested time. A positive time means that the order was delivered after the requested time.) The factors included were the menu choice (American or Continental) and the desired time period in which the order was to be delivered (Early Time Period [6:30-8:00 A.M.] or Late Time Period [8:00-9:30 A.M.]). Ten orders for each combination of menu choice and desired time period were studied on a particular day. The data, stored in the accompanying data file, are as follows:

                           DESIRED TIME
TYPE OF BREAKFAST    Early Time Period    Late Time Period
Continental          1.2                  -2.5
Continental          2.1                   3.0
Continental          3.3                  -0.2
Continental          4.4                   1.2
Continental          3.4                   1.2
Continental          5.3                   0.7
Continental          2.2                  -1.3
Continental          1.0                   0.2
Continental          5.4                  -0.5
Continental          1.4                   3.8
American             4.4                   6.0
American             1.1                   2.3
American             4.8                   4.2
American             7.1                   3.8
American             6.7                   5.5
American             5.6                   1.8
American             9.5                   5.1
American             4.1                   4.2
American             7.9                   4.9
American             9.4                   4.0
At the 0.05 level of significance,
a. is there an interaction between type of breakfast and desired time?
b. is there an effect due to type of breakfast?
c. is there an effect due to desired time?
d. Plot the mean delivery time difference for each desired time for each type of breakfast.
e. On the basis of the results of (a) through (d), what conclusions can you reach concerning delivery time difference? Discuss.

11.41 Refer to the room service experiment in Problem 11.40. Now suppose that the results are as shown below and stored in Breakfast Results. Repeat (a) through (e), using these data, and compare the results to those of (a) through (e) of Problem 11.40.

                           DESIRED TIME
TYPE OF BREAKFAST    Early    Late
Continental          1.2      -0.5
Continental          2.1       5.0
Continental          3.3       1.8
Continental          4.4       3.2
Continental          3.4       3.2
Continental          5.3       2.7
Continental          2.2       0.7
Continental          1.0       2.2
Continental          5.4       1.5
Continental          1.4       5.8
American             4.4       6.0
American             1.1       2.3
American             4.8       4.2
American             7.1       3.8
American             6.7       5.5
American             5.6       1.8
American             9.5       5.1
American             4.1       4.2
American             7.9       4.9
American             9.4       4.0
11.42 A pet food company has the business objective of having the weight of a can of cat food come as close to the specified weight as possible. Realizing that the size of the pieces of meat contained in a can and the can fill height could impact the weight of a can, a team studying the weight of canned cat food wondered whether the current larger chunk size produced higher can weight and more variability. The team decided to study the effect on weight of a cutting size that was finer than the current size. In addition, the team slightly lowered the target for the sensing mechanism that determines the fill height in order to determine the effect of the fill height on can weight. Twenty cans were filled for each of the four combinations of piece size (fine and current) and fill height (low and current). The contents of each can were weighed, and the amount above or below the label weight of 3 ounces was recorded as the variable coded weight. For example, a can containing 2.90 ounces was given a coded weight of -0.10. Results were stored in the accompanying data file. Analyze these data and write a report for presentation to the team. Indicate the importance of the piece size and the fill height on the weight of the canned cat food. Be sure to include a recommendation for the level of each factor that will come closest to meeting the target weight and the limitations of this experiment, along with recommendations for future experiments that might be undertaken.
CHAPTER 11 CASES
Managing Ashland MultiComm Services

PHASE 1
The computer operations department had a business objective of reducing the amount of time to fully update each subscriber's set of messages in a special secured email system. An experiment was conducted in which 24 subscribers were selected and three different messaging systems were used. Eight subscribers were assigned to each system, and the update times were measured. The results, stored in the accompanying data file, are presented in Table AMS11.1.

TABLE AMS11.1 Update times (in seconds) for three different systems

System 1    System 2    System 3
38.8        41.8        32.9
42.1        36.4        36.1
45.2        39.1        39.2
34.8        28.7        29.3
48.3        36.4        41.9
37.8        36.1        31.7
41.1        35.8        35.2
43.6        33.7        38.1
1. Analyze the data in Table AMS11.1 and write a report to the computer operations department that indicates your findings. Include an appendix in which you discuss the reason you selected a particular statistical test to compare the three email interfaces.

DO NOT CONTINUE UNTIL THE PHASE 1 EXERCISE HAS BEEN COMPLETED.
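Although this text carries out such analyses in Excel, a reader working in Python could begin the Phase 1 comparison along these lines, checking the equal-variance assumption before comparing the three means (the choice of tests, and the write-up, remain the exercise):

```python
import numpy as np
from scipy import stats

# A sketch of how the Phase 1 comparison might begin in Python.
# Update times (seconds) from Table AMS11.1.
system1 = [38.8, 42.1, 45.2, 34.8, 48.3, 37.8, 41.1, 43.6]
system2 = [41.8, 36.4, 39.1, 28.7, 36.4, 36.1, 35.8, 33.7]
system3 = [32.9, 36.1, 39.2, 29.3, 41.9, 31.7, 35.2, 38.1]

# Levene test for homogeneity of variance, then one-way ANOVA.
lev_stat, lev_p = stats.levene(system1, system2, system3)
f_stat, p_value = stats.f_oneway(system1, system2, system3)

print([round(np.mean(g), 4) for g in (system1, system2, system3)])
print(round(f_stat, 4), round(p_value, 4))
```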
PHASE 2
After analyzing the data in Table AMS11.1, the computer operations department team decided to also study the effect of the connection media used (cable or fiber). The team designed a study in which a total of 30 subscribers were chosen. The subscribers were randomly assigned to one of the three messaging systems so that there were five subscribers in each of the six combinations of the two factors (messaging system and media used). Measurements were taken on the update time. Table AMS11.2 summarizes the results, which are stored in the accompanying data file.
TABLE AMS11.2 Update times (in seconds), based on messaging system and media used

MEDIA    System 1    System 2    System 3
Cable    45.6        41.7        35.3
Cable    49.0        42.8        37.7
Cable    41.8        40.0        41.0
Cable    35.6        39.6        28.7
Cable    43.4        36.0        31.8
Fiber    44.1        37.9        43.3
Fiber    40.8        41.1        40.0
Fiber    46.9        35.8        43.1
Fiber    51.8        45.3        39.6
Fiber    48.5        40.2        33.2

2. Completely analyze these data and write a report to the team that indicates the importance of each of the two factors and/or the interaction between them on the update time. Include recommendations for future experiments to perform.

Digital Case
Apply your knowledge about ANOVA in this Digital Case, which continues the cereal-fill packaging dispute from Chapters 7, 9, and 10.
After reviewing CCACC's latest document (see the Digital Case for Chapter 10 on page 345), Oxford Cereals has released Second Analysis.pdf, a press kit that Oxford Cereals has assembled to refute the claim that it is guilty of using selective data. Review the Oxford Cereals press kit and then answer the following questions.
1. Does Oxford Cereals have a legitimate argument? Why or why not?
2. Assuming that the samples Oxford Cereals has posted were randomly selected, perform the appropriate analysis to resolve the ongoing weight dispute.
3. What conclusions can you reach from your results? If you were called as an expert witness, would you support the claims of the CCACC or the claims of Oxford Cereals? Explain.

Sure Value Convenience Stores
You work in the corporate office for a nationwide convenience store franchise that operates nearly 10,000 stores. The per-store daily customer count (the mean number of customers in a store in one day) has been steady, at 900, for some time. To increase the customer count, the chain is considering cutting prices for coffee beverages. The question to be determined is how much to cut prices to increase the daily customer count without reducing the gross margin on coffee sales too much. You decide to carry out an experiment in a sample of 24 stores where customer counts have been running almost exactly at the national average of 900. In 6 of the stores, the price of a small coffee will now be $0.59; in 6 stores, $0.69; in 6 stores, $0.79; and in 6 stores, $0.89. After four weeks of selling the coffee at the new prices, the daily customer counts in the stores were recorded and stored in the accompanying data file.
1. Analyze the data and determine whether there is evidence of a difference in the daily customer count, based on the price of a small coffee.
2. If appropriate, determine which mean prices differ in daily customer counts.
3. What price do you recommend for a small coffee?

CardioGood Fitness
Return to the CardioGood Fitness case (stored in CardioGood Fitness), first presented on page 33.
1. Determine whether differences exist between customers based on the product purchased (TM195, TM498, TM798), in their age in years, education in years, annual household income ($), mean number of times the customer plans to use the treadmill each week, and mean number of miles the customer expects to walk or run each week.
2. Write a report to be presented to the management of CardioGood Fitness detailing your findings.

More Descriptive Choices Follow-Up
Follow up the Using Statistics scenario "More Descriptive Choices, Revisited" on page 147 by determining whether there is a difference between the small, mid-cap, and large market cap funds in the one-year return percentages, five-year return percentages, and ten-year return percentages (stored in Retirement Funds).

Clear Mountain State Student Survey
The Student News Service at Clear Mountain State University (CMSU) has decided to gather data about the undergraduate students who attend CMSU. They create and distribute a survey of 14 questions and receive responses from 111 undergraduates (stored in Student Survey).
1. At the 0.05 level of significance, is there evidence of a difference based on academic major in expected starting salary, number of social networking sites registered for, age, spending on textbooks and supplies, text messages sent in a week, and the wealth needed to feel rich?
2. At the 0.05 level of significance, is there evidence of a difference based on graduate school intention in grade point average, expected starting salary, number of social networking sites registered for, age, spending on textbooks and supplies, text messages sent in a week, and the wealth needed to feel rich?
EXCEL GUIDE

EG11.1 The COMPLETELY RANDOMIZED DESIGN: ONE-WAY ANOVA

Analyzing Variation in One-Way ANOVA
Key Technique Use the Section EG2.5 instructions to construct scatter plots using stacked data. If necessary, change the levels of the factor to consecutive integers beginning with 1, as was done for the Figure 11.3 in-store location sales experiment data on page 357.

F Test for Differences Among More Than Two Means
Key Technique Use the DEVSQ(cell range of data of all groups) function to compute SST, and use an expression in the form SST - DEVSQ(group 1 data cell range) - DEVSQ(group 2 data cell range) ... - DEVSQ(group n data cell range) to compute SSA.
Example Perform the Figure 11.6 one-way ANOVA for the in-store location sales experiment on page 360.

PHStat Use One-Way ANOVA. For the example, open to the DATA worksheet of the Mobile Electronics workbook. Select PHStat → Multiple-Sample Tests → One-Way ANOVA. In the procedure's dialog box:
1. Enter 0.05 as the Level of Significance.
2. Enter A1:D6 as the Group Data Cell Range.
3. Check First cells contain label.
4. Enter a Title, clear the Tukey-Kramer Procedure check box, and click OK.
In addition to the worksheet shown in Figure 11.6, this procedure creates an ASFData worksheet to hold the data used for the test. See the following Workbook section for a complete description of this worksheet.

Workbook Use the COMPUTE worksheet of the One-Way ANOVA workbook as a template. The COMPUTE worksheet uses the ASFData worksheet that already contains the data for the example. Modifying the COMPUTE worksheet for other problems involves multiple steps and is more complex than template modifications discussed in earlier chapters. To modify the One-Way ANOVA workbook for other problems, first paste the data for the new problem into the ASFData worksheet, overwriting the in-store location sales data. Then, in the COMPUTE worksheet (shown in Figure 11.6):
1. Edit the SST formula =DEVSQ(ASFData!A1:D6) in cell B16 to use the cell range of the new data just pasted into the ASFData worksheet.
2. Edit the cell B13 SSA formula so there are as many DEVSQ(group column cell range) terms as there are groups.
3. Change the level of significance in cell G17, if necessary.
4. If the problem contains three groups, select row 8, right-click, and select Delete from the shortcut menu. If the problem contains more than four groups, select row 8, right-click, and click Insert from the shortcut menu. Repeat this step as many times as necessary.
5. If you inserted new rows, enter (not copy) the formulas for those rows, using the formulas in row 7 as models.
6. Adjust table formatting as necessary.
To see the arithmetic formulas that the COMPUTE worksheet uses, not shown in Figure 11.6, open to the COMPUTE_FORMULAS worksheet.

Analysis ToolPak Use Anova: Single Factor. For the example, open to the DATA worksheet of the Mobile Electronics workbook and:
1. Select Data → Data Analysis.
2. In the Data Analysis dialog box, select Anova: Single Factor from the Analysis Tools list and then click OK.
In the procedure's dialog box:
3. Enter A1:D6 as the Input Range.
4. Click Columns, check Labels in First Row, and enter 0.05 as Alpha.
5. Click New Worksheet Ply.
6. Click OK.
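For readers working outside Excel, the DEVSQ-based decomposition the guide describes can be mirrored in a few lines of Python. The group values below are made-up stand-ins for the A1:D6 location columns:

```python
# Excel's DEVSQ(range) returns the sum of squared deviations about the
# range's mean. The guide's "SST - DEVSQ(...) - DEVSQ(...)" pattern for
# SSA works because SST = SSA + SSW and each group's DEVSQ is its
# within-group sum of squares.
def devsq(values):
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values)

# Made-up stand-in data for three groups (columns).
groups = [[30.0, 32.0, 31.0], [25.0, 27.0, 26.0], [29.0, 28.0, 30.0]]
all_values = [x for g in groups for x in g]

SST = devsq(all_values)                # DEVSQ over all the data
SSW = sum(devsq(g) for g in groups)    # sum of per-group DEVSQs
SSA = SST - SSW                        # the worksheet's SSA formula
print(round(SSA, 4), round(SSW, 4), round(SST, 4))  # 38.0 6.0 44.0
```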
otherwise, do not reject H0. Table 13.6 organizes the complete set of results into an analysis of variance (ANOVA) table.

TABLE 13.6 ANOVA table for testing the significance of a regression coefficient

Source        df       Sum of Squares    Mean Square (Variance)    F
Regression    1        SSR               MSR = SSR/1 = SSR         FSTAT = MSR/MSE
Error         n - 2    SSE               MSE = SSE/(n - 2)
Total         n - 1    SST

Figure 13.19, the completed ANOVA table for the Sunflowers Apparel sales data (and part of Figure 13.4), shows that the computed FSTAT test statistic is 66.8792 and the p-value is 0.0000 (or less than 0.0001).
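The quantities in Table 13.6 can be built directly from any fitted line. The sketch below uses a small made-up (x, y) sample and verifies that SSR + SSE = SST and that FSTAT = MSR/MSE:

```python
import numpy as np

# Building the Table 13.6 quantities for a small made-up (x, y) sample.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])
n = x.size

b1, b0 = np.polyfit(x, y, 1)   # least-squares slope and intercept
y_hat = b0 + b1 * x

SST = ((y - y.mean()) ** 2).sum()       # total variation
SSR = ((y_hat - y.mean()) ** 2).sum()   # regression variation
SSE = ((y - y_hat) ** 2).sum()          # error variation
assert np.isclose(SST, SSR + SSE)       # the partition in Table 13.6

MSR = SSR / 1
MSE = SSE / (n - 2)
F_stat = MSR / MSE
print(round(F_stat, 2))  # 1521.0
```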
FIGURE 13.19 Excel F test results for the Sunflowers Apparel data

[ANOVA: Regression df = 1, SS = 66.7854, MS = 66.7854, F = 66.8792, Significance F = 0.0000; Residual df = 12, SS = 11.9832, MS = 0.9986; Total df = 13, SS = 78.7686. Coefficients: Intercept = -1.2088, standard error = 0.9949, tSTAT = -1.2151, p-value = 0.2477, 95% limits -3.3765 to 0.9588; Profiled Customers = 2.0742, standard error = 0.2536, tSTAT = 8.1780, p-value = 0.0000, 95% limits 1.5216 to 2.6268.]
Using a level of significance of 0.05, from Table E.5, the critical value of the F distribution, with 1 and 12 degrees of freedom, is 4.75 (see Figure 13.20). Because FSTAT = 66.8792 > 4.75, or because the p-value = 0.0000 < 0.05, one rejects H0 and concludes that there is a significant linear relationship between the number of profiled customers and annual sales. Because the F test in Equation (13.17) on page 455 is equivalent to the t test in Equation (13.16) on page 454, one reaches the same conclusion using that other test.
FIGURE 13.20 Regions of rejection and nonrejection when testing for the significance of the slope at the 0.05 level of significance, with 1 and 12 degrees of freedom

[The region of nonrejection lies below the critical value of 4.75; the region of rejection lies above it.]
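The decision rule pictured in Figure 13.20 can be checked numerically. The sketch below uses Python's scipy as an alternative to Table E.5 (it is not part of the text's Excel workflow) to recompute the critical value and p-value from the reported FSTAT, and confirms the equivalence of the F test and the t test:

```python
from scipy import stats

f_stat = 66.8792            # FSTAT reported in Figure 13.19
df1, df2 = 1, 12            # 1 regression df and n - 2 = 12 error df

f_crit = stats.f.ppf(0.95, df1, df2)     # upper-tail critical value, about 4.75
p_value = stats.f.sf(f_stat, df1, df2)   # far below 0.0001
reject_h0 = f_stat > f_crit              # True: the slope is significant

# With 1 numerator df, FSTAT equals the square of the slope's tSTAT
t_stat = 8.1780                          # from the Figure 13.19 coefficients table
```

Because tSTAT squared (8.1780 squared, about 66.88) matches FSTAT, both tests lead to the same rejection of H0.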
13.7 Inferences About the Slope and Correlation Coefficient  457
Confidence Interval Estimate for the Slope In addition to testing for the existence of a linear relationship between the variables, one can construct a confidence interval estimate of β1 using Equation (13.18). Construct the confidence interval estimate for the population slope by taking the sample slope, b1, and adding and subtracting the critical t value multiplied by the standard error of the slope.
CONFIDENCE INTERVAL ESTIMATE OF THE SLOPE, β1

b1 ± tα/2 Sb1

b1 - tα/2 Sb1 ≤ β1 ≤ b1 + tα/2 Sb1    (13.18)

where tα/2 = critical value corresponding to an upper-tail probability of α/2 from the t distribution with n - 2 degrees of freedom (i.e., a cumulative area of 1 - α/2)
From the Figure 13.17 results on page 455,

b1 = 2.0742    n = 14    Sb1 = 0.2536

To construct a 95% confidence interval estimate, α/2 = 0.025, and from Table E.3, tα/2 = 2.1788. Thus,

b1 ± tα/2 Sb1 = 2.0742 ± (2.1788)(0.2536) = 2.0742 ± 0.5526

1.5216 ≤ β1 ≤ 2.6268
Therefore, one has 95% confidence that the population slope is between 1.5216 and 2.6268. The confidence interval indicates that for each increase of 1 million profiled customers, predicted annual sales are estimated to increase by at least $1,521,600 but no more than $2,626,800. Because both of these values are above 0, one has evidence of a significant linear relationship between annual sales and the number of profiled customers. Had the interval included 0, one would have concluded that there is no evidence of a significant linear relationship between the variables.
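The interval can be reproduced with scipy's t distribution in place of Table E.3; only the b1, Sb1, and n values reported earlier are needed:

```python
from scipy import stats

b1, s_b1, n = 2.0742, 0.2536, 14            # slope, its standard error, sample size

t_crit = stats.t.ppf(1 - 0.05 / 2, n - 2)   # 2.1788 for 12 degrees of freedom
half_width = t_crit * s_b1
lower, upper = b1 - half_width, b1 + half_width
# Matches 1.5216 <= beta1 <= 2.6268 from the text, to rounding
```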
t Test for the Correlation Coefficient Section 3.5 notes that the strength of the relationship between two numerical variables can be measured using the correlation coefficient, r. The values of the coefficient of correlation range from -1 for a perfect negative correlation to +1 for a perfect positive correlation. One uses the correlation coefficient to determine whether there is a statistically significant linear relationship between X and Y. To do so, one hypothesizes that the population correlation coefficient, ρ, is 0. Thus, the null and alternative hypotheses are

H0: ρ = 0 (no correlation)
H1: ρ ≠ 0 (correlation)

Equation (13.19a) defines the test statistic for determining the existence of a significant correlation.
458 CHAPTER 13 | Simple Linear Regression
TESTING FOR THE EXISTENCE OF CORRELATION

tSTAT = (r - ρ) / √[(1 - r²)/(n - 2)]    (13.19a)

where

r = +√r²  if b1 > 0
r = -√r²  if b1 < 0

Because b1 > 0, the correlation coefficient for annual sales and profiled customers is the positive square root of r², that is, r = +√0.8479 = +0.9208. Using Equation (13.19a) to test the null hypothesis that there is no correlation between these two variables results in

tSTAT = (r - 0) / √[(1 - r²)/(n - 2)] = (0.9208 - 0) / √[(1 - (0.9208)²)/(14 - 2)] = 8.178
Using the 0.05 level of significance, because tSTAT = 8.178 > 2.1788, one rejects the null hypothesis. One concludes that there is a significant correlation between annual sales and the number of profiled customers. This tSTAT test statistic is the same value as the tSTAT test statistic calculated when testing whether the population slope, β1, is equal to zero.
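Equation (13.19a) is a one-line computation; the sketch below repeats the calculation and adds the exact p-value that the table lookup only approximates:

```python
import math
from scipy import stats

r, n = 0.9208, 14
t_stat = (r - 0) / math.sqrt((1 - r ** 2) / (n - 2))   # Equation (13.19a), about 8.178
t_crit = stats.t.ppf(0.975, n - 2)                     # 2.1788
p_value = 2 * stats.t.sf(abs(t_stat), n - 2)           # two-tailed p-value
reject_h0 = abs(t_stat) > t_crit                       # True: significant correlation
```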
PROBLEMS FOR SECTION 13.7

LEARNING THE BASICS

13.39 You are testing the null hypothesis that there is no linear relationship between two variables, X and Y. From your sample of n = 10, you determine that r = 0.80.
a. What is the value of the t test statistic tSTAT?
b. At the α = 0.05 level of significance, what are the critical values?
c. Based on your answers to (a) and (b), what statistical decision should you make?
13.40 You are testing the null hypothesis that there is no linear relationship between two variables, X and Y. From your sample of n = 18, you determine that b1 = +4.5 and Sb1 = 1.5.
a. What is the value of tSTAT?
b. At the α = 0.05 level of significance, what are the critical values?
c. Based on your answers to (a) and (b), what statistical decision should you make?
d. Construct a 95% confidence interval estimate of the population slope, β1.
13.41 You are testing the null hypothesis that there is no linear relationship between two variables, X and Y. From your sample of n = 20, you determine that SSR = 60 and SSE = 40.
a. What is the value of FSTAT?
b. At the α = 0.05 level of significance, what is the critical value?
c. Based on your answers to (a) and (b), what statistical decision should you make?
d. Compute the correlation coefficient by first computing r² and assuming that b1 is negative.
e. At the 0.05 level of significance, is there a significant correlation between X and Y?
APPLYING THE CONCEPTS
13.42 In Problem 13.4 on page 439, you used the percentage of alcohol to predict wine quality. The data are stored in Mlf!'Wl'M'!tj. From the results of that problem, b1 = 0.5624 and Sb1 = 0.1127.
a. At the 0.05 level of significance, is there evidence of a linear relationship between the percentage of alcohol and wine quality?
b. Construct a 95% confidence interval estimate of the population slope, β1.
13.43 In Problem 13.5 on page 439, you used the summated rating of a restaurant to predict the cost of a meal. The data are stored in 1;@1€1116!.IA.
a. At the 0.05 level of significance, is there evidence of a linear relationship between the summated rating of a restaurant and the cost of a meal?
b. Construct a 95% confidence interval estimate of the population slope, β1.

13.44 In Problem 13.6 on page 440, a prospective MBA student wanted to predict starting salary upon graduation, based on program per-year tuition. Use the results of that problem.
a. At the 0.05 level of significance, is there evidence of a linear relationship between the starting salary upon graduation and program per-year tuition?
b. Construct a 95% confidence interval estimate of the population slope, β1.

13.45 In Problem 13.7 on page 440, you used the plate gap in the bag-sealing equipment to predict the tear rating of a bag of coffee. The data are stored in @tm!'t!t;jt". Use the results of that problem.
a. At the 0.05 level of significance, is there evidence of a linear relationship between the plate gap of the bag-sealing machine and the tear rating of a bag of coffee?
b. Construct a 95% confidence interval estimate of the population slope, β1.

13.46 In Problem 13.8 on page 440, you used annual revenues to predict the current value of an MLB baseball team, using the data stored in W@:MM. Use the results of that problem.
a. At the 0.05 level of significance, is there evidence of a linear relationship between annual revenue and franchise value?
b. Construct a 95% confidence interval estimate of the population slope, β1.

13.47 In Problem 13.9 on page 440, an agent for a real estate company wanted to predict the monthly rent for one-bedroom
apartments, based on the size of the apartment. The data are stored in Silver Spring Rentals. Use the results of that problem.
a. At the 0.05 level of significance, is there evidence of a linear relationship between the size of the apartment and the monthly rent?
b. Construct a 95% confidence interval estimate of the population slope, β1.

13.48 In Problem 13.10 on page 440, you used YouTube trailer views to predict movie weekend box office gross from data stored in e;m. Use the results of that problem.
a. At the 0.05 level of significance, is there evidence of a linear relationship between YouTube trailer views and movie weekend box office gross?
b. Construct a 95% confidence interval estimate of the population slope, β1.

13.49 The volatility of a stock is often measured by its beta value. You can estimate the beta value of a stock by developing a simple linear regression model, using the percentage weekly change in the stock as the dependent variable and the percentage weekly change in a market index as the independent variable. The S&P 500 Index is a common index to use. For example, if you wanted to estimate the beta value for Disney, you could use the following model, which is sometimes referred to as a market model:

% weekly change in Disney = β0 + β1(% weekly change in S&P 500 Index) + ε

The least-squares regression estimate of the slope b1 is the estimate of the beta value for Disney. A stock with a beta value of 1.0 tends to move the same as the overall market. A stock with a beta value of 1.5 tends to move 50% more than the overall market, and a stock with a beta value of 0.6 tends to move only 60% as much as the overall market. Stocks with negative beta values tend to move in the opposite direction of the overall market. The following table gives some beta values for some widely held stocks as of April 23, 2018.

Company                Ticker Symbol   Beta
Apple                  AAPL            1.09
Disney                 DIS             0.70
American Eagle Mines   AEM             -0.26
Marriott               MAR             1.25
Microsoft              MSFT            1.02
Procter & Gamble       PG              0.34

Source: Data extracted from finance.yahoo.com, July 16, 2019.
a. For each of the six companies, interpret the beta value.
b. How can investors use the beta value as a guide for investing?

13.50 Index funds are mutual funds that try to mimic the movement of leading indexes, such as the S&P 500 or the Russell 2000. The beta values (as described in Problem 13.49) for these funds are therefore approximately 1.0, and the estimated market models for these funds are approximately

% weekly change in index fund = 0.0 + 1.0(% weekly change in the index)

Leveraged index funds are designed to magnify the movement of major indexes. Direxion Funds is a leading provider of leveraged index and other alternative-class mutual fund products for investment
advisors and sophisticated investors. Two of the company's funds are shown in the following table:

Name                           Ticker Symbol   Description
Daily Small Cap Bull 3x Fund   TNA             300% of the Russell 2000 Index
Daily S&P 500 Bull 2x Fund     SPUU            200% of the S&P 500 Index

Source: Data extracted from www.direxionfunds.com.

The estimated market models for these funds are approximately

% daily change in TNA = 0.0 + 3.0(% daily change in the Russell 2000)
% daily change in SPUU = 0.0 + 2.0(% daily change in the S&P 500 Index)

Thus, if the Russell 2000 Index gains 10% over a period of time, the leveraged mutual fund TNA gains approximately 30%. On the downside, if the same index loses 20%, TNA loses approximately 60%.
a. The objective of the Direxion Funds Bull 2x Fund, SPUU, is 200% of the performance of the S&P 500 Index. What is its approximate market model?
b. If the S&P 500 Index gains 10% in a year, what return do you expect SPUU to have?
c. If the S&P 500 Index loses 20% in a year, what return do you expect SPUU to have?
d. What type of investors should be attracted to leveraged index funds? What type of investors should stay away from these funds?

13.52 Movie companies need to predict the gross receipts of an individual movie once the movie has debuted. The following results (stored in [email protected]) are the first weekend gross, the U.S. gross, and the worldwide gross (in $millions) of the eight Harry Potter movies that debuted from 2001 to 2011:
13.51 The file @§ in this problem.
c. Use the prediction line developed in (a) to predict the mean number of wins for a team with an ERA of 4.50.
d. Compute the coefficient of determination, r², and interpret its meaning.
e. Perform a residual analysis on your results and determine the adequacy of the fit of the model.
f. At the 0.05 level of significance, is there evidence of a linear relationship between the number of wins and the ERA?
g. Construct a 95% confidence interval estimate of the mean number of wins expected for teams with an ERA of 4.50.
h. Construct a 95% prediction interval of the number of wins for an individual team that has an ERA of 4.50.
i. Construct a 95% confidence interval estimate of the population slope.
j. The 30 teams constitute a population. In order to use statistical inference, as in (f) through (i), the data must be assumed to represent a random sample. What "population" would this sample be drawing conclusions about?
k. What other independent variables might you consider for inclusion in the model?
l. What conclusions can you reach concerning the relationship between ERA and wins?

13.82 Can you use the annual revenues generated by NBA teams to predict current values? Figure 2.17 on page 66 shows a scatter plot of team revenue and current value, and Figure 3.12 on page 138 shows the correlation coefficient. Now, you want to develop a simple linear regression model to predict current values based on revenues. (Team current values and revenues are stored in ·~!:t·lijleF!.MFll.)
a. Assuming a linear relationship, use the least-squares method to compute the regression coefficients b0 and b1.
b. Interpret the meaning of the Y intercept, b0, and the slope, b1, in this problem.
c. Predict the mean current value of an NBA team that generates $150 million of annual revenue.
d. Compute the coefficient of determination, r², and interpret its meaning.
e. Perform a residual analysis on your results and evaluate the regression assumptions.
f. At the 0.05 level of significance, is there evidence of a linear relationship between the annual revenues generated and the current value of an NBA team?
g. Construct a 95% confidence interval estimate of the mean value of all NBA teams that generate $150 million of annual revenue.
h. Construct a 95% prediction interval of the value of an individual NBA team that generates $150 million of annual revenue.
i. Compare the results of (a) through (h) to those of baseball teams in Problems 13.8, 13.20, 13.30, 13.46, and 13.62 and European soccer teams in Problem 13.83.

13.83 In Problem 13.82 you used annual revenue to develop a model to predict the current value of NBA teams. Can you also use the annual revenues generated by European soccer teams to predict their current values? (European soccer team current values and revenues are stored in l+l§§§i!J!j@t1.)
a. Repeat Problem 13.82 (a) through (h) for the European soccer teams.
b. Compare the results of (a) to those of baseball teams in Problems 13.8, 13.20, 13.30, 13.46, and 13.62 and NBA teams in Problem 13.82.

13.84 During the fall harvest season in the United States, pumpkins are sold in large quantities at farm stands. Often, instead of weighing the pumpkins prior to sale, the farm stand operator will just place the pumpkin in the appropriate circular cutout on the counter. When asked why this was done, one farmer replied, "I can tell the weight of the pumpkin from its circumference."
To determine whether this was really true, the circumference and weight of each pumpkin from a sample of 23 pumpkins were determined and the results stored in ljll,,!.!@'j.
a. Assuming a linear relationship, use the least-squares method to compute the regression coefficients b0 and b1.
b. Interpret the meaning of the slope, b1, in this problem.
c. Predict the mean weight for a pumpkin that is 60 centimeters in circumference.
d. Do you think it is a good idea for the farmer to sell pumpkins by circumference instead of weight? Explain.
e. Determine the coefficient of determination, r², and interpret its meaning.
f. Perform a residual analysis for these data and evaluate the regression assumptions.
g. At the 0.05 level of significance, is there evidence of a linear relationship between the circumference and weight of a pumpkin?
h. Construct a 95% confidence interval estimate of the population slope, β1.
REPORT WRITING EXERCISE 13.85 In Problems 13.8, 13.20, 13.30, 13.46, 13.62, 13.82, and 13.83, you developed regression models to predict current value of MLB baseball, NBA basketball, and European soccer teams. Now, write a report based on the models you developed. Append to your report all appropriate charts and statistical information.
Cases
Managing Ashland MultiComm Services To ensure that as many trial subscriptions to the 3-For-All service as possible are converted to regular subscriptions, the marketing department works closely with the customer support department to accomplish a smooth initial process for the trial subscription customers. To assist in this effort, the marketing department needs to accurately forecast the monthly total of new regular subscriptions. A team consisting of managers from the marketing and customer support departments was convened to develop a better method of forecasting new subscriptions. Previously, after examining new subscription data for the prior three months, a group of three managers would develop a subjective forecast of the number of new subscriptions. Livia Salvador, who was recently hired by the company to provide expertise in quantitative forecasting methods, suggested that the department look for factors that might help in predicting new subscriptions. Members of the team found that the forecasts in the past year had been particularly inaccurate because in some months, much more time was spent on telemarketing than in other months. Livia collected data (stored in mEJEI) for the number of new subscriptions and hours spent on telemarketing for each month for the past two years. 1. What criticism can you make concerning the method of forecasting that involved taking the new subscriptions data for the prior three months as the basis for future projections?
2. What factors other than number of telemarketing hours spent might be useful in predicting the number of new subscriptions? Explain. 3. a. Analyze the data and develop a regression model to predict the number of new subscriptions for a month, based on the number of hours spent on telemarketing for new subscriptions.
b. If you expect to spend 1,200 hours on telemarketing per month, estimate the number of new subscriptions for the month. Indicate the assumptions on which this prediction is based. Do you think these assumptions are valid? Explain. c. What would be the danger of predicting the number of new subscriptions for a month in which 2,000 hours were spent on telemarketing?
Digital Case Apply your knowledge of simple linear regression in this Digital Case, which extends the Sunflowers Apparel Using Statistics scenario from this chapter.
Leasing agents from the Triangle Mall Management Corporation have suggested that Sunflowers consider several locations in some of Triangle's newly renovated lifestyle malls that cater to
shoppers with higher-than-mean disposable income. Although the locations are smaller than the typical Sunflowers location, the leasing agents argue that higher-than-mean disposable income in the surrounding community is a better predictor of higher sales than profiled customers. The leasing agents maintain that sample data from 14 Sunflowers stores prove that this is true. Open Triangle_Sunflower.pdf and review the leasing agents' proposal and supporting documents. Then answer the following questions: 1. Should mean disposable income be used to predict sales based on the sample of 14 Sunflowers stores?
2. Should the management of Sunflowers accept the claims of Triangle's leasing agents? Why or why not? 3. Is it possible that the mean disposable income of the surrounding area is not an important factor in leasing new locations? Explain. 4. Are there any other factors not mentioned by the leasing agents that might be relevant to the store leasing decision?
Brynne Packaging

Brynne Packaging is a large packaging company, offering its customers the highest standards in innovative packaging solutions and reliable service. About 25% of the employees at Brynne Packaging are machine operators. The human resources department has suggested that the company consider using the Wesman Personnel Classification Test (WPCT), a measure of reasoning ability, to screen applicants for the machine operator job. In order to assess the WPCT as a predictor of future job performance, 25 recent applicants were tested using the WPCT; all were hired, regardless of their WPCT score. At a later time, supervisors were asked to rate the quality of the job performance of these 25 employees, using a 1-to-10 rating scale (where 1 = very low and 10 = very high). Factors considered in the ratings included the employee's output, defect rate, ability to implement continuous quality procedures, and contributions to team problem-solving efforts. The file contains the WPCT scores (WPCT) and job performance ratings (Ratings) for the 25 employees.
1. Assess the significance and importance of WPCT score as a predictor of job performance. Defend your answer.
2. Predict the mean job performance rating for all employees with a WPCT score of 6. Give a point prediction as well as a 95% confidence interval. Do you have any concerns using the regression model for predicting mean job performance rating given the WPCT score of 6? 3. Evaluate whether the assumptions of regression have been seriously violated.
CHAPTER 13 EXCEL GUIDE

EG13.2 DETERMINING the SIMPLE LINEAR REGRESSION EQUATION
Key Technique Use the LINEST(cell range of Y variable, cell range of X variable, True, True) array function to compute the b1 and b0 coefficients, the b1 and b0 standard errors, r² and the standard error of the estimate, the F test statistic and error df, and SSR and SSE. Use the expression T.INV.2T(1 - confidence level, Error degrees of freedom) to compute the critical value for the t test.

Example Perform the Figure 13.4 analysis of the Sunflowers Apparel data on page 435.
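For readers replicating this guide outside Excel, the statistics that LINEST and T.INV.2T return can be reproduced with scipy; the tiny data set below is purely illustrative (it is not the Site Selection data):

```python
import numpy as np
from scipy import stats

# Illustrative paired data only, not the textbook's values
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

res = stats.linregress(x, y)
b1, b0 = res.slope, res.intercept      # LINEST's first-row coefficients
s_b1 = res.stderr                      # standard error of the slope
r2 = res.rvalue ** 2                   # coefficient of determination
n = len(x)

t_crit = stats.t.ppf(0.975, n - 2)     # Excel: T.INV.2T(0.05, n - 2)
f_stat = (res.slope / res.stderr) ** 2 # F statistic with 1 and n - 2 df
```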
PHStat Use Simple Linear Regression. For the example, open to the DATA worksheet of the Site Selection workbook. Select PHStat → Regression → Simple Linear Regression. In the procedure's dialog box (shown below):
1. Enter C1:C15 as the Y Variable Cell Range.
2. Enter B1:B15 as the X Variable Cell Range.
3. Check First cells in both ranges contain label.
4. Enter 95 as the Confidence level for regression coefficients.
5. Check Regression Statistics Table and ANOVA and Coefficients Table.
6. Enter a Title and click OK.
If one fails to reject the null hypothesis, one concludes that the model fit is not appropriate. If one rejects the null hypothesis, one proceeds with the model and uses methods that Sections 14.4 and 14.5 discuss to determine which independent variables should be included in the final regression model. For the OmniPower sales study, using the 0.05 level of significance, α, and Table E.5, the critical value of the F distribution with 2 and 31 degrees of freedom is approximately 3.32. Figure 14.3 visualizes the regions of nonrejection and rejection using this critical value.
486 CHAPTER 14 | Introduction to Multiple Regression
FIGURE 14.3 Testing for the significance of a set of regression coefficients at the 0.05 level of significance, with 2 and 31 degrees of freedom

[The region of nonrejection lies below the critical value F = 3.32; the region of rejection lies above it.]
The Figure 14.1 multiple regression results on page 481 include the FSTAT test statistic in the ANOVA table. Table 14.4 summarizes the results of the test for the set of regression coefficients. Based on the results, one concludes that either price or promotional expenses or both variables can be used to help predict mean monthly sales.
TABLE 14.4 Overall F test results and conclusions

Result: FSTAT = 48.4771 is greater than the F critical value, 3.32; the p-value = 0.0000 is less than the level of significance, α = 0.05.

Conclusions:
1. Reject the null hypothesis H0.
2. Conclude that evidence exists for claiming that at least one of the independent X variables (price or promotional expenses) is related to the dependent Y variable, sales.
3. The probability is 0.0000 that FSTAT > 48.4771.

student TIP
Using tables to summarize regression results and conclusions is a good way to communicate results to others.
PROBLEMS FOR SECTION 14.2

LEARNING THE BASICS

14.9 The following ANOVA summary table is for a multiple regression model with two independent variables:

Source      Degrees of Freedom   Sum of Squares   Mean Squares   F
Regression  2                    60
Error       18                   120
Total       20                   180

a. Determine the regression mean square (MSR) and the mean square error (MSE).
b. Compute the overall FSTAT test statistic.
c. Determine whether there is a significant relationship between Y and the two independent variables at the 0.05 level of significance.
d. Compute the coefficient of multiple determination, r², and interpret its meaning.
e. Compute the adjusted r².

14.10 The following ANOVA summary table is for a multiple regression model with two independent variables:

Source      Degrees of Freedom   Sum of Squares   Mean Squares   F
Regression  2                    30
Error       10                   120
Total       12                   150

a. Determine the regression mean square (MSR) and the mean square error (MSE).
b. Compute the overall FSTAT test statistic.
c. Determine whether there is a significant relationship between Y and the two independent variables at the 0.05 level of significance.
d. Compute the coefficient of multiple determination, r², and interpret its meaning.
e. Compute the adjusted r².
14.3 Multiple Regression Residual Analysis
APPLYING THE CONCEPTS

14.11 A financial analyst engaged in business valuation obtained financial data on 60 drug companies (Industry Group SIC 3 code: 283). The file contains the following variables:
Company - Drug company name
PBfye - Price-to-book-value ratio (fiscal year ending)
ROE - Return on equity
SGrowth - Growth (GS5)
a. Develop a regression model to predict price-to-book-value ratio based on return on equity.
b. Develop a regression model to predict price-to-book-value ratio based on growth.
c. Develop a regression model to predict price-to-book-value ratio based on return on equity and growth.
d. Compute and interpret the adjusted r² for each of the three models.
e. Which of these three models do you think is the best predictor of price-to-book-value ratio?
14.12 In Problem 14.3 on page 482, you predicted nonprofit charitable commitment, based on nonprofit revenue and fundraising efficiency. The regression analysis resulted in this ANOVA table:

Source      Degrees of Freedom   Sum of Squares   Mean Squares   F         p-Value
Regression  2                    3529.0718        1764.54        57.9410
Error       95                   2893.1323
Total       97                   6422.2041

Determine whether there is a significant relationship between commitment and the two independent variables at the 0.05 level of significance.

14.13 In Problem 14.5 on page 483, you used the percentage of alcohol and chlorides to predict wine quality (stored in l@@f®t\11). Use the results from that problem to do the following:
a. Determine whether there is a significant relationship between wine quality and the two independent variables (percentage of alcohol and chlorides) at the 0.05 level of significance.
b. Interpret the meaning of the p-value.
c. Compute the coefficient of multiple determination, r², and interpret its meaning.
d. Compute the adjusted r².

14.14 In Problem 14.4 on page 482, you used efficiency ratio and total risk-based capital to predict ROAA at a community bank. Using the results from that problem,
a. determine whether there is a significant relationship between ROAA and the two independent variables (efficiency ratio and total risk-based capital) at the 0.05 level of significance.
b. interpret the meaning of the p-value.
c. compute the coefficient of multiple determination, r², and interpret its meaning.
d. compute the adjusted r².

14.15 In Problem 14.7 on page 483, you used the weekly staff count and remote engineering hours to predict standby hours. Using the results from that problem,
a. determine whether there is a significant relationship between standby hours and the two independent variables (total staff present and remote engineering hours) at the 0.05 level of significance.
b. interpret the meaning of the p-value.
c. compute the coefficient of multiple determination, r², and interpret its meaning.
d. compute the adjusted r².

14.16 In Problem 14.6 on page 483, you used full-time voluntary turnover (%) and total worldwide revenue ($billions) to predict number of full-time jobs added. Using the results from that problem,
a. determine whether there is a significant relationship between number of full-time jobs added and the two independent variables (full-time voluntary turnover and total worldwide revenue) at the 0.05 level of significance.
b. interpret the meaning of the p-value.
c. compute the coefficient of multiple determination, r², and interpret its meaning.
d. compute the adjusted r².

14.17 In Problem 14.8 on page 483, you used the land area of a property and the age of a house to predict the fair market value (stored in lfilt!tilij.11§). Using the results from that problem,
a. determine whether there is a significant relationship between fair market value and the two independent variables (land area of a property and age of a house) at the 0.05 level of significance.
b. interpret the meaning of the p-value.
c. compute the coefficient of multiple determination, r², and interpret its meaning.
d. compute the adjusted r².

The coefficient of partial determination of variable Y with X2 while holding X1 constant is 0.4728. For a given (constant) price, 47.28% of the variation in OmniPower sales is explained by variation in the amount of promotional expenses.

r²Y2.1 = 11,319,244.62 / (52,093,677.44 - 39,472,730.77 + 11,319,244.62) = 0.4728
Equation (14.14) defines the coefficient of partial determination for the jth variable in a multiple regression model containing several (k) independent variables.

COEFFICIENT OF PARTIAL DETERMINATION FOR A MULTIPLE REGRESSION MODEL CONTAINING k INDEPENDENT VARIABLES

r²Yj.(All variables except j) = SSR(Xj | All Xs except j) / [SST - SSR(All Xs) + SSR(Xj | All Xs except j)]    (14.14)
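For the two-variable OmniPower model, Equation (14.14) reduces to the arithmetic on the sums of squares quoted above; a minimal numeric check:

```python
sst = 52_093_677.44              # SST
ssr_all = 39_472_730.77          # SSR(X1 and X2)
ssr_x2_given_x1 = 11_319_244.62  # SSR(X2 | X1)

# Equation (14.14) with k = 2 and j = 2
r2_y2_1 = ssr_x2_given_x1 / (sst - ssr_all + ssr_x2_given_x1)
# r2_y2_1 rounds to 0.4728, as in the text
```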
14.6 Using Dummy Variables and Interaction Terms
LEARNING THE BASICS

14.31 The following is the ANOVA summary table for a multiple regression model with two independent variables:

Source      Degrees of Freedom   Sum of Squares
Regression  2                    60
Error       18                   120
Total       20                   180

If SSR(X1) = 45 and SSR(X2) = 25,
a. determine whether there is a significant relationship between Y and each independent variable at the 0.05 level of significance.
b. compute the coefficients of partial determination, r²Y1.2 and r²Y2.1, and interpret their meaning.

14.32 The following is the ANOVA summary table for a multiple regression model with two independent variables:

Source      Degrees of Freedom   Sum of Squares
Regression  2                    30
Error       10                   120
Total       12                   150

If SSR(X1) = 20 and SSR(X2) = 15,
a. determine whether there is a significant relationship between Y and each independent variable at the 0.05 level of significance.
b. compute the coefficients of partial determination, r²Y1.2 and r²Y2.1, and interpret their meaning.

APPLYING THE CONCEPTS

14.33 In Problem 14.5 on page 483, you used alcohol percentage and chlorides to predict wine quality (stored in the data file for that problem). Using the results from that problem,
a. at the 0.05 level of significance, determine whether each independent variable makes a significant contribution to the regression model. On the basis of these results, indicate the most appropriate regression model for this set of data.
b. compute the coefficients of partial determination, r²Y1.2 and r²Y2.1, and interpret their meaning.

14.34 In Problem 14.4 on page 482, you used efficiency ratio and total risk-based capital to predict ROAA at a community bank (stored in the data file for that problem). Using the results from that problem,
a. at the 0.05 level of significance, determine whether each independent variable makes a significant contribution to the regression model. On the basis of these results, indicate the most appropriate regression model for this set of data.
b. compute the coefficients of partial determination, r²Y1.2 and r²Y2.1, and interpret their meaning.

14.35 In Problem 14.7 on page 483, you used the weekly staff count and remote engineering hours to predict standby hours (stored in Nickels26Weeks). Using the results from that problem,
a. at the 0.05 level of significance, determine whether each independent variable makes a significant contribution to the regression model. On the basis of these results, indicate the most appropriate regression model for this set of data.
b. compute the coefficients of partial determination, r²Y1.2 and r²Y2.1, and interpret their meaning.

14.36 In Problem 14.6 on page 483, you used full-time voluntary turnover (%) and total worldwide revenue ($billions) to predict the number of full-time job openings (stored in the data file for that problem). Using the results from that problem,
a. at the 0.05 level of significance, determine whether each independent variable makes a significant contribution to the regression model. On the basis of these results, indicate the most appropriate regression model for this set of data.
b. compute the coefficients of partial determination, r²Y1.2 and r²Y2.1, and interpret their meaning.

14.37 In Problem 14.8 on page 483, you used land area of a property and age of a house to predict the fair market value (stored in the data file for that problem). Using the results from that problem,
a. at the 0.05 level of significance, determine whether each independent variable makes a significant contribution to the regression model. On the basis of these results, indicate the most appropriate regression model for this set of data.
b. compute the coefficients of partial determination, r²Y1.2 and r²Y2.1, and interpret their meaning.
14.6 Using Dummy Variables and Interaction Terms

The multiple regression models that Sections 14.1 through 14.5 discuss assume that each independent variable is a numerical variable. For example, in Section 14.1, you used price and promotional expenses, two numerical independent variables, to predict the monthly sales of OmniPower nutrition bars. However, for some models, one needs to examine the effect of a categorical independent variable. In such cases, one uses a dummy variable to include a categorical independent variable in a regression model.
498 CHAPTER 14 | Introduction to Multiple Regression
student TIP The software guides for this chapter explain how to create dummy variables from categorical variables not already coded with the values 0 and 1.
Dummy variables use the numeric values 0 and 1 to recode the two categories of a categorical independent variable in a regression model. In general, the number of dummy variables one needs to define equals the number of categories minus 1. If a categorical independent variable has only two categories, one dummy variable, Xd, is defined, and the values 0 and 1 represent the two categories. When the two categories represent the presence or absence of a characteristic, use 0 to represent the absence and 1 to represent the presence of the characteristic.

For example, to predict the monthly sales of the OmniPower bars, one might include the categorical variable location in the model to explore the possible effect on sales caused by displaying the OmniPower bars in two different sales locations, a special front location and the snack aisle, analogous to the locations used in the Chapter 10 Arlingtons scenario to sell streaming media players. In this case, for the categorical variable location, the dummy variable, Xd, would have these values:

Xd = 0 if the value is the first category (special front location)
Xd = 1 if the value is the second category (in-aisle location)

To illustrate using dummy variables in regression, consider the case of a realtor who seeks to develop a model for predicting the asking price of houses listed for sale ($thousands) in Silver Spring, Maryland, using living space in the house (square feet) and whether the house has a fireplace as the independent variables. To include the categorical variable for the presence of a fireplace, the dummy variable X2 is defined as

X2 = 0 if the house does not have a fireplace
X2 = 1 if the house has a fireplace

Assuming that the slope of asking price with living space is the same for houses that do and do not have a fireplace, the multiple regression model is

Yi = β0 + β1X1i + β2X2i + εi

where

Yi = asking price, in thousands of dollars, for house i
β0 = Y intercept
X1i = living space, in square feet, for house i
β1 = slope of asking price with living space, holding constant the presence or absence of a fireplace
X2i = dummy variable that represents the absence or presence of a fireplace for house i
β2 = net effect of the presence of a fireplace on asking price, holding constant the living space
εi = random error in Y for house i

Figure 14.9 presents the regression results for this model, using a sample of 61 Silver Spring houses listed for sale that was extracted from trulia.com. In these results, the dummy variable X2 is labeled as Fireplace.
FIGURE 14.9 Excel worksheet results for the regression model that includes Living Space and Fireplace

Asking Price Analysis

Regression Statistics
Multiple R          0.6842
R Square            0.4681
Adjusted R Square   0.4497
Standard Error      66.8687
Observations        61

ANOVA
            df   SS            MS            F        Significance F
Regression  2    228210.1161   114105.0581   25.5187  0.0000
Residual    58   259342.5606   4471.4235
Total       60   487552.6767

              Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept     302.2518       26.5548          11.3822   0.0000    249.0965    355.4071
Living Space  0.0765         0.0129           5.9179    0.0000    0.0507      0.1024
Fireplace     52.9674        19.1421          2.7671    0.0076    14.6504     91.2844

In the Worksheet: This worksheet uses the same template as the Figure 14.1 worksheet.
From Figure 14.9, the regression equation is

Ŷi = 302.2518 + 0.0765X1i + 52.9674X2i

For houses without a fireplace, one sets X2 = 0:

Ŷi = 302.2518 + 0.0765X1i + 52.9674(0)
   = 302.2518 + 0.0765X1i

For houses with a fireplace, one sets X2 = 1:

Ŷi = 302.2518 + 0.0765X1i + 52.9674(1)
   = 355.2192 + 0.0765X1i
Table 14.9 summarizes the results of the tests for the regression coefficient for living space (b1) and the regression coefficient for the presence or absence of a fireplace (b2) that appear as part of Figure 14.9, the Silver Spring houses multiple regression results on page 498. Based on these results, one can conclude that living space has a significant effect on mean asking price and that the presence of a fireplace also has a significant effect.

student TIP
Remember that an independent variable does not always make a significant contribution to a regression model.

TABLE 14.9 t test for the slope results and conclusions for the Silver Spring houses multiple regression model

Result: tSTAT = 5.9179 is greater than 2.0017; p-value = 0.0000 is less than the level of significance, α = 0.05.
Conclusions:
1. Reject the null hypothesis H0.
2. Conclude that strong evidence exists for claiming that living space is related to the dependent Y variable, asking price, taking into account the presence or absence of a fireplace.
3. The probability is 0.0000 that tSTAT < −5.9179 or tSTAT > 5.9179.

Result: tSTAT = 2.7671 is greater than 2.0017; p-value = 0.0076 is less than the level of significance, α = 0.05.
Conclusions:
1. Reject the null hypothesis H0.
2. Conclude that strong evidence exists for claiming that presence of a fireplace is related to the dependent Y variable, asking price, taking into account the living space.
3. The probability is 0.0076 that tSTAT < −2.7671 or tSTAT > 2.7671.

Result: r² = 0.4681.
Conclusion: 46.81% of the variation in the asking price can be explained by variation in living space and whether the house has a fireplace.
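The decision rule that Table 14.9 applies, comparing tSTAT = b/Sb against the critical value (here ±2.0017 for 58 degrees of freedom), can be sketched as follows. The function name and two-sided framing are our illustration; the check values are the Fireplace coefficient and standard error from Figure 14.9.

```python
def slope_t_test(b: float, s_b: float, t_crit: float) -> tuple[float, bool]:
    """Return (t_STAT, reject_H0) for H0: beta = 0 against a two-sided alternative."""
    t_stat = b / s_b
    return t_stat, abs(t_stat) > t_crit

# Fireplace coefficient from Figure 14.9: b2 = 52.9674, S_b2 = 19.1421
t_stat, reject = slope_t_test(52.9674, 19.1421, 2.0017)
print(round(t_stat, 4), reject)  # ~2.7671, rejected
```

Because |tSTAT| exceeds the critical value, the fireplace dummy stays in the model.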
Using the net regression coefficients b1 and b2, the Table 14.10 net effects table summarizes the effects of adding one square foot of living space (X1) or the presence of a fireplace (X2).

TABLE 14.10 Net effects table for the Silver Spring houses multiple regression model

Independent Variable Change: An increase of one square foot in living space
Net Effect: Predict mean asking price to increase by 0.0765 ($000), or $76.50, holding presence of a fireplace constant.

Independent Variable Change: Presence of a fireplace
Net Effect: Predict mean asking price to increase by 52.9674 ($000), or $52,967.40, for a house with a fireplace, holding living space constant.
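Using the fitted coefficients from Figure 14.9, the role of the dummy variable can be sketched as a prediction function. This is an illustrative sketch: the function and variable names are ours; living space is in square feet and price is in $thousands.

```python
B0, B1, B2 = 302.2518, 0.0765, 52.9674  # Figure 14.9 coefficients

def predicted_price(living_space_sqft: float, has_fireplace: bool) -> float:
    """Predicted asking price ($thousands); X2 = 1 if fireplace, else 0."""
    x2 = 1 if has_fireplace else 0
    return B0 + B1 * living_space_sqft + B2 * x2

# With fireplace, the intercept becomes 302.2518 + 52.9674 = 355.2192:
print(round(predicted_price(0, True), 4))  # 355.2192
# The dummy shifts every prediction by b2 = 52.9674 ($thousands), at any living space:
print(round(predicted_price(2000, True) - predicted_price(2000, False), 4))  # 52.9674
```

The sketch makes the geometry concrete: the two fireplace groups share one slope and differ only by a parallel shift of b2.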
In some situations, the categorical independent variable has more than two categories. When this occurs, two or more dummy variables are needed. Example 14.3 illustrates such situations.
EXAMPLE 14.3 Modeling a Three-Level Categorical Variable
Define a multiple regression model to predict the asking price of houses as the dependent variable, as was done in the previous example for the Silver Spring houses, and use living space and house type as independent variables. House type is a three-level categorical variable with the values colonial, ranch, and other.

SOLUTION To model the three-level categorical variable house type, two dummy variables, X1 and X2, are needed:

X1i = 1 if the house type is colonial for house i; 0 otherwise
X2i = 1 if the house type is ranch for house i; 0 otherwise

Thus, if house i is a colonial, then X1i = 1 and X2i = 0; if house i is a ranch, then X1i = 0 and X2i = 1; and if house i is other (neither colonial nor ranch), then X1i = X2i = 0. House type other becomes the baseline category to which the effect of being a colonial or ranch house type is compared. A third independent variable is used for living space:

X3i = living space for house i

Thus, the regression model for this example is

Yi = β0 + β1X1i + β2X2i + β3X3i + εi

where

Yi = asking price for house i
β0 = Y intercept
β1 = difference between the predicted asking price of house type colonial and the predicted asking price of house type other, holding living space constant
β2 = difference between the predicted asking price of house type ranch and the predicted asking price of house type other, holding living space constant
β3 = slope of asking price with living space, holding the house type constant
εi = random error in Y for house i
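The Example 14.3 coding scheme, two dummies with "other" as the baseline, can be sketched as a small encoding function. The function name is our illustration, not part of the example.

```python
def encode_house_type(house_type: str) -> tuple[int, int]:
    """Return (X1, X2) for the three-level variable house type:
    X1 = 1 if colonial, X2 = 1 if ranch; 'other' is the (0, 0) baseline."""
    x1 = 1 if house_type == "colonial" else 0
    x2 = 1 if house_type == "ranch" else 0
    return x1, x2

print(encode_house_type("colonial"))  # (1, 0)
print(encode_house_type("ranch"))     # (0, 1)
print(encode_house_type("other"))     # (0, 0)
```

Note that only two dummies are used for three categories; adding a third dummy for "other" would make the columns linearly dependent and the model unestimable.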
Regression models can have several numerical independent variables along with a dummy variable. Example 14.4 illustrates a regression model in which there are two numerical independent variables and a categorical independent variable.
EXAMPLE 14.4 Studying a Regression Model That Contains a Dummy Variable and Two Numerical Independent Variables

A builder of suburban homes faces the business problem of seeking to predict heating oil consumption in single-family houses. The independent variables considered are temperature, the atmospheric temperature (°F), X1; insulation, the amount of attic insulation (inches), X2; and ranch-style, whether the house is ranch-style, X3. Data are collected from a sample of 15 single-family houses and stored in the data file for this example. Develop and analyze an appropriate regression model, using these three independent variables X1, X2, and X3.

SOLUTION Define X3, ranch-style, a dummy variable for ranch-style house, as follows:

X3 = 0 if not a ranch-style house
X3 = 1 if a ranch-style house
Assuming that the slope between heating oil consumption and temperature, X1, and between heating oil consumption and insulation, X2, is the same for both styles of houses, the regression model is

Yi = β0 + β1X1i + β2X2i + β3X3i + εi

where

Yi = monthly heating oil consumption, in gallons, for house i
β0 = Y intercept
β1 = slope of heating oil consumption with temperature, holding constant the effect of insulation and ranch-style
β2 = slope of heating oil consumption with insulation, holding constant the effect of temperature and ranch-style
β3 = incremental effect of the presence of a ranch-style house on heating oil consumption, holding constant the effect of temperature and insulation
εi = random error in Y for house i

Figure 14.10 presents results for this regression model.
FIGURE 14.10 Excel worksheet results for the regression model that includes temperature, insulation, and ranch-style for the heating oil data

Heating Oil Consumption Analysis

Regression Statistics
Multiple R          0.9942
R Square            0.9884
Adjusted R Square   0.9853
Standard Error      15.7489
Observations        15

ANOVA
            df   SS            MS           F         Significance F
Regression  3    233406.9094   77802.3031   313.6822  0.0000
Residual    11   2728.3200     248.0291
Total       14   236135.2293

              Coefficients   Standard Error   t Stat     P-value   Lower 95%   Upper 95%
Intercept     592.5401       14.3370          41.3295    0.0000    560.9846    624.0956
Temperature   -5.5251        0.2044           -27.0267   0.0000    -5.9751     -5.0752
Insulation    -21.3761       1.4480           -14.7623   0.0000    -24.5632    -18.1891
Ranch-style   -38.9727       8.3584           -4.6627    0.0007    -57.3695    -20.5759
From the results in Figure 14.10, the regression equation is

Ŷi = 592.5401 − 5.5251X1i − 21.3761X2i − 38.9727X3i

For houses that are not ranch style, because X3 = 0, the regression equation reduces to

Ŷi = 592.5401 − 5.5251X1i − 21.3761X2i

For houses that are ranch style, because X3 = 1, the regression equation reduces to

Ŷi = 553.5674 − 5.5251X1i − 21.3761X2i

Table 14.11 on page 502 summarizes the results of the tests for the regression coefficients for temperature (b1), insulation (b2), and ranch-style, the presence or absence of a ranch-style house (b3), that appear as part of Figure 14.10. Based on these results, the developer can conclude that temperature, insulation, and ranch-style each has a significant effect on mean monthly heating oil consumption.
TABLE 14.11 t test for the slope results and conclusions for the Example 14.4 multiple regression model

Result: tSTAT = −27.0267 is less than −2.2010; p-value = 0.0000 is less than the level of significance, α = 0.05.
Conclusions:
1. Reject the null hypothesis H0.
2. Conclude that strong evidence exists for claiming that temperature is related to the dependent Y variable, heating oil consumption, holding insulation and ranch-style constant.
3. The probability is 0.0000 that tSTAT < −27.0267 or tSTAT > 27.0267.

Result: tSTAT = −14.7623 is less than −2.2010; p-value = 0.0000 is less than the level of significance, α = 0.05.
Conclusions:
1. Reject the null hypothesis H0.
2. Conclude that strong evidence exists for claiming that insulation is related to the dependent Y variable, heating oil consumption, holding temperature and ranch-style constant.
3. The probability is 0.0000 that tSTAT < −14.7623 or tSTAT > 14.7623.

Result: tSTAT = −4.6627 is less than −2.2010; p-value = 0.0007 is less than the level of significance, α = 0.05.
Conclusions:
1. Reject the null hypothesis H0.
2. Conclude that strong evidence exists for claiming that whether the house is a ranch-style is related to the dependent Y variable, heating oil consumption, holding temperature and insulation constant.
3. The probability is 0.0007 that tSTAT < −4.6627 or tSTAT > 4.6627.

Result: r² = 0.9884.
Conclusion: 98.84% of the variation in the heating oil consumption can be explained by variation in temperature, insulation, and whether the house is a ranch-style.
Using the net regression coefficients b1, b2, and b3, the Table 14.12 net effects table summarizes the effects of an increase of one degree in temperature (X1), adding one inch to the insulation amount (X2), and whether the house is a ranch-style (X3). If the cost of adding one inch of attic insulation were equivalent to about 21 gallons of heating oil, the builder could predict that the new insulation would "pay for itself" by lowering heating oil consumption in about one month.

TABLE 14.12 Net effects table for the Example 14.4 multiple regression model

Independent Variable Change: An increase of one degree in temperature (°F)
Net Effect: Predict mean monthly heating oil consumption to decrease by 5.5251 gallons, holding insulation and ranch-style constant.

Independent Variable Change: An increase of one inch in attic insulation
Net Effect: Predict mean monthly heating oil consumption to decrease by 21.3761 gallons, holding temperature and ranch-style constant.

Independent Variable Change: Presence of a ranch-style house
Net Effect: Predict mean monthly heating oil consumption to decrease by 38.9727 gallons for a ranch-style house, holding temperature and insulation constant.
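The Example 14.4 regression equation and its ranch-style reduction can be sketched as a prediction function. This is an illustrative sketch with names of our choosing; the coefficients are those of Figure 14.10.

```python
B0, B1, B2, B3 = 592.5401, -5.5251, -21.3761, -38.9727  # Figure 14.10 coefficients

def predicted_oil(temp_f: float, insulation_in: float, ranch_style: bool) -> float:
    """Predicted monthly heating oil consumption, in gallons."""
    x3 = 1 if ranch_style else 0
    return B0 + B1 * temp_f + B2 * insulation_in + B3 * x3

# Ranch-style shifts the intercept down: 592.5401 - 38.9727 = 553.5674
print(round(predicted_oil(0, 0, True), 4))  # 553.5674
# Each added inch of insulation lowers predicted consumption by 21.3761 gallons:
print(round(predicted_oil(30, 6, False) - predicted_oil(30, 7, False), 4))  # 21.3761
```

The differences printed here are exactly the net effects in Table 14.12, since each coefficient acts with the other variables held constant.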
Before anyone could use the model in Example 14.4 for decision making, one may want to determine whether the independent variables interact with each other.
Interactions

In the regression models discussed so far, the effect an independent variable has on the dependent variable has been assumed to be independent of the other independent variables in the model. An interaction occurs if the effect of an independent variable on the dependent variable changes according to the value of a second independent variable. For example, it is possible that advertising will have a large effect on the sales of a product when the price of the product is low. However, if the price of the product is too high, increases in advertising will not dramatically change sales. In this case, price and advertising are said to interact. In other words, one cannot make general statements about the effect of advertising on sales. The effect that advertising has on sales is dependent on the price. One uses an interaction term, also called a cross-product term, to model an interaction effect in a multiple regression model.

To illustrate the concept of interaction and the use of an interaction term, recall the Section 14.6 asking price of homes example, the multiple regression results for which Figure 14.9 on page 498 presents. In the regression model, the realtor has assumed that the effect that living space has on the asking price is independent of whether the house has a fireplace. Stated in statistical language, the realtor has assumed that the slope of asking price with living space is the same for all houses, regardless of whether the house contains a fireplace. However, this assumption may not be true. If an interaction exists between living space and the presence or absence of a fireplace, the two slopes will be different. To evaluate whether an interaction exists, one first defines an interaction term that is the product of the independent variable X1 (living space) and the dummy variable X2 (fireplace). Then, one tests whether this interaction variable makes a significant contribution to the regression model. If the interaction is significant, one cannot use the original model for prediction. For these data, one defines

X3 = X1 × X2

Figure 14.11 presents regression results for the model that includes the living space, X1, the presence of a fireplace, X2, and the interaction of X1 and X2, which has been defined as X3 and labeled Living Space*Fireplace.
FIGURE 14.11 Excel worksheet results for the regression model that includes living space, fireplace, and the interaction of living space and fireplace

Asking Price Analysis

Regression Statistics
Multiple R          0.6849
R Square            0.4691
Adjusted R Square   0.4411
Standard Error      67.3907
Observations        61

ANOVA
            df   SS            MS           F        Significance F
Regression  3    228686.7174   76228.9058   16.7849  0.0000
Residual    57   258865.9593   4541.5081
Total       60   487552.6767

                         Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept                316.2350       50.7878          6.2266   0.0000    214.5341    417.9359
Living Space             0.0681         0.0292           2.3319   0.0233    0.0096      0.1265
Fireplace                34.8926        59.0359          0.5910   0.5568    -83.3248    153.1101
Living Space*Fireplace   0.0106         0.0326           0.3239   0.7472    -0.0548     0.0759

In the Worksheet: This worksheet uses the same template as the Figure 14.1 worksheet. "Living Space*Fireplace" refers to the column that holds the cross-product term that models the interaction between the living space and fireplace variables.

student TIP
It is possible that the interaction between two independent variables will be significant even though one of the independent variables is not significant.
The null and alternative hypotheses to test for the existence of an interaction are

H0: β3 = 0
H1: β3 ≠ 0

Table 14.13 summarizes the results of the test for the interaction of living space and the presence of a fireplace that appears as part of Figure 14.11. Based on these results, one concludes that the interaction of living space and the presence of a fireplace is not significant. The interaction term should not be included in the regression model to predict asking price.
TABLE 14.13 t test for the interaction of living space and presence of a fireplace results and conclusions

Result: tSTAT = 0.3239 is less than 2.0025; p-value = 0.7472 is greater than the level of significance, α = 0.05.
Conclusions:
1. Do not reject the null hypothesis H0.
2. Conclude that there is insufficient evidence of an interaction of living space and presence of a fireplace.
3. The probability is 0.7472 that tSTAT < −0.3239 or tSTAT > 0.3239.
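What an interaction term does can be sketched with the Figure 14.11 coefficients: in the interaction model the slope of asking price with living space becomes b1 + b3·X2, so the slope differs by fireplace status. The function name is our illustration.

```python
B0, B1, B2, B3 = 316.2350, 0.0681, 34.8926, 0.0106  # Figure 14.11 coefficients

def living_space_slope(has_fireplace: bool) -> float:
    """Slope of asking price with living space implied by the interaction model:
    b1 when X2 = 0, and b1 + b3 when X2 = 1."""
    x2 = 1 if has_fireplace else 0
    return B1 + B3 * x2

print(round(living_space_slope(False), 4))  # 0.0681
print(round(living_space_slope(True), 4))   # 0.0787
```

Because b3 is not significantly different from zero here, these two slopes are not statistically distinguishable, which is why the simpler Figure 14.9 model with a single slope is preferred.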
In Example 14.5, three interaction terms are added to the model.
EXAMPLE 14.5 Evaluating a Regression Model with Several Interactions

For the Example 14.4 data, determine whether adding interaction terms makes a significant contribution to the regression model.

SOLUTION To evaluate possible interactions between the independent variables, three interaction terms are constructed as follows: X4 = X1 × X2, X5 = X1 × X3, and X6 = X2 × X3. The regression model is now

Yi = β0 + β1X1i + β2X2i + β3X3i + β4X4i + β5X5i + β6X6i + εi

where X1 is temperature, X2 is insulation, X3 is the dummy variable ranch-style, X4 is the interaction between temperature and insulation, X5 is the interaction between temperature and ranch-style, and X6 is the interaction between insulation and ranch-style. Figure 14.12 presents the results for this regression model.

FIGURE 14.12 Excel worksheet results for the regression model that includes temperature, X1; insulation, X2; the dummy variable ranch-style, X3; the interaction of temperature and insulation, X4; the interaction of temperature and ranch-style, X5; and the interaction of insulation and ranch-style, X6

From Table 15.5, the models under consideration include X3, X4, X5, X6; X1, X2, X3, X4, X5, X6; X1, X2, X3, X4, X5, X7; X1, X2, X3, X4, X6, X7; and X1, X2, X3, X4, X5, X6, X7.
SOLUTION From Table 15.5, you need to determine which models have Cp values that are less than or close to k + 1. Two models meet this criterion. The model with six independent variables (X1, X2, X3, X4, X5, X6) has a Cp value of 6.8, which is less than k + 1 = 6 + 1 = 7, and the full model with seven independent variables (X1, X2, X3, X4, X5, X6, X7) has a Cp value of 8.0.

One way you can choose between the two models is to select the model with the largest adjusted r², which is the model with six independent variables. Another way to select a final model is to determine whether the models contain a subset of variables that are common and then test whether the contribution of the additional variables is significant. In this case, because the models differ only by the inclusion of variable X7 in the full model, you test whether variable X7 makes a significant contribution to the regression model, given that the variables X1, X2, X3, X4, X5, and X6 are already included in the model. If the contribution is statistically significant, then you should include variable X7 in the regression model. If variable X7 does not make a statistically significant contribution, you should not include it in the model.
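The two screening statistics used in this kind of solution, the adjusted r² and the Cp statistic of Equation (15.9), can be sketched as follows. The function names are ours; the first check value uses the Figure 14.9 fit (r² = 0.4681, n = 61, k = 2), and the Cp example uses illustrative numbers rather than values from Table 15.5.

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted coefficient of multiple determination for k independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

def c_p(r2_k: float, r2_total: float, n: int, k: int) -> float:
    """Cp statistic, Equation (15.9): compares a k-variable model (r2_k)
    with the model containing all candidate variables (r2_total)."""
    return (1 - r2_k) * (n - 1) / (1 - r2_total) - (n - 2 * (k + 1))

# Figure 14.9 fit: adjusted r-squared of roughly 0.4497
print(round(adjusted_r2(0.4681, 61, 2), 4))
# Illustrative screen: retain a 3-variable model only if Cp is near or below k + 1 = 4
print(round(c_p(0.5, 0.8, 50, 3), 1))  # 80.5, so this model would be discarded
```

A Cp far above k + 1 signals substantial bias from omitted variables, which is why such models are discarded in the Figure 15.18 process.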
546 CHAPTER 15 | Multiple Regression Model Building

Figure 15.18 summarizes the model-building process.

FIGURE 15.18 The model-building process: (1) choose the candidate independent X variables; (2) perform a regression with all independent X variables to find the VIFs; (3) eliminate the X variable with the largest VIF, repeating as necessary; (4) run a best subsets regression to obtain the best models with k terms for a given number of independent variables; (5) for each model best subsets creates, retain the model if Cp is near or less than k + 1 and discard the model otherwise; (6) choose a "best" model from the models retained; (7) perform a residual analysis on the selected model; and (8) add quadratic or interaction terms, or transform variables, as necessary.
LEARNING THE BASICS

15.21 You are considering four independent variables for inclusion in a regression model. You select a sample of n = 30, with the following results:
1. The model that includes independent variables A and B has a Cp value equal to 4.6.
2. The model that includes independent variables A and C has a Cp value equal to 2.4.
3. The model that includes independent variables A, B, and C has a Cp value equal to 2.7.
a. Which models meet the criterion for further consideration? Explain.
b. How would you compare the model that contains independent variables A, B, and C to the model that contains independent variables A and B? Explain.

15.22 You are considering six independent variables for inclusion in a regression model. You select a sample of n = 40, with the following results:

k = 2, T = 6 + 1 = 7, R²k = 0.274, R²T = 0.653

a. Compute the Cp value for this two-independent-variable model.
b. Based on your answer to (a), does this model meet the criterion for further consideration as the best model? Explain.

APPLYING THE CONCEPTS

15.23 The file for this problem contains data from a sample of full-time MBA programs offered by private universities. The variables collected for this sample are the average starting salary upon graduation ($), the percentage of applicants to the full-time program who were accepted, the average GMAT test score of students entering the program, the program per-year tuition ($), and the percent of students with job offers at time of graduation.
Source: Data extracted from U.S. News & World Report Education, "Best Graduate Schools," bit.ly/lE8MBcp.
Develop the most appropriate multiple regression model to predict the mean starting salary upon graduation. Be sure to include a thorough residual analysis. In addition, provide a detailed explanation of the results, including a comparison of the most appropriate multiple regression model to the best simple linear regression model.

15.24 You need to develop a model to predict the asking price of houses listed for sale in Silver Spring, Maryland, based on the living space of the house, the lot size, the age, whether it has a fireplace, the number of bedrooms, and the number of bathrooms. A sample of 61 houses is selected and the results are stored in Silver Spring Homes. Develop the most appropriate multiple regression model to predict asking price. Be sure to perform a thorough residual analysis. In addition, provide a detailed explanation of the results.
15.25 Accounting Today identified top public accounting firms in 10 geographic regions across the United States. The file Regional Accounting Partners contains data for public accounting firms in the Southeast, Gulf Coast, and Capital Regions. The variables are revenue ($millions), number of partners in the firm, number of offices in the firm, proportion of business dedicated to management advisory services (MAS%), whether the firm is located in the Southeast Region (0 = no, 1 = yes), and whether the firm is located in the Gulf Coast Region (0 = no, 1 = yes).
Source: Data extracted from "The 2019 Top 100 Firms and Regional Leaders," bit.ly/20jXtDo.
Develop the most appropriate multiple regression model to predict firm revenue. Be sure to perform a thorough residual analysis. In addition, provide a detailed explanation of the results.
15.5 Pitfalls in Multiple Regression and Ethical Issues

Pitfalls in Multiple Regression

Model building is an art as well as a science. Different individuals may not always agree on the best multiple regression model. To develop a good regression model, use the process that Exhibit 15.1 and Figure 15.18 on pages 540 and 546 summarize. As you follow that process, you must avoid certain pitfalls that can interfere with the development of a useful model. Section 13.9 discussed pitfalls in simple linear regression and strategies for avoiding them. Multiple regression models require the following additional precautions to avoid common pitfalls:

• Interpret the regression coefficient for a particular independent variable from a perspective in which the values of all other independent variables are held constant.
• Evaluate residual plots for each independent variable.
• Evaluate interaction and quadratic terms.
• Compute the VIF for each independent variable before determining which independent variables to include in the model.
• Examine several alternative models, using best subsets regression.
• Use logistic regression instead of least squares regression when the dependent variable is categorical.
• Use cross-validation to improve the accuracy of models for predicting future data.
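The VIF computation recommended in the precautions above, VIFj = 1/(1 − R²j) from Equation (15.8), where R²j comes from regressing Xj on the other independent variables, can be sketched as follows; the function name and the threshold remark are ours.

```python
def vif(r2_j: float) -> float:
    """Variance inflationary factor for variable j; r2_j is the R-squared from
    regressing X_j on all of the other independent variables."""
    return 1.0 / (1.0 - r2_j)

print(vif(0.0))  # 1.0: X_j is uncorrelated with the other Xs
print(vif(0.9))  # about 10, a commonly cited warning level for collinearity
```

Large VIFs inflate the standard errors of the affected coefficients, which is why the Figure 15.18 process eliminates the variable with the largest VIF before running best subsets.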
Ethical Issues Ethical issues arise when a user who wants to make predictions manipulates the development process of the multiple regression model. The key here is intent. In addition to the situations that Section 13.9 discusses, unethical behavior occurs when someone uses multiple regression analysis and willfully fails to remove from consideration independent variables that exhibit a high collinearity with other independent variables or willfully fails to use methods other than least-squares regression when the assumptions necessary for least-squares regression are seriously violated.
In the Using Statistics scenario, you were hired by Nickels Broadcasting to determine which variables have an effect on WSTA-TV standby hours. You were given a 26-week sample that contained the weekly standby hours, staff count, remote engineering hours, graphics hours, and production hours. You performed a multiple regression analysis on the data. The coefficient of multiple determination indicated that 62.31% of the variation in standby hours can be explained by variation in the weekly staff count and the number of remote engineering, graphics, and editorial production hours. The model indicated that standby hours are estimated to increase by 1.2456 hours for each additional weekly staff member present, holding constant the other independent variables; to decrease by 0.1184 hour for each additional remote engineering hour, holding constant the other independent variables; to decrease by 0.2971 hour for each additional graphics hour, holding constant the other independent variables; and to increase by 0.1305 hour for each additional editorial production hour, holding constant the other independent variables. Each of the four independent variables had a significant effect on standby hours, holding constant the other independent variables. This regression model enables you to predict standby hours based on the weekly staff count, remote engineering hours, graphics hours, and editorial production hours. Any predictions developed by the model can then be carefully monitored, new data can be collected, and other variables may possibly be considered.
CHAPTER 15 | Multiple Regression Model Building
▼ SUMMARY
Figure 15.19 summarizes the topics associated with multiple regression model building.

FIGURE 15.19 Roadmap for multiple regression: choose among model building methods (stepwise regression, best subsets regression) and multiple regression techniques (quadratic terms, transformations, dummy variables and interaction terms); fit the best model (adjusted r², Cp statistic, collinearity analysis); determine and interpret the regression coefficients; perform residual analysis; test overall model significance (H0: β1 = β2 = ... = βk = 0) and test portions of the model; estimate βj; and validate and use the model for prediction and estimation (estimate μ and predict Y).
▼ REFERENCES
Kutner, M., C. Nachtsheim, J. Neter, and W. Li. Applied Linear Statistical Models, 5th ed. New York: McGraw-Hill/Irwin, 2005.
Montgomery, D. C., E. A. Peck, and G. G. Vining. Introduction to Linear Regression Analysis, 5th ed. New York: Wiley, 2012.
Snee, R. D. "Some Aspects of Nonorthogonal Data Analysis, Part I. Developing Prediction Equations." Journal of Quality Technology 5 (1973): 67-79.
▼ KEY EQUATIONS

Quadratic Regression Model
Yi = β0 + β1X1i + β2X1i² + εi   (15.1)

Quadratic Regression Equation
Ŷi = b0 + b1X1i + b2X1i²   (15.2)

Regression Model with a Square-Root Transformation
√Yi = β0 + β1X1i + εi   (15.3)

Original Multiplicative Model
Yi = β0 X1i^β1 X2i^β2 εi   (15.4)

Transformed Multiplicative Model
log Yi = log(β0 X1i^β1 X2i^β2 εi)
       = log β0 + β1 log X1i + β2 log X2i + log εi   (15.5)

Original Exponential Model
Yi = e^(β0 + β1X1i + β2X2i) εi   (15.6)

Transformed Exponential Model
ln Yi = ln(e^(β0 + β1X1i + β2X2i) εi)
      = ln(e^(β0 + β1X1i + β2X2i)) + ln εi
      = β0 + β1X1i + β2X2i + ln εi   (15.7)

Variance Inflationary Factor
VIFj = 1 / (1 − Rj²)   (15.8)

Cp Statistic
Cp = [(1 − Rk²)(n − T) / (1 − RT²)] − [n − 2(k + 1)]   (15.9)

where T is the total number of parameters (including the intercept) to be estimated in the full model, Rk² is the coefficient of multiple determination for a model with k independent variables, and RT² is the coefficient of multiple determination for the full model.
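Equations (15.4) and (15.5) say that a multiplicative model becomes linear after a log transformation. The sketch below (not from the text; it assumes numpy and uses synthetic data) recovers the parameters of a known multiplicative model by fitting the log-transformed version with least squares:

```python
import numpy as np

def fit_multiplicative(y, x1, x2):
    """Fit Y = b0 * X1^b1 * X2^b2 by least squares on the log-transformed
    model (Equation 15.5): log Y = log b0 + b1 log X1 + b2 log X2."""
    X = np.column_stack([np.ones(len(y)), np.log10(x1), np.log10(x2)])
    coef, *_ = np.linalg.lstsq(X, np.log10(y), rcond=None)
    log_b0, b1, b2 = coef
    return 10 ** log_b0, b1, b2

# Noise-free check: Y = 2 * X1^0.5 * X2^1.5 should be recovered exactly.
rng = np.random.default_rng(0)
x1, x2 = rng.uniform(1, 10, 30), rng.uniform(1, 10, 30)
y = 2.0 * x1**0.5 * x2**1.5
b0, b1, b2 = fit_multiplicative(y, x1, x2)
print(round(b0, 4), round(b1, 4), round(b2, 4))  # → 2.0 0.5 1.5
```

With real data the error term εi keeps the recovery from being exact, but the mechanics are the same: transform, fit a linear model, then back-transform the intercept.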
▼ KEY TERMS
best subsets approach 542
collinearity 538
Cp statistic 543
curvilinear relationship 527
logarithmic transformation 536
principle of parsimony 540
quadratic regression model 527
quadratic term 527
square-root transformation 535
stepwise regression 541
transformations 534
variance inflationary factor (VIF) 538
▼ CHECKING YOUR UNDERSTANDING
15.26 How can you evaluate whether collinearity exists in a multiple regression model?
15.27 What is the difference between stepwise regression and best subsets regression?
15.28 How do you choose among models according to the Cp statistic in best subsets regression?
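For question 15.28, a small helper function (a sketch, not part of the text) makes the Cp screening rule of Equation (15.9) concrete. The numbers below reuse the WSTA-TV example from this chapter: n = 26 weeks, four independent variables, and R² = 0.6231.

```python
def cp_statistic(r2_k, r2_full, n, k, total_params):
    """Mallows' Cp per Equation (15.9): compares a candidate model with k
    independent variables against the full model, where total_params is the
    total number of parameters (including the intercept) in the full model."""
    return ((1 - r2_k) * (n - total_params)) / (1 - r2_full) - (n - 2 * (k + 1))

# For the full model itself (k + 1 == total_params and R²_k == R²_full),
# Cp reduces to k + 1, the benchmark used to screen candidate models:
# models with Cp at or below k + 1 are considered for further study.
print(cp_statistic(0.6231, 0.6231, 26, 4, 5))  # → 5.0
```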
▼ CHAPTER REVIEW PROBLEMS

15.29 A specialist in baseball analytics has expanded his analysis, presented in Problem 14.77 on page 520, of which variables are important in predicting a team's wins in a given baseball season. He has collected data related to wins, ERA, saves, runs scored per game, batting average, home runs, and batting average against for a recent season. Develop the most appropriate multiple regression model to predict a team's wins. Be sure to include a thorough residual analysis. In addition, provide a detailed explanation of the results.

15.30 In the production of printed circuit boards, errors in the alignment of electrical connections are a source of scrap. The file Registration Errors Comparison contains the registration error, the temperature, the pressure, and the circuit board cost (low or high) used in the production of circuit boards. Develop the most appropriate multiple regression model to predict registration error.
Source: Data extracted from C. Nachtsheim and B. Jones, "A Powerful Analytical Tool," Six Sigma Forum Magazine, August 2003, pp. 30-33.

15.31 Hemlock Farms is a community located in the Pocono Mountains area of eastern Pennsylvania. The file contains information on homes that were recently for sale. The variables included were:
List Price - Asking price of the house
Hot Tub - Whether the house has a hot tub, with 0 = No and 1 = Yes
Lake View - Whether the house has a lake view, with 0 = No and 1 = Yes
Bathrooms - Number of bathrooms
Bedrooms - Number of bedrooms
Loft/Den - Whether the house has a loft or den, with 0 = No and 1 = Yes
Finished Basement - Whether the house has a finished basement, with 0 = No and 1 = Yes
Acres - Number of acres for the property
Develop the most appropriate multiple regression model to predict the asking price. Be sure to perform a thorough residual analysis. In addition, provide a detailed explanation of your results.

15.32 Nassau County is located approximately 25 miles east of New York City. The file contains a sample of 30 single-family homes located in Glen Cove. Variables included are the fair market value, land area of the property (acres), interior size of the house (square feet), age (years), number of rooms, number of bathrooms, and number of cars that can be parked in the garage.
a. Develop the most appropriate multiple regression model to predict fair market value.
b. Compare the results in (a) with those of Problems 15.33 (a) and 15.34 (a).

15.33 Data similar to those in Problem 15.32 are available for homes located in Roslyn (approximately 8 miles from Glen Cove) and are stored in a companion file.
a. Perform an analysis similar to that of Problem 15.32.
b. Compare the results in (a) with those of Problems 15.32 (a) and 15.34 (a).

15.34 Data similar to those in Problem 15.32 are available for homes located in Freeport (located approximately 20 miles from Roslyn) and are stored in a companion file.
a. Perform an analysis similar to that of Problem 15.32.
b. Compare the results in (a) with those of Problems 15.32 (a) and 15.33 (a).

15.35 You are a real estate broker who wants to compare property values in Glen Cove and Roslyn (which are located approximately 8 miles apart). Use the combined data file, and make sure to include the dummy variable for location (Glen Cove or Roslyn) in the regression model.
a. Develop the most appropriate multiple regression model to predict fair market value.
b. What conclusions can you reach concerning the differences in fair market value between Glen Cove and Roslyn?

15.36 You are a real estate broker who wants to compare property values in Glen Cove, Freeport, and Roslyn. Use the data in GC Freeport Roslyn.
a. Develop the most appropriate multiple regression model to predict fair market value.
b. What conclusions can you reach concerning the differences in fair market value between Glen Cove, Freeport, and Roslyn?

15.37 Financial analysts engage in business valuation to determine a company's value. A standard approach uses the multiple of earnings method: you multiply a company's profits by a certain value (average or median) to arrive at a final value. More recently, regression analysis has been demonstrated to consistently deliver more accurate predictions. A valuator has been given the assignment of valuing a drug company. She obtained financial data on 60 drug companies (Industry Group Standard Industrial Classification [SIC] 3 code 283), which included pharmaceutical preparation firms (SIC 4 code 2834), in vitro and in vivo diagnostic substances firms (SIC 4 code 2835), and biological products firms (SIC 4 code 2836). The file Business Valuation 2 contains these variables:
COMPANY - Drug company name
TS - Ticker symbol
SIC 3 - Standard Industrial Classification 3 code (industry group identifier)
SIC 4 - Standard Industrial Classification 4 code (industry identifier)
PB fye - Price-to-book value ratio (fiscal year ending)
PE fye - Price-to-earnings ratio (fiscal year ending)
NL Assets - Natural log of assets (as a measure of size)
ROE - Return on equity
SGROWTH - Growth (GS5)
DEBT/EBITDA - Ratio of debt to earnings before interest, taxes, depreciation, and amortization
D2834 - Dummy variable indicator of SIC 4 code 2834 (1 if 2834, 0 if not)
D2835 - Dummy variable indicator of SIC 4 code 2835 (1 if 2835, 0 if not)
Develop the most appropriate multiple regression model to predict the price-to-book value ratio. Perform a thorough residual analysis and provide a detailed explanation of your results.

15.38 The J. Conklin article, "It's a Marathon, Not a Sprint," Quality Progress, June 2009, pp. 46-49, discussed a metal deposition process in which a piece of metal is placed in an acid bath and an alloy is layered on top of it. The key quality characteristic is the thickness of the alloy layer. The file contains the following variables:
Thickness - Thickness of the alloy layer
Catalyst - Catalyst concentration in the acid bath
pH - pH level of the acid bath
Pressure - Pressure in the tank holding the acid bath
Temp - Temperature in the tank holding the acid bath
Voltage - Voltage applied to the tank holding the acid bath
Develop the most appropriate multiple regression model to predict the thickness of the alloy layer. Be sure to perform a thorough residual analysis. The article suggests that there is a significant interaction between the pressure and the temperature in the tank. Do you agree?
15.39 A molding machine that contains different cavities is used in producing plastic parts. The product characteristics of interest are the product length (in.) and weight (g). The mold cavities were filled with raw material powder and then vibrated during the experiment. The factors that were varied were the vibration time (seconds), the vibration pressure (psi), the vibration amplitude (%), the raw material density (g/mL), and the quantity of raw material (scoops). The experiment was conducted in two different cavities on the molding machine. The data are stored in a companion file.
Source: Data extracted from M. Lopez and M. McShane-Vaughn, "Maximizing Product, Minimizing Costs," Six Sigma Forum Magazine, February 2008, pp. 18-23.
a. Develop the most appropriate multiple regression model to predict the product length in cavity 1. Be sure to perform a thorough residual analysis. In addition, provide a detailed explanation of your results.
b. Repeat (a) for cavity 2.
c. Compare the results for length in the two cavities.
d. Develop the most appropriate multiple regression model to predict the product weight in cavity 1. Be sure to perform a thorough residual analysis. In addition, provide a detailed explanation of your results.
e. Repeat (d) for cavity 2.
f. Compare the results for weight in the two cavities.

15.40 The file contains a sample of 25 cities in the United States. Variables included are city average annual salary ($), unemployment rate (%), median home value ($thousands), number of violent crimes per 100,000 residents, average commuter travel time (minutes), and livability score, a rating on a scale of 0 to 100 that rates the overall livability of the city.
Source: Data extracted from "125 Best Places to Live in the USA," available at bit.ly/2jYvtFz and "AARP Livability Index," available at bit.ly/1Qbd6oj.
Develop the most appropriate multiple regression model to predict average annual salary ($). Be sure to perform a thorough residual analysis and provide a detailed explanation of the results as part of your answer.

15.41 A sports statistician seeks to evaluate NBA team statistics for a recent year. The statistician collects data for wins and points scored, field goal percentage, opponent field goal percentage, three-point field goal percentage, free throw percentage, rebounds per game, assists per game, and turnovers per game, and stores these data in a companion file. Develop the most appropriate multiple regression model to predict the number of wins for an NBA team. Be sure to perform a thorough residual analysis and provide a detailed explanation of the results as part of your answer.
REPORT WRITING EXERCISE 15.42 In Problems 15.32-15.36 you developed multiple regression models to predict the fair market value of houses in Glen Cove, Roslyn, and Freeport. Now write a report based on the models you developed. Append all appropriate charts and statistical information to your report.
▼ CHAPTER 15 CASES

The Mountain States Potato Company
Mountain States Potato Company sells a by-product of its potato-processing operation, called a filter cake, to area feedlots as cattle feed. The business problem faced by the feedlot owners is that the cattle are not gaining weight as quickly as they once were. The feedlot owners believe that the root cause of the problem is that the percentage of solids in the filter cake is too low. Historically, the percentage of solids in the filter cakes ran slightly above 12%. Lately, however, the solids are running in the 11% range. What is actually affecting the solids is a mystery, but something has to be done quickly. Individuals involved in the process were asked to identify variables that might affect the percentage of solids. This review turned up the six variables (in addition to the percentage of solids) listed in the table that follows. Data collected by monitoring the process several times daily for 20 days are stored in a companion file.
1. Thoroughly analyze the data and develop a regression model to predict the percentage of solids.
2. Write an executive summary concerning your findings to the president of the Mountain States Potato Company. Include specific recommendations on how to get the percentage of solids back above 12%.
Variable - Comments
SOLIDS - Percentage of solids in the filter cake.
PH - Acidity. This measure of acidity indicates bacterial action in the clarifier and is controlled by the amount of downtime in the system. As bacterial action progresses, organic acids are produced that can be measured using pH.
LOWER - Pressure of the vacuum line below the fluid line on the rotating drum.
UPPER - Pressure of the vacuum line above the fluid line on the rotating drum.
THICK - Filter cake thickness, measured on the drum.
VARIDRIV - Setting used to control the drum speed. May differ from DRUMSPD due to mechanical inefficiencies.
DRUMSPD - Speed at which the drum is rotating when collecting the filter cake. Measured with a stopwatch.
Sure Value Convenience Stores You work in the corporate office for a nationwide convenience store franchise that operates nearly 10,000 stores. The per-store daily customer count (i.e., the mean number of customers in a store in one day) has been steady at 900 for some time. To increase the customer count, the chain is considering cutting prices for coffee beverages. The question to be answered is how much prices should be cut to increase the daily customer count without reducing the gross margin on coffee sales too much. You decide to carry out an experiment in a sample of 24 stores where customer counts have been running almost exactly at the national average of 900. In six of the stores, the price of a small coffee will now be $0.59; in six stores, the price of a small coffee will now be $0.69; in six stores, the price of a small coffee will now be $0.79; and in six stores, the price of a small coffee will now be $0.89. After four weeks at the new prices, the daily customer count in the stores is determined and is stored in a companion file.
a. Construct a scatter plot for price and sales.
b. Fit a quadratic regression model and state the quadratic regression equation. c. Predict the mean weekly sales for a small coffee priced at 79 cents.
d. Perform a residual analysis on the results and determine whether the regression model is valid. e. At the 0.05 level of significance, is there a significant quadratic relationship between weekly sales and price? f. At the 0.05 level of significance, determine whether the quadratic model is a better fit than the linear model. g. Interpret the meaning of the coefficient of multiple determination.
h. Compute the adjusted r².
i. What price do you recommend the small coffee should be sold for?
Digital Case Apply your knowledge of multiple regression model building in this Digital Case, which extends the Chapter 14 OmniPower Bars Using Statistics scenario.
Still concerned about ensuring a successful test marketing of its OmniPower bars, the marketing department of OmniFoods has contacted Connect2Coupons (C2C), another merchandising consultancy. C2C suggests that earlier analysis done by In-Store Placements Group (ISPG) was faulty because it did not use the correct type of data. C2C claims that its Internet-based viral marketing will have an even greater effect on OmniPower energy bar sales, as new data from the same 34-store sample will show. In response, ISPG says its earlier claims are valid and has reported to the OmniFoods marketing department that it can discern no simple relationship between C2C's viral marketing and increased OmniPower sales.
Open OmniPowerForum15.pdf to review all the claims made in a private online forum and chat hosted on the OmniFoods corporate website. Then answer the following:
1. Which of the claims are true? False? True but misleading? Support your answer by performing an appropriate statistical analysis.
2. If the grocery store chain allowed OmniFoods to use an unlimited number of sales techniques, which techniques should it use? Explain.
3. If the grocery store chain allowed OmniFoods to use only one sales technique, which technique should it use? Explain.
The Craybill Instrumentation Company Case The Craybill Instrumentation Company produces highly technical industrial instrumentation devices. The human resources (HR) director has the business objective of improving recruiting decisions concerning sales managers. The company has 45 sales regions, each headed by a sales manager. Many of the sales managers have degrees in electrical engineering, and due to the technical nature of the product line, several company officials believe that only applicants with degrees in electrical engineering should be considered. At the time of their application, candidates are asked to take the Strong-Campbell Interest Inventory Test and the Wonderlic Personnel Test. Due to the time and money involved with the testing, some discussion has taken place about dropping one or both of the tests. To start, the HR director gathered information on each of the 45 current sales managers, including years of selling experience, electrical engineering background, and the scores from both the Wonderlic and Strong-Campbell tests. The HR director has decided to use regression modeling to predict a dependent variable of "sales index" score, which is the ratio of the region's actual sales divided by the target sales. The target values are constructed each year by upper management, in consultation with the sales managers, and are based on past performance and market potential within each region. The companion file contains information on the 45 current sales managers.
The following variables are included:
Sales - Ratio of yearly sales divided by the target sales value for that region; the target values were mutually agreed-upon "realistic expectations"
Wonder - Score from the Wonderlic Personnel Test; the higher the score, the higher the applicant's perceived ability to manage
SC - Score on the Strong-Campbell Interest Inventory Test; the higher the score, the higher the applicant's perceived interest in sales
Experience - Number of years of selling experience prior to becoming a sales manager
Engineer - Dummy variable that equals 1 if the sales manager has a degree in electrical engineering and 0 otherwise
a. Develop the most appropriate regression model to predict sales.
b. Do you think that the company should continue administering both the Wonderlic and Strong-Campbell tests? Explain.
c. Do the data support the argument that electrical engineers outperform the other sales managers? Would you support the idea to hire only electrical engineers? Explain.
d. How important is prior selling experience in this case? Explain.
e. Discuss in detail how the HR director should incorporate the regression model you developed into the recruiting process.
More Descriptive Choices Follow-Up Follow up on the Using Statistics scenario, "More Descriptive Choices, Revisited" on page 147, by developing regression models to predict the one-year return, the three-year return, the five-year return, and the ten-year return based on the assets, turnover ratio, expense ratio, beta, standard deviation, type of fund (growth versus value), and risk (stored in Retirement Funds). (For this analysis, combine low and average risk into the new category "not high.") Be sure to perform a thorough residual analysis. Provide a summary report that explains your results in detail.
▼ EXCEL GUIDE
EG15.1 The QUADRATIC REGRESSION MODEL

Key Technique Use the exponential operator (^) in a column of formulas to create the quadratic term.
Example Create the quadratic term for the Section 15.1 concrete strength analysis.

PHStat, Workbook, and Analysis ToolPak For the example, open to the DATA worksheet of the Fly Ash workbook, which contains the independent X variable fly ash % in column A and the dependent Y variable strength in column B and:
1. Select column B, right-click, and click Insert from the shortcut menu. This creates a new, blank column B, and changes strength to column C.
2. Enter the label FlyAsh%^2 in cell B1 and then enter the formula =A2^2 in cell B2.
3. Copy this formula down column B through all the data rows (through row 19).
(Best practice places the quadratic term in a column that is contiguous to the columns of the other independent X variables.)

Adapt the Section EG14.1 instructions to perform a regression analysis using the quadratic term. For PHStat, use C1:C19 as the Y Variable Cell Range and A1:B19 as the X Variables Cell Range. For Worksheet, use C2:C19 and A2:B19 in step 4 as the new cell ranges. For Analysis ToolPak, use C1:C19 as the Input Y Range and A1:B19 as the Input X Range.

To create a scatter plot, adapt the EG2.5 "The Scatter Plot" instructions. For PHStat, use C1:C19 as the Y Variable Cell Range and A1:B19 as the X Variable Cell Range. For Worksheet, select the noncontiguous cell range A1:A19, C1:C19 in step 1 and skip step 3. (Appendix B explains how to select a noncontiguous cell range.) Select the scatter chart and then:
1. Select Design (or Chart Design) → Add Chart Element → Trendline → More Trendline Options.
In the Format Trendline pane:
2. Click Polynomial and keep 2 as the Order.
3. Check Display Equation on chart.

In older Excels, select Layout → Trendline → More Trendline Options in step 1 and in the Format Trendline dialog box, click Trendline Options in the left pane. In the Trendline Options right pane, click Polynomial, check Display Equation on chart, and click OK.
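The same computation the Excel steps perform (append a squared column, then regress) can be sketched in Python. The fly ash numbers below are hypothetical stand-ins, not the actual Fly Ash workbook values; numpy is an assumed dependency.

```python
import numpy as np

# Hypothetical stand-ins for the Fly Ash workbook's fly ash % and strength.
flyash_pct = np.array([0.0, 10.0, 20.0, 30.0, 40.0, 50.0, 60.0])
strength = np.array([4.8, 5.3, 5.9, 6.1, 5.8, 5.0, 3.9])

# Analog of the FlyAsh%^2 column plus the regression run: build the
# design matrix [1, X, X^2] and solve Equation (15.2) by least squares.
A = np.column_stack([np.ones_like(flyash_pct), flyash_pct, flyash_pct ** 2])
(b0, b1, b2), *_ = np.linalg.lstsq(A, strength, rcond=None)
print(f"Yhat = {b0:.4f} + {b1:.4f} X {b2:+.6f} X^2")
```

The displayed equation matches what Excel's polynomial (order 2) trendline would show for the same data; a rise-then-fall pattern like this one yields a negative quadratic coefficient b2.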
EG15.2 USING TRANSFORMATIONS in REGRESSION MODELS

The Square-Root Transformation
To the worksheet that contains your regression data, add a new column of formulas that computes the square root of the variable for which you want to create a square-root transformation. For example, to create a square-root transformation in a blank column D for a variable in column C, enter the formula =SQRT(C2) in cell D2 of that worksheet and copy the formula down through all data rows.
If the column to the immediate right of the variable to be transformed is not empty, first select that column, right-click, and click Insert from the shortcut menu. Then place the transformation in the newly inserted blank column.

The Log Transformation
To the worksheet that contains your regression data, add a new column of formulas that computes the common (base 10) logarithm or natural logarithm (base e) of the dependent variable to create a log transformation. For example, to create a common logarithm transformation in a blank column D for a variable in column C, enter the formula =LOG(C2) in cell D2 of that worksheet and copy the formula down through all data rows. To create a natural logarithm transformation in a blank column D for a variable in column C, enter the formula =LN(C2) in cell D2 of that worksheet and copy the formula down through all data rows.
If the dependent variable appears in a column to the immediate right of the independent variable being transformed, first select the dependent variable column, right-click, and click Insert from the shortcut menu and then place the transformation of the independent variable in that new column.
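The transformed-column workflow is the same outside Excel. As a sketch (numpy assumed, data synthetic), here is the square-root model of Equation (15.3): transform Y, then fit an ordinary linear regression on the transformed scale.

```python
import numpy as np

def fit_sqrt_model(x, y):
    """Fit Equation (15.3) by regressing sqrt(Y) on X, mirroring the Excel
    workflow of adding a =SQRT() column before running the regression."""
    X = np.column_stack([np.ones(len(x)), x])
    b, *_ = np.linalg.lstsq(X, np.sqrt(y), rcond=None)
    return b  # intercept and slope on the square-root scale

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = (1.0 + 0.5 * x) ** 2          # exact square-root relationship, no noise
b0, b1 = fit_sqrt_model(x, y)
print(round(b0, 4), round(b1, 4))  # → 1.0 0.5
```

A =LOG() or =LN() column works the same way: replace np.sqrt with np.log10 or np.log before fitting.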
EG15.3 COLLINEARITY

Workbook To compute the variance inflationary factor, first use the EG14.1 "Interpreting the Regression Coefficients" Workbook instructions on page 522 to create regression results worksheets for every combination of independent variables in which one serves as the dependent variable. Then, in each of the regression results worksheets, enter the label VIF in cell A9 and enter the formula =1/(1 - B5) in cell B9 to compute the VIF. (Cell B5 holds the R² value of that auxiliary regression.)
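The Excel formula =1/(1 - B5) is Equation (15.8) applied to the R² of each auxiliary regression. A Python sketch of the same idea (numpy assumed; data synthetic):

```python
import numpy as np

def vif(X):
    """Variance inflationary factors (Equation 15.8): for each column j,
    regress it on the remaining columns and return 1 / (1 - R_j^2)."""
    n, k = X.shape
    factors = []
    for j in range(k):
        A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
        resid = X[:, j] - A @ coef
        r2 = 1 - (resid @ resid) / ((X[:, j] - X[:, j].mean()) ** 2).sum()
        factors.append(1.0 / (1.0 - r2))
    return factors

# Nearly uncorrelated predictors give VIFs close to 1; large values
# signal that a predictor is well explained by the other predictors.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
print([round(v, 2) for v in vif(X)])
```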
EG15.4 MODEL BUILDING

The Stepwise Regression Approach to Model Building
Key Technique Use PHStat to perform a stepwise analysis.
Example Perform the Figure 15.14 stepwise analysis for the Nickels Broadcasting data on page 542.

PHStat Use Stepwise Regression. For the example, open to the DATA worksheet of the Nickels 26Weeks workbook and select PHStat → Regression → Stepwise Regression. In the procedure's dialog box:
1. Enter A1:A27 as the Y Variable Cell Range.
2. Enter B1:E27 as the X Variables Cell Range.
3. Check First cells in both ranges contain label.
4. Enter 95 as the Confidence level for regression coefficients.
5. Click p values as the Stepwise Criteria.
6. Click General Stepwise and keep the pair of .05 values as the p value to enter and the p value to remove.
7. Enter a Title and click OK.

Because this procedure examines many different regression models, there may be a noticeable delay between when OK is clicked and results appear onscreen.
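To make the "p value to enter" criterion concrete, here is a minimal sketch of forward selection, the adding half of general stepwise regression. It is not the PHStat implementation; numpy and scipy are assumed dependencies, the data are synthetic, and a stricter-than-default p_enter is used for illustration.

```python
import numpy as np
from scipy import stats

def slope_p_values(A, y):
    """Two-sided t-test p-values for the coefficients of y = A b + e,
    where A already includes a leading column of ones."""
    n, p = A.shape
    b, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ b
    s2 = (resid @ resid) / (n - p)
    se = np.sqrt(np.diag(s2 * np.linalg.inv(A.T @ A)))
    return 2 * stats.t.sf(np.abs(b / se), n - p)

def forward_stepwise(X, y, p_enter=0.001):
    """Repeatedly add the candidate variable with the smallest p-value
    below p_enter; stop when no remaining candidate qualifies."""
    n, k = X.shape
    chosen = []
    while True:
        best = None
        for j in set(range(k)) - set(chosen):
            A = np.column_stack([np.ones(n), X[:, chosen + [j]]])
            p = slope_p_values(A, y)[-1]      # p-value of the new candidate
            if p < p_enter and (best is None or p < best[1]):
                best = (j, p)
        if best is None:
            return chosen
        chosen.append(best[0])

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4))
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(size=100)   # only X0, X2 matter
chosen = forward_stepwise(X, y)
print(sorted(chosen))
```

General stepwise also rechecks already-entered variables against the "p value to remove" threshold after each addition; that removal step is omitted here for brevity.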
CONTENTS
USING STATISTICS: Is the ByYourDoor Service Trending?
16.1 Time-Series Component Factors
16.2 Smoothing an Annual Time Series
16.3 Least-Squares Trend Fitting and Forecasting
16.4 Autoregressive Modeling for Trend Fitting and Forecasting
16.5 Choosing an Appropriate Forecasting Model
16.6 Time-Series Forecasting of Seasonal Data
16.7 Index Numbers (online)
CONSIDER THIS: Let The Model User Beware
Is the ByYourDoor Service Trending? Revisited
EXCEL GUIDE
OBJECTIVES
• Construct different time-series forecasting models for annual and seasonal data
• Choose the most appropriate time-series forecasting model
Senior managers at ByYourDoor, an online food delivery service, have asked you to analyze sales data. These managers would like to know if sales data can be used to estimate future sales. They already know that their business is thriving, but sales seem to be subject to periodic dips that make it hard to accurately estimate short-term physical and labor resource requirements for the company. One manager wondered if a regression technique might be useful, but another manager recalls that simple and multiple regression models can only predict inside the range of the X values used to create the model. Looking forward would require going beyond the values in such a range. Is it even possible to make a useful estimation about a future value of a dependent Y variable?
Forecasting estimates future business conditions by monitoring changes that occur over time. Managers must be able to develop forecasts to anticipate likely changes their businesses will face. For example, retail marketing executives might forecast product demand, sales revenues, consumer preferences, and inventory, among other things, to make decisions regarding product promotions and strategic planning. Time-series forecasting, the focus of this chapter, uses a time series, a set of numerical data collected over time at regular intervals, as the basis for the estimation. Both government and business activities generate time-series data. Some government examples include economic indicators such as a consumer price index or the quarterly gross domestic product (GDP) as well as measurements of real-world phenomena such as the mean monthly level of lakes, the levels of carbon dioxide in the air, or the daily high temperature for a locality. Businesses generate many types of time series and typically include annual measurements of sales revenues, net profits, and other accounting data in annual reports or similar documents. Time-series forecasting is not the only type of forecasting that uses numerical data. Causal forecasting methods help determine the factors that relate to the variable being estimated. These methods include multiple regression analysis with lagged variables, econometric modeling, leading indicator analysis, and other economic barometers that are beyond the scope of this text, but which the Box, Hanke, and Montgomery references discuss. Although time-series forecasting shares the goal of prediction with the regression methods that previous chapters discuss, time-series forecasting seeks to estimate a future value, a goal very different from the goals of the regression methods that Chapters 13, 14, and 15 discuss. For ByYourDoor, an initial complication would be to establish the time interval that makes the most sense for estimating future sales.
The time interval can affect both the perception of the data and the statistical methods used to analyze the time series. Because the company buys supplies monthly and because customers use the service once a month, on average, collecting monthly data might make the most sense, but other time intervals might also be appropriate depending on the goals of the senior managers at the firm.
16.1 Time-Series Component Factors

As Section 2.5 notes, a time-series plot, in which the X axis represents units of time and the Y axis represents the values of a numerical variable, can help visualize trends in data that occur over time. A trend, an overall long-term upward or downward movement in a time series, is one possible pattern, or component, of a time series. Establishing whether a trend exists in a time series is an important early step in time-series analysis. Time-series plots can suggest whether a trend component exists in the time series. If a time series shows no trend, then the techniques of moving averages and exponential smoothing that Section 16.2 discusses can be used to analyze the time series. If a time series shows a trend, the various methods that Sections 16.3 through 16.5 discuss can be used if the time series represents annual data. Figure 16.1 Panel A shows a time series with a strong upward trend.

Time-series data may also show a combination of cyclical and irregular components. A cyclical component is up-and-down movement in the time series of medium duration, typically from two to ten years in length. Figure 16.1 Panel B shows a time series with four cycles of differing durations. These cycles often correlate with "business cycles" that are associated with certain types of economic activities. Figure 16.1 Panel C visualizes a time series that has a strong irregular component. An irregular component reflects one-time changes to a time series that cannot be explained by the trend or cyclical components. For business decision makers, discovering an irregularity may signal an inflection point at which a significant business or economic change has occurred.

If a time series is collected at intervals of less than a year, such as monthly or quarterly, the time series may also have a seasonal component. Figure 16.1 Panel D visualizes a seasonal component in which time-series data values show the pattern of being very high during the summer months and very low during the winter months.
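The four components can be made concrete by assembling a synthetic monthly series from them. This is an illustrative sketch, not data from the text: the amplitudes, periods, and noise level are arbitrary assumptions, and numpy is an assumed dependency.

```python
import numpy as np

# A synthetic monthly series built from the four components this
# section describes; all magnitudes are arbitrary illustrative choices.
months = np.arange(120)                                  # ten years, monthly
trend = 100 + 0.8 * months                               # long-term upward movement
seasonal = 10 * np.sin(2 * np.pi * months / 12)          # pattern repeating every 12 months
cyclical = 15 * np.sin(2 * np.pi * months / 60)          # one five-year up-and-down cycle
irregular = np.random.default_rng(3).normal(0, 2, 120)   # unexplained one-time changes

series = trend + seasonal + cyclical + irregular

# The trend component shows up as a higher mean in the later half of the series.
print(series[:60].mean() < series[60:].mean())  # → True
```

Real series rarely separate this cleanly; the decomposition methods later in the chapter work in the opposite direction, starting from the observed series and estimating the components.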
CHAPTER 16 Time-Series Forecasting
FIGURE 16.1 Trend, cyclical, irregular, and seasonal components of a time series. [Four time-series plots: Panel A, a time series with a long-term trend; Panel B, a time series with a cyclical component; Panel C, a time series with an irregular component; Panel D, a time series with a seasonal component, plotted by month.]
Figure 16.2 shows the time series of houses sold in the United States from June 1987 through June 2019. This time series contains two types of components. The irregular downward component centered on 2008 reflects the collapse of the U.S. housing market that led to the "Great Recession" of 2007-2009. Several monthly upward spikes can be seen, including one that occurs every March (red points). This seasonal effect persists through the irregular component centered on 2008, in which March 2007 and March 2008 sales represent temporary upswings. An analyst trying to predict future housing sales and not accounting for that seasonal effect might have been misled by these March changes during the 2007-2009 recessionary period. An analyst who understood that such spikes were seasonal would not expect these spikes to continue and therefore would have been less likely to overestimate future housing sales.
FIGURE 16.2 Monthly sales of houses in the United States, June 1987 through June 2019 (red plots represent March sales). [Time-series plot of houses sold by month, June 1987 through June 2019.]
16.2 Smoothing an Annual Time Series

Smoothing a time series, transforming the time series to dampen small-scale fluctuations, can help determine if a time series contains a trend because the smoothing minimizes the effects of the other time-series components. For example, Table 16.1 presents the annual U.S. and Canada movie attendance (in billions) from 2001 through 2018, as reflected by the number of tickets sold. Figure 16.3 visualizes these data, stored in Movie Attendance .
TABLE 16.1 Annual movie attendance (in billions) from 2001 through 2018

Year  Attendance    Year  Attendance    Year  Attendance
2001  1.44          2007  1.40          2013  1.34
2002  1.58          2008  1.34          2014  1.27
2003  1.53          2009  1.41          2015  1.32
2004  1.51          2010  1.34          2016  1.32
2005  1.38          2011  1.28          2017  1.23
2006  1.41          2012  1.36          2018  1.31

Source: Data extracted from boxofficemojo.com/yearly.
FIGURE 16.3 Time-series plot of movie attendance from 2001 through 2018. [Plot of attendance (in billions) versus year, 2000 through 2020.]
Figure 16.3 seems to show a slight downward trend in movie attendance for the time period plotted. However, the variation that exists from one time period to another can sometimes obscure a long-term trend, making an existing trend hard to identify. Using moving averages or exponential smoothing can smooth the data and reveal a long-term trend that may be present.
Moving Averages

student TIP
Remember that you cannot calculate moving averages at the beginning and at the end of the series.
The moving averages method calculates means for sequences of consecutive time-series values for a time duration L. The sequences each differ by one time-series value, as the moving averages method "moves" through the time series. For example, for a three-year moving average for an annual time series of 11 years, the first calculated mean would be the mean of the time-series values for years 1 through 3, the second calculated mean would be the mean for years 2 through 4, and the ninth calculated mean would be the mean for years 9 through 11. The moving averages method always reduces the number of values because moving averages cannot be calculated for the first (L - 1)/2 years and the last (L - 1)/2 years of the time series. For the example, in which L = 3, a moving average cannot be calculated for either the first or last (eleventh) year.

Although L could be any whole number, making L an odd number permits centering each moving average on a time value, which simplifies preparing tabular and visual summaries of a moving average. For example, if L = 3, the first moving average for an annual time series of 11 years would be centered on year 2. If L = 5, the first moving average would be centered on year 3. However, if L = 4, the moving average would be centered on year "2.5," a time value that is not part of the original time series. For annual time-series data that do not contain an obvious cyclical component, values of 3, 5, or 7 for L are reasonable choices. If a cyclical component exists in a time series, the value of L should be a number that corresponds to or is a multiple of the estimated length of a cycle. Example 16.1 illustrates calculating moving averages for L = 5.
EXAMPLE 16.1 Calculating Five-Year Moving Averages
The following data represent revenue (in $millions) for a casual dining restaurant over the 11-year period 2008 to 2018:

4.0 5.0 7.0 6.0 8.0 9.0 5.0 7.0 7.5 5.5 6.5

Compute the five-year moving averages for this annual time series.

SOLUTION Five-year moving averages take the mean of five consecutive time-series values.
The first of the five-year moving averages is

MA(5) = (Y1 + Y2 + Y3 + Y4 + Y5)/5 = (4.0 + 5.0 + 7.0 + 6.0 + 8.0)/5 = 30.0/5 = 6.0

The second of the five-year moving averages is

MA(5) = (Y2 + Y3 + Y4 + Y5 + Y6)/5 = (5.0 + 7.0 + 6.0 + 8.0 + 9.0)/5 = 35.0/5 = 7.0

The third, fourth, fifth, sixth, and seventh moving averages are

MA(5) = (Y3 + Y4 + Y5 + Y6 + Y7)/5 = (7.0 + 6.0 + 8.0 + 9.0 + 5.0)/5 = 35.0/5 = 7.0
MA(5) = (Y4 + Y5 + Y6 + Y7 + Y8)/5 = (6.0 + 8.0 + 9.0 + 5.0 + 7.0)/5 = 35.0/5 = 7.0
MA(5) = (Y5 + Y6 + Y7 + Y8 + Y9)/5 = (8.0 + 9.0 + 5.0 + 7.0 + 7.5)/5 = 36.5/5 = 7.3
MA(5) = (Y6 + Y7 + Y8 + Y9 + Y10)/5 = (9.0 + 5.0 + 7.0 + 7.5 + 5.5)/5 = 34.0/5 = 6.8
MA(5) = (Y7 + Y8 + Y9 + Y10 + Y11)/5 = (5.0 + 7.0 + 7.5 + 5.5 + 6.5)/5 = 31.5/5 = 6.3

student TIP
Using a time duration L that is an odd number facilitates the comparison of the moving averages with the original time series.
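The Example 16.1 arithmetic can be checked with a short script. This is a plain-Python sketch, not part of the textbook's software; the function and variable names are illustrative.

```python
# Sketch: centered moving averages for Example 16.1 (names are illustrative).

def moving_averages(series, L):
    """Return the L-period moving averages of a time series.

    Moving averages cannot be computed for the first and last (L - 1) // 2
    periods, so the result is shorter than the input by L - 1 values.
    """
    return [sum(series[i:i + L]) / L for i in range(len(series) - L + 1)]

# Revenue ($millions) for the casual dining restaurant, 2008 to 2018
revenue = [4.0, 5.0, 7.0, 6.0, 8.0, 9.0, 5.0, 7.0, 7.5, 5.5, 6.5]

ma5 = moving_averages(revenue, 5)
print(ma5)  # the seven five-year moving averages from the solution above
```

With L = 5, the eleven-year series yields seven moving averages, centered on years 3 through 9, matching the hand calculations above.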
Figure 16.4 (left) visualizes the three-year and five-year moving averages for the movie attendance data stored in Movie Attendance . The moving average plots show a downward trend but, unlike the Figure 16.3 time-series plot, reveal that the trend greatly slowed after 2004. Figure 16.4 (right), a redone plot that discards the time-series values for the early years 2001 through 2004, reveals a time series with a much weaker trend. That pattern suggests that using the shorter time series may lead to a more accurate short-term forecast of future movie attendance.
FIGURE 16.4 Time-series plots for the three- and five-year moving averages of the movie attendance for two time series, 2001 through 2018 and 2005 through 2018. [Left: attendance with the MA3-Yr and MA5-Yr series, 2001 through 2018. Right: attendance with the moving averages, 2005 through 2018.]

Later movie attendance examples in this chapter use the shorter 2005 through 2018 time series for the reasons this passage discusses.
While discarding data is almost never allowed in the inferential methods that earlier chapters discuss, discarding consecutive time-series values is an example of the partially subjective nature of time-series forecasting. Determining the proper length of a time series to be used for forecasting can be a mix of business experience and awareness of external or one-time, irregular factors. For the U.S. and Canadian movie attendance time series, further investigation reveals that the year 2002 was unusual in being the only year in which the popular Star Wars, Harry Potter, and Lord of the Rings movie series all had releases. (And those three films were only the second, third, and fourth most popular movies that year, as 2002 also saw the release of the first modern-day Spider-Man movie.) Because the shorter movie attendance time series shows no trend, the moving averages based on the shorter time series could be used for short-term forecasting. However, a second technique, exponential smoothing, typically offers better short-term forecasting.
Exponential Smoothing

Exponential smoothing consists of a series of exponentially weighted moving averages. The weights assigned to the values change so that the most recent (the last) value receives the highest weight, the previous value receives the second-highest weight, and so on, with the first value receiving the lowest weight. Therefore, the more recent a time-series value is, the more influence that value has on the smoothed result. Exponential smoothing uses all previous values, in contrast to the moving averages method, which uses only a subset of the time series to calculate each moving average. Exponential smoothing enables one to calculate short-term forecasts one period into the future, even when the presence and type of long-term trend in a time series is difficult to determine. Equation (16.1) defines an exponentially smoothed value for time period i. Note the special case that the smoothed value for time period 1 is the time period 1 value.
AN EXPONENTIALLY SMOOTHED VALUE FOR TIME PERIOD i

Ei = WYi + (1 - W)Ei-1    (16.1)

where
i = 2, 3, 4, ...
Ei = exponentially smoothed value for time period i
Ei-1 = exponentially smoothed value for time period i - 1
Yi = time-series value for time period i
W = subjectively assigned weight or smoothing coefficient, where 0 < W < 1
I Time-Series Forecasting Choosing the weight or smoothing coefficient, W, that one assigns to the time series is both critical to the smoothing and somewhat subjective. One should select a small value for W, close to zero, if the goal is to smooth a series by eliminating unwanted cyclical and irregular variations in order to see the overall long-term tendency of the series. One should select a large value for W, close to 0.5, if the goal is forecasting future short-term directions. Figure 16.5 presents the exponentially smoothed values (with smoothing coefficients W = 0.50 and W = 0.25), the movie attendance data from 2005 to 2018, and a plot of the original data and the two exponentially smoothed time series. Observe that exponential smoothing has smoothed some of the variation in the movie attendance.
FIGURE 16.5 Exponentially smoothed series (W = 0.50 and W = 0.25) worksheet and plot for the movie attendance data. [Worksheet columns: Year, Attendance, ES(W=0.50), ES(W=0.25) for 2005 through 2018, with a plot of the original series and both smoothed series.]
To illustrate these exponential smoothing calculations for a smoothing coefficient of W = 0.25, begin with the initial value Y2005 = 1.38 as the first smoothed value (E2005 = 1.38). Then, using the value of the time series for 2006 (Y2006 = 1.41), smooth the series for 2006 as follows:

E2006 = WY2006 + (1 - W)E2005 = (0.25)(1.41) + (0.75)(1.38) = 1.3875

To smooth the series for 2007:

E2007 = WY2007 + (1 - W)E2006 = (0.25)(1.40) + (0.75)(1.3875) = 1.3906

This smoothing would continue for each of the remaining years in the time series. (Figure 16.5 also contains the results of this smoothing operation.) Exponential smoothing is a weighted average of all previous time periods. Therefore, when using exponential smoothing for forecasting, one uses the smoothed value in the current time period as the forecast of the value in the following period (i + 1).
FORECASTING TIME PERIOD i + 1

Ŷi+1 = Ei    (16.2)
To forecast the movie attendance in 2019 using a smoothing coefficient of W = 0.25, one uses the smoothed value for 2018 as its estimate:

Ŷ2018+1 = E2018
Ŷ2019 = E2018
Ŷ2019 = 1.3038

The exponentially smoothed forecast for 2019 is 1.3038 billion.
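The Equation (16.1) recursion and the Equation (16.2) forecast can be reproduced with a few lines of plain Python. This is a sketch, not the textbook's worksheet; the function and variable names are illustrative.

```python
# Sketch: exponential smoothing per Equations (16.1) and (16.2).

def exponential_smooth(series, w):
    """Exponentially smooth a series; the first smoothed value is Y1."""
    smoothed = [series[0]]
    for y in series[1:]:
        # E_i = W * Y_i + (1 - W) * E_{i-1}
        smoothed.append(w * y + (1 - w) * smoothed[-1])
    return smoothed

# Movie attendance (billions), 2005 through 2018, from Table 16.1
attendance = [1.38, 1.41, 1.40, 1.34, 1.41, 1.34, 1.28,
              1.36, 1.34, 1.27, 1.32, 1.32, 1.23, 1.31]

es = exponential_smooth(attendance, w=0.25)
forecast_2019 = es[-1]  # Equation (16.2): next-period forecast
```

The smoothed values for 2006 and 2007 match the 1.3875 and 1.3906 computed above, and the final smoothed value reproduces the 1.3038 billion forecast for 2019.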
PROBLEMS FOR SECTION 16.2
LEARNING THE BASICS

16.1 If you are using exponential smoothing for forecasting an annual time series of revenues, what is your forecast for next year if the smoothed value for this year is $32.4 million?

16.2 Consider a nine-year moving average used to smooth a time series that was first recorded in 1984.
a. Which year serves as the first centered value in the smoothed series?
b. How many years of values in the series are lost when computing all the nine-year moving averages?

16.3 You are using exponential smoothing on an annual time series concerning total revenues (in $millions). You decide to use a smoothing coefficient of W = 0.20, and the exponentially smoothed value for 2019 is E2019 = (0.20)(12.1) + (0.80)(9.4).
a. What is the smoothed value of this series in 2019?
b. What is the smoothed value of this series in 2020 if the value of the series in that year is $11.5 million?

APPLYING THE CONCEPTS

16.4 The data below, stored in Computer Usage , represent the hours per day spent by American desktop and laptop users from 2008 to 2017.

Year  Hours per Day
2008  2.2
2009  2.3
2010  2.4
2011  2.6
2012  2.5
2013  2.3
2014  2.2
2015  2.2
2016  2.2
2017  2.1

Source: Data extracted from R. Marvin, "Tech Addiction by the Numbers ...," bit.ly/2KECJAZ.

a. Plot the time series.
b. Fit a three-year moving average to the data and plot the results.
c. Using a smoothing coefficient of W = 0.50, exponentially smooth the series and plot the results.
d. What is your exponentially smoothed forecast for 2018?
e. Repeat (c) and (d), using W = 0.25.
f. Compare the results of (d) and (e).
g. What conclusions can you reach about desktop/laptop use by American users?
16.5 The following data, stored in Core Appliances , list the number of shipments of core major household appliances in the United States from 2000 to 2017 (in millions).

Year  Shipments    Year  Shipments
2000  38.4         2009  36.5
2001  38.2         2010  38.2
2002  40.8         2011  36.0
2003  42.5         2012  35.8
2004  46.1         2013  39.2
2005  47.0         2014  41.5
2006  46.7         2015  42.9
2007  44.1         2016  44.7
2008  39.8         2017  46.4

Source: Data extracted from www.statista.com.

a. Plot the time series.
b. Fit a three-year moving average to the data and plot the results.
c. Using a smoothing coefficient of W = 0.50, exponentially smooth the series and plot the results.
d. What is your exponentially smoothed forecast for 2018?
e. Repeat (c) and (d), using W = 0.25.
f. Compare the results of (d) and (e).
g. What conclusions can you reach concerning the total number of shipments of core major household appliances in the United States from 2000 to 2017 (in millions)?
16.6 How have stocks performed in the past? The following table presents the performance of a broad measure of stock performance (by percentage) for each decade from the 1830s through the 2010s:

Decade  Performance (%)    Decade  Performance (%)
1830s    2.8               1920s   13.3
1840s   12.8               1930s   -2.2
1850s    6.6               1940s    9.6
1860s   12.5               1950s   18.2
1870s    7.5               1960s    8.3
1880s    6.0               1970s    6.6
1890s    5.5               1980s   16.6
1900s   10.9               1990s   17.6
1910s    2.2               2000s    1.2
                           2010s*  11.0

* Through December 31, 2018.
Source: T. Lauricella, "Investors Hope the '10s Beat the '00s," The Wall Street Journal, December 21, 2009, pp. C1, C2, and Moneychimp.com, "CAGR of the Stock Market," bit.ly/laiqtil.

a. Plot the time series.
b. Fit a three-period moving average to the data and plot the results.
c. Using a smoothing coefficient of W = 0.50, exponentially smooth the series and plot the results.
d. What is your exponentially smoothed forecast for the 2020s?
e. Repeat (c) and (d), using W = 0.25.
f. Compare the results of (d) and (e).
g. What conclusions can you reach concerning how stocks have performed in the past?
16.7 The file contains the coffee exports (in thousands of 60 kg bags) by Costa Rica from 2004 to 2018.
a. Plot the data.
b. Fit a three-year moving average to the data and plot the results.
c. Using a smoothing coefficient of W = 0.50, exponentially smooth the series and plot the results.
d. What is your exponentially smoothed forecast for 2019?
e. Repeat (c) and (d), using a smoothing coefficient of W = 0.25.
f. Compare the results of (d) and (e).
g. What conclusions can you reach about the exports of coffee in Costa Rica?
16.8 The file contains the number of initial public offerings (IPOs) issued from 2001 through 2018.

Source: Data extracted from Statista, "Number of IPOs in the United States from 1999 to 2018," bit.ly/2uctnU4.

a. Plot the data.
b. Fit a three-year moving average to the data and plot the results.
c. Using a smoothing coefficient of W = 0.50, exponentially smooth the series and plot the results.
d. What is your exponentially smoothed forecast for 2019?
e. Repeat (c) and (d), using a smoothing coefficient of W = 0.25.
f. Compare the results of (d) and (e).
16.3 Least-Squares Trend Fitting and Forecasting

Making intermediate and long-range forecasts requires identifying the trend component in a time series. Identifying the trend means developing the most appropriate model that fits the trend. As with the regression models that previous chapters discuss, time-series data might fit a linear trend model (see Section 13.2), a quadratic trend model (see Section 15.1), or, if the time-series data increase at a rate such that the percentage difference from value to value is constant, an exponential trend model.

The Linear Trend Model

The linear trend model

Yi = β0 + β1Xi + εi

is the simplest forecasting model. Equation (16.3) defines the linear trend forecasting equation.

LINEAR TREND FORECASTING EQUATION

Ŷi = b0 + b1Xi    (16.3)
Recall that in linear regression analysis, one uses the method of least squares to compute the sample slope, b1, and the sample Y intercept, b0. The values for X are then substituted into Equation (16.3) to predict Y.

When using the least-squares method for fitting trends in a time series, one can simplify the interpretation of the coefficients by assigning coded values to the X (time) variable. One assigns consecutively numbered integers, starting with 0, as the coded values for the time periods. For example, in time-series data that have been recorded annually for 8 years, one assigns the coded value 0 to the first year, the coded value 1 to the second year, the coded value 2 to the third year, and so on, concluding by assigning 7 to the eighth. To illustrate model fitting, consider the Table 16.2 time series, which lists Alphabet Inc.'s annual revenue (in $billions) from 2011 to 2018.

TABLE 16.2 Annual revenues for Alphabet Inc., 2011-2018

Year  Revenue ($billions)
2011   37.905
2012   46.039
2013   55.519
2014   66.001
2015   74.989
2016   90.272
2017  110.855
2018  136.819

Source: Data extracted from J. Clement, "Annual revenue of Alphabet from 2011 to 2018," bit.ly/217QJBq.

Alphabet Inc. was formed in 2015 and is the holding company that owns Google.
Figure 16.6 presents the regression results for the simple linear regression model that uses the consecutive coded values 0 through 7 as the X (coded year) variable. These results produce the linear trend forecasting equation:

Ŷi = 30.2280 + 13.4491Xi

FIGURE 16.6 Excel regression worksheet results for the linear trend model to forecast revenue (in $billions) for Alphabet Inc. [Key results: r² = 0.9528, adjusted r² = 0.9449, b0 = 30.2280, b1 = 13.4491.]

For this regression model, X1 = 0 represents the year 2011, and the regression coefficients are interpreted as follows:

• The Y intercept, b0 = 30.2280, is the predicted mean revenue (in $billions) at Alphabet Inc. during the origin, or base, year, 2011.
• The slope, b1 = 13.4491, indicates that the mean revenue is predicted to increase by $13.4491 billion per year.

To project the trend in the revenue at Alphabet Inc. to 2019, one substitutes X9 = 8, the code for 2019, into the linear trend forecasting equation:

Ŷi = 30.2280 + 13.4491(8) = 137.8208 billions of dollars
Figure 16.7 presents the linear trend line plotted with the time-series values. There is a strong upward linear trend, and r² is 0.9528, indicating that more than 95% of the variation in revenue is explained by the linear trend of the time series. However, observe that the early years are slightly above the trend line, the middle years are below the trend line, and the last two years are again above the trend line. To investigate whether a different trend model might provide a better fit, a quadratic trend model and an exponential trend model can be fitted.
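The least-squares linear trend fit and the 2019 projection above can be reproduced with a short script. This is a sketch assuming NumPy is available (the textbook itself uses Excel); the variable names are illustrative.

```python
# Sketch: linear trend fit for the Alphabet Inc. revenues of Table 16.2.
import numpy as np

# Annual revenue ($billions), 2011 through 2018
revenue = np.array([37.905, 46.039, 55.519, 66.001,
                    74.989, 90.272, 110.855, 136.819])
coded_year = np.arange(len(revenue))  # 2011 -> 0, ..., 2018 -> 7

# np.polyfit returns coefficients highest degree first: slope, then intercept
b1, b0 = np.polyfit(coded_year, revenue, 1)

# Project the trend one year past the series: 2019 is coded X = 8
forecast_2019 = b0 + b1 * 8
```

The fitted intercept and slope agree with the 30.2280 and 13.4491 reported in Figure 16.6, and the projection agrees with the 137.8208 computed above.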
FIGURE 16.7 Plot of the linear trend forecasting equation for Alphabet Inc. annual revenue. [Scatter of revenue versus coded year with fitted trend line Y = 13.449X + 30.228, r² = 0.9528.]
The Quadratic Trend Model

The quadratic trend model

Yi = β0 + β1Xi + β2Xi² + εi

is a nonlinear model that contains a linear term and a curvilinear term in addition to a Y intercept. Using the least-squares method for a quadratic model that Section 15.1 describes, Equation (16.4) defines a quadratic trend forecasting equation.

QUADRATIC TREND FORECASTING EQUATION

Ŷi = b0 + b1Xi + b2Xi²    (16.4)

where
b0 = estimated Y intercept
b1 = estimated linear effect on Y
b2 = estimated quadratic effect on Y
Figure 16.8 presents the regression results for the quadratic trend model to forecast annual revenue at Alphabet Inc. Using Equation (16.4) and the results from Figure 16.8,

Ŷi = 40.1296 + 3.5475Xi + 1.4145Xi²

where the year coded 0 is 2011.
FIGURE 16.8 Excel regression worksheet results for the quadratic trend model to forecast annual revenue (in $billions) for Alphabet Inc. [Key results: r² = 0.9949, adjusted r² = 0.9929, standard error = 2.8396.]
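The quadratic fit reported in Figure 16.8 can likewise be reproduced. This is a sketch assuming NumPy; the 2019 quadratic projection is not shown in this passage, so the value computed below simply follows from the fitted coefficients.

```python
# Sketch: quadratic trend fit for the Alphabet Inc. revenues of Table 16.2.
import numpy as np

revenue = np.array([37.905, 46.039, 55.519, 66.001,
                    74.989, 90.272, 110.855, 136.819])
coded_year = np.arange(len(revenue))  # 2011 -> 0, ..., 2018 -> 7

# Degree-2 fit; np.polyfit returns coefficients highest degree first
b2, b1, b0 = np.polyfit(coded_year, revenue, 2)

# Quadratic trend projection for 2019 (X = 8)
forecast_2019 = b0 + b1 * 8 + b2 * 8 ** 2
```

The fitted coefficients match the 40.1296, 3.5475, and 1.4145 in the quadratic trend forecasting equation above.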
One calculates the values for β0 and β1 by taking the antilog of the regression coefficients b0 and b1:

β0 = antilog(b0) = antilog(1.5814) = 10^1.5814 = 38.1417
β1 = antilog(b1) = antilog(0.0774) = 10^0.0774 = 1.1951

Thus, using Equation (16.7b), the exponential trend forecasting equation is

Ŷi = (38.1417)(1.1951)^Xi

where the year coded 0 is 2011. The Y intercept, β0 = 38.1417 billions of dollars, is the revenue forecast for the base year 2011. The value (β1 - 1) × 100% = 19.51% is the annual compound growth rate in revenues at Alphabet Inc.

For forecasting purposes, one substitutes the appropriate coded X values into either Equation (16.7a) or Equation (16.7b). For example, to forecast revenues for 2019 (X = 8) using Equation (16.7a):

log(Ŷi) = 1.5814 + 0.0774(8) = 2.2006
Ŷi = antilog(2.2006) = 10^2.2006 = 158.708 billions of dollars
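The antilog calculations above can be reproduced by regressing log10(revenue) on the coded year. This is a sketch assuming NumPy; note that the 158.708 forecast above uses coefficients rounded to four decimal places, so an unrounded fit gives a slightly different value.

```python
# Sketch: exponential (log-linear) trend fit for the Table 16.2 revenues.
import numpy as np

revenue = np.array([37.905, 46.039, 55.519, 66.001,
                    74.989, 90.272, 110.855, 136.819])
coded_year = np.arange(len(revenue))  # 2011 -> 0, ..., 2018 -> 7

# Fit a straight line to the base-10 logs of revenue
b1, b0 = np.polyfit(coded_year, np.log10(revenue), 1)

beta0 = 10 ** b0                  # antilog of the intercept, about 38.14
beta1 = 10 ** b1                  # antilog of the slope, about 1.195
growth_rate = (beta1 - 1) * 100   # annual compound growth rate, about 19.5%

# 2019 forecast (X = 8) from the unrounded log-linear coefficients
forecast_2019 = 10 ** (b0 + b1 * 8)
```

The fitted log-scale coefficients agree with the 1.5814 and 0.0774 used in the text, and the unrounded 2019 forecast lands close to the 158.708 obtained from the rounded coefficients.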
Figure 16.11 plots the exponential trend forecasting equation, along with the time-series data. The adjusted r² for the exponential trend model (0.9967) is greater than the adjusted r² for the linear trend model (0.9449) and similar to that of the quadratic model (0.9929).

FIGURE 16.11 Plot of the exponential trend forecasting equation for Alphabet Inc. annual revenue. [Plot of log10(revenue) versus coded year with the fitted line y = 0.0774x + 1.5814.]
Model Selection Using First, Second, and Percentage Differences

Exhibit 16.1 summarizes how the first, second, and percentage differences in a time series help determine which type of model is most appropriate for the time series. Although most time-series data will not perfectly fit any of the models, consider the first differences, second differences, and percentage differences as guides in choosing an appropriate model. Examples 16.2, 16.3, and 16.4 illustrate linear, quadratic, and exponential trend models that have perfect (or nearly perfect) fits to their respective data sets.
EXHIBIT 16.1 Model Selection Using First, Second, and Percentage Differences

• If a linear trend model provides a perfect fit to a time series, then the first differences will be constant:

  Yi - Yi-1 = constant

• If a quadratic trend model provides a perfect fit to a time series, then the second differences will be constant:

  (Yi - Yi-1) - (Yi-1 - Yi-2) = constant

• If an exponential trend model provides a perfect fit to a time series, then the percentage differences between consecutive values will be constant:

  [(Yi - Yi-1)/Yi-1] × 100% = constant

EXAMPLE 16.2 A Linear Trend Model With a Perfect Fit
The following time series represents the number of customers per year (in thousands) at a branch of a fast-food chain:

Year         2010  2011  2012  2013  2014  2015  2016  2017  2018  2019
Customers Y  200   205   210   215   220   225   230   235   240   245
Using first differences, show that the linear trend model provides a perfect fit to these data.
SOLUTION The following table shows the solution:

Year               2010  2011  2012  2013  2014  2015  2016  2017  2018  2019
Customers Y        200   205   210   215   220   225   230   235   240   245
First differences        5.0   5.0   5.0   5.0   5.0   5.0   5.0   5.0   5.0
The differences between consecutive values in the series are the same throughout. Thus, the number of customers at the branch of the fast-food chain shows a linear growth pattern.
EXAMPLE 16.3 A Quadratic Trend Model With a Perfect Fit
The following time series represents the number of customers per year (in thousands) at another branch of a fast-food chain:

Year         2010  2011  2012   2013   2014  2015  2016   2017   2018  2019
Customers Y  200   201   203.5  207.5  213   220   228.5  238.5  250   263
Using second differences, show that the quadratic trend model provides a perfect fit to these data.

SOLUTION The following table shows the solution:

Year                2010  2011  2012   2013   2014  2015  2016   2017   2018  2019
Customers Y         200   201   203.5  207.5  213   220   228.5  238.5  250   263
First differences         1.0   2.5    4.0    5.5   7.0   8.5    10.0   11.5  13.0
Second differences              1.5    1.5    1.5   1.5   1.5    1.5    1.5   1.5
The second differences between consecutive pairs of values in the series are the same throughout. Thus, the number of customers at the branch of the fast-food chain shows a quadratic growth pattern. Its rate of growth is accelerating over time.
EXAMPLE 16.4 An Exponential Trend Model With an Almost Perfect Fit
The following time series represents the number of customers per year (in thousands) for another branch of the fast-food chain:

Year         2010  2011  2012    2013    2014    2015    2016    2017    2018    2019
Customers Y  200   206   212.18  218.55  225.11  231.86  238.82  245.98  253.36  260.96
Using percentage differences, show that the exponential trend model provides almost a perfect fit to these data.

SOLUTION The following table shows the solution:

Year                    2010  2011  2012    2013    2014    2015    2016    2017    2018    2019
Customers Y             200   206   212.18  218.55  225.11  231.86  238.82  245.98  253.36  260.96
Percentage differences        3.0   3.0     3.0     3.0     3.0     3.0     3.0     3.0     3.0
The percentage differences between consecutive values in the series are approximately the same throughout. Thus, this branch of the fast-food chain shows an exponential growth pattern. Its rate of growth is approximately 3% per year.
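The Exhibit 16.1 diagnostics for Examples 16.2 through 16.4 can be computed with a few small helper functions. This is a plain-Python sketch, not the textbook's worksheet; the names are illustrative.

```python
# Sketch: first, second, and percentage differences per Exhibit 16.1.

def first_differences(y):
    """Y_i - Y_{i-1} for consecutive values."""
    return [b - a for a, b in zip(y, y[1:])]

def second_differences(y):
    """Differences of the first differences."""
    return first_differences(first_differences(y))

def percentage_differences(y):
    """(Y_i - Y_{i-1}) / Y_{i-1} * 100 for consecutive values."""
    return [(b - a) / a * 100 for a, b in zip(y, y[1:])]

# Customer counts (thousands) from Examples 16.2, 16.3, and 16.4
linear_y = [200, 205, 210, 215, 220, 225, 230, 235, 240, 245]
quadratic_y = [200, 201, 203.5, 207.5, 213, 220, 228.5, 238.5, 250, 263]
exponential_y = [200, 206, 212.18, 218.55, 225.11,
                 231.86, 238.82, 245.98, 253.36, 260.96]
```

Applied to the three series, the helpers reproduce the constant first differences of 5, the constant second differences of 1.5, and percentage differences of approximately 3% per year.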
Figure 16.12 shows a worksheet that compares the first, second, and percentage differences for Alphabet Inc. revenue. None of the first, second, or percentage differences is constant across the series, although the percentage differences are fairly consistent, save for one value. Therefore, other models, including the models that Section 16.5 discusses, may be more appropriate.
FIGURE 16.12 Excel worksheet template that computes first, second, and percentage differences in revenue (in $billions) for Alphabet Inc.
Year  Revenue  First Difference  Second Difference  Percentage Difference
2011   37.905      #N/A               #N/A                #N/A
2012   46.039      8.134              #N/A               21.459
2013   55.519      9.480              1.346              20.591
2014   66.001     10.482              1.002              18.880
2015   74.989      8.988             -1.494              13.618
2016   90.272     15.283              6.295              20.380
2017  110.855     20.583              5.300              22.801
2018  136.819     25.964              5.381              23.422
PROBLEMS FOR SECTION 16.3

LEARNING THE BASICS

16.9 If you are using the method of least squares for fitting trends in an annual time series containing 25 consecutive yearly values,
a. what coded value do you assign to X for the first year in the series?
b. what coded value do you assign to X for the fifth year in the series?
c. what coded value do you assign to X for the most recent recorded year in the series?
d. what coded value do you assign to X if you want to project the trend and make a forecast five years beyond the last observed value?

16.10 The linear trend forecasting equation for an annual time series containing 22 values (from 1998 to 2019) on total revenues (in $millions) is

Ŷi = 4.0 + 1.5Xi

a. Interpret the Y intercept, b0.
b. Interpret the slope, b1.
c. What is the fitted trend value for the fifth year?
d. What is the fitted trend value for the most recent year?
e. What is the projected trend forecast three years after the last value?

16.11 The linear trend forecasting equation for an annual time series containing 42 values (from 1998 to 2019) on net sales (in $billions) is

Ŷi = 1.2 + 0.5Xi

a. Interpret the Y intercept, b0.
b. Interpret the slope, b1.
c. What is the fitted trend value for the tenth year?
d. What is the fitted trend value for the most recent year?
e. What is the projected trend forecast two years after the last value?
APPLYING THE CONCEPTS

16.12 There has been much publicity about bonuses paid to workers on Wall Street. Just how large are these bonuses? The file contains the bonuses paid (in $000) from 2000 to 2018.

Source: Data extracted from J. Spector, "Wall Street bonuses rise 1% to average $138,210," USA Today, March 15, 2017, and N. Chiwaya, "Wall Street bonuses declined 17 percent in 2018 ...," NBC News, March 26, 2019, available at nbcnews.to/2Tch1Z9.

a. Plot the data.
b. Compute a linear trend forecasting equation and plot the results.
c. Compute a quadratic trend forecasting equation and plot the results.
d. Compute an exponential trend forecasting equation and plot the results.
e. Using the forecasting equations in (b) through (d), what are your annual forecasts of the bonuses for 2019 and 2020?
f. How can you explain the differences in the three forecasts in (e)? What forecast do you think you should use? Why?
16.1 3 Gross domestic product (GDP) is a major indicator of a nation's overall economic activity. It consists of personal consumption expenditures, gross domestic investment, net exports of goods and services, and government consumption expenditures. The file m:lil contains the GDP (in billions of current dollars) for the United States from 1980 to 201 8. Sour ce: Data extracted from Bureau of Economic Analysis, U.S. Department of Commerce, www.bea .gov.
a. Plot the data.
b. Compute a linear trend forecasting equation and plot the trend line.
c. What are your forecasts for 2019 and 2020?
d. What conclusions can you reach concerning the trend in GDP?
16.14 The data in the file represent federal receipts from 1978 through 2018, in billions of current dollars, from individual and corporate income tax, social insurance, excise tax, estate and gift tax, customs duties, and federal reserve deposits.
Source: Data extracted from "Historical Federal Receipt and Outlay Summary," Tax Policy Center, tpc.io/lJMFKpo.
a. Plot the series of data.
b. Compute a linear trend forecasting equation and plot the trend line.
c. What are your forecasts of the federal receipts for 2019 and 2020?
d. What conclusions can you reach concerning the trend in federal receipts?
16.15 The file contains the number of new, single-family houses sold in the United States from 1992 through 2018.
a. Plot the data.
b. Compute a linear trend forecasting equation and plot the trend line.
c. Compute a quadratic trend forecasting equation and plot the results.
d. Compute an exponential trend forecasting equation and plot the results.
e. Which model is the most appropriate?
f. Using the most appropriate model, forecast the number of new, single-family houses sold in the United States in 2019.
CHAPTER 16 | Time-Series Forecasting
16.16 The data shown in the following table and stored in the file represent the yearly amount of solar power generated by utilities (in millions of kWh) in the United States from 2002 through 2018:

Year   Solar Power Generated     Year   Solar Power Generated
       (millions of kWh)                (millions of kWh)
2002        555                  2010        1,212
2003        534                  2011        1,818
2004        575                  2012        4,327
2005        551                  2013        9,253
2006        508                  2014       18,321
2007        612                  2015       26,473
2008        864                  2016       36,754
2009        892                  2017       52,958
                                 2018       66,604

Source: Data extracted from en.wikipedia.org/wiki/Solar_power_in_the_United_States.
a. Plot the data.
b. Compute a linear trend forecasting equation and plot the trend line.
c. Compute a quadratic trend forecasting equation and plot the results.
d. Compute an exponential trend forecasting equation and plot the results.
e. Using the models in (b) through (d), what are your annual trend forecasts of the yearly amount of solar power generated by utilities (in millions of kWh) in the United States in 2019 and 2020?
16.17 The file contains the number of automobiles assembled in the United States (in thousands) from 1999 to 2018.
Source: Data extracted from FRED, "Domestic Auto Production," bit.ly/2KHKCpz.
a. Plot the data.
b. Compute a linear trend forecasting equation and plot the trend line.
c. Compute a quadratic trend forecasting equation and plot the results.
d. Compute an exponential trend forecasting equation and plot the results.
e. Which model is the most appropriate?
f. Using the most appropriate model, forecast the U.S. automobiles assembled for 2019.
16.18 The average salary of Major League Baseball players on opening day from 2000 to 2019 is stored in the file and shown in the following table.

Year   Salary ($millions)   Year   Salary ($millions)
2000        1.99            2010        3.27
2001        2.29            2011        3.32
2002        2.38            2012        3.38
2003        2.58            2013        3.62
2004        2.49            2014        3.81
2005        2.63            2015        4.25
2006        2.83            2016        4.40
2007        2.92            2017        4.70
2008        3.13            2018        4.63
2009        3.26            2019        4.70

Source: Data extracted from "Baseball Salaries," USA Today, April 6, 2009, p. 6C; mlb.com; USA Today, April 2, 2017; and "MLB Salaries - MLB Baseball - USA TODAY," bit.ly/lkj4E4N.
a. Plot the data.
b. Compute a linear trend forecasting equation and plot the trend line.
c. Compute a quadratic trend forecasting equation and plot the results.
d. Compute an exponential trend forecasting equation and plot the results.
e. Which model is the most appropriate?
f. Using the most appropriate model, forecast the average salary for 2020.
16.19 The file contains the following prices in London for an ounce of silver (in US$) on the last day of the year from 1999 to 2018:

Year   Price (US$/ounce)   Year   Price (US$/ounce)
1999        5.330          2009       16.990
2000        4.570          2010       30.630
2001        4.520          2011       28.180
2002        4.670          2012       29.950
2003        5.965          2013       19.500
2004        6.815          2014       15.970
2005        8.830          2015       13.820
2006       12.900          2016       15.990
2007       14.760          2017       16.865
2008       10.790          2018       15.490

Source: Data extracted from JM Bullion, "Silver Spot Price & Charts," bit.ly/2w4YPYI.
a. Plot the data.
b. Compute a linear trend forecasting equation and plot the trend line.
c. Compute a quadratic trend forecasting equation and plot the results.
d. Compute an exponential trend forecasting equation and plot the results.
e. Which model is the most appropriate?
f. Using the most appropriate model, forecast the price of silver at the end of 2019.
16.20 The data in the file reflect the annual values of the consumer price index for all urban consumers (CPI-U) in the United States over the 54-year period 1965 through 2018, using 1982 through 1984 as the base period. This index measures the average change in prices over time in a fixed "market basket" of goods and services purchased by all urban consumers, including urban clerical, professional, managerial, and technical workers; self-employed individuals; short-term workers; unemployed individuals; and retirees.
Source: Data extracted from Bureau of Labor Statistics, U.S. Department of Labor, www.bls.gov.
a. Plot the data.
b. Describe the movement in this time series over the 54-year period.
c. Compute a linear trend forecasting equation and plot the trend line.
d. Compute a quadratic trend forecasting equation and plot the results.
e. Compute an exponential trend forecasting equation and plot the results.
f. Which model is the most appropriate?
g. Using the most appropriate model, forecast the CPI for 2019 and 2020.
16.21 Although you should not expect a perfectly fitting model for any time-series data, you can consider the first differences, second differences, and percentage differences for a given series as guides in choosing an appropriate model. For this problem, use each of the time series presented in the following table and stored in the file.

Year   Series I   Series II   Series III
2007     10.0       30.0         60.0
2008     15.1       33.1         67.9
2009     24.0       36.4         76.1
2010     36.7       39.9         84.0
2011     53.8       43.9         92.2
2012     74.8       48.2        100.0
2013    100.0       53.2        108.0
2014    129.2       58.2        115.8
2015    162.4       64.5        124.1
2016    199.0       70.7        132.0
2017    239.3       77.1        140.0
2018    283.5       83.9        147.8
a. Determine the most appropriate model.
b. Compute the forecasting equation.
c. Forecast the value for 2019.
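The first-, second-, and percentage-difference guides named in this exercise can be computed mechanically. The following is a minimal Python sketch (the example series is hypothetical, not one of the series in the exercise): roughly constant first differences point to a linear trend, roughly constant second differences to a quadratic trend, and roughly constant percentage differences to an exponential trend.

```python
# Compute first, second, and percentage differences for a time series,
# the diagnostics used to choose among linear, quadratic, and
# exponential trend models.

def differences(series):
    first = [series[i] - series[i - 1] for i in range(1, len(series))]
    second = [first[i] - first[i - 1] for i in range(1, len(first))]
    pct = [100 * (series[i] - series[i - 1]) / series[i - 1]
           for i in range(1, len(series))]
    return first, second, pct

# Hypothetical series growing by a constant 10% per year
series = [100.0, 110.0, 121.0, 133.1]
first, second, pct = differences(series)
print(pct)  # percentage differences are roughly constant -> exponential trend
```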
16.22 A time-series plot often helps you determine the appropriate model to use. For this problem, use each of the time series presented in the following table and stored in the file.

Year   Series I   Series II
2007    100.0       100.0
2008    115.2       115.2
2009    130.1       131.7
2010    144.9       150.8
2011    160.0       174.1
2012    175.0       200.0
2013    189.8       230.8
2014    204.9       266.1
2015    219.8       305.5
2016    235.0       351.8
2017    249.8       403.0
2018    264.9       469.2
a. Plot the observed data Y over time X and plot the logarithm of the observed data (log Y) over time X to determine whether a linear trend model or an exponential trend model is more appropriate. (Hint: If the plot of log Y versus X appears to be linear, an exponential trend model provides an appropriate fit.)
b. Compute the appropriate forecasting equations.
c. Forecast the values for 2019.
16.4 Autoregressive Modeling for Trend Fitting and Forecasting
learnMORE The exponential smoothing model that Section 16.3 describes and the autoregressive models that Section 16.4 describes are special cases of autoregressive integrated moving average (ARIMA) models developed by Box and Jenkins. To learn more about such models, see Bisgaard and Kulahci; Box, Jenkins, Reinsel, and Ljung; and Pecar.
Frequently, the values of a time series at particular points in time are highly correlated with the values that precede and succeed them. This type of correlation is called autocorrelation. When the autocorrelation exists between values that are in consecutive periods in a time series, the time series displays first-order autocorrelation. When the autocorrelation exists between values that are two periods apart, the time series displays second-order autocorrelation. For the general case in which the autocorrelation exists between values that are p periods apart, the time series displays pth-order autocorrelation.

Autoregressive modeling uses a set of lagged predictor variables to overcome the problems that autocorrelation causes with other models. A lagged predictor variable takes its value from the value of a predictor variable for a previous time period. To analyze pth-order autocorrelation, you create a set of p lagged predictor variables. The first lagged predictor variable takes its value from the value of a predictor variable that is one time period away, the lag; the second lagged predictor variable takes its value from the value of a predictor variable that is two time periods away; and so on, until the pth lagged predictor variable, which takes its value from the value of a predictor variable that is p time periods away. Note that each subsequent lagged predictor variable contains one fewer time-series value. In the general case, a p-lagged variable will contain p fewer values.

Equation (16.8) defines the pth-order autoregressive model. In the equation, A₀, A₁, …, Aₚ represent the parameters and a₀, a₁, …, aₚ represent the corresponding regression coefficients. This is similar to the multiple regression model, Equation (14.1) on page 480, in which β₀, β₁, …, βₖ represent the regression parameters and b₀, b₁, …, bₖ represent the corresponding regression coefficients.
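The construction of lagged predictor variables described above can be sketched in a few lines of Python. This is an illustrative equivalent of the Excel template discussed later in this section, not the book's own procedure; the helper name `make_lags` is ours, and `None` plays the role of Excel's #N/A for positions that have no predecessor.

```python
# Build p lagged predictor variables from a time series.
# Lag k holds the value observed k periods earlier; the first k
# positions have no predecessor and are marked None (Excel's #N/A).

def make_lags(y, p):
    return {k: [None] * k + y[:len(y) - k] for k in range(1, p + 1)}

y = [31, 34, 37, 35, 36, 43, 40]   # the series from Examples 16.5 and 16.6
lags = make_lags(y, 2)
print(lags[1])  # [None, 31, 34, 37, 35, 36, 43]
print(lags[2])  # [None, None, 31, 34, 37, 35, 36]
```

Note that, exactly as the text says, each higher-order lag column contains one fewer usable value.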
pTH-ORDER AUTOREGRESSIVE MODEL
Yᵢ = A₀ + A₁Yᵢ₋₁ + A₂Yᵢ₋₂ + ⋯ + AₚYᵢ₋ₚ + δᵢ   (16.8)

where
Yᵢ = observed value of the series at time i
Yᵢ₋₁ = observed value of the series at time i − 1
Yᵢ₋₂ = observed value of the series at time i − 2
Yᵢ₋ₚ = observed value of the series at time i − p
p = number of autoregression parameters (not including a Y intercept) to be estimated from least-squares regression analysis
A₀, A₁, A₂, …, Aₚ = autoregression parameters to be estimated from least-squares regression analysis
δᵢ = a nonautocorrelated random error component (with mean = 0 and constant variance)
student TIP δ is the Greek letter delta.
Equations (16.9) and (16.10) define two specific autoregressive models. Equation (16.9) defines the first-order autoregressive model and is similar in form to the simple linear regression model, Equation (13.1) on page 432. Equation (16.10) defines the second-order autoregressive model and is similar to the multiple regression model with two independent variables, Equation (14.2) on page 480.
FIRST-ORDER AUTOREGRESSIVE MODEL

Yᵢ = A₀ + A₁Yᵢ₋₁ + δᵢ   (16.9)

SECOND-ORDER AUTOREGRESSIVE MODEL

Yᵢ = A₀ + A₁Yᵢ₋₁ + A₂Yᵢ₋₂ + δᵢ   (16.10)
Selecting an Appropriate Autoregressive Model

Selecting an appropriate autoregressive model can be complicated. One must weigh the advantages of using a simpler model against the concern of using a model that does not take into account important autocorrelation in the data. On the other hand, a higher-order model that requires estimates of numerous parameters may contain some unnecessary parameters, especially if the time series is short (n is small). Recall that when computing an estimate of Aₚ, p out of the n time-series values are lost due to the lagging of values. Examples 16.5 and 16.6 illustrate this loss.
EXAMPLE 16.5
Comparison Schema for a First-Order Autoregressive Model

Consider the following series of n = 7 consecutive annual values:

Year     1    2    3    4    5    6    7
Series  31   34   37   35   36   43   40

Show the comparisons needed for a first-order autoregressive model.
SOLUTION

Year i   First-Order Autoregressive Model (Lag1: Yᵢ versus Yᵢ₋₁)
1        31 versus —
2        34 versus 31
3        37 versus 34
4        35 versus 37
5        36 versus 35
6        43 versus 36
7        40 versus 43

Because Y₁ is the first value and there is no value prior to it, Y₁ is not used in the regression analysis. Therefore, the first-order autoregressive model would be based on six pairs of values.
EXAMPLE 16.6
Comparison Schema for a Second-Order Autoregressive Model

Consider the following series of n = 7 consecutive annual values:

Year     1    2    3    4    5    6    7
Series  31   34   37   35   36   43   40
Show the comparisons needed for a second-order autoregressive model.

SOLUTION

Year i   Second-Order Autoregressive Model (Lag2: Yᵢ versus Yᵢ₋₁ and Yᵢ₋₂)
1        31 versus — and —
2        34 versus 31 and —
3        37 versus 34 and 31
4        35 versus 37 and 34
5        36 versus 35 and 37
6        43 versus 36 and 35
7        40 versus 43 and 36

Because no value is recorded prior to Y₁, the first two comparisons, each of which requires a value prior to Y₁, cannot be used when performing the regression analysis. Therefore, the second-order autoregressive model would be based on five comparisons.
Determining the Appropriateness of a Selected Model

After selecting a model and using the least-squares method to compute the regression coefficients, one must determine the appropriateness of the model. One either selects a particular pth-order autoregressive model based on previous experiences with similar data or starts with a model that contains several autoregressive parameters and then eliminates the higher-order parameters that do not significantly contribute to the model. In this latter approach, one uses a t test for the significance of Aₚ, the highest-order autoregressive parameter in the current model under consideration. The null and alternative hypotheses are

H₀: Aₚ = 0
H₁: Aₚ ≠ 0
Equation (16.11) defines the test statistic.
t TEST FOR SIGNIFICANCE OF THE HIGHEST-ORDER AUTOREGRESSIVE PARAMETER, Aₚ

tSTAT = (aₚ − Aₚ) / S_aₚ   (16.11)

where
Aₚ = hypothesized value of the highest-order parameter, Aₚ, in the autoregressive model
aₚ = regression coefficient that estimates the highest-order parameter, Aₚ, in the autoregressive model
S_aₚ = standard deviation of aₚ

The tSTAT test statistic follows a t distribution with n − 2p − 1 degrees of freedom. In addition to the degrees of freedom lost for each of the p population parameters being estimated, an additional p degrees of freedom are lost because there are p fewer comparisons to be made from the original n values in the time series.
For a given level of significance, α, you reject the null hypothesis if the tSTAT test statistic is greater than the upper-tail critical value from the t distribution or if the tSTAT test statistic is less than the lower-tail critical value from the t distribution. Thus, the decision rule is

Reject H₀ if tSTAT > tα/2 or if tSTAT < −tα/2;
otherwise, do not reject H₀.

Figure 16.13 illustrates the decision rule and regions of rejection and nonrejection.
FIGURE 16.13 Rejection regions for a two-tail test for the significance of the highest-order autoregressive parameter Aₚ (regions of rejection lie below the critical value −tα/2 and above the critical value +tα/2; the region of nonrejection lies between them, centered at Aₚ = 0)
If one does not reject the null hypothesis that Aₚ = 0, one concludes that the selected model contains too many estimated autoregressive parameters. One discards the highest-order term and develops an autoregressive model of order p − 1, using the least-squares method. Then, one repeats the test of the hypothesis that the new highest-order parameter is 0. This cycle of testing and modeling continues until one rejects H₀. When this occurs, one can conclude that the remaining highest-order parameter is significant and that the model is acceptable for forecasting purposes.
Equation (16.12) defines the fitted pth-order autoregressive equation.
FITTED pTH-ORDER AUTOREGRESSIVE EQUATION

Ŷᵢ = a₀ + a₁Yᵢ₋₁ + a₂Yᵢ₋₂ + ⋯ + aₚYᵢ₋ₚ   (16.12)

where
Ŷᵢ = fitted value of the series at time i
Yᵢ₋₁ = observed value of the series at time i − 1
Yᵢ₋₂ = observed value of the series at time i − 2
Yᵢ₋ₚ = observed value of the series at time i − p
p = number of autoregression parameters (not including a Y intercept) to be estimated from least-squares regression analysis
a₀, a₁, a₂, …, aₚ = regression coefficients
One uses Equation (16.13) to forecast j years into the future from the current nth time period.
pTH-ORDER AUTOREGRESSIVE FORECASTING EQUATION

Ŷₙ₊ⱼ = a₀ + a₁Ŷₙ₊ⱼ₋₁ + a₂Ŷₙ₊ⱼ₋₂ + ⋯ + aₚŶₙ₊ⱼ₋ₚ   (16.13)

where
a₀, a₁, a₂, …, aₚ = regression coefficients that estimate the parameters
p = number of autoregression parameters (not including a Y intercept) to be estimated from least-squares regression analysis
j = number of years into the future
Ŷₙ₊ⱼ₋ₚ = forecast of Yₙ₊ⱼ₋ₚ from the current year for j − p > 0
Ŷₙ₊ⱼ₋ₚ = observed value Yₙ₊ⱼ₋ₚ for j − p ≤ 0
Thus, to make forecasts j years into the future, using a third-order autoregressive model, one needs only the most recent p = 3 values (Yₙ, Yₙ₋₁, and Yₙ₋₂) and the regression estimates a₀, a₁, a₂, and a₃. To forecast one year ahead, Equation (16.13) becomes

Ŷₙ₊₁ = a₀ + a₁Yₙ + a₂Yₙ₋₁ + a₃Yₙ₋₂

To forecast two years ahead, Equation (16.13) becomes

Ŷₙ₊₂ = a₀ + a₁Ŷₙ₊₁ + a₂Yₙ + a₃Yₙ₋₁

To forecast three years ahead, Equation (16.13) becomes

Ŷₙ₊₃ = a₀ + a₁Ŷₙ₊₂ + a₂Ŷₙ₊₁ + a₃Yₙ

and so on.
Autoregressive modeling is a powerful forecasting technique for time series that have autocorrelation. Exhibit 16.2 summarizes the steps to construct an autoregressive model.
EXHIBIT 16.2 Autoregressive Modeling Steps
student TIP Remember that in an autoregressive model, the independent variable(s) are equal to the dependent variable lagged by a certain number of time periods.
1. Choose a value for p, the highest-order parameter in the autoregressive model to be evaluated, remembering that the t test for significance is based on n − 2p − 1 degrees of freedom.
2. Create a set of p lagged predictor variables. (See Figure 16.14 for an example.)
3. Perform a least-squares analysis of the multiple regression model containing all p lagged predictor variables.
4. Test for the significance of Aₚ, the highest-order autoregressive parameter in the model.
5. If one does not reject the null hypothesis, discard the pth variable and repeat steps 3 and 4 with the revised degrees of freedom that correspond to the revised number of predictors. If one rejects the null hypothesis, select the autoregressive model with all p predictors for fitting [see Equation (16.12)] and forecasting [see Equation (16.13)].
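For the simplest case, a first-order model, the fit-then-test cycle above reduces to a simple linear regression of Yᵢ on Yᵢ₋₁ followed by a t test on the lag coefficient. The following pure-Python sketch (our own illustrative code, not the book's Excel procedure) applies it to the n = 7 series from Example 16.5; any statistics package would report the same numbers.

```python
# Fit a first-order autoregressive model by least squares and compute
# the t statistic for the lag coefficient a1 (Equation 16.11).
from math import sqrt

def fit_ar1(y):
    x, t = y[:-1], y[1:]                  # the lag-1 comparison pairs
    n = len(x)
    mx, mt = sum(x) / n, sum(t) / n
    sxx = sum((v - mx) ** 2 for v in x)
    sxy = sum((v - mx) * (w - mt) for v, w in zip(x, t))
    a1 = sxy / sxx
    a0 = mt - a1 * mx
    resid = [w - (a0 + a1 * v) for v, w in zip(x, t)]
    df = n - 2                            # equals n_total - 2p - 1 for p = 1
    s = sqrt(sum(r * r for r in resid) / df)
    t_stat = a1 / (s / sqrt(sxx))         # tSTAT = a1 / S_a1
    return a0, a1, t_stat, df

a0, a1, t_stat, df = fit_ar1([31, 34, 37, 35, 36, 43, 40])
print(round(a0, 4), round(a1, 4), round(t_stat, 4), df)
```

With only six usable pairs, the model has df = 7 − 2(1) − 1 = 4 degrees of freedom, matching step 1 of the exhibit.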
To demonstrate the autoregressive modeling approach, consider the Coca-Cola Company annual revenue for the years 2000 through 2018 (see Figure 16.14). Figure 16.14 presents a worksheet template that computes three lagged predictor variables, Lag1, Lag2, and Lag3, that can be used for the first-order, second-order, and third-order autoregressive models, using the Coca-Cola Company annual revenue data.
FIGURE 16.14 Excel worksheet template for computing the lagged predictor variables Lag1, Lag2, and Lag3 for the first-order, second-order, and third-order autoregressive models for the Coca-Cola Company annual revenue, 2000–2018 (cells without an available lagged value contain #N/A)
To fit the third-order autoregressive model, all three lagged predictor variables are used. To fit the second-order autoregressive model, only the Lag1 and Lag2 variables are used. To fit the first-order autoregressive model, only the Lag1 variable is used.

Selecting the autoregressive model that best fits an annual time series begins with the highest-order autoregressive model to be evaluated. Choosing which model will be the highest-order model to be evaluated can be based on prior experience using a time series. When no past experience or insight exists, one often chooses a third-order model as the starting point, thereby making p = 3. Because the Coca-Cola Company revenue time series is being evaluated without prior experience, the highest-order model to evaluate is the third-order autoregressive model. From Figure 16.15, the fitted third-order autoregressive equation is

Ŷᵢ = 5.1923 + 1.4023Yᵢ₋₁ − 0.6557Yᵢ₋₂ + 0.1128Yᵢ₋₃

where the first year in the series is 2003.
FIGURE 16.15 Excel regression worksheet results for a third-order autoregressive model for the Coca-Cola Company annual revenue. The key coefficient estimates are:

             Coefficients   Standard Error   t Stat    P-value
Intercept       5.1923         3.1945        1.6254    0.1300
Lag1            1.4023         0.2816        4.9795    0.0003
Lag2           −0.6557         0.4782       −1.3714    0.1954
Lag3            0.1128         0.3096        0.3645    0.7219
One evaluates this model by testing for the significance of A₃, the highest-order parameter. The null and alternative hypotheses are

H₀: A₃ = 0
H₁: A₃ ≠ 0

The highest-order regression coefficient, a₃, for the fitted third-order autoregressive model is 0.1128, with a standard error of 0.3096. Using Equation (16.11) on page 576 and the worksheet results given in Figure 16.15,

tSTAT = (0.1128 − 0) / 0.3096 = 0.3645

Using a 0.05 level of significance, the two-tail t test with 12 degrees of freedom has critical values of ±2.1788. Because −2.1788 < tSTAT = 0.3645 < 2.1788, or because the p-value = 0.7219 > 0.05, one does not reject H₀. One concludes that the third-order parameter of the autoregressive model is not significant and should not remain in the model. Therefore, one continues by fitting the second-order autoregressive model shown in Figure 16.16. The fitted second-order autoregressive equation is
Ŷᵢ = 4.5737 + 1.3856Yᵢ₋₁ − 0.5156Yᵢ₋₂

where the first year of the series is 2002.
FIGURE 16.16 Excel regression worksheet results for the second-order autoregressive model for the Coca-Cola Company annual revenue (Multiple R 0.9494, R Square 0.9013, Standard Error 3.3697, Observations 17)
Least-Squares Forecasting with Monthly or Quarterly Data

To develop a least-squares regression model that includes a seasonal component, the least-squares exponential trend fitting method used in Section 16.3 is combined with dummy variables that represent the quarters (see Section 14.6) to model the seasonal component.
16.6 Time-Series Forecasting of Seasonal Data
Equation (16.15) defines the exponential trend model for quarterly data.
EXPONENTIAL MODEL WITH QUARTERLY DATA

Yᵢ = β₀ β₁^Xᵢ β₂^Q₁ β₃^Q₂ β₄^Q₃ εᵢ   (16.15)

where
Xᵢ = coded quarterly value, i = 0, 1, 2, …
Q₁ = 1 if first quarter, 0 if not first quarter
Q₂ = 1 if second quarter, 0 if not second quarter
Q₃ = 1 if third quarter, 0 if not third quarter
β₀ = Y intercept
(β₁ − 1) × 100% = quarterly compound growth rate (in %)
β₂ = multiplier for first quarter relative to fourth quarter
β₃ = multiplier for second quarter relative to fourth quarter
β₄ = multiplier for third quarter relative to fourth quarter
εᵢ = value of the irregular component for time period i

²One could also choose to use base e logarithms. For more information on logarithms, see Appendix Section A.3.
The model in Equation (16.15) is not in the form of a linear regression model. To transform this nonlinear model to a linear model, one uses a base 10 logarithmic transformation.² Taking the logarithm of each side of Equation (16.15) results in Equation (16.16).
TRANSFORMED EXPONENTIAL MODEL WITH QUARTERLY DATA

log(Yᵢ) = log(β₀ β₁^Xᵢ β₂^Q₁ β₃^Q₂ β₄^Q₃ εᵢ)   (16.16)
        = log(β₀) + log(β₁^Xᵢ) + log(β₂^Q₁) + log(β₃^Q₂) + log(β₄^Q₃) + log(εᵢ)
        = log(β₀) + Xᵢ log(β₁) + Q₁ log(β₂) + Q₂ log(β₃) + Q₃ log(β₄) + log(εᵢ)
One uses the Equation (16.16) linear model for least-squares regression estimation. Performing the regression analysis using log(Yᵢ) as the dependent variable and Xᵢ, Q₁, Q₂, and Q₃ as the independent variables results in Equation (16.17).
EXPONENTIAL GROWTH WITH QUARTERLY DATA FORECASTING EQUATION

log(Ŷᵢ) = b₀ + b₁Xᵢ + b₂Q₁ + b₃Q₂ + b₄Q₃   (16.17)

where
b₀ = estimate of log(β₀), and thus 10^b₀ = β̂₀
b₁ = estimate of log(β₁), and thus 10^b₁ = β̂₁
b₂ = estimate of log(β₂), and thus 10^b₂ = β̂₂
b₃ = estimate of log(β₃), and thus 10^b₃ = β̂₃
b₄ = estimate of log(β₄), and thus 10^b₄ = β̂₄
One uses Equation (16.18) for monthly data.
EXPONENTIAL MODEL WITH MONTHLY DATA

Yᵢ = β₀ β₁^Xᵢ β₂^M₁ β₃^M₂ β₄^M₃ ⋯ β₁₂^M₁₁ εᵢ   (16.18)

where
Xᵢ = coded monthly value, i = 0, 1, 2, …
M₁ = 1 if January, 0 if not January
M₂ = 1 if February, 0 if not February
M₃ = 1 if March, 0 if not March
⋮
M₁₁ = 1 if November, 0 if not November
β₀ = Y intercept
(β₁ − 1) × 100% = monthly compound growth rate (in %)
β₂ = multiplier for January relative to December
β₃ = multiplier for February relative to December
β₄ = multiplier for March relative to December
⋮
β₁₂ = multiplier for November relative to December
εᵢ = value of the irregular component for time period i
The model in Equation (16.18) is not in the form of a linear regression model. To transform this nonlinear model into a linear model, one can use a base 10 logarithm transformation. Taking the logarithm of each side of Equation (16.18) results in Equation ( 16.19).
TRANSFORMED EXPONENTIAL MODEL WITH MONTHLY DATA

log(Yᵢ) = log(β₀ β₁^Xᵢ β₂^M₁ β₃^M₂ ⋯ β₁₂^M₁₁ εᵢ)   (16.19)
        = log(β₀) + Xᵢ log(β₁) + M₁ log(β₂) + M₂ log(β₃) + M₃ log(β₄)
          + M₄ log(β₅) + M₅ log(β₆) + M₆ log(β₇) + M₇ log(β₈) + M₈ log(β₉)
          + M₉ log(β₁₀) + M₁₀ log(β₁₁) + M₁₁ log(β₁₂) + log(εᵢ)
3. Enter B1:B13 as the Input Range.
4. Enter 0.5 as the Damping factor. (The damping factor is equal to 1 − W.)
5. Check Labels, enter C1 as the Output Range, and click OK.
In the new column C:
6. Copy the last formula in cell C12 to cell C13.
7. Enter the column heading ES(W=.50) in cell C1, replacing the #N/A value.

To create the exponentially smoothed values that use a smoothing coefficient of W = 0.25, repeat steps 1 through 7, but enter 0.75 as the Damping factor in step 4, enter D1 as the Output Range in step 5, and enter ES(W=.25) as the column heading in step 7.
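The computation the ToolPak performs here can be sketched directly in Python, which makes the role of the damping factor concrete: the ToolPak's damping factor is 1 − W, so W = 0.50 corresponds to a damping factor of 0.50. The series below is hypothetical; the smoothing follows the standard recursion used in Section 16.3, E₁ = Y₁ and Eᵢ = W·Yᵢ + (1 − W)·Eᵢ₋₁.

```python
# Exponential smoothing with smoothing coefficient w:
# the first smoothed value equals the first observation, and each
# later value blends the new observation with the previous smooth.

def exp_smooth(y, w):
    smoothed = [y[0]]
    for value in y[1:]:
        smoothed.append(w * value + (1 - w) * smoothed[-1])
    return smoothed

y = [12.0, 15.0, 11.0, 18.0]     # hypothetical series
print(exp_smooth(y, 0.50))       # [12.0, 13.5, 12.25, 15.125]
```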
EG16.3 LEAST-SQUARES TREND FITTING and FORECASTING

The Linear Trend Model
Key Technique Modify the Section EG13.2 instructions on page 474. Use the cell range of the coded variable as the X variable cell range (called the X Variable Cell Range in the PHStat instructions, the cell range of X variable in the Workbook instructions, and the Input X Range in the Analysis ToolPak instructions). To enter many coded values, use Home → Fill (in the Editing group) → Series, and in the Series dialog box, click Columns and Linear and select appropriate values for Step value and Stop value.
The Quadratic Trend Model
Key Technique Modify the Section EG15.1 instructions on page 554. Use the cell range of the coded variable and the squared coded variable as the X variables cell range, called the X Variables Cell Range in the PHStat instructions and the Input X Range in the Analysis ToolPak instructions.
The Exponential Trend Model
Key Technique Modify the Section EG15.2 instructions on page 554 and the Section EG13.5 instructions on page 475. Use the POWER(10, predicted log(Y)) function to compute the predicted Y values from the predicted log(Y) results.
To create an exponential trend model, first convert the values of the dependent variable Y to log(Y) values using the Section EG15.2 instructions. Then perform a simple linear regression analysis with residual analysis using the log(Y) values. Modify the Section EG13.5 instructions using the cell range of the log(Y) values as the Y variable cell range and the cell range of the coded variable as the X variable cell range. If you use the PHStat or Workbook instructions, residuals will appear in a residuals worksheet. If you use the Analysis ToolPak instructions, residuals will appear in the RESIDUAL OUTPUT area of the regression results worksheet. Because you use log(Y) values for the regression, the predicted Y and residuals listed are log values that need to be converted. [The Analysis ToolPak incorrectly labels the new column for the logs of the residuals Residuals, and not LOG(Residuals).]
In an empty column in the residuals worksheet (PHStat or Workbook) or an empty column range to the right of the RESIDUAL OUTPUT area (Analysis ToolPak):
1. Add a column of formulas that use the POWER function to compute the predicted Y values.
2. Copy the original Y values to the next empty column.
3. In the next empty (third new) column, enter formulas in the form =Y value cell − predicted Y cell to compute the residuals.

Use columns G through I of the RESIDUALS worksheet of the Exponential Trend workbook as a model for these three columns. The worksheet already contains the values and formulas needed to create the Figure 16.11 plot that fits an exponential trend forecasting equation for Alphabet Inc. annual revenue.
To construct an exponential trend plot, first select the cell range of the time-series data and then use the Section EG2.5 instructions to construct a scatter plot. (For the Alphabet Inc. revenue example, use the cell range B1:C9 in the Data worksheet of the Alphabet workbook.) Select the chart and:
1. Select Design (or Chart Design) → Add Chart Element → Trendline → More Trendline Options.
2. In the Format Trendline pane, click Exponential.
Model Selection Using First, Second, and Percentage Differences
Key Technique Use the COMPUTE worksheet of the Differences workbook (see Figure 16.12 on page 571) as a model for developing a differences worksheet. Use subtraction formulas to compute the first and second differences and division formulas to compute the percentage differences. Open to the COMPUTE_FORMULAS worksheet to review the formulas the COMPUTE worksheet uses.
EG16.4 AUTOREGRESSIVE MODELING for TREND FITTING and FORECASTING

Creating Lagged Predictor Variables
Key Technique Use the COMPUTE worksheet of the Lagged Predictors workbook as a model for developing lagged predictor variables for the first-order, second-order, and third-order autoregressive models.
Create lagged predictor variables by creating a column of formulas that refer to a previous row's (previous time period's) Y value. Enter the special worksheet value #N/A (not available) for the cells in the column to which lagged values do not apply. When specifying cell ranges for a lagged predictor variable, include only rows that contain lagged values. Contrary to the usual practice in this book, you do not include rows that contain #N/A, nor do you include the row 1 column heading.
Open to the COMPUTE_FORMULAS worksheet to review the formulas that the worksheet template in Figure 16.14 on page 578 uses.
Autoregressive Modeling
Key Technique To create a third-order or second-order autoregressive model, modify the Section EG14.1 instructions on page 552. Use the cell range of the first-order, second-order, and third-order lagged predictor variables as the X variables cell range for the third-order model. Use the cell range of the first-order and second-order lagged predictor variables as the X variables cell range for the second-order model. If you use the PHStat instructions, modify step 3 to clear, not check, First cells in both ranges contain label. If using the Workbook instructions, use the COMPUTE3 worksheet in lieu of the COMPUTE worksheet for the third-order model. If using the Analysis ToolPak instructions, do not check Labels in step 4.
To create a first-order autoregressive model, modify the Section EG13.2 instructions on page 474. Use the cell range of the first-order lagged predictor variable as the X variable cell range (called the X Variable Cell Range in the PHStat instructions, the cell range of X variable in the Workbook instructions, and the Input X Range in the Analysis ToolPak instructions). If using the PHStat instructions, modify step 3 to clear, not check, First cells in both ranges contain label. If using the Analysis ToolPak instructions, do not check Labels in step 4.
EG16.5 CHOOSING an APPROPRIATE FORECASTING MODEL
Performing a Residual Analysis To create residual plots for the linear trend model or the first-order autoregressive model, use the Section EG13.5 instructions on page 475. To create residual plots for the quadratic trend model or the second-order autoregressive model, use the Section EG14.3 instructions on page 523. To create residual plots for the exponential trend model, use the Section EG16.4 instructions on page 523. To create residual plots for the third-order autoregressive model, modify the Section EG14.3 instructions to use the RESIDUALS3 worksheet instead of the RESIDUALS worksheet if you use the Workbook instructions.
Measuring the Magnitude of the Residuals Through Squared or Absolute Differences Key Technique Use the SUMPRODUCT and COUNT functions to compute the mean absolute deviation (MAD). To compute the MAD, first perform a residual analysis. Then, in an empty cell, add the formula =SUMPRODUCT(ABS(residuals cell range)) / COUNT(residuals cell range). When entering the residuals
599
cell range, do not include the column heading in the cell range. (See Appendix Section F.2 to learn more about the use of the SUMPRODUCT function in this formula.) Cell I10 of the RESIDUALS_FORMULAS worksheet of the Exponential Trend workbook contains the MAD formula for the Alphabet Inc. annual revenue example.
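Outside Excel, the same MAD calculation is a one-liner. The sketch below mirrors =SUMPRODUCT(ABS(range))/COUNT(range); the residual values are hypothetical.

```python
# Mean absolute deviation (MAD) of a set of residuals, equivalent to the
# Excel formula =SUMPRODUCT(ABS(residuals cell range))/COUNT(residuals cell range).

def mad(residuals):
    return sum(abs(r) for r in residuals) / len(residuals)

residuals = [1.2, -0.8, 0.5, -1.5, 0.6]  # hypothetical residuals
print(round(mad(residuals), 2))  # → 0.92
```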
A Comparison of Four Forecasting Methods Key Technique Use the COMPARE worksheet of the Forecasting Comparison workbook as a model. Construct a model comparison worksheet similar to the Figure 16.21 worksheet on page 585 by using Paste Special values (see Appendix Section B.5) to transfer results from the regression results worksheets. For the SSE values (row 22 in Figure 16.21), copy the regression results worksheet cell C13, the SS value for Residual in the ANOVA table. For the SYX values, copy the regression results worksheet cell B7, labeled Standard Error, for all but the exponential trend model. For the MAD values, add formulas as discussed in the previous section. For the SYX value for the exponential trend model, enter a formula in the form =SQRT(exponential SSE cell / (COUNT(cell range of exponential residuals) - 2)). In the COMPARE worksheet, this formula is =SQRT(H22/(COUNT(H3:H21) - 2)). Open to the COMPARE_FORMULAS worksheet to discover how the COMPARE worksheet uses the SUMSQ function as an alternate way of computing the SSE values.
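The three comparison statistics that the COMPARE worksheet collects can all be computed from a model's residuals alone. A minimal Python sketch (the function name and residual values are hypothetical): SSE is the sum of squared residuals (Excel's SUMSQ), SYX is sqrt(SSE/(n − 2)), and MAD is the mean absolute residual.

```python
import math

# Comparison statistics used in the COMPARE worksheet, computed from one
# model's residuals: SSE (Excel's SUMSQ), the standard error of the
# estimate SYX = sqrt(SSE/(n - 2)), and the mean absolute deviation.

def model_stats(residuals):
    n = len(residuals)
    sse = sum(r * r for r in residuals)        # =SUMSQ(range)
    syx = math.sqrt(sse / (n - 2))             # =SQRT(SSE/(COUNT(range)-2))
    mad = sum(abs(r) for r in residuals) / n   # =SUMPRODUCT(ABS(range))/COUNT(range)
    return sse, syx, mad

sse, syx, mad = model_stats([2.0, -1.0, 1.0, -2.0, 1.0, -1.0])  # hypothetical
print(round(sse, 1), round(syx, 3), round(mad, 3))  # → 12.0 1.732 1.333
```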
EG16.6 TIME-SERIES FORECASTING of SEASONAL DATA Least-Squares Forecasting With Monthly or Quarterly Data To develop a least-squares regression model for monthly or quarterly data, add columns of formulas that use the IF function (see Appendix Section F.2) to create dummy variables for the quarterly or monthly data. Enter all formulas in the form =IF(comparison, 1, 0). Shown below are the first five rows of columns F through K of a data worksheet that contains dummy variables. In the first illustration, columns F, G, and H contain the quarterly dummy variables Q1, Q2, and Q3 that are based on column B coded quarter values (not shown). In the second illustration, columns J and K contain the two monthly variables M1 and M6 that are based on column C month values (also not shown).
    F                  G                  H
1   Q1                 Q2                 Q3
2   =IF(B2=1,1,0)      =IF(B2=2,1,0)      =IF(B2=3,1,0)
3   =IF(B3=1,1,0)      =IF(B3=2,1,0)      =IF(B3=3,1,0)
4   =IF(B4=1,1,0)      =IF(B4=2,1,0)      =IF(B4=3,1,0)
5   =IF(B5=1,1,0)      =IF(B5=2,1,0)      =IF(B5=3,1,0)

    J                          K
1   M1                         M6
2   =IF(C2="January",1,0)      =IF(C2="June",1,0)
3   =IF(C3="January",1,0)      =IF(C3="June",1,0)
4   =IF(C4="January",1,0)      =IF(C4="June",1,0)
5   =IF(C5="January",1,0)      =IF(C5="June",1,0)
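The same indicator columns can be generated in code. This hypothetical Python sketch reproduces the =IF(B2=1,1,0) pattern for quarterly data, with Q4 as the omitted baseline category:

```python
# Quarterly dummy variables Q1, Q2, and Q3, equivalent to the worksheet
# formulas =IF(B2=1,1,0), =IF(B2=2,1,0), and =IF(B2=3,1,0).
# Q4 is the omitted baseline category.

quarters = [1, 2, 3, 4, 1, 2, 3, 4]  # hypothetical coded quarter values

dummies = {
    f"Q{q}": [1 if value == q else 0 for value in quarters]
    for q in (1, 2, 3)
}

print(dummies["Q1"])  # → [1, 0, 0, 0, 1, 0, 0, 0]
```

At most one of Q1, Q2, and Q3 is 1 in any row; a row of all zeros identifies the baseline fourth quarter.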
CONTENTS
USING STATISTICS: Back to Arlingtons for the Future
17.1 Business Analytics Overview
CONSIDER THIS: What's My Major If I Want to Be a Data Miner?
17.2 Descriptive Analytics
17.3 Decision Trees
17.4 Clustering
17.5 Association Analysis
17.6 Text Analytics
17.7 Prescriptive Analytics
Back to Arlingtons ..., Revisited
SOFTWARE GUIDE
OBJECTIVES
• Understand fundamental business analytics concepts
• Gain experience with selected analytics methods
• Understand the variety of predictive analytics methods
600
Through sales experiments that the Using Statistics scenarios in Chapter 10 describe, Arlingtons discovered how the location of items in a store can affect in-store sales. While making store placement decisions and charging varying store placement fees based on those experiments did increase revenues, long-term retailing trends toward online commerce continued to hurt the overall financial health of Arlingtons. When a private equity firm made an unsolicited bid for Arlingtons, senior management and the board of directors reluctantly agreed to a buyout.
The new owners believe that with advanced data analysis, they can grow the business, especially in the online marketplace where Arlingtons has been a weak competitor. Just as multiple regression allows consideration of several independent variables, they believe that other methods associated with business analytics will allow them to analyze many more relevant variables. For example, the new owners look to track customer buying habits and to be able to answer questions such as "Which customers were most likely to buy the VLABGo players from the special front-of-store sales location?" and "What else could one expect those customers to buy at Arlingtons?" The new owners also believe that they will be able to start getting answers to more fundamental questions such as "Should we even be selling mobile electronics?" and "Should we invest more in online sales and less in brick-and-mortar (physical) stores?"
To introduce business analytics to existing store managers, the new owners have hired you to prepare notes for a management seminar that would introduce business analytics to these managers, each of whom already has a knowledge of introductory business statistics. What do you say to such a group?
17.1 Business Analytics Overview
601
Business statistics first gained widespread usage in an age of manual filing systems and limited computerization. The first wave of business computers made practical the calculations of the advanced inferential methods that previous chapters discuss, but data handling and storage were often limited or clumsy or both. As information technology and management matured, the application of business statistics grew within organizations and was applied to larger and larger sets of data.
In today's world, where even mobile devices surpass the functionality of supercomputers that existed 30 years ago, much more can be done to support fact-based decision making. This "much more" is the practical realization of techniques long imagined but that could not be implemented due to the limitations of information technology in the past. This much more combines statistics, information systems, and management science. This much more often uses well-known methods but extends those methods into more functional areas or provides the means to analyze large volumes of data. This much more is business analytics, which Section FTF.2 on page 4 first defines.
Section FTF.2 describes business analytics as "the changing face of statistics," but these sets of techniques could also be called "the changing face of business." Just as business students today typically take at least one course in business statistics, business students of tomorrow (and some even today) will be taking at least one course in business analytics. This chapter serves as an introduction and bridge to that future.
Business Analytics Categories
Business analytics methods help management decision makers answer what has happened or has been happening in the business, what could happen in the business, or what should happen based on a recommended course of action. These three kinds of management questions define the three main categories of business analytics (see Table 17.1).

TABLE 17.1 Management questions that define business analytics categories

Question                                     Business analytics category
What has happened or has been happening?     Descriptive analytics
What could happen?                           Predictive analytics
What should happen?                          Prescriptive analytics
Descriptive analytics answer "What has happened or has been happening?" questions. Descriptive analytics methods summarize historical data to identify patterns in the data that might be worthy of investigation or provide decision makers with new insights about business operations. Many methods give decision makers the ability to drill down, or reveal, the details of data that were summarized, and most are related to, or extensions of, methods that Chapter 2 discusses.
Predictive analytics answer "What could happen?" questions. Several subtypes of this category exist. Prediction methods use historical data to explain relationships among variables or to estimate ("predict") values of a dependent Y variable. Multiple regression, which Chapter 14 discusses, is an example of such a method. Classification methods assign items in a collection to target categories or classes. Logistic regression, which Chapter 14 also discusses, assigns items to one of two target categories. Clustering methods find groupings in the data being analyzed. Association methods find items that tend to occur together or specify the rules that explain such co-occurrences.
Prescriptive analytics answer "What should happen?" questions. Prescriptive methods seek to optimize the performance of a business and offer decision-making recommendations for how to respond to and manage business circumstances in the future. These methods often evaluate models that predictive analytics methods build to determine new ways to operate a business while balancing constraints and considering business objectives. Prescriptive methods "can take processes that were once expensive, arduous, and difficult, and complete them in a cost-effective and effortless manner" (Morgan).
602 CHAPTER 17 | Business Analytics
Business Analytics Vocabulary
Note that anyone who has acquired language using an illustrated vocabulary book, which labels pictures with vocabulary words, has learned language using labeled data, too.
Using business analytics, one discovers a vocabulary that has been partially invented by suppliers of computer services or derived from fields other than statistics, such as artificial intelligence. Although this new vocabulary can make business analytics sound difficult to learn, many terms relate directly to the key terms of earlier chapters or have direct parallels to concepts that earlier chapters discuss. Therefore, a foundation based on statistical learning serves one well when facing business analytics.
Perhaps the most commonly encountered term is data mining. Thinking about natural resources mining, where valuable minerals are extracted from the earth, a marketer saw an analogy to "mining data" to extract valuable information from corporate data stores and coined the term. Because the term was invented by thinking about corporate data, data mining usually implies processing large volumes of data or even big data that might combine corporate data (a primary source) with external, secondary sources of data. However, smaller sets of data can be data mined, including certain sets that other chapters use to demonstrate concepts. While "data mining" varies depending on the software and information system used, data mining typically combines what Section 14.5 calls model building with the application of one or more descriptive or predictive analytics methods. This means one element of data mining can be understood as similar to the stepwise or best subsets regression that Section 14.5 discusses.
A conversation about business analytics today often includes the terms artificial intelligence (AI) and machine learning. While "AI" suggests to some a future of talking cyborgs, dysfunctional computers, or the like, artificial intelligence is a branch of computer science, more than 60 years old, that seeks to use software to mimic human expertise, reasoning, or knowledge. Machine learning refers to the AI methods that help automate model building.
Using Excel to perform regression analysis illustrates a simple case of machine learning: Excel computes the coefficients that could otherwise be manually calculated by a human with knowledge of the defining equations found in Chapters 13 and 14.
Classifying a business analytics method as either supervised or unsupervised is also borrowed from artificial intelligence. Supervised methods use explicit facts, called labeled data, to better "learn" relationships among variables and build models. All regression data are labeled because the data include not only values of the independent X variables but also values for the dependent Y variable. Therefore, all regression methods are examples of supervised methods. In contrast, unsupervised methods uncover relationships among variables with unlabeled data. Models produced by unsupervised methods can sometimes be hard for business analyst users or decision makers to understand because the model can be based on abstract mathematical properties of the data.
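The regression-as-machine-learning point can be made concrete. The sketch below computes the least-squares coefficients b0 and b1 for simple linear regression from the kind of defining equations Chapter 13 presents; the data values are hypothetical, and the coefficients are "learned" from the data just as Excel computes them.

```python
# Simple linear regression coefficients b0 and b1 computed from data --
# the "machine learning" that Excel performs when it fits a regression.

def fit_line(x, y):
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx               # slope
    b0 = mean_y - b1 * mean_x    # intercept
    return b0, b1

x = [1, 2, 3, 4, 5]   # hypothetical X values
y = [3, 5, 7, 9, 11]  # exactly y = 1 + 2x
b0, b1 = fit_line(x, y)
print(b0, b1)  # → 1.0 2.0
```

Because these data are labeled (each X value is paired with a Y value), this is an example of a supervised method.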
CONSIDER THIS
What's My Major If I Want to Be a Data Miner? To be most effective as a data miner, you need a broad base of business skills, as one would get majoring in any business subject. Most critically, you need to know how to define problems and requirements using a problem-solving framework such as the DCOVA model and have an awareness of basic concepts of statistics, goals of this book. You might supplement your knowledge with a course that fully introduces business analytics and gives you additional practice in the data preparation tasks that Chapter 1 summarizes. But you do not need to major in data mining to be a data miner, just as you do not need to major in statistics to apply statistical methods to fact-based decision making.
If you are, or plan to be, a graduate student, consider a concentration in business analytics that more closely examines the application of data mining to a functional area. Whatever choices you make, the points made in Section FTF.1 about using a framework and understanding that analytical skills are more important than arithmetic (and other mathematical) skills will always hold. Ironically, as data mining/business analytics software gets more capable and gains the ability to analyze more and more data in ever more sophisticated ways, the points that Section FTF.1 emphasizes will become increasingly important.
17 .1 Business Analytics Overview
603
Inferential Statistics and Predictive Analytics
Inferential statistics form an important foundation for predictive analytics, and one can consider inferential methods such as multiple regression as examples of predictive analytics, too. Therefore, anyone who has studied multiple regression has already begun studying business analytics! Chapter 13 explains that "regression methods seek to discover how one or more independent X variables can predict the value of a dependent Y variable." Most regression examples that Chapters 13 through 15 consider uncover and explain the relationship among the variables being analyzed. Regression methods for those cases produce explanatory models that help better explain the relationship among variables. For example, in the Chapter 14 OmniPower scenario, the resulting explanatory model helps one better estimate the value of the dependent sales variable, given specific values of the price and promotional expenses variables.
The value of the dependent sales variable is the predicted mean sales, that is, an average of what will be expected over time given the explanatory model. One could say that the model predicts the predicted mean sales, instead of estimates. In fact, people (and this book) sometimes use the verbs estimate and predict interchangeably when discussing how one uses a regression model. That usage is not wrong, but it does obscure an important distinction: One uses predictive analytics methods to produce prediction models, models that predict individual cases, not estimate the general case. A method can produce a model that, in one context, is an explanatory model and, in another context, is a prediction model. Logistic regression, which Section 14.7 discusses, best illustrates this.
Section 14.7 uses a cardholder study and logistic regression to develop a regression equation that enables one to estimate the probability that a cardholder will upgrade to a premium card based on specific values for purchases made last year and whether the cardholder possesses additional cards. In the specific worked-out example, the result was a probability of 0.702, which is equivalent to saying that, in the general case, one expects 70.2% of all individuals with those specific values of purchases and additional cards to upgrade to a premium card. Section 14.7 uses the results of logistic regression as an explanatory model.
While that is a perfectly appropriate use, logistic regression becomes a predictive analytics method when the logistic regression model is used as (part of) a prediction model. After all, logistic regression uses a binary categorical variable, so each cardholder upgraded or did not upgrade to a premium card. No cardholder upgraded "70.2% of a premium card," nor did "70.2% of a cardholder" upgrade to a premium card. Recall that the data analyzed to create the logistic model use the values 1 and 0 for the upgraded variable to represent the two categories, or classes, "upgraded to a premium card" and "did not upgrade to a premium card."
Could the logistic regression model be used as a prediction model to predict individual values for the upgraded variable? Yes, if one uses a decision criterion, a value set using business experience or insight or by experimentation. One establishes a decision rule that compares the criterion to the estimated probability to decide whether to predict the value 1 or 0 for the upgraded variable. Figure 17.1 diagrams how applying a decision rule turns logistic regression into a method of classification.
FIGURE 17.1 Using estimated probability for classification
[Decision Rule: Is the estimated probability p greater than the decision criterion? If yes, predict 1 (upgraded); if no, predict 0 (did not upgrade).]
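The decision rule in Figure 17.1 amounts to a single comparison per case. A minimal Python sketch (the probability values and the 0.5 criterion are hypothetical, chosen only for illustration):

```python
# Decision rule from Figure 17.1: predict 1 ("upgraded") when the
# estimated probability exceeds the decision criterion, else predict 0.

def classify(probabilities, criterion=0.5):
    return [1 if p > criterion else 0 for p in probabilities]

estimated = [0.702, 0.31, 0.55, 0.49]   # hypothetical estimated probabilities
print(classify(estimated))              # → [1, 0, 1, 0]
print(classify(estimated, 0.6))         # → [1, 0, 0, 0]
```

Note how changing the criterion changes the predictions, which is why the criterion is set by business experience or experimentation.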
Note that stepwise regression and best subsets regression require using decision rules, too, even though the phrase never appears in Chapter 15. In Exhibit 15.1, "Successful Model Building," on page 540, step 5 specifies choosing the best model based on the values of the adjusted r2 or the Cp statistic, thereby stating a decision rule. When a regression method includes the automated selection of a model based on a decision rule, the method becomes a true example of predictive analytics.
Microsoft Excel and Business Analytics Microsoft Excel plays mostly a supporting role when using business analytics. Many use Excel as a convenient means to select and organize data or to perform various data preparation tasks to ready data for later analysis by business analytics applications. Reflecting that choice, many analytics programs, including Tableau and Microsoft Power BI Desktop, include the ability to import Excel workbook files for analysis. Excel can be used to perform statistical analyses that help better explain, support, or explore business analytics results. One can also modify Excel, adding Microsoft-supplied or third-party add-ins that give Excel new analytics-oriented functionality. Add-ins that perform basic descriptive analytics or implement predictive analytics methods such as tree induction (see Section 17.3) are available on either a free-to-use, free-to-try, or for-sale basis.
Remainder of This Chapter Because Excel is most likely to be used in conjunction with descriptive analytics, Section 17.2 discusses descriptive analytics and presents an example that uses Microsoft Power BI Desktop, an application that complements and works with Microsoft Excel. Section 17.3 features regression and classification trees, a predictive analytics method implemented in a variety of free-to-try and for-sale add-ins. (As Section 17.3 discusses, trees are also related to the regression methods that Chapters 13 through 15 discuss.) Section 17.4 provides examples of clustering, and Section 17.5 presents an example of association analysis, two other types of predictive analytics. Section 17.6 introduces readers to text analytics, a much-discussed application of business analytics, and Section 17.7 provides readers with an introduction to prescriptive analytics.
17.2 Descriptive Analytics
Chapters 2 and 3 discuss descriptive methods that organize and visualize previously collected data. What if current data could be organized and visualized as they are collected? That would change descriptive methods from being summaries of the status of a business at some point in the past into a tool for day-to-day, if not minute-by-minute, business monitoring. Giving decision makers this ability is one of the goals of descriptive analytics. Descriptive analytics provide the means to monitor business activities in near real time, very quickly after a transaction or other business event has occurred.
Being able to do this monitoring can be useful for a business that handles perishable inventory. As the First Things First Chapter Using Statistics scenario notes, empty seats on an airplane or in a concert hall or theater cannot be sold after a certain time. Descriptive analytics allows for a continuously updated display of the inventory, informing late-to-buy customers of the current availability of seats as well as visualizing patterns of sold and unsold seats for managers.
Descriptive analytics can also help manage sets of interrelated flows of people or objects as those flows occur. For example, managers of large sports complexes use descriptive analytics to monitor the flow of cars in parking facilities, the flow of arriving patrons into the stadium, as well as the flow of patrons inside the stadium. Summaries generated by descriptive methods can highlight trends as they occur, such as points of growing congestion. By being provided with such information in a timely manner, stadium managers can redirect personnel to trouble spots in the complex and redirect patrons to entrances or facilities that are underused.
605
Dashboards
Dashboards are comprehensive summary displays that enable decision makers to monitor a business or business activity. Dashboards present the most important pieces of information, typically in a visual format that allows decision makers to quickly perceive the overall status of an activity. Dashboards present these key indicators in a way that provides drill-down abilities that can reveal progressive levels of detail interactively. Dashboards can be of any size, from a single desktop computer display to wall-mounted displays or even larger, such as the nearly 800-square-foot NASDAQ MarketSite Video Wall at Times Square, which can be configured as a NASDAQ stock market dashboard that provides current stock market trends for passersby and viewers of financial programming (see Paczkowski).
Figure 17.2 presents a Microsoft Power BI dashboard that the new managers at Arlingtons might use to monitor national sales. The dashboard uses word tiles and clickable tabular summaries to present sales summaries at different levels of detail: by store category and then by the subcategories of the hardlines department that include mobile electronics sales, the subject of the Chapter 10 sales experiments. In Figure 17.2, managers have decided to focus on mobile electronics sales and are currently viewing mobile electronics sales from one of the four national sales regions, while monitoring total mobile electronics sales nationwide (6544). By viewing a dashboard, the new owners of Arlingtons have a clearer and more immediate picture of current sales throughout the Arlingtons chain. That may help them better react to changes as they seek to manage the retailer to better success.
Figure 17.2 illustrates that dashboards can visually present drilled-down data and act as complements to the data exploration techniques that organize and visualize a mix of variables that Sections 2.6 and 2.7 summarize. While Figure 17.2 contains simple data visualizations, the more complex visualizations that Section 2.7 describes, such as treemaps and colored scatter plots, can also appear in dashboards. For dashboards designed for individual users, multidimensional contingency tables that permit drill-down (see Section 2.6) are also found.
FIGURE 17.2 National sales dashboard for the Arlingtons retail chain
[Power BI dashboard titled ARLINGTONS NATIONAL SALES DASHBOARD, Focus on Mobile Electronics. Panels: Sales by Category (Household Essentials, Furnishings & D..., Grocery, and the Hardlines Department subcategories Toys & Games, Sporting Goods, Appliances, Automotive, Electronics, Mobile Electronics, and Home Improvement); Mobile Electronics by Region; Mobile Electronics Total Sales (6544); and Top 5 Mobile Electronics stores (Brooksville, Interlachen, Lakewood, Pennsburg, Stilwell) by brand (Alpha, Beta, Gamma, Delta).]
606
Data Dimensionality and Descriptive Analytics
Descriptive analytics also help visualize data that have a higher data dimensionality, the number of variables that comprise a data item, by using color, size, or motion to represent additional dimensions. This higher dimensionality overcomes the limits of standard business display technologies, such as screens and paper, that are two-dimensional surfaces.
A bubble chart uses filled-in circles, called bubbles, the color and size (diameter) of which add additional data dimensions. Typically, color represents a categorical variable and size represents a numerical variable, but either of these attributes can be used in the other way. Dynamic bubble charts, also known as motion charts, extend bubble charts by using motion to represent one additional data dimension, typically time. These charts take the form of animations in which bubbles change over time. The changing position of bubbles over time often reveals complex trends and interactions better than an equivalent time-series plot of the data.
Figure 17.3 shows a time-lapse image from a Tableau dynamic bubble chart animation that visualizes domestic movie revenues, by the MPAA ratings G, PG, PG-13, and R, for the years 2002 through 2018. The time-lapse image shows only the animation for the even years in this time series. The animation reveals that as revenues of G-rated movies increase in a year, those revenues tend to depress the revenues of PG-rated movies, suggesting some relationship. The animation also shows how revenues for G-rated movies shrink over time and revenues for PG-13 movies increase over time.
FIGURE 17.3 Time-lapse of Tableau dynamic bubble chart for domestic movie revenues by MPAA rating, for the years 2002 through 2018, showing even years only
[Tableau chart titled "Movie Revenues by Rating and Year, Sized by Revenues."]
[Figure 19.1: Three control chart panels (Time Panel A, Time Panel B, Time Panel C), each plotting values over time against an upper control limit (UCL), a center line, and a lower control limit (LCL), with special cause variation flagged.]
Copyright © 2021 Pearson Education, Inc.
19-4 CHAPTER 19 | Statistical Applications in Quality Management
student TIP Look for an obvious pattern over time and not small fluctuations from one time period to another.
2 This rule is often referred to as the runs rule. A similar rule that some companies use is called the trend rule: eight or more consecutive points that increase in value or eight or more consecutive points that decrease in value. Some statisticians (see reference 5) have criticized the trend rule, so it should be used only with extreme caution.
In Panel A of Figure 19.1, there is no apparent pattern in the values over time and none of the points fall outside the 3 standard deviation control limits. The process appears stable and contains only common cause variation. Panel B, in contrast, contains two points that fall outside the 3 standard deviation control limits. You should investigate these points to try to determine the special causes that led to their occurrence. Although Panel C does not have any points outside the control limits, it has a series of consecutive points above the mean value (the center line) as well as a series of consecutive points below the mean value. In addition, a long-term overall downward trend is clearly visible. You should investigate the situation to try to determine what may have caused this pattern.
Detecting a pattern is not always so easy. The following simple rule (see references 10, 15, and 19) can help you detect a trend or a shift in the mean level of a process: eight or more consecutive points that lie above the center line or eight or more consecutive points that lie below the center line.2
A process is considered out of control if its control chart indicates an out-of-control condition, either a point outside the control limits or a series of points that exhibits a pattern. An out-of-control process contains both common causes of variation and special causes of variation. Because special causes of variation are not part of the process design, an out-of-control process is unpredictable. When you determine that a process is out of control, you must identify the special causes of variation that are producing the out-of-control conditions. If the special causes are detrimental to the quality of the product or service, you need to implement plans to eliminate this source of variation. When a special cause increases quality, you should change the process so that the special cause is incorporated into the process design. Thus, this beneficial special cause now becomes a common cause source of variation, and the process is improved.
A process whose control chart does not indicate any out-of-control conditions is said to be in control. An in-control process contains only common causes of variation. Because these sources of variation are inherent to the process itself, an in-control process is predictable. In-control processes are sometimes said to be in a state of statistical control. When a process is in control, you must determine whether the amount of common cause variation in the process is small enough to satisfy the customers of the products or services. If the common cause variation is small enough to consistently satisfy the customers, you then use control charts to monitor the process on a continuing basis to make sure the process remains in control. If the common cause variation is too large, you need to alter the process itself.
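The two out-of-control signals described above (a point beyond a control limit, or eight or more consecutive points on one side of the center line) are mechanical enough to check in code. A hypothetical sketch (the function name and data are invented for illustration):

```python
# Check the two out-of-control signals: a point outside the LCL-UCL
# control limits, or run_length (eight) or more consecutive points
# on one side of the center line.

def out_of_control(values, lcl, center, ucl, run_length=8):
    if any(v < lcl or v > ucl for v in values):
        return True  # a point falls outside the control limits
    run, prev_side = 0, 0
    for v in values:
        side = (v > center) - (v < center)  # +1 above, -1 below, 0 on the line
        run = run + 1 if side != 0 and side == prev_side else (1 if side != 0 else 0)
        if run >= run_length:
            return True  # a sustained run on one side of the center line
        prev_side = side
    return False

stable = [5.1, 4.9, 5.2, 4.8, 5.0, 5.3, 4.7, 5.1]  # hypothetical values
print(out_of_control(stable, lcl=4.0, center=5.0, ucl=6.0))     # → False
print(out_of_control([5.4] * 8, lcl=4.0, center=5.0, ucl=6.0))  # → True
```

The first series shows only common cause variation; the second trips the runs rule even though every point lies within the control limits.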
19.2 Control Chart for the Proportion: The p Chart
student TIP The p chart is used only when each item can be classified into one of two possible categories, such as conforming and not conforming.
Various types of control charts are used to monitor processes and determine whether special cause variation is present in a process. Attribute control charts are used for categorical or discrete variables. This section introduces the p chart, which is used for categorical variables. The p chart gets its name from the fact that you plot the proportion of items in a sample that are in a category of interest. For example, sampled items are often classified according to whether they conform or do not conform to operationally defined requirements. Thus, the p chart is frequently used to monitor and analyze the proportion of nonconforming items in repeated samples (i.e., subgroups) selected from a process.
To begin the discussion of p charts, recall that you studied proportions and the binomial distribution in Section 5.2. Then, in Equation (7.6), the sample proportion is defined as $p = X/n$, and the standard deviation of the sample proportion is defined in Equation (7.7) as
σ_p = √(π(1 − π)/n)
Copyright © 2021 Pearson Education, Inc.
³This chapter uses the quality management phrase proportion of nonconforming items even though a p chart can monitor any proportion of interest. (In the Section 5.2 discussion of the binomial distribution, the phrase proportion of items of interest is used.)
Using Equation (19.1), control limits for the proportion of nonconforming³ items from the sample data are established in Equation (19.2).

CONTROL LIMITS FOR THE p CHART

p̄ ± 3√(p̄(1 − p̄)/n̄)

UCL = p̄ + 3√(p̄(1 − p̄)/n̄)
LCL = p̄ − 3√(p̄(1 − p̄)/n̄)    (19.2)
For equal n_i,

n̄ = n_i  and  p̄ = (Σᵢ₌₁ᵏ p_i)/k

or, in general,

n̄ = (Σᵢ₌₁ᵏ n_i)/k  and  p̄ = (Σᵢ₌₁ᵏ X_i)/(Σᵢ₌₁ᵏ n_i)

where
X_i = number of nonconforming items in subgroup i
n_i = sample (or subgroup) size for subgroup i
p_i = X_i/n_i = proportion of nonconforming items in subgroup i
k = number of subgroups selected
n̄ = mean subgroup size
p̄ = proportion of nonconforming items in the k subgroups combined
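The definitions above translate directly into code. The following Python sketch (a hypothetical helper, not part of the text or its Excel, Minitab, or JMP workbooks) computes the center line and control limits from subgroup counts:

```python
from math import sqrt

def p_chart_limits(X, n):
    """Center line and control limits for a p chart, following Equation (19.2).

    X[i] = number of nonconforming items in subgroup i
    n[i] = size of subgroup i
    """
    k = len(X)
    n_bar = sum(n) / k                      # mean subgroup size
    p_bar = sum(X) / sum(n)                 # combined proportion nonconforming
    half_width = 3 * sqrt(p_bar * (1 - p_bar) / n_bar)
    lcl = max(p_bar - half_width, 0.0)      # a negative LCL means the LCL does not exist
    ucl = p_bar + half_width
    return p_bar, lcl, ucl

# Four subgroups of 100 items each (illustrative values, not from the text):
center, lcl, ucl = p_chart_limits([12, 14, 10, 18], [100] * 4)
```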
Any negative value for the LCL means that the LCL does not exist. To show the application of the p chart, return to the Beachcomber Hotel scenario. During the process improvement effort in the Measure phase of Six Sigma (see Section 19.8), a nonconforming room was operationally defined as the absence of an amenity or an appliance not in working order upon check-in. During the Analyze phase of Six Sigma, data on the nonconformances were collected daily from a sample of 200 rooms and stored in the companion data file. Table 19.1 on the next page lists the number and proportion of nonconforming rooms for each day in the four-week period. For these data,

k = 28,  Σᵢ₌₁ᵏ p_i = 2.315,  and  p̄ = (Σᵢ₌₁ᵏ p_i)/k = 2.315/28 = 0.0827

Because the n_i are equal, n_i = n̄ = 200. Using Equation (19.2),

0.0827 ± 3√((0.0827)(0.9173)/200)

resulting in

UCL = 0.0827 + 0.0584 = 0.1411
LCL = 0.0827 − 0.0584 = 0.0243
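The arithmetic for these limits can be replayed numerically. The following Python sketch is a hypothetical check, not part of the text:

```python
from math import sqrt

p_bar = 2.315 / 28        # sum of the 28 daily proportions divided by k
n_bar = 200               # equal subgroup size on every day
half_width = 3 * sqrt(p_bar * (1 - p_bar) / n_bar)
ucl = p_bar + half_width
lcl = p_bar - half_width
```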
TABLE 19.1
Nonconforming hotel rooms at check-in over 28-day period

Day (i)  Rooms Studied (n_i)  Rooms Not Ready (X_i)  Proportion (p_i)
1        200                  16                     0.080
2        200                   7                     0.035
3        200                  21                     0.105
4        200                  17                     0.085
5        200                  25                     0.125
6        200                  19                     0.095
7        200                  16                     0.080
8        200                  15                     0.075
9        200                  11                     0.055
10       200                  12                     0.060
11       200                  22                     0.110
12       200                  20                     0.100
13       200                  17                     0.085
14       200                  26                     0.130
15       200                  18                     0.090
16       200                  13                     0.065
17       200                  15                     0.075
18       200                  10                     0.050
19       200                  14                     0.070
20       200                  25                     0.125
21       200                  19                     0.095
22       200                  12                     0.060
23       200                   6                     0.030
24       200                  12                     0.060
25       200                  18                     0.090
26       200                  15                     0.075
27       200                  20                     0.100
28       200                  22                     0.110
Figure 19.2 displays Excel and Minitab p charts for the Table 19.1 data. These charts show a process in a state of statistical control, with the individual points distributed around p̄ without any pattern and all the points within the control limits. Thus, any improvement in the process of making rooms ready for guests must come from the reduction of common cause variation. Such reductions require changes in the process. These changes are the responsibility of management. Remember that improvements in quality cannot occur until changes to the process itself are successfully implemented.

FIGURE 19.2
Excel and Minitab p charts for the nonconforming hotel rooms
[chart images not reproduced; both plot the daily proportion of rooms not ready against the center line p̄ = 0.0827 and the control limits]
This example illustrates a situation in which the subgroup size does not vary. As a general rule, as long as none of the subgroup sizes, n_i, differ from the mean subgroup size, n̄, by more than ±25% of n̄ (see reference 10), you can use Equation (19.2) to compute the control limits for the p chart. If any subgroup size differs by more than ±25% of n̄, you use alternative formulas for calculating the control limits (see references 10 and 15). To illustrate the use of the p chart when the subgroup sizes are unequal, Example 19.1 studies the production of medical sponges.
EXAMPLE 19.1 Using the p Chart for Unequal Subgroup Sizes

Table 19.2 indicates the number of medical sponges produced daily for a period of 32 days and the number that are nonconforming (stored in the companion data file). Construct a control chart for these data.

TABLE 19.2
Medical sponges produced and number nonconforming over a 32-day period

Day (i)  Sponges Produced (n_i)  Nonconforming Sponges (X_i)  Proportion (p_i)
1        690                     21                           0.030
2        580                     22                           0.038
3        685                     20                           0.029
4        595                     21                           0.035
5        665                     23                           0.035
6        596                     19                           0.032
7        600                     18                           0.030
8        620                     24                           0.039
9        610                     20                           0.033
10       595                     22                           0.037
11       645                     19                           0.029
12       675                     23                           0.034
13       670                     22                           0.033
14       590                     26                           0.044
15       585                     17                           0.029
16       560                     16                           0.029
17       575                     20                           0.035
18       610                     16                           0.026
19       596                     15                           0.025
20       630                     24                           0.038
21       625                     25                           0.040
22       615                     21                           0.034
23       575                     23                           0.040
24       572                     20                           0.035
25       645                     24                           0.037
26       651                     39                           0.060
27       660                     21                           0.032
28       685                     19                           0.028
29       671                     17                           0.025
30       660                     22                           0.033
31       600                     24                           0.040
32       595                     16                           0.027
SOLUTION From Table 19.2,

k = 32,  Σᵢ₌₁ᵏ n_i = 19,926,  and  Σᵢ₌₁ᵏ X_i = 679

Therefore,

n̄ = 19,926/32 = 622.69
p̄ = 679/19,926 = 0.034

Using Equation (19.2),

p̄ ± 3√(p̄(1 − p̄)/n̄) = 0.034 ± 3√((0.034)(1 − 0.034)/622.69)
= 0.034 ± 0.022

resulting in

UCL = 0.034 + 0.022 = 0.056
LCL = 0.034 − 0.022 = 0.012
Figure 19.3 displays the Excel and JMP p charts for the sponge data.

FIGURE 19.3
Excel and JMP p charts for the proportion of nonconforming medical sponges
[chart images not reproduced; both plot the daily proportion of nonconforming sponges against the control limits, with one point above the UCL]
From Figure 19.3, you can see that day 26, on which there were 39 nonconforming sponges produced out of 651 sampled, is above the UCL. Management needs to determine the reason (i.e., root cause) for this special cause variation and take corrective action. Once actions are taken, you can remove the data from day 26 and then construct and analyze a new control chart.
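A small Python sketch (hypothetical, with the subgroup sizes and counts re-keyed from Table 19.2) can flag which days fall outside these limits:

```python
from math import sqrt

def out_of_control_days(X, n):
    """Return the (1-based) subgroups whose proportion falls outside the
    p-chart limits computed from the mean subgroup size (Equation 19.2)."""
    k = len(X)
    n_bar = sum(n) / k
    p_bar = sum(X) / sum(n)
    half = 3 * sqrt(p_bar * (1 - p_bar) / n_bar)
    lcl, ucl = max(p_bar - half, 0.0), p_bar + half
    return [i + 1 for i in range(k) if not lcl <= X[i] / n[i] <= ucl]

# Sponges produced (n_i) and nonconforming (X_i) for the 32 days of Table 19.2:
produced = [690, 580, 685, 595, 665, 596, 600, 620, 610, 595, 645, 675, 670, 590,
            585, 560, 575, 610, 596, 630, 625, 615, 575, 572, 645, 651, 660, 685,
            671, 660, 600, 595]
bad = [21, 22, 20, 21, 23, 19, 18, 24, 20, 22, 19, 23, 22, 26, 17, 16, 20, 16,
       15, 24, 25, 21, 23, 20, 24, 39, 21, 19, 17, 22, 24, 16]
```

Running `out_of_control_days(bad, produced)` flags only day 26, matching the chart.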
PROBLEMS FOR SECTION 19.2

LEARNING THE BASICS

19.1 The following data were collected on nonconformances for a period of 10 days:

Day  Sample Size  Nonconformances
1    100          12
2    100          14
3    100          10
4    100          18
5    100          22
6    100          14
7    100          15
8    100          13
9    100          14
10   100          16

a. On what day is the proportion of nonconformances largest? Smallest?
b. What are the LCL and UCL?
c. Are there any special causes of variation?

19.2 The following data were collected on nonconformances for a period of 10 days:

Day  Sample Size  Nonconformances
1    111          12
2     93          14
3    105          10
4     92          18
5    117          22
6     88          14
7    117          15
8     87          13
9    119          14
10   107          16

a. On what day is the proportion of nonconformances largest? Smallest?
b. What are the LCL and UCL?
c. Are there any special causes of variation?
APPLYING THE CONCEPTS

19.3 A medical transcription service enters medical data on patient files for hospitals. The service has the business objective of improving the turnaround time (defined as the time between sending data and the time the client receives completed files). After studying the process, it was determined that turnaround time was increased by transmission errors. A transmission error was defined as data transmitted that did not go through as planned and needed to be retransmitted. For a period of 31 days, a sample of 125 transmissions was randomly selected each day and evaluated for errors, with the results stored in the companion data file. The following table presents the number and proportion of transmissions with errors:

Day (i)  Number of Errors (X_i)  Proportion of Errors (p_i)
1         6                      0.048
2         3                      0.024
3         4                      0.032
4         4                      0.032
5         9                      0.072
6         0                      0.000
7         0                      0.000
8         8                      0.064
9         4                      0.032
10        3                      0.024
11        4                      0.032
12        1                      0.008
13       10                      0.080
14        9                      0.072
15        3                      0.024
16        1                      0.008
17        4                      0.032
18        6                      0.048
19        3                      0.024
20        5                      0.040
21        1                      0.008
22        3                      0.024
23       14                      0.112
24        6                      0.048
25        7                      0.056
26        3                      0.024
27       10                      0.080
28        7                      0.056
29        5                      0.040
30        0                      0.000
31        3                      0.024

a. Construct a p chart.
b. Is the process in a state of statistical control? Why?

19.4 The following data (stored in the companion data file) represent the findings from a study conducted at a factory that manufactures film canisters. For 32 days, 500 film canisters were sampled and inspected. The following table lists the number of defective film canisters (the nonconforming items) for each day (the subgroup):

Day  Number Nonconforming
1    26
2    25
3    23
4    24
5    26
6    20
7    21
8    27
9    23
10   25
11   22
12   26
13   25
14   29
15   20
16   19
17   23
18   19
19   18
20   27
21   28
22   24
23   26
24   23
25   27
26   28
27   24
28   22
29   20
30   25
31   27
32   19

a. Construct a p chart.
b. Is the process in a state of statistical control? Why?

19.5 A hospital administrator has the business objective of reducing the time to process patients' medical records after discharge. She determined that all records should be processed within 5 days of discharge. Thus, any record not processed within 5 days of a patient's discharge is nonconforming. The administrator recorded the number of patients discharged and the number of records not processed within the 5-day standard for a 30-day period and stored the results in the companion data file.
a. Construct a p chart for these data.
b. Does the process give an out-of-control signal? Explain.
c. If the process is out of control, assume that special causes were subsequently identified and corrective action was taken to keep them from happening again. Then eliminate the data causing the out-of-control signals and recalculate the control limits.

19.6 The bottling division of Sweet Suzy's Sugarless Cola maintains daily records of the occurrences of unacceptable cans flowing from the filling and sealing machine. The companion data file lists the number of cans filled and the number of nonconforming cans for one month (based on a five-day workweek).
a. Construct a p chart for the proportion of unacceptable cans for the month. Does the process give an out-of-control signal?
b. If you want to develop a process for reducing the proportion of unacceptable cans, how should you proceed?

19.7 The manager of the accounting office of a large hospital has the business objective of reducing the number of incorrect account numbers entered into the computer system. A subgroup of 200 account numbers is selected from each day's output, and each account number is inspected to determine whether it is a nonconforming item. The results for a period of 39 days are stored in the companion data file.
a. Construct a p chart for the proportion of nonconforming items. Does the process give an out-of-control signal?
b. Based on your answer in (a), if you were the manager of the accounting office, what would you do to improve the process of account number entry?

19.8 A regional manager of a telecommunications company is responsible for processing requests concerning additions, changes, and deletions of service. She forms a service improvement team to look at the corrections to the orders in terms of central office equipment and facilities required to process the orders that are issued to service requests. Data collected over a period of 30 days are stored in the companion data file.
a. Construct a p chart for the proportion of corrections. Does the process give an out-of-control signal?
b. What should the regional manager do to improve the processing of requests for changes in service?
19.3 The Red Bead Experiment: Understanding Process Variability
4
For information on how to purchase such a bowl, visit the Li ghtning Calculato r website, www .qualitytng.com.
The red bead experiment, also known as "the Parable of the Red Beads," is a demonstration developed by W. Edwards Deming that enhances understanding of the difference between common cause and special cause variation. The red bead experiment involves the selection of beads from a bowl that contains 4,000 beads.⁴ Unknown to the participants in the experiment, 3,200 (80%) of the beads are white and 800 (20%) are red.

You can use several different scenarios for conducting the experiment. The one used here begins with a facilitator (who will play the role of company supervisor) asking members of the audience to volunteer for the jobs of workers (at least four are needed), inspectors (two are needed), chief inspector (one is needed), and recorder (one is needed). A worker's job consists of using a paddle that has five rows of 10 bead-size holes to select 50 beads from the bowl of beads.

When the participants have been selected, the supervisor explains the jobs to them. The job of the workers is to produce white beads because red beads are unacceptable to the customers. Strict procedures are to be followed. Work standards call for the daily production of exactly 50 beads by each worker (a strict quota system). Management has established a standard that no more than 2 red beads (4%) per worker are to be produced on any given day. Each worker dips the paddle into the box of beads so that when it is removed, each of the 50 holes contains a bead. The worker carries the paddle to the two inspectors, who independently record the count of red beads. The chief inspector compares their counts and announces the results to the audience. The recorder writes down the number and percentage of red beads next to the name of the worker.

When all the people know their jobs, "production" can begin. Suppose that on the first "day," the number of red beads "produced" by the four workers (call them Livia, David, Dan, and Sharyn) was 9, 12, 13, and 7, respectively.
How should management react to the day's production when the standard says that no more than 2 red beads per worker should be produced? Should all the workers be reprimanded, or should only David and Dan be warned that they will be fired if they don't improve? Suppose that production continues for an additional two days. Table 19.3 summarizes the results for all three days.
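The experiment is easy to mimic in code. The sketch below is a hypothetical simulation, not part of Deming's demonstration; it draws each worker's 50 beads independently from a bowl that is 20% red (sampling with replacement, a close approximation for a 4,000-bead bowl):

```python
import random

def red_bead_day(rng, workers=4, paddle_size=50, red_fraction=0.20):
    """Simulate one 'day' of the red bead experiment: each worker draws
    paddle_size beads from a bowl in which red_fraction of the beads are red.
    Returns the red-bead count for each worker."""
    return [sum(rng.random() < red_fraction for _ in range(paddle_size))
            for _ in range(workers)]

rng = random.Random(0)          # fixed seed for a repeatable demonstration
day1 = red_bead_day(rng)        # four red-bead counts, one per worker
```

Even though every worker follows the identical procedure, the simulated counts vary from worker to worker and day to day, which is exactly the point of the parable.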
TABLE 19.3
Red bead experiment results for four workers over three days

WORKER            DAY 1      DAY 2      DAY 3      All Three Days
Livia             9 (18%)    11 (22%)   6 (12%)    26 (17.33%)
David             12 (24%)   12 (24%)   8 (16%)    32 (21.33%)
Dan               13 (26%)   6 (12%)    12 (24%)   31 (20.67%)
Sharyn            7 (14%)    9 (18%)    8 (16%)    24 (16.00%)
All four workers  41         38         34         113
Mean              10.25      9.5        8.5        9.42
Percentage        20.5%      19%        17%        18.83%
From Table 19.3, on each day, some of the workers were above the mean and some below the mean. On Day 1, Sharyn did best, but on Day 2, Dan (who had the worst record on Day 1) was best, and on Day 3, Livia was best. How can you explain all this variation?

From Table 19.3, k = 12 (4 workers × 3 days), n = 50, Σᵢ₌₁ᵏ X_i = 113, and Σᵢ₌₁ᵏ n_i = 600. Therefore,

p̄ = 113/600 = 0.1883
Using Equation (19.2),

p̄ ± 3√(p̄(1 − p̄)/n) = 0.1883 ± 3√((0.1883)(1 − 0.1883)/50)
= 0.1883 ± 0.1659

resulting in

UCL = 0.1883 + 0.1659 = 0.3542
LCL = 0.1883 − 0.1659 = 0.0224
Figure 19.4 represents the p chart for the Table 19.3 data. In the chart, all the points are within the control limits, and there are no patterns in the results. The differences between the workers merely represent common cause variation inherent in an in-control process.
FIGURE 19.4
p chart for the red bead experiment
[chart image not reproduced; all 12 points fall between the LCL of 0.0224 and the UCL of 0.3542]
The c chart does not indicate any points outside the control limits. However, because there are eight or more consecutive points that lie above the center line and there are also eight or more consecutive points that lie below the center line, the process is out of control. There is a clear pattern to the number of customer complaints over time. During the first half of the sequence, the number of complaints for almost all the weeks is greater than the mean number of complaints, and the number of complaints for almost all the weeks in the second half is less than the mean number of complaints. This change, which is an improvement, is due to a special cause of variation. The next step is to investigate the process and determine the special cause that produced this pattern. When identified, you then need to ensure that this becomes a permanent improvement, not a temporary phenomenon. In other words, the source of the special cause of variation must become part of the permanent ongoing process in order for the number of customer complaints not to slip back to the high levels experienced in the first 25 weeks.
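The out-of-control signal used here, eight or more consecutive points on the same side of the center line, can be checked mechanically. The following Python sketch is a hypothetical helper (run lengths other than eight are sometimes used in practice):

```python
def runs_rule_violation(points, center, run_length=8):
    """Return True if run_length or more consecutive points fall strictly
    on the same side of the center line, a common out-of-control signal."""
    run = 0
    last_side = 0
    for x in points:
        side = (x > center) - (x < center)   # +1 above, -1 below, 0 on the line
        if side != 0 and side == last_side:
            run += 1
        else:
            run = 1 if side != 0 else 0
        last_side = side
        if run >= run_length:
            return True
    return False
```

Applied to the weekly complaint counts, this test would fire on both the early run above the center line and the later run below it.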
PROBLEMS FOR SECTION 19.4

LEARNING THE BASICS

19.11 The following data were collected on the number of nonconformities per unit for 10 time periods:

Time  Nonconformities per Unit
1     7
2     3
3     6
4     3
5     4
6     5
7     3
8     5
9     2
10    0

a. Construct the appropriate control chart and determine the LCL and UCL.
b. Are there any special causes of variation?

19.12 The following data were collected on the number of nonconformities per unit for 10 time periods:

Time  Nonconformities per Unit
1     25
2     11
3     10
4     11
5     6
6     15
7     12
8     10
9     9
10    6

a. Construct the appropriate control chart and determine the LCL and UCL.
b. Are there any special causes of variation?
APPLYING THE CONCEPTS

19.13 To improve service quality, the owner of a dry-cleaning business has the business objective of reducing the number of dry-cleaned items that are returned for rework per day. Records were kept for a four-week period (the store is open Monday through Saturday), with the results given in the following table and in the companion data file.

Day  Items Returned for Rework
1    4
2    6
3    3
4    7
5    6
6    8
7    6
8    4
9    8
10   6
11   5
12   12
13   5
14   8
15   3
16   4
17   10
18   9
19   6
20   5
21   8
22   6
23   7
24   9

a. Construct a c chart for the number of items per day that are returned for rework. Do you think the process is in a state of statistical control?
b. Should the owner of the dry-cleaning store take action to investigate why 12 items were returned for rework on Day 12? Explain. Would your answer change if 20 items were returned for rework on Day 12?
c. On the basis of the results in (a), what should the owner of the dry-cleaning store do to reduce the number of items per day that are returned for rework?

19.14 The branch manager of a savings bank has recorded the number of errors of a particular type that each of 12 tellers has made during the past year. The results (stored in the companion data file) are as follows:

Teller    Number of Errors
Alice     4
Carl      7
Gina      12
Jane      6
Livia     2
Marla     5
Mitchell  6
Nora      3
Paul      5
Salvador  4
Tripp     7
Vera      5

a. Do you think the bank manager will single out Gina for any disciplinary action regarding her performance in the past year?
b. Construct a c chart for the number of errors committed by the 12 tellers. Is the number of errors in a state of statistical control?
c. Based on the c chart developed in (b), do you now think that Gina should be singled out for disciplinary action regarding her performance? Does your conclusion now agree with what you expected the manager to do?
d. On the basis of the results in (b), what should the branch manager do to reduce the number of errors?

19.15 Falls are one source of preventable hospital injury. Although most patients who fall are not hurt, a risk of serious injury is involved. The data in the companion data file represent the number of patient falls per month over a 28-month period in a 19-bed AIDS unit at a major metropolitan hospital.
a. Construct a c chart for the number of patient falls per month. Is the process of patient falls per month in a state of statistical control?
b. What effect would it have on your conclusions if you knew that the AIDS unit was started only one month prior to the beginning of data collection?
c. Compile a list of factors that might produce special cause variation in this problem.

19.16 A member of the volunteer fire department for Trenton, Ohio, decided to apply the control chart methodology he learned in his business statistics class to data collected by the fire department. He was interested in determining whether weeks containing more than the mean number of fire runs were due to inherent, chance causes of variation, or if there were special causes of variation such
as increased arson, severe drought, or holiday-related activities. The companion data file contains the number of fire runs made per week (Sunday through Saturday) during a single year.

Source: Data extracted from The City of Trenton 2001 Annual Report, Trenton, Ohio, February 21, 2002.

a. What is the mean number of fire runs made per week?
b. Construct a c chart for the number of fire runs per week.
c. Is the process in a state of statistical control?
d. Weeks 15 and 41 experienced seven fire runs each. Are these large values explainable by common causes, or does it appear that special causes of variation occurred in these weeks?
e. Explain how the fire department can use these data to chart and monitor future weeks in real time (i.e., on a week-to-week basis).
19.17 Rochester-Electro-Medical Inc. is a manufacturing company based in Tampa, Florida, that produces medical products. Management had the business objective of improving the safety of the workplace and began a safety sampling study. The following data (stored in the companion data file) represent the number of unsafe acts observed by the company safety director over an initial time period in which he made 20 tours of the plant.

Tour  Number of Unsafe Acts
1     10
2     6
3     6
4     10
5     8
6     12
7     2
8     1
9     23
10    3
11    2
12    8
13    7
14    6
15    6
16    11
17    13
18    9
19    6
20    9

Source: Data extracted from H. Gitlow, A. R. Berkins, and M. He, "Safety Sampling: A Case Study," Quality Engineering, 14 (2002), 405–419.

a. Construct a c chart for the number of unsafe acts.
b. Based on the results of (a), is the process in a state of statistical control?
c. What should management do next to improve the process?
19.5 Control Charts for the Range and the Mean

student TIP
Use range and mean charts when you have measurements on a numerical variable.
You use variables control charts to monitor and analyze a process when you have numerical variables. Common numerical variables include time, money, and weight. Because numerical variables provide more information than categorical variables, such as the proportion of nonconforming items, variables control charts are more sensitive than the p chart in detecting special cause variation. Variables charts are typically used in pairs, with one chart monitoring the variability in a process and the other monitoring the process mean. You must examine the chart that monitors variability first because if it indicates the presence of out-of-control conditions, the interpretation of the chart for the mean will be misleading. Although businesses currently use several alternative pairs of charts (see references 10, 15, 16, and 19), this chapter considers only the control charts for the range and the mean.
The R Chart

You can use several different types of control charts to monitor the variability in a numerical variable. The simplest and most common is the control chart for the range, the R chart. You use the range chart only when the sample size or subgroup size is 10 or less. If the sample size is greater than 10, a standard deviation chart is preferable (see references 10, 15, 16, and 19). Because sample sizes of 5 or less are used in many applications, the standard deviation chart is not illustrated in this text.

An R chart enables you to determine whether the variability in a process is in control or whether changes in the amount of variability are occurring over time. If the process range is in control, then the amount of variation in the process is consistent over time, and you can use the results of the R chart to develop the control limits for the mean.

To develop control limits for the range, you need an estimate of the mean range and the standard deviation of the range. As shown in Equation (19.4), these control limits depend on two constants: the d₂ factor, which represents the relationship between the standard deviation and the range for varying sample sizes, and the d₃ factor, which represents the relationship between the standard deviation and the standard error of the range for varying sample sizes.
Table E.9 contains values for these factors. Equation (19.4) defines the control limits for the R chart.
CONTROL LIMITS FOR THE RANGE

R̄ ± 3R̄(d₃/d₂)

UCL = R̄ + 3R̄(d₃/d₂)
LCL = R̄ − 3R̄(d₃/d₂)    (19.4)

where

R̄ = (Σᵢ₌₁ᵏ R_i)/k
k = number of subgroups selected
You can simplify the computations in Equation (19.4) by using the D₃ factor, equal to 1 − 3(d₃/d₂), and the D₄ factor, equal to 1 + 3(d₃/d₂), to express the control limits (see Table E.9), as shown in Equations (19.5a) and (19.5b).

COMPUTING CONTROL LIMITS FOR THE RANGE

UCL = D₄R̄    (19.5a)
LCL = D₃R̄    (19.5b)
To illustrate the R chart, return to the Beachcomber Hotel scenario. As part of the Measure phase of a Six Sigma project (see Section 19.8), the amount of time to deliver luggage was operationally defined as the time from when the guest completes check-in procedures to the time the luggage arrives in the guest's room. During the Analyze phase of the Six Sigma project, data were recorded over a four-week period (see the companion data file). Subgroups of five deliveries were selected from the evening shift on each day. Table 19.5 on the next page summarizes the results for all 28 days. From Table 19.5,

k = 28,  Σᵢ₌₁ᵏ R_i = 97.5,  and  R̄ = (Σᵢ₌₁ᵏ R_i)/k = 97.5/28 = 3.482

From Table E.9, for n = 5, D₃ = 0 and D₄ = 2.114. Using Equation (19.5),

UCL = D₄R̄ = (2.114)(3.482) = 7.36

and the LCL does not exist. Figure 19.6 on the next page displays Excel and Minitab R charts for the luggage delivery times. Figure 19.6 does not indicate any individual ranges outside the control limits or any obvious patterns. Therefore, you conclude that the process is in control.
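The luggage delivery computation can be replayed numerically; the sketch below is a hypothetical check, with the 28 daily ranges re-keyed from Table 19.5:

```python
# Daily luggage delivery ranges, in minutes (Table 19.5).
ranges = [5.0, 3.8, 2.0, 6.3, 3.1, 2.7, 3.3, 2.2, 2.8, 5.6, 2.3, 2.5, 3.5, 5.3,
          4.4, 5.6, 4.2, 2.1, 3.0, 2.7, 1.6, 3.7, 3.1, 2.3, 3.4, 1.6, 4.0, 5.4]
r_bar = sum(ranges) / len(ranges)     # 97.5 / 28
ucl = 2.114 * r_bar                   # D4 for n = 5 is 2.114; D3 = 0, so no LCL
```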
TABLE 19.5
Luggage delivery times and subgroup mean and range for 28 days

Day  Luggage Delivery Times (in minutes)   Mean    Range
1     6.7  11.7   9.7   7.5   7.8          8.68    5.0
2     7.6  11.4   9.0   8.4   9.2          9.12    3.8
3     9.5   8.9   9.9   8.7  10.7          9.54    2.0
4     9.8  13.2   6.9   9.3   9.4          9.72    6.3
5    11.0   9.9  11.3  11.6   8.5         10.46    3.1
6     8.3   8.4   9.7   9.8   7.1          8.66    2.7
7     9.4   9.3   8.2   7.1   6.1          8.02    3.3
8    11.2   9.8  10.5   9.0   9.7         10.04    2.2
9    10.0  10.7   9.0   8.2  11.0          9.78    2.8
10    8.6   5.8   8.7   9.5  11.4          8.80    5.6
11   10.7   8.6   9.1  10.9   8.6          9.58    2.3
12   10.8   8.3  10.6  10.3  10.0         10.00    2.5
13    9.5  10.5   7.0   8.6  10.1          9.14    3.5
14   12.9   8.9   8.1   9.0   7.6          9.30    5.3
15    7.8   9.0  12.2   9.1  11.7          9.96    4.4
16   11.1   9.9   8.8   5.5   9.5          8.96    5.6
17    9.2   9.7  12.3   8.1   8.5          9.56    4.2
18    9.0   8.1  10.2   9.7   8.4          9.08    2.1
19    9.9  10.1   8.9   9.6   7.1          9.12    3.0
20   10.7   9.8  10.2   8.0  10.2          9.78    2.7
21    9.0  10.0   9.6  10.6   9.0          9.64    1.6
22   10.7   9.8   9.4   7.0   8.9          9.16    3.7
23   10.2  10.5   9.5  12.2   9.1         10.30    3.1
24   10.0  11.1   9.5   8.8   9.9          9.86    2.3
25    9.6   8.8  11.4  12.2   9.3         10.26    3.4
26    8.2   7.9   8.4   9.5   9.2          8.64    1.6
27    7.1  11.1  10.8  11.0  10.2         10.04    4.0
28   11.1   6.6  12.0  11.5   9.7         10.18    5.4
                              Sums:      265.38   97.5
FIGURE 19.6
Excel and Minitab R charts for the luggage delivery times
[chart images not reproduced; both plot the daily range against the center line R̄ = 3.482 and the UCL of 7.36]
(b) P(X < 30) + P(X > 60) = P(Z < −1.67) + P(Z > 0.83) = 0.0475 + (1.0 − 0.7967) = 0.2508. (c) P(Z < −0.84) ≈ 0.20, Z = −0.84 = (X − 50)/12, X = 50 − 0.84(12) = 39.92 thousand miles, or 39,920 miles. (d) The smaller standard deviation makes the absolute Z values larger. (a) P(34 < X < 50) = P(−1.60 < Z < 0) = 0.4452. (b) P(X < 30) + P(X > 60) = P(Z < −2.00) + P(Z > 1.00) = 0.0228 + (1.0 − 0.8413) = 0.1815. (c) X = 50 − 0.84(10) = 41.6 thousand miles, or 41,600 miles.

6.10 (a) 0.9878. (b) 0.8185. (c) 86.16%. (d) Option 1: Because your score of 81% on this exam represents a Z score of 1.00, which is below the minimum Z score of 1.28, you will not earn an A grade on the exam under this grading option. Option 2: Because your score of 68% on this exam represents a Z score of 2.00, which is well above the minimum Z score of 1.28, you will earn an A grade on the exam under this grading option. You should prefer Option 2.

6.12 (a) 0.0591. (b) 0.5636. (c) 0.0003. (d) 56.1108 gallons.

6.14 With 39 values, the smallest of the standard normal quantile values covers an area under the normal curve of 0.025. The corresponding Z value is −1.96. The middle (20th) value has a cumulative area of 0.50 and a corresponding Z value of 0.0. The largest of the standard normal quantile values covers an area under the normal curve of 0.975, and its corresponding Z value is +1.96.

6.16 Super Bowl Ad Ratings, First and Second Period: (a) Mean = 5.38, median = 5.34, S = 0.7153, range = 2.29, 6S = 4.2918, interquartile range = 1.33, 1.33(0.7153) = 0.9513. The mean is approximately the same as the median. The range is much less than 6S, and the interquartile range is more than 1.33S. The skewness statistic is 0.0358, indicating a symmetric distribution, and the kurtosis statistic is −1.2567, indicating a platykurtic distribution.
Super Bowl Ad Ratings, Halftime and Afterward: (a) Mean = 5.65, median = 5.59, S = 1.0411, range = 4.06, 6S = 6.2466, interquartile range = 1.45, 1.33(1.0411) = 1.3817. The mean is approximately the same as the median. The range is much less than 6S, and the interquartile range is slightly more than 1.33S. The skewness statistic is 0.1406, indicating an approximately symmetric distribution, and the kurtosis statistic is −0.5910, indicating a platykurtic distribution.

6.18 (a) Mean = $1,979.49, median = $1,763, S = $900.5919, range = $3,540, 6S = 6(900.5919) = $5,403.5514, interquartile range = $1,333, 1.33(900.5919) = $1,197.7872. The mean is greater than the median. The range is much less than 6S, and the interquartile range is greater than 1.33S. (b) The normal probability plot appears to be right-skewed. The skewness statistic is 0.6408. The kurtosis is −0.5058, indicating some departure from a normal distribution.

6.20 (a) Interquartile range = 0.0025, S = 0.0017, range = 0.008, 1.33(S) = 0.0023, 6(S) = 0.0102. Because the interquartile range is close to 1.33S and the range is also close to 6S, the data appear to be approximately normally distributed. (b) The normal probability plot suggests that the data appear to be approximately normally distributed.

6.22 (a) Five-number summary: 82 127 148.5 168 213; mean = 147.06, mode = 130, range = 131, interquartile range = 41, standard deviation = 31.69. The mean is very close to the median. The five-number summary suggests that the distribution is approximately symmetric around the median. The interquartile range is very close to 1.33S. The range is about $50 below 6S. In general, the distribution of the data appears to closely resemble a normal distribution. (b) The normal probability plot confirms that the data appear to be approximately normally distributed.

6.24 (a) 0.1667. (b) 0.1667. (c) 0.7083. (d) Mean = 60, standard deviation = 34.641.

6.26 (a) 0.5000. (b) 0.2500. (c) 0.6000. (d) Mean = 3, standard deviation = 0.4082.

6.34 (a) 0.4772. (b) 0.9544. (c) 0.0456. (d) 1.8835. (e) 1.8710 and 2.1290.

6.36 (a) 0.0228. (b) 0.1524. (c) $275.63. (d) $224.37 to $275.63.

6.38 (a) Waiting time will more closely resemble an exponential distribution. (b) Seating time will more closely resemble a normal distribution. (c) Both the histogram and normal probability plot suggest that waiting time more closely resembles an exponential distribution. (d) Both the histogram and normal probability plot suggest that seating time more closely resembles a normal distribution.

6.40 (a) 0.4602. (b) 0.3812. (c) 0.0808. (d) $5,009.46. (e) $5,156.01 and $6,723.99.
CHAPTER 7
7.2 (a) Virtually 0. (b) 0.1587. (c) 0.0139. (d) 50.195.
7.4 (a) Both means are equal to 6. This property is called unbiasedness. (c) The distribution for n = 3 has less variability. The larger sample size has resulted in sample means being closer to µ. (d) Same answer as in (c).
7.6 (a) The probability that an individual energy bar has a weight below 42.05 grams is 0.2743. (b) The probability that the mean of a sample of four energy bars has a weight below 42.05 grams is 0.1151. (c) The probability that the mean of a sample of 25 energy bars has a weight below 42.05 grams is 0.00135. (d) (a) refers to an individual energy bar, while (c) refers to the mean of a sample of 25 energy bars. There is a 27.43% chance that an individual energy bar will have a weight below 42.05 grams but only a 0.135% chance that the mean of a sample of 25 energy bars will be below 42.05 grams. (e) Increasing the sample size from four to 25 reduced the probability that the sample mean will have a weight below 42.05 grams from 11.51% to 0.135%.
7.8 (a) When n = 4, because the mean is larger than the median, the distribution of the sales price of new houses is skewed to the right, and so is the sampling distribution of X̄, although it will be less skewed than the population. (b) If you select samples of n = 100, the shape of the sampling distribution of the sample mean will be very close to a normal distribution, with a mean of $382,700 and a standard error of the mean of $9,000. (c) 0.0791. (d) 0.0057.
7.10 (a) 0.8413. (b) 16.0364. (c) To be able to use the standardized normal distribution as an approximation for the area under the curve, you must assume that the population is approximately symmetrical. (d) 15.5182.
7.12 (a) 0.40. (b) 0.0704.
7.14 (a) π = 0.501, σp = √(π(1 − π)/n) = √(0.501(1 − 0.501)/100) = 0.05.
P(p > 0.55) = P(Z > 0.98) = 1.0 − 0.8365 = 0.1635.
(b) π = 0.60, σp = √(0.60(1 − 0.60)/100) = 0.04899.
P(p > 0.55) = P(Z > −1.021) = 1.0 − 0.1539 = 0.8461.
(c) π = 0.49, σp = √(0.49(1 − 0.49)/100) = 0.05.
P(p > 0.55) = P(Z > 1.20) = 1.0 − 0.8849 = 0.1151.
Self-Test Solutions and Answers to Selected Even-Numbered Problems
(d) Increasing the sample size by a factor of 4 decreases the standard error by a factor of 2. (a) P(p > 0.55) = P(Z > 1.96) = 1.0 − 0.9750 = 0.0250. (b) P(p > 0.55) = P(Z > −2.04) = 1.0 − 0.0207 = 0.9793. (c) P(p > 0.55) = P(Z > 2.40) = 1.0 − 0.9918 = 0.0082.
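The 7.14 calculations can be reproduced with the standard normal CDF; a sketch using Python's `statistics.NormalDist` (the helper name is mine, not from the text):

```python
from math import sqrt
from statistics import NormalDist

def upper_tail_prob(pi, n, threshold=0.55):
    """P(p > threshold) when the sample proportion is approximately normal
    with mean pi and standard error sqrt(pi * (1 - pi) / n)."""
    se = sqrt(pi * (1 - pi) / n)
    z = (threshold - pi) / se
    return 1 - NormalDist().cdf(z)

# 7.14(a): pi = 0.501, n = 100 gives about 0.1635.
# With n = 400, as in (d), the standard error halves and the tail shrinks to about 0.0250.
```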
7.16 (a) 0.8522. (b) 0.7045. (c) 0.1478. (d) (a) 0.9820. (b) 0.9640. (c) 0.0180.
7.18 (a) 0.3594. (b) The probability is 90% that the sample percentage will be contained between 0.3244 and 0.4496. (c) The probability is 95% that the sample percentage will be contained between 0.3344 and 0.4596.
7.20 (a) 0.2367. (b) The probability is 90% that the sample percentage will be contained between 0.7534 and 0.8466. (c) The probability is 95% that the sample percentage will be contained between 0.7445 and 0.8555.
7.26 (a) 0.4999. (b) 0.00009. (c) 0.0000. (d) 0.0000. (e) 0.7518.
7.28 (a) 0.8944. (b) 4.617; 4.783. (c) 4.641.
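The standard error in 7.8(b) follows from σ_X̄ = σ/√n; with the stated SE of $9,000 at n = 100, the implied population standard deviation is $90,000. A one-line check (the σ value is inferred from the stated SE, not quoted from the text):

```python
from math import sqrt

sigma, n = 90_000, 100              # sigma inferred from the stated SE of $9,000
standard_error = sigma / sqrt(n)    # sigma_xbar = sigma / sqrt(n)
```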
7.30 (a) 0.0000. (b) 0.0126. (c) 0.9874.
CHAPTER 8
8.2 114.68 ≤ µ ≤ 135.32.
8.4 Yes, it is true because 5% of intervals will not include the population mean.
8.6 (a) You would compute the mean first because you need the mean to compute the standard deviation. If you had a sample, you would compute the sample mean. If you had the population mean, you would compute the population standard deviation. (b) If you have a sample, you are computing the sample standard deviation, not the population standard deviation needed in Equation (8.1). If you have a population and have computed the population mean and population standard deviation, you don't need a confidence interval estimate of the population mean because you already know the mean.
8.8 Equation (8.1) assumes that you know the population standard deviation. Because you are selecting a sample of 100 from the population, you are computing a sample standard deviation, not the population standard deviation.
8.10 (a) X̄ ± Z·σ/√n = 49,875 ± 1.96(1,500/√64); 49,507.51 ≤ µ ≤ 50,242.49. (b) Yes, because the confidence interval includes 50,000 hours, the manufacturer can support a claim that the bulbs have a mean life of 50,000 hours. (c) No. Because σ is known and n = 64, from the Central Limit Theorem, you know that the sampling distribution of X̄ is approximately normal. (d) The confidence interval is narrower, based on a population standard deviation of 500 hours rather than the original standard deviation of 1,500 hours: 49,752.50 ≤ µ ≤ 49,997.50. No, because the confidence interval does not include 50,000 hours.
8.12 (a) 2.2622. (b) 3.2498. (c) 2.0395. (d) 1.9977. (e) 1.7531.
8.14 −0.12 ≤ µ ≤ 11.84, 2.00 ≤ µ ≤ 6.00. The presence of the outlier increases the sample mean and greatly inflates the sample standard deviation.
8.16 (a) 87 ± (1.9781)(9)/√133; 85.46 ≤ µ ≤ 88.54. (b) You can be 95% confident that the population mean one-time gift is between $85.46 and $88.54.
8.18 (a) 6.31 ≤ µ ≤ 7.87. (b) You can be 95% confident that the population mean amount spent for lunch at a fast-food restaurant is between $6.31 and $7.87. (c) That the population distribution is normally distributed. (d) The assumption of normality is not seriously violated and, with a sample of 15, the validity of the confidence interval is not seriously impacted.
8.20 (a) For first and second quarter ads: 5.12 ≤ µ ≤ 5.65. For halftime and second half ads: 5.24 ≤ µ ≤ 6.06. (b) You are 95% confident that the mean rating for first and second quarter ads is between 5.12 and 5.65. You are 95% confident that the mean rating for halftime and second half ads is between 5.24 and 6.06. (c) The confidence intervals for the two groups of ads are similar. (d) You need to assume that the distributions of the ratings for the two groups of ads are normally distributed. (e) The distribution of each group of ads appears approximately normally distributed.
8.22 (a) 31.12 ≤ µ ≤ 54.96. (b) The number of days is approximately normally distributed. (c) No, the outliers skew the data. (d) Because the sample size is fairly large, at n = 50, the use of the t distribution is appropriate.
8.24 (a) 41.85 ≤ µ ≤ 48.31. (b) That the population distribution is normally distributed. (c) The normal probability plot appears to be approximately normally distributed.
8.26 0.19 ≤ π ≤ 0.31.
8.28 (a) p = X/n = 135/500 = 0.27; p ± Z√(p(1 − p)/n) = 0.27 ± 2.58√(0.27(0.73)/500); 0.2189 ≤ π ≤ 0.3211. (b) The manager in charge of promotional programs can infer that the proportion of households that would upgrade to an improved cellphone if it were made available at a substantially reduced cost is somewhere between 0.22 and 0.32, with 99% confidence.
8.30 (a) 0.2328 ≤ π ≤ 0.2872. (b) No, you cannot, because the interval estimate includes 0.25 (25%). (c) 0.2514 ≤ π ≤ 0.2686. Yes, you can, because the interval is above 0.25 (25%). (d) The larger the sample size, the narrower the confidence interval, holding everything else constant.
8.32 (a) 0.8632 ≤ π ≤ 0.8822. (b) 0.1770 ≤ π ≤ 0.2007. (c) Because almost 90% of adults have purchased something online, but only about 20% are weekly online shoppers, the director of e-commerce sales may want to focus on those adults who are weekly online shoppers.
8.34 n = 35.
8.36 n = 1,041.
8.38 (a) n = Z²σ²/e² = (1.96)²(400)²/50² = 245.86. Use n = 246. (b) n = Z²σ²/e² = (1.96)²(400)²/25² = 983.41. Use n = 984.
8.40 n = 55.
8.42 (a) n = 107. (b) n = 62.
8.44 (a) n = 246. (b) n = 385. (c) n = 554. (d) When there is more variability in the population, a larger sample is needed to accurately estimate the mean.
8.46 (a) 0.6209 ≤ π ≤ 0.7878. (b) 0.5015 ≤ π ≤ 0.6812. (c) 0.0759 ≤ π ≤ 0.2024. (d) (a) n = 2,017, (b) n = 2,324, (c) n = 1,157.
8.48 (a) If you conducted a follow-up study, you would use π = 0.38 in the sample size formula because it is based on past information on the proportion. (b) n = 1,006.
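The interval in 8.10(a), X̄ ± Z·σ/√n, can be checked numerically, using the exact 97.5th normal percentile rather than the rounded 1.96:

```python
from math import sqrt
from statistics import NormalDist

xbar, sigma, n = 49_875, 1_500, 64
z = NormalDist().inv_cdf(0.975)           # ~1.96
half_width = z * sigma / sqrt(n)
lower, upper = xbar - half_width, xbar + half_width
# Roughly 49,507.51 <= mu <= 50,242.49, matching 8.10(a).
```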
8.54 (a) PC/laptop: 0.8173 ≤ π ≤ 0.8628. Smartphone: 0.8923 ≤ π ≤ 0.9277. Tablet: 0.4690 ≤ π ≤ 0.5310. Smart watch: 0.0814 ≤ π ≤ 0.1186. (b) Most adults have a PC/laptop and a smartphone. Some adults have a tablet computer, and very few have a smart watch.
8.56 (a) 49.88 ≤ µ ≤ 52.12. (b) 0.6760 ≤ π ≤ 0.9240. (c) n = 25. (d) n = 267. (e) If a single sample were to be selected for both purposes, the larger of the two sample sizes (n = 267) should be used.
8.58 (a) 3.19 ≤ µ ≤ 9.21. (b) 0.3242 ≤ π ≤ 0.7158. (c) n = 110. (d) n = 121. (e) If a single sample were to be selected for both purposes, the larger of the two sample sizes (n = 121) should be used.
8.60 (a) 0.2562 ≤ π ≤ 0.3638. (b) $3.22 ≤ µ ≤ $3.78. (c) $17,581.68 ≤ µ ≤ $18,418.32.
8.62 (a) $36.66 ≤ µ ≤ $40.42. (b) 0.2027 ≤ π ≤ 0.3973. (c) n = 110. (d) n = 423. (e) If a single sample were to be selected for both purposes, the larger of the two sample sizes (n = 423) should be used.
8.64 (a) 0.4643 ≤ π ≤ 0.6690. (b) $136.28 ≤ µ ≤ $502.21.
8.66 (a) 13.40 ≤ µ ≤ 16.56. (b) With 95% confidence, the population mean answer time is somewhere between 13.40 and 16.56 seconds. (c) The assumption is valid as the answer time is approximately normally distributed.
8.68 (a) 0.2425 ≤ µ ≤ 0.2856. (b) 0.1975 ≤ µ ≤ 0.2385. (c) The amounts of granule loss for both brands are skewed to the right, but the sample sizes are large enough. (d) Because the two confidence intervals do not overlap, it appears that the mean granule loss of Boston shingles is higher than that of Vermont shingles.
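The 99% proportion interval in 8.28, p ± Z√(p(1 − p)/n), can be sketched the same way, with the exact 99.5th percentile (≈2.5758) in place of the rounded 2.58:

```python
from math import sqrt
from statistics import NormalDist

p, n = 135 / 500, 500                 # p = 0.27
z = NormalDist().inv_cdf(0.995)       # ~2.5758
margin = z * sqrt(p * (1 - p) / n)
lower, upper = p - margin, p + margin
# Roughly 0.2189 <= pi <= 0.3211, matching 8.28(a).
```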
CHAPTER 9
9.2 Because Z_STAT = +2.21 > 1.96, reject H0.
9.4 Reject H0 if Z_STAT < −2.58 or if Z_STAT > 2.58.
9.6 p-value = 0.0456.
9.8 p-value = 0.1676.
9.10 H0: Defendant is guilty; H1: Defendant is innocent. A Type I error would be not convicting a guilty person. A Type II error would be convicting an innocent person.
9.12 H0: µ = 20 minutes. 20 minutes is adequate travel time between classes. H1: µ ≠ 20 minutes. 20 minutes is not adequate travel time between classes.
9.14 (a) Z_STAT = (49,875 − 50,000)/(1,500/√64) = −0.6667. Because −1.96 < Z_STAT = −0.6667 < 1.96, do not reject H0. (b) p-value = 0.5050. (c) 49,507.51 ≤ µ ≤ 50,242.49. (d) The conclusions are the same.
9.16 (a) Because −2.58 < Z_STAT = −1.7678 < 2.58, do not reject H0. (b) p-value = 0.0771. (c) 0.9877 ≤ µ ≤ 1.0023. (d) The conclusions are the same.
9.18 t_STAT = 2.00.
9.20 ±2.1315.
9.22 No, you should not use a t test because the original population is left-skewed, and the sample size is not large enough for the t test to be valid.
9.24 (a) t_STAT = (3.57 − 3.70)/(0.8/√64) = −1.30. Because −1.9983 < t_STAT = −1.30 < 1.9983 and p-value = 0.1984 > 0.05, do not reject H0. There is insufficient evidence that the population mean waiting time is different from 3.7 minutes. (b) Because n = 64, the sampling distribution of the t test statistic is approximately normal. In general, the t test is appropriate for this sample size except for the case where the population is extremely skewed or bimodal.
9.26 (a) Because −1.9842 < t_STAT = 1.25 < 1.9842, do not reject H0. There is insufficient evidence that the population mean amount spent by Amazon Prime customers is different from $1,475. (b) p-value = 0.2142 > 0.05. The probability of getting a t_STAT statistic greater than +1.25 or less than −1.25, given that the null hypothesis is true, is 0.2142.
9.28 (a) Because −2.1448 < t_STAT = 1.6344 < 2.1448, do not reject H0. There is not enough evidence to conclude that the mean amount spent for lunch at a fast-food restaurant is different from $6.50. (b) The p-value is 0.1245. If the population mean is $6.50, the probability of observing a sample of fifteen customers that will result in a sample mean farther away from the hypothesized value than this sample is 0.1245. (c) The distribution of the amount spent is normally distributed. (d) With a sample size of 15, it is difficult to evaluate the assumption of normality. However, the distribution may be fairly symmetric because the mean and the median are close in value. Also, the boxplot appears only slightly skewed, so the normality assumption does not appear to be seriously violated.
9.30 (a) Because −2.0096 < t_STAT = 0.114 < 2.0096, do not reject H0. There is no evidence that the mean amount is different from 2 liters. (b) p-value = 0.9095. (c) Yes, the data appear to have met the normality assumption. (e) The amount of fill is decreasing over time, so the values are not independent. Therefore, the t test is invalid.
9.32 (a) Because t_STAT = −5.9355 < −2.0106, reject H0. There is enough evidence to conclude that the mean width of the troughs is different from 8.46 inches. (b) The population distribution is normal. (c) Although the distribution of the widths is left-skewed, the large sample size means that the validity of the t test is not seriously affected. The large sample size allows you to use the t distribution.
9.34 (a) Because −2.68 < t_STAT = 0.094 < 2.68, do not reject H0. There is no evidence that the mean amount is different from 5.5 grams. (b) 5.462 ≤ µ ≤ 5.542. (c) The conclusions are the same.
9.36 p-value = 0.0228.
9.38 p-value = 0.0838.
9.40 p-value = 0.9162.
9.42 2.7638.
9.44 −2.5280.
9.46 (a) t_STAT = 2.6880 > 1.6694, reject H0. There is evidence that the population mean bus miles is greater than 8,000 miles. (b) p-value = 0.0046 < 0.05. The probability of getting a t_STAT statistic greater than 2.6880, given that the null hypothesis is true, is 0.0046.
9.48 (a) t_STAT = (24.05 − 30)/(16.5/√860) = −10.5750. Because t_STAT = −10.5750 < −2.3307, reject H0. p-value = 0.0000 < 0.01, reject H0. (b) The probability of getting a sample mean of 24.05 minutes or less if the population mean is 30 minutes is 0.0000.
9.50 (a) t_STAT = 1.9221 < 2.3549, do not reject H0. There is insufficient evidence that the population mean one-time gift donation is greater than $85.50. (b) The probability of getting a sample mean of $87 or more if the population mean is $85.50 is 0.0284.
9.52 p = 0.22.
9.54 Do not reject H0.
9.56 (a) Z_STAT = −2.0681, p-value = 0.0193. Because Z_STAT = −2.0681 < −1.645 or p-value = 0.0193 < 0.05, reject H0. There is evidence to show that less than 69.52% of students at your university use the Chrome web browser. (b) Z_STAT = −5.0658, p-value = 0.0000. Because Z_STAT = −5.0658 < −1.645 or p-value = 0.0000 < 0.05, reject H0. There is evidence to show that less than 69.52% of students at your university use the Chrome web browser. (c) The sample size had an effect on being able to reject the null hypothesis. (d) You would be very unlikely to reject the null hypothesis with a sample of 20.
9.58 H0: π = 0.60; H1: π ≠ 0.60. Decision rule: If Z_STAT > 1.96 or Z_STAT < −1.96, reject H0.
p = 464/703 = 0.6600.
Test statistic: Z_STAT = (p − π)/√(π(1 − π)/n) = (0.6600 − 0.60)/√(0.60(1 − 0.60)/703) = 3.2488.
Because Z_STAT = 3.2488 > 1.96 or p-value = 0.0012 < 0.05, reject H0 and conclude that there is evidence that the proportion of all talent acquisition professionals who report competition is the biggest obstacle to attracting the best talent at their company is different from 60%.
9.60 (a) H0: π ≥ 0.294. H1: π < 0.294. (b) Z_STAT = −0.5268 > −1.645; p-value = 0.2992. Because Z_STAT = −0.5268 > −1.645 or p-value = 0.2992 > 0.05, do not reject H0. There is insufficient evidence that the percentage is less than 29.4%.
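The 9.58 one-proportion test statistic can be recomputed directly (tolerances are used in the checks because the answer key rounds intermediate values):

```python
from math import sqrt
from statistics import NormalDist

x, n, pi0 = 464, 703, 0.60
p = x / n                                       # ~0.6600
z_stat = (p - pi0) / sqrt(pi0 * (1 - pi0) / n)  # ~3.2488
p_value = 2 * (1 - NormalDist().cdf(z_stat))    # two-tail, ~0.0012
# z_stat > 1.96, so reject H0 at the 0.05 level.
```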
9.70 (a) Concluding that a firm will go bankrupt when it will not. (b) Concluding that a firm will not go bankrupt when it will go bankrupt. (c) Type I. (d) If the revised model results in more moderate or large Z scores, the probability of committing a Type I error will increase. Many more of the firms will be predicted to go bankrupt than will go bankrupt. On the other hand, the revised model that results in more moderate or large Z scores will lower the probability of committing a Type II error because fewer firms will be predicted to go bankrupt than will actually go bankrupt.
9.72 (a) Because t_STAT = 3.3197 > 2.0010, reject H0. (b) p-value = 0.0015. (c) Because Z_STAT = 0.2582 < 1.645, do not reject H0. (d) Because −2.0010 < t_STAT = −1.1066 < 2.0010, do not reject H0. (e) Because Z_STAT = 2.3238 > 1.645, reject H0.
9.74 (a) Because t_STAT = −1.69 > −1.7613, do not reject H0. (b) The data are from a population that is normally distributed. (d) With the exception of one extreme value, the data are approximately normally distributed. (e) There is insufficient evidence to state that the waiting time is less than five minutes.
9.76 (a) Because t_STAT = −1.47 > −1.6896, do not reject H0. (b) p-value = 0.0748. If the null hypothesis is true, the probability of obtaining a t_STAT of −1.47 or more extreme is 0.0748. (c) Because t_STAT = −3.10 < −1.6973, reject H0. (d) p-value = 0.0021. If the null hypothesis is true, the probability of obtaining a t_STAT of −3.10 or more extreme is 0.0021. (e) The data in the population are assumed to be normally distributed. (g) Both boxplots suggest that the data are skewed slightly to the right, more so for the Boston shingles. However, the very large sample sizes mean that the results of the t test are relatively insensitive to the departure from normality.
9.78 (a) t_STAT = −3.2912, reject H0. (b) p-value = 0.0012. The probability of getting a t_STAT value below −3.2912 or above +3.2912 is 0.0012. (c) t_STAT = −7.9075, reject H0. (d) p-value = 0.0000. The probability of getting a t_STAT value below −7.9075 or above +7.9075 is 0.0000. (e) Because of the large sample sizes, you do not need to be concerned with the normality assumption.
CHAPTER 10
10.2 (a) t = 3.8959. (b) df = 21. (c) 2.5177. (d) Because t_STAT = 3.8959 > 2.5177, reject H0.
10.4 3.73 ≤ µ1 − µ2 ≤ 12.27.
10.6 Because t_STAT = 2.6762 < 2.9979 or p-value = 0.0158 > 0.01, do not reject H0. There is no evidence that the mean of population 1 is greater than the mean of population 2.
10.8 (a) Because t_STAT = 2.8990 > 1.6620 or p-value = 0.0024 < 0.05, reject H0. There is evidence that the mean amount of Walker Crisps eaten by children who watched a commercial featuring a long-standing sports celebrity endorser is higher than for those who watched a commercial for an alternative food snack. (b) 3.4616 ≤ µ1 − µ2 ≤ 18.5384. (c) The results cannot be compared because (a) is a one-tail test and (b) is a confidence interval that is comparable only to the results of a two-tail test. (d) You would choose the commercial featuring a long-standing celebrity endorser.
10.10 (a) H0: µ1 = µ2, where Populations: 1 = Southeast, 2 = Gulf Coast. H1: µ1 ≠ µ2. Decision rule: df = 42. If t_STAT < −2.0181 or t_STAT > 2.0181, reject H0.
Test statistic:
S²p = ((n1 − 1)(S1²) + (n2 − 1)(S2²))/((n1 − 1) + (n2 − 1)) = ((21)(51.8684²) + (21)(62.2588²))/(21 + 21) = 3,283.2435.
t_STAT = ((X̄1 − X̄2) − (µ1 − µ2))/√(S²p(1/n1 + 1/n2)) = ((45.0455 − 40.8182) − 0)/√(3,283.2435(1/22 + 1/22)) = 0.2447.
Decision: Because −2.0181 < t_STAT = 0.2447 < 2.0181, do not reject H0. There is not enough evidence to conclude that the mean number of partners between the Southeast and Gulf Coast is different. (b) p-value = 0.8079. (c) In order to use the pooled-variance t test, you need to assume that the populations are normally distributed with equal variances.
10.12 (a) Because t_STAT = −4.1343 < −2.0484, reject H0. (b) p-value = 0.0003 < 0.05, reject H0. (c) The populations of waiting times are approximately normally distributed. (d) −4.2292 ≤ µ1 − µ2 ≤ −1.4268.
10.14 (a) Because t_STAT = 2.7349 > 2.0484, reject H0. There is evidence of a difference in the mean time to start a business between developed and emerging countries. (b) p-value = 0.0107. The probability that two samples have a mean difference of 14.62 or more is 0.0107 if there is no difference in the mean time to start a business between developed and emerging countries. (c) You need to assume that the population distribution of the time to start a business of both developed and emerging countries is normally distributed. (d) 3.6700 ≤ µ1 − µ2 ≤ 25.5700.
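The pooled-variance t statistic in 10.10 combines the two sample variances weighted by their degrees of freedom; a sketch with the reported summary statistics:

```python
from math import sqrt

n1, n2 = 22, 22
xbar1, xbar2 = 45.0455, 40.8182
s1, s2 = 51.8684, 62.2588

# Pooled variance, then the two-sample t statistic under H0: mu1 - mu2 = 0
sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / ((n1 - 1) + (n2 - 1))
t_stat = (xbar1 - xbar2) / sqrt(sp2 * (1 / n1 + 1 / n2))
# sp2 ~ 3,283.24 and t_stat ~ 0.2447, inside (-2.0181, 2.0181): do not reject H0.
```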
10.16 (a) Because t_STAT = −2.1554 < −2.0017 or p-value = 0.03535 < 0.05, reject H0. There is evidence of a difference in the mean time per day accessing the Internet via a mobile device between males and females. (b) You must assume that each of the two independent populations is normally distributed.
10.18 df = 19.
10.20 (a) t_STAT = (−1.5566)/(1.424/√9) = −3.2772. Because t_STAT = −3.2772 < −2.306 or p-value = 0.0112 < 0.05, reject H0. There is enough evidence of a difference in the mean summated ratings between the two brands. (b) You must assume that the distribution of the differences between the two ratings is approximately normal. (c) p-value = 0.0112. The probability of obtaining a mean difference in ratings that results in a test statistic that deviates from 0 by 3.2772 or more in either direction is 0.0112 if there is no difference in the mean summated ratings between the two brands. (d) −2.6501 ≤ µD ≤ −0.4610. You are 95% confident that the mean difference in summated ratings between brand A and brand B is somewhere between −2.6501 and −0.4610.
10.22 (a) Because t_STAT = −4.1372 < −2.0244, reject H0. There is evidence to conclude that the mean download speed at AT&T is lower than at Verizon Wireless. (b) You must assume that the distribution of the differences between the ratings is approximately normal. (d) The confidence interval is from −2.9359 to −1.0067.
10.24 (a) Because t_STAT = 1.8425 < 1.943, do not reject H0. There is not enough evidence to conclude that the mean bone marrow microvessel density is higher before the stem cell transplant than after the stem cell transplant. (b) p-value = 0.0575. The probability that the t statistic for the mean difference in microvessel density is 1.8425 or more is 5.75% if the mean density is not higher before the stem cell transplant than after the stem cell transplant. (c) −28.26 ≤ µD ≤ 200.55. You are 95% confident that the mean difference in bone marrow microvessel density before and after the stem cell transplant is somewhere between −28.26 and 200.55. (d) That the distribution of the difference before and after the stem cell transplant is normally distributed.
10.26 (a) Because t_STAT = −9.3721 < −2.4258, reject H0. There is evidence that the mean strength is lower at two days than at seven days. (b) The population of differences in strength is approximately normally distributed. (c) p = 0.000 < 0.05, reject H0.
10.28 (a) Because −2.58 ≤ Z_STAT = −0.58 ≤ 2.58, do not reject H0. (b) −0.273 ≤ π1 − π2 ≤ 0.173.
10.30 (a) H0: π1 ≤ π2. H1: π1 > π2. Populations: 1 = VOD D4+, 2 = general TV. (b) Because Z_STAT = 8.9045 > 1.6449 or p-value = 0.0000 < 0.05, reject H0. There is evidence to conclude that the population proportion of those who viewed the brand on VOD D4+ were more likely to visit the brand website. (c) Yes, the result in (b) makes it appropriate to claim that the population proportion of those who viewed the brand on VOD D4+ were more likely to visit the brand website than those who viewed the brand on general TV.
10.32 (a) H0: π1 = π2. H1: π1 ≠ π2. Decision rule: If |Z_STAT| > 2.58, reject H0.
Test statistic: p̄ = (X1 + X2)/(n1 + n2) = (326 + 167)/(423 + 192) = 0.8016.
Z_STAT = ((p1 − p2) − (π1 − π2))/√(p̄(1 − p̄)(1/n1 + 1/n2)) = ((0.7707 − 0.8698) − 0)/√(0.8016(1 − 0.8016)(1/423 + 1/192)) = −2.8560.
Because Z_STAT = −2.8560 < −2.58, reject H0. There is evidence of a difference in the proportion of organizations with recognition programs between organizations that have between 500 and 2,499 employees and organizations that have 10,000+ employees. (b) p-value = 0.0043. The probability of obtaining a difference in proportions that gives rise to a test statistic below −2.8560 or above +2.8560 is 0.0043 if there is no difference in the proportion based on the size of the organization. (c) −0.1809 ≤ π1 − π2 ≤ −0.0173. You are 99% confident that the difference in the proportion based on the size of the organization is between 1.73% and 18.09%.
10.34 (a) Because Z_STAT = 4.4662 > 1.96, reject H0. There is evidence of a difference in the proportion of cobrowsing organizations and non-cobrowsing organizations that use skills-based routing to match the caller with the right agent. (b) p-value = 0.0000. The probability of obtaining a difference in proportions that is 0.2586 or more in either direction is 0.0000 if there is no difference between the proportion of cobrowsing organizations and non-cobrowsing organizations that use skills-based routing to match the caller with the right agent.
10.36 (a) 2.20. (b) 2.57. (c) 3.50.
10.38 (a) Population B: S² = 25. (b) 1.5625.
10.40 df_numerator = 24, df_denominator = 24.
10.42 Because F_STAT = 1.2109 < 2.27, do not reject H0.
10.44 (a) Because F_STAT = 1.2995 < 3.18, do not reject H0. (b) Because F_STAT = 1.2995 < 2.62, do not reject H0.
10.46 (a) H0: σ1² = σ2². H1: σ1² ≠ σ2². Decision rule: If F_STAT > 2.4086, reject H0.
Test statistic: F_STAT = S1²/S2² = 3,876.1558/2,690.3312 = 1.4408.
Decision: Because F_STAT = 1.4408 < 2.4086, do not reject H0. There is insufficient evidence to conclude that the two population variances are different. (b) p-value = 0.4096 > 0.05, do not reject H0. (c) The test assumes that each of the two populations is normally distributed. (d) Based on (a) and (b), a pooled-variance t test should be used.
10.48 (a) Because F_STAT = 2.1187 > 2.1124 or p-value = 0.0491 < 0.05, reject H0. There is sufficient evidence of a difference in the variability of the ratings between the two groups. (b) p-value = 0.0491. The probability of obtaining a sample that yields a test statistic more extreme than 2.1187 is 0.0491 if there is no difference in the two population variances. (c) The test assumes that each of the two populations is normally distributed. (d) Based on (a) and (b), a separate-variance t test should be used.
10.50 (a) Because F_STAT = 69.5000 > 1.9811 or p-value = 0.0000 < 0.05, reject H0. There is evidence of a difference in the variance of the delay times between the two drivers. (b) You assume that the delay times are normally distributed. (c) From the boxplot and the normal probability plots, the delay times appear to be approximately normally distributed. (d) Because there is a difference in the variance of the delay times between the two drivers, you should use the separate-variance t test to determine whether there is evidence of a difference in the mean delay time between the two drivers.
10.58 (a) Because F_STAT = 2.9736 > 1.9288, or p-value = 0.0016 < 0.05, reject H0. There is a difference in the variance of the salary of Black Belts and Green Belts. (b) The separate-variance t test. (c) Because t_STAT = 5.9488 > 1.6639 or p-value = 0.0000 < 0.05, reject H0. There is evidence that the mean salary of Black Belts is greater than the mean salary of Green Belts.
10.60 (a) Because F_STAT = 1.3611 < 1.6854, do not reject H0. There is insufficient evidence to conclude that there is a difference between the variances in the online time per week between women and men. (b) It is more appropriate to use a pooled-variance t test. Using the pooled-variance t test, because t_STAT = −9.7619 < −2.6009, reject H0. There is evidence of a difference in the mean online time per week
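The two-proportion Z test in 10.32 pools the two sample proportions for the standard error; a numeric check of the reported statistic:

```python
from math import sqrt

x1, n1 = 326, 423
x2, n2 = 167, 192
p1, p2 = x1 / n1, x2 / n2
p_bar = (x1 + x2) / (n1 + n2)                       # pooled proportion, ~0.8016
se = sqrt(p_bar * (1 - p_bar) * (1 / n1 + 1 / n2))  # pooled standard error
z_stat = (p1 - p2) / se
# z_stat ~ -2.8560 < -2.58: reject H0 at the 0.01 level.
```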
between women and men. (c) Because F_STAT = 1.7778 > 1.6854, reject H0. There is evidence to conclude that there is a difference between the variances in the time spent playing games between women and men. (d) Using the separate-variance t test, because t_STAT = −26.4 < −2.603, reject H0. There is evidence of a difference in the mean time spent playing games between women and men.
10.62 (a) Because t_STAT = 3.3282 > 1.8595, or the p-value = 0.0052 < 0.05, reject H0. There is enough evidence to conclude that the introductory computer students required more than a mean of 10 minutes to write and run a program in Python. (b) Because t_STAT = 1.3636 < 1.8595, do not reject H0. There is not enough evidence to conclude that the introductory computer students required more than a mean of 10 minutes to write and run a Python program. (c) Although the mean time necessary to complete the assignment increased from 12 to 16 minutes as a result of the increase in one data value, the standard deviation went from 1.8 to 13.2, which reduced the value of the t statistic. (d) Because F_STAT = 1.2308 < 4.2951, do not reject H0. There is not enough evidence to conclude that the population variances are different for the Introduction to Computers students and computer majors. Hence, the pooled-variance t test is a valid test to determine whether computer majors can write a Python program in less time than introductory students, assuming that the distributions of the time needed to write a Python program for both the Introduction to Computers students and the computer majors are approximately normally distributed. Because t_STAT = 4.0666 > 1.7341, reject H0. There is enough evidence that the mean time is higher for Introduction to Computers students than for computer majors. (e) p-value = 0.0052.
If the true population mean amount of time needed for Introduction to Computers students to write a Python program is no more than 10 minutes, the probability of observing a sample mean greater than the 12 minutes in the current sample is 0.0362%. Hence, at a 5% level of significance, you can conclude that the population mean amount of time needed for Introduction to Computers students to write a Python program is more than 10 minutes. As illustrated in (d), in which there is not enough evidence to conclude that the population variances are different for the Introduction to Computers students and computer majors, the pooled-variance t test performed is a valid test to determine whether computer majors can write a Python program in less time than introductory students, assuming that the distribution of the time needed to write a Python program for both the Introduction to Computers students and the computer majors is approximately normally distributed.
10.64 From the boxplot and the summary statistics, both distributions are approximately normally distributed. F_STAT = 1.056 < 1.89. There is insufficient evidence to conclude that the two population variances are significantly different at the 5% level of significance. t_STAT = −5.084 < −1.99. At the 5% level of significance, there is sufficient evidence to reject the null hypothesis of no difference in the mean life of the bulbs between the two manufacturers. You can conclude that there is a significant difference in the mean life of the bulbs between the two manufacturers.
10.66 (a) Because Z_STAT = 3.6911 > 1.96, reject H0. There is enough evidence to conclude that there is a difference in the proportion of men and women who order dessert. (b) Because Z_STAT = 6.0873 > 1.96, reject H0. There is enough evidence to conclude that there is a difference in the proportion of people who order dessert based on whether they ordered a beef entrée.
10.68 The normal probability plots suggest that the two populations are not normally distributed. An F test is inappropriate for testing the difference in the two variances. The sample variances for Boston and Vermont shingles are 0.0203 and 0.015, respectively. Because t_STAT = 3.015 > 1.967 or p-value = 0.0028 < α = 0.05, reject H0. There is sufficient evidence to conclude that there is a difference in the mean granule loss of Boston and Vermont shingles.
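The F test for equal variances in 10.42–10.50 is just the ratio of the sample variances; using the values reported in 10.46:

```python
s1_sq, s2_sq = 3876.1558, 2690.3312
f_stat = s1_sq / s2_sq        # ~1.4408
f_crit = 2.4086               # upper critical value given in the answer key
reject = f_stat > f_crit      # False: do not reject H0
```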
CHAPTER 11
11.2 (a) SSW = 150. (b) MSA = 15. (c) MSW = 5. (d) F_STAT = 3.
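The arithmetic behind 11.2 can be reproduced directly. The group count c = 3 and total sample size n = 33 below are assumed values, chosen only because they are consistent with the printed answers (MSA = 15 and MSW = 5); the answer itself does not state them.

```python
# One-way ANOVA F statistic from sums of squares, consistent with 11.2.
# c = 3 groups and n = 33 observations are assumptions that reproduce
# the printed MSA = 15 and MSW = 5; they are not stated in the answer.
def anova_f(ssa, ssw, c, n):
    msa = ssa / (c - 1)   # mean square among groups
    msw = ssw / (n - c)   # mean square within groups
    return msa, msw, msa / msw

msa, msw, f_stat = anova_f(ssa=30, ssw=150, c=3, n=33)
print(msa, msw, f_stat)  # 15.0 5.0 3.0
```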
11.4 (a) 2. (b) 18. (c) 20.
11.6 (a) Reject H0 if F_STAT > 2.95; otherwise, do not reject H0. (b) Because F_STAT = 4 > 2.95, reject H0. (c) The table does not have 28 degrees of freedom in the denominator, so use the next larger critical value, Qα = 3.90. (d) Critical range = 6.166.
11.8 (a) H0: μA = μB = μC = μD and H1: At least one mean is different.
MSA = SSA/(c - 1) = 1,151,016.4750/3 = 383,672.1583.
MSW = SSW/(n - c) = 2,961,835.3000/36 = 82,273.2028.
F_STAT = MSA/MSW = 383,672.1583/82,273.2028 = 4.6634.
Because the p-value is 0.0075 < 0.05 and F_STAT = 4.6634 > 2.8663, reject H0. There is sufficient evidence of a difference in the mean import cost across the four global regions. (b) Critical range = Qα √((MSW/2)(1/nj + 1/nj′)) = 3.81 √((82,273.2028/2)(1/10 + 1/10)) = 3.81(90.7046) = 345.5845. From the Tukey-Kramer procedure, there is a difference in the mean import cost between the East Asia and Pacific region and Latin America and the Caribbean, and between Eastern Europe and Central Asia and Latin America and the Caribbean. None of the other regions are different. (c) ANOVA output for Levene's test for homogeneity of variance:
MSA = SSA/(c - 1) = 190,890.4750/3 = 63,630.1583.
MSW = SSW/(n - c) = 1,469,223.4/36 = 40,811.7611.
F_STAT = MSA/MSW = 63,630.1583/40,811.7611 = 1.5591.
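Levene's test reported above is simply a one-way ANOVA run on the absolute deviations of each observation from its group median. A minimal sketch follows; the three groups of data are made up for illustration and are not the import-cost data.

```python
from statistics import median

# Levene's test (median-centered): replace each observation with its
# absolute deviation from the group median, then compute the one-way
# ANOVA F statistic on those deviations.
# The groups below are illustrative, not the textbook's data.
def levene_f(groups):
    devs = [[abs(x - median(g)) for x in g] for g in groups]
    n = sum(len(d) for d in devs)
    c = len(devs)
    grand = sum(x for d in devs for x in d) / n
    ssa = sum(len(d) * (sum(d) / len(d) - grand) ** 2 for d in devs)
    ssw = sum((x - sum(d) / len(d)) ** 2 for d in devs for x in d)
    return (ssa / (c - 1)) / (ssw / (n - c))

f = levene_f([[1, 2, 3, 4, 5], [10, 12, 14, 16, 18], [5, 5, 6, 7, 7]])
print(round(f, 4))  # about 2.8108 for these made-up groups
```

A small F (relative to the critical value) supports the equal-variance assumption, which is how the answer above reaches "do not reject H0."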
Because p-value = 0.2161 > 0.05 and F_STAT = 1.5591 < 2.8663, do not reject H0. There is insufficient evidence to conclude that the variances in the import cost are different. (d) From the results in (a) and (b), the mean import cost for the East Asia and Pacific region and Eastern Europe and Central Asia is lower than for Latin America and the Caribbean.
11.10 (a) Because F_STAT = 12.56 > 2.76, reject H0. (b) Critical range = 4.67. Advertisements A and B are different from Advertisements C and D. Advertisement E is only different from Advertisement D. (c) Because F_STAT = 1.927 < 2.76, do not reject H0. There is no evidence of a significant difference in the variation in the ratings among the five advertisements. (d) The advertisements underselling the pen's characteristics had the highest mean ratings, and the advertisements overselling the pen's characteristics had the lowest mean ratings. Therefore, use an advertisement that undersells the pen's characteristics and avoid advertisements that oversell the pen's characteristics.
11.12 (a)

Source          Degrees of Freedom   Sum of Squares       Mean Squares    F
Among groups     5                    15,671,226,037.72    3,134,245,208  0.6808
Within groups   50                   230,181,430,044.96    4,603,628,601
Total           55                   245,852,656,082.68
Self-Test Solutions and Answers to Selected Even-Numbered Problems
(b) Because F_STAT = 0.6808 < 2.41, do not reject H0. There is insufficient evidence of a difference in the mean brand value of the different groups. (c) Because there was no significant difference among the groups, none of the critical ranges were significant.
11.14 (a) Because F_STAT = 5.3495 > 2.8663; p-value = 0.0038 < 0.05, reject H0. (b) Critical range = 9.3401 (using 36 degrees of freedom and interpolating). Asia is different from North America. (c) The assumptions are that the samples are randomly and independently selected (or randomly assigned), the original populations of congestion are approximately normally distributed, and the variances are equal. (d) Because F_STAT = 0.4321 < 2.8663; p-value = 0.7333 > 0.05, do not reject H0. There is insufficient evidence of a difference in the variation in the mean congestion level among the continents.
11.16 (a) 40. (b) 60 and 55. (c) 10. (d) 10.
11.18 (a) Because F_STAT = 6.00 > 3.35, reject H0. (b) Because F_STAT = 5.50 > 3.35, reject H0. (c) Because F_STAT = 1.00 < 2.73, do not reject H0.
11.20 df_B = 4, df_TOTAL = 44, SSA = 160, SSAB = 80, SSE = 150, SST = 610, MSB = 55, MSE = 5. For A: F_STAT = 16. For B: F_STAT = 11. For AB: F_STAT = 2. (a) Because F_STAT = 16 > 3.32, reject H0. Factor A is significant. (b) Because F_STAT = 11 > 2.69, reject H0. Factor B is significant. (c) Because F_STAT = 2.0 < 2.27, do not reject H0. The AB interaction is not significant.
11.22 (a) Because F_STAT = 3.4032 < 4.3512, do not reject H0. (b) Because F_STAT = 1.8496 < 4.3512, do not reject H0. (c) Because F_STAT = 9.4549 > 4.3512, reject H0. (e) Die diameter has a significant effect on density, but die temperature does not. However, the cell means plot shows that the density seems higher with a 3 mm die diameter at 155°C but that there is little difference in density with a 4 mm die diameter. This interaction is not significant at the 0.05 level of significance.
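The table bookkeeping in 11.20 above can be checked mechanically. Here r = 3 levels of factor A, c = 5 levels of factor B, and n′ = 3 replicates per cell are assumed values consistent with the printed degrees of freedom (2, 4, 8, 30, 44); only the sums of squares come from the answer.

```python
# Completing the two-way ANOVA table of 11.20 from the given sums of
# squares. r = 3, c = 5, n_prime = 3 are assumptions that reproduce the
# printed degrees of freedom; only the SS values come from the answer.
r, c, n_prime = 3, 5, 3
n = r * c * n_prime                      # 45 observations in total
ssa, ssab, sse, sst = 160, 80, 150, 610
ssb = sst - ssa - ssab - sse             # 220 by subtraction

df_a, df_b = r - 1, c - 1                # 2 and 4
df_ab = (r - 1) * (c - 1)                # 8
df_e = r * c * (n_prime - 1)             # 30
msa, msb, msab, mse = ssa / df_a, ssb / df_b, ssab / df_ab, sse / df_e
f_a, f_b, f_ab = msa / mse, msb / mse, msab / mse
print(f_a, f_b, f_ab)  # 16.0 11.0 2.0
```

Note that df_a + df_b + df_ab + df_e = 44 = df_TOTAL, matching the printed answer.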
11.24 (a) H0: There is no interaction between filling time and mold temperature. H1: There is an interaction between filling time and mold temperature. Because F_STAT = 0.1136/0.05 = 2.27 < 2.9277 or the p-value = 0.1018 > 0.05, do not reject H0. There is insufficient evidence of interaction between filling time and mold temperature. (b) F_STAT = 9.0222 > 3.5546, reject H0. There is evidence of a difference in the warpage due to the filling time. (c) F_STAT = 4.2305 > 3.5546, reject H0. There is evidence of a difference in the warpage due to the mold temperature. (e) The warpage for a three-second filling time seems to be much higher at 60°C and 72.5°C but not at 85°C.
11.26 (a) F_STAT = 0.8325, p-value = 0.3725 > 0.05, do not reject H0. There is not enough evidence to conclude that there is an interaction between zone 1 lower and zone 3 upper. (b) F_STAT = 0.3820, p-value = 0.5481 > 0.05, do not reject H0. There is insufficient evidence to conclude that there is an effect due to zone 1 lower. (c) F_STAT = 0.1048, p-value = 0.7517 > 0.05, do not reject H0. There is inadequate evidence to conclude that there is an effect due to zone 3 upper. (d) There is a large difference at a zone 3 upper of 695°C but only a small difference at a zone 3 upper of 715°C. (e) Because this difference appeared on the cell means plot but the interaction was not statistically significant because of the large MSE, further testing should be done with larger sample sizes.
11.36 (a) Because F_STAT = 0.0111 < 2.9011, do not reject H0. (b) Because F_STAT = 0.8096 < 4.1491, do not reject H0. (c) Because F_STAT = 5.1999 > 2.9011, reject H0. (e) Critical range = 3.56. Only the means of Suppliers 1 and 2 are different. You can conclude that the
mean strength is lower for Supplier 1 than for Supplier 2, but there are no statistically significant differences between Suppliers 1 and 3, Suppliers 1 and 4, Suppliers 2 and 3, Suppliers 2 and 4, and Suppliers 3 and 4. (f) F_STAT = 5.6998 > 2.8663 (p-value = 0.0027 < 0.05). There is evidence that the mean strength of suppliers is different. Critical range = 3.359. Supplier 1 has a mean strength that is less than Suppliers 2 and 3.
11.38 (a) Because F_STAT = 0.075 < 3.68, do not reject H0. (b) Because F_STAT = 4.09 > 3.68, reject H0. (c) Critical range = 1.489. Breaking strength is significantly different between 30 and 50 psi.
11.40 (a) Because F_STAT = 0.1899 < 4.1132, do not reject H0. There is insufficient evidence to conclude that there is any interaction between type of breakfast and desired time. (b) Because F_STAT = 30.4434 > 4.1132, reject H0. There is sufficient evidence to conclude that there is an effect due to type of breakfast. (c) Because F_STAT = 12.4441 > 4.1132, reject H0. There is sufficient evidence to conclude that there is an effect due to desired time. (e) At the 5% level of significance, both the type of breakfast ordered and the desired time have an effect on delivery time difference. There is no interaction between the type of breakfast ordered and the desired time.
11.42 Interaction: F_STAT = 0.2169 < 3.9668 or p-value = 0.6428 > 0.05. There is insufficient evidence of an interaction between piece size and fill height. Piece size: F_STAT = 842.2242 > 3.9668 or p-value = 0.0000 < 0.05. There is evidence of an effect due to piece size. The fine piece size has a lower difference in coded weight. Fill height: F_STAT = 217.0816 > 3.9668 or p-value = 0.0000 < 0.05. There is evidence of an effect due to fill height. The low fill height has a lower difference in coded weight.
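The Tukey-Kramer critical range used throughout this chapter (for example, in 11.8(b)) can be reproduced with a few lines. Qα = 3.81, MSW = 82,273.2028, and group sizes of 10 are the values from 11.8; the printed answer shows the square-root term (90.7046), and the final multiplication by Qα is our arithmetic.

```python
import math

# Tukey-Kramer critical range: Q_alpha times the standard error term
# sqrt((MSW/2)(1/n_j + 1/n_j')). The inputs below are the values from
# Problem 11.8(b); any pair of group means differing by more than this
# range is declared significantly different.
def tukey_critical_range(q_alpha, msw, n_j, n_jp):
    return q_alpha * math.sqrt((msw / 2) * (1 / n_j + 1 / n_jp))

cr = tukey_critical_range(3.81, 82273.2028, 10, 10)
print(round(cr, 2))  # 345.58
```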
CHAPTER 12
12.2 (a) For df = 1 and α = 0.05, χ² = 3.841. (b) For df = 1 and α = 0.025, χ² = 5.024. (c) For df = 1 and α = 0.01, χ² = 6.635.
12.4 (a) All fe = 25. (b) Because χ²_STAT = 4.00 > 3.841, reject H0.
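The chi-square tests in the next several answers (12.6 through 12.10) all use the same statistic, the sum of (observed - expected)²/expected over the cells. A sketch, using the 2x2 counts that can be read off the terms of the printed sum in 12.8:

```python
# Chi-square test of independence for a 2x2 contingency table.
# Observed counts are taken from the terms of the chi-square sum
# printed in Problem 12.8; expected counts are recomputed from margins.
observed = [[326, 97], [167, 25]]
row_totals = [sum(row) for row in observed]        # 423 and 192
col_totals = [sum(col) for col in zip(*observed)]  # 493 and 122
n = sum(row_totals)                                # 615

chi2 = sum(
    (observed[i][j] - row_totals[i] * col_totals[j] / n) ** 2
    / (row_totals[i] * col_totals[j] / n)
    for i in range(2)
    for j in range(2)
)
print(round(chi2, 4))  # 8.1566, matching the printed statistic
```

For a 2x2 table the statistic has (r - 1)(c - 1) = 1 degree of freedom, which is why it equals the square of the Z statistic from the two-proportion test, as 12.8(c) notes.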
12.6 (a) H0: π1 = π2. H1: π1 ≠ π2. (b) Because χ²_STAT = 79.29 > 3.841, reject H0. There is evidence to conclude that the population proportion of those who viewed the brand on general TV was different from those who viewed the brand on VOD 04+. p-value = 0.0000. The probability of obtaining a test statistic of 79.29 or larger when the null hypothesis is true is 0.0000. (c) You should not compare the results in (a) to those of Problem 10.30 (b) because that was a one-tail test.
12.8 (a) H0: π1 = π2. H1: π1 ≠ π2. Because χ²_STAT = (326 - 339.0878)²/339.0878 + (97 - 83.9122)²/83.9122 + (167 - 153.9122)²/153.9122 + (25 - 38.0878)²/38.0878 = 8.1566 > 6.635, reject H0. There is evidence of a difference in the proportion of organizations with 500 to 2,499 employees and organizations with 2,500+ employees with respect to the proportion that have employee recognition programs. (b) p-value = 0.0043. The probability of obtaining a difference in proportions that gives rise to a test statistic above 8.1566 is 0.0043 if there is no difference in the proportion in the two groups. (c) The results of (a) and (b) are exactly the same as those of Problem 10.32. The χ² in (a) and the Z in Problem 10.32 (a) satisfy the relationship χ² = 8.1566 = Z² = (-2.856)², and the p-value in (b) is exactly the same as the p-value computed in Problem 10.32 (b).
12.10 (b) Because χ²_STAT = 19.9467 > 3.841, reject H0. There is evidence that there is a significant difference between the proportion of cobrowsing organizations and non-cobrowsing organizations that use skills-based routing to match the caller with the right agent. (c) p-value is virtually
zero. The probability of obtaining a test statistic of 19.9467 or larger when the null hypothesis is true is 0.0000. (d) The results are identical because (4.4662)² = 19.9467.
12.12 (a) The expected frequencies for the first row are 20, 30, and 40. The expected frequencies for the second row are 30, 45, and 60. (b) Because χ²_STAT = 12.5 > 5.991, reject H0.
12.14 (a) Because the calculated test statistic 96.7761 is greater than the critical value of 7.8147, you reject H0 and conclude that there is evidence of a difference among the age groups in the proportion of smartphone owners who mostly use smartphones to go online. (b) p-value = 0.0000. The probability of obtaining a data set that gives rise to a test statistic of 96.7761 or more is 0.0000 if there is no difference in the proportion who mostly use smartphones to go online. (c) There is a significant difference between 18- to 29-year-olds and 50- to 64-year-olds and those 65 and older. There is a significant difference between 30- to 49-year-olds and 50- to 64-year-olds and those 65 and older. There is a significant difference between those who are between 50 and 64 years old and those 65 years old or older.
12.16 (a) H0: π1 = π2 = π3. H1: At least one proportion differs.

Observed Frequencies
Compensation value    BE      HR      Employees    Total
Yes                    28      76      66           170
No                    172     124     134           430
Total                 200     200     200           600

Expected Frequencies
Compensation value    BE         HR         Employees    Total
Yes                    56.6667    56.6667    56.6667      170
No                    143.3333   143.3333   143.3333      430
Total                 200        200        200           600

Data
Level of Significance         0.05
Number of Rows                2
Number of Columns             3
Degrees of Freedom            2

Results
Critical Value                5.9915
Chi-Square Test Statistic     31.5841
p-Value                       0.0000
Reject the null hypothesis

Because 31.5841 > 5.9915, reject H0. There is a significant difference among business groups with respect to the proportion that say compensation (pay and rewards) makes for a unique and compelling EVP. (b) p-value = 0.0000. The probability of a test statistic greater than 31.5841 is 0.0000. (c)

Level of Significance             0.05
Square Root of Critical Value     2.4477

Sample Proportions
Group 1    0.14
Group 2    0.38
Group 3    0.33

Marascuilo Table
Proportions          Absolute Differences    Critical Range
Group 1 - Group 2    0.24                    0.1033    Significant
Group 1 - Group 3    0.19                    0.1011    Significant
Group 2 - Group 3    0.05                    0.1170    Not significant

Business executives are different from HR leaders and from employees.
12.18 (a) Because χ²_STAT = 31.6888 > 5.9915, reject H0. There is evidence of a difference in the percentage who use their device to check social media while watching TV between the groups. (b) p-value = 0.0000. (c) Smartphone versus computer: 0.1616 > 0.0835. Significant. Smartphone versus tablet: 0.1805 > 0.0917. Significant. Computer versus tablet: 0.0188 < 0.0998. Not significant. The smartphone group is different from the computer and tablet groups.
12.20 df = (r - 1)(c - 1) = (3 - 1)(4 - 1) = 6.
12.22 Because χ²_STAT = 92.1028 > 16.919, reject H0 and conclude that there is evidence of a relationship between the type of dessert ordered and the type of entrée ordered.
12.24 Because χ²_STAT = 27.06509 < 31.9993, do not reject H0. There is no evidence of a relationship between time willing to devote to health and age.
12.26 Because χ²_STAT = 81.6061 > 47.3999, reject H0. There is evidence of a relationship between identified main opportunity and geographic region.
12.28 (a) 31. (b) 29. (c) 27. (d) 25.
12.30 40 and 79.
12.32 (a) The ranks for Sample 1 are 1, 2, 4, 5, and 10. The ranks for Sample 2 are 3, 6.5, 6.5, 8, 9, and 11. (b) 22. (c) 44.
12.34 Because T1 = 22 > 20, do not reject H0.
12.36 (a) The data are ordinal. (b) The two-sample t test is inappropriate because the data can only be placed in ranked order. (c) Because Z_STAT = -2.2054 < -1.96, reject H0. There is evidence of a significant difference in the median rating of California Cabernets and Washington Cabernets.
12.38 (a) H0: M1 = M2, where Populations: 1 = Wing A, 2 = Wing B. H1: M1 ≠ M2.
Population 1 sample: Sample size 20, sum of ranks 561.
Population 2 sample: Sample size 20, sum of ranks 259.
μT1 = n1(n + 1)/2 = 20(40 + 1)/2 = 410.
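The 12.38 answer is cut off here mid-formula; the large-sample Wilcoxon rank-sum calculation it begins can be sketched as follows. T1 = 561 and n1 = n2 = 20 come from the answer above, but the resulting Z value is our own arithmetic, not a figure quoted from the book in this excerpt.

```python
import math

# Large-sample Wilcoxon rank-sum test for Problem 12.38.
# T1 = 561 and n1 = n2 = 20 are from the answer above; the Z statistic
# below is computed here for illustration.
n1, n2, t1 = 20, 20, 561
n = n1 + n2
mu_t1 = n1 * (n + 1) / 2                        # 410.0, the mean of T1
sigma_t1 = math.sqrt(n1 * n2 * (n + 1) / 12)    # about 36.97
z_stat = (t1 - mu_t1) / sigma_t1
print(round(mu_t1, 1), round(z_stat, 4))
```

Since the Z statistic is well beyond ±1.96, this sketch is consistent with rejecting H0 at the 5% level of significance.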