Basic Business Statistics: Concepts and applications [5 ed.] 9781488617249, 0321870026


English Pages [889] Year 2019


Table of contents :
Front Cover
Front Matter
Half Title
Full Title
Imprint
Brief Contents
Detailed Contents
Preface
Acknowledgements
How to use this book
About the authors
Part 1 Presenting and describing information
Chapter 1 Defining and collecting data
1.1 Basic concepts of data and statistics
1.2 Types of variables
1.3 Collecting data
1.4 Types of survey sampling methods
1.5 Evaluating survey worthiness
1.6 The growth of statistics and information technology
Summary
Key terms
References
Chapter review problems
Continuing cases
Chapter 1 Excel Guide
Chapter 2 Organising and visualising data
2.1 Organising and visualising categorical data
2.2 Organising numerical data
2.3 Summarising and visualising numerical data
2.4 Organising and visualising two categorical variables
2.5 Visualising two numerical variables
2.6 Business analytics applications – descriptive analytics
2.7 Misusing graphs and ethical issues
Summary
Key terms
References
Chapter review problems
Continuing cases
Chapter 2 Excel Guide
Chapter 3 Numerical descriptive measures
3.1 Measures of central tendency, variation and shape
3.2 Numerical descriptive measures for a population
3.3 Calculating numerical descriptive measures from a frequency distribution
3.4 Five-number summary and box-and-whisker plots
3.5 Covariance and the coefficient of correlation
3.6 Pitfalls in numerical descriptive measures and ethical issues
Summary
Key formulas
Key terms
Chapter review problems
Continuing cases
Chapter 3 Excel Guide
End of Part 1 problems
Part 2 Measuring uncertainty
Chapter 4 Basic probability
4.1 Basic probability concepts
4.2 Conditional probability
4.3 Bayes’ theorem
4.4 Counting rules
4.5 Ethical issues and probability
Summary
Key formulas
Key terms
Chapter review problems
Continuing cases
Chapter 4 Excel Guide
Chapter 5 Some important discrete probability distributions
5.1 Probability distribution for a discrete random variable
5.2 Covariance and its application in finance
5.3 Binomial distribution
5.4 Poisson distribution
5.5 Hypergeometric distribution
Summary
Key formulas
Key terms
Chapter review problems
Chapter 5 Excel Guide
Chapter 6 The normal distribution and other continuous distributions
6.1 Continuous probability distributions
6.2 The normal distribution
6.3 Evaluating normality
6.4 The uniform distribution
6.5 The exponential distribution
6.6 The normal approximation to the binomial distribution
Summary
Key formulas
Key terms
Chapter review problems
Continuing cases
Chapter 6 Excel Guide
Chapter 7 Sampling distributions
7.1 Sampling distributions
7.2 Sampling distribution of the mean
7.3 Sampling distribution of the proportion
Summary
Key formulas
Key terms
References
Chapter review problems
Continuing cases
Chapter 7 Excel Guide
End of Part 2 problems
Part 3 Drawing conclusions about populations based only on sample information
Chapter 8 Confidence interval estimation
8.1 Confidence interval estimation for the mean (σ known)
8.2 Confidence interval estimation for the mean (σ
unknown)
8.3 Confidence interval estimation for the proportion
8.4 Determining sample size
8.5 Applications of confidence interval estimation in auditing
8.6 More on confidence interval estimation and ethical issues
Summary
Key formulas
Key terms
References
Chapter review problems
Continuing cases
Chapter 8 Excel Guide
Chapter 9 Fundamentals of hypothesis testing: One-sample tests
9.1 Hypothesis-testing methodology
9.2 Z test of hypothesis for the mean (σ
known)
9.3 One-tail tests
9.4 t test of hypothesis for the mean (σ
unknown)
9.5 Z test of hypothesis for the proportion
9.6 The power of a test
9.7 Potential hypothesis-testing pitfalls and ethical issues
Summary
Key formulas
Key terms
References
Chapter review problems
Continuing cases
Chapter 9 Excel Guide
Chapter 10 Hypothesis testing: Two-sample tests
10.1 Comparing the means of two independent populations
10.2 Comparing the means of two related populations
10.3 F test for the difference between two variances
10.4 Comparing two population proportions
Summary
Key formulas
Key terms
References
Chapter review problems
Continuing cases
Chapter 10 Excel Guide
Chapter 11 Analysis of variance
11.1 The completely randomised design: One-way analysis of variance
11.2 The randomised block design
11.3 The factorial design: Two-way analysis of variance
Summary
Key formulas
Key terms
References
Chapter review problems
Continuing cases
Chapter 11 Excel Guide
End of Part 3 problems
Part 4 Determining cause and making reliable forecasts
Chapter 12 Simple linear regression
12.1 Types of regression models
12.2 Determining the simple linear regression equation
12.3 Measures of variation
12.4 Assumptions
12.5 Residual analysis
12.6 Measuring autocorrelation - The Durbin-Watson statistic
12.7 Inferences about the slope and correlation coefficient
12.8 Estimation of mean values and prediction of individual values
12.9 Pitfalls in regression and ethical issues
Summary
Key formulas
Key terms
References
Chapter review problems
Continuing cases
Chapter 12 Excel Guide
Chapter 13 Introduction to multiple regression
13.1 Developing the multiple regression model
13.2 R2, adjusted R2 and the overall F test
13.3 Residual analysis for the multiple regression model
13.4 Inferences concerning the population regression coefficients
13.5 Testing portions of the multiple regression model
13.6 Using dummy variables and interaction terms in regression models
13.7 Collinearity
Summary
Key formulas
Key terms
References
Chapter review problems
Continuing cases
Chapter 13 Excel Guide
Chapter 14 Time-series forecasting and index numbers
14.1 The importance of business forecasting
14.2 Component factors of the classical multiplicative time-series model
14.3 Smoothing the annual time series
14.4 Least-squares trend fitting and forecasting
14.5 The Holt-Winters method for trend fitting and forecasting
14.6 Autoregressive modelling for trend fitting and forecasting
14.7 Choosing an appropriate forecasting model
14.8 Time-series forecasting of seasonal data
14.9 Index numbers
14.10 Pitfalls in time-series forecasting
Summary
Key formulas
Key terms
References
Chapter review problems
Chapter 14 Excel Guide
Chapter 15 Chi-square tests
15.1 Chi-square test for the difference between two proportions (independent samples)
15.2 Chi-square test for differences between more than two proportions
15.3 Chi-square test of independence
15.4 Chi-square goodness-of-fit tests
15.5 Chi-square test for a variance or standard deviation
Summary
Key formulas
Key terms
References
Chapter review problems
Continuing cases
Chapter 15 Excel Guide
End of Part 4 problems
Part 5 Further topics in stats
Chapter 16 Multiple regression model building
16.1 Quadratic regression model
16.2 Using transformations in regression models
16.3 Influence analysis
16.4 Model building
16.5 Pitfalls in multiple regression and ethical issues
Summary
Key formulas
Key terms
References
Chapter review problems
Continuing cases
Chapter 16 Excel Guide
Chapter 17 Decision making
17.1 Payoff tables and decision trees
17.2 Criteria for decision making
17.3 Decision making with sample information
17.4 Utility
Summary
Key formulas
Key terms
References
Chapter review problems
Chapter 17 Excel Guide
Chapter 18 Statistical applications in quality management
18.1 Total quality management
18.2 Six Sigma management
18.3 The theory of control charts
18.4 Control chart for the proportion - The p chart
18.5 The red bead experiment - Understanding process variability
18.6 Control chart for an area of opportunity - The c chart
18.7 Control charts for the range and the mean
18.8 Process capability
Summary
Key formulas
Key terms
References
Chapter review problems
Chapter 18 Excel Guide
Chapter 19 Further non-parametric tests
19.1 McNemar test for the difference between two proportions (related samples)
19.2 Wilcoxon rank sum test - Non-parametric analysis for two independent populations
19.3 Wilcoxon signed ranks test - Non-parametric analysis for two related populations
19.4 Kruskal-Wallis rank test - Non-parametric analysis for the one-way anova
19.5 Friedman rank test - Non-parametric analysis for the randomised block design
Summary
Key formulas
Key terms
Chapter review problems
Continuing cases
Chapter 19 Excel Guide
Chapter 20 Business analytics
20.1 Predictive analytics
20.2 Classification and regression trees
20.3 Neural networks
20.4 Cluster analysis
20.5 Multidimensional scaling
Summary
Key formulas
Key terms
References
Chapter review problems
Chapter 20 Software Guide
Chapter 21 Data analysis: The big picture
21.1 Analysing numerical variables
21.2 Analysing categorical variables
21.3 Predictive analytics
Chapter review problems
End of Part 5 problems
Appendices
Glossary
Index

Citation preview


5TH EDITION

Basic Business Statistics: Concepts and applications
Berenson, Levine, Szabat, O’Brien, Jayne, Watson


Copyright © Pearson Australia (a division of Pearson Australia Group Pty Ltd) 2019

Pearson Australia
707 Collins Street
Melbourne VIC 3008
www.pearson.com.au

Authorised adaptation from the United States edition entitled Basic Business Statistics, 13th edition, ISBN 0321870026 by Berenson, Mark L., Levine, David M., Szabat, Kathryn A., published by Pearson Education, Inc., Copyright © 2015. Fifth adaptation edition published by Pearson Australia Group Pty Ltd, Copyright © 2019.

The Copyright Act 1968 of Australia allows a maximum of one chapter or 10% of this book, whichever is the greater, to be copied by any educational institution for its educational purposes provided that that educational institution (or the body that administers it) has given a remuneration notice to Copyright Agency Limited (CAL) under the Act. For details of the CAL licence for educational institutions contact: Copyright Agency Limited, telephone: (02) 9394 7600, email: [email protected]

All rights reserved. Except under the conditions described in the Copyright Act 1968 of Australia and subsequent amendments, no part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without the prior permission of the copyright owner.

Portfolio Manager: Rebecca Pedley
Development Editor: Anna Carter
Project Managers: Anubhuti Harsh and Keely Smith
Production Manager: Julie Ganner
Product Manager: Sachin Dua
Content Developer: Victoria Kerr
Rights and Permissions Team Leader: Lisa Woodland
Lead Editor/Copy Editor: Julie Ganner
Proofreader: Katy McDevitt
Indexer: Garry Cousins
Cover and internal design by Natalie Bowra
Cover photograph © kireewong foto/Shutterstock
Typeset by iEnergizer Aptara®, Ltd
Printed in Malaysia

ISBN 9781488617249
1 2 3 4 5 23 22 21 20 19

Pearson Australia Group Pty Ltd   ABN 40 004 245 943


brief contents

Preface x
Acknowledgements xi
How to use this book xii
About the authors xvii

PART 1 PRESENTING AND DESCRIBING INFORMATION
1 Defining and collecting data 4
2 Organising and visualising data 37
3 Numerical descriptive measures 91

PART 2 MEASURING UNCERTAINTY
4 Basic probability 147
5 Some important discrete probability distributions 180
6 The normal distribution and other continuous distributions 212
7 Sampling distributions 248

PART 3 DRAWING CONCLUSIONS ABOUT POPULATIONS BASED ONLY ON SAMPLE INFORMATION
8 Confidence interval estimation 279
9 Fundamentals of hypothesis testing: One-sample tests 315
10 Hypothesis testing: Two-sample tests 358
11 Analysis of variance 401

PART 4 DETERMINING CAUSE AND MAKING RELIABLE FORECASTS
12 Simple linear regression 455
13 Introduction to multiple regression 504
14 Time-series forecasting and index numbers 544
15 Chi-square tests 607

ONLINE CHAPTERS
PART 5 FURTHER TOPICS IN STATS
16 Multiple regression model building 650
17 Decision making 680
18 Statistical applications in quality management 704
19 Further non-parametric tests 740
20 Business analytics 770
21 Data analysis: The big picture 794

Appendices A to F A-1
Glossary G-1
Index I-1


detailed contents

Preface x
Acknowledgements xi
How to use this book xii
About the authors xvii

PART 1 PRESENTING AND DESCRIBING INFORMATION

1 Defining and collecting data 4
1.1 Basic concepts of data and statistics 6
1.2 Types of variables 9
1.3 Collecting data 13
1.4 Types of survey sampling methods 17
1.5 Evaluating survey worthiness 22
1.6 The growth of statistics and information technology 26
Summary 27; Key terms 27; References 27; Chapter review problems 28; Continuing cases 29; Chapter 1 Excel Guide 29

2 Organising and visualising data 37
2.1 Organising and visualising categorical data 38
2.2 Organising numerical data 43
2.3 Summarising and visualising numerical data 46
2.4 Organising and visualising two categorical variables 55
2.5 Visualising two numerical variables 59
2.6 Business analytics applications – descriptive analytics 62
2.7 Misusing graphs and ethical issues 69
Summary 73; Key terms 73; References 73; Chapter review problems 74; Continuing cases 76; Chapter 2 Excel Guide 77

3 Numerical descriptive measures 91
3.1 Measures of central tendency, variation and shape 92
3.2 Numerical descriptive measures for a population 113
3.3 Calculating numerical descriptive measures from a frequency distribution 118
3.4 Five-number summary and box-and-whisker plots 120
3.5 Covariance and the coefficient of correlation 123
3.6 Pitfalls in numerical descriptive measures and ethical issues 129
Summary 130; Key formulas 130; Key terms 132; Chapter review problems 132; Continuing cases 134; Chapter 3 Excel Guide 135

End of Part 1 problems 139

PART 2 MEASURING UNCERTAINTY

4 Basic probability 147
4.1 Basic probability concepts 148
4.2 Conditional probability 156
4.3 Bayes’ theorem 163
4.4 Counting rules 168
4.5 Ethical issues and probability 172
Summary 173; Key formulas 173; Key terms 173; Chapter review problems 174; Continuing cases 177; Chapter 4 Excel Guide 178

5 Some important discrete probability distributions 180
5.1 Probability distribution for a discrete random variable 181
5.2 Covariance and its application in finance 185
5.3 Binomial distribution 189
5.4 Poisson distribution 196
5.5 Hypergeometric distribution 200
Summary 204; Key formulas 204; Key terms 205; Chapter review problems 205; Chapter 5 Excel Guide 208

6 The normal distribution and other continuous distributions 212
6.1 Continuous probability distributions 213
6.2 The normal distribution 214
6.3 Evaluating normality 229
6.4 The uniform distribution 233
6.5 The exponential distribution 235
6.6 The normal approximation to the binomial distribution 238
Summary 242; Key formulas 242; Key terms 242; Chapter review problems 243; Continuing cases 244; Chapter 6 Excel Guide 246

7 Sampling distributions 248
7.1 Sampling distributions 249
7.2 Sampling distribution of the mean 249
7.3 Sampling distribution of the proportion 259
Summary 262; Key formulas 263; Key terms 263; References 263; Chapter review problems 263; Continuing cases 265; Chapter 7 Excel Guide 265

End of Part 2 problems 267

PART 3 DRAWING CONCLUSIONS ABOUT POPULATIONS BASED ONLY ON SAMPLE INFORMATION

8 Confidence interval estimation 279
8.1 Confidence interval estimation for the mean (σ known) 280
8.2 Confidence interval estimation for the mean (σ unknown) 285
8.3 Confidence interval estimation for the proportion 291
8.4 Determining sample size 294
8.5 Applications of confidence interval estimation in auditing 300
8.6 More on confidence interval estimation and ethical issues 307
Summary 308; Key formulas 308; Key terms 308; References 309; Chapter review problems 309; Continuing cases 313; Chapter 8 Excel Guide 313

9 Fundamentals of hypothesis testing: One-sample tests 315
9.1 Hypothesis-testing methodology 316
9.2 Z test of hypothesis for the mean (σ known) 322
9.3 One-tail tests 329
9.4 t test of hypothesis for the mean (σ unknown) 334
9.5 Z test of hypothesis for the proportion 340
9.6 The power of a test 344
9.7 Potential hypothesis-testing pitfalls and ethical issues 349
Summary 352; Key formulas 353; Key terms 353; References 353; Chapter review problems 354; Continuing cases 356; Chapter 9 Excel Guide 356

10 Hypothesis testing: Two-sample tests 358
10.1 Comparing the means of two independent populations 359
10.2 Comparing the means of two related populations 371
10.3 F test for the difference between two variances 378
10.4 Comparing two population proportions 384
Summary 389; Key formulas 391; Key terms 392; References 392; Chapter review problems 392; Continuing cases 395; Chapter 10 Excel Guide 396

11 Analysis of variance 401
11.1 The completely randomised design: One-way analysis of variance 402
11.2 The randomised block design 415
11.3 The factorial design: Two-way analysis of variance 425
Summary 438; Key formulas 439; Key terms 440; References 440; Chapter review problems 441; Continuing cases 443; Chapter 11 Excel Guide 444

End of Part 3 problems 448

PART 4 DETERMINING CAUSE AND MAKING RELIABLE FORECASTS

12 Simple linear regression 455
12.1 Types of regression models 456
12.2 Determining the simple linear regression equation 458
12.3 Measures of variation 467
12.4 Assumptions 473
12.5 Residual analysis 473
12.6 Measuring autocorrelation: The Durbin–Watson statistic 477
12.7 Inferences about the slope and correlation coefficient 482
12.8 Estimation of mean values and prediction of individual values 489
12.9 Pitfalls in regression and ethical issues 493
Summary 496; Key formulas 497; Key terms 498; References 498; Chapter review problems 498; Continuing cases 501; Chapter 12 Excel Guide 502

13 Introduction to multiple regression 504
13.1 Developing the multiple regression model 505
13.2 R², adjusted R² and the overall F test 511
13.3 Residual analysis for the multiple regression model 514
13.4 Inferences concerning the population regression coefficients 516
13.5 Testing portions of the multiple regression model 520
13.6 Using dummy variables and interaction terms in regression models 525
13.7 Collinearity 535
Summary 536; Key formulas 537; Key terms 537; References 537; Chapter review problems 538; Continuing cases 541; Chapter 13 Excel Guide 541

14 Time-series forecasting and index numbers 544
14.1 The importance of business forecasting 545
14.2 Component factors of the classical multiplicative time-series model 546
14.3 Smoothing the annual time series 547
14.4 Least-squares trend fitting and forecasting 555
14.5 The Holt–Winters method for trend fitting and forecasting 567
14.6 Autoregressive modelling for trend fitting and forecasting 570
14.7 Choosing an appropriate forecasting model 579
14.8 Time-series forecasting of seasonal data 584
14.9 Index numbers 591
14.10 Pitfalls in time-series forecasting 599
Summary 600; Key formulas 600; Key terms 601; References 602; Chapter review problems 602; Chapter 14 Excel Guide 604

15 Chi-square tests 607
15.1 Chi-square test for the difference between two proportions (independent samples) 608
15.2 Chi-square test for differences between more than two proportions 615
15.3 Chi-square test of independence 622
15.4 Chi-square goodness-of-fit tests 627
15.5 Chi-square test for a variance or standard deviation 632
Summary 635; Key formulas 635; Key terms 636; References 636; Chapter review problems 636; Continuing cases 640; Chapter 15 Excel Guide 641

End of Part 4 problems 642

PART 5 (ONLINE) FURTHER TOPICS IN STATS

16 Multiple regression model building 650
16.1 Quadratic regression model 651
16.2 Using transformations in regression models 657
16.3 Influence analysis 660
16.4 Model building 663
16.5 Pitfalls in multiple regression and ethical issues 673
Summary 674; Key formulas 674; Key terms 674; References 676; Chapter review problems 676; Continuing cases 677; Chapter 16 Excel Guide 677

17 Decision making 680
17.1 Payoff tables and decision trees 681
17.2 Criteria for decision making 685
17.3 Decision making with sample information 694
17.4 Utility 699
Summary 700; Key formulas 701; Key terms 701; References 701; Chapter review problems 701; Chapter 17 Excel Guide 703

18 Statistical applications in quality management 704
18.1 Total quality management 705
18.2 Six Sigma management 707
18.3 The theory of control charts 708
18.4 Control chart for the proportion – The p chart 710
18.5 The red bead experiment – Understanding process variability 716
18.6 Control chart for an area of opportunity – The c chart 718
18.7 Control charts for the range and the mean 721
18.8 Process capability 727
Summary 733; Key formulas 733; Key terms 734; References 734; Chapter review problems 734; Chapter 18 Excel Guide 736

19 Further non-parametric tests 740
19.1 McNemar test for the difference between two proportions (related samples) 741
19.2 Wilcoxon rank sum test – Non-parametric analysis for two independent populations 744
19.3 Wilcoxon signed ranks test – Non-parametric analysis for two related populations 750
19.4 Kruskal–Wallis rank test – Non-parametric analysis for the one-way ANOVA 755
19.5 Friedman rank test – Non-parametric analysis for the randomised block design 758
Summary 762; Key formulas 762; Key terms 762; Chapter review problems 763; Continuing cases 765; Chapter 19 Excel Guide 766

20 Business analytics 770
20.1 Predictive analytics 771
20.2 Classification and regression trees 772
20.3 Neural networks 777
20.4 Cluster analysis 781
20.5 Multidimensional scaling 783
Key formulas 786; Key terms 787; References 787; Chapter review problems 787; Chapter 20 Software Guide 788

21 Data analysis: The big picture 794
21.1 Analysing numerical variables 798
21.2 Analysing categorical variables 800
21.3 Predictive analytics 801
Chapter review problems 802

End of Part 5 problems 804

Appendices A to F A-1
Glossary G-1
Index I-1

preface

This fifth Australasian and Pacific edition of Basic Business Statistics: Concepts and Applications continues to build on the strengths of the fourth edition, and extends the outstanding teaching foundation of the previous American editions, authored by Berenson, Levine and Szabat. The teaching philosophy of this text is based upon the principles of the American book, but each chapter has once again been carefully revised to include practical examples and a language and style that are more applicable to Australasian and Pacific readers. In preparation for this edition we again asked lecturers from around the country to comment on the format and content of the fourth edition and, based on those comments, the authors have worked to create a text that is more accessible – but no less authoritative – for students.

Part 5 contains additional chapters: Chapter 16 on multiple regression and model building, Chapter 17 on decision making, Chapter 18 on statistical applications in quality and productivity management, Chapter 19 on further non-parametric tests, and two brand new chapters: Chapter 20 on business analytics and Chapter 21 on data analysis. Chapter 21 will be especially useful to students who wish to understand how the concepts and techniques studied in this book all fit together. The Part 5 chapters can be found within the MyLab and student download page via our catalogue. Chapter 21 (including Figure 21.1, which provides a summary of the contents of this book arranged by data-analysis task) is designed to provide guidance in choosing appropriate statistical techniques for data-analysis questions arising in business or elsewhere. Figure 21.1, and Chapter 21, should be referred to when working through the earlier chapters of this book. This should enable students to see connections between topics; that is, the big picture.

The new edition has continued with a ‘real-world’ focus, to take students beyond the pure theory. Some chapters have a completely new opening scenario, focusing on a person or company, which serves to introduce key concepts covered in the chapter. The scenario is interwoven throughout the chapter to reinforce the concepts to the student. Multiple in-chapter examples have been updated to highlight real Australasian and Pacific data.

The Real people, real stats feature that opens each of the text’s five parts is composed of a personal interview highlighting how real people in real business situations apply the principles of statistics to their jobs. The interviewees are:

Part 1: David McCourt, BDO
Part 2: Ellouise Roberts, Deloitte Access Economics
Part 3: Rod Battye, Tourism Research Australia
Part 4: Gautam Gangopadhyay, Endeavour Energy
Part 5: Deborah O’Mara, The University of Sydney

Judith Watson
Nicola Jayne
Martin O’Brien


acknowledgements

When developing the new edition of Basic Business Statistics, we were mindful of retaining the strengths of the current edition, but also of the need to build on those strengths, to enhance the text and to ensure wider reader appeal and useability. We are indebted to the following academics who contributed to the new edition.

Technical Editor
We would like to thank Martin Firth at UWA for carrying out a detailed technical edit of the text.

Reviewers
Ms Gerrie Roberts, Monash University
Dr Sonika Singh, University of Technology Sydney
Dr Erick Li, University of Sydney
Dr Amir Arjomandi, University of Wollongong
Mr Jason Hay, Queensland University of Technology
Mr Martin J Firth, University of Western Australia
Dr Scott Salzman, Deakin University
Ms Charanjit Kaur, Monash University
Dr Jill Wright, Monash University

The enormous task of writing a book of this scope was possible only with the expert assistance of all these friends and colleagues and that of the editorial and production staff at Pearson Australia. We gratefully acknowledge their invaluable contributions at every stage of this project, collectively and, now, individually. We thank the following people at Pearson Australia: Rebecca Pedley, Portfolio Manager; Anna Carter, Development Editor; Julie Ganner, Production Manager and Copy Editor; and Lisa Woodland, Rights & Permissions Team Leader.


how to use this book

Real people, real stats interviews open each part. These introduce real people working in real business environments, using statistics to tackle real business challenges.

Learning objectives introduce you to the key concepts to be covered in each chapter, and are signposted in the margins where they are covered within the chapter.

Chapter-opening scenarios show how statistics are used in everyday life. The scenarios introduce the concepts to be covered, showing the relevance of using particular statistical techniques. The problem is woven throughout each chapter, showing the connection between statistics and their use in business, as well as keeping you motivated.

Data sets and Excel workbooks that accompany the text can be downloaded and used to answer the appropriate questions.

(The sample pages reproduced here show the Part 1 opener, with its ‘Real people, real stats’ interview with David McCourt of BDO, and the opening pages of Chapter 1, ‘Defining and collecting data’, including the chapter’s learning objectives and the Hong Kong Airport survey scenario.)



Real world, business examples are included throughout the chapter. These are designed to show the multiple applications of statistics, while helping you to learn the statistics techniques.

Emphasis on data output and interpretation. The authors believe that the use of computer software is an integral part of learning statistics. Our focus emphasises analysing data by interpreting the output from Microsoft Excel while reducing emphasis on doing calculations. Excel 2016 changes to statistical functions are reflected in the operations shown in this edition. In the coverage of hypothesis testing in Chapters 9 to 11, extensive computer output is included so that the focus can be placed on the p-value approach. In our coverage of simple linear regression in Chapter 12, we assume that a software program will be used and our focus is on interpretation of the output, not on hand calculations.

Summaries are provided at the end of each chapter, to help you review the key content.

Key terms are signposted in the margins when they are first introduced, and are referenced to page numbers at the end of each chapter, helping you to revise key terms and concepts for the chapter.

End-of-section problems are divided into Learning the basics and Applying the concepts.

(The sample page reproduced here is from Section 2.1, ‘Organising and visualising categorical data’ (p. 41). It shows Figure 2.3, a pie chart of the reasons for grocery shopping online, Example 2.3 ‘Pie chart for family type’ and Figure 2.4, pie charts for family type for the capital city and the council area, and includes the following guidance.)

What type of chart should you use? The selection of a chart depends on your intention. If a comparison of categories is most important, use a bar chart. If observing the portion of the whole that lies in a particular category is most important, use a pie chart. There should be no more than eight categories or slices in a pie chart. If there are more than eight, merge the smaller categories into a category called ‘other’.
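The text builds these charts in Microsoft Excel. Purely as an illustration of the same bar-versus-pie choice in another tool (a sketch, not part of the book), the following minimal Python/matplotlib snippet plots the categories and percentages shown in Figure 2.3 both ways:

    # Compare a bar chart and a pie chart for the same categorical summary.
    # Categories and percentages are those shown in Figure 2.3
    # (reasons for grocery shopping online).
    import matplotlib.pyplot as plt

    reasons = ["Convenience", "Competitive prices", "Quality products",
               "Customer service", "Variety/range of products",
               "Comfortable environment", "Products well displayed"]
    percentages = [28, 20, 18, 13, 10, 8, 3]

    fig, (ax_bar, ax_pie) = plt.subplots(1, 2, figsize=(11, 4))

    # Bar chart: easiest for comparing categories against one another.
    ax_bar.bar(range(len(reasons)), percentages)
    ax_bar.set_xticks(range(len(reasons)))
    ax_bar.set_xticklabels(reasons, rotation=45, ha="right")
    ax_bar.set_ylabel("% of respondents")

    # Pie chart: shows each category's share of the whole
    # (seven slices, under the suggested maximum of eight).
    ax_pie.pie(percentages, labels=reasons, autopct="%1.0f%%")

    plt.tight_layout()
    plt.show()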

End-of-part problems challenge the student to make decisions about the appropriate technique to apply, to carry out that technique and to interpret the data meaningfully.*

Australasian and Pacific data sets are used for the problems in each chapter. These files are contained on the Pearson website.

Ethical issues sections are integrated into many chapters, raising issues for ethical consideration.

(Two further sample pages are reproduced here. The first is the end-of-chapter material for Chapter 16, ‘Multiple regression model building’ (p. 674): a summary noting that the chapter covered quadratic regression models, transformations, measures of the influence of individual observations, and the best-subsets and stepwise approaches to model building; the key formulas listed below; and key terms including the best-subsets approach, Cook’s Di statistic, the Cp statistic, cross-validation, data mining, hat matrix diagonal elements hi, logarithmic transformation, parsimony, quadratic regression model, square-root transformation, stepwise regression and the Studentised deleted residual. The second is the ‘End of Part 1 problems’ self-assessment page (p. 139), whose problems A.1 to A.6 cover a shopper-survey contingency table, carpet-installation complaint-resolution times, superannuation fund crediting rates and geometric rates of return, spring-water magnesium content, the NAB Online Retail Sales Index and student website-access data.)

Chapter 16 key formulas:
The quadratic regression model (16.1): Yi = β0 + β1X1i + β2X1i² + εi
Quadratic regression equation (16.2): Ŷi = b0 + b1X1i + b2X1i²
Regression model with a square-root transformation (16.3): Yi = b0 + b1√X1i + εi
Original multiplicative model (16.4): Yi = β0 X1i^β1 X2i^β2 εi
Transformed multiplicative model (16.5): log Yi = log β0 + β1 log X1i + β2 log X2i + log εi
Original exponential model (16.6): Yi = e^(β0 + β1X1i + β2X2i) εi
Transformed exponential model (16.7): ln Yi = β0 + β1X1i + β2X2i + ln εi
Studentised deleted residual (16.8): ti = ei √[(n − k − 1) / (SSE(1 − hi) − ei²)]
Cook’s Di statistic (16.9): Di = [ei² / (k MSE)] × [hi / (1 − hi)²]
The Cp statistic (16.10): Cp = [(1 − R²k)(n − T) / (1 − R²T)] − [n − 2(k + 1)]
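As a brief, hedged illustration of the quadratic model in (16.1) and (16.2) — a sketch on invented data, not an example from the book — the coefficients b0, b1 and b2 can be obtained by least squares with numpy:

    # Fit the quadratic regression Y-hat = b0 + b1*X + b2*X^2 by least squares.
    # The data points below are invented purely for illustration.
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.1, 3.9, 7.2, 11.8, 18.3, 26.0])

    # Design matrix with columns 1, X and X^2, matching equation (16.2).
    X = np.column_stack([np.ones_like(x), x, x**2])
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    b0, b1, b2 = coeffs

    y_hat = X @ coeffs
    sse = float(np.sum((y - y_hat) ** 2))
    print(f"b0={b0:.3f}, b1={b1:.3f}, b2={b2:.3f}, SSE={sse:.3f}")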

*The solutions are calculated using the (raw) Excel output. If you use the rounded figures presented in the text to reproduce these answers there may be minor differences.


MyLab Statistics: a guided tour for students and educators

Study Plan
A study plan is generated from each student’s results on a pre-test. Students can clearly see which topics they have mastered and, more importantly, which they need to work on.

Unlimited Practice
Each MyLab Statistics comes with preloaded assignments, including select end-of-chapter questions, all of which are automatically graded. Many study plan and educator-assigned exercises contain algorithmically generated values to ensure students get as much practice as they need. As students work through study plan or homework exercises, instant feedback and tutorial resources guide them towards understanding.


Learning Resources
To further reinforce understanding, study plan and homework problems link to the following learning resources:
• eText linked to sections for all study plan questions
• Help Me Solve This, which walks students through the problem with step-by-step help and feedback without giving away the answer
• StatCrunch.

StatTalk Videos
Fun-loving statistician Andrew Vickers takes to the streets of Brooklyn, New York to demonstrate important statistical concepts through interesting stories and real-life events. This series of videos and corresponding autograded questions will help students to understand statistics.


EDUCATOR RESOURCES

A suite of resources is provided to assist with delivery of the text, as well as to support teaching and learning.

Solutions Manual
The Solutions Manual provides educators with detailed, accuracy-verified solutions to all the in-chapter and end-of-chapter problems in the book.

Test Bank
The Test Bank provides a wealth of accuracy-verified testing material. Updated for the new edition, each chapter offers a wide variety of true/false and multiple-choice questions, arranged by learning objective and tagged by AACSB standards. Questions can be integrated into Blackboard, Canvas or Moodle Learning Management Systems.

PowerPoint lecture slides
A comprehensive set of PowerPoint slides can be used by educators for class presentations or by students for lecture preview or review. They include key figures and tables, as well as a summary of key concepts and examples from the text.

Digital image PowerPoint slides
All the diagrams and tables from the text are available for lecturer use.


about the authors

Judith Watson
Judith Watson teaches in the Business School at UNSW Australia. She has extensive experience in lecturing and administering undergraduate and postgraduate Quantitative Methods courses. Judith’s keen interest in student support led her to establish the Peer Assisted Support Scheme (PASS) in 1996 and she has coordinated this program for many years. She served as her faculty’s academic adviser from 2001 to 2004. Judith has been the recipient of a number of awards for teaching. She received the inaugural Australian School of Business Outstanding Teaching Innovations Award in 2008 and the 2012 Bill Birkett Award for Teaching Excellence. She also won the UNSW Vice Chancellor’s Award for Teaching Excellence in 2012 and a Citation of Outstanding Contributions to Student Learning from the Australian Government’s Office for Learning and Teaching in 2013. Judith is interested in using online learning technology to engage students and has created a number of adaptive e-learning tutorials for mathematics and statistics and cartoon-style videos to explain statistical concepts.

Dr Nicola Jayne
Nicola Jayne is a lecturer in the Southern Cross Business School at the Lismore campus of Southern Cross University. She has been teaching quantitative units since being appointed to the university in 1993 after several years at Massey University in New Zealand. Nicola has lectured extensively in Business and Financial Mathematics, Discrete Mathematics and Statistics, both undergraduate and postgraduate, as well as various Pure Mathematics units. Nicola’s academic qualifications from Massey University include a Bachelor of Science (majors in Mathematics and Statistics), a Bachelor of Science with Honours (first class) and a Doctor of Philosophy, both in Mathematics. Nicola also has a Graduate Certificate in Higher Education (Learning & Teaching) from Southern Cross University. She was the recipient of a Vice Chancellor’s Citation for an Outstanding Contribution to Student Learning in 2011.

Dr Martin O’Brien
Dr Martin O’Brien is a senior lecturer in economics, Director of the Centre for Human and Social Capital Research, and Director of the MBA program in the Sydney Business School, University of Wollongong. Martin earned his Bachelor of Commerce (first-class honours) and PhD in Economics at the University of Newcastle. His PhD and subsequent published research is in the general area of labour economics, and specifically the exploration of older workers’ labour force participation in Australia in the context of an ageing society. Martin has been an expert witness for a number of Fair Work Commission cases, providing statistical analyses of the effects of penalty rates, workforce casualisation and family and domestic violence leave. Martin has taught a wide range of quantitative subjects at university level, including business statistics, business analytics, quantitative analysis for decision making, econometrics, financial modelling and business research methods. He also has a keen interest in learning analytics and the development and analysis of new teaching technologies.


about the originating authors

Mark L. Berenson is Professor of Management and Information Systems at Montclair State University (Montclair, New Jersey) and also Professor Emeritus of Statistics and Computer Information Systems at Bernard M. Baruch College (City University of New York). He currently teaches graduate and undergraduate courses in statistics and in operations management in the School of Business and an undergraduate course in international justice and human rights that he co-developed in the College of Humanities and Social Sciences. Berenson received a BA in economic statistics, an MBA in business statistics from City College of New York and a PhD in business from the City University of New York. His research has been published in Decision Sciences Journal of Innovative Education, Review of Business Research, The American Statistician, Communications in Statistics, Psychometrika, Educational and Psychological Measurement, Journal of Management Sciences and Applied Cybernetics, Research Quarterly, Stats Magazine, The New York Statistician, Journal of Health Administration Education, Journal of Behavioral Medicine and Journal of Surgical Oncology. His invited articles have appeared in The Encyclopedia of Measurement & Statistics and Encyclopedia of Statistical Sciences. He is co-author of 11 statistics texts published by Prentice Hall, including Statistics for Managers Using Microsoft Excel, Basic Business Statistics: Concepts and Applications and Business Statistics: A First Course. Over the years, Berenson has received several awards for teaching and for innovative contributions to statistics education. In 2005, he was the first recipient of the Catherine A. Becker Service for Educational Excellence Award at Montclair State University and, in 2012, he was the recipient of the Khubani/Telebrands Faculty Research Fellowship in the School of Business.

David M. Levine is Professor Emeritus of Statistics and Computer Information Systems at Baruch College (City University of New York). He received BBA and MBA degrees in statistics from City College of New York and a PhD from New York University in industrial engineering and operations research. He is nationally recognised as a leading innovator in statistics education and is the co-author of 14 books, including such best-selling statistics textbooks as Statistics for Managers Using Microsoft Excel, Basic Business Statistics: Concepts and Applications, Business Statistics: A First Course and Applied Statistics for Engineers and Scientists Using Microsoft Excel and Minitab. He also is the co-author of Even You Can Learn Statistics: A Guide for Everyone Who Has Ever Been Afraid of Statistics (currently in its second edition), Six Sigma for Green Belts and Champions and Design for Six Sigma for Green Belts and Champions, and the author of Statistics for Six Sigma Green Belts, all published by FT Press, a Pearson imprint, and Quality Management, third edition, published by McGraw-Hill/Irwin. He is also the author of Video Review of Statistics and Video Review of Probability, both published by Video Aided Instruction, and the statistics module of the MBA primer published by Cengage Learning. He has published articles in various journals, including Psychometrika, The American Statistician, Communications in Statistics, Decision Sciences Journal of Innovative Education, Multivariate Behavioral Research, Journal of Systems Management, Quality Progress and The American Anthropologist, and he has given numerous talks at the Decision Sciences Institute (DSI), American Statistical Association (ASA) and Making Statistics More Effective in Schools and Business (MSMESB) conferences. Levine has also received several awards for outstanding teaching and curriculum development from Baruch College.

Kathryn A. Szabat is Associate Professor and Chair of Business Systems and Analytics at LaSalle University. She teaches undergraduate and graduate courses in business statistics and operations management. Szabat’s research has been published in International Journal of Applied Decision Sciences, Accounting Education, Journal of Applied Business and Economics, Journal of Healthcare Management and Journal of Management Studies. Scholarly chapters have appeared in Managing Adaptability, Intervention, and People in Enterprise Information Systems; Managing, Trade, Economies and International Business; Encyclopedia of Statistics in Behavioral Science; and Statistical Methods in Longitudinal Research. Szabat has provided statistical advice to numerous business, non-business and academic communities. Her more recent involvement has been in the areas of education, medicine and non-profit capacity building. Szabat received a BS in mathematics from State University of New York at Albany and MS and PhD degrees in statistics, with a cognate in operations research, from the Wharton School of the University of Pennsylvania.


PART 1
Presenting and describing information

Real People, Real Stats
David McCourt, BDO

Which company are you currently working for and what are some of your responsibilities?
I work at BDO, Chartered Accountants and Advisors, in the corporate finance team. My primary responsibilities include the preparation of financial models and valuation reports.

List five words that best describe your personality.
Affable, level-headed, perceptive, analytical, assured (according to my colleagues).

What are some things that motivate you?
Success, working with a team, client satisfaction.

When did you first become interested in statistics?
I never really understood statistics at school and it was a minor part of my university degree. However, statistics play a significant role in many of our valuations, including discounted cash flow valuations and share option valuations.

Complete the following sentence. A world without statistics …
… is not worth thinking about.

LET’S TALK STATS

What do you enjoy most about working in statistics?
We use data services and statistical tools that have been created by third parties. I can use, and talk reasonably knowledgeably about, statistical data without being an expert.


a quick q&a

Describe your first statistics-related job or work experience. Was this a positive or a negative experience?
The first time I can recall using statistics was for a share option valuation. We had to determine the share price volatility based on historical share price data. There are about half a dozen methods that can be used, all with various advantages and disadvantages. I did and still find this analysis interesting.

What do you feel is the most common misconception about your work held by students who are studying statistics? Please explain.
Statistics provides information to support our analysis and decisions. However, the information is never perfect, and subjectivity and commercial common sense play a large part in our work.

Do you need to be good at maths to understand and use statistics successfully?
I think you need to have a logical and well-structured approach to problems. These skills would probably make you good at both maths and statistics.

Is there a high demand for statisticians in your industry (or in other industries)? Please explain.
The finance industry is heavily reliant on statistics. I expect there is high demand for statisticians from the various data providers, and in a number of specialist areas (e.g. insurance).

Does data collection play an important role in the decisions you make for your business/work? Please explain.
Accurate data collection is essential to our valuation projects. Although our work involves a degree of commercial acumen, it is essential that the data supports and justifies these decisions. We also aggregate data for internal business use to measure staff productivity and business performance and to forecast budgets.

Describe a project that you have worked on recently that might have involved data collection. Please be specific.
We recently valued an infrastructure asset using the discounted cash flow model. The model requires two essential inputs: the forecast of future cash flows of the asset, and the discount rate that reflects the riskiness of those cash flows. To arrive at an appropriate discount rate we generally analyse comparable companies for an indication of the level of risk that should be attributed to the asset to be valued. In this exercise there are several instances of data collection. We collect five-year historical stock data for numerous comparable companies as an initial indication of risk. We then collect data on key financial indicators to assess the degree of comparability between the stock and the asset to be valued. To determine the risk-free rate and the market-risk premium, 10-year government bond rate data is collected.

How are these data usually summarised? What are some positives and negatives of these summary techniques?
We generally organise the collected data into Microsoft Excel workbooks. The main advantage of using this software is the ease of data analysis. Some powerful data analysis tools include data tables, What-If Analysis, Solver, charting and common statistical functions. Some shortcomings we have encountered using Excel are that data sometimes need to be rearranged depending on the analysis, [there can be] problems with inconsistent or missing data, and output can sometimes be incomplete. These factors increase the likelihood of errors in data analysis; however, for the purposes of corporate finance, Excel is generally sufficient as a means of summarising and analysing the data collected.

In your experience, what is the most commonly referred to measure of central tendency? What benefits does this measure offer over others?
In valuations, we generally prefer to use the median as a measure of central tendency rather than the mean or mode. We find that the mean has one main disadvantage: it is particularly susceptible to outliers. When looking at comparable companies there are often outliers caused by one-off business issues that are irrelevant for the purposes of comparing our business. We very rarely use the mode, given that it only really coincides with the central tendency of data where the distribution is centre-heavy and there are generally few recurring figures in the data set.

Why is it important to be aware of the spread/variation of data points in a sample? What are the consequences of not knowing this type of information about your sample?
Without an understanding of the spread and variation of a data set there is no context for the measure of central tendency applied. A measure of central tendency summarises the data into a single value, while the spread and variation of the data give an indication of how reliable an average or median summary of collected data is. For example, if the spread of values in the data set is relatively large, it suggests the mean is not as representative, and a smoothing of the data is required, compared with a data set with a smaller range. Adopting a mean without reference to the spread can taint our analysis and leave the decisions we base on the data without validity.


CHAPTER 1

Defining and collecting data

THE HONG KONG AIRPORT SURVEY

You are departing Hong Kong International Airport on the next leg of your trip and have cleared Immigration. You are approached by a researcher holding a tablet computer who asks if you can answer a few questions. The first question determines if you are a visitor to Hong Kong or a resident. After establishing that you are a visitor the questions go on to determine the purpose of your visit, the name of your hotel, the activities you have undertaken and much additional information about your visit. This information is useful for a tourism authority that has the task of marketing Hong Kong as a travel destination and monitoring the quality of visitors’ experiences in the city. It may also inform the authority’s government and commercial stakeholders, who provide transport, accommodation, and food and shopping for visitors, and be used for forward planning.


LEARNING OBJECTIVES

After studying this chapter you should be able to:
1. identify the types of data used in business
2. identify how statistics is used in business
3. recognise the sources of data used in business
4. distinguish between different survey sampling methods
5. evaluate the quality of surveys

Not so long ago, business students were unfamiliar with the word data and had little experience handling data. Today, every time you visit a search engine website or ‘ask’ your mobile device a question, you are handling data. And if you ‘check in’ to a location or indicate that you ‘like’ something, you are creating data as well. You accept as almost true the premises of stories in which characters collect ‘a lot of data’ to uncover conspiracies, foretell disasters or catch a criminal. You hear concerns about how the government or business might be able to ‘spy’ on you in some way or how large social media companies ‘mine’ your personal data for profit. You hear the word data everywhere and may even have a ‘data plan’ for your smartphone. You know, in a general way, that data are facts about the world and that most data seem to be, ultimately, a set of numbers – that 34% of students recently polled prefer using a certain Internet browser, or that 50% of citizens believe the country is headed in the right direction, or that unemployment is down 3%, or that your best friend’s social media account has 835 friends and 202 recent posts. You cannot escape from data in this digital world. What, then, should you do? You could try to ignore data and conduct business by relying on hunches or your ‘gut instincts’. However, if you want to use only gut instincts, then you probably shouldn’t be reading this book or taking business courses in the first place. You could note that there is so much data in the world – or just in your own little part of the world – that you couldn’t possibly get a handle on it. You could accept other people’s data summaries and their conclusions without first reviewing the data yourself. That, of course, would expose yourself to fraudulent practices. Or you could do things the proper way and realise the benefits of learning the methods of statistics, the subject of this book. You can learn, though, the procedures and methods that will help you make better decisions based on solid evidence. When you begin focusing on the procedures and methods involved in collecting, presenting and summarising a set of data, or forming conclusions about those data, you have discovered statistics. In the Hong Kong Airport survey scenario it is important that research team members focus on the information that is needed by many different stakeholders when planning for future business and tourist visitors. If the research team fails to collect important information, or misrepresents the opinions of current visitors, stakeholders may make poor decisions about advertising, pricing, facilities and other factors relevant to attracting visitors and hosting them in Hong Kong. Failure to offer suitable facilities and experiences could affect the profitability of businesses in Hong Kong. In deciding how to collect the facts that are needed, it will help if you know something about the basic concepts of statistics.


1.1  BASIC CONCEPTS OF DATA AND STATISTICS

The Meaning of ‘Data’
What do we mean by the word data? Its common use is somewhat different from its use in statistics. It could be described in a general way as meaning ‘facts about the world’. However, statisticians distinguish between the traits or properties that relate to people or things and the actual values that these take.

variables Characteristics or attributes that can be expected to differ from one individual to another. data The observed values of variables.

VARIABLES
Variables are characteristics of items or individuals.

DATA
Data are the observed values of variables.

For a group of people, we could examine the traits of age, country of birth or weight. For a group of cars, we could note the colour, current value or kilometres driven. These characteristics are called variables. Data are the values associated with these traits or properties. As an example, in Table 1.1 we find a set of data collected from six people which represents observations on three different variables.

Table 1.1

Variable                Data
Age in years            24, 18, 53, 16, 22, 31
Country of birth        Australia, China, Australia, Malaysia, India, Australia
Weight in kilograms     50.2, 74.6, 96.3, 45.2, 56.1, 87.3

operational definition  Defines how a variable is to be measured.
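
To make Table 1.1 concrete, here is a minimal sketch – not part of the text, which works in Microsoft Excel – of the same three variables and six observations stored with the pandas library in Python. The column names are invented for illustration; each column holds a variable and each row holds one person’s observed values (the data).

```python
import pandas as pd

# Each column is a variable; each row holds one person's observed values (the data).
table_1_1 = pd.DataFrame({
    "age_years": [24, 18, 53, 16, 22, 31],
    "country_of_birth": ["Australia", "China", "Australia",
                         "Malaysia", "India", "Australia"],
    "weight_kg": [50.2, 74.6, 96.3, 45.2, 56.1, 87.3],
})

print(table_1_1)
print(table_1_1.dtypes)  # age_years: integer, country_of_birth: text, weight_kg: float
```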

In this book, the word data is always plural to remind you that data are a collection or set of values. While we could say that a single value, such as ‘Australia’, is a datum, the terms data point, observation, response or single data value are more typically encountered. All variables should have an operational definition – a universally accepted meaning that is clear to all associated with an analysis. Without operational definitions, confusion can occur. An example of a situation where operational definitions are needed is the process of data gathering by the Australian Bureau of Statistics (ABS). The ABS needs to collect information about the country of birth of a person and also the countries in which their father and mother were born. While this might seem straightforward, definitional problems arise in the case of people who were adopted or have step- or foster parents or other guardians. So the operational definitions used are:
• ‘Country of birth of person’, which is the country identified as being the one in which the person was born
• ‘Country of birth of father’, which is the country in which the person’s birth father was born, and
• ‘Country of birth of mother’, which is the country in which the person’s birth mother was born
(Australian Bureau of Statistics, Country of Birth Standard, Cat. No. 1200.0.55.004, 2016).

The Meaning of ‘Statistics’

statistics  A branch of mathematics concerned with the collection and analysis of data.

Statistics is the branch of mathematics that examines ways to process and analyse data. It provides procedures to collect and transform data in ways that are useful to business decision makers. Statistics allows you to determine whether your data represent information that could be used in making better decisions. Therefore, it helps you determine whether differences in the


numbers are meaningful in a significant way or are due to chance. To illustrate, consider the following reports:
• In ‘News use across social media platforms 2016’ the Pew Research Center reported in May 2016 that 67% of the adult US population had a Facebook account and 66% of users get news from the site (accessed 12 June 2017).
• In a blog titled ‘The top 10 benefits of newspaper advertising’, the 360 Degree Marketing Group says that a study showed newspaper advertising was considered a more trusted paid medium for information (58%) compared with television (54%), radio (49%) or online (27%) (accessed 12 June 2017).

Without statistics, you cannot determine whether the ‘numbers’ in these stories represent useful information. Without statistics, you cannot validate claims such as the statement that advertising in newspapers or on television is more trusted than online advertising. And without statistics, you cannot see patterns that large amounts of data sometimes reveal. Statistics is a way of thinking that can help you make better decisions. It helps you solve problems that involve decisions based on data that have been collected.

You may have had some statistics instruction in the past. If you ever created a chart to summarise data or calculated values such as averages to summarise data, you have used statistics. But there’s even more to statistics than these commonly taught techniques, as the detailed table of contents shows. Statistics is undergoing important changes today. There are new ways of visualising data that did not exist, were not practicable or were not widely known until recently. And, increasingly, statistics today is being used to ‘listen’ to what the data might be telling you rather than just being a way to use data to prove something you want to say.

If you associate statistics with doing a lot of mathematical calculations, you will quickly learn that business statistics uses software to perform the calculations for you (and, generally, the software calculates with more precision and efficiency than you could do manually). But while you do not need to be a good manual calculator to apply statistics, because statistics is a way of thinking, you do need to follow a framework or plan to minimise possible errors of thinking and analysis. One such framework consists of the following tasks to help apply statistics to business decision making:
1. Define the data that you want to study in order to solve a problem or meet an objective.
2. Collect the data from appropriate sources.
3. Organise the data collected by developing tables.
4. Visualise the data collected by developing charts.
5. Analyse the data collected to reach conclusions and present those results.

Typically, you do the tasks in the order listed. You must always do the first two tasks to have meaningful outcomes, but, in practice, the order of the other three can change or appear inseparable. Certain ways of visualising data will help you to organise your data while performing preliminary analysis as well. In any case, when you apply statistics to decision making, you should be able to identify all five tasks, and you should verify that you have done the first two tasks before the other three. Using this framework helps you to apply statistics to these four broad categories of business activities:
1. Summarise and visualise business data.
2. Reach conclusions from those data.
3. Make reliable forecasts about business activities.
4. Improve business processes.
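
The five-task framework above can also be sketched in code. The outline below is only a hypothetical illustration, assuming a file named survey.csv with ‘purpose’ and ‘amount_spent’ columns (neither the file nor the column names come from the text), and it uses Python with pandas rather than the Excel workflow the book follows.

```python
import pandas as pd

# 1. Define: the variable of interest is the amount spent per visitor (dollars).
# 2. Collect: read the data from an appropriate source (a hypothetical CSV file).
visits = pd.read_csv("survey.csv")            # assumed columns: 'purpose', 'amount_spent'

# 3. Organise: tabulate visitors by purpose of visit.
counts = visits["purpose"].value_counts()
print(counts)

# 4. Visualise (requires matplotlib): a simple bar chart of the tabulated counts.
counts.plot(kind="bar", title="Visitors by purpose of visit")

# 5. Analyse: summarise spending to reach conclusions and present the results.
print(visits["amount_spent"].describe())
```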


descriptive statistics The field that focuses on summarising or characterising a set of data. inferential statistics Uses information from a sample to draw conclusions about a population.

Throughout this book, and especially in the scenarios that begin the chapters, you will discover specific examples of how we can apply statistics to business situations. Statistics is itself divided into two branches, both of which are applicable to managing a business. Descriptive statistics focuses on collecting, summarising and presenting a set of data. Inferential statistics uses sample data to draw conclusions about a population. Descriptive statistics has its roots in the record-keeping needs of large political and social organisations. Refining the methods of descriptive statistics is an ongoing task for government statistical agencies such as the Australian Bureau of Statistics and Statistics New Zealand as they prepare for each Census. In Australia, a Census is scheduled to be carried out every five years (e.g. 2011 and 2016) to count the entire population and to collect data about education, occupation, languages spoken and many other characteristics of the citizens. A large amount of planning and training is necessary to ensure that the data collected represent an accurate record of the population’s characteristics at the Census date. However, despite the best planning, such an immense data collection task can be affected by external factors. The Australian Census held in 2016 was badly affected by a computer shutdown on Census night, 9 August. It was blamed on the need to protect the system from denial of service cyber attacks and added approximately $30 million to the cost of the Census (, accessed 13 July 2017). The foundation of inferential statistics is based on the mathematics of probability theory. Inferential methods use sample data to calculate statistics that provide estimates of the characteristics of the entire population. Today, applications of statistical methods can be found in different areas of business. Accounting uses statistical methods to select samples for auditing purposes and to understand the cost drivers in cost accounting. Finance uses statistical methods to choose between alternative portfolio investments and to track trends in financial measures over time. Management uses statistical methods to improve the quality of the products manufactured or the services delivered by an organisation. Marketing uses statistical methods to estimate the proportion of customers who prefer one product over another and to draw conclusions about what advertising strategy might be most useful in increasing sales of a product.

Other Important Definitions
Now that the terms variables, data and statistics have been defined, you need to understand the meaning of the terms population, sample and parameter.

population  A collection of all members of a group being investigated.
sample  The portion of the population selected for analysis.
parameter  A numerical measure of some population characteristic.
statistic  A numerical measure that describes a characteristic of a sample.

POPULATION
A population consists of all the members of a group about which you want to draw a conclusion.

SAMPLE
A sample is the portion of the population selected for analysis.

PARAMETER
A parameter is a numerical measure that describes a characteristic of a population.

STATISTIC
A statistic is a numerical measure that describes a characteristic of a sample.

Examples of populations are all the full-time students at a university, all the registered voters in New Zealand and all the people who were customers of the local shopping centre last weekend. The term population is not limited to groups of people. We could refer to a
population of all motor vehicles registered in Victoria. Two factors need to be specified when defining a population:
1. the entity (e.g. people or motor vehicles)
2. the boundary (e.g. registered to vote in New Zealand or registered in Victoria for road use).

Samples could be selected from each of the populations mentioned above. Examples include 10 full-time students selected for a focus group; 500 registered voters in New Zealand who were contacted by telephone for a political poll; 30 customers at the shopping centre who were asked to complete a market research survey; and all the vehicles registered in Victoria that are more than 10 years old. In each case, the people or the vehicles in the sample represent a portion, or subset, of the people or vehicles comprising the population.

The average amount spent by all the customers at the local shopping centre last weekend is an example of a parameter. Information from all the shoppers in the entire population is needed to calculate this parameter. The average amount spent by the 30 customers completing the market research survey is an example of a statistic. Information from a sample of only 30 of the shopping centre’s customers is used in calculating the statistic.
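
A small illustration, with invented numbers, of the parameter/statistic distinction just described: the mean amount spent by every customer is a parameter, while the mean of a random sample of 30 customers is a statistic that estimates it.

```python
import random

random.seed(1)

# Hypothetical population: amount spent (dollars) by every customer last weekend.
population = [round(random.uniform(5, 300), 2) for _ in range(5000)]

# Parameter: calculated from the whole population.
population_mean = sum(population) / len(population)

# Statistic: calculated from a sample of 30 customers.
sample = random.sample(population, k=30)
sample_mean = sum(sample) / len(sample)

print(f"Parameter (population mean): {population_mean:.2f}")
print(f"Statistic (sample mean):     {sample_mean:.2f}")
```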

1.2  TYPES OF VARIABLES

As illustrated in Figure 1.1, there are two types of variables – categorical and numerical, sometimes referred to as qualitative and quantitative variables respectively.

The Hong Kong airport survey
Travellers in the departure lounge of the busy Hong Kong International Airport are asked to complete a survey with questions about various aspects of their visit to the city and future travel plans. The interviewer first asks if the traveller is a resident or a visitor. If the traveller is a visitor, the survey proceeds. The survey includes these questions:

■ How many visits have you made to Hong Kong prior to this one?
■ How long is it since your visit here?
■ How satisfied were you with your accommodation?
  Very satisfied ■  Satisfied ■  Undecided ■  Dissatisfied ■  Very dissatisfied ■
■ How many times during this visit did you travel by ferry?
■ Shopping in Hong Kong stores gives good value for money
  Almost always ■  Sometimes ■  Very infrequently ■  Never ■
■ Was the purpose of your visit business?  Yes ■  No ■
■ Are you likely to return to Hong Kong in the next 12 months?  Yes ■  No ■

You have been asked to review the survey. What type of data does the survey seek to collect? What type of information can be generated from the data of the completed survey? How can the research company’s clients use this information when planning for future visitors? What other questions would you suggest for the survey?


Figure 1.1  Types of variables

VARIABLE TYPE             QUESTION TYPES                                              RESPONSES
Categorical               Do you currently own any shares?                            Yes / No
Numerical – Discrete      How many messages did you send on social media last week?   Number
Numerical – Continuous    How tall are you?                                           Centimetres

LEARNING OBJECTIVE 1
Identify the types of data used in business

categorical variables  Take values that fall into one or more categories.
numerical variables  Take numbers as their observed responses.
discrete variables  Can only take a finite or countable number of values.
continuous variables  Can take any value between specified limits.

Categorical variables yield categorical responses, such as yes or no or male or female answers. An example is the response to the question ‘Do you currently own any shares?’ because it is limited to a simple yes or no answer. Another example is the response to the question in the Hong Kong Airport survey (presented above), ‘Are you likely to return to Hong Kong in the next 12 months?’ Categorical variables can also yield more than one possible response; for example, ‘On which days of the week are you most likely to use public transport?’

Numerical variables yield numerical responses, such as your height in centimetres. Other examples are ‘How many times during this visit did you travel by ferry?’ (from the Hong Kong Airport survey) or the response to the question, ‘How many messages did you send on social media last week?’

There are two types of numerical variables: discrete and continuous. Discrete variables produce numerical responses that arise from a counting process. ‘The number of social media messages sent’ is an example of a discrete numerical variable because the response is one of a finite number of integers. You send zero, one, two, …, 50 and so on messages. Continuous variables produce numerical responses that arise from a measuring process. Your height is an example of a continuous numerical variable because the response takes on any value within a continuum or interval, depending on the precision of the measuring instrument. For example, your height may be 158 cm, 158.3 cm or 158.2945 cm, depending on the precision of the available instruments. No two people are exactly the same height, and the more precise the measuring device used, the greater the likelihood of detecting differences in their heights. However, most measuring devices are not sophisticated enough to detect small differences. Hence, tied observations are often found in experimental or survey data even though the variable is truly continuous and, theoretically, all values of a continuous variable are different.
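
As a rough sketch (not from the text), the three kinds of responses above could be stored as follows: categorical answers as labels, a discrete count as an integer and a continuous measurement as a floating-point number. The variable names are invented.

```python
import pandas as pd

responses = pd.DataFrame({
    "owns_shares": ["yes", "no", "no", "yes"],      # categorical
    "messages_last_week": [12, 0, 47, 3],           # numerical, discrete (a count)
    "height_cm": [158.3, 172.0, 165.4, 180.1],      # numerical, continuous (a measurement)
})

# Marking owns_shares as categorical makes the variable type explicit to the software.
responses["owns_shares"] = responses["owns_shares"].astype("category")
print(responses.dtypes)
```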

Levels of Measurement and Types of Measurement Scales
Data are also described in terms of their level of measurement. There are four widely recognised levels of measurement: nominal, ordinal, interval and ratio scales.

nominal scale A classification of categorical data that implies no ranking.

Nominal and ordinal scales
Data from a categorical variable are measured on a nominal scale or on an ordinal scale. A nominal scale (Figure 1.2) classifies data into various distinct categories in which no ranking is implied. In the Hong Kong Airport survey, the answer to the question ‘Are you likely to return to Hong Kong in the next 12 months?’ is an example of a nominally scaled variable, as is your favourite soft drink, your political party affiliation and your gender. Nominal scaling is the weakest form of measurement because you cannot specify any ranking across the various categories.

Figure 1.2  Examples of nominal scaling

CATEGORICAL VARIABLE          CATEGORIES
Personal computer ownership   Yes    No
Type of fuel used             Unleaded    Premium Unleaded    Diesel    LPG
Internet connection           Cable    Wireless

An ordinal scale classifies data into distinct categories in which ranking is implied. In the Hong Kong Airport survey, the answers to the question ‘Shopping in Hong Kong stores gives good value for money’ represent an ordinal scaled variable because the responses ‘almost always, sometimes, very infrequently and never’ are ranked in order of frequency. Figure 1.3 lists other examples of ordinal scaled variables.


ordinal scale Scale of measurement where values are assigned by ranking.

Figure 1.3  Examples of ordinal scaling

CATEGORICAL VARIABLE     ORDERED CATEGORIES
Product satisfaction     Very unsatisfied   Fairly unsatisfied   Neutral   Fairly satisfied   Very satisfied
Clothing size            S   M   L   XL
Type of Olympic medal    Gold   Silver   Bronze
Education level          Primary   Secondary   Tertiary

Ordinal scaling is a stronger form of measurement than nominal scaling because an observed value classified into one category possesses more or less of a property than does an observed value classified into another category. However, ordinal scaling is still a relatively weak form of measurement because the scale does not account for the amount of the differences between the categories. The ordering implies only which category is ‘greater’, ‘better’ or ‘more preferred’ – not by how much.

Interval and ratio scales
Data from a numerical variable are measured on an interval or ratio scale. An interval scale (Figure 1.4) is an ordered scale in which the difference between measurements is a meaningful quantity but does not involve a true zero point. For example, sports shoes for adults are often sold in Australia marked with sizes based on the US or UK system. Neither system has a true zero size. The size below an adult size 1 is a child’s size 13. However, in each system the intervals between sizes are equal.

Figure 1.4  Examples of interval and ratio scales

NUMERICAL VARIABLE                        LEVEL OF MEASUREMENT
Shoe size (UK or US)                      Interval
Height (in centimetres)                   Ratio
Weight (in kilograms)                     Ratio
Salary (in US dollars or Japanese yen)    Ratio

A ratio scale is an ordered scale in which the difference between the measurements involves a true zero point, as in length, weight, age or salary measurements, and the ratio of two values is meaningful. In the Hong Kong Airport survey, the number of times a visitor travelled by ferry is an example of a ratio scaled variable, as six trips is three times as many as two trips. As another example, a carton that weighs 40 kg is twice as heavy as one that weighs 20 kg. Data measured on an interval scale or on a ratio scale constitute the highest levels of measurement. They are stronger forms of measurement than an ordinal scale, because you can determine not only which observed value is the largest but also by how much. Interval and ratio scales may apply for either discrete or continuous data.

interval scale  A ranking of numerical data where differences are meaningful but there is no true zero point.
ratio scale  A ranking where the differences between measurements involve a true zero point.
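
Software can mirror these levels of measurement only partly. The hedged sketch below (not from the text) uses pandas categoricals for a nominal and an ordinal variable and plain numbers for ratio data; no data type distinguishes interval from ratio, so deciding whether a true zero exists remains the analyst’s judgement.

```python
import pandas as pd

# Nominal: categories with no ranking implied.
fuel = pd.Categorical(["Unleaded", "Diesel", "LPG", "Unleaded"])

# Ordinal: categories with a defined ranking.
satisfaction = pd.Categorical(
    ["Satisfied", "Very dissatisfied", "Undecided"],
    categories=["Very dissatisfied", "Dissatisfied", "Undecided",
                "Satisfied", "Very satisfied"],
    ordered=True,
)
print(satisfaction.min(), "<", satisfaction.max())  # the ranking is meaningful

# Ratio: a true zero point, so ratios of values are meaningful.
ferry_trips = [6, 2]
print(ferry_trips[0] / ferry_trips[1])  # 3.0 - six trips is three times two trips
```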


think about this: Telephone polling

Companies such as Newspoll regularly undertake market research and political polling conducted by phone interviews. A phone poll conducted by Newspoll in Sydney in November 2014 asked questions about a number of topics. Some were demographic questions about the number of people who lived in the household and the age, income, occupation and marital status of the participant. What would be the purpose of asking such questions? The other questions could be divided into three sections. The first section related to voting intentions for the next state election and the level of satisfaction with the premier and the opposition leader. The second section asked the participant’s opinion on the renewal of the federal government’s ban on super trawlers. The third section asked a number of questions about domestic and international air travel undertaken in the past year. These questions covered areas such as the purpose of travel, the airlines used and level of satisfaction. Who would use the data collected in this poll? If you were designing a similar poll, how would you construct questions to collect data for the variables referred to above? More recently, political and business functions of Newspoll have been separated. To see how results of the latest political polls are published in the Australian, go to . To see some public opinion poll reports, go to .

Problems for Section 1.2

LEARNING THE BASICS
1.1 Three different types of drinks are sold at a fast-food restaurant – soft drinks, fruit juices and coffee.
    a. Explain why the type of drinks sold is an example of a categorical variable.
    b. Explain why the type of drinks sold is an example of a nominally scaled variable.
1.2 Coffee is sold in three sizes in takeaway cardboard cups – small, medium and large. Explain why the size of the coffee cup is an example of an ordinal scaled variable.
1.3 Suppose that you measure the time it takes to download an MP3 file from the Internet.
    a. Explain why the download time is a numerical variable.
    b. Explain why the download time is a ratio scaled variable.

APPLYING THE CONCEPTS
1.4 For each of the following variables, determine whether the variable is categorical or numerical. If the variable is numerical, determine whether the variable is discrete or continuous. In addition, determine the level of measurement.
    a. Number of mobile phones per household
    b. Length (in minutes) of the longest mobile call made per month
    c. Whether all mobile phones in the household use the same telecommunications provider
    d. Whether there is a landline telephone in the household
1.5 The following information is collected from students as they leave the campus bookshop during the first week of classes:
    a. Amount of time spent shopping in the bookshop
    b. Number of textbooks purchased
    c. Name of degree
    d. Gender
    Classify each of these variables as categorical or numerical. If the variable is numerical, determine whether the variable is discrete or continuous. In addition, determine the level of measurement.
1.6 For each of the following variables, determine whether the variable is categorical or numerical. If the variable is numerical, determine whether the variable is discrete or continuous. In addition, determine the level of measurement.
    a. Name of Internet provider
    b. Amount of time spent surfing the Internet per week
    c. Number of emails received per week
    d. Number of online purchases made per month
1.7 Suppose the following information is collected from Andrew and Fiona Chen on their application for a home loan mortgage at Metro Home Loans:
    a. Monthly expenses: $2,056
    b. Number of dependants being supported by applicant(s): 2
    c. Annual family salary income: $105,000
    d. Marital status: Married
    Classify each of the responses by type of data and level of measurement.


1.8 One of the variables most often included in surveys is income. Sometimes the question is phrased, ‘What is your income (in thousands of dollars)?’ In other surveys, the respondent is asked to ‘Place an X in the circle corresponding to your income group’ and given a number of ranges to choose from.
    a. In the first format, explain why income might be considered either discrete or continuous.
    b. Which of these two formats would you prefer to use if you were conducting a survey? Why?
    c. Which of these two formats would probably bring you a greater rate of response? Why?
1.9 The director of research at the e-business section of a major department store wants to conduct a survey throughout Australia to determine the amount of time working women spend shopping online for clothing in a typical month.
    a. Describe the population and the sample of interest, and indicate the type of data the director might wish to collect.
    b. Develop a first draft of the questionnaire needed in (a) by writing a series of three categorical questions and three numerical questions that you feel would be appropriate for this survey.
1.10 A university researcher designs an experiment to see how generous participants will be in giving to charity. Discuss the types of variables the experiment might give compared with a survey of the same subjects about donations to charity.
1.11 Before a company undertakes an online marketing campaign it needs to consider information about its own current sales and the sales made by its competitors. What categorical data might it use?

1.3  COLLECTING DATA

In the Hong Kong Airport scenario, identifying the data that need to be collected is an important step in the process of marketing the city and operational planning. Some of the data will come from consumers through market research. It is important that the correct inferences are drawn from the research and that appropriate statistical methods assist planners and designers to make the right decisions.

Managing a business effectively requires collecting the appropriate data. In most cases, the data are measurements acquired from items in a sample. The samples are chosen from populations in such a manner that the sample is as representative of the population as possible. The most common technique to ensure proper representation is to use a random sample. (See section 1.4 for a detailed discussion of sampling techniques.) Many different types of circumstances require the collection of data:
• A marketing research analyst needs to assess the effectiveness of a new television advertisement.
• A pharmaceutical manufacturer needs to determine whether a new drug is more effective than those currently in use.
• An operations manager wants to monitor a manufacturing process to find out whether the quality of output being produced is conforming to company standards.
• An auditor wants to review the financial transactions of a company to determine whether or not the company is in compliance with generally accepted accounting principles.
• A potential investor wants to determine which firms within which industries are likely to have accelerated growth in a period of economic recovery.

LEARNING OBJECTIVE 2
Identify how statistics is used in business

Identifying Sources of Data
Identifying the most appropriate source of data is a critical aspect of statistical analysis. If biases, ambiguities or other types of errors flaw the data being collected, even the most sophisticated statistical methods will not produce accurate information. Five important sources of data are:
• data distributed by an organisation or an individual
• a designed experiment
• a survey
• an observational study
• data collected by ongoing business activities.

primary sources  Provide information collected by the data analyser.
secondary sources  Provide data collected by another person or organisation.
focus group  A group of people who are asked about attitudes and opinions for qualitative research.

LEARNING OBJECTIVE 3
Recognise the sources of data used in business

Data sources are classified as either primary sources or secondary sources. When the data collector is the one using the data for analysis, the source is primary. When another organisation or

individual has collected the data that are used for analysis by an organisation or individual, the source is secondary. Organisations and individuals that collect and publish data typically use this information as a primary source and then let others use the data as a secondary source. For example, the Australian federal government collects and distributes data in this way for both public and private purposes. The Australian Bureau of Statistics oversees a variety of ongoing data collection in areas such as population, the labour force, energy, and the environment and health care, and publishes statistical reports. The Reserve Bank of Australia collects and publishes data on exchange rates, interest rates and ATM and credit card transactions. Market research firms and trade associations also distribute data pertaining to specific industries or markets. Investment services such as Morningstar provide financial data on a company-by-company basis. Syndicated services such as Nielsen provide clients with data enabling the comparison of client products with those of their competitors. Daily newspapers in print and online formats are filled with numerical information about share prices, weather conditions and sports statistics.

As listed above, conducting an experiment is another important data-collection source. For example, to test the effectiveness of laundry detergent, an experimenter determines which brands in the study are more effective in cleaning soiled clothes by actually washing dirty laundry instead of asking customers which brand they believe to be more effective. Proper experimental designs are usually the subject matter of more advanced texts, because they often involve sophisticated statistical procedures. However, some fundamental experimental design concepts are considered in Chapter 11.

Conducting a survey is a third important data source. Here, the people being surveyed are asked questions about their beliefs, attitudes, behaviours and other characteristics. Responses are then edited, coded and tabulated for analysis.

Conducting an observational study is the fourth important data source. In such a study, a researcher observes the behaviour directly, usually in its natural setting. Observational studies take many forms in business. One example is the focus group, a market research tool that is used to elicit unstructured responses to open-ended questions. In a focus group, a moderator leads the discussion and all the participants respond to the questions asked. Other, more structured types of studies involve group dynamics and consensus building and use various organisational-behaviour tools such as brainstorming, the Delphi technique and the nominal-group method. Observational study techniques are also used in situations in which enhancing teamwork or improving the quality of products and services are management goals.

Data collected through ongoing business activities are a fifth data source. Such data can be collected from operational and transactional systems that exist in both physical ‘bricks-and-mortar’ and online settings but can also be gathered from secondary sources such as third-party social media networks and online apps and website services that collect tracking and usage data. For example, a bank might analyse a decade’s worth of financial transaction data to identify patterns of fraud, and a marketer might use tracking data to determine the effectiveness of a website.

‘Big Data’

big data Large data sets characterised by their volume, velocity and variety.

Relatively recent advances in information technology allow businesses to collect, process, and analyse very large volumes of data. Because the operational definition of ‘very large’ can be partially dependent on the context of a business – what might be ‘very large’ for a sole proprietorship might be commonplace and small for a multinational corporation – many use the term big data. Big data is more of a fuzzy concept than a term with a precise operational definition, but it implies data that are being collected in huge volumes and at very fast rates (typically in real time) and data that arrive in a variety of forms, both organised and unorganised. These attributes of ‘volume, velocity, and variety’, first identified in 2001 (see reference 1), make big data different from any of the data sets used in this book. Big data increases the use of business analytics because the sheer size of these very large data sets makes preliminary exploration of the data using older techniques impracticable. This effect is explored in Chapter 20.


Big data tends to draw on a mix of primary and secondary sources. For example, a retailer interested in increasing sales might mine Facebook and Twitter accounts to identify sentiment about certain products or to pinpoint top influencers and then match those data to its own data collected during customer transactions.

Data Formatting
The data you collect may be formatted in more than one way. For example, suppose that you wanted to collect electronic financial data about a sample of companies. The data you seek to collect could be formatted in any number of ways, including:
• tables of data
• contents of standard forms
• a continuous data stream
• messages delivered from social media websites and networks.

These examples illustrate that data can exist in either a structured or an unstructured form. Structured data are data that follow some organising principle or plan, typically a repeating pattern. For example, a simple ASX share price search record is structured because each entry would have the name of a company, the last sale, change in price, bid price, volume traded, and so on. Due to their inherent organisation, tables and forms are also structured. In a table, each row contains a set of values for the same columns (i.e. variables), and in a set of forms, each form contains the same set of entries. For example, once we identify that the second column of a table or the second entry on a form contains the family name of an individual, then we know that all entries in the second column of the table or all of the second entries in all copies of the form contain the family name of an individual.

In contrast, unstructured data follow no repeating pattern. For example, if five different people sent you an email message concerning the share trades of a specific company, that data could be anywhere in the message. You could not reliably count on the name of the company being the first words of each message (as in the ASX search), and the pricing, volume and percentage of change data could appear in any order. Earlier in this section, big data was defined, in part, as data that arrive in a variety of forms, both organised and unorganised. You can restate that definition as ‘big data exists as both structured and unstructured data’. The ability to handle unstructured data represents an advance in information technology. Chapter 20 discusses business analytics methods that can analyse structured data as well as unstructured data or semi-structured data. (Think of an application form that contains structured form-fills but also contains an unstructured free-response portion.) With the exception of some of the methods discussed in Chapter 20, the methods taught and the software techniques used in this book involve structured data. Your beginning point will always be tabular data, and for many problems and examples you can begin with that data in the form of a Microsoft Excel worksheet that you can download and use (see companion website).

Electronic formats and encoding need to be considered. Data can exist in more than one electronic format. This affects data formatting, as some electronic formats are more immediately usable than others. For example, which data would you like to use: data in an electronic worksheet file or data in a scanned image file that contains one of the worksheet illustrations in this book? Unless you like to do extra work, you would choose the first format because the second would require you to employ a translation process – perhaps a character-scanning program that can recognise numbers in an image. Data can also be encoded in more than one way, as you may have learned in an information systems course. Different encodings can affect the precision of values for numerical variables, and that can make some data not fully compatible with other data you have collected.

structured data Data that follow an organised pattern.

unstructured data Data that have no repeated pattern.

electronic formats Data in a form that can be read by a computer. encoding Representing data by numbers or symbols to convert the data into a usable form.
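
A brief, illustrative contrast between structured and unstructured data (the company names, prices and message below are all made up): the CSV rows follow one repeating pattern and can be read by column name, whereas the free-text message must be searched.

```python
import csv
import io
import re

# Structured: every row has the same columns in the same order.
csv_text = "company,last_sale,volume\nAcme Ltd,4.52,120000\nZenith Co,1.07,55000\n"
for row in csv.DictReader(io.StringIO(csv_text)):
    print(row["company"], row["last_sale"])

# Unstructured: the same kind of facts, buried in free text with no fixed position.
message = "FYI - picked up more Acme Ltd this morning, paid about $4.52 a share."
price = re.search(r"\$(\d+\.\d+)", message)
print(price.group(1) if price else "price not found")
```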

Data Cleaning
No matter how you choose to collect data, you may find irregularities in the values you collect, such as undefined or impossible values. For a categorical variable, an undefined value would be a value that does not represent one of the categories defined for the variable. For a numerical variable, an impossible value would be a value that falls outside a defined range of possible values for the variable. For a numerical variable without a defined range of possible values, you might also find outliers, values that seem excessively different from most of the rest of the values. Such values may or may not be errors, but they demand a second review.

Missing values are another type of irregularity. They are values that were not able to be collected (and therefore are not available for analysis). For example, you would record a non-response to a survey question as a missing value. You can represent missing values in some computer programs and such values will be properly excluded from analysis. Excel, which is more limited, has no special value that represents a missing value; when using Excel, you must find and then exclude missing values manually.

When you spot an irregularity, you may have to ‘clean’ the data you have collected. A full discussion of data cleaning is beyond the scope of this book. (See reference 2 for more information.)

outliers  Values that appear to be excessively large or small compared with most values observed.
missing values  Refers to when no data value is stored for one or more variables in an observation.
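
A hedged sketch of the checks described above, using pandas (which, unlike Excel, records missing entries with a special NaN value). The age limits and the three-standard-deviation rule used here to flag outliers are arbitrary choices for illustration only.

```python
import pandas as pd

ages = pd.Series([24, 18, None, 53, 430, 22])    # None = a missing value; 430 is impossible

# Missing values: detect and count them.
print(ages.isna().sum())                          # 1 missing value

# Impossible values: outside a defined range of possible values.
print(ages[(ages < 0) | (ages > 120)])            # flags 430 for a second review

# Outliers: excessively different from the rest (rule of thumb: beyond 3 standard deviations).
valid = ages[(ages >= 0) & (ages <= 120)].dropna()
z = (valid - valid.mean()) / valid.std()
print(valid[z.abs() > 3])                         # empty here, but the check is the point
```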

Recoding Variables

recoded variable  A variable that has been assigned new values that replace the original ones.
mutually exclusive  Two events that cannot occur simultaneously.
collectively exhaustive  Set of events such that one of the events must occur.

After you have collected data, you may discover that you need to reconsider the categories that you have defined for a categorical variable, or that you need to transform a numerical variable into a categorical variable by assigning the individual numeric data values to one of several groups. In either case, you can define a recoded variable that supplements or replaces the original variable in your analysis. For example, when defining households by their location, the suburb or town recorded might be replaced by a new variable of the postcode.

When recoding variables, be sure that the category definitions cause each data value to be placed in one and only one category, a property known as being mutually exclusive. Also ensure that the set of categories you create for the new, recoded variables includes all the data values being recoded, a property known as being collectively exhaustive. If you are recoding a categorical variable, you can preserve one or more of the original categories, as long as your recoded values are both mutually exclusive and collectively exhaustive.

When recoding numerical variables, pay particular attention to the operational definitions of the categories you create for the recoded variable, especially if the categories are not self-defining ranges. For example, while the recoded categories ‘Under 12’, ‘12–20’, ‘21–34’, ‘35–59’ and ‘60 and over’ are self-defining for age, the categories ‘Child’, ‘Youth’, ‘Young adult’, ‘Middle aged’ and ‘Senior’ need their own operational definitions.
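
A minimal sketch of recoding a numerical variable into the self-defining age groups mentioned above: pd.cut makes the bins mutually exclusive, and choosing edges that span all possible ages keeps the categories collectively exhaustive. The bin edges are assumptions for illustration only.

```python
import pandas as pd

age = pd.Series([8, 15, 22, 41, 67])

# Recode age (numerical) into mutually exclusive, collectively exhaustive groups.
age_group = pd.cut(
    age,
    bins=[0, 11, 20, 34, 59, 150],                       # each value falls in exactly one bin
    labels=["Under 12", "12–20", "21–34", "35–59", "60 and over"],
)
print(age_group)
```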

Problems for Section 1.3

APPLYING THE CONCEPTS
1.12 The Data and Story Library (DASL) is an online library of data files and stories that illustrate the use of basic statistical methods. Visit the DASL website, click Power search, and explore a datafile of interest to you. Which of the five sources of data best describes the sources of the datafile you selected?
1.13 Visit the website of Ipsos Australia. Read about a recent poll or news story. What type of data source is this based on?
1.14 Visit the website of the Pew Research Center. Read one of today’s top stories. What type of data source is the story based on?
1.15 Transportation engineers and planners want to address the dynamic properties of travel behaviour by describing in detail the driving characteristics of drivers over the course of a month. What type of data collection source do you think the transportation engineers and planners should use?
1.16 Visit the homepage of the Statistics Portal ‘Statista’. Go to Statistics > Popular Statistics, then choose one item to examine. What type of data source is the information presented here based on?


1.4  TYPES OF SURVEY SAMPLING METHODS

LEARNING OBJECTIVE 4
Distinguish between different survey sampling methods

In Section 1.1 a sample was defined as the portion of the population that has been selected for analysis. You collect your data from either a population or a sample depending on whether all items or people about whom you wish to reach conclusions are included. Rather than taking a complete census of the whole population, statistical sampling procedures focus on collecting a small representative group of the larger population. The sample results are then used to estimate characteristics of the entire population. The three main reasons for drawing a sample are:
1. A sample is less time-consuming than a census.
2. A sample is less costly to administer than a census.
3. A sample is less cumbersome and more practical to administer than a census.

The sampling process begins by defining the frame. The frame is a listing of items that make up the population. Frames are data sources such as population lists, directories or maps. Samples are drawn from these frames. Inaccurate or biased results can occur if the frame excludes certain groups of the population. Using different frames to generate data can lead to opposite conclusions. Once you select a frame, you draw a sample from the frame. As illustrated in Figure 1.5, there are two kinds of samples: the non-probability sample and the probability sample.

frame A list of the items in the population of interest.

Figure 1.5  Types of samples

Types of samples used
  Non-probability samples: judgment sample, quota sample, chunk sample, convenience sample
  Probability samples: simple random sample, systematic sample, stratified sample, cluster sample

In a non-probability sample, you select the items or individuals without knowing their probabilities of selection. Thus, the theory that has been developed for probability sampling cannot be applied to non-probability samples. A common type of non-probability sampling is convenience sampling. In convenience sampling, items are selected based only on the fact that they are easy, inexpensive or convenient to sample. In some cases, participants are self-selected. For example, many companies conduct surveys by giving visitors to their website the opportunity to complete survey forms and submit them electronically. The response to these surveys can provide large amounts of data quickly, but the sample consists of self-selected web users. For many studies, only a non-probability sample such as a judgment sample is available. In a judgment sample, you get the opinions of preselected experts in the subject matter as to who should be included in the survey. Some other common procedures of non-probability sampling are quota sampling and chunk sampling. These are discussed in detail in specialised books on sampling methods (see references 3 and 4). Non-probability samples can have certain advantages such as convenience, speed and lower cost. However, their lack of accuracy due to selection bias and their poorer capacity to provide generalised results more than offset these advantages. Therefore, you should restrict the use of non-probability sampling methods to situations in which you want to get rough

approximations at low cost to satisfy your curiosity about a particular subject, or to small-scale studies that precede more rigorous investigations.

non-probability sample  One where selection is not based on known probabilities.
convenience sampling  Selection using a method that is easy or inexpensive.
judgment sample  Gives the opinions of preselected experts.
probability sample  One where selection is based on known probabilities.

In a probability sample, you select the items based on known probabilities. Whenever possible, you should use probability sampling methods. The samples based on these methods allow you to make unbiased inferences about the population of interest. In practice, it is often difficult or impossible to take a probability sample. However, you should work towards achieving a probability sample and acknowledge any potential biases that might exist. The four types of probability samples most commonly used are simple random, systematic, stratified and cluster. These sampling methods vary in their cost, accuracy and complexity.
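
Before turning to the details, here is a small sketch of drawing a simple random sample of n items from a frame of N items, both without replacement and with replacement; the frame of 100 numbered shopping dockets is invented for illustration.

```python
import random

random.seed(7)

frame = list(range(1, 101))   # N = 100 shopping dockets, numbered 1 to 100
n = 10

# Simple random sample without replacement: no docket can be chosen twice.
without_replacement = random.sample(frame, k=n)

# Simple random sample with replacement: a docket can be chosen more than once.
with_replacement = random.choices(frame, k=n)

print(sorted(without_replacement))
print(sorted(with_replacement))
```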

Simple Random Sample

simple random sample One where each item in the frame has an equal chance of being selected.

sampling with replacement  An item in the frame can be selected more than once.

sampling without replacement  Each item in the frame can be selected only once.

table of random numbers  Shows a list of numbers generated in a random sequence.

In a simple random sample, every item from a frame has the same chance of selection as every other item. In addition, every sample of a fixed size has the same chance of selection as every other sample of that size. Simple random sampling is the most elementary random sampling technique. It forms the basis for the other random sampling techniques. With simple random sampling, you use n to represent the sample size and N to represent the frame size. You number every item in the frame from 1 to N. The chance that you will select any particular member of the frame on the first draw is 1/N.

You select samples with replacement or without replacement. Sampling with replacement means that after you select an item you return it to the frame, where it has the same probability of being selected again. Imagine you have a barrel which contains the shopping dockets of N shoppers at a major retail centre who are entering a competition. First assume that each shopper can have only one entry but can win more than one prize. The barrel is rolled, opened and the entry of Jason O’Brien is selected. His docket is replaced, the barrel is rolled again and a second docket is chosen. Jason’s docket has the same probability of being selected again, 1/N. You repeat this process until you have selected the desired sample size n.

However, it is usually more desirable to have a sample of different items than to permit a repetition of measurements on the same item. Sampling without replacement means that once you select an item it cannot be selected again. The chance that you will select any particular item in the frame, say the shopping docket of Jason O’Brien on the first draw, is 1/N. The chance that you will select any shopping docket not previously selected on the second draw is now 1 out of N – 1. This process continues until you have selected the desired sample of size n.

Regardless of whether you have sampled with or without replacement, barrel draw methods have a major drawback for sample selection. In a crowded barrel, it is difficult to mix the entries thoroughly and ensure that the sample is selected randomly. As barrel draw methods are not very useful, you need to use less cumbersome and more scientific methods of selection. One such method uses a table of random numbers (see Table E.1 in Appendix E of this book) for selecting the sample.

A table of random numbers consists of a series of digits listed in a randomly generated sequence (see reference 5). Because the numeric system uses 10 digits (0, 1, 2, …, 9), the chance that you will randomly generate any particular digit is equal to the probability of generating any other digit. This probability is 1 out of 10. Hence, if a sequence of 800 digits is generated, you would expect about 80 of them to be the digit 0, 80 to be the digit 1, and so on. In fact, those who use tables of random numbers usually test the generated digits for randomness prior to using them. Table E.1 has met all such criteria for randomness. Because every digit or sequence of digits in the table is random, the table can be read either horizontally or vertically. The margins of the table designate row numbers and column numbers.


The digits themselves are grouped into sequences of five in order to make reading the table easier. To use such a table instead of a barrel for selecting the sample, you first need to assign code numbers to the individual members of the frame. Then you get the random sample by reading the table of random numbers and selecting those individuals from the frame whose assigned code numbers match the digits found in the table. Example 1.1 demonstrates the process of sample selection.

SELECTING A SIMPLE RANDOM SAMPLE USING A TABLE OF RANDOM NUMBERS
A company wants to select a sample of 32 full-time workers from a population of 800 full-time employees in order to collect information on expenditures concerning a company-sponsored dental plan. How do you select a simple random sample?

EXAMPLE 1.1

SOLUTION

The company can contact all employees by email but assumes that not everyone will respond to the survey, so you need to distribute more than 32 surveys to get the desired 32 responses. Assuming that 8 out of 10 full-time workers will respond to such a survey (i.e. a response rate of 80%), you decide to email 40 surveys. The frame consists of a listing of the names and email addresses of all N = 800 full-time employees taken from the company personnel files. Thus, the frame is an accurate and complete listing of the population.

To select the random sample of 40 employees from this frame, you use a table of random numbers, as shown in Table 1.2. Because the population size (800) is a three-digit number, each assigned code number must also be three digits so that every full-time worker has an equal chance of selection. You give a code of 001 to the first full-time employee in the population listing, a code of 002 to the second full-time employee in the population listing, and so on, until a code of 800 is given to the Nth full-time worker in the listing. Because N = 800 is the largest possible coded value, you discard all three-digit code sequences greater than N (i.e. 801 to 999 and 000).

To select the simple random sample, you choose an arbitrary starting point from the table of random numbers. One method you can use is to close your eyes and strike the table of random numbers with a pencil. Suppose you use this procedure and select row 06, column 05, of Table 1.2 (which is extracted from Table E.1) as the starting point. Although you can go in any direction, in this example you will read the table from left to right in sequences of three digits without skipping. The individual with code number 003 is the first full-time employee in the sample (row 06 and columns 05–07), the second individual has code number 364 (row 06 and columns 08–10) and the third individual has code number 884. Because the highest code for any employee is 800, you discard this number. Individuals with code numbers 720, 433, 463, 363, 109, 592, 470 and 705 are selected third to tenth, respectively (along the way the sequences 919 and 941 are also discarded, because they too exceed 800). You continue the selection process until you get the needed sample size of 40 full-time employees. During the selection process, if any three-digit coded sequence is repeated, you include the employee corresponding to that coded sequence again as part of the sample, if sampling with replacement. You discard the repeating coded sequence if sampling without replacement.
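If you prefer to automate this kind of selection, the table-reading procedure can be mimicked in a few lines of code. The short Python sketch below is an illustration added here, not part of the textbook's Excel-based workflow; the digit stream, function and variable names are invented for the example. It scans a stream of random digits in three-digit groups, discards codes outside 001–800 and, because it samples without replacement, discards repeats.

import random

def select_codes(digits, n, N):
    """Scan a string of random digits in three-digit groups and keep the
    first n valid, non-repeating codes between 1 and N (sampling without
    replacement)."""
    selected = []
    for i in range(0, len(digits) - 2, 3):
        code = int(digits[i:i + 3])
        if 1 <= code <= N and code not in selected:
            selected.append(code)
        if len(selected) == n:
            break
    return selected

random.seed(1)  # fixed seed so this illustration is reproducible
# An invented digit stream standing in for one or more rows of Table E.1
stream = ''.join(random.choice('0123456789') for _ in range(600))
print(select_codes(stream, n=40, N=800))

# Equivalent shortcuts built into Python's standard library:
# random.sample(range(1, 801), 40)      -> sampling without replacement
# random.choices(range(1, 801), k=40)   -> sampling with replacement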


Table 1.2  Using a table of random numbers
Source: Data from the Rand Corporation, A Million Random Digits with 100,000 Normal Deviates (Glencoe, IL: The Free Press, 1955) (displayed in Table E.1 in Appendix E of this book).

Begin selection at row 06, column 05.

        Columns
Row     01–05   06–10   11–15   16–20   21–25   26–30   31–35   36–40
01      49280   88924   35779   00283   81163   07275   89863   02348
02      61870   41657   07468   08612   98083   97349   20775   45091
03      43898   65923   25078   86129   78496   97653   91550   08078
04      62993   93912   30454   84598   56095   20664   12872   64647
05      33850   58555   51438   85507   71865   79488   76783   31708
06      97340   03364   88472   04334   63919   36394   11095   92470
07      70543   29776   10087   10072   55980   64688   68239   20461
08      89382   93809   00796   95945   34101   81277   66090   88872
09      37818   72142   67140   50785   22380   16703   53362   44940
10      60430   22834   14130   96593   23298   56203   92671   15925
11      82975   66158   84731   19436   55790   69229   28661   13675
12      39087   71938   40355   54324   08401   26299   49420   59208
13      55700   24586   93247   32596   11865   63397   44251   43189
14      14756   23997   78643   75912   83832   32768   18928   57070
15      32166   53251   70654   92827   63491   04233   33825   69662
16      23236   73751   31888   81718   06546   83246   47651   04877
17      45794   26926   15130   82455   78305   55058   52551   47182
18      09893   20505   14225   68514   46427   56788   96297   78822
19      54382   74598   91499   14523   68479   27686   46162   83554
20      94750   89923   37089   20048   80336   94598   26940   36858
21      70297   34135   53140   33340   42050   82341   44104   82949
22      85157   47954   32979   26575   57600   40881   12250   73742
23      11100   02340   12860   74697   96644   89439   28707   25815
24      36871   50775   30592   57143   17381   68856   25853   35041
25      23913   48357   63308   16090   51690   54607   72407   55538

Systematic Sample

systematic sample A method that involves selecting the first element randomly then choosing every kth element thereafter.

In a systematic sample, you partition the N items in the frame into n groups of k items, where k = N/n. You round k to the nearest integer. To select a systematic sample, you choose the first item to be selected at random from the first k items in the frame. Then you select the remaining n – 1 items by taking every kth item thereafter from the entire frame. If the frame consists of a listing of prenumbered cheques, sales receipts or invoices, a systematic sample is faster and easier to take than a simple random sample. A systematic sample is also a convenient mechanism for collecting data from telephone directories, class rosters and consecutive items coming off an assembly line.

To take a systematic sample of n = 40 from the population of N = 800 employees, you partition the frame of 800 into 40 groups, each of which contains 20 employees. You then select a random number from the first 20 individuals, and include every 20th individual after the first selection in the sample. For example, if the first number you select is 008, your subsequent selections are 028, 048, 068, 088, 108, … , 768 and 788.

Although they are simpler to use, simple random sampling and systematic sampling are generally less efficient than other, more sophisticated probability sampling methods. Even greater possibilities for selection bias and lack of representation of the population characteristics occur from systematic samples than from simple random samples. If there is a pattern in the frame, you could have severe selection biases.
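As a rough illustration of the mechanics just described (added here for illustration only, with invented names and not part of the original text), the following Python sketch selects a systematic sample by choosing a random starting point within the first k items and then taking every kth item thereafter.

import random

def systematic_sample(N, n):
    """Positions (1 to N) chosen by a systematic sample: a random start
    within the first k items, then every kth item thereafter."""
    k = round(N / n)              # group size, rounded to the nearest integer
    start = random.randint(1, k)  # randomly chosen first selection
    return list(range(start, N + 1, k))[:n]

random.seed(2)
print(systematic_sample(N=800, n=40))  # e.g. 8, 28, 48, ..., 788 if the start is 8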


To overcome the potential problem of disproportionate representation of specific groups in a sample, you can use either stratified sampling methods or cluster sampling methods.

Stratified Sample

In a stratified sample, you first subdivide the N items in the frame into separate subpopulations, or strata. A stratum is defined by some common characteristic. You select a simple random sample, in proportion to the size of the strata, and combine the results from the separate simple random samples. This method is more efficient than either simple random sampling or systematic sampling because you are assured of the representation of items across the entire population. The homogeneity of items within each stratum provides greater precision in the estimates of underlying population parameters.

SELECTING A STRATIFIED SAMPLE
A company wants to select a sample of 32 full-time workers from a population of 800 full-time employees in order to estimate expenditures from a company-sponsored dental plan. Of the full-time employees, 25% are managerial and 75% are non-managerial workers. How do you select the stratified sample so that the sample will represent the correct proportion of managerial workers?

stratified sample Items randomly selected from each of several populations or strata.

strata Subpopulations composed of items with similar characteristics in a stratified sampling design.

EXAMPLE 1.2

SOLUTION

If you assume an 80% response rate, you need to distribute 40 surveys to get the desired 32 responses. The frame consists of a listing of the names and company email addresses of all N = 800 full-time employees included in the company personnel files. Since 25% of the full-time employees are managerial, you first separate the population frame into two strata: a subpopulation listing of all 200 managerial-level personnel and a separate subpopulation listing of all 600 full-time non-managerial workers. Since the first stratum consists of a listing of 200 managers, you assign three-digit code numbers from 001 to 200. Since the second stratum contains a listing of 600 non-managerial-level workers, you assign three-digit code numbers from 001 to 600. To collect a stratified sample proportional to the sizes of the strata, you select 25% of the overall sample from the first stratum and 75% of the overall sample from the second stratum. You take two separate simple random samples, each of which is based on a distinct random starting point from a table of random numbers (Table E.1). In the first sample you select 10 managers from the listing of 200 in the first stratum, and in the second sample you select 30 non-managerial workers from the listing of 600 in the second stratum. You then combine the results to reflect the composition of the entire company.
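A hypothetical sketch of this proportional allocation in Python is shown below. It is an illustration added here, not the textbook's procedure; the employee codes and seed are invented for the example.

import random

random.seed(3)

# Invented identifiers for the two strata: 200 managers, 600 non-managerial workers
managers = ['M%03d' % i for i in range(1, 201)]
non_managers = ['N%03d' % i for i in range(1, 601)]

n = 40  # overall sample size (40 surveys sent, expecting 32 responses at 80%)

# Proportional allocation: 25% of the sample from the managerial stratum and
# 75% from the non-managerial stratum, each drawn as a simple random sample
sample = (random.sample(managers, int(0.25 * n)) +
          random.sample(non_managers, int(0.75 * n)))

print(len(sample))  # 40 in total: 10 managers and 30 non-managerial workers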

Cluster Sample

In a cluster sample, you divide the N items in the frame into several clusters so that each cluster is representative of the entire population. You then take a random sample of clusters and study all items in each selected cluster. Clusters are naturally occurring designations, such as postcode areas, electorates, city blocks, households or sales territories. Cluster sampling is often more cost-effective than simple random sampling, particularly if the population is spread over a wide geographical region. However, cluster sampling often requires a larger sample size to produce results as precise as those from simple random sampling or stratified sampling. A detailed discussion of systematic sampling, stratified sampling and cluster sampling procedures can be found in references 3, 4 and 6.

cluster sample The frame is divided into representative groups (or clusters), then all items in randomly selected clusters are chosen.

cluster A naturally occurring grouping, such as a geographical area.
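For comparison, a minimal Python sketch of cluster sampling is given below. It is illustrative only and not part of the original text; the households and postcode clusters are hypothetical. Whole clusters are selected at random and every item in a chosen cluster is included.

import random
from collections import defaultdict

random.seed(4)

# Invented frame: 1,000 households, each tagged with a naturally occurring
# cluster (here, a postcode)
postcodes = ['2000', '2010', '2020', '2030', '2040']
frame = [('household%04d' % i, random.choice(postcodes)) for i in range(1, 1001)]

# Group the frame by cluster
clusters = defaultdict(list)
for household, postcode in frame:
    clusters[postcode].append(household)

# Randomly select two clusters and study every item in each selected cluster
chosen = random.sample(list(clusters), 2)
cluster_sample = [h for pc in chosen for h in clusters[pc]]

print(chosen, len(cluster_sample))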


Problems for Section 1.4

LEARNING THE BASICS

1.17 For a population containing N = 902 individuals, what code number would you assign for:
a. the first person on the list?
b. the fortieth person on the list?
c. the last person on the list?
1.18 For a population of N = 902, verify that, by starting in row 05 of the table of random numbers (Table E.1), you need only six rows to select a sample of n = 60 without replacement.
1.19 Given a population of N = 93, starting in row 29 of the table of random numbers (Table E.1) and reading across the row, select a sample of n = 15:
a. without replacement
b. with replacement

APPLYING THE CONCEPTS
1.20 For a study that consists of personal interviews with participants (rather than mail or phone surveys), explain why a simple random sample might be less practical than some other methods.
1.21 You want to select a random sample of n = 1 from a population of three items (called A, B and C). The rule for selecting the sample is: flip a coin; if it is heads, pick item A; if it is tails, flip the coin again; this time, if it is heads, choose B; if it is tails, choose C. Explain why this is a random sample but not a simple random sample.
1.22 A population has four members (call them A, B, C and D). You would like to draw a random sample of n = 2, which you decide to do in the following way: flip a coin; if it is heads, the sample will be items A and B; if it is tails, the sample will be items C and D. Although this is a random sample, it is not a simple random sample. Explain why. (If you did problem 1.21, compare the procedure described there with the procedure described in this problem.)
1.23 The town planning department of a Sydney council with a population of N = 40,000 registered voters is asked by the mayor to conduct a survey to measure community attitudes to urban consolidation. The table following contains a breakdown of the 40,000 registered voters by gender and ward of residence.

                     Ward of residence
Gender      North     South     East      West      Total
Female      7,000     5,200     5,000     4,800     22,000
Male        5,600     4,600     4,000     3,800     18,000
Total      12,600     9,800     9,000     8,600     40,000

The planning department intends to take a probability sample of n = 2,000 voters and project the results from the sample to the entire population of voters.
a. If the frame available from the council files is an alphabetical listing of the names of all N = 40,000 registered voters, what type of sample could you take? Discuss.
b. What is the advantage of selecting a simple random sample in (a)?
c. What is the advantage of selecting a systematic sample in (a)?
d. If the frame available from the council’s files is a listing of the names and addresses of all N = 40,000 registered voters, compiled from eight separate alphabetical lists based on the gender and address breakdowns shown in the ward-of-residence table, what type of sample should you take? Discuss.
e. At present East Ward has many high-rise apartments, West Ward and South Ward have single dwellings only and North Ward has a mixture of low- and medium-density housing. What would be the danger in randomly choosing 40 street names and systematically sampling 50 of the residents of those streets?
1.24 Suppose that 5,000 sales invoices are separated into four strata. Stratum 1 contains 50 electrical invoices, stratum 2 contains 500 paint invoices, stratum 3 contains 1,000 plumbing supplies invoices and stratum 4 contains 3,450 hardware invoices. A sample of 500 sales invoices is needed.
a. What type of sampling method should you use? Why?
b. Explain how you would carry out the sampling according to the method stated in (a).
c. Why is the sampling in (a) not simple random sampling?

LEARNING OBJECTIVE 5  Evaluate the quality of surveys

1.5  EVALUATING SURVEY WORTHINESS

Nearly every day you read or hear about survey or opinion poll results in newspapers, on the Internet or on radio or television. To identify surveys that lack objectivity or credibility, you must critically evaluate what you read and hear by examining the worthiness of the survey. First, you must evaluate the purpose of the survey, why it was conducted and for whom it was conducted. An opinion poll or survey conducted to satisfy curiosity is mainly for entertainment. Its result is an end in itself rather than a means to an end. You should be sceptical of such a survey because the result should not be put to further use.


The second step in evaluating the worthiness of a survey is for you to determine whether it was based on a probability or a non-probability sample (as discussed in Section 1.4). You need to remember that the only way to make correct statistical inferences from a sample to a population is through the use of a probability sample. Surveys that use non-probability sampling methods are subject to serious, perhaps unintentional, bias that may render the results meaningless.

Survey Errors

Even when surveys use random probability sampling methods, they are subject to potential errors. Four types of survey errors are:
• coverage error
• non-response error
• sampling error
• measurement error.
Good survey research design attempts to reduce or minimise these various survey errors, often at considerable cost.

Coverage error  The key to proper sample selection is an adequate frame. Remember, a frame is an up-to-date list of all the items from which you will select the sample. Coverage error occurs if certain groups of items are excluded from this frame so that they have no chance of being selected in the sample. Coverage error results in a selection bias. If the frame is inadequate because certain groups of items in the population were not properly included, any random probability sample selected will provide an estimate of the characteristics of the frame, not the actual population. Computer-based surveys are useful for certain studies where the subjects all have Internet access. Coverage error could result if the unemployed, the elderly or indigenous communities are not selected in the frame due to their lack of Internet or email access.

Non-response error  Not everyone is willing to respond to a survey. In fact, research has shown that individuals in the upper and lower socioeconomic classes tend to respond less frequently to surveys than people in the middle class. Non-response error arises from the failure to collect data on all items in the sample and results in a non-response bias. Because you cannot generally assume that people who do not respond to surveys are similar to those who do, you need to follow up on the non-responses after a specified period of time. You should make several attempts to persuade these individuals to complete the survey. The follow-up responses are then compared with the initial responses in order to make valid inferences from the survey (references 3, 4 and 6). The mode of response you use affects the rate of response. The personal interview and the telephone interview usually produce a higher response rate than a mail survey – but at a higher cost.

Sampling error  There are three main reasons for selecting a sample rather than taking a complete census: it is more expedient, less costly and more efficient. However, chance dictates which individuals or items will or will not be included in the sample. Sampling error reflects the heterogeneity, or ‘chance differences’, from sample to sample, based on the probability of certain individuals or items being selected in particular samples. When you read about the results of surveys or polls in newspapers or magazines, there is often a statement regarding margin of error or precision; for example, ‘the results of this poll are expected to be within ±4 percentage points of the actual value’. This margin of error is the sampling error. You can reduce sampling error by taking larger sample sizes, although this also increases the cost of conducting the survey.

coverage error Occurs when all items in a frame do not have an equal chance of being selected. This causes selection bias.

non-response error Occurs due to the failure to collect information on all items chosen for the sample; this causes nonresponse bias.

sampling error The difference in results for different samples of the same size.
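One way to make sampling error concrete is to simulate it. The short Python sketch below is an illustration added here, not taken from the text, and the population values are hypothetical. It repeatedly draws samples from the same population and shows that the sample-to-sample spread of the sample proportion shrinks as the sample size grows.

import random
import statistics

random.seed(5)

# Invented population: 10,000 people, 55% of whom hold a given opinion
population = [1] * 5500 + [0] * 4500

def spread_of_sample_proportions(n, repeats=200):
    """Draw many samples of size n and report how much the sample proportion
    varies from sample to sample, i.e. the size of the sampling error."""
    proportions = [sum(random.sample(population, n)) / n for _ in range(repeats)]
    return statistics.stdev(proportions)

print(spread_of_sample_proportions(100))   # larger spread for small samples
print(spread_of_sample_proportions(1600))  # smaller spread: bigger samples reduce sampling error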


Think about this: The problem of online survey rigging

As the use of online methods for collecting information grows more prevalent we need to be aware that individuals will not all act honestly, especially when they have something to gain. There are many methods being used to contravene the rules of online competitions, such as paying companies to vote, setting up multiple email addresses or Facebook accounts, and using methods to mask the true IP address of the computer being used. Even if a small incentive is offered for completing a survey, similar problems can arise. At an Australian university, students were recently asked to complete a survey about a peerassisted learning program and were offered the chance to win movie tickets as an incentive to give feedback. The survey was carried out anonymously in order to elicit frank responses, but on completion students were automatically sent to a second site where they could register their student ID in order to enter a draw to win movie tickets. One student registered 105 times in order to increase the chance of winning the movie tickets. It is not clear how many times this person completed the survey itself. How could this type of behaviour potentially affect survey results? What could you do to minimise this type of survey error if you were designing an online survey?

Measurement error  In the practice of good survey research, you design a questionnaire with the intention of gathering meaningful information. But you have a dilemma here – getting meaningful measurements is easier said than done. Consider the following proverb: A man with one watch always knows what time it is. A man with two watches always searches to identify the correct one. A man with ten watches is always reminded of the difficulty in measuring time.

measurement error The difference between survey results and the true value of what is being measured.

Unfortunately, the process of getting a measurement is often governed by what is convenient, not what is needed. The measurements are often only a proxy for the ones you really desire. Much attention has been given to measurement error that occurs because of a weakness in question wording (reference 6). A question should be clear, not ambiguous. And, to avoid leading questions, you need to present them in a neutral manner. There are three sources of measurement error: ambiguous wording of questions, the halo effect and respondent error.

The Australian Bureau of Statistics is very conscious of minimising error caused by questionnaire design and survey operations. For the National Health Survey in 2010–11 it used Computer Assisted Interview techniques to collect information. It states that the CAI instrument allows:
• data to be captured electronically at the point of interview, which obviates the cost, logistical, timing and quality issues associated with transport, storage and security of paper forms, and transcription/data entry of information from forms into electronic format
• the ability to use complex sequencing to define specific populations for questions, and ensure word substitutes used in the questions were appropriate to each respondent’s characteristics and prior responses
• the ability, through data validation (edits), to check responses entered against previous responses, reduce data entry errors by interviewers, and enable seemingly inconsistent responses to be clarified with respondents at the time of interview.
The audit trail recorded in the instrument also provides valuable information about the operation of particular questions, and associated data quality issues.
(Australian Bureau of Statistics, Australian Health Survey: Users’ Guide, 2011–2013, electronic publication, Cat. No. 4363.0.55.001, 2013)


The halo effect occurs when the respondent feels obligated to please the interviewer. Proper interviewer training can minimise the halo effect. Respondent error occurs as a result of overzealous or underzealous effort by the respondent. You can minimise this error in two ways: (1) by carefully scrutinising the data and calling back those individuals whose responses seem unusual, and (2) by establishing a program of random call-backs to determine the reliability of the responses. Other sources of error besides measurement error can result from clerical or recording errors. See references 7, 8 and 9 for a more detailed discussion of measurement error and the difficulties of avoiding it.

Ethical Issues

Ethical considerations arise with respect to the four types of potential errors that can occur when designing surveys that use probability samples: coverage error, non-response error, sampling error and measurement error. Coverage error can result in selection bias and becomes an ethical issue if particular groups or individuals are purposely excluded from the frame so that the survey results are skewed, indicating a position more favourable to the survey’s sponsor. Non-response error can lead to non-response bias and becomes an ethical issue if the sponsor knowingly designs the survey in such a manner that particular groups or individuals are less likely to respond. Sampling error becomes an ethical issue if the findings are purposely presented without reference to sample size and margin of error, so that the sponsor can promote a viewpoint that might otherwise be truly insignificant. Measurement error becomes an ethical issue in one of three ways: (1) a survey sponsor chooses leading questions that guide the responses in a particular direction; (2) an interviewer, through mannerisms and tone, purposely creates a halo effect or otherwise guides the responses in a particular direction; (3) a respondent, having a disdain for the survey process, wilfully provides false information.

Ethical issues also arise when the results of non-probability samples are used to form conclusions about the entire population. When you use a non-probability sampling method, you need to explain the sampling procedures and state that the results cannot be generalised beyond the sample.

Problems for Section 1.5

APPLYING THE CONCEPTS
1.25 ‘A survey indicates that the vast majority of university students own their own personal computer.’ What information would you want to know before you accepted the results of this survey?
1.26 A simple random sample of n = 300 full-time employees is selected from a company list containing the names of all N = 5,000 full-time employees in order to evaluate job satisfaction.
a. Give an example of possible coverage error.
b. Give an example of possible non-response error.
c. Give an example of possible sampling error.
d. Give an example of possible measurement error.
1.27 According to a recent cyber security report, ‘millennials remain the most common victims of cybercrime, with 40 percent having experienced cybercrime in the past year’. Reasons given for this include slack online security habits and password sharing (2016 Norton Cyber Security Insights Report, , accessed 16 June 2017). What information would you want to know before you accepted the results of the survey?
1.28 Kiribati is a small, poor Pacific nation under threat from global warming. According to the CIA World Factbook, Kiribati comprises a group of 33 coral atolls in the Pacific Ocean straddling the equator, with elevations varying from 0 to 81 metres above sea level. The low level of some of the islands makes them sensitive to changes in sea level (Central Intelligence Agency, The World Factbook, accessed 16 June 2017). Suppose that an environmental economist has seen results from a survey which claims that 30% of inhabitants of Kiribati are already affected by roads having been permanently cut by rising seawater. What information would she want to know before accepting the results of the survey?


1.29 Reality TV shows have incorporated surveys of audience opinion into their formats. In Australia several shows have allowed the audience to vote on whether contestants should remain on the show or be excluded. Consider a show where voting is by SMS, premium rate phone call, Facebook or another online site, and viewers are limited to 10 votes using each method. Compare this type of survey with a random poll of viewers without replacement conducted by phone for the TV show. a. How might the results differ? b. What are the costs and benefits for the owners of the show for each voting method?

1.30 The online restaurant search site Dimmi encourages diners to rate restaurants they have been to by giving them reward points which can be accumulated until a meal discount is available. A restaurant at The Rocks in Sydney has been rated as follows: Recommended 8.7; Food 8.5; Service 8.7; Value for money 7.8; Atmosphere 8.4. What differences could arise from this type of survey compared with ratings derived from a random sample of diners?

1.6  THE GROWTH OF STATISTICS AND INFORMATION TECHNOLOGY

statistical packages Computer programs designed to perform statistical analysis.

During the past century, statistics has played an important role in spurring the use of information technology and, in turn, such technology has spurred the wider use of statistics. At the beginning of the twentieth century, the expanding data-handling requirements associated with the United States Federal Census led directly to the development of tabulating machines that were the forerunners of today’s business computer systems. Statisticians such as Pearson, Fisher, Gosset, Neyman, Wald and Tukey established the techniques of modern inferential statistics as an alternative to analysing large sets of population data that had become increasingly costly, time-consuming and cumbersome to collect. The development of early computer systems permitted others to develop computer programs to ease the calculation and data-processing burdens imposed by those techniques.

Over time, greater use of statistical methods by business decision makers and advances in computer capacity have led to the development of even more sophisticated statistical methods. Today, when you hear of retailers investing in a ‘customer-relationship management system’, or CRM, or a packaged goods producer engaging in ‘data mining’ to uncover consumer preferences, you should realise that statistical techniques form the foundations of such cutting-edge applications of information technology. As global information storage increases dramatically, businesses are rapidly coming to terms with how to analyse big data – data sets so large and varied that conventional software cannot readily handle them. (Think of the huge volume of data produced each day by people using Visa, Facebook, eBay and Twitter.)

Even though cutting-edge applications might require custom programming, for many years businesses have had access to statistical packages such as Minitab, SPSS/PASW Statistics, SAS and Stata – standardised sets of programs that help managers use a wide range of statistical techniques by automating the data processing and calculations these techniques require. The leasing and training costs associated with statistical packages have led many to consider using some of the graphical and statistical functions of Microsoft Excel. However, you need to be aware that many statisticians have concerns about the accuracy and completeness of the statistical results produced by early versions of Excel. Invalid results could be produced, especially when the data sets were very large or had unusual statistical properties (see reference 10). Microsoft Excel 2010 and subsequent versions made some significant improvements in statistical functions (see references 11 and 12) but it would still be wise to be careful about the data and the analysis you are undertaking.


Assess your progress


Summary

In this chapter you have studied data collection and the various types of data used in business. In the Hong Kong International Airport scenario you were asked to review the visitor survey which will be used to provide information to the tourism authority planning staff (see page 9). Three of the questions shown will produce numerical data and four will produce categorical data. The responses to the first question (number of previous visits to Hong Kong) are discrete, and the responses to the second question (length of time since last visit) are continuous. After the data have been collected, they must be organised and prepared in order to make various analyses. You have also learned about commonly used sampling methods and ways to prepare data for analysis such as encoding, cleaning and recoding. The next two chapters develop tables and charts and a variety of descriptive numerical measures that are useful for data analysis.

Key terms

big data 14 • categorical variables 10 • cluster 21 • cluster sample 21 • collectively exhaustive 16 • continuous variables 10 • convenience sampling 17 • coverage error 23 • data 6 • descriptive statistics 8 • discrete variables 10 • electronic formats 15 • encoding 15 • focus group 14 • frame 17 • inferential statistics 8 • interval scale 11 • judgment sample 17 • measurement error 24 • missing values 16 • mutually exclusive 16 • nominal scale 10 • non-probability sample 17 • non-response error 23 • numerical variables 10 • operational definition 6 • ordinal scale 11 • outliers 16 • parameter 8 • population 8 • primary sources 13 • probability sample 18 • ratio scale 11 • recoded variable 16 • sample 8 • sampling error 23 • sampling with replacement 18 • sampling without replacement 18 • secondary sources 13 • simple random sample 18 • statistic 8 • statistical packages 26 • statistics 6 • strata 21 • stratified sample 21 • structured data 15 • systematic sample 20 • table of random numbers 18 • unstructured data 15 • variables 6

References

1. Laney, D., 3D Data Management: Controlling Data Volume, Velocity, and Variety (Stamford, CT: META Group, 6 February 2001).
2. Osbourne, J., Best Practices in Data Cleaning: A Complete Guide to Everything You Need to Do Before and After Collecting Your Data (Thousand Oaks, CA: Sage Publications, 2013).
3. Cochran, W. G., Sampling Techniques, 3rd edn (New York: Wiley, 1977).
4. Lohr, S. L., Sampling Design and Analysis, 2nd edn (Boston, MA: Brooks/Cole Cengage Learning, 2010).
5. Rand Corporation, A Million Random Digits with 100,000 Normal Deviates (Glencoe, IL: The Free Press, 1955).
6. Groves, R. M., F. J. Fowler, M. P. Couper, J. M. Lepkowski, E. Singer & R. Tourangeau, Survey Methodology, 2nd edn (New York: John Wiley, 2009).
7. Sudman, S., N. M. Bradburn & N. Schwarz, Thinking About Answers: The Application of Cognitive Processes to Survey Methodology (San Francisco, CA: Jossey-Bass, 1996).
8. Biemer, P. B., R. M. Graves, L. E. Lyberg, A. Mathiowetz & S. Sudman, Measurement Errors in Surveys (New York: Wiley Interscience, 2004).
9. Fowler, F. J., Improving Survey Questions: Design and Evaluation, Applied Social Research Methods Series, Vol. 38 (Thousand Oaks, CA: Sage Publications, 1995).


10. McCullough, B. D. & B. Wilson, ‘On the accuracy of statistical procedures in Microsoft Excel 97’, Computational Statistics and Data Analysis, 31 (1999): 27–37.
11. Microsoft Corporation at , accessed June 2017.
12. Microsoft Corporation at , accessed June 2017.

Chapter review problems

CHECKING YOUR UNDERSTANDING
1.31 What is the difference between a sample and a population?
1.32 What is the difference between a statistic and a parameter?
1.33 What is the difference between descriptive and inferential statistics?
1.34 What is the difference between a categorical and a numerical variable?
1.35 What is the difference between a discrete and a continuous variable?
1.36 What is an operational definition and why is it so important?
1.37 What are the four types of measurement scales?
1.38 What are some potential problems with using ‘barrel draw’ methods to select a simple random sample?
1.39 What is the difference between sampling with replacement and sampling without replacement?
1.40 What is the difference between a simple random sample and a systematic sample?
1.41 What is the difference between a simple random sample and a stratified sample?
1.42 What is the difference between a stratified sample and a cluster sample?

APPLYING THE CONCEPTS
1.43 The Australasian Data and Story Library OZDASL is an online library of data files and stories that illustrate the use of basic statistical methods. The stories are classified by method and by topic. Go to this site and click on ‘First Course in Statistics’. Pick a story and summarise how statistics were used in the story.
1.44 Make a list of six ways you have used or encountered statistics in the past week. Think about what you read or heard in a news report or saw on a commercial website. Also think whether you made a bet or participated in a survey.
1.45 The Australian Bureau of Statistics site contains survey information on people, business, geography and other topics. Go to the site and find the latest version of Labour Force, Australia (Cat. No. 6202.0).
a. Briefly describe the Labour Force survey.
b. Give an example of a categorical variable found in this survey.
c. Give an example of a numerical variable found in this survey.
d. Is the variable you selected in (c) discrete or continuous?
1.46 The Australian Bureau of Statistics website allows users to access a large amount of Census data online. Go to and in the Data by Products section click on the latest Census year, enter a location and search for QuickStats.
a. Give an example of a categorical variable found in this summary of survey results.
b. Give an example of a numerical variable found in this summary of survey results.
c. Is the variable you selected in (b) discrete or continuous?
1.47 Detailed information on airport and airline on-time performance can be found at . Explore the departures performance data for different airports and regions.
a. Which of the five types of data sources listed in Section 1.3 do you think were used here?
b. Name a categorical variable for which observations were collected.
c. Name a numerical variable for which observations were collected.
d. What type of recoding has been used here and why?
1.48 Late in 2016 the National Roads and Motorists’ Association (NRMA), a major Australian motoring organisation, released results of a survey that sought to check members’ attitudes to traffic congestion and a motorway extension (see ).
a. Describe the population(s) for this survey.
b. Describe the sample(s) for this survey.
c. Can you identify potential difficulties in comparing these results with results from a similar 2005 survey?
1.49 A manufacturer of flavoured milk is planning to survey households in Tasmania to determine the purchasing habits of consumers. Among the questions to be included are those that relate to:
1. where flavoured milk is primarily purchased
2. what flavour of milk is purchased most often
3. how many people living in the household drink flavoured milk
4. the total number of millilitres of flavoured milk drunk in the past week by members of the household
a. Describe the population.
b. For each of the four items listed, indicate whether the variable is categorical or numerical. If numerical, is it discrete or continuous?
c. Develop five categorical questions for the survey.
d. Develop five numerical questions for the survey.
1.50 A new bus network is proposed for a north-eastern Sydney region. A survey is sent out to residents asking questions which relate to:
1. the resident’s age
2. frequency of bus use
3. usual ticket type purchased
4. main purpose of using the bus
a. Describe the population.
b. Indicate whether each of the questions above is categorical or numerical.


c. Develop two more numerical questions and state whether the variables are discrete or continuous.
d. Develop two more categorical questions.
1.51 Political polling has traditionally used telephone interviews. Researchers at a polling organisation argue that Internet polling is less expensive and faster, and offers higher response rates than telephone surveys. Critics are concerned about the scientific reliability of this approach. Even amid this strong criticism, Internet polling is becoming more and more common. What concerns, if any, do you have about Internet polling?
1.52 Statistics New Zealand mentions a number of possible sources of non-sampling error in economic surveys in A Guide to Good Survey Design, 3rd edition, which can be downloaded from .
a. Which of the four types of survey error from Section 1.5 are identified on this site as a non-sampling error?
b. Discuss which errors would be more difficult to eliminate.
1.53 Researchers at a university wish to conduct a survey of past students to ascertain how frequently they are using statistical techniques in the workforce. The researchers have permission from the ethics committee to use the last recorded email and postal addresses to contact ex-students, but these may be out of date, particularly as many students have returned to homes overseas without updating their records. The emails and letters are sent out simultaneously. The response to the survey is low.


a. What type of errors or biases should the researchers be especially concerned with?
b. What step(s) should the researchers take to try to overcome the problems noted in (a)?
c. What could have been done differently to improve the survey’s worthiness?
1.54 According to a survey conducted by the Australian Interactive Media Industry Association, 77% of mobile phone users surveyed pay by a monthly phone bill compared to 21% who are on pre-paid plans. The percentage of respondents that have data included in their payment plans is 84% (M. M. Mackay, Australian Mobile Phone Lifestyle Index, 9th edn, October 2013, , accessed 24 January 2014).
a. What other information would you want to know before you accepted the results of this survey?
b. Suppose that you wished to conduct a similar survey for the geographic region you live in. Describe the population for your survey.
c. Explain how you could minimise the chance of a coverage error in this type of survey.
d. Explain how you could minimise the chance of a non-response error in this type of survey.
e. Explain how you could minimise the chance of a sampling error in this type of survey.
f. Explain how you could minimise the chance of a measurement error in this type of survey.

Continuing cases

Tasman University
Tasman University’s Tasman Business School (TBU) regularly surveys business students on a number of issues. In particular, students within the school are asked to complete a student survey when they receive their grades each semester. The results of Bachelor of Business (BBus) students who responded to the latest undergraduate (UG) survey are stored in < TASMAN_UNIVERSITY_BBUS_STUDENT_SURVEY >.
a For each question asked in the survey, determine whether the variable is categorical or numerical. If you determine that the variable is numerical, identify whether it is discrete or continuous.
b A separate survey has been carried out for Master of Business Administration (MBA) students. Results for these postgraduate (PG) students are in the file < TASMAN_UNIVERSITY_MBA_STUDENT_SURVEY >. Repeat the analysis you carried out in (a) for the postgraduate survey results.

As Safe as Houses
To analyse the real estate market in non-capital cities and towns in states A and B, Safe-As-Houses Real Estate, a large national real estate company, has collected samples of recent residential sales from a sample of non-capital cities and towns in these states. The data are stored in < REAL_ESTATE >.
a Identify data sources and discuss the type of sampling that was most likely used to collect these data.
b Suggest any additional variables that could be collected in order to explain property prices, and determine if they are numerical or categorical, discrete or continuous.


Chapter 1 Excel Guide

EG1.1 GETTING STARTED WITH MICROSOFT EXCEL

Microsoft Excel is the electronic worksheet program of Microsoft Office. Although not a specialised statistical program, Excel contains basic statistical functions, and the Excel 2016 PC and Mac versions include Data Analysis Toolpak procedures that you can use to perform selected advanced statistical methods. To use the Data Analysis Toolpak you must select it as an Excel add-in. You can also install the PHStat add-in (available for separate purchase or with some textbooks) to extend and enhance the Data Analysis Toolpak that Microsoft Excel contains. (You do not need to use PHStat in order to use Microsoft Excel with this text, although using PHStat will simplify using Excel for statistical analysis.)

In Microsoft Excel, you create or open and save files that are called workbooks. Workbooks are collections of worksheets and related items, such as charts, that contain the original data as well as the calculations and results associated with one or more analyses. Because of its widespread distribution, Microsoft Excel is a convenient program to use, but some statisticians have expressed concern about its lack of fully reliable and accurate results for some statistical procedures. Although Microsoft has recently improved many statistical functions, especially from Excel 2010 onwards, you should be somewhat cautious about using Microsoft Excel to perform analyses on data other than the data used in this text. (If you plan to install PHStat, make sure you first read Appendix F and any PHStat read-me file.)

You can use Excel to learn and apply the statistical methods discussed in this book and as an aid in solving end-of-section and end-of-chapter problems. For many topics, you may choose to use the ‘Excel How-to’ instructions. These instructions use pre-constructed worksheets as models or templates for a statistical solution. You learn how to adapt these worksheets to construct your own solutions. Many of these sections feature a specific Excel Guide workbook that contains worksheets that are identical to the worksheets that PHStat creates. Because both of these methods create the same results and the same worksheets, you can use a combination of them as you read through this book.

The ‘Excel How-to’ instructions and the Excel Guide workbooks work best with the latest versions of Microsoft Excel, including Excel 2016 and Excel 2013 (Microsoft Windows), Excel 2016 for Mac, and Office 365. (Excel Guides also contain instructions for using the Analysis ToolPak add-in that is included with most of the latest Microsoft Excel versions.) (Microsoft Excel 2016, Microsoft Corporation, 2015)

You will want to master the basic skills listed in Table EG1.1 before you begin using Microsoft Excel to understand statistical concepts and solve problems. If you plan to use the ‘Excel How-to’ instructions, you will also need to master the skills listed in the lower part of the table.

Table EG1.1  Basic skills for using Microsoft Excel

Excel skill                    Specifics
Excel data entry               Organising worksheet data in columns; entering numerical and categorical data
File operations                Open; Save; Print
Worksheet operations           Create; Copy and paste
Formula skills                 Concept of a formula; cell references; absolute and relative cell references; how to enter a formula; how to enter an array formula
Workbook presentation          How to apply format changes that affect the display of worksheet cell contents
Chart formatting correction    How to correct the formatting of charts that Excel improperly creates
Discrete histogram creation    How to create a properly formatted histogram for a discrete probability distribution


Table EG1.2  Excel typographic conventions

Operation: Keyboard keys
Examples: Enter; Ctrl; Shift
Notes: Names of keys are always the object of the verb press, as in ‘press Enter’.

Operation: Keystroke combinations
Examples: Ctrl+C; Ctrl+Shift+Enter; Command+Enter
Notes: Keyboarding actions that require you to press more than one key at the same time. Ctrl+C means press C while holding down Ctrl. Ctrl+Shift+Enter means press Enter while holding down both Ctrl and Shift.

Operation: Click or select operations
Examples: Click OK; select the first 2-D Bar gallery item
Notes: Mouse pointer actions that require you to single click an onscreen object. This book uses the verb select when the object is either a worksheet cell or an item in a gallery, menu, list or Ribbon tab.

Operation: Menu or ribbon selection
Examples: File ➔ New; Layout ➔ Legend ➔ None
Notes: A sequence of Ribbon or menu selections. File ➔ New means first select the File tab and then select New from the list that appears.

Operation: Placeholder object
Examples: variable 1 cell range; bins cell range
Notes: An italicised bold-faced phrase is a placeholder for an object reference. In making entries, you enter the reference (e.g. A1:A10) and not the placeholder.

While you do not necessarily need these skills if you plan to use PHStat, knowing them will be useful if you expect to customise the Excel worksheets that PHStat creates or expect to be using Excel beyond the course that uses this book. The list of skills in Table EG1.1 begins with the more basic skills and progresses towards slightly more advanced skills that you will need to use less frequently. Table EG1.2 presents the typographic conventions that the Excel Guides in this book use to present computer operations.

EG1.2 OPENING AND SAVING WORKBOOKS

Once you open the Excel program a new workbook will be displayed where you can begin entering data in rows and columns. Figure EG1.1 shows a newly opened workbook in Excel 2016. It contains the elements that are common with most Microsoft Windows programs. If you wish to use a workbook created previously you will need to use the following commands. If you are using Microsoft Excel 2016, select File ➔ Open. In the Backstage view you will be given a choice of selecting from Recent Workbooks, OneDrive or the Computer. You can browse, select the file to be opened and then click on the OK button. If you cannot find your file, you may need to do one or more of the following:
• Use the scroll bars or the slider, if present, to scroll through the entire list of files.
• Select the correct folder from the drop-down list at the left-hand side of the dialog box.
• To search every file in the folder, leave All Files showing at the bottom of the dialog box. If you want a specific type of file such as text files, use the arrow to open a drop-down menu and then select Text Files.

In Excel 2016, select File ➔ Save As, and in the Backstage view choose the location. In the dialog box enter (or edit) the name of the file in the File name box and click on the OK button. If applicable, you can also do the following:
• Change to another folder by selecting that folder from the Save in drop-down list.
• Change the Save as type value to something other than the default choice, Microsoft Excel Workbook. Text (Tab delimited) or CSV (Comma delimited) are two file types sometimes used to share Excel data with other programs.

After saving your work, you should consider saving your file a second time, using a different name, to create a backup copy of your work. Read-only files cannot be saved to their original folders unless the name is changed.

EG1.3 ENTERING DATA

The main worksheet area is composed of rows and columns that you use for data entry. You enter data into the rows and columns of a worksheet.


Figure EG1.1  The Excel 2016 window. The labelled elements of the window include the Quick Access Toolbar, the Ribbon (with its tabs, groups and launcher buttons), the formula bar, the title bar, the Minimise, Resize and Close buttons, the column and row labels, the workspace area with an opened workbook, the scroll bars and the sheet tab.

By convention, and the style used in this book, when you enter data for a set of variables you enter the name of each variable into the cells of the first row, beginning with column A. Then you enter the data for the variable in the subsequent rows to create a DATA worksheet similar to the one shown in Figure EG1.2, which contains data from an auction sale. Note that the formula used in the active cell F6 can be seen on the formula bar.

To enter data in a specific cell, either use the cursor keys to move the cell pointer to the cell or use your mouse to select the cell directly. As you type, what you type appears in the formula bar. Complete your data entry by pressing Tab or Enter or by clicking the checkmark button in the formula bar. When you enter data, never skip any rows in a column and, as a general rule, avoid skipping any columns. Also try to avoid using numbers as row 1 variable headings; if you cannot avoid their use, precede such headings with apostrophes.

cannot avoid their use, precede such headings with apostrophes. Pay attention to any special instructions that occur throughout the book for the order of the entry of your data. For some statistical methods, entering your data in an order that Excel does not expect will lead to incorrect results.

Figure EG1.2 An example of a DATA worksheet

To refer to a specific entry, or cell, you use Sheetname!ColumnRow notation. For example, Data!A2 refers to the cell in column A and row 2 of the Data worksheet. To refer to a specific group or range of cells, you use Sheetname!Upperleftcell:Lowerrightcell notation. For example, Data!A2:B11 refers to the 20 cells that are in rows 2 to 11 of columns A and B of the Data worksheet. An absolute address for cell A6 is written as $A$6; even if a formula using this address is copied to another row or column, it will still refer to this cell. However, if the formula is written with the relative address A6, copying the formula to another location will change the reference cell. Both absolute and relative addresses may be necessary in one sheet, depending on the operations intended. Also note that $A6 freezes the column but not the row, and A$6 freezes the row but allows the column to change.
Each Microsoft Excel worksheet has its own name. By default, Microsoft Excel names worksheets Sheet1, Sheet2 and so on. You should rename your worksheets, giving them more self-descriptive names, by double-clicking on the sheet tab that appears at the bottom of each sheet, typing a new name and pressing the Enter key.

EG1.4 USING FORMULAS IN EXCEL WORKSHEETS
Formulas are worksheet cell entries that perform a calculation or some other task. You enter a formula by typing the equals sign (=) followed by some combination of mathematical or other data-processing operations. For simple formulas, you use the symbols +, -, *, / and ^ for addition, subtraction, multiplication, division and exponentiation (raising a number to a power), respectively. For example, the formula =Data!B2 + Data!B3 + Data!B4 + Data!B5 adds the contents of cells B2, B3, B4 and B5 of the Data worksheet and displays the sum as the value in the cell containing the formula.
You can also use Microsoft Excel functions in formulas to simplify them. To find lists of the functions that can be selected in Excel, click the fx Function Wizard symbol on the formula bar. For example, the formula =SUM(Data!B2:B5), using the Excel SUM() function, is a shorter equivalent of the formula above. You can also use cell or cell-range references that omit the Sheetname! part, such as B2 or B2:B5; such references always refer to the worksheet in which the formula has been entered.
Formulas allow you to create generalised solutions and give Excel its distinctive ability to recalculate results automatically when you change the values of the supporting data. Typically, when you use a worksheet, you see only the results of any formulas entered, not the formulas themselves. However, for your reference, many illustrations of Microsoft Excel worksheets in this text also show the underlying formulas adjacent to the results they produce. In Excel 2016, select Formulas ➔ Formula Auditing ➔ Show Formulas to display the formulas themselves rather than their results; to restore the original view, click Show Formulas again.

EG1.5 CREATING CHARTS
The method of creating charts can vary according to the version of Excel you are using. Both of the following methods are available in Excel 2016.
• Method 1 A feature in Excel 2016 allows you to create charts easily using the Quick Analysis tool. Simply highlight an area of the spreadsheet containing some data you wish to graph by clicking on the top left-hand cell, then dragging the mouse. The range may contain labels. Click on the small box that appears in the bottom right-hand corner to open Quick Analysis. Select Charts, then, by hovering the mouse over the different chart types, you can see previews of recommended charts for the selected data. You can also choose More, which opens a dialog box with a more extensive range of options. Once a chart is selected there are several ways you can modify it by clicking on the icons that appear on its right-hand side: Chart Elements (+), Chart Styles (paintbrush) and Chart Filters (filter). You will also now see that multiple design options are shown on the ribbon, together with options to change colours or chart type. By right-clicking on the background area of the chart you can also activate a drop-down menu. If you choose Format Chart Area, a menu opens on the right-hand side of the spreadsheet that allows you to change the format of the chart and text in many ways. If instead you choose Move Chart, you can choose a new location on another sheet. To reposition the chart on the existing sheet, simply click on it and drag. To resize it, drag one of the circles on its border.
• Method 2 Highlight the area of the spreadsheet with your data as described above. If you wish to select areas that are not adjacent, hold down the Ctrl key while selecting; the area selected must be rectangular. Click on the Insert tab, then from the Charts area click on Recommended Charts and select a particular format from the drop-down gallery. Alternatively, you can select a chart type from the icons shown. Once the chart is created it can be formatted or enhanced by clicking on it and following the instructions given for Method 1.

Figure EG1.3 shows an example of a chart created in Excel 2016 with the Format Axis panel open.

EG1.6 PRINTING WORKBOOKS
Before printing, you may select a print area if you do not want the whole sheet printed. To print Excel 2016 worksheets, select File ➔ Print. A print preview is created automatically, as can be seen in Figure EG1.4. Various print settings are available in the drop-down list boxes. Clicking on Page Setup gives access to more choices, such as changing from Portrait to Landscape orientation, as would suit the worksheet shown. When you are satisfied with the settings and the look of the preview, click on the Print button. Note that if you want only a part of the worksheet to be printed, it is easier to set this using the Page Layout tab, then Page Setup ➔ Print Area.


Figure EG1.3 An example of a chart created in Excel 2016 with the Format Axis panel open (labelled elements: chart area, plot area, chart title, legend, vertical axis title, horizontal axis title)

Page Setup also allows you to customise printing before you print, for example by changing the print orientation or adding gridlines. Once you are satisfied with the results, click on the Print button in the print preview window, then OK in the Print dialog box.

The Print Backstage view (see Figure EG1.4) contains settings to select the printer to be used, what parts of the workbook to print (the active worksheet is the default) and the number of copies to produce (1 is the default). If you need to change these settings, change them before clicking on the OK button.

Figure EG1.4 The Excel 2016 Backstage view with Print and Page Setup selected


After printing, you should verify the contents of your printout. Most printing failures will trigger the display of an error message that you can use to work out the source of the failure.

EG1.7 HOW USING EXCEL FOR MAC DIFFERS
Excel 2016 for Mac comes with the Analysis ToolPak add-in, but earlier versions did not. If you don't have a current version, it is possible to download software made by third-party companies to perform some of the same statistical analysis tasks. The free program StatPlus®:mac LE, for instance, will allow you to run a regression, calculate descriptive statistics and run analysis of variance tests. Further capability is available in the Pro edition at a cost.
In Excel 2016 for Mac you can open a new workbook when the program opens by using New ➔ Blank Workbook ➔ Create. The easiest way to save a new workbook is to click on the Save icon on the quick access toolbar. A Save As dialog box will allow you to choose a file name, a location for the file and the file format. You can also choose File ➔ Save to begin this process.
To create a chart in Excel 2016 for Mac, use Method 2 described in Section EG1.5. With the chart selected, click on the Chart Design tab. You will find that extra options such as Add Chart Element, Quick Layout and Switch Row/Column open on the ribbon to allow more formatting.
To print a worksheet or selection, use File ➔ Print, then under Printer select the printer you wish to use. The default is that all active worksheets will be printed; to modify this, select Show Details, choose the option preferred from the drop-down menu, and finally select Print.

EG1.8 DEFINING DATA
Establishing the Variable Type
Microsoft Excel infers the variable type from the data you enter into a column. If Excel discovers a column that contains numbers, for example, it treats the column as a numerical variable. If Excel discovers a column that contains words or alphanumeric entries, it treats the column as a non-numerical (categorical) variable.
This imperfect method works most of the time, especially if you make sure that the categories for your categorical variables are words or phrases such as 'yes' and 'no'. However, because you cannot explicitly define the variable type, Excel can mistakenly offer or allow you to do nonsensical things, such as using a statistical method designed for numerical variables on categorical variables. If you must use coded values such as 1, 2 or 3, enter them preceded by an apostrophe, as Excel treats all values that begin with an apostrophe as non-numerical data. (You can check whether a cell entry includes a leading apostrophe by selecting the cell and viewing its contents in the formula bar.)


EG1.9 COLLECTING DATA
Recoding Variables
Key technique To recode a categorical variable, you first copy the original variable's column of data and then use the find-and-replace function on the copied data. To recode a numerical variable, or a categorical variable with only two values, enter a formula that returns a recoded value in a new column.
Example Imagine that we have collected data at an airport using a survey such as shown on page 9. The Recode workbook shows how the original variables 'Accommodation satisfaction' and 'Business visit' have been recoded.
Excel how-to Two recoded variables were created by first opening the Airport Survey worksheet in the Recode workbook and then following these steps:
1. Right-click column B (right-click over the shaded 'B' at the top of column B) and click Copy in the shortcut menu.
2. Right-click column C and click the first choice in the Paste Options gallery.
3. Enter Accommodation code in cell C1.
4. Select column C. With column C selected, click Home ➔ Find & Select ➔ Replace.
In the Replace tab of the Find and Replace dialog box:
5. Enter Very satisfied as Find what, 1 as Replace with, and then click Replace All.
6. Click OK to close the dialog box that reports the results of the replacement command.
7. Still in the Find and Replace dialog box, enter Very dissatisfied as Find what (replacing Very satisfied), and 5 as Replace with, then click Replace All.
8. Click OK to close the dialog box that reports the results of the replacement command.
9. Continue to replace the words Dissatisfied, Satisfied and Undecided with the numbers 4, 2 and 3 respectively using this method. (This creates the recoded variable Accommodation code in column C.)
10. Enter Business visit code in cell H1.
11. Enter the formula =IF(F2="No", 0, 1) in cell H2.
12. Copy this formula down the column to the last row that contains visitor data (row 31). (This creates the recoded variable Business visit code in column H.)
The Recode workbook uses the IF function to recode the two categories as numbers. Numerical variables can also be recoded into multiple categories by using a more advanced technique based on the VLOOKUP function.
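For readers who also work outside Excel, the same recoding ideas can be expressed in a few lines of pandas. This is only an illustrative sketch, not part of the textbook's workbooks; the small data frame below is invented for illustration.

```python
import pandas as pd

# Invented survey responses (not the textbook's Recode workbook data)
survey = pd.DataFrame({
    "Accommodation satisfaction": ["Very satisfied", "Satisfied", "Undecided",
                                   "Dissatisfied", "Very dissatisfied"],
    "Business visit": ["Yes", "No", "No", "Yes", "No"],
})

# Recode a multi-valued categorical variable via a mapping (find-and-replace analogue)
satisfaction_codes = {"Very satisfied": 1, "Satisfied": 2, "Undecided": 3,
                      "Dissatisfied": 4, "Very dissatisfied": 5}
survey["Accommodation code"] = survey["Accommodation satisfaction"].map(satisfaction_codes)

# Recode a two-valued categorical variable, mirroring the formula =IF(F2="No", 0, 1)
survey["Business visit code"] = (survey["Business visit"] != "No").astype(int)

print(survey)
```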


EG1.10 TYPES OF SAMPLING METHODS
Simple Random Sample
Key technique Use the RANDBETWEEN(smallest integer, largest integer) function to generate a random integer that can then be used to select an item from a frame.
Example Create a simple random sample with replacement of size 40 from a population of 800 items.
Excel how-to Enter a formula that uses this function and then copy the formula down a column for as many rows as necessary. For example, to create a simple random sample with replacement of size 40 from a population of 800 items, open a new worksheet. Enter Sample in cell A1 and enter the formula =RANDBETWEEN(1, 800) in cell A2. Then copy the formula down the column to cell A41.
Excel contains no functions to select a random sample without replacement. Such samples are most easily created using an add-in such as PHStat or the Analysis ToolPak, as described in the following paragraphs.
Analysis ToolPak Use Sampling to create a random sample with replacement. For the example, assume you have a worksheet that contains the population of 800 items in column A, with a column heading in cell A1. Select Data ➔ Data Analysis. In the Data Analysis dialog box, select Sampling from the Analysis Tools list and then click OK. In the procedure's dialog box:
1. Enter A1:A801 as the Input Range and check Labels.
2. Click Random and enter 40 as the Number of Samples.
3. Click New Worksheet Ply and then click OK.

Example Create a simple random sample without replacement of size 40 from a population of 800 items.
PHStat Use Random Sample Generation. For the example, select PHStat ➔ Sampling ➔ Random Sample Generation. In the procedure's dialog box:
1. Enter 40 as the Sample Size.
2. Click Generate list of random numbers and enter 800 as the Population Size.
3. Enter a Title and click OK.
Unlike most other PHStat results worksheets, the worksheet created contains no formulas.

Excel how-to Use the COMPUTE worksheet of the Random workbook as a template. The worksheet already contains 40 copies of the formula =RANDBETWEEN(1, 800) in column B. Because the RANDBETWEEN function samples with replacement, as discussed at the start of this section, you may need to add further copies of the formula in new column B rows until you have 40 unique values. If your intended sample size is large, you may find it difficult to spot duplicates. See the ADVANCED worksheet in the Random workbook for more information about an advanced technique that uses formulas to detect duplicate values.
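Outside Excel, both kinds of sample can be drawn with Python's standard random module. The sketch below is illustrative only; it assumes a frame simply numbered 1 to 800 to match the example above and is not part of the Random workbook.

```python
import random

population = range(1, 801)   # frame of 800 items, numbered 1 to 800

# Simple random sample WITH replacement (the RANDBETWEEN approach):
# the same item may be selected more than once.
with_replacement = [random.randint(1, 800) for _ in range(40)]

# Simple random sample WITHOUT replacement:
# each item can be selected at most once, so all 40 values are unique.
without_replacement = random.sample(population, 40)

print(sorted(with_replacement))
print(sorted(without_replacement))
```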

Microsoft® product screen shots are reprinted with permission from Microsoft Corporation.


CHAPTER 2
Organising and visualising data

FESTIVAL EXPENDITURE
A council is investigating the contribution to the local economy of visitors to an annual three-day music festival. Kai, a researcher employed by the council, has collected data from a random sample of non-local festival attendees aged 18 years and over. This data includes the total amount spent in the region during the festival, excluding festival tickets, and whether the festival attendee has travelled from within the state (intrastate), from another state (interstate) or from another country (international) to attend the festival. The data is stored in the < FESTIVAL > file.

Kai is interested in answering the following questions:
■ What is the typical amount spent during the festival by intrastate, interstate and international visitors?
■ How does the amount spent vary between visitors and between intrastate, interstate and international visitors?
■ Is there a difference in the amount spent between intrastate, interstate and international visitors?


LEARNING OBJECTIVES

After studying this chapter you should be able to:
1 describe the distribution of a single categorical variable using tables and charts
2 describe the distribution of a single numerical variable using tables and graphs
3 describe the relationship between two categorical variables using contingency tables
4 describe the relationship between two numerical variables using scatter diagrams and time-series plots
5 develop dashboard elements such as sparklines, gauges, bullet graphs and treemaps for descriptive analytics
6 correctly present data in graphs

Kai needs to organise the data into usable forms. One way of doing this is to use tables or charts to organise and visualise the data. This chapter helps you to select and construct appropriate tables and charts. We can also use numerical measures to determine certain characteristics of the data, such as their centre and spread. These numerical descriptive measures are covered in the next chapter. From Chapter 1 we know that data can be either categorical or numerical.

LEARNING OBJECTIVE 1 Describe the distribution of a single categorical variable using tables and charts

2.1 ORGANISING AND VISUALISING CATEGORICAL DATA
The expenditure data in the file are examples of raw data – that is, data presented just as they were collected. Raw data give very little information, but by using summary tables and charts we can condense and present the data in a meaningful way. For categorical data, you first divide the data into categories and then present the frequency or percentage in each category in a table or chart.

Organising Categorical Data: Summary Table
summary table Summarises categorical or numerical data; gives the frequency, proportion or percentage of data values in each category or class.

A summary table gives the frequency, proportion or percentage of the data in each category, which allows you to see differences between the categories. A summary table lists the categories in one column and the frequency, percentage or proportion in a separate column or columns. Table 2.1 illustrates a summary table based on a recent survey that asked why people shopped for groceries online. From this table, stored in < ONLINE SHOPPING >, the most common reason for grocery shopping online was convenience, followed by competitive prices and quality products. Very few respondents shopped for groceries online because of a comfortable environment or well-displayed products.

Table 2.1  Reasons for grocery shopping online
Reason                       Percentage
Comfortable environment               8
Competitive prices                   20
Convenience                          28
Customer service                     13
Products well displayed               3
Quality products                     18
Variety/range of products            10
Total                               100
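If the raw survey responses are available (rather than the already tallied percentages of Table 2.1), a summary table can be produced in a few lines of pandas. This is only an illustrative sketch; the short list of responses below is invented and is not the < ONLINE SHOPPING > data.

```python
import pandas as pd

# Invented raw responses; in practice these would come from the survey file
responses = pd.Series([
    "Convenience", "Convenience", "Competitive prices", "Quality products",
    "Customer service", "Convenience", "Variety/range of products",
    "Competitive prices", "Comfortable environment", "Quality products",
])

frequency = responses.value_counts()                       # counts per category
percentage = responses.value_counts(normalize=True) * 100  # percentage per category

summary = pd.DataFrame({"Frequency": frequency, "Percentage": percentage.round(1)})
print(summary)
print("Total responses:", frequency.sum())
```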


EXAMPLE 2.1
SUMMARY TABLES FOR LOCATION AND TYPE OF PROPERTIES
In other research Kai is exploring the property market in the council area. Data from 100 recent property sales is stored in < PROPERTY >. These properties are classified according to location, either in town or rural, and also by type, either a house or a unit. Construct summary tables for the properties categorised by location and type.

SOLUTION

Table 2.2A  A frequency and percentage summary table for the location of 100 recent property sales
Location    Number (frequency) of properties    Percentage of properties
Rural                     34                            34.0
Town                      66                            66.0
Total                    100                           100.0

From Table 2.2A we can see that approximately twice as many town properties were sold as rural properties.

Table 2.2B  A frequency and percentage summary table for the type of 100 recent property sales
Type     Number of properties    Percentage of properties
House             82                      82.0
Unit              18                      18.0
Total            100                     100.0

From Table 2.2B we can see that relatively few units were sold.

Visualising Categorical Data: Bar Charts
Each category in a bar chart is represented by a bar, the length of which indicates the proportion, frequency or percentage of values falling into that category. Figure 2.1 displays a bar chart of the reasons for grocery shopping online, presented in Table 2.1. Bar charts allow you to compare percentages, frequencies or proportions in the different categories. In Figure 2.1 the most common reason for shopping online is convenience, followed by competitive prices. Very few respondents shopped for groceries online because of a comfortable environment or well-displayed products.

bar chart Graphical representation of a summary table for categorical data; the length of each bar represents the proportion, frequency or percentage of data values in a category.

Figure 2.1  Microsoft Excel bar chart of the reasons for grocery shopping online (horizontal bars, one per category; percentage on the horizontal axis, 0 to 30%)
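A chart like Figure 2.1 can also be drawn programmatically. The following matplotlib sketch uses the percentages from Table 2.1; it is an illustration of the idea, not the Excel chart itself.

```python
import matplotlib.pyplot as plt

# Percentages from Table 2.1
reasons = ["Comfortable environment", "Competitive prices", "Convenience",
           "Customer service", "Products well displayed", "Quality products",
           "Variety/range of products"]
percentages = [8, 20, 28, 13, 3, 18, 10]

fig, ax = plt.subplots()
ax.barh(reasons, percentages)          # one horizontal bar per category
ax.set_xlabel("%")
ax.set_title("Bar chart - reasons for grocery shopping online")
plt.tight_layout()
plt.show()
```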


EXAMPLE 2.2
BAR CHART FOR FAMILY TYPE
The council is also interested in demographic differences between the council area and the capital city. Demographic information has been collected and is stored in < DEMOGRAPHIC_INFORMATION >. Use the summary tables for family type to construct and interpret bar charts for the council area and the capital city.

SOLUTION

Figure 2.2  Microsoft Excel bar charts for family type in the council area and the capital city (categories: Couple with children, Couple no children, One parent, Other; percentage on the horizontal axis)

We can see that, in both areas, the majority of families are couples with or without children, with a significant number of one-parent families. However, the capital city has approximately 10% more couples without children and 5% fewer one-parent families.

Pie Charts
pie chart Graphical representation of a summary table for categorical data, with each category represented by a slice of a circle of which the area represents the proportion or percentage share of the category relative to the total of all categories.

A pie chart is a circle, used to represent the total, which is divided into slices, each representing a category. The area of each slice represents the proportion or the percentage share of the corresponding category. In Table 2.1, for example, 28% of the respondents said that convenience was the main reason for grocery shopping online. Thus, in constructing the pie chart, the 360° that makes up a circle is multiplied by 0.28, resulting in a slice of the pie that takes up 100.8° of the 360° of the circle (Figure 2.3). A pie chart allows you to see the portion of the entire pie that falls into each category. In Figure 2.3, convenience takes 28% of the pie and products well displayed takes only 3%.


What type of chart should you use? The selection of a chart depends on your intention. If a comparison of categories is most important, use a bar chart. If observing the portion of the whole that lies in a particular category is most important, use a pie chart. There should be no more than eight categories or slices in a pie chart. If there are more than eight, merge the smaller categories into a category called 'other'.

Figure 2.3  Microsoft Excel pie chart of the reasons for grocery shopping online (slices: Convenience 28%, Competitive prices 20%, Quality products 18%, Customer service 13%, Variety/range of products 10%, Comfortable environment 8%, Products well displayed 3%)
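The same percentages can be shown as a pie chart. Another short matplotlib sketch, again purely illustrative:

```python
import matplotlib.pyplot as plt

# Percentages from Table 2.1; each slice takes up 360 * percentage / 100 degrees
reasons = ["Convenience", "Competitive prices", "Quality products", "Customer service",
           "Variety/range of products", "Comfortable environment", "Products well displayed"]
percentages = [28, 20, 18, 13, 10, 8, 3]

plt.pie(percentages, labels=reasons, autopct="%1.0f%%")  # label each slice with its percentage
plt.title("Pie chart - reasons for grocery shopping online")
plt.show()
```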

EXAMPLE 2.3
PIE CHART FOR FAMILY TYPE
Use the summary tables given for family type in < DEMOGRAPHIC_INFORMATION > to construct and interpret pie charts for the capital city and the council area.

SOLUTION

Figure 2.4  Microsoft Excel pie charts for family type in the council area and the capital city (slices: Couple with children, Couple no children, One parent, Other)


We can see that, in both areas, most families are couples with or without children, with a significant number of one-parent families. However, the capital city has a higher proportion of couples without children.

Problems for Section 2.1

LEARNING THE BASICS
2.1 A categorical variable has three categories with the following frequency of occurrence:
Category    Frequency
A                  13
B                  28
C                   9
a. Calculate the percentage of values in each category.
b. Construct a bar chart.
c. Construct a pie chart.
2.2 A categorical variable has four categories with the following percentages of occurrence:
Category    Percentage
A                   12
B                   29
C                   35
D                   24
a. Construct a bar chart.
b. Construct a pie chart.

APPLYING THE CONCEPTS
2.3 [Pie chart showing the news-channel viewing shares for ABC, SBS, Channel 7, Channel 9 and Channel 10]
The pie chart above was constructed from the results of a survey of 2,000 viewers to determine which TV channels they watch for news. By measuring the angle of each slice using a protractor, or estimating by eye, calculate the percentage of viewers watching:
a. ABC
b. Channel 7
c. Channel 9
d. Channel 10
e. SBS

You can solve problems 2.4 to 2.7 manually or by using Microsoft Excel.
2.4 The following table gives the top 10 websites ranked by estimated number of unique monthly visitors in March 2017.
Website      Unique monthly visitors (millions)
Google                 1,600
Facebook               1,100
YouTube                1,100
Yahoo!                   750
Amazon                   500
Wikipedia                475
Twitter                  290
Bing                     285
eBay                     285
MSN                      280
Data obtained from eBusMBA Guide, Top 15 Most Popular Websites March 2017, at accessed 13 March 2017
a. Construct bar and pie charts.
b. Which graphical method do you think best portrays these data?
c. What conclusions can you reach concerning the number of unique visitors?
2.5 Pat, the owner of Pat's Cars, asked 200 customers their colour preference when purchasing a new car. The following summary table gives the results.
Colour    Frequency
White           56
Blue            31
Red             29
Brown           17
Grey            19
Silver          15
Green           15
Black           13
Other            5
a. Construct bar and pie charts.
b. What colours of cars should Pat have on show?
2.6 The following table gives the labour force status of the Australian civilian population aged 15 years and over in January 2017.


6202.0 – Labour Force, Australia, Jan 2017
Labour force status (aged 15 years & over)         Total ('000)
Employed full-time                                     8,066.3
Employed part-time                                     3,762.6
Unemployed looking for full-time work                    561.4
Unemployed not looking for full-time work                213.7
Not in labour force                                    7,096.2
Civilian population aged 15 years and over            19,700.2
Data obtained from Australian Bureau of Statistics, Labour Force, Australia, January 2017, Cat. No. 6202.0 accessed 15 March 2017
a. Construct bar and pie charts.
b. Which graphical method do you think best portrays these data?
c. What conclusions can you draw about participation rate – that is, the percentage of the population in the labour force?
2.7 Use the summary table for country of birth in < DEMOGRAPHIC_INFORMATION > to construct pie and bar charts.

2.2 ORGANISING NUMERICAL DATA

LEARNING OBJECTIVE 2 Describe the distribution of a single numerical variable using tables and graphs

When you have a large amount of raw numerical data, a useful first step is to present the data as either an ordered array or a stem-and-leaf display. Suppose you undertake a study to compare the cost of a main meal at similar restaurants in a city and in the suburbs. Table 2.3 gives the raw data for 50 city restaurants and 50 suburban restaurants; these data are stored in . From the raw data it is difficult to draw any conclusions about the price of city and suburban restaurant meals.

Table 2.3  Price per main meal at 50 city restaurants and 50 suburban restaurants
City
50 38 43 56 51 36 25 33 41 44
34 39 49 37 40 50 50 35 22 45
44 38 14 44 51 27 44 39 50 35
31 34 48 48 30 42 26 35 32 63
36 38 53 23 39 45 37 31 39 53
Suburban
37 37 29 38 37 38 39 29 36 38
44 27 24 34 44 23 30 32 25 29
43 31 26 34 23 41 32 30 28 33
26 51 26 48 39 55 24 38 31 30
51 30 27 38 26 28 33 38 32 25

Ordered Arrays
A more meaningful display is obtained by sorting the raw data in order of magnitude – that is, from smallest to largest. This is called an ordered array. Table 2.4 presents the data in Table 2.3 as ordered arrays. From Table 2.4 you can see that the price of a main meal at city restaurants is between $14 and $63, and the price of a main meal at suburban restaurants is between $23 and $55.

ordered array Numerical data sorted by order of magnitude.

Stem-and-Leaf Displays
A stem-and-leaf display is a quick and easy way to visually display numerical data. The data are divided into groups (called stems) such that the values within each group (the leaves) branch out to the right on each row. The resulting display allows you to see how the data are distributed and also where they are concentrated.

stem-and-leaf display Graphical representation of numerical data; partitions each data value into a stem portion and a leaf portion.


Table 2.4  Ordered array of price per main meal at 50 city restaurants and 50 suburban restaurants
City
14 22 23 25 26 27 30 31 31 32
33 34 34 35 35 35 36 36 37 37
38 38 38 39 39 39 39 40 41 42
43 44 44 44 44 45 45 48 48 49
50 50 50 50 51 51 53 53 56 63
Suburban
23 23 24 24 25 25 26 26 26 26
27 27 28 28 29 29 29 30 30 30
30 31 31 32 32 32 33 33 34 34
36 37 37 37 38 38 38 38 38 38
39 39 41 43 44 44 48 51 51 55

To see how a stem-and-leaf display is constructed, suppose that 20 students spend the following amounts at a coffee cart between lectures: < COFFEE >
$6.35 $4.75 $4.30 $5.40 $4.85 $6.60 $5.55 $4.90 $6.85  $7.50
$8.45 $6.05 $9.90 $5.75 $6.80 $4.30 $5.45 $7.20 $7.80 $10.65

To construct a stem-and-leaf display for these data, use the dollar amount as the stem and round the cents to the nearest 10 cents for the leaves. Now list the stem values ($) in order of size to the left of a vertical divider (|) and then record the leaves (10 cents) for each stem in rows to the right. The 'unordered' stem-and-leaf display for the amount spent at the coffee cart by the 20 students is:

stem unit: $   leaf unit: 10 cents
 4 | 83993
 5 | 4685
 6 | 46918
 7 | 528
 8 | 5
 9 | 9
10 | 7

The first value of $6.35 is rounded to 6.4. Its stem (row) is 6 and its leaf is 4. The second value of $4.75 is rounded to 4.8. Its stem (row) is 4 and its leaf is 8. Then, ordering each leaf, we obtain the following ordered stem-and-leaf display for the amount spent at the coffee cart by the 20 students:

stem unit: $   leaf unit: 10 cents
 4 | 33899
 5 | 4568
 6 | 14689
 7 | 258
 8 | 5
 9 | 9
10 | 7
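For readers who want to reproduce such a display outside Excel or PHStat, the following Python sketch builds the ordered stem-and-leaf display from the coffee-cart amounts, using whole dollars as stems and cents rounded (half up) to the nearest 10 cents as leaves. It is illustrative only.

```python
from collections import defaultdict
from decimal import Decimal, ROUND_HALF_UP

# Amounts kept as strings so that Decimal rounding matches rounding by hand
amounts = ["6.35", "4.75", "4.30", "5.40", "4.85", "6.60", "5.55", "4.90", "6.85", "7.50",
           "8.45", "6.05", "9.90", "5.75", "6.80", "4.30", "5.45", "7.20", "7.80", "10.65"]

stems = defaultdict(list)
for amount in amounts:
    rounded = Decimal(amount).quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)
    stem = int(rounded)                   # whole dollars form the stem
    leaf = int((rounded - stem) * 10)     # tens of cents form the leaf
    stems[stem].append(leaf)

print("stem unit: $   leaf unit: 10 cents")
for stem in sorted(stems):
    leaves = "".join(str(leaf) for leaf in sorted(stems[stem]))  # ordered leaves
    print(f"{stem:>2} | {leaves}")
```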

EXAMPLE 2.4
STEM-AND-LEAF DISPLAY FOR FESTIVAL EXPENDITURE – INTERSTATE VISITORS
Kai is interested in the amount spent during the festival by interstate visitors. < FESTIVAL > Construct and interpret a stem-and-leaf display for these data.


SOLUTION

Figure 2.5  PHStat stem-and-leaf display for festival expenditure by interstate visitors
Festival expenditure by interstate visitors
Stem unit: $100   Leaf unit: $10
 2 | 278
 3 | 1235999
 4 | 02335567889
 5 | 1255689
 6 | 00033346689
 7 | 3567789
 8 | 067
 9 | 114
10 | 4

From Figure 2.5 Kai can conclude that during the festival:
• interstate visitors spend between $220 and $1,040
• most interstate visitors spend between $300 and $800
• interstate visitors rarely spend less than $300 or more than $800.

Problems for Section 2.2

LEARNING THE BASICS

stem unit: $100 1 2 3 4 5

2.8 Form an ordered array given the following data from a sample of n = 7 mid-semester exam scores in accounting: 68

94

63

75

71

88

64

2.9 Form a stem-and-leaf display given the following data from a sample of n = 7 mid-semester exam scores in finance: 80

54

69

98

93

53

74

2.10 Form an ordered array given the following stem-and-leaf display from a sample of n = 7 mid-semester exam scores in information systems: stem unit: 10 5 6 7 8 9

leaf unit: 1 0 446 19 2

APPLYING THE CONCEPTS 2.11 Data were collected on the monthly expenses submitted by 35 employees in a firm’s sales team. The data are summarised in the following stem-and-leaf display:

leaf unit: $10 12489 0013999999 01124445899 11556 0156

a. Place the data into an ordered array. b. Which of the two displays provides the most information? Discuss. c. In what range are most monthly expense claims? d. Is there a concentration of expense claims near the centre of the distribution? 2.12 The following data represent the late payment fee in dollars for a sample of 22 accounts. 20 45

40 20

40 38

38 45

35 45

35 15

45 35

50 40

45 35

40 45

35 40

a. Display the data as an ordered array. b. Construct a stem-and-leaf display for the data. c. Which of the two displays provides the most information? Discuss. d. Around what value, if any, are the late payment fees concentrated? Explain.


2.13 The following data represent ATM fees for withdrawals made above the free monthly allowance for a sample of 26 transaction accounts.
0.65 0.50 0.70 1.30 2.50 0.50 2.00 1.00 2.00 1.25 1.50 2.00 0.30
2.00 0.65 2.00 0.50 0.65 0.50 0.65 1.60 0.70 1.00 1.50 1.65 0.50
a. Display the data as an ordered array.
b. Construct a stem-and-leaf display for the data.
c. Which of the two displays provides the most information? Discuss.
d. Around what value, if any, are the withdrawal fees concentrated? Explain.
2.14 Low-fat foods are not necessarily low calorie, as many are high in sugar. The following data give calories per 250 ml cup of a random sample of brands of fresh cow's milk for sale in Australia.
Full cream milk
155 188 160 155 160 163 170 185 135 160 165 160 163
Low- or reduced-fat milk
120 133 133 125 118 113 140 110 128 115
No-fat or skim milk
133 90 90 98 88 85 115 108 88 90 90 98
Data obtained from Calorie King Australia accessed 22 December 2013
For each category of milk:
a. Display the data in ordered arrays.
b. Construct stem-and-leaf displays for the data.
c. Which arrangement provides more information? Discuss.
d. Compare the items in terms of calories. What conclusions can you make?

LEARNING OBJECTIVE 2 Describe the distribution of a single numerical variable using tables and graphs

2.3 SUMMARISING AND VISUALISING NUMERICAL DATA
Ordered arrays and stem-and-leaf displays are of limited use when we have very large quantities of data or the data are highly variable. In these cases we use tables and graphs to condense and present the data visually. These tables and graphs include histograms; frequency, relative frequency and cumulative distributions; and polygons.

Summarising Numerical Data: Frequency Distributions
A frequency distribution allows you to condense a set of data.

frequency distribution Summary table for numerical data; gives the frequency of data values in each class.

class width Distance between upper and lower boundaries of a class.

range Distance measure of variation; difference between maximum and minimum data values.

A frequency distribution is a summary table in which the data are arranged into numerically ordered classes or intervals.

To construct a frequency distribution, first select an appropriate number of classes and a suitable class width. The classes should be exhaustive and mutually exclusive, so that any one data value belongs to one and only one class. The number of classes chosen depends on the amount of data – a small number of classes for small amounts of data and a larger number of classes for larger amounts of data. In general, a frequency distribution should have at least five classes but no more than 15. If there are too few classes we lose too much information and if there are too many classes the data are not condensed enough. Each class should be of equal width. To determine the required (approximate) width of the classes, divide the range (the highest value – the lowest value) of the data by the required number of classes.

DETERMINING AN APPROXIMATE WIDTH OF A CLASS

Class width = range / number of classes    (2.1)


The city restaurant data consist of a sample of 50 restaurants; with this sample size, 10 is an appropriate number of classes. From the ordered array in Table 2.4, the range of the data is $63 – $14 = $49. Using Equation 2.1, the approximate class width is:

Class width = 49 / 10 = 4.9

Choose a class width that simplifies the reading and interpretation of the distribution and resultant graphs. Therefore, instead of using a class width of $4.90, choose a width of $5.00. Construct the frequency distribution table by first establishing clearly defined class boundaries so that each data value belongs in one and only one class. The classes must be mutually exclusive and exhaustive. Whenever possible, choose class boundaries that simplify the reading and interpretation of the resultant tables or graphs. For the city restaurant data the price ranges from $14 to $63, so appropriate classes could be (1) from $10 to less than $15, (2) from $15 to less than $20, and so on, until we have included the highest data value, in this case $63. The last and 11th class ranges from $60 to less than $65. The centre of each class, called the class mid-point, is halfway between the lower boundary and the upper boundary of the class. Thus, the class mid-point for the first class, from $10 to under $15, is $12.50 ((10 + 15)/2); the class mid-point for the second class, from $15 to under $20, is $17.50, and so on. Table 2.5 gives a frequency distribution of the cost per meal for the 50 city and the 50 suburban restaurants. A frequency distribution allows you to draw conclusions about the major characteristics of the data. For example, Table 2.5 shows that the price of main meals at city restaurants is concentrated between $30 and $55, compared with the price of main meals at suburban restaurants, which are clustered between $25 and $40.
For small data sets, one set of class boundaries may provide a different picture from another set. For example, for the restaurant price data, using a class width of 4.0 instead of 5.0 (as was used in Table 2.5) may cause shifts in the way in which the values are distributed between the classes. You can also get shifts in data concentration when you choose different lower and upper class boundaries. Fortunately, as the sample size increases, alterations in the selection of class boundaries affect the concentration of data less and less.
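The whole tabulation can be automated. The pandas sketch below forms a frequency and percentage distribution from a list of prices using the $5-wide classes described above; the short price list is invented for illustration and is not the full 50-restaurant data set.

```python
import pandas as pd

# Illustrative meal prices (a small made-up sample, not the full city data set)
prices = pd.Series([14, 22, 27, 31, 34, 36, 38, 39, 42, 44, 45, 48, 50, 53, 63])

# Class boundaries $10, $15, ..., $65 give classes such as '$10 but less than $15'
boundaries = list(range(10, 70, 5))
classes = pd.cut(prices, bins=boundaries, right=False)  # right=False: [10, 15), [15, 20), ...

frequency = classes.value_counts().sort_index()
percentage = frequency / frequency.sum() * 100

print(pd.DataFrame({"Frequency": frequency, "Percentage": percentage.round(1)}))
```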

class boundaries Upper and lower values used to define classes for numerical data.
class mid-point Centre of a class; representative value of class.

Table 2.5  Frequency distribution of the price per main meal for 50 city restaurants and 50 suburban restaurants
Price of main meal ($)       City frequency    Suburban frequency
$10 but less than $15                1                  0
$15 but less than $20                0                  0
$20 but less than $25                2                  4
$25 but less than $30                3                 13
$30 but less than $35                7                 13
$35 but less than $40               14                 12
$40 but less than $45                8                  4
$45 but less than $50                5                  1
$50 but less than $55                8                  2
$55 but less than $60                1                  1
$60 but less than $65                1                  0
Total                               50                 50

EXAMPLE 2.5
FREQUENCY DISTRIBUTION FOR FESTIVAL EXPENDITURE – INTERSTATE VISITORS
Kai is interested in the amount spent during the festival by interstate visitors. Construct and interpret a frequency distribution for this data.


SOLUTION
As we have data from 52 interstate visitors, with expenditure during the festival ranging from approximately $220 to $1,040 (see Figure 2.5), we can choose a class width of $200 with the first class starting at $200.

Table 2.6  Frequency distribution of festival expenditure – interstate visitors
Festival expenditure      Frequency
$200 to < $400                   11
$400 to < $600                   17
$600 to < $800                   18
$800 to < $1,000                  5
$1,000 to < $1,200                1
Total                            52

From Table 2.6 Kai can conclude that festival expenditure for interstate visitors is:
• between $200 and $1,200
• concentrated between $400 and $800
• rarely more than $800.

Relative Frequency and Percentage Distributions

relative frequency distribution Summary table for numerical data which gives the proportion of data values in each class.
percentage distribution Summary table for numerical data which gives the percentage of data values in each class.

Instead of the frequency of the data in each class, knowing the proportion or the percentage of the data that fall into each class is often more useful. To do this, we use either a relative frequency or a percentage distribution. Also, when comparing two or more samples with different sample sizes, a relative frequency or percentage distribution should be used. A relative frequency distribution is obtained by dividing the frequency in each class by the total number of values. From this a percentage distribution can be obtained by multiplying each relative frequency by 100%. Thus, the relative frequency of a main meal at city restaurants with a price between $30 and $35 is 0.14 (7 ÷ 50), and the corresponding percentage is 14%. Table 2.7 presents the relative frequency and percentage distributions of the price of main meals at city and suburban restaurants.
From Table 2.7 you can conclude that meals cost more at city restaurants than at suburban restaurants – 16% of main meals at city restaurants cost between $40 and $45 compared with 8% at suburban restaurants; 16% of main meals at city restaurants cost between $50 and $55 compared with 4% at suburban restaurants; while only 6% of main meals at city restaurants cost between $25 and $30 compared with 26% at suburban restaurants.

Table 2.7  Relative frequency and percentage distributions of the price of main meals at city and suburban restaurants
                              City                          Suburban
Price of main meal ($)   Relative frequency  Percentage   Relative frequency  Percentage
$10 but less than $15          0.02              2.0            0.00              0.0
$15 but less than $20          0.00              0.0            0.00              0.0
$20 but less than $25          0.04              4.0            0.08              8.0
$25 but less than $30          0.06              6.0            0.26             26.0
$30 but less than $35          0.14             14.0            0.26             26.0
$35 but less than $40          0.28             28.0            0.24             24.0
$40 but less than $45          0.16             16.0            0.08              8.0
$45 but less than $50          0.10             10.0            0.02              2.0
$50 but less than $55          0.16             16.0            0.04              4.0
$55 but less than $60          0.02              2.0            0.02              2.0
$60 but less than $65          0.02              2.0            0.00              0.0
Total                          1.00            100.0            1.00            100.0


EXAMPLE 2.6
RELATIVE FREQUENCY AND PERCENTAGE DISTRIBUTIONS FOR FESTIVAL EXPENDITURE – INTERSTATE AND INTRASTATE VISITORS
Kai is interested in the amount spent during the festival by festival attendees; in particular if there is any difference in expenditure between interstate and intrastate visitors. Construct and interpret frequency and percentage distributions to compare the festival expenditure of interstate and intrastate visitors.

SOLUTION

Table 2.8  Relative frequency and percentage distributions of festival expenditure – intrastate and interstate
                          Interstate                  Intrastate
Festival expenditure   Proportion  Percentage     Proportion  Percentage
$0 to < $200              0.000        0.00          0.019        1.92
$200 to < $400            0.212       21.15          0.442       44.23
$400 to < $600            0.327       32.69          0.250       25.00
$600 to < $800            0.346       34.62          0.135       13.46
$800 to < $1,000          0.096        9.62          0.115       11.54
$1,000 to < $1,200        0.019        1.92          0.039        3.85
Total                     1.000      100.00          1.000      100.00

From Table 2.8 Kai can conclude that interstate visitors generally spend more during the festival than intrastate visitors. However, there is more variation in festival expenditure between intrastate visitors.

Cumulative Distributions
cumulative percentage distribution Summary table for numerical data; gives the cumulative frequency of each successive class.
A cumulative percentage distribution gives the percentage of values that are less than a certain value. For example, you may want to know what percentage of the city restaurant main meals cost less than $20, less than $50, and so on. A percentage distribution is used to form the corresponding cumulative percentage distribution. From Table 2.7, 0% of main meals at city restaurants cost less than $10, 2% cost less than $15, 2% also cost less than $20 (since none of the meals cost between $15 and $20), 6% (2% + 4%) cost less than $25, and so on, until all 100% of the meals cost less than $65. Table 2.9 summarises the cumulative percentages for the price of main meals at city and suburban restaurants.

Table 2.9  Cumulative percentage distributions of the price of city and suburban restaurant main meals
Price ($)   City percentage of restaurants      Suburban percentage of restaurants
            less than indicated value           less than indicated value
$10                    0                                    0
$15                    2                                    0
$20                    2                                    0
$25                    6                                    8
$30                   12                                   34
$35                   26                                   60
$40                   54                                   84
$45                   70                                   92
$50                   80                                   94
$55                   96                                   98
$60                   98                                  100
$65                  100                                  100

The cumulative distribution clearly shows that the cost of main meals is lower in suburban restaurants than in city restaurants – 34% of main meals at suburban restaurants cost


less than $30 compared with 12% at city restaurants; 60% of main meals at suburban restaurants cost less than $35 compared with 26% at city restaurants; 84% of main meals at suburban restaurants cost less than $40 compared with 54% at city restaurants.

EXAMPLE 2.7
CUMULATIVE PERCENTAGE DISTRIBUTION FOR FESTIVAL EXPENDITURE
Kai is interested in the amount spent during the festival by festival attendees; in particular if there is any difference in expenditure between interstate and intrastate visitors. Construct and interpret cumulative distributions to compare festival expenditure of interstate and intrastate visitors.

SOLUTION

Table 2.10  Cumulative percentage distribution of festival expenditure – intrastate and interstate
Festival expenditure    Interstate percentage    Intrastate percentage
$0 to < $200                      0.00                    1.92
$200 to < $400                   21.15                   46.15
$400 to < $600                   53.85                   71.15
$600 to < $800                   88.46                   84.61
$800 to < $1,000                 98.08                   96.15
$1,000 to < $1,200              100.00                  100.00

From Table 2.10 Kai can conclude that 71% of intrastate visitors spend less than $600 during the festival while only 54% of interstate visitors spend less than $600. This indicates that, generally, intrastate visitors spend less during the festival than interstate visitors.
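The running-total idea behind a cumulative percentage distribution is easy to express in code. A brief pandas sketch using the interstate percentages from Table 2.8, reproduced here only for illustration:

```python
import pandas as pd

# Percentage distribution for interstate visitors (Table 2.8)
classes = ["$0 to < $200", "$200 to < $400", "$400 to < $600",
           "$600 to < $800", "$800 to < $1,000", "$1,000 to < $1,200"]
percentage = pd.Series([0.00, 21.15, 32.69, 34.62, 9.62, 1.92], index=classes)

# Cumulative percentage: running total of the class percentages
cumulative = percentage.cumsum().round(2)
print(cumulative)  # 0.00, 21.15, 53.84, 88.46, 98.08, 100.00 (small rounding differences from Table 2.10)
```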

Histograms
histogram Graphical representation of a frequency, relative frequency or percentage distribution; the area of each rectangle represents the class frequency, relative frequency or percentage.

A grouped frequency, relative frequency or percentage distribution can be graphically represented by a histogram. The horizontal axis is divided into intervals corresponding to the classes. Rectangles are constructed above these intervals, the heights of which measure the frequency, relative frequency or percentage of data values in the class. Figure 2.6 displays an Excel frequency histogram for the price of main meals at city restaurants. The histogram indicates that the price of main meals at city restaurants is concentrated between approximately $30 and $55. Very few meals cost less than $25 or more than $55. Instead of using class boundaries you can label and identify classes by their mid-point.

Figure 2.6  Excel histogram of the price of main meals at city restaurants (horizontal axis: price – city, class mid-points $12.50 to $62.50; vertical axis: frequency, 0 to 16)
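A histogram such as Figure 2.6 can also be produced directly from raw prices with matplotlib by supplying the class boundaries as bin edges. A minimal sketch, again using an invented list of prices rather than the full data set:

```python
import matplotlib.pyplot as plt

# Illustrative meal prices (not the full 50-restaurant data set)
prices = [14, 22, 27, 31, 34, 36, 38, 39, 42, 44, 45, 48, 50, 53, 63]

# Class boundaries $10, $15, ..., $65 become the bin edges
boundaries = list(range(10, 70, 5))

plt.hist(prices, bins=boundaries, edgecolor="black")
plt.xlabel("Price - city ($)")
plt.ylabel("Frequency")
plt.title("Histogram of the price of main meals at city restaurants")
plt.show()
```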


EXAMPLE 2.8
HISTOGRAM FOR FESTIVAL EXPENDITURE – INTERSTATE VISITORS
Kai is interested in the amount spent during the festival by interstate visitors. Construct and interpret a histogram for the data.

SOLUTION

Figure 2.7  Histogram of festival expenditure – interstate visitors

(Histogram of festival expenditure for interstate visitors: horizontal axis festival expenditure, $0 to $1,400 in $200 classes; vertical axis frequency, 0 to 20)

From Figure 2.7 Kai can conclude that festival expenditure for interstate visitors is:
• between $200 and $1,200
• concentrated between $400 and $800
• rarely more than $800.

Polygons
When comparing two or more sets of data we can construct polygons on the same set of axes, allowing for easy interpretation.
PERCENTAGE POLYGON
A percentage polygon is constructed by plotting the percentage for each class above the respective class mid-point and then joining the mid-points by straight lines. The graph is extended at each end to classes with a frequency of zero so that the polygon starts and finishes on the horizontal axis.

percentage polygon Graphical representation of a percentage distribution.

Figure 2.8 displays percentage polygons for the price of main meals in city and suburban restaurants. The polygon for suburban restaurants is concentrated to the left (corresponding to lower prices) of the polygon for city restaurants. The highest percentages of price for suburban restaurants are for class mid-points of $27.50 and $32.50, while the highest percentage of price for city restaurants is for a class mid-point of $37.50.
The polygons in Figure 2.8 have plotted points whose values on the horizontal axis represent the class mid-points. For example, for class mid-point $22.50, the plotted point for suburban restaurants (the higher one) represents the fact that 8% of these restaurants have main meal prices between $20 and $25, while the plotted point for city restaurants (the lower one) indicates that only 4% of these restaurants have main meal prices between $20 and $25.
When constructing polygons or histograms, the vertical axis should show the true zero or 'origin' so as not to distort the character of the data. The horizontal axis does not need to specify the zero point for the variable of interest, although the range of the variable should constitute the major portion of the axis.

Figure 2.8  Percentage polygons for the price of main meals in city and suburban restaurants (horizontal axis: price of main meal ($), class mid-points $7.50 to $62.50; vertical axis: percentage, 0 to 30%)

EXAMPLE 2.9
PERCENTAGE POLYGONS FOR FESTIVAL EXPENDITURE
Kai is interested in the amount spent during the festival by attendees; in particular if there is a difference between interstate and intrastate visitors. Construct and interpret percentage polygons to compare the festival expenditure of interstate and intrastate visitors.

SOLUTION

Figure 2.9  Percentage polygons – festival expenditure (interstate and intrastate visitors; horizontal axis: expenditure ($), class mid-points $100 to $1,300; vertical axis: percentage, 0 to 50%)

From Figure 2.9 Kai can conclude that intrastate visitors generally spend less during the festival than interstate visitors.

Cumulative Percentage Polygons (Ogives)
cumulative percentage polygon (ogive) Graphical representation of a cumulative frequency distribution.

A cumulative percentage polygon, or ogive, displays the variable of interest along the horizontal axis and the cumulative percentages (percentiles) on the vertical axis. A percentile is defined as 'the value below which a given percentage of observations in a data set fall'.
Figure 2.10 shows the cumulative percentage polygons of the price of main meals at city and suburban restaurants. Most of the curve for city restaurants is located to the right of the curve for suburban restaurants. This indicates that city restaurants have fewer main meals that cost below a particular value. For example, 12% of city restaurant main meals cost less than $30 compared with 34% of suburban restaurant main meals.

Figure 2.10  Cumulative percentage polygons of the cost of main meals at city and suburban restaurants

(Ogives for city and suburban restaurants: horizontal axis price of main meal, $10 to $65; vertical axis cumulative percentage, 0 to 100%)

EXAMPLE 2.10
CUMULATIVE PERCENTAGE POLYGONS FOR FESTIVAL EXPENDITURE
Kai is interested in the amount spent during the festival by attendees; in particular if there is a difference in expenditure between interstate and intrastate visitors. Construct and interpret cumulative percentage polygons to compare festival expenditure of interstate and intrastate visitors.

SOLUTION

Figure 2.11  Cumulative percentage polygons for festival expenditure

(Ogives of festival expenditure for interstate and intrastate visitors: horizontal axis expenditure, $0 to $1,200; vertical axis cumulative percentage, 0 to 100%)

In Figure 2.11, we see that the curve for expenditure by intrastate visitors is to the left of that by interstate visitors. This indicates that generally intrastate visitors spend less during the festival than interstate visitors.

Problems for Section 2.3 LEARNING THE BASICS 2.15 The values for a data set vary from 11.6 to 97.8. a. If these values are grouped into nine classes, indicate appropriate class boundaries. b. What class width did you choose? c. What are the corresponding class mid-points? 2.16 The cumulative percentage polygon below shows the amount spent (in dollars) by 200 customers at a local supermarket. Ogive – amount spent at local supermarket 100 80

%

60 40 20 0 0

20

40

60

80

100

120

140

160

180

200

Amount spent ($)

a. Approximately what percentage of customers spent less than $100? b. Approximately how many customers spent at least $60? c. Approximately how much did the top 10% of customers spend? d. Approximately how much did the bottom 10% of customers spend?

APPLYING THE CONCEPTS
You can solve problems 2.17 to 2.19 manually or by using Microsoft Excel.

2.17 The following data represent the electricity cost (in dollars) during the month of July for a random sample of 50 two-bedroom apartments in a New Zealand city.

Electricity charge ($)
 96 171 202 157 185  90 141 149 206  95
163 150 108 119 183 147 172 123 130 114
102 111 128 143 135 153 148 144 187 191
197 213 168 166 137 127 130 109 139 129
178 116 175 154 151  82 165 167 149 158

a. Construct a frequency distribution and a percentage distribution with upper class boundaries of …
c. Construct the corresponding cumulative percentage distribution and plot the corresponding ogive (cumulative percentage polygon).
d. Around what amount does the monthly electricity cost seem to be concentrated?

2.18 To investigate the variation in fuel prices in New South Wales on a day in March 2017, a random sample of 45 petrol stations, each in a different location, was selected. The price per litre of both unleaded petrol and diesel is recorded. Using the New South Wales data:
a. Construct frequency, percentage and cumulative distributions for the price of petrol and diesel.
b. As separate graphs, plot frequency histograms for the price of petrol and diesel.
c. On the same set of axes plot percentage polygons for the price of petrol and diesel.
d. On the same set of axes plot cumulative percentage polygons for the price of petrol and diesel.
e. What can you conclude about the variation in the fuel prices in New South Wales at the time the data were collected?

2.19 The ordered arrays in the table below give the life (in hours of usage) of samples of forty 15-watt CFL (compact fluorescent lamp) energy-saving light bulbs produced by two manufacturers, A and B.

Manufacturer A
5,544 6,832 7,497 8,091
5,814 6,868 7,645 8,119
6,190 6,879 7,654 8,392
6,307 6,930 7,773 8,416
6,342 6,941 7,816 8,416
6,423 7,007 7,838 8,514
6,429 7,037 7,924 8,532
6,485 7,043 7,999 8,542
6,612 7,059 8,038 8,544
6,667 7,136 8,067 8,731

Manufacturer B
6,701 7,607 8,298 9,036
7,118 7,721 8,666 9,385
7,133 7,754 8,792 9,460
7,142 7,767 8,800 9,471
7,156 7,806 8,856 9,521
7,344 7,839 8,861 9,540
7,493 7,888 8,993 9,693
7,569 7,983 9,001 9,744

2.4  ORGANISING AND VISUALISING TWO CATEGORICAL VARIABLES

Table 2.12  Percentage contingency table for number of bedrooms and location based on overall total

                      Bedrooms %
Location      1      2      3      4     >4   Total %
Rural       2.0    5.0   16.0   10.0    1.0      34.0
Town        4.0   14.0   29.0   14.0    5.0      66.0
Total       6.0   19.0   45.0   24.0    6.0     100.0


Table 2.13  Contingency table for number of bedrooms and location based on row total reported as a percentage

                      Bedrooms %
Location      1      2      3      4     >4   Total %
Rural       5.9   14.7   47.1   29.4    2.9     100.0
Town        6.1   21.2   43.9   21.2    7.6     100.0
Total       6.0   19.0   45.0   24.0    6.0     100.0

Table 2.14  Contingency table for number of bedrooms and location based on column total reported as a percentage

                      Bedrooms %
Location      1      2      3      4     >4   Total %
Rural      33.3   26.3   35.6   41.7   16.7      34.0
Town       66.7   73.7   64.4   58.3   83.3      66.0
Total     100.0  100.0  100.0  100.0  100.0     100.0

Table 2.12 shows that 45% of the properties have three bedrooms and that 29% of the properties are located in town and have three bedrooms. Table 2.13 shows that 47.1% of rural properties have three bedrooms while only 43.9% of properties located in town have three bedrooms. Table 2.14 shows that 64.4% of three-bedroom properties are located in town while 35.6% are rural.
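The three percentage breakdowns can also be produced programmatically. The Python sketch below is illustrative only: it uses pandas.crosstab with its normalize argument on a hypothetical data frame of property sales (the column names 'location' and 'bedrooms' are assumptions, not the textbook's data file).

```python
# A minimal sketch of the overall-total, row and column percentage breakdowns
# of a contingency table, using a hypothetical data frame of property sales.
import pandas as pd

sales = pd.DataFrame({
    'location': ['Rural', 'Town', 'Town', 'Rural', 'Town', 'Town'],
    'bedrooms': [3, 3, 2, 4, 3, 1],
})

counts  = pd.crosstab(sales['location'], sales['bedrooms'])                              # raw counts
overall = pd.crosstab(sales['location'], sales['bedrooms'], normalize='all') * 100       # % of overall total
by_row  = pd.crosstab(sales['location'], sales['bedrooms'], normalize='index') * 100     # % of each row total
by_col  = pd.crosstab(sales['location'], sales['bedrooms'], normalize='columns') * 100   # % of each column total

print(counts, overall.round(1), by_row.round(1), by_col.round(1), sep='\n\n')
```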

Visualising Two Categorical Variables: Side-by-Side Bar Charts

side-by-side bar chart  Graphical representation of a cross-classification table.

A useful way to display the results of contingency table data is by constructing a side-by-side bar chart. Figure 2.12, using the data from Table 2.11, is a Microsoft Excel side-by-side bar chart that compares the number of bedrooms based on the location of the property.

Figure 2.12  Microsoft Excel side-by-side bar chart for number of bedrooms and location
[Figure: horizontal side-by-side bar chart of number of properties (0 to 30) by number of bedrooms (1, 2, 3, 4, >4) for rural and town locations.]
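A clustered (side-by-side) bar chart can also be drawn directly from a contingency table of counts. The Python sketch below is an illustration only; the counts are illustrative values consistent with the tables above rather than the textbook's data file.

```python
# A minimal sketch of a side-by-side horizontal bar chart from a contingency table.
import pandas as pd
import matplotlib.pyplot as plt

table = pd.DataFrame(
    {'Rural': [2, 5, 16, 10, 1], 'Town': [4, 14, 29, 14, 5]},   # illustrative counts
    index=['1', '2', '3', '4', '>4'])                           # number of bedrooms

table.plot(kind='barh')              # one bar per location within each bedroom category
plt.xlabel('Number of properties')
plt.ylabel('Number of bedrooms')
plt.title('Side-by-side chart for number of bedrooms and location')
plt.tight_layout()
plt.show()
```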

EXAMPLE 2.11
SIDE-BY-SIDE CHARTS FOR PRICE OF RURAL AND URBAN PROPERTIES
For the 100 recent property sales, construct and interpret side-by-side charts to investigate if there is a difference between rural and urban property prices.

SOLUTION
First, construct a column percentage contingency table for price and location.


Table 2.15  Contingency table for price and location based on percentage of column total

                                Frequency           Column percentage
Asking price ($)             Rural    Town          Rural     Town
300,000 to < 400,000             8      17           23.5     25.8
400,000 to < 500,000             9      32           26.5     48.5
500,000 to < 600,000            12      10           35.3     15.1
600,000 to < 700,000             4       6           11.8      9.1
700,000 to < 800,000             0       0            0.0      0.0
800,000 to < 900,000             1       1            2.9      1.5
Total                           34      66          100.0    100.0

From Table 2.15 we can construct a side-by-side bar chart for location and price.

Figure 2.13  Side-by-side bar chart for location and price
[Figure: horizontal side-by-side bar chart of column percentage (0% to 50%) by asking price category, from $300,000 to < $400,000 up to $800,000 to < $900,000, for rural and town properties.]

Figure 2.13 shows that a higher proportion of rural properties have prices above $500,000, and that approximately 50% of the urban properties have prices between $400,000 and $500,000.

Problems for Section 2.4

LEARNING THE BASICS

2.20 The following data represent the responses to two questions asked in a survey of 40 undergraduate students majoring in business:
What is your gender? (M = Male; F = Female; O = Other)
What is your major? (A = Accounting; I = Information Systems; M = Marketing)

Gender  M M M F M F F M F M F M M M M F F M F F
Major   A I I M A I A A I I A A A M I M A A A I
Gender  M M M M F M F F M M F M M M M F M F M M
Major   I I A A M M I A A A I I A A A A I I A I

a. Represent the data in a contingency table where the rows represent the gender categories and the columns the academic-major categories.
b. Construct cross-classification tables based on percentages of all 40 student responses, on row percentages and on column percentages.
c. Using the results from (a), construct a side-by-side bar chart of gender based on student major.

2.21 Given the following cross-classification table, construct a side-by-side bar chart comparing A and B for each of the three column categories on the vertical axis.

        1     2     3   Total
A      20    40    40     100
B      80    80    40     200

APPLYING THE CONCEPTS

2.22 The Living in Australia Study gives information on the study mode (full or part time) of students studying for a post-school qualification, as well as their employment status.


Percentage of students enrolled in post-school education

                        Studying     Studying
Employment status      full-time    part-time   All students
Employed full-time           6.4         37.7           44.1
Employed part-time          18.1         12.2           30.3
Not employed                17.4          8.2           25.6
All students                41.9         58.1          100.0

Data obtained from the Household, Income and Labour Dynamics in Australia (HILDA) Survey, 2001–2005 (also known as the Living in Australia Study), The University of Melbourne 1994–2011

a. Construct cross-classification tables based on row percentages and column percentages.
b. Construct a side-by-side bar chart for employment status and study mode.
c. What conclusions do you draw from these analyses?

2.23 The following table classifies road fatalities in Australia from 2012 to 2016 (inclusive) by age and gender.

                           Gender
Age            Male   Female   Unknown    Total
< 10             89       74         3      166
10 to < 20      402      182         0      584
20 to < 30      990      310         0    1,300
30 to < 40      686      185         0      871
40 to < 50      693      182         0      875
50 to < 60      534      179         0      713
60 to < 70      412      212         0      624
70 to < 80      319      162         0      481
80 to < 90      243      178         0      421
90 or more       56       47         0      103
Unknown           3        1         0        4
Total         4,427    1,712         3    6,142

Data obtained from the Australian Road Deaths Database accessed 18 March 2017



Ignore the unknown categories.
a. Investigate the relationship between age and gender by constructing a side-by-side bar chart to highlight the pattern of male and female road fatalities.
b. Discuss the pattern of male and female road fatalities for 2012 to 2016.

2.24 The following data for people aged 15 years and older, classified by highest level of educational attainment and gender, were obtained for a certain Australian state:

Highest level of educational attainment    Males ('000)   Females ('000)   Total ('000)
Below Year 10                                     238.1            253.9          492.0
Year 10 or equivalent                             326.7            394.4          721.1
Year 11 or equivalent                             102.0             89.4          191.4
Year 12 or equivalent                             492.2            506.8          999.0
Post-secondary below bachelor degree              840.8            687.6        1,528.4
Bachelor degree or higher                         749.8            856.5        1,606.3
Total                                           2,749.6          2,788.6        5,538.2

Data obtained from Australian Bureau of Statistics, Education and Work, Australia, May 2016, 62270DO001_201605 accessed March 2017. © Commonwealth of Australia

a. Construct a cross-classification table based on column percentages.
b. Construct a side-by-side bar chart to highlight the information in (a).
c. Discuss any apparent pattern in male and female education levels in this Australian state.

2.25 The table below contains the sales of new passenger cars in New Zealand for February 2016 and 2017.

Sales of new cars
Make             February 2017   February 2016
Audi                       176             137
BMW                        160             193
Citroen                     15              10
Dodge                       23              44
Ford                       611             604
Holden                     654             645
Honda                      373             292
Hyundai                    606             470
Jaguar                      26              37
Jeep                        56             100
Kia                        513             407
Land Rover                  93              64
Lexus                       62              59
Maserati                    29               6
Mazda                      755             719
Mercedes Benz              245             164
Mini                        45              44
Mitsubishi                 547             413
Nissan                     346             484
Peugeot                     48              55
Porsche                     22              25
Renault                     30              11
Skoda                      104             102
Ssanyong                    93              95
Subaru                     305             208
Suzuki                     624             362
Tesla                       21               1
Toyota                     990             915
Volkswagen                 355             309
Volvo                       48              53
Other                       75             163
Total                    8,050           7,191

Data obtained from Motor Industry Association of New Zealand accessed March 2017, reproduced with permission. © Motor Industry Association of New Zealand

a. Construct a side-by-side bar chart for the makes of cars.
b. Discuss the changes in the sale of new cars in February 2017 compared with February 2016.


2.5  VISUALISING TWO NUMERICAL VARIABLES

LEARNING OBJECTIVE 4
Describe the relationship between two numerical variables using scatter diagrams and time-series plots

Scatter Diagrams

scatter diagram  Graphical representation of the relationship between two numerical variables; plotted points represent the given values of the independent variable and corresponding dependent variable.

When analysing a single numerical variable (univariate data), such as the price of a restaurant meal or festival expenditure, you can use a histogram, polygon or cumulative percentage polygon, as introduced in Section 2.3. When examining the relationship between two numerical variables (bivariate data) we can use a scatter diagram or plot to obtain a picture of a possible relationship. Plot one variable, the independent variable, on the horizontal (or x) axis and the other variable, the dependent variable, on the vertical (or y) axis. For example, a marketing analyst could study the effectiveness of advertising by comparing weekly sales volumes and weekly advertising expenditures. Or, a human resources director interested in the salary structure of the company could compare the employees' years of experience with their current salaries.
For the data from 100 recent property sales in the council area introduced in Example 2.1, a scatter plot can be used to explore the relationship between number of bedrooms (independent variable) and asking price (dependent variable). For each property, plot the number of bedrooms on the horizontal axis and the corresponding asking price on the vertical axis. Figure 2.14 gives an Excel scatter diagram for these data.

Figure 2.14  Microsoft Excel scatter diagram for number of bedrooms and asking price
[Figure: scatter diagram of asking price ($0 to $900,000 on the vertical axis) against number of bedrooms (0 to 8 on the horizontal axis) for 100 recent property sales.]

As expected, there is a weak increasing (positive) linear relationship, with more bedrooms associated with higher asking prices. Other pairs of variables may have a decreasing (negative) relationship in which one variable increases as the other decreases; for example, the age of a second-hand car and its value. Scatter diagrams are revisited in Chapter 3 when the coefficient of correlation and the covariance are studied, and in Chapter 12 when regression analysis is introduced.
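A scatter diagram needs only the paired values of the two variables. The Python sketch below is illustrative only; the bedroom counts and asking prices are hypothetical values, not the textbook's data file.

```python
# A minimal sketch of a scatter diagram: independent variable on the x axis,
# dependent variable on the y axis, using a hypothetical sample of properties.
import matplotlib.pyplot as plt

bedrooms     = [1, 2, 2, 3, 3, 3, 4, 4, 5, 6]            # hypothetical values
asking_price = [310000, 350000, 420000, 455000, 480000,
                520000, 560000, 610000, 700000, 820000]  # hypothetical prices ($)

plt.scatter(bedrooms, asking_price)
plt.xlabel('Number of bedrooms')
plt.ylabel('Asking price ($)')
plt.title('Scatter diagram – property sales')
plt.show()
```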

Time-series Plots

A time-series plot is used to study patterns in the value of a variable over time. A time-series plot displays the time period on the horizontal axis and the variable of interest on the vertical axis. Figure 2.15 is a time-series plot of the monthly exchange rate of the Australian dollar against the United States dollar from January 2010 to February 2017.


Figure 2.15  Microsoft Excel time-series plot of exchange rates: Australian dollar against US dollar 2010 to 2017
[Figure: time-series plot of the exchange rate (US$ per AUS$, 0.0 to 1.2) at the end of each month from January 2010 to February 2017.]
Source: Data based on Reserve Bank of Australia, Statistics, Exchange Rates accessed March 2017.

During 2010 and the first six months of 2011, rates rose steadily from US$0.90 to US$1.10. They remained between US$1.00 and US$1.10 until 2013, steadily decreased to US$0.80 in September 2015, and then remained between US$0.80 and US$0.90 until February 2017.
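A time-series plot simply joins the successive observations in time order. The Python sketch below is illustrative only; the monthly rates are hypothetical values chosen to show the idea, not the Reserve Bank data used in Figure 2.15.

```python
# A minimal sketch of a time-series plot: time on the horizontal axis,
# the variable of interest on the vertical axis (hypothetical monthly rates).
import matplotlib.pyplot as plt

month = list(range(1, 13))                       # month number within a hypothetical year
rate  = [0.90, 0.92, 0.93, 0.95, 0.97, 1.00,
         1.02, 1.05, 1.07, 1.08, 1.06, 1.04]     # hypothetical US$ per AUS$ values

plt.plot(month, rate, marker='.')
plt.xlabel('End of month')
plt.ylabel('US$ per AUS$')
plt.title('Exchange rate time-series plot')
plt.show()
```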

think about this: Rare events

When rare events happen, we often react to them more strongly than to common events with similar outcomes. Charts and graphs can give us a picture of the situation, helping to put the risk of these rare events in perspective. For example, in Australia when there is a shark attack, even if not fatal, there are often calls to protect beach users from attack, including controlling shark numbers by culling. However, shark attacks are rare: there are usually between 10 and 15 attacks annually in Australia, of which one or two are fatal, as shown in the table below.

Shark attacks, Australia
Year    Total attacks   Fatal attacks   Non-fatal
2001                9               0           9
2002                9               2           7
2003                8               1           7
2004               13               2          11
2005               10               2           8
2006                7               1           6
2007               12               0          12
2008               10               1           9
2009               22               0          22
2010               14               1          13
2011               13               4           9
2012               14               2          12
2013               10               2           8
2014               11               2           9
2015               18               1          17
2016               15               2          13

Data obtained from the International Shark Attack File accessed May 2014 and March 2017, © Florida Museum of Natural History, University of Florida


If we compare these mainly non-fatal shark attacks with the number of people drowning annually at Australian beaches in the same period (see the bar chart below), it is clear that the risk of drowning while at the beach is far higher than that of being attacked by a shark.

[Bar chart: Australia – shark attacks and beach drownings, annual counts (0 to 70) for 2001 to 2016.]
Drowning data obtained from Royal Life Saving, National Drowning Reports 2001 to 2016 accessed March 2017; shark attack data obtained from International Shark Attack File, Florida Museum of Natural History, University of Florida

A time-series plot of the same data, shown below, indicates that there is no apparent increase in either the number of shark attacks or the number of drownings at Australian beaches.

[Time-series plot: Australia – shark attacks and beach drownings, annual counts (0 to 70) for 2001 to 2016.]


Problems for Section 2.5

LEARNING THE BASICS

2.26 Below is a set of data from a sample of n = 11 items:

X (horizontal axis)    7    5    8    3    6   10   12    4    9   15   18
Y (vertical axis)     21   15   24    9   18   30   36   12   27   45   54

a. Plot the scatter diagram.
b. Is there a relationship between X and Y? Explain.

2.27 Below is a series of real annual sales (in millions of constant 2010 dollars) for a department over an 11-year period (2007 to 2017):

Year    2007  2008  2009  2010  2011  2012  2013  2014  2015  2016  2017
Sales   13.0  17.0  19.0  20.0  20.5  20.5  20.5  20.0  19.0  17.0  13.0

a. Construct a time-series plot.
b. Does there appear to be any change in real annual sales over time? Explain.

APPLYING THE CONCEPTS
You can solve problems 2.28 to 2.32 manually or by using Microsoft Excel.

2.28 For the city and suburban restaurants introduced in Section 2.2, an independent reviewer rated each restaurant on food quality, décor and service. Each was given a score out of 30 and then the three scores were added to give an overall rating out of 90.
a. Construct a scatter diagram with overall rating on the horizontal axis and price on the vertical axis.
b. Does there appear to be a relationship between overall rating and price? If so, is the relationship positive or negative?

2.29 The data were obtained from several used-car yards for 4-cylinder, 4-door sedans.
a. Construct a scatter diagram, with price on the vertical axis, to investigate the relationship between the age of a car and its price.
b. Construct a scatter diagram, with price on the vertical axis, to investigate the relationship between the kilometres travelled by a car and its price.
c. What conclusions can you reach about the relationship between the age or the kilometres travelled and the price of a used car? Are these the relationships you expected?

2.30 To investigate the variation in fuel prices in New South Wales on a given day, a random sample of 45 towns and suburbs was selected. The average price per litre of both unleaded petrol and diesel is recorded. Using the New South Wales data:
a. Construct a scatter diagram to investigate the relationship between petrol and diesel prices.
b. What conclusions can you reach about the relationship between petrol and diesel prices?

2.31 The data file gives the monthly Australian unemployment rate (seasonally adjusted) from March 2007 to February 2017.
a. Construct a time-series plot for the unemployment rate.
b. Does there appear to be any pattern?

2.32 A general measure of inflation is the annual increase in the consumer price index (CPI). The table below gives the annual increase in the CPI in Australia and New Zealand.

Year to    Australia rate %   NZ rate %      Year to    Australia rate %   NZ rate %
Mar 11                  3.3         4.5      Mar 14                  2.9         1.5
Jun 11                  3.5         5.3      Jun 14                  3.0         1.6
Sep 11                  3.4         4.6      Sep 14                  2.3         1.0
Dec 11                  3.0         1.8      Dec 14                  1.7         0.8
Mar 12                  1.6         1.6      Mar 15                  1.3         0.3
Jun 12                  1.2         1.0      Jun 15                  1.5         0.4
Sep 12                  2.0         0.8      Sep 15                  1.5         0.4
Dec 12                  2.2         0.9      Dec 15                  1.7         0.1
Mar 13                  2.5         0.9      Mar 16                  1.3         0.4
Jun 13                  2.4         0.7      Jun 16                  1.0         0.4
Sep 13                  2.2         1.4      Sep 16                  1.3         0.4
Dec 13                  2.7         1.6      Dec 16                  1.5         1.3

Data obtained from Reserve Bank of Australia and Reserve Bank of New Zealand accessed March 2017

a. Investigate the relationship between the inflation rates for the two countries by constructing time-series plots on the same set of axes.
b. What conclusions can you make about the inflation rates of the two countries?

2.6  BUSINESS ANALYTICS APPLICATIONS – DESCRIPTIVE ANALYTICS

LEARNING OBJECTIVE 5
Develop dashboard elements such as sparklines, gauges, bullet graphs and treemaps for descriptive analytics

As business people gain the ability to retrieve and process larger amounts of data in smaller amounts of time, sometimes approaching near real time, some have asked: At what point does the need for using samples to expedite analysis disappear? Might there not be a day when business decision makers could just analyse all the data continuously as it flows into the business in near real time? While, in most cases, continuous data analysis is not yet a reality, these questions taken together have created the demand for methods known collectively as business analytics.

Analytics represents an evolution of pre-existing statistical methods combined with advances in information systems and techniques from management science. Analytics is naturally interdisciplinary, and this nature underscores how important statistics is as part of your business education. Descriptive analytics, predictive analytics and prescriptive analytics form the three broad categories of analytic methods. Descriptive analytics explores business activities that have occurred or are occurring now. Predictive analytics identifies what is likely to occur in the (near) future and finds relationships in data that may not be readily apparent using descriptive analytics. Prescriptive analytics investigates what should occur and prescribes the best course of action for the future. We may use a number of organising and visualisation tools to aid our descriptive analytics.
Giving decision makers the ability to combine, collect, organise and visualise data that could be used for day-to-day, if not minute-by-minute, business monitoring in the present, rather than business activity in the past, is one of the main goals of descriptive analytics. Being able to do real-time monitoring can be useful for a business that handles a perishable inventory. Perishable inventory is inventory that will disappear after a particular event takes place, such as an airplane taking off for its destination or the end of a concert. Empty seats on the airplane or at the concert cannot be sold later. Perishable inventory also occurs with less tangible inventory, such as spaces reserved for advertisements on a commercial web page – such spaces cannot be sold after the page has been viewed.
In the past, the problem of perishable inventory was handled by models that predicted consumer behaviour based on historical patterns. A concert promoter sets prices based on the best estimation of ticket-buying behaviour. Today, by constantly monitoring sales, the promoter can use a dynamic pricing model in which the price of tickets could fluctuate in near real time based on whether sales are exceeding or failing to meet predicted demand.
Real-time monitoring can also be useful for a business that manages flows of people or objects that can be adjusted in near real time, especially when there is more than one flow and the flows are interrelated. For example, overseers of a large sports stadium could benefit from monitoring the flows of cars in parking facilities, as well as the flow of fans into the stadium, and redirect stadium personnel to assist at points of congestion.
The managers of WaldoLands, the theme park that licenses the characters from the Waldowood stories, seek to stabilise and grow their business. During the most recent tourist season, their park was plagued by a number of major ride breakdowns, long lines at popular attractions and key food service areas, and a general inability to respond to the park's day-to-day operating status. Last year's problems led to numerous unfavourable reviews in key social media travel websites, and the managers are concerned that possible patrons may decide to visit competing parks run by Universal Parks & Resorts and Six Flags Entertainment. For this year, the managers have added the LineJumper service that allows patrons to 'jump' to the head of a line, and are offering the premium-priced No-Stress-Express experience that offers special guided tours and behind-the-scenes access.
The managers also hope the new multimillion-dollar Rabbit Creek Racers and a greatly expanded MirrorGate Experience, based on a popular sci-fi franchise, will boost attendance, even though they fret about the technical complexity of these rides. In the WaldoLands scenario, managers could monitor flows of patrons through the ticket booths and into the theme park while also keeping an eye on the length of waiting lines and the use of the LineJumper service. This would allow the managers to adjust ride lengths or dispatch live performers to entertain patrons in line, and to try to redirect patrons to areas of the park that are currently under capacity.

business analytics  Skills, technologies and practices for continuous iterative exploration and investigation of past business performance to gain insight and drive business planning.
descriptive analytics  A form of business analytics that explores business activities that have occurred or are occurring in the present moment.
predictive analytics  A form of business analytics that identifies what is likely to occur in the (near) future and finds relationships in data that may not be readily apparent using descriptive analytics.
prescriptive analytics  A form of business analytics that investigates what should occur and prescribes the best course of action for the future.

Dashboards

Over several decades, people talked about developing executive information systems that would put information at the 'fingertips' of decision makers. Many of these efforts have spurred the development of dashboards that use descriptive analytics methods to present up-to-the-minute operational status about a business.

dashboard  Descriptive analytics methods to present up-to-the-minute operational status about a business.

An analytics dashboard provides this information in a visual form that is intended to be easy to comprehend and review. Dashboards can contain the summary tables and charts discussed earlier in this chapter, as well as newer or more novel forms of information presentation that can summarise big data as well as smaller sets of data. The dashboard in Figure 2.16 displays key WaldoLands operational statistics that are updated on a near-real-time basis. Clicking one of the categories would lead to other displays that contain additional information about theme park operations.

Figure 2.16  A WaldoLands dashboard
Source: The contents, descriptions, and characters of WaldoLands and Waldowood are copyright © 2018, 2014, 2011 Waldowood Productions, and used with permission.

Sparklines are one of the descriptive analytic methods that dashboards can contain.

sparklines  A descriptive analytics method that summarises time-series data as small, compact graphs designed to appear as part of a table.

Sparklines summarise time-series data as small, compact graphs designed to appear as part of a table (or a written passage). In Figure 2.17, sparklines display the wait times for WaldoLands attractions at half-hour intervals for the current day, helping to provide context for the current wait times that are indicated by the dot markers. For example, the sparkline for the Rabbit Springs Racers ride shows that the current wait time is one of the longest wait times for the day.

Figure 2.17  WaldoLands wait times table with sparklines
Source: The contents, descriptions, and characters of WaldoLands and Waldowood are copyright © 2018, 2014, 2011 Waldowood Productions, and used with permission.
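Sparklines can be approximated with any plotting library by stripping away axes and gridlines. The Python sketch below is an illustration only; the ride names reuse those in the WaldoLands scenario but the wait-time values are hypothetical.

```python
# A minimal sketch of sparklines: one small, axis-free line per series,
# with a dot marking the most recent value (hypothetical wait times, minutes).
import matplotlib.pyplot as plt

waits = {
    'Rabbit Springs Racers': [20, 35, 50, 65, 70, 85, 80],
    'MirrorGate Experience': [15, 25, 30, 40, 45, 50, 55],
    'Mt Waldo Alpine Sleds': [10, 15, 20, 20, 25, 30, 25],
}

fig, axes = plt.subplots(len(waits), 1, figsize=(3, 1.8))
for ax, (ride, series) in zip(axes, waits.items()):
    ax.plot(series, linewidth=1)
    ax.plot(len(series) - 1, series[-1], 'o')   # dot marker on the current value
    ax.axis('off')                              # sparklines omit axes and gridlines
    ax.set_title(ride, loc='left', fontsize=7)
plt.tight_layout()
plt.show()
```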

gauges  A visual display of data inspired by the speedometer in a car.
bullet graph  A horizontal bar chart inspired by a thermometer.

Analogous to automotive dashboards, analytic dashboards can provide warnings when predefined conditions are met or exceeded. Figure 2.18 contains a set of gauges and a bullet graph that both display the wait-line status for WaldoLands attractions. These displays combine a single numerical measure (wait time) with one of five categorical values that rates the wait time subjectively, from excellent (less than 25 minutes) to poor (more than 85 minutes). While gauges have been a popular choice in business, most information design specialists prefer bullet graphs because those graphs foster the direct comparison of each measurement (wait time in Figure 2.18). Gauges can also consume a lot of visual space in a dashboard. For example, in Figure 2.18, note the amount of space the gauges consume to show the status of the six most popular rides. The corresponding bullet graph can display the status of 14 rides and present the wait times in a way that facilitates comparisons. For these reasons, some consider gauges little more than examples of chartjunk (see reference 1), even as many decision makers request them due to their visual appeal.1

chartjunk  Unnecessary information and detail that reduces the clarity of a graph.

Figure 2.18  Gauges and bullet graph of wait times for WaldoLands attractions
Source: The contents, descriptions, and characters of WaldoLands and Waldowood are copyright © 2018, 2014, 2011 Waldowood Productions, and used with permission.

treemaps  A descriptive analytics method that helps visualise two variables, one of which must be categorical.

Dashboards may also contain treemaps that help users to visualise two variables, one of which must be categorical. Treemaps are especially useful when categories can be grouped to form a multilevel hierarchy or tree. Figure 2.19 displays a pair of treemaps that visualise the number of social media comments made today about WaldoLands attractions (the size of each rectangle). The left treemap shows each ride grouped by the 'land' of WaldoLands (StrausLand, the BWLand or FamilyLand) where the attraction is found. The right treemap shows the data for the six most popular WaldoLands attractions, illustrating that treemaps can be used with non-hierarchical information as well.

[Figure: two treemaps of WaldoLands attractions – the left grouped by land (StrausLand, The BWLand, FamilyLand) and showing rides such as Kirby's SplashDown, Soarin' Stegosaurs, Stressed Out Wild Mouse, Rabbit Springs Racers, Mt Waldo Alpine Sleds, A.B.'s Hall of Mirrors, MirrorGate Experience and Lande's Musical Chairs; the right showing only the six most popular attractions.]
Figure 2.19  Treemaps of number and favourability of social media comments about WaldoLands attractions
Source: The contents, descriptions, and characters of WaldoLands and Waldowood are copyright © 2018, 2014, 2011 Waldowood Productions, and used with permission.

1  This tension between what decision makers might find visually appealing and what statisticians and information specialists have found most useful reflects the relative newness of these descriptive methods. Over time, this tension may ease and an acceptable standard for representing such information may emerge.
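A simple treemap can be sketched in Python with the third-party squarify package (an assumption – it is not used by the textbook, and would need to be installed separately). Rectangle sizes come from the comment counts and colours encode a second, categorical variable; all the counts and groupings below are hypothetical.

```python
# A minimal sketch of a treemap, assuming the 'squarify' package is available
# (pip install squarify); sizes and groupings are hypothetical.
import matplotlib.pyplot as plt
import squarify

rides  = ["Rabbit Springs Racers", "MirrorGate Experience", "Soarin' Stegosaurs",
          "Mt Waldo Alpine Sleds", "Stressed Out Wild Mouse", "Lande's Musical Chairs"]
counts = [180, 150, 90, 70, 60, 40]                      # hypothetical comment counts
lands  = ['StrausLand', 'The BWLand', 'FamilyLand',
          'StrausLand', 'The BWLand', 'FamilyLand']      # hypothetical grouping
palette = {'StrausLand': 'tab:blue', 'The BWLand': 'tab:orange', 'FamilyLand': 'tab:green'}

squarify.plot(sizes=counts, label=rides, color=[palette[l] for l in lands], alpha=0.8)
plt.axis('off')
plt.title('Treemap of social media comments by attraction')
plt.show()
```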


When combined with the Figure 2.18 gauges or bullet graph, the treemap on the right in Figure 2.19 would allow managers to preliminarily conclude that the negativity of comments seems to be tied to current wait lines and that rides with the shortest wait lines may generate the fewest social media comments. These relationships could then be further investigated and, if the former one was confirmed, managers could, in the future, respond to excessive wait lines by shortening the ride length to handle more customers, sending live performers to entertain those waiting in line or instructing park staff to divert incoming park patrons to other rides.
Note that gauges, bullet graphs and treemaps use colour to represent the value of a second variable, thereby increasing the data density of the displays – one of the principles of good information design (see reference 2). However, when using these displays, particularly bullet graphs and treemaps, avoid using colour spectrums that run from red to green, the two colours most subject to confusion due to colour vision deficiencies. (This is less of a problem with gauges, as colours subject to confusion will have unique positions on the gauge dial.)

Data Discovery

data discovery  Methods used to take a closer look at historical or status data, to quickly review data for unusual values or outliers, or to construct visualisations for management presentations.
drill-down  The revealing of the data that underlie a higher-level summary.

Data discovery methods allow decision makers to interactively organise or visualise data and perform preliminary analyses. These methods can be used to take a closer look at historical or status data, to quickly review data for unusual values or outliers, or to construct visualisations for management presentations. In these ways, data discovery realises the earlier promise of executive information systems to give decision makers the tools of data exploration and presentation.
In its simplest version, data discovery involves drill-down, the revealing of the data that underlie a higher-level summary. For example, clicking the merchandise entry in the Figure 2.16 WaldoLands dashboard would reveal more detailed information such as the table of sales by 'lands' shown in the left table in Figure 2.20. In turn, this summary can be drilled down to reveal sales by each store in the theme park (see the table on the right in Figure 2.20). At this level of detail, sales at Peri's Playtime are significantly lower than at the other stores, perhaps suggesting that this store be closed, relocated, or have its merchandise mix reconsidered.

Figure 2.20  WaldoLands merchandise sales summarised on two different levels
Source: The contents, descriptions, and characters of WaldoLands and Waldowood are copyright © 2018, 2014, 2011 Waldowood Productions, and used with permission.

Another level of drill-down (not shown) would reveal the sales of each item or SKU (stock-keeping unit) sold in each store. By reorganising that list by item, WaldoLands managers could discover which items are selling the best and may be subject to stockouts.
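Drill-down amounts to summarising the same underlying records at successively finer levels. The Python sketch below is an illustration only; the data frame, its column names and the store names (apart from Peri's Playtime, mentioned above) are hypothetical.

```python
# A minimal sketch of drill-down: the same hypothetical sales records summarised
# at successively finer levels (park total -> land -> store), as a dashboard might.
import pandas as pd

sales = pd.DataFrame({
    'land':  ['StrausLand', 'StrausLand', 'The BWLand', 'FamilyLand', 'FamilyLand'],
    'store': ['Kirby Korner', 'Waldo Wares', 'BW Boutique', "Peri's Playtime", 'Family Finds'],
    'sales': [12500, 9800, 14200, 2100, 7600],   # hypothetical daily sales ($)
})

print(sales['sales'].sum())                              # top level: total merchandise sales
print(sales.groupby('land')['sales'].sum())              # drill down: sales by land
print(sales.groupby(['land', 'store'])['sales'].sum())   # drill down again: sales by store
```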


Problems for Section 2.6

2.33 The Edmunds.com NHTSA Complaints Activity Report is the result of the examination of the frequency, trends and composition of consumer vehicle complaint submissions at the car manufacturer, brand and category levels (data obtained from ). The table below, stored in < AUTOMAKER1 >, contains complaints received by six car manufacturers for January 2013. When the number of complaints is less than 300, the complaint rating is considered to be low; when the number of complaints is between 300 and 500, the complaint rating is considered to be medium; and when the number of complaints is more than 500, the complaint rating is considered to be high.

Car manufacturer               Number of complaints
American Honda                                  169
Chrysler LLC                                    439
Ford Motor Company                              440
General Motors                                  551
Nissan Motors Corporation                       467
Toyota Motor Sales                              332

a. Construct a gauge for each car manufacturer.
b. Construct a bullet graph for the car manufacturers.
c. Which display is more effective at comparing the number of complaints for each car manufacturer?

2.34 There is a very large number of mutual funds from which an investor can choose. Each mutual fund has its own mix of different types of investments. The file < BEST_FUNDS1 > contains the one-year return percentage and the three-year annualised return percentage for the 10 best short-term bond and long-term bond funds according to the U.S. News & World Report score (data obtained from ).
a. Construct bullet graphs of the one-year returns and the three-year returns. For the purposes of comparison, consider a return below 5% as low-performing, a return between 5 and 10% as medium-performing and a return above 10% as high-performing.
b. Why would you not want to construct a gauge for each bond fund?
c. What conclusions can you reach about the one-year and three-year return percentages for the short-term bond and long-term bond funds?

2.35 A financial analyst was interested in comparing the price-to-book ratio (P/B) of pharmaceutical companies. The analyst collected P/B ratios for 71 pharmaceutical companies (Industry Group SIC 3 code: 283) and stored them as part of the file < BUSINESS_VALUATION >.
a. Visually evaluate the P/B ratios by constructing a bullet graph. For the purposes of comparison, consider a P/B ratio that is 2 or less as excellent, a P/B ratio that is between 2 and 5 as acceptable, and a P/B ratio that is above 5 as unacceptable.
b. Why would using gauges be a poor choice for this analysis?
c. Are the three groupings of P/B ratios helpful in analysing the data? What constitutes an acceptable P/B ratio varies by industry and is partially based on subjective analysis. For the purposes of information presentation, would you redefine or subdivide the current acceptable category?

2.36 The file < BB_COST_2012 > contains the total cost (in $) for four tickets, two beers, four soft drinks, four hot dogs, two game programs, two baseball caps and parking for one vehicle at each of the 30 Major League Baseball (MLB) parks during the 2012 season (data obtained from ).
a. Visually evaluate the total cost at each MLB park by constructing a bullet graph. For the purposes of comparison, consider a total cost (in dollars) less than $180 as inexpensive, between $180 and $240 as typical, and more than $240 as expensive.
b. Which display best visualises the distribution of costs – the bullet graph or a stem-and-leaf display? Why?
c. Name something that the bullet graph reveals about the data that the stem-and-leaf display does not. How could that be used as the basis for future analysis of total costs at MLB parks?

2.37 Referring to the movie attendance data between 2002 and 2012 (stored in < MOVIE_ATTENDANCE2 >):
a. Construct a sparkline graph for movie attendance between 2002 and 2012.
b. What conclusions can you reach about movie attendance between 2002 and 2012?
c. When would using a sparkline graph be the better choice to visualise these data? When would using the time-series plot be the better choice?
d. Might you ever use both a sparkline graph and a time-series plot in the same analysis report? Explain your reasoning.

2.38 The file < STOCK_INDICES > contains the data that represent the total rate of return (as a percentage) for the Dow Jones Industrial Average (DJIA), the Standard & Poor's 500 (S&P500) and the technology-heavy NASDAQ Composite (NASDAQ) from 2006 through 2012 (data obtained from accessed 29 March 2013).
a. Construct sparklines for the annual rate of return for the DJIA, S&P500 and NASDAQ from 2006 to 2012.
b. What conclusions can you reach concerning the annual rates of return of the three market indices?

2.39 From 2006 to 2012, the value of precious metals fluctuated dramatically. The file < METAL_INDICES > contains the total rate of return (as a percentage) for platinum, gold and silver from 2006 through 2012 (data obtained from accessed 29 March 2013).
a. Construct sparklines for the annual rate of return for platinum, gold and silver from 2006 to 2012.
b. What conclusions can you reach concerning the rates of return of the three precious metals?
c. Compare the results of (b) to those of Problem 2.38(b).

2.40 Drive-through service time is an important quality attribute for fast-food chains. The data in < SERVICE_TIME > are the mean service times for Burger King, Chick-Fil-A, McDonald's and Wendy's in 12 recent years (data obtained from ).
a. Construct sparklines of the mean service times for Burger King, Chick-Fil-A, McDonald's and Wendy's in 12 recent years.
b. What conclusions can you reach concerning the mean service times for Burger King, Chick-Fil-A, McDonald's and Wendy's in 12 recent years?

2.41 Sales of cars in the United States fluctuate from month to month and year to year. The data in the file < AUTO_SALES > represent the sales for various manufacturers in July 2013 and the change from July 2012 sales in percentages (data obtained from ).
a. Construct a treemap of the sales of cars and the change in sales from July 2012.
b. What conclusions can you reach concerning the sales of cars and the change in sales from July 2012?

2.42 The value of a National Basketball Association (NBA) franchise has increased dramatically over the past few years. The value of a franchise varies based on the size of the city in which the team is located, the amount of revenue it receives and the success of the team. The file < NBA_VALUES > contains the value of each team and the change in value in the past year (data obtained from ).
a. Construct a treemap that visualises the values of the NBA teams (size) and the one-year changes in value (colour).
b. What conclusions can you reach concerning the value of NBA teams and the one-year change in value?

2.43 The annual ranking of the FT Global 500 2013 provides a snapshot of the world's largest companies. The companies are ranked by market capitalisation – the greater the sharemarket value of a company, the higher the ranking. The market capitalisations (in billions of dollars) and the 52-week change in market capitalisations (in percentages) for companies in the Automobile & Parts, Financial Services, Health Care Equipment & Services and Software & Computer Services sectors are stored in < FT_GLOBAL500 > (data obtained from ).
a. Construct a treemap that presents each company's market capitalisation (size) and the 52-week change in market capitalisation (colour) grouped by sector and country.
b. Which sector seems to have the best gains in the market capitalisations of its companies? Which sectors seem to have the worst gains (or greatest losses)?
c. Construct a treemap that presents each company's market capitalisation (size) and the 52-week change in market capitalisation (colour) grouped by country.

d. What comparison can be more easily made with the treemap constructed in (c) than with the treemap constructed in (a)?

2.44 Your task as a member of the International Strategic Management Team at your company is to investigate the potential for entry into a foreign market. As part of your initial investigation, you must provide an assessment of the economies of countries in the Americas and the Asia and Pacific regions. The file < DOING_BUSINESS > contains the 2012 GDPs per capita for these countries as well as the number of Internet users in 2011 (per 100 people) and the number of mobile phone subscriptions in 2011 (per 100 people) (data obtained from ).
a. Construct a treemap of the GDPs per capita (size) and their number of Internet users in 2011 (per 100 people) (colour) for each country grouped by region.
b. Construct a treemap of the GDPs per capita (size) and their number of mobile phone subscriptions in 2011 (per 100 people) (colour) for each country grouped by region.
c. What patterns to these data do the two treemaps suggest? Are the patterns in the two treemaps similar or different? Explain.

2.45 Using the sample of retirement funds stored in < RETIREMENT_FUNDS >:
a. Construct a table that tallies type, market cap and risk.
b. Drill down to examine the large-cap growth funds with high risk. How many funds are there? What conclusions can you reach about these funds?

2.46 Using the sample of retirement funds stored in < RETIREMENT_FUNDS >:
a. Construct a table that tallies type, market cap and rating.
b. Drill down to examine the large-cap growth funds with a rating of three. How many funds are there? What conclusions can you reach about these funds?

2.47 Using the sample of retirement funds stored in < RETIREMENT_FUNDS >:
a. Construct a table that tallies market cap, risk and rating.
b. Drill down to examine the large-cap funds that are high risk with a rating of three. How many funds are there? What conclusions can you reach about these funds?

2.48 Using the sample of retirement funds stored in < RETIREMENT_FUNDS >:
a. Construct a table that tallies type, risk and rating.
b. Drill down to examine the growth funds that are high risk with a rating of three. How many funds are there? What conclusions can you reach about these funds?

2.49 Using the sample of retirement funds stored in < RETIREMENT_FUNDS >:
a. What are the attributes of the fund with the highest five-year return?
b. What five-year returns are associated with small market cap funds that have a rating of five stars?
c. Which fund(s) in the sample have the lowest five-year return?
d. What is the type and market cap of the five-star fund with the highest five-year return?


2.7  MISUSING GRAPHS AND ETHICAL ISSUES

LEARNING OBJECTIVE 6
Correctly present data in graphs

Good graphical displays should present the data in a clear and understandable way. Unfortunately, many graphs in newspapers and magazines, as well as graphs constructed using Microsoft Excel, are incorrect, misleading or unnecessarily complicated. To illustrate the misuse of graphs, Figure 2.21 was constructed using data obtained from Wine Australia. In the figure, the contents of the wine bottle representing 606 million litres for 1995/96 appear to be approximately three times the contents of the icon representing 346 million litres for 1990/91. This is because a magnification factor of 1.75 (606/346 ≈ 1.75) has been applied to both height and width, so the volume has increased by 1.75² ≈ 3. One principle of good graphs is that, when using three-dimensional icons, frequency/quantity must be proportional to volume.

Figure 2.21  Misleading display of Australian wine production
[Figure: wine-bottle icons showing Australian beverage wine production (million litres) for 1990/91 (346), 1995/96 (606), 2000/01, 2005/06, 2010/11 and 2014/15 (values of 1,034, 1,410, 1,118 and 1,191 million litres); the icon sizes are not drawn proportional to the quantities.]
Source: Data obtained from 'Australian Gross Wine Production – pdf format', Wine Australia Corporation accessed December 2013.

Also, the time difference between the wine bottles is not constant. There are five years between the first five icons and four years between the last two. Good graphs should be properly scaled along each axis. Finally, the year labels are ambiguous. It is not clear whether the 346 million litres represent the total production for the two years 1990 and 1991, the average production for those two years, or the wine production for the 1990/91 financial year. Good graphs should be clearly labelled. Although the wine bottle presentation may catch the eye, the data would have been better presented in a summary table or as a time-series plot using all the data available.
It is often the improper use of the vertical and horizontal axes that leads to distortions in presenting data. Figure 2.22, representing New Zealand alcohol consumption, was constructed using data from OECD (2011 and 2014). The graph in Figure 2.22 is clearly labelled, the horizontal/time axis is correctly spaced and the height and volume are proportional. However, the cylinder representing 9.1 litres for 2004 is more than twice the height/volume of the cylinder representing 8.9 litres for 2003. This is because there is no zero point on the vertical axis. The vertical axis on a good graph should usually begin at zero. Other eye-catching displays seen in magazines and newspapers often include information that is not necessary, blurring the effect.

Figure 2.22  Misleading display of New Zealand alcohol consumption
[Figure: cylinders showing alcohol consumption in litres per capita (15+) for each year from 2003 to 2013, with values ranging from 8.9 (2003) and 9.1 (2004) up to 9.6; the vertical axis does not begin at zero.]
Source: Data from OECD (2011 and 2014), 'Alcohol consumption', Health: Key Tables from OECD, No. 24. doi: 10.1787/alcoholcons-table-2014-1-en and 10.1787/alcoholcons-table-2011-1-en, accessed March 2017.

Some guidelines for presenting good graphs are as follows:
• The graph should not distort the data. In particular, frequency/quantity should be proportional to area and/or volume.
• The graph should not contain chartjunk.
• Any two-dimensional graph should contain a scale for each axis.
• The scale on the vertical axis should begin at zero.
• Graphs should be properly scaled along each axis.
• All axes should be properly labelled.
• The graph should contain a title.
• The simplest possible graph should be used for a given set of data.

Often these guidelines are unknowingly violated by individuals unaware of how to construct appropriate graphs. Some applications, including Excel, tempt you to create ‘pretty’ charts that may be fancy in their designs but represent unwise choices. For example, making a simple pie chart fancier by adding exploded 3D slices is unwise as this can complicate a viewer’s interpretation of the data. Uncommon chart choices such as doughnut, radar and surface charts may look visually striking, but in most cases they obscure the data.
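The zero-baseline guideline is easy to demonstrate. The Python sketch below is an illustration only: it draws the same illustrative series (the first two values are taken from Figure 2.22, the rest are similar made-up values) once with a truncated vertical axis and once with an axis that begins at zero.

```python
# A minimal sketch of the 'begin the vertical axis at zero' guideline:
# the same series drawn with a truncated axis and with a zero baseline.
import matplotlib.pyplot as plt

years  = [2003, 2004, 2005, 2006, 2007]
litres = [8.9, 9.1, 9.2, 9.3, 9.5]          # illustrative per-capita consumption values

fig, (misleading, honest) = plt.subplots(1, 2, figsize=(8, 3))
misleading.bar(years, litres)
misleading.set_ylim(8.8, 9.6)               # truncated axis exaggerates small differences
misleading.set_title('Axis starts at 8.8 (misleading)')

honest.bar(years, litres)
honest.set_ylim(0, 10)                      # zero baseline keeps bar heights proportional
honest.set_title('Axis starts at 0')

for ax in (misleading, honest):
    ax.set_ylabel('Litres per capita')
plt.tight_layout()
plt.show()
```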

Ethical Concerns

Inappropriate graphs raise ethical concerns, especially when they, deliberately or not, present a false impression of the data. To illustrate this, take the example of mobile speed cameras that were reintroduced in New South Wales on 19 July 2010. Suppose the following graphs were produced by groups for and against this, using data obtained from the Australian Road Deaths Database. Figure 2.23A gives the impression that the number of road fatalities in New South Wales has increased after the reintroduction of mobile speed cameras, while Figure 2.23B gives the opposite impression.

Figure 2.23A  NSW road fatalities 2010

[Figure: NSW number of road fatalities per month, July 2010 to October 2010, vertical axis from 20 to 40, with the introduction of mobile cameras marked.]



Figure 2.23B  NSW road fatalities 2010

[Figure: NSW number of road fatalities per month, April 2010 to August 2010, vertical axis from 20 to 45, with the introduction of mobile cameras marked.]

Figure 2.23C  NSW road fatalities 2009 to 2017

[Figure: time-series plot of NSW number of road fatalities per month, January 2009 to February 2017, vertical axis from 0 to 60, with the introduction of mobile speed cameras marked.]
Source: Data in Figures 2.23A–C obtained from Australian Road Deaths Database, , accessed 8 April 2017.

However, a time-series plot for 2009 to 2017 (Figure 2.23C) shows that there may be a slight decrease in fatalities since the introduction of mobile cameras, although the number of fatalities per month is very variable.

Problems for Section 2.7

APPLYING THE CONCEPTS

2.50 (Student project) Bring to class a chart from a newspaper or magazine that you believe to be a poor representation of a numerical variable. Be prepared to discuss why you think this. Do you believe that the intent of the chart is purposely to mislead the reader?

2.51 (Student project) Bring to class a chart from a newspaper or magazine that you believe to be a poor representation of a categorical variable. Be prepared to discuss why you think this. Do you believe that the intent of the chart is purposely to mislead the reader?

2.52 (Student project) Bring to class a chart from a newspaper or magazine that you believe contains too many unnecessary adornments (i.e. chartjunk) that may cloud the message given by the data. Be prepared to discuss why you think this.

2.53 The following graph shows a relationship between number of pirates and global average temperature between 1820 and 2000. Comment on the influence of pirates on global warming.


[Graph: global average temperature (°C, roughly 13.0 to 16.0 on the vertical axis) plotted against the approximate number of pirates (declining from about 45,000 to 17 on the horizontal axis), with point labels for years from 1820 to 2000.]
Source: Church of the Flying Spaghetti Monster, accessed 28 December 2014. Used by permission of Bobby Henderson

2.54 Using the data and , redraw Figures 2.21 and 2.22, following the guidelines for good graphs given in Section 2.7.

2.55 The following three time-series plots show Perth's monthly average petrol prices from January 2006 to February 2017:
a. [Time-series plot: Perth petrol price, average price (cents/litre), January 2006 to February 2017, vertical axis from 0 to about 160.]
b. [Time-series plot: Perth petrol price, average price (cents/litre), January 2006 to February 2017, vertical axis from 95 to 155.]
c. [Time-series plot: Perth petrol price, average price (cents/litre), January 2006 to February 2017, vertical axis from 0 to 300.]
Data obtained from Australian Automobile Association accessed April 2017
Which graph do you think best represents the data and why?

2.56 An article in the New York Times (D. Rosato, 'Worried about the numbers? How about the charts?', New York Times, 15 September 2002, Business 7) reported on research done on annual reports of corporations by Professor Deanna Oxender Burgess of Florida Gulf Coast University. Professor Burgess found that even slight distortions in a chart changed readers' perception of the information. The article displayed sales information from the annual report of Zale Corporation and showed how results were exaggerated. Go online or to the library and study the most recent annual report of a local corporation. Find at least one chart in the report that you think needs improvement and develop an improved chart. Explain why you believe the improved chart is better than the one from the annual report.

2.57 Figures 2.1 and 2.3 show a bar chart and a pie chart, respectively, for the online grocery shopping data.
a. Create an exploded pie chart, a doughnut chart, a cone chart or a pyramid chart for the online shopping data.
b. Which graphs do you prefer? Explain.


Assess your progress

Summary

Table 2.16 summarises the tables and charts discussed in this chapter. These tables and charts enabled us to draw conclusions about online grocery shopping, the cost of restaurant meals in a city and its suburbs, and festival expenditure in the scenario at the beginning of the chapter.

Table 2.16  Roadmap for selecting tables and charts

Tabulating, organising and graphically presenting the values of a variable
  Numerical data: ordered array, stem-and-leaf display, frequency distribution, relative frequency distribution, percentage distribution, cumulative percentage distribution, histogram, polygon, cumulative percentage polygon (Sections 2.2 and 2.3)
  Categorical data: summary table, bar chart, pie chart (Section 2.1)

Organising and graphically presenting the relationship between two variables
  Numerical data: scatter diagram, time-series plot (Section 2.5); sparklines, gauges, bullet graph, treemap, drill-down (Section 2.6)
  Categorical data: contingency table, side-by-side bar chart (Section 2.4); treemap, drill-down (Section 2.6)

Now that you have studied tables (which show how data are distributed) and charts (which provide a visual display of how data are distributed), a variety of numerical descriptive measures will be introduced in Chapter 3 for further analysis and interpretation of data.

Key terms
bar chart 39, bullet graph 64, business analytics 63, chartjunk 65, class boundaries 47, class mid-point 47, class width 46, contingency (cross-classification) table 55, cumulative percentage distribution 49, cumulative percentage polygon (ogive) 52, dashboard 64, data discovery 66, descriptive analytics 63, drill-down 66, frequency distribution 46, gauges 64, histogram 50, ordered array 43, percentage distribution 48, percentage polygon 51, pie chart 40, predictive analytics 63, prescriptive analytics 63, range 46, relative frequency distribution 48, scatter diagram 59, side-by-side bar chart 56, sparklines 64, stem-and-leaf display 43, summary table 38, time-series plot 59, treemaps 65

References
1. Few, S. Information Dashboard Design: Displaying Data for At-a-Glance Monitoring, 2nd edn (Burlingame, CA: Analytics Press, 2013).
2. Tufte, E. Beautiful Evidence (Cheshire, CT: Graphics Press, 2006).


Chapter review problems

CHECKING YOUR UNDERSTANDING
2.58 How do histograms and polygons differ with respect to their construction and use?
2.59 When or why would you construct a summary table?
2.60 What are the advantages and/or disadvantages of a bar chart or a pie chart?
2.61 Compare and contrast the bar chart for categorical data and the histogram for numerical data.
2.62 What is the difference between a time-series plot and a scatter diagram?
2.63 What are the three percentage breakdowns that can help you interpret the results found in a cross-classification table?

APPLYING THE CONCEPTS
You can solve problems 2.64 to 2.76 manually or using Microsoft Excel.

2.64 One thousand Australians were asked which websites they had visited in the previous week. The results were:

Type of sites                      Number
Auction                               122
Banking                               245
Classifieds                           213
Dating                                 41
Email                                 552
Gaming                                132
News                                  335
Online music site                     186
Search engine                         743
Shopping                              381
Social network                        649
Sport                                 236
TV                                    201
User generated or upload site         472
Weather                               398

a. Illustrate these data with an appropriate graph or graphs.
b. What can you conclude about the type of website most visited?

2.65 Another poll asked Australians how they spent their time online, with the following result:

Email and communications     19.3%
Multimedia sites             13.1%
Online shopping               5.4%
Reading content              19.9%
Searches                     20.7%
Social networking            21.6%
Total                       100.0%

a. Illustrate these data with an appropriate graph or graphs.
b. What can you conclude about Internet usage? Are these conclusions different from those in problem 2.64? If so, what could the reasons be?

2.66 The following table classifies road fatalities in Australia for 2012 to 2016 by crash type:

Crash type          2012    2013    2014    2015    2016
Multiple vehicle     573     479     503     511     556
Pedestrian           171     158     154     162     171
Single vehicle       556     550     493     532     573
Total              1,300   1,187   1,150   1,205   1,300

Data obtained from the Australian Road Deaths Database at accessed 9 April 2017

a. Illustrate these data by constructing appropriate tables and graphs.
b. What can you say about the pattern of road fatalities in these five years?

2.67 Residents in the seaside town hosting a three-day music festival are concerned that the influx of tourists for this and other events causes an increase in traffic and other offences. As the council area has one of the highest drink-driving rates in the state, Kai is investigating whether tourists can be blamed for this high rate. The following table classifies the previous year's 993 drink-driving offences by the home address of the offender:

Home address                           Number of drink-driving offences
Local – in council area
  Seaside town                                            151
  Not seaside town                                        462
Not local – not in council area
  Intrastate (within state)                               130
  Interstate (another state)                              228
  International (outside Australia)                        22

a. Construct bar and pie charts.
b. What conclusions can Kai draw about the prevalence of drink driving?
c. The headline of an article in the local paper discussing these data was 'Tourists can't be blamed for number of drink-drivers'. Do you agree with this? Justify your answer.
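Problems 2.64 to 2.67 all ask for charts of tallied categorical data. As an aside for readers working in Python rather than Excel, a minimal pandas/matplotlib sketch for problem 2.64 part (a), using the counts from the table above, might look like the following; the choice of a horizontal bar chart ordered by frequency is only one reasonable option.

import pandas as pd
import matplotlib.pyplot as plt

# Website visits from problem 2.64 (number of respondents out of 1,000)
visits = pd.Series(
    {"Auction": 122, "Banking": 245, "Classifieds": 213, "Dating": 41,
     "Email": 552, "Gaming": 132, "News": 335, "Online music site": 186,
     "Search engine": 743, "Shopping": 381, "Social network": 649,
     "Sport": 236, "TV": 201, "User generated or upload site": 472,
     "Weather": 398})

# Order the bars from most to least visited so the leading categories stand out
visits.sort_values(ascending=False).plot(kind="barh")
plt.xlabel("Number of respondents (out of 1,000)")
plt.title("Websites visited in the previous week")
plt.tight_layout()
plt.show()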


2.68 The reasons why Queensland households installed a rainwater tank are given in the table below.

Reason                                     Brisbane    Rest of Queensland
To save water                               142.10           59.00
To save on water costs                       55.60           41.70
Water restrictions on mains water            55.20           31.90
Not connected to mains water                  5.40           73.70
Concerns about quality of mains water         5.40           28.10
Water tank rebates                           43.00            7.90
Other                                        48.50           47.20
Total households (thousands)                216.50          206.30

Data obtained from Australian Bureau of Statistics, Environmental Issues: Water Use and Conservation, Mar 2013, Cat. No. 4602.0.55.003 accessed 4 November 2013

a. Illustrate these data with an appropriate table or graph.
b. What can you conclude about the reasons for installing a rainwater tank, and are there differences between Brisbane and non-Brisbane Queensland households?

2.69 The data in contains the fat and sugar content in grams (g) per 250 ml cup of a random sample of brands of fresh cow's milk for sale in Australia.
a. Use the combined data to construct graphs to explore the relationship between the variables.
b. What conclusions can you reach about the relationship between the fat, sugar and calorie content of fresh milk?

2.70 On the same day in March 2017, the researcher in problem 2.30 also obtained the prices per litre of unleaded petrol and diesel from a random sample of 45 towns and suburbs in Queensland. This set of data is in the data file with the New South Wales data.
a. Using appropriate tables and graphs, investigate the distribution of unleaded petrol and diesel prices in Queensland on this day in March 2017. What can you conclude about the variation in fuel prices in Queensland when the data were collected?
b. Using an appropriate graph, investigate the relationship between petrol and diesel prices in Queensland. What conclusions can you draw about this relationship?
c. Using appropriate tables and graphs, investigate the distribution of unleaded petrol and diesel prices in New South Wales on this day in March 2017. What can you conclude about the variation in fuel prices in New South Wales when the data were collected?
d. Using an appropriate graph, investigate the relationship between petrol prices in New South Wales and Queensland. What conclusions can you draw?
e. Using an appropriate graph, investigate the relationship between diesel prices in New South Wales and Queensland. What conclusions can you draw?
f. The data in was obtained in March 2017. Go to Motor Mouth at , NRMA at , RACQ at , or elsewhere, to collect recent price data. Then use appropriate graphs and tables to investigate any changes in petrol and/or diesel prices in New South Wales and/or Queensland.

2.71 Data from 100 recent property sales from a council area are stored in . For the asking price data:
a. Construct and interpret a stem-and-leaf display.
b. Construct frequency, percentage and cumulative distributions.
c. Construct a frequency histogram, a percentage polygon and an ogive.
d. What conclusions can you make about the distribution of asking prices?
e. Construct and interpret a scatter diagram for asking and selling price.
For the type and bedroom data:
f. Construct cross-classification tables based on total, row and column percentages.
g. Construct side-by-side charts to investigate the relationship between number of bedrooms and type.
h. What conclusions can you make about the relationship between type and number of bedrooms?

2.72 The data in data file give the bank interest rate for standard housing loans in New Zealand and Australia from January 2000 to March 2017. Construct and interpret time-series plots, on the same set of axes, for New Zealand and Australian interest rates from January 2000.

2.73 Using the Australian data from problem 2.72, a PR spokesperson for an Australian political party constructed the following graph to illustrate that the party's influence has lowered interest rates. Do you think this is an ethical graph? Discuss.
[Graph titled 'Australian housing interest rates': interest rate (%) plotted from about September 2011 to March 2017, with the vertical axis running from 5.0 to 8.0 per cent]


2.74 The data in data file contain sample student marks and grades from a population of students enrolled in a statistics unit.
a. Construct an appropriate graph to investigate the distribution of grades. What conclusions can you draw?


b. Construct an appropriate graph to investigate the distribution of total marks. What conclusions can you draw?
c. Construct an appropriate graph to investigate the relationship between a student's semester mark and their exam mark. What conclusions can you draw?

2.75 (Class project) Ask each student in the class to respond to the question 'Which soft drink do you prefer?' and display the results in a summary table.
a. Convert the data to percentages and construct a bar or pie chart.
b. Analyse the findings.

2.76 (Class project) Classify each student in the class on the basis of gender (male, female), study mode (full-time or part-time) and current employment status (full-time, part-time).
a. Construct contingency tables to explore the data.
b. What would you conclude from this study?
c. What other variables would you want to know about employment in order to enhance your findings?
d. Compare your results with those from the Living in Australia Study in problem 2.22.

2.77 The file < DOMESTIC_BEER2 > contains the number of calories per 355 mL and number of carbohydrates (in grams) per 355 mL for a sample of 15 of the best-selling domestic beers in the United States (data obtained from ).
a. Visually evaluate the number of calories per 355 mL for each beer by constructing a bullet graph. For the purposes of comparison, consider calories below 100 as low, between 100 and 160 as medium, and above 160 as high.
b. Visually evaluate the number of carbohydrates (in grams) per 355 mL for each beer by constructing a bullet graph. For the purposes of comparison, consider carbohydrates below 10 grams as low, between 10 and 14 grams as medium, and above 14 grams as high.
c. What preliminary conclusions can you reach about the number of calories and amount of carbohydrate in the beers?
d. Why would constructing sets of gauges for the calories and carbohydrates be a less effective means of visualising these data?

2.78 The file < CURRENCY2 > contains the value of the Canadian dollar, British pound and Euro for one US dollar from 2002 to 2012.
a. Construct sparklines for the value of the US dollar in terms of the Canadian dollar, British pound and Euro.
b. What conclusions can you reach about the value of the US dollar in terms of the Canadian dollar, British pound and Euro from 2002 to 2012?

Continuing cases

Tasman University
Tasman University's Tasman Business School (TBS) regularly surveys business students on a number of issues. In particular, students within the school are asked to complete a student survey when they receive their grades each semester. The results of Bachelor of Business (BBus) and Master of Business Administration (MBA) students who responded to the latest undergraduate (UG) and postgraduate (PG) student surveys are stored in < TASMAN_UNIVERSITY_BBUS_STUDENT_SURVEY > and < TASMAN_UNIVERSITY_MBA_STUDENT_SURVEY >. Copies of the survey questions are stored in Tasman University Undergraduate BBus Student Survey and Tasman University Undergraduate MBA Student Survey.
a For a selection of questions asked in the BBus student survey, construct appropriate tables and charts.
b For a selection of questions asked in the MBA student survey, construct appropriate tables and charts.
c Construct appropriate tables and charts to explore the relationship between selected pairs of questions within a survey or between surveys.
d Write a report summarising your conclusions.


As Safe as Houses
To analyse the real estate market in non-capital cities and towns in states A and B, Safe-As-Houses Real Estate, a large national real estate company, has collected samples of recent residential sales from a sample of non-capital cities and towns in these states. The data are stored in < REAL_ESTATE >.
a For regional city 1, state A:
  i Construct appropriate tables and charts for a selection of variables.
  ii Construct appropriate tables and charts to explore the relationship between pairs of variables.
b For coastal city 1, state A:
  i Construct appropriate tables and charts for a selection of variables.
  ii Construct appropriate tables and charts to explore the relationship between pairs of variables.
c Construct appropriate tables and charts to explore the relationship between the same variable in coastal city 1, state A, and regional city 1, state A.
d Write a report summarising your conclusions.
e Repeat (a) to (d) for another pair of non-capital cities or towns in state A and/or state B.

Chapter 2 Excel Guide

EG2.1 ORGANISING AND VISUALISING CATEGORICAL DATA

ORGANISING CATEGORICAL DATA

The Summary Table
Key technique  Use the PivotTable feature to create a summary table for untallied data.
Example  Create a frequency and percentage summary table similar to Table 2.2B on page 39.
PHStat  Use One-Way Tables & Charts. For the example, open the Property file. Select PHStat ➔ Descriptive Statistics ➔ One-Way Tables & Charts. In the procedure's dialog box (shown in Figure EG2.1):
1. Click Raw Categorical Data (because the worksheet contains untallied data).
2. Enter or highlight G2:G102 as the Raw Data Cell Range and check First cell contains label.
3. Enter a Title, check Percentage Column, and click OK.
Figure EG2.1  One-Way Tables & Charts dialog box

PHStat creates a PivotTable summary table on a new worksheet.

In-depth Excel (untallied data)  Use the Summary_Table workbook as a model. For the example, open the Property file and select Insert ➔ PivotTable. In the Create PivotTable dialog box (shown in Figure EG2.2):
1. Click Select a table or range and enter or highlight G2:G102 as the Table/Range cell range.
2. Click New Worksheet and then click OK.


Figure EG2.2  Create PivotTable dialog box

In the Excel 2016 PivotTable Fields task pane (shown in Figure EG2.3) or in the similar PivotTable Field List task pane in earlier Excels:
3. Tick Type in Choose fields to add to report to add it to the ROWS (or Row Labels) box.
4. Drag Type in Choose fields to add to report and drop it in the Σ Values box. This second label changes to Count of Type to indicate that a count, or tally, of the type categories will be displayed in the PivotTable.
Figure EG2.3  Microsoft Excel PivotTable Fields task pane

In the PivotTable being created:
5. Enter Type in cell A3 to replace the heading Row Labels.
6. Right-click cell A3 and then click PivotTable Options in the shortcut menu that appears.
In the PivotTable Options dialog box (shown in Figure EG2.4):
7. Click the Layout & Format tab.
8. Check For empty cells show and enter 0 as its value. Leave all other settings unchanged.
9. Click OK to complete the PivotTable.
Figure EG2.4  PivotTable Options dialog box

To add a column for the percentage frequency:
10. Enter Percentage in cell C3. Enter the formula =B4/B$6 in cell C4 and copy it down to row 6.
11. Select cell range C4:C6, right-click, and select Format Cells in the shortcut menu.
12. In the Number tab of the Format Cells dialog box, select Percentage as the Category, and the number of decimal places you wish to show, and click OK.
13. Adjust the worksheet formatting, if appropriate, and enter a title in cell A1.
In the PivotTable, type categories appear in alphabetical order. To change the order:
14. Click the Unit label in cell A5 to highlight cell A5. Move the mouse pointer to the top edge of the cell until the mouse pointer changes to a four-way arrow.
15. Drag the Unit label and drop the label over cell A4. The type categories now appear in the order Unit then House in the summary table.

In-depth Excel (tallied data)  Use the SUMMARY_SIMPLE worksheet of the Summary_Table workbook as a model for creating a summary table.
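As an aside for readers working in Python rather than Excel, a rough pandas equivalent of this frequency and percentage summary table is sketched below. The values are made up for illustration; they are not the contents of the Property file.

import pandas as pd

# Illustrative stand-in for the Type column of the Property file
types = pd.Series(["House", "Unit", "House", "House", "Unit", "House"])

# Frequency summary table with a percentage column
summary = types.value_counts().to_frame("Frequency")
summary["Percentage"] = 100 * summary["Frequency"] / summary["Frequency"].sum()
print(summary)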


VISUALISING CATEGORICAL VARIABLES

The Bar Chart and the Pie Chart
Many of the In-depth Excel instructions in the rest of this Excel Guide refer to the labelled Charts group illustration shown in Figure EG2.5.

Figure EG2.5 Microsoft Excel Charts group

Key technique  Use the Excel bar or pie chart feature. If the variable to be visualised is untallied, first construct a summary table (see the instructions in Section EG2.1 'Organising Categorical Data: The Summary Table').
Example  Construct a bar or pie chart from a summary table similar to Table 2.2B on page 39.
PHStat  Use One-Way Tables & Charts. For the example, use the PHStat instructions in Section EG2.1 'Organising Categorical Data: The Summary Table', but in step 3 check either Bar Chart or Pie Chart (or both) in addition to entering a Title, checking Percentage Column, and clicking OK.
In-depth Excel  Use the Summary_Table workbook as a model. For the example, open to the OneWayTable worksheet of the Summary_Table workbook. (The PivotTable in this worksheet was constructed using the instructions in Section EG2.1 'Organising Categorical Data: The Summary Table'.) To construct a bar chart:
1. Select cell range A4:B5. (Begin your selection at cell B5 and not at cell A4, as you would normally do.)
2. In Excel 2016, select Insert, then the Column icon in the Charts group (#1 in the Charts group illustration in Figure EG2.5), and then select the first 2-D Bar gallery item (Clustered Bar). In other Excels, select Insert ➔ Bar icon and then select the first 2-D Bar gallery item (Clustered Bar).
3. Right-click the Count of Type button in the chart and click Hide All Field Buttons on Chart.
4. Select Design ➔ Add Chart Element ➔ Axis Titles ➔ Primary Horizontal. (Earlier Excels) Select Layout ➔ Axis Titles ➔ Primary Horizontal Axis Title ➔ Title Below Axis. Select the words "Axis Title" in the chart and enter the title Frequency.


5. If required, move the chart to a chart sheet (right-click on Chart ➔ Move Chart). Adjust chart formatting if required.
Although not the case with the example, sometimes the horizontal axis scale of a bar chart will not begin at 0. If this occurs, right-click the horizontal (value) axis in the bar chart and click Format Axis in the shortcut menu. In the Format Axis task pane, click Axis Options. In the Axis Options, enter 0 in the Minimum box and then close the pane. In earlier Excels, you set this value in the Format Axis dialog box: click Axis Options in the left pane, and in the Axis Options right pane, click the first Fixed option button (for Minimum), enter 0 in its box, and then click Close.
To construct a pie chart, replace steps 2 and 4 with these steps:
2. Select Insert, then the Pie icon (#3 in the Charts group illustration in Figure EG2.5), and then select the first 2-D Pie gallery item (Pie). In earlier Excels, select Insert ➔ Pie and then select the first 2-D Pie gallery item (Pie).
4. Select Design ➔ Add Chart Element ➔ Data Labels ➔ More Data Label Options. In the Format Data Labels task pane, click Label Options. In the Label Options, check Category Name and Percentage, clear the other Label Contains check boxes, and click Outside End. (To see the Label Options, you may have to first click the chart (fourth) icon near the top of the task pane.) Then close the task pane. (Earlier Excels) Select Layout ➔ Data Labels ➔ More Data Label Options. In the Format Data Labels dialog box, click Label Options in the left pane. In the Label Options right pane, check Category Name and Percentage and clear the other Label Contains check boxes. Click Outside End and then click Close.
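For comparison, a minimal Python sketch of a bar chart and a pie chart drawn from a frequency summary table is shown below; the counts are illustrative only, not the Property data.

import pandas as pd
import matplotlib.pyplot as plt

# Illustrative frequency summary table (counts by property type)
counts = pd.Series({"House": 70, "Unit": 30})

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
counts.plot(kind="barh", ax=ax1)                       # bar chart of frequencies
ax1.set_xlabel("Frequency")
counts.plot(kind="pie", ax=ax2, autopct="%.1f%%")      # pie chart with percentage labels
ax2.set_ylabel("")
plt.tight_layout()
plt.show()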

EG2.2 ORGANISING NUMERICAL DATA

Stacked and Unstacked Data
PHStat  Use Stack Data or Unstack Data. For example, to unstack the Asking Price variable by the Type variable in the property data given in Example 2.1, open the Property file. Select Data Preparation ➔ Unstack Data. In that procedure's dialog box, enter or highlight G2:G102 (the Type variable cell range) as the Grouping Variable Cell Range and enter or highlight A2:A102 (the Asking Price variable cell range) as the Stacked Data Cell Range. Check First cells in both ranges contain label and click OK. The unstacked data appear on a new worksheet.
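A rough pandas sketch of the same unstacking idea, splitting one stacked column of values into one column per group, is given below. The column names and values are hypothetical, not taken from the Property file.

import pandas as pd

# Illustrative stacked data: one column of values plus a grouping column
stacked = pd.DataFrame({
    "AskingPrice": [450, 380, 520, 300, 610],
    "Type":        ["House", "Unit", "House", "Unit", "House"]})

# One column (Series) of asking prices per property type
unstacked = {group: sub["AskingPrice"].reset_index(drop=True)
             for group, sub in stacked.groupby("Type")}
print(pd.DataFrame(unstacked))   # groups side by side; shorter columns are padded with NaN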


The Ordered Array
In-depth Excel  To create an ordered array, first select the numerical variable to be sorted. Then select Home ➔ Sort & Filter (in the Editing group) and in the drop-down menu click Sort Smallest to Largest. (You will see Sort A to Z as the first drop-down choice if you did not select a cell range of numerical data.)

The Stem-and-Leaf Display
Key technique  Enter leaves as a string of digits.
Example  Construct a stem-and-leaf display for festival expenditure by interstate visitors, similar to Figure 2.5 on page 45.
PHStat  Use the Stem-and-Leaf Display. For the example, open the Festival file. Select PHStat ➔ Descriptive Statistics ➔ Stem-and-Leaf Display. In the procedure's dialog box (shown in Figure EG2.6):
1. Enter or highlight A2:A54 as the Variable Cell Range and check First cell contains label.
2. Click Set stem unit as and enter 100 in its box.
3. Enter a Title and click OK.
Figure EG2.6  Stem-and-Leaf Display dialog box

When creating other displays, use the Set stem unit as option sparingly and only if Autocalculate stem unit creates a display that has too few or too many stems. (Any stem unit you specify must be a power of 10.)

In-depth Excel  Use the Stem_and_Leaf workbook as a model. Manually construct the stems and leaves on a new worksheet to create a stem-and-leaf display. Adjust the column width of the column that holds the leaves as necessary.
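As an aside, an ordered array and a basic stem-and-leaf display (with a stem unit of 100, as in the PHStat example) can be produced in a few lines of Python. The expenditure values below are made up for illustration; they are not the Festival data.

# Ordered array: simply sort the observations
expenditure = [104, 253, 318, 147, 207, 389, 121, 266]   # illustrative values
ordered = sorted(expenditure)
print(ordered)

# Basic stem-and-leaf display, stem unit 100 (leaves are the remaining two digits)
stems = {}
for x in ordered:
    stems.setdefault(x // 100, []).append(x % 100)
for stem in sorted(stems):
    leaves = " ".join(f"{leaf:02d}" for leaf in stems[stem])
    print(f"{stem} | {leaves}")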

EG2.3 SUMMARISING AND VISUALISING NUMERICAL DATA

SUMMARISING NUMERICAL DATA

The Frequency Distribution
Key technique  Establish bins and then use the FREQUENCY(untallied data cell range, bins cell range) array function to tally data.
Example  Create frequency, percentage and cumulative percentage distributions for the restaurant meal cost data as in Tables 2.5, 2.7 and 2.9 in Section 2.3.
To construct a frequency distribution using Excel or PHStat, you must first define your classes by a bin range.

Defining Classes Using Bins
Open the worksheet containing the data you want to summarise in classes. Decide on your classes and, in a separate column, enter the upper boundary or maximum value, called the bin value, for each class. This gives the Bin Cell Range. If the data are discrete, the bin range should contain the highest value in each class. If the data are continuous but recorded to a set number of decimal places, the values in the bin range should be just less than the minimum value in the next class. In this case, record the value in the bin range to one or two more significant figures than the data. For example, for the restaurant data in < RESTAURANT > in Section 2.3, the following classes were required (see Table 2.5): $10 to less than $15, $15 to less than $20 and so on. As the first class is $10 to less than $15, $15 belongs in the second class and the bin value for the first class is just less than this, 14.99 or 14.999. Therefore, the Bin Cell Range would be 14.999, 19.999, 24.999 and so on.

Class            Bin value    Class mid-point
$10 to < $15       14.999        $12.50
$15 to < $20       19.999        $17.50
$20 to < $25       24.999        $22.50
:                  :             :
$60 to < $65       64.999        $62.50
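In Python this class definition can be expressed with exact boundaries and right-open intervals, which sidesteps the 14.999 bin-value workaround. The meal costs below are illustrative only, not the Restaurant data.

import pandas as pd

# Illustrative meal costs; classes $10 to < $15, $15 to < $20, ..., $60 to < $65
costs = pd.Series([12.5, 18.0, 22.4, 31.9, 44.0, 59.5, 15.0])
edges = list(range(10, 70, 5))                      # class boundaries 10, 15, ..., 65
classes = pd.cut(costs, bins=edges, right=False)    # right=False gives [10, 15), [15, 20), ...
print(classes.value_counts(sort=False))             # frequency distribution in class order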

PHStat (untallied data)  Use Frequency Distribution. (Use Histogram & Polygons, discussed later in Section EG2.3, if you plan to construct a histogram or polygon in addition to a frequency distribution.) For the example, open the Restaurant file. The Data worksheet contains the meal cost data in stacked format in column G. Enter an appropriate bin cell range


(see above) in column H (say, H1:H12). Select PHStat ➔ Descriptive Statistics ➔ Frequency Distribution. In the procedure's dialog box (shown in Figure EG2.7):
1. Enter or highlight G1:G101 as the Variable Cell Range, enter or highlight H1:H12 as the Bins Cell Range, and check First cell in each range contains label.
2. Click Multiple Groups - Stacked and enter or highlight A1:A101 as the Grouping Variable Cell Range. (The cell range A1:A101 contains the Location variable.)
3. Enter a Title and click OK.
Figure EG2.7  Frequency Distribution dialog box

Click Single Group Variable in step 2 if constructing a distribution from a single group of untallied data. Click Multiple Groups - Unstacked in step 2 if the Variable Cell Range contains two or more columns of unstacked, untallied data. Frequency distributions for the two groups appear on separate worksheets. To display the information for the two groups on one worksheet, select the cell range B3:D14 on one of the worksheets. Right-click that range and click Copy in the shortcut menu. Open to the other worksheet. In that other worksheet, right-click cell E3 and click Paste Special in the shortcut menu. In the Paste Special dialog box, click Values and numbers format and click OK. Adjust the worksheet title as necessary.

In-depth Excel (untallied data)  Use the Distributions workbook as a model. For the example, use the Unstacked worksheet of the Restaurant file. This worksheet contains the meal cost data unstacked in columns A and B. Enter an appropriate bin range (see above) in column D (say, D1:D12). Then: 1. Right-click the Unstacked sheet tab and click Insert in the shortcut menu.


2. In the General tab of the Insert dialog box, click Worksheet and then click OK.
In the new worksheet:
3. Enter a title in cell A1, Bins in cell A3 and Frequency in cell B3.
4. Copy the bin number list in the cell range D2:D12 of the Unstacked worksheet and paste this list into cell A4 of the new worksheet.
5. Select the cell range B4:B14 that will hold the array formula.
6. Type (but do not press the Enter or Tab key) the formula =FREQUENCY(UNSTACKED!$A$1:$A$51, $A$4:$A$14). Then, while holding down the Ctrl and Shift keys, press the Enter key to enter the array formula into the cell range B4:B14.
7. Adjust the worksheet formatting as necessary.
Note that in step 6 you enter the cell range as UNSTACKED!$A$1:$A$51 and not as $A$1:$A$51 because the untallied data are located on another (the Unstacked) worksheet.
Steps 1 to 7 construct a frequency distribution for the meal costs at city restaurants. To construct a frequency distribution for the meal costs at suburban restaurants, repeat steps 1 to 7 but in step 6 type =FREQUENCY(UNSTACKED!$B$1:$B$51, $A$4:$A$14) as the array formula.
To display the distributions for the two groups on one worksheet, select the cell range B3:B14 on one of the worksheets. Right-click that range and click Copy in the shortcut menu. Open to the other worksheet. In that other worksheet, right-click cell C3 and click Paste Special in the shortcut menu. In the Paste Special dialog box, click Values and numbers format and click OK. Adjust the worksheet title as necessary.

Analysis ToolPak (untallied data)  Use Histogram. For the example, use the Unstacked worksheet of the Restaurant file. This worksheet contains the meal cost data unstacked in columns A and B. Enter an appropriate bin range (see above) in column D (say, D1:D12). Then:
1. Select Data ➔ Data Analysis. In the Data Analysis dialog box, select Histogram from the Analysis Tools list and then click OK.
In the Histogram dialog box (shown in Figure EG2.8):
2. Enter or highlight A1:A51 as the Input Range and enter or highlight D1:D12 as the Bin Range. (If you leave Bin Range blank, the procedure creates a set of bins that will not be as well formed as the ones you can specify.)
3. Check Labels and click New Worksheet Ply.
4. Click OK to create the frequency distribution on a new worksheet.


Figure EG2.8 Histogram dialog box

In the new worksheet:
5. Select row 1. Right-click this row and click Insert in the shortcut menu. Repeat. (This creates two blank rows at the top of the worksheet.)
6. Enter a title in cell A1.
The ToolPak creates a frequency distribution that contains an improper bin labelled More. Correct this error by using these general instructions:
7. Manually add the frequency count of the More row to the frequency count of the preceding row. (For the example, the More row contains a zero for the frequency, so the frequency of the preceding row does not change.)
8. Select the worksheet row (for this example, row 15) that contains the More row.
9. Right-click that row and click Delete in the shortcut menu.
Steps 1 to 9 construct a frequency distribution for the meal costs at city restaurants. To construct a frequency distribution for the meal costs at suburban restaurants, repeat these nine steps but in step 2 enter or highlight B1:B51 as the Input Range.

The Relative Frequency, Percentage and Cumulative Distributions
Key technique  Add columns that contain formulas for the relative frequency or percentage and cumulative percentage to a previously constructed frequency distribution.
Example  Create a distribution that includes the relative frequency or percentage as well as the cumulative percentage, as in Tables 2.7 (relative frequency and percentage) and 2.9 (cumulative percentage) in Section 2.3 for the restaurant meal cost data.
PHStat (untallied data)  Use Frequency Distribution. For the example, use the PHStat instructions in 'Summarising Numerical Data: The Frequency Distribution' to

construct a frequency distribution. Note that the frequency distribution constructed by PHStat also includes columns for the percentages and cumulative percentages. To change the column of percentages to a column of relative frequencies, reformat that column. For the example, open to the new worksheet that contains the city restaurant frequency distribution and:
1. Select the cell range C4:C14, right-click, and select Format Cells from the shortcut menu.
2. In the Number tab of the Format Cells dialog box, select Number as the Category and click OK.
Then repeat these two steps for the new worksheet that contains the suburban restaurant frequency distribution.

In-depth Excel (untallied data)  Use the Distributions workbook as a model. For the example, first construct a frequency distribution using the In-depth Excel instructions in 'Summarising Numerical Data: The Frequency Distribution'. Open to the new worksheet that contains the frequency distribution for the city restaurants and:
1. Enter Percentage in cell C3 and Cumulative Pctage in cell D3.
2. Enter =B4/SUM($B$4:$B$14) in cell C4 and copy this formula down to row 14.
3. Enter =C4 in cell D4.
4. Enter =C5+D4 in cell D5 and copy this formula down to row 14.
5. Select the cell range C4:D14, right-click, and click Format Cells in the shortcut menu.
6. In the Number tab of the Format Cells dialog box, click Percentage in the Category list and click OK.
Then open to the worksheet that contains the frequency distribution for the suburban restaurants and repeat steps 1 to 6.
If you want column C to display relative frequencies instead of percentages, enter Rel. Frequencies in cell C3. Select the cell range C4:C12, right-click, and click Format Cells in the shortcut menu. In the Number tab of the Format Cells dialog box, click Number in the Category list and click OK.
Analysis ToolPak  Use Histogram and then modify the worksheet created. For the example, first construct the frequency distributions using the Analysis ToolPak instructions in 'The Frequency Distribution'. Then use the In-depth Excel instructions to modify those distributions.
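The same relative frequency, percentage and cumulative percentage columns can be sketched in pandas, mirroring the worksheet formulas above. The frequencies and class labels below are made up for illustration.

import pandas as pd

# Illustrative frequency distribution (counts per class, in class order)
freq = pd.Series([2, 5, 8, 4, 1],
                 index=["$10 to <$15", "$15 to <$20", "$20 to <$25",
                        "$25 to <$30", "$30 to <$35"],
                 name="Frequency").to_frame()

freq["Relative frequency"] = freq["Frequency"] / freq["Frequency"].sum()
freq["Percentage"] = 100 * freq["Relative frequency"]
freq["Cumulative percentage"] = freq["Percentage"].cumsum()
print(freq)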

VISUALISING NUMERICAL DATA

The Histogram
Key technique  Construct a histogram.


Example  Construct histograms for the price of main meals in city restaurants, similar to Figure 2.6 on page 50.
PHStat  Use Histogram & Polygons.

83

Figure EG2.9 Histogram & Polygons dialog box

PHStat Defining Classes: Bins and Mid-points
If constructing a frequency polygon or histogram using PHStat, include a class of zero frequency at the beginning of the bin range. For example, for the restaurant data in < RESTAURANT > in Section 2.3, the first class of non-zero frequency is $10 to less than $15 with bin value 14.999, so the class $5 to less than $10 must be included before this. Therefore, the Bin Cell Range would be 9.999, 14.999, 19.999, 24.999 and so on. PHStat also requires a Mid-point Cell Range. Since PHStat associates the first mid-point given with the second bin value or second class, the Mid-point Cell Range must have one fewer cells/values than the Bin Cell Range.

Price    Bin values    Class mid-points
  50        9.999          $12.50
  38       14.999          $17.50
  43       19.999          $22.50
  56       24.999          $27.50
  51       29.999          $32.50
  36       34.999          $37.50
  25       39.999          $42.50
  33       44.999          $47.50
  41       49.999          $52.50
  44       54.999          $57.50
  34       59.999          $62.50
  39       64.999

For the example, open to the Data worksheet of the Restaurant file. Select PHStat ➔ Descriptive Statistics ➔ Histogram & Polygons. Enter an appropriate bin range (see above) in column H (say, H1:H13) and mid-point range in column I (say, I1:I12). Then, in the procedure's dialog box (shown in Figure EG2.9):
1. Enter or highlight G1:G101 as the Variable Cell Range, H1:H13 as the Bins Cell Range and I1:I12 as the Midpoints Cell Range, and check First cell in each range contains label.
2. Click Multiple Groups - Stacked and enter or highlight A1:A101 as the Grouping Variable Cell Range. (In the Data worksheet of the Restaurant file, the prices of meals in city and suburban restaurants are stacked, or placed in a single column. The column A values allow PHStat to separate the city restaurant prices from the suburban restaurant prices.)
3. Enter a Title, check Histogram, and click OK.
PHStat inserts two new worksheets, each of which contains a frequency distribution and a histogram. Since you cannot define an explicit lower boundary for the first bin, the first bin can never have a mid-point. Therefore, the Midpoints Cell Range you enter must have one fewer cell than the Bins Cell Range. PHStat associates the first mid-point with the second bin and uses -- as the label for the first bin. When you include a class of zero frequency before the first class of non-zero frequency, as in this example, the histogram bar labelled -- will always be a zero bar.

In-depth Excel  Use the Histogram workbook as a model. For the example, first construct frequency distributions for city and suburban meal prices. Open the Unstacked worksheet in the Restaurant file. This worksheet contains the meal cost data unstacked in columns A and B. Enter appropriate bin cell and mid-point cell ranges, including titles, in columns D and E (say, D1:D12 and E1:E12). Then:
1. Right-click the Unstacked sheet tab and click Insert in the shortcut menu.
2. In the General tab of the Insert dialog box, click Worksheet and then click OK.
In the new worksheet:
3. Enter a title in cell A1, Bins in cell A3, Frequency in cell B3, and Midpoints in cell C3.


4. Copy the bin values in the cell range D2:D12 of the Unstacked worksheet and paste this list into cell A4 of the new worksheet.
5. Copy the mid-points in the cell range E2:E12 of the Unstacked worksheet and paste this list into cell C4 of the new worksheet.
6. Select the cell range B4:B14 that will hold the array formula.
7. Type (but do not press the Enter or Tab key) the formula =FREQUENCY(UNSTACKED!$A$2:$A$51, $A$4:$A$14). Then, while holding down the Ctrl and Shift keys, press the Enter key to enter the array formula into the cell range B4:B14.
8. Adjust the worksheet formatting as necessary.
Steps 1 to 8 construct a frequency distribution for city restaurant main meal prices. To construct a frequency distribution for main meal prices for suburban restaurants, repeat steps 1 to 8 but in step 7 type =FREQUENCY(UNSTACKED!$B$1:$B$51, $A$4:$A$14) as the array formula.
Having constructed the two frequency distributions, continue constructing the two histograms. Open to the worksheet that contains the frequency distribution for city restaurant prices and:
1. Select the cell range B3:B14 (the cell range of the frequencies).
2. Select Insert, then the Column icon in the Charts group (#1 in the Charts group illustration in Figure EG2.5), and then select the first 2-D Column gallery item (Clustered Column). In earlier Excels, select Insert ➔ Column and select the first 2-D Column gallery item (Clustered Column).
3. Right-click the chart and click Select Data in the shortcut menu.
In the Select Data Source dialog box:
4. Click Edit under the Horizontal (Categories) Axis Labels heading.
5. In the Axis Labels dialog box, drag the mouse to select the cell range C4:C14 (containing the mid-points) to enter that cell range. Do not type this cell range in the Axis label range box as you would otherwise do. Click OK in this dialog box and then click OK (in the Select Data Source dialog box).
In the chart:
6. Right-click inside a bar and click Format Data Series in the shortcut menu.
7. In the Format Data Series task pane, click Series Options. In the Series Options, click Series Options, enter 0 in the Gap Width box, and then close the task pane. (To see the Series Options, you may have to first click the chart [third] icon near the top of the task pane.) (Earlier Excels) In the Format Data Series dialog box, click Series Options in the left pane, and in

the Series Options right pane, change the Gap Width slider to No Gap. Click Close.
8. Move the chart to a chart sheet (right-click on Chart ➔ Move Chart). Adjust chart formatting if required.
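As a point of comparison, a histogram with the same class boundaries (and, by default, no gaps between adjoining bars) can be sketched in Python as follows; the meal costs are illustrative only, not the Restaurant data.

import matplotlib.pyplot as plt

# Illustrative city-restaurant meal costs
city_costs = [12, 14, 18, 22, 23, 27, 31, 33, 38, 41, 44, 52, 58, 61]
edges = list(range(10, 70, 5))               # class boundaries 10, 15, ..., 65

fig, ax = plt.subplots()
ax.hist(city_costs, bins=edges, edgecolor="black")   # adjoining bars, one per class
ax.set_xlabel("Cost of main meal ($)")
ax.set_ylabel("Frequency")
ax.set_title("City restaurants")
plt.show()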

Analysis ToolPak  Use Histogram. For the example, open the Unstacked worksheet in the Restaurant file. Enter appropriate bin cell and mid-point cell ranges, including titles, in columns D and E (say, D1:D12 and E1:E12) and:
1. Select Data ➔ Data Analysis. In the Data Analysis dialog box, select Histogram from the Analysis Tools list and then click OK.
In the Histogram dialog box:
2. Enter or highlight A1:A51 as the Input Range and enter or highlight D1:D12 as the Bin Range.
3. Check Labels, click New Worksheet Ply, and check Chart Output.
4. Click OK to create the frequency distribution and histogram on a new worksheet.
In the new worksheet:
5. Follow steps 5 to 9 of the Analysis ToolPak instructions in 'Summarising Numerical Data: The Frequency Distribution' above.
These steps construct a frequency distribution and histogram for city restaurant main meal prices. To construct a frequency distribution and histogram for suburban restaurant main meal prices, repeat the nine steps but in step 2 enter or highlight B1:B51 as the Input Range.
You will need to correct several formatting errors that Excel makes to the histograms it constructs. For each histogram:
1. Right-click inside a bar and click Format Data Series in the shortcut menu.
2. In the Format Data Series task pane, click Series Options. In the Series Options, click Series Options, enter 0 in the Gap Width box, and then close the task pane. (To see the Series Options, you may have to first click the chart [third] icon near the top of the task pane.) (Earlier Excels) In the Format Data Series dialog box, click Series Options in the left pane, and in the Series Options right pane, change the Gap Width slider to No Gap. Click Close.
Histogram bars are labelled by bin numbers. To change the labelling to mid-points, open to each of the new worksheets and:
3. Enter Midpoints in cell C3. Copy the mid-point cell range E2:E12 of the Unstacked worksheet and paste this list into cell C4 of the new worksheet.
4. Right-click the histogram and click Select Data.


5. In the Select Data Source dialog box, click Edit under the Horizontal (Categories) Axis Labels heading.
6. In the Axis Labels dialog box, drag the mouse to select the cell range C4:C14 to enter that cell range. Do not type this cell range in the Axis label range box as you would otherwise do. Click OK in this dialog box and then click OK (in the Select Data Source dialog box).
7. Move the chart to a chart sheet (right-click on Chart ➔ Move Chart). Adjust chart formatting if required.

The Percentage Polygon and the Cumulative Percentage Polygon (Ogive)
Key technique  Construct percentage and cumulative percentage polygons.
Example  Construct percentage and cumulative percentage polygons for main meal prices at city and suburban restaurants, similar to Figure 2.8 on page 52 and Figure 2.10 on page 53.
PHStat  Use Histogram & Polygons. For the example, use the PHStat instructions for creating a histogram on page 83, but in step 3 of those instructions also check Percentage Polygon and Cumulative Percentage Polygon (Ogive) before clicking OK.
In-depth Excel  Use the Polygons workbook as a model. For the example, open the Unstacked worksheet in the Restaurant file. Then follow steps 1 to 8 to construct a histogram for city restaurant meal prices. However, include a class of zero frequency at either end of your bin cell range (say, in cells D1:D14, including the title; also add the corresponding class mid-point cells E1:E14). Repeat steps 1 to 8 but in step 7 type the array formula =FREQUENCY(UNSTACKED!$B$1:$B$51, $A$4:$A$16) to construct a frequency distribution for suburban restaurant main meal prices.
Open to the worksheet that contains the city restaurant meal price frequency distribution and:
1. Select column C. Right-click and click Insert in the shortcut menu. Right-click and click Insert in the shortcut menu a second time. (The worksheet contains new, blank columns C and D and the mid-points column is now column E.)
2. Enter Percentage in cell C3 and Cumulative Pctage in cell D3.


3. Enter =B4/SUM($B$4:$B$16) in cell C4 and copy this formula down to row 16.
4. Enter =C4 in cell D4.
5. Enter =C5+D4 in cell D5 and copy this formula down to row 16.
6. Select the cell range C4:D16, right-click, and click Format Cells in the shortcut menu.
7. In the Number tab of the Format Cells dialog box, click Percentage in the Category list and click OK.
Open to the worksheet that contains the suburban restaurant main meal price frequency distribution and repeat steps 1 to 7.
To construct the percentage polygons, open to the worksheet that contains the city restaurant price frequency distribution and:
1. Select cell range C4:C16.
2. Select Insert, then select the Line icon in the Charts group (#2 in the Charts group illustration in Figure EG2.5), and then select the fourth 2-D Line gallery item (Line with Markers). In earlier Excels, select Insert ➔ Line and select the fourth 2-D Line gallery item (Line with Markers).
3. Right-click the chart and click Select Data in the shortcut menu.
In the Select Data Source dialog box:
4. Click Edit under the Legend Entries (Series) heading. In the Edit Series dialog box, enter the formula ="City Restaurants" as the Series name and click OK.
5. Click Edit under the Horizontal (Categories) Axis Labels heading. In the Axis Labels dialog box, drag the mouse to select the cell range E4:E16 to enter that cell range. Do not type this cell range in the Axis label range box as you would otherwise do.
6. Click OK in this dialog box and then click OK (in the Select Data Source dialog box).
Back in the chart:
7. Move the chart to a chart sheet (right-click on Chart ➔ Move Chart). Adjust chart formatting if required.
In the new chart sheet:
8. Right-click the chart and click Select Data in the shortcut menu.
9. In the Select Data Source dialog box, click Add.
In the Edit Series dialog box:
10. Enter the formula ="Suburban Restaurants" as the Series name and press Tab.
11. With the current value in Series values highlighted, click the worksheet tab for the worksheet that contains the suburban restaurant meal price frequency distribution.
12. Drag the mouse to select the cell range C4:C16 to enter that cell range as the Series values. Do not type this cell range in the Series values box as you would otherwise do.
13. Click OK. Back in the Select Data Source dialog box, click OK.
To construct the cumulative percentage polygons, open to the worksheet that contains the city restaurant price of main meal frequency distribution and repeat steps 1 to 13, but replace steps 1, 5 and 12 with the following:
1. Select the cell range D4:D16.
5. Click Edit under the Horizontal (Categories) Axis Labels heading. In the Axis Labels dialog box, drag the mouse to select the cell range A4:A16 to enter that cell range.
12. Drag the mouse to select the cell range D4:D16 to enter that cell range as the Series values.
If the Y axis of the cumulative percentage polygon extends past 100%, right-click the axis and click Format Axis in the shortcut menu. In the Format Axis task pane, click Axis Options. In the Axis Options, enter 0 in the Minimum box and 1 in the Maximum box and then close the pane. In earlier Excels, you set this value in the Format Axis dialog box: click Axis Options in the left pane, and in the Axis Options right pane, click the first Fixed option button (for Minimum), enter 0 in its box, and then click Close.
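For readers who prefer Python, percentage polygons and ogives for two groups can be sketched directly from the class frequencies; the frequencies below are made up for illustration, not the Restaurant results.

import numpy as np
import matplotlib.pyplot as plt

# Illustrative frequencies for two groups over the same classes
edges = np.arange(10, 70, 5)                       # class boundaries 10, 15, ..., 65
midpoints = edges[:-1] + 2.5                       # 12.5, 17.5, ..., 62.5
city = np.array([2, 5, 8, 9, 7, 6, 4, 3, 3, 2, 1])
suburban = np.array([4, 7, 9, 8, 6, 5, 4, 3, 2, 1, 1])

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 3))
for freq, label in [(city, "City"), (suburban, "Suburban")]:
    pct = 100 * freq / freq.sum()
    ax1.plot(midpoints, pct, marker="o", label=label)            # percentage polygon at mid-points
    ax2.plot(edges[1:], pct.cumsum(), marker="o", label=label)   # ogive at upper class boundaries
ax1.set_title("Percentage polygons")
ax2.set_title("Cumulative percentage (ogive)")
ax1.legend()
ax2.legend()
plt.tight_layout()
plt.show()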


12. Drag the mouse to select the cell range C4:C16 to enter that cell range as the Series values. Do not type this cell range in the Series values box as you would otherwise do. 13. Click OK. Back in the Select Data Source dialog box, click OK. To construct the cumulative percentage polygons, open to the worksheet that contains the city restaurant price of main meal frequency distribution and repeat steps 1 to 13 but replace steps 1, 5 and 12 with the following: 1. Select the cell range D4:D16. 5. Click Edit under the Horizontal (Categories) Axis Labels heading. In the Axis Labels dialog box, drag the mouse to select the cell range A4:A16 to enter that cell range. 12. Drag the mouse to select the cell range D4:D16 to enter that cell range as the Series values. If the Y axis of the cumulative percentage polygon extends past 100%, right-click the axis and click Format Axis in the shortcut menu. In the Format Axis task pane, click Axis Options. In the Axis Options, enter 0 in the Minimum box and 1 in the Maximum box and then close the pane. In earlier Excels, you set this value in the Format Axis dialog box. Click Axis Options in the left pane, and in the Axis Options right pane, click the first Fixed option button (for Minimum), enter 0 in its box, and then click Close.

EG2.4 ORGANISING AND VISUALISING TWO CATEGORICAL VARIABLES ORGANISING TWO CATEGORICAL VARIABLES The Contingency Table Key technique  Use the PivotTable feature to create a contingency table for untallied data. Example  Construct a contingency table for location and number of bedrooms similar to Table 2.11 on page 55. PHStat (untallied data)  Use Two-Way Tables & Charts. For the example, open the Property file. Select PHStat ➔ Descriptive Statistics ➔ Two-Way Tables & Charts. In the procedure’s dialog box (shown in Figure EG2.10): 1. Enter or highlight F2:F102 as the Row Variable Cell Range. 2. Enter or highlight C2:C102 as the Column Variable Cell Range.

3. Check First cell in each range contains label.
4. Enter a Title and click OK.
Figure EG2.10  Two-Way Tables & Charts dialog box

In-depth Excel (untallied data)  Use the Contingency_Table workbook as a model. For the example, open the Property file. Select Insert ➔ PivotTable. In the Create PivotTable dialog box:
1. Click Select a table or range and enter or highlight C2:F102 as the Table/Range cell range.
2. Click New Worksheet and then click OK.
In the PivotTable Fields (called the PivotTable Field List in some Excel versions) task pane:
3. Tick Location in Choose fields to add to report to add it to the ROWS (or Row Labels) box.
4. Tick Bedrooms in Choose fields to add to report and drag it to the COLUMNS (or Column Labels) box.
5. Drag Location in Choose fields to add to report and drop it in the Σ Values box. (Location changes to Count of Location.)
In the PivotTable being created:
6. Select cell A3 and enter a space character to clear the label Count of Location.
7. Enter Location in cell A4 to replace the heading Row Labels.
8. Enter Bedroom in cell B3 to replace the heading Column Labels.
9. Right-click over the PivotTable and then click PivotTable Options in the shortcut menu that appears.
In the PivotTable Options dialog box:
10. Click the Layout & Format tab.
11. Check For empty cells show and enter 0 as its value. Leave all other settings unchanged.
12. Click the Total & Filters tab.
13. Check Show grand totals for columns and Show grand totals for rows.
14. Click OK to complete the table.


In-depth Excel (tallied data)  Use the CONTINGENCY_SIMPLE worksheet of the ­Contingency_Table workbook as a model for creating a contingency table.
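A rough pandas sketch of the same contingency table, including grand totals and a row-percentage version, is given below. The column names and values are hypothetical, not the Property data.

import pandas as pd

# Illustrative property data: location and number of bedrooms
props = pd.DataFrame({
    "Location": ["Coastal", "Inland", "Coastal", "Coastal", "Inland", "Inland"],
    "Bedrooms": [2, 3, 3, 4, 2, 3]})

# Contingency (cross-classification) table with grand totals
table = pd.crosstab(props["Location"], props["Bedrooms"],
                    margins=True, margins_name="Total")
print(table)

# Row percentages: each row sums to 100%
print(100 * pd.crosstab(props["Location"], props["Bedrooms"], normalize="index"))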

VISUALISING TWO CATEGORICAL VARIABLES

The Side-By-Side Chart
Key technique  Use an Excel bar chart that is based on a contingency table.
Example  Construct a side-by-side chart that displays location and number of bedrooms, similar to Figure 2.12 on page 56.
PHStat  Use Two-Way Tables & Charts. For the example, use the Section EG2.4 'The Contingency Table' PHStat instructions, but in step 4 check Side-by-Side Bar Chart in addition to entering a Title and clicking OK.
In-depth Excel  Use the Contingency_Table workbook as a model. For the example, open to the TwoWayTable worksheet of the Contingency_Table workbook and:
1. Select cell A3 (or any other cell inside the PivotTable).
2. Select Insert ➔ Column in Excel 2016, or Bar in earlier Excel versions, and select the first 2-D Bar gallery item (Clustered Bar).
3. Right-click the Count of Location button in the chart and click Hide All Field Buttons on Chart.
4. Move the chart to a chart sheet (right-click on Chart ➔ Move Chart). Adjust formatting if required.
When creating a chart from a contingency table that is not a PivotTable, select the cell range of the contingency table, including row and column headings but excluding the total row and total column, as step 1.
If you need to switch the row and column variables in a side-by-side chart, right-click the chart and then click Select Data in the shortcut menu. In the Select Data Source dialog box, click Switch Row/Column and then click OK. (In Excel 2007, if the chart is based on a PivotTable, the Switch Row/Column button will be disabled. In that case, you need to change the PivotTable to change the chart.)
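A side-by-side bar chart can be drawn directly from a contingency table in pandas, as sketched below with an illustrative (made-up) table; rows become groups of bars and columns become the bars within each group.

import pandas as pd
import matplotlib.pyplot as plt

# Illustrative contingency table: rows = location, columns = number of bedrooms
table = pd.DataFrame({2: [8, 5], 3: [12, 15], 4: [6, 9]},
                     index=["Coastal", "Inland"])

table.plot(kind="bar")          # one cluster of bars per row category
plt.ylabel("Frequency")
plt.legend(title="Bedrooms")
plt.tight_layout()
plt.show()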

EG2.5 VISUALISING TWO NUMERICAL VARIABLES

The Scatter Diagram
Key technique  Use the Excel scatter chart.


Example  Construct a scatter diagram of number of bedrooms and asking price, similar to Figure 2.14 on page 59.
PHStat  Use Scatter Plot. For the example, open the Property file. Select PHStat ➔ Descriptive Statistics ➔ Scatter Plot. In the procedure's dialog box (shown in Figure EG2.11):
1. Enter or highlight A2:A102 as the Y Variable Cell Range.
2. Enter or highlight C2:C102 as the X Variable Cell Range.
3. Check First cells in each range contains label.
4. Enter a Title and click OK.

Figure EG2.11 Scatter Plot dialog box

To add a superimposed line like the one shown in Figure 2.14, click the chart and use step 3 of the In-depth Excel instructions.

In-depth Excel  Use the Scatter_Diagram workbook as a model. For the example, open the Property file. The two variables 'Number of bedrooms' and 'Asking price' have been copied to columns I and J.
1. Select the cell range I2:J102.
2. Select Insert, then the Scatter (X, Y) icon in the Charts group (#4 in the illustration in Figure EG2.5), and then select the first Scatter gallery item (Scatter). In earlier Excels, select Insert ➔ Scatter and select the first Scatter gallery item (Scatter with only Markers).
3. Select Design ➔ Add Chart Element ➔ Trendline ➔ Linear. In earlier Excels, select Layout ➔ Trendline ➔ Linear Trendline.
4. Move the chart to a chart sheet (right-click on Chart ➔ Move Chart). Adjust chart formatting if required.


When constructing Excel scatter diagrams with other variables, make sure that the X or horizontal variable column precedes (is to the left of) the Y or vertical variable column. (If the worksheet is arranged Y then X, cut and paste so that the Y variable column appears to the right of the X variable column.)
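As an aside, a scatter diagram with a superimposed linear trend line can be sketched in Python as follows; the bedroom and price values are made up for illustration, not the Property data.

import numpy as np
import matplotlib.pyplot as plt

# Illustrative data: number of bedrooms (X) and asking price in $000 (Y)
bedrooms = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5])
price = np.array([310, 395, 420, 460, 505, 530, 600, 640, 720])

fig, ax = plt.subplots()
ax.scatter(bedrooms, price)

# Superimpose a least-squares linear trend line
slope, intercept = np.polyfit(bedrooms, price, 1)
xs = np.linspace(bedrooms.min(), bedrooms.max(), 100)
ax.plot(xs, slope * xs + intercept)

ax.set_xlabel("Number of bedrooms")
ax.set_ylabel("Asking price ($000)")
plt.show()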

The Time-Series Plot
Key technique  Use the Excel scatter chart.
Example  Construct a time-series plot of the Australian dollar exchange rate against the US dollar from 2010 to 2017, similar to Figure 2.15 on page 60.
In-depth Excel  Use the Time-Series workbook as a model. For the example, open the Exchange_Rate_2010_2017 file and:
1. Select the cell range A9:B95.
2. Select Insert, then select the Scatter (X, Y) icon in the Charts group (#4 in the illustration in Figure EG2.5), and then select the fourth or fifth Scatter gallery item (Scatter with Straight Lines, with or without Markers). In earlier Excels, select Insert ➔ Scatter and select the fourth or fifth Scatter gallery item (Scatter with Straight Lines, with or without Markers).
3. Move the chart to a chart sheet (right-click on Chart ➔ Move Chart). Adjust chart formatting if required.
When constructing time-series charts with other variables, make sure that the X or time variable column precedes (is to the left of) the Y or vertical variable column. (If the worksheet is arranged Y then X, cut and paste so that the Y variable column appears to the right of the X variable column.)
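A time-series plot is equally easy to sketch in pandas once the series is indexed by dates; the monthly values below are generated purely for illustration and are not the exchange-rate data.

import pandas as pd
import matplotlib.pyplot as plt

# Illustrative monthly series (not the actual exchange-rate data)
dates = pd.date_range("2010-01-01", periods=24, freq="MS")
rate = pd.Series([0.90 + 0.01 * (i % 6) for i in range(24)], index=dates)

rate.plot()                      # pandas places the dates on the X axis
plt.xlabel("Month")
plt.ylabel("Exchange rate")
plt.title("Time-series plot")
plt.tight_layout()
plt.show()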

EG2.6 DESCRIPTIVE ANALYTICS

Sparklines
In-depth Excel  Use Sparklines. For example, to create the Figure 2.17 sparklines display, open to the DATA worksheet of the WL_WaitHistory workbook. In this worksheet, ride names are in column A and the historical wait-times data by half-hours are in columns C through W. Select cell range C3:W16 and:
1. Select Insert ➔ Sparklines (select Line as the sparkline type).
2. In the Insert Sparklines dialog box, enter B3:B16 as the Location Range and click OK.
3. Select Axis and then Vertical. Choose Same for all Sparklines for both Maximum and Minimum.

Gauges
In-depth Excel  To construct a gauge we must create both a doughnut chart for the coloured zones and a pie chart for the pointer. To create a gauge equivalent to the one shown in Figure 2.18 on page 65, open to the TopSixDATA worksheet of the WL_WaitData workbook and:
1. Select the cell range E3:E7.
2. Select Insert ➔ Pie Chart and select Doughnut.
3. Right-click on the doughnut, select Format Data Series, type 271 into Angle of first slice (see Figure EG2.12) and close the box.
Figure EG2.12  Format Data Series dialog box
4. Right-click on the largest doughnut slice, select Format Data Point, then select Fill ➔ No Fill.
Figure EG2.13  Format Data Point dialog box
5. Right-click on the doughnut and choose Select Data. Click the + button and add 'pointer' as the name and G3:G5 as the Y values.


Figure EG2.14  Select Data Source dialog box

6. Right-click on the new second doughnut, click Change Chart Type and choose Pie Chart.
7. Right-click on the pie chart, select Format Data Series and check the Secondary Axis option. Type 270 into the Angle of first slice box.

Figure EG2.15 Format Data Series dialogue box

8. Right-click on the largest slice of the pie and select Format Data Point. Select Fill ➔ No Fill. Repeat for the next largest slice.
9. Use Insert ➔ Text Box to add the appropriate labels and change the gauge colours to suit.

Bullet Graph
In-depth Excel  Use the BulletGraph worksheet of the GaugeBullet workbook as a model for simulating a bullet graph.


To construct a simulated bullet graph in Excel, you create a bar chart of the variable being graphed with a transparent background and overlay this chart on a bar chart that displays the coloured zones. For example, to construct a chart similar to the bullet graph shown in Figure 2.18 on page 65, open to the waitDATA worksheet of the WL_WaitData workbook and:
1. Select cell range B1:C15.
2. Select Insert, then the bar chart icon, and select the Clustered Bar.
3. In the newly constructed bar chart, turn off the gridlines.
4. Right-click in the white space to the right of the chart title and click Format Chart Area in the shortcut menu.
5. In the Fill part of the Format Chart Area pane click No fill. The background of the chart becomes transparent.
Next, construct the bar chart that will serve as the coloured zones for the bullet graph.
6. In the cell range D2:D6, enter the values 25, 20, 20, 20 and 15, to define the five zones of the Figure 2.18 bullet graph. Then select this edited cell range D2:D6.
7. Select Insert, then the bar chart icon, and select the Stacked Bar.
8. In the newly constructed bar chart, turn off the gridlines.
9. Right-click in the white space to the right of the chart title and click Select Data in the shortcut menu.
10. In the Select Data Source dialog box, click Switch Row/Column and then click OK. A chart of five simple bars becomes a chart of one stacked bar with five parts.
11. Right-click the one stacked bar and click Format Data Series in the shortcut menu. In the Series Options part of the Format Data Series pane, change Gap Width to 0%.
12. Change the colouring of the stacked bars. Select Design ➔ Change Colors and in the gallery click one of the colour spectrums. Be sure to choose a set of colours that does not include the colour used for the bars in the bar chart you constructed using steps 1 to 5.
13. Right-click the horizontal chart axis and click Format Axis in the shortcut menu.
14. In the Axis Options of the Format Axis pane, enter 100 as the Maximum. In Excel 2010, first click Fixed in the Maximum line, then enter 100, and then click Close.
15. Adjust the size of the chart, as necessary, by clicking a corner of the bar chart frame and then dragging that corner to resize the chart.




16. Right-click the chart border and select Send to Back ➔ Send to Back in the shortcut menu.
17. Drag the bar chart with the transparent background over the stacked bar chart and adjust so that the zeroes on the horizontal axes of both charts coincide. Then adjust the width of that bar chart so that all other horizontal axis numbers that the two charts share coincide.
For other problems, you need to identify the maximum value and enter the proper set of values in a new column in order to construct the stacked bar chart that serves to display the zones for the bullet graph.

Treemap
1. Highlight cells A1:C15 and select Insert ➔ Chart ➔ Other Charts ➔ Hierarchical Treemap.
More detailed instructions for treemaps and data discovery are contained in the Software Guide in Chapter 20 (online).

Microsoft® product screen shots are reprinted with permission from Microsoft Corporation.


CHAPTER 3
Numerical descriptive measures

FESTIVAL EXPENDITURE

Returning to the festival expenditure scenario introduced in Chapter 2, as well as presenting the expenditure data graphically, Kai wishes to summarise and analyse the data further. In particular, for each non-local visitor type (intrastate, interstate and international) numerical measures of the centre and variation of total expenditure in the region during the festival are required. This analysis will help to answer the following questions:
■ What is the 'average' amount spent during the festival? How does this differ between visitor types?
■ How varied is the amount spent during the festival? How does this differ between visitor types?
© Ton Koene/age fotostock




LEARNING OBJECTIVES

After studying this chapter you should be able to:
1 calculate and interpret numerical descriptive measures of central tendency, variation and shape for numerical data
2 calculate and interpret descriptive summary measures for a population
3 construct and interpret a box-and-whisker plot
4 calculate and interpret the covariance and the coefficient of correlation for bivariate data

variation The spread, scattering or dispersion of data values.

In Chapter 2 we saw how tables and graphs can be used to organise, visualise, summarise and describe data. In this chapter we discuss various numerical measures that can be used to summarise and describe numerical data. These numerical measures not only can be used to summarise a particular sample or population but will also enable the sample or population to be compared with others. Furthermore, these numerical measures, unlike graphs and tables, are precise, objectively determined and easy to manipulate, interpret and compare. They allow for a careful analysis of data, which is especially important when using sample data to make inferences about an entire population. For example, Kai may be interested in whether interstate visitors spend more during the festival than do intrastate visitors. Also of interest would be how expenditure by international visitors to the festival compares to that of non-local visitors from within Australia. This chapter introduces some of the statistics that measure:
• central tendency, the extent to which the data values are grouped around a central value
• variation, the spread, scattering or dispersion of data values
• shape, the pattern of the distribution of data values from the lowest value to the highest value.

shape The pattern of the distribution of data values.

Covariance and the coefficient of correlation, which measure the strength of the association between two numerical variables, are also introduced.

central tendency The extent to which data values are grouped around a central value.

LEARNING OBJECTIVE 1  Calculate and interpret numerical descriptive measures of central tendency, variation and shape for numerical data

arithmetic mean (mean) Measure of central tendency; sum of all values divided by the number of values (usually called the mean); called the arithmetic mean to distinguish it from the geometric mean.

3.1  MEASURES OF CENTRAL TENDENCY, VARIATION AND SHAPE
We can describe a data set by describing its central tendency, variation and shape.

Measures of Central Tendency
Many data sets have a distinct central tendency, with the data values grouped or clustered around a central point. Everyday expressions such as 'the average value', 'the middle value' or 'the most popular or frequent value' refer to measures of central tendency. The three most important measures of central tendency – mean, median and mode – are introduced in this section. These measures are precise, objectively determined and easy to manipulate, interpret and compare. As we see in the following sections, each has its advantages and disadvantages.
Mean
The arithmetic mean (typically referred to as the mean) is the most common measure of central tendency. The mean uses all the data values and can be calculated exactly. It can be thought of as a 'balance point' in a set of data (like the fulcrum on a seesaw). The mean is calculated by adding all the values of a variable in a data set and then dividing the sum by the number of variable values in the data set.






The symbol \bar{X}, called 'X bar', is used to represent the mean of a sample. For a sample containing n values, the equation for the mean of a sample is written as:

\bar{X} = \frac{\text{sum of the sample values}}{\text{number of sample values}}

Using the series X_1, X_2, \ldots, X_n to represent the set of n values and n to represent the number of values, the equation becomes:

\bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n}

By using summation notation (discussed in Appendix B), we can replace the numerator X_1 + X_2 + \cdots + X_n by \sum_{i=1}^{n} X_i, which means sum all the X_i values from the first X value, X_1, to the last X value, X_n, to obtain Equation 3.1.

SAMPLE MEAN
The sample mean is the sum of the values divided by the number of values.

sample mean  Mean calculated from sample data.

\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}    (3.1)

where  \bar{X} = sample mean
       n = number of values or sample size
       X_i = ith value of the variable X
       \sum_{i=1}^{n} X_i = X_1 + X_2 + \cdots + X_n = sum of all X_i values in the sample

As all the data values play an equal role in the calculation of the mean, the mean will be affected by any extreme (high or low) value. When there are extreme values, you should take care when using the mean as a measure of central tendency. The mean gives a 'typical' or central value for a data set. For example, if you knew the typical time it takes you to get ready in the morning, you might be able to plan your morning better and minimise any excessive lateness (or earliness). Suppose you define the time to get ready as the time in minutes (rounded to the nearest minute) from when you get out of bed to when you leave. You collect the times (shown below) for 10 consecutive working days; this data is stored in < TIMES >.

Day:             1   2   3   4   5   6   7   8   9  10
Time (minutes): 39  29  43  52  39  44  40  31  44  35

The mean time to get ready is 39.6 minutes, calculated using Equation 3.1:

\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{39 + 29 + \cdots + 35}{10} = \frac{396}{10} = 39.6

Even though no one day in the sample actually had the value 39.6 minutes, allotting about 40 minutes to get ready would be a good rule for planning your morning – but only because the 10 days did not contain any extreme values.




To illustrate how the mean can be greatly affected by any value that is very different from the others, imagine that on day 4, a set of unusual circumstances delayed you getting ready by 50 minutes so that the time for that day was 102 minutes. This extreme value would cause the mean to rise to 44.6 minutes:

\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{446}{10} = 44.6

The one extreme value has increased the mean by more than 10%, from 39.6 to 44.6 minutes. In contrast to the original mean, which was in the 'middle' of the data (greater than five of the times to get ready and less than the other five), the new mean is greater than 9 of the 10 times to get ready. The extreme value of 102 has caused the mean to increase and thus become a poor measure of central tendency. A statistical calculator can be used to calculate the mean (and other numerical measures introduced in this chapter), while for large data sets, as we see later in this section, Excel can be used. Even though it is not usually necessary to use Equation 3.1 to calculate the mean, it is important that you understand the process of how the mean is determined.
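As a quick check of these hand calculations, the sample mean can also be reproduced with a few lines of code. The sketch below uses Python, which is not part of this text's software guides and is shown only for illustration; it recomputes the 39.6-minute mean and the 44.6-minute mean that results when day 4 is replaced by the extreme value of 102 minutes.

# Sample mean of the 10 'time to get ready' values (Equation 3.1)
times = [39, 29, 43, 52, 39, 44, 40, 31, 44, 35]
print(sum(times) / len(times))            # 39.6

# Replacing day 4 (52 minutes) with the extreme value 102 pulls the mean up
times_with_outlier = [39, 29, 43, 102, 39, 44, 40, 31, 44, 35]
print(sum(times_with_outlier) / len(times_with_outlier))   # 44.6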

EXAMPLE 3.1

MEAN FESTIVAL EXPENDITURE – INTERNATIONAL VISITORS
In the opening scenario, Kai is interested in the distribution of the amount spent by international visitors in the region during the festival. The following data give the dollar amount spent by a random sample of 12 international visitors. < FESTIVAL >

1,119  615  971  553  343  502  928  1,005  993  408  725  763

Calculate and interpret the mean amount spent by international visitors.
SOLUTION
Calculating the sum of X, we obtain:

\sum_{i=1}^{12} X_i = 1{,}119 + 615 + \cdots + 763 = 8{,}925

then, using Equation 3.1:

\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{8{,}925}{12} = 743.75

Therefore, international visitors on average spent $743.75 in the region during the festival.

median Measure of central tendency; middle value in an array.

Median
The median is the value that partitions or splits an ordered set of data into two equal parts. As the median is not affected by extreme values, it may be a better measure of central tendency when there are extreme values.

The median is the middle value in a set of data that has been ordered from lowest to highest value.

To calculate the median for a set of data, first order the values from smallest to largest. Then use Equation 3.2 to calculate the rank of the value that is the median.





MEDIAN
50% of the values are equal to or smaller than the median and 50% of the values are equal to or larger than the median.

\text{Median} = \frac{n+1}{2} \text{ ranked value}    (3.2)

Calculate the median value by these two rules:
• Rule 1  If there is an odd number of values in the data set, the median is the middle-ranked value.
• Rule 2  If there is an even number of values in the data set, then the median is the mean of the two middle-ranked values.

To calculate the median for the sample of the 10 times to get ready, first order the times:

Ordered values: 29  31  35  39  39  40  43  44  44  52
Ranks:           1   2   3   4   5   6   7   8   9  10

Median = 39.5

Rank of the median is (n + 1)/2 = (10 + 1)/2 = 5.5. So, using rule 2, the median is the mean of the fifth- and sixth-ranked values, (39 + 40)/2 = 39.5. Therefore, for half of the days the time to get ready is less than or equal to 39.5 minutes and for half of the days the time to get ready is greater than or equal to 39.5 minutes. The median time to get ready of 39.5 minutes is very close to the mean time to get ready of 39.6 minutes.

EXAMPLE 3.2
CALCULATING THE MEDIAN FOR AN ODD SAMPLE SIZE
For a certain café, the number of customers during a selected seven-day week were 100, 75, 92, 85, 70, 80 and 71. Calculate the median number of customers for this week.
SOLUTION

Ordered values: 70  71  75  80  85  92  100
Ranks:           1   2   3   4   5   6    7

Median = 80

Rank of the median is (n + 1)/2 = (7 + 1)/2 = 4. So, using rule 1, the median is the fourth-ranked value. The median number of customers is 80. Therefore, 50% of days have 80 or fewer customers and 50% have 80 or more customers.

EXAMPLE 3.3
CALCULATING THE MEDIAN FESTIVAL EXPENDITURE – INTERNATIONAL VISITORS
In the opening scenario, Kai is interested in the distribution of the amount spent by international visitors in the region during the festival. < FESTIVAL > Calculate and interpret the median amount spent by international visitors.
SOLUTION
First order the data:

343  408  502  553  615  725  763  928  971  993  1,005  1,119





Rank of the median is (n + 1)/2 = (12 + 1)/2 = 6.5. Using rule 2, the median is the mean of the sixth- and seventh-ranked values, (725 + 763)/2 = 744.
Therefore, 50% of international visitors in the sample spent less than $744 during the festival and 50% spent more than $744.
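The two median rules translate directly into code. The short Python sketch below is our own illustration (the function and variable names are not from this text); it applies rule 1 or rule 2 to an unordered sample and reproduces the medians of 39.5, 80 and 744 found above.

def median(values):
    # Order the data, then apply rule 1 (odd n) or rule 2 (even n)
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                                  # rule 1: middle-ranked value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2    # rule 2: mean of the two middle values

print(median([39, 29, 43, 52, 39, 44, 40, 31, 44, 35]))                         # 39.5
print(median([100, 75, 92, 85, 70, 80, 71]))                                    # 80
print(median([1119, 615, 971, 553, 343, 502, 928, 1005, 993, 408, 725, 763]))   # 744.0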

mode Measure of central tendency; most frequent value.

Mode
The mode is the value in a data set that appears most frequently. Like the median and unlike the mean, extreme values do not affect the mode. You should use the mode only for descriptive purposes as it is more variable from sample to sample than either the mean or the median. Often there is no mode or there are several modes in a set of data. For example, consider the data for the times to get ready shown below:

29  31  35  39  39  40  43  44  44  52

There are two modes, 39 minutes and 44 minutes, since each of these values occurs twice. Because it has two modes, this data set is considered to be bimodal.

EXAMPLE 3.4
CALCULATING THE MODE
A company's information systems manager keeps track of the number of unplanned outages that occur in a month. Calculate the mode for the following data, which represent the number of unplanned outages during the past 14 months:

1  3  0  3  26  2  7  4  0  2  3  3  6  3

SOLUTION 

The ordered array for these data is:

0  0  1  2  2  3  3  3  3  3  4  6  7  26

Because 3 appears five times, more than any other value, the mode is 3. Thus, the systems manager can say that the most common occurrence is three unplanned outages a month. For this data set, the median is also equal to 3 while the mean is equal to 4.5. As the mean is affected by the extreme value of 26 unplanned outages, the median and the mode are better measures of central tendency than the mean for this data set.
A set of data will have no mode if none of the values is 'most typical' – that is, if no data value occurs more than once. Example 3.5 presents a data set with no mode.

EXAMPLE 3.5
DATA WITH NO MODE
For the café of Example 3.2, calculate the mode for the number of customers for the seven days.
SOLUTION
The ordered array for these data is:

70  71  75  80  85  92  100

As no two days have the same number of customers, there is no mode.

quartiles Measures of relative standing, partition a data set into quarters.

Quartiles
We have seen that the median partitions a set of data into two equal parts. We can extend this idea by partitioning a set of data into as many equal parts as we wish. Quartiles divide a set of data into quarters – that is, four equal parts. The first, or lower, quartile, Q1, divides the lower 25% of the values from the other 75%, which are larger. The second quartile, Q2, is the median – 50% of the values are below the median and 50% above. The third, or upper, quartile, Q3, has 75% of the values below it and 25% above. Equations 3.3 and 3.4 define the first and third quartiles. Q1, the median and Q3 are also the 25th, 50th and 75th percentiles, respectively. Equations 3.2, 3.3 and 3.4 can be expressed generally in terms of finding percentiles: (p × 100)th percentile = p × (n + 1) ranked value. p is between 0 and 1, with, for example, the median (Q2) corresponding to a p value of 0.5.

first (lower) quartile  Value that 25% of data values are smaller than, or equal to.
second quartile  The median; value that 50% of data values are smaller than, or equal to.
third (upper) quartile  Value that 75% of data values are smaller than, or equal to.

FIRST, OR LOWER, QUARTILE, Q1
25% of the values are smaller than, or equal to, the first quartile, Q1, and 75% are larger than, or equal to, the first quartile, Q1.

Q_1 = \frac{n+1}{4} \text{ ranked value}    (3.3)

THIRD, OR UPPER, QUARTILE, Q3
75% of the values are smaller than, or equal to, the third quartile, Q3, and 25% are larger than, or equal to, the third quartile, Q3.

Q_3 = \frac{3(n+1)}{4} \text{ ranked value}    (3.4)

Use the following rules to calculate the quartiles:
• Rule 1  If the result is an integer, then the quartile is equal to that ranked value. For example, if the sample size is n = 7, the first quartile, Q1, is equal to the (7 + 1)/4 = 2nd ranked value.
• Rule 2  If the result is a fractional half (2.5, 4.5, etc.), then the quartile is equal to the mean of the corresponding ranked values. For example, if the sample size is n = 9, the first quartile, Q1, is equal to the (9 + 1)/4 = 2.5 ranked value, halfway between the second- and third-ranked values.
• Rule 3  If the result is neither an integer nor a fractional half, round the result to the nearest integer and select that ranked value. For example, if the sample size is n = 10, the first quartile, Q1, is equal to the (10 + 1)/4 = 2.75 ranked value. Round 2.75 to 3 and use the third-ranked value.

To illustrate the calculation of the quartiles for the times to get ready, rank the data from smallest to largest:

Ranked values: 29  31  35  39  39  40  43  44  44  52
Ranks:          1   2   3   4   5   6   7   8   9  10

The first quartile is the (n + 1)/4 = (10 + 1)/4 = 2.75 ranked value. Using the third rule for quartiles, round up to the third-ranked value as it is the closest integer. The third-ranked value for the data for the times to get ready is 35 minutes. Interpret the first quartile of 35 to mean that on 25% of the days the time to get ready is less than or equal to 35 minutes, and on 75% of the days the time to get ready is greater than or equal to 35 minutes. The third quartile is the 3(n + 1)/4 = 3(10 + 1)/4 = 8.25 ranked value. Using the third rule for quartiles, round down to the eighth-ranked value as it is the closest integer. The eighth-ranked value for the data for the times to get ready is 44 minutes. Interpret this to mean that on 75% of the days the time to get ready is less than or equal to 44 minutes, and on 25% of the days the time to get ready is greater than or equal to 44 minutes.




Be aware that several methods exist for calculating quartiles. Other textbooks and Excel may use different rules, which can result in slightly different values for the upper and lower quartiles.

EXAMPLE 3.6
CALCULATING THE QUARTILES FOR FESTIVAL EXPENDITURE – INTERNATIONAL VISITORS
Kai is interested in the distribution of the amount spent by international visitors in the region during the festival. < FESTIVAL > Calculate and interpret the quartiles for the amount spent by international visitors.
SOLUTION
First order the data:

343  408  502  553  615  725  763  928  971  993  1,005  1,119

Rank of the first quartile is (n + 1)/4 = (12 + 1)/4 = 3.25. Using rule 3, the first quartile is the third-ranked value, Q1 = 502.
Therefore, 25% of international visitors in the sample spent $502 or less during the festival and 75% spent $502 or more.
Rank of the third quartile is 3(n + 1)/4 = 3(12 + 1)/4 = 9.75. Using rule 3, the third quartile is the 10th-ranked value, Q3 = 993.
Therefore, 75% of international visitors in the sample spent $993 or less during the festival and 25% spent $993 or more.
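Because this text's three quartile rules differ from the interpolation methods built into Excel and most statistical libraries, a direct implementation is a useful check. The Python sketch below is our own illustration (not part of the text's software guides); it applies the three rules and reproduces Q1 = 502 and Q3 = 993 for the festival data, and Q1 = 35 and Q3 = 44 for the times to get ready.

def quartile(values, which):
    # which = 1 for Q1 or 3 for Q3; the rank comes from Equations 3.3 and 3.4
    ordered = sorted(values)
    n = len(ordered)
    rank = which * (n + 1) / 4
    if rank == int(rank):                  # rule 1: integer rank
        return ordered[int(rank) - 1]
    if rank % 1 == 0.5:                    # rule 2: fractional half -> mean of neighbours
        return (ordered[int(rank) - 1] + ordered[int(rank)]) / 2
    return ordered[round(rank) - 1]        # rule 3: round to the nearest integer rank

festival = [1119, 615, 971, 553, 343, 502, 928, 1005, 993, 408, 725, 763]
times = [39, 29, 43, 52, 39, 44, 40, 31, 44, 35]
print(quartile(festival, 1), quartile(festival, 3))   # 502 993
print(quartile(times, 1), quartile(times, 3))         # 35 44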

Geometric Mean

geometric mean  Average rate of change of a variable.

The geometric mean and the geometric mean rate of return are used to measure the status of an investment over time or the average percentage change in a variable. The geometric mean, defined by Equation 3.5, measures the average rate of change of a variable over n periods.

GEOMETRIC MEAN
The geometric mean is the nth root of the product of n values.

\bar{X}_G = (X_1 \times X_2 \times \cdots \times X_n)^{1/n}    (3.5)

Using the geometric mean, we can measure the average return on an investment over time. This is given by the geometric mean rate of return, defined by Equation 3.6.

GEOMETRIC MEAN RATE OF RETURN

\bar{R}_G = [(1 + R_1) \times (1 + R_2) \times \cdots \times (1 + R_n)]^{1/n} - 1    (3.6)

where R_i = the rate of return in time period i as a decimal

To illustrate the use of these measures, consider an investment of $100,000 that declined to a value of $50,000 at the end of year 1 and then rebounded back to its original $100,000 value at the end of year 2. The rate of return for this investment for the two-year period is 0, because the starting and ending values of the investment are the same. However, the arithmetic mean of the annual rates of return of this investment is:

\bar{X} = \frac{(-0.50) + (1.00)}{2} = 0.25 \text{ or } 25\%






since the rate of return for year 1 is:

R_1 = \frac{50{,}000 - 100{,}000}{100{,}000} = -0.50 \text{ or } -50\%

and the rate of return for year 2 is:

R_2 = \frac{100{,}000 - 50{,}000}{50{,}000} = 1.00 \text{ or } 100\%

Using Equation 3.6, the geometric mean rate of return for the two years is:

\bar{R}_G = [(1 + R_1) \times (1 + R_2)]^{1/2} - 1
          = \{[1 + (-0.50)] \times [1 + (1.0)]\}^{1/2} - 1
          = (0.50 \times 2.0)^{1/2} - 1
          = 1^{1/2} - 1
          = 0

Thus, the geometric mean rate of return more accurately reflects the (zero) change in the value of the investment for the two-year period than does the arithmetic mean.

EXAMPLE 3.7
CALCULATING THE GEOMETRIC MEAN RATE OF RETURN
The annual percentage change in a New Zealand share market index, the NZX-50, for 2012 to 2016 was:

Year            2012   2013   2014   2015   2016
Annual change    24%    16%    18%    14%    10%

Data obtained from Yahoo 7 Finance accessed April 2017

Calculate the geometric mean rate of return for these five years.
SOLUTION
Using Equation 3.6, the geometric mean rate of return in the NZX 50 Index for the five years is:

\bar{R}_G = [(1 + R_{2012}) \times (1 + R_{2013}) \times (1 + R_{2014}) \times (1 + R_{2015}) \times (1 + R_{2016})]^{1/5} - 1
          = [(1 + 0.24) \times (1 + 0.16) \times (1 + 0.18) \times (1 + 0.14) \times (1 + 0.10)]^{1/5} - 1
          = (1.24 \times 1.16 \times 1.18 \times 1.14 \times 1.10)^{1/5} - 1
          = 1.16308\ldots - 1
          = 0.1630\ldots

The geometric mean rate of return of the NZX 50 Index for the five years is approximately 16.3% annually.
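Equation 3.6 is also straightforward to evaluate in code. The Python sketch below is our own illustration; it reproduces the zero return for the $100,000 investment and the roughly 16.3% annual return for the NZX-50 figures in Example 3.7.

def geometric_mean_rate(returns):
    # returns are decimal rates of return, e.g. 0.24 for 24% (Equation 3.6)
    product = 1.0
    for r in returns:
        product *= (1 + r)
    return product ** (1 / len(returns)) - 1

print(geometric_mean_rate([-0.50, 1.00]))                    # 0.0
print(geometric_mean_rate([0.24, 0.16, 0.18, 0.14, 0.10]))   # approximately 0.163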

Measures of Variation
Variation measures the spread or dispersion of values in a data set. One simple measure of variation is the range: the difference between the highest and lowest value. More commonly used in statistics are the standard deviation and variance, two measures also introduced in this section.
Range
The range is the simplest numerical descriptive measure of variation in a set of data.

spread (dispersion) The amount of scattering of data values. range Distance measure of variation; difference between maximum and minimum data values.



RANGE
The range is equal to the largest value minus the smallest value.

\text{Range} = X_{\text{largest}} - X_{\text{smallest}}    (3.7)

To determine the range of the times to get ready, first rank the data from smallest to largest:

29  31  35  39  39  40  43  44  44  52

Then, using Equation 3.7, the range is 52 − 29 = 23 minutes. The range of 23 minutes indicates that the largest difference between any two days in the time to get ready is 23 minutes.

EXAMPLE 3.8

CALCULATING THE RANGE FOR FESTIVAL EXPENDITURE – INTERNATIONAL VISITORS
In the opening scenario, Kai is interested in the distribution of the amount spent by international visitors in the region during the festival. < FESTIVAL > Calculate and interpret the range for the amount spent by international visitors.
SOLUTION

From an ordered array of the data, the minimum amount an international visitor spent was $343 and the maximum was $1,119. Using Equation 3.7, the range is:

\text{Range} = X_{\text{largest}} - X_{\text{smallest}} = 1{,}119 - 343 = 776

Therefore, the difference between the maximum and minimum amounts spent by international visitors during the festival was $776.

The range measures the total spread of the data. Although the range is a simple measure of total variation, it is based only on the two extreme values and ignores all the other values. Thus, it does not take into account how the data are distributed between the smallest and largest values; it does not indicate whether the values are evenly distributed throughout the data set, clustered near the middle or clustered near one or both ends. Like the mean, the range is distorted by very high or very low values, so care is needed when using the range as a measure of variation.

interquartile range Distance measure of variation; difference between third and first quartile; range of middle 50% of data.

Interquartile Range
The interquartile range is the difference between the third and first quartiles in a set of data.

INTERQUARTILE RANGE
The interquartile range is the difference between the third quartile and the first quartile.

\text{Interquartile range} = Q_3 - Q_1    (3.8)

The interquartile range is a more meaningful measure of variation than the range because it ignores extreme values by finding the range of the middle 50% of the ordered array of data values. In the times to get ready we found that Q1 = 35 and Q3 = 44. Hence, using Equation 3.8:

\text{Interquartile range} = 44 - 35 = 9 \text{ minutes}

Therefore, the interquartile range in the time to get ready is 9 minutes.





EXAMPLE 3.9
CALCULATING THE INTERQUARTILE RANGE FOR FESTIVAL EXPENDITURE – INTERNATIONAL VISITORS
In the opening scenario, Kai is interested in the distribution of the amount spent by international visitors during the festival. < FESTIVAL > Calculate and interpret the interquartile range for the amount spent by international visitors.
SOLUTION

From Example 3.6, the first quartile, Q1, is 502 and the third quartile, Q3, is 993. Using Equation 3.8, the interquartile range is:

Q_3 - Q_1 = 993 - 502 = 491

Therefore, the spread of the middle 50% of the amounts spent by international visitors in the sample is $491.
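The range and interquartile range for the festival data can also be checked in a few lines of code. The Python sketch below is illustrative only and assumes the quartile() helper sketched after Example 3.6 is available; under that assumption it returns a range of 776 and an interquartile range of 491.

festival = [1119, 615, 971, 553, 343, 502, 928, 1005, 993, 408, 725, 763]

# Equation 3.7: range = largest value minus smallest value
data_range = max(festival) - min(festival)             # 776

# Equation 3.8: interquartile range = Q3 - Q1, using the quartile() sketch above
iqr = quartile(festival, 3) - quartile(festival, 1)    # 993 - 502 = 491
print(data_range, iqr)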

When calculating the interquartile range the highest and lowest 25% of the data values are discarded. Therefore, the interquartile range is not affected by extreme values. Summary measures such as the median, Q1, Q3 and the interquartile range, which are not influenced by extreme values, are called resistant measures.

resistant measures  Summary measures not influenced by extreme values.

Variance and Standard Deviation
Although the range and interquartile range are measures of variation, they do not take into consideration how the values are distributed or clustered between the extremes. Two commonly used and related measures of variation that take into account how all the values in the data set are distributed are the variance and the standard deviation. These statistics measure the average scatter around the mean – how larger values fluctuate above it and how smaller values are distributed below it.

variance  Measure of variation based on squared deviations from the mean; directly related to the standard deviation.
standard deviation  Measure of variation based on squared deviations from the mean; directly related to the variance.

These measures are based on the difference between each data value and the mean, called the deviation of the data value from the mean. The notation X_i - \bar{X} is used to denote the deviation of a data value X_i from the mean \bar{X}. A measure of variation around the mean could be to take the deviation of each value from the mean, and then sum the deviations. However, as the mean is the centre of balance in a set of data, for every data set the deviations from the mean would sum to zero – that is, \sum_{i=1}^{n} (X_i - \bar{X}) = 0. This can be overcome by squaring the deviations from the mean before summing. In statistics, this quantity is called a sum of squares (or SS). So the sum of squares for X is SS_X = \sum_{i=1}^{n} (X_i - \bar{X})^2. This sum of squares is then divided by the number of values minus 1 (for sample data) to get the sample variance (S^2). The square root of the sample variance is the sample standard deviation (S).

sum of squares (SS)  Sum of the squared deviations.

Because the sum of squares is a sum of squared differences that will always be non-negative, neither the variance nor the standard deviation can ever be negative. For a data set, the variance and standard deviation will usually be positive, and will only be zero if there is no variation – that is, all the values are equal. For a sample containing n values, X_1, X_2, \ldots, X_n, the sample variance (given by the symbol S^2) is:

S^2 = \frac{(X_1 - \bar{X})^2 + (X_2 - \bar{X})^2 + \cdots + (X_n - \bar{X})^2}{n - 1}

Equation 3.9a expresses the equation using summation notation.



sample variance  Variance calculated from sample data.

SAMPLE VARIANCE – DEFINITION FORMULA
The sample variance is the sum of the squared deviations from the sample mean divided by the sample size minus one.

S^2 = \frac{SS_X}{n-1} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}    (3.9a)

where  \bar{X} = sample mean
       n = number of values or sample size
       X_i = ith value of the variable X
       SS_X = \sum_{i=1}^{n} (X_i - \bar{X})^2 = sum of the squared deviations from the mean (sum of squares)

sample standard deviation  Standard deviation calculated from sample data.

SAMPLE STANDARD DEVIATION – DEFINITION FORMULA
The sample standard deviation is the square root of the sample variance.

S = \sqrt{S^2}    (3.10)

If the denominator was n instead of n − 1, Equation 3.9a would calculate the average of the squared deviations from the mean. However, n − 1 is used because of certain desirable mathematical properties of the statistic S^2 that make it appropriate for statistical inference (discussed in Chapter 7). The sample standard deviation, defined by Equation 3.10, is the more useful measure of variation because, unlike the sample variance, which is a squared quantity, the standard deviation is a value that is expressed in the same units of measurement as the original sample data. The standard deviation is a measure of how a set of data is clustered or distributed around its mean. For most data sets the majority of the data values lie within one standard deviation of the mean – that is, within (\bar{X} - S, \bar{X} + S) – and we will see later in this chapter that for all data sets at least 75% of the data values lie within two standard deviations of the mean – that is, within (\bar{X} - 2S, \bar{X} + 2S). Therefore, a knowledge of the mean and the standard deviation helps to define where the majority of the data values are clustered.
Table 3.1 illustrates the steps for calculating the variance and standard deviation for the data on the times to get ready with mean \bar{X} = 39.6, calculated earlier. The second column of Table 3.1 calculates the deviation of each time from the mean (step 1). The third column of Table 3.1 calculates the square of each deviation from the mean (step 2). The sum of the squared deviations (step 3) is shown at the bottom of Table 3.1. This total is then divided by 10 − 1 = 9 to calculate the variance (step 4). We can also calculate the variance by substituting values for the terms in Equation 3.9a:

S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1} = \frac{(39 - 39.6)^2 + (29 - 39.6)^2 + \cdots + (35 - 39.6)^2}{10 - 1} = \frac{412.4}{9} = 45.822\ldots





Table 3.1  Calculating the variance of the times to get ready (\bar{X} = 39.6)

Time (X)    Step 1: (X_i − \bar{X})    Step 2: (X_i − \bar{X})²
39                −0.60                      0.36
29               −10.60                    112.36
43                 3.40                     11.56
52                12.40                    153.76
39                −0.60                      0.36
44                 4.40                     19.36
40                 0.40                      0.16
31                −8.60                     73.96
44                 4.40                     19.36
35                −4.60                     21.16
Step 3: Sum                          SS_X = 412.40
Step 4: Divide by (n − 1)            S² = 45.822...

The variance is in squared units (in squared minutes for these data) so, to calculate the standard deviation, which is in the original units (minutes for these data), take the square root of the variance. Using Equation 3.10, the sample standard deviation S is:

S = \sqrt{S^2} = \sqrt{45.82\ldots} = 6.769\ldots

This indicates that most of the times to get ready in this sample are clustered within 6.77 minutes of the mean of 39.6 minutes (i.e. clustered between \bar{X} - S = 32.83 and \bar{X} + S = 46.37). Seven of the 10 times to get ready lie within this interval. To check that the mean is correct, use the second column of Table 3.1 to calculate the sum of the deviations from the mean. For any set of data, this sum will be zero – that is:

\sum_{i=1}^{n} (X_i - \bar{X}) = 0 \text{ for all sets of data}

It is tedious to use Equation 3.9a to calculate sample variance, especially for large samples or when the mean and/or data values are not integers. Instead, we can use algebra to obtain alternative calculation formulas.

SAMPLE VARIANCE – CALCULATION FORMULA
The sample variance is the sum of the squared deviations from the mean divided by the sample size minus 1.

S^2 = \frac{SS_X}{n-1} = \frac{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}{n-1} = \frac{\sum_{i=1}^{n} X_i^2 - \dfrac{\left(\sum_{i=1}^{n} X_i\right)^2}{n}}{n-1}    (3.9b)

where  \bar{X} = sample mean
       n = number of values or sample size
       X_i = ith value of the variable X
       \sum_{i=1}^{n} X_i^2 = X_1^2 + X_2^2 + \cdots + X_n^2 = sum of the squared X_i values in the sample



Use either calculation formula. The sum of the squared times to get ready is:

\sum_{i=1}^{10} X_i^2 = 39^2 + 29^2 + \cdots + 35^2 = 16{,}094

Then, calculate the variance by substituting values for the terms in the calculation form of Equation 3.9b:

S^2 = \frac{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}{n-1} = \frac{16{,}094 - 10 \times 39.6^2}{10 - 1} = \frac{412.4}{9} = 45.822\ldots

A statistical calculator can be used to calculate the standard deviation (and some other numerical measures introduced in this chapter) and, as covered later in this section, Excel can be used for large data sets. Even though it is not usually necessary to use Equations 3.9a or 3.9b to calculate variance and Equation 3.10 to calculate standard deviation, it is important that you understand the process of how the variance and standard deviation are obtained.
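Both the definition formula (3.9a) and the calculation formula (3.9b) are easy to verify in code. The Python sketch below is our own illustration; it computes the sample variance of the times to get ready both ways and takes the square root for the sample standard deviation.

times = [39, 29, 43, 52, 39, 44, 40, 31, 44, 35]
n = len(times)
mean = sum(times) / n

# Definition formula (3.9a): sum of squared deviations divided by n - 1
ss_x = sum((x - mean) ** 2 for x in times)
variance_def = ss_x / (n - 1)

# Calculation formula (3.9b): sum of squared values minus n times the squared mean
variance_calc = (sum(x ** 2 for x in times) - n * mean ** 2) / (n - 1)

std_dev = variance_def ** 0.5
print(variance_def, variance_calc, std_dev)   # 45.822..., 45.822..., 6.769...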

EXAMPLE 3.10

CALCULATING THE VARIANCE AND STANDARD DEVIATION FOR FESTIVAL EXPENDITURE – INTERNATIONAL VISITORS
Kai is interested in the distribution of the amount spent by international visitors during the festival. < FESTIVAL > Calculate and interpret the variance and standard deviation for the amount spent by international visitors.
SOLUTION
Calculate the sum of X squared:

\sum_{i=1}^{12} X_i^2 = 1{,}119^2 + 615^2 + \cdots + 763^2 = 7{,}380{,}205

then, from Example 3.1, \bar{X} = 743.75 and, using Equation 3.9b, we obtain:

SS_X = \sum_{i=1}^{n} X_i^2 - n\bar{X}^2 = 7{,}380{,}205 - 12 \times 743.75^2 = 742{,}236.25

S^2 = \frac{SS_X}{n-1} = \frac{742{,}236.25}{11} = 67{,}476.022\ldots

Therefore, the variance of the amount spent by international visitors during the festival is approximately 67,476.02 dollars squared. Now, using Equation 3.10, the sample standard deviation, S, is:

S = \sqrt{S^2} = \sqrt{67{,}476.022\ldots} = 259.761\ldots

Therefore, the standard deviation of the amount spent during the festival by international visitors is approximately $259.76. This indicates that we expect the majority of international visitors in the sample to have spent within $260 (plus or minus) of the mean expenditure of $743.75 during the festival.





The following summarises the characteristics of the range, interquartile range, variance and standard deviation:
• The more spread out, or dispersed, the data, the larger the range, interquartile range, variance and standard deviation.
• The more concentrated, or homogeneous, the data, the smaller the range, interquartile range, variance and standard deviation.
• If the values are all the same (so that there is no variation in the data), the range, interquartile range, variance and standard deviation will all equal zero.
• None of the measures of variation (the range, interquartile range, standard deviation and variance) can ever be negative.

Coefficient of Variation
Unlike the previous measures of variation presented, the coefficient of variation is a relative measure of variation that is expressed as a percentage rather than in terms of the units of the particular data. The coefficient of variation, denoted by the symbol CV, measures the scatter in the data relative to the mean.

coefficient of variation Relative measure of variation; the standard deviation divided by the mean.

COEFFICIENT OF VARIATION
The coefficient of variation is equal to the standard deviation divided by the mean, multiplied by 100%.

CV = \frac{S}{\bar{X}} \times 100\%    (3.11)

where  S = sample standard deviation
       \bar{X} = sample mean

For the sample of 10 times to get ready, since \bar{X} = 39.6 and S = 6.769\ldots, the coefficient of variation is:

CV = \frac{S}{\bar{X}} \times 100\% = \frac{6.769\ldots}{39.6} \times 100\% = 17.09\ldots\%

For the times to get ready, the standard deviation is 17.1% of the size of the mean. You will find the coefficient of variation useful when comparing two or more sets of data that have different units of measurement, as Example 3.11 illustrates, or when the scale of the data sets is substantially different.

EXAMPLE 3.11
COMPARING TWO COEFFICIENTS OF VARIATION WHEN TWO VARIABLES HAVE DIFFERENT UNITS OF MEASUREMENT
The operations manager of a package delivery service is deciding whether to purchase a new fleet of trucks. When packages are stored in the trucks in preparation for delivery, two major constraints need to be considered – the weight (in kilograms) and the volume (in cubic metres) of each item. The operations manager samples 200 packages and finds that the mean weight is 12.0 kilograms, with a standard deviation of 1.8 kilograms; the mean volume is 0.25 cubic metres, with a standard deviation of 0.06 cubic metres. How can the operations manager compare the variation of the weight and the volume?
SOLUTION

Because the measurement units differ for the weight and volume constraints, the operations manager should compare the relative variability in the two types of measurements.



For weight, the coefficient of variation is:

CV_W = \frac{S}{\bar{X}} \times 100\% = \frac{1.8}{12} \times 100\% = 15\%

For volume, the coefficient of variation is:

CV_V = \frac{S}{\bar{X}} \times 100\% = \frac{0.06}{0.25} \times 100\% = 24\%

Thus, relative to the mean, the package volume is more variable than the package weight because it has a higher coefficient of variation.
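The comparison in Example 3.11 can be reproduced with a one-line helper. The Python sketch below is illustrative only; the summary figures are those given in the example.

def coefficient_of_variation(std_dev, mean):
    # Equation 3.11: CV = (S / X-bar) * 100%
    return std_dev / mean * 100

print(coefficient_of_variation(1.8, 12.0))    # 15.0  (weight, %)
print(coefficient_of_variation(0.06, 0.25))   # 24.0  (volume, %)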

Z Scores

Z scores  Measures of relative standing; number of standard deviations that given data values are from the mean.
extreme value (outlier)  Value located far from the mean; will have a large Z score, positive or negative.

Z scores are measures of relative standing that take into consideration both the mean and the standard deviation. A Z score represents the distance between a given observation and the mean expressed in standard deviations. An extreme value or outlier, a value located far away from the mean, will have a large Z score, either positive or negative. Therefore, Z scores are useful in identifying extreme values or outliers.

Z SCORE

Z = \frac{X - \bar{X}}{S}    (3.12)

For the data for the times to get ready in the morning, the mean is 39.6 minutes and the standard deviation is 6.77 minutes. The time to get ready on the first day is 39.0 minutes. Use formula 3.12 to calculate the Z score for day 1:

Z = \frac{X - \bar{X}}{S} = \frac{39.0 - 39.6}{6.77} = -0.09

Therefore, the first day's time to get ready of 39 minutes is just 0.09 of a standard deviation below the mean – that is, just slightly quicker than the mean time to get ready. Table 3.2 shows the Z scores for all 10 days. The largest Z score is 1.83 for day 4, on which the time to get ready was 52 minutes. The lowest Z score was −1.57 for day 2, on which the time to get ready was 29 minutes. As a general rule, a value is said to be an outlier if its Z score is less than −3.0 or greater than +3.0 – that is, the value is more than three standard deviations below or above the mean. As none of the times to get ready meets the outlier criterion, we can say there are no outliers in these data.

Table 3.2  Z scores for the 10 times to get ready

Time (X)   Z score
39          −0.09
29          −1.57
43           0.50
52           1.83
39          −0.09
44           0.65
40           0.06
31          −1.27
44           0.65
35          −0.68
Mean = 39.6   Standard deviation = 6.77





EXAMPLE 3.12
CALCULATING THE Z SCORES FOR REAL ESTATE PRICES
A couple seeking a 'green-change' sell their inner city unit for $520,000 and plan to purchase a house in a rural town for the same price. Given that the mean unit price in the inner city is $845,000, with a standard deviation of $220,000, and the mean house price in the rural town is $280,000 with a standard deviation of $120,000, use Z scores to determine the price of each property relative to its region.
SOLUTION
The Z score for the inner city unit is:

Z = \frac{X - \bar{X}}{S} = \frac{520{,}000 - 845{,}000}{220{,}000} = -1.477\ldots

so the price of the unit sold is approximately 1.5 standard deviations below the mean price. That is, the couple have sold their unit for a relatively low price compared with mean inner city prices. If the couple purchase a house for $520,000, then its Z score is:

Z = \frac{X - \bar{X}}{S} = \frac{520{,}000 - 280{,}000}{120{,}000} = 2

The price of this property is approximately two standard deviations above the mean price. That is, the couple plan to purchase a house for a relatively high price compared with property prices in the region.
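Z scores and the ±3.0 outlier rule are equally easy to automate. The Python sketch below is our own illustration; it reproduces the Z scores in Table 3.2 and confirms that none of the times to get ready is an outlier. The standard deviation uses n − 1 in the denominator, as in Equation 3.9a.

times = [39, 29, 43, 52, 39, 44, 40, 31, 44, 35]
n = len(times)
mean = sum(times) / n
std_dev = (sum((x - mean) ** 2 for x in times) / (n - 1)) ** 0.5

# Equation 3.12: Z = (X - X-bar) / S
z_scores = [round((x - mean) / std_dev, 2) for x in times]
print(z_scores)    # [-0.09, -1.57, 0.5, 1.83, -0.09, 0.65, 0.06, -1.27, 0.65, -0.68]

# Flag outliers: values whose Z score exceeds 3.0 in absolute value
print([x for x, z in zip(times, z_scores) if abs(z) > 3.0])   # []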

Shape
As well as the centre and the variation of numerical data we also need a description of the shape of the distribution which represents a pattern of all the values from the lowest to highest. Many data sets are approximately mound- or bell shaped; other data sets may be skewed, with the majority of data values clustered in the upper or lower end of the distribution. A distribution is symmetrical if the lower and upper halves of the graph are mirror images of each other. Panel B of Figure 3.1 illustrates a symmetrical distribution. If the distribution is not symmetrical, it may be skewed. A distribution is skewed to the right, or positively skewed, if there is a long tail to the right, indicating that there are relatively few large data values and more smaller values – that is, most of the values are concentrated in the lower portion of the distribution. Panel C of Figure 3.1 illustrates a positively skewed distribution. As relatively few people have extremely high incomes, we would expect the distribution of annual income to be positively skewed. A distribution is skewed to the left, or negatively skewed, if there is a long tail to the left, indicating that there are relatively few small data values and more larger values, and so most of the values are concentrated in the upper portion of the distribution. Panel A in Figure 3.1 illustrates a negatively skewed distribution. As relatively few people die at an early age, we would expect the distribution of age at death of Australian residents to be negatively skewed.

symmetrical Distribution of data values above and below the mean are identical. skewed Non-symmetrical distribution; data values are clustered either in the lower or the upper portion of the distribution.

Figure 3.1  A comparison of three data sets differing in shape: Panel A – negative, or left skewed; Panel B – symmetrical; Panel C – positive, or right skewed

The relative positions of the mean and median provide some information about the shape of a distribution. In many, but not all, negative or left-skewed distributions the few extremely small values pull the mean downwards so that the mean is less than the median. In many, but again not all, positive or right-skewed distributions the few extremely large values pull the mean upwards so that the mean is greater than the median. If the distribution is symmetrical, the high and low values balance each other and the mean equals the median. Therefore, for most continuous unimodal (one peak) distributions, we can say that:
• mean < median, the distribution is likely to be negative or left skewed
• mean = median, the distribution is symmetrical or has zero skewness
• mean > median, the distribution is likely to be positive or right skewed.
These rules often do not apply for discrete distributions, as illustrated in Example 3.13.

EXAMPLE 3.13

DISTRIBUTION OF NUMBER OF ADULTS IN HOUSEHOLD
From a random survey of 40 households the following data were obtained in response to the question 'How many adults (people over 18) are there in the household?' < HOUSEHOLD >

4  4  2  2  2  1  1  3  2  2
1  1  1  1  3  2  3  2  2  3
1  2  1  1  1  2  2  5  1  3
1  2  1  2  1  1  3  2  1  1

Present these data graphically and calculate the mean and median.

SOLUTION

A column chart of the data is given in Figure 3.2.

Figure 3.2  Column chart for number of adults in household (frequency on the vertical axis; number of adults, 1 to 5, on the horizontal axis)

As most households have either one or two adults, the data are concentrated in the lower portion of the graph with a tail to the right. Therefore, the distribution of the number of adults in these households is positively or right skewed.
To calculate the mean, first calculate the sum of X, \sum_{i=1}^{40} X_i = 4 + 4 + \cdots + 1 = 76. Then, as n = 40, using Equation 3.1 we obtain:

\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{76}{40} = 1.9





Rank of the median is:

\frac{n+1}{2} = \frac{40+1}{2} = 20.5

The median is the mean of the 20th- and 21st-ranked values. From the ordered array of the data:

1  1  1  1  1  1  1  1  1  1
1  1  1  1  1  1  1  2  2  2
2  2  2  2  2  2  2  2  2  2
2  3  3  3  3  3  3  4  4  5

the 20th- and 21st-ranked values are 2, so:

Median = 2

So the mean number of adults per household is 1.9, while the median number of adults is 2. In this case, mean < median even though the number of adults per household is skewed to the right.
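Example 3.13 is a useful reminder that the mean-versus-median rules are only a guide for discrete data, and it is easy to verify by machine. The Python sketch below is illustrative only; it recomputes the mean of 1.9 and the median of 2 for the 40 household responses.

from statistics import mean, median

adults = [4, 4, 2, 2, 2, 1, 1, 3, 2, 2,
          1, 1, 1, 1, 3, 2, 3, 2, 2, 3,
          1, 2, 1, 1, 1, 2, 2, 5, 1, 3,
          1, 2, 1, 2, 1, 1, 3, 2, 1, 1]

print(mean(adults))     # 1.9
print(median(adults))   # 2  (mean < median despite the right skew)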

Microsoft Excel Descriptive Statistics Output
The Microsoft Excel Data Analysis Toolpak generates the mean, median, mode, standard deviation, variance, range, minimum, maximum and count (sample size) on a single worksheet, all of which have been discussed in this section. In addition, Excel calculates the standard error, along with statistics for kurtosis and skewness. The standard error is the standard deviation divided by the square root of the sample size and is discussed in Chapter 7. Skewness measures the lack of symmetry in the data and is based on a statistic that is a function of the cubed differences around the mean. A skewness value of zero indicates a symmetrical distribution. Positive and negative values indicate positive or negative skewness. Kurtosis measures the relative concentration of values in the centre of the distribution compared with the tails, and is based on the differences around the mean raised to the fourth power. This measure is not discussed in this text. For data on festival expenditure by international visitors, the Excel descriptive statistics output, shown in Figure 3.3, gives many of the sample statistics calculated in the examples in this section.

Figure 3.3  Microsoft Excel summary statistics for festival expenditure

Festival spending – international visitors
Mean                  743.75
Standard error         74.9867
Median                744
Mode                  #N/A
Standard deviation    259.761
Sample variance     67476
Kurtosis               –1.41411
Skewness               –0.13236
Range                 776
Minimum               343
Maximum              1119
Sum                  8925
Count                  12
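Readers without Excel can obtain a comparable summary with a few lines of code. The sketch below uses the Python pandas library (an assumption on our part – pandas is not used anywhere in this text). pandas' skew() and kurt() use sample-adjusted formulas similar to Excel's, so the output should be close to Figure 3.3, although describe() interpolates quartiles and so may not match this text's quartile rules exactly.

import pandas as pd

festival = pd.Series([1119, 615, 971, 553, 343, 502, 928, 1005, 993, 408, 725, 763],
                     name='Festival spending - international visitors')

print(festival.describe())                  # count, mean, std, min, quartiles, max
print(festival.skew(), festival.kurt())     # roughly -0.13 and -1.41
print(festival.max() - festival.min(), festival.sum())   # range 776, sum 8925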



think about this
Median or mean?

While the mean is the most common measure of central tendency, there are times when the median is the appropriate measure to use. A common measure of relative poverty is living in a household that has less than 50% of median household income. In Poverty in Australia 2016, the Australian Council of Social Service (www. acoss.org.au) reveals that a single adult with a disposable income of less than $426 per week or a couple with two children with a disposable income of less than $895 per week were living in relative poverty in 2014. Why is median household income used to define relative poverty, not mean household income? Two possible reasons are:

■ Since household income is likely to be skewed to the right, mean household income is likely to be considerably higher than the median household income. Therefore, defining the poverty line as 50% of mean household income would lead to a greater proportion of the population being defined as living in relative poverty.
■ Furthermore, defining the poverty line as 50% of mean household income would mean that any measures to alleviate poverty would be unlikely to change the proportion of households in relative poverty, since any increase in disposable household income of those in relative poverty would increase mean household income and hence raise the poverty line. However, using median household income to define relative poverty makes it possible to reduce, possibly to zero, the proportion of households in relative poverty. This is because increasing the disposable income of those living in relative poverty, through employment, benefits, tax rebates or other means, so that household income is above 50% of median income, need not change the median household income.

visual explorations
Exploring Descriptive Statistics

Open the VE_Descriptive_Statistics workbook to explore the effects of changing data values on measures of central tendency, variation and shape. Change the data values in the cell range A2:A11 and then observe the changes to the statistics shown in the chart. Click View the Suggested Activity Page to view a specific change you could make to the data values in column A. Click View the More About Descriptive Statistics Page to view summary definitions of the descriptive statistics shown in the chart.





Problems for Section 3.1
LEARNING THE BASICS

3.1 The data below are a sample of n = 5: 7

4

9

8

2

a. Calculate the mean, median and mode. b. Calculate the range, interquartile range, variance, standard deviation and coefficient of variation. c. Calculate the Z scores. Are there any outliers? 3.2 The data below are a sample of n = 6: 7

4

9

7

3

12

a. Calculate the mean, median and mode. b. Calculate the range, interquartile range, variance, standard deviation and coefficient of variation. c. Calculate the Z scores. Are there any outliers? 3.3 The data below are a sample of n = 7: 12

7

4

9

0

7

3

a. Calculate the mean, median and mode. b. Calculate the range, interquartile range, variance, standard deviation and coefficient of variation. 3.4 The data below are a sample of n = 5: 7

- 5

- 8

7

9

a. Calculate the mean, median and mode. b. Calculate the range, interquartile range, variance, standard deviation and coefficient of variation. 3.5 Suppose that the rate of return for a particular share during the past two years was 10% and 30%. Calculate the geometric mean rate of return. (Note: A rate of return of 10% is recorded as 0.10 and a rate of return of 30% is recorded as 0.30.)

APPLYING THE CONCEPTS Problems 3.6 to 3.18 can be solved manually or by using Microsoft Excel.

3.6 The operations manager of a plant that manufactures tyres wants to compare the actual inner diameter of two grades of tyres, each of which is expected to be 575 millimetres. A sample of five tyres of each grade is selected and the results, representing the inner diameters of the tyres, ranked from smallest to largest, are as follows:
Grade X: 568  570  575  578  584
Grade Y: 573  574  575  577  578
a. For each of the two grades of tyres, calculate the mean, median and standard deviation.
b. Which grade of tyre is providing on average better quality? Explain.
c. What would be the effect on your answers in (a) and (b) if the last value for grade Y was 588 instead of 578? Explain.

3.7 Low-fat foods are not necessarily low calorie, as many low-fat foods are high in sugar. The calories per 250 ml cup of a random sample of brands of fresh cow's milk for sale in Australia were given in problem 2.14 and stored in < FRESH_MILK >. Using the calorie data for each milk category:
a. Calculate the mean, median, first quartile and third quartile.
b. Calculate the variance, standard deviation, range, interquartile range and coefficient of variation.
c. Based on the results of (a) and (b), what conclusions can you reach about the differences in calories between these types of milk?

3.8 The sales per day, in dollars, at a certain store are: < SALES >
1,520  2,620  3,360  3,550  1,350  2,545  1,430  2,400  3,580  2,390  1,525  2,400  1,420  1,550  2,390  1,560  1,680  2,330
a. Calculate the mean, median, mode, first quartile and third quartile.
b. Calculate the variance, standard deviation, range, interquartile range and coefficient of variation.
c. What conclusions can you reach about daily sales at this store?

3.9 The supervisor of a tourist information desk at a local airport is interested in how long it takes an employee to serve a customer. For a sample of 12 customers, she measures the amount of time taken to serve each one. These times, measured in minutes, are reported below: < TOURIST >
2.4  1.5  3.9  0.6  2.7  3.1  2.8  0.9  1.4  2.6  1.4  6.1
a. Calculate the mean, median, mode, first quartile and third quartile.
b. Calculate the variance, standard deviation, range, interquartile range, coefficient of variation and Z scores.
c. Are there any outliers, and are the data skewed?
d. Based on the results of (a) to (c), what conclusions can you reach about the time taken to serve a customer?

3.10 The ordered arrays in the table below give the life (in hours of usage) of samples of forty 15-watt CFL (compact fluorescent lamp) energy-saving light bulbs produced by two manufacturers, A and B. < BULBS >

Manufacturer A
5,544  5,814  6,190  6,307  6,342  6,423  6,429  6,485  6,612  6,667
6,832  6,868  6,879  6,930  6,941  7,007  7,037  7,043  7,059  7,136
7,497  7,645  7,654  7,773  7,816  7,838  7,924  7,999  8,038  8,067
8,091  8,119  8,392  8,416  8,416  8,514  8,532  8,542  8,544  8,731

Manufacturer B
6,701  6,837  6,961  7,118  7,133  7,142  7,156  7,344  7,493  7,569
7,607  7,612  7,651  7,721  7,754  7,767  7,806  7,839  7,888  7,983
8,298  8,344  8,535  8,666  8,792  8,800  8,856  8,861  8,993  9,001
9,036  9,096  9,262  9,385  9,460  9,471  9,521  9,540  9,693  9,744




For each manufacturer:
a. Calculate the mean, median, first quartile and third quartile.
b. Calculate the variance, standard deviation, range, interquartile range and coefficient of variation.
c. What conclusions can you reach concerning the life of each manufacturer's bulbs?

3.11 The prices (in dollars) of 14 models of camera at a camera specialty store were as follows. < CAMERA >
340  370  450  400  450  310  280  340  220  430  340  270  290  380
a. Calculate the mean, median, first quartile and third quartile.
b. Calculate the variance, standard deviation, range, interquartile range, coefficient of variation and Z scores. Are there any outliers? Explain.
c. Are the data skewed? If so, how?
d. Based on the results of (a) to (c), what conclusions can you reach about the price of cameras at the camera specialty store?

3.12 The following data refer to the number of kilometres that a sample of 50 people drive to work each day. < TRAVEL_WORK >
23  19  12  15  26  34  26  26  27  15
25   8   5  32  27  31  35  16  10  24
32  36   7  38  25   4  24  35   9  18
17  22  46  24  44  19  27  34  12  23
30  47  38  27  27  42  29  27  45  29
a. Calculate the mean, median and mode.
b. Calculate the range, variance and standard deviation.
c. Interpret the summary measures calculated in (a) and (b).

3.13 A manufacturer of torch batteries took a sample of 13 batteries from a day's production and used them continuously until they were drained. The numbers of hours they were used until failure were: < BATTERIES >
342  426  317  545  264  451  1,049  631  512  266  492  562  298
a. Calculate the mean, median and mode. Looking at the distribution of times to failure, which measures of central tendency do you think are most appropriate and which least appropriate to use for these data? Why?
b. Calculate the range, variance and standard deviation.
c. What would you advise if the manufacturer wanted to say in advertisements that these batteries 'should last 400 hours'? (Note: There is no right answer to this question; the point is to consider how to make such a statement precise.)
d. Suppose that the first value was 1,342 instead of 342. Repeat (a) to (c), using this value. Comment on the difference in the results.

3.14 A bank branch located in a commercial district of a city has developed an improved process for serving customers during the noon to 1 pm lunch period. The waiting time in minutes (defined as the time the customer enters the line to when they reach the teller window) of all customers during this hour is recorded over a period of one week. A random sample of 15 customers is selected, and the results are as follows: < BANK1 >
4.21  5.55  3.02  5.13  4.77  2.34  3.54  3.20  4.50  6.10  0.38  5.12  6.46  6.19  3.79
a. Calculate the mean, median, first quartile and third quartile.
b. Calculate the variance, standard deviation, range, interquartile range, coefficient of variation and Z scores. Are there any outliers? Explain.
c. Are the data skewed? If so, how?
d. As a customer walks into the branch office during the lunch hour, she asks the branch manager how long she can expect to wait. The branch manager replies, 'Almost certainly less than five minutes'. On the basis of the results of (a) and (b), evaluate the accuracy of this statement.

3.15 Suppose that another branch, located in a residential area, is also concerned about the noon to 1 pm lunch hour. The waiting time in minutes (defined as the time the customer enters the line to the time they reach the teller window) of all customers during this hour is recorded over a period of one week. A random sample of 15 customers is selected, and the results are as follows: < BANK2 >
9.66  5.90  8.02  5.79  8.73  3.82  8.01  8.35  10.49  6.68  5.64  4.08  6.17  9.91  5.47
a. Calculate the mean, median, first quartile and third quartile.
b. Calculate the variance, standard deviation, range, interquartile range and coefficient of variation. Are there any outliers? Explain.
c. Are the data skewed? If so, how?
d. As a customer walks into the branch office during the lunch hour, he asks the branch manager how long he can expect to wait. The branch manager replies, 'Almost certainly less than five minutes'. On the basis of the results of (a) and (b), evaluate the accuracy of this statement.

3.16 Data from 100 recent property sales from a council area are stored in < PROPERTY >. For the asking price data, calculate and interpret:
a. the mean and median (refer to graphs in problem 2.71)
b. the quartiles
c. the range and interquartile range
d. the variance and standard deviation.


3.17 The five years 2012 to 2016 saw volatility in the value of shares. The data in the following table give the annual percentage change in the share market index for Hong Kong, the Hang Seng, and for Australia, the S&P/ASX 200, for 2012 to 2016.

Year    Hang Seng    ASX 200
2012       22.9%       14.6%
2013        2.9%       15.1%
2014        1.3%        1.1%
2015       −7.2%       −2.1%
2016        0.4%        7.0%
Source: Data obtained from Yahoo 7 Finance, accessed April 2017.

a. For each index calculate the geometric rate of return for the five years.
b. What conclusions can you reach concerning the geometric rates of return of the two indices?

3.18 The annual returns (before tax and fees) on several managed superannuation investment funds, recorded as the historical crediting rate (%) for the year ending 30 June, are:

Year    Conservative balanced    Balanced    High growth    Sustainable balanced
2017             5.3                9.2          16.6               12.4
2016             7.5                6.1           0.0                0.0
2015            10.2               11.0          13.9               15.0
2014            11.6               13.9          18.9               15.7
2013            11.7               15.9          20.7               15.9

a. For each fund, calculate the geometric rate of return for three years (2015 to 2017) and for five years (2013 to 2017).
b. What conclusions can you reach concerning the geometric rates of return for the funds?

3.2  NUMERICAL DESCRIPTIVE MEASURES FOR A POPULATION

LEARNING OBJECTIVE 2
Calculate and interpret descriptive summary measures for a population

Section 3.1 introduces several statistics that describe the properties of central tendency, variation and shape for a sample. If we have population data there are similar numerical descriptive measures, called population parameters, of central tendency, variation and shape. This section introduces three population parameters: population mean, population variance and population standard deviation. To illustrate these population parameters we use the data in Table 3.3, which classifies road fatalities in Australia for 2016 by month and gender. Because the table gives the monthly road fatalities for 2016, by gender and in total, for the whole of Australia, this is population data.

Table 3.3 Road fatalities in Australia 2016
Month        Male   Female   Unknown   Total
January       27      80        0       107
February      30      72        0       102
March         23      87        0       110
April         33      81        0       114
May           29      76        0       105
June          23      74        0        97
July          26      91        0       117
August        38      74        0       112
September     24      68        0        92
October       29      89        0       118
November      28      78        0       106
December      27      88        1       116
Total        337     958        1     1,296
Source: Data obtained from the Australian Road Deaths Database, accessed 4 May 2017.

Population Mean

population mean  Mean calculated from population data.

The population mean, defined by Equation 3.13, is represented by the symbol μ, the Greek lower-case letter mu.


POPULATION MEAN
The population mean is the sum of the values in the population divided by the population size, N.

$\mu = \dfrac{\sum_{i=1}^{N} X_i}{N}$   (3.13)

where  μ = population mean
       Xi = ith value of the variable X
       $\sum_{i=1}^{N} X_i$ = sum of all Xi values in the population

To calculate the mean monthly total road fatality for 2016 from the data given in Table 3.3, use Equation 3.13:

$\mu = \dfrac{\sum_{i=1}^{N} X_i}{N} = \dfrac{107 + 102 + \dots + 116}{12} = \dfrac{1,296}{12} = 108$

Thus, the mean monthly road fatality for 2016 was 108.

Population Variance and Standard Deviation

population variance  Variance calculated from population data.
population standard deviation  Standard deviation calculated from population data.

The population variance and the population standard deviation measure variation in a population. Like the related sample statistic, the population standard deviation is the square root of the population variance. The population variance is represented by the symbol σ², the Greek lower-case letter sigma squared, and the population standard deviation by the symbol σ. These parameters are defined by Equations 3.14a and 3.15. The denominator in Equation 3.14a is N (population size) and not n − 1 as used in the equation for the sample variance (see Equation 3.9a).

POPULATION VARIANCE – DEFINITION FORMULA
The population variance is the sum of the squared deviations from the population mean divided by the population size N.

$\sigma^2 = \dfrac{SS_X}{N} = \dfrac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$   (3.14a)

where  μ = population mean
       Xi = ith value of the variable X
       $SS_X = \sum_{i=1}^{N} (X_i - \mu)^2$ = sum of the squared deviations from the mean (sum of squares)

POPULATION STANDARD DEVIATION
The population standard deviation is the square root of the population variance.

$\sigma = \sqrt{\sigma^2}$   (3.15)


As we did for sample variance and standard deviation, we can use algebra to obtain alternative calculation formulas.

POPULATION VARIANCE – CALCULATION FORMULA
The population variance is the sum of the squared deviations from the population mean divided by the population size N.

$\sigma^2 = \dfrac{SS_X}{N} = \dfrac{\sum_{i=1}^{N} X_i^2 - N\mu^2}{N} = \dfrac{\sum_{i=1}^{N} X_i^2 - \dfrac{\left(\sum_{i=1}^{N} X_i\right)^2}{N}}{N}$   (3.14b)

where  μ = population mean
       Xi = ith value of the variable X
       $\sum_{i=1}^{N} X_i^2 = X_1^2 + X_2^2 + \dots + X_N^2$ = sum of the squared Xi values in the population

Use either calculation formula. Using the data in Table 3.3 to calculate the population variance and standard deviation for the 2016 monthly total road fatalities, first calculate:

$\sum_{i=1}^{N} X_i^2 = 107^2 + 102^2 + \dots + 116^2 = 140,696$

then use Equations 3.14b and 3.15 to obtain:

$\sigma^2 = \dfrac{\sum_{i=1}^{N} X_i^2 - N\mu^2}{N} = \dfrac{140,696 - 12 \times 108^2}{12} = 60.666\ldots$

$\sigma = \sqrt{\sigma^2} = \sqrt{60.666\ldots} = 7.788\ldots$

Thus, the variance of monthly total fatalities for 2016 is approximately 60.7 and the standard deviation is approximately 7.8 fatalities per month. So, the typical 2016 monthly fatality rate differs from the mean of 108 by plus or minus 7.8.
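Although the text carries out these calculations by hand and in Microsoft Excel, the same population parameters can be cross-checked with a few lines of Python using only the monthly totals from Table 3.3. This is an illustrative sketch, not part of the text's own workflow:

```python
# Minimal sketch: population parameters for the 2016 monthly total road fatalities (Table 3.3)
from statistics import mean, pvariance, pstdev

monthly_totals = [107, 102, 110, 114, 105, 97, 117, 112, 92, 118, 106, 116]

mu = mean(monthly_totals)             # population mean (Equation 3.13)
sigma_sq = pvariance(monthly_totals)  # population variance (Equation 3.14a), divides by N
sigma = pstdev(monthly_totals)        # population standard deviation (Equation 3.15)

print(mu, round(sigma_sq, 3), round(sigma, 3))  # 108 60.667 7.789
```

Note that pvariance and pstdev divide by N, matching the population formulas; the sample statistics of Section 3.1 would instead divide by n − 1.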

The Empirical Rule

bell-shaped  Symmetric, unimodal, mound-shaped distribution.
empirical rule  Gives the distribution of data values in terms of standard deviations from the mean for bell-shaped distributions.

In many data sets a large portion of the values tend to cluster near the median. In right-skewed data sets, this clustering occurs in the left or lower part of the distribution. In left-skewed data sets, the values tend to cluster in the right or upper part of the distribution. In symmetrical data sets, where the median and mean are similar, the values often cluster around the median and mean, producing a bell-shaped distribution. You can use the empirical rule to examine the variability in bell-shaped distributions, both population and sample. The empirical rule states that for bell-shaped distributions:
• Approximately 68% of the values are within a distance of ±1 standard deviation from the mean. That is, approximately 68% of the data values have Z scores between −1 and 1.
• Approximately 95% of the values are within a distance of ±2 standard deviations from the mean. That is, approximately 95% of the data values have Z scores between −2 and 2.
• Approximately 99.7% of the values are within a distance of ±3 standard deviations from the mean. That is, approximately 99.7% of the data values have Z scores between −3 and 3.


The empirical rule helps to identify outliers when analysing a set of numerical data. The empirical rule implies that, for bell-shaped distributions, only about 1 in 20 values will be beyond two standard deviations from the mean. As a general rule, you can consider values not found in the interval μ ± 2σ (or X ± 2S) as potential outliers. The rule also implies that only about 3 in 1,000 will be beyond three standard deviations from the mean. Therefore, values not found in the interval μ ± 3σ (or X ± 3S) are almost always considered outliers. For heavily skewed or non-bell-shaped data sets the Chebyshev rule, introduced next, should be used instead of the empirical rule.
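The outlier screen described above is easy to automate. The sketch below (illustrative only) applies the Z-score form of the rule to the service-time sample from problem 3.9, flagging values more than two or three sample standard deviations from the mean:

```python
# Minimal sketch: flagging potential outliers with the X-bar +/- 2S and +/- 3S rule
from statistics import mean, stdev

sample = [2.4, 1.5, 3.9, 0.6, 2.7, 3.1, 2.8, 0.9, 1.4, 2.6, 1.4, 6.1]  # problem 3.9 data, for illustration

x_bar, s = mean(sample), stdev(sample)  # sample mean and sample standard deviation
for x in sample:
    z = (x - x_bar) / s
    if abs(z) > 3:
        print(f"{x}: Z = {z:.2f} -> almost certainly an outlier")
    elif abs(z) > 2:
        print(f"{x}: Z = {z:.2f} -> potential outlier")
# Only the 6.1-minute service time is flagged (Z is roughly 2.4)
```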

EXAMPLE 3.14

USING THE EMPIRICAL RULE
A population of 600-mL bottles of soft drink is known to have a mean fill-weight of 603 mL and a standard deviation of 1 mL. The population is also known to be bell-shaped. Describe the distribution of fill-weights. Is it very likely that a bottle will contain less than 600 mL of soft drink?

SOLUTION
μ ± σ = 603 ± 1 = (602, 604)
μ ± 2σ = 603 ± 2(1) = (601, 605)
μ ± 3σ = 603 ± 3(1) = (600, 606)
Using the empirical rule, approximately 68% of the bottles will contain between 602 mL and 604 mL, approximately 95% will contain between 601 mL and 605 mL, and approximately 99.7% will contain between 600 mL and 606 mL. Therefore, it is highly unlikely that a bottle will contain less than 600 mL of soft drink. Specifically, because of the assumed symmetry, we would expect only 0.15% of bottles to have a volume of soft drink less than 600 mL (and thus 0.15% above 606 mL).
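To make the arithmetic explicit, here is a minimal sketch (using the mean and standard deviation assumed in Example 3.14) that builds the three empirical-rule intervals and the approximate proportions attached to them:

```python
# Minimal sketch: empirical-rule intervals for a bell-shaped population (Example 3.14 values)
mu, sigma = 603, 1  # mean fill volume (mL) and standard deviation (mL)

for k, pct in [(1, "approximately 68%"), (2, "approximately 95%"), (3, "approximately 99.7%")]:
    lower, upper = mu - k * sigma, mu + k * sigma
    print(f"within {k} standard deviation(s): ({lower}, {upper}) mL -> {pct} of bottles")
```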

The Chebyshev Rule

Chebyshev rule  Gives lower bounds of the distribution of data values in terms of standard deviations from the mean for any distribution.

The Chebyshev rule states that, for all data sets, population or sample, the percentage of values within k standard deviations of the mean must be at least:

$\left[1 - \left(\dfrac{1}{k}\right)^2\right] \times 100\%$

You can use this rule for any value of k greater than 1. Consider k = 2. The Chebyshev rule states that at least [1 − (1/2)²]100% = 75% of the values must be within ±2 standard deviations of the mean. The Chebyshev rule is very general and applies to any distribution. The rule gives the percentage of values that must at least be within a given distance from the mean. However, if the data set is approximately bell-shaped, the empirical rule will more accurately reflect the greater concentration of data close to the mean. Table 3.4 compares the Chebyshev and empirical rules.

Table 3.4 How data vary around the mean

                      % of values found in intervals around the mean
Interval              Chebyshev (any distribution)    Empirical rule (bell-shaped distribution)
(μ − σ, μ + σ)        At least 0%                     Approximately 68%
(μ − 2σ, μ + 2σ)      At least 75%                    Approximately 95%
(μ − 3σ, μ + 3σ)      At least 88.89%                 Approximately 99.7%


EXAMPLE 3.15
USING THE CHEBYSHEV RULE
As in Example 3.14, a population of 600-mL bottles of soft drink is known to have a mean fill-weight of 603 mL and a standard deviation of 1 mL. However, the shape of the population is unknown and you cannot assume that it is bell-shaped. Describe the distribution of fill-weights. Is it very likely that a bottle will contain less than 600 mL of soft drink?

SOLUTION
μ ± σ = 603 ± 1 = (602, 604)
μ ± 2σ = 603 ± 2(1) = (601, 605)
μ ± 3σ = 603 ± 3(1) = (600, 606)
Because the distribution may not be bell-shaped, the empirical rule should not be used. Using the Chebyshev rule, you cannot say anything about the percentage of bottles containing between 602 mL and 604 mL. You can state that at least 75% of the bottles will contain between 601 mL and 605 mL, and at least 88.89% will contain between 600 mL and 606 mL. Therefore, it is possible that up to 11.11% of bottles contain less than 600 mL of soft drink (or more than 606 mL).

These two rules apply to both population and sample data. For sample data, use the sample mean X̄ and sample standard deviation S in place of the population parameters μ and σ.
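The Chebyshev bound itself is a one-line calculation. The following sketch (illustrative only; the function name is arbitrary) prints the minimum percentage of values within k standard deviations for a few values of k and reproduces the entries of Table 3.4:

```python
# Minimal sketch: Chebyshev's rule -- at least (1 - 1/k**2) * 100% of values lie within k standard deviations
def chebyshev_minimum_percentage(k: float) -> float:
    if k <= 1:
        raise ValueError("The rule is informative only for k > 1")
    return (1 - 1 / k**2) * 100

for k in (1.5, 2, 3):
    print(f"k = {k}: at least {chebyshev_minimum_percentage(k):.2f}% of values")
# k = 2 -> at least 75.00%, k = 3 -> at least 88.89% (compare Table 3.4)
```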

Problems for Section 3.2

LEARNING THE BASICS

3.19 The data below are for a population with N = 10:
7  5  11  8  3  6  2  1  9  8
a. Calculate the population mean.
b. Calculate the population standard deviation.

3.20 The data below are for a population with N = 10:
7  5  6  6  6  4  8  6  9  3
a. Calculate the population mean.
b. Calculate the population standard deviation.

APPLYING THE CONCEPTS

3.21 Analyse the road fatality data for 2016 given in < MONTHLY_FATALITY_2016 > for each gender by:
a. Calculating the mean, variance and standard deviation.
b. Finding the proportion of months that have fatalities within one and two standard deviations of the mean.
c. Comparing your findings with what would be expected on the basis of the empirical rule.

3.22 Naturally Soap is a small business, based in a coastal town, that makes and sells natural, luxurious, handmade soap bars in a variety of scents. Presently the soap is sold at local markets: Wednesday evening in the coastal town where the business is located, and a scheduled Sunday morning market in a roster of local villages. During the last six months, Naturally Soap has also been available via the Internet. Naturally Soap is interested in analysing the quantity sold weekly at each market and Internet sales. While Naturally Soap has complete sales and price data for both markets for the previous year, due to a computer 'problem' there is only a sample of weekly sales and price data for the Internet sales. The data is stored in the < NATURALLY_SOAP > file.
a. For the Sunday morning market:
   i. Calculate the mean, variance and standard deviation of the weekly sales for the year.
   ii. What conclusions can you make about the weekly sales for this market?
   iii. Use the empirical rule or the Chebyshev rule, whichever is appropriate, to further explain the variation in the weekly sales.
   iv. Using the results in (iii), are there any outliers? Explain.
b. Repeat (a) for the Wednesday evening market.

3.23 The ages, to the nearest year, of all employees at a certain fast-food outlet are:
19  19  45  20  21  21  18  20  23  17
a. Calculate the mean, variance and standard deviation.
b. Calculate the Z scores.
c. Based on the results of (a) and (b), what conclusions can you reach about employee ages at this fast-food outlet?

3.24 The file < HOURS > gives the hours worked during a recent week by all 30 employees of a local bakery. For this week:
a. Calculate and interpret the mean hours worked.
b. Calculate the variance and standard deviation of the hours worked. Interpret the standard deviation.
c. Use the empirical rule or the Chebyshev rule, whichever is appropriate, to explain further the variation in the hours worked.
d. Using the results in (c), are there any outliers? Explain.


3.3  CALCULATING NUMERICAL DESCRIPTIVE MEASURES FROM A FREQUENCY DISTRIBUTION

LEARNING OBJECTIVE 1
Calculate and interpret numerical descriptive measures of central tendency, variation and shape for numerical data

When you have a frequency distribution and the raw data are not available, you can calculate approximations to the mean and the standard deviation by assuming that all values within each class are located at the class mid-point.

APPROXIMATING THE SAMPLE MEAN, VARIANCE AND STANDARD DEVIATION FROM A FREQUENCY DISTRIBUTION

$\bar{X} = \dfrac{\sum_{j=1}^{c} m_j f_j}{n}$   (3.16)

where  X̄ = sample mean
       n = number of values or sample size
       c = number of classes in the frequency distribution
       mj = mid-point of the jth class
       fj = number of values in the jth class

$S = \sqrt{S^2}$   (3.17)

where
$S^2 = \dfrac{\sum_{j=1}^{c} (m_j - \bar{X})^2 f_j}{n-1} = \dfrac{\sum_{j=1}^{c} f_j m_j^2 - n\bar{X}^2}{n-1} = \dfrac{\sum_{j=1}^{c} f_j m_j^2 - \dfrac{\left(\sum_{j=1}^{c} m_j f_j\right)^2}{n}}{n-1}$

Example 3.16 illustrates the calculation of a sample mean and the standard deviation from a frequency distribution.

EXAMPLE 3.16
APPROXIMATING THE MEAN AND STANDARD DEVIATION FROM A FREQUENCY DISTRIBUTION
Use the frequency distribution for real estate prices given in Table 3.5 to calculate the approximate sample mean and standard deviation. Compare these approximations with the mean and standard deviation calculated from the raw (ungrouped) data in < PROPERTY >; see problem 3.16.

Table 3.5 Frequency distribution: real estate asking prices
Asking price ($)           Frequency
300,000 to < 350,000            8
350,000 to < 400,000           17
400,000 to < 450,000           21
450,000 to < 500,000           20
500,000 to < 550,000           16
550,000 to < 600,000            6
600,000 to < 650,000            7
650,000 to < 700,000            3
700,000 to < 750,000            0
750,000 to < 800,000            0
800,000 to < 850,000            2
Total                         100


SOLUTION
The calculations of the approximate mean and standard deviation of the real estate prices are summarised in Table 3.6 where, to avoid extremely large numbers, the mid-point of each class has been recorded in thousands of dollars.

Table 3.6 Calculations needed to calculate approximations of the mean and standard deviation of the real estate prices
Asking price ($)         Mid-point in $000s (mj)   Frequency (fj)     fj mj        fj mj²
300,000 to < 350,000              325                     8           2,600       845,000
350,000 to < 400,000              375                    17           6,375     2,390,625
400,000 to < 450,000              425                    21           8,925     3,793,125
450,000 to < 500,000              475                    20           9,500     4,512,500
500,000 to < 550,000              525                    16           8,400     4,410,000
550,000 to < 600,000              575                     6           3,450     1,983,750
600,000 to < 650,000              625                     7           4,375     2,734,375
650,000 to < 700,000              675                     3           2,025     1,366,875
700,000 to < 750,000              725                     0               0             0
750,000 to < 800,000              775                     0               0             0
800,000 to < 850,000              825                     2           1,650     1,361,250
Total                                                   100          47,300    23,397,500

Using Equations 3.16 and 3.17:

$\bar{X} = \dfrac{\sum_{j=1}^{c} m_j f_j}{n} = \dfrac{47,300}{100} = 473$

and

$S = \sqrt{\dfrac{\sum_{j=1}^{c} f_j m_j^2 - n\bar{X}^2}{n-1}} = \sqrt{\dfrac{23,397,500 - 100 \times 473^2}{99}} = \sqrt{10,349.4949\ldots} = 101.732\ldots$

Therefore, the mean and standard deviation are approximately $473,000 and $101,700. These values compare with the actual mean, $472,440, and the standard deviation, $102,395, calculated from the raw (ungrouped) data; see solutions to problem 3.16.
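The approximations in Example 3.16 can also be reproduced with a short Python sketch; this is illustrative only, and uses the class mid-points and frequencies from Table 3.6 (in thousands of dollars):

```python
# Minimal sketch: approximate mean and standard deviation from a frequency distribution (Equations 3.16 and 3.17)
from math import sqrt

midpoints   = [325, 375, 425, 475, 525, 575, 625, 675, 725, 775, 825]  # class mid-points ($000s)
frequencies = [  8,  17,  21,  20,  16,   6,   7,   3,   0,   0,   2]

n = sum(frequencies)                                              # 100
mean_approx = sum(m * f for m, f in zip(midpoints, frequencies)) / n
sum_f_m2 = sum(f * m**2 for m, f in zip(midpoints, frequencies))
var_approx = (sum_f_m2 - n * mean_approx**2) / (n - 1)
std_approx = sqrt(var_approx)

print(mean_approx, round(std_approx, 3))  # 473.0 101.732  (i.e. about $473,000 and $101,700)
```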

Problems for Section 3.3

LEARNING THE BASICS

3.25 Given the following frequency distribution for n = 100:

Class intervals    Frequency
0–under 10             10
10–under 20            20
20–under 30            40
30–under 40            20
40–under 50            10
Total                 100

Approximate:
a. the mean
b. the standard deviation.

3.26 Given the following frequency distribution for n = 100:

Class intervals    Frequency
0–under 10             40
10–under 20            25
20–under 30            15
30–under 40            15
40–under 50             5
Total                 100

Approximate:
a. the mean
b. the standard deviation.


APPLYING THE CONCEPTS

3.27 A company wished to study its accounts receivable for two successive months. An independent sample of 50 accounts was selected for each month. The results are in the table below.

Frequency distributions for accounts receivable
Amount                       March frequency    April frequency
$0 to under $2,000                  6                 10
$2,000 to under $4,000             13                 14
$4,000 to under $6,000             17                 13
$6,000 to under $8,000             10                 10
$8,000 to under $10,000             4                  0
$10,000 to under $12,000            0                  3
Total                              50                 50

a. For each month, approximate the:
   i. mean
   ii. standard deviation
b. On the basis of your answers in (a), do you think the mean and the standard deviation of the accounts receivable have changed substantially from March to April? Explain.

3.4  FIVE-NUMBER SUMMARY AND BOX-AND-WHISKER PLOTS

LEARNING OBJECTIVE 3
Construct and interpret a box-and-whisker plot

Section 3.1 introduces sample statistics to measure the centre, variation and shape of numerical data. Another way of describing numerical data is to use the five-number summary, which is illustrated graphically by a box-and-whisker plot.

Five-Number Summary

five-number summary  Numerical data summarised by quartiles.

The five-number summary consists of the five statistics:
Xsmallest  Q1  Median  Q3  Xlargest
The five-number summary characterises a sample (or population) reasonably well and is useful for exploratory data analysis. In particular, it provides a way to determine the shape of the distribution. Table 3.7 explains how the relationships between the 'five numbers' allow you to recognise the shape of a data set.

Table 3.7 Relationships between the five-number summary and the type of distribution

Comparison 1: the distance from Xsmallest to the median versus the distance from the median to Xlargest.
• Left skewed: the distance from Xsmallest to the median is greater than the distance from the median to Xlargest.
• Symmetrical: both distances are the same.
• Right skewed: the distance from Xsmallest to the median is less than the distance from the median to Xlargest.

Comparison 2: the distance from Xsmallest to Q1 versus the distance from Q3 to Xlargest.
• Left skewed: the distance from Xsmallest to Q1 is greater than the distance from Q3 to Xlargest.
• Symmetrical: both distances are the same.
• Right skewed: the distance from Xsmallest to Q1 is less than the distance from Q3 to Xlargest.

Comparison 3: the distance from Q1 to the median versus the distance from the median to Q3.
• Left skewed: the distance from Q1 to the median is greater than the distance from the median to Q3.
• Symmetrical: both distances are the same.
• Right skewed: the distance from Q1 to the median is less than the distance from the median to Q3.

The sample of 10 times to get ready (Section 3.1) ranges from 29 minutes to 52 minutes. The median is 39.5, the first quartile is 35 and the third quartile is 44. Therefore, the five-number summary is:
29  35  39.5  44  52


The distance from Xsmallest to the median (39.5 − 29 = 10.5) is slightly less than the distance from the median to Xlargest (52 − 39.5 = 12.5). The distance from Xsmallest to Q1 (35 − 29 = 6) is slightly less than the distance from Q3 to Xlargest (52 − 44 = 8). Therefore, the times to get ready are slightly right skewed.

EXAMPLE 3.17
CALCULATING THE FIVE-NUMBER SUMMARY FOR FESTIVAL EXPENDITURE – INTERNATIONAL VISITORS
In the opening scenario, Kai is interested in the distribution of the amount spent by international visitors during the festival. < FESTIVAL > Calculate the five-number summary.

SOLUTION
From Examples 3.3, 3.6 and 3.8 the five-number summary is:
343  502  744  993  1,119
The distance from the median to Xsmallest ($401) is more than the distance from Xlargest to the median ($375). Furthermore, the distance from Xsmallest to Q1 ($159) is more than the distance from Q3 to Xlargest ($126). Therefore, the amount spent by international visitors during the festival has a slight left-skewed distribution.
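A five-number summary can also be obtained in software. The sketch below is illustrative only and uses the small sample from problem 3.28; note that NumPy's default percentile interpolation can differ slightly from the (n + 1)/4 ranked-value rules of Section 3.1, so small samples may give slightly different quartiles than a hand calculation:

```python
# Minimal sketch: five-number summary and the distance comparisons of Table 3.7
import numpy as np

data = np.array([7, 4, 9, 8, 2])  # sample from problem 3.28, for illustration

x_small, q1, median, q3, x_large = np.percentile(data, [0, 25, 50, 75, 100])
print(x_small, q1, median, q3, x_large)

# Compare distances to judge skewness (Table 3.7)
left, right = median - x_small, x_large - median
print("left distance:", left, "right distance:", right)
```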

Box-and-Whisker Plots

box-and-whisker plot  Graphical representation of the five-number summary.

A box-and-whisker plot, alternatively called a boxplot, provides a graphical representation of the data based on the five-number summary. It shows the range, interquartile range and quartiles. Figure 3.4 illustrates the box-and-whisker plot for the times to get ready. The vertical line drawn within the box represents the median. The vertical line at the left side of the box represents Q1 and the vertical line at the right side of the box represents Q3. Thus, the box contains the middle 50% of an ordered array of data values, 25% between the median and each quartile. The lower 25% of the data values is represented by a line (i.e. a whisker) connecting the left side of the box to the location of the smallest value, Xsmallest. Similarly, the upper 25% of the data values is represented by a whisker connecting the right side of the box to Xlargest.

Figure 3.4 Box-and-whisker plot of the time to get ready (time in minutes, axis from 20 to 55; Xsmallest, Q1, median, Q3 and Xlargest are marked)

The box-and-whisker plot of the times to get ready in Figure 3.4 confirms a very slight right skewness since the right whisker is slightly longer than the left whisker.

EXAMPLE 3.18
CONSTRUCTING A BOX-AND-WHISKER PLOT FOR FESTIVAL EXPENDITURE – INTERNATIONAL VISITORS
In the opening scenario, Kai is interested in the distribution of the amount spent by international visitors during the festival. < FESTIVAL > Construct and interpret the box-and-whisker plot shown in Figure 3.5.


SOLUTION
Figure 3.5 Box-and-whisker plot, festival expenditure – international visitors (expenditure in dollars, axis from 300 to 1,200)

The left-hand whisker is slightly longer than the right-hand whisker and the left-hand and right-hand rectangles are approximately the same. Therefore, the amount spent by international visitors during the festival has a very slight left or negative skew.

Figure 3.6 demonstrates the relationship between a box-and-whisker plot and the corresponding polygon for four different types of distributions. (Note: The area under each polygon is split into quartiles corresponding to the five-number summary for the box-andwhisker plot.) Panels A and D of Figure 3.6 are symmetrical. In these distributions, the length of the left whisker is equal to the length of the right whisker, and the median line divides the box in half. Panel B of Figure 3.6 is left skewed. For this left-skewed distribution, the skewness indicates that there is a heavy clustering of values at the high end of the scale (i.e. the right-hand side); 75% of all values are found between the left edge of the box (Q1) and the end of the right whisker (Xlargest). Therefore, the long left whisker contains the smallest 25% of the values. Panel C of Figure 3.6 is right skewed. The concentration of values is on the low end of the scale (i.e. the left side of the box-and-whisker plot). Here, 75% of all data values are found between the beginning of the left whisker (Xsmallest) and the right edge of the box (Q3), and the remaining 25% of the values are dispersed along the long right whisker at the upper end of the scale.

Figure 3.6 Box-and-whisker plots and corresponding polygons for four distributions: Panel A, bell-shaped distribution; Panel B, left-skewed distribution; Panel C, right-skewed distribution; Panel D, rectangular/uniform distribution
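The text constructs Figures 3.4 to 3.6 with Excel/PHStat. As an illustrative alternative only, a horizontal box-and-whisker plot can be drawn with Matplotlib; in recent Matplotlib versions, passing whis=(0, 100) makes the whiskers run to Xsmallest and Xlargest as in the book's plots, rather than using Matplotlib's default 1.5 × interquartile range rule:

```python
# Minimal sketch: horizontal box-and-whisker plot in the style of Figures 3.4 and 3.5
import matplotlib.pyplot as plt

sample = [12, 7, 4, 9, 0, 7, 3]  # data from problem 3.30, used purely for illustration

fig, ax = plt.subplots(figsize=(6, 2))
ax.boxplot(sample, vert=False, whis=(0, 100))  # whiskers extend to the smallest and largest values
ax.set_xlabel("Value")
ax.set_yticks([])
plt.tight_layout()
plt.show()
```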


Problems for Section 3.4

LEARNING THE BASICS

3.28 The data below are a sample of n = 5: 7  4  9  8  2
a. List the five-number summary.
b. Construct the box-and-whisker plot and describe the shape.

3.29 The data below are a sample of n = 6: 7  4  9  7  3  12
a. List the five-number summary.
b. Construct the box-and-whisker plot and describe the shape.

3.30 The data below are a sample of n = 7: 12  7  4  9  0  7  3
a. List the five-number summary.
b. Construct the box-and-whisker plot and describe the shape.

3.31 The data below are a sample of n = 5: 7  −5  −8  7  9
a. List the five-number summary.
b. Construct the box-and-whisker plot and describe the shape.

APPLYING THE CONCEPTS

Problems 3.32 to 3.35 can be solved manually or by using Microsoft Excel or PHStat.

3.32 For the life of 15-watt CFL light bulbs data in problem 3.10: < BULBS >
a. List the five-number summary for each manufacturer.
b. Construct the box-and-whisker plot and describe the shape of the distribution for each manufacturer.

3.33 For the daily sales data in problem 3.8: < SALES >
a. List the five-number summary.
b. Construct the box-and-whisker plot and discuss the daily sales distribution for the store.

3.34 Many fast-food chains offer salads and low-fat options on their menu as an alternative to their traditional rolls and burgers. Data for a sample of these alternative and traditional menu items are stored in . For each product category, use the fat in grams per serve data:
a. List the five-number summary.
b. Construct the box-and-whisker plot.
c. What similarities and differences are there in the distributions for the product categories?

3.35 Use the data in problems 3.14 and 3.15, representing the waiting times of random samples of customers at two bank branches during the noon to 1 pm lunch period. For each bank:
a. List the five-number summary of the waiting time at the two bank branches.
b. Construct the box-and-whisker plot and describe the shape of the distribution of the two bank branches.
c. What similarities and differences are there in the distribution of the waiting time at the two bank branches?

3.5  COVARIANCE AND THE COEFFICIENT OF CORRELATION

LEARNING OBJECTIVE 4
Calculate and interpret the covariance and the coefficient of correlation for bivariate data

In Section 2.5, scatter diagrams are used to examine the relationship between two numerical variables (bivariate data). In this section, the covariance and the coefficient of correlation are introduced to measure the strength of the linear relationship between two numerical variables.

Covariance

covariance  Measure of the strength of the linear relationship between two numerical variables.
sample covariance  Covariance calculated from sample data.

The covariance is a measure of the strength and direction of the linear relationship between two numerical variables (X and Y). A positive value indicates a positive linear relationship between the two variables and a negative value indicates a negative relationship. A value of zero indicates that there is no linear relationship between the variables. A relationship that is linear can be graphed by a straight line, sloping upwards if positive and downwards if negative. Equation 3.18a defines the sample covariance.


THE SAMPLE COVARIANCE – DEFINITION FORMULA

$\text{cov}(X, Y) = \dfrac{SS_{XY}}{n-1} = \dfrac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n-1}$   (3.18a)

where  X̄ = sample mean of variable X
       Ȳ = sample mean of variable Y
       n = number of data points (Xi, Yi)
       Xi = ith value of the independent variable X
       Yi = ith value of the dependent variable Y, which corresponds to Xi
       $SS_{XY} = \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})$ = sum of the squares for X and Y

As for the sample variance and standard deviation, we can use algebra to obtain alternative calculation formulas.

THE SAMPLE COVARIANCE – CALCULATION FORMULA

$\text{cov}(X, Y) = \dfrac{SS_{XY}}{n-1} = \dfrac{\sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y}}{n-1} = \dfrac{\sum_{i=1}^{n} X_i Y_i - \dfrac{\sum_{i=1}^{n} X_i \sum_{i=1}^{n} Y_i}{n}}{n-1}$   (3.18b)

where  $\sum_{i=1}^{n} X_i Y_i = X_1 Y_1 + X_2 Y_2 + \dots + X_n Y_n$ = sum of the products of the XiYi values

Use either calculation formula.

EXAMPLE 3.19
CALCULATING THE SAMPLE COVARIANCE FOR DISCRETIONARY INCOME AND EXPENDITURE
The council in the opening scenario is also interested in the discretionary, or disposable, income and corresponding expenditure of residents within the region. To explore this Kai obtains the following data on discretionary weekly income and expenditure from 10 randomly selected residents of the region. Calculate the sample covariance for discretionary weekly income and expenditure.

Discretionary income ($)       400  815  550  400  250  300  375  380  425  600
Discretionary expenditure ($)  350  650  525  370  250  295  330  350  415  460

SOLUTION

Kai expects that discretionary expenditure is related to discretionary income, so defines Discretionary Income $ as the independent variable (X) and Discretionary Expenditure $ as the dependent variable (Y). Calculate:

$\bar{X} = \dfrac{\sum_{i=1}^{n} X_i}{n} = \dfrac{4,495}{10} = 449.50$

$\bar{Y} = \dfrac{\sum_{i=1}^{n} Y_i}{n} = \dfrac{3,995}{10} = 399.50$

$\sum_{i=1}^{10} X_i Y_i = (400 \times 350) + (815 \times 650) + \dots + (600 \times 460) = 1,966,625$

Then, using Equation 3.18b, we obtain:

$SS_{XY} = \sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y} = 1,966,625 - (10 \times 449.50 \times 399.50) = 170,872.50$

$\text{cov}(X, Y) = \dfrac{SS_{XY}}{n-1} = \dfrac{170,872.50}{9} = 18,985.833\ldots$

As the covariance is positive, Kai can conclude that there is a positive linear relationship between discretionary income and expenditure.
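The sample covariance in Example 3.19 can be cross-checked with a short Python sketch based on the definition formula (Equation 3.18a); this is illustrative only and uses the income and expenditure values given in the example:

```python
# Minimal sketch: sample covariance via the definition formula (Equation 3.18a)
income      = [400, 815, 550, 400, 250, 300, 375, 380, 425, 600]  # X, discretionary weekly income ($)
expenditure = [350, 650, 525, 370, 250, 295, 330, 350, 415, 460]  # Y, discretionary weekly expenditure ($)

n = len(income)
x_bar = sum(income) / n       # 449.5
y_bar = sum(expenditure) / n  # 399.5

ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(income, expenditure))
cov_xy = ss_xy / (n - 1)

print(round(ss_xy, 2), round(cov_xy, 3))  # 170872.5 18985.833
```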

As covariance can have any value, it is difficult to use it as a measure of the relative strength of a linear relationship. A better and related measure of the relative strength of a linear relationship is the coefficient of correlation.

Coefficient of Correlation

coefficient of correlation (or correlation coefficient)  Measure of the relative strength of the linear relationship between two numerical variables.

The coefficient of correlation measures the relative strength of a linear relationship between two numerical variables. The values of the coefficient of correlation range from −1 for a perfect negative linear correlation to +1 for a perfect positive linear correlation. Perfect means that, if the points are plotted in a scatter diagram, all the points will lie in a straight line. When dealing with population data for two numerical variables, the Greek letter ρ (rho) is used as the symbol for the coefficient of correlation. Figure 3.7 illustrates three different types of association between two variables.

Figure 3.7 Types of association between variables: Panel A, perfect negative correlation (r = −1); Panel B, no correlation (r = 0); Panel C, perfect positive correlation (r = +1)

Panel A of Figure 3.7 illustrates a perfect negative linear relationship between X and Y, where the coefficient of correlation ρ equals −1. Panel B shows a situation in which there is no relationship between X and Y. In this case, the coefficient of correlation ρ equals 0. Panel C illustrates a perfect positive linear relationship where ρ equals +1. With sample data, the sample coefficient of correlation r can be calculated. Figure 3.8 (page 127) gives the scatter diagrams with their respective sample coefficients of correlation r for six data sets, each of which contains 100 values of X and Y.


sample coefficient of correlation Coefficient of correlation calculated from sample data.

In panel A of Figure 3.8 the coefficient of correlation r is −0.9. You can see that small values of X tend to be paired with large values of Y. Likewise, large values of X tend to be paired with small values of Y. As the data are not all in a straight line, the linear relationship between X and Y is strong but not perfect. The data in panel B have a coefficient of correlation equal to −0.6, and the small values of X tend to be paired with large values of Y. However, as the data points are more scattered in panel B, the linear relationship between X and Y in panel B is not as strong as that in panel A. Thus, the coefficient of correlation in panel B, while still negative (indicating a negative relationship), is closer to 0 than the correlation coefficient in panel A. In panel C the negative linear relationship between X and Y is very weak, r = −0.3, and there is only a slight tendency for the small values of X to be paired with the larger values of Y.

Panels D to F depict data sets that have positive coefficients of correlation, hence positive linear relationships, where small values of X tend to be paired with small values of Y, and the large values of X tend to be paired with large values of Y.

In this discussion of Figure 3.8, the relationships are deliberately described as tendencies and not as cause-and-effect. This wording is intentional. Correlation alone cannot prove that there is a causal effect – that is, that the change in the value of one variable caused the change in the other variable. A strong correlation can be produced simply by chance, by the effect of a third variable not considered in the calculation of the coefficient of correlation, or by a cause-and-effect relationship. You would need to perform additional analysis to determine which of these three situations actually produced the correlation. Therefore, you can say that causation implies correlation, but correlation alone does not imply causation. Equation 3.19 defines the sample coefficient of correlation r and Example 3.20 illustrates its use.

THE SAMPLE COEFFICIENT OF CORRELATION
The sample coefficient of correlation is sample covariance divided by the sample standard deviations of X and Y:

$r = \dfrac{\text{cov}(X, Y)}{S_X S_Y}$

where SX and SY are the sample standard deviations for variables X and Y, defined by Equation 3.10. As $\text{cov}(X, Y) = \dfrac{SS_{XY}}{n-1}$, $S_X = \sqrt{\dfrac{SS_X}{n-1}}$ and $S_Y = \sqrt{\dfrac{SS_Y}{n-1}}$, the sample correlation coefficient can also be defined as:

$r = \dfrac{SS_{XY}}{\sqrt{SS_X}\sqrt{SS_Y}}$   (3.19)

where the formulas for the respective sums of squares are:

$SS_{XY} = \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}) = \sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y} = \sum_{i=1}^{n} X_i Y_i - \dfrac{\sum_{i=1}^{n} X_i \sum_{i=1}^{n} Y_i}{n}$

$SS_X = \sum_{i=1}^{n} (X_i - \bar{X})^2 = \sum_{i=1}^{n} X_i^2 - n\bar{X}^2 = \sum_{i=1}^{n} X_i^2 - \dfrac{\left(\sum_{i=1}^{n} X_i\right)^2}{n}$

$SS_Y = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} Y_i^2 - n\bar{Y}^2 = \sum_{i=1}^{n} Y_i^2 - \dfrac{\left(\sum_{i=1}^{n} Y_i\right)^2}{n}$


Figure 3.8 Six scatter diagrams and their sample coefficients of correlation, r: Panel A, r = −0.9; Panel B, r = −0.6; Panel C, r = −0.3; Panel D, r = 0.3; Panel E, r = 0.6; Panel F, r = 0.9

EXAMPLE 3.20
CALCULATING THE SAMPLE CORRELATION COEFFICIENT FOR DISCRETIONARY INCOME AND EXPENDITURE
Kai is exploring the relationship between discretionary, or disposable, income and the corresponding expenditure of residents within the region. From the data in Example 3.19, calculate and interpret the sample correlation coefficient.

SOLUTION
From Example 3.19: X̄ = 449.50, Ȳ = 399.50 and SSXY = 170,872.5


Calculate:

$\sum_{i=1}^{10} X_i^2 = 400^2 + 815^2 + \dots + 600^2 = 2,264,875$

$\sum_{i=1}^{10} Y_i^2 = 350^2 + 650^2 + \dots + 460^2 = 1,722,275$

$SS_X = \sum_{i=1}^{n} X_i^2 - n\bar{X}^2 = 2,264,875 - 10 \times 449.5^2 = 244,372.5$

$SS_Y = \sum_{i=1}^{n} Y_i^2 - n\bar{Y}^2 = 1,722,275 - 10 \times 399.5^2 = 126,272.5$

Therefore, using Equation 3.19:

$r = \dfrac{SS_{XY}}{\sqrt{SS_X}\sqrt{SS_Y}} = \dfrac{170,872.5}{\sqrt{244,372.5}\sqrt{126,272.5}} = 0.9727\ldots$

As r = 0.97 is very close to 1, Kai can conclude that there is a very strong positive linear relationship between discretionary income and expenditure. As it is known that there is a relationship between a person's income and their expenditure, Kai can conclude that, if a resident's discretionary income increases, their expenditure is also highly likely to increase.

In summary, the coefficient of correlation is a measure of the strength of the linear relationship, or association, between two numerical variables. The closer the coefficient of correlation is to +1 or −1, the stronger the linear relationship. When the coefficient of correlation is near 0, there is little or no linear relationship between the two numerical variables. The sign of the coefficient of correlation indicates whether the data are positively correlated (i.e. the larger values of X tend to be paired with the larger values of Y) or negatively correlated (i.e. the larger values of X tend to be paired with the smaller values of Y). The existence of a strong correlation does not imply a causation effect. It only indicates the tendencies present in the data.
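Continuing the illustrative sketch from Example 3.19, the sample coefficient of correlation can be verified either from the sums of squares in Equation 3.19 or with NumPy's corrcoef function; both give r ≈ 0.97 for the income and expenditure data assumed above:

```python
# Minimal sketch: sample coefficient of correlation (Equation 3.19)
import numpy as np

income      = np.array([400, 815, 550, 400, 250, 300, 375, 380, 425, 600])
expenditure = np.array([350, 650, 525, 370, 250, 295, 330, 350, 415, 460])

ss_xy = np.sum((income - income.mean()) * (expenditure - expenditure.mean()))
ss_x = np.sum((income - income.mean()) ** 2)
ss_y = np.sum((expenditure - expenditure.mean()) ** 2)

r = ss_xy / (np.sqrt(ss_x) * np.sqrt(ss_y))
print(round(r, 4))                                        # 0.9727
print(round(np.corrcoef(income, expenditure)[0, 1], 4))   # same value from NumPy's Pearson correlation
```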

Problems for Section 3.5

LEARNING THE BASICS

3.36 The data are from a sample of n = 11 items:
X:  7   5   8   3   6  10  12   4   9  15  18
Y: 21  15  24   9  18  30  36  12  27  45  54
a. Calculate the covariance.
b. Calculate the coefficient of correlation.
c. How strong is the relationship between X and Y? Explain.

APPLYING THE CONCEPTS

Problems 3.37 to 3.40 can be solved manually or by using Microsoft Excel.

3.37 You are interested in the relationship between the number of people in a sales team and the sales generated, in a certain industry. These data show gross sales, measured in millions of dollars, and the number of people on a sales team.
Number of staff: 26  18  15  28  19  23  27  23  17  24
Sales:           45  38  35  77  33  44  54  55  32  47
a. Calculate the covariance and coefficient of correlation.
b. What conclusions can you reach about the relationship between the number of people in a sales team and the sales generated?

3.38 Use the data in problem 2.18 to investigate the relationship between petrol and diesel prices in New South Wales and in Queensland. < FUEL_2017 >
a. Calculate the covariance and coefficient of correlation for diesel and petrol prices in New South Wales.
b. Calculate the covariance and coefficient of correlation for diesel and petrol prices in Queensland.
c. What conclusions can you reach about the relationship between petrol and diesel prices in New South Wales and in Queensland?


3.39 A local council is interested in the relationship between the size of local restaurants, measured as number of seats, and their annual water usage, in kilolitres. From a random sample of 10 local restaurants the following information was obtained. < WATER2 >
Number of seats X:                  60   45   54   68   70   55   67   45   64   42
Annual water usage Y (kilolitres): 880  550  720  725  932  922  950  560  726  405
a. Construct a scatter diagram for the data and comment on any apparent relationship between restaurant size and annual water usage.
b. Calculate the sample covariance and coefficient of correlation. Are these values what you expected from the scatter diagram?
c. What conclusions can you reach about the relationship between restaurant size and annual water usage?

3.40 The data file < MILK > gives nutrition content (number of calories and total fat, in grams) per 250 mL of a random sample of 20 fresh milks available in Australia.
a. Calculate the covariance.
b. Calculate the coefficient of correlation.
c. Which do you think is more valuable in expressing the relationship between calories and fat content – the covariance or the coefficient of correlation? Explain.
d. What conclusions can you reach about the relationship between calories and fat content?

3.6  PITFALLS IN NUMERICAL DESCRIPTIVE MEASURES AND ETHICAL ISSUES

This chapter introduces sample statistics and population parameters that describe the centre, variation and shape of a distribution of a single numerical variable and also the association between two numerical variables. The next step is analysis and interpretation of the calculated statistics. While your analysis is objective, your interpretation is subjective. Be careful to avoid errors that may arise either in the objectivity of your analysis or in the subjectivity of your interpretation.

Analysis of expenditure data in the opening scenario is objective and reveals several impartial findings. Objectivity in data analysis means reporting the most appropriate descriptive summary measures for a given data set. Now that you have read this chapter and become familiar with various descriptive summary measures and their strengths and weaknesses, how should you proceed with an objective analysis? For example, from Figure 2.9 the amount spent during the festival by intrastate visitors is positively skewed, so shouldn't both the median and the mean be reported? Also, doesn't the standard deviation and/or interquartile range provide more information about the variation of amount spent than the range?

On the other hand, data interpretation is subjective. Different people form different conclusions when interpreting analytical findings. Everyone sees the world from different perspectives. Thus, because data interpretation is subjective, you must attempt to present your findings in a fair, neutral and transparent manner.

Ethical Issues

Ethical issues are vitally important to all data analysis. As a daily consumer of information, you need to question what you read in newspapers and magazines, what you hear on the radio or television, and what you see online. Over time, much scepticism has been expressed about the purpose, the focus and the objectivity of published studies. Perhaps no comment on this topic is more telling than a quip often attributed to the famous nineteenth-century British statesman Benjamin Disraeli: 'There are three kinds of lies: lies, damned lies, and statistics'.

Ethical considerations arise when you are deciding what results to include in a report. You should document both good and bad results. In addition, when making oral presentations and compiling written reports, you need to give results in a fair, objective and neutral manner.


Unethical behaviour occurs when you wilfully choose an inappropriate summary measure (e.g. the mean for a very skewed set of data) to distort the facts in order to support a particular position. It also occurs when you selectively fail to report pertinent findings because they would be detrimental to the support of a particular position. To illustrate this selective use of statistics, in 2009 an Australian newspaper, under the heading 'Nation of gamblers', stated:

Australian and New Zealand gamblers are the worst in the world, betting more money online than those of any other country…

According to the report from which these statistics were taken (R. T. Wood and R. J. Williams, 'Internet gambling: Prevalence, patterns, problems, and policy options', 5 January 2009), the mean net monthly gambling expenditure of the 19 Australian and New Zealand Internet gamblers in the sample (out of more than 12,000 respondents from 105 countries) was US$300.32, the second highest in the survey. However, the report gave the median net monthly gambling expenditure of this group as US$9.00 – the lowest.

Assess your progress

Summary

This chapter introduced numerical descriptive measures. This, and Chapter 2, covered descriptive statistics – how data are presented in tables and charts and then summarised, described, analysed and interpreted. When dealing with the opening scenario data, we were able to present useful information through the use of histograms and other graphical methods. Then characteristics of the expenditure data such as central tendency, variability and shape were explored, using numerical descriptive measures including the mean, median, quartiles, range and standard deviation. The covariance and coefficient of correlation were introduced to describe the relationship between two numerical variables. In the next chapter, the basic principles of probability are introduced to bridge the gap between descriptive statistics and inferential statistics.

Key formulas

Sample mean
$\bar{X} = \dfrac{\sum_{i=1}^{n} X_i}{n}$   (3.1)

Median
Median = $\dfrac{n+1}{2}$ ranked value   (3.2)

First quartile Q1
$Q_1 = \dfrac{n+1}{4}$ ranked value   (3.3)

Third quartile Q3
$Q_3 = \dfrac{3(n+1)}{4}$ ranked value   (3.4)

Geometric mean
$X_G = (X_1 \times X_2 \times \dots \times X_n)^{1/n}$   (3.5)



Key formulas 131

Geometric mean rate of return
$R_G = [(1 + R_1) \times (1 + R_2) \times \dots \times (1 + R_n)]^{1/n} - 1$   (3.6)

Range
Range = $X_{\text{largest}} - X_{\text{smallest}}$   (3.7)

Interquartile range
Interquartile range = $Q_3 - Q_1$   (3.8)

Sample variance
$S^2 = \dfrac{SS_X}{n-1} = \dfrac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}$   (3.9a) (definition)
$S^2 = \dfrac{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}{n-1} = \dfrac{\sum_{i=1}^{n} X_i^2 - \dfrac{\left(\sum_{i=1}^{n} X_i\right)^2}{n}}{n-1}$   (3.9b) (calculation)

Sample standard deviation
$S = \sqrt{S^2}$   (3.10)

Coefficient of variation
$CV = \dfrac{S}{\bar{X}} \times 100\%$   (3.11)

Z score
$Z = \dfrac{X - \bar{X}}{S}$   (3.12)

Population mean
$\mu = \dfrac{\sum_{i=1}^{N} X_i}{N}$   (3.13)

Population variance
$\sigma^2 = \dfrac{SS_X}{N} = \dfrac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$   (3.14a) (definition)
$\sigma^2 = \dfrac{\sum_{i=1}^{N} X_i^2 - N\mu^2}{N} = \dfrac{\sum_{i=1}^{N} X_i^2 - \dfrac{\left(\sum_{i=1}^{N} X_i\right)^2}{N}}{N}$   (3.14b) (calculation)

Population standard deviation
$\sigma = \sqrt{\sigma^2}$   (3.15)

Approximating the mean from a frequency distribution
$\bar{X} = \dfrac{\sum_{j=1}^{c} m_j f_j}{n}$   (3.16)

Approximating the standard deviation from a frequency distribution
$S = \sqrt{S^2}$   (3.17)
where
$S^2 = \dfrac{\sum_{j=1}^{c} (m_j - \bar{X})^2 f_j}{n-1} = \dfrac{\sum_{j=1}^{c} f_j m_j^2 - n\bar{X}^2}{n-1} = \dfrac{\sum_{j=1}^{c} f_j m_j^2 - \dfrac{\left(\sum_{j=1}^{c} m_j f_j\right)^2}{n}}{n-1}$

Sample covariance
$\text{cov}(X, Y) = \dfrac{SS_{XY}}{n-1} = \dfrac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n-1}$   (3.18a) (definition)
$\text{cov}(X, Y) = \dfrac{\sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y}}{n-1} = \dfrac{\sum_{i=1}^{n} X_i Y_i - \dfrac{\sum_{i=1}^{n} X_i \sum_{i=1}^{n} Y_i}{n}}{n-1}$   (3.18b) (calculation)

Sample coefficient of correlation
$r = \dfrac{SS_{XY}}{\sqrt{SS_X}\sqrt{SS_Y}}$   (3.19)
where
$SS_{XY} = \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}) = \sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y} = \sum_{i=1}^{n} X_i Y_i - \dfrac{\sum_{i=1}^{n} X_i \sum_{i=1}^{n} Y_i}{n}$
$SS_X = \sum_{i=1}^{n} (X_i - \bar{X})^2 = \sum_{i=1}^{n} X_i^2 - n\bar{X}^2 = \sum_{i=1}^{n} X_i^2 - \dfrac{\left(\sum_{i=1}^{n} X_i\right)^2}{n}$
$SS_Y = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} Y_i^2 - n\bar{Y}^2 = \sum_{i=1}^{n} Y_i^2 - \dfrac{\left(\sum_{i=1}^{n} Y_i\right)^2}{n}$


Key terms

arithmetic mean (mean) 92; bell-shaped 115; box-and-whisker plot 121; central tendency 92; Chebyshev rule 116; coefficient of correlation 125; coefficient of variation 105; covariance 123; empirical rule 115; extreme value (outlier) 106; first (lower) quartile 97; five-number summary 120; geometric mean 98; interquartile range 100; median 94; mode 96; population mean 113; population standard deviation 114; population variance 114; quartiles 96; range 99; resistant measures 101; sample coefficient of correlation 126; sample covariance 123; sample mean 93; sample standard deviation 102; sample variance 102; second quartile 97; shape 92; skewed 107; spread (dispersion) 99; standard deviation 101; sum of squares (SS) 101; symmetrical 107; third (upper) quartile 97; variance 101; variation 92; Z scores 106

Chapter review problems

CHECKING YOUR UNDERSTANDING
3.41 What is meant by a property of central tendency?
3.42 What are the differences between the mean, median and mode, and what are the advantages and disadvantages of each?
3.43 How do you interpret the first quartile, median and third quartile?
3.44 What is meant by the property of variation?
3.45 What does the Z score measure?
3.46 What are the differences between the various measures of variation such as the range, interquartile range, variance, standard deviation and coefficient of variation, and what are the advantages and disadvantages of each?
3.47 How do the empirical rule and the Chebyshev rule differ?

APPLYING THE CONCEPTS

You can solve problems 3.48 to 3.56 manually or by using Microsoft Excel.

3.48 A quality characteristic of interest for a tea-bag-filling process is the weight of the tea in the individual bags. If the bags are underfilled, two problems arise. First, customers may not be able to brew the tea as strong as they wish. Second, the company may be in violation of the truth-in-labelling laws. For this product, the label weight on the package indicates that, on average, there are 5.5 grams of tea in a bag. If the average amount of tea in a bag exceeds the label weight, the company is giving away product. Getting an exact amount of tea into a bag is problematic because of variation in the temperature and humidity inside the factory, differences in the density of the tea, and the extremely fast filling operation of the machine (approximately 170 bags a minute). The table below provides the weight in grams of a sample of 50 tea-bags produced within an hour by a single machine. < TEABAGS >

5.65 5.57 5.47 5.77 5.61

5.44 5.40 5.40 5.57 5.45

5.42 5.53 5.47 5.42 5.44

5.40 5.54 5.61 5.58 5.25

5.53 5.55 5.53 5.58 5.56

5.34 5.62 5.32 5.50 5.63

5.54 5.56 5.67 5.32 5.50

5.45 5.46 5.29 5.50 5.57

5.52 5.44 5.49 5.53 5.67

5.41 5.51 5.55 5.58 5.36

a. Calculate the mean, median, first quartile and third quartile. b. Calculate the range, interquartile range, variance, standard deviation and coefficient of variation. c. Interpret the measures of central tendency and variation within the context of this problem. Why should the company producing the tea-bags be concerned about the central tendency and variation? d. Construct a box-and-whisker plot. Are the data skewed? If so, how? e. Is the company meeting the requirement set forth on the label that, on average, there are 5.5 grams of tea in a bag? If you were in charge of this process, what changes, if any, would you try to make concerning the distribution of weights in the individual bags? 3.49 Use the data in problems 2.30 and 2.70 to investigate the distribution of petrol and diesel prices in New South Wales and Queensland. < FUEL_MARCH_2017 > a. Calculate the mean, median, first quartile and third quartile of New South Wales and Queensland petrol and diesel prices. What conclusions can you draw? b. Calculate the range, interquartile range, variance, standard deviation and coefficient of variation of New South Wales and Queensland petrol and diesel prices. What conclusions can you draw? c. Construct box-and-whisker plots for the data. Are the data skewed? What conclusions can you draw? d. Calculate the covariance and coefficient of correlation for diesel and petrol prices in New South Wales and Queensland. What conclusions can you reach about the


relationship between petrol and diesel prices in New South Wales and Queensland? 3.50 The data file < GRADES > contains a sample of student marks and grades from a population of students enrolled in a statistics unit. a. Calculate the mean, median, range and standard deviation for total marks. Interpret these measures of central tendency and variability. b. List the five-number summary for total marks. c. For total marks, construct and interpret a box-and-whisker plot. d. Ignoring students who did not attempt the final exam, calculate the covariance and coefficient of correlation for semester and exam marks. e. What conclusions can you reach about the relationship between a student’s semester and exam marks? 3.51 The file < AGE > contains the ages and gender of the Australian population at 30 June 2013 and 2016. a. Calculate the approximate mean age and the approximate standard deviation of age for males and females at 30 June 2013 and 2016. b. What conclusions can you draw about male and female ages in 2013 and 2016? 3.52 In many manufacturing processes the term ‘work-in-process’ (WIP) is used. In a book-manufacturing plant the WIP represents the time it takes for sheets from a press to be folded, gathered, sewn, tipped on end sheets and bound. The following data represent samples of 20 books at each of two production plants and the processing time (operationally defined as the time in days from when the books came off the press to when they were packed in cartons) for these jobs. < WIP > Plant A 5.62 5.29 16.25 10.92 11.46 21.62 11.62 7.29 7.50 7.96 4.42 10.50

8.45 8.58 5.41 11.42 7.58 9.29 7.54 8.92

Plant B 9.54 11.46 16.62 12.62 25.75 15.41 14.29 13.13 13.71 10.04 5.75 12.46 9.17 13.21 6.00 2.33 14.25 5.37 6.25 9.71



For each of the two plants: a. Calculate the mean, median, first quartile and third quartile. b. Calculate the range, interquartile range, variance, standard deviation and coefficient of variation. c. Construct a box-and-whisker plot. Are the data skewed? If so, how? d. On the basis of the results of (a) to (c), are there any differences between the two plants? Explain. 3.53 Water_Wise is analysing water usage for a block of onebedroom flats. They collect data on daily water consumption in kilolitres (kl) for 133 consecutive days. < WATER > Explore the daily water usage in this block of flats by: a. plotting the data graphically b. calculating the summary statistics c. commenting on the graphs and the summary statistics.

3.54 In this problem you are asked to select an appropriate value for the standard deviation, based on your knowledge of how these variables vary. a. From a sample of 30 petrol stations, the mean price of E10 petrol is $1.56 per litre. Which of the following is a reasonable value for the corresponding standard deviation of prices: $0.03, $3.00 or $30.00? b. The mean starting salary of a sample of 50 recent graduates is $65,200. Which of the following is a reasonable value for the standard deviation of starting salaries: $5, $50 or $5,000? c. The mean weight of a sample of 100 male university students is 70 kg. Which of the following is a reasonable value for the standard deviation of weights: 0.5 kg, 10 kg or 50 kg? 3.55 The following table gives the annual increase in the Consumer Price Index (CPI), a measure of inflation in Australia and New Zealand.

              CPI % annual change
Year to       Australia    New Zealand
Dec 2012      2.2          0.9
Dec 2013      2.7          1.6
Dec 2014      1.7          0.8
Dec 2015      1.7          0.1
Dec 2016      1.5          1.3

Data obtained from Reserve Bank of Australia and Reserve Bank of New Zealand, accessed June 2017



For each country: a. Calculate the geometric mean inflation rate from 2012 to 2016. b. What conclusions can you draw about the inflation rate in New Zealand and Australia? 3.56 Naturally Soap (see problem 3.22) is interested in exploring the relationship between the price and the quantity sold at each market. < NATURALLY_SOAP > For the Sunday morning and Wednesday evening markets, calculate and interpret the coefficient of correlation between weekly quantity sold and price. 3.57 You are planning to study for your statistics examination with a group of classmates, one of whom you particularly want to impress. This individual has volunteered to use Microsoft Excel to get the needed summary information, tables and charts for a data set containing several numerical and categorical variables assigned by your lecturer for study purposes. This person comes over to you with the printout and exclaims, ‘I’ve got it all – the means, the medians, the standard deviations, the box-and-whisker plots, the pie charts – for all our variables. The problem is, some of the output looks weird – like the box-andwhisker plots for gender and for major, and the pie charts for grade point average and for height. Also, I can’t understand


why Professor Krehbiel said we can’t get the descriptive statistics for some of the variables – I got it for everything! See, the mean for height is 1.78, the grade point average is 2.76, the mean for gender is 1.50 and the mean for major is 4.33.’ What is your reply?

REPORT WRITING EXERCISE 3.58 The data in the file < BEER > give the alcohol and calorie content of a sample of 95 beers, together with country of origin and type.



Your task is to write a report based on a complete descriptive evaluation of each of the numerical variables – calories and alcohol content – regardless of type of product or origin. Then perform a similar evaluation comparing each of these numerical variables based on type of product – regular, light or non-alcoholic beers. In addition, perform a similar evaluation comparing and contrasting each of these numerical variables based on the origins of the beers – those of a selected country or continent versus those from elsewhere. Appended to your report should be all appropriate tables, charts and numerical descriptive measures.

Continuing cases Tasman University Tasman University’s Tasman Business School (TBS) regularly surveys business students on a number of issues. In particular, students within the school are asked to complete a student survey when they receive their grades each semester. The results of Bachelor of Business (BBus) and Master of Business Administration (MBA) students who responded to the latest undergraduate (UG) and postgraduate (PG) student surveys are stored in < TASMAN_UNIVERSITY_BBUS_STUDENT_SURVEY > and < TASMAN_UNIVERSITY_MBA_STUDENT_SURVEY >.

Copies of the survey questions are stored in Tasman University Undergraduate BBus Student Survey and Tasman University Postgraduate MBA Student Survey. a For a selection of numerical variables in the BBus student survey, calculate appropriate descriptive statistics. b For a selection of numerical variables in the MBA student survey, calculate appropriate descriptive statistics. c Write a report summarising your conclusions.

As Safe as Houses To analyse the real estate market in non-capital cities and towns in states A and B, Safe-As-Houses Real Estate, a large national real estate company, has collected samples of recent residential sales from a sample of non-capital cities and towns in these states. These data are stored in < REAL_ESTATE >. a For a selection of numerical variables for regional city 1 state A, calculate appropriate descriptive statistics. b For a selection of numerical variables for coastal city 1 state A, calculate appropriate descriptive statistics. c Write a report summarising your conclusions. d Repeat (a) to (c) for another pair of non-capital cities or towns in state A and/or state B.


Chapter 3 Excel Guide

EG3.1 MEASURES OF CENTRAL TENDENCY, VARIATION AND SHAPE

CENTRAL TENDENCY
The Mean, Median and Mode
Key technique Use the AVERAGE(variable cell range), MEDIAN(variable cell range), and MODE(variable cell range) functions to calculate these measures.
Example Calculate the mean, median and mode for the sample of getting-ready times introduced in Section 3.1.
PHStat Use Descriptive Summary. For the example, open the Times file. Select PHStat ➔ Descriptive Statistics ➔ Descriptive Summary. In the Descriptive Summary dialog box (shown in Figure EG3.1):
Figure EG3.1 Descriptive Summary dialog box
1. Enter or highlight cells A1:A11 as the Raw Data Cell Range and check First cell contains label.
2. Click Single Group Variable.
3. Enter a Title and click OK.
PHStat inserts a new worksheet that contains various measures of central tendency, variation and shape discussed in Section 3.1. This worksheet is similar to the CompleteStatistics worksheet of the Descriptive workbook.

In-depth Excel Use the CentralTendency worksheet of the Descriptive workbook as a model. For the example, open the Times file and insert a new worksheet (right-click on tab ➔ Insert ➔ Worksheet) and:
1. Enter a title in cell A1.
2. Enter Get-Ready Times in cell B3, Mean in cell A4, Median in cell A5, and Mode in cell A6.
3. Enter the formula =AVERAGE(DATA!A:A) in cell B4, the formula =MEDIAN(DATA!A:A) in cell B5, and the formula =MODE(DATA!A:A) in cell B6.
For these functions, the variable cell range includes the name of the DATA worksheet because the data being summarised appear on the separate DATA worksheet. If you suspect that there may be more than one mode, highlight several cells, say B7:G7, enter =TRANSPOSE(MODE.MULT(DATA!A:A)) and then press Ctrl+Shift+Enter. See the Central_Tendency workbook, which gives the two modes for the times to get ready. To calculate the mean, median and mode for another set of data, paste the data into column A of the DATA worksheet, overwriting the existing getting-ready times.

Analysis ToolPak Use Descriptive Statistics. For the example, open to the Times file and:
1. Select Data ➔ Data Analysis.
2. In the Data Analysis dialog box, select Descriptive Statistics from the Analysis Tools list and then click OK.
In the Descriptive Statistics dialog box (shown in Figure EG3.2):
Figure EG3.2 Descriptive Statistics dialog box
3. Enter or highlight cells A1:A11 as the Input Range. Click Columns and check Labels in first row.
4. Click New Worksheet Ply and check Summary statistics, Kth Largest, and Kth Smallest.
5. Click OK.

The ToolPak inserts a new worksheet that contains various measures of central tendency, variation, and shape discussed in Section 3.1.


Quartiles
Key technique Use the MEDIAN, COUNT, SMALL, INT, FLOOR, and CEILING functions in combination with the IF decision-making function to calculate the quartiles. To apply the rules for calculating quartiles on page 97, avoid using any of the Excel quartile functions to calculate the first and third quartiles.
Example Calculate the quartiles for the sample of getting-ready times introduced in Section 3.1.
PHStat Use Boxplot (discussed on page 137).
In-depth Excel Use the COMPUTE worksheet of the Quartiles workbook as a model. For the example, the COMPUTE worksheet already calculates the quartiles for the getting-ready times. To calculate the quartiles for another set of data, paste the data into column A of the DATA worksheet, overwriting the existing getting-ready times. Open to the COMPUTE_FORMULAS worksheet to examine the formulas. The COMPARE worksheet compares the quartiles obtained using the Section 3.1 rules for quartiles and the Excel quartile functions: QUARTILE(array, quart), QUARTILE.INC(array, quart) and QUARTILE.EXC(array, quart).

The Geometric Mean
Key technique Use the GEOMEAN((1 + R1), (1 + R2), … , (1 + Rn)) - 1 function to calculate the geometric mean rate of return.
Example Calculate the geometric mean rate of return in the NZX-50 Index for the five years as shown in Example 3.7 on page 99.
In-depth Excel Enter the formula =GEOMEAN(1+0.24, 1+0.16, 1+0.18, 1+0.14, 1+0.10)-1 in any cell.
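If you want to verify the worksheet result by hand, here is a minimal Python sketch (illustrative only, not part of the text's Excel workflow) of the same calculation, using the five yearly returns that appear in the GEOMEAN formula above and equation (3.6).

returns = [0.24, 0.16, 0.18, 0.14, 0.10]        # the returns used in the worksheet formula above

growth = 1.0
for r in returns:
    growth *= (1 + r)                           # accumulate (1 + R1)(1 + R2)...(1 + Rn)

geometric_mean_return = growth ** (1 / len(returns)) - 1    # equation (3.6)
print(round(geometric_mean_return, 4))          # approximately 0.1631, i.e. about 16.3% per year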

VARIATION AND SHAPE
The Range
Key technique Use the MIN(variable cell range) and MAX(variable cell range) functions to help calculate the range.
Example Calculate the range for the sample of getting-ready times introduced in Section 3.1.
PHStat Use Descriptive Summary as discussed earlier.
In-depth Excel Use the Range worksheet of the Descriptive workbook as a model. For the example, open the worksheet implemented for the example in the In-depth Excel 'The Mean, Median, and Mode' instructions. Enter Minimum in cell A7, Maximum in cell A8, and Range in cell A9. Enter the formula =MIN(DATA!A:A) in cell B7, the formula =MAX(DATA!A:A) in cell B8, and the formula =B8-B7 in cell B9.
Analysis ToolPak Use Descriptive Statistics as discussed earlier.

The Interquartile Range
Key technique Use a formula to subtract the first quartile from the third quartile.
Example Calculate the interquartile range for the sample of getting-ready times introduced in Section 3.1.
In-depth Excel Use the COMPUTE worksheet of the Quartiles workbook (introduced earlier) as a model. For the example, the interquartile range is already calculated in cell B19 using the formula =B18-B16.

The Variance, Standard Deviation, Coefficient of Variation and Z Scores
Key technique Use the VAR.S(variable cell range) and STDEV.S(variable cell range) functions to calculate the sample variance and the sample standard deviation, respectively. Use the AVERAGE and STDEV.S functions for the coefficient of variation. Use the STANDARDIZE(value, mean, standard deviation) function to calculate Z scores.
Example Calculate the variance, standard deviation, coefficient of variation, and Z scores for the sample of getting-ready times introduced in Section 3.1.
PHStat Use Descriptive Summary as discussed earlier.
In-depth Excel Use the Variation and ZScores worksheets of the Descriptive workbook as models. For the example, open to the worksheet implemented for the earlier examples. Enter Variance in cell A10,


Standard Deviation in cell A11 and Coeff. of Variation in cell A12. Enter the formula =VAR.S(DATA!A:A) in cell B10, the formula =STDEV.S(DATA!A:A) in cell B11, and the formula =B11/AVERAGE(DATA!A:A) in cell B12. If you previously entered the formula for the mean in cell B4 using the In-depth Excel instructions for the mean, enter the simpler formula =B11/B4 in cell B12. Right-click cell B12 and click Format Cells in the shortcut menu. In the Number tab of the Format Cells dialog box, click Percentage in the Category list, enter 2 as the Decimal places, and click OK. To calculate the Z scores, copy the DATA worksheet. In the new, copied worksheet, enter Z Score in cell B1. Enter the formula =STANDARDIZE(A2, AVERAGE(A:A), STDEV.S(A:A)) in cell B2 and copy the formula down to row 11. Then format cells B2 to B11 to the required number of decimal places. If you use an Excel version older than Excel 2010, use VAR and STDEV instead of VAR.S and STDEV.S.
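As a cross-check on the STANDARDIZE approach, the short Python sketch below (illustrative only; the times are hypothetical, not necessarily the Section 3.1 data) applies equation (3.12) and flags any value whose Z score exceeds 3 in absolute value as a possible extreme value.

import statistics

times = [29, 31, 35, 39, 39, 40, 43, 44, 44, 52]      # hypothetical getting-ready times
mean = statistics.mean(times)
s = statistics.stdev(times)                           # sample standard deviation, as STDEV.S computes

z_scores = [(x - mean) / s for x in times]            # equation (3.12)
extreme = [x for x, z in zip(times, z_scores) if abs(z) > 3.0]

print([round(z, 2) for z in z_scores])
print("possible extreme values:", extreme)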

Analysis ToolPak Use Descriptive Statistics as discussed earlier. This procedure does not calculate Z scores.

Shape: Skewness and Kurtosis
Key technique Use the SKEW(variable cell range) and the KURT(variable cell range) functions to calculate these measures.
Example Calculate the skewness and kurtosis for the sample of getting-ready times introduced in Section 3.1.
PHStat Use Descriptive Summary as discussed earlier.
In-depth Excel Use the Shape worksheet of the Descriptive workbook as a model. For the example, open to the worksheet implemented for the earlier examples. Enter Skewness in cell A13 and Kurtosis in cell A14. Enter the formula =SKEW(DATA!A:A) in cell B13 and the formula =KURT(DATA!A:A) in cell B14. Then format cells B13 and B14 to four decimal places.
Analysis ToolPak Use Descriptive Statistics as discussed earlier.

EG3.2 NUMERICAL DESCRIPTIVE MEASURES FOR A POPULATION

The Population Mean, Population Variance and Population Standard Deviation
Key technique Use AVERAGE(variable cell range), VAR.P(variable cell range), and STDEV.P(variable cell range) to calculate these measures.
Example Calculate the population mean, population variance and population standard deviation for the road fatality population data of Table 3.3 on page 113.
In-depth Excel Use the Parameters workbook as a model. For the example, the COMPUTE worksheet of the Parameters workbook already calculates the three population parameters for the road fatality data. For other problems, paste your unsummarised data into column B of the DATA worksheet, overwriting the road fatality data. If you use an Excel version older than Excel 2010, use the COMPUTE_OLDER worksheet.

The Empirical Rule and the Chebyshev Rule
Use the COMPUTE worksheet of the VE_Variability workbook to explore the effects of changing the mean and standard deviation on the ranges associated with ±1 standard deviation, ±2 standard deviations, and ±3 standard deviations from the mean. Change the mean in cell B4 and the standard deviation in cell B5 and then note the updated results in rows 9 to 11.
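The same exploration can be scripted. The sketch below (a hypothetical mean and standard deviation, not the road fatality data) prints the intervals within ±1, ±2 and ±3 standard deviations of the mean, together with the Chebyshev lower bound of at least 1 − 1/k² of values for k standard deviations.

mu, sigma = 100.0, 15.0          # hypothetical population mean and standard deviation

for k in (1, 2, 3):
    lower, upper = mu - k * sigma, mu + k * sigma
    bound = 1 - 1 / k ** 2       # Chebyshev bound; for k = 1 this is 0%, i.e. uninformative
    print(f"±{k} SD: {lower:.1f} to {upper:.1f}; Chebyshev bound: at least {bound:.0%}")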

EG3.3 FIVE-NUMBER SUMMARY AND BOX-AND-WHISKER PLOTS

Key technique Plot a series of line segments on the same chart to construct a boxplot. (Excel chart types do not include boxplots.)
Example Calculate the five-number summary and construct the boxplots for festival expenditure by international visitors in Figure 3.5.
PHStat Use Boxplot. For the example, open the Festival file. Select PHStat ➔ Descriptive Statistics ➔ Boxplot. In the Boxplot dialog box (shown in Figure EG3.3):
1. Enter or highlight C2:C14 as the Raw Data Cell Range and check First cell contains label.
2. Click Single Group Variable.
3. Enter a Title, check Five-Number Summary, and click OK.
The boxplot appears on its own chart sheet, separate from the worksheet that contains the five-number summary.
In-depth Excel Use the worksheets of the Boxplot workbook as templates. For the example, use the PLOT_DATA worksheet, which already shows the five-number summary and boxplot for festival expenditure by international visitors.


Figure EG3.3 Boxplot dialog box

For other problems, use the PLOT_SUMMARY worksheet as the template if the five-number summary has already been determined; otherwise, paste your unsummarised data into column A of the DATA worksheet and use the PLOT_DATA worksheet as was done for the example. The worksheets creatively misuse Excel line-charting features to construct a boxplot.
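As an alternative to the Excel line-segment construction, the sketch below (illustrative only; it assumes the matplotlib library is available, and the expenditure values are hypothetical rather than the Festival data) computes a five-number summary for a data set whose quartile positions are integers and draws a boxplot. Note that matplotlib's own quartile method may differ slightly from the Section 3.1 rules.

import matplotlib.pyplot as plt

expenditure = [220, 350, 410, 470, 520, 560, 610, 680, 740, 910, 1400]   # hypothetical, n = 11

ranked = sorted(expenditure)
n = len(ranked)
# With n = 11, the positions (n+1)/4, (n+1)/2 and 3(n+1)/4 are all integers.
five_number = (ranked[0],                        # smallest value
               ranked[(n + 1) // 4 - 1],         # first quartile
               ranked[(n + 1) // 2 - 1],         # median
               ranked[3 * (n + 1) // 4 - 1],     # third quartile
               ranked[-1])                       # largest value
print("five-number summary:", five_number)

plt.boxplot(expenditure, vert=False)             # horizontal box-and-whisker plot
plt.xlabel("Expenditure ($)")
plt.title("Box-and-whisker plot")
plt.show()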

EG3.4 THE COVARIANCE AND THE COEFFICIENT OF CORRELATION

The Covariance
Key technique Use the COVARIANCE.S(variable 1 cell range, variable 2 cell range) function to calculate this measure.
Example Calculate the sample covariance for discretionary income and expenditure data, Example 3.19.
In-depth Excel Use the Covariance workbook as a model. For the example, the discretionary income and expenditure data have already been placed in columns A and B of the DATA worksheet and the COMPUTE worksheet displays the calculated covariance in cell B9. For other problems, paste the data for two variables into columns A and B of the DATA worksheet, overwriting the discretionary income and expenditure data. If you use an Excel version older than Excel 2010, use the COMPUTE_OLDER worksheet that calculates the covariance without using the COVARIANCE.S function that was introduced in Excel 2010.

The Coefficient of Correlation
Key technique Use the CORREL(variable 1 cell range, variable 2 cell range) function to calculate this measure.
Example Calculate the coefficient of correlation for discretionary income and expenditure data in Example 3.19.
In-depth Excel Use the Correlation workbook as a model. For the example, the discretionary income and expenditure data have already been placed in columns A and B of the DATA worksheet and the COMPUTE worksheet displays the coefficient of correlation in cell B14. For other problems, paste the data for two variables into columns A and B of the DATA worksheet, overwriting the revenue and value data. The COMPUTE worksheet uses the COVARIANCE.S function to calculate the covariance (see the previous section) and also the DEVSQ, COUNT, and SUMPRODUCT functions. Open the COMPUTE_FORMULAS worksheet to examine the use of all these functions.
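To see what COVARIANCE.S and CORREL are computing, the following Python sketch (illustrative only, with hypothetical values rather than the Example 3.19 data) applies equations (3.18a) and (3.19) directly.

import math

x = [10.0, 12.0, 15.0, 18.0, 20.0]       # hypothetical discretionary income values
y = [3.0, 3.5, 4.5, 5.5, 6.0]            # hypothetical expenditure values
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

ssxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
ssx = sum((xi - x_bar) ** 2 for xi in x)
ssy = sum((yi - y_bar) ** 2 for yi in y)

covariance = ssxy / (n - 1)              # sample covariance, equation (3.18a); matches COVARIANCE.S
r = ssxy / math.sqrt(ssx * ssy)          # coefficient of correlation, equation (3.19); matches CORREL
print(covariance, r)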

Microsoft® product screen shots are reprinted with permission from Microsoft Corporation.


End of Part 1 problems

A.1 A sample of 500 shoppers was selected in a large metropolitan area to obtain consumer behaviour information. Among the questions asked was, 'Do you enjoy shopping for clothing?' The results are summarised in the following cross-classification table.

Enjoy shopping            Gender
for clothing       Male      Female     Total
Yes                136       224        360
No                 104       36         140
Total              240       260        500

a. Construct contingency tables based on total percentages, row percentages and column percentages.
b. Construct a side-by-side bar chart of enjoy shopping for clothing based on gender.
c. What conclusions do you draw from these analyses?

A.2 One of the major measures of the quality of service provided by any organisation is the speed with which the organisation responds to customer complaints. A large family-owned department store selling furniture and flooring, including carpet, has undergone major expansion in the past few years. In particular, the flooring department has expanded from two installation crews to an installation supervisor, a measurer and 15 installation crews. During a recent year the company got 50 complaints about carpet installation. The following data represent the number of days between receipt of the complaint and resolution of the complaint. < FURNITURE >

54  5   35   137  31   27   152  2   123  81  74  27
11  19  126  110  110  29   61   35  94   31  26  5
12  4   165  32   29   28   29   26  25   1   14  13
13  10  5    27   4    52   30   22  36   26  20  23
33  68

a. Construct frequency and percentage distributions.
b. Construct histogram and percentage polygons.
c. Construct a cumulative percentage distribution and plot the corresponding ogive.
d. Calculate the mean, median, first quartile and third quartile.
e. Calculate the range, interquartile range, variance, standard deviation and coefficient of variation.
f. Construct a box-and-whisker plot. Are the data skewed? If so, how?
g. On the basis of the results of (a) to (f), if you had to report to the manager on how long a customer should expect to wait to have a complaint resolved, what would you say? Explain.

A.3 The annual crediting rates (after tax and fees) on several managed superannuation investment funds between 2013 and 2017 are:

                        Historical crediting rate for year ending 30 June, %
Superannuation fund     2017    2016    2015    2014    2013
Conservative            5.5     8.7     9.0     11.3    12.3
Balanced                9.5     5.2     10.7    14.1    15.9
Growth                  11.8    3.8     11.3    15.6    18.7
High growth             13.7    3.1     12.3    17.4    20.5

a. For each fund, calculate the geometric rate of return for three years (2015 to 2017) and for five years (2013 to 2017).
b. What conclusions can you reach concerning the geometric rates of return for the funds?

A.4 A supplier of 'Natural Australian' spring water states that the magnesium content is 1.6 mg/L. To check this, the quality control department takes a random sample of 96 bottles during a day's production and obtains the magnesium content. < SPRING_WATER1 >
a. Construct frequency and percentage distributions.
b. Construct a histogram and a percentage polygon.
c. Construct a cumulative percentage distribution and plot the corresponding ogive.
d. Calculate the mean, median, mode, first quartile and third quartile.
e. Calculate the variance, standard deviation, range, interquartile range and coefficient of variation.
f. Construct and interpret a box-and-whisker plot.
g. What conclusions can you reach concerning the magnesium content of this day's production?

A.5 The National Australia Bank (NAB) produces regular reports titled NAB Online Retail Sales Index. Download the latest in-depth report.
a. Give an example of a categorical variable found in the report.
b. Give an example of a numerical variable found in the report.
c. Is the variable you selected in (b) discrete or continuous?

A.6 The data in the file < WEBSTATS > represent the number of times during August and September that a sample of 50 students accessed the website of a statistics unit they were enrolled in.
a. Construct ordered arrays for August and September.
b. Construct stem-and-leaf displays for August and September.
c. Construct frequency, percentage and cumulative distributions for August and September.


d. Plot frequency histograms as separate graphs; plot percentage polygons on the same graph. e. Plot cumulative percentage polygons on the same graph. f. Calculate the mean, median, mode, first quartile and third quartile. g. Calculate the variance, standard deviation, range, interquartile range and coefficient of variation. h. Based on the results of (a) to (g), what conclusions can you reach about the number of times a student accesses the website each month? A.7 In problem A.6 sample statistics were calculated from data representing the number of times, during August and September, a sample of 50 students accessed the website of a statistics unit they were enrolled in. < WEBSTATS > For each month (August and September): a. List the five-number summary. b. Construct the box-and-whisker plot. c. Discuss the distribution of the number of times a student accesses the website each month. A.8 The data stored in data file < WEBSTATS > classify the number of times, during August and September, that a sample of 50 students accessed a statistics unit website by day and time. a. Construct appropriate tables and/or charts to investigate the day of the week and the time that students access the website. b. What conclusions can you draw about the pattern of web access for the two months? c. When would you post an announcement, so that the maximum number of students would read it? A.9 The data in the file < NZ_CAR_SALES_16_17 > are of sales of new cars in New Zealand for February 2016 and 2017 (data obtained from Motor Industry Association of New Zealand accessed 27 March 2017). For each year, ignoring the other category: a. Calculate the mean, variance and standard deviation for the population of the 20 top-selling makes of car. b. What proportion of the makes have sales within ±1, ±2 and ±3 standard deviations of the mean? c. Compare and contrast your findings with what would be expected based on the empirical rule or on the Chebyshev rule. A.10 The data below represent the distribution of the ages of employees in two different divisions of a publishing company.

Age of employees (years)   Division A frequency   Division B frequency
20–under 30                8                      15
30–under 40                17                     32
40–under 50                11                     20
50–under 60                8                      4
60–under 70                2                      0



For each of the two divisions (A and B), approximate the a. mean. b. standard deviation. c. On the basis of the results of (a) and (b), do you think there are differences in the age distribution between the two divisions? Explain. A.11 For each of the following variables, determine whether the variable is categorical or numerical. If the variable is numerical, determine whether it is discrete or continuous. In addition, determine the level of measurement. a. Amount of money spent on clothing in the last month b. Favourite department store c. Most likely time period during which shopping for clothing takes place (weekday, weeknight, weekend) d. Number of pairs of jeans owned A.12 The file < CURRENCY > contains the monthly closing exchange rates for the New Zealand dollar (NZD), the Japanese yen (JPY), the United States dollar (USD) and the Chinese renminbi (CNY) from January 2010 to May 2017, where each currency is expressed in units per Australian dollar (data obtained from Reserve Bank of Australia accessed 1 June 2017). a. Construct time-series plots for the monthly closing values of each currency. b. Explain any patterns present in the plots. c. Construct separate scatter plots of the value of pairs of these currencies. d. Calculate the correlation coefficient for pairs of currencies. e. What conclusions can you reach concerning the value of these currencies in terms of the Australian dollar? f. Obtain current exchange rates from Reserve Bank of Australia or elsewhere for either these currencies or alternative currencies. Then repeat parts (a) to (e). A.13 The table below classifies the academic staff of a small regional university by gender and level. < ACADEMIC_STAFF >

Level                  Average salary   Gender: Female   Gender: Male   Total
Professor              $172,500         13               21             34
Associate professor    $147,600         16               24             40
Senior lecturer        $128,500         37               52             89
Lecturer               $108,200         74               58             132
Associate lecturer     $86,500          23               13             36
Total                                   163              168            331

a. Illustrate these data by constructing appropriate tables and graphs. b. What can you conclude about gender and level for academic staff at this university? c. Estimate the mean and standard deviation of academic salaries.


d. Estimate the annual expenditure on academic salaries for this university. e. Estimate the mean and standard deviation of male and female academic salaries. f. Comment on the difference in male and female academic salaries at this university. A.14 To test the effectiveness of mail X-ray screening in identifying potential illegal or threatening items a mail centre X-rays a random sample of 500 packages and then independently searches each package. The results of this test are given below.

                       X-ray items identified
Search items found     Yes      No       Total
Yes                    36       12       48
No                     14       438      452
Total                  50       450      500

a. Illustrate these data by constructing appropriate tables and graphs.
b. Do you feel that X-ray screening is effective in identifying items of interest?

A.15 The following data represent the amount of soft drink filled in a sample of 50 consecutive 2-litre bottles. The results are listed horizontally in the order filled. < DRINK >

2.109 2.086 2.066 2.075 2.065 2.057 2.052 2.044
2.036 2.038 2.031 2.029 2.025 2.029 2.023 2.020
2.015 2.014 2.013 2.014 2.012 2.012 2.012 2.010
2.005 2.003 1.999 1.996 1.997 1.992 1.994 1.986
1.984 1.981 1.973 1.975 1.971 1.969 1.966 1.967
1.963 1.957 1.951 1.951 1.947 1.941 1.941 1.938
1.908 1.894

a. Construct a frequency distribution and a percentage distribution.
b. Plot a histogram and a percentage polygon.
c. Form a cumulative percentage distribution and plot the corresponding cumulative percentage polygon.
d. On the basis of the results of (a) to (c), does the amount of soft drink in the bottles concentrate around specific values?
e. Construct a time-series plot with the amount of soft drink on the vertical axis and the bottles' numbers (from 1 to 50) on the horizontal axis.
f. What pattern, if any, is present in the data?
g. If you had to make a prediction of the amount of soft drink in the next bottle, what would you predict?
h. Based on the results of (e) to (g), explain why it is important to construct a time-series plot and not just a histogram, as was done in part (b).

A.16 Comment on the following graph, which appeared in the Northern Star in August 2008. Headed 'How does your lender compare?', it compared standard variable home loan rates on a loan of $150,000 over 25 years (rates current at 23 May 2008; data obtained from InfoChoice): BCU 9.15% pa (9.18% pa comparison rate), Newcastle Permanent 9.41% pa, Commonwealth Bank 9.44% pa, National Australia Bank 9.46% pa, ANZ 9.47% pa, Suncorp 9.47% pa, St George 9.47% pa and Westpac 9.47% pa.

A.17 The following table gives the results on food groups never eaten from a national study of 10,000 men and 10,000 women aged at least 50. < FOOD >

Foods never eaten               Men      Women    Total
Cheese                          236      219      455
Cream                           623      917      1,540
Dairy products                  131      196      327
Eggs                            175      279      454
Fish                            123      266      389
Seafood                         166      268      434
Any meat                        111      353      464
Chicken/Poultry                 126      368      494
Pork/Ham                        234      495      729
Red meat                        159      247      406
Sugar                           1,095    897      1,992
Wheat products                  187      380      567
Eat all foods                   7,299    7,878    15,177
Total number of respondents     10,000   10,000   20,000

a. For men and women, separately and combined, construct percentage summary tables and bar charts for the data. b. What conclusions can you draw about the diet of the participants in the study? c. Why would a pie chart not be appropriate for these data? A.18 The data in < PROBLEMS > are random samples of the time (in minutes) taken to resolve 40 problems reported by students and 40 problems reported by staff to the Technology Services (TS) Service Desk at Tasman University. For each sample: a. Construct appropriate tables and/or charts to investigate the time it takes the TS Service Desk to resolve problems. b. Calculate the mean, median and quartiles. c. Calculate the range, interquartile range, variance, standard deviation and coefficient of variation. d. Construct a box-and-whisker plot. Are the data skewed? If so, how? e. On the basis of the results of (a) to (d), are there any differences between the time to resolve TS problems for staff and for students? Explain.


A.19 If two students receive a mark of 90 on the same examination, what arguments could be used to show that the underlying variable – test score – is continuous? A.20 The call centre supervisor of the IT helpdesk of a large university is monitoring the performance of the technical support staff. The data in the file < HELP_DESK > give the number of calls resolved during a random sample of 20 eighthour shifts by five support staff. a. For each staff member, construct frequency, percentage and cumulative distributions. b. For each staff member, construct a histogram. c. On the same graph, construct percentage polygons for all staff members. d. On the same graph, construct ogives for all staff members. e. For each staff member, calculate the mean, median, mode, first quartile and third quartile. f. For each staff member, calculate the variance, standard deviation, range, interquartile range and coefficient of variation. g. On the same graph, construct and interpret a box-andwhisker plot for each staff member. h. What conclusions can you reach concerning the number of resolved calls? A.21 The file contains the ages and gender of the Australian population at 30 June 2013 and 2016. a. Construct percentage and cumulative percentage distributions for the age of males, females and the entire Australian population in 2013 and 2016. b. Construct and interpret appropriate graphs to investigate the age distribution of males and females, separately and combined, and how it is changing. c. Calculate the approximate mean age and approximate standard deviation of age for the entire Australian population. A.22 One operation of a mill is to cut pieces of steel into parts that will later be used as the frame for front seats in a car. The steel is cut with a diamond saw and the resulting parts must be within ±0.125 mm of the length specified by the car manufacturer. The data in < STEEL > come from a sample of 100 steel parts. The measurement reported is the difference in millimetres between the actual length of the steel part, as measured by a laser measurement device, and the specified length of the steel part. For example, the data value –0.05 represents a steel part that is 0.05 mm shorter than the specified length. a. Construct a frequency distribution and a percentage distribution. b. Plot the corresponding histogram and percentage polygon. c. Plot the corresponding cumulative percentage polygon. d. Is the steel mill doing a good job in meeting the requirements set by the car manufacturer? Explain.

A.23 For the previous year a large confectionary chain, Sweets-4-U, is interested in analysing the quantity sold weekly, including associated cost data, of two of its popular products, ‘Forgive’ and ‘Rejoice’. These products, both wrapped chocolates sold by weight, differ only in the message attached to each chocolate. Forgive chocolates contain messages ‘Sorry’, ‘Forgive Me’, ‘Trust Me’ and similar, while the messages attached to Rejoice chocolates are ‘Celebrate’, ‘Have Fun’, ‘I Love You’ and similar. < SWEETS_4_U > For Forgive chocolates quantity sold data, construct and interpret: a. a stem-and-leaf display b. frequency, percentage and cumulative distributions c. a frequency histogram, percentage polygon and ogive d. a scatter diagram quantity sold and total cost. For each product: e. Calculate the mean, variance and standard deviation of the weekly quantity sold for the year. f. What conclusions can you make about the weekly quantity sold for each product? g. Use the empirical rule or the Chebyshev rule, whichever is appropriate, to explain further the variation in the weekly quantity sold. h. Using the results in (g), are there any outliers? Explain. i. Calculate and interpret the coefficient of correlation between weekly quantity sold and the associated costs. Also calculate and interpret the coefficient of correlation between the weekly quantity sold of Rejoice and Forgive chocolates. j. Construct time-series plots to investigate any pattern in weekly sales over the year. What conclusions can you make about the pattern of weekly sales for the products? A.24 Several hundred laboratory tests are performed at a large hospital each day. The rate at which these tests are done improperly (and therefore need to be redone) seems steady, at about 4%. In an effort to get to the root cause of these nonconformances (tests that need to be redone), the director of the lab decided to keep records over a period of one week. The laboratory tests were subdivided by the shift of workers who performed them. The results are shown below. Shift Lab tests performed Nonconforming Conforming Total

Day 16 654 670

Evening 24 306 330

Total 40 960 1,000

a. Construct cross-classification tables based on total percentages, row percentages and column percentages. b. Which type of percentage – row, column or total – do you think is most informative for these data? Explain. c. What conclusions concerning the pattern of nonconforming laboratory tests can the laboratory director reach?


A.25 An economist exploring the relationship between interest rates and inflation has collected interest and CPI data from the reserve banks of New Zealand and Australia for 2000 to March 2017 (data obtained from Reserve Bank of Australia <www.rba.gov.au> and Reserve Bank of New Zealand, accessed 1 June 2017). < INTEREST_&C_PI_2017 >



For each country, use appropriate graphs and statistics to investigate the relationship between interest and inflation rates. What conclusions can you make? A.26 < GDP > gives the annual percentage change in real gross domestic product (GDP) per quarter since 2000 for New Zealand (NZ), Australia, the United States of America (USA), Japan and the United Kingdom (UK) (data obtained from Reserve Bank of New Zealand accessed 1 June 2017). a. Investigate the relationship between the annual percentage changes in GDP for these five countries by constructing time-series plots on the same set of axes. b. What conclusions can you make about the changes in GDP for these five countries? A.27 Alex and Tyler have been monitoring their electricity use since installing solar power almost a year ago, with the data stored in < SOLAR_POWER > . Explore Alex and Tyler’s power usage over this period by: a. plotting the data graphically b. calculating summary statistics c. commenting on the graphs and summary statistics A.28 The results of the 2017 Adobe Mobile Maturity Survey reveal insights into the change to smartphones as primary online access devices, and indicate the need for companies to focus on creating engaging and personalised digital experiences for

their customers. How are companies addressing the mobile experience? The survey found 40% of marketing decision makers were prioritising mobile apps and only 24% were prioritising mobile websites. However, the situation differed for IT decision makers, of whom 26% were prioritising mobile apps and 30% were prioritising mobile websites. The research is based on an online survey with a sample of 304 US executives, marketers, IT staff and analysts who had experience with mobile marketing and who worked for or were agents for organisations with 500+ employees. Of these, 254 were identified as marketing respondents and 50 as IT respondents (data obtained from ). a. Describe the populations of interest. b. Describe the samples that were collected. c. Describe a parameter of interest. d. Describe the statistic used to estimate the parameter in (c). A.29 A radio station survey of listeners found that 32% of the 1,356 drivers who responded admitted to talking on a hand-held mobile phone while driving, and 23% admitted to reading or sending SMS messages while driving. What information would you want to know before you accepted the results of the survey? A.30 Pre-numbered sales invoices are kept in a sales journal. The invoices are numbered from 0001 to 5,000. a. Beginning in row 16, column 1, and proceeding horizontally in Table E.1, select a simple random sample of 50 invoice numbers. b. Select a systematic sample of 50 invoice numbers. Use the random numbers in row 20, columns 5–7, as the starting point for your selection. c. Are the invoices selected in (a) the same as those selected in (b)? Why or why not?


PART 2

Measuring uncertainty

Real People, Real Stats Ellouise Roberts DELLOITE ACCESS ECONOMICS Which company are you currently working for and what are some of your responsibilities? I currently work for Deloitte Access Economics where I’m in the macroeconomic policy and forecasting team located in Canberra. One of my main responsibilities is working with our demographic forecasting model, where we project the future Australian population and some of its characteristics – such as where people will live, how many people will be in the labour force and the industries they might work in. These population forecasts are a key driver of our macroeconomic model, which is used to assist a variety of clients in determining the impacts of potential economic and policy changes on their business, industry or region. Before joining Deloitte Access Economics, I worked at the Australian Bureau of Statistics in a range of roles related to social research, demography and the Census. This included calculating life tables, analysing fertility rates and investigating the type of transport people use to get from home to work. List five words that best describe your personality. A statistical text book debutante! (Practical, adaptable, instinctive, determined and enquiring.) What are some things that motivate you? In my working life I’m motivated by the role that statistics can play in solving problems. For example, by undertaking statistical analysis to test the effectiveness of a particular policy in delivering intended outcomes, we can provide a basis of evidence to assist in deciding whether or not to continue funding existing programs, or to develop alternatives.


a quick q&a Many of the projects that I have been involved with will also play an important role in the future direction of Australia – whether these are in the area of higher education, infrastructure or the implementation of environmental controls. In these instances, the use of statistics can provide evidence and insights which cannot be acquired through other means such as consultations or literature reviews.

such as history, economics, demographics and sociology, to make sense of statistical findings. With such a wide applicability, working with statistics can offer a range of opportunities across a wide variety of industries and occupations. In many cases, the techniques and concepts used are the same, but the subject matter can differ significantly, which helps to keep work interesting.

When did you first become interested in statistics? I began to appreciate the value that statistics could offer while at school where I learnt about the work of John Graunt, who analysed the vital statistics of London’s citizens during the seventeenth century. During one of the many outbreaks of the bubonic plague in London, Graunt became interested in the Bills of Mortality – records of deaths from the plague – and through the use of statistics was able to draw conclusions about how the disease spread. Many ideas in use today – such as the application of life tables in the insurance industry, national censuses and medical statistics – utilise the principles and foundations of Graunt’s work. Statistics are also applicable to such a wide variety of industries and occupations that it is hard to imagine a subject where they could not offer additional insight and understanding. For example, a farmer can collect a record of daily rainfall, but in isolation those daily numbers do not offer any particularly interesting findings. However, with the introduction of even the most basic statistical techniques, such as the calculation of monthly averages or the pattern of rainfall events, insights begin to emerge. However, it is when they are combined with other observations – such as pest or disease outbreaks, or cropping metrics, or even worker productivity – that we begin to gain an understanding of the relationships between inputs and outputs (or dependent and independent variables) and appreciate the real value that statistics can offer.

Describe your first statistics-related job or work experience. Was this a positive or a negative experience? My first statistics-related job, as a university student, involved standing on the side of a road counting the types of vehicles that went past. A seemingly simple job in itself; however, after the counts were completed we would analyse the data to develop traffic-flow diagrams to assist with the planning of future road infrastructure, such as traffic lights. This was my first real experience of collecting data and then transforming information – a count of cars – into something meaningful and tangible to everyday life. It also emphasised the importance of accurate and suitable data collection techniques, and the role that sampling plays in obtaining information. For example, although standing by the side of the road counting cars for 24 hours was possible, it would not be particularly cost effective (or exciting), and the use of statistical techniques can help us build a comprehensive picture using only a snapshot of data. Although a relatively simple example, this experience helped to demonstrate the role of statistics in society and encouraged me to continue working in this area.

Complete the following sentence. A world without statistics … … would be a world where we wouldn’t be able to celebrate World Statistics Day.

LET’S TALK STATS What do you enjoy most about working in statistics? For me, it is not just the generation of the statistics and data that I enjoy (although that in itself can be very interesting), but rather the interpretation of these figures through the identification of patterns, trends and relationships. As part of working with statistics, you are also often involved in looking at the bigger picture, drawing in knowledge from a range of other disciplines,

What do you feel is the most common misconception about your work held by students who are studying statistics? Please explain. One of my misconceptions when studying statistics was that you are either a ‘numbers’ or a ‘words’ person. However, in the workplace no matter how good you may be at undertaking complex statistical analysis, or building complicated models, you also need to be able to communicate your findings with a variety of audiences with varying degrees of understanding and interest. Therefore, it is critical that, in addition to understanding the mathematical techniques, you also develop your ability to interpret your findings and convey them in a language that your audience will understand – no matter who they may be. Do you need to be good at maths to understand and use statistics successfully? To some degree, I think you do need to have a certain level of understanding of maths and an appreciation for the role that statistics can play. However, this doesn’t necessarily mean that you need to memorise countless formulas or mathematical


proofs. Rather, you need the ability to be able to understand the concepts and their application. It is also important to remember that in some instances, the interpretation of the statistics is the key output or outcome and can be more important than the numbers themselves. More broadly, studying statistics in a purely theoretical sense is useful, but the real value is being able to apply these techniques and calculations to real-world data, whether this be in the area of finance, oceanography or biomedical science. However, even more than that, to successfully work as a statistician – or with statistics in any capacity – I think you need to have an enquiring mind and to want to know things and understand why things are as they are (or may be in the future). Is there a high demand for statisticians in your industry (or in other industries)? Please explain. Studying statistics provides a solid foundation for a wide range of roles within the workplace – including ones that may not be immediately obvious, such as building early monitoring systems for tsunamis or in the monitoring of disease outbreaks. Within my role, both the public and private sectors are becoming increasingly aware of and interested in what the future demographic profile of Australia will look like, and the implications that this will have. In such a dynamic environment, I expect the opportunities for people with an understanding of statistics will only increase as more and more aspects of society, nature and the economy are investigated, evaluated and analysed. Ten years ago nobody imagined that there would be professional roles for people undertaking statistical analysis of social media, online social networks and online human behaviour – let alone the prominence that these applications would play in society.

MEASURING UNCERTAINTY What are the most practical consequences in your work that would result from failing to report uncertainty? In much of the work that I do – and particularly in population forecasting – the element of uncertainty is fairly explicit. No one knows for certain how big the population is going to be decades into the future, particularly when you consider the assumptions that need to be made about future fertility (including for females who have not yet been born themselves), mortality (where numerous medical breakthroughs every year continue to extend our lifespan) and migration (where government policies play a key role). However, by observing past trends, patterns and behaviours we can build a picture of what the demographic and economic future may look like under certain conditions. More broadly, statistics don’t always necessarily give you a definitive number or answer as such. Instead, they are often predications or assessments of information, making it critical to explain the role of uncertainty in the conclusions that you make.

Given that our work can influence public and social policy within Australia, the failure to report uncertainty can have considerable consequences by falsely informing our client’s decisions.
When might a discrete probability distribution be useful for your work? Can you provide a specific question for which it has helped to provide an answer? In our type of work we are often concerned with the distribution of certain events, such as the success or failure of students completing a particular year in their apprenticeship training. In this example, we were interested in understanding the probability of success in relation to a range of different characteristics, such as age, sex and industry, as well as any government assistance that they had received. Based on a sample of records, we investigated the probability distribution based on individual characteristics, which assisted us in identifying how these factors might contribute to the relative likelihood of success or failure in relation to the overall sample. This type of work assisted us in providing a range of information to our client. Firstly, it helped to establish whether the assistance being made available was targeted at the desired group (i.e. those least likely to complete a particular year of the apprenticeship) and whether the government program was having a positive influence on completion rates.
When might a continuous probability distribution be useful for your work? Can you provide a specific question for which it has helped to provide an answer? Continuous probability distributions provide the foundation for much of our multiple linear regression analysis. Using the example from above, in this instance we were also interested in estimating the overall probability of apprenticeship success. While the methods used were themselves conceptually advanced, they were built around the basic assumptions of continuous probability distributions.
Is it difficult to liken collected data to a common distribution? What features of the data are used to do so? One thing that you quickly learn in any analysis of ‘real-world’ data is that although some data may be easily likened to a common distribution – like exam results, which often follow a bell-shaped curve similar to a normal distribution – any data collected is likely to present its own unique set of challenges. Taking the Census, for example, despite the extensive effort put into the design of the form, the collection procedures, the processing of answers and the data analysis, there are still a wide range of errors (respondent error, processing error, partial or non-response, and undercount) that need to be considered while interpreting the statistics. In many other cases, the data you will be analysing may be collected for a different purpose (such as registrations of births, deaths and marriages for administrative purposes), and incorrect, incomplete and duplicate entries can be a significant issue.


CHAPTER 4
Basic probability

REPEAT FESTIVAL ATTENDANCE
To increase visitor numbers during the year and repeat attendance at the three-day musical festival presented in the Chapter 2 and 3 scenarios, non-local festival attendees are given a book of discount vouchers for subsequent visits to the region and/or the annual music festival. These vouchers include seven nights for the price of five at selected backpackers’ hostels and motels, and two meals for the price of one at selected restaurants. Gaia Adventure Tours, which runs tours and activities in the region, offers a voucher giving two for the price of one on selected tours and activities. Jo is analysing the use of these vouchers by a sample of 500 non-local festival attendees from five years ago. Some of the questions Jo hopes to answer for these attendees are:

■ Are those who have been to a subsequent music festival more likely to have also used an accommodation discount voucher than those who have not been a repeat attendee?
■ What proportion of past festival attendees attend the music festival again?
■ What proportion of repeat festival attendees use a discount meal voucher?
■ What proportion of repeat festival attendees use the two-for-one Gaia Adventure Tours voucher?
■ Is the proportion of repeat festival attendees who use the two-for-one Gaia Adventure Tours voucher the same as those who use a discount meal voucher?

Answers to these questions and others can help Jo develop future sales and marketing strategies to encourage repeat visits to the region and/or music festival by festival attendees.


LEARNING OBJECTIVES

After studying this chapter you should be able to:
1 recognise basic probability concepts
2 calculate probabilities of simple, marginal and joint events
3 calculate conditional probabilities and determine whether events are independent or not
4 revise probabilities using Bayes’ theorem
5 use counting rules to calculate the number of possible outcomes

Probability is the link between descriptive statistics and inferential statistics. This chapter introduces several types of probability and discusses how to revise probabilities in light of new information. These topics are the foundation for the probability distribution, the concept of mathematical expectation and the binomial, hypergeometric and Poisson distributions (topics covered in Chapter 5).

LEARNING OBJECTIVE 1
Recognise basic probability concepts

probability The likelihood of an event occurring.
impossible event An event that cannot occur.
certain event An event that will occur.
a priori classical probability Objective probability, obtained from prior knowledge of the process.

4.1  BASIC PROBABILITY CONCEPTS
What is probability? A probability is a numerical value that represents the chance, likelihood or possibility that a particular event will occur. Examples of events are the price of a share increasing, a rainy day, a defective item or the outcome 5 when you roll a die. A probability is given either as a proportion or fraction whose value lies between 0 and 1, inclusive. An event that has no chance of occurring (i.e. an impossible event) has a probability of 0. An event that is sure to occur (i.e. a certain event) has a probability of 1.
There are three approaches to assigning a probability to an event:
• a priori classical probability
• empirical classical probability
• subjective probability.
In a priori classical probability, the probability of an event is based on prior knowledge of the process involved. In the simplest case, each outcome is equally likely and the chance of occurrence of the event is given by Equation 4.1.

PROBABILITY OF OCCURRENCE
Probability of occurrence = X/T   (4.1)
where X = number of ways in which the event occurs
      T = total number of possible outcomes

Consider a standard deck of cards with 26 red cards and 26 black cards. The probability of selecting a black card (an event), using Equation 4.1, is 26/52 = 0.5 since there are X = 26 black cards and a total of T = 52 cards. What does this probability mean? As you cannot say for certain what colour the next card selected will be, it does not mean that, if each card is replaced after it is drawn, one out of the next two cards selected will be black. However, you can say that, in the long run, if cards are continually selected and replaced, the proportion of black cards selected will approach 0.5.
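This ‘long run’ interpretation can be checked with a short simulation. The following is a minimal Python sketch (Python is not part of this text’s Excel guides, and the variable names are illustrative only); it draws cards with replacement and shows the relative frequency of black cards settling near 0.5:

```python
import random

# A priori probability of drawing a black card: X ways out of T equally likely outcomes
X, T = 26, 52
print("A priori probability:", X / T)          # 0.5

# In the long run, the relative frequency of black cards (sampling with replacement)
# approaches the a priori probability.
random.seed(1)
deck = ["black"] * 26 + ["red"] * 26
for n in (100, 10_000, 100_000):
    draws = [random.choice(deck) for _ in range(n)]
    print(n, "draws:", draws.count("black") / n)
```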




EXAMPLE 4.1  FINDING A PRIORI PROBABILITIES
A standard die has six faces. Each face carries one, two, three, four, five or six dots. If you roll a die, what is the probability you will get a face with five dots?

SOLUTION
Each face is equally likely to occur. Since there are six faces, the probability of getting a face with five dots is 1/6.

The above examples use the a priori classical approach to assigning a probability because the number of ways the event occurs and the total number of possible outcomes are known from the composition of the deck of cards or the faces of the die. In addition to the cards and die examples discussed, games of chance such as Lotto and Roulette are based on known probabilities and, as such, are examples of a priori classical probability.
In the empirical classical approach to assigning a probability, the outcomes are based on observed data, not on prior knowledge of a process. Examples of this type of probability are the proportion of repeat festival attendees in the chapter scenario, the proportion of registered voters who prefer a certain political candidate or the proportion of students who have a part-time job. For example, if you take a survey of students and 60% state that they have a part-time job, then there is a 0.6 probability that an individual student has a part-time job.
The third approach to assigning a probability, subjective probability, differs from the other two approaches because a subjective probability differs from person to person. For example, the development team for a new product may assign a probability of 0.6 to the chance of success for the product while the managing director of the company is less optimistic and assigns a probability of 0.3. The assignment of subjective probabilities to various outcomes is usually based on a combination of an individual’s prior knowledge, personal opinion and analysis of a particular situation. Subjective probability is useful in making decisions in situations in which you cannot use a priori classical probability or empirical classical probability.

empirical classical probability Objective probability, obtained from the relative frequency of occurrence of an event.

subjective probability Probability that reflects an individual’s belief that an event occurs.

Events and Sample Spaces
We need the following definitions to understand probabilities.
A random experiment is a precisely described scenario that leads to an outcome that cannot be predicted with certainty. For example, the scenario could be ‘roll a die and record how many dots on the upper face’, or ‘toss a coin twice and record whether heads (H) or tails (T) occurs on each toss’.
An event is specified by one or more outcomes of a random experiment. The event is said to have occurred if one of the outcomes specified has occurred.

random experiment A precisely described scenario that leads to an outcome that cannot be predicted with certainty.

event One or more outcomes of a random experiment.

For example, when rolling a die, the event of an even number consists of three outcomes: 2, 4 and 6. A simple event is an event specified by a single outcome of a random experiment.

simple event A single outcome of a random experiment.

The collection of all simple events is called the sample space.

sample space Collection of all simple events of a random experiment.

For example, in the experiment of rolling the die, the sample space consists of the six simple events: 1, 2, 3, 4, 5 and 6. In the experiment of tossing a coin twice, the sample space consists of the four simple events: HH, HT, TH and TT.
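For readers who like to experiment, the sample space of a small random experiment can be enumerated directly. The following is a minimal Python sketch (not from the text; the event chosen is just an illustration):

```python
from itertools import product

# Sample space for 'toss a coin twice': all ordered pairs of H and T
sample_space = ["".join(outcome) for outcome in product("HT", repeat=2)]
print(sample_space)                      # ['HH', 'HT', 'TH', 'TT']

# An event is a collection of simple events, e.g. 'at least one head'
at_least_one_head = {s for s in sample_space if "H" in s}
print(at_least_one_head)                 # {'HH', 'HT', 'TH'}

# Its complement contains the simple events not in the event
complement = set(sample_space) - at_least_one_head
print(complement)                        # {'TT'}
```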


joint event An event described by two or more characteristics.

A joint event is an event described by two or more characteristics. A joint event can be a simple event. For example, in the experiment of tossing a coin twice, the simple event HH has the two characteristics H on first toss and H on second toss.

complement All simple outcomes not in an event.

The complement of event A (written A′) includes all simple events that are not included in the event A. When tossing a coin, the complement of a head is a tail, since it is the only simple event that is not a head. When rolling a die, the complement of ‘five’ is ‘not five’ – that is, a 1, 2, 3, 4 or 6 – and, when rolling a die, the complement of the event ‘an even number’ is ‘an odd number’ – that is, 1, 3 or 5.

EXAMPLE 4.2  EVENTS AND SAMPLE SPACES
Table 4.1 gives information on repeat attendance at festivals and the use of discount accommodation vouchers by the sample of 500 festival attendees.

Table 4.1  Accommodation voucher use and repeat festival attendance

                              Accommodation voucher used
Repeat festival attendance    Yes     No     Total
Yes                           210      70      280
No                            110     110      220
Total                         320     180      500

What is the sample space? Give examples of simple events and joint events.

SOLUTION
The sample space consists of discount accommodation voucher use and repeat festival attendance of the sample of 500 festival attendees. Examples of simple events are ‘Repeat festival attendance’ and ‘Accommodation voucher used’. The complement of the event ‘Accommodation voucher used’ is ‘Accommodation voucher not used’. The event ‘Repeat festival attendance and accommodation voucher used’ is a joint event because festival attendees have attended a subsequent music festival and used the discount accommodation voucher.

Contingency Tables and Venn Diagrams
contingency (or cross-classification) table – probability Represents a sample space for joint events classified by two characteristics; each cell represents the joint event satisfying given values of both characteristics.
Venn diagram Graphical representation of a sample space; joint events shown as ‘unions’ and ‘intersections’ of circles representing simple events.

There are several ways to present a sample space. Table 4.1 uses a contingency table, also called a cross-classification table (see Section 2.4), to represent a sample space. The values in the cells of the table are obtained by classifying the sample of 500 festival attendees by whether they have attended a subsequent music festival and/or used the discount accommodation voucher. For example, 210 festival attendees have used the discount accommodation voucher and attended a subsequent music festival.
A Venn diagram is another way to present a sample space. It graphically represents the various events as unions and intersections of circles. Figure 4.1 presents a typical Venn diagram for a two-variable situation, with each variable having only two events (A and A′, B and B′). The circle on the left represents all simple events that are part of A and the circle on the right represents all simple events that are part of B. The area contained within circle A and circle B (centre area) is the intersection of A and B (written as A ∩ B), since it contains all outcomes that are in event A and also in event B. The total area of the two circles is the union of A and B (written as A ∪ B) and contains all outcomes in event A and/or in event B. The area in the diagram outside A ∪ B contains outcomes that are neither in event A nor in event B.


To construct a Venn diagram the events A and B must be defined. You can define either event as A or B, or use different letters, as long as you are consistent in evaluating the various events. For the repeat festival attendance scenario you can define the events as follows:

A = repeat festival attendance      A′ = no repeat festival attendance
B = accommodation voucher used      B′ = accommodation voucher not used

In drawing the Venn diagram (see Figure 4.2), you must determine the value of the intersection of A and B in order to divide the sample space into its parts. A ∩ B consists of all 210 festival attendees who have attended a subsequent music festival and used the discount accommodation voucher. The remainder of event A (Repeat festival attendance) consists of the 70 repeat festival attendees who did not use the discount accommodation voucher. The remainder of event B (Accommodation voucher used) consists of the 110 festival attendees who have used the discount accommodation voucher but not attended another music festival. The remaining 110 festival attendees have neither attended a later music festival nor used the discount accommodation voucher.

Figure 4.1  Venn diagram for events A and B
Figure 4.2  Venn diagram for repeat festival attendance scenario: A ∩ B = 210, A ∩ B′ = 70, A′ ∩ B = 110, A ∪ B = 390
Note: A = (A ∩ B) + (A ∩ B′) and B = (A ∩ B) + (A′ ∩ B)

LEARNING OBJECTIVE 2
Calculate probabilities of simple, marginal and joint events

marginal probability Probability of an event described by a single characteristic.

Marginal Probability
Now some of the questions posed in the repeat festival attendance scenario can be answered. Since the results are based on data collected (see Table 4.1), the empirical classical approach to assigning probabilities can be used. Marginal probability refers to the probability P(A) of an occurrence of an event A described by a single characteristic. An example of a marginal probability in the repeat festival attendance scenario is the probability of a festival attendee attending a later music festival. Using Equation 4.1:

P(repeat festival attendance) = number of repeat festival attendees / total number of attendees = 280/500 = 0.56

Thus, there is a 0.56 (or 56%) likelihood that a festival attendee will attend a subsequent music festival. The name marginal probability derives from the fact that the total number of occurrences of event A (in this case, repeat festival attendance) is obtained from the margin of the contingency table (see Table 4.1). Example 4.3 illustrates another application of marginal probability.

EXAMPLE 4.3  CALCULATING THE PROBABILITY THAT A REPEAT FESTIVAL ATTENDEE USES THE GAIA ADVENTURE TOURS DISCOUNT VOUCHER
In the repeat festival attendance scenario, festival attendees were given a book of discount vouchers, including two-for-one vouchers for meals and selected activities and tours by Gaia Adventure Tours. Table 4.2 gives the use of these two-for-one vouchers by the 280 repeat festival attendees.


Table 4.2  Use of two-for-one vouchers by repeat festival attendees

                                         Meal voucher used
Gaia Adventure Tours voucher used        Yes     No     Total
Yes                                      126     42      168
No                                        84     28      112
Total                                    210     70      280

Find the probability that a repeat festival attendee uses the Gaia Adventure Tours voucher.

SOLUTION

Using Equation 4.1:
P(Gaia Adventure Tours) = number of repeat festival attendees using Gaia Adventure Tours voucher / total number of repeat festival attendees = 168/280 = 0.6

Therefore, 60% of repeat festival attendees use the Gaia Adventure Tours two-for-one voucher.
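A marginal probability such as this can also be obtained by summing joint counts over the other characteristic. The sketch below is a minimal Python illustration using the Table 4.2 counts (the dictionary keys are invented labels, not from the text):

```python
# Counts from Table 4.2 (rows: Gaia voucher used, columns: meal voucher used)
counts = {
    ("gaia_yes", "meal_yes"): 126, ("gaia_yes", "meal_no"): 42,
    ("gaia_no",  "meal_yes"): 84,  ("gaia_no",  "meal_no"): 28,
}
total = sum(counts.values())                          # 280 repeat attendees

# Marginal probability: sum the joint counts over the other characteristic
p_gaia = sum(n for (gaia, _), n in counts.items() if gaia == "gaia_yes") / total
print(round(p_gaia, 2))                               # 0.6
```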

Joint Probability
joint probability Probability of an occurrence described by two or more characteristics.

Joint probability refers to the probability of an occurrence described by two or more characteristics. An example of joint probability is the probability that you will get a head on the first toss of a coin and a head on the second toss of a coin. Referring to Table 4.1, the festival attendees who have attended a subsequent music festival and used the discount accommodation voucher are represented by the 210 festival attendees in the single cell ‘Yes – Repeat festival attendance and Yes – Accommodation voucher used’. Because this group consists of 210 festival attendees, the probability of picking a festival attendee who has attended a later music festival and used the discount accommodation voucher is:

P(repeat festival attendance and accommodation voucher used) = number of repeat festival attendees who used the accommodation voucher / total number of festival attendees = 210/500 = 0.42

Example 4.4 also demonstrates how to determine joint probability.

EXAMPLE 4.4  DETERMINING THE JOINT PROBABILITY OF A REPEAT FESTIVAL ATTENDEE USING TWO-FOR-ONE MEAL AND GAIA ADVENTURE TOURS VOUCHERS
In Table 4.2, festival attendees were given a book of discount vouchers, including two-for-one vouchers for meals and Gaia Adventure Tours. Find the probability that a randomly selected repeat festival attendee uses both the meal and Gaia Adventure Tours two-for-one vouchers.


SOLUTION

Using Equation 4.1:
P(Gaia Adventure Tours and meal voucher used) = number of repeat festival attendees using both Gaia Adventure Tours and meal vouchers / total number of repeat festival attendees = 126/280 = 0.45

Therefore, there is a 45% chance that a randomly selected repeat festival attendee uses both the meal and Gaia Adventure Tours two-for-one vouchers.

The marginal probability of an event is the sum of joint probabilities. For example, if B consists of two events, B1 and B2, then P(A), the probability of event A, consists of the joint probability of event A occurring with event B1 plus the joint probability of event A occurring with event B2. Equation 4.2 can be used to calculate marginal probabilities.

MARGINAL PROBABILITY
P(A) = P(A and B1) + P(A and B2) + … + P(A and Bk)   (4.2)
where B1, B2, …, Bk are k mutually exclusive and collectively exhaustive events.

Mutually exclusive events and collectively exhaustive events are defined as follows. Two events are mutually exclusive if the two events cannot occur simultaneously.

mutually exclusive Two events that cannot occur simultaneously.

Heads and tails in a coin toss are mutually exclusive events. When tossing a coin you cannot get both a head and a tail on the same toss. A set of events is collectively exhaustive if one of the events must occur.

collectively exhaustive Set of events such that one of the events must occur.

Heads and tails in a coin toss are collectively exhaustive events. One of them must occur. If heads does not occur, tails must occur. If tails does not occur, heads must occur. In summary, the outcomes of tossing a coin are both collectively exhaustive and mutually exclusive. The outcome must be either heads or tails, P(Heads or Tails) = 1, so the outcomes are collectively exhaustive. When heads occurs, tails cannot occur, P(Heads and Tails) = 0, so the outcomes are also mutually exclusive.
Equation 4.2 can be used to calculate the marginal probability of a festival attendee attending a later music festival:

P(repeat festival attendance) = P(repeat festival attendance and accommodation voucher used) + P(repeat festival attendance and accommodation voucher not used) = 210/500 + 70/500 = 280/500 = 0.56

Alternatively, Equation 4.1 can be used to calculate P(repeat festival attendance).


General Addition Rule
general addition rule Used to calculate the probability of the joint event A or B.

The probability of event ‘A or B’ can be calculated by the general addition rule. This rule considers the occurrence of either event A or event B or both A and B. The event ‘Repeat festival attendance or accommodation voucher used’ includes all festival attendees who have attended a subsequent music festival and all festival attendees who have used the discount accommodation voucher. Table 4.1 can be used to calculate the probability that a festival attendee either attended a later music festival or used the accommodation discount voucher by examining each cell of the contingency table (Table 4.1) to determine whether it is part of this event. From Table 4.1, the cell ‘Repeat festival attendance and accommodation voucher not used’ is part of the event, because it includes repeat festival attendees. The cell ‘No repeat festival attendance and accommodation voucher used’ is included because it contains festival attendees using the discount accommodation voucher. Finally, the cell ‘Repeat festival attendance and accommodation voucher used’ has both characteristics of interest. Therefore, the probability of a festival attendee either attending a later music festival or using the accommodation discount voucher is:

P(repeat festival attendance or accommodation voucher used)
= P(repeat festival attendance and accommodation voucher used) + P(no repeat festival attendance and accommodation voucher used) + P(repeat festival attendance and accommodation voucher not used)
= 210/500 + 110/500 + 70/500 = 390/500 = 0.78

Instead of using a contingency table, the general addition rule defined in Equation 4.3 can be used to calculate the probability of the event A or B, P(A or B).

GENERAL ADDITION RULE
The probability of A or B is equal to the probability of A plus the probability of B minus the probability of A and B.
P(A or B) = P(A) + P(B) − P(A and B)   (4.3)

Applying this equation to the previous example produces the following:

P(repeat festival attendance or accommodation voucher used)
= P(repeat festival attendance) + P(accommodation voucher used) − P(repeat festival attendance and accommodation voucher used)
= 280/500 + 320/500 − 210/500 = 390/500 = 0.78

The general addition rule adds the probability of A and the probability of B, and then subtracts the joint event of A and B from this total because the joint event has been included in both the probability of A and the probability of B. Referring to Table 4.1, if the outcomes of the event ‘Repeat festival attendance’ are added to those of the event ‘Accommodation voucher used’, the joint event ‘Repeat festival attendance and accommodation voucher used’ has been included in each of these simple events. Therefore, because this joint event has been counted twice, it needs to be subtracted once. Example 4.5 illustrates another application of the general addition rule.


EXAMPLE 4.5  APPLYING THE GENERAL ADDITION RULE FOR REPEAT FESTIVAL ATTENDEES USING TWO-FOR-ONE MEAL OR GAIA ADVENTURE TOURS VOUCHERS
In Example 4.3, festival attendees were given a book of discount vouchers, including two-for-one vouchers for meals and Gaia Adventure Tours. Find the probability that a randomly selected repeat festival attendee uses a two-for-one meal or Gaia Adventure Tours voucher.

SOLUTION
Using Equation 4.3:
P(Gaia Adventure Tours or meal voucher used)
= P(Gaia Adventure Tours voucher used) + P(meal voucher used) − P(Gaia Adventure Tours and meal voucher used)
= 168/280 + 210/280 − 126/280 = 252/280 = 0.9

Therefore, there is a 90% chance that a randomly selected repeat festival attendee uses a two-for-one meal or Gaia Adventure Tours voucher.
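The same answer can be verified directly from the Table 4.2 counts. The following minimal Python sketch applies Equation 4.3 (the variable names are illustrative only):

```python
# Counts from Table 4.2: 280 repeat attendees cross-classified by voucher use
n_total = 280
n_gaia = 168        # used the Gaia Adventure Tours voucher
n_meal = 210        # used the meal voucher
n_both = 126        # used both vouchers

# General addition rule: P(G or M) = P(G) + P(M) - P(G and M)
p_gaia_or_meal = n_gaia / n_total + n_meal / n_total - n_both / n_total
print(round(p_gaia_or_meal, 2))   # 0.9
```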

Problems for Section 4.1
LEARNING THE BASICS



4.1 Two coins are tossed.
a. Give an example of a simple event.
b. Give an example of a joint event.
c. What is the complement of a head on the first toss?
4.2 An urn contains 12 red balls and 8 white balls. One ball is to be selected from the urn.
a. Give an example of a simple event.
b. What is the complement of a red ball?
4.3 Given the following contingency table:
        B     B′
A      10     20
A′     20     40
what is the probability of
a. event A?
b. event A′?
c. event A and B?
d. event A or B?
4.4 Given the following contingency table:
        B     B′
A      10     30
A′     25     35
what is the probability of
a. event A′?
b. event A and B?
c. event A′ and B′?
d. event A′ or B′?

APPLYING THE CONCEPTS
4.5 For each of the following, indicate whether the type of probability involved is an example of a priori classical probability, empirical classical probability or subjective probability.
a. The next toss of a fair coin will be heads.
b. Italy will win soccer’s World Cup the next time the competition is held.
c. The sum of the faces of two dice will be 7.
d. The train taking a commuter to work will be more than 10 minutes late.
4.6 For each of the following, state whether the events are mutually exclusive and/or collectively exhaustive. If they are not mutually exclusive and/or collectively exhaustive, either reword the categories to make them mutually exclusive and collectively exhaustive or explain why this would not be useful.
a. An exit poll in an Australian federal election asked voters if they had voted for the Labor or the Coalition candidate.
b. Respondents were classified by type of car they drive: Australian, American, European, Japanese or none.


c. People were asked, ‘Do you currently live in (i) an apartment or (ii) a house?’
d. A product was classified as defective or not defective.
4.7 The probability of each of the following events is zero. For each, state why.
a. A day is Christmas and Easter.
b. A product is defective and not defective.
c. A car is a Ford and a Toyota.
4.8 A researcher has completed a survey of 10,000 viewers in a regional city to determine which TV network they watch most weekdays during the 6 pm to 7 pm time slot. The results are:
Network          Number
ABC               1,290
Seven             2,850
Nine              2,060
Ten               1,695
SBS                 430
Other or none     1,675
A surveyed viewer is chosen at random. Find the probability that during the 6 pm to 7 pm time slot the viewer:
a. watches ABC
b. watches ABC or SBS
c. watches neither ABC nor SBS
d. watches one of Channels 7, 9 or 10
e. does not watch one of Channels 7, 9 or 10
4.9 A sample of 500 consumers is selected in a large metropolitan area to study consumer behaviour. Among the questions asked was ‘Do you enjoy shopping for clothing (Yes or No)?’ Of 240 males, 136 answered yes. Of 260 females, 224 answered yes. Construct a contingency table or a Venn diagram to evaluate the probabilities. What is the probability that a surveyed consumer chosen at random:
a. enjoys shopping for clothing?
b. is a female and enjoys shopping for clothing?
c. is a female or enjoys shopping for clothing?
d. is a male or a female?

LEARNING OBJECTIVE 3
Calculate conditional probabilities and determine whether events are independent or not

conditional probability Probability of an event, given information on the occurrence of a second event.

4.2  CONDITIONAL PROBABILITY
Calculating Conditional Probabilities
We can often make use of extra information about the events under consideration when calculating probabilities. In this section, we consider the case where the probability of an event occurring depends on the occurrence of some other event. Suppose, for instance, that we are interested in determining the probability that a person selected at random earns more than $100,000 a year. If we know that the person has a degree, it might be reasonable to expect this to affect the probability. Conditional probability refers to the probability of event A, given information about the occurrence of another event, B.

CONDITIONAL PROBABILITY
The probability of A given B, written P(A | B), is equal to the probability of A and B divided by the probability of B.
P(A | B) = P(A and B) / P(B)   (4.4a)
The probability of B given A is equal to the probability of A and B divided by the probability of A.
P(B | A) = P(A and B) / P(A)   (4.4b)
where
P(A and B) = joint probability of A and B
P(A) = marginal probability of A
P(B) = marginal probability of B

Referring to the repeat festival attendance scenario, suppose we know that a festival attendee has used the discount accommodation voucher. What is the probability that they have also


attended a later music festival – that is, P(repeat festival attendance | accommodation voucher used)? As we know that the festival attendee has used the discount accommodation voucher, the sample space does not consist of all 500 festival attendees in the sample. It consists only of the festival attendees who have used the discount accommodation voucher. Of the 320 festival attendees who have used the discount accommodation voucher, 210 are repeat festival attendees. Therefore (see Table 4.1 or Figure 4.2), the probability that a festival attendee attends a subsequent music festival given that they have used the discount accommodation voucher is:

P(repeat festival attendance | accommodation voucher used)
= number of repeat festival attendees who used the accommodation voucher / number of accommodation vouchers used
= 210/320 = 0.65625

Equation 4.4a can be used to calculate the above result. Define events:
A = repeat festival attendance
B = accommodation voucher used
then:
P(A | B) = P(A and B) / P(B) = (210/500) / (320/500) = 210/320 = 0.65625

Therefore, if a festival attendee has used the discount accommodation voucher there is a 65.625% probability that they have also attended a subsequent music festival. Compare this conditional probability with the marginal probability of a festival attendee attending a later music festival, which is 280/500 = 0.56, or 56%. These results indicate that festival attendees who use the discount accommodation voucher are more likely to also attend a subsequent music festival. Example 4.6 further illustrates conditional probability.

EXAMPLE 4.6  FINDING A CONDITIONAL PROBABILITY CONCERNING REPEAT FESTIVAL ATTENDEES’ USE OF TWO-FOR-ONE VOUCHERS
Table 4.2 is a contingency table for whether repeat festival attendees use two-for-one meal and/or Gaia Adventure Tours vouchers. Find the probability that a randomly selected repeat festival attendee who used the two-for-one meal voucher also used the Gaia Adventure Tours voucher.

SOLUTION

We know that the repeat festival attendee has used the two-for-one meal voucher, so the sample space is reduced to the 210 attendees who have used their meal voucher. Of these 210 attendees, 126 have used their Gaia Adventure Tours voucher. Therefore, the probability that the Gaia Adventure Tours voucher is used, given that the meal voucher was used, is:

P(Gaia Adventure Tours voucher used | meal voucher used)
= number of repeat attendees who used both meal and Gaia Adventure Tours vouchers / number of repeat attendees who used the meal voucher
= 126/210 = 0.6


If we define events:
M = meal voucher used
G = Gaia Adventure Tours voucher used
then Equation 4.4a may be used:
P(G | M) = P(G and M) / P(M) = (126/280) / (210/280) = 126/210 = 0.6

Therefore, given that a repeat festival attendee has used the two-for-one meal voucher, there is a 60% chance that the Gaia Adventure Tours two-for-one voucher is also used.
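The reduced-sample-space reasoning and Equation 4.4a give the same result, as the following minimal Python sketch using the Table 4.2 counts illustrates (the variable names are not from the text):

```python
# Joint and marginal counts from Table 4.2 (280 repeat festival attendees)
n_total = 280
n_meal = 210            # used the two-for-one meal voucher
n_meal_and_gaia = 126   # used both the meal and Gaia Adventure Tours vouchers

# Conditional probability (Equation 4.4a): P(G | M) = P(G and M) / P(M)
p_gaia_given_meal = (n_meal_and_gaia / n_total) / (n_meal / n_total)
print(round(p_gaia_given_meal, 2))   # 0.6, the same as 126/210
```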

Decision Trees
decision tree Graphical representation of simple and joint probabilities as vertices of a tree. Also known as a tree diagram.

In Table 4.1, a sample of 500 festival attendees was classified according to whether they have attended a later music festival or used the discount accommodation voucher. A decision tree (or tree diagram) is an alternative to a contingency table or a Venn diagram. Figure 4.3 represents the decision tree for this example. In Figure 4.3, beginning at the left with the sample of 500 festival attendees, there are two ‘branches’ corresponding to whether or not a subsequent music festival was attended. Each branch has two sub-branches, corresponding to whether the festival attendee used the discount accommodation voucher. The probabilities at the end of the initial branches represent the marginal probabilities of A (Repeat festival attendance) and A′. The probabilities at the end of each of the four sub-branches represent the joint probability for each combination of events A and B (Accommodation voucher used). The conditional probability is calculated by dividing the joint probability by the appropriate marginal probability. For example, to calculate the conditional probability that a festival attendee uses the accommodation discount voucher given that they have attended a later music festival, divide P(repeat festival attendance and accommodation voucher used) by P(repeat festival attendance). From Figure 4.3:

P(accommodation voucher used | repeat festival attendance) = (210/500) / (280/500) = 210/280 = 0.75

Example 4.7 illustrates how to construct a decision tree.

Figure 4.3  Decision tree for repeat festival attendance scenario: P(A) = 280/500, P(A′) = 220/500; P(A and B) = 210/500, P(A and B′) = 70/500, P(A′ and B) = 110/500, P(A′ and B′) = 110/500


EXAMPLE 4.7  FORMING A DECISION TREE FOR REPEAT FESTIVAL ATTENDEES – TWO-FOR-ONE VOUCHER USE
Using the cross-classified data in Table 4.2, construct a decision tree and use it to find the probability that a randomly selected repeat festival attendee who used the two-for-one meal voucher also used the Gaia Adventure Tours two-for-one voucher.

SOLUTION
The decision tree for ‘two-for-one voucher use’ is displayed in Figure 4.4. Using Equation 4.4a and the following definitions:
M = meal voucher used
G = Gaia Adventure Tours voucher used
P(G | M) = P(G and M) / P(M) = (126/280) / (210/280) = 126/210 = 0.6

Figure 4.4  Decision tree for ‘two-for-one voucher use’: P(M) = 210/280, P(M′) = 70/280; P(M and G) = 126/280, P(M and G′) = 84/280, P(M′ and G) = 42/280, P(M′ and G′) = 28/280

Statistical Independence
In the repeat festival attendance scenario, the conditional probability is 210/320 = 0.65625 that a selected festival attendee attended a later music festival given that they have used the discount accommodation voucher. The probability that a randomly selected festival attendee attends a later music festival is 280/500 = 0.56. This result shows that the prior knowledge that a festival attendee has used the discount accommodation voucher affected the probability that they attended another music festival. In other words, the outcome of one event is dependent on the outcome of a second event. When the outcome of one event does not affect the probability of occurrence of another event, the events are said to be statistically independent. Statistical independence can be determined by using Equation 4.5.

statistical independence The occurrence of an event does not affect the occurrence of a second event.

STATISTICAL INDEPENDENCE
Two events, A and B, are statistically independent if and only if
P(A | B) = P(A) (also P(B | A) = P(B))   (4.5)

Example 4.8 demonstrates the use of Equation 4.5.


EXAMPLE 4.8  DETERMINING STATISTICAL INDEPENDENCE
Using the cross-classified data in Table 4.2, determine whether, for repeat festival attendees, use of the two-for-one meal voucher and use of the Gaia Adventure Tours voucher are statistically independent events.

SOLUTION

From Examples 4.6 and 4.7:
P(Gaia Adventure Tours voucher used | meal voucher used) = 0.6
which, from Example 4.3, is equal to:
P(Gaia Adventure Tours voucher used) = 0.6
Thus, use of the two-for-one meal voucher and use of the Gaia Adventure Tours voucher are statistically independent events. Occurrence of one event does not affect the probability of the other event.
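Both tests for statistical independence can be checked numerically. The following minimal Python sketch uses the Table 4.2 counts (a small tolerance guards against floating-point rounding; the code is illustrative only):

```python
# Counts from Table 4.2
n_total, n_meal, n_gaia, n_both = 280, 210, 168, 126

p_gaia = n_gaia / n_total                             # P(G) = 0.60
p_gaia_given_meal = n_both / n_meal                   # P(G | M) = 126/210 = 0.60
p_product = (n_gaia / n_total) * (n_meal / n_total)   # P(G)P(M) = 0.60 * 0.75 = 0.45

# Both conditions for statistical independence hold (Equations 4.5 and 4.7)
print(abs(p_gaia_given_meal - p_gaia) < 1e-12)        # True: P(G | M) = P(G)
print(abs(n_both / n_total - p_product) < 1e-12)      # True: P(G and M) = P(G)P(M)
```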

Multiplication Rules
general multiplication rule Used to calculate the probability of the joint event A and B.

By manipulating the formula for conditional probability, you can determine the joint probability P(A and B) from the conditional probability of an event. The general multiplication rule is derived using Equations 4.4a and 4.4b and solving for the joint probability P(A and B).

GENERAL MULTIPLICATION RULE
The probability of A and B is equal to the probability of A given B times the probability of B, or the probability of B given A times the probability of A.
P(A and B) = P(A | B)P(B) = P(B | A)P(A)   (4.6)



Example 4.9 demonstrates the use of the general multiplication rule.

EXAMPLE 4.9  USING THE MULTIPLICATION RULE
Of the 500 festival attendees in the repeat festival attendance scenario (Table 4.1), 280 have attended a subsequent music festival. Suppose two festival attendees are randomly selected. Find the probability that both festival attendees have since attended a later music festival.

SOLUTION

We can use the multiplication rule. Define events:
F1 = repeat festival attendance for the first attendee
F2 = repeat festival attendance for the second attendee
then, using Equation 4.6:
P(F1 and F2) = P(F2 | F1)P(F1)

The probability that the first attendee has subsequently attended another music festival is 280/500. However, the probability that the second attendee has attended a later music festival depends on the result of the first selection. If the first attendee is not returned to the sample after any repeat festival attendance is determined (sampling without replacement), then the number of attendees remaining will be 499. If the first festival attendee attends a later music festival, the probability that the second also attends a later music festival is 279/499,


because 279 attendees who have subsequently attended a later music festival remain in the sample. Therefore:
P(F1 and F2) = P(F2 | F1)P(F1) = (279/499) × (280/500) = 0.3131…

The probability that both festival attendees have since attended a later music festival is approximately 0.313.
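The calculation, with and without replacement, can be confirmed with a few lines of Python (a minimal sketch, not part of the text):

```python
# Sampling two attendees without replacement from the 500 in Table 4.1,
# of whom 280 are repeat festival attendees (general multiplication rule).
p_first = 280 / 500                 # P(F1)
p_second_given_first = 279 / 499    # P(F2 | F1): one repeat attendee already removed
print(round(p_second_given_first * p_first, 3))   # 0.313

# With replacement the two selections would be independent: P(F1)P(F2)
print(round((280 / 500) ** 2, 3))                 # 0.314
```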

If A and B are independent events, then P(A | B) = P(A), so we can substitute P(A) for P(A | B) (or P(B) for P(B | A)) in Equation 4.6 to obtain the multiplication rule for independent events.

MULTIPLICATION RULE FOR INDEPENDENT EVENTS
If A and B are statistically independent, the probability of A and B is equal to the probability of A times the probability of B.
P(A and B) = P(A)P(B)   (4.7)

multiplication rule for independent events Used to calculate the probability of the joint event A and B when A and B are independent.

If this rule holds for two events, A and B, then A and B are statistically independent. Thus, there are two ways to determine statistical independence:
1. Events A and B are statistically independent if and only if P(A | B) = P(A) (or P(B | A) = P(B)).
2. Events A and B are statistically independent if and only if P(A and B) = P(A)P(B).

Marginal Probability Using the General Multiplication Rule
In Section 4.1 marginal probability was defined using Equation 4.2, which can be rewritten using the general multiplication rule. If:
P(A) = P(A and B1) + P(A and B2) + … + P(A and Bk)

then, using the general multiplication rule, Equation 4.8 defines the marginal probability.

MARGINAL PROBABILITY USING THE GENERAL MULTIPLICATION RULE
P(A) = P(A | B1)P(B1) + P(A | B2)P(B2) + … + P(A | Bk)P(Bk)   (4.8)
where B1, B2, …, Bk are k mutually exclusive and collectively exhaustive events.

To illustrate this equation, refer to Table 4.1. Using Equation 4.8, the probability of a festival attendee attending a subsequent music festival is:
P(A) = P(A | B)P(B) + P(A | B′)P(B′)
where
P(A) = probability of ‘repeat festival attendance’
P(B) = probability of ‘accommodation voucher used’
P(B′) = probability of ‘accommodation voucher not used’

P(A) = (210/320) × (320/500) + (70/180) × (180/500) = 210/500 + 70/500 = 280/500 = 0.56


Problems for Section 4.2
LEARNING THE BASICS
4.10 Given the following contingency table:
        B     B′
A      10     20
A′     20     40
a. what is the probability of:
i. A | B?
ii. A | B′?
iii. A′ | B′?
b. Are events A and B statistically independent?
4.11 Given the following contingency table:
        B     B′
A      10     30
A′     25     35
a. what is the probability of:
i. A | B?
ii. A′ | B′?
iii. A | B′?
b. Are events A and B statistically independent?
4.12 If P(A and B) = 0.4 and P(B) = 0.8, find P(A | B).
4.13 If P(A) = 0.7, P(B) = 0.6, and A and B are statistically independent, find P(A and B).
4.14 If P(A) = 0.3, P(B) = 0.4, and P(A and B) = 0.2, are A and B statistically independent?

APPLYING THE CONCEPTS
4.15 The following table gives the labour force status of the Australian civilian population aged 15 years and over in May 2017:
6202.0 – Labour Force, Australia, May 2017
Labour force status (aged 15 years and over) ('000)      Male      Female      Total
Employed full-time                                      5,296.0     3,001.2     8,297.2
Employed part-time                                      1,230.5     2,678.2     3,908.7
Unemployed and looking for full-time work                 277.6       205.4       483.0
Unemployed and not looking for full-time work              90.3       130.1       220.4
Not in labour force                                     2,859.4     4,060.5     6,919.9
Total civilian population aged 15 years and over        9,753.8    10,075.4    19,829.2
Data obtained from Australian Bureau of Statistics, Labour Force, Australia, May 2017, Cat. No. 6202.0, accessed 28 June 2017

a. What is the probability that a randomly selected person is female?
b. What is the probability that a randomly selected male is not employed?
c. Suppose you know that a person is employed full-time. What is the probability that they are female?
d. Are the two events ‘employed full-time’ and ‘female’ statistically independent? Explain.
e. What is the probability that a randomly selected person is a male in full-time employment?
f. The unemployment rate is defined as the percentage of the labour force that is unemployed and either looking for full-time work or not looking for full-time work. What is the unemployment rate for males, females and overall?
g. The participation rate is defined as the percentage of the civilian population in the labour force, either employed or unemployed. What is the participation rate for males, females and overall?
4.16 Households in a certain town were surveyed to determine whether they would subscribe to a new Pay TV channel. The households were classified according to ‘high’, ‘medium’ and ‘low’ income levels. The results of the survey are summarised in the table below.
Income level          High      Medium      Low
Will subscribe       3,200       1,920      480
Will not subscribe     800       7,080    2,520
a. What is the probability that:
i. a household will subscribe?
ii. a household is high income?
iii. a household will subscribe and is high income?
iv. a high-income household will subscribe?
v. a household that subscribes is high income?
b. Is income level statistically independent of whether a household subscribes or not? Explain.
4.17 At a certain university, 25% of students are in the business faculty. Of the students in the business faculty, 66% are males. However, only 52% of all students at the university are male.
a. What is the probability that a student selected at random in the university is a male in the business faculty?
b. What is the probability that a student selected at random in the university is male or is in the business faculty?
c. What percentage of males are in the business faculty?
4.18 A sample of 500 consumers was selected in a large metropolitan area to study consumer behaviour with the following results:


                                  Gender
Enjoys shopping for clothing      Male     Female     Total
Yes                                136        224       360
No                                 104         36       140
Total                              240        260       500
a. What is the probability that a randomly chosen female consumer does not enjoy shopping for clothing?
b. Suppose the chosen consumer enjoys shopping for clothing. What is the probability that the individual is male?
c. Are enjoying shopping for clothing and the gender of the individual statistically independent? Explain.
4.19 A study was done to determine the efficacy of three different headache tablets – A, B and C. One thousand study participants used all three tablets (at different times) over the period of the study with the following results:
750 reported relief from tablet A
675 reported relief from tablet B
631 reported relief from tablet C
504 reported relief from both tablets A and B
453 reported relief from both tablets A and C
350 reported relief from both tablets B and C
236 reported relief from all three tablets
a. If a study participant is selected at random, what is the probability that they
i. reported relief from tablet A?
ii. reported relief from tablet B?
iii. reported relief from tablet A and tablet B?
iv. reported relief from tablet A or tablet B?
v. did not report relief from tablet C?
b. What is the probability that, if a participant reported relief from tablet A, they also reported relief from tablet B?
c. What is the probability that, if a participant reported relief from tablet B, they also reported relief from tablet A?
d. Are the events ‘report relief from tablet A’ and ‘report relief from tablet B’ statistically independent? Explain.
4.20 In 59 of the 88 years from 1929 to 2016, the S&P 500 (Standard and Poor’s 500 Index, one of the indices of the New York Stock Exchange that is widely used as a benchmark for the performance of US equity mutual funds) finished higher after the first five days of trading. In 41 of those 59 years the S&P 500 finished higher for the year. Is a good first week a good omen for the upcoming year? The following table gives the first-week and annual performance over this 88-year period:
                  S&P 500’s annual performance
First week        Higher     Not higher
Higher                41             18
Not higher            14             15
a. If a year is selected at random, what is the probability that the S&P finished higher for the year?
b. Given that the S&P 500 finished higher after the first five days of trading, what is the probability that it finished higher for the year?
c. Are the two events, first-week performance and annual performance, statistically independent? Explain.
d. In 2017 the S&P 500 was up 0.8% after the first five days. Look up the 2017 annual performance of the S&P 500 at or elsewhere. Comment on the results.
e. Repeat part (d) for last year.
4.21 A standard deck of cards is being used to play a game. There are four suits (hearts, diamonds, clubs and spades), each having 13 faces (ace, 2 to 10, jack, queen and king), making a total of 52 cards. This complete deck is thoroughly shuffled, and you will receive the first two cards from the deck without replacement.
a. What is the probability that both cards are queens?
b. What is the probability that the first card is a 10 and the second card is a 5 or 6?
c. If you were sampling with replacement, what would be the answer in (a)?
d. In the game of blackjack, the picture cards (jack, queen, king) count as 10 points and the ace counts as either 1 or 11 points. All other cards are counted at their face value. Blackjack is achieved if your two cards total 21 points. What is the probability of getting blackjack in this problem?

LEARNING OBJECTIVE 4
Revise probabilities using Bayes’ theorem

4.3  BAYES’ THEOREM
Bayes’ theorem is used to revise previously calculated probabilities (called prior probabilities) when there is new information. Developed by the Rev. Thomas Bayes in the eighteenth century, Bayes’ theorem is an extension of conditional probability. The conditional probability of B given A is given by Equation 4.4b combined with Equation 4.6:

P(B | A) = P(A and B) / P(A) = P(A | B)P(B) / P(A)

Bayes’ theorem is derived from this by substituting Equation 4.8 for P(A) in the above equation.

Bayes’ theorem Revises previously calculated probabilities when new information becomes available.


BAYES’ THEOREM
P(Bi | A) = P(A | Bi)P(Bi) / [P(A | B1)P(B1) + P(A | B2)P(B2) + … + P(A | Bk)P(Bk)]   (4.9)

where Bi is the ith event out of k mutually exclusive and collectively exhaustive events. The following situation illustrates when Bayes’ theorem can be used. Suppose the Consumer Electronics Company is considering marketing a new model of television. In the past, 40% of the televisions introduced by the company have been successful and 60% have been unsuccessful. Before introducing a new model of television to the marketplace, the marketing research department always conducts an extensive study and releases a report, either favourable or unfavourable. In the past, 80% of the successful televisions had received a favourable market research report and 30% of the unsuccessful televisions had received a favourable report. For the new model of television under consideration, the marketing research department has issued a favourable report. What is the probability that the television will be successful, given this favourable report? To use Equation 4.9 to calculate the required probability P(S | F), first define events:

S = successful television        F = favourable report
S′ = unsuccessful television     F′ = unfavourable report
then:
P(S) = 0.40      P(F | S) = 0.80
P(S′) = 0.60     P(F | S′) = 0.30

Therefore, using Equation 4.9:
P(S | F) = P(F | S)P(S) / [P(F | S)P(S) + P(F | S′)P(S′)]
         = (0.80)(0.40) / [(0.80)(0.40) + (0.30)(0.60)]
         = 0.32 / (0.32 + 0.18) = 0.32/0.50
         = 0.64

The probability of a successful television, given that a favourable report was received, is 0.64. Thus, the probability of an unsuccessful television, given that a favourable report was received, is 1 − 0.64 = 0.36. Table 4.3 summarises the calculation of the probabilities and Figure 4.5 presents the decision tree. The denominator in Bayes’ theorem represents P(F), the probability of a favourable report. This shows the connection between Equations 4.4a and 4.4b with Equation 4.9, reflecting that Bayes’ theorem is a special case of conditional probability.

Table 4.3  Bayes’ theorem calculations for the television-marketing example

Event Si                             Prior probability P(Si)   Conditional probability P(F | Si)   Joint probability P(F and Si) = P(F | Si)P(Si)   Revised probability P(Si | F)
S = successful television set                  0.40                         0.80                                 0.32                              0.32/0.50 = 0.64 = P(S | F)
S′ = unsuccessful television set               0.60                         0.30                                 0.18                              0.18/0.50 = 0.36 = P(S′ | F)
                                                                                                                 0.50 = P(F)
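Equation 4.9 is straightforward to program. The following minimal Python sketch (the function name bayes and its structure are illustrative, not from the text) reproduces the television-marketing result and, with different inputs, the medical diagnosis result of Example 4.10:

```python
def bayes(prior, likelihood):
    """Revise prior probabilities P(Bi) given likelihoods P(A | Bi) (Equation 4.9)."""
    joint = [p * l for p, l in zip(prior, likelihood)]   # P(A | Bi)P(Bi)
    p_a = sum(joint)                                     # denominator: P(A)
    return [j / p_a for j in joint]                      # P(Bi | A)

# Television-marketing example: prior P(S), P(S'); likelihoods P(F | S), P(F | S')
print([round(p, 2) for p in bayes([0.40, 0.60], [0.80, 0.30])])   # [0.64, 0.36]

# Medical diagnosis example (Example 4.10): prior P(D), P(D'); likelihoods P(T | D), P(T | D')
posterior = bayes([0.03, 0.97], [0.90, 0.02])
print(round(posterior[0], 3))                                     # 0.582
```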


Figure 4.5  Decision tree for marketing a new television set: P(S and F) = (0.80)(0.40) = 0.32, P(S and F′) = (0.20)(0.40) = 0.08, P(S′ and F) = (0.30)(0.60) = 0.18, P(S′ and F′) = (0.70)(0.60) = 0.42

Example 4.10 applies Bayes’ theorem to a medical diagnosis problem.

EXAMPLE 4.10  USING BAYES’ THEOREM IN A MEDICAL DIAGNOSIS PROBLEM
The probability that a person has a certain disease is 0.03. Medical diagnostic tests are available to determine whether a person has the disease. If the disease is present, the probability that the medical diagnostic test will give a positive result (indicating that the disease is present) is 0.90. If the disease is not present, the probability of a positive test result (indicating that the disease is present when it is not, called a false positive) is 0.02. Suppose that the medical diagnostic test has given a positive result. What is the probability that the disease is present, given the positive test result? What is the probability of a positive test result?

SOLUTION

Define events:

D = has disease              T = test is positive
D′ = does not have disease   T′ = test is negative

We are given:

P(D) = 0.03    P(D′) = 0.97
P(T | D) = 0.90    P(T | D′) = 0.02

Using Equation 4.9 to calculate P(D | T) – that is, the probability that the disease is present, given the positive test result – we obtain:

P(D | T) = P(T | D)P(D) / [P(T | D)P(D) + P(T | D′)P(D′)]
         = (0.90)(0.03) / [(0.90)(0.03) + (0.02)(0.97)]
         = 0.0270 / (0.0270 + 0.0194)
         = 0.0270/0.0464
         = 0.5818…

The probability that the disease is present, given a positive result has occurred (indicating that the disease is present), is 0.582. This means that if a person returns a positive test result, there is only a 58% chance they have the disease. Table 4.4 summarises the calculation of the probabilities and Figure 4.6 presents the decision tree.



Table 4.4  Bayes’ theorem calculations for the medical diagnosis problem

Event Di                     Prior probability   Conditional probability   Joint probability    Revised probability
                             P(Di)               P(T | Di)                 P(T | Di)P(Di)       P(Di | T)
D  = has disease             0.03                0.90                      0.0270               0.0270/0.0464 = 0.582 = P(D | T)
D′ = does not have disease   0.97                0.02                      0.0194               0.0194/0.0464 = 0.418 = P(D′ | T)
                                                                           0.0464

The denominator in Bayes’ theorem represents P(T), the probability of a positive test result, which in this case is 0.0464, or 4.64%.

Figure 4.6  Decision tree for the medical diagnosis problem

P(D) = 0.03:   P(D and T)  = P(T | D)P(D)    = (0.90)(0.03) = 0.0270
               P(D and T′) = P(T′ | D)P(D)   = (0.10)(0.03) = 0.0030
P(D′) = 0.97:  P(D′ and T)  = P(T | D′)P(D′)  = (0.02)(0.97) = 0.0194
               P(D′ and T′) = P(T′ | D′)P(D′) = (0.98)(0.97) = 0.9506
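Both worked examples follow the same two-event pattern, so they are easy to check numerically. The short Python sketch below is ours rather than the text’s (the helper name bayes_two_events is invented); it simply evaluates Equation 4.9 for the television-marketing and medical-diagnosis figures.

def bayes_two_events(prior_b, p_a_given_b, p_a_given_not_b):
    """Return P(B | A) and P(A) for two mutually exclusive, collectively
    exhaustive events B and B', using Equation 4.9."""
    p_a = p_a_given_b * prior_b + p_a_given_not_b * (1 - prior_b)  # denominator = P(A)
    return p_a_given_b * prior_b / p_a, p_a

# Television-marketing example: P(S) = 0.40, P(F | S) = 0.80, P(F | S') = 0.30
p_success_given_fav, p_fav = bayes_two_events(0.40, 0.80, 0.30)
print(round(p_success_given_fav, 2), round(p_fav, 2))      # 0.64 0.5

# Medical diagnosis example: P(D) = 0.03, P(T | D) = 0.90, P(T | D') = 0.02
p_disease_given_pos, p_pos = bayes_two_events(0.03, 0.90, 0.02)
print(round(p_disease_given_pos, 3), round(p_pos, 4))      # 0.582 0.0464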

THINK ABOUT THIS  Divine providence and spam

Would you ever guess that the essays Divine Benevolence: Or, An Attempt to Prove that the Principal End of the Divine Providence and Government is the Happiness of His Creatures and An Essay Towards Solving a Problem in the Doctrine of Chances were written by the same person? Probably not, and in doing so you illustrate a modern-day application of Bayesian statistics: spam, or junk mail, filters.

In not guessing correctly, you probably looked at the words in the titles of the essays and concluded that they were talking about two different things. An implicit rule you used was that word frequencies vary by subject matter. A statistics essay would very likely contain the word statistics as well as words such as chance, problem and solving. An eighteenth-century essay about theology and religion would be more likely to contain the uppercase forms of Divine and Providence. Likewise, there are words that you would guess to be very unlikely to appear in either book, such as technical terms from finance, and words that are most likely to appear in both – common words such as a, and and the.

That words would either be likely or unlikely suggests an application of probability theory. Of course, likely and unlikely are fuzzy concepts, and we might occasionally misclassify an essay if we kept things too simple, such as relying solely on the occurrence of the words Divine and Providence.





For example, a profile of the late Harris Milstead, better known as Divine, the star of Hairspray and other films, visiting Providence (Rhode Island), would most certainly not be an essay about theology. But if we widened the number of words we examined and found such words as movie or the name John Waters (Divine’s director in many films), we probably would quickly realise the essay had something to do with twentieth-century cinema and little to do with theology and religion.

We can use a similar process to try to classify a new email message in your inbox as either spam or a legitimate message (called ‘ham’ in this context). We would first need to add to your email program a ‘spam filter’ that has the ability to track word frequencies associated with spam and ham messages as you identify them on a day-to-day basis. This would allow the filter constantly to update the prior probabilities necessary to use Bayes’ theorem. With these probabilities, the filter can ask, ‘What is the probability that an email is spam, given the presence of a certain word?’ Applying the terms of Equation 4.9, such a Bayesian spam filter would multiply the probability of finding the word in a spam email, P(A | B), by the probability that the email is spam, P(B), and then divide by the probability of finding the word in an email, the denominator in Equation 4.9. Bayesian spam filters also use shortcuts by focusing on a small set of words that have a high probability of being found in a spam message and on a small set of other words that have a low probability of being found in a spam message.

As spammers (people who send junk email) learned of such new filters, they tried to outfox them. Having learned that Bayesian filters might be assigning a high P(A | B) value to words commonly found in spam, such as Viagra, spammers thought they could fool the filter by misspelling the word as Vi@gr@ or V1agra. What they overlooked was that the misspelled variants were even more likely to be found in a spam message than the original word. Thus, the misspelled variants made the job of spotting spam easier for the Bayesian filters.

Other spammers tried to fool the filters by adding ‘good’ words, words that would have a low probability of being found in a spam message, or ‘rare’ words, words not frequently encountered in any message. But these spammers overlooked the fact that the conditional probabilities are constantly updated and that words once considered ‘good’ would soon be discarded from the good list by the filter as their P(A | B) value increased. Likewise, as ‘rare’ words grew more common in spam and yet stayed rare in ham, such words acted like the misspelled variants that others had tried earlier.

Even then, and perhaps after reading about Bayesian statistics, spammers thought that they could ‘break’ Bayesian filters by inserting random words in their messages. Those random words would affect the filter by causing it to see many words whose P(A | B) value would be low. The Bayesian filter would begin to label many spam messages as ham and end up being of no practical use. Spammers again overlooked that conditional probabilities are constantly updated.

Other spammers decided to eliminate all or most of the words in their messages and replace them with graphics so that Bayesian filters would have very few words with which to form conditional probabilities. However, this approach failed too, as Bayesian filters were rewritten to consider things other than words in a message. After all, Bayes’ theorem concerns events, and ‘graphics present with no text’ is as valid an event as ‘some word, X, present in a message’. Other future tricks will ultimately fail for the same reason. (By the way, spam filters use non-Bayesian techniques as well, which make spammers’ lives even more difficult.)

Bayesian spam filters are an example of the unexpected way that applications of statistics can show up in your daily life. You will discover more examples as you read the rest of this book. Incidentally, the author of the two essays mentioned earlier was Thomas Bayes, who is a lot more famous for the second essay than the first, a failed attempt to use mathematics and logic to prove the existence of God.
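As a minimal sketch of the single-word calculation described above (this is not from the text, and the word probabilities are invented for illustration), a Bayesian spam score is just Equation 4.9 with B = ‘the message is spam’ and A = ‘the message contains the word’:

# Hypothetical training estimates for one word:
p_word_given_spam = 0.40   # P(A | B): word appears in spam
p_word_given_ham = 0.001   # P(A | B'): word appears in legitimate mail (ham)
p_spam = 0.50              # P(B): prior proportion of spam

# Equation 4.9: P(spam | word) = P(word | spam)P(spam) / P(word)
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(round(p_spam_given_word, 4))   # about 0.9975 with these made-up values

A real filter combines many such word scores and keeps re-estimating the conditional probabilities from newly labelled messages, which is why the spammers’ tricks described above keep failing.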



Problems for Section 4.3

LEARNING THE BASICS
4.22 If P(B) = 0.05, P(A | B) = 0.80 and P(A | B′) = 0.40, find P(B | A).
4.23 If P(B) = 0.30, P(A | B) = 0.60 and P(A | B′) = 0.50, find P(B | A).

APPLYING THE CONCEPTS
4.24 In Example 4.10 on page 165, suppose that the probability that the test will return a false positive (that is, the medical diagnostic test gives a positive result when the disease is not present) is reduced from 0.02 to 0.01. Given this information:
a. If the medical diagnostic test has given a positive result (indicating the disease is present), what is the probability that the disease is present?
b. If the medical diagnostic test has given a positive result, what is the probability that the disease is not present?
c. If the medical diagnostic test has given a negative result (indicating that the disease is not present), what is the probability that the disease is not present?
d. If the medical diagnostic test has given a negative result, what is the probability that the disease is present?
4.25 An advertising executive is studying the television viewing habits of married men and women during prime-time hours. On the basis of past viewing records, the executive has determined that, during prime time, husbands are watching television 60% of the time. When the husband is watching television, 40% of the time the wife is also watching. When the husband is not watching television, 30% of the time the wife is watching television. Find the probability that
a. if the wife is watching television, the husband is also watching television.
b. the wife is watching television in prime time.
4.26 The editor of a textbook-publishing company is trying to decide whether to publish a proposed business statistics textbook. Information on previous textbooks published indicates that 10% are huge successes, 20% are modest successes, 40% break even and 30% are failures. However, before a publishing decision is made, the book will be reviewed. In the past, 99% of the huge successes received favourable reviews, 70% of the moderate successes received favourable reviews, 40% of the break-even books received favourable reviews and 20% of the failures received favourable reviews.
a. If the proposed text receives a favourable review, how should the editor revise the probabilities of the various outcomes to take this information into account? (Hint: Derive the conditional probabilities for each outcome given that a favourable review has been received.)
b. What proportion of textbooks receive favourable reviews?
4.27 From past records of personal loans the Check$mart Bank found that 10% of borrowers default on their loan – that is, they fail to pay. It also found that, of those who default, 32% are unemployed while, of those who do not default, only 2% are unemployed.
a. What percentage of unemployed borrowers default?
b. What proportion of borrowers are unemployed?
c. What proportion of borrowers who are not unemployed do not default?

LEARNING OBJECTIVE 5
Use counting rules to calculate the number of possible outcomes

4.4  COUNTING RULES
In Equation 4.1 the probability of occurrence of an outcome was defined as the number of ways the outcome occurs divided by the total number of possible outcomes. In many instances, there is a large number of possible outcomes and it is difficult to determine the exact number. In these circumstances, rules for counting the number of possible outcomes have been developed. Five different counting rules are introduced in this section.

COUNTING RULE 1
If any one of k different mutually exclusive and collectively exhaustive events can occur on each of n trials, the number of possible outcomes is equal to

kⁿ    (4.10)

EXAMPLE 4.11
COUNTING RULE 1
Suppose you toss a coin five times. What is the number of different possible outcomes (the sequences of heads and tails)?





SOLUTION

If you toss a coin (with two sides) five times, using Equation 4.10 the number of possible outcomes is 2⁵ = 2 × 2 × 2 × 2 × 2 = 32.

EXAMPLE 4.12
ROLLING A DIE TWICE
Suppose you roll a die twice. How many different possible outcomes can occur?

SOLUTION

If a die (having six sides) is rolled twice, using Equation 4.10 the number of different possible outcomes is 6² = 36.

The second counting rule is a more general version of the first, and allows for the number of possible events to differ from trial to trial.

COUNTING RULE 2
If there are k1 events on the first trial, k2 events on the second trial, …, and kn events on the nth trial, then the number of possible outcomes is

k1 × k2 × … × kn    (4.11)



EXAMPLE 4.13
COUNTING RULE 2
At one stage, standard New South Wales vehicle number plates consisted of three letters followed by three digits. How many possible number plates are there of this form?

SOLUTION

Using Equation 4.11, if a number plate consists of three letters (A to Z) followed by three numbers (0 to 9), the total number of number plates of this form is: 26 × 26 × 26 × 10 × 10 × 10 = 26³ × 10³ = 17,576,000.

EXAMPLE 4.14
DETERMINING THE NUMBER OF DIFFERENT DINNERS
A restaurant menu has a fixed-price dinner consisting of an entrée, a main, a beverage and a dessert. There is a choice of ten entrées, five mains, three beverages and six desserts. Determine the total number of possible dinners.

SOLUTION

Using Equation 4.11, the total number of possible dinners is 10 × 5 × 3 × 6 = 900.

The third counting rule involves the calculation of the number of ways that a set of items can be arranged in order.

COUNTING RULE 3
The number of ways that n items can be arranged in order is

n! = n × (n − 1) × … × 2 × 1    (4.12)

where n! is called n factorial and 0! is defined as 1.



EXAMPLE 4.15
COUNTING RULE 3
If a set of six textbooks is to be placed on a shelf, in how many ways can the six books be arranged?

SOLUTION

Any of the six books could occupy the first position on the shelf. Once the first position is filled, there are five books to choose from in filling the second. Continue this assignment procedure until all the positions are occupied. The number of ways that the six books can be arranged is 6! = 6 × 5 × 4 × 3 × 2 × 1 = 720.

permutation Ordered selection of items.

In many instances we need to know the number of ways in which a subset of the entire group of items can be arranged in order. Each possible ordered arrangement is called a permutation.

COUNTING RULE 4 – PERMUTATIONS
The number of ways of arranging X objects selected from n objects in order is

nPX = n!/(n − X)!    (4.13)

EXAMPLE 4.16
COUNTING RULE 4
Modifying Example 4.15, if there are six textbooks but room for only four books on the shelf, in how many ways can these books be arranged on the shelf?

SOLUTION

Using Equation 4.13, the number of ordered arrangements of four books selected from six books is equal to:

6P4 = 6!/(6 − 4)! = 6!/2! = 360

Alternatively, any of the six books could occupy the first position. Once the first position is filled, there are five books to choose from in filling the second. Continue this assignment procedure until four books are placed on the shelf. Therefore, the number of ordered arrangements of four books selected from six is: 6 × 5 × 4 × 3 = 360.

combination Unordered selection of items.

In other situations we are not interested in the order of the outcomes, but only in the number of ways that X items can be selected from n items, irrespective of order. Each unordered selection is called a combination.

COUNTING RULE 5 – COMBINATIONS
The number of ways of selecting X objects from n objects, irrespective of order, is equal to:

nCX = n!/[X!(n − X)!]    (4.14)





Comparing equations 4.13 and 4.14, it can be seen that they differ only in the inclusion of a term X! in the denominator of equation 4.14. When permutations are used, all the arrangements of the chosen X objects are distinguishable. With combinations, the X! possible arrangements of the chosen X objects are irrelevant.

EXAMPLE 4.17
COUNTING RULE 5
Modifying Example 4.16, in how many ways can you choose four books to place on the shelf?

SOLUTION

Using Equation 4.14, the number of combinations of four books selected from six books is equal to:

6C4 = 6!/[4!(6 − 4)!] = 6!/(4!2!) = 15
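The worked examples above (4.11 to 4.17) can be checked with Python’s standard math module. This is a quick sketch rather than part of the text; math.perm and math.comb require Python 3.8 or later.

import math

print(2 ** 5)                # Counting rule 1, Example 4.11: 32
print(6 ** 2)                # Counting rule 1, Example 4.12: 36
print(26 ** 3 * 10 ** 3)     # Counting rule 2, Example 4.13: 17576000
print(10 * 5 * 3 * 6)        # Counting rule 2, Example 4.14: 900
print(math.factorial(6))     # Counting rule 3, Example 4.15: 720
print(math.perm(6, 4))       # Counting rule 4, Example 4.16: 360
print(math.comb(6, 4))       # Counting rule 5, Example 4.17: 15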

Problems for Section 4.4

APPLYING THE CONCEPTS
4.28 If there are 10 multiple-choice questions in an exam, each with three possible answers:
a. How many different answer sequences are there?
b. If you answer the questions randomly, what is the probability that you get all 10 correct?
4.29 A lock on a bank vault consists of three dials, each with 30 positions. To open the vault, each of the three dials must be in the correct position.
a. How many different possible dial combinations are there for this lock?
b. What is the probability that, if you randomly select a position on each dial, you will be able to open the bank vault?
c. Explain why ‘dial combinations’ are not mathematical combinations expressed by Equation 4.14.
4.30 A particular brand of women’s jeans is available in seven different sizes, three different colours and three different styles. How many different jeans does the store manager need to order to have one pair of each type?
4.31 Greenway Gardens has a $10 salad box consisting of lettuce, tomatoes, cucumber, sprouts, capsicum, avocado and a bottle of Greenway’s special salad dressing. Suppose that at present there is a choice of eight types of lettuce, four types of tomatoes, three types of cucumbers, three types of sprouts and no choice for capsicum, avocado and dressing. How many different salad boxes are there?
4.32 If each letter is used once, how many different arrangements are there of:
a. Grafton? b. Otaki? c. Darwin? d. Gore?
4.33 Currently, new standard New South Wales vehicle number plates consist of two letters followed by two digits followed by two letters. How many possible number plates are there of this form?
4.34 Each employee of a large firm has an ID number consisting of their initials (either two or three) followed by two digits. What is the maximum number of unique ID numbers generated by this system?
4.35 A trifecta consists of picking the correct finishing order of the first three horses in a race. Suppose 12 horses are entered in a race.
a. How many trifecta outcomes are there for this race?
b. If you choose three horses randomly, what is the probability that you win the trifecta?
4.36 Nine passengers are on a waiting list for an overbooked flight. Due to cancellations, four seats are available. How many ways are there, regardless of order, to allocate the four seats?
4.37 A daily lottery is conducted in which two winning numbers are selected out of 100 numbers.
a. How many different combinations of winning numbers are possible?
b. Suppose that you have an entry in this lottery – what is your probability of winning?
4.38 A reading list for a unit contains 20 articles. How many ways are there to choose three articles from this list?



4.5  ETHICAL ISSUES AND PROBABILITY
Ethical issues can arise when any statements relating to probability are presented to the public, particularly when these statements are part of an advertising campaign for a product or service. Unfortunately, many people are not comfortable with numerical concepts and tend to misinterpret the meaning of the probability. In some instances, the misinterpretation is not intentional but, in other cases, advertisements may unethically try to mislead potential customers.

A commercial for a Lotto game that said ‘We won’t stop until we have made everyone a millionaire’ would be a deceptive and possibly unethical application of probability. When purchasing a Lotto ticket, the customer selects a set of numbers (such as 6) from a larger list of numbers (such as 45). Although virtually all participants know that they are unlikely to win a first-division prize (select all six of the winning numbers drawn), they also have very little idea of how small the probability is (1 in 8,145,060 if selecting 6 from 45). Given the fact that Lotto makes millions of dollars, it is unlikely to stop running, so the statement made is true. However, it may also be misleading as, in a lifetime, no one can be certain of becoming a millionaire by winning Lotto.

A statement in an investment newsletter promising a 90% probability of a 20% annual return on an investment is another example of a potentially unethical application of probability. To make the claim in the newsletter an ethical one, the author needs to (a) explain the basis on which this probability estimate rests, (b) provide the probability statement in another format, such as 9 chances in 10, and (c) explain what happens to the investment in the 10% of cases in which a 20% return is not achieved (e.g. Is the entire investment lost?).

Other ethical issues arise when probabilities are calculated from non-representative samples. An example of this occurred during the 2007 Australian federal election campaign, when a leaflet from the Christian Democratic Party included the following:

Daily Telegraph Tele’s Voteline published on 31 March 2007
Fred Nile’s Christian Democrats are calling for an immediate moratorium on Islamic immigration. Do you agree? YES 99%

As well as being overtly discriminatory, there are several problems with this probability.
• The population sampled from is readers of the Daily Telegraph, who may not be representative of the Australian electorate.
• The sample is self-selected; readers have to ring the voteline at a cost of 55 cents a call. Therefore, only those who feel strongly about an issue, for or against, are likely to vote.
• The sample size is not given, so we do not know whether the probability is based on only a few votes or a large number of votes. From the Daily Telegraph the sample size was 972: Yes 960 and No 12.
• There is no mechanism to stop an individual voting more than once. The worst-case scenario is that this probability is based on the votes of two individuals, one voting Yes 960 times and the other No 12 times.
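The 1 in 8,145,060 figure quoted earlier for selecting 6 winning numbers from 45 can be checked with counting rule 5. This is a quick sketch, not part of the text.

import math

# Choosing 6 numbers from 45, order irrelevant (counting rule 5)
combinations = math.comb(45, 6)
print(combinations)          # 8145060
print(1 / combinations)      # about 1.23e-07, i.e. 1 chance in 8,145,060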

Problems for Section 4.5

APPLYING THE CONCEPTS
4.39 Write an advertisement for:
a. Lotto that ethically describes the probability of winning
b. the investment newsletter that ethically states the probability of a 20% return

4.40 Find an example online or in print of an unethical or misleading use of probability.





CHAPTER 4  Assess your progress

Summary
This chapter developed concepts concerning basic probability, conditional probability, Bayes’ theorem and counting rules. In the next chapter, important discrete probability distributions such as the binomial, hypergeometric and Poisson distributions will be considered.

Key formulas

Probability of occurrence
Probability of occurrence = X/T    (4.1)

Marginal probability
P(A) = P(A and B1) + P(A and B2) + … + P(A and Bk)    (4.2)

General addition rule
P(A or B) = P(A) + P(B) − P(A and B)    (4.3)

Conditional probability
P(A | B) = P(A and B)/P(B)    (4.4a)
P(B | A) = P(A and B)/P(A)    (4.4b)

Statistical independence
P(A | B) = P(A) (and P(B | A) = P(B))    (4.5)

General multiplication rule
P(A and B) = P(A | B)P(B) = P(B | A)P(A)    (4.6)

Multiplication rule for independent events
P(A and B) = P(A)P(B)    (4.7)

Marginal probability using the general multiplication rule
P(A) = P(A | B1)P(B1) + P(A | B2)P(B2) + … + P(A | Bk)P(Bk)    (4.8)

Bayes’ theorem
P(Bi | A) = P(A | Bi)P(Bi) / [P(A | B1)P(B1) + P(A | B2)P(B2) + … + P(A | Bk)P(Bk)]    (4.9)

Counting rule 1
kⁿ    (4.10)

Counting rule 2
k1 × k2 × … × kn    (4.11)

Factorials
n! = n × (n − 1) × … × 2 × 1    (4.12)

Permutations
nPX = n!/(n − X)!    (4.13)

Combinations
nCX = n!/[X!(n − X)!]    (4.14)

Key terms
a priori classical probability 148
Bayes’ theorem 163
certain event 148
collectively exhaustive 153
combination 170
complement 150
conditional probability 156
contingency (cross-classification) table – probability 150
decision tree 158
empirical classical probability 149
event 149
general addition rule 154
general multiplication rule 160
impossible event 148
joint event 150
joint probability 152
marginal probability 151
multiplication rule for independent events 161
mutually exclusive 153
permutation 170
probability 148
random experiment 149
sample space 149
simple event 149
statistical independence 159
subjective probability 149
Venn diagram 150



Chapter review problems

CHECKING YOUR UNDERSTANDING
4.41 What are the differences between a priori classical probability, empirical classical probability and subjective probability?
4.42 What is the difference between a simple event and a joint event?
4.43 How can you use the addition rule to find the probability of occurrence of event A or B?
4.44 What is the difference between mutually exclusive events and collectively exhaustive events?
4.45 How does conditional probability relate to the concept of statistical independence?
4.46 How does the multiplication rule differ for events that are and are not independent?
4.47 How can you use Bayes’ theorem to revise probabilities in light of new information?
4.48 What is the difference between a permutation and a combination?

APPLYING THE CONCEPTS
4.49 The breakdown by home address of the previous year’s 993 drink-driving offences in Problem 2.67 is:

Home address                           Number of drink-driving offences
Local – in council area
  Seaside town                         151
  Not seaside town                     462
Not local – not in council area
  Intrastate (within state)            130
  Interstate (another state)           228
  International (outside Australia)     22

If a drink-driving offender is selected at random, what is the probability that:
a. the offender is local?
b. the offender is from another state?
c. a non-local offender is from another state?
d. a local offender is from outside the seaside town?
e. the offender is from outside the state?
4.50 In a school of 200 students 95% are vaccinated against a certain disease. During a recent outbreak of this disease 20 students, including 11 vaccinated students, developed the disease.
a. Find the probability that a student
 i. who has the disease has been vaccinated
 ii. who has been vaccinated catches the disease
 iii. who is unvaccinated catches the disease
b. A parent states that vaccination is ineffective as more than 50% of those who developed the disease had been vaccinated. Comment on this.
4.51 The Melbourne Cup, held on the first Tuesday in November, has 24 horses entered in it.
a. What is the probability of winning a prize in an office sweep (where horses are randomly allocated), if prizes are given for first, second and third places?
b. In a trifecta three horses are selected to finish first, second and third in the correct order. How many possible trifectas are there in the Melbourne Cup?
c. How many combinations of the winning three horses are not trifectas – that is, the selected horses finish first, second and third but not in the correct order?
d. Suppose that you have a sweep ticket (where horses are randomly allocated) for the trifecta. What is your probability of winning the major prize (the trifecta) or a consolation prize (you have the three winning horses but in the wrong order)?
4.52 In March 2013, 26.8% of New South Wales dwellings suitable for a rainwater tank had one installed. Of the dwellings with a rainwater tank, 53.1% had the rainwater tank plumbed into the dwelling (Australian Bureau of Statistics, Environmental Issues: Water Use and Conservation, Mar 2013, Cat. No. 4602.0.55.003 accessed 4 November 2013).
a. Complete the following contingency table for this problem:

                     Plumbed into dwelling   Not plumbed into dwelling   Total
Rainwater tank
No rainwater tank
Total                                                                    0.0000

b. From part (a) or otherwise, answer the following, to four decimal places:
 i. What proportion of suitable New South Wales dwellings have a rainwater tank that is not plumbed into the dwelling?
 ii. What percentage of New South Wales dwellings that have a rainwater tank do not have the tank plumbed into the dwelling?
 iii. What proportion of New South Wales dwellings that are suitable for a rainwater tank do not have one?
c. There are an estimated 2,268,800 dwellings in New South Wales that are suitable for a rainwater tank. Estimate the number of dwellings with a rainwater tank plumbed into the dwelling.
4.53 When calculating premiums on life insurance products insurance companies often use life tables that enable the probability of a person dying in any age interval to be calculated. The following data obtained from New Zealand Abridged Period Life Table: 2014–2016 gives the number out of 100,000 New Zealand-born females and males who are still alive during each five-year period of life between age 20 and 60 (inclusive).





                      Number alive at exact age
Exact age (years)     Out of 100,000 females born   Out of 100,000 males born
20                    99,288                        99,031
25                    99,128                        98,685
30                    98,949                        98,312
35                    98,726                        97,899
40                    98,427                        97,381
45                    97,934                        96,649
50                    97,157                        95,548
55                    95,933                        93,853
60                    94,162                        91,352

Data obtained from accessed June 2017. © Statistics New Zealand, and licensed by Statistics New Zealand for re-use under the Creative Commons Attribution 3.0 New Zealand licence

a. What is the probability that a New Zealand-born female will reach the age of 30?
b. What is the probability that a New Zealand-born female will reach the age of 45?
c. What is the probability that a 20-year-old New Zealand-born female will reach the age of 30?
d. What is the probability that a 20-year-old New Zealand-born female will reach the age of 40?
e. A 30-year-old New Zealand-born female has purchased a term life policy that will pay her estate a million dollars if she dies within five years. What is the probability that the insurance company will pay her estate this amount?
f. Repeat (a) to (e) for New Zealand-born males.
4.54 In a certain region, during a recent outbreak of a preventable disease 0.1% of primary school children caught the disease; of these 30% were vaccinated against it. Furthermore, of those who did not catch the disease 80% were vaccinated.
a. What percentage of vaccinated children caught the disease?
b. What percentage of unvaccinated children caught the disease?
c. What percentage of primary school children in the region are vaccinated against this disease?
4.55 In an online test, 10 multiple-choice questions are randomly selected from a test bank of 100 questions.
a. If the order in which the questions appear is immaterial, how many different tests can be generated?
b. If the order in which the questions appear is important, how many different tests can be generated?
4.56 The employees of a company were surveyed and asked their educational background and marital status. Of the 600 employees, 400 had university degrees, 100 were single and 60 were single university graduates.
a. Construct a contingency table for this problem.
b. Find the probability that a randomly selected employee of the company is single or has a university degree.
c. What percentage of single employees have university degrees?
d. Are marital status and educational background statistically independent? Explain.

4.57 A researcher has completed a survey of 10,000 New Zealand viewers to determine which channel they watch on a weekday during the 6.30 pm to 7.30 pm time-slot, with the following results:

Channel             Number
TV One              3,160
TV2                 1,940
TV3                 2,190
Prime                 860
Maori Television      650
Other or none       1,200



A surveyed viewer is chosen at random. Find the probability that during the 6.30 pm to 7.30 pm time-slot the viewer:
a. watches TV One
b. watches TV2 or TV3
c. watches Prime
d. does not watch TV One, TV2 or TV3
4.58 The following table classifies residents of a regional area of New South Wales by gender and age.

Age groups          Males   Females   Persons
0–4 years             410       369       779
5–14 years            952       861     1,813
15–19 years           478       501       979
20–24 years           594       559     1,153
25–34 years           859       885     1,744
35–44 years           886       974     1,860
45–54 years         1,026     1,105     2,131
55–64 years         1,097     1,033     2,130
65–74 years           677       703     1,380
75–84 years           333       492       825
85 years and over     154       327       481
Total               7,466     7,809    15,275

Data obtained from Australian Bureau of Statistics, Census of Population and Housing: General Community Profile, Australia, 2016 accessed June 2017

a. If a resident is chosen at random, what is the probability that the resident:
 i. is male?
 ii. is a female aged at least 65 years?
 iii. is a child under 15 years?
b. What proportion of children, defined as under 15 years, are male?
c. Are the events ‘Child under 15’ and ‘Male’ statistically independent? Justify your answer.
d. What is the probability that a female chosen at random is at least 65 years?
e. Access the Community Profiles for the 2016 Census at for a selected location in Australia and repeat parts (a) to (d).
4.59 The following table classifies residents of a regional area of Queensland by gender, age and hours of unpaid domestic work in the week before the 2016 Census.



                     Did unpaid domestic work                                               Did no unpaid
                     Less than    5–14      15–29     30 hours                             domestic
                     5 hours      hours     hours     or more                              work            Total
Males
15–19 years             681          97         7          0                                  602          1,387
20–24 years             562         191        32         17                                  524          1,326
25–34 years           1,176         768       134         54                                  712          2,844
35–44 years           1,119       1,101       285         95                                  537          3,137
45–54 years           1,045       1,068       264        106                                  663          3,146
55–64 years             878       1,055       259        121                                  802          3,115
65–74 years             488         738       317        146                                  780          2,469
75–84 years             190         301       163        116                                  496          1,266
85 years and over        67          77        48         39                                  273            504
Total males           6,206       5,396     1,509        694                                5,389         19,194

Females
15–19 years             688         123        13          6                                  480          1,310
20–24 years             539         365        70         49                                  290          1,313
25–34 years             845       1,070       405        450                                  342          3,112
35–44 years             520       1,127       753        725                                  275          3,400
45–54 years             620       1,369       722        453                                  356          3,520
55–64 years             529       1,339       655        416                                  527          3,466
65–74 years             238         669       595        461                                  639          2,602
75–84 years             160         279       240        234                                  537          1,450
85 years and over       107         118        71         46                                  483            825
Total females         4,246       6,459     3,524      2,840                                3,929         20,998

Data obtained from Australian Bureau of Statistics, Census of Population and Housing: General Community Profile, Australia, 2016 accessed June 2017

a. If a resident is chosen at random, what is the probability that the resident:
 i. did unpaid domestic work?
 ii. did no unpaid domestic work and is female?
 iii. did unpaid domestic work and is male?
 iv. did at least 15 hours’ unpaid domestic work and is male?
 v. did no unpaid domestic work and is male?
b. What proportion of male residents did unpaid domestic work?
c. What percentage of female residents did unpaid domestic work?
d. From parts (a) and (b), are the events ‘Male’ and ‘Did unpaid domestic work’ statistically independent? Justify your answer.
e. What proportion of men do:
 i. at least 15 hours of unpaid domestic work?
 ii. less than five hours of unpaid domestic work (including no unpaid domestic work)?
f. What proportion of women do:
 i. at least 15 hours of unpaid domestic work?
 ii. less than five hours of unpaid domestic work (including no unpaid domestic work)?
g. From parts (e) and (f), can you conclude that men do less unpaid domestic work than women?
h. What proportion of male residents aged at least 65 did no unpaid domestic work?
i. What percentage of female residents aged at least 65 did no unpaid domestic work?
j. What proportion of male residents aged under 35 did unpaid domestic work?
k. What percentage of female residents aged under 35 did unpaid domestic work?
l. What conclusions can you draw from parts (h) to (k)?
m. Access the Community Profiles for the 2016 Census at for a selected location in Australia and repeat parts (a) to (l).
4.60 In a town, 45% of all households have a pet, 35% have children, and 40% of all households with children have a pet. Using these definitions:
P = event household has a pet
C = event household has children
a. Complete the following contingency table.

h. What proportion of male residents aged at least 65 did no unpaid domestic work? i. What percentage of female residents aged at least 65 did no unpaid domestic work? j. What proportion of male residents aged under 35 did unpaid domestic work? k. What percentage of female residents aged under 35 did unpaid domestic work? l. What conclusions can you draw from parts (h) to (k)? m. Access the Community Profiles for the 2016 Census at for a selected location in Australia and repeat parts (a) to (l). 4.60 In a town, 45% of all households have a pet, 35% have children, and 40% of all households with children have a pet. Using these definitions: P 5 event household has a pet C 5 event household has children a. Complete the following contingency table. P

P9

Total

C C9 Total





b. From part (a) or otherwise, answer the following:
 i. What is the probability that a randomly selected household has neither pets nor children?
 ii. What proportion of households with children do not have a pet?
 iii. Find and interpret P(C | P).

Continuing cases

Tasman University
Tasman University’s Tasman Business School (TBS) regularly surveys business students on a number of issues. In particular, students within the school are asked to complete a student survey when they receive their grades each semester. The results of Bachelor of Business (BBus) and Master of Business Administration (MBA) students who responded to the latest undergraduate (UG) and postgraduate (PG) student surveys are stored in < TASMAN_UNIVERSITY_BBUS_STUDENT_SURVEY > and < TASMAN_UNIVERSITY_MBA_STUDENT_SURVEY >. Copies of the survey questions are stored in Tasman University Undergraduate BBus Student Survey and Tasman University Postgraduate MBA Student Survey.
a. For pairs of variables in the BBus student survey, calculate contingency tables and then calculate conditional and marginal probabilities.
b. For pairs of variables in the MBA student survey, calculate contingency tables and then calculate conditional and marginal probabilities.
c. Write a report summarising your conclusions.

As Safe as Houses
To analyse the real estate market in non-capital cities and towns in states A and B, Safe-As-Houses Real Estate, a large national real estate company, has collected samples of recent residential sales from a sample of non-capital cities and towns in these states. The data are stored in < REAL_ESTATE >.
a. From the contingency tables constructed for selected variables in Chapter 2 for regional city 1 state A, calculate selected conditional and marginal probabilities.
b. From the contingency tables constructed for selected variables in Chapter 2 for coastal city 1 state A, calculate selected conditional and marginal probabilities.
c. Write a report summarising your conclusions.
d. Repeat (a) to (c) for another pair of non-capital cities or towns in state A and/or state B.



Chapter 4 Excel Guide

EG4.1  BASIC PROBABILITY CONCEPTS
Simple and Joint Probability and the General Addition Rule
Key technique  Use Excel arithmetic formulas.
Example  Calculate simple and joint probabilities for the Table 4.1 data on discount accommodation voucher use and repeat festival attendance.
PHStat  Use Simple & Joint Probabilities. For the example, select PHStat ➔ Probability & Prob. Distributions ➔ Simple & Joint Probabilities. In the new template, similar to the worksheet shown below, fill in the Sample Space area with the data.
In-depth Excel  Use the COMPUTE worksheet of the Probabilities workbook as a template. The worksheet (shown in Figure EG4.1) already contains the Table 4.1 discount accommodation voucher use and repeat festival attendance data. For other problems, change the sample space table entries in the cell ranges C3:D4 and A5:D6.

Figure EG4.1  COMPUTE worksheet of the Probabilities workbook

The COMPUTE_FORMULAS worksheet gives the formulas to calculate the probabilities.

EG4.2  CONDITIONAL PROBABILITY
There is no PHStat command for conditional probability.
In-depth Excel  Use the COMPUTE worksheet of the Probabilities workbook as a template. In this worksheet (shown in Figure EG4.1) the conditional probabilities are calculated in rows 28 to 35. The worksheet in Figure EG4.1 already contains the Table 4.1 data. For other problems, change the sample space table entries in the cell ranges C3:D4 and A5:D6.

EG4.3  BAYES’ THEOREM
Key technique  Use Excel arithmetic formulas.
Example  Apply Bayes’ theorem to the television-marketing example in Section 4.3.
In-depth Excel  Use the COMPUTE worksheet of the Bayes workbook as a template. The worksheet (shown in Figure EG4.2) already contains the probabilities for the Section 4.3 example. For other problems, change those probabilities in the cell range B5:C6.

Figure EG4.2  COMPUTE worksheet of the Bayes workbook

The COMPUTE_FORMULAS worksheet gives the formulas to calculate the probabilities, which are also shown as an inset to the worksheet in Figure EG4.2.

EG4.4  COUNTING RULES
Counting Rule 1
In-depth Excel  Use the POWER(k, n) worksheet function in a cell formula to calculate the number of outcomes given k events and n trials. For example, the formula =POWER(6, 2) calculates the answer for Example 4.12 on page 169.

Counting Rule 2
In-depth Excel  Use a formula that takes the product of successive POWER(k, n) functions to solve problems related to counting rule 2. For example, the formula =POWER(26, 3) * POWER(10, 3) calculates the answer for Example 4.13 New South Wales vehicle number plates on page 169.





Counting Rule 3
In-depth Excel  Use the FACT(n) worksheet function in a cell formula to calculate how many ways n items can be arranged. For example, the formula =FACT(6) calculates 6!, the answer to Example 4.15 on page 170.

Counting Rule 4
In-depth Excel  Use the PERMUT(n, x) worksheet function in a cell formula to calculate the number of ways of arranging in order x objects selected from n objects. For example, the formula =PERMUT(6, 4) calculates the answer for Example 4.16 on page 170.

Counting Rule 5
In-depth Excel  Use the COMBIN(n, x) worksheet function in a cell formula to calculate the number of ways of selecting x objects from n objects, irrespective of order. For example, the formula =COMBIN(6, 4) calculates the answer for Example 4.17 on page 171.

Microsoft® product screen shots are reprinted with permission from Microsoft Corporation.


CHAPTER 5
Some important discrete probability distributions

GAIA ADVENTURE TOURS
Tours and activities for Gaia Adventure Tours (see Chapter 4) are booked online. Potential customers can submit an online enquiry, which Gaia Adventure Tours advertises will be answered within 45 minutes between 7 am and 11 pm by a knowledgeable local adventure tour consultant. Yang, who is in charge of Gaia Adventure Tours’ online enquiry and booking procedures, is investigating several key performance indicators (KPIs); in particular:
■ the proportion of online enquiries converted to bookings
■ the number of online enquiries received in 1 hour
■ the proportion of online enquiries submitted between 7 am and 11 pm answered within 45 minutes.
Recent data collected by Yang show that:
■ 10% of online enquiries are converted to bookings
■ on average, Gaia Adventure Tours receives 30 online enquiries an hour between 7 am and 11 pm
■ with the current levels of staffing for enquiries:
 – when 24 or more online enquiries are received in 30 minutes, queries start to queue and may not be answered within the stated 45 minutes
 – when fewer than five enquiries are received in 20 minutes, enquiry staff have significant idle time.
Yang would like to determine the probability of a given number of online enquiries being converted to confirmed bookings in a sample of a specific size. In addition, to help determine optimal enquiry staffing levels, Yang would like to calculate the probability of receiving 24 or more online enquiries in any 30 minutes or fewer than five online enquiries in any 20 minutes. Answers to these questions and others can help Gaia Adventure Tours to develop future sales, marketing and staffing strategies.

© Georgejmclittle/Shutterstock/Pearson Education Ltd





LEARNING OBJECTIVES
After studying this chapter you should be able to:
1 recognise and apply the properties of a probability distribution
2 calculate the expected value and variance of a probability distribution
3 calculate average return and measure risk associated with various investment proposals
4 identify situations that can be modelled by a binomial distribution and calculate binomial probabilities
5 identify situations that can be modelled by a Poisson distribution and calculate Poisson probabilities
6 identify situations that can be modelled by a hypergeometric distribution and calculate hypergeometric probabilities

To help answer the given probability questions Yang can use a model, or small-scale representation, that approximates the online enquiry process, allowing inferences to be made about the processes. Although model building is a difficult task for some endeavours, in this case Yang can use probability distributions, which are mathematical models suitable for solving these types of probability questions. This chapter introduces probability distributions and explains how to apply the binomial, Poisson and hypergeometric distributions to business and other problems.

LEARNING OBJECTIVE 1
Recognise and apply the properties of a probability distribution

5.1  PROBABILITY DISTRIBUTION FOR A DISCRETE RANDOM VARIABLE
A numerical variable (see Chapter 1) is a variable that yields numerical responses such as the number of magazines you subscribe to or your height in centimetres. Numerical variables are classified as either continuous or discrete. Continuous numerical variables have outcomes that arise from a measuring process, for example your height or weight. Discrete numerical variables have outcomes that arise from a counting process, such as the number of magazines you subscribe to or the number of phone calls received in an hour. This chapter introduces probability distributions that represent discrete numerical variables; continuous probability distributions are discussed in Chapter 6.

A probability distribution for a discrete random variable is a mutually exclusive list of all possible numerical outcomes of the random variable with the probability of occurrence associated with each outcome.

probability distribution for a discrete random variable  Values of a discrete random variable with the corresponding probability of occurrence.

For a probability distribution for a discrete random variable:
1. all probabilities must be between 0 and 1 inclusive; that is, 0 ⩽ P(X) ⩽ 1
2. the sum of the probabilities must equal 1; that is, ∑P(X) = 1.
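Both properties are straightforward to verify for any proposed distribution. The following Python check is an illustrative sketch, not part of the text, and the function name is ours:

def is_valid_distribution(probabilities, tol=1e-9):
    """Check the two properties of a discrete probability distribution:
    every P(X) lies between 0 and 1, and the probabilities sum to 1."""
    in_range = all(0.0 <= p <= 1.0 for p in probabilities)
    sums_to_one = abs(sum(probabilities) - 1.0) <= tol
    return in_range and sums_to_one

# Table 5.1 probabilities for the number of home mortgages approved per week
print(is_valid_distribution([0.10, 0.10, 0.20, 0.30, 0.15, 0.10, 0.05]))  # True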

As an example, Table 5.1 gives the distribution of the number of home mortgages approved per week by the loans manager at a local branch of Check$mart Bank. From this we can see that the loans manager approves no more than six home mortgages per week as the list in Table 5.1 is collectively exhaustive. Furthermore, as one of the outcomes must happen – that is, between none and six mortgages approved – the probabilities must sum to 1. Figure 5.1 is a graphical representation of Table 5.1.

Table 5.1  Probability distribution of the number of home mortgages approved per week

Home mortgages approved per week    Probability
0                                   0.10
1                                   0.10
2                                   0.20
3                                   0.30
4                                   0.15
5                                   0.10
6                                   0.05

Figure 5.1  Probability distribution of the number of home mortgages approved per week (bar chart of P(X) against X)

LEARNING OBJECTIVE 2
Calculate the expected value and variance of a probability distribution

expected value of a discrete random variable  Measure of central tendency; the mean of a discrete random variable.

Expected Value of a Discrete Random Variable
In Chapter 3 we used the sample mean and variance to describe the centre and variation of a sample. In the same way, we can use the mean and variance of a random variable to describe the centre and variation of a probability distribution. The mean μ of a probability distribution is the expected value of its random variable. To calculate the expected value of a discrete random variable multiply each outcome X by its corresponding probability P(X) and then sum these products.

EXPECTED VALUE OF A DISCRETE RANDOM VARIABLE

μ = E(X) = ∑(i=1..N) XiP(Xi)    (5.1)

where Xi = the ith outcome of the discrete random variable X
      P(Xi) = probability of occurrence of the ith outcome of X





Using Equation 5.1 the mean, or expected value, for the probability distribution of the number of home mortgages approved per week is:

μ = E(X) = ∑(i=1..N) XiP(Xi)
         = (0 × 0.1) + (1 × 0.1) + (2 × 0.2) + (3 × 0.3) + (4 × 0.15) + (5 × 0.1) + (6 × 0.05)
         = 0 + 0.1 + 0.4 + 0.9 + 0.6 + 0.5 + 0.3
         = 2.8

The actual number of mortgages approved in a given week must be an integer value, so 2.8 mortgages are never approved in one week. However, on average, or in the long run, 2.8 are approved per week.

Variance and Standard Deviation of a Discrete Random Variable
The variance of a discrete probability distribution is calculated by multiplying each squared deviation from the mean [Xi − E(X)]² by its corresponding probability P(Xi) and then summing the resulting products. Equations 5.2a and 5.3 define, respectively, the variance of a discrete random variable and the standard deviation of a discrete random variable.

VARIANCE OF A DISCRETE RANDOM VARIABLE – DEFINITION FORMULA

σ² = ∑(i=1..N) [Xi − E(X)]² P(Xi)    (5.2a)

where Xi = the ith outcome of the discrete random variable X
      P(Xi) = probability of occurrence of the ith outcome of X

variance of a discrete random variable  Measure of variation, based on squared deviations from the mean; directly related to the standard deviation.
standard deviation of a discrete random variable  Measure of variation, based on squared deviations from the mean; directly related to the variance.

As for the sample variance, we can use algebra to obtain an alternative calculation formula.

VARIANCE OF A DISCRETE RANDOM VARIABLE – CALCULATION FORMULA

σ² = ∑(i=1..N) Xi²P(Xi) − E(X)²    (5.2b)

where ∑(i=1..N) Xi²P(Xi) = X1²P(X1) + X2²P(X2) + … + XN²P(XN)

STANDARD DEVIATION OF A DISCRETE RANDOM VARIABLE
The standard deviation of a discrete random variable is the square root of the variance:

σ = √σ²    (5.3)

Using Equations 5.2b and 5.3, the variance and standard deviation for the probability distribution of the number of mortgages approved per week are:



σ² = ∑(i=1..N) Xi²P(Xi) − E(X)²
   = [(0² × 0.1) + (1² × 0.1) + (2² × 0.2) + (3² × 0.3) + (4² × 0.15) + (5² × 0.1) + (6² × 0.05)] − 2.8²
   = [(0 × 0.1) + (1 × 0.1) + (4 × 0.2) + (9 × 0.3) + (16 × 0.15) + (25 × 0.1) + (36 × 0.05)] − 7.84
   = (0 + 0.1 + 0.8 + 2.7 + 2.4 + 2.5 + 1.8) − 7.84
   = 10.3 − 7.84 = 2.46

σ = √σ² = √2.46 = 1.568…

Alternatively, a table format can be used to calculate the mean and variance. In Table 5.2, the mean number of home mortgages approved per week is calculated. Then, using Equation 5.2b: N

σ² = ∑(i=1..N) Xi²P(Xi) − E(X)² = 10.3 − (2.8)² = 2.46

Table 5.2  Calculating the mean and variance of the number of home mortgages approved per week

Home mortgages approved per week Xi    P(Xi)    XiP(Xi)           Xi²P(Xi)
0                                      0.10     0.0               0.0
1                                      0.10     0.1               0.1
2                                      0.20     0.4               0.8
3                                      0.30     0.9               2.7
4                                      0.15     0.6               2.4
5                                      0.10     0.5               2.5
6                                      0.05     0.3               1.8
                                       1.00     μ = E(X) = 2.8    10.3
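The arithmetic in Table 5.2 can be reproduced with a few lines of Python (an illustrative sketch, not part of the text):

x_values = [0, 1, 2, 3, 4, 5, 6]
probabilities = [0.10, 0.10, 0.20, 0.30, 0.15, 0.10, 0.05]

mean = sum(x * p for x, p in zip(x_values, probabilities))                    # Equation 5.1
variance = sum(x**2 * p for x, p in zip(x_values, probabilities)) - mean**2   # Equation 5.2b
std_dev = variance ** 0.5                                                     # Equation 5.3

print(round(mean, 2), round(variance, 2), round(std_dev, 3))   # 2.8 2.46 1.568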

The expected value is often used to measure the amount we can expect to gain or lose by undertaking a particular investment, while the standard deviation is used to measure the risk involved.

Problems for Section 5.1

LEARNING THE BASICS
5.1 Given the following probability distributions:

Distribution A          Distribution B
X      P(X)             X      P(X)
0      0.50             0      0.05
1      0.20             1      0.10
2      0.15             2      0.15
3      0.10             3      0.20
4      0.05             4      0.50

a. Calculate the expected value for each distribution.
b. Calculate the standard deviation for each distribution.

c. Compare and contrast the results of distributions A and B. 5.2 Are each of the following a valid probability distribution? Justify your answers: Distribution A Distribution B Distribution C Distribution D X P(X) X P(X) X P(X) X P(X) 0.2 0 0.1 0.250 0.500 0 0.2 -1 1 0.9 1 0.2 0.500 0.250 1 0.1 2 2 0.3 1.000 0.250 2 0.4 -0.1 3 0.3 3 0.5





APPLYING THE CONCEPTS
5.3 Using the company records for the past 500 working days, the manager of Konig Motors has summarised the number of cars sold per day in the following table:

Number of cars sold per day    Frequency of occurrence
0                              40
1                              100
2                              142
3                              66
4                              36
5                              30
6                              26
7                              20
8                              16
9                              14
10                             8
11                             2
Total                          500

a. Form the probability distribution for the number of cars sold per day.
b. Calculate the mean or expected number of cars sold per day.
c. Calculate the standard deviation.
5.4 The manager of a large computer network has developed the following probability distribution of the number of interruptions per day:

Interruptions (X)    P(X)
0                    0.32
1                    0.35
2                    0.18
3                    0.08
4                    0.04
5                    0.02
6                    0.01

a. Calculate the mean or expected number of interruptions per day.
b. Calculate the standard deviation.
5.5 In the casino version of the traditional Australian game of two-up, a spinner stands in a ring and tosses two coins into the air. The coins may land showing two heads, two tails or one tail and one head (odds). Players can bet on either heads or tails at odds of one to one. Therefore, if a player bets $1 on heads, the player will win $1 if the coins land on heads but lose $1 if the coins land on tails. Alternatively, if a player bets $1 on tails, the player will win $1 if the coins land on tails but lose $1 if the coins land on heads. If the coins land on odds, all bets are frozen and the spinner tosses again until either heads or tails comes up. If five odds are tossed in a row all players lose.
a. Construct the probability distribution representing the different outcomes that are possible for a $1 bet on heads.
b. Construct the probability distribution representing the different outcomes that are possible for a $1 bet on tails.
c. What is the expected long-run profit (or loss) to the player?

LEARNING OBJECTIVE 3
Calculate average return and measure risk associated with various investment proposals

5.2  COVARIANCE AND ITS APPLICATION IN FINANCE
In Section 5.1 the expected value, variance and standard deviation of a discrete random variable are discussed. In this section the covariance between two discrete random variables is introduced and then applied to portfolio management, a topic of interest to financial analysts.

Covariance
Covariance, σXY, is a measure of the strength of the relationship between two random variables, X and Y. A positive covariance indicates a positive relationship, while a negative covariance indicates a negative relationship. If the two variables are independent then their covariance is zero. Equation 5.4a defines the covariance between two discrete random variables.

covariance Measure of the strength of the linear relationship between two numerical variables.

COVARIANCE – DEFINITION FORMULA

σXY = ∑(all Xi) ∑(all Yj) [Xi − E(X)][Yj − E(Y)]P(Xi and Yj)    (5.4a)

where Xi is the ith outcome of the discrete random variable X, and Yj is the jth outcome of the discrete random variable Y.

As for the sample covariance, we can use algebra to obtain an alternative calculation formula.


COVARIANCE – CALCULATION FORMULA

$$\sigma_{XY} = \sum_{\text{all } X_i} \sum_{\text{all } Y_j} X_i Y_j\,P(X_i \text{ and } Y_j) - E(X)E(Y) \qquad (5.4b)$$

To illustrate covariance, suppose that we are deciding between two alternative investments for the coming year. The first investment is a mutual fund that consists of shares that are expected to do well when economic conditions are strong. The second investment is a mutual fund that is expected to perform best when economic conditions are weak. Your estimate of the returns for each investment (per $1,000 investment) under three economic conditions, each with a given probability of occurrence, is summarised in Table 5.3.

Table 5.3 Estimated returns for each investment under three economic conditions

Economic condition    P(Xi Yi) [= P(Xi) = P(Yi)]    Xi       Yi       Xi Yi      Xi P(Xi)      Yi P(Yi)     Xi Yi P(Xi Yi)
Recession             0.2                           -$100    +$200    -20,000    -20           40           -4,000
Stable economy        0.5                           +100     +50      5,000      50            25           2,500
Expanding economy     0.3                           +250     -100     -25,000    75            -30          -7,500
Total                 1.0                                                        E(X) = 105    E(Y) = 35    -9,000

The expected value and standard deviation for each investment are calculated as follows. Let X = strong-economy fund, and Y = weak-economy fund:

E(X) = μX = (−100)(0.2) + (100)(0.5) + (250)(0.3) = $105
E(Y) = μY = (+200)(0.2) + (50)(0.5) + (−100)(0.3) = $35

σ²X = [((−100)² × 0.2) + (100² × 0.5) + (250² × 0.3)] − 105² = 25,750 − 11,025 = 14,725
σX = √14,725 = 121.346… ≈ $121.35

σ²Y = [(200² × 0.2) + (50² × 0.5) + ((−100)² × 0.3)] − 35² = 12,250 − 1,225 = 11,025
σY = √11,025 = $105.00

In the calculation of the covariance, the only non-zero probabilities are:
P(X = −$100 and Y = $200) = 0.2
P(X = $100 and Y = $50) = 0.5
P(X = $250 and Y = −$100) = 0.3

We therefore have:
σXY = [(−100 × 200 × 0.2) + (100 × 50 × 0.5) + (250 × (−100) × 0.3)] − (105 × 35) = −9,000 − 3,675 = −12,675

expected value of the sum of two random variables Measure of central tendency; mean of the sum of two random variables.
variance of the sum of two random variables Measure of variation; directly related to the standard deviation.
standard deviation of the sum of two random variables Measure of variation; directly related to the variance.

Thus, the strong-economy fund has a higher expected value (i.e. larger expected return) than the weak-economy fund but has a higher standard deviation (i.e. more risk). The covariance of -12,675 between the two investments indicates a negative relationship in which the two investments are varying in the opposite direction. Therefore, when the return on one investment is high, the return on the other is typically low.
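These hand calculations can be reproduced in a few lines of code. The following is a minimal Python sketch (an added illustration, not part of the textbook) using NumPy; the variable names are illustrative only.

```python
import numpy as np

# Joint outcomes with non-zero probability, from Table 5.3 (returns per $1,000 invested)
p = np.array([0.2, 0.5, 0.3])          # P(Xi and Yi) for recession, stable, expanding
x = np.array([-100.0, 100.0, 250.0])   # strong-economy fund returns
y = np.array([200.0, 50.0, -100.0])    # weak-economy fund returns

ex, ey = (x * p).sum(), (y * p).sum()   # expected values, Equation 5.1
var_x = (x**2 * p).sum() - ex**2        # variance, Equation 5.2b
var_y = (y**2 * p).sum() - ey**2
cov_xy = (x * y * p).sum() - ex * ey    # covariance, Equation 5.4b

print(ex, ey)                           # 105.0 35.0
print(np.sqrt(var_x), np.sqrt(var_y))   # approx. 121.35 and 105.0
print(cov_xy)                           # -12675.0
```

Only the three outcome pairs with non-zero joint probability need to be listed, which is why the double sum in Equation 5.4b collapses to a single sum here.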

Expected Value, Variance and Standard Deviation of the Sum of Two Random Variables Equation 5.4a defined the covariance between two discrete random variables, X and Y. Now, the expected value of the sum of two random variables, variance of the sum of two random variables and standard deviation of the sum of two random variables are defined.


EXPECTED VALUE OF THE SUM OF TWO RANDOM VARIABLES
The expected value of the sum of two random variables is equal to the sum of the expected values.

$$E(X + Y) = E(X) + E(Y) \qquad (5.5)$$

Alternatively: $\mu_{X+Y} = \mu_X + \mu_Y$

VARIANCE OF THE SUM OF TWO RANDOM VARIABLES
The variance of the sum of two random variables is equal to the sum of the variances plus twice the covariance.

$$\sigma^2_{X+Y} = \sigma^2_X + \sigma^2_Y + 2\sigma_{XY} \qquad (5.6)$$

STANDARD DEVIATION OF THE SUM OF TWO RANDOM VARIABLES
The standard deviation is the square root of the variance.

$$\sigma_{X+Y} = \sqrt{\sigma^2_{X+Y}} \qquad (5.7)$$

To illustrate the expected value, variance and standard deviation of the sum of two random variables, consider the two investments previously discussed. Using Equations 5.5, 5.6 and 5.7:

μX+Y = E(X + Y) = E(X) + E(Y) = 105 + 35 = $140
σ²X+Y = σ²X + σ²Y + 2σXY = (14,725 + 11,025) + 2 × (−12,675) = 400
σX+Y = √400 = $20

The expected return of the sum of the strong-economy fund and the weak-economy fund is $140 with a standard deviation of $20. The standard deviation of the sum of the two investments is much less than the standard deviation of either single investment because there is a large negative covariance between the investments.

Portfolio Expected Return and Portfolio Risk The concepts of covariance, expected return and standard deviation of the sum of two random variables can be applied to the study of investment portfolios where investors combine assets into portfolios to reduce their risk. The objective is to maximise the return while minimising the risk. For such portfolios, rather than studying the sum of two random variables, each investment is weighted by the proportion of assets assigned to that investment. Equations 5.8 and 5.9 define portfolio expected return and portfolio risk.

PORTFOLIO EXPECTED RETURN
The portfolio expected return for a two-asset investment is equal to the weight assigned to asset X multiplied by the expected return of asset X plus the weight assigned to asset Y multiplied by the expected return of asset Y:

$$E(P) = wE(X) + (1 - w)E(Y) \qquad (5.8)$$

where
E(P) = portfolio expected return
w = portion of the portfolio assigned to asset X, 0 ⩽ w ⩽ 1
1 − w = portion of the portfolio assigned to asset Y

portfolio A combined investment in two or more assets.
portfolio expected return Measure of central tendency; mean return on investment.
portfolio risk Measure of the variation of investment returns.


PORTFOLIO RISK

$$\sigma_p = \sqrt{w^2\sigma^2_X + (1 - w)^2\sigma^2_Y + 2w(1 - w)\sigma_{XY}} \qquad (5.9)$$

In the previous example, the expected return and risk of two different investments were calculated, a strong-economy fund and a weak-economy fund. The covariance of the two investments was also calculated. Now, suppose that we wish to form a portfolio of these two investments that consists of an equal investment in each of these two funds. To calculate the portfolio expected return and the portfolio risk, use Equations 5.8 and 5.9, with w = 0.5, to obtain:

E(P) = wE(X) + (1 − w)E(Y) = (0.5 × 105) + (0.5 × 35) = $70
σp = √[(0.5)²(14,725) + (1 − 0.5)²(11,025) + 2(0.5)(1 − 0.5)(−12,675)] = √100 = $10

Thus, the portfolio has an expected return of $70 for each $1,000 invested (a return of 7%) and has a portfolio risk of $10. The portfolio risk here is small because there is a large negative covariance between the two investments. The fact that each investment performs best under different circumstances has reduced the overall risk of the portfolio. It is possible to use calculus to determine the minimum portfolio risk – which may occasionally be zero – but that is outside the scope of this textbook.
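If you want to experiment with weights other than w = 0.5, the portfolio calculations can be wrapped in a small function. This is a minimal Python sketch added for illustration (not from the textbook); the function name and arguments are this edit's own.

```python
import math

def portfolio(w, ex, ey, var_x, var_y, cov_xy):
    """Portfolio expected return (Equation 5.8) and portfolio risk (Equation 5.9)
    for weight w placed in asset X and 1 - w in asset Y."""
    expected_return = w * ex + (1 - w) * ey
    risk = math.sqrt(w**2 * var_x + (1 - w)**2 * var_y + 2 * w * (1 - w) * cov_xy)
    return expected_return, risk

# Equal investment in the strong-economy and weak-economy funds
print(portfolio(0.5, 105, 35, 14_725, 11_025, -12_675))   # (70.0, 10.0)
```

Other weights can be explored simply by changing the first argument.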

Problems for Section 5.2

LEARNING THE BASICS
5.6 Given the following probability distributions for variables X and Y:

P(Xi Yi)    X      Y
0.4         100    200
0.6         200    100

Calculate: a. E(X) and E(Y)  b. σX and σY  c. σXY  d. E(X + Y)

5.7 Given the following probability distributions for variables X and Y:

P(Xi Yi)    X       Y
0.2         -100    50
0.4         50      30
0.3         200     20
0.1         300     20

Calculate: a. E(X) and E(Y)  b. σX and σY  c. σXY  d. E(X + Y)

5.8 Two investments, X and Y, have the following characteristics:
E(X) = $50, E(Y) = $100, σ²X = 9,000, σ²Y = 15,000 and σXY = 7,500
If the weight assigned to investment X of portfolio assets is 0.4, calculate:
a. the portfolio expected return
b. the portfolio risk

APPLYING THE CONCEPTS 5.9 The process of being served at a bank consists of two independent parts – the time waiting in line and the time it takes to be served by the teller. Suppose, at a branch of Check$mart, that the time waiting in line has an expected value of 4 minutes with a standard deviation of 1.2 minutes and the time it takes to be served by the teller has an expected value of 5.5 minutes with a standard deviation of 1.5 minutes. Calculate: a. the expected value of the total time it takes to be served b. the standard deviation of the total time it takes to be served 5.10 For the investment example given in Table 5.3: a. Calculate the portfolio expected return and the portfolio risk if: i. 30% is invested in the strong-economy fund and 70% in the weak-economy fund ii. 70% is invested in the strong-economy fund and 30% in the weak-economy fund b. Which of the three investment strategies (30%, 50% or 70% in the strong-economy fund) would you recommend? Why?


5.11 You are developing a strategy for investing in two different shares. The anticipated annual return for a $1,000 investment in each share has the following probability distribution:

Probability   Share X   Share Y
0.1           -$100     $50
0.3           0         150
0.3           80        -20
0.3           150       -100

a. Calculate: i. the expected return for share X and for share Y ii. the standard deviation for share X and for share Y iii. the covariance of share X and share Y b. Would you invest in share X or share Y? Explain. 5.12 Suppose that in problem 5.11 you wanted to create a portfolio that consists of share X and share Y. a. Calculate the portfolio expected return and portfolio risk for each of the following percentages invested in share X: i. 30% ii. 50% iii. 70% b. On the basis of the results of your calculations in part (a), which portfolio would you recommend? Explain. 5.13 You are trying to set up a portfolio that consists of a corporate bond fund and a common share fund. The following information about the annual return (per $1,000) of each of these investments under different economic conditions is available, together with the probability that each of these economic conditions will occur.

Probability   State of the economy   Corporate bond fund   Common share fund
0.10          Recession              -$30                  -$150
0.15          Stagnation             50                    -20
0.35          Slow growth            90                    120
0.30          Moderate growth        100                   160
0.10          High growth            110                   250

a. Calculate: i. the expected return for the corporate bond fund and for the common share fund ii. the standard deviation for the corporate bond fund and for the common share fund iii. the covariance of the corporate bond fund and the common share fund b. Would you invest in the corporate bond fund or the common share fund? Explain. 5.14 Suppose that in problem 5.13 you wanted to create a portfolio that consists of a corporate bond fund and a common share fund. a. Calculate the portfolio expected return and portfolio risk for each of the following percentages invested in a corporate bond fund: i. 30% ii. 50% iii. 70% b. On the basis of the results of your calculations in (a), which portfolio would you recommend? Explain.

5.3  BINOMIAL DISTRIBUTION

LEARNING OBJECTIVE 4
Identify situations that can be modelled by a binomial distribution and calculate binomial probabilities

The next three sections use mathematical models to solve business and other problems. A mathematical model is a mathematical expression representing a variable of interest.

mathematical model The mathematical representation of a random variable.
binomial distribution Discrete probability distribution, where the random variable is the number of successes in a sample of n observations from either an infinite population or sampling with replacement.

When a mathematical model of a discrete probability distribution is available, you can easily calculate the exact probability of occurrence of any particular outcome of the random variable. The binomial distribution is one of the most important and widely used discrete probability distributions. The binomial distribution arises when the discrete random variable is the number of successes in a sample of n observations. The binomial distribution has four essential properties:
1. The sample consists of a fixed number of observations, n.
2. Each observation is classified into one of two mutually exclusive and collectively exhaustive categories, usually called success and failure.
3. The probability of an observation being classified as a success, p, is constant from observation to observation. Thus, the probability of an observation being classified as a failure, 1 − p, is also constant for all observations.
4. The outcome (i.e. success or failure) of any observation is independent of the outcome of any other observation. To ensure independence, the observations can be randomly selected either from an infinite population without replacement or from a finite population with replacement.


The proportion of online enquiries that are converted to bookings is of interest in the Gaia Adventure Tours scenario, so Yang could define an online enquiry converted to a booking as a success and an online enquiry that is not converted to a booking as a failure. Yang would then be interested in the number of successes; that is, the number of online enquiries converted to bookings in a random sample of n online enquiries. Note: In a binomial distribution, 'success' is usually defined as the outcome we are interested in – in this case an online enquiry converted to a booking. This is a binomial situation because:
• a fixed number of online enquiries, n, is chosen
• each online enquiry is either converted to a booking – a success – or not converted – a failure
• 10% of online enquiries are converted to bookings, so the probability of a randomly chosen online enquiry being converted to a booking is p = 0.1 and that of a randomly chosen online enquiry not being converted to a booking is 1 − p = 0.9
• online enquiries are randomly selected; so the outcome, converted or not converted, of any enquiry is independent of the outcome of any other enquiry.
If Yang takes a random sample of four online enquiries, the binomial random variable defined as:
X = number of online enquiries converted to bookings
has a range from 0 to four as none, one, two, three or all four enquiries may be converted to bookings. In general, a binomial random variable has a range from 0 to n. Suppose that Yang observes the following result in a sample of four enquiries:

First enquiry    Second enquiry   Third enquiry    Fourth enquiry
Converted        Converted        Not converted    Converted

What is the probability of having three successes (converted enquiries) in a sample of four enquiries in this particular sequence? Because the historical probability of enquiries converted to bookings is 0.10, the probability that each enquiry occurs in the sequence is:

First enquiry    Second enquiry   Third enquiry    Fourth enquiry
p = 0.1          p = 0.1          1 − p = 0.9      p = 0.1

Each outcome is independent of the others because the enquiries are randomly selected. Therefore, the probability of having this particular sequence is: pp(1 - p)p = p3(1 - p) = (0.1)3(0.9)1 = 0.0009 This result indicates only the probability of three online enquiries converted to bookings (successes) out of a sample of four online enquiries in a specific sequence. The number of ways of selecting X objects from n objects irrespective of sequence is given by the counting rule for combinations introduced in Chapter 4 as Equation 4.14 and as Equation 5.10 below, introducing a different notation. COM B IN AT ION S The number of combinations of selecting X objects from n objects is given by

$$\binom{n}{X} = {}_nC_X = \frac{n!}{X!\,(n-X)!} \qquad (5.10)$$

where n factorial is defined by n! = n × (n − 1) × … × 2 × 1 and, by definition, 0! = 1.

Using Equation 5.10, we see that there are:

$${}_4C_3 = \frac{4!}{3!\,(4-3)!} = 4$$


sequences of three converted enquiries and one enquiry not converted. The four possible sequences are:

Sequence 1: Converted, Converted, Converted, Not converted
Sequence 2: Converted, Converted, Not converted, Converted
Sequence 3: Converted, Not converted, Converted, Converted
Sequence 4: Not converted, Converted, Converted, Converted

and the probability of each is:

p³(1 − p) = (0.1)³(0.9)¹ = 0.0009

Therefore, the probability of three converted enquiries out of four is equal to:
number of sequences × probability of sequence = 4 × 0.0009 = 0.0036

We can make similar, intuitive derivations for the other possible outcomes of the random variable – zero, one, two and four converted enquiries. However, as n, the sample size, gets larger, the calculations involved in using this approach become time-consuming. Instead, a mathematical model provides a formula to calculate any binomial probability. Equation 5.11 is the mathematical model that represents the binomial probability distribution and is used to calculate the probability of X successes for any given values of n and p.

BINOMIAL PROBABILITY DISTRIBUTION

$$P(X) = \frac{n!}{X!\,(n-X)!}\,p^X(1-p)^{n-X} \qquad (5.11)$$

where
P(X) = probability of X successes given n and p
n = number of observations
p = probability of success
1 − p = probability of failure
X = number of successes (X = 0, 1, 2, …, n)

Equation 5.11 restates what we had intuitively derived. The binomial random variable X can have any integer value X from 0 to n. In Equation 5.11 the product $p^X(1-p)^{n-X}$ indicates the probability of exactly X successes out of n observations in a particular sequence. The term $\frac{n!}{X!\,(n-X)!}$ indicates how many combinations of the X successes out of n observations are possible. Hence, given the number of observations n and the probability of success p, the probability of X successes is:

P(X) = number of sequences × probability of sequence = $\frac{n!}{X!\,(n-X)!}\,p^X(1-p)^{n-X}$

Example 5.1 illustrates the use of Equation 5.11.

EXAMPLE 5.1
DETERMINING P(X = 3), GIVEN n = 4 AND p = 0.1
If 10% of online enquiries are converted to bookings, what is the probability that there are three converted enquiries in a sample of four?


SOLUTION

Using Equation 5.11, the probability of three converted enquiries from a sample of four is:

$$P(X = 3) = \frac{4!}{3!\,(4-3)!}(0.1)^3(1-0.1)^{4-3} = 4 \times 0.001 \times 0.9 = 0.0036$$
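As a quick cross-check of Example 5.1, a short Python sketch (added here for illustration, not part of the text) reproduces the value both from first principles and with scipy's binomial distribution.

```python
from math import comb
from scipy.stats import binom

# Number of sequences times the probability of one sequence, as in Equation 5.11
print(comb(4, 3) * 0.1**3 * 0.9**1)   # 0.0036...
# The same probability from scipy's binomial probability mass function
print(binom.pmf(3, 4, 0.1))           # k = 3 successes, n = 4, p = 0.1
```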

Examples 5.2 and 5.3 give the calculations for other values of X.

EXAMPLE 5.2
DETERMINING P(X ⩾ 3), GIVEN n = 4 AND p = 0.1
If 10% of online enquiries are converted to bookings, what is the probability that there are at least three converted enquiries in a sample of four?

SOLUTION
In Example 5.1 we found that the probability of exactly three converted enquiries from a sample of four is 0.0036. To calculate the probability of at least three converted enquiries, we need to add the probability of three converted enquiries to the probability of four converted enquiries. The probability of four converted enquiries is:

$$P(X = 4) = \frac{4!}{4!\,(4-4)!}(0.1)^4(1-0.1)^{4-4} = 1 \times 0.0001 \times 1 = 0.0001$$

Thus, the probability of at least three converted enquiries is:
P(X ⩾ 3) = P(X = 3) + P(X = 4) = 0.0036 + 0.0001 = 0.0037
There is a 0.37% chance that there will be at least three converted enquiries in a sample of four.

EXAMPLE 5.3
DETERMINING P(X < 3), GIVEN n = 4 AND p = 0.1
If 10% of online enquiries are converted to bookings, what is the probability that there are fewer than three converted enquiries in a sample of four?

SOLUTION
The probability that there are fewer than three converted enquiries is:
P(X < 3) = P(X = 0) + P(X = 1) + P(X = 2)
Use Equation 5.11 to calculate each of these probabilities:

$$P(X = 0) = \frac{4!}{0!\,(4-0)!}(0.1)^0(1-0.1)^{4-0} = 0.6561$$
$$P(X = 1) = \frac{4!}{1!\,(4-1)!}(0.1)^1(1-0.1)^{4-1} = 0.2916$$
$$P(X = 2) = \frac{4!}{2!\,(4-2)!}(0.1)^2(1-0.1)^{4-2} = 0.0486$$

Therefore, P(X < 3) = 0.6561 + 0.2916 + 0.0486 = 0.9963
Alternatively, P(X < 3) can also be calculated from its complement, P(X ⩾ 3), since:
P(X < 3) = 1 − P(X ⩾ 3) = 1 − 0.0037 = 0.9963

Calculations such as those in Example 5.3 can become tedious, especially as n gets large. To avoid computational drudgery, many binomial probabilities can be found directly from Table E.6 (Appendix E), a portion of which is reproduced in Table 5.4. Table E.6 provides


­binomial probabilities for X = 0, 1, 2, … , n for selected combinations of n and p. For example, to find the probability of exactly two successes in a sample of four when the probability of success is 0.1, first find n = 4 and then read off the required probability at the intersection of the row X = 2 and the column p = 0.10. Thus: P(X = 2) = 0.0486

Table 5.4 Finding a binomial probability for n = 4, X = 2 and p = 0.1 (extracted from Table E.6)

                       p
n    X    0.01      0.02      ....    0.10
4    0    0.9606    0.9224    ....    0.6561
     1    0.0388    0.0753    ....    0.2916
     2    0.0006    0.0023    ....    0.0486
     3    0.0000    0.0000    ....    0.0036
     4    0.0000    0.0000    ....    0.0001

The binomial probabilities given in Table E.6 can also be calculated using Microsoft Excel. Figure 5.2 presents a Microsoft Excel worksheet for calculating binomial probabilities, using the Excel 2010 and later inbuilt binomial function BINOM.DIST(number_s,trials, probability_s,cumulative). For earlier versions of Excel the corresponding binomial function is BINOMDIST(number_s,trials,probability_s,cumulative).

Figure 5.2 Microsoft Excel worksheet for the number of online enquiries converted to bookings example

Data: Sample size = 4; Probability of an event of interest = 0.1
Statistics: Mean = 0.4 (=B4*B5); Variance = 0.36 (=B8*(1-B5)); Standard deviation = 0.6 (=SQRT(B9))
Binomial probabilities table, each value calculated with =BINOM.DIST(X, $B$4, $B$5, FALSE):

X      0        1        2        3        4
P(X)   0.6561   0.2916   0.0486   0.0036   0.0001
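The same probabilities can be generated outside Excel. The following Python sketch (an added illustration, not from the textbook) uses scipy.stats.binom to reproduce the values in Figure 5.2 and Table 5.4.

```python
from scipy.stats import binom

n, p = 4, 0.1
for x in range(n + 1):
    # P(X = x) for the number of online enquiries converted to bookings
    print(x, round(binom.pmf(x, n, p), 4))

# Mean (np) and standard deviation (square root of np(1 - p)) of the distribution
print(binom.mean(n, p), binom.std(n, p))   # 0.4 and 0.6
```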

The shape of a binomial probability distribution depends on the values of n and p. When p = 0.5, the binomial distribution is symmetrical, regardless of how large or small the value of n. When p ≠ 0.5, the distribution is skewed, to the right if p < 0.5 and to the left if p > 0.5. The closer p is to 0.5 and/or the larger the number of observations n, the less skewed the distribution. For example, the distribution of the number of converted online enquiries is highly skewed to the right because p = 0.1 and n = 4 (see Figure 5.3).


Figure 5.3 Microsoft Excel graph of the binomial probability distribution with n = 4 and p = 0.1 (vertical axis: P(X), from 0 to 0.7; horizontal axis: number of successes, 0 to 4)

Substituting the binomial probability equation (5.11) in the expected value equation (5.1) and using algebra to simplify, it can be shown that the mean of the binomial distribution is equal to the product of n and p, as shown in Equation 5.12. Therefore, use Equation 5.12 to calculate the mean of a binomial distribution, instead of Equation 5.1.

THE MEAN OF THE BINOMIAL DISTRIBUTION
The mean μ of the binomial distribution is equal to the sample size n multiplied by the probability of success p.

$$\mu = E(X) = np \qquad (5.12)$$

Therefore, on average, Yang can theoretically expect E(X) = 4 × 0.1 = 0.4 converted enquiries in a sample of four. Similarly, by substituting the binomial probability equation (5.11) in the variance equation (5.2a or 5.2b) and using algebra to simplify, it can be shown that the standard deviation of the binomial distribution is given by Equation 5.13.

THE STANDARD DEVIATION OF THE BINOMIAL DISTRIBUTION

$$\sigma = \sqrt{\sigma^2} = \sqrt{np(1-p)} \qquad (5.13)$$

Therefore, using Equation 5.13, the standard deviation of the number of converted enquiries is:

$$\sigma = \sqrt{4(0.1)(0.9)} = 0.60$$

EXAMPLE 5.4
CALCULATING BINOMIAL PROBABILITIES
Accuracy (measured as the percentage of orders consisting of a main item, side item and drink that are filled correctly) in taking orders at the drive-through window is an important feature for fast-food chains. Suppose in a recent month that records show that the percentage


of correct orders of this type filled at a Hungry Jack's franchise was 88%. Suppose three friends go to the drive-through window at this Hungry Jack's franchise and each places an order of the type just mentioned.
• What is the probability that:
– all three orders will be filled correctly?
– none of the three will be filled correctly?
– at least two of the three will be filled correctly?
• What is the average and standard deviation of the number of orders filled correctly?

SOLUTION
There are three orders and the probability of any order being accurate is 0.88. Therefore:
X = number of orders filled correctly = 0, 1, 2, 3
is a binomial random variable with n = 3, p = 0.88. Using Equations 5.11, 5.12 and 5.13:

$$P(X = 3) = \frac{3!}{3!\,(3-3)!}(0.88)^3(1-0.88)^{3-3} = 1 \times 0.68147… \times 1 = 0.68147…$$
$$P(X = 0) = \frac{3!}{0!\,(3-0)!}(0.88)^0(1-0.88)^{3-0} = 1 \times 1 \times 0.00172… = 0.00172…$$
$$P(X = 2) = \frac{3!}{2!\,(3-2)!}(0.88)^2(1-0.88)^{3-2} = 3 \times 0.7744 \times 0.12 = 0.27878…$$

P(X ⩾ 2) = P(X = 2) + P(X = 3) = 0.27878… + 0.68147… = 0.96025…
μ = E(X) = 3 × 0.88 = 2.64

$$\sigma = \sqrt{np(1-p)} = \sqrt{3 \times 0.88 \times 0.12} = 0.5628…$$

The probability that all three orders are filled correctly is 0.6815. The probability that none of the orders is filled correctly is 0.0017. The probability that at least two orders are filled correctly is 0.9603. The mean number of accurate orders filled in a sample of three orders is 2.64 and the standard deviation is 0.563. This section introduced the binomial distribution and applied it to business and other problems. The binomial distribution plays an important role when it is used in statistical inference problems involving the estimation or testing of hypotheses about proportions (discussed in Chapters 8 and 9).
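For readers checking Example 5.4 by computer, this short Python sketch (added for illustration, not part of the text) confirms the drive-through probabilities using scipy.

```python
from scipy.stats import binom

n, p = 3, 0.88
print(binom.pmf(3, n, p))    # all three orders filled correctly, approx. 0.6815
print(binom.pmf(0, n, p))    # none filled correctly, approx. 0.0017
print(binom.sf(1, n, p))     # P(X >= 2) = 1 - P(X <= 1), approx. 0.9603
print(binom.mean(n, p), binom.std(n, p))   # 2.64 and approx. 0.563
```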

Problems for Section 5.3 Problems 5.15 to 5.24 can be solved manually or by using Microsoft Excel. Some, but not all, can also be solved using Table E.6.

LEARNING THE BASICS

5.15 If X is a binomial random variable, determine the following: a. For n = 4 and p = 0.12, what is P(X = 0)? b. For n = 10 and p = 0.40, what is P(X = 9)? c. For n = 10 and p = 0.50, what is P(X = 8)? d. For n = 6 and p = 0.83, what is P(X = 5)? 5.16 If X is a binomial random variable with n = 5 and p = 0.40, what is the probability that:

a. X = 4? b. X ⩽ 3? c. X < 2? d. X > 1? 5.17 Determine the mean and standard deviation of the random variable X in each of the following binomial distributions: a. n = 4 and p = 0.10 b. n = 4 and p = 0.40 c. n = 5 and p = 0.80 d. n = 3 and p = 0.50


APPLYING THE CONCEPTS 5.18 The increase or decrease in the price of a share between the beginning and the end of a trading day is assumed to be an equally likely random event. What is the probability that a share will show an increase in its closing price on five consecutive days? 5.19 Research has shown that only 60% of consumers read every word, including the fine print, of a service contract. Assume that the number of consumers who read every word of a contract can be modelled using the binomial distribution. A group of five consumers has just signed a 12-month contract with an ISP (Internet service provider). a. What is the probability that: i. all five will have read every word of their contract? ii. at least three will have read every word of their contract? iii. less than two will have read every word of their contract? b. What would your answers be in (a) if the probability is 0.80 that a consumer reads every word of a service contract? 5.20 A student taking a multiple-choice test consisting of five questions, each with four options, selects the answers randomly. What is the probability that the student will get: a. five questions correct? b. at least four questions correct? c. no questions correct? d. no more than two questions correct? 5.21 In Example 5.4 three friends went to a Hungry Jack’s franchise. Instead, suppose that they go to a McDonald’s franchise, which last month filled 90% of orders correctly. a. What is the probability that: i. all three orders will be filled correctly? ii. none of the three will be filled correctly? iii. at least two of the three will be filled correctly? b. What is the mean and standard deviation of the number of orders filled correctly? 5.22 In a certain weekday television show, the winning contestant has to choose randomly from 20 boxes, one of which contains a major prize of $100,000.

a. What is the probability that, during a week:
i. no contestant wins the major prize?
ii. exactly one contestant wins the major prize?
iii. no more than two contestants win the major prize?
iv. at least three contestants win the major prize?
b. Calculate the expected number and standard deviation of winners in a week.
c. How much should the producers budget for major prizes per week?
5.23 When a customer places an order with Rudy's On-Line Office Supplies, a computerised accounting information system (AIS) automatically checks to see whether the customer has exceeded their credit limit. Past records indicate that the probability of customers exceeding their credit limit is 0.05. Suppose that, in a given half hour, 20 customers place orders. Assume that the number of customers that the AIS detects as having exceeded their credit limit is distributed as a binomial random variable.
a. What are the mean and standard deviation of the number of customers exceeding their credit limits?
b. What is the probability that no customer will exceed their limit?
c. What is the probability that one customer will exceed their limit?
d. What is the probability that two or more customers will exceed their limits?
5.24 A new drug is found to be effective on 90% of the patients tested.
a. Is the 90% effective rate best classified as a priori classical probability, empirical classical probability or subjective probability?
b. If the drug is administered to 20 randomly chosen patients at a large hospital, find the probability that it is effective for: i. fewer than five patients ii. 10 or more patients iii. all 20 patients

5.4  POISSON DISTRIBUTION

LEARNING OBJECTIVE 5
Identify situations that can be modelled by a Poisson distribution and calculate Poisson probabilities

Poisson distribution Discrete probability distribution, where the random variable is the number of events in a given interval.

Many studies are based on the number of times a random event occurs in an interval of time or space. Examples are the number of surface defects on a new refrigerator, the number of network failures in a month or the number of fleas on the body of a dog. The Poisson distribution can be used to calculate probabilities when counting the number of times a particular event occurs in an interval of time or space if:
1. the probability an event occurs in any interval is the same for all intervals of the same size
2. the number of occurrences of the event in one non-overlapping interval is independent of the number in any other interval
3. the probability of two or more occurrences of the event in an interval approaches zero as the interval becomes smaller.
If these properties hold, then the average or expected number of occurrences over any interval is proportional to the size of the interval.


Consider the number of online enquiries received by Gaia Adventure Tours. Suppose that Yang is interested in the number of online enquiries received during a 20-minute interval. Does this situation match the properties of the Poisson distribution given above? First, define the random variable as:
X = number of online enquiries received during a 20-minute interval
Suppose enquiries are received randomly; then it is reasonable to assume that the probability an enquiry is received during a 20-minute interval is the same as the probability for all other 20-minute intervals. Yang can also assume that the receipt of an enquiry during a 20-minute interval has no effect on (i.e. is statistically independent of) the receipt of any other enquiry during any 20-minute interval. Finally, the probability that two or more enquiries will be received in a given time period approaches zero as the time interval becomes smaller. For example, the probability is virtually zero that two enquiries will be received in a time interval of 0.001 of a second. Thus, Yang can use the Poisson distribution to determine probabilities involving the number of online enquiries received in a 20-minute interval.
The Poisson distribution has one parameter, λ (Greek lower-case letter lambda), which is the mean or expected number of events per interval. The variance of a Poisson distribution is also equal to λ, hence the standard deviation is equal to √λ. The number of events, X, of the Poisson random variable ranges from 0 to infinity. Equation 5.14, the mathematical formula for the Poisson distribution, gives the probability of X events in an interval, given that λ events are expected.

POISSON PROBABILITY DISTRIBUTION

$$P(X) = \frac{e^{-\lambda}\lambda^X}{X!} \qquad (5.14)$$

where
P(X) = the probability of X events in a given interval
λ = expected number of events in the given interval
e = 2.71828… is the base of natural logarithms

To illustrate the use of the Poisson distribution, calculate the probability that in a given 20 minutes exactly five online enquiries will be received, and the probability that less than five online enquiries will be received. On average, Gaia Adventure Tours receives 30 online enquiries an hour, so the average or expected number of enquiries in 20 minutes is λ = (30/60) × 20 = 10.
Using Equation 5.14 with λ = 10, the probability that in a given 20 minutes exactly five online enquiries will be received is:

$$P(X = 5) = \frac{e^{-10}10^5}{5!} = \frac{4.53999…}{120} = 0.03783…$$

and the probability that in any given 20 minutes less than five online enquiries will be received is:

$$P(X < 5) = \frac{e^{-10}10^0}{0!} + \frac{e^{-10}10^1}{1!} + \frac{e^{-10}10^2}{2!} + \frac{e^{-10}10^3}{3!} + \frac{e^{-10}10^4}{4!} = 0.00004… + 0.00045… + 0.00226… + 0.00756… + 0.01891… = 0.029252…$$

Thus, there is a 3% likelihood that less than five online enquiries will be received in 20 minutes, leading to enquiry staff having significant idle time. To avoid the computational drudgery involved in these calculations, many Poisson probabilities can be found directly from Table E.7 (Appendix E), a portion of which is


reproduced in Table 5.5. Table E.7 provides probabilities for the Poisson random variable for X = 0, 1, 2, … for selected values of the parameter λ. The probability that exactly five online enquiries will be received in a given 20-minute interval when the mean number of enquiries received in 20 minutes is 10 is given by the intersection of the row X = 5 and column λ = 10. Therefore, from Table 5.5, P(X = 5) = 0.0378.

Table 5.5 Calculating a Poisson probability for λ = 10 (extracted from Table E.7 in Appendix E of this book)

            λ
X    9.1       9.2       ....    10
0    0.0001    0.0001    ....    0.0000
1    0.0010    0.0009    ....    0.0005
2    0.0046    0.0043    ....    0.0023
3    0.0140    0.0131    ....    0.0076
4    0.0319    0.0302    ....    0.0189
5    0.0581    0.0555    ....    0.0378
6    0.0881    0.0851    ....    0.0631
7    0.1145    0.1118    ....    0.0901

You can also calculate the Poisson probabilities given in Table E.7 using Microsoft Excel. Figure 5.4 presents a Microsoft Excel worksheet for the Poisson distribution, with λ = 10, using the Excel 2010 and later inbuilt Poisson function POISSON.DIST(x,mean,cumulative). For earlier versions of Excel the corresponding Poisson function is POISSON(x,mean,cumulative).

Figure 5.4 Microsoft Excel worksheet for the 'number of online enquiries in 20 minutes' example

Data: Mean/Expected number of events of interest = 10
Poisson probabilities table, each value calculated with =POISSON.DIST(X, $E$4, FALSE):

X      0        1        2        3        4        5        6        7        8        9        10
P(X)   0.0000   0.0005   0.0023   0.0076   0.0189   0.0378   0.0631   0.0901   0.1126   0.1251   0.1251

X      11       12       13       14       15       16       17       18       19       20
P(X)   0.1137   0.0948   0.0729   0.0521   0.0347   0.0217   0.0128   0.0071   0.0037   0.0019
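The Poisson probabilities in Table E.7 and Figure 5.4 can equally be computed in Python; this is a minimal sketch added for illustration (not from the textbook).

```python
from scipy.stats import poisson

lam = 10   # expected enquiries in a 20-minute interval (30 per hour * 20/60)
print(poisson.pmf(5, lam))   # P(exactly five enquiries), approx. 0.0378
print(poisson.cdf(4, lam))   # P(fewer than five enquiries), approx. 0.0293
```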


EXAMPLE 5.5
CALCULATING POISSON PROBABILITIES
The number of faults per month that arise in the gearboxes of a bus fleet is known to follow a Poisson distribution with a mean of 2.5 faults per month. What is the probability that in a given month no faults are found? At least one fault is found?

SOLUTION
Using Equation 5.14 with λ = 2.5 (or using Table E.7 or Microsoft Excel), the probabilities that in a given month no faults are found and at least one fault is found are:

$$P(X = 0) = \frac{e^{-2.5}(2.5)^0}{0!} = \frac{0.08208… \times 1}{1} = 0.08208…$$

P(X ⩾ 1) = 1 − P(X = 0) = 1 − 0.08208… = 0.91791… The probability that there will be no faults in a given month is 0.0821. The probability that there will be at least one fault is 0.9179, which is the complement of there being no faults in a given month.

EXAMPLE 5.6
CALCULATING POISSON PROBABILITIES
For the Gaia Adventure Tours scenario, what is the probability that 24 or more online enquiries are received in 30 minutes?

SOLUTION
Let X = number of online enquiries received in 30 minutes; then X is Poisson with λ = (30/60) × 30 = 15. Using Microsoft Excel we can obtain Table 5.6, which gives Poisson probabilities for λ = 15.

Table 5.6 Poisson probabilities for λ = 15 (enquiries received in 30 minutes; expected number of enquiries: 15)

X   P(X)       X    P(X)       X       P(X)
0   0.0000     8    0.0194     16      0.0960
1   0.0000     9    0.0324     17      0.0847
2   0.0000     10   0.0486     18      0.0706
3   0.0002     11   0.0663     19      0.0557
4   0.0006     12   0.0829     20      0.0418
5   0.0019     13   0.0956     21      0.0299
6   0.0048     14   0.1024     22      0.0204
7   0.0104     15   0.1024     23      0.0133
                               Total   0.9805

From Table 5.6:
P(X ⩾ 24) = 1 − P(X < 24) = 1 − P(X ⩽ 23) = 1 − 0.9805 = 0.0195
Therefore, in approximately 2% of 30-minute intervals 24 or more online enquiries are expected to be received, hence increasing the likelihood that enquiries start to queue and may not be answered within the stated 45 minutes.
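The tail probability in Example 5.6 is one line in code; the following Python sketch is an added illustration (not from the textbook), using the survival function to avoid summing 24 individual terms.

```python
from scipy.stats import poisson

lam = 15   # expected enquiries in a 30-minute interval
# P(X >= 24) = 1 - P(X <= 23), via the survival function
print(poisson.sf(23, lam))   # approx. 0.0195
```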


Problems for Section 5.4

LEARNING THE BASICS
5.25 Assume a Poisson distribution. a. If λ = 2.5, find P(X = 2). b. If λ = 8, find P(X = 8). c. If λ = 0.5, find P(X = 1). d. If λ = 3.7, find P(X = 0).
5.26 Assume a Poisson distribution. a. If λ = 2, find P(X ⩾ 2). b. If λ = 8, find P(X ⩾ 3). c. If λ = 0.5, find P(X ⩽ 1). d. If λ = 4, find P(X ⩾ 1). e. If λ = 5, find P(X ⩽ 3).
5.27 Assume a Poisson distribution with λ = 5. Find the probability that: a. X = 1 b. X < 1 c. X > 1 d. X ⩽ 1

APPLYING THE CONCEPTS Problems 5.28 to 5.32 can be solved manually or by using Microsoft Excel. Some, but not all, can also be solved using Table E.7.

5.28 The quality control manager of Marilyn’s Bakery is inspecting a batch of chocolate-chip biscuits that has just been baked. If the production process is in control, the mean number of chip parts per biscuit is 6.0. What is the probability that, in any particular biscuit being inspected: a. fewer than five chip parts will be found? b. exactly five chip parts will be found? c. five or more chip parts will be found? d. either four or five chip parts will be found? 5.29 Refer to problem 5.28. How many biscuits in a batch of 100 should the manager expect to discard if company policy requires that all chocolate-chip biscuits sold must have at least four chocolate-chip parts? LEARNING OBJECTIVE

6

Identify situations that can be modelled by a hypergeometric distribution and calculate hypergeometric probabilities hypergeometric distribution Discrete probability distribution where the random variable is the number of successes in a sample of n observations from a finite population without replacement.

5.30 The number of floods in a certain region is approximately Poisson distributed with an average of three floods every 10 years. a. Find the probability that a family living in the area for one year will experience: i. exactly one flood ii. at least one flood b. Find the probability that a student who moves to the area for three years will experience i. exactly one flood ii. at least one flood 5.31 Based on past experience, it is assumed that the number of flaws per metre in rolls of grade 2 paper follow a Poisson distribution with a mean of one flaw per 5 metres of paper. What is the probability that in a: a. 1-metre roll there will be at least two flaws? b. 10-metre roll there will be at least one flaw? c. 50-metre roll there will be between five and 15 (inclusive) flaws? 5.32 A toll-free phone number is available from 9 am to 9 pm for customers to register a complaint about a product purchased from a large company. Past history indicates that an average of 0.4 calls are received per minute. a. What properties must be true about the situation described above in order to use the Poisson distribution to calculate probabilities concerning the number of phone calls received in a 1-minute period? b. Assuming that this situation matches the properties you discuss in (a), what is the probability that, during a 1-minute period: i. zero phone calls will be received? ii. three or more phone calls will be received? c. What is the maximum number of phone calls that will be received in a 1-minute period 99.99% of the time?

5.5  HYPERGEOMETRIC DISTRIBUTION The binomial distribution and the hypergeometric distribution are both concerned with the number of successes in a sample of n observations. However, they differ in the way in which the sample is selected. For the binomial distribution, as the probability of success p must be constant for all observations and the outcome of any particular observation must be independent of any other, the random sample is either selected with replacement from a finite population or without replacement from an infinite population. For the hypergeometric distribution, the random sample is selected without replacement from a finite population. Thus, the outcome of one observation is dependent on the outcomes of previous observations. Consider a population of size N. Let A represent the total number of successes in the population. The hypergeometric distribution is then used to find the probability of X successes in a


sample of size n selected without replacement. Equation 5.15, the mathematical formula for the hypergeometric distribution, gives the probability of X successes, given n, N and A.

HYPERGEOMETRIC DISTRIBUTION

$$P(X) = \frac{\dbinom{A}{X}\dbinom{N-A}{n-X}}{\dbinom{N}{n}} \qquad (5.15)$$

where
P(X) = the probability of X successes, given n, N and A
n = sample size
N = population size
A = number of successes in the population
N − A = number of failures in the population
X = number of successes in the sample
n − X = number of failures in the sample
$\binom{A}{X} = {}_AC_X$ (see Equation 5.10)

The number of successes in the sample, represented by X, cannot be greater than the number of successes in the population, A, or the sample size, n. Thus, the range of the hypergeometric random variable is limited to the minimum of the sample size or the number of successes in the population.
Equation 5.16 defines the mean of the hypergeometric distribution.

THE MEAN OF THE HYPERGEOMETRIC DISTRIBUTION

$$\mu = E(X) = \frac{nA}{N} \qquad (5.16)$$

Equation 5.17 defines the standard deviation of the hypergeometric distribution.

THE STANDARD DEVIATION OF THE HYPERGEOMETRIC DISTRIBUTION

$$\sigma = \sqrt{\frac{nA(N-A)}{N^2}}\sqrt{\frac{N-n}{N-1}} \qquad (5.17)$$

In Equation 5.17, the expression $\frac{N-n}{N-1}$ is a finite population correction factor that results from sampling without replacement from a finite population.

finite population correction factor Factor required when sampling from a finite population without replacement.

To illustrate the hypergeometric distribution, suppose that we wish to form a team of eight executives from different departments within a company. Suppose the company has a total of 30 executives, and 10 of these are from the finance department. If members of the team are to be selected at random, what is the probability that the team will contain two executives from the finance department? Here, the population of N = 30 executives within the company is finite. In addition, A = 10 are from the finance department and a team of n = 8 executives is to be selected.


Using Equation 5.15:

$$P(X = 2) = \frac{\dbinom{10}{2}\dbinom{20}{6}}{\dbinom{30}{8}} = \frac{\dfrac{10!}{2!\,8!} \times \dfrac{20!}{6!\,14!}}{\dfrac{30!}{8!\,22!}} = 0.2980…$$

Using Equations 5.16 and 5.17:

$$\mu = E(X) = \frac{8 \times 10}{30} = 2.666…$$

and

$$\sigma = \sqrt{\frac{8 \times 10 \times (30-10)}{30^2}}\sqrt{\frac{30-8}{30-1}} = 1.1613…$$

Thus, the probability that the team will contain two members from the finance department is 0.298, or 29.8%. Such calculations can become tedious, especially as N gets larger. However, Microsoft Excel can be used to calculate hypergeometric probabilities. Figure 5.5, using the Excel 2010 and later inbuilt hypergeometric function HYPGEOM.DIST(sample_s,number_sample, population_s,number_population,cumulative), presents a Microsoft Excel worksheet for the team-formation example. Note that the number of executives from the finance department (i.e. the number of successes in the sample) can be equal to 0, 1, 2, … 8. Figure 5.5 Microsoft Excel worksheet for the team-formation example

Data: Sample size = 8; No. of events of interest in population = 10; Population size = 30
Hypergeometric probabilities table, each value calculated with =HYPGEOM.DIST(X, $B$4, $B$5, $B$6, FALSE):

X      0        1        2        3        4        5        6        7        8
P(X)   0.0215   0.1324   0.2980   0.3179   0.1738   0.0491   0.0068   0.0004   0.0000

For earlier versions of Excel the corresponding hypergeometric function is HYPGEOMDIST (sample_s,number_sample,population_s,number_pop).
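As a cross-check outside Excel, the following Python sketch (added for illustration, not from the textbook) evaluates the team-formation example with scipy.stats.hypergeom; note that scipy's arguments are, in order, the population size, the number of successes in the population and the sample size.

```python
from scipy.stats import hypergeom

N, A, n = 30, 10, 8          # population size, finance executives, team size
rv = hypergeom(N, A, n)
print(rv.pmf(2))             # P(two finance executives on the team), approx. 0.298
print(rv.mean(), rv.std())   # approx. 2.667 and 1.161
```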

Problems for Section 5.5

LEARNING THE BASICS
5.33 Determine the following: a. If n = 4, N = 10 and A = 5, find P(X = 3). b. If n = 4, N = 6 and A = 3, find P(X = 1). c. If n = 5, N = 12 and A = 3, find P(X = 0). d. If n = 3, N = 10 and A = 3, find P(X = 3).
5.34 Referring to problem 5.33, calculate the mean and the standard deviation for the hypergeometric distributions described in (a) to (d).

APPLYING THE CONCEPTS
Problems 5.35 to 5.39 can be solved manually or by using Microsoft Excel.
5.35 An auditor for the Australian Taxation Office is selecting a sample of six tax returns from a batch of 100 for an audit. If two or more of these returns contain errors, the entire batch of 100 tax returns will be audited.


a. What is the probability that the entire batch will be audited if the true number of returns with errors in the batch is: i. 25? ii. 30? iii. 5? iv. 10?
b. Discuss the differences in your results depending on the true number of returns in the batch with error.
5.36 The dean of a business faculty wishes to form an executive committee of five from among the 40 tenured faculty members. The selection is to be random, and there are eight tenured faculty members in accounting.
a. What is the probability that the committee will contain: i. none of them? ii. at least one of them? iii. not more than one of them?
b. What is your answer to part (i) above if the committee consists of seven members?
5.37 In a shipment of 15 hard disks, five are defective. If four of the disks are inspected,
a. What is the probability that: i. exactly one is defective? ii. at least one is defective? iii. no more than two are defective?
b. What is the mean number of defective hard disks that you would expect to find in the sample of four hard disks?
5.38 In each game of OZ Lotto seven numbers are selected, from 1 to 45. Seven winning numbers are chosen at random plus two supplementary numbers. An extension of the hypergeometric distribution to calculate probabilities of selecting combinations of winning and supplementary numbers is:

$$P(X, Y) = \frac{\dbinom{A}{X}\dbinom{S}{Y}\dbinom{N-A-S}{n-X-Y}}{\dbinom{N}{n}}$$

where P(X, Y) is the probability of selecting X winning numbers and Y supplementary numbers, and S is the number of supplementary numbers.
a. To win Division 1, the seven winning numbers must be selected. In any game, what is the probability of winning Division 1?

b. To win Division 2, six winning numbers plus either of the two supplementary numbers must be selected. In any game, what is the probability of winning Division 2? c. To win Division 3, six winning numbers must be selected. In any game, what is the probability of winning Division 3? d. To win Division 4, five winning numbers plus either of the two supplementary numbers must be selected. In any game, what is the probability of winning Division 4? e. To win Division 5, five winning numbers must be selected. In any game, what is the probability of winning Division 5? f. To win Division 6, four winning numbers must be selected. In any game, what is the probability of winning Division 6? g. To win Division 7, three winning numbers plus either of the two supplementary numbers must be selected. In any game, what is the probability of winning Division 7? h. What is the probability of selecting none of the winning or supplementary numbers? 5.39 In a certain game of Lotto six numbers are selected from 1 to 45. Six winning numbers are chosen at random plus two supplementary numbers. Use the formula in problem 5.38 or Equation 5.15 to calculate the following probabilities. a. To win Division 1, the six winning numbers must be selected. In any game, what is the probability of winning Division 1? b. To win Division 2, five winning numbers plus either of the two supplementary numbers must be selected. In any game, what is the probability of winning Division 2? c. To win Division 3, five winning numbers must be selected. In any game, what is the probability of winning Division 3? d. To win Division 4, four winning numbers must be selected. In any game, what is the probability of winning Division 4? e. To win Division 5, three winning numbers plus either of the two supplementary numbers must be selected. In any game, what is the probability of winning Division 5? f. What is the probability of selecting none of the winning or supplementary numbers?



Assess your progress

Summary
This chapter introduced mathematical expectation, covariance and the development and application of the binomial, Poisson and hypergeometric distributions. In the Gaia Adventure Tours scenario, we saw how to calculate probabilities from the binomial and Poisson distributions concerning the number of online enquiries converted to bookings in a sample of n enquiries and the number of online enquiries received in a given time interval. In the next chapter, important continuous distributions are introduced, in particular the normal distribution.
To help decide which discrete probability distribution to use for a particular situation, we need to ask the following questions:
• Is there a fixed number of observations n, each of which is classified as success or failure, or are we counting the number of times an event happens in an interval of time or space? If there is a fixed number of observations n, each of which is classified as success or failure, we can use the binomial or hypergeometric distribution, if the properties of the distribution are satisfied. If we are counting the number of events in an interval, we can use the Poisson distribution only if all its properties are satisfied.
• In deciding whether to use the binomial or hypergeometric distribution, is the probability of success constant for all observations? If yes, we may be able to use the binomial distribution. If no, we may be able to use the hypergeometric distribution.

Key formulas Expected value of the sum of two random variables

Expected value 𝛍 of a discrete random variable

E(X + Y ) = E(X ) + E(Y ) (5.5)

N

μ = E(X ) =



XiP(Xi) (5.1)

Variance of the sum of two random variables

i=1

σ2X + Y = σ2X + σ2Y + 2σXY  (5.6)

Variance of a discrete random variable N

σ2 =

∑ [Xi − E(X)]2 P(Xi) 

(5.2a) (definition)

i=1 N

σ2 =

∑ Xi2P(Xi) − E(X)2  

(5.2b) (calculation)

i=1

Standard deviation of the sum of two random variables

σX + Y = σ2X + Y  (5.7) Portfolio expected return

E(P ) = wE(X ) + (1 − w)E(Y ) (5.8) Portfolio risk

Standard deviation of a discrete random variable

σp = w2σ2X + (1 − w)2σ2Y + 2w(1 − w)σXY  (5.9)

σ = σ2  (5.3)

Combinations Covariance

σXY =

∑∑

all Xi all Yj

σXY =

∑∑

all Xi all Yj

[Xi − E(X )][Yj − E(Y)]P(Xi and Yj) (5.4a) (definition)

n! n = nCX =  (5.10) X !(n − X)! X Binomial distribution

XiYjP(Xi and Yj) − E(X )E(Y ) (5.4b)

  (calculation)

P(X ) =

n! p X(1 − p)n − X  (5.11) X !(n − X )!


The mean of the binomial distribution
μ = E(X) = np   (5.12)

The standard deviation of the binomial distribution
σ = √σ² = √[np(1 − p)]   (5.13)

Poisson distribution
P(X) = e^(−λ) λ^X / X!   (5.14)

Hypergeometric distribution
P(X) = [C(A, X) × C(N − A, n − X)] / C(N, n)   (5.15)

The mean of the hypergeometric distribution
μ = E(X) = nA / N   (5.16)

The standard deviation of the hypergeometric distribution
σ = √{[nA(N − A) / N²] × [(N − n) / (N − 1)]}   (5.17)
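The distribution formulas above (Equations 5.10 to 5.17) can be checked with a few lines of code. The short Python sketch below is not part of the text or its Excel workbooks; it simply translates the formulas directly, and the function names are ours.

import math

def binomial_pmf(x, n, p):
    # Equation 5.11: P(X) = [n! / (x!(n - x)!)] p^x (1 - p)^(n - x)
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

def poisson_pmf(x, lam):
    # Equation 5.14: P(X) = e^(-lambda) lambda^x / x!
    return math.exp(-lam) * lam**x / math.factorial(x)

def hypergeometric_pmf(x, n, A, N):
    # Equation 5.15: P(X) = C(A, x) C(N - A, n - x) / C(N, n)
    return math.comb(A, x) * math.comb(N - A, n - x) / math.comb(N, n)

def binomial_mean_sd(n, p):
    # Equations 5.12 and 5.13
    return n * p, math.sqrt(n * p * (1 - p))

def hypergeometric_mean_sd(n, A, N):
    # Equations 5.16 and 5.17, including the finite population correction factor
    return n * A / N, math.sqrt(n * A * (N - A) / N**2 * (N - n) / (N - 1))

print(binomial_pmf(1, 4, 0.1))     # 0.2916 for n = 4, p = 0.1
print(binomial_mean_sd(4, 0.1))    # mean 0.4, standard deviation 0.6

For example, with n = 4 and p = 0.1 (the 'online enquiries converted to bookings' setting used later in the Excel guide), P(X = 1) = 4 × 0.1 × 0.9³ = 0.2916.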

Key terms
binomial distribution 189; covariance 185; expected value of a discrete random variable 182; expected value of the sum of two random variables 186; finite population correction factor 201; hypergeometric distribution 200; mathematical model 189; Poisson distribution 196; portfolio 187; portfolio expected return 187; portfolio risk 187; probability distribution for a discrete random variable 181; standard deviation of a discrete random variable 183; standard deviation of the sum of two random variables 186; variance of a discrete random variable 183; variance of the sum of two random variables 186

Chapter review problems
CHECKING YOUR UNDERSTANDING
5.40 What is the meaning of the expected value of a probability distribution?
5.41 What are the four properties of a binomial distribution?
5.42 What are the three properties of a Poisson distribution?
5.43 When is the hypergeometric distribution used instead of the binomial distribution?

APPLYING THE CONCEPTS Problems 5.44 to 5.53 can be solved manually or by using Microsoft Excel. Some, but not all, can also be solved using Tables E.6 and E.7.

5.44 From September 1984 to July 2017 the ASX All Ordinaries Index has opened higher than the previous month for 233 of the 395 months – that is, approximately 59.0% of months (Data from YAHOO!7FINANCE accessed July 2017). a. Assuming a binomial distribution, estimate the probability that the ASX All Ordinaries Index will open higher than the previous month: i. for one month ii. for two months in a row

iii. in four of the next five months iv. in none of the next five years b. For the situation in (a) above, what assumption of the binomial distribution might not be valid? 5.45 At a recent election, 12% of the voters in a certain electorate gave their first preference to the Greens candidate. If 10 people on the electoral roll for that electorate were randomly selected, find the probability that: a. exactly four gave their first preference to the Greens candidate b. at most four gave their first preference to the Greens candidate c. a majority gave their first preference to the Greens candidate 5.46 When calculating premiums on life insurance products, insurance companies often use life tables that enable the probability of a person dying in any age interval to be calculated. The following data obtained from the ‘New Zealand Abridged Period Life Table: 2014-16’ gives the number out of 100,000 New Zealand-born males and females who are still alive during each five-year period of life between age 20 and 60 (inclusive).


Exact age (years)   Alive at exact age, out of 100,000 females born   Alive at exact age, out of 100,000 males born
20                  99,288                                            99,031
25                  99,128                                            98,685
30                  98,949                                            98,312
35                  98,726                                            97,899
40                  98,427                                            97,381
45                  97,934                                            96,649
50                  97,157                                            95,548
55                  95,933                                            93,853
60                  94,162                                            91,352

Data obtained from accessed June 2017. © Statistics New Zealand and licensed by Statistics New Zealand for re-use under the Creative Commons Attribution 3.0 New Zealand licence

Suppose a New Zealand-born female on her 35th birthday purchases a one million dollar, five-year term life policy from an insurance company. That is, the insurance company must pay her estate $1 million if she dies within the next five years. a. Determine the insurance company’s expected payout on this policy. b. What would be the minimum you would expect the insurance company to charge her for this policy? c. What would the expected payout be if the same policy were taken out by a New Zealand-born female on her 40th birthday? d. Repeat parts (a) to (c) for a New Zealand-born male. 5.47 The emergency facility at a small country hospital has been in operation for 60 weeks and has been used 120 times. The weekly pattern of demand for this facility has a Poisson distribution. Find the: a. mean demand per week b. probability the emergency facility is not used in a given week c. probability the emergency facility is used at least twice in a week d. probability the room is used at least once in a given twoweek period 5.48 Check$mart’s records show that 58% of its customers pay only the minimum repayment on their credit card each month. a. If a random sample of 20 credit-card holders is selected, what is the probability that: i. none pays the minimum amount? ii. no more than five pay the minimum amount? iii. more than 10 pay the minimum amount? b. What assumptions did you have to make to answer each part of (a) above? 5.49 In 2016, the New Zealand general marriage rate was 10.95 marriages and civil unions per 1,000 population 16 years and over who are not married or in a civil union. The corresponding divorce rate was 8.7 per 1,000 existing marriages and civil unions (data obtained from Marriage, Civil Unions and Divorces: Year ended December 2016, Statistics New Zealand accessed July 2017).

a. Suppose 60 unmarried women were randomly selected on 1 January 2016. i. Find the probability that at least three married, including civil unions, during 2016. ii. Find the probability that at most two married, including civil unions, during 2016. iii. What is the mean and standard deviation of the number who married during 2016? b. Suppose 60 married couples were randomly selected on 1 January 2016. i. Find the probability that none divorced during 2016. ii. Find the probability that at most two divorced during 2016. iii. What is the mean and standard deviation of the number of divorces during 2016?
5.50 A customer service manager of Check$mart bank is monitoring one of its phone banking call centres servicing a rural region. Suppose that on average the call centre receives 180 calls an hour during its operating hours of 8 am to 6 pm. a. Can the Poisson distribution be used to model the number of calls received in one minute? Explain. b. Assuming the number of calls received in a given interval is Poisson, calculate the probability that: i. in a given minute exactly two calls will be received ii. more than two calls will be received in a minute iii. the number of calls received in 5 minutes is at least 20 iv. the number of calls received in 5 minutes is less than 10. At current staffing levels calls start to queue, increasing the time it takes to answer a call, when the number of calls received in 5 minutes is 20 or more. However, when there are less than 10 calls in 5 minutes, more than one Customer Service Officer is usually available, increasing unproductive staff time. c. What conclusions can you draw from problem (b) parts (iii) and (iv) above?
5.51 Suppose the average number of students who log on to a university's computer system is 4.45 in each 5-minute interval. a. What is the probability that six students will log on in the next minute? b. What is the probability that fewer than six students will log on during the next two minutes?
5.52 A study of various news home pages reports that the mean number of bad links per home page is 0.4 and the mean number of spelling errors per home page is 0.16. Use the Poisson distribution to find the probability that a randomly selected home page will contain: a. no bad links b. five or more bad links c. no spelling errors d. 10 or more spelling errors
5.53 In an online test, 10 multiple-choice questions are randomly selected from a test bank of 100 questions. Supposing that each student has two attempts at the online test, what is the probability that in the second test a student attempts there are:


a. no questions from the first test? b. at least one question from the first test? c. exactly five questions from the first test? d. 10 questions from the first test?
5.54 The following table gives the grade distribution at a certain university.

Fail   Pass   Credit   Distinction   High Distinction
15%    40%    25%      15%           5%

Supposing that a result is selected randomly, what is the probability that: a. the result is a passing grade (Pass or above)? b. the result is a Credit or above? c. If a random sample of 15 results is selected, what is the probability of: i. exactly three Fails? ii. more than five Fails? iii. all being Pass or above? iv. none being Credit or above? v. exactly five being Credits or above? vi. at least one Distinction or High Distinction? d. Based on the random sample of 15 results, what is the expected number, variance and standard deviation of the number of: i. Fail grades? ii. grades Pass or above? e. Comment on the relationship between (i) and (ii) in part (d) above. A grade point of 7 is assigned to each High Distinction, 6 to each Distinction, 5 to each Credit, 4 to each Pass and 0 to each Fail. f. What is the average, variance and standard deviation of grade points for the university?
5.55 You are trying to develop a strategy for investing in two different shares. The anticipated annual return for a $1,000 investment in each share has the following probability distribution:

Probability   Share A   Share B
0.25          $240      –$100
0.50          $150      $150
0.25          –$100     $240

a. Calculate: i. the expected returns for share A and for share B ii. the variances and standard deviations for share A and for share B iii. the covariance of share A and share B b. Suppose you want to create a portfolio that consists of share A and share B. Calculate the portfolio expected return and risk if the proportion invested in share A is: i. 0.40 ii. 0.50 iii. 0.60 c. On the basis of the results in (b), which portfolio would you recommend? Explain.
5.56 The breakdown by home address of the previous year's 993 drink-driving offences in Problem 2.67 is:

Home address                              Number of drink-driving offences
Local – in council area
  Seaside town                            151
  Not seaside town                        462
Not local – not in council area
  Intrastate (within state)               130
  Interstate (another state)              228
  International (outside Australia)        22

Suppose that Kai randomly selects 20 of the offenders to interview in depth. What is the probability that: a. all 20 will be local? b. 15 will be local? c. five will be from interstate? d. at least 10 will not be local?
5.57 Past data indicate that 6% of all students enrolled in a first-year statistics unit at Tasman University obtain a High Distinction (HD). Assume that students are allocated randomly to a tutorial group. a. What is the probability that in a tutorial group of 30 students: i. none receive an HD? ii. at most, two students obtain an HD? iii. more than four students obtain an HD? b. What is the mean and standard deviation of the number of HDs obtained in a tutorial group?
5.58 In a regional city, on average 2.6 traffic accidents are reported an hour from 7 am to 7 pm. On a given day, what is the probability that: a. four accidents are reported from 9 am to 9.30 am? b. five accidents are reported from 2 pm to 4 pm? c. three or four accidents are reported from 2 pm to 3 pm? d. at least one accident is reported from 4 pm to 5.30 pm?
5.59 A hand of five cards is dealt from a shuffled standard pack of 52 cards. Find the probability that: a. all the cards are red b. exactly two of the cards are red c. at least one card is red d. the hand contains four kings e. the hand has at least one king f. all the cards are hearts g. the cards are all the same suit
5.60 Pat's Used Cars sells on average 3.6 used cars in a normal trading day. Assume the number of used cars sold follows a Poisson distribution. Determine: a. the probability that five used cars are sold in a day b. the probability that no more than two used cars are sold in a day


c. the expected number and standard deviation of used cars sold in a 10-day period 5.61 The Ashland MultiComm Services (AMS) marketing department wants to increase subscriptions for a combined telephone, pay TV and Internet bundle called 3-For-All. AMS marketing has been conducting an aggressive directmarketing campaign that includes postal and electronic mailings and telephone solicitations. Feedback from these efforts indicates that including premium channels of the customer’s choice in this bundle is a very important factor for both current and prospective subscribers. After several brainstorming sessions, the marketing department has decided to add premium channels as a no-cost benefit of subscribing to 3-For-All. The research director, Mona Fields, is planning to conduct a survey among prospective customers to determine how many premium channels need to be added to 3-For-All in order to generate increased subscriptions. Based on past campaigns and on industry-wide data, she estimates the following: Number of free premium channels 0 1 2 3 4 5

Probability of subscriptions 0.020 0.040 0.060 0.070 0.080 0.085

a. If a sample of 50 prospective customers is selected and no free premium channels are included in the 3-For-All bundle, given the above probability estimates, what is the probability that: i. fewer than three customers will subscribe to 3-For-All? ii. at most one customer will subscribe to 3-For- All ? iii. more than four customers will subscribe to 3-For-All ? iv. Suppose that in the survey of 50 prospective customers, five customers subscribe to 3-For-All. What does this tell you about the estimate of the proportion of customers who would subscribe to 3-For-All if no free premium channels are included? b. Instead of offering no premium free channels, as in part (a), suppose that two free premium channels of the customer’s choice are included in the 3-For-All bundle. Given the above probability estimates, what is the probability that: i. fewer than three customers will subscribe to 3-For-All? ii. at most one customer will subscribe to 3-For-All? iii. more than four customers will subscribe to 3-For-All? c. Compare the results of (b) to those of (a). d. Suppose that in a survey of 50 prospective customers where two free premium channels of the customer’s choice are included in the 3-For-All offer, five customers subscribe. What does this tell you about the estimate of the proportion of customers who would subscribe to 3-For-All if two free premium channels are included? e. What do the above results tell you about the effect of offering free premium channels of the customer’s choice on the likelihood of obtaining subscriptions to 3-For-All?

Chapter 5 Excel Guide EG5.1 THE PROBABILITY DISTRIBUTION FOR A DISCRETE VARIABLE

Key technique Use the SUMPRODUCT(cell range 1, cell range 2) function to calculate the expected value and variance. Example Calculate the expected value, variance, and standard deviation for the number of home mortgages approved per week data given in Table 5.1 on page 182. In-depth Excel Use the Discrete_Variable workbook as a model.

For the example, open to the DATA worksheet of the Discrete_Variable workbook. The worksheet already contains the entries needed to calculate the expected value, variance, and standard deviation (shown in the COMPUTE worksheet) for the example. For other problems, enter the probability distribution data into columns A and B of the DATA worksheet, overwriting the existing entries. If required, extend columns C and D, first selecting cell range C7:D7 and then copying the cell range down as many rows as necessary. If the probability distribution has fewer than six outcomes, select the rows that contain the extra, unwanted outcomes, right-click, and then click Delete in the shortcut menu.
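Outside Excel, the same SUMPRODUCT logic is a one-line sum. The Python sketch below is ours, not part of the workbook, and the six-outcome distribution in it is made up purely for illustration rather than taken from Table 5.1.

import math

# Illustrative probability distribution (not the Table 5.1 data)
outcomes      = [0, 1, 2, 3, 4, 5]
probabilities = [0.10, 0.10, 0.20, 0.30, 0.15, 0.15]

# Equations 5.1 to 5.3, the same products SUMPRODUCT forms in the worksheet
expected_value = sum(x * p for x, p in zip(outcomes, probabilities))
variance = sum((x - expected_value) ** 2 * p for x, p in zip(outcomes, probabilities))
standard_deviation = math.sqrt(variance)

print(expected_value, variance, standard_deviation)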


EG5.2 COVARIANCE OF A PROBABILITY DISTRIBUTION AND ITS APPLICATION IN FINANCE

Key technique Use the SQRT and SUMPRODUCT functions to calculate the portfolio analysis statistics. Example Perform the portfolio analysis for the Section 5.2 investment example. PHStat Use Covariance and Portfolio Analysis. For the example, select PHStat ➔ Decision-Making ➔ Covariance and Portfolio Analysis. In the Covariance and Portfolio Management dialog box (shown in Figure EG5.1): 1. Enter 3 as the Number of Outcomes. 2. Enter a Title, check Portfolio Management Analysis and click OK.

Figure EG5.1 Covariance and Portfolio Management dialog box

In the new worksheet (shown in Figure EG5.2):
1. Enter the probabilities and outcomes in the table that begins in cell B3.
2. Enter 0.5 as the Weight assigned to X.

In-depth Excel Use the COMPUTE worksheet of the Portfolio workbook as a template. The worksheet (shown in Figure EG5.2) already contains the data for the example. Overwrite the P, X and Y values and the weight assigned to X when you enter data for other problems. If a problem has more or fewer than three outcomes, first select row 5, right-click, and click Insert (or Delete) in the shortcut menu to insert (or delete) rows one at a time. If you insert rows, select the cell range F4:J4 and copy the contents of this range down through the new table rows. The worksheet also contains a Calculations Area that contains various intermediate calculations. Open the COMPUTE_FORMULAS worksheet to examine all the formulas used in this area.

Figure EG5.2 COMPUTE worksheet of Portfolio workbook
[The worksheet lists the probabilities and outcomes (P = 0.2, 0.5, 0.3; X = –100, 100, 250; Y = 200, 50, –100), the weight assigned to X (0.5), and the computed statistics: E(X) = 105, E(Y) = 35, Variance(X) = 14725, Standard deviation(X) = 121.3466, Variance(Y) = 11025, Standard deviation(Y) = 105, Covariance(XY) = –12675, Variance(X + Y) = 400, Standard deviation(X + Y) = 20, portfolio expected return = 70 and portfolio risk = 10. Key formulas include =SUMPRODUCT(B4:B6,C4:C6) for E(X), =SUMPRODUCT(B4:B6,D4:D6) for E(Y), =SQRT(B13) and =SQRT(B15) for the standard deviations, =B13+B15+2*B17 for Variance(X + Y), =SQRT(B18) for its standard deviation, =B22*B11+B23*B12 for the portfolio expected return and =SQRT(B22^2*B13+B23^2*B15+2*B22*B23*B17) for the portfolio risk.]

EG5.3  BINOMIAL DISTRIBUTION

Key technique Use the BINOM.DIST(number of events of interest, sample size, probability of an event of interest, FALSE) function.
Example Calculate the binomial probabilities for n = 4 and p = 0.1, given in Figure 5.2 for the 'number of online enquiries converted to bookings' problem.

PHStat Use Binomial. For the example, select PHStat ➔ Probability & Prob. Distributions ➔ Binomial. In the Binomial Probability Distribution dialog box (shown in Figure EG5.3): 1. Enter 4 as the Sample Size. 2. Enter 0.1 as the Prob. of an Event of Interest. 3. Enter 0 as the Outcomes From value and enter 4 as the (Outcomes) To value. 4. Enter a Title, check Histogram, and click OK. Figure EG5.3 Binomial Probability Distribution dialog box


Check Cumulative Probabilities before clicking OK in step 4 to have the procedure include columns for P(⩽X), P(<X), P(>X) and P(⩾X) in the binomial probabilities table.

In-depth Excel Use the Binomial workbook as a template and model. For the example, open to the COMPUTE worksheet of the Binomial workbook, shown in Figure 5.2 on page 193. The worksheet already contains the entries needed for the example. For other problems, change the sample size in cell B4 and the probability of an event of interest in cell B5. If necessary, extend the binomial probabilities table by selecting cell range A18:B18 and then copying the cell range down as many rows as necessary. Use the CUMULATIVE worksheet if you require cumulative probabilities. Use CUMULATIVE_OLDER worksheet if using a version of Excel before Excel 2010.
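As a cross-check on the BINOM.DIST results, the short Python sketch below (ours, not part of the Binomial workbook) tabulates the individual and cumulative probabilities for n = 4 and p = 0.1; the FALSE/TRUE argument of BINOM.DIST corresponds to the two columns printed.

import math

n, p = 4, 0.1

def binom_pmf(x):
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

cumulative = 0.0
for x in range(n + 1):
    prob = binom_pmf(x)          # BINOM.DIST(x, 4, 0.1, FALSE)
    cumulative += prob           # BINOM.DIST(x, 4, 0.1, TRUE)
    print(x, round(prob, 4), round(cumulative, 4))
# The first two individual probabilities are 0.6561 and 0.2916;
# the cumulative column ends at 1.0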

EG5.4  POISSON DISTRIBUTION

Key technique Use the POISSON.DIST(number of events of interest, the average or expected number of events of interest, FALSE) function.
Example Calculate the Poisson probabilities for the 'number of online enquiries received in 20 minutes' problem with λ = 10, as in Figure 5.4 on page 198.
PHStat Use Poisson. For the example, select PHStat ➔ Probability & Prob. Distributions ➔ Poisson. In the Poisson Probability Distribution dialog box (shown in Figure EG5.4):
1. Enter 10 as the Mean/Expected No. of Events of Interest.
2. Enter a Title and click OK.

Figure EG5.4 Poisson Probability Distribution dialog box

Check Cumulative Probabilities before clicking OK in step 2 to have the procedure include columns for P(⩽X), P(<X), P(>X) and P(⩾X) in the Poisson probabilities table. Check Histogram to construct a histogram of the Poisson probability distribution.

In-depth Excel Use the Poisson workbook as a template. For the example, open to the COMPUTE worksheet of the Poisson workbook, shown in Figure 5.4 on page 198. The worksheet already contains the entries for the example. For other problems, change the mean or expected number of events of interest in cell E4. If necessary, extend the Poisson probabilities table by selecting cell range A28:B28 and then copying the cell range down as many rows as necessary. Use the CUMULATIVE worksheet if you require cumulative probabilities. Use the CUMULATIVE_OLDER worksheet if using a version of Excel before Excel 2010.

EG5.5  HYPERGEOMETRIC DISTRIBUTION

Key technique Use the HYPGEOM.DIST(X, sample size, number of events of interest in the population, population size, FALSE) function.
Example Calculate the hypergeometric probabilities for the team formation problem in Figure 5.5 on page 202.
PHStat Use Hypergeometric. For the example, select PHStat ➔ Probability & Prob. Distributions ➔ Hypergeometric. In this procedure's dialog box (shown in Figure EG5.5):
1. Enter 8 as the Sample Size.
2. Enter 10 as the No. of Events of Interest in Pop.
3. Enter 30 as the Population Size.
4. Enter a Title and click OK.

Figure EG5.5 Hypergeometric Probability Distribution dialog box

Check Histogram to produce a histogram of the probability distribution.


In-depth Excel Use the Hypergeometric workbook as a template. For the example, open to the COMPUTE worksheet of the Hypergeometric workbook, shown in Figure 5.5 on page 202. The worksheet already contains the entries for the example. For other problems, change the sample size in cell B4, the number of events of interest in the population

in cell B5, and the population size in cell B6. If necessary, extend the hypergeometric probabilities table by selecting cell range A18:B18 and then copying the cell range down as many rows as necessary. Use the CUMULATIVE worksheet if you require cumulative probabilities. Use the CUMULATIVE_OLDER worksheet if using a version of Excel before Excel 2010.
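The two Excel-guide examples above can also be reproduced directly from Equations 5.14 and 5.15. The Python sketch below is ours; the value of X in each call is chosen only for illustration.

import math

def poisson_pmf(x, lam):
    # Equation 5.14
    return math.exp(-lam) * lam**x / math.factorial(x)

def hypergeom_pmf(x, n, A, N):
    # Equation 5.15
    return math.comb(A, x) * math.comb(N - A, n - x) / math.comb(N, n)

# Poisson example: expected number of online enquiries in 20 minutes is 10
print(round(poisson_pmf(10, 10), 4))

# Hypergeometric example: sample of n = 8 from a population of N = 30
# containing A = 10 members of interest (the team-formation problem)
print(round(hypergeom_pmf(3, 8, 10, 30), 4))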

Microsoft® product screen shots are reprinted with permission from Microsoft Corporation.


CHAPTER 6
The normal distribution and other continuous distributions

TASMAN UNIVERSITY ORIENTATION
As part of orientation activities, new students at Tasman University (TU) are encouraged to complete a 'Welcome to Tasman University (TU)' online program. To assess the success – or otherwise – of this program, data have been collected on the time a new student spends working through it. The data suggest that the time students spend on the first module in the program, 'Introduction to TU', is normally distributed with a mean of 7 minutes and a standard deviation of 2 minutes. From the data, the time students spend on another module in the program, 'Support at TU', is also normal, but with a mean of 4 minutes and a standard deviation of 1 minute. How can the orientation organisers use this data to answer questions about the time students spend on the 'Introduction to TU' and 'Support at TU' modules?
© Solis Images/Shutterstock


LEARNING OBJECTIVES

After studying this chapter you should be able to:
1 calculate probabilities from the normal distribution
2 determine whether a set of data is approximately normally distributed
3 calculate probabilities from the uniform distribution
4 calculate probabilities from the exponential distribution
5 use the normal distribution to approximate probabilities from the binomial distribution

In the Gaia Adventure Tours scenario in Chapter 5, Yang wanted to solve problems about the number of occurrences of an outcome in a given sample size or the number of events in a specified interval. A different task is faced in the Tasman University Orientation scenario, one that involves a continuous measurement, since the time students spend on the 'Introduction to TU' module can be any positive value, not just an integer value. How, then, can the orientation organisers answer questions about continuous numerical variables such as:
• What proportion of students spend more than 9 minutes on the 'Introduction to TU' module?
• 10% of students spend less than how long on the module?
• What is the probability that a randomly chosen student accessing the module spends less than 3.5 minutes on it?
As in Chapter 5, we use probability distributions as models. This chapter introduces the characteristics of a continuous probability distribution and then uses the normal, uniform and exponential distributions to solve business and other problems.

6.1  CONTINUOUS PROBABILITY DISTRIBUTIONS Chapter 5 discussed discrete random variables and probability distributions. In this chapter we look at continuous random variables and probability distributions. Continuous random variables arise from a measuring process where the response can take on any value within a continuum or interval; for example time, temperature, weight, height, revenue or cost. A continuous probability density function, represented by f(x), is the mathematical expression that defines the distribution of the values for a continuous random variable. Figure 6.1 graphically displays the three continuous probability density functions discussed in this chapter. Panel A depicts a normal distribution. The normal distribution is symmetrical and bell shaped, implying that most values tend to cluster around the mean, which, due to its symmetry, is equal to the median. Although the values in a normal distribution can range from negative infinity to positive infinity, the shape of the distribution makes it very unlikely that extremely large or extremely small values will occur. Panel B depicts a uniform distribution where the probability of occurrence of a value is equally likely to occur anywhere in the range between the smallest value a and the largest value b. Sometimes referred to as the rectangular distribution, the uniform distribution is symmetrical and therefore the mean equals the median. An exponential distribution is illustrated in panel C. This distribution is skewed to the right, with the mean larger than the median. The range for an exponential distribution is zero to positive infinity but its shape makes the occurrence of extremely large values unlikely.

continuous probability density function Mathematical expression that defines the distribution of the values for a continuous random variable.


Figure 6.1 Three continuous distributions

[Panel A: normal distribution; Panel B: uniform distribution over the range a to b; Panel C: exponential distribution. Horizontal axis in each panel: values of X.]

Note that a continuous probability density function gives the graph of the probability distribution, not the probability, as is the case with a discrete probability function. Probabilities involving continuous random variables are calculated as areas under the curve given by the probability density function and between specified values of the random variable.

LEARNING OBJECTIVE 1
Calculate probabilities from the normal distribution

normal distribution Continuous probability distribution represented by a bell-shaped curve.

6.2  THE NORMAL DISTRIBUTION
The normal distribution (sometimes referred to as the Gaussian distribution) is the most common continuous probability distribution used in statistics. The normal distribution is vitally important in statistics for three main reasons:
1. Numerous continuous variables common in business, and elsewhere, have distributions that are normal or approximately normal.
2. The normal distribution can be used to approximate various discrete probability distributions.
3. The normal distribution provides the basis for classical statistical inference because of its relationship to the Central Limit Theorem (discussed in Section 7.2).
The normal distribution is represented by the classic bell shape depicted in panel A of Figure 6.1. In the normal distribution, we can calculate the probability that values of the random variable occur within a range or interval. However, the probability of a particular or individual value of a continuous random variable, such as a normal random variable, is zero. This property distinguishes continuous variables, which are measured, from discrete variables, which are counted. As an example, time (in seconds) is measured and not counted. Therefore, we can determine the probability that the load time for a website is between 1 and 5 seconds or between 2 and 4 seconds or between 2.99 and 3.01 seconds. However, the probability that the load time is exactly 3 seconds is effectively zero.
The normal distribution has several important theoretical properties:
• It is bell-shaped (and thus symmetrical) in its appearance.
• Its mean and median are equal.
• Its middle 50% of data is within approximately 4/3 standard deviations. This means that the interquartile range is contained within an interval of two-thirds of a standard deviation below the mean to two-thirds of a standard deviation above the mean – that is, the middle 50% of data have Z scores (introduced in Section 3.1) between −2/3 and +2/3.
• Its associated random variable has an infinite range (−∞ < X < ∞).
In practice, many variables have distributions that closely resemble the theoretical properties of the normal distribution. The data in Table 6.1 represent the thickness (in millimetres) of 10,000 brass washers manufactured by a large company. The continuous variable of interest, thickness, can be approximated by the normal distribution. The measurements of the thickness of the 10,000 brass washers cluster in the interval 0.485 to 0.495 mm and are distributed


symmetrically around that interval, forming a bell-shaped pattern. As illustrated in Table 6.1, the non-overlapping (mutually exclusive) classes contain all possible values (are collectively exhaustive) and so the relative frequencies sum to 1.

Table 6.1 Thickness of 10,000 brass washers

Thickness (mm)     Frequency    Relative frequency
Under 0.425             0        0
0.425 < 0.435          48        0.0048
0.435 < 0.445         122        0.0122
0.445 < 0.455         325        0.0325
0.455 < 0.465         695        0.0695
0.465 < 0.475       1,198        0.1198
0.475 < 0.485       1,664        0.1664
0.485 < 0.495       1,896        0.1896
0.495 < 0.505       1,664        0.1664
0.505 < 0.515       1,198        0.1198
0.515 < 0.525         695        0.0695
0.525 < 0.535         325        0.0325
0.535 < 0.545         122        0.0122
0.545 < 0.555          48        0.0048
0.555 or above          0        0
Total              10,000        1.0000

Figure 6.2 depicts the relative frequency histogram and polygon for the distribution of the thickness of 10,000 brass washers. For these data, the first three theoretical properties of the normal distribution are approximately satisfied; however, the fourth does not hold. The random variable of interest, thickness, cannot possibly be zero or below, and a washer cannot be so thick that it becomes unusable. From Table 6.1, only 48 out of every 10,000 brass washers are expected to have a thickness of between 0.545 and 0.555 mm and none above 0.555 mm, whereas an equal number is expected to have a thickness between 0.425 and 0.435 mm and none below 0.425 mm. Thus, the chance of randomly getting a washer thinner than 0.435 mm or thicker than 0.545 mm is 0.0048 + 0.0048 = 0.0096 – or less than 1 in 100.

Figure 6.2 Relative frequency histogram and polygon of the thickness of 10,000 brass washers

[Vertical axis: relative frequency, 0.00 to 0.20; horizontal axis: thickness (mm), 0.42 to 0.56.]

For the normal distribution, the normal probability density function is given by Equation 6.1.

normal probability density function Mathematical expression that defines the normal distribution.


THE NORMAL PROBABILITY DENSITY FUNCTION

f(X) = [1 / (σ√(2π))] e^(−(1/2)[(X − μ)/σ]²)   (6.1)

where
e = 2.71828… is the base of natural logarithms
π = 3.14159…
μ is the mean
σ is the standard deviation
X is any value of the normal random variable, where −∞ < X < ∞

Because e and π are mathematical constants, the probability density function depends on the two parameters of the normal distribution: the mean μ and the standard deviation σ. Each combination of μ and σ generates a different normal distribution. Figure 6.3 illustrates three different normal distributions. Distributions A and B have the same mean (μ) but have different standard deviations. Distributions A and C have the same standard deviation (σ) but have different means. Distributions B and C depict two normal probability density functions that differ with respect to both μ and σ.

Figure 6.3 Three normal distributions
[Three bell-shaped curves labelled A, B and C: A and B share the same mean but have different standard deviations; A and C share the same standard deviation but have different means.]
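Equation 6.1 can be evaluated directly. The Python sketch below is ours, not part of the text; it codes the density for μ = 7 and σ = 2 and uses a crude Riemann sum (our choice of check) to confirm that the total area under the curve is approximately 1.

import math

def normal_pdf(x, mu, sigma):
    # Equation 6.1
    return (1 / (sigma * math.sqrt(2 * math.pi))) * math.exp(-0.5 * ((x - mu) / sigma) ** 2)

mu, sigma = 7, 2
step = 0.001
# Sum narrow rectangles across mu +/- 8 sigma; the excluded tails are negligible
area = sum(normal_pdf(mu - 8 * sigma + i * step, mu, sigma) * step
           for i in range(int(16 * sigma / step)))
print(round(area, 4))   # approximately 1.0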

transformation formula Z score formula used to convert any normal random variable to the standardised normal random variable. standardised normal random variable Normal random variable with a mean of 0 and a standard deviation of 1.

Normal probabilities are calculated as areas under the curve given by Equation 6.1; this requires integral calculus and there is no exact rule. Fortunately, all normal probabilities can be calculated from normal probability tables. However, as there is a different normal probability distribution for each combination of μ and σ, the first step in finding a normal probability is to use the transformation formula, given in Equation 6.2, to convert any normal random variable X to a standardised normal random variable Z.

TRANSFORMATION FORMULA
The Z value is equal to the difference between X and the mean μ, divided by the standard deviation σ.

Z = (X − μ) / σ   (6.2)

Equation 6.2 is a restatement of the Z score equation (3.12), introduced in Chapter 3. Thus, Equation 6.2 represents the distance between a given value of the random variable X and the mean, expressed in standard deviations. Although the original normal random variable X had mean μ and standard deviation σ, the standard normal random variable Z has mean μ = 0 and standard deviation σ = 1. By substituting μ = 0 and σ = 1 in Equation 6.1, the probability density function of the standardised normal variable Z is given in Equation 6.3.


THE STANDARDISED NORMAL PROBABILITY DENSITY FUNCTION

f(Z) = [1 / √(2π)] e^(−(1/2)Z²)   (6.3)

Any normal probability distribution can be converted to the standardised probability distribution. Then normal probabilities can be determined from Table E.2, the cumulative standardised normal distribution. To see how the transformation formula is applied and the results used to find probabilities from Table E.2, recall from the Tasman University Orientation scenario at the beginning of the chapter that data indicate that the time students spend on the 'Introduction to TU' module is normal, with mean μ = 7 minutes and standard deviation σ = 2 minutes. From Figure 6.4, it can be seen that every value of the random variable X, time, has a corresponding standardised Z value calculated by the transformation formula (Equation 6.2). Therefore, a time of 9 minutes is equivalent to Z = 1; that is, 9 minutes is one standard deviation above the mean since:

Z = (9 − 7) / 2 = +1

Figure 6.4 Transformation of scales
[Time on the 'Introduction to TU' module, with tick marks from μ − 3σ to μ + 3σ. X scale, minutes (μ = 7, σ = 2): 1, 3, 5, 7, 9, 11, 13. Corresponding Z scale (μ = 0, σ = 1): −3, −2, −1, 0, +1, +2, +3.]

A time of 1 minute is equivalent to Z = −3; that is, 1 minute is three standard deviations below the mean since:

Z = (1 − 7) / 2 = −3

Thus, the standard deviation is the unit of measurement. In other words, a time of 9 minutes is 2 minutes (i.e. one standard deviation) higher, or longer, than the mean time of 7 minutes. Similarly, if a student spends 1 minute on the module it is 6 minutes (i.e. three standard deviations) lower, or shorter, than the mean time.
To illustrate the transformation formula further, the time students spend on the 'Support at TU' module is also normal with a mean of 4 minutes and a standard deviation of 1 minute. This distribution is illustrated in Figure 6.5. For 'Support at TU', a time of 5 minutes is one standard deviation above the mean time since:

Z = (5 − 4) / 1 = +1

A time of 1 minute is three standard deviations below the mean time since:

Z = (1 − 4) / 1 = −3
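These Z values are easily scripted. The small Python sketch below is ours and simply applies Equation 6.2 to the four times just discussed.

def z_score(x, mu, sigma):
    # Equation 6.2
    return (x - mu) / sigma

print(z_score(9, 7, 2))   # 'Introduction to TU', 9 minutes:  +1.0
print(z_score(1, 7, 2))   # 'Introduction to TU', 1 minute:   -3.0
print(z_score(5, 4, 1))   # 'Support at TU', 5 minutes:       +1.0
print(z_score(1, 4, 1))   # 'Support at TU', 1 minute:        -3.0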


The two bell-shaped curves in Figures 6.4 and 6.5 represent the probability density functions of the time (in minutes) students spend on the two modules. Since the times represent the entire population, the area under the entire curve, representing probability, must be 1.

Figure 6.5 A different transformation of scales
[Time on the 'Support at TU' module. X scale, minutes (μ = 4, σ = 1): 1, 2, 3, 4, 5, 6, 7. Corresponding Z scale (μ = 0, σ = 1): −3, −2, −1, 0, +1, +2, +3.]

The steps to find the probability that the time a student spends in the 'Introduction to TU' module in the Tasman University Orientation scenario is less than 9 minutes are as follows:
1. Use Equation 6.2 to transform X = 9 to the corresponding Z value:

Z = (9 − 7) / 2 = 1

2. Use Table E.2 to find the cumulative area under the standard normal curve less than (i.e. to the left of) Z = 1.00. To read the probability or area under the curve less than Z = 1.00, scan down the Z column in Table E.2 to the Z value of interest to one decimal place, the Z row for 1.0. Read across this row until it intersects the column that contains the second decimal place of the Z value, the column representing .00. Therefore, from the body of the table, the probability for P(Z < 1.00) is given by the intersection of the row Z = 1.0 and the column Z = .00, as shown in Table 6.2, which is extracted from Table E.2. This probability is 0.8413 – that is,

Table 6.2 Finding a cumulative area under the normal curve (extracted from Table E.2 in Appendix E of this book)

Z      .00    .01    .02    .03    .04    .05    .06    .07    .08    .09
0.0   .5000  .5040  .5080  .5120  .5160  .5199  .5239  .5279  .5319  .5359
0.1   .5398  .5438  .5478  .5517  .5557  .5596  .5636  .5675  .5714  .5753
0.2   .5793  .5832  .5871  .5910  .5948  .5987  .6026  .6064  .6103  .6141
0.3   .6179  .6217  .6255  .6293  .6331  .6368  .6406  .6443  .6480  .6517
0.4   .6554  .6591  .6628  .6664  .6700  .6736  .6772  .6808  .6844  .6879
0.5   .6915  .6950  .6985  .7019  .7054  .7088  .7123  .7157  .7190  .7224
0.6   .7257  .7291  .7324  .7357  .7389  .7422  .7454  .7486  .7518  .7549
0.7   .7580  .7612  .7642  .7673  .7704  .7734  .7764  .7794  .7823  .7852
0.8   .7881  .7910  .7939  .7967  .7995  .8023  .8051  .8078  .8106  .8133
0.9   .8159  .8186  .8212  .8238  .8264  .8289  .8315  .8340  .8365  .8389
1.0   .8413  .8438  .8461  .8485  .8508  .8531  .8554  .8577  .8599  .8621


P(Z < 1.00) = 0.8413. As illustrated in Figure 6.6, there is an 84.13% likelihood that a student will spend less than 9 minutes on the 'Introduction to TU' online module.

Figure 6.6 Determining the area less than Z from a cumulative standardised normal distribution
[Time on the 'Introduction to TU' module: the shaded area of 0.8413 lies below X = 9 minutes on the X scale (1 to 13 minutes), equivalently below Z = +1.00 on the Z scale (−3.00 to +3.00).]
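The Table E.2 value used here can be reproduced without a printed table. The Python sketch below is ours; it relies on the standard relationship between the cumulative standardised normal distribution and the error function, P(Z < z) = 0.5[1 + erf(z/√2)].

import math

def standard_normal_cdf(z):
    # Cumulative area under the standard normal curve to the left of z
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

print(round(standard_normal_cdf(1.00), 4))   # 0.8413, the Table 6.2 entry for Z = 1.00
print(round(standard_normal_cdf(0.00), 4))   # 0.5000
print(round(standard_normal_cdf(1.05), 4))   # 0.8531, row 1.0, column .05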

For 'Support at TU', as Z = (5 − 4)/1 = 1 (see Figure 6.7), the probability of a time less than 5 minutes is also 0.8413. Figure 6.7 shows that, regardless of the value of the mean μ and standard deviation σ of a normal random variable X, Equation 6.2 can be used to transform the distribution to the standard normal distribution Z.

Figure 6.7 A transformation of scales for corresponding cumulative portions under two normal curves
[The cumulative area below 5 minutes on the 'Support at TU' curve corresponds to the cumulative area below 9 minutes on the 'Introduction to TU' curve; both map to Z = +1 on the common Z scale.]

In the following examples, which answer questions relating to the time students spend on the 'Introduction to TU' module, when necessary the normal curve is sketched and the required probability/area shaded before using Table E.2 with Equation 6.2 to calculate the required probability.

EXAMPLE 6.1
FINDING P(X ⩾ 9)
What is the probability that a student will spend at least 9 minutes on 'Introduction to TU'?

SOLUTION

The probability that a student spends less than 9 minutes is 0.8413 (see Figure 6.6). Thus, the probability that a student will spend at least 9 minutes is the complement of less than 9 minutes, so: P(X ⩾ 9) = 1 − P(X < 9) = 1 − 0.8413 = 0.1587


Therefore, approximately 15.9% of students spend at least 9 minutes on 'Introduction to TU'. Figure 6.8 illustrates this result.

Figure 6.8 Finding P(X ⩾ 9)
[Time on the 'Introduction to TU' module: the area of 0.1587 lies above X = 9 minutes (Z = +1.00); the area of 0.8413 lies below it.]

EXAMPLE 6.2

FINDING P(7 < X < 9)
What is the probability that a student spends between 7 and 9 minutes on 'Introduction to TU'?
SOLUTION

From Figure 6.6, P(X < 9) = 0.8413. Now determine the probability that the time will be at most 7 minutes and subtract this from the probability that the time is less than 9 minutes. That is:

P(7 < X < 9) = P(X < 9) − P(X ⩽ 7)

This is shown in Figure 6.9.

Figure 6.9 Finding P(7 < X < 9)
[Time on the 'Introduction to TU' module: the area of 0.3413 lies between X = 7 and X = 9 minutes (Z = 0 to Z = +1.00); an area of 0.5000 lies below 7 minutes and an area of 0.1587 lies above 9 minutes.]

From Equation 6.2 and Table E.2:

P(X ⩽ 7) = P(Z ⩽ (7 − 7)/2) = P(Z ⩽ 0.00) = 0.5000

Therefore:

P(7 < X < 9) = P(X < 9) − P(X ⩽ 7) = 0.8413 − 0.5000 = 0.3413

EXAMPLE 6.3

FINDING P(X ⩽ 7 OR X ⩾ 9)
What is the probability that the time a student spends on 'Introduction to TU' is at most 7 minutes or at least 9 minutes?


SOLUTION

From Figure 6.9, the probability that a student spends between 7 and 9 minutes is 0.3413. Therefore, the probability that the time spent on 'Introduction to TU' is at most 7 minutes or at least 9 minutes is its complement, so:

P(X ⩽ 7 or X ⩾ 9) = 1 − P(7 < X < 9) = 1 − 0.3413 = 0.6587

Alternatively, calculate separately the probability of a time of 7 minutes or less and the probability of a time of 9 minutes or more and then add these two probabilities together to obtain the desired result (see Figure 6.10). Because the mean and median are the same for a normal distribution, 50% of students spend 7 minutes or less. From Example 6.1, the probability of a student spending at least 9 minutes is 0.1587. Hence, the probability that the time on 'Introduction to TU' is at most 7 minutes or at least 9 minutes is:

P(X ⩽ 7 or X ⩾ 9) = P(X ⩽ 7) + P(X ⩾ 9) = 0.5000 + 0.1587 = 0.6587

Figure 6.10 Finding P(X ⩽ 7 or X ⩾ 9)
[Time on the 'Introduction to TU' module: the area of 0.5000 lies below 7 minutes and the area of 0.1587 lies above 9 minutes; together they total 0.6587, the complement of the 0.3413 between 7 and 9 minutes.]
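The probabilities in Examples 6.1 to 6.3 can be verified in a few lines. The Python sketch below is ours; it combines the transformation formula (Equation 6.2) with the error-function form of the cumulative standardised normal distribution.

import math

def prob_less_than(x, mu, sigma):
    z = (x - mu) / sigma                           # Equation 6.2
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))  # P(Z < z)

mu, sigma = 7, 2
p_below_9 = prob_less_than(9, mu, sigma)
p_between = p_below_9 - prob_less_than(7, mu, sigma)

print(round(1 - p_below_9, 4))   # Example 6.1: P(X >= 9)           = 0.1587
print(round(p_between, 4))       # Example 6.2: P(7 < X < 9)        = 0.3413
print(round(1 - p_between, 4))   # Example 6.3: P(X <= 7 or X >= 9) = 0.6587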

EXAMPLE 6.4
FINDING P(5 < X < 9)
What is the probability that a student spends between 5 and 9 minutes on 'Introduction to TU'?

SOLUTION

The required area/probability is the area under the curve between X = 5 and X = 9 (see Figure 6.11). As Table E.2 gives probabilities less than a particular value of interest, we calculate the probabilities P(X < 9) and P(X ⩽ 5) and then obtain the desired probability/area by subtraction:

P(5 < X < 9) = P((5 − 7)/2 < Z < (9 − 7)/2) = P(−1.00 < Z < +1.00) = 0.8413 − 0.1587 = 0.6826

Central Office I
Time to clear problems (minutes)
1.48 1.75 0.78 2.85 0.52 1.60 4.15 1.02 0.53 0.93 1.60 0.80 1.05 6.32 3.97 3.93 1.48 5.45 3.10 0.97

Central Office II
Time to clear problems (minutes)
7.55 3.75 0.10 1.10 0.60 0.52 3.30 3.75 0.65 1.92 0.60 1.53 4.23 0.08 2.10 1.48 0.58 1.65 4.02 0.72



For each of the two central office locations, decide whether the data appear to be approximately normally distributed by: a. evaluating the actual versus theoretical properties b. constructing a normal probability plot 6.18 Many manufacturing processes use the term work-in-progress (often abbreviated to WIP). In a book-manufacturing plant, the WIP represents the time it takes for sheets from a press to be folded, gathered, sewn, tipped on end sheets and bound. The following data represent samples of 20 books at each of two

Plant A
5.62 11.62 5.29 21.62 10.50 8.45 7.29 7.58 16.25 7.50 8.58 9.29 10.92 7.96 5.41 7.54

Plant B  9.54 5.75 11.46 12.46 16.62 15.41 2.33 14.29 14.25 13.13

11.46 4.42 11.42 8.92

9.17 12.62 13.21 25.75 6.00 5.37 13.71  6.25 10.04 9.71



For each of the two plants, decide whether or not the data appear to be approximately normally distributed by: a. evaluating the actual versus theoretical properties b. constructing a normal probability plot 6.19 The data file < GRADES > contains a sample of student marks and grades from a population of students enrolled in a statistics unit. Decide whether or not the ‘Total Mark’ data appear to be approximately normal by: a. evaluating the actual versus theoretical properties b. constructing a normal probability plot 6.20 For the data from problem 6.19, < GRADES >, decide whether or not the ‘Exam Mark’ data appear to be approximately normally distributed by: a. evaluating the actual versus theoretical properties b. constructing a normal probability plot

6.4  THE UNIFORM DISTRIBUTION

LEARNING OBJECTIVE 3
Calculate probabilities from the uniform distribution

In the uniform distribution, a value has the same probability of occurrence anywhere in the range between the smallest value a and the largest value b. Because of its shape, the uniform distribution is sometimes called the rectangular distribution (see panel B of Figure 6.1). Equation 6.5 defines the uniform probability density function.

THE UNIFORM PROBABILITY DENSITY FUNCTION

f(X) = 1 / (b − a) if a ⩽ X ⩽ b, and 0 elsewhere   (6.5)

where
a = the minimum value of X
b = the maximum value of X

uniform (rectangular) distribution Continuous probability distribution; the values of the random variable have the same probability; also called the ‘rectangular distribution’.

Equations 6.6 and 6.7 define the mean and variance of the uniform distribution.

THE MEAN AND VARIANCE OF THE UNIFORM DISTRIBUTION

μ = (a + b) / 2   (6.6)

σ² = (b − a)² / 12   (6.7)


Figure 6.23 illustrates the uniform distribution with a = 0 and b = 1. The total area of the rectangle is equal to base × height = 1 × 1 = 1, thus satisfying the requirement that the area under any probability density function equals 1. In such a distribution, what is the probability of getting a value between 0.1 and 0.3? The area between 0.1 and 0.3, depicted in Figure 6.24, is equal to the base (0.3 − 0.1 = 0.2) multiplied by the height (1.0). Therefore:

P(0.1 < X < 0.3) = base × height = 0.2 × 1 = 0.2

Figure 6.23 Probability density function for a uniform distribution with a = 0 and b = 1
[A rectangle of height f(x) = 1.0 over the interval 0 to 1.]

Figure 6.24 Finding P(0.1 < X < 0.3) for a uniform distribution with a = 0 and b = 1
[The same rectangle, with the area between x = 0.1 and x = 0.3 shaded.]

From Equations 6.6 and 6.7, the mean and standard deviation of the uniform distribution for a = 0 and b = 1 are:

μ = (a + b)/2 = (0 + 1)/2 = 0.5

σ² = (b − a)²/12 = (1 − 0)²/12 = 1/12 = 0.0833…

σ = √0.0833… = 0.2886…

Thus, the mean is 0.5 and the standard deviation is 0.2887.
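The same results follow from a few lines of code. The Python sketch below is ours; it applies Equations 6.5 to 6.7 for the a = 0, b = 1 case.

import math

a, b = 0.0, 1.0

def uniform_prob(lo, hi):
    # Area of the rectangle between lo and hi, assuming a <= lo <= hi <= b
    return (hi - lo) / (b - a)

mean = (a + b) / 2                       # Equation 6.6
std_dev = math.sqrt((b - a) ** 2 / 12)   # square root of Equation 6.7

print(uniform_prob(0.1, 0.3))            # 0.2
print(mean, round(std_dev, 4))           # 0.5 and 0.2887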

Problems for Section 6.4
LEARNING THE BASICS
6.21 Suppose you sample one value from a uniform distribution with a = 0 and b = 10. a. What is the probability of getting a value: i. between 5 and 7? ii. between 2 and 3? b. What is the mean? c. What is the standard deviation?
APPLYING THE CONCEPTS
6.22 The time between arrivals of customers at a bank between noon and 1 pm has a uniform distribution over an interval from 0 to 120 seconds. a. What is the probability that the time between the arrival of two customers will be: i. less than 20 seconds? ii. between 10 and 30 seconds? iii. more than 35 seconds?


b. What is the mean and standard deviation of the time between arrivals? 6.23 The time of failure for a continuous operation monitoring device of air quality has a uniform distribution over a 24-hour day. a. If a failure occurs on a day when it is daylight between 5.55 am and 7.38 pm, what is the probability that the failure will occur during daylight hours? b. If the device is in secondary mode from 10 pm to 5 am, what is the probability that a failure occurs during secondary mode? c. If the device has a self-checking computer chip that determines whether the device is operational every hour on the hour, what is the probability that a failure will be detected within 10 minutes of its occurrence?

d. If the device has a self-checking computer chip that determines whether the device is operational every hour on the hour, what is the probability that it will take at least 40 minutes to detect that a failure has occurred? 6.24 In an apartment building the waiting time for a lift is found to be uniformly distributed between 0 and 3 minutes. a. What is the probability of waiting: i. no more than a minute? ii. between 1 and 2 minutes? iii. more than 2 minutes? b. What is the mean and standard deviation of waiting time?

6.5  THE EXPONENTIAL DISTRIBUTION

LEARNING OBJECTIVE 4
Calculate probabilities from the exponential distribution

The exponential distribution is a continuous distribution that is right skewed and ranges from zero to positive infinity (see panel C of Figure 6.1). The exponential distribution is widely used in waiting line (or queuing) theory to model the length of time between random and independent events or the time to the first occurrence of an event. For example, the exponential random variable can be used to model the:
• time between arrivals of customers at a bank's ATM or a fast-food restaurant
• time between patients entering a hospital emergency room
• time between hits on a website
• time between outages to an Internet banking system
• time to failure of a certain item or component.

exponential distribution Continuous probability distribution, used to model the interval between Poisson events.

The exponential and Poisson distributions are closely related. The Poisson distribution is used to count the number of times an event occurs in some interval, while the exponential distribution is used to measure the interval between Poisson events or until the first event. The exponential distribution is defined by a single parameter, λ (lambda), the expected number of events per interval; note that this is the mean of the corresponding Poisson distribution. Equation 6.8 can be used to calculate exponential probabilities.

PROBABILITY THAT AN EXPONENTIAL RANDOM VARIABLE IS LESS THAN A
If X is an exponential random variable, 0 ⩽ X < ∞, then

P(X < A) = 1 − e^(−λA)   (6.8)

where
λ = expected number of events in the interval
e = 2.71828… is the base of natural logarithms
A is a given value of the exponential random variable X

From Equation 6.8, using the complement rule, we obtain:

P(X ⩾ A) = 1 − P(X < A) = 1 − (1 − e^(−λA)) = e^(−λA)


THE MEAN, VARIANCE AND STANDARD DEVIATION OF THE EXPONENTIAL DISTRIBUTION

μ = σ = 1 / λ   (6.9)

σ² = 1 / λ²   (6.10)

where λ = expected number of events in the interval.

For example, if the expected number of events in a minute is λ = 4, then the mean time between events is μ = 1/4 = 0.25 minutes, or 15 seconds.
To illustrate the exponential distribution, suppose that customers arrive at an ATM randomly and independently at the rate of 20 per hour. If a customer has just arrived, what is the probability that the next customer will arrive within 6 minutes (i.e. 0.1 hour)? For this example, X = time in hours until the next customer, which is exponential with λ = 20 per hour. Using Equation 6.8 and A = 6 minutes = 0.1 hour:

P(X < 0.1) = 1 − e^(−20 × 0.1) = 1 − e^(−2) = 1 − 0.13533… = 0.86466…

Thus, the probability that a customer will arrive within 6 minutes is 0.8647. You can also use Microsoft Excel to calculate this probability. Figure 6.25 shows a Microsoft Excel worksheet, using the Excel inbuilt exponential function EXPON.DIST(x, lambda, cumulative). For Excel 2007 and earlier the corresponding exponential function is EXPONDIST(x, lambda, cumulative).

Figure 6.25 Microsoft Excel worksheet for finding exponential probabilities
[The worksheet holds λ = 20 and the X value 0.1 as data, and returns P(X < 0.1) = 0.8647 with =EXPON.DIST(B5, B4, TRUE) and the complement 0.1353 with =1-B8.]

EXAMPLE 6.9
CALCULATING EXPONENTIAL PROBABILITIES
In the ATM example, what is the probability that the next customer will arrive within 3 minutes (i.e. 0.05 hour)?
SOLUTION
Using Equation 6.8:

P(X < 0.05) = 1 − e^(−20 × 0.05) = 1 − e^(−1) = 1 − 0.36787… = 0.63212…

Thus, the probability that a customer will arrive within 3 minutes is 0.6321.
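The EXPON.DIST results above are equally easy to reproduce directly from Equation 6.8. The Python sketch below is ours, with λ = 20 per hour as in the ATM example.

import math

lam = 20   # expected arrivals per hour

def expon_cdf(a):
    # Equation 6.8: P(X < A) = 1 - e^(-lambda * A)
    return 1 - math.exp(-lam * a)

print(round(expon_cdf(0.1), 4))    # within 6 minutes (0.1 hour): 0.8647
print(round(expon_cdf(0.05), 4))   # within 3 minutes (0.05 hour): 0.6321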


EXAMPLE 6.10
CALCULATING EXPONENTIAL PROBABILITIES
Past data show that two serious workplace accidents resulting in employees taking time off work occur annually at Innovative Kitchens. A serious workplace accident has just occurred. What is the probability that there will not be another serious workplace accident in the next year and the probability that there will be at least one serious workplace accident in the next six months?

SOLUTION

X = time in years until the next serious workplace accident, which is exponential with λ = 2 per year. Using Equation 6.8 and the complement rule:

P(X > 1) = e^(−2×1) = e^(−2) = 0.13533…

Thus, the probability that there will not be another serious workplace accident in the next year is 0.1353. Using Equation 6.8:

P(X ⩽ 0.5) = 1 − e^(−2×0.5) = 1 − e^(−1) = 1 − 0.36787… = 0.63212…

Thus, the probability that there will be at least one serious workplace accident in the next six months is 0.6321.
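As a quick check of Example 6.10, the two probabilities can be reproduced with the same one-line formula (a Python sketch of ours, not part of the text):

```python
import math

lam = 2  # expected serious accidents per year

p_no_accident_next_year = math.exp(-lam * 1)            # P(X > 1)    ~ 0.1353
p_accident_within_6_months = 1 - math.exp(-lam * 0.5)   # P(X <= 0.5) ~ 0.6321

print(round(p_no_accident_next_year, 4), round(p_accident_within_6_months, 4))
```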

think about this

Memoryless distribution
Suppose customers arrive at an average rate of one per minute. If no customer has arrived in the last minute, what is the probability that no customer will arrive in the next 2 minutes?

To answer this, let X = time in minutes until the next customer arrives. Then X is exponential with λ = 1. We want the probability that we will wait at least another 2 minutes for the next customer, given that we have waited 1 minute already; that is:

P(X > 2 + 1 | X > 1) = P(X > 3)/P(X > 1) = e^(−3×1)/e^(−1×1) = e^(−2) = 0.1353…

Now suppose that a customer has just arrived. The probability that no customer will arrive in the next two minutes is:

P(X > 2) = e^(−2×1) = e^(−2) = 0.13533…

What do you notice? The probability has not changed. This illustrates the memoryless property of the exponential distribution. It means that it does not matter how long you have waited for a customer. If a customer has not arrived at time T, the distribution of the waiting time from time T until the next customer arrives is the same as when a customer has just arrived. In general, it can be shown that if X is exponential, then X is a memoryless random variable and:

P(X > A + T | X > T) = P(X > A)   for A, T ⩾ 0
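The memoryless property is easy to confirm numerically or by simulation. A small Python sketch of ours (using λ = 1 customer per minute, as in the illustration above):

```python
import math
import random

lam = 1.0  # customers per minute

# Direct calculation: the conditional and unconditional probabilities agree
p_conditional = math.exp(-lam * 3) / math.exp(-lam * 1)    # P(X > 2 + 1 | X > 1)
p_unconditional = math.exp(-lam * 2)                       # P(X > 2)
print(round(p_conditional, 4), round(p_unconditional, 4))  # both 0.1353

# Simulation check: among waits longer than 1 minute, the share exceeding 3 minutes
random.seed(1)
waits = [random.expovariate(lam) for _ in range(1_000_000)]
still_waiting = [w for w in waits if w > 1]
print(round(sum(w > 3 for w in still_waiting) / len(still_waiting), 4))  # close to 0.1353
```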


Problems for Section 6.5

LEARNING THE BASICS
6.25 Given an exponential distribution with λ = 10, what is the probability that X is: a. less than 0.1? b. greater than 0.1? c. between 0.1 and 0.2? d. less than 0.1 or greater than 0.2?
6.26 Given an exponential distribution with λ = 30, what is the probability that X is: a. less than 0.1? b. greater than 0.1? c. between 0.1 and 0.2? d. less than 0.1 or greater than 0.2?
6.27 Given an exponential distribution with λ = 20, what is the probability that X is: a. less than 0.4? b. greater than 0.4? c. between 0.4 and 0.5? d. less than 0.4 or greater than 0.5?

APPLYING THE CONCEPTS 6.28 Vehicles arrive, randomly and independently, at a toll booth located at the entrance to a bridge at the rate of 240 per hour between 1 am and 2 am. Suppose a vehicle has just arrived. a. What is the probability that the next vehicle arrives within the next minute? b. What is the probability that no vehicle arrives in the next 30 seconds? c. What is the mean time between arrivals at the toll booth? d. What are your answers to (a) to (c) if the rate of arrival of vehicles is 300 per hour? e. What are your answers to (a) to (c) if the rate of arrival of vehicles is 210 per hour?

LEARNING OBJECTIVE 5  Use the normal distribution to approximate probabilities from the binomial distribution

6.29 Customers arrive at the drive-through window of a fast-food restaurant at an average of two per minute during the lunch hour. a. What is the probability that the next customer will arrive within 1 minute? b. What is the probability that the next customer will arrive within 5 minutes? c. During the dinner time period, the average arrival rate is one per minute. What are your answers to (a) and (b) for this period? 6.30 The time between unplanned shutdowns of a power plant has an exponential distribution with a mean of 20 days. Find the probability that the time between two unplanned shutdowns is: a. less than 14 days b. more than 21 days c. less than 7 days 6.31 Golfers arrive at the starter’s booth of a public golf course at an average of eight per hour during the Monday-to-Thursday midweek period. a. If a golfer has just arrived: i. what is the probability that the next golfer arrives within 15 minutes (0.25 hour)? ii. what is the probability that the next golfer arrives within 3 minutes (0.05 hour)? b. The average arrival rate on Fridays is 15 per hour. What are your answers to (a) on Fridays? 6.32 The number of floods in a certain region is approximately Poisson distributed with an average of three floods every 10 years. A flood has just occurred. a. What is the probability that: i. a flood occurs in the next year? ii. there isn’t a flood in the next two years? iii. a flood occurs in the next month? iv. at least one flood occurs in the next six months? b. What is the average time between floods?

6.6  THE NORMAL APPROXIMATION TO THE BINOMIAL DISTRIBUTION

In earlier sections of this chapter, the normal probability distribution was introduced. In this section we use the normal distribution to approximate the binomial distribution. When, as in this case, a continuous distribution is used to approximate a discrete probability distribution, a continuity correction factor is required.

Need for a Continuity Correction

There are two major reasons why a continuity correction is needed when using a continuous random variable to approximate a discrete random variable. First, discrete random variables such as binomial random variables can take on only specified (integer) values, while continuous random variables such as normal random variables can take on any values within a continuum or interval. When using the normal distribution to approximate the binomial


distribution, more accurate approximations of the probabilities are obtained when a continuity correction is used. Second, with a continuous distribution such as the normal distribution, the probability of getting a specific value of a random variable is zero. However, when a continuous distribution is used to approximate a discrete distribution, a continuity correction is used to obtain the approximate probability of a specific value of the discrete distribution.

Consider an experiment in which we toss a fair coin 10 times. Suppose we want to calculate the probability of getting exactly four heads. Whereas a discrete random variable can have only a specified value (such as 4), a continuous random variable used to approximate it could take on any values within an interval around that specified value, as demonstrated on the scale below:

…  2.5   3   3.5   4   4.5   5   5.5  …  X

The continuity correction requires adding or subtracting 0.5 from the value or values of the discrete random variable X as required. To use the normal distribution to approximate the probability of getting exactly four heads, X = 4, we need to find the area under the normal curve from X = 3.5 to X = 4.5, the lower and upper boundaries of 4. To determine the approximate probability of getting at least four heads, we find the area under the normal curve greater than or equal to 3.5, X ⩾ 3.5, since 3.5 is the lower boundary of 4. Similarly, to determine the approximate probability of getting at most four heads, we find the area under the normal curve equal to or less than 4.5, X ⩽ 4.5, since 4.5 is the upper boundary of 4.

When using the normal distribution to approximate discrete probability distributions, semantics are important. To determine the approximate probability of getting fewer than four heads, we find the area under the normal curve less than or equal to 3.5, X ⩽ 3.5. To determine the approximate probability of getting more than four heads, we find the area under the normal curve greater than or equal to 4.5, X ⩾ 4.5. To determine the approximate probability of getting four to seven heads (inclusive), we find the area under the normal curve from 3.5 to 7.5, 3.5 ⩽ X ⩽ 7.5.
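To make the continuity correction concrete, the coin-tossing example can be checked directly. This short Python sketch (ours, not the text's; it anticipates problem 6.35) compares the normal approximation of P(exactly four heads in ten tosses) with the exact binomial probability:

```python
import math

def norm_cdf(z):
    """Cumulative standardised normal probability, via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

n, p = 10, 0.5
mu = n * p                           # 5
sigma = math.sqrt(n * p * (1 - p))   # about 1.5811

# P(exactly 4 heads): area under the normal curve between the boundaries 3.5 and 4.5
approx = norm_cdf((4.5 - mu) / sigma) - norm_cdf((3.5 - mu) / sigma)
exact = math.comb(10, 4) * 0.5 ** 10
print(round(approx, 4), round(exact, 4))   # about 0.2045 versus 0.2051
```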

Approximating the Binomial Distribution

In Section 5.3 we saw that the binomial distribution is symmetrical (as is the normal distribution) whenever p = 0.5. When p ≠ 0.5, the binomial distribution is not symmetrical. However, the closer p is to 0.5 and/or the larger the sample size n, the more symmetrical the distribution is. On the other hand, the larger the sample size, the more tedious it is to calculate the exact probabilities of success using Equation 5.11. Fortunately, whenever the sample size is large, we can use the normal distribution to approximate the exact binomial probabilities. As a general rule, the normal distribution can be used to approximate the binomial distribution whenever np and n(1 − p) are both at least 5.

From Section 5.3, the mean and standard deviation of the binomial distribution are:

μ = np    σ = √(np(1 − p))

By substituting these into the transformation formula (Equation 6.2), we obtain:

Z = (X − μ)/σ = (X − np)/√(np(1 − p))

where, for large enough n, the random variable Z is approximately normal.


Hence, Equation 6.11 is used to find approximate probabilities corresponding to the values of the discrete binomial random variable, X.

NORMAL APPROXIMATION TO THE BINOMIAL DISTRIBUTION

Z = (Xa − np)/√(np(1 − p))   (6.11)

where μ = np, the mean of the binomial distribution
      σ = √(np(1 − p)), the standard deviation of the binomial distribution
      Xa = adjusted number of successes for the discrete random variable X, such that Xa = X ± 0.5 as appropriate

EXAMPLE 6.11

USING THE NORMAL DISTRIBUTION TO APPROXIMATE THE BINOMIAL DISTRIBUTION
A random sample of n = 1,600 tyres is selected from an ongoing production process in which 8% of all tyres produced are defective. What is the probability that 150 or fewer tyres will be defective?

SOLUTION

Since both np = 1,600 × 0.08 = 128 and n(1 − p) = 1,600 × 0.92 = 1,472 are greater than 5, the normal distribution can be used to approximate the binomial. Here Xa, the adjusted number of successes, is 150.5 and:

Z ≈ (Xa − np)/√(np(1 − p)) = (150.5 − 128)/√((1,600)(0.08)(0.92)) = 22.5/10.8517… ≈ 2.07

Then, using Table E.2, the area under the curve to the left of Z = 2.07 is 0.9808 (see Figure 6.26). Therefore, the probability of 150 or fewer tyres being defective is approximately 0.98. This agrees to two decimal places with the exact binomial probability of 0.9790.

Figure 6.26  Approximating the binomial distribution (normal curve with mean μ = 128; the shaded area of 0.9808 lies to the left of X = 150.5, i.e. to the left of Z = +2.07)

Calculating a Probability Approximation for an Individual Value

Suppose that we want to approximate the probability of getting exactly 150 defective tyres. The correction for continuity defines the integer value of interest to range from one-half unit below it to one-half unit above it. Therefore, we define the probability of getting exactly 150 defective


tyres as the area under the normal curve between 149.5 and 150.5. Using Equation 6.11, the corresponding Z values are:

Z = (149.5 − 128)/√((1,600)(0.08)(0.92)) = 21.5/10.85 = 1.98

and:

Z = (150.5 − 128)/√((1,600)(0.08)(0.92)) = 22.5/10.85 = 2.07

Therefore, using Table E.2, we obtain:

P(exactly 150 tyres defective) ≈ P(149.5 ⩽ X ⩽ 150.5) ≈ P(1.98 ⩽ Z ⩽ 2.07) = 0.9808 − 0.9761 = 0.0047

Thus, the approximate probability of getting 150 defective tyres is 0.0047. Compare this with the exact binomial probability which, to four decimal places, is 0.0048.
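Both tyre calculations can be reproduced, and compared with the exact binomial probabilities, in a few lines of Python (a sketch of ours, not the text's Excel approach):

```python
import math

def norm_cdf(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def binom_pmf(n, k, p):
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 1600, 0.08
mu = n * p                            # 128
sigma = math.sqrt(n * p * (1 - p))    # about 10.8517

# P(150 or fewer defective): continuity-corrected boundary 150.5
approx_at_most = norm_cdf((150.5 - mu) / sigma)
exact_at_most = sum(binom_pmf(n, k, p) for k in range(151))
print(round(approx_at_most, 4), round(exact_at_most, 4))    # ~0.9809 vs 0.9790

# P(exactly 150 defective): area between 149.5 and 150.5
approx_exactly = norm_cdf((150.5 - mu) / sigma) - norm_cdf((149.5 - mu) / sigma)
exact_exactly = binom_pmf(n, 150, p)
print(round(approx_exactly, 4), round(exact_exactly, 4))    # ~0.0047 vs 0.0048
```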

Problems for Section 6.6

LEARNING THE BASICS
6.33 For n = 100 and p = 0.2, use the normal distribution to approximate the probability that: a. X = 25 b. X > 25 c. X ⩽ 25 d. X < 25
6.34 For n = 100 and p = 0.4, use the normal distribution to approximate the probability that: a. X = 40 b. X > 40 c. X ⩽ 40 d. X < 40

APPLYING THE CONCEPTS
6.35 Consider an experiment in which a fair coin is tossed 10 times. a. Use Equation 5.11, Table E.6 or Microsoft Excel to determine the probability of getting: i. four heads ii. at least four heads iii. four to seven heads b. Use the normal approximation to the binomial distribution to approximate the probabilities in (a).
6.36 For overseas flights, an airline has three different choices on its dessert menu: ice cream, apple pie and chocolate cake. Based on past experience, the airline feels that each dessert is equally likely to be chosen. If a random sample of 90 passengers is selected, what is the approximate probability that: a. at least 20 will choose ice cream for dessert? b. exactly 20 will choose ice cream for dessert? c. less than 20 will choose ice cream for dessert?


Assess your progress

Summary
In this chapter we used the normal distribution for the Tasman University Orientation scenario to study the time students spend on the ‘Introduction to TU’ module. We also used the exponential distribution to model the time between serious workplace accidents.

In addition, we studied the uniform distribution, the normal probability plot and the normal approximation to the binomial distribution. In the next chapter, the normal distribution is used in developing the subject of statistical inference.

Key formulas

The normal probability density function
f(X) = (1/(σ√(2π))) e^(−(1/2)[(X − μ)/σ]²)   (6.1)

Finding a Z value
Z = (X − μ)/σ   (6.2)

The standardised normal probability density function
f(Z) = (1/√(2π)) e^(−(1/2)Z²)   (6.3)

Finding an X value
X = μ + Zσ   (6.4)

The uniform distribution probability density function
f(X) = 1/(b − a) if a ⩽ X ⩽ b, and 0 elsewhere   (6.5)

Mean of the uniform distribution
μ = (a + b)/2   (6.6)

Variance of the uniform distribution
σ² = (b − a)²/12   (6.7)

Calculating exponential probabilities
P(X < A) = 1 − e^(−λA)   (6.8)

Mean and standard deviation of the exponential distribution
μ = σ = 1/λ   (6.9)

Variance of the exponential distribution
σ² = 1/λ²   (6.10)

Normal approximation to the binomial distribution
Z = (Xa − np)/√(np(1 − p))   (6.11)

Key terms
continuous probability density function  213
cumulative standardised normal distribution  217
exponential distribution  235
normal distribution  214
normal probability density function  216
normal probability plot  231
quantile–quantile plot  231
standardised normal random variable  216
transformation formula  216
uniform (rectangular) distribution  233


Chapter review problems

CHECKING YOUR UNDERSTANDING
6.37 How do you find the area between two values under the normal curve?
6.38 How do you find the X value that corresponds to a given percentile of the normal distribution?
6.39 What are some of the properties of a normal distribution?
6.40 How can you use the normal probability plot to evaluate whether a set of data is normally distributed?
6.41 Why is a continuity correction needed when approximating a binomial probability with the normal distribution?
6.42 When can you use the normal distribution to approximate the binomial distribution?

APPLYING THE CONCEPTS
6.43 Based on past experience, it is assumed that the number of flaws in rolls of grade 2 paper follows a Poisson distribution with a mean of one flaw per 5 metres of paper. A flaw has just been found. a. What is the probability that: i. there is not another flaw in the remaining 10 metres of the roll? ii. a flaw will be found in the next metre of the roll? iii. at least one flaw will be found in the next 5 metres? b. What is the mean distance between flaws?
6.44 Aircraft arrive at a regional airport at a rate of 30 per hour. a. If the interarrival time follows an exponential distribution: i. What is the probability that air traffic control will have a break of at least 2 minutes between arrivals? ii. What is the probability that there is less than 30 seconds between arrivals? iii. What is the expected time between arrivals? b. If the interarrival time follows a uniform distribution between 0 and 4 minutes: i. What is the probability that air traffic control will have a break of at least 2 minutes between arrivals? ii. What is the probability that there is less than 30 seconds between arrivals? iii. What is the expected time between arrivals? c. If the interarrival time follows a normal distribution with mean 2 minutes and standard deviation 0.6 minutes: i. What is the probability that air traffic control will have a break of at least 2 minutes between arrivals? ii. What is the probability that there is less than 30 seconds between arrivals?
6.45 An orange juice producer buys all his oranges from a large orange grove. The amount of juice squeezed from each orange is approximately normally distributed with a mean of 135 mL and a standard deviation of 12 mL. a. What is the probability that a randomly selected orange will contain between 135 mL and 140 mL of juice? b. What is the probability that a randomly selected orange will contain between 140 mL and 155 mL of juice? c. 77% of the oranges will contain at least how many millilitres of juice? d. 80% of the oranges are between which two values of juice (in millilitres) symmetrically distributed around the population mean?
6.46 The hotels from the chain in problem 6.16 frequently offer discounted ‘hot deal’ rates online. The table below gives the ‘hot deal’ rates available recently on a selected Sunday, in Australian dollars. < HOTEL_RATE >

Location            Hot deals rate A$
Auckland            140
Barossa Valley      174
Brisbane            129
Canberra            230
Darwin              114
Hamilton            154
Melbourne           152
Melbourne           189
Melbourne           149
Palmerston North     80
Perth               150
Queenstown           95
Rotorua             122
Snowy Mountains     288
Sunshine Coast      170
Sydney              239
Sydney              189
Sydney              160
Wellington          105

Decide whether or not the data appear to be approximately normally distributed by: a. evaluating the actual versus theoretical properties b. constructing a normal probability plot
6.47 Geoscientists estimate that, on average, a given region has a major earthquake every 250 years. Assuming that the time between major earthquakes in this region is exponentially distributed, what is the probability that a major earthquake: a. will not occur between 2020 and 2030? b. will occur between 2020 and 2070? c. will not occur between 2020 and 2200?
6.48 An examination consists of 40 multiple-choice questions, with each question having four options. Suppose you randomly select the answer to each question – that is, you guess. What is the probability of obtaining at least 50% in the examination?
6.49 According to Burton G. Malkiel, the daily changes in the closing price of shares follow a random walk – that is, these daily events are independent of each other and move upwards or downwards in a random manner – and can be approximated by a normal distribution.


a. To test this theory, use the daily changes in the All Ordinaries for the 2016–17 financial year in < ALL_ORDS_2016_17 > to: i. construct a stem-and-leaf display, histogram, polygon and/or box-and-whisker plot ii. evaluate the actual versus theoretical properties iii. construct a normal probability plot b. Discuss the results of (a). Are the daily changes in closing prices approximately normal?
6.50 From past data, Safe-As-Houses Real Estate concludes that the age of houses in the suburb of NewAcres is uniformly distributed between 20 and 40 years. What is the probability that the age of a randomly chosen house in NewAcres is: a. more than 30 years? b. between 25 and 35 years? c. less than 35 years?
6.51 The time customers are on hold when ringing the IT help line for a certain ISP provider is normally distributed with a mean of 20 minutes and a standard deviation of 10 minutes. a. What proportion of customers are on hold for more than 40 minutes? b. What is the probability that a customer is on hold for less than 30 minutes? c. What percentage of calls are answered within 10 minutes?
6.52 A study by the ISP provider in problem 6.51 has shown that the length of time on hold before a customer hangs up follows an approximate exponential distribution, with an average time of 15 minutes on hold before a customer hangs up. a. What percentage of customers will hang up during the first 20 minutes on hold? b. What is the probability that a customer will hang up during the first 10 minutes on hold? c. What proportion of customers do not hang up when on hold for 40 minutes?
6.53 From the Household Expenditure Statistics: Year Ended 30 June 2016 (Statistics New Zealand, ), the average weekly household expenditure in New Zealand was $1,300. Assuming that weekly household expenditure is approximately normal with a standard deviation of $350:

a. Find the probability that a household’s weekly expenditure is i. less than $500 ii. more than $1,750 b. What proportion of household expenditures are between $1,250 and $1,500? c. 99% of households have weekly expenditures of less than which amount? d. 95% of households have weekly expenditures of more than which amount? 6.54 Water_Wise (see problem 3.53) is analysing water usage for a block of one-bedroom flats. It collects data on total daily water consumption in kilolitres (kL) for 133 consecutive days. < WATER >. a. Decide whether total daily water usage in this block of flats is approximately normal by: i. evaluating the actual versus theoretical properties ii. constructing a normal probability plot b. From part (a), assume that total daily water usage of the flats is normally distributed with a mean of 1.27 kL and standard deviation of 0.33 kL. i. On what percentage of days is total water usage less than 1.0 kL? ii. On what proportion of days is total water usage between 0.8kL and 1.4 kL? iii. What is the probability that tomorrow total water usage will exceed 2.0 kL? 6.55 Suppose there is a free bus, with no timetable, which circles the city centre every 20 minutes. You arrive at a bus stop unaware of when the bus last arrived at this stop. What is the probability that you will wait for the bus: a. less than 5 minutes? b. between 10 and 15 minutes? c. more than 12 minutes? 6.56 The lifespan of a certain car battery is normally distributed with a mean of 5 years and a standard deviation of 9 months. a. What is the probability that a battery lasts more than 7 years? b. What proportion of batteries fail within the warranty period of 3 years? c. What warranty period, in months, should be set if only 1% of batteries fail within the warranty period?

Continuing cases

Tasman University
Tasman University’s Tasman Business School (TBS) regularly surveys business students on a number of issues. In particular, students within the school are asked to complete a student survey when they receive their grades each semester. The results of Bachelor of Business (BBus) and Master of Business Administration (MBA) students who responded to the latest undergraduate (UG) and postgraduate (PG) student surveys are stored in < TASMAN_UNIVERSITY_BBUS_STUDENT_SURVEY > and < TASMAN_UNIVERSITY_MBA_STUDENT_SURVEY >.


Copies of the survey questions are stored in Tasman University Undergraduate BBus Student Survey and Tasman University Postgraduate MBA Student Survey. a For a selection of numerical variables in the BBus student survey, decide whether the variable is approximately normally distributed by: i comparing data characteristics to theoretical properties ii constructing a normal probability plot b For a selection of numerical variables in the MBA student survey, decide whether the variable is approximately normally distributed by: i comparing data characteristics to theoretical properties ii constructing a normal probability plot c Write a report summarising your conclusions. d Assume that the weighted average mark (WAM) of BBus students is normal with a mean of 63.9 and a standard deviation of 12.8. i What percentage of BBus students have a WAM of at least 65, a Credit average? ii What percentage of BBus students have a WAM of at least 75, a Distinction average? iii What proportion of BBus students have a WAM of at least 85, a High Distinction average? iv What proportion of BBus students have a WAM of less than 50? v What is the probability that a BBus student chosen at random has a WAM between 50 and 70? vi Below what WAM do the lowest 10% of BBus students achieve? vii What WAM is achieved by the top 5% of BBus students? e Assume that the MBA weighted average mark (WAM) of MBA students is normal with a mean of 73.8 and a standard deviation of 8.6. i What percentage of MBA students have a WAM of at least 65, a Credit average? ii What percentage of MBA students have a WAM of at least 75, a Distinction average? iii What proportion of MBA students have a WAM of at least 85, a High Distinction average? iv What proportion of MBA students have a WAM of less than 50? v What is the probability that an MBA student chosen at random has a WAM between 50 and 70? vi Below what WAM do the lowest 10% of MBA students achieve? vii What WAM is achieved by the top 5% of MBA students?

As Safe as Houses To analyse the real estate market in non-capital cities and towns in states A and B, Safe-As-Houses Real Estate, a large national real estate company, has collected samples of recent residential sales from a sample of non-capital cities and towns in these states. The data are stored in < REAL_ESTATE >. a For a selection of numerical variables for regional city 1 state A, decide whether the variable is approximately normally distributed by: i comparing data characteristics to theoretical properties ii constructing a normal probability plot b For a selection of numerical variables for coastal city 1 state A, decide whether the variable is approximately normally distributed by: i comparing data characteristics to theoretical properties ii constructing a normal probability plot c Write a report summarising your conclusions. d Repeat (a) to (c) for another pair of non-capital cities or towns in state A and/or state B.


Chapter 6 Excel Guide

EG6.1 CONTINUOUS PROBABILITY DISTRIBUTIONS
There are no Excel Guide instructions for this section.

EG6.2 THE NORMAL DISTRIBUTION

Key technique  Use the NORM.DIST(X value, mean, standard deviation, True) function to calculate normal probabilities, and use the NORM.S.INV(percentage) function and the STANDARDIZE function (see Section EG3.1) to calculate the Z value.

Example  Calculate the normal probabilities for Examples 6.1, 6.4 and 6.5, and the X and Z values for Examples 6.6 and 6.7.

PHStat  Use Normal. For the example, select PHStat ➔ Probability & Prob. Distributions ➔ Normal. In this procedure’s dialog box (shown in Figure EG6.1):
1. Enter 7 as the Mean and 2 as the Standard Deviation.
2. Check Probability for: X <= and enter 3.5 in its box.
3. Check Probability for: X > and enter 9 in its box.
4. Check Probability for range and enter 5 in the first box and 9 in the second box.
5. Check X for Cumulative Percentage and enter 10 in its box.
6. Check X Values for Percentage and enter 95 in its box.
7. Enter a Title and click OK.

Figure EG6.1  Normal Probability Distribution dialog box

In-depth Excel  Use the COMPUTE worksheet of the Normal workbook as a template. The worksheet already contains the data for solving the problems in Examples 6.1 and 6.4 to 6.7. For other problems, change the values for the Mean, Standard Deviation, X Value, From X Value, To X Value, Cumulative Percentage and/or Percentage. If you use an Excel version older than Excel 2010, use the COMPUTE_OLDER worksheet.

EG6.3 EVALUATING NORMALITY

Comparing Data Characteristics to Theoretical Properties
Use the Sections EG2.3, EG3.1 and EG3.4 instructions to compare data characteristics to theoretical properties.

Constructing the Normal Probability Plot Key technique Use an Excel Scatter (X, Y) chart with Z values calculated using the NORM.S.INV function. Example Construct the normal probability plot for the call length data, as in Figure 6.22. PHStat Use Normal Probability Plot. For the example, open the Call_Length file. Select PHStat ➔ Probability & Prob. Distributions ➔ Normal Probability Plot. In the Normal Probability Plot dialog box (shown in Figure EG6.2): 1. Enter or highlight A1:A21 as the Variable Cell Range. 2. Check First cell contains label. 3. Enter a Title and click OK. Figure EG6.2 Normal Probability Plot dialog box

In addition to the chart sheet containing the normal probability plot, the procedure creates a plot data worksheet identical to the PlotData worksheet discussed in the In-depth Excel instructions.

In-depth Excel  Use the worksheets of the NPP workbook as templates. The NormalPlot chart sheet displays a normal probability plot using the rank, the proportion, the Z value and the variable found in the PLOT_DATA worksheet. The PLOT_DATA worksheet already contains the call length data for the example.

To construct a plot for a different variable, paste the sorted values for that variable in column D of the PLOT_DATA worksheet. If you have fewer than 20 values, delete rows from the bottom up. If you have more than 20 values, select row 21, right-click, click Insert ➔ Rows in the shortcut menu, copy down the formulas in A20:C20 to the new rows and then paste the sorted values for the variable in column D.

To create your own normal probability plot for the call length, open to the PLOT_DATA worksheet and select the cell range C1:D21. Then select Insert ➔ Scatter and select the first Scatter gallery item (that shows only points and is labeled with Scatter or Scatter with only Markers). Relocate the chart to a chart sheet, turn off the chart legend and gridlines, add axis titles and modify the chart title.

If you use an Excel version older than Excel 2010, use the PLOT_OLDER worksheet and the NormalPlot_OLDER chart sheet.

EG6.4 THE UNIFORM DISTRIBUTION
There are no Excel Guide instructions for this section.

EG6.5 THE EXPONENTIAL DISTRIBUTION

Key technique Use the EXPON.DIST(X value, mean, True) function.

Example Calculate the exponential probability for the bank ATM customer arrival example in Section 6.5. PHStat Use Exponential. For the example, select PHStat ➔ Probability & Prob. Distributions ➔ Exponential. In the procedure’s dialog box (shown in Figure EG6.3): 1. Enter 20 as the Mean per unit (Lambda) and 0.1 as the X Value. 2. Enter a Title and click OK.

Figure EG6.3  Exponential Probability Distribution dialog box

In-depth Excel Use the COMPUTE worksheet of the Exponential workbook as a template. The worksheet already contains the data for the example. For other problems, change Lambda and X Value in cells B4 and B5. If you use an Excel version older than Excel 2010, use the COMPUTE_OLDER worksheet.

Microsoft® product screen shots are reprinted with permission from Microsoft Corporation.


CHAPTER 7

Sampling distributions

PACKAGING TEA TREE SHAMPOO

For centuries, Indigenous Australian peoples used the leaves of the tea tree, Melaleuca alternifolia, for healing purposes. Now tea tree oil is being used in a variety of products for its beneficial antiseptic and antifungal properties. Zoffira Pty Ltd is a small company that manufactures a number of tea tree oil products, including Zoffira T Shampoo. The shampoo is packaged in 500 mL clear pump-pack bottles via a conveyor belt process. You are in charge of monitoring that bottles are being filled correctly.

Bottles are supposed to contain a mean of 500 mL of shampoo, as indicated on the package label. Because of the speed of the process, the volume of the contents varies from bottle to bottle, causing some bottles to be underfilled and some overfilled. If the process is not working properly, the mean volume in the bottles could vary too much from the label volume of 500 mL to be acceptable. As weighing every single bottle is too time-consuming, costly and inefficient, you must take a sample of bottles and make a decision regarding the probability that the packaging process is working properly. Each time you select a sample of bottles and check the individual contents, you calculate a sample mean X̄. You need to determine the probability that such an X̄ could have been randomly drawn from a population whose population mean is 500 mL. Based on this assessment, you will have to decide whether to maintain, alter or shut down the process.


LEARNING OBJECTIVES

After studying this chapter you should be able to:
1 interpret the concept of the sampling distribution
2 calculate probabilities related to the sample mean
3 recognise the importance of the Central Limit Theorem
4 calculate probabilities related to the sample proportion

In this chapter you need to make a decision about the shampoo-packaging process based on a sample of shampoo bottles. You will learn about sampling distributions and how to use them to solve business problems. As in the previous chapter, the normal distribution is used to calculate probabilities.

7.1  SAMPLING DISTRIBUTIONS

In many applications, you want to make statistical inferences – that is, to use statistics calculated from samples to estimate the values of population parameters. In this chapter you will learn more about the sample mean, a statistic used to estimate a population mean (a parameter). You will also learn about the sample proportion, a statistic used to estimate the population proportion (a parameter). Your main concern when making a statistical inference is drawing conclusions about a population, not about a sample. For example, a political pollster is interested in the sample results only as a way of estimating the actual proportion of the votes that each candidate will receive from the population of voters. Likewise, as an operations manager for Zoffira Pty Ltd, you are interested only in using the sample mean calculated from a sample of shampoo bottles for estimating the mean volume contained in a population of bottles.

In practice, you select a single random sample of a predetermined size from the population. The items included in the sample are determined through the use of a random number generator, such as a table of random numbers (see Section 1.4 and Table E.1), or by using Microsoft Excel (see page 36). Hypothetically, to use the sample statistic to estimate the population parameter, you should examine every possible sample that could occur. A sampling distribution is the distribution of the results if you actually selected all possible samples.

LEARNING OBJECTIVE 1  Interpret the concept of the sampling distribution

sampling distribution The probability distribution of a given sample statistic with repeated sampling of the population.

7.2  SAMPLING DISTRIBUTION OF THE MEAN

In Chapter 3, several measures of central tendency are discussed. Undoubtedly, the mean is the most widely used measure of central tendency. The sample mean is often used to estimate the population mean. The sampling distribution of the mean is the distribution of all possible sample means if you select all possible samples of a certain size.

The Unbiased Property of the Sample Mean

sampling distribution of the mean The distribution of all possible sample means from samples of a given size for a given population.

The sample mean is unbiased because the mean of all possible sample means (of a given sample size n), μX̄, is equal to the population mean μ. A simple example concerns a population of four candidates attempting a driver’s knowledge test of 45 questions in order to get a driver’s licence. Table 7.1 presents the number of errors.

unbiased If the average of all possible sample means equals the population mean then the sample mean is unbiased.


Table 7.1  Number of errors made by each of four driver’s knowledge test candidates

Candidate    Number of errors
Vicky        X1 = 3
Yvana        X2 = 2
Xing         X3 = 1
Zac          X4 = 4

This population distribution is shown in Figure 7.1.

Figure 7.1  Number of errors made by a population of four driver’s knowledge test candidates (frequency plot of the number of errors, 0 to 4; each of the values 1, 2, 3 and 4 occurs once)

When you have the data from a population, you calculate the mean using Equation 7.1.

POPULATION MEAN
The population mean is the sum of the values in the population divided by the population size N.

μ = (ΣXi)/N,  summing over i = 1 to N   (7.1)



You calculate the population standard deviation σ using Equation 7.2.

POPULATION STANDARD DEVIATION

σ = √( Σ(Xi − μ)² / N ),  summing over i = 1 to N   (7.2)

Thus, for the data of Table 7.1:

μ = (3 + 2 + 1 + 4)/4 = 2.5

and:

σ = √[ ((3 − 2.5)² + (2 − 2.5)² + (1 − 2.5)² + (4 − 2.5)²) / 4 ] = 1.12 errors


If you select samples of two candidates with replacement from this population, there are 16 possible samples (N^n = 4^2 = 16). Table 7.2 lists the 16 possible sample outcomes. If you average all 16 of these sample means, the mean of these values, μX̄, is equal to 2.5, which is also the mean of the population, μ.

Table 7.2  All 16 samples of n = 2 test candidates from a population of N = 4 candidates when sampling with replacement

Sample   Candidates        Sample outcomes   Sample mean
 1       Vicky, Vicky      3, 3              X̄1 = 3
 2       Vicky, Yvana      3, 2              X̄2 = 2.5
 3       Vicky, Xing       3, 1              X̄3 = 2
 4       Vicky, Zac        3, 4              X̄4 = 3.5
 5       Yvana, Vicky      2, 3              X̄5 = 2.5
 6       Yvana, Yvana      2, 2              X̄6 = 2
 7       Yvana, Xing       2, 1              X̄7 = 1.5
 8       Yvana, Zac        2, 4              X̄8 = 3
 9       Xing, Vicky       1, 3              X̄9 = 2
10       Xing, Yvana       1, 2              X̄10 = 1.5
11       Xing, Xing        1, 1              X̄11 = 1
12       Xing, Zac         1, 4              X̄12 = 2.5
13       Zac, Vicky        4, 3              X̄13 = 3.5
14       Zac, Yvana        4, 2              X̄14 = 3
15       Zac, Xing         4, 1              X̄15 = 2.5
16       Zac, Zac          4, 4              X̄16 = 4

ΣX̄ = 40, so μX̄ = 40/16 = 2.5

Since the mean of the 16 sample means is equal to the population mean, the sample mean is an unbiased estimator of the population mean. Therefore, although you do not know how close the sample mean of any particular sample selected comes to the population mean, you are at least assured that the mean of all the possible sample means that could have been selected is equal to the population mean.
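The unbiasedness argument can be verified directly by enumerating all 16 samples in code. A brief Python sketch (ours, using the Table 7.1 data):

```python
from itertools import product
from statistics import mean, pstdev

errors = [3, 2, 1, 4]   # Vicky, Yvana, Xing, Zac (Table 7.1)

# Population parameters
print(mean(errors), round(pstdev(errors), 2))   # mu = 2.5, sigma = 1.12

# All 16 samples of n = 2 selected with replacement, and their sample means
sample_means = [mean(pair) for pair in product(errors, repeat=2)]
print(len(sample_means), mean(sample_means))    # 16 samples; mean of the sample means = 2.5
```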

Standard Error of the Mean

standard error of the mean  Reflects how much the sample mean varies from its average value in repeated experiments.

Figure 7.2 illustrates the variation in the sample mean when selecting all 16 possible samples. In this small example, although the sample mean varies from sample to sample depending on which candidates are selected, the sample mean does not vary as much as the individual values in the population. That the sample means are less variable than the individual values in the population follows directly from the fact that each sample mean averages together all the values in the sample. A population consists of individual outcomes that can take on a wide range of values, from extremely small to extremely large. However, if a sample contains an extreme value, although this value will have an effect on the sample mean, the effect is reduced because the value is averaged with all the other values in the sample. As the sample size increases, the effect of a single extreme value becomes smaller because it is averaged with more values. The value of the standard deviation of all possible sample means, called the standard error of the mean, expresses how the sample mean varies from sample to sample. Equation 7.3 defines the standard error of the mean when sampling with replacement or without replacement (see page 18) from large or infinite populations.

Figure 7.2  Sampling distribution of the mean based on all possible samples containing two candidates (frequency plot of the 16 sample means from Table 7.2: the means range from 1 to 4, peaking at the population mean of 2.5)

STANDARD ERROR OF THE MEAN
The standard error of the mean, σX̄, is equal to the standard deviation in the population, σ, divided by the square root of the sample size, n.

σX̄ = σ/√n   (7.3)

Therefore, as the sample size increases, the standard error of the mean decreases by a factor equal to the square root of the sample size. You can also use Equation 7.3 as an approximation to the standard error of the mean when the sample is selected without replacement, if the sample contains less than 5% of the entire population. Example 7.1 calculates the standard error of the mean for such a situation.

EXAMPLE 7.1

CALCULATING THE STANDARD ERROR OF THE MEAN
Return to the shampoo-packaging process described in the scenario on page 248. If you randomly select a sample of 25 bottles without replacement from the thousands of bottles filled during a shift, the sample contains far less than 5% of the population. Given that the standard deviation of the shampoo-packaging process is 15 mL, calculate the standard error of the mean.

SOLUTION
Using Equation 7.3 with n = 25 and σ = 15, the standard error of the mean is:

σX̄ = σ/√n = 15/√25 = 15/5 = 3 mL

The variation in the sample means for samples of n = 25 is much less than the variation in individual bottles of shampoo (i.e. σX̄ = 3 while σ = 15).
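The arithmetic in Example 7.1 (and the effect of a larger sample, taken up in Example 7.2) can be sketched in a couple of lines; this is our own illustration of Equation 7.3, not part of the text:

```python
import math

sigma = 15   # population standard deviation of the fill volumes, in mL

for n in (25, 100):
    standard_error = sigma / math.sqrt(n)
    print(n, standard_error)   # 3.0 mL for n = 25, 1.5 mL for n = 100
```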

Sampling from Normally Distributed Populations

Now that the concept of a sampling distribution has been introduced and the standard error of the mean has been defined, what distribution will the sample mean follow? If you are sampling from a population that is normally distributed with mean μ and standard deviation σ, then, regardless of the sample size n, the sampling distribution of the mean is normally distributed, with mean μX̄ = μ and standard error of the mean σX̄. In the simplest case, if you take samples of size n = 1, each possible sample mean is a single value from the population because:

X̄ = (ΣXi)/n = X1/1 = X1  (summing over i = 1 to n, here with n = 1)


Therefore, if the population is normally distributed with mean μ and standard deviation σ, the sampling distribution of X̄ for samples of n = 1 must also follow the normal distribution, with mean μX̄ = μ and standard error of the mean σX̄ = σ/√1 = σ. In addition, as the sample size increases, the sampling distribution of the mean still follows a normal distribution with mean μX̄ = μ, but the standard error of the mean decreases, so that a larger proportion of sample means are closer to the population mean. Figure 7.3 illustrates this reduction in variability, in which 500 samples of sizes 1, 2, 4, 8, 16 and 32 were randomly selected from a normally distributed population. From the polygons in Figure 7.3, you can see that, although the sampling distribution of the mean is approximately¹ normal for each sample size, the sample means are distributed more tightly around the population mean as the sample size is increased.

To examine the concept of the sampling distribution of the mean further, consider the shampoo-packaging scenario again. The packaging equipment that is filling 500-mL bottles of shampoo is set so that the amount of shampoo in a bottle is normally distributed with a mean of 500 mL. From past experience, the population standard deviation for this filling process is 15 mL. If you randomly select a sample of 25 bottles from the many thousands that are filled in a day and the mean volume is calculated for this sample, what type of result could you expect? For example, do you think that the sample mean could be 500 mL? 300 mL? 510 mL?

Figure 7.3  Sampling distribution of the mean from 500 samples of sizes n = 1, 2, 4, 8, 16 and 32 selected from a normal population (overlaid polygons plotted on a Z scale; the polygons become taller and more tightly concentrated around the centre as n increases from 1 to 32)

1 Remember that ‘only’ 500 samples out of an infinite number of samples have been selected, so that the sampling distributions shown are only approximations of the true distributions.


The sample acts as a miniature representation of the population, so if the values in the population are normally distributed, the values in the sample should be approximately normally distributed. Thus, if the population mean is 500 mL, the sample mean has a good chance of being close to 500 mL. How can you determine the probability that the sample of 25 bottles will have a mean below 497 mL? From the normal distribution (Section 6.2) you know that you can find the area below any value X by converting to standardised Z units:

Z = (X − μ)/σ

In the examples in Section 6.2 we saw how any single value X differs from the mean. Now, in the shampoo-packaging example, the value involved is a sample mean X̄ and we wish to determine the likelihood that a sample mean is below 497. Thus, by substituting X̄ for X, μX̄ for μ and σX̄ for σ, the appropriate Z value is defined in Equation 7.4.

LEARNING OBJECTIVE 2  Calculate probabilities related to the sample mean

FINDING Z FOR THE SAMPLING DISTRIBUTION OF THE MEAN
The Z value is equal to the difference between the sample mean X̄ and the population mean μ, divided by the standard error of the mean σX̄.

Z = (X̄ − μX̄)/σX̄ = (X̄ − μ)/(σ/√n)   (7.4)

To find the area below 497 mL, from Equation 7.4:

Z = (X̄ − μX̄)/σX̄ = (497 − 500)/(15/√25) = −3/3 = −1.00

The area corresponding to Z = −1.00 in Table E.2 is 0.1587. Therefore, 15.87% of all the possible samples of size 25 have a sample mean below 497 mL.

This is not the same as saying that a certain percentage of individual bottles will have less than 497 mL of shampoo. We calculate that percentage as follows:

Z = (X − μ)/σ = (497 − 500)/15 = −3/15 = −0.20

The area corresponding to Z = −0.20 in Table E.2 is 0.4207. Therefore, 42.07% of the individual bottles are expected to contain less than 497 mL. Comparing these results, we see that many more individual bottles than sample means are below 497 mL. This result is explained by the fact that each sample consists of 25 different values, some small and some large. The averaging process dilutes the importance of any individual value, particularly when the sample size is large. Thus, the chance that the sample mean of 25 bottles is far away from the population mean is less than the chance that a single bottle is far away. Examples 7.2 and 7.3 show how these results are affected by using a different sample size.
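A short Python sketch (ours, not the text's) reproduces both probabilities and makes the contrast between a sample mean and an individual bottle explicit:

```python
import math

def norm_cdf(z):
    """Cumulative standardised normal probability."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

mu, sigma, n = 500, 15, 25

# P(sample mean of 25 bottles < 497 mL)
z_mean = (497 - mu) / (sigma / math.sqrt(n))
print(round(norm_cdf(z_mean), 4))     # 0.1587

# P(a single bottle < 497 mL)
z_single = (497 - mu) / sigma
print(round(norm_cdf(z_single), 4))   # 0.4207
```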

EXAMPLE 7.2

THE EFFECT OF SAMPLE SIZE n ON THE CALCULATION OF σX̄
How is the standard error of the mean affected by increasing the sample size from 25 to 100 bottles?


SOLUTION

If n = 100 bottles, then using Equation 7.3:

σX̄ = σ/√n = 15/√100 = 15/10 = 1.5 mL

The fourfold increase in the sample size from 25 to 100 reduces the standard error of the mean by half – from 3 mL to 1.5 mL. This demonstrates that taking a larger sample results in less variability in the sample means from sample to sample.

EXAMPLE 7.3

THE EFFECT OF SAMPLE SIZE n ON THE CLUSTERING OF MEANS IN THE SAMPLING DISTRIBUTION
In the shampoo-packaging example, if you select a sample of 100 bottles, what is the probability that the sample mean is below 497 mL?

SOLUTION
Using Equation 7.4:

Z = (X̄ − μX̄)/σX̄ = (497 − 500)/(15/√100) = −3/1.5 = −2.00

From Table E.2, the area less than Z = −2.00 is 0.0228. Therefore, 2.28% of the samples of 100 have means below 497 mL, as compared with 15.87% for samples of 25.

Sometimes you need to find the interval that contains a fixed proportion of the sample means. You need to determine a distance below and above the population mean containing a specific area of the normal curve. From Equation 7.4:

Z = (X̄ − μ)/(σ/√n)

Solving for X̄ results in Equation 7.5.

FINDING X̄ FOR THE SAMPLING DISTRIBUTION OF THE MEAN

X̄ = μ + Z σ/√n   (7.5)

Example 7.4 illustrates the use of Equation 7.5.

EXAMPLE 7.4

DETERMINING THE INTERVAL THAT INCLUDES A FIXED PROPORTION OF THE SAMPLE MEANS
In the shampoo-packaging example, find an interval around the population mean that will include 95% of the sample means, based on samples of 25 bottles.


SOLUTION

If 95% of the sample means are in the interval, then 5% are outside the interval. Divide the 5% into two equal parts of 2.5%. The value of Z in Table E.2 corresponding to an area of 0.0250 in the lower tail of the normal curve is −1.96, and the value of Z corresponding to a cumulative area of 0.975 (i.e. 0.025 in the upper tail of the normal curve) is +1.96. The lower value of X̄ (called X̄L) and the upper value of X̄ (called X̄U) are found by using Equation 7.5:

X̄L = 500 + (−1.96)(15/√25) = 500 − 5.88 = 494.12

X̄U = 500 + (+1.96)(15/√25) = 500 + 5.88 = 505.88

Therefore, 95% of all sample means are between 494.12 and 505.88 mL for samples of 25 bottles.
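Equation 7.5 translates directly into code; the following sketch (ours, not part of the text) recovers the interval in Example 7.4:

```python
import math

mu, sigma, n = 500, 15, 25
standard_error = sigma / math.sqrt(n)   # 3 mL
z = 1.96                                # from Table E.2, for the middle 95%

lower = mu - z * standard_error
upper = mu + z * standard_error
print(round(lower, 2), round(upper, 2))  # 494.12 and 505.88
```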

Sampling from Non-normally Distributed Populations – The Central Limit Theorem

So far in this section we have discussed the sampling distribution of the mean for a normally distributed population. However, in many instances, either you know that the population is not normally distributed or it is unrealistic to assume a normal distribution. An important theorem in statistics, the Central Limit Theorem, deals with this situation.

LEARNING OBJECTIVE 3  Recognise the importance of the Central Limit Theorem

Central Limit Theorem If the sample size is large enough, the distribution of sample means will be approximately normal even if the samples came from a population that was not normal.

THE CENTRAL LIMIT THEOREM
The Central Limit Theorem states that, as the sample size (i.e. the number of values in each sample) gets large enough, the sampling distribution of the mean is approximately normally distributed. This is true regardless of the shape of the distribution of the individual values in the population.

What sample size is large enough? A great deal of statistical research has gone into this issue. As a general rule, statisticians have found that for many population distributions, when the sample size is at least 30, the sampling distribution of the mean is approximately normal. However, you can apply the Central Limit Theorem for even smaller sample sizes if the population distribution is approximately bell-shaped. In the uncommon case where the distribution is extremely skewed or has more than one mode, you may need sample sizes larger than 30 to ensure normality. Figure 7.4 illustrates the application of the Central Limit Theorem to different populations. The sampling distributions from three different continuous distributions (normal, uniform and exponential) for varying sample sizes (n = 2, 5, 30) are displayed. Panel A of Figure 7.4 shows the sampling distribution of the mean selected from a normal population. As mentioned earlier, when the population is normally distributed the sampling distribution of the mean is normally distributed for any sample size. (You can measure the variability using the standard error of the mean, Equation 7.3.) Because of the unbiasedness property, the mean of any sampling distribution is always equal to the mean of the population. Panel B of Figure 7.4 depicts the sampling distribution from a population with a uniform (or rectangular) distribution (see Section 6.4). When samples of size n = 2 are selected, there is a peaking or central limiting effect already working. For n = 5, the sampling distribution is bell shaped and approximately normal. When n = 30, the sampling distribution looks very


Figure 7.4  Sampling distribution of the mean for different populations for samples of n = 2, 5 and 30 (Panel A: normal population; Panel B: uniform population; Panel C: exponential population. Each panel shows the population of values of X and the sampling distributions of X̄ for n = 2, 5 and 30.)

similar to a normal distribution. In general, the larger the sample size the more closely the sampling distribution will follow a normal distribution. As with all cases, the mean of each sampling distribution is equal to the mean of the population, and the variability decreases as the sample size increases. Panel C of Figure 7.4 presents an exponential distribution (see Section 6.5). This population is heavily skewed to the right. When n = 2, the sampling distribution is still highly skewed to the right but less so than the distribution of the population. For n = 5, the sampling distribution is


more symmetrical, with only a slight skew to the right. When n = 30, the sampling distribution looks approximately normal. Again, the mean of each sampling distribution is equal to the mean of the population, and the variability decreases as the sample size increases.

Using the results from these well-known statistical distributions (normal, uniform and exponential), you can make the following conclusions regarding the Central Limit Theorem:
• For most population distributions, regardless of shape, the sampling distribution of the mean is approximately normally distributed if samples of at least 30 are selected.
• If the population distribution is fairly symmetrical, the sampling distribution of the mean is approximately normal for samples as small as 5.
• If the population is normally distributed, the sampling distribution of the mean is normally distributed regardless of the sample size.

The Central Limit Theorem is of crucial importance in using statistical inference to draw conclusions about a population. It allows you to make inferences about the population mean without having to know the specific shape of the population distribution. You can explore how the Central Limit Theorem works yourself by using Excel to generate samples through a Random Number Generator (see the Chapter 7 Excel Guide at the end of this chapter). PHStat also has an easy-to-use Sampling Distributions Simulation.
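You can also see the Central Limit Theorem at work with a short simulation. The Python sketch below (ours, loosely mirroring Panel C of Figure 7.4) draws repeated samples from a heavily right-skewed exponential population and records the sample means; their average stays near the population mean, their spread shrinks roughly like 1/√n, and a histogram of the recorded means looks increasingly bell-shaped as n grows from 2 to 30.

```python
import random
from statistics import mean, stdev

random.seed(0)
lam = 1.0   # exponential population with mean 1/lambda = 1 (right-skewed)

def sample_means(n, reps=10_000):
    """Means of `reps` random samples, each of size n, from the exponential population."""
    return [mean(random.expovariate(lam) for _ in range(n)) for _ in range(reps)]

for n in (2, 5, 30):
    means = sample_means(n)
    # Centre stays near the population mean of 1; spread shrinks as n increases
    print(n, round(mean(means), 3), round(stdev(means), 3))
```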

Problems for Section 7.2

LEARNING THE BASICS
7.1 Given a normal distribution with μ = 100 and σ = 10, if you select a sample of n = 25: a. What is the probability that X̄ is: i. less than 95? ii. between 95 and 97.5? iii. above 102.2? b. There is a 65% chance that X̄ is above what value?
7.2 Given a normal distribution with μ = 50 and σ = 5, if you select a sample of n = 100: a. What is the probability that X̄ is: i. less than 47? ii. between 47 and 49.5? iii. above 51.1? b. There is a 35% chance that X̄ is above what value?

APPLYING THE CONCEPTS
7.3 For each of the following three populations, indicate what the sampling distribution of the mean for samples of 25 would consist of.
a. Travel expense vouchers for a university in an academic year
b. Absentee records (days absent per year) in 2010 for employees of a large construction company
c. Yearly sales (in litres) of E10 fuel at service stations located in a particular state
7.4 The following data represent the number of days absent per year in a population of six employees of a small company:
1 3 6 7 9 10

a. Assuming that you sample without replacement, select all possible samples of n = 2 and construct the sampling distribution of the mean. Calculate the mean of all the sample means and also calculate the population mean. Are they equal? What is this property called?
b. Repeat (a) for all possible samples of n = 3.
c. Compare the shape of the sampling distribution of the mean in (a) and (b). Which sampling distribution has less variability? Why?
d. Assuming that you sample with replacement, repeat (a), (b) and (c) and compare the results. Which sampling distributions have the least variability, those in (a) or (b)? Why?
7.5 The number of passengers passing through a large South East Asian airport is normally distributed with a mean of 110,000 persons per day and a standard deviation of 20,200 persons. If you select a random sample of 16 days:
a. What is the sampling distribution of the mean?
b. What is the probability that the sample mean is less than 98,000 passengers per day?
c. What is the probability that the sample mean is between 102,000 and 104,500 passengers per day?
d. The probability is 60% that the sample mean will be between which two values symmetrically distributed around the population mean?
7.6 Realestate.com.au reports that the median price of houses in the Newcastle suburb of Merewether that were sold in the 13 months to March 2017 was $1,150,000 (accessed 3 April 2017). Suppose that the mean price of houses in Merewether sold during that period was $1,236,450 and the standard deviation was $150,000.
a. If you take samples of n = 2, describe the shape of the sampling distribution of X̄.
b. If you take samples of n = 100, describe the shape of the sampling distribution of X̄.
c. If you take a random sample of n = 100, what is the probability that the sample mean will be less than $1,235,000?
7.7 Travel time on a bus between two suburban stops is normally distributed with μ = 8 minutes and σ = 2 minutes.
a. If you select a random sample of 25 trips, what is the probability that the sample mean is between 6.9 and 8.2 minutes?
b. If you select a random sample of 25 trips, what is the probability that the sample mean is between 7.5 and 8 minutes?
c. If you select a random sample of 100 trips, what is the probability that the sample mean is between 6.9 and 8.2 minutes?
d. Explain the difference between the results of (a) and (c).
7.8 It is often important to monitor traffic on a website as organisations need to make online interactions with their clients faster and easier. For example, businesses applying for an Australian Business Number (ABN) online are asked to have a variety of information about their entity ready before they begin the online process. Assume that ABN online-application times are normally distributed with a mean time of 40 minutes and a standard deviation of 5 minutes. If a random sample of 50 applications is taken:
a. What is the probability that the sample mean application time is less than 38 minutes?
b. What is the probability that the sample mean is between 39 and 41 minutes?
c. The probability is 80% that the sample mean is between what two values symmetrically distributed around the population mean?
d. The probability is 90% that the sample mean is less than what value?
7.9 A company is having a new corporate website developed. In the final testing phase the download time to open the new home page is recorded for a large number of computers in office and home settings. The mean download time for the site is 3.61 seconds. Suppose that the download times for the site are normally distributed with a standard deviation of 0.5 seconds. If you select a random sample of 30 download times:
a. What is the probability that the sample mean download time is less than 3.75 seconds?
b. What is the probability that the sample mean is between 3.70 and 3.90 seconds?
c. The probability is 80% that the sample mean is between which two values symmetrically distributed around the population mean?
d. The probability is 90% that the sample mean is less than what value?

7.3  SAMPLING DISTRIBUTION OF THE PROPORTION

Consider a categorical variable that has only two categories, such as the customer prefers your brand or the customer prefers the competitor's brand. Of interest is the proportion of items belonging to one of the categories; for example, the proportion of customers who prefer your brand. The population proportion, represented by π, is the proportion of items in the entire population with the characteristic of interest. The sample proportion, represented by p, is the proportion of items in the sample with the characteristic of interest. The sample proportion, a statistic, is used to estimate the population proportion, a parameter.

To calculate the sample proportion, you assign the two possible outcomes scores of 1 or 0 to represent the presence or absence of the characteristic. You then sum all the 1 and 0 scores and divide by n, the sample size. For example, if, in a sample of five customers, three preferred your brand and two did not, you have three ones and two zeroes. Summing the three ones and two zeroes and dividing by the sample size of 5 gives you a sample proportion of 0.60 who preferred your brand.

THE SAMPLE PROPORTION

$$p = \frac{X}{n} = \frac{\text{number of items with the characteristic of interest}}{\text{sample size}} \qquad (7.6)$$

The sample proportion p takes on values between 0 and 1. If all individuals possess the characteristic, you assign each a score of 1 and p is equal to 1. If half the individuals possess the characteristic, you assign half a score of 1 and the other half a score of 0, and p is equal to 0.5. If none of the individuals possess the characteristic, you assign each a score of 0 and p is equal to 0. Just as the sample mean X̄ is an unbiased estimator of the population mean μ, the statistic p is an unbiased estimator of the population proportion π. By analogy to the sampling distribution of the mean, the standard error of the proportion, σp, is given in Equation 7.7.

standard error of the proportion  The standard deviation of the sample proportion for repeated samples.

sampling distribution of the proportion  The distribution of all possible sample proportions from samples of a certain size.

STANDARD ERROR OF THE PROPORTION

$$\sigma_p = \sqrt{\frac{\pi(1-\pi)}{n}} \qquad (7.7)$$
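As a small numerical illustration (a sketch only, not part of the text; the 1/0 scores match the five-customer example above and the value of π is an assumed value for demonstration), Equations 7.6 and 7.7 can be computed directly:

```python
# Sketch of Equations 7.6 and 7.7. The scores and the assumed population
# proportion are illustrative values only.
import math

scores = [1, 1, 1, 0, 0]     # 1 = customer prefers your brand, 0 = does not
n = len(scores)

p = sum(scores) / n                        # Equation 7.6: p = X/n = 3/5 = 0.60
pi = 0.40                                  # assumed population proportion
sigma_p = math.sqrt(pi * (1 - pi) / n)     # Equation 7.7: standard error of the proportion

print(f"p = {p:.2f}, standard error = {sigma_p:.4f}")
```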

If you select all possible samples of a certain size, the distribution of all possible sample proportions is referred to as the sampling distribution of the proportion. When sampling with replacement from a finite population, the sampling distribution of the proportion follows the binomial distribution, as discussed in Section 5.3. However, you can use the normal distribution to approximate the binomial distribution when nπ and n(1 − π) are each greater than 5 (see Section 6.6). In most cases in which inferences are made about the proportion, the sample size is substantial enough to meet the conditions for using the normal approximation (see reference 1). Therefore, in many instances, you can use the normal distribution to estimate the sampling distribution of the proportion. Substituting p for X̄, π for μ and √(π(1 − π)/n) for σ/√n in Equation 7.4 results in Equation 7.8.

LEARNING OBJECTIVE 4
Calculate probabilities related to the sample proportion

DIFFERENCE BETWEEN THE SAMPLE PROPORTION AND THE POPULATION PROPORTION IN STANDARDISED NORMAL UNITS

$$Z = \frac{p - \pi}{\sqrt{\dfrac{\pi(1-\pi)}{n}}} \qquad (7.8)$$

To illustrate the sampling distribution of the proportion, suppose that the manager of a railway's WiFi services determines that 40% of all passengers have multiple WiFi-enabled devices available on board their train. You select a random sample of 200 passengers and count those with multiple WiFi-enabled devices. The probability that the sample proportion of passengers with multiple devices is less than 0.30 is calculated as follows. Because nπ = 200(0.40) = 80 > 5 and n(1 − π) = 200(0.60) = 120 > 5, the sample size is large enough to assume that the sampling distribution of the proportion is approximately normally distributed. Using Equation 7.8:

$$Z = \frac{p - \pi}{\sqrt{\dfrac{\pi(1-\pi)}{n}}} = \frac{0.30 - 0.40}{\sqrt{\dfrac{(0.40)(0.60)}{200}}} = \frac{-0.10}{\sqrt{\dfrac{0.24}{200}}} = \frac{-0.10}{0.0346} = -2.89$$

Using Table E.2, the area under the normal curve less than Z = −2.89 is 0.0019. Therefore, the probability that the sample proportion is less than 0.30 is 0.0019 – a highly unlikely event. This means that if the true proportion of successes in the population is 0.40, less than one-fifth of 1% of the samples of n = 200 are expected to have sample proportions of less than 0.30.
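This calculation can be checked numerically; the sketch below (not part of the text, and it assumes SciPy is installed) evaluates Equation 7.8 and the corresponding lower-tail normal probability:

```python
# Check the WiFi example: P(p < 0.30) when pi = 0.40 and n = 200.
import math
from scipy.stats import norm

pi, n, p = 0.40, 200, 0.30
z = (p - pi) / math.sqrt(pi * (1 - pi) / n)            # Equation 7.8
print(f"Z = {z:.2f}, P(p < 0.30) = {norm.cdf(z):.4f}")  # about -2.89 and 0.0019
```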




Problems for Section 7.3

LEARNING THE BASICS
7.10 In a random sample of 64 people, 48 are classified as 'successful'. If the population proportion is 0.70:
a. Determine the sample proportion p of 'successful' people.
b. Determine the standard error of the proportion.
7.11 A random sample of 50 households was selected for a telephone survey. The key question asked was, 'Has any member of your household travelled by plane in the past month?' Of the 50 respondents, 16 said yes and 34 said no. If the population proportion is 0.40:
a. Determine the sample proportion p of households with members who have travelled by plane in the past month.
b. Determine the standard error of the proportion.
7.12 The following data represent the responses (Y for yes and N for no) from a sample of 40 university students to the question, 'Do you currently own any shares in listed companies?':
N N Y N N Y N Y N Y N N Y N Y Y N N N Y
N Y N N N N Y N N Y Y N N N Y N N Y N N
If the population proportion is 0.30:
a. Determine the sample proportion p of university students who own shares in listed companies.
b. Determine the standard error of the proportion.

APPLYING THE CONCEPTS
7.13 A political polling organisation is conducting an analysis of sample results in order to make predictions on election night. Assuming a two-candidate election, if a specific candidate receives at least 55% of the vote in the sample, then that candidate will be forecast as the winner of the election. If you select a random sample of 100 voters:
a. What is the probability that a candidate will be forecast as the winner when:
   i. the true percentage of her vote is 50.1%?
   ii. the true percentage of her vote is 60%?
   iii. the true percentage of her vote is 49% (and she will actually lose the election)?
b. If the sample size is increased to 400, what are your answers to (a)? Discuss.
7.14 You plan to conduct a marketing experiment in which students are to taste one of two different brands of soft drink. Their task is to identify correctly the brand they tasted. You select a random sample of 200 students and assume they have no ability to distinguish between the two brands. (Hint: If an individual has no ability to distinguish between the two soft drinks, then each brand is equally likely to be selected.)
a. What is the probability that the sample will have between 50% and 60% of the identifications correct?
b. The probability is 90% that the sample percentage is contained within which symmetrical limits of the population percentage?
c. What is the probability that the sample percentage of correct identifications is greater than 65%?
d. Which is more likely to occur – more than 60% correct identifications in the sample of 200 or more than 55% correct identifications in a sample of 1,000? Explain.
7.15 Over the past few years there has been increased monitoring of the representation of women on corporate boards. The Australian Institute of Company Directors reports in its March–May 2016 Report that 23.6% of ASX 200 board members were female (accessed 25 April 2017). Suppose that the true percentage of women on ASX 200 boards is now 24.6% and that a random sample of 220 board members is chosen.
a. What is the probability that in the sample less than 24% of board members will be women?
b. What is the probability that in the sample between 24.2% and 25.0% of board members will be women?
c. What is the probability that in the sample between 24.5% and 24.7% of board members will be women?
d. If a sample of 100 is taken, how does this change your answers to (a), (b) and (c)?
7.16 People with permanent visas accounted for 19.5% of the net overseas migration to Australia during 2015. The relative shares of the different visa categories were: Family visas, 6.9%; Skilled, 9.0%; and Special Eligibility and Humanitarian, 2.5% (Australian Bureau of Statistics, Migration, Australia, 2015–16, Cat. No. 3412.0, March 2017). Suppose a government department is conducting a follow-up study and randomly selects 260 people who migrated in 2015.
a. What is the probability that more than 9.1% of the people in the sample are skilled migrants?
b. What is the probability that less than 2.8% are holders of Special Eligibility or Humanitarian permanent visas?
c. If a random sample of size 500 is taken, how does this change your answers to (a) and (b)?
7.17 As technology continues to change rapidly there has been a worldwide trend towards the use of smaller and more mobile devices and away from PCs. Analysts at Gartner predicted that in 2019 only 8% of devices shipped worldwide would be traditional PCs (desktops or notebooks) (accessed 27 April 2017). Assume this prediction holds and you randomly select a sample of 100 people who purchase a device shipped in 2019.
a. What is the probability that between 7.5% and 8.2% purchase a traditional PC?


b. The probability is 90% that the sample percentage will be contained within which symmetrical limits of the population percentage?
c. The probability is 95% that the sample percentage will be contained within which symmetrical limits of the population percentage?
7.18 According to an Australian Government report, retail trade is the second largest employing industry in Australia with more than 1.267 million workers, or 11% of working Australians (Department of Employment, Australian Jobs 2016, accessed 28 April 2017). This report shows that the percentage of those employed in retail trade in November 2015 who were working part-time was 49%. Assuming this percentage is still current:
a. If you select a random sample of 400 Australian retail trade workers, what is the probability that the sample has between 45% and 50% who are employed part-time?
b. If a current sample of 400 Australian retail trade workers has 50.2% who are employed part-time, what can you infer about the population estimate of 49%? Explain.
c. If a current sample of 100 Australian retail trade workers has 50.2% who are employed part-time, what can you infer about the population estimate of 49%? Explain.
d. Explain the difference between the results in (b) and (c).
7.19 The Australian Tax Office carries out a range of verification checks and audits for the goods and services tax (GST), including Business Activity Statement integrity audits. Assume that currently no additional tax is collected for 25% of such audits. Suppose that you select a random sample of 100 audits. What is the probability that the sample will have:
a. between 24% and 26% of audits that collect no additional tax?
b. between 20% and 30% of audits that collect no additional tax?
c. more than 30% of audits that collect no additional tax?
7.20 The 11th Annual Statistical Report of the HILDA Survey relates to the 2016 phase of a large longitudinal study of Australian residents. It found that 19.9% of households surveyed had HECS/HELP debts and 35.7% had debts on their home (R. Wilkins (ed.), The Household, Income and Labour Dynamics in Australia Survey: Selected Findings from Waves 1 to 14, Melbourne Institute of Applied Economic and Social Research, University of Melbourne, 2016, accessed 28 April 2017). Assume the same percentages found in the survey apply right now for all Australian households. In a sample of 600 of these households, what is the probability that:
a. more than 18% of households have HECS/HELP debts?
b. fewer than 33.5% of households have debts on their home?

Assess your progress

Summary
In this chapter we looked at the sampling distribution of the sample mean, the Central Limit Theorem and the sampling distribution of the sample proportion. You learned that the sample mean is an unbiased estimator of the population mean and the sample proportion is an unbiased estimator of the population proportion. By observing the mean volume in a sample of shampoo bottles filled by Zoffira Pty Ltd, you were able to draw conclusions about the mean volume in the population of shampoo bottles. In the next three chapters we discuss techniques commonly used for statistical inference: confidence intervals and tests of hypotheses.


Key formulas

Population mean
$$\mu = \frac{\sum_{i=1}^{N} X_i}{N} \qquad (7.1)$$

Population standard deviation
$$\sigma = \sqrt{\frac{\sum_{i=1}^{N}(X_i - \mu)^2}{N}} \qquad (7.2)$$

Standard error of the mean
$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} \qquad (7.3)$$

Finding Z for the sampling distribution of the mean
$$Z = \frac{\bar{X} - \mu_{\bar{X}}}{\sigma_{\bar{X}}} = \frac{\bar{X} - \mu}{\dfrac{\sigma}{\sqrt{n}}} \qquad (7.4)$$

Finding X̄ for the sampling distribution of the mean
$$\bar{X} = \mu + Z\frac{\sigma}{\sqrt{n}} \qquad (7.5)$$

Sample proportion
$$p = \frac{X}{n} = \frac{\text{number of items with the characteristic of interest}}{\text{sample size}} \qquad (7.6)$$

Standard error of the sample proportion
$$\sigma_p = \sqrt{\frac{\pi(1-\pi)}{n}} \qquad (7.7)$$

Finding Z for the sampling distribution of the proportion
$$Z = \frac{p - \pi}{\sqrt{\dfrac{\pi(1-\pi)}{n}}} \qquad (7.8)$$
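For readers who prefer to work programmatically, the sketch below (not part of the text; it assumes SciPy is installed, and the values of μ, σ, n and the cut-offs are arbitrary illustrations) applies Equations 7.3 to 7.5 for a sampling-distribution-of-the-mean calculation.

```python
# Illustrative use of Equations 7.3-7.5 (all input values are assumptions).
import math
from scipy.stats import norm

mu, sigma, n = 120.0, 15.0, 36          # population mean, population SD, sample size
se = sigma / math.sqrt(n)               # Equation 7.3: standard error of the mean

x_bar = 117.0                           # a sample mean of interest
z = (x_bar - mu) / se                   # Equation 7.4
print(f"P(X-bar < {x_bar}) = {norm.cdf(z):.4f}")

# Equation 7.5: the sample mean below which 65% of all sample means fall
print(f"65th percentile of X-bar = {mu + norm.ppf(0.65) * se:.2f}")
```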

Key terms
Central Limit Theorem 256
sampling distribution 249
sampling distribution of the mean 249
sampling distribution of the proportion 260
standard error of the mean 252
standard error of the proportion 260
unbiased 249

References
1. Cochran, W. G., Sampling Techniques, 3rd edn (New York: Wiley, 1977).

Chapter review problems

CHECKING YOUR UNDERSTANDING
7.21 Why is the sample mean an unbiased estimator of the population mean?
7.22 Why does the standard error of the mean decrease as the sample size n increases?
7.23 Why does the sampling distribution of the mean follow a normal distribution for a large enough sample size even though the population may not be normally distributed?
7.24 What is the difference between a probability distribution and a sampling distribution?
7.25 Under what circumstances does the sampling distribution of the proportion approximately follow the normal distribution?

APPLYING THE CONCEPTS
7.26 A particular type of ballpoint pen uses minute ball bearings that are targeted to have a diameter of 0.5 mm. The lower and upper specification limits under which the ball bearing can operate are 0.49 mm (lower) and 0.51 mm (upper). Past experience has indicated that the actual diameter of the ball bearings is approximately normally distributed with a mean of 0.503 mm and a standard deviation of 0.004 mm. If you select a random sample of 25 ball bearings:
a. What is the probability that the sample mean is:
   i. between the target and the population mean of 0.503?
   ii. between the lower specification limit and the target?
   iii. above the upper specification limit?
   iv. below the lower specification limit?
b. The probability is 93.32% that the sample mean diameter will be above what value?
7.27 The fill amount of milk in plastic containers is normally distributed with a mean of 2.0 litres and a standard deviation of 0.05 litres. If you select a random sample of 25 containers:
a. What is the probability that the sample mean will be:
   i. between 1.99 and 2.0 litres?
   ii. below 1.98 litres?
   iii. above 2.01 litres?


b. The probability is 99% that the sample mean will contain at least how much milk?
c. The probability is 99% that the sample mean will contain an amount that is between which two values (symmetrically distributed around the mean)?
7.28 The ABS has reported that in 2015, 26.78% of the 16.8 million employees in Australia worked part-time in their main job (Australian Bureau of Statistics, Characteristics of Employment, Australia, August 2015, Cat. No. 6333.0, 2016). Suppose that you select a random sample of 250 employees from around Australia.
a. What is the probability that more than 26.2% of those sampled work part-time in their main job?
b. What is the probability that the proportion of part-time employees is between 0.27 and 0.29?
c. The probability is 77% that the sample proportion of part-time employees will be above what value?
7.29 A new online advertisement for an Extra Dry beer has been designed for a target audience of Australian males aged 18 to 30. The advertisers hope that 24% of the target audience will find the ad 'very entertaining'. Suppose that a sample of 400 male television viewers in the target age group is shown the advertisement. What is the probability that the sample will have between:
a. 18% and 22% who find it 'very entertaining'?
b. 16% and 24% who find it 'very entertaining'?
c. 14% and 26% who find it 'very entertaining'?
d. 12% and 28% who find it 'very entertaining'?
7.30 Assume that, for the first quarter of 2017, the weekly rental costs of three-bedroom dwellings in a coastal town in Western Australia are normally distributed with a mean of $260 and a standard deviation of $30. If you select a random sample of 10 dwellings from this population, what is the probability that the sample will have a mean rental cost:
a. less than $270?
b. between $265 and $275?
c. greater than $282?
7.31 APRA, the Australian Prudential Regulation Authority, monitors the return rates of large superannuation funds in Australia. Its publication Statistics: Quarterly superannuation performance, December 2016 showed an annual rate of return of 6.8% for the year. Imagine that a researcher with access to the APRA data finds that the average rate of return for the largest superannuation funds in the last year has been 7.5% with a standard deviation of 0.7%, and that rates of return were normally distributed.
a. If the researcher selects an individual fund at random from this population, what is the probability that the fund had a return of:
   i. less than 8.2%?
   ii. between 6.9% and 7.8%?
   iii. greater than 7.9%?
b. If a random sample of 10 funds is selected from this population, what is the probability that the sample mean lies in the ranges given in (a)?
7.32 Assume that the returns for shares on the Chinese share market were distributed as a normal random variable, with a mean of 1.54 and a standard deviation of 10. If you select an individual share from this population, what is the probability that it would have a return:
a. less than 0 (i.e. a loss)?
b. between –10 and –20?
c. greater than –5?
If you selected a random sample of four shares from this population, what is the probability that the sample would have a mean return:
d. less than 0 (a loss)?
e. between –10 and –20?
f. greater than –5?
g. Compare your results in parts (d) to (f) to those in (a) to (c).
7.33 (Class project) The table of random numbers is an example of a uniform distribution because each digit is equally likely to occur. Starting in the row corresponding to the day of the month on which you were born, use the table of random numbers (Table E.1) to take one digit at a time. Select five different samples of n = 2, n = 5 and n = 10. Calculate the sample mean of each sample. Develop a frequency distribution of the sample means for the results of the entire class based on samples of sizes n = 2, n = 5 and n = 10. What can be said about the shape of the sampling distribution for each of these sample sizes?
7.34 (Class project) Toss a coin 10 times and record the number of heads. If each student performs this experiment five times, a frequency distribution of the number of heads can be developed from the results of the entire class. Does this distribution seem to approximate the normal distribution?
7.35 (Class project) The table of random numbers can simulate the selection of different-coloured balls from a bowl as follows:
1. Start in the row corresponding to the day of the month on which you were born.
2. Select one-digit numbers.
3. If a random digit between 0 and 6, inclusive, is selected, consider the ball white; if a random digit is a 7, 8 or 9, consider the ball red.
Select samples of n = 10, n = 25 and n = 50 digits. In each sample, count the number of white balls and calculate the proportion of white balls in the sample. If each student in the class selects five different samples for each sample size, a frequency distribution of the proportion of white balls (for each sample size) can be developed from the results of the entire class. What conclusions can you reach about the sampling distribution of the proportion as the sample size is increased?
7.36 (Class project) Suppose that step 3 of problem 7.35 uses the following rule: 'If a random digit between 0 and 8, inclusive, is selected, consider the ball to be white; if a random digit of 9 is selected, consider the ball to be red'. Compare and contrast the results in this problem and in problem 7.35.


Continuing cases

As Safe as Houses
To analyse the real estate market in non-capital cities and towns in states A and B, Safe-As-Houses Real Estate, a large national real estate company, has collected samples of recent residential sales from a sample of non-capital cities and towns in these states. The data are stored in < REAL_ESTATE >.
a. Find the mean price for the sample of 125 properties sold in regional city 1 of state A. What is the probability of finding a sample mean at least this large if the population mean and standard deviation of prices for this city are $300,000 and $100,000 respectively?
b. Now find the mean price for the sample of 125 properties sold in the coastal city of state B. What is the probability that the sample mean is less than or equal to this value if the population mean and standard deviation for this city are $595,000 and $287,000 respectively?
c. Discuss why your answers to (a) and (b) are not the same as finding comparable probabilities for individual properties sold in each city.
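The calculation in part (a) can be sketched in code as follows. This is not part of the text: the CSV file name and the 'Price' column are hypothetical placeholders for however the < REAL_ESTATE > data are actually stored, and pandas and SciPy are assumed to be available.

```python
# Hypothetical sketch for part (a): the file name and column are placeholders.
import math
import pandas as pd
from scipy.stats import norm

prices = pd.read_csv("real_estate_regional_city1.csv")["Price"]  # hypothetical
n, x_bar = len(prices), prices.mean()

mu, sigma = 300_000, 100_000            # population values given in part (a)
z = (x_bar - mu) / (sigma / math.sqrt(n))
print(f"sample mean = {x_bar:,.0f}")
print(f"P(X-bar >= observed mean) = {1 - norm.cdf(z):.4f}")
```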

Chapter 7 Excel Guide

EG7.1 SAMPLING DISTRIBUTION OF THE MEAN
Key technique Use an add-in procedure to create a simulated sampling distribution.
Example Create a simulated sampling distribution that consists of 100 samples of n = 30 from a uniformly distributed population.
Analysis ToolPak Use Random Number Generation. For the example, select Data ➔ Data Analysis. In the Data Analysis dialog box, select Random Number Generation from the Analysis Tools list and then click OK. In the procedure's dialog box (shown in Figure EG7.1):
1. Enter 100 as the Number of Variables.
2. Enter 30 as the Number of Random Numbers.
3. Select Uniform from the Distribution dropdown list.
4. Keep the Parameters values as they are.
5. Click New Worksheet Ply and then click OK.
Figure EG7.1 shows the entries for generating 100 samples of n = 30 from a uniformly distributed population.

Figure EG7.1  Data Analysis Random Number Generation dialog box

If you are using PHStat with either Excel for Mac 2016 or Excel 2016, see Appendix D.1 (Sampling Distribution of the Mean) to produce an enhanced version of this worksheet.


EG7.2 CENTRAL LIMIT THEOREM
By using the above method to generate 50 samples of size 3, then size 10 and size 40 from a uniform distribution, you should be able to observe how the Central Limit Theorem works. In PHStat simply click on Histogram to see the shape of the sampling distribution. If you are using Excel's Random Number Generator a bit more work is required.
For each set of samples use the =AVERAGE function to calculate the mean of the first sample, then drag or copy this to find the means of the remaining 49 samples. Next, create frequency distributions of the sample means using the methods described in the Chapter 2 Excel Guide. Last, compare the three frequency tables. You should see that they resemble a normal distribution more closely as the sample size increases.

Microsoft® product screen shots are reprinted with permission from Microsoft Corporation.


End of Part 2 problems

B.1 A soft-drink bottling company maintains records of the number of unacceptable bottles of soft drink coming from the filling and capping machines. Based on past data, the probability that a bottle came from machine I and was unacceptable is 0.01, and the probability that a bottle came from machine II and was unacceptable is 0.025. Half the bottles are filled on machine I and the other half are filled on machine II. If a filled bottle of soft drink is selected at random:
a. What is the probability that it is unacceptable?
b. What is the probability that it was filled on machine I or is acceptable?
c. Suppose you know that the bottle was filled on machine I. What is the probability that it is unacceptable?
d. Suppose you know that the bottle is unacceptable. What is the probability that it was filled on machine I?
e. Explain the difference in the answers to (c) and (d). (Hint: Construct a 2 × 2 contingency table or a Venn diagram to evaluate the probabilities.)
B.2 The fill amount of soft-drink bottles is normally distributed with a mean of 2.0 litres (the listed content) and a standard deviation of 0.05 litre. Bottles that contain less than 95% of the listed net content (1.90 litres, in this case) make the manufacturer subject to penalties. Bottles that have a net content above 2.10 litres may cause excess spillage upon opening.
a. What proportion of the bottles will contain:
   i. between 1.90 and 2.0 litres?
   ii. between 1.90 and 2.10 litres?
   iii. less than 1.90 litres or more than 2.10 litres?
b. 99% of the bottles contain at least how much soft drink?
c. 99% of the bottles contain an amount that is between which two values (symmetrically distributed) around the mean?
B.3 In an effort to reduce the number of bottles that contain less than 1.90 litres, the bottler in problem B.2 sets the filling machine so that the mean is 2.02 litres. Under these circumstances, what are your answers to (a) to (c)?
B.4 a. If a coin is tossed seven times, how many different outcomes are possible?
b. If a die is rolled seven times, how many different outcomes are possible?
c. Discuss the differences in your answers to (a) and (b).
B.5 The time between arrivals of cars at Sheng's carwash is exponential with an average of 6 minutes between arrivals. What is the probability that the time between successive arrivals will be:
a. less than 2 minutes?
b. more than 10 minutes?
c. between 4 and 6 minutes?
B.6 The following data represent the electricity cost in dollars during the month of July for a random sample of 50 two-bedroom apartments in a New Zealand city: < ELECTRICITY >
 96 171 202 178 147 102 153 197 127  82
157 185  90 116 172 111 148 213 130 165
141 149 206 175 123 128 144 168 109 167
 95 163 150 154 130 143 187 166 139 149
108 119 183 151 114 135 191 137 129 158
a. Decide whether the electricity cost for July is approximately normal by:
   i. evaluating the actual versus theoretical properties
   ii. constructing a normal probability plot
From part (a), assume that electricity cost for July is normally distributed with a mean of $147 and standard deviation of $31.70.
b. A two-bedroom apartment is selected at random. What is the probability that electricity cost for July is:
   i. less than $120?
   ii. between $100 and $160?
   iii. more than $225?
c. For 10% of two-bedroom apartments, the electricity cost for July is above what amount?
d. The cost of electricity for the middle 95% of two-bedroom apartments is between which two amounts?
B.7 An electrical retail store has found that 55% of its customers use a credit card to pay for their purchases.
a. If 15 customers who make a purchase are randomly selected, what is the probability that:
   i. none use a credit card?
   ii. exactly five use a credit card?
   iii. more than two use a credit card?
b. What are the mean and the standard deviation of the probability distribution?
B.8 It has been observed that 92% of train commuters travelling during the 8.00 am to 9.00 am period use a mobile phone during their trip for various activities.
a. In a train carriage with 42 passengers during this period, what is the probability that fewer than 38 passengers use their mobile phone during their commute?
b. If the carriage has 50 passengers, what is the probability that between 43 and 47 passengers use their mobile phone?
B.9 From a consignment of 64 large garden pots in individual crates being shipped from Vietnam to a local importer, 16 have imperfections such as cracks or are broken.
a. If eight crates are shipped to a particular garden nursery, what is the probability that:
   i. all eight will have defective pots?
   ii. none will have a defective pot?
   iii. at least one will have a defective pot?
b. What would be your answers to (a) if eight crates have defective pots?
B.10 East Park Realty, a small real estate company located in country areas of South Australia, specialises primarily in


residential listings. It is interested in determining the probability of one of its listings being sold within a certain number of days. An analysis of company sales of 800 houses in the previous year produces the following data.

                              Days listed until sold
Initial asking price      30 and under    31–90    Over 90    Total
Under $200,000                  50           40        10       100
$200,000–$299,999               40          140        70       250
$300,000–$399,999               30          270       100       400
$400,000 or more                10           30        10        50
Total                          130          480       190       800

a. Give an example of a simple event.
b. Give an example of a joint event.
c. What is the complement of 'asking price under $200,000'?
d. Why is 'asking price under $200,000 and being listed more than 90 days until sold' a joint event?
e. Given that a house had an asking price of less than $200,000, what is the probability that it took more than 90 days to sell?
f. Given that a house took more than 90 days to sell, what is the probability that its asking price was less than $200,000?
g. Explain the difference in the results in (e) and (f).
h. Are the two events – asking price less than $200,000, and taking more than 90 days to sell – statistically independent?
i. If a house is selected at random, what is the probability that:
   i. it is listed more than 90 days before being sold?
   ii. its initial asking price is at least $400,000?
   iii. its initial asking price is at least $400,000 and it is listed more than 90 days before being sold?
   iv. its initial asking price is more than $400,000 or it is listed more than 90 days before being sold?
j. Explain the difference in the results in parts (i) to (iv) above.
B.11 You are trying to develop a strategy for investing in two different shares. The anticipated annual return for a $1,000 investment in each share has the following probability distribution:

                    Returns
Probability    Share X    Share Y
   0.1          -$50      -$100
   0.3            20         50
   0.4           100        130
   0.2           150        200

a. Calculate the:
   i. expected return for share X and for share Y
   ii. standard deviation for share X and for share Y
   iii. covariance of share X and share Y
b. Would you invest in share X or share Y? Explain.
B.12 Suppose that in problem B.11 you wanted to create a portfolio that consists of share X and share Y.
a. Calculate the portfolio expected return and portfolio risk for each of the following percentages invested in share X:
   i. 30%
   ii. 50%
   iii. 70%
b. On the basis of the results in (a), which portfolio would you recommend? Explain.
B.13 At an ocean-side nuclear power plant, seawater is used as part of the cooling system. This system raises the temperature of the water that is discharged back into the ocean. The amount that the water temperature is raised has a uniform distribution over the interval from 10°C to 25°C.
a. What is the probability that the temperature will increase less than 20°C?
b. What is the probability that the temperature will increase between 20°C and 22°C?
c. A temperature increase of more than 18°C is considered potentially dangerous to the environment. What is the probability that, at any point of time, the temperature increase is potentially dangerous?
d. What is the mean and standard deviation of the temperature increase?
B.14 A survey of 1,500 students at a large university gave the following data on their study mode (full- or part-time) as well as their employment status.

                      Studying      Studying      All
Employment status     full-time     part-time     students
Employed full-time        94           558          652
Employed part-time       292           190          482
Not employed             278            88          366
All students             664           836        1,500

a. Give an example of a simple event.
b. Give an example of a joint event.
c. What is the complement of 'employed full-time'?
d. Why is 'employed full-time and studying full-time' a joint event?
e. If a student is selected at random, what is the probability that:
   i. they are employed?
   ii. they are studying part-time and are employed?
   iii. they are studying part-time or are employed?
f. Explain the difference between the results in part (e) above.
B.15 Telephone calls arrive at the information desk of a large computer software company at the rate of 15 per hour.
a. What is the probability that the next call will arrive within 3 minutes (0.05 hour)?
b. What is the probability that the next call will arrive within 15 minutes (0.25 hour)?
c. Suppose the company has just introduced an updated version of one of its software programs, and telephone calls are now arriving at the rate of 25 per hour. Given this information, redo (a) and (b).
B.16 On a tourism Twitter site, where photos of scenic views and native animals are regularly shared, the long-term average number of 'Likes' obtained per photo posted is 600.5, with a standard deviation of 76. A sample of 52 photos is selected at random.
a. What is the probability that the average number of 'Likes' for the sample is at least 630?


b. What is the probability that the average number of 'Likes' is less than 575?
B.17 In each game of OZ Lotto seven numbers are selected from 1 to 45. To win the first-division prize, the seven winning numbers must have been selected. On any game, what is the probability of winning the first division?
B.18 To test the effectiveness of mail X-ray screening in identifying potential illegal or threatening items, a mail centre X-rays a random sample of 500 packages and then independently searches each package. The results of this test are given below.

                                  Search: items found
X-ray items identified        Yes        No        Total
Yes                            36        12           48
No                             14       438          452
Total                          50       450          500

a. What percentage of items does the X-ray identify as potentially illegal or threatening?
b. What proportion of items identified by X-ray as potentially illegal or threatening are found to be such when searched?
c. An item is found during the search to be illegal or threatening. What is the probability that the X-ray identified it as potentially illegal or threatening?
d. What percentage of items are not found to be illegal or threatening during the search and not identified as illegal or threatening by X-ray?
B.19 Of the packages searched at the mail centre in problem B.18, 9.6% are found to contain illegal or threatening items. Suppose 10 packages are independently and randomly selected to be searched.
a. What is the probability that:
   i. exactly two contain illegal or threatening items?
   ii. none contain illegal or threatening items?
   iii. at least one contains illegal or threatening items?
   iv. more than half contain illegal or threatening items?
b. What is the expected number and standard deviation of the number of packages with illegal or threatening items?
B.20 The table below classifies the academic staff of a small regional university by gender and level of appointment.

                              Gender
Level                  Female      Male      Total
Professor                  13        21         34
Associate professor        16        24         40
Senior lecturer            37        52         89
Lecturer                   74        58        132
Associate lecturer         23        13         36
Total                     163       168        331

a. Calculate the following probabilities:
   i. A randomly selected academic staff member is female.
   ii. A randomly selected male academic staff member is a senior lecturer or above.
   iii. A randomly selected academic staff member is a female associate lecturer.
   iv. A randomly selected professor is female.
   v. A randomly selected academic staff member is an associate professor.
b. Are level of appointment and gender statistically independent? Explain.
B.21 Suppose the executive of the university in problem B.20 randomly selects five senior (senior lecturer and above) academic staff members for a committee. Calculate the following probabilities:
a. The selected members of the committee are all male senior lecturers.
b. There are no professors on the committee.
c. At least half the committee is female.
d. There is exactly one professor on the committee.
e. There are three associate professors on the committee.
B.22 An on-the-job injury occurs once every 10 days on average at a car manufacturer. What is the probability that the next on-the-job injury will occur within:
a. 10 days?
b. 5 days?
c. 1 day?
B.23 In a recent opinion poll a sample of 1,200 adults (at least 20 years old) was surveyed. Of these adults, 768 were married, 684 were female and there were 459 married females. Construct a contingency table or a Venn diagram and evaluate the probability that a surveyed adult selected at random:
a. is male
b. is single
c. is a married male
d. is a single female
B.24 The following table contains the probability distribution for the number of traffic accidents per day in a small city.

Number of accidents daily (X)    P(X)
0                                0.10
1                                0.20
2                                0.45
3                                0.15
4                                0.05
5                                0.05

a. Calculate the mean or expected number of accidents per day.
b. Calculate the standard deviation.
B.25 On average 108 customers per hour join a queue at any one of the checkout counters of a grocery store. Suppose that the number of customers joining a queue at the checkout counters follows an approximate Poisson distribution.
a. What is the probability that in the next minute:
   i. exactly four customers join a queue?
   ii. at least one customer joins a queue?


b. What is the probability that in the next 5 minutes:
   i. exactly 10 customers join a queue?
   ii. at least 10 customers join a queue?
B.26 A computer Help desk has two technicians, A with advanced training, who is able to solve 95% of problems, and B with less training, who is only able to solve 85% of problems. Each technician randomly receives 50% of problems.
a. What percentage of solved problems are solved by technician A?
b. What percentage of problems are solved?
B.27 A particular weekly Bingo session consists of 20 games. In each game, there are two points where a player can win (a line and a house). Assume on a typical week that there are 100 players, each player is equally likely to win and winning is independent.
a. Calculate the probability that a player has a win (line and/or house) on a game. Ignore the possibility of multiple winners at any stage of a game.
b. Calculate the probability that a player wins at least once during the evening.
On a typical week Biff went to Bingo with four friends, each of whom won at least once but she did not.
c. Calculate the probability that in a group of five players exactly four will win at least once.
d. Calculate the probability that Biff does not have a win but her four friends do.
The Bingo session costs a player $8, with each line won paying $10 and each house $20.
e. Construct the probability distribution for the amount a player wins in a game.
f. What is the expected amount a player wins in a game?
g. What is the variance and standard deviation of the amount a player wins in a game?
h. What is a player's expected profit (or loss) from the Bingo session?
B.28 Based on past experience, 40% of all customers at Miller's Service Station pay for their purchases with a credit card. If a random sample of 200 customers is selected, what is the approximate probability that:
a. at least 75 pay with a credit card?
b. not more than 70 pay with a credit card?
c. between 70 and 75 customers, inclusive, pay with a credit card?
B.29 At the local golf course golfers lose golf balls at a rate of 3.8 per 18-hole round. Assume that the number of golf balls lost in an 18-hole round is distributed as a Poisson random variable.
a. What assumptions need to be made so that the number of golf balls lost in an 18-hole round is distributed as a Poisson random variable?
b. Given the assumptions made in (a), what is the probability that in an 18-hole round:
   i. at least one ball will be lost?
   ii. less than three balls will be lost?
   iii. more than five balls will be lost?
B.30 The Tasmanian Visitor Survey presents data in an analyser database on a number of aspects of tourism, including attractions visited by tourists aged 14 or over. The most visited attractions by 1,283,618 tourists in the October 2016 to September 2017 period were the Saturday Salamanca Market (443,600/34.6%), MONA – the Museum of Old and New Art (352,222/27.4%) and Mt Wellington (328,752/25.6%) (data obtained from ).
a. If a survey of 300 people aged 14 or over who toured Tasmania during the period in question is taken, what is the probability that at least 30% visited MONA?
b. What is the probability in this survey that between 31% and 36% of tourists visited the Saturday Salamanca Market?
c. What is the probability in this survey that fewer than 23% of tourists visited Mt Wellington?
B.31 A box of nine golf gloves contains two left-handed gloves and seven right-handed gloves.
a. If two gloves are randomly selected from the box without replacement, what is the probability that both gloves will be right-handed?
b. If two gloves are randomly selected from the box without replacement, what is the probability that one right-handed glove and one left-handed glove will be selected?
c. If three gloves are selected with replacement, what is the probability that all three will be left-handed?
d. If you were sampling with replacement, what would be the answers to (a) and (b)?
B.32 Based on past experience, the owner of a stall at the local annual show states that 60% of visitors to the stall will purchase a showbag. On a certain day, the stall has 100 visitors.
a. Is the 60% figure best classified as a priori classical probability, empirical classical probability or subjective probability?
b. Find the expected number and standard deviation of sales, assuming that number of sales is binomial.
c. If the showbags cost $12 each, find the expected revenue from the sales.
d. What assumptions are necessary in (b)?
B.33 The cost of a phone call passed on to a 'live' operator is approximately 10 times that of a call answered by an automated customer-service system. However, as more and more companies have implemented automated systems, customer annoyance with these systems has grown. Many customers are quick to leave the automated system when given an option such as 'Press zero to talk to a customer-service representative'. Research has shown that approximately 40% of all callers to automated customer-service systems will automatically opt to go to a live operator when given the chance.
a. If 10 independent callers contact an automated customer-service system, what is the probability that:
   i. none of the callers will automatically opt to talk to a live operator?
   ii. exactly one will automatically opt to talk to a live operator?




   iii. two or fewer will automatically opt to talk to a live operator?
   iv. all 10 will automatically opt to talk to a live operator?
b. If all 10 automatically opt to talk to a live operator, do you think that the 40% figure applies to this particular system? Explain.
B.34 One theory concerning the Standard & Poor's (S&P) 500 Index of US stocks is that if it increases during the first five trading days of the year, it is likely to increase during the entire year. From 1929 to 2016, early gains during the first five days predicted full-year gains approximately 69.5% (41 out of 59) of the time. Assuming that this indicator is a random event with no predictive value, you would expect that the indicator would be correct 50% of the time.
a. What is the probability of the S&P 500 Index increasing in 41 or more of 59 years with an early gain if the true probability of an increase in the S&P 500 Index is:
   i. 0.50?
   ii. 0.70?
   iii. 0.90?
b. Based on the results in (a), what do you think is the probability that the S&P 500 Index will increase if there is an early gain in the first five trading days of the year? Explain.
B.35 A research institute has interviewed a total of 1,764 employers. Fifty-three per cent of the 264 employers from the telecommunications industry expected to have a net increase in employment in their company during the next quarter. Only 43% of employers interviewed from other industries expected a net increase during the same period.
a. If an employer from this survey pool is selected at random and expects that there will be a net increase in employment in his company during the next quarter, what is the probability that his company is in the telecommunications industry?
b. What is the chance that an employer, selected at random, is neither from the telecommunications industry nor expects an increase?
B.36 A quinella consists of picking the horses that will place first and second in a race irrespective of order. Suppose eight horses are entered in a race.
a. How many quinella combinations are there for this race?
b. If you choose two horses randomly, what is the probability that you win the quinella?
B.37 Suppose that a quality control department has established that 0.1% of items produced are defective.
a. If 25 items are randomly selected, find the probability that:
   i. exactly two items are defective
   ii. at most one item is defective
   iii. at least two items are defective
b. What is the expected number and standard deviation of defective items?
B.38 Assume that the number of network errors experienced in a day on a local area network (LAN) is distributed as a Poisson random variable. The mean number of network errors

experienced in a day is 2.4. What is the probability that, in any given day:
a. zero network errors will occur?
b. exactly one network error will occur?
c. two or more network errors will occur?
d. fewer than three network errors will occur?
B.39 Greenway Gardens currently has six plots available to plant tomatoes, eggplant, capsicum, cucumbers, beans and lettuce. Each vegetable will be planted in one and only one plot. How many ways are there to position these vegetables in the gardens?
B.40 Olive Construction Company is determining whether it should submit a bid for a new shopping centre. In the past, Olive's main competitor, Base Construction Company, has submitted bids 70% of the time. If Base Construction does not bid on a job, the probability that Olive Construction will get the job is 0.50. If Base Construction bids on a job, the probability that Olive Construction will get the job is 0.25.
a. If Olive Construction gets the job, what is the probability that Base Construction did not bid?
b. What is the probability that Olive Construction will get the job?
B.41 An airline maintains statistics for mishandled bags per 1,000 passengers. Suppose that last year this airline had 7.03 mishandled bags per 1,000 passengers. What is the probability that the next 1,000 passengers on this airline will have:
a. no mishandled bags?
b. at least one mishandled bag?
c. at least two mishandled bags?
B.42 A small factory processes and bottles fruit juice. Two types of defect can occur – an incorrect fill amount (over or under the stated amount on the label) and an incorrect seal. From production data it is known that 0.5% of two-litre bottles filled have an incorrect fill amount and 0.1% are incorrectly sealed, with 0.002% having both defects – an incorrect fill amount and being incorrectly sealed.
a. What proportion of two-litre bottles produced have at least one type of defect?
b. What proportion of two-litre bottles produced have no defects?
c. A two-litre bottle has an incorrect fill amount. What is the probability that it is also incorrectly sealed?
d. Twenty filled two-litre bottles are randomly chosen. Determine the probability that:
   i. only one bottle has an incorrect fill amount
   ii. at least one bottle has an incorrect fill amount
   iii. at most, two bottles have an incorrect fill amount
e. In a random sample of 100 filled two-litre bottles, find the expected number of bottles which are incorrectly sealed.
B.43 The amount of time a bank teller spends with each customer has a population mean μ = 3.10 minutes and standard deviation σ = 0.40 minute.
a. If you select a random sample of 16 customers:
   i. what is the probability that the mean time spent per customer is at least 3 minutes?
   ii. there is an 85% chance that the sample mean is below how many minutes?


b. What assumption must you make in order to solve both parts of (a)? c. If you select a random sample of 64 customers, there is an 85% chance that the sample mean is below how many minutes? B.44 A manager of a seafood restaurant is interested in both the time it takes a customer to be seated (the waiting time) and the length of time between a customer being seated and leaving the restaurant (the service time). Over a month, a random sample of 100 customers (only one per party/table) was selected and waiting and serving times, in minutes, are recorded in the file < RESTAURANT_TIMES >. a. Construct a histogram for waiting times. Are waiting times approximately normal, exponential or uniform? Is this what you expected? b. Construct a histogram of serving times. Are serving times approximately normal, exponential or uniform? Is this what you expected? c. Calculate the mean and standard deviation of waiting and serving times. d. Use the results of (a) and (c) to calculate the approximate probability that a customer will wait less than 5 minutes to be seated. e. Use the results of (a) and (c) to calculate the approximate probability that a customer will wait more than 10 minutes to be seated. f. Use the results of (b) and (c) to calculate the approximate probability that the serving time for a customer will be less than 1 hour. g. Use the results of (b) and (c) to calculate the approximate probability that the serving time for a customer will be more than 90 minutes. B.45 Data from the Bureau of Infrastructure, Transport and Regional Economics (BITRE) () shows that in Australia during 2015, the number of motorcyclist deaths was 6.47 per 100 million vehicle kilometres travelled (VKT), while for car occupants it was 0.35 per 100 million VKT. A local council estimates that within the council boundaries there are annually 300 million VKT for cars and 5 million VKT for motorcycles. Assume that the fatality rates have not changed and that the Poisson distribution can be used to model the number of deaths. a. For motorcyclists in the local council area, calculate the following probabilities that in the next 12 months: i. there are no deaths ii. there is at least one death iii. there is exactly one death iv. there are no more than two deaths b. For car occupants in the local council area, calculate the following probabilities that in the next 12 months: i. there are no deaths ii. there is at least one death iii. there is exactly one death iv. there are no more than two deaths

B.46 In 2015, 16.4% of Australians aged 45 to 54 years reported a disability compared to 8.2% aged 15 to 24 years (data obtained from Australian Bureau of Statistics, Disability, Ageing and Carers, Australia: Summary of Findings, 2015, Cat. No. 4430.0). Suppose 15 Australians in each age group are randomly selected.
a. For each age group of those selected, calculate the probability that: i. none reports a disability; ii. at least one reports a disability; iii. exactly five report a disability; iv. a majority report a disability.
b. Repeat (i) to (iv) for the 90 years and over age group, of whom 85.4% report a disability.

B.47 A telemarketing firm phones households at random. Data show that 80% of such calls are answered.
a. If 100 households are called each evening, approximate the probability that: i. more than 50% of the calls are answered; ii. between 70 and 90 (inclusive) calls are answered; iii. fewer than 75 calls are answered.
b. Use Excel to calculate the exact probabilities for part (a).

B.48 Check$mart encourages its customers to use Internet banking. Therefore the bank is concerned with the download time (the number of seconds that passes from first linking to the website until the home page is fully displayed) of its home page. Both the design of a home page and the load on the bank’s web server affect the download time. Past data indicate that download times are approximately normal with a mean of 0.9 seconds and a standard deviation of 0.3 seconds. What is the probability that a download time is:
a. less than 1 second?
b. more than 0.5 seconds?
c. between 0.5 and 1.5 seconds?
d. more than 2 seconds?
e. less than 0.6 seconds?
f. between 1.0 seconds and 1.5 seconds?

B.49 Past records show that on average there are four unplanned outages a year to Check$mart’s Internet banking system and that these unplanned outages occur randomly and are independent of each other. An unplanned outage has just occurred.
a. What is the probability that there will: i. not be an unplanned outage in the next month? ii. not be an unplanned outage in the next three months? iii. be at least one unplanned outage in the next six months?
b. What is the mean time between unplanned outages?
c. What is the probability that there will: i. be exactly three unplanned outages in the next year? ii. be more than six unplanned outages in the next six months? iii. be fewer than two unplanned outages in the next month?





B.50 In the Household Expenditure Statistics: Year Ended 30 June 2016 (Statistics New Zealand, licensed by Statistics New Zealand for re-use under the Creative Commons Attribution 3.0 New Zealand licence), 64% of New Zealand households reported that their income was enough or more than enough to meet their everyday needs. However, of the 20% of households with an annual income of less than $35,700, 48% reported that their income was enough or more than enough for their everyday needs, while of the 20% of households with an annual income of at least $136,600, 87% reported that their income was enough or more than enough to meet their everyday needs.
a. What proportion of households reporting that their income is not enough for their everyday needs have an annual income of at least $136,600?
b. What proportion of households who report their income is enough for their everyday needs have incomes of less than $35,700?
c. What is the probability that a household has an annual income of less than $35,700 and reports that this is enough for their everyday needs?
d. What proportion of households with an annual income of at least $35,700 report that their income is enough?
e. What proportion of households with an annual income of less than $136,600 report that their income is not enough?
f. What is the probability that a household has an annual income of at least $136,600 and reports that this is not enough for their everyday needs?

B.51 In problem 6.12 on page 229, it was assumed that the number of All Ordinaries shares traded daily on the Australian Securities Exchange (ASX) is a normal random variable.
a. To test this assumption, use the All Ordinaries daily volume of trade for the 2016–17 financial year to: i. construct a stem-and-leaf display, histogram, polygon and/or box-and-whisker plot; ii. evaluate the actual versus theoretical properties; iii. construct a normal probability plot.
b. Discuss the results in (a). Is the number of All Ordinaries shares traded daily approximately normal?

B.52 According to Burton G. Malkiel, the daily changes in the closing price of shares follow a random walk – that is, these daily events are independent of each other and move upwards or downwards in a random manner – and can be approximated by a normal distribution. To test this theory, use either a newspaper or the Internet to select three companies traded on the ASX or other stock exchange, and then do the following:
1. Obtain the daily closing share price of each company for six consecutive weeks (so that you have 30 values per company).
2. Obtain the daily changes in the closing share price of each company for six consecutive weeks (so that you have 30 values per company).

a. For each of your six data sets, decide whether the data are approximately normally distributed by: i. examining the stem-and-leaf display, histogram or polygon and the box-and-whisker plot; ii. evaluating the actual versus theoretical properties; iii. constructing a normal probability plot.
b. Discuss the results in (a). What can you now say about your three shares with respect to daily closing prices and daily changes in closing prices?
c. Which, if any, of the data sets are approximately normally distributed?
Note: The random-walk theory pertains to the daily changes in the closing share price, not the daily closing share price.

B.53 A motoring organisation has conducted a survey of owners of new cars manufactured in 2017. It has listed the average number of problems per car as 1.27 for brand H. Let the random variable X be equal to the number of problems with a newly purchased brand H car.
a. What assumptions must be made in order for X to be distributed as a Poisson random variable? Are these assumptions reasonable?
b. Making the assumptions as in (a), if you purchased a 2017 brand H car, what is the probability that the new car will have: i. zero problems? ii. two or fewer problems?
c. Give an operational definition for ‘problem’. Why is the operational definition important in interpreting the results of the survey?

B.54 Assume that in 2018 the manufacturers of brand H improve their performance, with owners of 2018 brand H cars reporting 1.04 problems per car.
a. If you purchased a 2018 brand H car, what is the probability that the new car will have: i. zero problems? ii. two or fewer problems?
b. Compare your answers in part (a) with those for the 2017 brand H car in problem B.53 part (b).

B.55 Jay has had three incidents in the past 10 years where an insurance excess needed to be paid. These were a collision with a kangaroo, hail damage and a collision from behind while stationary. In this last instance the excess was refunded as the other driver was at fault. Furthermore, Jay estimates that he drives 300 days a year. Jay recently booked a rental car online for a 27-day holiday in New Zealand. During the booking process he was offered a policy at a price of $18.40 per day, to reduce the insurance excess of $2,000 to $0. However, Jay chose not to accept this offer.
a. Estimate the probability, per day of driving, that Jay will have to pay an insurance excess, even if it is refunded later because the other driver is at fault.
b. Assume that the number of days during the holiday that require insurance excess to be paid can be modelled by the binomial distribution.



i. Calculate the probability that for the 27-day holiday there are no days requiring insurance excess to be paid.
ii. Calculate the probability that for the 27-day holiday there is exactly one day requiring insurance excess to be paid.
iii. Calculate the probability that for the 27-day holiday there are exactly two days requiring insurance excess to be paid.
iv. Calculate the probability that for the 27-day holiday there are at least three days requiring insurance excess to be paid.
v. Calculate the expected payout on this policy.
c. Assume that the number of instances during the holiday in which insurance excess is required to be paid can be modelled by the Poisson distribution.
i. Calculate the probability that for the 27-day holiday there are no instances in which insurance excess is required to be paid.
ii. Calculate the probability that for the 27-day holiday there is exactly one instance in which insurance excess is required to be paid.
iii. Calculate the probability that for the 27-day holiday there are exactly two instances in which insurance excess is required to be paid.
iv. Calculate the probability that for the 27-day holiday there are at least three instances in which insurance excess is required to be paid.
v. Calculate the expected payout on this policy.
d. Did Jay make the correct decision?
e. Calculate the probability per day of driving that Jay will have to pay an insurance excess for the policy to break even.

B.56 Sam and Jo recently lost their house to fire. Although they were insured, the insurance company has offered them 30% less than the rebuild amount for which they were insured. The amount for which they were insured was the amount specified by the insurance company and is consistent with the rebuild amount given by the insurance company’s online calculator. Therefore, Sam and Jo are not accepting the insurance company’s statement that they were over-insured. Do Sam and Jo have a case to ask for a higher amount to rebuild their house?
a. The online calculator states that ‘in approximately 80% of cases the building estimate delivers an accuracy of +/– 10%’. Assuming that the difference between the estimated rebuild cost given by the calculator and the actual rebuild cost is normal with a mean of zero, estimate the standard deviation.
b. Using the results of part (a), calculate the probability of an actual rebuild cost of at most 30% less than the estimated rebuild cost given by the online calculator.
c. Comment on the insurance company’s claim that Sam and Jo were over-insured. Do you consider that they are justified in asking for a higher rebuild amount?

B.57 Australia is known as a nation of sports lovers, but cultural events and venues are not all well supported. A survey by the Australian Bureau of Statistics found that in 2013–14 the attendance rates for Australians aged 15 years and over at selected cultural events and venues were as follows: cinemas 66.3%, zoological parks and aquariums 33.9%, botanic gardens 37.2% and libraries 34.0%. It also found that only 14.8% of Australians had attended an opera or musical in the previous 12 months (Australian Bureau of Statistics, Attendance at Selected Cultural Events and Venues, Australia, 2013–14, Cat. No. 4114.0).
a. If the percentages reported by the ABS are used in decimal form as probabilities, are they best classified as a priori classical probabilities, empirical classical probabilities or subjective probabilities?
b. Suppose that 10 Australians aged 15 years and over are randomly sampled. Consider the random variable defined by the number of people who have attended a musical or opera in the past year. What assumptions must be made so that this random variable is distributed as a binomial random variable?
c. Assuming that the number of people who have attended a musical or opera in the past year is a binomial random variable, what are the mean and standard deviation of the distribution in (b)?

B.58 Refer to problem B.57. Calculate the probability that, of the 10 people sampled, the number who have attended a musical or opera in the past year is:
a. exactly none
b. all 10
c. more than half
d. eight or more

B.59 Refer to problem B.57.
a. For cinemas, using the given probability of attendance of 0.663, calculate the probability that, of the 10 people sampled, the number who have attended a cinema in the past year is: i. exactly none; ii. all 10; iii. more than half; iv. eight or more.
b. Compare the results in (a) with those of problem B.58 (a) to (d).

B.60 The manager of a seafood restaurant was interested in studying ordering patterns of patrons for the Friday-to-Sunday weekend time period. Records were maintained that indicated the demand for dessert during the same period. The manager decided to study two other variables together with whether a dessert was ordered: the gender of the individual and whether a shellfish entrée was ordered. The results are as follows:

                        Gender
Dessert ordered     Male   Female   Total
Yes                   82       32     114
No                   278      208     486
Total                360      240     600





                   Shellfish entrée
Dessert ordered      Yes      No   Total
Yes                   52      62     114
No                   106     380     486
Total                158     442     600

a. A waiter approaches a table to take an order. What is the probability that the first customer to order at the table: i. orders a dessert? ii. orders a dessert or a shellfish entrée? iii. is a female and does not order a dessert? iv. is a female or does not order a dessert?
b. Suppose the first person that the waiter asks for a dessert order is a female. What is the probability that she does not order dessert?
c. Are gender and ordering dessert statistically independent?
d. Is ordering a shellfish entrée statistically independent of whether the person orders dessert?

B.61 The council for a regional city constructed a levee to protect the central business district and surrounding suburbs from flooding in up to a 1-in-10-year flood. This levee was finished 12 years ago, and has just been breached for the first time, having held during three previous floods. Assume that the number of floods that breach the levee can be modelled by a Poisson distribution.
a. What is the probability that the levee is: i. not breached in 12 years? ii. breached in the next 5 years? iii. not breached in the next 20 years? iv. breached again within 2 years? v. not breached within 10 years?
b. Suppose the council decides to increase the height of the levee, so that the new levee will protect the central business district and surrounding suburbs from flooding in up to a 1-in-20-year flood. What is the probability that the new levee is: i. not breached in 12 years? ii. breached within 5 years of completion? iii. not breached in 20 years? iv. breached within 2 years of completion? v. not breached within 10 years of completion?


PART 3

Drawing conclusions about populations based only on sample information

Real People, Real Stats
Rod Battye
TOURISM RESEARCH AUSTRALIA

Which company are you currently working for and what are some of your responsibilities?
Tourism Research Australia (TRA), which is currently a business unit within the Australian Trade Commission (AUSTRADE). My main responsibilities are to manage:
• the International (IVS) and National (NVS) Visitor Surveys
• the service-level agreement with funding partners
• TRA’s interactive websites
• TRA software and databases
• data requests and statistics in general
• staff and individual and team development.

List five words that best describe your personality.
Patient, precise, conscientious, creative, relaxed.

What are some things that motivate you?
Working in a team, building relationships, creating new ways to communicate messages, doing new things, getting it right and being relevant. Promoting a happy work environment.

When did you first become interested in statistics?
Started when I was writing programs to extract data in an area that handled statistics. I was always good at maths and it came naturally from there. There are many disciplines in statistics that apply to programming as well.


A quick Q&A
Complete the following sentence. A world without statistics …
… is uninformed and lacking the information required for good planning and decision making.

LET’S TALK STATS

What do you enjoy most about working in statistics?
Not all statistics are enjoyable to work with; the subject matter is very important. I work with tourism-related information, which covers both domestic and international topics. I am especially keen on the international topics. I enjoy working with people across the many facets of survey work I do, from collection in the field to publication and reporting. Tourism has a lot of positive and forward-looking values; it cuts across nations, genders, age, technology and much more (variety), which makes it interesting.

Describe your first statistics-related job or work experience. Was this a positive or a negative experience?
As mentioned earlier, I was always good at maths and statistics and got involved when I was writing software programs to extract data on migration and other topics. This experience increased my interest in information, wanting to know more, put the pieces together and tell a story.

What do you feel is the most common misconception about your work held by students who are studying statistics? Please explain.
The biggest misconceptions are that it’s easy and a lot of people don’t realise there is a need to do a bit of an apprenticeship. There are many different streams of statistics and a vast array of survey methodologies. It takes some time to gain the knowledge and experience to be competent at your job.

Do you need to be good at maths to understand and use statistics successfully?
Overall the answer is yes! I’ve seen some horror stories when people have ended up in the wrong role and they are not numbers savvy. Having said that, the direction we are moving in with the way we report statistics in a simpler way, using more visual and interactive formats/technologies, is removing some of the mystery.

Is there a high demand for statisticians in your industry (or in other industries)? Please explain.
There is a demand, particularly for younger people. There seems to have been a drop-off in younger people coming through. There is a tendency for people to focus on policy or marketing and other avenues, as the trip to the top is considered to happen more quickly. What we really need more of is statistics and research in the one package. What I mean by that is someone who can reveal the numbers, interpret and write the story/convey the message.

DRAWING CONCLUSIONS ABOUT POPULATIONS BASED ONLY ON SAMPLE INFORMATION

What are some variables for which data have been collected in your field?
There are so many to list in terms of international visitors to Australia: where they went; what activities they undertook; their satisfaction levels with cost, food, language services, accommodation etc.; their likelihood to recommend Australia as a holiday destination; expenditure; places and attractions; tours; demographics; where they come from; and why they are here, just to name a few.

Why is sampling an important part of your work? What are some common sampling techniques that you employed in the past?
Sampling techniques are vital to what I do as they are a cost-effective means of obtaining good results for a fraction of the cost of conducting a census. The surveys I work on are ongoing measurement surveys vital to government and the broader tourism industry. We mostly use stratified random sampling techniques. We have excellent data that we use for our sampling and benchmarking processes. Our domestic survey uses computer-assisted telephone interviewing (CATI) via random digit dialling of household phone numbers, using the last birthday method of selection. Samples are stratified using telephone prefixes and the estimated resident population of Australia by capital city/rest of state. The international survey uses computer-assisted personal interviewing (CAPI) at international departure lounges in airports throughout Australia. Immigration data using flight details, airport, country of residence and gender are used to stratify samples. Interviews are chosen at random. All TRA surveys use screening questions to check for in-scope/targeted respondents.

What are some statistical methods you have used that have assisted in solving a problem?
In our surveys we have a small number of records that end up with high weights. These weights can have a detrimental effect on results at lower levels. We use a ‘trimming’ technique to reduce these influences by trimming the weights to five standard errors from the mean. The weights are then redistributed using a raking method.

Have you ever conducted a hypothesis test where the outcome was not what you expected? If so, what did you do?
Yes, we conduct these types of tests on a regular basis when we consider adding new topics to the surveys. We conduct testing before implementation and have some expectation as to what the result would be. Occasionally we get an outcome well removed from what was expected. In this case we consult other data/information and source industry players. We then review and update certain details, re-test, then implement. We need to be sure what we are doing does not influence results (skew them).

What are some challenges that you have faced in using statistics to provide information about a population of interest? How have you overcome these challenges?
We have had difficulty reporting travel by domestic visitors due to the growth in mobile-phone-only households (under-coverage of the population); our CATI collection has been conducted via random selection of landline telephone numbers. This has long been the accepted way of surveying the community in a cost-effective manner. However, because the growth of mobile-only households was taken up at disproportionate rates across the age groups, there had been a shift in the characteristics of travellers and an under-representation of the younger age groups. We conducted an extensive review of methodologies and the result was that CATI was still the best method of collection for our survey (a large tracking survey with complex definitions). In recent years we had looked into phoning mobiles, but this was too expensive due to the large number of invalid numbers (SIM cards that were no longer in use, SIM cards in shops, etc.). Advancements in technology that reduced the cost issue (being able to ping and identify invalid mobile numbers) had recently appeared. With this we decided to push ahead with the introduction of a dual-frame overlap survey. This type of collection is cutting-edge, and nowhere in the world is there a survey of this size (120,000 sample) that measures visitation via a dual-frame survey. Whereas before the survey sampled and weighted to the estimated resident population of Australia, we now have three distinct populations: mobile only, landline only, and mobile and landline. The new approach is only in its early stages but all is looking very good; the weights are now distributed as they should be and we have successfully implemented all other facets of the sampling, collection, processing and weighting.


CHAPTER 8
Confidence interval estimation

AUDITING SALES INVOICES AT CALLISTEMON CAMPING SUPPLIES

Callistemon Camping Supplies Pty Ltd has several outlets that sell outdoor clothing, backpacks, tents and other camping equipment. As the company’s accountant, you are responsible for the accuracy of the integrated inventory management and sales information system. You could review the contents of every record to check the accuracy of this system, but such a detailed review would be time-consuming and costly. A better approach is to use statistical inference techniques to draw conclusions about the population of all records from a relatively small sample collected during an audit. At the end of each month, you can select a sample of the sales invoices to determine the following:
■ the mean dollar amount listed on the sales invoices for the month
■ the total dollar amount listed on the sales invoices for the month
■ any differences between the dollar amounts on the sales invoices and the amounts entered into the sales information system
■ the frequency of occurrence of various types of errors that violate the internal control policy of the distribution sites. These errors include making a shipment when there is no authorised delivery docket, failure to include the correct account number and shipment of the incorrect part.

How accurate are the results from the samples and how do you use this information? Are the sample sizes large enough to give you the information you need?



LEARNING OBJECTIVES

After studying this chapter you should be able to:
1 construct and interpret confidence interval estimates for the mean
2 construct and interpret confidence interval estimates for the proportion
3 determine the sample size necessary to develop a confidence interval for the mean or proportion
4 recognise how to use confidence interval estimates in auditing

point estimate  A single value calculated from a sample which is used to estimate an unknown population parameter.
confidence interval estimate  A range of numbers constructed about the point estimate.

Statistical inference is the process of using sample results to draw conclusions about the characteristics of a population. Inferential statistics enables you to estimate unknown population characteristics such as a population mean or a population proportion. Two types of estimates are used to estimate population parameters: point estimates and interval estimates. A point estimate is the value of a single sample statistic. A confidence interval estimate is a range of numbers, called an interval, constructed around the point estimate. The process used to construct confidence intervals tells us that the population parameter is located somewhere within the interval in a known percentage of the intervals that could be constructed from different samples.

Suppose that you would like to estimate the mean number of hours of paid work undertaken per week during term time by students in your university. The mean hours of paid work for all the students is an unknown population mean, denoted by μ. You select a sample of students and find that the sample mean is 14.8. The sample mean, X̄, is a point estimate of the population mean μ. How accurate is 14.8? To answer this question you must construct a confidence interval estimate. In this chapter you will learn how to construct and interpret confidence interval estimates.

Recall that the sample mean, X̄, is a point estimate of the population mean μ. However, the sample mean will vary from sample to sample because it depends on the items selected in the sample. By taking into account the known variability from sample to sample (see Section 7.2 on the sampling distribution of the mean), you will learn how to develop the interval estimate for the population mean. The interval constructed will have a specified confidence of correctly estimating the value of the population parameter μ. In other words, there is a specified confidence that μ is somewhere in the range of numbers defined by the interval. Suppose that after studying this chapter you find that a 95% confidence interval for the mean number of hours students at your university are employed in paid work per week is (14.75 ⩽ μ ⩽ 14.85). You can interpret this interval estimate by stating that you are 95% confident that the mean number of hours per week of paid work undertaken by students at your university is between 14.75 and 14.85. However, there is still a possibility that the mean number of hours is below 14.75 or above 14.85.

After learning about the confidence interval for the mean, we look at how to develop an interval estimate for the population proportion. Then we consider how large a sample to select when constructing confidence intervals, and how to perform several important estimation procedures that accountants use when performing audits.

8.1  CONFIDENCE INTERVAL ESTIMATION FOR THE MEAN (σ KNOWN)

In Section 7.2 we used the Central Limit Theorem and knowledge of the population distribution to determine the percentage of sample means that fall within certain distances of the population mean. For instance, in the shampoo-bottling example used throughout Chapter 7, 95% of all sample means are between 494.12 and 505.88 mL. This statement is based on deductive reasoning. However, inductive reasoning is what we need here. We need inductive reasoning because, in statistical inference, you use the results of a single sample to draw conclusions about the population, not vice versa. Suppose that in the shampoo-bottling example you wish to estimate the unknown population mean using the information from only a sample. Thus, rather than take μ ± (1.96)(σ/√n) to find the upper and lower limits around μ, as in Section 7.2, you substitute the sample mean, X̄, for the unknown μ and use X̄ ± (1.96)(σ/√n) as an interval to estimate the unknown μ. Although in practice you select a single sample of size n and calculate the mean X̄, in order to understand the full meaning of the interval estimate you need to examine a hypothetical set of all possible samples of n values.

Deductive reasoning  Reasoning that starts with a hypothesis and examines possibilities to move to a specific conclusion.
Inductive reasoning  Reasoning that uses specific observations to make a general conclusion.

Figure 8.1 shows the actual population distribution of shampoo bottle contents at the top with a mean value of 500 and five confidence intervals for the population mean based on five different sample means. Suppose that a sample of n = 25 bottles has a mean of 496.2 mL. The interval developed to estimate μ is 496.2 ± (1.96)(15)/(√25) or 496.2 ± 5.88. The interval estimate of μ is:

490.32 ⩽ μ ⩽ 502.08

Because the population mean μ (equal to 500) is included within the interval, this sample has led to a correct statement about μ (see Figure 8.1).

Figure 8.1  Confidence interval estimates for five different samples of n = 25 taken from a population where μ = 500 and σ = 15 (sample means X̄1 = 496.2, X̄2 = 501.6, X̄3 = 493.0, X̄4 = 494.12 and X̄5 = 505.88)

To continue this hypothetical example, suppose that for a different sample of n = 25 bottles the mean is 501.6. The interval developed from this sample is 501.6 ± (1.96)(15)/(√25) or 501.6 ± 5.88. The estimate is:

495.72 ⩽ μ ⩽ 507.48

Because the population mean μ (equal to 500) is also included within this interval, this statement about μ is correct. Now, before you begin to think that correct statements about μ are always made by developing a confidence interval estimate, suppose a third hypothetical sample of n = 25 bottles is selected and the sample mean is equal to 493 mL. The interval developed here is 493 ± (1.96)(15)/(√25) or 493 ± 5.88. In this case, the interval estimate of μ is:

487.12 ⩽ μ ⩽ 498.88



This estimate is not a correct statement, because the population mean μ is not included in the interval developed from this sample (see Figure 8.1). Thus, for some samples the interval estimate of μ is correct but for others it is incorrect. In practice, only one sample is selected and, because the population mean is unknown, you cannot determine whether the interval estimate is correct.

To resolve this dilemma of sometimes having an interval that provides a correct estimate and sometimes having an interval that provides an incorrect estimate, you need to determine the proportion of samples producing intervals that result in correct statements about the population mean μ. To do this, consider two other hypothetical samples: the case in which X̄ = 494.12 mL and the case in which X̄ = 505.88 mL. If X̄ = 494.12, the interval is 494.12 ± (1.96)(15)/(√25) or 494.12 ± 5.88. This leads to the following interval:

488.24 ⩽ μ ⩽ 500.00

Because the population mean of 500 is at the upper limit of the interval, the statement is correct (see Figure 8.1).

When X̄ = 505.88, the interval is 505.88 ± (1.96)(15)/(√25) or 505.88 ± 5.88. The interval for the sample mean is:

500.00 ⩽ μ ⩽ 511.76

In this case, because the population mean of 500 is included at the lower limit of the interval, the statement is correct. Figure 8.1 shows that when the sample mean falls anywhere between 494.12 and 505.88 mL, the population mean is included somewhere within the interval. In Section 7.2 we found that 95% of the sample means fall between 494.12 and 505.88 mL. Therefore, 95% of all samples of n = 25 bottles have sample means that include the population mean within the interval developed. The interval from 494.12 to 505.88 is referred to as a 95% confidence interval. Because, in practice, you select only one sample and μ is unknown, you never know for sure whether the specific interval includes the population mean or not. However, if you take all possible samples of n and calculate their sample means, 95% of the intervals will include the population mean and only 5% of them will not. In other words, there is 95% confidence that the population mean is somewhere in the interval. Thus, we can interpret the confidence interval above as follows:

I am 95% confident that the mean amount of shampoo in the population of bottles is somewhere between 494.12 and 505.88 mL.

LEARNING OBJECTIVE 1
Construct and interpret confidence intervals for the mean

level of confidence  Represents the percentage of intervals, based on all samples of a certain size, which would contain the population parameter.

In some situations, you might want a higher degree of confidence (such as 99%) of including the population mean within the interval. In other cases, you might accept less confidence (such as 90%) of correctly estimating the population mean. In general, the level of confidence is symbolised by (1 − α) × 100%, where α is the area in the tails of the distribution that is outside the confidence interval. The area in the upper tail of the distribution is α/2, and the area in the lower tail of the distribution is α/2. We can use Equation 8.1 to construct a (1 − α) × 100% confidence interval estimate of the mean with σ known.

CONFIDENCE INTERVAL FOR A MEAN (σ KNOWN)

X̄ ± Z σ/√n

or

X̄ − Z σ/√n ⩽ μ ⩽ X̄ + Z σ/√n          (8.1)

where Z = the value corresponding to a cumulative area of 1 − α/2 from the standardised normal distribution – that is, an upper-tail probability of α/2.
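Equation 8.1 is also easy to check in software. The short sketch below is illustrative only (the text itself carries out these calculations in Microsoft Excel); it assumes Python with the SciPy library installed, and the function name is ours rather than anything defined in the text. It reproduces the intervals for the three hypothetical shampoo samples discussed above.

from math import sqrt
from scipy import stats

def ci_mean_sigma_known(xbar, sigma, n, confidence=0.95):
    # Equation 8.1: X-bar +/- Z * sigma / sqrt(n), with sigma known
    z = stats.norm.ppf(1 - (1 - confidence) / 2)   # 1.96 for 95% confidence
    half_width = z * sigma / sqrt(n)
    return xbar - half_width, xbar + half_width

# Three hypothetical samples of n = 25 shampoo bottles (sigma = 15 mL)
for xbar in (496.2, 501.6, 493.0):
    low, high = ci_mean_sigma_known(xbar, sigma=15, n=25)
    print(f"x-bar = {xbar}: {low:.2f} <= mu <= {high:.2f}")
# Output matches the text: 490.32-502.08, 495.72-507.48 and 487.12-498.88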




The value of Z needed for constructing a confidence interval is called the critical value for the distribution. For a 95% confidence interval the value of α is 0.05. The critical Z value corresponding to a cumulative area of 0.9750 is 1.96 because there is 0.025 in the upper tail of the distribution and the cumulative area less than Z = 1.96 is 0.975. There is a different critical value for each level of confidence 1 - α. A level of confidence of 95% leads to a Z value of 1.96 (see Figure 8.2). For a 99% level of confidence, α is 0.01. The Z value is approximately 2.58 because the upper-tail area is 0.005 and the cumulative area less than Z = 2.58 is 0.995 (see Figure 8.3).

critical value  The value in a distribution that cuts off the required probability in the tail for a given confidence level.

Figure 8.2  Normal curve for determining the Z value needed for 95% confidence (an area of 0.025 in each tail, Z = ±1.96)
Figure 8.3  Normal curve for determining the Z value needed for 99% confidence (an area of 0.005 in each tail, Z = ±2.58)
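The critical values in Figures 8.2 and 8.3, and the repeated-sampling interpretation illustrated in Figure 8.1, can also be checked by simulation. The sketch below is a Python illustration under the same assumptions as the shampoo example (μ = 500, σ = 15, n = 25); the random seed and the number of trials are arbitrary choices of ours, not values from the text.

import numpy as np
from scipy import stats

# Critical Z values for common confidence levels (compare Figures 8.2 and 8.3)
for confidence in (0.90, 0.95, 0.99):
    z = stats.norm.ppf(1 - (1 - confidence) / 2)   # upper-tail area of alpha/2
    print(f"{confidence:.0%} confidence: Z = {z:.3f}")

# Repeated sampling: count how often X-bar +/- 1.96*sigma/sqrt(n) captures mu
rng = np.random.default_rng(seed=1)
mu, sigma, n, trials = 500, 15, 25, 10_000
half_width = stats.norm.ppf(0.975) * sigma / np.sqrt(n)        # about 5.88 mL
xbars = rng.normal(mu, sigma, size=(trials, n)).mean(axis=1)
covered = (xbars - half_width <= mu) & (mu <= xbars + half_width)
print(f"Proportion of intervals containing mu: {covered.mean():.3f}")   # close to 0.95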

Now that various levels of confidence have been considered, why not make the confidence level as close to 100% as possible? Before doing so, you need to realise that any increase in the level of confidence is achieved only by widening (and making less precise) the confidence interval. You would have more confidence that the population mean is within a broader range of values. However, this might make the interpretation of the confidence interval less useful. The trade-off between the width of the confidence interval and the level of confidence is discussed in greater depth in the context of determining the sample size in Section 8.4. Example 8.1 illustrates the application of the confidence interval estimate.

EXAMPLE 8.1  ESTIMATING THE MEAN SALMON WEIGHT WITH 95% CONFIDENCE

Atlantic salmon farming is an important industry in Tasmania. Fish are grown to market size in a series of large, circular, netted enclosures in areas such as the Huon River, Port Esperance, the D’Entrecasteaux Channel and around the Tasman Peninsula. When salmon are harvested to send to market they need to weigh 3.5–4 kg, so the farmer is aiming to have an average weight of 3.75 kg. We will assume that all salmon are placed in their final growing enclosure at the same time and spend 12 months there, and that the standard deviation of their weights after that time is 380 g. A farmer wishes to check whether the average weight of salmon in the enclosure falls in the required range. He weighs a sample of 50 salmon being sent to market and finds their average weight is 3,607 g. Construct a 95% confidence interval estimate for the population mean salmon weight.

SOLUTION

Using Equation 8.1 with Z = 1.96 for 95% confidence:

X̄ ± Z σ/√n = 3,607 ± (1.96)(380)/√50
            = 3,607 ± 105.33
3,501.67 ⩽ μ ⩽ 3,712.33


Thus, with 95% confidence you can conclude that the mean weight of salmon in the enclosure is between 3,501.67 g and 3,712.33 g. This would indicate that the average weight of fish in the enclosure is below the average of 3,750 g desirable for market-ready fish. We would expect that many fish in the enclosure still need to grow larger before being harvested.

To see the effect of using a 99% confidence interval, examine Example 8.2.

EXAMPLE 8.2  ESTIMATING THE MEAN SALMON WEIGHT WITH 99% CONFIDENCE

Construct a 99% confidence interval for the population mean salmon weight.

SOLUTION

Using Equation 8.1 with Z = 2.58 for 99% confidence:

X̄ ± Z σ/√n = 3,607 ± (2.58)(380)/√50
            = 3,607 ± 138.65
3,468.35 ⩽ μ ⩽ 3,745.65

The interval still does not contain the desired mean weight of 3.75 kg, so the fish will need to grow larger.
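The two salmon intervals can be reproduced with a few lines of code. This is a sketch only, assuming Python with SciPy (the text uses Excel); note that software uses Z = 2.5758 rather than the rounded 2.58, so the 99% limits differ from Example 8.2 by a fraction of a gram.

from math import sqrt
from scipy import stats

xbar, sigma, n = 3607, 380, 50     # sample mean (g), assumed known sigma (g), sample size
for confidence in (0.95, 0.99):
    z = stats.norm.ppf(1 - (1 - confidence) / 2)
    half_width = z * sigma / sqrt(n)
    print(f"{confidence:.0%}: {xbar - half_width:.2f} <= mu <= {xbar + half_width:.2f}")
# 95%: 3501.67 <= mu <= 3712.33 (as in Example 8.1)
# 99%: about 3468.57 <= mu <= 3745.43 (Example 8.2 gets 3468.35-3745.65 using Z = 2.58)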

Problems for Section 8.1

LEARNING THE BASICS
8.1 If X̄ = 85, σ = 8 and n = 64, construct a 95% confidence interval estimate of the population mean μ.
8.2 If X̄ = 125, σ = 24 and n = 36, construct a 99% confidence interval estimate of the population mean μ.
8.3 A market researcher states that she has 95% confidence that the mean monthly sales of a product are between $170,000 and $200,000. Explain the meaning of this statement.
8.4 Why is it not possible in Example 8.1 to have 100% confidence? Explain.
8.5 From the results of Example 8.1 regarding salmon farming, is it true that 95% of the sample means will fall between 3,501.67 g and 3,712.33 g? Explain.
8.6 Is it true in Example 8.1 that you do not know for sure whether the population mean is between 3,501.67 g and 3,712.33 g? Explain.

APPLYING THE CONCEPTS
8.7 The manager of a paint supply store wants to estimate the actual amount of paint contained in 4-litre cans purchased from a nationally known manufacturer. It is known from the manufacturer’s specifications that the standard deviation of the amount of paint is equal to 0.08 litres. A random sample of 50 cans is selected, and the sample mean amount of paint per 4-litre can is 3.98 litres.
a. Construct a 99% confidence interval estimate of the population mean amount of paint included in a 4-litre can.
b. On the basis of your results, do you think that the manager has a right to complain to the manufacturer? Why?
c. Must you assume that the population amount of paint per can is normally distributed here? Explain.
d. Construct a 95% confidence interval estimate. How does this change your answer to (b)?
8.8 The quality control manager at a light globe factory needs to estimate the mean life of a large shipment of energy-saving light-emitting diode (LED) light globes. The standard deviation is 3,000 hours. A random sample of 64 light globes indicates a sample mean life of 34,000 hours.
a. Construct a 95% confidence interval estimate of the population mean life of light globes in this shipment.
b. Do you think that the manufacturer has the right to state that the light globes last an average of 35,000 hours? Explain.
c. Must you assume that the population of light globe life is normally distributed? Explain.
d. Suppose that the standard deviation changes to 6,000 hours. What are your answers in (a) and (b)?





8.9 The inspection division of a state department that regulates trade measurement wants to estimate the actual amount of soft drink in 2-litre bottles at the local bottling plant of a large, nationally known soft-drink company. The bottling plant has informed the inspection division that the population standard deviation for 2-litre bottles is 0.05 litres. A random sample of 100 2-litre bottles at this bottling plant indicates a sample mean of 1.99 litres.
a. Construct a 95% confidence interval estimate of the population mean amount of soft drink in each bottle.
b. Must you assume that the population of soft-drink fill is normally distributed? Explain.
c. Explain why a value of 2.02 litres for a single bottle is not unusual, even though it is outside the confidence interval you calculated.
d. Suppose that the sample mean had been 1.97 litres. What is your answer to (a)?

8.2  CONFIDENCE INTERVAL ESTIMATION FOR THE MEAN (σ UNKNOWN)

Just as the mean of the population μ is usually unknown, you rarely know the actual standard deviation of the population, σ. Therefore, you need to develop a confidence interval estimate of μ using only the sample statistics X̄ and S.

Student’s t Distribution

At the beginning of the twentieth century a statistician for Guinness Breweries in Ireland (see reference 1), William S. Gosset, wanted to make inferences about the mean when σ was unknown. Because Guinness employees were not permitted to publish research work under their own names, Gosset adopted the pseudonym ‘Student’. The distribution that he developed is known as Student’s t distribution. If the random variable X is normally distributed, then the following statistic has a t distribution with n − 1 degrees of freedom:

t = (X̄ − μ) / (S/√n)

Student’s t distribution  A continuous probability distribution whose shape depends on the number of degrees of freedom.
degrees of freedom  Relate to the number of values in the calculation of a statistic that are free to vary.

This expression has the same form as the Z statistic in Equation 7.4 on page 254, except that S is used to estimate the unknown σ. The concept of degrees of freedom is discussed further on page 286.

Properties of the t Distribution

In appearance, the t distribution is very similar to the standardised normal distribution. Both distributions are bell shaped. However, the t distribution has more area in the tails and less in the centre than the standardised normal distribution (see Figure 8.4). Because the value of σ is unknown and S is used to estimate it, the values of t are more variable than those for Z. The degrees of freedom n − 1 are directly related to the sample size n. As the sample size and degrees of freedom increase, S becomes a better estimate of σ and the t distribution gradually approaches the standardised normal distribution until the two are virtually identical. With a sample size of about 120 or more, S estimates σ precisely enough that there is little difference between the t and Z distributions. For this reason, most statisticians use Z instead of t when the sample size is greater than 120.

Figure 8.4  Standardised normal distribution and t distribution for 5 degrees of freedom
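The convergence described above is easy to see numerically. The following sketch (Python with SciPy, ours rather than anything in the text) prints the upper 0.025 critical value of t for several degrees of freedom next to the corresponding Z value of 1.96.

from scipy import stats

z_crit = stats.norm.ppf(0.975)                 # 1.9600 for 95% confidence
for df in (5, 29, 99, 120, 1000):
    t_crit = stats.t.ppf(0.975, df)            # upper-tail area of 0.025
    print(f"df = {df:4d}: t = {t_crit:.4f}   (Z = {z_crit:.4f})")
# t falls from 2.5706 at 5 degrees of freedom towards 1.96 as the degrees of freedom grow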



As stated earlier, the t distribution assumes that the random variable X is normally distributed. In practice, however, as long as the sample size is large enough and the population is not very skewed, you can use the t distribution to estimate the population mean when σ is unknown. When dealing with a small sample size and a skewed population distribution, the validity of the confidence interval is a concern. To assess the assumption of normality, you can evaluate the shape of the sample data by using a histogram, stem-and-leaf display, box-and-whisker plot or normal probability plot.

You find the critical values of t for the appropriate degrees of freedom from the table of the t distribution (Table E.3). The columns of the table represent the area in the upper tail of the t distribution. Each row represents the particular t value for each specific degree of freedom. For example, with 99 degrees of freedom, if you want 95% confidence you find the appropriate value of t as shown in Table 8.1. The 95% confidence level means that 2.5% of the values (an area of 0.025) are in each tail of the distribution. Looking in the column for an upper-tail area of 0.025 and in the row corresponding to 99 degrees of freedom gives you a critical value for t of 1.9842. Because t is a symmetrical distribution with a mean of 0, if the upper-tail value is +1.9842, the value for the lower-tail area (lower 0.025) is −1.9842. A t value of −1.9842 means that the probability that t is less than −1.9842 is 0.025, or 2.5% (see Figure 8.5).

Table 8.1  Determining the critical value from the t table for an area of 0.025 in each tail with 99 degrees of freedom (extracted from Table E.3 in Appendix E of this book)

                                 Upper-tail areas
Degrees of freedom      .25      .10      .05     .025      .01     .005
  1                  1.0000   3.0777   6.3138  12.7062  31.8207  63.6574
  2                  0.8165   1.8856   2.9200   4.3027   6.9646   9.9248
  3                  0.7649   1.6377   2.3534   3.1824   4.5407   5.8409
  4                  0.7407   1.5332   2.1318   2.7764   3.7469   4.6041
  5                  0.7267   1.4759   2.0150   2.5706   3.3649   4.0322
  .                       .        .        .        .        .        .
 96                  0.6771   1.2904   1.6609   1.9850   2.3658   2.6280
 97                  0.6770   1.2903   1.6607   1.9847   2.3654   2.6275
 98                  0.6770   1.2902   1.6606   1.9845   2.3650   2.6269
 99                  0.6770   1.2902   1.6604   1.9842   2.3646   2.6264
100                  0.6770   1.2901   1.6602   1.9840   2.3642   2.6259

Figure 8.5  t distribution with 99 degrees of freedom (an area of 0.025 in each tail beyond t = ±1.9842, with 1 − α = 0.95 between)

The Concept of Degrees of Freedom

In Chapter 3 we saw that the numerator of the sample variance S² (see Equation 3.9a) requires the calculation of:

∑_{i=1}^{n} (Xᵢ − X̄)²





In order to calculate S², you first need to know X̄. Therefore, only n − 1 of the sample values are free to vary. This means that you have n − 1 degrees of freedom. For example, suppose a sample of five values has a mean of 20. How many values do you need to know before you can determine the remainder of the values? The fact that n = 5 and X̄ = 20 also tells you that:

∑_{i=1}^{n} Xᵢ = 100

because:

X̄ = (∑_{i=1}^{n} Xᵢ) / n

Thus, when you know four of the values, the fifth one will not be free to vary because the sum must add to 100. For example, if four of the values are 18, 24, 19 and 16, the fifth value must be 23 so that the sum equals 100.
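The ‘forced’ fifth value amounts to one line of arithmetic; the tiny Python sketch below is ours, not the book’s, and simply mirrors the example above.

# A sample of five values with a mean of 20 must sum to 5 * 20 = 100
known = [18, 24, 19, 16]          # four values chosen freely
fifth = 5 * 20 - sum(known)       # the remaining value is forced by the mean
print(fifth)                      # 23, so only n - 1 = 4 values were free to vary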

The Confidence Interval Statement

Equation 8.2 defines the (1 − α) × 100% confidence interval estimate for the mean with σ unknown.

LEARNING OBJECTIVE
Construct and interpret confidence intervals for the mean

CONFIDENCE INTERVAL FOR THE MEAN (σ UNKNOWN)

X̄ ± tn−1 S/√n

or

X̄ − tn−1 S/√n ⩽ μ ⩽ X̄ + tn−1 S/√n          (8.2)

where tn−1 is the critical value of the t distribution with n − 1 degrees of freedom for an area of α/2 in the upper tail.

To illustrate the application of the confidence interval estimate for the mean when the standard deviation σ is unknown, return to the Callistemon Camping Supplies scenario presented on page 279. You select a sample of 100 sales invoices from the population of sales invoices during the month and the sample mean of the 100 sales invoices is $230.27, with a sample standard deviation of $52.62. For 95% confidence, the critical value from the t distribution (as shown in Table 8.1) is 1.9842. Using Equation 8.2:

X̄ ± tn−1 S/√n = 230.27 ± (1.9842)(52.62)/√100
              = 230.27 ± 10.44
$219.83 ⩽ μ ⩽ $240.71

A Microsoft Excel worksheet for these data is presented in Figure 8.6 (overleaf). Thus, with 95% confidence, you conclude that the mean amount of all the sales invoices is between $219.83 and $240.71. The 95% confidence level indicates that if you selected all possible samples of 100 (something that is never done in practice), 95% of the intervals developed would include the population mean somewhere within the interval. The validity of this



Figure 8.6  Microsoft Excel 2016 worksheet to calculate a confidence interval estimate for the mean sales invoice amount for Callistemon Camping Supplies

Estimate for the mean sales invoice amount

Data
  Sample standard deviation        52.62
  Sample mean                     230.27
  Sample size                        100
  Confidence level                   95%

Intermediate calculations
  Standard error of the mean       5.262     =B4/SQRT(B6)
  Degrees of freedom                  99     =B6 - 1
  t value                       1.984217     =T.INV.2T(1 - B7, B11)
  Interval half width           10.44095     =B12 * B10

Confidence interval
  Interval lower limit          219.8291     =B5 - B13
  Interval upper limit          240.7109     =B5 + B13

confidence interval estimate depends on the assumption of normality for the distribution of the amount of the sales invoices. With a sample of 100, the normality assumption is not overly restrictive and the use of the t distribution is probably appropriate. Example 8.3 further illustrates how to construct the confidence interval for a mean when the population standard deviation is unknown.
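The worksheet in Figure 8.6 can be mirrored in a few lines of code. The sketch below is an illustration in Python with SciPy (not something the text provides); scipy.stats.t.ppf(0.975, 99) plays the same role as the Excel formula =T.INV.2T(1 - B7, B11) in the worksheet.

from math import sqrt
from scipy import stats

xbar, s, n = 230.27, 52.62, 100                 # sample statistics from the 100 invoices
t_crit = stats.t.ppf(0.975, n - 1)              # 1.9842, as in Table 8.1
half_width = t_crit * s / sqrt(n)               # about 10.44
print(f"{xbar - half_width:.2f} <= mu <= {xbar + half_width:.2f}")   # 219.83 <= mu <= 240.71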

EXAMPLE 8.3  ESTIMATING THE MEAN HEIGHT OF FEMALE ATHLETES AGED 18–25

A manufacturer of women’s tracksuits needs to estimate the average height of female athletes in the 18–25 age group. The measurements of a sample of 30 women are taken and their heights recorded in millimetres. Table 8.2 lists these values. < HEIGHTS > Construct a 95% confidence interval estimate for the population mean height of female athletes in this age group.

Table 8.2  Heights (in millimetres) of female athletes aged 18–25

1,870  1,728  1,656  1,610  1,634  1,784
1,522  1,696  1,592  1,662  1,866  1,764
1,734  1,662  1,734  1,774  1,550  1,756
1,762  1,866  1,820  1,744  1,788  1,688
1,810  1,752  1,680  1,810  1,652  1,736

SOLUTION

Figure 8.7 shows that the sample mean is X̄ = 1,723.4 mm and the sample standard deviation is S = 89.55 mm. Using Equation 8.2 to construct the confidence interval, you need to determine the critical value from the t table for an area of 0.025 in each tail with 29 degrees of freedom. Table E.3 shows that t29 = 2.0452. Thus, using X̄ = 1,723.4, S = 89.55, n = 30 and t29 = 2.0452:

X̄ ± t29 S/√n = 1,723.4 ± (2.0452)(89.55)/√30
             = 1,723.4 ± 33.44
1,689.96 ⩽ μ ⩽ 1,756.84
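The same interval can be produced directly from the 30 raw heights in Table 8.2. The sketch below assumes Python with NumPy and SciPy; the text itself uses PHStat and Excel for this calculation.

import numpy as np
from scipy import stats

heights = np.array([1870, 1728, 1656, 1610, 1634, 1784, 1522, 1696, 1592, 1662,
                    1866, 1764, 1734, 1662, 1734, 1774, 1550, 1756, 1762, 1866,
                    1820, 1744, 1788, 1688, 1810, 1752, 1680, 1810, 1652, 1736])

n = len(heights)
xbar = heights.mean()                      # 1,723.4
s = heights.std(ddof=1)                    # about 89.55 (sample standard deviation)
t_crit = stats.t.ppf(0.975, n - 1)         # 2.0452 for 29 degrees of freedom
half_width = t_crit * s / np.sqrt(n)       # about 33.44
print(f"{xbar - half_width:.2f} <= mu <= {xbar + half_width:.2f}")   # about 1689.96 to 1756.84

# The same interval in one call:
print(stats.t.interval(0.95, n - 1, loc=xbar, scale=stats.sem(heights)))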





Figure 8.7  PHStat confidence interval estimate for the mean height (in millimetres) of female athletes aged 18–25

One sample t: height

Data
  Sample standard deviation     89.55083319
  Sample mean                   1723.4
  Sample size                   30
  Confidence level              95%

Intermediate calculations
  Standard error of the mean    16.34967046
  Degrees of freedom            29
  t value                       2.045229642
  Interval half width           33.43883066

Confidence interval
  Interval lower limit          1689.96
  Interval upper limit          1756.84

You conclude with 95% confidence that the mean height of 18–25-year-old female athletes is between 1,689.96 and 1,756.84 mm. The validity of this confidence interval estimate depends on the assumption that the heights in the population are normally distributed. Remember, however, that you can slightly relax this assumption for large sample sizes. Thus, with a sample of 30, you can use the t distribution even if the distribution of heights is slightly skewed. From the normal probability plot displayed in Figure 8.8, or the boxplot displayed in Figure 8.9, the heights appear only slightly skewed. Thus the t distribution is appropriate for these data.

Figure 8.8  PHStat normal probability plot for the height (in millimetres) of female athletes aged 18–25 (height plotted against Z value)

Figure 8.9  PHStat boxplot for the height (in millimetres) of female athletes aged 18–25



The validity of this confidence interval estimate depends on the assumption that the data are normally distributed. What would happen if there was a small sample and the boxplot and the normal probability plot indicated that the distribution was right-skewed? In this case you would have some concern about the validity of the confidence interval in estimating the population mean. The concern is that a 95% confidence interval based on a small sample from a skewed distribution will contain the population mean less than 95% of the time in repeated sampling. In the case of small sample sizes and skewed distributions, you might consider the sample median as an estimate of central tendency and construct a confidence interval for the population median (see reference 2).

Problems for Section 8.2 LEARNING THE BASICS

8.10 Determine the critical value of t in each of the following circumstances: a. 1 - α = 0.95, n  = 10 b. 1 - α = 0.99, n  = 10 c. 1 - α = 0.95, n  = 32 d. 1 - α = 0.95, n  = 65 e. 1 - α = 0.90, n  = 16 –  8.11 If X =  75, S  =  24, n  =  36, and assuming that the population is normally distributed, construct a 95% confidence interval estimate of the population mean μ. – 8.12 If X =  50, S  =  15, n  =  16, and assuming that the population is normally distributed, construct a 99% confidence interval estimate of the population mean μ. 8.13 Construct a 95% confidence interval estimate for the population mean, based on each of the following sets of data, assuming that the population is normally distributed: Set 1: 1, 1, 1, 1, 8, 8, 8, 8 Set 2: 1, 2, 3, 4, 5, 6, 7, 8 Explain why these data sets have different confidence intervals even though they have the same mean and range. 8.14 Construct a 95% confidence interval for the population mean, based on the numbers 1, 2, 3, 4, 5, 6 and 20. Change the number 20 to 7 and recalculate the confidence interval. Using these results, describe the effect of an outlier (i.e. extreme value) on the confidence interval.

APPLYING THE CONCEPTS
You can solve problems 8.15 to 8.21 with or without Microsoft Excel.

8.15 A stationery store wants to estimate the mean retail value of greeting cards that it has in its inventory. A random sample of 20 greeting cards indicates a mean value of $4.95 and a standard deviation of $0.82.
a. Assuming a normal distribution, construct a 95% confidence interval estimate of the mean value of all greeting cards in the store's inventory.
b. How are the results in (a) useful in assisting the store owner to estimate the total value of his inventory?

8.16 Water resources in many parts of Australia are being closely watched and restrictions or water-wise rules have been imposed on activities such as garden watering. Suppose that Sydney Water monitors water usage in a suburb and finds that for one summer the average household usage is 408 litres per day. A year later it examines records of a sample of 50 households and finds that there is a daily mean usage of 380 litres with a standard deviation of 25 litres.
a. Construct a 95% confidence interval for the population mean daily water usage in the second summer. Assume the population usage is normally distributed.
b. Interpret the interval constructed in (a).
c. Do you think water usage has changed in the second summer? Explain.
8.17 The energy consumption of refrigerators sold in Australia and New Zealand is checked and appliances are given a star rating to guide consumers who are about to make purchases. The consumption in kilowatts per annum is also displayed for each model on the website. Suppose a consumer organisation wants to estimate the actual electricity usage of a model of refrigerator that has an advertised energy usage of 355 kW per annum. It tests a random sample of n = 18 fridges and finds a sample mean usage of 367 and a sample standard deviation of 30.
a. Assuming that the energy usage in the population is normally distributed, construct a 95% confidence interval estimate of the population mean energy usage for this model of refrigerator.
b. Do you think that the consumer organisation should accuse the manufacturer of producing fridges that do not meet the advertised energy consumption? Explain.
c. Explain why an observed energy usage of 350 kW for a particular refrigerator is not unusual, even though it is outside the confidence interval developed in (a).
8.18 The data below represent the annual account fees for cheques made by a bank for a sample of 23 clients with cheque accounts who do not undertake Internet banking. < BANK_COST1 >
26 29 20 20 21 22 25 25 18 25 15 18 20 25 25 22 30 30 30 15 20 29 20

a. Construct a 95% confidence interval for the population mean annual cheque fee.
b. What assumption must you make about the population distribution in (a)?
c. Interpret the interval constructed in (a).
8.19 One of the major measures of the quality of service provided by any organisation is the speed with which it responds to customer complaints. A large family-held department store selling furniture and flooring, including carpeting, has undergone a major expansion in the past several years. In particular, the flooring department has expanded from two installation crews to an installation supervisor, a measurer and 15 installation crews. Last year there were 50 complaints about carpet installation. The data below represent the number of days between the receipt of the complaint and the resolution of the complaint. < FURNITURE >
54 5 35 137 31 27 152 2 123 81 74 27
11 19 126 110 110 29 61 35 94 31 26 5
12 4 165 32 29 28 29 26 25 1 14 13
13 10 5 27 4 52 30 22 36 26 20 23 33 68
a. Construct a 95% confidence interval estimate of the mean number of days between receipt of the complaint and resolution of the complaint.
b. What assumption must you make about the population distribution in (a)?
c. Do you think that the assumption made in (b) is seriously violated? Explain.
d. What effect might your conclusion in (c) have on the validity of the results in (a)?
8.20 The approval process for a life insurance policy requires a review of the application and the applicant's medical history, possible requests for additional medical information and medical examinations, and a policy compilation stage where the policy pages are generated then delivered. The ability to deliver approved policies to customers in a timely manner is critical to the profitability of this service to the insurance company. Over a period of one month, a random sample of 27 approved policies was selected and the total processing time in days recorded. < INSURANCE >
73 19 16 64 28 28 31 90 60 56 31 56 22 18 45 48 17 17 17 91 92 63 50 51 69 16 17
a. Construct a 95% confidence interval estimate of the mean processing time.
b. What assumption must you make about the population distribution in (a)?
c. Do you think that the assumption made in (b) is seriously violated? Use a plot and explain.
8.21 The data below represent the daily rate in Australian dollars for a double room or studio booking on the following Monday night at a sample of hotels, motels and motor lodges in 20 New Zealand cities and towns in July 2017. < MOTEL_2017 >

City/Town          Room cost    City/Town          Room cost
Lake Taupo         138          Hamilton           147
Whitianga          152          Waitomo            118
Auckland           179          Whangarei          137
Paihia             113          Russell            129
Wellington         141          Kerikeri           136
Tauranga           128          Havelock North     156
New Plymouth       149          Thames             121
Hastings           103          Palmerston North   137
Napier             135          Wanganui           114
Gisborne           122          Rotorua            132

Data obtained from accessed 4 July 2017

a. Construct a 95% confidence interval for the population mean lowest room cost.
b. Construct a 99% confidence interval for the population mean lowest room cost.
c. What assumption do you need to make about the population of interest to construct the intervals in (a) and (b)?
d. Given the data presented, do you think the assumption needed in (a) and (b) is valid? Use a plot and explain.

8.3  CONFIDENCE INTERVAL ESTIMATION FOR THE PROPORTION

This section extends the concept of the confidence interval to categorical data. Here you are concerned with estimating the proportion of items in a population with a certain characteristic of interest. The unknown population proportion is represented by the Greek letter π (pronounced pi). The point estimate for π is the sample proportion, p = X/n, where n is the sample size and X is the number of items in the sample with the characteristic of interest. Equation 8.3 defines the confidence interval estimate for the population proportion.


CONFIDENCE INTERVAL ESTIMATE FOR THE PROPORTION

p ± Z √(p(1 − p)/n)

or

p − Z √(p(1 − p)/n) ⩽ π ⩽ p + Z √(p(1 − p)/n)    (8.3)

Using Equation 8.3 and Z = 1.96 for 95% confidence:
p ± Z √(p(1 − p)/n) = 0.10 ± (1.96) √((0.10)(0.90)/100)
= 0.10 ± (1.96)(0.03)
= 0.10 ± 0.0588
0.0412 ⩽ π ⩽ 0.1588
Therefore, you have 95% confidence that between 4.12% and 15.88% of all the sales invoices contain errors. Figure 8.10 shows a Microsoft Excel worksheet for these data. Note that in early versions of Excel, the formula used in cell B10 would be =NORMSINV((1+B6)/2). Example 8.4 illustrates another application of a confidence interval estimate for the proportion.

Figure 8.10  Microsoft Excel 2016 worksheet to form a confidence interval estimate for the proportion of sales invoices that contain errors

Proportion of in-error sales invoices

Data
Sample size            100
Number of successes    10
Confidence level       95%

Intermediate calculations
Sample proportion                  0.1      =B5/B4
Z value                            1.96     =NORM.S.INV((1 + B6)/2)
Standard error of the proportion   0.03     =SQRT(B9 * (1 - B9)/B4)
Interval half width                0.0588   =(B10 * B11)

Confidence interval
Interval lower limit   0.0412   =B9 - B12
Interval upper limit   0.1588   =B9 + B12
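The worksheet logic in Figure 8.10 can be reproduced in a few lines of code. The following Python sketch is illustrative only (it is not part of the text's Excel or PHStat material) and assumes scipy is available for the standard normal critical value.

from math import sqrt
from scipy.stats import norm

n, x, conf = 100, 10, 0.95            # sample size, number of successes, confidence level
p = x / n                             # sample proportion (0.10)
z = norm.ppf((1 + conf) / 2)          # Z value, about 1.96, as in cell B10
std_err = sqrt(p * (1 - p) / n)       # standard error of the proportion (0.03)
half_width = z * std_err              # 0.0588

print(f"{p - half_width:.4f} <= pi <= {p + half_width:.4f}")   # 0.0412 <= pi <= 0.1588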


EXAMPLE 8.4
ESTIMATING THE PROPORTION OF TYPOGRAPHICAL ERRORS IN ONLINE NEWSPAPERS
With the latest technology available to check written text, mistakes in newspapers are becoming less common. However, humans still make mistakes. A large media corporation wants to estimate the proportion of online newspaper articles written by a variety of journalists that have typographical errors. A random sample of 200 articles is selected from all the newspapers posted online during a single month. For this sample of 200, 7 contain some type of typographical error. Construct and interpret a 90% confidence interval for the proportion of articles posted online during the month that have a typographical error.

SOLUTION

Using Equation 8.3:
p = X/n = 7/200 = 0.035
so np = 200 × 0.035 = 7 ⩾ 5 and n(1 − p) = 200 × 0.965 = 193 ⩾ 5, and with a 90% level of confidence Z = 1.645:
p ± Z √(p(1 − p)/n) = 0.035 ± (1.645) √((0.035)(0.965)/200)
= 0.035 ± (1.645)(0.0130)
= 0.035 ± 0.0214
0.0136 < π < 0.0564
You can conclude with 90% confidence that between 1.36% and 5.64% of the newspaper articles posted online in that month have a typographical error.
Equation 8.3 contains a Z statistic since you can use the normal distribution to approximate the binomial distribution when the sample size is sufficiently large. In Example 8.4, the confidence interval using Z provides an excellent approximation for the population proportion since both X and n − X are greater than 5. However, if you do not have a sufficiently large sample size, you should use the binomial distribution rather than Equation 8.3 (see references 3, 4 and 5). The exact confidence intervals for various sample sizes and proportions of successes have been tabulated by Fisher and Yates (reference 4).

Problems for Section 8.3

LEARNING THE BASICS
8.22 If n = 200 and X = 50, construct a 95% confidence interval estimate of the population proportion.
8.23 If n = 400 and X = 25, construct a 99% confidence interval estimate of the population proportion.

APPLYING THE CONCEPTS
8.24 A telco wants to estimate the proportion of mobile phone customers who would purchase a phone plan with unlimited standard calls and SMS and 2GB of data if it were made available at a substantially reduced cost. A random sample of 500 customers is selected. The results indicate that 190 of the customers would purchase the plan at a reduced cost.
a. Construct a 99% confidence interval estimate of the population proportion of customers who would purchase the unlimited 2GB plan.
b. How would the manager in charge of promotional programs for mobile customers use the results in (a)?


8.25 A survey of 500 highly educated women who left careers for family reasons found that 66% postponed their return to work due to difficulty in making suitable childcare arrangements.
a. Construct a 95% confidence interval for the population proportion of highly educated women who have postponed their return to work due to difficulty in making suitable childcare arrangements.
b. Interpret the interval in (a).
8.26 A survey of 293 inhabitants of Tropical North Queensland in 2013 found that 45% considered increased property values were a negative impact of tourism in the region (Tropical North Queensland Social Indicators 2013 accessed 5 July 2017).
a. Construct a 95% confidence interval for the proportion of all residents in the region who believe increased property values are a negative impact of tourism.
b. Construct a 90% confidence interval for the proportion of all residents in the region who believe increased property values are a negative impact of tourism.
c. Which interval is wider? Explain why this is true.
8.27 The number of older consumers in Australia is growing and they are becoming an important economic force. According to the Australian Bureau of Statistics, the proportion of the population aged 65 years and over increased from 14% in 2011 to 16% in 2016 (Australian Bureau of Statistics, Reflecting Australia – Stories from the Census, 2016, Cat. No. 2071.0, 2017). The proportion is projected to grow higher in coming years. Many older consumers feel overwhelmed when confronted with the task of selecting investments, banking services, health insurance or phone service providers. Suppose a telephone survey of 1,900 older consumers found that 27% said they felt confused when making financial decisions.
a. Construct a 95% confidence interval for the population proportion of older consumers who feel confused when making financial decisions.
b. Interpret the interval in (a).

8.28 The Australian Telecommunications Industry Ombudsman 2016 Annual Report states that 34.1% of new complaints in 2015–16 related to faults ( accessed 5 July 2017). Imagine that you take a survey of 1,000 Australian users and find that 36% of this sample report that they have had telecommunication service faults in the past three months.
a. Construct a 95% confidence interval for the population proportion of users who have experienced service faults in the past three months.
b. Does your interval indicate that there is a difference from the percentage reported by the Ombudsman? Give reasons why a difference may occur.
8.29 The Australian Psychological Society conducted an online survey in 2016 of 1,000 adults and 518 adolescents. It found that 69% of adolescents reported consuming food from fast food restaurants at least once a week (Psychology Week 2016, APS Compass for Life Wellbeing Survey accessed 5 July 2017).
a. Construct a 95% confidence interval for the proportion of all Australian adolescents who consume food from fast food restaurants at least once per week.
b. How would your result change if it was a 99% interval?
8.30 Suppose that, in a survey of 600 employers, 126 indicate that they have used a recruitment service within the past two months to find new staff.
a. Construct a 95% confidence interval for the population proportion of employers who have used a recruitment service within the past two months to find new staff.
b. Construct a 99% confidence interval for the population proportion of employers who have used a recruitment service within the past two months to find new staff.
c. Interpret the intervals in (a) and (b).
d. Discuss the effect on the confidence interval estimate when you change the level of confidence.

8.4  DETERMINING SAMPLE SIZE

LEARNING OBJECTIVE 3
Determine the sample size necessary to develop a confidence interval for the mean

In each example of confidence interval estimation, you selected the sample size without regard to the width of the resulting confidence interval. In the business world, determining the proper sample size is a complicated procedure, subject to the constraints of budget, time and the amount of acceptable sampling error. If, in the Callistemon Camping Supplies scenario, you want to estimate the mean dollar amount of the sales invoices or the proportion of sales invoices that contain errors, you must determine in advance how large a sampling error to allow in estimating each of the parameters. You must also determine in advance the level of confidence to use in estimating the population parameter.

Sample Size Determination for the Mean
To develop a formula for determining the appropriate sample size needed when constructing a confidence interval estimate of the mean, recall Equation 8.1 on page 282:
X̄ ± Z σ/√n


The amount added to or subtracted from X̄ is equal to half the width of the interval. This quantity represents the amount of imprecision in the estimate that results from sampling error. The sampling error e (in this context, some statisticians refer to e as the 'margin of error') is defined as:
e = Z σ/√n

sampling error The difference in results for different samples of the same size.

Solving for n gives the sample size needed to construct the appropriate confidence interval estimate for the mean. 'Appropriate' means that the resulting interval will have an acceptable amount of sampling error.

SAMPLE SIZE DETERMINATION FOR THE MEAN
The sample size n is equal to the product of the Z value squared and the variance σ², divided by the sampling error e squared.

n = Z²σ²/e²    (8.4)

To determine the sample size, you must know three factors:
1. the desired confidence level, which determines the value of Z, the critical value from the standardised normal distribution¹
2. the acceptable sampling error e
3. the standard deviation σ.
¹ You use Z instead of t because, to determine the critical value of t, you need to know the sample size, but you do not know it yet. For most studies, the sample size needed is large enough that the standardised normal distribution is a good approximation of the t distribution.
In some business-to-business relationships requiring estimation of important parameters, legal contracts specify acceptable levels of sampling error and the confidence level required. For companies in the food or drug sectors, government regulations often specify sampling errors and confidence levels. In general, however, it is usually not easy to specify the two factors needed to determine the sample size. How can you determine the level of confidence and sampling error? Typically, these questions are answered only by the subject matter expert (i.e. the individual most familiar with the variables under study). Although 95% is the most common confidence level used, if more confidence is desired then 99% might be more appropriate; if less confidence is deemed acceptable, then 90% might be used. For the sampling error, you should think not of how much sampling error you would like to have (you really do not want any error), but of how much you can tolerate when drawing conclusions from the data.
In addition to specifying the confidence level and the sampling error, you need an estimate of the standard deviation. Unfortunately, you rarely know the population standard deviation, σ. In some instances, you can estimate the standard deviation from past data. In other situations, you can make an educated guess by taking into account the range and distribution of the variable. For example, if you assume a normal distribution, the range is approximately equal to 6σ (i.e. ±3σ around the mean), so you can estimate σ as the range divided by 6. If you cannot estimate σ in this way, you can conduct a small-scale study and estimate the standard deviation from the resulting data.
To explore how to determine the sample size needed for estimating the population mean, consider again the audit at Callistemon Camping Supplies. In Section 8.2, we selected a sample of 100 sales invoices and developed a 95% confidence interval estimate of the population mean sales invoice amount. How was this sample size determined? Should we have selected a different sample size? Suppose that, after consultation with company officials, we determine that a sampling error of no more than ±$10 is desired, together with 95% confidence. Past data indicate that the standard deviation of the sales amount is approximately $50. Thus, e = $10, σ = $50 and Z = 1.96 (for 95% confidence). Using Equation 8.4:

Z 2σ 2 e2

=

(1.96 ) 2 ( 50) 2 (10)2

= 96.04 Because the general rule is to oversatisfy slightly the criteria by rounding the sample size up to the next whole integer, you should select a sample of size 97. Thus, the sample of size 100 used on page 287 is close to what is necessary to satisfy the needs of the company based on the ­estimated standard deviation, desired confidence level and sampling error. Because the calculated sample standard deviation is slightly higher than expected, $52.62 compared with $50.00, the confidence interval is slightly wider than desired. Figure 8.11 illustrates the Microsoft Excel worksheet to determine the sample size. For early versions of Excel use the formula =NORMSINV((1+B6)/2) in cell B9. Figure 8.11  Microsoft Excel 2016 worksheet for determining sample size for estimating the mean sales invoice amount for Callistemon Camping Supplies Pty Ltd

For the mean sales invoice amount

Data
Population standard deviation   50
Sampling error                  10
Confidence level                95%

Intermediate calculations
Z value                   1.9600    =NORM.S.INV((1 + B6)/2)
Calculated sample size    96.0365   =((B9 * B4)/B5)^2

Result
Sample size needed        97        =ROUNDUP(B10,0)
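Equation 8.4 is simple enough to script directly. The Python sketch below is an illustration only (it is not the book's Excel template) and assumes scipy is available; it mirrors the worksheet in Figure 8.11.

from math import ceil
from scipy.stats import norm

sigma, e, conf = 50, 10, 0.95          # planning value for sigma, sampling error, confidence level
z = norm.ppf((1 + conf) / 2)           # 1.96 for 95% confidence
n_exact = (z * sigma / e) ** 2         # Equation 8.4: about 96.04
n_needed = ceil(n_exact)               # always round up, giving 97

print(n_exact, n_needed)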

Example 8.5 illustrates another application of determining the sample size needed to develop a confidence interval estimate for the mean.

EXAMPLE 8.5
DETERMINING THE SAMPLE SIZE FOR THE MEAN
Returning to Example 8.3, suppose you want to estimate the population mean height for females who wear size 12 to within ±15 mm with 95% confidence. On the basis of a study taken the previous year, you believe that the standard deviation is 100 mm. Find the sample size needed.

SOLUTION

Using Equation 8.4 on page 295 and e = 15, σ = 100 and Z = 1.96 for 95% confidence:
n = Z²σ²/e² = (1.96)²(100)²/(15)² = 170.74
Therefore, you should select a sample size of 171 women, because the general rule for determining sample size is always to round up to the next integer value in order to oversatisfy slightly the criteria desired. An actual sampling error slightly larger than 15 will result if the sample standard deviation calculated in this sample of 171 is greater than 100, and it will be slightly smaller if the sample standard deviation is less than 100.


Sample Size Determination for the Proportion

LEARNING OBJECTIVE 3
Determine the sample size necessary to develop a confidence interval for the proportion

So far, we have seen how to determine the sample size needed for estimating the population mean. Now suppose that you want to determine the sample size necessary for estimating the proportion of sales invoices at Callistemon Camping Supplies that contain errors. To determine the sample size needed to estimate a population proportion (π), you use a method similar to that for a population mean. Recall that in developing the sample size for a confidence interval for the mean, the sampling error is defined by:
e = Z σ/√n
When estimating a proportion, you replace σ with √(π(1 − π)). Thus, the sampling error is:
e = Z √(π(1 − π)/n)
Solving for n, you have the sample size necessary to develop a confidence interval estimate for a proportion.

SAMPLE SIZE DETERMINATION FOR THE PROPORTION
The sample size n is equal to the product of the Z value squared, the population proportion π and 1 minus the population proportion π, divided by the sampling error e squared.

n = Z²π(1 − π)/e²    (8.5)

To determine the sample size, you must know three factors:
1. the desired confidence level, which determines the value of Z, the critical value from the standardised normal distribution
2. the acceptable sampling error e
3. the population proportion π.
In practice, selecting these quantities requires some planning. Once you determine the desired level of confidence, you can find the appropriate Z value from the standardised normal distribution. The sampling error e indicates the amount of error that you are willing to tolerate in estimating the population proportion. The third quantity, π, is actually the population parameter that you want to estimate! How do you state a value for the very thing that you are taking a sample in order to determine?
There are two alternatives. In many situations, you may have past information or relevant experiences that provide an educated estimate of π. If you do not, you can try to provide a value for π that would never underestimate the sample size needed. Referring to Equation 8.5, you can see that the quantity π(1 − π) appears in the numerator. Thus, you need to determine the value of π that will make the quantity π(1 − π) as large as possible. When π = 0.5, the product π(1 − π) achieves its maximum result. To show this, here are several values of π together with the accompanying products of π(1 − π):
When π = 0.9, π(1 − π) = (0.9)(0.1) = 0.09
When π = 0.7, π(1 − π) = (0.7)(0.3) = 0.21
When π = 0.5, π(1 − π) = (0.5)(0.5) = 0.25
When π = 0.3, π(1 − π) = (0.3)(0.7) = 0.21
When π = 0.1, π(1 − π) = (0.1)(0.9) = 0.09


Therefore, when you have no prior knowledge or estimate of the population proportion π, use π = 0.5 for determining the sample size. This produces the largest possible sample size and results in the highest possible cost of sampling. Using π = 0.5 may overestimate the sample size needed because you use the actual sample proportion in developing the confidence interval. You will get a confidence interval narrower than originally intended if the actual sample proportion is different from 0.5. The increased precision comes at the cost of spending more time and money for an increased sample size.
Returning to the Callistemon Camping Supplies scenario, suppose that the auditing procedures require you to have 95% confidence in estimating the population proportion of sales invoices with errors to within ±0.07. The results from past months indicate that the largest proportion has been no more than 0.15. Thus, using Equation 8.5 and e = 0.07, π = 0.15 and Z = 1.96 for 95% confidence:
n = Z²π(1 − π)/e² = (1.96)²(0.15)(0.85)/(0.07)² = 99.96
Because the general rule is to round up the sample size to the next whole integer to slightly oversatisfy the criteria, a sample size of 100 is needed. Thus, the sample size needed to satisfy the requirements of the company based on the estimated proportion, desired confidence level and sampling error is equal to the sample size taken on page 292. The actual confidence interval is narrower than required since the sample proportion is 0.10, while 0.15 was used for π in Equation 8.5. Figure 8.12 shows a Microsoft Excel 2016 worksheet. Change the formula in cell B9 to =NORMSINV((1+B6)/2) for early versions of Excel. Example 8.6 provides a second application of determining the sample size for estimating the population proportion.

Figure 8.12  Microsoft Excel 2016 worksheet for determining sample size for estimating the proportion of sales invoices with errors for Callistemon Camping Supplies Pty Ltd

For the proportion of in-error sales invoices

Data
Estimate of true proportion   0.15
Sampling error                0.07
Confidence level              95%

Intermediate calculations
Z value                   1.9600    =NORM.S.INV((1 + B6)/2)
Calculated sample size    99.9563   =(B9^2 * B4 * (1 - B4))/B5^2

Result
Sample size needed        100       =ROUNDUP(B10,0)
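As with the sample size for the mean, Equation 8.5 can be scripted. This Python sketch is illustrative only (it is not part of the text's Excel material) and assumes scipy is available; it mirrors the worksheet in Figure 8.12.

from math import ceil
from scipy.stats import norm

pi_est, e, conf = 0.15, 0.07, 0.95              # planning estimate of pi, sampling error, confidence level
z = norm.ppf((1 + conf) / 2)                    # 1.96 for 95% confidence
n_exact = z**2 * pi_est * (1 - pi_est) / e**2   # Equation 8.5: about 99.96

print(ceil(n_exact))                            # round up to 100; use pi_est = 0.5 if no prior estimate exists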

EXAMPLE 8.6
DETERMINING THE SAMPLE SIZE FOR THE POPULATION PROPORTION
You want to have 90% confidence of estimating the proportion of office workers who respond to email within an hour to within ±0.05. Because you have not previously undertaken such a study, there is no information available from past data. Determine the sample size needed.


SOLUTION

Because no information is available from past data, assume π = 0.50. Using Equation 8.5 and e = 0.05, π = 0.50 and Z = 1.645 for 90% confidence:
n = Z²π(1 − π)/e² = (1.645)²(0.50)(0.50)/(0.05)² = 270.6
Therefore, you need a sample of 271 office workers to estimate the population proportion to within ±0.05 with 90% confidence.

Problems for Section 8.4

LEARNING THE BASICS
8.31 If you want to be 95% confident of estimating the population mean to within a sampling error of ±5 and the standard deviation is assumed to be 15, what sample size is required?
8.32 If you want to be 99% confident of estimating the population mean to within a sampling error of ±20 and the standard deviation is assumed to be 100, what sample size is required?
8.33 If you want to be 99% confident of estimating the population proportion to within a sampling error of ±0.04, what sample size is needed?
8.34 If you want to be 95% confident of estimating the population proportion to within a sampling error of ±0.02 and there is historical evidence that the population proportion is approximately 0.40, what sample size is needed?

APPLYING THE CONCEPTS
8.35 A survey is planned to determine the mean annual family medical expenses of employees of a large company which subsidises the health insurance of its staff. The management of the company wishes to be 95% confident that the sample mean is correct to within ±$50 of the population mean annual family medical expenses. A previous study indicates that the standard deviation is approximately $400.
a. How large a sample size is necessary?
b. If management wants to be correct to within ±$25, what sample size is necessary?
8.36 If the manager of a paint supply store wants to estimate the mean amount of paint in a 4-litre can to within ±0.015 litres with 95% confidence and also assumes that the standard deviation is 0.075 litres, what sample size is needed?
8.37 If a quality control manager wants to estimate the mean life of a new type of LED light globe to within 1,000 hours with 95% confidence and also assumes that the population standard deviation is 5,000 hours, what sample size is needed?
8.38 The inspection division of a state department which regulates trade measurement wants to estimate the mean amount of soft-drink fill in 2-litre bottles to within ±0.01 litres with 95% confidence. If it assumes that the standard deviation is 0.05 litres, what sample size is needed?
8.39 A consumer group wants to estimate the mean electric bill for the month of July for single family homes in a large city. Based on studies conducted in other cities, the standard deviation is assumed to be $60. The group wants to estimate the mean bill for July to within ±$15 with 99% confidence.
a. What sample size is needed?
b. If 95% confidence is desired, what sample size is necessary?
8.40 An advertising agency that serves a major radio station wants to estimate the mean amount of time that the station's audience spends listening to the radio daily. From past studies, the standard deviation is estimated as 45 minutes.
a. What sample size is needed if the agency wants to be 90% confident of being correct to within ±5 minutes?
b. If 99% confidence is desired, what sample size is necessary?
8.41 Suppose that an energy company wants to estimate its mean waiting time for natural gas installation to within ±5 days with 95% confidence. The company does not have access to previous data, but suspects that the standard deviation is approximately 20 days. What sample size is needed?
8.42 At a large South East Asian airport flights are classified as being 'on time' if they land less than 15 minutes after the scheduled time. A study of airlines using the airport finds that one of the airlines that services Australia has a record of 17% of flights arriving late. Suppose you were asked to perform a follow-up study for this airline in order to update the estimated proportion of late arrivals. What sample size would you use to estimate the population proportion to within a sampling error of:
a. ±0.06 with 95% confidence?
b. ±0.04 with 95% confidence?
c. ±0.02 with 95% confidence?
8.43 The Nielsen company regularly conducts research into consumer purchases. Nielsen Homescan data for the 52 weeks ended 28 January 2017 showed that 34.5% of Australian homes had purchased Asian vegetables in that period. Households of 1–2 persons accounted for 47% of the volume in Asian vegetable sales (Nielsen Insights accessed 5 July 2017).




Consider a follow-up study focusing on the latest calendar year.
a. What sample size is needed to estimate the population proportion of Australian households that have purchased Asian vegetables to within ±0.02 with 95% confidence?
b. What sample size is needed to estimate the population proportion of the volume of Asian vegetables that are purchased by 1–2 person households to within ±0.02 with 95% confidence?
c. Compare the results of (a) and (b). Explain why these results differ.
d. If you were to design a data collection method for a follow-up study, would you use one sample and collect data to answer both questions, or would you select two separate samples? Explain the rationale behind your decision.
8.44 Suppose that a survey of the audience at a Sydney Symphony Orchestra (SSO) concert has found that 48 out of 350 members of the audience who participated in the survey are visitors to Sydney.
a. Construct a 95% confidence interval for the population proportion of audience members at SSO concerts who are visitors to Sydney.
b. Interpret the interval constructed in (a).
c. To conduct a follow-up study that would provide 95% confidence that the point estimate is correct to within ±0.03 of the population proportion, how large a sample size is required?
d. To conduct a follow-up study that would provide 99% confidence that the point estimate is correct to within ±0.03 of the population proportion, how large a sample size is required?
8.45 A study conducted by the Australian Securities Exchange found that 36% of 4,009 Australian adults surveyed in late 2014 held shares, either directly or indirectly through unlisted managed funds (Australian Securities Exchange, 2014 Australian Share Ownership Study, accessed 5 July 2017).
a. Construct a 95% confidence interval for the proportion of Australian adults who held shares in late 2014.
b. Interpret the interval constructed in (a).
c. To conduct a follow-up study to estimate the population proportion of adults who currently hold shares to within ±0.01 with 95% confidence, how many adults would you interview?

8.5  APPLICATIONS OF CONFIDENCE INTERVAL ESTIMATION IN AUDITING

This chapter has focused on estimating either the population mean or the population proportion. Auditing is one area in business that makes widespread use of statistical sampling for the purposes of estimation.

auditing A process of checking the accuracy of financial records.

AUDITING
Auditing is the collection and evaluation of evidence about information relating to an economic entity such as a sole business proprietor, a partnership, a corporation or a government agency in order to determine and report on how well the information corresponds to established criteria.

Six advantages of statistical sampling in auditing are:
1. Results are objective and defensible. Because the sample size is based on demonstrable statistical principles, the audit is defensible before one's superiors and in a court of law.
2. Statistical sampling provides an objective way of estimating the sample size in advance.
3. Statistical sampling provides an estimate of the sampling error.
4. Statistical sampling is often more accurate for drawing conclusions about large populations. Examining large populations is time-consuming and therefore often subject to more non-sampling error than a statistical sample.
5. Statistical sampling allows auditors to combine, and then evaluate collectively, samples collected by different individuals.
6. Statistical sampling allows auditors to generalise their findings to the population with a known sampling error.

Estimating the Population Total Amount

total amount The sum of all values.

In auditing applications we are often more interested in developing estimates of the population total amount than the population mean. Equation 8.6 shows how to estimate a population total amount.


ESTIMATING THE POPULATION TOTAL
The point estimate for the population total is equal to the population size N times the sample mean.

Total = NX̄    (8.6)

Equation 8.7 defines the confidence interval estimate for the population total. The term √((N − n)/(N − 1)) is included where sampling is from a finite population.

LEARNING OBJECTIVE 4
Recognise how to use confidence interval estimates in auditing

CONFIDENCE INTERVAL ESTIMATE FOR THE TOTAL

NX̄ ± N(tn−1)(S/√n)√((N − n)/(N − 1))    (8.7)

To demonstrate the application of the confidence interval estimate for the population total amount, we return to the Callistemon Camping Supplies scenario. One of the auditing tasks is to estimate the total dollar amount of all sales invoices for the month. If there are 5,000 invoices for that month and X̄ = $110.27, then, using Equation 8.6:
NX̄ = (5,000)($110.27) = $551,350
If n = 100 and S = $28.95, then, using Equation 8.7 with t99 = 1.9842 for 95% confidence:
NX̄ ± N(tn−1)(S/√n)√((N − n)/(N − 1)) = 551,350 ± (5,000)(1.9842)(28.95/√100)√((5,000 − 100)/(5,000 − 1))
= 551,350 ± 28,721.295(0.99005)
= 551,350 ± 28,436
$522,914 < population total < $579,786
Therefore, with 95% confidence, you estimate that the total amount of sales invoices is between $522,914 and $579,786. Example 8.7 further illustrates the population total.
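A quick way to verify the arithmetic in Equation 8.7 is to script it. The Python sketch below is an illustration only (it is not part of the text's software material) and assumes scipy is available for the t critical value; it uses the Callistemon Camping Supplies figures above.

from math import sqrt
from scipy.stats import t

N, n, x_bar, s, conf = 5000, 100, 110.27, 28.95, 0.95   # population size, sample size, sample mean, sample SD

total = N * x_bar                                        # point estimate of the population total: 551,350
t_crit = t.ppf((1 + conf) / 2, df=n - 1)                 # about 1.9842
fpc = sqrt((N - n) / (N - 1))                            # finite population correction, about 0.99005
half_width = N * t_crit * (s / sqrt(n)) * fpc            # about 28,436

print(f"{total - half_width:,.0f} to {total + half_width:,.0f}")   # roughly 522,914 to 579,786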

EXAMPLE 8.7
DEVELOPING A CONFIDENCE INTERVAL ESTIMATE FOR THE POPULATION TOTAL
An auditor is faced with a population of 1,000 vouchers and wants to estimate the total value of the population of vouchers. A sample of 50 vouchers is selected with the following results:
Mean voucher amount (X̄) = $1,076.39
Standard deviation (S) = $273.62
Construct a 95% confidence interval estimate of the total amount for the population of vouchers.

SOLUTION
Using Equation 8.6, the point estimate of the population total is:
NX̄ = (1,000)(1,076.39) = $1,076,390


From Equation 8.7, a 95% confidence interval estimate of the population total amount is:
(1,000)(1,076.39) ± (1,000)(2.0096)(273.62/√50)√((1,000 − 50)/(1,000 − 1))
= 1,076,390 ± 77,762.902(0.97517)
= 1,076,390 ± 75,832
$1,000,558 < population total < $1,152,222
Therefore, with 95% confidence, you estimate that the total amount of the vouchers is between $1,000,558 and $1,152,222.

Difference Estimation

difference estimation A method of estimating the level of discrepancy between book and audit values for a population.

Auditors use difference estimation when they believe that errors exist in a set of items and they want to estimate the magnitude of the errors based only on a sample. The following steps are used in difference estimation:
1. Determine the sample size required.
2. Calculate the differences between the values reached during the audit and the original values recorded. The difference in value i, denoted Di, is equal to 0 if the auditor finds that the original value is correct, is a positive value when the audited value is larger than the original value, and is negative when the audited value is smaller than the original value.
3. Calculate the mean difference in the sample (D̄) by dividing the total difference by the sample size, as shown in Equation 8.8.

MEAN DIFFERENCE

D̄ = (ΣDi)/n    (8.8)

where Di = audited value − original value and the sum is taken over all n items in the sample.

4. Calculate the standard deviation of the differences (SD), as shown in Equation 8.9.

Remember that any item that is not in error has a difference value of 0.

STANDARD DEVIATION OF THE DIFFERENCE

SD = √( Σ(Di − D̄)² / (n − 1) )    (8.9)

5. Use Equation 8.10 to construct a confidence interval estimate of the total difference in the population.


CONFIDENCE INTERVAL ESTIMATE FOR THE TOTAL DIFFERENCE

ND̄ ± N(tn−1)(SD/√n)√((N − n)/(N − 1))    (8.10)

The auditing procedures for Callistemon Camping Supplies require a 95% confidence interval estimate of the difference between the actual dollar amounts on the sales invoice and the amounts entered into the integrated inventory and sales information system. Suppose that, in a sample of 100 sales invoices, you have 12 invoices in which the actual amount on the sales invoice and the amount entered into the integrated inventory and sales information system are different. These 12 differences < PARTS_INV > are:
$9.03 $7.47 $17.32 $8.30 $5.21 $10.80 $6.22 $5.63 $4.97 $7.43 $2.99 $4.63
The other 88 invoices are not in error. Their differences are each 0. Thus:

D̄ = (ΣDi)/n = 90/100 = 0.90
and:
SD = √( Σ(Di − D̄)²/(n − 1) ) = √( ((9.03 − 0.9)² + (7.47 − 0.9)² + … + (0 − 0.9)²)/(100 − 1) )
(In the numerator, there are 100 differences. The last 88 are all (0 − 0.9)².)
SD = 2.7518
Using Equation 8.10, construct the confidence interval estimate for the total difference in the population of 5,000 sales invoices as follows:
(5,000)(0.90) ± (5,000)(1.9842)(2.7518/√100)√((5,000 − 100)/(5,000 − 1))
= 4,500 ± 2,702.89
$1,797.11 < total difference < $7,202.89
Thus, the auditor estimates with 95% confidence that the total difference between the audited amounts and the amounts originally entered into the integrated inventory and sales information system is between $1,797.11 and $7,202.89.


EXAMPLE 8.8
DIFFERENCE ESTIMATION
Returning to Example 8.7, suppose that 14 vouchers contain errors in the sample of 50 vouchers. The values < DIFF_TEST > of the 14 errors are as follows, in which two differences are negative:
$75.41  $38.97  $108.54  −$37.18  $62.75  $118.32  −$88.84
$127.74  $55.42  $39.03  $29.41  $47.99  $28.73  $84.05

Construct a 95% confidence interval estimate of the total difference in the population of 1,000 vouchers.

SOLUTION
For these data:
D̄ = (ΣDi)/n = 690.34/50 = 13.8068
and:
SD = √( Σ(Di − D̄)²/(n − 1) ) = √( ((75.41 − 13.8068)² + (38.97 − 13.8068)² + … + (0 − 13.8068)²)/(50 − 1) ) = 37.427
Using Equation 8.10, construct the confidence interval estimate for the total difference in the population:
(1,000)(13.8068) ± (1,000)(2.0096)(37.427/√50)√((1,000 − 50)/(1,000 − 1))
= 13,806.8 ± 10,372.63
$3,434.17 < total difference < $24,179.43
Therefore, with 95% confidence you estimate that the total difference in the population of vouchers is between $3,434.17 and $24,179.43.

LEARNING OBJECTIVE 4
Recognise how to use confidence interval estimates in auditing

One-Sided Confidence Interval Estimation of the Rate of Non-Compliance with Internal Controls

one-sided confidence interval Gives only an upper or lower bound to the value of the population parameter.

Organisations use internal control mechanisms to ensure that individuals act in accordance with company guidelines. For example, Callistemon Camping Supplies requires that an authorised delivery docket is completed before goods are removed from the warehouse. During the monthly audit of the company, the auditing team is charged with the task of estimating the proportion of times goods were removed without proper authorisation. This is referred to as the rate of non-compliance with the internal control. To estimate the rate of non-compliance, auditors take a random sample of sales invoices and determine how often merchandise was shipped without an authorised delivery docket. The auditors then compare their results with a previously established tolerable exception rate, which is the maximum allowable proportion of items in the population not in compliance. When estimating the rate of non-compliance, it is commonplace to use a one-sided confidence interval. That is, the auditors estimate an upper bound on the rate of non-compliance. Equation 8.11 defines a one-sided confidence interval for a proportion.


ONE-SIDED CONFIDENCE INTERVAL FOR A PROPORTION

Upper bound = p + Z √(p(1 − p)/n) √((N − n)/(N − 1))    (8.11)

where Z = the value corresponding to a cumulative area of (1 − α) from the standardised normal distribution; that is, a right-hand tail probability of α.

If the tolerable exception rate is higher than the upper bound, then the auditor concludes that the company is in compliance with the internal control. If the upper bound is higher than the tolerable exception rate, the auditor concludes that the control non-compliance rate is too high. The auditor may then request a larger sample.
Suppose that, in the monthly audit, you select 400 of the sales invoices from a population of 10,000 invoices. In the sample of 400 sales invoices, 20 are in violation of the internal control. If the tolerable exception rate for this internal control is 6%, what should you conclude? Use a 95% level of confidence.
The one-sided confidence interval is calculated using p = 20/400 = 0.05 and Z = 1.645. Using Equation 8.11:
Upper bound = p + Z √(p(1 − p)/n) √((N − n)/(N − 1)) = 0.05 + 1.645 √((0.05)(1 − 0.05)/400) √((10,000 − 400)/(10,000 − 1))
= 0.05 + 1.645(0.0109)(0.98) = 0.05 + 0.0176 = 0.0676
Thus, you have 95% confidence that the rate of non-compliance is less than 6.76%. Because the tolerable exception rate is 6%, the rate of non-compliance may be too high for this internal control. In other words, it is possible that the non-compliance rate for the population is higher than the rate deemed tolerable. Therefore, you should request a larger sample. In many cases, the auditor is able to conclude that the rate of non-compliance with the company's internal controls is acceptable. Example 8.9 illustrates such an occurrence.

EXAMPLE 8.9
ESTIMATING THE RATE OF NON-COMPLIANCE
A large electronics firm makes one million direct debit payments a year. An internal control policy requires that each payment is made only after an invoice has been authorised by an accounts payable supervisor. The company's tolerable exception rate for this control is 4%. If control deviations are found in 8 of the 400 invoices sampled, what should the auditor do? Use a 95% level of confidence.

SOLUTION

The auditor constructs a 95% one-sided confidence interval for the proportion of invoices in non-compliance and compares this with the tolerable exception rate. Using Equation 8.11, p = 8/400 = 0.02 and Z = 1.645 for 95% confidence:
Upper bound = p + Z √(p(1 − p)/n) √((N − n)/(N − 1)) = 0.02 + 1.645 √((0.02)(1 − 0.02)/400) √((1,000,000 − 400)/(1,000,000 − 1))
= 0.02 + 1.645(0.007)(0.9998) = 0.02 + 0.0115 = 0.0315

The auditor concludes with 95% confidence that the rate of non-compliance is less than 3.15%. Since this is less than the tolerable exception rate, the auditor concludes that the internal control compliance is adequate. In other words, the auditor is more than 95% confident that the rate of non-compliance is less than 4%.
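Equation 8.11 can be checked the same way. The Python sketch below is illustrative only (it is not part of the text's software material) and assumes scipy is available; it uses the figures from Example 8.9.

from math import sqrt
from scipy.stats import norm

N, n, violations, conf = 1_000_000, 400, 8, 0.95   # population size, sample size, control deviations found
p = violations / n                                  # sample non-compliance rate, 0.02
z = norm.ppf(conf)                                  # one-sided bound, so all of alpha sits in one tail: about 1.645
upper = p + z * sqrt(p * (1 - p) / n) * sqrt((N - n) / (N - 1))

print(f"Upper bound = {upper:.4f}")                 # about 0.0315, below the 4% tolerable exception rate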


Problems for Section 8.5

LEARNING THE BASICS
8.46 A sample of 25 is selected from a population of 500 items. The sample mean is 25.7 and the sample standard deviation is 7.8. Construct a 99% confidence interval estimate of the population total.
8.47 Suppose that a sample of 200 is selected from a population of 10,000 items. Ten items are found to have errors of the following amounts: < ITEM_ERR >
13.76 42.87 34.65 11.09 14.54 22.87 25.52 9.81 10.03 15.49
Construct a 95% confidence interval estimate of the total difference in the population.
8.48 If p = 0.04, n = 300 and N = 5,000, calculate the upper bound for a one-sided confidence interval estimate of the population proportion, π, using a level of confidence of:
a. 90%
b. 95%
c. 99%

APPLYING THE CONCEPTS
8.49 A stationery store wants to estimate the total retail value of the 300 greeting cards it has in its inventory. Construct a 95% confidence interval estimate of the population total value of all greeting cards that are in the inventory if a random sample of 20 greeting cards indicates an average value of $5.45 and a standard deviation of $0.82.
8.50 The personnel department of a large corporation employing 3,000 workers wants to estimate the family dental expenses of its employees to determine the feasibility of providing a dental insurance plan. A random sample of 10 employees reveals the following family dental expenses (in dollars) for the preceding year: < DENTAL >
1,110 362 2,320 1,930 3,210 208 1,730 825 616 1,179
Construct a 90% confidence interval estimate of the total family dental expenses for all employees in the preceding year.
8.51 A branch of a chain of large electronics stores is conducting an end-of-month inventory of the merchandise in stock. There are 1,546 items in inventory at the time. A sample of 50 items is randomly selected and an audit conducted, with the following results:
Value of merchandise: X̄ = $252.28, S = $93.67
Construct a 95% confidence interval estimate of the total value of the merchandise in the inventory at the end of the month.
8.52 When a trade discount is allowed by wholesalers for particular types of early payments by customers, the Goods and Services Tax (GST) payable to the Australian Tax Office needs to be adjusted. A sample of 150 items selected from a population of 4,000 invoices at the end of a period of time revealed that in 13 cases staff failed to adjust the GST amount correctly. The amounts (in dollars) by which GST was overcharged in these 13 cases are: < DISCOUNT >
6.45 15.32 97.36 230.63 104.18 84.92 132.76 66.12 26.55 129.43 88.32 47.81 89.01
Construct a 99% confidence interval estimate of the population total amount of GST overcharged.
8.53 Econe Pty Ltd is a small company that manufactures women's dresses for sale to specialty stores. There are 1,200 inventory items, and the historical cost is recorded on a first in, first out (FIFO) basis. In the past, approximately 15% of the inventory items were incorrectly priced. However, any misstatements were usually not significant. A sample of 120 items was selected and the historical cost of each item compared with the audited value. The results indicated that 15 items differed in their historical cost and audited value. These differences were as follows: < FIFO >

Sample number   Historical cost ($)   Audited value ($)
  5             261                   240
  9              87                   105
 17             201                   276
 18             121                   110
 28             315                   298
 35             411                   356
 43             249                   211
 51             216                   305
 60              21                   210
 73             140                   152
 86             129                   112
 95             340                   216
 96             341                   402
107             135                    97
119             228                   220

Construct a 95% confidence interval estimate of the total population difference in the historical cost and audited value.
8.54 The Snowy Ski Centre Pty Ltd conducts an annual audit of its financial records. An internal control policy for the company is that a cheque can be issued only after the accounts payable manager initials the invoice. The tolerable exception rate for this internal control is 0.04. During an audit, a sample of 300 invoices is examined from a population of 10,000 invoices and 11 invoices are found to violate the internal control.
a. Calculate the upper bound for a 95% one-sided confidence interval estimate for the rate of non-compliance.
b. Based on (a), what should the auditor conclude?


8.6  MORE ON CONFIDENCE INTERVAL ESTIMATION AND ETHICAL ISSUES

You should be aware that when sampling is done without replacement from a finite population, an adjustment to the standard error of the mean or standard error of the proportion is required. This has been included in equations 8.7, 8.10 and 8.11, where standard errors have been multiplied by the correction factor √((N − n)/(N − 1)). The correction factor is used in confidence intervals for the population mean and proportion when the sample size, n, is large in relation to the population size, N (i.e. more than 5%).
Ethical issues relating to the selection of samples and the inferences that accompany them can arise in several ways. The major ethical issue relates to whether or not confidence interval estimates are provided together with the sample statistics. Providing a sample statistic without the confidence interval limits (typically set at 95%), the sample size used and an interpretation of the meaning of the confidence interval in terms that a layperson can understand raises ethical issues because of these omissions. Failure to include a confidence interval estimate might mislead the user of the results into thinking that the point estimate is all that is needed to predict the population characteristic with certainty. Thus, it is important that you indicate the interval estimate in a prominent place in any written communication, together with a simple explanation of the meaning of the confidence interval. In addition, you should highlight the size of the sample.
Ethical issues concerning estimation most commonly occur in the publication of the results of political polls. Often the results of the polls are highlighted in a prominent part of the newspaper, while the sampling error involved and the methodology used is printed on the page where the article is continued, frequently in the middle of the newspaper in print editions or with a separate link in online ones. To ensure an ethical presentation of statistical results, the confidence levels, sample size and confidence limits should be made available for all surveys and other statistical studies.

think about this
Reporting poll results
Let's imagine that a newspaper reports the following table in both its print and online editions.

State premier's performance
               July–Sept 2016 (%)   Oct–Dec 2016 (%)   Jan–Mar 2017 (%)   Mar–Jun 2017 (%)   July–Sept 2017 (%)
Satisfied      52                   50                 48                 42                 33
Dissatisfied   33                   33                 41                 46                 57
Uncommitted    15                   17                 11                 12                 10

In the print edition it shows this extra information immediately below the table. In the online edition readers need to click on a link to see it.
Question: Are you satisfied or dissatisfied with the way the current state premier is performing?
This poll was carried out by a phone interview of the state's voters, with the number in each poll being a constant percentage of the estimated number of voters. The latest survey interviewed 1,560 voters.
Do you think the variation in display methods between the print and online editions will alter the way readers interpret the poll results? What other information is necessary for you to be able to evaluate the poll results effectively?


8  Assess your progress

Summary
This chapter has discussed confidence intervals for estimating the characteristics of a population, and explained how to determine the necessary sample size. We showed how an accountant of Callistemon Camping Supplies can use the sample data from an audit to estimate important population parameters such as the mean dollar amount on invoices and the proportion of shipments that are made without proper authorisation. To determine which equation to use for a particular situation, you need to ask several questions:



Are you developing a confidence interval or are you determining sample size? • Do you have a numerical variable or do you have a categorical variable? • If you have a numerical variable, do you know the population standard deviation? If you do, use the normal distribution. If you do not, use the t distribution. The next three chapters develop a hypothesis-testing approach that makes decisions about population parameters.

Key formulas

Confidence interval for the mean (σ known)
X̄ ± Z σ/√n   (8.1)
or
X̄ − Z σ/√n ≤ μ ≤ X̄ + Z σ/√n

Confidence interval estimate for the proportion
p ± Z √(p(1 − p)/n)   (8.3)
or
p − Z √(p(1 − p)/n) ≤ π ≤ p + Z √(p(1 − p)/n)
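The two interval formulas above can be sketched in a few lines of Python. This is an illustrative sketch only, assuming the scipy library is available for the Z value; all sample figures are invented.

import math
from scipy.stats import norm

def ci_mean_sigma_known(xbar, sigma, n, conf=0.95):
    z = norm.ppf(1 - (1 - conf) / 2)          # e.g. about 1.96 for 95%
    half_width = z * sigma / math.sqrt(n)
    return xbar - half_width, xbar + half_width

def ci_proportion(p, n, conf=0.95):
    z = norm.ppf(1 - (1 - conf) / 2)
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

print(ci_mean_sigma_known(xbar=240, sigma=25, n=100))   # around (235.1, 244.9)
print(ci_proportion(p=0.10, n=200))                     # around (0.058, 0.142)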

contains the volume (in cubic metres) from a

sample of 368 truckloads of cypress pine mulch and from a sample of 330 truckloads of cedar wood chips. a. For the cypress pine wood chips, construct a 95% confidence interval estimate of the mean volume. b. For the cedar wood chips, construct a 95% confidence interval estimate of the mean volume. c. Evaluate whether the assumption needed for (a) and (b) has been seriously violated. d. Based on the results of (a) and (b), what conclusions can you reach concerning the mean volume of the cypress pine and cedar wood chips? 8.79 The manufacturer of ‘Bondi’ and ‘Vincentia’ terracotta roof shingles provides its customers with a 50-year warranty on the product. To determine whether a shingle will last as long as the warranty period, accelerated life testing is conducted at the manufacturing plant. Accelerated life testing exposes the shingle to the stresses it would be subject to in a lifetime of normal use via a laboratory experiment that takes only a few hours to conduct. In this test, a shingle is repeatedly scraped with an abrasive and the particles that are removed are weighed (in grams). Shingles that experience small amounts of particle loss are expected to last longer in normal use than shingles that experience large amounts of particle loss. In this situation, a shingle should experience no more than 0.8 g of particle loss if it is expected to last the length of the warranty period. The data file < PARTICLE > contains a sample of 170 measurements made on the company’s ‘Bondi’ shingles, and 140 measurements made on ‘Vincentia’ shingles. a. For the ‘Bondi’ shingles, construct a 95% confidence interval estimate of the mean particle loss. b. For the ‘Vincentia’ shingles, construct a 95% confidence interval estimate of the mean particle loss. c. Evaluate whether the assumption needed for (a) and (b) has been seriously violated. d. Based on the results of (a) and (b), what conclusions can you reach concerning the mean particle loss of the ‘Bondi’ and ‘Vincentia’ shingles? 8.80 Diners have rated 14 North Island and 14 South Island New Zealand restaurants on the basis of food, presentation, service and toilets using an online review system with ratings from 1 to 10. The data file < REST_NZ > contains the ratings for each of these categories. For each island separately: a. Construct 95% confidence interval estimates for the mean food rating, mean presentation rating, mean service rating and mean toilet rating. b. What conclusions can you reach about the North and South Island restaurants from the results in (a)?

REPORT WRITING EXERCISE 8.81 Referring to the results in problem 8.77 concerning the width of a steel trough, write a report that summarises your conclusions.


Continuing cases Tasman University The Business School at Tasman University (TBU) has decided to gather data about its undergraduate students. It has created and distributed a survey of 14 questions and receives responses from 62 undergraduates (stored in < TASMAN_UNIVERSITY_BBUS_STUDENT_SURVEY >).

a For each variable included in the survey, construct a 95% confidence interval estimate for the population characteristic and write a report summarising your conclusions. Shortly afterwards, TBU decides to undertake a similar survey for graduate students. It creates and distributes a survey of 14 questions and receives responses from 44 graduate students (stored in < TASMAN_UNIVERSITY_MBA_ STUDENT_SURVEY >).

b For each variable included in the survey, construct a 95% confidence interval estimate for the population characteristic and write a report summarising your conclusions.

As Safe as Houses While working at Safe-As-Houses Real Estate, you are told the company wishes to explore variations in the average prices of properties in towns and cities. Using data in the file < REAL_ESTATE >, find a 95% confidence interval for the mean property price in each town or city in both states. Write a report that details your findings. Have you found any evidence of differences between average prices in these towns and cities?

Chapter 8 Excel Guide

EG8.1 CONFIDENCE INTERVAL ESTIMATE FOR THE MEAN (σ KNOWN)
Open the CIE_Sigma_Known workbook. This workbook already contains the entries for Example 8.1 on page 283 and uses the NORM.S.INV and CONFIDENCE.NORM functions (see Appendix D.2 for more information). To adapt this worksheet to other problems, change the population standard deviation, sample mean, sample size and confidence level values in the tinted cells in rows 4 to 7.
OR See Appendix D.2 (Confidence Interval Estimate for the Mean, sigma known) if you want PHStat to produce a worksheet for you.

EG8.2 CONFIDENCE INTERVAL ESTIMATE FOR THE MEAN (σ UNKNOWN)
Open the CIE_Sigma_Unknown workbook, shown in Figure 8.6 on page 288. The workbook uses the T.INV.2T function to determine the critical value from the t distribution (see Appendix D.3 for more information). To adapt this workbook to other problems, change the sample statistics and confidence level values in the tinted cells in rows 4 to 7.
OR See Appendix D.3 (Confidence Interval Estimate for the Mean, sigma unknown) if you want PHStat to produce a worksheet for you.

EG8.3 CONFIDENCE INTERVAL ESTIMATE FOR THE PROPORTION
Open the CIE_Proportion workbook, shown in Figure 8.10 on page 292. The workbook uses the NORM.S.INV function to determine the Z value (see Appendix D.4 for more information). To adapt this workbook to other problems, change the sample size, number of successes and confidence level values in the tinted cells in rows 4 to 6.
OR See Appendix D.4 (Confidence Interval Estimate for the Proportion) if you want PHStat to produce a worksheet for you.

EG8.4 SAMPLE SIZE DETERMINATION FOR THE MEAN
Open the Sample_Size_Mean workbook, shown in Figure 8.11 on page 296. The workbook uses the NORM.S.INV and ROUNDUP functions (see Appendix D.5 for more information). To adapt this workbook to other problems, change the population standard deviation, sampling error and confidence level values in the tinted cells in rows 4 to 6.
OR See Appendix D.5 (Sample Size Determination for the Mean) if you want PHStat to produce a worksheet for you.

EG8.5 SAMPLE SIZE DETERMINATION FOR THE PROPORTION
Open the Sample_Size_Proportion workbook, shown in Figure 8.12 on page 298. The workbook uses the NORM.S.INV and ROUNDUP functions (see Appendix D.6 for more information). To adapt this workbook to other problems, change the estimate of true proportion, sampling error and confidence level values in the tinted cells in rows 4 to 6.
OR See Appendix D.6 (Sample Size Determination for the Proportion) if you want PHStat to produce a worksheet for you.

EG8.6 CONFIDENCE INTERVAL ESTIMATE FOR THE POPULATION TOTAL
Open the CIE_Total workbook. The workbook uses the T.INV.2T function to determine the critical value from the t distribution (see Appendix D.7 for more information). To adapt this workbook to other problems, change the population size, sample mean, sample size, sample standard deviation and confidence level values in the tinted cells in rows 4 to 8.
OR See Appendix D.7 (Estimate for the Population Total) if you want PHStat to produce a worksheet for you.

EG8.7 CONFIDENCE INTERVAL ESTIMATE FOR THE TOTAL DIFFERENCE
Open the CIE_Total_Difference workbook. This two-worksheet file already contains the entries for the Callistemon Camping Supplies example used in Section 8.5. To adapt this workbook to other problems, first change the population size, sample size and confidence level values in the tinted cells in rows 4 to 6. Then select the Data worksheet and enter differences data in column A, replacing the data already there for the Section 8.5 problem. Finally, adjust the column B formulas, copying the formulas down to additional cells if you have more than 12 differences, or deleting the unneeded formulas if you have fewer than 12 differences.
OR See Appendix D.8 (Estimate for the Total Difference) if you want PHStat to produce a worksheet for you.
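For readers who prefer code to workbooks, the following hedged Python sketch mirrors what the CIE_Sigma_Known, CIE_Sigma_Unknown and Sample_Size_Mean workbooks above compute, using scipy.stats in place of the Excel functions they mention (NORM.S.INV, T.INV.2T). All figures are illustrative placeholders, not textbook data.

import math
from scipy import stats

conf = 0.95
alpha = 1 - conf

# Mean, sigma known (as in CIE_Sigma_Known): Z interval
xbar, sigma, n = 362.3, 100.0, 25
z = stats.norm.ppf(1 - alpha / 2)
print(xbar - z * sigma / math.sqrt(n), xbar + z * sigma / math.sqrt(n))

# Mean, sigma unknown (as in CIE_Sigma_Unknown): t interval with n - 1 df
xbar, s, n = 230.86, 43.92, 12
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
print(xbar - t_crit * s / math.sqrt(n), xbar + t_crit * s / math.sqrt(n))

# Sample size for the mean (as in Sample_Size_Mean): n = (Z * sigma / e)^2, rounded up
sigma, e = 25.0, 5.0
print(math.ceil((z * sigma / e) ** 2))   # 97 for these illustrative figures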


CHAPTER 9
Fundamentals of hypothesis testing: One-sample tests

PATRICIO'S PASTA CO.
You have recently been appointed to oversee quality control at Patricio's Pasta Co., which produces and packages a range of dried pasta in traditional Italian shapes. It is made from Australian durum wheat semolina, sourced from grain grown in the Narrabri region of New South Wales. The pasta is sold in 500-gram packets, and part of your job is to ensure that packets are being filled correctly and that the weight of the contents is as shown on the packet. You select and weigh a random sample of 25 filled spiral pasta packets in order to calculate a sample mean and investigate how close the weights are to the company's specification of a mean of 500 grams. You must make a decision and conclude whether (or not) the mean fill weight in the entire process is equal to 500 grams, in order to know whether the fill process needs adjustment. How could you rationally make this decision?
© Tim Hill/Alamy Stock Photo


LEARNING OBJECTIVES

After studying this chapter you should be able to:
1 identify the basic principles of hypothesis testing
2 explain the assumptions of each hypothesis-testing procedure, how to evaluate them and the consequences if they are seriously violated
3 use hypothesis testing to test a mean or proportion
4 recognise the pitfalls involved in hypothesis testing
5 identify the ethical issues involved in hypothesis testing

Unlike Chapter 7, in which the problem facing the operations manager was to determine whether the sample mean was consistent with a known population mean, this chapter’s opening scenario asks how the sample mean can validate the claim that the population mean is 500 grams. To validate the claim, you must first state the claim unambiguously. For example, the population mean is 500 grams. In the inferential method known as hypothesis testing you consider the evidence – the sample statistic – to see whether the evidence better supports the statement, called the null hypothesis, or the mutually exclusive alternative which, in this case, states that the population mean is not 500 grams. In this chapter the focus is on hypothesis testing, another aspect of statistical inference that, like confidence interval estimation, is based on sample information. A step-by-step methodology is developed that enables you to make inferences about a population parameter by analysing differences between the results observed (the sample statistic) and the results you expect to get if some underlying hypothesis is actually true. For example, is the mean weight of the retail spiral pasta packets in the sample taken at Patricio’s Pasta consistent with what you would expect if the mean of the entire population of retail packets is 500 grams? Or can you infer that the population mean is not equal to 500 grams because the sample mean is significantly different from 500 grams?

9.1  HYPOTHESIS-TESTING METHODOLOGY
The Null and Alternative Hypotheses
hypothesis testing A method of statistical inference used to make tests about the value of population parameters.

null hypothesis (H0) A statement about the value of one or more population parameters which we test and aim to disprove.

Hypothesis testing typically begins with some theory, claim or assertion about a particular parameter of a population. For example, your initial hypothesis about the pasta company example is that the process is working properly, meaning that the mean weight is 500 grams, and no corrective action is needed. The hypothesis that the population parameter is equal to the company specification is referred to as the null hypothesis. A null hypothesis is always one of status quo, and is identified by the symbol H0. Here, the null hypothesis is that the filling process is working properly and therefore the mean weight is the 500-gram specification. This is stated as:

H0: μ = 500
Even though information is available only from the sample, the null hypothesis is written in terms of the population. Remember, your focus is on the population of all retail spiral pasta packets. The sample statistic is used to make inferences about the entire filling process. One inference may be that the results observed from the sample data indicate that the null hypothesis is false. If the null hypothesis is considered false, something else must be true. Whenever a null hypothesis is specified, an alternative hypothesis is also specified, one that must be true if


the null hypothesis is false. The alternative hypothesis, H1, is the opposite of the null hypothesis, H0. This is stated in the pasta example as:
H1: μ ≠ 500
The alternative hypothesis represents the conclusion reached by rejecting the null hypothesis. The null hypothesis is rejected when there is sufficient evidence from the sample information that the null hypothesis is false. In the pasta example, if the weights of the sampled packets are sufficiently above or below the expected 500-gram mean specified by the company, you reject the null hypothesis in favour of the alternative hypothesis that the mean fill is different from 500 grams. You stop production and take whatever action is necessary to correct the problem. If the null hypothesis is not rejected, then you should continue to believe in the status quo, that the process is working correctly and that no corrective action is necessary. Note that this does not mean you have proved that the process is working correctly. Rather, you have failed to prove that it is working incorrectly and, therefore, you continue your (unproven) belief in the null hypothesis.
In the hypothesis-testing methodology, the null hypothesis is rejected when the sample evidence suggests that it is far more likely that the alternative hypothesis is true. However, failure to reject the null hypothesis is not proof that it is true. You can never prove that the null hypothesis is correct because the decision is based only on the sample information, not on the entire population. Therefore, if you fail to reject the null hypothesis, you can only conclude that there is insufficient evidence to warrant its rejection. The following key points summarise the null and alternative hypotheses:
• The null hypothesis, H0, represents the status quo or the current belief in a situation.
• The alternative hypothesis, H1, is the opposite of the null hypothesis and represents a research claim or specific inference you would like to prove.
• If you reject the null hypothesis, you have statistical proof that the alternative hypothesis is correct.
• If you do not reject the null hypothesis, you have failed to prove the alternative hypothesis. Failure to prove the alternative hypothesis, however, does not mean that you have proved the null hypothesis.
• The null hypothesis, H0, always refers to a specified/hypothesised value of the population parameter (such as μ), not a sample statistic (such as X̄).
• The statement of the null hypothesis always contains an equals sign regarding the specified value of the population parameter (e.g. H0: μ = 500 or H0: μ ≥ 400).
• The statement of the alternative hypothesis never contains an equals sign regarding the specified value of the population parameter (e.g. H1: μ > 500 or H1: μ < 400).

THE NULL AND ALTERNATIVE HYPOTHESES
You are the manager of an Internet provider's call centre for customer support. You want to determine whether the time taken to call back customers who elected to leave the phone queue has changed in the past month from its previous population mean value of 4.5 minutes. State the null and alternative hypotheses.

alternative hypothesis (H1) A statement that we aim to prove about one or more population parameters; the opposite of the null hypothesis.

LEARNING OBJECTIVE 1
Identify the basic principles of hypothesis testing

EXAMPLE 9.1

SOLUTION

The null hypothesis is that the population mean has not changed from its previous value of 4.5 minutes. This is stated as: H0: μ = 4.5 The alternative hypothesis is the opposite of the null hypothesis. Since the null hypothesis is that the population mean is 4.5 minutes, the alternative hypothesis is that the population mean is not 4.5 minutes. This is stated as: H1: μ ≠ 4.5


Determining the Test Statistic

test statistic A value derived from sample data that is used to determine whether the null hypothesis should be rejected or not.
region of rejection The range of values of the test statistic where the null hypothesis is rejected; it is also called the 'critical region'.
region of non-rejection The range of values of the test statistic where the null hypothesis cannot be rejected.

The logic behind the hypothesis-testing methodology is to determine how likely it is that the null hypothesis is true by considering the information gathered in a sample. In the Patricio's Pasta scenario, the null hypothesis is that the mean weight of spiral pasta packets in the entire filling process is 500 grams (i.e. the population parameter specified by the company). You select a sample of packets from the filling process, weigh each packet and calculate the sample mean. This statistic is an estimate of the corresponding parameter (the population mean μ). Even if the null hypothesis is in fact true, the statistic (the sample mean X̄) is likely to differ from the value of the parameter (the population mean μ) because of variation due to sampling. However, you expect the sample statistic to be close to the population parameter if the null hypothesis is true. If the sample statistic is close to the population parameter, you have insufficient evidence to reject the null hypothesis. For example, if the sample mean is 499.9, you would conclude that the population mean has not changed (i.e. μ = 500), because a sample mean of 499.9 is very close to the hypothesised value of 500. Intuitively, you think that it is likely that you could get a sample mean of 499.9 from a population whose mean is 500.
On the other hand, if there is a large difference between the value of the statistic and the hypothesised value of the population parameter, you will conclude that the null hypothesis is false. For example, if the sample mean is 420, you would conclude that the population mean is not 500 (i.e. μ ≠ 500), because the sample mean is very far from the hypothesised value of 500. In such a case you conclude that it is very unlikely to get a sample mean of 420 if the population mean is really 500. Therefore, it is more logical to conclude that the population mean is not equal to 500 and reject the null hypothesis.
Unfortunately, the decision-making process is not always so clear-cut. Determining what is 'very close' and what is 'very different' is arbitrary and without clear definitions. Hypothesis-testing methodology provides clear definitions for evaluating differences. It also enables you to quantify the decision-making process by calculating the probability of getting a given sample result if the null hypothesis is true. You calculate this probability by determining the sampling distribution for the sample statistic of interest (e.g. the sample mean) and then calculating the particular test statistic based on the given sample result. Because the sampling distribution for the test statistic often follows a well-known statistical distribution, such as the standardised normal distribution or t distribution, you can use these distributions to help determine whether the null hypothesis is true.

Regions of Rejection and Non-Rejection
The sampling distribution of the test statistic is divided into two regions, a region of rejection (sometimes called the critical region) and a region of non-rejection (see Figure 9.1). If the test statistic falls into the region of non-rejection, you do not reject the null hypothesis. In the Patricio's Pasta scenario, you see that there is insufficient evidence that the population mean fill is different from 500 grams. If the test statistic falls into the rejection region, you reject the null hypothesis. In this case, you will see that the population mean is not 500 grams.

Figure 9.1 Regions of rejection and non-rejection in hypothesis testing
[Figure: sampling distribution of X̄ centred at μ, with a region of rejection in each tail beyond the critical values and a region of non-rejection between them]


The region of rejection consists of the values of the test statistic that are unlikely to occur if the null hypothesis is true. These values are more likely to occur if the null hypothesis is false. Therefore, if a value of the test statistic falls into this rejection region, you reject the null hypothesis because that value is unlikely if the null hypothesis is true. To make a decision concerning the null hypothesis, you first determine the critical value of the test statistic. The critical value divides the non-rejection region from the rejection region. Determining this critical value depends on the size of the rejection region. The size of the rejection region is directly related to the risks involved in using only sample evidence to make decisions about a population parameter.

critical value The value in a distribution that cuts off the required probability in the tail for a given confidence level.

Risks in Decision Making Using Hypothesis Testing
When using a sample statistic to make decisions about a population parameter, there is a risk that you will reach an incorrect conclusion. You can make two different types of errors when applying hypothesis-testing methodology: a Type I error and a Type II error. A Type I error occurs if you reject the null hypothesis, H0, when in fact it is true and should not be rejected. The probability of a Type I error occurring is α. A Type II error occurs if you do not reject the null hypothesis, H0, when in fact it is false and should be rejected. The probability of a Type II error occurring is β. In the Patricio's Pasta scenario, you make a Type I error if you conclude that the population mean weight is not 500 when in fact it is 500. You make a Type II error if you conclude that the population mean weight is 500 when in fact it is not 500.

The Level of Significance (α)
The probability of committing a Type I error, denoted by α (the lower-case Greek letter alpha), is referred to as the level of significance of the statistical test. Traditionally, you control the Type I error by deciding on the risk level, α, that you are willing to have in rejecting the null hypothesis when it is true. Because you specify the level of significance before the hypothesis test is performed, the risk of committing a Type I error, α, is directly under your control. Traditionally, you select levels of 0.01, 0.05 or 0.10. The choice of a particular risk level for making a Type I error depends on the cost of making such an error. After you specify the value for α, you know the size of the rejection region because α is the probability of rejection under the null hypothesis. From this fact, you can then determine the critical value or values that divide the rejection and non-rejection regions.
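A small sketch (assuming the scipy library is available) of how the chosen level of significance fixes the critical values that divide the rejection and non-rejection regions:

from scipy.stats import norm

for alpha in (0.01, 0.05, 0.10):
    two_tail = norm.ppf(1 - alpha / 2)   # reject if |Z| exceeds this value (two-tail test)
    lower_tail = norm.ppf(alpha)         # reject if Z is below this value (lower-tail test)
    print(alpha, round(two_tail, 3), round(lower_tail, 3))
# 0.01 -> ±2.576 and -2.326;  0.05 -> ±1.960 and -1.645;  0.10 -> ±1.645 and -1.282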

Type I error The rejection of a null hypothesis that is true and should not be rejected.
Type II error The non-rejection of a null hypothesis that is false and should be rejected.

level of significance (𝛂) The probability of rejecting a null hypothesis which is in fact true.

The Confidence Coefficient
The complement of the probability of a Type I error (1 − α) is called the confidence coefficient. When multiplied by 100%, the confidence coefficient yields the confidence level that was studied when constructing confidence intervals (see Section 8.1). The confidence coefficient, 1 − α, is the probability that you will not reject the null hypothesis, H0, when it is true and should not be rejected. The confidence level of a hypothesis test is (1 − α) × 100%.

In terms of hypothesis-testing methodology, the confidence coefficient represents the probability of concluding that the value of the parameter as specified in the null hypothesis is plausible when it is true. In the Patricio’s Pasta scenario, the confidence coefficient measures the probability of concluding that the population mean weight is 500 grams when it actually is 500 grams.

confidence coefficient (1 − α) The probability of not rejecting a null hypothesis when it is true and should not be rejected.
confidence level The confidence coefficient expressed as a percentage.


LEARNING OBJECTIVE 2
Explain the consequences of different errors

risk of Type II error (𝛃) The chance that the null hypothesis will not be rejected when it is incorrect.

The Risk of Type II Error (β)
The probability of committing a Type II error is denoted by β (the lower-case Greek letter beta). Unlike the Type I error, which you control by the selection of α, the probability of making a Type II error (the risk of Type II error, β) depends on the difference between the hypothesised and actual values of the population parameter. If the difference between the hypothesised and actual values of the population parameter is large, β is small, because large differences are easier to find than small ones. For example, if the true population mean is 460 grams, there is a small chance (β) that you will conclude that the mean has not changed from 500. On the other hand, if the difference between the hypothesised and actual values of the parameter is small, the probability you will commit a Type II error is large. Thus, if the population mean is really 497 grams, you have a high probability of making a Type II error by concluding that the population mean fill amount is still 500 grams.

The Power of a Test
The complement of the probability of a Type II error (1 − β) is called the power of a statistical test.

power of a statistical test The probability that you reject the null hypothesis when it is false and should be rejected.

The power of a statistical test, 1 − β, is the probability that you will reject the null hypothesis when in fact it is false and should be rejected.

In the Patricio's Pasta scenario, the power of the test is the probability that you will correctly conclude that the mean fill amount is not 500 grams when it actually is not 500 grams. For a detailed discussion of the power of the test, see Section 9.6.

Risks in Decision Making: A Delicate Balance
Table 9.1 illustrates the results of the two possible decisions (do not reject H0 or reject H0) that you can make in any hypothesis test. Depending on the specific decision, you can make one of two types of errors or you can reach one of two types of correct conclusions.

Table 9.1 Hypothesis testing and decision making
                            Actual situation
Statistical decision        H0 true                                  H0 false
Do not reject H0            Correct decision (confidence = 1 − α)    Type II error (P(Type II error) = β)
Reject H0                   Type I error (P(Type I error) = α)       Correct decision (power = 1 − β)

You can set the size of the Type I error but it is more difficult to control the Type II error as you do not know the actual value of the parameter being estimated. One way in which you can reduce the probability of making a Type II error is by increasing the size of the sample. Large samples generally permit you to detect even very small differences between the hypothesised values and the population parameters. For a given level of α, increasing the sample size will decrease β and therefore will increase the power of the test to detect that the null hypothesis, H0, is false. However, there is always a limit to your resources, and this will affect the decision about how large a sample you can take. Thus, for a given sample size, you must consider the trade-offs between the two possible types of errors. Because you can directly control the risk of Type I error, you can reduce this risk by selecting a smaller value for α. For example, if the negative consequences associated with making a Type I error are substantial, you could select α = 0.01 instead of 0.05. However, when you decrease α, you increase β, so reducing the risk of a Type I error will result in an increased risk of Type II error. If, on the other hand, you wish to reduce β, you could select a larger value for α. Therefore, if it is important to try to avoid a Type II error, you can select an α of 0.05 or 0.10 instead of 0.01.


In the Patricio's Pasta scenario, the risk of a Type I error involves concluding that the mean fill amount has changed from the hypothesised 500 grams when, in fact, it has not changed. The risk of a Type II error involves concluding that the mean fill amount has not changed from the hypothesised 500 grams when it actually has changed. The choice of reasonable values for α and β depends on the costs inherent in each type of error. For example, if it is very costly to change the pasta-packaging process, you would want to be very confident that a change is needed before making one. In this case, the risk of a Type I error is most important and you would choose a small α. On the other hand, if you want to be very certain of detecting changes from a mean of 500 grams, the risk of a Type II error is most important and you would choose a higher level of α.
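The trade-off between β and the true value of the parameter can be sketched numerically. The following is a hedged Python illustration for a two-tail Z test of H0: μ = 500; the standard deviation of 15 grams and the sample size of 25 are assumed purely for illustration, since the chapter does not give σ at this point.

from math import sqrt
from scipy.stats import norm

mu0, sigma, n, alpha = 500, 15, 25, 0.05      # sigma = 15 g is an assumed value
se = sigma / sqrt(n)
z_crit = norm.ppf(1 - alpha / 2)
lower, upper = mu0 - z_crit * se, mu0 + z_crit * se   # non-rejection region for the sample mean

def beta(true_mu):
    # P(do not reject H0) when the true mean is true_mu
    return norm.cdf(upper, loc=true_mu, scale=se) - norm.cdf(lower, loc=true_mu, scale=se)

for true_mu in (497, 490, 460):
    b = beta(true_mu)
    print(true_mu, round(b, 4), round(1 - b, 4))   # beta shrinks and power grows as the gap widens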

Problems for Section 9.1
LEARNING THE BASICS

9.1 You use the symbol H0 for which hypothesis?
9.2 You use the symbol H1 for which hypothesis?
9.3 What symbol do you use for the level of significance or chance of committing a Type I error?
9.4 What symbol do you use for the chance of committing a Type II error?
9.5 What does 1 − β represent?
9.6 What is the relationship of α to the Type I error?
9.7 What is the relationship of β to the Type II error?
9.8 How is the power related to the probability of making a Type II error?
9.9 Why is it possible to reject the null hypothesis when in fact it is true?
9.10 Why is it possible not to reject the null hypothesis when it is false?
9.11 For a given sample size, if α is reduced from 0.05 to 0.01, what will happen to β?
9.12 For H0: μ = 100, H1: μ ≠ 100, and for a sample of size n, why is β larger if the actual value of μ is 90 than if the actual value of μ is 75?


APPLYING THE CONCEPTS
9.13 In the Australian legal system, a defendant is presumed innocent until proven guilty. Consider a null hypothesis, H0, that the defendant is innocent, and an alternative hypothesis, H1, that the defendant is guilty. A jury has two possible decisions: convict the defendant (i.e. reject the null hypothesis) or do not convict the defendant (i.e. do not reject the null hypothesis). Explain the possible consequences of committing either a Type I or a Type II error in this example.
9.14 Suppose the defendant in problem 9.13 is presumed guilty until proven innocent, as in some other judicial systems. How do the null and alternative hypotheses differ from those in problem 9.13? What are the possible consequences of committing either a Type I or a Type II error here?
9.15 The process of deciding which drugs are effective and which may in fact do harm falls in the United States to the Food and Drug Administration (FDA). In April 2017 it wrote warning letters to 14 US companies who were selling products that fraudulently claim to prevent, treat or cure cancer. It is illegal to sell such products without first demonstrating to the FDA that they are safe and effective for their stated purpose.




one-tail (or directional) test A hypothesis where the entire rejection region is contained in one tail of the sampling distribution. The test can be either upper-tail or lower-tail.

The alternative hypothesis contains the statement you are trying to prove. If you reject the null hypothesis, there is sufficient evidence that the mean freezing point of the milk is less than the natural freezing point of −0.545°C. If the conclusion of the test is 'do not reject H0', there is insufficient evidence to prove that the mean freezing point is below the natural freezing point of −0.545°C.
Step 2 You have selected a sample size of n = 25. You decide to use α = 0.05.
Step 3 Because σ is known, you use the normal distribution and the Z test statistic.
Step 4 The rejection region is entirely contained in the lower tail of the sampling distribution of the mean since you want to reject H0 only when the sample mean is significantly below −0.545°C. When the entire rejection region is contained in one tail of the sampling distribution of the test statistic, the test is called a one-tail or directional test. The test can be either lower-tail, as here, or upper-tail. When the alternative hypothesis includes the less than sign, the critical value Z must be less than zero. As shown from Table 9.2 and Figure 9.6, because the entire rejection region is in the lower tail of the standardised normal distribution and contains an area of 0.05, the critical value of the Z test statistic is −1.645, the mean of −1.64 and −1.65. The decision rule is: Reject H0 if Z < −1.645; otherwise, do not reject H0.

Table 9.2 Finding the critical value of the Z test statistic from the standardised normal distribution for a one-tail test with α = 0.05 (extracted from Table E.2 in Appendix E of this book)

Z       .00     .01     .02     .03     .04     .05     .06     .07     .08     .09
−1.8    .0359   .0351   .0344   .0336   .0329   .0322   .0314   .0307   .0301   .0294
−1.7    .0446   .0436   .0427   .0418   .0409   .0401   .0392   .0384   .0375   .0367
−1.6    .0548   .0537   .0526   .0516   .0505   .0495   .0485   .0475   .0465   .0455

Figure 9.6 One-tail test of hypothesis for a mean (σ known) at the 0.05 level of significance
[Figure: standardised normal distribution with the region of rejection (area 0.05) below the critical value Z = −1.645 and the region of non-rejection (area 0.95) above it]

Step 5 You select a sample of 25 containers of milk and calculate the sample mean freezing point to be −0.550°C. Using n = 25, X̄ = −0.550°C, σ = 0.008°C and Equation 9.1:

Z = (X̄ − μ)/(σ/√n) = (−0.550 − (−0.545))/(0.008/√25) = −3.125

Step 6 Since Z = −3.125 < −1.645, you reject the null hypothesis (see Figure 9.6). You conclude that, at the 5% significance level, the mean freezing point of the milk provided is below −0.545°C. The company should pursue an investigation of the milk supplier because the mean freezing point is significantly below that expected to occur by chance.
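A minimal Python sketch (assuming the scipy library is available) that reproduces the calculation just shown, the test statistic of −3.125 and the −1.645 critical value:

from math import sqrt
from scipy.stats import norm

mu0, xbar, sigma, n, alpha = -0.545, -0.550, 0.008, 25, 0.05
z = (xbar - mu0) / (sigma / sqrt(n))    # -3.125
z_crit = norm.ppf(alpha)                # about -1.645 (lower-tail critical value)
print(round(z, 3), round(z_crit, 3), z < z_crit)   # True -> reject H0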


The p-Value Approach
Use the five steps listed in Exhibit 9.2 to illustrate the above test using the p-value approach.
Steps 1–3 These steps are the same as for the critical value approach.
Step 4 Z = −3.125 (see step 5 of the critical value approach). Since the alternative hypothesis indicates a rejection region entirely in the lower tail of the sampling distribution of the Z test statistic, to calculate the p-value you need to find the probability that the Z value will be below the test statistic of −3.125. From Table E.2, the probability that the Z value will be below −3.125 is 0.0009 (see Figures 9.7 and 9.8).

Figure 9.7 Determining the p-value for a one-tail test
[Figure: standardised normal curve with an area of 0.0009 below Z = −3.125 and 0.9991 above it]

Figure 9.8 Microsoft Excel 2016 Z test output for the milk-production example
Milk production hypothesis
Null hypothesis μ =              −0.545
Level of significance             0.05
Population standard deviation     0.008
Sample size                       25
Sample mean                       −0.55
Standard error of the mean        0.0016    =B6/SQRT(B7)
Z test statistic                  −3.125    =(B8 - B4)/B11
Lower critical value              −1.6449   =NORM.S.INV(B5)
p-value                           0.0009    =NORM.S.DIST(B12,1)
Reject the null hypothesis

Step 5 The p-value of 0.0009 is less than α = 0.05. You reject H0. You conclude that, at 5% significance, the mean freezing point of the milk provided is below −0.545°C. The company should pursue an investigation of the milk supplier because the mean freezing point is significantly below that expected to occur by chance.
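The same decision reached via the p-value, mirroring the NORM.S.DIST cell in the worksheet above (again a sketch, with scipy assumed):

from scipy.stats import norm

z = -3.125
p_value = norm.cdf(z)                     # lower-tail area, about 0.0009
print(round(p_value, 4), p_value < 0.05)  # True -> reject H0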

EXAMPLE 9.5
A ONE-TAIL TEST FOR THE MEAN
A company that manufactures travel goods is particularly concerned that the mean weight of roll-on cabin bags should not exceed 2.4 kg, as many airlines limit the weight of carry-on luggage on their planes to 7 kg and there is strong market competition for lighter bags.


Past experience allows the assumption that the standard deviation is 30 g. A sample of 50 cabin bags is selected and the sample mean is 2.409 kg. Using the α = 0.01 level of significance, is there sufficient evidence that the population mean weight of cabin bags is greater than 2.4 kg?

SOLUTION
Using the critical value approach:
Step 1 H0: μ ≤ 2.4  H1: μ > 2.4
Step 2 You have selected a sample size of n = 50. You decide to use α = 0.01.
Step 3 Because σ is known, you use the normal distribution and the Z test statistic.
Step 4 The rejection region is entirely contained in the upper tail of the sampling distribution of the mean, since you want to reject H0 only when the sample mean is significantly above 2.4 kg. Because the entire rejection region is in the upper tail of the standardised normal distribution and contains an area of 0.01, the critical value of the Z test statistic is 2.33. The decision rule is: Reject H0 if Z > 2.33; otherwise, do not reject H0.
Step 5 You select a sample of 50 cabin bags and the sample mean weight is 2.409 kg. Using n = 50, X̄ = 2.409, σ = 0.03 and Equation 9.1:

Z = (X̄ − μ)/(σ/√n) = (2.409 − 2.4)/(0.03/√50) = 2.1213

Step 6 Since Z = 2.1213 < 2.33, you do not reject the null hypothesis. There is insufficient evidence to conclude that, at 1% significance, the population mean weight of cabin bags is above 2.4 kg.
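A short sketch of Example 9.5's upper-tail test (scipy assumed); it reproduces Z = 2.1213 and the 'do not reject' outcome.

from math import sqrt
from scipy.stats import norm

mu0, xbar, sigma, n, alpha = 2.4, 2.409, 0.03, 50, 0.01
z = (xbar - mu0) / (sigma / sqrt(n))     # about 2.1213
z_crit = norm.ppf(1 - alpha)             # about 2.326, the exact upper-tail critical value
print(round(z, 4), round(z_crit, 3), z > z_crit)   # False -> do not reject H0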

LEARNING OBJECTIVE 1
Identify the basic principles of a one-tail test

To perform one-tail tests of hypotheses, H0 and H1 must be properly formulated. A summary of the null and alternative hypotheses for one-tail tests is as follows:
1. The null hypothesis, H0, represents the status quo or the current belief in a situation.
2. The alternative hypothesis, H1, is the opposite of the null hypothesis and represents a research claim or specific inference you would like to prove.
3. If you reject the null hypothesis, you have sufficient evidence that the alternative hypothesis is correct.
4. If you do not reject the null hypothesis, then you have failed to prove the alternative hypothesis. Failure to prove the alternative hypothesis, however, does not mean that you have proved the null hypothesis.
5. The null hypothesis (H0) always refers to a specified value of the population parameter (such as μ), not to a sample statistic (such as X̄).
6. The statement of the null hypothesis always contains an equals sign regarding the specified value of the parameter (e.g. H0: μ ≥ −0.545°C).
7. The statement of the alternative hypothesis never contains an equals sign regarding the specified value of the parameter (e.g. H1: μ < −0.545°C).


think about this
Can you ever know the population standard deviation?
In Chapters 8 and 9 we have discussed estimation methods that require knowing σ, the population standard deviation. Sometimes, as for the pasta packaging example, we have reliable past experience that gives the value of σ, but this is not always the case. We have made an assumption about σ to make it easier to explain the fundamentals of confidence intervals and now hypothesis testing. With a known population standard deviation, you can use the normal distribution and calculate p-values using the tables of the normal distribution.
However, for most practical applications, you are unlikely to use a hypothesis-testing method that requires knowing σ. If you knew the population standard deviation, you would also know the population mean and would not need to form a hypothesis about the mean and then test that hypothesis. Because it is important that you understand the fundamentals of hypothesis testing when reading the rest of this book, review the first three sections carefully to understand the underlying concepts – even if you anticipate never having a practical reason to use the test represented by Equation 9.1 where the value of σ is required. We will see how to deal with an unknown σ in Section 9.4.

Problems for Section 9.3
LEARNING THE BASICS
9.33 What is the upper-tail critical value of the Z test statistic at the 0.01 level of significance?
9.34 In problem 9.33, what is your statistical decision if the calculated value of the Z test statistic is +2.39?
9.35 What is the lower-tail critical value of the Z test statistic at the 0.01 level of significance?
9.36 In problem 9.35, what is your statistical decision if the calculated value of the Z test statistic is −1.15?
9.37 Suppose that, in an upper-tail hypothesis test where you reject H0, you calculate the value of the test statistic Z to be +2.00. What is the p-value?
9.38 In problem 9.37, what is your statistical decision if you tested the null hypothesis at the 0.05 level of significance?
9.39 Suppose that, in a lower-tail hypothesis test where you reject H0, you calculate the value of the test statistic Z as −1.38. What is the p-value?
9.40 In problem 9.39, what is your statistical decision if you tested the null hypothesis at the 0.01 level of significance?
9.41 In a lower-tail hypothesis test where you reject H0, you calculate the value of the test statistic Z as +1.38. What is the p-value?
9.42 In problem 9.41, what is the statistical decision if you tested the null hypothesis at the 0.01 level of significance?

APPLYING THE CONCEPTS 9.43 The Glendale Steel Company manufactures steel bars. If the production process is working properly, it turns out steel bars

with a mean length of at least 855 mm and a standard deviation of 65 mm (as determined from engineering specifications on the production equipment involved). Longer steel bars can be used or altered, but shorter bars must be scrapped. You select a sample of 25 bars and the mean length is 832 mm. Do you need to adjust the production equipment? a. If you want to test the hypothesis at the 0.05 level of significance, what decision would you make using the critical value approach to hypothesis testing? b. If you want to test the hypothesis at the 0.05 level of significance, what decision would you make using the p-value approach to hypothesis testing? c. Interpret the meaning of the p-value in this problem. d. Compare your conclusions in (a) and (b). 9.44 You are the manager of a restaurant that delivers pizza to customers. You have just changed your delivery process in an effort to reduce the mean time between the order and completion of delivery from the current 25 minutes. From past experience, you can assume that the population standard deviation is 6 minutes. A sample of 36 orders using the new delivery process yields a sample mean of 22.4 minutes. a. Using the six-step critical value approach, at the 0.05 level of significance, is there sufficient evidence that the mean delivery time has been reduced below the previous value of 25 minutes? b. At the 0.05 level of significance, use the five-step p-value approach. c. Interpret the meaning of the p-value in (b). d. Compare your conclusions in (a) and (b).


9.45 A New Zealand researcher believes that, on average, teenagers aged 16–19 living in a major city will post photographs on social network sites more than 10 times a week. Suppose she wishes to find statistical evidence to support this. Let m represent the population mean number of times 16–19-year-old teenagers in this city post photos on social network sites. a. State the null and alternative hypotheses. b. Explain in the context of the above scenario the meaning of Type I and Type II errors.

c. Suppose the researcher carries out a study in the city in which you live. Based on past studies, she assumes that the standard deviation of the number of times teenagers aged 16–19 post photos on social network sites is 1.6. She takes a sample of 100 teenagers aged 16–19 and finds that the mean number of times they post photos per week is 10.87. At the 0.01 level of significance, find whether there is sufficient evidence that the mean number of times a week photos are posted is greater than 10. Use the p-value approach. d. Interpret the meaning of the p-value in (c).

9.4  t TEST OF HYPOTHESIS FOR THE MEAN (σ UNKNOWN)
In most hypothesis-testing situations dealing with numerical data, you do not know the population standard deviation σ. Instead, you use the sample standard deviation S. If you assume that the population is normally distributed, the sampling distribution of the mean will follow a t distribution with n − 1 degrees of freedom. If the population is not normally distributed, you can still use the t test if the sample size is large enough for the Central Limit Theorem to take effect (see Section 7.2). Equation 9.2 defines the test statistic, t, for determining the difference between the sample mean, X̄, and the population mean, μ, when the sample standard deviation, S, is used.

t test of hypothesis for the mean A test about the population mean that uses a t distribution.

t TEST OF HYPOTHESIS FOR THE MEAN (σ UNKNOWN)

t = (X̄ − μ)/(S/√n)   (9.2)

where the test statistic t follows a t distribution having n − 1 degrees of freedom

To illustrate the use of this t test, return to the scenario about Callistemon Camping in Chapter 8. Over the past five years, the mean amount per sales invoice was $240. As the accountant for the company, you need to inform the financial controller if this amount changes. In other words, the hypothesis test is used to try to prove that the mean amount per sales invoice is increasing or decreasing.

The Critical Value Approach
To perform this two-tail hypothesis test, use the six-step method listed in Exhibit 9.1.

LEARNING OBJECTIVE 2
Explain the assumptions of a test with sigma unknown

Step 1 H0: μ = $240  H1: μ ≠ $240
The alternative hypothesis contains the statement you are trying to prove. If the null hypothesis is rejected, you will have sufficient evidence that the mean amount per sales invoice is no longer $240. If the statistical conclusion is 'do not reject H0', then you will conclude that there is insufficient evidence to prove that the mean amount differs from the long-term mean of $240.
Step 2 You have selected a sample of n = 12. You decide to use α = 0.05.
Step 3 Because σ is unknown, you use the t distribution and the t test statistic for this example. You must assume that the population of sales invoices is normally distributed. This assumption is discussed on page 337.


Step 4 For a given sample size n, the test statistic t follows a t distribution with n − 1 degrees of freedom. The critical values of the t distribution with 12 − 1 = 11 degrees of freedom are found in Table E.3, as illustrated in Figure 9.9 and Table 9.3. Because the alternative hypothesis H1 that μ ≠ $240 is non-directional, the area in the rejection region of the t distribution's left (lower) tail is 0.025, and the area in the rejection region of the t distribution's right (upper) tail is also 0.025. From the t table as given in Table E.3, a portion of which is shown in Table 9.3, the critical values are ±2.2010. The decision rule is: Reject H0 if t < −t11 = −2.2010 or if t > t11 = +2.2010; otherwise, do not reject H0.

Figure 9.9 Testing a hypothesis about the mean (σ unknown) at the 0.05 level of significance with 11 degrees of freedom
[Figure: t11 distribution with regions of rejection (area 0.025 each) below the critical value −2.2010 and above +2.2010, a region of non-rejection (area 0.95) between them, and the corresponding X̄ scale centred at $240]

Table 9.3 Determining the critical value from the t table for an area of 0.025 in each tail with 11 degrees of freedom (extracted from Table E.3 in Appendix E of this book)

                         Upper-tail areas
Degrees of freedom    .25      .10      .05      .025      .01       .005
 1                    1.0000   3.0777   6.3138   12.7062   31.8207   63.6574
 2                    0.8165   1.8856   2.9200    4.3027    6.9646    9.9248
 3                    0.7649   1.6377   2.3534    3.1824    4.5407    5.8409
 4                    0.7407   1.5332   2.1318    2.7764    3.7469    4.6041
 5                    0.7267   1.4759   2.0150    2.5706    3.3649    4.0322
 6                    0.7176   1.4398   1.9432    2.4469    3.1427    3.7074
 7                    0.7111   1.4149   1.8946    2.3646    2.9980    3.4995
 8                    0.7064   1.3968   1.8595    2.3060    2.8965    3.3554
 9                    0.7027   1.3830   1.8331    2.2622    2.8214    3.2498
10                    0.6998   1.3722   1.8125    2.2281    2.7638    3.1693
11                    0.6974   1.3634   1.7959    2.2010    2.7181    3.1058
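A one-line check (scipy assumed) of the Table 9.3 entry used in Step 4, the upper 0.025 critical value of a t distribution with 11 degrees of freedom:

from scipy.stats import t

print(round(t.ppf(0.975, df=11), 4))   # about 2.201, so the critical values are ±2.2010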

Step 5 The following data < INVOICES > are the amounts (in dollars) in a random sample of 12 sales invoices:

308.98  162.22  220.15  243.46  227.80  172.60
215.80  242.97  188.20  286.35  268.94  232.90

Using Equations 3.1 and 3.10, or the Microsoft Excel output of Figure 9.10:

X̄ = Σ Xi / n = $230.86   and   S = √[ Σ (Xi − X̄)² / (n − 1) ] = $43.92




From Equation 9.2:

t = (X̄ − μ)/(S/√n) = (230.86 − 240)/(43.92/√12) = −0.72

Step 6 Since −2.201 < t = −0.72 < 2.201, you do not reject H0. You have insufficient evidence to conclude that, at the 5% level of significance, the mean amount per sales invoice differs from $240. You should inform the financial controller that the audit suggests that the mean amount per invoice has not changed.

The p-Value Approach
Steps 1–3 These steps are the same as in the critical value approach.
Step 4 t = −0.72 (see step 5 of the critical value approach).
Step 5 The t tables are too limited to allow us to find a p-value, so use Microsoft Excel. The worksheet of Figure 9.10 gives the p-value for this two-tail test as 0.4862. Since the p-value of 0.4862 is greater than α = 0.05, you do not reject H0. The data provide insufficient evidence to conclude that the mean amount per sales invoice differs from $240. You should inform the financial controller that the audit suggests that the mean amount per invoice has not changed. The p-value indicates that if the null hypothesis is true, the probability that a sample of 12 invoices could have a monthly mean that differs by $9.14 or more from the stated $240 is 0.4862. In other words, if the mean amount per sales invoice is truly $240, there is a 48.62% chance of observing a sample mean below $230.86 or above $249.14.

Figure 9.10 Microsoft Excel 2016 worksheet for the one-sample t test of sales invoices

Mean amount per invoice hypothesis
Null hypothesis μ =             240
Level of significance           0.05
Sample size                     12
Sample mean                     230.86
Sample standard deviation       43.92
Standard error of the mean      12.6784   =B8/SQRT(B6)
Degrees of freedom              11        =B6 - 1
t test statistic                −0.7206   =(B7 - B4)/B11
Lower critical value            −2.2010   = -(T.INV.2T(B5, B12))
Upper critical value            2.2010    =T.INV.2T(B5, B12)
p-value                         0.4862    =T.DIST.2T(ABS(B13), B12)
Do not reject the null hypothesis

Note: For some earlier versions of Excel the formula in B16 is = -(TINV(B5,B12)). The formula in B18 is = TDIST(ABS(B13), B12, 2).
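The worksheet above can also be reproduced with scipy's one-sample t test applied to the 12 invoice amounts listed in Step 5; this is a sketch, not the textbook's own method, but it returns the same test statistic and p-value.

from scipy.stats import ttest_1samp

invoices = [308.98, 162.22, 220.15, 243.46, 227.80, 172.60,
            215.80, 242.97, 188.20, 286.35, 268.94, 232.90]
result = ttest_1samp(invoices, popmean=240)
print(round(result.statistic, 4), round(result.pvalue, 4))   # about -0.7206 and 0.4862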

In this example, it is incorrect to state that there is a 48.62% chance that the null hypothesis is true. This misinterpretation of the p-value is sometimes used by those not properly trained in statistics. Remember that the p-value is a conditional probability, calculated by assuming that the null


hypothesis is true. In general, it is proper to state the following: if the null hypothesis is true, then there is a (p-value) * 100% chance of observing a sample result at least as contradictory to the null hypothesis as the result observed.

Checking Assumptions
You use the one-sample t test when the population standard deviation σ is not known and is estimated using the sample standard deviation, S. (When a large sample size is available, S estimates σ precisely enough that there is little difference between the t and Z distributions. So, you can use a Z test instead of a t test when the sample size is greater than 120.) The t test is considered a classical parametric procedure, one that makes a variety of stringent assumptions that must hold to ensure the results of the test are valid. To use the one-sample t test, the data are assumed to represent a random sample from a population that is normally distributed. In practice, as long as the sample size is not very small and the population is not very skewed, the t distribution provides a good approximation to the sampling distribution of the mean when σ is unknown.
There are several ways to evaluate the normality assumption necessary for using the t test. You can observe how closely the sample statistics match the normal distribution's theoretical properties. You can also use a histogram, stem-and-leaf display, box-and-whisker plot or normal probability plot. For details on evaluating normality, see Section 6.3.
Figure 9.11 presents Microsoft Excel output that provides descriptive statistics. Because the mean is very close to the median, the points on the normal probability plot (Figure 9.12) appear to be increasing approximately in a straight line and the boxplot (Figure 9.13) appears approximately symmetrical. You can assume that the population of sales invoices is approximately normally distributed. The normality assumption is valid and therefore the auditor's results are valid.
The t test is a robust test. It does not lose power if the shape of the population departs somewhat from a normal distribution, particularly when the sample size is large enough to enable the test statistic t to be influenced by the Central Limit Theorem (see Section 7.2). However, you can make erroneous conclusions and lose statistical power if you use the t test incorrectly. If the sample size n is small (i.e. less than 30) and you cannot easily make the assumption that the underlying population is at least approximately normally distributed, other non-parametric testing procedures are more appropriate (see references 1 and 2).
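As an informal check of the normality discussion above, here is a small sketch (scipy assumed) that compares the mean and median of the invoice amounts and computes their skewness, in the spirit of the descriptive statistics shown in Figure 9.11:

from statistics import mean, median
from scipy.stats import skew

invoices = [308.98, 162.22, 220.15, 243.46, 227.80, 172.60,
            215.80, 242.97, 188.20, 286.35, 268.94, 232.90]
print(round(mean(invoices), 2), round(median(invoices), 2))   # mean and median are close
print(round(skew(invoices, bias=False), 3))                   # adjusted skewness, close to zero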

LEARNING OBJECTIVE 2: Explain the assumptions of a test with sigma unknown

robust  A test or procedure that is not seriously affected by the breakdown of assumptions.

Figure 9.11 Microsoft Excel descriptive statistics for the sales invoice data

Invoice amount
  Mean                 230.8642
  Standard error        12.67841
  Median               230.35
  Mode                 #N/A
  Standard deviation    43.9193
  Sample variance     1928.905
  Kurtosis              −0.3925
  Skewness               0.132501
  Range                146.76
  Minimum              162.22
  Maximum              308.98
  Sum                 2770.37
  Count                 12
  Largest(1)           308.98
  Smallest(1)          162.22


Figure 9.12 PHStat normal probability plot for the sales invoice data (invoice amount plotted against Z value)

Figure 9.13 PHStat boxplot for the sales invoice data (invoice amount)

Problems for Section 9.4

LEARNING THE BASICS

9.46 In a sample of n = 16 selected from a normal population, X̄ = 56 and S = 12. What is the value of the t test statistic if you are testing the null hypothesis, H0: μ = 50?
9.47 In problem 9.46, how many degrees of freedom are there in the one-sample t test?
9.48 In problems 9.46 and 9.47, what are the critical values from the t table if the level of significance α = 0.05 and the alternative hypothesis H1 is:
a. μ ≠ 50?
b. μ > 50?
9.49 In problems 9.46, 9.47 and 9.48, what is your statistical decision if the alternative hypothesis, H1, is:
a. μ ≠ 50?
b. μ > 50?
9.50 In a sample of n = 16 selected from a left-skewed population, X̄ = 65 and S = 21. Would you use the t test to test the null hypothesis, H0: μ = 60? Discuss.
9.51 In a sample of n = 160 selected from a left-skewed population, X̄ = 65 and S = 21. Would you use the t test to test the null hypothesis, H0: μ = 60? Discuss.

APPLYING THE CONCEPTS

Problems 9.52 to 9.54 can be solved manually or by using Microsoft Excel. We recommend that you use Microsoft Excel to solve problems 9.55 to 9.61.

9.52 The director of admissions at a large university advises parents of incoming students about the cost of textbooks during a typical semester. He selected a sample of 100 students and recorded their textbook expenses for the semester. He then calculated a sample mean cost of $675.60 and a sample standard deviation of $45.20.
a. Using the 0.10 level of significance, is there sufficient evidence that the population mean is above $665?
b. What is your answer in (a) if the standard deviation is $75 and the 0.05 level of significance is used?
c. What is your answer in (a) if the sample mean is $669.60 and the sample standard deviation is $45.20?
9.53 A large supermarket chain has a target average waiting time of 90 seconds for customers who queue to use the self-service checkouts. To test that a particular store is meeting this target, the waiting time for a random sample of 50 customers in this queue is recorded during a one-day trading period. The average waiting time is found to be 101 seconds with a standard deviation of 26.5 seconds. Using a 0.10 level of significance, is there sufficient evidence to show that the actual waiting time is different from the target of 90 seconds?
9.54 You are the manager of a fast-food franchise. Last month the mean waiting time at the counter, as measured from the time a customer places an order until the time the customer receives the order, was 3.7 minutes. The franchise helped you to institute a new process intended to reduce waiting time. You select a random sample of 64 orders. The sample mean waiting time is 3.57 minutes with a sample standard deviation of 0.8 minutes. At the 0.05 level of significance, is there sufficient evidence that the population mean waiting time is now less than 3.7 minutes?
9.55 A manufacturer of chocolate-coated sweets uses machines to package the sweets as they move along a filling line. Although the packages are labelled as 250 g, the company wants the packages to contain 250.4 g so that virtually none of the packages contains less than 250 g. A sample of 50 packages is selected periodically, and the packaging process is stopped if there is sufficient evidence that the mean amount packaged is different from 250.4 g. Suppose that the mean amount dispensed in a particular sample of 50 packages is 250.35 g with a sample standard deviation of 0.2 g.
a. Is there sufficient evidence that the population mean amount is different from 250.4 g? (Use a 0.05 level of significance.)
b. Calculate the p-value and interpret its meaning.
9.56 The approval process for a life insurance policy requires a review of the application and the applicant's medical history, possible requests for additional medical information and medical examinations, and a policy compilation stage where the policy pages are generated then delivered. The ability to deliver approved policies to customers in a timely manner is critical to the profitability of this service. During a period of one month, a random sample of 27 approved policies is selected < INSURANCE > and the total processing time in days recorded:
73 19 16 64 28 28 31 90 60 56 31 56 22 18 45 48 17 17 17 91 92 63 50 51 69 16 17
a. In the past, the mean processing time averaged 45 days. At the 0.05 level of significance, is there sufficient evidence that the mean processing time has changed from 45 days?
b. What assumption about the population distribution is needed in (a)?
c. Do you think that the assumption needed in (a) is seriously violated? Explain.
9.57 The following data represent the amount of soft drink in a sample of 50 consecutive 2-litre bottles. < DRINK > The results are listed horizontally in the order of bottles being filled:
2.109 2.086 2.066 2.075 2.065 2.057 2.052 2.044 2.036 2.038
2.031 2.029 2.025 2.029 2.023 2.020 2.015 2.014 2.013 2.014
2.012 2.012 2.012 2.010 2.005 2.003 1.999 1.996 1.997 1.992
1.994 1.986 1.984 1.981 1.973 1.975 1.971 1.969 1.966 1.967
1.963 1.957 1.951 1.951 1.947 1.941 1.941 1.938 1.908 1.894
a. At the 0.05 level of significance, is there sufficient evidence that the mean amount of soft drink filled is different from 2.0 litres?
b. Determine the p-value in (a) and interpret its meaning.
c. Evaluate the assumption you made in (a) graphically. Are the results of (a) valid? Why?
d. Examine the values of the 50 bottles in their sequential order as given in the problem. Is there a pattern to the results? If so, what impact might this pattern have on the validity of the results in (a)?
9.58 At a large furniture and electrical store customers usually find that the furniture on display is not held in stock. Rather than being immediately available it must be sourced from manufacturers. In the sofa department the average delivery time is expected to be six weeks after purchase. In order to test whether this target is accurate the store records the delivery times (in days) taken for 50 sofa purchases. < FURNITURE >
54 5 35 37 31 27 52 2 123 81
74 27 11 19 26 110 110 29 61 35
94 31 26 5 12 4 65 32 29 28
49 26 45 1 14 53 13 60 45 27
41 52 30 22 36 46 50 23 33 68
a. The supervisor claims that the mean number of days between receipt of the order and delivery of the sofa is 42 days or more. At the 0.05 level of significance, is there sufficient evidence that the claim is not true (i.e. that the mean number of days is less than 42)?
b. What assumption about the population distribution must you make in (a)?
c. Do you think that the assumption made in (b) is seriously violated? Explain.
d. What effect might your conclusion in (c) have on the validity of the results in (a)?
9.59 Assume that the quality control section of a company that produces chemical products has tested a number of batches for viscosity (resistance to flow). The level of viscosity is an important indicator of the quality of products such as inks, oils and resins. The data for 120 batches are in the data file. < CHEMICAL >
a. In the past, the mean viscosity was 15.5. At the 0.10 level of significance, is there sufficient evidence that the mean viscosity has changed from 15.5?
b. What assumption about the population distribution do you need to make in (a)?
c. Do you think that the assumption made in (b) has been seriously violated? Explain.
9.60 One operation of a steel mill is to cut pieces of steel into parts that are used in the frame for front seats in a car. The steel is cut with a diamond saw and the resulting parts must be within ±0.125 mm of the length specified by the car manufacturer. The data in the file < STEEL > come from a sample of 100 steel parts. The measurement reported is the difference in millimetres between the actual length of the steel part, as measured by a laser measurement device, and the specified length of the steel part. For example, a value of −0.05 represents a steel part that is 0.05 mm shorter than the specified length.
a. At the 0.05 level of significance, is there sufficient evidence that the mean difference is not equal to 0.0 mm?
b. Determine the p-value in (a) and interpret its meaning.
c. What assumption about the differences between the actual length of the steel part and the specified length of the steel part must you make in (a)?
d. Evaluate the assumption in (c) graphically. Are the results of (a) valid? Why?
9.61 Suppose the following table contains a random sample of a one-day change in value of 30 managed funds as reported in a financial newspaper. < CHANGE >

Managed fund            Value change
Ambicorp                  -0.22
Austacom Growth            0.00
Beresford Mortgage        -0.42
BNT Resources             -0.24
Capital Equities          -0.23
Commercial Property       -0.55
Commercial Value          -0.15
Dancer Growth             -0.17
Diamond Mortgage          -0.38
Doncaster Balanced        -1.50
Dubois Holdings           -0.09
DunhillMerton             -0.02
East Asia Equities        -0.06
ELine Corp                -0.18
Everton Resources         -0.26
Federated Growth          -0.20
Fidelity Balanced         -0.43
First City Prime          -0.58
Gregory Growth            -0.17
Fraser Balanced           -0.16
Jacaranda Holdings        -0.21
Moreton Holdings          -0.12
NZAR Investments          -0.46
Oakleigh Growth           -0.09
Plus50 Investments        -0.24
Shore Balanced             0.03
Suncoast Property         -0.02
WA Equity                 -0.14
WP Holdings               -0.26
ZVT Mortgage              -0.10

a. Is there sufficient evidence that the population mean fund value has changed? Use a level of significance of 0.05.
b. What assumptions are made to perform the test in (a)?
c. Determine the p-value and interpret its meaning.

LEARNING OBJECTIVE 3: Use hypothesis testing to test a proportion

sample proportion  The number of items that have some characteristic of interest divided by the size of the sample.
Z test for the proportion  A test statistic used for a test of the population proportion.

9.5  Z TEST OF HYPOTHESIS FOR THE PROPORTION

In some situations, you want to test a hypothesis about the population proportion π of values that are in a particular category rather than testing the population mean. To begin, select a random sample and calculate the sample proportion, p = X/n. Then compare the value of this statistic with the hypothesised value of the parameter π to decide whether to reject the null hypothesis. If the number of successes (X) and the number of failures (n − X) are each greater than five, the sampling distribution of a proportion approximately follows a standardised normal distribution. You use the Z test for the proportion given in Equation 9.3 to perform the hypothesis test for the difference between the sample proportion p and the hypothesised population proportion π.

ONE-SAMPLE Z TEST FOR THE PROPORTION

Z = \frac{p - \pi}{\sqrt{\pi(1 - \pi)/n}}   (9.3)

where p = X/n = (number of successes in the sample)/(sample size) = sample proportion of successes
      π = hypothesised proportion of successes in the population

The test statistic Z approximately follows a standardised normal distribution.
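If you prefer to carry out this test in code rather than by hand or in Excel, the following Python sketch implements Equation 9.3; the function name and structure are our own, and scipy is assumed to be available. The check on the numbers of successes and failures mirrors the condition stated above.

```python
from math import sqrt
from scipy import stats

def z_test_proportion(x, n, pi_0):
    """One-sample Z test for a proportion (Equation 9.3); x = number of successes.

    A sketch, not library code: returns the Z statistic and the two-tail p-value.
    """
    if min(x, n - x) <= 5:
        raise ValueError("Normal approximation needs X > 5 and n - X > 5")
    p = x / n                                    # sample proportion
    z = (p - pi_0) / sqrt(pi_0 * (1 - pi_0) / n)
    p_value = 2 * stats.norm.sf(abs(z))          # two-tail p-value
    return z, p_value
```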


Alternatively, by multiplying numerator and denominator by n, you can write the Z test statistic in terms of the number of successes X, as shown in Equation 9.4.

Z TEST FOR THE PROPORTION IN TERMS OF THE NUMBER OF SUCCESSES

Z = \frac{X - n\pi}{\sqrt{n\pi(1 - \pi)}}   (9.4)

To illustrate the one-sample Z test for a proportion, consider the following study into energy usage. A survey of households showed that 34.9% of the 842 households in the Hunter region used off-peak electric systems for their main hot water supply (IPART 2015 Household Survey of Electricity, Gas and Water Usage, Independent Pricing and Regulatory Tribunal of New South Wales, September 2016). You might suspect that choice of electric off-peak storage hot water systems is related to a number of factors, including the availability of other energy sources. Assume that you wish to test the hypothesis that in this population in the Hunter region the proportion of households that use off-peak is 0.36. Alternatively, the proportion may be higher or lower. The null and alternative hypotheses should be:

H0: π = 0.36 (i.e. the proportion of households which use electric off-peak storage hot water is 0.36)
H1: π ≠ 0.36 (i.e. the proportion of households which use electric off-peak storage hot water is not 0.36)

The Critical Value Approach

Because you are interested in whether or not the proportion of households which use off-peak electric storage hot water is 0.36, you use a two-tail test. If you select the α = 0.05 level of significance, the rejection and non-rejection regions are set up as in Figure 9.14, and the decision rule is:

Reject H0 if Z < −1.96 or if Z > +1.96; otherwise, do not reject H0.

Figure 9.14 Two-tail test of hypothesis for the proportion at the 0.05 level of significance (regions of rejection below Z = −1.96 and above Z = +1.96; region of non-rejection between, with area 0.95)

From the survey we know that p = 0.349 and n = 842, so np > 5 and n(1 − p) > 5. Using Equation 9.3:

Z = \frac{p - \pi}{\sqrt{\pi(1 - \pi)/n}} = \frac{0.349 - 0.36}{\sqrt{0.36(0.64)/842}} = \frac{-0.011}{0.0165} = -0.6650

Because −1.96 < −0.6650 < 1.96, you do not reject H0. You can conclude that there is not enough evidence to show that the proportion of households in the population which use off-peak electric hot water is other than 0.36. Figure 9.15 presents a Microsoft Excel 2016 worksheet for these data. Note that some earlier versions of Excel replace the formulas in cells B14:B16 with NORMSINV and NORMSDIST.

The p-Value Approach

An alternative approach to making a hypothesis-testing decision is to calculate the p-value. For this two-tail test, in which the rejection region is located in the lower tail and the upper tail, you need to find the area below a Z value of −0.6650 and above a Z value of +0.6650. Figure 9.15 reports a p-value of 0.5061. As this is larger than the selected level of significance (α = 0.05), you cannot reject the null hypothesis. The large p-value shows that it is very likely that a sample proportion of 0.349 could be obtained when the population proportion is 0.36.

Figure 9.15 Microsoft Excel 2016 worksheet for the study of energy usage

Off-peak storage hypothesis test

Data
  Null hypothesis π =      0.36
  Level of significance    0.05
  Sample proportion        0.349
  Sample size              842

Intermediate calculations
  Standard error           0.0165    =SQRT(B4*(1 − B4)/B7)
  Z test statistic         −0.665    =(B6 − B4)/B10

Two-tail test
  Lower critical value     −1.9600   =NORM.S.INV(B5/2)
  Upper critical value     1.9600    =NORM.S.INV(1 − B5/2)
  p-value                  0.5061    =2*(1 − NORM.S.DIST(ABS(B11),1))
  Do not reject the null hypothesis
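The same figures can be reproduced outside Excel; the short Python sketch below is ours (scipy assumed) and mirrors the worksheet calculations for the Hunter region data.

```python
from math import sqrt
from scipy import stats

# Hunter region off-peak hot water example (figures taken from the text above)
pi_0, p, n = 0.36, 0.349, 842

std_error = sqrt(pi_0 * (1 - pi_0) / n)   # about 0.0165
z = (p - pi_0) / std_error                # about -0.665
p_value = 2 * stats.norm.sf(abs(z))       # two-tail p-value, about 0.506

print(f"Z = {z:.4f}, p-value = {p_value:.4f}")
# Matches Figure 9.15 to rounding: Z = -0.665, p-value = 0.5061
```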

EXAMPLE 9.6  TESTING A HYPOTHESIS FOR A PROPORTION

A fast-food chain has just developed a new process to make sure that orders are served correctly. Using the previous process, orders were served correctly 88% of the time. A sample of 100 orders using the new process is selected and 92 are served correctly. At the 0.01 level of significance, can you conclude that the new process has increased the proportion of orders served correctly?

SOLUTION

The null and alternative hypotheses are:
H0: π ≤ 0.88 (i.e. the proportion of orders served correctly is less than or equal to 0.88)
H1: π > 0.88 (i.e. the proportion of orders served correctly is greater than 0.88)

Using Equation 9.3:

p = \frac{X}{n} = \frac{92}{100} = 0.92

Z = \frac{p - \pi}{\sqrt{\pi(1 - \pi)/n}} = \frac{0.92 - 0.88}{\sqrt{0.88(1 - 0.88)/100}} = \frac{0.04}{0.0325} = 1.23

The p-value for Z > +1.23 is 0.1093.


Using the critical value approach, you reject H0 if Z > 2.33. Using the p-value approach, you reject H0 if the p-value < 0.01. Since Z = +1.23 < +2.33 (equivalently, the p-value = 0.1093 > 0.01), you do not reject H0. You conclude that, at the 1% level of significance, there is insufficient evidence that the new process has increased the proportion of correct orders above 0.88.
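A minimal Python sketch (ours, with scipy assumed) confirms the calculations in Example 9.6:

```python
from math import sqrt
from scipy import stats

# Example 9.6: one-tail (upper) Z test for a proportion, figures from the example above
pi_0, x, n, alpha = 0.88, 92, 100, 0.01

p = x / n
z = (p - pi_0) / sqrt(pi_0 * (1 - pi_0) / n)   # about 1.23
p_value = stats.norm.sf(z)                     # upper-tail p-value, about 0.109
z_crit = stats.norm.ppf(1 - alpha)             # about 2.33

print(f"Z = {z:.2f}, critical value = {z_crit:.2f}, p-value = {p_value:.4f}")
# Z is below the critical value and the p-value exceeds alpha, so do not reject H0
```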

Problems for Section 9.5

LEARNING THE BASICS

9.62 If, in a random sample of 400 items, 88 are defective, what is the sample proportion of defective items?
9.63 In problem 9.62, if the null hypothesis is that 20% of the items in the population are defective, what is the value of the Z test statistic?
9.64 In problems 9.62 and 9.63, suppose you are testing the null hypothesis H0: π = 0.20 against the two-tail alternative hypothesis H1: π ≠ 0.20 and you choose the level of significance of α = 0.05. What is your statistical decision?

APPLYING THE CONCEPTS

9.65 Assume that an article in a weekend newspaper implies that more than half of all Sydney residents would prefer tolls on all motorways to be reduced by 25 cents, rather than receiving a $100 lower annual registration cost for their cars. Also assume that this was based on a telephone poll where 593 of 1,040 participants indicated that they would rather have the reduced motorway tolls.
a. At the 0.05 level of significance, is there sufficient evidence based on the survey data that more than half of all Sydney residents would rather have motorway tolls reduced by 25 cents than have their annual car registration cost lowered by $100?
b. Calculate the p-value and interpret its meaning.
9.66 The ABS reported that 30.6% of those who were unemployed at June 2017 were looking for only part-time work (Australian Bureau of Statistics, Labour Force, Australia, Detailed – Electronic Delivery, Cat. No. 6291.0.55.001, June 2017). Assume that at a later date 975 unemployed people are interviewed and 315 indicate they are looking for only part-time work. Conduct a test to determine whether the proportion of unemployed who are looking only for part-time work has changed. Use the six-step hypothesis-testing method and a 0.05 level of significance.
9.67 The Environmental Protection Authority in New South Wales published a paper which discussed plastic shopping bags. The paper referred to a 2015 Omnipoll survey which showed 64% of NSW respondents supported a total ban on single-use plastic shopping bags (Environmental Protection Authority, Plastic Shopping Bags: Options Paper, EPA 2016, accessed 27 April 2018). Assume that a follow-up study is undertaken to see whether the support for a ban on single-use plastic bags in NSW has increased. A random sample of 300 shoppers is surveyed and 205 support a ban on single-use plastic shopping bags.
a. At the 0.05 level of significance, use the six-step hypothesis-testing method to test whether the proportion of shoppers supporting a ban is significantly higher.
b. Use the five-step p-value approach. Interpret the meaning of the p-value.
c. Repeat (a) and (b) using a sample size of 1,000 and the same sample proportion.
d. Discuss the effect that sample size had on the outcome of this analysis and, in general, on the effect sample size plays in hypothesis testing.
9.68 The Australian Bureau of Statistics reported that, for the year 2015–16, 38.2% of businesses had a social media presence (Australian Bureau of Statistics, Business Use of Information Technology, 2015–16, Cat. No. 8129.0). Assume a recent survey has been carried out of 3,996 Australian businesses. Results show that 1,595 of them have a social media presence.
a. At the 0.05 level of significance, use the six-step hypothesis-testing method to try to prove that the percentage of businesses with a social media presence has increased from 38.2%.
b. Use the five-step p-value approach. Interpret the meaning of the p-value.
9.69 The QILT 2016 Graduate Outcomes Survey found that 70.9% of undergraduates had found full-time work within four months of completing their degrees. Now imagine that a researcher carries out a follow-up study by surveying a representative sample of 1,000 recent graduates from undergraduate degree courses and finds that 68.1% have found full-time work within four months of completing their degrees.
a. At the 0.01 level of significance, can you state that the percentage of new graduates who have found full-time work within four months of completing their degrees has decreased since 2016?
b. At the 0.05 level of significance, can you state that the proportion of new graduates who have found full-time work within four months of completion has changed since 2016?
c. What conditions needed to be met in order to answer parts (a) and (b)?


LEARNING OBJECTIVE 2: Explain how to evaluate the assumptions

9.6  THE POWER OF A TEST

Section 9.1 defined Type I and Type II errors and their associated risks. Recall that α represents the probability that you reject the null hypothesis when it is true and should not be rejected, and β represents the probability that you do not reject the null hypothesis when it is false and should be rejected. The power of the test, 1 − β, is the probability that you correctly reject a false null hypothesis. This probability depends on how different the actual population mean is from the value being hypothesised (under H0), the value of α used and the sample size. If there is a large difference between the population mean and the hypothesised mean, the power of the test will be much greater than if the difference between the population mean and the hypothesised mean is small. Selecting a larger value of α makes it easier to reject H0 and therefore increases the power of a test. Increasing the sample size increases the precision in the estimates and therefore increases the ability to detect differences in the parameters and increases the power of a test.

In this section, the power of a statistical test is illustrated using the Patricio's Pasta scenario. The packaging process is subject to periodic inspection from a representative of the trade measurement office. The representative's job is to detect the possible 'short weighting' of packets, where packets having less than the specified 500 grams are sold. Thus, the representative is interested in determining whether there is sufficient evidence that the pasta packets have a mean weight that is less than 500 grams. The null and alternative hypotheses are as follows:

H0: μ ≥ 500 (packaging process is working properly)
H1: μ < 500 (packaging process is not working properly)

The representative is willing to accept the company's claim that the standard deviation σ equals 15 grams. Therefore, you can use the Z test. Using Equation 9.1, with X̄_L (the lower critical value) substituted for X̄, you can find the value of X̄ that enables you to reject the null hypothesis:

Z = \frac{\bar{X}_L - \mu}{\sigma/\sqrt{n}}

Z \frac{\sigma}{\sqrt{n}} = \bar{X}_L - \mu

\bar{X}_L = \mu + Z \frac{\sigma}{\sqrt{n}}

Because you have a one-tail test with a level of significance of 0.05, the value of Z is equal to −1.645 (see Figure 9.16). The sample size n = 25. Therefore:

\bar{X}_L = 500 + (-1.645)\frac{15}{\sqrt{25}} = 500 - 4.935 = 495.065

Figure 9.16 Determining the lower critical value for a one-tail Z test for a population mean at the 0.05 level of significance (region of rejection below Z_L = −1.645, area 0.05; region of non-rejection above, area 0.95; μ = 500)


The decision rule for this one-tail test is: Reject H0 if X̄ < 495.065; otherwise, do not reject H0.

The decision rule states that if, in a random sample of 25 packets, the sample mean is less than 495.065 grams, you reject the null hypothesis, and the representative concludes that the process is not working properly. The power of the test measures the probability of concluding that the process is not working properly for differing values of the true population mean. What is the power of the test if the actual population mean is 492 grams, for example? To determine the chance of rejecting the null hypothesis when the population mean is 492 grams, you need to determine the area under the normal curve below X̄_L = 495.065 grams with the population mean μ = 492. Using Equation 9.1:

Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} = \frac{495.065 - 492}{15/\sqrt{25}} = +1.02

From Table E.2, there is an 84.61% chance that the Z value is less than +1.02. This is the power of the test where μ = 492 is the actual population mean (see Figure 9.17). The probability (β) that you will not reject the null hypothesis (μ = 500) is 1 − 0.8461 = 0.1539. Thus, the probability of committing a Type II error is 15.39%.

Figure 9.17 Determining the power of the test and the probability of a Type II error when μ = 492 g (power = 0.8461, β = 0.1539, X̄_L = 495.065, Z = +1.02)
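The following Python sketch (ours, not the text's) packages this calculation so you can compute the power for any true mean; the function name and default arguments reflect the pasta example above, and it also reproduces the two calculations that follow.

```python
from scipy import stats

def power_lower_tail(mu_true, mu_0=500.0, sigma=15.0, n=25, alpha=0.05):
    """Power of the one-tail (lower) Z test H0: mu >= mu_0 vs H1: mu < mu_0.

    Sketch of the textbook calculation: find the critical sample mean X_L,
    then the probability of the sample mean falling below it when the true mean is mu_true.
    """
    se = sigma / n ** 0.5
    x_L = mu_0 + stats.norm.ppf(alpha) * se        # lower critical value, about 495.065
    power = stats.norm.cdf((x_L - mu_true) / se)   # P(X-bar < X_L | mu = mu_true)
    return power

for mu in (492, 484, 499):
    print(mu, round(power_lower_tail(mu), 4))
# About 0.846, 0.9999 and 0.095; the text, rounding Z to two decimals,
# reports 0.8461, 0.99989 and 0.0951 (Figures 9.17 to 9.19)
```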

Now that you have determined the power of the test if the population mean is equal to 492, you can calculate the power for any other value of μ. For example, what is the power of the test if the population mean is 484 grams? Assuming the same standard deviation, sample size and level of significance, the decision rule is: Reject H0 if X̄ < 495.065; otherwise, do not reject H0. Once again, because you are testing a hypothesis for a mean, from Equation 9.1:

Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}

If the population mean shifts down to 484 grams (see Figure 9.18), then:

Z = \frac{495.065 - 484}{15/\sqrt{25}} = +3.69

Figure 9.18 Determining the power of the test and the probability of a Type II error when μ = 484 g (power = 0.99989, β = 0.00011, X̄_L = 495.065, Z = +3.69)

From Table E.2, there is a 99.989% chance that the Z value is less than +3.69. This is the power of the test when the population mean is 484. The probability (β) that you will not reject the null hypothesis (μ = 500) is 1 − 0.99989 = 0.00011. Thus, the probability of committing a Type II error is only 0.011%.

In the preceding two examples the power of the test is high and the chance of committing a Type II error is low. The next example calculates the power of the test when the population mean is equal to 499 grams, a value that is very close to the hypothesised mean of 500 grams. Once again, from Equation 9.1:

Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}

If the population mean is equal to 499 grams (see Figure 9.19), then:

Z = \frac{495.065 - 499}{15/\sqrt{25}} = -1.31

Figure 9.19 Determining the power of the test and the probability of a Type II error when μ = 499 g (power = 0.0951, β = 0.9049, X̄_L = 495.065, Z = −1.31)

From Table E.2, the probability of obtaining a Z value less than −1.31 is 0.0951 (or 9.51%). Because the rejection region is in the lower tail of the distribution, the power of the test is 9.51% and the chance of making a Type II error is 90.49%.


Figure 9.20 illustrates the power of the test for various possible values of μ1 (including the three values examined). This graph is called a power curve.

power curve  A graph showing the power of the test for various actual values of the population parameter.

Figure 9.20 Power curve of the pasta-packaging process for H1: μ < 500 g (power plotted against possible values for μ from 484 to 500 grams; the power decreases from 0.99989 at μ = 484 through 0.8461 at μ = 492 and 0.0951 at μ = 499, approaching 0.0500 at μ = 500)

From Figure 9.20, you can see that the power of this one-tail test increases sharply (and approaches 100%) as the population mean takes on values further below the hypothesised mean of 500 grams. Clearly, for this one-tail test, the smaller the actual mean μ, the greater the power to detect this difference. (For situations involving one-tail tests in which the actual mean μ is greater than the hypothesised mean, the converse is true: the larger the actual mean μ compared with the hypothesised mean, the greater the power. For two-tail tests, the greater the distance between the actual mean μ and the hypothesised mean, the greater the power of the test.) For values of μ close to 500 grams, the power is small because the test cannot effectively detect small differences between the actual population mean and the hypothesised value of 500 grams. When the population mean approaches 500 grams, the power of the test approaches α, the level of significance (which is 0.05 in this example).

Figure 9.21, overleaf, summarises the calculations for the three cases. You can see the drastic changes in the power of the test for differing values of the actual population mean by reviewing the different panels of Figure 9.21. From panels A and B you can see that, when the population mean does not greatly differ from 500 grams, the chance of rejecting the null hypothesis, based on the decision rule involved, is not large. However, once the population mean shifts substantially below the hypothesised 500 grams, the power of the test greatly increases, approaching its maximum value of 1 (or 100%).

In the above discussion, a one-tail test with α = 0.05 and n = 25 was used. The type of statistical test (one-tail versus two-tail), the level of significance and the sample size all affect the power. Three basic conclusions regarding the power of the test are:
1. A one-tail test is more powerful than a two-tail test.
2. An increase in the level of significance (α) results in an increase in power. A decrease in α results in a decrease in power.
3. An increase in the sample size n results in an increase in power. A decrease in the sample size n results in a decrease in power.
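A short extension of the earlier sketch (again ours, redefining the same assumed helper function so it stands alone) illustrates conclusions 2 and 3 numerically: lowering α reduces power, while increasing n raises it.

```python
from scipy import stats

def power_lower_tail(mu_true, mu_0=500.0, sigma=15.0, n=25, alpha=0.05):
    """Power of the lower-tail Z test for a mean (same calculation as in Section 9.6)."""
    se = sigma / n ** 0.5
    x_L = mu_0 + stats.norm.ppf(alpha) * se
    return stats.norm.cdf((x_L - mu_true) / se)

# Power at a true mean of 492 g for different choices of alpha and n
print(round(power_lower_tail(492, alpha=0.05, n=25), 4))   # about 0.846
print(round(power_lower_tail(492, alpha=0.01, n=25), 4))   # smaller alpha, lower power (about 0.63)
print(round(power_lower_tail(492, alpha=0.05, n=100), 4))  # larger n, higher power (about 0.9999)
```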


Figure 9.21 Determining statistical power for varying values of the population mean

Panel A  Given: α = 0.05, σ = 15, n = 25, one-tail test, μ = 500 (null hypothesis is true)
  X̄_L = 500 − (1.645)(15/√25) = 495.065
  Decision rule: Reject H0 if X̄ < 495.065; otherwise do not reject.
  (Region of rejection below X̄_L, area α = 0.050; region of non-rejection above, area 1 − α = 0.95.)

Panel B  Given: α = 0.05, σ = 15, n = 25, one-tail test, H0: μ = 500, μ = 499 (true mean shifts to 499 grams)
  Z = (X̄ − μ)/(σ/√n) = (495.065 − 499)/3 = −1.31
  Power = 0.0951, β = 0.9049

Panel C  Given: α = 0.05, σ = 15, n = 25, one-tail test, H0: μ = 500, μ = 492 (true mean shifts to 492 grams)
  Z = (X̄ − μ)/(σ/√n) = (495.065 − 492)/3 = +1.02
  Power = 0.8461, β = 0.1539

Panel D  Given: α = 0.05, σ = 15, n = 25, one-tail test, H0: μ = 500, μ = 484 (true mean shifts to 484 grams)
  Z = (X̄ − μ)/(σ/√n) = (495.065 − 484)/3 = +3.69
  Power = 0.99989, β = 0.00011

Problems for Section 9.6

APPLYING THE CONCEPTS

9.70 A coin-operated soft-drink machine is designed to discharge at least 200 mL of drink per cup with a standard deviation of 6 mL. If you select a random sample of 16 cups and you are willing to have an α = 0.05 risk of committing a Type I error, calculate the power of the test and the probability of a Type II error (β) if the population mean amount dispensed is actually:
a. 197 mL per cup
b. 194 mL per cup


9.71 Refer to problem 9.70. If you are willing to have an α = 0.01 risk of committing a Type I error, calculate the power of the test and the probability of a Type II error (β) if the population mean amount dispensed is actually:
a. 197 mL per cup
b. 194 mL per cup
c. Compare the results in (a) and (b) of this problem and (a) and (b) of problem 9.70. What conclusion can you reach?
9.72 Refer to problem 9.70. If you select a random sample of 25 cups and are willing to have an α = 0.05 risk of committing a Type I error, calculate the power of the test and the probability of a Type II error (β) if the population mean amount dispensed is actually:
a. 197 mL per cup
b. 194 mL per cup
c. Compare the results in (a) and (b) of this problem and (a) and (b) of problem 9.70. What conclusion can you reach?
9.73 A tyre manufacturer produces tyres that have a mean life of at least 40,000 km when the production process is working properly. Based on past experience, the standard deviation of the tyres is 5,600 km. The operations manager stops the production process if there is sufficient evidence that the mean tyre life is below 40,000 km. If you select a random sample of 100 tyres (to be subjected to destructive testing) and you are willing to have an α = 0.05 risk of committing a Type I error, calculate the power of the test and the probability of a Type II error (β) if the population mean life is actually:
a. 38,400 km
b. 39,840 km
9.74 Refer to problem 9.73. If you are willing to have an α = 0.01 risk of committing a Type I error, calculate the power of the test and the probability of a Type II error (β) if the population mean life is actually:
a. 38,400 km
b. 39,840 km
c. Compare the results in (a) and (b) of this problem and (a) and (b) of problem 9.73. What conclusion can you reach?
9.75 Refer to problem 9.73. If you select a random sample of 25 tyres and are willing to have an α = 0.05 risk of committing a Type I error, calculate the power of the test and the probability of a Type II error (β) if the population mean life is actually:
a. 38,400 km
b. 39,840 km
c. Compare the results in (a) and (b) of this problem and (a) and (b) of problem 9.73. What conclusion can you reach?
9.76 Refer to problem 9.73. If the operations manager stops the process when there is sufficient evidence that the mean life is different from 40,000 km (either less than or greater than) and a random sample of 100 tyres is selected together with a level of significance of α = 0.05, calculate the power of the test and the probability of a Type II error (β) if the population mean life is actually:
a. 38,400 km
b. 39,840 km
c. Compare the results in (a) and (b) of this problem and (a) and (b) of problem 9.73. What conclusion can you reach?

LEARNING OBJECTIVE 4: Recognise the pitfalls involved in hypothesis testing
LEARNING OBJECTIVE 5: Identify the ethical issues involved in hypothesis testing

9.7  POTENTIAL HYPOTHESIS-TESTING PITFALLS AND ETHICAL ISSUES

To this point, you have studied the fundamental concepts of hypothesis testing. You used hypothesis testing for analysing differences between sample estimates (i.e. statistics) and hypothesised population characteristics (i.e. parameters) in order to make decisions about the underlying characteristics. You have also learned how to evaluate the risks involved in making these decisions.

When planning to carry out a test of the hypothesis based on a survey, research study or designed experiment, you must ask several questions to ensure that proper methodology is used. You need to raise and answer questions like the ones below in the planning stage:
1. What is the goal of the survey, study or experiment? How can you translate the goal into a null hypothesis and an alternative hypothesis?
2. Is the hypothesis test a two-tail test or a one-tail test?
3. Can you select a random sample from the underlying population of interest?
4. What kinds of data will you collect from the sample? Are the variables numerical or categorical?
5. At what level of significance, or risk of committing a Type I error, should you conduct the hypothesis test?
6. Is the intended sample size large enough to achieve the desired power of the test for the level of significance chosen?
7. What statistical test procedure should you use and why?
8. What conclusions and interpretations can you make from the results of the hypothesis test?

To do this, consult a person with substantial statistical training early in the process. Often, an expert is consulted far too late in the process, after the data have been collected. Typically, all that you can do at such a late stage is to choose the statistical test procedure that is best for the data. You may be forced to assume that biases built into the study (because of poor planning) are negligible, but they might not be. To avoid biases, adequate controls must be built in from the beginning.

You need to distinguish between poor research methodology and unethical behaviour. Ethical considerations arise when the hypothesis-testing process is manipulated. Some of the ethical issues that arise include the data-collection method, informed consent from the human subjects, the type of test (one-tail or two-tail), the choice of the level of significance α, data snooping, the cleansing and discarding of data, and the reporting of methodology and findings.

Data-Collection Method – Randomisation

randomisation  A process used in an experiment to ensure that selection bias is avoided.

To eliminate the possibility of potential biases in the results, you must use proper data-collection methods. To draw meaningful conclusions, the data must be the outcome of a random sample from a population or from an experiment in which a randomisation process was used. Potential respondents should not be permitted to self-select for a study, nor should they be purposely selected. Aside from the potential ethical issues that may arise, such a lack of randomisation can result in serious coverage errors or selection biases that destroy the integrity of the study.

Informed Consent from Human Respondents

Ethical considerations require that any individual who is to be subjected to some 'treatment' in an experiment must be made aware of the research endeavour and any potential behavioural or physical side effects. The subject should also provide informed consent with respect to participation.

Type of Test: Two-Tail or One-Tail

If prior information is available that leads you to test the null hypothesis against a specifically directed alternative, then a one-tail test is more powerful than a two-tail test. However, if you are interested only in differences from the null hypothesis, not in the direction of the difference, the two-tail test is the appropriate procedure to use. For example, if previous research and statistical testing have already established the difference in a particular direction, or if an established scientific theory states that it is possible for results to occur in one direction only, then a one-tail test is appropriate. It is never appropriate to change the direction of a test after the data are collected.

Choice of Level of Significance, α

In a well-designed study, you select the level of significance, α, before data collection occurs. You cannot alter the level of significance after the fact to achieve a specific result. It is also good practice always to report the p-value, not just the conclusions of the hypothesis test.

Data Snooping

data snooping  Using a set of data more than once for inference or selecting a model.

Data snooping is never permissible. It is unethical to perform a hypothesis test on a set of data, look at the results and then decide on the level of significance or decide between a one-tail and two-tail test. You must make these decisions before the data are collected if the conclusions are to have meaning. In those situations in which you consult a statistician late in the process, with data already available, it is imperative that you establish the null and alternative hypotheses and choose the level of significance prior to carrying out the hypothesis test. In addition, you cannot arbitrarily change or discard extreme or unusual values in order to alter the results of the hypothesis tests.

Cleansing and Discarding of Data

Data cleansing is not data snooping. In the data-preparation stage of editing, coding and transcribing, you have an opportunity to review the data for any value whose measurement appears extreme or unusual. After reviewing the unusual observations, you should construct a stem-and-leaf display and/or a box-and-whisker plot in preparation for further data presentation and confirmatory analysis. This exploratory data-analysis stage gives you another opportunity to cleanse the data set by flagging possible outliers to double-check against the original data. In addition, the exploratory data analysis enables you to examine the data graphically with respect to the assumptions underlying a particular hypothesis test procedure.

The process of data cleansing raises a major ethical question. Should you ever remove a value from a study? The answer is a qualified yes. If you can determine that a measurement is incomplete or grossly in error because of some equipment problem or unusual behavioural occurrence unrelated to the study, you can discard the value. Sometimes you have no choice – an individual may decide to quit a study they have been participating in before a final measurement can be made. In a well-designed experiment or study you should decide, in advance, on all rules regarding the possible discarding of data.

Reporting of Findings

In conducting research, you should document both good and bad results. It is inappropriate to report the results of hypothesis tests that show statistical significance, but not those for which there is insufficient evidence in the findings. In those instances where there is insufficient evidence to reject H0, you must make it clear that this does not prove that the null hypothesis is true. What the result does indicate is that, with the sample size used, there is not enough information to disprove the null hypothesis.

Statistical Significance Versus Practical Significance

You need to make the distinction between the existence of a statistically significant result and its practical significance in the context of the field of application. Sometimes, due to a very large sample size, you will get a result that is statistically significant but has little practical significance.

For example, suppose that prior to a national marketing campaign focusing on a series of expensive television commercials, you believe that the proportion of people who recognised your brand was 0.30. At the completion of the campaign, a survey of 20,000 people indicates that 6,168 recognise your brand. A one-tail test trying to prove that the proportion is now greater than 0.30 results in a p-value of 0.0047, and the correct statistical conclusion is that the proportion of consumers recognising your brand name has now increased. Was the campaign successful? The result of the hypothesis test indicates a statistically significant increase in brand awareness, but is this increase practically important? The population proportion is now estimated at 6,168/20,000 = 0.3084 or 30.84%. This increase is less than 1% more than the hypothesised value of 30%. Did the large expenses associated with the marketing campaign produce a result with a meaningful increase in brand awareness? Because of the minimal real-world impact that an increase of less than 1% has on the overall marketing strategy, and the huge expenses associated with the marketing campaign, you should conclude that the campaign was not successful. On the other hand, if the campaign increased brand awareness by 20%, you could conclude that the campaign was successful.
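As a quick check of the figures quoted above, a few lines of Python (ours; scipy assumed) reproduce the one-tail test for the brand-awareness example:

```python
from math import sqrt
from scipy import stats

# Brand-awareness example above: H0: pi <= 0.30 vs H1: pi > 0.30
x, n, pi_0 = 6168, 20000, 0.30

p = x / n                                     # 0.3084
z = (p - pi_0) / sqrt(pi_0 * (1 - pi_0) / n)
p_value = stats.norm.sf(z)                    # upper-tail p-value

print(f"p = {p:.4f}, Z = {z:.2f}, p-value = {p_value:.4f}")
# p-value is roughly 0.005 (the text reports 0.0047): statistically significant,
# yet the estimated increase is under one percentage point
```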

Statistical Insignificance Versus Importance

In contrast to the issue of the practical significance of a statistically significant result is the situation in which an important result may not be statistically significant. In some situations (see reference 3), the lack of a large enough sample size may result in a non-significant result when in fact an important difference does exist. Consider a study that compared male and female entrepreneurship rates globally and within one city. It found a significant difference globally but not within the city, even though the entrepreneurship rates for females and for males in the two geographic areas were similar. The difference was due to the fact that the global sample size was 20 times larger than the single-city sample size.

To summarise, in discussing ethical issues concerning hypothesis testing the key is intent. You must distinguish between poor data analysis and unethical practice. Unethical practice occurs when researchers intentionally create a selection bias in data collection, manipulate the treatment of human subjects without informed consent, use data snooping to select the type of test (two-tail or one-tail) and/or level of significance to their advantage, hide the facts by discarding values that do not support a stated hypothesis or fail to report pertinent findings.


Problems for Section 9.7

APPLYING THE CONCEPTS

9.77 You wish to carry out research and then test a hypothesis about the average hours of employment undertaken by full-time university students during semester.
a. Using the questions listed in Section 9.7 as a guide, design the method of collecting data and describe the hypothesis test you will use.
b. How would you deal with ethical requirements?
9.78 A government department wishes to carry out a study to find the proportion of all 15-year-old teenagers who smoke at least once each week.
a. Using the questions listed in Section 9.7 as a guide, design a method of collecting data and describe the hypothesis test the department will use for its test.
b. What ethical issues might the department encounter?
9.79 A supermarket chain wants to test a hypothesis about the average amount spent per week on fruit and vegetables by customers across Australia.
a. Using the questions listed in Section 9.7 as a guide, design a method of collecting data and describe an appropriate hypothesis-testing method the supermarket chain could use.
b. Describe any ethical requirements that should be observed.

Assess your progress

Summary

This chapter presented the foundation of hypothesis testing. You learned how to perform Z and t tests on the population mean, and a Z test on the population proportion, and how an operations manager of a production facility can use hypothesis testing to monitor and improve a pasta-packaging process. You also learned how the significance level α is related to the Type I error. In two-tail tests the area in each tail is α/2, but in a one-tail test the single rejection region has an area of α. You found that Type II errors and the power of the test depend on the true value of the population parameter being estimated. The next two chapters build on this foundation of hypothesis testing. Figure 9.22 presents a flow chart for selecting the correct one-sample hypothesis test to use.

Figure 9.22 Flow chart for selecting a one-sample test

Introduction to hypothesis testing → Type of data?
  Categorical → Z test for the proportion
  Numerical → σ known?
    No → t test
    Yes → Z test


In deciding which test to use, ask the following questions:
• Does the test involve a numerical variable or a categorical variable? If the test involves a categorical variable, use the Z test for the proportion.
• If the test involves a numerical variable, do you know the population standard deviation? If you know the population standard deviation, use the Z test for the mean. If you do not know the population standard deviation, use the t test for the mean.
Table 9.4 provides a list of topics covered in this chapter.

Table 9.4 Summary of topics in Chapter 9

Type of analysis: Hypothesis test concerning a single parameter
  Numerical data:   Z test of hypothesis for the mean (Sections 9.2 and 9.3); t test of hypothesis for the mean (Section 9.4)
  Categorical data: Z test of hypothesis for the proportion (Section 9.5)
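As an illustration only, the decision logic in Figure 9.22 and the questions above can be written as a few lines of Python (the function and argument names are our own):

```python
def choose_one_sample_test(data_type: str, sigma_known: bool = False) -> str:
    """Select a one-sample test following Figure 9.22 (our paraphrase of the flow chart)."""
    if data_type == "categorical":
        return "Z test for the proportion"
    if data_type == "numerical":
        return "Z test for the mean" if sigma_known else "t test for the mean"
    raise ValueError("data_type must be 'categorical' or 'numerical'")

print(choose_one_sample_test("numerical", sigma_known=False))  # t test for the mean
```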

Key formulas

Z test of hypothesis for the mean (σ known)

Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}   (9.1)

t test of hypothesis for the mean (σ unknown)

t = \frac{\bar{X} - \mu}{S/\sqrt{n}}   (9.2)

One-sample Z test for the proportion

Z = \frac{p - \pi}{\sqrt{\pi(1 - \pi)/n}}   (9.3)

Z test for the proportion in terms of the number of successes

Z = \frac{X - n\pi}{\sqrt{n\pi(1 - \pi)}}   (9.4)

Key terms

alternative hypothesis (H1) 317
confidence coefficient (1 − α) 319
confidence level 319
critical value 319
data snooping 350
hypothesis testing 316
level of significance (α) 319
null hypothesis (H0) 316
one-tail (or directional) test 330
power curve 347
power of a statistical test 320
p-value 325
randomisation 350
region of non-rejection 318
region of rejection 318
risk of Type II error (β) 320
robust 337
sample proportion 340
test statistic 318
t test of hypothesis for the mean 334
two-tail test 322
Type I error 319
Type II error 319
Z test for the proportion 340
Z test of hypothesis for the mean 322
Z test statistic 322

References
1. Bradley, J. V., Distribution-Free Statistical Tests (Englewood Cliffs, NJ: Prentice Hall, 1968).
2. Daniel, W., Applied Nonparametric Statistics, 2nd edn (Boston, MA: Houghton Mifflin, 1990).
3. Seaman, J. & E. Allen, 'Not significant, but important?', Quality Progress, August 2011, 57–59.


Chapter review problems

CHECKING YOUR UNDERSTANDING

9.80 What is the difference between a null hypothesis, H0, and an alternative hypothesis, H1?
9.81 What is the difference between a Type I and a Type II error?
9.82 What is meant by the power of a test?
9.83 What is the difference between a one-tail and a two-tail test?
9.84 What is meant by a p-value?
9.85 How can a confidence interval estimate for the population mean provide conclusions to the corresponding hypothesis test for the population mean?
9.86 What is the six-step critical value approach to hypothesis testing?
9.87 What are some of the ethical issues to be concerned about when performing a hypothesis test?
9.88 In planning to carry out a test of hypothesis based on a designed experiment or research study under investigation, what are some of the questions that you need to raise to ensure that proper methodology will be used?

APPLYING THE CONCEPTS 9.89 Your company is carrying out tests on a new pharmaceutical product which has the potential to treat a serious disease. From a sample of 250 persons in the trial, 162 report improvement in their symptoms over a three-month period. You wish to carry out a test to see if the population proportion improvement rate is better than the rate for the drug currently in use, which improves the symptoms of 60% of patients. a. State the null and alternative hypotheses in statistical terms. b. Explain the risks associated with Type I and Type II errors in this context. c. Which type of error do you consider to be more serious? 9.90 TasmanTrading, a company developing resorts in holiday areas along the east coast of Australia, has developed an econometric model to help predict the profitability of sites that are being considered as locations for new resort projects. If the model predicts large profits, the company buys the proposed site and builds a resort. If the econometric model predicts small or moderate profits, the company chooses not to proceed with that site. This decision-making procedure can be expressed in the hypothesis-testing framework. The null hypothesis is that the site is not a profitable location. The alternative hypothesis is that the site is a profitable location. a. Explain the risks associated with committing a Type I error. b. Explain the risks associated with committing a Type II error. c. Which type of error do you think the executives at TasmanTrading are trying hard to avoid? Explain. d. How do changes in the rejection criterion affect the probabilities of committing Type I and Type II errors? 9.91 For the March quarter 2017, it has been reported that the bulk-billing rate under Medicare for all Medicare Benefits Schedule (MBS) services was 78.7% (Australian Government, Department of Health, Quarterly Medicare Statistics

Suppose the government proposes to make changes to scheduled fees and carries out a survey to see whether providers will bulk-bill under the new arrangements. The survey shows that 412 of 500 will bulk-bill.
a. At the 0.01 level of significance, is there sufficient evidence that the proportion of providers who bulk-bill will change from 78.7% under the new arrangements?
b. Calculate the p-value and interpret its meaning.
9.92 The owner of a petrol station wants to study petrol-purchasing habits by motorists at his station. He selects a random sample of 60 motorists during a certain week with the following results:
• Amount purchased: X̄ = 42.8 litres, S = 11.7 litres.
• 11 motorists purchased premium unleaded petrol.
a. At the 0.05 level of significance, is there sufficient evidence that the mean purchase is different from 38 litres?
b. Find the p-value in (a).
c. At the 0.05 level of significance, is there sufficient evidence that fewer than 20% of all motorists at his station purchase premium unleaded petrol?
d. What is your answer to (a) if the sample mean equals 39 litres?
e. What is your answer to (c) if seven motorists purchased premium unleaded petrol?
9.93 A private health fund auditor is assigned the task of evaluating benefits paid to members for dental consultation claims. The audit is conducted on a sample of 85 claims.
• In 10 of the consultations, an incorrect amount of reimbursement was provided.
• The amount of benefit was: X̄ = $98.95, S = $34.55.
a. At the 0.05 level of significance, is there sufficient evidence that the mean benefit is less than $100?
b. At the 0.05 level of significance, is there sufficient evidence that the proportion of incorrect benefit in the population is greater than 0.10?
c. Discuss the underlying assumptions of the test used in (a).
d. What is your answer to (a) if the sample mean equals $90?
e. What is your answer to (b) if 15 consultations had incorrect reimbursements?
9.94 A bank branch located in a commercial district of a city has developed an improved process for serving customers during the noon to 1 pm lunch period. The waiting time (defined as the time from when customers join the queue until they reach the teller's window) of all customers during this hour is recorded over a period of one week. A random sample of 15 customers is selected, and the results are as follows: < BANK1 >
4.21  5.55  3.02  5.13  4.77  2.34  3.54  3.20  4.50  6.10  0.38  5.12  6.46  6.19  3.79

a. At the 0.05 level of significance, is there sufficient evidence that the mean waiting time is less than 5 minutes? b. What assumption must hold in order to perform the test in (a)? c. Evaluate this assumption through a graphical approach. Discuss.

Copyright © Pearson Australia (a division of Pearson Australia Group Pty Ltd) 2019— 9781488617249 — Berenson/Basic Business Statistics 5e



Chapter review problems 355

d. As a customer walks into the branch office during the lunch hour, she asks the branch manager how long she can expect to wait. The branch manager replies, 'Almost certainly not longer than 5 minutes'. On the basis of the results of (a), evaluate this statement.
9.95 A manufacturer of women's tracksuits needs to check that the average height of 18–25-year-old female athletes is still 1,700 mm. The company collects measurements of 30 basketball players of this age. They are shown in the file < HEIGHTS > and below:
Heights (in millimetres)
1,870 1,728 1,656 1,610 1,634 1,784 1,522 1,696 1,592 1,662
1,866 1,764 1,734 1,662 1,734 1,774 1,550 1,756 1,762 1,866
1,820 1,744 1,788 1,688 1,810 1,752 1,680 1,810 1,652 1,736
a. At the 0.05 level of significance, is there sufficient evidence that the mean height has changed from 1,700 mm?
b. What assumption must hold in order to perform the test in (a)?
c. Evaluate this assumption through a graphical approach. Discuss.
9.96 The manufacturer of 'Bondi' and 'Vincentia' terracotta roof shingles provides its customers with a 50-year warranty on most of its products. To determine whether a shingle will last as long as the warranty period, accelerated-life testing is conducted at the manufacturing plant. Accelerated-life testing exposes the shingle, in a laboratory setting, to the stresses it would be subject to in a lifetime of normal use, via an experiment that takes only a few hours to conduct. In this test, a shingle is repeatedly scraped with an abrasive and the amount of particles removed is weighed (in grams). Shingles that experience low amounts of particle loss are expected to last longer in normal use than shingles that experience high amounts of particle loss. The data file contains a sample of 170 measurements made on the company's 'Bondi' shingles and 140 measurements made on 'Vincentia' shingles.
a. For the 'Bondi' shingles, is there sufficient evidence that the mean particle loss is significantly different from 0.50 grams?
b. Interpret the meaning of the p-value in (a).
c. For the 'Vincentia' shingles, is there sufficient evidence that the mean particle loss is significantly different from 0.50 grams?
d. Interpret the meaning of the p-value in (c).
e. Is the assumption needed for (a) and (c) seriously violated?
9.97 In hypothesis testing, the common level of significance is α = 0.05. Some might argue for a level of significance greater than 0.05. Suppose that web designers tested the proportion of potential web page visitors with a preference for a new web design over the existing one. The null hypothesis was that the population proportion of web page visitors preferring the new design was 0.50, and the alternative hypothesis was that it was not equal to 0.50. The p-value for the test was 0.20.
a. State, in statistical terms, the null and alternative hypotheses for this example.
b. Explain the risks associated with Type I and Type II errors in this case.
c. What would be the consequences if you incorrectly rejected the null hypothesis?
d. What might be an argument for raising the value of α?
e. What would you do in this situation?
f. What is your answer in (e) if the p-value equals 0.12? What if it equals 0.06?
9.98 Financial institutions utilise prediction models to predict bankruptcy. One such model is the Altman Z score model, which uses multiple corporate income and balance sheet values to measure the financial health of a company. If the model predicts a low Z score value, the company is in financial stress and is predicted to go bankrupt within the next two years. If the model predicts a moderate or high Z score value, the company is financially healthy and is predicted to be a non-bankrupt firm. This decision-making procedure can be expressed in the hypothesis-testing framework. The null hypothesis is that a company is predicted to be a non-bankrupt firm. The alternative hypothesis is that the company is predicted to be a bankrupt firm.
a. Explain the risks associated with committing a Type I error in this case.
b. Explain the risks associated with committing a Type II error in this case.
c. Which type of error do you think executives want to avoid? Explain.
d. How would changes in the model affect the probabilities of committing Type I and Type II errors?
9.99 The owner of a specialty coffee shop wants to study the coffee-purchasing habits of customers at her shop. During a certain week, she selects a random sample of 160 customers, with the following results:
• the amount spent was X̄ = $10.25, S = $2.75
• 84 customers say they 'definitely will' recommend the specialty coffee shop to family and friends
a. At the 0.05 level of significance, is there sufficient evidence that the population mean amount spent was significantly different from $9.50?
b. Determine the p-value in (a).
c. At the 0.05 level of significance, is there sufficient evidence that more than 50% of customers say they 'definitely will' recommend the specialty coffee shop to family and friends?
d. What is your answer to (a) if the sample mean equals $6.25?
e. What is your answer to (c) if 94 customers say they 'definitely will' recommend the specialty coffee shop to family and friends?

REPORT WRITING EXERCISE
9.100 Referring to the results of problem 9.96 concerning 'Bondi' and 'Vincentia' shingles, write a report that evaluates the particle loss of the two types of shingles.

Continuing cases

Tasman University
It has been observed over a long period that the weighted average mark (WAM) of undergraduate BBus students is 65 with a standard deviation of 13. However, the WAM for MBA students tends to be higher. Using the data in and < TASMAN_UNIVERSITY_MBA_STUDENT_SURVEY >:
a. Find the average WAM for students in the BBus survey and use it to test whether there has been a decrease in the population WAM. State the null and alternative hypotheses and choose a level of significance and the relevant distribution. State the test statistic, critical value and decision made.
b. Find the average and standard deviation for students in the MBA survey and use them to test the hypothesis that the population WAM is higher than 65. Follow the steps outlined in (a).
c. Write a report that states your conclusions. Include any other evidence you notice from the data that might explain why the WAMs of BBus and MBA students differ.

As Safe as Houses
Traditionally in state A, 65% of properties sold in regional city 1 have been houses, whereas in coastal city 1, houses made up 55% of real estate sales. Using the sample data in < REAL_ESTATE >, test hypotheses at the 5% level of significance to show whether these percentages have changed, and state any conditions required and your conclusions. (Hint: Use the Excel COUNTIF function to quickly identify the number of houses sold.)

Chapter 9 Excel Guide

EG9.1 Z TEST FOR THE MEAN, σ KNOWN
Open the Z_Mean workbook shown in Figure 9.5. This workbook already contains the entries for Example 9.2 concerning Patricio's Pasta. It uses the NORM.S.INV and NORM.S.DIST functions (see Appendix D.9 for more information). To adapt this workbook to other problems, change the null hypothesis, level of significance, population standard deviation, sample size and sample mean values in the tinted cells in rows 4 to 8.
OR See Appendix D.9 (Z Test for the Mean, sigma known) if you want PHStat to produce a worksheet for you.

EG9.2 t TEST FOR THE MEAN, σ UNKNOWN
Open the T_Mean workbook shown in Figure 9.10. This workbook already contains the entries for the Section 9.4 sales invoices example. It uses the T.INV.2T and T.DIST.2T functions (see Appendix D.10 for more information). To adapt this workbook to other problems, change the null hypothesis, level of significance, sample size, sample mean and sample standard deviation values in the tinted cells in rows 4 to 8.
OR See Appendix D.10 (t Test for the Mean, sigma unknown) if you want PHStat to produce a worksheet for you.
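For readers working outside Excel and PHStat, the following Python sketch (not part of this Excel guide) reproduces the calculations the Z_Mean and T_Mean workbooks perform. The numeric inputs are illustrative placeholders, not the Patricio's Pasta or sales invoices data from the textbook.

```python
# Minimal sketch of a one-sample Z test (sigma known) and t test (sigma unknown), two-tail.
import math
from scipy import stats

# Assumed, illustrative summary statistics (replace with your own):
mu0, sigma, n, xbar, alpha = 50.0, 10.0, 36, 53.2, 0.05

# Z test for the mean, sigma known
z = (xbar - mu0) / (sigma / math.sqrt(n))
z_crit = stats.norm.ppf(1 - alpha / 2)            # upper-tail critical value
p_z = 2 * stats.norm.sf(abs(z))                   # two-tail p-value
print(f"Z = {z:.4f}, critical = ±{z_crit:.4f}, p-value = {p_z:.4f}")

# t test for the mean, sigma unknown (using a sample standard deviation)
s = 9.1                                           # assumed sample standard deviation
t = (xbar - mu0) / (s / math.sqrt(n))
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
p_t = 2 * stats.t.sf(abs(t), df=n - 1)
print(f"t = {t:.4f}, critical = ±{t_crit:.4f}, p-value = {p_t:.4f}")
```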

EG9.3 Z TEST FOR THE PROPORTION
Open the Z_Proportion workbook shown in Figure 9.15. This workbook already contains the entries for the gas usage example in Section 9.5. It uses the NORM.S.INV and NORM.S.DIST functions (see Appendix D.11 for more information). To adapt this workbook to other problems, change the null hypothesis, level of significance, sample proportion and sample size values in the tinted cells in rows 4 to 7.
OR See Appendix D.11 (Z Test for the Proportion) if you want PHStat to produce a worksheet for you.
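The Z_Proportion workbook's calculation can also be sketched in a few lines of Python (not part of the textbook). The null proportion and sample counts below are assumed placeholders, not the gas usage data.

```python
# One-sample Z test for a proportion (two-tail), mirroring the Z_Proportion worksheet layout.
import math
from scipy.stats import norm

pi0, n, x, alpha = 0.50, 400, 228, 0.05    # assumed values: null proportion, sample size, successes
p = x / n                                   # sample proportion
z = (p - pi0) / math.sqrt(pi0 * (1 - pi0) / n)
z_crit = norm.ppf(1 - alpha / 2)
p_value = 2 * norm.sf(abs(z))
print(f"p = {p:.3f}, Z = {z:.4f}, critical = ±{z_crit:.4f}, p-value = {p_value:.4f}")
```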

Microsoft® product screen shots are reprinted with permission from Microsoft Corporation.

CHAPTER 10
Hypothesis testing: Two-sample tests

SWITCHING ON ELECTRICITY PRICES AND CONSUMPTION
[Photo: © Antartis|Dreamstime.com]

'A better deal in electricity is vital to keeping the lights on, delivering cheaper prices to families and businesses and sustaining jobs, particularly the thousands of jobs in our energy intensive industries.' (A. Gartrell, 'Malcolm Turnbull orders national review of electricity prices', Sydney Morning Herald, 27 March 2017)

Both the price of electricity and its consumption have become major issues for households and governments alike in recent years. Electricity forms a major component of households' expenditure; therefore, recent price rises and proposed future price hikes are of major concern to households on a tight budget. In addition, consumers now have a huge choice of electricity providers, including from 'green' energy sources, and a vast array of energy-saving devices such as energy-efficient light globes. Government policy makers also have a big impact on electricity prices and consumption, including whether electricity generation and provision is publicly or privately owned, subsidies for energy-saving devices such as insulation or solar panels, and taxes on 'dirty' sources of electricity.
You have been contracted by a consumer advocacy group to test a number of hypotheses surrounding household electricity costs and consumption patterns. Namely, you are to investigate whether there is any difference in average prices, volatility of prices and use of energy-efficient light globes between Sydney and Melbourne. Furthermore, you are to test whether a government-sponsored electricity waste awareness campaign has been effective in reducing households' electricity consumption. How would you go about designing experiments to test these hypotheses?

LEARNING OBJECTIVES
After studying this chapter you should be able to:
1 conduct hypothesis tests for the means of two independent populations
2 conduct hypothesis tests for the means of two related populations
3 conduct hypothesis tests for the variances of two independent populations
4 conduct hypothesis tests for two population proportions

Hypothesis testing provides a confirmatory approach to data analysis. In Chapter 9, the focus was on a variety of commonly used hypothesis-testing procedures that relate to a single sample of data selected from a single population. In this chapter, hypothesis testing is extended to procedures that compare statistics from two samples of data drawn from two populations. For example, is the mean electricity price in Sydney different from the mean electricity price in Melbourne?

LEARNING OBJECTIVE 1
Conduct hypothesis tests for the means of two independent populations

10.1  COMPARING THE MEANS OF TWO INDEPENDENT POPULATIONS

Z Test for the Difference between Two Means
Z test for the difference between two means: A test statistic used in hypothesis tests about the difference between means of two populations.

Suppose that you take a random sample of size n1 from the first population and a random sample of size n2 from the second population, and the data collected in each sample are from a numerical variable. In the first population, the mean is represented by the symbol μ1 and the standard deviation is represented by the symbol σ1. In the second population, the mean is represented by the symbol μ2 and the standard deviation is represented by the symbol σ2. The test statistic used to determine the difference between the population means is based on the difference between the sample means (X̄1 − X̄2). If you assume that the samples are randomly and independently selected from populations that are normally distributed, this statistic follows the standardised normal distribution. If the populations are not normally distributed, the Z test is still appropriate if the sample sizes are large enough (typically n1 ≥ 30 and n2 ≥ 30; see the Central Limit Theorem in Section 7.2). Equation 10.1 defines the Z test for the difference between two means.

Z TEST FOR THE DIFFERENCE BETWEEN TWO MEANS

Z = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}    (10.1)

where
X̄1 = mean of the sample taken from population 1
μ1 = mean of population 1
σ1² = variance of population 1
n1 = size of the sample taken from population 1
X̄2 = mean of the sample taken from population 2
μ2 = mean of population 2
σ2² = variance of population 2
n2 = size of the sample taken from population 2
The test statistic Z follows a standardised normal distribution.
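Not part of the textbook: a minimal Python sketch of Equation 10.1, computed from summary statistics. The numbers passed to the function are illustrative assumptions, not data from the book.

```python
# Z test for the difference between two means with sigma1 and sigma2 known (Equation 10.1).
import math
from scipy.stats import norm

def z_two_means(xbar1, xbar2, sigma1, sigma2, n1, n2, diff0=0.0):
    """Return the Z statistic and two-tail p-value for H0: mu1 - mu2 = diff0."""
    se = math.sqrt(sigma1**2 / n1 + sigma2**2 / n2)
    z = ((xbar1 - xbar2) - diff0) / se
    return z, 2 * norm.sf(abs(z))

# Illustrative (assumed) summary statistics:
z, p = z_two_means(xbar1=72.0, xbar2=74.5, sigma1=10.0, sigma2=11.0, n1=40, n2=45)
print(f"Z = {z:.4f}, two-tail p-value = {p:.4f}")
```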

Pooled-Variance t Test for the Difference between Two Means
pooled-variance t test: A test for the difference between two population means which assumes that the unknown population variances are equal.

In most cases the variances of the two populations are unknown. You usually have only the sample means and the sample variances. If you assume that the samples are randomly and independently selected from populations that are normally distributed and that the population variances are equal (that is, σ1² = σ2²), you can use a pooled-variance t test to determine whether there is a significant difference between the means of the two populations. If the populations are not normally distributed, the pooled-variance t test is still appropriate if the sample sizes are large enough (typically n1 ≥ 30 and n2 ≥ 30; see the Central Limit Theorem in Section 7.2).
The pooled-variance t test is so named because the test statistic pools (combines) the two sample variances S1² and S2² to calculate Sp², the best estimate of the variance common to both populations under the assumption that the two population variances are equal. To test the null hypothesis of no difference in the means of two independent populations:
H0: μ1 = μ2 or μ1 − μ2 = 0
against the alternative that the means are not the same:
H1: μ1 ≠ μ2 or μ1 − μ2 ≠ 0
you use the pooled-variance t test statistic.

POOLED-VARIANCE t TEST FOR THE DIFFERENCE BETWEEN TWO MEANS

t = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{S_p^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}    (10.2)

where

S_p^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{(n_1 - 1) + (n_2 - 1)}

and
Sp² = pooled variance
X̄1 = mean of the sample taken from population 1
S1² = variance of the sample taken from population 1
n1 = size of the sample taken from population 1
X̄2 = mean of the sample taken from population 2
S2² = variance of the sample taken from population 2
n2 = size of the sample taken from population 2
The test statistic t follows a t distribution with n1 + n2 − 2 degrees of freedom.

When the two sample sizes are equal (that is, n1 = n2), the formula for the pooled variance can be simplified to Sp² = (S1² + S2²)/2.
The pooled-variance t test statistic follows a t distribution with n1 + n2 − 2 degrees of freedom. For a given level of significance, α, in a two-tail test, you reject the null hypothesis if the calculated t test statistic is greater than the upper-tail critical value from the t distribution or if the calculated test statistic is less than the lower-tail critical value from the t distribution. Figure 10.1 displays the regions of rejection.

[Figure 10.1  Regions of rejection and non-rejection for the pooled-variance t test for the difference between the means (two-tail test)]

In a one-tail test in which the rejection region is in the lower tail, you reject the null hypothesis if the calculated test statistic is less than the lower-tail critical value from the t distribution. In a one-tail test in which the rejection region is in the upper tail, you reject the null hypothesis if the calculated test statistic is greater than the upper-tail critical value from the t distribution.
To demonstrate the use of the pooled-variance t test, return to the electricity price scenario on page 358.


The question you want to answer is whether the mean electricity price in Sydney is different from the mean electricity price in Melbourne. There are two populations of interest. The first population is the set of all possible electricity prices in Sydney (group 1). The second population is the set of all possible electricity prices in Melbourne (group 2). The first sample contains the annual price index of electricity in Sydney from 1997–98 to 2016–17, while the second sample contains the annual price index of electricity for Melbourne over the same time period. Table 10.1 contains the electricity price data for the two samples. < ELECTRICITY_PRICE >

Table 10.1  Comparing the price index of electricity in Sydney and Melbourne, 1997–98 to 2016–17

Year      Sydney  Melbourne      Year      Sydney  Melbourne
1997–98      39       49         2007–08      60       61
1998–99      39       42         2008–09      66       69
1999–00      39       43         2009–10      80       80
2000–01      43       48         2010–11      87       92
2001–02      43       53         2011–12     100      100
2002–03      44       55         2012–13     119      122
2003–04      46       55         2013–14     124      127
2004–05      49       55         2014–15     115      121
2005–06      53       55         2015–16     109      123
2006–07      56       56         2016–17     121      125

Source: Australian Bureau of Statistics, Consumer Price Index, Australia, 2017, Cat. No. 6401.0, Table 9.

The null and alternative hypotheses are:
H0: μ1 = μ2 or μ1 − μ2 = 0
H1: μ1 ≠ μ2 or μ1 − μ2 ≠ 0
Assuming that the samples are from underlying normal populations with equal variances, you can use the pooled-variance t test. The t test statistic follows a t distribution with 20 + 20 − 2 = 38 degrees of freedom. Using the α = 0.05 level of significance, you divide the rejection region into the two tails for this two-tail test (i.e. two equal parts of 0.025 each). Table E.3 shows that the critical values for this two-tail test are +2.0244 and −2.0244. As shown in Figure 10.2, overleaf, the decision rule is:
Reject H0 if t > t38 = +2.0244 or if t < −t38 = −2.0244; otherwise, do not reject H0.
From Figure 10.3, overleaf, the calculated t statistic for this test is −0.49 and the p-value is 0.62. Using Equation 10.2 and the descriptive statistics provided in Figure 10.3:

t = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{S_p^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} = \frac{(71.6 - 76.55) - 0}{\sqrt{1006.73\left(\dfrac{1}{20} + \dfrac{1}{20}\right)}} \approx -0.49
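Not part of the textbook: the pooled-variance t test for the Table 10.1 data can be reproduced in Python with scipy, which returns the same t statistic and p-value reported in Figure 10.3.

```python
# Pooled-variance t test on the Table 10.1 electricity price indices (Sydney vs Melbourne).
from scipy import stats

sydney    = [39, 39, 39, 43, 43, 44, 46, 49, 53, 56, 60, 66, 80, 87, 100, 119, 124, 115, 109, 121]
melbourne = [49, 42, 43, 48, 53, 55, 55, 55, 55, 56, 61, 69, 80, 92, 100, 122, 127, 121, 123, 125]

# equal_var=True pools the two sample variances, as in Equation 10.2.
t_stat, p_value = stats.ttest_ind(sydney, melbourne, equal_var=True)
print(f"t = {t_stat:.2f}, two-tail p-value = {p_value:.2f}")   # t ≈ -0.49, p ≈ 0.62
```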

[Figure 10.2  Two-tail test of hypothesis for the difference between the means at the 0.05 level of significance with 38 degrees of freedom]

[Figure 10.3  Microsoft Excel t test output for the two electricity prices: t test two-sample assuming equal variances; Sydney mean 71.6, variance 1017.73, observations 20; pooled variance 1006.73; hypothesised mean difference 0; df 38; t stat −0.49]

Weekly electricity consumption (kWh)

Household   Before   After   Difference (Di)*
    1         126      122          4
    2         138      134          4
    3         156      158         −2
    4         109       89         20
    5         176      145         31
    6         125      112         13
    7         134      128          6
    8         118      124         −6
    9         152      123         29
   10         133      112         21

Table 10.4  Matched pairs for household electricity consumption (kWh) before and after awareness campaign
* Difference = Before − After

The question that you must answer is whether the advertising campaign was effective. That is, is there sufficient evidence that the mean household electricity consumption is significantly lower after the advertising campaign? Thus, the null and alternative hypotheses are:
H0: μD ≤ 0 (mean household electricity consumption before the awareness campaign is less than or equal to that after the campaign)
H1: μD > 0 (mean household electricity consumption is greater before the awareness campaign)
Choosing a level of significance of α = 0.05 and assuming that the differences are normally distributed, use the paired t test (Equation 10.7). For a sample of n = 10, there are n − 1 = 9 degrees of freedom. Using Table E.3, the decision rule is:
Reject H0 if t > t9 = 1.8331; otherwise, do not reject H0.

For the n = 10 differences (see Table 10.4), the sample mean difference is:

\bar{D} = \frac{\sum_{i=1}^{n} D_i}{n} = \frac{120}{10} = 12

and:

S_D = \sqrt{\frac{\sum_{i=1}^{n} (D_i - \bar{D})^2}{n - 1}} = 12.82

From Equation 10.7:

t = \frac{\bar{D} - \mu_D}{\dfrac{S_D}{\sqrt{n}}} = \frac{12 - 0}{\dfrac{12.82}{\sqrt{10}}} = 2.96
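Not part of the textbook: the same paired test can be reproduced in Python with scipy, using the Table 10.4 data. The one-sided `alternative` argument assumes scipy 1.6 or later.

```python
# Paired t test on the Table 10.4 before/after electricity consumption data.
from scipy import stats

before = [126, 138, 156, 109, 176, 125, 134, 118, 152, 133]
after  = [122, 134, 158,  89, 145, 112, 128, 124, 123, 112]

# H1: mean(before - after) > 0, i.e. consumption fell after the campaign.
t_stat, p_value = stats.ttest_rel(before, after, alternative='greater')
print(f"t = {t_stat:.2f}, one-tail p-value = {p_value:.3f}")   # t ≈ 2.96, p ≈ 0.01
```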

Because t = 2.96 is greater than 1.8331, you reject the null hypothesis, H0 (see Figure 10.8). There is sufficient evidence that the mean household consumption of electricity declined following the awareness campaign. You can calculate this test statistic and the p-value using Microsoft Excel (see Figure 10.9). Because the p-value = 0.01 < α = 0.05, you reject H0; a probability this small means that a sample result like this would be very unlikely if the campaign had had no effect.

[Figure 10.8  One-tail paired t test at the 0.05 level of significance with 9 degrees of freedom]

[Figure 10.9  Microsoft Excel paired t test for the electricity consumption data]

…; the higher the AdIndex value, the more likeable the ad. Calculate descriptive statistics and perform a paired t test. State your findings and conclusions in a report. (Use the 0.05 level of significance.)
10.25 Most motorists believe that petrol stations put prices up on public holidays. A motorist advocacy group collected petrol price data for a sample of petrol stations on the Thursday before a public holiday and the Friday of the public holiday (measured in cents per litre). < PUBLIC_HOLIDAY >

Station      1    2    3    4    5    6    7    8    9   10
Thursday   132  123  114  142  138  128  119  138  142  133
Friday     143  155  123  156  142  165  145  155  163  133

a. At the 0.01 level of significance, is there sufficient evidence that the mean price of petrol is more expensive on a public holiday?
b. Interpret the meaning of the p-value in (a).
10.26 Does playing Lumosity increase one's IQ? We collected IQ scores for a sample of 7 individuals before and after a 30-day trial of Lumosity. < LUMOSITY >

Individual    1    2    3    4    5    6    7
Before      100  115   97   85  127  133  101
After       103  112  100   87  125  140  113

a. At the 0.01 level of significance, is there sufficient evidence that the mean IQ has increased?
b. What assumption is necessary to perform this test?
c. Find the p-value in Excel or PHStat and interpret its meaning.

LEARNING OBJECTIVE 3
Conduct hypothesis tests for the variances of two independent populations

10.3  F TEST FOR THE DIFFERENCE BETWEEN TWO VARIANCES
F distribution: A right-skewed continuous probability distribution which has as its parameters degrees of freedom in the numerator and in the denominator.
F test statistic for testing the equality of two variances: A ratio of the sample variances from two samples.

Often you need to test whether two independent populations have the same variability. One important reason to test for the difference between the variances of two populations is to determine whether the pooled-variance t test is appropriate. The test for the difference between the variances of two independent populations is based on the ratio of the two sample variances. If you assume that each population is normally distributed, then the ratio S1²/S2² follows the F distribution (see Table E.5). Note that this distribution is not symmetrical, as the normal and t distributions are. The critical values of the F distribution in Table E.5 depend on two sets of degrees of freedom: the degrees of freedom in the numerator of the ratio are for the first sample, and the degrees of freedom in the denominator are for the second sample. Equation 10.9 defines the F test statistic for testing the equality of two variances.

F STATISTIC FOR TESTING THE EQUALITY OF TWO VARIANCES
The F test statistic is equal to the variance of sample 1 divided by the variance of sample 2.

F = \frac{S_1^2}{S_2^2}    (10.9)

where
S1² = variance of sample 1
S2² = variance of sample 2
n1 = size of sample taken from population 1
n2 = size of sample taken from population 2
n1 − 1 = degrees of freedom from sample 1 (i.e. the numerator degrees of freedom)
n2 − 1 = degrees of freedom from sample 2 (i.e. the denominator degrees of freedom)
The test statistic F follows an F distribution with n1 − 1 and n2 − 1 degrees of freedom.

For a given level of significance of α, to test the null hypothesis of equality of variances:
H0: σ1² − σ2² = 0
against the alternative hypothesis that the two population variances are not equal:
H1: σ1² − σ2² ≠ 0
you reject the null hypothesis if the calculated F test statistic is greater than the upper-tail critical value, FU, from the F distribution with n1 − 1 degrees of freedom in the numerator and n2 − 1 degrees of freedom in the denominator, or if the calculated F test statistic falls below the lower-tail critical value, FL, from the F distribution with n1 − 1 and n2 − 1 degrees of freedom in the numerator and denominator, respectively. Thus, the decision rule is:
Reject H0 if F > FU or if F < FL; otherwise, do not reject H0.
This decision rule and rejection regions are displayed in Figure 10.12.

[Figure 10.12  Regions of rejection and non-rejection for the two-tail F test]

To illustrate how to use the F test to determine whether the two variances are equal, return to the opening scenario about the electricity price indices in Sydney and Melbourne. To determine whether to use the pooled-variance t test or the separate-variance t test in Section 10.1, you can test the equality of the two population variances. The null and alternative hypotheses are:
H0: σ1² − σ2² = 0
H1: σ1² − σ2² ≠ 0
Because this is a two-tail test, the rejection region is split into the lower and upper tails of the F distribution. Using a level of significance of α = 0.05, each rejection region contains 0.025 of the distribution. Because there are samples of 20 for Sydney and Melbourne, there are 20 − 1 = 19 degrees of freedom for group 1 and also for group 2. FU, the upper-tail critical value of the F distribution, is found directly from Table E.5, a portion of which is presented in Table 10.6. Note that 19 degrees of freedom for the numerator is not presented in this table, so we use 20 degrees of freedom for the numerator and 19 degrees of freedom for the denominator. Thus, the upper-tail critical value of this F distribution is 2.51.

Table 10.6  Finding FU, the upper-tail critical value of F with 20 and 19 degrees of freedom for an upper-tail area of 0.025 (extracted from Table E.5 in Appendix E of this book)

                                    Numerator df1
Denominator df2       1        2        3      ...      12       15       20
        1          647.80   799.50   864.20    ...    976.70   984.90   993.10
        2           38.51    39.00    39.17    ...     39.41    39.43    39.45
        3           17.44    16.04    15.44    ...     14.34    14.25    14.17
        .              .        .        .      .        .        .        .
       18            5.98     4.56     3.95    ...      2.77     2.67     2.56
       19            5.92     4.51     3.90    ...      2.72     2.62     2.51
       20            5.87     4.46     3.86    ...      2.68     2.57     2.46

Note: 19 degrees of freedom for the numerator is not presented in Table E.5 so we go to 20 as the closest value.
Source: Pearson, E. S. & H. O. Hartley, eds, Biometrika Tables for Statisticians, Volume 1, 3rd edn (London: Cambridge University Press 2015, reproduced with permission).

Finding Lower-Tail Critical Values
You calculate FL, a lower-tail critical value on the F distribution with n1 − 1 degrees of freedom in the numerator and n2 − 1 degrees of freedom in the denominator, by taking the reciprocal of FU*, an upper-tail critical value on the F distribution with the degrees of freedom 'switched' (i.e. n2 − 1 degrees of freedom in the numerator and n1 − 1 degrees of freedom in the denominator). This relationship is shown in Equation 10.10.

FINDING LOWER-TAIL CRITICAL VALUES FROM THE F DISTRIBUTION

F_L = \frac{1}{F_{U^*}}    (10.10)

where FU* is from an F distribution with n2 − 1 degrees of freedom in the numerator and n1 − 1 degrees of freedom in the denominator.

In the electricity price example, the degrees of freedom are 19 and 19 for the respective numerator and denominator samples, so there is no switching of degrees of freedom; you just take the reciprocal. However, note that we had to use 20 degrees of freedom, as Table E.5 does not contain 19 degrees of freedom. To calculate the lower-tail 0.025 critical value, we simply take the reciprocal of the upper-tail critical value. As shown in Table 10.6, this upper-tail value is 2.51. Using Equation 10.10:

F_L = \frac{1}{F_{U^*}} = \frac{1}{2.51} = 0.40
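Not part of the textbook: a short Python check of the reciprocal relationship in Equation 10.10, using scipy's F distribution with the exact 19 and 19 degrees of freedom (rather than the Table E.5 approximation of 20).

```python
# Lower- and upper-tail F critical values and the reciprocal relationship (Equation 10.10).
from scipy.stats import f

alpha, df1, df2 = 0.05, 19, 19             # 20 observations in each sample
f_upper = f.ppf(1 - alpha / 2, df1, df2)   # upper-tail 0.025 critical value
f_lower = f.ppf(alpha / 2, df1, df2)       # lower-tail 0.025 critical value

# FL equals the reciprocal of the upper-tail value with degrees of freedom switched.
print(f"FU = {f_upper:.3f}, FL = {f_lower:.3f}, 1/FU* = {1 / f.ppf(1 - alpha / 2, df2, df1):.3f}")
```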

As depicted in Figure 10.13, the decision rule is:
Reject H0 if F > FU = 2.51 or if F < FL = 0.40; otherwise, do not reject H0.

[Figure 10.13  Regions of rejection and non-rejection for the two-tail F test for equality of two variances at the 0.05 level of significance with 20 and 20 degrees of freedom]

Using Equation 10.9 and the electricity price index data (see Table 10.1 on page 361), the F test statistic is:

F = \frac{S_1^2}{S_2^2} = \frac{1017.73}{995.73} = 1.02

Because FL = 0.40 < F = 1.02 < FU = 2.51, do not reject H0. The p-value is 0.96 for a two-tail test (twice the p-value for the one-tail test shown in the Microsoft Excel output of Figure 10.14, as Excel only generates a one-tail F test). Since 0.96 > 0.05, you conclude that there is no significant difference in the variability of the electricity price index data for Sydney and Melbourne.

[Figure 10.14  Microsoft Excel F test: two-sample for variances output for the Sydney and Melbourne electricity price data]

[Figure: Regions of rejection for the F test; Panel A, two-tail test, H1: σ1² − σ2² ≠ 0; Panel B, one-tail test, H1: σ1² − σ2² > 0]

EXAMPLE 10.3

A ONE-TAIL TEST FOR THE DIFFERENCE BETWEEN TWO VARIANCES
Waiting time is a critical issue at fast-food chains, which want to minimise not only the mean service time but also the variation in the service time from customer to customer. One fast-food chain carried out a study to measure the variability in the waiting time (defined as the time in minutes from when an order was completed to when it was delivered to the customer) at lunch and breakfast at one of the chain's stores. The results were as follows:
lunch: n1 = 25, S1² = 4.4
breakfast: n2 = 21, S2² = 1.9
At the 0.05 level of significance, is there sufficient evidence that there is more variability in the service time at lunch than at breakfast? Assume that the population service times are normally distributed.

SOLUTION

The null and alternative hypotheses are:
H0: σL² − σB² ≤ 0
H1: σL² − σB² > 0
The F test statistic is given by Equation 10.9 on page 379:

F = \frac{S_1^2}{S_2^2}

You use Table E.5 to find the upper critical value of the F distribution. With n1 − 1 = 25 − 1 = 24 degrees of freedom in the numerator, n2 − 1 = 21 − 1 = 20 degrees of freedom in the denominator, and α = 0.05, the upper-tail critical value, F0.05, is 2.08. The decision rule is:
Reject H0 if F > 2.08; otherwise, do not reject H0.
From Equation 10.9 on page 379:

F = \frac{S_1^2}{S_2^2} = \frac{4.4}{1.9} = 2.3158

Because F = 2.3158 > 2.08, you reject H0. Using a 0.05 level of significance, you conclude that there is sufficient evidence that there is more variability in the service time at lunch than at breakfast.
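Not part of the textbook: Example 10.3 can be reproduced from the summary statistics alone in Python, with the critical value and one-tail p-value taken from scipy's F distribution instead of Table E.5.

```python
# One-tail F test from summary statistics (Example 10.3): lunch vs breakfast variability.
from scipy.stats import f

s1_sq, n1 = 4.4, 25    # lunch: sample variance and sample size
s2_sq, n2 = 1.9, 21    # breakfast: sample variance and sample size
alpha = 0.05

F = s1_sq / s2_sq                             # Equation 10.9
f_crit = f.ppf(1 - alpha, n1 - 1, n2 - 1)     # upper-tail critical value (about 2.08)
p_value = f.sf(F, n1 - 1, n2 - 1)             # one-tail p-value

print(f"F = {F:.4f}, critical value = {f_crit:.2f}, p-value = {p_value:.4f}")
print("Reject H0" if F > f_crit else "Do not reject H0")
```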

Problems for Section 10.3

LEARNING THE BASICS
10.27 Determine FU and FL, the upper- and lower-tail critical values of F, in each of the following two-tail tests:
a. α = 0.05, n1 = 8, n2 = 7
b. α = 0.05, n1 = 9, n2 = 6
c. α = 0.025, n1 = 7, n2 = 5
d. α = 0.01, n1 = 9, n2 = 9
10.28 Determine FU, the upper-tail critical value of F, in each of the following one-tail tests:
a. α = 0.05, n1 = 8, n2 = 7
b. α = 0.025, n1 = 9, n2 = 6
c. α = 0.01, n1 = 7, n2 = 5
d. α = 0.005, n1 = 9, n2 = 9
10.29 Determine FL, the lower-tail critical value of F, in each of the following one-tail tests:
a. α = 0.05, n1 = 13, n2 = 25
b. α = 0.025, n1 = 13, n2 = 25
c. α = 0.01, n1 = 13, n2 = 25
d. α = 0.005, n1 = 13, n2 = 25
10.30 The following information is available for two samples drawn from independent normally distributed populations:
n1 = 21   S1² = 133.7   n2 = 16   S2² = 161.9
What is the value of the F test statistic if you are testing the null hypothesis H0: σ1² − σ2² = 0?
10.31 In problem 10.30, how many degrees of freedom are there in the numerator and denominator of the F test?
10.32 In problems 10.30 and 10.31, what are the critical values for FU and FL from Table E.5 if the level of significance α is 0.05 and the alternative hypothesis is H1: σ1² − σ2² ≠ 0?
10.33 In problems 10.30 to 10.32, what is your statistical decision?
10.34 The following information is available for two samples selected from independent but very right-skewed populations:
n1 = 16   S1² = 47.3   n2 = 13   S2² = 36.4
Should you use the F test to test the null hypothesis of equality of variances? Discuss.
10.35 If the two samples in problem 10.34 are drawn from independent, normally distributed populations:
a. At the 0.05 level of significance, is there sufficient evidence of a significant difference between σ1² and σ2²?
b. Suppose that you want to perform a one-tail test. At the 0.05 level of significance, what is the upper-tail critical value of the F test statistic to determine whether there is sufficient evidence that σ1² > σ2²? What is your statistical decision?
c. Suppose that you want to perform a one-tail test. At the 0.05 level of significance, what is the lower-tail critical value of the F test statistic to determine whether there is sufficient evidence that σ1² < σ2²? What is your statistical decision?

APPLYING THE CONCEPTS
Problems 10.36 to 10.39 can be solved manually or by using Microsoft Excel or PHStat.
10.36 Use the information in question 10.8 on private and public school students to answer this question.
a. At the 0.05 level of significance, is there sufficient evidence of a difference in the variability of ATAR score for private and public students?
b. What assumptions do you need to make about the two populations in order to justify the use of the F test?
c. Based on the result of part (a), which t test, defined in Section 10.1, should you use to test whether there is a significant difference in the mean ATAR score of private vs public students?
10.37 A bank with a branch located in a commercial district of a city < BANK1 > has developed an improved process for serving customers during the noon to 1 pm lunch period. The waiting time (defined as the time elapsed from when the customer enters the line until they reach the teller window) of all customers during this hour is recorded over a period of one week. A random sample of 15 customers is selected and the results (in minutes) are:
4.21  5.55  3.02  5.13  4.77  2.34  3.54  3.20  4.50  6.10  0.38  5.12  6.46  6.19  3.79
Suppose that another branch located in a residential area < BANK2 > is also concerned about the noon to 1 pm lunch period. A random sample of 15 customers is selected and the results are as follows:
9.66  5.90  8.02  5.79  8.73  3.82  8.01  8.35  10.49  6.68  5.64  4.08  6.17  9.91  5.47

a. Is there sufficient evidence of a difference in the variability of the waiting time between the two branches? (Use α = 0.05.)
b. Find the p-value in (a) and interpret its meaning.
c. What assumption is necessary in (a)? Is the assumption valid for these data?
d. Based on the results of (a), is it appropriate to use the pooled-variance t test to compare the means of the two branches?
10.38 A financial planner is comparing the volatility of two investment funds. She collects annual returns of the two funds over the previous 10 years (in percentages). < RETURNS >
Fund 1 (%):  6   6   8  −10   8  −3  −3  −4  −9   3  10
Fund 2 (%):  0  10  11    0   9   2   5  10   8  −1   6
a. Is there sufficient evidence of a significant difference in the volatility (i.e. variability) of the two funds? (Use α = 0.05.)
b. Find the p-value in (a) and interpret its meaning.
c. Based on the results of (a), is it appropriate to use the pooled-variance t test to compare the means of the two funds?
10.39 Use the information in question 10.10 on petrol prices to answer this question.
a. Using a 0.05 level of significance, is there sufficient evidence of a significant difference between the variances in petrol prices of Campbelltown and the rest of Sydney?
b. On the basis of the results in (a), is it appropriate to use the pooled-variance t test to compare the means of the two groups? Discuss.

LEARNING OBJECTIVE 4
Conduct hypothesis tests for two population proportions

10.4  COMPARING TWO POPULATION PROPORTIONS
To this point in Chapter 10 we have looked at a variety of tests for samples from two populations. Now we will see how to use the Z test to analyse the difference between two proportions. In this section we present a procedure whose test statistic, Z, is approximated by a standardised normal distribution.

Z Test for the Difference between Two Proportions
Z test for the difference between two proportions: A test statistic used in hypothesis tests about the difference between the proportions of two populations.

In evaluating differences between two population proportions, you can use a Z test for the difference between two proportions. The test statistic Z is based on the difference between two sample proportions (p1 - p2). This test statistic, given in Equation 10.11, follows approximately a standardised normal distribution for large enough sample sizes. The values n1 p1, n1 (1 - p1), n2 p2 and n2 (1 - p2) must all be at least 5.

Z TEST FOR THE DIFFERENCE BETWEEN TWO PROPORTIONS

Z = \frac{(p_1 - p_2) - (\pi_1 - \pi_2)}{\sqrt{\bar{p}(1 - \bar{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}    (10.11)

with

\bar{p} = \frac{X_1 + X_2}{n_1 + n_2},\quad p_1 = \frac{X_1}{n_1},\quad p_2 = \frac{X_2}{n_2}

where
p1 = proportion of successes in sample 1
X1 = number of successes in sample 1
n1 = sample size of sample 1
π1 = proportion of successes in population 1
p2 = proportion of successes in sample 2
X2 = number of successes in sample 2
n2 = sample size of sample 2
π2 = proportion of successes in population 2
p̄ = pooled estimate of the population proportion of successes
The test statistic Z follows approximately a standardised normal distribution.

Under the null hypothesis, you assume that the two population proportions are equal (π1 = π2). Because the pooled estimate for the population proportion is based on the null hypothesis, you combine or pool the two sample proportions to calculate an overall estimate of the common population proportion. This estimate is equal to the number of successes in the two samples combined (X1 + X2) divided by the total sample size from the two sample groups (n1 + n2).
As shown in the following table, you can use this Z test for the difference between population proportions to determine whether there is a difference in the proportion of successes in the two groups (two-tail test) or whether one group has a higher proportion of successes than the other group (one-tail test).

Two-tail test                     One-tail test                     One-tail test
H0: π1 = π2 or π1 − π2 = 0        H0: π1 ≥ π2 or π1 − π2 ≥ 0        H0: π1 ≤ π2 or π1 − π2 ≤ 0
H1: π1 ≠ π2 or π1 − π2 ≠ 0        H1: π1 < π2 or π1 − π2 < 0        H1: π1 > π2 or π1 − π2 > 0

where π1 = proportion of successes in population 1 and π2 = proportion of successes in population 2.

To test the null hypothesis that there is no difference between the proportions of two independent populations:
H0: π1 − π2 = 0
against the alternative hypothesis that the two population proportions are not the same:
H1: π1 − π2 ≠ 0
use the test statistic Z, given by Equation 10.11. For a given level of significance, α, reject the null hypothesis if the calculated Z test statistic is greater than the upper-tail critical value from the standardised normal distribution, or if the calculated test statistic is less than the lower-tail critical value from the standardised normal distribution.
To illustrate the use of the Z test for the equality of two proportions, we will consider some market research where customers were asked to state whether they had purchased energy-efficient light globes. In Sydney, 225 of the 308 customers surveyed responded 'yes'. In Melbourne, 271 of the 322 customers surveyed responded 'yes'. At the 0.05 level of significance, is there sufficient evidence of a significant difference in the proportion of consumers who have purchased energy-efficient light globes in Sydney and Melbourne?
The null and alternative hypotheses are:
H0: π1 − π2 = 0
H1: π1 − π2 ≠ 0
Using the 0.05 level of significance, the critical values are −1.96 and +1.96 (see Figure 10.16) and the decision rule is:
Reject H0 if Z < −1.96 or if Z > +1.96; otherwise, do not reject H0.

[Figure 10.16  Regions of rejection and non-rejection when testing a hypothesis for the difference between two proportions at the 0.05 level of significance]

Using Equation 10.11:

Z = \frac{(p_1 - p_2) - (\pi_1 - \pi_2)}{\sqrt{\bar{p}(1 - \bar{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}

where:

p_1 = \frac{X_1}{n_1} = \frac{225}{308} = 0.7305 \qquad p_2 = \frac{X_2}{n_2} = \frac{271}{322} = 0.8416

and:

\bar{p} = \frac{X_1 + X_2}{n_1 + n_2} = \frac{225 + 271}{308 + 322} = \frac{496}{630} = 0.7873

so that:

Z = \frac{(0.7305 - 0.8416) - 0}{\sqrt{0.7873(1 - 0.7873)\left(\dfrac{1}{308} + \dfrac{1}{322}\right)}}
  = \frac{-0.1111}{\sqrt{(0.1675)(0.0064)}}
  = \frac{-0.1111}{0.0326} = -3.4063

Using the 0.05 level of significance, reject the null hypothesis because Z = −3.4063 is less than −1.96. The p-value is 0.0007 (calculated from Table E.2 or from the PHStat worksheet of Figure 10.17). This means that, if the null hypothesis is true, the probability that a Z test statistic is less than −3.4063 or greater than +3.4063 is 0.0007. Because 0.0007 < α = 0.05, you reject the null hypothesis. There is sufficient evidence to conclude that the two cities are significantly different with respect to purchase history of energy-efficient light globes: a greater proportion of consumers in Melbourne than in Sydney have purchased energy-efficient light globes.

[Figure 10.17  PHStat worksheet for the Z test for the difference between two proportions in the two-city energy-efficient light globe survey: hypothesised difference 0; level of significance 0.05; Sydney 225 successes in 308; Melbourne 271 successes in 322; sample proportions 0.7305 and 0.8416; Z test statistic −3.40625; two-tail critical values ±1.96; p-value 0.000659; reject the null hypothesis]
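Not part of the textbook: the PHStat worksheet's calculation can be reproduced with a few lines of Python that implement Equation 10.11 directly for the light-globe counts.

```python
# Two-sample Z test for the difference between proportions (Equation 10.11),
# using the Sydney and Melbourne light-globe survey counts.
import math
from scipy.stats import norm

x1, n1 = 225, 308   # Sydney: successes, sample size
x2, n2 = 271, 322   # Melbourne: successes, sample size

p1, p2 = x1 / n1, x2 / n2
p_bar = (x1 + x2) / (n1 + n2)                         # pooled proportion
se = math.sqrt(p_bar * (1 - p_bar) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se
p_value = 2 * norm.sf(abs(z))                          # two-tail p-value

print(f"Z = {z:.4f}, p-value = {p_value:.4f}")         # Z ≈ -3.41, p ≈ 0.0007
```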

EXAMPLE 10.4
Are men less likely than women to shop for bargains? A survey reported that when going shopping, 24% of men (181 of 756 sampled) and 34% of women (275 of 809 sampled) go for bargains (data obtained from 'Brands more critical for dads', USA Today, 21 July 2011, p. 1C). At the 0.05 level of significance, is the proportion of men who shop for bargains significantly lower than the proportion of women who shop for bargains?

SOLUTION

Because you want to know whether there is sufficient evidence that the proportion of men who shop for bargains is significantly lower than the proportion of women who shop for bargains, you use a one-tail test. The null and alternative hypotheses are:
H0: π1 − π2 ≥ 0 (the proportion of men who shop for bargains is greater than or equal to the proportion of women who shop for bargains)
H1: π1 − π2 < 0 (the proportion of men who shop for bargains is lower than the proportion of women who shop for bargains)
Using the 0.05 level of significance, for the one-tail test in the lower tail the critical value is −1.645. The decision rule is:
Reject H0 if Z < −1.645; otherwise, do not reject H0.
Using Equation 10.11 on page 384:

Z = \frac{(p_1 - p_2) - (\pi_1 - \pi_2)}{\sqrt{\bar{p}(1 - \bar{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}

where:

p_1 = \frac{X_1}{n_1} = \frac{181}{756} = 0.2394 \qquad p_2 = \frac{X_2}{n_2} = \frac{275}{809} = 0.3399

and:

\bar{p} = \frac{X_1 + X_2}{n_1 + n_2} = \frac{181 + 275}{756 + 809} = \frac{456}{1565} = 0.2914

so that:

Z = \frac{(0.2394 - 0.3399) - 0}{\sqrt{0.2914(1 - 0.2914)\left(\dfrac{1}{756} + \dfrac{1}{809}\right)}}
  = \frac{-0.1005}{\sqrt{(0.2065)(0.00256)}}
  = \frac{-0.1005}{0.0230} = -4.37

Using the 0.05 level of significance, you reject the null hypothesis because Z = −4.37 < −1.645. The p-value is virtually 0.0000: if the null hypothesis is true, the probability that a Z test statistic is less than −4.37 is essentially zero (which is less than α = 0.05). You conclude that there is sufficient evidence that the proportion of men who shop for bargains is significantly lower than the proportion of women who shop for bargains.

Confidence Interval Estimate for the Difference between Two Proportions
Instead of, or in addition to, testing for the difference between the proportions of two independent populations, you can construct a confidence interval estimate of the difference between the two proportions, using Equation 10.12.

CONFIDENCE INTERVAL ESTIMATE FOR THE DIFFERENCE BETWEEN TWO PROPORTIONS

(p_1 - p_2) \pm Z \sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}}    (10.12)

or

(p_1 - p_2) - Z \sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}} < (\pi_1 - \pi_2) < (p_1 - p_2) + Z \sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}}

To construct a 95% confidence interval estimate of the population difference between the proportion of consumers in Sydney and Melbourne who had purchased energy-efficient light globes, use the results on page 386 or from Figure 10.17. Using Equation 10.12:

(0.7305 - 0.8416) \pm 1.96 \sqrt{\frac{0.7305(1 - 0.7305)}{308} + \frac{0.8416(1 - 0.8416)}{322}}
-0.1111 \pm (1.96)(0.0325)
-0.1111 \pm 0.0636
-0.1747 < \pi_1 - \pi_2 < -0.0475

By rearranging the population proportions, to aid interpretation, this could also be written as 0.0475 ≤ π2 − π1 ≤ 0.1747. Thus, you have 95% confidence that the difference between the population proportions of Sydney and Melbourne consumers who have purchased energy-efficient light globes is between −0.1747 and −0.0475. The proportion of Melbourne consumers who have purchased energy-efficient light globes is higher.
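As a cross-check (not from the textbook), the same 95% interval can be computed directly from Equation 10.12 in Python:

```python
# 95% confidence interval for the difference between two proportions (Equation 10.12).
import math
from scipy.stats import norm

p1, n1 = 225 / 308, 308   # Sydney
p2, n2 = 271 / 322, 322   # Melbourne
z = norm.ppf(0.975)       # about 1.96 for 95% confidence

se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
lower, upper = (p1 - p2) - z * se, (p1 - p2) + z * se
print(f"95% CI for pi1 - pi2: ({lower:.4f}, {upper:.4f})")   # about (-0.175, -0.048)
```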

Problems for Section 10.4

LEARNING THE BASICS
10.40 Assume that n1 = 120, X1 = 55, n2 = 65 and X2 = 30.
a. At the 0.05 level of significance, is there sufficient evidence of a significant difference between the two population proportions in group 1 and group 2?
b. Set up a 95% confidence interval estimate of the difference between the two proportions.
10.41 Assume that n1 = 100, X1 = 45, n2 = 50 and X2 = 25.
a. At the 0.01 level of significance, is there sufficient evidence of a significant difference between the two population proportions in group 1 and group 2?
b. Construct a 99% confidence interval estimate of the difference between the two population proportions.

APPLYING THE CONCEPTS

10.42 A human resources manager is trying to reduce the percentage of employees resigning at the organisation's two factories in Brisbane and Christchurch. She implements an employee loyalty scheme at the Christchurch factory in which employees receive a financial bonus for staying at the factory for over five years. Over the next 12-month period, 12 out of 345 employees resign at the Christchurch factory, compared to 32 out of 890 employees in Brisbane.
a. At the 0.01 level of significance, is there sufficient evidence that the percentage of employees resigning in Christchurch is lower than that of Brisbane?
b. Construct and interpret a 99% confidence interval estimate of the difference between the population proportions for the factories.
10.43 One of the most impressive, innovative advances in online fundraising over the past decade is the rise of crowdfunding websites. While features differ from site to site, crowdfunding sites are websites that allow you to set up an online fundraising campaign based around a fundraising page and accept money directly from that page using the website's own credit card processor. Kickstarter, one crowdfunding website, reported that 316 of 831 technology crowdfunding projects were successfully launched in the past year and 923 of 2,796 games crowdfunding projects were successfully launched in the past year.
a. Is there sufficient evidence of a significant difference in the proportion of technology crowdfunding projects and games crowdfunding projects that were successful at the 0.05 level of significance?
b. Determine the p-value in (a) and interpret its meaning.
c. Construct and interpret a 95% confidence interval estimate for the difference between the proportion of technology crowdfunding projects and games crowdfunding projects that are successful.
10.44 A survey of 1,085 adults asked, 'Do you enjoy shopping for clothing for yourself?' The results indicated that 51% of the females enjoyed shopping for clothing for themselves, compared to 44% of the males (data obtained from 'Split decision on clothes shopping', USA Today, 28 January 2011, p. 1B). The sample sizes of males and females were not provided. Suppose that of 542 males, 238 said that they enjoyed shopping for clothing for themselves, while of 543 females, 276 said that they enjoyed shopping for clothing for themselves.
a. Is there sufficient evidence of a significant difference between males and females in the proportion who enjoy shopping for clothing for themselves at the 0.01 level of significance?
b. Find the p-value in (a) and interpret its meaning.
c. Construct and interpret a 99% confidence interval estimate for the difference between the proportion of males and females who enjoy shopping for clothing for themselves.
d. What are your answers to parts (a) to (c) if 218 males enjoyed shopping for clothing for themselves?
10.45 A car manufacturer wishes to test whether a higher proportion of car drivers in Malaysia have converted to LPG fuel compared to in Singapore. A survey of a sample of 45 drivers in Malaysia revealed that 18 had converted to LPG, while 12 out of 30 Singapore drivers had done so. Test the claim at the 0.01 level of significance.
10.46 A marketing company employed by the Victorian Dairy Board wishes to determine whether there is a preference for margarine by females over males. A sample of 82 males shows that 33 prefer margarine to butter. A sample of 133 females shows that 65 prefer margarine to butter. At the 0.05 level of significance, test to see whether males prefer margarine significantly less than do females.

Assess your progress


Summary
This chapter introduced statistical test procedures used in analysing possible differences between two independent populations. You also learned a test procedure frequently used when analysing differences between the means of two related populations. Remember that you need to select the test most appropriate for a given set of conditions, and to investigate critically the validity of the assumptions underlying each of the hypothesis-testing procedures. Figure 10.18, overleaf, illustrates the steps in determining which two-sample test of hypothesis to use.

1. What type of data do you have? If you have categorical variables, use the Z test for the difference between two proportions (assuming the samples are independent).
2. If you have numerical variables, determine whether you have independent samples or related samples. If you have related samples, use the paired t test.
3. If you have independent samples, determine whether you can assume that the variances of the two groups are equal. (This assumption can be tested using the F test.)
4. If you can assume that the two groups have equal variances, use the pooled-variance t test. If you cannot assume that the two groups have equal variances, use the separate-variance t test.
Table 10.7 provides a list of topics covered in this chapter.

[Figure 10.18  Flow chart for selecting a two-sample test of hypothesis: categorical data leads to the Z test for the difference between two proportions; numerical data leads first to the independent samples question (No: paired t test), then to the focus (variability: F test for σ1² = σ2²; central tendency: if σ1² = σ2² can be assumed, pooled-variance t test, otherwise separate-variance t test)]

Table 10.7  Summary of topics in Chapter 10

Type of analysis            Numerical data                                              Categorical data
Comparing two populations   Z and t tests for the difference in the means of two        Z test for the difference between
                            independent populations (Section 10.1)                      two proportions (Section 10.4)
                            Paired t test (Section 10.2)
                            F test for differences in two variances (Section 10.3)

Copyright © Pearson Australia (a division of Pearson Australia Group Pty Ltd) 2019— 9781488617249 — Berenson/Basic Business Statistics 5e



Key formulas 391

Key formulas

Z test for the difference between two means
$$Z = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} \qquad (10.1)$$

Pooled-variance t test for the difference between two means
$$t = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{S_p^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} \qquad (10.2)$$

Confidence interval estimate of the difference in the means of two independent populations
$$(\bar{X}_1 - \bar{X}_2) \pm t_{n_1+n_2-2}\sqrt{S_p^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)} \qquad (10.3)$$
or
$$(\bar{X}_1 - \bar{X}_2) - t_{n_1+n_2-2}\sqrt{S_p^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)} < \mu_1 - \mu_2 < (\bar{X}_1 - \bar{X}_2) + t_{n_1+n_2-2}\sqrt{S_p^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}$$

Separate-variance t test for the difference between two means
$$t = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}} \qquad (10.4)$$

Calculating degrees of freedom in separate-variance t test
$$\nu = \frac{\left(\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}\right)^2}{\dfrac{\left(S_1^2/n_1\right)^2}{n_1 - 1} + \dfrac{\left(S_2^2/n_2\right)^2}{n_2 - 1}} \qquad (10.5)$$

Z test for the mean difference
$$Z = \frac{\bar{D} - \mu_D}{\sigma_D/\sqrt{n}} \qquad (10.6)$$

Paired t test for the mean difference
$$t = \frac{\bar{D} - \mu_D}{S_D/\sqrt{n}} \qquad (10.7)$$

Confidence interval estimate for the mean difference
$$\bar{D} \pm t_{n-1}\,\frac{S_D}{\sqrt{n}} \qquad (10.8)$$
or
$$\bar{D} - t_{n-1}\,\frac{S_D}{\sqrt{n}} < \mu_D < \bar{D} + t_{n-1}\,\frac{S_D}{\sqrt{n}}$$

F statistic for testing the equality of two variances
$$F = \frac{S_1^2}{S_2^2} \qquad (10.9)$$

Finding lower-tail critical values from the F distribution
$$F_L = \frac{1}{F_U^*} \qquad (10.10)$$

Z test for the difference between two proportions
$$Z = \frac{(p_1 - p_2) - (\pi_1 - \pi_2)}{\sqrt{\bar{p}(1 - \bar{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} \qquad (10.11)$$
with
$$\bar{p} = \frac{X_1 + X_2}{n_1 + n_2}, \qquad p_1 = \frac{X_1}{n_1}, \qquad p_2 = \frac{X_2}{n_2}$$

Confidence interval estimate for the difference between two proportions
$$(p_1 - p_2) \pm Z\sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}} \qquad (10.12)$$
or
$$(p_1 - p_2) - Z\sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}} < \pi_1 - \pi_2 < (p_1 - p_2) + Z\sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}}$$
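For readers who want to evaluate these formulas outside Excel, the sketch below computes formulas 10.2, 10.5 and 10.11 directly with NumPy/SciPy. The means and standard deviations are hypothetical; the proportion counts reuse the figures quoted in problem 10.44 purely as an illustration.

```python
# A minimal sketch evaluating key formulas 10.2, 10.5 and 10.11 from summary
# statistics. The mean/standard deviation figures are hypothetical.
import numpy as np
from scipy import stats

# Pooled-variance t test (10.2) from summary statistics
n1, xbar1, s1 = 20, 71.6, 31.9     # hypothetical sample 1
n2, xbar2, s2 = 20, 76.5, 31.5     # hypothetical sample 2
sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
t = (xbar1 - xbar2) / np.sqrt(sp2 * (1 / n1 + 1 / n2))      # mu1 - mu2 = 0 under H0
p = 2 * stats.t.sf(abs(t), df=n1 + n2 - 2)
print(f"pooled t = {t:.3f}, p-value = {p:.3f}")

# Satterthwaite degrees of freedom for the separate-variance t test (10.5)
v1, v2 = s1**2 / n1, s2**2 / n2
df_sep = (v1 + v2)**2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
print(f"separate-variance df = {df_sep:.1f}")

# Z test for the difference between two proportions (10.11),
# using the counts from problem 10.44 as an illustration
x1, m1, x2, m2 = 238, 542, 276, 543
p1, p2 = x1 / m1, x2 / m2
pbar = (x1 + x2) / (m1 + m2)
z = (p1 - p2) / np.sqrt(pbar * (1 - pbar) * (1 / m1 + 1 / m2))
p_z = 2 * stats.norm.sf(abs(z))
print(f"Z = {z:.3f}, p-value = {p_z:.3f}")
```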


Key terms
F distribution 378
F test statistic for testing the equality of two variances 378
paired (or matched) 371
paired t test for the mean difference in related populations 372
pooled-variance t test 360
repeated measurements 371
robust 364
separate-variance t test 365
Z test for the difference between two means 359
Z test for the difference between two proportions 384

References
1. Snedecor, G. W. & W. G. Cochran, Statistical Methods, 8th edn (Ames, IA: Iowa State University Press, 1989).
2. Satterthwaite, F. E., ‘An approximate distribution of estimates of variance components’, Biometrics Bulletin, 2 (1946): 110–114.
3. Winer, B. J., Statistical Principles in Experimental Design, 3rd edn (New York: McGraw-Hill, 1989).
4. Conover, W. J., Practical Nonparametric Statistics, 3rd edn (New York: Wiley, 2000).
5. Daniel, W., Applied Nonparametric Statistics, 2nd edn (Boston: Houghton Mifflin, 1990).

Chapter review problems

CHECKING YOUR UNDERSTANDING
10.47 What are some of the criteria used in the selection of a particular hypothesis-testing procedure?
10.48 Under what conditions should you use the separate-variance t test to examine possible differences in the means of two independent populations?
10.49 Under what conditions should you use the F test to examine possible differences in the variances of two independent populations?
10.50 What is the difference between two independent and two related populations?
10.51 What is the distinction between repeated measurements and matched (or paired) items?
10.52 Under what conditions should the paired t test for the mean difference between two related populations be used?
10.53 Explain the similarities and differences between the test of hypothesis for the difference between the means of two independent populations, and the confidence interval estimate of the difference between the means.
10.54 Under what conditions should you use the Z test for two proportions?

APPLYING THE CONCEPTS
Problems 10.55 to 10.62 can be solved manually or by using Microsoft Excel. We recommend using Microsoft Excel or PHStat to solve problems 10.63 to 10.73.

10.55 There are a very large number of mutual funds from which an investor can choose. Each mutual fund has its own mix of different types of investments. The data in < BEST_FUNDS1 > present the one-year return and the three-year annualised return for the 10 best short-term bond funds and the 10 best long-term bond funds, according to U.S. News & World Report (data obtained from ). Analyse the data and determine whether any significant differences exist between short-term and long-term bond funds. (Use the 0.05 level of significance.)
10.56 Do male and female students study the same amount per week? In a recent year, 58 undergraduate business students were surveyed at a large university that has more than 1,000 undergraduate business students enrolled each year. The file < STUDY_TIME > contains the gender and the number of hours spent studying in a typical week for the sampled students.
a. At the 0.05 level of significance, is there a significant difference in the variance of the study time between male students and female students?
b. Using the results of (a), which t test is appropriate for comparing the mean study time for male and female students?
c. At the 0.05 level of significance, conduct the test selected in (b).
d. Write a short summary of your findings.
10.57 Suppose that a marketing firm has been engaged by a major international pharmaceutical company to test the acceptance of a new sunscreen in the Australian market. The company has recruited 442 people to try out the sunscreen and answer a number of questions about it on a five-point scale (1 = Strongly Disagree, 2 = Disagree, 3 = Neutral, 4 = Agree and 5 = Strongly Agree). Four of the questions asked are listed below:
1. I have been sunburnt at least once during the past summer.
2. I would be willing to pay more for a sunscreen that I know will be more effective while I am swimming.
3. Skin cancer due to sun exposure is something I want to prevent.
4. I was not aware that sunscreen needs to be applied at least 20 minutes before exposure to the sun.
For each question, the researchers tested the null hypothesis that the mean response for males and females is equal. The


alternative hypothesis is that the mean response is different for males and females. The following table summarises the results.

                  Sample mean
Question   Female (n1 = 137)   Male (n2 = 305)       t     p-value
   1              4.46               4.26          1.907    0.057
   2              4.09               3.86          2.105    0.035
   3              4.26               3.91          3.258    0.001
   4              4.12               4.06          0.567    0.571

a. Interpret the results of the t test for question 1.
b. Interpret the results of the t test for question 2.
c. Interpret the results of the t test for question 3.
d. Interpret the results of the t test for question 4.
e. Write a short summary about the differences between males and females concerning their views towards sun exposure and sunscreen.
10.58 A motorist wishes to compare the price of petrol in Adelaide and rural South Australia. She collected the following data:

        Adelaide   Rural South Australia
X̄        $1.23           $1.43
S        $0.23           $0.74
n          25              21

Use a 0.01 level of significance.
a. Assuming equal variances, is there sufficient evidence that the mean petrol price is greater in rural South Australia?
b. Is there sufficient evidence of a significant difference between the variances of petrol price in Adelaide and rural South Australia?
c. Construct and interpret a 99% confidence interval estimate of the difference between the mean petrol prices of Adelaide and rural South Australia.
10.59 The manager of computer operations of a large company wants to study computer usage of two departments within the company – the accounting department and the research department. A random sample of five jobs from the accounting department in the past week and six jobs from the research department in the past week is selected, and the processing time (in seconds) for each job is recorded. < ACC_RES >

Department    Processing time (in seconds)
Accounting    9   3   8   7  12
Research      4  13  10   9   9   6

Use a 0.05 level of significance.
a. Is there sufficient evidence that the mean processing time in the research department is significantly greater than 6 seconds?
b. Is there sufficient evidence of a significant difference between the variances in the processing time of the two departments?
c. Is there sufficient evidence of a significant difference between the mean processing time of the accounting department and the research department?

d. Determine the p-values in (a), (b) and (c) and interpret their meanings.
e. Construct and interpret a 95% confidence interval estimate of the mean difference in the mean processing times between the accounting and research departments.
10.60 A social researcher claims that New Zealanders watch more television than Australians. He takes a random sample of 41 New Zealanders and determines that they watched a daily average of 245 minutes of television with a standard deviation of 34. A sample of 61 Australians averaged 220 minutes with a standard deviation of 23.
a. Assuming equal variances, test the claim at the 0.05 level of significance.
b. Assuming unequal variances, repeat the test at the 0.05 level of significance.
c. Test whether the variances are equal at the 0.05 level of significance.
d. Based on the finding from part (c), which was the appropriate test for the two means?
e. A few days later the researcher calls again to tell you that a reviewer of his article wants him to include the p-value for the ‘correct’ result in (a). In addition, the researcher inquires about an unequal variances problem, which the reviewer wants him to discuss in his article. In your own words, discuss the concept of p-value and describe the unequal variances problem. Determine the p-value in (a) and discuss whether or not the unequal variances problem had any meaning in the researcher’s study.
10.61 Do Pinterest shoppers and Facebook shoppers differ with respect to spending behaviour? A study of browser-based shopping sessions reported that Pinterest shoppers spent a mean of $153 per order and Facebook shoppers spent a mean of $85 per order (data obtained from ). Suppose the study consisted of 500 Pinterest shoppers and 500 Facebook shoppers, and the standard deviation of the order value was $150 for Pinterest shoppers and $80 for Facebook shoppers. Assume a 0.05 level of significance.
a. Is there sufficient evidence of a significant difference in the variances of the order values between Pinterest shoppers and Facebook shoppers?
b. Is there sufficient evidence of a significant difference in the mean order value between Pinterest shoppers and Facebook shoppers?
c. Construct a 95% confidence interval estimate for the difference in mean order value between Pinterest shoppers and Facebook shoppers.
10.62 The owner of a restaurant that serves Continental-style entrées has the business objective of learning more about the patterns of patron demand during the weekend period from Friday to Sunday. She decided to study the demand for dessert during this period. In addition to studying whether a dessert was ordered, she will study the gender of the individual and whether a beef entrée was ordered. Data were collected from 630 customers and organised in the following contingency tables.


                         Gender
Dessert ordered     Male    Female    Total
Yes                   50        96      146
No                   250       234      484
Total                300       330      630

                       Beef entrée
Dessert ordered      Yes        No    Total
Yes                   74        68      142
No                   123       365      488
Total                197       433      630

a. At the 0.05 level of significance, is there sufficient evidence of a significant difference between males and females in the proportion of those who order dessert?
b. At the 0.05 level of significance, is there sufficient evidence of a significant difference in the proportion of those who order dessert based on whether or not a beef entrée has been ordered?
10.63 Use the data in the < HOSPITALITY > file to compare New South Wales country and Sydney hospitality industry trainees in terms of their annual transport and rent costs and annual wage. Use the 0.05 level of significance.
10.64 The lengths of life (in hours) of a sample of forty 100-watt incandescent light bulbs produced by manufacturer A and a sample of forty 100-watt incandescent light bulbs produced by manufacturer B are in the file < BULBS >. Completely analyse the differences between the life of the bulbs produced by the two manufacturers (use α = 0.05).
10.65 Suppose the data in the file < CYCLING > show the sex, age, top lap speed and points for the season, and current competition points for 121 cyclists. Completely analyse the differences between males and females for all characteristics. (Use α = 0.05.)
10.66 A retailer wishes to determine whether the husband and wife of a couple have different spending patterns. They collect the following data for weekly spending (in dollars):

Couple        1    2    3    4    5    6    7    8
Husband ($) 345  234  124  276  284  670  127  109
Wife ($)    245  345  250  113  267  875  120  235

Test the claim at the 0.10 level of significance.
10.67 Do males and females differ in the amount of time they talk on the phone and the number of text messages they send? A study reported that women spent a mean of 818 minutes per month talking, while men spent 716 minutes per month talking (data obtained from ‘Women talk and text more’, USA Today, 1 February 2011, p. 1A). The sample sizes were not reported. Suppose the sample sizes were 100 each for women and men and that the standard deviations were 125 minutes per month for women and 100 minutes per month for men.

a. Using a 0.01 level of significance, is there sufficient evidence of a significant difference between women and men in the variances of the amount of time spent talking?
b. To test for a difference in the mean talking time of women and men, is it most appropriate to use the pooled-variance t test or the separate-variance t test? Use the most appropriate test to determine if there is a significant difference between women and men in the amount of time spent talking.
The article also reported that women sent a mean of 716 text messages per month while men sent 555 messages per month. Suppose that the standard deviation was 150 text messages per month for women and 125 text messages per month for men.
c. Using a 0.01 level of significance, is there sufficient evidence of a significant difference between women and men in the variances of the number of text messages sent per month?
d. Based on the results of (c), use the most appropriate test to determine, at the 0.01 level of significance, whether there is sufficient evidence of a significant difference between women and men in the mean number of text messages sent per month.
10.68 A consumer believes that the price of fresh fruit and vegetables at their local corner store is more volatile than that at the supermarket so she collects price information over a given time period. She calculates a standard deviation of $6.70 at the corner store from a sample of 12 weeks while the standard deviation from the supermarket is $5.78 from a sample of 12 weeks’ data. Is there sufficient evidence that the variance of fruit and vegetable prices at the local corner store is significantly greater than that at the supermarket? Test at the 0.01 level of significance.
10.69 Retailers often become very competitive during the back-to-school sale time. The data below and in the spreadsheet < STATIONERY > are for similar items sold by two retailers and were extracted from advertising brochures circulated by Coles and Woolworths in the Sydney metropolitan area during the first week of January 2006.

Stationery item                       Coles    Woolworths
64-page exercise book                 $0.05      $0.05
96-page exercise book                 $0.10      $0.09
128-page exercise book                $0.15      $0.15
A4 lever arch folder                  $0.59      $0.52
A4 20-pocket display book             $0.46      $0.52
A4 5-tab subject dividers             $0.33      $0.34
A4 2-ring binder                      $1.69      $1.49
24-pk colour pencils                  $1.26      $1.26
30-cm plastic ruler                   $0.28      $0.29
30-cm wooden ruler                    $0.49      $0.49
1 m × 45 cm movie logo book cover     $1.69      $1.28
9-piece geometry set                  $1.49      $1.49
100 A4 sheet protectors               $1.49      $1.93
64-page binder book                   $0.34      $0.29
96-page binder book                   $0.38      $0.35

Data obtained from advertising brochures Coles NSW-METRO-0901-06-R-RH, Woolworths BC09016/N1A


a. Using an appropriate test and a 1% level of significance, examine whether there was a significant difference between the mean price of stationery at Coles and Woolworths in Sydney in the relevant week.
b. Assuming a two-tail test is used, calculate the p-value and interpret it.
10.70 A lot of interest has been centred on whether or not telecommunication facilities ‘in the bush’ are on par with those in the big cities of Australia. Assume a survey carried out in both the city and regional areas looks at the proportion of Internet users who have access to a high-speed connection. The results show that 592 of 800 city dwellers have access to the National Broadband Network (NBN) and in regional areas 394 of 850 Internet users have NBN access. Using a 1% level of significance, test whether the proportion of city dwellers who have NBN access is higher than the proportion of regional dwellers.
10.71 A hotel manager looks to enhance the initial impressions that hotel guests have when they check in. Contributing to initial impressions is the time it takes to deliver a guest’s luggage to the room after check-in. A random sample of 20 deliveries on a particular day were selected in wing A of the hotel, and a random sample of 20 deliveries were selected in wing B. The results are stored in < LUGGAGE >. Analyse the data and determine whether there is a significant difference between the mean delivery times in the two wings of the hotel. (Use α = 0.05.)
10.72 Use the information in problem 10.70 to construct a 95% confidence interval for the difference in the proportion of city and regional residents who have access to the NBN.
10.73 An insurance company is interested in the driving behaviour of young drivers. Based upon its survey results, 76 out of 100 males aged 17–25 admitted to speeding on a regular basis, while 67 out of 85 young females admitted the same. Is there sufficient evidence at the 0.10 level of significance that a significantly greater proportion of young males speed on a regular basis?

REPORT WRITING EXERCISE
10.74 Referring to the results of problem 10.73 concerning the behaviour of young drivers, write a report that summarises your conclusions.

Continuing cases

Tasman University
The Student News Service at Tasman University (TU) has decided to gather data about the undergraduate students that attend TU. It creates and distributes a survey of 14 questions and receives responses from 62 undergraduates (stored in < TASMAN_UNIVERSITY_BBUS_STUDENT_SURVEY >).
a At the 0.05 level of significance, is there sufficient evidence of a significant difference between males and females in weighted average mark (WAM), expected starting salary, number of social networking sites registered for, age, spending on textbooks and supplies, text messages sent in a week and the wealth needed to feel rich?
b At the 0.05 level of significance, is there sufficient evidence of a difference between students who plan to go to graduate school and those who do not plan to go to graduate school in WAM, expected starting salary, number of social networking sites registered for, age, spending on textbooks and supplies, text messages sent in a week and the wealth needed to feel rich?
The dean of students at TU has learned about the undergraduate survey and has decided to undertake a similar survey for MBA students. She creates and distributes a survey of 14 questions and receives responses from 44 graduate students (stored in < TASMAN_UNIVERSITY_MBA_STUDENT_SURVEY >).
c For these data, at the 0.05 level of significance, is there sufficient evidence of a difference between males and females in age, undergraduate WAM, MBA WAM, expected salary on graduation, spending on textbooks and supplies, text messages sent in a week and the wealth needed to feel rich?


As Safe as Houses
To analyse the real estate market in non-capital cities and towns in states A and B, Safe-As-Houses Real Estate, a large national real estate company, has collected samples of recent residential sales from a sample of non-capital cities and towns in these states. The data are stored in < REAL_ESTATE >.
a Determine whether any significant average price differences exist between houses and units.
b Determine whether the average internal area of houses significantly exceeds that for units.
c Determine whether there are any significant differences between houses and units in the variance of price or internal area.
d Prepare a brief report to summarise your findings.

Chapter 10 Excel Guide

EG10.1 COMPARING THE MEANS OF TWO INDEPENDENT POPULATIONS

Pooled-Variance t Test for the Difference Between Two Means

Figure EG10.1  Pooled-Variance t Test dialog box

Key technique  Use the T.INV.2T(level of significance, degrees of freedom) function to calculate the lower and upper critical values and use the T.DIST.2T(absolute value of the t test statistic, degrees of freedom) to calculate the p-value.
Example  Perform the Figure 10.3 pooled-variance t test for the electricity price data shown on page 362.
PHStat  Use Pooled-Variance t Test. For the example, open the Electricity_Price file. Select PHStat ➔ Two-Sample Tests (Unsummarized Data) ➔ Pooled-Variance t Test. In the procedure’s dialog box (shown in Figure EG10.1):
1. Enter 0 as the Hypothesized Difference.
2. Enter 0.05 as the Level of Significance.
3. Enter B1:B21 as the Population 1 Sample Cell Range.
4. Enter C1:C21 as the Population 2 Sample Cell Range.
5. Check First cells in both ranges contain label.
6. Click Two-Tail Test.
7. Enter a Title and click OK.

When using summarised data, select PHStat ➔ Two-Sample Tests (Summarized Data) ➔ Pooled-Variance t Test. In that procedure’s dialog box, enter the hypothesised difference and level of significance, as well as the sample size, sample mean and sample standard deviation for each sample.

Analysis ToolPak  Use t-Test: Two-Sample Assuming Equal Variances. For the example, open the Electricity_Price file and:
1. Select Data ➔ Data Analysis.
2. In the Data Analysis dialog box, select t-Test: Two-Sample Assuming Equal Variances from the Analysis Tools list and then click OK.


In the procedure’s dialog box (shown in Figure EG10.2):
3. Enter B1:B21 as the Variable 1 Range.
4. Enter C1:C21 as the Variable 2 Range.
5. Enter 0 as the Hypothesized Mean Difference.
6. Check Labels and enter 0.05 as Alpha.
7. Click New Worksheet Ply.
8. Click OK.
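As a cross-check outside Excel, the critical values returned by T.INV.2T and the two-tail p-value of the pooled-variance test can be reproduced with SciPy. The sketch below is not part of the Excel guide and uses two hypothetical samples rather than the Electricity_Price data.

```python
# A sketch of the T.INV.2T / T.DIST.2T calculations outside Excel, using SciPy.
# The two samples below are hypothetical stand-ins, not the Electricity_Price data.
import numpy as np
from scipy import stats

sample1 = np.array([68.0, 75.5, 60.2, 81.3, 72.8, 69.9, 77.1, 64.4])
sample2 = np.array([74.6, 79.2, 70.1, 85.0, 76.3, 72.9, 81.8, 69.5])
alpha = 0.05
df = len(sample1) + len(sample2) - 2

# T.INV.2T(alpha, df) in Excel returns the (positive) two-tail critical value
upper_critical = stats.t.ppf(1 - alpha / 2, df)

# Pooled-variance t statistic and two-tail p-value (the T.DIST.2T equivalent)
t_stat, p_value = stats.ttest_ind(sample1, sample2, equal_var=True)

print(f"critical values: ±{upper_critical:.4f}")
print(f"t = {t_stat:.4f}, two-tail p-value = {p_value:.4f}")
```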

t Test for the Difference Between Two Means, Assuming Unequal Variances
Key technique  Use the T.INV.2T(level of significance, degrees of freedom) function to calculate the lower and upper critical values and use the T.DIST.2T(absolute value of the t test statistic, degrees of freedom) to calculate the p-value.
Example  Perform the Figure 10.7 separate-variance t test for the electricity price data shown on page 367.

Figure EG10.2 t-Test: Two Sample Assuming Equal Variances dialog box

Results (shown in Figure EG10.3) appear in a new worksheet that contains both two-tail and one-tail test critical values and p-values.

[t-Test: Two-Sample Assuming Equal Variances output (excerpt) — Sydney: Mean 71.6, Variance 1017.7263, Observations 20, Pooled variance 1006.7303, Hypothesized mean difference 0, df 38, t Stat −0.4933, P(T…]

            Designs
Design 1   Design 2   Design 3   Design 4
 226.32     223.81     237.08     233.90
 246.77     243.85     250.55     251.10
 227.94     226.75     241.43     241.28
 244.79     243.97     247.95     241.53
 226.19     225.68     238.04     249.43
 249.75     254.30     251.84     255.45
 224.45     224.49     244.13     233.54
 248.51     239.50     244.87     248.35
 229.65     230.86     231.82     234.51
 241.44     253.00     249.49     245.09

a. At the 0.05 level of significance, is there evidence of a significant difference in the mean distance travelled by the golf balls with different designs?
b. If the results in (a) indicate that it is appropriate, use the Tukey–Kramer procedure to determine which designs differ in mean distance.
c. What assumptions are necessary in (a)?
d. At the 0.05 level of significance, is there evidence of a significant difference in the variation of the distance travelled by the golf balls differing in design?

11.2  THE RANDOMISED BLOCK DESIGN

In Section 11.1 you used the one-way ANOVA F test to evaluate differences between the means of more than two independent groups. In Section 10.2 you used the paired t test when you had repeated measurements or matched samples in order to evaluate the difference between the means of two groups. In this section, a method to analyse more than two groups using repeated measures or matched samples is developed. The heterogeneous sets of items or individuals that have been matched (or on whom repeated measurements have

LEARNING OBJECTIVE 3  Construct and apply a randomised block design


blocks  Groupings of homogeneous units in experiments.
randomised block design  An experimental technique where data in groups are divided into fairly homogeneous subgroups called blocks to remove variability from random error.

been taken) are called blocks. Experimental situations that use blocks are called randomised block designs. Although groups and blocks are both used in a randomised block design, the focus of the analysis is on the differences between the different groups. As is the case in completely randomised designs, groups are often different levels pertaining to a factor of interest. For example, if the factor of interest is advertising medium, three groups could be subject to the following different levels: television, radio and newspaper. In this experiment, different cities could be used as blocks. The purpose of blocking is to remove as much variability as possible from the random error so that the differences between the groups are more evident. A randomised block design is often more efficient statistically than a completely randomised design and therefore produces more precise results (see references 1, 3, 5 and 9).

To compare a completely randomised design with a randomised block design, assume we are comparing the delivery time of four couriers. Suppose that a completely randomised design is used with 16 parcels of parts dispatched Monday to Thursday over a four-week period. (Friday dispatches arrive after the weekend, so have not been included in the experiment.) Any variability between days becomes part of the random error, so differences between couriers might be hard to detect. To reduce random error, a randomised block experiment is designed where each courier collects a shipment on each of the four weekdays excluding Friday. The four days are considered blocks, while the treatment factor is still the four couriers. The advantage of the randomised block design is that variability between days, because of traffic conditions or demand for courier bookings, is removed from the random error. Therefore, this design should provide more precise results concerning delivery time differences between the four couriers.

Tests for the Treatment and Block Effects
Recall from Figure 11.1 that, in the completely randomised design, the total variation (SST) is subdivided into variation due to differences between the c groups (SSB) and variation due to differences within the c groups (SSW). Within-group variation is considered experimental error, and between-group variation is due to treatment effects. To remove the effects of the blocking from the random error in the randomised block design, the within-group variation (SSW) is subdivided into variation due to differences between the blocks (SSBL) and variation due to random error (SSE). Therefore, as presented in Figure 11.9, in a randomised block design the total variation is the sum of three components – between-group variation (SSB), between-block variation (SSBL) and random error (SSE).

Figure 11.9  Partitioning the total variation in a randomised block design model
SST = SSB + SSBL + SSE
Total variation (SST), df = n − 1, is partitioned into:
  Between-group variation (SSB), df = c − 1
  Between-block variation (SSBL), df = r − 1
  Random variation (SSE), df = (r − 1)(c − 1)

The following definitions are needed to develop the ANOVA procedure for the randomised block design:


$r$ = the number of blocks
$c$ = the number of groups
$n$ = the total number of values (where $n = rc$)
$X_{ij}$ = the value in the $i$th block for the $j$th group
$\bar{X}_{i.}$ = the mean of all the values in block $i$
$\bar{X}_{.j}$ = the mean of all the values for group $j$
$\sum_{j=1}^{c}\sum_{i=1}^{r} X_{ij}$ = the grand total

Note that the dots are used in the notation as spacers to distinguish, for example, $\bar{X}_{3.}$, the mean of the third block, from $\bar{X}_{.3}$, the mean of the third group. The total variation, also called sum of squares total (SST), is a measure of the variation between all the values. You calculate SST by summing the squared differences between each individual value and the grand mean $\bar{\bar{X}}$ that is based on all $n$ values. Equation 11.7 shows the calculation for total variation.

TOTAL VARIATION IN RANDOMISED BLOCK DESIGN
$$SST = \sum_{j=1}^{c}\sum_{i=1}^{r} \left(X_{ij} - \bar{\bar{X}}\right)^2 \qquad (11.7)$$
where
$$\bar{\bar{X}} = \frac{\displaystyle\sum_{j=1}^{c}\sum_{i=1}^{r} X_{ij}}{rc} \quad (\text{grand mean})$$

You calculate the between-group variation, also called the sum of squares between groups (SSB), by summing the squared differences between the sample mean of each group, $\bar{X}_{.j}$, and the grand mean $\bar{\bar{X}}$, weighted by the number of blocks $r$. Equation 11.8 shows the calculation for the between-group variation.

BETWEEN-GROUP VARIATION IN RANDOMISED BLOCK DESIGN
$$SSB = r \sum_{j=1}^{c} \left(\bar{X}_{.j} - \bar{\bar{X}}\right)^2 \qquad (11.8)$$
where
$$\bar{X}_{.j} = \frac{\displaystyle\sum_{i=1}^{r} X_{ij}}{r}$$

between-block variation  That part of the within-group variation due to differences between the blocks.
sum of squares between blocks (SSBL)  That part of the within-group variation due to differences between the blocks.

You calculate the between-block variation, also called the sum of squares between blocks (SSBL), by summing the squared differences between the mean of each block, $\bar{X}_{i.}$, and the grand


mean $\bar{\bar{X}}$, weighted by the number of groups $c$. Equation 11.9 shows the calculation for the between-block variation.

BETWEEN-BLOCK VARIATION IN RANDOMISED BLOCK DESIGN
$$SSBL = c \sum_{i=1}^{r} \left(\bar{X}_{i.} - \bar{\bar{X}}\right)^2 \qquad (11.9)$$
where
$$\bar{X}_{i.} = \frac{\displaystyle\sum_{j=1}^{c} X_{ij}}{c}$$

sum of squares error (SSE)  The sum of squared differences between the values in each cell and the corresponding mean of that cell.

You calculate the random variation, also called the sum of squares error (SSE), by summing the squared differences between all the values after the effects of the particular treatments and blocks have been accounted for. Equation 11.10 shows the calculation for random error.

mean square between blocks (MSBL)  The sum of squares between blocks divided by the appropriate degrees of freedom.
mean square error (MSE)  The sum of squares due to random error divided by the appropriate degrees of freedom.

RANDOM ERROR IN RANDOMISED BLOCK DESIGN
$$SSE = \sum_{j=1}^{c}\sum_{i=1}^{r} \left(X_{ij} - \bar{X}_{.j} - \bar{X}_{i.} + \bar{\bar{X}}\right)^2 \qquad (11.10)$$

Since you are comparing c groups, there are c − 1 degrees of freedom associated with the sum of squares between groups (SSB). Similarly, since there are r blocks, there are r − 1 degrees of freedom associated with the sum of squares between blocks (SSBL). Moreover, there are n − 1 degrees of freedom associated with the sum of squares total (SST) because you are comparing each value $X_{ij}$ to the grand mean $\bar{\bar{X}}$ based on all n (= rc) values. Therefore, since the degrees of freedom for each of the sources of variation must add to the degrees of freedom for the total variation, you calculate the degrees of freedom for the sum of squares error (SSE) component by subtraction and algebraic manipulation. Thus, the degrees of freedom associated with the sum of squares error is (r − 1)(c − 1). If you divide each of the component sums of squares by its associated degrees of freedom, you have the three variances or mean square terms (MSB, MSBL and MSE). Equations 11.11(a)–(c) give the mean square terms needed for the ANOVA table.

THE MEAN SQUARES IN RANDOMISED BLOCK DESIGN

$$MSB = \frac{SSB}{c - 1} \qquad (11.11a)$$
$$MSBL = \frac{SSBL}{r - 1} \qquad (11.11b)$$
$$MSE = \frac{SSE}{(r - 1)(c - 1)} \qquad (11.11c)$$

If the assumptions of the analysis of variance are valid, the null hypothesis of no differences in the c population means (i.e. no treatment effects):
$$H_0: \mu_{.1} = \mu_{.2} = \cdots = \mu_{.c}$$


is tested against the alternative hypothesis that not all the c population means are equal (i.e. there are treatment effects):
$$H_1: \text{Not all } \mu_{.j} \text{ are equal (where } j = 1, 2, \ldots, c)$$
by calculating the test statistic F given in Equation 11.12.

RANDOMISED BLOCK F TEST STATISTIC
$$F = \frac{MSB}{MSE} \qquad (11.12)$$

The F test statistic follows an F distribution with c − 1 degrees of freedom for the MSB term and (r − 1)(c − 1) degrees of freedom for the MSE term. For a given level of significance α, you reject the null hypothesis if the calculated F test statistic is greater than the upper-tail critical value FU from the F distribution with c − 1 and (r − 1)(c − 1) degrees of freedom (see Table E.5). The decision rule is:

Reject H0 if F > FU; otherwise, do not reject H0.

F test for block effects  A test to determine whether or not all the population block means are equal.

To examine whether the randomised block design was advantageous to use, some statisticians suggest that you perform the F test for block effects. The null hypothesis of no block effects:
$$H_0: \mu_{1.} = \mu_{2.} = \cdots = \mu_{r.}$$
is tested against the alternative hypothesis:
$$H_1: \text{Not all } \mu_{i.} \text{ are equal (where } i = 1, 2, \ldots, r)$$
Equation 11.13 gives the F statistic for the block effects.

F TEST STATISTIC FOR BLOCK EFFECTS
$$F = \frac{MSBL}{MSE} \qquad (11.13)$$

You reject the null hypothesis at the α level of significance if the F test statistic is greater than the upper-tail critical value FU from the F distribution with r − 1 and (r − 1)(c − 1) degrees of freedom (see Table E.5). That is, the decision rule is: Reject H0 if F > FU; otherwise, do not reject H0. The results of the analysis-of-variance procedure are usually displayed in an ANOVA summary table, as shown in Table 11.6.

Table 11.6  Analysis of variance table for the randomised block design

Source              Degrees of freedom    Sum of squares   Mean square (variance)        F
Between treatments  c − 1                 SSB              MSB = SSB/(c − 1)             F = MSB/MSE
Between blocks      r − 1                 SSBL             MSBL = SSBL/(r − 1)           F = MSBL/MSE
Error               (r − 1)(c − 1)        SSE              MSE = SSE/[(r − 1)(c − 1)]
Total               rc − 1                SST
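Once the sums of squares are known, the two F tests summarised in Table 11.6 (Equations 11.12 and 11.13) reduce to simple ratios compared with upper-tail critical values. The sketch below is illustrative rather than part of the text's procedure; for concreteness it borrows the figures from problem 11.16 later in this section (four treatments, six blocks, SSB = 60, SSBL = 75, SST = 210) and uses SciPy only for the critical values.

```python
# A minimal sketch of the randomised block F tests (Equations 11.12 and 11.13).
# Sums of squares taken from problem 11.16 for illustration.
from scipy import stats

r, c = 6, 4                           # blocks and groups
SSB, SSBL, SST = 60.0, 75.0, 210.0
SSE = SST - SSB - SSBL                # remaining variation is random error

MSB = SSB / (c - 1)
MSBL = SSBL / (r - 1)
MSE = SSE / ((r - 1) * (c - 1))

alpha = 0.05
F_treat = MSB / MSE                   # Equation 11.12
F_block = MSBL / MSE                  # Equation 11.13
FU_treat = stats.f.ppf(1 - alpha, c - 1, (r - 1) * (c - 1))   # upper-tail critical value
FU_block = stats.f.ppf(1 - alpha, r - 1, (r - 1) * (c - 1))

print(f"Treatments: F = {F_treat:.2f}, FU = {FU_treat:.2f}, reject H0: {F_treat > FU_treat}")
print(f"Blocks:     F = {F_block:.2f}, FU = {FU_block:.2f}, reject H0: {F_block > FU_block}")
```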


To illustrate the randomised block design, suppose that a fast-food chain wants to evaluate the service at four restaurants. The customer service director for the chain hires six investigators with varied experiences in food-service evaluations to act as raters. To reduce the effect of the variability from rater to rater, you use a randomised block design with raters serving as the blocks. The four restaurants are the groups of interest. The six raters evaluate the service at each of the four restaurants in a random order. A rating scale from 0 (low) to 100 (high) is used. Table 11.7 summarises the results < FF_CHAIN >, along with the group totals, group means, block totals, block means, grand total and grand mean. In addition, from Table 11.7:

r = 6   c = 4   n = rc = 24

and:
$$\bar{\bar{X}} = \frac{\sum_{j=1}^{c}\sum_{i=1}^{r} X_{ij}}{rc} = \frac{1{,}887}{24} = 78.625$$

Table 11.7  Ratings at four restaurants of a fast-food chain

                     Restaurants
Raters        A       B       C       D     Totals    Means
1            70      61      82      74       287     71.75
2            77      75      88      76       316     79.00
3            76      67      90      80       313     78.25
4            80      63      96      76       315     78.75
5            84      66      92      84       326     81.50
6            78      68      98      86       330     82.50
Totals      465     400     546     476     1,887
Means     77.50   66.67   91.00   79.33              78.625

Figure 11.10 illustrates output from Microsoft Excel for this randomised block design. Using the 0.05 level of significance to test for differences between the restaurants, you reject the null hypothesis (H0: μ.1 = μ.2 = μ.3 = μ.4) if the calculated F value is greater than 3.29, the upper-tail critical value from the F distribution with 3 and 15 degrees of freedom in the numerator and denominator, respectively (see Figure 11.11). Since F = 39.758 > FU = 3.29, or since the p-value = 0.000 < 0.05, you reject H0 and conclude that there is evidence of a significant difference in the mean rating between the different restaurants. The extremely small p-value indicates that if the means from the four restaurants are equal, there is virtually no chance that you will get differences as large or larger between the sample means as observed in this study. Thus, there is little degree of belief in the null hypothesis. You conclude that the alternative hypothesis is correct: the mean ratings between the four restaurants are different.

As a check on the effectiveness of blocking, you can test for a difference between the raters. The decision rule, using the 0.05 level of significance, is to reject the null hypothesis (H0: μ1. = μ2. = … = μ6.) if the calculated F value is greater than 2.90, the upper-tail critical value from the F distribution with 5 and 15 degrees of freedom (see Figure 11.12). Since F = 3.782 > FU = 2.90, or the p-value = 0.02 < 0.05, you reject H0 and conclude that there is evidence of a significant difference between the raters. Thus, you conclude that the blocking has been advantageous in reducing the random error.

The assumptions of the one-way analysis of variance discussed on page 410 (randomness and independence, normality and homogeneity of variance) also apply to the randomised block design. If the normality assumption is violated, you can use the Friedman rank test discussed in Chapter 19. In addition, you need to assume that there is no interacting effect between the


Figure 11.10  Microsoft Excel output for the fast-food chain study

ANOVA: two-factor without replication

Summary         Count     Sum    Average   Variance
Rater 1             4     287      71.75      76.25
Rater 2             4     316      79.00      36.67
Rater 3             4     313      78.25      90.92
Rater 4             4     315      78.75     184.92
Rater 5             4     326      81.50     121.00
Rater 6             4     330      82.50     161.00
Restaurant A        6     465      77.50      21.50
Restaurant B        6     400      66.67      23.47
Restaurant C        6     546      91.00      33.20
Restaurant D        6     476      79.33      23.47

ANOVA
Source of variation        SS       df        MS         F     p-value    F crit
Rows                    283.375      5      56.675     3.782     0.02      2.901
Columns                1787.458      3     595.819    39.758     0.000     3.287
Error                   224.792     15      14.986
Total                  2295.625     23

Figure 11.11  Regions of rejection and non-rejection for the fast-food-chain study at the 0.05 level of significance with 3 and 15 degrees of freedom (critical value 3.29)
Figure 11.12  Regions of rejection and non-rejection for the fast-food-chain study at the 0.05 level of significance with 5 and 15 degrees of freedom (critical value 2.90)


treatments and the blocks. In other words, you need to assume that any differences between the treatments (the restaurants) are consistent across the entire set of blocks (the raters). The concept of interaction is discussed further in Section 11.3.

estimated relative efficiency (RE)  A comparison between the randomised block design and completely randomised design ANOVA methods.

Did the blocking result in an increase in precision in comparing the different treatment groups? To answer this question, use Equation 11.14 to calculate the estimated relative efficiency (RE) of the randomised block design as compared with the completely randomised design.

ESTIMATED RELATIVE EFFICIENCY
$$RE = \frac{(r - 1)MSBL + r(c - 1)MSE}{(rc - 1)MSE} \qquad (11.14)$$

Using Figure 11.10,
$$RE = \frac{(5)(56.675) + (6)(3)(14.986)}{(23)(14.986)} = 1.60$$

This value for relative efficiency means that it would take 1.6 times as many observations in a one-way ANOVA design compared with the randomised block design in order to have the same precision in comparing the restaurants.
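The Figure 11.10 quantities can be reproduced directly from Equations 11.7–11.14. The sketch below does this with NumPy, using the ratings in Table 11.7; variable names are illustrative, and the printed values should agree (to rounding) with the Excel output.

```python
# A sketch reproducing the randomised block ANOVA of Table 11.7 / Figure 11.10
# from Equations 11.7-11.14. Variable names are illustrative.
import numpy as np

# Ratings: rows = raters (blocks), columns = restaurants A-D (groups), from Table 11.7
X = np.array([[70, 61, 82, 74],
              [77, 75, 88, 76],
              [76, 67, 90, 80],
              [80, 63, 96, 76],
              [84, 66, 92, 84],
              [78, 68, 98, 86]], dtype=float)
r, c = X.shape
grand = X.mean()

SSB  = r * ((X.mean(axis=0) - grand) ** 2).sum()   # between groups (11.8)
SSBL = c * ((X.mean(axis=1) - grand) ** 2).sum()   # between blocks (11.9)
SST  = ((X - grand) ** 2).sum()                    # total (11.7)
SSE  = SST - SSB - SSBL                            # random error (11.10)

MSB, MSBL, MSE = SSB / (c - 1), SSBL / (r - 1), SSE / ((r - 1) * (c - 1))
F_treat = MSB / MSE                                # (11.12), approx. 39.76
F_block = MSBL / MSE                               # (11.13), approx. 3.78
RE = ((r - 1) * MSBL + r * (c - 1) * MSE) / ((r * c - 1) * MSE)   # (11.14), approx. 1.60

print(f"SSB = {SSB:.3f}, SSBL = {SSBL:.3f}, SSE = {SSE:.3f}")
print(f"F (treatments) = {F_treat:.3f}, F (blocks) = {F_block:.3f}, RE = {RE:.2f}")
```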

Multiple Comparisons: The Tukey Procedure

Tukey procedure A method of making pairwise comparisons between means.

As in the case of the completely randomised design, once you reject the null hypothesis of no differences between the groups, you need to determine which groups are significantly different from the others. For the randomised block design, you can use a procedure developed by John Tukey (see references 2, 3 and 9). Equation 11.15 gives the critical range for the Tukey procedure.

THE CRITICAL RANGE FOR THE RANDOMISED BLOCK DESIGN
$$\text{Critical range} = Q_U \sqrt{\frac{MSE}{r}} \qquad (11.15)$$

where QU is the upper-tail critical value from a Studentised range distribution having c degrees of freedom in the numerator and (r − 1)(c − 1) degrees of freedom in the denominator. Values for the Studentised range distribution are found in Table E.10.

Compare each of the c(c − 1)/2 pairs against the critical range. If the absolute difference in a specific pair of sample means, say $|\bar{X}_{.j} - \bar{X}_{.j'}|$, is greater than the critical range, then group j and group j′ are significantly different. To apply the Tukey procedure, return to the fast-food-chain study. Since there are four restaurants, there are 4(4 − 1)/2 = 6 possible pairwise comparisons. From Figure 11.10, the absolute mean differences are:
1. $|\bar{X}_{.1} - \bar{X}_{.2}| = |77.50 - 66.67| = 10.83$
2. $|\bar{X}_{.1} - \bar{X}_{.3}| = |77.50 - 91.00| = 13.50$


3. $|\bar{X}_{.1} - \bar{X}_{.4}| = |77.50 - 79.33| = 1.83$
4. $|\bar{X}_{.2} - \bar{X}_{.3}| = |66.67 - 91.00| = 24.33$
5. $|\bar{X}_{.2} - \bar{X}_{.4}| = |66.67 - 79.33| = 12.66$
6. $|\bar{X}_{.3} - \bar{X}_{.4}| = |91.00 - 79.33| = 11.67$

Locate MSE = 14.986 and r = 6 in Figure 11.10 to determine the critical range. From Table E.10 (for α = 0.05, c = 4 and (r − 1)(c − 1) = 15), QU, the upper-tail critical value of the test statistic with 4 and 15 degrees of freedom, is 4.08. Using Equation 11.15:
$$\text{Critical range} = 4.08\sqrt{\frac{14.986}{6}} = 6.448$$

All pairwise comparisons except $|\bar{X}_{.1} - \bar{X}_{.4}|$ are greater than the critical range. Therefore, you conclude that there is evidence of a significant difference in the mean rating between all pairs of restaurant branches except for branches A and D. In addition, branch C has the highest ratings (i.e. is most preferred) and branch B has the lowest (i.e. is least preferred).
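The same pairwise comparisons can be generated programmatically. The sketch below substitutes SciPy's studentised range distribution for the Table E.10 lookup; it is an alternative illustration rather than the text's procedure, with the group means and MSE taken from Table 11.7 and Figure 11.10.

```python
# A sketch of the Tukey procedure for the fast-food data (Equation 11.15),
# using scipy.stats.studentized_range in place of Table E.10.
from itertools import combinations
import numpy as np
from scipy import stats

group_means = {"A": 77.50, "B": 66.67, "C": 91.00, "D": 79.33}   # from Table 11.7
MSE, r, c = 14.986, 6, 4
df_error = (r - 1) * (c - 1)

QU = stats.studentized_range.ppf(0.95, c, df_error)   # approx. 4.08
critical_range = QU * np.sqrt(MSE / r)                 # approx. 6.45

for (g1, m1), (g2, m2) in combinations(group_means.items(), 2):
    diff = abs(m1 - m2)
    verdict = "significant" if diff > critical_range else "not significant"
    print(f"|{g1} - {g2}| = {diff:.2f} -> {verdict}")
print(f"critical range = {critical_range:.3f}")
```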

think about this  Randomised block design
Six salespeople look after elite property sales in a real estate business you work for. The owner of the business wants to experiment with the use of different bidding techniques in order to maximise selling price and also commission. In order to see if this looks like a good strategy, he decides to allocate two salespeople to sell properties by internet auction only, two salespeople to sell by ordinary auction, and the other two to sell via more traditional means. He then plans to compare the commission between the three methods after a six-month period, and perhaps change his selling strategy based on the results.
Knowing that you had recently completed a business statistics subject at university, he asks you to collect that data and report back to him after six months. ‘But boss, I think that the results could be affected by both bidding technique as well as the salesperson.’ He agrees and asks you to design a better experiment. Thinking back to your business stats course you suggest a randomised block design. This way you can have all salespeople involved in selling using the three techniques and test for both a difference in bidding technique (column) and salesperson (row).

Problems for Section 11.2

LEARNING THE BASICS
11.15 Given a randomised block experiment with four treatment levels and six blocks, how many degrees of freedom are there in determining:
a. the between-group variation?
b. the between-block variation?
c. the inherent random variation or error?
d. the total variation?
11.16 Refer to problem 11.15.
a. If SSB = 60, SSBL = 75 and SST = 210, what is SSE?
b. What are MSB, MSBL and MSE?
c. What is the value of the test statistic F for the difference in the four means?
d. What is the value of the test statistic F for the block effects?
11.17 Refer to problems 11.15 and 11.16.
a. Construct the ANOVA summary table and fill in all values in the body of the table.
b. At the 0.05 level of significance, is there evidence of a significant difference in the population means?
c. At the 0.05 level of significance, is there evidence of a significant difference due to blocks?
11.18 Refer to problems 11.15, 11.16 and 11.17.
a. To perform the Tukey procedure, how many degrees of freedom are there in the numerator and how many degrees of freedom are there in the denominator of the Studentised range distribution?
b. At the 0.05 level of significance, what is the upper-tail critical value from the Studentised range distribution?
c. To perform the Tukey procedure, what is the critical range?


11.19 Given a randomised block experiment with three treatment levels and seven blocks, how many degrees of freedom are there in determining:
a. the between-group variation?
b. the between-block variation?
c. the random variation or error?
d. the total variation?
11.20 From problem 11.19, if SSB = 36 and the randomised block F test statistic is 6.0:
a. What is MSE and SSE?
b. What is SSBL if the F test statistic for block effects is 4.0?
c. What is SST?
d. At the 0.01 level of significance, is there evidence of treatment and block effects?
11.21 Given a randomised block experiment with four treatment levels and eight blocks, from the ANOVA summary table below, fill in all the missing results.

Source              Degrees of freedom      Sum of squares   Mean square (variance)   F
Between treatments  c − 1 = ?               SSB = ?          MSB = 80                 F = ?
Between blocks      r − 1 = ?               SSBL = 540       MSBL = ?                 F = 5.0
Error               (r − 1)(c − 1) = ?      SSE = ?          MSE = ?
Total               rc − 1 = ?              SST = ?

11.22 Refer to problem 11.21.
a. At the 0.05 level of significance, is there evidence of a significant difference between the four treatment level means?
b. At the 0.05 level of significance, is there evidence of a block effect?

APPLYING THE CONCEPTS
Problems 11.23–11.28 can be solved manually or by using Microsoft Excel.

11.23 Nine experts rated four brands of Colombian coffee in a taste-testing experiment. A rating on a seven-point scale (1 = extremely unpleasing, 7 = extremely pleasing) is given for each of four characteristics: taste, aroma, richness and acidity. The following table displays the summated ratings, accumulated over all four characteristics. < COFFEE >

                 Brand
Expert      A     B     C     D
C.C.       24    26    25    22
S.E.       27    27    26    24
E.G.       19    22    20    16
B.L.       24    27    25    23
C.M.       22    25    22    21
C.N.       26    27    24    24
G.N.       27    26    22    23
R.M.       25    27    24    21
P.V.       22    23    20    19



At the 0.05 level of significance, completely analyse the data to determine whether there is evidence of a significant difference in the summated ratings of the four brands of Colombian coffee and, if so, which of the brands are rated highest (i.e. best). What can you conclude?
11.24 How do the ratings for TV, phone and Internet services compare? The data in < TELECOM2 > represent the mean ratings in 13 different cities (data obtained from ‘Ratings: TV, phone, and Internet services’, Consumer Reports, May 2013, 24–25).
a. At the 0.05 level of significance, determine whether there is evidence of a significant difference in the mean rating between TV, phone and Internet services.
b. If appropriate, use the Tukey procedure to determine which services’ mean ratings differ. Again, use a 0.05 level of significance.
11.25 At the major shopping centre in Chatswood, NSW, there are several large chemist shops which compete for business, often through advertising discounted prices or specials. The following data < CHEMISTS > shows prices for comparable items as shown on the shelves at three Chatswood chemist shops on 5 April 2008 (in dollars).

Product                             Shop 1   Shop 2   Shop 3
Taft Hairspray 200 g                  3.69     3.99     3.75
Nicabate Gum 96 pce                  24.69    32.50    31.95
Claratyne 30 pk                      22.99    22.75    25.95
Pantene Shampoo 350 mL                6.39     6.75     6.95
L’Oréal Hair Colour–Crème Gloss      13.69    14.75    14.75
Visine Eyedrops 15 mL                 6.99     7.99     5.95
Nivea Shaving Gel 200 mL              5.69     5.75     5.15
Strepsils 24 pk                       4.29     5.50     5.45

a. At the 0.05 level of significance, use the randomised block design to determine if there is evidence of a significant difference in the mean prices for these products at the three chemist shops.
b. What assumptions are necessary to perform this test?
c. If appropriate, use the Tukey procedure to determine which chemist shops differ. (Use α = 0.05.)
d. Do you think that there was a significant block effect in this experiment? Explain.
11.26 Remaindered books are often sold for prices much less than the original recommended retail price. They will usually be offered in low-rent shops or stalls set up within the open areas of shopping centres. The following data < BOOK_PRICE > shows five book titles and the prices the remaindered titles cost (in dollars) at five locations:

                                      Location
Title                       1        2        3        4        5
World Rice Cooking        11.99     9.44    13.99    20.79     9.95
Cute Canines              13.99     9.44    13.99    12.49    13.95
Mind Over Grey Matter      9.99    27.28    13.99     9.99     9.95
The Joy of Exercise       12.87     9.44     9.99    11.99    13.95
Military Jets              9.99    17.44    13.99     9.99    18.81


a. At the 0.05 level of significance, use the randomised block design to determine if there is evidence of a significant difference in the mean prices for books at the five locations.
b. What assumptions are necessary to perform this test?
c. If appropriate, use the Tukey procedure to determine which locations differ. (Use α = 0.05.)
d. Do you think that there was a significant block effect in this experiment? Explain.
11.27 How different are the rates of return of money market accounts and certificates of deposit that vary in their term length? The data in < MMCD_RATE > contain the money market, one-year CD, two-year CD and five-year CD rates for banks in a suburban area (data obtained from ‘Consumer money rates’, Newsday, 13 June 2013, p. A47).
a. At the 0.05 level of significance, determine whether there is evidence of a significant difference between the mean rates for these investments.
b. What assumptions are necessary to perform this test?
c. If appropriate, use the Tukey procedure to determine which investments differ. (Use α = 0.05.)
d. Was there a significant block effect in their mean rates in this experiment? Explain.
11.28 The data in the < CONCRETE2 > file represent the compressive strength in thousands of pounds per square inch of 40 samples of concrete taken 2, 7 and 28 days after pouring.
a. At the 0.05 level of significance, is there evidence of a significant difference in the mean compressive strength after 2, 7 and 28 days?
b. If appropriate, use the Tukey procedure to determine the days that differ in mean compressive strength. (Use α = 0.05.)
c. Determine the relative efficiency of the randomised block design as compared with the one-way ANOVA (completely randomised) design.
d. Construct box-and-whisker plots of the compressive strength for the different time periods.
e. Based on the results of (a), (b) and (d), is there a pattern in the compressive strength over the three time periods?

11.3  THE FACTORIAL DESIGN: TWO-WAY ANALYSIS OF VARIANCE

LEARNING OBJECTIVE 4  Conduct a two-way analysis of variance and interpret the interaction

two-factor factorial design  Analysis of variance where two factors are simultaneously evaluated.
two-way ANOVA  Analysis of variance where two factors are simultaneously evaluated.

In Section 11.1 you learned about the one-way analysis of variance, and in Section 11.2 you studied the randomised block design. In this section, the analysis of variance is extended to the two-factor factorial design, in which two factors are simultaneously evaluated. Each factor is evaluated at two or more treatment levels. For example, we may be interested in comparing the electricity consumption of households from different cities and whether they purchased their electricity from ‘green’ versus ‘traditional’ sources. While discussion here is limited to two factors, you can extend factorial designs to three or more factors (see references 1, 2 and 8). Data from a two-factor factorial design are analysed using two-way ANOVA. Because of the complexity of the calculations involved, you should use software such as Microsoft Excel when conducting this analysis. Nevertheless, for purposes of illustration and a better conceptual understanding of two-way ANOVA, the decomposition of the total variation is presented below. The following definitions are needed to develop the two-way ANOVA procedure:

$r$ = the number of levels of factor A
$c$ = the number of levels of factor B
$n'$ = the number of values (replications) for each cell (combination of a particular level of factor A and a particular level of factor B)
$n$ = total number of values in the experiment (where $n = rcn'$)
$X_{ijk}$ = value of the $k$th observation for level $i$ of factor A and level $j$ of factor B

$$\bar{\bar{X}} = \frac{\displaystyle\sum_{i=1}^{r}\sum_{j=1}^{c}\sum_{k=1}^{n'} X_{ijk}}{rcn'} = \text{grand mean}$$

$$\bar{X}_{i..} = \frac{\displaystyle\sum_{j=1}^{c}\sum_{k=1}^{n'} X_{ijk}}{cn'} = \text{mean of the } i\text{th level of factor A (where } i = 1, 2, \ldots, r)$$

$$\bar{X}_{.j.} = \frac{\displaystyle\sum_{i=1}^{r}\sum_{k=1}^{n'} X_{ijk}}{rn'} = \text{mean of the } j\text{th level of factor B (where } j = 1, 2, \ldots, c)$$

$$\bar{X}_{ij.} = \frac{\displaystyle\sum_{k=1}^{n'} X_{ijk}}{n'} = \text{mean of the cell } ij\text{, the combination of the } i\text{th level of factor A and the } j\text{th level of factor B}$$

replicates  Sample sizes for particular combinations of two factors in two-way ANOVA.

This text deals only with situations in which there are an equal number of replicates (i.e. sample sizes n′) for each combination of the levels of factor A with those of factor B. (See references 5 and 9 for a discussion of two-factor factorial designs with unequal sample sizes.)

Testing for Factor and Interaction Effects interaction The impact of one independent variable depends on the value of another independent variable.

There is an interaction between factors A and B if the effect of factor A is dependent on the level of factor B. Thus, the decomposition of total variation needs to account for a possible interaction effect, as well as factor A, factor B and random error. Therefore, the total variation (SST) is subdivided into sum of squares due to factor A (or SSA), sum of squares due to factor B (or SSB), sum of squares due to the interaction effect of A and B (or SSAB), and sum of squares due to random error (or SSE). This decomposition of the total variation (SST) is displayed in Figure 11.13. The sum of squares total (SST) represents the total variation between all the values around the grand mean. Equation 11.16 shows the calculation for total variation.

Figure 11.13  Partitioning the total variation in a two-factor factorial design:
SST = SSA + SSB + SSAB + SSE
Factor A variation (SSA), df = r − 1; Factor B variation (SSB), df = c − 1; Interaction (SSAB), df = (r − 1)(c − 1); Random variation (SSE), df = rc(n′ − 1); Total variation (SST), df = n − 1

TOTAL VARIATION IN TWO-WAY ANOVA

SST = \sum_{i=1}^{r} \sum_{j=1}^{c} \sum_{k=1}^{n'} (X_{ijk} - \overline{\overline{X}})^2 \qquad (11.16)

sum of squares due to factor A (SSA)  Variation due to factor A in two-way ANOVA.

The sum of squares due to factor A (SSA) represents the differences between the various levels of factor A and the grand mean. Equation (11.17) shows the calculation for factor A variation.


FACTOR A VARIATION

SSA = cn' \sum_{i=1}^{r} (\overline{X}_{i..} - \overline{\overline{X}})^2 \qquad (11.17)

The sum of squares due to factor B (SSB) represents the differences between the various levels of factor B and the grand mean. Equation (11.18) shows the calculation for factor B variation.

sum of squares due to factor B (SSB) Variation due to factor B in two-way ANOVA.

FACTOR B VARIATION

SSB = rn' \sum_{j=1}^{c} (\overline{X}_{.j.} - \overline{\overline{X}})^2 \qquad (11.18)

The sum of squares due to interaction (SSAB) represents the interacting effect of specific combinations of factor A and factor B. Equation (11.19) shows the calculation for interaction variation.

sum of squares due to interaction (SSAB) The interacting effect of specific combinations of factor A and factor B.

INTERACTION VARIATION

SSAB = n' \sum_{i=1}^{r} \sum_{j=1}^{c} (\overline{X}_{ij.} - \overline{X}_{i..} - \overline{X}_{.j.} + \overline{\overline{X}})^2 \qquad (11.19)

The sum of squares error (SSE) represents the differences between the values within each cell and the corresponding cell mean. Equation (11.20) shows the calculation for random error.

RANDOM ERROR IN TWO-WAY ANOVA

SSE = \sum_{i=1}^{r} \sum_{j=1}^{c} \sum_{k=1}^{n'} (X_{ijk} - \overline{X}_{ij.})^2 \qquad (11.20)

Because there are r treatment levels of factor A, there are r − 1 degrees of freedom associated with SSA. Similarly, because there are c treatment levels of factor B, there are c − 1 degrees of freedom associated with SSB. Because there are n′ replications in each of the rc cells, there are rc(n′ − 1) degrees of freedom associated with the SSE term. Carrying this further, there are n − 1 degrees of freedom associated with the sum of squares total (SST) – because you are comparing each value Xijk with the grand mean based on all n values. Therefore, because the degrees of freedom for each of the sources of variation must add to the degrees of freedom for the total variation (SST), you calculate the degrees of freedom for the interaction component (SSAB) by subtraction. The degrees of freedom for interaction are (r − 1)(c − 1).
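For readers who want to check this decomposition numerically, the following is a minimal Python/NumPy sketch (ours, not the textbook's Excel approach), assuming a balanced design stored as an array with one row per level of factor A, one column per level of factor B and n′ replicates in the third dimension; the helper name two_way_sums_of_squares is introduced here purely for illustration.

```python
# Minimal sketch: the two-way ANOVA decomposition in NumPy,
# assuming a balanced design stored as an array of shape (r, c, n').
import numpy as np

def two_way_sums_of_squares(data):
    """Return SSA, SSB, SSAB, SSE and SST for a balanced r x c x n' data array."""
    r, c, n_prime = data.shape
    grand_mean = data.mean()                 # grand mean (X double-bar)
    a_means = data.mean(axis=(1, 2))         # X-bar_i.. , one per level of factor A
    b_means = data.mean(axis=(0, 2))         # X-bar_.j. , one per level of factor B
    cell_means = data.mean(axis=2)           # X-bar_ij. , r x c table of cell means

    ssa = c * n_prime * np.sum((a_means - grand_mean) ** 2)            # Equation 11.17
    ssb = r * n_prime * np.sum((b_means - grand_mean) ** 2)            # Equation 11.18
    ssab = n_prime * np.sum(
        (cell_means - a_means[:, None] - b_means[None, :] + grand_mean) ** 2
    )                                                                   # Equation 11.19
    sse = np.sum((data - cell_means[:, :, None]) ** 2)                  # Equation 11.20
    sst = np.sum((data - grand_mean) ** 2)                              # Equation 11.16
    return ssa, ssb, ssab, sse, sst
```

A useful check on any balanced data set is that SSA + SSB + SSAB + SSE equals SST up to rounding.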


mean square between A (MSBA)  The sum of squares due to factor A divided by the appropriate degrees of freedom.
mean square between B (MSBB)  The sum of squares due to factor B divided by the appropriate degrees of freedom.
mean square between AB (MSBAB)  The interaction sum of squares, SSAB, divided by the appropriate degrees of freedom.

If you divide each of the sums of squares by its associated degrees of freedom, you have the four variances or mean square terms (MSBA, MSBB, MSBAB and MSE). Equations 11.21a–d give the mean square terms needed for the two-way ANOVA table.

THE MEAN SQUARES IN TWO-WAY ANOVA

MSBA = \dfrac{SSA}{r - 1} \qquad (11.21a)

MSBB = \dfrac{SSB}{c - 1} \qquad (11.21b)

MSBAB = \dfrac{SSAB}{(r - 1)(c - 1)} \qquad (11.21c)

MSE = \dfrac{SSE}{rc(n' - 1)} \qquad (11.21d)

There are three distinct tests to perform in a two-way ANOVA.

1. To test the hypothesis of no difference due to factor A:
H_0: \mu_{1..} = \mu_{2..} = \cdots = \mu_{r..}
against the alternative hypothesis:
H_1: \text{Not all } \mu_{i..} \text{ are equal}
you use the F statistic in Equation 11.22.

F TEST FOR FACTOR A EFFECT

F = \dfrac{MSBA}{MSE} \qquad (11.22)

F test for factor A effect  The F test statistic formed by dividing mean square A by mean square error.

You reject the null hypothesis at the α level of significance if:
F = \dfrac{MSBA}{MSE} > F_U
the upper-tail critical value from an F distribution with r − 1 and rc(n′ − 1) degrees of freedom.

2. To test the hypothesis of no difference due to factor B:
H_0: \mu_{.1.} = \mu_{.2.} = \cdots = \mu_{.c.}
against the alternative hypothesis:
H_1: \text{Not all } \mu_{.j.} \text{ are equal}
you use the F statistic in Equation 11.23.

F test for factor B effect  The F test statistic formed by dividing mean square B by mean square error.

F TEST FOR FACTOR B EFFECT

F = \dfrac{MSBB}{MSE} \qquad (11.23)


You reject the null hypothesis at the α level of significance if:
F = \dfrac{MSBB}{MSE} > F_U
the upper-tail critical value from an F distribution with c − 1 and rc(n′ − 1) degrees of freedom.

3. To test the hypothesis of no interaction of factors A and B:
H_0: \text{The interaction of A and B is equal to zero}
against the alternative hypothesis:
H_1: \text{The interaction of A and B is not equal to zero}
you use the F statistic in Equation 11.24.

F TEST FOR INTERACTION EFFECT

F = \dfrac{MSBAB}{MSE} \qquad (11.24)

F test for interaction effect  The F test statistic formed by dividing mean square AB by mean square error.

You reject the null hypothesis at the α level of significance if:
F = \dfrac{MSBAB}{MSE} > F_U
the upper-tail critical value from an F distribution with (r − 1)(c − 1) and rc(n′ − 1) degrees of freedom. Table 11.8 summarises the entire set of steps.

Table 11.8  Analysis of variance table for the two-factor factorial design

Source | Degrees of freedom | Sum of squares | Mean square (variance) | F
A | r − 1 | SSA | MSBA = SSA/(r − 1) | F = MSBA/MSE
B | c − 1 | SSB | MSBB = SSB/(c − 1) | F = MSBB/MSE
AB | (r − 1)(c − 1) | SSAB | MSBAB = SSAB/[(r − 1)(c − 1)] | F = MSBAB/MSE
Error | rc(n′ − 1) | SSE | MSE = SSE/[rc(n′ − 1)] |
Total | n − 1 | SST | |
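Continuing the illustrative sketch from earlier (again, ours rather than the textbook's Excel workflow), the following Python code turns those sums of squares into the Table 11.8 ANOVA table, using scipy.stats.f for the p-values; two_way_sums_of_squares is the hypothetical helper defined in the earlier sketch.

```python
# Sketch: build the Table 11.8 ANOVA table from the sums of squares,
# using scipy.stats.f for upper-tail p-values.
from scipy import stats

def two_way_anova_table(data):
    r, c, n_prime = data.shape
    ssa, ssb, ssab, sse, sst = two_way_sums_of_squares(data)   # helper from the earlier sketch

    df_a, df_b = r - 1, c - 1
    df_ab, df_e = (r - 1) * (c - 1), r * c * (n_prime - 1)
    msa, msb, msab, mse = ssa / df_a, ssb / df_b, ssab / df_ab, sse / df_e

    for source, df, ss, ms in [("A", df_a, ssa, msa), ("B", df_b, ssb, msb), ("AB", df_ab, ssab, msab)]:
        f_stat = ms / mse
        p_value = stats.f.sf(f_stat, df, df_e)      # upper-tail area of the F distribution
        print(f"{source:5s} df={df:3d} SS={ss:10.3f} MS={ms:10.3f} F={f_stat:7.3f} p={p_value:.4f}")
    print(f"Error df={df_e:3d} SS={sse:10.3f} MS={mse:10.3f}")
    print(f"Total df={r * c * n_prime - 1:3d} SS={sst:10.3f}")
```

Applied to the 2 × 4 × 5 electricity consumption data introduced in the next example, this sketch should reproduce the F statistics reported in Figure 11.14 (about 22.79, 10.36 and 0.20).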

To examine two-way ANOVA, return to the electricity scenario. We have designed an experiment where we evaluate the electricity consumption of a sample of households from the four capital cities. Five households from each city purchase electricity from ‘green’ sources and five from ‘traditional’ sources. Forty households are sampled in total. The results < ELECTRICITY_CONSUMPTION1 > are given in Table 11.9.


Table 11.9  Electricity consumption by city and source (kWh)

Source of electricity | Sydney (1) | Melbourne (2) | Hobart (3) | Perth (4)
Green | 135 167 118 129 140 | 126 134 150 119 137 | 113 131 118 120 112 | 143 154 146 138 140
Traditional | 156 166 142 154 148 | 156 143 178 144 152 | 134 123 138 127 142 | 165 148 179 158 161
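If you want to verify the Excel output that follows with other software, the sketch below (a Python illustration, not part of the text's Excel instructions) enters the Table 11.9 data and fits the same two-factor model with interaction using the statsmodels formula interface.

```python
# Sketch: the Table 11.9 two-way ANOVA in Python via statsmodels.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

cells = {
    ("Green", "Sydney"):        [135, 167, 118, 129, 140],
    ("Green", "Melbourne"):     [126, 134, 150, 119, 137],
    ("Green", "Hobart"):        [113, 131, 118, 120, 112],
    ("Green", "Perth"):         [143, 154, 146, 138, 140],
    ("Traditional", "Sydney"):    [156, 166, 142, 154, 148],
    ("Traditional", "Melbourne"): [156, 143, 178, 144, 152],
    ("Traditional", "Hobart"):    [134, 123, 138, 127, 142],
    ("Traditional", "Perth"):     [165, 148, 179, 158, 161],
}
rows = [(source, city, kwh) for (source, city), values in cells.items() for kwh in values]
df = pd.DataFrame(rows, columns=["source", "city", "kwh"])

# C(source) * C(city) fits both main effects plus the source-by-city interaction
model = smf.ols("kwh ~ C(source) * C(city)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```

Because the design is balanced, the resulting sums of squares and F statistics should match the Figure 11.14 values below (about F = 22.79 for source, 10.36 for city and 0.20 for the interaction).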

Figure 11.14 presents Microsoft Excel output, with the summary tables providing the sample size, sum, mean and variance for each combination of city and source of electricity. The total column in the first two tables provides these statistics for each source of electricity, and the third table provides them for each city.

Figure 11.14 Microsoft Excel two-way ANOVA output for the electricity consumption example

ANOVA: two-factor with replication

Summary | Sydney (1) | Melbourne (2) | Hobart (3) | Perth (4) | Total
Green – Count | 5 | 5 | 5 | 5 | 20
Green – Sum | 689 | 666 | 594 | 721 | 2670
Green – Average | 137.8 | 133.2 | 118.8 | 144.2 | 133.5
Green – Variance | 333.7 | 137.7 | 57.7 | 39.2 | 211.53
Traditional – Count | 5 | 5 | 5 | 5 | 20
Traditional – Sum | 766 | 773 | 664 | 811 | 3014
Traditional – Average | 153.2 | 154.6 | 132.8 | 162.2 | 150.7
Traditional – Variance | 81.2 | 200.8 | 60.7 | 127.7 | 223.8
Total – Count | 10 | 10 | 10 | 10 |
Total – Sum | 1455 | 1439 | 1258 | 1532 |
Total – Average | 145.5 | 143.9 | 125.8 | 153.2 |
Total – Variance | 250.28 | 277.66 | 107.07 | 164.18 |

ANOVA
Source of variation | SS | df | MS | F | p-value | F crit
Sample (factor A) | 2958.4 | 1 | 2958.4 | 22.7854 | 3.8351E-05 | 4.1491
Columns (factor B) | 4037.0 | 3 | 1345.6667 | 10.3642 | 6.3999E-05 | 2.9011
Interaction | 79.4 | 3 | 26.4667 | 0.2038 | 0.8930 | 2.9011
Within | 4154.8 | 32 | 129.8375 | | |
Total | 11229.6 | 39 | | | |


To interpret the results of our experiment, start by testing whether there is an interaction effect between factor A (source of electricity) and factor B (city). If the interaction effect is significant, further analysis will refer only to this interaction. If the interaction effect is not significant, you can focus on the main effects – potential differences in source of electricity (factor A) and potential differences in cities (factor B). Using the 0.05 level of significance, to determine whether there is evidence of an interaction effect, you reject the null hypothesis of no interaction between source of electricity and city if the calculated F value is greater than 2.92, the approximate upper-tail critical value from the F distribution with 3 and 32 degrees of freedom (see Figure 11.15).1

main effects  The effects of individual factors averaged over the levels of other factors.

Figure 11.15  Regions of rejection and non-rejection at the 0.05 level of significance with 3 and 32 degrees of freedom (F distribution; region of non-rejection below the critical value 2.92, region of rejection in the upper 0.05 tail)

Because F = 0.204 < FU = 2.92 or the p-value = 0.893 > 0.05, do not reject H0. You conclude that there is insufficient evidence of an interaction effect between source of electricity and city. The focus is now on the main effects. Using the 0.05 level of significance and testing for a difference between the two sources of electricity (factor A), you reject the null hypothesis if the calculated F value is greater than 4.17, the (approximate) upper-tail critical value from the F distribution with 1 and 32 degrees of freedom (see Figure 11.16). Because F = 22.79 > FU = 4.17 or the p-value = 0.00 < 0.05, you reject H0. You conclude that there is sufficient evidence of a significant difference between the two sources of electricity in terms of households’ electricity consumption.

Figure 11.16  Regions of rejection and non-rejection at the 0.05 level of significance with 1 and 32 degrees of freedom (F distribution; region of non-rejection below the critical value 4.17, region of rejection in the upper 0.05 tail)

Using the 0.05 level of significance and testing for a difference between the cities (factor B), you reject the null hypothesis of no difference if the calculated F value exceeds 2.92, the approximate upper-tail critical value from the F distribution with 3 degrees of freedom in the numerator and 32 degrees of freedom in the denominator (see Figure 11.15).¹ Since F = 10.36 > FU = 2.92 or the p-value = 0.00 < 0.05, you reject H0. You conclude that there is evidence of a significant difference between the cities in terms of electricity consumption.

¹ Similar to one of our previous examples, Table E.5 does not provide the upper-tail critical values from the F distribution with 32 degrees of freedom in the denominator. Table E.5 gives critical values either for F distributions with 30 degrees of freedom in the denominator or for F distributions with 40 degrees of freedom in the denominator. When the desired degrees of freedom are not provided in the table, you can round to the closest value that is given or use the p-value approach.
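As the footnote notes, Table E.5 does not list 32 denominator degrees of freedom. A small sketch (ours, not the text's) shows how exact critical values can be obtained with scipy instead.

```python
# Sketch: exact upper-tail F critical values at alpha = 0.05
# for the degrees of freedom used in this example.
from scipy.stats import f

alpha, df_error = 0.05, 32
print(round(f.ppf(1 - alpha, 1, df_error), 3))   # factor A (source): about 4.149 (approximated above as 4.17)
print(round(f.ppf(1 - alpha, 3, df_error), 3))   # factor B (city) and interaction: about 2.901 (approximated as 2.92)
```

These values should agree with the F crit column reported in the Figure 11.14 output.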

Interpreting Interaction Effects

You can better understand the interaction by plotting the cell means (i.e. the means of specific treatment level combinations) as shown in Figure 11.17. Figure 11.14 provides the cell means for the source–city combinations. From the plot of the mean electricity consumption for each combination of electricity source and city, observe that the two lines (representing the two sources) are roughly parallel. This indicates that the difference between mean electricity consumption for the two sources is virtually the same for the four cities. In other words, there is no interaction between these two factors, as was clearly substantiated from the F test.

Figure 11.17  Microsoft Excel cell means plot of electricity consumption based on source and city (average electricity consumption in kWh plotted for the Green and Traditional sources across Sydney (1), Melbourne (2), Hobart (3) and Perth (4))
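A cell means plot is straightforward to produce programmatically as well. The sketch below (ours, not the text's Excel method) reuses the DataFrame df built in the earlier statsmodels sketch and plots one line per source.

```python
# Sketch: a cell means plot for the Table 11.9 data,
# reusing the DataFrame `df` from the earlier statsmodels sketch.
import matplotlib.pyplot as plt

cell_means = df.groupby(["city", "source"])["kwh"].mean().unstack("source")
cell_means = cell_means.reindex(["Sydney", "Melbourne", "Hobart", "Perth"])  # keep the textbook city order

cell_means.plot(marker="o")                       # one line per source, cities on the x-axis
plt.ylabel("Mean electricity consumption (kWh)")
plt.title("Cell means plot: electricity consumption by source and city")
plt.show()
```

Roughly parallel lines, as here, suggest little or no interaction; crossing lines, as in Example 11.2 below, suggest an interaction that the F test may confirm.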

What is the interpretation if there is an interaction? In such a situation, some levels of factor A would respond better with certain levels of factor B. For example, with respect to electricity consumption, suppose that some cities offered bonuses for the adoption of green energy sources and the use of energy conservation devices. If this were true, the lines in Figure 11.17 would not be nearly as parallel, and the interaction effect might be statistically significant. In such a situation, the difference between the electricity consumption of those with green versus traditional sources would no longer be the same for all cities. Such an outcome would also complicate the interpretation of the main effects, because differences in one factor (i.e. source) are not consistent across the other factor (i.e. city). Example 11.2 illustrates a situation with a significant interaction effect.

EXAMPLE 11.2
INTERPRETING SIGNIFICANT INTERACTION EFFECTS
A nationwide private education company specialising in pathway courses for university admission had the business objective of improving its courses. Two factors of interest to the company were the length of the course (a condensed 10-day period or a regular 30-day period) and the type of course (traditional classroom or online distance learning). The company collected data by randomly assigning 10 clients to each of the four cells that represent a combination of length of the course and type of course. The pathway course scores are organised in the file < ACT > and presented in Table 11.10.


What are the effects of the type of course and the length of the course on the pathway course scores?

Table 11.10  Pathway course scores for different types and lengths of courses

Method | Condensed (10-day) | Regular (30-day)
Traditional | 26 18 27 24 25 19 21 20 21 18 | 34 28 24 21 35 23 31 29 28 26
Online | 27 21 29 32 30 20 24 28 30 29 | 24 21 16 19 22 19 20 24 23 25

SOLUTION

The cell means plot presented in Figure 11.18 shows a strong interaction between the type of course and the length of the course. The intersecting lines indicate that the effect of condensing the course depends on whether the course is taught in the traditional classroom or by online distance learning. The online mean score is higher when the course is condensed to a 10-day period, whereas the traditional mean score is higher when the course takes place over the regular 30-day period.

Figure 11.18  Excel cell means plot of pathway scores (cell means for the Online and Traditional methods at the Condensed and Regular course lengths; the two lines cross)

To verify the visual analysis provided by interpreting the cell means plot, you begin by testing whether there is a statistically significant interaction between factor A (type of course) and factor B (length of course). Using a 0.05 level of significance, you reject the null hypothesis because F = 24.2569 > 4.1132 or the p-value equals 0.0000 < 0.05 (see Figure 11.19). Thus, the hypothesis test confirms the interaction evident in the cell means plot. The existence of this significant interaction effect complicates the interpretation of the hypothesis tests concerning the two main effects. You cannot directly conclude that there is no effect with respect to length of course and type of course, even though both have p-values > 0.05.

Figure 11.19  Excel two-way ANOVA results for the pathway scores

ANOVA: two-factor with replication

Summary | Condensed | Regular | Total
Traditional – Count | 10 | 10 | 20
Traditional – Sum | 219 | 279 | 498
Traditional – Average | 21.90 | 27.90 | 24.90
Traditional – Variance | 11.21 | 20.99 | 24.73
Online – Count | 10 | 10 | 20
Online – Sum | 270 | 213 | 483
Online – Average | 27.00 | 21.30 | 24.15
Online – Variance | 16.22 | 8.01 | 20.03
Total – Count | 20 | 20 |
Total – Sum | 489 | 492 |
Total – Average | 24.45 | 24.60 |
Total – Variance | 19.84 | 25.20 |

ANOVA
Source of variation | SS | df | MS | F | p-value | F crit
Sample | 5.6250 | 1 | 5.6250 | 0.3987 | 0.5318 | 4.1132
Columns | 0.2250 | 1 | 0.2250 | 0.0159 | 0.9002 | 4.1132
Interaction | 342.2250 | 1 | 342.2250 | 24.2569 | 0.0000 | 4.1132
Within | 507.9000 | 36 | 14.1083 | | |
Total | 855.9750 | 39 | | | |

Given that the interaction is significant, you can re-analyse the data with the two factors collapsed into the four groups of a single factor, rather than as a two-way ANOVA with two levels of each of the two factors. You can reorganise the data as follows: group 1 is traditional condensed, group 2 is traditional regular, group 3 is online condensed and group 4 is online regular. Figure 11.20 shows the results for these data, stored in < ACT_ONEWAY >.
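The collapsed one-way analysis can also be sketched in Python (again an illustration, not the text's method): scipy supplies the one-way F test and statsmodels supplies a Tukey comparison of the four groups.

```python
# Sketch: collapse the Table 11.10 scores into four groups,
# then run a one-way ANOVA and Tukey multiple comparisons.
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

group1 = [26, 18, 27, 24, 25, 19, 21, 20, 21, 18]   # traditional condensed
group2 = [34, 28, 24, 21, 35, 23, 31, 29, 28, 26]   # traditional regular
group3 = [27, 21, 29, 32, 30, 20, 24, 28, 30, 29]   # online condensed
group4 = [24, 21, 16, 19, 22, 19, 20, 24, 23, 25]   # online regular

print(stats.f_oneway(group1, group2, group3, group4))        # one-way ANOVA F test

scores = group1 + group2 + group3 + group4
labels = ["Group 1"] * 10 + ["Group 2"] * 10 + ["Group 3"] * 10 + ["Group 4"] * 10
print(pairwise_tukeyhsd(scores, labels, alpha=0.05))          # which pairs of groups differ
```

The F statistic should be about 8.22, and the Tukey output should flag the same four significantly different pairs reported in Figure 11.20.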


Figure 11.20  Excel one-way ANOVA and Tukey–Kramer results for the pathway scores

ANOVA: single factor

Summary
Groups | Count | Sum | Average | Variance
Group 1 | 10 | 219 | 21.9 | 11.2111
Group 2 | 10 | 279 | 27.9 | 20.9889
Group 3 | 10 | 270 | 27.0 | 16.2222
Group 4 | 10 | 213 | 21.3 | 8.0111

ANOVA
Source of variation | SS | df | MS | F | p-value | F crit
Between groups | 348.075 | 3 | 116.025 | 8.2239 | 0.0003 | 2.8663
Within groups | 507.9 | 36 | 14.1083 | | |
Total | 855.975 | 39 | | | |

Tukey–Kramer multiple comparisons
Group | Sample mean | Sample size
Group 1 | 21.9 | 10
Group 2 | 27.9 | 10
Group 3 | 27.0 | 10
Group 4 | 21.3 | 10

Comparison | Absolute difference | Std error of difference | Critical range | Results
Group 1 to Group 2 | 6.0 | 1.1878 | 4.5017 | Means are different
Group 1 to Group 3 | 5.1 | 1.1878 | 4.5017 | Means are different
Group 1 to Group 4 | 0.6 | 1.1878 | 4.5017 | Means are not different
Group 2 to Group 3 | 0.9 | 1.1878 | 4.5017 | Means are not different
Group 2 to Group 4 | 6.6 | 1.1878 | 4.5017 | Means are different
Group 3 to Group 4 | 5.7 | 1.1878 | 4.5017 | Means are different

Other data: Level of significance 0.05; Numerator df 4; Denominator df 36; MSW 14.1083; Q statistic 3.79

From Figure 11.20, because F = 8.2239 > 2.8663 or p-value = 0.0003 < 0.05, there is evidence of a significant difference in the four groups (traditional condensed, traditional regular, online condensed and online regular). Traditional condensed is different from traditional regular and from online condensed. Traditional regular is also different from online regular, and online condensed is also different from online regular. Thus, whether condensing a course is a good idea depends on whether the course is offered in a traditional classroom or as an online distance learning course. To ensure the highest mean pathway scores, the company should use the traditional approach for courses that are given over a 30-day period but use the online approach for courses that are condensed into a 10-day period. This confirms the information conveyed in Figure 11.18.

Multiple Comparisons: The Tukey Procedure

Unless there is a significant interaction effect, you can determine the particular levels that are significantly different by using a procedure developed by John Tukey (see references 2 and 3). Equation 11.25 gives the critical range for factor A.


THE CRITICAL RANGE FOR FACTOR A

Critical range = Q_U \sqrt{\dfrac{MSE}{cn'}} \qquad (11.25)

where QU is the upper-tail critical value from a Studentised range distribution having r and rc(n′ − 1) degrees of freedom. Values for the Studentised range distribution are found in Table E.10.

Equation 11.26 gives the critical range for factor B.

THE CRITICAL RANGE FOR FACTOR B

Critical range = Q_U \sqrt{\dfrac{MSE}{rn'}} \qquad (11.26)

where QU is the upper-tail critical value from a Studentised range distribution having c and rc(n′ − 1) degrees of freedom. Values for the Studentised range distribution are found in Table E.10.

To use the Tukey procedure, return to the electricity consumption data of Table 11.9 on page 430. In Figure 11.14 (the ANOVA summary table provided by Microsoft Excel), both main effects were significant. Using an α level of 0.05, there is evidence of a significant difference between the two sources that comprise factor A. However, because there are only two levels (Green and Traditional), it is not necessary to apply the Tukey procedure. Also, there is evidence of a significant difference between the four cities that comprise factor B. Thus, you can use the multiple comparison procedure to analyse further differences between the cities of factor B. Because there are four levels of factor B, there are 4(4 − 1)/2 = 6 pairwise comparisons. Using the calculations presented in Figure 11.14, the absolute mean differences are as follows:

1. |\overline{X}_{.1.} - \overline{X}_{.2.}| = |145.5 - 143.9| = 1.6
2. |\overline{X}_{.1.} - \overline{X}_{.3.}| = |145.5 - 125.8| = 19.7
3. |\overline{X}_{.1.} - \overline{X}_{.4.}| = |145.5 - 153.2| = 7.7
4. |\overline{X}_{.2.} - \overline{X}_{.3.}| = |143.9 - 125.8| = 18.1
5. |\overline{X}_{.2.} - \overline{X}_{.4.}| = |143.9 - 153.2| = 9.3
6. |\overline{X}_{.3.} - \overline{X}_{.4.}| = |125.8 - 153.2| = 27.4

To determine the critical range, refer to Figure 11.14 to find MSE = 129.838, r = 2, c = 4 and n′ = 5. From Table E.10 (for α = 0.05, c = 4 and rc(n′ − 1) = 32), QU, the upper-tail critical value of the test statistic with 4 and 32 degrees of freedom, is approximated as 3.84. Using Equation 11.26:

Critical range = 3.84 \sqrt{\dfrac{129.838}{10}} = 13.84

Because 19.7, 18.1 and 27.4 are all greater than 13.84, Hobart differs significantly from Sydney, from Melbourne and from Perth; none of the other pairs of cities differ.
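If you prefer an exact Q value to the Table E.10 approximation, the sketch below (ours, and assuming SciPy 1.7 or later, which provides scipy.stats.studentized_range) recomputes the factor B critical range.

```python
# Sketch: the Equation 11.26 critical range with an exact Studentised range quantile
# (requires SciPy >= 1.7 for scipy.stats.studentized_range).
import math
from scipy.stats import studentized_range

mse, r, c, n_prime = 129.838, 2, 4, 5
df_error = r * c * (n_prime - 1)                          # 32

q_upper = studentized_range.ppf(0.95, c, df_error)        # exact Q for 4 groups and 32 df
critical_range = q_upper * math.sqrt(mse / (r * n_prime)) # Equation 11.26
print(round(q_upper, 3), round(critical_range, 2))
```

The exact Q should be close to the tabled 3.84, so the critical range stays near 13.8 and the conclusions above are unchanged.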




Problems for Section 11.3

LEARNING THE BASICS
11.29 Consider a two-factor factorial design with three levels for factor A, three levels for factor B and four replicates in each of the nine cells. How many degrees of freedom are there in determining:
a. the factor A variation and the factor B variation?
b. the interaction variation?
c. the random variation?
d. the total variation?
11.30 Assume that you are working with the results from problem 11.29, and SSA = 120, SSB = 110, SSE = 270 and SST = 540.
a. What is SSAB?
b. What are MSA and MSB?
c. What is MSAB?
d. What is MSE?
11.31 Assume that you are working with the results of problem 11.30.
a. What is the value of the test statistic F for the interaction effect?
b. What is the value of the test statistic F for the factor A effect?
c. What is the value of the test statistic F for the factor B effect?
d. Form the ANOVA summary table and fill in all values in the body of the table.
11.32 Given the results from problems 11.29 to 11.31, at the 0.05 level of significance:
a. Is there an effect due to factor A?
b. Is there an effect due to factor B?
c. Is there an interaction effect?
11.33 Given a two-way ANOVA with two treatment levels for factor A and five treatment levels for factor B, and four replicates in each of the 10 treatment combinations of factors A and B, with SSA = 18, SSB = 64, SSE = 60 and SST = 150:
a. Form the ANOVA summary table and fill in all values in the body of the table.
b. At the 0.05 level of significance:
i. Is there an effect due to factor A?
ii. Is there an effect due to factor B?
iii. Is there an interaction effect?
11.34 Given a two-factor factorial experiment and the ANOVA summary table that follows, fill in all the missing results:

Source | Degrees of freedom | Sum of squares | Mean square (variance) | F
Factor A | r − 1 = 4 | SSA = ? | MSBA = 80 | F = ?
Factor B | c − 1 = ? | SSB = 220 | MSBB = ? | F = 11.0
AB interaction | (r − 1)(c − 1) = 8 | SSAB = ? | MSBAB = 10 | F = ?
Error | rc(n′ − 1) = 30 | SSE = ? | MSE = ? |
Total | n − 1 = ? | SST = ? | |

11.35 From the results of problem 11.34, at the 0.05 level of significance:
a. Is there an effect due to factor A?
b. Is there an effect due to factor B?
c. Is there an interaction effect?

APPLYING THE CONCEPTS
Problems 11.36–11.38 can be solved manually or by using Microsoft Excel.

11.36 The effects of developer strength (factor A) and development time (factor B) on the density of photographic plate film were being studied. Two strengths and two development times were used, and four replicates for each treatment combination were run. The results (with larger values being best) are as follows: < PHOTO >

Developer strength | Development time 10 minutes | Development time 14 minutes
1 | 0 5 2 4 | 1 4 3 2
2 | 4 7 6 5 | 6 7 8 7

At the 0.05 level of significance:
a. Is there a significant interaction between developer strength and development time?
b. Is there an effect due to developer strength?
c. Is there an effect due to development time?
d. Plot the mean density of each developer strength for each development time.
e. What can you conclude about the effect of developer strength and development time on density?
11.37 A chef in a restaurant that specialises in pasta dishes was experiencing difficulty in getting brands of pasta to be al dente – that is, cooked enough so as not to feel starchy or hard but still feel firm when bitten into. She decided to conduct an experiment in which two brands of pasta, one Australian and one Italian, were cooked for either 4 or 8 minutes. The response variable measured was the weight of the pasta, since cooking the pasta enables it to absorb water. A pasta with a faster rate of water absorption may provide a shorter interval in which the pasta is al dente, thereby increasing the chance that it might be overcooked. The experiment was conducted by using 150 grams of uncooked pasta. Each trial began by bringing a pot containing 5 litres of cold unsalted water to a moderate boil. The 150 g of uncooked pasta was added and then weighed after a given period of time by lifting the pasta from the pot via a built-in strainer. The results (in terms of weight in grams) for two replications of each type of pasta and cooking time are as follows: < PASTA >

Type of pasta | Cooking time 4 minutes | Cooking time 8 minutes
Australian | 265, 270 | 310, 320
Italian | 250, 245 | 300, 305

At the 0.05 level of significance:
a. Is there a significant interaction between type of pasta and cooking time?
b. Is there an effect due to type of pasta?
c. Is there an effect due to cooking time?
d. Plot the mean weight for each type of pasta for each cooking time.
e. What conclusions can you reach concerning the importance of each of these two factors on the weight of the pasta?
11.38 A student team in a business statistics course performed a factorial experiment to investigate the sales of a new product. The two factors were marketing technique (A, B, C) and geographical location of market (Newcastle and Wollongong). The data are presented below:

Market | Marketing technique A | Marketing technique B | Marketing technique C
Newcastle | 87 97 78 102 | 45 56 65 33 | 89 67 85 56
Wollongong | 67 56 67 75 | 32 23 42 18 | 75 63 56 67

At the 0.05 level of significance:
a. Is there a significant interaction between brand marketing technique and location of market?
b. Is there an effect due to marketing technique?
c. Is there an effect due to the location of market?
d. Plot the mean sales for each marketing technique for the two locations.
e. Discuss the results of (a) to (d).

11 Assess your progress

Summary
In this chapter, various statistical procedures were used to analyse the effect of one or two factors of interest. You learned how electricity prices and consumption could differ because of city or source of electricity. The assumptions required for using these procedures were discussed in detail. Remember that you need to investigate critically the validity of the assumptions underlying the hypothesis-testing procedures. Table 11.11 summarises the topics covered in this chapter.

Table 11.11  Summary of topics in Chapter 11
Type of analysis | Type of data: numerical
Comparing more than two groups | One-way analysis of variance (Section 11.1); Randomised block design (Section 11.2); Two-way analysis of variance (Section 11.3)


Key formulas

Total variation in one-way ANOVA
SST = \sum_{j=1}^{c} \sum_{i=1}^{n_j} (X_{ij} - \overline{\overline{X}})^2 \qquad (11.1)

Between-group variation in one-way ANOVA
SSB = \sum_{j=1}^{c} n_j (\overline{X}_j - \overline{\overline{X}})^2 \qquad (11.2)

Within-group variation in one-way ANOVA
SSW = \sum_{j=1}^{c} \sum_{i=1}^{n_j} (X_{ij} - \overline{X}_j)^2 \qquad (11.3)

The mean squares in one-way ANOVA
MSB = \dfrac{SSB}{c - 1} \qquad (11.4a)
MSW = \dfrac{SSW}{n - c} \qquad (11.4b)
MST = \dfrac{SST}{n - 1} \qquad (11.4c)

One-way ANOVA F test statistic
F = \dfrac{MSB}{MSW} \qquad (11.5)

The critical range for the Tukey–Kramer procedure
Critical range = Q_U \sqrt{\dfrac{MSW}{2}\left(\dfrac{1}{n_j} + \dfrac{1}{n_{j'}}\right)} \qquad (11.6)

Total variation in randomised block design
SST = \sum_{j=1}^{c} \sum_{i=1}^{r} (X_{ij} - \overline{\overline{X}})^2 \qquad (11.7)

Between-group variation in randomised block design
SSB = r \sum_{j=1}^{c} (\overline{X}_{.j} - \overline{\overline{X}})^2 \qquad (11.8)

Between-block variation in randomised block design
SSBL = c \sum_{i=1}^{r} (\overline{X}_{i.} - \overline{\overline{X}})^2 \qquad (11.9)

Random error in randomised block design
SSE = \sum_{j=1}^{c} \sum_{i=1}^{r} (X_{ij} - \overline{X}_{.j} - \overline{X}_{i.} + \overline{\overline{X}})^2 \qquad (11.10)

The mean squares in randomised block design
MSB = \dfrac{SSB}{c - 1} \qquad (11.11a)
MSBL = \dfrac{SSBL}{r - 1} \qquad (11.11b)
MSE = \dfrac{SSE}{(r - 1)(c - 1)} \qquad (11.11c)

Randomised block F test statistic
F = \dfrac{MSB}{MSE} \qquad (11.12)

F test statistic for block effects
F = \dfrac{MSBL}{MSE} \qquad (11.13)

Estimated relative efficiency
RE = \dfrac{(r - 1)MSBL + r(c - 1)MSE}{(rc - 1)MSE} \qquad (11.14)

The critical range for the randomised block design
Critical range = Q_U \sqrt{\dfrac{MSE}{r}} \qquad (11.15)

Total variation in two-way ANOVA
SST = \sum_{i=1}^{r} \sum_{j=1}^{c} \sum_{k=1}^{n'} (X_{ijk} - \overline{\overline{X}})^2 \qquad (11.16)

Factor A variation
SSA = cn' \sum_{i=1}^{r} (\overline{X}_{i..} - \overline{\overline{X}})^2 \qquad (11.17)

Factor B variation
SSB = rn' \sum_{j=1}^{c} (\overline{X}_{.j.} - \overline{\overline{X}})^2 \qquad (11.18)

Interaction variation
SSAB = n' \sum_{i=1}^{r} \sum_{j=1}^{c} (\overline{X}_{ij.} - \overline{X}_{i..} - \overline{X}_{.j.} + \overline{\overline{X}})^2 \qquad (11.19)

Random error in two-way ANOVA
SSE = \sum_{i=1}^{r} \sum_{j=1}^{c} \sum_{k=1}^{n'} (X_{ijk} - \overline{X}_{ij.})^2 \qquad (11.20)

The mean squares in two-way ANOVA
MSBA = \dfrac{SSA}{r - 1} \qquad (11.21a)
MSBB = \dfrac{SSB}{c - 1} \qquad (11.21b)
MSBAB = \dfrac{SSAB}{(r - 1)(c - 1)} \qquad (11.21c)
MSE = \dfrac{SSE}{rc(n' - 1)} \qquad (11.21d)

F test for factor A effect
F = \dfrac{MSBA}{MSE} \qquad (11.22)

F test for factor B effect
F = \dfrac{MSBB}{MSE} \qquad (11.23)

F test for interaction effect
F = \dfrac{MSBAB}{MSE} \qquad (11.24)

The critical range for factor A
Critical range = Q_U \sqrt{\dfrac{MSE}{cn'}} \qquad (11.25)

The critical range for factor B
Critical range = Q_U \sqrt{\dfrac{MSE}{rn'}} \qquad (11.26)

Key terms
analysis of variance (ANOVA) 402; ANOVA summary table 405; between-block variation 417; between-group variation 402; blocks 416; completely randomised designs 402; critical range 409; estimated relative efficiency (RE) 422; factor 402; F distribution 405; F test for block effects 419; F test for factor A effect 428; F test for factor B effect 428; F test for interaction effect 429; grand mean 403; groups 402; homogeneity of variance 411; interaction 426; levels 402; Levene test 411; main effects 431; mean square between (MSB) 404; mean square between A (MSBA) 428; mean square between AB (MSBAB) 428; mean square between B (MSBB) 428; mean square between blocks (MSBL) 418; mean square error (MSE) 418; mean square total (MST) 404; mean square within (MSW) 404; multiple comparisons 408; normality 410; one-way ANOVA 402; one-way ANOVA F test statistic 404; post-hoc 408; random error 402; randomised block design 416; randomness and independence 410; replicates 426; Studentised range distribution 409; sum of squares between blocks (SSBL) 417; sum of squares between groups (SSB) 403; sum of squares due to factor A (SSA) 426; sum of squares due to factor B (SSB) 427; sum of squares due to interaction (SSAB) 427; sum of squares error (SSE) 418; sum of squares total (SST) 403; sum of squares within groups (SSW) 404; total variation 403; treatment effect 402; Tukey–Kramer multiple comparison procedure 408; Tukey procedure 422; two-factor factorial design 425; two-way ANOVA 425; within-group variation 402

References
1. Hicks, C. R. & K. V. Turner, Fundamental Concepts in the Design of Experiments, 5th edn (New York: Oxford University Press, 1999).
2. Montgomery, D. M., Design and Analysis of Experiments, 6th edn (New York: John Wiley, 2005).
3. Tukey, J. W., ‘Comparing individual means in the analysis of variance’, Biometrics, 5 (1949): 99–114.
4. Kramer, C. Y., ‘Extension of multiple range tests to group means with unequal numbers of replications’, Biometrics, 12 (1956): 307–310.
5. Berenson, M. L., D. M. Levine & M. Goldstein, Intermediate Statistical Methods and Applications: A Computer Package Approach (Englewood Cliffs, NJ: Prentice Hall, 1983).
6. Conover, W. J., Practical Nonparametric Statistics, 3rd edn (New York: Wiley, 2000).
7. Daniel, W. W., Applied Nonparametric Statistics, 2nd edn (Boston: PWS Kent, 1990).
8. Gitlow, H. S. & D. M. Levine, Six Sigma for Green Belts and Champions: Foundations, DMAIC, Tools, Cases, and Certification (Upper Saddle River, NJ: Financial Times – Prentice-Hall, 2005).
9. Neter, J., M. H. Kutner, C. Nachtsheim & W. Wasserman, Applied Linear Statistical Models, 4th edn (Homewood, IL: Richard D. Irwin, 1996).


Chapter review problems

CHECKING YOUR UNDERSTANDING
11.39 In a one-way ANOVA, what is the difference between the between-groups variance MSB and the within-groups variance MSW?
11.40 What are the distinguishing features of the completely randomised design, randomised block design and two-factor factorial designs?
11.41 What are the assumptions of ANOVA?
11.42 Under what conditions should you select the one-way ANOVA F test to examine possible differences between the means of c independent populations?
11.43 When and how should you use multiple comparison procedures for evaluating pairwise combinations of the group means?
11.44 What is the difference between the one-way ANOVA F test and the Levene test?
11.45 Under what conditions should you use the two-way ANOVA F test to examine possible differences between the means of each factor in a factorial design?
11.46 What is meant by the concept of interaction in a two-factor factorial design?
11.47 How can you determine whether there is an interaction in the two-factor factorial design?

APPLYING THE CONCEPTS
Problems 11.48–11.56 can be solved manually or by using Microsoft Excel.

11.48 The effects of the global financial crisis (GFC) (factor A) and location (factor B) are being studied on consumer confidence (measured on a scale of 10). < CONFIDENCE >

GFC period | Sydney | Singapore | Wellington | London
Pre-crisis | 7 8 5 9 8 | 8 9 9 10 7 | 7 6 5 7 8 | 5 8 9 4 6
Post-crisis | 6 6 5 8 7 | 8 8 9 9 9 | 5 6 4 6 5 | 3 4 2 5 3

a. At the 0.05 level of significance:
i. is there a significant interaction between location and the GFC?
ii. is there an effect due to location?
iii. is there an effect due to the GFC?
b. Plot the mean consumer confidence level for each location pre- and post-GFC.
c. If appropriate, use the Tukey procedure to examine differences between locations and GFC.
d. What can you conclude about the effects of location and GFC on consumer confidence? Explain.
11.49 An operations manager wants to examine the effect of air-jet pressure (in pounds per square inch (psi)) on the breaking strength of yarn. Three different levels of air-jet pressure are to be considered: 30 psi, 40 psi and 50 psi. A random sample of 18 yarns are selected from the same batch, and the yarns are randomly assigned – six each – to the three levels of air-jet pressure. The breaking strength scores are stored in < YARN >.
a. At the 0.05 level of significance, is there evidence of a significant difference in the variances of the breaking strengths for the three air-jet pressures?
b. At the 0.05 level of significance, is there evidence of a significant difference between mean breaking strengths for the three air-jet pressures?
c. If appropriate, use the Tukey–Kramer procedure and a 0.05 level of significance to determine which air-jet pressures significantly differ with respect to mean breaking strength.
d. What should the operations manager conclude?
11.50 Suppose that, when setting up the experiment in problem 11.49, the operations manager is able to study the effect of side-to-side aspect in addition to air-jet pressure. Thus, instead of the one-factor completely randomised design in problem 11.49, a two-factor factorial design was used, with the first factor, side-to-side aspect, having two levels (nozzle and opposite) and the second factor, air-jet pressure, having three levels (30 psi, 40 psi and 50 psi). A sample of 18 yarns is randomly assigned, three to each of the six side-to-side aspect and pressure level combinations. The breaking-strength scores, stored in < YARN >, are as follows.

Side-to-side aspect | 30 psi | 40 psi | 50 psi
Nozzle | 25.5 | 24.8 | 23.2
Nozzle | 24.9 | 23.7 | 23.7
Nozzle | 26.1 | 24.4 | 22.7
Opposite | 24.7 | 23.6 | 22.6
Opposite | 24.2 | 23.3 | 22.8
Opposite | 23.6 | 21.4 | 24.9

a. At the 0.05 level of significance:
i. is there a significant interaction between side-to-side aspect and air-jet pressure?
ii. is there an effect due to side-to-side aspect?
iii. is there an effect due to air-jet pressure?
b. Plot the mean yarn breaking strength for each level of side-to-side aspect for each level of air-jet pressure.
c. If appropriate, use the Tukey procedure to study differences between the air-jet pressures.
d. On the basis of the results of (a) to (c), what conclusions can you reach concerning yarn breaking strength? Discuss.


e. Compare your results in (a) to (d) with those from the completely randomised design in problem 11.49. Discuss fully.
11.51 A hotel wanted to develop a new system for delivering room service breakfasts. In the current system, an order form is left on the bed in each room. If the customer wishes to receive a room service breakfast, they place the order form on the doorknob before 11 pm. The current system requires customers to select a 15-minute interval for desired delivery time (6.30–6.45 am, 6.45–7.00 am, etc.). The new system is designed to allow the customer to request a specific delivery time. The hotel wants to measure the difference (in minutes) between the actual delivery time and the requested delivery time of room service orders for breakfast. (A negative time means that the order was delivered before the requested time. A positive time means that the order was delivered after the requested time.) The factors included were the menu choice (American or Continental) and the desired time period in which the order was to be delivered (Early (6.30–8.00 am) or Late (8.00–9.30 am)). Ten orders for each combination of menu choice and desired time period were studied on a particular day. The data, stored in < BREAKFAST >, are as follows.

Type of breakfast | Early time period | Late time period
Continental | 1.2, 2.1, 3.3, 4.4, 3.4, 5.3, 2.2, 1.0, 5.4, 1.4 | −2.5, 3.0, −0.2, 1.2, 1.2, 0.7, −1.3, 0.2, −0.5, 3.8
American | 4.4, 1.1, 4.8, 7.1, 6.7, 5.6, 9.5, 4.1, 7.9, 9.4 | 6.0, 2.3, 4.2, 3.8, 5.5, 1.8, 5.1, 4.2, 4.9, 4.0

a. At the 0.05 level of significance:
i. is there a significant interaction between type of breakfast and desired time?
ii. is there an effect due to type of breakfast?
iii. is there an effect due to desired time?
b. Plot the mean delivery time difference for each desired time for each type of breakfast.
c. On the basis of the results of (a) to (c), what conclusions can you reach concerning delivery time difference? Discuss.

11.52 Does the price of fresh fruit smoothies differ between Boost juice, Naked juice and Funky juice outlets? A researcher collected the price of a sample of five products at each of the outlets (in dollars). < JUICE1 >
a. At the 0.05 level of significance:
i. is there evidence of a significant difference in the variance of the fruit smoothie prices?
ii. is there evidence of a difference between mean fruit smoothie price?
b. If appropriate, use the Tukey–Kramer procedure and a 0.05 level of significance to determine which outlet significantly differs with respect to mean price.
c. What conclusions can you reach?
11.53 Suppose that in problem 11.52, when setting up the experiment, the effects of the store outlet location (university, large shopping centre, street location) was studied in addition to the company name. Thus, instead of the one-factor completely randomised design given in problem 11.52, the experiment used a two-factor factorial design, the first factor, outlet, having three levels (Boost, Naked and Funky) and the second factor, outlet location, having three levels (university, large shopping centre and street location). The data is available in the data file < JUICE2 >.
a. At the 0.05 level of significance:
i. is there a significant interaction between outlet name and location?
ii. is there an effect due to outlet name?
iii. is there an effect due to outlet location?
b. Plot the mean price for the three outlets for each location. Describe the interaction and discuss why you can or cannot interpret the main effects in (i) and (ii) above.
c. On the basis of the results of (a) and (b), what conclusions can you reach concerning fruit smoothie prices? Discuss.
d. Compare and contrast your results in (a) to (c) with those from the completely randomised design in problem 11.52(a) to (c).
11.54 A recent blind wine tasting was held by the Pickled Palate Wine Club, during which eight wines under $14 were rated by club members. The region of origin and the price were not known to the club members until after the tasting had taken place. The wines rated (and the prices paid for them) were:
· Seahaven Riesling $10.59
· Hunter Chardonnay $8.50
· Mudgee Shiraz $8.50
· Margaret River Cabernet Sauvignon $10.69
· Barossa Cabernet Sauvignon $11.75
· Clare Cabernet Merlot $10.50
· Riverina Chardonnay $9.75
· Coonawarra Shiraz $13.59
The summed ratings over several characteristics for the 12 club members are in the data file < WINE >.
a. At the 0.01 level of significance, is there evidence of a significant difference between the mean rating scores of the wines?
b. What assumptions are necessary in order to answer (a) of this problem? Comment on the validity of these assumptions.


c. If appropriate, use the Tukey procedure and a 0.05 level of significance to determine the wines that differ in mean rating.
d. Based on your results in (c), do you think that the type of wine or the price has had an effect on the ratings?
e. Determine the relative efficiency of the randomised block design as compared with the completely randomised design.
11.55 Ignore the blocking variable in problem 11.54.
a. ‘Erroneously’ re-analyse the data as a one-factor completely randomised design where the one factor (brands of wines) has eight levels and each level contains a sample of 12 independent observations.
b. Compare the SSBL and SSE terms in problem 11.54(a) with the SSW term in (a). Discuss.

c. Using the results in problem 11.54(a) and this problem, describe the issues that can arise when analysing data if the wrong procedures are applied.

REPORT WRITING EXERCISES
11.56 The data in the file < BEER > represent the alcohol percentage and calorie content of a sample of 95 beers together with country of origin and type. Your task is to write a report based on a complete evaluation comparing the calories and alcohol content based on the type of beer – light, regular or non-alcoholic.

Continuing cases

Tasman University
The Student News Service at Tasman University (TU) has decided to gather data about the undergraduate students who attend TU. It creates and distributes a survey of 14 questions and receives responses from 62 undergraduates (stored in < TASMAN_UNIVERSITY_BBUS_STUDENT_SURVEY >).
a At the 0.05 level of significance, is there evidence of a significant difference based on academic major in expected starting salary, number of social networking sites registered for, age, spending on textbooks and supplies, text messages sent in a week and the wealth needed to feel rich?
b At the 0.05 level of significance, is there evidence of a significant difference based on graduate school intention in WAM, expected starting salary, number of social networking sites registered for, age, spending on textbooks and supplies, text messages sent in a week and the wealth needed to feel rich?
The dean of students at TU has learned about the undergraduate survey and has decided to undertake a similar survey for MBA students. She creates and distributes a survey of 14 questions and receives responses from 44 graduate students (stored in < TASMAN_UNIVERSITY_MBA_STUDENT_SURVEY >). For these data, at the 0.05 level of significance:
c Is there evidence of a significant difference based on undergraduate major in age, undergraduate WAM, MBA WAM, expected salary upon graduation, spending on textbooks and supplies, text messages sent in a week and the wealth needed to feel rich?
d Is there evidence of a significant difference based on MBA major in age, undergraduate WAM, MBA WAM, expected salary upon graduation, spending on textbooks and supplies, text messages sent in a week and the wealth needed to feel rich?
e Is there evidence of a significant difference based on employment status in age, undergraduate WAM, MBA WAM, expected salary upon graduation, spending on textbooks and supplies, text messages sent in a week and the wealth needed to feel rich?


As Safe as Houses
To analyse the real estate market in non-capital cities and towns in states A and B, Safe-As-Houses Real Estate, a large national real estate company, has collected samples of recent residential sales from a sample of non-capital cities and towns in these states. The data are stored in < REAL_ESTATE >.
a Determine whether average price differences exist between properties with different numbers of bedrooms. If so, use the Tukey–Kramer technique to determine which ones.
b Determine whether average price differences exist between properties with different numbers of garages. If so, use the Tukey–Kramer technique to determine which ones.
c Check that homogeneity of variances exists for parts (a) and (b).
d Prepare a brief report to summarise your findings.

Chapter 11 Excel Guide

EG11.1 THE COMPLETELY RANDOMISED DESIGN: ONE-WAY ANOVA

Analysing Variation in One-Way ANOVA
Key technique  Use the Section EG2.5 instructions to construct scatter plots using stacked data. If necessary, change the levels of the factor to consecutive integers beginning with 1, as was done for the electricity price data in Figure 11.4 on page 406.

F Test for Differences Among More Than Two Means
Key technique  Use the DEVSQ(cell range of data of all groups) function to calculate SST and use an expression in the form SST – DEVSQ(group 1 data cell range) – DEVSQ(group 2 data cell range) … – DEVSQ(group n data cell range) to calculate SSA.
Example  Perform the Figure 11.6 one-way ANOVA for the electricity price data shown on page 408.
PHStat  Use One-Way ANOVA. For the example, open the Electricity_Price1 file. Select PHStat ➔ Multiple-Sample Tests ➔ One-Way ANOVA. In the procedure’s dialog box (shown in Figure EG11.1):
1. Enter 0.05 as the Level of Significance.
2. Enter B1:E6 as the Group Data Cell Range.
3. Check First cells contain label.
4. Enter a Title, clear the Tukey–Kramer Procedure check box, and click OK.

Figure EG11.1  One-Way ANOVA dialog box

In addition to the worksheet shown in Figure 11.6, this procedure creates an ASFData worksheet to hold the data used for the test.
Analysis ToolPak  Use Anova: Single Factor. For the example, open the Electricity_Price1 file and:
1. Select Data ➔ Data Analysis.
2. In the Data Analysis dialog box, select Anova: Single Factor from the Analysis Tools list and then click OK.


In the procedure’s dialog box (shown in Figure EG11.2):
3. Enter B1:E6 as the Input Range.
4. Click Columns, check Labels in First Row, and enter 0.05 as Alpha.
5. Click New Worksheet Ply.
6. Click OK.

Figure EG11.2  Anova: Single Factor dialog box

The Analysis ToolPak creates a worksheet that does not use formulas but is similar in layout to the Figure 11.6 worksheet on page 408.

Multiple Comparisons: The Tukey–Kramer Procedure
Key technique  Use formulas to calculate the absolute mean differences and use the IF function to compare pairs of means.
Example  Perform the Figure 11.7 Tukey–Kramer procedure for the electricity price data shown on page 410.
PHStat  Use the PHStat instructions for the one-way ANOVA F test to perform the Tukey–Kramer procedure, checking Tukey–Kramer Procedure instead in step 4. The procedure creates a worksheet identical to the one shown in Figure 11.7 and discussed in the following In-depth Excel section. To complete the worksheet, enter the Studentised range Q statistic (use Table E.10) for the level of significance and the numerator and denominator degrees of freedom that are given in the worksheet.
Analysis ToolPak  Transfer selected values from the Analysis ToolPak results worksheet to one of the TK worksheets in the OneWay_ANOVA workbook. For example, to perform the Figure 11.7 Tukey–Kramer procedure for the electricity price data:
1. Use the Anova: Single Factor procedure, as described earlier in this section, to create a worksheet that contains ANOVA results for the electricity price data.
2. Record the name, sample size (in the Count column), and sample mean (in the Average column) of each group. Also record the MSW value, found in the cell that is the intersection of the MS column and Within Groups row, and the denominator degrees of freedom, found in the cell that is the intersection of the df column and Within Groups row.
3. Open to the TK4 worksheet of the OneWay_ANOVA workbook. In the TK4 worksheet:
4. Overwrite the formulas in cell range A5:C8 by entering the name, sample mean, and sample size of each group into that range.
5. Enter 0.05 as the Level of significance in cell B11.
6. Enter 4 as the Numerator d.f. (equal to the number of groups) in cell B12.
7. Enter 16 as the Denominator d.f. in cell B13.
8. Enter 17.525 as the MSW in cell B14.
9. Enter 4.05 as the Q Statistic in cell B15. (Look up the Studentised range Q statistic using Table E.10.)

Levene Test for Homogeneity of Variance
Key technique  Use the techniques for performing a one-way ANOVA.
Example  Perform the Figure 11.8 Levene test for the electricity price data shown on page 412.
PHStat  Use Levene Test. For the example, open the Electricity_Price1 file. Select PHStat ➔ Multiple-Sample Tests ➔ Levene Test. In the procedure’s dialog box (shown in Figure EG11.3):
1. Enter 0.05 as the Level of Significance.
2. Enter B1:E6 as the Sample Data Cell Range.
3. Check First cells contain label.
4. Enter a Title and click OK.

Figure EG11.3  Levene Test dialog box

The procedure creates a worksheet that performs the Table 11.4 absolute differences calculations (see page 411) as well as the Figure 11.8 worksheet.
Analysis ToolPak  Use Anova: Single Factor with absolute difference data to perform the Levene test.


EG11.2 THE RANDOMISED BLOCK DESIGN

Key technique  Use the F.INV.RT, F.DIST.RT and DEVSQ functions to help calculate the ANOVA summary table statistics for a randomised block design. Enter F.INV.RT(level of significance, degrees of freedom for source, Error degrees of freedom) to calculate the F critical value for the among-groups (A) and among-blocks (BL) sources of variation. Enter F.DIST.RT(F test statistic for source, degrees of freedom for rows, Error degrees of freedom within groups) to calculate the p-value for the two sources of variation. Use the DEVSQ function to calculate SSA, SSBL, SSE and SST.
Example  Perform the Figure 11.10 randomised block design for the quick-service restaurant chain study on page 421.
PHStat  Use Randomised Block Design. For the example, open the Electricity_Consumption1 file. Select PHStat ➔ Multiple-Sample Tests ➔ Randomized Block Design. In the procedure’s dialog box (shown in Figure EG11.4):
1. Enter 0.05 as the Level of Significance.
2. Enter B1:E11 as the Sample Data Cell Range.
3. Check First cells contain label.
4. Enter a Title and click OK.

Figure EG11.4  Randomized Block Design dialog box

This procedure requires that the labels that identify factor A appear stacked in column A, followed by columns for factor B.
Analysis ToolPak  Use Anova: Two-Factor Without Replication. For the example, open the Electricity_Consumption1 file and:
1. Select Data ➔ Data Analysis.
2. In the Data Analysis dialog box, select Anova: Two-Factor Without Replication from the Analysis Tools list and then click OK.
In the procedure’s dialog box (shown in Figure EG11.5):
3. Enter B1:E11 as the Input Range.
4. Check Labels and enter 0.05 as Alpha.
5. Click New Worksheet Ply.
6. Click OK to create the worksheet.

Figure EG11.5  Anova: Two-Factor Without Replication dialog box

This procedure requires that the labels that identify blocks appear stacked in column A and that group names appear in row 1, starting with cell B1. The Analysis ToolPak creates a worksheet that is visually similar to Figure 11.10 but contains only values and does not include any cell formulas. The ToolPak worksheet also does not contain the level of significance in row 24.

EG11.3 THE FACTORIAL DESIGN: TWO-WAY ANOVA

Key technique  Use the DEVSQ function to calculate SSA, SSB, SSAB, SSE and SST.
Example  Perform the Figure 11.14 two-way ANOVA for the electricity consumption data shown on page 430.
PHStat  Use Two-Way ANOVA with replication. For the example, open the Electricity_Consumption1 file. Select PHStat ➔ Multiple-Sample Tests ➔ Two-Way ANOVA. In the procedure’s dialog box (shown in Figure EG11.6):
1. Enter 0.05 as the Level of Significance.
2. Enter B1:E11 as the Sample Data Cell Range.
3. Check First cells contain label.
4. Enter a Title and click OK.

Figure EG11.6  Two-Way ANOVA With Replication dialog box

This procedure requires that the labels that identify factor A appear stacked in column A, followed by columns for factor B.


Analysis ToolPak  Use Anova: Two-Factor With Replication. For the example, open the Electricity_Consumption1 file and:
1. Select Data ➔ Data Analysis.
2. In the Data Analysis dialog box, select Anova: Two-Factor With Replication from the Analysis Tools list and then click OK.
In the procedure's dialog box (shown in Figure EG11.7):
3. Enter B1:E11 as the Input Range.
4. Enter 5 as the Rows per sample.
5. Enter 0.05 as Alpha.
6. Click New Worksheet Ply.
7. Click OK.

Figure EG11.7  Anova: Two-Factor With Replication dialog box

This procedure requires that the labels that identify factor A appear stacked in column A, followed by columns for factor B. The Analysis ToolPak creates a worksheet that does not use formulas but is similar in layout to the Figure 11.14 worksheet.
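As a hedged alternative to the Anova: Two-Factor With Replication tool, the sketch below runs a two-way ANOVA with interaction in Python using statsmodels. The file name and the column names (factor_a, factor_b, consumption) are assumptions for illustration only, and the data would need to be arranged in 'long' format, one observation per row, rather than the worksheet layout described above.

    # Hedged sketch of a two-way ANOVA with interaction using statsmodels.
    # The file name and the column names factor_a, factor_b and consumption
    # are assumptions; the data must be in 'long' format (one row per observation).
    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.formula.api import ols

    data = pd.read_csv("electricity_consumption_long.csv")   # hypothetical file

    model = ols("consumption ~ C(factor_a) * C(factor_b)", data=data).fit()
    anova_table = sm.stats.anova_lm(model, typ=2)   # sums of squares, F statistics, p-values
    print(anova_table)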

Visualising Interaction Effects: The Cell Means Plot

Key technique  Use the SUMPRODUCT(cell range 1, cell range 2) function to calculate the expected value and variance.

Example  Construct the Figure 11.17 cell means plot for the electricity consumption data on page 432.

PHStat  Modify the PHStat instructions for the two-way ANOVA. In step 4, check Cell Means Plot before clicking OK.

Microsoft® product screen shots are reprinted with permission from Microsoft Corporation.


End of Part 3 problems

C.1 A business which sells flowers and gifts through Internet sales has found that the proportion of website hits from which sales result is 9.8%. After redesigning its website it takes a sample of 200 potential customers who browse the site and checks whether they make a purchase. Suppose that 24 browsers make a purchase.
a. Is there evidence of a significantly increased sales rate at the 0.05 level of significance?
b. Is there evidence of a significantly increased sales rate at the 0.1 level of significance?
C.2 An internal control policy for Rhonda's Online Fashion Accessories requires a quality assurance check before a shipment is made. The tolerable exception rate for this internal control is 0.05. During an audit, 500 shipping records are sampled from a population of 5,000 shipping records and 12 that violate the internal control are found.
a. Calculate the upper bound for a 95% one-sided confidence interval estimate for the rate of non-compliance.
b. Based on (a), what should the auditor conclude?
C.3 In July 2017 the Fair Work Commission reduced the penalty rates for work on a Sunday and public holidays in the retail and hospitality sectors. A representative from an employers association wishes to determine if average weekly labour costs for its clients have decreased between June and July of 2017. The data are contained in < PENALTY_RATE >. At the 5% significance level, is there evidence of a significant reduction in labour costs since the reduction of penalty rates?
C.4 In 2017 Australia's major banks decided to remove fees for the use of their ATMs by customers of other ('foreign') banks. The data in < BANK_COST2 > represent the number of foreign bank ATM transactions made by 26 customers of bank A in one month since the change was made.
12 8 5 5 6 6 10 10 9 7 10 7 7 5 0 10 6 9 12 0 5 10 8 5 5 9
a. Construct a 95% confidence interval for the mean monthly number of transactions at 'foreign' ATMs.
b. Interpret the interval constructed in (a).
C.5 A key part of most Central Banks' mission is the stability of a country's financial system. With this in mind, the Governor of the Reserve Bank of Australia wishes to compare the volatility of the exchange rate of Australia (population 1) with that of New Zealand (population 2). A sample of monthly exchange rate data over a two-year period reveals the following:
n1 = 24, S1² = 210.2, n2 = 24, S2² = 176.4
a. At the 0.05 level of significance, is there evidence to suggest that Australia's exchange rate is more volatile than New Zealand's?
b. What assumptions do you make here about the two populations in order to justify your use of the F test?
C.6 A manufacturer of torch batteries took a sample of 13 batteries < BATTERIES > from a day's production and used them continuously until they failed to work. The life of the batteries in hours until failure is:
342 426 317 545 264 451 1,049 631 512 266 492 562 298
a. At the 0.05 level of significance, is there sufficient evidence that the mean life of the batteries is more than 400 hours?
b. Determine the p-value in (a) and interpret its meaning.
c. Using the information above, what would you advise if the manufacturer wanted to say in advertisements that these batteries 'should last more than 400 hours'?
d. Suppose that the first value was 1,342 instead of 342. Repeat (a) to (c), using this value. Comment on the difference in the results.
C.7 While the use of cash is declining, bank ATMs must still be stocked with enough cash to satisfy customers making withdrawals over an entire weekend. Customer goodwill depends on such services meeting their needs. At one branch, the population mean amount of money withdrawn from the ATM per customer transaction over the weekend is $180, with a population standard deviation of $25. Suppose that a random sample of 36 customer transactions is examined, and you find that the sample mean withdrawal amount is $187.
a. At the 0.05 level of significance, using the critical value approach to hypothesis testing, is there evidence to believe that the population mean withdrawal amount is significantly greater than $180?
b. At the 0.05 level of significance, using the p-value approach to hypothesis testing, is there evidence to believe that the population mean withdrawal amount is significantly greater than $180?
c. Interpret the meaning of the p-value in this problem.
d. Compare your conclusions in (a) and (b).
C.8 Assume that two suburbs of Sydney, Narrabeen and Coogee, are being considered as sites for government-subsidised day-care centres. Of 150 households surveyed in Narrabeen, the proportion in which the mother worked full-time was 44%. In Coogee, of 100 households surveyed, 38% had mothers who worked full-time.
a. At the 0.05 level of significance, is there evidence of a significant difference between the two suburbs in the proportion of mothers working full-time?
b. Find the p-value in (a) and interpret its meaning.
C.9 Non-destructive evaluation (NDE) is a method used to describe the properties of components or materials without causing any permanent physical change to the units. It includes the determination of properties of materials and the classification of flaws by size, shape, type and location.


This method is most effective for detecting surface flaws and characterising surface properties of electrically conductive materials. Assume that data were collected that classified each component as being unflawed (Type 0) or having a flaw (Type 1) based on manual inspection and operator judgment. The size of the crack in the material was also reported. Do the components classified as unflawed have a smaller mean crack size than components classified as flawed? The results in terms of crack size in millimetres are in the data file < CRACK >.
a. Assuming that the population variances are equal, is there evidence that the mean crack size is smaller for the unflawed specimens than for the flawed specimens? (Use α = 0.05.)
b. Repeat (a) assuming that the population variances are not equal.
c. Compare the results of (a) and (b).
C.10 The Australian Government has endorsed a National Disability Strategy to deal with issues such as the divide in employment opportunities for disabled and able-bodied workers. In 2015 an ABS survey found an employment rate of 53.4% for people with a disability and 83.2% for people without a disability (Australian Bureau of Statistics, Disability, Ageing and Carers: Summary of Findings, 2015, Cat. No. 4430.0). Imagine that you have been assigned to check whether the gap has changed since 2015. You conduct a survey of 250 persons with disabilities and find that 137 are in the labour force. You also conduct a survey of 500 persons without disabilities and find that 420 are in the labour force.
a. Construct a 95% confidence interval for the proportion of people with a disability who are now in the labour force.
b. Construct a 95% confidence interval for the proportion of people without a disability who are now in the labour force.
c. Which interval is wider? Explain why it is wider.
C.11 The manager of a paint supply store wants to determine whether the mean amount of paint contained in 4-litre cans purchased from a nationally known manufacturer is actually 4 litres. You know from the manufacturer's specifications that the standard deviation of the amount of paint is 0.08 litres. You select a random sample of 50 cans, and the mean amount of paint per 4-litre can is 3.98 litres.
a. At the 0.01 level of significance, is there evidence that the mean amount is different from 4 litres?
b. Calculate the p-value and interpret its meaning.
c. Construct a 99% confidence interval estimate of the population mean amount of paint.
d. Compare the results of (a) and (c). What conclusions do you reach?
C.12 A university in Melbourne is testing the effects of alcohol consumption on the cognitive performance of a group of male students. Thirty students are recruited and randomly assigned to one of three groups where they are required to drink either five, one or zero standard drinks of the same alcohol in one

hour. All participants are then required to attempt the same puzzle. The number of seconds they each take to solve the puzzle is recorded in the file < CONSUMPTION >. Completely analyse the data in the table below.

Consumption
No alcohol   156 166 148 160 139 151 158 167 142 219
1 drink      160 165 184 192 197 172 189 179 200 193
5 drinks     236 238 257 242 282 253 270 256 267 259

C.13 A researcher analysing gender pay differences has conducted a survey of male and female income (in dollars) for aged-care workers in a nursing home:

           X̄        S       n
Males     567.87   13.35   12
Females   546.76    9.42   47

a. At the 0.05 level of significance, is there sufficient evidence that the male aged-care workers earn a higher income than the female aged-care workers?
b. What assumptions do you have to make about the two populations in order to justify the use of the t test?
C.14 One operation of a mill is to cut pieces of steel into parts that are used later in the frame for front seats in a passenger car. The steel is cut with a diamond saw and requires the resulting parts to be within plus or minus 0.125 mm of the length specified by the car-manufacturing company. The measurement reported from a sample of 100 steel parts is the difference in millimetres between the actual length of the steel part as measured by a laser measurement device, and the specified length of the steel part. For example, the first observation, –0.05, represents a steel part that is 0.05 mm shorter than the specified length.
a. Construct a 95% confidence interval estimate of the mean difference between the actual length of the steel part and the specified length of the steel part.
b. What assumption must you make about the population distribution in (a)?
c. Do you think that the assumption made in (b) is seriously violated? Explain.
d. Compare the conclusions reached in (a) with those of problem A.22 on page 142.


C.15 Students in a business statistics course performed a completely randomised design to test the strength of four brands of garbage bags. One-hundred-gram weights were placed into a bag one at a time until the bag broke. A total of 40 bags, 10 for each brand, was used. The data in the file give the weight (in kilograms) required to break the garbage bags.
a. At the 0.05 level of significance, is there evidence of a significant difference in the mean strength of the four brands of garbage bags?
b. If appropriate, determine which brands differ in mean strength.
c. At the 0.05 level of significance, is there evidence of a significant difference in the variation in strength between the four brands of garbage bags?
d. Which brand(s) should you buy and which brand(s) should you avoid? Explain.
C.16 Tourism Australia reports in its China Market Profile that in 2016 there were two key factors that were equally most important for visitors in choosing Australia as their destination. Both 'Safety and security' and 'World-class nature' were chosen by 40% of visitors as key factors in their choice (data obtained from accessed 12 January 2018). Let's consider the proportion choosing 'Safety and security'.
a. To conduct a follow-up study that would provide 95% confidence that the point estimate is correct to within ±0.04 of the population proportion, how large a sample size is required?
b. To conduct a follow-up study that would provide 99% confidence that the point estimate is correct to within ±0.04 of the population proportion, how large a sample size is required?
c. To conduct a follow-up study that would provide 95% confidence that the point estimate is correct to within ±0.02 of the population proportion, how large a sample size is required?
d. To conduct a follow-up study that would provide 99% confidence that the point estimate is correct to within ±0.02 of the population proportion, how large a sample size is required?
e. Discuss the effects of changing the desired confidence level and the acceptable sampling error on sample size requirements.
C.17 In problem 3.48 on page 132, you were introduced to a tea-bag-filling operation. An important quality characteristic of interest for this process is the weight of the tea in the individual bags. The data in the file are an ordered array of the weight, in grams, of a sample of 50 tea-bags produced during an eight-hour shift.
a. Is there evidence that the mean amount of tea per bag is significantly different from 5.5 g? (Use α = 0.01.)
b. Construct a 99% confidence interval estimate of the population mean amount of tea per bag. Interpret this interval.
c. Compare the conclusions reached in (a) and (b).
C.18 The Tertiary Certificate of Teaching has been introduced in Australian universities to help improve the quality of teaching offered in the classroom. From a sample of student responses to teaching quality, across several subjects at a university in Sydney, the following data were collected:

                                     Student teaching satisfaction: High     n
Lecturer has certificate             46 (58.97%)                             78
Lecturer does not have certificate   70 (51.85%)                            135

The satisfaction with teaching was rated high if a student gave at least 5 on a scale from 1 = very poor to 10 = extremely good. At the 0.05 level of significance, is there evidence that the proportion of students who rated teaching satisfaction highly is significantly higher for lecturers who have completed the Tertiary Certificate of Teaching?
C.19 A debate in educational institutions has followed the issue of whether falling tutorial attendance can be addressed by marking attendance in class in universities. The argument is that attendance dropping will increase failure rates and this is an inefficient system of education provision. A lecturer has collected attendance records for a set of tutorials in one week containing 112 enrolled students, without telling the students, and then collected attendance for the same students two weeks later after telling them that attendance records will be kept.

                 Attending
No rolls kept    71 (63.39%)
Rolls kept       87 (77.68%)

a. At the 0.05 level of significance, is there evidence that the proportion of students attending class tutorials is significantly higher when class rolls are kept?
b. Calculate the p-value in (a) and interpret its meaning.


PART 4  Determining cause and making reliable forecasts

Real People, Real Stats
Gautam Gangopadhyay, ENDEAVOUR ENERGY (FORMER EMPLOYEE)

Which company are you currently working for and what are some of your responsibilities?
I retired on 31 January 2014 after working for 12 years as network forecasting manager of Endeavour Energy, a major electricity distribution authority in New South Wales. I was responsible for leading a team of forecasting analysts that was required to produce and report accurate short-term and long-term energy demand, customer numbers and system peak demands (at system and spatial levels) forecasts, with very tight accuracy targets, for Endeavour's electricity distribution network, using best practice methods.

List five words that best describe your personality.
Fact based, analytical and strategic thinker.

What are some things that motivate you?
Challenges in capturing and quantitative modelling of the key drivers of long-term demand for services in the infrastructure sector that often require lumpy capital investments.

When did you first become interested in statistics?
Back in the late 1970s when I was working as an industrial engineer for a large coal mining company in India (Coal India Limited), responsible for developing optimal solutions for underground mine transport systems, manpower planning and inventory management.

Complete the following sentence. A world without statistics …
… will be a world without planning.


a quick q&a

LET'S TALK STATS
What do you enjoy most about working in statistics?
• Undertaking customer surveys, including sample design.
• Sample data gathering, examining the data and modelling of the sample data.
• Reporting in plain English the findings of the survey to the relevant parties, such as board members and the senior management team of the companies I worked for.

Describe your first statistics-related job or work experience. Was this a positive or a negative experience?
It was the determination of capacity of front-end loaders of various types in an open-cut coal mine using multiple regression analysis in 1977. It was quite a positive experience. The results of this analysis were adopted for mine planning. It was the first time this sort of statistical approach was adopted in the Indian coal mining industry. The work was well appreciated by the senior management of Coal India.

What do you feel is the most common misconception about your work held by students who are studying statistics? Please explain.
The difference between sample and population, and a lack of understanding of the risks associated with the results of statistical analysis based on often small samples.

Do you need to be good at maths to understand and use statistics successfully?
Good mathematical skills certainly help. However, a basic understanding of sample design, sample data, and the procedures and results of the analysis of sample data is critical, and a vast knowledge in mathematics is not essential for this.

Is there a high demand for statisticians in your industry (or in other industries)? Please explain.
Skills in statistics and quantitative modelling are in high demand, particularly in the banking, telecommunications and energy supply industries for undertaking reliable market research, demand forecasting, and responding to the questions and issues raised by industry regulators in relation to end-use customers' behaviour.

REGRESSION ANALYSIS
When and why might regression analysis be used in your work?
Mainly for weather correction of electricity demand and for predicting the future demand for energy in the short and long terms.

Are the assumptions that are necessary for an accurate regression analysis always met? How are these issues combated?
Not always. The issues are addressed by transformation of the variables and assessing the predictive power of the explanatory variables, trialling alternative specifications, examining the residuals, cross-validation and backcasting.

What are some practical advantages of using multiple regression models?
Formulated properly, with the right specifications, they generally provide more accurate demand forecasts, and selling the forecasts based on multiple regression analysis to senior management is somewhat easier.

What are some challenges that are usually faced in constructing a regression model and interpreting the results? How have you managed to overcome these?
• Gathering accurately matching historical data in respect of endogenous and exogenous variables.
• Postulating the right specifications of the model.
• Avoiding gross violation of assumptions.
• Avoiding spurious regression models and accomplishing the stationarity of time series before producing regression output.
• Adopting the right projections of explanatory variables if the regression equations are used for forecasting purposes.
In order to manage these issues, I had to, among other things:
• cross-check the data, examine the scatter plots, and conduct specification tests (such as Ramsay's test), tests for heteroscedasticity, ADF tests for stationarity of time series, comparing MSEs for candidate models, cross-validation etc.
• research the projections of macroeconomic variables (such as real growth projections of GDP, GSP, household disposable income and expenditure, interest rates, inflation and unemployment) and demographic variables (such as household formation rates, population- and household-size projections) from various sources, including engaging reputable economic consultants to formulate internally consistent macro scenarios and project economic and demographic variables with the use of reliable macroeconomic models such as the IMP, ORANI and Murphy models.


Is regression analysis used in conjunction with any other statistical tool?
I have used Markov Chain models for re-establishing the market shares equilibrium in response to price changes, particularly in the telecommunications industry in Australia.

How relevant is regression analysis in some of the current issues you face? Please explain.
All major capital-intensive industries in Australia, such as power supply, gas supply, telecommunications, steel manufacturing and oil refineries, are going through major structural changes, and these, as well as the offshoring of manufacturing activities, are all making historical data somewhat irrelevant for forecasting purposes. A lack of available customer data, partly due to stringent privacy laws, and companies' diminishing appetite for commissioning expensive customer surveys are making multiple regression analysis more difficult to apply and less reliable.


CHAPTER 12
Simple linear regression

MATHS SAVES HUMANITY
Who said that maths has no practical use? On the contrary, the application of mathematics and statistical tools has been instrumental in the development of modern society, from the introduction of currency, to various engineering breakthroughs, to the development of markets, modern medicine, trade and the allocation of aid funding. In a bid to encourage mathematical literacy in developing nations, the United Nations has commissioned you to prepare a report that establishes a formal link between countries' mathematical literacy and their nation's development. Specifically, you will attempt to predict countries' Human Development Index, which is constructed using health, education and income data. How would you go about estimating such a relationship? ('Trends in the Human Development Index 1990–2015', UNDP 2016, accessed 3 July 2017; 'PISA 2015: Full selection of indicators', OECD 2016, accessed 3 July 2016.)


LEARNING OBJECTIVES

After studying this chapter you should be able to:
1 conduct a simple regression and interpret the meaning of the regression coefficients b0 and b1
2 use regression analysis to predict the value of a dependent variable based on an independent variable
3 assess the adequacy of your estimated model
4 evaluate the assumptions of regression analysis
5 make inferences about the slope
6 make inferences about the correlation coefficient
7 estimate mean values and predict individual values
8 comprehend the pitfalls in regression

regression analysis Method for predicting the values of a numerical variable based upon the values of one or more other variables.

dependent variable The variable whose values are explained by changes in the independent variable.
independent variable The variable that explains the values of the dependent variable.
simple linear regression Regression method using a single independent variable to predict values of the numerical dependent variable.

In this and the following two chapters, you will learn how regression analysis allows you to develop a model to predict the values of a numerical variable based on the values of one or more other variables. For example, in the Human Development Index scenario, you may wish to predict a country’s Human Development Index based upon its mathematical literacy. Other examples include predicting sales based on the amount spent on marketing, and predicting price based on volume of production. In regression analysis, the variable you wish to predict is called the dependent variable. The variables used to make the prediction are called independent variables. In addition to predicting values of the dependent variable, regression analysis also allows you to identify the type of mathematical relationship that exists between a dependent variable and an independent variable, to quantify the effect that changes in the independent variable have on the dependent ­variable, and to identify unusual observations. This chapter discusses simple linear regression in which a single numerical independent variable, X, is used to predict the numerical dependent variable, Y, such as using the inflation rate to predict the number of house-building permits issued in a market area. Chapter 13 discusses multiple regression models that use several independent variables to predict a numerical dependent variable, Y.

12.1  TYPES OF REGRESSION MODELS

scatter diagram Graphical representation of the relationship between two numerical variables; plotted points represent the given values of the independent variable and corresponding dependent variable.

In Section 2.5 we used a scatter diagram to plot the relationship between an X variable on the horizontal axis and a Y variable on the vertical axis. The nature of the relationship between two variables can take many forms, ranging from simple to extremely complicated mathematical functions. The simplest relationship consists of a straight-line or linear relationship. An example of this relationship is shown in Figure 12.1.

Figure 12.1  A positive straight-line relationship (Y plotted against X; the line meets the Y axis at β0; ΔY = 'change in Y' corresponding to ΔX = 'change in X')


Equation 12.1 represents the straight-line (linear) model.

LEARNING OBJECTIVE 1: Conduct a simple regression and interpret the meaning of the regression coefficients b0 and b1

SIMPLE LINEAR REGRESSION MODEL
Yi = β0 + β1Xi + εi   (12.1)
where
β0 = Y intercept for the population
β1 = slope for the population
εi = random error in Y for observation i
Yi = dependent variable (sometimes referred to as the response variable)
Xi = independent variable (sometimes referred to as the explanatory variable)

response variable The dependent variable.
explanatory variable The independent variable.
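To make Equation 12.1 concrete, the following short Python sketch simulates observations from the model Yi = β0 + β1Xi + εi. The parameter values and the error standard deviation are arbitrary choices for illustration, not estimates from any data set in this chapter.

    # Simulate observations from Y_i = beta0 + beta1*X_i + eps_i (Equation 12.1).
    # The parameter values and error standard deviation are arbitrary choices.
    import numpy as np

    rng = np.random.default_rng(42)
    beta0, beta1, sigma = 20.0, 0.13, 6.0     # hypothetical population parameters

    x = rng.uniform(350, 570, size=19)        # independent variable values
    eps = rng.normal(0, sigma, size=19)       # random error in Y for each observation
    y = beta0 + beta1 * x + eps               # dependent variable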

Yi = β0 + β1Xi is the equation of a straight line. The slope of the line, β1, represents the expected change in Y per unit change in X. It represents the mean amount that Y changes (either positively or negatively) for a one-unit change in X. The Y intercept, β0, represents the mean value of Y when X equals 0. The last component of the model, εi, represents the random error in Y for each observation i that is not explained by β0 + β1Xi. In other words, εi is the vertical distance Yi is above or below the line.
Selection of the proper mathematical model depends on the distribution of the X and Y values on the scatter diagram. In panel A of Figure 12.2, the values of Y are generally increasing linearly as X increases. This panel is similar to Figure 12.3 on page 459, which illustrates the positive relationship between the Human Development Index and mean mathematical literacy score in our sample of 19 countries. Panel B is an example of a negative linear relationship. As X increases, the values of Y are generally decreasing. An example of this type of relationship might be obesity rates and physical activity rates. The data in panel C show a positive curvilinear relationship between X and Y. The values of Y increase as X increases, but this increase tapers off beyond certain values of X. An example of this positive curvilinear relationship might be the age and maintenance cost of a machine. As a machine gets older, the maintenance cost may rise rapidly at first but then level off beyond a certain number of years.

Y intercept Represents the mean value of Y when X equals zero in the regression.

Figure 12.2  Examples of types of relationships found in scatter diagrams: Panel A, positive linear relationship; Panel B, negative linear relationship; Panel C, positive curvilinear relationship; Panel D, U-shaped curvilinear relationship; Panel E, negative curvilinear relationship; Panel F, no relationship between X and Y


Panel D shows a U-shaped relationship between X and Y. As X increases, at first Y generally decreases; but as X continues to increase, Y not only stops decreasing but actually increases above its minimum value. An example of this type of relationship might be the number of errors per hour at a task and the number of hours worked. The number of errors per hour decreases as the individual becomes more proficient at the task, but then increases beyond a certain point because of factors such as fatigue and boredom. Panel E indicates a negative curvilinear relationship between X and Y. In this case, Y decreases very rapidly as X first increases, but then decreases much less rapidly as X increases further. An example of this negative curvilinear relationship could be the resale value of a vehicle and its age. In the first year, the resale value drops drastically from its original price; however, the resale value then decreases much less rapidly in subsequent years. Finally, Panel F shows a set of data in which there is very little or no relationship between X and Y. High and low values of Y appear at each value of X. In this section, a variety of different models that represent the relationship between two variables is briefly examined. Although scatter diagrams are useful in visually showing the mathematical form of a relationship, more sophisticated statistical procedures are available to determine the most appropriate model for a set of variables. The rest of this chapter discusses the model used when there is a linear relationship between variables.

12.2  DETERMINING THE SIMPLE LINEAR REGRESSION EQUATION

In the opening scenario the stated goal is to estimate the relationship between mathematical literacy and human development. To examine this relationship a sample of 19 developed and developing countries was selected using Organisation for Economic Co-operation and Development (OECD) and United Nations (UN) data. The mean mathematical literacy score is collected from the Programme for International Student Assessment (PISA) report of the OECD (OECD 2016). The Human Development Index is constructed from life expectancy, educational attainment and living standards data by the United Nations Development Programme (UNDP 2016). Table 12.1 summarises these data. < HDI >

Table 12.1  Human Development Index and mean mathematical literacy scores, various countries, 2015
Source: UNDP (2016) and OECD (2016) (see opening scenario).

Country          Human Development Index – Y   Mean mathematics performance score – X
Norway           94.9                          502
Australia        93.9                          494
New Zealand      91.5                          495
United States    92.0                          470
Ireland          92.3                          504
Iceland          92.1                          488
Netherlands      92.4                          512
Canada           92.0                          516
Germany          92.6                          506
Singapore        92.5                          564
Argentina        82.7                          409
Brazil           75.4                          377
Colombia         72.7                          390
Tunisia          72.5                          367
Jordan           74.1                          380
Turkey           76.7                          420
Thailand         74.0                          415
Indonesia        68.9                          386
Vietnam          68.3                          495

Figure 12.3 displays the scatter diagram for the data in Table 12.1. Observe the increasing relationship between mean mathematical literacy of a country (X) and its Human Development


Figure 12.3  Microsoft Excel scatter diagram for the Human Development Index data (Human Development Index on the vertical axis plotted against mean mathematics performance score on the horizontal axis)

Index (Y) over this period. As the mean mathematical literacy score of a country increases, the Human Development Index increases approximately as a straight line. Thus, you can assume that a straight line provides a useful mathematical model of this relationship. Now you need to determine the specific straight line that is the best fit to these data.

The Least-Squares Method
In Section 12.1 a statistical model is hypothesised to represent the relationship between two variables, the Human Development Index and mean mathematical literacy score for all countries. However, as shown in Table 12.1, the data are from only a selected time period and a sample of countries. If certain assumptions are valid (see Section 12.4), you can use the sample Y intercept, b0, and the sample slope, b1, as estimates of the respective population parameters, β0 and β1. Equation 12.2 uses these estimates to form the simple linear regression equation. This straight line is often referred to as the prediction line.

prediction line The straight line derived by a regression equation using the method of least squares.

SIMPLE LINEAR REGRESSION EQUATION: THE PREDICTION LINE
The predicted value of Y equals the Y intercept plus the slope multiplied by the value of X.
Ŷi = b0 + b1Xi   (12.2)
where
Ŷi = predicted value of Y for observation i
Xi = value of X for observation i
b0 = sample Y intercept
b1 = sample slope

Equation 12.2 requires the determination of two regression coefficients − b0 (the sample Y intercept) and b1 (the sample slope). The most common approach to find b0 and b1 is the method of least squares. This method minimises the sum of the squared differences between the actual values (Yi) and the predicted values, using the simple linear regression equation (i.e. the prediction line; see Equation 12.2). This sum of squared differences is equal to:

regression coefficients The calculated parameters in regression that specify the interval and slope of the linear line defining the relationship between the independent and dependent variables.

Σ(Yi − Ŷi)²   (where each sum runs over the observations i = 1 to n)


Since Ŷi = b0 + b1Xi:

Σ(Yi − Ŷi)² = Σ[Yi − (b0 + b1Xi)]²

least-squares method The fitting of a linear relationship between the X and Y variables such that the sum of squared distances of the data values from the linear line of best fit is at a minimum.

Since this equation has two unknowns, b0 and b1, the sum of squared differences is a function of the sample Y intercept b0 and the sample slope b1. The least-squares method determines what values of b0 and b1 minimise the sum of squared differences. Any values for b0 and b1 other than those determined by the least-squares method result in a greater sum of squared differences between the actual value of Y and the predicted value of Y. In this textbook, Microsoft Excel spreadsheet software is used to perform the calculations involved in the least-squares method. For the data of Table 12.1, Figure 12.4 shows the Microsoft Excel output. To help understand how the results are calculated, some of the calculations involved are illustrated in Figure 12.4.

Figure 12.4  Microsoft Excel output for the Human Development Index regression

Regression statistics
Multiple R           0.7922
R square             0.6276
Adjusted R square    0.6057
Standard error       6.3010
Observations         19

ANOVA
              df    SS          MS          F         Significance F
Regression     1    1137.7110   1137.7110   28.6555   0.0001
Residual      17     674.9532     39.7031
Total         18    1812.6642

                   Coefficients   Standard error   t stat   p-value   Lower 95%   Upper 95%
Intercept          23.4251        11.3640          2.0613   0.0549    –0.5508     47.4010
Mean mathematics    0.1319         0.0246          5.3531   0.0001     0.0799      0.1839

In Figure 12.4, observe that b0 = 23.4251 and b1 = 0.1319. Thus, the prediction line (see Equation 12.2) for these data is:
Ŷi = 23.4251 + 0.1319Xi
The slope b1 is 0.1319. This means that, for each increase of 1 unit in X, the mean value of Y is estimated to increase by 0.1319 units. In other words, for each one-point increase in a country's mean mathematical literacy score, its mean Human Development Index is estimated to increase by 0.1319 index points. Thus, the slope measures how the estimated Human Development Index varies with a country's mean mathematical literacy. The Y intercept, b0, is 23.4251. The Y intercept represents the mean value of Y when X equals 0. It is effectively impossible for there to be a zero mean mathematical literacy score, so this Y intercept has no practical interpretation. Also, the Y intercept for this example is outside the range of the observed values of the X variable, and therefore interpretations of the value of b0 should be made cautiously. Figure 12.5 displays the actual observations and the prediction line. For an illustration of a specific situation where there is a direct interpretation for the Y intercept, b0, see Example 12.1.
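Although this textbook uses Microsoft Excel for the least-squares calculations, the coefficients can also be checked with a few lines of Python. The sketch below applies numpy's polyfit to the Table 12.1 data; the results should agree with Figure 12.4 (b0 ≈ 23.4251, b1 ≈ 0.1319) apart from rounding.

    # Reproduce the least-squares coefficients in Figure 12.4 from the
    # Table 12.1 data; results should match b0 = 23.4251 and b1 = 0.1319
    # apart from rounding.
    import numpy as np

    x = np.array([502, 494, 495, 470, 504, 488, 512, 516, 506, 564,
                  409, 377, 390, 367, 380, 420, 415, 386, 495], dtype=float)
    y = np.array([94.9, 93.9, 91.5, 92.0, 92.3, 92.1, 92.4, 92.0, 92.6, 92.5,
                  82.7, 75.4, 72.7, 72.5, 74.1, 76.7, 74.0, 68.9, 68.3])

    b1, b0 = np.polyfit(x, y, deg=1)          # slope and intercept of the prediction line
    print(f"b0 = {b0:.4f}, b1 = {b1:.4f}")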


Figure 12.5  Microsoft Excel scatter diagram and prediction line for the Human Development Index data (prediction line Ŷ = 23.4251 + 0.1319X)

EXAMPLE 12.1  INTERPRETING THE Y INTERCEPT, b0, AND THE SLOPE, b1
A statistics professor wants to use the number of hours a student studies for a statistics final exam (X) to predict the final exam score (Y). A regression model was fitted based on data collected for a class during the previous semester, with the following results:
Ŷi = 35.0 + 3Xi
What is the interpretation of the Y intercept, b0, and the slope, b1?
SOLUTION
The value of the Y intercept, b0 = 35.0, indicates that, when the student does not study for the final exam, the mean final exam score is 35.0. The value of the slope, b1 = 3, indicates that, for each increase of one hour in studying time, the mean change in the final exam score is predicted to be +3.0. In other words, the final exam score is predicted to increase by 3 points for each one-hour increase in studying time.

Exploring Simple Linear Regression Coefficients
Use the Visual Explorations simple linear regression procedure to produce a prediction line that is as close as possible to the prediction line defined by the least-squares solution. Open the Visual Explorations.xla macro workbook and select Add-Ins ➔ VisualExplorations ➔ Simple Linear Regression from the Microsoft Excel menu bar.

visual explorations

When a scatter diagram of the Human Development Index data of Table 12.1 (shown on page 458), with an initial prediction line appears, click on the arrow buttons to change the values for b1, the slope of the prediction line, and b0, the Y intercept of the prediction line. Try to produce a prediction line that is as close as possible to the prediction line defined by the least-squares estimates, using the chart display and the Difference from Target SSE value as feedback (see page 467 for an explanation of SSE). Click on Finish when you have ended this exploration. At any time, click on Reset to reset the b1 and b0 values, Help for more information, or Solution to reveal the prediction line defined by the least-squares estimates.


Using your own regression data. To use Visual Explorations to find a prediction line for your own data, open the Visual Explorations.xla workbook (if not already open) and select VisualExplorations ➔ Simple Linear Regression with your worksheet data. In the Simple Linear Regression (your data) dialog box (shown below):

■ Enter the Y variable cell range in the Y Variable Cell Range edit box.
■ Enter the X variable cell range in the X Variable Cell Range edit box.
■ Select the First cells in both ranges contain a label check box, if appropriate.
■ Enter a title in the Title edit box.
■ Click on the OK button.
When the scatter diagram with an initial prediction line appears, use the instructions in the first part of this section to try to produce the prediction line defined by the least-squares estimate.

LEARNING OBJECTIVE 2: Use regression analysis to predict the value of a dependent variable based on an independent variable

Return to the scenario concerning the Human Development Index and mathematical literacy. Example 12.2 illustrates how to use the prediction equation to predict the mean Human Development Index.

EXAMPLE 12.2  PREDICTING MEAN HUMAN DEVELOPMENT INDEX BASED UPON MATHEMATICAL LITERACY SCORES
Use the prediction line to predict the mean Human Development Index for a country with a mean mathematical literacy score of 500.
SOLUTION
You can determine the predicted value by substituting X = 500 (a country's mean mathematical literacy score) into the simple linear regression equation:
Ŷi = 23.4251 + 0.1319Xi
Ŷi = 23.4251 + 0.1319(500) = 89.39
Thus, the predicted mean Human Development Index for a country with a mean mathematical literacy score of 500 is 89.39.¹

¹ We use the full output from Excel for our calculated answers throughout this chapter. You may experience small discrepancies in your calculated answers due to rounding errors.




Predictions in Regression Analysis: Interpolation versus Extrapolation
When using a regression model for prediction purposes, you need to consider only the relevant range of the independent variable in making predictions. This relevant range includes all values from the smallest to the largest X used in developing the regression model. Hence, when predicting Y for a given value of X you can interpolate within this relevant range of the X values, but you should not extrapolate beyond the range of X values. When you use a country's mean mathematical literacy score to predict its Human Development Index, the score varies from 367 to 564 (see Table 12.1). Therefore, you should predict the Index only for scores between 367 and 564. Any prediction of the Human Development Index for scores outside of this range assumes that the observed relationship between the Index for countries' scores from 367 to 564 is the same as for scores outside this range. For example, you should only extrapolate the linear relationship beyond a mathematics literacy score of 564 in Example 12.2 with extreme caution. It would be improper to use the prediction line to forecast the Human Development Index for a score of 600.

relevant range The observed range of values of the explanatory variable, which are themselves the only values relevant to predicting any value in regression.
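One practical way to respect the relevant range is to build the check into the prediction itself. The sketch below is one possible implementation in Python; the function name predict_hdi is ours, not the textbook's. It uses the coefficients reported in Figure 12.4 and refuses to extrapolate beyond the observed scores of 367 to 564.

    # One possible way to keep predictions inside the relevant range of X.
    # predict_hdi is our own illustrative helper, using the Figure 12.4 coefficients.
    def predict_hdi(score, b0=23.4251, b1=0.1319, x_min=367, x_max=564):
        """Predict HDI, refusing to extrapolate beyond the observed scores."""
        if not (x_min <= score <= x_max):
            raise ValueError(f"score {score} is outside the relevant range [{x_min}, {x_max}]")
        return b0 + b1 * score

    print(predict_hdi(500))   # interpolation: about 89.4
    # predict_hdi(600) would raise an error, because 600 lies beyond the observed scores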

Calculating the Y Intercept, b0, and the Slope, b1
For small data sets, it is possible to perform the least-squares method using a hand calculator. Equations 12.3 and 12.4 give the values of b0 and b1, which minimise:

Σ(Yi − Ŷi)² = Σ[Yi − (b0 + b1Xi)]²

FORMULA FOR CALCULATING THE SLOPE, b1
b1 = SSXY / SSX   (12.3)
where
SSXY = Σ(Xi − X̄)(Yi − Ȳ) = ΣXiYi − (ΣXi)(ΣYi)/n
SSX = Σ(Xi − X̄)² = ΣXi² − (ΣXi)²/n

FORMULA FOR CALCULATING THE Y INTERCEPT, b0
b0 = Ȳ − b1X̄   (12.4)
where
Ȳ = ΣYi/n  and  X̄ = ΣXi/n


EXAMPLE 12.3  CALCULATING THE Y INTERCEPT, b0, AND THE SLOPE, b1
Calculate the Y intercept, b0, and the slope, b1, for the Human Development Index problem.
SOLUTION
Examining Equations 12.3 and 12.4, we see that five quantities must be calculated to determine b1 and b0. These are n, the sample size; ΣXi, the sum of the X values; ΣYi, the sum of the Y values; ΣXi², the sum of the squared X values; and ΣXiYi, the sum of the cross-products of X and Y. For the Human Development Index data, the mathematical literacy score is used to predict the HDI. Table 12.2 presents the calculations of the various sums needed (including the sum of the squared Y values that will be used to calculate SST in Section 12.3).

Table 12.2  Calculations for the Human Development Index regression

Country        Mean mathematics performance score – X   Human Development Index – Y   X²          Y²           XY
Norway         502                                      94.9                          252,004     9,006.01     47,639.80
Australia      494                                      93.9                          244,036     8,817.21     46,386.60
New Zealand    495                                      91.5                          245,025     8,372.25     45,292.50
United States  470                                      92.0                          220,900     8,464.00     43,240.00
Ireland        504                                      92.3                          254,016     8,519.29     46,519.20
Iceland        488                                      92.1                          238,144     8,482.41     44,944.80
Netherlands    512                                      92.4                          262,144     8,537.76     47,308.80
Canada         516                                      92.0                          266,256     8,464.00     47,472.00
Germany        506                                      92.6                          256,036     8,574.76     46,855.60
Singapore      564                                      92.5                          318,096     8,556.25     52,170.00
Argentina      409                                      82.7                          167,281     6,839.29     33,824.30
Brazil         377                                      75.4                          142,129     5,685.16     28,425.80
Colombia       390                                      72.7                          152,100     5,285.29     28,353.00
Tunisia        367                                      72.5                          134,689     5,256.25     26,607.50
Jordan         380                                      74.1                          144,400     5,490.81     28,158.00
Turkey         420                                      76.7                          176,400     5,882.89     32,214.00
Thailand       415                                      74.0                          172,225     5,476.00     30,710.00
Indonesia      386                                      68.9                          148,996     4,747.21     26,595.40
Vietnam        495                                      68.3                          245,025     4,664.89     33,808.50
Totals         8,690                                    1,591.5                       4,039,902   135,121.73   736,525.80

Using Equations 12.3 and 12.4 we can calculate the values of b0 and b1:

b1 = SSXY / SSX

SSXY = Σ(Xi − X̄)(Yi − Ȳ) = ΣXiYi − (ΣXi)(ΣYi)/n
     = 736,525.8 − (8,690)(1,591.5)/19
     = 8,623.95789

SSX = Σ(Xi − X̄)² = ΣXi² − (ΣXi)²/n
    = 4,039,902 − (8,690)²/19
    = 65,370.4211

so that:

b1 = 8,623.95789 / 65,370.4211 = 0.13192447

and:

b0 = Ȳ − b1X̄
Ȳ = ΣYi/n = 1,591.5/19 = 83.76316
X̄ = ΣXi/n = 8,690/19 = 457.36842
b0 = 83.76316 − (0.13192447)(457.36842) = 23.4250731
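The hand calculation in Example 12.3 can be verified directly from the column totals in Table 12.2; the following short Python sketch applies Equations 12.3 and 12.4 to those totals.

    # Verify Example 12.3 using the column totals from Table 12.2.
    n = 19
    sum_x, sum_y = 8_690, 1_591.5
    sum_x2, sum_xy = 4_039_902, 736_525.8

    ssxy = sum_xy - sum_x * sum_y / n          # 8,623.9579
    ssx = sum_x2 - sum_x ** 2 / n              # 65,370.4211
    b1 = ssxy / ssx                            # 0.13192447
    b0 = sum_y / n - b1 * (sum_x / n)          # 23.4250731
    print(f"b1 = {b1:.8f}, b0 = {b0:.7f}")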

Problems for Section 12.2

LEARNING THE BASICS
12.1 Fitting a straight line to a set of data yields the following prediction line: Ŷi = 15 + 25Xi
a. Interpret the meaning of the Y intercept, b0.
b. Interpret the meaning of the slope, b1.
c. Predict the mean value of Y for X = 12.
12.2 If the values of X in problem 12.1 range from 2 to 25, should you use this model to predict the average value of Y when X equals:
a. 3?
b. −3?
c. 0?
d. 24?

12.3 Fitting a straight line to a data set yields the following prediction line: Ŷi = 24 − 2.5Xi
a. Interpret the meaning of the Y intercept, b0.
b. Interpret the meaning of the slope, b1.
c. Predict the mean value of Y for X = 10.

APPLYING THE CONCEPTS
Problems 12.4 to 12.9 can be solved manually or by using Microsoft Excel.

12.4 An increasingly important topic in the context of climate change and global warming is the link between countries’ energy generation and CO2 emissions. The following table lists energy


generation (terawatt hours) and CO2 emissions (000s metric tonnes) for 11 countries. < CO2 >

Country     CO2         Energy
USA         5,762,050   4,150
China       3,473,600   2,187
Russia      1,540,360     931
Japan       1,224,740   1,110
India       1,007,980     651
Germany       837,425     607
UK            558,225     400
Australia     332,377     236
Canada        521,404     568
Brazil        327,858     386
Spain         304,882     278
Data obtained from

a. Which variable would you choose as the (i) independent and (ii) dependent variable? Explain the reason for your choice.
b. Construct a scatter diagram.
c. Assuming a linear relationship, use the least-squares method to find the regression coefficients, b0 and b1.
d. Interpret the meaning of the slope, b1, in this problem.
e. Predict the CO2 emission for a country generating 1,000 terawatt hours of energy.
12.5 Starbucks Coffee Co. uses a data-based approach to improving the quality and customer satisfaction of its products. When survey data indicated that Starbucks needed to improve its package-sealing process, an experiment was conducted to determine the factors in the bag-sealing equipment that might be affecting the ease of opening the bag without tearing its inner liner (data obtained from L. Johnson and S. Burrows, 'For Starbucks, it's in the bag', Quality Progress, March 2011, 17–23). One factor that could affect the rating of the ability of the bag to resist tears was the plate gap on the bag-sealing equipment. Data were collected on 19 bags in which the plate gap was varied. The results are stored in < STARBUCKS >.
a. Construct a scatter plot.
b. Assuming a linear relationship, use the least-squares method to determine the regression coefficients b0 and b1.
c. Interpret the meaning of the slope, b1, in this problem.
d. Predict the mean tear rating when the plate gap is equal to 0.
e. What should you tell Starbucks management about the relationship between the plate gap and the tear rating?
12.6 A consumer's spending is widely believed to be a function of their income. A university professor tries to estimate this relationship by measuring his students' spending and income patterns in week 6 of the semester.

Spending ($) 154 135 95 29 124 130 30 73 81 221 132 100 55 94 217 200 224 127 87

Income ($) 305 150 100 30 150 175 50 120 122 300 152 125 70 100 220 240 250 150 50

a. Construct a scatter diagram.
b. Assuming a linear relationship, use the least-squares method to find the regression coefficients, b0 and b1.
c. Interpret the meaning of the slope, b1, in this problem.
d. Predict the mean spending for a student with a weekly income of $225.
12.7 A statistics professor is investigating the relationship between tutorial class size and academic performance for a first-year business statistics course. She collects data on the tutorial class size (number of students) and average class mark (/100) for a sample of 15 classes. < CLASS_SIZE >

Tutorial size 15 25 22 12 28 22 17 29  8 15 22 19 21 27 15

Average mark 78 67 69 79 58 65 72 61 82 73 68 70 70 65 77


a. Construct a scatter diagram.
b. Assuming a linear relationship, use the least-squares method to find the regression coefficients, b0 and b1.
c. Interpret the meaning of the slope, b1, in this problem.
d. Predict the average class mark for a tutorial of 20 students.
12.8 The production of wine is a multibillion-dollar worldwide industry. In an attempt to develop a model of wine quality as judged by wine experts, data was collected from red wine variants of Portuguese vinho verde (data obtained from P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis, 'Modeling wine preferences by data mining from physiochemical properties', Decision Support Systems, 47, 2009, 547–553 and ). A sample of 50 wines is stored in < VINHO_VERDE >. Develop a simple linear regression model to predict wine quality, measured on a scale from 0 (very bad) to 10 (excellent), based on alcohol content (%).
a. Construct a scatter plot. (For these data, b0 = −0.3529 and b1 = 0.5624.)
b. Interpret the meaning of the slope, b1, in this problem.
c. Predict the mean wine quality for wines with a 10% alcohol content.
d. What conclusion can you reach based on the results of (a) to (c)?
12.9 A concert promoter is interested in the relationship between ticket price and merchandise sales. She obtains data from the previous year. < CONCERT >

Merchandise sales ($)   3,402  7,069  20,269  1,113  15,116  16,954  18,317  13,623  24,131  24,073  21,806  10,687
Ticket price ($)          115    109      79    125      85      85      80      95      65      55      75     100

a. Construct a scatter diagram.
b. Assuming a linear relationship, use the least-squares method to find the regression coefficients, b0 and b1.
c. Interpret the meaning of b0 and b1 in this problem.
d. Predict merchandise sales if the ticket price is $100.

12.3  MEASURES OF VARIATION

Calculating the Sum of Squares
When using the least-squares method to find the regression coefficients for a set of data, there are three measures of variation that determine how much of the variation in the dependent variable Y is explained by variation in the independent variable X. The first measure, the total sum of squares (SST), is a measure of variation of the Yi values around their mean. In a regression analysis, the total variation, or total sum of squares, is subdivided into explained variation, or regression sum of squares (SSR), which is due to the relationship between X and Y, and unexplained variation, or error sum of squares (SSE), which is due to factors other than the relationship between X and Y. Figure 12.6, overleaf, shows these different measures of variation. The regression sum of squares (SSR) is based on the squared differences between the values of Y predicted from the prediction line and the mean value of Y. The error sum of squares (SSE) represents the part of the variation in Y that is not explained by the regression. It is based on the difference between Yi and Ŷi. Equations 12.5, 12.6, 12.7 and 12.8 define these measures of variation.

MEASURES OF VARIATION IN REGRESSION
Total sum of squares = regression sum of squares + error sum of squares
SST = SSR + SSE   (12.5)

total sum of squares (SST) The total variation.
total variation The sum of the squared differences between each individual value and the grand mean.
explained variation The regression sum of squares.
regression sum of squares (SSR) The degree of variation between X and Y variables that is explained by the defined regression relationship between the two variables. Specifically, the degree of variation in the Y variable that is accounted for by variation in the X variable(s).
unexplained variation The error sum of squares.
error sum of squares (SSE) The degree of variation between the X and Y variables that is not explained by the defined regression relationship between the two variables. Specifically, the degree of variation in the Y variable that is not accounted for by variation in the X variable(s).



Figure 12.6  Measures of variation. [The plot shows the prediction line Ŷi = b0 + b1Xi and, for a typical point (Xi, Yi): the error sum of squares, Σ(Yi − Ŷi)² = SSE; the regression sum of squares, Σ(Ŷi − Ȳ)² = SSR; and the total sum of squares, Σ(Yi − Ȳ)² = SST.]

TOTAL SUM OF SQUARES (SST)
The total sum of squares (SST) is equal to the sum of the squared differences between each observed Y value and Ȳ, the mean value of Y.

SST = total sum of squares = Σ(Yi − Ȳ)²   (12.6)

REGRESSION SUM OF SQUARES (SSR)
The regression sum of squares (SSR) is equal to the sum of the squared differences between the predicted value of Y and the mean value of Y.

SSR = explained variation or regression sum of squares = Σ(Ŷi − Ȳ)²   (12.7)

ERROR SUM OF SQUARES (SSE)
The error sum of squares (SSE) is equal to the sum of the squared differences between the observed value of Y and the predicted value of Y.

SSE = unexplained variation or error sum of squares = Σ(Yi − Ŷi)²   (12.8)

(In each case the sum runs over i = 1 to n.)

Figure 12.7 represents the sum-of-squares portion of the Microsoft Excel output for the Human Development Index problem. From Figure 12.7, we can see that: SSR = 1,137.7110, SSE = 674.9532 and SST = 1,812.6642





Figure 12.7  Microsoft Excel sum of squares for the Human Development Index regression

ANOVA         df    SS          MS          F         Significance F
Regression     1    1137.7110   1137.7110   28.6555   0.0001
Residual      17     674.9532     39.7031
Total         18    1812.6642

From Equation 12.5:

SST = SSR + SSE
1,812.6642 = 1,137.7110 + 674.9532

The SST is equal to 1,812.6642. This amount is subdivided into the sum of squares that is explained by the regression (SSR), equal to 1,137.7110, and the sum of squares that is unexplained by the regression (SSE), equal to 674.9532.
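The decomposition can be verified with a few lines of code. The following Python sketch (numpy is an assumption about available software, not part of this text, and the data are hypothetical because the Human Development Index sample is not reproduced here) fits a least-squares line and confirms that SST = SSR + SSE.

```python
import numpy as np

# Hypothetical (X, Y) data used only to illustrate the decomposition
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 3.8, 5.2, 5.9, 7.1])

b1, b0 = np.polyfit(x, y, 1)            # least-squares slope and intercept
y_hat = b0 + b1 * x                     # values predicted by the regression line

sst = np.sum((y - y.mean()) ** 2)       # total sum of squares (Equation 12.6)
ssr = np.sum((y_hat - y.mean()) ** 2)   # regression sum of squares (Equation 12.7)
sse = np.sum((y - y_hat) ** 2)          # error sum of squares (Equation 12.8)

print(sst, ssr, sse)
print(np.isclose(sst, ssr + sse))       # SST = SSR + SSE (Equation 12.5)
```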

The Coefficient of Determination

LEARNING OBJECTIVE 3  Assess the adequacy of your estimated model

coefficient of determination, r²  Indicates the proportion of variation in Y that is explained by X.

By themselves, SSR, SSE and SST provide little information. However, the ratio of the regression sum of squares (SSR) to the total sum of squares (SST) measures the proportion of variation in Y that is explained by the independent variable X in the regression model. This ratio is called the coefficient of determination, r², and is defined in Equation 12.9.

COEFFICIENT OF DETERMINATION
The coefficient of determination is equal to the regression sum of squares (i.e. explained variation) divided by the total sum of squares (i.e. total variation).

r² = SSR / SST = regression sum of squares / total sum of squares   (12.9)

The coefficient of determination measures the proportion of variation in Y that is explained by the independent variable X in the regression model. For the Human Development Index example, with SSR = 1,137.7110, SSE = 674.9532 and SST = 1,812.6642:

r² = 1,137.7110 / 1,812.6642 = 0.6276

Therefore, 62.76% of the variation in the Human Development Index is explained by the variability in a country's mean mathematical literacy scores. This r² indicates a relatively strong positive linear relationship between the two variables, because the use of a regression model has reduced the variability in predicting the Human Development Index by 62.76%. Over one-third, or 37.24%, of the sample variability in the Human Development Index is due to factors other than what is accounted for by the linear regression model that uses a country's mean mathematical literacy scores. Figure 12.8, overleaf, represents the coefficient-of-determination portion of Microsoft Excel's output for the Human Development Index.



Figure 12.8  Partial Microsoft Excel regression output for the Human Development Index

Regression statistics
Multiple R           0.7922
R square             0.6276
Adjusted R square    0.6057
Standard error       6.3010
Observations         19

EXAMPLE 12.4
CALCULATING THE COEFFICIENT OF DETERMINATION
Calculate the coefficient of determination, r², for the Human Development Index problem.

SOLUTION

You can calculate SST, SSR and SSE (defined in Equations 12.6, 12.7 and 12.8 on page 468) by using Equations 12.10, 12.11 and 12.12.

Formula for calculating SST:

SST = Σ(Yi − Ȳ)² = ΣYi² − (ΣYi)²/n   (12.10)

Formula for calculating SSR:

SSR = Σ(Ŷi − Ȳ)² = b0ΣYi + b1ΣXiYi − (ΣYi)²/n   (12.11)

Formula for calculating SSE:

SSE = Σ(Yi − Ŷi)² = ΣYi² − b0ΣYi − b1ΣXiYi   (12.12)

(All sums run over i = 1 to n.)







Using the summary results from Table 12.2 on page 464:

SST = ΣYi² − (ΣYi)²/n
    = 135,121.73 − (1,591.5)²/19
    = 1,812.664211

SSR = b0ΣYi + b1ΣXiYi − (ΣYi)²/n
    = (23.4250731)(1,591.5) + (0.131924466)(736,525.8) − (1,591.5)²/19
    = 1,137.71104

SSE = ΣYi² − b0ΣYi − b1ΣXiYi
    = 135,121.73 − (23.4250731)(1,591.5) − (0.131924466)(736,525.8)
    = 674.953170

Therefore:

r² = 1,137.71104 / 1,812.664211 = 0.6276
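These hand calculations can be checked directly from the summary totals quoted above. A minimal Python sketch (Python is an assumption about available software; every number below is taken from Example 12.4):

```python
# Summary totals and coefficients quoted in Example 12.4
n = 19
sum_y = 1591.5
sum_y2 = 135121.73
sum_xy = 736525.8
b0 = 23.4250731
b1 = 0.131924466

sst = sum_y2 - sum_y ** 2 / n                    # Equation 12.10
ssr = b0 * sum_y + b1 * sum_xy - sum_y ** 2 / n  # Equation 12.11
sse = sum_y2 - b0 * sum_y - b1 * sum_xy          # Equation 12.12
r2 = ssr / sst                                   # Equation 12.9

print(sst, ssr, sse, r2)  # approximately 1812.66, 1137.71, 674.95, 0.6276
```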

Standard Error of the Estimate
Although the least-squares method results in the line that fits the data with the minimum amount of variation, unless all the observed data points fall on a straight line the prediction line is not a perfect predictor. Just as all data values cannot be expected to be exactly equal to their mean, neither can they be expected to fall exactly on the prediction line. Therefore, a statistic that measures the variability of the actual Y values from the predicted Y values needs to be developed, in the same way that the standard deviation was developed in Chapter 3 as a measure of the variability of each value around the mean. This standard deviation around the prediction line is called the standard error of the estimate. Figure 12.5 on page 461 illustrates the variability around the prediction line for the Human Development Index data. Observe that, although many of the actual values of Y fall near the prediction line, no values are exactly on the line.

standard error of the estimate  The standard deviation of the observed Y values around the prediction line (the line of best fit).



The standard error of the estimate, represented by the symbol SYX, is defined in Equation 12.13.

STANDARD ERROR OF THE ESTIMATE

SYX = √(SSE / (n − 2)) = √( Σ(Yi − Ŷi)² / (n − 2) )   (12.13)

where
Yi = actual value of Y for a given Xi
Ŷi = predicted value of Y for a given Xi
SSE = error sum of squares

Using SSE = 674.953170 (calculated from Equation 12.8) in Equation 12.13:

SYX = √(674.953170 / (19 − 2)) = 6.3010

This standard error of the estimate, equal to 6.3010, is labelled ‘Standard error’ on the Microsoft Excel output of Figure 12.8. The standard error represents a measure of the variation around the prediction line. It is measured in the same units used by the dependent variable Y. The interpretation of the standard error is similar to that of the standard deviation. Just as the standard deviation measures variability around the mean, the standard error measures variability around the prediction line. As you will see in Sections 12.7 and 12.8, the standard error is used to determine whether a statistically significant relationship exists between the two variables and also to make inferences about future values of Y.
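A one-line check of this calculation, using the SSE and sample size reported above (Python is assumed as the tool; the values come from Example 12.4):

```python
import math

sse = 674.953170   # error sum of squares from Example 12.4
n = 19             # sample size

s_yx = math.sqrt(sse / (n - 2))  # standard error of the estimate (Equation 12.13)
print(round(s_yx, 4))            # approximately 6.3010
```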

Problems for Section 12.3 LEARNING THE BASICS

12.10 What does it mean if the coefficient of determination, r², is equal to 0.75? 12.11 If SSR = 25 and SSE = 24, calculate the coefficient of determination, r², and interpret its meaning. 12.12 If SSR = 55 and SST = 72, calculate the coefficient of determination, r², and interpret its meaning. 12.13 If SSE = 28 and SSR = 67, calculate the coefficient of determination, r², and interpret its meaning. 12.14 If SSR = 105, why is it impossible for SST to equal 100?

APPLYING THE CONCEPTS Problems 12.15 to 12.20 can be solved manually or by using Microsoft Excel.

12.15 In problem 12.4 on page 465, you used energy generation to predict CO2 emissions. < CO2 > Using the results of that problem: a. Determine the coefficient of determination, r 2, and interpret its meaning.

b. Determine the standard error. c. How useful do you think this regression model is for CO2 emissions? 12.16 In problem 12.5 on page 466, you used the plate gap to predict the mean tear rating for Starbucks bags. For that data, SSR = 10.98 and SST = 28.81. Using the results of that problem: a. Determine the coefficient of determination, r 2, and interpret its meaning. b. Determine the standard error. c. How useful do you think this regression model is for predicting the tear rating? 12.17 In problem 12.6 on page 466, a professor wanted to estimate the spending of his students based upon their income. Using the results of that problem: a. Determine the coefficient of determination, r 2, and interpret its meaning. b. Determine the standard error. c. How useful do you think this regression is for predicting spending?





12.18 In problem 12.7 on page 466, a statistics professor used tutorial class size to predict average class marks. Using the results of that problem: a. Determine the coefficient of determination, r 2, and interpret its meaning. b. Find the standard error. c. How useful do you think this regression model is for predicting the average class marks? 12.19 In problem 12.8 on page 467, you used alcohol content to predict wine quality. Using the results of that problem: a. Determine the coefficient of determination, r 2, and interpret its meaning.

b. Determine the standard error. c. How useful do you think this regression model is for predicting wine quality? 12.20 In problem 12.9 on page 467, a concert promoter used the ticket price to predict merchandise sales. Using the results of that problem: a. Determine the coefficient of determination, r 2, and interpret its meaning. b. Find the standard error. c. How useful do you think this regression model is for predicting merchandise sales?

12.4  ASSUMPTIONS

LEARNING OBJECTIVE 4  Evaluate the assumptions of regression analysis

assumptions of regression  Required conditions for regression analysis to produce reliable results.
linearity  The assumption that the relationship between variables is linear.
independence of errors  An assumption that the errors in a data set are not related to each other; particularly relevant in time-series data.
normality  An assumption that the errors in a regression are normally distributed at each value of X.
equal variance (homoscedasticity)  An assumption that the variance of the error terms is constant for all values of X.

The discussion of hypothesis testing and the analysis of variance emphasised the importance of the assumptions to the validity of any conclusions reached. The assumptions necessary for regression are similar to those for the analysis of variance because both topics fall under the general heading of linear models (reference 1). The four assumptions of regression (known by the acronym LINE) are as follows.
• Linearity
• Independence of errors
• Normality of error
• Equal variance (also called homoscedasticity).

The first assumption, linearity, states that the relationship between variables is linear. Relationships between variables are not always linear; they can be non-linear.
The second assumption, independence of errors, requires that the errors (ϵi) are independent of one another. This assumption is particularly important when data are collected over a period of time. In such situations, the errors for a specific time period are often correlated with those of the previous time period.
The third assumption, normality, requires that the errors (ϵi) are normally distributed at each value of X. Like the t test and the ANOVA F test, regression analysis is fairly robust against departures from the normality assumption. As long as the distribution of the errors at each level of X is not extremely different from a normal distribution, inferences about β0 and β1 are not seriously affected.
The fourth assumption, equal variance or homoscedasticity, requires that the variance of the errors (ϵi) is constant for all values of X. In other words, the variability of Y values will be the same when X is a low value as when X is a high value. The equal variance assumption is important when making inferences about β0 and β1. If there are serious departures from this assumption, you can use either data transformations or weighted least-squares methods (see reference 1).

12.5  RESIDUAL ANALYSIS

In Section 12.1, regression analysis was introduced. In Sections 12.2 and 12.3, a model was developed and estimated using the least-squares approach for the Human Development Index data. Was this the correct model for these data? Were the assumptions introduced in Section 12.4 valid? In this section, a graphical approach called residual analysis is used to evaluate the assumptions and thus determine whether the regression model selected is an appropriate model.

residual analysis Graphical evaluation of the residuals from regression to test for violations of the assumptions of regression.



residual Difference between the observed values and the corresponding values that are predicted by the regression model.

The residual or estimated error value, ei, is the difference between the observed (Yi) and the predicted values of the dependent variable for a given value of Xi. Graphically, a residual appears on a scatter diagram as the vertical distance between an observed value of Y and the prediction line. Equation 12.14 defines the residual.

THE RESIDUAL
The residual is equal to the difference between the observed value of Y and the predicted value of Y.

ei = Yi − Ŷi   (12.14)

Evaluating the Assumptions
Recall from Section 12.4 that the four assumptions of regression (known by the acronym LINE) are linearity, independence, normality and equal variance.

Linearity
To evaluate linearity, plot the residuals on the vertical axis against the corresponding Xi values of the independent variable on the horizontal axis. If the linear model is appropriate for the data, there will be no apparent pattern in this plot. However, if the linear model is not appropriate, there will be a relationship between the Xi values and the residuals ei. You can see such a pattern in Figure 12.9. The scatter plot in panel A shows a situation in which, although there is an increasing trend in Y as X increases, the relationship seems curvilinear because the upward trend decreases for increasing values of X. This quadratic effect is highlighted in the corresponding residual plot for a particular data set in panel B, where there is a clear relationship between Xi and ei. By plotting the residuals, the linear trend of X with Y has been removed, thereby exposing the lack of fit in the simple linear model. Thus, a quadratic model is a better fit and should be used in place of the simple linear model.

Figure 12.9  Studying the appropriateness of the simple linear regression model. [Panel A: scatter plot of Y against X showing a curvilinear pattern. Panel B: the corresponding plot of the residuals, e, against X, which exposes the quadratic pattern.]

To determine whether the simple linear regression model is appropriate, return to the evaluation of the Human Development Index data. Figure 12.10 provides the predicted and residual values of the response variable (the Human Development Index) calculated by Microsoft Excel. To assess linearity, the residuals are plotted against the independent variable (countries' mean mathematical literacy scores) in Figure 12.11. There is no apparent pattern or relationship between the residuals and Xi. The residuals appear to be evenly spread above and below 0 for the differing values of X. You can conclude that the linear model is appropriate for the Human Development Index data.




Figure 12.10  Microsoft Excel residual statistics for the Human Development Index regression

Observation   Predicted Human Development Index   Residuals
 1            89.65116                              5.24884
 2            88.59576                              5.30424
 3            88.72768                              2.77232
 4            85.42957                              6.57043
 5            89.91500                              2.38500
 6            87.80421                              4.29579
 7            90.97040                              1.42960
 8            91.49810                              0.50190
 9            90.17885                              2.42115
10            97.83047                             -5.33047
11            77.38218                              5.31782
12            73.16060                              2.23940
13            74.87561                             -2.17561
14            71.84135                              0.65865
15            73.55637                              0.54363
16            78.83335                             -2.13335
17            78.17373                             -4.17373
18            74.34792                             -5.44792
19            88.72768                            -20.42768

Figure 12.11  Microsoft Excel plot of residuals against the mathematical literacy score for the Human Development Index regression. [Residual plot: residuals on the vertical axis against the mean mathematics performance score, X, on the horizontal axis (0 to 600).]
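A residual plot such as Figure 12.11 can be produced with a few lines of code. The sketch below uses Python with numpy and matplotlib (assumed tooling, not part of this text); the x and y arrays are hypothetical stand-ins, since the raw Human Development Index sample is not reproduced here.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical stand-ins for the literacy scores (x) and HDI values (y);
# replace with the actual sample to reproduce Figure 12.11.
x = np.array([380, 420, 450, 470, 490, 510, 540, 560])
y = np.array([72.0, 78.5, 83.0, 85.5, 88.0, 90.5, 93.0, 95.5])

b1, b0 = np.polyfit(x, y, 1)      # least-squares fit
residuals = y - (b0 + b1 * x)     # e_i = Y_i - Y_hat_i (Equation 12.14)

plt.scatter(x, residuals)
plt.axhline(0)                    # reference line at zero
plt.xlabel('Mean mathematics performance score (X)')
plt.ylabel('Residuals')
plt.show()
```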

Independence
You can evaluate the assumption of independence of the errors by plotting the residuals in the order or sequence in which the observed data were collected. Data collected over periods of time sometimes exhibit an autocorrelation effect between successive observations. In these instances, there is a relationship between consecutive residuals. If this relationship exists (which violates the assumption of independence), it will be apparent in the plot of the residuals versus the time in which the data were collected. You can also test for autocorrelation using the Durbin–Watson statistic, which is the subject of Section 12.6. The Human Development Index data considered so far in this chapter were collected during the same time period (i.e. cross-sectional data). Therefore, you do not need to evaluate the independence assumption for these data.


Normality
You can evaluate the assumption of normality in the errors by tallying the residuals into a frequency distribution and displaying the results in a histogram (see Section 2.3). For the Human Development Index data, the residuals have been tallied into a frequency distribution as shown in Table 12.3. (There is an insufficient number of values to construct a histogram.) You can also evaluate the normality assumption by comparing the actual versus theoretical values of the residuals, or by constructing a normal probability plot of the residuals (see Section 6.2).

Table 12.3  Frequency distribution of 19 residual values for the Human Development Index regression

Residuals               Frequency
Less than -6             1
-6 but less than -3      3
-3 but less than 0       2
0 but less than 3        8
3 but less than 6        4
6 or greater             1
Total                   19
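The tally in Table 12.3, and the normal probability plot mentioned above, can be reproduced from the 19 residuals listed in Figure 12.10. A minimal sketch using Python with numpy, scipy and matplotlib (assumed tooling, not part of this text):

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Residuals from Figure 12.10
residuals = np.array([5.24884, 5.30424, 2.77232, 6.57043, 2.38500, 4.29579,
                      1.42960, 0.50190, 2.42115, -5.33047, 5.31782, 2.23940,
                      -2.17561, 0.65865, 0.54363, -2.13335, -4.17373,
                      -5.44792, -20.42768])

# Tally into the class intervals of Table 12.3 (outer edges chosen wide
# enough to capture all values)
edges = [-30, -6, -3, 0, 3, 6, 30]
counts, _ = np.histogram(residuals, bins=edges)
print(counts)  # expected: [1 3 2 8 4 1]

# Normal probability plot of the residuals
stats.probplot(residuals, plot=plt)
plt.show()
```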

It is difficult to evaluate the normality assumption for a sample of only 19 values, regardless of whether you use a histogram, stem-and-leaf display, box-and-whisker plot or normal probability plot. The robustness of regression analysis to modest departures from normality, together with the small sample size, means that we should not be overly concerned about departures from the normality assumption in the Human Development Index data.

Equal variance
You can also evaluate the assumption of equal variance from a plot of the residuals with Xi. For the Human Development Index data of Figure 12.11, the variability of the residuals appears to increase with greater values of Xi. Thus, we can conclude that there may be a violation of the assumption of equal variance at each level of X. To examine a more obvious case in which the equal variance assumption is violated, observe Figure 12.12, which is a plot of the residuals with Xi for a hypothetical set of data. In this plot, the variability of the residuals increases dramatically as X increases, demonstrating the lack of homogeneity in the variances of Yi at each level of X. For these data, the equal variance assumption is invalid.

Figure 12.12  Violation of equal variance. [Plot of residuals against X for a hypothetical data set in which the spread of the residuals around zero widens as X increases.]





Problems for Section 12.5 LEARNING THE BASICS

12.21 The following computer output contains the X values, residuals and a residual plot from a regression analysis. Is there any evidence of a pattern in the residuals? Explain.

APPLYING THE CONCEPTS We recommend that you use Microsoft Excel to solve problems 12.22 to 12.27.

12.22 In problem 12.5 on page 466, you used plate gap to predict tear ratings for Starbucks bags. Perform a residual analysis for these data. Based on these results: a. Determine the adequacy of the fit of the model. b. Evaluate whether the assumptions of regression have been seriously violated.

12.23 In problem 12.4 on page 465, you used energy generation to predict CO2 emissions. < CO2 > Perform a residual analysis for these data. Based on the results: a. Determine the adequacy of the fit of the model. b. Evaluate whether the assumptions of regression have been seriously violated. 12.24 In problem 12.7 on page 466, a statistics professor used tutorial class size to predict average class marks. Perform a residual analysis for these data. Based on these results: a. Determine the adequacy of the fit of the model. b. Evaluate whether the assumptions of regression have been seriously violated. 12.25 In problem 12.6 on page 466, a professor wanted to estimate the spending of his students based upon their income. Perform a residual analysis for these data. Based on these results: a. Determine the adequacy of the fit of the model. b. Evaluate whether the assumptions of regression have been seriously violated. 12.26 In problem 12.8 on page 467, you used alcohol content to predict wine quality. Perform a residual analysis for these data. Based on these results: a. Determine the adequacy of the fit of the model. b. Evaluate whether the assumptions of regression have been seriously violated. 12.27 In problem 12.9 on page 467, a concert promoter used the ticket price to predict merchandise sales. Perform a residual analysis for these data. Based on these results: a. Determine the adequacy of the fit of the model. b. Evaluate whether the assumptions of regression have been seriously violated.

12.6  MEASURING AUTOCORRELATION – THE DURBIN–WATSON STATISTIC

LEARNING OBJECTIVE 4  Evaluate the assumptions of regression analysis

autocorrelation  Relationships between data values in consecutive periods of time.

One of the basic assumptions of the regression model is the independence of the errors. This assumption is sometimes violated when data are collected over sequential periods of time because a residual at any one point in time may tend to be similar to residuals at adjacent points in time. This pattern in the residuals is called autocorrelation. When a set of data has substantial autocorrelation, the validity of a regression model can be in serious doubt.

Residual Plots to Detect Autocorrelation
As mentioned in Section 12.5, one way to detect autocorrelation is to plot the residuals in time order. If a positive autocorrelation effect is present, there will be clusters of residuals with the same sign and you will readily detect an apparent pattern. If negative autocorrelation exists, residuals will tend to jump back and forth from positive to negative to positive, and so on. This type of pattern is very rarely seen in regression analysis. Thus, the focus of this section is on positive autocorrelation.



To illustrate positive autocorrelation, consider the following example. The manager of a discount electrical store wants to predict weekly sales based on the number of customers making purchases for a period of 15 weeks. In this situation, because data are collected over a period of 15 consecutive weeks at the same store, you need to determine whether autocorrelation is present. Table 12.4 summarises the data for this store. < CUST_SALE > Figure 12.13 illustrates the Microsoft Excel output.
From Figure 12.13, observe that r² is 0.6575, indicating that 65.75% of the variation in sales is explained by variation in the number of customers. In addition, the Y intercept, b0, is -16.0322, and the slope, b1, is 0.0308. However, before using this model for prediction, you must undertake proper analyses of the residuals. Because the data have been collected over a consecutive period of 15 weeks, in addition to checking the linearity, normality and equal variance assumptions, you must investigate the independence of errors assumption. Figure 12.14 plots the residuals against time to see whether a pattern exists. In Figure 12.14 the residuals tend to fluctuate up and down in a cyclical pattern. This cyclical pattern provides strong cause for concern about the autocorrelation of the residuals and, hence, a violation of the independence of errors assumption.

Table 12.4  Customers and sales for a period of 15 consecutive weeks

Week   Customers   Sales (in thousands of dollars)
 1     794          9.33
 2     799          8.26
 3     837          7.48
 4     855          9.08
 5     845          9.83
 6     844         10.09
 7     863         11.01
 8     875         11.49
 9     880         12.07
10     905         12.55
11     886         11.92
12     843         10.27
13     904         11.80
14     950         12.15
15     841          9.64

Figure 12.13  Microsoft Excel output for the discount electrical store data of Table 12.4

Regression statistics
Multiple R           0.81083
R square             0.65745
Adjusted R square    0.63109
Standard error       0.93604
Observations         15

ANOVA         df    SS          MS         F          Significance F
Regression     1    21.86043    21.86043   24.95014   0.00025
Residual      13    11.39014     0.87616
Total         14    33.25057

             Coefficients   Standard error   t stat     p-value   Lower 95%   Upper 95%
Intercept    -16.03219      5.31017          -3.01915   0.00987   -27.50411   -4.56028
Customers      0.03076      0.00616           4.99501   0.00025     0.01746    0.04406





Figure 12.14  Microsoft Excel residual plot for the discount electrical store data of Table 12.4. [Residuals plotted against week (1 to 15); the residuals fluctuate up and down in a cyclical pattern.]

The Durbin–Watson Statistic

Durbin–Watson statistic  Measures autocorrelation between data values in a time series by measuring the correlation between each residual and each preceding residual in the series.

The Durbin–Watson statistic is used to detect autocorrelation. This statistic measures the correlation between each residual and the residual for the time period immediately preceding the one of interest. Equation 12.15 defines the Durbin–Watson statistic.

DURBIN–WATSON STATISTIC

D = Σ(ei − ei−1)² / Σei²   (12.15)

where ei = residual at time period i; the numerator is summed from i = 2 to n and the denominator from i = 1 to n.

To understand the Durbin–Watson statistic, D, examine the composition of the statistic presented in Equation 12.15. The numerator, Σ(ei − ei−1)², represents the squared differences between successive residuals, summed from the second value to the nth value. The denominator, Σei², represents the sum of the squared residuals. When successive residuals are positively autocorrelated, the value of D will approach 0. If the residuals are not correlated, the value of D will be close to 2. (If there is negative autocorrelation, D will be greater than 2 and could even approach its maximum value of 4.)
For the discount electrical store data, the Durbin–Watson statistic, D = 0.8830, is calculated by Microsoft Excel as illustrated in Figure 12.15. To determine whether there is significant positive autocorrelation, you need to know how far below 2 the Durbin–Watson statistic, D, needs to fall.



Figure 12.15  Microsoft Excel output of the Durbin–Watson statistic for the sales data

1 2 3 4 5 6

A Durbin–Watson calculations Sum of squared difference of residuals Sum of squared residuals Durbin–Watson statistic

B

10.0575 11.3901 0.8830

Whether D is significant depends on n, the sample size, and k, the number of independent variables in the model (in simple linear regression, k = 1). Table 12.5 has been extracted from Table E.11, the table of the Durbin–Watson statistic. In Table 12.5, two values are shown for each combination of α (level of significance), n (sample size) and k (number of independent variables in the model). The first value, dL, represents the lower critical value. If D is below dL, you conclude that there is evidence of positive autocorrelation between the residuals. Under such a circumstance, the least-squares method used in this chapter is inappropriate and you should use alternative methods (reference 1). The second value, dU, represents the upper critical value of D, above which you would conclude that there is no evidence of positive autocorrelation between the residuals. If D is between dL and dU, you are unable to arrive at a definite conclusion.
For the data concerning the discount electrical store, with one independent variable (k = 1) and 15 values (n = 15), dL = 1.08 and dU = 1.36. Because D = 0.8830 < 1.08, you conclude that there is positive autocorrelation between the residuals. The least-squares regression analysis of the data is inappropriate because of the presence of significant positive autocorrelation between the residuals. In other words, the independence of errors assumption is invalid. You need to use the alternative approaches discussed in reference 1.

Table 12.5  Finding critical values of the Durbin–Watson statistic (extracted from Table E.11 in Appendix E of this book)
Source: J. Durbin and G. S. Watson, 'Testing for serial correlation in least squares regression', Biometrika, June 1951. Oxford University Press © Biometrika Trust.

α = 0.05
       k = 1         k = 2         k = 3         k = 4         k = 5
n     dL    dU      dL    dU      dL    dU      dL    dU      dL    dU
15   1.08  1.36     .95  1.54     .82  1.75     .69  1.97     .56  2.21
16   1.10  1.37     .98  1.54     .86  1.73     .74  1.93     .62  2.15
17   1.13  1.38    1.02  1.54     .90  1.71     .78  1.90     .67  2.10
18   1.16  1.39    1.05  1.53     .93  1.69     .82  1.87     .71  2.06
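If you wish to reproduce the Durbin–Watson calculation in Figure 12.15 outside Excel, the sketch below does so in Python with numpy (assumed tooling, not part of this text), using the 15 weeks of customers and sales from Table 12.4; the result should closely match the reported 0.8830.

```python
import numpy as np

# Discount electrical store data from Table 12.4
customers = np.array([794, 799, 837, 855, 845, 844, 863, 875,
                      880, 905, 886, 843, 904, 950, 841])
sales = np.array([9.33, 8.26, 7.48, 9.08, 9.83, 10.09, 11.01, 11.49,
                  12.07, 12.55, 11.92, 10.27, 11.80, 12.15, 9.64])

b1, b0 = np.polyfit(customers, sales, 1)   # least-squares fit
e = sales - (b0 + b1 * customers)          # residuals in time order

# Durbin-Watson statistic (Equation 12.15)
d = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)
print(round(d, 4))  # approximately 0.88
```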

Problems for Section 12.6 LEARNING THE BASICS
12.28 The residuals for 10 consecutive time periods are as follows:

Time period   Residual
 1            -2
 2            +3
 3            -3
 4            +2
 5            -5
 6            +1
 7            -2
 8            +4
 9            -2
10            +3

a. Plot the residuals over time. What conclusion can you reach about the pattern of the residuals? b. Based on (a), what conclusion can you reach about the autocorrelation of the residuals?





12.29 The residuals for 15 consecutive time periods are as follows:

Time period   Residual
 1            +2
 2            +2
 3             0
 4            -1
 5            -3
 6            -3
 7            -1
 8             0
 9            -1
10            +1
11            +2
12            +3
13            +2
14             0
15            -1

a. Plot the residuals over time. What conclusion can you reach about the pattern of the residuals? b. Calculate the Durbin–Watson statistic. At the 0.05 level of significance, is there evidence of positive autocorrelation between the residuals? c. Based on (a) and (b), what conclusion can you reach about the autocorrelation of the residuals?

APPLYING THE CONCEPTS We recommend that you use Microsoft Excel to solve problems 12.30 to 12.34.

12.30 In problem 12.6 on page 466, a professor wanted to estimate the spending of his students based upon their weekly income. a. Is it necessary to calculate the Durbin–Watson statistic? Explain. b. Under what circumstances is it necessary to calculate the Durbin–Watson statistic before proceeding with the least-squares method of regression analysis?
12.31 A mortgage brokerage company is trying to predict mortgage interest rates over the next few years. Its statistical analyst believes that the main influence on interest rates is inflation, and a time series of mortgage interest rates and inflation over the past 10 years is collected for analysis. < INTEREST_RATE >

Year   Interest rate   Inflation rate
2006   6.6             1.5
2007   7.1             2.7
2008   7.3             2.3
2009   7.7             2.6
2010   8.1             2.3
2011   9               4.4
2012   7               1.8
2013   6.7             2.8
2014   7.1             3
2015   7.8             3.1

a. Assuming a linear relationship, use the least-squares method to find the regression coefficients, b0 and b1.

b. Predict interest rates if inflation is 4%. c. Plot the residuals for the analysis. d. Calculate the Durbin–Watson statistic. At the 0.05 level of significance, is there evidence of positive autocorrelation between the residuals? e. Based on the results of (c) and (d), is there reason to question the validity of the model?
12.32 In problem 12.6 on page 466, a professor wanted to estimate the spending of his students based upon their weekly income. Instead of collecting these data from 19 different students at one point in time (cross-sectional data), he has now collected spending and income from one student at monthly intervals over one year (time-series data).

Month       Spending ($)   Income ($)
January     564            625
February    652            678
March       389            400
April       478            525
May         612            600
June        423            521
July        565            645
August      486            500
September   384            489
October     769            893
November    524            645
December    698            785

a. Find the regression coefficients, b0 and b1. b. Predict the monthly spending for an income of $650. c. Plot the residuals over time. Are there any noticeable patterns? d. Calculate the Durbin–Watson statistic. At the 0.05 level of significance, is there evidence of positive autocorrelation between the residuals? e. Based on the results of (c) and (d), is there reason to question the validity of the model?
12.33 A social scientist wishes to analyse the relationship between crime rates (y) and the youth unemployment rate (x). He collects data on the number of break and enters per 1,000 people and the youth unemployment rate measured quarterly over a four-year period. < CRIME >

Quarter    Crime rate   Unemployment rate
2009 – 1   21           25
2009 – 2   20           23
2009 – 3   18           22
2009 – 4   17           19
2010 – 1   22           20
2010 – 2   17           15
2010 – 3   16           18
2010 – 4   22           14
2011 – 1   25           23
2011 – 2   27           26
2011 – 3   16           16
2011 – 4   18           14
2012 – 1   16           13
2012 – 2   13           12
2012 – 3   19           13
2012 – 4   15           17

a. Assuming a linear relationship, use the least-squares method to find the regression coefficients, b0 and b1. b. Predict the crime rate if youth unemployment rates are 23%. c. Plot the residuals against the time period. d. Calculate the Durbin–Watson statistic. At the 0.05 level of significance, is there evidence of positive autocorrelation between the residuals? e. Based on the results of (c) and (d), is there reason to question the validity of the model? 12.34 The number of meat pies sold at an AFL Saturday football match at the MCG in Melbourne is expected to be determined by the size of the crowd attending the game. Data collected for 12 Saturday games during the football season are given below.

Attendance ('000)   Number of pies sold ('000)
50                  12
45                  13
67                  18
56                  19
80                  25
60                  10
90                  27
95                  33
70                  25
66                  20
57                  11
68                  15

a. Construct a scatter diagram with the independent variable on the x-axis. Discuss any patterns present in the data. b. Determine the prediction line. c. Plot the residuals against the time period and interpret the plot. d. Calculate the Durbin–Watson statistic. At the 0.05 level of significance, is there evidence of positive autocorrelation between the residuals? e. Based on the results of (c) and (d), is there reason to question the validity of the model?

12.7  INFERENCES ABOUT THE SLOPE AND CORRELATION COEFFICIENT

LEARNING OBJECTIVE 5  Make inferences about the slope

In Sections 12.1 to 12.3 we used regression solely for the purpose of description. You learned how the least-squares method determines the regression coefficients, and how to predict Y for a given value of X. You also learned how to calculate and interpret the standard error of the estimate and the coefficient of determination. When residual analysis, as discussed in Section 12.5, indicates that the assumptions of a least-squares regression model are not seriously violated and that the straight-line model is appropriate, you can make inferences about the linear relationship between the variables in the population.

t Test for the Slope

t test for the slope  Hypothesis test for the statistical significance of the regression slope using the t probability distribution.

To determine the existence of a significant linear relationship between the X and Y variables, you can test whether β1 (the population slope) is equal to 0. The null and alternative hypotheses are as follows:

H0: β1 = 0
H1: β1 ≠ 0

If you reject the null hypothesis, you conclude that there is evidence of a linear relationship. Equation 12.16 defines the test statistic.





TESTING A HYPOTHESIS FOR A POPULATION SLOPE, β1, USING THE t TEST
The t statistic equals the difference between the sample slope and the hypothesised value of the population slope divided by the standard error of the slope.

t = (b1 − β1) / Sb1   (12.16)

where

Sb1 = SYX / √SSX

SSX = Σ(Xi − X̄)²

The test statistic t follows a t distribution with n − 2 degrees of freedom.
Return to the scenario concerning the Human Development Index data. To test whether there is a significant relationship between the countries' mean mathematical literacy scores and their Human Development Index at the 0.05 level of significance, refer to Microsoft Excel's output for the t test presented in Figure 12.16. From Figure 12.16, b1 = 0.131924466, n = 19 and Sb1 = 0.024644596, so:

t = (b1 − β1) / Sb1 = (0.131924466 − 0) / 0.024644596 = 5.3531

Microsoft Excel labels this t statistic t stat (see Figure 12.16). Using the 0.05 level of significance, the critical value of t with n − 2 = 17 degrees of freedom is 2.1098 (for a two-tailed test). Because t = 5.35 > 2.1098, reject H0 (see Figure 12.17). Using the p-value, we reject H0 because the p-value is approximately 0. Hence, we can conclude that there is a significant linear relationship between the mean Human Development Index and countries' mean mathematical literacy scores.

Figure 12.16  Microsoft Excel t test for the Human Development Index regression

                                      Coefficients   Standard error   t stat   p-value   Lower 95%   Upper 95%
Intercept                             23.4251        11.3640          2.0613   0.0549    -0.5508     47.4010
Mean mathematical literacy score–X     0.1319         0.0246          5.3531   0.0001     0.0799      0.1839

Figure 12.17  Testing a hypothesis about the population slope at the 0.05 level of significance with 17 degrees of freedom. [t17 distribution: regions of rejection (area 0.025 in each tail) beyond the critical values -2.1098 and 2.1098, with the region of non-rejection (area 0.95) between them.]
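The t test can also be reproduced from the coefficient output alone. The sketch below uses Python with scipy (assumed tooling, not part of this text) and the slope and standard error reported in Figure 12.16.

```python
from scipy import stats

b1 = 0.131924466     # sample slope from Figure 12.16
s_b1 = 0.024644596   # standard error of the slope
n = 19

t_stat = (b1 - 0) / s_b1                          # Equation 12.16 with beta1 = 0
t_crit = stats.t.ppf(0.975, df=n - 2)             # two-tailed critical value
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)   # two-tailed p-value

# t statistic (about 5.3531) exceeds the critical value (about 2.1098)
# and the p-value is far below 0.05, so H0 is rejected
print(round(t_stat, 4), round(t_crit, 4), p_value)
```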



F Test for the Slope

F test for the slope  Use of the F probability distribution to test whether the slope in simple linear regression is statistically significant.

You can also use an F test to determine whether the slope in simple linear regression is statistically significant. In testing for the significance of the slope, the F test, defined in Equation 12.17, is the ratio of the variance that is due to the regression (MSR) divided by the error variance (MSE = S²YX).

TESTING A HYPOTHESIS FOR A POPULATION SLOPE, β1, USING THE F TEST
The F statistic is equal to the regression mean square (MSR) divided by the error mean square (MSE).

F = MSR / MSE   (12.17)

where
MSR = SSR / k
MSE = SSE / (n − k − 1)
k = number of explanatory variables in the regression model

The test statistic F follows an F distribution with k and n − k − 1 degrees of freedom.

Using a level of significance of α, the decision rule is: reject H0 if F > FU; otherwise, do not reject H0. Table 12.6 organises the complete set of results into an ANOVA table.

Table 12.6  ANOVA table for testing the significance of a regression coefficient

Source        df          Sum of squares   Mean square (variance)     F
Regression    k           SSR              MSR = SSR/k                F = MSR/MSE
Error         n − k − 1   SSE              MSE = SSE/(n − k − 1)
Total         n − 1       SST

The completed ANOVA table is also part of the output from Microsoft Excel (see Figure 12.18), which shows that the calculated F statistic is 28.6555 and the p-value is 0.0001.

Figure 12.18  Microsoft Excel F test for the Human Development Index regression

ANOVA         df    SS          MS          F         Significance F
Regression     1    1137.7110   1137.7110   28.6555   0.0001
Residual      17     674.9532     39.7031
Total         18    1812.6642





Using a level of significance of 0.05, from Table E.5 the critical value of the F distribution with 1 and 17 degrees of freedom is 4.45 (see Figure 12.19). Because F = 28.6555 > 4.45, or because the p-value = 0.0001 < 0.05, we reject H0 and conclude that a country's mean mathematical literacy score is significantly related to its Human Development Index.

Figure 12.19  Regions of rejection and non-rejection when testing for significance of the slope at the 0.05 level of significance with 1 and 17 degrees of freedom. [F distribution: the region of non-rejection (area 0.95) lies below the critical value of 4.45 and the region of rejection (area 0.05) lies above it.]
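The F test uses only the sums of squares from the ANOVA table. A minimal sketch using Python with scipy (assumed tooling, not part of this text) and the values in Figure 12.18:

```python
from scipy import stats

ssr, sse = 1137.7110, 674.9532   # sums of squares from Figure 12.18
n, k = 19, 1                     # sample size, number of explanatory variables

msr = ssr / k                    # regression mean square
mse = sse / (n - k - 1)          # error mean square
f_stat = msr / mse               # Equation 12.17

f_crit = stats.f.ppf(0.95, dfn=k, dfd=n - k - 1)   # upper critical value
p_value = stats.f.sf(f_stat, dfn=k, dfd=n - k - 1)

# F (about 28.66) exceeds the critical value (about 4.45) and the p-value
# is far below 0.05, so H0 is rejected
print(round(f_stat, 4), round(f_crit, 2), p_value)
```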

Confidence Interval Estimate of the Slope (β1)
As an alternative to testing the existence of a linear relationship between the variables, we can set up a confidence interval estimate of β1 and determine whether the hypothesised value (β1 = 0) is included in the interval. Equation 12.18 defines the confidence interval estimate of β1.

confidence interval estimate A range of numbers constructed about the point estimate.

CONFIDENCE INTERVAL ESTIMATE OF THE SLOPE, β1
The confidence interval estimate for the slope can be formed by taking the sample slope, b1, and adding and subtracting the critical t value multiplied by the standard error of the slope.

b1 ± tn−2 Sb1   (12.18)

From the Microsoft Excel output of Figure 12.16 (page 483), b1 = 0.131924466, n = 19 and Sb1 = 0.024644596. To construct a 95% confidence interval estimate, α/2 = 0.025 and, from Table E.3, t17 = 2.1098. Thus:

b1 ± tn−2 Sb1 = 0.131924466 ± (2.1098)(0.024644596)
             = 0.131924466 ± 0.051995169

0.0799 ⩽ β1 ⩽ 0.1839

Therefore, we estimate with 95% confidence that the population slope is between 0.0799 and 0.1839. Because these values are both above 0, you conclude that there is a significant linear relationship between the countries' mean mathematical literacy scores and the Human Development Index. Had the interval included 0, you would have concluded that no significant relationship exists between the variables. The confidence interval indicates that for each one-unit increase in a country's mean mathematical literacy score, the mean Human Development Index is estimated to increase by between 0.0799 and 0.1839. This interval can be found under 'Lower 95%' and 'Upper 95%' in Figure 12.16.
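A short sketch of the interval calculation, using Python with scipy (assumed tooling, not part of this text) and the values quoted above:

```python
from scipy import stats

b1 = 0.131924466     # sample slope
s_b1 = 0.024644596   # standard error of the slope
n = 19

t_crit = stats.t.ppf(0.975, df=n - 2)   # t_17 critical value, about 2.1098
half_width = t_crit * s_b1              # Equation 12.18

lower, upper = b1 - half_width, b1 + half_width
print(round(lower, 4), round(upper, 4))  # approximately 0.0799 and 0.1839
```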



LEARNING OBJECTIVE 6  Make inferences about the correlation coefficient

correlation coefficient (or coefficient of correlation)  Measure of the relative strength of the linear relationship between two variables.
t test for the correlation coefficient  Hypothesis test for the statistical significance of the correlation coefficient in regression using the t distribution.

t Test for the Correlation Coefficient
In Section 3.5 the strength of the relationship between two numerical variables was measured using the correlation coefficient, r. We can use the correlation coefficient to determine whether there is a statistically significant association between X and Y. We hypothesise that the population correlation coefficient, ρ, is 0. Thus, the null and alternative hypotheses are:

H0: ρ = 0
H1: ρ ≠ 0

Equation 12.19 defines the test statistic for determining the existence of a significant correlation (t test for the correlation coefficient).

TESTING FOR THE EXISTENCE OF CORRELATION

t = (r − ρ) / √[(1 − r²) / (n − 2)]   (12.19)

where r = +√r² if b1 > 0, or r = −√r² if b1 < 0
The test statistic t follows a t distribution with n − 2 degrees of freedom.

In the Human Development Index problem, r² = 0.6276 and b1 = 0.1319 (see Figure 12.4 on page 460). Since b1 > 0, the correlation coefficient for countries' Human Development Index and mean mathematical literacy scores is the positive square root of r²; that is, r = √0.6276 = 0.7922. Testing the null hypothesis that there is no correlation between these two variables results in the following observed t statistic:

t = (r − ρ) / √[(1 − r²) / (n − 2)] = (0.7922 − 0) / √[(1 − (0.7922)²) / (19 − 2)] = 5.35

Using the 0.05 level of significance, since t = 5.35 > 2.1098, you reject the null hypothesis. You conclude that there is evidence of a significant association between a country’s mean mathematical literacy scores and its Human Development Index. This t statistic is equivalent to the t statistic found when testing whether the population slope, β1, is equal to zero (Figure 12.16 on page 483). When inferences concerning the population slope were discussed, confidence intervals and tests of hypothesis were used interchangeably. However, developing a confidence interval for the correlation coefficient is more complicated because the shape of the sampling distribution of the statistic r varies for different values of the population correlation coefficient. Methods for developing a confidence interval estimate for the correlation coefficient can be found in reference 1.
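The same calculation, together with the corresponding p-value, can be reproduced outside Excel; the sketch below uses Python with scipy (an assumption about available software, not part of this text).

```python
import math
from scipy import stats

r2 = 0.6276
n = 19
r = math.sqrt(r2)   # positive root because b1 > 0

t_stat = (r - 0) / math.sqrt((1 - r ** 2) / (n - 2))   # Equation 12.19
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)

# t is about 5.35 and the p-value is far below 0.05, so H0: rho = 0 is rejected
print(round(r, 4), round(t_stat, 2), p_value)
```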





Problems for Section 12.7 LEARNING THE BASICS 12.35 You are testing the null hypothesis that there is no relationship between two variables, X and Y. From your sample of n = 34, you determine that b1 = 3.4 and Sb1= 1.2. a. What is the value of the t test statistic? b. At the α = 0.05 level of significance, what are the critical values? c. Based on your answers to (a) and (b), what statistical decision should you make? d. Construct a 95% confidence interval estimate of the population slope, β1. 12.36 You are testing the null hypothesis that there is no relationship between two variables, X and Y. From your sample of n = 21, you determine that SSR = 34 and SSE = 25. a. What is the value of the F test statistic? b. At the α = 0.05 level of significance, what is the critical value? c. Based on your answers to (a) and (b), what statistical decision should you make? d. Calculate the correlation coefficient by first calculating r2 and assuming b1 is negative. e. At the 0.05 level of significance, is there a significant correlation between X and Y?

APPLYING THE CONCEPTS Problems 12.37 to 12.48 can be solved manually or by using Microsoft Excel.

12.37 In problem 12.4 on page 465, you used energy generation to predict CO2 emissions. < CO2 > From the results of that problem, b1 = 1442.216 and Sb1 = 60.209. a. At the 0.05 level of significance, is there evidence of a linear relationship between energy generation and CO2 emissions? b. Construct a 95% confidence interval estimate of the population slope, β1. 12.38 In problem 12.5 on page 466, you used plate gap to predict tear ratings for bags at Starbucks. Using the results of that problem: a. At the 0.05 level of significance, is there evidence of a linear relationship between plate gap and tear rating? b. Construct a 95% confidence interval estimate of the population slope, β1. 12.39 In problem 12.6 on page 466, a professor wanted to estimate the spending of his students based upon their income. Using the results of that problem: a. At the 0.05 level of significance, is there evidence of a linear relationship between spending and income? b. Construct a 95% confidence interval estimate of the population slope, β1.

12.40 In problem 12.7 on page 466, a statistics professor used tutorial class size to predict average class marks. Using the results of that problem: a. At the 0.05 level of significance, is there evidence of a linear relationship between the tutorial class size and average class mark? b. Construct a 95% confidence interval estimate of the population slope, β1. 12.41 In problem 12.8 on page 467, you used alcohol content to predict wine quality. Using the results of that problem: a. At the 0.05 level of significance, is there evidence of a linear relationship between alcohol content and wine quality? b. Construct a 95% confidence interval estimate of the population slope, β1. 12.42 In problem 12.9 on page 467, a concert promoter used ticket price to predict merchandise sales. Using the results of that problem: a. At the 0.05 level of significance, is there evidence of a linear relationship between ticket price and merchandise sales? b. Construct a 95% confidence interval estimate of the population slope, β1. 12.43 Movie companies need to predict the gross receipts of an individual movie once the movie has debuted. The following results, stored in < POTTER_MOVIES >, are the first weekend gross, the US gross and the worldwide gross (in millions of dollars) of the eight Harry Potter movies that debuted from 2001 to 2011. First Worldwide Title weekend US gross gross Sorcerer’s Stone  90.295 317.558   976.458 Chamber of Secrets  88.357 261.988   878.988 Prisoner of Azkaban  93.687 249.539   795.539 Goblet of Fire 102.335 290.013   896.013 Order of the Phoenix  77.108 292.005   938.469 Half-Blood Prince  77.836 301.460   934.601 Deathly Hallows: Part I 125.017 295.001   955.417 Deathly Hallows: Part II 169.189 381.001 1,328.110 Data obtained from

a. Calculate the coefficient of correlation between first weekend gross and US gross, first weekend gross and worldwide gross, and US gross and worldwide gross. b. At the 0.05 level of significance, is there a significant linear relationship between first weekend gross and US gross, first weekend gross and worldwide gross, and US gross and worldwide gross?



12.44 The high cost of childcare is seen by policy makers as a major hurdle for young mothers to return to the workforce. To analyse this theory we collected the weekly price paid for childcare and hours worked per week by young mothers 12 months after the birth of their youngest child. < CHILDCARE >

Hours worked   Childcare cost ($)
 0             850
 1             750
34             345
31             350
38             330
24             500
16             550
 0             775
43             345
35             375
29             500
25             550

a. At the 0.05 level of significance, is there evidence of a linear relationship between hours worked and childcare cost? b. Construct a 95% confidence interval estimate of the population slope, β1.
12.45 A question has arisen about the best location for police breath-testing units on public roads in Victoria. One argument is that testing should be done in high-risk locations such as the sites of major events and outside night entertainment areas. Another argument is that the number of people tested and found to be above the 0.05 drink-driving limit is simply a function of the number of drivers tested, regardless of location. Data are easily tabulated for an assumed number of charges laid and the number of drivers tested over a two-week period.

Number of drivers tested   Number of drivers charged
250                        20
288                        19
330                        24
 68                         5
145                         6
310                        17
220                        10
 88                         6
 71                         4
169                         8
121                        11
115                         9
196                        16
243                        22

a. Calculate the coefficient of correlation, r. b. At the 0.05 level of significance, is there a significant linear relationship between the number of drivers tested and the number of drivers charged for drink-driving?

12.46 The volatility of a share is often measured by its beta value. You can estimate the beta value of a share by developing a simple linear regression model, using the percentage weekly change in the share as the dependent variable and the percentage weekly change in a market index as the independent variable. The S&P 500 Index is a common index to use. For example, if you wanted to estimate the beta value for Disney, you could use the following model, which is sometimes referred to as a market model:

(% weekly change in Disney) = β0 + β1(% weekly change in S&P 500 Index) + ε

The least-squares regression estimate of the slope b1 is the estimate of the beta value for Disney. A share with a beta value of 1.0 tends to move the same as the overall market. A share with a beta value of 1.5 tends to move 50% more than the overall market, and a share with a beta value of 0.6 tends to move only 60% as much as the overall market. Shares with negative beta values tend to move in the opposite direction of the overall market. The following table gives some beta values for some widely held shares as of 8 June 2013.

Company                   Ticker symbol   Beta
Procter & Gamble          PG               0.32
Dr Pepper Snapple Group   DPS             -0.02
Disney                    DIS              1.07
Apple                     AAPL             0.69
eBay                      EBAY             0.79
Marriott                  MAR              1.32

Data obtained from accessed 26 June 2013

a. For each of the six companies, interpret the beta value. b. How can investors use the beta value as a guide for investing?
12.47 AsiaRapid is a delivery service for packages and mail across the Asia-Pacific region. The manager would like to estimate the relationship between the distance of the delivery job and the time taken to deliver. This is to enable him to quote more accurate delivery times to customers. A set of sample data is collected of 24 different distances and the time taken to deliver documents. The set of data is tabulated below.

Locations                 Distance (km)   Delivery time (hours)
Beijing–Sydney             8,557          47
Seoul–Honolulu             7,317          45
New Delhi–Shanghai         4,944          23
Melbourne–Sydney             713           8
Chiang Mai–Jakarta         2,906          13
Xian–Nagasaki             13,449          52
Port Moresby–Honiara       1,406           9
Singapore–Brisbane         6,153          33
Seoul–Honolulu             7,317          42
Melbourne–Sydney             713           7
Beijing–Bangkok            3,298          17
Papeete–Noumea             4,610          30
Canberra–Agana             5,436          41
Sydney–Melbourne             713           8
Tokyo–Beijing              2,099          10
Manila–Singapore           2,392          12
Sydney–Beijing             8,557          40
Phnom Penh–Bangkok           530           3
Kuala Lumpur–Singapore       317           4
Brisbane–Singapore         6,153          29
Wellington–Singapore      10,716          44
Jakarta–Kuala Lumpur       1,172           6
Avarua–Port Moresby        5,811          20
Singapore–Manila           2,392          10

a. Calculate the coefficient of correlation, r.
b. At the 0.05 level of significance, is there a significant linear relationship between the distance and the time taken to deliver?
c. What conclusions can you reach about the relationship between distance and time taken to deliver?
d. Predict the time taken for a travel distance of 9,250 km.

12.48 Under new workplace agreement structures brought in by the government, an owner of a small chain of clothing stores employing fewer than 100 employees is considering whether to keep on older staff. One element of the decision is an assessment of the staff claim that experienced staff sell more goods. A sample of 12 staff with varying lengths of employment is chosen by the store, and total weekly sales by each are collected for one week.

Length of service (years)     5    12    15     6     2     4     7    17     3    20     1     7
Sales ($'00)                560   634   876   350   532   789   256   822   189   834   366   212

a. Calculate the coefficient of correlation, r.
b. At the 0.05 level of significance, is there a significant linear relationship between the length of service and sales?
c. What conclusions can you reach about the relationship between length of service and sales?

12.8  ESTIMATION OF MEAN VALUES AND PREDICTION OF INDIVIDUAL VALUES

LEARNING OBJECTIVE 7
Estimate mean values and predict individual values

This section presents methods of making inferences about the mean of Y and the prediction of individual values of Y.

The Confidence Interval Estimate

In Example 12.2 on page 462 we used the prediction line to make predictions about the value of Y for a given X. The mean Human Development Index for a mean mathematical literacy score of 500 was predicted to be 89.39. This estimate, however, is a point estimate of the population average value. Chapter 8 presented the concept of the confidence interval as an estimate of the population mean. In a similar fashion, Equation 12.20 defines the confidence interval estimate for the mean response for a given X.

CONFIDENCE INTERVAL ESTIMATE FOR THE MEAN OF Y

\hat{Y}_i \pm t_{n-2} S_{YX} \sqrt{h_i}

\hat{Y}_i - t_{n-2} S_{YX} \sqrt{h_i} < \mu_{Y|X=X_i} < \hat{Y}_i + t_{n-2} S_{YX} \sqrt{h_i}    (12.20)



where
h_i = \frac{1}{n} + \frac{(X_i - \bar{X})^2}{SSX}
\hat{Y}_i = predicted value of Y; \hat{Y}_i = b_0 + b_1 X_i
S_{YX} = standard error of the estimate
n = sample size
X_i = given value of X
\mu_{Y|X=X_i} = mean value of Y when X = X_i
SSX = \sum_{i=1}^{n} (X_i - \bar{X})^2

The width of the confidence interval in Equation 12.20 depends on several factors. For a given level of confidence, increased variation around the prediction line, as measured by the standard error of the estimate, results in a wider interval. However, as you would expect, increased sample size reduces the width of the interval. In addition, the width of the interval also varies at different values of X. When you predict Y for values of X close to \bar{X}, the interval is narrower than for predictions for X values more distant from the mean.

In the Human Development Index example, suppose you want a 95% confidence interval estimate of the mean Human Development Index for the entire population when the mean mathematical literacy score of a country is 500 (X = 500). Using the simple linear regression equation:

\hat{Y}_i = 23.4251 + 0.1319X_i = 23.4251 + 0.1319(500) = 89.387306

Also, given the following:

\bar{X} = 457.3684    S_{YX} = 6.3010    SSX = \sum_{i=1}^{n} (X_i - \bar{X})^2 = 65{,}370.4211

From Table E.3, t_{17} = 2.1098. Thus:

\hat{Y}_i \pm t_{n-2} S_{YX} \sqrt{h_i}    where    h_i = \frac{1}{n} + \frac{(X_i - \bar{X})^2}{SSX}

so that:

\hat{Y}_i \pm t_{n-2} S_{YX} \sqrt{\frac{1}{n} + \frac{(X_i - \bar{X})^2}{SSX}} = 89.387306 \pm (2.1098)(6.3010)\sqrt{\frac{1}{19} + \frac{(500 - 457.3684)^2}{65{,}370.4211}}
= 89.387306 \pm 3.770305

so:

85.617 < \mu_{Y|X=500} < 93.15761




Therefore, the 95% confidence interval estimate is that the mean Human Development Index lies between 85.617 and 93.15761 for the country’s mean mathematical literacy score of 500.
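The same calculation can be reproduced outside Excel. The following Python sketch is only an illustration (the function and variable names are ours, not part of the text's Excel/PHStat workflow, and it assumes the scipy library is available); it computes the confidence interval for the mean of Y from the summary statistics used above.

    import math
    from scipy import stats

    def mean_response_ci(y_hat, s_yx, n, x_value, x_bar, ssx, confidence=0.95):
        """Confidence interval for the mean of Y at a given X (Equation 12.20)."""
        h = 1 / n + (x_value - x_bar) ** 2 / ssx              # h statistic
        t_crit = stats.t.ppf(1 - (1 - confidence) / 2, n - 2)  # t value with n - 2 df
        half_width = t_crit * s_yx * math.sqrt(h)
        return y_hat - half_width, y_hat + half_width

    # Human Development Index example, X = 500
    lower, upper = mean_response_ci(y_hat=89.387306, s_yx=6.3010, n=19,
                                    x_value=500, x_bar=457.3684, ssx=65370.4211)
    print(round(lower, 3), round(upper, 3))   # approximately 85.617 and 93.158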

The Prediction Interval

In addition to the need for a confidence interval estimate for the mean value, you often want to predict the response for an individual value. Although the form of the prediction interval is similar to the confidence interval estimate of Equation 12.20, the prediction interval is predicting an individual value, not estimating a parameter. Equation 12.21 defines the prediction interval for an individual response Y at a particular value X_i, denoted by Y_{X=X_i}.

prediction interval for an individual response Y  The interval for the prediction of a specific value of Y in regression, given a value of X.

PREDICTION INTERVAL FOR AN INDIVIDUAL RESPONSE Y

\hat{Y}_i \pm t_{n-2} S_{YX} \sqrt{1 + h_i}

\hat{Y}_i - t_{n-2} S_{YX} \sqrt{1 + h_i} < Y_{X=X_i} < \hat{Y}_i + t_{n-2} S_{YX} \sqrt{1 + h_i}    (12.21)

where h_i, \hat{Y}_i, S_{YX}, n and X_i are defined as in Equation 12.20, and Y_{X=X_i} is the future value of Y when X = X_i.

To construct a 95% prediction interval of the Human Development Index for a country's mean mathematical literacy score of 500 (X = 500), first calculate \hat{Y}_i. Using the prediction line:

\hat{Y}_i = 23.4251 + 0.1319X_i = 23.4251 + 0.1319(500) = 89.387306

Also, given the following:

\bar{X} = 457.3684    S_{YX} = 6.3010    SSX = \sum_{i=1}^{n} (X_i - \bar{X})^2 = 65{,}370.4211

From Table E.3, t_{17} = 2.1098. Thus:

\hat{Y}_i \pm t_{n-2} S_{YX} \sqrt{1 + h_i}    where    h_i = \frac{1}{n} + \frac{(X_i - \bar{X})^2}{SSX}

so that:

\hat{Y}_i \pm t_{n-2} S_{YX} \sqrt{1 + \frac{1}{n} + \frac{(X_i - \bar{X})^2}{SSX}} = 89.387306 \pm (2.1098)(6.3010)\sqrt{1 + \frac{1}{19} + \frac{(500 - 457.3684)^2}{65{,}370.4211}}
= 89.387306 \pm 13.81834279

so:

75.56896339 < Y_{X=500} < 103.205649



Therefore, with 95% confidence you predict that the Human Development Index for a country with a mean mathematical literacy score of 500 lies between 75.56896339 and 103.205649.

Figure 12.20 is a Microsoft Excel worksheet that illustrates the confidence interval estimate and the prediction interval for the Human Development Index problem. If you compare the results of the confidence interval estimate and the prediction interval, you see that the prediction interval for an individual Human Development Index is much wider than the confidence interval estimate for the mean Human Development Index. Remember, there is much more variation in predicting an individual value than in estimating a mean value.

Figure 12.20  PHStat confidence interval estimate and prediction interval for the Human Development Index regression

Data: X value = 500; Confidence level = 95%
Intermediate calculations: Sample size = 19; Degrees of freedom = 17; t value = 2.109816; X bar (sample mean of X) = 457.3684; Sum of squared differences from X bar = 65,370.42; Standard error of the estimate = 6.301042; h statistic = 0.080434; Predicted Y (Y hat) = 89.38731
For average Y: Interval half width = 3.770305; Confidence interval = 85.617 to 93.15761
For individual response Y: Interval half width = 13.81834; Prediction interval = 75.56896 to 103.2056
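The prediction interval in Figure 12.20 differs from the confidence interval only through the extra '1 +' term under the square root. The short Python sketch below extends the earlier illustrative function to show this; again, the names are ours and scipy is assumed, so treat it as a sketch rather than the text's prescribed method.

    import math
    from scipy import stats

    def prediction_interval(y_hat, s_yx, n, x_value, x_bar, ssx, confidence=0.95):
        """Prediction interval for an individual Y at a given X (Equation 12.21)."""
        h = 1 / n + (x_value - x_bar) ** 2 / ssx
        t_crit = stats.t.ppf(1 - (1 - confidence) / 2, n - 2)
        half_width = t_crit * s_yx * math.sqrt(1 + h)   # the '1 +' widens the interval
        return y_hat - half_width, y_hat + half_width

    lower, upper = prediction_interval(y_hat=89.387306, s_yx=6.3010, n=19,
                                       x_value=500, x_bar=457.3684, ssx=65370.4211)
    print(round(lower, 3), round(upper, 3))   # approximately 75.569 and 103.206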

Problems for Section 12.8

LEARNING THE BASICS

12.49 Based on a sample of n = 24, the least-squares method was used to develop the following prediction line: \hat{Y}_i = 5 + 3X_i. In addition, S_{YX} = 2.1, \bar{X} = 3 and \sum_{i=1}^{n} (X_i - \bar{X})^2 = 25.
a. Construct a 95% confidence interval estimate of the mean response for X = 5.
b. Construct a 95% prediction interval of an individual response for X = 5.

12.50 Based on a sample of n = 10, the least-squares method was used to develop the following prediction line: \hat{Y}_i = 5 + 12X_i. In addition, S_{YX} = 1.0, \bar{X} = 2 and \sum_{i=1}^{n} (X_i - \bar{X})^2 = 20.
a. Construct a 95% confidence interval estimate of the population mean response for X = 4.
b. Construct a 95% prediction interval of an individual response for X = 4.
c. Compare the results of (a) and (b) with those of problem 12.49 (a) and (b). Which interval is wider? Why?





APPLYING THE CONCEPTS

Problems 12.51 to 12.56 can be solved manually or by using Microsoft Excel.

12.51 In problem 12.5 on page 466, you used plate gap to predict tear ratings of bags at Starbucks.
a. Construct a 95% confidence interval estimate of the mean tear rating when the plate gap is 1.80.
b. Construct a 95% prediction interval of the tear rating when the plate gap is 1.80.
c. Explain the difference in the results in (a) and (b).

12.52 In problem 12.4 on page 465, you used energy generation to predict CO2 emissions. < CO2 >
a. Construct a 95% confidence interval estimate of the mean CO2 emissions for a country generating 1,000 terawatt hours of energy.
b. Construct a 95% prediction interval of the CO2 emissions for a country generating 1,000 terawatt hours of energy.
c. Explain the difference in the results in (a) and (b).

12.53 In problem 12.7 on page 466, a statistics professor used tutorial class size to predict average class marks.
a. Construct a 95% confidence interval estimate of the average class mark for a tutorial class of 20 students.
b. Construct a 95% prediction interval of the average class mark for a tutorial class of 20 students.
c. Explain the difference in the results in (a) and (b).

12.54 In problem 12.6 on page 466, a professor wanted to estimate the spending of his students based upon their income.
a. Construct a 95% confidence interval estimate of the mean spending for a weekly income of $150.
b. Construct a 95% prediction interval for spending on a weekly income of $150.
c. Explain the difference in the results in (a) and (b).

12.55 In problem 12.8 on page 467, you predicted wine quality based on alcohol content.
a. Construct a 95% confidence interval estimate of the mean value of all average shares based on a retail sales index of 40.
b. Construct a 95% prediction interval of the value of an individual group of shares based on a retail sales index of 40.
c. Explain the difference in the results in (a) and (b).

12.56 In problem 12.9 on page 467, a concert promoter used ticket price to predict merchandise sales.
a. Construct a 95% confidence interval estimate of the population mean merchandise sales for a ticket price of $80.
b. Construct a 95% prediction interval of merchandise sales for a ticket price of $80.
c. Explain the difference in the results in (a) and (b).

12.57 In problem 12.4 on page 465, you used energy generation to predict CO2 emissions. Re-estimate this model using energy as the dependent variable. Interpret the results.

12.9  PITFALLS IN REGRESSION AND ETHICAL ISSUES

LEARNING OBJECTIVE 8
Comprehend the pitfalls in regression

Some of the pitfalls involved in using regression analysis are as follows:
• Lacking an awareness of the assumptions of least-squares regression.
• Not knowing how to evaluate the assumptions of least-squares regression.
• Not knowing what the alternatives to least-squares regression are if a particular assumption is violated.
• Using a regression model without knowledge of the subject matter.
• Extrapolating outside the relevant range.
• Concluding that a significant relationship identified in an observational study is due to a cause-and-effect relationship.

The widespread availability of spreadsheet and statistical software has removed the computational hurdle that prevented many users from applying regression analysis. However, for many users this enhanced availability of software has not been accompanied by an understanding of how to use regression analysis properly. A user who is not familiar with either the assumptions of regression or how to evaluate those assumptions cannot be expected to know what the alternatives to least-squares regression are if a particular assumption is violated. The data < ANSCOMBE > in Table 12.7 illustrate the necessity of using scatter plots and residual analysis to go beyond the basic number crunching of calculating the Y intercept, the slope and r².




Table 12.7  Four sets of artificial data ('Anscombe's quartet')
Source: F. J. Anscombe, 'Graphs in statistical analysis', American Statistician, 27 (1973): 17–21. © American Statistical Association, reprinted by permission of Taylor & Francis Ltd, on behalf of the American Statistical Association.

Data set A        Data set B        Data set C        Data set D
Xi      Yi        Xi      Yi        Xi      Yi        Xi      Yi
10    8.04        10    9.14        10    7.46         8    6.58
14    9.96        14    8.10        14    8.84         8    5.76
 5    5.68         5    4.74         5    5.73         8    7.71
 8    6.95         8    8.14         8    6.77         8    8.84
 9    8.81         9    8.77         9    7.11         8    8.47
12   10.84        12    9.13        12    8.15         8    7.04
 4    4.26         4    3.10         4    5.39         8    5.25
 7    4.82         7    7.26         7    6.42        19   12.50
11    8.33        11    9.26        11    7.81         8    5.56
13    7.58        13    8.74        13   12.74         8    7.91
 6    7.24         6    6.13         6    6.08         8    6.89

Anscombe (reference 2) showed that all four data sets given in Table 12.7 have the following identical results:

\hat{Y}_i = 3.0 + 0.5X_i
S_{YX} = 1.237
S_{b_1} = 0.118
r^2 = 0.667
SSR = explained variation = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 = 27.51
SSE = unexplained variation = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = 13.76
SST = total variation = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 = 41.27

Thus, with respect to these statistics associated with a simple linear regression, the four data sets are identical. If you stopped the analysis at this point, you would lose valuable information in the data.

By examining Figure 12.21, which presents scatter diagrams for the four data sets, and Figure 12.22, which presents residual plots for the four data sets, you can clearly see that each of the four data sets has a different relationship between X and Y. The only data set that seems to follow an approximate straight line is data set A. The residual plot for data set A does not show any obvious patterns or outlying residuals. This is certainly not true for data sets B, C and D. The scatter diagram for data set B shows that a quadratic regression model is more appropriate; this conclusion is reinforced by the clear parabolic form of the residual plot for B. The scatter diagram and the residual plot for data set C clearly show an outlying observation. In this case, you may want to remove the outlier and re-estimate the regression model; re-estimating the model will uncover a very different relationship. Similarly, the scatter diagram for data set D represents the situation in which the model is heavily dependent on the outcome of a single response (X8 = 19 and Y8 = 12.50). You would have to evaluate any such regression model cautiously, since its regression coefficients are heavily dependent on a single observation.
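The identical summary statistics can be verified directly. The following Python sketch (a minimal illustration, not part of the text's Excel workflow; numpy is assumed) fits a least-squares line to each of the four data sets in Table 12.7 and prints the intercept, slope and r²; each should give approximately b0 = 3.0, b1 = 0.5 and r² = 0.667.

    import numpy as np

    x_abc = [10, 14, 5, 8, 9, 12, 4, 7, 11, 13, 6]      # X values for data sets A, B and C
    x_d   = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]           # X values for data set D
    data_sets = {
        'A': (x_abc, [8.04, 9.96, 5.68, 6.95, 8.81, 10.84, 4.26, 4.82, 8.33, 7.58, 7.24]),
        'B': (x_abc, [9.14, 8.10, 4.74, 8.14, 8.77, 9.13, 3.10, 7.26, 9.26, 8.74, 6.13]),
        'C': (x_abc, [7.46, 8.84, 5.73, 6.77, 7.11, 8.15, 5.39, 6.42, 7.81, 12.74, 6.08]),
        'D': (x_d,   [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
    }

    for name, (x, y) in data_sets.items():
        b1, b0 = np.polyfit(x, y, 1)              # least-squares slope and intercept
        r2 = np.corrcoef(x, y)[0, 1] ** 2          # coefficient of determination
        print(f"Data set {name}: b0 = {b0:.2f}, b1 = {b1:.2f}, r2 = {r2:.3f}")

Only the scatter diagrams and residual plots reveal how different the four relationships really are, which is exactly the point of the quartet.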





Figure 12.21  Scatter diagrams for four data sets (Panels A–D plot Y against X for data sets A–D)

Figure 12.22  Residual plots for four data sets (Panels A–D plot the residuals against X for data sets A–D)



In summary, scatter diagrams and residual plots are of vital importance to a complete regression analysis. The information they provide is so basic to a credible analysis that you should always include these graphical methods as part of a regression analysis. Thus, a strategy that you can use to help avoid the pitfalls of regression is as follows:
• Start with a scatter plot to observe the possible relationship between X and Y.
• Check the assumptions of regression before moving on to use the results of the model.
• Plot the residuals against the independent variable to determine whether the linear model is appropriate and to check the equal variance assumption.
• Use a histogram, stem-and-leaf display, box-and-whisker plot or normal probability plot of the residuals to check the normality assumption.
• If you collected the data over time, plot the residuals against time or use the Durbin–Watson test to check the independence assumption.
• If there are violations of the assumptions, use alternative methods to least-squares regression or alternative least-squares models.
• If there are no violations of the assumptions, carry out tests for the significance of the regression coefficients and develop confidence and prediction intervals.
• Avoid making predictions and forecasts outside the relevant range of the independent variable.
• Always note that the relationships identified in observational studies may or may not be due to a cause-and-effect relationship. Remember that, while causation implies correlation, correlation does not imply causation.
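Several of these checks can be automated. The sketch below is an illustrative outline only (the variable names and the use of a Shapiro–Wilk test are our assumptions, not the text's prescribed procedure); it computes the residuals from a fitted line, the Durbin–Watson statistic of Equation 12.15 (meaningful only when the observations are in time order), and a rough normality check.

    import numpy as np
    from scipy import stats

    def residual_diagnostics(x, y):
        """Fit a least-squares line, then report basic residual diagnostics."""
        x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
        b1, b0 = np.polyfit(x, y, 1)
        residuals = y - (b0 + b1 * x)                            # e_i = Y_i - Y_hat_i
        durbin_watson = np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)
        shapiro_p = stats.shapiro(residuals).pvalue              # rough normality check
        return residuals, durbin_watson, shapiro_p

    # Example using data set A from Table 12.7
    x = [10, 14, 5, 8, 9, 12, 4, 7, 11, 13, 6]
    y = [8.04, 9.96, 5.68, 6.95, 8.81, 10.84, 4.26, 4.82, 8.33, 7.58, 7.24]
    res, dw, p_norm = residual_diagnostics(x, y)
    print(f"Durbin-Watson = {dw:.2f}, Shapiro-Wilk p-value = {p_norm:.3f}")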

Assess your progress

Summary

In this chapter we examined the simple linear regression model between one independent (explanatory) variable and one dependent (response) variable. First, we looked at the concept of the linear regression model, then the linear regression equation, and how this equation can be used to predict ahead, assuming known future values of the independent variable. We examined ways of assessing the strength of a particular model and how accurately it can predict, using the example of the Human Development Index and mean mathematical literacy scores for a sample of 19 developed and developing economies. Then we turned to the components of the linear model, examining the assumptions of linear regression and also the residuals, and how these both evaluate the model and can lead to further analysis. The Durbin–Watson statistic is used to measure autocorrelation and test for independence of the errors. This was followed by a test of the slope of the regression line, where we used hypothesis testing (Chapter 9) to test whether there was a statistically significant relationship between the independent and dependent variables. We also assessed the slope by calculating the confidence interval around the sample slope, to determine whether the hypothesised value of the population slope is included in the interval. We then extended the analysis of confidence intervals to calculate intervals for the regression mean and intervals around a predicted value. Finally, we examined some of the dangers in using this methodology, and how violating the assumptions could raise ethical issues.





Key formulas

Simple linear regression model
Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i    (12.1)

Simple linear regression equation: the prediction line
\hat{Y}_i = b_0 + b_1 X_i    (12.2)

Formula for calculating the slope b1
b_1 = \frac{SSXY}{SSX}    (12.3)

Formula for calculating the Y intercept b0
b_0 = \bar{Y} - b_1 \bar{X}    (12.4)

Measures of variation in regression
SST = SSR + SSE    (12.5)

Total sum of squares (SST)
SST = total sum of squares = \sum_{i=1}^{n} (Y_i - \bar{Y})^2    (12.6)

Regression sum of squares (SSR)
SSR = explained variation or regression sum of squares = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2    (12.7)

Error sum of squares (SSE)
SSE = unexplained variation or error sum of squares = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2    (12.8)

Coefficient of determination
r^2 = \frac{SSR}{SST} = \frac{\text{regression sum of squares}}{\text{total sum of squares}}    (12.9)

Formula for calculating SST
SST = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} Y_i^2 - \frac{\left(\sum_{i=1}^{n} Y_i\right)^2}{n}    (12.10)

Formula for calculating SSR
SSR = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 = b_0 \sum_{i=1}^{n} Y_i + b_1 \sum_{i=1}^{n} X_i Y_i - \frac{\left(\sum_{i=1}^{n} Y_i\right)^2}{n}    (12.11)

Formula for calculating SSE
SSE = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} Y_i^2 - b_0 \sum_{i=1}^{n} Y_i - b_1 \sum_{i=1}^{n} X_i Y_i    (12.12)

Standard error of the estimate
S_{YX} = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{n-2}}    (12.13)

The residual
e_i = Y_i - \hat{Y}_i    (12.14)

Durbin–Watson statistic
D = \frac{\sum_{i=2}^{n} (e_i - e_{i-1})^2}{\sum_{i=1}^{n} e_i^2}    (12.15)

Testing a hypothesis for a population slope, β1, using the t test
t = \frac{b_1 - \beta_1}{S_{b_1}}    (12.16)

Testing a hypothesis for a population slope, β1, using the F test
F = \frac{MSR}{MSE}    (12.17)

Confidence interval estimate of the slope, β1
b_1 \pm t_{n-2} S_{b_1}    (12.18)

Testing for the existence of correlation
t = \frac{r - \rho}{\sqrt{\dfrac{1 - r^2}{n - 2}}}    (12.19)

Confidence interval estimate for the mean of Y
\hat{Y}_i \pm t_{n-2} S_{YX} \sqrt{h_i}
\hat{Y}_i - t_{n-2} S_{YX} \sqrt{h_i} < \mu_{Y|X=X_i} < \hat{Y}_i + t_{n-2} S_{YX} \sqrt{h_i}    (12.20)

Prediction interval for an individual response Y
\hat{Y}_i \pm t_{n-2} S_{YX} \sqrt{1 + h_i}
\hat{Y}_i - t_{n-2} S_{YX} \sqrt{1 + h_i} < Y_{X=X_i} < \hat{Y}_i + t_{n-2} S_{YX} \sqrt{1 + h_i}    (12.21)



Key terms
assumptions of regression 473
autocorrelation 477
coefficient of determination, r² 469
confidence interval estimate 485
correlation coefficient 486
dependent variable 456
Durbin–Watson statistic 479
equal variance (homoscedasticity) 473
error sum of squares (SSE) 467
explained variation 467
explanatory variable 457
F test for the slope 484
independence of errors 473
independent variable 456
least-squares method 460
linearity 473
normality 473
prediction interval for an individual response Y 491
prediction line 459
regression analysis 456
regression coefficients 459
regression sum of squares (SSR) 467
relevant range 463
residual 474
residual analysis 473
response variable 457
scatter diagram 456
simple linear regression 456
standard error of the estimate 471
total sum of squares (SST) 467
total variation 467
t test for the correlation coefficient 486
t test for the slope 482
unexplained variation 467
Y intercept 457

References
1. Neter, J., M. H. Kutner, C. J. Nachtsheim & W. Wasserman, Applied Linear Statistical Models, 4th edn (Homewood, IL: Irwin, 1996).
2. Anscombe, F. J., 'Graphs in statistical analysis', The American Statistician, 27 (1973): 17–21. © American Statistical Association, reprinted by permission of Taylor & Francis Ltd, on behalf of the American Statistical Association.

Chapter review problems

CHECKING YOUR UNDERSTANDING

12.58 What is the interpretation of the Y intercept and the slope in the simple linear regression equation?
12.59 What is the relationship between the correlation coefficient and the coefficient of determination?
12.60 How is the ANOVA table used in regression analysis?
12.61 What is the difference between a t test and an F test for the population slope?
12.62 Why should you always carry out a residual analysis as part of a regression model?
12.63 Explain the difference between interpolation and extrapolation.
12.64 How do you evaluate the assumptions of regression analysis?
12.65 When is the unexplained variation (i.e. the error sum of squares) equal to 0?
12.66 What is the difference between a confidence interval estimate of the mean response, μY|X=Xi, and a prediction interval of YX=Xi?

APPLYING THE CONCEPTS

Problems 12.67 to 12.75 can be solved manually or by using Microsoft Excel.

12.67 Can you use Twitter activity to forecast box office receipts on the opening weekend? The following data, stored in < TWITTER_MOVIES >, indicate the Twitter activity ('want to see') and the receipts ($) per theatre on the weekend a movie opened, for seven movies.

Movie                     Twitter activity   Receipts ($)
The Devil Inside                   219,509         14,763
The Dictator                         6,405          5,796
Paranormal Activity 3              165,128         15,829
The Hunger Games                   579,288         36,871
Bridesmaids                          6,564          8,995
Red Tails                           11,104          7,477
Act of Valor                         9,152          8,054

Source: R. Dodes, 'Twitter goes to the movies', Wall Street Journal, 3 August 2012, D1–D12

a. Use the least-squares method to calculate the regression coefficients b0 and b1.
b. Interpret the meaning of b0 and b1 in this problem.
c. Predict the mean receipts for a movie that has a Twitter activity of 100,000.
d. Should you use the model to predict the receipts for a movie that has a Twitter activity of 1,000,000? Why or why not?
e. Calculate the coefficient of determination, r², and explain its meaning in this problem.
f. Perform a residual analysis. Is there any evidence of a pattern in the residuals? Explain.
g. At the 0.05 level of significance, is there evidence of a significant linear relationship between Twitter activity and receipts?
h. Construct a 95% confidence interval estimate of the mean receipts for a movie that has a Twitter activity of 100,000

and a 95% prediction interval of the receipts for a single movie that has a Twitter activity of 100,000.
i. Based on the results of (a) to (h), do you think that Twitter activity is a useful predictor of receipts on the first weekend a movie opens? What issues about these data might make you hesitant to use Twitter activity to predict receipts?

12.68 An accountant for a large department store has the business objective of developing a model to predict the amount of time it takes to process invoices. Data are collected from the past 32 working days, and the number of invoices processed and completion time (in hours) are stored in < INVOICE >.
a. Assuming a linear relationship, use the least-squares method to calculate the regression coefficients b0 and b1.
b. Interpret the meaning of the Y intercept, b0, and the slope, b1, in this problem.
c. Use the prediction line developed in (a) to predict the mean amount of time it would take to process 150 invoices.
d. Determine the coefficient of determination, r², and interpret its meaning.
e. Plot the residuals against the number of invoices processed and also against time.
f. Based on the plots in (e), does the model seem appropriate?
g. Based on the results in (e) and (f), what conclusions can you reach about the validity of the prediction made in (c)?
h. What conclusions can you reach about the relationship between the number of invoices and the completion time?

12.69 The Longhaul Transport Company wishes to predict the price of petrol based on international crude oil prices. It records average monthly petrol prices and crude oil prices over a 12-month period. < CRUDE >

Month        Petrol price ($/litre)   Crude oil price ($/barrel)
January                      1.25                           72
February                     1.32                           75
March                        1.15                           69
April                        1.35                           82
May                          1.22                           78
June                         1.26                           80
July                         1.43                           91
August                       1.29                           84
September                    1.28                           82
October                      1.17                           79
November                     1.52                           98
December                     1.43                           91

a. Plot a scatter diagram and, assuming a linear relationship, use the least-squares method to calculate the regression coefficients, b0 and b1.
b. Interpret the meaning of the Y intercept, b0, and the slope, b1, in this problem.
c. Use the prediction line developed in (a) to predict the petrol price if crude oil is $75/barrel.
d. Calculate the coefficient of determination, r², and interpret its meaning in this problem.
e. Perform a residual analysis and determine the adequacy of the fit of the model.
f. At the 0.05 level of significance, is there evidence of a significant linear relationship between petrol and crude oil prices?
g. Construct a 95% confidence interval estimate for the mean petrol price if crude oil is $75/barrel.
h. Construct a 95% prediction interval for a particular petrol price if crude oil is $75/barrel.
i. Construct a 95% confidence interval estimate of the population slope.

12.70 Can you use the annual revenues generated by National Basketball Association (NBA) franchises to predict franchise values? (Franchise values and revenues are stored in < NBA_VALUES >.)
a. Assuming a linear relationship, use the least-squares method to calculate the regression coefficients b0 and b1.
b. Interpret the meaning of the Y intercept, b0, and the slope, b1, in this problem.
c. Predict the mean value of an NBA franchise that generates $150 million of annual revenue.
d. Calculate the coefficient of determination, r², and interpret its meaning.
e. Perform a residual analysis and evaluate the regression assumptions.
f. At the 0.05 level of significance, is there evidence of a significant linear relationship between the annual revenues generated and the value of an NBA franchise?
g. Construct a 95% confidence interval estimate of the mean value of all NBA franchises that generate $150 million of annual revenue.
h. Construct a 95% prediction interval of the value of an individual NBA franchise that generates $150 million of annual revenue.

12.71 In problem 12.70 you used annual revenue to develop a model to predict the franchise value of NBA teams. Can you also use the annual revenues generated by European soccer teams to predict franchise values? (European soccer team values and revenues are stored in < SOCCER_VALUES_2013 >.) Repeat problem 12.70 (a)–(h) for the European soccer teams.

12.72 A health scientist is looking at how obesity rates in different regions may be affected by the number of fast-food outlets. She collects data from 10 regions. < OBESITY >



Region            Obesity rate   Fast-food outlets
Illawarra                   45                 456
Hunter                      36                 345
Western Sydney              23                 256
St George                   44                 435
Richmond                    30                 289
Murray                      40                 376
Gippsland                   44                 425
Moreton                     51                 498
Darling Downs               60                 562
Fremantle                   51                 451

a. Assuming a linear relationship, use the least-squares method to calculate the regression coefficients, b0 and b1.
b. Interpret the meaning of the slope, b1, in this problem.
c. Use the prediction line developed in (a) to predict the obesity rate for a region with 350 fast-food outlets.
d. Calculate the coefficient of determination, r², and interpret its meaning.
e. Perform a residual analysis and determine the adequacy of the fit of the model.
f. At the 0.05 level of significance, is there evidence of a significant linear relationship between obesity rates and the number of fast-food outlets?
g. Construct a 95% confidence interval estimate of the obesity rate for a region with 350 fast-food outlets.
h. Construct a 95% prediction interval of the obesity rate for a region with 350 fast-food outlets.
i. Construct a 95% confidence interval estimate of the slope.
j. What other independent variables might you consider for inclusion in the model?

12.73 Port Kembla Golf Club wishes to predict the number of golfers per weekend based upon weather conditions. The number of golfers and the temperature are measured over a sample of weekends covering all seasons. < GOLF >

Number of golfers    68  63  83  115  75  97  117  82  121  80  87  102  64  97  128
Temperature (°C)     17  13  18   25  27  22   21  27   23  14  34   27  12  17   22

a. Assuming a linear relationship, use the least-squares method to calculate the regression coefficients, b0 and b1.
b. Interpret the meaning of the slope, b1, in this problem.
c. Predict the mean number of golfers if the temperature is 25 degrees Celsius.
d. Calculate the coefficient of determination, r², and interpret its meaning.
e. Perform a residual analysis and determine the adequacy of the fit of the model.

12.74 The file < CEO_COMPENSATION > includes the total compensation (in millions of dollars) for CEOs of 170 large public companies and the investment return in 2012 (data obtained from 'CEO pay rockets as economy, stocks recover', USA Today, 27 March 2013, 1B).
a. Calculate the correlation coefficient between compensation and the investment return in 2012.
b. At the 0.05 level of significance, is the correlation between compensation and the investment return in 2012 statistically significant?
c. Write a short summary of your findings in (a) and (b). Do the results surprise you?

12.75 An entertainment promoter is trying to estimate the concert attendance of an upcoming touring DJ. She believes that the expected number of concert attendees will be related to the number of times the artist's songs have been downloaded since their last tour.

Concert attendees    12,564  17,213  18,622   7,709   6,950   1,690  10,078  11,867  11,053   3,569   7,785
Song downloads       25,651  32,589  36,500  12,568  10,236   2,136  14,329  15,243  13,489   4,681  13,589

a. Assuming a linear relationship, use the least-squares method to calculate the regression coefficients, b0 and b1.
b. Interpret the meaning of the slope, b1, in this problem.
c. Use the prediction line developed in (a) to predict the number of concert attendees for an artist with 12,500 song downloads.
d. Calculate the coefficient of determination, r², and interpret its meaning.

12.76 A sanitation engineer would like to predict the number of 'portaloo' lavatories required for outdoor entertainment events based upon the volume of food to be consumed at the event.





Number of lavatories            17   121    93    17    66    48    119    108    67    81    104    28    65
Volume of food consumed (kg)   125   875   723   154   469   350  1,156  1,000   500   683    723   256   523

a. Assuming a linear relationship, use the least-squares method to calculate the regression coefficients, b0 and b1.
b. Interpret the meaning of the slope, b1, in this problem.
c. Use the prediction line developed in (a) to predict the number of lavatories if 500 kg of food is consumed.
d. Calculate the coefficient of determination, r², and interpret its meaning.

12.77 Using the problem outlined in 12.76, produce the scatter diagram. Is there anything that can be interpreted from the diagram that is not evident from the analysis in 12.76?

Continuing cases

Tasman University
The Student News Service at Tasman University (TU) wishes to use the data it has collected to see if a student's WAM helps to predict their expected salary.
a Starting with the BBus students, assess the significance and importance of WAM as a predictor of expected salary (data stored in < TASMAN_UNIVERSITY_BBUS_STUDENT_SURVEY >).
b Predict the mean expected salary for all students with a WAM of 65. Give a point prediction as well as a 95% confidence interval. Do you have any concerns about using the regression model to predict mean expected salary for a WAM of 65?
c Does undergraduate WAM help explain a student's MBA WAM? Use the data stored in < TASMAN_UNIVERSITY_MBA_STUDENT_SURVEY > to estimate and interpret a predictive model.

As Safe as Houses
To analyse the real estate market in non-capital cities and towns in states A and B, Safe-As-Houses Real Estate, a large national real estate company, has collected samples of recent residential sales from a sample of non-capital cities and towns in these states. The data are stored in < REAL_ESTATE >:
a Use the internal area to predict house and unit prices.
b What would be the expected mean price of a house that has an internal area of 100 square metres? Give a point prediction as well as a 95% confidence interval.
c Repeat this exercise for units.
d Prepare a brief report to summarise your findings.



Chapter 12 Excel Guide

EG12.1  TYPES OF REGRESSION MODELS
There are no Excel Guide instructions for this section.

EG12.2 DETERMINING THE SIMPLE LINEAR REGRESSION EQUATION

Key technique  Use the LINEST(cell range of Y variable, cell range of X variable, True, True) array function to calculate the b1 and b0 coefficients, the b1 and b0 standard errors, r² and the standard error of the estimate, the F test statistic and error df, and SSR and SSE.
Example  Perform the Figure 12.4 analysis of the Human Development Index data on page 460.
PHStat  Use Simple Linear Regression. For the example, open the HDI file. Select PHStat ➔ Regression ➔ Simple Linear Regression. In the procedure's dialog box (shown in Figure EG12.1):
1. Enter B1:B20 as the Y Variable Cell Range.
2. Enter C1:C20 as the X Variable Cell Range.
3. Check First cells in both ranges contain label.
4. Enter 95 as the Confidence level for regression coefficients.
5. Check Regression Statistics Table and ANOVA and Coefficients Table.
6. Enter a Title and click OK.


The procedure creates a worksheet that contains a copy of your data as well as the worksheet shown in Figure 12.4. To create a scatter plot that contains a prediction line and regression equation similar to Figure 12.5 on page 461, modify step 6 by checking Scatter Plot before clicking OK.

Analysis ToolPak  Use Regression. For the example, open the HDI file and:
1. Select Data ➔ Data Analysis.
2. In the Data Analysis dialog box, select Regression from the Analysis Tools list and then click OK.
In the Regression dialog box (shown in Figure EG12.2):
3. Enter B1:B20 as the Input Y Range and enter C1:C20 as the Input X Range.
4. Check Labels and check Confidence Level and enter 95 in its box.
5. Click New Worksheet Ply and then click OK.


Figure EG12.2 Regression dialog box

Figure EG12.1  Simple Linear Regression dialog box

EG12.3  MEASURES OF VARIATION

The measures of variation are calculated as part of creating the simple linear regression worksheet using the Section EG12.2 instructions. If you use the Section EG12.2 PHStat instructions, formulas used to calculate these measures are in the COMPUTE worksheet that is created. Formulas in cells B5, B7, B13, C12, C13, D12 and E12 copy values calculated by the array formula in cell range L2:M6. In cell F12, the F.DIST.RT(F test statistic, regression degrees of freedom, error degrees of freedom) function calculates the p-value for the F test for the slope, discussed in Section 12.7. (The similar FDIST function is used in the COMPUTE worksheet of the Simple Linear Regression 2007 workbook.)

EG12.4  ASSUMPTIONS OF REGRESSION
There are no Excel Guide instructions for this section.

EG12.5  RESIDUAL ANALYSIS

Key technique  Use arithmetic formulas to calculate the residuals. To evaluate assumptions, use the Section EG2.5 scatter plot instructions for constructing residual plots and the Section EG6.3 instructions for constructing normal probability plots.
Example  Perform the residual analysis for Figures 12.10 and 12.11 of the Human Development Index data.
PHStat  Use the Section EG12.2 PHStat instructions. Modify step 5 by checking Residuals Table and Residual Plot in addition to checking Regression Statistics Table and ANOVA and Coefficients Table. To construct a normal probability plot, follow the Section EG6.3 PHStat instructions using the cell range of the residuals as the Variable Cell Range in step 1.
Analysis ToolPak  Use the Section EG12.2 Analysis ToolPak instructions. Modify step 5 by checking Residuals and Residual Plots before clicking New Worksheet Ply and then OK.

EG12.6 MEASURING AUTOCORRELATION: THE DURBIN–WATSON STATISTIC

Key technique  Use the SUMXMY2(cell range of the second through last residual, cell range of the first through second-to-last residual) function to calculate the sum of squared differences of the residuals (the numerator in Equation 12.15 on page 479), and use the SUMSQ(cell range of the residuals) function to calculate the sum of squared residuals (the denominator in Equation 12.15).

Example  Compute the Durbin–Watson statistic for the discount electrical store data in Table 12.4 on page 478. PHStat  Use the PHStat instructions at the beginning of Section EG12.2. Modify step 6 by checking the Durbin-Watson Statistic output option before clicking OK.

EG12.7  INFERENCES ABOUT THE SLOPE AND CORRELATION COEFFICIENT
The t test for the slope and F test for the slope are included in the worksheet created by using the Section EG12.2 instructions. The t test computations in the worksheets created by using the PHStat instructions are discussed in Section EG12.2. The F test computations are discussed in Section EG12.3.

EG12.8 ESTIMATION OF MEAN VALUES AND PREDICTION OF INDIVIDUAL VALUES

Key technique  Use the TREND(Y variable cell range, X variable cell range, X value) function to calculate the predicted Y value for the X value, and use the DEVSQ(X variable cell range) function to compute the SSX value.
Example  Calculate the confidence interval estimate and prediction interval for the HDI data shown in Section 12.8.
PHStat  Use the Section EG12.2 PHStat instructions but replace step 6 with these steps 6 and 7:
6. Check Confidence Int. Est. & Prediction Int. for X= and enter 500 in its box. Enter 95 as the percentage for Confidence level for intervals.
7. Enter a Title and click OK.

Microsoft® product screen shots are reprinted with permission from Microsoft Corporation.


CHAPTER 13

Introduction to multiple regression

AUSTRALIA'S CAR INDUSTRY ON THE SKIDS

A country prides itself on its ability to produce and export motor vehicles. However, imported cars have become increasingly popular with Australian consumers over past decades. This has ultimately resulted in Ford, Holden and Toyota recently discontinuing production in Australia. So, what factors have influenced this outcome? In a Financial Review article, Ford Australia chief Bob Graziano cited high labour costs as a major reason for Ford's inability to be competitive on the global stage. In return, the prime minister and treasurer blamed high exchange rates. In this chapter we will attempt to estimate Australian car exports using exchange rates and the wage price index as explanatory variables. (P. Coorey, M. Skulley, M. Whitbourn & L. Keen, 'Ford to pull out of Australian manufacturing', Financial Review, 23 May 2013.) © Charles Robertson/Alamy





LEARNING OBJECTIVES

After studying this chapter you should be able to:
1 construct a multiple regression model and analyse model output
2 determine which independent variables to include in the regression model, and decide which are more important in predicting a dependent variable
3 incorporate categorical and interactive variables into a regression model
4 detect collinearity using the variance inflationary factor (VIF)

Chapter 12 focused on simple linear regression models that use one numerical independent, or explanatory, variable, X, to predict the value of a numerical dependent, or response, variable, Y. Often you can make better predictions by using more than one independent variable. This chapter introduces you to multiple regression models which use two or more independent variables to predict the value of a dependent variable.

13.1  DEVELOPING THE MULTIPLE REGRESSION MODEL

LEARNING OBJECTIVE 1
Construct a multiple regression model and analyse model output

multiple regression model  Refers to regression models that use more than one independent variable to explain the variation in the dependent variable.

In order to demonstrate an application of multiple regression in this chapter we will estimate the relationship between car exports from Australia and the exchange rate and wage price index (see Table 13.1). < CAR_EXPORT > The dependent variable, Y, is the export of motor vehicles from Australia (in millions of dollars). The two independent variables chosen are the Australia–US exchange rate in cents (X1) and the wage price index (X2). Quarterly data for September 2013 to March 2017 have been collected from the Reserve Bank of Australia and the Australian Bureau of Statistics, comprising 15 observations.

Table 13.1  Australian car exports, exchange rates and wage price index
Sources: Australian Bureau of Statistics, International Trade in Goods and Services, Australia, Cat. No. 5368.0, Table 12b; Reserve Bank of Australia; Australian Bureau of Statistics, Wage Price Index, Australia, Cat. No. 6345.0, Table 1.

Year to     Exports (xport, $m)   Exchange rate (xrate, %)   Wage price index (wpi, %)
Sep 2013                    578                      90.98                       116.4
Dec 2013                    585                      91.75                       117.0
Mar 2014                    377                      89.77                       117.7
Jun 2014                    390                      93.42                       118.2
Sep 2014                    451                      91.42                       119.3
Dec 2014                    560                      84.99                       119.9
Mar 2015                    483                      77.36                       120.3
Jun 2015                    497                      77.75                       120.8
Sep 2015                    568                      71.51                       121.8
Dec 2015                    563                      71.98                       122.3
Mar 2016                    464                      72.99                       122.7
Jun 2016                    363                      74.41                       123.1
Sep 2016                    517                      75.55                       124.1
Dec 2016                    481                      74.41                       124.5
Mar 2017                    280                      76.33                       124.9

Interpreting the Regression Coefficients

When there are several independent variables, we can extend the simple linear regression model of Equation 12.1 by assuming a linear relationship between each independent variable and the dependent variable. For example, with k independent variables, the multiple regression model is expressed in Equation 13.1.

MULTIPLE REGRESSION MODEL WITH k INDEPENDENT VARIABLES

Yi = β0 + β1X1i + β2 X2i + β3 X3i + … + βk Xki + εi (13.1)

where  β0 = Y intercept β1 = slope of Y with variable X1, holding variables X2, X3, …, Xk constant β2 = slope of Y with variable X2, holding variables X1, X3, …, Xk constant β3 = slope of Y with variable X3, holding variables X1, X2, X4, …, Xk constant · · · βk = slope of Y with variable Xk, holding variables X1, X2, X3, …, Xk−1 constant εi = random error in Y for observation i

Equation 13.2 defines the multiple regression model with two independent variables.

MULTIPLE REGRESSION MODEL WITH TWO INDEPENDENT VARIABLES

Yi = β0 + β1X1i + β2 X2i + εi (13.2)

where  β0 = Y intercept β1 = slope of Y with variable X1 holding variable X2 constant β2 = slope of Y with variable X2 holding variable X1 constant εi = random error in Y for observation i

Compare the multiple regression model with the simple linear regression model (Equation 12.1): Yi = β0 + β1X1i + εi

net regression coefficient The population slope coefficient representing the change in the mean of Y per unit change in X, taking into account the effect of other independent X variables in a multiple regression.

In the simple regression model, the slope, β1, represents the change in the mean of Y per unit change in X, and does not take into account any other variables besides the single independent variable included in the model. In the multiple regression model with two independent variables (Equation 13.2), the slope, β1, represents the change in the mean of Y per unit change in X1, taking into account the effect of X2. β1 is called a net regression coefficient. (Some statisticians refer to net regression coefficients as partial regression coefficients.) As in the case of simple linear regression, you use the sample regression coefficients (b0, b1 and b2) as estimates of the population parameters (β0, β1 and β2). Equation 13.3 defines the regression equation for a multiple regression model with two independent variables.

MULTIPLE REGRESSION EQUATION WITH TWO INDEPENDENT VARIABLES

Yˆi = b0 + b1X1i + b2X2i (13.3)

You can use Microsoft Excel to calculate the values of the three regression coefficients using the least-squares method. Figure 13.1 presents Microsoft Excel’s output for the car exports data.
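The same coefficients can be obtained with any least-squares routine. The Python sketch below is an illustration only (the array and variable names are ours, numpy is assumed, and the numerical results quoted in the comments are those reported in the text's output and worked prediction); it fits Equation 13.3 to the data in Table 13.1 and then predicts car exports for an exchange rate of 75 cents and a wage price index of 120.

    import numpy as np

    # Quarterly data from Table 13.1 (Sep 2013 to Mar 2017)
    exports = np.array([578, 585, 377, 390, 451, 560, 483, 497,
                        568, 563, 464, 363, 517, 481, 280], dtype=float)
    xrate = np.array([90.98, 91.75, 89.77, 93.42, 91.42, 84.99, 77.36, 77.75,
                      71.51, 71.98, 72.99, 74.41, 75.55, 74.41, 76.33])
    wpi = np.array([116.4, 117.0, 117.7, 118.2, 119.3, 119.9, 120.3, 120.8,
                    121.8, 122.3, 122.7, 123.1, 124.1, 124.5, 124.9])

    # Design matrix with a constant column, then least-squares fit of Equation 13.3
    X = np.column_stack([np.ones_like(xrate), xrate, wpi])
    b0, b1, b2 = np.linalg.lstsq(X, exports, rcond=None)[0]
    print(b0, b1, b2)   # approximately 5,937.85, -10.76 and -37.97, as in Figure 13.1

    # Point prediction for an exchange rate of 75 cents and a wage price index of 120
    y_hat = b0 + b1 * 75 + b2 * 120
    print(y_hat)        # approximately 574.34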





A 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19

B

C

D

E

F

Figure 13.1 Car export model Microsoft Excel output

G

Summary output Regression statistics Multiple R

0.5882

R square

0.3460

Adjusted R square Standard error

0.2370 79.7494

Observations

15

ANOVA df

SS

MS

2

40382.0465

20191.0233

Residual

12

76319.6868

6359.9739

Total

14

116701.7333

Coefficients

Standard error

5937.8533

2179.8227

Regression

Intercept

F Significance F 3.1747

0.0782

t stat

p-value

Lower 95%

Upper 95%

2.7240

0.0185

1188.4277

10687.2790

xrate

–10.7613

5.0077

–2.1490

0.0527

–21.6721

0.1495

wpi

–37.9702

15.0698

–2.5196

0.0269

–70.8045

–5.1358

From Figure 13.1, the calculated values of the regression coefficients are: b0 = 5,937.8533    b1 = −10.7613    b2 = −37.9702 Therefore, the multiple regression equation is: Yˆi = 5,937.8533 − 10.7613X1i − 37.9702X2i where  Yˆi = car exports i X1i = exchange rate i X2i = wage price index i The sample Y intercept (b0 = 5937.8533) estimates car exports if the exchange rate is zero and wage price index is zero. Because neither variable will be zero, the value of b0 has no practical interpretation. Regression coefficients in multiple regression are called net regression coefficients and measure the mean change in Y per unit change in a particular X, holding constant the effect of the other X variables. • The slope of exchange rate with car exports (b1 = −10.7613) indicates that, for a given percentage of wage price index, mean car exports are estimated to decrease by $10.7613 million or $10,761,300 each percentage point increase in exchange rate. • The slope of the wage price index with car exports (b2 = −37.9702) indicates that, for a given exchange rate, mean car exports are estimated to decrease by $37.9702 million or $37,970,200 for each additional percentage point of wage price index. These estimates give a better understanding of the likely effect that exchange rates and wage price index will have on car exports. For example, an increase in exchange rates of 10 percentage points is estimated to decrease car exports by $107.613 million, for a given wage price index, whereas an increase in wage price index of 10 percentage points is estimated to decrease car exports by $379.702 million, for a given exchange rate.

Predicting the Dependent Variable, Y You can use the multiple regression equation calculated by Microsoft Excel to predict values of the dependent variable. For example, what is the estimated value of car exports for an exchange rate of 75 cents and wage price index of 120? Using the multiple regression equation: Yˆ = 5,937.8533 – 10.7613X – 37.9702X i

1i

2i

with X1i = 75 and X2i = 120, ˆ

Yi = 5,937.8533 – 10.7613(75) – 37.9702(120) Copyright © Pearson Australia (a division of Pearson Australia Group Pty Ltd) 2019— 9781488617249 — Berenson/Basic Business Statistics 5e

508 CHAPTER 13 INTRODUCTION TO MULTIPLE REGRESSION

Yˆi = 5,937.8533 – 10.7613X1i – 37.9702X2i and = 120, with X1i = 75 with X1iX= and X2i = 120, 2i 75 Yˆ = 5,937.8533 – 10.7613(75) – 37.9702(120) i

= 574.3361 Thus, the estimated value of car exports for an exchange rate of 75 cents and wage price index of 120 is $574.3361 million1 or $574,336,100. After you have predicted Y and completed a residual analysis (see Section 13.3), the next step often involves a confidence interval estimate of the mean response and a prediction interval for an individual response. The calculation of these intervals is too complex to perform by hand, and you should use Microsoft Excel to perform the calculations. Figure 13.2 illustrates Microsoft Excel output. The 95% confidence interval estimate of the mean value of car exports for an exchange rate of 75 cents and wage price index of 120 is between $473.046 million and $675.626 million. The prediction interval for an individual car export value is between $373.210 million and $775.4627 million. Figure 13.2 PHStat Excel confidence interval estimate and prediction interval for the car export example

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

A B Confidence interval estimate and prediction interval

C

D

Data Confidence level

95% 1 75 120

xrate given value wpi given value X'X

15 1214.62 1813

1214.62 99322.03 146530.5

1813 146530.5 219238.2

Inverse of X'X

747.1142 –1.55145 –5.14136

–1.55145 0.003943 0.010194

–5.14136 0.010194 0.035708

X'G times inverse of X'X

13.79192

–0.03239

-0.09186

[X'G times inverse of X'X] times XG t statistic Predicted Y (Y Hat)

0.339812 2.178813 574.3361

For average predicted Y (Y Hat) Interval half width 101.2901 Confidence interval lower limit 473.046 Confidence interval upper limit 675.6261 For individual response Y Interval half width Prediction interval lower limit Prediction interval upper limit

201.1266 373.2095 775.4627

Problems for Section 13.1 LEARNING THE BASICS 13.1 For this problem, use the following multiple regression equation: Yˆi = 12 + 8X1i − 3X2i a. Interpret the meaning of the slopes. b. Interpret the meaning of the Y intercept.

13.2 For this problem, use the following multiple regression equation: Yˆi = 100 + 10X1i − 15X2i

a. Interpret the meaning of the slopes. b. Interpret the meaning of the Y intercept.

1

We use the full output from Excel for our calculated answers throughout this chapter. You may experience small discrepancies in your calculated answers due to rounding errors. Copyright © Pearson Australia (a division of Pearson Australia Group Pty Ltd) 2019— 9781488617249 — Berenson/Basic Business Statistics 5e



13.1 Developing the Multiple Regression Model 509

APPLYING THE CONCEPTS You need to use Microsoft Excel to solve problems 13.4 to 13.8.

13.3 A marketing analyst for a shoe manufacturer is considering the development of a new brand of tennis shoes. The marketing analyst wants to determine which variables to use in predicting durability (i.e. the effect of long-term impact). Two independent variables under consideration are X1 (Foreimp), a measurement of the forefoot shock-absorbing capability, and X2 (Midsole), a measurement of the change in impact properties over time. The dependent variable, Y, is LTIMP, a measure of the shoe’s durability after a repeated impact test. A random sample of 15 types of currently manufactured tennis shoes was selected for testing, with the following results. ANOVA Regression Error Total

Variable Intercept Foreimp Midsole

df  2 12 14

SS 12.61020  0.77453 13.38473

Coefficients 0.02686 0.79116 0.60484

MS 6.30510 0.06454

Standard error 0.06905 0.06295 0.07174

F 97.69

Significance F 0.0001

t stat  0.39 12.57  8.43

p-value 0.7034 0.0000 0.0000

a. State the multiple regression equation. b. Interpret the meaning of the slope coefficients b1 and b2 in this problem. 13.4 A financial planner believes that retirement trends reflect the trends in the labour market and also share market returns. She collects data on 15 OECD countries’ retirement rates (%), unemployment rates (%) and share market returns for 2011. < RETIRE >

Country Australia Canada Denmark Finland France Germany Italy Netherlands New Zealand Norway Portugal Spain Sweden United Kingdom United States

Retirement rate 20 34 22 16 32 17 20 18 19 24 25 34 14 18 15

Unemployment rate 5 9 8 6 8 7 8 9 6 5 11 16 4 7 8

Share market return 8 6 3 6 6 5 2 4 7 3 3 4 7 6 7

a. State the multiple regression equation. b. Interpret the meaning of the slope coefficients b1 and b2 in this problem.

c. Explain why the regression coefficient, b0, has no practical meaning in the context of this problem. d. Predict the mean retirement rate when unemployment is 6% and there is a share market return of 5%. e. Construct a 95% confidence interval estimate for the mean retirement rate when unemployment is 6% and there is a share market return of 5%. f. Construct a 95% prediction interval for the retirement rate when unemployment is 6% and there is a share market return of 5%. 13.5 Climate change and a country’s carbon footprint are contentious issues faced by many governments. Limiting or reducing carbon dioxide emissions is seen as a priority by climate change activists. Here we attempt to predict CO2 emissions (million metric tonnes) for 14 countries based upon their GDP (in US $trillion) and population density (people per square kilometre). < EMISSION >

China United States Russia India Japan Germany South Korea Australia Saudi Arabia South Africa United Kingdom Mexico Poland Canada

CO2 (million metric tonnes) 4,266.04 2,478.03 1,000.18 963.48 561.21 350.51 338.7 241.7 241.33 229.05 196.7 190.76 165.67 165.62

GDP (US$t) 8.36 15.68 2.01 1.84 5.96 3.4 1.13 1.52 0.711 0.384 2.44 1.18 0.49 1.82

Population (per km2) 133.69 29.77 8.61 336.62 336.72 234.86 477.49 2.47 10.97 35.6 244.69 52.15 126.79 3.36

Data obtained from NationMaster , ,

a. What would be the predicted sign of coefficients b1 and b2?
b. State the multiple regression equation.
c. Interpret the meaning of the slope coefficients b1 and b2.
d. Explain why coefficient b0 would have no practical meaning in the context of this problem.
e. Predict CO2 emissions for a country with GDP of $1 trillion and a population density of 50 people per square kilometre.
f. Construct a 95% confidence interval estimate for the mean CO2 emissions for a country with GDP of $1 trillion and a population density of 50 people per square kilometre.
g. Construct a 95% prediction interval estimate for the CO2 emissions for a country with GDP of $1 trillion and a population density of 50 people per square kilometre.



13.6 The production of wine is a multibillion-dollar worldwide industry. In an attempt to develop a model of wine quality as judged by wine experts, data was collected from red wine variants of Portuguese vinho verde. A sample of 50 wines is stored in < VINHO_VERDE_RED_AND_WHITE > (data obtained from P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis, 'Modeling wine preferences by data mining from physicochemical properties', Decision Support Systems, 47, 2009, 547–553). Develop a multiple linear regression model to predict wine quality, measured on a scale from 0 (very bad) to 10 (excellent), based on alcohol content (%) and the amount of chlorides (g/L).
a. State the multiple regression equation.
b. Interpret the meaning of the slope coefficients b1 and b2 in this problem.
c. Explain why the regression coefficient b0 has no practical meaning in the context of this problem.
d. Predict the mean wine quality rating for wines that have 10% alcohol and chlorides of 0.08 g/L.
e. Construct a 95% confidence interval estimate for the mean wine quality rating for wines that have 10% alcohol and chlorides of 0.08 g/L.
f. Construct a 95% prediction interval for the wine quality rating for an individual wine that has 10% alcohol and chlorides of 0.08 g/L.
g. What conclusions can you reach concerning this regression model?
13.7 The study of 'happiness' and wellbeing is becoming popular in both economics and social science as an alternative to GDP as a measure of countries' standard of living. In an attempt to analyse the possible connection (if any) between happiness and economic variables we have obtained survey data measuring people's happiness levels. The data below show the percentage of respondents from various countries stating they were 'very happy'. We have included GDP per capita ($) and CPI as economic variables. < HAPPY >

Country         Very happy   GDP/capita   CPI
Venezuela       55%          13,500       79.32
Nigeria         45%          2,700        49.13
Iceland         42%          39,700       81.85
Philippines     40%          4,400        25.78
Netherlands     40%          41,500       71.6
Australia       39%          42,000       85.15
United States   39%          51,700       57.93
Turkey          39%          14,800       32.54
Switzerland     38%          44,900       107.28
Belgium         37%          37,500       70.16
Sweden          36%          40,300       69.77
Denmark         36%          37,300       80.64
Canada          32%          42,300       62.24

Data obtained from NationMaster

a. State the multiple regression equation.
b. Interpret the meaning of the slope coefficients b1 and b2.
c. Explain why coefficient b0 would have no practical meaning in the context of this problem.
d. Predict the mean happiness percentage for a country with GDP per capita of $35,000 and CPI of 75.
e. Construct a 95% confidence interval estimate for the mean happiness percentage for a country with GDP per capita of $35,000 and CPI of 75.
f. Construct a 95% prediction interval estimate for the happiness percentage of an individual country with GDP per capita of $35,000 and CPI of 75.
13.8 The business problem facing a consumer products company is to measure the effectiveness of different types of advertising media in the promotion of its products. Specifically, the company is interested in the effectiveness of radio advertising and newspaper advertising (including the cost of discount coupons). During a one-month test period, data were collected from a sample of 22 cities with approximately equal populations. Each city is allocated a specific expenditure level for radio advertising and for newspaper advertising. The sales of the product (in $'000) and also the levels of media expenditure (in $'000) during the test month are recorded, with the following results shown below and stored in < ADVERTISE >:

City   Sales ($'000)   Radio advertising ($'000)   Newspaper advertising ($'000)
1      973             0                           40
2      1,119           0                           40
3      875             25                          25
4      625             25                          25
5      910             30                          30
6      971             30                          30
7      931             35                          35
8      1,177           35                          35
9      882             40                          25
10     982             40                          25
11     1,628           45                          45
12     1,577           45                          45
13     1,044           50                          0
14     914             50                          0
15     1,329           55                          25
16     1,330           55                          25
17     1,405           60                          30
18     1,436           60                          30
19     1,521           65                          35
20     1,741           65                          35
21     1,866           70                          40
22     1,717           70                          40

a. State the multiple regression equation.
b. Interpret the meaning of the slope coefficients b1 and b2 in this problem.
c. Interpret the meaning of the regression coefficient b0.
d. Which type of advertising is more effective? Explain.





13.2  R², ADJUSTED R² AND THE OVERALL F TEST

Coefficients of Multiple Determination
Recall from Section 12.3 that the coefficient of determination, r², measures the proportion of the variation in Y that is explained by the independent variable, X, in the simple linear regression model. In multiple regression, the coefficient of multiple determination, R², represents the proportion of the variation in Y that is explained by the set of independent variables selected. Equation 13.4 defines the coefficient of multiple determination for a multiple regression model with two or more independent variables.

coefficient of multiple determination, R 2 In regression the proportion of the variation in the Y dependent variable that is explained by a set of X independent variables.

THE COEFFICIENT OF MULTIPLE DETERMINATION
The coefficient of multiple determination is equal to the regression sum of squares (SSR) divided by the total sum of squares (SST):

R² = SSR/SST   (13.4)

where  SSR = regression sum of squares
       SST = total sum of squares

In the car exports example, from Figure 13.1 (page 507), SSR = 40,382.0465, SST = 116,701.7333 and k = 2. Thus:

R² = SSR/SST = 40,382.0465/116,701.7333 = 0.3460

The coefficient of multiple determination (R2 = 0.3460) indicates that 34.6% of the variation in car exports is explained by the variation in the exchange rate and wage price index. However, when dealing with multiple regression models, some statisticians suggest using the adjusted R 2 to reflect both the number of independent variables in the model and the sample size. Reporting the adjusted R2 is extremely important when you are comparing two or more regression models that predict the same dependent variable but have a different number of independent variables. Equation 13.5 defines the adjusted R2.

adjusted R 2 A modification of R-square that adjusts for the number of terms in a model.

ADJUSTED R²

R²adj = 1 − [(1 − R²)(n − 1)/(n − k − 1)]   (13.5)

where k is the number of independent variables in the regression equation.

Thus, for the car exports data, because R² = 0.3460, n = 15 and k = 2:

R²adj = 1 − [(1 − 0.3460)(15 − 1)/(15 − 2 − 1)]
      = 1 − [0.6540(14/12)]
      = 1 − 0.7630
      = 0.2370



Hence, 23.7% of the variation in car exports is explained by the multiple regression model, adjusted for the number of independent variables and sample size.
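These calculations are easy to verify outside Excel. The following is a minimal Python sketch (an illustration, not the text's Excel workflow) that reproduces R² and adjusted R² from the sums of squares quoted above for the car exports example.

```python
# Reproduce R-squared (Equation 13.4) and adjusted R-squared (Equation 13.5)
# from the sums of squares quoted for the car exports example.
SSR = 40_382.0465    # regression sum of squares (Figure 13.1)
SST = 116_701.7333   # total sum of squares
n, k = 15, 2         # sample size and number of independent variables

r2 = SSR / SST
r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(f"R^2 = {r2:.4f}, adjusted R^2 = {r2_adj:.4f}")
# Expected output: R^2 = 0.3460, adjusted R^2 = 0.2370
```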

Test for the Significance of the Overall Multiple Regression Model

overall F test  Joint test for a significant relationship between the Y dependent variable and a set of X independent variables in multiple regression, using the F probability distribution.

In this section you learn how to use the overall F test to test for the significance of the overall multiple regression model. We use this test to determine whether there is a significant relationship between the dependent variable and the entire set of independent variables. Because there is more than one independent variable, we use the following null and alternative hypotheses:
H0: β1 = … = βk = 0 (No linear relationship between the dependent variable and the independent variables.)
H1: At least one βj ≠ 0 (Linear relationship between the dependent variable and at least one of the independent variables.)
Equation 13.6 defines the statistic for the overall F test, and Table 13.2 presents the associated ANOVA summary table.

OVERALL F TEST STATISTIC
The F statistic is equal to the regression mean square (MSR) divided by the error mean square (MSE):

F = MSR/MSE   (13.6)

where  k = number of independent variables in the regression model
       F = test statistic from an F distribution with k and n − k − 1 degrees of freedom

Table 13.2  ANOVA summary table for the overall F test

Source       Degrees of freedom   Sum of squares   Mean square (variance)    F
Regression   k                    SSR              MSR = SSR/k               F = MSR/MSE
Error        n − k − 1            SSE              MSE = SSE/(n − k − 1)
Total        n − 1                SST

The decision rule is:
Reject H0 at the α level of significance if F > FU(k, n−k−1); otherwise, do not reject H0.
Using a 0.05 level of significance, the critical value of the F distribution with 2 and 12 degrees of freedom, found from Table E.5, is 3.89 (see Figure 13.3). From Figure 13.1, the F statistic given in the ANOVA summary table is 3.1747. Because F = 3.1747 < 3.89, or because the p-value = 0.0782 > 0.05, we do not reject H0: there is insufficient evidence that either independent variable (exchange rate and/or wage price index) is related to car exports.
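The overall F test can be checked the same way. The sketch below (Python, with scipy assumed to be installed) recomputes the F statistic, the 0.05 critical value and the p-value for the car exports example.

```python
from scipy import stats

SSR, SST = 40_382.0465, 116_701.7333
n, k = 15, 2
SSE = SST - SSR

MSR = SSR / k                    # regression mean square
MSE = SSE / (n - k - 1)          # error mean square
F = MSR / MSE                    # Equation 13.6

f_crit = stats.f.ppf(0.95, k, n - k - 1)    # upper critical value at alpha = 0.05
p_value = stats.f.sf(F, k, n - k - 1)       # P(F > observed F)

print(f"F = {F:.4f}, critical value = {f_crit:.2f}, p-value = {p_value:.4f}")
# F = 3.17 < 3.89 and p = 0.078 > 0.05, so H0 is not rejected.
```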





Figure 13.3 Testing for significance of a set of regression coefficients at the 0.05 level of significance with 2 and 12 degrees of freedom

[F distribution with the region of non-rejection below the critical value FU(2,12) = 3.89 and the region of rejection above it]

Problems for Section 13.2

LEARNING THE BASICS
13.9 The following ANOVA summary table is for a multiple regression model with two independent variables.

Source       Degrees of freedom   Sum of squares   Mean squares   F
Regression   2                    55
Error        18                   135
Total        20                   190

a. Determine the mean square due to regression and the mean square due to error.
b. Calculate the F statistic.
c. Determine whether there is a significant relationship between Y and the two independent variables at the 0.05 level of significance.
d. Calculate the coefficient of multiple determination, R², and interpret its meaning.
e. Calculate the adjusted R².
13.10 The following ANOVA summary table is for a multiple regression model with two independent variables.

Source       Degrees of freedom   Sum of squares   Mean squares   F
Regression   2                    185
Error        18                   315
Total        20                   500

a. Determine the mean square that is due to regression and the mean square that is due to error.
b. Calculate the F statistic.
c. Determine whether there is a significant relationship between Y and the two independent variables at the 0.05 level of significance.
d. Calculate the coefficient of multiple determination, R², and interpret its meaning.
e. Calculate the adjusted R².

APPLYING THE CONCEPTS
Use Microsoft Excel to solve problems 13.11 to 13.17.

13.11 Thousands of tourists travel on the free City Circle tram in Melbourne every day, together with many locals who simply want to get from one place to another. In a survey on the tram, tourists are asked to rank their 'enjoyment' of the service from 1 (very low) to 5 (very high); a range of 19 other questions are also asked, such as how long they will stay in Melbourne, how much they will spend per day, how much they earn and how old they are. Many different regression models are run to assess the factors determining enjoyment, including the following:
Model 1: Enjoyment = β0 + β1(length of stay) + ε,  R²adj = 0.72
Model 2: Enjoyment = β0 + β1(average income) + ε,  R²adj = 0.78
Model 3: Enjoyment = β0 + β1(length of stay) + β2(average income) + ε,  R²adj = 0.68
a. Interpret the adjusted R² for each of the three models.
b. Which of these three models do you think is the best predictor of Enjoyment?
c. What additional information is needed to help explain these relationships?
13.12 In problem 13.3 on page 509, a marketing analyst predicted the durability of a brand of tennis shoe based on the forefoot shock-absorbing capability and the change in impact



properties over time. The regression analysis resulted in the following ANOVA summary table.

ANOVA        Degrees of freedom   SS         MS        F       Significance F
Regression   2                    12.61020   6.30510   97.69   0.0001
Error        12                   0.77453    0.06454
Total        14                   13.38473

a. Determine whether there is a significant relationship between durability and the two independent variables at the 0.05 level of significance.
b. Interpret the meaning of the p-value.
c. Calculate the coefficient of multiple determination, R², and interpret its meaning.
d. Calculate the adjusted R².
13.13 In problem 13.5 on page 509, you used GDP and population density to estimate CO2 emissions. Using the computer output from that problem:
a. Determine whether there is a significant relationship between CO2 emissions and the two independent variables (GDP and population density) at the 0.05 level of significance.
b. Interpret the meaning of the p-value.
c. Calculate the coefficient of multiple determination, R², and interpret its meaning.
d. Calculate the adjusted R².
13.14 In problem 13.4 on page 509, a financial planner used unemployment rates and share market returns to predict retirement rates in OECD countries. Using the computer output from that problem:
a. Determine whether there is a significant relationship between retirement rates and the two independent variables (unemployment rates and share market returns) at the 0.05 level of significance.


b. Interpret the meaning of the p-value. c. Calculate the coefficient of multiple determination, R 2, and interpret its meaning. d. Calculate the adjusted R 2. 13.15 In problem 13.7 on page 510, you used GDP per capita and CPI to estimate the percentage of very happy people in a country. Using the computer output from that problem: a. Determine whether there is a significant relationship between percentage of very happy people and the two independent variables (GDP per capita and CPI) at the 0.05 level of significance. b. Interpret the meaning of the p-value. c. Calculate the coefficient of multiple determination, R 2, and interpret its meaning. d. Calculate the adjusted R 2. 13.16 In problem 13.6 on page 510, a wine expert used alcohol content and amount of chlorides to predict wine quality. Using the computer output from that problem: a. Determine whether there is a significant relationship between wine quality and the two independent variables (alcohol and chlorides) at the 0.05 level of significance. b. Interpret the meaning of the p-value. c. Calculate the coefficient of multiple determination, R 2, and interpret its meaning. d. Calculate the adjusted R 2. 13.17 In problem 13.8 on page 510, you used radio and newspaper advertising to predict product sales. Using the computer output from that problem: a. Determine whether there is a significant relationship between product sales and the two independent variables (radio and newspaper advertising) at the 0.05 level of significance. b. Interpret the meaning of the p-value. c. Calculate the coefficient of multiple determination, R 2, and interpret its meaning. d. Calculate the adjusted R 2.

LEARNING OBJECTIVE 1
Construct a multiple regression model and analyse a model output

13.3  RESIDUAL ANALYSIS FOR THE MULTIPLE REGRESSION MODEL

In Section 12.5 we used residual analysis to evaluate the appropriateness of using the simple linear regression model for a set of data. For the multiple regression model with two independent variables, we need to construct and analyse the following residual plots:
1. residuals versus Ŷi
2. residuals versus X1i
3. residuals versus X2i
4. residuals versus time.
The first residual plot examines the pattern of residuals versus the predicted values of Y. If the residuals show a pattern for different predicted values of Y, there is evidence of a possible non-linear effect in at least one independent variable, a possible violation of the assumption of equal variance (see Figure 12.13) and/or the need to transform the Y variable.





The second and third residual plots involve the independent variables. Patterns in the plot of the residuals versus an independent variable may indicate the existence of a non-linear effect and perhaps the need to add a quadratic independent variable to the multiple regression model. The fourth type of plot is used to investigate patterns in the residuals in order to validate the independence assumption when the data are collected in time order. Associated with this residual plot, as in Section 12.6, you can calculate the Durbin–Watson statistic to determine the existence of positive autocorrelation between the residuals. You can use statistical and spreadsheet software to plot the residuals. Figure 13.4 illustrates the Microsoft Excel residual plots for the car exports example. In Figure 13.4, there is very little or no pattern in the relationship between the residuals and the predicted value of Y, the value of X1 (exchange rate) or the value of X2 (wage price index). Thus, we can conclude that the multiple regression model is appropriate for predicting car exports.
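Residual plots such as those in Figure 13.4 can also be produced outside Excel. The sketch below is purely illustrative: the car export data are not reproduced here, so it generates hypothetical exports, xrate and wpi values; pandas, statsmodels and matplotlib are assumed to be installed.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

# Hypothetical data standing in for the car export sample (n = 15).
rng = np.random.default_rng(1)
df = pd.DataFrame({'xrate': rng.uniform(60, 100, 15),
                   'wpi': rng.uniform(115, 126, 15)})
df['exports'] = 1700 - 10 * df['xrate'] - 5 * df['wpi'] + rng.normal(0, 80, 15)

model = smf.ols('exports ~ xrate + wpi', data=df).fit()

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
plots = [(model.fittedvalues, 'Predicted exports'),
         (df['xrate'], 'Exchange rate'),
         (df['wpi'], 'Wage price index')]
for ax, (x, label) in zip(axes, plots):
    ax.scatter(x, model.resid)     # look for patterns; a random scatter is good
    ax.axhline(0, linewidth=1)
    ax.set_xlabel(label)
    ax.set_ylabel('Residual')
plt.tight_layout()
plt.show()
```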

Figure 13.4  Microsoft Excel residual plots for car exports: panel A, residuals versus exchange rate (xrate, %); panel B, residuals versus wage price index (wpi, %); panel C, residuals versus predicted exports
[Three scatter plots of the residuals against each horizontal-axis variable; in each panel the residuals are scattered around zero with no obvious pattern]

Problems for Section 13.3

APPLYING THE CONCEPTS
Use Microsoft Excel to solve problems 13.18 to 13.22.

13.18 In problem 13.4 on page 509, a financial planner used unemployment rates and share market returns to predict retirement rates in OECD countries. Perform a residual analysis on the results and determine the adequacy of the model. 13.19 In problem 13.5 on page 509, you used GDP and population density to estimate CO2 emissions. Perform a residual analysis on the results and determine the adequacy of the model.


13.20 In problem 13.6 on page 510, a wine expert used alcohol content and chlorides to predict wine quality. Perform a residual analysis on the results and determine the adequacy of the model. 13.21 In problem 13.7 on page 510, you used GDP per capita and CPI to estimate the percentage of very happy people in a country. Perform a residual analysis on your results and determine the adequacy of the model. 13.22 In problem 13.8 on page 510, you used radio and newspaper advertising to predict product sales. Perform a residual analysis on the results and determine the adequacy of the model.

LEARNING OBJECTIVE 2
Determine which independent variables to include in the regression model, and decide which are more important in predicting the dependent variable

13.4  INFERENCES CONCERNING THE POPULATION REGRESSION COEFFICIENTS

In Section 12.7 we tested the slope in a simple linear regression model to determine the significance of the relationship between X and Y. We also constructed a confidence interval of the population slope. This section extends these procedures to multiple regression.

Tests of Hypothesis

In a simple linear regression model, to test a hypothesis concerning the population slope, β1, we used Equation 12.16:

t = (b1 − β1)/Sb1

Equation 13.7 generalises this equation for multiple regression.





TESTING FOR THE SLOPE IN MULTIPLE REGRESSION

t = (bj − βj)/Sbj   (13.7)

where  bj = slope of variable j with Y, holding constant the effects of all other independent variables
       Sbj = standard error of the regression coefficient bj
       t = test statistic for a t distribution with n − k − 1 degrees of freedom
       k = number of independent variables in the regression equation
       βj = hypothesised value of the population slope for variable j, holding constant the effects of all other independent variables

The Microsoft Excel output includes the results of the t test for each of the independent variables included in the regression model (see Figure 13.1 on page 507). To determine whether variable X2 (wage price index) has a significant effect on car exports, taking into account exchange rates, the null and alternative hypotheses are:
H0: β2 = 0
H1: β2 ≠ 0
From Equation 13.7 and Figure 13.1:

t = (b2 − β2)/Sb2 = (−37.9702 − 0)/15.0698 = −2.5196

If you select a level of significance of 0.05, the critical values of t for 12 degrees of freedom from Table E.3 are −2.1788 and +2.1788 (see Figure 13.5). From Figure 13.1, the p-value is 0.0269. Because t = −2.5196 < −2.1788, or because the p-value of 0.0269 < 0.05, we reject H0 and conclude that there is a significant relationship between variable X2 (wage price index) and car exports, taking into account exchange rates, X1. This small p-value allows you to reject the null hypothesis that there is no linear relationship between car exports and wage price index.

Figure 13.5  Testing for significance of a regression coefficient at the 0.05 level of significance with 12 degrees of freedom
[t distribution with 12 degrees of freedom showing regions of rejection below −2.1788 and above +2.1788, and the region of non-rejection between the two critical values]

Example 13.1 presents the test for the significance of β1, the slope of car exports with exchange rate.



EXAMPLE 13.1

TESTING FOR THE SIGNIFICANCE OF THE SLOPE OF CAR EXPORTS WITH EXCHANGE RATE
At the 0.05 level of significance, is there evidence that the slope of car exports with exchange rates is different from zero?

SOLUTION

From Figure 13.1, t = −2.1490 > −2.1788 (the critical value for α = 0.05) or the p-value = 0.0527 > 0.05. Therefore, there is not a significant relationship between exchange rate, X1, and car exports, taking into account wage price index, X2. As seen with each of these two X variables, the test of significance for a particular regression coefficient is actually a test for the significance of adding a particular variable into a regression model, given that the other variable is included. Therefore, the t test for the regression coefficient is equivalent to testing for the contribution of each independent variable.
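Either t test can be reproduced directly from the reported coefficient and standard error. The following Python sketch (scipy assumed installed) checks the wage price index result; swapping in −10.7613 and 5.0077 gives the exchange rate test of this example.

```python
from scipy import stats

b2, se_b2 = -37.9702, 15.0698    # slope and standard error for wage price index
n, k = 15, 2
df_resid = n - k - 1

t_stat = (b2 - 0) / se_b2                        # Equation 13.7 with beta_j = 0
p_value = 2 * stats.t.sf(abs(t_stat), df_resid)  # two-tailed p-value
t_crit = stats.t.ppf(0.975, df_resid)            # critical value for alpha = 0.05

print(f"t = {t_stat:.4f}, critical values = +/-{t_crit:.4f}, p-value = {p_value:.4f}")
# t = -2.5196, |t| > 2.1788 and p = 0.0269 < 0.05, so H0 is rejected.
```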

Confidence Interval Estimation

Instead of testing the significance of a population slope, you may want to estimate the value of a population slope. Equation 13.8 defines the confidence interval estimate for a population slope in multiple regression.

CONFIDENCE INTERVAL ESTIMATE FOR THE SLOPE

bj ± tn−k−1 Sbj   (13.8)

For example, if you want to construct a 95% confidence interval estimate of the population slope β1 (the effect of exchange rate, X1, on car exports, Y, holding constant the effect of wage price index, X2), from Equation 13.8 and Figure 13.1:

b1 ± tn−k−1 Sb1

Because the critical value of t at the 95% confidence level with 12 degrees of freedom is 2.1788 (see Table E.3):

−10.7613 ± (2.1788)(5.0077)
−10.7613 ± 10.9108
−21.6721 ⩽ β1 ⩽ 0.1495

With 95% confidence, the estimated effect of a one percentage point increase in exchange rates ranges from decreasing car exports by as much as $21.6721 million to increasing them by $0.1495 million. From a hypothesis-testing viewpoint, because this confidence interval includes 0, you conclude that there is not a significant relationship between exchange rates and car exports. Example 13.2 constructs and interprets a confidence interval estimate for the slope of car exports with wage price index.

EXAMPLE 13.2

CONSTRUCTING A CONFIDENCE INTERVAL ESTIMATE FOR THE SLOPE OF CAR EXPORTS WITH WAGE PRICE INDEX
Construct a 95% confidence interval estimate of the population slope of car exports with wage price index.





SOLUTION

The critical value of t at the 95% confidence level with 12 degrees of freedom is 2.1788 (see Table E.3). Using Equation 13.8:

−37.9702 ± (2.1788)(15.0698)



−37.9702 ± 32.8342



−70.8045 ⩽ β2 ⩽ −5.1358

Thus, taking into account the effect of exchange rates, the estimated effect of each additional percentage point of wage price index is a decrease in mean car exports of between $5.1358 million and $70.8045 million. You have 95% confidence that this interval correctly estimates the true relationship between these variables. From a hypothesis-testing viewpoint, because this confidence interval does not include 0, you can conclude that there is a significant relationship between wage price index and car exports.
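The interval in Example 13.2 follows from Equation 13.8 and two numbers in the Excel output. A minimal Python sketch (scipy assumed installed):

```python
from scipy import stats

b2, se_b2 = -37.9702, 15.0698   # slope and standard error for wage price index
df_resid = 15 - 2 - 1

t_crit = stats.t.ppf(0.975, df_resid)   # 2.1788 for 12 degrees of freedom
lower = b2 - t_crit * se_b2
upper = b2 + t_crit * se_b2

print(f"95% CI for beta2: ({lower:.4f}, {upper:.4f})")
# Approximately (-70.80, -5.14); the interval does not include 0.
```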

Problems for Section 13.4

LEARNING THE BASICS
13.23 You are given the following information from a multiple regression analysis:
n = 45, b1 = 7, b2 = 6, Sb1 = 2.5, Sb2 = 1.5

a. Which variable has the largest slope in units of a t statistic? b. Construct a 95% confidence interval estimate of the population slope β1. c. At the 0.05 level of significance, determine whether each independent variable makes a significant contribution to the regression model. On the basis of these results, indicate the independent variables to include in this model.

APPLYING THE CONCEPTS
Use Microsoft Excel to solve problems 13.24 to 13.29.

13.24 In problem 13.3 on page 509, a marketing analyst predicted the durability of a brand of tennis shoe based on the forefoot shock-absorbing capability (Foreimp) and the change in impact properties over time (Midsole) for a sample of 15 pairs of shoes. Use the following results:

Variable    Coefficients   Standard error   t stat   p-value
Intercept   -0.02686       0.06905          -0.39    0.7034
Foreimp      0.79116       0.06295          12.57    0.0000
Midsole      0.60484       0.07174           8.43    0.0000

a. Construct a 95% confidence interval estimate of the population slope between durability and forefoot shockabsorbing capability. b. At the 0.05 level of significance, determine whether each independent variable makes a significant contribution to the regression model. On the basis of these results, indicate the independent variables to include in this model.

13.25 In problem 13.4 on page 509, a financial planner used unemployment rates and share market returns to predict retirement rates in OECD countries. Using the computer output from that problem: a. Construct a 95% confidence interval estimate of the population slope between retirement rate and unemployment rate. b. At the 0.05 level of significance, determine whether each independent variable makes a significant contribution to the regression model. On the basis of these results, indicate the independent variables to include in this model. 13.26 In problem 13.5 on page 509, you used GDP and population density to estimate CO2 emissions. Using the computer output from that problem: a. Construct a 95% confidence interval estimate of the population slope between GDP and population density. b. At the 0.05 level of significance, determine whether each independent variable makes a significant contribution to the regression model. On the basis of these results, indicate the independent variables to include in this model. 13.27 In problem 13.6 on page 510, a wine expert used alcohol content and chlorides to predict wine quality. Using the computer output from that problem: a. Construct a 95% confidence interval estimate of the population slope between wine quality and chlorides. b. At the 0.05 level of significance, determine whether each independent variable makes a significant contribution to the regression model. On the basis of these results, indicate the independent variables to include in this model. 13.28 In problem 13.7 on page 510, you used GDP per capita and CPI to estimate the percentage of very happy people in a country. Using the computer output from that problem: a. Construct a 95% confidence interval estimate of the population slope between happiness percentage and GDP per capita.



b. At the 0.05 level of significance, determine whether each independent variable makes a significant contribution to the regression model. On the basis of these results, indicate the independent variables to include in this model.
13.29 In problem 13.8 on page 510, you used radio and newspaper advertising to predict the product sales. Using the computer output from that problem:


a. Construct a 95% confidence interval estimate of the population slope between newspaper advertising and product sales. b. At the 0.05 level of significance, determine whether each independent variable makes a significant contribution to the regression model. On the basis of these results, indicate the independent variables to include in this model.

LEARNING OBJECTIVE 2
Determine which independent variables to include in the regression model, and decide which are more important in predicting the dependent variable

13.5  TESTING PORTIONS OF THE MULTIPLE REGRESSION MODEL

partial F test  Tests for a significant contribution of an individual independent X variable in multiple regression after all other independent X variables have been included in the regression model, using the F probability distribution.

When developing a multiple regression model, we want to use only those independent variables that are useful in predicting the value of a dependent variable. If an independent variable is not helpful in making this prediction, you can delete it from the multiple regression model and use a model with fewer independent variables. The partial F test is an alternative method to the t test discussed in Section 13.4 for determining the contribution of an independent variable. It involves determining the contribution to the regression sum of squares made by each independent variable after all the other independent variables have been included in the model. The new independent variable is included only if it significantly improves the model.
To conduct a partial F test in the car export example, we first need to evaluate the contribution of wage price index (X2) after exchange rate (X1) has been included in the model, and the contribution of exchange rate (X1) after wage price index (X2) has been included in the model. In general, if there are several independent variables, you determine the contribution of each independent variable by taking into account the regression sum of squares of a model that includes all independent variables except the one of interest, SSR(all variables except j). Equation 13.9 determines the contribution of variable j, assuming that all other variables are already included.

DETERMINING THE CONTRIBUTION OF AN INDEPENDENT VARIABLE TO THE REGRESSION MODEL

SSR(Xj | all variables except j) = SSR(all variables including j) − SSR(all variables except j)   (13.9)

If there are two independent variables, use Equations 13.10a and 13.10b to determine the contribution of each.

CONTRIBUTION OF VARIABLE X1 GIVEN THAT X2 HAS BEEN INCLUDED

SSR(X1 | X2) = SSR(X1 and X2) − SSR(X2)   (13.10a)

CONTRIBUTION OF VARIABLE X2 GIVEN THAT X1 HAS BEEN INCLUDED

SSR(X2 | X1) = SSR(X1 and X2) − SSR(X1)   (13.10b)

The term SSR(X2) represents the sum of squares that is due to regression for a model that includes only the independent variable X2 (wage price index). Similarly, SSR(X1) represents the sum of squares that is due to regression for a model that includes only the independent variable X1 (exchange rate). Figures 13.6 and 13.7 present Microsoft Excel output for these two models.




Figure 13.6  Microsoft Excel output of simple linear regression model for car exports and wage price index, SSR(X2)

Regression statistics: Multiple R 0.3072; R square 0.0944; Adjusted R square 0.0247; Standard error 90.1666; Observations 15

ANOVA        df   SS             MS            F        Significance F
Regression   1    11,011.6380    11,011.6380   1.3544   0.2654
Residual     13   105,690.0953   8,130.0073
Total        14   116,701.7333

            Coefficients   Standard error   t stat    p-value   Lower 95%   Upper 95%
Intercept   1,703.5424     1,054.0490       1.6162    0.1301    -573.5921   3,980.6769
wpi         -10.1468       8.7186           -1.1638   0.2654    -28.9823    8.6887

Figure 13.7  Microsoft Excel output of simple linear regression model for car exports and exchange rates, SSR(X1)

Regression statistics: Multiple R 0.0072; R square 0.0001; Adjusted R square -0.0769; Standard error 94.7449; Observations 15

ANOVA        df   SS             MS           F        Significance F
Regression   1    6.0596         6.0596       0.0007   0.9797
Residual     13   116,695.6737   8,976.5903
Total        14   116,701.7333

            Coefficients   Standard error   t stat   p-value   Lower 95%   Upper 95%
Intercept   470.7286       247.7224         1.9002   0.0798    -64.4431    1,005.9002
xrate       0.0791         3.0443           0.0260   0.9797    -6.4977     6.6559

From Figure 13.6, SSR(X2) = 11,011.6380, and from Figure 13.1, SSR(X1 and X2) = 40,382.0465. Then, using Equation 13.10a:

SSR(X1| X2) = SSR(X1 and X2) − SSR(X2) = 40,382.0465 − 11,011.6380 = 29,370.4085

To determine whether X1 significantly improves the model after X2 has been included, we subdivide the regression sum of squares into two component parts, as shown in Table 13.3. The null and alternative hypotheses to test for the contribution of X1 to the model are:
H0: Variable X1 does not significantly improve the model after variable X2 has been included.
H1: Variable X1 significantly improves the model after variable X2 has been included.


Table 13.3  ANOVA table dividing the regression sum of squares into components to determine the contribution of variable X1

Source       Degrees of freedom   Sum of squares   Mean square (variance)   F
Regression   2                    40,382.0465      20,191.0233
  X2         1                    11,011.6380
  X1 | X2    1                    29,370.4086      29,370.4086              4.6180
Error        12                   76,319.6868      6,359.9739
Total        14                   116,701.7333

Equation 13.11 defines the partial F test statistic for testing the contribution of an independent variable.

PARTIAL F TEST STATISTIC

F = SSR(Xj | all variables except j)/MSE   (13.11)

The partial F test statistic follows an F distribution with 1 and n − k − 1 degrees of freedom.

From Table 13.3:

F = 29,370.4086/6,359.9739 = 4.6180

The partial F test statistic has 1 and n − k − 1 = 15 − 2 − 1 = 12 degrees of freedom. Using a level of significance of 0.05, the critical value from Table E.5 is 4.75 (see Figure 13.8).

Figure 13.8  Testing for the contribution of a regression coefficient to the multiple regression model at the 0.05 level of significance with 1 and 12 degrees of freedom
[F distribution with the region of non-rejection below the critical value 4.75 and the region of rejection above it]

Because the partial F test statistic is less than the critical F value (4.6180 < 4.75), we do not reject H0 and conclude that the addition of variable X1 (exchange rate) does not significantly improve a regression model that already contains variable X2 (wage price index). To evaluate the contribution of variable X2 (wage price index) to a model in which variable X1 (exchange rate) has been included, use Equation 13.10b. First, from Figure 13.7, observe that SSR(X1) = 6.0596. Second, from Table 13.3, observe that SSR(X1 and X2) = 40,382.0465. Then, using Equation 13.10b: SSR(X2 | X1) = 40,382.0465 − 6.0596 = 40,375.9869





To determine whether X2 significantly improves a model after X1 has been included, we can subdivide the regression sum of squares into two component parts, as shown in Table 13.4.

Table 13.4  ANOVA table dividing the regression sum of squares into components to determine the contribution of variable X2

Source       Degrees of freedom   Sum of squares   Mean square (variance)   F
Regression   2                    40,382.0465      20,191.0233
  X1         1                    6.0596
  X2 | X1    1                    40,375.9869      40,375.9869              6.3484
Error        12                   76,319.6868      6,359.9739
Total        14                   116,701.7333

The null and alternative hypotheses to test for the contribution of X2 to the model are:
H0: Variable X2 does not significantly improve the model after variable X1 has been included.
H1: Variable X2 significantly improves the model after variable X1 has been included.
Using Equation 13.11 and Table 13.4:

F = 40,375.9869/6,359.9739 = 6.3484

In Figure 13.8, you can see that, using a 0.05 level of significance, the critical value of F with 1 and 12 degrees of freedom is 4.75. Since the partial F test statistic is greater than the critical value (6.3484 > 4.75), we reject H0 and conclude that the addition of variable X2 (wage price index) significantly improves the multiple regression model already containing X1 (exchange rate).
Thus, by testing for the contribution of each independent variable after the other has been included in the model, we find that wage price index, X2, significantly improves a model that already contains exchange rate, X1, but that exchange rate does not significantly improve a model that already contains wage price index. This is consistent with the t test results of Section 13.4.
The partial F test statistic developed in this section and the t test statistic of Equation 13.7 are both used to determine the contribution of an independent variable to a multiple regression model. In fact, the hypothesis tests associated with these two statistics always result in the same decision (i.e. the p-values are identical). The t values for the car export regression model are −2.1490 and −2.5196, and the corresponding F values are 4.618 and 6.3484. Equation 13.12 illustrates the relationship between t and F.

THE RELATIONSHIP BETWEEN A t STATISTIC AND AN F STATISTIC

t²a = F1,a   (13.12)

where a = the degrees of freedom
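The partial F calculations can be verified from the sums of squares in Tables 13.3 and 13.4. The sketch below (Python, scipy assumed installed) reproduces both partial F statistics and the 0.05 critical value.

```python
from scipy import stats

SST = 116_701.7333
SSR_both = 40_382.0465    # SSR(X1 and X2)
SSR_x1 = 6.0596           # SSR(X1): exchange rate only (Figure 13.7)
SSR_x2 = 11_011.6380      # SSR(X2): wage price index only (Figure 13.6)
n, k = 15, 2
MSE = (SST - SSR_both) / (n - k - 1)

def partial_f(ssr_full, ssr_reduced, mse):
    """Equation 13.11: contribution of one variable given the other."""
    return (ssr_full - ssr_reduced) / mse

F_x1_given_x2 = partial_f(SSR_both, SSR_x2, MSE)   # about 4.62
F_x2_given_x1 = partial_f(SSR_both, SSR_x1, MSE)   # about 6.35
f_crit = stats.f.ppf(0.95, 1, n - k - 1)           # about 4.75

print(F_x1_given_x2, F_x2_given_x1, f_crit)
```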

Coefficients of Partial Determination

Recall from Section 13.2 that the coefficient of multiple determination, R², measures the proportion of the variation in Y that is explained by variation in the two independent variables. Now, we examine the contribution of each independent variable to the multiple regression model, while holding constant the other variable. The coefficients of partial determination (R²Y1.2 and R²Y2.1) measure the proportion of the variation in the dependent variable that is explained by each independent variable while controlling for, or holding constant, the other independent variable. Equation 13.13 defines the coefficients of partial determination for a multiple regression model with two independent variables.

coefficient of partial determination In regression the independent proportion of the variation in the Y dependent variable that is explained by each X independent variable.



COEFFICIENTS OF PARTIAL DETERMINATION FOR A MULTIPLE REGRESSION MODEL CONTAINING TWO INDEPENDENT VARIABLES

R²Y1.2 = SSR(X1 | X2) / [SST − SSR(X1 and X2) + SSR(X1 | X2)]   (13.13a)

and

R²Y2.1 = SSR(X2 | X1) / [SST − SSR(X1 and X2) + SSR(X2 | X1)]   (13.13b)

where  SSR(X1 | X2) = sum of squares of the contribution of variable X1 to the regression model given that variable X2 has been included in the model
       SST = total sum of squares for Y
       SSR(X1 and X2) = regression sum of squares when variables X1 and X2 are both included in the multiple regression model
       SSR(X2 | X1) = sum of squares of the contribution of variable X2 to the regression model given that variable X1 has been included in the model

Equation 13.14 defines the coefficient of partial determination for the jth variable in a multiple regression model containing several (k) independent variables.

COEFFICIENT OF PARTIAL DETERMINATION FOR A MULTIPLE REGRESSION MODEL CONTAINING k INDEPENDENT VARIABLES

R²Yj(all variables except j) = SSR(Xj | all variables except j) / [SST − SSR(all variables including j) + SSR(Xj | all variables except j)]   (13.14)

For the car export example:

R²Y1.2 = 29,370.4086 / (116,701.7333 − 40,382.0465 + 29,370.4086) = 0.2779

and:

R²Y2.1 = 40,375.9869 / (116,701.7333 − 40,382.0465 + 40,375.9869) = 0.3460

The coefficient of partial determination of Y with X1 while holding X2 constant is 0.2779. Thus, for a given (constant) wage price index, 27.79% of the variation in car exports is explained by the variation in the exchange rate. The coefficient of partial determination of Y with X2 while holding X1 constant is 0.3460. Thus, for a given (constant) exchange rate, 34.6% of the variation in car exports is explained by variation in the wage price index.
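The coefficients of partial determination use the same sums of squares. A minimal Python sketch of Equations 13.13a and 13.13b for the car exports example:

```python
SST = 116_701.7333
SSR_both = 40_382.0465        # SSR(X1 and X2)
SSR_x1_given_x2 = 29_370.4086
SSR_x2_given_x1 = 40_375.9869

r2_y1_2 = SSR_x1_given_x2 / (SST - SSR_both + SSR_x1_given_x2)
r2_y2_1 = SSR_x2_given_x1 / (SST - SSR_both + SSR_x2_given_x1)

print(f"R^2_Y1.2 = {r2_y1_2:.4f}, R^2_Y2.1 = {r2_y2_1:.4f}")
# Expected output: R^2_Y1.2 = 0.2779, R^2_Y2.1 = 0.3460
```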





Problems for Section 13.5

LEARNING THE BASICS
13.30 The following is the ANOVA summary table for a multiple regression model with two independent variables.

Source       Degrees of freedom   Sum of squares   Mean squares   F
Regression   2                    60
Error        18                   120
Total        20                   180

If SSR(X1) = 45 and SSR(X2) = 25:
a. Determine whether there is a significant relationship between Y and each of the independent variables at the 0.05 level of significance.
b. Calculate the coefficients of partial determination, R²Y1.2 and R²Y2.1, and interpret their meaning.
13.31 The following is the ANOVA summary table for a multiple regression model with two independent variables.

Source       Degrees of freedom   Sum of squares   Mean squares   F
Regression   2                    234
Error        13                   346
Total        15                   580

If SSR(X1) = 124 and SSR(X2) = 97:
a. Determine whether there is a significant relationship between Y and each of the independent variables at the 0.05 level of significance.
b. Calculate the coefficients of partial determination, R²Y1.2 and R²Y2.1, and interpret their meaning.

APPLYING THE CONCEPTS
Use Microsoft Excel to solve problems 13.32 to 13.36.
13.32 In problem 13.5 on page 509, you used GDP and population density to estimate CO2 emissions. Using the computer output from that problem:
a. At the 0.05 level of significance, determine whether each independent variable makes a significant contribution to the regression model. On the basis of these results, indicate the most appropriate regression model for this set of data.
b. Calculate the coefficients of partial determination, R²Y1.2 and R²Y2.1, and interpret their meaning.
13.33 In problem 13.4 on page 509, a financial planner used unemployment rates and share market returns to predict retirement rates for OECD countries. Using the computer output from that problem:
a. At the 0.05 level of significance, determine whether each independent variable makes a significant contribution to the regression model. On the basis of these results, indicate the most appropriate regression model for this set of data.
b. Calculate the coefficients of partial determination, R²Y1.2 and R²Y2.1, and interpret their meaning.
13.34 In problem 13.7 on page 510, you used GDP per capita and CPI to estimate the percentage of very happy people in a country. Using the computer output from that problem:
a. At the 0.05 level of significance, determine whether each independent variable makes a significant contribution to the regression model. On the basis of these results, indicate the most appropriate regression model for this set of data.
b. Calculate the coefficients of partial determination, R²Y1.2 and R²Y2.1, and interpret their meaning.
13.35 In problem 13.6 on page 510, a wine expert used alcohol content and chlorides to predict wine quality. Using the computer output from that problem:
a. At the 0.05 level of significance, determine whether each independent variable makes a significant contribution to the regression model. On the basis of these results, indicate the most appropriate regression model for this set of data.
b. Calculate the coefficients of partial determination, R²Y1.2 and R²Y2.1, and interpret their meaning.
13.36 In problem 13.8 on page 510, you used radio and newspaper advertising to predict product sales. Using the computer output from that problem:
a. At the 0.05 level of significance, determine whether each independent variable makes a significant contribution to the regression model. On the basis of these results, indicate the most appropriate regression model for this set of data.
b. Calculate the coefficients of partial determination, R²Y1.2 and R²Y2.1, and interpret their meaning.

LEARNING OBJECTIVE 3
Incorporate categorical and interactive variables into a regression model

13.6  USING DUMMY VARIABLES AND INTERACTION TERMS IN REGRESSION MODELS

The multiple regression models discussed in Sections 13.1 to 13.5 assume that each independent (also called explanatory) variable is numerical. However, in some situations we might want to include categorical variables as independent variables in the regression model. For example, in Section 13.1, we used exchange rates and wage price index to predict car exports.



In addition to these numerical independent variables, we may want to include the impact of a change in government policy when developing a model to predict car exports. The use of dummy variables allows us to include categorical independent variables as part of the regression model.

dummy variable  Variable that takes the values 0 or 1 to indicate the absence or presence of some categorical effect that may be expected to alter the result of the analysis.

If a given categorical independent variable has two categories, then we need only one dummy variable to represent the two categories. A particular dummy variable, Xd, is defined as:
Xd = 0 if the observation is in category 1
Xd = 1 if the observation is in category 2
To illustrate the application of dummy variables in regression, consider a model for predicting the assessed value of a sample of 15 apartments based on the size of the apartment (in hundreds of square metres) and whether or not the apartment has an ensuite bathroom. To include the categorical variable concerning the presence of an ensuite bathroom, the dummy variable (X2) is defined as:
X2 = 0 if the apartment does not have an ensuite
X2 = 1 if the apartment has an ensuite
In the last column of Table 13.5, you can see how the categorical data are converted to numerical values.

Table 13.5  Predicting assessed value based on size of apartment and presence of an ensuite

House   Y = assessed value ($000)   X1 = size of dwelling (hundreds of square metres)   Ensuite   X2 = ensuite bathroom
 1      84.4                        2.00                                                 Yes       1
 2      77.4                        1.71                                                 No        0
 3      75.7                        1.45                                                 No        0
 4      85.9                        1.76                                                 Yes       1
 5      79.1                        1.93                                                 No        0
 6      70.4                        1.20                                                 Yes       1
 7      75.8                        1.55                                                 Yes       1
 8      85.9                        1.93                                                 Yes       1
 9      78.5                        1.59                                                 Yes       1
10      79.2                        1.50                                                 Yes       1
11      86.7                        1.90                                                 Yes       1
12      79.3                        1.39                                                 Yes       1
13      74.5                        1.54                                                 No        0
14      83.8                        1.89                                                 Yes       1
15      76.8                        1.59                                                 No        0

Assuming that the slope of assessed value with the size of the apartment is the same for apartments that have and do not have an ensuite, the multiple regression model is:

Yi = β0 + β1X1i + β2X2i + εi

where  Yi = assessed value in thousands of dollars for apartment i
       β0 = Y intercept
       X1i = size of the apartment in hundreds of square metres for apartment i
       β1 = slope of assessed value with size of the apartment, holding constant the effect of the presence of an ensuite
       X2i = dummy variable representing the presence or absence of an ensuite for apartment i
       β2 = incremental effect of the presence of an ensuite, holding constant the effect of the size of the apartment
       εi = random error in Y for apartment i



Figure 13.9  Microsoft Excel output for the regression model that includes size of the apartment and presence of an ensuite

Regression statistics: Multiple R 0.90059; R square 0.81106; Adjusted R square 0.77957; Standard error 2.26260; Observations 15

ANOVA        df   SS          MS          F          Significance F
Regression   2    263.70391   131.85196   25.75565   4.54968E-05
Residual     12   61.43209    5.11934
Total        14   325.136

            Coefficients   Standard error   t stat      p-value       Lower 95%   Upper 95%
Intercept   50.09049       4.351658         11.510668   7.67943E-08   40.60904    59.57193
Size        16.18583       2.574442         6.287124    4.02437E-05   10.57661    21.79506
Ensuite     3.85298        1.241223         3.104183    0.00912       1.14859     6.55737

Figure 13.9 illustrates the Microsoft Excel output for this model. From Figure 13.9, the regression equation is:

Ŷi = 50.09 + 16.186X1i + 3.853X2i

For apartments without an ensuite, you substitute X2 = 0 into the regression equation:

Ŷi = 50.09 + 16.186X1i + 3.853(0) = 50.09 + 16.186X1i

For apartments with an ensuite, you substitute X2 = 1 into the regression equation:

Ŷi = 50.09 + 16.186X1i + 3.853(1) = 53.943 + 16.186X1i

In this model, the regression coefficients are interpreted as follows:
1. Holding constant whether or not an apartment has an ensuite, for each increase of 100 square metres in the size of the apartment, the mean assessed value is estimated to increase by 16.186 thousand dollars (or $16,186).
2. Holding constant the size of the apartment, the presence of an ensuite is estimated to increase the mean value of the apartment by 3.853 thousand dollars (or $3,853).
In Figure 13.9, the t statistic for the slope of the size of the apartment with assessed value is 6.29 and the p-value is approximately 0.000; the t statistic for presence of an ensuite is 3.10 and the p-value is 0.009. Thus, both variables make a significant contribution to the model at a level of significance of 0.01. In addition, the coefficient of multiple determination indicates that 81.1% of the variation in assessed value is explained by variation in the size of the apartment and whether or not the apartment has an ensuite.
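The same dummy-variable regression can be fitted outside Excel. The sketch below uses Python with pandas and statsmodels (assumed installed) and the data from Table 13.5; the estimates should be close to the values reported in Figure 13.9.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Table 13.5 data; 'ensuite' is 1 if the apartment has an ensuite, 0 otherwise.
apartments = pd.DataFrame({
    'value':   [84.4, 77.4, 75.7, 85.9, 79.1, 70.4, 75.8, 85.9,
                78.5, 79.2, 86.7, 79.3, 74.5, 83.8, 76.8],
    'size':    [2.00, 1.71, 1.45, 1.76, 1.93, 1.20, 1.55, 1.93,
                1.59, 1.50, 1.90, 1.39, 1.54, 1.89, 1.59],
    'ensuite': [1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0],
})

model = smf.ols('value ~ size + ensuite', data=apartments).fit()
print(model.params)      # roughly: Intercept 50.09, size 16.19, ensuite 3.85
print(model.rsquared)    # roughly 0.811
```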

EXAMPLE 13.3
MODELLING A THREE-LEVEL CATEGORICAL VARIABLE
Define a multiple regression model using sales as the dependent variable and package design and price as independent variables. Package design is a three-level categorical variable with designs A, B and C.



SOLUTION

To model a three-level categorical variable, two dummy variables are needed:
X1i = 1 if package design A is used in observation i; 0 otherwise
X2i = 1 if package design B is used in observation i; 0 otherwise
If observation i is for package design A, then X1i = 1 and X2i = 0; for package design B, X1i = 0 and X2i = 1; and for package design C, X1i = X2i = 0. A third independent variable is used for price:
X3i = price for observation i
Thus, the regression model for this example is:

Yi = β0 + β1X1i + β2X2i + β3X3i + εi

where  Yi = sales for observation i
       β0 = Y intercept
       β1 = difference between the mean sales of design A and the mean sales of design C, holding the price constant
       β2 = difference between the mean sales of design B and the mean sales of design C, holding the price constant
       β3 = slope of sales with price, holding the package design constant
       εi = random error in Y for observation i
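Creating the two dummy variables of Example 13.3 is routine in software. The sketch below is illustrative only: the sales, design and price values are hypothetical, design C is the baseline category, and pandas/statsmodels are assumed to be installed.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: sales, a three-level package design (A, B, C) and price.
sales_data = pd.DataFrame({
    'sales':  [52, 48, 60, 55, 43, 40, 58, 50, 45],
    'design': ['A', 'A', 'B', 'B', 'C', 'C', 'A', 'B', 'C'],
    'price':  [2.5, 2.9, 2.4, 2.8, 2.6, 3.0, 2.7, 2.5, 2.8],
})

# Two dummy variables (design_A, design_B); design C is the omitted baseline.
dummies = pd.get_dummies(sales_data['design'], prefix='design', dtype=int)
sales_data = pd.concat([sales_data, dummies.drop(columns='design_C')], axis=1)

model = smf.ols('sales ~ design_A + design_B + price', data=sales_data).fit()
print(model.params)   # b1 and b2 compare designs A and B with design C
```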

Interactions

interaction  The impact of one independent variable depends on the value of another independent variable.
interaction term  Refers to interaction between the X independent variables; more specifically, the effect one independent variable has upon another independent variable.
cross-product term  Another name for the interaction term.

In all the regression models discussed so far, the effect an independent variable has on the dependent variable was assumed to be statistically independent of the other independent variables in the model. An interaction occurs if the effect of an independent variable on the response variable is dependent on the value of a second independent variable. For example, it is possible for advertising to have a large effect on the sales of a product when the price of a product is low. However, if the price of the product is too high, increases in advertising will not dramatically change sales. In this case, sales and advertising are said to interact. In other words, you cannot make general statements about the effect of advertising on sales: the effect that advertising has on sales is dependent on the price. You use an interaction term (sometimes referred to as a cross-product term) to model an interaction effect in a regression model.
To illustrate the concept of interaction and the use of an interaction term, return to the example concerning the assessed values of apartments discussed above. In the regression model, we assumed that the effect the size of the apartment has on the assessed value is independent of whether or not the apartment has an ensuite bathroom. In other words, we assumed that the slope of assessed value with size is the same for apartments with ensuites as it is for apartments without ensuites. If these two slopes are different, an interaction between size of the apartment and ensuite exists.
To evaluate a hypothesis of equal slopes of a Y variable with X, you first define an interaction term that consists of the product of the independent variable X1 and the dummy variable X2. You then test whether this interaction variable makes a significant contribution to a regression model that contains the other X variables. If the interaction is significant, you cannot use the original model for prediction. For the data of Table 13.5 on page 526, let:

X3 = X1 × X2

Figure 13.10 illustrates the Microsoft Excel output for this regression model, which includes the size of the apartment X1, the presence of an ensuite X2 and the interaction of X1 and X2 (which is defined as X3). To test for the existence of an interaction, we use the null hypothesis H0: β3 = 0 versus the alternative hypothesis H1: β3 ≠ 0. In Figure 13.10, the t statistic for the interaction of size and


Figure 13.10 Microsoft Excel output for a regression model that includes size, presence of ensuite bathroom and interaction of size and ensuite bathroom

Assessed value analysis

Regression statistics
Multiple R           0.9179
R square             0.8426
Adjusted R square    0.7996
Standard error       2.1573
Observations         15

ANOVA
              df          SS         MS          F    Significance F
Regression     3    273.9441    91.3147    19.6215            0.0001
Residual      11     51.1919     4.6538
Total         14    325.1360

                Coefficients   Standard error    t stat   p-value   Lower 95%   Upper 95%
Intercept            62.9522           9.6122    6.5492    0.0000     41.7959     84.1085
Size                  8.3624           5.8173    1.4375    0.1784     -4.4414     21.1662
Ensuite             -11.8404          10.6455   -1.1122    0.2898    -35.2710     11.5902
Size*Ensuite          9.5180           6.4165    1.4834    0.1661     -4.6046     23.6406

In Figure 13.10, the t statistic for the interaction of size and ensuite is 1.48. Because the p-value = 0.166 > 0.05, we do not reject the null hypothesis. Therefore, the interaction term does not make a significant contribution to the model, given that size and presence of an ensuite are already included.
Regression models can have several numerical independent variables. Example 13.4 illustrates a regression model in which there are two numerical independent variables as well as a categorical independent variable.

EXAMPLE 13.4
STUDYING A REGRESSION MODEL THAT CONTAINS A DUMMY VARIABLE
A real estate developer wants to predict electricity consumption (in kilowatts) based on atmospheric temperature (°C), X1, and the amount of ceiling insulation (10 mm), X2. Suppose that of 15 houses selected, houses 7, 8, 13, 14, 15 and 18 are open-living-design houses. Develop and analyse an appropriate regression model using these three independent variables, X1, X2 and X3 (the dummy variable for open-living-design houses). < ELECTRIC >

SOLUTION
Define X3, a dummy variable for an open-living-design house, as follows:
X3 = 0 if the style is not open living
X3 = 1 if the style is open living
Assuming that the slope between home electricity consumption and atmospheric temperature, X1, and between electricity consumption and the amount of ceiling insulation, X2, is the same for both styles of house, the regression model is:
Yi = β0 + β1X1i + β2X2i + β3X3i + εi
where
Yi = monthly electricity consumption in kilowatts for house i
β0 = Y intercept
β1 = slope of electricity consumption with atmospheric temperature, holding constant the effect of ceiling insulation and the style of house
β2 = slope of electricity consumption with ceiling insulation, holding constant the effect of atmospheric temperature and the style of house


β3 = incremental effect of the presence of an open-living-style house on electricity consumption, holding constant the effect of atmospheric temperature and ceiling insulation
εi = random error in Y for house i
Figure 13.11 displays the Microsoft Excel output.

Figure 13.11 Microsoft Excel output for a regression model that includes temperature, insulation and style for the electricity data

Electricity consumption analysis

Regression statistics
Multiple R           0.9124
R square             0.8325
Adjusted R square    0.7868
Standard error       18.4809
Observations         15

ANOVA
              df            SS          MS          F    Significance F
Regression     3    18668.6093   6222.8698    18.2198            0.0001
Residual      11     3756.9907    341.5446
Total         14    22425.6000

                   Coefficients   Standard error    t stat   p-value   Lower 95%   Upper 95%
Intercept              227.2347          30.2729    7.5062    0.0000    160.6045    293.8648
Temperature (C)         -3.7850           1.3084   -2.8928    0.0146     -6.6647     -0.9052
Insulation              -6.2345           3.3488   -1.8617    0.0896    -13.6052      1.1363
Style                   15.0042          13.8750    1.0814    0.3027    -15.5344     45.5428

From the output in Figure 13.11, the regression equation is:
Ŷi = 227.235 − 3.785X1i − 6.234X2i + 15.004X3i
For houses that are not open-living style, X3 = 0, so this reduces to:
Ŷi = 227.235 − 3.785X1i − 6.234X2i
For houses that are open-living style, X3 = 1, so this reduces to:
Ŷi = 227.235 − 3.785X1i − 6.234X2i + 15.004
   = 242.239 − 3.785X1i − 6.234X2i

The regression coefficients are interpreted as follows:
1. Holding constant the effect of ceiling insulation and house style, for each additional 1°C increase in atmospheric temperature, we estimate that the mean electricity consumption decreases by 3.785 kilowatts.
2. Holding constant the effect of atmospheric temperature and house style, for each additional 10 mm increase in ceiling insulation, we estimate that the mean electricity consumption decreases by 6.234 kilowatts.
3. b3 measures the effect on electricity consumption of having an open-living-style house (X3 = 1) compared with having a house that is not open-living style (X3 = 0). Thus, with atmospheric temperature and ceiling insulation held constant, we estimate that the mean electricity consumption is 15.004 kilowatts more for an open-living-style house than for a house that is not open-living style.
The three t statistics representing the slopes for temperature, insulation and open-living style are −2.89, −1.86 and 1.08. The corresponding p-values vary from a small 0.01 to 0.09 to 0.30, so that only temperature is statistically significant in determining electricity consumption (at the 0.05 level of significance). However, the coefficient of multiple determination indicates that 83.3% of the variation in electricity use is explained by variation in the temperature and insulation, and whether the house is open-living style.
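As a quick illustration of how the two style-specific equations are used, the following sketch plugs the coefficients reported in Figure 13.11 into the fitted equation for both house styles. The temperature and insulation values chosen are arbitrary illustrative inputs, not values taken from the < ELECTRIC > data.

```python
# Fitted coefficients from Figure 13.11: intercept, temperature, insulation, style dummy
b0, b1, b2, b3 = 227.235, -3.785, -6.234, 15.004

def predicted_consumption(temperature_c, insulation_10mm, open_living):
    """Predicted monthly electricity consumption (kilowatts) from the Example 13.4 model."""
    x3 = 1 if open_living else 0
    return b0 + b1 * temperature_c + b2 * insulation_10mm + b3 * x3

# Illustrative inputs (our choice): 10 degrees C and insulation level 5 (i.e. 50 mm)
print(predicted_consumption(10, 5, open_living=False))  # 227.235 - 37.85 - 31.17 = 158.215
print(predicted_consumption(10, 5, open_living=True))   # the same plus 15.004    = 173.219
```

The two predictions differ by exactly b3 = 15.004 kilowatts, reflecting the parallel-slopes assumption built into the model.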

Before you can use the model in Example 13.4, you need to determine whether the independent variables interact with each other. In Example 13.5, three interaction terms are added to the model.

EXAMPLE 13.5
EVALUATING A REGRESSION MODEL WITH SEVERAL INTERACTIONS
For the data of Example 13.4, determine whether adding the interaction terms makes a significant contribution to the regression model.

SOLUTION
To evaluate possible interactions between the independent variables, three interaction terms are constructed. Let X4 = X1 * X2, X5 = X1 * X3 and X6 = X2 * X3. The regression model is now:
Yi = β0 + β1X1i + β2X2i + β3X3i + β4X4i + β5X5i + β6X6i + εi
where X1 is temperature, X2 is insulation, X3 is the dummy variable open-living style, X4 is the interaction term between temperature and insulation, X5 is the interaction term between temperature and open-living style, and X6 is the interaction term between insulation and open-living style.
To test whether the three interaction terms significantly improve the regression model, we use the partial F test. The null and alternative hypotheses are:
H0: β4 = β5 = β6 = 0 (There are no interactions among X1, X2 and X3.)
H1: β4 ≠ 0 and/or β5 ≠ 0 and/or β6 ≠ 0 (X1 interacts with X2, and/or X1 interacts with X3, and/or X2 interacts with X3.)
From Figure 13.12, overleaf:
SSR(X1, X2, X3, X4, X5, X6) = 20,239.42 with 6 degrees of freedom
and, from Figure 13.11:
SSR(X1, X2, X3) = 18,668.61 with 3 degrees of freedom
Thus, SSR(X1, X2, X3, X4, X5, X6) − SSR(X1, X2, X3) = 20,239.42 − 18,668.61 = 1,570.81, and the difference in degrees of freedom is 6 − 3 = 3.
To use the partial F test for the simultaneous contribution of three variables to a model, we use an extension of Equation 13.11. The partial F test statistic is:
F = {[SSR(X1, X2, X3, X4, X5, X6) − SSR(X1, X2, X3)]/3} / MSE(X1, X2, X3, X4, X5, X6)
  = (1,570.81/3) / 273.27
  = 1.92
We compare the F statistic of 1.92 with the critical F value for 3 and 8 degrees of freedom. Using a level of significance of α = 0.01, the critical F value from Table E.5 is 7.59. Because 1.92 < 7.59, we conclude that the interaction terms do not make a significant contribution to the model, given that the model already includes temperature X1, insulation X2 and whether the house is open-living style X3. Therefore, the multiple regression model using X1, X2 and X3 with no interaction terms is the better model. Had we rejected this null hypothesis, we would then test the contribution of each interaction separately in order to determine which interaction terms to include in the model.
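The partial F calculation above can be reproduced directly from the sums of squares reported in Figures 13.11 and 13.12. The sketch below is not the book's Excel procedure; it simply recomputes the statistic and, for convenience, obtains the α = 0.01 critical value (which the text reads from Table E.5) from SciPy, assuming SciPy is available.

```python
from scipy import stats

# Sums of squares from Figures 13.11 and 13.12
ssr_full    = 20239.42   # SSR(X1, X2, X3, X4, X5, X6)
ssr_reduced = 18668.61   # SSR(X1, X2, X3)
mse_full    = 273.27     # MSE of the six-variable model
q = 3                    # number of interaction terms being tested
df_error = 8             # error df of the six-variable model: n - k - 1 = 15 - 6 - 1

# Partial F statistic for the simultaneous contribution of the three interaction terms
partial_f = ((ssr_full - ssr_reduced) / q) / mse_full
f_critical = stats.f.ppf(0.99, q, df_error)   # upper 1% critical value for (3, 8) df

print(round(partial_f, 2))    # about 1.92
print(round(f_critical, 2))   # about 7.59, matching Table E.5
print("reject H0" if partial_f > f_critical else "do not reject H0")
```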


Figure 13.12 Microsoft Excel output for a regression model that includes temperature X1, insulation X2, the dummy variable open-living style X3, the interaction of temperature and insulation X4, the interaction of temperature and open-living style X5 and the interaction of insulation and open-living style X6

Regression analysis

Regression statistics
Multiple R           0.9500
R square             0.9025
Adjusted R square    0.8294
Standard error       16.5310
Observations         15

ANOVA
              df            SS          MS          F    Significance F
Regression     6    20239.4171   3373.2362    12.3438            0.0012
Residual       8     2186.1829    273.2729
Total         14    22425.6000

                          Coefficients   Standard error    t stat   p-value   Lower 95%   Upper 95%
Intercept                     261.0759         178.9387    1.4590    0.1827   -151.5579    673.7096
Temperature C                  -6.3589           7.1592   -0.8882    0.4003    -22.8681     10.1504
Insulation                    -14.6268          30.0498   -0.4868    0.6395    -83.9218     54.6682
Style                          79.5621         102.1086    0.7792    0.4583   -155.9008    315.0250
Temperature*Insulation          0.5313           1.1833    0.4490    0.6653     -2.1974      3.2600
Temperature*Style              -2.8582           4.2060   -0.6796    0.5160    -12.5572      6.8408
Insulation*Style               -3.4437          11.7515   -0.2930    0.7769    -30.5427     23.6553

Problems for Section 13.6
LEARNING THE BASICS

13.37 Suppose X1 is a numerical variable and X2 is a dummy variable, and the regression equation estimated from a sample of n = 35 is:
Ŷi = 12 + 5X1i + 0.5X2i

a. Interpret the meaning of the slope for variable X1. b. Interpret the meaning of the slope for variable X2. c. Suppose that the t statistic for testing the contribution of variable X2 is 2.67. At the 0.05 level of significance, is there evidence that variable X2 makes a significant contribution to the model?

APPLYING THE CONCEPTS You need to use Microsoft Excel to solve problems 13.39 to 13.46.

13.38 The manager of a music production house wants to predict CD sales on the basis of the amount spent on advertising and whether the music had air-time on popular radio during the preceding week (no air-time = 0 and air-time = 1). a. Explain the steps involved in developing a regression model for these data. Be sure to indicate the particular models you need to evaluate and compare. b. Suppose the regression coefficient for the variable of whether or not the music received air-time is +0.30. How do you interpret this result?

13.39 A real estate association in Melbourne would like to study the relationship between the size of a family house (measured by the number of rooms) and the selling price of the house (in thousands of dollars). Two different neighbourhoods are included in the study, one on the east Dandenong side (= 0) and the other on the west Sunshine side (= 1). A random sample of 20 houses was selected with the following results given in the file: < HOUSE_SIZE >

House   Selling price ($'000)   Number of rooms   Location
1       345                     8                 0
2       655                     9                 0
3       325                     7                 1
4       824                     12                0
5       432                     10                1
6       233                     6                 1
7       567                     9                 1
8       988                     13                0
9       199                     6                 1
10      934                     12                0
11      258                     6                 1
12      379                     10                1
13      355                     8                 1
14      643                     10                0
15      710                     11                0
16      585                     9                 0
17      677                     14                1
18      870                     12                0
19      670                     9                 0
20      280                     7                 1

a. State the multiple regression equation. b. Interpret the meaning of the slopes in this problem. c. Predict the selling price for a house with nine rooms located in Melbourne’s east and construct a 95% confidence interval estimate and a 95% prediction interval. d. Perform a residual analysis on the results and determine the adequacy of the model. e. Is there a significant relationship between selling price and the two independent variables (rooms and neighbourhood) at the 0.05 level of significance? f. At the 0.05 level of significance, determine whether each independent variable makes a contribution to the regression model. On the basis of these results, indicate the most appropriate regression model for this set of data. g. Construct 95% confidence interval estimates of the population slope for the relationship between selling price and number of rooms, and between selling price and neighbourhood. h. Interpret the meaning of the coefficient of multiple determination. i. Calculate the adjusted R 2. j. Calculate the coefficients of partial determination and interpret their meaning. k. What assumption do you need to make about the slope of selling price with number of rooms? l. Add an interaction term to the model and, at the 0.05 level of significance, determine whether it makes a significant contribution to the model. m. On the basis of the results of (f) and (l), which model is more appropriate? Explain. 13.40 A traffic engineer wants to predict the number of road accidents at intersections in a major urban area. He believes the main determinants are volume of traffic and whether or not the intersection has traffic lights. Traffic lights have the dummy value 1 and no traffic lights the value 0. A sample of 15 intersections is placed into a table.

Intersection   Number of accidents per month   Traffic volume ('000/day)   Traffic lights
1              12                              13                          1
2              25                              25                          0
3              18                              17                          1
4              10                              22                          1
5              20                              33                          0
6              8                               10                          1
7              11                              44                          1
8              22                              21                          0
9              24                              28                          0
10             16                              33                          1
11             9                               19                          1
12             15                              23                          1
13             8                               13                          0
14             18                              16                          1
15             23                              28                          0

a. State the multiple regression equation. b. Interpret the meaning of the slopes in this problem. c. Predict the number of accidents for an intersection with lights and 18,000 cars/day and construct a 95% confidence interval estimate and a 95% prediction interval. d. Perform a residual analysis on the results and determine the adequacy of the model. e. Is there a significant relationship between the number of accidents and the two independent variables (traffic volume and traffic lights) at the 0.05 level of significance? f. At the 0.05 level of significance, determine whether each independent variable makes a contribution to the regression model. On the basis of these results, indicate the most appropriate regression model for this set of data. g. Construct 95% confidence interval estimates of the population slope for the relationship between the number of accidents and traffic volume, and between accidents and traffic lights. h. Interpret the meaning of the coefficient of multiple determination. i. Calculate the adjusted R 2 and interpret the result. j. Calculate the coefficients of partial determination and interpret their meaning. k. What assumption about the slope of the number of accidents with volume of traffic do you need to make? l. Add an interaction term to the model and, at the 0.05 level of significance, determine whether it makes a significant contribution to the model. m. On the basis of the results of (f) and (l), which model is more appropriate? Explain. 13.41 In problem 13.6 on page 510, you developed a multiple regression model to predict wine quality for red wines. Now, you wish to determine whether the type of wine – white (0) or red (1) – has an effect on wine quality. These data are organised and stored in < VINHO_VERDE_RED_AND_WHITE >. Develop a multiple regression model to predict wine quality based on the percentage of alcohol and the type of wine. For (a)–(m), do not include an interaction term. a. State the multiple regression equation that predicts wine quality based on the percentage of alcohol and the type of wine. b. Interpret the regression coefficients in (a).


c. Predict the mean quality for a red wine that has 10% alcohol. Construct a 95% confidence interval estimate and a 95% prediction interval. d. Perform a residual analysis on the results and determine whether the regression assumptions are valid. e. Is there a significant relationship between wine quality and the two independent variables (percentage of alcohol and the type of wine) at the 0.05 level of significance? f. At the 0.05 level of significance, determine whether each independent variable makes a contribution to the regression model. Indicate the most appropriate regression model for this set of data. g. Construct and interpret 95% confidence interval estimates of the population slope for the relationship between wine quality and the percentage of alcohol and between wine quality and the type of wine. h. Compare the slope in (b) with the slope for the simple linear regression model of problem 13.6 on page 510. Explain the difference in the results. i. Calculate and interpret the meaning of the coefficient of multiple determination, r 2. j. Calculate and interpret the adjusted r 2. k. Compare r 2 with the r 2 value calculated in problem 13.16(c) on page 514. l. Calculate the coefficients of partial determination and interpret their meaning. m. What assumption about the slope of type of wine with wine quality do you need to make in this problem? n. Add an interaction term to the model and, at the 0.05 level of significance, determine whether it makes a significant contribution to the model. o. On the basis of the results of (f) and (n), which model is most appropriate? Explain. p. What conclusions can you reach concerning the effect of alcohol percentage and type of wine on wine quality? 13.42 In problem 13.4 on page 509, a financial planner used unemployment rates and share market returns to predict retirement rates in OECD countries. Develop a regression model that includes unemployment rates, share market returns and the interaction between unemployment rates and share market returns. a. At the 0.05 level of significance, is there evidence that the interaction term makes a significant contribution to the model? b. Which regression model is more appropriate, the one used in this problem or the one used in problem 13.4? Explain. 13.43 In problem 13.8 on page 510, you used radio and newspaper advertising to predict product sales. Develop a regression model that includes radio advertising and newspaper advertising and the interaction between radio and newspaper advertising. a. At the 0.05 level of significance, is there evidence that the interaction term makes a significant contribution to the model? b. Which regression model is more appropriate, the one used in this problem or the one used in problem 13.8? Explain. 13.44 In problem 13.5 on page 509, you used GDP and population density to estimate CO2 emissions. Develop a regression model

that includes GDP, population density, and the interaction of GDP and population density to predict CO2 emissions. a. At the 0.05 level of significance, is there evidence that the interaction term makes a significant contribution to the model? b. Which regression model is more appropriate, the one used in this problem or the one used in problem 13.5? Explain. 13.45 In problem 13.7 on page 510, you used GDP per capita and CPI to estimate the percentage of very happy people in a country. Develop a regression model to predict the happiness percentage using GDP per capita, CPI and the interaction of GDP per capita and CPI. a. At the 0.05 level of significance, is there evidence that the interaction term makes a significant contribution to the model? b. Which regression model is more appropriate, the one used in this problem or the one used in problem 13.7? Explain. 13.46 The director of a training program for a large insurance company has the business objective of determining which method is best for training underwriters. The three methods to be evaluated are classroom, online and courseware app. The 30 trainees are divided into three randomly assigned groups of 10. Before the start of the training, each trainee is given a proficiency exam that measures mathematics and computer skills. At the end of the training, all students take the same end-of-training exam. The results are organised and stored in < UNDERWRITING >. Develop a multiple regression model to predict the score on the end-of-training exam, based on the score on the proficiency exam and the method of training used. For (a)–(k), do not include an interaction term. a. State the multiple regression equation. b. Interpret the regression coefficients in (a). c. Predict the mean end-of-training exam score for a student with a proficiency exam score of 100 who had coursewareapp-based training. d. Perform a residual analysis on your results and determine whether the regression assumptions are valid. e. Is there a significant relationship between the end-oftraining exam score and the independent variables (proficiency score and training method) at the 0.05 level of significance? f. A t the 0.05 level of significance, determine whether each independent variable makes a contribution to the regression model. Indicate the most appropriate regression model for this set of data. g. Construct and interpret a 95% confidence interval estimate of the population slope for the relationship between the end-of-training exam score and the proficiency exam score. h. Construct and interpret 95% confidence interval estimates of the population slope for the relationship between the end-of-training exam score and type of training method. i. Calculate and interpret the adjusted r 2.


j. Calculate the coefficients of partial determination and interpret their meaning. k. What assumption about the slope of proficiency score with end-of-training exam score do you need to make in this problem?

l. Add interaction terms to the model and, at the 0.05 level of significance, determine whether any interaction terms make a significant contribution to the model. m. On the basis of the results of (f) and (l), which model is most appropriate? Explain.

13.7 COLLINEARITY
LEARNING OBJECTIVE 4 Detect collinearity using the variance inflationary factor (VIF)
collinearity Refers to the potential for correlation within a set of independent X variables.
variance inflationary factor (VIF) Measures the impact of collinearity among the Xs in a regression model by stating the degree to which collinearity among the predictors reduces the precision of an estimate.

An important problem in the application of multiple regression analysis involves the possible collinearity of the independent variables. This condition refers to situations in which one or more of the independent variables are highly correlated with each other. In such situations, collinear variables do not provide unique information, and it becomes difficult to separate the effect of such variables on the dependent variable. When collinearity exists, the values of the regression coefficients for the correlated variables may fluctuate drastically, depending on which independent variables are included in the model. One method of measuring collinearity is the variance inflationary factor (VIF) for each independent variable. Equation 13.15 defines VIFj, the variance inflationary factor for ­variable j.

VARIANCE INFLATIONARY FACTOR
VIFj = 1/(1 − R²j)   (13.15)
where R²j is the coefficient of multiple determination of independent variable Xj with all other X variables

If there are only two independent variables, R²1 is the coefficient of determination between X1 and X2. It is identical to R²2, which is the coefficient of determination between X2 and X1. If, for example, there are three independent variables, then R²1 is the coefficient of multiple determination of X1 with X2 and X3; R²2 is the coefficient of multiple determination of X2 with X1 and X3; and R²3 is the coefficient of multiple determination of X3 with X1 and X2.
If a set of independent variables is perfectly uncorrelated, each VIFj is equal to 1. If the set is highly correlated, then a VIFj might even exceed 10. Marquardt (see reference 1) suggests that if VIFj is greater than 10, there is too much correlation between the variable Xj and the other independent variables. However, other statisticians suggest a more conservative criterion: Snee (see reference 2) recommends using alternatives to least-squares regression if the maximum VIFj exceeds 5.
You need to proceed with extreme caution when using a multiple regression model that has one or more large VIF values. You can use the model to predict values of the dependent variable only when the values of the independent variables used in the prediction are in the relevant range of the values of the data set. However, you cannot extrapolate to values of the independent variables not observed in the sample data. Since the independent variables contain overlapping information, you should always avoid interpreting the regression coefficient estimates separately (i.e. there is no way of accurately estimating the individual effects of the independent variables). One solution to the problem is to delete the variable with the largest VIF value. The reduced model (i.e. the model with the independent variable with the largest VIF value deleted) is often free of collinearity problems. If you determine that all the independent variables are needed in the model, you can use methods discussed in references 1 and 3.


In the car exports example, the correlation between the two independent variables, exchange rate and wage price index, is 0.8592. Because there are only two independent variables in the model, from Equation 13.15:
VIF1 = VIF2 = 1/(1 − (0.8592)²) = 3.8201
Because this value is below the conservative criterion of 5, collinearity of the independent variables is not a concern here.
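For two independent variables the VIF follows directly from their correlation, as in the calculation above. The sketch below reproduces that number and also shows a general-purpose VIF calculation (an illustration with NumPy, not the text's Excel workflow) in which each R²j is obtained by regressing one X column on all the others.

```python
import numpy as np

# Two-predictor case from the car exports example: r = 0.8592 between the X variables
r = 0.8592
print(round(1.0 / (1.0 - r ** 2), 4))   # about 3.8201, as in the text

# General case (illustrative): VIF_j = 1 / (1 - R^2_j), where R^2_j comes from
# regressing column j of X on all the other columns (with an intercept).
def vif(X):
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    result = []
    for j in range(k):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, _, _, _ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ coef
        ss_res = float(resid @ resid)
        ss_tot = float(((y - y.mean()) ** 2).sum())
        r2_j = 1.0 - ss_res / ss_tot
        result.append(1.0 / (1.0 - r2_j))
    return result
```

Applying `vif` to the columns of any data matrix with two or more predictors returns one VIF per column, which can then be compared against the guidelines of 5 or 10 discussed above.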

Problems for Section 13.7
LEARNING THE BASICS
13.47 If the coefficient of determination between two independent variables is 0.35, what is the VIF?
13.48 If the coefficient of determination between two independent variables is 0.50, what is the VIF?
13.49 If the VIF is equal to 6.2, what does this indicate?

APPLYING THE CONCEPTS You need to use Microsoft Excel to solve problems 13.50 to 13.54.

13.50 Refer to problem 13.4 on page 509. Perform a multiple regression analysis and determine the VIF for each explanatory variable in the model. Is there reason to suspect the existence of collinearity?

13.51 Refer to problem 13.5 on page 509. Perform a multiple regression analysis and determine the VIF for each explanatory variable in the model. Is there reason to suspect the existence of collinearity? 13.52 Refer to problem 13.6 on page 510. Perform a multiple regression analysis and determine the VIF for each explanatory variable in the model. Is there reason to suspect the existence of collinearity? 13.53 Refer to problem 13.7 on page 510. Perform a multiple regression analysis and determine the VIF for each explanatory variable in the model. Is there reason to suspect the existence of collinearity? 13.54 Refer to problem 13.8 on page 510. Perform a multiple regression analysis and determine the VIF for each explanatory variable in the model. Is there reason to suspect the existence of collinearity?

13 Assess your progress
Summary
In this chapter we looked at how to extend the simple linear regression model (Chapter 12) to include more than one independent (explanatory) variable to predict a dependent (response) variable. First, we developed the multiple regression model using the example of manufacturing trade and then predicted car exports with two independent variables. We examined how to assess the strength of the relationship between the variables using the coefficient of multiple determination (R²)

and an examination of the residuals from the regression. We tested the significance of the slope of the regression, and calculated confidence intervals for a slope coefficient. We considered the question of whether all the independent variables were contributing to an explanation of the dependent variable. To do this, we used hypothesis testing (Chapter 9) and our knowledge of probability distributions to test the significance of the independent variables.


Section 13.6 introduced a non-numeric independent variable as a dummy variable. The same section examined interactions between independent variables, testing whether the effect of one independent variable on the dependent variable depends on the value of another. Finally, the problem of collinearity among the independent variables was examined using the variance inflationary factor statistic.

Key formulas

Multiple regression model with k independent variables
Yi = β0 + β1X1i + β2X2i + β3X3i + … + βkXki + εi   (13.1)

Multiple regression model with two independent variables
Yi = β0 + β1X1i + β2X2i + εi   (13.2)

Multiple regression equation with two independent variables
Ŷi = b0 + b1X1i + b2X2i   (13.3)

Coefficient of multiple determination
R² = SSR/SST   (13.4)

Adjusted R²
R²adj = 1 − [(1 − R²)(n − 1)/(n − k − 1)]   (13.5)

Overall F test statistic
F = MSR/MSE   (13.6)

Testing for the slope in multiple regression
t = (bj − βj)/Sbj   (13.7)

Confidence interval estimate for the slope
bj ± tn−k−1 Sbj   (13.8)

The contribution of an independent variable to the regression model
SSR(Xj | all variables except j) = SSR(all variables including j) − SSR(all variables except j)   (13.9)

Contribution of variable X1 given that X2 has been included
SSR(X1 | X2) = SSR(X1 and X2) − SSR(X2)   (13.10a)

Contribution of variable X2 given that X1 has been included
SSR(X2 | X1) = SSR(X1 and X2) − SSR(X1)   (13.10b)

Partial F test statistic
F = SSR(Xj | all variables except j)/MSE   (13.11)

Relationship between a t statistic and an F statistic
t²α = F1,α   (13.12)

Coefficients of partial determination for a multiple regression model containing two independent variables
R²Y1.2 = SSR(X1 | X2) / [SST − SSR(X1 and X2) + SSR(X1 | X2)]   (13.13a)
R²Y2.1 = SSR(X2 | X1) / [SST − SSR(X1 and X2) + SSR(X2 | X1)]   (13.13b)

Coefficient of partial determination for a multiple regression model containing k independent variables
R²Yj(all variables except j) = SSR(Xj | all variables except j) / [SST − SSR(all variables including j) + SSR(Xj | all variables except j)]   (13.14)

Variance inflationary factor
VIFj = 1/(1 − R²j)   (13.15)
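As a quick numerical check of two of these formulas, the following sketch applies Equations 13.5 and 13.6 to the values reported in Figure 13.11 for the electricity model (n = 15, k = 3); the results match the adjusted R square and F statistic shown in that output.

```python
# Check of Equations 13.5 and 13.6 against Figure 13.11 (electricity model)
n, k = 15, 3
r_square = 0.832468665
msr, mse = 6222.869763, 341.54461

adjusted_r_square = 1 - (1 - r_square) * (n - 1) / (n - k - 1)
overall_f = msr / mse

print(round(adjusted_r_square, 4))   # about 0.7868, matching the Excel output
print(round(overall_f, 2))           # about 18.22, matching the Excel output
```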

Key terms
adjusted R² 511
coefficient of multiple determination, R² 511
coefficient of partial determination 523
collinearity 535
cross-product term 528
dummy variable 526
interaction 528
interaction term 528
multiple regression model 505
net regression coefficient 506
overall F test 512
partial F test 520
variance inflationary factor (VIF) 535

References
1. Marquardt, D. W., 'You should standardize the predictor variables in your regression models', discussion of 'A critique of some ridge regression methods' by G. Smith & F. Campbell, Journal of the American Statistical Association, 75 (1980): 87–91.
2. Snee, R. D., 'Some aspects of nonorthogonal data analysis, Part 1. Developing prediction equations', Journal of Quality Technology, 5 (1973): 67–79.
3. Kutner, M., C. Nachtsheim, J. Neter & W. Li, Applied Linear Statistical Models, 5th edn (New York: McGraw-Hill/Irwin, 2005).


Chapter review problems CHECKING YOUR UNDERSTANDING 13.55 How does the interpretation of the slope coefficients differ in multiple regression versus simple regression? 13.56 How does testing the significance of the entire multiple regression model differ from testing the contribution of each independent variable? 13.57 What is the difference between R 2 and adjusted R 2? 13.58 How do the coefficients of partial determination differ from the coefficient of multiple determination? 13.59 How do you test for collinearity? 13.60 Why and how do you use dummy variables? 13.61 Under what circumstances would you include an interaction term in a multiple regression? How would you interpret its coefficient? 13.62 What process should you follow to determine which variables should be included in a multiple regression?

APPLYING THE CONCEPTS Use Microsoft Excel to solve problems 13.63 to 13.71.

13.63 Academics have noticed that attendance at lectures has declined over the last decade or so. A marketing lecturer decides to collect data on 12 students' lecture attendance over a 13-week session. She believes that attendance is related to the distance a student has to travel to campus (in km) and to the number of weeks in which the student accesses the video recording of the lecture. < ATTENDANCE >

Lecture   Distance (km)   Video
12        5               4
4         35              6
2         45              8
6         12              7
5         23              7
13        2               5
8         12              4
7         25              5
3         40              7
2         25              9
9         21              4
8         8               5

a. State the multiple regression equation. b. Interpret the meaning of the slopes in this equation. c. Predict the attendance for a student who lives 5 km from campus and who accessed three video lectures. d. Perform a residual analysis and determine the adequacy of fit of the model. e. Is there a significant relationship between student attendance and the two independent variables (distance

from campus and number of video lectures) at the 0.05 level of significance? f. Determine the p-value in (e) and interpret its meaning. g. Interpret the meaning of the coefficient of multiple determination. h. Determine the adjusted R 2. i. At the 0.05 level of significance, determine whether each independent variable makes a significant contribution to the regression model. On the basis of these results, indicate the most appropriate regression model for this set of data. j. Determine the p-values in (i) and interpret their meaning. k. Calculate a measure of collinearity and interpret the result. 13.64 Professional basketball has truly become a sport that generates interest among fans around the world. More and more players venture to the United States to play in the National Basketball Association (NBA). You want to develop a regression model to predict the number of wins achieved by each NBA team, based on field goal (shots made) percentage for the team and for the opponent. The data are stored in < NBA_2012 >. a. State the multiple regression equation. b. Interpret the meaning of the slopes in this equation. c. Predict the mean number of wins for a team that has a field goal percentage of 45% and an opponent field goal percentage of 44%. d. Perform a residual analysis and determine whether the regression assumptions are valid. e. Is there a significant relationship between number of wins and the two independent variables (field goal percentage for the team and for the opponent) at the 0.05 level of significance? f. Determine the p-value in (e) and interpret its meaning. g. Interpret the meaning of the coefficient of multiple determination. h. Determine the adjusted r2. i. At the 0.05 level of significance, determine whether each independent variable makes a significant contribution to the regression model. Indicate the most appropriate regression model for this set of data. j. Determine the p-values in (i) and interpret their meaning. k. Calculate and interpret the coefficients of partial determination. l. What conclusions can you reach concerning field goal percentage (team and opponent) in predicting the number of wins? 13.65 An economist wishes to examine the relationship between robberies, literacy and unemployment for a number of countries – robberies = b0 + b1 literacy rate + b2 unemployment rate.


Country          Robberies per 1,000 people   % of adults with high literacy   Unemployment rate
Australia        1.16048                      17.4                             4.9
Canada           0.823411                     25.1                             6.4
Czech Republic   0.400254                     19.6                             8.4
Denmark          0.580265                     25.4                             3.8
Finland          0.497798                     25.1                             7.0
Germany          0.720773                     18.9                             7.1
Hungary          0.349156                     8.0                              7.4
Ireland          0.601096                     11.5                             4.3
Netherlands      1.13549                      20.0                             5.5
New Zealand      0.439901                     17.6                             3.8
Norway           0.387764                     29.4                             3.5
Poland           1.38838                      5.8                              14.9
Switzerland      0.290827                     16.1                             3.3
United Kingdom   1.57433                      19.1                             2.9
United States    1.38527                      19.0                             4.8

Source: © NationMaster

a. State the multiple regression equation. b. Interpret the meaning of the slopes in this equation. c. Predict the robbery rate for a country with a literacy rate of 20% and unemployment rate of 7%. d. Interpret the meaning of the coefficient of multiple determination. e. Perform a residual analysis and determine the adequacy of the model. f. Determine whether there is a significant relationship between the robbery rate and the two independent variables (literacy and unemployment rates) at the 0.05 level of significance. g. Construct a 95% confidence interval estimate of the population slope between robbery rates and literacy rates, and between robbery rates and unemployment rates. h. At the 0.05 level of significance, determine whether each independent variable makes a significant contribution to the regression model. On the basis of these results, indicate the independent variables to include in this model. i. Test the level of collinearity between the independent variables. 13.66 Starbucks Coffee Co. uses a data-based approach to improving the quality and customer satisfaction of its products. When survey data indicated that Starbucks needed to improve its package sealing process, an experiment was conducted to determine the factors in the bag-sealing equipment that might be affecting the ease of opening the bag without tearing the inner liner of the bag (data obtained from L. Johnson and S. Burrows, ‘For Starbucks, it’s in the bag’, Quality Progress, March 2011, 17–23.) Among the factors that could affect the rating of the ability of the bag to resist tears were the viscosity, pressure, and plate gap on the bag-sealing equipment. Data were collected on 19 bags in which the plate gap was varied. The results are stored in < STARBUCKS >.

Develop a multiple regression model that uses the viscosity, pressure, and plate gap on the bag-sealing equipment to predict the tear rating. Be sure to perform a thorough residual analysis. Do you think that you need to use all three independent variables in the model? Explain.
13.67 A lecturer believes that students majoring in economics get higher wages than other graduates. To test his theory he selects a sample of 10 past students and collects their wage (in dollars) and the weighted average mean of their grades, and includes a variable that equals 1 if the student majored in economics and 0 otherwise. < ECON >

Wage ($)   Weighted average mean   Economics major
56,500     3.75                    0
45,245     2.8                     0
89,750     3.8                     1
75,200     3.5                     0
35,400     1.9                     0
89,500     3.8                     1
102,500    4.0                     1
67,450     3.4                     0
22,150     1.5                     0
56,400     3.0                     0

a. State the multiple regression equation. b. Interpret the meaning of the slopes in this equation. c. Predict the wage of a student with a weighted average mean of 3.5 and an economics major. d. Is there is a significant relationship between wage and the two independent variables (weighted average mean and economic major) at the 0.05 level of significance? e. Determine the p-value in (d) and interpret its meaning. f. Determine the adjusted R 2. 13.68 In the past, countries have tended to follow different economic development strategies. One group tended to favour protectionist theories largely based on the infant industry argument, aimed at reducing imports and substituting them with domestic production. Another group (mostly East Asian developing countries) based growth on export promotion. The following group of countries favoured the protectionist policy:

Country     Growth/annum GDP (%)   Growth/annum manufacturing (%)   Growth/annum exports (%)
Argentina   0.6                    0.6                              -0.3
Brazil      3.0                    2.2                              5.6
Chile       2.7                    2.9                              4.9
Colombia    3.5                    3.1                              9.8
Mexico      0.7                    0.7                              3.7
Peru        0.4                    0.4                              0.4
Venezuela   1.0                    4.9                              11.3

Source: S. Edwards, ‘Openness, trade liberalization, and growth in developing countries’, Journal of Economic Literature, 31 (1993): 1360. Copyright American Economic Association; reproduced with permission of the Journal of Economic Literature


a. State the multiple regression equation. b. Interpret the meaning of the slopes in this equation. c. Predict the GDP for a country in the protectionist group with manufacturing growth of 1.2% and export growth of 6.3%. d. Interpret the meaning of the coefficient of multiple determination. e. Determine the adjusted R 2. Explain what has happened here. f. At the 0.05 level of significance, determine whether each independent variable makes a significant contribution to the regression model. g. What do the results of (b) to (f) tell you? 13.69 The percentage of youth not in employment, education or training (NEET) is of greater policy significance than simply looking at unemployment rates. We hypothesise that NEET is influenced by the average age at which females are having their first child and their school life expectancy. Data is collected for 30 countries in NEET. < NEET >

Country          NEET   Mean birth age   School life expectancy
Netherlands      4.4    28.9             15.9
Denmark          5.6    29.1             15.6
Iceland          5.9    27.0             15.8
Switzerland      6.8    30.2             15.0
Sweden           6.8    28.6             16.0
Austria          6.8    28.5             14.7
Slovenia         7.4    28.7             14.1
Luxembourg       7.9    29.3             13.1
Finland          8.6    27.9             16.7
Canada           8.8    27.6             14.8
Norway           9.2    28.4             16.9
Germany          9.5    28.9             15.3
Japan            10.1   29.4             14.3
Czech Republic   11.0   27.6             13.5
Estonia          11.0   26.3             14.1
Poland           11.1   26.6             14.4
Australia        11.4   30.5             16.6
France           12.0   28.6             15.4
New Zealand      12.5   27.7             16.2
Portugal         12.8   27.4             15.2
United Kingdom   13.4   30.0             16.4
Hungary          13.8   28.2             13.6
United States    14.8   25.0             15.2
Belgium          16.0   28.0             15.8
Ireland          17.6   29.8             14.9
Spain            17.6   29.3             15.3
Greece           18.2   29.2             14.3
Italy            19.5   27.7             14.7
Mexico           22.0   21.3             11.5
Turkey           30.2   22.9             9.5

Data obtained from Organisation for Economic Co-operation and Development (OECD), Statistics New Zealand and NationMaster

a. State the multiple regression equation. b. Interpret the meaning of the slopes in this equation. c. Predict the percentage of youth not in employment, education or training (NEET) for a country where the mother’s average age at first birth is 25 and school life expectancy is 15 years. d. Is there a significant relationship between NEET and the two independent variables at the 0.05 level of significance? e. Interpret the meaning of the coefficient of multiple determination. f. Determine the adjusted R 2. g. At the 0.05 level of significance, determine whether each independent variable makes a significant contribution to the regression model. 13.70 The owner of a moving company typically has his most experienced manager predict the total number of labour hours that will be required to complete an upcoming move. This approach has proved useful in the past, but the owner has the business objective of developing a more accurate method of predicting labour hours. In a preliminary effort to provide a more accurate method, the owner has decided to use the number of cubic feet moved and the number of pieces of large furniture as the independent variables, and has collected data for 36 moves in which the origin and destination were within the borough of Manhattan in New York, with the travel time an insignificant portion of the hours worked. The data are organised and stored in < MOVING >. a. State the multiple regression equation. b. Interpret the meaning of the slopes in this equation. c. Predict the mean labour hours for moving 500 cubic feet with two large pieces of furniture. d. Perform a residual analysis and determine whether the regression assumptions are valid. e. Determine whether there is a significant relationship between labour hours and the two independent variables (the number of cubic feet moved and the number of pieces of large furniture) at the 0.05 level of significance. f. Determine the p-value in (e) and interpret its meaning. g. Interpret the meaning of the coefficient of multiple determination. h. Determine the adjusted r 2. i. At the 0.05 level of significance, determine whether each independent variable makes a significant contribution to the regression model. Indicate the most appropriate regression model for this set of data. j. Determine the p-values in (i) and interpret their meaning. k. Construct a 95% confidence interval estimate of the population slope between labour hours and the number of cubic feet moved. l. Calculate and interpret the coefficients of partial determination. m. What conclusions can you reach concerning labour hours?


13.71 An experiment was conducted to study the extrusion process of biodegradable packaging foam (data extracted from W. Y. Koh, K. M. Eskridge and M. A. Hanna, 'Supersaturated split-plot designs', Journal of Quality Technology, 45, January 2013, 61–72). Among the factors considered for their effect on the unit density (mg/mL) were the die temperature (145°C versus 155°C) and the die diameter (3 mm versus 4 mm). The results were stored in < PACKAGING_FOAM3 >. Develop a multiple regression model that uses die temperature and die diameter to predict the unit density (mg/mL). Do you think that you need to use both independent variables in the model? Explain.

Continuing cases
Tasman University
The Student News Service at Tasman University (TU) wishes to use the data it has collected to see if a student's WAM and gender help predict their expected salary.
a Starting with the BBus students, create a dummy variable that takes the value of 1 for female. Using the data stored in < TASMAN_UNIVERSITY_BBUS_STUDENT_SURVEY >, assess the significance and importance of both WAM and gender as predictors of expected salary.
b Now create an interaction term for gender and WAM and determine its influence.
c Do undergraduate WAM and gender help explain a student's MBA WAM score? Use the data stored in < TASMAN_UNIVERSITY_MBA_STUDENT_SURVEY > to estimate and interpret a predictive model.

As Safe as Houses To analyse the real estate market in non-capital cities and towns in states A and B, Safe-As-Houses Real Estate, a large national real estate company, has collected samples of recent residential sales from a sample of non-capital cities and towns in these states. The data are stored in < REAL_ESTATE >: a Use the internal area and the number of bedrooms to predict house and unit prices. b Use dummy variables to determine if this relationship differs for units. c Prepare a brief report to summarise your findings.

Chapter 13 Excel Guide EG13.1 DEVELOPING A MULTIPLE REGRESSION MODEL Interpreting the Regression Coefficients Key technique Use the LINEST(cell range of Y variable, cell range of X variables, True, True) function to calculate the regression coefficients and other values related to a multiple regression analysis.

Example Develop the Figure 13.1 multiple regression model for the car export data shown on page 507. PHStat Use Multiple Regression. For the example, open the Car_Export file. Select PHStat ➔ Regression ➔ Multiple Regression, and


in the procedure’s dialog box (shown in Figure EG13.1): 1. Enter B1:B16 as the Y Variable Cell Range. 2. Enter C1:D16 as the X Variables Cell Range. 3. Check First cells in both ranges contain label. 4. Enter 95 as the Confidence level for regression coefficients. 5. Check Regression Statistics Table and ANOVA and Coefficients Table. 6. Enter a Title and click OK.

Figure EG13.2  Regression dialog box

Predicting the Dependent Variable Y Key technique Use the MMULT array function and the T.INV.2T function to help calculate intermediate values that determine the confidence interval estimate and prediction interval. Example Calculate the Figure 13.2 confidence interval estimate and prediction interval for the car export data shown on page 508.

Figure EG13.1  Multiple Regression dialog box

The procedure creates a worksheet that contains a copy of your data in addition to the Figure 13.1 worksheet.

Analysis ToolPak Use Regression. For the example, open the Car_Export file and: 1. Select Data ➔ Data Analysis. 2. In the Data Analysis dialog box, select Regression from the Analysis Tools list and then click OK. In the Regression dialog box (shown in Figure EG13.2): 3. Enter B1:B16 as the Input Y Range and enter C1:D16 as the Input X Range. 4. Check Labels and check Confidence Level and enter 95 in its box. 5. Click New Worksheet Ply. 6. Click OK.

PHStat Use the PHStat ‘Interpreting the regression coefficients’ instructions but replace step 6 with the following steps 6 through 8: 6. Check Confidence Interval Estimate & Prediction Interval and enter 95 as the percentage for Confidence level for intervals. 7. Enter a Title and click OK. 8. In the new worksheet, enter 75 in cell B6 and enter 120 in cell B7. These steps create a new worksheet that is discussed in the following In-depth Excel instructions.

EG13.2 r 2, ADJUSTED r 2 AND THE OVERALL F TEST The coefficient of multiple determination, r2, the adjusted r2 and the overall F test are all calculated as part of creating the multiple regression results worksheet using the Section EG13.1 instructions. If you use the PHStat instructions, formulas are used to calculate these results in the COMPUTE worksheet. Formulas in cells B5, B7, B13, C12, C13, D12, and E12 copy values calculated by the array


formula in cell range L2:N6. In cell F12, the expression F.DIST.RT(F test statistic, 1, error degrees of freedom) calculates the p-value for the overall F test.

EG13.3 RESIDUAL ANALYSIS FOR THE MULTIPLE REGRESSION MODEL

Key technique Use arithmetic formulas and some results from the multiple regression COMPUTE worksheet to calculate residuals. Example Perform the residual analysis for the car export data discussed in Section 13.3, starting on page 514. PHStat Use the Section EG13.1 ‘Interpreting the regression coefficients’ PHStat instructions. Modify step 5 by checking Residuals Table and Residual Plots in addition to checking Regression Statistics Table and ANOVA and Coefficients Table. Analysis ToolPak Use the Section EG13.1 Analysis ToolPak instructions. Modify step 5 by checking Residuals and ­Residual Plots before clicking New Worksheet Ply and then OK. The ­Residuals Plots option constructs residual plots only for each independent variable. To construct a plot of the

residuals and the predicted value of Y, select the predicted and residuals cells (in the RESIDUAL OUTPUT area of the regression results worksheet) and then apply the Section EG2.5 In-depth Excel ‘The Scatter Diagram’ instructions.

EG13.4 INFERENCES CONCERNING THE POPULATION REGRESSION COEFFICIENTS The regression results worksheets created by using the Section EG13.1 instructions include the information needed to make the inferences discussed in Section 13.4.

EG13.5 TESTING PORTIONS OF THE MULTIPLE REGRESSION MODEL

Key technique Adapt the Section EG13.1 ‘Interpreting the regression coefficients’ instructions and the Section EG12.2 instructions to develop the regression analyses needed. Example Test portions of the multiple regression model for the car export data as discussed in Section 13.5, starting on page 520. PHStat Use the Section EG13.1 PHStat ‘Interpreting the regression coefficients’ instructions but modify step 6 by checking Coefficients of Partial Determination before you click OK.

Microsoft® product screen shots are reprinted with permission from Microsoft Corporation.


CHAPTER 14
Time-series forecasting and index numbers

FORECASTING FEMALE LABOUR MARKET PATTERNS
The changing labour force participation patterns of females in Australia are studied by economists, demographers, sociologists and policy makers with keen interest. As with many other developed economies, the labour force participation of females increased dramatically following the Second World War, and later with the arrival of baby boomers in subsequent decades. Researchers and policy makers are interested in issues such as the changing gender composition of occupations, discrimination in wages and promotion, the relationship between childbirth and the availability and cost of childcare, and the relationship between education choices and labour force participation. Many of these issues have gained increased attention in recent years as Australia's population has aged and policy makers look to ways of increasing future labour force participation. With this in mind, we are interested in forecasting female labour force patterns. This chapter presents a number of alternative forecasting models as well as strategies for choosing the best model to use.


LEARNING OBJECTIVES

After studying this chapter you should be able to:
1 explain the importance of business forecasting
2 disaggregate the components of a time series
3 use moving averages and exponential smoothing
4 apply linear trend, quadratic trend and exponential trend time-series models
5 estimate the Holt–Winters forecasting model
6 use autoregressive models
7 choose an appropriate forecasting model
8 estimate a forecasting model for seasonal data
9 calculate and interpret various price indices

In Chapters 12 and 13 we used regression analysis as a tool for model building and prediction. In this chapter, regression analysis and other statistical methodologies are applied to time-series data. A time series is a set of numerical data obtained at regular periods of time. Due to differences in the features of data, we need to consider several different approaches to forecasting time-series data. Discussion of time series begins with annual time-series data. Two techniques for smoothing a series are illustrated – moving averages and exponential smoothing (Section 14.3). The analysis of annual time series continues with the use of least-squares trend fitting and forecasting (Section 14.4) and other more sophisticated forecasting methods (Sections 14.5 and 14.6). These trend-fitting and forecasting models are then extended to a monthly or quarterly time series (Section 14.8).

14.1  THE IMPORTANCE OF BUSINESS FORECASTING Forecasting is needed to monitor the changes that occur over time. Forecasting is commonly used in both the for-profit and the non-profit sectors of the economy. For example, officials in government forecast unemployment, inflation, industrial production and revenues from income taxes in order to formulate policies. The marketing executives of a retailing corporation forecast product demand, sales revenues, consumer preferences, inventory and so on, in order to make timely decisions about promotions and strategic planning. There are two common approaches to forecasting: qualitative and quantitative. Qualitative forecasting methods are especially important when historical data are unavailable. Qualitative forecasting methods are considered to be highly subjective and judgmental. Quantitative forecasting methods make use of historical data. The goal of these methods is to use past data to predict future values. Quantitative forecasting methods are subdivided into two types: time series and causal. Time-series forecasting methods involve the forecast of future values of a variable based entirely on the past and present values of that variable. For example, the daily closing prices of a particular stock on the Tokyo Stock Exchange constitute a time series. Other examples of economic or business time series are the monthly publication of the Consumer Price Index (CPI), the quarterly gross domestic product (GDP) and the annual recorded total sales revenues of a particular company. Causal forecasting methods involve the determination of factors that relate to the variable you are trying to forecast. These include multiple regression analysis with lagged variables, econometric modelling, leading indicator analysis, diffusion indices and other economic barometers that are beyond the scope of this text. The primary emphasis in this chapter is on time-series forecasting methods.

time series A sequence of measurements taken at successive points in time.

LEARNING OBJECTIVE 1  Explain the importance of business forecasting

qualitative forecasting methods Methods that are primarily based on the subjective opinion of the forecaster rather than the analysis of numerical data.
quantitative forecasting methods Methods that use time-series data in a mathematical process to forecast future values of the series.
time-series forecasting methods Statistical methods for forecasting future values of a variable based entirely on the past values of that variable.
causal forecasting methods Methods that attempt to find causal variables to account for changes in a time series.


LEARNING OBJECTIVE 2  Disaggregate the components of a time series

classical multiplicative model Model which states that values in a time series are a combination of trend, seasonal, cyclical and irregular components.

14.2  COMPONENT FACTORS OF THE CLASSICAL MULTIPLICATIVE TIME-SERIES MODEL

The basic assumption of time-series forecasting is that the factors that have influenced activities in the past and present will continue to do so in more or less the same way in the future. Thus, the major goals of time-series forecasting are to identify and isolate these influencing factors in order to make predictions. To achieve these goals, many mathematical models are available for exploring the fluctuations among the component factors of a time series. Perhaps the most basic is the classical multiplicative model for annual, quarterly or monthly data. To demonstrate the classical multiplicative time-series model, Figure 14.1 plots the labour force participation rate for females in Australia from July 1979 to June 2014.

Figure 14.1  Female labour force participation rates in Australia (original, 1979 to 2014)
[Time-series plot of the participation rate (per cent, vertical axis) against year, 1979–2014.]
Source: Data obtained from Australian Bureau of Statistics, Labour Force, Australia, Cat. No. 6202.0.

trend component An overall long-term upward or downward movement in the values of a time series.
cyclical component Displays in a time series as a wavelike up and down change in the individual values sequentially throughout the series.
irregular (random) component Any values in a time series that cannot be accounted for by trend, cyclical or seasonal components.
seasonal component A factor that measures the regular seasonal change in a time series.

A trend component is an overall long-term upward or downward movement in a time series. From Figure 14.1 you can see that female labour force participation rates in Australia have steadily increased over the time period displayed, and are thus described as exhibiting an upward trend. The trend is not the only component factor that influences data in a time series. Two other factors, the cyclical component and the irregular component, are also present in the data. The cyclical component depicts the up-and-down swings or movements through the series. Cyclical movements vary in length, usually lasting from 2 to 10 years. They differ in intensity or amplitude and are often correlated with a business cycle. In some years, the values are higher than would be predicted by a trend line (i.e. they are at or near the peak of a cycle); in other years, the values are lower than would be predicted by a trend line (i.e. they are at or near the bottom or trough of a cycle). Any data that do not follow the trend curve modified by the cyclical component are considered part of the irregular or random component. When you have monthly or quarterly data, an additional component, the seasonal component, is considered together with the trend and the cyclical and irregular components. Table 14.1 summarises the four component factors that can influence an economic or business time series. The classical multiplicative time-series model states that any value in a time series is the product of these components. When forecasting an annual time series, you do not include the seasonal component. Equation 14.1 defines Yi, the value of an annual time series recorded in year i, as the product of the trend, cyclical and irregular components.


Table 14.1  Factors influencing time-series data

Trend (systematic): Overall or persistent, long-term upward or downward pattern of movement. Reason for influence: changes in technology, population, wealth, value. Duration: several years.
Seasonal (systematic): Fairly regular periodic fluctuations that occur within each 12-month period year after year. Reason for influence: weather conditions, social customs, religious customs, school schedules. Duration: within 12 months (for monthly or quarterly data).
Cyclical (systematic): Repeating up-and-down swings or movements through four phases: from peak (prosperity) to contraction (recession) to trough (depression) to expansion (recovery or growth). Reason for influence: interactions of numerous combinations of factors that influence the economy. Duration: usually 2–10 years, with differing intensity for a complete cycle.
Irregular (unsystematic): The erratic or 'residual' fluctuations in a series that exist after taking into account the systematic effects. Reason for influence: random variations in data due to unforeseen events such as strikes, natural disasters and wars. Duration: short duration and non-repeating.

CLASSICAL MULTIPLICATIVE TIME-SERIES MODEL FOR ANNUAL DATA
Yi = Ti × Ci × Ii  (14.1)
where, in year i,
Ti = value of the trend component
Ci = value of the cyclical component
Ii = value of the irregular component

When forecasting quarterly or monthly data, you include the seasonal component in the model. Equation 14.2 defines Yi, a value recorded in time period i, as the product of all four components.

CLASSICAL MULTIPLICATIVE TIME-SERIES MODEL FOR DATA WITH A SEASONAL COMPONENT
Yi = Ti × Si × Ci × Ii  (14.2)
where
Ti, Ci, Ii = values of the trend, cyclical and irregular components in time period i
Si = value of the seasonal component in time period i

The first step in a time-series analysis is to plot the data and observe any patterns that occur over time. You must determine whether there is a long-term upward or downward movement in the series (i.e. a trend). If there is no obvious long-term upward or downward trend, you can use the method of moving averages or the method of exponential smoothing to smooth the series and provide an overall long-term impression (see Section 14.3). If a trend is present, you can consider several time-series forecasting methods (see Sections 14.4 to 14.6 for forecasting annual data and Section 14.8 for forecasting monthly or quarterly time series).

14.3  SMOOTHING THE ANNUAL TIME SERIES

LEARNING OBJECTIVE 3  Use moving averages and exponential smoothing

When you examine annual data, your visual impression of the long-term trend in the series is sometimes obscured by the amount of variation from year to year. Often, you cannot judge whether any long-term upward or downward trend exists in the series. To get a better overall impression of the pattern of movement in the data over time, you can use the methods of moving averages or exponential smoothing. A time-series trend that may be obscured by cyclical patterns is the unemployment rate for females in Australia. Figure 14.2 presents the time-series plot of these data from 1979 to 2014.

Figure 14.2  Female unemployment rates in Australia (original, 1979–2014)
[Time-series plot of the unemployment rate (per cent, vertical axis) against year, 1979–2014.]
Source: Data obtained from Australian Bureau of Statistics, Labour Force, Australia, Cat. No. 6202.0.

Moving Averages

moving averages A series of means calculated over time such that each mean is calculated for a set number of observed values.

Moving averages (MA) for a chosen period of length L consist of a series of means calculated over time such that each mean is calculated for a sequence of L observed values. Moving averages are represented by the symbol MA(L).

The method of moving averages for smoothing a time series is highly subjective and dependent on L, the length of the period selected for constructing the averages. If cyclical fluctuations are present in the data, choose an integer value of L that corresponds to (or is a multiple of) the estimated mean length of a cycle in the series. To illustrate, suppose you want to calculate 5-year moving averages from a series that has n = 11 years. Since L = 5, the 5-year moving averages consist of a series of means calculated by averaging consecutive sequences of five values. You calculate the first 5-year moving average by summing the values for the first five years in the series and dividing by 5:
MA(5) = (Y1 + Y2 + Y3 + Y4 + Y5)/5
You calculate the second 5-year moving average by summing the values of years 2 to 6 in the series and then dividing by 5:
MA(5) = (Y2 + Y3 + Y4 + Y5 + Y6)/5


Continue this process until you have calculated the last of these 5-year moving averages by summing the values of the last five years in the series (i.e. years 7 to 11) and then dividing by 5:
MA(5) = (Y7 + Y8 + Y9 + Y10 + Y11)/5

When you are dealing with annual time-series data, L, the length of the period chosen for constructing the moving averages, should be an odd number of years. By following this rule, you are not able to calculate any moving averages for the first (L − 1)/2 years or the last (L − 1)/2 years of the series. Thus, for a 5-year moving average, you cannot make calculations for the first two years or the last two years of the series. When graphing moving averages, you plot each of the calculated values against the middle year of the sequence of years used to calculate it. If n = 11 and L = 5, the first moving average is centred on the third year, the second moving average is centred on the fourth year and the last moving average is centred on the ninth year. Example 14.1 illustrates the calculation of the 5-year moving average.

EXAMPLE 14.1  CALCULATING A 5-YEAR MOVING AVERAGE
The following data represent total revenues (in millions of constant 2010 dollars) for a car rental agency over the 11-year period 1998 to 2008:
4.0 5.0 7.0 6.0 8.0 9.0 5.0 2.0 3.5 5.5 6.5
Calculate the 5-year moving average for this annual time series.

SOLUTION
To calculate a 5-year moving average, first calculate the 5-year moving total and then divide this total by 5. The first of the 5-year moving averages is:
MA(5) = (Y1 + Y2 + Y3 + Y4 + Y5)/5 = (4.0 + 5.0 + 7.0 + 6.0 + 8.0)/5 = 30.0/5 = 6.0
The moving average is then centred on the middle value – the third year of this time series. To calculate the second of the 5-year moving averages, calculate the moving total of the second to sixth years and divide this value by 5:
MA(5) = (Y2 + Y3 + Y4 + Y5 + Y6)/5 = (5.0 + 7.0 + 6.0 + 8.0 + 9.0)/5 = 35.0/5 = 7.0
This moving average is centred on the new middle value – the fourth year of the time series. The remaining moving averages are:
MA(5) = (Y3 + Y4 + Y5 + Y6 + Y7)/5 = (7.0 + 6.0 + 8.0 + 9.0 + 5.0)/5 = 35.0/5 = 7.0
MA(5) = (Y4 + Y5 + Y6 + Y7 + Y8)/5 = (6.0 + 8.0 + 9.0 + 5.0 + 2.0)/5 = 30.0/5 = 6.0
MA(5) = (Y5 + Y6 + Y7 + Y8 + Y9)/5 = (8.0 + 9.0 + 5.0 + 2.0 + 3.5)/5 = 27.5/5 = 5.5
MA(5) = (Y6 + Y7 + Y8 + Y9 + Y10)/5 = (9.0 + 5.0 + 2.0 + 3.5 + 5.5)/5 = 25.0/5 = 5.0
MA(5) = (Y7 + Y8 + Y9 + Y10 + Y11)/5 = (5.0 + 2.0 + 3.5 + 5.5 + 6.5)/5 = 22.5/5 = 4.5


These moving averages are then centred on their respective middle values, the fifth, sixth, seventh, eighth and ninth years in the time series. By using the 5-year moving averages, you are unable to calculate a moving average for the first two or last two values in the time series.

In practice, you should use computer software such as Microsoft Excel when calculating moving averages to avoid the tedious calculations. Figure 14.3 presents the female unemployment rates in Australia from 1979 to 2014, the calculations for 5-year and 7-year moving averages, and a plot of the original data and the moving averages.
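If you prefer a scripting language to a spreadsheet, the centred moving averages can be reproduced with a short Python sketch. This is an illustration only, using the Example 14.1 revenues; the variable names are our own and the pandas rolling-window call simply automates the hand calculations shown above.

```python
import pandas as pd

# Annual revenues from Example 14.1 (millions of constant 2010 dollars), 1998 to 2008
revenue = pd.Series(
    [4.0, 5.0, 7.0, 6.0, 8.0, 9.0, 5.0, 2.0, 3.5, 5.5, 6.5],
    index=range(1998, 2009),
)

# Centred 5-year moving average: each mean is attached to the middle year of the
# five years it averages, so no value is available for the first and last two years.
ma5 = revenue.rolling(window=5, center=True).mean()

print(ma5)  # NaN for 1998-99 and 2007-08; 6.0 for 2000, 7.0 for 2001, and so on
```

Changing window=5 to window=7 gives a 7-year moving average of the same kind as the one plotted in Figure 14.3.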

Figure 14.3  Microsoft Excel 5-year and 7-year moving averages (MA) for female unemployment rates (1979–2014)
[Worksheet columns: Year, Unemp, 5-year MA and 7-year MA, with an accompanying plot of the original unemployment rate (per cent) and both moving averages against year.]
Source: Data obtained from Australian Bureau of Statistics, Labour Force, Australia, Cat. No. 6202.0.


In Figure 14.3 there is no 5-year moving average for the first and last two years and no 7-year moving average for the first and last three years. You can see that the 7-year moving averages smooth the series more than the 5-year moving averages, because the period is longer. Unfortunately, the longer the moving average time period, the fewer the number of moving averages you can calculate. Therefore, selecting moving averages that are longer than seven years is usually undesirable because too many moving average values are missing at the beginning and end of the series. This makes it more difficult to get an overall impression of the entire series.

Exponential Smoothing

Exponential smoothing is also used to smooth a time series. In addition, you can use exponential smoothing to calculate short-term (one period into the future) forecasts when it is questionable what type of long-term trend effect, if any, is present in a time series. In this respect, exponential smoothing has a distinct advantage over the method of moving averages. The name 'exponential smoothing' comes from the fact that it consists of a series of exponentially weighted moving averages. The weights assigned to the values decrease over time, so that the most recent value receives the highest weight, the previous value receives the second highest weight and so on, with the first value receiving the lowest weight. Throughout the series each exponentially smoothed value depends on all previous values, another advantage of exponential smoothing over the method of moving averages. The calculations involved in exponential smoothing seem formidable, so use Microsoft Excel. The equation developed for exponentially smoothing a series in any time period i is based on only three terms – the current value in the time series, Yi, the previously calculated exponentially smoothed value, Ei−1, and an assigned weight or smoothing coefficient, W. Use Equation 14.3 to smooth a time series exponentially.

exponential smoothing Statistical method for removing extreme values from a time series.

CALCULATING AN EXPONENTIALLY SMOOTHED VALUE IN TIME PERIOD i
E1 = Y1
Ei = WYi + (1 − W)Ei−1   i = 2, 3, 4, …  (14.3)
where
Ei = value of the exponentially smoothed series being calculated in time period i
Ei−1 = value of the exponentially smoothed series already calculated in time period i − 1
Yi = observed value of the time series in period i
W = subjectively assigned weight or smoothing coefficient (where 0 < W < 1)

Choosing the smoothing coefficient (i.e. weight) that you assign to the time series is critical. Unfortunately, this selection is somewhat subjective. If your goal is only to smooth a series by eliminating unwanted cyclical and irregular variations, you should select a small value for W (close to 0). If your goal is forecasting, choose a large value for W (close to 1). In the former case, the overall long-term tendencies of the series will be more apparent; in the latter case, future short-term directions may be more adequately predicted.

Figure 14.4, overleaf, presents the Microsoft Excel exponentially smoothed values (with smoothing coefficients of W = 0.50 and W = 0.25) for female unemployment rates in Australia from 1979 to 2014, together with a plot of the original data and the two exponentially smoothed time series. To illustrate the exponential smoothing calculations for a smoothing coefficient of W = 0.25, begin with the initial value Y1979 = 8.2 as the first smoothed value (E1979 = 8.2). Then, using the value of the time series for 1980 (Y1980 = 7.9), smooth the series for 1980 by calculating:
E1980 = WY1980 + (1 − W)E1979 = (0.25)(7.9) + (0.75)(8.2) = 8.125 ≈ 8.1


Figure 14.4  Microsoft Excel plot of exponentially smoothed series (W = 0.50 and W = 0.25) of female unemployment rates in Australia (1979–2014)
[Worksheet columns: Year, Unemp, and the exponentially smoothed series for W = 0.50 and W = 0.25, with an accompanying plot of the original unemployment rate (per cent) and both smoothed series against year.]
Source: Data obtained from Australian Bureau of Statistics, Labour Force, Australia, Cat. No. 6202.0.

To smooth the series for 1981: E1981 = WY1981 + (1 – W )E1980 = (0.25)(7.4) + (0.75)(8.1) = 7.925 ≈ 7.9 To smooth the series for 1982: E1982 = WY1982 + (1 – W )E1981 = (0.25)(8.5) + (0.75)(7.9) = 8.05 ≈ 8.1 This process continues until you have calculated all the exponentially smoothed values for the 36 years in the series, as shown in Figure 14.4. To use the exponential smoothing for forecasting, use the smoothed value in the current time period as a forecast of the value in the following period.
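The same recursion can be scripted directly from Equation 14.3. The following Python sketch is ours rather than the text's Excel workbook; it smooths any series for a chosen W and, as described above, uses the last smoothed value as the one-period-ahead forecast.

```python
def exponential_smoothing(y, w):
    """Apply Equation 14.3: E1 = Y1 and Ei = W*Yi + (1 - W)*E(i-1) for i = 2, 3, ..."""
    smoothed = [y[0]]                                  # E1 = Y1
    for value in y[1:]:
        smoothed.append(w * value + (1 - w) * smoothed[-1])
    return smoothed

# First few female unemployment rates (1979 onwards) with W = 0.25
rates = [8.2, 7.9, 7.4, 8.5]
smoothed = exponential_smoothing(rates, w=0.25)
print([round(e, 1) for e in smoothed])                 # [8.2, 8.1, 7.9, 8.1], as in the text

# The forecast for the next period is simply the final smoothed value
forecast_next = smoothed[-1]
```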


FORECASTING TIME PERIOD i + 1
Ŷi+1 = Ei  (14.4)

To forecast the unemployment rate for 2015 using a smoothing coefficient of W = 0.25, use the smoothed value for 2014 as its estimate. Figure 14.4 shows that this projection is 5.6. (How close is this forecast? Look up and select Catalogue No. 6202.0 to find out.) When the value for 2015 becomes available, you can use Equation 14.3 to make a forecast for 2016 by calculating the smoothed value for 2015 as follows:
Current smoothed value = (W)(current value) + (1 − W)(previous smoothed value)
E2015 = WY2015 + (1 − W)E2014
Or, in terms of forecasting, you calculate the following:
New forecast = (W)(current value) + (1 − W)(current forecast)
Ŷ2016 = WY2015 + (1 − W)Ŷ2015

Problems for Section 14.3

LEARNING THE BASICS
14.1 If you are using exponential smoothing for forecasting an annual time series of revenues, what is your forecast for next year if the smoothed value for this year is $12 million?
14.2 Consider a 3-year moving average used to smooth a time series that was first recorded in the year 1955.
a. Which year serves as the first centred value in the smoothed series?
b. How many years of values in the series are lost when calculating all the 5-year moving averages?
14.3 You are using exponential smoothing on an annual time series concerning total revenues (in millions of constant 1995 dollars). If you use a smoothing coefficient of W = 0.20 and the exponentially smoothed value for 2006 is E2006 = (0.20)(12.1) + (0.80)(9.4):
a. What is the smoothed value of this series in 2006?
b. What is the smoothed value of this series in 2007 if the value of the series in that year is 11.5 millions of constant 1995 dollars?

APPLYING THE CONCEPTS
You can solve problems 14.4 to 14.7 manually or by using Microsoft Excel.

14.4 The following data represent unemployment rates for male youths aged 15–19 in New Zealand from 2000 to 2016.

Year 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016

Unemployment rate (%) 13.5 12.1 11.7 10.5 9.7 9.8 9.9 10.1 11.4 16.8 17.4 17.5 18.0 16.3 15.0 14.7 13.2

a. Plot the time series.
b. Fit a 3-year moving average to the data and plot the results.
c. Using a smoothing coefficient of W = 0.50, smooth the series exponentially and plot the results.
d. Repeat (c) using W = 0.25.


e. Compare the results of (c) and (d).
f. Forecast unemployment rates for 2017.

14.5 The data in the following table represent total exports for Malaysia (in millions).

Year: 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012
Total exports: 184,986.5 197,026.1 220,890.4 286,563.1 321,559.5 373,270.3 334,283.8 357,430.0 397,884.4 481,253.0 536,233.7 589,240.3 604,299.6 663,013.5 552,518.1 638,822.5 697,861.9 702,641.2

Data obtained from Department of Statistics Malaysia, Malaysia Economics Statistics–Time Series 2013, Table 3.1

a. Plot the time series.
b. Fit a 3-year moving average to the data and plot the results.
c. Using a smoothing coefficient of W = 0.30, smooth the series exponentially and plot the results.
d. Repeat (c) using W = 0.65.
e. Compare the results of (c) and (d).
f. Forecast the 2013 value using W = 0.3, and compare this forecast with the real value at .

14.6 The following data represent the assumed cost of a 375-gram jar of Vegemite from 1995 to 2011.

Year: 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011
Price ($): 1.89 1.95 1.99 2.20 2.50 2.55 2.80 2.95 3.12 3.22 3.35 3.50 3.89 4.02 4.10 4.25 4.40

a. Plot the time series.
b. Fit a 3-period moving average to the data and plot the results.
c. Using a smoothing coefficient of W = 0.50, exponentially smooth the series and plot the results.
d. Repeat (c) using W = 0.25.

14.7 The following data represent CO2 emissions per capita in Australia from 2000 to 2014.

Year: 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014
CO2 tonnes per capita: 17.51 17.61 17.82 17.82 18.31 18.34 18.33 18.47 18.2 18.14 17.59 17.23 16.92 16.46 15.81

Source: OECD, Environment at a Glance Report

a. Plot the data.
b. Fit a 3-year moving average to the data and plot the results.
c. Using a smoothing coefficient of W = 0.50, exponentially smooth the series and plot the results.
d. What is your exponentially smoothed forecast for 2015?
e. Repeat (c), using a smoothing coefficient of W = 0.25.
f. Compare the results of (c) and (e).


14.4  LEAST-SQUARES TREND FITTING AND FORECASTING

LEARNING OBJECTIVE 4  Apply linear trend, quadratic trend and exponential trend time-series models

The component factor of a time series that is most often studied is trend. You study trend in order to make intermediate and long-range forecasts. As shown in Figure 14.1, to get a visual impression of the overall long-term movements in a time series, plot the data (dependent variable) on the vertical axis and the time periods (independent variable) on the horizontal axis. If a straight-line trend fits the data adequately, the two most widely used methods of trend fitting are least squares (see Section 12.2) and double exponential smoothing (reference 1). If the time-series data indicate some long-run downward or upward quadratic movement, the two most widely used trend-fitting methods are least squares and triple exponential smoothing (reference 1). The focus of this section is forecasting linear, quadratic and exponential trends using the method of least squares.

Linear Trend Model

linear trend model Model defining just the trend change as a linear change through time.

The linear trend model:
Yi = β0 + β1Xi + εi
is the simplest forecasting model. Equation 14.5 defines the linear trend forecasting equation.

LINEAR TREND FORECASTING EQUATION
Ŷi = b0 + b1Xi  (14.5)

Recall that in linear regression analysis we used the method of least squares to calculate the sample slope b1 and the sample Y intercept b0. We then substitute the values for X into Equation 14.5 to predict Y. When using the least-squares method for fitting trends in a time series, you can simplify the interpretation of the coefficients by coding the X values so that the first value is the origin and assigned a code value of X = 0. We then assign all successive values consecutively increasing integer codes 1, 2, 3, …, so that the nth and last value in the series has code n − 1. For example, for time-series data recorded annually over 20 years, assign the first year a coded value of 0, the second year a coded value of 1, the third year a coded value of 2 and so on, with the final (20th) year assigned a coded value of 19. In the chapter-opening scenario, one of the data sets of interest, which displayed an obvious trend over time in Figure 14.1, is female labour force participation rates in Australia. Table 14.2 lists these data. Coding the consecutive X values from 0 to 35 (n = 36) and then using Microsoft Excel to perform a simple linear regression analysis on the adjusted time series (see Figure 14.5) results in the following linear trend forecasting equation:

Ŷi = 44.8767 + 0.4521Xi
where the origin is 1979 (= 0) and units of X = 1 year.
You interpret the regression coefficients as follows:
• The Y intercept b0 = 44.8767 is the base percentage of female labour force participation rates for year 1979, as predicted by the model.
• The slope b1 = 0.4521 indicates that female labour force participation rates are predicted to increase by an average of 0.4521 percentage points per year.


Table 14.2  Female labour force participation rates in Australia (original, 1979–2014)
Source: Australian Bureau of Statistics, Labour Force, Australia, Cat. No. 6202.0.

Year: 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014
Labour force participation rate: 43.6 44.8 44.7 44.6 44.7 45.3 46.3 48.3 48.9 49.9 51.2 52.2 52.0 51.9 51.7 52.6 53.6 53.8 53.6 53.6 53.6 54.5 55.0 55.2 55.9 55.7 57.0 57.5 58.1 58.5 58.7 58.6 58.9 58.7 58.6 58.6

To project the trend in female labour force participation rates in 2015, substitute X37 = 36, the code for 2015, into the linear trend forecasting equation:
Ŷi = 44.8767 + 0.4521(36) = 61.1523%¹

The trend line is plotted in Figure 14.6 together with the observed values of the time series. There is a strong upward linear trend and the adjusted R² is 0.9459, indicating that a great deal of the variation in labour force participation rates is explained by the linear trend over the time series. To investigate whether a different trend model might provide an even better fit, a quadratic trend model and an exponential trend model are presented next.

¹ The solutions presented in this chapter are calculated using the (raw) Excel output. If you use the rounded figures presented in the text to reproduce these answers there may be minor differences.
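As a cross-check on the Excel output shown in Figure 14.5, the same least-squares trend line can be fitted in a few lines of Python. This is a sketch under our own variable names; np.polyfit performs an ordinary least-squares polynomial fit, and the data are the Table 14.2 participation rates.

```python
import numpy as np

# Female labour force participation rates, 1979-2014 (Table 14.2)
rates = np.array([43.6, 44.8, 44.7, 44.6, 44.7, 45.3, 46.3, 48.3, 48.9, 49.9,
                  51.2, 52.2, 52.0, 51.9, 51.7, 52.6, 53.6, 53.8, 53.6, 53.6,
                  53.6, 54.5, 55.0, 55.2, 55.9, 55.7, 57.0, 57.5, 58.1, 58.5,
                  58.7, 58.6, 58.9, 58.7, 58.6, 58.6])
x = np.arange(len(rates))                  # coded years: 1979 = 0, ..., 2014 = 35

b1, b0 = np.polyfit(x, rates, deg=1)       # slope and intercept (about 0.45 and 44.9)

# Project the trend for 2015, coded X = 36
forecast_2015 = b0 + b1 * 36               # roughly 61 per cent
```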


Figure 14.5  Microsoft Excel output for a linear regression model to forecast female labour force participation rates in Australia

SUMMARY OUTPUT
Regression statistics: Multiple R 0.9734; R square 0.9474; Adjusted R square 0.9459; Standard error 1.1384; Observations 36

ANOVA
            df   SS        MS        F         Significance F
Regression   1   794.1550  794.1550  612.8221  0.0000
Residual    34    44.0605    1.2959
Total       35   838.2156

            Coefficients  Standard error  t stat    p-value  Lower 95%  Upper 95%
Intercept   44.8767       0.3717          120.7379  0.0000   44.1214    45.6321
Time         0.4521       0.0183           24.7552  0.0000    0.4150     0.4892

Figure 14.6  Microsoft Excel least-squares trend line for female labour force participation rates in Australia
[Plot of the actual participation rate (per cent) and the linear trend prediction against year, 1979–2014.]

Quadratic Trend Model

quadratic trend model A non-linear forecast model where the second independent variable is the square of the first independent time-series variable.

A quadratic trend model:
Yi = β0 + β1Xi + β2Xi² + εi
is the simplest non-linear model. Using the least-squares method, we can calculate the quadratic trend forecasting equation presented in Equation 14.6.


QUADRATIC TREND FORECASTING EQUATION
Ŷi = b0 + b1Xi + b2Xi²  (14.6)
where
b0 = estimated Y intercept
b1 = estimated linear effect on Y
b2 = estimated quadratic effect on Y

Once again, use Microsoft Excel to calculate the quadratic trend forecasting equation. Figure 14.7 provides Excel output for the quadratic trend model used to forecast female labour force participation rates in Australia. In Figure 14.7:
Ŷi = 43.2058 + 0.7470Xi − 0.0084Xi²
where the origin is 1979 (= 0) and units of X = 1 year.
To use the quadratic trend equation for forecasting purposes, substitute the appropriate coded X values into this equation. For example, to predict female labour force participation rates for 2015 (i.e. X37 = 36):
Ŷi = 43.2058 + 0.7470(36) − 0.0084(36)² = 59.1790%
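The quadratic fit can be sketched the same way by raising the degree of the polynomial; the array of participation rates repeats the Table 14.2 values used in the linear-trend sketch above, and the figures quoted in the comments are approximate.

```python
import numpy as np

# Female labour force participation rates, 1979-2014 (Table 14.2)
rates = np.array([43.6, 44.8, 44.7, 44.6, 44.7, 45.3, 46.3, 48.3, 48.9, 49.9,
                  51.2, 52.2, 52.0, 51.9, 51.7, 52.6, 53.6, 53.8, 53.6, 53.6,
                  53.6, 54.5, 55.0, 55.2, 55.9, 55.7, 57.0, 57.5, 58.1, 58.5,
                  58.7, 58.6, 58.9, 58.7, 58.6, 58.6])
x = np.arange(len(rates))                      # coded years: 1979 = 0, ..., 2014 = 35

# Least-squares quadratic trend: coefficients are returned highest power first
b2, b1, b0 = np.polyfit(x, rates, deg=2)

# Forecast for 2015 (coded X = 36): close to 59 per cent, below the linear projection
forecast_2015 = b0 + b1 * 36 + b2 * 36 ** 2
```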

Figure 14.7  Microsoft Excel output for a quadratic regression model to forecast female labour force participation rates in Australia

SUMMARY OUTPUT
Regression statistics: Multiple R 0.9878; R square 0.9758; Adjusted R square 0.9743; Standard error 0.7845; Observations 36

ANOVA
            df   SS        MS        F         Significance F
Regression   2   817.9053  408.9527  664.4646  0.0000
Residual    33    20.3102    0.6155
Total       35   838.2156

            Coefficients  Standard error  t stat    p-value  Lower 95%  Upper 95%
Intercept   43.2058       0.3714          116.3227  0.0000   42.4502    43.9615
Time         0.7470       0.0491           15.2114  0.0000    0.6471     0.8469
Timesq      −0.0084       0.0014           −6.2120  0.0000   −0.0112    −0.0057

Figure 14.8 plots the quadratic trend forecasting equation together with the time series for the actual data. This quadratic trend model does provide a better fit (adjusted R² = 0.9743) to the time series than the linear trend model. The t statistic for the contribution of the quadratic term to the model is −6.2120 (p-value = 0.0000).

Exponential Trend Model

exponential trend model Method for measuring trend in a time series that increases at a constant rate.

When a time series increases at a rate such that the percentage difference from value to value is constant, an exponential trend is present. Equation 14.7 defines the exponential trend model.


Figure 14.8  Microsoft Excel fitted quadratic trend for female labour force participation rates in Australia
[Plot of the actual participation rate (per cent) and the quadratic trend prediction against year, 1979–2014.]

EXPONENTIAL TREND MODEL
Yi = β0 β1^Xi εi  (14.7)
where
β0 = Y intercept
(β1 − 1) × 100% is the annual compound growth rate (in %)

The model in Equation 14.7 is not in the form of a linear regression model. To transform this non-linear model to a linear model, you use a base 10 logarithm transformation. (Alternatively, you can use base e logarithms. For more information on logarithms, see Appendix A at the back of the book.) Taking the logarithm of each side of Equation 14.7 yields Equation 14.8:

TRANSFORMED EXPONENTIAL TREND MODEL
log(Yi) = log(β0 β1^Xi εi)
        = log(β0) + log(β1^Xi) + log(εi)
        = log(β0) + Xi log(β1) + log(εi)  (14.8)

Equation 14.8 is a linear model that you can estimate, using the least-squares method with log(Yi) as the dependent variable and Xi as the independent variable. This results in Equation 14.9.

EXPONENTIAL TREND FORECASTING EQUATION
log(Ŷi) = b0 + b1Xi  (14.9a)
where
b0 = estimate of log(β0) and thus 10^b0 = β̂0
b1 = estimate of log(β1) and thus 10^b1 = β̂1
therefore
Ŷi = β̂0 (β̂1)^Xi  (14.9b)
where (β̂1 − 1) × 100% is the estimated annual compound growth rate (in %)


Figure 14.9 represents Excel output for an exponential trend model for female labour force participation rates in Australia. Using Equation 14.9a above and the results from Figure 14.9:
log(Ŷi) = 1.6542 + 0.0038Xi
where the origin is 1979 (= 0) and units of X = 1 year.

Figure 14.9  Microsoft Excel output for an exponential regression model to forecast female labour force participation rates in Australia

SUMMARY OUTPUT
Regression statistics: Multiple R 0.9652; R square 0.9317; Adjusted R square 0.9296; Standard error 0.0110; Observations 36

ANOVA
            df   SS       MS       F         Significance F
Regression   1   0.0560   0.0560   463.5041  0.0000
Residual    34   0.0041   0.0001
Total       35   0.0601

            Coefficients  Standard error  t stat    p-value  Lower 95%  Upper 95%
Intercept   1.6542        0.0036          461.1003  0.0000   1.6470     1.6615
Time        0.0038        0.0002           21.5291  0.0000   0.0034     0.0042

You calculate the values for β̂0 and β̂1 by taking the antilog of the regression coefficients (b0 and b1):
β̂0 = antilog(b0) = antilog(1.6542) = 10^1.6542 = 45.1067
β̂1 = antilog(b1) = antilog(0.0038) = 10^0.0038 = 1.0088

Thus, using Equation 14.9b, the exponential trend forecasting equation is:
Ŷi = (45.1067)(1.0088)^Xi
where the origin is 1979 (= 0) and units of X = 1 year.
The Y intercept β̂0 = 45.1067% is the forecast in the base year 1979. The value (β̂1 − 1) × 100% = 0.88% is the annual compound growth rate in female labour force participation rates. For forecasting purposes, you can substitute the appropriate coded X values into either Equation 14.9a or Equation 14.9b. For example, to forecast labour force participation rates for 2015 (i.e. X37 = 36) using Equation 14.9a:
log(Ŷi) = 1.6542 + 0.0038(36) = 1.7909
Ŷi = antilog(1.7909) = 10^1.7909 = 61.7833

Figure 14.10 plots the exponential trend forecasting equation along with the time series for the female labour force participation rate data. Although not directly comparable (as we have a different dependent variable), the adjusted R² for the exponential trend model (0.9296) is close to the adjusted R² for both the linear and quadratic models.
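The exponential trend can be sketched by fitting the transformed model of Equation 14.8, that is, by regressing the base 10 logarithm of the series on the coded years and back-transforming. Again this is our own illustration, using the Table 14.2 values, with approximate results noted in the comments.

```python
import numpy as np

# Female labour force participation rates, 1979-2014 (Table 14.2)
rates = np.array([43.6, 44.8, 44.7, 44.6, 44.7, 45.3, 46.3, 48.3, 48.9, 49.9,
                  51.2, 52.2, 52.0, 51.9, 51.7, 52.6, 53.6, 53.8, 53.6, 53.6,
                  53.6, 54.5, 55.0, 55.2, 55.9, 55.7, 57.0, 57.5, 58.1, 58.5,
                  58.7, 58.6, 58.9, 58.7, 58.6, 58.6])
x = np.arange(len(rates))                    # coded years: 1979 = 0, ..., 2014 = 35

# Fit log10(Y) = b0 + b1*X by least squares (Equation 14.9a)
b1, b0 = np.polyfit(x, np.log10(rates), deg=1)

beta0_hat = 10 ** b0                         # about 45, the fitted value for 1979
beta1_hat = 10 ** b1                         # about 1.009, i.e. roughly 0.9% compound growth

# Forecast for 2015 (coded X = 36), back-transformed from the log scale
forecast_2015 = 10 ** (b0 + b1 * 36)         # roughly 62 per cent
```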


Figure 14.10  Fitting an exponential trend equation using Microsoft Excel for female labour force participation rates in Australia
[Plot of the actual participation rate (per cent) and the exponential trend prediction against year, 1979–2014.]

Model Selection Using First, Second and Percentage Differences

We have used the linear, quadratic and exponential models to forecast female labour force participation rates in Australia. How can you determine which of these models is the most appropriate? In addition to visually inspecting scatter plots and comparing adjusted R² values, you can calculate and examine first, second and percentage differences. The identifying features of linear, quadratic and exponential trend models are as follows:
• If a linear trend model provides a perfect fit to a time series, then the first differences are constant. Thus, the differences between consecutive values in the series are the same throughout:
(Y2 − Y1) = (Y3 − Y2) = … = (Yn − Yn−1)
• If a quadratic trend model provides a perfect fit to a time series, then the second differences are constant. Thus:
[(Y3 − Y2) − (Y2 − Y1)] = [(Y4 − Y3) − (Y3 − Y2)] = … = [(Yn − Yn−1) − (Yn−1 − Yn−2)]
• If an exponential trend model provides a perfect fit to a time series, then the percentage differences between consecutive values are constant. Thus:
[(Y2 − Y1)/Y1] × 100% = [(Y3 − Y2)/Y2] × 100% = … = [(Yn − Yn−1)/Yn−1] × 100%

Although you should not expect a perfectly fitting model for any particular set of time-series data, you can consider the first differences, second differences and percentage differences for a given series as guides in choosing an appropriate model. Examples 14.2, 14.3 and 14.4 illustrate applications of linear, quadratic and exponential trend models with perfect fits to their respective data sets.

EXAMPLE 14.2  A LINEAR TREND MODEL WITH A PERFECT FIT
The following time series represents the number of customers per year (in thousands) at a branch of a fast-food chain:

Year:      2008 2009 2010 2011 2012 2013 2014 2015 2016 2017
Customers: 200  205  210  215  220  225  230  235  240  245

Using first differences, show that the linear trend model provides a perfect fit to these data.


SOLUTION
The following table shows the solution:

Year:              2008 2009 2010 2011 2012 2013 2014 2015 2016 2017
Customers Y:       200  205  210  215  220  225  230  235  240  245
First differences:      5.0  5.0  5.0  5.0  5.0  5.0  5.0  5.0  5.0

The differences between consecutive values in the series are the same throughout. Thus, the number of customers at the branch of the fast-food chain shows a linear growth pattern.

EXAMPLE 14.3  A QUADRATIC TREND MODEL WITH A PERFECT FIT
The following time series represents the number of customers per year (in thousands) at another branch of a fast-food chain:

Year:      2008 2009 2010  2011  2012 2013 2014  2015  2016 2017
Customers: 200  201  203.5 207.5 213  220  228.5 238.5 250  263

Using second differences, show that the quadratic trend model provides a perfect fit to these data.

SOLUTION
The following table shows the solution:

Year:               2008 2009 2010  2011  2012 2013 2014  2015  2016 2017
Customers:          200  201  203.5 207.5 213  220  228.5 238.5 250  263
First differences:       1.0  2.5   4.0   5.5  7.0  8.5   10.0  11.5 13.0
Second differences:           1.5   1.5   1.5  1.5  1.5   1.5   1.5  1.5

The second differences between consecutive pairs of values in the series are the same throughout. Thus, the number of customers at the branch of the fast-food chain shows a quadratic growth pattern. Its rate of growth is accelerating over time.

EXAMPLE 14.4  AN EXPONENTIAL TREND MODEL WITH A PERFECT FIT
The following time series represents the number of customers per year (in thousands) for another branch of the fast-food chain:

Year:      2008 2009 2010   2011   2012   2013   2014   2015   2016   2017
Customers: 200  206  212.18 218.55 225.11 231.86 238.82 245.98 253.36 260.96

Using percentage differences, show that the exponential trend model provides almost a perfect fit to these data.

SOLUTION
The following table shows the solution:

Year:                   2008 2009 2010   2011   2012   2013   2014   2015   2016   2017
Customers:              200  206  212.18 218.55 225.11 231.86 238.82 245.98 253.36 260.96
Percentage differences:      3.0  3.0    3.0    3.0    3.0    3.0    3.0    3.0    3.0

The percentage differences between consecutive values in the series are approximately the same throughout. Thus, the number of customers at the branch of the fast-food chain shows an exponential growth pattern. Its rate of growth is approximately 3% per year.
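The difference checks used in Examples 14.2 to 14.4 (and tabulated for the participation-rate series in Table 14.3 below) are easy to automate. The following is a minimal pandas sketch, using the Example 14.2 customer counts; the column labels are our own.

```python
import pandas as pd

# Customer counts from Example 14.2 (thousands), 2008 to 2017
y = pd.Series([200, 205, 210, 215, 220, 225, 230, 235, 240, 245],
              index=range(2008, 2018))

first_diff = y.diff()              # constant first differences point to a linear trend
second_diff = y.diff().diff()      # constant second differences point to a quadratic trend
pct_diff = y.pct_change() * 100    # constant percentage differences point to an exponential trend

print(pd.DataFrame({'first': first_diff, 'second': second_diff, 'pct (%)': pct_diff}))
# For this series the first differences are 5.0 every year, confirming the linear pattern.
```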


Table 14.3 presents the first, second and percentage differences for female labour force participation rates in Australia. You can see that none of the differences is constant across the series. Other models (including those considered in Sections 14.5 and 14.6) may be more appropriate.

Table 14.3  Comparing first, second and percentage differences for female labour force participation rates in Australia (1979–2014)
[Columns: Year; Labour force participation rate; First difference; Second difference; Percentage difference, for each year from 1979 to 2014.]
Source: Australian Bureau of Statistics, Labour Force, Australia, Cat. No. 6202.0.

Problems for Section 14.4

LEARNING THE BASICS
14.8 The linear trend forecasting equation for an annual time series containing 20 values (from 1992 to 2011) on real total revenues (in millions of constant 1996 dollars) is:
Ŷi = 23.2 + 4.8Xi
a. Interpret the Y intercept b0.
b. Interpret the slope b1.
c. What is the fitted trend value for the fifth year?
d. What is the fitted trend value for the most recent recorded year?
e. What is the projected trend forecast three years after the last recorded value?


14.9 The linear trend forecasting equation for an annual time series containing 40 values (from 1969 to 2008) on real net sales (in billions of constant 1998 dollars) is:
Ŷi = 1.2 + 0.5Xi
a. Interpret the Y intercept b0.
b. Interpret the slope b1.
c. What is the fitted trend value for the tenth year?
d. What is the fitted trend value for the most recent recorded year?
e. What is the projected trend forecast two years after the last recorded value?

APPLYING THE CONCEPTS
Use Microsoft Excel to solve problems 14.10 to 14.18.

14.10 The generosity of social security is often expressed as a percentage of average weekly labour market earnings and called a replacement rate. The following is the replacement rate for unemployment benefits in Australia from 1967 to 2007.

Year: 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007
Replacement rate: 16 15 14 14 14 15 16 19 22 23 25 25 25 24 22 22 22 23 24 24 25 25 25 26 26 27 27 27 27 27 27 26 25 25 25 23 22 22 22 21 20

Based on data obtained from Unemployment Benefit Replacement Rates, The OECD Summary Measure of Benefit Entitlements 1961–2007, accessed 14 January 2010

a. Plot the data.
b. Calculate a linear trend forecasting equation and plot the trend line.
c. Calculate a quadratic trend forecasting equation and plot the results.
d. Calculate an exponential trend forecasting equation and plot the results.
e. Which model is the most appropriate?

14.11 Gross domestic product (GDP) is a major indicator of the nation's overall economic activity. It consists of personal consumption expenditures, gross domestic investment, net exports of goods and services, and government consumption expenditures. The gross domestic product (in $US millions of constant dollars) for New Zealand from 1990 to 2016 is in the data file.
a. Plot the data.
b. Calculate a linear trend forecasting equation and plot the trend line.
c. What are your forecasts for 2017 and 2018?
d. What conclusions can you reach concerning the trend in real gross domestic product?

14.12 The following data represent welfare expenditure as a percentage of GDP for New Zealand from 1971 to 2001.

Year: 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001
Welfare expenditure as a percentage of GDP: 6.10 5.70 6.67 7.03 7.82 8.54 8.22 10.48 10.93 10.99 11.26 10.91 11.92 11.62 11.33 12.03 11.60 12.44 13.41 14.36 14.10 14.56 16.04 14.10 13.47 13.21 12.98 12.91 12.60 11.88 11.51

Source: Statistics New Zealand, accessed 4 November 2008. This work is based on/includes Statistics New Zealand's data, which are licensed by Statistics New Zealand for re-use under the Creative Commons Attribution 3.0 New Zealand licence

a. Plot the data.
b. Calculate a linear trend forecasting equation and plot the trend line.
c. Calculate a quadratic trend forecasting equation and plot the results.
d. Calculate an exponential trend forecasting equation and plot the results.
e. Which model is the most appropriate?
f. Using the most appropriate model, forecast welfare expenditure as a percentage of GDP for 2002. Check how accurate your forecast is by locating the true value on the Internet.

14.13 Female average weekly earnings from November 2006 to November 2016 are presented below.

Quarter: Nov 2006, May 2007, Nov 2007, May 2008, Nov 2008, May 2009, Nov 2009, May 2010, Nov 2010, May 2011, Nov 2011, May 2012, Nov 2012, May 2013, Nov 2013, May 2014, Nov 2014, May 2015, Nov 2015, May 2016, Nov 2016
Female earnings: 654.70, 670.10, 683.40, 693.70, 715.30, 726.60, 744.20, 765.30, 783.80, 795.90, 808.10, 821.90, 840.00, 849.90, 872.60, 881.30, 887.90, 907.80, 914.80, 925.80, 932.40

Data obtained from Australian Bureau of Statistics, Average Weekly Earnings, Cat. No. 6302.0

a. Plot the data.
b. Calculate a linear trend forecasting equation and plot the trend line.
c. Calculate a quadratic trend forecasting equation and plot the results.
d. Calculate an exponential trend forecasting equation and plot the results.
e. Using the models in (b), (c) and (d), what are your forecasts of female average weekly earnings for May and November 2017?

14.14 The data in the following table represent the consumer price index (CPI) for Australia from the September quarter 2006 to the September quarter 2011.

Quarter: Sep 2006, Dec 2006, Mar 2007, Jun 2007, Sep 2007, Dec 2007, Mar 2008, Jun 2008, Sep 2008, Dec 2008, Mar 2009, Jun 2009, Sep 2009, Dec 2009, Mar 2010, Jun 2010, Sep 2010, Dec 2010, Mar 2011, Jun 2011, Sep 2011
CPI: 155.7, 155.5, 155.6, 157.5, 158.6, 160.1, 162.2, 164.6, 166.5, 166.0, 166.2, 167.0, 168.6, 169.5, 171.0, 172.1, 173.3, 174.0, 176.7, 178.3, 179.4

Source: © Commonwealth of Australia, Australian Bureau of Statistics, Consumer Price Index Australia, Cat. No. 6401.0

a. Plot the data.
b. Calculate a linear trend forecasting equation and plot the trend line.
c. Calculate a quadratic trend forecasting equation and plot the results.
d. Calculate an exponential trend forecasting equation and plot the results.
e. Which model is the most appropriate?
f. Using the most appropriate model, forecast the CPI for the December quarter 2011. Compare this to the actual value – see .

14.15 Use the Malaysian total export data from problem 14.5 to answer the following questions.
a. Plot the data.
b. Calculate a linear trend forecasting equation and plot the trend line.
c. Calculate a quadratic trend forecasting equation and plot the results.
d. Calculate an exponential trend forecasting equation and plot the results.
e. Which model is the most appropriate?
f. Using the most appropriate model, forecast total exports for Malaysia in 2013.

14.16 A time-series plot often helps you to determine the appropriate model to use. For this problem, use each of the time series presented in the following table.

Year:           1996  1997  1998  1999  2000  2001  2002  2003  2004  2005
Time series I:  100.0 115.2 130.1 144.9 160.0 175.0 189.8 204.9 219.8 235.0
Time series II: 100.0 115.2 131.7 150.8 174.1 200.0 230.8 266.1 305.5 351.8

a. Plot the observed data (Y) over time (X), and plot the logarithm of the observed data (log Y) over time (X), to determine whether a linear trend model or an exponential trend model is more appropriate. (Hint: If the plot of log Y versus X appears to be linear, an exponential trend model provides an appropriate fit.)
b. Calculate the appropriate forecasting equation.
c. Forecast the value for 2006.

14.17 Data for overseas arrivals (immigration) into Australia from 1991 to 2015 are presented below.

Year: 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015
Overseas arrivals ('000): 59.6 53.7 50.9 58.2 66.2 64.4 62.9 63.3 73.6 81.0 80.4 84.2 90.7 95.9 104.9 105.0 116.4 132.2 107.6 119.2 121.1 124.6 116.8 114.7 118.7

a. Compare the first differences, second differences and percentage differences to determine the most appropriate model.
b. Calculate the appropriate forecasting equation.
c. Forecast migration for 2016 and 2017.

14.18 The following table displays the amount of emergency food aid provided to developing countries from 1995 to 2012 (measured in $US million).

Year: 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012
Emergency food aid: 210.015741 140.72414 148.159145 171.953338 817.649792 393.846905 340.936075 1,034.932696 1,945.945352 1,496.426459 2,154.783084 1,835.9147 1,653.781354 3,012.401861 2,766.001457 2,284.101469 2,427.387885 2,015.416683

Based on data from OECD, Foodaid – Emergency Food Aid, Cat. No. 72040, OECD International Development Statistics, last accessed 31 March 2015

a. Plot the data.
b. Calculate a linear trend forecasting equation and plot the results.
c. Calculate a quadratic trend forecasting equation and plot the results.
d. Calculate an exponential trend forecasting equation and plot the results.
e. Using the forecasting equations in (b), (c) and (d), what are your annual forecasts of emergency food aid for 2013 and 2014?

14.5  THE HOLT–WINTERS METHOD FOR TREND FITTING AND FORECASTING

LEARNING OBJECTIVE 5  Estimate the Holt–Winters forecasting model

Holt–Winters method Forecasting methodology that includes a measure of trend and exponential smoothing.

The Holt–Winters method extends the exponential smoothing approach described in Section 14.3 by including the future trend. To use the Holt–Winters method at any time period i, you must continuously estimate the level of the series (i.e. the smoothed value Ei) and the trend value (Ti). Equations 14.10a and 14.10b estimate the level and trend values in the Holt–Winters method.

THE HOLT–WINTERS METHOD
Level: Ei = U(Ei−1 + Ti−1) + (1 − U)Yi  (14.10a)
Trend: Ti = VTi−1 + (1 − V)(Ei − Ei−1)  (14.10b)
where
Ei = level of the smoothed series being calculated in time period i
Ei−1 = level of the smoothed series already calculated in time period i − 1
Ti = value of the trend component being calculated in time period i
Ti−1 = value of the trend component already calculated in time period i − 1
Yi = observed value of the time series in period i
U = subjectively assigned smoothing constant (where 0 < U < 1)
V = subjectively assigned smoothing constant (where 0 < V < 1)

To begin the calculations, let E2 = Y2 and T2 = Y2 − Y1 and choose smoothing constants for U and V. Then calculate Ei and Ti for all i years, i = 3, 4, …, n. The choices for the smoothing constants U and V affect the results. Smaller values of U give more weight to the more recent levels of the time series and less weight to earlier levels in the series. Smaller values of V give more weight to the current trend in the time series and less weight to past trends in the series. Larger values of U and V have the opposite effect on the Holt–Winters method. To illustrate the Holt–Winters method, return to the data for female labour force participation rates in Australia, discussed in Section 14.4.


To calculate the level and trend for the third and fourth years (1981 and 1982) using the selected smoothing constants of U = 0.3 and V = 0.3, begin by setting:
E2 = Y2 = 44.8
and:
T2 = Y2 − Y1 = 44.8 − 43.6 = 1.2
Choosing smoothing constants U = 0.3 and V = 0.3, Equations 14.10a and 14.10b become:
Ei = (0.3)(Ei−1 + Ti−1) + (0.7)(Yi)
and:
Ti = (0.3)(Ti−1) + (0.7)(Ei − Ei−1)
For 1981, the third year, i = 3 and:
E3 = (0.3)(44.8 + 1.2) + (0.7)(44.7) = 45.1
and:
T3 = (0.3)(1.2) + (0.7)(45.1 − 44.8) = 0.6
For 1982, the fourth year, i = 4 and:
E4 = (0.3)(45.1 + 0.6) + (0.7)(44.6) = 44.9
and:
T4 = (0.3)(0.6) + (0.7)(44.9 − 45.1) = 0.0
Fortunately, you can use Microsoft Excel to calculate these values. Figure 14.11 displays the calculated values for level and trend for the entire series. To use the Holt–Winters method for forecasting, assume that all future trend movements will continue from the most recent smoothed level En. Thus, you can use Equation 14.11 to forecast j years into the future.

USING THE HOLT–WINTERS METHOD FOR FORECASTING

Ŷn+j = En + j(Tn)   (14.11)

where  Ŷn+j = forecast value j years into the future
       En = level of the smoothed series calculated in the most recent time period n
       Tn = value of the trend component calculated in the most recent time period n
       j = number of years into the future

To illustrate the process of forecasting with the Holt–Winters method, we can forecast female labour force participation rates in Australia for 2015 and 2016. Using the values of level and trend based on smoothing constants of U = 0.3 and V = 0.3 in Figure 14.11, from Equation 14.11 the forecasts of labour force participation rates for 2015 and 2016 are as follows:

Ŷn+j = En + j(Tn)

2015: 1 year ahead   Ŷ37 = E36 + (1)(T36) = 58.6 + (1)(−0.1) = 58.5%
2016: 2 years ahead  Ŷ38 = E36 + (2)(T36) = 58.6 + (2)(−0.1) = 58.4%

Figure 14.12 plots the labour force participation rates and the forecasted values.
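For illustration, a two-line Python sketch of Equation 14.11, assuming the 2014 level and trend values (E36 = 58.6, T36 = −0.1) read from Figure 14.11:

```python
# Forecast j years ahead from the most recent smoothed level and trend (Eq. 14.11).
E_n, T_n = 58.6, -0.1                        # E36 and T36 from Figure 14.11
for j in (1, 2):
    print(2014 + j, round(E_n + j * T_n, 1))  # 2015 58.5, 2016 58.4
```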


Figure 14.11  Using the Holt–Winters method on female labour force participation rates in Australia (1979–2014)
[Excel worksheet listing, for each year from 1979 to 2014, the labour force participation rate together with the smoothed level E and the trend T calculated with smoothing constants U = 0.3 and V = 0.3; the 1979 cells are #NA because the calculations begin in the second year. For 2014 the level is 58.6 and the trend is −0.1.]

Figure 14.12  Plot of female labour force participation rates in Australia (1979–2014)
[Line chart of the participation rate (per cent) against year, 1979–2014, showing the actual series and the Holt–Winters (HW) predictions.]
Source: Australian Bureau of Statistics, Labour Force, Australia, Cat. No. 6202.0.


Problems for Section 14.5

LEARNING THE BASICS
14.19 Consider an annual time series with 20 consecutive values. If the smoothed level for the most recent value is 50.8 and the corresponding trend level is 7.2:
a. What is your forecast for the coming year?
b. What is your forecast five years from now?
14.20 Given the following series from n = 10 consecutive time periods:
152 146 148 135 130 124 120 113 110 104
use the Holt–Winters method (with U = 0.30 and V = 0.30) to forecast the 16th to 20th periods.
14.21 Given the following series from n = 10 consecutive time periods:
335 332 325 312 298 275 267 245 223 213
use the Holt–Winters method (with U = 0.20 and V = 0.20) to forecast the 11th to 14th periods.

APPLYING THE CONCEPTS
You should use Microsoft Excel to solve problems 14.22 to 14.25.

14.22 Refer to the data of problem 14.11 on page 564 concerning gross domestic product.
a. Forecast gross domestic product for 2017 and 2018 using the Holt–Winters method with U = 0.30 and V = 0.30.
b. Repeat (a) with U = 0.70 and V = 0.70.
c. Repeat (a) with U = 0.30 and V = 0.70.
d. Which of these sets of forecasts would you select, given the historical movement of the time series? Discuss.

e. Compare the results of (a) to (c) with those of problem 14.11(c).
14.23 Refer to the data of problem 14.15 on page 566 concerning total exports from Malaysia.
a. Forecast total exports for 2013 using the Holt–Winters method with U = 0.30 and V = 0.30.
b. Repeat (a) with U = 0.70 and V = 0.70.
c. Repeat (a) with U = 0.30 and V = 0.70.
d. Which of these sets of forecasts would you select, given the historical movement of the time series? Discuss.
e. Compare the results of (a) to (c) with those of problem 14.15(f).
14.24 Refer to the data of problem 14.13 on page 565 concerning female average weekly earnings in Australia.
a. Forecast the average weekly earnings in May and November 2017 using the Holt–Winters method with U = 0.30 and V = 0.30.
b. Repeat (a) with U = 0.70 and V = 0.70.
c. Repeat (a) with U = 0.30 and V = 0.70.
14.25 Refer to the data of problem 14.17 on page 566 concerning overseas arrivals.
a. Forecast overseas arrivals for 2016 and 2017 using the Holt–Winters method with U = 0.30 and V = 0.30.
b. Repeat (a) with U = 0.70 and V = 0.70.
c. Repeat (a) with U = 0.30 and V = 0.70.
d. Compare these forecasts with the actual values. Which method provided the best forecast?

LEARNING OBJECTIVE 6
Use autoregressive models

autoregressive modelling  Modelling using autocorrelation, which is the correlation between successive values in a time series.
first-order autocorrelation  Indicates that there is a correlation between consecutive values in a time series.
second-order autocorrelation  Indicates that there is a correlation between values two periods apart in a time series.
pth-order autocorrelation  Correlation between values in a time series that are p periods apart.

14.6  AUTOREGRESSIVE MODELLING FOR TREND FITTING AND FORECASTING

Frequently, the values of a time series are highly correlated with the values that precede and succeed them. This type of correlation is called autocorrelation. Autoregressive modelling is a technique used to forecast time series with autocorrelation. A first-order autocorrelation refers to the association between consecutive values in a time series. A second-order autocorrelation refers to the relationship between values that are two periods apart. A pth-order autocorrelation refers to the correlation between values in a time series that are p periods apart. We can take advantage of the autocorrelation in data by using autoregressive modelling methods. Equations 14.12, 14.13 and 14.14 define first-order, second-order and pth-order autoregressive models.

FIRST-ORDER AUTOREGRESSIVE MODEL

Yi = A0 + A1Yi−1 + δi   (14.12)

SECOND-ORDER AUTOREGRESSIVE MODEL

Yi = A0 + A1Yi−1 + A2Yi−2 + δi   (14.13)


pTH-ORDER AUTOREGRESSIVE MODEL

Yi = A0 + A1Yi−1 + A2Yi−2 + … + ApYi−p + δi   (14.14)



where  Yi = the observed value of the series at time i
       Yi−1 = the observed value of the series at time i − 1
       Yi−2 = the observed value of the series at time i − 2
       Yi−p = the observed value of the series at time i − p
       A0, A1, A2, …, Ap = autoregression parameters to be estimated from least-squares regression analysis
       δi = a non-autocorrelated random (error) component (with mean = 0 and constant variance)

The first-order autoregressive model (Equation 14.12) is similar in form to the simple linear regression model (Equation 12.1 on page 457). The second-order autoregressive model (Equation 14.13) is similar to the multiple regression model with two independent variables (Equation 13.2 on page 506). The pth-order autoregressive model (Equation 14.14) is similar to the multiple regression model (Equation 13.1 on page 506). In the regression models, the regression parameters are given by the symbols β0, β1, …, βk, with corresponding estimates denoted by b0, b1, …, bk. In the autoregressive models, the parameters are given by the symbols A0, A1, …, Ap, with corresponding estimates denoted by a0, a1, …, ap.
Selecting an appropriate autoregressive model is not easy. You must weigh the advantages that are due to simplicity against the concern of failing to take into account important autocorrelation in the data. You must be equally concerned with selecting a higher-order model requiring the estimation of numerous, unnecessary parameters – especially if n, the number of values in the series, is small. The reason for this concern is that p out of n data values are lost in calculating an estimate of Ap when comparing each data value with another data value that is p periods apart. Examples 14.5 and 14.6 illustrate this loss of data values.

first-order autoregressive model  Regression model to measure first-order autocorrelation in a time series.
second-order autoregressive model  Regression model to measure second-order autocorrelation in a time series.
pth-order autoregressive model  Regression model to measure autocorrelation p orders apart in a time series.

EXAMPLE 14.5
COMPARISON SCHEMA FOR A FIRST-ORDER AUTOREGRESSIVE MODEL
Consider the following series of n = 7 consecutive annual values:

Year     1    2    3    4    5    6    7
Series  31   34   37   35   36   43   40

Demonstrate the comparisons needed for a first-order autoregressive model.

SOLUTION
First-order autoregressive model (Yi versus Yi−1):

Year i   Yi versus Yi−1
1        31 versus —
2        34 versus 31
3        37 versus 34
4        35 versus 37
5        36 versus 35
6        43 versus 36
7        40 versus 43

Because no value is recorded prior to Y1, this value is lost for regression analysis. Therefore, the first-order autoregressive model is based on six pairs of values.
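The same comparison schema can be written down in a few lines of Python; this sketch simply pairs each value with its predecessor and confirms that six pairs remain:

```python
# Build the (Yi, Yi-1) comparison pairs for a first-order autoregressive model.
series = [31, 34, 37, 35, 36, 43, 40]
pairs = [(series[i], series[i - 1]) for i in range(1, len(series))]
print(pairs)       # [(34, 31), (37, 34), (35, 37), (36, 35), (43, 36), (40, 43)]
print(len(pairs))  # 6 usable pairs; year 1 is lost
```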


EXAMPLE 14.6
COMPARISON SCHEMA FOR A SECOND-ORDER AUTOREGRESSIVE MODEL
Consider the following series of n = 7 consecutive annual values:

Year     1    2    3    4    5    6    7
Series  31   34   37   35   36   43   40

Demonstrate the comparisons needed for a second-order autoregressive model.

SOLUTION
Second-order autoregressive model (Yi versus Yi−1 and Yi versus Yi−2):

Year i   Yi versus Yi−1 and Yi versus Yi−2
1        31 versus — and 31 versus —
2        34 versus 31 and 34 versus —
3        37 versus 34 and 37 versus 31
4        35 versus 37 and 35 versus 34
5        36 versus 35 and 36 versus 37
6        43 versus 36 and 43 versus 35
7        40 versus 43 and 40 versus 36

Because no value is recorded prior to Y1, two values are lost for regression analysis. Therefore, the second-order autoregressive model is based on five pairs of values.

After selecting a model and using the least-squares method to calculate estimates of the parameters, you need to determine the appropriateness of the model. Either you can select a particular pth-order autoregressive model based on previous experiences with similar data or, as a starting point, you can choose a model with several parameters and then eliminate the parameters that do not significantly contribute to the model. In this latter approach, you use a t test for the significance of Ap, the highest-order autoregressive parameter in the current model under consideration. The null and alternative hypotheses are:

H0: Ap = 0
H1: Ap ≠ 0

Equation 14.15 defines the test statistic.

t TEST FOR SIGNIFICANCE OF THE HIGHEST-ORDER AUTOREGRESSIVE PARAMETER Ap

t = (ap − Ap)/Sap   (14.15)

where  Ap = hypothesised value of the highest-order parameter, Ap, in the regression model
       ap = the estimate of the highest-order parameter, Ap, in the autoregressive model
       Sap = the standard deviation of ap

and the test statistic t follows a t distribution with n − 2p − 1 degrees of freedom.

For a given level of significance α, you reject the null hypothesis if the calculated t test statistic is greater than the upper-tail critical value from the t distribution, or if the calculated t test statistic is less than the lower-tail critical value from the t distribution. Thus, the decision rule is:

Reject H0 if t > tn−2p−1 or if t < −tn−2p−1; otherwise, do not reject H0.
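A small Python sketch of this decision rule, assuming SciPy is available; the values of n and p in the example call anticipate the third-order model fitted below, and the t statistic shown is purely illustrative.

```python
# Two-tail decision rule for H0: Ap = 0 with n - 2p - 1 degrees of freedom.
from scipy.stats import t as t_dist

def reject_highest_order(t_stat, n, p, alpha=0.05):
    t_crit = t_dist.ppf(1 - alpha / 2, df=n - 2 * p - 1)
    return abs(t_stat) > t_crit, t_crit

# e.g. a hypothetical t statistic of 0.26 for a third-order model on 36 annual values:
decision, t_crit = reject_highest_order(t_stat=0.26, n=36, p=3)
print(round(t_crit, 4), decision)   # 2.0452 False -> do not reject H0
```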


Figure 14.13 illustrates the decision rule and regions of rejection and non-rejection.

Figure 14.13  Rejection regions for a two-tail test for the significance of the highest-order autoregressive parameter, Ap
[Sketch of the t distribution centred on Ap = 0: a region of non-rejection of area 1 − α lies between the critical values −tn−2p−1 and +tn−2p−1, with a region of rejection of area α/2 in each tail.]

If you do not reject the null hypothesis that Ap = 0, you conclude that the selected model contains too many estimated parameters. You then discard the highest-order term and estimate an autoregressive model of order p − 1 using the least-squares method. You then repeat the test of the hypothesis that the new highest-order term is 0. This testing and modelling continues until you reject H0. When this occurs, you know that the remaining highest-order parameter is significant and you can use that model for forecasting purposes. Equation 14.16 defines the fitted pth-order autoregressive equation.

FITTED pTH-ORDER AUTOREGRESSIVE EQUATION

Ŷi = a0 + a1Yi−1 + a2Yi−2 + … + apYi−p   (14.16)

where  Ŷi = fitted value of the series at time i
       Yi−1 = observed value of the series at time i − 1
       Yi−2 = observed value of the series at time i − 2
       Yi−p = observed value of the series at time i − p
       a0, a1, a2, …, ap = regression estimates of the parameters A0, A1, A2, …, Ap

Use Equation 14.17 to forecast j years into the future from the current nth time period.

pTH-ORDER AUTOREGRESSIVE FORECASTING EQUATION

Ŷn+j = a0 + a1Ŷn+j−1 + a2Ŷn+j−2 + … + apŶn+j−p   (14.17)

where  a0, a1, a2, …, ap = regression estimates of the parameters A0, A1, A2, …, Ap
       j = number of years into the future
       Ŷn+j−p = forecast of Yn+j−p from the current time period for j − p > 0
       Ŷn+j−p = observed value of Yn+j−p for j − p ⩽ 0

Thus, to make forecasts j years into the future using a third-order autoregressive model, we need only the most recent p = 3 values (Yn, Yn−1 and Yn−2) and the regression estimates a0, a1, a2 and a3. To forecast 1 year ahead, Equation 14.17 becomes:

Ŷn+1 = a0 + a1Yn + a2Yn−1 + a3Yn−2


To forecast 2 years ahead, Equation 14.17 becomes:

Ŷn+2 = a0 + a1Ŷn+1 + a2Yn + a3Yn−1

To forecast 3 years ahead, Equation 14.17 becomes:

Ŷn+3 = a0 + a1Ŷn+2 + a2Ŷn+1 + a3Yn

and so on.
Autoregressive modelling is a powerful forecasting technique for time series exhibiting autocorrelation. Although slightly more complicated than other methodologies you have studied, the following step-by-step approach should guide you through the analysis (a short code sketch of steps 2 to 4 appears after the worked example below):
1. Choose a value for p, the highest-order parameter in the autoregressive model to be evaluated, realising that the t test for significance is based on n − 2p − 1 degrees of freedom.
2. Form a series of p 'lagged predictor' variables such that the first variable lags by one time period, the second variable lags by two time periods and so on, and the last predictor variable lags by p time periods (see Figure 14.14).
3. Use Microsoft Excel to perform a least-squares analysis of the multiple regression model containing all p lagged predictor variables.
4. Test for the significance of Ap, the highest-order autoregressive parameter in the model.
   a. If you do not reject the null hypothesis, discard the pth variable and repeat steps 3 and 4 with an evaluation of the new highest-order parameter whose predictor variable lags by p − 1 years. The test for the significance of the new highest-order parameter is based on a t distribution whose degrees of freedom are revised to correspond with the new number of predictors.
   b. If you reject the null hypothesis, select the autoregressive model with all p predictors for fitting (see Equation 14.16) and forecasting (see Equation 14.17).
To demonstrate the autoregressive modelling technique, return to the time series concerning female labour force participation rates in Australia for the 36-year period 1979–2014. Figure 14.14 displays the data setup for the first-order, second-order and third-order autoregressive models. All the columns in this table are needed for fitting the third-order autoregressive model. The last column is omitted when fitting second-order autoregressive models, and the last two columns are eliminated when fitting first-order autoregressive models. Thus, p = 1, 2 or 3 values out of n = 36 are lost in the comparisons needed for developing the first-order, second-order and third-order autoregressive models.
Selecting an autoregressive model that best fits the annual time series begins with the third-order autoregressive model depicted in Figure 14.15, using only the rows with complete data in Microsoft Excel. From Figure 14.15, the fitted third-order autoregressive equation is:

Ŷi = 2.3694 + 1.2732Yi−1 − 0.3553Yi−2 + 0.0434Yi−3

where the origin is 1982 and units of Y = 1 year.
Next, you test for the significance of A3, the highest-order parameter. The highest-order parameter estimate, a3, for the fitted third-order autoregressive model is 0.0434, with a standard error of 0.1690. The null and alternative hypotheses are:

H0: A3 = 0
H1: A3 ≠ 0

From Equation 14.15 and the Microsoft Excel output given in Figure 14.15:

t = (a3 − A3)/Sa3 = (0.0434 − 0)/0.1690 = 0.2567
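The textbook fits these models with Microsoft Excel's regression tool. The NumPy sketch below carries out steps 2 to 4 of the approach above for the third-order model: it builds the lagged predictors of Figure 14.14, estimates the parameters by least squares and computes the t statistic of Equation 14.15 for the lag-3 parameter. The printed values should agree with Figure 14.15 up to rounding, since the data here are the rates rounded to one decimal place.

```python
import numpy as np

# Participation rates 1979-2014 (from Figure 14.14, rounded to one decimal place).
y = np.array([43.6, 44.8, 44.7, 44.6, 44.7, 45.3, 46.3, 48.3, 48.9, 49.9,
              51.2, 52.2, 52.0, 51.9, 51.7, 52.6, 53.6, 53.8, 53.6, 53.6,
              53.6, 54.5, 55.0, 55.2, 55.9, 55.7, 57.0, 57.5, 58.1, 58.5,
              58.7, 58.6, 58.9, 58.7, 58.6, 58.6])

p = 3                                    # highest-order parameter to evaluate
Y = y[p:]                                # responses for years 4, ..., n
X = np.column_stack([np.ones(len(Y))] +
                    [y[p - k:len(y) - k] for k in (1, 2, 3)])  # intercept, lags 1-3

coef, *_ = np.linalg.lstsq(X, Y, rcond=None)        # estimates a0, a1, a2, a3
resid = Y - X @ coef
s2 = resid @ resid / (len(Y) - X.shape[1])          # residual variance, 29 df
se = np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))  # coefficient standard errors
print(np.round(coef, 4))                # approx. [ 2.3694  1.2732 -0.3553  0.0434]
print(round(coef[-1] / se[-1], 4))      # approx. 0.2567, the t statistic for A3
```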


Figure 14.14  Developing first-order, second-order and third-order autoregressive models for female labour force participation rates (1979–2014)
[Excel worksheet listing, for each year from 1979 to 2014, the labour force participation rate together with its Lag 1, Lag 2 and Lag 3 values; the first one, two and three lagged cells, respectively, are #NA because no earlier observations exist.]

Using a 0.05 level of significance, the two-tail t test with 29 degrees of freedom has critical values of ±2.0452. Because −2.0452 < t = 0.2567 < +2.0452, or because the p-value = 0.7992 > α = 0.05, you do not reject H0. You conclude that the third-order parameter of the autoregressive model is not significant and can be deleted. Using Microsoft Excel (see Figure 14.16), you fit a second-order autoregressive model. The fitted second-order autoregressive equation is:

Ŷi = 1.7578 + 1.2137Yi−1 − 0.2412Yi−2

where the origin is 1981 and units of Y = 1 year. From the Microsoft Excel output, the highest-order parameter estimate is a2 = −0.2412 with a standard error = 0.1670.


Figure 14.15  Microsoft Excel output for the third-order autoregressive model for female labour force participation rates (1979–2014)

Regression statistics
Multiple R          0.9936
R square            0.9873
Adjusted R square   0.9860
Standard error      0.5153
Observations        33

ANOVA
             df        SS         MS          F      Significance F
Regression    3   597.4822   199.1607   750.1278          0.0000
Residual     29     7.6996     0.2655
Total        32   605.1818

            Coefficients  Standard error   t stat   p-value  Lower 95%  Upper 95%
Intercept      2.3694        1.1152        2.1247    0.0423     0.0886     4.6501
Lag 1          1.2732        0.1749        7.2802    0.0000     0.9155     1.6309
Lag 2         -0.3553        0.2705       -1.3135    0.1993    -0.9085     0.1979
Lag 3          0.0434        0.1690        0.2567    0.7992    -0.3022     0.3890

Figure 14.16  Microsoft Excel output for the second-order autoregressive model for female labour force participation rates (1979–2014)

Regression statistics
Multiple R          0.9936
R square            0.9872
Adjusted R square   0.9864
Standard error      0.5297
Observations        34

ANOVA
             df        SS         MS           F      Significance F
Regression    2   672.5824   336.2912   1198.7722          0.0000
Residual     31     8.6964     0.2805
Total        33   681.2788

            Coefficients  Standard error   t stat   p-value  Lower 95%  Upper 95%
Intercept      1.7578        1.0896        1.6132    0.1168    -0.4645     3.9800
Lag 1          1.2137        0.1724        7.0413    0.0000     0.8621     1.5652
Lag 2         -0.2412        0.1670       -1.4446    0.1586    -0.5818     0.0993

To test: H0: A2 = 0 against: H1: A2 ≠ 0


from Equation 14.15:

t = (a2 − A2)/Sa2 = (−0.2412 − 0)/0.1670 = −1.4446

To test at the 0.05 level of significance, the two-tail t test with 31 degrees of freedom has critical values t31 of ±2.0395. Because −2.0395 < t = −1.4446 < +2.0395, or because the p-value = 0.1586 > α = 0.05, you cannot reject H0. You conclude that the second-order parameter of the autoregressive model is not significant and can be deleted. Using Microsoft Excel (see Figure 14.17), you fit a first-order autoregressive model.

Figure 14.17  Microsoft Excel output for the first-order autoregressive model for female labour force participation rates (1979–2014)

Regression statistics
Multiple R          0.9937
R square            0.9874
Adjusted R square   0.9870
Standard error      0.5361
Observations        35

ANOVA
             df        SS         MS           F       Significance F
Regression    1   741.8841   741.8841   2581.6055           0.0000
Residual     33     9.4833     0.2874
Total        34   751.3674

            Coefficients  Standard error   t stat   p-value  Lower 95%  Upper 95%
Intercept      2.4859        0.9993        2.4876    0.0181     0.4528     4.5190
Lag 1          0.9609        0.0189       50.8095    0.0000     0.9224     0.9994

The first-order autoregressive equation is:

Ŷi = 2.4859 + 0.9609Yi−1

From the Microsoft Excel output, the highest-order parameter estimate is a1 = 0.9609 with a standard error = 0.0189. To test:

H0: A1 = 0
H1: A1 ≠ 0

from Equation 14.15:

t = (a1 − A1)/Sa1 = (0.9609 − 0)/0.0189 = 50.8095

To test at the 0.05 level of significance, the two-tail t test with 33 degrees of freedom has critical values t33 of ±2.0345. Because t = 50.8095 > 2.0345, or because the p-value = 0.0000 < α = 0.05, you reject H0. You conclude that the first-order parameter of the autoregressive model is significant and should remain in the model. The model-building approach has led to the selection of the first-order autoregressive model over the second-order model because of the significant first-order parameter. Using the estimates a0 = 2.4859


and a1 = 0.9609, as well as the most recent data value Y36 = 58.6, the forecasts of female labour force participation rates in Australia for 2015 and 2016 from Equation 14.17 are:

Ŷn+j = 2.4859 + 0.9609Ŷn+j−1

2015: 1 year ahead   Ŷ37 = 2.4859 + 0.9609(58.6) = 58.8%
2016: 2 years ahead  Ŷ38 = 2.4859 + 0.9609(58.8) = 59.0%
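A short Python sketch of this forecasting recursion, assuming the rounded estimates a0 = 2.4859 and a1 = 0.9609 from Figure 14.17 and the 2014 value Y36 = 58.6:

```python
# One- and two-year-ahead forecasts from the fitted first-order model (Eq. 14.17).
a0, a1 = 2.4859, 0.9609
y_hat = 58.6                        # most recent observed value, Y36 (2014)
for year in (2015, 2016):
    y_hat = a0 + a1 * y_hat         # each forecast feeds the next one
    print(year, round(y_hat, 1))    # 2015 58.8, 2016 59.0
```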

Figure 14.18 displays the predicted Y values from the first-order autoregressive model.

Figure 14.18  Microsoft Excel plot of actual and predicted female labour force participation rate from a first-order autoregressive model
[Line chart of the participation rate (per cent) against year, 1979–2014, showing the actual series and the AR1 predictions.]

Problems for Section 14.6

LEARNING THE BASICS
14.26 You are given an annual time series with 50 consecutive values and asked to fit a fifth-order autoregressive model.
a. How many comparisons are lost in the development of the autoregressive model?
b. How many parameters do you need to estimate?
c. Which of the original 50 values do you need for forecasting?
d. State the model.
e. Write an equation to indicate how you would forecast j years into the future.
14.27 A third-order autoregressive model is fitted to an annual time series with 19 values and has the following estimated parameters and standard deviations:
a0 = 10.20   a1 = 1.80   a2 = 0.99   a3 = 0.21
Sa1 = 0.25   Sa2 = 0.40   Sa3 = 0.09
At the 0.05 level of significance, test the appropriateness of the fitted model.
14.28 Refer to problem 14.27. The three most recent values are:
Y17 = 12   Y18 = 14   Y19 = 19
Forecast the values for the next year and the following year.
14.29 Refer to problem 14.27. Suppose, when testing for the appropriateness of the fitted model, the standard deviations are:
Sa1 = 0.90   Sa2 = 0.35   Sa3 = 0.2
a. What conclusions can you make?
b. Discuss how to proceed if forecasting is still your main objective.

APPLYING THE CONCEPTS
Use Microsoft Excel to solve problems 14.30 to 14.33.

14.30 Refer to the data given in problem 14.11 on page 564 that represent New Zealand GDP.
a. Fit a third-order autoregressive model to the GDP data and test for the significance of the third-order autoregressive parameter. (Use α = 0.05.)


b. Fit a second-order autoregressive model to the GDP data and test for the significance of the second-order autoregressive parameter. (Use α = 0.05.)
c. Fit a first-order autoregressive model to the GDP data and test for the significance of the first-order autoregressive parameter. (Use α = 0.05.)
d. If appropriate, forecast welfare as a percentage of GDP in 2002 and 2003.
14.31 Refer to the data introduced in problem 14.13 on page 565 concerning average weekly earnings of female employees in Australia.
a. Fit a third-order autoregressive model to the earnings data and test for the significance of the third-order autoregressive parameter. (Use α = 0.05.)
b. If necessary, fit a second-order autoregressive model to the earnings data and test for the significance of the second-order autoregressive parameter. (Use α = 0.05.)
c. If necessary, fit a first-order autoregressive model to the earnings data and test for the significance of the first-order autoregressive parameter. (Use α = 0.05.)

14.32 Refer to the data given in problem 14.14 on page 565 that represent the CPI for Australia from September 2006 to September 2011.
a. Fit a third-order autoregressive model to the data and test for the significance of the third-order autoregressive parameter. (Use α = 0.05.)
b. Fit a second-order autoregressive model to the data and test for the significance of the second-order autoregressive parameter. (Use α = 0.05.)
c. Fit a first-order autoregressive model to the data and test for the significance of the first-order autoregressive parameter. (Use α = 0.05.)
d. If appropriate, forecast the data for 2012.
14.33 Refer to the data given in problem 14.17 on page 566 that represent the number of overseas arrivals into Australia.
a. Fit a first-order autoregressive model to the data and test for the significance of the first-order autoregressive parameter. (Use α = 0.05.)
b. If appropriate, forecast arrivals for 2016 and 2017.

LEARNING OBJECTIVE 7
Choose an appropriate forecasting model

14.7  CHOOSING AN APPROPRIATE FORECASTING MODEL

In Sections 14.4 to 14.6 we looked at seven time-series forecasting methods: the linear trend model, the quadratic trend model and the exponential trend model in Section 14.4; the Holt–Winters method in Section 14.5; and the first-order, second-order and pth-order autoregressive models in Section 14.6.
Is there a best model? Which of these models should you select for forecasting? The following guidelines are provided for determining the adequacy of a particular forecasting model. These guidelines are based on a judgment of how well the model fits the past data of a given time series, and assume that future movements in the time series can be projected by a study of the past data:
• Perform a residual analysis.
• Measure the magnitude of the residual error through squared differences.
• Measure the magnitude of the residual error through absolute differences.
• Use the principle of parsimony.

A discussion of these guidelines follows.

Performing a Residual Analysis

Recall from Sections 12.5 and 13.3 that residuals are the differences between the observed and the predicted values. After fitting a particular model to a time series, we plot the residuals over the n time periods. As shown in panel A of Figure 14.19, if the particular model fits adequately, the residuals represent the irregular component of the time series. Therefore, they should be randomly distributed throughout the series. However, as illustrated in the three remaining panels of Figure 14.19, if the particular model does not fit adequately, the residuals may show a systematic pattern such as a failure to account for trend (panel B), a failure to account for cyclical variation (panel C) or, with monthly or quarterly data, a failure to account for seasonal variation (panel D).

Measuring the Magnitude of the Residual Error with Squared or Absolute Differences

If, after performing a residual analysis, you still believe that two or more models appear to fit the data adequately, you can use additional methods for model selection.


mean absolute deviation (MAD)  The mean absolute difference between the actual and predicted values in a time series.

Figure 14.19  Residual analysis for studying error patterns
[Four panels plotting the residuals ei = Yi − Ŷi against time (years). Panel A: randomly distributed forecast errors. Panel B: trend not accounted for. Panel C: cyclical effects not accounted for. Panel D: seasonal effects not accounted for.]

Numerous measures based on the residual error are available (see references 1 and 2). However, there is no consensus among statisticians as to which particular measure is best for determining the most appropriate forecasting model.
Based on the principle of least squares, one measure that we have already used in regression analysis (see Section 12.2) is the standard error of the estimate (SYX). For a particular model, this measure is based on the sum of squared differences between the actual and the predicted values in a time series. If a model fits the time-series data perfectly, then the standard error of the estimate is zero. If a model fits the time-series data poorly, then SYX is large. Thus, when comparing the adequacy of two or more forecasting models, you can select the model with the minimum SYX as being the most appropriate.
However, a major drawback to using SYX when comparing forecasting models is that it penalises a model too much for a large individual forecasting error. Thus, whenever there is a large difference between even a single Yi and Ŷi, the value of SYX (see Chapter 12, page 472) becomes magnified through the squaring process. For this reason, a measure that many statisticians seem to prefer for measuring the appropriateness of various forecasting models is the mean absolute deviation (MAD). Equation 14.18 defines the MAD as the mean of the absolute differences between the actual and predicted values in a time series.

MEAN ABSOLUTE DEVIATION

MAD = (1/n) Σ(i=1 to n) |Yi − Ŷi|   (14.18)

If a model fits the time-series data perfectly, the MAD is zero. If a model fits the time-series data poorly, the MAD is large. When comparing two or more forecasting models, you can select the one with the minimum MAD as being the most appropriate.
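A minimal Python sketch of these three error measures, given arrays of actual and predicted values; the divisor used for SYX here assumes k estimated parameters, which matches the regression convention of Chapter 12 but may differ slightly from the divisor used for the smoothing-based models in Figure 14.21.

```python
import numpy as np

def error_measures(actual, predicted, k=1):
    """Return SSE, SYX and MAD for a fitted model with k estimated parameters."""
    e = np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float)
    sse = float(np.sum(e ** 2))                    # error sum of squares
    syx = float(np.sqrt(sse / (len(e) - k - 1)))   # standard error of the estimate
    mad = float(np.mean(np.abs(e)))                # mean absolute deviation (Eq. 14.18)
    return sse, syx, mad

# e.g. error_measures(actual_rates, linear_predictions, k=1) for a linear trend model
```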


Principle of Parsimony

If, after performing a residual analysis and comparing the SYX and MAD measures, you still believe that two or more models appear to fit the data adequately, then you can use the principle of parsimony for model selection.

The principle of parsimony is the belief that you should select the simplest model that gets the job done adequately.

parsimony The process of choosing the simplest model in terms of independent variables that still adequately explains the variation in the dependent variable.

Among the seven forecasting models studied in this chapter, the least-squares linear and quadratic models and the first-order autoregressive model are regarded by most statisticians as the simplest. The second- and pth-order autoregressive models, the least-squares exponential model and the Holt–Winters model are considered the most complex of the techniques presented.

Comparison of Five Forecasting Methods

Consider once again the female labour force participation rate data for Australia. To illustrate the model-selection process, this section compares five of the forecasting models used in Sections 14.4 to 14.6: the linear model, the quadratic model, the exponential model, the Holt–Winters model and the first-order autoregressive model. (There is no need for further study of the second-order or third-order autoregressive models for this time series, because they did not significantly improve the fit over the simpler first-order model.)
Figure 14.20 displays the residual plots for the five models. In drawing conclusions from these residual plots, you must use caution because there are only 36 values. In Figure 14.20, observe the systematic structure of the residuals in the linear model (panel A), the quadratic model (panel B) and the exponential model (panel C). For the first-order autoregressive model (panel D) and the Holt–Winters model (panel E), the residuals appear more random. To summarise, on the basis of the residual analysis of all five forecasting models, it appears that the Holt–Winters model and the first-order autoregressive model are the most appropriate, and the linear, quadratic and exponential models are the least appropriate.
For further verification, compare the five models with respect to the magnitude of their residuals. Figure 14.21 provides the actual values (Yi) together with the predicted values, the residuals (ei), the error sum of squares (SSE), the standard error of the estimate (SYX) and the mean absolute deviation (MAD) for each of the five models. For this time series, the SSE, SYX and MAD provide similar results. A comparison of the SSE, SYX and MAD indicates that, for the female labour force participation rate series, the best-fitting model is the Holt–Winters model, followed by the first-order autoregressive model, with the exponential model the worst fit. Although the Holt–Winters model is clearly superior in terms of the historical fit of the data, since the choice of the smoothing constants is subjective we might want to consider trying to improve the fit even more with other smoothing constants.
Once we have selected a particular forecasting model, we need to monitor our forecasts continually. If large errors occur between forecast and actual values, the underlying structure of the time series may have changed. Remember that the forecasting methods presented in this chapter assume that the patterns inherent in the past will continue into the future. Large forecast errors are an indication that this assumption is no longer true.


Figure 14.20  Microsoft Excel residual plots for five forecasting methods
[Five panels plotting residuals against year, 1979–2014: Panel A, linear model; Panel B, quadratic model; Panel C, exponential model; Panel D, AR1 model; Panel E, Holt–Winters model.]


Figure 14.21  Comparison of five forecasting methods using SSE, SYX and MAD
[Excel worksheet listing, for each year from 1979 to 2014, the actual participation rate together with the predicted value and residual from each of the five models (linear, quadratic, exponential, Holt–Winters and AR1), and the following summary measures:]

        Linear   Quadratic   Exponential   Holt–Winters   AR1
SSE      44.1       20.3         57.0            1.5       9.5
SYX       1.3        0.6          1.7            0.0       0.3
MAD       0.9        0.6          1.0            0.2       0.4

Problems for Section 14.7

LEARNING THE BASICS
14.34 The following residuals are from a linear trend model used to forecast sales:
3.0  −0.5  0.0  −0.2  0.7  −2.7  −0.5  −0.1  0.3
a. Calculate SYX and interpret your findings.
b. Calculate the MAD and interpret your findings.
14.35 Refer to problem 14.34. Suppose the first residual is 4.0 (instead of 3.0) and the last value is −0.7 (instead of 0.3).
a. Calculate SYX and interpret your findings.
b. Calculate the MAD and interpret your findings.

APPLYING THE CONCEPTS
You should use Microsoft Excel to solve problems 14.36 to 14.39.

14.36 Refer to the results in problem 14.15 on page 566.
a. Perform a residual analysis.
b. Calculate the standard error of the estimate (SYX).


c. Calculate the MAD.
d. On the basis of (a), (b) and (c), are you satisfied with your linear trend forecast in problem 14.15? Discuss.
14.37 Refer to the results in problem 14.12 on page 564 and problem 14.30 on page 578 concerning welfare as a percentage of GDP in New Zealand.
a. Perform a residual analysis for each model.
b. Calculate the standard error of the estimate (SYX) for each model.
c. Calculate the MAD for each model.
d. On the basis of (a), (b), (c) and parsimony, which forecasting model would you select? Discuss.
14.38 Refer to the results in problem 14.13 on page 565, and problems 14.24 (page 570) and 14.31 (page 579) concerning female earnings in Australia.

a. Perform a residual analysis for each model.
b. Calculate the standard error of the estimate (SYX) for each model.
c. Calculate the MAD for each model.
d. On the basis of (a), (b), (c) and parsimony, which forecasting model would you select? Discuss.
14.39 Refer to the results in problem 14.17 on page 566 and problems 14.25 and 14.33 concerning overseas arrivals in Australia.
a. Perform a residual analysis for each model.
b. Calculate the standard error of the estimate (SYX) for each model.
c. Calculate the MAD for each model.
d. On the basis of (a), (b), (c) and the principle of parsimony, which forecasting model would you select? Discuss.

LEARNING OBJECTIVE 8
Estimate a forecasting model for seasonal data

14.8  TIME-SERIES FORECASTING OF SEASONAL DATA

So far, this chapter has focused on time-series forecasting with annual data. However, numerous time series are collected quarterly or monthly, and others are collected weekly, daily or even hourly. When a time series is collected quarterly or monthly, we must consider the impact of seasonal effects (see Table 14.1). In this section, regression model building is used to forecast monthly or quarterly data.
We return to the female unemployment rate time series from Section 14.4, where we would expect a seasonal pattern in the quarterly data. The quarterly pattern is the division of the months of the year into four equal quarters (Mar = January to March; June = April to June; Sept = July to September; Dec = October to December). Table 14.4 provides the quarterly female unemployment rates in Australia from 2005 to 2014. Figure 14.22 displays the time series.

Table 14.4  Quarterly female unemployment rates in Australia (2005–2014)
Source: Australian Bureau of Statistics, Labour Force, Australia, Cat. No. 6202.0.

Unemployment rate (%)
Year    Mar    June   Sept   Dec
2005    6.0    5.3    4.9    4.8
2006    5.6    5.0    4.5    4.5
2007    5.5    4.8    4.5    4.4
2008    4.9    4.5    4.4    4.4
2009    5.8    5.4    5.3    5.2
2010    5.8    5.3    5.2    5.2
2011    5.8    5.2    5.1    5.1
2012    5.8    5.4    5.0    5.1
2013    6.0    5.6    5.5    5.4
2014    6.8    6.0    —      —


Figure 14.22  Microsoft Excel plot of quarterly female unemployment rates in Australia (2005–2014)
[Line chart of the unemployment rate (per cent) against year, 2005–2014.]
Source: Data from Australian Bureau of Statistics, Labour Force, Australia, Cat. No. 6202.0.

For quarterly time series such as these, the classical multiplicative time-series model includes the seasonal component in addition to the trend, cyclical and irregular components. It is expressed by Equation 14.2 (page 547) as: Yi = Ti × Si × Ci × Ii

Least-Squares Forecasting with Monthly or Quarterly Data

To develop a least-squares regression model that includes trend, seasonal, cyclical and irregular components, the approach to least-squares trend fitting in Section 14.4 is combined with the approach to model-building using categorical independent variables (see Section 13.6) to model the seasonal component. Equation 14.19 defines the exponential trend model for quarterly data.

EXPONENTIAL MODEL WITH QUARTERLY DATA

Yi = β0 β1^Xi β2^Q1 β3^Q2 β4^Q3 εi   (14.19)

where  Xi = coded quarterly value, i = 0, 1, 2, …
       Q1 = 1 if first quarter, 0 if not first quarter
       Q2 = 1 if second quarter, 0 if not second quarter
       Q3 = 1 if third quarter, 0 if not third quarter
       β0 = Y intercept
       (β1 − 1) × 100% = quarterly compound growth rate (in %)
       β2 = multiplier for first quarter relative to fourth quarter
       β3 = multiplier for second quarter relative to fourth quarter
       β4 = multiplier for third quarter relative to fourth quarter
       εi = value of the irregular component for time period i


The model in Equation 14.19 is not in the form of a linear regression model. To transform this non-linear model into a linear model, we use a base 10 logarithmic transformation. (Alternatively, we can use base e logarithms. See Appendix A.) Taking the logarithm of each side of Equation 14.19 yields Equation 14.20.

TRANSFORMED EXPONENTIAL MODEL WITH QUARTERLY DATA

log(Yi) = log(β0 β1^Xi β2^Q1 β3^Q2 β4^Q3 εi)
        = log(β0) + log(β1^Xi) + log(β2^Q1) + log(β3^Q2) + log(β4^Q3) + log(εi)   (14.20)
        = log(β0) + Xi log(β1) + Q1 log(β2) + Q2 log(β3) + Q3 log(β4) + log(εi)

Equation 14.20 is a linear model that we can estimate using least-squares regression. Performing the regression using log(Yi) as the dependent variable and Xi, Q1, Q2 and Q3 as the independent variables results in Equation 14.21.

EXPONENTIAL GROWTH WITH QUARTERLY DATA FORECASTING EQUATION

log(Ŷi) = b0 + b1Xi + b2Q1 + b3Q2 + b4Q3   (14.21)

where  b0 = estimate of log(β0), and thus 10^b0 = β̂0
       b1 = estimate of log(β1), and thus 10^b1 = β̂1
       b2 = estimate of log(β2), and thus 10^b2 = β̂2
       b3 = estimate of log(β3), and thus 10^b3 = β̂3
       b4 = estimate of log(β4), and thus 10^b4 = β̂4

Equation 14.22 is used for monthly data.

EXPONENTIAL MODEL WITH MONTHLY DATA

Yi = β0 β1^Xi β2^M1 β3^M2 β4^M3 β5^M4 β6^M5 β7^M6 β8^M7 β9^M8 β10^M9 β11^M10 β12^M11 εi   (14.22)

where  Xi = coded monthly value, i = 0, 1, 2, …
       M1 = 1 if January, 0 if not January
       M2 = 1 if February, 0 if not February
       M3 = 1 if March, 0 if not March
       …
       M11 = 1 if November, 0 if not November
       β0 = Y intercept
       (β1 − 1) × 100% = monthly compound growth rate (in %)
       β2 = multiplier for January relative to December
       β3 = multiplier for February relative to December
       β4 = multiplier for March relative to December
       …
       β12 = multiplier for November relative to December
       εi = value of the irregular component for time period i


The model in Equation 14.22 is not in the form of a linear regression model. To transform this non-linear model into a linear model, we can use a base 10 logarithm transformation. Taking the logarithm of each side of Equation 14.22 yields Equation 14.23.

TRANSFORMED EXPONENTIAL MODEL WITH MONTHLY DATA

log(Yi) = log(β0 β1^Xi β2^M1 β3^M2 β4^M3 β5^M4 β6^M5 β7^M6 β8^M7 β9^M8 β10^M9 β11^M10 β12^M11 εi)   (14.23)
        = log(β0) + Xi log(β1) + M1 log(β2) + M2 log(β3) + M3 log(β4) + M4 log(β5) + M5 log(β6) + M6 log(β7) + M7 log(β8) + M8 log(β9) + M9 log(β10) + M10 log(β11) + M11 log(β12) + log(εi)

Equation 14.23 is a linear model that we can estimate using the least-squares method. Performing the regression using log(Yi) as the dependent variable and Xi, M1, M2, …, M11 as the independent variables results in Equation 14.24.

EXPONENTIAL GROWTH WITH MONTHLY DATA FORECASTING EQUATION

log(Ŷi) = b0 + b1Xi + b2M1 + b3M2 + b4M3 + b5M4 + b6M5 + b7M6 + b8M7 + b9M8 + b10M9 + b11M10 + b12M11   (14.24)

where  b0 = estimate of log(β0), and thus 10^b0 = β̂0
       b1 = estimate of log(β1), and thus 10^b1 = β̂1
       b2 = estimate of log(β2), and thus 10^b2 = β̂2
       b3 = estimate of log(β3), and thus 10^b3 = β̂3
       …
       b12 = estimate of log(β12), and thus 10^b12 = β̂12

Q1, Q2 and Q3 are the three dummy variables needed to represent the four quarter periods in a quarterly time series. M1, M2, M3, …, M11 are the 11 dummy variables needed to represent the 12 months in a monthly time series. In building the model, we use log(Yi) instead of Yi values and then find the actual regression coefficients by taking the antilog of the regression coefficients developed from Equations 14.21 and 14.24.
Although at first glance these regression models look imposing, when fitting or forecasting in any one time period the values of all, or all but one of, the dummy variables in the model are set equal to 0, and the equations simplify dramatically. In establishing the dummy variables for quarterly time-series data, the fourth quarter (December) is the base period and has a coded value of 0 for each dummy variable. With a quarterly time series, Equation 14.21 reduces as follows:

For any first quarter:   log(Ŷi) = b0 + b1Xi + b2
For any second quarter:  log(Ŷi) = b0 + b1Xi + b3
For any third quarter:   log(Ŷi) = b0 + b1Xi + b4
For any fourth quarter:  log(Ŷi) = b0 + b1Xi

When establishing the dummy variables for each month, December serves as the base period and has a coded value of 0 for each dummy variable. For example, with a monthly time series, Equation 14.24 reduces as follows:

For any January:   log(Ŷi) = b0 + b1Xi + b2
For any December:  log(Ŷi) = b0 + b1Xi


To demonstrate the process of model-building and least-squares forecasting with a quarterly time series, return to the female unemployment rate data originally displayed in Table 14.4. The data are from each quarter from the March (first) quarter of 2005 to the June (second) quarter of 2014. Microsoft Excel output for the quarterly exponential trend model is displayed in Figure 14.23.
From Figure 14.23, the model fits the data reasonably well. The adjusted coefficient of determination R2 is 0.6416, and the overall F test results in an F statistic of 17.5566 (p-value = 0.000). Looking further, at the 0.05 level of significance, each regression coefficient is statistically significant and contributes to the classical multiplicative time-series model except for the September quarter dummy. Taking the antilogs of all the regression coefficients, we have the following summary:

Regression coefficient       bi = log β̂i    β̂i = antilog(bi) = 10^bi
b0: Y intercept                 0.6517          4.4843
b1: time                        0.0020          1.0045
b2: March quarter               0.0751          1.1888
b3: June quarter                0.0299          1.0713
b4: September quarter           0.0049          1.0113

Figure 14.23  Microsoft Excel output for fitting and forecasting with the quarterly female unemployment rate data

Regression statistics
Multiple R          0.8248
R square            0.6803
Adjusted R square   0.6416
Standard error      0.0269
Observations        38

ANOVA
             df      SS       MS         F      Significance F
Regression    4   0.0507   0.0127   17.5566          0.0000
Residual     33   0.0238   0.0007
Total        37   0.0745

            Coefficients  Standard error   t stat   p-value  Lower 95%  Upper 95%
Intercept      0.6517        0.0117       55.6099    0.0000     0.6279     0.6755
Time           0.0020        0.0004        4.9330    0.0000     0.0012     0.0028
Mar            0.0751        0.0124        6.0808    0.0000     0.0500     0.1002
June           0.0299        0.0123        2.4240    0.0210     0.0048     0.0550
Sept           0.0049        0.0127        0.3860    0.7020    -0.0209     0.0307

The interpretations for β̂0, β̂1, β̂2, β̂3 and β̂4 are as follows:
• The Y intercept β̂0 = 4.4843% is the unadjusted forecast for the female unemployment rate in the first quarter of 2005, the initial quarter in the time series. Unadjusted means that the seasonal component is not incorporated.
• The value (β̂1 − 1) × 100% = 0.45% is the estimated quarterly compound growth rate in the unemployment rate, after adjusting for the seasonal component.
• β̂2 = 1.1888 is the seasonal multiplier for the March quarter relative to the December quarter; it indicates that March quarter unemployment is 18.88% higher than that in the December quarter.


• β̂3 = 1.0713 is the seasonal multiplier for the June quarter relative to the December quarter; it indicates that June quarter unemployment is 7.13% higher than that in the December quarter.
• β̂4 = 1.0113 is the seasonal multiplier for the September quarter relative to the December quarter; it indicates that September quarter unemployment is 1.13% higher than that in the December quarter. Note, however, that this coefficient is statistically insignificant.

Using the regression coefficients b0, b1, b2, b3 and b4 and Equation 14.21, we can make forecasts for selected quarters. As an example, to predict the unemployment rate for the December quarter of 2014 (Xi = 39):

log(Ŷi) = b0 + b1Xi = 0.6517 + 0.0020(39) = 0.7297

Thus:

Ŷi = 10^0.7297 ≈ 5.4%

The predicted female unemployment rate for the December quarter of 2014 is approximately 5.4%. To make a forecast for the March quarter of 2015 (Xi = 40, Mar = 1):

log(Ŷi) = b0 + b1Xi + b2Mar = 0.6517 + 0.0020(40) + 0.0751(1) = 0.8068

Thus:

Ŷi = 10^0.8068 ≈ 6.4%

The predicted female unemployment rate for the March quarter of 2015 is approximately 6.4%.
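The following Python sketch evaluates Equation 14.21 with the rounded coefficients from Figure 14.23 and reproduces the seasonal multipliers and the two forecasts above; note that the growth rate printed from the rounded b1 differs slightly from the 0.45% quoted in the text, which uses the unrounded coefficient.

```python
# Forecasting from the fitted quarterly exponential model (Equation 14.21).
b0, b1 = 0.6517, 0.0020
b_quarter = {"Mar": 0.0751, "June": 0.0299, "Sept": 0.0049, "Dec": 0.0}

def forecast(x, quarter):
    log_y = b0 + b1 * x + b_quarter[quarter]   # log10 of the forecast
    return 10 ** log_y

print(round(10 ** b_quarter["Mar"], 4))   # 1.1888, the March-quarter multiplier
print(round(10 ** b1 - 1, 4))             # ~0.0046 quarterly compound growth rate
print(round(forecast(39, "Dec"), 1))      # ~5.4 (%), December quarter 2014
print(round(forecast(40, "Mar"), 1))      # ~6.4 (%), March quarter 2015
```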

Problems for Section 14.8

LEARNING THE BASICS
14.40 In forecasting a monthly time series over the period from January 2010 to December 2017, the exponential trend forecasting equation for January is:
log Ŷi = 2.6 + 0.03Xi + 0.18 January
Take the antilog of the appropriate coefficient from the above equation and interpret:
a. the Y intercept
b. the monthly compound growth rate
c. the January multiplier
14.41 If forecasting weekly time-series data, how many dummy variables are needed to account for the seasonal categorical variable week?
14.42 In forecasting a quarterly time series over the five-year period from the first quarter 2013 to the fourth quarter 2017, the exponential trend forecasting equation is given by:
log Ŷi = 4.5 + 0.03Xi − 0.32Q1 + 0.40Q2 + 0.17Q3
where the origin is the first quarter of 2013 and units of X = 1 quarter. Take the antilog of the appropriate coefficient from the above equation and interpret the:
a. Y intercept β̂0
b. quarterly compound growth rate
c. second-quarter multiplier
14.43 Refer to the exponential model given in problem 14.42.
a. What is the fitted value of the series in the third quarter of 2017?
b. What is the fitted value of the series in the fourth quarter of 2017?
c. What is the forecast in the first quarter of 2018?
d. What is the forecast in the second quarter of 2018?





APPLYING THE CONCEPTS
Use Microsoft Excel to solve problems 14.44 to 14.48.

14.44 The data given in the following table represent Standard & Poor’s Composite Stock Price Index recorded at the end of each quarter from 1994 to the second quarter of 2004.


Year   Quarter 1   Quarter 2   Quarter 3   Quarter 4
1994   445.77      444.27      462.69      459.27
1995   500.71      544.75      584.41      615.93
1996   645.50      670.63      687.31      740.74
1997   757.12      885.14      947.28      970.43
1998   1,101.75    1,133.84    1,017.01    1,229.23
1999   1,286.37    1,372.71    1,282.71    1,469.25
2000   1,498.58    1,454.60    1,436.51    1,320.28
2001   1,160.33    1,224.38    1,040.94    1,148.08
2002   1,147.38    989.81      815.28      879.28
2003   848.18      974.51      995.97      1,111.92
2004   1,126.21    1,140.81

a. Plot the data. b. Develop an exponential trend forecasting equation with quarterly components. c. What is the fitted value in the first quarter of 2004? d. What is the fitted value in the second quarter of 2004? e. What are the forecasts for the last two quarters of 2004 and all four quarters of 2005? f. Interpret the quarterly compound growth rate. g. Interpret the second-quarter multiplier. 14.45 The data below show the quarterly underemployment rate (those who are employed but would prefer more hours) for male youth aged 15–24 in Australia from the August quarter 2008 to the May quarter 2014.

Quarter     Underemployment rate (%)
Aug 2008    8.3
Nov 2008    11.3
Feb 2009    11.1
May 2009    13.1
Aug 2009    11.3
Nov 2009    13.0
Feb 2010    12.4
May 2010    10.8
Aug 2010    10.8
Nov 2010    13.3
Feb 2011    10.7
May 2011    10.7
Aug 2011    10.5
Nov 2011    11.9
Feb 2012    12.2
May 2012    11.9
Aug 2012    10.9
Nov 2012    12.0
Feb 2013    11.1
May 2013    12.0
Aug 2013    11.9
Nov 2013    14.7
Feb 2014    12.5
May 2014    12.8

Data obtained from Australian Bureau of Statistics, Labour Force, Australia, Cat. no. 6202.0

a. Plot the time-series data. b. Develop an exponential trend forecasting equation with quarterly components. c. What is the fitted value in the May quarter of 2014? d. What are the forecasts for the August and November quarters of 2014? e. Interpret the quarterly compound growth rate. f. Interpret the second-quarter multiplier. 14.46 Use the Australian CPI data from question 14.14 on page 565. a. Construct the time-series plot. b. Describe the quarterly pattern that is evident in the data. c. Develop an exponential trend forecasting equation with quarterly components. d. Interpret the quarterly compound growth rate. e. Interpret the September multiplier. f. What is the predicted value for December 2011? g. What is the predicted value for March 2012? h. Compare these predictions to the actual values – see . 14.47 The following data represent the number of married females aged 25–34 in the labour force in Australia from November 2005 to September 2008.

Month/year   Married females aged 25–34 in labour force ('000)
Nov 2005     652.5
Dec 2005     648.6
Jan 2006     628.6
Feb 2006     639.7
Mar 2006     633.6
Apr 2006     636.8
May 2006     644.0
Jun 2006     641.2
Jul 2006     640.9
Aug 2006     638.8
Sep 2006     653.0
Oct 2006     633.6
Nov 2006     646.0
Dec 2006     645.3
Jan 2007     623.1
Feb 2007     634.5
Mar 2007     633.1
Apr 2007     627.0
May 2007     634.1
Jun 2007     633.4


Jul 2007     631.9
Aug 2007     644.0
Sep 2007     677.6
Oct 2007     668.5
Nov 2007     661.6
Dec 2007     681.0
Jan 2008     655.3
Feb 2008     676.6
Mar 2008     657.1
Apr 2008     661.9
May 2008     668.5
Jun 2008     679.8
Jul 2008     678.6
Aug 2008     697.8
Sep 2008     696.6

a. Do you think that the number of married females aged 25–34 in the labour force is subject to seasonal variation? Explain. b. Develop an exponential trend forecasting equation with monthly components. c. Interpret the monthly compound growth rate. d. Interpret the monthly multipliers. e. What are the forecasts for October, November and December of 2008? 14.48 Use the female average weekly earnings data in question 14.13. a. Do you think that the female earnings are subject to seasonal variation? Explain. b. Plot the data. Does this chart support your answer to (a)? c. Develop an exponential trend forecasting equation with quarterly components. d. Interpret the quarterly compound growth rate. e. Interpret the quarterly multipliers. f. What are the forecasts for all four quarters of 2017?

Source: © Commonwealth of Australia, Australian Bureau of Statistics, Labour Force, Australia, Detailed, Cat. No. 6291.0.55.001

14.9  INDEX NUMBERS
This chapter has presented various methods to study time series. In this section, index numbers are used to compare a value of a time series relative to another value of a time series. Index numbers measure the value of an item (or group of items) at a particular point in time as a percentage of the value of that item (or group of items) at another point in time. They are commonly used in business and economics as indicators of changing business or economic activity. There are many kinds of index numbers, including price indices, quantity indices, value indices and sociological indices. In this section, only the price index is considered. In addition to allowing comparison of prices at different points in time, price indices are also used to remove the effect of inflation from a time series in order to compare values in real dollars instead of actual dollars.

LEARNING OBJECTIVE 9
Calculate and interpret various price indices

index number  A percentage measure of the change in the value of an item between two time periods.

The Price Index
A price index compares the price of a commodity at a given period of time with the price paid for that commodity at a particular point of time in the past. A simple price index tracks the price of a single commodity. An aggregate price index compares the price of a group of commodities (called a market basket) at a given period of time with the price paid for that group of commodities at a particular point of time in the past. The base period is the point of time in the past against which all comparisons are made. In selecting the base period for a particular index, select, if possible, a period of economic stability rather than one at or near the peak of an expanding economy or the bottom of a recession or declining economy. In addition, the base period should be recent, so that comparisons are not greatly affected by changing technology and consumer attitudes and habits. Equation 14.25 defines the simple price index.

SIMPLE PRICE INDEX
Ii = (Pi / Pbase) × 100    (14.25)
where
Ii = price index for year i
Pi = price for year i
Pbase = price for the base year

price index  A measure of the average price of a group of goods relative to a base year.
simple price index  A percentage measure of the change in the price of a single item between two time periods.
aggregate price index  Calculates a percentage change in prices of a group of commodities between two time periods.
base period  The initial point in time for comparisons calculated using index numbers.


As an example of the simple price index, consider the price per litre of petrol in Dubbo from 1992 to 2017. Table 14.5 presents the price data plus two sets of index numbers. To illustrate the calculation of the simple price index for 2017, using 1992 as the base year, from Equation 14.25 and Table 14.5:

I2017 = (P2017 / P1992) × 100 = (1.52 / 0.53) × 100 = 286.79

Therefore, the price per litre of petrol in Dubbo in 2017 was 186.79% higher than in 1992. An examination of the price indices for 1992–2017 in Table 14.5 indicates that the price of petrol increased sharply in 2007 (and again in 2017). Since the base period for the index numbers in Table 14.5 is the initial year 1992, you may choose to use a base year closer to 2007. Equation 14.26 is used to develop index numbers with a new base.

Table 14.5  Price per litre of petrol in Dubbo and simple price index numbers with 1992 and 2007 as the base years (1992–2017)
Year   Price ($ per litre)   Price index (base 1992)   Price index (base 2007)
1992   0.53                  100.00                    63.10
1993   0.55                  103.77                    65.48
1994   0.49                  92.45                     58.33
1995   0.58                  109.43                    69.05
1996   0.57                  107.55                    67.86
1997   0.60                  113.21                    71.43
1998   0.61                  115.09                    72.62
1999   0.64                  120.75                    76.19
2000   0.68                  128.30                    80.95
2001   0.69                  130.19                    82.14
2002   0.68                  128.30                    80.95
2003   0.71                  133.96                    84.52
2004   0.72                  135.85                    85.71
2005   0.75                  141.51                    89.29
2006   0.77                  145.28                    91.67
2007   0.84                  158.49                    100.00
2008   0.85                  160.38                    101.19
2009   0.86                  162.26                    102.38
2010   0.89                  167.92                    105.95
2011   0.93                  175.47                    110.71
2012   0.94                  177.36                    111.90
2013   0.97                  183.02                    115.48
2014   0.99                  186.79                    117.86
2015   1.04                  196.23                    123.81
2016   1.09                  205.66                    129.76
2017   1.52                  286.79                    180.95

SHIFTING THE BASE FOR A SIMPLE PRICE INDEX
Inew = (Iold / Inew base) × 100    (14.26)
where
Inew = new price index
Iold = old price index
Inew base = value of the old price index for the new base year


To change the base year to 2007, Inew base = 158.49. Using Equation 14.26 to find the new price index for 2017:

Inew = (Iold / Inew base) × 100 = (286.79 / 158.49) × 100 = 180.95

Thus, the 2017 price for petrol in Dubbo was 80.95% higher than it was in 2007. See Table 14.5 for the complete set of price indices.
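The calculations in this example are easy to script. The Python sketch below reproduces the 1992-based index for 2017 and the shift to a 2007 base (Equations 14.25 and 14.26); the variable and function names are illustrative only.

# Minimal sketch: simple price index (Eq. 14.25) and base shifting (Eq. 14.26)
# for the Dubbo petrol prices in Table 14.5.
prices = {1992: 0.53, 2007: 0.84, 2017: 1.52}   # $ per litre (subset of Table 14.5)

def simple_index(price, base_price):
    """Ii = (Pi / Pbase) * 100"""
    return price / base_price * 100

i_2017_base92 = simple_index(prices[2017], prices[1992])   # about 286.79
i_2007_base92 = simple_index(prices[2007], prices[1992])   # about 158.49

# Shift the base year from 1992 to 2007: Inew = (Iold / Inew base) * 100
i_2017_base07 = i_2017_base92 / i_2007_base92 * 100         # about 180.95

print(round(i_2017_base92, 2), round(i_2017_base07, 2))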

Aggregate Price Indices
An index based on a group of commodities taken together is more important than a price index for any individual commodity. There are two basic types of aggregate price indices: unweighted aggregate price indices and weighted aggregate price indices. An unweighted aggregate price index, defined in Equation 14.27, places equal weight on all the items in the market basket.

unweighted aggregate price index Price index for a group of items where each item has an equal weight.

UNWEIGHTED AGGREGATE PRICE INDEX
IU(t) = ( Σi=1..n Pi(t) / Σi=1..n Pi(0) ) × 100    (14.27)
where
t = time period (0, 1, 2, …)
i = item (1, 2, …, n)
n = total number of items under consideration
Σi=1..n Pi(t) = sum of the prices paid for each of the n commodities at time period t
Σi=1..n Pi(0) = sum of the prices paid for each of the n commodities at time period 0
IU(t) = value of the unweighted price index at time t

Table 14.6 presents the average prices for three fruit items for selected periods from 1997 to 2017.

Table 14.6  Prices (in dollars per kilogram) for three fruit items
Fruit        1997 Pi(0)   2002 Pi(1)   2007 Pi(2)   2012 Pi(3)   2017 Pi(4)
Kiwi fruit   2.60         3.10         3.38         3.52         3.91
Rockmelon    1.80         1.90         2.15         2.20         2.30
Watermelon   0.35         0.39         0.44         0.49         0.56


To calculate the unweighted aggregate price index for the various years, using Equation 14.27 and 1997 as the base period (each sum is taken over the three items, i = 1 to 3):

1997: IU(0) = (ΣPi(0) / ΣPi(0)) × 100 = ((2.60 + 1.80 + 0.35)/(2.60 + 1.80 + 0.35)) × 100 = (4.75/4.75) × 100 = 100.0
2002: IU(1) = (ΣPi(1) / ΣPi(0)) × 100 = ((3.10 + 1.90 + 0.39)/(2.60 + 1.80 + 0.35)) × 100 = (5.39/4.75) × 100 = 113.5
2007: IU(2) = (ΣPi(2) / ΣPi(0)) × 100 = ((3.38 + 2.15 + 0.44)/(2.60 + 1.80 + 0.35)) × 100 = (5.97/4.75) × 100 = 125.7
2012: IU(3) = (ΣPi(3) / ΣPi(0)) × 100 = ((3.52 + 2.20 + 0.49)/(2.60 + 1.80 + 0.35)) × 100 = (6.21/4.75) × 100 = 130.7
2017: IU(4) = (ΣPi(4) / ΣPi(0)) × 100 = ((3.91 + 2.30 + 0.56)/(2.60 + 1.80 + 0.35)) × 100 = (6.77/4.75) × 100 = 142.5

Thus, in 2017, the combined price of a kilogram of kiwi fruit, a kilogram of rockmelon and a kilogram of watermelon was 42.5% more than it was in 1997. An unweighted aggregate price index represents the changes in prices, over time, for an entire group of commodities. However, an unweighted aggregate price index has two shortcomings. First, the index considers each commodity in the group equally important. Thus, the most expensive commodities per unit are overly influential. Second, not all the commodities are consumed at the same rate. In the unweighted index, changes in the price of the least consumed commodities are overly influential.
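A short script confirms these unweighted index values. The Python sketch below applies Equation 14.27 to the Table 14.6 prices; the data structure and names are chosen only for illustration.

# Minimal sketch: unweighted aggregate price index (Eq. 14.27) for the fruit
# market basket in Table 14.6, with 1997 as the base period.
prices = {                       # $ per kilogram: [1997, 2002, 2007, 2012, 2017]
    "kiwi fruit": [2.60, 3.10, 3.38, 3.52, 3.91],
    "rockmelon":  [1.80, 1.90, 2.15, 2.20, 2.30],
    "watermelon": [0.35, 0.39, 0.44, 0.49, 0.56],
}
years = [1997, 2002, 2007, 2012, 2017]

base_total = sum(p[0] for p in prices.values())          # sum of 1997 prices = 4.75
for t, year in enumerate(years):
    total_t = sum(p[t] for p in prices.values())         # sum of prices at time t
    print(year, round(total_t / base_total * 100, 1))    # 100.0, 113.5, 125.7, 130.7, 142.5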

Weighted Aggregate Price Indices weighted aggregate price index Price index for a group of items where each item has a different weight based on volume of consumption.

Due to the shortcomings of the unweighted aggregate price index, weighted aggregate price indices are generally preferable. Weighted aggregate price indices account for differences in the magnitude of prices per unit and differences in the consumption levels of the items in the market basket. Two types of weighted aggregate price indices are commonly used in business and economics: the Laspeyres price index and the Paasche price index.

Laspeyres price index Uses consumption quantities in the base year to weight price changes measured by the index number.

Laspeyres Price Index  Equation 14.28 defines the Laspeyres price index, which uses the consumption quantities associated with the base year in the calculation of all price indices in the series.


LASPEYRES PRICE INDEX
IL(t) = ( Σi=1..n Pi(t)Qi(0) / Σi=1..n Pi(0)Qi(0) ) × 100    (14.28)
where
t = time period (0, 1, 2, …)
i = item (1, 2, …, n)
n = total number of items under consideration
Qi(0) = quantity of item i at time period 0
IL(t) = value of the Laspeyres price index at time t

Table 14.7 gives the per capita consumption in kilos for the three fruit items comprising the market basket of interest.

Table 14.7  Prices (in dollars per kilogram) and quantities (annual per capita consumption in kilograms) for three fruit items
Fruit        1997 Pi(0), Qi(0)   2002 Pi(1), Qi(1)   2007 Pi(2), Qi(2)   2012 Pi(3), Qi(3)   2017 Pi(4), Qi(4)
Kiwi fruit   2.60, 5.2           3.10, 5.5           3.38, 4.9           3.52, 4.5           3.91, 4.7
Rockmelon    1.80, 21.6          1.90, 19.8          2.15, 20.5          2.20, 21.4          2.30, 22.4
Watermelon   0.35, 24.8          0.39, 25.9          0.44, 26.3          0.49, 27.7          0.56, 28.2

Using 1997 as the base year, calculate the Laspeyres price index for 2017 (t = 4) using Equation 14.28:

IL(4) = (ΣPi(4)Qi(0) / ΣPi(0)Qi(0)) × 100
      = [(3.91 × 5.2) + (2.3 × 21.6) + (0.56 × 24.8)] / [(2.6 × 5.2) + (1.8 × 21.6) + (0.35 × 24.8)] × 100
      = (83.9 / 61.08) × 100 = 137.36

Thus, the Laspeyres price index is 137.36, indicating that the cost of purchasing these three items in 2017 was 37.36% more than in 1997. This index is less than the unweighted index, 142.5, as the expensive kiwi fruit was purchased less often over time and the least expensive item, watermelon, was purchased more over time. Paasche Price Index  The Paasche price index uses the consumption quantities in the year of interest instead of using the initial quantities. Thus, the Paasche index is a more accurate reflection of total consumption costs at that point in time. However, there are two major drawbacks of the Paasche index. First, accurate consumption values for current purchases are often hard to obtain. Thus, many important indices such as the Consumer Price Index use the Laspeyres method. The Australian CPI uses a modified Laspeyres index and is calculated separately for each capital city, with the weighted aggregate for Australia weighted by population size. Second, if a particular product increases greatly in price compared with the other items in the market basket, consumers will

Paasche price index Uses consumption quantities in the final year to weight price changes measured as an index number.


avoid the high-priced item out of necessity, not because of changes in what they might prefer to purchase. Equation 14.29 defines the Paasche price index.

PAASCHE PRICE INDEX
IP(t) = ( Σi=1..n Pi(t)Qi(t) / Σi=1..n Pi(0)Qi(t) ) × 100    (14.29)
where
t = time period (0, 1, 2, …)
i = item (1, 2, …, n)
n = total number of items under consideration
Qi(t) = quantity of item i at time period t
IP(t) = value of the Paasche price index at time t

To calculate the Paasche price index in 2017 using 1997 as the base year, use t = 4 in Equation 14.29:

IP(4) = (ΣPi(4)Qi(4) / ΣPi(0)Qi(4)) × 100
      = [(3.91 × 4.7) + (2.3 × 22.4) + (0.56 × 28.2)] / [(2.6 × 4.7) + (1.8 × 22.4) + (0.35 × 28.2)] × 100
      = (85.689 / 62.41) × 100 = 137.30

The Paasche price index for this market basket is 137.30. Thus, the cost of these three fruit items was 37.3% higher in 2017 than in 1997 when using 2017 quantities.
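Both weighted indices can be verified with a few lines of code. The Python sketch below applies Equations 14.28 and 14.29 to the Table 14.7 prices and quantities; the names used are illustrative only.

# Minimal sketch: Laspeyres (Eq. 14.28) and Paasche (Eq. 14.29) price indices
# for 2017 with 1997 as the base, using the Table 14.7 data.
p0 = {"kiwi fruit": 2.60, "rockmelon": 1.80, "watermelon": 0.35}   # 1997 prices
q0 = {"kiwi fruit": 5.2,  "rockmelon": 21.6, "watermelon": 24.8}   # 1997 quantities
p4 = {"kiwi fruit": 3.91, "rockmelon": 2.30, "watermelon": 0.56}   # 2017 prices
q4 = {"kiwi fruit": 4.7,  "rockmelon": 22.4, "watermelon": 28.2}   # 2017 quantities

items = p0.keys()
laspeyres = sum(p4[i] * q0[i] for i in items) / sum(p0[i] * q0[i] for i in items) * 100
paasche   = sum(p4[i] * q4[i] for i in items) / sum(p0[i] * q4[i] for i in items) * 100

print(f"{laspeyres:.2f} {paasche:.2f}")   # 137.36 and 137.30, as in the text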

Some Common Price Indices
Various price indices are commonly used in business and economics. The Consumer Price Index is the most familiar index in Australia. This index is officially referred to as the All Australia Index to reflect that it measures the prices 'urban' residents are subject to (the price changes are measured only in capital cities), but is commonly referred to as the CPI. The CPI, published quarterly by the Australian Bureau of Statistics, is the primary measure of changes in the cost of living in Australia. The CPI is a weighted aggregate price index, using a variation of the Laspeyres method, for hundreds of commonly purchased items grouped under the headings of Food, Alcohol and Tobacco, Clothing and Footwear, Housing, Household Furnishings, Health, Transportation, Communication, Recreation, Education and Miscellaneous. The CPI is calculated separately in eight Australian capital cities and averaged (weighted by population size) to create the All Australia Index. From December 2011 it began its 16th linked series (each series is about six years) and details can be found at (note that the website sometimes refers to 'prices' rather than the CPI; the catalogue number is 6401.0).
An important use of the CPI is as a price deflator. The CPI is used to convert (and deflate) actual dollars into real dollars by multiplying each dollar value in a time series by the quantity (100/CPI).
Financial indices such as the Dow Jones Industrial Average, the S&P 500, the NASDAQ Index and the Australian ASX are price indices for different sets of stocks in the United States and Australia. ASX stands for Australian Securities Exchange. The ASX 200, known as the S&P


ASX 200 Index, is recognised as the investment benchmark for the Australian equity market and comprises the S&P ASX 100 plus an additional 100 stocks. The All Ordinaries Index is Australia's premier market indicator, representing the 500 largest companies listed on the Australian Securities Exchange. Many indices measure the performance of international share markets, including the Nikkei Index for Japan, the Dax 30 for Germany and the SSE Composite for China.
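The deflation step mentioned above is a one-line transformation. The short Python sketch below uses invented CPI and weekly-earnings figures purely to illustrate the rule real value = actual value × (100/CPI).

# Minimal sketch of using a price index as a deflator.
# The CPI and earnings figures below are invented for illustration only.
actual_earnings = [900.0, 940.0, 985.0]   # actual (nominal) dollars per week
cpi             = [100.0, 104.0, 109.0]   # CPI for the same periods

real_earnings = [e * 100 / c for e, c in zip(actual_earnings, cpi)]
print([round(e, 2) for e in real_earnings])   # values expressed in base-period (real) dollars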

Problems for Section 14.9
LEARNING THE BASICS
14.49 The simple price index for a commodity in 2014, using 1998 as the base year, is 125. Interpret this index number.
14.50 The following are prices for a commodity from 2015 to 2017:
2015   $5
2016   $8
2017   $7

a. Calculate the simple price indices for 2015–2017, using 2015 as the base year.
b. Calculate the simple price indices for 2015–2017, using 2016 as the base year.
14.51 The following are prices and consumption quantities for three commodities in 2007 and 2017:
Commodity   2007 Price   2007 Quantity   2017 Price   2017 Quantity
A           $2           15              $5           17
B           $32          4               $28          3
C           $5           20              $7           15

a. Calculate the unweighted aggregate price index for 2017, using 2007 as the base year. b. Calculate the Laspeyres aggregate price index for 2017, using 2007 as the base year. c. Calculate the Paasche aggregate price index for 2017, using 2007 as the base year.

APPLYING THE CONCEPTS 14.52 The data in the file reflect the annual values of the United States Consumer Price Index (US CPI), constructed over the 50-year period from 1965 to 2014, using the year 1982 as the base period. a. Form the price index for the US CPI with 1965 as the base year. b. Shift the base of the US CPI to 1990 and recalculate the price index. c. Compare the results of (a) and (b). Which price index do you think is more useful in understanding the changes in the US CPI? Explain. 14.53 The data in the following table represent the closing value of the Dow Jones Industrial Average (DJIA) from 1979 to 2003.

Year   DJIA
1979   838.7
1980   964.0
1981   875.0
1982   1,046.5
1983   1,258.6
1984   1,211.6
1985   1,546.7
1986   1,896.0
1987   1,938.8
1988   2,168.6
1989   2,753.2
1990   2,633.7
1991   3,168.8
1992   3,301.1
1993   3,754.1
1994   3,834.4
1995   5,117.1
1996   6,448.3
1997   7,908.3
1998   9,181.4
1999   11,497.1
2000   10,788.0
2001   10,021.5
2002   8,341.6
2003   10,453.9

Data obtained from

a. Form the price index for the DJIA with 1979 as the base year.
b. Shift the base of the DJIA to 1990 and recalculate the price index.
c. Compare the results of (a) and (b). Which price index do you think is more useful in understanding the changes in the DJIA? Explain.
14.54 Refer to the CPI data used in problem 14.14.
a. Form the price index for the CPI using September 2006 as the base period.
b. Discuss the changes in consumer prices during this five-year period.
14.55 The following data represent the annual domestic supply price index for Singapore from 1999 to 2013.
Year   Supply price index
1999   73.0
2000   80.3
2001   79.0
2002   76.5
2003   78.0
2004   82.0
2005   89.9
2006   94.4
2007   94.7
2008   101.8
2009   87.7
2010   91.8
2011   99.5
2012   100.0
2013   97.3

Source: Department of Statistics Singapore, Government of Singapore,

a. Form the price index with 2000 as the base year. b. Shift the base of the index to 2013 and recalculate the price index.


c. Compare the results of (a) and (b). Which price index do you think is more useful in understanding the changes in prices in Singapore? Explain.
14.56 The data below represent the CPI for tradables in New Zealand from the second quarter of 1999 to the third quarter of 2011 (second quarter 2006 is the base).
Year   Q1      Q2      Q3      Q4
1999   –       895     895     899
2000   903     910     924     940
2001   938     953     957     960
2002   958     973     970     975
2003   972     962     956     955
2004   950     956     956     962
2005   958     963     974     979
2006   978     1,000   1,003   990
2007   986     995     1,000   1,018
2008   1,020   1,043   1,063   1,041
2009   1,037   1,045   1,062   1,057
2010   1,058   1,055   1,065   1,092
2011   1,097   1,113   1,114   –

Source: Statistics New Zealand, accessed 6 November 2011. This work is based on/includes Statistics New Zealand’s data, which are licensed by Statistics New Zealand for re-use under the Creative Commons ­Attribution 3.0 New Zealand licence

a. Calculate the simple price index for this period using the third quarter of 2000 as the base period. b. Interpret the simple price index for the third quarter of 2011, using the third quarter of 2000 as the base period. c. Recalculate the simple price index found in (a), using Equation 14.26 and the first quarter of 2005 as the base period. d. Interpret the simple price index for the third quarter of 2011, using the first quarter of 2005 as the base period. e. Would it be a good idea to use the first quarter of 2005 as the base period? Explain. f. Describe the trends in the New Zealand CPI from 1999 to 2011. 14.57 The following data represent the prices (in cents) of a basket of staple food items in September 2007 and September 2017 in Sydney and Melbourne.

Item                 2007 Sydney   2007 Melbourne   2017 Sydney   2017 Melbourne
Loin chop (1 kg)     765           844              1,564         1,525
Potatoes (1 kg)      108           111              137           175
Bread (650 g)        216           201              246           254
Milk (150 g)         220           275              256           268
Coffee (150 g)       662           608              609           611
Baby food (120 g)    59            55               75            75

a. Calculate the simple aggregate index, using 2007 as the base year for each city. b. Interpret the price index for 2017, using 2007 as the base year for each city. c. Interpret the meaning of the difference, or lack of difference, in the indices between Sydney and Melbourne.


14.10  PITFALLS IN TIME-SERIES FORECASTING The value of time-series forecasting methodology, which uses past and present information as guides to the future, was recognised and most eloquently expressed more than two centuries ago by the US statesman Patrick Henry, who said: I have but one lamp by which my feet are guided, and that is the lamp of experience. I know no way of judging the future but by the past. Speech at Virginia Convention (Richmond), 23 March 1775 However, critics of time-series forecasting argue that these techniques are overly naïve and mechanical; that is, a mathematical model based on the past should not be used to extrapolate trends mechanically into the future without considering personal judgments, business experiences or changing technologies, habits and needs. Thus, in recent years econometricians have developed highly sophisticated computerised models of economic activity incorporating such factors for forecasting purposes. Such forecasting methods, however, are beyond the scope of this text (references 1, 3 and 4). Nevertheless, as you have seen from the preceding sections of this chapter, time-series methods provide useful guides for projecting future trends (on long- and short-term bases). Although not explicitly stating the causes of time-series change, many of the past changes reflect economic causes. If used properly and in conjunction with other forecasting methods, as well as with business judgment and experience, time-series methods will continue to be excellent tools for forecasting. They maintain the advantages of fewer technical assumptions and do not require knowledge of a future cause (independent variable) in order to predict change in the dependent variable.

think about this
Let the model user beware
When you use a model, you must always review the assumptions built into the model and reflect on how novel or changing circumstances may render the model less useful. No model can completely remove the risk involved in making a decision.
Implicit in the time-series models developed in this chapter is that past data can be used to help predict the future. While using past data in this way is a legitimate application of time-series models, every so often a crisis in financial markets illustrates that using models that rely on the past to predict the future is not without risk. For example, during August 2007 many hedge funds suffered unprecedented losses. Apparently, many hedge fund managers used models that based their investment strategy on trading patterns over long time periods. These models did not – and could not – reflect trading patterns contrary to historical patterns (G. Morgenson, 'A week when risk came home to roost', New York Times, 12 August 2007, pp. B1, B7). When fund managers in early August 2007 needed to sell shares due to losses in their fixed-income portfolios, shares that were previously stronger became weaker, and weaker ones became stronger – the reverse of what the models expected. Making matters worse was the fact that many fund managers were using similar models and rigidly made investment decisions based solely on what those models said. These similar actions multiplied the effect of the selling pressure, an effect that the models had not considered and that therefore could not be seen in the models' results. This example illustrates that using models does not absolve you of the responsibility of being a thoughtful decision maker. Do go ahead and use models – when appropriately used they will enhance your decision making – but don't use them mindlessly, for, in the words of a famous public service announcement, 'a mind is a terrible thing to waste'.


14

Assess your progress

Summary In this chapter we used time-series methods to develop forecasts of female labour force patterns in Australia and examined some important index number methodology. For time-series forecasting methods, we asked the question: Is there trend in the data or not? When there is no trend, use the moving-average technique and the exponential smoothing method. When there is trend, move on to the regression method, using simple regression from Chapter 12 to develop a linear time-series regression model. We then extended the linear regression model to a second-degree quadratic model and then to an exponential trend model. We asked the question: How can you determine which model forecasts most accurately? We looked at the measurement of percentage differences between the actual data and the forecast results.

We studied an extension of exponential smoothing called the Holt–Winters method, and followed this by examining the autoregressive method, using several orders of autocorrelation to search for a predictive model. To choose between autoregressive models of different orders, we referred back to hypothesis testing (Chapter 9). The next question was: Now I have several forecasting models, how do I know if there is a best model? We examined the residuals of the models and compared five forecasting methods. Section 14.8 looked at the problem of data containing seasonal elements through time, and used the regression method on quarterly seasonal data. Section 14.9 discussed index numbers: the simple price index, the unweighted aggregate price index, the weighted aggregate price indices (Laspeyres and Paasche) and the Consumer Price Index.

Key formulas
Classical multiplicative time-series model for annual data
Yi = Ti × Ci × Ii    (14.1)

Classical multiplicative time-series model for data with a seasonal component
Yi = Ti × Si × Ci × Ii    (14.2)

Calculating an exponentially smoothed value in time period i
E1 = Y1;  Ei = WYi + (1 − W)Ei−1    (14.3)

Forecasting time period i + 1
Ŷi+1 = Ei    (14.4)

Linear trend forecasting equation
Ŷi = b0 + b1Xi    (14.5)

Quadratic trend forecasting equation
Ŷi = b0 + b1Xi + b2Xi²    (14.6)

Exponential trend model
Yi = β0 β1^Xi εi    (14.7)

Transformed exponential trend model
log(Yi) = log(β0 β1^Xi εi) = log(β0) + Xi log(β1) + log(εi)    (14.8)

Exponential trend forecasting equation
log(Ŷi) = b0 + b1Xi    (14.9a)
Ŷi = β̂0 β̂1^Xi    (14.9b)

The Holt–Winters method
Level:  Ei = U(Ei−1 + Ti−1) + (1 − U)Yi    (14.10a)
Trend:  Ti = VTi−1 + (1 − V)(Ei − Ei−1)    (14.10b)

Using the Holt–Winters method for forecasting
Ŷn+j = En + j(Tn)    (14.11)

First-order autoregressive model
Yi = A0 + A1Yi−1 + δi    (14.12)

Second-order autoregressive model
Yi = A0 + A1Yi−1 + A2Yi−2 + δi    (14.13)

pth-order autoregressive model
Yi = A0 + A1Yi−1 + A2Yi−2 + … + ApYi−p + δi    (14.14)

t test for significance of the highest-order autoregressive parameter Ap
t = (ap − Ap) / Sap    (14.15)

Fitted pth-order autoregressive equation
Ŷi = a0 + a1Yi−1 + a2Yi−2 + … + apYi−p    (14.16)

pth-order autoregressive forecasting equation
Ŷn+j = a0 + a1Ŷn+j−1 + a2Ŷn+j−2 + … + apŶn+j−p    (14.17)

Mean absolute deviation
MAD = Σi=1..n |Yi − Ŷi| / n    (14.18)

Exponential model with quarterly data
Yi = β0 β1^Xi β2^Q1 β3^Q2 β4^Q3 εi    (14.19)

Transformed exponential model with quarterly data
log(Yi) = log(β0 β1^Xi β2^Q1 β3^Q2 β4^Q3 εi)
        = log(β0) + Xi log(β1) + Q1 log(β2) + Q2 log(β3) + Q3 log(β4) + log(εi)    (14.20)

Exponential growth with quarterly data forecasting equation
log(Ŷi) = b0 + b1Xi + b2Q1 + b3Q2 + b4Q3    (14.21)

Exponential model with monthly data
Yi = β0 β1^Xi β2^M1 β3^M2 β4^M3 β5^M4 β6^M5 β7^M6 β8^M7 β9^M8 β10^M9 β11^M10 β12^M11 εi    (14.22)

Transformed exponential model with monthly data
log(Yi) = log(β0) + Xi log(β1) + M1 log(β2) + M2 log(β3) + M3 log(β4) + M4 log(β5) + M5 log(β6) + M6 log(β7) + M7 log(β8) + M8 log(β9) + M9 log(β10) + M10 log(β11) + M11 log(β12) + log(εi)    (14.23)

Exponential growth with monthly data forecasting equation
log(Ŷi) = b0 + b1Xi + b2M1 + b3M2 + b4M3 + b5M4 + b6M5 + b7M6 + b8M7 + b9M8 + b10M9 + b11M10 + b12M11    (14.24)

Simple price index
Ii = (Pi / Pbase) × 100    (14.25)

Shifting the base for a simple price index
Inew = (Iold / Inew base) × 100    (14.26)

Unweighted aggregate price index
IU(t) = ( Σi=1..n Pi(t) / Σi=1..n Pi(0) ) × 100    (14.27)

Laspeyres price index
IL(t) = ( Σi=1..n Pi(t)Qi(0) / Σi=1..n Pi(0)Qi(0) ) × 100    (14.28)

Paasche price index
IP(t) = ( Σi=1..n Pi(t)Qi(t) / Σi=1..n Pi(0)Qi(t) ) × 100    (14.29)
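Two of the formulas above translate almost directly into code. The Python sketch below is a minimal, illustrative implementation of exponential smoothing (Equation 14.3) and of the Holt–Winters level and trend updates with the forecasting rule (Equations 14.10a, 14.10b and 14.11); the sample series, smoothing constants and start-up values are assumptions chosen only to show the mechanics, not values from the text.

# Minimal sketch: exponential smoothing (Eq. 14.3) and the Holt-Winters method
# (Eqs. 14.10a, 14.10b and 14.11). Series and constants are arbitrary illustrations.
def exponential_smoothing(y, w):
    """E1 = Y1; Ei = W*Yi + (1 - W)*E(i-1). Returns the smoothed series."""
    smoothed = [y[0]]
    for value in y[1:]:
        smoothed.append(w * value + (1 - w) * smoothed[-1])
    return smoothed

def holt_winters(y, u, v, horizon=1):
    """Level/trend updates (14.10a, 14.10b); forecasts Y(n+j) = En + j*Tn (14.11)."""
    level, trend = y[1], y[1] - y[0]      # start-up choice assumed for illustration
    for value in y[2:]:
        prev_level = level
        level = u * (level + trend) + (1 - u) * value
        trend = v * trend + (1 - v) * (level - prev_level)
    return [level + j * trend for j in range(1, horizon + 1)]

series = [5.2, 5.0, 5.4, 5.6, 5.5, 5.9, 6.1]
print(exponential_smoothing(series, w=0.5))
print(holt_winters(series, u=0.3, v=0.3, horizon=2))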

Key terms
aggregate price index  591
autoregressive modelling  570
base period  591
causal forecasting methods  545
classical multiplicative model  546
cyclical component  546
exponential smoothing  551
exponential trend model  558
first-order autocorrelation  570
first-order autoregressive model  571
Holt–Winters method  567
index number  591
irregular (random) component  546
Laspeyres price index  594
linear trend model  555
mean absolute deviation (MAD)  580
moving averages  548
Paasche price index  595
parsimony  581
price index  591
pth-order autocorrelation  570
pth-order autoregressive model  571
quadratic trend model  557
qualitative forecasting methods  545
quantitative forecasting methods  545
seasonal component  546
second-order autocorrelation  570
second-order autoregressive model  571
simple price index  591
time series  545
time-series forecasting methods  545
trend component  546
unweighted aggregate price index  593
weighted aggregate price index  594


References
1. Bowerman, B. L., R. T. O'Connell & A. Koehler, Forecasting, Time Series, and Regression, 4th edn (Belmont, CA: Duxbury Press, 2005).
2. Hanke, J. E., D. W. Wichern & A. G. Reitsch, Business Forecasting, 7th edn (Upper Saddle River, NJ: Prentice Hall, 2001).
3. Box, G. E. P., G. M. Jenkins & G. C. Reinsel, Time Series Analysis: Forecasting and Control, 3rd edn (Englewood Cliffs, NJ: Prentice Hall, 1994).
4. Frees, E. W., Data Analysis Using Regression Models: The Business Perspective (Upper Saddle River, NJ: Prentice Hall, 1996).

Chapter review problems
CHECKING YOUR UNDERSTANDING
14.58 Provide examples of business scenarios where forecasting is important.
14.59 What is a time series?
14.60 Describe the main components that constitute the classical multiplicative time-series model.
14.61 What is the difference between moving averages and exponential smoothing?
14.62 Under what circumstances would you adopt a linear trend forecasting model versus a quadratic model?
14.63 How does the least-squares linear trend forecasting model, developed in this chapter, differ from the least-squares linear regression model considered in Chapter 12?
14.64 How does Holt–Winters modelling differ from the other approaches to forecasting?
14.65 What are the different approaches to choosing an appropriate forecasting model?
14.66 What is the major difference between using SYX and MAD for evaluating how well a particular model fits the data?
14.67 How does forecasting for monthly or quarterly data differ from forecasting for annual data?
14.68 What is an index number?
14.69 What is the difference between a simple price index and an aggregate price index?
14.70 What is the difference between the Paasche price index and the Laspeyres price index?

APPLYING THE CONCEPTS Use Microsoft Excel to solve problems 14.71 to 14.76.

14.71 The data stored in < MCDONALDS > represent the gross revenues (in billions of current dollars) of McDonald’s Corporation from 1975 to 2012. a. Plot the data. b. Calculate the linear trend forecasting equation. c. Calculate the quadratic trend forecasting equation. d. Calculate the exponential trend forecasting equation. e. Determine the best-fitting autoregressive model, using α 5 0.05. f. Perform a residual analysis for each of the models in (b) to (e). g. Calculate the standard error of the estimate (SYX) and the MAD for each corresponding model in (f). h. On the basis of your results in (f) and (g), along with a consideration of the principle of parsimony, which model would you select for purposes of forecasting? Discuss. i. Using the selected model in (h), forecast gross revenues for 2013.

14.72 The labour force participation of older workers is an increasingly important policy topic in the context of ageing populations. The following displays labour force participation rates for males aged 55–59 in Australia from 1967 to 2010. < LFPR >

Year Labour force participation rate 1967 91.29213 1968 90.92155 1969 90.83277 1970 91.21132 1971 90.90909 1972 90.58325 1973 88.31000 1974 87.6434 1975 87.84957 1976 86.88882 1977 86.34586 1978 81.94404 1979 82.09309 1980 82.62169 1981 81.21961 1982 80.10982 1983 78.47148 1984 78.04564 1985 77.12231 1986 77.03493 1987 75.83300 1988 74.26153 1989 75.24049 1990 75.01858 1991 73.69792 1992 73.21390 1993 71.68768 1994 73.22536 1995 74.14988 1996 72.74920 1997 72.30829 1998 73.45512 1999 72.47096 2000 72.40716 2001 71.73871 2002 72.63158 2003 73.76008 2004 74.28808 2005 75.32154




Year Labour force participation rate 2006 76.27012 2007 77.22897 2008 76.34785 2009 78.26174 2010 80.19947

a. Plot the time series. b. Calculate the linear trend forecasting equation. c. Calculate the exponential trend forecasting equation. d. Calculate the quadratic trend forecasting equation. e. Which appears to be the best model? 14.73 You are an analyst for a company preparing a report on the potential closure of Essendon Airport in Melbourne. Assume that the following data represent the number of light aircraft overhauls completed at the airport from 1990 to 2005. You need to estimate the likely demand for 2008 and 2009 after the potential closure of the airport. Year Aircraft overhauls 1990 147 1991 175 1992 150 1993 191 1994 188 1995 179 1996 201 1997 220 1998 209 1999 220 2000 240 2001 235 2002 261 2003 270 2004 272 2005 280

a. Plot the time series. b. Does the series indicate a particular pattern in the plot in (a)? c. Use an appropriate method to forecast the series. d. How accurate is this method in forecasting the current series? e. Forecast the number of overhauls for 2007 and 2008. 14.74 The data in the table below present the annual dividend return for Coca-Cola shares from 2001 to 2013. < COKE > Year Annual dividend 2001 0.72 2002 0.80 2003 0.88 2004 1.00 2005 1.12

2006 1.24 2007 1.36 2008 1.52 2009 1.64 2010 1.76 2011 1.88 2012 1.02 2013 1.12 Data obtained from Coca-Cola, Year-End Market Values

a. Plot the data. b. Calculate the linear trend forecasting equation. c. Calculate the quadratic trend forecasting equation. d. Calculate the exponential trend forecasting equation. e. Find the best-fitting autoregressive model using α 5 0.05. f. Perform a residual analysis for each of the models in (b) to (e). g. Calculate the standard error of the estimate (SYX) and the MAD for each corresponding model in (f). h. On the basis of your results in (f) and (g), together with a consideration of parsimony, which model would you select for purposes of forecasting? Discuss. i. Using the selected model in (h), forecast the annual dividend return for 2014. 14.75 Teachers’ Retirement System of the City of New York offers several types of investments for its members. Among the choices are investments with fixed and variable rates of return. There are several categories of variable-return investments. The Diversified Equity Fund consists of investments that are primarily made in shares, and the Stable-Value Fund consists of investments in corporate bonds and other types of lower-risk instruments. The data in < TRS_NYC > represent the value of a unit of each type of variable-return investment at the beginning of each year from 1984 to 2013 (data obtained from ‘Historical data-unit values, Teachers’ Retirement System of the City of New York’, ). For each of the two time series: a. Plot the data. b. Calculate the linear trend forecasting equation. c. Calculate the quadratic trend forecasting equation. d. Calculate the exponential trend forecasting equation. e. Determine the best-fitting autoregressive model, using α 5 0.05. f. Perform a residual analysis for each of the models in (b) to (e). g. Calculate the standard error of the estimate (SYX) and the MAD for each corresponding model in (f). h. On the basis of your results in (f) and (g), along with a consideration of the principle of parsimony, which model would you select for purposes of forecasting? Discuss. i. Using the selected model in (h), forecast the unit values for 2014. j. Based on the results of (a) to (i), what investment strategy would you recommend for a member of the Teachers’ Retirement System of the City of New York? Explain.


14.76 William Tanner, the owner of a vineyard in the Barossa Valley, South Australia, has collected the following information describing the prices and quantities of harvested crops for the years 2014 and 2017. 2014

Type of grape

2017

Volume Price harvested ($/tonne) (tonnes)

Ruby Cabernet Barber Shiraz

108 93 97

a. Calculate the Paasche index with 2014 as the base year. b. Calculate the Laspeyres index with 2014 as the base year. c. Calculate the unweighted aggregate price index, using 2014 as the base year.

Price ($/tonne)

Volume harvested (tonnes)

111 101 107

1,360 890 1,660

1,280 830 1,640

Chapter 14 Excel Guide EG14.1 THE IMPORTANCE OF BUSINESS FORECASTING There are no Excel Guide instructions for this section.

EG14.2 COMPONENT FACTORS OF TIME-SERIES MODELS There are no Excel Guide instructions for this section.

EG14.3 SMOOTHING AN ANNUAL TIME SERIES Moving Averages Key technique  Use the AVERAGE(cell range that contains a sequence of L observed values) function to calculate moving averages and use the special worksheet value #N/A (not available) for time periods in which no moving average can be calculated.

Example  Calculate the five- and seven-year moving averages for the female unemployment rate data shown in Figure 14.3 on page 550.

Exponential Smoothing
Key technique  Use arithmetic formulas to calculate exponentially smoothed values.
Example  Calculate the exponentially smoothed series (W = 0.50 and W = 0.25) for the female unemployment rate data shown in Figure 14.4 on page 552.
Analysis ToolPak  Use Exponential Smoothing.
For the example, open the Fem_Partic file and:
1. Select Data ➔ Data Analysis.
2. In the Data Analysis dialog box, select Exponential Smoothing from the Analysis Tools list and then click OK.
In the Exponential Smoothing dialog box (shown in Figure EG14.1):
3. Enter B1:B37 as the Input Range.
4. Enter 0.5 as the Damping factor. (The damping factor is equal to 1 − W.)
5. Check Labels, enter C1 as the Output Range, and click OK.

Figure EG14.1  Exponential Smoothing dialog box

In the new column C:
6. Copy the last formula in cell C11 to cell C12.
7. Enter the column heading ES(W = .50) in cell C1, replacing the #N/A value.
To create the exponentially smoothed values that use a smoothing coefficient of W = 0.25, repeat steps 3 through 7 with these modifications: enter 0.75 as the Damping factor in step 4, enter D1 as the Output Range in step 5, and enter ES(W = .25) as the column heading in step 7.


EG14.4 LEAST-SQUARES TREND FITTING AND FORECASTING The Linear Trend Model Modify the Section EG12.2 instructions (see page 502) to create a linear trend model. Use the cell range of the coded variable as the X variable cell range (called the X Variable Cell Range in the PHStat instructions, and called the Input X Range in the Analysis ToolPak instructions). If you need to create coded values, enter them manually in a column. If you have many coded values, you can use Home ➔ Fill (in the Editing group) ➔ Series and in the Series dialog box, click Columns and Linear, and select appropriate values for Step value and Stop value.

The Quadratic Trend Model Modify the Section EG16.1 instructions (see page 677) to create a quadratic trend model. Use the cell range of the coded variable and the squared coded variable as the X variables cell range (called the X Variables Cell Range in the PHStat instructions and the Input X Range in the Analysis ToolPak instructions). Use the Sections EG16.1 and EG16.2 instructions to create the squared coded variable and to plot the quadratic trend.

The Exponential Trend Model Key technique  Use the POWER(10, predicted log(Y)) function to calculate the predicted Y values from the predicted log(Y) results. To create an exponential trend model, first convert the values of the dependent variable Y to log(Y) values using the ­Section EG16.2 instructions on page 678. Then perform a simple linear regression analysis with residual analysis using the log(Y) values. Modify the Section EG12.5 ‘Residual analysis’ instructions using the cell range of the log(Y) values as the Y variable cell range and the cell range of the coded variable as the X variable cell range. (Note that the residual analysis instructions incorporate the Section EG12.2 ‘Determining the simple linear regression equation’ instructions.) Note that the Y and X variable cell ranges are called the Y Variable Cell Range and X Variable Cell Range in the PHStat instructions, and the Input Y Range and Input X Range in the Analysis ToolPak instructions. If you use the PHStat instructions, residuals will appear in a residuals worksheet. If you use the Analysis ToolPak instructions, residuals will appear in the RESIDUAL OUTPUT area of the regression results worksheet. Because you use log(Y) values for the regression, the predicted Y and residuals listed are log values that need to be converted. (The Analysis ToolPak incorrectly labels the new column for the logs of the residuals as Residuals, and not as LOG(Residuals) as you might expect.)

Convert the predicted log(Y) results to predicted Y results using the POWER function. Use an empty column in the residuals worksheet (PHStat) or empty column ranges to the right of the RESIDUALS OUTPUT area (Analysis ToolPak) to first add a column of formulas that use the POWER function to calculate the predicted Y values. Then, add a second column that contains the original Y values. (Copy the original Y values to this column.) Finally, add a third new column that contains formulas in the form =(revenue cell − predicted revenue cell) to calculate the actual residuals. To construct an exponential trend plot, first select the cell range of the time-series data and then use the Section EG2.5 instructions to construct a scatter plot. (For the female labour participation example, use the cell range B1:B37 in the Fem_Partic file.) Select the chart and: 1. Select Design ➔ Add Chart Element ➔ Trendline ➔ More Trendline Options. 2. In the Format Trendline pane, click Exponential. If you use an Excel version older than Excel 2013, select Layout ➔ Trendline ➔ More Trendline Options. In the Format Trendline dialog box, click Trendline Options in the left pane and in the Trendline Options right pane, click Exponential and click OK.
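Outside Excel, the same conversion is a single expression. The Python sketch below mirrors the POWER(10, ...) step: predictions made on the log10(Y) scale are converted back and residuals are then formed on the original scale; all numbers are invented for illustration only.

# Minimal sketch of converting log10-scale predictions back to the original scale.
actual = [4.9, 5.1, 5.6]                    # original Y values (invented)
log_predicted = [0.695, 0.710, 0.742]       # predicted log10(Y) from the regression (invented)

predicted = [10 ** lp for lp in log_predicted]        # same role as POWER(10, ...)
residuals = [a - p for a, p in zip(actual, predicted)]
print([round(p, 3) for p in predicted], [round(r, 3) for r in residuals])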

Model Selection Using First, Second, and Percentage Differences Use arithmetic formulas to calculate the first, second, and percentage differences. Use division formulas to calculate the percentage differences and use subtraction formulas to calculate the first and second differences.

EG14.5 AUTOREGRESSIVE MODELLING FOR TREND FITTING AND FORECASTING Creating Lagged Predictor Variables Create lagged predictor variables by creating a column of formulas that refer to a previous row’s (previous time period’s) Y value. Enter the special worksheet value #N/A (not available) for the cells in the column to which lagged values do not apply. When specifying cell ranges for a lagged predictor variable, you include only rows that contain lagged values. Contrary to the usual practice in this book, you do not include rows that contain #N/A, nor do you include the row 1 column heading.

Autoregressive Modelling Modify the Section EG13.1 instructions (see page 541) to create a third-order or second-order autoregressive model. Use the cell range of the first-order, second-order and thirdorder lagged predictor variables as the X variables cell


range for the third-order model. Use the cell range of the first-order and second-order lagged predictor variables as the X variables cell range for the second-order model. (The X variables cell range is the X Variables Cell Range in the PHStat instructions and the Input X Range in the Analysis ToolPak instructions.) If using the PHStat instructions, modify step 3 to clear, not check, First cells in both ranges contain label. Modify the Section EG12.2 instructions (see page 502) to create a first-order autoregressive model. Use the cell range of the first-order lagged predictor variable as the X variable cell range (called the X Variable Cell Range in the PHStat instructions and the Input X Range in the Analysis ToolPak instructions). If using the PHStat instructions, modify step 3 to clear, not check, First cells in both ranges contain label. If using the Analysis ToolPak instructions, do not check Labels in step 4.

EG14.6 CHOOSING AN APPROPRIATE FORECASTING MODEL
Performing a Residual Analysis
To create residual plots for the linear trend model or the first-order autoregressive model, use the instructions in Section EG12.5 on page 503. To create residual plots for the quadratic trend model or the second-order or third-order autoregressive model, use the instructions in Section EG13.3 on page 543. To create residual plots for the exponential trend model, use the instructions in Section EG14.4 on page 605.

Measuring the Magnitude of the Residuals Through Squared or Absolute Differences
To calculate the mean absolute deviation (MAD), first perform a residual analysis. Then add a formula in the form =SUMPRODUCT(ABS(cell range of residual values)) / COUNT(cell range of the residual values). In the cell range of the residual values, do not include the column heading (contrary to the standard practice in this book).

A Comparison of Five Forecasting Methods
Construct a model comparison worksheet similar to the one shown in Figure 14.21 on page 583 by using Paste Special values to transfer results from regression results worksheets. For the SSE values (row 39 in Figure 14.21), copy the regression results worksheet cell C13, the SS value for Residual in the ANOVA table. For the SYX values (row 40), copy the regression results worksheet cell B7, labelled Standard Error, for all but the exponential trend model. For the MAD values, add formulas as discussed in the previous section. For the SYX value for the exponential trend model, enter a formula in the form =SQRT(exponential SSE cell / (COUNT(cell range of exponential residuals) − 2)). For the Figure 14.21 worksheet, this formula is =SQRT(H20 / (COUNT(H3:H19) − 2)).

EG14.7 TIME-SERIES FORECASTING OF SEASONAL DATA
Least-Squares Forecasting with Monthly or Quarterly Data
To develop a least-squares regression model for monthly or quarterly data, add columns of formulas that use the IF function to create dummy variables for the quarterly or monthly data. Enter all formulas in the form =IF(comparison, 1, 0). Figure EG14.2 shows the first five rows of columns F through K of a data worksheet that contains dummy variables. In panel A, columns F, G, and H contain the quarterly dummy variables Q1, Q2, and Q3 that are based on column B coded quarter values (not shown). In panel B, columns J and K contain the two monthly variables M1 and M6 that are based on column C month values (also not shown).

Figure EG14.2  Data worksheet with dummy variables

Microsoft® product screen shots are reprinted with permission from Microsoft Corporation.
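The dummy-variable approach described in EG14.7 can also be checked outside Excel. The Python sketch below builds Q1–Q3 dummy variables and fits the quarterly exponential trend model of Equation 14.21 by least squares with NumPy; the quarterly series and all variable names are invented purely for illustration.

# Minimal sketch: quarterly dummies and the model log10(Y) = b0 + b1*X + b2*Q1 + b3*Q2 + b4*Q3.
import numpy as np

y = np.array([5.0, 5.6, 5.3, 4.9, 5.1, 5.8, 5.4, 5.0, 5.2, 5.9, 5.5, 5.1])  # invented quarterly data
quarters = np.tile([1, 2, 3, 4], 3)     # quarter labels: 1 = March quarter, ..., 4 = December quarter
x = np.arange(len(y))                   # coded quarter 0, 1, 2, ...

q1 = (quarters == 1).astype(float)      # dummy variables; the fourth quarter is the base category
q2 = (quarters == 2).astype(float)
q3 = (quarters == 3).astype(float)

design = np.column_stack([np.ones_like(x, dtype=float), x, q1, q2, q3])
coeffs, *_ = np.linalg.lstsq(design, np.log10(y), rcond=None)
b0, b1, b2, b3, b4 = coeffs

# Forecast the next period (a March quarter in this invented series): take the antilog.
x_next = len(y)
print(round(10 ** (b0 + b1 * x_next + b2 * 1), 2))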

Copyright © Pearson Australia (a division of Pearson Australia Group Pty Ltd) 2019— 9781488617249 — Berenson/Basic Business Statistics 5e

CHAPTER 15
Chi-square tests

ONLINE CHECK-IN USAGE

Airlines are constantly trying to find new ways to cut costs and improve efficiency. In recent years, as access to the Internet has become much more widespread, travellers are being given the option to check in for international flights prior to their arrival at the airport. This prevents long airport check-in queues and reduces the number of staff required. However, for some customers airport check-in is still more practical, especially if they are away from their home country and do not have easy access to printers or Internet connections for mobile devices immediately before travelling. Hikari Airlines encourages online check-in by sending reminder emails to its passengers 24 hours prior to their departure time. For operational planning purposes, Hikari Airlines is examining whether there is a difference between the use of online check-in at two of its ports, Sydney and Singapore.
© Elena Elisseeva|Dreamstime.com


LEARNING OBJECTIVES

After studying this chapter you should be able to:
1 identify how and when to use the chi-square test for contingency tables
2 use the Marascuilo procedure for determining pairwise differences when evaluating more than two proportions
3 apply the chi-square test of independence
4 use the chi-square test to evaluate the goodness of fit of a set of data to a specific probability distribution
5 use the chi-square distribution to test for the population variance or standard deviation

In the preceding chapters, you have used hypothesis-testing procedures to analyse both numerical and categorical data in both one-sample tests and two-sample tests. This chapter extends hypothesis testing to analyse differences between population proportions based on two or more samples, as well as the hypothesis of independence in the joint responses to two categorical variables. In addition, the chi-square test is used to test whether a set of data fits a specific probability distribution.

LEARNING OBJECTIVE 1
Identify how and when to use the chi-square test for contingency tables

cross-classification table – chi-square  Displays counts of categorical responses between any number of independent groups.
contingency table (or cross-classification table) – probability  Represents a sample space for joint events classified by two characteristics; each cell represents the joint event satisfying given values of both characteristics.

15.1  CHI-SQUARE TEST FOR THE DIFFERENCE BETWEEN TWO PROPORTIONS (INDEPENDENT SAMPLES)

This hypothesis-testing procedure uses a test statistic that is approximated by a chi-square distribution. If you are interested in comparing the counts of categorical responses between two independent groups, you can develop a two-way cross-classification table (see Section 2.4) to display the frequency of occurrence of successes and failures for each group. This table is called a contingency table and is used in Chapter 4 to define and study probability.
To illustrate the contingency table, return to the scenario at the beginning of the chapter concerning use of online check-in facilities. Two cities (Sydney and Singapore) are used for the analysis. Hikari Airlines randomly selects a sample of passengers from its complete departure list for the two cities over a two-month period. The check-in methods of 420 passengers leaving Sydney and 530 passengers departing from Singapore on this airline are tabulated. At the 0.05 level of significance, is there evidence of a significant difference in online check-in pattern between the two cities?
The contingency table displayed in Table 15.1 has two rows and two columns and is called a '2 × 2 table'. The cells in the table indicate the frequency for each row and column combination. Table 15.2 contains the contingency table for the check-in method. The contingency table has two rows, indicating whether the passenger checks in online (i.e. success) or at the airport (i.e. failure), and two columns, one for each city. The cells in the table indicate the frequency of each row and column combination. The row totals indicate the number of passengers checking in by each method. The column totals are the sample sizes for each city.
To test whether the population proportion of all passengers who use online check-in in Sydney, π1, is equal to the population proportion of all passengers who use online check-in in Singapore, π2, you can use the χ2 test for equality of proportions. (Note that this is based on a Greek letter and therefore pronounced 'kai' square, not 'chee'.) To test the null hypothesis that there is no difference between the two population proportions:
H0: π1 = π2


against the alternative hypothesis that the two population proportions are not the same:
H1: π1 ≠ π2
you use the χ2 test statistic, shown in Equation 15.1.

Table 15.1  Layout of a 2 × 2 contingency table

                      Column variable (group)
Row variable          1            2            Totals
Successes             X1           X2           X
Failures              n1 − X1      n2 − X2      n − X
Totals                n1           n2           n

where  X1 = number of successes in group 1
       X2 = number of successes in group 2
       n1 − X1 = number of failures in group 1
       n2 − X2 = number of failures in group 2
       X = X1 + X2 is the total number of successes
       n − X = (n1 − X1) + (n2 − X2) is the total number of failures
       n1 = the sample size in group 1
       n2 = the sample size in group 2
       n = n1 + n2 is the total sample size

Table 15.2  A 2 × 2 contingency table for the flight check-in study

                          City
Check-in method    Sydney    Singapore    Total
Online               258        375         633
Airport              162        155         317
Total                420        530         950

χ2 TEST FOR THE DIFFERENCE BETWEEN TWO PROPORTIONS
The χ2 test statistic is equal to the squared difference between the observed and expected frequencies, divided by the expected frequency in each cell of the table, summed over all cells of the table.

χ2 = Σ over all cells (fo − fe)2 / fe   (15.1)

where  fo = observed frequency in a particular cell of a contingency table
       fe = expected frequency in a particular cell if the null hypothesis is true

The test statistic, χ2, approximately follows a chi-square distribution with one degree of freedom for this particular problem. Degrees of freedom = (r − 1)(c − 1), where r = number of rows and c = number of columns.

To calculate the expected frequency, fe , in any cell you need to understand that, if the null hypothesis is true, the proportion of successes in the two populations will be equal. Then the sample proportions you calculate from each of the two groups would differ from each other only by chance, and each would provide an estimate of the common population parameter, π. A statistic that combines these two separate estimates into one overall estimate of the population parameter, π, provides more information than either of the separate estimates could provide

observed frequency
The known (given) categorical response value usually displayed in a cross-classification table.

expected frequency
The calculated categorical response value on the basis of a true null hypothesis whereby the total number of responses (successes) for each group is divided on a proportional basis to the total number of successes.


by itself. This statistic, given by the symbol p̄, represents the estimated overall proportion of successes for the two groups combined (i.e. the total number of successes divided by the total sample size). The complement of p̄, 1 − p̄, represents the overall proportion of failures in the two groups. Using the notation presented in Table 15.1, Equation 15.2 defines p̄.

CALCULATING THE ESTIMATED OVERALL PROPORTION

p̄ = (X1 + X2)/(n1 + n2) = X/n   (15.2)

chi-square (χ2) distribution
The probability distribution for chi-square to be used in determining critical values of chi-square.

To calculate the expected frequency, fe, for each cell pertaining to success (i.e. the cells in the first row in the contingency table), multiply the sample size (or column total) for a group by p̄. To calculate the expected frequency, fe, for each cell pertaining to failure (i.e. the cells in the second row in the contingency table), multiply the sample size (or column total) for a group by 1 − p̄. The test statistic shown in Equation 15.1 approximately follows a chi-square distribution with one degree of freedom. Using a level of significance of α, you reject the null hypothesis in favour of the alternative if the calculated χ2 test statistic is greater than χU2, the upper-tail critical value from the χ2 distribution having one degree of freedom. Thus, the decision rule is: Reject H0 if χ2 > χU2; otherwise, do not reject H0. Figure 15.1 illustrates the decision rule.

Figure 15.1  Regions of rejection and non-rejection when using the chi-square test for equality of proportions with level of significance of α

If the null hypothesis is true, the calculated χ2 statistic should be close to zero because the squared difference between what is actually observed in each cell, fo, and what is theoretically expected, fe, should be very small. On the other hand, if H0 is false and there are real differences in the population proportions, the calculated χ2 statistic is expected to be large. However, what constitutes a large difference in a cell is relative. The same actual difference between fo and fe from a cell in which only a few observations are expected contributes more to the χ2 test statistic than a cell where many observations are expected.
To illustrate the use of the chi-square test for equality of two proportions, return to the online check-in scenario and the corresponding contingency table displayed in Table 15.2. The null hypothesis (H0: π1 = π2) states that there is no difference between the proportion of passengers in Sydney and Singapore who will choose online check-in for their flight. From Equation 15.2 above, we use p̄ to estimate the common parameter π, the population proportion of passengers in Sydney and Singapore who will check in online. To begin, calculate p̄ using Equation 15.2:

p̄ = (X1 + X2)/(n1 + n2) = (258 + 375)/(420 + 530) = 633/950 = 0.6663


p̄ is the estimate of the common parameter π, the population proportion of passengers who will check in online if the null hypothesis is true. The estimated proportion of passengers who are not likely to check in online (i.e. will check in at the airport) is the complement of p̄, 1 − 0.6663 = 0.3337. Multiplying these two proportions by the sample size for Sydney gives the hypothetical number of passengers departing from Sydney who are expected to check in online and at the airport. In a similar manner, multiplying the two respective proportions by the Singapore sample size yields the corresponding expected frequencies for that group.

EXAMPLE 15.1
CALCULATING THE EXPECTED FREQUENCIES
Calculate the expected frequencies for each of the four cells of Table 15.2. For greater accuracy use the value of p̄ without rounding it first.

SOLUTION
Online check-in, Sydney: p̄ = 0.6663 and n1 = 420, so fe = 279.85
Online check-in, Singapore: p̄ = 0.6663 and n2 = 530, so fe = 353.15
Airport check-in, Sydney: 1 − p̄ = 0.3337 and n1 = 420, so fe = 140.15
Airport check-in, Singapore: 1 − p̄ = 0.3337 and n2 = 530, so fe = 176.85
Table 15.3 presents these expected frequencies next to the corresponding observed frequencies taken from Table 15.2.

Table 15.3  A 2 × 2 contingency table for comparing the observed (fo) and expected (fe) frequencies

                        Sydney                    Singapore
Check-in method    Observed    Expected     Observed    Expected     Total
Online               258        279.85        375        353.15       633
Airport              162        140.15        155        176.85       317
Total                420        420.00        530        530.00       950

To test the null hypothesis that the population proportions are equal:
H0: π1 = π2
against the alternative hypothesis that the population proportions are not equal:
H1: π1 ≠ π2
you use the observed and expected frequencies from Table 15.3 to calculate the χ2 test statistic given by Equation 15.1. Table 15.4 presents the calculations.

Table 15.4  Calculation of χ2 test statistic for the online check-in study

fo      fe        (fo − fe)    (fo − fe)2    (fo − fe)2/fe
258     279.85     −21.85        477.54          1.706
375     353.15      21.85        477.54          1.352
162     140.15      21.85        477.54          3.407
155     176.85     −21.85        477.54          2.700
                                                 9.166

The chi-square distribution is a right-skewed distribution whose shape depends solely on the number of degrees of freedom. As the number of degrees of freedom increases, the chi-square distribution becomes more symmetrical. You find the critical value of the χ2 test statistic from Table E.4, a portion of which is presented as Table 15.5. The values in Table 15.5 refer to selected upper-tail areas of the χ2 distribution. A 2 × 2 contingency table has (2 − 1)(2 − 1) = 1 degree of freedom. Using α = 0.05, with one degree of freedom, the critical value of χ2 from Table 15.5 is 3.841. You reject H0 if the calculated χ2 statistic is greater than 3.841 (see Figure 15.2). Since 9.166 > 3.841, you reject H0 and conclude that there is a difference between the online check-in rates for passengers departing from Sydney and Singapore.

Table 15.5  Finding the χ2 critical value from the chi-square distribution with one degree of freedom using the 0.05 level of significance (extracted from Table E.4 in Appendix E of this book)

                                            Upper-tail area
Degrees of freedom    0.995     0.99    . . .     0.05      0.025     0.01      0.005
1                                       . . .     3.841     5.024     6.635     7.879
2                     0.010     0.020   . . .     5.991     7.378     9.210    10.597
3                     0.072     0.115   . . .     7.815     9.348    11.345    12.838
4                     0.207     0.297   . . .     9.488    11.143    13.277    14.860
5                     0.412     0.554   . . .    11.071    12.833    15.086    16.750

Figure 15.2  Regions of rejection and non-rejection when finding the χ2 critical value with one degree of freedom at the 0.05 level of significance (region of non-rejection below and region of rejection above the critical value of 3.841)
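The text carries out these calculations by hand and in Excel. Purely as a cross-check, the following Python sketch (assuming numpy and scipy are available; these are not used by the text) reproduces Equations 15.1 and 15.2 for the Table 15.2 data.

```python
import numpy as np
from scipy.stats import chi2

# Observed frequencies from Table 15.2 (rows: online, airport; columns: Sydney, Singapore)
observed = np.array([[258, 375],
                     [162, 155]])

n1, n2 = observed.sum(axis=0)            # column totals: 420 and 530
p_bar = observed[0].sum() / (n1 + n2)    # Equation 15.2: 633/950 = 0.6663

# Expected frequencies under H0: column total times p_bar (successes) or 1 - p_bar (failures)
expected = np.array([[p_bar * n1, p_bar * n2],
                     [(1 - p_bar) * n1, (1 - p_bar) * n2]])

# Equation 15.1: sum of (fo - fe)^2 / fe over all cells
chi_sq = ((observed - expected) ** 2 / expected).sum()

critical = chi2.ppf(0.95, df=1)          # upper-tail critical value at alpha = 0.05
print(round(chi_sq, 3), round(critical, 3))   # 9.166 and 3.841, so H0 is rejected
```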

Figure 15.3 represents a Microsoft Excel worksheet for the airport check-in contingency table (Table 15.2 on page 609). These outputs include the expected frequencies, χ2 test statistic, degrees of freedom and p-value. The χ2 test statistic is 9.166, which is greater than the critical value of 3.841 (or the p-value = 0.0025 < 0.05), so you reject the null hypothesis that there is no difference between the two cities in the proportion of passengers who check in online. For the χ2 test to give accurate results for a 2 × 2 table, you must assume that each expected frequency is at least 5. If this assumption is not satisfied, you can use alternative procedures such as Fisher's exact test (see references 1 and 2).
In the airport check-in study, both the Z test based on the standardised normal distribution (see Section 10.4) and the χ2 test based on the chi-square distribution will lead to the same conclusion. You can explain this result by the interrelationship between the standardised normal distribution and a chi-square distribution with one degree of freedom. For such situations, the χ2 test statistic is the square of the Z test statistic. For instance, in the airport check-in study, the calculated Z test statistic would be −3.028 and the calculated χ2 test statistic is 9.166. Except for rounding error, this latter value is the square of −3.028 (i.e. (−3.028)2 = 9.166). Also, if you compare the critical values of the test statistics from the two distributions, at the 0.05 level of significance, the χ12 value of 3.841 is the square of the Z value of ±1.96 (i.e. χ12 = Z2). Furthermore, the p-values for both tests are equal. Therefore, when testing the null hypothesis of equality of proportions:
H0: π1 = π2
against the alternative hypothesis that the population proportions are not equal:
H1: π1 ≠ π2
the Z test and the χ2 test are equivalent methods. However, if you are specifically interested in determining whether there is evidence of a directional difference, such as π1 > π2, then you


Figure 15.3  Microsoft Excel 2016 worksheet for the online check-in data (the worksheet shows the observed and expected frequencies together with the calculations: level of significance 0.05, 2 rows, 2 columns, 1 degree of freedom, critical value 3.8415, chi-square test statistic 9.1662, p-value 0.0025; decision: reject the null hypothesis; the expected frequency assumption is met)

must use the Z test with the entire rejection region located in one tail of the standardised normal distribution. In Section 15.2, the χ2 test is extended to make comparisons and evaluate differences between the proportions for more than two groups. However, you cannot use the Z test if there are more than two groups.
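A quick numerical check of the Z and χ2 relationship described above (a sketch, not part of the text; assumes scipy is available):

```python
from scipy.stats import norm, chi2

z_crit = norm.ppf(0.975)          # two-tailed Z critical value at alpha = 0.05: 1.96
chi_crit = chi2.ppf(0.95, df=1)   # chi-square critical value with 1 degree of freedom
print(round(z_crit ** 2, 4), round(chi_crit, 4))   # both approximately 3.8415

# The squared Z test statistic from the check-in study matches the chi-square statistic
print(round((-3.028) ** 2, 3))    # about 9.17, equal to 9.166 apart from rounding
```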

think about this
Insurance for young drivers

Insurance companies charge heavy premiums for policies for high-performance cars, particularly those that are driven by young drivers. Premiums are based on the risk of claims for theft and the cost of accident repairs. Accident risks are perceived to be greater due to the capacity for driving at high speed.

An insurance company believes that there is one factor that may reduce claims. An advanced driver education course may make a difference to the behaviour of young drivers of high-performance cars, which would then be reflected in the proportion of claims they make on their policies during a one-year period. If you were working for an insurance company, how would you set about testing whether the population proportion of claims made for cars with young drivers who have completed advanced training differs from the proportion made where young drivers have not completed advanced training?


Problems for Section 15.1

LEARNING THE BASICS
15.1 Determine the critical value of χ2 in each of the following circumstances.
a. α = 0.01, df = 12
b. α = 0.025, df = 23
c. α = 0.05, df = 24
15.2 Determine the critical value of χ2 in each of the following circumstances.
a. α = 0.95, df = 4
b. α = 0.975, df = 26
c. α = 0.99, df = 15
15.3 In this problem, use the following contingency table.

         A     B     Total
1        20    30      50
2        30    45      75
Total    50    75     125

a. Find the expected frequency for each cell.
b. Compare the observed and expected frequencies for each cell.
c. Calculate the χ2 statistic. Is it significant at α = 0.05?
15.4 For this problem, use the following contingency table.

         A     B     Total
1        15    24      39
2        12    25      37
Total    27    49      76

a. Find the expected frequency for each cell.
b. Find the χ2 statistic for this contingency table. Is it significant at α = 0.01?

APPLYING THE CONCEPTS
You can solve problems 15.5 to 15.9 manually or by using Microsoft Excel.

15.5 A car importer believes that black-coloured cars are more often chosen by male buyers than by females. A survey is conducted to ask, 'Do you prefer black-coloured cars?' The survey results are available only as percentages and no sample sizes were given.

Prefer black cars?    Men    Women
Yes                   26%     10%
No                    74%     90%

a. Assume that 50 men and 50 women were included in the survey. At the 0.05 level of significance, is there a difference between males and females in the proportion of people who prefer black-coloured cars? b. Assume that 200 men and 200 women were included in the survey. At the 0.05 level of significance, is there a difference between males and females in the proportion of people who prefer black-coloured cars? c. Discuss the effect of sample size on the chi-square test.

15.6 In a study of share market investments, a sample of 334 Chinese and 116 Japanese business executives were asked, 'Do you privately own shares on the stock market?' It is expected that the longer history of free-market activity in Japan compared with China will show that Japanese executives are more likely to own shares. The contingency table below presents the survey results.

                   Country of origin
Own shares?       China    Japan    Totals
Yes                 14       36        50
No                 320       80       400
Total              334      116       450

a. At the 0.05 level of significance, is there a significant difference between the proportion of share ownership between Chinese and Japanese executives? b. Determine the p-value in (a) and interpret its meaning. c. What conclusions can you draw from this analysis? 15.7 A company is about to enter into wage negotiations with its workers and must choose between enterprise (collective) bargaining or individual contracts as its mode of negotiation. Out of a sample of 165 blue-collar workers, 112 preferred enterprise bargaining while 23 out of 48 white-collar workers preferred enterprise bargaining. Is there a significant difference between the proportion of blue-collar and white-collar workers who prefer enterprise bargaining? (Use α = 0.05.) 15.8 In order to test the puncture resistance of the new SDsandtrak tyre, a new type of off-road vehicle tyre, 120 of the new tyres and 180 standard SD tyres are driven over the same hazardous 500-kilometre course in Central Australia. Just six of the SDsandtrak tyres are punctured whereas 18 of the standard SD tyres have to be replaced due to punctures. a. Can it be concluded that the puncture rate of the new SDsandtrak is significantly different from the puncture rate of normal tyres, at the 1% level of significance? b. Why has it been decided to set the significance level at a low 1%? What is the consequence if a 5% level had been chosen? 15.9 Non-response to mail, telephone and interview surveys is a continual problem for researchers. The problem is knowing whether the individuals who do not respond form a separate subsample of the population. If they do, then the data collected are biased because they do not represent this subsample of respondents. A common method of increasing survey response percentages is to collect a smaller, directed sample by interview at a prearranged appointment. Another way of increasing the response rate is to offer a prize draw (although the anonymity of respondents then becomes an issue). To test whether offering a prize is effective in increasing the response rate, two groups of financial planners are surveyed by mail, one in Adelaide without offering a prize and one in Perth


where a lucky prize draw for a high-quality laptop computer, valued at $1,675, is offered. The results are given in the following table:

                       City
                 Adelaide    Perth    Total
Response             73        98      171
Non-response        138       117      255
Total               211       215      426

a. Is there a difference between the two treatment groups in the proportion of responses to the survey at the 0.05 level of significance? b. Find the p-value in (a) and interpret its meaning. c. What aspect of offering a prize might impact on the findings?

15.2  CHI-SQUARE TEST FOR DIFFERENCES BETWEEN MORE THAN TWO PROPORTIONS

In this section the χ2 test is extended to compare more than two independent populations. The letter c is used to represent the number of independent populations under consideration. Thus, the contingency table now has two rows and c columns. To test the null hypothesis that there are no differences between the c proportions:
H0: π1 = π2 = … = πc
against the alternative hypothesis that not all the c population proportions are equal:
H1: Not all πj are equal (where j = 1, 2, …, c)
use Equation 15.1 on page 609:

χ2 = Σ over all cells (fo − fe)2 / fe

where  fo = observed frequency in a particular cell of a 2 × c contingency table
       fe = expected frequency in a particular cell if the null hypothesis is true

If the null hypothesis is true and the proportions are equal across all c populations, then the c sample proportions should differ only by chance. In such a situation, a statistic that combines these c separate estimates into one overall estimate of the population proportion π provides more information than any one of the c separate estimates alone. To expand on Equation 15.2 (page 610), the statistic p̄ in Equation 15.3 represents the estimated overall proportion for all c groups combined.

CALCULATING THE ESTIMATED OVERALL PROPORTION FOR c GROUPS

p̄ = (X1 + X2 + … + Xc)/(n1 + n2 + … + nc) = X/n   (15.3)

To calculate the expected frequency, fe, for each cell in the first row of the contingency table, multiply each sample size (or column total) by p̄. To calculate the expected frequency, fe, for each cell in the second row of the contingency table, multiply each sample size (or column total) by 1 − p̄. The test statistic shown in Equation 15.1 approximately follows a chi-square distribution with degrees of freedom equal to the number of rows in the contingency table minus 1, multiplied by the number of columns in the table minus 1. For a 2 × c contingency table, there are c − 1 degrees of freedom:
Degrees of freedom = (2 − 1)(c − 1) = c − 1


Using a level of significance of α, you reject the null hypothesis if the calculated χ2 test statistic is greater than χU2 , the upper-tail critical value from a chi-square distribution having c − 1 degrees of freedom. Therefore, the decision rule is: Reject H0 if χ2 > χU2 ; otherwise, do not reject H0. Figure 15.4 illustrates the decision rule.

Figure 15.4  Regions of rejection and non-rejection when testing for differences between c proportions using the χ2 test

To illustrate the χ2 test for equality of proportions when there are more than two groups, return to the chapter-opening scenario about online flight check-ins. Results for a third city, Jakarta, are now added. Table 15.6 presents the check-in patterns of the three samples of different passengers departing from the three cities.

Table 15.6  A 2 × 3 contingency table for online check-in patterns

                              City
Check-in method    Sydney    Singapore    Jakarta    Total
Online               258        375          210       843
Airport              162        155          190       507
Total                420        530          400     1,350

Since the null hypothesis states that there are no differences between the three cities regarding the use of online check-in, use Equation 15.3 to calculate an estimate of π, the population proportion of passengers who intend to use online check-in for their flights:

p̄ = (X1 + X2 + … + Xc)/(n1 + n2 + … + nc) = (258 + 375 + 210)/(420 + 530 + 400) = 843/1,350 = 0.624

The estimated overall proportion of passengers who would not be likely to check in online (i.e. intend to check in at the airport) is 1 − p̄, or 0.376. Multiplying these two proportions by the sample size taken in each city yields the hypothetical expected number of passengers who will check in online and the hypothetical expected number who will check in at the airport.


EXAMPLE 15.2
CALCULATING THE EXPECTED FREQUENCIES
Calculate the expected frequencies for each of the six cells in Table 15.6. For greater accuracy, use the p̄ value without rounding.

SOLUTION
Online check-in, Sydney: p̄ = 0.624 and n1 = 420, so fe = 262.27
Online check-in, Singapore: p̄ = 0.624 and n2 = 530, so fe = 330.96
Online check-in, Jakarta: p̄ = 0.624 and n3 = 400, so fe = 249.78
Airport check-in, Sydney: 1 − p̄ = 0.376 and n1 = 420, so fe = 157.73
Airport check-in, Singapore: 1 − p̄ = 0.376 and n2 = 530, so fe = 199.04
Airport check-in, Jakarta: 1 − p̄ = 0.376 and n3 = 400, so fe = 150.22

Table 15.7  Cross-classification of expected frequencies from the Hikari Airlines passenger check-in analysis of three cities

                              City
Check-in method    Sydney    Singapore    Jakarta    Total
Online             262.27       330.96     249.78      843
Airport            157.73       199.04     150.22      507
Total              420.00       530.00     400.00    1,350

Table 15.7 presents these expected frequencies. To test the null hypothesis that the proportions are equal:
H0: π1 = π2 = π3
against the alternative hypothesis that not all the three proportions are equal:
H1: Not all πj are equal (where j = 1, 2, 3)
use the observed and expected frequencies from Tables 15.6 and 15.7 to calculate the χ2 test statistic given by Equation 15.1 (page 609). Table 15.8 presents the calculations.

Table 15.8  Calculation of χ2 test statistic for three cities

fo      fe           (fo − fe)     (fo − fe)2      (fo − fe)2/fe
258     262.2667       −4.2667        18.2044          0.0694
375     330.9556       44.0444     1,939.9131          5.8616
210     249.7778      −39.7778     1,582.2716          6.3347
162     157.7333        4.2667        18.2044          0.1154
155     199.0444      −44.0444     1,939.9131          9.7461
190     150.2222       39.7778     1,582.2716         10.5329
                                                      32.6601

You find the critical value of the χ2 test statistic from Table E.4. Because three cities are evaluated in the airport check-in study, there are (2 − 1)(3 − 1) = 2 degrees of freedom. Using α = 0.05, the χ2 critical value with 2 degrees of freedom is 5.991. Because the calculated test statistic (χ2 = 32.6601) is greater than this critical value, you reject the null hypothesis (see Figure 15.5). Microsoft Excel (see Figure 15.6) also reports the p-value. Since the p-value is approximately zero, which is obviously less than α = 0.05, you reject the null hypothesis. Further, this p-value indicates that there is virtually no chance of seeing differences this large or larger between the three sample proportions if the population proportions for the three cities are equal. Thus, there is sufficient evidence to conclude that, at the 5% level of significance, the cities are different with respect to the proportion of passengers who check in for their flights online.
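As a cross-check on the hand calculation in Table 15.8 (a sketch, not the textbook's Excel method; assumes numpy and scipy are available):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed frequencies from Table 15.6 (rows: online, airport; columns: Sydney, Singapore, Jakarta)
observed = np.array([[258, 375, 210],
                     [162, 155, 190]])

# correction=False gives the uncorrected chi-square statistic used in the text
stat, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(round(stat, 4), dof, p_value)   # 32.6601, 2 degrees of freedom, p-value about 8.1e-08
```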


Figure 15.5  Regions of rejection and non-rejection when testing for differences in three proportions at the 0.05 level of significance with two degrees of freedom (region of non-rejection below and region of rejection above the critical value of 5.991)

Figure 15.6  Microsoft Excel worksheet for the online check-in patterns data of Table 15.6 (the worksheet shows the observed and expected frequencies together with the calculations: level of significance 0.05, 2 rows, 3 columns, 2 degrees of freedom, critical value 5.9915, chi-square test statistic 32.6601, p-value 8.0900E-08; decision: reject the null hypothesis; the expected frequency assumption is met)

For the χ2 test to give accurate results when dealing with 2 × c contingency tables, all expected frequencies must be large. For such situations there is much debate among statisticians about the definition of large. Some statisticians (see reference 3) have found that the test gives accurate results as long as all expected frequencies equal or exceed 5.0. Other statisticians, slightly different in their approach, require that no more than 20% of the cells contain expected frequencies less than 5 and no cells have expected frequencies less than 1 (see reference 4). A reasonable compromise between these points of view is to make sure that each expected frequency is at least 1. To accomplish this you may need to collapse two or more low expected


frequency categories into one category in the contingency table before performing the test. Such merging of categories usually results in expected frequencies sufficiently large to conduct the χ2 test accurately. If the combining of categories is undesirable, alternative procedures are available (see references 2 and 5).

The Marascuilo Procedure

LEARNING OBJECTIVE 2
Use the Marascuilo procedure for determining pairwise differences when evaluating more than two proportions

Marascuilo procedure
Enables comparisons between all pairs of groups within a contingency table.

Rejecting the null hypothesis in a χ2 test of equality of proportions in a 2 × c table only allows you to reach the conclusion that not all c population proportions are equal. But which of the proportions differ? Because the result of the χ2 test for equality of proportions does not specifically answer this question, a post-hoc multiple comparison procedure is needed. One such approach that follows rejection of the null hypothesis of equal proportions is the Marascuilo procedure. The Marascuilo procedure enables you to make comparisons between all pairs of groups. First, you need to calculate the observed differences pj − pj′ (where j ≠ j′) between all c(c − 1)/2 pairs. Then, use Equation 15.4 to calculate the corresponding critical ranges for the Marascuilo procedure (j and j′ refer to the two groups being compared).

CRITICAL RANGE FOR THE MARASCUILO PROCEDURE

Critical range = √χU2 × √[ pj(1 − pj)/nj + pj′(1 − pj′)/nj′ ]   (15.4)

You need to calculate a distinct critical range for each pairwise comparison of sample proportions. In the final step you compare each of the c(c − 1)/2 pairs of sample proportions against its corresponding critical range. You declare a specific pair significantly different if the absolute difference in the sample proportions |pj − pj′| is greater than its critical range.
To apply the Marascuilo procedure, return to the airport check-in study. Using the χ2 test, you concluded that there was evidence of a significant difference between the population proportions. Because there are three cities, there are (3)(3 − 1)/2 = 3 pairwise comparisons. From Table 15.6 (page 616) the three sample proportions are:

p1 = X1/n1 = 258/420 = 0.614
p2 = X2/n2 = 375/530 = 0.708
p3 = X3/n3 = 210/400 = 0.525

Using Table E.4 and an overall level of significance of 0.05, the upper-tail critical value of the χ2 test statistic for a chi-square distribution having (c − 1) = 2 degrees of freedom is 5.991. Thus:

√χU2 = √5.991 = 2.448

Next, calculate the three pairs of absolute differences in sample proportions and their corresponding critical ranges. If the absolute difference is greater than its critical range, the proportions are significantly different.


Absolute difference in proportions           Critical range
|pj − pj′|                                   2.448 √[ pj(1 − pj)/nj + pj′(1 − pj′)/nj′ ]
|p1 − p2| = |0.614 − 0.708| = 0.093          2.448 √[ (0.614)(0.386)/420 + (0.708)(0.292)/530 ] = 0.076
|p1 − p3| = |0.614 − 0.525| = 0.089          2.448 √[ (0.614)(0.386)/420 + (0.525)(0.475)/400 ] = 0.084
|p2 − p3| = |0.708 − 0.525| = 0.183          2.448 √[ (0.708)(0.292)/530 + (0.525)(0.475)/400 ] = 0.078

Figure 15.7 shows a Microsoft Excel worksheet for the Marascuilo procedure. You can conclude, using a 0.05 overall level of significance, that a lower proportion of passengers check in online in Jakarta (p3 = 0.525) than either in Sydney (p1 = 0.614) or Singapore (p2 = 0.708). However, the online check-in proportion found in Sydney is also significantly lower than the proportion in Singapore.

Figure 15.7  Microsoft Excel worksheet for the Marascuilo procedure (the worksheet shows the level of significance 0.05, the square root of the critical value 2.4477, the sample proportions 0.6143, 0.7075 and 0.5250, and the Marascuilo table in which all three absolute differences, 0.0933, 0.0893 and 0.1825, exceed their critical ranges, 0.0756, 0.0844 and 0.0779, and are declared significant)
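The Marascuilo calculations shown in Figure 15.7 can also be scripted. The following sketch (not from the text; assumes scipy is available) reproduces the three critical ranges from Equation 15.4.

```python
from itertools import combinations
from math import sqrt
from scipy.stats import chi2

# Successes and sample sizes from Table 15.6 (Sydney, Singapore, Jakarta)
x = [258, 375, 210]
n = [420, 530, 400]
p = [xi / ni for xi, ni in zip(x, n)]

chi_upper = chi2.ppf(0.95, df=len(x) - 1)   # 5.991 with c - 1 = 2 degrees of freedom

for j, k in combinations(range(len(x)), 2):
    diff = abs(p[j] - p[k])
    # Equation 15.4: critical range for the pair (j, k)
    crit = sqrt(chi_upper) * sqrt(p[j] * (1 - p[j]) / n[j] + p[k] * (1 - p[k]) / n[k])
    verdict = "significant" if diff > crit else "not significant"
    print(f"groups {j + 1} and {k + 1}: |diff| = {diff:.3f}, critical range = {crit:.3f} -> {verdict}")
```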

Problems for Section 15.2

LEARNING THE BASICS
15.10 Consider a contingency table with two rows and five columns.
a. Find the degrees of freedom.
b. Find the critical value for α = 0.05.
c. Find the critical value for α = 0.01.
15.11 Using the following contingency table:

         A     B     C     Total
1        12    18    34      64
2        23    15    38      76
Total    35    33    72     140

a. Calculate the expected frequencies for each cell.
b. Calculate the χ2 statistic for this contingency table. Is it significant at α = 0.01?
c. If appropriate, use the Marascuilo procedure and α = 0.01 to determine which groups are different.
15.12 Using the following contingency table:

         A     B     C     Total
1        20    30    25      75
2        30    20    25      75
Total    50    50    50     150


a. Calculate the expected frequencies for each cell. b. Calculate the χ2 statistic for this contingency table. Is it significant at α = 0.05?

APPLYING THE CONCEPTS
You can solve problems 15.13 to 15.20 manually or by using Microsoft Excel.

15.13 A survey was conducted in five countries. The percentages of respondents whose household members own more than one laptop computer are as follows. Australia 53% New Zealand 48% China 38% Japan 54% South Korea 49%



Suppose that the survey was based on 500 respondents in each country.
a. At the 0.05 level of significance, determine whether there is a significant difference in the proportion of households who own more than one laptop computer.
b. Find the p-value in (a) and interpret its meaning.
c. If appropriate, use the Marascuilo procedure and α = 0.05 to determine which countries are different. Discuss your results.
15.14 Although online grocery shopping is now possible, supermarket shopping is still the most common method that Australians use to buy their groceries. Assume that a market research company examining supermarket operations throughout the week interviewed shoppers who buy groceries in person in order to find out whether their major grocery shopping day is Saturday or another day. Shoppers were separated into three different age categories.

                                        Age
Major shopping day           Under 35    35–54    Over 54
Saturday                        24%        28%       12%
A day other than Saturday       76%        72%       88%



Assume that 200 shoppers for each age category were surveyed. a. Is there evidence of a significant difference between the age groups with respect to major grocery shopping day? (Use α = 0.05.) b. Determine the p-value in (a) and interpret its meaning. c. If appropriate, use the Marascuilo procedure and α = 0.05 to determine which age groups are different. Discuss your results. d. Discuss the managerial implications of (a) and (b). How can supermarkets use this information to improve marketing and sales? Be specific. 15.15 Repeat (a) and (b) of problem 15.14, assuming that only 50 shoppers for each age category were surveyed. Discuss the

implications of sample size on the χ2 test for differences in more than two populations.
15.16 Micro Brewery is interested in whether there is a difference in the proportion of people of different age groups who prefer low-carbohydrate beer. The table below contains the survey results.

                             Age group
                           18–25    26–45    46–65    >65
Prefer low carb              45       45       23      12
Do not prefer low carb       32       43       31      18

a. Determine at a 5% level of statistical significance whether there is a difference in the proportion of people of different ages who prefer low-carbohydrate beer.
b. If appropriate, use the Marascuilo procedure and α = 0.05 to determine which age groups are different.
15.17 The staff at Xavier University in Victoria believe that the academic capacity of students from high schools varies depending on the area of study they enrol in. Using the Australian Tertiary Admission Rank (ATAR) as the measure of academic capacity, the data below are tabulated for a sample of students.

                      University course
              Arts    Business    Engineering    Science
ATAR < 90     164       172            68          120
ATAR > 90     172       195            73          139

a. Determine whether there is a significant difference in the ATAR entry level for the different courses at the 0.05 level of statistical significance.
b. Calculate the p-value and interpret its meaning.
15.18 The availability of free WiFi Internet connections in hotel rooms varies considerably with the luxury level of the hotel and the country. Despite changing attitudes to the necessity of providing free WiFi access, many upmarket hotels are still charging guests to use WiFi in their rooms. A survey was developed which kept the hotel type similar. From data collected by Syba Computing at 100 mid-range hotels in each of five countries, the following results were obtained.

                            Country
Free WiFi available?    UK    Australia    US    China    Malaysia
No                      38        43       59      52        60
Yes                     62        57       41      48        40

a. At the 0.025 level of statistical significance, determine whether there is a difference in the availability of free in-room WiFi between these countries.
b. Determine the p-value in (a) and interpret its meaning.
15.19 A travel company believes there is a difference in the ages of people, divided simply between younger and older, based on the place they choose for a tropical holiday. A sample of travellers to three destinations on the basis of age is collected.

                           Destination
                       Bali    Fiji    North Queensland
Under 50 years old      68      59            65
Over 50 years old       23      32            54

Determine at 95% statistical significance whether travellers to North Queensland are different on the basis of age compared with travellers to Bali and Fiji, using the Marascuilo procedure.
15.20 An election polling company in Australia wishes to determine whether those born overseas have similar party voting intentions to those born in Australia. A random sample of 118 registered voters was taken and classified by their country of origin and political voting intention.

                                  Country of origin
Political voting intention    Australia    Overseas    Total
Liberal                           12           23         35
Labor                             23           28         51
National                           8           12         20
Other                              4            8         12

Conduct an analysis to test whether there is a difference between political party voting intentions for those born overseas, at the α = 0.05 level of statistical significance.

LEARNING OBJECTIVE 3
Apply the chi-square test of independence

15.3  CHI-SQUARE TEST OF INDEPENDENCE

In Sections 15.1 and 15.2 you used the χ2 test to evaluate potential differences between population proportions. For a contingency table that has r rows and c columns, you can generalise the χ2 test as a test of independence for two categorical variables. As a test of independence, the null and alternative hypotheses are:
H0: The two categorical variables are independent (i.e. there is no relationship between them).
H1: The two categorical variables are dependent (i.e. there is a relationship between them).
Once again, you use Equation 15.1 (page 609) to calculate the test statistic:

χ2 = Σ over all cells (fo − fe)2 / fe

You reject the null hypothesis at the α level of significance if the calculated value of the χ2 test statistic is greater than χU2, the upper-tail critical value from a chi-square distribution with (r − 1)(c − 1) degrees of freedom (see Figure 15.8). Thus, the decision rule is: Reject H0 if χ2 > χU2; otherwise, do not reject H0.

chi-square test of independence
A hypothesis test used to test whether there is a significant relationship between two categorical variables.

The chi-square test of independence is similar to the χ2 test for equality of proportions. The test statistic and the decision rule are the same, but the stated hypotheses and the conclusion to be drawn are different. For example, in the online check-in analysis of Section 15.2 there is evidence of a significant difference between the cities with respect to online check-in rates. From a different viewpoint, you could conclude that there is a significant relationship between the cities and the likelihood that a passenger would check in online. Nevertheless, there is a fundamental and conceptual distinction between the two types of tests. The major difference is in the sampling scheme used. In a test for equality of proportions, there is one factor of interest with two or more levels. These levels represent samples drawn from independent populations. The categorical responses

Figure 15.8  Regions of rejection and non-rejection when testing for independence in an r × c contingency table using the χ2 test


in each sample group or level are usually classified into two categories, which might be considered 'success' and 'failure'. The objective is to make comparisons and evaluate differences between the proportions of success at the various levels. However, in a test for independence, there are two factors of interest, each of which has two or more levels. You select one sample, and tally the joint responses to the two categorical variables into the cells of a contingency table.
To illustrate the χ2 test for independence, suppose that, in the analysis of airport check-in methods, for those who checked in at the airport a second factor was examined to see if passenger departure experience contributed to intercity differences. The measure of experience used was the passengers' frequent flyer status with Hikari Club, the airline's loyalty program. The airline offers the basic Bronze level with Silver and Gold levels attainable through a points system. Those without Hikari Club frequent flyer membership are shown by None. Table 15.9 presents the resulting 4 × 3 contingency table.

Table 15.9  Observed frequency of responses, cross-classifying frequent flyer status with city of departure

                               City
Frequent flyer status    Sydney    Singapore    Jakarta    Total
Bronze                      62         58           48       168
Silver                      22         15           29        66
Gold                         5          7           10        22
None                        73         75          103       251
Total                      162        155          190       507

In Table 15.9 observe that for the 507 passengers who checked in at the airport, 168 hold Hikari Club Bronze status, 66 hold Silver, 22 hold Gold and the remaining 251 are not Hikari Club members. As in Table 15.6 there were 162 passengers in Sydney, 155 in Singapore and 190 in Jakarta who checked in for their flights at the airport. The observed frequencies in the cells of the 4 × 3 contingency table represent the joint tallies of the sampled passengers and their frequent flyer status. The null and alternative hypotheses are:
H0: There is no relationship between the city and frequent flyer status of those who choose to check in at the airport.
H1: There is a relationship between the city and the frequent flyer status of those who choose to check in at the airport.
To test this null hypothesis of independence against the alternative, that there is a relationship between the two categorical variables, use Equation 15.1 to calculate the test statistic:

χ2 = Σ over all cells (fo − fe)2 / fe

where  fo = observed frequency in a particular cell of the r × c contingency table
       fe = expected frequency in a particular cell if the null hypothesis of independence is true

To calculate the expected frequency, fe, in any cell, use the multiplication rule for independent events (see Equation 4.7). For example, under the null hypothesis of independence, the probability of responses expected in the upper-left-corner cell, representing a Bronze status for Sydney, is the product of the two separate probabilities:
P(Bronze and Sydney) = P(Bronze) × P(Sydney)
Here, the proportion of passengers that hold Bronze status, P(Bronze), is 168/507 = 0.3314 and the proportion of passengers departing from Sydney, P(Sydney), is 162/507 = 0.3195. If the null hypothesis is true then the city and frequent flyer status are independent and P(Bronze and Sydney) is the product of the individual probabilities, 0.3314 × 0.3195 = 0.1059. The expected frequency is the product of the overall sample size n and this probability, 507 × 0.1059 = 53.68. The fe values for the remainder of a 4 × 3 contingency table are calculated in a similar manner (see Table 15.10).


Table 15.10  Expected frequencies for departure city cross-classified with frequent flyer status

                               City
Frequent flyer status    Sydney    Singapore    Jakarta    Total
Bronze                    53.68       51.36       62.96      168
Silver                    21.09       20.18       24.73       66
Gold                       7.03        6.73        8.24       22
None                      80.20       76.74       94.06      251
Total                    162.00      155.00      190.00      507

Equation 15.5 presents a simpler way to calculate expected frequencies.

CALCULATING THE EXPECTED FREQUENCIES
The expected frequency in a cell is the product of its row total and column total divided by the overall sample size.

fe = (row total × column total)/n   (15.5)

where  row total = sum of all the frequencies in the row
       column total = sum of all the frequencies in the column
       n = overall sample size

For example, using Equation 15.5 for the upper-left-corner cell (Bronze and Sydney):

fe = (row total × column total)/n = (168)(162)/507 = 53.68

and for the lower-right-corner cell (None and Jakarta):

fe = (row total × column total)/n = (251)(190)/507 = 94.06
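Equation 15.5 amounts to an outer product of the row totals and column totals divided by n. The following is a sketch (not from the text; assumes numpy is available) for the Table 15.9 data.

```python
import numpy as np

# Observed frequencies from Table 15.9 (rows: Bronze, Silver, Gold, None; columns: Sydney, Singapore, Jakarta)
observed = np.array([[62, 58,  48],
                     [22, 15,  29],
                     [ 5,  7,  10],
                     [73, 75, 103]])

row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
n = observed.sum()

# Equation 15.5: fe = (row total x column total) / n, for every cell at once
expected = np.outer(row_totals, col_totals) / n
print(np.round(expected, 2))   # matches Table 15.10, e.g. 53.68 for Bronze/Sydney
```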

Table 15.10 lists the entire set of fe values. To perform the test of independence, use the χ2 test statistic shown in Equation 15.1. Here the test statistic approximately follows a chi-square distribution with degrees of freedom equal to the number of rows in the contingency table minus 1 multiplied by the number of columns in the table minus 1. Thus, for an r × c contingency table:
Degrees of freedom = (r − 1)(c − 1)
Table 15.11 illustrates the calculations for the χ2 test statistic.

Table 15.11  Calculation of χ2 test statistic for the test of independence

Cell                  fo      fe       (fo − fe)    (fo − fe)2    (fo − fe)2/fe
Bronze/Sydney         62     53.68        8.32         69.215         1.289
Bronze/Singapore      58     51.36        6.64         44.077         0.858
Bronze/Jakarta        48     62.96      −14.96        223.759         3.554
Silver/Sydney         22     21.09        0.91          0.830         0.039
Silver/Singapore      15     20.18       −5.18         26.807         1.329
Silver/Jakarta        29     24.73        4.27         18.201         0.736
Gold/Sydney            5      7.03       −2.03          4.119         0.586
Gold/Singapore         7      6.73        0.27          0.075         0.011
Gold/Jakarta          10      8.24        1.76          3.082         0.374
None/Sydney           73     80.20       −7.20         51.857         0.647
None/Singapore        75     76.74       −1.74          3.013         0.039
None/Jakarta         103     94.06        8.94         79.868         0.849
                                                                     10.311


Using a level of significance of α = 0.05, the upper-tail critical value from the chi-square distribution with (4 − 1)(3 − 1) = 6 degrees of freedom is 12.592 (see Table E.4). Since the calculated test statistic χ2 = 10.311 < 12.592, you do not reject the null hypothesis of independence (see Figure 15.9). Similarly, you can use the p-value approach shown in the Microsoft Excel spreadsheet in Figure 15.10. Since the p-value = 0.112 > 0.05, you do not reject the null hypothesis of independence.

Figure 15.9  Regions of rejection and non-rejection when testing for independence in the airport check-in study at the 0.05 level of significance with 6 degrees of freedom (region of non-rejection below and region of rejection above the critical value of 12.592)

Figure 15.10  Microsoft Excel worksheet for the 4 × 3 contingency table for frequent flyer status by departure city (the worksheet shows the observed and expected frequencies together with the calculations: level of significance 0.05, 4 rows, 3 columns, 6 degrees of freedom, critical value 12.5916, chi-square test statistic 10.3113, p-value 0.1121; decision: do not reject the null hypothesis; the expected frequency assumption is met)
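As a cross-check on Table 15.11 and Figure 15.10 (a sketch, not the textbook's Excel method; assumes numpy and scipy are available):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed frequencies from Table 15.9 (frequent flyer status by departure city)
observed = np.array([[62, 58,  48],
                     [22, 15,  29],
                     [ 5,  7,  10],
                     [73, 75, 103]])

# correction=False: no continuity correction, matching the text's calculation
stat, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(round(stat, 3), dof, round(p_value, 4))   # 10.311, 6 degrees of freedom, p-value about 0.1121
```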


To ensure accurate results, all expected frequencies need to be large in order to use the χ2 test when dealing with r × c contingency tables. As in the case of the 2 × c contingency table on page 616, all expected frequencies should be at least 5. For cases in which one or more expected frequencies are less than 1, you can use the test after collapsing two or more low-frequency rows into one row (or collapsing two or more low-frequency columns into one column). Merging of rows or columns usually results in expected frequencies sufficiently large to conduct the χ2 test accurately.

Problems for Section 15.3

LEARNING THE BASICS
15.21 If a contingency table has five rows and six columns, how many degrees of freedom are there for the χ2 test for independence?
15.22 When performing a χ2 test for independence in a contingency table with r rows and c columns, determine the upper-tail critical value of the χ2 test statistic in each of the following circumstances.
a. α = 0.05, r = 4 rows, c = 5 columns
b. α = 0.10, r = 4 rows, c = 5 columns
c. α = 0.01, r = 5 rows, c = 4 columns
d. α = 0.05, r = 3 rows, c = 6 columns
e. α = 0.01, r = 2 rows, c = 5 columns

APPLYING THE CONCEPTS
You can solve problems 15.23 to 15.27 manually or by using Microsoft Excel.

15.23 In problem 15.16 on page 621, information on age group and preference for low-carbohydrate beer is presented. Test whether age group and low-carbohydrate beer are independent. (Use α = 0.05.) 15.24 A large corporation is interested in determining whether an association exists between the commuting time of its employees and the level of stress-related problems observed on the job. A study of 116 assembly-line workers reveals the following: Stress Commuting time (minutes) High Moderate Low Total Under 15  9  5 18  32 15–45 17  8 28  53 Over 45 18  6  7  31 Total 44 19 53 116

a. At the 0.01 level of significance, is there evidence of a significant relationship between commuting time and stress? b. What is your answer to (a) if you use the 0.05 level of significance? 15.25 In problem 15.20 on page 622, information on political party voting intention and country of origin is presented. Determine at the 0.01 level of significance whether there is a relationship between country of origin and voting intention. 15.26 A 2011 study by Anna Mieczakowski, Tanya Goldhaber and John Clarkson investigated communication preferences in four

countries. The responses for four communication methods have been cross-tabulated in the following table.

                              Country
Preferred method      UK     US     Australia    China
Face to face         824    509        667        545
Phone, talk           70    111        127        191
Phone, text           94    154         75         91
Email                118    120        142         20

Data obtained from A. Mieczakowski, T. Goldhaber and J. Clarkson, Culture, Communication and Change: Summary of an Investigation of the Use and Impact of Modern Media and Technology in our Lives, Engineering Design Centre, University of Cambridge, 2011



At the 0.01 level of significance, is there evidence of a significant relationship between country and type of preferred communication as shown in the study? 15.27 Managed funds provide the individual investor with a convenient medium for diversification. A managed fund spreads its investment over numerous securities, so that individuals who purchase a share in a managed fund are investing in a highly diversified portfolio, something they could not achieve on their own with limited funds. Managed funds tailor their diversification strategies to different investor groups. There are capital growth funds, current income funds, equity funds, bond funds and so on. The table below lists the level of return over five years for a sample of investors for various categories of managed funds. Managed fund type Maximum capital gain Long-term growth Growth and current income Balanced income Common stock

Level of return High Medium Low return return return 25 41 52 22 31 42 33 41 53 35 39 42 28 15 10

a. Determine at the 0.05 level of significance if there is evidence of a significant relationship between the level of return and the type of managed fund. b. What does this finding say about managed fund investment?





15.4  CHI-SQUARE GOODNESS-OF-FIT TESTS

LEARNING OBJECTIVE 4
Use the chi-square test to evaluate the goodness of fit of a set of data to a specific probability distribution

In this section, χ2 goodness-of-fit tests are used to determine how well a set of data matches a specific probability distribution. Goodness-of-fit tests compare the observed frequencies in a category with the frequencies that are theoretically expected if the data follow a specific probability distribution. In doing a χ2 goodness-of-fit test, you follow several steps. First, you determine the specific probability distribution to compare with the data. Second, you estimate (from a sample) or hypothesise the value of each parameter (such as the mean) of the selected probability distribution. Next, you determine the theoretical probability in each category using the selected probability distribution. Finally, you use a χ2 test statistic to test whether the selected distribution is a good fit to the data.

Chi-Square Goodness-of-Fit Test for a Poisson Distribution
Recall from Chapter 5 that we used the Poisson distribution to find the probability of a specific number of occurrences of an event per period of time. Let's now consider the number of arrivals per minute at a bank located in the central business district of a city. Suppose that you recorded the actual arrivals per minute in 200 one-minute periods over the course of a week. Table 15.12 summarises the results.

goodness-of-fit tests  Any test to determine how well a set of sample data matches a specific probability distribution.

Table 15.12  Frequency distribution of arrivals per minute
Arrivals   Frequency
0              14
1              31
2              47
3              41
4              29
5              21
6              10
7               5
8               2
Total         200

To determine whether the number of arrivals per minute follows a Poisson distribution, the null and alternative hypotheses are:
H0: The number of arrivals per minute follows a Poisson distribution.
H1: The number of arrivals per minute does not follow a Poisson distribution.
The Poisson distribution has one parameter, its mean λ, and you need to specify its value in the null and alternative hypotheses. You can use either a value based on past knowledge or a value estimated from sample data. In this example we estimate the mean, using the data in Table 15.12. Using Equation 3.16 on page 118 and the calculations in Table 15.13:

\bar{X} = \frac{\sum_{j=1}^{c} m_j f_j}{n} = \frac{580}{200} = 2.90

To find the probabilities from the tables of the Poisson distribution (Table E.7), you use 2.90 as an estimate of λ. You then calculate the expected frequency for each number of arrivals by multiplying the appropriate Poisson probability by the sample size n = 200. Table 15.14 summarises these results. Observe in Table 15.14 that the expected frequency of 9 or more arrivals is less than 1.0. In order to have all categories contain a frequency of 1.0 or greater, we need to combine the category 9 or more with the category of 8 arrivals.



Table 15.13  Calculation of the sample mean number of arrivals from the frequency distribution of arrivals per minute
Arrivals   Frequency fj   mj fj
0               14           0
1               31          31
2               47          94
3               41         123
4               29         116
5               21         105
6               10          60
7                5          35
8                2          16
Total          200         580

Table 15.14  Observed and expected frequencies of the arrivals per minute
Arrivals     Observed frequency fo   Probability P(X) for Poisson distribution with λ = 2.9   Expected frequency fe = n·P(X)
0                      14                         0.0550                                              11.00
1                      31                         0.1596                                              31.92
2                      47                         0.2314                                              46.28
3                      41                         0.2237                                              44.74
4                      29                         0.1622                                              32.44
5                      21                         0.0940                                              18.80
6                      10                         0.0455                                               9.10
7                       5                         0.0188                                               3.76
8                       2                         0.0068                                               1.36
9 or more               0                         0.0030                                               0.60
Total                 200                         1.0000                                             200.00

Equation 15.6 defines the chi-square test for determining whether the data follow a specific probability distribution.

CHI-SQUARE GOODNESS-OF-FIT TEST

\chi^2_{k-p-1} = \sum_{\text{all classes}} \frac{(f_o - f_e)^2}{f_e}   (15.6)

where
f_o = observed frequency
f_e = expected frequency
k = number of categories or classes remaining after combining classes
p = number of parameters estimated from the data

The test statistic χ2 follows a chi-square distribution with k − p − 1 degrees of freedom. Returning to the example concerning arrivals at the bank, nine categories remain (0, 1, 2, 3, 4, 5, 6, 7, 8 or more). As we estimated the mean of the Poisson distribution from the data, the number of degrees of freedom is k − p − 1 = 9 − 1 − 1 = 7. Using the 0.05 level of significance, from Table E.4 the critical value of χ2 with 7 degrees of freedom is 14.067. The decision rule is:
Reject H0 if χ2 > 14.067; otherwise, do not reject H0.
Table 15.15 shows the calculation of the χ2 statistic. Since χ2 = 2.28954 < 14.067, we do not reject H0. There is insufficient evidence to conclude that the arrivals per minute do not fit a Poisson distribution.




Table 15.15  Calculation of the χ2 test statistic for arrivals per minute
Arrivals     fo     fe      (fo – fe)   (fo – fe)2   (fo – fe)2/fe
0            14   11.00       3.00        9.0000       0.81818
1            31   31.92      –0.92        0.8464       0.02652
2            47   46.28       0.72        0.5184       0.01120
3            41   44.74      –3.74       13.9876       0.31264
4            29   32.44      –3.44       11.8336       0.36478
5            21   18.80       2.20        4.8400       0.25745
6            10    9.10       0.90        0.8100       0.08901
7             5    3.76       1.24        1.5376       0.40894
8 or more     2    1.96       0.04        0.0016       0.00082
                                                       2.28954
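If you prefer to check the arithmetic outside Excel, the following minimal sketch (an assumption, not part of the text) reproduces the Table 15.13 to 15.15 calculations in Python with scipy. Because scipy computes exact Poisson probabilities rather than the rounded Table E.7 values, the statistic comes out very close to, but not exactly, 2.28954.

```python
# Minimal sketch (assumed, not from the text): Poisson goodness-of-fit test
# for the bank-arrivals data in Table 15.12.
import numpy as np
from scipy import stats

arrivals = np.arange(9)                        # 0, 1, ..., 8
f_o = np.array([14, 31, 47, 41, 29, 21, 10, 5, 2])
n = f_o.sum()                                  # 200
lam = (arrivals * f_o).sum() / n               # sample mean = 580/200 = 2.90

# Expected frequencies for 0-7 arrivals plus a combined "8 or more" class
p = stats.poisson.pmf(np.arange(8), lam)
p = np.append(p, 1 - p.sum())                  # lump the upper tail into "8 or more"
f_e = n * p

chi_sq = ((f_o - f_e) ** 2 / f_e).sum()        # approximately 2.28
df = len(f_o) - 1 - 1                          # k - p - 1 = 9 - 1 - 1 = 7
critical = stats.chi2.ppf(0.95, df)            # approximately 14.067
print(chi_sq, critical, chi_sq > critical)     # chi_sq < critical: do not reject H0
```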

Chi-Square Goodness-of-Fit Test for a Normal Distribution
In Chapters 9 and 10, when testing hypotheses about numerical variables, we assumed that the underlying population was normally distributed. While the boxplot and the normal probability plot are useful for evaluating the validity of this assumption, when you have a large sample size you can also use the χ2 goodness-of-fit test. As an example of how you can use the χ2 goodness-of-fit test for a normal distribution, refer to Table 15.16 for the age of car purchasers in a district over the past two months.

Table 15.16  Frequency distribution of the age of car purchasers
Age                   Frequency
10 but less than 20        2
20 but less than 30        9
30 but less than 40       13
40 but less than 50       15
50 but less than 60        5
60 but less than 70        5
70 but less than 80        0
Total                     49

To test whether these ages follow a normal distribution, the null and alternative hypotheses are:
H0: The ages follow a normal distribution.
H1: The ages do not follow a normal distribution.
The normal distribution has two parameters, the mean μ and the standard deviation σ. You must specify the values of these parameters in the null and alternative hypotheses. You can use values based on past knowledge or values estimated from sample data. In this example, approximating the mean and standard deviation from the frequency distribution gives X̄ = 40.510 and S = 13.080. Table 15.16 uses class interval widths of 10 with class boundaries beginning at 10.0. Since the normal distribution is continuous, you need to determine the area in each class interval. In addition, since a normally distributed variable theoretically ranges from −∞ to +∞, you must account for the area beyond the class interval. Thus, the area below 10 is the area below the Z value:

Z = \frac{X - \bar{X}}{S} = \frac{10.0 - 40.510}{13.080} = -2.33

From Table E.2, the area below Z = −2.33 is approximately 0.0099.


To find the area between 10.0 and 20.0, calculate the area below 20.0 as follows:

Z = \frac{X - \bar{X}}{S} = \frac{20.0 - 40.510}{13.080} = -1.57

From Table E.2, the area below Z = −1.57 is approximately 0.0582. Thus, the area between 10.0 and 20.0 is the difference in the area below 20.0 and the area below 10.0, which is 0.0582 − 0.0099 = 0.0483. Continuing, to find the area between 20.0 and 30.0, calculate the area below 30.0 as follows:

Z = \frac{X - \bar{X}}{S} = \frac{30.0 - 40.510}{13.080} = -0.80

From Table E.2, the area below Z = −0.80 is approximately 0.2119. Thus, the area between 20.0 and 30.0 is the difference in the area below 30.0 and the area below 20.0, which is 0.2119 − 0.0582 = 0.1537. In a similar manner, you can calculate the area in each class interval. Table 15.17 summarises the complete set of calculations performed in Excel that are needed to find the area and expected frequency in each class.

Table 15.17  Calculation of the area and expected frequencies in each class interval for the ages of car purchasers
Classes           X      X – X̄     Z      Area below   Area in class, P(X)   fe = n·P(X)
Below 10         10     –30.51   –2.33      0.0098           0.0098              0.4819
10 but < 20      20     –20.51   –1.57      0.0584           0.0486              2.3814
20 but < 30      30     –10.51   –0.80      0.2108           0.1524              7.4677
30 but < 40      40      –0.51   –0.04      0.4844           0.2736             13.4069
40 but < 50      50       9.49    0.73      0.7659           0.2815             13.7930
50 but < 60      60      19.49    1.49      0.9319           0.1660              8.1319
60 but < 70      70      29.49    2.25      0.9879           0.0560              2.7452
70 but < 80      80      39.49    3.02      0.9987           0.0108              0.5298
80 or more        ∞        –        –       1.0000           0.0013              0.0621
Total                                                        1.0000             49.0000

Observe in Table 15.17 that the expected frequencies in the categories below 10.0, 70.0 to 80.0 and 80.0 or more are each less than 1.0. Since all categories need to have a frequency of 1.0 or greater, you combine the category below 10.0 with the category 10.0 to 20.0 and the categories 70.0 to 80.0 and 80 or more with the category 60.0 to 70.0. Using Equation 15.6, we now calculate the chi-square test statistic. In this example, after combining classes, 6 classes remain. As we used the sample mean and sample standard deviation to estimate the population mean and population standard deviation, the number of degrees of freedom is equal to k − p − 1 = 6 − 2 − 1 = 3. Using a level of significance of 0.05, the critical value of chi-square with 3 degrees of freedom is 7.815. Table 15.18 summarises the calculations for the chi-square test. Since χ2 = 2.7276 < 7.815, do not reject H0. There is insufficient evidence to conclude that the ages of car purchasers do not fit a normal distribution. Therefore, it is reasonable to assume that this set of data is normally distributed for the purposes of hypothesis testing.





Table 15.18  Calculating the χ2 test statistic for the ages of car purchasers
Classes        fo      fe       (fo – fe)   (fo – fe)2   (fo – fe)2/fe
Below 20        2    2.8633      –0.8633      0.7424        0.2603
20 but < 30     9    7.4677       1.5323      2.3479        0.3144
30 but < 40    13   13.4069      –0.4069      0.1656        0.0124
40 but < 50    15   13.7930       1.2070      1.4569        0.1056
50 but < 60     5    8.1319      –3.1319      9.8090        1.2062
60 or more      5    3.3371       1.6629      2.7652        0.8286
                                                            2.7276
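A similar sketch (again an assumption, not part of the text) reproduces the normal goodness-of-fit calculation, using the grouped-data estimates X̄ = 40.510 and S = 13.080 quoted above and the same combining of sparse end classes.

```python
# Minimal sketch (assumed, not from the text): normal goodness-of-fit test
# for the car-purchaser ages in Table 15.16.
import numpy as np
from scipy import stats

boundaries = np.array([10, 20, 30, 40, 50, 60, 70, 80])  # class boundaries
f_o = np.array([2, 9, 13, 15, 5, 5, 0])                   # observed counts, 10-20 ... 70-80
n = f_o.sum()                                             # 49
x_bar, s = 40.510, 13.080

# Areas in each class under the fitted normal, including the open-ended tails
cdf = stats.norm.cdf(boundaries, loc=x_bar, scale=s)
areas = np.concatenate(([cdf[0]], np.diff(cdf), [1 - cdf[-1]]))
f_e = n * areas                          # expected: below 10, 10-20, ..., 70-80, 80 or more

# Combine sparse end classes as in Table 15.18: "below 20" and "60 or more"
f_o_c = np.array([2, 9, 13, 15, 5, 5])
f_e_c = np.array([f_e[0] + f_e[1], f_e[2], f_e[3], f_e[4],
                  f_e[5], f_e[6] + f_e[7] + f_e[8]])

chi_sq = ((f_o_c - f_e_c) ** 2 / f_e_c).sum()   # approximately 2.73
df = len(f_o_c) - 2 - 1                          # k - p - 1 = 6 - 2 - 1 = 3
print(chi_sq, stats.chi2.ppf(0.95, df))          # critical value approximately 7.815
```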

Problems for Section 15.4

APPLYING THE CONCEPTS
15.28 The manager of a human resources department has collected the number of people absent through illness in the previous 12 months. The results are shown below.

Number of sick days   Number of employees
0                            123
1                            137
2                             95
3                             67
4                             32
5                             26
6                              4
Total                        484

Does the distribution of sick days follow a Poisson distribution? (Use the 0.01 level of significance.)
15.29 Referring to the data in problem 15.28, at the 0.01 level of significance does the distribution of sick days follow a Poisson distribution with a population mean of 1.5 sick days per year?
15.30 The manager of a commercial mortgage department of a large bank has collected data during the past two years concerning the number of commercial mortgages approved per week. The results from these two years (104 weeks) indicate the following:

Number of commercial mortgages approved   Frequency
0                                             13
1                                             25
2                                             32
3                                             17
4                                              9
5                                              6
6                                              1
7                                              1
Total                                        104

Does the distribution of commercial mortgages approved per week follow a Poisson distribution? (Use the 0.01 level of significance.)
15.31 A random sample of 500 car batteries revealed the following distribution of battery life (in years).

Life (in years)   Frequency
0–under 1             12
1–under 2             94
2–under 3            170
3–under 4            188
4–under 5             28
5–under 6              8
Total                500

For these data, X̄ = 2.80 and S = 0.97. At the 0.05 level of significance, does battery life follow a normal distribution?
15.32 A random sample of 500 long-distance telephone calls revealed the following distribution of call length (in minutes).

Length (in minutes)   Frequency
0–under 5                 48
5–under 10                84
10–under 15              164
15–under 20              126
20–under 25               50
25–under 30               28
Total                    500

a. Calculate the mean and standard deviation of this frequency distribution.
b. At the 0.05 level of significance, does call length follow a normal distribution?



15.5  CHI-SQUARE TEST FOR A VARIANCE OR STANDARD DEVIATION

LEARNING OBJECTIVE 5
Use the chi-square distribution to test for the population variance or standard deviation

When analysing numerical data, sometimes you need to draw conclusions about the population variance or standard deviation. For example, recall that in the pasta-packaging process described in Section 9.2, you assumed the population standard deviation σ was equal to 15 grams. To see if the variability of the process has changed, you need to test whether the standard deviation has changed from the previously specified level of 15 grams. Assuming that the data are normally distributed, Equation 15.7 gives the χ2 test statistic used to test whether or not the population variance or standard deviation is equal to a specified value.

χ2 TEST FOR THE VARIANCE OR STANDARD DEVIATION

\chi^2 = \frac{(n - 1)S^2}{\sigma^2}   (15.7)

where
n = sample size
S^2 = sample variance
σ^2 = hypothesised population variance

The test statistic χ2 follows a chi-square distribution with n − 1 degrees of freedom.
To apply the test of hypothesis, return to the pasta-packaging example. You are interested in determining whether the standard deviation has changed from the previously specified level of 15 grams. Thus, you use a two-tail test with the following null and alternative hypotheses:
H0: σ = 15 grams (or σ2 = 225 grams squared)
H1: σ ≠ 15 grams (or σ2 ≠ 225 grams squared)
If you select a sample of 25 pasta packets, you reject the null hypothesis if the χ2 test statistic falls into either the lower or upper tail of a chi-square distribution with 25 − 1 = 24 degrees of freedom, as shown in Figure 15.11. From Equation 15.7, observe that the χ2 test statistic falls into the lower tail of the chi-square distribution if the sample standard deviation (S) is sufficiently smaller than the hypothesised σ of 15 grams, and it falls into the upper tail if S is sufficiently larger than 15 grams. From Table 15.19 (extracted from Table E.4), if you select a level of significance of 0.05, the lower χ2_L and upper χ2_U critical values are 12.401 and 39.364 respectively. Therefore, the decision rule is:
Reject H0 if χ2 < χ2_L = 12.401 or if χ2 > χ2_U = 39.364; otherwise, do not reject H0.

Figure 15.11  Determining the lower and upper critical values of a chi-square distribution with 24 degrees of freedom corresponding to a 0.05 level of significance for a two-tail test of hypothesis about a population variance or standard deviation






Table 15.19  Finding the critical values corresponding to a 0.05 level of significance for a two-tail test from the chi-square distribution with 24 degrees of freedom (extracted from Table E.4 in Appendix E of this book)

Cumulative area:       0.005    0.01    0.025    0.05    0.10    0.90    0.95    0.975
Upper-tail area:       0.995    0.99    0.975    0.95    0.90    0.10    0.05    0.025
Degrees of freedom
 1                       –        –     0.001   0.004   0.016   2.706   3.841   5.024
 2                     0.010   0.020    0.051   0.103   0.211   4.605   5.991   7.378
 3                     0.072   0.115    0.216   0.352   0.584   6.251   7.815   9.348
 ⋮
23                     9.260  10.196   11.689  13.091  14.848  32.007  35.172  38.076
24                     9.886  10.856   12.401  13.848  15.659  33.196  36.415  39.364
25                    10.520  11.524   13.120  14.611  16.473  34.382  37.652  40.646

Suppose that in the sample of 25 pasta packets, the standard deviation, S, is 17.7 grams. Using Equation 15.7, the test statistic is:

\chi^2 = \frac{(n - 1)S^2}{\sigma^2} = \frac{(25 - 1)(17.7)^2}{(15)^2} = 33.42

Since χ2_L = 12.401 < χ2 = 33.42 < χ2_U = 39.364, or since the p-value = 0.0956 > 0.05 (see Figure 15.12), you do not reject H0. You conclude that there is insufficient evidence that the population standard deviation is different from 15 grams.
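The same calculation can also be scripted. The following minimal sketch (an assumption, not part of the text) reproduces the test statistic, the critical values and the upper-tail area reported in the Excel worksheet of Figure 15.12 below.

```python
# Minimal sketch (assumed, not from the text): two-tail chi-square test for a
# variance, Equation 15.7, applied to the pasta-packaging example
# (n = 25, S = 17.7 grams, hypothesised sigma = 15 grams, alpha = 0.05).
from scipy import stats

n, s, sigma0, alpha = 25, 17.7, 15.0, 0.05

chi_sq = (n - 1) * s**2 / sigma0**2            # (25 - 1)(17.7)^2 / 15^2, about 33.42
df = n - 1                                     # 24

lower = stats.chi2.ppf(alpha / 2, df)          # about 12.401
upper = stats.chi2.ppf(1 - alpha / 2, df)      # about 39.364
upper_tail = stats.chi2.sf(chi_sq, df)         # about 0.0956, as reported in Figure 15.12

reject = (chi_sq < lower) or (chi_sq > upper)  # False here: do not reject H0
print(chi_sq, lower, upper, upper_tail, reject)
```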

Figure 15.12  Microsoft Excel 2016 worksheet for testing the variance in the pasta-packaging process

Row   Entry                                  Value      Formula
1     Pasta-packaging analysis
3     Data
4     Null hypothesis σ^2 =                  225
5     Level of significance                  0.05
6     Sample size                            25
7     Sample standard deviation              17.7
9     Intermediate calculations
10    Degrees of freedom                     24         =B6 - 1
11    Half area                              0.025      =B5/2
12    Chi-square statistic                   33.4176    =B10 * B7^2/B4
14    Two-tail test
15    Lower critical value                   12.4012    =CHISQ.INV(B11, B10)
16    Upper critical value                   39.3641    =CHISQ.INV.RT(B11, B10)
17    p-value                                0.0956     =IF(B12 - B15 < 0, 1 - CHISQ.DIST.RT(B12, B10), CHISQ.DIST.RT(B12, B10))
18    Do not reject the null hypothesis

Problems for Section 15.5
a. At the 0.05 level of significance, is there evidence that the population standard deviation differs from 5 g?
b. What assumption do you need to make in order to perform this test?
c. Calculate the p-value in part (a) and interpret its meaning.
A manufacturer claims that the standard deviation in capacity of a certain type of battery the company produces is 2.5 ampere-hours. An independent consumer protection agency wishes to test the credibility of the manufacturer's claim and measures the capacity of a random sample of 20 batteries from a recently produced batch. The results are in the file .
a. At the 0.05 level of significance, is there evidence that the population standard deviation in battery capacity is greater than 2.5 ampere-hours?
b. What assumption do you need to make in order to perform this test?
c. Calculate the p-value in part (a) and interpret its meaning.

Assess your progress

Summary
This chapter examined a number of non-parametric chi-square tests. In the first chi-square section we looked at the case of two samples to see whether they were different, and then considered the case of more than two samples. Both tests looked at differences in proportions. For the multiple-sample chi-square test, we used the Marascuilo procedure to determine which of the multiple samples differed. In examining these differences, we used the example of airline check-in analysis. This example was then extended with the chi-square test of independence, which evaluates whether two categorical variables are related. We looked further at probability distributions by using chi-square goodness-of-fit tests to determine whether a set of data matched a specific probability distribution, including the Poisson and normal distributions. Finally, we used the chi-square distribution to perform tests about the population variance or standard deviation.

Key formulas

χ2 test for the difference between two proportions
\chi^2 = \sum_{\text{all cells}} \frac{(f_o - f_e)^2}{f_e}   (15.1)

Calculating the estimated overall proportion
\bar{p} = \frac{X_1 + X_2}{n_1 + n_2} = \frac{X}{n}   (15.2)

Calculating the estimated overall proportion for c groups
\bar{p} = \frac{X_1 + X_2 + \dots + X_c}{n_1 + n_2 + \dots + n_c} = \frac{X}{n}   (15.3)

Critical range for the Marascuilo procedure
\text{Critical range} = \sqrt{\chi^2_U}\,\sqrt{\frac{p_j(1 - p_j)}{n_j} + \frac{p_{j'}(1 - p_{j'})}{n_{j'}}}   (15.4)

Calculating the expected frequencies
f_e = \frac{\text{row total} \times \text{column total}}{n}   (15.5)

Chi-square goodness-of-fit test
\chi^2_{k-p-1} = \sum_{\text{all classes}} \frac{(f_o - f_e)^2}{f_e}   (15.6)

χ2 test for the variance or standard deviation
\chi^2 = \frac{(n - 1)S^2}{\sigma^2}   (15.7)



Key terms
chi-square (χ2) distribution  610
chi-square test of independence  622
contingency (cross-classification) table – probability  608
cross-classification table – chi square  608
expected frequency  609
goodness-of-fit tests  627
Marascuilo procedure  619
observed frequency  609

References
1. Conover, W. J., Practical Nonparametric Statistics, 3rd edn (New York: John Wiley, 2000).
2. Daniel, W. W., Applied Nonparametric Statistics, 2nd edn (Boston: PWS Kent, 1990).
3. Hollander, M. & D. A. Wolfe, Nonparametric Statistical Methods (New York: John Wiley & Sons, 1973).
4. Dixon, W. J. & F. J. Massey Jr, Introduction to Statistical Analysis, 4th edn (New York: McGraw-Hill, 1983).
5. Marascuilo, L. A. & M. McSweeney, Nonparametric and Distribution-Free Methods for the Social Sciences (Monterey, CA: Brooks/Cole, 1977).

Chapter review problems

CHECKING YOUR UNDERSTANDING
15.47 Under what conditions should you use the χ2 test rather than the Z test to examine possible differences in the proportions of two independent populations?
15.48 Provide examples of scenarios where you could utilise the χ2 test to examine possible differences in the proportions of more than two independent populations.
15.49 Under what conditions should you use the χ2 test of independence?

APPLYING THE CONCEPTS
15.50 Service suppliers tend to cluster close together to minimise competition; so, if you are in an Australian city looking for a Westpac bank and you find an ANZ bank, the one you are actually looking for is likely to be nearby. Major supermarket chains are often near each other, specialist doctors locate in the same street and takeaway brand shops locate near each other. Thus, there should be no difference in the perception of people about the ease of accessing different main brand service suppliers. Assume that a sample of 318 people has been surveyed on how easy it is to access three banks.

                                    Bank
Ease of access           Commonwealth   ANZ   Westpac   Total
Easy to find                   77        70      65       212
Difficult to find              22        30      25        77
Very difficult to find         11        10       8        29

a. Using a level of significance of 0.05, is there evidence of a significant difference between the public perception of the ease of access to these three banks?
b. Interpret the result in the light of the comments made above about service providers locating in clusters.
15.51 A female telemarketer is promoting a new espresso coffee machine. She wishes to analyse the buying intentions of males and females. The results are contained in the contingency table below.

                       Gender
Intend to buy     Male   Female   Total
Yes                122      91      213
No                  79      49      128

a. Is there a significant difference between males and females regarding their intent to purchase the product? Use the 0.05 level of significance.
b. Determine the p-value in (a) and interpret its meaning.
15.52 A company that rents campervans in New Zealand is checking whether the number of days campervans are rented is lower in the South Island than in the North Island. Campervans from this company are not permitted to travel between islands by ferry so no hires are for both islands. The company embarks on a study to determine the reasons for variations in hire length. First, it is concerned that in the South Island the weather is colder than in the North Island and, thus, the number of days of campervan use may be affected. The figures for number of days of use from a survey of 250 hire contracts are shown in the contingency table below.

                            Island
Number of days of use   North   South   Total
Less than 10               75      86     161
10 or more                 44      45      89

a. Use a technique from this chapter with a 5% level of significance to determine whether there is a difference in the number of days of use between the North and South islands.
b. Determine the p-value in (a) and interpret its meaning.
15.53 A car hire company that operates in Australia and New Zealand is conducting a survey of the prevalence of damage to the





windscreens of cars returned from hire in both countries. The results are contained in the following table.

                            Country
Windscreen damage   Australia   New Zealand   Total
Yes                      23          32          55
No                      387         448         835
Total                   410         480         890

At the 0.05 level of significance, is there evidence of a difference in the proportion of cars returned with windscreen damage between Australia and New Zealand?
15.54 Government researchers are interested in the public's attitude to supporting an increase in the age pension. The researchers gather information from three age groups, the results of which are shown in the following table.

                                       Age group
Support increase in age pension   18–34   35–54   >54   Total
Yes                                 34      45     55     134
No                                  78      51     24     153

a. At the 0.05 level of significance, is there a significant difference in the proportion of support between age groups?
b. Find the p-value in (a) and interpret its meaning.
c. If appropriate, use the Marascuilo procedure and a 0.05 level of significance to determine which age groups are different.
15.55 A company is considering an organisational change by adopting the use of self-managed work teams. To assess the attitudes of company employees towards this change, a sample of 400 employees is selected and asked whether they favour the institution of self-managed work teams in the organisation. Three responses were permitted: favour, neutral or oppose. The results of the survey, cross-classified by type of job and attitude towards self-managed work teams, are summarised as follows:

                        Attitude towards self-managed work teams
Type of job           Favour   Neutral   Oppose   Total
Hourly worker           108       46        71      225
Supervisor               18       12        30       60
Middle management        35       14        26       75
Upper management         24        7         9       40
Total                   185       79       136      400

a. At the 0.05 level of significance, is there evidence of a relationship between attitude towards self-managed work teams and type of job?
The survey also asked respondents about their attitudes towards instituting a policy whereby an employee could take one additional day off per month without pay. The results, cross-classified by type of job, are as follows:

                        Attitude towards time off without pay
Type of job           Favour   Neutral   Oppose   Total
Hourly worker           135       23        67      225
Supervisor               39        7        14       60
Middle management        47        6        22       75
Upper management         26        6         8       40
Total                   247       42       111      400

b. At the 0.05 level of significance, is there evidence of a relationship between attitude towards time off without pay and type of job?
15.56 A survey conducted in China, India and Australia of 283 school teachers in the secondary-level teaching system asked several questions about attitudes towards teaching a business curriculum. The first question asked, 'Should economics courses place an emphasis on understanding the banking system?' The results are tabulated below.

                               Country
Emphasis on banking?   China   India   Australia   Total
Yes                      34      45        48        127
No                       55      50        51        156

a. At the 0.05 level of significance, is there evidence of a difference in the emphasis on an understanding of the banking system between these three countries?
b. If appropriate, use the Marascuilo procedure and α = 0.05 to determine which countries differ.
c. Calculate the p-value and interpret its meaning.
The second survey question was, 'What percentage emphasis should be placed on corporate accounting?' The results are shown below.

                                Country
Corporate accounting?   China   India   Australia   Total
Less than 20%             29      35        45        109
21–40%                    50      47        41        138
Greater than 40%          10      13        13         36

d. At the 0.05 level of significance, is there evidence of a relationship between country and the level of corporate accounting desired?
e. Calculate the p-value and interpret its meaning.
The third question was, 'Should comparative market systems be a significant component in an economics senior subject?' The results are shown below.

                           Country
Market systems?   China   India   Australia   Total
Yes                 59      50        45        154
No                  30      45        54        129



f. Is there evidence of a difference between the countries on whether comparative market systems should be a significant part of an economics senior subject? Use the 0.025 level of significance.
g. If appropriate, use the Marascuilo procedure and α = 0.05 to determine which countries have different approaches.
15.57 The provision of a swimming pool in a hotel could be expected to depend on factors such as the age and location of the building, level of luxury and the climate. The Trivago.com.au hotel comparison website allows filtering which enables a quick count of pools in the hotels in different cities. The data for four cities in different countries is shown below.

                       Accommodation
City          Pool   No pool    Total
Barcelona      130     5,887    6,017
Lyon            10       912      922
Melbourne      143     1,685    1,828
London          79     7,644    7,723
Total          362    16,128   16,490
Data obtained from Trivago accessed 21 August 2017

a. Using a 0.01 level of significance, is there a significant difference between these cities in the availability of a swimming pool in hotels?
b. Is it possible to tell from part (a) whether the availability of hotel swimming pools in particular cities differs?
15.58 A market researcher was interested in studying the effect of advertisements on the brand preference of new car buyers. Assume that prospective purchasers of new cars were first asked whether they preferred Toyota or Mazda and then shown video advertisements of comparable models from the two manufacturers. After viewing the ads, the prospective customers again indicated their preference. The results are summarised in the following tables.

                            Preference after ads
Preference before ads   Toyota   Mazda   Total
Toyota                      79      13      92
Mazda                       11      97     108
Total                       90     110     200

a. Is there evidence of a significant difference in the proportion of respondents who prefer Toyota before and after viewing the ads? (Use α = 0.05.)
b. Calculate the p-value and interpret its meaning.
The following table was derived from the table above.

                Preference
             Toyota   Mazda   Total
Before ad        92     108     200
After ad         90     110     200
Total           182     218     400

c. Show how this table is derived from the table above.
d. Using the second table, is there evidence of a significant difference in preference for Toyota before and after viewing the ads? (Use α = 0.05.)
e. Calculate the p-value and interpret its meaning.
f. Explain the difference in the results of (a) and (d). Which method of analysing the data should you use? Why?
15.59 An online streaming company is analysing the relationship between income levels and subscriber download choices – TV shows or movies. A sample of 450 subscribers revealed the following information.

                              Income level
Downloads/week   High income   Low income   Total
TV episodes            352          769      1,121
Movies                 198          498        696
Total                  550        1,267      1,817

a. Is there evidence of a relationship between income level and download choice? (Use α = 0.10.)
b. Calculate the p-value and interpret its meaning.
15.60 Assume that the current Queensland Government employment strategy has a fundamental concept that men and women should be treated equally and have the same opportunities in the workplace. As part of the Department of Education and Training program to implement this concept, an initial analysis is undertaken of the workplace to identify areas where male and female situations differ. From this initial determination, a direction towards more in-depth analysis is intended. The following table of data was compiled by researchers on the basis of a series of sample surveys.

Employment characteristics                        Males   Females
Duration of current job (%)
  Under one year                                   20.7     31.2
  1 year and under 2 years                          9.9     15.3
  2 years and under 3 years                         9.5     13.0
  3 years and under 4 years                         7.4      9.6
  4 years and under 5 years                         5.7      6.0
  5 years and under 10 years                       18.8     15.6
  10 years and under 15 years                      11.5      4.8
  15 years and under 20 years                       6.0      2.3
  Over 20 years                                    10.5      2.2
Multiple job holders (%)
  Multiple job                                      3.3      2.5
  Single job                                       96.7     97.5
Persons looking for work ('000): difficulties reported
  Too young or too old                             12.6     20.1
  Unsuitable hours                                  3.8      5.1
  Transport problems                                9.6      3.2
  Lacked training/education                        13.6      2.2
  Insufficient work experience                     15.4     18.0
  No vacancies in line of work                     19.7     25.8
  No vacancies at all                              24.5     25.6
Duration of unemployment ('000)
  1–4 weeks                                        11.5     12.9
  5–13 weeks                                        7.5     17.5
  14–26 weeks                                       9.5     23.2
Number of periods looking for work ('000)
  One                                             118.0    112.5
  Two                                              10.4      8.0
  Three or more                                     7.3      7.2
Birthplace of unemployed ('000)
  Born in Australia                                34.9     34.6
  Born outside Australia                           15.5     15.2

To determine whether males and females have different opportunities in the workplace, an attempt was made to use χ2 analysis. To determine relative strength for each analysis, where the null hypothesis of no difference between sexes was rejected, the highest critical value of rejection was also determined. That is, if rejection of the null hypothesis could be achieved only at the 0.05 level of significance, that was considered less important than rejection at the 0.01 level of significance.

a. Test the hypothesis that the duration of the current job is independent of sex.
b. Test the hypothesis that whether a person has more than one job is independent of sex.
c. Test the hypothesis that the difficulties reported in looking for work are independent of sex.
d. Test the hypothesis that the duration of unemployment is independent of sex.
e. Test the hypothesis that birthplace of the unemployed is independent of sex.
f. Rank the order of importance of each problem area analysed. Note that, in the ranking, acceptance of the null hypothesis is of equal lowest rank importance.
15.61 What social media tools do males and females commonly use? The Sensis Social Media Report 2017 surveyed the percentage of males and females who use the indicated social media tools. A total of 1,178 people were surveyed. Assume that exactly half were males and the other half females. The results are shown in the following table:

Social networking sites used   Male   Female
Facebook                        91%     97%
Instagram                       50%     41%
LinkedIn                        22%     14%
Twitter                         35%     28%
Source: Reproduced with the permission of Sensis Pty Ltd

For each social media tool, at the 0.05 level of significance, determine whether there is a difference between males and females in the proportion who used each social media tool.

TEAM PROJECT
The data file  contains information regarding 21 variables from a sample of 48 investment funds listed on the ASX as at June 2014, as shown in ASX Funds (data obtained from ASX Spotlight on Listed Investment & Absolute Return Funds, Monthly Update, June 2014, p. 4, ). The variables are:
Code – ASX code for the fund
Fund name – the name of the investment fund
Type – Shares or units
Category – type of shares comprising the fund – Australian or International
MER – Managed expense ratio % p.a.
Outperf fee – outperformance fee
Mkt cap ($m) – market capitalisation in $million
Mkt cap ($m) change – change in market capitalisation over the period
Traded value – value of shares/units traded
Traded volume – number of shares/units traded
Number of trades
Monthly liquidity
Prem/disc – premium discount
Last price – price of last trade
Year high – highest price during the previous year
Year low – lowest price during the previous year
Hist. dist yield – historical distribution yield
1 mth total return – one-month total return
1 yr total return – 12-month total return
3 yr total return (ann.) – annualised total return over 3 years
5 yr total return (ann.) – annualised total return over 5 years

15.62 a. Construct a 2 × 2 contingency table using category as the row variable and sign of one-month total return (either negative or non-negative) as the column variable.
b. At the 0.05 level of significance, is there evidence of a significant relationship between the category of the investment fund and whether the one-month return was negative or not?
15.63 a. Classify market capitalisation as high if at least $500m, low if less than $50m and medium otherwise. Ignoring missing data, construct a 2 × 3 contingency table using outperformance fees as the row variable and market capitalisation classification as the column variable.
b. At the 0.05 level of significance, is there evidence of a significant relationship between a fund's market capitalisation class and whether it has outperformance fees?
15.64 a. Classify the traded volume as high if it is at least 1 million and low if less than 1 million. Construct a 3 × 2 contingency table using market capitalisation classification (as in problem 15.63) as the row variable and traded volume as the column variable.
b. At the 0.05 level of significance, is there evidence of a significant relationship between the market capitalisation of an investment fund and its traded volume?
15.65 Considering the funds for which data is available, does the three-year return fit a normal distribution (α = 0.05)?
15.66 Considering the funds for which data is available, does the five-year return fit a normal distribution (α = 0.05)?



Continuing cases

Tasman University
The Business Faculty at Tasman University (TU) has decided to gather data about the undergraduate students in the BBus program. It creates a survey of 14 questions and receives responses from 62 undergraduates, which it stores in < TASMAN_UNIVERSITY_BBUS_STUDENT_SURVEY >.
a  Construct contingency tables using gender, major, plans to go to graduate school and employment status. (You need to construct six tables, taking two variables at a time.) Analyse the data at the 0.05 level of significance to determine whether any significant relationships exist among these variables.
b  At the 0.05 level of significance, is there evidence of a difference between males and females in the number of text messages sent in a week using the categories 0–299, 300–599 and 600 and over?
c  Can it be concluded that undergraduate students' weighted average marks (WAMs) are normally distributed? Perform a chi-square goodness-of-fit test at the 0.05 level of significance.
The Business Faculty decides to now investigate the postgraduate students and undertakes a similar survey for them at TU. It creates a survey of 14 questions and receives responses from 44 graduate students, which are stored in < TASMAN_UNIVERSITY_MBA_STUDENT_SURVEY >. For these data, at the 0.05 level of significance:
d  Construct contingency tables using gender, undergraduate major, graduate major, and employment status. (You need to construct six tables, taking two variables at a time.) Analyse the data to determine whether any significant relationships exist among these variables.
e  Repeat part (b) for the postgraduate students.
f  Repeat part (c) for the postgraduate students.

As Safe as Houses
To analyse the real estate market in non-capital cities and towns in states A and B, Safe-As-Houses Real Estate, a large national real estate company, has collected samples of recent residential sales from a sample of non-capital cities and towns in these states. The data are stored in < REAL_ESTATE >. A saleswoman at Safe-As-Houses Real Estate is interested in whether the internal size of properties is different for houses and units.
a  For each city in state A, work out the proportion of properties that have internal areas of less than 120 square metres and those that have internal areas of at least 120 square metres. Then, for each city, develop a two-way classification table using the property type (house or unit) as the other variable. For each city, carry out a test for the difference between two proportions using a 5% level of significance.
b  Repeat the tests for each city in state B but use a 10% level of significance.





Chapter 15 Excel Guide

EG15.1 FOR χ2 TEST FOR THE DIFFERENCE BETWEEN TWO PROPORTIONS
Open the 2 × 2 worksheet in the Chi_Square workbook. This worksheet already contains the entries for the Section 15.1 airline check-in example. The worksheet uses the CHISQ.INV.RT and CHISQ.DIST.RT functions (see Appendix D.21 for more information). To adapt this worksheet to other problems, change the tinted labels and values in the observed frequencies table in rows 4 to 7 and the level of significance value in tinted cell B18.
OR
See Appendix D.21 (Chi-square Test for the Differences in Two Proportions) if you want PHStat to produce a worksheet for you.

EG15.2 FOR χ2 TEST FOR THE DIFFERENCES IN MORE THAN TWO PROPORTIONS AND FOR THE χ2 TEST OF INDEPENDENCE
Open the Chi_Square workbook to the worksheet that contains the appropriate observed frequency table for your problem. For the Section 15.2 online check-in example that requires two rows and three columns, open to the 2 × 3 worksheet of the Chi_Square workbook. For the Section 15.3 airport check-in example that requires four rows and three columns, open to the 4 × 3 worksheet of the Chi_Square workbook. To perform calculations for other problems, change the data in the observed frequency table and, optionally, the level of significance value in the tinted cell in column B. (#DIV/0! messages in individual cells will disappear after you enter the observed frequencies.) These worksheets use the CHISQ.INV.RT and CHISQ.DIST.RT functions (see Appendix D.22 for more information).
OR
See Appendix D.22 (Chi-square Test) if you want PHStat to produce a worksheet for you.
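If you are not working in Excel or PHStat, the test of independence can also be run in Python. The following minimal sketch (an assumption, not part of the Excel Guide) uses scipy's chi2_contingency on the commuting time and stress table from problem 15.24.

```python
# Not part of the Excel Guide: a minimal Python sketch (assumed) of the
# chi-square test of independence on an observed-frequency table.
import numpy as np
from scipy import stats

observed = np.array([[ 9,  5, 18],    # Under 15 minutes
                     [17,  8, 28],    # 15-45 minutes
                     [18,  6,  7]])   # Over 45 minutes

chi_sq, p_value, df, expected = stats.chi2_contingency(observed)
print(chi_sq, p_value, df)            # df = (3 - 1)(3 - 1) = 4
print(expected)                       # expected frequencies under independence
```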

Microsoft® product screen shots are reprinted with permission from Microsoft Corporation.



End of Part 4 problems

D.1 One of the most famous concepts in economics is the Phillips Curve, which suggests there is an inverse relationship between inflation and unemployment in an economy. Malaysian data is contained in < PHILLIPS_CURVE > for 2004 to 2014.
a. Use the most appropriate method to measure the relationship between the two variables.
b. Interpret the coefficients.
c. Predict the inflation rate when unemployment is 3.5%.
d. Why would it not be appropriate to use the model to predict inflation when unemployment is 10%? Explain.
e. Determine the coefficient of determination, r2, and interpret its meaning.
D.2 In problem D.1, an attempt is made to predict inflation on the basis of the unemployment rate.
a. Construct a 95% confidence interval estimate of the mean inflation rate when the unemployment rate is 3.5%.
b. Construct a 95% prediction interval of the inflation rate when the unemployment rate is 3.5%.
c. Explain the difference in the results in (a) and (b).
D.3 The tourism industry is exploring the link between exchange rates and overseas visitors. The following quarterly data were collected over a five-year period.

Quarter     Overseas arrivals   Exchange rate
Sep 2006         23,450             102.5
Dec 2006         36,123             100.3
Mar 2007         56,045              94.3
Jun 2007         45,674              98.7
Sep 2007         23,158             105.9
Dec 2007         65,784              89.7
Mar 2008         23,986             105.8
Jun 2008         78,647              76.8
Sep 2008         45,238              95.7
Dec 2008         73,452              82.4
Mar 2009         76,234              85.7
Jun 2009         82,690              72.8
Sep 2009         45,387              97.8
Dec 2009         29,849             105.8
Mar 2010         45,676              96.5
Jun 2010         24,785             107.9
Sep 2010         56,864              86.7
Dec 2010         25,896             109.8
Mar 2011         68,075              90.8
Jun 2011         46,896              96.5

a. Use the most appropriate method to measure the relationship between the two variables.
b. Predict the mean overseas arrivals when the exchange rate is 100.
c. Plot the residuals against the time period.
d. At the 0.05 level of significance, is there evidence of positive autocorrelation between the residuals?
e. Based on the results of (c) and (d), is there reason to question the validity of the model?
D.4 Management of a soft-drink bottling company wants to develop a method for allocating delivery costs to customers. Although one cost clearly relates to travel time within a particular route, another variable cost reflects the time required to unload the cases of soft drink at the delivery point. A sample of 20 deliveries within a territory was selected. The delivery time and the number of cases delivered were recorded.

Customer   Number of cases   Delivery time (minutes)
 1               52                  32.1
 2               64                  34.8
 3               73                  36.2
 4               85                  37.8
 5               95                  37.8
 6              103                  39.7
 7              116                  38.5
 8              121                  41.9
 9              143                  44.2
10              157                  47.1
11              161                  43.0
12              184                  49.4
13              202                  57.2
14              218                  56.8
15              243                  60.6
16              254                  61.2
17              267                  58.2
18              275                  63.1
19              287                  65.6
20              298                  67.3

Develop a regression model to predict delivery time based on the number of cases delivered.
a. Use the least-squares method to calculate the regression coefficients, b0 and b1.
b. Interpret the meaning of b0 and b1 in this problem.
c. Predict the delivery time for 150 cases of soft drink.
d. Would it be appropriate to use the model to predict the delivery time for a customer who is receiving 500 cases of soft drink? Explain.
e. Determine the coefficient of determination, r2, and explain its meaning in this problem.
f. Perform a residual analysis. Is there any evidence of a pattern in the residuals? Explain.
g. At the 0.05 level of significance, is there evidence of a linear relationship between delivery time and the number of cases delivered?
h. Construct a 95% confidence interval estimate of the mean delivery time for 150 cases of soft drink.
i. Construct a 95% prediction interval of the delivery time for a single delivery of 150 cases of soft drink.
j. Construct a 95% confidence interval estimate of the population slope.
k. Explain how the results in (a) to (j) can help allocate delivery costs to customers.





D.5 An educational designer is studying the relationship between student performance and the provision of online learning content. She collects data on student grades and total time spent viewing online lecture content.

Student   Grade (/100)   Online content (minutes)
 1             38                  36
 2             53                 125
 3              7                  12
 4             33                  50
 5             27                  45
 6             35                  67
 7             70                 234
 8             47                  87
 9             66                 112
10              0                   0
11             20                  32
12             42                  67
13             51                 128
14             17                  63
15             57                 187

a. Use the least-squares method to calculate the regression coefficients, b0 and b1.
b. Interpret the meaning of the model parameters.
c. Predict the grades of a student who views 100 minutes of online content.
d. Is it appropriate to use the model to predict the marks for students viewing 500 minutes of content? Explain.
e. Calculate the coefficient of determination, r2, and explain its meaning in this problem.
f. Plot the residuals against time online. Is there any evidence of a pattern in the residuals? Explain.
g. Based on the results of (f), is there reason to question the validity of the model? Explain.
h. At the 0.05 level of significance, is there evidence of a linear relationship between student grades and online viewing?
i. Construct a 95% confidence interval estimate of the mean grade for a student with 85 minutes of online viewing.
j. Construct a 95% prediction interval of the grade for a student with 85 minutes of online viewing.
k. Construct a 95% confidence interval estimate of the population slope.
D.6 A fundamental principle in total quality management is the positive relationship between employee satisfaction and customer satisfaction. A business consultant conducts a survey of 15 businesses and measures an index of employee and customer satisfaction.

Customer satisfaction index   Employee satisfaction index
            94                            76
            54                            48
            41                            56
            23                            12
            34                            56
            62                            78
            22                            23
            31                            45
             9                            40
            88                            98
            47                            55
            20                            34
            42                            23
            90                            88
            77                            78

a. Use the most appropriate method to measure the relationship between the two variables.
b. Calculate the coefficient of determination and interpret its meaning.
c. Predict the customer satisfaction index for a company with an employee satisfaction index of 75.
d. Perform a residual analysis. What conclusions do you reach?
D.7 In a phone survey of 200 people conducted in each of five states in Australia the following percentages of respondents said they were searching for a new job:
New South Wales 5.5%
Queensland 7.5%
Tasmania 7%
Victoria 6.5%
Western Australia 8%
a. At the 0.05 level of significance, determine whether there is a significant difference in the proportion of people who are seeking new employment.
b. Find the p-value in (a) and interpret its meaning.
c. If appropriate, use the Marascuilo procedure and a 0.05 level of significance to determine which states are different.
D.8 A sample of 300 members from four political parties in Australia are surveyed to determine their preference for the head of government in Australia:

                 Royalty   President   Prime Minister   Chancellor
Greens               7          28            33             9
Liberal Party       28          15            20            10
Labor Party         22          14            25            12
National Party      23          24            22             8

At the 0.05 level of significance, is there evidence of a significant relationship between political party and preferred head of state?
D.9 A large produce distributor is analysing the supply of lamb in Western Australia. She hypothesises that the supply of lamb would be a function of lamb feed price and the supply of beef in the market.



Year   Supply of lamb ('000 kg)   Price of feed ($/kg)   Supply of beef ('000 kg)
1995          6,589                       2                     8,906
1996          6,789                       3                     9,065
1997          7,852                       4                     7,890
1998          8,645                       5                     9,867
1999          7,853                       5                     7,865
2000          5,490                       4                     5,679
2001          6,540                       6                     4,589
2002          5,703                       7                     8,759
2003          4,590                       8                     7,853
2004          6,780                       8                     6,953
2005          5,289                       6                     6,489
2006          5,987                       7                     5,649
2007          7,890                       4                     8,963
2008          8,093                       9                     7,639

a. Predict the quantity of lamb supplied in Western Australia if the price of feed is $5/kg and the supply of beef is 7,800 kg, and construct a 95% confidence interval estimate and a 95% prediction interval.
b. Perform a residual analysis and determine the adequacy of the model.
c. Is there a significant relationship between the supply of lamb and the two independent variables at the 0.05 level of significance?
d. Construct 95% confidence interval estimates of the population slope for the relationship between lamb supply and feed price, and between lamb supply and beef supply.
e. Calculate the adjusted R2.
f. Calculate the coefficients of partial determination and interpret their meaning.
D.10 A farmer who specialises in the production of carpet wool where the sheep are shorn twice per year is seeking a 75-mm-length clip from his Tukidale sheep. He believes that the proportion of sheep at each clip meeting this standard varies according to average rainfall during the six-month growing period and whether additional hand feeding of high-protein sheep nuts occurs during the period (because of a shortage of grass cover in the paddocks). Hand feeding is measured as 1 and no hand feeding as 0.

Proportion at 75 mm (%)   Average rainfall   Hand feeding
          67                    100                1
          75                    150                0
          80                    148                0
          72                     70                1
          91                    210                0
          69                    120                1
          55                     50                1
          77                    167                0
          84                    230                0
          92                    189                0
          58                     40                1
          69                     93                1
          74                    133                0
          72                     80                1
          66                    108                1

a. Predict the proportion at 75 mm if the rainfall is 180 mm and there is no hand feeding, and construct a 95% confidence interval estimate and 95% prediction interval.
b. Perform a residual analysis and determine the adequacy of the model.
c. Is there a significant relationship between the clip length proportion and the two independent variables at the 0.05 level of significance?
d. Construct 95% confidence interval estimates of the population slope for the relationship between clip proportion and rainfall, and between clip proportion and hand feeding.
e. Interpret the meaning of the coefficient of multiple determination.
f. Calculate the adjusted R2.
g. Calculate the coefficients of partial determination and interpret their meaning.
h. Add an interaction term to the model and, at the 0.05 level of significance, determine whether it makes a significant contribution to the model.
D.11 If the coefficient of determination between two independent variables is 0.75, what is the VIF?
D.12 The following data represent petrol consumption for a family's annual vacation trip.

Year   Petrol consumption (L)
2000            89
2001           123
2002           102
2003           112
2004            87
2005           213
2006           112
2007           123
2008           101
2009           109
2010            99
2011           103

a. Plot the data.
b. Fit a 3-year moving average to the data and plot the results.
c. Using a smoothing coefficient of W = 0.50, exponentially smooth the series and plot the results.
d. What is your exponentially smoothed forecast for 2012?
e. Repeat (c), using a smoothing coefficient of W = 0.25.
f. What is your exponentially smoothed forecast for 2012?
g. Compare the results of (d) and (e).
D.13 If using the method of least squares for fitting trends in an annual time series containing 32 consecutive yearly values:
a. What coded value do you assign to X for the first year in the series?
b. What coded value do you assign to X for the eleventh year in the series?
c. What coded value do you assign to X for the most recent recorded year in the series?
d. What coded value do you assign to X if you want to project the trend and make a forecast three years beyond the last observed value?





D.14 The data contained in  are the number of males employed full-time in a professional occupation in Australia from the August quarter 1996 to the August quarter 2008.

Time period   Male full-time employed professionals ('000)
Aug 1996          645.6
Nov 1996          675.0
Feb 1997          661.2
May 1997          659.9
Aug 1997          675.2
Nov 1997          700.5
Feb 1998          710.7
May 1998          697.8
Aug 1998          711.1
Nov 1998          733.6
Feb 1999          751.0
May 1999          730.4
Aug 1999          704.2
Nov 1999          725.7
Feb 2000          735.2
May 2000          743.7
Aug 2000          742.7
Nov 2000          746.4
Feb 2001          762.7
May 2001          760.5
Aug 2001          765.8
Nov 2001          752.7
Feb 2002          759.6
May 2002          768.6
Aug 2002          761.1
Nov 2002          783.3
Feb 2003          793.4
May 2003          772.2
Aug 2003          768.8
Nov 2003          786.9
Feb 2004          799.1
May 2004          794.7
Aug 2004          772.5
Nov 2004          782.2
Feb 2005          769.7
May 2005          798.6
Aug 2005          823.3
Nov 2005          841.7
Feb 2006          848.3
May 2006          848.2
Aug 2006          838.2
Nov 2006          825.1
Feb 2007          851.0
May 2007          858.9
Aug 2007          856.0
Nov 2007          875.5
Feb 2008          889.7
May 2008          883.2
Aug 2008          887.9

Source: Australian Bureau of Statistics, Labour Force, Australia, Detailed, Quarterly, Aug 2008, Cat. No. 6291.0.55.033 accessed 6 December 2008

a. Plot the series of data.
b. Calculate a linear trend forecasting equation and plot the trend line.
c. What are your forecasts of the male full-time professional employment in the November quarter 2008 and the February quarter 2009?
d. Do you think it is reasonable to try to forecast employment in this way?
D.15 Singapore welcomes thousands of Australian visitors each month. Monthly data from February 2016 to July 2017 are contained in .
a. Plot the data.
b. Calculate a linear trend forecasting equation.
c. Calculate a quadratic trend forecasting equation.
d. Calculate an exponential trend forecasting equation.
e. Which model is the most appropriate?
D.16 Refer to the data in the previous problem concerning Australian visitors to Singapore.
a. Forecast Australian visitors for August 2017 using the Holt–Winters method with U = 0.30 and V = 0.30.
b. Repeat (a) with U = 0.70 and V = 0.70.
c. Repeat (a) with U = 0.30 and V = 0.70.
D.17 Refer to the data given in problem D.15 representing Australian visitors to Singapore.
a. Fit a third-order autoregressive model to the visitor data and, using a 0.05 level of significance, test for the significance of the third-order autoregressive parameter.
b. Fit a second-order autoregressive model to the visitor data and, using a 0.05 level of significance, test for the significance of the second-order autoregressive parameter.
c. Fit a first-order autoregressive model to the visitor data and, using a 0.05 level of significance, test for the significance of the first-order autoregressive parameter.
d. If appropriate, forecast the number of visitors for August 2017.
D.18 Refer to the results in problem D.15, problem D.16 and problem D.17(c) concerning visitors to Singapore.
a. Perform a residual analysis for each model.
b. Calculate the standard error of the estimate (SYX) for each model.
c. Calculate the MAD for each model.
d. On the basis of (a), (b), (c) and parsimony, which forecasting model would you select? Discuss.
D.19 An upmarket restaurant owner believes that customers' spending behaviour is influenced by their level of income and also if they are using their company's credit card.



Bill ($)   Income    Credit card
212        67,000    1
56         34,000    0
237        112,000   0
89         78,000    0
45         23,500    0
432        112,000   1
123        24,500    1
67         45,000    0
49         29,800    0
256        56,780    1

a. Interpret the regression coefficients.
b. Conduct hypothesis tests on the significance of the coefficients.
c. Predict the bill for a customer with an income of $75,000 and who is using the company's credit card.
d. Perform a residual analysis and determine the adequacy of the model.

D.20 A car dealership is constructing a model to predict second-hand car prices. The two factors that are hypothesised to affect the price most are the car's mileage and its new car price.

Second-hand price ('000)  Mileage (km)
3.2  5.4  10.8  23.5  34.8  4.6  21.8  12.6  6.8  23.7  18.2

New car price ($'000)
780 12 400 18 234 23 106 34 234 54 670 12 452 33 123 23 546 19 234 35 123 29

a. Predict the second-hand sales price of a car with 450,000 km and a new car sales price of $35,000, and construct a 95% confidence interval estimate and a 95% prediction interval.
b. Perform a residual analysis and determine the adequacy of the model.
c. Is there a significant relationship between second-hand sales price and the two independent variables at the 0.05 level of significance?
d. Construct 95% confidence interval estimates of the population slopes.
e. Calculate the adjusted R².
f. Calculate the coefficients of partial determination and interpret their meaning.

D.21 Are beer prices higher during summer? The following table contains an estimate of the average monthly price (in dollars per dozen stubbies of light beer) for beer sold on the Gold Coast, Queensland, from 2008 to 2011.

Month       2008    2009    2010    2011
January     13.01   14.72   11.39   14.73
February    13.69   14.84   11.30   16.41
March       15.41   14.47   12.41   17.48
April       15.06   15.64   14.07   16.59
May         14.98   17.29   14.21   15.42
June        16.17   16.40   14.04   15.14
July        15.93   14.82   14.12   15.24
August      15.10   14.27   14.23   16.28
September   15.82   15.31   14.22   17.28
October     15.59   13.62   14.49   16.03
November    15.55   12.63   14.48   15.35
December    14.89   11.31   13.94   14.94

a. Construct a time-series plot.
b. Develop an exponential trend forecasting equation for monthly data.
c. Interpret the monthly compound growth rate.
d. Interpret the monthly multipliers.
e. Write a short summary of your findings.

D.22 The data below represent the average prices (cents) and consumption (volume per item measure) of the items in the basket of goods in problem 14.57 on page 598. For example, in Sydney in 2007, the average price per kg of lamb sold was $7.65, and 60 kg was consumed. Each cell shows price, consumption.

Item                 Sydney 2007   Melbourne 2007   Sydney 2017   Melbourne 2017
Loin chop (1 kg)     765, 60       844, 58          1,564, 72     1,525, 68
Potatoes (1 kg)      108, 83       111, 82          137, 77       175, 75
Bread (650 g)        216, 72       201, 70          246, 68       254, 69
Milk (150 g)         220, 64       275, 60          256, 70       268, 73
Coffee (150 g)       662, 84       608, 75          609, 93       611, 72
Baby food (120 g)    59, 55        55, 50           75, 58        75, 60

a. Calculate the 2017 Paasche index for each city, using 2007 as the base year.
b. Calculate the 2017 Laspeyres index for each city, using 2007 as the base year.
c. Compare the two indices from (a) and (b).

D.23 The data below are the average prices (in dollars) of various cars from 2001 to 2011.

Year   Commodore   Falcon   WRX      Kluger
2001   34,500      35,600   45,000   55,000
2003   35,450      36,000   46,750   56,750
2005   37,650      37,540   48,750   58,900
2007   39,075      38,650   49,000   62,450
2009   40,650      39,500   51,350   63,000
2011   42,675      41,500   53,000   65,400

a. Calculate the 2001–2011 simple price indices for the four cars using 2001 as the base year.
b. Recalculate the price indices in (a) using 2003 as the base year.
c. Calculate the 2001–2011 unweighted aggregate price indices for the four cars.
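For problems D.22 and D.23, the weighted index-number calculations can be sketched in Python. This is an illustrative sketch only (not the text's Excel approach), using the Sydney basket from problem D.22 with 2007 as the base year.

```python
# Laspeyres and Paasche price indices for the Sydney basket in problem D.22:
#   Laspeyres = 100 * sum(p_t * q_0) / sum(p_0 * q_0)   (base-period quantities)
#   Paasche   = 100 * sum(p_t * q_t) / sum(p_0 * q_t)   (current-period quantities)
p_2007 = [765, 108, 216, 220, 662, 59]     # prices (cents), Sydney 2007
q_2007 = [60, 83, 72, 64, 84, 55]          # quantities consumed, Sydney 2007
p_2017 = [1564, 137, 246, 256, 609, 75]    # prices (cents), Sydney 2017
q_2017 = [72, 77, 68, 70, 93, 58]          # quantities consumed, Sydney 2017

def weighted_sum(prices, quantities):
    return sum(p * q for p, q in zip(prices, quantities))

laspeyres = 100 * weighted_sum(p_2017, q_2007) / weighted_sum(p_2007, q_2007)
paasche = 100 * weighted_sum(p_2017, q_2017) / weighted_sum(p_2007, q_2017)
print(f"Laspeyres 2017 (base 2007): {laspeyres:.1f}")
print(f"Paasche   2017 (base 2007): {paasche:.1f}")
```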





d. Calculate the 2011 Laspeyres price index for the cars assuming that 1,200 Commodores, 1,000 Falcons, 550 WRXs and 375 Klugers were sold in 2001.
e. Calculate the 2011 Paasche price index for the cars assuming that 1,650 Commodores, 1,346 Falcons, 789 WRXs and 502 Klugers were sold in 2011.

D.24 Given n = 25, b1 = 5, b2 = 10, Sb1 = 2, Sb2 = 8:
a. Which variable has the largest slope in units of a t statistic?
b. Construct a 95% confidence interval estimate of the population slope, β1.
c. At the 0.05 level of significance, determine whether each independent variable makes a significant contribution to the regression model. On the basis of these results, indicate the independent variables to include in this model.

D.25 The following computer output contains the X values, residuals and a residual plot from a regression analysis. Is there any evidence of a pattern in the residuals? Explain.

D.26 A real estate chain in Hobart has collected data for the number of domestic and commercial properties sold per week during the past three years (156 weeks):

Number sold   Number of weeks
0             11
1             49
2             32
3             22
4             15
5             13
6              8
7              4
8              2

Do sales of domestic properties follow a Poisson distribution?

D.27 In a random sample of 360 goals kicked in AFL games in Adelaide, the distance for each kick has been recorded.

Kick distance (metres)   Number of goals kicked
 0 – under 10             52
10 – under 20             69
20 – under 30             82
30 – under 40             74
40 – under 50             61
50 – under 60             22

a. Calculate the mean and standard deviation of the frequency distribution.
b. At the 0.05 level of significance, does the length of distance kicked follow a normal distribution?

D.28 The warranty offered for a brand of motor vehicles sold in Australia is 5 years. The company keeps records of all warranty claims made and has calculated the average time between claims to be 18 months with a standard deviation of five months. The service manager has become concerned that the model released two years ago has been less reliable. He has examined the warranty claims for a small sample of 30 vehicles of this model and found the standard deviation to be six months.
a. At the 0.05 level of significance, is the population standard deviation significantly different from 5?
b. What assumption do you need to make in order to perform this test?
c. Calculate the p-value in part (a) and interpret its meaning.

D.29 A debate in education institutions has followed the issue of whether falling tutorial attendance can be addressed by marking attendance in class in universities. The argument is that falling attendance will increase failure rates, and this is an inefficient system of education provision. A lecturer has collected attendance records for a set of tutorials in one week containing 112 enrolled students without telling the students, and then collected attendance for the same students two weeks later after telling them that attendance records will be kept.

                 No rolls kept   Rolls kept
Attendance            71             87
Non-attendance        41             25

a. At the 0.05 level of significance, is the proportion of students attending class tutorial the same when class rolls are kept and when they are not?
b. Calculate the p-value in (a) and interpret its meaning.

D.30 The following data represent quarterly sales of iPhones from 2010 to 2014.

Fiscal year   Dec quarter   Mar quarter   Jun quarter   Sep quarter
2010           8,737,000     8,752,000     8,398,000    14,102,000
2011          16,240,000    18,650,000    20,340,000    17,070,000
2012          37,044,000    35,100,000    26,000,000    26,900,000
2013          47,800,000    37,400,000    31,200,000    33,800,000
2014          51,000,000         —             —             —

a. Construct a time-series plot.
b. Develop an exponential trend forecasting equation for quarterly data.
c. Interpret the quarterly compound growth rate.
d. Interpret the quarterly multipliers.
e. Forecast iPhone sales for the March quarter 2014.



PART 5
Further topics in stats

Real People, Real Stats
Deborah O'Mara, THE UNIVERSITY OF SYDNEY

Which company are you currently working for and what are some of your responsibilities?
I am an Associate Professor at the Sydney Medical School, The University of Sydney, and I also work on a small number of external consultancy projects. My academic responsibilities and consultancy work include the statistical analysis of examination results, longitudinal student databases and research.

List five words that best describe your personality.
Outgoing, friendly, analytical, loyal, caring.

What are some things that motivate you?
I have always been self-motivated. I strive to achieve goals related to family, work, leisure, finance, and so on.

When did you first become interested in statistics?
At secondary school in mathematics, and to a greater degree at university.

LET'S TALK STATS

What do you enjoy most about working in statistics?
Identifying unexpected patterns and uncovering a conclusion that helps resolve a problem.


a quick q&a Describe your first statistics-related job or work experience. Was this a positive or a negative experience? I worked as a research assistant in the summer vacation for a group of academics in the School of Education at Macquarie University. It was a great experience, much more enjoyable and better paid than being a waitress, and I understood more about the application of statistics to a research problem. This job led to other similar positions which further developed my research experience and statistical expertise. What do you feel is the most common misconception about your work held by students who are studying statistics? Please explain. The most common misconception, even among my own children, is that statistics is boring. It is the way statistics is taught that is boring or irrelevant to students. Unfortunately, making it compulsory for many students as a prerequisite course inevitably means that many study it under duress. Do you need to be good at maths to understand and use statistics successfully? No, but it does help. Often I find students with a social science background are better at seeing the application of statistics. If you are using multivariate statistics, an aptitude for mathematics is helpful, but you do not need a mathematics degree. Is there a high demand for statisticians in your industry (or in other industries)? Please explain. Yes, there is a high demand for statisticians in social science. We use Rasch analysis to analyse and equate examinations and there are very few statisticians who have a higher degree and/or extensive experience in both education and statistics. Similarly, in market research there are a few researchers who understand and can communicate to both marketers and statisticians.

FURTHER TOPICS IN STATISTICS

Based on your experiences, how reliable do you think a decision you have made would be without the aid of statistical tools and statistics in general?
Most decisions I make are based on a thorough analysis of the data and/or are based on my extensive experience in using statistics to make decisions, and therefore are directly or indirectly guided by statistical tools.

Why should we bother worrying about quality/productivity?
If you do not worry about quality, then your statistical results will be unreliable and perhaps also invalid. Reliability is an essential but not sufficient condition for validity, and both quality and reliability are required if your conclusions from statistical analyses are to be useful for your client. The use of incorrect statistical results can lead to reduced profits for business and could have major ramifications for individuals.

Is the process of performing statistical tests usually a tedious task?
Not for me. Sometimes it becomes tedious if the same process has to be performed on different data sets. Similarly, data cleaning can be boring. But like many things, statistics in my experience is an 80/20 relationship of data preparation and interesting statistical analyses, respectively. But the last 20% is where the fun and influence are.

Have you used non-parametric tests such as the Wilcoxon rank sum test or signed ranks test?
Yes, I have used several non-parametric tests, primarily in my PhD thesis, which was a longitudinal analysis of education policy change over 30 years. Due to the small number of years investigated and my use of qualitative indicators about policy, these were the most appropriate techniques. I have more recently used the Mann–Whitney U test for a small patient study on dementia with two geriatricians, which has now been published.

Do you have any advice for students who are studying statistics? If so, what would it be?
My advice is to persist through basic statistics. They are important building blocks for multivariate techniques such as ANOVA and regression, which is where the fun and enjoyment are. But statistics is not a replacement for thinking – you need to plan your analysis and conduct your interpretation systematically and then reap the rewards. Being a good statistician and having an ability to explain statistical findings in plain English will give you an advantage over others in both business and academia.


CHAPTER 16
Multiple regression model building

PROPERTY RATINGS
Surveys are commonly taken of suburbs in major cities to estimate the rating or quality of the suburb. While the classification variables vary widely from housing prices to lifestyle, the ratings can have a significant impact on perception in real estate markets.
A recent survey that received widespread publicity rated suburbs on a lifestyle set of measures and was publicised in a 'worst' to 'best' guide in a newspaper. As a real estate analyst it seems to you that the ratings simply relate to house prices, and these in turn relate to average family income. The problem is that household incomes are not graded linearly: there are few people on very high incomes and most people on lower incomes. How do you build the model with the most appropriate mix of independent variables?





LEARNING OBJECTIVES

After studying this chapter you should be able to:
1. utilise quadratic terms in a regression model
2. calculate and use transformed variables in a regression model
3. examine the effect of each observation on the regression model
4. construct a regression model using either the stepwise or the best-subsets approach
5. recognise the many pitfalls involved in developing a multiple regression model

In Chapter 13 you studied multiple regression models that contained two independent variables. In this chapter, regression analysis is extended to models containing more than two independent variables. In order to help you learn to develop the best model when confronted with a large set of data (such as the one described in the property ratings scenario), this chapter introduces you to various topics related to model building. These topics include quadratic independent variables, transformations of either the dependent or the independent variables, identification of influential observations, stepwise regression and logistic regression.

16.1  QUADRATIC REGRESSION MODEL

LEARNING OBJECTIVE 1
Utilise quadratic terms in a regression model

The simple regression model discussed in Chapter 12 and the multiple regression model discussed in Chapter 13 assumed that the relationship between Y and each independent variable is linear. However, in Section 12.1, several different types of non-linear relationships between variables were introduced. One of the most common non-linear relationships is a quadratic relationship between two variables in which Y increases (or decreases) at a changing rate for various values of X (see Figure 12.2, panels C–E, on page 457). You can use the quadratic regression model defined in Equation 16.1 to analyse this type of relationship between X and Y.

QUADRATIC REGRESSION MODEL
Yi = β0 + β1X1i + β2X1i² + εi    (16.1)
where
β0 = Y intercept
β1 = coefficient of the linear effect on Y
β2 = coefficient of the quadratic effect on Y
εi = random error in Y for observation i

This quadratic regression model is similar to the multiple regression model with two independent variables (see Equation 13.2 on page 506) except that the second independent variable is the square of the first independent variable. Once again, you use the sample regression coefficients (b0, b1 and b2) as estimates of the population parameters (β0, β1 and β2). Equation 16.2 defines the regression equation for the quadratic model with one independent variable (X1) and a dependent variable (Y).

quadratic regression model A multiple regression model with two independent variables, where the second independent variable is the square of the first independent variable.

QUADRATIC REGRESSION EQUATION
Ŷi = b0 + b1X1i + b2X1i²    (16.2)



In Equation 16.2, the first regression coefficient, b0, represents the Y intercept; the second regression coefficient, b1, represents the linear effect; and the third regression coefficient, b2, represents the quadratic effect.

Finding the Regression Coefficients and Predicting Y
You have taken a sample of suburban ratings for 18 suburbs in Sydney and collected data on the average household income (AUD$) for each suburb in < SUB_HOUSE >. Table 16.1 summarises the data.

Table 16.1  Suburban lifestyle ratings and average household incomes for 18 suburbs in Sydney

Suburb         Rating   Average household income (AUD$ '000s)
Vaucluse        1,000    285
Palm Beach      1,000    265
Watsons Bay     1,000    240
Mosman          1,000    220
Manly             800    150
Narrabeen         800    145
Balgowlah         800    130
Bondi             800    130
Paddington        600     84
Chatswood         600     80
Maroubra          400     47
Hornsby           400     46
Merrylands        400     45
Sylvania          400     44
Pymble            400     40
Castle Hill       400     40
Marrickville      200     33
Liverpool         200     32

To help you select the proper model for expressing the relationship between ratings and income, you plot a scatter diagram as shown in Figure 16.1. Figure 16.1 indicates a rapid increase in income as rating increases. Therefore, a non-linear model is more appropriate for these data. You could use either a quadratic or an exponential non-linear model. Here you use a quadratic model. (Exponential regression was discussed in Chapter 14.)

Figure 16.1  Microsoft Excel scatter diagram of suburb rating (X) and income (Y)

[Scatter diagram of suburb rating and average household income: Rating on the X axis (0 to 1,200), Income on the Y axis (0 to 300).]





You need to use a spreadsheet program or a statistical package to perform the least-squares method needed to calculate the three sample regression coefficients (b0, b1 and b2) for this example. Figure 16.2 presents a Microsoft Excel worksheet of the output. From Figure 16.2:
b0 = 72.4672   b1 = 7.3891   b2 = −0.0146
Therefore, the quadratic regression equation is:
Ŷi = 72.4672 + 7.3891X1i − 0.0146X1i²
where
Ŷi = predicted suburb rating for sample i
X1i = average income for sample i

Figure 16.2  Microsoft Excel output for the suburb rating data

Suburb rating analysis

Regression statistics
Multiple R            0.98782494
R square              0.975798111
Adjusted R square     0.972571193
Standard error        46.69007093
Observations          18

ANOVA
              df    SS             MS            F             Significance F
Regression     2    1318411.67     659205.8351   302.3931685   7.56605E-13
Residual      15    32699.44085    2179.962724
Total         17    1351111.111

                Coefficients    Standard error   t stat        p-value       Lower 95%      Upper 95%
Intercept       72.4671963      32.99812102      2.196100689   0.044223364   2.133366573    142.801026
Family income   7.389052965     0.607628043      12.16048708   3.60421E-09   6.093924456    8.684181475
FI squared      –0.014637544    0.00201472       –7.26529806   2.76164E-06   –0.018931818   –0.010343269
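The least-squares fit shown in Figure 16.2 can also be reproduced outside Excel. The following Python sketch (illustrative only; variable names are not from the text) fits the quadratic model to the Table 16.1 data and should reproduce the coefficients above up to rounding.

```python
# Fitting the quadratic regression model of Equation 16.2 to the Table 16.1 data.
import numpy as np

income = np.array([285, 265, 240, 220, 150, 145, 130, 130, 84, 80,
                   47, 46, 45, 44, 40, 40, 33, 32], dtype=float)          # X ($'000)
rating = np.array([1000, 1000, 1000, 1000, 800, 800, 800, 800, 600, 600,
                   400, 400, 400, 400, 400, 400, 200, 200], dtype=float)  # Y

# Design matrix with an intercept, a linear term and a quadratic term
X = np.column_stack([np.ones_like(income), income, income ** 2])
b, *_ = np.linalg.lstsq(X, rating, rcond=None)
print("b0 = %.4f, b1 = %.4f, b2 = %.4f" % tuple(b))   # ~ 72.4672, 7.3891, -0.0146

# Predicted ratings at the bottom, middle and top of the income range
for x in (32, 80, 285):
    print(x, round(b[0] + b[1] * x + b[2] * x ** 2))
```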

Figure 16.3 plots this quadratic regression equation on the scatter diagram to show the fit of the quadratic regression model to the original data.

Figure 16.3  Microsoft Excel scatter diagram expressing the quadratic relationship between suburb rating and average income for the suburb rating data
[Scatter diagram: Rating on the X axis (0 to 1,200), Income on the Y axis (0 to 300), with the fitted curve Ŷi = 72.4672 + 7.3891X1i − 0.0146X1i².]



From the quadratic regression equation and Figure 16.3, the Y intercept (b0 = 72.4672) is the predicted rating when the average income is 0. To interpret the coefficients b1 and b2, observe that income increases rapidly as rating increases. This non-linear relationship is further demonstrated by predicting the suburb rating at the bottom, middle and top of the data range:¹

Ŷi = 72.4672 + 7.3891X1i − 0.0146X1i²
for X1i = 32:  Ŷi = 72.4672 + 7.3891(32) − 0.0146(32)² = 294
for X1i = 80:  Ŷi = 72.4672 + 7.3891(80) − 0.0146(80)² = 570
for X1i = 285: Ŷi = 72.4672 + 7.3891(285) − 0.0146(285)² = 989

Thus, the predicted rating for an average income of $32,000 is 294, which is 276 rating points below that for the middle income of $80,000, but from the middle income of $80,000 to the top income of $285,000 the gap between the predicted ratings widens from 276 to 419 (570 to 989).

Testing for the Significance of the Quadratic Model
After you calculate the quadratic regression equation, you can test whether there is a significant overall relationship between rating, Y, and average income, X. The null and alternative hypotheses are as follows:
H0: β1 = β2 = 0 (No overall relationship between X1 and Y)
H1: At least one βj ≠ 0 (Overall relationship between X1 and Y)
Equation 13.6 on page 512 defines the overall F statistic used for this test:
F = MSR/MSE
From the Excel output in Figure 16.2:
F = MSR/MSE = 659,205.8351/2,179.9627 = 302.3932

If you choose a level of significance of 0.05, then from Table E.5 the critical value of the F distribution with 2 and 15 degrees of freedom is 3.68 (see Figure 16.4). Since F = 302.39 > 3.68, or since the p-value = 0.0000 < 0.05, you reject the null hypothesis (H0) and conclude that there is a significant quadratic relationship between rating and average income. Figure 16.4 Testing for the existence of the overall relationship at the 0.05 level of significance with 2 and 15 degrees of freedom

[F distribution with 2 and 15 degrees of freedom: the critical value 3.68 separates the region of non-rejection (area 0.95) from the region of rejection (area 0.05).]

¹ The solutions presented in this chapter are calculated using the (raw) Excel output. If you use the rounded figures presented in the text to reproduce these answers, there may be minor differences.
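A minimal Python sketch of the overall F test, using the ANOVA quantities from Figure 16.2 (illustrative only; scipy is assumed to be available, and this is not the text's Excel workflow):

```python
# Overall F test of H0: beta1 = beta2 = 0 for the quadratic model.
from scipy import stats

msr, mse = 659_205.8351, 2_179.9627       # mean squares from Figure 16.2
df_regression, df_residual = 2, 15

f_stat = msr / mse                                        # = 302.39
f_crit = stats.f.ppf(0.95, df_regression, df_residual)    # = 3.68 at alpha = 0.05
p_value = stats.f.sf(f_stat, df_regression, df_residual)

print(f"F = {f_stat:.2f}, critical value = {f_crit:.2f}, p-value = {p_value:.2e}")
# Reject H0 because F > critical value (equivalently, p-value < 0.05)
```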





Testing the Quadratic Effect
In using a regression model to examine a relationship between two variables, you want to find not only the most accurate model but also the simplest model expressing that relationship. Therefore, you need to examine whether there is a significant difference between the quadratic model:
Yi = β0 + β1X1i + β2X1i² + εi
and the linear model:
Yi = β0 + β1X1i + εi
In Section 13.4, you used the t test to determine whether each particular variable makes a significant contribution to the regression model. To test the significance of the contribution of the quadratic effect, you use the following null and alternative hypotheses:
H0: Including the quadratic effect does not significantly improve the model (β2 = 0).
H1: Including the quadratic effect significantly improves the model (β2 ≠ 0).
The standard error of each regression coefficient and its corresponding t statistic are part of the Excel output (see Figure 16.2 on page 653). Equation 13.7 on page 517 defines the t test statistic:
t = (b2 − β2)/Sb2 = (−0.0146 − 0)/0.0020 = −7.3
If you select a level of significance of 0.05, then from Table E.3 the critical values for the t distribution with 15 degrees of freedom are −2.1315 and +2.1315 (see Figure 16.5). Since t = −7.3 < −2.1315, or since the p-value = 0.0000 < 0.05, you reject H0 and conclude that the quadratic model is significantly better than the linear model for representing the relationship between rating and average income.

Figure 16.5  Testing for the contribution of the quadratic effect to a regression model at the 0.05 level of significance with 15 degrees of freedom

[t distribution with 15 degrees of freedom: regions of rejection (area 0.025 each) below −2.1315 and above +2.1315; region of non-rejection (area 0.95) between the critical values.]
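A minimal Python sketch of this t test for the quadratic effect, using the coefficient and standard error from Figure 16.2 (illustrative only; scipy assumed):

```python
# t test for the contribution of the quadratic term (H0: beta2 = 0).
from scipy import stats

b2, se_b2 = -0.0146375, 0.0020147        # quadratic coefficient and its standard error
df = 15                                   # n - k - 1 = 18 - 2 - 1

t_stat = (b2 - 0) / se_b2                 # ~ -7.27
t_crit = stats.t.ppf(0.975, df)           # ~ 2.1315 for a two-tail 0.05 test
p_value = 2 * stats.t.sf(abs(t_stat), df)

print(f"t = {t_stat:.2f}, critical values = +/-{t_crit:.4f}, p-value = {p_value:.2e}")
# Reject H0: the quadratic model is a significant improvement over the linear model
```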

The Coefficient of Multiple Determination
In the multiple regression model, the coefficient of multiple determination R² (see Section 13.2) represents the proportion of variation in Y that is explained by variation in the independent variables. Consider the quadratic regression model you used to predict suburb ratings using average income and average income squared. You calculate R² using Equation 13.4 on page 511:
R² = SSR/SST
From Figure 16.2 on page 653:
SSR = 1,318,411.67 and SST = 1,351,111.11



Thus:
R² = SSR/SST = 1,318,411.67/1,351,111.11 = 0.9758
This coefficient of multiple determination indicates that 97.58% of the variation in rating is explained by the quadratic relationship between rating and average income. You should also calculate R²adj to account for the number of independent variables and the sample size. In the quadratic regression model, k = 2 since there are two independent variables, X1 and X1². Thus, using Equation 13.5 on page 511:
R²adj = 1 − [(1 − R²)(n − 1)/(n − k − 1)]
      = 1 − [(1 − 0.9758)(17/15)]
      = 1 − 0.0274
      = 0.9726
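These two calculations can be sketched in Python directly from the sums of squares in Figure 16.2 (illustrative only):

```python
# Coefficient of multiple determination and adjusted R-squared (Equations 13.4 and 13.5).
ssr, sst = 1_318_411.67, 1_351_111.11
n, k = 18, 2                              # observations and independent variables

r_squared = ssr / sst                                          # = 0.9758
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)    # = 0.9726

print(f"R2 = {r_squared:.4f}, adjusted R2 = {adj_r_squared:.4f}")
```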

Problems for Section 16.1
LEARNING THE BASICS
16.1 The following quadratic regression equation is for a sample of n = 30:
Ŷi = 10 + 2X1i − 0.5X1i²
a. Predict Y for X1 = 4.
b. Suppose the t statistic for the quadratic regression coefficient is 2.35. At the 0.05 level of significance, is there evidence that the quadratic model is significantly better than the linear model?
c. Suppose the t statistic for the quadratic regression coefficient is 1.17. At the 0.05 level of significance, is there evidence that the quadratic model is significantly better than the linear model?
d. Suppose the regression coefficient for the linear effect is −3.0. Predict Y for X1 = 2.

APPLYING THE CONCEPTS
You need to use Microsoft Excel to solve problems 16.2–16.5.

16.2 A research analyst for General Motors Holden in Australia wants to develop a model to predict petrol consumed per 100 kilometres based on highway speed for the new model SV6. An experiment is designed in which a test SV6 is driven at speeds ranging from 20 kilometres per hour to 120 kilometres per hour. The results are in the data file < SPEED >. Assume a quadratic relationship between speed and petrol consumption. a. Construct a scatter diagram for speed and petrol consumption. b. State the quadratic regression equation. c. Predict litres per 100 kilometres when the car is driven at 90 kilometres per hour. d. Perform a residual analysis and determine the adequacy of the model.

e. At the 0.05 level of significance, is there a significant quadratic relationship between litres per 100 kilometres and speed? f. At the 0.05 level of significance, determine whether the quadratic model is a better fit than the linear model. g. Interpret the meaning of the coefficient of multiple determination. h. Calculate the adjusted R². 16.3 Businesses actively recruit business students with well-developed higher-order cognitive skills (HOCS) such as problem identification, analytical reasoning and content integration skills. Researchers in the United States conducted a study to see whether improvement in students' HOCS was related to the students' grade point average (GPA) (data obtained from R. V. Bradley, C. S. Sankar, H. R. Clayton, V. W. Mbarika and P. K. Raju, 'A study on the impact of GPA on perceived improvement of higher-order cognitive skills', Decision Sciences Journal of Innovative Education, 5(1), January 2007, 151–168). The researchers conducted a study in which business students were taught using the case study method. Using data collected from 300 business students, the following quadratic regression equation was derived: HOCS = −3.48 + 4.53(GPA) − 0.68(GPA)²

where the dependent variable HOCS measured the improvement in higher order cognitive skills, with 1 being the lowest improvement in HOCS and 5 being the highest improvement in HOCS. a. Construct a table of predicted HOCS, using GPA equal to 2.0, 2.1, 2.2, …, 4.0. b. Plot the values in the table constructed in (a), with GPA on the horizontal axis and predicted HOCS on the vertical axis. c. Discuss the curvilinear relationship between students’ GPA and their predicted improvement in HOCS.





d. The researchers reported that the model had an R2 of 0.07 and an adjusted R2 of 0.06. What does this tell you about the scatter of individual HOCS scores around the curvilinear relationship plotted in (b) and discussed in (c)? 16.4 A travel agency is interested in analysing the relationship between expenditure on social marketing and client revenue. The data are contained in < SOCIAL >. Assume a quadratic relationship between the social marketing expenditure and client revenue. a. Construct a scatter diagram for expenditure and revenue. b. State the quadratic regression equation. c. Predict revenue for a firm spending $75,000 on social marketing. d. Perform a residual analysis and determine the adequacy of the model. e. At the 0.05 level of significance, is there a significant overall relationship between client revenue and marketing expenditure? f. What is the p-value in (e)? Interpret its meaning. g. At the 0.05 level of significance, determine whether there is a significant quadratic effect. h. What is the p-value in (g)? Interpret its meaning. i. Interpret the meaning of the coefficient of multiple determination. j. Calculate the adjusted R 2. 16.5 Researchers wanted to investigate the relationship between employment and accommodation capacity in the European travel and tourism industry. The file < EURO_TOURISM> contains a sample of 27 European countries. Variables included are the number of jobs generated in the travel and tourism industry in 2012 and the number of establishments that provide

overnight accommodation for tourists (data obtained from ). a. Construct a scatter plot of the number of jobs generated in the travel and tourism industry in 2012 (Y ) and the number of establishments that provide overnight accommodation for tourists (X ). b. Fit a quadratic regression model to predict the number of jobs generated and state the quadratic regression equation. c. Predict the mean number of jobs generated in the travel and tourism industry for a country with 3,000 establishments that provide overnight accommodation for tourists. d. Perform a residual analysis and determine whether the regression model is valid. e. At the 0.05 level of significance, is there a significant quadratic relationship between the number of jobs generated in the travel and tourism industry in 2012 and the number of establishments that provide overnight accommodation for tourists? f. What is the p-value in (e)? Interpret its meaning. g. At the 0.05 level of significance, determine whether the quadratic model is a better fit than the linear model. h. What is the p-value in (g)? Interpret its meaning. I. Interpret the meaning of the coefficient of multiple determination. j. Calculate the adjusted R2. k. What conclusions can you reach concerning the relationship between the number of jobs generated in the travel and tourism industry in 2012 and the number of establishments that provide overnight accommodation for tourists?

16.2  USING TRANSFORMATIONS IN REGRESSION MODELS

LEARNING OBJECTIVE 2
Calculate and use transformed variables in a regression model

This section introduces regression models in which the independent variable, the dependent variable or both are transformed in order either to overcome violations of the assumptions of regression or to make a model linear in form. Among the many transformations available (see reference 1) are the square-root transformation and transformations involving the common logarithm (base 10) and the natural logarithm (base e).²

The Square-Root Transformation The square-root transformation is often used to overcome violations of the equal variance assumption, as well as to transform a model that is not linear in form into one that is linear. Equation 16.3 shows a regression model that uses a square-root transformation of the independent variable.

square-root transformation Uses the square-root of the sample data to overcome breaches of the homoscedasticity or linearity assumptions.

REGRESSION MODEL WITH A SQUARE-ROOT TRANSFORMATION
Yi = β0 + β1√X1i + εi    (16.3)

Example 16.1 illustrates the use of a square-root transformation.

² For more information on logarithms see Appendix A.



EXAMPLE 16.1  USING THE SQUARE-ROOT TRANSFORMATION
Given the following values for Y and X, use a square-root transformation for the X variable. Construct a scatter diagram for X and Y and for the square root of X and Y.

Y       X        Y       X
42.7    1        100.4   3
50.4    1        104.7   4
69.1    2        112.3   4
79.8    2        113.6   5
90.0    3        123.9   5

SOLUTION

Figure 16.6, panel A, displays the scatter diagram of X and Y, and panel B shows the square root of X versus Y.

Figure 16.6  Panel A: scatter diagram of X and Y; Panel B: scatter diagram of the square root of X and Y.

You can see that the square-root transformation has transformed a non-linear relationship into a linear relationship.
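A minimal Python sketch of fitting Equation 16.3 to the Example 16.1 data (illustrative only; the text itself works in Excel):

```python
# Regression with a square-root transformation of the independent variable.
import numpy as np

x = np.array([1, 1, 2, 2, 3, 3, 4, 4, 5, 5], dtype=float)
y = np.array([42.7, 50.4, 69.1, 79.8, 90.0, 100.4, 104.7, 112.3, 113.6, 123.9])

root_x = np.sqrt(x)                               # transformed independent variable
design = np.column_stack([np.ones_like(root_x), root_x])
(b0, b1), *_ = np.linalg.lstsq(design, y, rcond=None)

print(f"Yhat = {b0:.2f} + {b1:.2f} * sqrt(X)")
print("Correlation of Y with sqrt(X):", round(np.corrcoef(root_x, y)[0, 1], 4))
```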

The Log Transformation
logarithmic transformation  Uses the common or natural log of the sample data to overcome breaches of the homoscedasticity or linearity assumptions in regression.

The logarithmic transformation is often used to overcome violations of the equal variance assumption. You can also use the logarithmic transformation to change a non-linear model into a linear model. Equation 16.4 shows a multiplicative model.

ORIGINAL MULTIPLICATIVE MODEL
Yi = β0 X1i^β1 X2i^β2 εi    (16.4)

By taking base 10 logarithms of both the dependent and the independent variables, you can transform Equation 16.4 to the model shown in Equation 16.5.

TRANSFORMED MULTIPLICATIVE MODEL
log Yi = log(β0 X1i^β1 X2i^β2 εi)
       = log β0 + β1 log X1i + β2 log X2i + log εi    (16.5)





Hence, Equation 16.5 is linear in the logarithms. In a similar fashion, you can transform the exponential model shown in Equation 16.6 to linear form by taking the natural logarithm of both sides of the equation.

ORIGINAL EXPONENTIAL MODEL
Yi = e^(β0 + β1X1i + β2X2i) εi    (16.6)

TRANSFORMED EXPONENTIAL MODEL
ln Yi = ln(e^(β0 + β1X1i + β2X2i) εi)
      = ln(e^(β0 + β1X1i + β2X2i)) + ln εi
      = β0 + β1X1i + β2X2i + ln εi    (16.7)



EXAMPLE 16.2  USING THE NATURAL LOG TRANSFORMATION
Given the following values for Y and X, use a natural logarithm transformation for the Y variable. Construct a scatter diagram for X and Y, and for X and the natural logarithm of Y.

Y      X        Y       X
0.7    1         4.8    3
0.5    1        12.9    4
1.6    2        11.5    4
1.8    2        32.1    5
4.2    3        33.9    5

SOLUTION

As shown in Figure 16.7, the natural logarithm transformation has transformed a non-linear relationship into a linear relationship.

Figure 16.7  Panel A: scatter diagram of X and Y; Panel B: scatter diagram of X and the natural logarithm of Y.
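A minimal Python sketch of fitting the exponential model of Equations 16.6–16.7 to the Example 16.2 data by regressing ln Y on X (illustrative only):

```python
# Regression with a natural log transformation of the dependent variable.
import numpy as np

x = np.array([1, 1, 2, 2, 3, 3, 4, 4, 5, 5], dtype=float)
y = np.array([0.7, 0.5, 1.6, 1.8, 4.2, 4.8, 12.9, 11.5, 32.1, 33.9])

ln_y = np.log(y)                                   # transformed dependent variable
design = np.column_stack([np.ones_like(x), x])
(b0, b1), *_ = np.linalg.lstsq(design, ln_y, rcond=None)

print(f"ln(Yhat) = {b0:.3f} + {b1:.3f} * X")
print(f"Equivalent multiplicative form: Yhat = {np.exp(b0):.3f} * e^({b1:.3f} X)")
```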



Problems for Section 16.2
APPLYING THE CONCEPTS
You need to use Microsoft Excel to solve problems 16.6–16.9.

16.6 Referring to the data of problem 16.2 on page 656, and using the file < SPEED >, perform a square-root transformation of the independent variable (speed) and re-analyse the data using this model. a. State the regression equation. b. Predict the petrol consumed per 100 kilometres when the car is driven at 85 kilometres per hour. c. Perform a residual analysis and determine the adequacy of the model. d. At the 0.05 level of significance, is there a significant relationship between petrol consumed per 100 kilometres and the square root of speed? e. Interpret the meaning of the coefficient of determination R 2 in this problem. f. Calculate the adjusted R 2. g. Compare your results with those in problem 16.2. Which model is better? Why? 16.7 Referring to the data of problem 16.2 on page 656, using the file < SPEED >, perform a natural logarithmic transformation of the dependent variable (litres per 100 kilometres) and re-analyse the data using this model. a. State the regression equation. b. Predict the petrol consumed per 100 kilometres when the car is driven at 85 kilometres per hour. c. Perform a residual analysis and determine the adequacy of the fit of the model. d. At the 0.05 level of significance, is there a significant relationship between the natural logarithm of petrol consumed per 100 kilometres and speed? e. Interpret the meaning of the coefficient of determination R 2 in this problem. f. Calculate the adjusted R 2. g. Compare your results with those in problems 16.2 and 16.6. Which model is best? Why?

LEARNING OBJECTIVE 3
Examine the effect of each observation on the regression model

16.8 Referring to the data of problem 16.5 on page 657, and using the file < EURO_TOURISM >, perform a natural logarithm transformation of the dependent variable (employment) and re-analyse the data using this model. a. State the regression equation. b. Predict employment when the number of tourism establishments is 5,000. c. Perform a residual analysis and determine the adequacy of the model. d. At the 0.05 level of significance, is there a significant relationship between the natural logarithm of tourism employment and the number of tourism establishments? e. Interpret the meaning of the coefficient of determination R 2 in this problem. f. Calculate the adjusted R 2. g. Compare your results with those in problem 16.5. Which model is better? Why? 16.9 Referring to the data of problem 16.5 on page 657, and using the file < EURO_TOURISM >, perform a square-root transformation of the independent variable (tourism establishments) and re-analyse the data using this model. a. State the regression equation. b. Predict tourism employment when the number of tourism establishments is 5,000. c. Perform a residual analysis and determine the adequacy of the model. d. At the 0.05 level of significance, is there a significant relationship between employment and the square root of the tourism establishments? e. Interpret the meaning of the coefficient of determination R 2 in this problem. f. Calculate the adjusted R 2. g. Compare your results with those of problems 16.5 and 16.8. Which model is best? Why?

16.3  INFLUENCE ANALYSIS
In Sections 12.5 and 13.3, you used residual analysis to evaluate the regression assumptions. This section introduces several methods that measure the influence of individual observations:
• the hat matrix elements hi
• the Studentised deleted residuals ti
• Cook's distance statistic Di.
Figure 16.8 presents the values of these statistics calculated by Excel for the car exports data from Chapter 13. In Figure 16.8, certain influence statistics are highlighted for further analysis.





Figure 16.8  Excel influence statistics for the car export data

Quarter   xport   xrate    wagegr   hi         Stud t      Cook D
Jun 10    500     87.71    1.6      0.363398   –0.19985    0.001894
Sep 10    431     91.9     1.6      0.233053   –0.56235    0.007802
Dec 10    452     98.47    1.4      0.113098   0.616601    0.003916
Mar 11    325     101.4    1.3      0.108784   –0.88714    0.007513
Jun 11    331     107.83   1.3      0.255382   0.233398    0.00155
Sep 11    343     104.75   1.4      0.190086   0.006857    9.2E-07
Dec 11    354     102.29   1.5      0.169043   –0.12334    0.000258
Mar 12    382     106.18   1.2      0.181297   0.632818    0.007151
Jun 12    417     101.24   0.6      0.096661   –0.223      0.000442
Sep 12    492     104.3    0.2      0.212764   1.049322    0.022715
Dec 12    407     103.98   0        0.284756   –0.57918    0.010826
Mar 13    416     103.65   0.2      0.206823   –0.2606     0.001467
Jun 13    391     97.64    0.5      0.133746   –1.35363    0.020452
Sep 13    578     90.98    0.7      0.247752   0.669546    0.011861
Dec 13    585     91.75    0.8      0.203358   1.009718    0.01999

The Hat Matrix Elements hi
hat matrix diagonal elements hi  Tests for the influence of individual sample cases in a multiple regression model.

In Section 12.8, hi was defined for the simple linear regression model when constructing the confidence interval estimate of the mean response. For multiple regression models, the formula for calculating the hat matrix diagonal elements hi requires the use of matrix algebra and is beyond the scope of this text (see references 1, 2 and 3). The hat matrix diagonal element for observation i, denoted hi, reflects the possible influence of Xi on the regression equation. If potentially influential observations are present, you may need to delete them from the model. In a regression model containing k independent variables, Hoaglin and Welsch (see reference 3) suggest the following decision rule:
If hi > 2(k + 1)/n then Xi is an influential observation and is a candidate for removal from the model.
For the car export data, because n = 15 and k = 2, you flag any hi value greater than 2(2 + 1)/15 = 0.4. Referring to Figure 16.8, no values are greater than 0.4.

The Studentised Deleted Residuals ti
Studentised deleted residual  A statistical method of residual analysis using the t probability distribution that identifies individual cases in the sample data of a multiple regression that have high individual influence on the regression equation.

Recall from Section 12.5 that a residual is the difference between the observed value of Y and the predicted value of Y (see Equation 12.14 on page 474). Studentised residuals are the residuals divided by the standard error of the estimate SYX and adjusted for the distance from X̄. The Studentised deleted residual, expressed as a t statistic in Equation 16.8, measures the difference of each Yi from the value predicted by a model that includes all observations except observation i.

STUDENTISED DELETED RESIDUAL
ti = ei √[(n − k − 2)/(SSE(1 − hi) − ei²)]    (16.8)
where
ei = the residual for observation i
k = number of independent variables
SSE = error sum of squares of the regression model fitted
hi = hat matrix diagonal element for observation i



Hoaglin and Welsch (see reference 3) suggest that if ti > tn−k−2 or ti < −tn−k−2 (using a two-tail test with a 0.10 level of significance), then the observed and predicted values are so different that observation i is highly influential on the regression equation and is a candidate for removal. For the car export data, n = 15 and k = 2. Thus, you flag any ti whose absolute value is greater than 1.7959 (see Table E.3). In Figure 16.8 no value is greater than 1.7959. Since hi and ti measure different aspects of influence, neither criterion is sufficient by itself. When hi is small, ti may be large. When hi is large, ti may be moderate or small because the observed Yi is consistent with the rest of the data. The next section introduces a third method for identifying influential observations.
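The influence statistics reported in Figure 16.8 can be computed outside Excel. A minimal Python sketch using statsmodels and the car export data (illustrative only; Cook's Di is defined in the next subsection, and the variable names used here are not from the text):

```python
# Hat values, Studentised deleted residuals and Cook's Di for the car export data.
import numpy as np
import statsmodels.api as sm

xport = np.array([500, 431, 452, 325, 331, 343, 354, 382, 417, 492, 407, 416, 391, 578, 585], float)
xrate = np.array([87.71, 91.9, 98.47, 101.4, 107.83, 104.75, 102.29, 106.18,
                  101.24, 104.3, 103.98, 103.65, 97.64, 90.98, 91.75])
wagegr = np.array([1.6, 1.6, 1.4, 1.3, 1.3, 1.4, 1.5, 1.2, 0.6, 0.2, 0.0, 0.2, 0.5, 0.7, 0.8])

X = sm.add_constant(np.column_stack([xrate, wagegr]))
influence = sm.OLS(xport, X).fit().get_influence()

h = influence.hat_matrix_diag                  # hat matrix diagonal elements hi
t = influence.resid_studentized_external       # Studentised deleted residuals ti
d = influence.cooks_distance[0]                # Cook's distance statistics Di

n, k = len(xport), 2
print("hi flagged:", np.where(h > 2 * (k + 1) / n)[0])    # rule: hi > 2(k+1)/n
print("largest |ti|:", round(np.abs(t).max(), 3))          # compare with the t critical value
print("largest Di:", round(d.max(), 4))                    # compare with the F critical value
```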

Cook’s Distance Statistic Di Cook’s Di statistic A statistical method of residual analysis using the F probability distribution that identifies individual cases in the sample data of a multiple regression that have high individual influence on the regression equation.

Cook’s Di statistic, based on both hi and the Studentised residual, is a third criterion for identifying

influential observations. To decide whether an observation flagged by either the hi or the ti criterion is unduly affecting the model, Cook and Weisberg (see reference 2) developed the Cook’s Di statistic.

COOK'S Di STATISTIC

Di = ei²hi / [k MSE (1 − hi)²]

(16.9)

where
ei = the residual for observation i
k = number of independent variables
MSE = mean square error of the regression model fitted
hi = hat matrix diagonal element for observation i

Cook and Weisberg suggest that if Di > Fk+1,n−k−1, the critical value of the F distribution having k + 1 degrees of freedom in the numerator and n − k − 1 degrees of freedom in the denominator at a 0.50 level of significance, then the observation is highly influential on the regression equation and is a candidate for removal. For the car export data, n = 15 and k = 2. Thus, any Di > F3,12 = 0.8353 is flagged. Referring to Figure 16.8, you see that none of the Di values exceed 0.8353, and therefore no observations are identified as influential using Cook's Di statistic.

Overview
This section discussed three criteria for evaluating the influence of each observation on the multiple regression model. The various statistics led to a consistent set of conclusions. According to both the hi and the Di criteria, none of the observations is a candidate for removal. Under such circumstances, most statisticians would conclude that there is insufficient evidence for the removal of any observation from the analysis. In addition to the three criteria presented here, there are other measures of influence (see reference 4). Although different statisticians seem to prefer particular measures, currently there is no consensus as to the 'best' measure.

Problems for Section 16.3 APPLYING THE CONCEPTS You need to use Microsoft Excel to solve problems 16.10–16.14.

16.10 In problem 13.4 on page 509, a financial planner used unemployment rates and share market returns to predict retirement rates. Perform an influence analysis on the results and determine whether any observations should be deleted from the analysis. If necessary, re-analyse the regression model after deleting these observations and compare your results.





16.11 In problem 13.5 on page 509, a researcher used GDP and population density to predict CO2 emissions. Perform an influence analysis on the results and determine whether any observations should be deleted from the analysis. If necessary, re-analyse the regression model after deleting these observations and compare your results. 16.12 In problem 13.6 on page 510, a wine expert used alcohol quantity and amount of chlorides to predict wine quality. Perform an influence analysis on the results and determine whether any observations should be deleted from the analysis. If necessary, re-analyse the regression model after deleting these observations and compare your results.

16.13 In problem 13.7 on page 510, you used GDP per capita and CPI to predict countries’ happiness. Perform an influence analysis on your results and determine whether any observations should be deleted from the analysis. If necessary, re-analyse the regression model after deleting these observations and compare your results. 16.14 In problem 13.8 on page 510, you used radio and newspaper advertising to predict product sales. Perform an influence analysis on the results and determine whether any observations should be deleted from the analysis. If necessary, re-analyse the regression model after deleting these observations and compare your results.

16.4  MODEL BUILDING This chapter and Chapter 13 have introduced you to many different topics in regression analysis, including quadratic terms, dummy variables, interaction terms and influential observations. In this section, you will learn a structured approach to building the most appropriate regression model. As you will see, successful model building incorporates many of the topics you have studied so far. To begin, imagine you are the director of operations for a television station looking for ways to reduce labour expenses. Currently, the unionised graphic artists at the station receive hourly pay for a significant number of hours in which they are idle. These hours are called standby hours. You have collected data concerning standby hours and four factors that you suspect are related to the excessive number of standby hours the station is currently experiencing: the total number of staff present, remote hours, Dubner hours3 and total labour hours. Table 16.2, overleaf, presents the data < STANDBY >. Before you develop a model to predict standby hours, you need to consider the principle of parsimony. Parsimony means that you want to develop a regression model that includes the fewest number of independent variables that permit an adequate interpretation of the dependent variable. Regression models with fewer independent variables are easier to interpret, p­ articularly because they are less likely to be affected by collinearity problems (described in Section 13.7). The selection of an appropriate model when many independent variables are under consideration involves complexities that are not present with a model with only two independent variables. The evaluation of all possible regression models is more computationally complex. Although you can quantitatively evaluate competing models, there may not be a uniquely best model, but rather several equally appropriate models. To begin analysing the standby-hours data, you calculate the variance inflationary factors (see Equation 13.15 on page 535) to measure the amount of collinearity between the independent variables. Figure 16.9 shows Microsoft Excel output of the VIF values along with the regression equation. Observe that all the VIF values are relatively small, ranging from a high of 2.0 for the total labour hours to a low of 1.2 for remote hours. Thus, on the basis of the criteria developed by Snee that all VIF values should be less than 5.0 (see reference 2 in Chapter 13), there is little evidence of collinearity between the set of independent variables.
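A minimal Python sketch of the VIF calculation using statsmodels (illustrative only; the file name and column names are hypothetical stand-ins for the < STANDBY > data):

```python
# Variance inflationary factors for the four standby-hours predictors.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

data = pd.read_csv("standby.csv")          # hypothetical file with the Table 16.2 columns
predictors = ["total_staff", "remote_hours", "dubner_hours", "total_labour_hours"]

X = sm.add_constant(data[predictors])
for i, name in enumerate(predictors, start=1):   # skip column 0, the constant
    print(name, round(variance_inflation_factor(X.values, i), 2))
# VIF values below 5 suggest little evidence of collinearity among the predictors
```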

The Stepwise Regression Approach to Model Building You continue your analysis of the standby hours data by attempting to determine if a subset of all independent variables yields an adequate and appropriate model. The first approach described here is stepwise regression, which attempts to find the ‘best’ regression model without examining all possible models.

LEARNING OBJECTIVE

4

Construct a regression model using either the stepwise or the best-subsets approach

parsimony The process of choosing the simplest model in terms of independent variables that still adequately explains the variation in the dependent variable.

stepwise regression A model-building regression technique to find subsets of independent variables that most adequately predict a dependent variable given the specified criteria for adequacy of model fit.

3

Dubner hours refer to time used on a specific graphic software system.



Table 16.2 Predicting standby hours based on total staff present, remote hours, Dubner hours and total labour hours

Standby Total staff Remote Dubner Total labour Week hours present hours hours hours   1 245 338 414 323 2,001   2 177 333 598 340 2,030   3 271 358 656 340 2,226   4 211 372 631 352 2,154   5 196 339 528 380 2,078   6 135 289 409 339 2,080   7 195 334 382 331 2,073   8 118 293 399 311 1,758   9 116 325 343 328 1,624 10 147 311 338 353 1,889 11 154 304 353 518 1,988 12 146 312 289 440 2,049 13 115 283 388 276 1,796 14 161 307 402 207 1,720 15 274 322 151 287 2,056 16 245 335 228 290 1,890 17 201 350 271 355 2,187 18 183 339 440 300 2,032 19 237 327 475 284 1,856 20 175 328 347 337 2,068 21 152 319 449 279 1,813 22 188 325 336 244 1,808 23 188 322 267 253 1,834 24 197 317 235 272 1,973 25 261 315 164 223 1,839 26 232 331 270 272 1,935

The first step of stepwise regression is to find the best model that uses one independent variable. The next step is to find the best of the remaining independent variables to add to the model selected in the first step. An important feature of the stepwise approach is that an independent variable included in the model at an early stage may subsequently be removed after other independent variables are considered. Thus, in stepwise regression, variables are either added to or deleted from the regression model at each step of the model-building process. The partial F test statistic (see Section 13.5) is used to determine if variables are added or deleted. The stepwise procedure terminates with the selection of a best-fitting model when no additional variables can be added to or deleted from the last model evaluated. Figure 16.10 (page 666) represents Microsoft Excel stepwise regression output for the standby-hours data. For this example, a significance level of 0.05 is used to enter a variable into the model or to delete a variable from the model. The first variable entered into the model is total staff, the variable that correlates most highly with the dependent variable standby hours. Because the p-value of 0.001 is less than 0.05, total staff is included in the regression model. The next step involves selecting a second independent variable for the model. The second variable chosen is one that makes the largest contribution to the model, given that the first variable has been selected. For this model, the second variable is remote hours. Because the p-value of 0.027 for remote hours is less than 0.05, remote hours is included in the regression model. After remote hours is entered into the model, the stepwise procedure determines whether total staff is still an important contributing variable or whether it can be eliminated from the model. Because the p-value of 0.0001 for total staff is less than 0.05, total staff remains in the regression model.
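A simplified forward-selection sketch of the stepwise idea in Python (illustrative only: it only adds variables, whereas the full stepwise procedure can also remove them; X and y are hypothetical names for the standby-hours data):

```python
# Forward selection: at each step add the candidate with the smallest p-value below 0.05.
import statsmodels.api as sm

def forward_stepwise(X, y, alpha=0.05):
    selected = []
    while True:
        remaining = [c for c in X.columns if c not in selected]
        p_values = {}
        for candidate in remaining:
            model = sm.OLS(y, sm.add_constant(X[selected + [candidate]])).fit()
            p_values[candidate] = model.pvalues[candidate]
        if not p_values:
            break
        best = min(p_values, key=p_values.get)
        if p_values[best] >= alpha:
            break                      # no remaining variable meets the entry criterion
        selected.append(best)
    return selected

# Example call (assumes X and y have been loaded from the < STANDBY > data):
# print(forward_stepwise(X[["total_staff", "remote_hours", "dubner_hours",
#                           "total_labour_hours"]], y))
```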





Figure 16.9 Microsoft Excel regression model to predict standby hours based on four independent variables

Variance inflationary factors
                      Total staff       Remote            Dubner            Total labour
Regression statistics and all other X   and all other X   and all other X   and all other X
Multiple R            0.64368           0.43490           0.56099           0.70698
R square              0.41433           0.18914           0.31471           0.49982
Adjusted R square     0.33446           0.07856           0.22126           0.43161
Standard error        16.47151          124.93921         57.55254          114.41183
Observations          26                26                26                26
VIF                   1.70743           1.23325           1.45924           1.99928

Regression output
Regression statistics: Multiple R 0.78935; R square 0.62308; Adjusted R square 0.55128; Standard error 31.83501; Observations 26

ANOVA          df    SS             MS            F         Significance F
Regression      4    35181.79373    8795.44843    8.67857   0.00027
Residual       21    21282.82166    1013.46770
Total          25    56464.61538

               Coefficients   Standard error   t stat     p-value   Lower 95%      Upper 95%
Intercept      -330.83184     110.89536        -2.98328   0.00709   -561.451405    -100.212285
Total staff    1.24563        0.41206          3.02293    0.00647   0.388704       2.102554
Remote         -0.11842       0.05432          -2.17983   0.04080   -0.231392      -0.005444
Dubner         -0.29706       0.11793          -2.51891   0.01995   -0.542310      -0.051807
Total labour   0.13053        0.05932          2.20041    0.03911   0.007166       0.253904

Durbin-Watson calculations
Sum of squared difference of residuals   47241.61261
Sum of squared residuals                 21282.82166
Durbin-Watson statistic                  2.21971

The next step involves selecting a third independent variable for the model. Because none of the other variables meet the 0.05 criterion for entry into the model, the stepwise procedure terminates with a model that includes total staff present and the number of remote hours. This stepwise regression approach to model building was originally developed more than three decades ago, in an era in which regression analysis on mainframe computers involved the costly use of large amounts of processing time. Under such conditions, stepwise regression became widely used, although it provides a limited evaluation of alternative models. With today’s extremely fast personal computers, the evaluation of many different regression models is completed quickly at a very small cost. Thus, a more general way of evaluating alternative regression models, in this era of fast computers, is the best-subsets approach discussed below. Stepwise regression is not obsolete, however. Today, many businesses use stepwise regression as part of a new research technique called data mining, where huge data sets are explored to discover significant statistical relationships between a large number of variables. These data sets are so large that the best-subsets approach is impractical.

The Best-Subsets Approach to Model Building
The best-subsets approach evaluates all possible regression models for a given set of independent variables. Figure 16.11, overleaf, represents Microsoft Excel output summarising all possible regression models for the standby-hours data.
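The enumeration behind a best-subsets analysis is straightforward to sketch in code. The snippet below is an illustration only, not the PHStat add-in used in the text; the column names are hypothetical.

```python
# Minimal sketch of a best-subsets evaluation (not the PHStat add-in used in the text).
# Assumes `data` is a pandas DataFrame with hypothetical columns: 'standby' (Y) plus
# 'staff', 'remote', 'dubner', 'labour' (candidate X's).
from itertools import combinations
import statsmodels.api as sm

def best_subsets(y, X):
    rows = []
    cols = list(X.columns)
    for r in range(1, len(cols) + 1):
        for subset in combinations(cols, r):
            fit = sm.OLS(y, sm.add_constant(X[list(subset)])).fit()
            rows.append({'model': '+'.join(subset),
                         'k': r,
                         'r2': fit.rsquared,
                         'adj_r2': fit.rsquared_adj,
                         'std_error': fit.mse_resid ** 0.5})
    return rows

# Example call:
# results = best_subsets(data['standby'], data[['staff', 'remote', 'dubner', 'labour']])
```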


data mining Sorting through large amounts of data to find relevant information.

best-subsets approach An approach to multiple regression model development that evaluates all possible models for a given set of independent variables.


Figure 16.10 Microsoft Excel stepwise regression output for the standby-hours data

Stepwise analysis for standby hours – table of results for general stepwise

Total staff entered
ANOVA          df    SS             MS             F          Significance F
Regression      1    20667.39798    20667.39798    13.85632   0.00106
Residual       24    35797.21741    1491.55073
Total          25    56464.61538

               Coefficients   Standard error   t stat     p-value   Lower 95%     Upper 95%
Intercept      -272.38165     124.24020        -2.19238   0.03829   -528.80077    -15.96253
Total staff    1.42405        0.38256          3.72241    0.00106   0.63448       2.21362

Remote entered
ANOVA          df    SS             MS             F          Significance F
Regression      2    27662.54287    13831.27143    11.04501   0.00043
Residual       23    28802.07251    1252.26402
Total          25    56464.61538

               Coefficients   Standard error   t stat     p-value   Lower 95%     Upper 95%
Intercept      -330.67483     116.48022        -2.83889   0.00930   -571.63220    -89.71747
Total staff    1.76846        0.37904          4.65619    0.00011   0.98077       2.54896
Remote         -0.13897       0.05880          -2.36347   0.02693   -0.26060      -0.01733

No other variables could be entered into the model. Stepwise ends.

Figure 16.11 Microsoft Excel best-subsets regression output for the standby-hours data
Note: For an extremely small R², it is possible to get a negative adjusted R².

Best subsets analysis for standby hours
Intermediate calculations: R²T 0.62308; 1 - R²T 0.37692; n 26; T 5; n - T 21

Model       Cp          k + 1   R square   Adj. R square   Std error
X1          13.32152    2       0.36602     0.33961        38.62060
X1X2         8.41933    3       0.48991     0.44555        35.38734
X1X2X3       7.84181    4       0.53617     0.47292        34.50286
X1X2X3X4     5.00000    5       0.62308     0.55128        31.83501
X1X2X4       9.34492    4       0.50919     0.44227        35.49212
X1X3        10.64856    3       0.44990     0.40206        36.74905
X1X3X4       7.75166    4       0.53779     0.47476        34.44263
X1X4        14.79818    3       0.37542     0.32111        39.15789
X2          33.20781    2       0.00909    -0.03220        48.28359
X2X3        32.30673    3       0.06116    -0.02048        48.00868
X2X3X4      12.13813    4       0.45906     0.38529        37.26076
X2X4        23.24809    3       0.22375     0.15625        43.65405
X3          30.38835    2       0.05970     0.02052        47.03452
X3X4        11.82309    3       0.42882     0.37915        37.44658
X4          24.18460    2       0.17105     0.13651        44.16192


A criterion often used in model building is the adjusted R², which adjusts the R² of each model to account for the number of independent variables in the model as well as for the sample size (see Section 13.1). Because model building requires you to compare models with different numbers of independent variables, the adjusted R² is more appropriate than R². Referring to Figure 16.11, you see that the adjusted R² reaches a maximum value of 0.551 when all four independent variables (X1X2X3X4) plus the intercept term (for a total of five estimated parameters) are included in the model. Note too that the model with the highest adjusted R² also has the lowest standard error of the estimate; in this example, the four-variable model's standard error (31.835) is lower than that of every other model.

A second criterion often used in the evaluation of competing models is the Cp statistic developed by Mallows (see reference 1). The Cp statistic, defined in Equation 16.10, measures the differences between a fitted regression model and a true model, along with random error.

Cp statistic Test for determining which combination of independent X variables is best to use in a multiple regression model.

THE Cp STATISTIC

Cp = [(1 − R²k)(n − T) / (1 − R²T)] − [n − 2(k + 1)]   (16.10)

where
k = number of independent variables included in a regression model
T = total number of parameters (including the intercept) to be estimated in the full regression model
R²k = coefficient of multiple determination for a regression model that has k independent variables
R²T = coefficient of multiple determination for a full regression model that contains all T estimated parameters

Using Equation 16.10 to calculate Cp for the model containing total staff present and remote hours (X1X2):

n = 26   k = 2   T = 4 + 1 = 5   R²k = 0.490   R²T = 0.623

so that:

Cp = [(1 − 0.490)(26 − 5) / (1 − 0.623)] − [26 − 2(2 + 1)]
   = 8.42
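Equation 16.10 is easy to script as a check on this arithmetic. The sketch below simply evaluates the formula, using the unrounded R² values from Figure 16.11; it assumes nothing beyond the quantities already defined above.

```python
# Mallows' Cp statistic (Equation 16.10), evaluated for the worked example above.
def mallows_cp(r2_k, r2_t, n, k, t):
    """Cp = (1 - R²k)(n - T) / (1 - R²T) - [n - 2(k + 1)]."""
    return (1 - r2_k) * (n - t) / (1 - r2_t) - (n - 2 * (k + 1))

# Model with total staff and remote hours (X1X2): k = 2, T = 5,
# R²k = 0.48991 and R²T = 0.62308 (unrounded values from Figure 16.11).
print(round(mallows_cp(r2_k=0.48991, r2_t=0.62308, n=26, k=2, t=5), 2))   # 8.42
```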

When a regression model with k independent variables contains only random differences from a true model, the mean value of Cp is k + 1, the number of parameters. Thus, in evaluating many alternative regression models, the goal is to find models whose Cp is close to or less than k + 1. In Figure 16.11, you see that only the model with all four independent variables has a Cp value close to or below k + 1. Therefore, you should choose this model. Although it was not the case here, the Cp statistic often provides several alternative models for you to evaluate in greater depth using other criteria such as parsimony, interpretability and departure from model assumptions (as evaluated by residual analysis). The model selected using stepwise regression has a Cp value of 8.4, which is substantially above the suggested criterion of k + 1 = 3 for that model.

Since the data were collected in time order, you need to calculate the Durbin–Watson statistic to determine whether there is autocorrelation in the residuals (see Section 12.6). From Figure 16.9, you see that the Durbin–Watson D statistic is 2.22. Since D is greater than 2.0, there is no indication of positive autocorrelation in the residuals, although there may be slight negative autocorrelation.

When you have finished selecting the independent variables to include in the model, you should perform a residual analysis to evaluate the regression assumptions. Figure 16.12, overleaf, presents the Microsoft Excel residual analysis output. None of the plots of the residuals versus total staff, remote hours, Dubner hours and total labour hours reveals an apparent pattern.
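The Durbin–Watson check described above can be reproduced outside Excel. A minimal sketch, assuming `fit` is a fitted statsmodels OLS results object for the four-variable standby-hours model:

```python
# Minimal sketch of the Durbin-Watson check, assuming `fit` is a fitted statsmodels
# OLS results object for the four-variable standby-hours model.
from statsmodels.stats.stattools import durbin_watson

d = durbin_watson(fit.resid)   # sum of squared successive residual differences / SSE
print(round(d, 2))             # about 2.22 for the model in Figure 16.9
```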


Figure 16.12 Microsoft Excel residual plots for the standby-hours data
(Four panels: residuals versus total staff; residuals versus remote hours; residuals versus Dubner hours; residuals versus total labour hours.)

In addition, a histogram of the residuals (not shown here) indicates only moderate departure from normality. Because the residual analysis appears to confirm the aptness of the model, you can now use various influence measures to determine whether any of the values unduly influence the regression equation. Figure 16.13 presents the values of the hi, ti and Cook's Di statistics for the fitted model, with certain observations highlighted.

Using the decision rule suggested by Hoaglin and Welsch (see Section 16.3), you flag any hi value greater than 2(k + 1)/n = 2(4 + 1)/26 = 0.3846. Referring to Figure 16.13, h6 = 0.4049, h9 = 0.4537 and h11 = 0.5217 are greater than 0.3846, so these observations are candidates for deletion from the analysis.

Now consider the Studentised deleted residual measure ti. Using the decision rule suggested by Hoaglin and Welsch (see Section 16.3), you flag any ti value greater than 1.7247 or less than −1.7247 (see Table E.3 to find the critical t value with n − k − 2 = 26 − 4 − 2 = 20 degrees of freedom). Referring to Figure 16.13, you see that t3 = 1.7411, t11 = 2.0673, t17 = −2.0072 and t19 = 2.1029. Thus, these observations may have an adverse effect on the model. Observation 11 was also flagged according to the hi criterion.

Now you should consider a third criterion, Cook's Di statistic, which is based on hi and the standardised residual. For the model in which k = 4 and n = 26, using the decision rule suggested by Cook and Weisberg (see Section 16.3), you flag any Di greater than 0.899, the critical value of the F statistic having 5 and 21 degrees of freedom at the 0.50 level of significance (see Table E.12). Referring to Figure 16.13, none of the Di values exceeds 0.899, although Di for observation 11 is 0.807. So, according to Cook's Di statistic, no values are candidates for deletion. Hence, you have no clear basis for removing any observations from the analysis. Thus, from Figure 16.9 on page 665, the regression equation is:

Ŷi = −330.83 + 1.2456X1i − 0.1184X2i − 0.2971X3i + 0.1305X4i

Example 16.3 presents a situation where there are several alternative models in which the Cp statistic is close to or less than k + 1.
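The same three influence measures and cut-offs can be sketched in code. This is an illustration only, assuming `fit` is a fitted statsmodels OLS results object with k = 4 predictors and n = 26 observations; it is not the Excel output shown in Figure 16.13.

```python
# Minimal sketch of the influence analysis described above, assuming `fit` is a
# fitted statsmodels OLS results object (k = 4 predictors, n = 26 observations).
from scipy import stats

influence = fit.get_influence()
h = influence.hat_matrix_diag                 # hat values hi
t_del = influence.resid_studentized_external  # Studentised deleted residuals ti
cooks_d = influence.cooks_distance[0]         # Cook's Di

n, k = int(fit.nobs), fit.df_model
h_cut = 2 * (k + 1) / n                               # Hoaglin-Welsch rule: 2(k + 1)/n
t_cut = stats.t.ppf(0.95, df=n - k - 2)               # upper-tail critical value, about 1.7247
d_cut = stats.f.ppf(0.50, dfn=k + 1, dfd=n - k - 1)   # Cook-Weisberg rule, about 0.899

flagged_h = [i + 1 for i, v in enumerate(h) if v > h_cut]
flagged_t = [i + 1 for i, v in enumerate(t_del) if abs(v) > t_cut]
flagged_d = [i + 1 for i, v in enumerate(cooks_d) if v > d_cut]
print(flagged_h, flagged_t, flagged_d)   # observation numbers flagged by each criterion
```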


Figure 16.13 Excel influence statistics for the standby-hours data
(Columns C1–C5 of the worksheet repeat the standby, staff, remote, Dubner and labour data of Table 16.2; TRES1, HI1 and COOK1 are the Studentised deleted residuals ti, the hat values hi and the Cook's Di statistics.)

Obs   TRES1 (ti)   HI1 (hi)    COOK1 (Di)
 1     1.26648     0.057851    0.019147
 2    -0.00450     0.159009    0.000001
 3     1.74109     0.308626    0.246769
 4    -0.88635     0.317663    0.073903
 5     0.28517     0.117963    0.002275
 6    -0.66415     0.404904    0.061665
 7    -0.55135     0.066692    0.004493
 8    -0.20251     0.177819    0.001859
 9    -1.38665     0.453697    0.305928
10    -0.36082     0.080160    0.002367
11     2.06732     0.521708    0.806606
12    -0.50740     0.239542    0.016814
13    -0.47510     0.267786    0.017142
14    -0.20841     0.219149    0.002554
15     1.42189     0.241543    0.122799
16     0.84801     0.155157    0.026771
17    -2.00716     0.240144    0.222548
18    -1.06250     0.073308    0.017752
19     2.10290     0.101265    0.085690
20    -1.02770     0.071640    0.016257
21    -0.49356     0.105621    0.005969
22    -0.31659     0.107193    0.002515
23    -0.48404     0.100514    0.005434
24    -0.52597     0.123908    0.008105
25     1.63790     0.193025    0.118818
26     0.34618     0.094110    0.002599

EXAMPLE 16.3
CHOOSING BETWEEN ALTERNATIVE REGRESSION MODELS
Given the output in Table 16.3 from a best-subsets regression analysis of a regression model with seven independent variables, determine which regression model you would choose as the best model.

SOLUTION
From Table 16.3, overleaf, you need to determine which models have Cp values that are less than or close to k + 1. Two models meet this criterion. The model with six independent variables (X1, X2, X3, X4, X5, X6) has a Cp value of 6.8, which is less than k + 1 = 6 + 1 = 7, and the full model with seven independent variables (X1, X2, X3, X4, X5, X6, X7) has a Cp value of 8.0. One way you can choose between models that meet these criteria is to determine whether the models contain a subset of variables that are common. Then you test whether the contribution of the additional variables is significant. In this case, because the models differ only by the inclusion of variable X7 in the full model, you test whether variable X7 makes a significant contribution to the regression model given that the variables X1, X2, X3, X4, X5 and X6 are already included in the model. If the contribution is statistically significant, then you should include variable X7 in the regression model. If variable X7 does not make a statistically significant contribution, you should not include it in the model.


Table 16.3 Partial output from best-subsets regression

Number of variables   R²(%)   Adjusted R²(%)   Cp      Variables included
1                     12.1    11.9             113.9   X4
1                      9.3     9.0             130.4   X1
1                      8.3     8.0             136.2   X3
2                     21.4    21.0              62.1   X3 X4
2                     19.1    18.6              75.6   X1 X3
2                     18.1    17.7              81.0   X1 X4
3                     28.5    28.0              22.6   X1 X3 X4
3                     26.8    26.3              32.4   X3 X4 X5
3                     24.0    23.4              49.0   X2 X3 X4
4                     30.8    30.1              11.3   X1 X2 X3 X4
4                     30.4    29.7              14.0   X1 X3 X4 X6
4                     29.6    28.9              18.3   X1 X3 X4 X5
5                     31.7    30.8               8.2   X1 X2 X3 X4 X5
5                     31.5    30.6               9.6   X1 X2 X3 X4 X6
5                     31.3    30.4              10.7   X1 X3 X4 X5 X6
6                     32.3    31.3               6.8   X1 X2 X3 X4 X5 X6
6                     31.9    30.9               9.0   X1 X2 X3 X4 X5 X7
6                     31.7    30.6              10.4   X1 X2 X3 X4 X6 X7
7                     32.4    31.2               8.0   X1 X2 X3 X4 X5 X6 X7
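The partial F test described in the solution to Example 16.3 can be sketched as follows. This is an illustration with hypothetical data and column names (y, x1–x7), not the data behind Table 16.3.

```python
# Minimal sketch of the partial F test from Example 16.3, assuming `data` is a
# hypothetical pandas DataFrame with columns y and x1..x7.
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

reduced = smf.ols('y ~ x1 + x2 + x3 + x4 + x5 + x6', data=data).fit()
full = smf.ols('y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7', data=data).fit()

# anova_lm on the nested models reports the partial F statistic and its p-value for x7.
print(anova_lm(reduced, full))
```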

Exhibit 16.1 summarises the steps involved in model building.

EXHIBIT 16.1 STEPS INVOLVED IN MODEL BUILDING
1. Compile a listing of all independent variables under consideration.
2. Fit a regression model that includes all the independent variables under consideration and determine the variance inflationary factor (VIF) for each independent variable.
3. Determine whether any independent variables have a VIF > 5. Three possible results can occur:
   a. None of the independent variables have a VIF > 5; proceed to step 4.
   b. One of the independent variables has a VIF > 5; eliminate that independent variable and proceed to step 4.
   c. More than one of the independent variables has a VIF > 5; eliminate the independent variable that has the highest VIF, and go back to step 2.
4. Perform a best-subsets regression with the remaining independent variables and determine the Cp statistic and/or the adjusted R² for each model.
5. List all models that have Cp close to or less than k + 1 and/or a high adjusted R².
6. From those models listed in step 5, choose a best model.
7. Perform a complete analysis of the model chosen, including a residual analysis and an influence analysis.
8. Depending on the results of the residual analysis and influence analysis, add quadratic terms, transform variables, delete individual observations if necessary and re-analyse the data.
9. Use the selected model for prediction and inference.
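Steps 2 and 3 of Exhibit 16.1 (the VIF screen) can be sketched in code. This is an illustration only, assuming `X` is a pandas DataFrame of the candidate independent variables (names are hypothetical); it is not the PHStat procedure described in the Excel Guide.

```python
# Minimal sketch of steps 2-3 of Exhibit 16.1 (VIF screening), assuming `X` is a
# pandas DataFrame of candidate independent variables (hypothetical names).
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(X):
    Xc = sm.add_constant(X)
    # Skip the constant; report one VIF per independent variable.
    return {col: variance_inflation_factor(Xc.values, i)
            for i, col in enumerate(Xc.columns) if col != 'const'}

# Example: repeatedly drop the variable with the largest VIF while any VIF exceeds 5.
# vifs = vif_table(X)
# while max(vifs.values()) > 5:
#     X = X.drop(columns=max(vifs, key=vifs.get))
#     vifs = vif_table(X)
```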

Figure 16.14 represents a road map for the steps involved in model building.


Figure 16.14 Road map for model building
(Flowchart: choose the independent variables to be considered; run a regression model with all independent variables to find the VIFs; if more than one X variable has a VIF > 5, eliminate the X variable with the largest VIF and rerun; if exactly one has a VIF > 5, eliminate that X variable; when no VIF exceeds 5, run best-subsets regression to obtain the best models with k terms for a given number of independent variables; list all models that have Cp close to or less than k + 1; choose a 'best' model from among these models; do a complete analysis of this model, including residual analysis and influence analysis; depending on the results, add quadratic terms, transform variables and/or delete individual observations as necessary; use the selected model for prediction and confidence interval estimation.)


Model Validation
The final step in the model-building process is to validate the selected regression model. This step involves checking the model against data that were not part of the sample analysed. There are several ways of validating a regression model:
• Collect new data and compare the results.
• Compare the results of the regression model with previous results.
• If the data set is large, split the data into two parts and cross-validate the results.

cross-validation An in-sample method for assessing the validity of a multiple regression model, using half the data to test the accuracy of prediction, having generated the regression model with the other half of the sample data.

Perhaps the best way of validating a regression model is by collecting new data. If the results with new data are consistent with the selected regression model, you have strong reason to believe that the fitted regression model is applicable in a wide set of circumstances. If it is not possible to collect new data, you can use one of the two other approaches. One possibility is to compare your regression coefficients and predictions with previous results. Another approach, called cross-validation, that you can use when the data set is large enough is to split the data into two parts. You use the first part of the data to develop the regression model. You use the second part of the data to evaluate the predictive ability of the regression model.
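The cross-validation idea can be sketched with a simple holdout split. This is an illustration only, assuming a hypothetical pandas DataFrame `data` with a response column 'y' and predictors 'x1' and 'x2'.

```python
# Minimal sketch of cross-validation: fit on one half of the data and assess
# prediction on the other half. Column names 'y', 'x1', 'x2' are hypothetical.
import numpy as np
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
idx = rng.permutation(len(data))
half = len(data) // 2
fit_part, holdout = data.iloc[idx[:half]], data.iloc[idx[half:]]

model = smf.ols('y ~ x1 + x2', data=fit_part).fit()
pred = model.predict(holdout)
rmse = np.sqrt(np.mean((holdout['y'] - pred) ** 2))   # predictive accuracy on the holdout half
print(rmse)
```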

Problems for Section 16.4

LEARNING THE BASICS
16.15 You are considering four independent variables for inclusion in a regression model. You select a sample of 30 observations with the following results:
• The model that includes independent variables A and B has a Cp value equal to 4.6.
• The model that includes independent variables A and C has a Cp value equal to 2.4.
• The model that includes independent variables A, B and C has a Cp value equal to 2.7.
a. Which models meet the criterion for further consideration? Explain.
b. How would you compare the model that contains independent variables A, B and C with the model that contains independent variables A and B? Explain.
16.16 You are considering six independent variables for inclusion in a regression model. You select a sample of 40 observations with the following results:
n = 40  k = 2  T = 6 + 1 = 7  R²k = 0.274  R²T = 0.623
a. Calculate the Cp value for this two-independent-variable model.
b. Based on your answer to (a), does this model meet the criterion for further consideration as the best model? Explain.

APPLYING THE CONCEPTS
You need to use Microsoft Excel to solve problems 16.17 and 16.18.

16.17 The human resources (HR) director for a large company that produces highly technical industrial instrumentation devices is interested in using regression modelling to help in recruiting decisions concerning the company’s sales managers.

The company has 45 sales regions, each headed by a sales manager. Many of the sales managers have degrees in electrical engineering, and due to the technical nature of the product line several company officials believe that only applicants with degrees in electrical engineering should be considered. At the time of their application, candidates are asked to take the Strong–Campbell interest inventory test and the Wonderlic personnel test. Due to the time and money involved with the testing, some discussion has taken place about dropping one or both of the tests. To start, the HR director gathered information on each of the 45 current sales managers, including years of selling experience, electrical engineering background and the scores from both the Wonderlic and Strong–Campbell tests. The dependent variable was ‘sales-index’ score, which is the ratio of the regions’ actual sales divided by the target sales. The target values are constructed each year by upper management in consultation with the sales managers, and are based on past performance and market potential within each region. The data file contains information on the 45 current sales managers. The variables included are: • Sales: ratio of yearly sales divided by the target sales value for that region. The target values were mutually agreed upon ‘realistic expectations’. • Wonder: score from the Wonderlic personnel test. The higher the score, the higher the applicant’s perceived ability to manage. • SC: score on the Strong–Campbell interest inventory test. The higher the score, the higher the applicant’s perceived interest in sales. • Experience: number of years of selling experience before becoming a sales manager.


• Engineer: dummy variable that equals 1 if the sales manager has a degree in electrical engineering and 0 otherwise. a. Develop the most appropriate model to predict sales. b. Do you think that the company should continue administering the Wonderlic and Strong–Campbell tests? Explain. c. Do the data support the argument that electrical engineers outperform the other sales managers? Would you support the idea only to hire electrical engineers? Explain. d. How important is previous selling experience? Explain. e. Discuss in detail how the HR director should incorporate the regression model you developed into the recruiting process. 16.18 Accounting Today identified top public accounting firms in ten geographic regions across the United States. The file < ACCOUNTING_PARTNERS6 > contains data for public accounting firms in the Southeast, Gulf Coast and Capital Region. The variables are: • revenue (millions of dollars) • number of partners in the firm • number of professionals in the firm

• proportion of business dedicated to management advisory services (MAS%) • whether the firm is located in the Southeast Region (0 = no, 1 = yes) • whether the firm is located in the Gulf Coast Region (0 = no, 1 = yes) (data obtained from ) Develop the most appropriate multiple regression model to predict business revenue. Be sure to perform a thorough residual and influence analysis. In addition, provide a detailed explanation of the results. 16.19 In problem 13.4 on page 509, a financial planner used unemployment rates and share market returns to predict retirement rates. Develop the best model from the available data. 16.20 In problem 13.6 on page 510, a wine expert used alcohol content and amount of chlorides to predict wine quality. Develop the best model from the available data. 16.21 In problem 13.7 on page 510, GDP per capita and CPI were used to predict countries’ happiness. Develop the best model from the available data.

16.5  PITFALLS IN MULTIPLE REGRESSION AND ETHICAL ISSUES

Pitfalls in Multiple Regression
Model building is an art as well as a science. Different individuals may not always agree on the best multiple regression model. Nevertheless, you should use the process described in Exhibit 16.1 on page 670. In doing so, you must avoid certain pitfalls that can interfere with the development of a useful model. Section 12.9 discussed pitfalls in simple linear regression and strategies for avoiding them. Now that you have studied a variety of multiple regression models, you need to take some additional precautions. To avoid pitfalls in multiple regression:
• Interpret the regression coefficient for a particular independent variable from a perspective in which the values of all other independent variables are held constant.
• Evaluate residual plots for each independent variable.
• Evaluate interaction terms.
• Calculate the VIF for each independent variable before determining which independent variables to include in the model.
• Examine several alternative models using best-subsets regression.
• Use influence analysis to determine whether to remove any observations from the analysis.

LEARNING OBJECTIVE Recognise the many pitfalls involved in developing a multiple regression model

Ethical Considerations
Ethical considerations arise when a user who wants to make predictions manipulates the development process of the multiple regression model. The key here is intent. In addition to the situations discussed in Section 12.9, unethical behaviour occurs when someone uses multiple regression analysis and wilfully fails to remove from consideration variables that exhibit a high collinearity with other independent variables, or wilfully fails to use methods other than least-squares regression when the assumptions necessary for least-squares regression are seriously violated.


16 Assess your progress

Summary
In this chapter, various multiple regression topics were considered (see Figure 16.15), including quadratic regression models, interactions, and square-root and log transformations. A number of criteria were presented to examine the influence of each individual observation on the results. In addition, the best-subsets and stepwise regression approaches to model building were detailed.

You have learned how suburban ratings can be used to derive a measure of income distribution. You also learned how a director of operations at a television station could build a multiple regression model as an aid to reducing labour expenses.

Key formulas

The quadratic regression model
Yi = β0 + β1X1i + β2X²1i + εi   (16.1)

Quadratic regression equation
Ŷi = b0 + b1X1i + b2X²1i   (16.2)

Regression model with a square-root transformation
Yi = β0 + β1√X1i + εi   (16.3)

Original multiplicative model
Yi = β0 X1i^β1 X2i^β2 εi   (16.4)

Transformed multiplicative model
log Yi = log(β0 X1i^β1 X2i^β2 εi)
       = log β0 + β1 log X1i + β2 log X2i + log εi   (16.5)

Original exponential model
Yi = e^(β0 + β1X1i + β2X2i) εi   (16.6)

Transformed exponential model
ln Yi = ln(e^(β0 + β1X1i + β2X2i) εi)
      = β0 + β1X1i + β2X2i + ln εi   (16.7)

Studentised deleted residual
ti = ei √[(n − k − 1) / (SSE(1 − hi) − e²i)]   (16.8)

Cook's Di statistic
Di = [e²i / (k MSE)] × [hi / (1 − hi)²]   (16.9)

The Cp statistic
Cp = [(1 − R²k)(n − T) / (1 − R²T)] − [n − 2(k + 1)]   (16.10)

Key terms
best-subsets approach 665
Cook's Di statistic 662
Cp statistic 667
cross-validation 672
data mining 665
hat matrix diagonal elements hi 661
logarithmic transformation 658
parsimony 663
quadratic regression model 651
square-root transformation 657
stepwise regression 663
Studentised deleted residual 661




Figure 16.15 Road map for multiple regression
(Flowchart: fit a multiple regression model, possibly with dummy variables, quadratic terms, transformed variables and/or interaction terms, or use model building via stepwise or best-subsets regression with the adjusted R² and Cp criteria; determine and interpret the regression coefficients; check collinearity; perform residual analysis and influence analysis (hi, t*i, Di) to decide whether the model is apt and whether there are consistent candidates for removal; test overall model significance H0: β1 = β2 = ... = βk = 0; if the model is significant, test portions of the model; where βi is significant, estimate βi, estimate μ and predict Y, and use the model for prediction and estimation.)


References
1. Kutner, M., C. Nachtsheim, J. Neter & W. Li, Applied Linear Statistical Models, 5th edn (New York: McGraw-Hill/Irwin, 2005).
2. Cook, R. D. & S. Weisberg, Residuals and Influence in Regression (New York: Chapman & Hall, 1982).
3. Hoaglin, D. C. & R. Welsch, 'The hat matrix in regression and ANOVA', American Statistician, 32 (1978): 17–22.
4. Hocking, R. R., 'Developments in linear regression methodology: 1959–1982', Technometrics, 25 (1983): 219–250.

Chapter review problems

CHECKING YOUR UNDERSTANDING
16.22 How can you measure the influence that individual observations have on a particular regression model?
16.23 What is the difference between stepwise regression and best-subsets regression?
16.24 How do you choose between models according to the Cp statistic in best-subsets regression?
16.25 Describe the steps to follow in model validation.

APPLYING THE CONCEPTS
You need to use Microsoft Excel to solve problems 16.26–16.29.

16.26 A business analyst suspects a quadratic relationship exists between investor confidence (1 to 10) and exchange rates. A sample of 12 months of data is taken. < INVESTOR >

a. Construct a scatter diagram for investor confidence and exchange rates. b. State the quadratic regression equation. c. Predict investor confidence for an exchange rate of 85. d. At the 0.05 level of significance, is there a significant quadratic relationship between investor confidence and exchange rates? e. At the 0.05 level of significance, determine whether the quadratic model is a better fit than the linear model. f. Interpret the meaning of the coefficient of multiple determination. g. Calculate the adjusted R 2. 16.27 Hemlock Farms is a community located in the Pocono Mountains area of eastern Pennsylvania. The file < HEMLOCK_FARMS > contains information on homes that were recently for sale. The variables included were: • list price – asking price of the house • hot tub – whether the house has a hot tub (0 = no, 1 = yes)

• lake view – whether the house has a lake view (0 = no, 1 = yes) • bathrooms – number of bathrooms • bedrooms – number of bedrooms • loft/den – whether the house has a loft or den (0 = no, 1 = yes) • finished basement – whether the house has a finished basement (0 = no, 1 = yes) • acres – number of acres for the property Develop the most appropriate multiple regression model to predict the asking price. Be sure to perform a thorough residual and influence analysis. In addition, provide a detailed explanation of your results. 16.28 An agricultural industry consultant has been contracted to build a model to predict the demand for chicken in Malaysia. He postulates that demand for chicken may be related to the price of chicken, the price of duck, industry advertising expenditure and average income. < CHICKEN > Develop the most appropriate multiple regression model to predict the demand for chicken in Malaysia. Make sure you perform a thorough residual analysis. In addition, provide a detailed explanation of your results. 16.29 Nassau County is located approximately 25 miles east of New York City. The file < GLEN_COVE > contains a sample of 30 single-family homes located in Glen Cove. Variables included are the fair market value, land area of the property (acres), interior size of the house (square feet), age (years), number of rooms, number of bathrooms and number of cars that can be parked in the garage. Develop the most appropriate multiple regression model to predict fair market value.


Continuing cases

Tasman University
The Student News Service at Tasman University (TU) wishes to use the data it has collected to see whether a student's weighted average mark (WAM) and gender help to predict their expected salary.
a. Use the data stored in < TASMAN_UNIVERSITY_BBUS_STUDENT_SURVEY > to develop the best model to predict BBus students' expected salary.
b. Use the data stored in < TASMAN_UNIVERSITY_MBA_STUDENT_SURVEY > to develop the best model to predict MBA students' expected salary.

As Safe as Houses
To analyse the real estate market in non-capital cities and towns in states A and B, Safe-As-Houses Real Estate, a large national real estate company, has collected samples of recent residential sales from a sample of non-capital cities and towns in these states.

Use the data stored in < REAL_ESTATE > to develop the best model to predict house and unit prices. Prepare a report to summarise your recommendations.

Chapter 16 Excel Guide

EG16.1 THE QUADRATIC REGRESSION MODEL

Key technique  Use the exponential operator (^) in a column of formulas to create a quadratic term.

Figure EG16.1  Format Trendline pane

Example  Create the quadratic term for family income in Section 16.1 and construct the Figure 16.1 scatter plot that shows the quadratic relationship between suburb rating and family income on page 652. To create the quadratic term, open the Sub_House file. That worksheet contains the independent variable family income in column C and the dependent variable suburb rating in column B. Enter the label FI^2 in cell D1 and then enter the formula =C2^2 in cell D2. Copy this formula down the column through all the data rows. Perform a regression analysis using this new variable by adapting the Section EG13.1 PHStat instructions. Then adapt the appropriate Section EG2.5 instructions to construct a scatter plot. Select that chart. Then select Design ➔ Add Chart Element ➔ Trendline ➔ More Trendline Options and in the Format Trendline pane (shown in Figure EG16.1) click Polynomial and check Display Equation on chart in the Format Trendline pane.


In Excel versions older than Excel 2013, select Layout ➔ Trendline ➔ More Trendline Options and in the Format Trendline dialog box (similar to the Format Trendline Pane), click Trendline Options in the left pane and in the Trendline Options right pane, click Polynomial, check Display Equation on chart, and click OK. While the quadratic term FI^2 could be created in any column, placing independent variables in adjacent columns is a best practice and mandatory if you use the Analysis ToolPak Regression procedure.

EG16.2 USING TRANSFORMATIONS IN REGRESSION MODELS The Square-Root Transformation To the worksheet that contains your regression data, add a new column of formulas that computes the square root of one of the independent variables to create a square-root transformation. For example, to create a square-root transformation in blank column E for an independent variable in column C, enter the formula =SQRT(C2) in cell E2 of that worksheet and copy the formula down through all data rows. If the rightmost column in the worksheet contains the dependent variable, first select that column, right-click, click Insert from the shortcut menu and place the transformation in that new column.

Example  Perform the Figure 16.10 stepwise analysis for the standby-hours data shown on page 666.
PHStat  Use Stepwise Regression. For the example, open the Standby file. Select PHStat ➔ Regression ➔ Stepwise Regression. In the procedure's dialog box (shown in Figure EG16.2):
1. Enter A1:A27 as the Y Variable Cell Range.
2. Enter B1:E27 as the X Variables Cell Range.
3. Check First cells in both ranges contain label.
4. Enter 95 as the Confidence level for regression coefficients.
5. Click p values as the Stepwise Criteria.
6. Click General Stepwise and keep the pair of .05 values as the p value to enter and the p value to remove.
7. Enter a Title and click OK.

The Log Transformation To the worksheet that contains your regression data, add a new column of formulas that computes the common (base 10) logarithm or natural logarithm (base e) of one of the independent variables to create a log transformation. For example, to create a common logarithm transformation in blank column F for a variable in column B, enter the formula =LOG(B2) in cell F2 of that worksheet and copy the formula down through all data rows.

EG16.3 COLLINEARITY

PHStat  To compute the variance inflationary factor (VIF), use the ‘Interpreting the regression coefficients’ PHStat instructions in Section EG13.1 on page 541, but modify step 6 by checking Variance Inflationary Factor (VIF) before you click OK. The VIF will appear in cell B9 of the regression results worksheet, immediately following the Regression Statistics area.

EG16.4  MODEL BUILDING

The Stepwise Regression Approach to Model Building
Key technique  Use PHStat to perform a stepwise analysis.

Figure EG16.2  Stepwise Regression dialog box

This procedure may take more than a few seconds to construct its results. The procedure finishes when the statement ‘Stepwise ends’ is added to the stepwise regression results worksheet (shown in row 26 in Figure 16.10 on page 666).

The Best-Subsets Approach to Model Building
Key technique  Use PHStat to perform a best-subsets analysis.
Example  Perform the Figure 16.11 best-subsets analysis for the standby-hours data shown on page 666.


PHStat  Use Best Subsets. For the example, open the Standby file. Select PHStat ➔ Regression ➔ Best Subsets. In the procedure's dialog box (shown in Figure EG16.3):
1. Enter A1:A27 as the Y Variable Cell Range.
2. Enter B1:E27 as the X Variables Cell Range.
3. Check First cells in each range contains label.
4. Enter 95 as the Confidence level for regression coefficients.
5. Enter a Title and click OK.
This procedure constructs many regression results worksheets (which may be seen as a flickering in the Excel window) as it evaluates each subset of independent variables.

Figure EG16.3  Best Subsets dialog box

Microsoft® product screen shots are reprinted with permission from Microsoft Corporation.


CHAPTER 17
Decision making

SELECTING A SOCIAL MEDIA PLATFORM
The 2017 Sensis Social Media report showed that 79% of Australians are now using social media. Therefore, it is not surprising to find that marketers are turning to online social media campaigns to advertise products and bypassing traditional media such as newspapers and magazines. Besides running direct promotions, they are paying influencers to showcase products using photos and blogs and offering discounts for those who sign up to receive regular offers.
As the marketing manager of an Australian fashion brand, you are responsible for deciding the most effective use of the marketing budget. You currently need to decide between advertising campaigns on two social media platforms. Your company's board expects a large return on this investment and, at the same time, wants to minimise the company's risk. A researcher for your company has evaluated the potential one-year returns for campaigns on both platforms under four economic conditions: recession, stability, moderate growth and boom. She has also estimated the probability of each economic condition occurring. How can you use the information provided by the researcher to determine which social media platform to choose in order to maximise return and minimise risk?
© Kaspars Grinvalds/Shutterstock





LEARNING OBJECTIVES

After studying this chapter, you should be able to:
1 use payoff tables and decision trees to evaluate alternative courses of action
2 use several criteria to select an alternative course of action
3 use Bayes' theorem to revise probabilities in light of sample information
4 recognise the concept of utility

In Chapter 4, you studied various rules of probability and used Bayes' theorem to revise probabilities. In Chapter 5, you learned about discrete probability distributions and how to calculate the expected value. In this chapter these probability rules and probability distributions are applied to a decision-making process for evaluating alternative courses of action. In this context, you can consider the four basic features of a decision-making situation:
1. Alternative courses of action. The decision maker must have two or more possible choices to evaluate before selecting one course of action. For example, as the marketing manager of an Australian fashion company in the scenario at the beginning of the chapter, you must decide whether to run a marketing campaign on social media platform A or B.
2. Events or states of the world. The decision maker must list the events that can occur and consider each event's probability of occurring. To aid in selecting which platform to use in the scenario, a researcher for your company has listed four possible economic conditions and the probability of each occurring in the next year.
3. Payoffs. In order to evaluate each course of action, the decision maker must associate a value or payoff with the result of each event. In business applications, this payoff is usually expressed in terms of profits or costs, although other payoffs such as units of satisfaction or utility are sometimes considered. In the chapter-opening scenario, the payoff is the return on advertising expenditure.
4. Decision criteria. The decision maker must determine how to select the best course of action. Section 17.2 discusses three criteria for decision making: expected monetary value, expected opportunity loss and return-to-risk ratio.

17.1  PAYOFF TABLES AND DECISION TREES

LEARNING OBJECTIVE 1
Use payoff tables and decision trees to evaluate alternative courses of action

alternative courses of action Choices that may be made in decision making.
events or states of the world Outcomes that may occur, and their associated probabilities.
payoffs Values associated with the outcome of events.
decision criteria Alternative methods for deciding the best course of action.
payoff table A table that shows the values associated with every possible event that can occur for each course of action.

In order to evaluate the various alternative courses of action for the complete set of events, you need to develop a payoff table or construct a decision tree. A payoff table contains each possible event that can occur for each alternative course of action. You must associate a value or payoff for each combination of an event and course of action. Example 17.1 shows a payoff table for a marketing manager trying to decide whether or not to introduce a new high-definition television model.

EXAMPLE 17.1
A PAYOFF TABLE FOR DECIDING WHETHER TO MARKET A NEW HIGH-DEFINITION TELEVISION MODEL
As the marketing manager of the Consumer Electronics Company, you are considering whether to introduce a new widescreen high-definition television (HDTV) model into the market. You are aware of the risks in deciding whether to market this model. For example, you could decide to market the HDTV model and then, for any of a number of reasons, the introduction turns out to be unsuccessful. Second, you could decide not to market the HDTV model when, in reality, it would have been successful. There is a fixed cost of $3 million incurred prior to making a final decision to market the HDTV model. Based on past experience, if the HDTV model is successful, you expect to make a profit of $45 million. If the HDTV model is not successful, you expect to lose $36 million. Construct a payoff table for these two alternative courses of action.

SOLUTION

Table 17.1 is a payoff table for the television marketing example.

Table 17.1 Payoff table for the television marketing example (in millions of dollars)
                                Alternative courses of action
Event Ei                        Market, A1    Do not market, A2
Successful HDTV model, E1       +$45          -$3
Unsuccessful HDTV model, E2     -$36          -$3

decision tree Graphical representation of simple and joint probabilities as vertices of a tree.

A decision tree, first introduced in Section 4.2, is another way of representing the events for each alternative course of action. The decision tree pictorially represents the events and courses of action through a set of branches and nodes. A square node represents a point where a decision is made and branches coming from it represent the alternative actions. A circular node represents an uncertainty point where a set of uncontrollable events can occur. Lines from the circular nodes represent possible outcomes which occur by chance. The values at the end of each branch are payoffs associated with the sequence of events represented by that branch. Example 17.2 illustrates a decision tree.

EXAMPLE 17.2
THE DECISION TREE FOR THE TELEVISION MARKETING DECISION
Given the payoff table for the television marketing example, construct a decision tree.

SOLUTION

Figure 17.1 is the decision tree for the payoff table shown in Table 17.1.

Figure 17.1 Decision tree for the television marketing example
(Decision node: market the HDTV model or do not market the HDTV model. Market branch: successful, $45 million; unsuccessful, -$36 million. Do not market branch: successful, -$3 million; unsuccessful, -$3 million.)

In Figure 17.1, the first set of branches relates to the two alternative courses of action: market the HDTV model or do not market the HDTV model. The second set of branches represents the possible events of successful HDTV model or unsuccessful HDTV model. These events occur for each of the alternative courses of action on the decision tree.





The decision structure for the television marketing example contains only two possible alternative courses of action and two possible events. In general, there can be several alternative courses of action and events. As the marketing manager in the scenario at the beginning of the chapter, you need to decide which of two social media platforms to use for a fashion advertising campaign. A researcher in the company has predicted returns for the two platforms under the four economic conditions – recession, stability, moderate growth and boom. Table 17.2 presents the predicted yearly return of a social media marketing campaign for each platform under each economic condition. Figure 17.2 shows the decision tree for this payoff table. The decision (which platform to use) is the first branch of the tree and the second set of branches represents the four events (the economic conditions).

Table 17.2 Predicted yearly return of a social media campaign on two platforms under four economic conditions
Economic conditions    Platform A ($'000)    Platform B ($'000)
Recession               30                    -50
Stable economy          70                     30
Moderate growth        100                    250
Boom                   150                    400

Figure 17.2 Decision tree for the platform selection payoff table (in thousands of dollars)
(Decision node: platform A or platform B. Platform A branch: recession 30; stable economy 70; moderate growth 100; boom 150. Platform B branch: recession -50; stable economy 30; moderate growth 250; boom 400.)

You use payoff tables and decision trees as decision-making tools to help determine the best course of action. For example, when deciding whether to market a new television model, you would market it if you knew that the television was going to be successful. Certainly, you would not market it if you knew that it was not going to be successful. For each event, you can determine the amount of profit that will be lost if the best alternative course of action is not taken. This is called ‘opportunity loss’. The opportunity loss is the difference between the highest possible profit for an event and the actual profit for an action taken.

opportunity loss The difference between the highest possible profit for an event and the actual profit.

Example 17.3 illustrates the calculation of opportunity loss.



EXAMPLE 17.3

FINDING OPPORTUNITY LOSS IN THE TELEVISION MARKETING EXAMPLE
Using the payoff table from Example 17.1, construct an opportunity loss table.

SOLUTION
For the event 'successful HDTV model', the maximum profit occurs when the product is marketed (+$45 million). The opportunity that is lost by not marketing the HDTV model is the difference between $45 million and -$3 million, which is $48 million. If the HDTV model is unsuccessful, the best action is not to market it (-$3 million profit). The opportunity that is lost by making the incorrect decision of marketing the television is -$3 - (-$36) = $33 million. The opportunity loss is always a non-negative number, because it represents the difference between the profit under the best action and any other course of action that is taken for the particular event. Table 17.3 shows the complete opportunity loss table for the television marketing example.

Table 17.3 Opportunity loss table for the television marketing example (in millions of dollars)
                                                          Alternative courses of action
Event Ei        Optimum action    Profit of optimum action    Market                Do not market
Successful      Market            $45                         $45 - $45 = $0        $45 - (-$3) = $48
Unsuccessful    Do not market     -$3                         -$3 - (-$36) = $33    -$3 - (-$3) = $0

Figure 17.3 represents the Microsoft Excel opportunity loss table.

Figure 17.3 Microsoft Excel opportunity loss table (in millions of dollars)
(Worksheet showing the payoff table of Table 17.1 together with the optimum action, the optimum profit and the opportunity losses of Table 17.3.)

You can also develop an opportunity loss table for the platform selection problem in the scenario at the beginning of the chapter. Here, there are four possible events or economic conditions that will affect the yearly return for each of the two social media platforms. In a recession, platform A is a better choice, providing a return of $30,000 as compared with a loss of $50,000 for platform B. In a stable economy, a platform A campaign is again better than a campaign using platform B. Platform A provides a return of $70,000 compared with $30,000 for platform B. However, under conditions of moderate growth or boom, advertising on platform B produces a better result than advertising on platform A. In a moderate growth period, a platform B campaign provides a return of $250,000 as compared with $100,000 from platform A, while in boom conditions the difference between the platforms is even greater, with a platform B campaign providing a return of $400,000 as compared with $150,000 from platform A. Table 17.4 summarises the complete set of opportunity losses.





Table 17.4 Opportunity loss table for advertising on two platforms under four economic conditions
                                                                        Alternative courses of action
Event Ei           Optimum action    Profit of optimum action ($'000)    Platform A ($'000)    Platform B ($'000)
Recession          Platform A         30                                 30 - 30 = 0           30 - (-50) = 80
Stable economy     Platform A         70                                 70 - 70 = 0           70 - 30 = 40
Moderate growth    Platform B        250                                 250 - 100 = 150       250 - 250 = 0
Boom               Platform B        400                                 400 - 150 = 250       400 - 400 = 0
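The arithmetic behind Table 17.4 can be sketched in a few lines of code: for each event, subtract each action's payoff from the best payoff for that event. The values below come directly from Table 17.2.

```python
# Minimal sketch of how the opportunity losses in Table 17.4 follow from the
# payoffs in Table 17.2 (yearly return in $'000).
payoffs = {
    'Recession':       {'Platform A': 30,  'Platform B': -50},
    'Stable economy':  {'Platform A': 70,  'Platform B': 30},
    'Moderate growth': {'Platform A': 100, 'Platform B': 250},
    'Boom':            {'Platform A': 150, 'Platform B': 400},
}

for event, by_action in payoffs.items():
    best = max(by_action.values())                        # profit of the optimum action
    losses = {a: best - p for a, p in by_action.items()}  # opportunity loss per action
    print(event, losses)
# e.g. Recession {'Platform A': 0, 'Platform B': 80} ... Boom {'Platform A': 250, 'Platform B': 0}
```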

Problems for Section 17.1

LEARNING THE BASICS
17.1 For this problem use the following payoff table:
             Action
Event     A ($)    B ($)
1          50      100
2         200      125
a. Construct an opportunity loss table.
b. Construct a decision tree.
17.2 For this problem use the following payoff table:
             Action
Event     A ($)    B ($)
1          50       10
2         300      100
3         500      200
a. Construct an opportunity loss table.
b. Construct a decision tree.

APPLYING THE CONCEPTS
17.3 A manufacturer of designer jeans must decide whether to build a large factory or a small factory in a particular location. The profit per pair of jeans manufactured is estimated as $10. A small factory will incur an annual cost of $200,000 and have a production capacity of 50,000 jeans per year. A large factory will incur an annual cost of $400,000 and have a production capacity of 100,000 jeans per year. Four levels of manufacturing demand are considered likely: 10,000, 20,000, 50,000 and 100,000 pairs of jeans per year.

a. Determine the possible levels of production for a small factory and the payoffs for each possible level of production.
b. Determine the possible levels of production for a large factory and the payoffs for each possible level of production.
c. Based on the results of (a) and (b), construct a payoff table indicating the events and alternative courses of action.
d. Construct a decision tree.
e. Construct an opportunity loss table.
17.4 An author is trying to choose between two publishing companies that are competing for the marketing rights to her new novel. Company A has offered the author $10,000 plus $2 per book sold. Company B has offered the author $2,000 plus $4 per book sold. The author believes that five levels of demand for the book are possible: 1,000, 2,000, 5,000, 10,000 and 50,000 books sold.
a. Calculate the payoffs for each level of demand for company A and company B.
b. Construct a payoff table indicating the events and alternative courses of action.
c. Construct a decision tree.
d. Construct an opportunity loss table.
17.5 The East Ryfield Scout Group purchases and sells Christmas trees each year as a fundraiser. The group purchases the trees for $40 each and sells them for $75 each. Any trees not sold by Christmas Day are taken to the recycling centre at a cost of $5 each. The scout group estimates that four levels of demand are possible: 100, 200, 500 and 1,000 trees.
a. Calculate the payoffs for purchasing 100, 200, 500 or 1,000 trees for each of the four levels of demand.
b. Construct a payoff table indicating the events and alternative courses of action.
c. Construct a decision tree.
d. Construct an opportunity loss table.

17.2 CRITERIA FOR DECISION MAKING

LEARNING OBJECTIVE 2
Use several criteria to select an alternative course of action

After you calculate the profit and opportunity loss for each event under each alternative course of action, you need to determine the criteria for selecting the most desirable course of action. To determine which alternative to choose, you first assign a probability to each event. The probability assigned is based on information available from past data, from the opinions of the decision maker or from knowledge about the probability distribution that the event may follow. Using these probabilities, along with the payoffs or opportunity losses of each event–action combination, you select the best course of action according to a particular criterion. In this section, three decision criteria are presented: expected monetary value, expected opportunity loss and the return-to-risk ratio.

Expected Monetary Value

expected monetary value (EMV) The sum of payoffs for each event and action multiplied by the respective event probabilities.

In Section 5.1, Equation 5.1 on page 182 shows how to calculate the expected value of a probability distribution. Now you use this formula to calculate the expected monetary value for each alternative course of action. The expected monetary value (EMV) for a course of action j is the payoff (Xij) for each combination of event i and action j multiplied by Pi, the probability of occurrence of event i, summed over all events (see Equation 17.1).

EXPECTED MONETARY VALUE

EMV(j) = \sum_{i=1}^{N} X_{ij} P_i    (17.1)

where
EMV(j) = expected monetary value of action j
Xij = payoff that occurs when course of action j is selected and event i occurs
Pi = probability of occurrence of event i
N = number of events

Criterion: Select the course of action with the largest EMV.

Example 17.4 illustrates the application of expected monetary value to the television marketing example.

EXAMPLE 17.4

CALCULATING THE EXPECTED MONETARY VALUE (EMV) IN THE TELEVISION MARKETING EXAMPLE
Returning to the payoff table for deciding whether to market a HDTV model (Example 17.1), suppose that the probability is 0.40 that the model will be successful (so that the probability is 0.60 that the HDTV model will not be successful). Calculate the expected monetary value for each alternative course of action and determine whether or not to market the new television.

SOLUTION
You use Equation 17.1 to determine the expected monetary value for each alternative course of action. Table 17.5 summarises these calculations.

Table 17.5 Expected monetary value (in millions of dollars) for each alternative for the television marketing example

                          Alternative courses of action
Event Ei          Pi     Market, A1 (Xij)   XijPi                  Do not market, A2 (Xij)   XijPi
Successful E1     0.40   +$45               $45(0.4) = $18         -$3                       -$3(0.4) = -$1.2
Unsuccessful E2   0.60   -$36               -$36(0.6) = -$21.6     -$3                       -$3(0.6) = -$1.8
                                            EMV(A1) = -$3.6                                  EMV(A2) = -$3

The expected monetary value for marketing the HDTV model is -$3.6 million, and the expected monetary value for not marketing the HDTV model is -$3 million. Thus, if your objective is to choose the action that maximises the expected monetary value, you would choose the action of not marketing the HDTV model because its profit is highest (or in this case its loss is lowest). Note, however, that if the probability that the HDTV model is successful, P1, is slightly greater than the assumed value of 0.40, you would make a different decision. Specifically, if P1 = 0.41, then EMV(A1) = -$2.79 million, EMV(A2) = -$3 million and the best decision is to market the television. This change in the optimal decision, when such a small change in the assumed probability of success occurs, illustrates the importance of accuracy when determining the probabilities. You need to consider the closeness of the decision-making criterion before making a final decision.

As a second application of expected monetary value, return to the scenario at the beginning of the chapter and the payoff table presented in Table 17.2. Suppose an economist assigns the following probabilities to the different economic conditions:

P(Recession) = 0.10
P(Stable economy) = 0.40
P(Moderate growth) = 0.30
P(Boom) = 0.20

Table 17.6 shows the calculations of the expected monetary value for advertising on each of the two platforms, A and B.

Table 17.6 Expected monetary value for advertising on two platforms under four economic conditions

                          Alternative courses of action ($'000)
Event Ei          Pi     Platform A (Xij)   XijPi             Platform B (Xij)   XijPi
Recession         0.10    30                30(0.1) = 3       -50                -50(0.1) = -5
Stable economy    0.40    70                70(0.4) = 28       30                 30(0.4) = 12
Moderate growth   0.30   100               100(0.3) = 30      250                250(0.3) = 75
Boom              0.20   150               150(0.2) = 30      400                400(0.2) = 80
                                           EMV(A) = 91                           EMV(B) = 162

Thus, the expected monetary value of the return for advertising on platform A is $91,000 and the expected monetary value of the return for platform B is $162,000. Using these results, you should choose platform B because the expected monetary value of the return for advertising on platform B is almost twice that of advertising on platform A.
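Because the EMV in Equation 17.1 is simply a probability-weighted sum, it is easy to check by hand or with a short script. The following Python sketch is an illustration (not from the text); the probability and payoff values are those assumed in the scenario, and the function name is arbitrary.

```python
# Minimal sketch: expected monetary value (Equation 17.1) for the platform example.
probabilities = [0.10, 0.40, 0.30, 0.20]           # recession, stable, moderate growth, boom
payoffs = {"Platform A": [30, 70, 100, 150],       # returns in $'000
           "Platform B": [-50, 30, 250, 400]}

def emv(payoff_column, probs):
    """Expected monetary value: sum of payoff times probability over all events."""
    return sum(x * p for x, p in zip(payoff_column, probs))

for action, column in payoffs.items():
    print(action, "EMV =", emv(column, probabilities))   # 91.0 and 162.0 ($'000)
```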

Expected Opportunity Loss

expected opportunity loss (EOL) The sum of the losses for each event and action multiplied by the respective event probabilities.

Earlier, you learned how to use the expected monetary value criterion when making a decision. An equivalent criterion, based on opportunity losses, is introduced next. Payoffs and opportunity losses can be viewed as two sides of the same coin: it all depends on whether you wish to view the problem in terms of maximising expected monetary value or minimising expected opportunity loss. The expected opportunity loss (EOL) of action j is the loss Lij for each combination of event i and action j multiplied by Pi, the probability of occurrence of event i, summed over all events (see Equation 17.2).

EXPECTED OPPORTUNITY LOSS

EOL(j) = \sum_{i=1}^{N} L_{ij} P_i    (17.2)

where
Lij = opportunity loss that occurs when course of action j is selected and event i occurs
Pi = probability of occurrence of event i

Criterion: Select the course of action with the smallest EOL. Selecting the course of action with the smallest EOL is equivalent to selecting the course of action with the largest EMV (see Equation 17.1).


Example 17.5 illustrates the application of opportunity loss for the HDTV marketing example.

EXAMPLE 17.5
CALCULATING THE EXPECTED OPPORTUNITY LOSS (EOL) FOR THE HDTV MARKETING EXAMPLE
Referring to the opportunity loss table given in Table 17.3, and assuming that the probability is 0.40 that the television model will be successful, calculate the expected opportunity loss for each alternative course of action. Determine whether or not to market the model.

SOLUTION

Table 17.7 Expected opportunity loss in millions of dollars for each alternative for the HDTV marketing example

                          Alternative courses of action
Event Ei          Pi     Market, A1 (Lij)   LijPi                Do not market, A2 (Lij)   LijPi
Successful E1     0.40    0                  0(0.4) = 0          $48                       $48(0.4) = $19.2
Unsuccessful E2   0.60   $33                $33(0.6) = $19.8      0                         0(0.6) = 0
                                            EOL(A1) = $19.8                                EOL(A2) = $19.2

The expected opportunity loss is lower for not marketing the model ($19.2 million) than for marketing the model ($19.8 million). Therefore, using the EOL criterion, the optimal decision is not to market the television. This outcome is expected, since the equivalent EMV criterion produced the same optimal strategy. Recall from Table 17.5 that EMV(A2) = -$3 million versus EMV(A1) = -$3.6 million, a difference of $0.6 million. Note once again that if the probability that the model is successful, P1, is slightly greater than the assumed value of 0.40, a different decision is made. Specifically, if P1 = 0.41, then EOL(A1) = $19.47 million, EOL(A2) = $19.68 million, and the best decision is to market the television.

expected value of perfect information (EVPI) The expected opportunity loss from the best decision.

The expected opportunity loss from the best decision is called the expected value of perfect information (EVPI). Equation 17.3 defines the EVPI.

EXPECTED VALUE OF PERFECT INFORMATION

EVPI = expected profit under certainty - expected monetary value of the best alternative    (17.3)

expected profit under certainty The expected profit that you could make if you had perfect information about which event will occur.

The expected profit under certainty represents the expected profit that you could make if you have perfect information about which event will occur. Example 17.6 illustrates the expected value of perfect information.

EXAMPLE 17.6
CALCULATING THE EXPECTED VALUE OF PERFECT INFORMATION (EVPI) IN THE HDTV MARKETING EXAMPLE
Referring to Example 17.5, calculate the expected profit under certainty and the expected value of perfect information.

SOLUTION
As the marketing manager of the Consumer Electronics Company, if you could always predict the future, a profit of $45 million would be made for the 40% of the HDTV models that are successful, and a loss of $3 million would be incurred for the 60% of the models that are not successful. Thus:


Expected profit under certainty = 0.40($45) + 0.60(-$3) = $18 - $1.80 = $16.2

This value, $16.2 million, represents the profit you could make if you knew with certainty whether or not the model was going to be successful. Use Equation 17.3 to calculate the expected value of perfect information:

EVPI = expected profit under certainty - expected monetary value of the best alternative
     = $16.2 - (-$3) = $19.2

This EVPI value of $19.2 million represents the maximum amount that you should be willing to pay for perfect information. Of course, you can never have perfect information and you should never pay the entire EVPI for more information. Rather, the EVPI provides a guideline for an upper bound on how much you might consider paying for better information. The EVPI is also the expected opportunity loss of not marketing the model.

Return to the scenario at the beginning of the chapter and the opportunity loss table presented in Table 17.4. Table 17.8 presents the calculations to determine the expected opportunity loss for advertising on platform A or B.

Table 17.8 Expected opportunity loss for each alternative for the social media advertising example

                          Alternative courses of action ($'000)
Event Ei          Pi     Platform A (Lij)   LijPi              Platform B (Lij)   LijPi
Recession         0.10     0                0(0.1) = 0          80                80(0.1) = 8
Stable economy    0.40     0                0(0.4) = 0          40                40(0.4) = 16
Moderate growth   0.30   150              150(0.3) = 45          0                 0(0.3) = 0
Boom              0.20   250              250(0.2) = 50          0                 0(0.2) = 0
                                          EOL(A) = 95                             EOL(B) = EVPI = 24

The expected opportunity loss is lower for platform B than for platform A. Your optimal decision is to advertise on platform B, which is consistent with the decision made using expected monetary value. The expected value of perfect information is $24,000 (being the lower of the two expected opportunity loss values), meaning that you should be willing to pay up to $24,000 for perfect information.
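The figures in Table 17.8 can be verified with a short calculation. The following Python sketch is an illustration only (not from the text): it derives the opportunity losses from the payoff table, applies Equation 17.2 to obtain the EOL for each platform, and reports the EVPI as the smaller of the two expected opportunity losses.

```python
# Minimal sketch: expected opportunity loss (Equation 17.2) and EVPI for the platform example.
probabilities = [0.10, 0.40, 0.30, 0.20]
payoffs = {"Platform A": [30, 70, 100, 150],      # $'000
           "Platform B": [-50, 30, 250, 400]}

# Opportunity loss for each event = best payoff for that event minus the action's payoff.
best_per_event = [max(col) for col in zip(*payoffs.values())]
eol = {}
for action, column in payoffs.items():
    losses = [best - x for best, x in zip(best_per_event, column)]
    eol[action] = sum(loss * p for loss, p in zip(losses, probabilities))

print(eol)                            # {'Platform A': 95.0, 'Platform B': 24.0}
print("EVPI =", min(eol.values()))    # 24.0, i.e. $24,000
```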

Return-to-Risk Ratio

Unfortunately, neither the expected monetary value criterion nor the expected opportunity loss criterion takes into account the variability of the payoffs for the alternative courses of action under different events. From Table 17.2, you see that the return for advertising on platform A varies from $30,000 in a recession to $150,000 in an economic boom, while for platform B (the one chosen according to the expected monetary value and expected opportunity loss criteria) the return varies from a loss of $50,000 in a recession to a profit of $400,000 in an economic boom. To take into account the variability of the events (in this case, the different economic conditions), you can calculate the variance and standard deviation of each platform using Equations 5.2a and 5.3 on page 183. Using the information presented in Table 17.6, for platform A, μA = $91 thousand and the variance is:

σ²A = ∑ (Xi − μA)² P(Xi)
    = (30 − 91)²(0.1) + (70 − 91)²(0.4) + (100 − 91)²(0.3) + (150 − 91)²(0.2)
    = 1,269

and the standard deviation for platform A is σA = √1,269 = $35.62 thousand.


For platform B, μB = $162 thousand and the variance is:

σ²B = ∑ (Xi − μB)² P(Xi)
    = (−50 − 162)²(0.1) + (30 − 162)²(0.4) + (250 − 162)²(0.3) + (400 − 162)²(0.2)
    = 25,116

and the standard deviation of platform B is σB = √25,116 = $158.48 thousand.

Because you are comparing two sets of data with vastly different means, you should evaluate the relative risk associated with advertising on each platform. Once you calculate the standard deviation of the return from each platform, you calculate the coefficient of variation, discussed in Section 3.1. Substituting σ for S and EMV for X̄ in Equation 3.11 on page 105, you find that the coefficient of variation for platform A is:

CVA = (σA / EMVA) × 100% = (35.62 / 91) × 100% = 39.1%

while the coefficient of variation for platform B is:

CVB = (σB / EMVB) × 100% = (158.48 / 162) × 100% = 97.8%

return-to-risk ratio (RTRR) The expected monetary value of an action divided by its standard deviation.

Thus, there is much more variation in return for using platform B than for platform A. Since the coefficient of variation shows the relative size of the variation compared with the mean (or expected monetary value), a criterion other than EMV or EOL is needed to express the relationship between the return (as expressed by the EMV) and the risk (as expressed by the standard deviation). Equation 17.4 defines the return-to-risk ratio (RTRR) as the expected monetary value of action j divided by the standard deviation of action j.

RETURN-TO-RISK RATIO

RTRR(j) = EMV(j) / σj    (17.4)

where
EMV(j) = expected monetary value for alternative course of action j
σj = standard deviation for alternative course of action j

Criterion: Select the course of action with the largest RTRR.

For each of the two platforms discussed previously, you calculate the return-to-risk ratios as follows. For platform A, the return-to-risk ratio is:

RTRR(A) = 91 / 35.62 = 2.55

For platform B, the return-to-risk ratio is:

RTRR(B) = 162 / 158.48 = 1.02


Thus, relative to the risk as expressed by the standard deviation, the expected return is much higher for platform A than for B. Platform A has a smaller expected monetary value than ­platform B, but also has a much smaller risk than platform B. The return-to-risk ratio shows A to be preferable to B. Figure 17.4 represents the Microsoft Excel output.

Figure 17.4 Microsoft Excel expected value and standard deviation worksheet for the platform selection analysis. The worksheet contains the probabilities and payoffs table; the expected monetary value, variance, standard deviation, coefficient of variation and return-to-risk ratio for each platform (EMV 91 and 162, standard deviations 35.6230 and 158.4803, coefficients of variation 39.15% and 97.83%, and return-to-risk ratios 2.5545 and 1.0222 for platforms A and B respectively); and the opportunity loss table, with expected opportunity losses of 95 for platform A and 24 for platform B (the latter equal to the EVPI).
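The statistics shown in Figure 17.4 can be cross-checked with a short calculation. The following Python sketch is an illustration, not the PHStat worksheet itself; the names and structure are my own assumptions.

```python
# Minimal sketch: EMV, standard deviation, coefficient of variation and
# return-to-risk ratio (Equation 17.4) for the platform selection example.
from math import sqrt

probabilities = [0.10, 0.40, 0.30, 0.20]
payoffs = {"Platform A": [30, 70, 100, 150],      # $'000
           "Platform B": [-50, 30, 250, 400]}

for action, column in payoffs.items():
    emv = sum(x * p for x, p in zip(column, probabilities))
    variance = sum((x - emv) ** 2 * p for x, p in zip(column, probabilities))
    sd = sqrt(variance)
    cv = sd / emv * 100        # coefficient of variation, as a percentage
    rtrr = emv / sd            # return-to-risk ratio
    print(f"{action}: EMV={emv:.0f}  SD={sd:.2f}  CV={cv:.2f}%  RTRR={rtrr:.4f}")
```

The printed values match those in the worksheet: platform B has the higher EMV, but platform A has the higher return-to-risk ratio.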

Problems for Section 17.2

LEARNING THE BASICS
17.6 For the following payoff table:

                 Action
Event          A       B
1             50     100
2            200     125

The probability of event 1 is 0.5 and the probability of event 2 is also 0.5.
a. Calculate the expected monetary value (EMV) for actions A and B.
b. Calculate the expected opportunity loss (EOL) for actions A and B.
c. Explain the meaning of the expected value of perfect information (EVPI) in this problem.
d. Based on the results of (a) or (b), which action would you choose? Why?
e. Calculate the coefficient of variation for each action.
f. Calculate the return-to-risk ratio (RTRR) for each action.
g. Based on (e) and (f), which action would you choose? Why?
h. Compare the results of (d) and (g) and explain any differences.

17.7 For the following payoff table:

                 Action
Event          A       B
1             50      10
2            300     100
3            500     200

The probability of event 1 is 0.8, the probability of event 2 is 0.1 and the probability of event 3 is 0.1.
a. Calculate the expected monetary value (EMV) for actions A and B.
b. Calculate the expected opportunity loss (EOL) for actions A and B.
c. Explain the meaning of the expected value of perfect information (EVPI) in this problem.
d. Based on the results of (a) or (b), which action would you choose? Why?
e. Calculate the coefficient of variation for each action.
f. Calculate the return-to-risk ratio (RTRR) for each action.
g. Based on (e) and (f), which action would you choose? Why?
h. Compare the results of (d) and (g) and explain any differences.
i. Would your answers to (d) and (g) be different if the probabilities for the three events were 0.1, 0.1 and 0.8 respectively? Discuss.

17.8 For a potential investment of $1,000, if a share has an EMV of $100 and a standard deviation of $25, what is the:
a. rate of return?
b. coefficient of variation?
c. return-to-risk ratio?

17.9 A share has the following predicted returns under the following economic conditions:

Economic condition     Probability    Return
Recession                 0.30          $50
Stable economy            0.30          100
Moderate growth           0.30          120
Boom                      0.10          200

Calculate the:
a. expected monetary value.
b. standard deviation.
c. coefficient of variation.
d. return-to-risk ratio.

17.10 The following are the results for shares in two companies:

                            A      B
Expected monetary value    $90    $60
Standard deviation          10     10

Which company would you choose to invest in, and why?

17.11 The following are the results for shares in two companies:

                            A      B
Expected monetary value    $60    $60
Standard deviation          20     10

Which company would you choose to invest in, and why?

APPLYING THE CONCEPTS
17.12 A street vendor outside a cricket ground is deciding whether to sell ice cream or hot dogs at today's game. The vendor believes that the profit made will depend on the weather. The payoff table is as follows:

                        Action
Event            Sell hot dogs    Sell ice cream
Cool weather         $260             $190
Warm weather          230              290

Based on his experience at this time of year, the vendor estimates the probability of warm weather as 0.60.
a. Calculate the expected monetary value (EMV) for selling hot dogs and selling ice cream.
b. Calculate the expected opportunity loss (EOL) for selling hot dogs and selling ice cream.
c. Explain the meaning of the expected value of perfect information (EVPI) in this problem.
d. Based on the results of (a) or (b), would you choose to sell hot dogs or ice cream? Why?
e. Calculate the coefficient of variation for selling hot dogs and selling ice cream.
f. Calculate the return-to-risk ratio (RTRR) for selling hot dogs and selling ice cream.
g. Based on (e) and (f), would you choose to sell hot dogs or ice cream? Why?
h. Compare the results of (d) and (g) and explain any differences.

17.13 The Hawkesbury Tomato Farm grows tomatoes each summer at a cost of $2.50 per kilogram and sells them at a farmgate stall for $3.50 per kilogram. Any tomatoes not sold to the public by the end of the week can be sold to a local Italian restaurant for sauce making at $1.50 per kg. The company can grow enough plants to produce 500, 1,000 or 2,000 kg of tomatoes. The probabilities of various levels of demand are as follows:

Demand (kilograms)    Probability
  500                    0.2
1,000                    0.4
2,000                    0.4

a. For each possible production level (500, 1,000 or 2,000 kg), calculate the profit (or loss) for each level of demand.
b. Using the expected monetary value (EMV) criterion, determine the optimal number of kilograms of tomatoes the farm should produce. Discuss.
c. Calculate the standard deviation for each possible production level.
d. Calculate the expected opportunity loss (EOL) for producing 500, 1,000 and 2,000 kg of tomatoes.
e. Explain the meaning of the expected value of perfect information (EVPI) in this problem.
f. Calculate the coefficient of variation for producing 500, 1,000 and 2,000 kg of tomatoes. Discuss.
g. Calculate the return-to-risk ratio (RTRR) for producing 500, 1,000 and 2,000 kg of tomatoes. Discuss.
h. Based on (b) and (d), what would you choose to sell: 500, 1,000 or 2,000 kg of tomatoes? Why?


i. Compare the results of (b), (d), (f) and (g) and explain any differences.
j. Suppose that tomatoes can be sold to the public for $4 per kg. Repeat (a) to (h) with this selling price for tomatoes and compare the results with those in (i).
k. What would be the effect on the results in (a) to (i) if the probability of the demand for 500, 1,000 and 2,000 kg were 0.4, 0.4 and 0.2 respectively?

17.14 An investor has a certain sum of money available to invest now. Three alternative investments are available. The estimated profits of each investment under each economic condition are indicated in the following payoff table:

                        Investment selection
Event                  A          B           C
Economy declines      $500     -$2,000     -$7,000
No change           $1,000      $2,000     -$1,000
Economy expands     $2,000      $5,000     $20,000

Based on his own experience, the investor assigns the following probabilities to each economic condition:

P(economy declines) = 0.30
P(no change) = 0.50
P(economy expands) = 0.20

a. Determine the best investment according to the expected monetary value (EMV) criterion. Discuss.
b. Calculate the standard deviation for investments A, B and C.
c. Calculate the expected opportunity loss (EOL) for investments A, B and C.
d. Explain the meaning of the expected value of perfect information (EVPI) in this problem.
e. Calculate the coefficient of variation for investments A, B and C.
f. Calculate the return-to-risk ratio (RTRR) for investments A, B and C.
g. Based on (e) and (f), which would you choose, investment A, B or C? Why?
h. Compare the results of (a) and (g) and explain any differences.
i. Suppose the probabilities of the different economic conditions are as follows:
   i. 0.1, 0.6 and 0.3
   ii. 0.1, 0.3 and 0.6
   iii. 0.4, 0.4 and 0.2
   iv. 0.6, 0.3 and 0.1
   Repeat (a) to (h) with each of these sets of probabilities and compare the results with those originally calculated. Discuss.

17.15 In problem 17.3 on page 685, you developed a payoff table for building a small factory and a large factory for manufacturing designer jeans. Given the results of that problem, suppose that the demand probabilities are as follows:

Demand     Probability
 10,000       0.1
 20,000       0.4
 50,000       0.2
100,000       0.3

a. Calculate the expected monetary value (EMV) for building a small factory and building a large factory.
b. Calculate the expected opportunity loss (EOL) for building a small factory and building a large factory.
c. Explain the meaning of the expected value of perfect information (EVPI) in this problem.
d. Based on the results of (a) or (b), would you choose to build a small factory or a large factory? Why?
e. Calculate the coefficient of variation for building a small factory and building a large factory.
f. Calculate the return-to-risk ratio (RTRR) for building a small factory and building a large factory.
g. Based on (e) and (f), would you choose to build a small factory or a large factory? Why?
h. Compare the results of (d) and (g) and explain any differences.
i. Suppose that the demand probabilities are 0.4, 0.2, 0.2 and 0.2 respectively. Repeat (a) to (h) with these probabilities and compare the results with those originally calculated.

17.16 In problem 17.4 on page 685, you developed a payoff table to assist an author in choosing between signing with company A or with company B. Given the results calculated in that problem, suppose that the probabilities of the levels of demand for the novel are as follows:

Demand     Probability
 1,000        0.45
 2,000        0.20
 5,000        0.15
10,000        0.10
50,000        0.10

a. Calculate the expected monetary value (EMV) for signing with company A and with company B.
b. Calculate the expected opportunity loss (EOL) for signing with company A and with company B.
c. Explain the meaning of the expected value of perfect information (EVPI) in this problem.
d. Based on the results of (a) or (b), if you were the author, which company would you choose to sign with, company A or company B? Why?
e. Calculate the coefficient of variation for signing with company A and signing with company B.
f. Calculate the return-to-risk ratio (RTRR) for signing with company A and signing with company B.


g. Based on (e) and (f), which company would you choose to sign with, company A or company B? Why?
h. Compare the results of (d) and (g) and explain any differences.
i. Suppose that the probabilities of demand are 0.3, 0.2, 0.2, 0.1 and 0.2 respectively. Repeat (a) to (h) with these probabilities and compare the results with those for (a) to (h).

17.17 In problem 17.5 on page 685, you developed a payoff table for whether to purchase 100, 200, 500 or 1,000 Christmas trees. Given the results of that problem, suppose that the demand probabilities for different numbers of trees are as follows:

Demand (number of trees)    Probability
  100                          0.20
  200                          0.50
  500                          0.20
1,000                          0.10

a. Calculate the expected monetary value (EMV) for purchasing 100, 200, 500 and 1,000 trees.
b. Calculate the expected opportunity loss (EOL) for purchasing 100, 200, 500 and 1,000 trees.
c. Explain the meaning of the expected value of perfect information (EVPI) in this problem.
d. Based on the results of (a) or (b), would you choose to purchase 100, 200, 500 or 1,000 trees? Why?
e. Calculate the coefficient of variation for purchasing 100, 200, 500 and 1,000 trees.
f. Calculate the return-to-risk ratio (RTRR) for purchasing 100, 200, 500 and 1,000 trees.
g. Based on (e) and (f), would you choose to purchase 100, 200, 500 or 1,000 trees? Why?
h. Compare the results of (d) and (g) and explain any differences.
i. Suppose that the demand probabilities are 0.4, 0.2, 0.2 and 0.2 respectively. Repeat (a) to (h) with these probabilities and compare the results with those originally calculated.

17.3 DECISION MAKING WITH SAMPLE INFORMATION

LEARNING OBJECTIVE 3
Use Bayes' theorem to revise probabilities in light of sample information

In Sections 17.1 and 17.2, you learned the framework for making decisions when there are several alternative courses of action. You then studied three different criteria for choosing between alternatives. For each criterion, you assigned the probabilities of the various events using past experience and/or the subjective judgment of the decision maker. This section introduces decision making when sample information is available to estimate probabilities. Example 17.7 illustrates decision making with sample information.

EXAMPLE 17.7
DECISION MAKING USING SAMPLE INFORMATION FOR THE TELEVISION MARKETING EXAMPLE
In Section 4.3 on page 164, you found that the probability of a new television model being successful, given that the company receives a favourable report, is 0.64. Thus, the probability of an unsuccessful model, given that the company receives a favourable report, is 1 - 0.64 = 0.36. Given that the company receives a favourable report, calculate the expected monetary value for each alternative course of action, and determine whether or not to market the television.

SOLUTION
You need to use the revised probabilities, not the original subjective probabilities, to calculate the expected monetary value of each alternative. Table 17.9 illustrates the calculations.

Table 17.9 Expected monetary value in millions of dollars using revised probabilities for each alternative in the television marketing example

                          Alternative courses of action
Event Ei          Pi     Market, A1 (Xij)   XijPi                    Do not market, A2 (Xij)   XijPi
Successful E1     0.64   +$45               $45(0.64) = $28.8        -$3                       -$3(0.64) = -$1.92
Unsuccessful E2   0.36   -$36               -$36(0.36) = -$12.96     -$3                       -$3(0.36) = -$1.08
                                            EMV(A1) = $15.84                                   EMV(A2) = -$3


In this case, the optimal decision is to market the product because a profit of $15.84 million is expected, compared with a loss of $3 million if the television model is not marketed. This decision is different from the one considered optimal before the sample information was collected in the form of the market research report. The favourable recommendation contained in the report greatly increases the probability that the model will be successful.
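As a small illustration (not from the text), the following Python sketch shows how the EMV comparison for the HDTV example responds to the assumed probability of success, including the revised value of 0.64; the function name is arbitrary.

```python
# Minimal sketch: how the EMV-based decision for the HDTV example changes with
# the probability of success (payoffs in $ millions: market = +45 / -36, do not market = -3).
def emv_market(p_success):
    return 45 * p_success + (-36) * (1 - p_success)

for p in (0.40, 0.41, 0.64):
    decision = "market" if emv_market(p) > -3 else "do not market"
    print(f"P(successful) = {p:.2f}: EMV(market) = {emv_market(p):6.2f}, best action = {decision}")
```

At P = 0.40 the best action is not to market (-3.6 versus -3), at P = 0.41 the decision flips, and at the revised probability of 0.64 marketing gives an expected profit of $15.84 million, as in Table 17.9.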

Now we will return to the platform selection scenario at the beginning of the chapter. Because the relative desirability of the two platforms under consideration is directly affected by economic conditions, you should use a forecast of the economic conditions in the upcoming year. You can then use Bayes' theorem, introduced in Section 4.3, to revise the probabilities associated with the different economic conditions. Suppose that such a forecast can predict either an expanding economy (F1) or a declining or stagnant economy (F2). Past experience indicates that when there is a recession, prior forecasts predicted an expanding economy 20% of the time. When there is a stable economy, prior forecasts predicted an expanding economy 40% of the time. When there is moderate growth, prior forecasts predicted an expanding economy 70% of the time. Finally, when there is a boom economy, prior forecasts predicted an expanding economy 90% of the time. If the forecast is for an expanding economy, you can revise the probabilities of economic conditions using Bayes' theorem (Equation 4.9 on page 164). Let:

event E1 = recession
event E2 = stable economy
event E3 = moderate growth
event E4 = boom economy
event F1 = expanding economy is predicted
event F2 = declining or stagnant economy is predicted

and:

P(E1) = 0.10    P(F1 | E1) = 0.20
P(E2) = 0.40    P(F1 | E2) = 0.40
P(E3) = 0.30    P(F1 | E3) = 0.70
P(E4) = 0.20    P(F1 | E4) = 0.90

Then, using Bayes' theorem as defined by Equation 4.9:

P(E1 | F1) = P(F1 | E1)P(E1) / [P(F1 | E1)P(E1) + P(F1 | E2)P(E2) + P(F1 | E3)P(E3) + P(F1 | E4)P(E4)]
           = (0.20)(0.10) / [(0.20)(0.10) + (0.40)(0.40) + (0.70)(0.30) + (0.90)(0.20)]
           = 0.02/0.57 = 0.035

P(E2 | F1) = P(F1 | E2)P(E2) / [P(F1 | E1)P(E1) + P(F1 | E2)P(E2) + P(F1 | E3)P(E3) + P(F1 | E4)P(E4)]
           = (0.40)(0.40) / [(0.20)(0.10) + (0.40)(0.40) + (0.70)(0.30) + (0.90)(0.20)]
           = 0.16/0.57 = 0.281

P(E3 | F1) = P(F1 | E3)P(E3) / [P(F1 | E1)P(E1) + P(F1 | E2)P(E2) + P(F1 | E3)P(E3) + P(F1 | E4)P(E4)]
           = (0.70)(0.30) / [(0.20)(0.10) + (0.40)(0.40) + (0.70)(0.30) + (0.90)(0.20)]
           = 0.21/0.57 = 0.368

P(E4 | F1) = P(F1 | E4)P(E4) / [P(F1 | E1)P(E1) + P(F1 | E2)P(E2) + P(F1 | E3)P(E3) + P(F1 | E4)P(E4)]
           = (0.90)(0.20) / [(0.20)(0.10) + (0.40)(0.40) + (0.70)(0.30) + (0.90)(0.20)]
           = 0.18/0.57 = 0.316

Table 17.10 summarises the calculation of these probabilities. Figure 17.5 displays the joint probabilities in a decision tree. You need to use the revised probabilities, not the original subjective probabilities, to calculate the expected monetary value. Table 17.11 shows these calculations.

Table 17.10 Bayes' theorem calculations for the platform selection example

Event Ei               Prior probability   Conditional probability   Joint probability    Revised probability
                       P(Ei)               P(F1 | Ei)                P(F1 | Ei)P(Ei)      P(Ei | F1)
E1 = Recession         0.10                0.20                      0.02                 0.02/0.57 = 0.035
E2 = Stable economy    0.40                0.40                      0.16                 0.16/0.57 = 0.281
E3 = Moderate growth   0.30                0.70                      0.21                 0.21/0.57 = 0.368
E4 = Boom              0.20                0.90                      0.18                 0.18/0.57 = 0.316
                                                                     0.57

Figure 17.5 Decision tree with joint probabilities for the platform selection example

P(E1) = 0.10:  P(E1 and F1) = P(F1 | E1)P(E1) = (0.20)(0.10) = 0.02;  P(E1 and F2) = P(F2 | E1)P(E1) = (0.80)(0.10) = 0.08
P(E2) = 0.40:  P(E2 and F1) = P(F1 | E2)P(E2) = (0.40)(0.40) = 0.16;  P(E2 and F2) = P(F2 | E2)P(E2) = (0.60)(0.40) = 0.24
P(E3) = 0.30:  P(E3 and F1) = P(F1 | E3)P(E3) = (0.70)(0.30) = 0.21;  P(E3 and F2) = P(F2 | E3)P(E3) = (0.30)(0.30) = 0.09
P(E4) = 0.20:  P(E4 and F1) = P(F1 | E4)P(E4) = (0.90)(0.20) = 0.18;  P(E4 and F2) = P(F2 | E4)P(E4) = (0.10)(0.20) = 0.02


Table 17.11 Expected monetary value using revised probabilities for each alternative of advertising on two platforms under four economic conditions

                           Alternative courses of action ($'000)
Event Ei          Pi      Platform A (Xij)   XijPi                 Platform B (Xij)   XijPi
Recession         0.035    30                30(0.035) = 1.05      -50                -50(0.035) = -1.75
Stable economy    0.281    70                70(0.281) = 19.67      30                 30(0.281) = 8.43
Moderate growth   0.368   100               100(0.368) = 36.80     250                250(0.368) = 92.00
Boom              0.316   150               150(0.316) = 47.40     400                400(0.316) = 126.40
                                            EMV(A) = 104.92                           EMV(B) = 225.08

Thus, the expected monetary value or profit for advertising on platform A is $104,920, and the expected monetary value or profit for platform B is $225,080. Using this criterion, you should once again choose platform B, since the expected monetary value is much higher. However, you should re-examine the return-to-risk ratios in light of these revised probabilities. Using Equations 5.2a and 5.3 on page 183, for platform A, since μA = $104.92 thousand:

σ²A = ∑ (Xi − μA)² P(Xi)
    = (30 − 104.92)²(0.035) + (70 − 104.92)²(0.281) + (100 − 104.92)²(0.368) + (150 − 104.92)²(0.316)
    = 1,190.194

The standard deviation of platform A is σA = √1,190.194 = $34.50 thousand.

For platform B, since μB = $225.08 thousand:

σ²B = ∑ (Xi − μB)² P(Xi)
    = (−50 − 225.08)²(0.035) + (30 − 225.08)²(0.281) + (250 − 225.08)²(0.368) + (400 − 225.08)²(0.316)
    = 23,239.39

The standard deviation of platform B is σB = √23,239.39 = $152.445 thousand.

To calculate the coefficient of variation, substitute σ for S and EMV for X̄ in Equation 3.11 on page 105:

CVA = (σA / EMVA) × 100% = (34.50 / 104.92) × 100% = 32.88%

and:

CVB = (σB / EMVB) × 100% = (152.445 / 225.08) × 100% = 67.73%

Thus, there is still much more variation in the returns from platform B advertising than from platform A advertising.


For each of these platforms, you calculate the return-to-risk ratios as follows. For platform A, the return-to-risk ratio is:

RTRR(A) = 104.92 / 34.50 = 3.041

For platform B, the return-to-risk ratio is:

RTRR(B) = 225.08 / 152.445 = 1.476

Thus, using the return-to-risk ratio, you should select advertising on platform A. This decision is different from the one you reached when using expected monetary value (or the equivalent expected opportunity loss). What platform should you use? Your final decision will depend on whether you believe it is more important to maximise the expected return on investment (select platform B) or to control the relative risk (select platform A).
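To illustrate the revision mechanics described above, the following Python sketch (my own illustration, not the text's software; the variable names are assumptions) applies Bayes' theorem to the prior probabilities and forecast likelihoods for the platform example and then recomputes the EMV, standard deviation and return-to-risk ratio under the revised probabilities.

```python
# Minimal sketch: Bayes' theorem revision of event probabilities, followed by
# the EMV and return-to-risk ratio under the revised probabilities.
from math import sqrt

priors = [0.10, 0.40, 0.30, 0.20]                    # recession, stable, moderate growth, boom
p_expanding_given_event = [0.20, 0.40, 0.70, 0.90]   # P(F1 | Ei): forecast of an expanding economy

joint = [p * prior for p, prior in zip(p_expanding_given_event, priors)]
revised = [j / sum(joint) for j in joint]            # P(Ei | F1); denominator = 0.57
print("Revised probabilities:", [round(r, 3) for r in revised])   # [0.035, 0.281, 0.368, 0.316]

payoffs = {"Platform A": [30, 70, 100, 150], "Platform B": [-50, 30, 250, 400]}   # $'000
for action, column in payoffs.items():
    emv = sum(x * p for x, p in zip(column, revised))
    sd = sqrt(sum((x - emv) ** 2 * p for x, p in zip(column, revised)))
    # Results agree with Table 17.11 and the calculations above up to rounding
    # (the table uses the revised probabilities rounded to three decimal places).
    print(f"{action}: EMV = {emv:.2f}  SD = {sd:.2f}  RTRR = {emv / sd:.3f}")
```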

Problems for Section 17.3

LEARNING THE BASICS
17.18 Consider the following payoff table:

                 Action
Event          A       B
1             50     100
2            200     125

For this problem, P(E1) = 0.5, P(E2) = 0.5, P(F | E1) = 0.6 and P(F | E2) = 0.4. Suppose you are informed that event F occurs.
a. Revise the probabilities P(E1) and P(E2) now that you know event F has occurred. Based on these revised probabilities, answer (b) to (i).
b. Calculate the expected monetary value of action A and action B.
c. Calculate the expected opportunity loss of action A and action B.
d. Explain the meaning of the expected value of perfect information (EVPI) in this problem.
e. On the basis of (b) or (c), which action should you choose? Why?
f. Calculate the coefficient of variation for each action.
g. Calculate the return-to-risk ratio (RTRR) for each action.
h. On the basis of (f) and (g), which action should you choose? Why?
i. Compare the results of (e) and (h), and explain any differences.

17.19 Consider the following payoff table:

                 Action
Event          A       B
1             50      10
2            300     100
3            500     200

For this problem, P(E1) = 0.8, P(E2) = 0.1 and P(E3) = 0.1, P(F | E1) = 0.2, P(F | E2) = 0.4 and P(F | E3) = 0.4. Suppose you are informed that event F occurs.
a. Revise the probabilities P(E1), P(E2) and P(E3) now that you know event F has occurred. Based on these revised probabilities, answer (b) to (i).
b. Calculate the expected monetary value of action A and action B.
c. Calculate the expected opportunity loss of action A and action B.
d. Explain the meaning of the expected value of perfect information (EVPI) in this problem.
e. On the basis of (b) and (c), which action should you choose? Why?
f. Calculate the coefficient of variation for each action.
g. Calculate the return-to-risk ratio (RTRR) for each action.
h. On the basis of (f) and (g), which action should you choose? Why?
i. Compare the results of (e) and (h) and explain any differences.

APPLYING THE CONCEPTS
17.20 In problem 17.12 on page 692, a street vendor outside a cricket ground was deciding whether to sell ice cream or hot dogs at today's game. Before making his decision, he decides to listen to the local weather forecast. In the past, when it has been cool the reporter has forecast cool weather 80% of the time. When it has been warm the reporter has forecast warm weather 70% of the time. The local forecast is for cool weather.
a. Revise the prior probabilities now that you know the forecast is for cool weather.
b. Use these revised probabilities to redo problem 17.12.
c. Compare the results in (b) to the original results of problem 17.12.


17.21 In problem 17.14 on page 693, an investor is trying to determine the optimal investment decision between three investment opportunities. Before making his decision, the investor decides to consult with his stockbroker. In the past, when the economy has declined the stockbroker has given a rosy forecast 20% of the time (with a gloomy forecast 80% of the time). When there has been no change in the economy the stockbroker has given a rosy forecast 40% of the time. When there has been an expanding economy the stockbroker has given a rosy forecast 70% of the time. The stockbroker in this case gives a gloomy forecast for the economy.
a. Revise the probabilities of the investor based on the stockbroker's economic forecast.
b. Use these revised probabilities to redo problem 17.14.
c. Compare the results in (b) with the original results of problem 17.14.

17.22 In problem 17.16 on page 693, an author is deciding which of two competing publishing companies to select to publish her new novel. Before making a final decision, the author decides to have an experienced reviewer examine her novel. This reviewer has an outstanding reputation for predicting the success of a novel. In the past, for novels that sold 1,000 copies only 1% received favourable reviews. Of novels that sold 5,000 copies, 25% received favourable reviews. Of novels that sold 10,000 copies, 60% received favourable reviews. Of novels that sold 50,000 copies, 99% received favourable reviews. After examining the author's novel, the reviewer gives it an unfavourable review.
a. Revise the probabilities of the number of books sold in light of the reviewer's unfavourable review.
b. Use these revised probabilities to redo problem 17.16.
c. Compare the results in (b) with the original results of problem 17.16.

17.4 UTILITY

LEARNING OBJECTIVE 4
Recognise the concept of utility

utility A measure of the desirability of different outcomes for an individual decision maker.

The methods used in Sections 17.1 to 17.3 assume that each incremental amount of profit or loss has the same value as the previous amounts of profits attained or losses incurred. In fact, under many circumstances in the business world, this assumption of incremental changes is not valid. Most companies, as well as most individuals, make special efforts to avoid large losses. At the same time, many companies, as well as most individuals, place less value on extremely large profits as compared with initial profits. Such differential evaluation of incremental profits or losses is referred to as utility, a concept first discussed by Daniel Bernoulli in the eighteenth century (see reference 1). To illustrate this concept, suppose that you are faced with the following two choices:

1. Choice 1: A fair coin is to be tossed. If it lands on heads you will receive $0.60; if it lands on tails you will pay $0.40.
2. Choice 2: Do not play the game.

Which one should you choose? The expected value of playing this game is (0.60)(0.50) + (−0.40)(0.50) = +$0.10, and the expected value of not playing the game is 0. Most people will decide to play the game, since the expected value is positive and only small amounts of money are involved. Suppose, however, that the game is formulated with a payoff of $600,000 when the coin lands on heads and a loss of $400,000 when the coin lands on tails. The expected value of playing the game is now +$100,000. With these payoffs, even though the expected value is positive, most individuals will not play the game because of the severe negative consequences of losing $400,000. Each additional dollar amount of either profit or loss does not have the same utility as the previous amount. Large negative amounts for most individuals have severely negative utility; conversely, the extra value of each incremental dollar of profit decreases when high enough profit levels are reached.

An important part of the decision-making problem, which is beyond the scope of this text (see references 2 and 3), is to develop a utility curve for the decision maker that represents the utility of each specified dollar amount. Figure 17.6, overleaf, illustrates three types of utility curves: those of the risk averter, the risk seeker and the risk-neutral person.

risk-averter's curve A utility curve that increases rapidly then levels off as dollar amounts increase.

The risk-averter's curve shows a rapid increase in utility for initial amounts of money followed by a gradual levelling off for increasing dollar amounts. This curve is appropriate for most individuals or businesses, because the value of each additional dollar is not as great after large amounts of money have already been earned.


Figure 17.6 Three types of utility curves (utility plotted against dollar amount): panel A, risk averter; panel B, risk seeker; panel C, risk neutral

risk-seeker's curve A utility curve that increases more rapidly as dollar amounts increase.

risk-neutral curve A utility curve where each additional dollar of profit has the same value.

The risk-seeker's curve represents the utility of someone who enjoys taking risks. The utility is greater for large dollar amounts. This curve represents an individual who is interested only in 'striking it rich' and is willing to take large risks for the opportunity of making large profits. The risk-neutral curve represents the expected monetary value approach. Each additional dollar of profit has the same value as the previous dollar.

After a utility curve is developed in a specific situation, you convert the dollar amounts to utilities. Then you calculate the utility of each alternative course of action and apply the decision criteria of expected utility value, expected opportunity loss and return-to-risk ratio to make a decision.
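The effect of a risk-averse utility curve can be sketched numerically. The following Python example is my own illustration, not material from the text: the exponential utility function and the risk-tolerance parameter R are assumptions chosen only to show how a concave curve can reverse a decision based on expected monetary value for the $600,000/$400,000 coin-toss gamble discussed above.

```python
# Minimal sketch: expected monetary value versus expected utility for the large coin-toss gamble.
# u(x) = 1 - exp(-x / R) is a commonly used risk-averse (concave) utility curve;
# R is an assumed risk-tolerance parameter, not a value given in the text.
from math import exp

R = 500_000
def utility(x):
    return 1 - exp(-x / R)

outcomes = [600_000, -400_000]     # heads, tails
probs = [0.5, 0.5]

emv = sum(x * p for x, p in zip(outcomes, probs))
expected_utility_play = sum(utility(x) * p for x, p in zip(outcomes, probs))
expected_utility_pass = utility(0)   # not playing leaves wealth unchanged

print(f"EMV of playing = {emv:+,.0f}")                          # +100,000: playing looks attractive
print(f"Expected utility of playing = {expected_utility_play:.3f}")
print(f"Expected utility of not playing = {expected_utility_pass:.3f}")
# With this degree of risk aversion the expected utility of playing is negative,
# so a risk averter declines a gamble that has a positive expected monetary value.
```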

Problems for Section 17.4

APPLYING THE CONCEPTS
17.23 Do you consider yourself a risk seeker, a risk averter or a risk-neutral person? Explain.
17.24 Refer to problems 17.3–17.5 on page 685 and 17.12–17.14 on pages 692–693. In which problems do you think the expected monetary value (risk-neutral) criterion is inappropriate? Why?

17 Assess your progress

Summary
In this chapter, you learned how to develop payoff tables and decision trees, to use various criteria to choose between alternative courses of action and to revise probabilities in light of sample information using Bayes' theorem. In the scenario at the beginning of the chapter, you learned how a company could use these tools to decide whether to advertise on social media platform A or on platform B. You found that platform B had a higher expected monetary value and a lower expected opportunity loss, but a lower return-to-risk ratio. You also learned how to calculate the expected value of perfect information and how to distinguish between utility curves for risk seekers and risk averters.


Key formulas

Expected monetary value
EMV(j) = \sum_{i=1}^{N} X_{ij} P_i    (17.1)

Expected opportunity loss
EOL(j) = \sum_{i=1}^{N} L_{ij} P_i    (17.2)

Expected value of perfect information
EVPI = expected profit under certainty - expected monetary value of the best alternative    (17.3)

Return-to-risk ratio
RTRR(j) = EMV(j) / σj    (17.4)

Key terms
alternative courses of action 681
decision criteria 681
decision tree 682
events or states of the world 681
expected monetary value (EMV) 686
expected opportunity loss (EOL) 687
expected profit under certainty 688
expected value of perfect information (EVPI) 688
opportunity loss 683
payoffs 681
payoff table 681
return-to-risk ratio (RTRR) 690
risk-averter's curve 699
risk-neutral curve 700
risk-seeker's curve 700
utility 699

References
1. Bernstein, P. L., Against the Gods: The Remarkable Story of Risk (New York: John Wiley, 1996).
2. Render, B., R. M. Stair, M. Hanna & T. S. Hale, Quantitative Analysis for Management, 13th edn (Boston: Pearson, 2018).
3. Tversky, A. & D. Kahneman, 'Rational choice and the framing of decisions', Journal of Business, 59 (1986): 251–278.

Chapter review problems

CHECKING YOUR UNDERSTANDING
17.25 What is the difference between an event and an alternative course of action?
17.26 What are the advantages and disadvantages of a payoff table as compared with a decision tree?
17.27 How are opportunity losses calculated from payoffs?
17.28 Why can't an opportunity loss be negative?
17.29 How does expected monetary value (EMV) differ from expected opportunity loss (EOL)?
17.30 What is the meaning of the expected value of perfect information (EVPI)?
17.31 How does the expected value of perfect information differ from the expected profit under certainty?
17.32 What are the advantages and disadvantages of using expected monetary value (EMV) as compared with the return-to-risk ratio (RTRR)?
17.33 How is Bayes' theorem used to revise probabilities in light of sample information?
17.34 What is the difference between a risk averter and a risk seeker?
17.35 Why should you use utilities instead of payoffs in certain circumstances?

APPLYING THE CONCEPTS
17.36 A supermarket chain purchases large quantities of discounted own-brand white bread for sale during a week. The stores purchase the bread for $1.00 a loaf and sell it for $1.80 a loaf. Any loaves not sold after three days can be donated to charity, returning a tax benefit worth approximately $0.32 per loaf. Based on past demand, the probability of various levels of demand is as follows:

Demand (loaves)    Probability
 6,000                0.10
 8,000                0.50
10,000                0.30
12,000                0.10

a. Construct the payoff table indicating the events and alternative courses of action.
b. Construct the decision tree.
c. Calculate the expected monetary value (EMV) for purchasing 6,000, 8,000, 10,000 and 12,000 loaves.
d. Calculate the expected opportunity loss (EOL) for purchasing 6,000, 8,000, 10,000 and 12,000 loaves.


e. Explain the meaning of the expected value of perfect information (EVPI) in this problem.
f. Based on the results of (c) or (d), how many loaves would you purchase? Why?
g. Calculate the coefficient of variation for each purchase level.
h. Calculate the return-to-risk ratio (RTRR) for each purchase level.
i. Based on (g) and (h), what action would you choose? Why?
j. Compare the results of (f) and (i) and explain any differences.
k. Suppose that new information changes the probabilities associated with the demand level as follows:

Demand (loaves)    Probability
 6,000                0.30
 8,000                0.40
10,000                0.20
12,000                0.10

Repeat (c) to (j) of this problem with these new probabilities and compare the results with those of (c) to (j).

17.37 The owner of a company that supplies home heating oil to Australian customers is deciding whether to offer a solar heating installation service to this market. The owner of the company has determined that a startup cost of $150,000 would be necessary, but a profit of $2,000 can be made on each solar heating system installed. The owner estimates the probability of various demand levels as follows:

Number of units installed    Probability
 50                             0.40
100                             0.30
200                             0.30

a. Construct the payoff table indicating the events and alternative courses of action.
b. Construct the decision tree.
c. Construct the opportunity loss table.
d. Calculate the expected monetary value (EMV) for offering this solar heating system installation service.
e. Calculate the expected opportunity loss (EOL) for offering this solar heating system installation service.
f. Explain the meaning of the expected value of perfect information (EVPI) in this problem.
g. Calculate the return-to-risk ratio (RTRR) for offering this solar heating system installation service.
h. Based on the results of (d) or (e) and (g), should the company offer this solar heating system installation service? Why?
i. How would your answers to (a) to (h) be affected if the startup cost was $200,000?

17.38 The manufacturer of a nationally distributed brand of potato chips wants to determine the feasibility of changing the product package from a cellophane bag to an unbreakable container. The product manager believes that there are three possible national market responses to a change in product package: weak, moderate and strong. The projected payoffs, in millions of dollars, in increased or decreased profit compared with the current package are as follows:

                               Strategy
Event                       Use new package    Keep old package
Weak national response          -$4                  0
Moderate national response        1                  0
Strong national response          5                  0

Based on experience, the product manager assigns the following probabilities to the different levels of national response:

P(Weak national response) = 0.30
P(Moderate national response) = 0.60
P(Strong national response) = 0.10

a. Construct the decision tree.
b. Construct the opportunity loss table.
c. Calculate the expected monetary value (EMV) for offering this new product package.
d. Calculate the expected opportunity loss (EOL) for offering this new product package.
e. Explain the meaning of the expected value of perfect information (EVPI) in this problem.
f. Calculate the return-to-risk ratio (RTRR) for offering this new product package.
g. Based on the results of (c) or (d) and (f), should the company offer this new product package? Why?
h. What are your answers to parts (c) to (g) if the probabilities were 0.6, 0.3 and 0.1 respectively?
i. What are your answers to parts (c) to (g) if the probabilities were 0.1, 0.3 and 0.6 respectively?

Before making a final decision, the product manager would like to test market the new package in a selected city by substituting the new package for the old package. A determination will then be made about whether sales have increased, decreased or stayed the same. In previous test marketing of other products, when there was a subsequent weak national response, sales in the test city decreased 60% of the time, stayed the same 30% of the time and increased 10% of the time. When there was a moderate national response, sales in the test city decreased 20% of the time, stayed the same 40% of the time and increased 40% of the time. When there was a strong national response, sales in the test city decreased 5% of the time, stayed the same 35% of the time and increased 60% of the time.
j. If sales in the test city stayed the same, revise the original probabilities in light of this new information.
k. Use the revised probabilities in (j) to repeat (c) to (g).
l. If sales in the test city decreased, revise the original probabilities in light of this new information.
m. Use the revised probabilities in (l) to repeat (c) to (g).


17.39 A manufacturer of a brand of inexpensive felt-tip pens maintains a production process that produces 10,000 pens per day. In order to maintain the highest quality of this product, the manufacturer guarantees free replacement of any defective pen sold. Each defective pen produced costs 20 cents for the manufacturer to replace. Based on past experience, four rates of producing defective pens are possible:
· very low – 1% of the pens will be defective
· low – 5% of the pens will be defective
· moderate – 10% of the pens will be defective
· high – 20% of the pens will be defective
The manufacturer can reduce the rate of defective pens produced by having a mechanic fix the machines at the end of each day. This mechanic can reduce the rate to 1%, but his services will cost $80. A payoff table based on the daily production of 10,000 pens, indicating the replacement costs ($) for each of the two alternatives (calling in the mechanic and not calling in the mechanic), is as follows:

                            Action
Defective rate         Do not call mechanic    Call mechanic
Very low (1%)                  20                   100
Low (5%)                      100                   100
Moderate (10%)                200                   100
High (20%)                    400                   100

Based on past experience, each defective rate is assumed to be equally likely to occur.
a. Construct a decision tree.
b. Construct an opportunity loss table.
c. Calculate the expected monetary value (EMV) for calling and for not calling the mechanic.
d. Calculate the expected opportunity loss (EOL) for calling and for not calling the mechanic.
e. Explain the meaning of the expected value of perfect information (EVPI) in this problem.
f. Calculate the return-to-risk ratio (RTRR) for calling and not calling the mechanic.
g. Based on the results of (c), (d) and (f), should the company call the mechanic? Why?
h. At the end of a day's production, a sample of 15 pens is selected, and 2 are defective. Revise the prior probabilities in light of this sample information. (Hint: Use the binomial distribution to determine the probability of the outcome that occurred, given a particular defective rate.)
i. Use the revised probabilities in (h) to repeat (c) to (g).

Chapter 17 Excel Guide

EG17.1 OPPORTUNITY LOSS
See Appendix D.23 (Opportunity Loss) if you want PHStat to produce an opportunity loss analysis worksheet. (There are no Microsoft Excel commands that directly perform opportunity loss analysis.)

EG17.2 EXPECTED MONETARY VALUE
See Appendix D.25 (Expected Monetary Value) if you want PHStat to produce an expected monetary value analysis worksheet. (There are no Microsoft Excel commands that directly perform expected monetary value analysis.)

EG17.3 EXPECTED OPPORTUNITY LOSS
See Appendix D.24 (Expected Opportunity Loss) if you want PHStat to produce an expected opportunity loss worksheet for you. (There are no Microsoft Excel commands that directly produce such a worksheet.)

Microsoft® product screen shots are reprinted with permission from Microsoft Corporation.


CHAPTER 18
Statistical applications in quality management

TASMAN UNIVERSITY TECHNOLOGY SERVICES (TS) SERVICE DESK
Tasman University aims to continually improve student satisfaction. A factor in student satisfaction is how quick and easy it is to get help when problems arise.

Having just received training in Six Sigma management, Robin, manager of Technology Services (TS), believes that by applying the Six Sigma principles to enquiries received by TS, student satisfaction can be improved. Robin has decided to focus on calls from students to the TS Service Desk, which past data shows often involve complicated or urgent technology problems. Are calls answered quickly? Do TS Service Desk staff have sufficient training and knowledge to solve the problem? Does the student call again for the same issue? To study these student satisfaction issues, Robin has embarked on an improvement project that measures two critical-to-quality (CTQ) measurements, the total time a call takes and if the problem was solved. Robin would like to learn the following:



■ Is the first call resolution rate (FCR) – that is, the rate at which the problem is solved on the first call, with no repeat calls required from either the student or a TS support officer – acceptable?
■ Is the caller invested time (CIT) – that is, the total time the call took, including time spent in a queue waiting for the call to be answered, talking to the TS support officer and spent on hold – acceptable?
■ Are the first call resolution rate and caller invested time consistent from day to day, or are they increasing or decreasing?
■ When the proportion of calls not solved on the first call or the caller invested time is greater than normal, is this due to a chance occurrence or is there a fundamental flaw in the process used to answer and deal with calls?


LEARNING OBJECTIVES

After studying this chapter you should be able to:
1 distinguish between common cause and special cause variation
2 select appropriate control charts for categorical and numerical data
3 construct and interpret attribute and variables control charts
4 measure the capability of a process

In this chapter, the focus is on quality management. Companies manufacturing products, as well as those providing services, realise that quality is essential for survival in the global economy. The areas in which quality has an impact on our everyday work and personal lives include:
• the design, production and subsequent reliability of our cars
• the services provided by hotels, banks, schools, shopping centres and mail-order companies
• the continuous improvement in computer hardware that makes for faster and more reliable computers
• the availability of new technology and equipment that has led to improved diagnosis of illnesses and the improved delivery of health care services.
In this chapter, a process is the value-added transformation of inputs to outputs. The inputs and outputs of a process can involve machines, materials, methods, measurement, people and the environment. Each of the inputs is a source of variability, which can lead to variability in the output. We seek to monitor and manage this variability in output as it can result in poor service and/or poor product quality, both of which can decrease customer satisfaction. Two important quality improvement approaches, total quality management and Six Sigma management, will be discussed, before we introduce control charts, a statistical tool widely used to monitor variability and improve quality.

process The value-added transformation of inputs to outputs.

18.1  TOTAL QUALITY MANAGEMENT
Total quality management (TQM) is an approach to improving quality that includes a broad range of behavioural, management and technical techniques. It focuses on continuous improvement of products and services through an emphasis on statistics, process improvement and optimisation of the total system. The key concepts of quality management, developed by, among others, W. Edwards Deming, Joseph M. Juran and Kaoru Ishikawa, were implemented successfully in Japan in the 1950s. These ideas were then 'exported' to the West and further developed. Total quality management is characterised by the following themes:
• The primary focus is on process improvement.
• Most of the variation in a process is due to the system and not the individual.
• Teamwork is an integral part of a quality management organisation.
• Customer satisfaction is a primary organisational goal.
• Organisational transformation must occur in order to implement quality management.
• Fear must be removed from organisations.
• Higher quality costs less, not more, but requires an investment in training.

total quality management (TQM) Approach to quality improvement with emphasis on continuous improvement and the total system.

Today, quality improvement systems have been implemented in many organisations worldwide. Even if the name TQM is not used, the underlying philosophy and statistical


Deming’s 14 points for management Aspects of total quality management.

Shewhart–Deming cycle Improvement process used by TQM: ‘plan, do, study, act’.

Figure 18.1 Shewhart–Deming cycle

methods used in today's quality improvement systems are consistent with TQM, as reflected by Deming's 14 points for management:
1. Create constancy of purpose for improvement of product and service.
2. Adopt the new philosophy.
3. Cease dependence on inspection to achieve quality.
4. End the practice of awarding business on the basis of price tag alone. Instead, minimise total cost by working with a single supplier.
5. Improve constantly and forever every process for planning, production and service.
6. Institute on-the-job training.
7. Adopt and institute leadership.
8. Drive out fear.
9. Break down barriers between staff areas.
10. Eliminate slogans, exhortations and targets for the workforce.
11. Eliminate numerical quotas for the workforce and numerical goals for management.
12. Remove barriers that rob people of pride of workmanship. Eliminate the annual rating or merit system.
13. Institute a vigorous program of education and self-improvement for everyone.
14. Put everyone in the company to work to accomplish the transformation.
Points 1, 2, 5, 7 and 14 focus on the need for commitment and leadership from management for organisational transformation, without which any improvements obtained will be limited. One aspect of this improvement process is illustrated by the Shewhart–Deming cycle shown in Figure 18.1. The Shewhart–Deming cycle represents a continuous cycle of 'plan, do, study and act'. The first step, planning, represents the initial design phase for planning a change in a manufacturing or service process. This step involves teamwork by individuals from different areas within an organisation. The second step, doing, involves implementing the change, preferably on a small scale. The third step, studying, involves an analysis of the results using statistical tools to determine what was learned. The fourth step, acting, involves the acceptance of the change, its abandonment or further study of the change under different conditions.

(Figure 18.1 depicts this as a loop: Plan → Do → Study → Act.)

Point 3, ‘Cease dependence on inspection to achieve quality’, implies that any inspection whose purpose is to improve quality is too late because the quality is either already built into the product or it is not. It is better to focus on making it right the first time. Among the difficulties involved in inspection (besides high costs) are the failure of inspectors to agree on the operational definitions for nonconforming items, and the problem of separating good and bad items. The following example illustrates the difficulties inspectors face. Suppose your job involves proofreading the sentence in Figure 18.2 with the objective of counting the number of occurrences of the letter F. Perform this task and ask two or three others to perform the task as well, each recording the number of occurrences of the letter ‘F’. It is likely that your answers are not consistent as people usually see either three Fs or six Fs. The correct number is six Fs. The number you see depends on the method you use to examine the sentence. You are likely to find three Fs if you read the sentence phonetically and six Fs if you count the number of Fs carefully. This exercise shows that a simple well-defined process of counting F’s can lead to inconsistent results. Therefore, what may happen when a complicated


process does not have a well-defined, clear operational definition of 'nonconforming'? In such situations, we would expect a large amount of variability from inspector to inspector.
Point 4, ending the practice of awarding business on the basis of price tag alone, focuses on the idea that there is no real long-term meaning to price without knowledge of the quality of the product. Points 6 and 13 refer to training, education and self-improvement for all employees, including managers. Continuous learning is critical for quality improvement within an organisation. In particular, management needs to understand the differences between special causes and common causes of variation (Section 18.3) so that proper action is taken in each circumstance.
Points 8 to 12 relate to the evaluation of employee performance. Deming believed that an emphasis on targets and exhortations places an improper burden on the workforce. Workers cannot produce beyond what the system allows (as illustrated in the red bead experiment in Section 18.5). It is management's job to improve the system, not to raise the expectations on workers beyond the system's capability. Although Deming's points were thought provoking, some criticised his approach for lacking a formal, objective accountability. Many managers of large organisations, used to seeing financial analyses of policy changes, needed a more prescriptive approach.

Figure 18.2 An example of a proofreading process: FINISHED FILES ARE THE RESULT OF YEARS OF SCIENTIFIC STUDY COMBINED WITH THE EXPERIENCE OF MANY YEARS
Source: W. W. Scherkenbach, The Deming Route to Quality and Productivity: Road Maps and Roadblocks (Washington, DC: CEEP Press, 1986). Reproduced with permission of W. W. Scherkenbach.

18.2  SIX SIGMA MANAGEMENT
Six Sigma management is a quality improvement system originally developed by Motorola in the mid-1980s. It is used by many companies to improve efficiency, cut costs, eliminate defects and reduce product variation. Six Sigma offers a more prescriptive and systematic approach to process improvement than TQM. It is distinguished from other quality improvement systems by its clear focus on achieving bottom-line results in a relatively short three- to six-month period of time.
The Six Sigma approach assumes that processes are designed so that the upper and lower specification limits (Section 18.8) are six standard deviations away from the mean, hence the name. The Six Sigma approach also assumes that the process may shift as much as 1.5 standard deviations over the long term. Therefore, if the processes are monitored correctly with control charts (Section 18.3), the worst-case scenario is for the mean to shift to within 4.5 standard deviations of the nearest specification limit. Because the area under the normal curve more than 4.5 standard deviations below (or more than 4.5 standard deviations above) the mean is approximately 3.4 out of 1 million (Table E.2 reports this probability as 0.000 003 398), the process will result in no more than 3.4 defects per million.
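A quick numerical check of the 3.4-defects-per-million figure can be made with any statistics package; the Python sketch below (using scipy.stats) simply evaluates the normal tail area beyond 4.5 standard deviations.

```python
# Check the '3.4 defects per million' figure: with a 1.5-standard-deviation
# shift, the nearest specification limit sits 4.5 standard deviations away.
from scipy.stats import norm

p_beyond = norm.cdf(-4.5)            # area more than 4.5 SDs beyond the mean
print(p_beyond)                      # ~0.0000033977 (Table E.2: 0.000 003 398)
print(p_beyond * 1_000_000)          # ~3.4 defects per million opportunities
```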

Six Sigma management Approach to process improvement with an emphasis on accountability and bottom-line results.

The DMAIC Model
To guide managers in their task of improving short- and long-term results, Six Sigma uses a five-step process known as the DMAIC model, named for the five steps in the process: Define, Measure, Analyse, Improve and Control.
• Define  The problem is defined along with the costs, benefits and the impact on the customer.
• Measure  Important characteristics related to the quality of the service or product are identified and discussed, with operational definitions for each critical-to-quality (CTQ) characteristic identified and then developed. In addition, the measurement procedure is verified so that it is consistent over repeated measurements.

DMAIC model Improvement process used by Six Sigma: ‘Define, Measure, Analyse, Improve, Control’. critical-to-quality (CTQ) Characteristics that affect quality.




• Analyse  The root causes of why defects occur are determined, and variables in the process causing the defects are identified. Data are collected to determine benchmark values for each process variable. This analysis often uses control charts (discussed in Sections 18.3 to 18.7).
• Improve  The impact of each process variable on the CTQ characteristic is studied using designed experiments. The objective is to determine the best level for each variable.
• Control  The objective is to maintain the benefits for the long term by avoiding potential problems that can occur when a process is changed.

Implementation of Six Sigma management requires a data-oriented approach that is heavily based on using statistical tools such as control charts and designed experiments. It also involves training everyone in the organisation in the DMAIC model.

LEARNING OBJECTIVE 1
Distinguish between common cause and special cause variation

control chart Chart that monitors variation in a characteristic of a product or service over time.

special (or assignable) causes of variation Large fluctuations or patterns in data that are not inherent to a process; these fluctuations reflect changes in the process. common (or chance) causes of variation Variability inherent in a process; it cannot be reduced without redesigning the process.

18.3  THE THEORY OF CONTROL CHARTS
Both total quality management and Six Sigma management make use of a wide array of statistical tools. One tool widely used in each approach to analyse process data collected sequentially over time is the control chart. A control chart monitors variation in a characteristic of a product or service over time. You can use a control chart to study past performance, to evaluate present conditions or to predict future outcomes. Information gained from analysing a control chart forms the basis for process improvement. Different types of control charts allow you to analyse different types of CTQ variables – for categorical variables such as the proportion of problems solved on first contact, discrete variables such as the number of appliances replaced under warranty during a week, and continuous variables such as the length of time a call took. In addition to providing a visual display of data representing a process, a principal focus of the control chart is the attempt to separate special causes of variation from common causes of variation.
Special causes of variation represent large fluctuations or patterns in the data that are not inherent to a process. These fluctuations are often caused by changes in the process that represent either problems to correct or opportunities to exploit. Some organisations refer to special causes of variation as assignable causes of variation. Common causes of variation represent the inherent variability that exists in a process. These fluctuations consist of the numerous small causes of variability that operate randomly or by chance. Some organisations refer to common causes of variation as chance causes of variation.
Walter Shewhart (see reference 1) developed an experiment that illustrates the distinction between common and special causes of variation. The experiment asks you repeatedly to write the letter A in a horizontal line across a piece of paper:
AAAAAAAAAAAAAAAAA
When you do this, you immediately notice that the As are all similar but not exactly the same. In addition, you may notice some difference in the size of the As from letter to letter. This difference is due to common-cause variation. Nothing special happened that caused the differences in the size of the A. You probably would have a hard time trying to explain why the largest A is bigger than the smallest A. These types of differences almost certainly represent common-cause variation. However, if you did the experiment over again but wrote half of the As with your right hand and the other half of the As with your left hand, you would almost certainly see a very big difference in the As written with each hand. In this case, the hand that you used to write the As is the source of the special-cause variation. The distinction between the two causes of variation is crucial because special causes of variation are not part of a process and are correctable or exploitable without changing the process.


Common causes of variation, however, can be reduced only by changing the process. Such systemic changes are the responsibility of management.
Control charts allow you to monitor a process and identify the presence or absence of special causes of variation. By doing so, control charts help prevent two types of errors. The first type of error involves the belief that an observed value represents special-cause variation when it is due to the common-cause variation of the system. Treating common-cause variation as special-cause variation often results in overadjustment of a process. This overadjustment, known as tampering, increases the variation in the process. The second type of error involves treating special-cause variation as common-cause variation. This error results in not taking immediate corrective action when necessary. Although both of these types of errors can occur even when using a control chart, they are less likely.
To construct a control chart, collect samples from the output of a process over time. These samples are known as subgroups. For each subgroup (i.e. sample), calculate the value of a statistic associated with a CTQ variable. Commonly used statistics include the sample proportion for a categorical variable (see Section 18.4), the number of items of interest (see Section 18.6) and the mean and range of a numerical variable (see Section 18.7). Then plot the values against time and add control limits to the chart. The most typical form of a control chart sets control limits that are within ±3 standard deviations¹ of the statistical measure of interest. Equation 18.1 defines, in general, the upper and lower control limits for control charts.

CONSTRUCTING CONTROL LIMITS

Process mean ± 3 standard deviations  (18.1)

so that:
Upper control limit (UCL) = process mean + 3 standard deviations
Lower control limit (LCL) = process mean − 3 standard deviations

tampering Over-adjustment that increases variation in a process.

subgroup Sample used in a control chart.

upper control limit (UCL) Upper limit for a control chart, typically three standard deviations above the process mean. lower control limit (LCL) Lower limit for a control chart, typically three standard deviations below the process mean.

When these control limits are set, evaluate the control chart by trying to find any pattern that might exist in the values over time and by determining whether any points fall outside the control limits. Figure 18.3 illustrates three different situations. In panel A of Figure 18.3, there is no apparent pattern in the values over time and there are no points that fall outside the 3 standard deviation control limits. The process appears stable and contains only common-cause variation. Panel B, in contrast, contains two points that fall outside the 3 standard deviation control limits. These points should be investigated to try to determine the special causes that led to their occurrence.

Figure 18.3 Three control chart patterns. Panel A: common-cause variation only – no points outside the 3σ limits and no pattern over time. Panel B: points outside the control limits – special-cause variation. Panel C: a pattern over time – special-cause variation.

¹ Recall from Section 6.2 that in the normal distribution, μ ± 3σ includes almost all (99.73%) of the observations in the population.


Although panel C does not have any points outside the control limits, it has a series of consecutive points above the mean value (the centre line) as well as a series of consecutive points below the mean value. In addition, a long-term overall downward trend is clearly visible. You should investigate the situation to try to determine what may have caused this pattern.
Detecting a trend or pattern is not always so obvious. Two simple rules that allow you to detect a shift in the mean level of a process involve the following results:
• Eight or more consecutive points lie above the centre line, or eight or more consecutive points lie below the centre line.
• Eight or more consecutive points move upward in value, or eight or more consecutive points move downward in value.
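A minimal sketch of how these two run rules, together with the ±3 standard deviation limits of Equation 18.1, might be checked programmatically is shown below. The function name and the short illustrative series are hypothetical; statistical software and Excel add-ins provide equivalent checks.

```python
def control_signals(values, centre, sigma, run_length=8):
    """Flag points outside the 3-sigma limits and simple run-rule violations."""
    ucl, lcl = centre + 3 * sigma, centre - 3 * sigma
    outside = [i + 1 for i, x in enumerate(values) if x > ucl or x < lcl]

    def longest_run(flags):
        best = cur = 0
        for f in flags:
            cur = cur + 1 if f else 0
            best = max(best, cur)
        return best

    same_side = max(longest_run([x > centre for x in values]),
                    longest_run([x < centre for x in values]))
    diffs = [b - a for a, b in zip(values, values[1:])]
    # eight points moving in one direction corresponds to seven consecutive steps
    trend_points = max(longest_run([d > 0 for d in diffs]),
                       longest_run([d < 0 for d in diffs])) + 1
    return {'points outside limits': outside,
            'eight or more on one side': same_side >= run_length,
            'eight or more trending': trend_points >= run_length}

# Example: a stable series with one unusually large value at position 6
print(control_signals([5.1, 4.9, 5.0, 5.2, 4.8, 9.0, 5.1], centre=5.0, sigma=0.5))
```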

out-of-control process Process that contains special-cause variation as well as common-cause variation.

in-control process Process that contains only common-cause variation. state of statistical control A process that is in control.

LEARNING OBJECTIVE 2
Select appropriate control charts for categorical and numerical data

LEARNING OBJECTIVE 3
Construct and interpret attribute and variables control charts

A process whose control chart indicates an out-of-control condition (a point outside the control limits or exhibiting a trend) is said to be out of control. An out-of-control process contains both common-cause and special-cause variation. Because special causes of variation are not part of the process design, an out-of-control process is unpredictable. Once you determine that a process is out of control, you must identify the special causes of variation that are producing the out-of-control conditions. If the special causes are detrimental to the quality of the product or service, you need to implement plans to eliminate this source of variation. When a special cause increases quality, you should change the process so that the special cause is incorporated into the process design. Thus, this beneficial special cause becomes a common-cause source of variation and the process is improved. A process whose control chart is not indicating any out-of-control conditions is said to be in control. An in-control process contains only common-cause variation. Because these sources of variation are inherent to the process itself, an in-control process is predictable. In-control processes are sometimes said to be in a state of statistical control. When a process is in control, you must determine whether the amount of common-cause variation in the process is small enough to satisfy the customers of the products or services. (In Section 18.8, you will learn statistical methods that allow you to compare common-cause variation with customer expectations.) If the common-cause variation is small enough to satisfy the customer consistently, then use control charts to monitor the process on a continuing basis to make sure that the process remains in control. If the common-cause variation is too large you need to alter the process itself.

18.4  CONTROL CHART FOR THE PROPORTION – THE p CHART
Various types of control charts are used to monitor processes and determine whether special-cause variation is present. Attribute charts are used for categorical or discrete variables. This section introduces the p chart, which plots the proportion of items in a sample that are in a category of interest. For example, sampled items are often classified according to whether they conform or do not conform to operationally defined requirements. Thus, the p chart can be used to monitor and analyse the proportion of nonconforming items there are in repeated samples (i.e. subgroups) selected from a process. Section 18.6 introduces the c chart, an attribute chart for the number of nonconformances (or occurrences) in a given area of opportunity.
To begin the discussion of p charts, recall that proportions and the binomial distribution were introduced in Section 5.3. Then, in Section 7.3, Equations 7.6 and 7.7 defined the sample proportion as p = X/n, with the standard deviation of the sample proportion given by:

attribute chart Control chart for categorical or discrete variables. p chart Control chart for proportion of nonconforming items.

$$\sigma_p = \sqrt{\frac{\pi(1-\pi)}{n}}$$

Using Equation 18.1, Equation 18.2 gives the control limits for the proportion of nonconforming² items from the sample data.

² In this chapter the term 'nonconforming items' is used, while in earlier chapters the term 'success' was used.


CONTROL LIMITS FOR THE p CHART

$$\bar{p} \pm 3\sqrt{\frac{\bar{p}(1-\bar{p})}{\bar{n}}} \qquad (18.2)$$

$$\text{UCL} = \bar{p} + 3\sqrt{\frac{\bar{p}(1-\bar{p})}{\bar{n}}} \qquad \text{LCL} = \bar{p} - 3\sqrt{\frac{\bar{p}(1-\bar{p})}{\bar{n}}}$$

For equal $n_i$:

$$\bar{n} = n_i \quad \text{and} \quad \bar{p} = \frac{\sum_{i=1}^{k} p_i}{k}$$

or, in general:

$$\bar{n} = \frac{\sum_{i=1}^{k} n_i}{k} \quad \text{and} \quad \bar{p} = \frac{\sum_{i=1}^{k} X_i}{\sum_{i=1}^{k} n_i}$$

where
Xi = number of nonconforming items in subgroup i
ni = sample (or subgroup) size for subgroup i
pi = Xi/ni = proportion of nonconforming items in subgroup i
k = number of subgroups selected
n̄ = mean subgroup size
p̄ = estimated proportion of nonconforming items

Any negative value for the lower control limit means that the lower control limit does not exist. To show the application of the p chart, return to the Tasman University Technology Services (TS) scenario concerning the quality of support provided by the TS Service Desk. During the Measure phase of the Six Sigma DMAIC model, a nonconformance was operationally defined as a problem not being solved on first contact. During the Analyse phase of the Six Sigma DMAIC model, data on the nonconformances were collected daily from a sample of 200 calls to the TS Service Desk. Table 18.1 lists the number and proportion of problems not solved on first contact for each day in the four-week period. < CALLS1 >

For these data, $k = 28$, $\sum_{i=1}^{k} p_i = 2.315$ and, because the $n_i$ are equal, $n_i = \bar{n} = 200$. Thus:

$$\bar{p} = \frac{\sum_{i=1}^{k} p_i}{k} = \frac{2.315}{28} = 0.0827$$

Using Equation 18.2:

$$0.0827 \pm 3\sqrt{\frac{(0.0827)(0.9173)}{200}} \approx 0.0827 \pm 0.0584$$

so that:
UCL = 0.0827 + 0.0584 = 0.1411
LCL = 0.0827 − 0.0584 = 0.0243


Table 18.1  Nonconforming calls to TS Service Desk over a 28-day period (200 calls sampled each day)

Day  Problem not solved  Proportion    Day  Problem not solved  Proportion
 1   16                  0.080         15   18                  0.090
 2    7                  0.035         16   13                  0.065
 3   21                  0.105         17   15                  0.075
 4   17                  0.085         18   10                  0.050
 5   25                  0.125         19   14                  0.070
 6   19                  0.095         20   25                  0.125
 7   16                  0.080         21   19                  0.095
 8   15                  0.075         22   12                  0.060
 9   11                  0.055         23    6                  0.030
10   12                  0.060         24   12                  0.060
11   22                  0.110         25   18                  0.090
12   20                  0.100         26   15                  0.075
13   17                  0.085         27   20                  0.100
14   26                  0.130         28   22                  0.110

Figure 18.4 displays the Microsoft Excel control chart for the data of Table 18.1. Figure 18.4 indicates a process in a state of statistical control, with the individual points distributed around p̄ without any pattern and all the points within the control limits. Thus, any improvement in the process of solving TS problems in the Improve phase of the DMAIC model must come from the reduction of common-cause variation. Such reductions require a change in the process. These changes are the responsibility of management. Remember that improvements cannot occur until improvements to the process itself are successfully implemented.
This example illustrated a situation in which the subgroup size did not vary. As a general rule, as long as none of the subgroup sizes ni differ from the mean subgroup size n̄ by more than ±25%, you can use Equation 18.2 to calculate the control limits for the p chart. If any subgroup size differs by more than ±25%, there are alternative formulas for calculating the control limits. To illustrate the use of the p chart when the subgroup sizes are unequal, Example 18.1 studies the production of gauze sponges.

Figure 18.4 Microsoft Excel p chart for the nonconforming calls to TS Service Desk

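The control limits plotted in Figure 18.4 can be reproduced directly from Equation 18.2. The Python sketch below uses the daily counts from Table 18.1; it is an illustrative calculation, not the Microsoft Excel/PHStat procedure described in the Excel Guide.

```python
# p chart limits (Equation 18.2) for the TS Service Desk data in Table 18.1,
# where every subgroup has the same size n = 200 calls.
from math import sqrt

not_solved = [16, 7, 21, 17, 25, 19, 16, 15, 11, 12, 22, 20, 17, 26,
              18, 13, 15, 10, 14, 25, 19, 12, 6, 12, 18, 15, 20, 22]
n = 200
k = len(not_solved)                       # 28 subgroups

p_i = [x / n for x in not_solved]
p_bar = sum(p_i) / k                      # about 0.0827
half_width = 3 * sqrt(p_bar * (1 - p_bar) / n)

ucl = p_bar + half_width                  # about 0.141
lcl = max(p_bar - half_width, 0.0)        # about 0.024
print(round(p_bar, 4), round(lcl, 4), round(ucl, 4))

out_of_control = [day + 1 for day, p in enumerate(p_i) if p > ucl or p < lcl]
print(out_of_control)                     # [] - consistent with Figure 18.4
```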


EXAMPLE 18.1
USING THE p CHART FOR UNEQUAL SUBGROUP SIZES
Table 18.2 indicates the number of sponges produced daily and the number of nonconforming sponges for a period of 32 days. < SPONGE > Construct a control chart for these data.

Table 18.2  Nonconforming sponges over a 32-day period

Day  Sponges produced  Nonconforming  Proportion    Day  Sponges produced  Nonconforming  Proportion
 1   690               21             0.030         17   575               20             0.035
 2   580               22             0.038         18   610               16             0.026
 3   685               20             0.029         19   596               15             0.025
 4   595               21             0.035         20   630               24             0.038
 5   665               23             0.035         21   625               25             0.040
 6   596               19             0.032         22   615               21             0.034
 7   600               18             0.030         23   575               23             0.040
 8   620               24             0.039         24   572               20             0.035
 9   610               20             0.033         25   645               24             0.037
10   595               22             0.037         26   651               39             0.060
11   645               19             0.029         27   660               21             0.032
12   675               23             0.034         28   685               19             0.028
13   670               22             0.033         29   671               17             0.025
14   590               26             0.044         30   660               22             0.033
15   585               17             0.029         31   595               24             0.040
16   560               16             0.029         32   600               16             0.027

SOLUTION

For these data:

$$k = 32, \quad \sum_{i=1}^{k} n_i = 19{,}926 \quad \text{and} \quad \sum_{i=1}^{k} X_i = 679$$

Thus, using Equation 18.2:

$$\bar{n} = \frac{19{,}926}{32} = 622.69 \quad \text{and} \quad \bar{p} = \frac{679}{19{,}926} = 0.034$$

so that:

$$0.034 \pm 3\sqrt{\frac{(0.034)(1-0.034)}{622.69}} = 0.034 \pm 0.022$$

Thus:
UCL = 0.034 + 0.022 = 0.056
LCL = 0.034 − 0.022 = 0.012
Figure 18.5, overleaf, displays the Microsoft Excel control chart for the sponge data. An examination of Figure 18.5 indicates that day 26, with 39 nonconforming sponges out of 651, is above the upper control limit. Management needs to determine the reason (i.e. root cause) for this special-cause variation and take corrective action. Once action is taken, you can remove the data from day 26 and then construct and analyse a new control chart.


Figure 18.5 Microsoft Excel p chart for the proportion of nonconforming sponges

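For unequal subgroup sizes, Equation 18.2 is applied with the mean subgroup size n̄ and the overall proportion p̄ = ΣXi/Σni. A sketch of the Example 18.1 calculation in Python is shown below; again, this is illustrative rather than the PHStat output.

```python
# p chart limits with unequal subgroup sizes, using the gauze-sponge data of
# Table 18.2 (n-bar replaces the common subgroup size in Equation 18.2).
from math import sqrt

produced = [690, 580, 685, 595, 665, 596, 600, 620, 610, 595, 645, 675, 670, 590, 585, 560,
            575, 610, 596, 630, 625, 615, 575, 572, 645, 651, 660, 685, 671, 660, 595, 600]
defective = [21, 22, 20, 21, 23, 19, 18, 24, 20, 22, 19, 23, 22, 26, 17, 16,
             20, 16, 15, 24, 25, 21, 23, 20, 24, 39, 21, 19, 17, 22, 24, 16]

n_bar = sum(produced) / len(produced)          # 19,926 / 32 = 622.69
p_bar = sum(defective) / sum(produced)         # 679 / 19,926 = 0.034
half_width = 3 * sqrt(p_bar * (1 - p_bar) / n_bar)
ucl, lcl = p_bar + half_width, max(p_bar - half_width, 0.0)

print(round(p_bar, 3), round(lcl, 3), round(ucl, 3))      # 0.034, 0.012, 0.056
flagged = [d + 1 for d, (x, n) in enumerate(zip(defective, produced))
           if not lcl <= x / n <= ucl]
print(flagged)                                 # [26] - day 26 is above the UCL
```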

Problems for Section 18.4

LEARNING THE BASICS
18.1 The following data were collected on nonconformances for a period of 10 days.

Day                1    2    3    4    5    6    7    8    9   10
Sample size      100  100  100  100  100  100  100  100  100  100
Nonconformances   12   14   10   18   22   14   15   13   14   16

a. On what day is the proportion of nonconformances largest? Smallest?
b. Calculate the LCL and UCL.
c. Are there any special causes of variation?

18.2 The following data were collected on nonconformances for a period of 10 days.

Day                1    2    3    4    5    6    7    8    9   10
Sample size      111   93  105   92  117   88  117   87  119  107
Nonconformances   12   14   10   18   22   14   15   13   14   16

a. On what day is the proportion of nonconformances largest? Smallest?
b. Calculate the LCL and UCL.
c. Are there any special causes of variation?

APPLYING THE CONCEPTS You can solve problems 18.3–18.8 manually or using Microsoft Excel.

18.3 A medical transcription service enters medical data on patient files for hospitals. The service studied ways to improve the turnaround time (defined as the time between receiving data

Copyright © Pearson Australia (a division of Pearson Australia Group Pty Ltd) 2019— 9781488617249 — Berenson/Basic Business Statistics 5e



18.4 CONTROL CHART FOR THE PROPORTION – THE p CHART 715

and the time the client receives completed files). After studying the process, it was determined that turnaround time was increased by transmission errors. A transmission error was defined as data transmitted that did not go through as planned, and needed to be retransmitted. Each day for a month, a sample of 125 transmissions was randomly selected and evaluated for errors. The table below presents the number and proportion of transmissions with errors. < TRANSMIT >

Day  Number of errors  Proportion of errors    Day  Number of errors  Proportion of errors
 1    6                0.048                   17    4                0.032
 2    3                0.024                   18    6                0.048
 3    4                0.032                   19    3                0.024
 4    4                0.032                   20    5                0.040
 5    9                0.072                   21    1                0.008
 6    0                0.000                   22    3                0.024
 7    0                0.000                   23   14                0.112
 8    8                0.064                   24    6                0.048
 9    4                0.032                   25    7                0.056
10    3                0.024                   26    3                0.024
11    4                0.032                   27   10                0.080
12    1                0.008                   28    7                0.056
13   10                0.080                   29    5                0.040
14    9                0.072                   30    0                0.000
15    3                0.024                   31    3                0.024
16    1                0.008

a. Construct a p chart.
b. Is the process in a state of statistical control? Why?

18.4 The following data represent the findings from a study conducted at a factory that manufactures film canisters. For 32 days, 500 film canisters were sampled and inspected. The following table lists the number of defective film canisters (i.e. nonconforming items) for each day (i.e. subgroup). < CANISTER >

Day  Nonconforming    Day  Nonconforming    Day  Nonconforming    Day  Nonconforming
 1   26                9   23               17   23               25   27
 2   25               10   25               18   19               26   28
 3   23               11   22               19   18               27   24
 4   24               12   26               20   27               28   22
 5   26               13   25               21   28               29   20
 6   20               14   29               22   24               30   25
 7   21               15   20               23   26               31   27
 8   27               16   19               24   23               32   19

a. Construct a p chart.
b. Is the process in a state of statistical control? Why?

18.5 A hospital administrator is concerned with the time to process patients' medical records after discharge. She determined that all records should be processed within five days of discharge. Thus, any record not processed within five days of a patient's discharge is nonconforming. The administrator recorded the number of patients discharged and the number of records not processed within the five-day standard for a 30-day period in the file < MED_REC >.
a. Construct a p chart for these data.
b. Does the process give an out-of-control signal? Explain.
c. If the process is out of control, assume that special causes were subsequently identified and corrective action taken to keep them from happening again. Eliminate the data causing the out-of-control signals, and recalculate the control limits.

18.6 The bottling division of Sweet Suzy's Sugarless Cola maintains daily records of the occurrences of unacceptable cans flowing from the filling and sealing machine. The following table lists the number of cans filled and the number of nonconforming cans for one month (based on a five-day working week).

Day  Cans filled  Unacceptable cans    Day  Cans filled  Unacceptable cans
 1   5,043        47                   12   5,314        70
 2   4,852        51                   13   5,097        64
 3   4,908        43                   14   4,932        59
 4   4,756        37                   15   5,023        75
 5   4,901        78                   16   5,117        71
 6   4,892        66                   17   5,099        68
 7   5,354        51                   18   5,345        78
 8   5,321        66                   19   5,456        88
 9   5,045        61                   20   5,554        83
10   5,113        72                   21   5,421        82
11   5,247        63                   22   5,555        87

a. Construct a p chart for the proportion of unacceptable cans for the month.
b. Does the process give an out-of-control signal?
c. If you want to develop a process for reducing the proportion of unacceptable cans, how should you proceed?

18.7 The manager of the accounting office of a large corporation is studying the problem of entering incorrect account numbers into the computer system. A subgroup of 200 account numbers


is selected from each day's output, and each account number is inspected to determine whether it is a nonconforming item. The results for a period of 39 days are in the file < ERRORS_PC >.
a. Construct a p chart for the proportion of nonconforming items.
b. Does the process give an out-of-control signal?
c. Based on your answer to (b), if you were the manager of the accounting office what would you do to improve the process of account number entry?

LEARNING OBJECTIVE 1
Distinguish between common cause and special cause variation

18.8 Check$mart Bank is monitoring the proportion of transaction errors made by customer service operators at a (24-hour, 7 days a week) phone banking call centre. Data collected over a month (30 days) is given in the file < BANK_ERROR >.
a. Construct a p chart for the proportion of transaction errors.
b. Is the process out of control?
c. What should be done to improve the error rate?

18.5  THE RED BEAD EXPERIMENT – UNDERSTANDING PROCESS VARIABILITY
In Section 18.3 common- and special-cause variation were defined and discussed. The red bead experiment is now used to enhance your understanding of common-cause and special-cause variation. The red bead experiment involves the selection of beads from a box that contains 4,000 beads. Unknown to the participants in the experiment, 3,200 (80%) of the beads are white and 800 (20%) are red. Several different scenarios can be used for conducting the experiment. The one used here begins with a facilitator (playing the role of company supervisor) asking members of the audience to volunteer for the jobs of workers (four), inspectors (two), chief inspector (one) and recorder (one). A worker's job consists of using a paddle that has five rows of 10 bead-size holes to select 50 beads from the box of beads.
When the participants have been selected the supervisor explains the jobs to them. The job of the workers is to produce white beads, because red beads are unacceptable to the customers. Strict procedures are to be followed. Work standards call for the daily production of exactly 50 beads by each worker (a strict quota system). Management has established a standard that no more than two red beads (4%) per worker are to be produced on any given day. Each worker dips the paddle into the box of beads so that when it is removed, each of the 50 holes contains a bead. The worker carries the paddle to the two inspectors, who independently count and record the number of red beads. The chief inspector compares their counts and announces the results to the audience. The recorder writes down the number and percentage of red beads next to the name of the worker.
When all the people know their jobs, 'production' can begin. Suppose that on the first 'day', the number of red beads 'produced' by the four workers (Alyson, David, Peter and Sharyn) was 9, 12, 13 and 7, respectively. How should management react to the day's production when the standard says that no more than two red beads per worker should be produced? Should all the workers be reprimanded, or should only David and Peter be given a stern warning that they will be fired if they don't improve? Suppose that production continues for an additional two days. Table 18.3 summarises the results for all three days.

Table 18.3  Red bead experiment results for four workers over three days

Name            Day 1       Day 2       Day 3       All three days
Alyson           9 (18%)    11 (22%)     6 (12%)    26 (17.33%)
David           12 (24%)    12 (24%)     8 (16%)    32 (21.33%)
Peter           13 (26%)     6 (12%)    12 (24%)    31 (20.67%)
Sharyn           7 (14%)     9 (18%)     8 (16%)    24 (16.0%)
All 4 workers   41          38          34         113
Mean            10.25        9.5         8.5         9.42
Percentage      20.5%       19%         17%         18.83%


From Table 18.3, on each day some of the workers were above the mean and some below the mean. On day 1, Sharyn did best, but on day 2, Peter (who had the worst record on day 1) was best, and on day 3, Alyson was best. How can you explain all this variation? Use Equation 18.2 to develop a p chart for these data:

$$k = 4 \text{ workers} \times 3 \text{ days} = 12, \quad n = 50 \quad \text{and} \quad \sum_{i=1}^{k} X_i = 113$$

Thus:

$$\bar{p} = \frac{113}{(50)(12)} = 0.1883$$

so that:

$$\bar{p} \pm 3\sqrt{\frac{\bar{p}(1-\bar{p})}{n}} = 0.1883 \pm 3\sqrt{\frac{0.1883(1-0.1883)}{50}} = 0.1883 \pm 0.1659$$

Thus:
UCL = 0.1883 + 0.1659 = 0.3542
LCL = 0.1883 − 0.1659 = 0.0224
Figure 18.6 represents the p chart for the data in Table 18.3. In Figure 18.6, all of the points are within the control limits and there are no patterns in the results. The differences between the workers merely represent common-cause variation inherent in a stable system. The red bead experiment illustrates the following:
• Variation is an inherent part of any process.
• Workers work within a system over which they have little control. It is the system that primarily determines their performance.
• Only management can change the system.
• There will always be some workers above the mean and some workers below the mean.

Figure 18.6 p chart for the red bead experiment (proportion of red beads for each worker on each day, plotted against UCL = 0.3542, p̄ = 0.1883 and LCL = 0.0224)
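The conclusions of the red bead experiment can also be seen by simulating it. The sketch below draws 50 beads for each of four workers on three 'days' from a box that is 20% red; any differences between the simulated workers are produced entirely by the sampling, that is, by common-cause variation. (The random seed is an arbitrary illustration.)

```python
# Simulate the red bead experiment: a box of 4,000 beads, 800 of them red,
# with each worker drawing 50 beads per day.
import random

random.seed(1)
BOX = ['red'] * 800 + ['white'] * 3200

for day in range(1, 4):
    counts = {worker: sum(b == 'red' for b in random.sample(BOX, 50))
              for worker in ['Alyson', 'David', 'Peter', 'Sharyn']}
    print('Day', day, counts)   # counts vary even though nothing special happened
```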


Problems for Section 18.5

APPLYING THE CONCEPTS
18.9 In the red bead experiment, how do you think many managers would have reacted after day 1? Day 2? Day 3?
18.10 (Class project) Obtain a version of the red bead experiment for your class.
a. Conduct the experiment in the same way as described in this section.
b. Remove 400 red beads from the bead box before beginning the experiment. How do your results differ from those in (a)? What does this tell you about the effect of the 'system' on the workers?

LEARNING OBJECTIVE 2
Select appropriate control charts for categorical and numerical data

LEARNING OBJECTIVE 3
Construct and interpret attribute and variables control charts

c chart Control chart for the number of nonconformities.
area of opportunity Area of interest for counting nonconformities.

18.6  CONTROL CHART FOR AN AREA OF OPPORTUNITY – THE c CHART
In Section 18.4 the proportion of nonconforming items was monitored and analysed using a p chart, where nonconformities are defects or flaws in a product or service. In this section we use a c chart to monitor and analyse the number of nonconformities in an area of opportunity. An area of opportunity can be an individual unit of a product or service, or a unit of time, space or area. Examples are the number of flaws in a square metre of carpet, the number of typographical errors on a printed page, the number of computer outages in a given month or the number of customer complaints in a given week.
In a p chart, each unit was classified as either conforming or nonconforming, allowing the binomial distribution to be used. In a c chart, the number of nonconformities in an area of opportunity is counted, which fits the assumptions of the Poisson distribution (Section 5.4). For the Poisson distribution, the standard deviation of the number of nonconformities is the square root of the mean number of nonconformities (λ). Assuming that the size of each area of opportunity remains constant,³ you can calculate the control limits for the number of nonconformities per area of opportunity using the observed mean number of nonconformities as an estimate of λ. Equation 18.3 defines the control limits for the c chart, which is used to monitor and analyse the number of nonconformities per area of opportunity.

CONTROL LIMITS FOR THE c CHART

$$\bar{c} \pm 3\sqrt{\bar{c}} \qquad (18.3)$$

$$\text{UCL} = \bar{c} + 3\sqrt{\bar{c}} \qquad \text{LCL} = \bar{c} - 3\sqrt{\bar{c}}$$

where

$$\bar{c} = \frac{\sum_{i=1}^{k} c_i}{k}$$

k = number of units sampled
ci = number of nonconformities in unit i

Suppose Innovative Kitchens, which designs and installs custom-made kitchens, including kitchen appliances, is monitoring customer complaints with the aim of improving customer satisfaction. If a customer is dissatisfied with a product purchased or service received, they are asked to complete a customer complaint form.

³ If the size of the unit varies, you should use the u chart instead of the c chart.


At the end of each week, the number of complaints received is recorded. In this example a complaint is a nonconformity and the area of opportunity is one week. Table 18.4 lists the number of complaints from the past 50 weeks. < COMPLAINTS >

Table 18.4  Number of complaints in the past 50 weeks

Week  Complaints    Week  Complaints    Week  Complaints
  1    8             18    7             35    3
  2   10             19   10             36    5
  3    6             20   11             37    2
  4    7             21    8             38    4
  5    5             22    7             39    3
  6    7             23    8             40    3
  7    9             24    6             41    4
  8    8             25    7             42    2
  9    7             26    7             43    4
 10    9             27    5             44    5
 11   10             28    8             45    5
 12    7             29    6             46    3
 13    8             30    7             47    2
 14   11             31    5             48    5
 15   10             32    5             49    4
 16    9             33    4             50    4
 17    8             34    4

For these data:

$$k = 50 \quad \text{and} \quad \sum_{i=1}^{k} c_i = 312$$

Thus:

$$\bar{c} = \frac{312}{50} = 6.24$$

so that, using Equation 18.3:

$$\bar{c} \pm 3\sqrt{\bar{c}} = 6.24 \pm 3\sqrt{6.24} = 6.24 \pm 7.494$$

Thus:
UCL = 6.24 + 7.494 = 13.734
LCL = 6.24 − 7.494 = −1.254 < 0
Therefore, the LCL does not exist.
Figure 18.7, overleaf, displays the control chart for the complaint data in Table 18.4. Figure 18.7 does not indicate any points outside the control limits. However, there is a clear pattern to the number of customer complaints over time. During the first half of the sequence almost all the weeks had more than the mean number of complaints, and almost all the weeks in the second half had less than the mean. There are more than eight points in a row above the centre line (weeks 6–23) and below the centre line (weeks 31–50), thus signalling a trend. This change, which is an improvement, is due to a special cause of variation. The next step is to investigate the process and determine the special cause that produced this pattern. When identified, you then need to ensure that this becomes a permanent improvement, not a temporary phenomenon. In other words, the source of the special-cause variation must become part of the permanent ongoing process in order for the number of customer complaints not to slip back to the high levels experienced in the first half of the study period.
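The c chart limits just calculated, and the run of weeks below the centre line, can be verified with a few lines of code. The Python sketch below uses the weekly counts from Table 18.4 and is illustrative only.

```python
# c chart limits (Equation 18.3) for the complaint counts in Table 18.4,
# where the area of opportunity is one week.
from math import sqrt

complaints = [8, 10, 6, 7, 5, 7, 9, 8, 7, 9, 10, 7, 8, 11, 10, 9, 8,
              7, 10, 11, 8, 7, 8, 6, 7, 7, 5, 8, 6, 7, 5, 5, 4, 4,
              3, 5, 2, 4, 3, 3, 4, 2, 4, 5, 5, 3, 2, 5, 4, 4]

c_bar = sum(complaints) / len(complaints)       # 312 / 50 = 6.24
ucl = c_bar + 3 * sqrt(c_bar)                   # about 13.73
lcl = max(c_bar - 3 * sqrt(c_bar), 0.0)         # negative, so no LCL

print(round(c_bar, 2), round(ucl, 2), lcl)
# No weekly count exceeds the UCL, but the long run of weeks below c-bar in the
# second half of the series signals the downward shift discussed above.
longest_below = max(len(run) for run in
                    ''.join('b' if c < c_bar else '.' for c in complaints).split('.'))
print(longest_below)                             # well beyond eight in a row
```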


Figure 18.7 c chart for Innovative Kitchens customer complaints (weekly number of complaints plotted against UCL = 13.73, CL = 6.24 and LCL = 0.00)

Problems for Section 18.6

LEARNING THE BASICS
18.11 The following data were collected on the number of nonconformities for 10 time periods:

Time              1   2   3   4   5   6   7   8   9  10
Nonconformities   7   3   6   3   4   5   3   5   2   0

a. Construct the appropriate control chart and determine the LCL and UCL.
b. Are there any special causes of variation?

18.12 The following data were collected on the number of nonconformities for 10 time periods:

Time              1   2   3   4   5   6   7   8   9  10
Nonconformities  25  11  10  11   6  15  12  10   9   6

a. Construct the appropriate control chart and determine the LCL and UCL.
b. Are there any special causes of variation?

APPLYING THE CONCEPTS
You can solve problems 18.13 to 18.17 manually or using Microsoft Excel.

18.13 To improve service quality the owner of a dry cleaning business wants to study the number of dry cleaned items that are returned for re-cleaning per day. Records were kept for a four-week period (the store is open Monday to Saturday) with the results given below. < DRYCLEAN >

Day             1   2   3   4   5   6   7   8   9  10  11  12
Items returned  4   6   3   7   6   8   6   4   8   6   5  12

Day            13  14  15  16  17  18  19  20  21  22  23  24
Items returned  5   8   3   4  10   9   6   5   8   6   7   9

a. Construct a c chart for the number of items per day that are returned for re-cleaning. Do you think that the process is in a state of statistical control?
b. Should the owner of the dry cleaning store take action to investigate why 12 items were returned for re-cleaning on day 12? Explain. Would your answer change if 20 items were returned for re-cleaning on day 12?


c. On the basis of the results in (a), what should the owner of the dry cleaning store do to reduce the number of items per day that are returned for re-cleaning?

18.14 The branch manager of a savings bank has recorded the number of errors that each of 12 tellers has made during the past year. The results were as follows. < TELLER >

Teller    Number of errors    Teller     Number of errors
Alice      4                  Mitchell    6
Carl       7                  Nora        3
Gina      12                  Paul        5
Jane       6                  Susan       4
Linda      2                  Thomas      7
Marla      5                  Vera        5

a. Do you think the bank manager will single out Gina for any disciplinary action regarding her performance in the last year?
b. Construct a c chart for the number of errors committed by the 12 tellers. Is the number of errors in a state of statistical control?
c. Based on the c chart developed in (b), do you now think that Gina should be singled out for disciplinary action regarding her performance? Does your conclusion now agree with what you expected the manager to do?
d. On the basis of the results in (b), what should the branch manager do to reduce the number of errors?

18.15 Falls are one source of preventable hospital injury. Although most patients who fall are not hurt, a risk of serious injury is involved. The data in the file < PT_FALLS > represent the number of patient falls per month over a 28-month period in a large metropolitan hospital.

a. Construct a c chart for the number of patient falls per month. Is the process of patient falls per month in a state of statistical control?
b. What effect would it have on your conclusions if you knew that the hospital was opened only one month before the beginning of data collection?
c. What other factors might contribute to special-cause variation in this problem?

18.16 The data in the file < FATAL_CRASH_JUNE > give the number of fatal crashes per day in Australia during June 2016 and June 2017 (data obtained from the Australian Road Deaths Database, accessed July 2017). A researcher wants to determine whether variation in the number of fatal crashes is due to chance variation or special-cause variation. Separately for June of each year:
a. What is the mean number of fatal crashes per day?
b. Construct a c chart for the number of fatal crashes per day.
c. There were 7 fatal crashes on 10 June 2016. Is this large value explained by chance variation, or does it appear that special causes of variation occurred?
d. What conclusions can you draw from these data?

18.17 For an investigation into the number of homicide and related offences, Hemi obtained the data in < NZ_HOMICIDE >, which give the number of homicide and related offences per month in New Zealand from June 2015 to May 2017 (data obtained from New Zealand Police, Offender Demographics (Proceedings), policedata.nz, accessed 26 July 2017). Hemi wants to determine whether months containing more than the mean number of homicide and related offences were due to chance variation or special-cause variation.
a. What is the mean number of recorded homicide and related offences?
b. Construct a c chart for the number of recorded homicide and related offences.
c. Is the process in a state of statistical control?

18.7  CONTROL CHARTS FOR THE RANGE AND THE MEAN
Variables control charts are used to monitor and analyse a process when you have numerical data. Common numerical variables include time, money and measurements such as weight and height. Because numerical variables provide more information than attribute data, which relate to the proportion of nonconforming items or the number of nonconformities, variables control charts are more sensitive in detecting special-cause variation than a p chart or a c chart. Variables charts are typically used in pairs. One chart monitors the variability in a process, and the other monitors the mean. You should examine the chart that monitors variability first because, if it indicates the presence of out-of-control conditions, the interpretation of the chart for the mean will be misleading. Although businesses currently use several alternative pairs of charts, this text considers only the control charts for the range and the mean.

The R Chart
You can use several different types of control charts to monitor the variability in a numerically measured characteristic of interest. The simplest and most common is the control chart for the range, the R chart.

LEARNING OBJECTIVE 2
Select appropriate control charts for categorical and numerical data

LEARNING OBJECTIVE 3
Construct and interpret attribute and variables control charts

variables control charts Control charts for numerical variables.


R chart Control chart for the range.

d2 factor Represents the relationship between standard deviation and range, used in X̄ and R charts.
d3 factor Represents the relationship between standard deviation and standard error, used in R charts.

Use the range chart only when the sample size is 10 or fewer. If the sample size is greater than 10, a standard deviation chart is preferable. Since sample sizes of five or fewer are typically used in many applications, the standard deviation chart is not covered in this text. The R chart enables you to determine whether the variability in a process is in control or whether changes in the amount of variability are occurring over time. If the process range is in control, then the amount of variation in the process is consistent over time, and you can use the results of the R chart to develop the control limits for the mean.
To develop control limits for the range, you need an estimate of the mean range and the standard deviation of the range. As shown in Equation 18.4, these control limits depend on two constants: the d2 factor, which represents the relationship between the standard deviation and the range for varying sample sizes, and the d3 factor, which represents the relationship between the standard deviation and the standard error of the range for varying sample sizes. Table E.13 contains values for these factors. Equation 18.4 defines the control limits for the R chart.

CONTROL LIMITS FOR THE RANGE
R̄ ± 3R̄(d3/d2)
UCL = R̄ + 3R̄(d3/d2)
LCL = R̄ − 3R̄(d3/d2)    (18.4)
where
R̄ = (ΣRi)/k, the mean of the k subgroup ranges
k = number of subgroups

D3 factor  Used to calculate the LCL of an R chart.
D4 factor  Used to calculate the UCL of an R chart.

You can simplify the calculations in Equation 18.4 by using the D3 factor, equal to 1 − 3(d3/d2), and the D4 factor, equal to 1 + 3(d3/d2), to express the control limits as shown in Equations 18.5a and 18.5b.

CALCULATING CONTROL LIMITS FOR THE RANGE
UCL = D4R̄    (18.5a)
LCL = D3R̄    (18.5b)

To illustrate the R chart, return to the scenario at the beginning of the chapter concerning calls to the TS Service Desk at Tasman University. During the Measure phase of the Six Sigma DMAIC model, the caller invested time (CIT) was operationally defined as the time from when the student picked up the phone to ring the TS Service Desk to the time the call ended. During the Analyse phase of the Six Sigma DMAIC model, data were recorded over a four-week period. Subgroups of five calls were selected from each day. Table 18.5 summarises the results for all 28 days. < CALLS2 >




18.7 Control Charts for the Range and the Mean 723

Table 18.5  Caller invested time (CIT) and subgroup mean and range for 28 days
Day    CIT in minutes                     Mean    Range
 1     6.7   9.7  11.7   7.5   7.8        8.68     5.0
 2     7.6   9.0  11.4   8.4   9.2        9.12     3.8
 3     9.5   9.9   8.9   8.7  10.7        9.54     2.0
 4     9.8   6.9  13.2   9.3   9.4        9.72     6.3
 5    11.0  11.3   9.9  11.6   8.5       10.46     3.1
 6     8.3   9.7   8.4   9.8   7.1        8.66     2.7
 7     9.4   8.2   9.3   7.1   6.1        8.02     3.3
 8    11.2  10.5   9.8   9.0   9.7       10.04     2.2
 9    10.0   9.0  10.7   8.2  11.0        9.78     2.8
10     8.6   8.7   5.8   9.5  11.4        8.80     5.6
11    10.7   9.1   8.6  10.9   8.6        9.58     2.3
12    10.8  10.6   8.3  10.3  10.0       10.00     2.5
13     9.5   7.0  10.5   8.6  10.1        9.14     3.5
14    12.9   8.1   8.9   9.0   7.6        9.30     5.3
15     7.8  12.2   9.0   9.1  11.7        9.96     4.4
16    11.1   8.8   9.9   5.5   9.5        8.96     5.6
17     9.2  12.3   9.7   8.1   8.5        9.56     4.2
18     9.0  10.2   8.1   9.7   8.4        9.08     2.1
19     9.9   8.9  10.1   9.6   7.1        9.12     3.0
20    10.7  10.2   9.8   8.0  10.2        9.78     2.7
21     9.0   9.6  10.0  10.6   9.0        9.64     1.6
22    10.7   9.4   9.8   7.0   8.9        9.16     3.7
23    10.2   9.5  10.5  12.2   9.1       10.30     3.1
24    10.0   9.5  11.1   8.8   9.9        9.86     2.3
25     9.6  11.4   8.8  12.2   9.3       10.26     3.4
26     8.2   8.4   7.9   9.5   9.2        8.64     1.6
27     7.1  10.8  11.1  11.0  10.2       10.04     4.0
28    11.1  12.0   6.6  11.5   9.7       10.18     5.4
Sums:                                    265.38    97.5

For the data in Table 18.5:
k = 28, ΣRi = 97.5 and R̄ = ΣRi/k = 97.5/28 = 3.482
Using Equation 18.4 and d2 = 2.326 and d3 = 0.864 from Table E.13 for n = 5:
3.482 ± 3(3.482)(0.864/2.326) = 3.482 ± 3.880
Thus:
UCL = 3.482 + 3.880 = 7.362
LCL = 3.482 − 3.880 = −0.398 < 0
Therefore, the LCL does not exist because it is impossible to get a negative range.
Alternatively, using Equations 18.5a and 18.5b and D3 = 0 and D4 = 2.114 from Table E.13:
UCL = D4R̄ = (2.114)(3.482) = 7.361
and the LCL does not exist.
Figure 18.8 displays the Microsoft Excel R chart for the CITs. It does not indicate any individual ranges outside the control limits or any trends.
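The same limits can be reproduced outside Excel. The short Python sketch below is only an illustrative check, not the text's Excel/PHStat workflow; the 28 subgroup ranges come from Table 18.5 and the D3 and D4 factors from Table E.13 for n = 5.

# Sketch: R-bar and the R chart limits of Equations 18.5a and 18.5b
# for the Table 18.5 caller-invested-time ranges.
ranges = [5.0, 3.8, 2.0, 6.3, 3.1, 2.7, 3.3, 2.2, 2.8, 5.6, 2.3, 2.5, 3.5, 5.3,
          4.4, 5.6, 4.2, 2.1, 3.0, 2.7, 1.6, 3.7, 3.1, 2.3, 3.4, 1.6, 4.0, 5.4]

D3, D4 = 0.0, 2.114                       # Table E.13 factors for subgroups of n = 5

r_bar = sum(ranges) / len(ranges)         # mean range: 97.5 / 28 = 3.482
ucl_r = D4 * r_bar                        # UCL = D4 * R-bar = 7.361
lcl_r = D3 * r_bar                        # LCL = D3 * R-bar = 0 (no lower limit in practice)

print(f"R-bar = {r_bar:.3f}, UCL = {ucl_r:.3f}, LCL = {lcl_r:.3f}")

Running the sketch prints R-bar = 3.482, UCL = 7.361 and LCL = 0.000, matching the calculations above.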



Figure 18.8  Microsoft Excel R chart for the caller invested times (vertical axis: minutes, 0 to 8; horizontal axis: day, 0 to 30; plotted lines: UCL, RBar and LCL)

The X̄ Chart

X̄ chart  Control chart for the process mean.
A2 factor  Used to calculate control limits of an X̄ chart.

Once you have determined from the R chart that the range is in control, then continue by examining the control chart for the process mean, the X̄ chart.
The control chart for X̄ uses subgroups each of size n for k consecutive periods of time. To calculate control limits for the mean, you need to calculate the mean of the subgroup means (called X̿) and the standard error of the mean. The estimate of the standard error of the mean is a function of the d2 factor, which represents the relationship between the standard deviation and the range for varying sample sizes.⁴ Equations 18.6 and 18.7 define the control limits for the X̄ chart.

CONTROL LIMITS FOR THE X̄ CHART
X̿ ± 3R̄/(d2√n)
UCL = X̿ + 3R̄/(d2√n)
LCL = X̿ − 3R̄/(d2√n)    (18.6)
where
X̿ = (ΣX̄i)/k and R̄ = (ΣRi)/k
X̄i = sample mean of n observations at time i
Ri = range of n observations at time i
k = number of subgroups

You can simplify the calculations in Equation 18.6 by utilising the A2 factor, equal to 3/(d2√n). Equations 18.7a and 18.7b show the simplified control limits.

⁴ R̄/d2 is used to estimate the standard deviation of the population, and R̄/(d2√n) is used to estimate the standard error of the mean.





CALCULATING CONTROL LIMITS FOR THE MEAN USING THE A2 FACTOR
UCL = X̿ + A2R̄    (18.7a)
LCL = X̿ − A2R̄    (18.7b)

From Table 18.5:
k = 28, ΣX̄i = 265.38 and ΣRi = 97.5
so that:
X̿ = ΣX̄i/k = 265.38/28 = 9.478 and R̄ = ΣRi/k = 97.5/28 = 3.482
Using Equation 18.6 and d2 = 2.326 from Table E.13 for n = 5:
X̿ ± 3R̄/(d2√n) = 9.478 ± 3(3.482)/((2.326)√5) = 9.478 ± 2.008
Thus:
UCL = 9.478 + 2.008 = 11.486
LCL = 9.478 − 2.008 = 7.470
Alternatively, using Equations 18.7a and 18.7b, and A2 = 0.577 from Table E.13:
UCL = 9.478 + (0.577)(3.482) = 9.478 + 2.009 = 11.487
LCL = 9.478 − (0.577)(3.482) = 9.478 − 2.009 = 7.469
These results are the same as those using Equation 18.6, except for rounding error.
Figure 18.9 displays the Microsoft Excel X̄ chart for the caller invested time data. Figure 18.9 does not reveal any points outside the control limits or any trend.

Figure 18.9  Microsoft Excel X̄ chart for the caller invested times (vertical axis: minutes, 0 to 14; horizontal axis: day, 0 to 30; plotted lines: UCL, XBar and LCL)



Although there is considerable variability between the 28 subgroup means, the CIT process is in a state of statistical control because both the R chart and the X̄ chart are in control. If you want to reduce the variation or lower the caller invested time, then you need to change the process.
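If you want to verify the X̄ chart limits without Excel, a minimal Python sketch using the A2 factor follows; the values of X̿, R̄ and A2 are taken from the worked example above, and the snippet is illustrative rather than the text's Excel/PHStat procedure.

# Sketch: X-bar chart limits using Equations 18.7a and 18.7b for the
# caller-invested-time data (values from the worked example).
x_double_bar = 265.38 / 28    # mean of the 28 subgroup means (X-double-bar) = 9.478
r_bar = 97.5 / 28             # mean subgroup range (R-bar) = 3.482
A2 = 0.577                    # Table E.13 factor for subgroups of n = 5

ucl = x_double_bar + A2 * r_bar   # upper control limit, about 11.49
lcl = x_double_bar - A2 * r_bar   # lower control limit, about 7.47

print(f"UCL = {ucl:.3f}, centre = {x_double_bar:.3f}, LCL = {lcl:.3f}")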

Problems for Section 18.7

LEARNING THE BASICS
18.18 For subgroups of n = 4, what is the value of:
a. the d2 factor?
b. the d3 factor?
c. the D3 factor?
d. the D4 factor?
e. the A2 factor?
18.19 The following summary of data is for subgroups of n = 4 for a 10-day period.
Day   Mean   Range
 1    13.6    3.5
 2    14.3    4.1
 3    15.3    5.0
 4    12.6    2.8
 5    11.8    3.7
 6    12.9    4.8
 7    17.3    4.5
 8    13.9    2.9
 9    12.6    3.8
10    15.2    4.6
a. Calculate control limits for the range.
b. Is there evidence of special-cause variation in (a)?
c. Calculate control limits for the mean.
d. Is there evidence of special-cause variation in (c)?

APPLYING THE CONCEPTS
You should use Microsoft Excel to solve problems 18.20–18.26.
18.20 The manager of a Check$mart Bank branch wants to analyse waiting times of customers for teller service during the 12 noon to 1 pm lunch hour. A subgroup of four customers is selected (one in each 15-minute interval during the hour), and the time in minutes is measured from the point each customer enters the line to when they reach the teller window. The results over a four-week period are in the data file < BANK_TIME >.
Time in minutes (each line gives one of the four waiting times per day, days left to right):
Days 1–12:
7.2 5.6 5.5 4.4 9.7 8.3 4.7 8.8 5.7 1.7 2.6 4.6
8.4 7.9 8.7 3.3 7.3 3.2 8.0 5.4 4.6 4.8 8.9 9.1
6.6 5.3 5.5 8.4 4.7 4.1 4.0 3.0 3.9 5.2 2.7 6.3
4.9 4.2 6.0 7.4 5.8 6.2 5.8 6.9 4.6 5.2 4.8 3.4
Days 13–20:
4.9 7.1 7.1 6.7 5.5 4.9 7.2 6.1
6.2 6.3 5.8 6.9 6.3 5.1 8.0 3.4
7.8 8.2 6.9 7.0 3.2 3.2 4.1 7.2
8.7 5.5 7.0 9.4 4.9 7.6 5.9 5.9
a. Construct control charts for the mean and range.
b. Is the process in a state of statistical control?
18.21 The call centre supervisor of the IT helpdesk at a large university is monitoring the performance of the technical support staff. The data in the file < HELP_DESK > give the number of calls resolved during each 8-hour shift (8 am to 4 pm, Monday to Friday) by each of a subgroup of five support staff over a four-week period.
a. Construct a control chart for the range and mean.
b. Is the process in a state of statistical control?
18.22 A supplier of 'natural Australian' spring water states that the magnesium content is 1.6 mg/L. To check this during the bottling operation, the quality control department takes a random sample of four bottles each hour and calculates the magnesium content. The data in < SPRING_WATER > represent the magnesium content of the samples collected over a 24-hour bottling day.
a. Construct a control chart for the range.
b. Construct a control chart for the mean.
c. Is the process in a state of statistical control?
18.23 The data in the file < ROPE > are the tensile strengths in megapascals (MPa) of nylon fibre ropes. The data were collected in subgroups of three over a 24-hour manufacturing period.
a. Construct a control chart for the range.
b. Construct a control chart for the mean.
c. Is the process in a state of statistical control?
18.24 The director of radiology at a large metropolitan hospital is concerned about scheduling in the radiology facilities. On a typical day, 250 patients are transported to the radiology department for treatment or diagnostic procedures. If patients do not reach the radiology unit at their scheduled time, backups occur and other patients experience delays. The time it takes to transport patients to the radiology unit is operationally defined as the time between when transport is assigned to the patient and the time the patient arrives at





the radiology unit. A sample of n = 4 patients was selected each day for 20 days and the time to transport each patient (in minutes) was determined, with the results in the data file < TRANSPORT >.
a. Construct control charts for the mean and range.
b. Is the process in a state of statistical control?
18.25 A filling machine for a tea-bag manufacturer produces approximately 170 tea-bags per minute. The process manager monitors the weight of the tea placed in individual bags. A subgroup of n = 4 tea-bags is taken every 15 minutes for 25 consecutive time periods. The results are given below: < TEA3 >
Weight in grams (each line gives one of the four tea-bags per subgroup, samples left to right):
Samples 1–13:
5.32 5.63 5.56 5.32 5.45 5.29 5.57 5.44 5.53 5.41 5.55 5.58 5.63
5.77 5.50 5.44 5.54 5.40 5.67 5.45 5.50 5.53 5.46 5.42 5.50 5.40
5.52 5.61 5.49 5.25 5.67 5.55 5.51 5.58 5.58 5.36 5.45 5.75 5.46
5.61 5.40 5.57 5.42 5.47 5.44 5.54 5.58 5.53 5.53 5.56 5.53 5.54
Samples 14–25:
5.48 5.49 5.54 5.46 5.72 5.58 5.43 5.59 5.42 5.64 5.62 5.51
5.44 5.57 5.62 5.46 5.36 5.50 5.51 5.58 5.41 5.59 5.38 5.54
5.45 5.43 5.66 5.38 5.59 5.36 5.37 5.60 5.40 5.42 5.75 5.73
5.60 5.36 5.59 5.49 5.25 5.40 5.32 5.46 5.69 5.56 5.47 5.77
a. What are some of the sources of common-cause variation that might be present in this process?
b. What problems might occur that would result in special-cause variation?
c. Construct control charts for the range and the mean.
d. Is the process in a state of statistical control?
18.26 A manufacturing company makes brackets for bookshelves. The brackets provide critical structural support and must have a 90-degree bend ±1 degree. Measurements of the bend of the brackets were taken at 18 different times. Five brackets were sampled at each time. The data are in the file < ANGLE >.
a. Construct control charts for the range and mean.
b. Is the process in a state of statistical control?

18.8  PROCESS CAPABILITY

It is often necessary to analyse the amount of common-cause variation present in a process that is in control. Is the common-cause variation small enough to satisfy customers with the product or service? Or is the common-cause variation so large that there are too many dissatisfied customers and a process change is needed? Analysing the capability of a process is a way to answer these questions. There are many methods available to analyse and report process capability. However, this section begins with a relatively simple method to estimate the percentage of products or services that will satisfy the customer. Later in the section the use of capability indices is introduced.

LEARNING OBJECTIVE 4  Measure the capability of a process

Customer Satisfaction and Specification Limits
Quality is defined by the customer. A customer who believes that a product or service has met or exceeded their expectations will be satisfied. The management of a company must listen to the customer and translate the customer's needs and expectations into easily measured critical-to-quality (CTQ) variables. Management then sets specification limits for these CTQs. Specification limits are technical requirements set by management in response to customers' needs and expectations. The upper specification limit (USL) is the largest value a CTQ can have and still conform to customer expectations. Likewise, the lower specification limit (LSL) is the smallest value a CTQ can have and still conform to customer expectations.

specification limits  Technical requirements based on customers' needs and expectations.
upper specification limit (USL)  Largest value a CTQ can have to meet customer expectations.
lower specification limit (LSL)  Smallest value a CTQ can have to meet customer expectations.

Copyright © Pearson Australia (a division of Pearson Australia Group Pty Ltd) 2019— 9781488617249 — Berenson/Basic Business Statistics 5e

728 CHAPTER 18 STATISTICAL APPLICATIONS IN QUALITY MANAGEMENT

For example, a business selling natural, luxurious, handmade soap will understand that customers expect their soap to produce a certain amount of lather. The customer can become dissatisfied if the soap produces too much or too little lather. Product engineers know that the level of free fatty acids in the soap controls the amount of lather. Thus, the process manager, with input from the product engineers, sets both a USL and an LSL for the amount of free fatty acids in the soap. As an example of a case in which only a single specification limit is involved, consider calls to the TS Service Desk in the Tasman University Technology Services scenario. Since students want their calls to the TS Service Desk to be handled quickly and efficiently, Robin sets a USL for the caller invested time but no LSL. As you can see in both the caller invested time and the soap examples, specification limits are customer-driven requirements placed on a product or a service. If a process consistently meets these requirements, the process is capable of satisfying the customer.

process capability The ability of a process to consistently meet specified customer expectations.

Process capability is the ability of a process to consistently meet specified customer-driven requirements.

One way to analyse the capability of a process is to estimate the percentage of products or services that are within specifications. To do this you must have a process in control, because an out-of-control process does not allow you to predict its capability. If you are dealing with an out-of-control process, you must first identify and eliminate the special causes of variation before performing a capability analysis. Out-of-control processes are unpredictable, and therefore you cannot conclude that such processes are capable of meeting specifications or satisfying customer expectations.
In order to estimate the percentage of product or service within specifications, first you must estimate the mean and standard deviation of the population of all X values, the CTQ variable of interest for the product or service. The estimate for the population mean is X̿, the mean of all the sample means (see Equation 18.6). The estimate of the standard deviation of the population is R̄ divided by d2. You can use the X̿ and R̄ from in-control X̄ and R charts respectively, with the appropriate d2 value from Table E.13. In this text, the population of X values is assumed to be approximately normally distributed. (If your data are not approximately normally distributed, see reference 2 for an alternative approach.) Assuming that the process is in control and X is approximately normal, you can use Equation 18.8 to estimate the probability that a process outcome is within specifications.

ESTIMATING THE CAPABILITY OF A PROCESS
For a CTQ variable with a lower specification limit and an upper specification limit:
P(an outcome will be within specifications) = P(LSL < X < USL)
  = P[(LSL − X̿)/(R̄/d2) < Z < (USL − X̿)/(R̄/d2)]    (18.8a)
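As a rough illustration of Equation 18.8, the Python sketch below estimates the proportion of caller invested times within specification using the normal approximation; X̿, R̄ and d2 come from the chapter, while the USL of 12 minutes is a hypothetical value chosen here purely for illustration (the scenario sets a USL but its value is not given in this section).

# Minimal sketch of Equation 18.8 using the normal approximation.
# ASSUMPTION: usl = 12.0 minutes is hypothetical, used only to illustrate the calculation.
from statistics import NormalDist

x_double_bar = 9.478            # estimated process mean (X-double-bar)
r_bar, d2 = 3.482, 2.326        # mean range and Table E.13 d2 factor for n = 5
sigma_hat = r_bar / d2          # estimated process standard deviation

usl = 12.0                      # hypothetical upper specification limit (minutes)
p_within = NormalDist(mu=x_double_bar, sigma=sigma_hat).cdf(usl)   # P(X < USL)

print(f"Estimated proportion of calls within specification: {p_within:.4f}")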

a. Estimate the percentage of ropes within specifications.
b. Calculate CPL and Cpk.
18.32 Refer to problem 18.25 on page 727 concerning a filling machine for a tea-bag manufacturer. In this problem you should have concluded that the process is in control. The label weight for this product is 5.5 g, the lower specification limit is 5.2 g, and the upper specification limit is 5.8 g. Company policy states that at least 99% of the tea-bags produced must be inside the specifications in order for the process to be considered capable. < TEA3 >
a. Estimate the percentage of the tea-bags that are inside the specification limits. Is the process capable of meeting the company policy?
b. If management implemented a new policy stating that 99.7% of all tea-bags are required to be within specifications, is this process capable of reaching that goal? Explain.
18.33 Refer to problem 18.20 on page 726 concerning waiting time for customers at a bank. Suppose management has set an upper specification limit of 5 minutes on waiting time and that at least 99% of the waiting times must be less than 5 minutes in order for the process to be considered capable. < BANK_TIME >
a. Estimate the percentage of waiting times that are inside the specification limits. Is the process capable of meeting the company policy?
b. If management implemented a new policy stating that 99.7% of all waiting times are required to be within specifications, is this process capable of reaching that goal? Explain.





Assess your progress

Summary
This chapter introduced you to quality, including total quality management, Deming's 14 points and Six Sigma management. You learned how to use several types of control charts to distinguish between common-cause and special-cause variation. Finally, you learned how to measure the capability of a process by estimating the percentage of items within specifications and by calculating capability indices. By applying these concepts to calls to Tasman University's TS Service Desk, you learned how a manager can identify problems and continually improve service quality.

Key formulas

Constructing control limits    (18.1)
Process mean ± 3 standard deviations
Upper control limit (UCL) = process mean + 3 standard deviations
Lower control limit (LCL) = process mean − 3 standard deviations

Control limits for the p chart    (18.2)
p̄ ± 3√(p̄(1 − p̄)/n)
UCL = p̄ + 3√(p̄(1 − p̄)/n)
LCL = p̄ − 3√(p̄(1 − p̄)/n)

Control limits for the c chart    (18.3)
c̄ ± 3√c̄
UCL = c̄ + 3√c̄
LCL = c̄ − 3√c̄

Control limits for the range    (18.4)
R̄ ± 3R̄(d3/d2)

Calculating control limits for the range
UCL = D4R̄    (18.5a)
LCL = D3R̄    (18.5b)

Control limits for the X̄ chart    (18.6)
X̿ ± 3R̄/(d2√n)
UCL = X̿ + 3R̄/(d2√n)
LCL = X̿ − 3R̄/(d2√n)

Calculating control limits for the mean using the A2 factor
UCL = X̿ + A2R̄    (18.7a)
LCL = X̿ − A2R̄    (18.7b)

Estimating the capability of a process
For a CTQ variable with a lower specification limit and an upper specification limit:
P(an outcome will be within specifications) = P(LSL < X < USL)
  = P[(LSL − X̿)/(R̄/d2) < Z < (USL − X̿)/(R̄/d2)]    (18.8a)

contains the data for the first test (Test 1) and the last test (Test 4). For each test:
a. Construct a control chart for the range.
b. Construct a control chart for the mean.
c. Is the process in a state of statistical control?
18.47 The funds-transfer department of a bank is concerned with turnaround time for investigations of funds-transfer payments. A payment may involve the bank as a remitter of funds, a beneficiary of funds or an intermediary in the payment. An investigation is initiated by a payment inquiry or query by a party involved in the payment or any department affected by the flow of funds. When a query is received, an investigator reconstructs the transaction trail of the payment and verifies that the information is correct and the proper payment is transmitted. The investigator then reports the results of the investigation and the transaction is considered closed. It is important that investigations are closed rapidly, preferably within the same day. The number of new investigations and the number closed on the same day that the inquiry was made are in the file < FUND_TRAN >.
a. Construct a control chart for these data.
b. Is the process in a state of statistical control? Explain.
c. Based on the results of (a) and (b), what should management do next to improve the process?
18.48 A branch manager of a brokerage company is concerned with the number of undesirable trades made by her sales staff. A trade is considered undesirable if there is an error on the trade ticket. Trades with errors are cancelled and resubmitted. The cost of correcting errors is billed to the brokerage company. The branch manager wants to know whether the proportion of undesirable trades is in a state of statistical control so that she can plan the next step in a quality improvement process. Data were collected for a 30-day period with the following results: < TRADE >

Day   Undesirable trades   Total trades      Day   Undesirable trades   Total trades
 1             2                 74           16            3                 54
 2            12                 85           17           12                 74
 3            13                114           18           11                103
 4            33                136           19           11                100
 5             5                 97           20           14                 88
 6            20                115           21            4                 58
 7            17                108           22           10                 69
 8            10                 76           23           19                135
 9             8                 69           24            1                 67
10            18                 98           25           11                 77
11             3                104           26           12                 88
12            12                 98           27            4                 66
13            15                105           28           11                 72
14             6                 98           29           13                118
15            21                204           30           15                138

a. Construct a control chart for these data.
b. Is the process in control? Explain.
c. Based on the results of (a) and (b), what should the manager do next to improve the process?
18.49 On each morning during a four-week period, record your pulse rate (in beats per minute) just after you get out of bed and also before you go to sleep at night. Set up X̄ and R charts and determine whether your pulse rate is in a state of statistical control. Explain.
18.50 (Class project) Use the table of random numbers (Table E.1) to simulate the selection of different coloured balls from an urn as follows:
1. Start in the row corresponding to the day of the month you were born plus the year in which you were born. If your total exceeds 100, subtract 100 from the total. For example, if you were born on 15 October 1996 you would start in row 15 + 96 − 100 = 11.
2. Select two-digit random numbers.
3. If you select a random number from 00 to 94, consider the ball to be white; if the random number is from 95 to 99, consider the ball to be red.
Each student is to select 100 such two-digit random numbers and report the number of 'red balls' in the sample. Construct a control chart for the proportion of red balls.
a. What conclusions can you draw about the system of selecting red balls?
b. Are all the students part of the system?
c. Is anyone outside the system?
d. If so, what explanation can you give for someone who has too many red balls?
e. If a bonus were paid to the top 10% of the students (the 10% with the fewest red balls), what effect would that have on the rest of the students? Discuss.



18.51 A customer services manager of Your Phone, a telecommunications company, is monitoring the number of errors made by its offshore call centre processing requests concerning additions, changes and deletions to phone and Internet services. Data were collected over a period of 30 days. < ERROR >
a. Construct a p chart for the proportion of errors. Is the process in a state of statistical control?
b. What should be done to improve the processing of requests for additions, changes and deletions to phone and Internet services?
18.52 Jay, a rural fire service volunteer, decided to apply control chart methodology, learned in a business statistics class, to data collected by the local brigade. Jay wants to determine whether weeks containing more than the mean number of incidents attended were due to inherent, chance causes of variation, or if there were special causes of variation such as increased arson, severe drought or holiday-related activities, or other emergencies including searches, storms and floods. The file < INCIDENTS > contains the number of incidents attended per week (Sunday to Saturday) over the past year.
a. What is the mean number of incidents attended per week?
b. Construct a c chart for the number of incidents attended per week.
c. Is the process in a state of statistical control?
d. Weeks 15 and 41 experienced seven incidents each. Are these large values explainable by common causes, or does it appear that special-cause variation occurred in these weeks?
e. Explain how the local fire brigade can use these data to chart and monitor future weeks in real time (i.e. on a week-to-week basis).
18.53 The manager of a large cosmetics store is monitoring sales-staff performance. The data in < SALES > give the number of sales per week made by a subgroup of five employees over a 30-week period.
a. Construct control charts for the range and the mean.
b. Is the process in a state of statistical control?
18.54 A company manufactures stone benchtops. The specified length of one range is 1000 mm ± 2.5 mm. The data in < BENCH > give the length of a sample of five benchtops produced each day during a 30-day period.
a. Construct control charts for the range and the mean.
b. Is the process in a state of statistical control?
c. If the process is in control:
i. estimate the percentage of benchtops inside the specification limits
ii. calculate Cp, CPL, CPU and Cpk
d. If the manufacturer requires 99.7% of all benchtops to be within the specification limits, comment on the capability of the process based on your calculations in (c).
e. Reducing the specification limits to 1000 mm ± 1.0 mm may result in significant orders from a new customer. Is this possible with the current production process?
18.55 The Ashland MultiComm Services (AMS) technical services team has embarked on a quality improvement effort. Its first project relates to maintaining the target upload speed for its Internet service subscribers. Upload speeds are measured on a device that records the results on a standard scale in which the target value is 1.0. Each day for 25 days, five uploads are randomly selected and their speeds measured. These data are stored in < AMS_UPLOAD >.
a. Construct appropriate control charts for the data.
b. Is the process in a state of statistical control?
c. If the process is in control, given an upper specification limit of 1.2 with no lower specification limit:
i. estimate the percentage of uploads within specification
ii. calculate CPU
d. If AMS requires 99.7% of all uploads to be within the specification limits, comment on the capability of the process based on your calculations in (c).

Chapter 18 Excel Guide

EG18.1  TOTAL QUALITY MANAGEMENT
There are no Excel Guide instructions for this section.

EG18.2 SIX SIGMA MANAGEMENT There are no Excel Guide instructions for this section.

EG18.3 THE THEORY OF CONTROL CHARTS There are no Excel Guide instructions for this section.

EG18.4 CONTROL CHART FOR THE PROPORTION: THE p CHART

Example  Construct the Figure 18.4 p chart for the Table 18.1 nonconforming calls to TS Service Desk data. PHStat  Use p Chart.





For the example, open the Calls1 file. Select PHStat ➔ Control Charts ➔ p Chart and in the p Chart dialog box (shown in Figure EG18.1): 1. Enter or highlight C1:C29 as the Nonconformances Cell Range. 2. Check First cell contains label. 3. Click Size does not vary and enter 200 as the Sample/Subgroup Size. 4. Enter a Title and click OK.

Figure EG18.1  p Chart dialog box

Figure EG18.2  COMPUTE worksheet of the p_Chart workbook
p chart summary
Intermediate calculations: Sum of subgroup sizes = 5600; Number of subgroups taken = 28; Average sample/subgroup size = 200; Average proportion of nonconforming items = 0.0827; Three standard deviations = 0.0584; Preliminary lower control limit = 0.0243
p chart control limits: Lower control limit = 0.0243; Center = 0.0827; Upper control limit = 0.1411
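To see where the Figure EG18.2 limits come from, a small Python sketch (not part of PHStat) recomputes them from the average proportion and subgroup size shown in the worksheet.

# Sketch: p chart control limits from the Figure EG18.2 summary values.
from math import sqrt

p_bar = 0.0827          # average proportion of nonconforming items
n = 200                 # sample/subgroup size

three_sd = 3 * sqrt(p_bar * (1 - p_bar) / n)   # about 0.0584

lcl = max(p_bar - three_sd, 0.0)               # about 0.0243
ucl = p_bar + three_sd                         # about 0.1411

print(f"LCL = {lcl:.4f}, centre = {p_bar:.4f}, UCL = {ucl:.4f}")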

The procedure creates a p chart on its own chart sheet and two supporting worksheets: one that calculates the control limits and one that calculates the values to be plotted. For problems in which the sample/subgroup sizes vary, replace step 3 with this step: Click Size varies, enter the cell range that contains the sample/subgroup sizes as the Sample/Subgroup Cell Range, and click First cell contains label.

In-depth Excel  Use the pChartDATA and COMPUTE worksheets of the p_Chart workbook as a template for calculating control limits, plot points and a p chart. The pChartDATA worksheet uses formulas in column D to divide the column C number of nonconformances value by the column B subgroup/sample size value to calculate the proportion ( pi). It then uses formulas in columns E to G to display the values for the LCL, p and UCL that are calculated in cells B12 to B14 of the COMPUTE worksheet. In turn, the COMPUTE worksheet (shown In Figure EG18.2) uses the subgroup sizes and the proportion values in the pChartDATA worksheet to calculate the control limits. (To examine all of the formulas used in the workbook, open to the COMPUTE_FORMULAS and pChartDATA_FORMULAS worksheets.)

Calculating control limits, plot points and a p chart for other problems requires the new data to be pasted in the pChartDATA worksheet of the p_Chart workbook. Before pasting the data, adjust this worksheet so that the number of time periods is correct. If there are more than 28 time periods, select row 29, right-click, click Insert in the shortcut menu, and repeat as many times as is necessary to insert the required number of additional rows. Then copy down the formulas in columns D to G to the new rows. If there are fewer than 28 time periods, delete the extra rows from the bottom up, starting with row 29. Then paste the time period, subgroup/sample size, and number of nonconformances data into columns A to C of the pChartDATA worksheet.

EG18.5 THE RED BEAD EXPERIMENT: UNDERSTANDING PROCESS VARIABILITY There are no Excel Guide instructions for this section.

EG18.6 CONTROL CHART FOR AN AREA OF OPPORTUNITY: THE c CHART

Example  Construct the Figure 18.7 c chart for the Table 18.4 Innovative Kitchens customer complaint data. PHStat  Use c Chart. For the example, open the Complaints file. Select PHStat ➔ Control Charts ➔ c Chart and in the c Chart dialog box (shown in Figure EG18.3): 1. Enter or highlight B1:B51 as the Nonconformances Cell Range. 2. Check First cell contains label. 3. Enter a Title and click OK.



Figure EG18.3  c Chart dialog box

The procedure creates a c chart on its own chart sheet and two supporting worksheets: one that calculates the control limits and one that calculates the values to be plotted.

In-depth Excel  Use the cChartDATA and COMPUTE worksheets of the c_Chart workbook as a template for calculating control limits, plot points and a c chart. The cChartDATA worksheet uses formulas in columns C to E to display the values for the LCL, cBar and UCL that are calculated in cells B10 to B12 of the COMPUTE worksheet. In turn, the COMPUTE worksheet (shown in Figure EG18.4) calculates sums and counts of the number of nonconformities found in the cChartDATA worksheet to calculate the control limits. (To examine all of the formulas used in the workbook, open to the COMPUTE_FORMULAS and cChartDATA_FORMULAS worksheets.)

Figure EG18.4  COMPUTE worksheet of the c_Chart workbook
c chart summary
Intermediate calculations: Sum of nonconformities = 312; Number of units sampled = 50; CBar = 6.24; Preliminary lower control limit = −1.2540
c chart control limits: Lower control limit = 0.0000; Center = 6.2400; Upper control limit = 13.7340

Calculating control limits, plot points and a c chart for other problems requires the new data to be pasted in the cChartDATA worksheet of the c_Chart workbook. Before pasting the data, adjust this worksheet so that the number of time periods is correct. If there are more than 50 time periods, select row 51, right-click, click Insert in the shortcut menu, and repeat as many times as necessary to insert the required number of additional rows. Then copy down the formulas in columns C to E to the new rows. If there are fewer than 50 time periods, delete the extra rows from the bottom up, starting with row 51. Then paste the time period and number of nonconformances data into columns A and B of the cChartDATA worksheet.

EG18.7  CONTROL CHARTS FOR THE RANGE AND THE MEAN
The R Chart and the X̄ Chart

Example  Construct the Figure 18.8 R chart and Figure 18.9 X̄ chart for the Table 18.5 caller invested time data.
PHStat  Use R and XBar Charts.
For the example, open the Calls2 file. The PHStat procedure requires column cell ranges that contain either means or ranges, so first add two columns that calculate the mean and ranges on this worksheet. Enter the column heading Mean in cell G1 and the heading Range in cell H1. Enter the formula =AVERAGE(B2:F2) in cell G2 and the formula =MAX(B2:F2) - MIN(B2:F2) in cell H2. Select the cell range G2:H2 and copy the range down through row 29. With the two columns created, select PHStat ➔ Control Charts ➔ R and XBar Charts. In the R and XBar Charts dialog box (shown in Figure EG18.5):
1. Enter 5 as the Subgroup/Sample Size.
2. Enter or highlight H1:H29 as the Subgroup Ranges Cell Range.
3. Check First cell contains label.
4. Click R and XBar Charts. Enter or highlight G1:G29 as the Subgroup Means Cell Range and check First cell contains label.
5. Enter a Title and click OK.

Figure EG18.5  R and XBar Charts dialog box





The procedure creates the two charts on separate chart sheets and two supporting worksheets: one that calculates the control limits and one that calculates the values to be plotted.

In-depth Excel  Use the DATA, RXChartDATA and COMPUTE worksheets of the R_and_XBar_Chart workbook as a template for calculating control limits, plotting points and R and X̄ charts. The RXChartDATA worksheet uses formulas in columns B and C to calculate the mean and range values for the Table 18.5 caller invested times stored in the DATA worksheet. The worksheet uses formulas in columns D through I to display the values for the control limit and center lines, using values that are calculated in the COMPUTE worksheet. Formulas in columns D and G use IF functions that will omit the lower control limit if the LCL value calculated is less than 0. (To examine all of the formulas used in the workbook, open to the COMPUTE_FORMULAS and RXChartDATA_FORMULAS worksheets.)
The COMPUTE worksheet (shown in Figure EG18.6) uses the calculated means and ranges to calculate R̄ and X̿, the mean of the subgroup means. Unlike the COMPUTE worksheets for other control charts, you must manually enter the Sample/Subgroup Size in cell B4 (5, as shown below) in addition to the D3, D4 and A2 factors in cells B8, B9 and B18 (0, 2.114 and 0.577, as shown). Use Table E.13 to look up the values for the D3, D4 and A2 factors.

Figure EG18.6  COMPUTE worksheet of the R and XBar Chart workbook
Data: Sample/subgroup size = 5
R chart intermediate calculations: RBar = 3.4821; D3 factor = 0; D4 factor = 2.114
R chart control limits: Lower control limit = 0.0000; Center = 3.4821; Upper control limit = 7.3613
XBar chart intermediate calculations: Average of subgroup averages = 9.4779; A2 factor = 0.577; A2 factor * RBar = 2.0092
XBar chart control limits: Lower control limit = 7.4687; Center = 9.4779; Upper control limit = 11.4871

Calculating control limits, plot points and R and X̄ charts for other problems requires changes to the RXChartDATA and/or the DATA worksheet, depending on whether means and ranges have been previously calculated. If the means and ranges have been previously calculated and the number of time periods is not 28, first adjust the RXChartDATA worksheet. If there are more than 28 time periods, select row 29, right-click, click Insert in the shortcut menu, and repeat as many times as necessary to insert the required number of additional rows. Then copy down the formulas in columns D to I to the new rows. If there are fewer than 28 time periods, delete the extra rows from the bottom up, starting with row 29. Then paste the calculated means and ranges into columns B and C of the RXChartDATA worksheet, and paste time-period data into column A. If the sample/subgroup size is not 5, update the COMPUTE worksheet.
If the means and ranges have not been previously calculated, changes must be made to the DATA worksheet. First, determine the subgroup size. If the subgroup size is less than 5, delete the extra columns, right-to-left, starting with column F. If the subgroup size is greater than 5, select column F, right-click, and click Insert from the shortcut menu. (Repeat as many times as necessary.) If the sample/subgroup size is not 5, update the COMPUTE worksheet. With the DATA worksheet so adjusted, paste the time and subgroup data into the worksheet, starting with cell A1. Then open to the RXChartDATA worksheet and, if the number of time periods is not equal to 28, adjust the number of rows. If there are more than 28 time periods, select row 29, right-click, click Insert in the shortcut menu, and repeat as many times as necessary to insert the required number of additional rows. Then copy down the formulas in columns A to I to the new rows. If there are fewer than 28 time periods, delete the extra rows from the bottom up, starting with row 29.

EG18.8  PROCESS CAPABILITY
Use the COMPUTE worksheet of the Capability workbook as a template for computing the process capability indices discussed in Section 18.8. The worksheet contains the soft-drink filling process data discussed in Example 18.2. For other problems, enter the corresponding data.

CHAPTER 19
Further non-parametric tests

TASMAN UNIVERSITY ACADEMIC SUPPORT
Academic Support at Tasman University runs workshops for new students to develop the required academic skills for their studies. An academic communication workshop first assesses students' current skills by way of a short, 15-minute written exercise. This is graded from A to E, where a grade of C or above implies that the student has adequate communication skills for undergraduate study. Students are reassessed with a follow-up equivalent written exercise after the workshop and completion of related activities. Similarly, a numeracy workshop first assesses students' current numeracy skills with a short, 20-question online test, with a follow-up equivalent online test after the workshop and completion of related activities. Academic Support wishes to know if these workshops and related activities improve students' academic communication and numeracy skills.
© Wavebreakmedia/Shutterstock/Pearson Education Ltd





LEARNING OBJECTIVE

After studying this chapter you should be able to: 1 choose and conduct a selection of powerful non-parametric tests

The non-parametric tests used in chi-square analysis (see Chapter 15) are well known since they are a commonly used non-parametric analysis tool. However, there are many other very useful and statistically powerful non-parametric tests. This chapter introduces some of these techniques as alternative non-parametric tests for several hypothesis tests in Chapters 9 to 11.

19.1  McNEMAR TEST FOR THE DIFFERENCE BETWEEN TWO PROPORTIONS (RELATED SAMPLES)

LEARNING OBJECTIVE 1  Choose and conduct a selection of powerful non-parametric tests
McNemar test  A non-parametric test for testing the difference between two proportions from related samples.

In Chapter 10 the Z test and in Section 15.1 the chi-square test were used to test for the difference between two proportions. These tests require that the samples are independent. However, there are times when testing for a difference between two proportions that the data are from repeated measurements or matched samples, and therefore the samples are related. Such situations often arise in marketing when you want to determine whether there has been a change in attitude, perception or behaviour from one time period to another. The McNemar test is useful for testing the significance of changes between states before and after the sample objects (or individuals) have been exposed to the independent variable, which itself can be classified into two categories. The McNemar test is used to determine whether there is evidence of a significant difference between the proportions of two related samples. Although you could use a test statistic that approximately follows a chi-square distribution, the McNemar test introduced in this section uses a test statistic that approximately follows the normal distribution, enabling either a one-tail or a two-tail test.
Consider the 2 × 2 table presented in Table 19.1. The sample proportions of interest are:
p1 = (A + B)/n = proportion of respondents in the sample who answer yes to condition 1
p2 = (A + C)/n = proportion of respondents in the sample who answer yes to condition 2

Table 19.1  2 × 2 contingency table for the McNemar test
                          Condition (group 2)
Condition (group 1)       Yes        No        Totals
Yes                        A          B         A + B
No                         C          D         C + D
Totals                   A + C      B + D         n

where
A = number of respondents that answer yes to condition 1 and yes to condition 2
B = number of respondents that answer yes to condition 1 and no to condition 2
C = number of respondents that answer no to condition 1 and yes to condition 2
D = number of respondents that answer no to condition 1 and no to condition 2
n = number of respondents in the sample



The population proportions of interest are:
π1 = proportion in the population who would answer yes to condition 1
π2 = proportion in the population who would answer yes to condition 2
Equation 19.1 presents the McNemar test statistic used to test H0: π1 = π2.

McNEMAR TEST
Z = (B − C)/√(B + C)    (19.1)
where the test statistic Z is approximately normally distributed.

To illustrate the McNemar test, return to the Tasman University Academic Support scenario. The results of 40 students who recently attended an academic communication workshop are classified by whether the grade achieved is C or above (yes) or below C (no) in the first and follow-up written assessment tasks. Table 19.2 presents the results.

Table 19.2  Assessment results before and after workshop
First assessment         Second assessment grade C or above
grade C or above         Yes       No       Total
Yes                        5        3           8
No                        22       10          32
Total                     27       13          40

The McNemar test can be used, since each student is assessed twice, both before and after the workshop. That is, the data are paired. You want to determine whether there is a difference between the population proportion of students with a grade of C or above before the workshop, π1, and the population proportion of students with a grade of C or above after the workshop, π2. The null and alternative hypotheses are:
H0: π1 = π2
H1: π1 ≠ π2
Using a 0.05 level of significance, the critical values are −1.96 and +1.96 (see Figure 19.1), with the decision rule:
Reject H0 if Z < −1.96 or if Z > +1.96; otherwise, do not reject H0.

Figure 19.1  Two-tail McNemar test at the 0.05 level of significance (rejection regions Z < −1.96 and Z > +1.96, each with tail area 0.025)





For the data of Table 19.2: A = 5, B = 3, C = 22, D = 10, so that:
p1 = (A + B)/n = (5 + 3)/40 = 8/40 = 0.2 and p2 = (A + C)/n = (5 + 22)/40 = 27/40 = 0.675
Using Equation 19.1:
Z = (B − C)/√(B + C) = (3 − 22)/√(3 + 22) = −19/√25 = −19/5 = −3.8
Since Z = −3.8 < −1.96, you reject H0. Alternatively, using the p-value approach, the p-value is less than 0.0001 (from the cumulative standardised normal table (Table E.2)). Since 0.0001 < 0.05, you reject H0. You can conclude that the proportion of students with a grade of C or above before and after the workshop differs.
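The McNemar calculation is easy to script. The Python sketch below simply re-implements Equation 19.1 for the Table 19.2 counts; it is an illustrative check rather than the chapter's Excel approach.

# Sketch of the McNemar test (Equation 19.1) for the Table 19.2 workshop data.
from math import sqrt
from statistics import NormalDist

A, B, C, D = 5, 3, 22, 10                 # cell counts from Table 19.2
n = A + B + C + D

p1 = (A + B) / n                          # sample proportion with grade C or above before: 0.2
p2 = (A + C) / n                          # sample proportion with grade C or above after: 0.675

z = (B - C) / sqrt(B + C)                 # McNemar test statistic: -3.8
p_value = 2 * NormalDist().cdf(-abs(z))   # two-tail p-value (well below 0.05)

print(f"p1 = {p1:.3f}, p2 = {p2:.3f}, Z = {z:.2f}, two-tail p-value = {p_value:.5f}")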

Problems for Section 19.1

APPLYING THE CONCEPTS
19.1 A market researcher wanted to determine whether the proportion of beer drinkers who preferred brand A increased as the result of a change to the brewing technique. A random sample of 200 beer drinkers was selected. The results indicating preference for brand A or brand B prior to the beginning of the change and after its completion are shown below.

Preference before change    Preference after completion of change to brewing technique
of brewing technique        Brand A    Brand B    Total
Brand A                         112         20      132
Brand B                          43         25       68
Total                           155         45      200

a. At the 0.05 level of significance, is there evidence that the proportion of beer drinkers who prefer brand A is significantly lower before the change in brewing technique?
b. Calculate the p-value in (a) and interpret its meaning.
19.2 Two candidates for prime minister participated in a televised debate. A political pollster recorded the preferences of 200 registered voters in a random sample before and after the debate.

Preference prior    Preference after debate
to debate           Candidate A    Candidate B    Total
Candidate A                  95             25      120
Candidate B                  10             70       80
Total                       105             95      200

a. At the 0.05 level of significance, is there evidence of a significant difference in the proportion of voters who favour candidate A before and after the debate?
b. Calculate the p-value in (a) and interpret its meaning.
19.3 A taste-testing experiment compared two brands of Barossa Cabernet Sauvignon wines. After the initial comparison, 60 preferred brand A and 40 preferred brand B. The 100 respondents were then exposed to a very professional and powerful advertisement promoting brand A. The 100 respondents were then asked to taste the two wines again and state which brand they preferred.

Preference before    Preference after completion of advertising
advertising          Brand A    Brand B    Total
Brand A                   55          5       60
Brand B                   15         25       40
Total                     70         30      100

a. At the 0.05 level of significance, is there evidence that the proportion who prefer brand A is significantly lower before the advertising than after the advertising?
b. Calculate the p-value in (a) and interpret its meaning.
19.4 An airline has changed catering suppliers and wishes to gauge whether this has had an influence on customer satisfaction. A random sample of 100 passengers who flew with the airline both last year and this year is selected.

Satisfied         Satisfied now
last year         Yes    No    Total
Yes                55     8       63
No                 25    12       37
Total              80    20      100

a. At the 0.05 level of significance, is there evidence that satisfaction was significantly lower last year prior to introduction of a new catering company?
b. Calculate the p-value in (a) and interpret its meaning.
19.5 The human resources manager of a large department store wants to reduce absenteeism among sales assistants. She decided to institute an incentive plan that provides financial rewards for sales assistants who are absent less than five days in a given calendar year. A sample of 100 sales assistants selected at the end of the trial year revealed the following:

                      Year 2
Year 1                < 5 days absent    ≥ 5 days absent    Total
< 5 days absent                    32                  4       36
≥ 5 days absent                    25                 39       64
Total                              57                 43      100

a. At the 0.05 level of significance, is there evidence that the proportion of employees absent less than five days was significantly lower in year 1 than in year 2?
b. Calculate the p-value in (a) and interpret its meaning.

19.2  WILCOXON RANK SUM TEST – NON-PARAMETRIC ANALYSIS FOR TWO INDEPENDENT POPULATIONS

Wilcoxon rank sum test  A non-parametric test for testing the difference between two medians from independent samples.

In Section 10.1, a t test was used to test for a difference between the means of two independent populations. If sample sizes are small and the assumption that the data come from normally distributed populations cannot be made, it is not valid to use a t test. In this case a non-parametric test may be used. This section introduces the Wilcoxon rank sum test for testing whether there is a difference between two medians. The Wilcoxon rank sum test is almost as powerful as the pooled-variance and separate-variance t tests under conditions appropriate to these tests and is likely to be more powerful when the assumptions of those t tests are not met. However, the Wilcoxon rank sum test requires that the independent samples are from populations with distributions of approximately the same shape and variability. In addition, you can use the Wilcoxon rank sum test when you have ordinal data, as is often the case when dealing with studies in consumer behaviour and marketing research.
To perform the Wilcoxon rank sum test, the values in the two samples of size n1 and n2 are replaced by their combined ranks. Define n = n1 + n2 as the total sample size. Next, assign ranks so that rank 1 is given to the smallest of the n combined values, rank 2 is given to the second smallest and so on, until rank n is given to the largest. If several values are tied, assign each the average of the ranks that otherwise would have been assigned had there been no ties.
The Wilcoxon rank sum test statistic, T1, is defined as the sum of the ranks assigned to the n1 values in the sample from the first population, X1. Often, for convenience, the population corresponding to the smaller sample is defined as X1. For any integer value n, the sum of the first n consecutive integers is n(n + 1)/2. Therefore, the test statistic T1 plus T2, the sum of the ranks assigned to the n2 items in the second sample, must equal n(n + 1)/2. Use Equation 19.2 to check the accuracy of your rankings.

CHECKING THE RANKINGS
T1 + T2 = n(n + 1)/2    (19.2)

The Wilcoxon rank sum test can be either a two-tail test or a one-tail test, depending on whether you are testing whether the two population medians are different or whether one median is greater than the other median.





Two-tail test:   H0: M1 = M2    H1: M1 ≠ M2
One-tail test:   H0: M1 ≥ M2    H1: M1 < M2
One-tail test:   H0: M1 ≤ M2    H1: M1 > M2
where
M1 = median of population 1
M2 = median of population 2

When either sample size, n1 or n2, is 10 or less, use Table E.8 to find the critical values of the test statistic T1. For a two-tail test, you reject the null hypothesis (see panel A of Figure 19.2) if the calculated value of T1 is greater than or equal to the upper critical value or is less than or equal to the lower critical value. For one-tail tests having the alternative H1: M1 < M2, you reject the null hypothesis if the observed value of T1 is less than or equal to the lower critical value (see panel B of Figure 19.2). For one-tail tests having the alternative H1: M1 > M2, you reject the null hypothesis if the observed value of T1 is greater than or equal to the upper critical value (see panel C of Figure 19.2).

Figure 19.2  Regions of rejection and non-rejection using the Wilcoxon rank sum test (Panel A: H1: M1 ≠ M2, rejection in both tails; Panel B: H1: M1 < M2, rejection in the lower tail; Panel C: H1: M1 > M2, rejection in the upper tail)

For large sample sizes (n1 and n2 both ≥ 10), the test statistic T1 is approximately normally distributed with mean μT1 equal to:
μT1 = n1(n + 1)/2
and standard deviation σT1 equal to:
σT1 = √(n1n2(n + 1)/12)
Therefore, Equation 19.3 defines the standardised Z test statistic.

LARGE-SAMPLE WILCOXON RANK SUM TEST
Z = [T1 − n1(n + 1)/2] / √(n1n2(n + 1)/12)    (19.3)
where the test statistic Z approximately follows a standardised normal distribution.


U

746 CHAPTER 19 FURTHER NON-PARAMETRIC TESTS

Use Equation 19.3 for testing the null hypothesis when the sample sizes are both 9 10. Based on a, the level of significance selected, you reject the null hypothesis if the calculated Z value falls in the region of rejection. In the TU Academic Support scenario, Academic Support wishes to assess whether attending the academic communication workshop improves a student’s academic results in an introductory management unit. To test this, random samples of the marked first assignment for 10 students who had attended and for 10 students who did not attend the workshop were obtained. The data are shown in Table 19.3 below, from the file .  To test whether the median mark for students who attended the academic communication workshop is different from the median mark of students who did not attend, use a two-tail test with the following null and alternative hypotheses: H0: M1 5 M2 (i.e. the median assignment marks are equal) H1: M1 ∙ M2 (i.e. the median assignment marks are not equal) To perform the Wilcoxon rank sum test, calculate the rankings for the assignment marks from the n1 5 10 students who attended academic communication workshop and the n2 5 10 students who did not attend the workshop. Table 19.3 provides the combined rankings.

Table 19.3 Forming the combined rankings

Attended workshop

Assignment mark out of 100 Combined rank Did not attend workshop

Combined rank

35

3.0

24

1.0

50

5.5

30

2.0

56

7.0

43

4.0

66

9.5

50

5.5

66

9.5

60

8.0

83

15.0

67

11.0

84

16.5

74

12.0

84

16.5

76

13.0

89

19.0

77

14.0

93

20.0

85

18.0

The next step is to calculate T1, the sum of the ranks assigned to the first sample. Choosing Attended Workshop as the first sample:
T1 = 3 + 5.5 + 7 + 9.5 + 9.5 + 15 + 16.5 + 16.5 + 19 + 20 = 121.5
As a check on the ranking procedure, calculate T2 from:
T2 = 1 + 2 + 4 + 5.5 + 8 + 11 + 12 + 13 + 14 + 18 = 88.5
and then use Equation 19.2 to show that the sum of the first n = 20 integers in the combined ranking is equal to T1 + T2:
T1 + T2 = n(n + 1)/2
121.5 + 88.5 = 20(21)/2
210 = 210


To test the null hypothesis that there is no difference between the median assignment marks of the two populations, Table E.8 can be used to determine the lower- and upper-tail critical values for the test statistic T1. From Table 19.4, a portion of Table E.8, observe that for a level of significance of 0.05 the critical values are 78 and 132. The decision rule is:
Reject H0 if T1 ⩽ 78 or if T1 ⩾ 132; otherwise, do not reject H0.

Table 19.4  Finding the lower- and upper-tail critical values for the Wilcoxon rank sum test statistic T1 where n1 = 10, n2 = 10 and α = 0.05 (extracted from Table E.8 in Appendix E of this book)

          α                                        n1
      One-    Two-      4       5       6       7       8        9        10
n2    tail    tail                       (Lower, Upper)
 9    0.05    0.10    16,40   24,51   33,63   43,76   54,90    66,105
      0.025   0.05    14,42   22,53   31,65   40,79   51,93    62,109
      0.01    0.02    13,43   20,55   28,68   37,82   47,97    59,112
      0.005   0.01    11,45   18,57   26,70   35,84   45,99    56,115
10    0.05    0.10    17,43   26,54   35,67   45,81   56,96    69,111   82,128
      0.025   0.05    15,45   23,57   32,70   42,84   53,99    65,115   78,132
      0.01    0.02    13,47   21,59   29,73   39,87   49,103   61,119   74,136
      0.005   0.01    12,48   19,61   27,75   37,89   47,105   58,122   71,139

Source: Adapted from F. Wilcoxon and R. A. Wilcox, Some Rapid Approximate Statistical Procedures, Table 1 (Pearl River, NY: Lederle Laboratories, 1964), with permission of the American Cyanamid Company.

Since 78 < T1 = 121.5 < 132, do not reject H0. There is no evidence of a significant difference between the median assignment marks for students who have attended the academic communication workshop and those who have not. From the PHStat worksheet in Figure 19.3 (overleaf), the p-value is 0.2123, which is greater than α = 0.05.

Table E.8 shows the lower and upper critical values of the Wilcoxon rank sum test statistic T1 only for situations involving small samples, where n1 and n2 are less than or equal to 10. If both sample sizes are greater than 10, you must use the large-sample Z approximation formula (Equation 19.3). To demonstrate the large-sample Z approximation formula, consider the assignment mark data. Using Equation 19.3:

Z = [T1 − n1(n + 1)/2] / √[n1n2(n + 1)/12]
  = [121.5 − (10)(21)/2] / √[(10)(10)(21)/12]
  = (121.5 − 105)/13.23 = 1.247...

Since Z = 1.25 < 1.96, the critical value of Z at the 0.05 level of significance, do not reject H0.
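The same calculations can be reproduced in a few lines of code. The following Python sketch (an illustration only, not part of the text's Excel/PHStat workflow) ranks the combined Table 19.3 marks, checks Equation 19.2 and applies the Equation 19.3 Z approximation; the list names attended and not_attended are illustrative.

```python
# Minimal sketch: Wilcoxon rank sum test (large-sample Z approximation)
# for the Table 19.3 assignment-mark data. Assumes scipy is installed.
from math import sqrt
from scipy.stats import rankdata, norm

attended = [35, 50, 56, 66, 66, 83, 84, 84, 89, 93]        # sample 1 (n1 = 10)
not_attended = [24, 30, 43, 50, 60, 67, 74, 76, 77, 85]    # sample 2 (n2 = 10)

combined = attended + not_attended
ranks = rankdata(combined)                 # average ranks for ties
n1, n2 = len(attended), len(not_attended)
n = n1 + n2

T1 = ranks[:n1].sum()                      # sum of ranks for sample 1
T2 = ranks[n1:].sum()
assert T1 + T2 == n * (n + 1) / 2          # Equation 19.2 check: 210 = 210

mu_T1 = n1 * (n + 1) / 2                   # 105
sigma_T1 = sqrt(n1 * n2 * (n + 1) / 12)    # 13.2288...
Z = (T1 - mu_T1) / sigma_T1                # Equation 19.3
p_value = 2 * (1 - norm.cdf(abs(Z)))       # two-tail p-value

print(T1, round(Z, 4), round(p_value, 4))  # 121.5, 1.2473, 0.2123
```

The p-value of about 0.21 agrees with the PHStat output in Figure 19.3; scipy's built-in rank-sum routines should give essentially the same normal-approximation result.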


Figure 19.3  Wilcoxon rank sum test worksheet for assignment mark data
[PHStat worksheet output: level of significance 0.05; population 1 sample: size 10, sum of ranks 121.5; population 2 sample: size 10, sum of ranks 88.5; intermediate calculations: total sample size n = 20, T1 test statistic 121.5, T1 mean 105, standard error of T1 13.2288, Z test statistic 1.2473]

Problems for Section 19.2

Group A   2   5   7   3   6   8   5   3   2
Group B   5   4   5   6   4   5   2   4   7

At the 0.05 level of significance, is there evidence of a significant difference in the median complaints between the two groups?

19.16 A New York television station decides to produce a story comparing two commuter railroads in the area: the Long Island Rail Road (LIRR) and New Jersey Transit (NJT). Data analysts at the television station sample the performance of several scheduled train runs of each railroad, 10 for the LIRR and 12 for the NJT. The data on how many minutes each train is early (negative numbers) or late (positive numbers) are presented below. < TRAIN2 >

LIRR:  5  −1  39  9  12  21  15  52  18  23
NJT:   8  4  10  4  12  5  4  9  15  33  14  7

a. At the 0.01 level of significance, is there evidence that the railroads differ significantly in their median tendencies to be late?
b. What assumptions must you make in (a)?
c. What conclusions can you make about the lateness of the two railroads?

19.17 The director of training for an electronic equipment manufacturer wants to determine whether different training methods have an effect on the productivity of assembly-line employees. She randomly assigns 42 recently hired employees to two groups of 21. The first group receives a computer-assisted, individual-based training program and the other receives a team-based training program. Upon completion of the training, the employees are evaluated on the time (in seconds) it takes to assemble a part. The results are in the data file < TRAINING >.
a. Using a 0.05 level of significance, is there evidence of a significant difference in the median assembly times (in seconds) between the two training programs?

19.18
a. Using a 0.05 level of significance, can you conclude that males have a significantly greater starting salary than females?
b. What assumptions must you make in (a)?
c. Compare your results to those in problem 10.17.

19.19 A bank with a branch located in Pitt Street, Sydney < BANK1 > has developed an improved process for serving customers during the noon to 1 pm lunch period. The waiting time (defined as the time elapsed from when the customer enters the line until they reach the teller window) of all customers during this hour is recorded over a period of one week. A random sample of 15 customers is selected, and the results (in minutes) are as follows:

4.21  5.55  3.02  5.13  4.77  2.34  3.54  3.20  4.50  6.10  0.38  5.12  6.46  6.19  3.79

Another branch located in Hurstville, a residential area, < BANK2 > is also concerned with the noon to 1 pm lunch period. A random sample of 15 customers is selected, and the results are as follows:

9.66  5.90  8.02  5.79  8.73  3.82  8.01  8.35  10.49  6.68  5.64  4.08  6.17  9.91  5.47

a. At the 0.05 level of significance, is there evidence of a difference in the median waiting time between the two branches?
b. What assumptions must you make in (a)?

19.20 A problem with a telephone line that prevents a customer from receiving or making calls is upsetting to both the customer and the telephone company. The following data < PHONE > represent samples of 20 problems reported to two different offices of a telephone company and the time taken to clear these problems (in minutes) from the customers' lines:

Central Office I: Time to clear problems (minutes)
1.48  1.75  0.78  2.85  0.52  1.60  4.15  3.97  1.48  3.10
1.02  0.53  0.93  1.60  0.80  1.05  6.32  3.93  5.45  0.97

Central Office II: Time to clear problems (minutes)
7.55  3.75  0.10  1.10  0.60  0.52  3.30  2.10  0.58  4.02
3.75  0.65  1.92  0.60  1.53  4.23  0.08  1.48  1.65  0.72

a. At the 0.05 level of significance, is there evidence of a difference in the median time to clear these problems between the two offices?
b. What assumptions must you make in (a)?
c. Compare the results of problem 10.15(a) on page 370 with the results of (a) in this problem. Discuss.


19.3  WILCOXON SIGNED RANKS TEST – NON-PARAMETRIC ANALYSIS FOR TWO RELATED POPULATIONS

Wilcoxon signed ranks test A non-parametric test for testing the median difference for paired samples.

In Section 10.2 a paired t test was used to compare the means of two populations when you had repeated measures or matched samples. The paired t test assumes that the data are measured on an interval or a ratio scale and that the paired differences are normally distributed. If you cannot make these assumptions, you can use the non-parametric Wilcoxon signed ranks test to test the median difference. The Wilcoxon signed ranks test requires that the differences are approximately symmetric and that the data are measured on an interval or ratio scale. When the assumptions for the Wilcoxon signed ranks test are met, but the assumptions of the t test are violated, the Wilcoxon signed ranks test can be used to test for a difference between the two populations. Even under conditions appropriate to the paired t test, the Wilcoxon signed ranks test is almost as powerful. The Wilcoxon signed ranks test uses the test statistic W. Exhibit 19.1 lists the steps required to calculate W.

EXHIBIT 19.1  STEPS IN CALCULATING THE WILCOXON SIGNED RANKS TEST STATISTIC W
1. For each item in a sample of n items, calculate a difference score Di between the two paired values.
2. Neglect the '+' and '−' signs and calculate a set of n absolute differences |Di|.
3. Omit from further analysis any absolute difference score of zero, thereby yielding a set of n′ non-zero absolute difference scores, where n′ ⩽ n. After you remove values with absolute difference scores of zero, n′ becomes the actual sample size.
4. Assign ranks Ri from 1 to n′ to each of the |Di| such that the smallest absolute difference score gets rank 1 and the largest gets rank n′. If two or more |Di| are equal, assign each of them the mean rank of the ranks they would have been assigned individually had ties in the data not occurred.
5. Reassign the symbol '+' or '−' to each of the n′ ranks Ri, depending on whether Di was originally positive or negative.
6. Calculate the Wilcoxon test statistic W as the sum of the positive ranks (see Equation 19.4).

WILCOXON SIGNED RANKS TEST STATISTIC W
The Wilcoxon test statistic W is calculated as the sum of the positive ranks:

W = Σ(i=1 to n′) Ri(+)   (19.4)

The null and alternative hypotheses for the Wilcoxon signed ranks test are:
Two-tail test:  H0: MD = 0   H1: MD ≠ 0
One-tail test:  H0: MD ⩾ 0   H1: MD < 0
One-tail test:  H0: MD ⩽ 0   H1: MD > 0

Because the sum of the first n′ integers (1, 2, . . . , n′) equals n′(n′ + 1)/2, the Wilcoxon test statistic W ranges from a minimum of 0 (where all the difference scores are negative) to a maximum of n′(n′ + 1)/2 (where all the difference scores are positive). If the null hypothesis is


true, the test statistic W is expected to be close to its mean μW = n′(n′ + 1)/4. If the null hypothesis is false, the value of the test statistic is expected to be close to one of the extremes. Use Table E.9 to find the critical values of the test statistic W for both one- and two-tail tests for samples of n′ ⩽ 20. For a two-tail test, reject the null hypothesis (panel A of Figure 19.4) if W is greater than or equal to the upper critical value or less than or equal to the lower critical value. For a one-tail test in the negative direction, reject the null hypothesis if W is less than or equal to the lower critical value (panel B of Figure 19.4). For a one-tail test in the positive direction, the decision rule is to reject the null hypothesis if W equals or is greater than the upper critical value (panel C of Figure 19.4).

Figure 19.4  Regions of rejection and non-rejection using the Wilcoxon signed ranks test
[Panel A: H1: MD ≠ 0, rejection regions in both tails; Panel B: H1: MD < 0, rejection region in the lower tail; Panel C: H1: MD > 0, rejection region in the upper tail]

For samples of n′ ⩾ 15 the test statistic W is approximately normally distributed with mean μW = n′(n′ + 1)/4 and standard deviation σW = √[n′(n′ + 1)(2n′ + 1)/24].

Therefore, Equation 19.5 defines the standardised Z-test statistic.

LARGE-SAMPLE WILCOXON SIGNED RANKS TEST

Z = [W − n′(n′ + 1)/4] / √[n′(n′ + 1)(2n′ + 1)/24]   (19.5)

This large-sample approximation formula can be used when sample sizes are outside the range of Table E.9 or n′ ⩾ 15. Reject the null hypothesis if the calculated Z value falls in the rejection region. The rejection region used depends on the level of significance and whether the test is one-tail or two-tail (see Figure 19.4). To demonstrate how to use the Wilcoxon signed ranks test, return to the TU Academic Support scenario. Academic Support wishes to assess if attending a numeracy workshop improves a student’s numeracy skills. To test this, marks for the before and after workshop tests were obtained for a random sample of 16 students who had recently attended a numeracy workshop.


Use this data, shown in Table 19.5, to determine whether the median mark has increased. The null and alternative hypotheses for this one-tail test are:
H0: MD ⩽ 0
H1: MD > 0
To perform the test, follow the six steps listed in Exhibit 19.1. First, calculate a set of difference scores Di between each of the n paired values:
Di = X1i − X2i
where i = 1, 2, . . . , n. In this example, you calculate a set of n difference scores using Di = Xafter − Xbefore. If the numeracy workshop improves numeracy skills, the after-workshop test marks should be higher than the before-workshop test marks, so the difference scores will tend to be positive values (and you will reject H0). If the workshop does not improve numeracy skills, you can expect some Di values to be positive, others to be negative and some to show no change (that is, Di = 0). In this case, the difference scores will cluster near zero and you will not reject H0. The remaining steps of the six-step procedure are developed in Table 19.5.

Table 19.5  Setting up the Wilcoxon signed ranks test for the median difference

After workshop   Before workshop   Difference
X1i              X2i               Di = X1i − X2i   |Di|   Rank Ri   Sign of Di
10               10                0                0
10               9                 1                1      2.5       +
15               14                1                1      2.5       +
11               12                −1               1      2.5       −
12               13                −1               1      2.5       −
7                9                 −2               2      5.5       −
6                8                 −2               2      5.5       −
9                6                 3                3      7.0       +
13               9                 4                4      8.5       +
8                4                 4                4      8.5       +
18               13                5                5      10.0      +
13               5                 8                8      12.0      +
17               9                 8                8      12.0      +
13               5                 8                8      12.0      +
17               7                 10               10     14.0      +
18               5                 13               13     15.0      +

The first student in Table 19.5 has a difference of zero, so omit them from the sample. Therefore, the sample size is n′ = 15. Eleven of the 15 difference scores have a positive sign. Now calculate the test statistic W as the sum of the positive ranks:
W = 2.5 + 2.5 + 7 + 8.5 + 8.5 + 10 + 12 + 12 + 12 + 14 + 15 = 104
Because n′ = 15 ⩽ 20, Table E.9 can be used to determine the upper-tail critical value. Using α = 0.05 for this one-tail test, the upper-tail critical value is 90 (see Table 19.6, which is a portion of Table E.9). Since W = 104 > 90, reject the null hypothesis. There is evidence that marks have increased. That is, the numeracy workshop improved numeracy skills.


Table 19.6  Finding the upper-tail critical value for the Wilcoxon signed ranks test statistic W where n′ = 15 and α = 0.05 (extracted from Table E.9 in Appendix E of this book)

      One-Tail:   α = 0.05   α = 0.025   α = 0.01   α = 0.005
      Two-Tail:   α = 0.10   α = 0.05    α = 0.02   α = 0.01
 n                           (Lower, Upper)
 5                0,15       —,—         —,—        —,—
 6                2,19       0,21        —,—        —,—
 7                3,25       2,26        0,28       —,—
 8                5,31       3,33        1,35       0,36
 9                8,37       5,40        3,42       1,44
10                10,45      8,47        5,50       3,52
11                13,53      10,56       7,59       5,61
12                17,61      13,65       10,68      7,71
13                21,70      17,74       12,79      10,81
14                25,80      21,84       16,89      13,92
15                30,90      25,95       19,101     16,104

Source: Adapted from F. Wilcoxon and R. A. Wilcox, Some Rapid Approximate Statistical Procedures, Table 1 (Pearl River, NY: Lederle Laboratories, 1964), with permission of the American Cyanamid Company.

Table E.9 provides critical values only for situations involving small samples (n′ ⩽ 20). If the sample size n′ is greater than 20, you must use the large-sample Z approximation formula (Equation 19.5). For 15 ⩽ n′ ⩽ 20, use either Table E.9 or the Z approximation approach. To demonstrate the large-sample Z approximation approach, consider the numeracy test data. Using Equation 19.5:

Z = [W − n′(n′ + 1)/4] / √[n′(n′ + 1)(2n′ + 1)/24]
  = [104 − (15 × 16)/4] / √[(15 × 16 × 31)/24]
  = (104 − 60)/√310 = 2.499... ≈ 2.5

The decision rule is:
Reject H0 if Z > 1.6449; otherwise, do not reject H0.
Since Z = 2.5 > 1.6449, reject H0. Alternatively, the p-value is 0.0062. Since the p-value is less than α = 0.05, reject the null hypothesis. Thus, without having to assume that the original population of difference scores is normally distributed, you can conclude that attending a numeracy workshop improves a student's numeracy skills.

The Wilcoxon signed ranks test makes fewer and less stringent assumptions than the paired t test. The assumptions are:
• The data are a random sample of n independent difference scores. The difference scores are the result of repeated measures or matched pairs.
• The data are measured on an interval or ratio scale.
• The distribution of the population of difference scores is approximately symmetric.
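For readers who want to verify the six steps of Exhibit 19.1 programmatically, the following Python sketch reproduces W, Z and the one-tail p-value for the Table 19.5 numeracy marks (a minimal illustration; the after and before lists simply restate the table's values).

```python
# Minimal sketch: Wilcoxon signed ranks test (large-sample Z) for the
# Table 19.5 numeracy data. Assumes scipy is installed.
from math import sqrt
from scipy.stats import rankdata, norm

after  = [10, 10, 15, 11, 12, 7, 6, 9, 13, 8, 18, 13, 17, 13, 17, 18]
before = [10,  9, 14, 12, 13, 9, 8, 6,  9, 4, 13,  5,  9,  5,  7,  5]

diffs = [a - b for a, b in zip(after, before)]
diffs = [d for d in diffs if d != 0]                # step 3: drop zero differences
n_prime = len(diffs)                                # n' = 15

ranks = rankdata([abs(d) for d in diffs])           # step 4: rank |Di| with tied ranks
W = sum(r for d, r in zip(diffs, ranks) if d > 0)   # steps 5-6: sum of positive ranks

mu_W = n_prime * (n_prime + 1) / 4                                 # 60
sigma_W = sqrt(n_prime * (n_prime + 1) * (2 * n_prime + 1) / 24)   # sqrt(310)
Z = (W - mu_W) / sigma_W                                           # Equation 19.5
p_value = 1 - norm.cdf(Z)                                          # upper-tail test

print(W, round(Z, 2), round(p_value, 4))  # 104.0, 2.5, 0.0062
```

scipy.stats.wilcoxon implements the same test, although its reported statistic and p-value conventions differ slightly from the textbook's W.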


Problems for Section 19.3

LEARNING THE BASICS
19.21 Using Table E.9, determine the lower- and upper-tail critical values for the Wilcoxon signed ranks test statistic W in each of the following two-tail tests:
a. α = 0.10, n′ = 11
b. α = 0.05, n′ = 11
c. α = 0.02, n′ = 11
d. α = 0.01, n′ = 11
19.22 Using Table E.9, determine the upper-tail critical value for the Wilcoxon signed ranks test statistic W in each of the following one-tail tests:
a. α = 0.05, n′ = 11
b. α = 0.025, n′ = 11
c. α = 0.01, n′ = 11
d. α = 0.005, n′ = 11
19.23 Using Table E.9, determine the lower-tail critical value for the Wilcoxon signed ranks test statistic W in each of the following one-tail tests:
a. α = 0.05, n′ = 11
b. α = 0.025, n′ = 11
c. α = 0.01, n′ = 11
d. α = 0.005, n′ = 11
19.24 Consider the following n = 12 difference scores from two related samples:
Difference scores (Di): +4.8, +1.7, +4.5, −1.2, +11.1, −0.8, +2.3, −2.0, 0.0, +14.8, +7.8, +1.7
What is the value of the test statistic W if you are testing H0: MD ⩽ 0?
19.25 In problem 19.24, what is the upper-tail critical value for the test statistic W from Table E.9 if the level of significance α is 0.05 and the alternative hypothesis is H1: MD > 0?
19.26 In problems 19.24 and 19.25, what is your statistical decision?
19.27 Consider the following n′ = 12 signed ranks (Ri) calculated from the difference scores (Di) from two related samples:
Signed ranks (Ri): +5, +6.5, +4, +11, −8, +2.5, −2.5, +1, +12, +6.5, +10, +9
What is the value of the test statistic W if you are testing H0: MD ⩽ 0?
19.28 For problem 19.27, at a 0.05 level of significance, determine the upper-tail critical value for the Wilcoxon signed ranks test statistic W if you want to test H0: MD ⩽ 0 against H1: MD > 0.
19.29 For problems 19.27 and 19.28, what is your statistical decision?

APPLYING THE CONCEPTS
Problems 19.30 and 19.31 can be solved manually or by using Excel. We recommend that you use Excel to solve problems 19.32–19.34.

19.30 A new type of drink dispensing machine has been installed at refreshment stands at ANZ Stadium. The number of drinks dispensed per hour at each stand was measured before and after the installation of the new machine. < DISPENSE >



At the 0.05 level of significance, is there evidence of a significant increase in the rate of drinks dispensed?

19.31 The following data compares the prices (in dollars) of a range of food staples between a Coles and a Woolworths supermarket located in Melbourne. < SUPERMARKET >

Item              Quantity   Woolworths   Coles
Milk              1 litre    $2.00        $1.20
Eggs              1 dozen    $4.10        $3.00
Fruit juice       1 litre    $2.40        $2.37
Lettuce           Each       $3.00        $3.50
Lamb chops        500 g      $6.49        $6.50
Vegemite          380 g      $6.00        $6.60
Apples            1 kg       $4.80        $4.50
Oats              1 kg       $5.80        $5.00
Rump steak        1 kg       $21.00       $20.00
Wholemeal bread   Loaf       $2.80        $2.00



At the 0.01 level of significance, is there evidence that prices are significantly higher at Woolworths than at Coles?

19.32 Can students save money by buying their textbooks at Amazon.com? To investigate this possibility, a random sample of 15 textbooks used during a recent semester at Tasman University was selected. The prices for these textbooks at a local bookshop and through Amazon.com were recorded. The prices for the textbooks, including all relevant taxes and shipping, are in the data file < TEXTBOOK >. At the 0.01 level of significance, is there evidence of a significant difference in the price of textbooks between the local bookshop and Amazon.com?

19.33 Over the past year the manager for human resources at a large medical centre ran a series of three-month workshops aimed at increasing worker motivation and performance. As a check on the effectiveness of the workshops, she selected a random sample of 35 employees from the personnel files and recorded their most recent annual performance ratings along with the ratings attained prior to attending the workshops. The data are stored in the < PERFORM > file. At the 0.05 level of significance, is there evidence of a significant difference in performance ratings?

19.34 Return to problem 10.25 on page 378 regarding petrol prices. < PUBLIC_HOLIDAY >

At the 0.01 level of significance, is there evidence that prices are significantly higher on a public holiday (Friday)?


19.4  KRUSKAL–WALLIS RANK TEST – NON-PARAMETRIC ANALYSIS FOR THE ONE-WAY ANOVA

If the assumptions of the one-way ANOVA F test are not met, you can use the Kruskal–Wallis rank test. The Kruskal–Wallis rank test for differences between c medians (where c > 2) is an extension of the Wilcoxon rank sum test for two independent populations, discussed in Section 19.2. Thus, the Kruskal–Wallis test has the same power relative to the one-way ANOVA F test that the Wilcoxon rank sum test has relative to the pooled-variance t test for two independent populations. You use the Kruskal–Wallis rank test to test whether c independent groups have equal medians. The null hypothesis is:
H0: M1 = M2 = … = Mc
and the alternative hypothesis is:
H1: Not all Mj are equal (where j = 1, 2, . . . , c)

Kruskal–Wallis rank test A non-parametric alternative to one-way ANOVA; it tests the null hypothesis that the different samples are drawn from the same distribution or from distributions with the same median.

To use the Kruskal–Wallis test, you need to assume that the c groups have the same variability and shape. To use the Kruskal–Wallis rank test, replace the values in the c samples with their combined ranks. Rank 1 is given to the smallest of the combined values and rank n to the largest of the combined values (where n = n1 + n2 + … + nc). If any values are tied, you assign them the mean of the ranks they would have otherwise been assigned if ties had not been present in the data.

The Kruskal–Wallis test is an alternative to the one-way ANOVA F test. Thus, the Kruskal–Wallis test statistic H is conceptually similar to SSB, the between-group variation term (see Equation 11.2 on page 403), which is part of the test statistic F (see Equation 11.5 on page 404). Instead of comparing each of the c group means against the grand mean, the Kruskal–Wallis test compares the mean rank in each of the c groups against the overall mean rank based on all n combined values. If there is a significant difference between the c groups, the mean rank will differ considerably from group to group. In the process of squaring these differences, the test statistic H becomes large. If there is no difference present, the test statistic H will theoretically have a value of 0. In practice, because of random variation, the H test statistic will be small because the mean of the ranks assigned in each group should be similar from group to group. Equation 19.6 defines the Kruskal–Wallis test statistic H.

KRUSKAL–WALLIS RANK TEST FOR DIFFERENCES IN c MEDIANS

H = [12 / (n(n + 1))] Σ(j=1 to c) (Tj² / nj) − 3(n + 1)   (19.6)

where
n = the total number of values over the combined samples
nj = number of values in the jth sample (j = 1, 2, . . . , c)
Tj = sum of the ranks assigned to the jth sample
Tj² = square of the sum of the ranks assigned to the jth sample
c = number of groups

As the sample sizes in each group get large (i.e. greater than 5), you can approximate the test statistic H by the chi-square distribution with c − 1 degrees of freedom. Thus, reject the null hypothesis if the calculated value of H is greater than χU², the upper-tail critical value (see Figure 19.5, overleaf). That is:
Reject H0 if H > χU²; otherwise, do not reject H0.
The critical values from the chi-square distribution are given in Table E.4.


Figure 19.5  Determining the rejection region for the Kruskal–Wallis test
[Chi-square distribution: region of non-rejection (area 1 − α) to the left of the critical value χU²; region of rejection (area α) to the right]

To illustrate the Kruskal–Wallis rank test for differences between c medians, suppose a large corporation wishes to test the efficiency of its management teams in four cities: Kuala Lumpur, Manila, Sydney and Tokyo. Each team has been created independently by four different subsidiaries of the corporation. The staff of each team are set a simple efficiency test and a percentage score results. Since it cannot be assumed that the efficiency score is normally distributed in all c groups, use the Kruskal–Wallis rank test. The null hypothesis is that the median efficiency score for the four management teams is equal. The alternative hypothesis is that at least one team's median differs from that of the others:
H0: M1 = M2 = M3 = M4
H1: Not all Mj are equal (where j = 1, 2, 3, 4)
Table 19.7 displays the results of the efficiency test with the corresponding rank. < EFFICIENCY >

Table 19.7  Efficiency test of management teams in four cities

                                               City
Kuala Lumpur         Manila               Sydney               Tokyo
Efficiency   Rank    Efficiency   Rank    Efficiency   Rank    Efficiency   Rank
80           15.5    82           18.5    69           6       82           18.5
81           17      74           11      70           7.5     84           21
75           12      83           20      67           4       91           26
79           14      68           5       65           3       92           27
85           22.5    72           9       60           1       78           13
86           24      73           10      62           2       90           25
70           7.5     80           15.5                         93           28
                     85           22.5

In converting the efficiency percentages to ranks, observe in Table 19.7 that the fifth team member in Sydney has the lowest efficiency score, 60. This member is given a rank of 1. The first team member for Manila and the first for Tokyo each have a value of 82. Because they are tied for ranks 18 and 19, they are assigned the rank 18.5. Finally, the last team member for Tokyo has the largest value, 93, and is assigned a rank of 28. After all the ranks are assigned, calculate the sum of the ranks for each group:

Rank sums:  T1 = 112.5   T2 = 111.5   T3 = 23.5   T4 = 158.5

As a check on the rankings:
T1 + T2 + T3 + T4 = n(n + 1)/2
112.5 + 111.5 + 23.5 + 158.5 = 28(29)/2
406 = 406

Using Equation 19.6 to test the null hypothesis of equal population medians:

H = [12 / (n(n + 1))] Σ(j=1 to c) (Tj² / nj) − 3(n + 1)
  = [12 / (28(29))] [(112.5)²/7 + (111.5)²/8 + (23.5)²/6 + (158.5)²/7] − 3(29)
  = (12/812)(7,043.0) − 87 = 17.084

The statistic H approximately follows a chi-square distribution with c − 1 degrees of freedom. Using a 0.05 level of significance, χU², the upper-tail critical value of the chi-square distribution with c − 1 = 3 degrees of freedom, is 7.815 (see Table 19.8). Since the calculated value of the test statistic H = 17.084 is greater than the critical value, reject the null hypothesis and conclude that not all the teams are the same with respect to efficiency. The same conclusion is reached using the p-value approach. In Figure 19.6, observe that the p-value = 0.0007 < 0.05. So reject the null hypothesis and conclude that there is evidence of a significant difference between the teams with respect to efficiency.

Table 19.8  Finding χU², the upper-tail critical value for the Kruskal–Wallis test at the 0.05 level of significance with 3 degrees of freedom (extracted from Table E.4 in Appendix E of this book)

                                            Upper-tail area
Degrees of freedom   0.995   0.99    0.975   0.95    0.90    0.75    0.25    0.10    0.05     0.025
1                    —       —       0.001   0.004   0.016   0.102   1.323   2.706   3.841    5.024
2                    0.010   0.020   0.051   0.103   0.211   0.575   2.773   4.605   5.991    7.378
3                    0.072   0.115   0.216   0.352   0.584   1.213   4.108   6.251   7.815    9.348
4                    0.207   0.297   0.484   0.711   1.064   1.923   5.385   7.779   9.488    11.143
5                    0.412   0.554   0.831   1.145   1.610   2.675   6.626   9.236   11.071   12.833

Figure 19.6  Management team efficiency analysis
[PHStat Kruskal–Wallis rank test worksheet: level of significance 0.05; sample sizes and sums of ranks: Kuala Lumpur 7 and 112.5, Manila 8 and 111.5, Sydney 6 and 23.5, Tokyo 7 and 158.5; sum of squared ranks/sample size 7,043.0; total sample size 28; number of groups 4; H test statistic 17.0838; critical value 7.8147 (CHISQ.INV.RT); p-value 0.0007 (CHISQ.DIST.RT); reject the null hypothesis]
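The H statistic can also be reproduced outside Excel. The Python sketch below applies Equation 19.6 (without a tie correction) to the Table 19.7 efficiency scores; note that scipy's built-in scipy.stats.kruskal corrects for ties, so it reports a value very close to, but not exactly, 17.08.

```python
# Minimal sketch: Kruskal-Wallis rank test (Equation 19.6, no tie correction)
# for the Table 19.7 efficiency scores. Assumes scipy is installed.
from scipy.stats import rankdata, chi2

groups = {
    "Kuala Lumpur": [80, 81, 75, 79, 85, 86, 70],
    "Manila":       [82, 74, 83, 68, 72, 73, 80, 85],
    "Sydney":       [69, 70, 67, 65, 60, 62],
    "Tokyo":        [82, 84, 91, 92, 78, 90, 93],
}

values = [v for g in groups.values() for v in g]
ranks = rankdata(values)                       # combined ranking with tied ranks
n, c = len(values), len(groups)                # n = 28, c = 4

# Sum of ranks T_j for each group, walking through the combined ranking
rank_sums, start = {}, 0
for name, g in groups.items():
    rank_sums[name] = ranks[start:start + len(g)].sum()
    start += len(g)                            # 112.5, 111.5, 23.5, 158.5

H = (12 / (n * (n + 1))) * sum(
    rank_sums[name] ** 2 / len(g) for name, g in groups.items()
) - 3 * (n + 1)                                # Equation 19.6

critical = chi2.ppf(0.95, df=c - 1)            # 7.815
p_value = chi2.sf(H, df=c - 1)                 # approximately 0.0007

print(round(H, 3), round(critical, 3), round(p_value, 4))  # 17.084 7.815 0.0007
```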

The following assumptions are needed to use the Kruskal–Wallis rank test:
• The c samples are randomly and independently selected from their respective populations.
• The data provide at least a set of ranks, both within and between the c samples.
• The c populations have the same variability.
• The c populations have the same shape.

The Kruskal–Wallis procedure makes less stringent assumptions than does the F test, since to use the F test you must assume that the c samples are from underlying normal populations that have equal variances. When the more stringent assumptions of the F test hold, you should use the F test instead of the Kruskal–Wallis test because it is slightly more powerful in its ability to detect significant differences between groups. However, if the assumptions of the F test do not hold, the Kruskal–Wallis test should be used.


Problems for Section 19.4

LEARNING THE BASICS
19.35 What is the upper-tail critical value from the chi-square distribution if you use the Kruskal–Wallis rank test for comparing the medians in six populations at the 0.01 level of significance?
19.36 Using the results of problem 19.35:
a. State the decision rule for testing the null hypothesis that all six groups have equal population medians.
b. What is your statistical decision if the calculated value of the test statistic H is 13.77?

APPLYING THE CONCEPTS
Problems 19.37–19.40 can be solved manually or by using Microsoft Excel. We recommend that you use Microsoft Excel to solve problem 19.41.

19.37 An industrial psychologist wants to test whether the reaction times of assembly-line workers are equivalent under three different learning methods. From a group of 25 new employees, 9 are randomly assigned to method A, 8 to method B and 8 to method C. After the learning period, the workers are given a task to complete and their reaction times are measured. The rankings of the reaction times from 1 (fastest) to 25 (slowest) are in the data file < IND_PSYCH >. At the 0.01 level of significance, is there evidence of a significant difference between the median reaction times for the three learning methods?

19.38 An electronics firm is introducing its new smartphone to the market. They make it available in four cities: Sydney, Auckland, Melbourne and Singapore. Daily sales are recorded for the first five days. < SMART >

Sales
Sydney   Auckland   Melbourne   Singapore
345      231        356         567
456      289        452         765
564      342        299         234
346      298        535         589
651      314        456         678

The operations manager, by experience, knows that such data come from populations that are not normally distributed. At the 0.05 level of significance, is there a significant difference between the median sales of the four cities?

19.39 The retail manager of a supermarket chain wants to determine whether product location has an effect on the sale of pet toys. Three different aisle locations are considered: front, middle and rear. A random sample of 18 stores is selected with six stores randomly assigned to each aisle location. The size of the display area and the price of the product are constant for all stores. At the end of a one-month trial period, the sales volume (in thousands of dollars) of the product in each store is as follows. < LOCATE >

Aisle location
Front   Middle   Rear
8.6     3.2      4.6
7.2     2.4      6.0
5.4     2.0      4.0
6.2     1.4      2.8
5.0     1.8      2.2
4.0     1.6      2.8

At the α = 0.05 level of significance, is there evidence of a significant difference between the median sales of the three aisle locations?

19.40 Return to problem 11.11 on page 414 regarding sick leave taken for different age groups. At the 0.05 level of significance, is there evidence of a significant difference in the median sick leave taken by different age groups? < SICK_LEAVE >

19.41 Students in an engineering course performed an experiment to test the strength of four brands of garbage bags. One-kilogram weights were placed into a bag, one at a time, until the bag broke. A total of 40 bags were used (10 for each brand). The data file < TRASH_BAGS > gives the weight (in kilograms) required to break the bags. At the 0.05 level of significance, is there evidence of a significant difference in the median strength of the four brands of garbage bags?

19.5  FRIEDMAN RANK TEST – NON-PARAMETRIC ANALYSIS FOR THE RANDOMISED BLOCK DESIGN

Friedman rank test  Finds whether multiple sample groups have been selected from populations with equal medians.

When analysing a randomised block design, sometimes you cannot assume normality, or the data consist of only the ranks within each block. In these situations, the Friedman rank test can be used. Use the Friedman rank test to determine whether c groups (i.e. the treatment levels) have been selected from populations having equal medians. That is, you test:
H0: M1 = M2 = … = Mc
against the alternative:
H1: Not all Mj are equal (where j = 1, 2, . . . , c)

Copyright © Pearson Australia (a division of Pearson Australia Group Pty Ltd) 2019— 9781488617249 — Berenson/Basic Business Statistics 5e



19.5 Friedman Rank Test – Non-Parametric Analysis for the Randomised Block Design 759

To conduct the test, first replace the data by their ranks in each block. In each of the r independent blocks, the c values are replaced by their corresponding ranks such that you assign rank 1 to the smallest value in the block and rank c to the largest. If any values in a block are tied, assign them the mean of the ranks that they would otherwise have been given. Thus, Rij is the rank (from 1 to c) associated with the jth group in the ith block. Equation 19.7 defines the test statistic for the Friedman rank test.

FRIEDMAN RANK TEST FOR DIFFERENCES BETWEEN c MEDIANS

FR = [12 / (rc(c + 1))] Σ(j=1 to c) Rj² − 3r(c + 1)   (19.7)

where
Rj² = the square of the total of the ranks for group j (j = 1, 2, . . . , c)
r = the number of blocks
c = the number of groups

As the number of blocks in the experiment becomes large (greater than five), you can approximate the test statistic FR by the chi-square distribution with c − 1 degrees of freedom. Thus, for any selected level of significance α, you reject the null hypothesis if the calculated value of FR is greater than χU2 , the upper-tail critical value for the chi-square distribution having c − 1 degrees of freedom as shown in Figure 19.7. That is: Reject H0 if FR > χU2 ; otherwise, do not reject H0. The critical values from the chi-square distribution are given in Table E.4.

Figure 19.7  Determining the rejection region for the Friedman rank test
[Chi-square distribution: region of non-rejection (area 1 − α) to the left of the critical value χU²; region of rejection (area α) to the right]

To illustrate the Friedman rank test, six restaurant reviewers (blocks) evaluate the service in four restaurants (treatments), giving a rating out of 100, where higher ratings indicate better service. The results of the experiment are displayed in Table 19.9, overleaf. If you cannot assume that the service ratings are normally distributed for each restaurant, the Friedman rank test is more appropriate than the F test. The null hypothesis is that the median service ratings for the four restaurants are equal. The alternative hypothesis is that at least one of the restaurants differs from at least one of the others:
H0: M1 = M2 = M3 = M4
H1: Not all the medians are equal


Table 19.9 provides the 24 service ratings along with the ranks assigned within each block. < SERVICE >

Table 19.9  Converting data to ranks within blocks

                      Restaurant A      Restaurant B      Restaurant C      Restaurant D
Blocks of raters      Rating   Rank     Rating   Rank     Rating   Rank     Rating   Rank
1                     70       2.0      61       1.0      82       4.0      74       3.0
2                     77       3.0      75       1.0      88       4.0      76       2.0
3                     76       2.0      67       1.0      90       4.0      80       3.0
4                     80       3.0      63       1.0      96       4.0      76       2.0
5                     84       2.5      66       1.0      92       4.0      84       2.5
6                     78       2.0      68       1.0      98       4.0      86       3.0
Rank total                     14.5              6.0               24.0              15.5

From Table 19.9 the rank totals for each group are:

Rank totals:  R1 = 14.5   R2 = 6.0   R3 = 24.0   R4 = 15.5

Equation 19.8 provides a check on the rankings.

CHECKING THE RANKINGS IN THE FRIEDMAN RANK TEST

R1 + R2 + R3 + R4 = rc(c + 1)/2   (19.8)

For the restaurant data:
14.5 + 6 + 24 + 15.5 = (6)(4)(5)/2
60 = 60

Using Equation 19.7:
FR = [12 / (rc(c + 1))] Σ(j=1 to c) Rj² − 3r(c + 1)
   = [12 / ((6)(4)(5))] (14.5² + 6.0² + 24.0² + 15.5²) − (3)(6)(5)
   = (12/120)(1,062.5) − 90 = 16.25

Since FR = 16.25 > 7.815, the upper-tail critical value χU² of the chi-square distribution with c − 1 = 3 degrees of freedom (see Table E.4), reject the null hypothesis at the α = 0.05 level. Conclude that there are significant differences (as perceived by the reviewers) in the median service rating at the four restaurants.

The following assumptions are needed to use the Friedman rank test:
• The r blocks are independent, so the values in one block have no influence on the values in any other block.
• The data constitute at least an ordinal scale of measurement within each of the r blocks.
• There is no interaction between the r blocks and the c treatment levels.


• The c populations have the same variability.
• The c populations have the same shape.

The Friedman procedure makes less stringent assumptions than does the randomised block F test, since the F test requires that the level of measurement is an interval or ratio scale and that the c samples are from underlying normal populations having equal variances. Both the F test and the Friedman test assume that there is no interaction between the treatments and the blocks. When the more stringent assumptions of the F test hold, you should select it over the Friedman test because it has more power to detect significant treatment effects. However, if the assumptions of the F test are inappropriate, you should use the Friedman rank test.
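The following Python sketch reproduces the FR calculation for the Table 19.9 ratings by ranking within each block and applying Equation 19.7 (again without a tie correction, so it matches the hand calculation above; scipy.stats.friedmanchisquare may adjust for ties and report a slightly different statistic).

```python
# Minimal sketch: Friedman rank test (Equation 19.7, no tie correction)
# for the Table 19.9 restaurant service ratings. Assumes scipy is installed.
from scipy.stats import rankdata, chi2

# Rows are blocks (raters); columns are treatments (restaurants A, B, C, D)
ratings = [
    [70, 61, 82, 74],
    [77, 75, 88, 76],
    [76, 67, 90, 80],
    [80, 63, 96, 76],
    [84, 66, 92, 84],
    [78, 68, 98, 86],
]

r, c = len(ratings), len(ratings[0])              # r = 6 blocks, c = 4 groups
block_ranks = [rankdata(row) for row in ratings]  # rank within each block

# Rank total R_j for each restaurant (column sums of the within-block ranks)
rank_totals = [sum(block[j] for block in block_ranks) for j in range(c)]
# rank_totals -> [14.5, 6.0, 24.0, 15.5]

FR = (12 / (r * c * (c + 1))) * sum(Rj ** 2 for Rj in rank_totals) - 3 * r * (c + 1)

critical = chi2.ppf(0.95, df=c - 1)               # 7.815
print(round(FR, 2), round(critical, 3), FR > critical)  # 16.25 7.815 True
```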

Problems for Section 19.5

LEARNING THE BASICS
19.42 What is the upper-tail critical value χU² when testing for the equality of the medians in six populations using α = 0.10?
19.43 For problem 19.42:
a. State the decision rule for testing the null hypothesis that all six groups have equal population medians.
b. What is your statistical decision if FR = 11.56?

APPLYING THE CONCEPTS
19.44 Nine experts rated four brands of Colombian coffee in a taste-testing experiment. A rating on a seven-point scale (1 = extremely unpleasing, 7 = extremely pleasing) is given for each of the following four characteristics: taste, aroma, richness and acidity. The following table displays the summated ratings, accumulated over all four characteristics. < COFFEE >

            Brand
Expert   A    B    C    D
C.C.     24   26   25   22
S.E.     27   27   26   24
E.G.     19   22   20   16
B.L.     24   27   25   23
C.M.     22   25   22   21
C.N.     26   27   24   24
G.N.     27   26   22   23
R.M.     25   27   24   21
P.V.     22   23   20   19

At the 0.05 level of significance, is there evidence of a significant difference between the median summated ratings of the four brands of Colombian coffee?

19.45 A student team in a business statistics course designed an experiment to investigate whether the brand of bubblegum used affected the size of bubbles they could blow. The students believed that Kyle was an expert at blowing bubbles, and his expertise might negatively affect the results of a completely randomised design. Thus, to reduce the person-to-person variability, they decided to use a randomised block design using themselves as blocks. Four brands of bubblegum were tested. A student chewed two pieces of a brand of gum and then blew two bubbles, attempting to make them as big as possible. Another student measured the diameter of the bubbles at their biggest point. The following table gives the combined diameters of the bubbles (in centimetres) for the 16 observations. < BUBBLEGUM >

            Brand of bubblegum
Student   Bazooka   Bubbletape   Bubbleyum   Bubblicious
Kyle      8.75      9.50         8.50        11.50
Sarah     9.50      4.00         8.50        11.00
Leigh     9.25      5.50         7.50        7.50
Isaac     9.50      8.50         7.50        7.50

At the 0.05 level of significance, is there evidence of a significant difference in the median diameter of the bubbles produced by the four brands of bubblegum?

19.46 The following data compares the prices (in dollars) of staple food items at Woolworths, ALDI, Coles and IGA. < SUPERMARKET >

Item              Quantity   Woolworths   Coles    ALDI     IGA
Milk              1 litre    $2.00        $1.20    $1.00    $1.36
Eggs              1 dozen    $4.10        $3.00    $3.49    $3.46
Fruit juice       1 litre    $2.40        $2.37    $1.60    $2.26
Lettuce           Each       $3.00        $3.50    $2.50    $2.82
Lamb chops        500 g      $6.49        $6.50    $5.50    $10.99
Vegemite          380 g      $6.00        $6.60    $5.75    $6.30
Apples            1 kg       $4.80        $4.50    $3.50    $3.49
Oats              1 kg       $5.80        $5.00    $4.80    $4.50
Rump steak        1 kg       $21.00       $20.00   $19.00   $16.74
Wholemeal bread   Loaf       $2.80        $2.00    $2.29    $2.99

At the 0.05 level of significance, use the Friedman rank test to determine if there is evidence of a significant difference in the median prices for these staple food items at the four stores.


19

Assess your progress

Summary This chapter extended the non-parametric chi-square tests in Chapter 15 to introduce five non-parametric tests for analysing data that cannot be assumed to follow the normal probability distribution. The first test is the McNemar test, which can be used to measure the difference between two proportions where the data is derived from two related samples. We studied an example using this method for the Tasman University Academic Support scenario. The second test is the Wilcoxon rank sum test, which can be used to test whether there is a difference between two medians. It can be used as an alternative to the t test for testing for a difference between two means, when only ordinal level data is available or the assumptions of the normal distribution are not met. The third test is the Wilcoxon signed ranks test, which can replace the paired t test between two

means. It tests for a difference between data sampled from two related populations. The fourth test is the Kruskal–Wallis rank test, which can be used instead of the one-way analysis of variance (ANOVA), when the assumptions of ANOVA cannot be met. The Kruskal– Wallis test compares the differences between medians from three or more samples, and as such is an extension of the Wilcoxon rank sum test (for two independent populations). The final test is the Friedman rank test, which can be used as an alternative analysis of variance test for the randomised block design. All of the tests studied here are powerful alternatives that can be used instead of parametric hypothesis tests given in Chapters 9, 10 and 11, when the assumptions of these parametric tests cannot be made for specific sample data sets.

Key formulas

McNemar test
Z = (B − C) / √(B + C)   (19.1)

Checking the rankings
T1 + T2 = n(n + 1)/2   (19.2)

Large-sample Wilcoxon rank sum test
Z = [T1 − n1(n + 1)/2] / √[n1n2(n + 1)/12]   (19.3)

Wilcoxon signed ranks test statistic W
W = Σ(i=1 to n′) Ri(+)   (19.4)

Large-sample Wilcoxon signed ranks test
Z = [W − n′(n′ + 1)/4] / √[n′(n′ + 1)(2n′ + 1)/24]   (19.5)

Kruskal–Wallis rank test for differences between c medians
H = [12 / (n(n + 1))] Σ(j=1 to c) (Tj² / nj) − 3(n + 1)   (19.6)

Friedman rank test for differences between c medians
FR = [12 / (rc(c + 1))] Σ(j=1 to c) Rj² − 3r(c + 1)   (19.7)

Checking the rankings in the Friedman test
R1 + R2 + R3 + R4 = rc(c + 1)/2   (19.8)

Key terms
Friedman rank test  758
Kruskal–Wallis rank test  755
McNemar test  741
Wilcoxon rank sum test  744
Wilcoxon signed ranks test  750




Chapter review problems

CHECKING YOUR UNDERSTANDING
19.47 Under what conditions should you use the McNemar test?
19.48 Under what conditions should you use the Wilcoxon rank sum test?
19.49 Under what conditions should you use the Wilcoxon signed ranks test?
19.50 Under what conditions should you use the Kruskal–Wallis rank test?
19.51 Under what conditions should you use the Friedman rank test?

19.52 A market researcher is interested in studying the effect of advertisements on brand preference of new car buyers. Prospective purchasers of new cars were first asked whether they preferred Ford or Holden and then watched video advertisements of comparable models of the two manufacturers. After viewing the ads, the prospective customers again indicated their preference. The results are summarised in the following table.

                 Preference after ads
Before ads       Ford   Holden   Total
Ford             83     23       106
Holden           15     79       94
Total            98     102      200

a. At the 0.05 level of significance, is there evidence of a significant difference in the proportion of respondents who prefer Ford before and after viewing the ads?
b. Calculate the p-value and interpret its meaning.

19.53 Athletics Australia is trying to attract more sports spectators to athletics. It conducts an advertising campaign on sports TV channels to gauge spectator preferences.

                             Preference after campaign
Preference before campaign   Athletics   Other sports   Total
Athletics                    20          3              23
Other sports                 16          61             77
Total                        36          64             100

a. At the 0.10 level of significance, is there evidence of a significant increase in the proportion of respondents who support athletics before and after the campaign?
b. Calculate the p-value and interpret its meaning.

19.54 Managed funds provide the individual investor with a convenient medium for diversification. The table below lists the level of return over six years for a sample of investors for five different types of fund. < FUND >

Return for year       Australian   International   Multisector   Multisector   Multisector
ending 30 June (%)    shares       shares          balanced      growth        high growth
2012                  1.2          -5.3            -0.8          0.2           -1.2
2013                  22.6         23.9            15.9          18.6          20.7
2014                  18.7         21.7            17.2          19.3          19.9
2015                  6.9          24.2            11.0          12.7          13.5
2016                  1.7          0.2             6.7           4.4           3.6
2017                  12.6         20.4            10.4          13.7          16.8

Determine at the 0.05 significance level if there is any significant difference between the median returns of the various managed funds.

19.55 Playing the share market via the Internet has become a popular pastime for many people. Shares tend to be a long-term investment but the power of the Internet allows people to invest and sell quickly on international markets. The following table presents the opening and closing price for 13 selected Asia-Pacific indices on a day in 2017. < INDICES >

                Open        Close
Australia       5,824.50    5,795.70
Canada          15,267.40   15,256.40
Indonesia       5,755.04    5,810.56
China           3,262.08    3,281.87
Hong Kong       27,684.60   27,854.91
India           32,341.05   32,014.19
Japan           20,062.65   19,996.01
Malaysia        1,781.65    1,777.94
New Zealand     7,771.57    7,782.72
Singapore       3,320.67    3,318.08
South Korea     2,404.68    2,394.73
Taiwan          10,598.59   10,470.38
United States   6,373.33    6,370.46

Determine at the 0.05 significance level if there is a significant difference between the opening and closing prices for the selected indices.

19.56 Six Masterchef guest reviewers have given scores out of 10 to four contestants. < MASTERCHEF >

            Contestant
Reviewer   A   B   C    D
1          5   4   9    8
2          6   5   8    5
3          6   5   8    6
4          6   6   4    9
5          6   4   10   8
6          5   7   5    7

Use the Friedman test to determine at the 0.05 significance level whether there is any significant difference between the median scores for the four contestants.


19.57 A toy manufacturer is assessing a new, longer-lasting small battery in four popular toys, A, B, C and D, each requiring a single battery. Twenty batteries are randomly assigned to the toys. Then each toy is run continuously and the time to failure of the battery recorded. < BATTERY_LIFE >

Toy A   Toy B   Toy C   Toy D
18.0    17.6    16.0    15.1
18.1    18.2    16.3    15.6
19.2    19.8    17.1    15.9
19.4    20.9    17.7    16.7
21.7    22.3    18.9    17.8

It is known that time to failure for these batteries is unlikely to be normally distributed. At the 0.05 level of significance, analyse the data to determine whether there is a significant difference in the median battery life between the toys.

19.58 A hotel chain has similar 50-room resorts in five locations. The number of rooms occupied in each resort on randomly selected days is given in the table below. < OCCUPANCY >

Byron Bay   Cairns   Gold Coast   Noosa   Port Douglas
40          32       28           37      31
42          24       33           31      26
26          35       37           50      48
35          49       50           47      21
33          25       41           45      47

At the 0.10 level of significance, is there evidence to conclude that there is a significant difference in the median occupancy rates in the five locations?

19.59 A builder suspects that the primary supplier of raw materials is overcharging. In order to determine if her suspicion is correct, she contacts a second supplier and asks for the prices on a selection of materials. She wants to compare these prices with those of the primary supplier. The data collected is presented in the table below. < MATERIAL >

Material   Primary supplier   Secondary supplier
1          $55.10             $45.00
2          $48.00             $47.00
3          $31.00             $32.99
4          $83.00             $77.00
5          $92.00             $94.00
6          $95.00             $94.49
7          $34.00             $35.00
8          $235.00            $233.00
9          $1.99              $2.20
10         $102.05            $99.99
11         $114.60            $112.50
12         $21.00             $20.98
13         $15.00             $12.95
14         $19.55             $21.00
15         $26.00             $24.00
16         $37.05             $37.00

At the 0.05 level of significance, test the builder's suspicion.

19.60 An executive of Airline X has taken a quick poll of 30 regular airline passengers. Each passenger was asked to rate the airline they last flew on. The ratings were on a seven-point Likert scale, where 1 = extremely poor and 7 = excellent. Of the 30 respondents, 12 last flew on Airline X and the remainder flew on other airlines. The ratings are shown below. < AIRLINE >

Airline rating
Airline X:       4  6  3  4  2  4  7  6  5  5  4  1
Other airlines:  5  3  1  5  2  3  6  2  1  3  7  3  7  6  1  4  3  6

Can the executive conclude from this data, with 5% significance, that Airline X is more highly rated than the other airlines?

19.61 The manager of a fast-food restaurant wants to determine the effectiveness of a promotional campaign where '20% off' coupons, valid for a month, are widely distributed. The table below records the daily sales (in thousands of dollars) of random samples of 10 days before the campaign and 10 days during the campaign. < FAST_FOOD >

Before campaign:   18.8   9.6    17.3   9.0   12.3   12.0   9.1    9.3    8.9   12.3
During campaign:   22.9   12.0   22.9   9.2   14.3   13.9   11.0   11.7   9.6   18.1

Can the manager, at the 0.05 level of significance, conclude that sales increased significantly during the campaign?


Continuing cases Tasman University Tasman University’s Tasman Business School (TBS) regularly surveys business students on a number of issues. In particular, students within the school are asked to complete a student survey when they receive their grades each semester. Data from the survey responses of Bachelor of Business (BBus) and Master of Business Administration (MBA) students who responded to the latest undergraduate (UG) and postgraduate (PG) student survey are stored in < TASMAN_UNIVERSITY_BBUS_STUDENT_SURVEY > and < TASMAN_UNIVERSITY_MBA_STUDENT_SURVEY >. Copies of the survey questions are stored in Tasman University Undergraduate BBus Student Survey and Tasman University Postgraduate MBA Student Survey. a For the BBus student survey, at a 0.05 level of significance: i Is there evidence of a significant difference between males and females in median weighted average mark (WAM), expected starting salary, number of social networking sites registered for, age, spending on textbooks, number of text messages sent in a week and the wealth needed to feel rich? ii Is there evidence of a significant difference based on academic major in median expected starting salary, number of social networking sites registered for, age, spending on textbooks, text messages sent in a week and the wealth needed to feel rich? b For the MBA student survey, at a 0.10 level of significance: i Is there evidence of a significant difference between males and females in median undergraduate WAM, postgraduate WAM, expected salary on graduation, age, spending on textbooks, text messages sent in a week and the wealth needed to feel rich? ii Is there evidence of a significant difference based on undergraduate major in the median age, undergraduate WAM, MBA WAM, expected salary upon graduation, spending on textbooks, text messages sent in a week and the wealth needed to feel rich? iii Is there evidence of a significant difference based on MBA major in the median age, undergraduate WAM, MBA WAM, expected salary upon graduation, spending on textbooks, text messages sent in a week and the wealth needed to feel rich? iv Is there evidence of a significant difference based on employment status in the median age, undergraduate WAM, MBA WAM, expected salary upon graduation, spending on textbooks, text messages sent in a week and the wealth needed to feel rich? v Is there evidence that undergraduate WAM is significantly higher than MBA WAM? c Write a report summarising your conclusions.

As Safe as Houses To analyse the real estate market in non-capital cities and towns in states A and B, Safe-As-Houses Real Estate, a large national real estate company, has collected samples of recent residential sales from a sample of non-capital cities and towns in these states. This data is stored in < REAL_ESTATE >. a For regional city 1, state A at a 0.05 level of significance: i is there evidence that houses have significantly higher median prices than units? ii is there evidence that houses have significantly larger internal areas than units? iii is there evidence of a significant difference in median internal area based on number of bedrooms?


 iv is there evidence of a significant difference in median price based on number of bedrooms?
 v is there evidence of a significant difference in median price based on number of bathrooms?
 vi is there evidence of a significant difference in median internal area based on number of bathrooms?
b For coastal city 1, state A, at a 0.01 level of significance:
 i is there evidence that houses have significantly higher median prices than units?
 ii is there evidence that houses have significantly larger internal areas than units?
 iii is there evidence of a significant difference in median internal area based on number of bedrooms?
 iv is there evidence of a significant difference in median price based on number of bedrooms?
 v is there evidence of a significant difference in median price based on number of bathrooms?
 vi is there evidence of a significant difference in median internal area based on number of bathrooms?
c For coastal city 1, state A and regional city 1, state A, at a 0.10 level of significance:
 i is there evidence of a significant difference in median price between the two cities?
 ii is there evidence of a significant difference in median internal area between the two cities?
d Write a report summarising your conclusions.
e Repeat (a) to (d) for another pair of non-capital cities and/or towns in state A and/or state B, or different levels of significance.

Chapter 19 Excel Guide

EG19.1 MCNEMAR TEST FOR THE DIFFERENCE BETWEEN TWO PROPORTIONS (RELATED SAMPLES)

Key technique  Use the NORM.S.INV function to calculate critical values and use the NORM.S.DIST function to calculate p-values.
Example  Perform the McNemar test for the Section 19.1 Tasman University Academic Support scenario example.

PHStat  Use McNemar Test. For the example, select PHStat ➔ Two-Sample Tests (Summarized Data) ➔ McNemar Test. In the McNemar Test dialog box:
1. Enter 0.05 as the Level of Significance.
2. Click Two-Tail Test.
3. Enter a Title and click OK.

In the new worksheet:
4. Read the yellow note about entering values and then press the Delete key to delete the note.
5. Enter the Table 19.2 data, including row and column labels, in rows 4 through 7.

In-depth Excel  Use the COMPUTE worksheet of the McNemar_Test workbook (shown in Figure EG19.1) as a template for performing the McNemar test. The worksheet contains the data of Table 19.2 concerning assessment grades in the academic communication workshop example. Go to the COMPUTE_ALL_FORMULAS worksheet to examine the formulas used in the worksheet. To perform the McNemar two-tail test for other problems, change the row 4 through 7 entries in the Observed Frequencies area and enter the level of significance for the test in cell B11. For one-tail tests, change the Observed Frequencies area and level of significance in the COMPUTE_ALL worksheet in the McNemar_Test workbook.


Figure EG19.1  COMPUTE worksheet of the McNemar_Test workbook (McNemar test)

Observed frequencies (rows: first assessment grade C or above; columns: second assessment grade C or above)
                        Yes    No   Total
First assessment Yes      5    22      27
First assessment No       3    10      13
Total                     8    32      40

Level of significance: 0.05
Intermediate calculations: Numerator –19 [=C6-B7]; Denominator 5.0000 [=SQRT(C6+B7)]; Z test statistic –3.8000 [=B14/B15]
Two-tail test: Lower critical value –1.9600 [=NORM.S.INV(B11/2)]; Upper critical value 1.9600 [=NORM.S.INV(1-B11/2)]; p-value 0.0001 [=2*(1-NORM.S.DIST(ABS(B16),TRUE))]; decision: Reject the null hypothesis [=IF(B21 …)]
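The worksheet's logic can also be checked outside Excel. The following Python sketch is not part of the Excel Guide: the function name and the use of scipy are assumptions of mine, and the discordant counts passed in are illustrative values chosen only to reproduce the Z statistic of –3.8000 displayed in Figure EG19.1. Each line mirrors one of the worksheet formulas.

```python
from math import sqrt

from scipy.stats import norm


def mcnemar_z(b, c, alpha=0.05):
    """McNemar Z test for the difference between two related proportions.

    b and c are the two discordant counts of the 2x2 table (the cells
    combined by the worksheet formulas =C6-B7 and =SQRT(C6+B7)).
    """
    z = (b - c) / sqrt(b + c)                # Z test statistic
    lower = norm.ppf(alpha / 2)              # =NORM.S.INV(alpha/2)
    upper = norm.ppf(1 - alpha / 2)          # =NORM.S.INV(1-alpha/2)
    p_value = 2 * (1 - norm.cdf(abs(z)))     # two-tail p-value
    decision = 'Reject H0' if p_value < alpha else 'Do not reject H0'
    return z, (lower, upper), p_value, decision


# Illustrative discordant counts chosen to reproduce the displayed Z of -3.8000
print(mcnemar_z(b=3, c=22))
```

Because |Z| = 3.80 exceeds the critical value of 1.96, the two-tail p-value of about 0.0001 leads to rejecting the null hypothesis, matching the worksheet's decision.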

Problems for Section 20.2
20.1 A hotel has designed a new system for room service delivery of breakfast that allows the customer to select a specific delivery time. The file < SATISFACTION > contains the difference between the actual and requested delivery times (a negative time means that the breakfast was delivered before the requested time) recorded for 30 deliveries on a particular day, along with whether the customer had previously stayed at the hotel.
a. Using all the data as the training sample, develop a classification tree model to predict the probability that the customer will be satisfied based on the delivery time difference and whether the customer had previously stayed at the hotel.
b. What conclusions can the hotel reach about the probability that the customer will be satisfied?
20.2 A marketing manager wants to predict customers with risk of churning (switching their service contracts to another company) based on the number of calls the customer makes to the company call centre and the number of visits the customer makes to the local service centre. Data from a random sample of 30 customers are organised and stored in < CHURN >.
a. Using all the data as the training sample, develop a classification tree model to predict the probability of churning, based on the number of calls the customer makes to the company call centre and the number of visits the customer makes to the local service centre.
b. What conclusions can the marketing manager reach about the probability of churning?
20.3 An automotive insurance company wants to predict which filed stolen vehicle claims are fraudulent, based on the number of claims submitted per year by the policy holder and whether the policy is a new policy, that is, one year old or less (coded as 1 = yes, 0 = no). Data from a random sample of 98 automotive insurance claims are organised and stored in < INSURANCE_FRAUD > (data obtained from Gepp et al., 'A comparative analysis of decision trees vis-à-vis other computational data mining techniques in automotive insurance fraud detection', Journal of Data Science, 10, 2012, 537–561).
a. Using all the data as the training sample, develop a classification tree model to predict the probability of a fraudulent claim, based on the number of claims submitted per year by the policy holder and whether the policy is new.
b. What conclusions can the insurance company reach about the probability of a fraudulent claim?
c. Using half the data as the training sample and the other half as the validation sample, develop a classification tree model

to predict the probability of a fraudulent claim, based on the number of claims submitted per year by the policy holder and whether the policy is new.
d. What differences exist in the results of (a) and (c)? What conclusions can you reach about the models' fit from the training samples in (a) and (c)?
20.4 Undergraduate students at Miami University in Oxford, Ohio, were surveyed in order to evaluate the effect of price on the purchase of a pizza from Pizza Hut. The students were asked to suppose that they were going to have a large two-topping pizza delivered to their residence. Then they were asked to select from either Pizza Hut or another pizzeria of their choice. The price they would have to pay to get a Pizza Hut pizza differed from survey to survey. For example, some surveys used the price $11.49. Other prices investigated were $8.49, $9.49, $10.49, $12.49, $13.49 and $14.49. The dependent variable for this study is whether or not a student will select Pizza Hut. The independent variables are the price of a Pizza Hut pizza and the gender of the student (1 = male, 0 = female). The results of these surveys are stored in < PIZZA_HUT >.
a. Using half the data as the training sample and the other half as the validation sample, develop a classification tree model to predict the probability a student will select Pizza Hut based on the price of a Pizza Hut pizza and the student's gender.
b. What conclusions can you reach about the probability the student will select Pizza Hut?
20.5 The business problem facing a consumer products company is to measure the effectiveness of different types of advertising media in the promotion of its products. Specifically, the company is interested in the effectiveness of radio advertising and newspaper advertising (including the cost of discount coupons). During a one-month test period, data were collected from a sample of 22 cities with approximately equal populations. Each city is allocated a specific expenditure level for radio advertising and for newspaper advertising. The sales of the product (in thousands of dollars) and also the levels of media expenditure (in thousands of dollars) during the test month are recorded and stored in < ADVERTISE >.
a. Using all the data as the training sample, develop a regression tree model to predict sales of the product.
b. What conclusions can the consumer products company reach about sales of the product?
20.6 Starbucks Coffee Co. uses a data-based approach for improving the quality and customer satisfaction of its products. When survey data indicated that Starbucks needed to improve its


package sealing process, an experiment was conducted (data obtained from L. Johnson and S. Burrows, 'For Starbucks, it's in the bag', Quality Progress, March 2011, 17–23) to determine the factors in the bag-sealing equipment that might be affecting the ease of opening the bag without tearing its inner liner. Among the factors that could affect the rating of the ability of the bag to resist tears were the viscosity, pressure and plate gap on the bag-sealing equipment. Data were collected on 19 bags in which the plate gap was varied. The results are stored in < STARBUCKS >.
a. Using all the data as the training sample, develop a regression tree model to predict the rating of the ability of the bag to resist tears.
b. What conclusions can Starbucks reach about the rating of the ability of the bag to resist tears?
20.7 In mining engineering, holes are often drilled through rock using drill bits. As a drill hole gets deeper, additional rods are added to the drill bit to enable additional drilling to take place. It is expected that drilling time increases with depth. This increased drilling time could be caused by several factors, including the mass of the drill rods that are strung together. The business problem relates to whether drilling is faster using dry drilling holes or wet drilling holes. Using dry drilling holes involves forcing compressed air down the drill rods to flush the cuttings and drive the hammer. Using wet drilling holes involves forcing water rather than air down the hole. Data have been collected

from a sample of 50 drill holes that contains measurements of the time to drill each additional 5 feet (in minutes), the depth (in feet) and whether the hole was a dry or wet drilling hole. The data are organised and stored in < DRILL >.
a. Using half the data as the training sample and the other half as the validation sample, develop a regression tree model to predict the drilling time.
b. What conclusions can you reach about the drilling time?
20.8 The owner of a moving company typically has his most experienced manager predict the total number of labour hours that will be required to complete an upcoming move. This approach has proved useful in the past, but the owner has the business objective of developing a more accurate method of predicting labour hours. In a preliminary effort to provide a more accurate method, the owner has decided to use the number of cubic feet moved, the number of large pieces of furniture and whether there is an elevator in the apartment building as the independent variables. He has also collected data for 36 moves in which the origin and destination were within the borough of Manhattan in New York City and the travel time was an insignificant portion of the hours worked. The data are organised and stored in < MOVING >.
a. Using all the data as the training sample, develop a regression tree model to predict the labour hours.
b. What conclusions can the moving-company owner reach about the labour hours?
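All of the problems above follow the same workflow: fit a classification or regression tree, optionally holding half the data back as a validation sample, and read off the predicted probabilities or values. The book carries this out in JMP; purely as an illustration, the sketch below does the classification version with scikit-learn for a file laid out like < INSURANCE_FRAUD >. The file name, column names and tree settings are assumptions, not taken from the actual data file.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical layout of the <INSURANCE_FRAUD> data; column names are assumptions
df = pd.read_csv('insurance_fraud.csv')   # columns: claims_per_year, new_policy, fraudulent

X = df[['claims_per_year', 'new_policy']]
y = df['fraudulent']

# Part (a): use all the data as the training sample
full_tree = DecisionTreeClassifier(min_samples_leaf=5, random_state=0).fit(X, y)
print(export_text(full_tree, feature_names=list(X.columns)))   # the sequence of splits

# Part (c): use half the data for training and the other half for validation
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.5, random_state=0)
half_tree = DecisionTreeClassifier(min_samples_leaf=5, random_state=0).fit(X_train, y_train)

print('validation accuracy:', half_tree.score(X_valid, y_valid))
# probability of the class coded 1 (fraudulent) for the first five validation claims
print(half_tree.predict_proba(X_valid)[:5, 1])
```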

20.3  NEURAL NETWORKS

LEARNING OBJECTIVE 2
Use neural networks for predictive analytics

neural networks  Flexible data mining techniques that 'learn' from the data and construct models from patterns and relationships uncovered in data.

Neural networks are powerful, flexible data mining techniques that construct models from patterns and relationships uncovered in data. Unlike the inferential methods in earlier chapters, in which you must supply a model to be tested, neural networks 'learn' from the data to construct that model for you.2 Neural networks are very flexible and can be applied to prediction, classification and clustering problems and, as non-parametric methods, do not make a priori assumptions about the distribution of the data. Because they do not require a supplied model as a starting point, neural networks are particularly valuable in analysing big data in which the process of model transformations discussed in Chapters 13 and 16 would be too unwieldy and time consuming to perform.

2 Because neural networks were inspired by the architecture of the human brain, some describe such networks as learning from the data to form the best model. This 'learning' in neural networks, while complex, is more analogous to the computation and evaluation of inferential methods discussed earlier in this book than to the many types of human learning. Therefore this book uses 'uncovers' where other sources might use 'learn'.

Multilayer Perceptrons
multilayer perceptrons (MLPs)  Neural networks that contain an input layer, a hidden layer and an output layer.

All neural networks contain complex computations that begin with inputs and end with outputs. Neural networks used for prediction and classification are typically multilayer perceptrons (MLPs) that contain an input layer, a hidden layer and an output layer, as shown in Figure 20.3, overleaf. To construct models, MLPs use error back propagation. To begin, the input layer nodes (the circles in Figure 20.3) send the various inputs to the nodes of the hidden layer. The inputs


Figure 20.3  Structure of a multilayer perceptron
(The diagram shows an input layer with nodes Input 1 to Input 4, a hidden layer, and an output layer that produces the Output.)

processing elements  The hidden layer in multilayer perceptrons (MLPs).
hyperbolic tangent function  An S-shaped function that varies between −1 and +1.

have associated weights. Processing elements3 that comprise the hidden layer combine the weighted inputs and apply a non-linear function, such as the S-shaped hyperbolic tangent function, to the combination. The hyperbolic tangent function is a function that varies between −1 and +1.

HYPERBOLIC TANGENT FUNCTION

\dfrac{e^{2x} - 1}{e^{2x} + 1}   (20.4)

where x is a linear combination of the X variables.

3 Processing elements are meant to simulate the neuron in the brain and therefore are often referred to as artificial neurons or nodes. For more on the relationship between biological and artificial neural networks, see reference 2.

training data A set of data used by neural networks to uncover a model that by some criterion best describes the patterns and relationships in the data.

The results of the processing element calculations in the hidden layer are sent to the output layer. The output layer combines the results it receives from the hidden layer and compares them to the target Y value. The output layer then sends back to the hidden layer nodes (the start of the back propagation) its estimate of the difference between the predicted results and the target value (the error rate). Calculations in the hidden layer then backwardly influence the weighting done near the input layer, and the process continues forward a second time to the output layer. This forward-and-backward calculation between the three layers continues until the output layer detects that the error rate has been minimised or is at an acceptable level. (This minimisation of errors is analogous to the minimisation of prediction error of regression models.) At this point, the model is established.
To establish a neural network, you use some of your data as the training data and some of it as the validation data. Neural networks use the training data to uncover a model that by some


criterion best describes the patterns and relationships in the data. The model is then applied to the validation data to see if the model can make the correct prediction or classification. Models that neural networks construct can be difficult to interpret (see reference 3). Neural networks can also suffer from poor-quality data, insufficient data or overfitted models, that is, models that only work well with the data used to construct them. Determining the number of hidden nodes (processing elements) can be an inexact process. One source suggests that when using a neural network for classification, it is best to start with one hidden node per class (see reference 4).
To illustrate an MLP, recall the card study example that sought to identify credit-card holders who would be likely to upgrade to a premium card. Figure 20.4 shows the MLP results calculated by JMP for both the training and validation samples.

Figure 20.4  Multilayer perceptron results for classifying credit-card holders who would upgrade to a premium card
Source: The output for this paper was generated using JMP® software.
Note: The table of parameter estimates of the model contains the final estimated weights that resulted from the backpropagation training process. Even though these weights cannot be interpreted in the same manner as regression coefficients, weights are crucial in that they store the patterns that were uncovered ('learned') in analysing the data.

The validation data results can be used as a representation of the model’s predictive power on future observations. The validation data misclassification rate, 0.0909, means that 9% of the cardholders in the validation set were inaccurately classified using the trained MLP neural network. The Confusion Matrix report shows a contingency table of the actual and predicted values for the ‘Upgraded’ variable. The Confusion Rates report is equal to the Confusion Matrix report, with the numbers divided by the row totals. Of the 11 cardholders in the validation set, 5 actually upgraded to the premium card and 6 did not. Of the 5 cardholders who did in fact upgrade to the premium card, 5 (or 1.0000) were correctly classified by the MLP neural network. Of the 6 cardholders who did not in fact upgrade to the premium card, 5 (or 0.8333) were correctly classified by the MLP neural network. To illustrate a neural network analysis with a continuous numerical dependent variable, return to the OmniPower scenario above. Figure 20.5, overleaf, presents the neural network results for predicting the sales of OmniPower bars. The r2 statistic for the validation data is 0.9172. This large value indicates that the model is doing a good job in predicting the target in the validation data.


Figure 20.5  Multilayer perceptron results for predicting the sales of OmniPower bars
Source: The output for this paper was generated using JMP® software.
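As a rough, non-JMP analogue of the analyses in Figures 20.4 and 20.5, the sketch below fits a small multilayer perceptron with scikit-learn, holds back part of the data for validation, and reports the validation confusion matrix and misclassification rate. The file name, column names and the two-node hidden layer are assumptions for illustration; activation='tanh' is set so that the hidden layer uses the hyperbolic tangent function of Equation 20.4, although scikit-learn's optimiser and stopping rules differ from JMP's, so the weights will not match Figure 20.4.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import confusion_matrix

# Hypothetical layout of the card study data; file and column names are assumptions
df = pd.read_csv('card_study.csv')   # columns: purchases, extra_cards, upgraded

X = df[['purchases', 'extra_cards']]
y = df['upgraded']

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3333, random_state=0)

mlp = MLPClassifier(hidden_layer_sizes=(2,),  # one hidden layer with two processing elements
                    activation='tanh',        # the S-shaped hyperbolic tangent function (20.4)
                    max_iter=5000,
                    random_state=0).fit(X_train, y_train)

predictions = mlp.predict(X_valid)
print(confusion_matrix(y_valid, predictions))                     # rows: actual, columns: predicted
print('misclassification rate:', 1 - mlp.score(X_valid, y_valid))
```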

Problems for Section 20.3
20.9 Refer to problem 20.1 on page 776 concerning a hotel that has designed a new system for room service delivery of breakfast that allows the customer to select a specific delivery time. < SATISFACTION >
a. Develop a neural network model to predict the probability that the customer will be satisfied, based on the delivery time difference and whether the customer had previously stayed at the hotel.
b. What conclusions can the hotel reach about the probability that the customer will be satisfied?
20.10 Refer to problem 20.2 on page 776 concerning a marketing manager who wants to predict customers with risk of churning. < CHURN >
a. Develop a neural network model to predict the probability of churning, based on the number of calls the customer makes to the company call centre and the number of visits the customer makes to the local service centre.
b. What conclusions can the marketing manager reach about the probability of churning?
20.11 Refer to problem 20.3 on page 776 concerning an automotive insurance company that wants to predict which filed stolen vehicle claims are fraudulent. < INSURANCE_FRAUD >
a. Develop a neural network model to predict the probability of a fraudulent claim, based on the number of claims submitted per year by the policy holder and whether the policy is new.
b. What conclusions can the insurance company reach about the probability of a fraudulent claim?
20.12 Refer to problem 20.4 on page 776 concerning the effect of price on the purchase of a pizza from Pizza Hut. < PIZZA_HUT >
a. Develop a neural network model to predict the probability a student will select Pizza Hut based on the price of a Pizza Hut pizza and the student's gender.
b. What conclusions can you reach about the probability the student will select Pizza Hut?
20.13 Refer to problem 20.5 on page 776 concerning a consumer products company that wants to measure the effectiveness of different types of advertising media in the promotion of its products. < ADVERTISE >
a. Develop a neural network model to predict sales of the product.
b. What conclusions can the consumer products company reach about sales of the product?
20.14 Refer to problem 20.6 on page 776 concerning Starbucks' experiment to determine the factors in its bag-sealing equipment that might be affecting the ease of opening the bag without tearing its inner liner. < STARBUCKS >
a. Develop a neural network model to predict the rating of the ability of the bag to resist tears.
b. What conclusions can Starbucks reach about the rating of the ability of the bag to resist tears?
20.15 Refer to problem 20.7 on page 777, which considers whether drilling is faster using dry or wet drilling holes. < DRILL >
a. Develop a neural network model to predict the drilling time.
b. What conclusions can you reach about the drilling time?


20.16 Refer to problem 20.8 on page 777, where the owner of a moving company wishes to develop a more accurate method of predicting labour hours. < MOVING >
a. Develop a neural network model to predict labour hours.
b. What conclusions can the moving-company owner reach about labour hours?

20.4  CLUSTER ANALYSIS

LEARNING OBJECTIVE 3
Use cluster analysis for predictive analytics

Cluster analysis seeks to classify data into a sequence of groupings such that objects in each group are more like other objects in their group than they are like objects found in other groups. Cluster analysis can be performed in several different ways, two of the most common of which are hierarchical clustering and k-means clustering. In hierarchical clustering, the analysis starts with each object in its own cluster. Then, the two objects that are determined to be the closest to each other are merged into a single cluster. The merging of the two closest objects repeats until there remains only one cluster that includes all objects. In k-means clustering, the number of clusters (k) is set at the start of the process. Objects are then assigned to clusters in an iterative process that seeks to make the means of the k clusters as different as possible. During the iterative process, unlike hierarchical clustering, in which clusters once formed are never changed later in the process, objects may be reassigned to a different cluster later in the process.
To perform a cluster analysis, you must determine how to measure the distance between objects and how to measure the distance between clusters. The most common measure of distance between objects is Euclidean distance, which measures the distance between objects as the square root of the sum of the squared differences between objects over all r dimensions.

EUCLIDEAN DISTANCE (CLUSTER ANALYSIS)

d_{ij} = \sqrt{\sum_{k=1}^{r} (X_{ik} - X_{jk})^2}   (20.5)

where
d_{ij} = distance between object i and object j
X_{ik} = value of object i in dimension k
X_{jk} = value of object j in dimension k
r = number of dimensions

Among the measures of distance between clusters are complete linkage, single linkage, average linkage and Ward's minimum variance method. Complete linkage bases the distance between clusters on the maximum distance between objects in one cluster and another cluster. Single linkage bases the distance between clusters on the minimum distance between objects in one cluster and another cluster. Average linkage bases the distance between clusters on the mean distance between objects in one cluster and another cluster. Ward's minimum variance method bases the distance between clusters on the sum of squares over all variables between objects in one cluster and another cluster.
To illustrate cluster analysis, suppose that you wanted to examine the similarities and dissimilarities between various sports. You collect data from a survey on the perceptions of four attributes of nine sports (basketball, skiing, cricket, table tennis, hockey, track and field, bowling, tennis and Australian rules football) and store the data in < SPORTS >. You define the following seven-point rating scales for the four variables that correspond to the four attributes:
• movement speed: 1 = fast paced to 7 = slow paced
• rules: 1 = complicated rules to 7 = simple rules
• team orientation: 1 = team sport to 7 = individual sport
• amount of contact: 1 = non-contact to 7 = contact

cluster analysis  A form of analysis that classifies data into a sequence of groupings in which objects in each group have more in common with others in their group than they do with objects found in other groups.
hierarchical clustering  A form of cluster analysis where two objects that are determined to be the closest to each other are merged into a single cluster. This process of merging the two closest objects repeats until there remains only one cluster that includes all objects.
k-means clustering  A form of cluster analysis where objects are assigned to clusters in an iterative process that seeks to make the means of the k clusters as different as possible.
Euclidean distance  A distance measure used in cluster analysis where the distance between objects is the square root of the sum of the squared differences between objects over all r dimensions.
complete linkage  A measure of distance that bases the distance between clusters on the maximum distance between objects in one cluster and another cluster.
single linkage  A measure of distance that bases the distance between clusters on the minimum distance between objects in one cluster and another cluster.
average linkage  A measure of distance that bases the distance between clusters on the mean distance between objects in one cluster and another cluster.
Ward's minimum variance method  A measure of distance that bases the distance between clusters on the sum of squares over all variables between objects in one cluster and another cluster.


Figure 20.6 presents the results of a JMP complete linkage cluster analysis based on the mean score of each sport on each rating scale.

Figure 20.6  JMP cluster analysis results for the different sports
Source: The output for this paper was generated using JMP® software.

From examining either the tree diagram (which JMP calls a dendrogram) from left to right or the clustering history, observe that the first two sports that cluster together are track and field and bowling followed by basketball and hockey. Then skiing and tennis join together, followed by table tennis merging with track and field and bowling. This process continues until all the sports are merged into one cluster. When there are three clusters remaining, the sports in the three clusters are {basketball, hockey, Australian rules football and cricket}, {bowling, table tennis and track and field} and {tennis and skiing}. The first cluster of {basketball, hockey, Australian rules football and cricket} appears to represent team sports. The second cluster of {bowling, table tennis and track and field} are slow-moving individually contested sports. The third cluster {tennis and skiing} represents fast-moving individually contested sports.
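A minimal Python sketch of the same kind of complete-linkage analysis is shown below. It assumes the mean ratings of the nine sports on the four scales sit in a small file with the hypothetical layout indicated in the comments (the file and column names are assumptions); scipy's linkage and dendrogram functions produce the clustering history and tree diagram that JMP reports.

```python
import matplotlib.pyplot as plt
import pandas as pd
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from scipy.spatial.distance import pdist

# Hypothetical layout of the <SPORTS> mean ratings: one row per sport, four rating columns
sports = pd.read_csv('sports.csv', index_col='sport')   # columns: speed, rules, team, contact

distances = pdist(sports.values, metric='euclidean')    # Equation 20.5 over all pairs of sports
tree = linkage(distances, method='complete')            # complete linkage: maximum inter-object distance

print(tree)                                              # clustering history (merges and distances)
print(fcluster(tree, t=3, criterion='maxclust'))         # membership when three clusters remain

dendrogram(tree, labels=sports.index.tolist())           # the tree diagram (JMP's dendrogram)
plt.show()
```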

Problems for Section 20.4
20.17 Movie companies need to predict a movie's gross receipts once it has debuted. The following results, stored in < POTTER_MOVIES >, are the first weekend gross, the US gross and the worldwide gross (in millions of dollars) of the Harry Potter movies.
a. Using the complete linkage method, perform a cluster analysis on the Harry Potter movies based on the first weekend gross, the US gross and the worldwide gross (in millions of dollars).
b. What conclusions can you reach about which Harry Potter movies are most similar?
20.18 The file < CEREALS > contains the calories and the carbohydrates and sugar (in grams) in one serving of seven breakfast cereals.
a. Using the complete linkage method, perform a cluster analysis on the cereals based on the calories and the carbohydrates and sugar (in grams).
b. What conclusions can you reach about which cereals are most similar?
20.19 The file < PROTEIN > contains calorie and cholesterol information for popular protein foods (fresh red meats, poultry and fish) compiled by the US Department of Agriculture.
a. Using the complete linkage method, perform a cluster analysis on the protein foods based on the calories and the cholesterol (in grams).
b. What conclusions can you reach about which protein foods are most similar?


c. Using Ward's minimum variance method, perform a cluster analysis on the protein foods, based on the calories and the cholesterol (in grams).
d. What conclusions can you reach about which protein foods are most similar?
e. Compare the results of (a) and (c). Are there any differences between your conclusions? Explain why or why not.
20.20 A Pew Research Center survey found that social networking is popular in many nations around the world. The file < GLOBAL_SOCIAL_MEDIA > contains the level of social media networking (measured as the percentage of individuals polled who use social networking sites) and GDP at purchasing power parity (PPP) per capita for each of 25 selected countries (data obtained from 'Global digital communication: texting, social networking popular worldwide', Pew Research Center).
a. Using the complete linkage method, perform a cluster analysis on the nations based on the level of social media networking (measured as the percentage of individuals polled who use social networking sites) and GDP at purchasing power parity (PPP) per capita.
b. What conclusions can you reach about which nations are most similar?

20.5  MULTIDIMENSIONAL SCALING

LEARNING OBJECTIVE 4
Use multidimensional scaling for predictive analytics

multidimensional scaling (MDS)  A form of business analytics that visualises objects in a two- or more dimensional space with the goal of discovering patterns of similarities or dissimilarities between the objects.
stress statistic  A goodness-of-fit statistic used in multidimensional scaling.

Multidimensional scaling (MDS) visualises objects in a two- or more dimensional space, or map, with the goal of discovering patterns of similarities or dissimilarities between the objects. One challenge of MDS is to interpret the significance of the dimensions of the map and understand the reasons behind the distance between individual objects or apparent groups of objects. There are two main types of multidimensional scaling: metric multidimensional scaling, which assumes that the distance between objects is ratio scaled, and non-metric multidimensional scaling, which assumes that the distance between objects is ordinal scaled.
To perform an MDS analysis, you must determine how to measure the distance between objects (the basis for placing objects in the map) and the number of map dimensions to be interpreted. The most common measure of distance between objects used is the Euclidean distance, which measures the distance between objects as the square root of the sum of the squared differences between objects over all r dimensions. This is the same Euclidean distance formula that is used for cluster analysis (see Equation 20.5). You obtain MDS results in a varying number of dimensions, usually from one dimension to five. The goal of MDS analysis is to minimise the number of dimensions used to interpret the results while maximising the goodness of fit of the results to the original data. The goodness of fit is measured by the stress statistic, as defined in Equation 20.6.

STRESS STATISTIC

\text{Stress} = \sqrt{\dfrac{\sum_{i,j=1}^{m} (d_{ij} - \hat{d}_{ij})^2}{\sum_{i,j=1}^{m} (d_{ij} - \bar{d})^2}}   (20.6)

where
d_{ij} = distance between objects i and j
\hat{d}_{ij} = the fitted regression value estimated by the multidimensional scaling algorithm from the original data for objects i and j
\bar{d} = mean distance between objects
m = n(n − 1)/2
n = number of objects


While the smaller the stress statistic, the better the fit, there is no fixed rule about what constitutes an acceptable value for the stress statistic. Because, as a general rule, the stress statistic decreases as the number of dimensions increases, using many dimensions can cause the stress statistic to approach 0 (a perfect fit), but at the cost of creating a map that could be as complex as the original data itself! Usually, a good rule is to increase dimensions as long as the stress statistic decreases substantially. In many cases, the decrease in the stress statistic begins to level off after the second or third dimension is considered. When this occurs, you can limit yourself to trying to interpret two or three dimensions. Attempting to interpret more than three dimensions in many cases can be extremely challenging.
To illustrate MDS analysis, suppose that you wanted to examine the similarities and dissimilarities between various sports. You collect data from a survey on the perceptions of four attributes of nine sports (basketball, skiing, cricket, table tennis, hockey, track and field, bowling, tennis and Australian rules football) and store the data in < SPORTS >. You define the following seven-point rating scales for the four variables that correspond to the four attributes:
• movement speed: 1 = fast paced to 7 = slow paced
• rules: 1 = complicated rules to 7 = simple rules
• team orientation: 1 = team sport to 7 = individual sport
• amount of contact: 1 = non-contact to 7 = contact
Figure 20.7 presents the results of the MDS analysis based on the mean score of each attribute for each sport.

Figure 20.7  Multidimensional scaling stress results for nine sports
Source: The output for this paper was generated using JMP® software.

The JMP results reveal a stress statistic of 0.3166 in one dimension, 0.1376 in two dimensions and 0.0788 in three dimensions. Because there is a large difference in the stress statistic between one and two dimensions but only a small difference in the stress statistic between two and three dimensions, you would choose to begin by interpreting the two-dimensional results shown in Figure 20.8. To interpret a two-dimensional map, you look for points that appear close to each other as well as points that appear distant from each other. Although not the case with Figure 20.8, you may need to rotate the map in order to better interpret the dimensions. From Figure 20.8, observe that Australian rules football, hockey and basketball are close to each other. Tennis, table tennis and bowling are somewhat close to each other, and track and field and skiing are separate from the others. To best interpret the dimensions separating the sports, observe that if you rotate the map clockwise 45 degrees, one axis appears to separate the team sports (Australian rules football, hockey, basketball and cricket) from the non-team sports. The other axis appears to separate


Figure 20.8  Two-dimensional MDS map for perceptions about nine sports
Source: The output for this paper was generated using JMP® software.

the fast-paced contact sports (Australian rules football, hockey and basketball) from the slow-paced non-contact sports such as table tennis, bowling and tennis.
After interpreting the two-dimensional map, you can check to see if interpreting a three-dimensional map yields a better result. Interpreting a three-dimensional map is inherently harder as there are many more ways to examine and rotate the cube-like map. Figure 20.9 shows the original and rotated three-dimensional map. The rotated map seems to show team sports gathering near the 'ceiling', while individual sports gather near the 'floor'. As this does not enhance the interpretation, you would use the simpler two-dimensional map for your final analysis.

Figure 20.9  Original and rotated three-dimensional MDS maps for perceptions about nine sports
Source: The output for this paper was generated using JMP® software.
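To make the dimension choice concrete, the sketch below fits one-, two- and three-dimensional solutions with scikit-learn's metric MDS and computes a normalised stress directly from the embedded distances. This is only an approximation of Equation 20.6 (which uses the fitted values of a non-metric algorithm), and the file and column names are assumptions, so the values will not reproduce the JMP stress statistics quoted above.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist
from sklearn.manifold import MDS

# Hypothetical layout of the <SPORTS> mean ratings; file and column names are assumptions
sports = pd.read_csv('sports.csv', index_col='sport')

d = pdist(sports.values)                         # observed Euclidean distances (Equation 20.5)

for k in (1, 2, 3):
    mds = MDS(n_components=k, dissimilarity='euclidean', random_state=0)
    coords = mds.fit_transform(sports.values)    # k-dimensional map coordinates
    d_hat = pdist(coords)                        # distances between the mapped points
    stress = np.sqrt(((d - d_hat) ** 2).sum() / ((d - d.mean()) ** 2).sum())
    print(f'{k} dimension(s): normalised stress = {stress:.4f}')
```

Looking for the dimension at which the stress stops decreasing substantially mirrors the reasoning used above to settle on the two-dimensional map.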


Problems for Section 20.5
20.21 Refer to problem 20.17 on page 782 concerning the need for movie companies to predict the gross receipts of a movie once it has debuted. < POTTER_MOVIES >
a. Perform an MDS analysis on the Harry Potter movies based on the first weekend gross, the US gross and the worldwide gross (in millions of dollars).
b. What conclusions can you reach about which Harry Potter movies are most similar?
20.22 Refer to problem 20.18 on page 782 concerning the calories and the carbohydrates and sugar (in grams) in one serving of seven breakfast cereals. < CEREALS >
a. Perform an MDS analysis on the cereals based on the calories, carbohydrates and sugar, in grams.
b. What conclusions can you reach about which cereals are most similar?
20.23 Refer to problem 20.19 on page 782 concerning calorie and cholesterol information for popular protein foods (fresh red meats, poultry and fish) compiled by the US Department of Agriculture. < PROTEIN >
a. Perform an MDS analysis on the protein foods based on the calories and the cholesterol (in grams).
b. What conclusions can you reach about which protein foods are most similar?
20.24 Refer to problem 20.20 on page 783 concerning the Pew Research Center survey on social networking. < GLOBAL_SOCIAL_MEDIA >
a. Perform an MDS analysis on the nations based on the level of social media networking (measured as the percentage of individuals polled who use social networking sites) and GDP at purchasing power parity (PPP) per capita.
b. What conclusions can you reach about which nations are most similar?

20 Assess your progress

Summary
In this chapter you have applied a number of techniques in the predictive analytics phase of business analytics – namely, classification and regression trees, neural networks, cluster analysis for predictive analytics, and multidimensional scaling (MDS). You were able to apply these methods to conduct a campaign to persuade a bank's customers to upgrade their credit cards, determine the effect that price and in-store promotional expenses will have on sales of OmniPower bars, and examine the similarities and dissimilarities between various sports.

Key formulas

Akaike information criterion (AIC)
\text{AIC} = 2k - 2\ln(L)   (20.1)

Akaike information criterion corrected (AICc)
\text{AIC}_c = \text{AIC} + \dfrac{2k(k + 1)}{n - k - 1}   (20.2)

LogWorth
\text{LogWorth} = -\log_{10}(p\text{-value})   (20.3)

Hyperbolic tangent function
\dfrac{e^{2x} - 1}{e^{2x} + 1}   (20.4)

Euclidean distance (cluster analysis and MDS)
d_{ij} = \sqrt{\sum_{k=1}^{r} (X_{ik} - X_{jk})^2}   (20.5)

Stress statistic
\text{Stress} = \sqrt{\dfrac{\sum_{i,j=1}^{m} (d_{ij} - \hat{d}_{ij})^2}{\sum_{i,j=1}^{m} (d_{ij} - \bar{d})^2}}   (20.6)

Key terms
Akaike information criterion (AIC) 774
average linkage 781
business analytics 771
classification and regression trees 772
cluster analysis 781
complete linkage 781
data mining 772
Euclidean distance 781
Gini impurity 772
hierarchical clustering 781
hyperbolic tangent function 778
k-means clustering 781
multidimensional scaling (MDS) 783
multilayer perceptrons (MLPs) 777
neural networks 777
predictive analytics 771
processing elements 778
single linkage 781
stress statistic 783
training data 778
Ward's minimum variance method 781

References
1. JMP Version 10 (Cary, NC: SAS Institute, 2012).
2. Tableau Public Version 8 (Seattle, WA: Tableau Software, 2013).
3. Hakimpoor, H., K. Arshad, H. Tat, N. Khani and M. Rahmandoust, 'Artificial neural network application in management', World Applied Sciences Journal, 2011, 14(7), 1008–1019.
4. Linoff, G. and M. Berry, Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management (Hoboken, NJ: Wiley Publishing, Inc., 2011).

Chapter review problems

CHECKING YOUR UNDERSTANDING
20.25 How do classification trees differ from regression trees?
20.26 How do classification and regression tree models differ from neural network models?
20.27 How does cluster analysis differ from multidimensional scaling?

APPLYING THE CONCEPTS
20.28 The production of wine is a multibillion-dollar worldwide industry. In an attempt to develop a model of wine quality as judged by wine experts, data were collected from red and white wine variants of Portuguese vinho verde (data obtained from P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis, 'Modeling wine preferences by data mining from physicochemical properties', Decision Support Systems, 47, 2009, 547–553 and ). The population of 6,497 wines is stored in < VINHO_VERDE_POPULATION >.
a. Using half the data as the training sample and the other half as the validation sample, develop a classification tree model to predict the probability that the wine is red. (Consider the entire set of variables in your analysis.)
b. What conclusions can you reach about the probability that the wine is red?
c. Repeat (a) using a neural network model.
d. Compare the results of (a) and (c).
20.29 Refer to the data in problem 20.28.
a. Using half the data as the training sample and the other half as the validation sample, develop a regression tree model to predict wine quality. (Consider the entire set of variables in your analysis.)
b. What conclusions can you reach about wine quality?
c. Repeat (a) using a neural network model.
d. Compare the results of (a) and (c).
20.30 The file < FT_MBA > contains a sample of 2012 top-ranked full-time MBA programs. Variables included are mean starting salary on graduation (in dollars), percentage of students with job offers within three months of graduation, program cost (in dollars) and total number of students per program (data obtained from ).
a. Using all the data as the training sample, develop a regression tree model to predict the mean starting salary on graduation.


b. What conclusions can you reach about the mean starting salary on graduation?
c. Using half the data as the training sample and the other half as the validation sample, develop a regression tree model to predict the mean starting salary on graduation.
d. What differences exist in the results of (a) and (c)?
e. Repeat (c) using a neural network model.
f. Compare the results of (c) and (e).
20.31 A specialist in baseball analytics wants to determine which variables are important in predicting a team's wins in a given baseball season. He has collected data in < BB_2012 > that includes the number of wins, earned run average (ERA), saves, runs scored, hits allowed, walks allowed and errors for the 2012 season.
a. Using half the data as the training sample and the other half as the validation sample, develop a regression tree model to predict the number of wins.
b. What conclusions can the baseball analytics specialist reach about the number of wins?
c. Repeat (a) using a neural network model.
d. Compare the results of (a) and (c).
20.32 Nassau County is located approximately 25 miles east of New York City. Data in < GLEN_COVE > are from a sample of 30 single-family homes located in Glen Cove. Variables included are the fair market value, land area of the property (acres), interior size of the house (square feet), age (years), number of rooms, number of bathrooms and number of cars that can be parked in the garage.
a. Using all the data as the training sample, develop a regression tree model to predict the fair market value.
b. What conclusions can you reach about the fair market value?
c. Using half the data as the training sample and the other half of the data as the validation sample, develop a regression tree model to predict the fair market value.
d. What differences exist in the results of (a) and (c)?
e. Repeat (c) using a neural network model.
f. Compare the results of (c) and (e).
20.33 A market research study has been conducted by a travel website that specialises in restaurants, with the business objective of determining which types of foods are perceived to be similar and which are perceived to be different. The following 10 types of foods were studied:
• Japanese
• Cantonese (Chinese)
• Szechuan (Chinese)
• French
• Mexican
• Mandarin (Chinese)
• American
• Spanish
• Italian
• Greek
The mean values of each food on the following scales are stored in < FOODS >:
• bland = 1 to spicy = 7
• light = 1 to heavy = 7
• low calories = 1 to high calories = 7
a. Perform a cluster analysis on the types of foods.
b. Perform an MDS analysis on the types of foods.
c. Which foods can the travel website conclude are most similar?
20.34 A specialist in baseball analytics wants to determine which Major League Baseball (MLB) teams were most similar in 2012. He has collected data in < BB_2012 > related to ERA, saves, runs scored, hits allowed, walks allowed and errors for the 2012 season.
a. Perform a cluster analysis on the baseball teams.
b. Perform an MDS analysis on the baseball teams.
c. Which MLB teams can the baseball analytics specialist conclude were most similar in 2012?

Chapter 20 Software Guide

INTRODUCTION
Chapter 20 discusses a number of statistical methods not included in or weakly supported by Microsoft Excel, and uses JMP, a separate statistical application, for its predictive analytics examples. For these reasons, this Software Guide presents instructions for using JMP.

JMP  Use Treemap. For example, to construct a treemap of WaldoLands social media comments grouped by 'Land' similar to the one shown at left in Figure 2.19 on page 65, open the WL_Social_Data workbook. Select File ➔ Open. In the Open Data File dialog box, navigate to the location of the file,


select the file and then click Open. In the DATA - JMP window:
1. Select Graph ➔ Tree Map.
In the Tree Map - JMP dialog box (shown in Figure SG20.1):
2. Drag Land to the Categories box.
3. Drag Ride to the Categories box.
4. Drag Comments to the Sizes box.
5. Drag Rating to the Coloring box.
6. Click OK.

Figure SG20.1  Tree Map - JMP dialog box

Adjust the size of the treemap as necessary to clearly display the labels. By default, JMP colours the treemap using blue for the unfavourable ratings and red for the favourable ratings, the inverse of the colourings in Figure 2.19. To change the colour spectrum, click the drop-down button that is part of the chart title, click Color Theme, and then click one of the submenu choices. Note that there is no pre-defined spectrum that uses red for the lowest values.
To construct a treemap similar to the one shown at right in Figure 2.19, in the TopSix DATA - JMP window, repeat steps 1 to 6, skipping step 2.

Tableau Public  Use the treemap feature. For example, to construct a treemap of WaldoLands social media comments grouped by 'Land' similar to the one shown at left in Figure 2.19 on page 65, select File ➔ New and:
1. Select Data ➔ Connect to Data. In the Connect to Data panel, click Microsoft Excel.
2. In the Open dialog box, navigate to the location of the WL_Social_Data workbook, select that file, and then click Open.
3. In the Excel Workbook Connection dialog box, click DATA in the Step 2 list box and click OK.
Tableau Public displays a Data pane next to an empty worksheet. From the Data pane:
4. Drag Comments in the Measures group and drop it on the Size icon in the Marks area.
5. Drag Rating in the Measures group and drop it on the Color icon in the Marks area.
6. Drag Land in the Dimensions group and drop it on the Label icon in the Marks area of the new worksheet area.
7. Drag Ride in the Dimensions group and drop it on the Label icon in the Marks area.
Tableau Public constructs a treemap and updates the Marks area (called the 'Marks card' by Tableau) and adds a SUM(Rating) area similar to the areas shown in Figure SG20.2.

Figure SG20.2  Tableau Public Marks card

To change the red-to-green spectrum to colours less likely to be confused:
8. Double-click the SUM(Rating) colour spectrum. In the Edit Colors [(Rating)] dialog box:
9. Select Red-Blue Diverging from the Palette drop-down list.
10. Click OK.
Change the dimensions of the treemap to allow all labels to be displayed. To change a dimension, move the mouse pointer over an edge and then drag the edge to adjust. To adjust the formatting of the labels, select Format ➔ Font, and in the Font pane change the font attributes for Pane in the default group.
Tableau Public allows the interactive collapse of a level of data. For example, to collapse the rides into their lands, right-click Ride in the Marks area, and in the shortcut menu click Attribute. The treemap changes to a three-area map, one for each of WaldoLands' three lands. To restore the original map, right-click Ride in the Marks area, and in the shortcut menu click Dimension.
There are two ways to construct the treemap shown at right in Figure 2.19. You can repeat steps 1 to 10, clicking Top6DATA in step 3 and skipping step 4, or you can use the filter function to alter the treemap that is produced by the original steps 1 to 10. To use the filter function, first follow steps 1 to 10 above. Right-click Ride in the Marks area, and in the shortcut menu click Filter. In the General tab of the Filter [Ride] dialog box, clear the check boxes for the rides that are not part of the top six attractions and then click OK.
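For readers working in Python rather than JMP or Tableau Public, a comparable treemap can be sketched with Plotly Express. The file name and the column names Land, Ride, Comments and Rating below are assumptions that mirror the fields used in the steps above.

```python
import pandas as pd
import plotly.express as px

# Hypothetical layout of the WaldoLands social media data; file and column names are assumptions
df = pd.read_csv('wl_social_data.csv')   # columns: Land, Ride, Comments, Rating

fig = px.treemap(df,
                 path=['Land', 'Ride'],            # rides grouped within lands
                 values='Comments',                # tile size: number of comments
                 color='Rating',                   # tile colour: rating
                 color_continuous_scale='RdBu')    # red-to-blue diverging palette
fig.show()
```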

Data Discovery

In-depth Excel  Use PivotTables and Slicer. For drill-down, construct a PivotTable. (While those instructions discuss using only categorical variables, the same set of instructions can be used for a mix of categorical and numerical variables.) Then click on the + buttons that precede the row categories to expand the table one level deeper. Figure 2.20 on page 66 illustrates this operation. To reveal the data for variables not initially included in the initial PivotTable, as illustrated by Figure 20.6 on page 782, include the cell range of those variables when first defining the PivotTable and then later click on a cell that contains the value of interest.
To construct the slicer dashboard, first construct a PivotTable. Click cell A3 in the PivotTable and:
1. Select Insert ➔ Slicer.
In the Insert Slicers dialog box (shown in Figure SG20.3):
2. Check Market Cap, Type, Expense Ratio and Star Rating.
3. Click OK.
4. In the worksheet, drag the slicers to reposition them. If necessary, resize slicer panels as you would resize a window.

Figure SG20.3  Insert Slicers dialog box

Click the value buttons in the slicers to explore the data. When you click a value button, the icon at the top right of the slicer changes to include a red X. Click this icon to reset the slicer. When you click a value button, value buttons in other slicers may become dimmed. Dimmed value buttons represent values that are not found in the currently 'sliced' data, and if you click a dimmed value button, the PivotTable will be empty and show no values.

JMP  Use Graph Builder. Summary charts can be used as the basis for both drill-down and slicer-like operations in JMP. For example, open the Retirement_Funds workbook and select File ➔ Open. In the Open Data File dialog box, navigate to the location of the file, select the file, and then click Open. In the DATA - JMP window, select Analyze ➔ Distribution and in the Distribution - JMP dialog box (shown in Figure SG20.4):
1. Drag Type to the Y, Columns box.
2. Drag Risk to the By box.
3. Click OK.

Figure SG20.4  Distribution - JMP dialog box

JMP constructs a panel of three histograms that show the distribution of growth and value funds for each type of risk (low, average and high). Double-click a histogram bar to display all of the variables for funds that have the bar's combination of type and risk values.
To create a slicer-like display, in the DATA - JMP window, again select Analyze ➔ Distribution and in the Distribution - JMP dialog box:
1. Drag Type to the Y, Columns box.
2. Drag Market Cap to the Y, Columns box.
3. Drag Star Rating to the Y, Columns box.
4. Drag Expense Ratio to the Y, Columns box.
5. Click OK.
JMP constructs a panel that contains four histograms, one for each variable. (The display for the numerical variable Expense Ratio also includes a box-and-whisker plot.) Click a specific bar to display the proportion of the bars in the other histograms related to that bar's value. For example, when you click the Value bar in the Type histogram, you can see that among value funds large is the most frequent market cap and that three stars is the most frequent star rating.
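The same drill-down and slicing ideas can be sketched in pandas. This is only a rough analogue of the PivotTable and JMP displays, and it assumes a retirement-funds table with the column names shown in the comments.

```python
import pandas as pd

# Hypothetical layout of the retirement funds data; file and column names are assumptions
funds = pd.read_csv('retirement_funds.csv')   # columns include Type, Risk, Market Cap, Star Rating, Expense Ratio

# Drill-down: fund counts and mean expense ratio by Type, then one level deeper by Risk
print(funds.pivot_table(index=['Type', 'Risk'], values='Expense Ratio',
                        aggfunc=['count', 'mean']))

# Slicer-like filtering: restrict to value funds, then inspect the other variables
value_funds = funds[funds['Type'] == 'Value']
print(value_funds['Market Cap'].value_counts())    # most frequent market cap among value funds
print(value_funds['Star Rating'].value_counts())   # most frequent star rating among value funds
```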

SG20.1  PREDICTIVE ANALYTICS There are no software guide instructions for this section.

SG20.2 CLASSIFICATION AND REGRESSION TREES

Classification Tree
JMP  Use Partition. For example, to perform the Figure 20.1 classification tree analysis for predicting the proportion of credit-card holders who would upgrade to a premium card, open the Card_Study workbook. Select File ➔ Open. In the Open Data File dialog box, navigate to the location of the file, select the file, and then click Open.
Because the Upgraded and Extra Cards variables have been coded with values 0 and 1, JMP mistakes these categorical variables as numerical variables (and would perform an incorrect analysis on the data). To change the variable type of Extra Cards to categorical, right-click the Extra Cards column and click Column Info in the shortcut menu. In the Extra Cards - JMP dialog box (shown in Figure SG20.5), select Character from the Data Type drop-down list and click OK. (The Modeling Type changes from Continuous to Nominal.) To change Upgraded to a categorical variable, right-click the Upgraded column, click Column Info, select Character as the Data Type and click OK.

Figure SG20.5  Extra Cards - JMP dialog box

Verify these operations by examining the icons that appear in the Columns panel to the left of the JMP worksheet. If these icons, which JMP calls modeling type icons, appear as shown in Figure SG20.6, the Upgraded and Extra Cards variables have been properly identified as categorical variables (with a nominal scale).

Figure SG20.6  Icons in Columns panel of JMP worksheet

While you could begin your analysis at this point, the results will use the values 0 and 1, whose meanings may be misinterpreted, for the two categorical variables. Better would be results that use the easily understood values ‘No’ and ‘Yes’. To recode Extra Cards in this way, select the Extra Cards column and:
1. Select Cols ➔ Recode.
2. In the Recode - JMP dialog box (shown in Figure SG20.7), enter No as the New Value for 0, enter Yes as the New Value for 1, and click OK.

Figure SG20.7  Recode - JMP dialog box

To recode Upgraded, select the Upgraded column and repeat the previous steps 1 and 2.
With the data properly prepared, in the DATA - JMP window, select Analyze ➔ Modeling ➔ Partition. In the Partition dialog box (shown in Figure SG20.8):
3. Drag Upgraded to the Y, Response box.
4. Drag Purchases to the X, Factor box.
5. Drag Extra Cards to the X, Factor box.
6. Click OK.

Figure SG20.8  Partition - JMP dialog box

In the new DATA - Partition of Upgraded - JMP window:
7. Click the drop-down button to the left of the diagram title and click Split Best. Repeat this step until clicking Split Best no longer has any effect on the tree diagram.
8. If the contents of the diagram do not match Figure 20.1, click the drop-down button to the left and then select Display Options. To match Figure 20.1, all choices on the Display Options submenu should be checked, except the last two choices, Show Split Candidates and Sort Split Candidates. If necessary, click unchecked choices, one at a time, until all but the last two choices are checked.
9. Click Color Points. The Color Points button disappears and the ‘No’ points in the plot appear in red and the ‘Yes’ points appear in blue.
At any point, click Prune to remove the last split operation. To enhance the display of the points in the plot, right-click a point, then click Marker Size from the shortcut menu and click one of the size choices.
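For readers working in Python rather than JMP, a roughly comparable classification tree can be grown with scikit-learn. This is only a sketch under assumptions: card_study.csv is a hypothetical export of the Card_Study workbook, and scikit-learn's CART splits will not necessarily reproduce the JMP Partition output in Figure 20.1.

```python
# A minimal scikit-learn sketch of a classification tree for the card-study
# example. The file name 'card_study.csv' is an assumption for illustration.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

cards = pd.read_csv("card_study.csv")
X = cards[["Purchases", "Extra Cards"]]          # Extra Cards already coded 0/1
y = cards["Upgraded"].map({0: "No", 1: "Yes"})   # recode 0/1 as No/Yes

tree = DecisionTreeClassifier(max_depth=3, random_state=1)
tree.fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))
```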

Regression Tree
JMP  Use Partition. For example, to perform the Figure 20.2 regression tree analysis for predicting the sales of OmniPower bars, open the OmniPower workbook. Select File ➔ Open. In the Open Data File dialog box, navigate to the location of the file, select the file, and then click Open. Because JMP properly identifies Sales, Price and Promotion as numerical variables, there is no need to change variable types as is done in the preceding classification tree example.
In the DATA - JMP window, select Analyze ➔ Modeling ➔ Partition. In the Partition dialog box:
1. Drag Sales to the Y, Response box.
2. Drag Price to the X, Factor box.
3. Drag Promotion to the X, Factor box.
4. Click OK.
In the new DATA - Partition of Sales - JMP window:
5. Click the drop-down button to the left of the title Partition for Sales and click Split Best. Repeat this step until clicking Split Best no longer has any effect on the tree diagram.
At any point, click Prune to remove the last split operation. To enhance the display of the points in the plot, right-click a point, then click Marker Size from the shortcut menu and click one of the size choices.
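A corresponding regression-tree sketch in scikit-learn follows; omnipower.csv is a hypothetical export of the OmniPower workbook, and the fitted splits will generally differ from JMP's Figure 20.2.

```python
# A minimal scikit-learn regression tree for the OmniPower example.
# The file name 'omnipower.csv' is an assumption for illustration.
import pandas as pd
from sklearn.tree import DecisionTreeRegressor, export_text

omni = pd.read_csv("omnipower.csv")
X = omni[["Price", "Promotion"]]
y = omni["Sales"]

tree = DecisionTreeRegressor(max_depth=3, random_state=1)
tree.fit(X, y)
print(export_text(tree, feature_names=["Price", "Promotion"]))
```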

SG20.3  NEURAL NETWORKS

JMP  Use Neural. For example, to perform the Figure 20.4 MLP analysis for classifying credit-card holders who would be likely to upgrade to a premium card, open the Card_Study workbook. Select File ➔ Open. In the Open Data File dialog box, navigate to the location of the file, select the file and then click Open. As discussed in Section SG20.2, because the Upgraded and Extra Cards variables have been coded with values 0 and 1, JMP mistakes these categorical variables for numerical variables (and would perform an incorrect analysis on the data). First use the instructions in Section SG20.2 to change the data type (to Character) for these two variables. Then use the instructions in that section for recoding the values of the two variables as ‘No’ and ‘Yes’.
With the data properly prepared, select Analyze ➔ Modeling ➔ Neural. In the Neural JMP dialog box (similar to Figure SG20.8):
1. Drag Upgraded to the Y, Response box.
2. Drag Purchases to the X, Factor box.
3. Drag Extra Cards to the X, Factor box.
4. Click OK.
In the next dialog box:
5. Leave Holdback as the Validation Method and 0.3333 as the Holdback Proportion.
6. Enter 2 as the number of Hidden Nodes.
7. Click Go.
In the DATA - Neural of Upgraded - JMP window:
8. Click the triangle icons to the left of the two Confusion Matrix titles and the two Confusion Rates titles to display these tables.
9. Click the drop-down button to the left of the title Model NTanH(2) and click Show Estimates.
Because the initial weights for the model are chosen at random, the actual results will almost certainly differ from the Figure 20.4 results. If the validation data misclassification rate is too high, you may generate additional models. To generate an additional model, click the triangle icon to the left of the title Model Launch and then click Go. To eliminate a model generated, right-click on the model's Model NTanH(2) title and click Remove Fit in the shortcut menu.
To perform the Figure 20.5 MLP analysis for predicting the sales of OmniPower bars, open the OmniPower workbook. Select File ➔ Open. In the Open Data File dialog box, navigate to the location of the file, select the file, and then click Open. Because JMP properly identifies Sales, Price and Promotion as numerical variables, there is no need to change variable types as is done in the preceding example.
In the DATA - JMP window, select Analyze ➔ Modeling ➔ Neural. In the Neural JMP dialog box:
1. Drag Sales to the Y, Response box.
2. Drag Price to the X, Factor box.
3. Drag Promotion to the X, Factor box.
4. Click OK.
In the next dialog box:
5. Leave Holdback as the Validation Method and 0.3333 as the Holdback Proportion.
6. Enter 3 as the number of Hidden Nodes.
7. Click Go.
In the DATA - Neural of Sales - JMP window:
8. Click the drop-down button to the left of the title Model NTanH(3) and click Show Estimates.
As noted in the first example, because the initial weights for the model are chosen at random, the actual results will almost certainly differ from the Figure 20.5 results. If necessary, you can generate additional models by clicking Go in Model Launch as explained in the previous example.
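The JMP neural-network runs above can be mimicked, loosely, with scikit-learn's multilayer perceptron: a one-third holdback split and a small tanh hidden layer. The file and exact column names are assumptions, and because the starting weights are random the results will differ from Figures 20.4 and 20.5, just as repeated JMP runs do.

```python
# A rough scikit-learn counterpart to the classification MLP described above.
# 'card_study.csv' is an assumed export of the Card_Study workbook.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

cards = pd.read_csv("card_study.csv")
X = cards[["Purchases", "Extra Cards"]]
y = cards["Upgraded"].map({0: "No", 1: "Yes"})

# Hold back one-third of the data for validation, as in the JMP dialog.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.3333, random_state=1)

mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(2,), activation="tanh",
                  max_iter=5000, random_state=1))
mlp.fit(X_train, y_train)
print("validation accuracy:", mlp.score(X_valid, y_valid))
```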


SG20.4  CLUSTER ANALYSIS

JMP 10  Use Cluster. For example, to perform the Figure 20.6 cluster analysis for the different sports, open the Sports workbook. Select File ➔ Open. In the Open Data File dialog box, navigate to the location of the file, select the file, and then click Open. (JMP properly identifies all five variables that comprise the file.) In the DATA - JMP window, select Analyze ➔ Multivariate Methods ➔ Cluster. In the Cluster dialog box (shown in Figure SG20.9):
1. Drag Movement Speed to the Y, Columns box.
2. Drag Rules to the Y, Columns box.
3. Drag Team Oriented to the Y, Columns box.
4. Drag Amount of Contact to the Y, Columns box.
5. Drag Sport to the Label box.
6. In the Options drop-down list, click Hierarchical and click Complete (in the Method group).
7. Click OK.

Figure SG20.9  Clustering - JMP dialog box

In the DATA - Hierarchical Cluster - JMP dialog box, click the triangle icon to the left of the title Clustering History to reveal how the clustering was done.
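An analogous complete-linkage hierarchical clustering can be produced with SciPy. The file sports.csv is a hypothetical export of the Sports workbook; the dendrogram plays the role of JMP's cluster tree in Figure 20.6.

```python
# A SciPy sketch of complete-linkage hierarchical clustering on the four
# sport attributes. File and column names are assumptions for illustration.
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

sports = pd.read_csv("sports.csv")
attributes = ["Movement Speed", "Rules", "Team Oriented", "Amount of Contact"]

Z = linkage(sports[attributes], method="complete")   # complete linkage, as in JMP
dendrogram(Z, labels=sports["Sport"].tolist())        # Sport used as the label
plt.show()
```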

SG20.5  MULTIDIMENSIONAL SCALING

JMP  Use the Multidimensional Scaling add-in and the R statistical package. For example, to perform the Figure 20.7 multidimensional scaling analysis based on the mean score of each attribute for each sport, open the Sports workbook. Select File ➔ Open. In the Open Data File dialog box, navigate to the location of the file, select the file, and then click Open. (JMP properly identifies all five variables that comprise the file.) In the DATA - JMP window, click Add-Ins ➔ Multidimensional Scaling. In the General tab of the Multidimensional Scaling - JMP dialog box:
1. Click R project.
2. Drag Movement Speed to the Y, Columns box.
3. Drag Rules to the Y, Columns box.
4. Drag Team Oriented to the Y, Columns box.
5. Drag Sport to the Label Column box.
6. Click Run.
The add-in displays an overlay plot in a new NMDS Fit Output - JMP window. To view the results for a particular number of dimensions, click the plot point for that number of dimensions. Then click Display MDS Results for Selected Dimension. To view the table of stress values, click the drop-down list button to the left of the title Overlay Plot and click Script, then Data Table Window. The add-in also works with SAS. To use SAS, click SAS Project in step 1.
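For comparison, scikit-learn's MDS estimator can place the sports in two dimensions from the same attribute scores. This is metric MDS via SMACOF rather than the R-based non-metric fit the add-in performs, and the file and column names are assumptions, so the resulting map will only loosely resemble Figure 20.7.

```python
# An illustrative multidimensional-scaling sketch; 'sports.csv' is an assumed
# export of the Sports workbook, using the three attributes named above.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.manifold import MDS

sports = pd.read_csv("sports.csv")
attributes = ["Movement Speed", "Rules", "Team Oriented"]

coords = MDS(n_components=2, random_state=1).fit_transform(sports[attributes])

plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), name in zip(coords, sports["Sport"]):
    plt.annotate(name, (x, y))
plt.show()
```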

The output for the figures in this chapter was generated using JMP® software.


CHAPTER 21
Data analysis: The big picture

MOUNTING FUTURE ANALYSES

Learning business statistics is a lot like climbing a mountain. At first, it may seem intimidating, or even overwhelming, but over time you learn techniques that help make the task much more manageable.

In Part 1 – Presenting and Describing Information you learned how to define, collect and organise data, which allowed you to visualise the data collected. These are important first steps in analysing data, which was covered extensively in Parts 3 to 5. Part 2 – Measuring Uncertainty provides the link between Part 1, describing sample or population data, and Parts 3 to 5, where sample data is used to draw conclusions about the populations the data is drawn from.
Determining what methods to use to analyse data may have seemed straightforward when solving problems from a particular chapter, and even the end-of-part problems. But what approach will you take when you find yourself in new situations, needing to analyse data for another course or to help solve a problem in a real business setting? After all, when you solved a problem from a chapter on multiple regression, you ‘knew’ that multiple regression methods would be part of your analysis. In new situations, you might wonder whether you should use multiple regression, or whether using simple linear regression would be better – or, indeed, whether any type of regression would be appropriate. You also might wonder if you should use a combination of methods from several different chapters or parts to help solve the problems you face. The question for you becomes: How can you apply the statistical methods you have learned to new situations that require you to analyse data?


LEARNING OBJECTIVES

After studying this chapter you should be able to:
1. identify the questions to ask when choosing which statistical methods to use to conduct data analysis
2. generate rules for applying statistics in future studies and analyses

Figure 21.1 contains a summary of the contents of this book, arranged by data analysis task. This would be a good starting point for answering the question posed above.

Figure 21.1  Commonly used data analysis tasks discussed in this book

DESCRIBING ONE OR SEVERAL GROUPS

Numerical variables
• Ordered array, stem-and-leaf display, frequency distribution, relative frequency distribution, percentage distribution, cumulative percentage distribution, histogram, polygon, cumulative percentage polygon (Sections 2.2 and 2.3)
• Bullet graph, gauge, treemap (Section 2.6)
• Sample mean, median, mode, quartiles, range, interquartile range, standard deviation, variance, coefficient of variation (Sections 3.1 and 3.3)
• Population mean, standard deviation and variance (Section 3.2)
• Boxplot (Section 3.4)
• Normal probability plot (Section 6.3)

Categorical variables
• Summary table, bar chart, pie chart (Section 2.1)
• Contingency tables (Section 2.4)

INFERENCES ABOUT A SINGLE POPULATION

Numerical variables
• Confidence interval estimate of the mean (Sections 8.1 and 8.2)
• Z test or t test for the mean (Sections 9.2 to 9.4)
• Chi-square test for a variance or standard deviation (Section 15.5)
• Chi-square goodness-of-fit test (Section 15.4)

Categorical variables
• Confidence interval estimate of the proportion (Section 8.3)
• Z test for the proportion (Section 9.5)


COMPARING TWO POPULATIONS

Numerical variables
• Tests for the difference in the means of two independent populations (Section 10.1)
• Paired t test (Section 10.2)
• F test for the difference between two variances (Section 10.3)
• Wilcoxon rank sum test (online Section 19.2)
• Wilcoxon signed ranks test (online Section 19.3)

Categorical variables
• Z test for the difference between two proportions (independent samples) (Section 10.4)
• Chi-square test for the difference between two proportions (independent samples) (Section 15.1)
• McNemar test for the difference between two proportions (related samples) (online Section 19.1)

COMPARING MORE THAN TWO POPULATIONS

Numerical variables
• One-way analysis of variance (Section 11.1)
• Kruskal–Wallis rank test (online Section 19.4)
• Randomised block design (Section 11.2)
• Two-way analysis of variance (Section 11.3)
• Friedman rank test (online Section 19.5)

Categorical variables
• Chi-square test for differences between more than two proportions (Section 15.2)

ANALYSING THE RELATIONSHIP BETWEEN TWO VARIABLES

Numerical variables
• Scatter plot, time-series plot (Section 2.5)
• Covariance, coefficient of correlation, coefficient of determination (Sections 3.5 and 12.3)
• Simple linear regression (Chapter 12)
• Time-series forecasting (Chapter 14)

Categorical variables
• Contingency table, side-by-side bar chart (Sections 2.1 and 2.4)
• Chi-square test of independence (Section 15.3)


ANALYSING THE RELATIONSHIP BETWEEN TWO OR MORE VARIABLES


Numerical dependent variables
• Multiple regression (Chapter 13 and online Chapter 16)
• Regression tree (online Section 20.2)
• Neural network (online Section 20.3)

CLASSIFYING OBJECTS INTO GROUPS

For a set of variables
• Cluster analysis (online Section 20.4)
• Multidimensional scaling (online Section 20.5)

ANALYSING PROCESS DATA

Numerical variables
• X̄ and R control charts (online Section 18.7)

Categorical variables
• p chart (online Section 18.4)

For counts of nonconformities
• c chart (online Section 18.6)

For any data analysis task to solve a business (or other) problem, the first step is to define the variables you want to study. To do this, you must identify the type of business (or other) problem (describing a group, population or sample, or making inferences about a population or populations, among other choices) and then determine the type of variable – numerical or ­categorical – you are analysing. In Figure 21.1, the all-uppercase first-level headings identify types of business (or other) problems, and the second-level headings include the two types of variables. The entries in Figure 21.1 identify the specific statistical methods appropriate for a particular type of problem and type of variable. Choosing appropriate statistical methods for your data is the single most important task you face and is at the heart of ‘doing statistics’. But this selection process is also the single most difficult thing you do when applying statistics! How, then, can you ensure that you have made an appropriate choice? By asking a series of questions, you can guide yourself to the appropriate choice of methods. The rest of this chapter presents questions that will help guide you in making this choice. Two lists of questions, one for numerical variables and the other for categorical variables, are presented in the next two sections. Having two lists makes the decision you face more manageable while also reinforcing the importance of identifying the type of variable that you seek to analyse.


LEARNING OBJECTIVE 1  Identify the questions to ask when choosing which statistical methods to use to conduct data analysis
LEARNING OBJECTIVE 2  Generate rules for applying statistics in future studies and analyses

21.1  ANALYSING NUMERICAL VARIABLES

Exhibit 21.1 presents the list of questions to ask if you plan to analyse a numerical variable. Each question is independent of the others, and you can ask as many or as few questions as is appropriate for your analysis. How to go about answering these questions follows Exhibit 21.1.

EXHIBIT 21.1  QUESTIONS TO ASK WHEN ANALYSING NUMERICAL VARIABLES
• Do you want to describe the characteristics of the variable (possibly broken down into several groups)?
• Do you want to reach conclusions about the mean and/or standard deviation of the variable in a population?
• Do you want to determine whether the mean and/or standard deviation of the variable differs depending on the group?
• Do you want to determine which factors affect the value of a variable?
• Do you want to predict the value of the variable based on the values of other variables?
• Do you want to determine whether the values of the variable are stable over time?

Describing the Characteristics of a Numerical Variable Develop tables and charts and calculate descriptive statistics to describe characteristics such as central tendency, variation and shape. Specifically, you can create a stem-and-leaf display, percentage distribution, histogram, polygon, boxplot, normal probability plot, bullet graph, gauge, and treemap (see Sections 2.2, 2.3, 2.6, 3.4 and 6.3), and you can calculate sample statistics, and population parameters, such as the mean, median, mode, quartiles, range, interquartile range, standard deviation, variance, coefficient of variation, skewness and kurtosis (see Sections 3.1, 3.2 and 3.3).

Reaching Conclusions About the Population Mean and/or Standard Deviation You have several different choices, and you can use any combination of these choices. To estimate the mean value of the variable in a population, you construct a confidence interval estimate of the mean (see Sections 8.1 and 8.2). To determine whether the population mean is equal to a specific value, you may be able to conduct a Z test or t test of hypothesis for the mean (see Sections 9.1 to 9.4). To determine whether the population standard deviation or variance is equal to a specific value, you may be able to conduct a χ² test of hypothesis for the standard deviation or variance (see Section 15.5).
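As a concrete aside, the SciPy sketch below computes a 95% confidence interval for a mean and a one-sample t test on made-up data; it illustrates the choices just listed rather than the text's Excel or JMP workflow.

```python
# A small SciPy illustration of single-population inference for a mean.
import numpy as np
from scipy import stats

sample = np.array([9.2, 10.1, 8.7, 11.3, 10.8, 9.9, 10.4, 9.5])  # made-up data

mean = sample.mean()
se = stats.sem(sample)                                  # standard error of the mean
ci = stats.t.interval(0.95, df=len(sample) - 1, loc=mean, scale=se)
print("95% CI for the mean:", ci)

t_stat, p_value = stats.ttest_1samp(sample, popmean=10)  # H0: mu = 10
print("t =", t_stat, "p =", p_value)
```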

Determining Whether the Mean and/or Standard Deviation Differs Depending on the Group When examining differences between groups, you first need to establish which categorical variable to use to divide your data into groups. You then need to know whether this grouping variable divides your data in two groups (such as male and female groups for a gender variable) or whether the variable divides your data into more than two groups (such as the analysis of electricity prices and consumption in four capital cities discussed in Section 11.1). Finally, you must ask whether your data set contains independent groups or whether your data set contains matched or repeated measurements.


If the Grouping Variable Defines Two Independent Groups and You Are Interested in Central Tendency  Which hypothesis tests you use depends on the assumptions you can make about your data. If you can assume that your numerical variable is normally distributed and that the variances are equal, you conduct a pooled t test for the difference between the means (see Section 10.1). If you cannot assume that the variances are equal, you conduct a separate-variance t test for the difference between the means (see Section 10.1). To test whether the variances are equal, assuming that the populations are normally distributed, you can conduct an F test for the difference between two variances (see Section 10.3). If you cannot assume that your numerical variable is normally distributed, you may be able to perform a Wilcoxon rank sum test (see online Section 19.2). To evaluate the assumption of normality that the pooled t test and separate-variance t test require, you can construct boxplots and normal probability plots for each group (see Sections 3.4 and 6.3). Alternatively, a chi-square goodness-of-fit test (see Section 15.4) can be used to test if the data come from normal populations.

If the Grouping Variable Defines Two Groups of Matched Samples or Repeated Measurements and You Are Interested in Central Tendency  If you can assume that the paired differences are normally distributed, you conduct a paired t test (see Section 10.2). If you cannot assume that the paired differences are normally distributed, you may be able to conduct a Wilcoxon signed rank test (see online Section 19.3).

If the Grouping Variable Defines Two Independent Groups and You Are Interested in Variability  If you can assume that your numerical variable is normally distributed, you conduct an F test for the difference between two variances (see Section 10.3).

If the Grouping Variable Defines More Than Two Independent Groups and You Are Interested in Central Tendency  If you can assume that the values of the numerical variable are normally distributed, with equal variances, you conduct a one-way analysis of variance (see Section 11.1); otherwise, you may be able to conduct a Kruskal–Wallis rank test (see online Section 19.4).

If the Grouping Variable Defines More Than Two Groups of Matched Samples or Repeated Measurements and You Are Interested in Central Tendency  Suppose that you have a design where the rows represent the blocks and the columns represent the levels of a factor. If you can assume that the values of the numerical variable are normally distributed, you may be able to conduct a randomised block design F test (see Section 11.2). If you cannot assume that the values of the numerical variable are normally distributed, you may be able to conduct a Friedman rank test (see online Section 19.5).
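A brief SciPy illustration of the two-independent-group options just described, on made-up data: the pooled-variance t test, the separate-variance (Welch) t test and the Wilcoxon rank sum (Mann–Whitney) alternative.

```python
# Two-independent-sample tests on made-up values.
from scipy import stats

group1 = [23.1, 25.4, 22.8, 26.0, 24.3, 23.7]
group2 = [27.2, 26.8, 29.1, 25.9, 28.4, 27.7]

print(stats.ttest_ind(group1, group2, equal_var=True))    # pooled-variance t test
print(stats.ttest_ind(group1, group2, equal_var=False))   # separate-variance t test
print(stats.mannwhitneyu(group1, group2, alternative="two-sided"))
```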

Determining Which Factors Affect the Value of a Variable If there are two factors to be examined to determine their effect on the values of a variable, you may be able to develop a two-factor factorial design (see Section 11.3).

Predicting the Value of a Variable Based on the Values of Other Variables When predicting the values of a numerical dependent variable, you may be able to conduct least-squares regression analysis. The least-squares regression model you develop depends on the number of independent variables in your model. If there is only one independent variable being used to predict the numerical dependent variable of interest, you develop a simple linear regression model (see Chapter 12); otherwise, you develop a multiple regression model (see Chapter 13 and online Chapter 16) or a regression tree (see online Section 20.2) or a neural network (see online Section 20.3).
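As an aside, the regression choices above can be illustrated with statsmodels on a few made-up observations: a simple linear regression with one predictor and a multiple regression with two.

```python
# A statsmodels sketch of simple and multiple regression on made-up data.
import pandas as pd
import statsmodels.formula.api as smf

data = pd.DataFrame({
    "sales":     [310, 295, 340, 325, 360, 300, 385, 370],
    "price":     [79, 85, 69, 75, 59, 89, 55, 65],
    "promotion": [200, 180, 250, 220, 300, 150, 320, 280],
})

simple = smf.ols("sales ~ price", data=data).fit()
multiple = smf.ols("sales ~ price + promotion", data=data).fit()
print(simple.params)
print(multiple.summary())
```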


If you have values over a period of time and you want to forecast the variable for future time periods, you can use moving averages, exponential smoothing, least-squares forecasting and autoregressive modelling (see Chapter 14).

Determining Whether the Values of a Variable Are Stable Over Time If you are studying a process and have collected data on the values of a numerical variable over a time period, you construct R and X̄ charts (see online Section 18.7). If you have collected data in which the values are counts of the number of nonconformities, you construct a c chart (see online Section 18.6).

LEARNING OBJECTIVE 1  Identify the questions to ask when choosing which statistical methods to use to conduct data analysis

21.2  ANALYSING CATEGORICAL VARIABLES

Exhibit 21.2 presents the list of questions to ask if you plan to analyse a categorical variable. Each question is independent of the others, and you can ask as many or as few questions as is appropriate for your analysis. How to go about answering these questions follows Exhibit 21.2.

EXHIBIT 21.2  QUESTIONS TO ASK WHEN ANALYSING CATEGORICAL VARIABLES
• Do you want to describe the proportion of items of interest in each category (possibly broken down into several groups)?
• Do you want to reach conclusions about the proportion of items of interest in a population?
• Do you want to determine whether the proportion of items of interest differs depending on the group?
• Do you want to determine whether the proportion of items of interest is stable over time?

Describing the Proportion of Items of Interest in Each Category You can construct summary tables and the following charts: bar chart, pie chart, or side-by-side bar chart (see Sections 2.1 and 2.4).

Reaching Conclusions About the Proportion of Items of Interest You may be able to estimate the proportion of items of interest in a population by constructing a confidence interval estimate of the proportion (see Section 8.3). Alternatively, you may be able to determine whether the population proportion is equal to a specific value by conducting a Z test of hypothesis for the proportion (see Section 9.5).
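For illustration, the one-proportion procedures can be run with statsmodels on made-up counts: a confidence interval for the proportion and a Z test of a hypothesised value.

```python
# One-proportion confidence interval and Z test on made-up counts.
from statsmodels.stats.proportion import proportion_confint, proportions_ztest

successes, n = 64, 100                       # made-up: 64 items of interest in 100

print(proportion_confint(successes, n, alpha=0.05, method="normal"))
z_stat, p_value = proportions_ztest(count=successes, nobs=n, value=0.5)  # H0: pi = 0.5
print("z =", z_stat, "p =", p_value)
```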

Determining Whether the Proportion of Items of Interest Differs Depending on the Group When examining this difference, you first need to establish the number of categories associated with your categorical variable and the number of groups in your analysis. If your data contain two groups, you must also ask if your data contain independent groups or if your data contain matched samples or repeated measurements.


For Two Categories and Two Independent Groups  You may be able to conduct either the Z test for the difference between two proportions (see Section 10.4) or the χ² test for the difference between two proportions (see Section 15.1).

For Two Categories and Two Groups of Matched or Repeated Measurements  You may be able to use the McNemar test (see online Section 19.1).

For Two Categories and More Than Two Independent Groups  You may be able to use a χ² test for the difference between several proportions (see Section 15.2).

For More Than Two Categories and More Than Two Groups  You can develop contingency tables and use multidimensional contingency tables to drill down to examine relationships between two or more categorical variables (Sections 2.1, 2.4, and 2.6). You may be able to conduct a χ² test of independence (see Section 15.3).
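A short SciPy illustration of the contingency-table tests listed above, applied to a small made-up table of counts.

```python
# Chi-square test of independence on a made-up 2 x 3 contingency table.
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[30, 20, 10],
                     [25, 30, 35]])

chi2, p_value, dof, expected = chi2_contingency(observed)
print("chi-square =", chi2, "df =", dof, "p =", p_value)
print("expected frequencies:\n", expected)
```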

Determining Whether the Proportion of Items of Interest is Stable over Time If you are studying a process and have collected data over a time period, you can create the appropriate control chart. If you have collected the proportion of items of interest over a time period, you develop a p chart (see online Section 18.4).

21.3  PREDICTIVE ANALYTICS

LEARNING OBJECTIVE 2  Generate rules for applying statistics in future studies and analyses

Predictive analytics are methods that determine what is likely to occur in the future. These methods fall into one of four categories:
• prediction: assigning a value to a target based on a model
• classification: assigning items in a collection to target categories or classes
• clustering: finding natural groupings in data
• association: finding items that tend to co-occur and specifying the rules that govern their co-occurrence.
Table 21.1 identifies and classifies the specific predictive analytics methods that were discussed in Chapter 20.

Table 21.1  Selected predictive analytics methods

Method                                 Prediction   Classification   Clustering   Association
Classification and regression trees   •            •
Neural networks                       •            •
Cluster analysis                                                     •
Multidimensional scaling (MDS)                                       •

While predictive analytics has been used with samples from a population, the growing use of big data has led to more widespread use. Combining predictive methods with big data requires data mining. Data mining is the use of techniques that allow the extraction of useful knowledge from huge repositories of data. That extraction is rarely simple and typically requires statistical methods and computer science algorithms as well as the database-type operations associated with simple searches and retrievals of data. Because of this common association of predictive analytics and data mining, some people intertwine the meanings of


the two terms or use them jointly in the phrase ‘predictive analytics and data mining’. Even though predictive analytics methods may be most useful to a business decision maker when used with data mining, these methods can exist independently of data mining and can be used with smaller sets of data.

21 Assess your progress

Chapter review problems

21.1 In many manufacturing processes, the term work in progress (often abbreviated WIP) is used. At the LSS Publishing book manufacturing plants, WIP represents the time it takes for sheets from a press to be folded, gathered, sewn, tipped on end sheets and bound together to form a book, and the book placed in a packing carton. The operational definition of the variable of interest – processing time – is the number of days (measured in hundredths) from when the sheets come off the press to when the book is placed in a packing carton. The company has the business objective of determining whether there are differences in the WIP between plants. Data have been collected from samples of 20 books at each of two production plants. The data, stored in < WIP >, are as follows:

Plant A
5.62  5.29  16.25  10.92  11.46  21.62  8.45  8.58  5.41  11.42
11.62  7.29  7.50  7.96  4.42  10.50  7.58  9.29  7.54  8.92

Plant B
9.54  11.46  16.62  12.62  25.75  15.41  14.29  13.13  13.71  10.04
5.75  12.46  9.17  13.21  6.00  2.33  14.25  5.37  6.25  9.71

Completely analyse the data.

21.2 The file < EURO_TOURISM2 > contains a sample of 27 European countries. Variables included are the number of jobs generated in the travel and tourism industry in 2012, the spending on business travel within the country by residents and international visitors in 2012, the total number of international visitors who visited the country in 2012 and the number of establishments that provide overnight accommodation for tourists (data obtained from ). You want to be able to predict the number of jobs generated in the travel and tourism industry. Completely analyse the data.

21.3 The file < PHILLY > contains a sample of 25 neighbourhoods in Philadelphia. Variables included are neighbourhood population, median sales price of homes in 2012, mean number of days homes were on the market in 2012, number of homes sold in 2012, median neighbourhood household income, percentage of residents in the neighbourhood with a bachelor's degree or higher, and whether the neighbourhood is considered ‘hot’ (coded as 1 = yes, 0 = no) (Data extracted from ). You want to be able to predict median sales price of homes. Completely analyse the data.

21.4 Professional basketball has truly become a sport that generates interest among fans around the world. More and more players travel from outside the United States to play in the National Basketball Association (NBA). Many factors could impact the number of wins achieved by each NBA team. In addition to the number of wins, the file < NBA_2012 > contains team statistics for points per game (for team, opponent, and the difference between team and opponent), field goal (shots made) percentage (for team, opponent, and the difference between team and opponent), turnovers (losing the ball before a shot is taken) per game (for team, opponent, and the difference between team and opponent), and the rebound percentage. You want to be able to predict the number of wins. Completely analyse the data.

21.5 The data in < USED_CARS > have been collected for 10 different cars in various Australian states from online car sales sites. For one of the cars in one state, fully analyse the data and provide a written report. In particular, for cars of this make and model in the specified state, answer the following.
a. For two- and three-year-old cars, what is the minimum and maximum price and the average price? Provide an estimated price range for a two- or three-year-old used car.


b. Suppose a buyer will consider only white cars, since they believe white cars are safer as they are more visible. Will this limit their choice of cars for sale? (For the variable ‘White’: Yes = car for sale is white, No = car for sale is not white)
c. Suppose a buyer wishes to purchase a two- or three-year-old car with less than 50,000 km on the clock. Will this limit their choice of cars for sale?
d. Is there a difference in price between cars for sale privately and those for sale by a used-car dealer? (For the variable ‘Seller’: Private = private sale, Dealer = for sale by used-car dealer.)
e. How does a used car depreciate in value?
21.6 A study was conducted to determine whether any gender bias existed in an academic science environment. Faculty from several universities were asked to rate candidates for the position of undergraduate laboratory manager based on their application. The gender of the applicant was given in the applicant's materials. The raters were from the biology, chemistry and physics departments. Each rater was to give a competence rating to the applicant's materials on a seven-point scale, with 1 being the lowest and 7 being the highest. In addition, the rater supplied a starting salary that should be offered to the applicant. These data (which have been altered from an actual study to preserve the anonymity of the respondents) are stored in < CANDIDATE_ASSESSMENT >. Analyse the data. Do you think that there is any gender bias in the evaluations? Support your point of view with specific references to your data analysis.
21.7 Zagat's publishes restaurant ratings for various locations in the United States. The file < RESTAURANTS2 > contains

the Zagat rating for food, décor, service, cost per person and popularity index (popularity points the restaurant received divided by the number of people who voted for that restaurant) for various types of restaurants in New York City. You want to study differences in the cost of a meal for the different types of cuisines and also want to be able to predict the cost of a meal. Completely analyse the data (data obtained from Zagat Survey 2012 New York City Restaurants). 21.8 The data in < BANK_MARKETING > are from a direct marketing campaign conducted by a Portuguese banking institution (data obtained from S. Moro, R. Laureano and P. Cortez, ‘Using data mining for bank direct marketing: an application of the CRISP-DM methodology’, in P. Novais et al. (eds.), Proceedings of the European Simulation and Modeling Conference — ESM’2011, 117–121.) The variables included were age, type of job, marital status, education, whether credit is in default, average yearly balance in account in euros, whether there is a housing loan, whether there is a personal loan, last contact duration in seconds, number of contacts performed during this campaign and whether the client has purchased a term deposit. Analyse the data and assess the likelihood that the client will purchase a term deposit. 21.9 The file < HYBRID_SALES > contains the number of domestic and imported hybrid vehicles sold in the United States from 1999 to 2012 (data obtained from ). You want to be able to predict the number of domestic and imported hybrid vehicles sold in the United States in 2013 and 2014. Completely analyse the data.


End of Part 5 problems

E.1 To monitor the filling process at a small Marlborough winery a random sample of six bottles is selected each hour and the volume of wine per bottle measured. To be acceptable as a 750 mL bottle each bottle must contain between 747 and 753 mL. The data in the file < WINE_BOTTLES > gives the results for a 24-hour day's production.
a. Construct a control chart for the range.
b. Construct a control chart for the mean.
c. Is the process in a state of statistical control?
d. If the process is in control, estimate the percentage of bottles whose content is within the specification limits.
e. If the process is in control, calculate Cp, CPL, CPU and Cpk.
f. If the winery requires that 99.7% of all bottles be within the specification limits, comment on the capability of the process based on your calculations in (d) and (e).

E.2 An entrepreneur wants to determine whether it would be profitable to establish a gardening service in a local suburb. The entrepreneur believes that there are four possible levels of demand for this gardening service:
· very low demand – 1% of the households would use the service
· low demand – 5% of the households would use the service
· moderate demand – 10% of the households would use the service
· high demand – 25% of the households would use the service.
Based on experiences in other suburbs, the entrepreneur assigns the following probabilities to the various demand levels:
P(very low demand) = 0.20
P(low demand) = 0.50
P(moderate demand) = 0.20
P(high demand) = 0.10
The entrepreneur has calculated the following profits or losses of this garden service for each demand level (over a period of 1 year):

Demand      Action: Provide garden service ($)   Action: Do not provide garden service
Very low    −50,000                              0
Low         60,000                               0
Moderate    130,000                              0
High        300,000                              0

a. Construct a decision tree.
b. Construct an opportunity loss table.
c. Compute the expected monetary value (EMV) for offering this garden service.
d. Compute the expected opportunity loss (EOL) for offering this garden service.
e. Explain the meaning of the expected value of perfect information (EVPI) in this problem.

f. Compute the return-to-risk ratio (RTRR) for offering this garden service.
g. Based on the results of (c) or (d) and (f), should the entrepreneur offer this garden service? Why?
Before making a final decision, the entrepreneur conducts a survey to determine demand for the gardening service. A random sample of 20 households is selected and three indicate that they would use this gardening service.
h. Revise the prior probabilities in light of this sample information. (Hint: Use the binomial distribution to determine the probability of the outcome that occurred given a particular level of demand.)
i. Use the revised probabilities in (h) to repeat (c) to (g).

E.3 A netball player, in the position of goal shooter, is studying her goal shooting ability. On each training day she shoots 100 goals. The file < GOAL_SHOTS > contains her results for 40 training days.
a. Construct a p chart for the proportion of successful goal shots. Do you think that the netballer's goal-shooting process is in statistical control? If not, why not?
b. What if you were told that the netballer used a different method of shooting goals for the past 20 days? How might this information change your conclusions in (a)?
c. If you knew the information in (b) before completing (a), how might you do the analysis differently?

E.4 The Parents and Citizens Association of a high school runs an annual fete as a fundraising measure. As most stalls are outdoors the organisers will suffer losses if there is heavy rain on the day of the fete. Therefore, they consider taking out wet-weather insurance which will cost $750. The insurance company will repay unsold food costs if there is high rainfall (> 40 mm) during the fete. This will reimburse the costs of wasted food if fete attendance and food sales are low due to the weather. The percentage of food wasted under different rainfall levels can be summarised as follows:
· very low – 1% of food will be wasted
· low – 10% of food will be wasted
· moderate – 20% of food will be wasted
· high – 50% of food will be wasted
Potential losses are based on a food cost of $4500 if all food was wasted and there was no insurance. The following payoff table is based on the costs of the two alternatives (insuring against heavy rain and not insuring):

Rainfall          Action: Do not insure   Action: Rain insurance
Very low (1%)     $45                     $750
Low (10%)         $450                    $750
Moderate (20%)    $900                    $750
High (50%)        $2,250                  $750




Based on experience, each rainfall level is assumed to be equally likely to occur.
a. Construct a decision tree.
b. Construct an opportunity loss table.
c. Compute the expected monetary value (EMV) for insuring and for not insuring.
d. Compute the expected opportunity loss (EOL) for insuring and for not insuring.
e. Explain the meaning of the expected value of perfect information (EVPI) in this problem.
f. Compute the return-to-risk ratio (RTRR) for not insuring.
g. Based on the results of (c) or (d) and (f), should the Parents and Citizens Association insure? Why?
h. The fete moves to a different time of year where probabilities for rain are:
· very low – 0.25
· low – 0.3
· moderate – 0.27
· high – 0.18
Use the revised probabilities to repeat (c) to (g).

E.5 As chief operating officer of a local community hospital, you have just returned from a three-day seminar on quality and productivity. It is your intention to implement many of the ideas that you learned at the seminar. You have decided to maintain control charts for the upcoming month for the following variables: number of daily admissions, proportion of rework in the laboratory (based on 1,000 daily samples), and time (in hours) between receipt of a specimen at the laboratory and completion of the work (based on a subgroup of 10 specimens per day). The data collected are summarised in the file . You are to make a presentation to the chief executive officer of the hospital and the board of directors. Prepare a report that summarises the conclusions drawn from analysing control charts for these variables. In addition, recommend additional variables to measure and monitor using control charts.

E.6 In Australia during the summer one-day cricket series competition, only one free-to-air television station will obtain the broadcast rights. Television ratings are not normally taken over summer, but to test the significance of gaining the rights to summer cricket, a television station obtained the ratings for 10 Australian cities for two sporting events, tennis and cricket, which it broadcast on consecutive weekends.

City         Cricket   Tennis
Sydney       56        45
Melbourne    65        40
Brisbane     47        58
Adelaide     60        50
Perth        52        61
Canberra     38        72
Hobart       44        66
Darwin       72        38
Newcastle    47        50
Geelong      58        29

At the 0.05 level of significance, test whether cricket achieved a significantly higher rating audience than tennis.

E.7 An employment agency regularly tests secretarial job applicants on typing speed through a home Internet test before choosing the applicant for a given job. In a subsequent check of the procedure, 12 successful applicants are re-tested under the conditions of the workplace.

Applicant   Home test speed (words/min) rank   Workplace test speed (words/min) rank
1           11                                 2
2           8                                  11
3           2                                  6
4           6                                  8
5           10                                 3
6           3                                  10
7           4                                  1
8           1                                  4
9           9                                  5
10          5                                  9
11          7                                  12
12          12                                 7

The data are available by rank order (1 = fastest to 12 = slowest) of the 12 applicants. At the 0.05 level of significance, is there a significant difference in the rankings between the two tests?

E.8 A chain of pet supply stores is trying to decide whether store location is likely to affect sales. Annual sales for each type of store were recorded for 2017 ($million).

Shopping centre   Main road   Suburban
1.2               3.4         1.2
7.6               5.2         0.8
6.2               1.9         4.4
3.8               7.6         2.9
7.8               3.8         8.1
3.9               3.2         7.4

At the 0.05 level of significance, is there a significant difference in the median sales between store locations?

E.9 A panel of 10 executives from several companies affiliated to a Singapore-based parent company have been carefully selected as specialists in the required fields to sit on a selection panel for the appointment of a new CEO. The selection has eight main criteria, including leadership, financial skills, communication skills and corporate entrepreneurship, each of which is rated on a scale of 1 to 6. The following table displays the sum of the ratings for each panel member for the top four applicants for the position.


Panel member   Applicant 1   Applicant 2   Applicant 3   Applicant 4
J.F.           36            40            32            42
H.F.           30            41            35            44
L.T.           38            40            39            40
J.W.           41            44            45            47
F.S.           33            36            32            43
S.W.           40            33            46            48
J.H.           40            30            28            42
P.J.           36            39            42            47
M.M.           35            39            46            45
I.S.           38            40            40            43

a. At the 0.05 level of significance, is there a significant difference in the median ratings for the four applicants?
b. Recalculate the test assuming the panel members are selected completely randomly.
E.10 Each semester at Tasman University, for each unit, a sample of enrolled students complete a unit feedback survey. In this survey students are asked to rate their overall satisfaction with the unit on a Likert scale, from 1 – extremely dissatisfied to 5 – extremely satisfied. A lecturer for a first-year statistics unit is analysing student satisfaction with their unit for the last nine semesters. The table below gives the number of students in each semester's sample who are dissatisfied, including extremely dissatisfied, and also the number who are satisfied, including extremely satisfied.

Semester   Sample size   Dissatisfied   Satisfied
1          88            17             53
2          80            8              60
3          99            8              70
4          96            16             76
5          77            2              65
6          76            17             42
7          75            8              56
8          74            13             45
9          73            5              58

a. Construct a control chart for the proportion of students who are dissatisfied with the unit. Is the process in a state of statistical control?
b. Construct a control chart for the proportion of students who are satisfied with the unit. Is the process in a state of statistical control?
E.11 Safe-As-Houses Real Estate is comparing the median number of days it took a property to sell in two neighbourhoods (A and B). Ten properties are randomly selected from each neighbourhood, and the number of days on the market until sold is recorded.

Neighbourhood A: 80 120 52 96 102 85 106 117 98 89
Neighbourhood B: 152 96 123 98 181 133 76 47 115 104

At the 0.05 level of significance, is there a significant difference in the median number of days a property takes to sell between the neighbourhoods? E.12 The manager of Maria’s Takeaways (open Monday to Friday) suspects that the average number of lunchtime customers depends on the day of the week. The following table shows the number of lunchtime customers on randomly selected days.

Monday   Tuesday   Wednesday   Thursday   Friday
116      97        91          99         114
126      122       112         107        118
146      85        67          98         102
97       80        56          126        143
100      111       134         135        123

At the 0.01 level of significance, can the manager conclude that the median number of customers differs significantly during the week? E.13 To improve customer service, the manager of Maria’s Takeaways asked customers for four weeks to anonymously rate the service received on a Likert scale, from 1 = extremely dissatisfied to 5 = extremely satisfied. The number of customers either extremely dissatisfied or dissatisfied with the service received each day is recorded in the table below.

Week   Monday   Tuesday   Wednesday   Thursday   Friday
1      6        8         0           2          4
2      9        3         3           1          0
3      0        5         1           0          3
4      7        1         2           4          2

a. Construct a control chart for the number of dissatisfied customers per day.
b. Is the process in a state of statistical control?
E.14 Each semester, a sample of students at Tasman University complete a unit feedback survey for each unit attempted. In this survey students are asked to rate their overall satisfaction with a unit on a Likert scale from 1 = extremely dissatisfied to 5 = extremely satisfied.


In response to student feedback in semester 1, the lecturer for first-year law implemented several suggested changes in semester 2. The lecturer wishes to know if these changes have resulted in increased student satisfaction with the unit. The file contains the sample student satisfaction ratings for semesters 1 and 2. At a 0.01 level of significance, can the lecturer conclude that the changes have resulted in significantly higher student satisfaction in semester 2? E.15 A market researcher wanted to determine whether the proportion of coffee drinkers who preferred brand A increased as the result of an advertising campaign. A random sample of 200 coffee drinkers was selected. The results indicating preference for brand A or brand B before the advertising campaign began and after its completion are shown below:

                         Preference before advertising campaign
Preference after
advertising campaign     Brand A    Brand B    Total
Brand A                  101        22         123
Brand B                  9          68         77
Total                    110        90         200

a. At the 0.10 significance level, is the proportion of coffee drinkers who prefer brand A significantly lower before the advertising campaign than after it?
b. Calculate the p-value in (a) and interpret its meaning.

E.16 Suppose that 10 different car guides have given a rating, out of 100, of satisfaction for four best-selling Japanese cars in Australia.

Guide   Mazda   Honda   Subaru   Toyota
1       70      72      79       79
2       75      68      82       74
3       73      60      84       95
4       80      85      85       76
5       72      66      88       83
6       84      81      70       77
7       89      89      78       80
8       80      78      83       78
9       69      82      56       99
10      70      63      91       92

At the 0.01 significance level, is there a significant difference between the satisfaction ratings for the different brands?

E.17 Concerned that decreasing tutorial attendance will lead to higher failure rates, Tasman University is considering recording tutorial attendance. To test whether keeping attendance records for tutorials increases tutorial attendance, a lecturer in the School of Management collected attendance records in week 3 for scheduled tutorials containing 112 enrolled students, without telling the students that attendance would be recorded. Then tutorial attendance was recorded for the same students two weeks later, in week 5, after telling them attendance records would be kept.

                         Week 3 tutorial attendance
Week 5 tutorial
attendance               Attended    Absent    Total
Attended                 65          22        87
Absent                   6           19        25
Total                    71          41        112

a. At the 0.01 level of significance, is the proportion of students attending a tutorial significantly higher when students are aware that tutorial attendance records are being kept?
b. Calculate the p-value in (a) and interpret its meaning.
E.18 The production of wine is a multibillion-dollar worldwide industry. In an attempt to develop a model of wine quality as judged by wine experts, data were collected from red and white wine variants of Portuguese vinho verde (data obtained from P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis, ‘Modeling wine preferences by data mining from physicochemical properties’, Decision Support Systems, 47, 2009, 547–553 and ). The population of 6,497 wines is stored in .
a. Perform a cluster analysis on the wine information.
b. Perform a multidimensional scaling analysis on the wine information.
c. What conclusions can you reach about which wines are most similar?


Appendices

A. Review of arithmetic, algebra and logarithms
   A.1 Rules for arithmetic operations
   A.2 Rules for algebra: Exponents and square roots
   A.3 Rules for logarithms
B. Summation notation
C. Statistical symbols and Greek alphabet
   C.1 Statistical symbols
   C.2 Greek alphabet
D. PHStat user's guide
E. Tables
   E.1 Table of random numbers
   E.2 The cumulative standardised normal distribution
   E.3 Critical values of t
   E.4 Critical values of χ2
   E.5 Critical values of F
   E.6 Table of binomial probabilities
   E.7 Table of Poisson probabilities
   E.8 Lower and upper critical values T1 of Wilcoxon rank sum test
   E.9 Lower and upper critical values W of Wilcoxon signed ranks test
   E.10 Critical values of the Studentised range Q
   E.11 Critical values dL and dU of the Durbin–Watson statistic D
   E.12 Selected critical values of F for Cook's Di statistic
   E.13 Control chart factors
   E.14 The standardised normal distribution
F. Using Microsoft Excel Analysis Toolpack
   F.1 Configuring Microsoft Excel
   F.2 Using the data analysis tools


A  Review of Arithmetic, Algebra and Logarithms

A.1 • RULES FOR ARITHMETIC OPERATIONS

RULE                                    EXAMPLE
1.  a + b = c and b + a = c             2 + 1 = 3 and 1 + 2 = 3
2.  a + (b + c) = (a + b) + c           5 + (7 + 4) = (5 + 7) + 4 = 16
3.  a − b = c but b − a ≠ c             9 − 7 = 2 but 7 − 9 = −2
4.  a × b = b × a                       7 × 6 = 6 × 7 = 42
5.  a × (b + c) = (a × b) + (a × c)     2 × (3 + 5) = (2 × 3) + (2 × 5) = 16
6.  a ÷ b ≠ b ÷ a                       12 ÷ 3 ≠ 3 ÷ 12
7.  (a + b)/c = a/c + b/c               (7 + 3)/2 = 7/2 + 3/2 = 5
8.  a/(b + c) ≠ a/b + a/c               3/(4 + 5) ≠ 3/4 + 3/5
9.  1/a + 1/b = (b + a)/(ab)            1/3 + 1/5 = (5 + 3)/((3)(5)) = 8/15
10. (a/b) × (c/d) = (a × c)/(b × d)     (2/3) × (6/7) = (2 × 6)/(3 × 7) = 12/21
11. (a/b) ÷ (c/d) = (a × d)/(b × c)     (5/8) ÷ (3/7) = (5 × 7)/(8 × 3) = 35/24

A.2 • RULES FOR ALGEBRA: EXPONENTS AND SQUARE ROOTS

RULE                                    EXAMPLE
1.  X^a × X^b = X^(a+b)                 4² × 4³ = 4⁵
2.  (X^a)^b = X^(ab)                    (2²)³ = 2⁶
3.  X^a / X^b = X^(a−b)                 3⁵/3³ = 3²
4.  X^a / X^a = X⁰ = 1                  3⁴/3⁴ = 3⁰ = 1
5.  √(XY) = √X × √Y                     √((25)(4)) = √25 × √4 = 10
6.  √(X/Y) = √X / √Y                    √(16/100) = √16/√100 = 0.40


A.3 • RULES FOR LOGARITHMS

Base 10
LOG is the symbol used for base 10 logarithms.

RULE                                    EXAMPLE
1.  LOG(10^A) = A                       LOG(100) = LOG(10²) = 2
2.  If LOG(A) = B, then A = 10^B        If LOG(A) = 2, then A = 10² = 100
3.  LOG(A × B) = LOG(A) + LOG(B)        LOG(100) = LOG(10 × 10) = LOG(10) + LOG(10) = 1 + 1 = 2
4.  LOG(A^B) = B × LOG(A)               LOG(1000) = LOG(10³) = 3 × LOG(10) = 3 × 1 = 3
5.  LOG(A/B) = LOG(A) − LOG(B)          LOG(100) = LOG(1000/10) = LOG(1000) − LOG(10) = 3 − 1 = 2

Example  Take the base 10 logarithm of each side of the following equation:
Y = β0 β1^X ε
Solution  Apply rules 3 and 4:
LOG(Y) = LOG(β0 β1^X ε) = LOG(β0) + LOG(β1^X) + LOG(ε) = LOG(β0) + X × LOG(β1) + LOG(ε)

Base e
LN is the symbol used for base e logarithms, commonly referred to as natural logarithms. e is Euler's number and e ≅ 2.718282.

RULE                                    EXAMPLE
1.  LN(e^A) = A                         LN(7.389056) = LN(e²) = 2
2.  If LN(A) = B, then A = e^B          If LN(A) = 2, then A = e² = 7.389056
3.  LN(A × B) = LN(A) + LN(B)           LN(100) = LN(10 × 10) = LN(10) + LN(10) = 2.302585 + 2.302585 = 4.605170
4.  LN(A^B) = B × LN(A)                 LN(1000) = LN(10³) = 3 × LN(10) = 3 × 2.302585 = 6.907755
5.  LN(A/B) = LN(A) − LN(B)             LN(100) = LN(1000/10) = LN(1000) − LN(10) = 6.907755 − 2.302585 = 4.605170

Example  Take the base e logarithm of each side of the following equation:
Y = β0 β1^X ε
Solution  Apply rules 3 and 4:
LN(Y) = LN(β0 β1^X ε) = LN(β0) + LN(β1^X) + LN(ε) = LN(β0) + X × LN(β1) + LN(ε)

Copyright © Pearson Australia (a division of Pearson Australia Group Pty Ltd) 2019— 9781488617249 — Berenson/Basic Business Statistics 5e

A

A-4 APPENDICES

B
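The same log-linearisation can be verified numerically. A minimal Python sketch; the parameter values below are arbitrary and chosen only for illustration:

import math

b0, b1, x, eps = 3.0, 1.2, 4.0, 1.05      # arbitrary illustrative values, not from the text
y = b0 * b1**x * eps                      # multiplicative model Y = beta0 * beta1^X * epsilon
lhs = math.log(y)                         # LN(Y)
rhs = math.log(b0) + x * math.log(b1) + math.log(eps)   # rules 3 and 4 applied
print(abs(lhs - rhs) < 1e-12)             # True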

B. Summation Notation

The symbol Σ, the Greek capital letter sigma, is used to denote 'taking the sum of'. Consider a set of n values for variable X. The expression Σ(i=1 to n) Xi means that these n values are to be added together. Thus:

Σ(i=1 to n) Xi = X1 + X2 + X3 + … + Xn

The following problem illustrates the use of the summation notation. Consider five values of a variable X: X1 = 2, X2 = 0, X3 = −1, X4 = 5 and X5 = 7. Thus:

Σ(i=1 to 5) Xi = X1 + X2 + X3 + X4 + X5 = 2 + 0 + (−1) + 5 + 7 = 13

In statistics, the squared values of a variable are often summed. Thus:

Σ(i=1 to n) Xi² = X1² + X2² + X3² + … + Xn²

and, in the example above:

Σ(i=1 to 5) Xi² = X1² + X2² + X3² + X4² + X5² = 2² + 0² + (−1)² + 5² + 7² = 4 + 0 + 1 + 25 + 49 = 79

Σ(i=1 to n) Xi², the summation of the squares, is not the same as [Σ(i=1 to n) Xi]², the square of the sum:

Σ(i=1 to n) Xi² ≠ [Σ(i=1 to n) Xi]²

In the example given earlier, the summation of squares is equal to 79. This is not equal to the square of the sum, which is 13² = 169.

Another frequently used operation involves the summation of the product. Consider two variables, X and Y, each having n values. Then:

Σ(i=1 to n) XiYi = X1Y1 + X2Y2 + X3Y3 + … + XnYn

Continuing with the previous example, suppose there is a second variable, Y, whose five values are Y1 = 1, Y2 = 3, Y3 = −2, Y4 = 4 and Y5 = 3. Then:

Σ(i=1 to 5) XiYi = X1Y1 + X2Y2 + X3Y3 + X4Y4 + X5Y5
                 = (2)(1) + (0)(3) + (−1)(−2) + (5)(4) + (7)(3)
                 = 2 + 0 + 2 + 20 + 21 = 45


In calculating Σ(i=1 to n) XiYi, realise that the first value of X is multiplied by the first value of Y, the second value of X is multiplied by the second value of Y, and so on. These cross-products are then summed in order to calculate the desired result. However, the summation of products is not equal to the product of the individual sums:

Σ(i=1 to n) XiYi ≠ [Σ(i=1 to n) Xi][Σ(i=1 to n) Yi]

In this example, Σ(i=1 to 5) Xi = 13 and Σ(i=1 to 5) Yi = 1 + 3 + (−2) + 4 + 3 = 9, so that:

[Σ(i=1 to 5) Xi][Σ(i=1 to 5) Yi] = (13)(9) = 117

However, Σ(i=1 to 5) XiYi = 45. The following table summarises these results.

Value     Xi     Yi     XiYi
1          2      1       2
2          0      3       0
3         −1     −2       2
4          5      4      20
5          7      3      21
Total     13      9      45

Rule 1  The summation of the values of two variables is equal to the sum of the values of each summed variable:

Σ(i=1 to n) (Xi + Yi) = Σ(i=1 to n) Xi + Σ(i=1 to n) Yi

Thus:

Σ(i=1 to 5) (Xi + Yi) = (2 + 1) + (0 + 3) + [−1 + (−2)] + (5 + 4) + (7 + 3)
                      = 3 + 3 + (−3) + 9 + 10 = 22
Σ(i=1 to 5) Xi + Σ(i=1 to 5) Yi = 13 + 9 = 22

Rule 2  The summation of a difference between the values of two variables is equal to the difference between the summed values of the variables:

Σ(i=1 to n) (Xi − Yi) = Σ(i=1 to n) Xi − Σ(i=1 to n) Yi

Thus:

Σ(i=1 to 5) (Xi − Yi) = (2 − 1) + (0 − 3) + [−1 − (−2)] + (5 − 4) + (7 − 3)
                      = 1 + (−3) + 1 + 1 + 4 = 4
Σ(i=1 to 5) Xi − Σ(i=1 to 5) Yi = 13 − 9 = 4

Rule 3  The summation of a constant times a variable is equal to that constant times the summation of the values of the variable:

Σ(i=1 to n) cXi = c Σ(i=1 to n) Xi

where c is a constant. Thus, if c = 2:

Σ(i=1 to 5) cXi = Σ(i=1 to 5) 2Xi = (2)(2) + (2)(0) + (2)(−1) + (2)(5) + (2)(7)
                = 4 + 0 + (−2) + 10 + 14 = 26
c Σ(i=1 to 5) Xi = 2 Σ(i=1 to 5) Xi = (2)(13) = 26

Rule 4  A constant summed n times will be equal to n times the value of the constant:

Σ(i=1 to n) c = nc

where c is a constant. Thus, if the constant c = 2 is summed five times:

Σ(i=1 to 5) c = 2 + 2 + 2 + 2 + 2 = 10
nc = (5)(2) = 10

You should distinguish a summation such as Σ(i=1 to 3) XiYi from the double summation, which contains extra terms. The double summation:

Σ(all i) Σ(all j) XiYj = X1Y1 + X1Y2 + X1Y3 + X2Y1 + X2Y2 + … + X3Y3

For example, if X1 = 2, X2 = 0, X3 = −1 and Y1 = 1, Y2 = 3, Y3 = −2:

Σ(i=1 to 3) XiYi = X1Y1 + X2Y2 + X3Y3
                 = 2 × 1 + 0 × 3 + (−1) × (−2)
                 = 2 + 0 + 2
                 = 4

Σ(all i) Σ(all j) XiYj = X1Y1 + X1Y2 + X1Y3 + X2Y1 + X2Y2 + X2Y3 + X3Y1 + X3Y2 + X3Y3
                       = 2 × 1 + 2 × 3 + 2 × (−2) + 0 × 1 + 0 × 3 + 0 × (−2) + (−1) × 1 + (−1) × 3 + (−1) × (−2)
                       = 2 + 6 − 4 + 0 + 0 + 0 − 1 − 3 + 2
                       = 2
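A minimal Python sketch (not part of the text) that reproduces the single and double summations for these three pairs of values:

X = [2, 0, -1]
Y = [1, 3, -2]

single = sum(x * y for x, y in zip(X, Y))   # sum of Xi*Yi over matched pairs: 2 + 0 + 2 = 4
double = sum(x * y for x in X for y in Y)   # sum of Xi*Yj over every combination of i and j: 2
print(single, double)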

Problem  Suppose there are six values for the variables X and Y such that X1 = 2, X2 = 1, X3 = 5, X4 = −3, X5 = 1 and X6 = −2, and Y1 = 4, Y2 = 0, Y3 = −1, Y4 = 2, Y5 = 7 and Y6 = −3. Calculate each of the following:

(a) Σ(i=1 to 6) Xi
(b) Σ(i=1 to 6) Yi
(c) Σ(i=1 to 6) Xi²
(d) Σ(i=1 to 6) Yi²
(e) Σ(i=1 to 6) XiYi
(f) Σ(i=1 to 6) (Xi + Yi)
(g) Σ(i=1 to 6) (Xi − Yi)
(h) Σ(i=1 to 6) (Xi − 3Yi + 2Xi²)
(i) Σ(i=1 to 6) cXi, where c = −1
(j) Σ(i=1 to 6) (Xi − 3Yi + c), where c = +3

Answer  (a) 4  (b) 9  (c) 44  (d) 79  (e) 10  (f) 13  (g) −5  (h) 65  (i) −4  (j) −5

References
1. Bashaw, W. L., Mathematics for Statistics (New York: Wiley, 1969).
2. Lanzer, P., Video Review of Arithmetic (Hicksville, NY: Video Aided Instruction, 1990).
3. Levine, D., The MBA Primer: Business Statistics (Cincinnati, OH: Southwestern Publishing, 2000).
4. Levine, D., Video Review of Statistics (Hicksville, NY: Video Aided Instruction, 1989).
5. Shane, H., Video Review of Elementary Algebra (Hicksville, NY: Video Aided Instruction, 1990).

C. Statistical Symbols and Greek Alphabet

C.1 • STATISTICAL SYMBOLS
+  add                           ×  multiply
−  subtract                      ÷  divide
=  equal to                      ≠  not equal to
≅  approximately equal to        <  less than
>  greater than                  ⩽  less than or equal to
⩾  greater than or equal to

C.2 • GREEK ALPHABET
Greek letter    Letter name    English equivalent
Α α             Alpha          a
Β β             Beta           b
Γ γ             Gamma          g
Δ δ             Delta          d
Ε ε             Epsilon        ĕ
Ζ ζ             Zeta           z
Η η             Eta            ē
Θ θ             Theta          th
Ι ι             Iota           i
Κ κ             Kappa          k
Λ λ             Lambda         l
Μ μ             Mu             m
Ν ν             Nu             n
Ξ ξ             Xi             x
Ο ο             Omicron        ŏ
Π π             Pi             p
Ρ ρ             Rho            r
Σ σ             Sigma          s
Τ τ             Tau            t
Υ υ             Upsilon        u
Φ φ             Phi            ph
Χ χ             Chi            ch
Ψ ψ             Psi            ps
Ω ω             Omega          ō

D. PHStat User's Guide

ABOUT THIS APPENDIX

You should read this appendix if you plan to use PHStat to perform statistical analyses in Microsoft Excel while learning from this text. This appendix presents selected PHStat commands in order of appearance in this text. Other PHStat commands are included in the Excel guides of a number of chapters. For each command presented, this appendix explains what the command does and shows how to select the command and fill in its dialog box. (If you need more detailed information about a particular command, you can click on the Help button of the dialog box.) Note that PHStat as discussed here uses formulas from Excel 2016.

D.1 • SAMPLING DISTRIBUTIONS SIMULATION
The Sampling Distributions Simulation command creates a sampling distributions simulation worksheet based on the number of samples, sample size and type of distribution that you specify. The command silently uses the Data Analysis Random Number Generation procedure to create the results of the simulation on a new worksheet. The command then adds the sample means, the overall mean and the standard error of the mean to this worksheet. If you check the Histogram box, the command uses the Data Analysis Histogram procedure to create a histogram of the simulation.
Use: Select PHStat ➔ Sampling ➔ Sampling Distributions Simulation.
Example: Figure D.1 contains the entries for generating a simulation of 100 samples of sample size 30 from a uniformly distributed population.
Figure D.1  Sampling Distributions Simulation dialog box

D.2 • CONFIDENCE INTERVAL ESTIMATE FOR THE MEAN, SIGMA KNOWN
The Estimate for the Mean, sigma known command creates a confidence interval estimate worksheet based on the values of the population standard deviation, sample mean, sample size and confidence level that you specify. If you have unsummarised data, you can select the Sample Statistics Unknown option and have the command calculate the sample statistics for you. The worksheet created uses the NORMSINV and CONFIDENCE functions to determine the Z value and to calculate the interval half width, respectively. The templates for these functions are:
• NORMSINV(P < Z), where P < Z is the area under the curve that is less than Z.
• CONFIDENCE(1 − confidence level, population standard deviation, sample size).
Use: Select PHStat ➔ Confidence Intervals ➔ Estimate for the Mean, sigma known.
Example: Figure D.2 contains the entries for solving Example 8.1 on page 283.
Figure D.2  Estimate for the Mean, sigma known dialog box

Note: PHStat reports the lower Z value (e.g. −1.95996) while the Excel spreadsheets shown in Chapter 8 give the upper Z value (e.g. 1.95996) used for confidence intervals. However, the actual confidence intervals or calculations of sample size produced by Excel and PHStat are the same.
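For readers who want to check the worksheet arithmetic outside Excel, the following Python sketch mirrors the same calculation; it is not part of PHStat, and the sample figures used are arbitrary illustrative values rather than those of Example 8.1.

from math import sqrt
from scipy.stats import norm

xbar, sigma, n, conf = 350.0, 100.0, 64, 0.95    # illustrative values only
z = norm.ppf(1 - (1 - conf) / 2)                 # upper Z value, like NORMSINV(0.975)
half_width = z * sigma / sqrt(n)                 # like CONFIDENCE(0.05, sigma, n)
print(xbar - half_width, xbar + half_width)      # lower and upper confidence limits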

D.3 • CONFIDENCE INTERVAL ESTIMATE FOR THE MEAN, SIGMA UNKNOWN
The Estimate for the Mean, sigma unknown command creates a confidence interval estimate worksheet based on sample statistics and a confidence level value that you specify. If you have unsummarised data, you can select the Sample Statistics Unknown option and have the command calculate the sample statistics for you. The worksheet created uses the TINV function to determine the critical value of the t distribution. The template for this function is TINV(1 − confidence level, degrees of freedom).
Use: Select PHStat ➔ Confidence Intervals ➔ Estimate for the Mean, sigma unknown.
Example: Figure D.3 contains the entries for estimating the mean amount of sales invoices (see page 287).
Figure D.3  Estimate for the Mean, sigma unknown dialog box

D.4 • CONFIDENCE INTERVAL ESTIMATE FOR THE PROPORTION
The Estimate for the Proportion command creates a confidence interval estimate worksheet based on the values of the sample size, number of successes and confidence level that you specify. The worksheet created uses the NORMSINV function, the template of which is NORMSINV(P < Z), where P < Z is the area under the curve that is less than Z, to determine the Z value.
Use: Select PHStat ➔ Confidence Intervals ➔ Estimate for the Proportion.
Example: Figure D.4 contains the entries for estimating the proportion of sales invoices with errors in Section 8.3.
Figure D.4  Estimate for the Proportion dialog box

D.5 • SAMPLE SIZE DETERMINATION FOR THE MEAN
The Determination for the Mean command creates a worksheet based on the values of the population standard deviation, sampling error and confidence level that you specify. The worksheet created uses the NORMSINV function, the template of which is NORMSINV(P < Z), where P < Z is the area under the curve that is less than Z, to determine the Z value. The worksheet also uses the ROUNDUP function to round the result of the sample size calculation up to the next integer.
Use: Select PHStat ➔ Sample Size ➔ Determination for the Mean.
Example: Figure D.5 contains the entries for determining the sample size needed to estimate the mean sales invoice amount in Section 8.4.
Figure D.5  Sample Size Determination for the Mean dialog box
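As a rough cross-check of the TINV- and NORMSINV-based worksheets described in D.3 and D.5, a Python sketch of the same two calculations follows; the numbers are invented for illustration and are not the sales-invoice figures used in Chapter 8.

from math import sqrt, ceil
from scipy.stats import norm, t

# Confidence interval for the mean, sigma unknown (mirrors the TINV-based worksheet)
xbar, s, n, conf = 110.0, 28.0, 100, 0.95        # invented sample statistics
t_crit = t.ppf(1 - (1 - conf) / 2, n - 1)        # critical value of the t distribution
half = t_crit * s / sqrt(n)
print(xbar - half, xbar + half)

# Sample size determination for the mean (mirrors NORMSINV plus ROUNDUP)
sigma, e = 25.0, 5.0                             # assumed standard deviation and sampling error
z = norm.ppf(1 - (1 - conf) / 2)
print(ceil((z * sigma / e) ** 2))                # rounded up to the next integer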


D.6 • SAMPLE SIZE DETERMINATION FOR THE PROPORTION
The Determination for the Proportion command creates a worksheet based on the values of the sample size, number of successes and confidence level that you specify. The worksheet created uses the NORMSINV function, the template of which is NORMSINV(P < Z), where P < Z is the area under the curve that is less than Z, to determine the Z value. The worksheet also uses the ROUNDUP function to round the result of the sample size calculation up to the next integer.
Use: Select PHStat ➔ Sample Size ➔ Determination for the Proportion.
Example: Figure D.6 contains the entries for determining the sample size needed to estimate the proportion of sales invoices with errors in Section 8.4.
Figure D.6  Sample Size Determination for the Proportion dialog box

D.7 • CONFIDENCE INTERVAL ESTIMATE FOR THE POPULATION TOTAL
The Estimate for the Population Total command creates a confidence interval estimate worksheet based on the values of the population size, sample mean, sample size, sample standard deviation and confidence level that you specify. If you have unsummarised data, you can select the Sample Statistics Unknown option and have the command calculate the sample statistics for you. The worksheet created uses the TINV function, the template of which is TINV(1 − confidence level, degrees of freedom), to determine the critical value from the t distribution.
Use: Select PHStat ➔ Confidence Intervals ➔ Estimate for the Population Total.
Example: Figure D.7 contains the entries for estimating the total amount of sales invoices in Section 8.5.
Figure D.7  Estimate for the Population Total dialog box

D.8 • CONFIDENCE INTERVAL ESTIMATE FOR THE TOTAL DIFFERENCE
The Estimate for the Total Difference command creates a confidence interval estimate worksheet based on the values of the population size, sample size and confidence level that you specify. To use this command, first create a column of differences starting in row 1 on a new worksheet. (You can compare column A of the Data worksheet of the CIE Total Difference.xlsx file as a guide.) With your workbook open to these data, use the command.
Use: Select PHStat ➔ Confidence Intervals ➔ Estimate for the Total Difference.
Example: Figure D.8 contains the entries for estimating the difference between the actual amount on the sales invoices and that entered into the accounting system in Section 8.5. The cell range entry assumes that the differences data are located on a worksheet named Data.
Figure D.8  Estimate for the Total Difference dialog box
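The sample-size logic described in D.5 and D.6 can be sketched the same way in Python; the proportion version below uses hypothetical planning values, not the Section 8.4 figures.

from math import ceil
from scipy.stats import norm

p, e, conf = 0.15, 0.07, 0.95          # hypothetical planning proportion, sampling error and confidence level
z = norm.ppf(1 - (1 - conf) / 2)       # like NORMSINV(0.975)
n = ceil(z**2 * p * (1 - p) / e**2)    # round up, as the ROUNDUP function does
print(n)                               # 100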

D.9 • Z TEST FOR THE MEAN, SIGMA KNOWN
The Z Test for the Mean, sigma known command creates a hypothesis-testing worksheet based on the values of the null hypothesis, level of significance, population standard deviation, sample size and sample mean, and the test option that you specify. If you have unsummarised data, you can select the Sample Statistics Unknown option and have the command calculate the sample statistics for you. The worksheet created uses the NORMSINV and NORMSDIST functions to determine the critical values and p-values, respectively. The templates for these functions are:
• NORMSINV(P < Z), where P < Z is the area under the curve that is less than Z.
• NORMSDIST(Z value).
Use: Select PHStat ➔ One-Sample Tests ➔ Z Test for the Mean, sigma known.
Example: Figure D.9 contains the entries to test the pasta-packaging hypothesis in Section 9.2.
Figure D.9  Z Test for the Mean, sigma known dialog box

D.10 • t TEST FOR THE MEAN, SIGMA UNKNOWN
The t Test for the Mean, sigma unknown command creates a hypothesis-testing worksheet based on the values of the null hypothesis, level of significance, sample standard deviation, sample size and sample mean, and the test option that you specify. If you have unsummarised data, you can select the Sample Statistics Unknown option and have the command calculate the sample statistics for you. The worksheet created uses the TINV and TDIST functions to determine the critical values and p-values, respectively. The templates for these functions are:
• TINV(1 − confidence level, degrees of freedom)
• TDIST(ABS(t), degrees of freedom, tails), where tails = 2 indicates a two-tail test.
Use: Select PHStat ➔ One-Sample Tests ➔ t Test for the Mean, sigma unknown.
Example: Figure D.10 contains the entries for testing whether the mean amount per sales invoice in Section 9.4 is increasing or decreasing.
Figure D.10  t Test for the Mean, sigma unknown dialog box
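The critical-value and p-value arithmetic behind these two tests can be reproduced directly. A minimal Python sketch with invented numbers (not the pasta-packaging or sales-invoice data):

from math import sqrt
from scipy.stats import norm, t

# Z test for the mean, sigma known (two-tail), mirroring NORMSINV and NORMSDIST
mu0, sigma, n, xbar, alpha = 368.0, 15.0, 25, 372.5, 0.05    # invented values
z_stat = (xbar - mu0) / (sigma / sqrt(n))
z_crit = norm.ppf(1 - alpha / 2)
p_value = 2 * (1 - norm.cdf(abs(z_stat)))
print(z_stat, z_crit, p_value)

# t test for the mean, sigma unknown (two-tail), mirroring TINV and TDIST
s = 12.0                                                     # invented sample standard deviation
t_stat = (xbar - mu0) / (s / sqrt(n))
p_value_t = 2 * (1 - t.cdf(abs(t_stat), n - 1))
print(t_stat, p_value_t)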

D.11 • Z TEST FOR THE PROPORTION
The Z Test for the Proportion command creates a hypothesis-testing worksheet based on the values of the null hypothesis, level of significance, number of successes and sample size, and the test option that you specify. The worksheet created uses the NORMSINV and NORMSDIST functions to determine the critical values and p-values, respectively. The templates for these functions are:
• NORMSINV(P < Z), where P < Z is the area under the curve that is less than Z.
• NORMSDIST(Z value).
Use: Select PHStat ➔ One-Sample Tests ➔ Z Test for the Proportion.
Example: Figure D.11 contains the entries for testing the off-peak storage hypothesis in Section 9.5. Note that the closest integer is used for the number of households.
Figure D.11  Z Test for the Proportion dialog box

D.12 • CHI-SQUARE TEST FOR DIFFERENCES IN TWO PROPORTIONS
The Chi-Square Test for Differences in Two Proportions command creates a hypothesis-testing worksheet into which you enter observed frequency table data. The worksheet created uses the CHIINV and CHIDIST functions to determine the critical value and p-value, respectively. The templates for these functions are:
• CHIINV(level of significance, degrees of freedom).
• CHIDIST(critical value of χ2, degrees of freedom).
Use: Select PHStat ➔ Two-Sample Tests (Summarized Data) ➔ Chi-Square Test for Differences in Two Proportions.
Example: Figure D.12 contains the structure for comparing the online check-in patterns of passengers in two cities in Section 15.1.
Figure D.12  Chi-Square Test for Differences in Two Proportions dialog box
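A Python sketch of the equivalent CHIINV/CHIDIST-style calculation, using a made-up 2 × 2 table of observed counts rather than the airline check-in data:

from scipy.stats import chi2, chi2_contingency

observed = [[45, 55], [30, 70]]                      # made-up observed frequency table
stat, p_value, df, expected = chi2_contingency(observed, correction=False)
critical = chi2.ppf(1 - 0.05, df)                    # like CHIINV(0.05, df)
print(stat, critical, p_value)                       # reject H0 if stat exceeds the critical value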

D.13 • CHI-SQUARE TEST
The Chi-Square Test command creates a hypothesis-testing worksheet into which you enter the observed frequency table data based on the number of rows and number of columns that you specify. You can use the worksheet produced for either the χ2 test for the differences in more than two proportions or the χ2 test of independence. If the number of rows entered is 2, you can also perform the Marascuilo procedure by selecting the Marascuilo Procedure output option (this option is enabled only if you enter 2 as the number of rows, as in Figure D.13A below). The worksheet created uses the CHIINV and CHIDIST functions to determine the critical values and p-value, respectively. The templates for these functions are:
• CHIINV(level of significance, degrees of freedom).
• CHIDIST(critical value of χ2, degrees of freedom).
Use: Select PHStat ➔ Multiple-Sample Tests ➔ Chi-Square Test.
Examples: Figure D.13A contains the entries for the proportion of passengers checking in online from three cities in Section 15.2 (with the optional Marascuilo Procedure selected) and Figure D.13B contains the entries for testing two factors of interest in online check-in rates in Section 15.3.
Figure D.13A  Chi-Square Test (2 × 3) dialog box
Figure D.13B  Chi-Square Test (4 × 3) dialog box

D.14 • OPPORTUNITY LOSS
The Opportunity Loss command creates an opportunity loss analysis worksheet based on the number of events and alternative actions that you specify. To complete the worksheet, enter the payoff data in the tinted table cells that begin in row 4.
Use: Select PHStat ➔ Decision-Making ➔ Opportunity Loss.
Example: Figure D.14 contains the entries for the opportunity loss analysis for marketing an HDTV model, shown in Table 17.1 on page 682.
Figure D.14  Opportunity Loss dialog box

D.15 • EXPECTED OPPORTUNITY LOSS
If you already know the probabilities and opportunity losses for events with different alternatives, you can use this command to calculate the expected opportunity losses directly.
Use: Select PHStat ➔ Decision-Making ➔ Expected Opportunity Loss.


D.16 • EXPECTED MONETARY VALUE
The Expected Monetary Value command creates an expected monetary value worksheet based on the number of events and alternative actions that you specify. Select the Expected Opportunity Loss and Measures of Variation output options to include these additional statistics. To complete the worksheet, enter the probabilities and payoff data in the tinted table cells that begin in row 4.
Use: Select PHStat ➔ Decision-Making ➔ Expected Monetary Value.
Example: Figure D.15 contains the entries for the HDTV marketing data (see page 686).

Figure D.15  Expected Monetary Value dialog box
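The quantity this worksheet computes is a probability-weighted sum of payoffs. A small Python sketch with invented probabilities and payoffs (not the HDTV figures from Chapter 17):

# EMV(action) = sum over events of P(event) x payoff(event, action)
probabilities = [0.3, 0.7]                                   # hypothetical event probabilities
payoffs = {                                                  # hypothetical payoff table
    "market the new model": [200_000, -50_000],
    "keep the current model": [80_000, 30_000],
}
emv = {a: sum(p * x for p, x in zip(probabilities, v)) for a, v in payoffs.items()}
print(emv, max(emv, key=emv.get))                            # EMVs and the action with the largest EMV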


E. Tables

E.1  Table of random numbers
E.2  The cumulative standardised normal distribution
E.3  Critical values of t
E.4  Critical values of χ2
E.5  Critical values of F
E.6  Table of binomial probabilities
E.7  Table of Poisson probabilities
E.8  Lower and upper critical values T1 of Wilcoxon rank sum test
E.9  Lower and upper critical values W of Wilcoxon signed ranks test
E.10 Critical values of the Studentised range Q
E.11 Critical values dL and dU of the Durbin–Watson statistic D
E.12 Selected critical values of F for Cook's Di statistic
E.13 Control chart factors
E.14 The standardised normal distribution

Table E.1  Table of random numbers
Each of rows 01 to 50 below contains 40 random digits, printed in groups of five; the column headings (00000 00001 … 33334 over 12345 67890 … 67890) index digit positions 1 to 40.

49280 88924 35779 00283 81163 07275 89863 02348 61870 41657 07468 08612 98083 97349 20775 45091 43898 65923 25078 86129 78496 97653 91550 08078 62993 93912 30454 84598 56095 20664 12872 64647 33850 58555 51438 85507 71865 79488 76783 31708 97340 03364 88472 04334 63919 36394 11095 92470 70543 29776 10087 10072 55980 64688 68239 20461 89382 93809 00796 95945 34101 81277 66090 88872 37818 72142 67140 50785 22380 16703 53362 44940 60430 22834 14130 96593 23298 56203 92671 15925 82975 66158 84731 19436 55790 69229 28661 13675 30987 71938 40355 54324 08401 26299 49420 59208 55700 24586 93247 32596 11865 63397 44251 43189 14756 23997 78643 75912 83832 32768 18928 57070 32166 53251 70654 92827 63491 04233 33825 69662 23236 73751 31888 81718 06546 83246 47651 04877 45794 26926 15130 82455 78305 55058 52551 47182 09893 20505 14225 68514 47427 56788 96297 78822 54382 74598 91499 14523 68479 27686 46162 83554 94750 89923 37089 20048 80336 94598 26940 36858 70297 34135 53140 33340 42050 82341 44104 82949 85157 47954 32979 26575 57600 40881 12250 73742 11100 02340 12860 74697 96644 89439 28707 25815 36871 50775 30592 57143 17381 68856 25853 35041 23913 48357 63308 16090 51690 54607 72407 55538 79348 36085 27973 65157 07456 22255 25626 57054 92074 54641 53673 54421 18130 60103 69593 49464 06873 21440 75593 41373 49502 17972 82578 16364 12478 37622 99659 31065 83613 69889 58869 29571 57175 55564 65411 42547 70457 03426 72937 83792 91616 11075 80103 07831 59309 13276 26710 73000 78025 73539 14621 39044 47450 03197 12787 47709 27587 67228 80145 10175 12822 86687 65530 49325 16690 20427 04251 64477 73709 73945 92396 68263 70183 58065 65489 31833 82093 16747 10386 59293 90730 35385 15679 99742 50866 78028 75573 67257 10934 93242 13431 24590 02770 48582 00906 58595 82462 30166 79613 47416 13389 80268 05085 96666 27463 10433 07606 16285 93699 60912 94532 95632 02979 52997 09079 92709 90110 47506 53693 49892 46888 69929 75233 52507 32097 37594 10067 67327 53638 83161 08289 12639 08141 12640 28437 09268 82433 61427 17239 89160 19666 08814 37841 12847 35766 31672 50082 22795 66948 65581 84393 15890 10853 42581 08792 13257 61973 24450 52351 16602 20341 27398 72906 63955 17276 10646 74692 48438 54458 90542 77563 51839 52901 53355 83281 19177 26337 66530 16687 35179 46560 00123 44546 79896 34314 23729 85264 05575 96855 23820 11091 79821 28603 10708 68933 34189 92166 15181 66628 58599 continues

(Rows 51 to 100, labelled 51 to 00, follow; digit positions 1 to 40 in groups of five as before.)

66194 28926 99547 16625 45515 67953 12108 57846 78240 43195 24837 32511 70880 22070 52622 61881 00833 88000 67299 68215 11274 55624 32991 17436 12111 86683 61270 58036 64192 90611 15145 01748 47189 99951 05755 03834 43782 90599 40282 51417 76396 72486 62423 27618 84184 78922 73561 52818 46409 17469 32483 09083 76175 19985 26309 91536 74626 22111 87286 46772 42243 68046 44250 42439 34450 81974 93723 49023 58432 67083 36876 93391 36327 72135 33005 28701 34710 49359 50693 89311 74185 77536 84825 09934 99103 09325 67389 45869 12296 41623 62873 37943 25584 09609 63360 47270 90822 60280 88925 99610 42772 60561 76873 04117 72121 79152 96591 90305 10189 79778 68016 13747 95268 41377 25684 08151 61816 58555 54305 86189 92603 09091 75884 93424 72586 88903 30061 14457 18813 90291 05275 01223 79607 95426 34900 09778 38840 26903 28624 67157 51986 42865 14508 49315 05959 33836 53758 16562 41081 38012 41230 20528 85141 21155 99212 32685 51403 31926 69813 58781 75047 59643 31074 38172 03718 32119 69506 67143 30752 95260 68032 62871 58781 34143 68790 69766 22986 82575 42187 62295 84295 30634 66562 31442 99439 86692 90348 66036 48399 73451 26698 39437 20389 93029 11881 71685 65452 89047 63669 02656 39249 05173 68256 36359 20250 68686 05947 09335 96777 33605 29481 20063 09398 01843 35139 61344 04860 32918 10798 50492 52655 33359 94713 28393 41613 42375 00403 03656 77580 87772 86877 57085 17930 00794 53836 53692 67135 98102 61912 11246 24649 31845 25736 75231 83808 98917 93829 99430 79899 34061 54308 59358 56462 58166 97302 86828 76801 49594 81002 30397 52728 15101 72070 33706 36239 63636 38140 65731 39788 06872 38971 53363 07392 64449 17886 63632 53995 17574 22247 62607 67133 04181 33874 98835 67453 59734 76381 63455 77759 31504 32832 70861 15152 29733 75371 39174 85992 72268 42920 20810 29361 51423 90306 73574 79553 75952 54116 65553 47139 60579 09165 85490 41101 17336 48951 53674 17880 45260 08575 49321 36191 17095 32123 91576 84221 78902 82010 30847 62329 63898 23268 74283 26091 68409 69704 82267 14751 13151 93115 01437 56945 89661 67680 79790 48462 59278 44185 29616 76537 19589 83139 28454 29435 88105 59651 44391 74588 55114 80834 85686 28340 29285 12965 14821 80425 16602 44653 70467 02167 58940 27149 80242 10587 79786 34959 75339 17864 00991 39557 54981 23588 81914 37609 13128 79675 80605 60059 35862 00254 36546 21545 78179 72335 82037 92003 34100 29879 46613 89720 13274

Table E.1

Table of random numbers (continued)

Source: Partially extracted from the Rand Corporation, A Million Random Digits with 100,000 Normal Deviates (Glencoe, IL, The Free Press, 1955).


Table E.2  The cumulative standardised normal distribution
Entry represents the area under the cumulative standardised normal distribution from −∞ to Z. (The accompanying figure shades the area to the left of Z.)

  Z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 −6.0 0.000000001 −5.5 0.000000019 −5.0 0.000000287 −4.5 0.000003398 −4.0 0.000031671 −3.9 0.00005 0.00005 0.00004 0.00004 0.00004 0.00004 0.00004 0.00004 0.00003 0.00003 −3.8 0.00007 0.00007 0.00007 0.00006 0.00006 0.00006 0.00006 0.00005 0.00005 0.00005 −3.7 0.00011 0.00010 0.00010 0.00010 0.00009 0.00009 0.00008 0.00008 0.00008 0.00008 −3.6 0.00016 0.00015 0.00015 0.00014 0.00014 0.00013 0.00013 0.00012 0.00012 0.00011 −3.5 0.00023 0.00022 0.00022 0.00021 0.00020 0.00019 0.00019 0.00018 0.00017 0.00017 −3.4 0.00034 0.00032 0.00031 0.00030 0.00029 0.00028 0.00027 0.00026 0.00025 0.00024 −3.3 0.00048 0.00047 0.00045 0.00043 0.00042 0.00040 0.00039 0.00038 0.00036 0.00035 −3.2 0.00069 0.00066 0.00064 0.00062 0.00060 0.00058 0.00056 0.00054 0.00052 0.00050 −3.1 0.00097 0.00094 0.00090 0.00087 0.00084 0.00082 0.00079 0.00076 0.00074 0.00071 −3.0 0.00135 0.00131 0.00126 0.00122 0.00118 0.00114 0.00111 0.00107 0.00103 0.00100 −2.9 0.0019 0.0018 0.0018 0.0017 0.0016 0.0016 0.0015 0.0015 0.0014 0.0014 −2.8 0.0026 0.0025 0.0024 0.0023 0.0023 0.0022 0.0021 0.0021 0.0020 0.0019 −2.7 0.0035 0.0034 0.0033 0.0032 0.0031 0.0030 0.0029 0.0028 0.0027 0.0026 −2.6 0.0047 0.0045 0.0044 0.0043 0.0041 0.0040 0.0039 0.0038 0.0037 0.0036 −2.5 0.0062 0.0060 0.0059 0.0057 0.0055 0.0054 0.0052 0.0051 0.0049 0.0048 −2.4 0.0082 0.0080 0.0078 0.0075 0.0073 0.0071 0.0069 0.0068 0.0066 0.0064 −2.3 0.0107 0.0104 0.0102 0.0099 0.0096 0.0094 0.0091 0.0089 0.0087 0.0084 −2.2 0.0139 0.0136 0.0132 0.0129 0.0125 0.0122 0.0119 0.0116 0.0113 0.0110 −2.1 0.0179 0.0174 0.0170 0.0166 0.0162 0.0158 0.0154 0.0150 0.0146 0.0143 −2.0 0.0228 0.0222 0.0217 0.0212 0.0207 0.0202 0.0197 0.0192 0.0188 0.0183 −1.9 0.0287 0.0281 0.0274 0.0268 0.0262 0.0256 0.0250 0.0244 0.0239 0.0233 −1.8 0.0359 0.0351 0.0344 0.0336 0.0329 0.0322 0.0314 0.0307 0.0301 0.0294 −1.7 0.0446 0.0436 0.0427 0.0418 0.0409 0.0401 0.0392 0.0384 0.0375 0.0367 −1.6 0.0548 0.0537 0.0526 0.0516 0.0505 0.0495 0.0485 0.0475 0.0465 0.0455 −1.5 0.0668 0.0655 0.0643 0.0630 0.0618 0.0606 0.0594 0.0582 0.0571 0.0559 −1.4 0.0808 0.0793 0.0778 0.0764 0.0749 0.0735 0.0721 0.0708 0.0694 0.0681 −1.3 0.0968 0.0951 0.0934 0.0918 0.0901 0.0885 0.0869 0.0853 0.0838 0.0823 −1.2 0.1151 0.1131 0.1112 0.1093 0.1075 0.1056 0.1038 0.1020 0.1003 0.0985 −1.1 0.1357 0.1335 0.1314 0.1292 0.1271 0.1251 0.1230 0.1210 0.1190 0.1170 −1.0 0.1587 0.1562 0.1539 0.1515 0.1492 0.1469 0.1446 0.1423 0.1401 0.1379 −0.9 0.1841 0.1814 0.1788 0.1762 0.1736 0.1711 0.1685 0.1660 0.1635 0.1611 −0.8 0.2119 0.2090 0.2061 0.2033 0.2005 0.1977 0.1949 0.1922 0.1894 0.1867 −0.7 0.2420 0.2388 0.2358 0.2327 0.2296 0.2266 0.2236 0.2206 0.2177 0.2148 −0.6 0.2743 0.2709 0.2676 0.2643 0.2611 0.2578 0.2546 0.2514 0.2482 0.2451 −0.5 0.3085 0.3050 0.3015 0.2981 0.2946 0.2912 0.2877 0.2843 0.2810 0.2776 −0.4 0.3446 0.3409 0.3372 0.3336 0.3300 0.3264 0.3228 0.3192 0.3156 0.3121 −0.3 0.3821 0.3783 0.3745 0.3707 0.3669 0.3632 0.3594 0.3557 0.3520 0.3483 −0.2 0.4207 0.4168 0.4129 0.4090 0.4052 0.4013 0.3974 0.3936 0.3897 0.3859 −0.1 0.4602 0.4562 0.4522 0.4483 0.4443 0.4404 0.4364 0.4325 0.4286 0.4247 −0.0 0.5000 0.4960 0.4920 0.4880 0.4840 0.4801 0.4761 0.4721 0.4681 0.4641 continues

Table E.2  The cumulative standardised normal distribution (continued)
Entry represents the area under the cumulative standardised normal distribution from −∞ to Z.

Z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.0 0.5000 0.5040 0.5080 0.5120 0.5160 0.5199 0.5239 0.5279 0.5319 0.5359 0.1 0.5398 0.5438 0.5478 0.5517 0.5557 0.5596 0.5636 0.5675 0.5714 0.5753 0.2 0.5793 0.5832 0.5871 0.5910 0.5948 0.5987 0.6026 0.6064 0.6103 0.6141 0.3 0.6179 0.6217 0.6255 0.6293 0.6331 0.6368 0.6406 0.6443 0.6480 0.6517 0.4 0.6554 0.6591 0.6628 0.6664 0.6700 0.6736 0.6772 0.6808 0.6844 0.6879 0.5 0.6915 0.6950 0.6985 0.7019 0.7054 0.7088 0.7123 0.7157 0.7190 0.7224 0.6 0.7257 0.7291 0.7324 0.7357 0.7389 0.7422 0.7454 0.7486 0.7518 0.7549 0.7 0.7580 0.7612 0.7642 0.7673 0.7704 0.7734 0.7764 0.7794 0.7823 0.7852 0.8 0.7881 0.7910 0.7939 0.7967 0.7995 0.8023 0.8051 0.8078 0.8106 0.8133 0.9 0.8159 0.8186 0.8212 0.8238 0.8264 0.8289 0.8315 0.8340 0.8365 0.8389 1.0 0.8413 0.8438 0.8461 0.8485 0.8508 0.8531 0.8554 0.8577 0.8599 0.8621 1.1 0.8643 0.8665 0.8686 0.8708 0.8729 0.8749 0.8770 0.8790 0.8810 0.8830 1.2 0.8849 0.8869 0.8888 0.8907 0.8925 0.8944 0.8962 0.8980 0.8997 0.9015 1.3 0.9032 0.9049 0.9066 0.9082 0.9099 0.9115 0.9131 0.9147 0.9162 0.9177 1.4 0.9192 0.9207 0.9222 0.9236 0.9251 0.9265 0.9279 0.9292 0.9306 0.9319 1.5 0.9332 0.9345 0.9357 0.9370 0.9382 0.9394 0.9406 0.9418 0.9429 0.9441 1.6 0.9452 0.9463 0.9474 0.9484 0.9495 0.9505 0.9515 0.9525 0.9535 0.9545 1.7 0.9554 0.9564 0.9573 0.9582 0.9591 0.9599 0.9608 0.9616 0.9625 0.9633 1.8 0.9641 0.9649 0.9656 0.9664 0.9671 0.9678 0.9686 0.9693 0.9699 0.9706 1.9 0.9713 0.9719 0.9726 0.9732 0.9738 0.9744 0.9750 0.9756 0.9761 0.9767 2.0 0.9772 0.9778 0.9783 0.9788 0.9793 0.9798 0.9803 0.9808 0.9812 0.9817 2.1 0.9821 0.9826 0.9830 0.9834 0.9838 0.9842 0.9846 0.9850 0.9854 0.9857 2.2 0.9861 0.9864 0.9868 0.9871 0.9875 0.9878 0.9881 0.9884 0.9887 0.9890 2.3 0.9893 0.9896 0.9898 0.9901 0.9904 0.9906 0.9909 0.9911 0.9913 0.9916 2.4 0.9918 0.9920 0.9922 0.9925 0.9927 0.9929 0.9931 0.9932 0.9934 0.9936 2.5 0.9938 0.9940 0.9941 0.9943 0.9945 0.9946 0.9948 0.9949 0.9951 0.9952 2.6 0.9953 0.9955 0.9956 0.9957 0.9959 0.9960 0.9961 0.9962 0.9963 0.9964 2.7 0.9965 0.9966 0.9967 0.9968 0.9969 0.9970 0.9971 0.9972 0.9973 0.9974 2.8 0.9974 0.9975 0.9976 0.9977 0.9977 0.9978 0.9979 0.9979 0.9980 0.9981 2.9 0.9981 0.9982 0.9982 0.9983 0.9984 0.9984 0.9985 0.9985 0.9986 0.9986 3.0 0.99865 0.99869 0.99874 0.99878 0.99882 0.99886 0.99889 0.99893 0.99897 0.99900 3.1 0.99903 0.99906 0.99910 0.99913 0.99916 0.99918 0.99921 0.99924 0.99926 0.99929 3.2 0.99931 0.99934 0.99936 0.99938 0.99940 0.99942 0.99944 0.99946 0.99948 0.99950 3.3 0.99952 0.99953 0.99955 0.99957 0.99958 0.99960 0.99961 0.99962 0.99964 0.99965 3.4 0.99966 0.99968 0.99969 0.99970 0.99971 0.99972 0.99973 0.99974 0.99975 0.99976 3.5 0.99977 0.99978 0.99978 0.99979 0.99980 0.99981 0.99981 0.99982 0.99983 0.99983 3.6 0.99984 0.99985 0.99985 0.99986 0.99986 0.99987 0.99987 0.99988 0.99988 0.99989 3.7 0.99989 0.99990 0.99990 0.99990 0.99991 0.99991 0.99992 0.99992 0.99992 0.99992 3.8 0.99993 0.99993 0.99993 0.99994 0.99994 0.99994 0.99994 0.99995 0.99995 0.99995 3.9 0.99995 0.99995 0.99996 0.99996 0.99996 0.99996 0.99996 0.99996 0.99997 0.99997 4.0 0.999968329 4.5 0.999996602 5.0 0.999999713 5.5 0.999999981 6.0 0.999999999

Table E.3  Critical values of t
For a particular number of degrees of freedom, entry represents the critical value of t corresponding to a specified upper-tail area (α). (The accompanying figure shades the upper-tail area α to the right of t(α, df).)

Degrees of freedom    Upper-tail areas: 0.25   0.10   0.05   0.025   0.01   0.005

  1   2   3   4   5

1.0000 3.0777 6.3138 12.7062 31.8207 63.6574 0.8165 1.8856 2.9200 4.3027 6.9646 9.9248 0.7649 1.6377 2.3534 3.1824 4.5407 5.8409 0.7407 1.5332 2.1318 2.7764 3.7469 4.6041 0.7267 1.4759 2.0150 2.5706 3.3649 4.0322

  6   7   8   9 10

0.7176 1.4398 1.9432 0.7111 1.4149 1.8946 0.7064 1.3968 1.8595 0.7027 1.3830 1.8331 0.6998 1.3722 1.8125

2.4469 2.3646 2.3060 2.2622 2.2281

3.1427 2.9980 2.8965 2.8214 2.7638

3.7074 3.4995 3.3554 3.2498 3.1693

11 12 13 14 15

0.6974 1.3634 1.7959 0.6955 1.3562 1.7823 0.6938 1.3502 1.7709 0.6924 1.3450 1.7613 0.6912 1.3406 1.7531

2.2010 2.1788 2.1604 2.1448 2.1315

2.7181 2.6810 2.6503 2.6245 2.6025

3.1058 3.0545 3.0123 2.9768 2.9467

16 17 18 19 20

0.6901 1.3368 1.7459 0.6892 1.3334 1.7396 0.6884 1.3304 1.7341 0.6876 1.3277 1.7291 0.6870 1.3253 1.7247

2.1199 2.1098 2.1009 2.0930 2.0860

2.5835 2.5669 2.5524 2.5395 2.5280

2.9208 2.8982 2.8784 2.8609 2.8453

21 22 23 24 25

0.6864 1.3232 1.7207 0.6858 1.3212 1.7171 0.6853 1.3195 1.7139 0.6848 1.3178 1.7109 0.6844 1.3163 1.7081

2.0796 2.0739 2.0687 2.0639 2.0595

2.5177 2.5083 2.4999 2.4922 2.4851

2.8314 2.8188 2.8073 2.7969 2.7874

26 27 28 29 30

0.6840 1.3150 1.7056 0.6837 1.3137 1.7033 0.6834 1.3125 1.7011 0.6830 1.3114 1.6991 0.6828 1.3104 1.6973

2.0555 2.0518 2.0484 2.0452 2.0423

2.4786 2.4727 2.4671 2.4620 2.4573

2.7787 2.7707 2.7633 2.7564 2.7500

31 32 33 34 35

0.6825 1.3095 1.6955 0.6822 1.3086 1.6939 0.6820 1.3077 1.6924 0.6818 1.3070 1.6909 0.6816 1.3062 1.6896

2.0395 2.0369 2.0345 2.0322 2.0301

2.4528 2.4487 2.4448 2.4411 2.4377

2.7440 2.7385 2.7333 2.7284 2.7238

36 37 38 39 40

0.6814 1.3055 1.6883 0.6812 1.3049 1.6871 0.6810 1.3042 1.6860 0.6808 1.3036 1.6849 0.6807 1.3031 1.6839

2.0281 2.0262 2.0244 2.0227 2.0211

2.4345 2.4314 2.4286 2.4258 2.4233

2.7195 2.7154 2.7116 2.7079 2.7045

41 42 43 44 45

0.6805 1.3025 1.6829 0.6804 1.3020 1.6820 0.6802 1.3016 1.6811 0.6801 1.3011 1.6802 0.6800 1.3006 1.6794

2.0195 2.0181 2.0167 2.0154 2.0141

2.4208 2.4185 2.4163 2.4141 2.4121

2.7012 2.6981 2.6951 2.6923 2.6896

46 47 48

0.6799 1.3002 1.6787 2.0129 2.4102 2.6870 0.6797 1.2998 1.6779 2.0117 2.4083 2.6846 0.6796 1.2994 1.6772 2.0106 2.4066 2.6822 continues

Upper-tail areas
Degrees of freedom    0.25   0.10   0.05   0.025   0.01   0.005
  49   50

0.6795 1.2991 1.6766 2.0096 2.4049 2.6800 0.6794 1.2987 1.6759 2.0086 2.4033 2.6778

  51   52   53   54   55

0.6793 1.2984 1.6753 0.6792 1.2980 1.6747 0.6791 1.2977 1.6741 0.6791 1.2974 1.6736 0.6790 1.2971 1.6730

2.0076 2.0066 2.0057 2.0049 2.0040

2.4017 2.4002 2.3988 2.3974 2.3961

2.6757 2.6737 2.6718 2.6700 2.6682

  56   57   58   59   60

0.6789 1.2969 1.6725 0.6788 1.2966 1.6720 0.6787 1.2963 1.6716 0.6787 1.2961 1.6711 0.6786 1.2958 1.6706

2.0032 2.0025 2.0017 2.0010 2.0003

2.3948 2.3936 2.3924 2.3912 2.3901

2.6665 2.6649 2.6633 2.6618 2.6603

  61   62   63   64   65

0.6785 1.2956 1.6702 0.6785 1.2954 1.6698 0.6784 1.2951 1.6694 0.6783 1.2949 1.6690 0.6783 1.2947 1.6686

1.9996 1.9990 1.9983 1.9977 1.9971

2.3890 2.3880 2.3870 2.3860 2.3851

2.6589 2.6575 2.6561 2.6549 2.6536

  66   67   68   69   70

0.6782 1.2945 1.6683 0.6782 1.2943 1.6679 0.6781 1.2941 1.6676 0.6781 1.2939 1.6672 0.6780 1.2938 1.6669

1.9966 1.9960 1.9955 1.9949 1.9944

2.3842 2.3833 2.3824 2.3816 2.3808

2.6524 2.6512 2.6501 2.6490 2.6479

  71   72   73   74   75

0.6780 1.2936 1.6666 0.6779 1.2934 1.6663 0.6779 1.2933 1.6660 0.6778 1.2931 1.6657 0.6778 1.2929 1.6654

1.9939 1.9935 1.9930 1.9925 1.9921

2.3800 2.3793 2.3785 2.3778 2.3771

2.6469 2.6459 2.6449 2.6439 2.6430

  76   77   78   79   80

0.6777 1.2928 1.6652 0.6777 1.2926 1.6649 0.6776 1.2925 1.6646 0.6776 1.2924 1.6644 0.6776 1.2922 1.6641

1.9917 1.9913 1.9908 1.9905 1.9901

2.3764 2.3758 2.3751 2.3745 2.3739

2.6421 2.6412 2.6403 2.6395 2.6387

  81   82   83   84   85

0.6775 1.2921 1.6639 0.6775 1.2920 1.6636 0.6775 1.2918 1.6634 0.6774 1.2917 1.6632 0.6774 1.2916 1.6630

1.9897 1.9893 1.9890 1.9886 1.9883

2.3733 2.3727 2.3721 2.3716 2.3710

2.6379 2.6371 2.6364 2.6356 2.6349

  86   87   88   89   90

0.6774 1.2915 1.6628 0.6773 1.2914 1.6626 0.6773 1.2912 1.6624 0.6773 1.2911 1.6622 0.6772 1.2910 1.6620

1.9879 1.9876 1.9873 1.9870 1.9867

2.3705 2.3700 2.3695 2.3690 2.3685

2.6342 2.6335 2.6329 2.6322 2.6316

  91   92   93   94   95

0.6772 1.2909 1.6618 0.6772 1.2908 1.6616 0.6771 1.2907 1.6614 0.6771 1.2906 1.6612 0.6771 1.2905 1.6611

1.9864 1.9861 1.9858 1.9855 1.9853

2.3680 2.3676 2.3671 2.3667 2.3662

2.6309 2.6303 2.6297 2.6291 2.6286

  96   97   98   99 100

0.6771 1.2904 1.6609 0.6770 1.2903 1.6607 0.6770 1.2902 1.6606 0.6770 1.2902 1.6604 0.6770 1.2901 1.6602

1.9850 1.9847 1.9845 1.9842 1.9840

2.3658 2.3654 2.3650 2.3646 2.3642

2.6280 2.6275 2.6269 2.6264 2.6259

110

0.6767 1.2893 1.6588 1.9818 2.3607 2.6213

120

0.6765 1.2886 1.6577 1.9799 2.3578 2.6174



0.6745 1.2816 1.6449 1.9600 2.3263 2.5758

Table E.3

Critical values of t (continued )


Table E.4  Critical values of χ2
For a particular number of degrees of freedom, entry represents the critical value of χ2 corresponding to a specified upper-tail area (α). (The accompanying figure shades the upper-tail area α to the right of χ2U(α, df), with area 1 − α to its left.)

Degrees of freedom    Upper-tail areas (α): 0.995   0.99   0.975   0.95   0.90   0.75   0.25   0.10   0.05   0.025   0.01   0.005

  1 0.001 0.004 0.016 0.102 1.323 2.706 3.841 5.024 6.635 7.879   2 0.010 0.020 0.051 0.103 0.211 0.575 2.773 4.605 5.991 7.378 9.210 10.597   3 0.072 0.115 0.216 0.352 0.584 1.213 4.108 6.251 7.815 9.348 11.345 12.838   4 0.207 0.297 0.484 0.711 1.064 1.923 5.385 7.779 9.488 11.143 13.277 14.860   5 0.412 0.554 0.831 1.145 1.610 2.675 6.626 9.236 11.071 12.833 15.086 16.750   6   7   8   9 10

0.676 0.872 1.237 1.635 2.204 3.455 7.841 10.645 12.592 14.449 16.812 18.458 0.989 1.239 1.690 2.167 2.833 4.255 9.037 12.017 14.067 16.013 18.475 20.278 1.344 1.646 2.180 2.733 3.490 5.071 10.219 13.362 15.507 17.535 20.090 21.955 1.735 2.088 2.700 3.325 4.168 5.899 11.389 14.684 16.919 19.023 21.666 23.589 2.156 2.558 3.247 3.940 4.865 6.737 12.549 15.987 18.307 20.483 23.209 25.188

11 12 13 14 15

2.603 3.074 3.565 4.075 4.601

3.053 3.571 4.107 4.660 5.229

3.816 4.404 5.009 5.629 6.262

16 17 18 19 20

5.142 5.697 6.265 6.844 7.434

5.812 6.408 7.015 7.633 8.260

6.908 7.962 9.312 11.912 19.369 23.542 26.296 28.845 32.000 34.267 7.564 8.672 10.085 12.792 20.489 24.769 27.587 30.191 33.409 35.718 8.231 9.390 10.865 13.675 21.605 25.989 28.869 31.526 34.805 37.156 8.907 10.117 11.651 14.562 22.718 27.204 30.144 32.852 36.191 38.582 9.591 10.851 12.443 15.452 23.828 28.412 31.410 34.170 37.566 39.997

4.575 5.226 5.892 6.571 7.261

5.578 7.584 13.701 17.275 19.675 21.920 24.725 26.757 6.304 8.438 14.845 18.549 21.026 23.337 26.217 28.299 7.042 9.299 15.984 19.812 22.362 24.736 27.688 29.819 7.790 10.165 17.117 21.064 23.685 26.119 29.141 31.319 8.547 11.037 18.245 22.307 24.996 27.488 30.578 32.801

21 22 23 24 25

8.034 8.897 10.283 11.591 13.240 16.344 24.935 29.615 32.671 35.479 38.932 41.401 8.643 9.542 10.982 12.338 14.042 17.240 26.039 30.813 33.924 36.781 40.289 42.796 9.260 10.196 11.689 13.091 14.848 18.137 27.141 32.007 35.172 38.076 41.638 44.181 9.886 10.856 12.401 13.848 15.659 19.037 28.241 33.196 36.415 39.364 42.980 45.559 10.520 11.524 13.120 14.611 16.473 19.939 29.339 34.382 37.652 40.646 44.314 46.928

26 27 28 29 30

11.160 12.198 13.844 15.379 17.292 20.843 30.435 35.563 38.885 41.923 45.642 48.290 11.808 12.879 14.573 16.151 18.114 21.749 31.528 36.741 40.113 43.194 46.963 49.645 12.461 13.565 15.308 16.928 18.939 22.657 32.620 37.916 41.337 44.461 48.278 50.993 13.121 14.257 16.047 17.708 19.768 23.567 33.711 39.087 42.557 45.722 49.588 52.336 13.787 14.954 16.791 18.493 20.599 24.478 34.800 40.256 43.773 46.979 50.892 53.672

For larger values of degrees of freedom (df), the expression Z = √(2χ2) − √(2(df) − 1) may be used, and the resulting upper-tail area can be found from the cumulative standardised normal distribution (Table E.2).


161.40 199.50 215.70 224.60 230.20 234.00 236.80 238.90 240.50 241.90 243.90 245.90 248.00 249.10 250.10 251.10 252.20 253.30 254.30 18.51 19.00 19.16 19.25 19.30 19.33 19.35 19.37 19.38 19.40 19.41 19.43 19.45 19.45 19.46 19.47 19.48 19.49 19.50 10.13 9.55 9.28 9.12 9.01 8.94 8.89 8.85 8.81 8.79 8.74 8.70 8.66 8.64 8.62 8.59 8.57 8.55 8.53 7.71 6.94 6.59 6.39 6.26 6.16 6.09 6.04 6.00 5.96 5.91 5.86 5.80 5.77 5.75 5.72 5.69 5.66 5.63

6.61 5.79 5.41 5.19 5.05 4.95 4.88 4.82 4.77 4.74 4.68 4.62 4.56 4.53 4.50 4.46 4.43 4.40 4.36 5.99 5.14 4.76 4.53 4.39 4.28 4.21 4.15 4.10 4.06 4.00 3.94 3.87 3.84 3.81 3.77 3.74 3.70 3.67 5.59 4.74 4.35 4.12 3.97 3.87 3.79 3.73 3.68 3.64 3.57 3.51 3.44 3.41 3.38 3.34 3.30 3.27 3.23 5.32 4.46 4.07 3.84 3.69 3.58 3.50 3.44 3.39 3.35 3.28 3.22 3.15 3.12 3.08 3.04 3.01 2.97 2.93 5.12 4.26 3.86 3.63 3.48 3.37 3.29 3.23 3.18 3.14 3.07 3.01 2.94 2.90 2.86 2.83 2.79 2.75 2.71

4.96 4.10 3.71 3.48 3.33 3.22 3.14 3.07 3.02 2.98 2.91 2.85 2.77 2.74 2.70 2.66 2.62 2.58 2.54 4.84 3.98 3.59 3.36 3.20 3.09 3.01 2.95 2.90 2.85 2.79 2.72 2.65 2.61 2.57 2.53 2.49 2.45 2.40 4.75 3.89 3.49 3.26 3.11 3.00 2.91 2.85 2.80 2.75 2.69 2.62 2.54 2.51 2.47 2.43 2.38 2.34 2.30 4.67 3.81 3.41 3.18 3.03 2.92 2.83 2.77 2.71 2.67 2.60 2.53 2.46 2.42 2.38 2.34 2.30 2.25 2.21 4.60 3.74 3.34 3.11 2.96 2.85 2.76 2.70 2.65 2.60 2.53 2.46 2.39 2.35 2.31 2.27 2.22 2.18 2.13

4.54 3.68 3.29 3.06 2.90 2.79 2.71 2.64 2.59 2.54 2.48 2.40 2.33 2.29 2.25 2.20 2.16 2.11 2.07 4.49 3.63 3.24 3.01 2.85 2.74 2.66 2.59 2.54 2.49 2.42 2.35 2.28 2.24 2.19 2.15 2.11 2.06 2.01 4.45 3.59 3.20 2.96 2.81 2.70 2.61 2.55 2.49 2.45 2.38 2.31 2.23 2.19 2.15 2.10 2.06 2.01 1.96 4.41 3.55 3.16 2.93 2.77 2.66 2.58 2.51 2.46 2.41 2.34 2.27 2.19 2.15 2.11 2.06 2.02 1.97 1.92 4.38 3.52 3.13 2.90 2.74 2.63 2.54 2.48 2.42 2.38 2.31 2.23 2.16 2.11 2.07 2.03 1.98 1.93 1.88

4.35 3.49 3.10 2.87 2.71 2.60 2.51 2.45 2.39 2.35 2.28 2.20 2.12 2.08 2.04 1.99 1.95 1.90 1.84 4.32 3.47 3.07 2.84 2.68 2.57 2.49 2.42 2.37 2.32 2.25 2.18 2.10 2.05 2.01 1.96 1.92 1.87 1.81 4.30 3.44 3.05 2.82 2.66 2.55 2.46 2.40 2.34 2.30 2.23 2.15 2.07 2.03 1.98 1.91 1.89 1.84 1.78 4.28 3.42 3.03 2.80 2.64 2.53 2.44 2.37 2.32 2.27 2.20 2.13 2.05 2.01 1.96 1.91 1.86 1.81 1.76 4.26 3.40 3.01 2.78 2.62 2.51 2.42 2.36 2.30 2.25 2.18 2.11 2.03 1.98 1.94 1.89 1.84 1.79 1.73

4.24 3.39 2.99 2.76 2.60 2.49 2.40 2.34 2.28 2.24 2.16 2.09 2.01 1.96 1.92 1.87 1.82 1.77 1.71 4.23 3.37 2.98 2.74 2.59 2.47 2.39 2.32 2.27 2.22 2.15 2.07 1.99 1.95 1.90 1.85 1.80 1.75 1.69 4.21 3.35 2.96 2.73 2.57 2.46 2.37 2.31 2.25 2.20 2.13 2.06 1.97 1.93 1.88 1.84 1.79 1.73 1.67 4.20 3.34 2.95 2.71 2.56 2.45 2.36 2.29 2.24 2.19 2.12 2.04 1.96 1.91 1.87 1.82 1.77 1.71 1.65 4.18 3.33 2.93 2.70 2.55 2.43 2.35 2.28 2.22 2.18 2.10 2.03 1.94 1.90 1.85 1.81 1.75 1.70 1.64

4.17 3.32 2.92 2.69 2.53 2.42 2.33 2.27 2.21 2.16 2.09 2.01 1.93 1.89 1.84 1.79 1.74 1.68 1.62 4.08 3.23 2.84 2.61 2.45 2.34 2.25 2.18 2.12 2.08 2.00 1.92 1.84 1.79 1.74 1.69 1.64 1.58 1.51 4.00 3.15 2.76 2.53 2.37 2.25 2.17 2.10 2.04 1.99 1.92 1.84 1.75 1.70 1.65 1.59 1.53 1.47 1.39 3.92 3.07 2.68 2.45 2.29 2.17 2.09 2.02 1.96 1.91 1.83 1.75 1.66 1.61 1.55 1.50 1.43 1.35 1.25 3.84 3.00 2.60 2.37 2.21 2.10 2.01 1.94 1.88 1.83 1.75 1.67 1.57 1.52 1.46 1.39 1.32 1.22 1.00 continues

  5   6   7   8   9

  10   11   12   13   14

  15   16   17   18   19

  20   21   22   23   24

  25   26   27   28   29

  30   40   60 120 ∞

1 2 3 4 5 6 7 8 9 10 12 15 20 24 30 40 60 120 ∙

Table E.5  Critical values of F (α = 0.05)
For a particular combination of numerator and denominator degrees of freedom, entry represents the critical value of F corresponding to a specified upper-tail area (α). Columns give the numerator degrees of freedom df1; rows give the denominator degrees of freedom df2 (the first data block above corresponds to df2 = 1 to 4, followed by df2 = 5 to 30, 40, 60, 120 and ∞).

6.20 4.77 4.15 3.80 3.58 3.41 3.29 3.20 3.12 3.06 2.96 2.86 2.76 2.70 6.12 4.69 4.08 3.73 3.50 3.34 3.22 3.12 3.05 2.99 2.89 2.79 2.68 2.63 6.04 4.62 4.01 3.66 3.44 3.28 3.16 3.06 2.98 2.92 2.82 2.72 2.62 2.56 5.98 4.56 3.95 3.61 3.38 3.22 3.10 3.01 2.93 2.87 2.77 2.67 2.56 2.50 5.92 4.51 3.90 3.56 3.33 3.17 3.05 2.96 2.88 2.82 2.72 2.62 2.51 2.45

5.87 4.46 3.86 3.51 3.29 3.13 3.01 2.91 2.84 2.77 2.68 2.57 2.46 2.41 5.83 4.42 3.82 3.48 3.25 3.09 2.97 2.87 2.80 2.73 2.64 2.53 2.42 2.37 5.79 4.38 3.78 3.44 3.22 3.05 2.93 2.84 2.76 2.70 2.60 2.50 2.39 2.33 5.75 4.35 3.75 3.41 3.18 3.02 2.90 2.81 2.73 2.67 2.57 2.47 2.36 2.30 5.72 4.32 3.72 3.38 3.15 2.99 2.87 2.78 2.70 2.64 2.54 2.44 2.33 2.27

5.69 4.29 3.69 3.35 3.13 2.97 2.85 2.75 2.68 2.61 2.51 2.41 2.30 2.24 5.66 4.27 3.67 3.33 3.10 2.94 2.82 2.73 2.65 2.59 2.49 2.39 2.28 2.22 5.63 4.24 3.65 3.31 3.08 2.92 2.80 2.71 2.63 2.57 2.47 2.36 2.25 2.19 5.61 4.22 3.63 3.29 3.06 2.90 2.78 2.69 2.61 2.55 2.45 2.34 2.23 2.17 5.59 4.20 3.61 3.27 3.04 2.88 2.76 2.67 2.59 2.53 2.43 2.32 2.21 2.15

5.57 4.18 3.59 3.25 3.03 2.87 2.75 2.65 2.57 2.51 2.41 2.31 2.20 2.14 5.42 4.05 3.46 3.13 2.90 2.74 2.62 2.53 2.45 2.39 2.29 2.18 2.07 2.01 5.29 3.93 3.34 3.01 2.79 2.63 2.51 2.41 2.33 2.27 2.17 2.06 1.94 1.88 5.15 3.80 3.23 2.89 2.67 2.52 2.39 2.30 2.22 2.16 2.05 1.94 1.82 1.76 5.02 3.69 3.12 2.79 2.57 2.41 2.29 2.19 2.11 2.05 1.94 1.83 1.71 1.64

  15   16   17   18   19

  20   21   22   23   24

  25   26   27   28   29

  30   40   60 120   ∞

2.07 1.94 1.82 1.69 1.57

2.18 2.16 2.13 2.11 2.09

2.35 2.31 2.27 2.24 2.21

2.64 2.57 2.50 2.44 2.39

3.31 3.12 2.96 2.84 2.73

2.01 1.88 1.74 1.61 1.48

2.12 2.09 2.07 2.05 2.03

2.29 2.25 2.21 2.18 2.15

2.59 2.51 2.44 2.38 2.33

3.26 3.06 2.91 2.78 2.67

1.94 1.80 1.67 1.53 1.39

2.05 2.03 2.00 1.98 1.96

2.22 2.18 2.14 2.11 2.08

2.52 2.45 2.38 2.32 2.27

3.20 3.00 2.85 2.72 2.61

1.87 1.79 1.72 1.64 1.58 1.48 1.43 1.31 1.27 1.00 continues

1.98 1.91 1.95 1.88 1.93 1.85 1.91 1.83 1.89 1.81

2.16 2.09 2.11 2.04 2.08 2.00 2.04 1.97 2.01 1.94

2.46 2.40 2.38 2.32 2.32 2.25 2.26 2.19 2.20 2.13

3.14 3.08 2.94 2.88 2.79 2.72 2.66 2.60 2.55 2.49

6.07 6.02 4.90 4.85 4.20 4.14 3.73 3.67 3.39 3.33

6.94 5.46 4.83 4.47 4.24 4.07 3.95 3.85 3.78 3.72 3.62 3.52 3.42 3.37 6.72 5.26 4.63 4.28 4.04 3.88 3.76 3.66 3.59 3.53 3.43 3.33 3.23 3.17 6.55 5.10 4.47 4.12 3.89 3.73 3.61 3.51 3.44 3.37 3.28 3.18 3.07 3.02 6.41 4.97 4.35 4.00 3.77 3.60 3.48 3.39 3.31 3.25 3.15 3.05 2.95 2.89 6.30 4.86 4.24 3.89 3.66 3.50 3.38 3.29 3.21 3.15 3.05 2.95 2.84 2.79

6.12 4.96 4.25 3.78 3.45

  10   11   12   13   14

6.18 5.01 4.31 3.84 3.51

10.01 8.43 7.76 7.39 7.15 6.98 6.85 6.76 6.68 6.62 6.52 6.43 6.33 6.28 8.81 7.26 6.60 6.23 5.99 5.82 5.70 5.60 5.52 5.46 5.37 5.27 5.17 5.12 8.07 6.54 5.89 5.52 5.29 5.12 4.99 4.90 4.82 4.76 4.67 4.57 4.47 4.42 7.57 6.06 5.42 5.05 4.82 4.65 4.53 4.43 4.36 4.30 4.20 4.10 4.00 3.95 7.21 5.71 5.08 4.72 4.48 4.32 4.20 4.10 4.03 3.96 3.87 3.77 3.67 3.61

  5   6   7   8   9

6.23 5.07 4.36 3.89 3.56

647.80 799.50 864.20 899.60 921.80 937.10 948.20 956.70 963.30 968.60 976.70 984.90 993.10 997.20 1,001.00 1,006.00 1,010.00 1,014.00 1,018.00 38.51 39.00 39.17 39.25 39.30 39.33 39.36 39.39 39.39 39.40 39.41 39.43 39.45 39.46 39.46 39.47 39.48 39.49 39.50 17.44 16.04 15.44 15.10 14.88 14.73 14.62 14.54 14.47 14.42 14.34 14.25 14.17 14.12 14.08 14.04 13.99 13.95 13.90 12.22 10.65 9.98 9.60 9.36 9.20 9.07 8.98 8.90 8.84 8.75 8.66 8.56 8.51 8.46 8.41 8.36 8.31 8.26

Table E.5  Critical values of F (continued): α = 0.025
Columns: numerator df1 = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 15, 20, 24, 30, 40, 60, 120, ∞; rows: denominator df2 (1 to 4 in the line above, then 5 to 30, 40, 60, 120, ∞).


4,052.00 4,999.50 5,403.00 5,625.00 5,764.00 5,859.00 5,928.00 5,982.00 6,022.00 6,056.00 6,106.00 6,157.00 6,209.00 6,235.00 6,261.00 6,287.00 6,313.00 6,339.00 6,366.00 98.50 99.00 99.17 99.25 99.30 99.33 99.36 99.37 99.39 99.40 99.42 99.43 44.45 99.46 99.47 99.47 99.48 99.49 99.50 34.12 30.82 29.46 28.71 28.24 27.91 27.67 27.49 27.35 27.23 27.05 26.87 26.69 26.60 26.50 26.41 26.32 26.22 26.13 21.20 18.00 16.69 15.98 15.52 15.21 14.98 14.80 14.66 14.55 14.37 14.20 14.02 13.93 13.84 13.75 13.65 13.56 13.46

16.26 13.27 12.06 11.39 10.97 10.67 10.46 10.29 10.16 10.05 9.89 9.72 9.55 9.47 9.38 9.29 9.20 9.11 9.02 13.75 10.92 9.78 9.15 8.75 8.47 8.26 8.10 7.98 7.87 7.72 7.56 7.40 7.31 7.23 7.14 7.06 6.97 6.88 12.25 9.55 8.45 7.85 7.46 7.19 6.99 6.84 6.72 6.62 6.47 6.31 6.16 6.07 5.99 5.91 5.82 5.74 5.65 11.26 8.65 7.59 7.01 6.63 6.37 6.18 6.03 5.91 5.81 5.67 5.52 5.36 5.28 5.20 5.12 5.03 4.95 4.86 10.56 8.02 6.99 6.42 6.06 5.80 5.61 5.47 5.35 5.26 5.11 4.96 4.81 4.73 4.65 4.57 4.48 4.40 4.31

10.04 7.56 6.55 5.99 5.64 5.39 5.20 5.06 4.94 4.85 4.71 4.56 4.41 4.33 4.25 4.17 4.08 4.00 3.91 9.65 7.21 6.22 5.67 5.32 5.07 4.89 4.74 4.63 4.54 4.40 4.25 4.10 4.02 3.94 3.86 3.78 3.69 3.60 9.33 6.93 5.95 5.41 5.06 4.82 4.64 4.50 4.39 4.30 4.16 4.01 3.86 3.78 3.70 3.62 3.54 3.45 3.36 9.07 6.70 5.74 5.21 4.86 4.62 4.44 4.30 4.19 4.10 3.96 3.82 3.66 3.59 3.51 3.43 3.34 3.25 3.17 8.86 6.51 5.56 5.04 4.69 4.46 4.28 4.14 4.03 3.94 3.80 3.66 3.51 3.43 3.35 3.27 3.18 3.09 3.00

8.68 6.36 5.42 4.89 4.56 4.32 4.14 4.00 3.89 3.80 3.67 3.52 3.37 3.29 3.21 3.13 3.05 2.96 2.87 8.53 6.23 5.29 4.77 4.44 4.20 4.03 3.89 3.78 3.69 3.55 3.41 3.26 3.18 3.10 3.02 2.93 2.81 2.75 8.40 6.11 5.18 4.67 4.34 4.10 3.93 3.79 3.68 3.59 3.46 3.31 3.16 3.08 3.00 2.92 2.83 2.75 2.65 8.29 6.01 5.09 4.58 4.25 4.01 3.84 3.71 3.60 3.51 3.37 3.23 3.08 3.00 2.92 2.84 2.75 2.66 2.57 8.18 5.93 5.01 4.50 4.17 3.94 3.77 3.63 3.52 3.43 3.30 3.15 3.00 2.92 2.84 2.76 2.67 2.58 2.49

8.10 5.85 4.94 4.43 4.10 3.87 3.70 3.56 3.46 3.37 3.23 3.09 2.94 2.86 2.78 2.69 2.61 2.52 2.42 8.02 5.78 4.87 4.37 4.04 3.81 3.64 3.51 3.40 3.31 3.17 3.03 2.88 2.80 2.72 2.64 2.55 2.46 2.36 7.95 5.72 4.82 4.31 3.99 3.76 3.59 3.45 3.35 3.26 3.12 2.98 2.83 2.75 2.67 2.58 2.50 2.40 2.31 7.88 5.66 4.76 4.26 3.94 3.71 3.54 3.41 3.30 3.21 3.07 2.93 2.78 2.70 2.62 2.54 2.45 2.35 2.26 7.82 5.61 4.72 4.22 3.90 3.67 3.50 3.36 3.26 3.17 3.03 2.89 2.74 2.66 2.58 2.49 2.40 2.31 2.21

7.77 5.57 4.68 4.18 3.85 3.63 3.46 3.32 3.22 3.13 2.99 2.85 2.70 2.62 2.54 2.45 2.36 2.27 2.17 7.72 5.53 4.64 4.14 3.82 3.59 3.42 3.29 3.18 3.09 2.96 2.81 2.66 2.58 2.50 2.42 2.33 2.23 2.13 7.68 5.49 4.60 4.11 3.78 3.56 3.39 3.26 3.15 3.06 2.93 2.78 2.63 2.55 2.47 2.38 2.29 2.20 2.10 7.64 5.45 4.57 4.07 3.75 3.53 3.36 3.23 3.12 3.03 2.90 2.75 2.60 2.52 2.44 2.35 2.26 2.17 2.06 7.60 5.42 4.54 4.04 3.73 3.50 3.33 3.20 3.09 3.00 2.87 2.73 2.57 2.49 2.41 2.33 2.23 2.14 2.03

7.56 5.39 4.51 4.02 3.70 3.47 3.30 3.17 3.07 2.98 2.84 2.70 2.55 2.47 2.39 2.30 2.21 2.11 2.01 7.31 5.18 4.31 3.83 3.51 3.29 3.12 2.99 2.89 2.80 2.66 2.52 2.37 2.29 2.20 2.11 2.02 1.92 1.80 7.08 4.98 4.13 3.65 3.34 3.12 2.95 2.82 2.72 2.63 2.50 2.35 2.20 2.12 2.03 1.94 1.84 1.73 1.60 6.85 4.79 3.95 3.48 3.17 2.96 2.79 2.66 2.56 2.47 2.34 2.19 2.03 1.95 1.86 1.76 1.66 1.53 1.38 6.63 4.61 3.78 3.32 3.02 2.80 2.64 2.51 2.41 2.32 2.18 2.04 1.88 1.79 1.70 1.59 1.47 1.32 1.00 continues

  5   6   7   8   9

  10   11   12   13   14

  15   16   17   18   19

  20   21   22   23   24

  25   26   27   28   29

  30   40   60 120   ∞

Table E.5  Critical values of F (continued): α = 0.01
Columns: numerator df1 = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 15, 20, 24, 30, 40, 60, 120, ∞; rows: denominator df2 = 1 to 30, 40, 60, 120, ∞.

Table E.5  Critical values of F (continued)
α = 0.005. Entries are the upper critical values F_U(α, df1, df2) for numerator degrees of freedom df1 = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 15, 20, 24, 30, 40, 60, 120, ∞ and denominator degrees of freedom df2 = 1 to 30, 40, 60, 120, ∞.
Source: E. S. Pearson and H. O. Hartley, Biometrika Tables for Statisticians, Volume 1, 3rd edn, 2015, copyright Cambridge University Press, reproduced with permission.
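As an aside that is not part of the printed appendix, upper-tail F critical values such as those in Table E.5 can also be generated in software. A minimal sketch in Python, assuming the SciPy library is available:

```python
# Minimal sketch (not from the text): upper-tail F critical values, as in Table E.5.
from scipy.stats import f

alpha = 0.01        # upper-tail area
df1, df2 = 5, 20    # numerator and denominator degrees of freedom

# F_U(alpha, df1, df2) is the point with area alpha in the upper tail.
f_upper = f.ppf(1 - alpha, df1, df2)
print(round(f_upper, 2))  # about 4.10, matching the df1 = 5, df2 = 20 entry at alpha = 0.01
```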


Table E.6  Table of binomial probabilities
For a given combination of n and p, the entry indicates the probability of obtaining a specified value of X. To locate an entry: when p ≤ 0.50, read p across the top heading and both n and X down the left margin; when p > 0.50, read p across the bottom heading and both n and X up the right margin.
Probabilities are tabulated for n = 1 to 15 and n = 20, for X = 0 to n, and for p = 0.01 to 0.10 (in steps of 0.01) and 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50 (equivalently, by symmetry, for p = 0.50 to 0.99 read from the bottom heading).
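As an aside that is not part of the printed appendix, the entries in Table E.6 can be reproduced in software. A minimal sketch in Python, assuming SciPy is available (the n and p shown are only an example):

```python
# Minimal sketch (not from the text): binomial probabilities, as in Table E.6.
from scipy.stats import binom

n, p = 7, 0.40
for x in range(n + 1):
    # P(X = x) for a binomial random variable with parameters n and p
    print(x, round(binom.pmf(x, n, p), 4))
# The n = 7, p = 0.40 column of the table begins 0.0280, 0.1306, 0.2613, ...
```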


Table E.7  Table of Poisson probabilities
For a given value of λ, the entry indicates the probability of a specified value of X.
Probabilities are tabulated for λ = 0.1 to 10.0 in steps of 0.1, and for λ = 20 (X = 0 to 39).
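As an aside that is not part of the printed appendix, the entries in Table E.7 can be reproduced in software. A minimal sketch in Python, assuming SciPy is available:

```python
# Minimal sketch (not from the text): Poisson probabilities, as in Table E.7.
from scipy.stats import poisson

lam = 2.0
for x in range(8):
    # P(X = x) for a Poisson random variable with mean lam
    print(x, round(poisson.pmf(x, lam), 4))
# The λ = 2.0 column of the table begins 0.1353, 0.2707, 0.2707, 0.1804, ...
```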


Table E.8  Lower and upper critical values T1 of the Wilcoxon rank sum test
Entries give the lower and upper critical values of T1 for sample sizes n1 and n2 from 4 to 10, for one-tail α = 0.05, 0.025, 0.01, 0.005 (two-tail α = 0.10, 0.05, 0.02, 0.01).
Source: Adapted from Table 1 of F. Wilcoxon and R. A. Wilcox, Some Rapid Approximate Statistical Procedures (Pearl River, NY: Lederle Laboratories, 1964), with permission of the American Cyanamid Company.


Table E.9  Lower and upper critical values W of the Wilcoxon signed ranks test
Entries give the (lower, upper) critical values of W for n = 5 to 20, for one-tail α = 0.05, 0.025, 0.01, 0.005 (two-tail α = 0.10, 0.05, 0.02, 0.01).
Source: Adapted from Table 2 of F. Wilcoxon and R. A. Wilcox, Some Rapid Approximate Statistical Procedures (Pearl River, NY: Lederle Laboratories, 1964), with permission of the American Cyanamid Company.
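Not part of the printed appendix: the two procedures that Tables E.8 and E.9 support are usually run directly in software, which returns a p-value instead of requiring a critical-value look-up. A minimal sketch in Python, assuming SciPy is available; the data are invented purely for illustration:

```python
# Minimal sketch (not from the text): Wilcoxon-type tests in SciPy.
from scipy.stats import mannwhitneyu, wilcoxon

# Two independent samples: rank sum test (cf. Table E.8)
group_a = [24, 31, 29, 35, 40, 28, 33]
group_b = [19, 22, 30, 26, 25, 27, 21]
stat_rank_sum, p_rank_sum = mannwhitneyu(group_a, group_b, alternative="two-sided")

# Paired samples: signed ranks test (cf. Table E.9)
before = [10.2, 9.8, 11.4, 10.9, 12.1, 9.5]
after = [9.7, 9.9, 10.8, 10.1, 11.6, 9.0]
stat_signed, p_signed = wilcoxon(before, after)

print(p_rank_sum, p_signed)
```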


Table E.10  Critical values of the Studentised range Q
Upper 5% points (α = 0.05) and upper 1% points (α = 0.01), for 2 to 20 groups (numerator degrees of freedom) and denominator degrees of freedom 1 to 20, 24, 30, 40, 60, 120, ∞.
Source: E. S. Pearson and H. O. Hartley, Biometrika Tables for Statisticians, Volume 1, 3rd edn, 2015, copyright Cambridge University Press, reproduced with permission.
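Not part of the printed appendix: upper critical values of the Studentised range can also be obtained in software. A minimal sketch in Python, assuming SciPy 1.7 or later (which provides scipy.stats.studentized_range):

```python
# Minimal sketch (not from the text): Studentised range critical values, as in Table E.10.
from scipy.stats import studentized_range

alpha = 0.05
k = 4      # number of groups (the column heading in Table E.10)
df = 20    # denominator degrees of freedom

q_upper = studentized_range.ppf(1 - alpha, k, df)
print(round(q_upper, 2))  # about 3.96, matching the k = 4, df = 20 entry for alpha = 0.05
```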


Table E.11  Critical values dL and dU of the Durbin–Watson statistic D (critical values are one-sided)ᵃ
Entries give the lower (dL) and upper (dU) critical values for α = 0.05 and α = 0.01, for n = 15 to 40 and n = 45 to 100 (in steps of 5) observations, and for k = 1 to 5 independent variables.
ᵃ n = number of observations; k = number of independent variables.
Source: This table is reproduced from J. Durbin and G. S. Watson, Biometrika, 41 (1951): 173 and 175, Oxford University Press © 1951, Biometrika Trust.
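Not part of the printed appendix: the Durbin–Watson statistic itself is normally computed by software and then compared with the dL and dU bounds above. A minimal sketch in Python, assuming the statsmodels package is available; the regression data are invented for illustration:

```python
# Minimal sketch (not from the text): computing the Durbin-Watson statistic D.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)
x = np.arange(30, dtype=float)
y = 2.0 + 0.5 * x + rng.normal(size=30)

model = sm.OLS(y, sm.add_constant(x)).fit()
d = durbin_watson(model.resid)
# Compare d with the dL and dU entries in Table E.11 for n = 30, k = 1.
print(round(d, 2))
```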


Table E.12  Selected critical values of F for Cook's Di statistic, α = 0.50

                                    Numerator df = k + 1
Denominator
df = n − k − 1    2     3     4     5     6     7     8     9     10    12    15    20
  10            .743  .845  .899  .932  .954  .971  .983  .992  1.00  1.01  1.02  1.03
  11            .739  .840  .893  .926  .948  .964  .977  .986  .994  1.01  1.02  1.03
  12            .735  .835  .888  .921  .943  .959  .972  .981  .989  1.00  1.01  1.02
  15            .726  .826  .878  .911  .933  .949  .960  .970  .977  .989  1.00  1.01
  20            .718  .816  .868  .900  .922  .938  .950  .959  .966  .977  .989  1.00
  24            .714  .812  .863  .895  .917  .932  .944  .953  .961  .972  .983  .994
  30            .709  .807  .858  .890  .912  .927  .939  .948  .955  .966  .978  .989
  40            .705  .802  .854  .885  .907  .922  .934  .943  .950  .961  .972  .983
  60            .701  .798  .849  .880  .901  .917  .928  .937  .945  .956  .967  .978
 120            .697  .793  .844  .875  .896  .912  .923  .932  .939  .950  .961  .972
  ∞             .693  .789  .839  .870  .891  .907  .918  .927  .934  .945  .956  .967

Source: E. S. Pearson and H. O. Hartley, Biometrika Tables for Statisticians, Volume 1, 3rd edn, 2015, copyright Cambridge University Press, reproduced with ­permission.
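Not part of the printed appendix: the α = 0.50 F values in Table E.12 can be reproduced in software when an exact combination of degrees of freedom is not tabulated. A minimal sketch in Python, assuming SciPy is available:

```python
# Minimal sketch (not from the text): the F(0.50, k + 1, n - k - 1) comparison value
# against which Cook's Di is judged.
from scipy.stats import f

k, n = 1, 12   # one independent variable, twelve observations
critical = f.ppf(0.50, k + 1, n - k - 1)
print(round(critical, 3))  # about 0.743, the numerator df = 2, denominator df = 10 entry
```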

Table E.13  Control chart factors

Number of observations
in sample (n)      d2      d3      D3      D4      A2
 2               1.128   0.853   0       3.267   1.880
 3               1.693   0.888   0       2.575   1.023
 4               2.059   0.880   0       2.282   0.729
 5               2.326   0.864   0       2.114   0.577
 6               2.534   0.848   0       2.004   0.483
 7               2.704   0.833   0.076   1.924   0.419
 8               2.847   0.820   0.136   1.864   0.373
 9               2.970   0.808   0.184   1.816   0.337
10               3.078   0.797   0.223   1.777   0.308
11               3.173   0.787   0.256   1.744   0.285
12               3.258   0.778   0.283   1.717   0.266
13               3.336   0.770   0.307   1.693   0.249
14               3.407   0.763   0.328   1.672   0.235
15               3.472   0.756   0.347   1.653   0.223
16               3.532   0.750   0.363   1.637   0.212
17               3.588   0.744   0.378   1.622   0.203
18               3.640   0.739   0.391   1.609   0.194
19               3.689   0.733   0.404   1.596   0.187
20               3.735   0.729   0.415   1.585   0.180
21               3.778   0.724   0.425   1.575   0.173
22               3.819   0.720   0.435   1.565   0.167
23               3.858   0.716   0.443   1.557   0.162
24               3.895   0.712   0.452   1.548   0.157
25               3.931   0.708   0.459   1.541   0.153

Source: Reprinted from ASTM-STP 15D by kind permission of the American Society for Testing and Materials.
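Not part of the printed appendix: a minimal sketch, in Python, of how the factors in Table E.13 are commonly used to set X-bar and R chart control limits. The centre-line values below are invented for illustration; the factors shown are the Table E.13 values for samples of size n = 5.

```python
# Minimal sketch (not from the text): control limits from the Table E.13 factors.
x_bar_bar = 20.0   # mean of the sample means (centre line of the X-bar chart)
r_bar = 3.2        # mean of the sample ranges (centre line of the R chart)
A2, D3, D4 = 0.577, 0, 2.114   # Table E.13 factors for n = 5

ucl_xbar = x_bar_bar + A2 * r_bar
lcl_xbar = x_bar_bar - A2 * r_bar
ucl_r = D4 * r_bar
lcl_r = D3 * r_bar
print(lcl_xbar, ucl_xbar, lcl_r, ucl_r)
```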


Table E.14  The standardised normal distribution
Entry represents the area under the standardised normal distribution from the mean to Z.

Z     .00     .01     .02     .03     .04     .05     .06     .07     .08     .09
0.0  .0000   .0040   .0080   .0120   .0160   .0199   .0239   .0279   .0319   .0359
0.1  .0398   .0438   .0478   .0517   .0557   .0596   .0636   .0675   .0714   .0753
0.2  .0793   .0832   .0871   .0910   .0948   .0987   .1026   .1064   .1103   .1141
0.3  .1179   .1217   .1255   .1293   .1331   .1368   .1406   .1443   .1480   .1517
0.4  .1554   .1591   .1628   .1664   .1700   .1736   .1772   .1808   .1844   .1879
0.5  .1915   .1950   .1985   .2019   .2054   .2088   .2123   .2157   .2190   .2224
0.6  .2257   .2291   .2324   .2357   .2389   .2422   .2454   .2486   .2518   .2549
0.7  .2580   .2612   .2642   .2673   .2704   .2734   .2764   .2794   .2823   .2852
0.8  .2881   .2910   .2939   .2967   .2995   .3023   .3051   .3078   .3106   .3133
0.9  .3159   .3186   .3212   .3238   .3264   .3289   .3315   .3340   .3365   .3389
1.0  .3413   .3438   .3461   .3485   .3508   .3531   .3554   .3577   .3599   .3621
1.1  .3643   .3665   .3686   .3708   .3729   .3749   .3770   .3790   .3810   .3830
1.2  .3849   .3869   .3888   .3907   .3925   .3944   .3962   .3980   .3997   .4015
1.3  .4032   .4049   .4066   .4082   .4099   .4115   .4131   .4147   .4162   .4177
1.4  .4192   .4207   .4222   .4236   .4251   .4265   .4279   .4292   .4306   .4319
1.5  .4332   .4345   .4357   .4370   .4382   .4394   .4406   .4418   .4429   .4441
1.6  .4452   .4463   .4474   .4484   .4495   .4505   .4515   .4525   .4535   .4545
1.7  .4554   .4564   .4573   .4582   .4591   .4599   .4608   .4616   .4625   .4633
1.8  .4641   .4649   .4656   .4664   .4671   .4678   .4686   .4693   .4699   .4706
1.9  .4713   .4719   .4726   .4732   .4738   .4744   .4750   .4756   .4761   .4767
2.0  .4772   .4778   .4783   .4788   .4793   .4798   .4803   .4808   .4812   .4817
2.1  .4821   .4826   .4830   .4834   .4838   .4842   .4846   .4850   .4854   .4857
2.2  .4861   .4864   .4868   .4871   .4875   .4878   .4881   .4884   .4887   .4890
2.3  .4893   .4896   .4898   .4901   .4904   .4906   .4909   .4911   .4913   .4916
2.4  .4918   .4920   .4922   .4925   .4927   .4929   .4931   .4932   .4934   .4936
2.5  .4938   .4940   .4941   .4943   .4945   .4946   .4948   .4949   .4951   .4952
2.6  .4953   .4955   .4956   .4957   .4959   .4960   .4961   .4962   .4963   .4964
2.7  .4965   .4966   .4967   .4968   .4969   .4970   .4971   .4972   .4973   .4974
2.8  .4974   .4975   .4976   .4977   .4977   .4978   .4979   .4979   .4980   .4981
2.9  .4981   .4982   .4982   .4983   .4984   .4984   .4985   .4985   .4986   .4986
3.0  .49865  .49869  .49874  .49878  .49882  .49886  .49889  .49893  .49897  .49900
3.1  .49903  .49906  .49910  .49913  .49916  .49918  .49921  .49924  .49926  .49929
3.2  .49931  .49934  .49936  .49938  .49940  .49942  .49944  .49946  .49948  .49950
3.3  .49952  .49953  .49955  .49957  .49958  .49960  .49961  .49962  .49964  .49965
3.4  .49966  .49968  .49969  .49970  .49971  .49972  .49973  .49974  .49975  .49976
3.5  .49977  .49978  .49978  .49979  .49980  .49981  .49981  .49982  .49983  .49983
3.6  .49984  .49985  .49985  .49986  .49986  .49987  .49987  .49988  .49988  .49989
3.7  .49989  .49990  .49990  .49990  .49991  .49991  .49992  .49992  .49992  .49992
3.8  .49993  .49993  .49993  .49994  .49994  .49994  .49994  .49995  .49995  .49995
3.9  .49995  .49995  .49996  .49996  .49996  .49996  .49996  .49996  .49997  .49997
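Not part of the printed appendix: the Table E.14 entries (the area from the mean to Z) can be reproduced in software. A minimal sketch in Python, assuming SciPy is available:

```python
# Minimal sketch (not from the text): area under the standard normal curve from the mean to Z.
from scipy.stats import norm

z = 1.96
area_mean_to_z = norm.cdf(z) - 0.5
print(round(area_mean_to_z, 4))  # 0.4750, matching the Z = 1.96 entry in Table E.14
```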


F  Using Microsoft Excel Analysis ToolPak

F.1 • CONFIGURING MICROSOFT EXCEL
Many of the statistical functions in Excel need an add-in called Analysis ToolPak to be activated before they can be used. This add-in is available in both Microsoft Excel 2016 for PC and Microsoft Excel 2015 or 2016 for Mac. Complete the following steps to ensure that the copy of Microsoft Excel you are using is properly configured; you also need to do this to ensure that PHStat and Visual Explorations work. If you are using an earlier Mac version of Excel, be aware that most earlier versions do not contain the Analysis ToolPak add-in. Alternatives to upgrading are to partition the drive using Boot Camp software so that the machine can run as a PC and then install Microsoft Office for Windows, or to purchase a program such as Statplus or Numbers, which provides some of the same functionality.
Steps for PCs: Verify installation of the Microsoft Excel 2016 Analysis ToolPak add-ins. Open Microsoft Excel and click on the File tab, then Options, then Add-Ins in the Excel Options dialog box. If Analysis ToolPak appears in the Active Application Add-ins list, it is installed and enabled. Otherwise, select Excel Add-ins in the Manage list and click on Go. In the Add-Ins dialog box (see Figure F.1), tick both Analysis ToolPak and Analysis ToolPak – VBA, then click on OK.

In any version of Excel, if Analysis ToolPak and Analysis ToolPak – VBA do not appear in the Add-Ins dialog box, you need to rerun the Microsoft Office (or Excel) setup program. Use your original Microsoft Office software to install the ToolPak add-in. Steps for Macs: The steps to configure the Analysis ToolPak add-in on a Mac are slightly different. Open Microsoft Excel 2015 or 2016, go to Tools on the top menu, then in the drop-down menu click on Add-Ins. From the add-ins available, click on Analysis ToolPak, then OK. You should see that Data Analysis now appears at the right-hand end of the Data tab. Analysis ToolPak – VBA, which allows extra macros to be run, is not included in the Mac versions of Excel.

F.2 • USING THE DATA ANALYSIS TOOLS
The Data Analysis Tools are a set of statistical procedures included with Microsoft Excel. To use the Data Analysis Tools, select Data ➔ Data Analysis to display the Data Analysis dialog box (see Figure F.2). In the Analysis Tools list, select a procedure and click on the OK button. For most procedures, a second dialog box will appear in which you make entries and selections.

Figure F.1  Excel Add-ins dialog box

Figure F.2  Data Analysis dialog box


Glossary
a priori classical probability  Objective probability, obtained from prior knowledge of the process.
A2 factor  Used to calculate control limits of an X̄ chart.
adjusted R2  A modification of R-square that adjusts for the number of terms in a model.
aggregate price index  A price index that calculates a percentage change in prices of a group of commodities between two time periods.
Akaike information criterion (AIC)  A measure of the relative quality of a regression model.
alternative courses of action  Choices that may be made in decision making.
alternative hypothesis (H1)  A statement that we aim to prove about one or more population parameters; the opposite of the null hypothesis.
analysis of variance (ANOVA)  A method of analysing the differences between group means by partitioning variation.
ANOVA summary table  A table showing the results of an analysis of variance.
area of opportunity  The area of interest for counting nonconformities.
arithmetic mean (mean)  A measure of central tendency; the sum of all values divided by the number of values (usually called the mean); called the arithmetic mean to distinguish it from the geometric mean.
assignable (or special) causes of variation  Large fluctuations or patterns in data that are not inherent to a process; these fluctuations reflect changes in the process.
assumptions of regression  The conditions required for regression analysis to produce reliable results.
attribute chart  A control chart for categorical or discrete variables.
auditing  A process of checking the accuracy of financial records.
autocorrelation  Relationships between data values in consecutive periods of time.
autoregressive modelling  Modelling using autocorrelation, which is the correlation between successive values in a time series.
average linkage  A measure of distance that bases the distance between clusters on the mean distance between objects in one cluster and another cluster.
bar chart  The graphical representation of a summary table for categorical data; the length of each bar represents the proportion, frequency or percentage of data values in a category.
base period  The initial point in time for comparisons calculated using index numbers.
Bayes' theorem  Revises previously calculated probabilities when new information becomes available.
bell-shaped  Symmetric, unimodal, mound-shaped distribution.
best-subsets approach  An approach to multiple regression model development that evaluates all possible models for a given set of independent variables.

between-block variation  That part of the within-group variation due to differences between the blocks. between-group variation  That part of total variation due to differences from group to group. big data  Large data sets characterised by their volume, velocity and variety. binomial distribution  Discrete probability distribution, where the random variable is the number of successes in a sample of n observations from either an infinite population or sampling with replacement. blocks  Groupings of homogeneous units in experiments. box-and-whisker plot  The graphical representation of the five-number summary. Also called a boxplot. bullet graph  A horizontal bar chart inspired by a thermometer. business analytics  Skills, technologies and practices for continuous iterative exploration and investigation of past business performance to gain insight and drive business planning. capability index  A numerical measure of a process’s ability to meet specification limits. categorical variables  Variables which take values that fall into one or more categories. causal forecasting methods  Methods that attempt to find causal variables to account for changes in a time series. c chart  A control chart for the number of nonconformities. Central Limit Theorem  If the sample size is large enough, the distribution of sample means will be approximately normal even if the samples came from a population that was not normal. central tendency  The extent to which data values are grouped around a central value. certain event  An event that will occur. chance (or common) causes of variation  The variability inherent in a process; it cannot be reduced without redesigning the process. chartjunk  Unnecessary information and detail that reduce the clarity of a graph. Chebyshev rule  Gives lower bounds of the distribution of data values in terms of standard deviations from the mean for any distribution. chi-square (χ2) distribution  The probability distribution for chi-square to be used in determining critical values of chi-square. chi-square test of independence  A hypothesis test used to test whether there is a significant relationship between two categorical variables. chi-squared statistic  Measures the extent to which a set of categorical outcomes are different from a set of probabilities. class boundaries  Upper and lower values used to define classes for numerical data. class mid-point  The centre of a class; the representative value of a class. class width  Distance between upper and lower boundaries of a class.


classical multiplicative model  Model which states that values in a time series are a combination of trend, seasonal, cyclical and irregular components. classification and regression trees  Decision trees that split data into groups based on the values of independent or explanatory (X) variables. cluster  A naturally occurring grouping, such as a geographic area. cluster analysis  A form of analysis that classifies data into a sequence of groupings in which objects in each group have more in common with others in their group than they do with objects found in other groups. cluster sample  The frame is divided into representative groups (or clusters), then all items in randomly selected clusters are chosen. coefficient of correlation (or correlation coefficient)  A measure of the relative strength of the linear relationship between two numerical variables. coefficient of determination, r2  The square of the correlation coefficient between two variables (r2). coefficient of multiple determination, R2  In regression, the proportion of the variation in the Y dependent variable that is explained by a set of X independent variables. coefficient of partial determination  In regression, the independent proportion of the variation in the Y dependent variable that is explained by each X independent variable. coefficient of variation  A relative measure of variation; the standard deviation divided by the mean. collectively exhaustive  A set of events such that one of the events must occur. collinearity  In regression, refers to the potential for correlation within a set of independent X variables. combination  An unordered selection of items. common (or chance) causes of variation  The variability inherent in a process; it cannot be reduced without redesigning the process. complement  All simple outcomes not in an event. complete linkage  A measure of distance that bases the distance between clusters on the maximum distance between objects in one cluster and another cluster. completely randomised designs  One-factor experiments in the analysis of variance. conditional probability  The probability of an event, given information on the occurrence of a second event. confidence coefficient (1 − α)  The probability of not rejecting a null hypothesis when it is true and should not be rejected. confidence interval estimate  A range of numbers constructed about the point estimate. confidence level  The confidence coefficient expressed as a percentage. contingency table (or cross-classification table) – probability  A table representing a sample space for joint events classified by two characteristics; each cell represents the joint event satisfying given values of both characteristics. continuous probability density function  A mathematical expression that defines the distribution of the values for a continuous random variable. continuous variable  A variable that can take any value between specified limits.

control chart  A chart that monitors variation in a characteristic of a product or service over time. convenience sampling  Selection using a method that is easy or inexpensive. Cook’s Di statistic  A statistical method of residual analysis using the F probability distribution that identifies individual cases in the sample data of a multiple regression that have high individual influence on the regression equation. correlation coefficient (or coefficient of correlation)  A measure of the relative strength of the linear relationship between two numerical variables. covariance  A measure of the strength of the linear relationship between two numerical variables. coverage error  An error that occurs when all items in a frame do not have an equal chance of being selected. This causes selection bias. Cp statistic  A test for determining which combination of independent X variables is best to use in a multiple regression model. critical range  In the Tukey–Kramer method, the value above which differences in means are significant. critical-to-quality (CTQ)  Characteristics that affect quality. critical value  The value in a distribution that cuts off the required probability in the tail for a given confidence level. cross-classification (or contingency) table – descriptive statistics  A summary table for two categorical variables; each cell represents data that satisfy the given values of both variables. cross-classification table – chi-square  A table that displays counts of categorical responses between any number of independent groups. cross-product term  The interaction term. cross-validation  An in-sample method for assessing the validity of a multiple regression model, using half the data to test the accuracy of prediction, having generated the regression model with the other half of the sample data. cumulative percentage distribution  A summary table for numerical data that gives the cumulative frequency of each successive class. cumulative percentage polygon (ogive)  The graphical representation of a cumulative frequency distribution. cumulative standardised normal distribution  Represents the cumulative area under the standard normal curve less than a given value. cyclical component  Displays in a time series as a wave-like up and down change in the individual values sequentially through the series. d2 factor  A factor that represents the relationship between standard deviation and range, used in X-bar and R charts. d3 factor  A factor that represents the relationship between standard deviation and standard error, used in R charts. D3 factor  A factor used to calculate the LCL of an R chart. D4 factor  A factor used to calculate the UCL of an R chart. dashboard  Descriptive analytics methods to present up-to-the-minute operational status about a business. data  The observed values of variables. data discovery  Methods used to take a closer look at historical or status data, to quickly review data for unusual values or outliers, or to construct visualisations for management presentations.


data mining  Sorting through large amounts of data to find relevant information. data snooping  Using a set of data more than once for inference or selecting a model. decision criteria Alternative methods for deciding the best course of action. decision tree The graphical representation of simple and joint probabilities as vertices of a tree. Also known as a tree diagram. deductive reasoning  Reasoning that starts with a hypothesis and examines possibilities to move to a specific conclusion. degrees of freedom  Relate to the number of values in the calculation of a statistic that are free to vary. Deming’s 14 points for management Aspects of total quality management. dependent (or response) variable  In regression, the variable whose values are explained by changes in the independent variable. descriptive analytics  A form of business analytics that explores business activities that have occurred or are occurring in the present moment. descriptive statistics  The field that focuses on summarising or characterising a set of data. difference estimation A method of estimating the level of discrepancy between book and audit values for a population. discrete variables  Can only take a finite or countable number of values. DMAIC model The improvement process used by Six Sigma: ‘Define, Measure, Analyse, Improve, Control’. drill-down  The revealing of the data that underlie a higher-level summary. dummy variable  In regression analysis, takes the values 0 or 1 to indicate the absence or presence of some categorical effect that may be expected to alter the result of the analysis. Durbin–Watson statistic  Measures autocorrelation between data values in a time series by measuring the correlation between each residual and each preceding residual in the series. electronic formats  Data in a form that can be read by a computer. empirical classical probability  Objective probability, obtained from the relative frequency of occurrence of an event. empirical rule  Gives the distribution of data values in terms of standard deviations from the mean for mound-shaped distributions. encoding  Representing data by numbers or symbols to convert the data into a usable form. equal variance (homoscedasticity) In regression, a state of equal variance whereby the variability in the Y values remains equal for different values of the X variable. error sum of squares (SSE) (or sum of squares error) The degree of variation between the X and Y variables that is not explained by the defined regression relationship between the two variables. Specifically, the degree of variation in the Y variable that is not accounted for by variation in the X variable(s). estimated relative efficiency (RE) A comparison between the randomised block design and completely randomised design ANOVA methods. Euclidean distance A distance measure used in cluster analysis where the distance between objects is the square root of the sum of the squared differences between objects over all r dimensions.

event  One or more outcomes of a random experiment. events or states of the world Outcomes that may occur, and their associated probabilities. expected frequency The calculated categorical response value on the basis of a true null hypothesis whereby the total number of responses (successes) for each group is divided on a proportional basis to the total number of successes. expected monetary value (EMV) The sum of payoffs for each event and action multiplied by the respective event probabilities. expected opportunity loss (EOL) The sum of the losses for each event and action multiplied by the respective event probabilities. expected profit under certainty The expected profit that you could make if you had perfect information about which event will occur. expected value of a discrete random variable A measure of central tendency; the mean of a discrete random variable. expected value of perfect information (EVPI) The expected opportunity loss from the best decision. expected value of the sum of two random variables A measure of central tendency; the mean of the sum of two random variables. explained variation  The regression sum of squares. explanatory variable  The independent variable. exponential distribution Continuous probability distribution, used to model the interval between Poisson events. exponential smoothing A statistical method for removing extreme values from a time series. exponential trend model A method for measuring trend in a time series that increases at a constant rate. extreme value (outlier) A value located far from the mean; it will have a large Z score, positive or negative. factor A categorical explanatory variable that has a number of levels. F distribution A right-skewed continuous probability distribution that has as its parameters degrees of freedom in the numerator and in the denominator. finite population correction factor A factor that is required when sampling from a finite population without replacement. first-order autocorrelation Indicates there is a correlation between consecutive values in a time series. first-order autoregressive model A regression model to measure first-order autocorrelation in a time series. first (lower) quartile The value that 25% of data values are smaller than or equal to. five-number summary  Numerical data summarised by quartiles. focus group  A group of people who are asked about attitudes and opinions for qualitative research. frame  A list of the items in the population of interest. frequency distribution A summary table for numerical data; gives the frequency of data values in each class. Friedman rank test  Finds whether multiple sample groups have been selected from populations with equal medians. F test for block effects  A test to determine whether or not all the population block means are equal. F test for factor A effect The F test statistic formed by dividing mean square A by mean square error.


F test for factor B effect The F test statistic formed by dividing mean square B by mean square error. F test for interaction effect The F test statistic formed by dividing mean square AB by mean square error. F test for the slope  Use of the F probability distribution to test whether the slope in simple linear regression is statistically significant. F test statistic for testing the equality of two variances  A ratio of the sample variances from two samples. gauges  A visual display of data inspired by the speedometer in a car. general addition rule  Used to calculate the probability of the joint event A or B. general multiplication rule  Used to calculate the probability of the joint event A and B. geometric mean  The average rate of change of a variable over time. Gini impurity  The product of the probability that each item is chosen multiplied by the probability that the item has been misclassified. goodness-of-fit tests Any test to determine how well a set of sample data matches a specific probability distribution. grand mean  The mean of all the values in all groups combined. groups  The term for different populations used in the analysis of variance. hat matrix  Gives the value of Y predicted as a linear combination of the observed values of Y. hat matrix diagonal elements hi  Tests for the influence of individual sample cases in a multiple regression model. hierarchical clustering A form of cluster analysis where two objects that are determined to be the closest to each other are merged into a single cluster. This process of merging of the two closest objects repeats until there remains only one cluster that includes all objects. histogram  The graphical representation of a frequency, relative frequency or percentage distribution; the area of each rectangle represents the class frequency, relative frequency or percentage. Holt–Winters method  A forecasting methodology that includes a measure of trend and exponential smoothing. homogeneity of variance An assumption that the population variances are equal. hyperbolic tangent function An S-shaped function that varies between −1 and +1. hypergeometric distribution A discrete probability distribution where the random variable is the number of successes in a sample of n observations from a finite population without replacement. hypothesis testing A method of statistical inference used to make tests about the value of population parameters. impossible event  An event that cannot occur. in-control process  A process that contains only common-cause variation. independence of errors  The assumption that the errors in a data set are not related to each other. Particularly relevant in timeseries data. independent (or explanatory) variable  In regression, the variable that explains the values of the dependent variable. index number  A percentage measure of the change in the value of an item between two time periods.

inductive reasoning  Reasoning that uses specific observations to make a general conclusion. inferential statistics  Uses information from a sample to draw conclusions about a population. interaction  The impact of one independent variable depends on the value of another independent variable. interaction term  Refers to interaction within X independent variables in regression or, more specifically, the effect one independent variable has upon another independent variable. interquartile range  A distance measure of variation; the difference between the third and first quartile; the range of the middle 50% of data. interval scale  A ranking of numerical data where differences are meaningful but there is no true zero point. irregular (random) component  Any values in a time series that cannot be accounted for by trend, cyclical or seasonal components. joint event  An event described by two or more characteristics. joint probability  The probability of an occurrence described by two or more characteristics. judgment sample  A sample that gives the opinions of preselected experts. k-means clustering  A form of cluster analysis where objects are assigned to clusters in an iterative process that seeks to make the means of the k clusters as different as possible. Kruskal–Wallis rank test  A non-parametric alternative to one-way ANOVA; it tests the null hypothesis that the different samples are drawn from the same distribution or from distributions with the same median. Laspeyres price index  Uses consumption quantities in the base year to weight price changes measured by the index number. least-squares method  The fitting of a linear relationship between the X and Y variables such that the sum of squared distances of the data values from the line of best fit is at a minimum. level of confidence  Represents the percentage of intervals, based on all samples of a certain size, which would contain the population parameter. level of significance (α)  The probability of rejecting a null hypothesis which is in fact true. levels  Numerical or categorical divisions of a factor. Levene test  A method of testing whether the variances of all populations are equal. linear trend model  A model defining just the trend change as a linear change through time. linearity  The assumption that the relationship between variables is linear. logarithmic transformation  Uses the common or natural log of the sample data to overcome breaches of the homoscedasticity or linearity assumptions in regression. lower control limit (LCL)  The lower limit for a control chart, typically three standard deviations below the process mean. lower specification limit (LSL)  The smallest value a CTQ can have to meet customer expectations. main effects  The effects of individual factors averaged over the levels of other factors. Marascuilo procedure  Enables comparisons between all pairs of groups within a contingency table. marginal probability  The probability of an event described by a single characteristic.


matched  Observations that are analysed together on the basis of a common characteristic. mathematical model  The mathematical representation of a random variable. McNemar test A non-parametric test for testing the difference between two proportions from related samples. mean absolute deviation (MAD)  The mean absolute difference between the actual and predicted values in a time series. mean square Variance, sum of squares divided by degrees of freedom. mean square between (MSB) The sum of squares between groups divided by the appropriate degrees of freedom. mean square between A (MSBA)  The sum of squares due to factor A divided by the appropriate degrees of freedom. mean square between AB (MSBAB)  The interaction of squares, SSAB, divided by the appropriate degrees of freedom. mean square between B (MSBB) The sum of squares due to factor B divided by the appropriate degrees of freedom. mean square between blocks (MSBL)  The sum of squares between blocks divided by the appropriate degrees of ­freedom. mean square error (MSE)  The sum of squares due to random error divided by the appropriate degrees of freedom. mean square total (MST)  The total sum of squares divided by its degrees of freedom. mean square within (MSW)  The within-group sum of squares divided by the appropriate degrees of freedom. measurement error  The difference between survey results and the true value of what is being measured. median A measure of central tendency; the middle value in an array. missing values  Refers to when no data value is stored for one or more variables in an observation. mode  A measure of central tendency; the most frequent value. moving averages  A series of means calculated over time such that each mean is calculated for a set number of observed values. MSB (one-way ANOVA) Mean square between; the sum of squares between groups divided by the associated degrees of freedom. MSB (two-way ANOVA) Mean square B; the sum of squares due to factor B divided by the associated degrees of freedom. multicollinearity  Used in the context of correlation matrices or covariance matrices, to describe the condition when one or more variables from which the respective matrix was calculated are linear functions of other variables. multidimensional scaling (MDS)  A form of business analytics that visualises objects in a two- or more dimensional space with the goal of discovering patterns of similarities or dissimilarities between the objects. multilayer perceptrons (MLPs) Neural networks that contain an input layer, a hidden layer and an output layer. multiple comparisons  A method of identifying which means are different. multiple regression Used to analyse the relationship between several independent or predictor variables and a dependent or criterion variable. multiple regression model Regression models that use more than one independent variable to explain the variation in the dependent variable.

multiplication rule for independent events  The probability of the joint event A and B is the product of the simple probabilities when A and B are independent. multiplicative model  A model in which the joint effect of two or more factors is the product of their effects. mutually exclusive Two events that cannot occur simultaneously. net regression coefficient  The population slope coefficient representing the change in the mean of Y per unit change in X, taking into account the effect of other independent X variables in a multiple regression. neural networks Flexible data mining techniques that ‘learn’ from the data and construct models from patterns and relationships uncovered in data. nominal scale  A classification of categorical data that implies no ranking. non-parametric test A test used when the parameters of the distribution are unknown for the variable of interest in the population. non-probability sample  A sample where selection is not based on known probabilities. non-response error An error that occurs due to the failure to collect information on all items chosen for the sample; this causes non-response bias. normal distribution  A continuous probability distribution represented by a bell-shaped curve. normal probability density function  The mathematical expression that defines the normal distribution. normal probability plot  A graphical approach used to evaluate if data are normal. normality  An assumption that sample values come from a normally distributed population. null hypothesis (H0)  A statement about the value of one or more population parameters which we test and aim to disprove. numerical variables  Take numbers as their variables. observed frequency The known (given) categorical response value usually displayed in a cross-classification table. one-sided confidence interval Gives only an upper or lower bound to the value of the population parameter. one-tail (or directional) test A hypothesis where the entire rejection region is contained in one tail of the sampling distribution. The test can be either upper-tail or lower-tail. one-way ANOVA  A method of comparing the means of observations occurring in a number of groups. one-way ANOVA F test statistic The F value used to test a null hypothesis that the means of all populations are equal. operational definition  Defines how a variable is to be measured. opportunity loss The difference between the highest possible profit for an event and the actual profit. ordered array  Numerical data sorted by order of magnitude. ordinal scale  A scale of measurement that represents the ranks of a variable’s values. outliers  Values that appear to be excessively large or small compared with most values observed. out-of-control process A process that contains special-cause variation as well as common-cause variation. overall F test  Tests for a significant relationship between the Y dependent variable and a set of X independent variables in multiple regression, using the F probability distribution.


Paasche price index Uses consumption quantities in the final year to weight price changes measured as an index number. paired  Observations that are analysed together on the basis of a common characteristic. paired t test for the mean difference in related populations A test for the difference between the means of two populations that have a common characteristic. parameter  A numerical measure of some population charac­teristics. parsimony  The process of choosing the simplest model in terms of independent variables that still adequately explains the variation in the dependent variable. partial correlation  The correlation between two variables after removing the effects of other variables; used to identify spurious correlation and hidden correlation (a correlation masked by the effect of other variables). partial F test  Tests for a significant contribution of an individual independent X variable in multiple regression after all other independent X variables have been included in the regression model, using the F probability distribution. payoff table  A table that shows the values associated with every possible event that can occur for each course of action. payoffs  Values associated with the outcome of events. p chart  A control chart for the proportion of nonconforming items. Pearson correlation  The correlation coefficient, also called the linear or product-moment correlation; determines the extent to which values of two variables are ‘proportional’ to each other. percentage distribution  A summary table for numerical data; it gives the percentage of data values in each class. percentage polygon  A graphical representation of a percentage distribution. permutation  An ordered selection of items. pie chart  A graphical representation of a summary table for categorical data; each category is represented by a slice of a circle of which the area represents the proportion or percentage share of the category. point estimate  A single value, calculated from a sample, that is used to estimate an unknown population parameter. Poisson distribution  Discrete probability distribution, where the random variable is the number of events in a given interval. pooled-variance t test A test for the difference between two population means which assumes that the unknown population variances are equal. population A collection of all members of a group being investigated. population mean  A mean calculated from population data. population standard deviation  A standard deviation calculated from population data. population variance  A variance calculated from population data. portfolio  A combined investment in two or more assets. portfolio expected return A measure of central tendency; a mean return on investment. portfolio risk  A measure of the variation of investment returns. post-hoc A comparison where hypotheses are formulated after the data have been inspected. power curve  A graph showing the power of the test for various actual values of the population parameter. power of a statistical test  The probability that you reject the null hypothesis when it is false and should be rejected.

prediction interval for an individual response Y  The interval for the prediction of a specific value of Y in regression, given a value of X. prediction line  The straight line derived by a regression equation using the method of least squares. predictive analytics  A form of business analytics that identifies what is likely to occur in the (near) future and finds relationships in data that may not be readily apparent using descriptive analytics. prescriptive analytics A form of business analytics that investigates what should occur and prescribes the best course of action for the future. price index  A measure of the average price of a group of goods relative to a base year. primary source Provides information collected by the data analyser. principle of parsimony  The principle that the simplest of two competing statistical processes is to be preferred. probability  The likelihood of an event occurring. probability distribution for a discrete random variable The values of a discrete random variable with the corresponding probability of occurrence. probability sample  A sample where selection is based on known probabilities. process  The value-added transformation of inputs to outputs. process capability  The ability of a process to consistently meet specified customer expectations. processing elements  The hidden layer in multilayer perceptrons (MLPs). pth-order autocorrelation  The correlation between values in a time series that are p periods apart. pth-order autoregressive model  A regression model to measure autocorrelation p order apart in a time series. p-value  The probability of getting a test statistic more extreme than the sample result if the null hypothesis is true. quadratic regression model A multiple regression model with two independent variables, where the second independent variable is the square of the first independent variable. quadratic trend model A non-linear forecast model where the second independent variable is the square of the first independent time-series variable. qualitative forecasting methods Methods that are primarily based on the subjective opinion of the forecaster rather than the analysis of numerical data. quantile–quantile plot  A normal probability plot. quantitative forecasting methods  Methods that use time-series data in a mathematical process to forecast future values of the series. quartiles  Measures of relative standing that partition a data set into quarters. R chart  A control chart for the range. random error  An error that results from unpredictable variations. random experiment  A precisely described scenario that leads to an outcome that cannot be predicted with certainty. randomisation  A process used in an experiment to ensure selection bias is avoided. randomised block design An experimental technique where data in groups are divided into fairly homogeneous subgroups called blocks to remove variability from random error.


randomness and independence Assumptions necessary in ANOVA to avoid bias. range A distance measure of variation; the difference between maximum and minimum data values. ratio scale A ranking where the differences between measurements involve a true zero point. recoded variable  A variable that has been assigned new values that replace the original ones. rectangular distribution A continuous probability distribution where the values of the random variable have the same probability; also called the ‘uniform distribution’. region of non-rejection  The range of values of the test statistic where the null hypothesis cannot be rejected. region of rejection  The range of values of the test statistic where the null hypothesis is rejected; it is also called the ‘critical region’. regression analysis A method for predicting the values of a numerical variable based upon the values of one or more other variables. regression coefficients  The calculated parameters in regression that specify the interval and slope of the linear line defining the relationship between the independent and dependent variables. regression sum of squares (SSR) The degree of variation between X and Y variables that is explained by the defined regression relationship between the two variables. Specifically, the degree of variation in the Y variable that is accounted for by variation in the X variable(s). relative frequency distribution  A summary table for numerical data which gives the proportion of data values in each class. relevant range  The range of values of the explanatory variable, which are themselves the only values relevant to predicting any value in regression. repeated measurements Data collected from the same set of persons or items at different times. replicates  Sample sizes for particular combinations of two factors in two-way ANOVA. residual Difference between the observed values and the corresponding values that are predicted by the regression model; they represent the variance that is not explained by the model. residual analysis A graphical evaluation of the residuals from regression to test for violations of the assumptions of regression. resistant measures Summary measures not influenced by extreme values. response variable  A dependent variable. return-to-risk ratio (RTRR)  The expected monetary value of an action divided by its standard deviation. risk-averter’s curve A utility curve that increases rapidly then levels off as dollar amounts increase. risk of Type II error (b)  The chance that the null hypothesis will not be rejected when it is incorrect. risk-neutral curve  A utility curve where each additional dollar of profit has the same value. risk-seeker’s curve  A utility curve that increases more rapidly as dollar amounts increase. robust A test or procedure that is not seriously affected by the breakdown of assumptions.

sample  The portion of the population selected for analysis. sample coefficient of correlation A coefficient of correlation calculated from sample data. sample covariance  A covariance calculated from sample data. sample mean  A mean calculated from sample data. sample proportion  The number of items that have some characteristic of interest divided by the size of the sample. sample space A collection of all simple events of a random experiment. sample standard deviation A standard deviation calculated from sample data. sample variance  A variance calculated from sample data. sampling distribution The probability distribution of a given sample statistic with repeated sampling of the population. sampling distribution of the mean  The distribution of all possible sample means from a given population. sampling distribution of the proportion  The distribution of all possible sample proportions from samples of a certain size. sampling error  The difference in results for different samples of the same size. sampling with replacement  An item in the frame can be selected more than once. sampling without replacement  Each item in the frame can be selected only once. scatter diagram A graphical representation of the relationship between two numerical variables; plotted points represent the given values of the independent variable and corresponding dependent variable. seasonal component  A factor that measures the regular seasonal change in a time series. second-order autocorrelation Indicates there is a correlation between values two periods apart in a time series. second-order autoregressive model  A regression model to measure second-order autocorrelation in a time series. second quartile  Usually called the median; the middle value in an array that 50% of data values are smaller than, or equal to. secondary source  Provides data collected by another person or organisation. separate-variance t test A test for the difference between two population means, used when the unknown population variances cannot be assumed to be equal. shape  The pattern of the distribution of data values. Shewhart–Deming cycle An improvement process used by TQM: ‘plan, do, study, act’. side-by-side bar chart A graphical representation of a crossclassification table. simple event  A single outcome of a random experiment. simple linear regression A regression method using a single independent variable to predict values of the numerical dependent variable. simple price index A percentage measure of the change in the price of a single item between two time periods. simple random sample  A sample where each item in the frame has an equal chance of being selected. single linkage A measure of distance that bases the distance between clusters on the minimum distance between objects in one cluster and another cluster. Six Sigma management An approach to process improvement with an emphasis on accountability and bottom-line results.


skewed  Non-symmetrical distribution; where the distribution of data values above and below the mean differ. sparklines  A descriptive analytics method that summarises timeseries data as small, compact graphs designed to appear as part of a table. special (or assignable) causes of variation  Large fluctuations or patterns in data that are not inherent to a process; these fluctuations reflect changes in the process. specification limits  Technical requirements based on customers’ needs and expectations. spread (dispersion)  The amount of scattering of data values. square-root transformation  Uses the square-root of the sample data to overcome breaches of the homoscedasticity or linearity assumptions in regression. standard deviation A measure of variation based on squared deviations from the mean; closely related to the variance. standard deviation of a discrete random variable  A measure of variation, based on squared deviations from the mean; closely related to the variance. standard deviation of the sum of two random variables A measure of variation; closely related to the variance. standard error  The square root of the expected squared difference between the random variable and its expected value. standard error of the estimate The standard deviation of the Y predicted values in a regression around the line of best fit. standard error of the mean Reflects how much the sample mean varies from its average value in repeated experiments. standard error of the proportion  The standard deviation of the sample proportion for repeated samples. standardised normal random variable  A normal random variable with a mean of 0 and a standard deviation of 1. state of statistical control  A process that is in control. statistic A numerical measure that describes a characteristic of a sample. statistical independence The occurrence of an event does not affect the occurrence of a second event. statistical packages Computer programs designed to perform statistical analysis. statistics  A branch of mathematics concerned with the collection and analysis of data. stem-and-leaf display A graphical representation of numerical data that partitions each data value into a stem portion and a leaf portion. stepwise regression A model-building regression technique to find subsets of independent variables that most adequately predict a dependent variable given the specified criteria for adequacy of model fit. strata  Subpopulations composed of items with similar characteristics in a stratified sampling design. stratified sample  Items randomly selected from each of several populations or strata. stress statistic A goodness-of-fit statistic used in multidimensional scaling. structured data  Data that follow an organised pattern. Studentised deleted residual A statistical method of residual analysis using the t probability distribution that identifies individual cases in the sample data of a multiple regression that have high individual influence on the regression equation.

Studentised range distribution  A probability distribution used for testing all differences between pairs of means. Student’s t distribution A continuous probability distribution whose shape depends on the number of degrees of freedom. subgroup  A sample used in a control chart. subjective probability  The probability that reflects an individual’s belief that an event occurs. sum of squares (SS)  The sum of the squared deviations. sum of squares between blocks (SSBL)  That part of the withingroup variation that is due to differences between the blocks. sum of squares between groups (SSB)  That part of total variation that is due to differences between groups. sum of squares due to factor A (SSA)  Variation due to factor A in two-way ANOVA. sum of squares due to factor B (SSBB)  Variation due to factor B in two-way ANOVA. sum of squares due to interaction (SSAB) The interacting effect of specific combinations of factor A and factor B. sum of squares error (SSE) (or error sum of squares)  The sum of squared differences between the values in each cell and the corresponding mean of that cell. sum of squares total (SST) (or total sum of squares)  The total variation; the sum of squared differences between each value and the grand mean. sum of squares within groups (SSW)  The sum of squared differences between each value and the mean of its own group. summary table  A table that summarises categorical or numerical data; it gives the frequency, proportion or percentage of data values in each category or class. symmetrical Where the distribution of data values above and below the mean are identical. systematic sample  A method that involves selecting the first element randomly then choosing every kth element thereafter. table of random numbers A table showing a list of numbers generated in a random sequence. tampering  Over-adjustment that increases variation in a process. test of independence  Tests for independence between the rows and columns of a contingency table. test statistic A value derived from sample data that is used to determine whether the null hypothesis should be rejected or not. third (upper) quartile The value that 75% of data values are smaller than or equal to. time series A sequence of measurements taken at successive points in time. time-series forecasting methods Statistical methods for forecasting future values of a variable based entirely on the past values of that variable. time-series plot A graphical representation of the value of a numerical variable over time. total amount  The sum of values. total quality management (TQM) An approach to quality improvement that emphasises continuous improvement and the total system. total sum of squares (SST) (or sum of squares total)  The total variation. total variation  The sum of the squared differences between each individual value and the grand mean.


training data  A set of data used by neural networks to uncover a model that by some criterion best describes the patterns and relationships in the data. transformation formula A Z-score formula used to convert any normal random variable to the standardised normal random variable. treatment effect  A variation due to group membership. treemaps  A descriptive analytics method that helps visualise two variables, one of which must be categorical. trend component An overall long-term upward or downward movement in the values of a time series. t test for the correlation coefficient  A hypothesis test for the statistical significance of the correlation coefficient in regression using the t probability distribution. t test for the slope A hypothesis test for the statistical significance of the regression slope b using the t probability distribution. t test of hypothesis for the mean A test about the population mean that uses a t distribution. Tukey–Kramer multiple comparison procedure A method of determining which of the group means are significantly ­different. Tukey procedure A method of making pairwise comparisons between means. two-factor factorial design  Analysis of variance where two factors are simultaneously evaluated. two-tail test A hypothesis test where the rejection region is ­divided into the two tails of the probability distribution. two-way ANOVA  An analysis of variance where two factors are simultaneously evaluated. Type I error  The rejection of a null hypothesis that is true and should not be rejected. Type II error  The non-rejection of a null hypothesis that is false and should be rejected. unbiased  If the average of all possible sample means equals the population mean then the sample mean is unbiased. unexplained variation  The error sum of squares. uniform distribution A continuous probability distribution in which the values of the random variable have the same probability; also called the ‘rectangular distribution’. unstructured data  Data that have no repeated pattern. unweighted aggregate price index  A price index for a group of items where each item has an equal weight. upper control limit (UCL)  The upper limit for a control chart, typically three standard deviations above the process mean. upper specification limit (USL)  The largest value a CTQ can have to meet customer expectations. utility  A measure of the desirability of different outcomes for an individual decision maker.

variables  Characteristics or attributes that can be expected to differ from one individual to another. variables control charts  Control charts for numerical variables. variance  A measure of variation based on squared deviations from the mean; closely related to the standard deviation. variance inflationary factor (VIF)  A factor that measures the impact of collinearity among the Xs in a regression model by stating the degree to which collinearity among the predictors reduces the precision of an estimate. variance of a discrete random variable  A measure of variation, based on squared deviations from the mean; closely related to the standard deviation. variance of the sum of two random variables  A measure of variation; closely related to the standard deviation. variation  The spread, scattering or dispersion of data values. Venn diagram  The graphical representation of a sample space; joint events shown as ‘unions’ and ‘intersections’ of circles representing simple events. Ward’s minimum variance method  A measure of distance that bases the distance between clusters on the sum of squares over all variables between objects in one cluster and another cluster. weighted aggregate price index  A price index for a group of items where each item has a different weight based on volume of consumption. Wilcoxon rank sum test  A non-parametric test for testing the difference between two medians from independent samples. Wilcoxon signed ranks test  A non-parametric test for testing the mean difference for paired samples. within-group variation  That part of total variation due to differences within individual groups. X-bar chart  A control chart for the process mean. Y intercept  Represents the mean value of Y when X equals zero in regression. Z scores  Measures of relative standing; number of standard deviations given data values are from the mean. Z test for the difference between two means  A test statistic used in hypothesis tests about the difference between means of two populations. Z test for the difference between two proportions  A test statistic used in hypothesis tests about the difference between the proportions of two populations. Z test for the proportion  A test statistic used for a test of the population proportion. Z test of hypothesis for the mean  A test about the population mean which uses the standard normal distribution. Z test statistic  A test statistic calculated by converting a sample statistic to a standard normal score.


Index

Page numbers in bold indicate definitions of key terms. Page numbers in italics indicate figures.

2 × 2 tables  608–9, 611 2 × 3 tables  616 4 × 3 tables  625 a priori classical probability  148, 148–9 ABS (Australian Bureau of Statistics)  6, 8, 14, 24 accounting  8 adjusted R2  511, 537, 542–3 aggregate price indices  591, 593–4 All Australia Index  596 All Industries Index  597 alternative hypotheses (H1)  316–17, 317, 323–4, 332 analysis of variance (ANOVA), defined  402 see also one-way analysis of variance; randomised block design; two-way analysis of variance annual time-series data  545 ANOVA summary tables  405, 406–8, 408, 419, 430 Anscombe’s quartet  494 arithmetic mean  92, 92–3 artificial data  494 assumptions of regression (LINE)  473 ASX 200 Index  596–7 auditing  300, 300–7 Australian Bureau of Statistics (ABS)  6, 8, 14, 24 Australian Securities Exchange (ASX)  597 autocorrelation defined  477 first-order  570 measuring with Durbin–Watson statistic  477–80, 480, 503 pth-order  570 regression analysis  475, 477–80, 478–80 second-order  570 autoregressive modelling  570, 570–8, 573, 575–8, 600–1, 605–6 averages  see moving averages bar charts  39, 39–40, 39–40, 79 base period  591 Bayes’ theorem  163, 163–6, 168–9, 173, 178 bell-shaped distribution  115, 122, 122, 213, 214 between-block variation  416, 416, 417, 417–18 between-group variation  402, 403, 416, 416, 417–18, 439 bias  25 big data  14, 14–15 bin ranges  83 binomial distribution binomial probabilities  191–5, 193–4 defined  189 example  190 formula  191, 204 mean  194, 205 normal approximation to  238–41, 240, 242

properties  189 standard deviation  194, 205 using statistical software  193, 193–4 binomial probabilities  191–5, 193–4 block effects  416, 416–22, 421 blocks  416 see also randomised block design box-and-whisker plots confidence interval estimation  289 defined  121 examples  121–2, 121–2 one-sample tests  338, 350 two-sample tests  364, 375 using Excel  137–8, 138 boxplots  121 brainstorming  14 bullet graphs  64, 64–5, 65, 66, 89–90 business analytics  62–7, 63 business forecasting  545 CAI (Computer Assisted Interview)  24 call monitoring  368 car exports  505 categorical data, tables and charts for  38–42, 73, 77–9 categorical variables  9–11, 10, 10, 55–7, 86–7, 527–8 causal forecasting methods  545 cell means plots  432–4, 432–4, 447 censuses  8, 26 Central Limit Theorem  214, 256, 256–8, 257, 266, 280, 322, 334, 337 central tendency  92, 92–9, 135–6 certain events  148 chartjunk  65 charts bar charts  39, 39–40, 39–40, 79 bullet graphs  64, 64–5, 65 for categorical data  38–42, 39–41 choosing an appropriate chart  73 misusing graphs  69–71, 69–71 for numerical data  46–53 pie charts  40, 40–2, 41, 79 side-by-side bar charts  56, 56–7, 56–7 using statistical software  79 see also scatter diagrams; time-series plots Chebyshev rule  116, 116–17, 137 chi-square analysis chi-square (χ2) distribution  610 χ2 test statistic  609–13, 612, 615–18, 622–33, 641 goodness-of-fit tests  627, 627–31, 635 key formulas  635 Marascuilo procedure  619, 619–20, 620, 635 test for differences between more than two proportions  615–20, 616, 618, 620, 635, 641 test for differences between two proportions  608–13, 610, 612–13, 635, 641 test for standard deviation  632–3, 632–3, 635 test for the variance  632–3, 632–3, 635

test of independence  622, 622, 622–6, 625, 641 using statistical software  612–13, 618, 620, 625, 633, 641 chi-square (χ2) distribution  610 chunk samples  17 class boundaries  47 class mid-point  47, 83 class width  46, 46–7 classical multiplicative time-series model  546, 546–7, 585, 600 classical parametric procedures  337 classical statistical inference  214 cluster samples  17, 21 clusters  21 coefficient of correlation  125, 125–8, 138, 486, 497, 503 coefficient of determination  469, 469–71, 470, 497 coefficient of multiple determination  511, 511–12, 537, 542–3 coefficient of partial determination  523, 523–4, 537 coefficient of variation  105, 105–6, 131, 136–7 collectively exhaustive events  16, 47, 153, 215 collinearity  535, 535–6 combinations  170, 170–1, 173, 179, 190–1, 204 complement  150 completely randomised designs  402, 403, 416, 444–5 Computer Assisted Interview (CAI) techniques  24 computer technology  26 conditional probability  156, 156–61, 164, 173 confidence coefficient (1 − α)  319 confidence interval estimate, defined  280, 485 confidence interval estimation applications in auditing  300–7 compared with prediction interval  492, 492 difference between the means of two independent populations  364–5, 391 difference between the means of two related populations  377, 391, 397 difference between two proportions  388, 391, 400 ethical issues  307 and hypothesis testing  327–8 for the mean (σ known)  280–4, 281, 283, 308, 313 for the mean (σ unknown)  285–6, 285–90, 288–9, 308, 313 for the mean difference  377 for the mean of Y  489–91, 497 of the mean response  508, 508 one-sided  304, 304–5 for the population total  300–2, 314 for the proportion  291–3, 292, 305, 308, 314 of the slope  485, 497, 537 for total difference  314 using statistical software  288, 289 value of a population slope  518–19 see also difference estimation; sample size determination


confidence interval statements  287–8, 288 confidence level  319 consent, in research  350 Consumer Price Index (CPI)  595, 596 contingency tables 2 × 2 tables  608–9, 611 2 × 3 tables  616 4 × 3 tables  625 chi-square analysis  608–9, 615 defined  55, 608 examples of  55–7 probability  150 using statistical software  86–7, 625 continuity corrections  238–9 continuous numerical variables  181 continuous probability density function  213, 213–14 continuous probability distributions  146, 213–14, 405 continuous random variables  213–14, 238–9 continuous variables  10 convenience sampling  17, 17 correlation coefficient  125, 125–8, 138, 486, 497, 503 counting rules  168–71, 173, 178–9 covariance  123, 123–5, 138, 185, 185–8, 204, 209 coverage errors  23, 25 CPI (Consumer Price Index)  595, 596 critical range factor A  436, 440 factor B  436, 440 for Marascuilo procedure  619 randomised block design  422–3, 439 for Tukey–Kramer procedure  409–10, 439 for Tukey procedure  438 critical region  318 critical value approach to hypothesis testing  322–5, 323 to one-tail tests  329–30, 330 t test for the mean (σ unknown)  334–6, 335 Z test for the proportion  341–2, 341–2 critical values  283, 286, 318, 319, 335, 335, 380–2, 381–2, 391 CRM (customer-relationship management system)  26 cross-classification (contingency) tables  55, 150, 608, 608–9 cross-product term  528 cumulative percentage distributions  49, 49–50, 82, 85–6 cumulative percentage polygons (ogives)  52, 52–3, 53 cumulative standardised normal distribution  217, 217–18, 219, 223 curvilinear relationship  457, 457–8 customer-relationship management system (CRM)  26 cyclical component  546, 546–7 dashboards  63–6, 64, 64 data analysing  129, 350–1 big data  14, 14–15 categorical  9–11, 38–42 cleaning  15–16, 350–1 collecting  13–16, 350 defined  6

discarding 350–1 formatting 15 independence of  420 interpreting 129 measuring 10–11 numerical  9–11, 43–53 randomness 350 sources of  13–14 stacked and unstacked data  79 structured  15 unstructured  15 Data Analysis Toolpak  see Microsoft Excel data discovery  66 data mining  26 data point  6 data snooping  350 Dax 30 Index  597 decision making, in hypothesis testing  320 decision trees  158, 158–9, 158–9, 165–6 deductive reasoning  281 degrees of freedom  285, 285–7, 286, 366, 391, 404 Delphi technique  14 dependent variables  456, 507–8, 508 descriptive analytics  62–7, 63, 88–90 descriptive statistics  8, 55, 109 see also numerical descriptive measures difference estimation  302, 302–4 diffusion indices  545 directional tests  330 discrete random variables  181–4, 185–7, 208 discrete variables  10 dispersion (spread)  99 distribution bell-shaped  115, 122, 122, 214 exponential 213, 214, 235, 235–7, 236, 242, 247, 247, 257, 257–8 hypergeometric distribution  200, 200–2 probability distribution for a discrete random variable  181, 181–4, 182, 208 shape  107, 107–9 skewed distribution  107, 107, 107–8, 109, 110, 120, 137 symmetrical  107, 107, 107–8, 109, 120 uniform 213, 214, 233, 233–4, 234, 256, 257, 258 see also binomial distribution; chi-square distribution; normal distribution; Poisson distribution; sampling distributions double exponential smoothing  555 Dow Jones Industrial Average  596 drill-down  66 dummy variables  525–31, 526, 527, 529–30, 532, 587 Durbin–Watson statistic  475, 479, 479–80, 480, 497, 503 econometric modelling  545 electronic formats  15 empirical classical probability  149 empirical rule  115, 115–16, 137 encoding  15 equal variance  473, 476, 476 error sum of squares (SSE)  467, 468, 468–9, 497, 581 see also sum of squares error (SSE) errors independence of  473

random error  402 residual 579–80 Type I errors  319, 319–21, 344 Type II errors  319, 319–21, 344, 345–6, 345–6 estimated relative efficiency (RE)  422, 439 ethical issues calculating probabilities  172 confidence interval estimation  307 hypothesis testing  349–51 misuse of graphs  70–1, 70–1 selective use of statistics  129–30 survey errors  25 using numerical descriptive measures  129–30 using regression analysis  493–4, 495, 496 events certain events  148 collectively exhaustive events  16, 47, 153, 215 complement of  150 defined  149 impossible events  148 independent events  159–61 joint events  150 mutually exclusive events  16, 47, 153, 215 rare events  60–1 sample spaces and  150 simple events  149 Excel  see Microsoft Excel exchange rate  505 executive information systems  63 expected frequency  609, 611–13, 612–13, 615–19, 623–8, 630, 635 expected returns  187–8 expected value of a discrete random variable  182, 182–3, 204 of the sum of two random variables  186, 186–7, 204 explained variation  467 explanatory variables  457, 525 exponential distribution  213, 214, 235, 235–7, 236, 242, 247, 247, 257, 257–8 exponential growth, forecasting equations  586, 601 exponential smoothing  551, 551–3, 552, 600, 604 exponential trend model  558, 558–60, 560–1, 581, 582–3, 585–8, 588, 600–1, 605 exponentially weighted moving averages  551 extrapolation (regression analysis)  463, 493 extreme values (outliers)  106 F distribution  378, 380–1, 381, 391, 405 F test for block effects  419–20, 421 for the difference between two variances 378–83, 379, 381–2, 391, 399–400, 399–400 for differences between more than two means 402–8, 403, 405–6, 408 for factor A effect  428, 440 for factor B effect  428, 428–9, 440 for interaction effect  429, 440 one-way ANOVA F test  402–8, 403, 404, 405–6, 408, 439 overall F test  511–12, 512, 513, 537, 542–3 partial F test  520, 520–3, 537 for the slope  484, 484–5, 484–5, 497

F test statistic  378, 378–80, 391, 419, 439, 512, 522–3 factor A  426–8, 436, 439, 440 factor B  426–8, 436, 439, 440 factorials  169, 173, 179 factors (variables)  402 finance, use of statistics in  8, 187–8, 209 financial indices  596–7 finite population correction factor  201 first differences  561–3, 605 first (lower) quartile  97, 100, 130 first-order autocorrelation  570 first-order autoregressive model  570–1, 571, 575, 577–8, 577–8, 581, 582–3, 600 fitted pth-order autoregressive equation  573, 601 five-number summary  120, 120–1, 137 focus groups  14 forecasting  see time-series forecasting frame  17 frequencies  see expected frequency; observed frequency frequency distributions approximating standard deviation from  131 approximating the mean from  131 constructing 46–8 defined  46 finding numerical descriptive measures from 118–19 relative frequency distributions  48–9 using statistical software  80–4, 81 frequency polygons  83 Friedman rank test  420 gauges  64, 64–6, 65, 88–9 Gaussian distribution  214 general addition rule  154, 154–5, 173 general multiplication rule  160, 160–1, 173 geometric mean  98, 98–9, 130 geometric mean rate of return  98–9, 131 goodness-of-fit tests  627, 627–31, 635 Gosset, William S.  285 grand mean  403 graphs bullet graphs  64, 64–5, 65 misuse of  69–71, 69–71 groups  402 halo effect  25 highest-order autoregressive model  571–3, 575, 601 histograms  50, 50–1, 82–5 Holt–Winters method  567, 567–8, 569, 581, 582–3, 600 homogeneity of variance  411, 412, 445 homoscedasticity  473 households  108, 108–9 Human Development Index  456–64, 460–1, 468, 469, 471, 475–6, 483–6, 484, 489–92 hypergeometric distribution  200, 200–2, 205, 211 hypothesis testing comparing the means of two independent populations 359–68, 361–4, 367, 391, 396–8, 396–8 comparing the means of two related populations 371–7, 374–6, 391, 398–9, 398–9 comparing two population proportions  384–8, 385–6, 391, 400, 400

and confidence interval estimation  327–8 confidence level  319 confirmatory approach  359 critical value approach  322–5, 323, 329–30, 330, 334–6, 335, 341–2, 341–2 decision-making risks  319–21 defined  316 ethical issues  349–51 F test for difference between two variances 378–83, 379, 381–2, 399–400, 399–400 hypothesis-testing methodology  316–21 in multiple regression models  516–19, 517 null and alternative hypotheses  316–17, 323–5, 327, 332, 350 one-sample tests  316–57 p-value approach  325–7, 326, 331–2, 342–3 potential pitfalls  349–51 power of a statistical test  320, 320–1, 344–7, 344–8 regions of rejection and non-rejection  318, 318–19, 323, 324 selecting a test  352, 353 six-step method  323–4 t test of hypothesis for the mean  334, 334–7, 335–8 test statistic  318, 323–4 two-sample tests  359–400 using statistical software  326, 326, 331, 336–7, 336–8, 342, 356–7, 362–3, 367, 374, 376, 381, 386, 396–400, 397–400 Z test of hypothesis for the mean  322, 322–8, 323, 326 Z test of hypothesis for the proportion  340, 340–3, 341 see also chi-square analysis; F test; one-tail tests; t test; two-tail tests; Z test impossible events  148 independence in ANOVA  410, 420 chi-square test of  622, 622, 622–6, 625, 641 of data  420 of errors  473 in regression analysis  475 statistical  159, 159–60 independent events  159–61 independent populations  359–68, 361–4, 367, 378, 391, 396–8, 396–8 independent variables  456, 506–7, 520, 523–4, 528–9, 535–6, 537 index numbers  591, 591–7 inductive reasoning  281 inferential statistics  8, 26, 280–1 information technology  26 informed consent  350 insurance premiums  613 interaction effects  420, 422, 426, 426, 426–35, 430–5, 447 interaction terms  528 interactions (variables)  528, 528–31, 529–30, 532 interpolation (regression analysis)  463 interquartile range  100, 100–1, 105, 131, 136 interval estimates  see confidence interval estimation interval scales  11, 11

intervals 280 investment services  14 irregular (random) component  546, 546–7 joint events  150 joint probability  152, 152–3 judgment samples  17, 17 kurtosis  109, 137 labour force participation rates, for females  546, 546 lagged predictor variables  545, 605 Laspeyres price index  594, 594–5, 596, 601 leading indicator analysis  545 leading questions  24 least-squares method  459–61, 460, 460–1, 555–63, 557, 581, 606 least-squares trend-fitting and forecasting  555–63, 557, 585–9, 588, 605 level of confidence  282, 282–4 level of significance (α)  319, 323–4, 350, 431 levels (factors)  402 Levene test  411, 445 LINE  473, 474 linear relationship  457, 457–8 linear trend models  555, 555–6, 557, 561–2, 581, 582–3, 600, 605 linearity  473, 474, 475 lotteries 172 lower quartile  97, 100, 130 lower-tail critical values  380–2, 381–2, 391 MAD (mean absolute deviation)  580, 601, 606 main effects  431, 432 management 8 Marascuilo procedure  619, 619–20, 620, 635 margin of error  295 marginal probability  151, 151–2, 173 market research  14 marketing 8 matched observations  371, 373 mathematical models  189 mean arithmetic mean  92, 92–3 of binomial distribution  194, 205 calculation of  92–3, 135 comparing the means of two independent populations 359–68, 361–4, 367, 391, 396–8, 396–8 comparing the means of two related populations 371–7, 374–6, 391, 398–9, 398–9 confidence interval estimation  280–90, 281, 283, 285–6, 288–9, 308, 313, 364–5, 377, 391 defined  92 exponential distribution  236, 242 F test for differences between more than two means 402–8, 403, 405–6, 408 from a frequency distribution  118–19, 131 geometric mean  98, 98–9, 130 grand mean  403 of hypergeometric distribution  201, 205 versus median  110 normal distribution  216, 216, 219 one-tail test of hypothesis for  330, 330, 331–2

mean (continued) related populations  371–7, 374–6 sample mean  93, 93–4 sample size determination  294–6, 296, 308, 314 sampling distribution of  249–58, 253, 256, 263 in shape of distribution  107–8 standard error of  251, 251–2, 252, 307 t test of hypothesis for  334, 334–7, 335–8, 353, 356 unbiased property of  249–51 uniform distribution  233, 234, 242 Z test of hypothesis for  322, 322–8, 323, 326, 353, 356 see also population mean mean absolute deviation (MAD)  580, 601, 606 mean difference  302 mean square between (MSB)  404 mean square between A (MSBA)  428 mean square between AB (MSBAB)  428 mean square between B (MSBB)  428 mean square between blocks (MSBL)  418, 418–19 mean square error (MSE)  418, 418–19 mean square total (MST)  404 mean square within (MSW)  404 mean squares  418–19, 439 mean values, estimating  489–92, 503 measurement error  24, 24–5 measurement, scales of  11–12, 11–12 see also numerical descriptive measures median  94, 94–6, 110, 130, 135 Microsoft Excel analysis of variance  408, 412, 430, 433–5, 444–7, 444–7 autoregressive modelling  575–7 bar charts  39–40, 79 basic probabilities  178 basics 30–1 Bayes’ theorem  178 binomial probabilities  193, 193–4, 209–10 box-and-whisker plots  137–8, 138, 289, 338, 375 bullet graphs  89–90 calculating coefficient of correlation  138 calculating covariance  138 calculating mean, median and mode  135, 135 calculating quartiles  136 calculating variation  136–7 Central Limit Theorem  266 chi-square analysis  612–13, 618, 620, 625, 633, 641 coefficient of variation  136–7 collecting data  35 conditional probability  178 confidence interval estimates  288, 289, 292, 313–14, 492 contingency tables  86–7 counting rules  178–9 covariance of a probability distribution  209 creating charts  33 Data Analysis Toolpak  109, 109 defining classes, bins and mid-points  83–4 defining data  35 descriptive measures for a population  137 descriptive statistics  230

determining sample size  296 entering data  31–3, 32 evaluating normality  246, 246–7 exponential distribution  236, 236, 247, 247 F test for difference between two variances  381 frequency distributions  80–2, 81 gauges 88–9 histograms  50, 83–5 hypergeometric distribution  202, 202, 210, 211 hypothesis testing  326, 326, 331, 336–7, 336–8, 342, 356–7, 362–3, 367, 374, 376, 381, 386, 396–400, 396–400 Marascuilo procedure  620 multiple regression  507–8, 514–15, 515–16, 527, 527, 529–30, 532, 541–3, 542 normal distribution  246 normal probabilities  226, 227, 230, 246, 246–7 normal probability plots  232, 338 numerical descriptive measures for a population 137 ogives 85 one-sample t test  336, 336 opening and saving workbooks  31 ordered arrays  80 paired t tests  376 percentage and cumulative percentage polygons 85–6 pie charts  41, 79 Poisson distribution  198, 210 polygons 85 pooled t tests  362–3 printing workbooks  33–5, 34 probability distribution for a discrete variable 208 probability plots  289, 338 problems with early versions of  26 randomised block design  421, 446, 446 relative frequency  82 residual analysis  478–9, 514–15, 515–16 sample size determination  314 sampling distributions  82, 265, 265 scatter diagrams  59, 87, 87–8, 408, 459, 461 separate-variance t test  367 side-by-side bar charts  56, 87 side-by-side charts  87 simple linear regression  460–1, 461–2, 469, 472, 475, 483, 492, 492, 502, 502–3, 521 sparklines 88 stacked and unstacked data  79 standard deviation  136–7 stem-and-leaf displays  80 summary tables  77–8, 77–8 tables and charts  45 time-series forecasting  550, 552, 557–8, 569, 575–7, 582–3, 585, 588, 604, 604–6, 606 time-series plots  60, 88 Tukey–Kramer procedure  410 two-sample tests  338, 396–400, 396–400 types of sampling methods  36 using Excel on a Mac  35 using formulas in worksheets  33 variance 136–7 Z scores  136–7 Z test for difference between two proportions  386

Minitab (software)  26 missing values  16 mode  98, 135 Morningstar 14 moving averages  548, 548–51, 550, 604 MSB (mean square between)  404 MSBA (mean square between A)  428 MSBAB (mean square between AB)  428 MSBB (mean square between B)  428 MSBL (mean square between blocks)  418, 418–19 MSE (mean square error)  418, 418–19 MST (mean square total)  404 MSW (mean square within)  404 multiple comparisons  408, 408–10, 410, 422–3, 435–6, 445 multiple determination, coefficients of  511, 511–12, 537, 542–3 multiple regression models  505–36 coefficients of multiple determination  511, 511–12, 537, 542–3 coefficients of partial determination  523, 523–4, 537 collinearity  535, 535–6 confidence interval estimate for the slope  518–19, 537 defined  505 dummy variables  525–31, 526, 527, 529–30, 532 interactions  528, 528–31, 529–30, 532 interpreting the regression coefficient  505–7 with k independent variables  506, 524, 537 key formulas  537 overall F test  511–12, 512, 513, 537, 542–3 partial F test  520, 520–3 population regression coefficients  516–19, 517, 543 predicting the dependent variable Y 507–8, 508, 542 residual analysis for  514–15, 515–16, 543 testing for significance of overall model  512, 513 testing for the slope  516–18, 517, 537 testing portions of  520–4, 521–2, 543 with two independent variables  506–7, 507, 537 using statistical software  507–8, 527, 527, 529–30, 532, 541–3, 542 multiplication rule for independent events  161, 173 multiplication rules  160–1 multiplicative model  546, 546–7, 585, 600 mutually exclusive events  16, 47, 153, 215 NASDAQ Index  596 National Health Survey (2010–11)  24 negatively skewed distribution  107, 107–8, 120 net regression coefficients  506 Newspoll 12 Nielsen 14 Nikkei Index  597 nominal scales  10, 10, 10–11 non-compliance, rate of  304–5 non-parametric procedures  337, 364, 375 non-probability sampling  17, 17, 17–18, 25 non-response bias  23 non-response errors  23, 25

normal distribution approximating binomial distribution  238–41, 240, 242 bell-shaped curve of  213, 214 chi-square goodness-of-fit tests for  629–31 constructing normal probability plots  231, 232 cumulative standardised  217, 217–18, 219 defined  214 different normal distributions  216, 216, 219 evaluating normality  227, 229–31, 230, 232, 246, 246–7 example calculations  219–26, 220–2, 224–5 misapplication of in business  227 sampling distribution  256–7, 257 standardised normal distribution  285, 285 theoretical properties  214, 229–31 transformation formula  216, 216–19, 217–19, 224 using statistical software  226, 227, 246–7, 246–7 normal probability density function  215, 215–18, 242 normal probability plots  231, 232, 246, 289, 338 normality  227, 229–31, 230, 232, 246, 246–7, 410, 420, 473, 476 null hypothesis (H0)  316, 316–17, 323–5, 327, 332, 350 numerical data, tables and charts for  9–11, 43–53, 73, 79–86 numerical descriptive measures box-and-whisker plots  121, 121–2, 121–2 coefficient of variation  105, 105–6 ethical issues  129–30 five-number summary  120, 120–1 from a frequency distribution  118–19 measures of central tendency  92–9 objectivity in data analysis  129 for a population  113–17, 137 shape  92, 107, 107–9 variance and standard deviation  101–5 Z scores  106, 106–7 numerical variables  9–11, 10, 10, 59–61, 181 objectivity, in data analysis  129 observation 6 observational studies  14 observed frequency  609, 611, 623, 627–8 observed level of significance (p-value) 325 OECD (Organisation for Economic Co-operation and Development)  458 ogives  52, 52–3, 53, 85 one-factor experiments  402 one-sample tests  316–57 flow chart for selecting  352 one-sample t test  336 potential pitfalls  349–51 power of a statistical test  320, 320–1, 344–7, 344–8 t test of hypothesis for the mean  337 using statistical software  356–7 Z test of hypothesis for the mean  337, 353 Z test of hypothesis for the proportion  340, 340–3, 341, 353 see also hypothesis testing; one-tail tests one-sided confidence interval  304, 304–5

one-tail tests choice of  350 comparing two population proportions  364–5 critical value approach  329–30, 330 defined  330 ethical considerations  350 F test for difference between two variances  382, 382–3 for the mean  330, 330, 331–2 p-value approach  331–2 power of a test  344, 344–5, 347, 347 t test  374 Z test for population mean  344, 344 one-way analysis of variance  402–13 assumptions  402, 410–11 between-group variation  402, 403, 439 calculating mean squares  404, 439 completely randomised design  402–13, 403, 405–6, 408, 410, 412 defined  402 example calculation  412–13 F test for differences between more than two means 402–8, 403, 405–6, 408 F test statistic  404, 404–5, 405, 439 key formulas  439 Levene test  411 summary tables  405, 406–8, 408, 419 total variation  403, 403, 403–4, 439 Tukey–Kramer procedure  408, 408–10, 410, 413, 435, 439, 445 using statistical software  406, 408, 412, 444–5, 445 within-group variation  402, 404, 439 see also randomised block design online surveys, rigging of  24 operational definition  6, 16 opinion polls  307 ordered arrays  43, 43–4, 80 ordinal scales  10–11, 11, 11 Organisation for Economic Co-operation and Development (OECD)  458 outliers  16, 106 overall F test  511–12, 512, 513, 542–3 p-value  325 p-value approach to hypothesis testing  325–7, 326 to one-tail tests  331–2 t test for the mean (σ unknown)  336–7 to two-tail tests  325–6, 326 Z test for the proportion  342, 342–3 Paasche price index  595, 595–6, 601 paired observations  371 paired t test  372, 372–6, 374–6, 391, 398–9, 398–9 parameters  8, 9 parametric procedures  337 parsimony  581 partial determination, coefficient of  523, 523–4, 537 partial F test  520, 520–3, 537 partial regression coefficients  506 Pearson, Karl  227 percentage differences  561–3, 605 percentage distributions  48, 48–9, 82 percentage polygons  51, 52, 85–6 perfect negative correlation  125, 125

perfect positive correlation  125, 125 perishable inventory  63 permutations  170, 173, 179 PHStat  see Microsoft Excel pie charts  40, 40–2, 41, 79 point estimate  280 Poisson distribution calculating probabilities  197–9 chi-square goodness-of-fit tests for  627–9 defined  196 formula 205 properties 196 relation to exponential distribution  235 using statistical software  198, 210 political polls  307 polls  12, 172, 307 polygons 51–3, 52–3, 83, 85–6 pooled-variance t test  360, 360–4, 361–4, 391, 396–7, 396–7 population comparing the means of two independent populations 359–68, 361–4, 367, 391, 396–8, 396–8 comparing the means of two related populations 371–7, 374–6, 391, 397, 398–9, 398–9 comparing two population proportions  384–8, 385–6, 391, 400, 400 defined  8 estimating total amount  300, 300–2, 314 estimating unknown characteristics  280–2, 281 examples of  8–9 independent populations  359–68, 361–4, 367 numerical descriptive measures for  113–17, 137 related populations  371–7, 374–6 sampling from non-normally distributed populations 256–8, 257 sampling from normally distributed populations 252–6, 253 see also confidence interval estimation; proportions population mean calculation of  114, 137 defined  113 formula  114, 131, 263 power of a statistical test  344–7, 344–8 sampling distributions  250 Z test for  344, 344 population parameters  113, 332 population proportions  259–60, 297–9, 298 population regression coefficients  516–19, 517, 543 population standard deviation  114, 114–15, 131, 137, 250–1, 263, 333, 334 population variance  114, 114–15, 131, 137 portfolio expected return  187, 187–8, 204 portfolio risk  187, 187–8, 204 portfolios  187, 209 positively skewed distribution  107, 107–8, 120 post-hoc comparison  408 poverty, measures of  110 power curve  347, 347 power of a statistical test  320, 320–1, 344–7, 344–8 practical significance  351

prediction interval for an individual response Y  491, 491–2, 493, 497, 503, 508, 508 prediction line  459, 461–3, 497 predictions, in regression analysis  462–3, 503 predictive analytics  63 prescriptive analytics  63 price indices  591–7, 601 primary sources, of data  13, 13–14 prior probabilities  163 probability a priori classical probability  148, 148–9 basic concepts  148–55 Bayes’ theorem  163, 163–6, 168–9 conditional probability  156, 156–61, 164, 173 contingency tables  150 continuous probability distributions  146, 213–14 counting rules  168–71, 173, 178–9 defined  148 empirical classical approach  149 ethical issues  172 events 149–50 general addition rule  154, 154–5, 173 impossible events  148 joint probability  152, 152–3 marginal probability  151, 151–2, 173 of occurrence  148, 173 sample space  149 subjective  149 using statistical software  178–9 Venn diagrams  150, 150–1, 151 see also binomial distribution; normal distribution; Poisson distribution probability distribution, for a discrete random variable  181, 181–4, 182, 208 probability samples  17, 18 proportions calculating overall proportions  610, 610–11, 615–16, 616, 635 chi-square test for differences between more than two  615–20, 616, 618, 620, 635, 641 chi-square test for differences between two 608–13, 610, 612–13, 635, 641 comparing two population proportions  384– 7, 385–6, 391, 400, 400 confidence interval estimates for  291–3, 292, 305, 314, 388, 400 population proportions  259–60, 263 sample size determination  297–9, 298, 308, 314 standard error of  307 Z test for the difference between two  384–7, 385–6, 391, 400, 400 Z test of hypothesis for  340, 340–3, 341, 352, 353, 357 pth-order autocorrelation  570 pth-order autoregressive forecasting equation  573–4, 601 pth-order autoregressive model  570–1, 571, 575, 581, 601 quadratic trend model  557, 557–8, 558–9, 562, 581, 582–3, 600, 605 qualitative forecasting methods  545 quantile-quantile plots  231 quantitative forecasting methods  545

quartiles  96, 96–8, 100, 100–1 questionnaires 24–5 quota samples  17 R² (coefficient of multiple determination)  511, 511–12, 537 random component  546, 546–7 random error  402, 418, 427, 439 random experiments  149 random numbers tables  18, 18–20 randomisation  350 randomised block design  415–23 between-block variation  416, 416, 417, 417–18, 439 between-group variation  402, 403, 416, 416, 417–18, 439 block effects  419–20, 421, 439 compared with completely randomised design 416 critical range  422–3, 439 defined  416 estimated relative efficiency  422, 439 F test statistic  419–20, 421, 439 focus of analysis  416 mean squares  418–19, 439 partitioning the total variation  416 random error  416, 416, 418, 439 tests for the treatment and block effects  416, 416–22, 421 total variation  416, 416–17, 439 Tukey procedure  422, 422–3 using statistical software  421, 446, 446 within-group variation  402 randomness  410, 420 range calculation of  99–100, 136 characteristics of  105 defined  46, 99 formula 131 interquartile  100, 100–1, 105 relevant range  463 rare events  60–1 rate of non-compliance  304–5 ratio scales  11, 11 RE (estimated relative efficiency)  422, 439 real-time monitoring  63 recoded variables  16 rectangular distribution  213, 233, 256 region of non-rejection  318, 318, 323, 324 region of rejection  318, 318, 318–19, 323, 324 regression analysis defined  456 ethical issues  493–4, 495, 496 pitfalls 493–4, 495, 496 predictions 462–3 scatter diagrams for  456–8, 457, 459 types of regression models  456–8, 458–9 see also multiple regression models; simple linear regression regression coefficients defined  459 interpreting 505–7, 507, 541–2, 588 net regression coefficients  506 partial regression coefficients  506 population regression coefficients  516–19, 517 producing a prediction line  461–2

testing for significance  484, 512, 513, 517, 517 using dummy variables  525–31, 526, 527, 529–30, 532, 587 regression models  456–8, 458–9 regression sum of squares (SSR)  467, 468, 468, 497, 522, 523 related populations  371–7, 374–6 relative frequency distributions  48, 48–9, 82, 215, 215 relevant range  463 repeated measurements  371 replicates  426 research findings, reporting of  351 Reserve Bank of Australia  14 residual  474, 497 residual analysis defined  473 multiple regression  514–15, 515–16, 543 simple linear regression  473–6, 474–6, 493–4, 495, 496 time-series forecasting models  579, 580, 606 using statistical software  475, 478–9, 503 residual error, measuring magnitude of  579–80 residual plots to assess linearity  474, 475 to detect autocorrelation  477–8, 479 for five forecasting methods  581, 582 multiple regression  514–15, 515–16 simple linear regression  494, 495, 496 resistant measures  101 respondent error  25 response 6 response variables  457 risk of Type II error (β)  320 road fatalities, statistics on  70–1, 70–1, 113–15 robust tests  337, 364 rules, of counting  168–71 sample coefficient of correlation  126, 126–8, 127, 131 sample covariance  123, 123–5, 131 sample mean  93, 93–4, 118, 130 sample proportion  340 sample size  254–5, 256, 323–4, 351 sample size determination in business  295 for the mean  294–6, 296, 308, 314 for the proportion  297–9, 298, 308, 314 using statistical software  296, 298, 314 sample space  149, 150 sample standard deviation  102, 102–3, 131, 334, 337 sample statistic  316, 332 sample variance  102, 103–4, 131 samples cluster  17, 21 convenience  17, 17 defined  8, 17 examples of  9 judgment  17, 17 non-probability  17, 17, 17–18 probability  17, 18 reasons for drawing  17 simple random  17, 18, 18–20 stratified  17, 21 systematic  17, 20, 20–1

sampling applications in auditing  300–7 Central Limit Theorem  214, 256, 256–8, 257 from finite populations  301, 307 from non-normally distributed populations 256–8, 257 from normally distributed populations  252–6, 253 with replacement  18 survey sampling methods  17–21 without replacement  18 sampling distributions defined  249 of the mean  249, 249–58, 253, 263 of the proportion  259–60, 260, 263 using statistical software  82, 265–6 sampling error  23, 25, 295 S&P 500 Index  596 S&P ASX 200 Index  596–7 SAS (software)  26 scales 11–12, 11–12 scatter diagrams defined  59, 456 example 59, 59 necessity of using  493–4 regression analysis  474–5, 493–4, 495, 496 regression models  456–8, 457, 459 sample coefficients of correlation  127 for two numerical variables  59, 59 using statistical software  87, 87–8 seasonal component  546, 546–7 seasonal data, time-series forecasting of  584–9, 585, 588, 600, 601, 606 second differences  561–3, 605 second-order autocorrelation  570 second-order autoregressive model  570–1, 571, 575, 575, 576, 577, 581, 600 second quartile  97 secondary sources, of data  13, 13–14 semi-structured data  15 separate-variance t test  365, 365–7, 367, 391, 397 shape  92, 107, 107–9, 136–7 shark attacks, statistics on  60–1 side-by-side bar charts  56, 56–7, 56–7, 87 significance level  319, 323–4, 350, 431 significance, statistical versus practical  351 simple events  149 simple linear regression  456–98 assumptions of  473, 473–6 calculating the slope  463–5, 497 calculating Y intercept  463–5, 497 coefficient of correlation  486, 497 coefficient of determination  469, 469–72, 470, 497 confidence interval estimate of the slope  485 confidence interval estimation for the mean of Y 489–91 defined  456 determining the equation  458–65, 459–62 Durbin–Watson statistic  475, 479, 479–80, 480, 497, 503 estimation of mean values  489–92 inferences about the slope  482–5, 483–5, 503 key formulas  497 least-squares method  459–61, 460, 460–1 measures of variation  467–72, 468–70, 497

measuring autocorrelation  477–8, 478–9, 503 pitfalls 493–4, 495, 496 prediction interval for an individual response Y  491, 491–2, 493 prediction line  459 regression coefficients  459, 461–2, 484 regression models  456–7, 456–8, 497 relevant range  463 residual analysis  473, 473–6, 474–6 standard error of the estimate  471, 471–2, 497 total sum of squares (SST)  467, 467–9, 468–9, 497 using statistical software  460–1, 461–2, 469, 472, 475, 483, 492, 492, 502, 502–3, 521 simple price index  591, 591–3, 601 simple random samples  17, 18, 18–20 single data value  6 skewed distribution  107, 107, 107–8, 109, 110, 137 slope calculating 497 confidence interval estimate of  485, 518–19, 537 F test for  484, 484–5, 484–5, 497 simple linear regression  463–5, 482–5, 483–5, 503 t test for  482, 482–3, 483, 497 testing for in multiple regression models 516–18, 517, 537 software, statistical  see statistical software spam filters  168–9 sparklines  64, 64, 88 spread (dispersion)  99 SPSS/PASW Statistics (software)  26 SS (sum of squares)  101 SSA (sum of squares due to factor A)  426, 426–7 SSAB (sum of squares due to interaction)  427 SSB (sum of squares between groups)  403, 407, 416, 416 SSB (sum of squares due to factor B)  427 SSBL (sum of squares between blocks)  416, 416, 417, 417–18 SSE (error sum of squares, or sum of squares error)  418, 467, 468, 468–9, 497, 581 SSE Composite Index  597 SSR (regression sum of squares)  467, 468, 468, 497, 522, 523 SST (sum of squares total, or total sum of squares)  403, 407, 416, 416, 467, 467–9, 468–9, 497 SSW (sum of squares within groups)  404, 407, 416, 416 stacked data  79 standard deviation of binomial distribution  194, 205 calculation of  101–2 characteristics of  105 chi-square test for  632–3, 632–3, 635 defined  101 in determining sample size  295 of the difference  302 of a discrete random variable  183, 183–4, 204 of exponential distribution  242 from a frequency distribution  118–19, 131

of hypergeometric distribution  201–2, 202, 205 of normal distribution  216, 216, 217, 222 sample standard deviation  102, 102–3 of the sum of two random variables  186, 186–7, 204 of uniform distribution  234 using statistical software  136–7 see also population standard deviation standard error defined 109 of the estimate  471, 471–2, 497, 580 of the mean  251, 251–2, 252, 263, 307 of the proportion  260, 307 standardised normal distribution  285, 285 standardised normal probability density function  217–18, 242 standardised normal random variables  216 Stata (software)  26 statistic, defined  8, 9 statistical independence  159, 159–60, 173 statistical inference  280–1 statistical packages  26 statistical significance  351 statistical software analysis of variance  406, 408, 412, 430, 433–5, 444–7, 445–7 autoregressive modelling  575–7 bar charts and pie charts  79 binomial distribution  193, 193–4 bullet graphs  89–90 chi-square analysis  612–13, 618, 620, 625, 633, 641 confidence interval estimates  288, 289, 292, 313–14 contingency tables  86–7 cumulative distributions  82 descriptive statistics  109, 109, 135–8 determining sample size  296, 298 frequency distributions  80–2, 81 gauges 88–9 histograms 82–4 hypergeometric distribution  202, 202 hypothesis testing  326, 326, 331, 336–7, 336–8, 342, 356–7, 362–3, 367, 374, 376, 381, 386, 396–400, 397–400 Marascuilo procedure  620 measures of central tendency  135–6 multiple regression  507–8, 527, 527, 529–30, 532, 541–3, 542 normal distribution  226, 227, 246–7, 246–7 one-sample tests of hypothesis  356–7 organising numerical data  79–86 percentage and cumulative percentage polygons 85–6 percentage distributions  82 Poisson distribution  198 probabilities 178–9 probability distribution for a discrete variable 208 randomised block design  421 relative frequency  82 residual analysis  478–9 sample size determination  314 sampling distributions  265–6 scatter diagrams  87, 87–8

statistical software (continued) simple linear regression  460–1, 461–2, 469, 472, 475, 483, 492, 492, 502, 502–3, 521 sparklines 88 stacked and unstacked data  79 summary tables  77–8, 77–8 time-series forecasting  550, 552, 557–8, 569, 575–7, 582–3, 585, 588, 604, 604–6, 606 time-series plots  88 two-sample tests  396–400, 397–400 variation and shape  136–7 see also Microsoft Excel; Minitab; SAS; SPSS/PASW; Stata statistics  6, 6–8, 26 Statistics New Zealand  8 stem-and-leaf displays  43, 43–5, 80, 350 straight-line relationship  456, 456 strata  21 stratified samples  17, 21 structured data  15 Studentised range distribution  409, 436 Student’s t distribution  285, 285, 285–6 subjective probability  149 subjectivity, in interpretation  129 sum of squares (SS)  101 sum of squares between blocks (SSBL) 416, 416, 417, 417–18 sum of squares between groups (SSB)  403, 407, 416, 416 sum of squares due to factor A (SSA)  426, 426–7 sum of squares due to factor B (SSB)  427 sum of squares due to interaction (SSAB)  427 sum of squares error (SSE)  418 see also error sum of squares sum of squares total (SST)  403, 407, 416, 416, 467, 467–9, 468–9 see also total sum of squares sum of squares within groups (SSW)  404, 407, 416, 416 summary tables ANOVA summary tables  405, 406–8, 408, 419, 430 defined  38 examples of  38–9 using statistical software  77–8, 77–8 see also frequency distributions survey errors  23–5 survey sampling methods  17–21 surveys  14, 22–5 symmetrical distribution  107, 107, 107–8, 109, 120 syndicated services  14 systematic samples  17, 20, 20–1 t distribution, properties of  285, 285–6 t test checking assumptions  337, 337–8 choice of  352 as a classical parametric procedure  337 for the correlation coefficient  486 critical value approach  334–6, 335 highest-order autoregressive model  571–2, 572, 601 of hypothesis for the mean (σ unknown)  334, 334–7, 335–8, 353, 356 means of two independent populations (σ unknown) 360–4, 361–4

p-value approach  336–7 paired t test  372, 372–6, 374–6, 398–9, 398–9 pooled-variance t test  360, 360–4, 361–4, 391, 396–7, 396–7 robustness of  337 separate-variance t test  365, 365–7, 367, 391, 397 for the slope  482, 482–3, 483, 497 t statistic and the F statistic  523, 537 test means of two related samples  372–6 tables for categorical data  38–42 choosing an appropriate chart  73 frequency distributions  46–8 for numerical data  43–50 of random numbers  18, 18–20 using statistical software for  77–8, 77–8 see also contingency tables; summary tables telephone polling  12 test statistic  318, 323–4, 334 tests goodness-of-fit tests  627, 627–31, 635 power of  344–7, 344–8 robust tests  337, 364 see also hypothesis testing; one-sample tests; one-tail tests; t test; two-sample tests; twotail tests; Z test third-order autoregressive model  576 third (upper) quartile  97, 100, 130 tied observations  10 time-period forecasting  553, 600 time series  545 time-series forecasting  545–601 assumptions of  546, 599 autoregressive modelling  570, 570–8, 573, 575–8, 581, 582–3, 600–1, 605–6 in business  545 choosing an appropriate model  579–81, 580, 582–3, 606 classical multiplicative model  546, 546–7, 585, 600 defined  545 exponential smoothing  551, 551–3, 552, 600, 604 exponential trend model  558, 558–60, 560–1, 581, 582–3, 585–8, 588, 600–1, 605 five methods compared  581, 582–3, 606 forecasting time period  600 Holt–Winters method  567, 567–8, 569, 581, 582–3, 600 index numbers  591, 591–7 key formulas  600–1 least-squares method  555–63, 581, 582–3 least-squares trend-fitting and forecasting 555–63, 557, 585–9, 588, 605 linear trend model  555, 555–6, 557, 561–2, 581, 582–3, 600, 605 mean absolute deviation (MAD)  580 model selection  561–3, 571–2, 579–81, 580, 582–3, 605 moving averages  548, 548–51, 550 performing a residual analysis  579, 580 pitfalls 599 principle of parsimony  581 quadratic trend model  557, 557–8, 558–9, 562, 581, 582–3, 600, 605

as a quantitative method  545 of seasonal data  584–9, 585, 588, 600, 606 smoothing the annual time series  547–53 using statistical software  550, 552, 557–8, 569, 575–7, 582–3, 585, 588, 604, 604–6, 606 time-series plots  59, 59–61, 60–1, 88, 548 total amount  300, 300–2, 314 total difference  303–4, 314 total sum of squares (SST)  403, 407, 416, 416, 467, 467–9, 468–9, 497 total variation  403, 403, 403–4, 416, 416–17, 426, 426, 439, 467 trade associations  14 transformation formula  216, 216–19, 217–19, 224 treatment effect  402, 404–6, 416, 416–22, 421 treemaps  65, 65, 65–6 trend 545–6, 546 trend component  546 trend-fitting  see least-squares trend-fitting and forecasting triple exponential smoothing  555 Tukey–Kramer multiple comparison procedure  408, 408–10, 410, 413, 435, 439, 445 Tukey procedure  422, 422–3, 435–6 two-factor factorial design  425 two-sample tests  359–400 comparing the means of two independent populations 359–68, 361–4, 367 comparing the means of two related populations 371–7, 374–6 comparing two population proportions  384–7, 385–6 F test for difference between two variances 378–83, 379, 381–2 flow chart for selecting  380 using statistical software  396–400, 397–400 two-tail tests autoregressive modelling  573, 577 choice of  350 defined  322 difference between two means  361–2, 362 difference between two proportions  364–5 ethical considerations  350 F test for difference between two variances  379, 382 hypothesis for the proportion  342 p-value approach  325–6, 326, 342, 342–3 paired t test  372–3 t test of hypothesis for the mean  334–6, 335 two-way analysis of variance  425–36 critical range  438, 440 defined  425 F test for factor A effect  428, 440 F test for factor B effect  428, 428–9, 440 F test for interaction effect  429, 429–30, 440 factor A variation  426–8, 436, 439 factor B variation  426–8, 436, 439 interaction variation  439 interpreting interaction effects  432–5, 432–5 key formulas  439–40 mean squares  428, 439 random error  427, 439 summary tables  430 testing for factor and interaction effects  426, 426–32, 430–1

total variation  426, 426, 439 Tukey procedure  435–6 using statistical software  430, 433–5, 446–7, 446–7 Type I errors  319, 319–21, 344 Type II errors  319, 319–21, 344, 345–6, 345–6 unbiased sample mean  249, 249–51, 250 UNDP (United Nations Development Programme) 458 unethical practices  see ethical issues unexplained variation  467 uniform distribution  213, 214, 233, 233–4, 234, 242, 256, 257, 258 uniform probability density function  233–4, 234, 242 United Nations Development Programme (UNDP) 458 United States Federal Census  26 unstacked data  79 unstructured data  15 unweighted aggregate price index  593, 593–4, 601 upper quartile  97, 100 values 6, 16 variables categorical 9–10, 10, 10, 55–7, 86–7, 527–8 continuous  10 defined  6 dependent  456, 507–8, 508 discrete  10, 181–4 dummy variables  525–31, 526, 527, 529–30, 532, 587 explanatory  457 independent  456, 506–7, 520, 523–4, 528–9, 535–6, 537 lagged predictor variables  605 numerical 9–10, 10, 10

operational definition  6 recoding 16 response variable  457 types of association  125, 125 variance analysis of variance (ANOVA)  402 calculation of  103–5 characteristics of  105 chi-square test for  632–3, 632–3, 635 defined  101 difference between two means, unequal variances 397–8 of discrete random variables  183, 183–4, 204 equal variance  476, 476 exponential distribution  236, 242 F test for difference between two variances 378–83, 379, 381–2, 399–400, 399–400 from a frequency distribution  118 homogeneity of  411, 412, 445 population variance  114, 114–15 sample variance  102, 103–4 of the sum of two random variables  186, 186–7, 204 uniform distribution  233, 242 using statistical software  136–7, 399–400, 399–400 see also one-way analysis of variance; two-way analysis of variance variance inflationary factor (VIF)  535, 535–6, 537 variation calculating with Excel  136–7 coefficient of  105, 105–6 defined  92 explained variation  467 measures of  99–105, 467–72, 468–70, 497, 502–3

partitioning the total variation  403, 426 total variation  403, 403, 403–4, 416, 416–17, 426, 426, 439, 467 unexplained  467 Venn diagrams  150, 150–1, 151 VIF (variance inflationary factor)  535, 535–6, 537 wage price index  505 WaldoLands 63, 64, 65, 65 weighted aggregate price indices  594, 594–6 Wilcoxon rank sum test  364 within-group variation  402, 404, 439 X bar (X̄)  93, 252–3, 255–6, 257, 263 X values  223–6, 224–5, 242 Y intercept  457, 460–1, 461, 463–5, 497 Z scores  106, 106–7, 131, 136–7, 214, 216, 231 Z test for the difference between two means  359, 391 for the difference between two proportions  384, 384–7, 385–6, 391, 400, 400, 612–13 for the difference between two related populations 371–7, 374–6 of hypothesis for the mean  322, 322–8, 323, 326, 337, 353, 356 for the mean difference  371–2, 391 for the population mean  344, 344 for the proportion  340, 340–3, 341, 342, 352, 353, 357 versus t test  337 Z test statistic  322, 323–4, 325 Z values  224–5, 224–6, 231, 242, 263
