Statistics for Business Students

A Guide to Using Excel & IBM SPSS Statistics

Glyn Davis & Branko Pecar

First edition, 2021


Statistics for Business Students: A Guide to Using Excel and IBM SPSS Statistics
Copyright © 2021 by Branko Pecar & Glyn Davis
ISBN: 978-1-63795-762-2

All rights reserved. No part of this book shall be reproduced, stored in a retrieval system, or transmitted by any means, electronic, mechanical, photocopying, recording, photographing, or otherwise, without written permission from the authors. No patent liability is assumed with respect to the use of the information contained herein. Although every precaution has been taken in the preparation of this book, the publisher and authors assume no responsibility for errors and omissions. Nor is any liability assumed for damages resulting from the use of the information contained herein.

Electronic print as Amazon Kindle Edition. First version: January 2021.

Trademarks
All terms mentioned in this book that are known as trademarks or service marks have been appropriately referenced. Use of a term in this book should not be regarded as affecting the validity of any trademark or service mark.

Warning and disclaimer
Every effort has been made to make this book as complete and as accurate as possible, but no warranty or fitness is implied. The information provided is on an “as is” basis. The authors and the publisher shall have neither liability nor responsibility to any person or entity with respect to any loss or damages arising from the information contained in this book.

Workbooks and supporting material
All workbooks and other supporting materials are published for free download from the website: https://www.stats-bus.co.uk. This includes Excel and IBM SPSS data and output files.

Contact email addresses for authors: [email protected]


Contents – Brief

Preface ... 13
Chapter 1, Data Visualisation ... 18
Chapter 2, Descriptive statistics ... 95
Chapter 3, Probability distributions ... 166
Chapter 4, Sampling distributions ... 241
Chapter 5, Point and interval estimates ... 289
Chapter 6, Hypothesis testing ... 334
Chapter 7, Parametric hypothesis tests ... 358
Chapter 8, Nonparametric tests ... 417
Chapter 9, Linear correlation and regression analysis ... 510
Chapter 10, Introduction to time series data, long term forecasts and seasonality ... 569
Chapter 11, Short and medium-term forecasts ... 644
Appendices ... 717
Index ... 751


Contents

Preface ... 13
    What features do we have? ... 16
    Greek alphabet letters used within this textbook ... 16
    Acknowledgements ... 17

Chapter 1 Data visualisation ... 18
    1.1 Introduction and learning objectives ... 18
    1.2 What is a variable? ... 19
        Quantitative variables: interval and ratio scales ... 20
        Qualitative variables: categorical data (nominal and ordinal scales) ... 21
        Discrete and continuous data types ... 21
    1.3 Tables ... 23
        Simple tables ... 23
        Frequency distribution ... 25
        Creating a crosstab table using Excel PivotTable ... 38
        Summarising the principles of table construction ... 46
        Check your understanding ... 46
    1.4 Graphs - visualising of data ... 47
        Bar charts ... 47
        Creating a bar chart using Excel and SPSS ... 50
        Check your understanding ... 58
        Pie charts ... 58
        Check your understanding ... 65
        Histograms ... 66
        Creating a histogram using Excel and SPSS ... 70
        Check your understanding ... 77
        Scatter and time series plots ... 78
        Creating a scatter and times series plot using Excel and SPSS ... 80
        Check your understanding ... 90
    Chapter summary ... 91
    Test your understanding ... 91
    Want to learn more? ... 94

Chapter 2 Descriptive statistics ... 95
    2.1 Introduction and learning objectives ... 95
        Introduction ... 95
        Learning objectives ... 98
    2.2 Measures of average for a set of numbers ... 98
        Mean, median and mode for a set of numbers ... 99
        Check your understanding ... 108
    2.3 Measures of dispersion for a set of numbers ... 108
        Percentiles and quartiles for a set of numbers ... 109
        Check your understanding ... 118
        The range ... 118
        The interquartile range and semi-interquartile range ... 119
        The standard deviation and variance ... 120
        Check your understanding ... 128
        Interpretation of the standard deviation ... 129
        The coefficient of variation ... 130
        Check your understanding ... 131
    2.4 Measures of shape ... 131
        Measuring skewness: distribution symmetry ... 132
        Pearson’s coefficient of skewness ... 133
        Fisher–Pearson skewness coefficient ... 133
        Check your understanding ... 139
        Measuring kurtosis: distribution outliers and peakedness ... 140
        Check your understanding ... 146
        Calculating a five-number summary ... 146
        To identify symmetry ... 148
        To identify outliers ... 149
        Check your understanding ... 152
        Creating a box plot ... 152
        To identify symmetry ... 153
        To identify outliers ... 153
        Check your understanding ... 158
    2.5 Using the Excel Data Analysis menu ... 159
        Check your understanding ... 162
    Chapter summary ... 163
    Test your understanding ... 164
    Want to learn more? ... 165

Chapter 3 Probability distributions ... 166
    3.1 Introduction and learning objectives ... 166
        Learning objectives ... 166
    3.2 What is probability? ... 167
        Introduction to probability ... 167
        Relative frequency ... 169
        Sample space ... 170
        Discrete and continuous random variables ... 170
    3.3 Continuous probability distributions ... 171
        Introduction ... 171
        The normal distribution ... 172
        Check your understanding ... 178
        The standard normal distribution (Z distribution) ... 178
        Check your understanding ... 190
        Checking for normality ... 190
        Check your understanding ... 197
        Student’s t distribution ... 198
        Check your understanding ... 204
        F distribution ... 205
        Check your understanding ... 208
        Chi-square distribution ... 208
        Check your understanding ... 211
    3.4 Discrete probability distributions ... 211
        Introduction ... 211
        Check your understanding ... 215
        Binomial probability distribution ... 216
        Normal approximation to the binomial distribution ... 228
        Check your understanding ... 228
        Poisson probability distribution ... 229
        Check your understanding ... 237
    Chapter summary ... 237
    Test your understanding ... 238
    Want to learn more? ... 240

Chapter 4 Sampling distributions ... 241
    4.1 Introduction and learning objectives ... 241
        Learning objectives ... 241
    4.2 Introduction to sampling ... 242
        Types of sampling ... 243
        Types of errors ... 249
        Check your understanding ... 249
    4.3 Sampling from a population ... 250
        Population versus sample ... 250
        Sampling distribution of the mean ... 259
        Sampling from a normal population ... 264
        Sampling from a non-normal population ... 273
        Sampling without replacement ... 276
        Sampling distribution of the proportion ... 282
        Check your understanding ... 285
    Chapter summary ... 286
    Test your understanding ... 286
    Want to learn more? ... 288

Chapter 5 Point and interval estimates ... 289
    5.1 Introduction and learning objectives ... 289
        Learning objectives ... 289
    5.2 Point estimates ... 290
        Point estimate of the population mean and variance ... 292
        Point estimate of the population proportion and variance ... 303
        Pooled estimates ... 308
        Check you understanding ... 308
    5.3 Interval estimates ... 309
        Interval estimate of the population mean where σ is not known and the sample is smaller than 30 observations ... 315
        Interval estimate of a population proportion ... 323
        Check your understanding ... 325
    5.4 Calculating sample sizes ... 326
        Check your understanding ... 331
    Chapter summary ... 331
    Test your understanding ... 332
    Want to learn more? ... 333

Chapter 6 Hypothesis testing ... 334
    6.1 Introduction and Learning Objectives ... 334
        Learning objectives ... 335
    6.2 What is hypothesis testing? ... 335
        What are parametric and nonparametric statistical tests? ... 335
        Hypothesis statements H0 and H1 ... 337
        One- and two-tailed tests ... 338
        One- and two-sample tests ... 339
        Independent and dependent samples/populations ... 340
        Sampling distributions from different population distributions ... 341
        Sampling from a normal distribution, large sample and known σ (AAA) ... 341
        Sampling from a non-normal distribution, large sample size and known σ (BAA) ... 342
        Sampling from a normal distribution, small sample size and unknown σ (ABB) ... 342
        Sampling from a normal distribution, large sample and unknown σ (AAB) ... 344
        Sampling from a normal distribution, small sample and known σ (ABA) ... 344
        Sampling from a non-normal distribution, large sample and unknown σ (BAB) ... 344
        Sampling from a non-normal distribution, small sample and known σ (BBA) ... 344
        Sampling from a non-normal distribution, small sample and unknown σ (BBB) ... 344
        Check your understanding ... 344
    6.3 Introduction to hypothesis testing procedure ... 345
        Steps in hypothesis testing procedure ... 345
        How do we make decisions? ... 348
        Types of errors and statistical power ... 354
        Check your understanding ... 356
    Chapter summary ... 356
    Test your understanding ... 357
    Want to learn more? ... 357

Chapter 7 Parametric hypothesis tests ... 358
    7.1 Introduction and Learning Objectives ... 358
        Learning objectives ... 359
    7.2 One-sample hypothesis tests ... 359
        One-sample z test for the population mean ... 359
        Check your understanding ... 366
        One-sample t test for the population mean ... 366
        Check your understanding ... 376
        One-sample z test for the population proportion ... 377
        Check your understanding ... 382
    7.3 Two-sample hypothesis tests ... 383
        Two-sample t test for the population mean: independent samples ... 383
        Excel Data Analysis solutions ... 399
        Check your understanding ... 401
        Two-sample t-test for the population mean: dependent or paired samples ... 402
        Check your understanding ... 413
    Chapter summary ... 414
    Test your understanding ... 415
    Want to learn more? ... 416

Chapter 8 Chi square and non-parametric hypothesis tests ... 417
    8.1 Introduction and learning objectives ... 417
        Learning objectives ... 419
    8.2 Chi-square tests ... 419
        Chi-square test of independence ... 420
        How do you solve problems when you have raw data? ... 432
        Check your understanding ... 439
        Chi-square test for two proportions (independent samples) ... 441
        Check your understanding ... 449
        McNemar’s test for the difference between two proportions (dependent samples) ... 450
        Check your understanding ... 458
    8.3 Nonparametric tests ... 459
        Sign test ... 460
        Check your understanding ... 473
        Wilcoxon signed-rank test for matched pairs ... 473
        Mann–Whitney U test for two independent samples ... 488
        Check your understanding ... 505
    Chapter summary ... 505
    Test your understanding ... 506
    Want to learn more? ... 509

Chapter 9 Linear correlation and regression analysis ... 510
    9.1 Introduction and chapter overview ... 510
        Learning objectives ... 511
    9.2 Introduction to linear correlation ... 511
    9.3 Linear correlation analysis ... 513
        Scatter plots ... 514
        Covariance ... 518
        Pearson’s correlation coefficient, r ... 523
        The coefficient of determination, r² or R-Squared ... 529
        Spearman’s rank correlation coefficient, rs ... 529
        Check your understanding ... 532
    9.4 Introduction to linear regression ... 533
    9.5 Linear regression ... 533
        Fit line to a scatter plot ... 538
        Sum of squares defined ... 543
        Regression assumptions ... 545
        Test how well the model fits the data (Goodness-of-fit) ... 547
        Prediction interval for an estimate of Y ... 554
        Excel data analysis regression solution ... 559
        Regression and p-value explained ... 562
        Check your understanding ... 564
    Chapter summary ... 564
    Test your understanding ... 565
    Want to learn more? ... 568

Chapter 10 Introduction to time series data, long-term forecasts and seasonality ... 569
    10.1 Introduction and chapter overview ... 569
        Learning objectives ... 570
    10.2 Introduction to time series analysis ... 570
        Stationary and non-stationary data sets ... 571
        Seasonal time series ... 573
        Check your understanding ... 574
    10.3 Trend extrapolation as long-term forecasting method ... 575
        A trend component ... 575
        Fitting a trend to a time series ... 577
        Using a trend chart function to forecast time series ... 580
        Trend parameters and calculations ... 585
        Check your understanding ... 589
    10.4 Error measurements ... 589
        Types of error statistics ... 594
        Check your understanding ... 598
    10.5 Prediction interval ... 599
        Standard errors in time series ... 600
        Check your understanding ... 613
    10.6 Seasonality and Decomposition in classical time series analysis ... 613
        Cyclical component ... 619
        Seasonal component ... 625
        Error measurement ... 634
        Prediction interval ... 636
        Check your understanding ... 640
    Chapter summary ... 641
    Test your understanding ... 642
    Want to learn more? ... 643

Chapter 11 Short and medium-term forecasts ... 644
    11.1 Introduction and chapter overview ... 644
        Learning objectives ... 645
    11.2 Moving averages ... 646
        Simple moving averages ... 646
        Short-term forecasting with moving averages ... 654
        Mid-range forecasting with moving averages ... 662
        Check your understanding ... 667
    11.3 Introduction to exponential smoothing ... 667
        Forecasting with exponential smoothing ... 670
        Mid-range forecasting with exponential smoothing ... 684
        Check your understanding ... 690
    11.4 Handling errors for the moving averages or exponential smoothing forecasts ... 690
        Prediction interval for short and mid-term forecasts ... 694
        Check your understanding ... 697
    11.5 Handling seasonality using exponential smoothing forecasting ... 698
        Classical decomposition combined with exponential smoothing ... 698
        Holt-Winters’ seasonal exponential smoothing ... 702
        Check your understanding ... 713
    Chapter summary ... 714
    Test your understanding ... 715
    Want to learn more? ... 716

Appendices ... 717
    Appendix A Microsoft Excel Functions ... 717
    Appendix B Areas of the standardised normal curve ... 724
    Appendix C Percentage points of the Student’s t distribution (5% and 1%) ... 725
    Appendix D Percentage points of the chi-square distribution ... 726
    Appendix E Percentage points of the F distribution ... 727
        Upper 5% ... 727
        Upper 2.5% ... 728
        Upper 1% ... 729
    Appendix F Binomial critical values ... 730
    Appendix G Critical values of the Wilcoxon matched-pairs signed-ranks test ... 731
    Appendix H Probabilities for the Mann–Whitney U test ... 732
        Mann–Whitney p-values (n2 = 3) ... 732
        Mann–Whitney p-values (n2 = 4) ... 732
        Mann–Whitney p-values (n2 = 5) ... 732
        Mann–Whitney p-values (n2 = 6) ... 733
        Mann–Whitney p-values (n2 = 7) ... 733
        Mann–Whitney p-values (n2 = 8) ... 734
    Appendix I Statistical glossary ... 735

Book index ... 751


Preface

The way we teach and learn statistics has not changed since the days before computers and apps. This cannot be right. Surely the way we use various computing platforms and associated apps should also have an impact on the way we teach statistics. This is precisely what we have attempted to achieve in this textbook.

When it comes to statistics, universities all over the country usually rely on one of two software platforms: Microsoft Excel or IBM SPSS. Other platforms are also used, but they represent a minority. Furthermore, once our students graduate, almost all of them are, without a shadow of a doubt, expected to be proficient in Microsoft Excel, a platform of choice for business, government, and non-profit organisations. For this reason, we decided to build this textbook around solutions based on both Microsoft Excel and IBM SPSS. Every problem is first explained and then solved using Excel, followed by the SPSS solution. This is the general approach we followed throughout this textbook.

The second point we would like to emphasise is why one should bother to learn statistics at all. If you set aside the technical and mathematical aspects of statistics and think for a moment, you will realise that by learning statistics, you learn how to:

• Draw conclusions about the whole population based on limited data that you were given.
• Assign a specific level of confidence to your conclusions.
• Describe and quantify relationships between different phenomena.
• Reduce uncertainty.
• Predict the future by understanding the past.

We are sure that you will agree that these are very useful skills. If you ask students whether they would like to have these sorts of skills, they all agree that they are very desirable. At the same time, when you try to teach them, a lot of them lose interest. Why? This brings us back to the first point. The reason is that the way we teach our students statistics has not changed for decades. Most statistics courses are structured and taught as if all the students will become future statisticians. This could not be further from the truth. They will be accomplished professionals in their chosen area of expertise, and statistics will be just one of many tools that they need to use to do their jobs properly.

This is where we come to the third point of this preface. By using tools such as Excel and SPSS, we eliminate the need to learn the technical and mathematical details of the methodology that underpins statistics. We put the emphasis on:

• What is the problem you are trying to solve?
• Which method can be applied to provide a solution?
• How do you interpret the results that Excel or SPSS produced using this method?

Why do we think that this approach is the right approach for business students? The most fundamental aspect of running any business is decision making. To make a decision, we can follow chance, intuition, or rumours, or we can do something completely different that usually guarantees success. We can:

• Gain a better understanding of the issue by analysing it.
• Draw the relevant conclusions.
• Decide on how to implement the actions.

This is the essence of any good decision-making process and the foundation of business management. What we are advocating is that, to make good decisions and manage a business properly, we must first gain insight into a problem. How do we gain insight, and what exactly is insight? Insight is the ability to understand the inner nature of things and clearly discern cause and effect in a specific context. You could gain insight by talking to experts, or to people who have done something repeatedly or have been there for a long time. However, most of the time you do not have the luxury of access to such people or experts, or they simply do not exist in a field. The only alternative is to apply a methodology that will effectively turn you into such an expert.

So, the fundamental question here is: what sort of methodology are we talking about? What kind of ‘science’ can give us insight into problems we have never dealt with before and make us experts overnight? It sounds like a science fiction movie in which the main character takes a pill that enables a much larger percentage of his brain to engage and suddenly gains tremendous insight into virtually any problem. What we are advocating here is not science fiction, but a simple matter of understanding statistics. The methodology that will give you an insight into virtually any problem is the most fundamental toolset of statistics.

Moreover, the methodology that statistics teaches us is universal. It is designed so that, no matter what the problem and no matter what the context, we will always have a solid foundation to make the right decision and defend it without bias or arbitrary spin. The decision becomes easy to defend and can be scrutinised by anyone who wants to challenge it. Ultimately, they will come to the same conclusion.

Being proficient in statistics will give you a competitive advantage and help you with employability (i.e. it is easier to get a job) and transferability (i.e. the ability to move from one field or industry to another). The important point is that being proficient in statistics does not imply that you must be an expert in statistics. It is no different to speaking a foreign language. If you are in a foreign country, it helps if you speak the language. You do not have to be a linguist and understand the nuances of the grammatical syntax of the language you are speaking. It suffices to be fluent in it. The same applies to statistics. You do not have to know how certain equations were derived. You just need to know how to apply them and how to interpret the results. This alone will give you a distinct advantage and a head start over those of your colleagues who have not bothered to equip themselves with this statistical methodology.

Regardless of what your core skills and profession are, proficiency in statistical methodology makes you a more insightful individual who will find employment more easily, stay in the job with more respect and satisfaction, and be able to take advantage of what other subject fields offer. Not a bad proposition. Therefore, we believe that it is a good personal investment to learn the skill set found in statistics.

Our ambition was to create a textbook that is practical in nature and focuses on applications rather than theory. To make it very useful, we used two core software tools (Excel and SPSS) that are invariably used to teach statistics. Excel is very transferable, as it is a widely used tool throughout a range of organisations, including business, the public sector, and non-profit organisations. In addition to this, we offer an incredible wealth of online resources to help those who either struggle with the content or are keen to learn more. Most of all, as you go through this textbook and learn one chapter after another, you should feel good about yourself. You have just gained a small competitive advantage that might make your career prospects brighter than you ever imagined.

To support this textbook, we created a supporting website containing numerous resources, depending on whether you are a student or a lecturer. The website is: https://stats-bus.co.uk/

Once you land on the website, depending on whether you are a lecturer or a student, there will be two possible options for you. Lecturers will have a password-protected part of the website that contains:

• All data files with solutions.
• PowerPoint slides for all the lectures.
• Instructor’s manual.
• Multiple choice questions.
• Exam questions.
• A pdf version of the textbook.

Students and any other reader will have free access to the side of the website that contains:

• Student files (both Excel and SPSS).
• A ‘Want to learn more’ folder with numerous files expanding on the current content.
• Learning resources, such as:
  o Revision tips.
  o Multiple choice questions.
• Other resources, such as:
  o Introduction to Excel.
  o Introduction to SPSS.
  o Develop your mathematical skills.
  o Factorial experiment workbook.
• Video files, i.e. YouTube-style videos, each up to 5 min, explaining key points.

Lecturers will have to register to gain access to the protected side of the website, whilst students and the general reader do not have to register. However, you can register to receive updates, corrections and any other improvements or additions to the textbook and the website.

To contact the two authors of this textbook, just drop an email to: [email protected] We will be glad to assist.

What features do we have?

Throughout the chapters, various features are included, such as learning objectives, examples, and solutions for Excel and SPSS. Every feature is clearly marked with a symbol, or icon, with the meanings shown in Table 0.1.

• Learning objectives
• Example
• Excel Solution
• SPSS Solution
• Check your understanding for every section
• Chapter summary
• Test your understanding
• Want to learn more
• Glossary
• Index

Table 0.1 Book icons

Greek alphabet letters used within this textbook

A list of the Greek letters used in this book is provided in Table 0.2.

Name      Lowercase letter      Name      Lowercase letter
Alpha     α                     Mu        μ
Beta      β                     Pi        π
Chi       χ                     Rho       ρ
Lambda    λ                     Sigma     σ

Table 0.2 Greek letters

Acknowledgements

We are grateful to everyone who granted permission to reproduce copyrighted material in this textbook. Every effort has been made to trace the copyright holders and we apologise for any unintentional omissions. We would be pleased to include the appropriate acknowledgements in any subsequent edition of this publication or at reprint stage.

The authors of this textbook would like to thank the following for their agreement to use software screenshot images and data files:

• Microsoft Excel: Microsoft Excel screenshots, Excel function definitions, and generation of critical tables used with permission from Microsoft®.
• IBM SPSS Statistics software (‘SPSS’): IBM® SPSS Statistics screenshots of solutions used throughout all chapters. Reprint courtesy of International Business Machines Corporation, © International Business Machines Corporation. SPSS Inc. was acquired by IBM in October 2009.
• Example 8.2 data set: reproduced with permission from Dr Anne Johnston.

We owe a debt of gratitude to our colleague from down under, Anthony Hyden. Our Australian friend, though not a statistician but a thoroughbred engineer, volunteered to read the manuscript in its early stages and made numerous suggestions to improve the readability of the text.

Glyn Davis, Associate Professor, Teesside University Business School, Teesside University
Dr Branko Pecar, Visiting Fellow, University of Gloucestershire


Chapter 1 Data visualisation

1.1 Introduction and learning objectives

In this chapter we shall look at methods to summarise data using tables and charts.

Learning Objectives

On completing this chapter, you will be able to:

1. Understand the different types of data variables that can be used to represent a specific measurement.
2. Know how to present data in table form.
3. Present data in a variety of graphical forms.
4. Construct frequency distributions from raw data.
5. Distinguish between discrete and continuous data.
6. Construct histograms for equal class widths.
7. Solve problems using Microsoft Excel and IBM SPSS Statistics.

The display of various types of data or information in the form of tables, graphs and diagrams is quite a common practice these days. Newspapers, magazines, television, and social media all use these types of display to try to convey information in an easy-to-assimilate way. In a nutshell, what these forms of display aim to do is summarise large sets of raw data so that we can see the 'behaviour', or the pattern, in the data. This chapter and the next will introduce a variety of techniques that can be used to present raw data in a form that will make sense to people, using both the Microsoft Excel and IBM SPSS Statistics software packages.

Tables and graphs can be useful tools for helping people to make decisions. As well as being able to identify clearly what the graph or table is telling us, it is important to identify what parts of the story are missing. This can help the reader decide what other information they need, or whether the argument should be rejected because the supporting evidence is suspect. You will need to know how to critique the data and the way they are presented. It is worth remembering that a table or graph can misrepresent information either by leaving out important information or by being constructed in such a way that it misrepresents relationships.

If we have a choice, should we use a table or a graph? We can use both, but a general guide is that:

1. Tables are generally best if you want to be able to look up specific information or if the values must be reported precisely.
2. Graphs are best for illustrating trends, ratios and making comparisons.

Figures 1.1 and 1.2 provide examples of a graph and table published within an economic report by the Office for National Statistics (https://www.ons.gov.uk/businessindustryandtrade/business/activitysizeandlocation/bulletins/ukbusinessactivitysizeandlocation/2018).

Figure 1.1 Number of value added tax and/or pay as you earn based businesses, 2013–2018 (Source: Office for National Statistics)

Figure 1.2 Number of value added tax and/or pay as you earn based businesses by region (thousands), 2016–2018 (Source: Office for National Statistics)

London accounted for the largest number of businesses in March 2018, with 19% of the UK total. The region with the next largest share of businesses was the South East, with 15.2%.

1.2 What is a variable?

We collect data either from published material (books, statistical bulletins, etc.) or by conducting some form of survey. Regardless of the source of our data, we are usually interested in one specific measured characteristic. A good example is the height of 1000 subjects. This measured characteristic, or attribute, that differs from subject to subject is called a variable. A variable is a symbolic name that has a value associated with it. If we had 1000 subjects, as in our example, then there would be 1000 values associated with this single variable. You can also think of a variable as a symbol that has a series of datapoints associated with it. These datapoints are also called observations, so the 1000 values associated with the variable height also represent 1000 observations. However, these observations (or datapoints) do not have to be numbers, which means that we can have different types of variables. Variables are usually divided into quantitative (continuous or discrete) and qualitative (nominal or ordinal) variables, as indicated in Figure 1.3.

Figure 1.3 Quantitative versus qualitative variables Quantitative variables always consist of numerical data, for example, the average height of a person or the time required to finish the 800 m race by an athlete. In later chapters, we will use statistical techniques that will be dependent upon whether the data measured are quantitative or qualitative. Qualitative (non-metric) variables describe some quality of the item being measured without measuring it. For example, when describing the colour of the sky or the finish position of an athlete in an 800 m race. Let us look at an example. If a group of business students were asked to name their favourite video game, then the variable would be qualitative. If the time spent playing a game was measured, then the variable would be quantitative. Both quantitative and qualitative variables can have data measured at different scales, as shown in Figure 1.3. Let us explore different scales.

Quantitative variables: interval and ratio scales If one unit on the scale represents the same magnitude of the characteristic being measured across the whole range of the scale, then we call this an interval measurement scale. For example, we can attempt to measure student stress levels on an interval scale. In this case, a difference between a score of 5 and a score of 6 would represent the same difference in anxiety as would a difference between a score of 9 and a score of 10. However, interval scales do not have a ‘true’ zero point; and therefore, it is not possible to make statements about how many times higher one score is than another. For the stress measurement, it would not be valid to say that a person with a score of 6 was twice as anxious as a person with a score of 3.


Ratio scales, on the other hand, are very similar to interval scales, except that they have true zero points. For example, a height of 2 m is twice as much as 1 m. Interval and ratio measurements are also called continuous variables. Table 1.1 summarises the different measurement scales with examples provided of these different scales.

Qualitative variables: categorical data (nominal and ordinal scales)

Qualitative variables do not use numbers in a mathematical sense; instead they use labels to put data into categories. If the scale used happens to be nominal, this means that the variable consists of groups or categories to which every observation is assigned. No quantitative information is conveyed, and no ordering of the observations is implied. Football club allegiance, sex or gender, degree type, and courses studied are all examples of nominal scales. To show how often every value occurs on a point on the scale, we use frequency distributions.

If the categories of the data can be placed in order of size, then the data are classed as ordinal. Measurements with ordinal scales are ordered in the sense that higher numbers represent higher values. However, the intervals between the numbers are not necessarily equal. For example, on a five-point rating scale measuring student satisfaction, the difference between a rating of 1 ('very poor') and a rating of 2 ('poor') may not be the same as the difference between a rating of 4 ('good') and a rating of 5 ('very good'). The lowest point on the rating scale in the example was arbitrarily chosen to be 1, and this scale does not have a 'true' zero point.

Measurement scale   Recognising a measurement scale
Interval data       1. Ordered, constant scale, with no natural zero, e.g. temperature, dates.
                    2. Differences make sense, but ratios do not, e.g. temperature difference.
Ratio data          1. Ordered, constant scale, and a natural zero, e.g. length, height, weight, and age.
Nominal data        1. Classification data, e.g. male or female, red or blue hat.
                    2. Arbitrary labels, e.g. m (male) or f (female), 0 or 1.
                    3. No ordering, e.g. it makes no sense to state that m > f.
Ordinal data        1. Ordered list, e.g. customer satisfaction scale of 1, 2, 3, 4, and 5.
                    2. Differences between values are not important, e.g. political parties can be given labels far left, left, mid, right, far right, etc.
Table 1.1 Examples of measurement scales

A scale also implies that the data it measures can be either continuous or discrete. Let us see what we mean by this.

Discrete and continuous data types

The data that a variable consists of can exist in two forms: discrete and continuous. Discrete data occur as integers (whole numbers), for example, 1, 2, 3, 4, 5, 6, etc. Continuous data occur as continuous numbers and can take any level of accuracy, for example, the number of miles travelled could be 110.5 or 110.52 or 110.524, etc. Note that whether data are discrete or continuous does not depend upon how they are collected, but how they occur. Thus height, distance and age are all examples of continuous data, although they may be presented as whole numbers.

Most of the time, the scale of the data (both discrete and continuous) is very wide, and we want to put the data into groups, or classes. This means that every class needs a boundary. Class limits are the extreme boundaries. When we start creating frequency distributions (further below), the class limits are called the stated limits. Two common types are illustrated in Table 1.2.

A: 5 - under 10, 10 - under 15
B: 5 - 9, 10 - 14
Table 1.2 Examples of class limits

We must make sure that there are no gaps between classes. To help place data in their appropriate class, we use what are known as true limits, or mathematical limits. True or mathematical limits are determined depending on whether we are dealing with discrete or continuous data. Table 1.3 indicates how these limits may be defined.

Stated limit         Mathematical limit (discrete)   Mathematical limit (continuous)
A: 5 - under 10      5 - 9                           5 - 9.999999'
   10 - under 15     10 - 14                         10 - 14.999999'
B: 5 - 9             5 - 9                           4.5 - 9.5
   10 - 14           10 - 14                         9.5 - 14.5
Table 1.3 Example of mathematical limits

Why are true limits so important? If the data are continuous and the stated limits are as in style A, then a value of 9.9 would be placed in the class '5 - under 10'. If style B were used, then it would be placed in the class '10 - 14', given that a value of 9.9 lies closer to 10 than to 9. This means that the class width is very important. Using the true or mathematical limits, the width of a class can be found. If CW is the class width, UCB the upper-class boundary, and LCB the lower-class boundary, then the class width is calculated using equation (1.1).

CW = UCB - LCB    (1.1)

If, for example, the true limits are 0.5–1.5, 1.5–2.5, etc., then the class width is 1.5 - 0.5 = 1 or 2.5 - 1.5 = 1. Or, if the true limits are 399.5–419.5, 419.5–439.5, then the class width is 419.5 - 399.5 = 20 or 439.5 - 419.5 = 20. When we come to the end of a distribution, open-ended classes can be used as a catch-all for extreme values. For example: Up to 40, 40–50, 50–60, …, 100 and over.

To decide what number of classes to use is subjective and there are no strict rules about it. However, the following should be taken into consideration:

a. Use between 5 and 12 classes. The actual number will depend on the size of the sample and minimising the loss of information.
b. Class widths are easier to handle if in multiples of 2, 5 or 10 units.
c. Although not always possible, try and keep classes at the same widths within a distribution.

As a guide, equation (1.2) can be used to calculate the class width, given the highest and lowest values and the chosen number of classes.

Class width = (Highest value - Lowest value) / Number of classes    (1.2)

For example, if we had highest value 309.5, lowest value 189.5, and say we want 6 classes of equal size, then from equation (1.2), the class width is:

Class width = (309.5 - 189.5) / 6 = 120 / 6 = 20

With 6 classes and given the highest and lowest value, each class should be 20 units wide.
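Although this book works in Excel and SPSS, the same calculation is easy to script. The short Python sketch below is offered purely as an illustrative cross-check of equation (1.2); the variable names are our own.

# Illustrative sketch of equation (1.2): class width = (highest - lowest) / number of classes
highest = 309.5
lowest = 189.5
number_of_classes = 6

class_width = (highest - lowest) / number_of_classes
print(class_width)   # prints 20.0, matching the worked example above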

1.3 Tables

A table is the most basic arrangement of data into rows and columns. Apart from taking up less room, a table enables figures to be located more quickly, makes comparisons between different classes easier, and may reveal patterns which cannot otherwise be deduced. The simplest form of table indicates the frequency of occurrence of objects within several defined categories.

Simple tables

Tables come in a variety of formats, from simple tables to frequency distributions. When creating a table, the following principles should be followed:

a. When a secondary data source is used, it is acknowledged.
b. The title of the table is given.
c. The total of the frequencies is given.
d. When percentages are used for frequencies, this is indicated together with the sample size, n.

Example 1.1

355 undergraduate business students were asked the question 'how many still plan to stay with us and study one of our master's degrees'; the results were as follows:

• 120 MSc International Management,
• 24 MSc Business Analysis,
• 80 MA Human Resource Management,
• 45 MSc Project Management,
• 86 MSc Digital Business.

We can put this information in table form, indicating the frequency within each category either as a raw score or as a percentage of the total number of responses.

Number of undergraduate students entering differing master's courses (Source: School survey August 2020)
Course                          Frequency    Frequency %
MSc International Management    120          34%
MSc Business Analysis           24           7%
MA Human Resource Management    80           23%
MSc Project Management          45           13%
MSc Digital Business            86           24%
Total                           355          100%
Table 1.4 Type of master's course chosen to study
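If you wish to reproduce the percentage column of Table 1.4, the short Python sketch below (illustrative only; the book itself performs this in Excel and SPSS, and the variable names are our own) computes each course's rounded share of the 355 students.

# Frequency table with percentage of total (rounded to the nearest whole percent)
counts = {"MSc International Management": 120,
          "MSc Business Analysis": 24,
          "MA Human Resource Management": 80,
          "MSc Project Management": 45,
          "MSc Digital Business": 86}

total = sum(counts.values())                 # 355
for course, f in counts.items():
    print(course, f, round(100 * f / total)) # 34, 7, 23, 13, 24 as in Table 1.4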

Sometimes categories can be subdivided, and tables can be constructed to convey this information together with the frequency of occurrence within the subcategories. For example, Table 1.5 indicates the frequency of the number of hard disks sold in a shop or online, with the sales split by month.

Example 1.2

Table 1.5 illustrates further subdivisions of categories.

Quick Computers Ltd – Number of hard disks sold in shop and online
Month     January    March    May    July    September    November    Total
Shop      23         56       123    158     134          182         11750
Online    64         145      423    400     350          409         7533
Total     87         201      546    558     484          591         19283

Table 1.5 Number of hard disks sold in shop vs. online

Another example of how categories may also be displayed is given in Table 1.6, showing the time spent online for a sample of 560 adults.

Example 1.3

Table 1.6 shows the tabulated results from a survey undertaken to measure the time spent online.

                                     Less than 15 hours per week    15 - 30 hours per week    More than 30 hours per week    Totals
Young person (up to 18 years old)    45                             62                        81                             188
18 - 24 years old                    50                             82                        102                            234
Over 24 years old                    18                             54                        66                             138
Totals                               113                            198                       249                            560
Table 1.6 Time spent online

Frequency distribution

We have already mentioned frequency distributions, so let us learn more about this important type of table. When data are collected by survey, or by some other means, we initially have a set of unorganised raw data which, when viewed, would convey little information. A first step would be to organise the set into a frequency distribution. By doing so, we create a table showing the similar quantities that were collected and the frequency of occurrence of those quantities for every group of quantities.

Example 1.4

Consider the number of telephone calls per day received by a mobile phone shop enquiring about the new iPhone SE over a period of 92 days in the summer of 2020.

1 4 3 5 4 1 4 3 2 4 5 3 4 1 4 3 5 2 4 3

5 1 4 3 5 4 3 5 1 5 3 4 2 3 4 1 5 3 5 4

2 3 1 5 4 3 5 4 5 3 2 5 4 5 3 5 4 5 1 3

5 4 3 5 2 5 4 3 5 5 3 5 5 4 3 5 5 5 2 4

3 5 4 4 3 2 5 4 5 5 3 2

Table 1.7 Number of calls received over a period of 92 days Page | 25

The frequency distribution can be used to show how many days we had 0 calls per day, 1 call per day, 2 calls per day, and so on. You will notice that if you undertake the calculation then you will find that the minimum and maximum value of the number of calls per day is 1 and 5, respectively. Therefore, we wish to create a tally chart and then a frequency distribution for the value of X. Write down the range of values from lowest (1) to the highest (5) then go through the data set recording each score in the table with a tally mark. It is a good idea to cross out figures in the data set as you go through it to prevent double counting. Table 1.8 illustrates the frequency distribution for this data set.

Day    Frequency, f
1      8
2      9
3      22
4      23
5      30
Table 1.8 Frequency distribution

Excel solution

Figure 1.4 illustrates the data held in cells C4:C95 (the first 20 observations are shown). You will observe from the Excel solution that we calculate the minimum and maximum values from the data set:

Minimum value of x = 1
Maximum value of x = 5

This enables the frequencies to be calculated, and therefore the frequency distribution identified, for this data set (C4:C95) using the =COUNTIF() Excel function.
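The same tally can be reproduced outside Excel. The Python sketch below is illustrative only; it assumes a list named calls holds the 92 daily values from Table 1.7 (only the first ten are typed out here).

from collections import Counter

# calls is assumed to hold the 92 daily values from Table 1.7
calls = [1, 4, 3, 5, 4, 1, 4, 3, 2, 4]   # ... remaining 82 values omitted in this sketch

freq = Counter(calls)                     # counts how many days each value occurred
for value in sorted(freq):
    print(value, freq[value])             # for the full data set: 1->8, 2->9, 3->22, 4->23, 5->30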


Figure 1.4 Excel solution

Observe that the Excel frequency distribution is the same as that given in Table 1.8.

SPSS solution

Input the data into SPSS. Figure 1.5 illustrates the first 20 values.

Figure 1.5 Example 1.4 SPSS data We can create the frequency distribution using the SPSS Frequencies method illustrated below. Select Analyze > Descriptive Statistics > Frequencies. Page | 27

Transfer Number_calls_day into the Variables(s) box.

Figure 1.6 SPSS frequencies menu Click OK SPSS output

Figure 1.7 SPSS solution

In this example, there were relatively few distinct values. However, had we increased our survey period to one year, the number of calls per day might have been anywhere between 0 and 100+. Since our aim is to summarise information, we may find it better to group the values into classes to form a grouped frequency distribution. The next example illustrates this point.

Example 1.5

Table 1.9 illustrates the distance travelled in miles by 100 FG delivery vans per day.

Data 499 501 553 455 452 506 533 526 492 514

526 489 451 457 481 460 482 470 484 501

456 528 509 537 440 556 400 588 541 492

545 505 507 493 474 486 526 485 496 480

426 522 582 528 483 484 555 434 500 534

561 566 574 459 582 535 534 469 461 534

501 497 466 482 477 528 450 499 529 476

495 523 514 472 594 517 552 471 488 513

579 526 554 472 496 500 469 519 545 496

Table 1.9 Distance travelled by delivery vans per day

This mass of data conveys little in terms of information. Because there would be too many distinct values, putting the data into an ungrouped frequency distribution would not provide an adequate summary. Grouping the data, however, leads to Table 1.10.

Class        Frequency
400 - 449    4
450 - 499    44
500 - 549    36
550 - 599    15
600 - 649    1
Table 1.10 Grouped frequency distribution for the Example 1.5 data

For this example, we have created five classes, each with the same class interval. The lower- and upper-class boundaries are:

• 399.5
• 449.5
• 499.5
• 549.5
• 599.5
• 649.5

with corresponding class widths of 50:

• 449.5 – 399.5 = 50
• 499.5 – 449.5 = 50
• 549.5 – 499.5 = 50
• 599.5 – 549.5 = 50
• 649.5 – 599.5 = 50


Excel solution Step 1 Input data into cells C6:C105 (first 20 data values illustrated)

Figure 1.8 Example 1.5 Excel dataset Step 2 Use Excel Analysis ToolPak solution to create a frequency distribution. Excel can construct grouped frequency distributions from raw data by using Analysis ToolPak. Before we use this add-in, we must input the lower- and upper-class boundaries into Excel. Excel calls this the bin range. In this example we have decided to create a bin range that is based upon equal class widths. Therefore, the Excel bin range values will be: • • • • • •

399.5 449.5 499.5 549.5 599.5 649.5

We can now use Excel to calculate the grouped frequency distribution. We put the bin range values in cells E6:E11 (with the label in cell E5) as illustrated in Figure 1.9. Page | 30

Figure 1.9 Excel bin range values Now create the histogram. Select Data. Select Data Analysis menu.

Figure 1.10 Excel Data > Data Analysis menu Click on Histogram.

Figure 1.11 Excel Data Analysis menu Click OK. Enter Input Range: C5:C105. Enter Bin Range: E5:E11. Click on Labels Choose location of Output Range: G5.


Figure 1.12 Excel histogram Click OK. Excel will now print out the grouped frequency table (bin range and frequency of occurrence) as presented in cells G5:H11.

Figure 1.13 Excel solution

The grouped frequency distribution would now be as illustrated in Table 1.11.

Bin Range    Frequency
399.5        0
449.5        4
499.5        44
549.5        36
599.5        15
649.5        1
More         0

Table 1.11 Bin and frequency values
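The same binning can be reproduced outside Excel. The Python sketch below (illustrative only) uses numpy.histogram with the true class boundaries listed above; it assumes an array named distances holds the 100 values from Table 1.9 (only the first ten are typed out here).

import numpy as np

# distances is assumed to hold the 100 values from Table 1.9
distances = np.array([499, 501, 553, 455, 452, 506, 533, 526, 492, 514])  # ... remaining values omitted

bins = [399.5, 449.5, 499.5, 549.5, 599.5, 649.5]   # true class boundaries (bin range)
freq, edges = np.histogram(distances, bins=bins)
print(freq)   # for the full data set this gives [4, 44, 36, 15, 1], as in Table 1.12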


From this table we can now create the grouped frequency distribution as illustrated in Table 1.12.

Class        Frequency
400 - 449    4
450 - 499    44
500 - 549    36
550 - 599    15
600 - 649    1

Table 1.12 Grouped frequency distribution SPSS solution Input data into SPSS (the first 20 values are illustrated in Figure 1.14). We have called the SPSS column Distance_travelled.

Figure 1.14 Example 1.5 SPSS data

Transform the data into a grouped frequency distribution using SPSS Visual Binning. Select Transform > Visual Binning. Transfer distance travelled into the Variables to Bin box (Figure 1.15).


Figure 1.15 SPSS Visual Binning menu Click Continue: this gives you the Visual Binning dialog. Type Distance_travelled_cat into the Binned Variable box. Type Distance_travelled (Binned) into the Label box.

Figure 1.16 SPSS Visual Binning continued Click the Excluded button.


Figure 1.17 SPSS Visual Binning Excluding end points Click on Make Cutpoints button Enter First cutpoint Location: 400 Enter Width = 50 Note that when you click out of the Width box that the Number of Cutpoints is calculated by SPSS and enters 5 in the box as illustrated in Figure 1.18.

Figure 1.18 Make Cutpoints Click Apply You will find that Figure 1.16 will now change such that the values are provided as illustrated in Figure 1.19.


Figure 1.19 SPSS class Cutpoints Now generate the Labels Click on the Make Labels button

Figure 1.20 Class values and labels Click OK SPSS warns you that the automatically generated labels are no longer correct.

Figure 1.21 SPSS warning – creating a new variable in SPSS data file

Once you are finally satisfied with the binning, click OK. This will create a new variable within your SPSS data file (the first 20 values are illustrated in Figure 1.22).


Figure 1.22 SPSS data file with binned variable Now, create the grouped frequency table Select Analyze > Descriptive Statistics > Frequencies Transfer Distance_travelled_.. into the Variable(s) box.

Figure 1.23 SPSS Frequencies menu We can ask SPSS to create the histogram if required for this grouped frequency distribution. Click on Charts Choose Histograms Please note we will discuss the histogram when introducing this later in this chapter. Page | 37

Click OK SPSS output

Figure 1.24 SPSS solution Observe that this result is the same as both the Excel and manual solutions.

Creating a crosstab table using Excel PivotTable A cross tabulation (or a crosstab) is a table that shows relationships between variables. Excel calls it a pivot table. A pivot table can organise and summarise large amounts of data. If you collect raw data and store the data in Excel, then it is a convention that every column will represent a variable and every row will represent an observation. You might like to see and present how different variables are related to each other and how are different observations distributed across these variables. To do this, you need to create a pivot table. A pivot table can also filter the data to display just the details for areas of interest. Once you have a pivot table, you can present the content in form of a chart. Details on creating a pivot chart are set out later in this section. The source data can be: 1. An Excel worksheet, a database/list, or any range that has labelled columns. 2. A collection of ranges must be consolidated. The ranges must contain both labelled rows and columns. 3. A database file created in an external application such as Access.
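Outside Excel, an equivalent crosstab can be built with the pandas library. The sketch below is illustrative only: the column names (Department, Product, Number) echo Example 1.6 that follows, but the rows of data and the department names are hypothetical.

import pandas as pd

# Hypothetical raw data: one row per requisition (names and values are our own)
raw = pd.DataFrame({
    "Department": ["Sales", "Sales", "Finance", "Finance", "HR"],
    "Product":    ["Paper", "Toner", "Paper",   "Toner",   "Paper"],
    "Number":     [10, 4, 7, 6, 3],
})

# Rows = department, columns = product, values = total number requested (with grand totals)
pivot = raw.pivot_table(index="Department", columns="Product",
                        values="Number", aggfunc="sum", margins=True)
print(pivot)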


Note that the data in a pivot table cannot be changed as they are the summary of other data. However, the data set itself (raw data from the spreadsheet) can be changed and the pivot table recalculated thereafter. Formatting changes (bold, number formats, etc.) can be made directly to the pivot table data. The general rule is that you need more than two criteria of data to work with, otherwise you have nothing to pivot. Figure 1.25 depicts a typical pivot table where we have tabulated department with the product required per trimester. Notice the black down-pointing arrows in the pivot table. In Figure 1.25, the rows represent the department and the columns represent the product.

Figure 1.25 Example of an Excel pivot table We could click on a department and view the number of a product type per trimester. But Excel does most of the work for you and puts in those drop-down boxes as part of the wizard. In the example, we can see that the total number of toners required across all departments for all trimesters was 134. Example 1.6 This example consists of a set of data that has been collected to measure the departmental product requirements (paper, toner) across the 3 trimesters.

Figure 1.26 Example 1.6 Excel dataset Excel solution Select Insert > PivotTable. Page | 39

The PivotTable wizard will walk you through the process of creating an initial PivotTable. Select PivotTable as illustrated in Figure 1.27.

Figure 1.27 Selecting Excel PivotTable Input in the Create PivotTable menu the cell range for the data table and where you want the PivotTable to appear. Select a table: D4:G20. Choose to insert the PivotTable in Existing Worksheet in cell D25. Figure 1.28 illustrates the Create PivotTable menu.

Figure 1.28 Create Excel PivotTable menu Click OK Excel creates a blank PivotTable and the user must then drag and drop the various fields from the items. The resulting report is displayed ‘on the fly’ as illustrated in Figure 1.29.


Figure 1.29 Excel dataset with blank PivotTable The PivotTable (cells D25:F42) will be populated with data from the data table in cells B3:E32 with the completion of the PivotTable Fields list which is located at the righthand side of the worksheet. For example, choose to select Department, Product, and Number budget as illustrated in Figure 1.30.

Figure 1.30 Excel PivotTable fields Page | 41

1. From the Field List, drag the fields with the data you want to display in rows to the area on the PivotTable diagram labelled Drop Row Fields Here or into the Row Labels box. 2. Drag the fields with the data you want to display in columns to the area labelled Drop Column Fields Here or into the Column Labels box. 3. Drag the fields that contain the data you want to summarize to the area labelled Drop Value Fields Here or into the Values box. Excel assumes Sum as the calculation method for numeric fields and Count for non-numeric fields. 4. If you drag more than one data field into rows or into columns, you can re-order them by clicking and dragging the columns on the PivotTable itself or in the boxes. 5. To rearrange the fields at any time, simply drag them from one area to another. 6. To remove a field, drag it out of the PivotTable report or untick it in the Field List. Fields that you remove remain available in the field list.

Figure 1.31 Select PivotTable variable for table The completed PivotTable for this problem is shown in Figure 1.32.

Figure 1.32 Excel PivotTable SPSS solution Input data into SPSS


Figure 1.33 SPSS data Select Analyze > Tables > Custom Tables

Figure 1.34 SPSS Custom Table


Figure 1.35 SPSS Custom tables warning Click OK. Transfer Department from Variables to Rows box Transfer Product from Variables to Columns box Transfer Number from Variables to table cells labelled ‘nnnn’

Figure 1.36 Create table in SPSS Custom Tables Change Mean to Sum


Figure 1.37 Change summary statistics data type Click Apply to Selection button

Figure 1.38 Summary statistic now changed to Sum from Mean Click OK SPSS output


Figure 1.39 SPSS table solution This solution agrees with the Excel solution in Figure 1.32.

Summarising the principles of table construction

From the above, we can conclude that when constructing tables good principles to be adopted are as follows:

a. Aim at simplicity.
b. The table must have a comprehensive and explanatory title.
c. The source should be stated.
d. Units must be clearly stated.
e. The headings to columns and rows should be unambiguous.
f. Double counting should be avoided.
g. Totals should be shown where appropriate.
h. Percentages and ratios should be computed and shown where appropriate.
i. Overall, use your imagination and common sense.

Check your understanding

X1.1    Criticise Table 1.13.

Weight of metal      Castings    Foundry
Up to 4 ton          60          210
Up to 10 ton         100         640
All other weights    110         800
Other                20          85
Total                290         2000
Table 1.13 Weight of metal

X1.2    Table 1.14 shows the number of customers visited by a salesman over an 80-week period. Use Excel to construct a grouped frequency distribution from the data set and indicate both stated and mathematical limits (start at 50–54 with class width of 5).


68 64 75 82 68 60 62 71 59 85 75 61 65 75 82 75 94 77 69 74 68 83 71 79 62 67 97 78 88 78 62 76 53 74 86 Table 1.14 Number of customers

88 87 60 85 67

76 74 96 76 73

93 62 78 65 81

73 95 89 71 72

79 78 61 75 63

88 63 75 65 76

73 72 95 80 75

60 66 60 73 85

93 78 79 57 77

1.4 Graphs - visualising of data

Once the data have been tabulated, we might like to graph the data using a variety of visual display methods to provide a suitable graph or chart. In this section we will explore bar charts, pie charts, histograms, frequency polygons, scatter plots, and time series plots. The type of graph we will use to visualise the data depends upon the type of variable we are dealing with within our data set. When describing scales, we classified data as interval, ratio, nominal and ordinal. Different types of data will use different graphs, as per Table 1.15.

Data type              Which graph to use?
Interval and ratio     Histogram and frequency polygon; cumulative frequency curve (or ogive); scatter plots; time series plots.
Nominal or category    Bar chart; pie chart; cross tab tables (or contingency tables).
Ordinal                Bar chart; pie chart; scatter plots.
Table 1.15 Deciding which graph type given data type

'Graph' and 'chart' are terms that are used here interchangeably to refer to any form of graphical display.

Bar charts Bar charts are very useful in providing a simple pictorial representation of several sets of data on one graph. They are used for categorical data where each category is represented by each vertical (or horizontal) bar. In bar charts each category is represented by a bar with the frequency represented by the height of the bar. All bars should have equal width, and the distance between each bar is kept constant. It is important that the axes (X and Y) are labelled and the chart has an appropriate title. What each bar represents should be clearly stated within the chart.


Example 1.7 Figure 1.40 shows a component bar chart for the number of bus tickets sold for route A and route B.

Figure 1.40 Number of bus tickets sold Example 1.8 Consider the categorical data in Example 1.1 which represents the number of undergraduate students choosing to enter master’s courses. Excel can be used to create a bar chart to represent this data set. For each category, a vertical bar is drawn, with the vertical height representing the number of students in that category (or frequency) and the horizontal distance for each bar and distance between each bar equal. Each bar represents the number of students who would choose a course. From the bar chart, you can easily detect the differences of frequency between the five categories: 1. 2. 3. 4. 5.

MSc International Management MSc Business Analysis MA Human Resource Management MSc Project Management MSc Digital Business

Figure 1.41 shows the bar chart for the proposed postgraduate taught course chosen by the undergraduate students.


Figure 1.41 Proposed postgraduate taught course Example 1.9 If you are interested in comparing totals, then a component bar chart (or stacked chart) is constructed. Figure 1.42 shows a component bar chart for the sales of hard disks. In this component bar chart, you can see the variation in total sales from month to month, and the split between the shop sales and online sales per month.

Figure 1.42 Excel component bar chart
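A component (stacked) bar chart like Figure 1.42 could also be sketched in Python with matplotlib. The snippet below is illustrative only and is not part of the book's Excel/SPSS workflow; the figures are the shop and online sales from Table 1.5.

import matplotlib.pyplot as plt

months = ["Jan", "Mar", "May", "Jul", "Sep", "Nov"]
shop   = [23, 56, 123, 158, 134, 182]     # shop sales from Table 1.5
online = [64, 145, 423, 400, 350, 409]    # online sales from Table 1.5

plt.bar(months, shop, label="Shop")
plt.bar(months, online, bottom=shop, label="Online")   # stack the online bars on top of the shop bars
plt.title("Number of hard disks sold")
plt.ylabel("Frequency")
plt.legend()
plt.show()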


Creating a bar chart using Excel and SPSS Excel and IBM SPSS Statistics enable you to easily tabulate and graph your raw data. To illustrate the application of Excel and SPSS we shall use the following data example that provides data on 400 employees working at a local engineering firm. Example 1.10 Figures 1.43 and 1.44 represent the Excel and SPSS data views for the first 10 records out of a total of 400 records (complete data sets are available via the book website). The variables identified in the screenshot are: 1. 2. 3. 4. 5.

ID – employee ID from 1 to 400 Gender – male or female Employment category – job category (1 = trainee, 2 = junior manager, 3 = manager) Current salary (£) Starting salary (£).

Figure 1.43 Example 1.10 Excel data view

Figure 1.44 Example 1.10 SPSS data view Use Excel and SPSS to create a bar chart to represent employment category. Page | 50

Excel solution Step 1 Input data series ID: cells B5:B404 Gender: C5:C404 Employment category: D5:D404 Current salary: E5:E404 Starting salary: F5:F404.

Figure 1.45 Example 1.10 Excel data set Step 2 Create the frequency distribution

Figure 1.46 Creating Excel frequency distribution Step 3 Create the bar chart Highlight H10:I13 (includes labels in cells H10 and I11).


Figure 1.47 Highlight bar chart cell range Step 4 Select Insert > Insert Column or Bar Chart,

Figure 1.48 Excel bar chart options Choose first option This will result in the graph illustrated in Figure 1.49

Figure 1.49 Excel bar chart Step 4 Edit the chart Page | 52

We can now edit the bar chart to remove the frequency (f) from the x-axis and add titles. Right-click on bar chart in Excel and choose Select Data. Click on variable labelled ‘x’

Figure 1.50 Select Data Source and remove variable ‘x’ Choose Remove ‘x’ (Figure 1.51).

Figure 1.51 Variable ‘x’ removed Click OK


Figure 1.52 Excel bar chart Now, add bar chart and vertical axis titles. Select in the far-left menu Add Chart Element.

Figure 1.53 Add Chart Element Now click on the drop-down menu (Figure 1.54) and choose the option you would like to modify.

Figure 1.54 Add Chart Element options For this example, we have added a chart title, axis titles, and placed the frequency values in each bar as illustrated in Figure 1.55. Page | 54

Figure 1.55 Final Excel bar chart Changing the column colours To change the bar colours, select each bar in turn, right-click and select Format Data Point > Solid Fill > choose colour. Repeat this for the other bars. When each bar has a unique colour the chart legend will list each of the bar titles with their respective colours as illustrated in Figure 1.56.

Figure 1.56 Modified Excel bar chart SPSS solution Figure 1.57 shows the SPSS data view of the first 10 records out of a total of 400 records taken from the SPSS data file.


Figure 1.57 Example 1.10 SPSS data The variables identified in the screenshot represent: 1. 2. 3. 4. 5.

ID – employee ID from 1 to 400 Gender – male or female EmpCategory – job category (1 = trainee, 2 = junior manager, 3 = manager) CurrentSalary – current salary (£) StartSalary – starting salary (£).

When SPSS carries out an analysis (calculation of statistics or creating graphs) it will create a separate file, called an SPSS output file, to store the outcome of the analysis. Create a bar chart for the current salary for each employment category. Graphs - Legacy Dialogs – Bar (Figure 1.58). Select Simple

Figure 1.58 SPSS Bar charts menu Select Define Click on Bars Represent N of cases and click on arrow to transfer to the Variable box. Transfer Employment Category variable to the Category Axis box (Figure 1.59).


Figure 1.59 Define Simple Bar summaries for groups of cases Click on Titles Type in Bar chart for employment category (Figure 1.60).

Figure 1.60 Add title Click Continue. Click OK: the output is shown in Figure 1.61. SPSS output


Figure 1.61 SPSS solution

Check your understanding X1.3

Draw a suitable bar chart for the data in Table 1.16.

Industrial sources for consumption and investment demand (thousand million)
Producing industry          Consumption    Investment
Agriculture, mining         1.1            0.1
Metal manufacturers         2.0            2.7
Other manufacturing         6.8            0.3
Construction                0.9            2.7
Gas, electricity & water    1.2            0.2
Services                    16.5           0.8
Total                       28.5           7.8
Table 1.16 Consumption and investment demand

Pie charts In a pie chart the relative frequencies are represented by slices of a circle. Each section represents a category, and the area of a section represents the frequency or number of objects within a category. They are particularly useful in showing relative proportions, but their effectiveness tends to diminish for more than eight categories. Example 1.11 Consider the Example 1.1 proposed postgraduate course chosen data illustrated in table 1.17.


Number of undergraduate students entering differing master's courses (Source: School survey August 2020)

Course                          Frequency
MSc International Management    120
MSc Business Analysis           24
MA Human Resource Management    80
MSc Project Management          45
MSc Digital Business            86
Total                           355

Table 1.17 Proposed postgraduate course These data can then be represented by a pie chart. Figure 1.62 represents a pie chart for proposed postgraduate course.

Figure 1.62 Pie chart for the proposed postgraduate course

We can see that the different slices of the circle represent the different choices that people have when it comes to the chosen postgraduate course. To make a pie chart, you need to calculate the angle of each slice of the circle representing each course. From the table, we can calculate the total number of students: 120 + 24 + 80 + 45 + 86 = 355. Given that 360° represents the total number of degrees in a circle, we can calculate how many degrees would be represented by each student.

For this example, 360° corresponds to 355 students. Therefore, each student is represented by 360/355 degrees. Based upon this calculation we can now calculate the angle for each course category (see Table 1.18).

Course                          Frequency    How to calculate angles for a pie chart    Angle (degrees, 1 d.p.)
MSc International Management    120          = 120 * (360/355)                          121.7
MSc Business Analysis           24           = 24 * (360/355)                           24.3
MA Human Resource Management    80           = 80 * (360/355)                           81.1
MSc Project Management          45           = 45 * (360/355)                           45.6
MSc Digital Business            86           = 86 * (360/355)                           87.2
Total                           355                                                     360
Table 1.18 Calculation of pie chart angles
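The angles in Table 1.18 are easy to check by script. The short Python sketch below is illustrative only; the dictionary name is our own.

# Angle of each pie slice = frequency * (360 / total)
counts = {"MSc International Management": 120, "MSc Business Analysis": 24,
          "MA Human Resource Management": 80, "MSc Project Management": 45,
          "MSc Digital Business": 86}

total = sum(counts.values())                     # 355
for course, f in counts.items():
    print(course, round(f * 360 / total, 1))     # 121.7, 24.3, 81.1, 45.6, 87.2
print(sum(f * 360 / total for f in counts.values()))   # the angles sum to 360 degrees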

The size of each slice (sector) depends on the angle at the centre of the circle, which in turn depends upon the number in the category the sector represents. Before drawing the pie chart, you should always check that the angles you have calculated sum to 360°. A pie chart may be constructed on a percentage basis, or the actual figures may be used.

Creating a pie chart using Excel and SPSS

Excel and IBM SPSS Statistics enable you to easily tabulate and graph your raw data. To illustrate the application of Excel and SPSS we shall use the following data example that provides data on 400 employees working at a local engineering firm.

Example 1.12

We will follow the same steps 1 and 2 as in Example 1.10 and use Figures 1.45 and 1.57, which represent the Excel and SPSS data views for the first 10 records out of a total of 400 records (complete data sets are available via the book website). Below we show how to use Excel and SPSS to create a pie chart to represent employment category.

Excel solution

Step 1 Input data into Excel (first 10 values illustrated)


Figure 1.63 Data set Step 2 Create the frequency distribution for this data set

Figure 1.64 Frequency distribution Step 3 Create the pie chart. In Figure 1.65 (identical to Figure 1.46), highlight H10:I13 (includes labels in cells H10 and I11).

Figure 1.65 Calculation of frequencies using Excel Select Insert > Insert Pie or Doughnut Chart


Figure 1.66 Choose type of pie chart Choose the first option. This will result in the graph illustrated in Figure 1.67.

Figure 1.67 Excel pie chart

Step 4 Edit the chart. The chart can then be edited to improve its appearance, for example, to include a chart title, change the slice colours, and change data label information using the method described for bar charts. Select data and remove variable 'x'. The final pie chart is illustrated in Figure 1.68.


Figure 1.68 Pie chart for employment category SPSS solution Enter data into SPSS

Figure 1.69 SPSS data Create a pie chart for each employment category Select Graphs – Legacy Dialogs – Pie

Figure 1.70 SPSS Pie chart menu Page | 63

Click on Summaries for groups of cases (Figure 1.71).

Figure 1.71 SPSS Pie Charts menu Click on Define Transfer Employment Category to the Define Slices by box (Figure 1.72).

Figure 1.72 SPSS define pie summaries for groups of cases Add the chart title Click on Titles


Figure 1.73 Add chart title Click Continue Click OK The output is shown in Figure 1.74.

Figure 1.74 SPSS pie chart Double-click on the chart to edit.

Check your understanding X1.4

Three thousand six hundred people who work in Bradford were asked about the means of transport which they used for daily commuting. The data collected is shown in Table 1.19. Construct a pie chart to represent this data. Type of Transport Frequency of Response Private Car 1800 Bus 900 Train 300 Other 600 Table 1.19 Type of transport Page | 65

X1.5

The results of the voting in an election are as shown in Table 1.20. Show this information on a pie diagram. Mr P Mr Q Mrs R Ms S Table 1.20 Election results

2045 votes 4238 votes 8605 votes 12012 votes

Histograms

Frequency distribution as a method of showing category-type data as a table has already been covered. This concept can now be extended to graphs. The method used to graph a grouped frequency distribution (or table) is to construct a histogram. A histogram looks like a bar chart, but they are different and should not be confused with each other. Histograms are constructed on the following principles:

a) The horizontal axis (x-axis) is a continuous scale.
b) Each class is represented by a vertical rectangle, the base of which extends from one true limit to the next.
c) The area of the rectangle is proportional to the frequency of the class.

The last point is very important since it means that the area of the bar represents the frequency of each category. In a bar chart the frequency is represented by the height of each bar. This implies that if we double the class width for one bar compared to all the other classes, then we would have to halve the height of that bar compared to all other bars. In the special case where class widths are equal, the height of the bar can be taken to be representative of the frequency of occurrence for that category. It is important to note that either frequencies or relative frequencies can be used to construct a histogram; the shape of the histogram would be the same regardless.

Example 1.13

In Example 1.4 we listed the number of telephone calls per day enquiring about the new iPhone SE over a period of 92 days (Table 1.21). This number of telephone calls per day could be called 'Day'. The data variable 'Day' is a discrete variable, and the histogram is constructed as illustrated in Table 1.22.

Day    Frequency, f
1      8
2      9
3      22
4      23
5      30
Table 1.21 Example 1.13 frequency distribution

We can see from Table 1.22 that all the class widths have the same value of 1 (constant, class width = UCB – LCB). In this case the histogram can be constructed with the height of the bar representing the frequency of occurrence.

Day    LCB – UCB    Class width    Frequency, f
1      0.5 – 1.5    1              8
2      1.5 – 2.5    1              9
3      2.5 – 3.5    1              22
4      3.5 – 4.5    1              23
5      4.5 – 5.5    1              30
                                   Σf = 92
Table 1.22 Creation of frequency distribution

To construct the histogram, we would plot frequency (y-axis, vertical) against Day (x-axis, horizontal), with the upper- and lower-class boundaries determining the class boundary positions for each bar (see Figure 1.75).

Figure 1.75 Identifying histogram class boundaries Figure 1.76 shows the completed histogram.


Figure 1.76 Histogram

We can use the histogram to see how the frequency varies from the lowest number of calls per day (1) to the highest (5). If we look at the histogram, we note that the frequency gradually increases as the number of telephone calls per day increases. These ideas will lead to the ideas of average (central tendency) and data spread (dispersion), which will be explored in Chapter 2.

Example 1.14

Example 1.5 represents the distance travelled by 100 delivery vans. Figure 1.77 shows the data set and Table 1.23 shows the grouped frequency table.

Figure 1.77 Example 1.14 Excel data Page | 68

From this data we can construct the grouped frequency distribution as illustrated in Table 1.23.

Class        Frequency
400 - 449    4
450 - 499    44
500 - 549    36
550 - 599    15
600 - 649    1
Table 1.23 Grouped frequency distribution

The data variable 'distance travelled' is a grouped variable, and the histogram is constructed as illustrated in Table 1.24.

Class        Lower    Upper    Class width    Frequency
400 - 449    399.5    449.5    50             4
450 - 499    449.5    499.5    50             44
500 - 549    499.5    549.5    50             36
550 - 599    549.5    599.5    50             15
600 - 649    599.5    649.5    50             1
Table 1.24 Calculation procedure to identify class limits for the histogram

We can see from the table that all the class widths have the same value, 50 (constant, class width = UCB – LCB). In this case the histogram can be constructed with the height of the bar representing the frequency of occurrence. To construct the histogram, we would plot frequency (y-axis, vertical) against distance travelled (x-axis, horizontal), with the upper- and lower-class boundaries determining the class boundary positions for each bar (see Figure 1.78).

Figure 1.78 Identifying group frequency histogram class boundaries Page | 69

Figure 1.79 shows the completed histogram.

Figure 1.79 Grouped frequency distribution histogram

We can use the histogram to see how the frequency changes as the distance travelled changes from the lowest group (400–449) to the highest group (600–649). If we look at the histogram we can note:

• Looking along the x-axis, we can see that the mileage is evenly spread out.
• The mileage rises (400–449 to 450–499) with a maximum of 450–499 recorded.
• The mileage falls (450–499 to 600–649) with a minimum of 600–649 miles recorded.
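For completeness, the grouped-frequency histogram in Figure 1.79 could also be sketched in Python (illustrative only, outside the book's Excel/SPSS workflow). The sketch assumes an array named distances holds the 100 values from Table 1.9; only the first ten are typed out here.

import numpy as np
import matplotlib.pyplot as plt

# distances is assumed to hold the 100 values from Table 1.9
distances = np.array([499, 501, 553, 455, 452, 506, 533, 526, 492, 514])  # ... remaining values omitted

boundaries = [399.5, 449.5, 499.5, 549.5, 599.5, 649.5]   # true class boundaries (equal widths of 50)
plt.hist(distances, bins=boundaries, edgecolor="black")   # bars touch, as in a histogram
plt.xlabel("Distance travelled (miles)")
plt.ylabel("Frequency")
plt.title("Histogram of distance travelled by delivery vans")
plt.show()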

Creating a histogram using Excel and SPSS Excel solution Step 1 Input data into cells A6:G19 as illustrated in Figure 1.80.


Figure 1.80 Example 1.14 Excel dataset and bin range Step 2 Excel spreadsheet macro command – using Analysis ToolPak. Select Data > Data Analysis > Histogram

Figure 1.81 Excel Data Analysis menu Click OK Input (data) Range: C5:C105 Input Bin Range: E14:E19 Choose location of Output Range: G13


Figure 1.82 Data Analysis Histogram menu Click OK Excel will now print out the grouped frequency table (bin range and frequency of occurrence) as presented in cells G13:H20.

Figure 1.83 Example 1.14 Excel solution for bin range and frequency

We can now use Excel to generate the graphical version of this grouped frequency distribution – the histogram for equal class widths.

Step 3 Create histogram

Input data series:
Class: J14:J19 (label in J14)
Frequency: K14:K19 (label in K14)

Highlight cells J14:K19

Figure 1.84 Excel grouped frequency distribution Step 4 Create column chart (Insert > Column > choose first option). This will create the chart illustrated in Figure 1.85 with chart title and axis titles updated.

Figure 1.85 Excel bar chart Step 5 Transformation of the column chart to a histogram. Select bars by clicking on them as illustrated in Figure 1.86.


Figure 1.86 Transforming bar chart to a histogram Select any one of the bars, right-click and Select Format Data Series

Figure 1.87 Format bar chart data series Reduce Gap Width to zero as illustrated in Figure 1.88.

Figure 1.88 Format data series The final histogram is presented in Figure 1.89 after adding title, axis titles, and removing the bar colour.


Figure 1.89 Excel grouped frequency histogram These ideas will lead to the idea of average (central tendency) and data spread (dispersion) which will be explored in Chapter 2. SPSS solution Input data into SPSS and calculate the bin values (see Example 1.5)– the first 10 values are shown in Figure 1.90.

Figure 1.90 Example 1.14 SPSS dataset Create the histogram. Select Analyze > Descriptive Statistics > Frequencies Transfer distance travelled into the Variable(s) box (Figure 1.91).


Figure 1.91 Transfer variable into Variable(s) box Click on Charts. Choose Histograms

Figure 1.92 Create histogram Click Continue

Figure 1.93 SPSS frequencies menu Page | 76

Click OK SPSS output

Figure 1.94 SPSS solution

Figure 1.95 SPSS solution continued

Check your understanding X1.6

Create a suitable histogram to represent the number of customers visited by a salesman over an 80-week period (Table 1.25). 68 64 75 82 68 60 62 88 76 93 73 79 71 59 85 75 61 65 75 87 74 62 95 78 82 75 94 77 69 74 68 60 96 78 89 61 83 71 79 62 67 97 78 85 76 65 71 75 88 78 62 76 53 74 86 67 73 81 72 63 Table 1.25 Number of customers visited by salesman

88 63 75 65 76

73 72 95 80 75

60 66 60 73 85

93 78 79 57 77

Page | 77

X1.7

Create a suitable histogram to represent the spending (£s) on extra-curricular activities for a random sample of university students during the ninth week of the first term (Table 1.26). 16.91 9.65 22.68 12.45 18.24 8.10 3.25 9.00 9.90 12.87 14.73 8.59 6.50 20.35 8.84 9.50 7.14 10.41 12.80 32.09 8.31 6.50 13.80 9.87 6.29 Table 1.26 Extra-curricular spending

11.79 17.50 13.45 6.74 14.59

6.48 10.05 18.75 11.38 19.25

12.93 27.43 24.10 17.95 5.74

7.25 16.01 13.57 7.25 4.95

13.02 6.63 9.18 4.32 15.90

Scatter and time series plots

A scatter plot is a graph which helps us assess visually the form of the relationship between two variables. To illustrate the idea of a scatter plot, consider the following problem.

Example 1.15

A luxury car business is interested in the relationship between sales and advertising spend. The company has collected 12 months of sales and advertising data, which is presented in Table 1.27. How does the advertising spend impact upon the sales per month over the last 12 months?

Month        Advertising spend (£)    Number of sales
January      15712                    17
February     53527                    75
March        66528                    84
April        31118                    51
May          95460                    118
June         69116                    90
July         29335                    61
August       96701                    100
September    38706                    54
October      60389                    89
November     35783                    58
December     47190                    70
Table 1.27 12 months of sales and advertising data

Figure 1.96 shows the scatter plot. As can be seen, there would seem to be some form of relationship, as the number of sales increases as the advertising spend increases. The data, in fact, would indicate a positive relationship.

Page | 78

Figure 1.96 Scatter plot for the number of sales against advertising spend Care needs to be taken when using graphs to infer what this relationship may be. For example, if we modify the y-axis scale then we have a very different picture of this potential relationship. Figure 1.97 illustrates the effect on the graph of modifying the vertical scale.

Figure 1.97 What happens if we change the size of the vertical axis?

We can see that the data points are now hovering above the x-axis, with the increase in the vertical direction not as pronounced as in Figure 1.96. If we further increased the y-axis scale, then this pattern would be diminished even further. This example illustrates well how important the scale is when charting data.
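The effect described above is easy to reproduce. The Python sketch below (illustrative only, outside the book's Excel/SPSS workflow) draws the scatter plot of the Table 1.27 data; uncommenting the axis-limit line stretches the y-axis and flattens the apparent relationship, as in Figure 1.97.

import matplotlib.pyplot as plt

advertising = [15712, 53527, 66528, 31118, 95460, 69116,
               29335, 96701, 38706, 60389, 35783, 47190]   # from Table 1.27
sales = [17, 75, 84, 51, 118, 90, 61, 100, 54, 89, 58, 70]

plt.scatter(advertising, sales)
plt.xlabel("Advertising spend (£)")
plt.ylabel("Number of sales")
# plt.ylim(0, 1000)   # a much larger y-axis range makes the upward pattern look far weaker
plt.show()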


A time series plot is concerned with data collected over time. It attempts to isolate and identify the influence of the various factors which contribute to changes over time in such a variable. Examples of time series data include imports and exports, sales, unemployment, and prices. If we can determine the main components which determine the value of, say, sales for a month, then we can project the series into the future to obtain a forecast.

Example 1.16

Consider the time series data table presented in Example 1.15, but this time we wish to construct a plot of the number of sales against time (months).

Figure 1.98 Time series plot Figure 1.98 illustrates the up and down pattern, with the overall number of sales oscillating between January and December. We shall explore these ideas of trend and seasonal components in Chapters 9, 10, and 11.
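The corresponding time series plot can also be sketched in Python (illustrative only), plotting the number of sales in month order from Table 1.27.

import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
sales = [17, 75, 84, 51, 118, 90, 61, 100, 54, 89, 58, 70]   # from Table 1.27

plt.plot(months, sales, marker="o")   # join the monthly values in time order
plt.xlabel("Month")
plt.ylabel("Number of sales")
plt.title("Number of sales per month")
plt.show()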

Creating a scatter and time series plot using Excel and SPSS

Excel solution

Step 1 Input data series:
Month: B4:B16 (including data label)

Advertising spend: C4:C16 (including data label) Number of sales: D4:D16 (including data label)

Figure 1.99 Example 1.15 data Highlight C4:D16

Figure 1.100 Example 1.15 data set highlighted Step 2 Select Insert > Scatter and choose option 1


Figure 1.101 Select Scatter plot type (option 1)

Figure 1.102 Excel scatter plot Step 3 Edit the chart. The chart can then be edited to improve its appearance, for example, to include chart title, axis titles, removal of horizontal gridlines, and removal of the legend as illustrated in Figure 1.103.


Figure 1.103 Excel scatter plot after update We could now ask Excel to fit a straight line to this data chart by clicking on a data point on the chart, right-clicking on a data point, and choosing Add Trendline. We will look at fitting a trend line and curves to scatter and time series data charts in Chapters 8 to 10. SPSS Solution Two solutions: (a) scatterplot, (b) time series plot. SPSS solution for a scatterplot Input data into SPSS

Figure 1.104 Example 1.15 SPSS dataset Graphs - Legacy Dialogs Page | 83

Figure 1.105 Creating SPSS scatterplot Select Scatter/Dot. Choose Simple Scatter

Figure 1.106 Choose type of SPSS scatter plot Choose Define. Transfer Number_of_sales to the Y Axis box Transfer Advertising_spend to the X Axis box.


Figure 1.107 Create SPSS simple scatter plot Now click on Titles and add a title to your chart

Figure 1.108 Add title to scatter plot Click Continue


Figure 1.109 Create SPSS simple scatter plot update Select OK SPSS output

Figure 1.110 SPSS solution SPSS solution for a time series plot Page | 86

Step 1 Given we have time series data then we need to define date and time.

Figure 1.111 SPSS data Select Data > Define date and time

Figure 1.112 Define date and time in SPSS Starting Jan 2019 to December 2019

Figure 1.113 Date and time added for the 12 months


Step 2 Create time series graph Select Graphs > Legacy Dialogs > Scatter/Dot

Figure 1.114 Creating scatterplot (this will be a time series plot)

Figure 1.115 Scatter/Dot options Choose Simple Scatter Transfer Number_of_sales to the Y Axis: box Transfer Month, period 12 [Month] to the X Axis: box


Figure 1.116 Simple Scatterplot Click OK

Figure 1.117 Time series plot Finally, edit the graph to give a final version as illustrated in Figure 1.118.


Figure 1.118 Time series plot for number of sales against time.

Check your understanding X1.8

Obtain a scatter plot for the data in Table 1.28 and comment on whether there is a link between road deaths and the number of vehicles on the road. Would you expect this to be true? Provide reasons for your answer.

Country           Vehicles per 100 population    Road deaths per 100,000 population
Great Britain     31                             14
Belgium           32                             30
Denmark           30                             23
France            46                             32
Germany           30                             26
Irish Republic    19                             20
Italy             35                             21
Netherlands       40                             23
Canada            46                             30
U.S.A.            57                             35
Table 1.28 Number of vehicles and road deaths

X1.9    Obtain a scatter plot for the data in Table 1.29 that represents the passenger miles flown by a UK-based airline (millions of passenger miles). Comment on the relationship between miles flown and quarter.

Year    Quarter 1    Quarter 2    Quarter 3    Quarter 4
2003    98.9         191.0        287.4        123.2
2004    113.4        228.8        316.2        155.7
Table 1.29 Passenger miles


Chapter summary

The methods described in this chapter are very useful for describing data using a variety of tabulated and graphical methods. These methods allow you to make sense of data by constructing visual representations of numbers within the data set. Table 1.30 provides a summary of which table/graph to construct, given the data type.

Which table or chart to be applied                       Numerical data                                                  Categorical data
Tabulating data                                          Frequency distribution; cumulative frequency distribution.     Summary table.
Graphing data                                            Histogram; frequency polygon.                                   Bar chart; pie chart.
Presenting a relationship between data variables         Scatterplot; time series graph.                                 Contingency table.
Table 1.30 Which table or chart to use?

In the next chapter we will look at summarising data using measures of average, dispersion, and shape.

Test your understanding TU1.1 A small company is promoting healthy lifestyles with its workforce and provides during the tea break a piece of fruit. Table 1.31 represents the fruit chosen by 36 workers. Apple Plum Banana Apple Peach Pear Orange Apple Pear Apple Pear Peach Pear Peach Apple Orange Apple Banana Table 1.31 Choice of fruit

Apple Apple Orange Apple Orange Orange

Plum Peach Plum Plum Apple Apple

a. Construct a tally chart for the type of fruit chosen. b. Clearly state the frequency of occurrence for each fruit. c. Construct an appropriate bar chart for the type of fruit chosen. TU1.2 The company described in TU1.1 is interested in the concept that worker productivity is dependent upon happy workers. One of the criteria to measure this is the number of pets that workers own. Table 1.32 represents the number of pets owned by 36 of the workers. 1 4 3 2 1 2 0 1 4 2 1 0 4 3 2 1 Table 1.32 Number of pets

0 2 2 0

2 2 1 3

1 0 2 4

3 3 1 2 Page | 91

a. Construct a tally chart for the number of pets owned. b. Clearly state the frequency of occurrence for the number of pets owned. c. Construct an appropriate bar chart for the number of pets owned. TU1.3 The monthly sales for a second-hand car dealership are provided in Table 1.33. Month Frequency January 16 February 24 March 22 April 26 May 27 June 31 Table 1.33 Monthly sales

Month July August September October November December

Frequency 30 29 24 20 15 10

a. Draw a line graph for the number of sales.
b. Use the graph to describe how the sales are varying per month.
c. What would you predict as sales for the following month?

TU1.4 Six hundred people are surveyed on the mode of transport they use to get to work, as shown in Table 1.34. Construct a suitable pie chart to represent these data.

Type of transport   Frequency
Train               90
Bus                 120
Car                 300
Cycle               30
Walk                60
Table 1.34 Mode of transport

TU1.5 State the class boundaries for a data set that varies between 11.6 and 97.8 when we would like to have 9 classes.

TU1.6 The data in Table 1.35 show the annual sales for a business over a period of 11 years.

Year   Sales   Year   Sales
2007   13      2013   20.5
2008   17      2014   20
2009   19      2015   19
2010   20      2016   17
2011   20.5    2017   143
2012   20.5
Table 1.35 Annual sales

Page | 92

a. Construct a time series plot for sales against time.
b. Use the time series plot to comment on how the annual sales have changed from 2007 to 2017.

TU1.7 The data in Table 1.36 show the cost of electricity (£) during June 2018 for 50 one-bedroom flats.

 96   171   202   178   157   185    90   116   141   149
206   175    95   163   150   154   108   119   183   151
129   139   109   130   158   149   167   165   147   172
123   130   114   127    82   102   111   128   143   135
166   137   153   148   144   187   191   168   213   197
Table 1.36 Cost of electricity (£)

a. Construct an appropriate grouped frequency distribution.
b. Using this frequency distribution, construct a histogram.
c. Around what amount does the June 2018 electricity cost appear to be concentrated?

TU1.8 During a manufacturing process a sample of fifty 2 litre bottles of pop are checked and the amount of pop (litres) is measured as given in Table 1.37.

2.109   1.963   2.003   2.031   2.029   2.065   2.036   1.908   1.981   1.999
2.014   2.025   2.015   2.086   1.957   1.973   1.996   2.012   2.005   2.038
1.894   1.951   1.975   1.997   1.984   2.014   2.066   2.075   1.951   1.971
2.012   2.044   2.052   1.941   1.938   2.010   2.023   2.020   1.967   1.986
2.012   1.941   1.947   2.057   2.029   2.012   1.992   1.966   1.994   1.969
Table 1.37 Quantity of pop in each bottle (litres)

a. Construct an appropriate grouped frequency distribution. b. Using this frequency distribution, construct a histogram. c. Use this histogram to comment on whether the sample content concentrates about specific values. Does this match the advertised amount of 2 litres per bottle? TU1.9 The sales of diesel cars over a period of 12 years in the UK are presented in Table 1.38.

Page | 93

Time point   New diesel car sales   Time point   New diesel car sales
March 2007   166667                 March 2013   186667
March 2008   180000                 March 2014   213333
March 2009   133333                 March 2015   226667
March 2010   160000                 March 2016   233333
March 2011   170000                 March 2017   240000
March 2012   180000                 March 2018   146667
Table 1.38 Sales of diesel cars

a. Construct a time series plot for new diesel car sales against time point.
b. What do you notice about the sales over time?
c. What happened to sales after the year from March 2017 to March 2018?
d. Based upon your business knowledge, why would we have this pattern?

Want to learn more? The textbook online resource centre contains a range of documents to provide further information on the following topics:

1. A1Wa Histograms with unequal class widths.
2. A1Wb Frequency polygon.
3. A1Wc Cumulative frequency curve.
4. A1Wd Superimposing two sets of data onto one graph.

Page | 94

Chapter 2 Descriptive statistics

2.1 Introduction and learning objectives

Introduction

A journey from data to knowledge takes you through different phases, but most of them can be summarised by the following stages:

1. Acquire raw data – create or import data sets from sources.
2. Organise raw data – tabulate data or put them in a spreadsheet format.
3. Understand the meaning of the data – calculate statistics that describe data.
4. Present and visualise the data – plot or chart various aspects of the data set.
5. Make inferences and draw conclusions – use findings to understand how things work in general, or how related phenomena might behave.

The first point about acquiring raw data is something that is usually covered in research methods modules. The second point is a prerequisite that every student has already learned through previous stages of education, and we have also covered some of the specifics in Chapter 1. The third point, as well as partially the fourth point, will be the subject of this chapter. The final point is covered in greater depth in the remainder of this book. In this chapter, we shall look at three key statistical measures that enable us to describe a data set. They are the measures of central tendency, measures of dispersion and measures of shape. The meaning of these three concepts is as follows:

1. The central tendency (or measure of average) is a single value that is defined as a typical value for the whole data set. This typical value is usually the value that best represents the whole set; we can use an average value, for example. However, there is more than one measure of central tendency that can be applied. The three most used are the mean (colloquially called an average), the median and the mode. These can be calculated for both ungrouped (individual data values known) and grouped (data values within class intervals) data sets.

2. The dispersion (or measure of spread) is the amount by which all the data values are dispersed around the central tendency value. The more closely the data surround the central tendency value, the more representative this central tendency value is for the given data set and the smaller the dispersion. Several different measures of dispersion exist, and we will include the range, interquartile range, semi-interquartile range, standard deviation, and variance. These can also be calculated for both ungrouped and grouped data sets.

3. The shape of the distribution is the pattern that the data set will form. This shape usually defines the peak and the symmetry of the pattern. Different shapes will not only look different, but will also determine where the different measures of central tendency are placed and ordered. Shapes can be classified according to whether the distribution is symmetric (or skewed) and whether there is evidence that the shape is peaked. Skewness is defined as a measure of the lack of symmetry in a distribution. Kurtosis is defined as a measure of the degree of peakedness in the distribution.

Page | 95

Different students have different habits when it comes to preparing for exams. Some are early morning birds, and some like to do revision in the evening, or even late into the night. Let us assume that you fall in the latter category. A friend of yours asks you: 'How do you stay awake?' To which the answer is that you drink lots of coffee. Your friend wants to know what you mean by 'lots of coffee'. How many cups in particular? You answer: 'Well, it really depends, but on average it must be 4 cups of coffee per night.' You just used a summary statistic, called an average. Rather than describing what happens every night when you do the revision, you used a single number to summarise and describe your typical revision night. This is how the statistical measures that we will cover in this chapter work. They provide an instant summary and description of a much broader pattern. However, there are various summary statistics, so we need to understand all of them to make sure we use them appropriately.

The above example is anecdotal. You might be working on a serious project and handling lots of tables summarising data. Although tables, diagrams and graphs provide easy-to-assimilate summaries of data, they only go part of the way in fully describing data. Often a concise numerical description is preferable as it enables us to interpret the significance of the data. Measures of average (or central tendency) attempt to quantify what we mean by the 'typical' or 'average' value for a data set. The concept of central tendency is extremely important, and it is encountered in daily life. For example:

• What are the average CO2 emissions for a particular car compared to other similar cars?
• What is the average starting salary for new graduates starting employment with a large city bank?

Measures of dispersion such as the standard deviation, on the other hand, provide an additional layer of information. Sometimes stating the average value is just not enough. Imagine you work for a company with two factories, one in Italy and one in the UK. Your factory in Italy might be producing a product that contains some impurities and they amount on average to 5.3 particles per million (ppm). Your UK factory’s average is 4.7 ppm. At first glance, the UK factory is making a purer product. However, the standard deviation for the average in Italy is 0.9 ppm and the standard deviation for the UK average is 1.8 ppm. What does this tell you? Suddenly you realise that although the UK product has a lower average value, the factory in Italy has better quality control and much less variation in its manufacturing process. Figures 2.1– 2.3 depict these statistics.

Page | 96

Figure 2.1 UK factory amount of impurities

Figure 2.2 Italian factory amount of impurities

Figure 2.3 Impurity comparison between UK and Italian factories This simple example illustrates that measures of central tendency (averages) provide only a partial picture. With the additional measure of dispersion (the standard deviation) we were able to gain deeper understanding of how the two factories performed. A final concept in analysing the data set is the shape of the distribution, which can be measured using the concepts of skewness and kurtosis. You will find similar applications for the measures of shape, as we go through this chapter. Page | 97

Learning objectives

On completing this chapter, you will be able to:

1. Understand the concept of an average and be able to recognise different types of averages for raw data and summary data: mean, mode and median.
2. Understand the concept of dispersion and be able to recognise different types of dispersion for raw data and summary data: range, interquartile range, semi-interquartile range, standard deviation, and variance.
3. Understand the idea of distribution shape and calculate a value for symmetry and peakedness.
4. Apply exploratory data analysis to a data set.
5. Use Microsoft Excel and IBM SPSS Statistics to calculate data descriptors.

2.2 Measures of average for a set of numbers

We know that there are four measurement scales (or types of data): nominal, ordinal, interval and ratio. These are simply ways to categorise different types of variables. Table 2.1 provides a summary of which statistical measures to use for different types of data.

                    Summary statistic to be applied
Data type           Average                Spread or Dispersion
Nominal             Mode                   N/A
Ordinal             Mode, Median           Range, interquartile range
Ratio or interval   Mode, Median, Mean     Range, interquartile range, Variance, Standard deviation, Skewness, Kurtosis
Table 2.1 Which summary statistic to use?

Whenever you handle data, or even just look at a data set, you intuitively try to summarise what you have in front of you. The summary statistics effectively give you elementary descriptors that define the data set. Some of these descriptors, such as the average value, you are already familiar with, but not necessarily with the others. By understanding what data descriptors are available and how to use them, you will be able to position your business arguments better and understand more clearly what has been presented to you. Imagine that you are doing price analysis and you are trying to define the competitor’s average price in the market. If you use the mean as the average value, then your competitor is operating at an average price of £75 per metre. If you use the median value as an average, then their average price is £55 per metre. Which one represents better what you have measured? Why should you use one and not the other to decide about your pricing policy? The next section will provide answers to all these questions.

Page | 98

Mean, median and mode for a set of numbers

If the mean is calculated from the entire population of the data set, then it is called the population mean. If we sample from this population and calculate the mean, then the mean is called the sample mean. The population and sample mean are calculated using the same formula:

Mean = Sum of data values / Total number of data values

For example, if DIVE Ltd were interested in the mean time for a consultant to travel by train from London to Newcastle and if we assume that DIVE Ltd has gathered the time (rounded to the nearest minute) for the last five trips (445, 415, 420, 435 and 405), then the mean time would be:

Mean = (445 + 415 + 420 + 435 + 405) / 5 = 2120 / 5 = 424

The mean time to travel between London and Newcastle was calculated to be 424 minutes. We can see that the mean uses all the data values in the data set (445 + 415 + … + 405) and provides an acceptable average if we do not have any values that can be considered unusually large or small. If we added an extra value of one single trip that took 900 minutes, then the new mean would be 503.3 minutes. This would not be representative of the other data values in the data set, which range in value from 405 to 445. Such extreme values, called outliers, tend to skew the data distribution. In this case we would use a different measure to calculate the value of central tendency. An alternative method to calculate the average is the median. The median is literally the 'middle' number if you list the numbers in order of size. The median is not as susceptible to extreme values as is the mean. Besides the mean and median, there is a third method for determining the average, called the mode. The mode is defined as the number that occurs most frequently in the data set. It can be used for both numerical and categorical (or nominal) data variables. A major problem with the mode is that it is possible to have more than one modal value representing the average for numerical data variables. Several examples are provided to demonstrate how these measures of central tendency are calculated.

Example 2.1

Suppose the advertising spend by Rubber Duck Ltd is as illustrated in Table 2.2. We can describe the overall advertising spend by calculating an 'average' score using the mean, median, and mode.

Page | 99

ID   Month       2019-2020 Advertising spend (£) by Rubber Duck Ltd
1    January     15712
2    February    53527
3    March       66528
4    April       31118
5    May         95460
6    June        15712
7    July        29335
8    August      96701
9    September   38706
10   October     60389
11   November    35783
12   December    47190
Table 2.2 Advertising spend (£)

The Mean

In general, the mean can be calculated using the formula:

Mean (X̄) = Sum of data values / Total number of data values = ∑X / N     (2.1)

where X̄ ('x-bar') represents the mean value for the sample data, ∑X represents the sum of all the data values, and N represents the number of data values. If the data represent the population of all data values, then the mean would represent the population mean. Alternatively, if the data represent a sample from the population then the mean would be called a sample mean. For the advertising spend example above, the mean is calculated as:

X̄ = ∑X / N = (15712 + 53527 + ⋯ + 47190) / 12 = 586161 / 12 = 48846.75

The mean advertising spend is £48,846.75.

The Median

The median is defined as the middle number when the data are arranged in order of size. The data in Table 2.2 are not ranked, so they must first be put in order of size before this method is used manually (see Table 2.3). With 12 data values, the median lies halfway between the 6th and 7th numbers in the ordered list, so that there are six numbers on each side of the median. If the data set were much larger, we would need to rely on a formula rather than visually positioning the value.

Page | 100

The position of any percentile within the ordered list of numbers is given by equation (2.2):

Position of percentile = (P / 100) × (N + 1)     (2.2)

where P represents the percentile value and N represents the number of numbers in the data set. A percentile is a value on a scale of 100 that indicates the percentage of a distribution that is equal to or below it. In our case, the median is the 50th percentile (P = 50):

Position of median = (P / 100) × (N + 1) = (50 / 100) × (12 + 1) = 6.5

Position of the median = 6.5th number from the data set which is listed in order of size as illustrated in Table 2.3.

ID   Advertising spend in size order (ranked)   Ascending rank order
1    15712                                      1
6    15712                                      2
7    29335                                      3
4    31118                                      4
11   35783                                      5
9    38706                                      6
12   47190                                      7
2    53527                                      8
10   60389                                      9
3    66528                                      10
5    95460                                      11
8    96701                                      12
Table 2.3 Numbers listed in order of size

6th number = 38706
7th number = 47190

Now, use linear interpolation to calculate the 6.5th number:

6.5th number = 6th number + 0.5 × (7th number − 6th number)
6.5th number = 38706 + 0.5 × (47190 − 38706) = 42948

The median value is £42,948.
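If you also work in Python, the percentile-position method of equation (2.2) is easy to reproduce. The short sketch below is only an optional cross-check (it is not part of the Excel/SPSS workflow used in this book), and the function name percentile_position_method is simply our own label for illustration.

    # A minimal sketch of equation (2.2): position = (P/100) * (N + 1),
    # followed by linear interpolation between the two surrounding values.
    def percentile_position_method(data, p):
        values = sorted(data)
        n = len(values)
        pos = (p / 100) * (n + 1)        # 1-based position in the ordered list
        lower = int(pos)                 # whole-number part of the position
        frac = pos - lower               # fractional part used for interpolation
        if lower < 1:
            return values[0]
        if lower >= n:
            return values[-1]
        return values[lower - 1] + frac * (values[lower] - values[lower - 1])

    spend = [15712, 53527, 66528, 31118, 95460, 15712,
             29335, 96701, 38706, 60389, 35783, 47190]
    print(percentile_position_method(spend, 50))   # 42948.0, the median found above

Note that many software libraries use a slightly different default percentile rule, so their results can differ marginally from this (N + 1) method; the Excel QUARTILE.EXC function referred to later in this chapter follows the same rule as the sketch above.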

Page | 101

The Mode The mode is defined as the number which occurs most frequently (the most ‘popular’ number). In our case only 15712 appears twice and it is, therefore, the most ‘popular’ value. Hence, the value of the mode is £15,712. If no number is repeated, then there is no mode. We have already said that if two, three or more numbers are repeated equal numbers of times, then it is also impossible to define the modal value. Excel solution Figures 2.4 and 2.5 illustrates the Excel solution.

Figure 2.4 Calculate measures of average

Figure 2.5 Calculate measures of average (Excel formulae)

Page | 102

The above values imply that, depending on what measure we use, the average advertising spend can be £48,846.75 (the mean), £42,948 (the median) or £15,712 (the mode). The choice of measure will depend on the type of numbers within the data set and the context.

SPSS solution

Using the Table 2.2 data, we can extract the same statistics from SPSS in the following manner. Enter data into SPSS.

Figure 2.6 Example 2.1 SPSS data set

With SPSS we have three Descriptive Statistics methods that can be used to calculate these statistics:

1. Frequencies
2. Descriptives
3. Explore

Method 1: Frequencies

Select Analyze > Descriptive Statistics > Frequencies.

Figure 2.7 SPSS frequencies menu Transfer Advertising_spend to the Variable(s) box Page | 103

Figure 2.8 SPSS Frequencies menu Click on Statistics. Under Central Tendency choose Mean, Median, and Mode (Figure 2.9).

Figure 2.9 SPSS frequencies statistics options Click Continue. Click OK

Page | 104

SPSS output

Figure 2.10 SPSS frequencies solution

The SPSS values for the mean (£48,846.75), median (£42,948) and mode (£15,712) agree with the Excel solutions in Figure 2.5.

Method 2: Descriptives

Warning: This method provides the mean but not the median or mode.

Select Analyze > Descriptive Statistics > Descriptives
Transfer variable Advertising_spend to the Variable(s) box

Figure 2.11 SPSS descriptives menu Click on Options and choose: Mean

Page | 105

Figure 2.12 SPSS descriptives options Click on Continue Click on OK SPSS output

Figure 2.13 SPSS descriptives solution

The output gives a mean of £48,846.75, but this method does not give the median or mode.

Method 3: Explore

Warning: Gives the mean and median but not the mode.

Select Analyze > Descriptive Statistics > Explore
Transfer variable Advertising_spend to the Variable(s) box

Page | 106

Figure 2.14 SPSS explore menu Click on Statistics and choose Descriptives

Figure 2.15 SPSS explore statistics options Click Continue Click OK SPSS output

Page | 107

Figure 2.16 SPSS explore solution The output gives a mean of £48,846.75 and a median of £42,948. This method does not give the value of the mode but provides a series of other summary statistics that we will explore within this and other chapters.
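As an optional cross-check for readers who also use Python (it is not needed for the Excel/SPSS approach above), the same three averages can be reproduced with the standard library statistics module:

    import statistics

    spend = [15712, 53527, 66528, 31118, 95460, 15712,
             29335, 96701, 38706, 60389, 35783, 47190]

    print(statistics.mean(spend))     # 48846.75
    print(statistics.median(spend))   # 42948.0 (midpoint of the 6th and 7th ordered values)
    print(statistics.mode(spend))     # 15712 (the only value that appears twice)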

Check your understanding

X2.1 In 12 consecutive innings a batsman's scores were: 6, 13, 16, 45, 93, 0, 62, 87, 136, 25, 14, 31. Find his mean score and the median.

X2.2 The following are the IQs of 12 people: 115, 89, 94, 107, 98, 87, 99, 120, 100, 94, 100, 99. It is claimed that 'the average person in the group has an IQ of over 100'. Is this a reasonable assertion?

X2.3 A sample of six components was tested to destruction, to establish how long they would last. The times to failure (in hours) during testing were 40, 44, 55, 55, 64, 69. Which would be the most appropriate average to describe the life of these components? What are the consequences of your choice?

X2.4 Find the mean, median and mode of the following set of data: 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5.

2.3 Measures of dispersion for a set of numbers In Section 2.2 we looked at the concept of central tendency, which provides a measure of the middle value of a data set: mean, median and mode. As useful as these statistics are, they only provide a partial description. A fuller description can be obtained by also obtaining a measure of dispersion (otherwise known as measure of spread, or measure Page | 108

of variation) of the distribution. Measures of dispersion indicate whether the values in the group are distributed closely around an average, or whether they are more dispersed. These measures are also particularly useful when we wish to compare distributions. To illustrate this, consider the two hypothetical distributions presented in Figure 2.17, which measure the daily value of sales made by two salespersons in their respective sales areas over a period of one year. Suppose the means of the two distributions, A and B, were 3500 and 5500, respectively. But as you can see, their shapes are very different, with B being far more spread out.

Figure 2.17 Comparison of two distributions What would you infer from the two distributions given about the two salespersons and the areas that they work in? We can see that the distributions, A and B, have different mean values, with distribution B being more spread out (or dispersed) than distribution A. Furthermore, distribution A is taller than distribution B. In this section, we will introduce the methods that can be used to put a number to this idea of dispersion. The methods we will explore include the range, interquartile range, semi-interquartile range, variance, standard deviation, and coefficient of variation. Dispersion is also called variability, scatter, or spread. A proper description of a set of data should include both characteristics: average and dispersion.

Percentiles and quartiles for a set of numbers As we already learned, the median represents the middle value of the data set, which corresponds to the 50th percentile (P = 50). This is also known as the second quartile. A data set always needs to be ranked in order of size – only then can we use the technique described below to calculate the values that would represent individual percentile or quartile values. Example 2.2

Page | 109

Reconsider Example 2.1 with the monthly advertising spend (Table 2.4). We will demonstrate how to calculate the first and third quartiles, that is, the 25th and 75th percentiles.

ID   Month       2019-2020 Advertising spend (£) by Rubber Duck Ltd
1    January     15712
2    February    53527
3    March       66528
4    April       31118
5    May         95460
6    June        15712
7    July        29335
8    August      96701
9    September   38706
10   October     60389
11   November    35783
12   December    47190
Table 2.4 Advertising spend (£)

To calculate the quartiles, like the median, we need to list the numbers in order of size as illustrated in table 2.5. 2019-2020 Advertising spend (£) by Rubber Duck Ltd 15712

ID 1

Advertising spend in size order (ranked) 15712

ID 1

Month January

2

February

53527

6

15712

3

March

66528

7

29335

4

April

31118

4

31118

5 6 7

May June July

95460 15712 29335

11 9 12

35783 38706 47190

8

August

96701

2

53527

9 10

September October

38706 60389

10 3

60389 66528

11 November 35783 5 12 December 47190 8 Table 2.5 Advertising data and data listed in order of size

95460 96701

Page | 110

First quartile, Q1

The first quartile corresponds to the 25th percentile and the position of this value within the ordered data set is given by equation (2.2):

Position of 25th percentile = (P / 100) × (N + 1)

P = 25, N = 12

Position of 25th percentile = (25 / 100) × (12 + 1) = 3.25th number

We therefore take the 25th percentile to be the number that is one quarter of the distance between the 3rd and 4th numbers. To solve this problem, we use linear interpolation:

3rd number = 29335
4th number = 31118

3.25th number = 3rd number + 0.25 × (4th number − 3rd number)
3.25th number = 29335 + 0.25 × (31118 − 29335) = 29780.75

The first quartile advertising spend is £29780.75. This means that 25% of the data have a value that is equal to or less than £29780.75.

Third quartile, Q3

The third quartile corresponds to the 75th percentile, and the position of this value within the ordered data set is also given by equation (2.2):

P = 75, N = 12

Position of 75th percentile = (75 / 100) × (12 + 1) = 9.75th number

We therefore take the 75th percentile to be the number that is three quarters of the distance between the 9th and 10th numbers. To solve this problem, we use linear interpolation:

9th number = 60389
10th number = 66528

9.75th number = 9th number + 0.75 × (10th number − 9th number)
9.75th number = 60389 + 0.75 × (66528 − 60389) = 64993.25

Page | 111

The third quartile advertising spend is £64993.25. This means that 75% of the data have a value that is equal to or less than £64993.25. Excel solution Figures 2.18 to 2.20 illustrate the Excel solution.

Figure 2.18 Example data

Figure 2.19 Excel formula solution

Page | 112

Figure 2.20 Excel function solution

Note that in cells M5:M7 we use dedicated Excel functions. From Excel, we observe:

1. 25th percentile = first quartile = £29780.75
2. 50th percentile = second quartile = median = £42948.00
3. 75th percentile = third quartile = £64993.25

These results agree with the manual results. SPSS solution

Figure 2.21 SPSS data Using the Explore method Select Analyze > Descriptives

Figure 2.22 SPSS Descriptives menu Page | 113

Select Explore Transfer Advertising_spend to the Variable(s) box

Figure 2.23 SPSS Frequencies menu Click on Statistics and choose Percentiles

Figure 2.24 SPSS explore statistics options Click Continue Click OK SPSS output

Figure 2.25 SPSS explore solution Page | 114

From SPSS, we observe:

1. 25th percentile = first quartile = £29780.75
2. 50th percentile = second quartile = median = £42948.00
3. 75th percentile = third quartile = £64993.25

These results agree with the Excel Quartile.EXC function results. Example 2.3 How would we calculate the value of the 20th percentile? 2019-2020 Advertising spend (£) by Rubber Duck Ltd 15712

ID 1

Advertising spend in size order (ranked) 15712

ID 1

Month January

2

February

53527

6

15712

3

March

66528

7

29335

4

April

31118

4

31118

5 6 7

May June July

95460 15712 29335

11 9 12

35783 38706 47190

8

August

96701

2

53527

9 10

September October

38706 60389

10 3

60389 66528

11 November 35783 5 12 December 47190 8 Table 2.6 Advertising data and data listed in order of size

95460 96701

P = 20, N = 12

Position of 20th percentile = (20 / 100) × (12 + 1) = 2.6th number

We therefore take the 20th percentile to be the number that is 0.6 of the distance between the 2nd and 3rd numbers. To solve this problem, we use linear interpolation:

2nd number = 15712
3rd number = 29335

2.6th number = 2nd number + 0.6 × (3rd number − 2nd number)

Page | 115

2.6th number = 15712 + 0.6 * (29335 – 15712) = 23885.80 The 20th percentile advertising spend is £23885.80 This means that 20% of the data have a value that is equal to or less than £23885.80. Note: If you wanted to calculate the 56th percentile then use the manual method to show the 56th percentile value is £48964.36. Excel solution

Figure 2.26 Example 2.3 Excel solution

From Excel, we observe:

1. 25th percentile = first quartile = £29780.75
2. 50th percentile = second quartile = median = £42948.00
3. 75th percentile = third quartile = £64993.25
4. 20th percentile = £23885.80
5. 56th percentile = £48964.36

These results agree with the manual results.

SPSS solution

Select Analyze > Descriptive Statistics > Frequencies.
Transfer Advertising_spend to the Variable(s) box

Page | 116

Figure 2.27 SPSS frequencies menu Click on Statistics. Click on Percentiles Type 20 into the percentiles box. Click on Add Type 56 into the percentiles box Click on Add

Figure 2.28 SPSS frequencies statistics options Click Continue Click OK SPSS output

Page | 117

Figure 2.29 SPSS frequencies solution

From SPSS, we observe:

1. 25th percentile = first quartile = £29780.75
2. 50th percentile = second quartile = median = £42948.00
3. 75th percentile = third quartile = £64993.25
4. 20th percentile = £23885.80
5. 56th percentile = £48964.36

These results agree with the manual and Excel results.

Check your understanding

X2.5 A class of 20 students had the following scores on their most recent test: 75, 77, 78, 78, 80, 81, 81, 82, 83, 84, 84, 84, 85, 87, 87, 88, 88, 88, 89, 90. Calculate: (a) the mean, (b) the median, (c) the first quartile, (d) the third quartile, (e) the 20th percentile. (f) What percentile does the value 85 represent?

The range The range is one of the simpler measures of distribution. It indicates the 'length' of a distribution. It is determined by finding the difference between the lowest and highest values in a distribution. A formula for calculating the range, depending on the type of data, is defined by equation (2.3) or (2.4): RANGE (ungrouped data) = Maximum value – Minimum value

(2.3)

RANGE (grouped data) = UCB Highest Class – LCB Lowest Class

(2.4)

Where UCB represents the upper-class boundary and LCB represents the lower-class boundary.

Page | 118

Example 2.4

We can use the data from Example 2.1 and take a look at Table 2.6, in which the data are ordered in ascending order. The minimum value is 15712 and the maximum value is 96701. According to equation (2.3), the range is calculated as:

RANGE = 96701 − 15712 = 80989

If, for example, you had data for another similar company's advertising expenses and their range turned out to be 42357, then you would be able to conclude that this other company has a much narrower range over which their advertising expenses are spread.

The interquartile range and semi-interquartile range

The interquartile range (IQR) represents the difference between the third and first quartiles and can be used to provide a measure of spread within a data set which includes extreme data values. The interquartile range is little affected by extreme data values in the data set and is a good measure of spread for skewed distributions. The interquartile range is defined by equation (2.5):

Interquartile range, IQR = Q3 − Q1     (2.5)

The semi-interquartile range (SIQR) is another measure of spread and is computed as half of the interquartile range, as shown in equation (2.6):

SIQR = (Q3 − Q1) / 2     (2.6)

Example 2.5

We will again use the data from Example 2.1 and Example 2.2. The interquartile range (IQR) and semi-interquartile range (SIQR) are calculated, using equations (2.5) and (2.6), as:

IQR = 64993.25 − 29780.75 = 35212.5

SIQR = (64993.25 − 29780.75) / 2 = 17606.25

The IQR of 35212.5 can be interpreted as follows: the 50% of the data (advertising expenses) that reside between the first and the third quartile (or, between the 25th and 75th percentile) have a range of 35212.5. The SIQR, on the other hand, splits the middle 50% of all the values exactly into half. This means that if our data were evenly distributed, the full range would be 70425, i.e. 4 × 17606.25 = 70425. As we can see from Example 2.4, the full range is 80989, which implies that the values are not equally distributed and that we might have some minor extremes present in our data set. As both the IQR and SIQR focus on the middle half of all the values in the data set, they are much less influenced by the extreme values.

Page | 119
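As an optional aside, a short Python sketch (assuming the same (N + 1) quartile rule used in Example 2.2 and by Excel's QUARTILE.EXC) reproduces the range, IQR and SIQR reported above:

    spend = sorted([15712, 53527, 66528, 31118, 95460, 15712,
                    29335, 96701, 38706, 60389, 35783, 47190])

    def quartile(values, p):
        # equation (2.2): position = (p/100) * (N + 1), then linear interpolation
        pos = (p / 100) * (len(values) + 1)
        lower = int(pos)
        frac = pos - lower
        return values[lower - 1] + frac * (values[lower] - values[lower - 1])

    q1, q3 = quartile(spend, 25), quartile(spend, 75)
    print(spend[-1] - spend[0])    # range: 80989
    print(q3 - q1)                 # IQR:   35212.5
    print((q3 - q1) / 2)           # SIQR:  17606.25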

The standard deviation and variance

The standard deviation is the measure of spread most used in statistics when the mean is used to calculate central tendency. The goal of the standard deviation is to summarise the spread of a data set (i.e. in general how far each data point is from the mean). Imagine you have calculated differences from the mean for each data value (x − x̄). If you did this, then some values would come up as positive and some as negative numbers. If you were then to sum all these differences, then you would find that ∑(x − x̄) = 0, i.e. the positive and negative values would cancel out. To avoid this problem, you need to square each individual difference before carrying out the summation. The benefits of squaring include:

• Squaring always gives a positive value, so the sum will not be zero.
• Squaring emphasises larger values – a feature that can be good or bad (for example, think of the effect of outliers).

The value calculated in this way, or the statistic, shows the average of the squared differences, which is known as the variance (VAR(X)). Variance is defined by equation (2.7), where N is the total population size:

VAR(X) = ∑(X − X̄)² / N     (2.7)

By algebraic manipulation we can also rearrange equation (2.7) to give equation (2.8):

VAR(X) = (∑X² / N) − X̄²     (2.8)

Squaring, however, does have a problem as a measure of spread, and that is that the units are all squared, whereas we might prefer the spread to be in the same units as the original data. Hence, the square root allows us to return to the original units to give the standard deviation, SD(X), as illustrated in equation (2.9):

SD(X) = √VAR(X)     (2.9)

If we substitute equation (2.7) into equation (2.9), we get the standard deviation as in equation (2.10):

SD(X) = √(∑(X − X̄)² / N)     (2.10)

Variance describes how much the data values are scattered around their mean value. You can also say that it shows how tightly the data values are grouped around the mean. This leads to the conclusion that the smaller the value of the variance, the more representative the mean value is. We will see later that the variance is also very useful as a comparison measure between two data sets. Because the variance is based on squared values (squared differences from the mean), this means that it does not have the same dimension as the data set, or the mean.

Page | 120

In other words, if the data values are percentages, inches, degrees Celsius, or any other unit, the variance is not expressed in the same units, because it is expressed in squared units. Standard deviation is used to bring the variance into the same units of measure as the data set. Standard deviation is the square root of the variance value, as shown in equation (2.9).

Example 2.6

The mean, variance and standard deviation can be calculated for the Example 2.1 data set using equations (2.1), (2.8) and (2.10), respectively:

Mean: X̄ = ∑X / N

Variance: VAR(X) = (∑X² / N) − X̄²

Standard deviation: SD(X) = √(∑(X − X̄)² / N)

To calculate the mean, variance, and standard deviation we need to calculate:

a) The number of data values, N.
b) The sum of the data values, ∑X.
c) From a) and b), calculate the mean.
d) The sum of the data values squared, ∑X².
e) From a), c) and d), calculate the variance and the standard deviation.

2019-2020 Advertising spend (£) by Rubber Duck Ltd   X²
15712                                                246866944
53527                                                2865139729
66528                                                4425974784
31118                                                968329924
95460                                                9112611600
15712                                                246866944
29335                                                860542225
96701                                                9351083401
38706                                                1498154436
60389                                                3646831321
35783                                                1280423089
47190                                                2226896100
Table 2.7

Page | 121

From the table, we can show:

a) The number of data values, N = 12
b) The sum of the data values, ∑X = 586161
c) The sum of the data values squared, ∑X² = 36729720497

Mean

X̄ = ∑X / N = (15712 + 53527 + ⋯ + 47190) / 12 = 48846.75

Variance

VAR(X) = (∑X² / N) − X̄² = (36729720497 / 12) − 48846.75² = 674805055.85

Standard deviation

SD(X) = √VAR(X) = √674805055.85 = 25977.01

Population data set

If the data set is the complete population then the same equations as (2.7)–(2.10) are used, except that we change the notation. For the population variance we use the symbol σ², for the population standard deviation σ, and μ is the symbol for the mean. Equations (2.8) and (2.7) are rewritten as:

σ² = (∑X² / N) − μ²

or

σ² = ∑(X − μ)² / N

Page | 122

Therefore:

σ = √VAR(X) = √σ²

The Excel functions to calculate the variance and standard deviation, assuming we are using population data, are =VAR.P() and =STDEV.P() respectively. It should be noted that VAR.P and STDEV.P are newer versions of the Excel functions VARP and STDEVP.

Sample from a population

If the data set is a sample from the population then the sample variance (s²) and sample standard deviation (s) are given by equations (2.11) and (2.12):

Sample variance, s² = ∑(x − x̄)² / (n − 1)     (2.11)

Sample standard deviation, s = √s²     (2.12)

The corresponding Excel functions to calculate sample variance and sample standard deviation are: =VAR.S() and =STDEV.S(), respectively. Again, it should be noted that VAR.S AND STDEV.S are newer versions of the Excel functions VAR and STDEV. Excel solution Figures 2.30 to 2.32 illustrate the Excel solutions.

Figure 2.30 Data set and column calculations

Page | 123

Figure 2.31

Figure 2.32

From Excel:

• Mean = 48846.75
• Population variance = 674805055.85
• Population standard deviation = 25977.01
• Range = 80989
• Q1 = 29780.75
• Median = 42948.00
• Q3 = 64993.25

Page | 124

SPSS solution Method: Frequencies Select Analyze > Descriptive Statistics > Frequencies.

Figure 2.33 SPSS frequencies menu Transfer Advertising_spend to the Variable(s) box

Figure 2.34 SPSS Frequencies menu Page | 125

Click on Statistics Choose Quartiles, Mean, Median, Std. deviation, Variance, Range

Figure 2.35 SPSS frequencies statistics options Click Continue. Click OK

Figure 2.36 SPSS solutions

From SPSS:

• Mean = 48846.75
• Population variance = 674805055.85
• Population standard deviation = 25977.01
• Range = 80989
• Q1 = 29780.75
• Median = 42948.00
• Q3 = 64993.25

Page | 126
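For completeness, the population and sample versions of the variance and standard deviation can also be cross-checked in Python with the standard library statistics module, which mirrors the distinction between Excel's VAR.P/STDEV.P and VAR.S/STDEV.S. This is only an optional aside:

    import statistics

    spend = [15712, 53527, 66528, 31118, 95460, 15712,
             29335, 96701, 38706, 60389, 35783, 47190]

    print(statistics.pvariance(spend))   # population variance, approx. 674805055.85
    print(statistics.pstdev(spend))      # population standard deviation, approx. 25977.01
    print(statistics.variance(spend))    # sample variance (divides by n - 1)
    print(statistics.stdev(spend))       # sample standard deviation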

Example 2.7

Consider the following e-commerce module marks achieved by 13 students.

ID   e-Commerce module marks
1    21
2    28
3    35
4    44
5    50
6    50
7    54
8    57
9    58
10   64
11   76
12   81
13   82
Table 2.8

If we solve this problem, we will find that the summary statistics are as follows:

• N = 13
• First quartile, Q1 = 39.5
• Median = second quartile, Q2 = 54
• Third quartile, Q3 = 70
• Interquartile range, IQR = Q3 − Q1 = 70 − 39.5 = 30.5
• Semi-interquartile range, SIQR = (Q3 − Q1) / 2 = 30.5 / 2 = 15.25

What is the meaning of these numbers? Since half the scores in a distribution lie between Q1 and Q3, the semi-interquartile range is half the distance needed to cover half the scores. In a symmetric distribution, an interval stretching from one semi-interquartile range below the median to one semi-interquartile range above the median will contain half of the scores, as illustrated in Figure 2.37.

Page | 127

Figure 2.37 Location of the IQR and SIQR for the Example 2.7 data set

If this were a symmetric distribution, then with an SIQR of 15.25 and a median Q2 of 54, 50% of all the scores would lie between 38.75 (= 54 − 15.25) and 69.25 (= 54 + 15.25). This does not exactly match our statement above that 50% of all the values lie between 39.5 and 70. The reason for this discrepancy is that our sample is not symmetric (and it is a very small sample). The interquartile and semi-interquartile ranges are more stable than the range because they focus on the middle half of the data values. Therefore, they cannot be influenced by extreme values. The SIQR is used in conjunction with the median for a highly skewed distribution or to describe an ordinal data set. The interquartile range (and semi-interquartile range) are more influenced by sampling fluctuations in normal distributions than is the standard deviation. Therefore, they are not often used for data that are approximately normally distributed. In general, for a normal distribution, the interquartile range is about 30% larger than the standard deviation. Although the SIQR is inferior to the standard deviation as a measure of dispersion, we can see that sometimes it makes sense to use it.

Check your understanding

X2.6 The daily incomes (£) of workers in a factory are: 95, 110, 105, 130, 135, 155, 170. Calculate a measure of central tendency and dispersion. Provide an explanation for your choice of average.

X2.7 Table 2.9 represents the time in days that a second-hand furniture store takes to sell tables. Calculate the mean time and an appropriate measure of dispersion. Provide an explanation for your choice of average.

24   27   36   48   52   52   53   55   59   60   85   90   92
Table 2.9 Time to sell tables (days)

X2.8

A local garden centre allows customers to order goods online via its own ecommerce website. The company quality assurance process includes the Page | 128

processing of customer orders and the time to deliver (working days). A sample of 30 orders is presented in Table 2.10. Calculate an appropriate measure of average and dispersion. Provide a rationale for your choice of average and dispersion.

25   25   32   16   25   29   30   20
23   28   25   18   18   22   28   22
32   19   28   28   27   28   19   18
18   29   25   20   28   18   18   20
26   21   33   23   26   25   26   30
Table 2.10 Time to deliver orders (days)

X2.9 Greendelivery.com has recently decided to review the weekly mileage of its delivery vehicles that are used to deliver shopping purchased online to customer homes from a central parcel depot. The sample data collected and provided in Table 2.11 is part of the first stage in analysing the economic benefit of potentially moving all vehicles to biofuels from diesel.

10   9    9    6    7    5    12   8    2    9
4    10   5    5    5    7    6    7    7    8
6    4    8    7    6    9    5    8    9    6
Table 2.11 Weekly mileage for delivery vehicles

a. Use Excel to construct a frequency distribution and plot the histogram with class intervals of 10 and classes 75–84, 85–94, …, 175–184. Comment on the pattern in mileage travelled by the company vehicles. b. Use the raw data to determine the mean, median, standard deviation and interquartile range. c. Comment on which measure you would use to describe the average and measure of dispersion. Explain using your answers to (a) and (b).

Interpretation of the standard deviation If we collect data from a population (or sample) then we can calculate the mean and standard deviation for this data set. For any data set we can use the calculated standard deviation to tell us about the proportion of data values that lie within a specified interval about the population mean. Recall that we used the squared differences between every data point and the mean to calculate both variance and standard deviation. Therefore, neither the variance nor the standard deviation can ever be negative. We stated that the variance is not expressed in the same units as the data points and the mean. However, when we convert variance into standard deviation, we get the same units as the original data. If the original data are in square metres, then both the mean and standard deviation are in square metres. If the original data are in degrees Celsius, then the mean and standard deviation are in degrees Celsius. You will notice, as you work on many other examples, that usually the standard deviation is much smaller than the mean. Why? Because the standard deviation is a measure of dispersion, in other words, it measures how data are dispersed around their Page | 129

mean. Another way to express this is to say that the standard deviation measures how well the mean represents the data. Let us explain this. If we have 100 data points and if 70 or 80 of them, for example, are very close to their mean in value, then we can say that the mean represents this data set very well. Another way to say that is: if a great amount of data is within the range that is defined as 𝑥̅ ± 1 standard deviation, then we have a narrow spread (dispersion) of data and the mean represents the data set very well. In the next chapter we will introduce the so called normal distribution. This is a classic bell-shaped distribution. For such distributions, the standard deviation defines exactly how wide the spread is for all the data points about the average value. Figure 2.38 illustrates this point.

Figure 2.38 Percentage points for the normal distribution

As we will see in Chapter 5, once we know the mean and the standard deviation, and if we assume that the data follow the normal distribution, we will be able to say that 68.3% of all the values in our data set are within μ ± 1 standard deviation, that 95.5% of all the values in our data set are within μ ± 2 standard deviations, and that 99.7% of all the values in our data set are within μ ± 3 standard deviations. If, for example, we measured the height of students at the local secondary school and found a mean height of x̄ = 165 cm, with standard deviation σ = 15 cm, and under the assumption that the height of the students follows the normal distribution, we can say that 68% of all the students in this school are between 165 − 15 = 150 cm and 165 + 15 = 180 cm, and that 95% of all the students are between 165 − 2 × 15 = 135 cm and 165 + 2 × 15 = 195 cm.
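The 68%, 95% and 99.7% figures quoted above follow directly from the normal distribution and can be confirmed with a few lines of Python using only the standard library (math.erf gives the required normal probabilities). This optional sketch assumes nothing beyond normality:

    import math

    def proportion_within_k_sd(k):
        # P(|X - mu| < k*sigma) for a normally distributed variable X
        return math.erf(k / math.sqrt(2))

    for k in (1, 2, 3):
        print(k, round(proportion_within_k_sd(k) * 100, 1))
    # prints approximately 68.3, 95.4 and 99.7 (per cent)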

The coefficient of variation

The coefficient of variation is a statistic calculated as the ratio of the standard deviation to the mean. When comparing the degree of variation from one data set to another, it is an invaluable statistic. If the two data sets have different units, or their values differ greatly in magnitude, then the coefficient of variation is particularly useful.

Page | 130

For example, the value of the standard deviation of a set of weights will be different, depending on whether they are measured in pounds or kilograms. The coefficient of variation, however, will be the same in both cases as it does not depend on the unit of measurement. The coefficient of variation, V, is defined by equation (2.13):

V = (Standard deviation / Mean) × 100 = (s / x̄) × 100     (2.13)

For example, if the coefficient of variation is 10% then this means that the standard deviation is equal to 10% of the average. For some measures, the standard deviation changes as the average changes. If this is the case, the coefficient of variation is the best way to summarise the variation.

Example 2.8

Consider the following problem that compares UK factory and US factory average earnings: (a) mean earnings in the UK are £125 per week with a standard deviation of £10 and (b) mean earnings in the USA are $145 per week with a standard deviation of $16.

For the UK, therefore, V = (10 / 125) × 100 = 8%

For the USA, we have V = (16 / 145) × 100 = 11.03%

Although one set of data is given in pounds sterling and the other in US dollars, the coefficient of variation returns the values as percentages, so that the two sets of data can be compared. In the case of the UK, the standard deviation is 8% of the mean value and in the case of the USA, the percentage is 11.03%. We can conclude that the spread of earnings in the USA is greater than the spread in earnings in the UK.
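Equation (2.13) amounts to a one-line calculation; the optional Python sketch below simply repeats the Example 2.8 figures:

    def coefficient_of_variation(std_dev, mean):
        # equation (2.13): V = (s / x-bar) * 100
        return std_dev / mean * 100

    print(coefficient_of_variation(10, 125))   # UK:  8.0 (%)
    print(coefficient_of_variation(16, 145))   # USA: approx. 11.03 (%)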

Check your understanding X2.10 A manufacturing company sells a new type of car tyre with a mean life of 27,000 miles and a standard deviation of 6000 miles. Calculate the coefficient of variation. X2.11 A salesman earns commission based upon the number of sales above a certain value. Calculate the coefficient of variation if the mean commission is €200 with a standard deviation of €40.

2.4 Measures of shape

Most of the time when conducting statistical analysis, we are just trying to establish the location and variability of a data set. However, there are other measures that could be included in this analysis, and they are the measures of distribution shape: skewness and kurtosis. Skewness is a measure of symmetry, or more precisely, the lack of symmetry. A distribution, or data set, is symmetric if it looks the same to the left and right of the centre point.

Page | 131

Measuring skewness: distribution symmetry The histogram is an effective graphical technique for showing both the skewness and kurtosis for a data set. Consider three distributions A, B and C as illustrated in Figures 2.39–2.41.

Figure 2.39 Symmetric distribution

Figure 2.40 Right-skewed (positive skewness)

Figure 2.41 Left-skewed (negative skewness)

Page | 132

Distribution A is said to be symmetrical. The mean, median and mode have the same value. Distribution B has a high frequency of relatively low values and a low frequency of relatively high values. Consequently, the mean is 'dragged' toward the right (the high values) of the distribution. It is known as a right-skewed (or positively skewed) distribution. Distribution C has a high frequency of relatively high values and a low frequency of relatively low values. Consequently, the mean is 'dragged' toward the left (the low values) of the distribution. It is known as a left-skewed (or negatively skewed) distribution. The skewness of a frequency distribution can be an important consideration. For example, if your data set is salary, your employer would prefer a situation that led to a positively skewed distribution of salary to one that is negatively skewed. To measure skewness, we can use one of several different methods.

Pearson’s coefficient of skewness One measure of skewness is Pearson's coefficient of skewness as defined by equation (2.14):

PCS = 3 × (Mean − Median) / Standard deviation     (2.14)

As we can see, equation (2.14) is relatively simple, but there are several points to remember related to the measurement of skewness:

1. The direction of skewness is given by the sign.
2. A large negative value means the distribution is negatively skewed or left-skewed.
3. A large positive value means the distribution is positively skewed or right-skewed.
4. A value of zero means no skewness at all (symmetric distribution).
5. The coefficient compares the sample distribution with a normal distribution.
6. The larger the value, the more the distribution differs from the normal distribution.
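To illustrate equation (2.14), the optional Python sketch below applies Pearson's coefficient of skewness to the advertising-spend data of Example 2.1, using the mean, median and population standard deviation obtained earlier in the chapter. The resulting value (roughly 0.68, a moderate positive skew) is not quoted elsewhere in the text; it is shown here purely to demonstrate the formula.

    import statistics

    spend = [15712, 53527, 66528, 31118, 95460, 15712,
             29335, 96701, 38706, 60389, 35783, 47190]

    # equation (2.14): PCS = 3 * (mean - median) / standard deviation
    pcs = 3 * (statistics.mean(spend) - statistics.median(spend)) / statistics.pstdev(spend)
    print(round(pcs, 2))   # approx. 0.68, i.e. a moderately right-skewed data set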

Fisher–Pearson skewness coefficient This is an alternative option to measure skewness, and Excel and SPSS use this alternative measure based upon the Fisher–Pearson skewness coefficient as defined by equation (2.15), where s is the standard deviation.

Sample skewness = [n / ((n − 1)(n − 2))] × [∑(X − X̄)³ / s³]     (2.15)

If the skewness is positive, the data are positively skewed or right skewed, meaning that the right tail of the distribution is longer than the left. If the skewness is negative, the data are negatively skewed or left-skewed, meaning that the left tail is longer than the right. If the skewness is zero, the data are perfectly symmetrical. However, a skewness

Page | 133

of exactly zero is quite unlikely for real-world data, so how can you interpret the skewness? Here we list some simple rules of thumb:

1. If skewness is less than −1, or greater than +1, the distribution is highly skewed.
2. If skewness is between −1 and −0.5, or between +0.5 and +1, the distribution is moderately skewed.
3. If skewness is between −0.5 and +0.5, the distribution is approximately symmetric.

Example 2.9

Consider the e-commerce module marks achieved by 25 students as illustrated in Table 2.12.

ID   E-Commerce marks, x
1    73
2    78
3    75
4    75
5    76
6    69
7    69
8    82
9    74
10   70
11   63
12   68
13   64
14   70
15   72
16   64
17   72
18   67
19   74
20   70
21   74
22   77
23   68
24   78
25   72
Table 2.12 e-Commerce marks

To calculate the sample skewness, we will solve equation (2.15).

Page | 134

Sample skewness = [n / ((n − 1)(n − 2))] × [∑(X − X̄)³ / s³]

Step 1

From previous calculations we can show that the sample size (n), sample mean, and sample standard deviation (s) have the following values:

n = 25

Mean X̄ = ∑X / n = (73 + 78 + ⋯ + 72) / 25 = 71.76

Sample standard deviation, s = √(∑(X − X̄)² / (n − 1)) = 4.7371

Step 2

Calculate the column statistic (X − X̄)³ and sum all these values:

ID   E-Commerce marks, x   (X − X̄)³
1    73                    1.9066
2    78                    242.9706
3    75                    34.0122
4    75                    34.0122
5    76                    76.2250
6    69                    −21.0246
7    69                    −21.0246
8    82                    1073.7418
9    74                    11.2394
10   70                    −5.4518
11   63                    −672.2214
12   68                    −53.1574
13   64                    −467.2886
14   70                    −5.4518
15   72                    0.0138
16   64                    −467.2886
17   72                    0.0138
18   67                    −107.8502
19   74                    11.2394
20   70                    −5.4518
21   74                    11.2394
22   77                    143.8778
23   68                    −53.1574
24   78                    242.9706
25   72                    0.0138
Table 2.13 Column calculation for (X − X̄)³

∑(X − X̄)³ = 4.108

Page | 135

Now, substitute these values into equation (2.15):

Sample skewness = [n / ((n − 1)(n − 2))] × [∑(X − X̄)³ / s³]

Sample skewness = [25 / ((25 − 1)(25 − 2))] × [4.108 / 4.7371³]

Sample skewness = 0.0018 Since the calculated value is 0.0018 (very close to zero), this indicates that this distribution is almost perfectly symmetrical. Excel solution

Figure 2.42 Example 2.9 Excel solution

Observe in the Excel solution that we created a column called (X − X̄)³ and calculated these values by placing the formula =(B4 - $G$5)^3 in cell C4, and then copied the formula down from C4:C28. Cells G9 and G11 show the same value, which is 0.0018. In cell G9, we used the manual formula as in equation (2.15), and in cell G11 we used the equivalent Excel function =SKEW(). Since the calculated value is 0.0018 (very close to zero), this indicates that this distribution is almost perfectly symmetrical.

SPSS solution

Using the data in Table 2.12, we can extract the same statistics from SPSS in the following manner. Enter data into SPSS.

Page | 136

Figure 2.43 Example 2.9 SPSS data set

With SPSS we have three methods to calculate descriptive statistics: Frequencies, Explore, and Descriptives. To illustrate, let us choose the Frequencies method to calculate the value of skewness.

Frequencies method

Select Analyze > Descriptive Statistics > Frequencies
Transfer variable eCommerceMarks into the Variable(s) box

Figure 2.44 SPSS descriptives menu Click on Options and choose Skewness, as illustrated in Figure 2.45.

Page | 137

Figure 2.45 SPSS descriptives options Click on Continue Click on OK SPSS output The output is shown in Figure 2.46.

Figure 2.46 SPSS Frequencies solution The value of skewness is given as 0.002, which agrees with the Excel value to 3 decimal places. Descriptives method solution

Page | 138

Figure 2.47 Explore method solution

Figure 2.48
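For readers who also use Python, equation (2.15) can be implemented directly; applied to the 25 e-commerce marks it reproduces the value 0.0018 obtained manually, by =SKEW() and by SPSS. This is only an optional sketch and is not part of the Excel/SPSS workflow described above.

    import statistics

    marks = [73, 78, 75, 75, 76, 69, 69, 82, 74, 70, 63, 68, 64,
             70, 72, 64, 72, 67, 74, 70, 74, 77, 68, 78, 72]

    def sample_skewness(x):
        # equation (2.15): [n / ((n-1)(n-2))] * sum((x - mean)^3) / s^3
        n = len(x)
        mean = statistics.mean(x)
        s = statistics.stdev(x)          # sample standard deviation (n - 1 divisor)
        return n / ((n - 1) * (n - 2)) * sum((v - mean) ** 3 for v in x) / s ** 3

    print(round(sample_skewness(marks), 4))   # 0.0018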

Check your understanding

X2.12 A newspaper delivers newspapers to customer homes and earns a commission rate based upon the volume of newspapers delivered per week. Table 2.14 shows the commission earned (£) for the last 30 weeks. Calculate a measure of skewness. Can we state that the distribution is symmetric?

80    165   159   136   138   118   159   131   93
163   136   163   106   111   123   144   145   91
170   105   131   137   152   109   114   155   92
143   120   145   142   161   112   141   122   143
140   124   109   80    179   146   122   126   165
Table 2.14 Commission earned (£)

Page | 139

Measuring kurtosis: distribution outliers and peakedness

The other common measure of shape is called the kurtosis. Traditionally, kurtosis has been explained in terms of the central peak. You'll see statements like this one: higher values indicate a higher, sharper peak; lower values indicate a lower, less distinct peak. Recent developments in the understanding of kurtosis suggest that higher kurtosis means that more of the variance is the result of infrequent extreme deviations, as opposed to frequent modestly sized deviations. In other words, it's the tails that mostly account for kurtosis, not the central region. The reference standard is a normal distribution, which has a kurtosis of 3. In recognition of this, it is the excess kurtosis that is often presented as the kurtosis. For example, the 'kurtosis' reported by Excel and SPSS is actually the excess kurtosis. There are several points, and expressions, to remember related to the measurement of kurtosis:

1. A normal distribution has kurtosis exactly 3 (or excess kurtosis 0). Any distribution with kurtosis ≈ 3 (excess kurtosis ≈ 0) is called mesokurtic.
2. A distribution with kurtosis less than 3 (excess kurtosis less than 0) is called platykurtic. Compared to a normal distribution, its tails are shorter and thinner, and often its central peak is lower and broader.
3. A distribution with kurtosis greater than 3 (excess kurtosis greater than 0) is called leptokurtic. Compared to a normal distribution, its tails are longer and fatter, and often its central peak is higher and sharper.

Figure 2.49 compares two normal population distributions with the same mean but different standard deviations.

Figure 2.49 Comparison of two distributions To assess the length of the tails and how peaked the distribution is we can calculate a measure of kurtosis, and Excel and SPSS provide Fisher’s kurtosis coefficient as defined by equation (2.16): Page | 140

Sample kurtosis = [n(n + 1) / ((n − 1)(n − 2)(n − 3))] × [∑(X − X̄)⁴ / s⁴] − [3(n − 1)² / ((n − 2)(n − 3))]     (2.16)

Where s represents the sample standard deviation. Example 2.10 illustrates the calculation details.

Example 2.10

Consider the e-commerce module marks achieved by 25 students as illustrated in Table 2.12 (Example 2.9). To calculate the sample kurtosis, we will solve equation (2.16):

Sample kurtosis = [n(n + 1) / ((n − 1)(n − 2)(n − 3))] × [∑(X − X̄)⁴ / s⁴] − [3(n − 1)² / ((n − 2)(n − 3))]

Step 1

From previous calculations we can show that the sample size (n), sample mean, and sample standard deviation (s) have the following values:

n = 25

Mean X̄ = ∑X / n = (73 + 78 + ⋯ + 72) / 25 = 71.76

Sample standard deviation, s = √(∑(X − X̄)² / (n − 1)) = 4.7371

Step 2

Calculate the column statistic (X − X̄)⁴ and sum all these values.

ID   E-Commerce marks, x   (X − X̄)⁴
1    73    2.3642
2    78    1516.1367
3    75    110.1996
4    75    110.1996
5    76    323.1941
6    69    58.0278
7    69    58.0278
8    82    10995.1163
9    74    25.1763
10   70    9.5951
11   63    5888.6593
12   68    199.8717
13   64    3626.1593
14   70    9.5951
15   72    0.0033
16   64    3626.1593
17   72    0.0033
18   67    513.3668
19   74    25.1763
20   70    9.5951
21   74    25.1763
22   77    753.9198
23   68    199.8717
24   78    1516.1367
25   72    0.0033

Table 2.15 Column calculation for (X − X̄)⁴

Σ(X − X̄)⁴ = 29601.7352

Now, substitute these values into equation (2.16):

$$\text{Sample kurtosis} = \frac{25(25+1)}{(25-1)(25-2)(25-3)} \cdot \frac{29601.7352}{4.7371^4} - \frac{3(25-1)^2}{(25-2)(25-3)} = -0.2686$$

The sample kurtosis is −0.2686. Since the calculated value is negative, this indicates that the distribution is platykurtic.
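For readers who also want to verify this result outside Excel and SPSS (Python is not part of this textbook’s workflow), a minimal sketch is shown below. It assumes the numpy library is available; the variable names are illustrative only.

import numpy as np

marks = np.array([73, 78, 75, 75, 76, 69, 69, 82, 74, 70, 63, 68, 64,
                  70, 72, 64, 72, 67, 74, 70, 74, 77, 68, 78, 72])

n = len(marks)                 # 25 students
mean = marks.mean()            # 71.76
s = marks.std(ddof=1)          # sample standard deviation, approximately 4.7371

# Equation (2.16): sample excess kurtosis (the value that Excel's =KURT() reports)
term1 = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
term2 = ((marks - mean) ** 4).sum() / s ** 4
term3 = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
print(round(term1 * term2 - term3, 4))   # -0.2686, confirming a platykurtic distribution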


Excel solution

Figures 2.50 and 2.51 illustrate the Excel solutions.

Figure 2.50 Example 2.10 Excel data set and calculation of column statistics

Observe in the Excel solution we created a column called (X − X̄)³ and calculated these values by placing the formula =(B4-$I$5)^3 in cell C4, and then copied the formula down from C4:C28. Similarly, we created a column called (X − X̄)⁴ and calculated these values by placing the formula =(B4-$I$5)^4 in cell E4, and then copied the formula down from E4:E28.

Figure 2.51 Excel solution continued

The kurtosis value is given in cells I15 and I17 and is equal to −0.2686. In cell I15, we used the manual formula as in equation (2.16) and in cell I17 the equivalent Excel function =KURT(). Note that SPSS and Excel report the excess kurtosis rather than the kurtosis itself. Excess kurtosis is the kurtosis value minus 3 (in our case, the kurtosis value equals 3 + (−0.2686) = 2.7314). Since the calculated excess kurtosis is −0.2686 (negative), this indicates that the distribution is platykurtic.

SPSS solution

With SPSS we have three methods to calculate descriptive statistics: Frequencies, Explore, and Descriptives. To illustrate, let us choose the Frequencies method to calculate the value of kurtosis. From the SPSS Statistics menu bar:

Select Analyze > Descriptives > Frequencies
Transfer eCommerceMarks to the Variable(s) box

Figure 2.52 SPSS descriptives menu

Click on Statistics and choose Skewness and Kurtosis, as illustrated in Figure 2.53.

Figure 2.53 SPSS descriptives options

Click on Continue
Click on OK

SPSS output

The output is shown in Figure 2.54.

Figure 2.54 SPSS Frequencies solution

The value of excess kurtosis is given as −0.269, which agrees with the Excel value to 3 decimal places.

Explore solution

Figure 2.55 Explore solution

Descriptives solution


Figure 2.56 Descriptives solution

Check your understanding

X2.13 A newspaper delivers newspapers to customer homes and earns a commission rate based upon the volume of newspapers delivered per week. Table 2.16 represents the commission earned (£) for the last 30 weeks. Calculate a measure of kurtosis.

19 28 17 16 18
20 20 21 25 20
15 16 17 21 21
23 21 21 19 17
13 21 20 16 24
20 15 17 22 19

Table 2.16 Commission earned (£s)

Calculating a five-number summary

We will now discuss one very simple, yet very effective and intuitive method that combines several measures we have covered so far (central tendency, dispersion, and shape). The five-number summary is a simple method that provides measures of average, spread and the shape of the distribution. This five-number summary consists of the following numbers in the data set:

• Smallest value
• First quartile, Q1
• Median or second quartile, Q2
• Third quartile, Q3
• Largest value.

For symmetrical distributions, the following rules would hold:

Q3 – Median = Median – Q1
Largest value – Q3 = Q1 – Smallest value

For non-symmetrical distributions, the following rules would hold:

Right-skewed distributions: Largest value – Q3 greatly exceeds Q1 – Smallest value
Left-skewed distributions: Q1 – Smallest value greatly exceeds Largest value – Q3

Example 2.11

Consider the student results obtained in a statistics examination as presented in Table 2.17.

73 78 75 75 76
69 69 82 74 70
63 68 64 70 72
64 72 67 74 70
74 77 68 78 72

Table 2.17 Statistics examination marks

Using the methods explored earlier in this chapter we can calculate the required statistics.

Statistic    Value
Minimum      63.00
Q1           68.50
Median       72.00
Q3           75.00
Maximum      82.00

Table 2.18

Excel solution

Figure 2.57 illustrates the Excel solution. The five-number summary, provided in columns D:F, is as follows:


Figure 2.57 Example 2.9 Excel data set and solution

• Smallest value = 63.00
• First quartile, Q1 = 68.50
• Median or second quartile, Q2 = 72.00
• Third quartile, Q3 = 75.00
• Largest value = 82.00

To identify symmetry

Using the numbers in Figure 2.57, we conclude:

• The distance from Q3 to the median (75 − 72 = 3) is similar to the distance between Q1 and the median (72 − 68.5 = 3.5).
• The distance between Q3 and the largest value (82 − 75 = 7) is not the same as the distance between Q1 and the smallest value (68.5 − 63 = 5.5).

These summary values indicate that the distribution is right-skewed, because the distance between Q3 and the largest value (82 − 75 = 7) is longer than the distance between Q1 and the smallest value (68.5 − 63 = 5.5).


Please note that this right skew is very small. If you calculate the measure of skewness for this data set, the value is 0.002, which suggests the data are close to symmetric.
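For readers working outside Excel and SPSS, the five-number summary can also be reproduced with a short Python sketch (an aside, not part of the textbook’s workflow). It assumes numpy is available and uses the (n + 1) weighted-average position rule for the quartiles, which is the rule SPSS applies; the function name is illustrative.

import numpy as np

marks = np.array([73, 78, 75, 75, 76, 69, 69, 82, 74, 70, 63, 68, 64,
                  70, 72, 64, 72, 67, 74, 70, 74, 77, 68, 78, 72])

def quartile(data, p):
    # (n + 1) weighted-average position rule, e.g. 0.25 * 26 = 6.5 for Q1
    x = np.sort(data)
    pos = p * (len(x) + 1)
    lo = int(np.floor(pos)) - 1          # convert to a 0-based index
    frac = pos - np.floor(pos)
    return x[lo] + frac * (x[lo + 1] - x[lo])

print(marks.min(), quartile(marks, 0.25), quartile(marks, 0.5),
      quartile(marks, 0.75), marks.max())
# minimum 63, Q1 = 68.5, median = 72, Q3 = 75, maximum 82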

To identify outliers

The following process is followed to identify possible outliers. The interquartile range (IQR) is defined by equation (2.17) and represents the range of the middle 50% of the data distribution.

IQR = Q3 – Q1   (2.17)

The IQR can then be used to identify any data outliers within the data set as described by the following set of rules.

Construct inner fences:

Lower inner fence = Q1 – 1.5 × IQR   (2.18)
Upper inner fence = Q3 + 1.5 × IQR   (2.19)

Construct outer fences:

Lower outer fence = Q1 – 3 × IQR   (2.20)
Upper outer fence = Q3 + 3 × IQR   (2.21)

In Example 2.11, with Q1 = 68.5 and Q3 = 75, the interquartile range is IQR = 75 − 68.5 = 6.5. The inner and outer fence values are as follows:

• Lower inner fence = Q1 − 1.5 × IQR = 58.75
• Upper inner fence = Q3 + 1.5 × IQR = 84.75
• Lower outer fence = Q1 − 3 × IQR = 49.0
• Upper outer fence = Q3 + 3 × IQR = 94.5

If data values are located between the inner and outer fences, then these data values would be classified as mild outliers. If data values are located outside the outer fences, then these would be classified as extreme outliers. Since all the values lie within the inner fences, we conclude that we have no outliers within the data set.
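The fence calculations themselves are simple enough to check by hand or with a few lines of Python (shown here only as an illustrative aside; the quartile values are taken from the text above).

q1, q3 = 68.5, 75.0
iqr = q3 - q1                                # 6.5

inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)     # (58.75, 84.75)
outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)     # (49.0, 94.5)
print(inner, outer)
# The sample minimum (63) and maximum (82) lie inside the inner fences,
# so no mild or extreme outliers are flagged.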


SPSS solution

The five-number summary can be calculated using the SPSS Statistics Frequencies command. Input data into SPSS.

Figure 2.58 Example 2.9 SPSS data set

Select Analyze > Descriptives > Frequencies
Transfer eCommerceMarks to the Variable(s) box

Figure 2.59 SPSS frequencies menu Click on Statistics. Choose Quartiles, Minimum, Maximum


Figure 2.60 SPSS frequencies statistics

Click Continue
Click OK

SPSS output

The output is shown in Figure 2.61.

Figure 2.61 SPSS frequencies solution

According to SPSS, the five-number summary is:

• Minimum = 63
• First quartile = 68.50
• Second quartile = 72.00
• Third quartile = 75.00
• Maximum = 82

We observe that the five-number summaries are the same in the manual, Excel and SPSS solutions.

You can also use SPSS Explore menu to generate the same results.

Check your understanding

X2.14 The manager at Big Jim’s restaurant is concerned at the time it takes to process credit card payments at the counter by counter staff. The manager has collected the processing time data (time in minutes) shown in Table 2.19 and requested that summary statistics are calculated.

73 73 73 73 73 73 73 73 73 73
78 78 78 78 78 78 78 78 78 78
75 75 75 75 75 75 75 75 75 75
75 75 75 75 75 75 75 75 75 75

Table 2.19 Time to process credit cards (minutes)

a. Calculate a five-number summary for this data set.
b. Do we have any evidence for a symmetric distribution?
c. Which measures would you use to provide a measure of average and spread?

X2.15 The local regional development agency is conducting a major review of the economic development of a local community. One economic measure to be collected is local house prices. These reflect the economic well-being of this community. The development agency has collected the house price data (£) shown in Table 2.20 (n = 40).

1.57 1.38 1.97 1.52 1.09 1.29 1.26 1.07
1.13 1.59 0.27 0.92 1.49 1.73 0.79 1.38
0.98 2.31 1.23 1.56 0.76 1.23 1.56 1.98
1.40 1.89 0.89 1.34 0.76 1.54 1.78 4.89
1.39 1.76 0.71 2.46 0.89 2.01 3.21 1.98

Table 2.20 House prices (£s)

a. Calculate a five-number summary.
b. Do we have any evidence for a symmetric distribution?
c. Which measures would you use to provide a measure of average and spread?

Creating a box plot

We have already discussed techniques for visually representing data (see histograms and frequency polygons). In this section we present another important method, called the box plot (also known as the box-and-whisker plot). A box plot is a graphical method of displaying the symmetry or skewness in a data set. It shows a measure of central location (the median), two measures of dispersion (the range and interquartile range), the skewness (from the orientation of the median relative to the quartiles) and potential outliers.

Example 2.12

Consider the student marks presented in Example 2.9. Figure 2.62 shows the box-and-whisker plot for the quantitative marks example, where the summary statistics are as follows: minimum = 63, first quartile Q1 = 68.5, median = 72, third quartile Q3 = 75 and maximum = 82.

Figure 2.62 Example 2.9 box plot

The box-and-whisker plot shows that the lowest 25% of the statistics marks are less spread out than the highest 25% of the distribution. The plot also shows that the middle half of the values are approximately equally spread out. This corresponds to the five-number summary analysis in the previous section.

To identify symmetry

The box plot is interpreted as follows. If the median within the box is not equidistant from the whiskers (or hinges), then the data are skewed. The box plot indicates right-skewness because the distance between the median and the highest value is greater than the distance between the median and the lowest value. Furthermore, the top whisker (the distance between Q3 and the maximum) is longer than the lower whisker (the distance between Q1 and the minimum).

To identify outliers

The box plot is interpreted as follows. The minimum and maximum points (or whiskers) are identified and enable identification of any extreme values (or outliers). A simple rule to identify an outlier (or suspected outlier) is that the whisker length (maximum value – minimum value) should be no longer than three times the length of the box (Q3 – Q1). In this case the difference between the maximum and minimum is 82 – 63 = 19 and 3(Q3 – Q1) = 3 × 6.5 = 19.5. The conclusion is that extreme values are not present in the data set and that the distribution is somewhat right skewed.

We have three methods we can use to create a box plot:

1. Create a box plot using the five-number summary – see the result in Figure 2.62.
2. Create a box plot using the Excel box-and-whisker plot method – see below.
3. Use SPSS to create a box plot – see below.

We will now look at the last two methods listed (2 and 3).
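As an aside for readers who also use Python (not one of the three methods listed above), a box plot of the same marks can be produced with matplotlib, assuming that library is available. Note that matplotlib computes the quartiles itself, so its whiskers may differ very slightly from the SPSS values.

import matplotlib.pyplot as plt

marks = [73, 78, 75, 75, 76, 69, 69, 82, 74, 70, 63, 68, 64,
         70, 72, 64, 72, 67, 74, 70, 74, 77, 68, 78, 72]

fig, ax = plt.subplots()
ax.boxplot(marks, whis=1.5)        # whiskers at 1.5 x IQR; points beyond are drawn as outliers
ax.set_title('Box plot for statistics data')
ax.set_ylabel('Statistics mark')
plt.show()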

Excel solution

Figure 2.63 Example 2.10 Excel solution for outliers

Highlight cells B3:B28 (complete data set, including label)
Select Insert Statistic Chart

Figure 2.64 Insert statistics chart

Select Box and Whisker (this is the same as a box plot)


Figure 2.65 Choose box and whisker plot Clicking on Insert Box and Whisker gives the solution shown in Figure 2.66.

Figure 2.66 Excel box and whisker plot Now edit the chart by adding chart title ‘Box plot for statistics data’, vertical axis title ‘Statistics mark’, replace the number 1 on the horizontal axis with Value, remove the chart border, and change the vertical axes from 0–100 to 50–90 as illustrated in Figure 2.67.


Figure 2.67 Excel box plot

SPSS solution

Input data into SPSS as illustrated in Figure 2.68.

Figure 2.68 Example 2.12 SPSS data Select Graphs > Legacy Dialogs > Boxplot


Figure 2.69 SPSS boxplot option Select Simple In the Data in Chart Are box choose Summaries of separate variables.

Figure 2.70 SPSS boxplot Select Define Transfer StatisticsMarks into the Variable box.


Figure 2.71 SPSS define simple boxplot

Click OK. Finally, edit the chart by double-clicking on it.

Figure 2.72 SPSS box plot

Check your understanding X2.16 Create a boxplot for the data in X2.14. X2.17 Create a boxplot for the data in X2.15.


2.5 Using the Excel Data Analysis menu

A selection of summary statistics can very easily be calculated in Excel by using the Data Analysis menu add-ins. Almost all the measures we described in this chapter, and a few extra, are included in the automatic printout. This tool will generate a report based upon your univariate data set, including the mean, median, mode, standard deviation, sample variance, kurtosis, skewness, range, minimum, maximum, sum, count, largest, and smallest number. The skewness and kurtosis values can be used to provide information about the shape of the distribution.

Example 2.13

Consider the e-Commerce marks example.

ID   e-Commerce marks, x
1    73
2    78
3    75
4    75
5    76
6    69
7    69
8    82
9    74
10   70
11   63
12   68
13   64
14   70
15   72
16   64
17   72
18   67
19   74
20   70
21   74
22   77
23   68
24   78
25   72

Table 2.21


Excel Solution The Descriptive Statistics procedure in the Excel ToolPak add-in can be used to calculate the required statistics. Enter the data into Excel

Figure 2.73 Example 2.10 Excel data set From the Data tab, select Data Analysis.


Figure 2.74 Excel data Analysis Descriptive Statistics menu Select Descriptive Statistics Input data range: B3:B28, Grouped By: Columns Click on Labels in first row Type in Output Range: D7. Tick Summary statistics.

Figure 2.75 Excel Descriptive Statistics menu Click OK The Excel results would then be calculated and printed out in the Excel worksheet


Figure 2.76 Excel solution
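For readers who want to cross-check the Data Analysis ToolPak report outside Excel, a minimal pandas sketch is given below (an aside only; it assumes the pandas library is available). The skew() and kurt() methods use the same bias-adjusted definitions as Excel’s SKEW() and KURT(), so the values should agree.

import pandas as pd

marks = pd.Series([73, 78, 75, 75, 76, 69, 69, 82, 74, 70, 63, 68, 64,
                   70, 72, 64, 72, 67, 74, 70, 74, 77, 68, 78, 72],
                  name='e-Commerce marks')

print(marks.describe())              # count, mean, std, min, quartiles, max
print(marks.skew(), marks.kurt())    # sample skewness and excess kurtosis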

Check your understanding

X2.18 The manager at Big Jim’s restaurant is concerned at the time it takes to process credit card payments at the counter by counter staff. The manager has collected the processing time data (time in minutes) shown in Table 2.22 and requested that summary statistics are calculated.

73 73 73 73 73 73 73 73 73 73
78 78 78 78 78 78 78 78 78 78
75 75 75 75 75 75 75 75 75 75
75 75 75 75 75 75 75 75 75 75

Table 2.22 Time to process credit cards (minutes)

Use Excel Data Analysis to calculate:

a. A five-number summary for this data set.
b. Do we have any evidence for a symmetric distribution?
c. Which measures would you use to provide a measure of average and spread?

X2.19 The local regional development agency is conducting a major review of the economic development of a local community. One economic measure to be collected is the local house prices. These reflect the economic well-being of this community. The development agency has collected the following house price data (£) as presented in Table 2.23 (n = 40).

1.57 1.38 1.97 1.52 1.39 1.09 1.29 1.26
1.07 1.76 1.13 1.59 0.27 0.92 0.71 1.49
1.73 0.79 1.38 0.98 2.31 1.23 1.56 0.76
1.23 1.56 1.98 1.40 1.89 0.89 1.34 0.76
1.54 1.78 4.89 2.46 0.89 2.01 3.21 1.98

Table 2.23 House prices (£s)

Use Excel Data Analysis to calculate:

a. A five-number summary.
b. Do we have any evidence for a symmetric distribution?
c. Which measures would you use to provide a measure of average and spread?

Chapter summary

This chapter expanded from using tables and charts to summarising data using measures of average and dispersion. The average provides a measure of the central tendency (or middle value). The mean is the most commonly used average to represent the measure of central tendency. We learned that this measurement uses all the data within the calculation, and therefore outliers will affect the value of the mean. Accordingly, the value of the mean may not be representative of the underlying data set. If outliers are present in the data set, we also learned that we can either eliminate these outlier values or use the median to represent the average.

The next calculation to perform is to provide a measure of the spread of the data within the distribution. The standard deviation is the most common measure of dispersion (or spread) but, like the mean, the standard deviation is influenced by the presence of outliers in the data set. If outliers are present, then again you can either eliminate these outlier values or use the semi-interquartile range to represent the degree of dispersion.

We learned that the degree of skewness in the data set can be estimated by calculating the Pearson or Fisher–Pearson skewness coefficient. To estimate the degree of ‘peakedness’, we used the Fisher kurtosis coefficient. The last topic covered was the box plot, a graph that allows you to visualise the degree of symmetry or skewness in the data set. The chapter explored the calculation process for raw data and frequency distributions. However, it is very important to note that the graphical method will not be as accurate as the raw data method when calculating the summary statistics.

Table 2.24 provides a summary of which statistical measures to use for different types of data.

Data type           Average   Spread or dispersion
Nominal             Mode      NA
Ordinal             Mode      Range
                    Median    Range, interquartile range
Ratio or interval   Mode      Range
                    Median    Range, interquartile range
                    Mean      Variance, standard deviation, skewness, kurtosis

Table 2.24 Which summary statistic to use?

Test your understanding

TU2.1 Calculate the mean and median for the following data set: 28, 23, 27, 19, 22, 19, 23, 26, 34, 30, 29, 25.

TU2.2 Calculate the 10th and 74th percentile values for the following data set: 38, 41, 38, 48, 56, 35, 44, 31, 46, 41, 54, 51, 33.

TU2.3 Table 2.25 represents the English language results for a sample of students attending a pre-university course. Calculate: (a) mean, (b) sample standard deviation, (c) median, (d) sample skewness, and (e) sample kurtosis. Use your results to comment on the shape of the sample distribution.

70 90 82 80 88 86 97
79 79 71 82 83 93 82
75 77 73 74 85 80 84
61 69 71 88 90 81 75

Table 2.25 English language results

TU2.4 Table 2.26 represents the multiple-choice results out of 40 for a mid-term economics module. Calculate: (a) mean, (b) sample standard deviation, (c) median, (d) sample skewness, and (e) sample kurtosis. Use your results to comment on the shape of the sample distribution.

26 16 22 28 28 15 28
29 25 26 24 31 29 32
19 26 28 32 22 25 27
18 28 29 22 26 28 35

Table 2.26 Mid-term economics results

TU2.5 For the data in TU2.3 construct a five-number summary and a box plot. Comment on the distribution shape. Does your answer agree with your answer to TU2.3?

TU2.6 For the data in TU2.4 construct a five-number summary and a box plot. Comment on the distribution shape. Does your answer agree with your answer to TU2.4?

TU2.7 A local delivery company is assessing the times for delivery of an order for the last 35 customer orders. The data are presented in Table 2.27. Calculate the mean, median, sample standard deviation, interquartile range, and measures of sample skewness and kurtosis. Use these summary statistics to comment on the distribution shape.

28 28 28 28 28 28 28 28 28
23 23 23 23 23 23 23 23 23
27 27 27 27 27 27 27 27 27
19 19 19 19 19 19 19 19 19

Table 2.27 Sample data for the delivery times (minutes)


TU2.8 Maxim’s, the wine merchant, supplies vintage wine to restaurants. Concerns have been raised at the sales to one restaurant during the last 12 months. Maxim’s has collected the last 48 weeks of sales data as presented in Table 2.28. Based upon the weekly data calculate: (a) mean, (b) standard deviation, (c) median, (d) five-number summary, and (e) box plot. Based upon your answers, comment on the central tendency and whether the data are skewed.

43 39 31 34 44 38 34 53
35 28 25 29 41 46 32 38
26 50 38 37 37 29 46 39
36 36 40 31 32 46 34 27
43 33 41 44 53 38 42 38
29 43 35 27 31 38 25 45

Table 2.28 Weekly sales data

TU2.9 Joe runs a business delivering packages on behalf of a national postal business to customers. He keeps detailed records of the daily number of deliveries and has provided data for the last 32 days (Table 2.29). Calculate: (a) mean, (b) standard deviation, (c) median, (d) five-number summary, and (e) box plot. Based upon your answers, comment on the central tendency and whether the data are skewed.

43 39 31 34 37 36 34 44
29 31 44 38 34 53 29 40
27 53 43 38 35 28 25 29
46 31 43 38 35 25 43 18

Table 2.29 Daily number of deliveries over 32 days

Want to learn more?

The textbook online resource centre contains a range of documents to provide further information on the following topics:

1. A2Wa Inferring the population skewness value from the sample
2. A2Wb Inferring the population kurtosis value from the sample
3. A2Wc Chebyshev’s theorem
4. A2Wd Measures of average and dispersion for a frequency distribution
5. A2We Generating a grouped frequency distribution from raw data using SPSS


Chapter 3 Probability distributions

3.1 Introduction and learning objectives

The topics covered in previous chapters would conventionally be called descriptive statistics. This means that the techniques we described are very useful to describe variables and populations of interest. However, statistics has another category called inferential statistics. As the word ‘inference’ implies, we will be drawing conclusions from something. This is precisely what inferential statistics does. It analyses the results from smaller samples that describe the same phenomena as the whole population, and then draws conclusions from these samples and applies them to the whole population. The fundamental tool that enables us to make these conclusions is probability theory.

This chapter starts with an introduction to common probability terms, such as sample and expected value. These will help the reader understand the concepts described later in the chapter. The focus of this chapter is to introduce the reader to probability distributions that are commonly used in statistical hypothesis testing and to describe them. These include the normal distribution, Student’s t distribution, the F distribution, the chi-square distribution, the binomial distribution, and the Poisson distribution.

Learning objectives

On completing this chapter, you will be able to:

1. Understand key probability terms, such as: experiment, outcome, sample space, relative frequency, sample probability, mutually exclusive events, independent events, and tree diagrams.
2. Identify a continuous probability distribution and calculate the mean and variance.
3. Identify a discrete probability distribution and calculate the mean and variance.
4. Solve problems using Microsoft Excel and IBM SPSS software packages.

The concept of probability is an important aspect of the study of statistics. In this chapter we will introduce you to some of the concepts that are relevant to probability. However, the main aim of Chapter 3 is to focus on the ideas of continuous and discrete probability distributions and not on the fundamentals of probability theory. We will first explore continuous probability distributions (normal, Student’s t, and F) and then introduce the concept of discrete probability distributions (binomial and Poisson). Table 3.1 summarises the most frequently used probability distributions according to whether the data variables are discrete or continuous and whether the distributions are symmetric or skewed.


Measured characteristic: Shape

Variable type              Symmetric                        Skewed
Discrete distributions     Uniform, Binomial                Poisson, Hypergeometric
Continuous distributions   Uniform, Student’s t, Normal     F, Exponential

Table 3.1 Variable type versus measurement characteristic

Suppose that you are conducting a survey among students about attitudes towards smoking. In your sample you included young people between 18 and 23 years of age. You get some interesting results. Can you assume that everyone of similar age has similar views? Perhaps. To get a definitive answer, you need to apply some fundamental principles of probability. You need to figure out what is the probability that the similar views are held by the rest of the population. Let us now assume that you have graduated and that you got a job as a market research analyst with a telephone service provider. The company would like to conduct a pilot study in Ireland and, if it works, to implement the findings not only in Ireland, but also in the UK. Do you have the tools to confirm with confidence that the results are applicable in the rest of Ireland, as well as in the UK? If you do, what is the level of confidence, or, to put it another way, the level of risk that you are prepared to tolerate for the things to go wrong? The above examples are just two out of many, many possible problem areas that this chapter will help you understand better. Probability plays a key role in statistics. As you collect data, or record any kind of data set, you will quickly realise that the data set behaves in a fashion and has some specific characteristics that describe it as unique. We refer to this behaviour, and the associated characteristics, as the data distribution. There are several well-known and well-defined types of distributions. The best-known one is the so-called normal distribution. Every data point that belongs to this, or any other distribution, is defined by certain rules and probability laws. By learning how these probability laws apply, you will learn how to deal with the questions that we put forward at the beginning of this section.

3.2 What is probability?

Introduction to probability

There are several words and phrases that encapsulate the basic concept of probability: ‘chance’, ‘probable’, ‘odds’ and so on. In all cases we are faced with a degree of uncertainty and concerned with the likelihood of an event happening. These words and phrases are too vague, so we need some measure of the likelihood of an event occurring. This measure is termed probability and is measured on a scale between 0 and 1, with 0 representing no possibility of the event occurring and 1 representing certainty that the event will occur (Figure 3.1). For all practical situations, the value of the probability will lie between 0 and 1.


Figure 3.1 Range of probability values

To determine the probability of an event occurring, data must be collected. For example, this can be achieved through experience, desk research, observation or empirical methods. The term ‘experiment’ is used when we want to make observations for a situation of uncertainty. The actual results of the uncertain situation are called the outcome or sample point. If the result of an experiment remains uncertain from one repetition to another then the experiment is called a random experiment. In a random experiment, the outcome cannot be stated with certainty. An experiment may consist of one or more observations. If there is only a single observation, the term ‘random trial’ or ‘simple trial’ is used. An LED bulb may be selected from a factory to examine if it is defective or not. A single LED bulb being selected is a trial. We can select any number of LED bulbs. The number of observations will be equal to the number of LED bulbs.

A random experiment has the following properties:

1. It may be repeated any number of times under similar conditions.
2. It has more than one possible outcome.
3. Outcomes vary from trial to trial even when the initial conditions are the same.

Here are some examples of random variables:

1. In an experiment involving measuring the time for an LED bulb to fail, the random variable X would be the time taken for an LED bulb to fail.
2. In an experiment involving measuring the starting salary of recently graduated students, the random variable X would be the value of this starting salary for each student measured within the experiment.

Since random variables cannot be predicted exactly, they must be described in the language of probability where every outcome of the experiment will have a probability associated with it. The result of an experiment is called an ‘outcome’. It is the single possible result of an experiment – for example, tossing a coin produces a ‘head’, or rolling a die gives a 3. If we accept the proposition that an experiment can produce a finite number of outcomes, then we could in theory define all these outcomes. The set of all possible outcomes is defined as the sample space. For example, the experiment of rolling a die could produce the outcomes 1, 2, 3, 4, 5, 6 which would thus define the sample space.


Another basic notion is the concept of an event. Think of it as simply an occurrence of one of the possible outcomes – this implies that an event is a subset of the sample space. For example, in the experiment of rolling a die, the event of obtaining an even number would be defined as the subset {2, 4, 6}. Finally, two events are said to be mutually exclusive if they cannot occur together. By rolling a die, for example, the event stated as ‘obtaining a 2’, is mutually exclusive of the event ‘obtaining a 3’. The event ‘obtaining a 2’ and the event ‘obtaining an even number’ are not mutually exclusive since both can occur together, since {2} is a subset of {2, 4, 6}. This section provides a very basic overview and a refresher of some of the most elementary concepts needed to be understood to follow the rest of the chapter. The online chapters provide a more comprehensive introduction and a refresher into elementary probability theory.

Relative frequency

Suppose we perform the experiment of throwing a die and note the score obtained. We repeat the experiment many times. We will assign the symbol n to the number of times we repeated the experiment. We also observe the occurrence of event A out of the total number of experiments. We will use the symbol m for the number of occurrences of this event A. The ratio between m and n is called the relative frequency. In general, if event A occurs m times in n experiments, then your estimate of the probability that A will occur is given by equation (3.1).

$$P(A) = \frac{m}{n} \tag{3.1}$$

Example 3.1

Consider the result of running the die experiment where the die has been thrown 10 times and the number of times each possible outcome (1, 2, 3, 4, 5, 6) recorded. Now consider the result of running the die experiment where the die has been thrown 1000 times and the number of times each possible outcome (1, 2, 3, 4, 5, 6) recorded. The result of this die experiment is illustrated in Table 3.2.

Score                               1      2      3      4      5      6
Frequency for 10 runs               3      1      2      1      0      3
Relative frequency for 10 runs      0.3    0.1    0.2    0.1    0      0.3
Frequency for 1000 runs             173    168    167    161    172    159
Relative frequency for 1000 runs    0.173  0.168  0.167  0.161  0.172  0.159

Table 3.2 Calculation of relative frequencies

We can see big differences between the relative frequencies for 10 throws when compared with 1000 throws of a die. As the number of experiments increases, the relative frequency stabilises and approaches the true probability of the event. Thus, if we had performed the above experiment 2000 times we might expect ‘in the long run’ the frequencies of all the scores to approach 0.167. This implies that P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 0.167. Actually, for this experiment the theoretical values for each event would be P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6.

It is a very common practice to calculate probabilities through this relative frequency approach. This approach is also called the ‘empirical approach’ or the ‘experimental probability’ approach. A good example of this approach is a scenario where a manufacturer indicates that he is 99% certain (P = 99% or 0.99) that an electric light bulb will last 200 hours. This figure will have been arrived at from experiments which have tested numerous samples of light bulbs. You can also read this statement differently. You can interpret it as saying that there is a 1% chance (risk) that the bulb will not last 200 hours.

Several important points are assumed, and should be remembered, when approaching probability problems:

1. The probability of each event within the probability experiment lies between 0 and 1.
2. The sum of probabilities of all events in this experiment equals 1.
3. If we know the probability of an event occurring in the experiment, then the probability of it not occurring is P(event not occurring) = 1 – P(event occurring).
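The ‘long run’ behaviour of relative frequencies is easy to see with a small simulation. The Python sketch below (an aside, assuming numpy is available; it is not part of the textbook’s Excel/SPSS workflow) simulates 10, 1000 and 100000 throws of a fair die and prints the relative frequency of each score; as n grows, every relative frequency approaches 1/6 ≈ 0.167.

import numpy as np

rng = np.random.default_rng(1)

for n in (10, 1000, 100000):
    rolls = rng.integers(1, 7, size=n)                  # n throws of a fair six-sided die
    rel_freq = np.bincount(rolls, minlength=7)[1:] / n  # relative frequency of scores 1..6
    print(n, np.round(rel_freq, 3))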

Sample space

We already know that the sample space contains all possible outcomes of an experiment and that one or more outcomes constitute an event. Here, rather than resort to the notion of relative frequency, we will look at probability as defined by equation (3.2):

$$P(\text{Event}) = \frac{\text{Number of outcomes in the event}}{\text{Total number of outcomes}} \tag{3.2}$$

An example below is used to illustrate this notion via the construction of the sample space.

Example 3.2

If an experiment consists of rolling a die then the possible outcomes are 1, 2, 3, 4, 5, 6. The probability of obtaining a 3 from one roll of the die can then be calculated using equation (3.2):

$$P(\text{Obtaining a 3}) = \frac{\text{Number of outcomes producing a 3}}{\text{Total number of outcomes}} = \frac{1}{6}$$

The probability of obtaining a 3 is 0.1667, or 16.7%, or 16⅔%.

Discrete and continuous random variables

A random variable is a variable that provides a measure of the possible values obtainable from an experiment. For example, we may wish to count the number of times that the number 3 appears on the tossing of a fair die, or we may wish to measure the weight of people participating in a new diet programme. They are both random variables.

The probabilities of a particular outcome for a random variable are distributed in a certain way. These probability distributions will be different, depending on our random variable being either discrete or continuous. Here is an example of discrete random variable. Let the random variable consist of the numbers 1, 2, 3, 4, 5, 6 (a six-sided die, for example). If the die were fair, then on each toss of the die each possible number (or outcome) will have an equal chance of occurring. The numbers 1, 2, 3, 4, 5, 6 represent the values of the random variable for this experiment. As the values are the whole-number answers (not a continuum such as 1.1, 1.2, etc.), this is an example of a discrete random variable. Several discrete probability distributions will be discussed in this and online chapters, including binomial and Poisson. If the numbers can take any value with respect to measured accuracy (160.4 lbs, 160.41 lbs. 160.414 lbs, etc.), then this is an example of a continuous random variable. In this chapter, we will explore the concept of a continuous probability distribution with the focus on introducing the reader to the normal probability distribution. However, several other continuous probability distributions will be discussed in this and online chapters, including Student’s t distribution, the chi-square distribution, and the F distribution.

3.3 Continuous probability distributions

Introduction

In probability theory, an expected value is the theoretical mean value of a numerical experiment over many repetitions of the experiment. The phrases ‘the mean’ and ‘the expected value’ can be used interchangeably. For any continuous probability distribution, the expected value, E(X), and variance, VAR(X), can be found by solving the integral equations (3.3) and (3.4), with the function f(x) known:

$$E(X) = \int_{a}^{b} x f(x)\, dx \tag{3.3}$$

Equation (3.3) represents the expected value E(X) of X, which is the total area under the function x f(x) between the lower limit (a) and upper limit (b). Essentially this implies that when we are referring to the whole distribution, then the expected value is equal to the mean value. We already said that we will use the phrases ‘expected value’ and ‘the mean’ interchangeably.

$$VAR(X) = E\left[(X - \mu)^2\right] = \int_{a}^{b} (x - \mu)^2 f(x)\, dx \tag{3.4}$$

Equation (3.4) represents the variance VAR(X) of X, which is the total area under the function (x − μ)² f(x) between the lower limit (a) and upper limit (b). We will provide more detailed explanations of this concept in the context of a specific continuous probability distribution.

The normal distribution

In probability theory, the normal distribution is a very common continuous probability distribution. The normal distribution is important because of the central limit theorem. The central limit theorem states that, under certain conditions, averages of samples of observations of random variables independently drawn from independent distributions converge in distribution to the normal. This statement is a powerful tool that will enable us to make several inferences and is the foundation of inferential statistics. The statement about the central limit theorem means that if we took many, many samples from any kind of distribution (even a non-normal distribution), when we calculate the averages for each and every one of these samples, these averages will follow the normal distribution, regardless of what the distribution of the original population happens to be. Many of the real-life variables that have a normal distribution can, for example, be found in manufacturing (weights of tin cans) or can be associated with the human population (people’s heights).

The probability density of the normal distribution is defined by equation (3.5):

$$f(X) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right] \tag{3.5}$$

Where:

• µ is the population mean or expectation of the distribution.
• σ is the population standard deviation.

The probability density is the function that gives the probability that the value of a random observation x lies close to any given point X. Or, to put it another way, if you integrate the function over an interval, then the area under the function in this interval is equal to the probability that a random variable will be in this interval. Think of the probability density function as a function that defines the probability of occurrence of every value from a random variable. How these probabilities are shaped and distributed is determined by the probability density function.

The following conventions are often used in relation to the normal distribution:

1. The population mean and population standard deviation are represented by the notation µ and σ respectively.
2. If a variable X follows a normal distribution, we write X ~ N(µ, σ²), which is read as ‘X varies in accordance with a normal distribution, whose mean is µ and whose variance is σ²’.
3. The total area under the curve represents the total probability of all events occurring, which equals 1.
4. The mean of the random variable is µ, which is the same as saying that the expected value E(X) = µ.
5. The variance of the random variable is σ², which is the same as saying that the variance value VAR(X) = E[(X – µ)²] = σ².

Equation (3.5) can be represented graphically by Figure 3.2, which illustrates the symmetrical characteristics of the normal distribution. For the normal distribution the mean, median, and mode are all aligned and have the same numerical value. The normal distribution is sometimes called the ‘bell curve’.
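Equation (3.5) can also be evaluated directly. The Python sketch below (an aside, assuming numpy and scipy are available, with arbitrary illustrative values µ = 100 and σ = 5) codes the density formula by hand and checks it against scipy’s built-in normal density.

import numpy as np
from scipy.stats import norm

mu, sigma = 100, 5          # illustrative values only

def normal_pdf(x, mu, sigma):
    # Equation (3.5): the normal probability density function
    return (1 / (sigma * np.sqrt(2 * np.pi))) * np.exp(-0.5 * ((x - mu) / sigma) ** 2)

x = 103
print(normal_pdf(x, mu, sigma))           # density from the formula
print(norm.pdf(x, loc=mu, scale=sigma))   # same value from scipy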

Figure 3.2 Percentage points of the normal distribution

It is a property of the normal curve that 68.3% of all the values reside between µ ± 1σ, 95.5% of all the values reside between µ ± 2σ and 99.7% of all the values reside between µ ± 3σ. To calculate the probability of a value of X occurring we would use Excel (or SPSS, or statistical tables) to find the corresponding value of the probability.

Example 3.3

A manufacturing firm’s quality department inspects the components manufactured and, historically, the length of a tube is found to be normally distributed with a population mean of 123 cm and a population standard deviation of 13 cm. Calculate the probability that a random sample of one tube will have a length of at least 136 cm.

From the information provided we define X as the tube length in centimetres, with population mean µ = 123 and standard deviation σ = 13. This can be represented using the notation X ~ N(123, 13²). The problem we must solve is to calculate the probability that one tube will have a length of at least 136 cm. This can be written as P(X ≥ 136) and is represented by the shaded area illustrated in Figure 3.3.


Figure 3.3 Region represents P(X ≥ 136)

Excel solution

The Excel solution is illustrated in Figure 3.5. This problem can be solved by using the Excel function =NORM.DIST(x, µ, σ, TRUE). This function calculates the area to the left of x, i.e. P(X ≤ x). Therefore, P(X ≥ 136) = 1 – NORM.DIST(136, 123, 13, TRUE).

Figure 3.4 Relationship between P(X ≥ 136) and NORM.DIST Excel function

Figure 3.5 Example 3.3 Excel solution

From Excel: P(X ≥ 136) = 0.1587 We observe that the probability that an individual tube length is at least 136 cm is 0.1587, or 15.87%. SPSS solution To use SPSS to calculate values we require data in the SPSS data file. If no data is present, then enter a number to represent variable VAR00001. In this example, we entered the number 1 as illustrated in Figure 3.6. Please note you can enter any number– see Figure 3.6.

Figure 3.6 Enter number 1 to represent VAR00001 Now we can use SPSS Statistics to calculate the associated probabilities. Select Transform > Compute Variable Target Variable: Example 3 Numeric expression = 1 – CDF.NORMAL (136, 123, 13).

Figure 3.7 Use of compute variable to calculate P(X ≥ 136) Click OK The value will not be in the SPSS output file but in the SPSS data file in a column called Example 3.

Figure 3.8 SPSS solution, P(X ≥ 136) = 0.158655

The probability that an individual tube length is at least 136 cm is 0.1587 (or 15.87%). This agrees with the Excel solution illustrated in Figure 3.5.

Example 3.4

Using the same assumptions as in Example 3.3, calculate the probability that X lies between 110 and 136 cm. In this example, we are required to calculate P(110 ≤ X ≤ 136), which represents the area shaded in Figure 3.9. The value of P(110 ≤ X ≤ 136) can be calculated using Excel’s =NORM.DIST() function.

Figure 3.9 Shaded region represents P(110 ≤ X ≤ 136) Excel solution The Excel solution is illustrated in Figure 3.10.

Figure 3.10 Excel solution for P(110 ≤ X ≤ 136) The =NORM.DIST() function can be used to calculate P(110 ≤ X ≤ 136) = 0.682689492. Thus, the probability that an individual tube length lies between 110 and 136 cm is 0.6827 or 68.27%.


SPSS solution Enter data into SPSS As before, note that in these examples we have no data to input but we must enter a data value to be able to use the methods described below. In this example, we have entered the number 1 into column 1 (VAR00001).

Figure 3.11 Enter number 1 to represent VAR00001 Now we can use SPSS Statistics to calculate the associated probabilities. Repeat the calculation above but this time use: Select Transform > Compute Variable Target Variable: Example 4 Numeric expression = CDF.NORMAL (136, 123, 13) – CDF.NORMAL (110, 123, 13).

Figure 3.12 Use computer variable to calculate P(110 ≤ X ≤ 136) Click OK. The value will not be in the SPSS output file but in the SPSS data file in a column called Example 4.

Figure 3.13 SPSS solution, P(110 ≤ X ≤ 136) = 0.682689 The probability that an individual tube length lies between 110 and 136 cm is 0.6827 (or 68.27%). This agrees with the Excel solution shown in Figure 3.10.
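Outside the Excel and SPSS workflows shown in Examples 3.3 and 3.4, the same two probabilities can be checked with scipy (an aside only; the library and variable names are assumptions, not part of the textbook’s method).

from scipy.stats import norm

mu, sigma = 123, 13

p_tail = norm.sf(136, loc=mu, scale=sigma)                        # P(X >= 136) = 0.1587
p_between = norm.cdf(136, mu, sigma) - norm.cdf(110, mu, sigma)   # P(110 <= X <= 136) = 0.6827
print(round(p_tail, 4), round(p_between, 4))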

Page | 177

Check your understanding X3.1

Calculate the following probabilities, where X ~ N(100, 25): (a) P(X  95), (b) P(95  X  105), (c) P(105  X  115), (d) P(93  X  99). For each probability identify the region to be found by shading the area on the normal probability distribution graph.

The standard normal distribution (Z distribution) Assume you are researching two different populations, both following normal distributions. However, it could be difficult to compare if the units are different, or the means and variances might be different. If this was the case, we would like to be able to standardise these distributions so that we can compare them. This is possible by creating the standard normal distribution. The corresponding probability density function f (z) is given by equation (3.6): 𝑓(𝑧) =

1 √2𝜋

1

exp (− 2 𝑧 2 )

(3.6)

The standard normal distribution is a normal distribution whose mean is always 0 (=0) and whose standard deviation is always 1 (=1). This means that every value of X in a normal distribution, can be transformed to a value of Z in the standard normal distribution. To achieve this, we are using equation (3.7): Z=

X −



(3.7)

Where X, µ, and σ are the variable score value, population mean, and population standard deviation respectively, taken from the original normal distribution. Equation (3.6) can also be solved for X, which means that if we know the values of Z,  and , we can calculate the value of X. This is done by rearranging equation (3.6) into Z = X – , which ultimately yields X = Z + , or X =  + Z. The advantage of this method is that the Z values are not dependent on the original data units, and this allows tables of Z values to be produced with corresponding areas under the curve. This also allows for probabilities to be calculated if the Z value is known, and vice versa, which allows a range of problems to be solved. Figure 3.14 illustrates the standard normal distribution (or Z distribution) with Z scores between –3 and +3 and how they correspond to the actual X values.

Page | 178

Figure 3.14 Normal and standard normal curve The Z-value is effectively a standard deviation from a standard normal distribution. Because Z is always identical to 1 (standard deviation), this means that the standard normal distribution will always have 68.3% of all the values between 1, 95.5% of all the values between 2, and 99.7% of all the values between 3. The Excel function =NORM.S.DIST() (not to be confused with the =NORM.DIST() function) calculates the probability P(Z ≤ z) as illustrated in Figure 3.15.

Figure 3.15 Shaded region represents P(Z ≤ z) If Z corresponds to the standard deviation of the standard normal distribution, and in the box above we said 2 covers 95.5% of the distribution, how does this translate into the statements that we made about the ordinary (non-standard) normal distribution? Just before Example 3.3 we stated that any normal distribution covers 68.3% of the values for   1, 95.5% of the values for   2, and 99.7% of the values for   3. In the case of the standard normal distribution  = 0, which means that we need 1.96Z (not 2 or 2Z) to cover exactly 95% of the values and 2.58Z (not 3 or 3Z) to cover exactly 99% of the values. Page | 179

We can show, for example, that the proportion of values between ±1, ±2, and ±3 population standard deviations from the population mean of zero is 68.3%, 95.5%, and 99.7% respectively as illustrated in Figure 3.16.

Figure 3.16 Population proportions within 1, 3, 3 population standard deviations We’ll illustrate the method of calculating the area between the mean ± 1 standard deviation (  1), or ± 1z. Note that these values can be found using critical tables or using software such as Excel and SPSS. Say, we want to calculate the probability that Z lies between -1 and + 1. This is represented by the statement P(- 1 ≤ Z ≤ +1).Remember that the total area underneath the curve but equal to or above the y-axis represents the total probability which equals 1. We can write this as: P(- infinity  Z  + infinity) = 1 Therefore, 1 = P( - 1  Z) + P(- 1  Z) + P(Z  +1) + P(Z  1) Rearranging this equation to give P(- 1 ≤ Z ≤ +1) = 1 – P(Z  - 1) – P(Z  + 1) Due to the normal distribution being symmetric, then P(Z  - 1) = P(Z  + 1). Therefore, P(- 1 ≤ Z ≤ +1) = 1 – P(Z  + 1) – P(Z  + 1) P(- 1 ≤ Z ≤ +1) = 1 – 2 * P(Z  + 1)

Page | 180

From table 3.3, we can look up the probabilities associated with certain Z values. The Z values listed in this table provide the right-hand tail probabilities for positive values of Z i.e. P(Z  + z). Z

0.00

0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3

0.500 0.460 0.421 0.382 0.345 0.309 0.274 0.242 0.212 0.184 0.159 0.136 0.115 0.097

0.01 0.496 0.456 0.417 0.378 0.341 0.305 0.271 0.239 0.209 0.181 0.156 0.133 0.113 0.095

0.02 0.492 0.452 0.413 0.374 0.337 0.302 0.268 0.236 0.206 0.179 0.154 0.131 0.111 0.093

0.03 0.488 0.448 0.409 0.371 0.334 0.298 0.264 0.233 0.203 0.176 0.152 0.129 0.109 0.092

0.04 0.484 0.444 0.405 0.367 0.330 0.295 0.261 0.230 0.200 0.174 0.149 0.127 0.107 0.090

Table 3.3 Use of critical values to find P(Z  1) From table 3.3, P(Z  + 1) = P(Z  + 1.00) = 0.159 to 3 decimal places. P(- 1 ≤ Z ≤ +1) = 1 – 2 * 0.159 P(- 1 ≤ Z ≤ +1) = 0.682 For the standard normal distribution where z = 1 is the same as one standard deviation, the proportion of all values between  ± 1σ is 0.682, or 68.2%. Other values can be looked up in a similar manner from table 3.3. For example, to look up P(Z  1.84), find the row for 1.8 and the 0.04 column. At the intersection of the row and column (1.8 + 0.04 = 1.84) it can be seen that P(Z  1.84)=0.033. Example 3.5 Reconsider Example 3.3 but find P(X  136) by calculating the value of Z. If a variable X varies as a normal distribution with a mean of 123 and a standard deviation of 13, then the value of Z when X = 136 would be given by equation (3.6): 𝑍=

136 − 123 = +1 13

From table 3.3: P(X  136) = P(Z  1) = 0.159. Therefore, the probability that X  136 is 15.9%. This agrees with the answer given in Example 3.3. Page | 181

Excel solution The value of P(Z ≥ 1) can be calculated using Excel’s =NORM.S.DIST() function. The Excel solution is illustrated in Figure 3.18.

Figure 3.18 Example 3.5 Excel solution P(Z ≥ 1) We used two functions to calculate that P(X ≥ 136) or P(Z ≥ 1). The first function in cell C10 is the Excel =NORM.DIST() function and the other one in cell C15 is the Excel =NORM.S.DIST() function. The results are the same, although the input parameters are different. The first function (cell C10) requires as an input the X values with the corresponding mean and the standard deviation values. The second function (cell C15) requires only the Z value to calculate the same probability. In cell C14, instead of using manual formula, we could have used the Excel function =STANDARDIZE(x, mean, standard-dev). Either way, we get the same Z value. This solution can be represented graphically by Figure 3.19. From Excel, the =NORM.S.DIST() function can be used to calculate P(Z ≥ +1) = 0.1588655.

Figure 3.19 Shaded region represents P(Z ≥ 1)

Page | 182

We observe that the probability that an individual tube length is at least 136 cm is 0.1589 or 15.89% (P(X ≥ 136) = P(Z ≥ 1) = 0.1589). Take a note of the following remarks: 1. 2. 3. 4.

The Excel function =NORM.DIST() calculates the value of the normal distribution for the specified mean and standard deviation. The Excel function =NORM.S.DIST() calculates the value of the normal distribution for the specified Z score value. The value of the Z score can also be calculated using the Excel function =STANDARDIZE(). If the mean is equal to 0 and the standard deviation is equal to 1, then =NORM.S.DIST() and =NORM.DIST() produce identical results. If 0 and σ1, then only the =NORM.DIST() function can be used.

SPSS solution To use SPSS to calculate values we require data in the SPSS data file. If no data is present, then enter a number to represent variable VAR00001. In this example, we entered the number 1 as illustrated in Figure 3.20 (please note you can enter any number).

Figure 3.20 Enter number 1 to represent VAR00001 Now we can use SPSS Statistics to calculate the associated probabilities. Select Transform > Compute Variable Target Variable: Example 5 Numeric expression = 1-CDF.NORMAL (136, 123, 13)

Figure 3.21 Use compute variable to calculate P(X ≥ 136) Click OK.

Page | 183

The value will not be in the SPSS output file but in the SPSS data file in a column called Example5.

Figure 3.22 SPSS solution P(Z ≥ 1) = 0.158655 The probability that an individual tube length is at least 110 cm is 0.158655 (or 15.9%). This agrees with the Excel solution illustrated in Figure 3.16. Example 3.6 A local authority purchases 250 emergency lights to be used by the emergency services. The lifetime in hours of these lights follows a normal distribution, where X ~ N(230, 182). Calculate: (a) what number of lights might be expected to fail within the first 215 hours; (b) what number of lights may be expected to fail between 227 and 235 hours; and (c) after how many hours would we expect 10% of the lights to fail? Excel solution a. From this information we have population mean, µ, of 230 hours and a variance, σ2, of 324 hours (which is 182). This problem can be solved using either the =NORM.DIST() or =NORM.S.DIST() Excel function. This solution can be represented graphically by Figure 3.21. This problem involves finding P(X ≤ 215), and then multiplying it by the number of lights purchased (250) to obtain the number expected to fail within the first 215 hours.

Figure 3.23 Shaded area represents P(X ≤ 215) The Excel solution is illustrated in Figure 3.23. The =NORM.DIST() or =NORM.S.DIST() function can be used to calculate P(X ≤ 215) = 0.2023. The number of lights that are expected to fail out of the 250 lights purchased is E(fail) = 250 × P(X ≤ 215) = 50.5 or 51 of the purchased lights.

Page | 184

Figure 3.24 Example 3.6 (a) Excel solution P(X ≤ 215) b. The second part of the problem requires the calculation of the probability that X lies between 227 and 235 hours, and the estimated number of purchased lights out of 250 which will fail. This problem consists of finding P(227 ≤ X ≤ 235), as shown graphically by Figure 3.25.

Figure 3.25 Shaded region represents P(227 ≤ X ≤ 235) The Excel solution is illustrated in Figure 3.26. The =NORM.DIST() or =NORM.S.DIST() function can be used to calculate P(227 ≤ X ≤ 235) = 0.175592357. The number of purchased lights that are expected to fail between 227 and 235 hours out of the 250 lamps is then E(fail) = 250 × P(227 ≤ X ≤ 235) = 48.8980 or 49 purchased lights.

Page | 185

Figure 3.26 Example 3.6(b) Excel solution c. The final part of this problem involves calculating the number of hours for the first 10% to fail. This corresponds to calculating the value of x where P(X ≤ x) = 0.1. To solve this problem, we need two new Excel functions: =NORM.INV() and =NORM.S.INV(). Figures 3.27 and 3,28 illustrate the graphical and Excel solutions.

Figure 3.27 Shaded region represents P(X ≤ x) = 0.1

Page | 186

Figure 3.28 Example 3.6(c) Excel solution From Excel, the expected time for 10% to fail is 206.93, or to round it up, in 207 hours. We show two different ways to solve this in cells C11 and C14. In cell C11 we calculate X directly using Excel =NORM.INV() function. The result is 206.93 hours. The Excel function =NORM.INV() calculates the value of X from a normal distribution for the specified probability, mean and standard deviation. The Excel function =NORM.S.INV() calculate the value Z from normal distribution for the specified probability value. In cell C14 we calculate X directly from equation (3.7), which was solved for X: Z=

X −



→ Z = X −  → X = Z + 

We find that P(X ≤ x) = 0.1 corresponds to Z = –1.28 (cell C13). We can now use the above equation to obtain X = (–1.28 × 18) + 230 = 206.96 (slight error here due to the use of 2 decimal places in the Z value). SPSS solution To use SPSS to calculate values we require data in the SPSS data file. If no data is present, then enter a number to represent variable VAR00001. In this example, we entered the number 1 as illustrated in Figure 3.29. Please note you can enter any number. Now we can use SPSS Statistics to calculate the associated probabilities.

Figure 3.29 Enter number 1 to represent VAR00001 a. Repeat the calculation above, but this time use: Page | 187

Select Transform > Compute Variable Target Variable: Example 6 Numeric expression = CDF.NORMAL(215, 230, 18)

Figure 3.30 Use computer variable to calculate P(X ≤ 215) Click ok

Figure 3.31 SPSS solution P(X ≤ 215) = 0.202328 Target Variable: Example6_expected Numeric expression = Example6*250

Figure 3.32 Use compute variable to calculate expected value Click OK. The value will not be in the SPSS output file but in the SPSS data file in a column called Example6_expected.

Figure 3.33 SPSS solution E(X ≤ 215) = 50.582095 or 51 The number of lights that are expected to fail out of the 250 lights is E(fail) = 250 × P(X ≤ 215) = 51 lamps. This agrees with the Excel solution illustrated in Figure 3.24. b. Repeat the calculation above, but this time use: Page | 188

Select Transform > Compute Variable Target Variable: Example6b Numeric expression = CDF.NORMAL(235, 230, 18) – CDF.NORMAL(227, 230, 18).

Figure 3.34 Use compute variable to calculate P(227 ≤ X ≤ 235) Click OK. Target Variable: Example6b_expected Numeric expression = Example6b*250

Figure 3.35 Calculate expected value Click OK. The value will not be in the SPSS output file but in the SPSS data file in a column called Example6b_expected.

Figure 3.36 SPSS solution E(227 ≤ X ≤ 235) = 43.898089 or 44. The number of lights that are expected to fail between 227 and 235 hours out of the 250 lights is E(fail) = 250 × P(227 ≤ X ≤ 235) = 44. This agrees with the Excel solution shown in Figure 3.26. c. Repeat the calculation above, but this time use: Select Transform > Compute Variable Target Variable: Example6c Numeric expression = IDF.NORMAL(0.1, 230, 18).

Figure 3.37 Use compute variable to calculate x given P(X ≤ x) = 0.1 Click OK. The value will not be in the SPSS output file but in the SPSS data file in a column called Example6c.

Figure 3.38 SPSS solution x = 206.932072 or 207. The expected time for 10% to fail is 207 hours. This agrees with the Excel solution shown in Figure 3.28.
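For readers who want to double-check the three answers of Example 3.6 outside Excel and SPSS, the short sketch below does the same calculations in Python (an assumption on our part – Python and the scipy library are not part of this book's Excel/SPSS workflow).

from scipy.stats import norm

mu, sigma, n_lights = 230, 18, 250                          # X ~ N(230, 18^2), 250 lights purchased
p_a = norm.cdf(215, mu, sigma)                              # P(X <= 215), about 0.2023
p_b = norm.cdf(235, mu, sigma) - norm.cdf(227, mu, sigma)   # P(227 <= X <= 235), about 0.1756
x_c = norm.ppf(0.10, mu, sigma)                             # x such that P(X <= x) = 0.1, about 206.93
print(round(n_lights * p_a), round(n_lights * p_b), round(x_c))   # 51, 44 and 207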

Check your understanding X3.2

Calculate the following probabilities, where X ~ N(100, 25): (a) P(X ≥ 95), (b) P(95 ≤ X ≤ 105), (c) P(105 ≤ X ≤ 115), (d) P(93 ≤ X ≤ 99)? In each case convert X to Z. Compare with your answers from X3.1.

X3.3

Given that a normal variable has a mean of 12 and a variance of 36, calculate the probability that a member chosen at random is: (a) 15 or greater, (b) 15 or smaller, (c) 5 or smaller, (d) 5 or greater, (e) between 5 and 15.

X3.4

The lifetimes of certain brand of car tyres are normally distributed with a mean of 60285 km and standard deviation of 7230 km. If the supplier guarantees them for 50000 km, what proportion of tyres will be replaced under guarantee?

X3.5

Audio sensors have a design frequency of at most 48 KHz. The sensors are produced on a line with an output distributed as N(48.1, 1.03). Sensors with a maximum frequency below 47.9 KHz or above 48.2 KHz are rejected. Find: (a) the proportion that will be rejected; (b) the proportion that would be rejected if the mean were adjusted so as to minimise the proportion of rejects; (c) by how much the standard deviation would need to be reduced (leaving the mean at 48.1 KHz) so that the proportion of rejects below 47.9 KHz would be halved.

Checking for normality Normality tests assess the likelihood that the given data set comes from a normal distribution. This is an important concept in statistics, given that the parametric assumption relies on the data being normally distributed or approximately normally

distributed. Several statistical tests exist to test for normality, such as the Shapiro–Wilk test. However, several visual tests can also be used, such as: 1. Constructing a five-number summary and box plot 2. Constructing a normal probability plot. We are already familiar with the five-number summary and box plot. The second approach, a normal probability plot, involves constructing a graph of data values against corresponding Z values, where Z is based upon the ordered value. Example 3.7 The manager at Big Jim’s restaurant is concerned about the time it takes counter staff to process credit card payments at the counter. The manager has collected the processing time data (time in minutes for each of 19 cards) shown in Table 3.4 and requested that the data be checked to see if they are normally distributed.

0.64 0.71 0.85 0.89 0.92 0.96 1.07 0.76 1.09 1.13
1.23 0.76 1.18 0.79 1.26 1.29 1.34 1.38 1.50
Table 3.4 Processing cards (n=19)

Excel solution The method to create the normal probability plot is as follows (refer to Figure 3.39):
1. Order the data values (1, 2, 3, …, n) with 1 referring to the smallest data value and n representing the largest data value (column E).
2. Show the data (y) sorted in ascending order (column F).
3. For the first data value (smallest) calculate the cumulative area using the formula: = 1/(n + 1) (cell G4).
4. Repeat for the other values, where the cumulative area is given by the formula: = old area + 1/(n + 1) (cells G5 down).
5. Calculate the value of Z for this cumulative area using the Excel function =NORM.S.INV(cumulative area) (column I).
6. Plot data values y (column F) against Z values (column I) for each data point (Figure 3.40).
Figure 3.39 illustrates the Excel solution.
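The same six steps can be reproduced outside Excel; a minimal sketch in Python is shown below (assuming the numpy and scipy libraries are available – the column letters in the comments refer to the Excel layout in Figure 3.39).

import numpy as np
from scipy.stats import norm

# Table 3.4 processing times (minutes); the input order does not matter as they are sorted
times = [0.64, 0.71, 0.85, 0.89, 0.92, 1.23, 0.76, 1.18, 0.79, 1.26,
         0.96, 1.29, 1.07, 1.34, 0.76, 1.38, 1.09, 1.50, 1.13]
y = np.sort(times)                       # steps 1-2: ordered data values (column F)
n = len(y)
area = np.arange(1, n + 1) / (n + 1)     # steps 3-4: cumulative area i/(n + 1) (column G)
z = norm.ppf(area)                       # step 5: Z value for each cumulative area (column I)
for zi, yi in zip(z, y):                 # step 6: plot y against z; a roughly straight line suggests normality
    print(f"{zi:6.3f}  {yi:.2f}")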


Figure 3.39 Example 3.7 Use excel to create a normal probability plot Figure 3.40 shows the normal probability curve plot for this example.

Figure 3.40 Example 3.7 Normal probability plot We observe from the graph that the relationship between the data values y and Z is approximately a straight line. For data that are normally distributed we would expect the relationship to be linear. In this situation, we would accept the statement that the data values are approximately normally distributed. SPSS solution Enter the data into SPSS in one column – we have called the column variable Normalcheck. Page | 192

Figure 3.41 Example 3.7 SPSS data Select Analyze > Descriptive Statistics > Explore.

Figure 3.42 SPSS Explore menu Transfer variable into Dependent List box


Figure 3.43 SPSS explore menu Click on Plots. Click on Normality plots with tests

Figure 3.44 SPSS explore plots options Click Continue Click OK SPSS output This will output: (a) descriptive statistics, (b) tests of normality, and (c) normal Q-Q plot.


Figure 3.45 SPSS descriptives solution From the descriptive statistics in Figure 3.45, we see that the skewness is 0.115 and the kurtosis –1.156. The following can be inferred: a. Descriptive statistics output already indicates that this data is normally distributed, and you can see it from the skewness and kurtosis values. Remember that SPSS and Excel show excess kurtosis rather than the proper kurtosis value. Excess kurtosis is the proper kurtosis value minus 3 (in our case, the proper kurtosis value equals 1.844). Both the ratio of skewness to its standard error and the ratio of excess kurtosis to its standard error are expected to lie within the range ± 1.96 for a 95% confidence interval. In our example (0.115/0.524 = 0.219 and –1.156/1.014 = –1.140), both ratios are within the range –1.96 to +1.96, and therefore from just these descriptives we can conclude that the distribution is normally distributed. b. SPSS test of normality in Figure 3.45 presents the results from two well-known tests of normality, namely the Kolmogorov–Smirnov test and the Shapiro–Wilk test. The Shapiro–Wilk test is more appropriate for small sample sizes (less than 50) but can also handle sample sizes as large as 2000. For this reason, we will use the Shapiro–Wilk test as our numerical means of assessing normality. If the p-value (Sig. in Figure 3.45) of the Shapiro–Wilk test is greater than 0.05, the data are approximately normally distributed. If it is below 0.05, the data significantly deviate

from a normal distribution. In this example, the significance value is 0.584 > 0.05, and we conclude that the data are approximately normally distributed. The rationale behind this conclusion will become much clearer after we introduced hypothesis testing in Chapter 6. c. To determine normality graphically, we can use the output of a normal Q-Q plot. If the data are normally distributed, the data points will be close to a straight line. If the data points stray from the line in an obvious nonlinear fashion, the data are not normally distributed. From Figure 3.46, we observe the data are approximately normally distributed. This agrees with the Excel solution illustrated in Figure 3.40.

Figure 3.46 Line fit to normal probability plot If you are at all unsure of being able to correctly interpret the graph, rely on the numerical methods instead because it can take a fair bit of experience to correctly judge the normality of data based on plots. To conclude, let us illustrate just three possible scenarios to better understand how decisions are made on the symmetry of a distribution and the shape of the normal probability curve. First, Figure 3.47 illustrates a normal distribution where largest value minus Q3 equals Q1 minus smallest value. Second, Figure 3.48 illustrates a left-skewed distribution where Q1 minus smallest value greatly exceeds largest value minus Q3. Finally, Figure 3.49 illustrates a right-skewed distribution where largest value minus Q3 greatly exceeds Q1 minus smallest value.


Figure 3.47 Line fit to a distribution that is normally distributed

Figure 3.48 Line fit to a distribution that is left skewed

Figure 3.49 Line fit to a distribution that is right skewed
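If you want a purely numerical cross-check of the assessment above, the short Python sketch below (assuming scipy is available; Python is not part of this book's Excel/SPSS workflow) computes the skewness, excess kurtosis and Shapiro–Wilk p-value for the Table 3.4 data. The figures will differ slightly from SPSS, which applies small-sample corrections, but the conclusion should be the same.

from scipy.stats import shapiro, skew, kurtosis

times = [0.64, 0.71, 0.85, 0.89, 0.92, 1.23, 0.76, 1.18, 0.79, 1.26,
         0.96, 1.29, 1.07, 1.34, 0.76, 1.38, 1.09, 1.50, 1.13]

print(skew(times), kurtosis(times))   # skewness and excess kurtosis (uncorrected estimates)
stat, p_value = shapiro(times)        # Shapiro-Wilk test of normality
print(p_value)                        # p > 0.05 suggests the data are approximately normal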

Check your understanding X3.6

Use SPSS to assess whether the data set in Table 3.5 may have been drawn from a normal distribution by comparing (a) skewness, (b) kurtosis, (c) Shapiro–Wilk test statistic, and (d) normal Q-Q plot. [Hint: Access the ‘want to learn more – common assumptions about data’ document to help you answer this question.]

3.4 3.8 3.9 2.7 4.8 4.7
4.3 3.2 3.7 3.2 3.8 4.7
4.4 3.3 3.9 4.4 3.3 3.1
3.7 3.2 3.5 4.5 4.1 3.2
4.1 4.0 4.2 4.7 3.5 4.1
Table 3.5 Data set

Student’s t distribution In probability and statistics, Student’s t distribution (or simply the t distribution) is any member of a family of continuous probability distributions that arise when estimating the mean of a normally distributed population in situations where the sample size is small and the population standard deviation is unknown. By ‘small sample’, we typically mean a sample with a maximum of 30 observations. It was developed by William Gosset under the pseudonym Student. Whereas a normal distribution describes a full population (it can also describe a sample), t distributions always describe samples drawn from a full population. Accordingly, the t distribution for each sample size is different, and the larger the sample, the more the distribution resembles a normal distribution. For interested readers the probability density function (pdf) of the t distribution is defined by equation (3.8):

f(t) = Γ((df + 1)/2) / (√(df·π) × Γ(df/2)) × (1 + t²/df)^(−(df + 1)/2)    (3.8)

Where the degrees of freedom df > 0 (to be explained shortly) and –∞ < t < +∞. The t distribution is symmetric and bell-shaped, like the normal distribution, but has heavier tails, meaning that it is more prone to producing values that fall far from its mean. Figure 3.50 provides a comparison between the effects of different sample sizes on the t distribution compared to the standard normal curve. As the sample size increases, the t distribution approaches the standard normal curve and the t distribution can be used in place of the normal distribution when the population standard deviation (or variance) is unknown.


Figure 3.50 Different Student’s t distributions compared to the normal distribution If we take a sample of n observations from a continuously distributed population with population mean µ, then the sample mean and sample variance are given by equations (3.9) and (3.10), respectively:

x̄ = (x₁ + x₂ + x₃ + ⋯ + xₙ)/n    (3.9)

S² = Σ(xᵢ − x̄)² / (n − 1)    (3.10)

Given equations (3.9) and (3.10), we can calculate the t-value using equation (3.11):

t = (x̄ − µ) / (S/√n)    (3.11)

The t distribution with n – 1 degrees of freedom is the sampling distribution of the tvalue when the samples consist of independent and identically distributed observations from a normally distributed population. This t-test equation will prove very important in later chapters when we would like to conduct a confidence interval or hypothesis test on data collected from a normal or approximately normal population and we do not know the value of the population variance (unknown standard deviation). Example 3.8 A sample has been collected from a normal distribution where the population standard deviation is unknown. After careful consideration, the business analyst decides that a t test would be appropriate for the required analysis. Use Excel and SPSS to calculate the Page | 199

value of the t statistic (assuming all the area is in the upper right-hand tail), when the number of degrees of freedom is 18 and with a significance level of 0.1, 0.05 and 0.01. Excel solution Figure 3.51 illustrates the Excel solution. The t-values for a t distribution with 18 degrees of freedom are 1.33, 1.73 and 2.55 when the significance levels are 0.1, 0.05 and 0.01, respectively.

Figure 3.51 Example 3.8 Excel solution The t-values for the significance levels of 0.1, 0.05 and 0.01 (cells C4:E4) are 1.33, 1.73 and 2.55 respectively (cells C6:E6). SPSS solution To use SPSS to calculate values we require data in the SPSS data file. If no data is present, then enter a number to represent variable VAR00001. In this example, we entered the number 1 as illustrated in Figure 3.52. Please note you can enter any number.

Figure 3.52 Enter number 1 to represent VAR00001 Now we can use SPSS Statistics to calculate the associated probabilities. For a significance value of 0.1, repeat the calculation above but this time use: Select Transform > Compute Variable Target Variable: Example 8a Numeric expression = IDF.T(1-0.1,18).


Figure 3.53 Use compute variable to calculate t(0.1, 18) Click OK. SPSS output The value will not be in the SPSS output file but in the SPSS data file in a column called Example8a.

Figure 3.54 SPSS solution t(0.1, 18) = 1.33 The t-value for a t distribution with 18 degrees of freedom is 1.33 when the significance level is 0.1. This agrees with the Excel solution shown in Figure 3.51. For a significance value of 0.05, repeat the calculation above but this time use: Select Transform > Compute Variable Select Target Variable: Example8b Numeric expression = IDF.T(1-0.05,18).

Figure 3.55 Use compute variable to calculate t(0.05, 18) Click OK. SPSS output The value will not be in the SPSS output file but in the SPSS data file in a column called Example8b.


Figure 3.56 SPSS solution t(0.05, 18) = 1.73 The t-value for a t distribution with 18 degrees of freedom is 1.73 when the significance level is 0.05. This agrees with the Excel solution shown in Figure 3.51. For a significance value of 0.01, repeat the calculation above but this time use: Select Transform > Compute Variable Select Target Variable: Example8c Numeric expression = IDF.T(1-0.01,18).

Figure 3.57 Use compute variable to calculate t(0.01, 18) Click OK. SPSS output The value will not be in the SPSS output file but in the SPSS data file in a column called Example8c.

Figure 3.58 SPSS solution t(0.01, 18) = 2.55 The t-value for a t distribution with 18 degrees of freedom is 2.55 when the significance level is 0.01. This agrees with the Excel solution shown in Figure 3.51. Example 3.9 The business analyst in the previous example finds that the value of the t statistic equals 1.24 with 18 degrees of freedom. Estimate the value of the area such that the variable is less than 1.24 and the value of the probability density function at this value of t and df. Excel solution Figure 3.59 illustrates the Excel solution. Page | 202

Figure 3.59 Example 3.9 Excel solution The probability that the t value, when we have 18 degrees of freedom, is less than or equal to 1.24 is 0.88 (or 88%). The value of the probability density function is 0.18 (or 18%) when t = 1.24, df = 18. This is the right tail distribution, and we will explain the details in the following chapter. SPSS solution To use SPSS to calculate values we require data in the SPSS data file. If no data is present, then enter a number to represent variable VAR00001. In this example, we entered the number 1 as illustrated in Figure 3.60. Please note you can enter any number.

Figure 3.60 Enter number 1 to represent VAR00001 Now we can use SPSS Statistics to calculate the associated probabilities. Repeat the calculation above but this time use: Select Transform > Compute Variable Select Target Variable: Example9a Numeric expression = CDF.T(1.24,18).

Figure 3.61 Use compute variable to calculate P(t ≤ 1.24) Page | 203

Click OK. SPSS output The value will not be in the SPSS output file but in the SPSS data file in a column called Example9a.

Figure 3.62 SPSS solution P(t ≤ 1.24) = 0.88 Now, calculate f(x) Select Target Variable: Example9b Numeric expression = PDF.T(1.24,18).

Figure 3.63 Use compute variable to calculate probability density function f(x) when t = 1.24, df = 18. The value will not be in the SPSS output file but in the SPSS data file in a column called Example9b. SPSS output

Figure 3.64 SPSS solution f(x) = 0.18 The probability that the t value is less than or equal to 1.24 is 0.88 (or 88%) when we have 18 degrees of freedom. The value of the probability density function is 0.18 when t = 1.24, df = 18. This agrees with the Excel solution illustrated in Figure 3.59.
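A quick way to reproduce the t-distribution values in Examples 3.8 and 3.9 outside Excel and SPSS is sketched below in Python (assuming the scipy library is available; this is only a cross-check, not part of the book's workflow).

from scipy.stats import t

df = 18
for alpha in (0.1, 0.05, 0.01):
    print(round(t.ppf(1 - alpha, df), 2))   # 1.33, 1.73, 2.55 (Example 3.8)

print(round(t.cdf(1.24, df), 2))            # P(t <= 1.24) is about 0.88 (Example 3.9)
print(round(t.pdf(1.24, df), 2))            # density f(1.24) is about 0.18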

Check your understanding X3.7

Calculate the following t-distribution probabilities when df = 6: (a) P(t ≤ –1.45), (b) P(t ≤ 0), (c) P(0.3 ≤ t ≤ 1.4), (d) P(–1.34 ≤ t ≤ 1.8).

X3.8

Calculate the area in the right-hand tail if the t-value equals 2.05, P(t ≥ 2.05), and the t distribution has 15 degrees of freedom.


F distribution In probability theory and statistics, the F distribution, also known as Snedecor's F distribution or the Fisher–Snedecor distribution (after Ronald Fisher and George W. Snedecor) is another continuous probability distribution. You will see in the chapters that follow that this distribution arises frequently as the null distribution of a test statistic, most notably in the analysis of variance. The F-test statistic is defined by equation (3.12):

F = s₁² / s₂²    (3.12)

Where 𝑆12 and 𝑆22 are the sample 1 and sample 2 variances, respectively. The shape of the distribution depends upon the numerator and denominator degrees of freedom (df1 = n1 – 1, df2 = n2 – 1) and the F distribution is written as a function of n1, n2 as F(n1, n2). The probability density function (pdf) of the F distribution is defined by equation (3.13):

f(x) = [Γ((df₁ + df₂)/2) / (Γ(df₁/2) × Γ(df₂/2))] × df₁^(df₁/2) × df₂^(df₂/2) × x^(df₁/2 − 1) / (df₂ + df₁x)^((df₁ + df₂)/2)    (3.13)

Where x > 0 and Γ(df) = (df – 1)! denotes the gamma function. The gamma function is one of the ‘standard’ functions in mathematics and it is used to extend the factorial function for use on all sorts of fractions and complex numbers. The factorial n! is defined for a positive integer n as n! = n (n – 1) (n – 2) ……. (2) (1). For example, 5! = 5  4  3  2  1 = 120. From a calculation perspective, we do not need to worry about using equation (3.13), given we have access to published tables or software like Excel and SPSS to do the calculations. Figure 3.65 illustrates the shape of the F distribution for dfA = 17 and dfB = 24.

Figure 3.65 F distribution with dfA = 17, dfB = 24
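A quick illustration of the gamma function mentioned above, using Python's standard math module (a sketch for interested readers; the textbook calculations themselves do not require it).

import math

print(math.factorial(5))       # 5! = 120
print(math.gamma(6))           # Gamma(6) = (6 - 1)! = 120.0
print(math.gamma(0.5) ** 2)    # Gamma(1/2)^2 = pi, showing the extension to fractions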


Example 3.10 Calculate the probability of F ≤ 4.03 if the numerator and denominator degrees of freedom are 9 and 9, respectively. Based on this answer, calculate P(F ≥ 4.03). Excel solution Figure 3.66 illustrates the Excel solution. The probability that F will be less than 4.03 is 0.98 (or 98%) and greater than 4.03 is 0.02 (or 2%), given the numerator and denominator degrees of freedom are 9.

Figure 3.66 Example 3.10 Excel solution Cells C9:C11 in Figure 3.66 show three different ways to achieve the same result. SPSS solution To use SPSS to calculate values we require data in the SPSS data file. If no data is present, then enter a number to represent variable VAR00001. In this example, we entered the number 1 as illustrated in Figure 3.67. Please note you can enter any number.

Figure 3.67 Enter number 1 to represent VAR00001 Now we can use SPSS Statistics to calculate the associated probabilities. a. To obtain P(F ≤ 4.03), repeat the calculation above but this time use: Select Transform > Compute Variable Target Variable: Example10a Page | 206

Numeric expression = CDF.F(4.03,9,9).

Figure 3.68 Use compute variable to calculate P(F ≤ 4.03) Click OK SPSS output The value will not be in the SPSS output file but in the SPSS data file in a column called Example10a.

Figure 3.69 SPSS solution P(F ≤ 4.03) = 0.98 P(F ≤ 4.03) = 0.98 The probability that F will be less than 4.03 is 0.98 (or 98%). This agrees with the Excel solution shown in Figure 3.66. b. To obtain P(F  4.03), repeat the calculation above but this time use: Select Transform > Compute Variable Target Variable: Example10b Numeric expression = 1-CDF.F(4.03,9,9).

Figure 3.70 Use compute variable to calculate P(F ≥ 4.03) Click OK SPSS output Page | 207

The value will not be in the SPSS output file but in the SPSS data file in a column called Example10b.

Figure 3.71 SPSS solution P(F ≥ 4.03) = 0.02 P(F  4.03) = 0.02 The probability that F will be greater than 4.03 is 0.02 (or 2%) The probability that F will be less than 4.03 is 0.98 (or 98%) and greater than 4.03 is 0.02 (or 2%), given the numerator and denominator degrees of freedom are 9. This agrees with the Excel solution shown in Figure 3.66.
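The same F-distribution probabilities can be cross-checked in Python (a sketch, assuming the scipy library is available; not part of the book's Excel/SPSS workflow).

from scipy.stats import f

df1, df2 = 9, 9
print(f.cdf(4.03, df1, df2))   # P(F <= 4.03), about 0.98 (Figure 3.66)
print(f.sf(4.03, df1, df2))    # P(F >= 4.03) = 1 - cdf, about 0.02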

Check your understanding X3.9 Calculate F when α = 0.1 and numerator (df1) and denominator (df2) degrees of freedom are 5 and 7, respectively. X3.10 Calculate the probability that F ≥ 2.34 when numerator and denominator degrees of freedom are 12 and 18, respectively.

Chi-square distribution The chi-square (χ²) distribution is a popular distribution that is used to solve many statistical inference problems involving contingency tables and can be used to test if a sample of data came from a population with a specific distribution. The probability density function (pdf) of the chi-square distribution is defined by equation (3.14):

f(x, k) = x^(k/2 − 1) × e^(−x/2) / (2^(k/2) × Γ(k/2)),  x > 0    (3.14)

Where x > 0 and Γ denotes the gamma function that we have already briefly described. From a calculation perspective, we again do not need to worry about using this equation, given we have access to published tables or software like Excel and SPSS to do the calculations. Figure 3.72 illustrates the variation in the value of the chi-square distribution with the degrees of freedom varying between 3 and 9.


Figure 3.72 Chi square distribution curves for different degrees of freedom Example 3.11 Calculate: (a) the probability that the chi-square test statistic is 1.86 or less if the number of degrees of freedom is 8; (b) find the value of x given P(χ² ≥ x) = 0.04 and 10 degrees of freedom. Excel solution Figure 3.73 illustrates the Excel solution.

Figure 3.73 Example 3.11 Excel solution We can see that (a) P(χ² ≤ 1.86) with 8 degrees of freedom yields probability 0.015, and (b) P(χ² ≥ x) = 0.04 with 10 df gives x = 19.02. SPSS solution To use SPSS to calculate values we require data in the SPSS data file. If no data is present, then enter a number to represent variable VAR00001. In this example, we entered the number 1 as illustrated in Figure 3.74. Please note you can enter any number.

Figure 3.74 Enter number 1 to represent variable VAR00001 Now we can use SPSS Statistics to calculate the associated probabilities. a. To find P(χ2 ≤ 1.86), repeat the calculation above but this time use: Select Transform > Compute Variable Target Variable: Example11a Numeric expression =CDF.CHISQ(1.86, 8).

Figure 3.75 Use compute variable to calculate P(chi-square  1.86) Click OK. SPSS output The value will not be in the SPSS output file but in the SPSS data file in a column called Example11a.

Figure 3.76 SPSS solution P(chi-square  1.86) = 0.015 P(χ2 ≤ 1.86) is 0.015 (or 1.5%). This agrees with the Excel solution shown in Figure 3.73. b. To find the value of x given P(χ2  x) = 0.04, repeat the calculation above but this time use: Select Transform > Compute Variable Target Variable: Example11b Numeric expression =IDF.CHISQ(1-0.04, 10).


Figure 3.77 Use compute variable to calculate x, given P(chi-square  x) = 0.04 Click OK The value will not be in the SPSS output file but in the SPSS data file in a column called Example11b.

Figure 3.78 SPSS solution, x = 19.02 P(χ2  x) = 0.04 with 10 df gives x = 19.02. This agrees with the Excel solution shown in Figure 3.73.
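As with the other distributions, Example 3.11 can be cross-checked in Python (a sketch, assuming the scipy library is available).

from scipy.stats import chi2

print(chi2.cdf(1.86, 8))          # (a) P(chi-square <= 1.86) with 8 df, about 0.015
print(chi2.ppf(1 - 0.04, 10))     # (b) x such that P(chi-square >= x) = 0.04 with 10 df, about 19.02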

Check your understanding X3.11 Calculate the probability that χ² ≥ 13.98 given 7 degrees of freedom. X3.12 Find the value of x given P(χ² ≤ x) = 0.0125 with 12 degrees of freedom.

3.4 Discrete probability distributions Introduction In this section, we shall explore discrete probability distributions when dealing with discrete random variables. The two specific distributions we include are the binomial probability distribution and the Poisson probability distribution. Example 3.12 To illustrate the idea of a discrete probability distribution, consider the frequency distribution presented in Table 3.6, representing the distance travelled by delivery vans per day. The table tells us that 4 vans out of 100 travelled between 400 and 449 miles, 44 travelled out of 100 travelled between 450 and 499 miles, etc.


Distance     Frequency, f
400 - 449    4
450 - 499    44
500 - 549    36
550 - 599    15
600 - 649    1
Total =      100

Table 3.6 Frequency distribution From Table 3.6 we can work out the relative frequency distribution. Table 3.7 illustrates the calculation process. We observe from Table 3.7 that the relative frequency for 400–449 miles travelled, for example, is 4/100 = 0.04. This implies that we have a chance or probability of 4/100 that the distance travelled lies within this class.

Distance     Frequency, f    Relative frequency
400 - 449    4               0.040000
450 - 499    44              0.440000
500 - 549    36              0.360000
550 - 599    15              0.150000
600 - 649    1               0.010000
Total =      100             1.000000

Table 3.7 Calculation of relative frequencies Excel solution

Figure 3.79 Example 3.12 Excel solution Thus, relative frequencies provide estimates of the probability for that class, or value, to occur. If we were to plot the histogram of relative frequencies, we would in fact be plotting out the probabilities for each event: P(400–449) = 0.04, P(450–499) = 0.44, etc. The distribution of probabilities given in Table 3.7 and the graphical representation in Figure 3.80 are different ways of illustrating the probability distribution.


Figure 3.80 Histogram for the distance travelled For the frequency distribution, the area under the histogram is proportional to the total frequency. However, for the probability distribution, the area is proportional to total probability (= 1). Given a probability distribution, we can determine the probability for any event associated with it. For example, P(400 – 549) = P(400 ≤ X ≤ 549) is the area under the distribution from 400 to 549, or P(400 – 449) + P(450 – 499) + P(500 – 549) = 0.04 + 0.44 + 0.36 = 0.84. Thus, we have a probability estimate of approximately 84% for the distance travelled to lie between 400 and 549 miles. Now, imagine that in Figure 3.80 we decreased the class width towards zero and increased the number of associated bars observable. Then the outline of the bars in Figure 3.80 would approximate a curve – the probability distribution curve. The value of the mean and standard deviation can be calculated from the frequency distribution by using equations (3.15) to (3.17). If you undertake the calculation (see the next example) then the mean value is 507.00 with a standard deviation of 40.85 miles. By using relative frequencies to determine the mean we have in fact found the mean of the probability distribution. The expected value of a discrete random variable X, which is denoted in many ways, including E(X) and µ, is also known as the expectation or mean. For a discrete random variable X under probability distribution P, the expected value of the probability distribution E(X) is defined by equation (3.15): E(X) = ∑ X × P

(3.15)

Where X is a random variable with a set of outcome variables X1, X2, X3, …, Xn occurring with probabilities P1, P2, P3, …, Pn. Equation (3.15) can be written as: E(X) = ∑ X × P E(X) = X1P1 + X2P2 + ……. + XnPn Page | 213

Further thought along the lines used in developing the notion of expectation would reveal that the variance of the probability distribution, VAR(X), represents a measure of how broadly distributed the random variable X tends to be and is defined as the expectation of the squared deviation from the mean:

VAR(X) = E[(X − E(X))²]

What does this mean? First, let us rewrite the definition explicitly as a sum. If X takes values X1, X2, …, Xn, with each X value having an associated probability, P(X), then the variance equation can be written as follows:

VAR(X) = ∑(X − E(X))² × P(X)    (3.16)

In words, the formula for VAR(X) says to take a weighted average of the squared distance to the mean. By squaring, we make sure we are averaging only non-negative values, so that the spread to the right of the mean will not cancel that to the left. From equation (3.16), the standard deviation for the probability distribution is calculated using the relationship given in equation (3.17): SD(X) = √VAR(X)

(3.17)

Example 3.13 Returning to the delivery vans travelled we can easily calculate the mean number and the corresponding measure of dispersion as illustrated in Figures 3.81–3.83. Remember that LCB and UCB stand for the lower-class boundary and upper-class boundary, respectively.

Figure 3.81 Example 3.13 Excel solution


Figure 3.82 Example 3.13 Excel solution continued

Figure 3.83 Example 3.13 Excel solution continued We can see that the mean, or expected value, is 507.0 miles (cell C14), with a standard deviation of 40.85 miles (cell C16).
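The same expected value and standard deviation can be reproduced with a few lines of Python (a sketch; it assumes the class midpoints 424.5, 474.5, …, 624.5, i.e. the LCB/UCB midpoints used in Figures 3.81–3.83).

midpoints = [424.5, 474.5, 524.5, 574.5, 624.5]      # assumed class midpoints
probs     = [0.04, 0.44, 0.36, 0.15, 0.01]           # relative frequencies from Table 3.7

mean = sum(x * p for x, p in zip(midpoints, probs))                 # E(X), equation (3.15)
var  = sum((x - mean) ** 2 * p for x, p in zip(midpoints, probs))   # VAR(X), equation (3.16)
print(mean, var ** 0.5)                              # 507.0 miles and about 40.85 miles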

Check your understanding X3.13 Give an appropriate sample space for each of the following experiments: a. A card is chosen at random from a pack of cards. b. A person is chosen at random from a group containing 5 females and 6 males. c. A football team records the results of each of two games as 'win', 'draw' or 'lose'. X3.14 A dart is thrown at a board and is likely to land on any one of eight squares numbered 1 to 8 inclusive. A represents the event the dart lands in square 5 or 8. B represents the event the dart lands in square 2, 3 or 4. C represents the event the dart lands in square 1, 2, 5 or 6. Which two events are mutually exclusive? X3.15 Table 3.8 provides information about 200 school leavers and their destination after leaving school. Determine the following probabilities that a person selected at random:

                          Leave school at 16 years    Leave school at a higher age
Full time education, E    14                          18
Full time job, J          96                          44
Other                     15                          13
Table 3.8 Destination of school leavers

a. Went into full-time education
b. Went into a full-time job
c. Either went into full-time education or went into a full-time job
d. Left school at 16
e. Left school at 16 and went into full-time education.

Binomial probability distribution One of the most elementary discrete random variables, the binomial, is associated with questions that only allow yes or no type answers, or a classification such as male or female, or recording a component as defective or not defective. If the outcomes are also independent, (e.g., the possibility of a defective component does not influence the possibility of finding another defective component) then the variable is a binomial variable. Consider the example of a supermarket that runs a two-week television campaign to increase its volume of trade. During the campaign, all customers are asked if they came to the supermarket because of the television advertising. Each customer response can be classified as either yes or no. At the end of the campaign the proportion of customers who responded ‘yes’ is determined. For this study, the experiment is the process of asking customers if they came to the supermarket because of the television advertising. The random variable, X, is defined as the number of customers who responded ‘yes’ or ‘no’. Clearly the random variable consists of n number of customers, each with the value of just 1 (“yes”) or 0 (“no”). Consequently, the random variable is discrete. Consider the characteristics that define the binomial experiment. The experiment consists of n identical trials: 1. Each trial results in one of two outcomes which for convenience we can define as either a success or a failure. 2. The outcomes from trial to trial are independent. 3. The probability of success (p) is the same for each trial. 4. The probability of failure is q = 1 – p. 5. The random variable equals the number of successes in the n trials and can take a value from 0 to n. These five characteristics define the binomial experiment and are applicable for situations of sampling from finite populations with replacement or for infinite populations with or without replacement. Example 3.14 A group of archers are interested in calculating the probability of hitting the centre of the target from the recommended beginners’ distance of 5 yards from the target as illustrated in Figure 3.84 (not to scale). The historical data collected from the archery club gives a probability of 0.3 for beginners after attending the required training courses. If the archer makes 3 attempts calculate the probability distribution.


Figure 3.84 Archery target (not to scale) This experiment can be modelled by a binomial distribution since: 1. We have three identical trials (n = 3). 2. Each trial can result in hitting the target (success) or not hitting the target (failure). 3. The outcome of each trial is independent. 4. The probability of a success (P(hitting target) = p = 0.3) is the same for each trial. 5. The random variable is discrete. Let T represent the event that the marksman hits the target, and T′ represents the event that the target is missed. The corresponding individual event probabilities are: Probability of hitting target, P(T) = 0.3 Probability of missing target, P(T′) = 1 – P(T) = 1 – 0.3 = 0.7 Figure 3.85 illustrates the tree diagram that represents the described experiment.

Figure 3.85 Example 3.14 tree diagram Page | 217

From this tree diagram we can identify the possible routes that could be achieved by having 3 attempts on the target, where X represents the number of targets hit on 3 attempts:
1. Target missed on all attempts out of the 3 attempts, X = 0
2. Target hit on 1 occasion out of the 3 attempts, X = 1
3. Target hit on 2 occasions out of the 3 attempts, X = 2
4. Target hit on 3 occasions out of the 3 attempts, X = 3

This completes the complete set of possible options for this experiment which consists of allowing 3 attempts on the target. We can now use the tree diagram to identify the routes to achieve each of these alternative possibilities and the associated probabilities, given the probability of hitting the target at each attempt, p = 0.3 (therefore, probability of missing target on each attempt q = 1 – p = 0.7). Target missed on all attempts out of the 3 attempts, X = 0 This can be achieved via the route: 1st attempt missed, 2nd attempt missed, and 3rd attempt missed. We can write this has a probability equation as follows: Probability that target missed on all 3 attempts = P(X = 0) P(X = 0) = P(1st attempt missed, 2nd attempt missed, and 3rd attempt missed) P(X = 0) = P(T’ and T’ and T’) Given that each of these 3 attempts are independent events, then we can show that P(T’ and T’ and T’) = P(T’)  P(T’)  P(T’) P(X = 0) = 0.7 × 0.7 × 0.7 P(X = 0) = (0.7)3 P(X = 0) = 0.343 The important lesson is to note how we can use the tree diagram to calculate an individual probability but also note the pattern identified in the relationship between the probability, P(X = x), and the individual event probability of success, p, or failure, q. If we continue the calculation procedure we find: P(1 target hit) = P(X = 1 success) P(1 target hit) = P(TT′T′ or T′TT′ or T′T′T) P(1 target hit) = 0.3 × 0.7 × 0.7 + 0.7 × 0.3 × 0.7 + 0.7 × 0.7 × 0.3


P(1 target hit) = 0.3 × (0.7)² + 0.3 × (0.7)² + 0.3 × (0.7)²
P(1 target hit) = 3 × 0.3 × (0.7)²
P(1 target hit) = 3pq² = 0.441

P(2 targets hit) = P(X = 2 successes)
P(2 targets hit) = P(TTT′ or TT′T or T′TT)
P(2 targets hit) = 0.3 × 0.3 × 0.7 + 0.3 × 0.7 × 0.3 + 0.7 × 0.3 × 0.3
P(2 targets hit) = (0.3)² × 0.7 + (0.3)² × 0.7 + (0.3)² × 0.7 = 3 × (0.3)² × 0.7
P(2 targets hit) = 3p²q = 0.189

P(3 targets hit) = P(X = 3 successes)
P(3 targets hit) = P(TTT)
P(3 targets hit) = 0.3 × 0.3 × 0.3
P(3 targets hit) = (0.3)³ = p³ = 0.027

From these calculations, we can now note the probability distribution for this experiment (see Table 3.9 and Figure 3.86).

x    Formula    P(X = x)
0    q³         0.343
1    3pq²       0.441
2    3p²q       0.189
3    p³         0.027
     Total =    1.000
Table 3.9 Probability distribution table


Figure 3.86 Bar chart representing the binomial distribution event From the probability distribution given in Table 3.9, we observe that the total probability equals 1. This is expected since the total probability would represent the total experiment. We can express the total probability for the experiment by equation (3.18): ∑ P(X = x) = 1

(3.18)

If we increase the size, n, of the experiment, then it becomes quite difficult to calculate the event probabilities. We really need to develop a formula for calculating binomial probabilities. Using the ideas generated earlier, we have Total probability = P(X = 0 or X = 1 or X = 2 or X = 3) Total probability = P(X = 0) + P(X = 1) + P(X = 2) + P(X = 3) Using the information given in table 3.8 we can re-write this equation Total probability = (0.7)3 + 3 × (0.3) × (0.7)2 + 3 × (0.3)2 × (0.7) + (0.3)3 Total probability = q3 + 3pq2 + 3p2q + p3 Total probability = p3 + 3p2q + 3pq2 + q3 Repeating this experiment for increasing values of n would enable the identification of a pattern that can be used to develop equation (3.19). This equation can then be used to calculate the probability of x successes given n attempts of the binomial experiment. P(X = x) = 𝐶𝑥𝑛 px qn−x

(3.19)

The term Cₓⁿ calculates the binomial coefficients which are the numbers in front of the letter terms in the binomial expansion. In the previous example we found that the total probability p³ + 3p²q + 3pq² + q³, with the numbers 1, 3, 3, 1 in front of the letters. These numbers are called the ‘binomial coefficients’ and are calculated using equation (3.20):

Cₓⁿ = n! / (x! (n − x)!)    (3.20)

Where n! (pronounced ‘n factorial’) is the factorial of a positive integer n, and is defined by equation (3.21): n! = n × (n – 1) × (n – 2) × (n – 3) × … × 3 × 2 × 1

(3.21)

The term 𝐶𝑥𝑛 calculates the number of ways of obtaining x successes from n attempts of the experiment. For example, the values of 3! 2! 1!, and 0! are as follows: 3! = 3 × 2 × 1 = 6 2! = 2 × 1 = 2 1! = 1 0! = 1. It can be shown that the mean and variance of a binomial distribution are given by equations (3.22) and (3.23): Mean of a binomial distribution, E(X) = np

(3.22)

Variance of a binomial distribution, VAR(X) = npq

(3.23)
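Equations (3.20)–(3.23) are easy to evaluate directly; the sketch below uses Python's standard library (math.comb and math.factorial mirror Excel's =COMBIN() and =FACT()), with the n = 3, p = 0.35 values used in the worked example that follows.

import math

n, p = 3, 0.35
q = 1 - p

print([math.comb(n, x) for x in range(n + 1)])   # binomial coefficients C(3,0)..C(3,3) -> [1, 3, 3, 1]
print(math.factorial(5))                         # 5! = 120
print(n * p, n * p * q)                          # mean = 1.05, variance = 0.6825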

In Example 3.14, we note that n = 3, p = 0.35, and q = 1 – p = 1 – 0.35 = 0.65. Now, let us calculate using equation (3.19) the probability that the target will not be hit on any of the 3 attempts, P(X = 0). Substituting these values into equation (3.19) gives for P(X = 0):

P(X = x) = Cₓⁿ × p^x × q^(n−x)
P(X = 0) = C₀³ × 0.35⁰ × 0.65³⁻⁰
P(X = 0) = C₀³ × 0.35⁰ × 0.65³

Inspecting this equation, we have three terms that are multiplied together to provide the probability of the target not hit on the three attempts:

C₀³
(0.35)⁰ = 1
(0.65)³ = 0.65 × 0.65 × 0.65

The second and third terms are straightforward to calculate, and the first can be calculated from equation (3.20) as follows:

C₀³ = 3! / (0! (3 − 0)!) = 3! / (0! 3!) = (3 × 2 × 1) / (1 × 3 × 2 × 1) = 1

Substituting this value into the problem solution gives:

P(X = 0) = 1 × 1 × 0.65³ = 0.274625

Equation (3.19) can now be used to calculate P(X = 1), P(X = 2), and P(X = 3):

P(X = 1) = C₁³ × 0.35¹ × 0.65³⁻¹
C₁³ = 3! / (1! (3 − 1)!) = 3! / (1! 2!) = 3
P(X = 1) = 3 × 0.35 × 0.65 × 0.65 = 0.443625

P(X = 2) = C₂³ × 0.35² × 0.65³⁻²
C₂³ = 3! / (2! (3 − 2)!) = 3! / (2! 1!) = 3
P(X = 2) = 3 × 0.35 × 0.35 × 0.65 = 0.238875

P(X = 3) = C₃³ × 0.35³ × 0.65³⁻³
C₃³ = 3! / (3! (3 − 3)!) = 3! / (3! 0!) = 1
P(X = 3) = 1 × 0.35 × 0.35 × 0.35 = 0.042875

Given that we have calculated all the probabilities for this binomial experiment when we had 3 attempts, then equation (3.18) should be true:

∑ P(X = x) = 1

Check by adding the individual probabilities together:

Total probability = P(X = 0) + P(X = 1) + P(X = 2) + P(X = 3) = 0.274625 + 0.443625 + 0.238875 + 0.042875 = 1


Excel solution Figure 3.87 illustrates the Excel solution.

Figure 3.87 Example 3.14 Binomial distribution SPSS solution To use SPSS to calculate values we require data in the SPSS data file. If no data is present, then enter a number to represent variable VAR00001. In this example, we entered the number 1 as illustrated in Figure 3.88. Please note you can enter any number.

Figure 3.88 Enter number 1 to represent VAR00001 Now we can use SPSS Statistics to calculate the probability distribution and individual associated probabilities. Probability distribution In SPSS type 0, 1, 2, 3 in one column and label X. Select Transform > Compute Variable Target Variable: Prob Numeric expression = PDF.BINOM(X, 3, 0.35)


Figure 3.89 Compute variable Click Ok The value will not be in the SPSS output file but in the SPSS data file in a column called X and Prob, as illustrated in Figure 3.90.

Figure 3.90 Binomial distribution These results agree with the manual and Excel solutions. Individual probabilities We can use SPSS to calculate the individual probabilities as illustrated for P(X = 3): P(X = 3 given binomial and n = 3, p = 0.35) SPSS can be used to calculate the probability of a binomial event occurring using the PDF.BINOM function. Select Transform > Compute Variable Target Variable: Example 14a Numeric expression = PDF.BINOM(3, 3, 0.35)

Figure 3.91 Use compute variable to calculate P(X = 3) Page | 225

Click OK SPSS output The value will not be in the SPSS output file but in the SPSS data file in a column called Example 14a

Figure 3.92 SPSS solution P(X = 3) = 0.0270 P(X = 3 given binomial and n = 3, p = 0.35) = 0.042875. This agrees with the Excel solution illustrated in Figure 3.87. Calculate P(X ≤ 2) If you wanted to calculate P(X ≤ 2) then we can solve this as follows: P(X ≤ 2) = P(X = 0) + P(X = 1) + P(X = 2) P(X ≤ 2) = 0.274625 + 0.443625 + 0.238875 P(X ≤ 2) = 0.957125 Therefore, we have a 95.71% chance that we have at most 2 target hits out of the 3 attempts. Excel solution We can use Excel to solve this problem by using the Excel function: P(X ≤ 2) = BINOM.DIST(2, n, p, TRUE) P(X ≤ 2) = 0.957125 SPSS solution We can use SPSS to solve this problem as follows: Select Transform > Compute Variable Target Variable: Prob_X Numeric expression = CDF.BINOM(2, 3, 0.35)


Figure 3.93 Use compute variable to calculate P(X ≤ 2)

Figure 3.94 SPSS solution P(X ≤ 2) = 0.957125 Calculate P(X < 2) If you wanted to calculate the probability that X < 2 then we can solve this as follows: P(X < 2) = P(X = 0) + P(X = 1) P(X < 2) = 0.274625 + 0.443625 P(X < 2) = 0.71825 Therefore, we have a 71.83% chance that we have less than 2 target hits out of the 3 attempts. Excel solution We can use Excel to solve this problem by using the Excel function P(X < 2) = BINOM.DIST(1, n, p, TRUE) P(X < 2) = 0.71825 SPSS solution We can use SPSS to solve this problem as follows: Select Transform > Compute Variable Target Variable: Prob_Xa Numeric expression = CDF.BINOM(1, 3, 0.35)


Figure 3.95 Use compute variable to calculate P(X < 2)

Figure 3.96 SPSS solution P(X < 2) = 0.718250 Notes: 1. The mean of a binomial distribution is given by equation (3.22), mean = np = 3 × 0.35 = 1.05. 2. The variance of a binomial distribution is given by equation (3.23), variance = npq = 3 × 0.35 × 0.65 = 0.6825. 3. The standard deviation of a binomial distribution is the square root of the variance, standard deviation = 0.8261 to 4 decimal places. Useful Excel functions: 1. Binomial coefficients Cₓⁿ can be calculated using the Excel function =COMBIN(n, x). For example, COMBIN(3, 0) = 1, COMBIN(3, 1) = 3, COMBIN(3, 2) = 3 and COMBIN(3, 3) = 1. 2. Factorial values n! can be calculated using the Excel function =FACT(). For example, FACT(0) = 1, FACT(1) = 1, FACT(2) = 2, FACT(3) = 6, FACT(4) = 24, FACT(5) = 120, and so on.
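The binomial probabilities in Example 3.14 can also be reproduced in Python (a sketch, assuming the scipy library is available), mirroring Excel's BINOM.DIST and SPSS's PDF.BINOM/CDF.BINOM.

from scipy.stats import binom

n, p = 3, 0.35
print([binom.pmf(x, n, p) for x in range(4)])   # 0.274625, 0.443625, 0.238875, 0.042875
print(binom.cdf(2, n, p))                       # P(X <= 2) = 0.957125
print(binom.cdf(1, n, p))                       # P(X < 2)  = 0.718250
print(binom.mean(n, p), binom.var(n, p))        # 1.05 and 0.6825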

Normal approximation to the binomial distribution The normal distribution is generally considered to be a pretty good approximation for the binomial distribution when np ≥ 5 and n(1 – p) ≥ 5. For values of p close to 0.5, the number 5 on the right side of these inequalities may be reduced somewhat, while for more extreme values of p (especially for p < 0.1 or p > 0.9) the value 5 may need to be increased.

Check your understanding X3.16 Evaluate the following: (a) 𝐶13 , (b) 𝐶310 , (c) 𝐶02 . X3.17 A binomial model has n = 4 and p = 0.6. a. Find the probabilities of each of the five possible outcomes (i.e. P(0), P(1), …, P(4)). b. Construct a histogram of the data. Page | 228

X3.18 Attendance at a cinema has been analysed and shows that audiences consist of 60% men and 40% women for a film. If a random sample of six people were selected from the audience during a performance, find the following probabilities: (a) all women are selected; (b) three men are selected; and (c) less than three women are selected. X3.19 A quality control system selects a sample of three items from a production line. If one or more is defective, a second sample is taken (also of size 3), and if one or more of these are defective then the whole production line is stopped. Given that the probability of a defective item is 0.05, what is the probability that the second sample is taken? What is the probability that the production line is stopped? X3.20 Five people in seven voted in an election. If four of those on the roll are interviewed, what is the probability that at least three voted? X3.21 A small tourist resort has a weekend traffic problem and is considering whether to provide emergency services to help mitigate the congestion that results from an accident or breakdown. Past records show that the probability of a breakdown or an accident on any given day of a four-day weekend is 0.25. The cost to the community caused by congestion resulting from an accident or breakdown is as follows: a weekend with 1 accident day costs £20,000, a weekend with 2 accident days costs £30,000, a weekend with 3 accident days costs £60,000, and a weekend with 4 accident days costs £125,000. As part of its contingency planning, the resort needs to know: a. The probability that a weekend will have no accidents b. The probability that a weekend will have at least two accidents c. The expected cost that the community will have to bear for an average weekend period d. Whether to accept a tender from a private firm for emergency services of £20,000 for each weekend during the season.

Poisson probability distribution In the previous section, we explored the concept of the binomial distribution, a discrete probability distribution that enables the probability of achieving r successes from n independent experiments to be calculated. Each experiment (or event) has two possible outcomes (‘success’ or ‘failure’) and the probability of ‘success’ (p) is known. The Poisson distribution is a discrete probability distribution that enables the probability of x events occurring during a specified interval (time, distance, area, and volume) to be calculated if the average occurrence is known and the events are independent of the specified interval since the last event occurred. It has been usefully employed to describe probability functions of phenomena such as product demand, demand for service, numbers of accidents, numbers of traffic arrivals, and numbers of defects in various types of lengths or objects.


Like the binomial, it is used to describe a discrete random variable. With the binomial distribution, we have a sample of definite size and we know the number of successes and failures. There are situations, however, when to ask how many 'failures' would not make sense and/or the sample size is indeterminate. For example, if we watch a football match, we can report the number of goals scored but we cannot say how many were not scored. In such cases we are dealing with isolated cases in a continuum of space and time, where the number of experiments (n) and the probability of success (p) and failure (q) cannot be defined. What we can do is divide the interval (time, distance, area, volume) into very small sections and calculate the mean number of occurrences in the interval. This gives rise to the Poisson distribution defined by equation (3.24): P(X = x) =

λ^x × e^(−λ) / x!    (3.24)

Where:



P(X = x) is the probability of event x occurring. The symbol r represents the number of occurrences of an event and can take the value 0 → ∞ (infinity). x! is the factorial of r calculated using the Excel function =FACT().  (Greek letter lambda) is a positive real number that represents the expected number of occurrences for a given interval. For example, if we found that we had an average of 4 stitching errors in 1 metre of cloth, then for 2 metres of cloth we would expect the average number of errors to be  = 4 × 2 = 8. The symbol e represents the base of the natural logarithms (e = 2.71828…).

If we determine the mean and variance of a Poisson distribution, either using the frequency distribution or the probability distribution, we will find that the relationship is as given in equation (3.25):  = VAR(X)

(3.25)

The characteristics of a Poisson distribution are: 1. 2. 3. 4. 5.

The variance is equal to the mean. Events are discrete and randomly distributed in time and space. The mean number of events in each interval is constant. Events are independent. Two or more events cannot occur simultaneously.

Once it has been identified that the mean and variance have the same numerical value, ensure that the other conditions above are satisfied, and this will indicate that the sample data most likely follow the Poisson distribution. Example 3.15 An ice cream parlour estimates the footfall arriving in the parlour is 16 customers per hour.

Page | 230

a) Use equation (3.34) to calculate the probability that thirteen customers will arrive in one hour? Compare this answer to the Excel and SPSS solutions. b) Use Excel and SPSS to create the Poisson probability distribution for this distribution from X = 0 to X = 30. Comment on the shape of the distribution. Answer a) a) Use equation (3.34) to calculate the probability that thirteen customers will arrive in one hour? Compare this answer to the Excel and SPSS solutions. Equation (3.24) can be used to calculate the Poisson probability P(X = 13) given  = 16 per hour, x = 13. λ x e− λ P(X = x) = x! P(X = 13) =

1613 e− 16 13!

Use a calculate to show: • • •

1613 = 4.503599627  1015 13! = 6227020800 e-16 = 1.12535  10-7

Substituting these values gives P(X = 13) = 0.081389 Probability that 13 customers will arrive per hour is 0.081389 or 8.1%. Excel solution Figure 3.97 illustrates the Excel solution for P(X = 13).


Figure 3.97 Excel solution SPSS solution SPSS can be used to calculate P(X = 13) Select Transform > Compute Variable Target Variable: Example15a Numeric expression = PDF.POISSON(13, 16)

Figure 3.98 Use compute variable to calculate P(X = 13)

Figure 3.99 SPSS solution P(X = 13) = 0.081389 Answer b) a) Use Excel and SPSS to create the Poisson probability distribution for this distribution from X = 0 to X = 30. Comment on the shape of the distribution. Excel solution Thus, we can now determine the probability distribution using equation (3.24) as illustrated in Figure 3.100. Page | 232

Figure 3.100 Example 3.15 Excel solution Figure 3.101 illustrates Poisson probability plot for the variation with P(X = x) with individual values of X from 0 to 30. We can see from the graph that the distribution shape looks quite symmetric and looks like a normal probability distribution. Remember for the Poisson distribution: Poisson mean =  = 16 Poisson variance =  = 16 Poisson standard deviation = √16 = 4


Figure 3.101 Poisson distribution with λ = 16 customers per hour Normal approximation to the Poisson distribution Furthermore, the shape of the Poisson distribution above tells us that under certain conditions we can approximate a Poisson distribution with a normal distribution with population mean µ = λ and population variance σ² = λ, when λ is large (say, λ ≥ 20). SPSS solution Thus, we can now determine the probability distribution using equation (3.24) as illustrated in Figure 3.102. Enter data into SPSS


Figure 3.102 SPSS data file Select Transform > Compute Variable Target Variable: Prob Numeric expression = PDF.POISSON(X, 16)

Figure 3.103 Use compute variable to calculate P(X = x) Click Ok SPSS output The value will not be in the SPSS output file but in the SPSS data file in a column called Prob. Page | 235

Figure 3.104 SPSS solutions P(X = x) for X = 0 to 30 From this data we can ask SPSS to create a scatterplot for the P(X = x) and x as illustrated in Figure 3.105.

Figure 3.105 Poisson distribution with λ = 16 customers per hour What is a real difference between a binomial and Poisson distribution, and could they be used interchangeably? When students need to decide which distribution to use, they often get confused between the two. The binomial distribution should be used when we try to fit the distribution to n cases, each with a probability of success p. The Poisson

distribution is used when we have an infinite number of cases n, but there is a very, very small (infinitesimally small) probability of success p. To use mathematical language, if you start with a binomial distribution and you let n approach infinity (n → ∞) and p approach zero (p → 0), then the binomial distribution approaches a Poisson distribution with parameter λ. In practical terms, if n is large and you happen to know the mean value, you should use a Poisson distribution and not a binomial. In the case of a binomial distribution, you will know the probability of a case n happening, but you will not know the mean value.
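A short Python sketch (assuming the scipy library is available) reproduces Example 3.15 and illustrates the binomial-to-Poisson limit just described; the large n and tiny p below are illustrative values chosen so that np = 16.

from scipy.stats import poisson, binom

lam = 16
print(poisson.pmf(13, lam))                  # P(X = 13), about 0.081389 (Example 3.15)
print(poisson.mean(lam), poisson.var(lam))   # mean = variance = 16

n, p = 100_000, 16 / 100_000                 # large n, tiny p, with np = 16
print(binom.pmf(13, n, p))                   # very close to the Poisson probability above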

Check your understanding X3.22 Calculate P(0), P(1), P(2), P(3), P(4), P(5), P(6) and P(>6) for a Poisson variable with a mean of 1.2. Using this probability distribution determine the mean and variance. X3.23 In a machine shop the average number of machines out of operation is 2. Assuming a Poisson distribution for machines out of operation, calculate the probability that at any one time there will be: (a) exactly one machine out of operation; and (b) more than one machine out of operation. X3.24 A factory estimates that 0.25% of its production of small components is defective. These are sold in packets of 200. Calculate the percentage of the packets containing one or more defectives. X3.25 The average number of faults in a metre of cloth produced by a particular machine is 0.1. a. What is the probability that a length of 4 metres is free from faults? b. How long would a piece have to be before the probability that it contains no flaws is less than 0.95? X3.26 A garage has three cars available for daily hire. Calculate the following probabilities if the variable is Poisson with a mean of 2 cars hired per day: a. Find the probability that on a given day that exactly 0, 1, 2, and 3 cars will be hired. b. The charge to hire a car is £25 per day and the total outgoings per car, irrespective of whether it is hired, are £5 per day. Determine the expected daily profit from hiring these three cars. X3.27 Accidents occur in a factory randomly and on average at the rate of 2.6 per month. What is the probability that in each month (a) no accidents will occur, and (b) more than one accident will occur?

Chapter summary In this chapter, we introduced the concept of probability using the idea of relative frequency. Then we defined key terms such as experiment, sample space, and the Page | 237

relationship between a relative frequency distribution and a probability distribution. Further, we used the concept of relative frequency to introduce probability distribution and expectation. Then we covered continuous and discrete probability distributions and provided examples to illustrate the different types of continuous (normal, Student’s t, F distribution and chi square distribution) and discrete (binomial, Poisson) distributions. To start with we introduced the normal continuous probability distributions, and we showed the probability density functions that define the shapes of the distribution and some of the associated parameters. We described how, by knowing the mean and the variance of a distribution, we can determine the probability that a certain value is within a range. The inverse, where knowing the probability, we can work out the value of X, was also shown. The standard normal distribution was introduced, and the concept of Z-value was explained. We demonstrated how to calculate these values either from the statistical table or using Excel/SPSS. Several examples illustrated applications of how Z-values and the standard normal distribution can be used to answer business questions. And finally, a very useful test for testing for normality of the data was introduced and explained. Three further continuous distributions were introduced, namely, Student’s t distribution, the F distribution, and the chi-square distribution. All have specific uses when dealing with different types of samples and situations. Just like with the normal distributions, we showed examples of how to calculate certain parameters, given that we have some limited knowledge of every distribution. Finally, we introduced two discrete distributions, the binomial and Poisson distributions. Just like continuous distributions, they are designed to handle different scenarios, with the exception that they are specifically suited to discrete data. As before, for each distribution we showed how to execute calculations based on certain parameters that are known upfront. The material from this chapter will enable us to explore, in the following chapter, the concept of data sampling from normal and nonnormal population distributions and to introduce the central limit theorem. In Chapters 4–5, we will apply the central limit theorem to provide point and interval estimates to certain population parameters (mean, variance, proportion) based on sample parameters (sample mean, sample variance, sample proportion).

Test your understanding TU3.1 A gas boiler engineer is employed to assess customer gas boilers and to service, repair or replace boilers if required. The engineer records the reasons for failure as either an electrical fault, gas fault, or other fault. The current records collected by the engineer involving either electrical or gas faults are as shown in Table 3.10.

                           Electrical fault
                           Yes        No
Gas fault       Yes        53         11
                No         23         13

Table 3.10 Type of faults

Page | 238

a. Calculate the probability that failure involves an electrical fault given that it involves a gas fault.
b. Calculate the probability that failure involves a gas fault given that it involves an electrical fault.

TU3.2 Bakers Ltd is currently in the process of reviewing the credit line available to supermarkets, which they have defined as either a 'good' or a 'bad' risk. Based upon a £100,000 credit line, the profit is estimated to be £25,000 with a standard deviation of £5,000. Calculate the probability that the profit is greater than £28,000.

TU3.3 The quality controller at a sweet factory monitors a range of quality control variables associated with the production of a range of chocolate bars. Based upon current information, the probability that a chocolate bar is overweight is 0.02. At the end of a shift the quality controller samples 70 chocolate bars and rejects the shift's production if more than 4 bars are overweight. Calculate the probability that he will reject the output of chocolate bars created by this shift. Is there a problem with the production process if the factory manager considers rejection rates over 2% to be an issue?

TU3.4 A local DIY store receives an average of 4 complaints per day from customers. Calculate on a given day the following probabilities: (a) no complaints, (b) exactly one complaint, (c) exactly two complaints, and (d) more than 2 complaints.

TU3.5 The number of bookings at a local gym can be modelled using a Poisson distribution with a mean value of 1.8 per hour. Karen works for 3 hours between tea breaks and is surprised that there are no bookings during this 3-hour period. What is the probability that no bookings occur during Karen's shift? Should Karen be surprised?

TU3.6 A dentist estimates that 23% of patients will require reappointments for dental treatment. Over any given week the dentist handles 100 patients. What are the mean and standard deviation for the number of subsequent reappointments? Calculate the probability that the dentist will have to see more than 30 reappointments in any week.

TU3.7 The Skodel Ltd credit manager knows from experience that if the company accepts a 'good risk' applicant for a £60,000 loan the profit will be £15,000. If it accepts a 'bad risk' applicant, it will lose £6,000. If it rejects a 'bad risk' applicant nothing is gained or lost. If it rejects a 'good risk' applicant, it will lose £3,000 in goodwill.
a. Complete the profit and loss table (Table 3.11) for this situation.

                               DECISION
                               Accept        Reject
Type of Risk      Good
                  Bad

Table 3.11 Profit and loss table Page | 239

b. The credit manager assesses the probability that an applicant is a 'good risk' as 1/3 and a 'bad risk' as 2/3. What would be the expected profits for each of the two decisions? Consequently, what decision should be taken for the applicant?
c. Another manager independently assesses the same applicant to be four times as likely to be a bad risk as a good one. What should this manager decide?
d. Let the probability of being a good risk be x. What value of x would make the company indifferent between accepting and rejecting an applicant?

TU3.8 A local hospital records the average rate of patient arrival at its accident and emergency department during the weekend. Calculate the probability that a patient arrives less than 23 seconds after the previous patient, if the average rate is 0.45 patients per minute.

TU3.9 At a bottling plant the quantity of chutney that is bottled is expected to be normally distributed with a population mean of 36 oz and a standard deviation of 0.1 oz. Once every 30 minutes a bottle is selected from the production process and its content measured. If the amount goes below 35.8 oz or above 36.2 oz, then the bottling process will be out of control and the bottling stopped.
a. Calculate the probability that the process is out of control.
b. Calculate the probability that the number of bottles found out of control in 16 inspections will be zero.
c. Calculate the probability that the number of bottles found out of control in 16 inspections will be equal to 1.
d. Calculate the probability of the process being out of control if the analysis of the historical data shows that the population mean and standard deviation are actually 37 oz and 0.4 oz, respectively.

Want to learn more? The textbook online resource centre contains a range of documents to provide further information on the following topics: 1. A3Wa The probability laws – introduction to the concept of probability and probability laws. 2. A3Wb Probability distributions and approximations – explores how certain probability distributions can be used to approximate other probability distributions and introduces two other probability distributions that have business applications: the hypergeometric and exponential probability distributions. 3. A3Wc Other useful probability distributions – introduces a few other distributions, such as the hypergeometric discrete probability distribution and binomial approximation, as well as the exponential distribution.

Page | 240

Chapter 4 Sampling distributions 4.1 Introduction and learning objectives In Chapter 3 we introduced the concept of a probability distribution and learned about two distinct types of distribution: continuous and discrete. This was followed by the definition of the key probability distributions in the context of business applications: the normal distribution, Student’s t distribution, F distribution, chi-square distribution, binomial distribution, and finally the Poisson distribution. Furthermore, the online resource includes information on probability distribution approximations and an introduction to the exponential and hypergeometric probability distributions. In this chapter we will see that we can take numerous samples from a population and calculate their means, for example. The collection of these means will also create a distribution, just like the actual variable values create a distribution. These distributions of the mean, or proportion if this is the statistic that we are calculating, are called sampling distributions. We will specifically investigate the sampling distribution of the means and the sampling distribution of the proportions. The chapter begins with simulation, during which we generate numerous samples and calculate the mean values for every sample. We then show how these sample means are distributed to form the sampling distribution. The concept that follows shows that it is not necessary to draw numerous samples and that we can use some of the properties of the sampling distribution to infer the properties of the population based on one single sample. The middle section of the chapter shows that sampling distribution of the mean will always follow the normal distribution, even if we used data from a non-normal distribution to calculate multiple sample means. This unique property, called the central limit theorem, is a useful tool that we can then use to estimate any population, regardless of what distribution it follows, just based on the sample we took from this distribution. The chapter concludes by defining yet another sampling distribution, the sampling distribution of the proportion. Principles similar to those for the sampling distribution of the mean are used to infer the population proportions based on the sample proportions.

Learning objectives

On completing this chapter, you will be able to:
1. Distinguish between the concept of the population and sample
2. Recognise different types of sampling – probability (simple, systematic, stratified, cluster) and non-probability (judgement, quota, chunk, convenience)
3. Recognise reasons for sampling error – coverage error, non-response error, sampling error, measurement error
4. Understand the concept of a sampling distribution: mean and proportion
5. Understand sampling from a normal population
6. Understand sampling from a non-normal population – central limit theorem
7. Estimate an appropriate sample size given the confidence interval
8. Solve problems using Microsoft Excel and IBM SPSS software packages.

Page | 241

4.2 Introduction to sampling ‘Sampling’ is one of the key words in the title of this chapter, and intuitively we understand the word. In fact, we probably engage in sampling activities more often that we think we do. A good example is browsing through the television channels to find a programme we may wish to watch. Once we have seen a short sample of the programme that is currently on, we decide to stay on that channel. This process is very similar to the scientific approach of sampling. In our case, based on a few seconds of a programme (a sample), we draw conclusions about the whole programme (a population). In science, and practical research, the sample taken leads us to make conclusions about the whole population. This methodology is called making an inference about the population based upon the sample observations. There is nothing rigorous or scientific about watching a sample of a TV programme and making an inference, while in real life this inference must be based on some structured rules. But why do we even bother with sampling? In our TV case, the reason is that we do not have time to watch the whole programme and then decide that we will or will not like it. We are looking for a shortcut. In real life, with various research questions, the size of the population is such that it is impractical to measure all members of the population. In this situation a proportion of the population would be selected from the population of interest to the researcher. This proportion is achieved by sampling from the population and the proportion selected is called the sample. The first question we would like to illuminate is: what kinds of inference are we likely to make once we start working as professionals? Depending on your job there will be many research questions you will wish to answer that involve populations that are too large to measure every member of the population. How have the wages of German car workers changed over the past ten years? What are the management practices of foreign exchange bankers working in Paris? How many voters are planning to vote for a political party at a local election? These are all relevant topics, but the second question is: what formal methods do we use to collect the relevant data, and how do we go about making inference? This is really the heart of this chapter. It will teach you the formal procedures and the shortcuts that can be used to make inferences about the whole population. Some form of a survey instrument is the most common way of conducting the sampling, but could also be achieved by observation, archival record, or other methods. However, no matter what method is used to collect the data, the purpose is typically to make estimates of the population parameters. It is then crucial to determine how, and how well, the data set can be used to generalise the findings from the sample to the population. It is important to avoid data collection methods that maximise the associated errors. A bad sample may well render findings misleading or meaningless. Page | 242

Sampling in conjunction with survey research is one of the most popular approaches to data collection in business research. The concept of random sampling provides the foundational assumption that allows statistical hypothesis testing to be valid. The primary aim of sampling is to select a sample from the population that has the same characteristics as the population itself. For example, if the population average height of grown men between the ages of 20 and 50 is 176 cm, then the sample average height would also be expected to be 176 cm, unless we have sampling error (which will be discussed later in this chapter). Ideally, sample and population values should agree. To put it another way, we expect the sample to be representative of the population being measured. Before we describe the main sampling methods, we need to define the terminology we will use in this and later chapters. A few statements hold true in general when dealing with sampling:

• Samples are always drawn from a population. A sample should reflect the properties of the target population. Sometimes, for reasons of practicality or convenience, the sampled population is more restricted than the target population. In such cases, precautions must be taken to ensure that the conclusions only refer to the sampled population.
• It is common practice to divide the population into parts that are called sampling units. These units must cover the whole of the population and they must not overlap. In other words, every element in the population must belong to one and only one sampling unit. For example, in sampling the supermarket spending habits of people living in a town, the unit may be an individual or family or a group of individuals living in a postcode.
• Putting the sampling units into what is called a sampling frame is often one of the major practical problems. The frame is a list that contains the population you would like to measure. For example, market research firms will access local authority census data to create a sample. The list of registered students may be the sampling frame for a survey of the student body at a university. Practical problems can arise in the form of sampling frame bias. Telephone directories are often used as sampling frames, for instance, but many people have unlisted numbers, or perhaps they opted out of landlines and use mobile phones only.
• Samples can be collected using either probability sampling or non-probability sampling.

Types of sampling There are several different ways to create a sample. Samples can be divided into two types: probability samples and non-probability samples. Probability sampling The idea behind probability sampling is random selection. More specifically, each sample from the population of interest has a known probability of selection under a given sampling scheme. There are five kinds of probability sampling: simple random Page | 243

sampling; systematic random sampling; stratified random sampling; cluster sampling; and multi-stage sampling.

Simple random sampling

Simple random sampling is the most widely known type of random sampling. Every member of the population has the same probability of selection. A sample of size n from a population of size N is selected and every possible sample of size n has an equal chance of being drawn.

Example 4.1

Consider the task that a cinema chain is facing when selecting a random sample of 300 viewers who visited a local cinema during a given period. The researcher notes that the cinema chain would like to seek the views of its customers on a proposed refurbishment of the cinema. The total number of viewers within this period is 8,000. With a population of this size we could employ several ways of selecting an appropriate sample of 300. For example, we could place 8,000 consecutively numbered pieces of paper (1, …, 8,000) in a box, draw a number at random from the box, shake the box, and select another number to maximise the chances of the second pick being random, continuing the process until all 300 numbers are selected. These numbers would then be used to select a viewer purchasing the cinema ticket, with the customer chosen based on the number selected from the random process. To maximise the chances that customers selected would agree to complete the survey we could enter them into a prize draw. These 300 customers will form our sample, with each number in the selection having the same probability of being chosen. When collecting data via random sampling it is generally difficult to devise a selection scheme to guarantee that we have a random sample. For example, the selection from a population might not be from the total population that you wish to measure. Alternatively, during the time interval when the survey is conducted, we may find that the customers sampled may be unrepresentative of the population due to unforeseen circumstances.

Systematic random sampling

With systematic random sampling, we create a list of every member of the population. From this list, we will sample 200 numbers from the population of 8,000 number values. From the list, we choose an initial sample value and then select every (N/n)th sample value. The term (N/n) is called the sampling interval or sampling fraction (f). This method involves choosing the nth element from a population list as follows:

1. Divide the number of cases in the population by the desired sample size (f = N/n = 8000/200 = 40).
2. Select a random number between 1 and the value found in step 1. For example, we could pick the number 23 (x1 = 23). This tells us that the first number chosen from the list is the 23rd number value.
3. We start with the sampling number value chosen in step 2 (23rd number) and the sampling fraction f = 8,000/200 = 40. This tells us to sample the 23rd number, Page | 244

63rd number, 103rd number, 143rd number, and so on until you have chosen 200 sample points. The kth sample value is given by the formula: kth sample value = first sample value + sampling fraction × (k − 1). Therefore, the 200th sample value would be the 23 + 40 × (200 − 1) = 7,983rd number value. Systematic sampling compared to simple random sampling has some advantages. It is easier to draw the sample from the population. We also avoid potential bias and clustering of members of the population, which could happen unintentionally. The disadvantage is that the sample points are not all equally likely to be selected.

Stratified random sampling

The procedure for stratified random sampling is to divide the population into two or more groups (subpopulations or strata). This could be by age, region, or some other research area of interest. Each stratum must be mutually exclusive (i.e., nonoverlapping), and together the strata must include the entire population. A random sample is then drawn from each group. As an example, suppose we conduct a national survey in England. We might divide the population into groups (or strata) based on the counties in England. Then we would randomly select from each group (or stratum). The advantage of this method is that it guarantees that every group within the population is selected and provides an opportunity to carry out group comparisons. Stratified random sampling nearly always results in a smaller variance for the estimated mean or other population parameters of interest. However, the main disadvantage of a stratified sample is that it may be costlier to collect and process the data compared to a simple random sample. Two different categories of stratified random sampling are available, as follows:



• Proportionate stratification. With proportionate stratification, the sample size of each stratum is proportionate to the population size of the stratum (same sampling fraction). The method provides greater precision than simple random sampling with the same sample size, and this precision is better when dealing with characteristics that are the same between strata (a short numerical sketch follows this list).
• Disproportionate stratification. With disproportionate stratification, the sampling fraction may vary from one stratum to the next. If differences are explored in the characteristics being measured between strata, then disproportionate stratification can provide better precision than proportionate stratification. In general, given similar costs, you would always choose proportionate stratification.
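The following minimal Python sketch is our own illustration of proportionate stratification; the strata names and sizes are hypothetical and are not taken from the text. Each stratum receives a share of the total sample proportional to its share of the population, that is, the same sampling fraction n/N is applied in every stratum.

# Proportionate allocation: n_h = n * N_h / N for each stratum h
stratum_sizes = {"North": 12000, "Midlands": 18000, "South": 25000, "East": 5000}  # hypothetical N_h
n = 600                                   # total sample size to allocate
N = sum(stratum_sizes.values())           # total population size

for stratum, N_h in stratum_sizes.items():
    n_h = round(n * N_h / N)              # stratum sample size under proportionate stratification
    print(f"{stratum:8s}: population {N_h:6d} -> sample {n_h}")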

Example 4.2 Consider the task where we wish to sample the views of graduate job applicants to a major financial institution. The nature of this survey is to collect data on the application process from the applicants’ perspective. The survey will therefore have to collect the views from the different specified groups within the identified population. For example, Page | 245

this could be based on gender, race, type of employment requested (full- or part-time), or whether an applicant is classified as disabled. Remember that a simple random sample might fail to obtain a representative sample from one of these groups. This could happen due, for example, to the size of the group relative to the population. This is the reason why we would employ stratified random sampling. We want to ensure that appropriate numbers of sample values are drawn from each group in proportion to the percentage of the population. Stratified sampling offers several advantages over simple random sampling: it guards against an unrepresentative sample (e.g., all-male samples from a predominately female population); it provides sufficient group data for separate group analysis; it requires a smaller sample; and it can achieve greater precision compared to simple random sampling for a sample of the same size. Cluster sampling Cluster sampling is a sampling technique in which the entire population of interest is divided into groups, or clusters, and a random sample of these clusters is selected. Each cluster must be mutually exclusive, and together the clusters must include the entire population. Once clusters are chosen, then all data points within the chosen clusters are selected. No data points from non-selected clusters are included in the sample. This differs from stratified sampling, in which some data values are selected from each group. When all the data values within a cluster are selected, the technique is referred to as ‘one-stage cluster sampling’. If a subset of units is selected randomly from each selected cluster, it is called ‘two-stage cluster sampling’. Cluster sampling can also be made in three or more stages: it is then referred to as ‘multistage cluster sampling’. The main reason for using cluster sampling is that it is usually much cheaper and more convenient to sample the population in clusters rather than randomly. In some cases, constructing a sampling frame that identifies every population element is too expensive or impossible. Cluster sampling can also reduce costs when the population elements are scattered over a wide area. Multistage sampling With multistage sampling (not to be confused with multistage cluster sampling), we select a sample by using combinations of different sampling methods. For example, in stage 1, we might use cluster sampling to choose clusters from a population. Then, in stage 2, we might use simple random sampling to select a subset of elements from each chosen cluster for the final sample. Non-probability sampling In many situations, it is not possible to select the kinds of probability samples used in large-scale surveys. For example, we may be required to seek the views of local familyrun businesses that are experiencing financial difficulties. In this situation, there are no easily accessible lists of businesses experiencing difficulties, or there may never be a list Page | 246

created or available. The question of obtaining a sample in this situation is achievable by using non-probability sampling methods. The two primary types of non-probability sampling methods are convenience sampling and purposive sampling. Convenience sampling Convenience sampling is a method of choosing subjects who are available or easy to find. This method is also sometimes referred to as haphazard, accidental, or availability sampling. The primary advantage of the method is that it is very easy to carry out, relative to other methods. Problems can occur with this survey method in that you can never guarantee that the sample is representative of the population. Convenience sampling is a popular method with researchers and provides some data that can analysed, but the type of statistics that can be applied to the data is compromised by uncertainties over the nature of the population that the survey data represent. Example 4.3 When a student researcher is eager to begin conducting research with people as subjects but may not have a large budget or the time and resources that would allow for the creation of a large, random sample, they may choose to use the technique of convenience sampling. This could mean stopping people as they enter and leave a supermarket, or surveying other students, or others to whom the researcher has regular access. For example, suppose that a business researcher is interested in studying online shopping habits among university students. The researcher is enrolled on a course and decides to give out surveys during class for other students to complete and hand in. This is an example of a convenience sample because the researcher is using subjects who are convenient and readily available. In just a few minutes the researcher can conduct an experiment with possibly a large research sample, given that introductory courses at universities can have as many as several hundreds of students enrolled. Purposive sampling Purposive sampling is a sampling method in which elements are chosen based on the purpose of the study. Purposive sampling may involve studying the entire population of some limited group (the accounts department at a local engineering firm) or a subset of a population (chartered accountants). As with other non-probability sampling methods, purposive sampling does not produce a sample that is representative of a larger population, but it can be exactly what is needed in some cases – a study of an organisation, community, or some other clearly defined and relatively limited group. Examples of two popular purposive sampling methods include: quota sampling and snowball sampling. Quota sampling Quota sampling is designed to overcome the most obvious flaw of convenience sampling. Rather than taking just anyone, you set quotas to ensure that the sample you get represents certain characteristics in proportion to their prevalence in the Page | 247

population. Note that for this method, you must know something about the characteristics of the population ahead of time. There are two types of quota sampling: proportional and non-proportional.

• In proportional quota sampling you want to represent the major characteristics of the population by sampling a proportional amount of each. For instance, if you know the population has 25% women and 75% men, and that you want a total sample size of 400, you will continue sampling until you get those percentages and then you will stop. So, if you've already got the 100 women for your sample, but not the 300 men, you will continue to sample men; if legitimate women respondents come along, you will not sample them because you have already 'met your quota'. The primary problem with this form of sampling is that even when we know that a quota sample is representative of the characteristics for which quotas have been set, we have no way of knowing if the sample is representative in terms of any other characteristics. If we set quotas for age, we are likely to attain a sample with good representativeness on age, but one that may not be very representative in terms of gender, education, or other pertinent factors.
• In non-proportional quota sampling you specify the minimum number of sampled data points you want in each category. In this case you are concerned not with having the correct proportions but with achieving the numbers in each category. This method is the non-probabilistic analogue of stratified random sampling in that it is typically used to ensure that smaller groups are adequately represented in your sample.

Finally, researchers often introduce bias when allowed autonomy in selecting respondents, which is usually the case in this form of survey research. In choosing males, interviewers are more likely to choose those who are better-dressed, seem more approachable and less threatening. That may be understandable from a practical point of view, but it introduces bias into research findings. Snowball sampling In snowball sampling, you begin by identifying someone who meets the criteria for inclusion in your study. You then ask them to recommend others they may know who also meet the criteria. Thus, the sample group appears to grow like a rolling snowball. This sampling technique is often used in hidden populations which are difficult for researchers to access, including firms with financial difficulties, or students struggling with their studies. The method creates a sample with questionable representativeness, and it can be difficult to judge how a sample compares to a larger population. Furthermore, an issue arises in how respondents choose others to refer you to; for example, friends will refer you to friends but are less likely to refer you to those they don't consider as friends, for whatever reason. This creates a further bias within the sample that makes it difficult to say anything about the population. The primary difference between probability methods of sampling and non-probability methods is that in the latter you do not know the likelihood that any element of a population will be selected for study. Page | 248

Types of errors

In this chapter we are concerned with sampling from populations using probability sampling methods, which are the prerequisite for the application of statistical tests. However, if we base our decisions on a sample, rather than the whole population, by definition, we are going to make some errors. The concept of sampling implies, therefore, that we'll also have to deal with several types of errors, including sampling error, coverage error, measurement error, and non-response error. Let us just briefly define these terms:

• Sampling error is the calculated statistical imprecision due to surveying a random sample instead of the entire population. The margin of error provides an estimate of how much the results of the sample may differ due to chance when compared to what would have been found if the entire population were interviewed.
• Coverage error is associated with the inability to contact portions of the population. Telephone surveys usually exclude people who do not have access to a landline telephone in their homes. They will also miss people who are not at home (e.g., at work, in prison, or on holiday).
• Measurement error is error, or bias, that occurs when surveys do not measure what they intended to measure. This type of error results from flaws in the measuring instrument (e.g. question wording, question order, interviewer error, timing, and question response options). This is the most common type of error faced by the polling industry.
• Non-response error results from not being able to interview people who would be eligible to take the survey. Many households use telephone answering machines and caller identification that prevent easy contact, or people may simply not want to respond to calls. Non-response bias is the difference in responses of those people who complete the survey against those who refuse to for any reason. While the error itself cannot be calculated, response rates can be calculated by dividing the number of responses by the number invited to respond.

Now that we understand what types of samples are possible, we will focus on the key type of sample: samples that have been randomly selected. We will explore the statistical techniques that can be applied to randomly selected data sets.

Check your understanding

X4.1 Compare random sampling and systematic (non-sampling) errors.
X4.2 Explain the reasons for taking a sample rather than a complete census.
X4.3 Name and describe the types of non-probability sampling.
X4.4 Name and describe the types of probability sampling.
X4.5 List the stages in the selection of a sample.

Page | 249

4.3 Sampling from a population So, we said that when we wish to know something about a population it is usually impractical, especially when considering large populations, to collect data from every unit of that population. It is more efficient to collect data from a sample of the population under study and then to make estimates of the population parameters from the sample. Essentially, we make generalisations about a population based on a sample.

Population versus sample

To describe the difference between a population and a sample, we reiterate the two terms as follows:

• A population is a complete set of counts or measurements derived from all objects possessing one or more common characteristics. For example, if you want to know the average height of the residents of London, that is your population: the residents of London. Measures such as means and standard deviations derived from the population data are known as population parameters.
• A sample is a set of data collected by a defined procedure from a proportion of the population under study. Measures such as means and standard deviations derived from samples are known as sample statistics or estimates.

In this section we will explore the concept of taking a sample from a population and use this sample to provide population estimates for the mean, standard deviation and proportion. The method of using samples to estimate population parameters is known as statistical inference. Statistical inference draws upon the probability results discussed in previous chapters. To distinguish between population parameters and sample statistics, very often the symbols presented in Table 4.1 are used (the symbols µ, σ, π, ρ are the Greek symbols 'mu', 'sigma', 'pi' and 'rho', respectively).

Parameter             Population    Sample
Size                  N             n
Mean                  µ             x̄
Standard deviation    σ             s
Proportion            π             p

Table 4.1 Symbols employed to differentiate between population and sample

One of the easiest ways to generate a random sample from a sampling distribution is to use Excel. Excel can be used to generate random samples from a range of probability distributions, including normal, binomial, and Poisson distributions. To generate a random sample, select Data > Data Analysis > Random Number Generation (Figure 4.1).

Page | 250

Figure 4.1 Excel Data Analysis menu

Click OK.

Provide the information required in the dialog box that appears (Figure 4.2):
• Input the number of variables (or samples)
• Input the number of data values in each sample
• Select the distribution
• Input the distribution parameters, e.g. for normal: µ, σ
• Decide on where the results should appear (Output range).

Figure 4.2 Excel random number generator Click OK. Example 4.4 Consider the problem of sampling from a population which consists of the salaries of public sector employees employed by a national government. The historical data suggest that the population data are normally distributed with mean £45,000 and standard deviation £1,000. We can use Excel to generate N random samples, with each sample containing n data values. Page | 251

a. Create 10 random samples each with 1000 data points.
b. Calculate the mean for each random sample.
c. Plot the histogram representing the sampling distribution for the sample mean.

Excel solution

a. Generate N = 10 samples (variables) with n = 1000 data values.

Select Data > Data Analysis > Random Number Generation (Figure 4.3).
• Inputs:
• Number of Variables = 10
• Number of Random Numbers = 1000
• Distribution: Normal distribution
• Parameter Mean, µ = 45000
• Parameter Standard deviation, σ = 1000
• Output range: cell B5.

Figure 4.3 Example 4.4 Excel random number generator Click OK. The N samples are in the columns of the table of values (sample 1, B5:B1004; sample 2, C5:C1004; …; sample 10, K5:K1004). b. Calculate N sample means.

Page | 252

Now, we can create 1000 mean values by assuming that each row across the 10 samples represents a sample of size 10. Figure 4.4 illustrates the first four samples and sample means.

Figure 4.4 Calculate the first 4 row means (sample means in cells L5:L1004) c. Create histogram bins and plot a histogram of sample means. We note that from the spreadsheet the smallest and largest sample means are 43,917.62 and 46,045.67 , respectively. Based upon these two values, we then determine the histogram bin range as 43,900 to 46,300 with step size 600 (43,900, 44,500, … , 46,300), as illustrated in Figure 4.5.

Figure 4.5 Creation of the histogram based upon the 1000 row means To create the histogram, select Data > Data Analysis > Histogram and select values as illustrated in Figure 4.6: Input Range: L4:L1004 Bin Range: N13:N18 Click on Labels Output Range: P9.

Page | 253

Figure 4.6 Excel histogram menu Click OK. Figures 4.7 and 4.8 illustrate the frequency distribution and corresponding histogram.

Figure 4.7 Frequency distribution Now, use Excel to create a histogram for the frequency against Bin as illustrated in Figure 4.8. Highlight data cells N20:O25. Select Insert > Insert Column or Bar Chart > 2-D Column (option 1). Now, edit the bar chart to remove bin value, add title, axes titles, and reduce bar gap to zero to give the histogram illustrated in Figure 4.8. Page | 254

Figure 4.8 Histogram The histogram shows a normal distribution for the sample means. The overall mean value of all 1000 sample means is £45,005 with a standard deviation of £329. From the histogram we note that the histogram values are centred about the population mean value of £45,000. If we repeated this exercise for different values of sample size n, we would find that the range would reduce as the sample sizes increase. If you don’t like the Excel method for generating a histogram described above, then you could just plot frequency against the sample means as illustrated in Figure 4.9.

Figure 4.9 Plot of frequency against the sample means
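For readers who prefer to work outside Excel, the sketch below is our own Python equivalent of Example 4.4 (it assumes the numpy library is installed): it draws 10 columns of 1,000 normally distributed values, treats each row as a sample of size 10, and summarises the 1,000 row means.

import numpy as np

rng = np.random.default_rng(seed=1)                        # seeded only so the run is repeatable
data = rng.normal(loc=45000, scale=1000, size=(1000, 10))  # 1000 rows x 10 columns

row_means = data.mean(axis=1)                              # one mean per row (sample of size 10)
print("mean of the 1000 sample means:", round(row_means.mean(), 2))
print("std. dev. of the sample means:", round(row_means.std(ddof=1), 2))
# Expect a mean near 45,000 and a spread near 1000/sqrt(10), about 316,
# consistent with the Excel (329) and SPSS (318) values reported in the text.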

Page | 255

SPSS solution We can use SPSS to re-create the sampling distribution for this example. a. Create the variables X1, X2, …, X10. Name the first column X1. Enter a number (any) in the 1000th cell of the first column to define the variable size (i.e., the size of the sample). If you have a problem with selecting the 1000th value, then create the 1000 case data values in Excel and copy to the SPSS data file as illustrated in Figure 4.10.

Figure 4.10 Enter ID values into SPSS Now enter any number into the 1000th cell for X1

Figure 4.11 Add any number into ID = 1000

Now create the sample values in column X1.

Transform > Compute Variable
Target Variable: X1.
Enter in Numeric Expression: RV.NORMAL(45000, 1000)

Figure 4.12 Use compute variable to randomly select from the normal population Click OK. SPSS will now carry out the calculation and store the result in the data file under the column labelled X1 (the first 10 of the 1000 values are shown in Figure 4.13).

Page | 256

Figure 4.13 First 10 values for the first sample Since we want 10 samples, we need to repeat this calculation for X2, …, X10. The first 10 values for the first five samples are shown in Figure 4.14.

Figure 4.14 First 10 values for the first 5 samples

b. Calculate the sample mean for each of the 1000 rows (each row is a sample of size 10).

Now, calculate the mean values for each of the rows to create 1000 row sample means of size 10.

Transform > Compute Variable
Target Variable: Xbar.
Enter in Numeric Expression: MEAN(X1,X2,X3,X4,X5,X6,X7,X8,X9,X10).

Figure 4.15 Calculate the row sample means The first three average values are presented in Figure 4.16.

Figure 4.16 The first 3 row means Page | 257

c. Calculate the overall Xbar mean and standard deviation and construct the histogram. Analyze > Descriptive Statistics > Frequencies. Transfer Xbar to the Variables box. Uncheck Display frequency tables (ignore the warning). Click on Charts and select Histogram. Click Continue. Click on Statistics and select Mean, Std. deviation. Click Continue. Click OK

Figure 4.17 SPSS solution representing the mean of the 1000 row means Summary statistics are shown in Figure 4.17. The overall mean value of all 1000 sample means gives a mean of £45,007 with a standard deviation of £318. The histogram (Figure 4.18) shows a fairly normal distribution for the sample means.

Figure 4.18 Histogram for the sampled data From Excel, the overall sample average of the means is £45,005 with a standard deviation of £329. From SPSS, the overall sample average of the means is £45,007 with a standard deviation of £318. We observe that the average of the sample means from Excel and SPSS is approximately equal to the population mean that we sampled from, and which we know is £45,000. Furthermore, the standard deviation of these sample means is £329 and £318 from Excel and SPSS respectively. What is important to observe is that these values are much smaller than the standard deviation of the population we Page | 258

sampled from, which is equal to £1,000. Why the difference? This will be explained soon when we talk about the Central Limit Theorem. In this section we have generated several random samples and then calculated the mean value for every random sample. This has enabled us to estimate the true mean of the overall population. However, in practice we do not have to do that. The central limit theorem provides us with a shortcut. What we noticed in the previous example was that when we gather the mean values calculated from numerous random samples, they begin to follow the normal distribution. In other words, the mean values from multiple samples are creating a sampling distribution. We could have calculated any other statistic for each one of these random samples, and we would see that they also form their own sampling distributions. This brings us to the key point. Any variable can be distributed in a number of different ways, and many, though not all, of them follow the normal distribution. Each one of these distributions that depict a variable is defined by certain parameters, such as the mean and standard deviation. If we collected a large number of samples from that one distribution and calculated their statistics (such as the mean or the standard deviation), then these statistics will also create a distribution. We call this the sampling distribution. The two sampling distributions that we will explore are the sampling distribution of the mean and the sampling distribution of the proportion.

Sampling distribution of the mean

In this section we will continue to explore what is understood by the phrase sampling distribution of the mean. Our Example 4.4 already illustrated that we can take random samples from a population and calculate the mean value for every sample. If we generate a large number of samples and have calculated the mean value for every sample, then these mean values will also form a distribution. This is what we mean by the sampling distribution of the mean. What is important here is that the mean of all the sample means has some interesting properties. It is identical to the overall population mean. A sample mean is called an unbiased estimator since the mean of all sample means of size n selected from the population is equal to the population mean, µ.

Example 4.5

To illustrate this property, consider the problem of tossing a fair die. We know that the die has 6 numbers (1, 2, 3, 4, 5, 6), with each number likely to have the same frequency of occurrence. As an example, we can then take all possible samples of size 2 with replacement from this population. Let us illustrate two important results of the sampling distribution of the sample means using this example. To refresh our memory, the population mean and population standard deviation are calculated using equations (4.1) and (4.2), respectively:

\mu = \frac{\sum X}{n} \qquad (4.1)

\sigma = \sqrt{\frac{\sum X^2}{n} - \mu^2} \qquad (4.2)

Page | 259

The sample mean is calculated exactly the same way as in equation (4.1), but if we have grouped data, then we can use equation (4.3):

\bar{X} = \frac{\sum f x}{\sum f} \qquad (4.3)

If we take several samples and for every sample calculate a sample mean, we'll end up with a number of sample means. The mean of all these sample means (𝑋̿) is calculated using equation (4.4):

\bar{\bar{X}} = \frac{\sum f \bar{X}}{\sum f} \qquad (4.4)

Can we calculate the standard deviation of the sample means around their mean? If you understood the question (read it again!), the answer is: yes. The standard deviation of the means around their central mean is called the standard error of the sample means. Effectively, the standard error of the sample means measures the standard deviation of all sample means from the overall mean. Equation (4.5) shows the formula for the standard error (sometimes the second part of the phrase, 'of the sample means', is dropped):

\sigma_{\bar{X}} = \sqrt{\frac{\sum f \bar{X}^2}{\sum f} - \bar{\bar{X}}^2} \qquad (4.5)

Equation (4.5) can be re-written as:

\sigma_{\bar{X}} = \sqrt{\frac{\sum (\bar{x} - \bar{\bar{x}})^2}{n - 1}} \qquad (4.6)

Remember, this equation applies only if we have a very large population and draw numerous random cases to calculate their respective means 𝑥̅. From the population data values (1, 2, 3, 4, 5, 6) we can calculate the population mean and standard deviation using equations (4.1) and (4.2), together with Table 4.2.

Die value, X      X²
1                 1
2                 4
3                 9
4                 16
5                 25
6                 36
N = 6
∑X = 21           ∑X² = 91
Mean µ = 3.5
Population standard deviation σ = 1.7078

Table 4.2 Calculation table for population mean and standard deviation

Using Table 4.2: Page | 260

\mu = \frac{\sum X}{n} = \frac{21}{6} = 3.5

\sigma = \sqrt{\frac{\sum X^2}{n} - \mu^2} = \sqrt{\frac{91}{6} - 3.5^2} = 1.7078

Let us now take all possible samples of size 2 (n = 2) with replacement from this population. From a population of size N = 6, there are 36 possible samples of size n = 2. Table 4.3 illustrates the results with pairs such as (1, 2) and (2, 1) combined to give a frequency of 2, for example.

ID    Value 1    Value 2    Mean, X̄    f    fX̄     fX̄²
1     1          1          1          1    1       1
2     1          2          1.5        2    3       4.5
3     1          3          2          2    4       8
4     1          4          2.5        2    5       12.5
5     1          5          3          2    6       18
6     1          6          3.5        2    7       24.5
7     2          2          2          1    2       4
8     2          3          2.5        2    5       12.5
9     2          4          3          2    6       18
10    2          5          3.5        2    7       24.5
11    2          6          4          2    8       32
12    3          3          3          1    3       9
13    3          4          3.5        2    7       24.5
14    3          5          4          2    8       32
15    3          6          4.5        2    9       40.5
16    4          4          4          1    4       16
17    4          5          4.5        2    9       40.5
18    4          6          5          2    10      50
19    5          5          5          1    5       25
20    5          6          5.5        2    11      60.5
21    6          6          6          1    6       36

Table 4.3 Calculation for sample mean and standard deviation

We can calculate the mean of these sample means and the corresponding standard deviation of the sample means using the Table 4.3 frequency distribution. For example, for sample pair (2, 6) the sample mean is equal to 4. The pair (2, 6) occurs twice, given we can have (2, 6) or (6, 2). From Table 4.3, recalling from equation (4.4) that the mean of the sample means is denoted by 𝑋̿, and denoting the standard deviation of the means by 𝜎𝑋̅, we have:

\sum f = 36

Page | 261

\sum f \bar{X} = 126

\sum f \bar{X}^2 = 493.5

\bar{\bar{X}} = \frac{\sum f \bar{X}}{\sum f} = \frac{126}{36} = 3.5

\sigma_{\bar{X}} = \sqrt{\frac{\sum f \bar{X}^2}{\sum f} - \bar{\bar{X}}^2} = \sqrt{\frac{493.5}{36} - 3.5^2} = 1.2076

The mean of the sample means is equal to 3.5. We already stated that the mean of the sample means is an unbiased estimator of the population mean, which means that:

\bar{\bar{X}} = \mu \qquad (4.7)

The standard deviation of the sample means, denoted by 𝜎𝑋̅, is equal to 1.2076. Recall, however, that the population standard deviation was 1.7078. If we take the value of σ = 1.7078 and divide it by √𝑛 (which is √2), we get 1.2076. First, we see that the standard deviation of the sample means is not equal to the population standard deviation (𝜎𝑋̅ < σ). In fact, the standard deviation of the sample means (or, as we will call it from now on, the standard error of the sample means) is a biased estimate of the population standard deviation. Secondly, it can be shown that the relationship between sample and population is given by equation (4.8):

\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} \qquad (4.8)

We have shown this in our example:

\frac{\sigma}{\sqrt{n}} = \frac{1.7078}{\sqrt{2}} = 1.2076 = \sigma_{\bar{X}}

Although the full name for equation (4.8) is the standard error of the sample means, more often it is called just the standard error. From equation (4.8) we observe that as n increases, the value of the standard error approaches zero (𝜎𝑋̅ → 0). In other words, as n increases, the spread of the sample means decreases to zero. This means that the sample mean 𝑋̅ is approaching the true value of µ. Make sure you understand the difference between the standard error and the standard deviation:
1. The standard deviation measures how much individual values are scattered around their mean (in either a sample or in the population).
2. The standard error measures how much sample means are scattered around the overall mean; in other words, how representative our sample mean is when compared to the true mean value.

Page | 262

Remember: the standard deviation is a descriptive statistic and the standard error is an inferential statistic. The standard error of the mean is effectively the standard deviation of a number of sample means around their overall mean. Excel solution Figures 4.19 and 4.20 illustrate the Excel solution.

Figure 4.19 Example 4.5 Excel solution The formulae in Figure 4.19 use two different methods to calculate the mean and standard deviation in Excel. One method is using Excel functions =AVERAGE() and =STDEV(), and the other method shows manual calculations. Now let us look at all possible samples of 2 as illustrated in Figure 4.20.

Page | 263

Figure 4.20 Example 4.5 Excel solution continued SPSS solution No built-in SPSS solution.
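Since there is no built-in SPSS routine for this enumeration, the short Python sketch below (our own addition, using only the standard library) verifies the two results of Example 4.5: the mean of the 36 sample means equals the population mean 3.5, and their standard deviation equals σ/√n = 1.2076.

from itertools import product
import statistics as st

faces = [1, 2, 3, 4, 5, 6]
sample_means = [(a + b) / 2 for a, b in product(faces, repeat=2)]   # all 36 ordered samples of size 2

pop_sd = st.pstdev(faces)                                            # population sigma = 1.7078

print("mean of sample means  :", st.mean(sample_means))              # 3.5
print("std error (enumerated):", round(st.pstdev(sample_means), 4))  # 1.2076
print("sigma / sqrt(2)       :", round(pop_sd / 2 ** 0.5, 4))        # 1.2076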

Sampling from a normal population If we select a random sample X from a population that is normally distributed with population mean µ and standard deviation σ, then we can state this relationship using the notation X ~ N(µ, σ2). Figure 4.21 shows the distribution of X.

Figure 4.21 Normal distribution X ~ N(µ, σ²)

If we choose a number of samples from a normal population then we can show that the sample means are also normally distributed with a mean of µ and a standard deviation Page | 264

of the sample mean given by equation (4.8), where n is the sample size on which the sampling distribution was based. Figure 4.22 shows the distribution of 𝑋̅.

Figure 4.22 Normal distribution for sample means 𝑋̅ ~ N(µ, σ²/n)

Example 4.6

We return to Example 4.4 with 40,000 random samples from a population that is assumed to be normally distributed with mean £45,000 and standard deviation £10,000. The population values are based on 40,000 data points and the sampling distribution is illustrated in Figure 4.23. We observe from Figure 4.23 that the population data are approximately normal.

Figure 4.23 Shape of the histogram when n = 40 Figures 4.24-4.27 show the sampling distributions of the mean where n = 2, 5, 10, and 40 respectively. From Figures 4.24–4.27 we observe that all the sampling distributions of the mean are approximately normal, but the shape changes depending on the size n. We observe that the sample means are less spread out about the mean as the sample sizes increase. Page | 265

Figure 4.24 Shape of histogram when n = 2

Figure 4.25 Shape of histogram when n = 5

Page | 266

Figure 4.26 Shape of histogram when n = 10

Figure 4.27 Shape of histogram when n = 40

From these observations we conclude that if we sample from a population that is normally distributed with mean µ and standard deviation σ (X ~ N(µ, σ²)), then the sample means are also normally distributed with mean µ and standard deviation of the sample means of 𝜎𝑋̅ = σ/√n. This relationship is represented by equation (4.9):

Page | 267

\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right) \quad \text{or} \quad \bar{X} \sim N\left(\mu, \left(\frac{\sigma}{\sqrt{n}}\right)^2\right) \qquad (4.9)

Now we know that the sample mean is normally distributed, we can solve a range of problems using the method that will be described in Chapters 5–6. Before we proceed, let us remind ourselves of equation (3.6), which defines the standardised values for the normal distribution. Just as equation (3.6) enabled us to convert all the values of X into Z, if we have a distribution of sample means 𝑋̅ we can use a similar equation to convert every value of 𝑋̅ to Z. The standardised sample mean Z value is given by equation (4.10):

Z = \frac{\bar{X} - \mu}{\sigma_{\bar{X}}} = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \qquad (4.10)

Equation (3.6) shows how to convert any x value into the Z-score. However, in this chapter we are dealing with the distribution of the means, and therefore, from equation (3.6), X becomes 𝑋̅, µ remains the same and σ becomes 𝜎𝑋̅, which is the standard error of the means, SE or 𝜎𝑋̅ = σ/√n. This is how equation (3.6) becomes (4.10). Why is this standardised sample mean Z value so important? Because this is the shortcut that will enable us to make estimates about the population without a need to draw numerous random samples to estimate the true parameters of the population. We will be able to take one sample and make inference about the whole population.

Example 4.7

Diet X runs several weight reduction centres in a large town in the North East of England. From historical data it was found that the weight of participants is normally distributed with a mean of 180 pounds and a standard deviation of 30 pounds, X ~ N(180, 30²). Calculate the probability that the average sample weight is greater than 189 pounds when 25 participants are randomly selected for the sample. Given X ~ N(µ, σ²) = N(180, 30²), calculate P(sample mean > 189) when we have a randomly chosen sample of size n = 25. From the central limit theorem, we have:

\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right) = N\left(180, \frac{30^2}{25}\right) = N\left(180, \left(\frac{30}{\sqrt{25}}\right)^2\right) = N(180, 6^2)

The central limit theorem is a 'rule' which says that the means calculated from a large number of samples, taken from a large population, will be approximately normally distributed, even if the population itself does not follow a normal distribution. It also implies that the mean of all the sample means taken is equal to the true mean of the population. Page | 268

The problem requires the solution to P(𝑋̅ > 189). Figure 4.28 illustrates the region to be found that represents this probability. Excel can be used to solve this problem by using either the =NORM.DIST() or =NORM.S.DIST() function.

Figure 4.28 Shaded region represents P(𝑋̅ > 189)

From equation (4.10) we have:

Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \frac{189 - 180}{30 / \sqrt{25}} = \frac{9}{6} = +1.5

Note that what is 189 on the actual scale (𝑋̅ = 189 pounds) becomes 1.5 (Z = 1.5) on the standardised scale:

P(𝑋̅ > 189) = P(Z > +1.5)

From normal tables:

P(𝑋̅ > 189) = P(Z > +1.5) = 0.06681

The probability that the sample mean is greater than 189 pounds is 0.06681 or 6.7%.

Excel solution


Figure 4.29 Example 4.7 Excel solution

The above solution shows two alternative ways in Excel of calculating Z and P. We have already described both Excel functions =NORM.DIST() and =NORM.S.DIST(); make sure you do not confuse them. We are given the population mean (µ = 180), population standard deviation (σ = 30), sample size (n = 25), and standard error of the sample mean (σX̄ = σ/√n = 30/√25 = 6). Therefore, from Excel, the probability that the sample mean is greater than 189 is 0.06680 or 6.7%.

SPSS solution

Select Transform > Compute Variable
Target Variable: EXAMPLE7
Enter in Numeric Expression: 1 - CDF.NORMAL(189, 180, 6).

Figure 4.30 Use compute variable to calculate P(𝑋̅ > 189) Click OK.


SPSS will now carry out the calculation and store the result in the data file under column labelled EXAMPLE7.

Figure 4.31 SPSS solution P(X̄ > 189) = 0.066807

This result agrees with the Excel solution shown in Figure 4.29.

Example 4.8

Use the Example 4.6 data to calculate the probability that the sample mean lies between 156 and 165 pounds. We have X ~ N(µ, σ²) = N(163, 35²). We require P(156 ≤ X̄ ≤ 165) when we have a randomly chosen sample of size n = 38. From the central limit theorem, we have:

X̄ ~ N(μ, σ²/n) = N(163, 35²/38)

We are given the population mean (µ = 163), population standard deviation (σ = 35), sample size (n = 38), and standard error of the sample mean (σX̄ = σ/√n = 35/√38 = 5.6777). Figure 4.32 illustrates the region to be found that represents this probability. Again, Excel can be used to solve this problem by using either the =NORM.DIST() or =NORM.S.DIST() function.

Figure 4.32 Shaded region represents P(156 ≤ X̄ ≤ 165)

From equation (4.10) we have:

Z1 = (X̄1 − μ)/(σ/√n) = (156 − 163)/(35/√38) = −1.2329

Z2 = (X̄2 − μ)/(σ/√n) = (165 − 163)/(35/√38) = +0.3523

P(156 ≤ X̄ ≤ 165) = P(−1.2329 ≤ Z ≤ +0.3523)

From normal critical tables:

P(156 ≤ X̄ ≤ 165) = 1 − P(Z ≥ +1.2329) − P(Z ≥ +0.3523)
P(156 ≤ X̄ ≤ 165) = 1 − 0.10935 − 0.36317
P(156 ≤ X̄ ≤ 165) = 0.52748

The probability that the sample mean is between 156 and 165 lbs is 0.52748 or 52.7%.

Excel solution

Figure 4.33 illustrates the Excel solution using the standard formula and function methods.

Figure 4.33 Example 4.8 Excel solution

P(156 ≤ X̄ ≤ 165) = 0.5289 or 52.9%. The probability that the sample mean is between 156 and 165 lbs is 0.5289, or approximately 53%.

SPSS solution

Select Transform > Compute Variable

Name Target Variable: example8
Enter in Numeric Expression: CDF.NORMAL(165, 163, 5.6777) - CDF.NORMAL(156, 163, 5.6777).

Figure 4.34 Use compute variable to calculate P(156 ≤ 𝑋̅ ≤ 165) Click OK SPSS will now carry out the calculation and store the result in the data file under column labelled example8.

Figure 4.35 SPSS solution P(156 ≤ 𝑋̅ ≤ 165) = 0.528869 or 52.9% This result agrees with the Excel solution illustrated in Figure 4.33 and the manual solution.
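For readers who want to verify these sampling-distribution probabilities outside Excel or SPSS, the short Python sketch below reproduces the calculations for Examples 4.7 and 4.8. It is only an illustrative cross-check, not part of the book's workbooks, and it assumes the SciPy library is available and uses the parameters quoted in the two examples (σ = 30, n = 25 and σ = 35, n = 38).

```python
from math import sqrt
from scipy.stats import norm

# Example 4.7: X ~ N(180, 30^2), n = 25, so the standard error is 30/sqrt(25) = 6
se_47 = 30 / sqrt(25)
p_47 = 1 - norm.cdf(189, loc=180, scale=se_47)   # P(sample mean > 189)
print(round(p_47, 5))                            # approx 0.06681

# Example 4.8: X ~ N(163, 35^2), n = 38, so the standard error is 35/sqrt(38) = 5.6777
se_48 = 35 / sqrt(38)
p_48 = norm.cdf(165, 163, se_48) - norm.cdf(156, 163, se_48)   # P(156 <= sample mean <= 165)
print(round(p_48, 4))                            # approx 0.5289
```

The two printed values should agree with the Excel and SPSS outputs shown in Figures 4.29–4.35.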

Sampling from a non-normal population

In the previous section, we sampled from a population which is normally distributed, and we stated that the sample means will be normally distributed with mean µ and standard error of the mean σX̄. What if the data do not come from the normal distribution? It can be shown that if we select a random sample from a non-normal distribution then the sampling mean will still be approximately normal with mean µ and standard deviation σX̄, if the sample size is sufficiently large. In most cases the value of n should be at least 30 for non-symmetric distributions and at least 20 for symmetric distributions, before we apply this approximation. This relationship is already represented by equation (4.9). This leads to an important concept in statistics that is known as the central limit theorem, which we briefly mentioned above. The central limit theorem provides us with a shortcut to the information required for constructing a sampling distribution. By applying the theorem, we can obtain the descriptive values for a sampling distribution, usually the mean and the standard error, computed from the sampling variance. We can also obtain probabilities associated with any of the sample means in the sampling distribution. The central limit theorem states that if you have a population with mean μ and standard deviation σ and take sufficiently large random samples from the population with replacement, then the distribution of the sample means will be approximately normally distributed. As we just stated, this will hold true regardless of

whether the source population is normal or skewed, provided the sample size is sufficiently large (usually n > 30). If the population is normal, then the theorem holds true even for samples smaller than 30. This means that we can use the normal probability model to quantify uncertainty when making inferences about a population mean based on the sample mean.

The central limit theorem also implies that certain distributions can be approximated by the normal distribution, for example:

1. Student’s t distribution, t(df), is approximately normal with mean 0 and variance 1 when the degrees of freedom df is large.
2. The chi-square distribution, χ²(k), is approximately normal with mean k and variance 2k, for large k.
3. The binomial distribution, B(n, p), is approximately normal with mean np and variance np(1 – p) for large n and for p not too close to 0 or 1.
4. The Poisson distribution, Po(λ), with mean value λ is approximately normal with mean λ and variance λ for large values of λ.

Example 4.9

Consider the sampling of 38 electrical components from a production run where historically a component’s average lifetime was found to be 990 hours with a standard deviation of 150 hours. The population data are right-skewed and therefore not normally distributed. Calculate the probability that a randomly chosen sample mean is less than 995 hours.

Given a population variable X with mean µ = 990 and standard deviation σ = 150 (right-skewed), we need to calculate P(X̄ < 995) when we have a randomly chosen sample of size n = 38. The population distribution is right-skewed, but the sample size is greater than 30, so we can use the central limit theorem to state:

X̄ ~ N(μ, σ²/n) = N(990, 150²/38)

The problem requires us to find 𝑃(𝑋̅ < 995). Figure 4.36 illustrates the region to be found that represents this probability. Excel can be used to solve this problem by using either the =NORM.DIST() or =NORM.S.DIST() function.


Figure 4.36 Shaded region represents P(X̄ < 995) = 0.5814

We are given the population mean (µ = 990), population standard deviation (σ = 150) and sample size (n = 38), and the standard error is:

σX̄ = σ/√n = 150/√38 = 24.33321317

From equation (4.10) we have:

Z = (X̄ − μ)/(σ/√n) = (995 − 990)/(150/√38) = +0.2054

From normal tables: 𝑃(𝑋̅ < 995) = 𝑃(𝑍 < +0.2054) 𝑃(𝑋̅ < 995) = 1 − 𝑃(𝑍 > +0.2054) 𝑃(𝑋̅ < 995) = 1 − 0.41683 𝑃(𝑋̅ < 995) = 0.58317 Based on a random sample, the probability that the sample mean is less than 995 hours is 0.58317 or 58%. Excel solution Figure 4.37 illustrates the Excel solution.


Figure 4.37 Example 4.9 Excel solution Based on a random sample, the probability that the sample mean is less than 995 hours is 0.58140 or 58%. SPSS solution Select Transform > Compute Variable Name Target Variable: Example9 Enter in Numeric Expression: CDF.NORMAL(995, 990, 24.33321317).

Figure 4.38 Use compute variable to calculate P(𝑋̅ < 995) Click OK. SPSS will now carry out the calculation and store the result in the data file under column labelled Example9.

Figure 4.39 SPSS solution P(𝑋̅ < 995) = 0.581402 The result agrees with the Excel solution shown in Figure 4.37 and the manual solution.
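The claim that sample means from a skewed population behave almost normally once n is around 30 or more can also be checked by simulation. The sketch below is purely illustrative and assumes NumPy is available; the gamma population is a stand-in chosen only because it is right-skewed with mean 990 and standard deviation 150, not the book's actual component data.

```python
import numpy as np

rng = np.random.default_rng(42)

# A right-skewed population with mean 990 and sd 150 (gamma distribution used for illustration)
shape = (990 / 150) ** 2            # gamma: mean = shape * scale, variance = shape * scale^2
scale = 150 ** 2 / 990
population = rng.gamma(shape, scale, size=100_000)

# Draw many samples of size 38 and keep each sample mean
sample_means = np.array([rng.choice(population, 38).mean() for _ in range(5_000)])

print(round(sample_means.mean(), 1))        # close to 990
print(round(sample_means.std(ddof=1), 2))   # close to 150/sqrt(38) = 24.33
print(round(np.mean(sample_means < 995), 3))  # close to P(Xbar < 995) = 0.58
```

The histogram of sample_means would look approximately normal even though the population itself is skewed, which is exactly what the central limit theorem predicts.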

Sampling without replacement

In the previous cases we assumed that sampling will have taken place with replacement (for a very large or infinite population). If there is no replacement, then equation (4.8) has to be modified by a correction factor to give equation (4.11):

σX̄ = (σ/√n) × √((N − n)/(N − 1))        (4.11)

where N is the size of the population and n is the size of the sample.

Example 4.10

A random sample of 38 part-time employees is chosen without replacement from a firm employing 98 part-time workers. The mean number of hours worked per month is 45, with a standard deviation of 12. Determine the probability that the sample mean: (a) will lie between 45 and 48 hours; (b) will be over 47 hours.

In this example we have a finite population of size N (= 98) and a sample size n (= 38). From equation (4.11) we can calculate the standard error of the sample mean and then use Excel to calculate the two probability values. Since the sample size (n = 38) is sufficiently large for the population (N = 98), we will apply the central limit theorem to the problem and assume that the sampling distribution of the mean is approximately normal, as given by equation (4.8):

X̄ ~ N(μ, σ²/n)

a. The problem requires us to find P(45 ≤ ̅ X ≤ 48). Figure 4.40 illustrates the region to be found that represents this probability. Excel can be used to solve this problem by using either the =NORM.DIST() or =NORM.S.DIST() function.

Figure 4.40 Shaded region represents P(45 ≤ X̄ ≤ 48)

Calculate P(45 ≤ X̄ ≤ 48). Given the population size (N = 98), population mean (µ = 45), population standard deviation (σ = 12) and sample size (n = 38), the corrected standard error is:

σX̄ = (σ/√n) × √((N − n)/(N − 1)) = (12/√38) × √((98 − 38)/(98 − 1))

σX̄ = 1.5310

From equation (4.10) we have:

Z = (X̄ − μ)/σX̄ = (48 − 45)/1.5310 = +1.9595

P(45 ≤ X̄ ≤ 48) = P(0 ≤ Z ≤ 1.9595)
P(45 ≤ X̄ ≤ 48) = 0.5 − P(Z ≥ 1.9595)

From critical normal tables:

P(45 ≤ X̄ ≤ 48) = 0.5 − 0.02500
P(45 ≤ X̄ ≤ 48) = 0.475

Based upon a random sample, the probability that the sample mean lies between 45 and 48 is 0.475 or 47.5%.

Excel solution

Figure 4.41 illustrates the Excel solution.


Figure 4.41 Example 4.10a Excel solution

Both methods provided the same answer to the problem of calculating the required probability.

SPSS solution

Select Transform > Compute Variable
Name Target Variable: Example10a
Enter in Numeric Expression: CDF.NORMAL(48, 45, 1.5310) - CDF.NORMAL(45, 45, 1.5310).

Figure 4.42 Use compute variable to calculate Click OK. SPSS will now carry out the calculation and store the result in the data file under column labelled Example10a

Figure 4.43 SPSS solution P(45 ≤ ̅ X ≤ 48) = 0.475


This result agrees with the Excel solution illustrated in Figure 4.41 and the manual solution which uses the critical normal tables.

b. The problem requires us to find P(X̄ > 47).

Given the population mean (µ = 45), population standard deviation (σ = 12), population size (N = 98), and sample size (n = 38), and using the correction factor, the corrected standard error is given by equation (4.11):

σX̄ = (σ/√n) × √((N − n)/(N − 1)) = (12/√38) × √((98 − 38)/(98 − 1))

σX̄ = 1.5310

Figure 4.44 shows the region to be found that represents the probability P(X̄ > 47).

Figure 4.44 Shaded region represents P(X̄ > 47)

From equation (4.10) we have:

Z = (X̄ − μ)/σX̄ = (47 − 45)/1.5310 = +1.3063

From critical normal tables:

P(X̄ > 47) = P(Z > +1.3063) = 0.09510

The probability that the sample mean is greater than 47 is 0.09510 or 9.5%. Excel solution The Excel solution is illustrated in Figure 4.45.

Figure 4.45 Example 4.10b Excel solution From Excel, the probability that the sample mean is greater than 47 is 0.0957 or 9.6%, which agrees with the manual solution using critical normal tables. SPSS solution Select Transform > Compute Variable Name Target Variable: Example10b Enter in Numeric Expression: 1-CDF.NORMAL(47, 45, 1.5310)

Figure 4.46 Use compute variable to calculate P(X̄ > 47)

Click OK. SPSS will now carry out the calculation and store the result in the data file under the column labelled Example10b.

Figure 4.47 SPSS solution P(X̄ > 47) = 0.095719

This result agrees with the Excel solution shown in Figure 4.45 and the manual solution using critical normal tables.
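The finite population correction of equation (4.11) is also easy to check numerically. The sketch below is only an illustration (assuming SciPy is available) and simply recomputes the two probabilities of Example 4.10 from the stated N, n, µ and σ.

```python
from math import sqrt
from scipy.stats import norm

N, n, mu, sigma = 98, 38, 45, 12

# Standard error with the finite population correction, equation (4.11)
se = (sigma / sqrt(n)) * sqrt((N - n) / (N - 1))
print(round(se, 4))                                            # approx 1.5310

# (a) P(45 <= Xbar <= 48) and (b) P(Xbar > 47)
print(round(norm.cdf(48, mu, se) - norm.cdf(45, mu, se), 4))   # approx 0.4750
print(round(1 - norm.cdf(47, mu, se), 4))                      # approx 0.0957
```

Dropping the correction factor would give a slightly larger standard error (12/√38 ≈ 1.95), which illustrates why the correction matters when the sample is a sizeable fraction of a finite population.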


Sampling distribution of the proportion Consider the case where a variable has two possible values, ‘yes’ or ‘no’, and we are interested in the proportion of people who choose ‘yes’ or ‘no’ in some survey. An example for this scenario could be a measure of responses of shoppers in deciding whether to purchase product A. From historical data it is found that 38% of people surveyed preferred product A. We would define this as the estimated population proportion, π, who prefer product A. If we then took a random sample from this population, it would be unlikely that exactly 38% would choose product A, but, given sampling error, it is likely that this proportion could be slightly less or slightly more than 38%. If we continued to sample proportions from this population, then each sample would have an individual sample proportion value, which when placed together, would form the sampling distribution of the sample proportion who choose product A. The sampling distribution of the sample proportion is approximated using the binomial distribution, given that the binomial distribution represents the distribution of r successes (choosing product A) from n trials (or selections). Remember, the binomial distribution is the distribution of the total number of successes, whereas the distribution of the population proportion is the distribution of the mean number of successes. Although we use the binomial distribution to approximate the sampling distribution of the sample proportions, they are not identical. Given that the mean is the total divided by the sample size, n, the sampling distribution of the proportion is somewhat different from the binomial distribution. Why? The sample proportion is effectively the mean of the scores and the binomial distribution is dealing with the total number of successes. Let us work out the details. We know from equation (3.22) that the mean of a binomial distribution is given by the equation µ = nπ, where π represents the population proportion. If we divide this expression by n, then this equation gives equation (4.12). Equation (4.12) represents the unbiased estimator of the mean of the sampling distribution of the proportion. (4.12)

𝜇𝜌 = 𝜋

Equation (3.23) represents the variance of the binomial distribution. Dividing this variance by n² gives the variance of the sample proportion, and taking the square root gives equation (4.13). Equation (4.13) represents the standard deviation of the sampling proportion (or standard error), σρ, where π represents the population proportion:

σρ = √(π(1 − π)/n)        (4.13)

From equations (4.12) and (4.13), the sampling distribution of the proportion is approximated by a binomial distribution with mean (µρ) and standard deviation (σρ). Furthermore, the sampling distribution of the sample proportion (ρ) can be

approximated with a normal distribution when the probability of success is approximately 0.5, and nπ and n(1 – π) are at least 5:

ρ ~ N(π, π(1 − π)/n)        (4.14)

The standardised sample proportion Z value is given by modifying equation (4.10) to give equation (4.15):

Z = (ρ − π)/σρ = (ρ − π)/√(π(1 − π)/n)        (4.15)

Example 4.11

It is known that 32% of workers in a factory own a personal computer. Find the probability that at least 40% of a random sample of 38 workers will own a personal computer.

In this example, we have the population proportion π = 0.32 and sample size n = 38. The problem requires us to find P(ρ ≥ 0.4). From equation (4.13) the standard error for the sampling distribution of the proportion is:

σρ = √(π(1 − π)/n) = √(0.32 × (1 − 0.32)/38) = 0.075672424

Substituting this value into equation (4.15) gives the standardised Z value:

Z = (ρ − π)/σρ = (0.4 − 0.32)/0.075672 = 1.057

P(ρ ≥ 0.4) = P(Z ≥ 1.057)

Figure 4.48 illustrates the area representing P(ρ ≥ 0.4).


Figure 4.48 Shaded region represents P(ρ ≥ 0.4)

From normal tables: P(ρ ≥ 0.4) = P(Z ≥ 1.057) = 0.14457. The probability that at least 40% of a random sample of 38 workers will own a personal computer is 14.5%.

Excel solution

Figure 4.49 illustrates the Excel solution.

Figure 4.49 Example 4.11 Excel solution From Excel, the probability that at least 40% of a random sample of 38 workers will own a personal computer is 14.5% which agrees with the manual solution.


SPSS solution Select Transform > Compute Variable Name Target Variable: Example11 Enter in Numeric Expression: 1-CDF.NORMAL(0.4, 0.32, 0.075672424)

Figure 4.50 Use compute variable to calculate P(ρ ≥ 0.4)

Click OK. SPSS will now carry out the calculation and store the result in the data file under the column labelled Example11.

Figure 4.51 SPSS solution P(ρ ≥ 0.4) = 0.145213

This result agrees with the Excel solution shown in Figure 4.49 and the manual solution.
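The normal approximation to the sampling distribution of the proportion used in Example 4.11 can be reproduced in a few lines. The sketch below is only a cross-check of the worked figures and assumes SciPy is available.

```python
from math import sqrt
from scipy.stats import norm

pi, n = 0.32, 38                       # population proportion and sample size from Example 4.11
se_rho = sqrt(pi * (1 - pi) / n)       # equation (4.13): standard error of the proportion
print(round(se_rho, 6))                # approx 0.075672

p = 1 - norm.cdf(0.40, loc=pi, scale=se_rho)   # P(sample proportion >= 0.40)
print(round(p, 4))                             # approx 0.1452
```

Note that the approximation relies on nπ and n(1 − π) both being comfortably above 5, which is the case here (12.2 and 25.8).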

Check your understanding X4.6

Five people have made insurance claims for the amounts shown in Table 4.4. A sample of two people is to be taken at random, with replacement, from the five. Derive the sampling distribution of the mean and prove: (a) x̿ = μ, and (b) σx̄ = σ/√n.

Person:                1     2     3     4      5
Insurance claim (£):   500   400   900   1000   1200

Table 4.4 Insurance claims (£)

X4.7

If X is a normal random variable with mean 10 and standard deviation 2, that is, X ~ N(10, 4), define and compare the sampling distribution of the mean for samples of size: (a) 2, (b) 4, (c) 16.

X4.8

If X is any random variable with mean 63 and standard deviation 10, define and compare the sampling distribution of the mean for samples of size: (a) 40, (b) 60, and (c) 100.

X4.9

Use Excel to generate a random sample of 100 observations from a normal distribution with mean 10 and a standard deviation 4. Calculate the sample mean

and standard deviation. Why are these values different from the population values?

Chapter summary In this chapter we have introduced the important statistical concept of sampling and the concept of the central limit theorem. This enabled us to conclude that the sampling distribution can be approximated by the normal distribution. We have shown how the central limit theorem can eliminate the need to construct a sampling distribution by examining all possible samples that might be drawn from a population. The central limit theorem allows us to determine the sampling distribution by using the population mean and variance values or estimates of these obtained from a sample. Furthermore, we established that an unbiased estimate of the population mean is provided by the sample mean. The sample variance (or standard deviation) is a biased estimate of the population variance (or standard deviation) and we used the standard error of the estimate, which is used to estimate the population variance (or standard deviation). We have shown that as the sample size increases, the standard error decreases. However, any advantage quickly vanishes as improvements in standard error tend to be smaller as the sample size gets larger. The next chapter will use these results to introduce the idea of calculating point and interval estimates for the population mean (and proportion) based on sample data and the assumption that the underlying x variable is normally distributed.

Test your understanding TU4.1 Is the sampling distribution of the sample means dependent on the underlying population distribution? For what values of sample size would the central limit theorem apply? TU4.2 The central limit theorem allows a sampling distribution of the sample means to be developed for large sample size even if the shape of the population distribution is unknown: a. What happens to the value of the standard deviation of the sampling means if the sample size increases? b. What happens to the sampling distribution of the sample means as the sample size increases from n = 15, n =23, n = 30, and n > 30? c. What value of the sample size should we use to randomly select samples such that the central limit theorem can be applied even when the shape of population distribution is unknown? TU4.3 A series of samples are to be taken from a population with a population mean of 100 and population variance of 67. The central limit theorem applies when the Page | 286

sample size is at least 30. If a random sample of size 34 is taken then calculate: (a) the mean and variance of the sampling distribution of the sample means; (b) the probability that a sample mean is greater 103; and (c) the probability that a sample mean lies between 98.4 and 103.2. TU4.4 Current airline rules assume that the average mean weight for male passengers is 88 kg, with no information provided on the shape of the population distribution. If we assume that the population standard deviation is 14 kg, describe the sampling distribution for the sample means if the sample size is large. Calculate the probability that a sample mean is greater than 96 kg when we randomly sample 38 passengers. TU4.5 A population variable has an unknown shape but known mean and standard deviation of 74 and 18, respectively. If we randomly choose a sample of size 48, calculate the probability that the sample mean: (a) is greater than 69, (b) is less than 69, (c) is between 69 and 73, and (d) is greater than 76. TU4.6 According to a recent UK Office for National Statistics report, the average weekly household spending was £528.90 during 2016, with a standard deviation of £90 and a population distribution that is positively skewed. What is the probability that the household sample mean spend is greater than £534.87 if we randomly sample 56 households? TU4.7 A research assistant collects data on the time spent in minutes by gym users on a new piece of gym equipment. The assistant randomly selects a sample of 57 from the population where the population mean is 18 minutes with a standard deviation of 1.8 minutes. a. What is the shape of the sampling distribution of the sample means? b. Is your answer to part (a) dependent upon the shape of the population distribution? c. Calculate the mean and standard deviation of the sampling means. d. Calculate the probability that a sample mean is greater than 18.4 minutes. TU4.8 A university academic spends on average 6 minutes answering each student email. The shape of the population distribution is unknown, but the standard deviation is known to be 2.2 minutes. The academic decides to calculate the sample mean and wishes to know the shape of the sampling distribution of the sample means and to calculate a series of probabilities associated with the sample means. a. If we select a small random sample can we apply the central limit theorem (say n < 30)? b. For what values of sample size can we then apply the central limit theorem? c. What does the central limit theorem tell us about the sampling distribution of the sample means? d. What is the value of the mean of the sampling means? e. What is the value of the standard error of the sampling means if we select a random sample of size 67? Page | 287

f. Calculate the probability that the sample mean is greater than 6.6 minutes when n = 67. g. Calculate the probability that the sample mean lies between 6.2 and 6.8 minutes when n = 67. TU4.9 A local supermarket employs a staff member at the counter who deals with the collection and returns of goods purchased online via the supermarket ecommerce website. The time spent dealing with each customer has a population mean of 4.5 minutes with a standard deviation of 0.35 minutes. The shape of the population distribution is unknown. a. What is the shape of the sampling distribution of the sample means if the supermarket randomly selects a sample of size 45? b. What is the name of the theorem stated in your answer to part (a)? c. What is the value of the mean of the sampling means if the central limit theorem applies? d. What is the value of the standard error of the sampling means if the central limit theorem applies? e. Calculate the probability that the sample mean time will be greater than 4.6 minutes? f. Calculate the probability that the sample mean time will lie between 4.55 and 4.7 minutes?

Want to learn more? The textbook online resource centre contains a range of documents to provide further information on the following topics: 1. AW4 Use SPSS to demonstrate the Central Limit Theorem


Chapter 5 Point and interval estimates

5.1 Introduction and learning objectives

Chapter 4 enabled us to understand that sampling distributions provide very useful clues about the population parameters, and we specifically looked at the mean and proportions. Effectively, this chapter takes the sampling distribution of the sample mean and proportion that we investigated in the previous chapter to the next level. We will learn how to use the sample mean (or proportion) to provide an estimate of the population value. We will first define two different types of estimates, the point estimate and the interval estimate. We will then define the criteria for a good estimator. From there, we will learn how to make some of the specific point estimates of the population mean, variance and proportion. This will be followed by exploring interval estimates for the same population statistics. We will clarify how interval estimates are related to confidence intervals and conclude the chapter by providing specifics of the procedure for calculating the interval estimate of the population mean (µ) and proportion (π), provided that σ is known and the sample is larger than 30 observations. Then we will do the same for the situation where σ is unknown and the sample is smaller than 30 observations. We will complete the chapter with some practical advice on how to calculate the sample size. Students might find this advice useful not only when designing surveys for their dissertations, but also later in professional life.

Learning objectives On completing this chapter, you will be able to: 1. Calculate point estimates for population parameters (mean, proportion, variance) for one and two samples 2. Calculate sampling errors and confidence intervals when the population standard deviation is known or unknown (z and t tests) for one and two samples 3. Determine sample sizes 4. Solve problems using Microsoft Excel and IBM SPSS. An important skill in business statistics is to be able to analyse sample data such that we can infer the population statistics. The simplest method is to calculate the sample mean and infer that this is equal to the value of the population mean. If we have collected sufficient sample data to minimise the chance of sampling error, then we would expect the sample mean to be an unbiased estimator of the population mean. Because no estimate can be 100% reliable, you would want to know how confident you can be in your estimate and whether to act on it from a business decision-making perspective. In statistics, a confidence interval gives the probability that an estimated range of possible values includes the actual value being estimated. For example, you may decide Page | 289

that a printing shop can print 2000 A4 pages per day. Because the printing shop cannot be expected to print exactly 2000 A4 pages on each day, a confidence interval can be created to give a range of possibilities. You may state that there is a 95% chance that the shop prints between 1800 and 2200 A4 pages per day. The confidence interval is 95%, and the probability that the actual number of pages printed is outside this estimated range is 5%. You can also think of this 5% figure as a risk factor. In other words, there is a 5% risk that actual number of pages is not between 1800 and 2200. The contents of this chapter will be useful to you if you need to make an estimate based on sample information. This happens almost every day. You might be in business and need to decide, based on research data, whether to launch a new product. If this is the case, the content of this chapter will help you frame your decision in a compelling way. You might be in manufacturing and deciding, based on the scrap sample, if you have adequate quality problem in your process. Again, the content of this chapter will provide the basis for such decision-making. You might be in public service and need to decide, based on local information, if the funding on a broader level is appropriate. The methods covered in this chapter are essential in this case.

5.2 Point estimates

In the previous chapter we explored the sampling distribution of the mean and proportion. We showed that these distributions can be normal with population parameters µ and σ². Because the distribution of the means (or proportions) is effectively normal, this will enable us to make estimates of the unknown population mean (or proportion), based on the sample mean (or proportion). This can be phrased as: our objective is to determine the value of a population parameter based on a sample statistic. As we will see shortly, this population parameter is determined with a degree of probability, which is called the level of confidence. We reiterate, the method described in this section is dependent upon the sampling distribution being (approximately) normally distributed. There are two estimates of the population value that we can make: either a point estimate or an interval estimate. Some procedures provide a single value (called point estimates), while others provide a range of values (called interval estimates). Figure 5.1 illustrates the relationship between the point and interval estimates for a population mean µ.

Figure 5.1 Point and interval estimate Suppose that you want to find the mean weight of all football players who play in a local football league. Due to practical constraints you are unable to measure all the players, but you can select a sample of 45 players at random and weigh them to provide a sample mean. From your survey, this sample mean is calculated as 78 kg. Page | 290

From Chapter 4, we know that the sampling distribution of the mean is approximately normally distributed for large sample sizes. We also know that the sample mean is an unbiased estimator of the population mean. Because of these two facts, you can treat this number of 78 kg also as the point estimate of the population mean. If we know, or can estimate, the population standard deviation (σ), then we can apply equation (4.9) to provide an interval estimate for the population mean based upon some degree of error between the sample and population means. This interval estimate is called the confidence interval for the population mean. This approach is also valid for the confidence interval for the population proportion if we are measuring proportions.

Performing a parametric statistical analysis requires the justification of several assumptions. These assumptions are as follows:

1. Data are collected randomly and independently.
2. Data are collected from normally distributed populations.
3. Variances are equal when comparing two or more groups.

The word ‘parametric’ means that sample data come from a known distribution (often a normal distribution, though not necessarily) whose parameters are fixed. There are several ways to check these assumptions (covered in subsequent chapters):

1. Normality can be checked by plotting a histogram to observe if the shape looks normal, or we can undertake some statistical analysis and calculate the Kolmogorov–Smirnov test statistic.
2. Equal variance can be checked by undertaking Levene’s test for equality of variance.
3. Hypothesis tests involving means and regression analysis can be run (more about that in the chapters that follow).

If the data collected are not normally distributed, or the equality of variance assumption is violated, there are alternative ways to make inferences or carry out hypothesis tests. These include carrying out an equivalent nonparametric test which does not assume that the distribution is normally distributed (or that the distribution parameters are fixed). These so-called nonparametric tests measure data that are not at the scale/ratio level but are of ordinal or categorical form (we shall explore some of them in Chapter 7). An alternative method that can be used when the parametric assumptions have been violated is to use Excel or IBM SPSS to implement a method called bootstrapping. Bootstrapping is a numerical sampling technique where the data are sampled with replacement. This means that you require an initial sample and then use this sample to generate more samples. In this way you obtain many samples from the original sample which can then be used to calculate descriptive statistics, such as the mean, median and variance. Bootstrapping can be used to create many resamples, and bootstrapping statistics allow the researcher to analyse any distribution and carry out hypothesis tests (covered in the next Chapter). Page | 291

As the objective of this chapter is to estimate certain statistics, we might ask ourselves what constitutes a good estimator. A good estimator should be unbiased, consistent, and efficient:

1. An unbiased estimator of a population parameter is an estimator whose expected value is equal to that parameter. As we already know from the previous chapter, the sample mean x̄ is an unbiased estimator of the population mean, µ. In other words, the expected value of the sample mean equals the population mean, as expressed in equation (5.1):

E(X̄) = μ        (5.1)

2. An unbiased estimator is said to be a consistent estimator if the difference between the estimator and the parameter grows smaller as the sample size grows larger. The sample mean x̄ is a consistent estimator of the population mean, µ, with variance given by equation (5.2):

VAR(X̄) = σ²/n        (5.2)

As n grows larger, the variance of the sample mean grows smaller.

3. If there are two unbiased estimators of a parameter, the one whose variance is smaller is said to be the more efficient estimator. For example, both the sample mean and median are unbiased estimators of the population mean. Which one should we use? If the sample median has a greater variance than the sample mean, we choose the sample mean since it is more efficient.
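These three properties can be illustrated by simulation. The sketch below is an illustration only, using an arbitrary normal population (mean 50, standard deviation 10) and assuming NumPy is available; it shows that the average of many sample means sits at µ (unbiasedness), that their variance shrinks roughly as σ²/n (consistency), and that the sample mean varies less than the sample median for normal data (efficiency).

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, reps = 50, 10, 20_000

for n in (10, 100):
    samples = rng.normal(mu, sigma, size=(reps, n))
    means = samples.mean(axis=1)
    medians = np.median(samples, axis=1)
    # average of the sample means ~ mu; var of means ~ sigma^2/n; var of medians is larger
    print(n, round(means.mean(), 2), round(means.var(), 2), round(medians.var(), 2))
```

For n = 100 the variance of the sample means is roughly 1 (= 10²/100), while the variance of the sample medians is noticeably larger, which is why the mean is preferred here.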

Point estimate of the population mean and variance

A point estimator draws an inference about a population parameter by estimating the value of an unknown parameter given as a single point or data value. For example, the sample mean is the best estimator of the population mean. It is unbiased, consistent, and the most efficient estimator as long as the sample was drawn from a normal population. If it was not drawn from a normal population but the sample is sufficiently large, then the central limit theorem states that the sampling distribution can be approximated by a normal distribution, so it is still unbiased, consistent and the most efficient estimator. To summarise, the sample mean represents a point estimate of the population mean, µ, and this relationship is defined by equation (5.3):

μ = X̄        (5.3)

Applying the central limit theorem, we would expect the point estimator to get closer and closer to the true population value as the sample size increases. The degree of error is not reflected by the point estimator. However, we can employ the concept of the interval estimator to put a probability on the value of the population parameter lying

between two values, with the middle value being the point estimator. In Section 5.3, we will discuss the concept of an interval estimate or confidence interval.

In statistics, the standard deviation is often estimated from a random sample drawn from a population. In Chapter 4, we showed, via a simple example, that the sampling distribution of the means gives the following rules:

1. The mean of the sample means is an unbiased estimator of the population mean. In other words, the expected value of the sample means equals the population mean (E(X̄) = μ).
2. The sample variance is a biased estimator of the population variance. In other words, the expected value of the biased sample variance (sb²) is not equal to the population variance (E(sb²) ≠ σ²).

As the sample mean is an unbiased estimator, it is obvious that estimating the population mean from the sample mean is straightforward. Let us just consider the variance and standard deviation. Because the sample variance is biased, the bias can be corrected using Bessel’s correction. This corrects the bias in the estimation of the population variance, and some, but not all, of the bias in the estimation of the population standard deviation. The Bessel correction factor is given by expression (5.4):

n/(n − 1)        (5.4)

The relationship between the unbiased sample variance (s²) and the biased sample variance (sb²) is given by equation (5.5):

s² = [n/(n − 1)] sb²        (5.5)

From equation (5.5), the unbiased sample variance is given by equation (5.6):

s² = Σ(Xᵢ − X̄)²/(n − 1)        (5.6)

The Excel function to calculate an unbiased estimate of the population variance (s²) is =VAR.S(). From equation (5.6), the unbiased sample standard deviation is given by equation (5.7):

s = √[Σ(Xᵢ − X̄)²/(n − 1)]        (5.7)

The Excel function to calculate an unbiased estimate of the population standard deviation (s) is =STDEV.S(). Why n – 1 and not n in equations (5.6) and (5.7)?

This is because the population variance (σ²) is estimated from the sample mean (X̄) and from the deviations of each measurement from the sample mean. If, for example, we lacked any one of these measurements (the sample mean or a single deviation value), we could still calculate it from the rest of the data. So, with n measurements (data points) only n – 1 of them are free variables in the calculation of the sample variance. This means that a missing observation can be found from the other n – 1 observations and the sample mean. The term n – 1 is called the degrees of freedom. Also note that if you used n rather than n – 1, you would effectively underestimate the true population variance or standard deviation.

Unfortunately, it can be shown mathematically that not all the bias is removed when using n – 1 in the equation rather than n. Fortunately, however, the amount of bias is negligible. Despite this negligible bias, we can safely assume that equation (5.7) is an unbiased estimator of the population standard deviation. From equation (5.7), we have n – 1 degrees of freedom. The larger the sample size, n, the smaller the correction involved in using the degrees of freedom (n – 1). For example, Table 5.1 compares the value of 1/n and 1/(n – 1) for different sample sizes n = 15, 25, 30, 40, 50, 100, …, 10000. We conclude that very little difference exists between 1/n and 1/(n – 1) for large n.

n              15       25       30       40       50       100      1000     10000
1/n            0.0667   0.0400   0.0333   0.0250   0.0200   0.0100   0.0010   0.0001
1/(n − 1)      0.0714   0.0417   0.0345   0.0256   0.0204   0.0101   0.0010   0.0001
% difference   0.0667   0.0400   0.0333   0.0250   0.0200   0.0100   0.0010   0.0001

Table 5.1 Difference between 1/n and 1/(n − 1) as n increases

Similarly, the bias in the sample standard deviation is negligible when n – 1 is used instead of n in the denominator. For example, for a normally distributed variable, the unbiased estimator of the population standard deviation (σ̂) can be shown to be given by equation (5.8):

σ̂ = s (1 + 1/(4(n − 1)))        (5.8)

In equation (5.8) s is the sample standard deviation and n is the number of observations in the sample. Table 5.2 explores the degree of error between the unbiased estimator of the population standard deviation and the sample standard deviation. The table shows that when the sample size is 4 the underestimate is 8.33% and when the sample size is 30 the underestimate is 0.86%. The difference between the two values quickly reduces with increasing sample size.

n          4       10      20      30      40      50      100     1000    10000
Error      0.083   0.028   0.013   0.009   0.006   0.005   0.003   0.000   0.000
% error    8.333   2.778   1.316   0.862   0.641   0.510   0.253   0.025   0.003

Table 5.2 Degree of error as n increases

Now we know how to calculate the unbiased standard deviation (s) for a sample by using equation (5.7), this value of s can be used to estimate the standard deviation of the population. This is done via the standard error of the mean. We already know that the standard error of the sample means is defined by equation (4.8), so we will repeat it here for the sake of convenience and replace σ with s, which is the standard deviation for a sample:

σX̄ = s/√n        (5.9)

This value σX̄, or the standard error of the sample mean (which is effectively the standard deviation of the sampling means), will become the key statistic that will enable us to make estimates of the population parameters, based on the sample parameters.

Example 5.1

A manufacturer takes 25 measurements of the weight of a product (Kgs), with the measurement results presented in Table 5.3.

Sample data
5.02   5.11   5.06   5.30   4.46
5.32   4.89   4.97   4.75   4.93
5.14   5.23   4.84   4.85   5.29
5.12   4.92   5.03   5.27   4.82
4.97   4.86   4.79   4.56   5.30

Table 5.3 Weight of product (Kgs)

Calculate an unbiased estimate of the mean, and the standard deviation and standard error of your estimate of the mean. The unbiased estimates of the population mean, standard deviation, and standard error of the mean are provided by solving equations (5.3), (5.7), and (5.9).

a. Calculate the sample statistics (sample size, sample mean, sample standard deviation)

Sample data, X    (X − X̄)²
5.02              0.000784
5.11              0.013924
5.06              0.004624
5.30              0.094864
4.46              0.283024
5.32              0.107584
4.89              0.010404
4.97              0.000484
4.75              0.058564
4.93              0.003844
5.14              0.021904
5.23              0.056644
4.84              0.023104
4.85              0.020164
5.29              0.088804
5.12              0.016384
4.92              0.005184
5.03              0.001444
5.27              0.077284
4.82              0.029584
4.97              0.000484
4.86              0.017424
4.79              0.040804
4.56              0.186624
5.30              0.094864

Table 5.4 Calculation of the summary statistics

Summary statistics:

Sample size, n = 25

Sample mean, X̄:

X̄ = (5.02 + 5.11 + ⋯ + 4.56 + 5.30)/25 = 124.8/25 = 4.9920 Kgs

Sample variance, s²:

s² = Σ(Xᵢ − X̄)²/(n − 1) = (0.000784 + 0.013924 + ⋯ + 0.186624 + 0.094864)/(25 − 1) = 1.2588/24 = 0.0525 Kgs²

Sample standard deviation, s:

s = √variance = √0.0525 = 0.2290 Kgs

b. Population estimates

Unbiased estimate of the population mean, μ̂:

μ̂ = X̄ = 4.9920 Kgs

Unbiased estimate of the population standard deviation, σ̂:

σ̂ = s = 0.2290 Kgs

Unbiased estimate of the standard deviation of the sampling means (the standard error, σX̄):

σ̂X̄ = σ̂/√n = s/√n = 0.2290/√25 = 0.0458 Kgs

We will show shortly how this standard error is used for interval estimates.

Excel solution

Method 1: Formula method – equations (5.3), (5.5) and (5.7)

Figures 5.2 and 5.3 illustrate the formula method to calculate the required summary statistics.


Figure 5.2 Example 5.1 Excel solution

Figure 5.3 Example 5.1 Excel solution continued From Excel, the population estimates are: 1. Estimate of the population mean is 4.9920 Kgs 2. Estimate of the population standard deviation is 0.2290 Kgs 3. Estimate of the standard error is 0.0458 Kgs Method 2 Excel function method Figure 5.4 illustrates the formula method to calculate the required summary statistics.


Figure 5.4 Example 5.1 Excel function solution continued SPSS Solution SPSS solution using SPSS Frequencies Input data into SPSS.

Figure 5.5 Example 5.1 SPSS data Frequencies method Select Analyze > Descriptive Statistics > Frequencies


Figure 5.6 SPSS Frequencies Transfer Data into the Variable(s) box. Switch off Display frequency tables (ignore warning).

Figure 5.7 SPSS frequencies menu

Click on Statistics. Choose Mean, Std. deviation, S.E. mean.

Figure 5.8 SPSS frequencies statistics option Click Continue.

Figure 5.9 SPSS frequencies menu Click OK SPSS output


Figure 5.10 SPSS frequencies solution The SPSS solutions presented in Figure 5.10 agree with the manual solution and Excel solutions presented in Figures 5.3 and 5.4. Alternatively, you could use the SPSS Descriptives and Explore menus to provide the results. Descriptives method Select Analyze > Descriptive Statistics > Descriptives. Transfer Data into the Variable(s) box. Click on Options. Choose Mean, Std.deviation, S.E. mean. Click Continue. Click OK. SPSS output

Figure 5.11 SPSS descriptives solution Explore method Select Analyze > Descriptive Statistics > Explore. Transfer Data into the Dependent List box. Click on Statistics. Choose Descriptives. Click Continue. Click OK. SPSS output


Figure 5.12 SPSS explore solution Observe that the Explore method provides a range of descriptive statistics and not just the statistics that the first two SPSS methods provided. What are the implications of these statistics that we have just calculated? We calculated that the average weight is 4.9920 Kgs. The standard error of 0.0458 Kgs will help us, as we will demonstrate shortly, to estimate how ‘close’ we are to estimating the true mean value of all the rods that we manufacture.
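The same point estimates can be checked outside Excel or SPSS. The sketch below is only an illustrative cross-check, assuming NumPy is available; it uses the 25 weights from Table 5.3, and the ddof=1 argument applies the n − 1 divisor of equations (5.6) and (5.7).

```python
import numpy as np

weights = np.array([5.02, 5.11, 5.06, 5.30, 4.46, 5.32, 4.89, 4.97, 4.75, 4.93,
                    5.14, 5.23, 4.84, 4.85, 5.29, 5.12, 4.92, 5.03, 5.27, 4.82,
                    4.97, 4.86, 4.79, 4.56, 5.30])

n = weights.size
mean = weights.mean()            # point estimate of the population mean
s = weights.std(ddof=1)          # sample standard deviation with the n - 1 divisor
se = s / np.sqrt(n)              # standard error of the sample mean, equation (5.9)

print(round(mean, 4), round(s, 4), round(se, 4))   # approx 4.9920, 0.2290, 0.0458
```

Using ddof=0 instead would reproduce the biased (n-divisor) variance discussed above, which slightly understates the population spread.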

Point estimate of the population proportion and variance

In the previous section we provided the equations to calculate the point estimate for the population mean based upon the sample data. A similar principle can be used if we have a sample proportion and we want to provide the point estimate of the population proportion. Equations (5.10) and (5.11) provide unbiased estimates of the population proportion (π̂) and standard error (σ̂ρ), where ρ is the sample proportion.

Estimate of the population proportion:

π̂ = ρ        (5.10)

Estimate of the population standard deviation (standard error of the proportion):

σ̂ρ = √(ρ(1 − ρ)/n)        (5.11)

This means we find the sample proportion ρ and from there we estimate the population proportion π̂ and the population standard deviation σ̂ρ.

Example 5.2 A local call centre has randomly sampled 30 staff from a total of 143 to ascertain if the staff are in favour of moving to a new working shift pattern. The results of the survey are presented in table 5.5. Provide a point estimate of the population proportion of total workers who disagree with the new shift pattern and give an estimate for the standard error of your estimate.

ID:       1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
Outcome:  D  D  A  A  A  A  A  D  D  A  D  A  A  A  A  D  D  D  D  A  A  A  D  A  A  D  A  D  A  D
(A = Agree, D = Disagree)

Table 5.5 Call centre staff survey results

Unbiased estimates of the population proportion and standard error of the proportion are found by solving equations (5.10) and (5.11).

Unbiased estimate of the population proportion:

π̂ = ρ = 13/30 = 0.4333

Unbiased estimate of the standard error of the proportion:

σ̂ρ = √(ρ(1 − ρ)/n) = √(0.4333 × (1 − 0.4333)/30) = 0.0905

Estimate of the population proportion is 0.4333 with an estimate of the standard error equal to 0.0905. Excel solution Figures 5.13 and 5.14 show the Excel solution.

Figure 5.13 Example 5.2 Excel solution

Figure 5.14 Example 5.2 Excel solution continued

From Excel, the estimates of the population proportion and standard error are 0.43 (43%) and 0.0905 (9.05%), respectively.

SPSS solution You can calculate the proportions by using the SPSS Frequency menu. Input data into SPSS

Figure 5.15 Example 5.2 SPSS data Select Analyze > Descriptive Statistics

Figure 5.16 Frequencies menu Select Frequencies. Transfer Outcome into Variable(s) box


Figure 5.17 SPSS frequencies menu Click OK SPSS output The output is shown in Figure 5.18.

Figure 5.18 SPSS frequencies solution

We observe that the point estimate of the population proportion who disagree is equal to the sample proportion who disagree, which is 43.3% or 0.433. As before, we may now use equation (5.11) to estimate the value of the standard error of the proportion:

σ̂ρ = √(ρ(1 − ρ)/n) = √(0.4333 × (1 − 0.4333)/30) = 0.0905

Again, we will shortly demonstrate how this standard error of the proportion is used to provide interval estimates.
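The same two numbers can be obtained directly from the counts. A minimal sketch (using only the 13 'Disagree' responses out of 30 reported in Table 5.5; no additional data are assumed):

```python
from math import sqrt

n = 30
disagree = 13

rho = disagree / n                     # sample proportion = point estimate of pi, equation (5.10)
se_rho = sqrt(rho * (1 - rho) / n)     # equation (5.11): standard error of the proportion

print(round(rho, 4), round(se_rho, 4))   # approx 0.4333, 0.0905
```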


Pooled estimates

If more than one sample is taken from a population then the resulting sample statistics can be combined to provide pooled estimates for the population mean, variance and proportion. If we have two samples of sizes n1 and n2, then the estimate of the population mean is provided by the pooled sample mean:

X̄ = (n1X̄1 + n2X̄2)/(n1 + n2)        (5.12)

The estimate of the population variance is provided by the pooled sample variance:

σ̂² = (n1s1² + n2s2²)/(n1 + n2 − 2)        (5.13)

The estimate of the population proportion is provided by the pooled sample proportion:

π̂ = (n1π̂1 + n2π̂2)/(n1 + n2) = (n1ρ1 + n2ρ2)/(n1 + n2)        (5.14)
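A small sketch of equations (5.12)–(5.14) is given below. The sample summaries used in the calls are made up purely for illustration and are not taken from the text.

```python
def pooled_mean(n1, x1, n2, x2):
    # Equation (5.12): weighted average of the two sample means
    return (n1 * x1 + n2 * x2) / (n1 + n2)

def pooled_variance(n1, s1_sq, n2, s2_sq):
    # Equation (5.13): pooled sample variance with n1 + n2 - 2 degrees of freedom
    return (n1 * s1_sq + n2 * s2_sq) / (n1 + n2 - 2)

def pooled_proportion(n1, p1, n2, p2):
    # Equation (5.14): weighted average of the two sample proportions
    return (n1 * p1 + n2 * p2) / (n1 + n2)

# Hypothetical samples: n1 = 40, mean 12.1, variance 0.25; n2 = 60, mean 12.4, variance 0.36
print(pooled_mean(40, 12.1, 60, 12.4))        # 12.28
print(pooled_variance(40, 0.25, 60, 0.36))    # approx 0.3224
print(pooled_proportion(40, 0.30, 60, 0.40))  # 0.36
```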

Check your understanding

X5.1

A random sample of five values was taken from a population: 8.1, 6.5, 4.9, 7.3, and 5.9. Estimate the population mean and standard deviation, and the standard error of the estimate for the population mean.

X5.2

The mean of 10 readings of a variable was 8.7 with standard deviation 0.3. A further five readings were taken: 8.6, 8.5, 8.8, 8.7, and 8.9. Estimate the mean and standard deviation of the set of possible readings using all the data available.

X5.3

Two samples are drawn from the same population as follows: sample 1 (0.4, 0.2, 0.2, 0.4, 0.3, and 0.3) and sample 2 (0.2, 0.2, 0.1, 0.4, 0.2, 0.3, and 0.1). Determine the best unbiased estimates of the population mean and variance.

X5.4

A random sample of 100 rods from a population were measured and found to have a mean length of 12.132 with standard deviation 0.11. A further sample of 50 is taken. Find the probability that the mean of this sample will be between 12.12 and 12.14.

X5.5

A random sample of 20 children in a large school were asked a question and 12 answered correctly. Estimate the proportion of children in the school who would answer correctly and the standard error of this estimate.

X5.6

A random sample of 500 fish is taken from a lake and marked. After a suitable interval, a second sample of 500 is taken and 25 of these are found to be marked. By considering the second sample, estimate the number of fish in the lake.

5.3 Interval estimates

As we know by now, if we take just one sample from a population, we can use the sample statistics to estimate a population parameter. Our knowledge of sampling error would indicate that the standard error provides an evaluation of the likely error associated with the estimate. If we assume that the sampling distribution of the sample means is normal, then we can provide a measure of this error in terms of a probability value that the value of the population mean will lie within a specified interval. This interval is called an interval estimate (or, as we will explain later, it can be called the confidence interval), and it is centred at the point estimate for the population mean.

To create an interval around the sample mean X̄, so that you can state with a certain confidence that the true mean resides within it, we need three parameters. We need to calculate the sample mean X̄ (see equation (4.3)), we need the value of the standard error σX̄ (see equation (4.8) or (5.9)) and we need the Z-value which will correspond to the desired probability level (see equation (4.10)). With these three parameters, we can make an estimate that the true population mean is within the interval, as described by equation (5.15) or equation (5.16):

μ = X̄ ± Z σX̄        (5.15)

or

μ = X̄ ± Z s/√n        (5.16)

From our knowledge of the normal distribution we know that 95% of the distribution lies within ±1.96 standard deviations of the mean. Therefore, for the distribution of sample means, 95% of these sample means will lie in the interval defined by equation (5.16). This equation tells us that an interval estimate (or confidence interval) is centred at X̄ with a lower and upper confidence interval boundary given by equations (5.17) and (5.18):

Lower confidence interval boundary value:

μ1 = X̄ − 1.96 s/√n        (5.17)

Upper confidence interval boundary value:

μ2 = X̄ + 1.96 s/√n        (5.18)

Figure 5.19 illustrates graphically the position of the point estimate, the lower and upper boundaries for the confidence interval (μ1, μ2), and the confidence interval itself relative to these statistics.


Figure 5.19 Shaded region represents the confidence interval

In other words, if we select a sample, we can calculate the sample mean X̄. From there, we can be 95% confident that this sample mean X̄ represents the true population mean µ, somewhere in the interval that is estimated as X̄ ± 1.96 s/√n. Why? Because, if we took many samples, as explained earlier, 95% of all the intervals would be around the true population mean. This also implies that we would expect that 5% of the intervals would not contain the population mean.

For example, suppose we take a sample from a population which is normally distributed with a population mean of 90 and standard deviation of 5. We can take random samples of size 36 from this population and calculate the sample means. If we wish to calculate a 95% confidence interval, then this interval would be given by the following equation:

μ = X̄ ± 1.96 s/√n

Now, if we collect four samples with sample means of 88.9, 91.0, 87.2 and 95.3 respectively, then the 95% confidence intervals for each sample are illustrated in Figure 5.20.


Figure 5.20 95% confidence intervals for four sample means

From Figure 5.20, we observe:

a. For sample 1, the interval is 86.3–91.5, which includes the assumed population mean of 90.
b. For sample 2, the interval is 88.4–93.6, which includes the assumed population mean of 90.
c. For sample 3, the interval is 84.6–89.8, which does not include the assumed population mean of 90.
d. For sample 4, the interval is 92.7–97.9, which does not include the assumed population mean of 90.

In the real world, you will choose one sample where the population mean µ is unknown; in this situation you have no guarantee that the confidence interval based upon this sample mean will include the population mean µ. However, if you take all possible samples of size n and compute the corresponding sample means, 95% of the confidence intervals will include the population mean and only 5% of the samples will not. This tells us that we would have 95% confidence that the population mean µ lies within the sample confidence interval.

If we return to Example 5.1, we had a mean value of 4.9920 and a standard error of 0.0458. If we take Z = 1.96, then we can provide an estimate of the population weight to lie between 4.9022 (= 4.9920 – 1.96 × 0.0458) and 5.0818 (= 4.9920 + 1.96 × 0.0458).


In fact, because we used 1.96 for the Z-value, this means that we are 95% certain that the true weight lies somewhere between 4.9022 and 5.0818. A different value of Z would give us a different confidence interval.

From the above examples you can conclude that the estimate interval and the confidence interval are connected. Equation (5.15) tells you the width of the interval in which the true mean is likely to lie (the estimate interval), and the value of Z determines the probability that the true mean value is in this interval (the confidence level). It is intuitive to expect that the higher the probability (i.e. the higher the confidence level), the wider the estimate interval will be, and vice versa.

Now that we understand how interval estimates are made and how this is connected with a confidence interval for the estimate, we will look at how to calculate confidence intervals for both the population mean and proportion, depending on the sample size.

Interval estimate of the population mean where σ is known and the sample is larger than 30 observations

To reiterate: if a random sample of size n is taken from a normal population N(µ, σ²) then the sampling distribution of the sample means will be normal, X̄ ~ N(μ, σ²/n). As we have shown, we can use equation (5.15) to give a confidence interval for the population mean:

X̄ − Z σX̄ ≤ μ ≤ X̄ + Z σX̄        (5.19)

or

X̄ − Z σ/√n ≤ μ ≤ X̄ + Z σ/√n        (5.20)
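The interpretation that roughly 95% of such intervals contain µ can be verified by simulation. The sketch below is an illustration only, assuming NumPy is available and a known σ; it uses the population mean 90, standard deviation 5 and sample size 36 mentioned above.

```python
import numpy as np

rng = np.random.default_rng(7)
mu, sigma, n, reps = 90, 5, 36, 10_000

covered = 0
for _ in range(reps):
    sample = rng.normal(mu, sigma, n)
    xbar = sample.mean()
    half_width = 1.96 * sigma / np.sqrt(n)     # Z times the standard error, sigma known
    if xbar - half_width <= mu <= xbar + half_width:
        covered += 1

print(covered / reps)   # close to 0.95
```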

Example 5.3

Calculate a 95% confidence interval for the population mean for the data presented in Example 5.1, but this time assume that the data are normally distributed. If you carry out the calculations, then you will find that the required statistics are as follows:

Population data: X ~ N(µ, σ²)

Sample data: sample means X̄ ~ N(μ, σ²/n)

Sample size, n = 25
Sample mean, X̄ = 4.9920


Sample standard deviation, s = 0.2290 Standard error of the means, σX̅ − 0.0458 Confidence intervals The value of the critical z statistic at a given significance level can be found from the normal distribution tables. With two tails your risk is equally split between the two tails, so you have 2.5% in the left tail, 2.5% in the right tail, and 95% in between. Table 5.6 shows an example of this with the critical value z identified for a z value of 1.96 of the probability P(Z ≥ z) = 2.5% = 0.025 (right-hand tail in Figure 5.19). Z 0.00 0.01 0.02 0.03 0.04 0.05 0.0 0.500 0.496 0.492 0.488 0.484 0.480 0.1 0.460 0.456 0.452 0.448 0.444 0.440 0.2 0.421 0.417 0.413 0.409 0.405 0.401 1.8 0.036 0.035 0.034 0.034 0.033 0.032 1.9 0.029 0.028 0.027 0.027 0.026 0.026 2.0 0.023 0.022 0.022 0.021 0.021 0.020 2.1 0.018 0.017 0.017 0.017 0.016 0.016 Table 5.6 Calculation of z when P(Z ≥ 1.96) = 0.025

0.06 0.476 0.436 0.397 0.031 0.025 0.020 0.015
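As an aside (not part of the authors' workbook), the same critical value can be obtained from software instead of the printed table; the short Python sketch below assumes the SciPy library is available.

from scipy.stats import norm

alpha = 0.05                      # two-tailed 95% confidence
z_crit = norm.ppf(1 - alpha / 2)  # upper 2.5% point of the standard normal
print(round(z_crit, 2))           # 1.96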

From Table 5.6, the critical z value is 1.96 when P(Z ≥ z) = 0.025. Given that we have two tails, the critical z value is ±1.96. The estimate of the standard error for the sample means is:

σ̂X̄ = s/√n = 0.2290/√25 = 0.0458

Then our interval endpoints are:

µ1 = X̄ − Z × s/√n = 4.9920 − 1.96 × 0.0458 = 4.9022
µ2 = X̄ + Z × s/√n = 4.9920 + 1.96 × 0.0458 = 5.0818

Figure 5.21 illustrates the 95% confidence interval for the population mean.


Figure 5.21 Shaded region represents 95% confidence interval

Thus, the 95% confidence interval for µ is 4.9920 ± 1.96 × 0.0458, that is, from 4.9022 to 5.0818. Another way to say the same thing is to conclude that there is a 5% risk that the population mean is not between 4.9022 and 5.0818.

Excel solution

Figures 5.22 and 5.23 illustrate the Excel solution (first 10 out of 25 observations illustrated).

Figure 5.22 Example 5.3 Excel solution


Figure 5.23 Example 5.3 Excel solution continued

From Excel, the 95% confidence interval for µ is 4.9022 to 5.0818.

Note: You can also solve this problem using the Excel CONFIDENCE.NORM function: =CONFIDENCE.NORM(alpha, standard_dev, size), where:
1. Alpha is the significance level used to compute the confidence level. The confidence level equals 100*(1 − alpha)%; in other words, an alpha of 0.05 indicates a 95 per cent confidence level.
2. Standard_dev is the population standard deviation for the data range and is assumed to be known. Please note that in the example above this value is not known, but you could replace it with the sample standard deviation to obtain the results given in Figure 5.23.
3. Size is the sample size, n.

SPSS solution

There is no built-in SPSS solution, though there are workarounds. We will show some possible ways to use SPSS later in the text.
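If you prefer code to a spreadsheet, the calculation behind the Excel solution can be sketched in Python as follows. This is a minimal illustration rather than the authors' workbook; it assumes SciPy is installed and uses the summary statistics quoted above.

from math import sqrt
from scipy.stats import norm

x_bar, s, n = 4.9920, 0.2290, 25            # summary statistics from Example 5.3
se = s / sqrt(n)                            # standard error of the mean
z = norm.ppf(0.975)                         # critical z for 95% confidence (about 1.96)
lower, upper = x_bar - z * se, x_bar + z * se
print(round(se, 4), round(lower, 4), round(upper, 4))   # 0.0458 4.9022 5.0818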

Interval estimate of the population mean where σ is not known and the sample is smaller than 30 observations

In the previous example we calculated the point and interval estimates when the population was normally distributed, and the population standard deviation was known. In most cases the population standard deviation is unknown, and we have to use the sample value to estimate the population value, with associated errors. With the population standard deviation unknown, the population mean estimate is still given by the value of the sample mean, but what about the interval estimate? In the previous example the sample mean and sample size were used to provide this interval. However, in this new case we have an extra unknown that has to be estimated from the sample data in order to find the interval estimate of a population mean when the sample size is small. This is often the case in student research projects: they handle small sample sizes and the population standard deviation is unknown. The question then becomes how we can create interval estimates when the population standard deviation is unknown and the sample sizes are small. This question was answered by W. S. Gossett, who determined the distribution of the mean when divided by an estimate of the standard error. The resulting distribution is called Student's t distribution. If the random variable X is normally distributed, then the test statistic has a t distribution with n – 1 degrees of freedom and is defined by equation (5.21):

t_df = (X̄ − µ) / (s/√n)    (5.21)

As we can see, the above equation has the same form as equation (4.10), although the t distribution is not the same as the Z distribution. As we already know, the t distribution is very similar to the normal distribution when the estimate of variance is based on many degrees of freedom (df = n – 1), but has relatively more scores in its tails when there are fewer degrees of freedom. The t distribution is symmetric, like the normal distribution, but with relatively more probability in its tails (it is leptokurtic). As a reminder, Figure 5.24 compares the t distribution with 5 degrees of freedom and the standard normal distribution.

Figure 5.24 Normal versus t distribution

Since the t distribution has relatively heavy tails, the percentage of the distribution within ±1.96 standard deviations of the mean is less than the 95% for the normal distribution. However, if the number of degrees of freedom (df) is large (df = n – 1 ≥ 30) then there is very little difference between the two probability distributions. The sampling error for the t distribution is given by the sample standard deviation (s) and sample size (n), as defined by equation (5.22), which has the same form as equations (4.8) and (5.9):

σX̄ = σ̂/√n = s/√n    (5.22)

The degrees of freedom and the interval estimate are given by equation (5.23):

df = n – 1    (5.23)

This yields the interval estimate for the population mean, defined as in equation (5.24):

X̄ − t_df × s/√n ≤ µ ≤ X̄ + t_df × s/√n    (5.24)

Example 5.4 Calculate a 95% confidence interval for the data presented in Table 5.7 (assume the data are normally distributed).

X: 12.00, 13.54, 12.22, 10.99, 10.09, 10.82, 11.62, 12.12, 12.49, 9.95, 12.57, 11.50, 10.22, 11.98, 10.98, 12.61, 12.40, 11.23

Table 5.7

For this data, we can calculate the summary statistics:
• Sample mean, X̄ = 11.6294
• Sample size, n = 18
• Sample standard deviation, s = 0.9858

The value of the critical t statistic at a given significance level and degrees of freedom can be found from Student's t distribution tables. Table 5.8 shows an example of this, with the critical t value identified for a right-hand tail probability P(T ≥ t) = 2.5% = 0.025 (see Figure 5.25 below), a two-tailed alpha = 2 × 0.025 = 0.05, and degrees of freedom n – 1 = 18 – 1 = 17. From the table, the critical t value is 2.11 when P(T ≥ t) = 0.025 for 17 degrees of freedom.

ALPHA   50%    20%    10%    5%     2.50%
df      0.5    0.20   0.1    0.05   0.025
1       1.00   3.08   6.31   12.71  25.45
2       0.82   1.89   2.92   4.30   6.21
14      0.69   1.35   1.76   2.14   2.51
15      0.69   1.34   1.75   2.13   2.49
16      0.69   1.34   1.75   2.12   2.47
17      0.69   1.33   1.74   2.11   2.46
18      0.69   1.33   1.73   2.10   2.45
19      0.69   1.33   1.73   2.09   2.43
20      0.69   1.33   1.72   2.09   2.42

Table 5.8 Calculation of t for P(T ≥ t) = 0.025 with 17 df

The confidence interval is then found using equations (5.22) and (5.24). The standard error is:

σX̄ = σ̂/√n = s/√n = 0.9858/√18 = 0.2323
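As a hedged aside (not taken from the book), the critical t value used below can also be obtained in Python with SciPy rather than read from Table 5.8.

from scipy.stats import t

df = 18 - 1                          # degrees of freedom for Example 5.4
t_crit = t.ppf(1 - 0.05 / 2, df)     # two-tailed 95% critical value
print(round(t_crit, 4))              # 2.1098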

Then, we can calculate the confidence interval using equation (5.24).

Lower confidence interval boundary, Lower CI:
Lower CI = X̄ − t_df × s/√n = 11.6294 − 2.1098 × 0.2323 = 11.1392

Upper confidence interval boundary, Upper CI:
Upper CI = X̄ + t_df × s/√n = 11.6294 + 2.1098 × 0.2323 = 12.1197

Figure 5.25 shows the 95% confidence interval for the population mean.


Figure 5.25 Shaded region represents 95% confidence interval for µ

Thus, the 95% confidence interval is from 11.14 to 12.12. To put it another way, we are 95% confident that, based on this small sample, the true population mean is between 11.14 and 12.12.

Excel solution

Method 1. Excel formula method

Figures 5.26 and 5.27 illustrate the Excel solution using equations (5.17)–(5.20).

Figure 5.26 Example 5.4 Excel solution

Figure 5.27 Example 5.4 Excel solution continued From Excel, the 95% confidence interval for µ is 11.14 to 12.12. Method 2. Excel function method Figure 5.28 illustrates the Excel solution using Excel functions. The results are the same as for the formula method.

Figure 5.28 Example 5.4 Excel solution continued

From Excel, the 95% confidence interval for µ is 11.14 to 12.12. In this example we used the Excel CONFIDENCE.T function: =CONFIDENCE.T(alpha, standard_dev, size), where:
1. Alpha is the significance level used to compute the confidence level. The confidence level equals 100*(1 − alpha)%; in other words, an alpha of 0.05 indicates a 95 per cent confidence level.
2. Standard_dev is the population standard deviation for the data range and is assumed to be known.
3. Size is the sample size.

Given the population standard deviation (Standard_dev) is unknown, we have replaced this value with the sample standard deviation to obtain the same results as illustrated in Figure 5.28. SPSS solution Enter data into SPSS

Figure 5.29 Example 5.4 SPSS data Select Analyze > Descriptive Statistics > Explore Transfer Sample_data to the Dependent List: box

Figure 5.30 SPSS Explore Choose Statistics

Page | 321

Figure 5.31 SPSS Explore Statistics option Click on Continue. Click OK SPSS output The output is shown in Figure 5.32

Figure 5.32 SPSS solution continued

From Figure 5.32, the point estimate for the population mean is 11.63 and the 95% estimate interval is from 11.14 to 12.12. These results agree with the Excel solution shown in Figures 5.26 and 5.27. We are 95% confident that the population mean is contained within the interval from 11.14 to 12.12.
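For completeness, the whole of Example 5.4 can also be reproduced from the raw data in Table 5.7 with a few lines of Python. This is an illustrative sketch only (it is not one of the book's supporting files) and it assumes NumPy and SciPy are available.

import numpy as np
from scipy import stats

x = np.array([12.00, 13.54, 12.22, 10.99, 10.09, 10.82, 11.62, 12.12, 12.49,
              9.95, 12.57, 11.50, 10.22, 11.98, 10.98, 12.61, 12.40, 11.23])
n = len(x)
x_bar = x.mean()
se = x.std(ddof=1) / np.sqrt(n)                       # standard error of the mean
lower, upper = stats.t.interval(0.95, n - 1, loc=x_bar, scale=se)
print(round(x_bar, 4), round(lower, 2), round(upper, 2))   # 11.6294 11.14 12.12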

Interval estimate of a population proportion

If the population is normally distributed or the sample size is large (central limit theorem, n ≥ 30) then the confidence interval for a proportion is calculated by using equation (5.11) and transforming equation (5.16) to give equation (5.25), where the population proportion, π, is estimated from the sample proportion, ρ:

ρ − Z σρ ≤ π ≤ ρ + Z σρ    (5.25)

which can also be written as

ρ − Z × √(ρ(1 − ρ)/n) ≤ π ≤ ρ + Z × √(ρ(1 − ρ)/n)

Example 5.5 Fit a 95% confidence interval for the population proportion given the sample proportion is 0.4 and the sample size is 38.

The confidence interval is given by equation (5.25). From the data we can state:
1. Sample proportion ρ = 0.4.
2. Sample size n = 38.
3. Population proportion distribution unknown but sample size large.
4. Given point 3, we can apply the central limit theorem to state ρ ∼ N(π, σρ²).
5. Z for 95% confidence is ±1.96 (Table 5.6).

Substituting these values into equations (5.10), (5.11), and (5.25) gives:

Estimate of the population proportion via equation (5.10):
π̂ = ρ = 0.4

Estimate of the standard error of the proportion via equation (5.11):
σ̂ρ = √(ρ(1 − ρ)/n) = √(0.4 × (1 − 0.4)/38) = 0.07947

Estimate the 95% confidence interval for the population proportion using equation (5.25):

Lower confidence interval (LCI), Z = −1.96:
LCI = ρ − Z × √(ρ(1 − ρ)/n) = 0.4 − 1.96 × 0.07947 = 0.2442
The lower 95% confidence boundary is 0.2442.

Upper confidence interval (UCI), Z = +1.96:
UCI = ρ + Z × √(ρ(1 − ρ)/n) = 0.4 + 1.96 × 0.07947 = 0.5558
The upper 95% confidence boundary is 0.5558.

Thus, the 95% confidence interval for the population proportion, π, is 0.4 ± 1.96 × 0.07947, that is, from 0.2442 to 0.5558. Figure 5.33 illustrates the 95% confidence interval for the population proportion.

Figure 5.33 Shaded region represents 95% confidence interval

We are 95% confident that the true population proportion is contained within the interval from 0.2442 to 0.5558.

Concept of risk

Let us reiterate one point. In business, people often use the concept of risk. A risk could be defined in relation to the confidence interval. In the simplest terms, risk and the confidence interval are two opposite sides of the same phenomenon. If we state with 95% confidence that the true value is, for example, between 4 and 6, then implicitly we also state that there is a 5% risk that the true value is not between 4 and 6.

Excel solution

Figure 5.34 illustrates the Excel solution.

Figure 5.34 Example 5.5 Excel solution SPSS solution There is no built-in SPSS solution but you could solve this problem in SPSS by using the SPSS transform method to calculate the individual statistics.
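Because there is no built-in SPSS routine, the proportion interval is also easy to reproduce in code. The following Python sketch (an illustration under the assumption that SciPy is installed, not part of the book's materials) repeats the Example 5.5 calculation.

from math import sqrt
from scipy.stats import norm

rho, n = 0.4, 38                      # sample proportion and sample size
se = sqrt(rho * (1 - rho) / n)        # standard error of the proportion
z = norm.ppf(0.975)                   # about 1.96 for 95% confidence
lower, upper = rho - z * se, rho + z * se
print(round(se, 5), round(lower, 4), round(upper, 4))   # 0.07947 0.2442 0.5558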

Check your understanding X5.7

The standard deviation for a method of measuring the concentration of nitrate ions in water is known to be 0.05 ppm. If 100 measurements give a mean of 1.13 ppm, calculate the 90% confidence limits for the true mean.

X5.8

In trying to determine the sphere of influence of a sports centre, a random sample of 100 visitors was taken. This indicated a mean travel distance (d) of 10 miles with a standard deviation of 3 miles. Calculate a 90% confidence interval for the mean travel distance (D).

X5.9

The masses, in grams, of 13 ball bearings taken at random from a batch are: 21.4, 23.1, 25.9, 24.7, 23.4, 24.5, 25.0, 22.5, 26.9, 26.4, 25.8, 23.2, 21.9. Calculate a 95%

confidence interval for the mean mass of the population, supposed normal, from which these masses were drawn.

5.4 Calculating sample sizes

If you take a closer look at equation (5.20), you will notice that we can control the width of the confidence interval by determining the sample size necessary to produce narrow intervals. For example, if we assume that we are sampling a mean from a population that is normally distributed then we can modify equation (4.10) to calculate an appropriate sample size. Equation (4.10) stated that:

Z = (X̄ − µ) / (σ/√n)

This can be re-written to give equation (5.26):

X̄ − µ = Z × σ/√n    (5.26)

One way to look at equation (5.26) is to say that X̄ − µ is in fact an interval within which our estimate should fall. In this case we can rewrite the above equation as:

Interval estimate = 2 × Z × σ/√n    (5.27)

Why did we insert the number 2 into equation (5.27)? The graph in Figure 5.35 illustrates the point.

Figure 5.35 Relationship between sample mean and confidence interval

Another way to look at equation (5.27) is to say that X̄ − µ is effectively an error between the estimated mean value and the true mean value. In this case, from equation (5.27) we can specify this error as:

E = Z × σ/√n    (5.28)

You can also think of E in equation (5.28) as a margin of error. If, for example, you would like your results to be within 10% accuracy, then the margin of error is expressed in decimal numbers as 0.1. If the desired accuracy is 5%, then the margin of error is expressed as 0.05. Either way, a margin of error applies to a given confidence level that is determined by Z. If X̄ − µ is effectively an error E, we can now rearrange equation (5.28) into:

√n = Zσ/E

This gives the sample size n, as per equation (5.29):

n = (Zσ/E)²    (5.29)

Figure 5.36 illustrates the confidence interval, margin of error, and sample size.

Figure 5.36 Confidence interval for the population mean µ

Example 5.6 A researcher working in the quality control department of a brewery wishes to determine the sample size where the margin of error is no more than 0.05 units of alcohol and with 98% confidence. Historical data provided to the researcher indicate that the population data are normally distributed with a population standard deviation of 0.2 units of alcohol. Information provided:

• Population data normally distributed
• Confidence interval = 98%
• From Table 5.6, Zcri for 98% CI = ±2.326347874
• Population standard deviation = 0.2 units
• Required margin of error E = 0.05 units

The sample size can now be calculated using equation (5.29):

n = Z²σ²/E² = (2.3263…)² × 0.2² / 0.05² = 86.59031…

To meet the requirements of the researcher, a sample size of 87 is required. Excel solution Figure 5.37 illustrates the Excel solution

Figure 5.37 Example 5.6 Excel solution

From Excel, the sample size to achieve the result would be 87.

SPSS solution

There is no built-in SPSS solution but you could solve this problem in SPSS by using the SPSS transform method to calculate the individual statistics.

Impact of margin of error and confidence interval on the sample size

To see what impact the selection of the margin of error and confidence interval has on the sample size, we will run a small simulation. Table 5.9 illustrates how the sample size changes with differing margins of error and confidence intervals. By keeping the same margin of error, but changing the confidence interval, we can see how the sample size changes. Effectively, in this example, we need to increase the sample size by almost two and a half times if we want our confidence interval to increase from 90% to 99% (see Table 5.9).

Margin of error    0.05   0.05   0.05   0.05
Conf. Interval     90%    95%    98%    99%
Sample size        44     62     87     107

Table 5.9 Sample size changes if margin of error and confidence interval changes
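The same sample-size calculation, equation (5.29), is easy to script. The Python sketch below is an illustration only (the function name is ours, and SciPy is assumed to be available); it reproduces Example 5.6 and the values in Table 5.9.

import math
from scipy.stats import norm

def sample_size_mean(sigma, margin, confidence):
    """Equation (5.29): n = (Z * sigma / E)^2, rounded up to a whole unit."""
    z = norm.ppf(1 - (1 - confidence) / 2)
    return math.ceil((z * sigma / margin) ** 2)

print(sample_size_mean(0.2, 0.05, 0.98))    # 87, as in Example 5.6
print([sample_size_mean(0.2, 0.05, c) for c in (0.90, 0.95, 0.98, 0.99)])
# [44, 62, 87, 107], matching Table 5.9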


Let us now keep the confidence interval constant, at 90%, but change the margin of error. Table 5.10 shows the sample size required.

Margin of error    0.15   0.10   0.05   0.01
Conf. Interval     90%    90%    90%    90%
Sample size        5      10     44     1083

Table 5.10 How sample size changes when margin of error changes but confidence interval is constant

As we can see, the margin of error has a tremendous impact on the sample size. It is particularly important to emphasise here that the margin of error depends very little on the size of the population from which we are sampling if the sampling fraction is less than 5% of the total population. For very large populations, the impact is almost negligible. The same equation (4.10) can be used to extract n, as we did in equation (5.29), or E or Z, as below:

E = Z × s/√n    or    Z = E × √n/s

If we solve the equation for E, this tells us what the size of our error is going to be if we use a level of confidence (Z) and a given sample size (n). If we solve the equation for Z, this tells us the expected level of confidence given the size of our error (E) and a given sample size (n).

A note to remember: the error margin is expressed in the same units as the original data values, and so is the standard deviation. If the data units are kg, for example, then the mean value, the standard deviation and the error margin are also in kg. If the mean value and the standard deviation are percentages, for example, then the error margin is also expressed in percentages. However, if your original values and the standard deviation are in kg, for example, and you would like the error margin to be 10% of the target value expressed in kg, then you need to multiply the target value by 0.1, which results in the correct error margin expressed in kg as units.

Sample size for the proportion estimate

If we do not have the standard deviation of the population, which is very often the case, and we use proportions, then there is another equation to determine the size of the sample. Let us assume that p is the proportion of some variable and q = 1 – p. In this case, the sample size is calculated as:

n = p q (Z/E)²    (5.30)

Example 5.7 Suppose that we do not know what the true proportion of people who like Marmite is, which means we will assume the neutral position of 50% (p = 0.5). Thus, q = 1 – 0.5 = 0.5. In other words, there are also 50% of people who do not like Marmite. Let us also assume that we would like 95% confidence in our results, which means that Z = 1.96. And finally, let us say that we will accept an error of 5% (E = 0.05). To calculate the size of the sample needed, we insert these values in equation (5.30):

n = 0.5 × 0.5 × (1.96/0.05)² = 384.16

A sample of 384 people will give us results within the 95% confidence limit and will not generate an error that is greater than ±5%. Equation (5.30) will work for very large populations, but if the population is small (and denoted by N), then the sample size n must be corrected using equation (5.31), called Cochran's formula correction:

n1 = n / (1 + (n − 1)/N)    (5.31)
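Equations (5.30) and (5.31) can be wrapped up as two small functions. The Python sketch below is illustrative only (the function names are ours, not the book's), and it is checked against the values quoted in Example 5.7 and in Example 5.8 below.

def sample_size_proportion(p, z, e):
    """Equation (5.30): n = p * q * (Z / E)^2."""
    return p * (1 - p) * (z / e) ** 2

def cochran_correction(n, population_size):
    """Equation (5.31): correct n for a small population of size N."""
    return n / (1 + (n - 1) / population_size)

print(round(sample_size_proportion(0.5, 1.96, 0.05), 2))    # 384.16 (Example 5.7)
print(round(cochran_correction(1032.54, 500), 2))           # 337.09 (Example 5.8, N = 500)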

Example 5.8 We will use the same data as in Example 5.7 but let us make two changes. Let us assume that we got survey results from somewhere and we know that only 41% of the people like Marmite (p = 0.41 and q = 0.59). Let us also assume that we would like to apply these numbers and survey the population of the first year at our business school, and this population is only 500 (N = 500). What should be the sample size we need to take in this case, still assuming a 95% confidence limit and a 3% margin of error (E = 0.03)? By inserting the new numbers into equation (5.30) we get:

n = 0.41 × 0.59 × (1.96/0.03)² = 1032.54 ≈ 1033

We use this value of n in equation (5.31) to get the new estimate of the sample size:

n1 = n / (1 + (n − 1)/N) = 1033 / (1 + (1033 − 1)/500) = 337.09

Now we have the corrected sample size, which indicates that we must include 337 people in our survey and that we can expect the results to be within the accepted margin of error. As before, the same equation (4.10) can be used to extract n for estimating the proportions, as we did in equation (5.30), or E or Z:


E = Z × √(pq/n)    or    Z = E / √(pq/n)

If we solve the equation for E, this tells us what the size of our error is going to be if we use a particular proportion (p) at a level of confidence (Z) and a given sample size (n). If we solve the equation for Z, this tells us the expected level of confidence given the size of our error (E), a given sample size (n) and the expected proportion (p).

Check your understanding X5.10 A business analyst has been requested by the managing director of a national supermarket chain to undertake a business review of the company. One of the key objectives is to assess the level of spending of shoppers who historically have weekly mean levels of spending of €168 with a standard deviation of €15.65. Calculate the size of a random sample to produce a 98% confidence interval for the population mean spend, given that the margin of error is €3. Is the sample size appropriate given the practical factors?

Chapter summary In this chapter we have explored methods that can be used to provide point and interval estimates for population parameters. We learned how to estimate the population mean and population proportion from the sample mean and the sample proportion, respectively. Once we learned how to provide point estimates, we extended these principles to interval estimates. What we learned was that we can assign a probability that these point estimates will reside in an interval. To this effect we learned how to make interval estimates for the population mean and a population proportion. We also learned that interval estimates, although closely related to confidence intervals, are not the same thing. One provides the interval of values where the population parameter is likely to be (interval estimate) and the other provides the probability (confidence interval) that the true value is in this interval. And finally, we learned how to handle estimates if the samples are small, i.e. less than 30 observations. Table 5.11 summarises the various equations and formulae that can be used to calculate certain population and sample parameters, as well as how they can be used to estimate the population parameters from a sample. Statistic Mean

Population ∑ 𝑓𝑋 𝜇= ∑𝑓

Standard deviation 𝜎=√ Standard error for 𝑥̅

∑( 𝑋 − 𝜇)2 𝑁

Sample > 30 ∑ 𝑓𝑥 𝑥̅ = ∑𝑓 ∑( 𝑥 − 𝑥̅ )2 𝑛−1 𝑠 𝑆𝐸𝑥̅ = √𝑛

𝑠=√

Sample < 30 ∑ 𝑓𝑥 𝑥̅ = ∑𝑓 ∑( 𝑥 − 𝑥̅ )2 𝑛−1 𝑠 𝑆𝐸𝑥̅ = √𝑛

𝑠=√

Page | 331

z-value / t-value

(𝑥 − 𝜇) 𝜎 𝑥 = 𝑧𝜎 + 𝜇

x or 𝑥̅ -value Proportion

(𝑥̅ − 𝜇) 𝑆𝐸𝑥̅ 𝑥̅ = 𝑧 𝑆𝐸𝑥̅ + 𝜇

𝑧=





𝑆𝐸𝜌 = √

Expected value

𝑡=



Standard error for  Probability p given x or z

(𝑥̅ − 𝜇) 𝑆𝐸𝑥̅ 𝑥̅ = 𝑡 𝑆𝐸𝑥̅ + 𝜇

𝑧=

𝜌(1 − 𝜌) 𝑛

𝑆𝐸𝜌 = √

𝜌(1 − 𝜌) 𝑛

For both the population and the samples >30, p is found from either the tables or from =NORM.DIST() if you know x, or =NORM.S.DIST() if you know z. For samples <30, p is found from the Student's t distribution tables or the equivalent Excel t distribution functions.

H0: µ = 100
H1: µ ≠ 100, or H1: µ < 100, or H1: µ > 100

The first line of this shorthand reads 'Our null hypothesis is that the mean value is equal to 100'. The second line shows three different options: 'Our alternative hypothesis is that the mean value is not equal to 100', 'Our alternative hypothesis is that the mean value is less than 100' or 'Our alternative hypothesis is that the mean value is greater than 100'. We typically use only one of them, depending on what we are testing.

Hypothesis testing has its own language. What we mean by this is that you will often see phrases such as 'the evidence suggests that we reject the null hypothesis' or 'the evidence suggests that we fail to reject the null hypothesis'. It would be incorrect to use the phrase 'accept the null hypothesis'. Why this convoluted language?

The way the test philosophy is used implies that you can never be completely sure that the null hypothesis is true. If you are ‘not rejecting’ the null hypothesis, this does not mean that it is true. It just means that you are still ‘retaining’ it, as there is some small possibility that it might be true. In fact, you can think of these tests as a method to help you collect evidence for rejecting or not rejecting the null hypothesis. If your evidence suggests that you cannot reject the null hypothesis, this means that you do not have enough evidence to do so. It does not mean that it is true, which is the reason why we should not use the word ‘accept H0’, but rather ‘evidence suggests we do not reject H0’.

One- and two-tailed tests

Depending on how we state the hypothesis to test, we will have to use either a one-tailed test or a two-tailed test. A typical two-tailed test states the null hypothesis as H0: µ = 100 and H1: µ ≠ 100, for example. The symbol ≠ determines that we need a two-tailed test. If the hypotheses are stated as H0: µ = 100 and H1: µ > 100, or H0: µ = 100 and H1: µ < 100, for example, then a one-tailed test is appropriate. The symbol '<' (less than) or '>' (greater than) will determine whether we will use a lower-tail or an upper-tail test. Whether you have a one- or two-tailed test will be important because of the rejection region. The rejection region is in the tail(s) of the distribution. The exact location is determined by the way H1 is expressed. If H1 simply states that there is a difference, for example H1: µ ≠ 100, then the rejection region is in both tails of the sampling distribution with areas equal to α/2. For example, if α (the level of significance) is set at 0.05 then the area in both tails will be 0.025 (see Figure 6.1). This is known as a two-tailed test.

Figure 6.1 Two-tailed tests and test rejection areas (shaded)

A two-tailed test is a statistical hypothesis test in which the values for which we can reject the null hypothesis, H0, are in both tails of the probability distribution. If H1 states that there is a direction of difference, for example µ < 100 or µ > 100, then the rejection region is in one tail of the sampling distribution and we have a one-tailed test, the tail being defined by the direction of the difference. If we have 'less than' (H1: µ < 100, for

example), the left-hand tail is used, and this is known as a lower-tailed test (see Figure 6.2).

Figure 6.2 (Lower) one-tailed test and test rejection area (shaded)

Conversely, if we have 'greater than' (H1: µ > 100, for example), the right-hand tail is used, and this is known as an upper-tailed test (see Figure 6.3).

Figure 6.3 Upper one-tailed test and test rejection area (shaded)

One- and two-sample tests

The hypothesis testing procedure will vary, depending on how many samples we use, whether they are dependent on each other or independent, and what kind of conclusions we want to draw. For each of these options, a slightly modified test is used. Although the logic and the procedures are almost identical, the formulae used are somewhat different. We will now describe the different tests available to us. How we form the hypotheses and how we go about executing the test will depend, among other things, on whether we are dealing with one or multiple samples. All the tests we will cover in this chapter are applicable to only one or two samples. A one-sample test involves testing a sample parameter (e.g. the mean value, variance or proportion) against a perceived population value to ascertain whether there is a significant difference between the sample statistic and the population parameter. For a two-sample test, we test one sample against another to ascertain whether there is a significant difference between them and, consequently, whether the two samples represent different populations. When we talk about 'one sample' we mean that the results associated with one group of observations are compared with the population results. When we talk about 'two samples' we mean that the results are compared between two groups of results (products, individuals, groups, or anything similar). Typical examples are:

• One individual produces X of product as opposed to the average amount Y of the whole group. Is he 'in line' with the rest of the group?
• Are the results of the sales team in region A comparable with the rest of the company, or is the team underperforming?
• The quality in one factory seems to be different from that in another. Should we be concerned or is it within our overall quality assurance standards?
• After attending a sales course, a team's sales effectiveness seems to have gone up. Is this by chance or does the training really have an impact on our sales force?
• You are planning a promotion campaign and want to give certain products for free if a customer buys some other products. Can you afford this and how should you structure the 'bundle'?

Independent and dependent samples/populations When we come to using the tests that apply to two samples/populations, most of them assume that the samples come from two independent populations. In some cases, the two populations are dependent. These two different scenarios can be defined as follows. Two populations are said to be independent if the measured values of the items observed in one population do not affect the measured values of the items observed in the other population. For example, consider the following two populations: all unmarried men aged 28 in Wales (population A) and all married men aged 28 in Wales (population B). The variable we are interested in measuring in these men is the amount of weight they have gained/lost since they were 18 years old. In this case, we would say that the two populations are independent, because the amount of weight gained by an individual in population A will not affect the amount of weight gained by an individual in population B (and vice versa). Two populations are dependent if the measured values of the items observed in one population directly affect the measured values of the items observed in the other population. Typically, the items in two dependent populations are paired, in the sense that each item in one population is directly linked to a corresponding item in the other population. For example, in a study to determine the effectiveness of sales training, we define population A as the population of salespeople before the training course and population B as the population of salespeople after the training course.


The variable being measured is the sales effectiveness (measured as a ratio of open quotations to closed orders). In this example, each item in population A is directly linked to each item in population B (the individual before training and the same individual after training). Clearly, the sales effectiveness value after the training (population B) is somewhat reliant on the original value before the training (population A), that is, they are dependent.

Sampling distributions from different population distributions

In Chapter 4, we explored the central limit theorem, which states that the sum of a number of independent and identically distributed random variables with finite variances will tend to a normal distribution as the number of variables grows. Thus, even though we might not know the shape of the distribution where our data comes from, the central limit theorem says that we can treat the sampling distribution as if it were normal. Furthermore, whether we sample from a normal or a non-normal distribution, the samples could be either large or small. And finally, for every case the standard deviation is either known or not known. All these sampling options are depicted below in Figure 6.4:

Figure 6.4 Different sampling scenarios

We will now describe briefly how to handle each of these cases.

Sampling from a normal distribution, large sample and known σ (AAA)

We already know that if the observed sample data X1, X2, …, Xn are (i) independent, (ii) have a common population mean µ, and (iii) have a common variance σ² (according to equation (6.1)), then the sample mean value X̄ has mean µ and variance σ²/n (equation (6.2)).

X ∼ N(µ, σ²)

(6.1)

Page | 341

Then the sampling distribution of the sample means is: 2

𝜎 𝑋̄ ∼ 𝑁 (𝜇, 𝑛 )

(6.2)

And the corresponding standardised Z equation is: Z=

̅− μ X

(6.3)

σ √n

Sampling from a non-normal distribution, large sample size and known  (BAA) Suppose now that the observed sample data X1, X2, …, Xn are (i) independent, (ii) have a common population mean , and (iii) unknown common variance 2. For populations that are not normally distributed we can make use of the central limit theorem. Because of the central limit theorem, many test statistics are approximately normally distributed for large samples (n ≥30). In this case, the unknown population standard deviation  would be replaced by the sample standard deviation s (equation (6.4)). Then the sampling distribution of the sample means is the same as equation (6.2), with the exception that we use s instead of : 2

𝑠 𝑋̄ ∼ 𝑁 (𝜇, 𝑛 )

(6.4)

And the corresponding standardised Z equation is the same as equation (6.3), with the exception that we use s instead of : Z=

̅− μ X

(6.5)

s √n

Sampling from a normal distribution, small sample size and unknown  (ABB) If the observed sample data X1, X2, …., Xn are (i) independent, (ii) have a common population mean , and (iii) unknown common variance 2, what do we do if the sample size is less than 30 and we do not know the population standard deviation? If the population data are normally distributed, then we can replace the normal distribution with Student’s t distribution with df = n – 1 degrees of freedom (equation (6.6)). Then the sampling distribution of the sample means is: 2

𝑠 𝑋̄ ∼ 𝑡𝑑𝑓 (𝜇, 𝑛 )

(6.6)

And the corresponding standardised Z equation is:

Page | 342

t =

̅− μ X s √n

(6.7)

Why a sample size of 30? This is a historical issue from the time before we had appropriate software packages to do the statistical analysis and a distinction would be made between small-sample and large-sample versions of t tests. The small- and large-sample versions did not differ in how we calculated the test statistic but in how the critical test statistic was obtained. For the small-sample test, one used the critical value of t, from a table of critical t-values. For the large-sample test, one used the critical value of z, obtained from a table of the standard normal distribution. The other difference is that to calculate t-values, we need one more piece of information, the degrees of freedom. Today we can use statistical software to carry out t tests, such as Excel and SPSS, which will print out for a given test the value of the test statistic and test statistic p-value. When we solve statistical hypothesis problems in this and later chapters, we will use the manual/critical tables method, the Excel method, and where possible provide the SPSS solution. As a reminder, Figure 6.5 shows a comparison between the normal and t distribution, where the number of degrees of freedom increases from 2 to 30. We observe that the difference between the normal and t distribution decreases as the number of degrees of freedom increases. In fact, very little numerical difference exists between the normal and t distributions when we have sample sizes of at least 30.

Figure 6.5 Comparison between normal and t distribution

Page | 343

Sampling from a normal distribution, large sample and unknown  (AAB) In this case, we are sampling from a normal distribution with an unknown  but large sample size. Therefore, the sampling distribution is given by equation (6.4).

Sampling from a normal distribution, small sample and known  (ABA) In this case, we are sampling from a normal distribution with a known  and small sample size. Therefore, the sampling distribution is given by equation (6.4).

Sampling from a non-normal distribution, large sample and unknown  (BAB) In this case, we are sampling from a non-normal distribution with an unknown  and large sample size. Therefore, the sampling distribution is given by equation (6.4) or equation (6.6).

Sampling from a non-normal distribution, small sample and known  (BBA) In this case, we are sampling from a non-normal distribution with a known  and small sample size. In this situation, we would use a non-parametric method described in Chapter 8.

Sampling from a non-normal distribution, small sample and unknown  (BBB) In this case, we are sampling from a non-normal distribution with a unknown  and small sample size. In this situation, we would use a non-parametric method described in Chapter 8.

Check your understanding State the null hypothesis for the following statements: X6.1

The incidence of spending is higher for men than for women.

X6.2

On the playground, more children play on the swings than on the slide.

X6.3

A high score on a quantitative methods module predicts success as a business analyst.

X6.4

Dog Luxuries Ltd sells luxury dog food direct to customers via its e-commerce web site. The company monitors sales in real time and undertakes regular checks on sales. The company would like to check if sales have changed since January 2018 when 5500 sales per month were recorded with a sales value to be

Page | 344

recorded at the end of May 2018 for the May sales period. Use this information to answer the following questions: a. State the null and alternative hypotheses? b. Is the alternative hypothesis one- or two-tailed? c. If you changed the word ‘changed’ to ‘reduced’ how would this affect your answers to (a) and (b)?

6.3 Introduction to hypothesis testing procedure The hypothesis testing procedure can be condensed into just five steps. Some steps are just very simple statements, the other ones contain calculations, and some involve making decisions.

Steps in hypothesis testing procedure We’ll define the five steps that apply to every hypothesis test. Every step is colour-coded to correspond with the same colour area in the spreadsheet to make it easier to identify what is happening during every step. In this section, we will not conduct any tests, just explain how the procedure works. Step 1 Provide the formal hypothesis statements H0 and H1 As we said, the null hypothesis (H0) and alternative hypothesis (H1) are two competing statements concerning the population parameters of interest. These statements determine the entire testing procedure. The general idea is to state H0 so that we can reject it. How H0 and H1 are stated will depend on the problem and the data available. We know that, on average, men are taller than women. How do you prove it? Let’s first state what we want to prove: that men are taller than women. We took a sample of 37 men and measured their height. Also, we took a sample of 41 women and measured their height. (Note that the two groups do not always have to be the same size.) We have population averages for both groups: 1 is the average for men and 2 is the average for women. Our hypotheses could be stated as follows: H0: 1 ≤ 2 (on average men are not taller than women) H1: 1 > 2 (on average men are taller than women) Note that because we want to prove that men are taller, this becomes H1. As we said, H0 is usually stated in anticipation of being rejected. Step 2 Determine the test to apply for the given hypothesis statement The statistical test we apply will depend, for example, on how many samples we have (one or two or more), which statistic are we using (the mean or proportions), how much we know about the population (we know the mean Page | 345

and/or variance, or neither), and the size of the sample or population. We will address all these criteria as we go through specific tests. Figure 7.1 from the following Chapter provides a high-level map of how to select the appropriate test. For the above example with the average height for men and women, we would choose a two-sample test for independent samples, comparing two means and assuming the variances are unequal (a t-test). The following sections in this chapter will explain all these terms. Step 3 Set the level of significance level,  The significance level, usually denoted by the Greek letter alpha (α), is a fixed probability of making the error of ‘rejecting the null hypothesis H0, even though it is true’ (more about that a bit later). This probability is arbitrary and specified by the person conducting the test. It also represents the degree of accuracy that the test should exhibit. For example, if we choose α = 0.05, we are saying that we would like to make the mistake of incorrectly rejecting H0 in at most 5% of cases when this test is conducted. Another way to think about this is to say that the level of significance represents the amount of risk that an analyst will accept when deciding. The use of the significance level is connected with the aim of seeking to put beyond reasonable doubt the notion that the findings are due to chance. The value of  normally takes the value 5% (0.05) or 1% (0.01), but it could be any other value. The value of  depends upon how sure you want to be that your decisions are an accurate reflection of the true population relationship. For the above example with the average height for men and women, the phrase ‘taller than’ implies that we are willing to take a chance on rejecting H0, when it might be true. This means that we are ready to accept that in 5% of cases, we might be rejecting the hypothesis which is true. You can also think about this percentage as the confidence level (1 – α = 1 – 0.05 = 0.95), implying that we can be 95% certain that our conclusion is correct. Step 4 Extract the relevant statistic A test statistic is a quantity calculated from sample data (an equation or a formula). It is used to determine if H0 should be rejected. This statistic is always calculated under the assumption that H0 is true. Most of the time we imply that the data are distributed in accordance with the normal distribution (z) or in accordance with the t distribution (t). The critical value of the test statistic will be denoted by either zcri or tcri (or sometimes zα or zα/2). These values (for the given level of significance α) are compared with the calculated values that we call zcal or tcal. You must be patient as we will explain this shortly. Page | 346

There are two alternative ways to calculate the necessary statistic. We can either use the critical value of the test statistic(zcri), or alternatively the so-called pvalue, which is the probability associated with this calculated statistic. The critical value is a quantile (related to the probability α) from the sampling distribution of the test statistic. Critical values define the range of values for the test statistic for which we do not reject H0, that is, if the observed value of the test statistic lies in the rejection region defined by the critical values, then we reject H0. The p-value is the probability of getting a value of the test statistic as extreme as or more extreme than that observed by chance alone, if the null hypothesis is true. Typically, if the p-value is less than α, we will reject the null hypothesis. For the above example with the average height for men and women, let’s assume that zcri = 1.65 and we calculated zcal = 3.5. We know that zcri corresponds to the α value, which is why it is sometimes called zα. Think of α as the probability level corresponding to zcri = 1.65. As an alternative method, we also calculated the pvalue. Let’s assume that the p-value is 0.002. Think of this as the probability level corresponding to zcal=3.5. Step 5 Make a decision The final step in the hypothesis testing procedure is to decide whether the null hypothesis should be rejected. We do this by deciding whether the test statistic is ‘large’, that is, whether the ‘distance’ between a sample statistic and a hypothesised parameter is large. Based on the value of the test statistic, we can determine if the sample statistic is ‘close’ to the hypothesised parameter (in which case the evidence suggests that we fail to reject H0), or if it is ‘far away’ from the hypothesised parameter (in which case we will reject H0). To facilitate this interpretation, we consider two different approaches that we can use to decide what constitutes ‘far away’ and what is ‘close’. We can use either the critical test statistic value or the p-value. In both approaches, the way in which we make a decision depends on the nature of the alternative hypothesis (whether it is right-sided, left-sided or two-sided). Example 6.1 For the above example with the average height for men and women, if zcri = 1.65 and zcal = 3.5, this means that zcal > zcri. We can therefore reject the null hypothesis H0 at the 0.05 level of significance which is stating that men are not taller than women. In fact, we are 95% confident that we have not made a wrong decision based on our samples. Using the alternative method, because the p-value is 0.0002 and α = 0.05, we can see that the p-value is less than α. Because of that, we can reject the null hypothesis H0. As expected, although we used two alternative methods, they lead us to identical conclusions. The situation can be visually summarised as shown in Figure 6.6.

Page | 347

Figure 6.6 Comparison between α and p-value, and between zcal (zα) and zcri Because we used the phrase ‘taller than’, we had to use a one-tailed test. If we had used the phrase ‘taller or shorter than’, then we would have had to use a two-tailed test. In the one-tailed example above, the yellow area represents all the values that are greater than 1.65 on the x-axis and smaller than 0.05 on the y-axis. This is called the rejection area. Every zcal value, if we use the zcri method, that is to the right of 1.65, falls in the rejection area. This means if zcal > zcri, we must reject H0. The same logic applies if we use the pvalue method. Every value of p that is less than 0.05 (in this case) falls in the rejection area, and H0 must be rejected. Rejecting H0 means that we need to accompany it with a statement that reads something like this: ‘We reject the hypothesis that men are not taller than women at the 5% level of significance’. This implies that we are 95% certain that men are, on average, taller than women. If we used the left-tailed test, the same logic would apply, only we would be looking for the numbers to the right of zcri. The two-tailed test uses similar reasoning, as we will see shortly.

How do we make decisions? As we already demonstrated whether we use the critical test statistic value or p-value makes no difference. These two approaches to hypothesis testing lead to identical conclusions. Intuitively, we can say that the critical test statistic values are thresholds indicating when a test statistic’s value is close to zero or not. If the test statistic exceeds the critical value, its value is ‘far away’ from zero (there is a ‘large distance’ between the sample statistic and the hypothesised parameter). Consequently, we would reject the null hypothesis H0 and accept the alternative hypothesis H1. If the test statistic lies anywhere else, then it is ‘close’ to zero (the ‘distance’ between the sample statistic and the hypothesised parameter is small enough to be considered Page | 348

negligible). Consequently, we would fail to reject the null hypothesis H0 and reject the alternative hypothesis H1. The p-value approach is directly related to the critical value approach, except that instead of basing the decision on the test statistic’s actual value compared to a critical value, we base it on a probability associated with the test statistic, called the p-value. We then compare the p-value to the level of significance 𝛼, specified for the test, and decide between rejecting and failing to reject the null hypothesis, H0. Once again, the nature of the alternative hypothesis will affect the way in which we make the decision. Example 6.2 To illustrate how critical values and rejection regions are defined, we consider the situation in which we test if the population mean is equal to 100 hours, H0: μ = 100. The test statistic selected for this test follows a standard normal distribution for each of these different alternatives if we assume that the sample data used in this test are obtained from a normally distributed population with known variance, σ2. The test statistic is given by equation (6.3): z=

X − 100  n

That is, z follows a standard normal distribution if the conditions stated in the null hypothesis are true (we usually shorten this by simply saying ‘z has a standard normal distributed under H0’). Note that we distinguish between the random variable z and the actual calculated value of the test statistic by denoting the calculated value by zcal. Just to make it easier, we will select α = 0.05 for all our examples. Left-sided test Critical value method, Zcri The hypothesis statement for a left-sided test that tests whether the population mean is 100 hours versus the alternative that it is less than 100 hours is: H0 : μ = 100 H1 : μ < 100 The critical value for this test is then just the lower one-tailed test with the α = 0.05 quantile for the standard normal distribution: it will be –Zα = –Z0.05 = –1.645. This value can be obtained using statistical tables or using a software package such as Excel. If the observed value of the test statistic is smaller than this critical value (if it falls in the rejection region), then we reject H0. In other words, if Zcal < Zα (or, as it is sometime written, Zcal < Zcri), we reject H0. Page | 349

p-value method, p We calculate the p-value using the formula: p = P (Zα < Zcal )

(6.8)

To perform the test, we compare the p-value to the value of the significance level (α). If the p-value is smaller than α, we reject H0. Figures 6.7 and 6.8 show two scenarios for a left-sided test: Scenario I (Figure 6.7) shows the case where the test statistic lies in the rejection region, and scenario II (Figure 6.8) shows the case where the test statistic lies in the ‘fail to reject H0’ region.

Figure 6.7 Left-sided test showing both critical values and p-values – scenario 1: H0:  = x, H1:  < x. If the p-value is less than , reject H0

Figure 6.8 Left-sided test showing both critical values and p-values – scenario 2: H0:  = x, H1:  < x. If the p-value is greater than , fail to reject H0

Page | 350

Right-sided test Critical value method The hypothesis statement for a right-sided test that tests whether the population mean is 100 hours versus the alternative that it is greater than 100 hours is: H0 : μ = 100 H1 : μ > 100 The critical value for this test is the upper α = 0.05 quantile for the standard normal distribution: it will be Zα = Z0.05 = 1.645. Again, this value can be obtained using the Excel function =NORM.S.INV(). If the observed value of the test statistic is greater than this critical value (if it falls in the rejection region), then we reject H0. In other words, if Zcal > Zα, we reject H0. p-value method We calculate the p-value using the formula: p = P (Z𝛼 > Zcal )

(6.9)

To perform the test, we compare the p-value to α. If the p-value is smaller than α, we reject H0. Figures 6.9 and 6.10 below show the two scenarios for a rightsided test: Scenario I (Figure 6.9) shows the case where the test statistic lies in the rejection region, and scenario II (Figure 6.10) shows the case where the test statistic lies in the ‘fail to reject H0’ region.

Figure 6.9 Right-sided test showing both critical values and p-values – scenario 1: H0:  = x, H1:  > x. If the p-value is less than , reject H0

Page | 351

Figure 6.10 Right-sided test showing both critical values and p-values – scenario 2: H0:  = x, H1:  > x. If the p-value is greater than , fail to reject H0 Two-sided test Critical value method The hypothesis statement for a two-sided test used to test whether the population mean is 100 hours versus the alternative that it is not 100 hours is: H0 : μ = 100 H1 : μ ≠ 100 There are two critical values in this test: the lower α/2 = 0.025 quantile and the upper α/2 = 0.025 quantile, both for the standard normal distribution, that is, they will be Zα/2 = – Z0.025 = – 1.96 and Zα/2 = Z0.025 = 1.96. If the observed value of the test statistic is greater than Zα/2 or is less than – Zα/2, then we reject H0. In other words, if Zcal < – Z(α/2), or if Zcal > Z(α/2), we reject H0. The critical values and rejection region are illustrated in Figure 6.11. p-value method If the calculated value of the test statistic Zcal is positive, then the p-value is calculated using the formula: p = P(Z𝛼 < − Zcal ) + P(Z𝛼 > Zcal ) p = 2 × P(Z𝛼 > Zcal )

(6.10)

If the calculated value of the test statistic Zcal is negative, then the p-value is calculated using the formula: p = P(Z𝛼 > − Zcal ) + P(Z𝛼 < Zcal ) Page | 352

p = 2 × P(Z𝛼 < Zcal )

(6.11)

Note that this calculation differs from the previous two because we need to calculate the probability to the left and to the right. Once again, to decide, we compare the p-value to . If the p-value is smaller than , we reject H0. The critical values and rejection region are illustrated in Figures 6.11 and 6.12.

Figure 6.11 Two-sided test showing both critical values and p-values – scenario 1: H0:  = x, H1:  ≠ x. If the p-value is less than , reject H0

Figure 6.12 Two-sided test showing both critical values and p-values – scenario 2: H0:  = x, H1:  ≠ x. If the p-value is greater than , fail to reject H0

Page | 353

Types of errors and statistical power Making decisions always implies that there is a possibility of making an error. When making decisions in hypothesis testing, we can distinguish between two types of possible errors: Type I error and Type II error. Null hypotheses again Earlier we defined the notion of the null hypothesis (H0), which is the hypothesis that the phenomenon to be demonstrated is in fact absent. For example, it is the hypothesis that there is no difference between a population mean and an observed sample mean or no difference between the means (1 = 2) in a t test. The null hypothesis is important because it is what researchers are most often testing in their studies. If they can reject the null hypothesis at a certain alpha level (e.g., p < 0.05), then they can accept as probable whatever alternative hypothesis makes sense, for example, that the population mean is not a predefined value for reasons other than chance (e.g.,  < 100 at p < 0.05) in a t test. Once again, focusing on rejecting the null hypothesis and declaring a ‘significant’ (at p < 0.05) mean difference is how researchers typically proceed. Type I and Type II error Most often, the probability statements in the above example are taken to indicate the probability that the researcher will accept the alternative hypothesis when the null hypothesis is true (see α in top left-hand corner of Table 6.3). That seems to be the primary concern of most researchers in their studies. However, there is another way to look at these issues that involves what are called Type I and Type II errors. From this perspective, α is the probability of making a Type I error (accepting the alternative hypothesis when the null hypothesis is true), and β is the probability of making a Type II error (accepting the null hypothesis when the alternative hypothesis is true). By extension, 1 – α is the probability of not making a Type I error, and 1 – β is the probability of not making a Type II error.

Reject H0 Decision about null hypothesis Fail to reject H0 (H0)

State of Nature (reality) Null hypothesis (H0) is actually: H0 is true H0 is false Type I error (false positive) Correct (true positive) Probability of making this Probability of getting this error = α correctly, Power = 1 – β Correct (true negative) Type II error (false Probability of getting this negative) correctly, the Level of Probability of making this confidence = 1 - α error = β

Table 6.3 Scenarios of rejecting, or not rejecting, an H0 that is true or not true

The primary concern of most researchers is to guard against Type I errors, errors that would lead to interpreting observed differences as non-chance (or probably real) when they are due to chance fluctuations. However, researchers often do not think about Type II errors and their importance. Recall that Type II errors are those that might lead us to accept that a set of results is null (i.e., there is nothing in the data but chance fluctuations) when the alternative hypothesis is true. Researchers may be making Type II errors every time they accept the null hypothesis because they are so tenaciously focused on Type I errors (α) while completely ignoring Type II errors (β).

Page | 354

Statistical power

The term statistical power (or just power) represents the probability that you will reject a false null hypothesis and therefore accept a true alternative hypothesis. This can be written as power = P(rejecting H0 given that H0 is false). Based on these definitions, we can write the following equation:

Statistical power = 1 – β    (6.12)

For example, if the Type II error (β) is equal to 23% then the statistical power is 1 – 0.23 = 77%, and we would conclude that we would correctly reject a false null hypothesis 77% of the time. If statistical power is high, the probability of making a Type II error, or concluding there is no effect when, in fact, there is one, is low.

The statistical power of an experiment is determined by the following factors:
a. The level of significance to be used.
b. The variability of the data (as measured, for example, by their standard deviation).
c. The size of the difference in the population it is required to detect.
d. The size of the samples.

By setting the power (often 80%) and any three of these four values, the remaining one can be calculated. However, since we usually use a 5% level of significance, we need only set two out of (b), (c), (d) and the power to determine the other. The variability of the data (b) needs to be approximately assessed, usually from previous studies or from the literature, and then the sample sizes (d) can be determined for a given difference (c) or, alternatively, for a specific sample size the difference likely to be detected can be calculated.

Software packages to calculate statistical power

There are several software packages that specialise in the calculation of statistical power:

G*Power - Free software
G*Power is free and can be downloaded from http://gpower.hhu.de. G*Power is a tool to compute statistical power analyses for many different t tests, F tests, χ2 tests, Z tests and some exact tests. G*Power can also be used to compute effect sizes and to display graphically the results of power analyses.

Page | 355

SPSS - Not free software IBM SPSS SamplePower software enables you to quickly find the right sample size for your research and test the possible results before you begin your study. The software provides advanced statistical techniques such as means and differences in means, correlation, one-way and factorial analysis of variance (ANOVA), regression and logistical regression, survival analysis and equivalence tests.
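If you prefer to script the calculation rather than use a dedicated package, the relationship power = 1 – β can also be evaluated directly. The short Python sketch below is only an illustration and is not part of the book's Excel or SPSS workflow; the numerical values for the standard deviation, detectable difference, sample size and significance level are assumptions chosen for the example, not figures taken from the text.

# Power of a two-tailed one-sample z test, computed from first principles.
from math import sqrt
from scipy.stats import norm

sigma = 15     # assumed population standard deviation
delta = 5      # assumed smallest difference from the null value we want to detect
n = 45         # assumed sample size
alpha = 0.05   # significance level

z_crit = norm.ppf(1 - alpha / 2)        # two-tailed critical value (about 1.96)
shift = delta * sqrt(n) / sigma         # how many standard errors the true mean lies from H0
power = norm.cdf(-z_crit + shift) + norm.cdf(-z_crit - shift)
print(round(power, 3))                  # probability of rejecting a false H0, i.e. 1 - beta

Increasing the sample size n or the detectable difference delta raises the power, while a smaller significance level lowers it, which mirrors the list of factors (a) to (d) above.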

Check your understanding X6.5

A supermarket is supplied by a consortium of milk producers. A recent quality assurance check suggests that the amount of milk supplied is significantly different from the quantity stated within the contract. a. Define what we mean by significantly different. b. State the null and alternative hypothesis statements. c. For the alternative hypothesis do we have a two-tailed, lower one-tailed, or upper one-tailed test?

X6.6

A business analyst is attempting to understand visually the meaning of the critical test statistic and the p-value. For a z-value of 2.5 and a significance level of 5% provide a sketch of the normal probability distribution and use the sketch to illustrate the location of the following statistics: test statistic, critical test statistic, significance value, and p-value (you do not need to calculate the values of zcri or the p-value).

X6.7

At the 2% significance level, what are the critical z values for (a) a two-tailed test, (b) a lower one-tailed test, and (c) an upper one-tailed test?

X6.8

A marketing manager has done a hypothesis test to test for the difference between accessories purchased for two different products. The initial analysis has been performed and an upper one-tailed z test chosen. Given that the z value was calculated to be 3.45, find the corresponding p-value. What would you conclude from this result?

Chapter summary

In this chapter we have introduced the important statistical concept of hypothesis testing. What is important in hypothesis testing is that you can recognise the nature of the problem and convert it into two appropriate hypothesis statements (H0 and H1) that can be tested. If you are comparing more than two samples then you would need to employ more advanced parametric hypothesis tests that are beyond the scope of this book – these statistical tests are called analysis of variance (ANOVA) tests.

In this chapter we have described a simple five-step procedure to aid the solution process. The main emphasis is placed on the use of the p-value, which measures the strength of the sample evidence against the null hypothesis (H0). Thus, if the measured p-value is greater than α then we fail to reject H0 and the result is not statistically significant. Remember that the value of the p-value will depend on whether we are dealing with a one- or two-tailed test. So, take extra care with this concept since this is where most students slip up. The alternative part of the decision-making process described the use of the critical test statistic in making decisions. This is the traditional textbook method, which uses published tables to provide estimates of critical values for various test statistics.

Moreover, we learned that we have two types of errors with hypothesis testing: Type I, when we reject a true null hypothesis; and Type II, when we fail to reject a false null hypothesis. This concept was then extended to the concept of statistical power, the probability of rejecting a false null hypothesis, and the relationship between statistical power and the probability of making a Type II error.

Page | 356

Test your understanding

TU6.1 Calculate the critical z value if you have a two-tailed test and you choose a significance level of 0.05 and 0.01.
TU6.2 If you conduct a z test and it is a lower one-tailed test, what is your decision if the significance level is 0.05 and the value of the test statistic is – 2.01?
TU6.3 Calculate the value of the p-value for question TU6.2.
TU6.4 Calculate the value of the z statistic if the null hypothesis is H0: μ = 63, where a random sample of size 23 is selected from a normal population with a sample mean of 66 (assume the population standard deviation is 15).
TU6.5 Calculate the probability that a sample mean is greater than 68 for question TU6.3, when the alternative hypothesis is (a) two-tailed, and (b) upper one-tailed (assume the significance level is 0.05).
TU6.6 Repeat TU6.5 but with the information that the population standard deviation is not known. Describe the test you would use to solve this problem. Given that the sample standard deviation was estimated to be 16.2, answer TU6.5 (a) and (b).

Want to learn more? The textbook online resource centre contains a range of documents to provide further information on the following topics: 1. A6Wa Common assumptions about data 2. A6Wb Meaning of the p-value

Page | 357

Chapter 7 Parametric hypothesis tests

7.1 Introduction and Learning Objectives

As we describe a variety of tests in this chapter, you will notice that in general we will only focus on testing three different statistics: the mean, the proportion and the variance (in fact, the variance is only covered in online chapters). As you will see, we will have a variety of test permutations, because we could have one or more samples, they could be dependent or independent, they could be large or small, the population standard deviation is either known or not known, and so on. This variety of permutations often "clouds" the hypothesis testing chapters, and it is difficult, at least at first glance, to differentiate one test from another. Make sure you clearly understand which test is applied to what combination of conditions. This chapter is dedicated to parametric hypothesis tests only. Figure 7.1 shows a high-level classification of some of the most important hypothesis tests.

Figure 7.1 Which test to use?

Regardless of whether we use one- or two-sample tests, in this chapter we will focus only on testing the mean and the proportion. The online material also includes variance tests. Table 7.1 provides a list of the parametric statistical tests described in this book and identifies which methods are solved using Excel and SPSS.

Page | 358

Statistics test | Excel | SPSS
One sample z test for the population mean | Yes | No
One sample t test for the population mean | Yes | Yes
One sample z test for the population proportion | Yes | No
Two sample z test for two independent population means | Online | No
Two sample z test for two independent population proportions | Online | No
Two sample t test for two population means (independent samples, equal variance) | Yes | Yes
Two sample t test for two population means (independent samples, unequal variance) | Yes | Yes
Two sample t test for two population means (dependent samples) | Yes | Yes
Two sample F test for two population variances | Online | Online

Table 7.1 Statistics tests covered in textbook

Learning objectives On completing this chapter, you will be able to: 1. Understand the concept of parametric hypothesis testing for one and two samples. 2. Be able to apply the tests for small and large samples as well as if the population standard deviation is known or not. 3. Conduct one- and two-sample hypothesis tests for the sample mean and proportion. 4. Solve problems using Microsoft Excel and IBM SPSS Statistics software packages.

7.2 One-sample hypothesis tests In this section we will explore the application of one-sample parametric hypothesis testing to carry out one-sample z tests for the population mean, one-sample t tests for the population mean, and one-sample z tests for the population proportion.

One-sample z test for the population mean

The first test we will explore is the one-sample z test for the population mean, with the following test assumptions:

1. The sample is a simple random sample from a defined population.
2. The variables of interest in the population are measured on an interval/ratio scale.
3. The population standard deviation is known.
4. The variable being measured is normally distributed in the population.

When dealing with a normal sampling distribution we calculate the z statistic using equation (6.3):

Zcal = (X̄ − μ) / (σ / √n)

Page | 359

As we already know, by convention we take X̄ to be the sample mean, μ is the population mean and σ is the population standard deviation. You should also remember that the denominator in equation (6.3), σ/√n, is called the standard error (or, to give it its full title, the standard error of the sampling distribution of the means, σX̄).

Example 7.1

A toy manufacturer undertakes regular assessment of employee performance. This performance testing consists of measuring the number of toys that employees can make per hour, with the historical data recording a rate of 85 per hour with a standard deviation of 15 units. All new employees are tested after a period of training. A new employee is tested on 45 separate random occasions and found to have the output given in Table 7.2. Does this indicate that the new employee's output is significantly different from the average output? Test at the 5% significance level.

ID | Sample data | ID | Sample data
1 | 88.6 | 24 | 81.8
2 | 63.8 | 25 | 88.8
3 | 95.6 | 26 | 83.2
4 | 94.4 | 27 | 94.0
5 | 118.2 | 28 | 96.7
6 | 84.4 | 29 | 79.9
7 | 81.6 | 30 | 93.1
8 | 78.6 | 31 | 101.7
9 | 56.6 | 32 | 55.5
10 | 77.8 | 33 | 88.6
11 | 87.2 | 34 | 79.3
12 | 82.0 | 35 | 80.7
13 | 100.3 | 36 | 93.3
14 | 91.6 | 37 | 102.5
15 | 126.2 | 38 | 71.9
16 | 99.8 | 39 | 91.1
17 | 94.1 | 40 | 109.6
18 | 92.5 | 41 | 96.0
19 | 96.0 | 42 | 82.6
20 | 85.8 | 43 | 108.3
21 | 69.0 | 44 | 95.5
22 | 90.3 | 45 | 76.8
23 | 88.0 | |

Table 7.2 Number of units per hour after training The five-step procedure to conduct this test progresses as follows:

Page | 360

Step 1 State hypothesis

Null hypothesis H0: μ = 85 (population mean is equal to 85 units per hour)
Alternative hypothesis H1: μ ≠ 85 (population mean is not 85 units per hour)

The ≠ sign implies that a two-tailed test will be appropriate.

Step 2 Select test

We now need to choose an appropriate statistical test for testing H0. From the information provided we note:

• Number of samples – one sample.
• The statistic we are testing – testing for a difference between a sample mean and a population mean (µ = 85).
• Population standard deviation is known (σ = 15).
• Size of the sample – relatively large (n = 45 ≥ 30).
• Nature of population from which sample drawn – the population distribution is not known but the sample size is large. For large n, the central limit theorem states that the sample mean approximately follows a normal distribution.

Then the sampling distribution of the sample means is, from equation (6.2),

X̄ ∼ N(μ, σ²/n)

where σ is known (= 15). Therefore, a one-sample z test of the mean is selected.

Step 3 Set the level of significance

α = 0.05

Step 4 Extract the relevant statistic

When dealing with a normal sampling distribution we calculate the one-sample z test statistic using equation (6.3):

Zcal = (X̄ − μ) / (σ / √n)

If we undertake the calculations, we find:

Page | 361

Sample size n = 45
Sample mean X̄ = 88.74 to 2 decimal places

Substituting these values into equation (6.3) gives:

Zcal = (88.74 − 85) / (15 / √45) = +1.6726 to 4 decimal places

Step 5 Make a decision

The calculated test statistic Zcal = +1.6726. We need to compare this test statistic with the critical z test statistic, Zcri. For 5% significance and a two-tailed test, Zcri = ±1.96 (see Figure 7.2).

Figure 7.2 Areas of the standardised normal distribution Does the test statistic lie within the rejection region? Compare the calculated and critical z values to determine which hypothesis statement (H0 or H1) to accept. We observe that the sample z value (zcal) does not lie in the upper rejection zone (+1.6726 < +1.96), so we will fail to reject H0. The sample mean value (88.74 units per hour) is close enough to the population mean value (85 units per hour) to allow us to assume that the sample comes from that population. We conclude from the evidence that there is no significant difference, at the 0.05 level, between the new employee's output and the firm’s existing employee output.

Page | 362

Excel solution Figures 7.3 and 7.4 illustrate the Excel solution.

Figure 7.3 Excel data for Example 7.1

Figure 7.4 Excel solution for Example 7.1 Page | 363

Critical Z value method

Two-tailed critical test statistic Zcri = ±1.96, and Zcal = 1.6726. Given that Zcal lies between the lower and upper critical Z values (– 1.96 < 1.6726 < + 1.96), we fail to reject the null hypothesis.

P-value method

Two-tailed p-value = 0.0944. Given that the two-tailed p-value > significance level α (0.0944 > 0.05), we fail to reject the null hypothesis.

We conclude that the evidence suggests there is no significant difference, at the 0.05 level, between the new employee's output and the firm's existing employee output after the training. Figure 7.5 illustrates the relationship between using the critical test statistic and using the p-value method in deciding on which hypothesis statement to accept.

Figure 7.5 Relationship between the p-value, test statistic, and critical test statistic for a two-tailed test

Table 7.3 provides the Excel functions to calculate the critical z test statistic or p-values.

Calculation | P-value | Critical test statistic
Lower one-tail | =NORM.S.DIST(z value, TRUE) | =NORM.S.INV(significance level)
Upper one-tail | =1-NORM.S.DIST(z value, TRUE) | =NORM.S.INV(1-significance level)
Two-tail | =2*(1-NORM.S.DIST(ABS(z value or cell reference), TRUE)) | =NORM.S.INV(significance level/2) for the lower critical z value and =NORM.S.INV(1-significance level/2) for the upper critical z value

Table 7.3 Excel functions to calculate critical values and p-values

Page | 364
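The same quantities can also be reproduced outside Excel. The short Python sketch below is an illustration only, not part of the book's Excel or SPSS workflow; it recalculates the test statistic, two-tailed p-value and critical value for Example 7.1 from the rounded summary figures quoted above.

# One-sample z test for Example 7.1, using the summary values from the text.
from math import sqrt
from scipy.stats import norm

x_bar, mu0, sigma, n, alpha = 88.74, 85, 15, 45, 0.05

z_cal = (x_bar - mu0) / (sigma / sqrt(n))     # about +1.67
p_two_tail = 2 * (1 - norm.cdf(abs(z_cal)))   # about 0.094
z_crit = norm.ppf(1 - alpha / 2)              # about 1.96

print(round(z_cal, 4), round(p_two_tail, 4), round(z_crit, 2))
# Since |z_cal| < z_crit (equivalently p > alpha), we fail to reject H0.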

Confidence interval method

From the previous chapter, we know that the true mean resides somewhere in the interval defined by equation (5.16). This was part of the interval estimate procedure, and we also used the expression 'confidence interval'. We can use this confidence interval to decide between the null and alternative hypotheses. The confidence interval for the population mean μ is given by rearranging equation (6.3) to give equation (7.1):

μ = X̄ ± Zcri × (σ / √n)    (7.1)

If we carry out the calculation for the 5% significance level, then the 95% confidence interval for the population mean would be from 84.36 to 93.12 as illustrated in Figure 7.6.

Figure 7.6 Confidence interval solution to make a hypothesis test decision

Cells N22 and N26 in Figure 7.6 calculate the same value, but using two different Excel functions. The end results, in cells N23:N24 and N27:N28, are identical. Observe that this 95% confidence interval (84.36 to 93.12) does contain the population mean assumed in the hypothesis test (85). We conclude from the evidence that there is no significant difference, at the 0.05 level, between the new employee's output and the firm's existing employee output.

Checking assumptions

Page | 365

To use the z test, the data are assumed to represent a random sample from a population that is normally distributed. One-sample Z tests are considered ‘robust’ for violations of the normal distribution assumption. This means that the assumption can be violated without serious error being introduced into the test. The central limit theorem tells us that, if our sample is large, the sampling distribution of the mean will be approximately normally distributed irrespective of the shape of the population distribution. Knowing that the sampling distribution is normally distributed is what makes the one-sample Z test robust for violations of the assumption of normal distribution. If the underlying population distribution is not normal and the sample size is small, then you should not use the Z test. In this situation you should use an equivalent nonparametric test (see Chapter 8).

Check your understanding X7.1

A mobile phone company is concerned at the lifetime of phone batteries supplied by a new supplier. Based upon historical data, this type of battery should last for 900 days with a standard deviation of 150 days. A recent randomly selected sample of 40 batteries was selected and the sample battery life was found to be 942 days. Is the population battery life significantly different from 900 days (significance level 5%)?

X7.2

A local Indian restaurant advertises home delivery times of 30 minutes. To monitor the effectiveness of this promise the restaurant manager monitors the time that the order was received and the time of delivery. Based upon historical data, the average time for delivery is 30 minutes with a standard deviation of 5 minutes. After a series of complaints from customers regarding this promise the manager decided to analyse the last 50 data orders, which resulted in an average time of 32 minutes. Conduct an appropriate test at the 5% significance level. Should the manager be concerned?

One-sample t test for the population mean In many real-world cases of hypothesis testing, one does not know the standard deviation of the population. In such cases, it must be estimated using the sample standard deviation. That is, s (calculated with division by n – 1) is used to estimate σ. Other than that, the calculations are identical as we saw for the z test for a single sample – but the test statistic is called t, not z, and we conduct a one-sample t test for the population mean with the following t-test assumptions: 1. The sample is a simple random sample from a defined population. 2. The variables of interest in the population are measured on an interval/ratio scale. 3. The sampling distribution of the sample means is normal (the central limit theorem tells you when this will be the case). 4. The population standard deviation is estimated from the sample. Student’s t test is built around a t distribution with the value of the t-test statistic given by equation (6.7): Page | 366

tcal = (X̄ − μ) / (s / √n)

Where the sample standard deviation (s) is given by equations (2.7) and (2.8):

s = √( Σ(X − X̄)² / (n − 1) )

For a single-sample t test, we must use a t distribution with n – 1 degrees of freedom. As this implies, there is a whole family of t distributions, with degrees of freedom ranging from 1 to infinity. All t distributions are symmetrical about t = 0, like the standard normal. In fact, the t distribution with df = ∞ is identical to the standard normal distribution.

Example 7.2

A car dealer offers a generous package to customers who would like high-quality extras fitted to their cars. Historically, people who take up this offer spend £2300 per customer. The owner is concerned that recently this average spend has changed and requested, after discussions with a data analyst, that this be tested. The analyst recommended a one-sample t test. To test this hypothesis, the data analyst checked the data for the last 10 years to confirm that the population spend follows approximately a normal distribution and then collected the spending data for the last thirteen customers, as illustrated in Table 7.4.

ID | Sample data, £'s
1 | 2595
2 | 1670
3 | 2899
4 | 2194
5 | 2313
6 | 2469
7 | 2131
8 | 2131
9 | 2657
10 | 1817
11 | 2473
12 | 1890
13 | 2330

Table 7.4 Sample data

Page | 367

Test the hypothesis that the average spend is £2300 (test at the 5% significance level). The five-step procedure to conduct this test progresses as follows:

Step 1 State hypothesis

Null hypothesis H0: μ = 2300 (population mean spend on extras is equal to £2300).
Alternative hypothesis H1: μ ≠ 2300 (population mean is not equal to £2300).

The ≠ sign implies a two-tailed test.

Step 2 Select test

We now need to choose an appropriate statistical test for testing H0. From the information provided we note:

a. Number of samples – one sample.
b. The statistic we are testing – testing for a difference between a sample mean and a population mean (µ = 2300). Therefore, we want a two-tailed test.
c. Size of the sample – small (n = 13).
d. Nature of population from which sample drawn – normal population distribution, the sample size is small, and the population standard deviation is unknown. The sample standard deviation will be used as an estimate of the population standard deviation and the sampling distribution of the mean is a t distribution with n – 1 degrees of freedom.

Then the sampling distribution of the sample means is given by equation (6.6)

X̄ ∼ tdf(μ, s²/n)

and the corresponding standardised t equation is given by equation (6.7)

t = (X̄ − μ) / (s / √n)

We conclude that a one-sample t test of the mean is appropriate.

Step 3 Set the level of significance

α = 0.05

Page | 368

Step 4 Extract relevant statistic

Sample data:
Sample size, n = 13
Sample mean, X̄ = 2274.538462
Sample standard deviation, s = 352.4974268
Estimated standard error of the mean, σX̄ = s / √n = 97.76519591

Substituting these values into equation (6.7) gives:

t = (X̄ − μ) / (s / √n) = (2274.5385 − 2300) / 97.7652 = −0.2604

with the number of degrees of freedom df = n – 1 = 12.
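Before moving to Step 5, the hand calculation can be checked with a few lines of code. The Python sketch below is an illustration only, outside the book's Excel and SPSS workflow; it applies a standard one-sample t test to the thirteen values in Table 7.4 and is simply an alternative way of evaluating equation (6.7).

# One-sample t test for Example 7.2 using the Table 7.4 data.
from scipy.stats import ttest_1samp

spend = [2595, 1670, 2899, 2194, 2313, 2469, 2131,
         2131, 2657, 1817, 2473, 1890, 2330]

result = ttest_1samp(spend, popmean=2300)   # two-tailed by default
print(round(result.statistic, 4), round(result.pvalue, 4))
# Expected output is roughly -0.2604 and 0.7989, matching the manual working.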

Figure 7.7 Critical values of the t distribution The calculated test statistic tcal = – 0.2604 and the critical t value  2.18 are compared to decide which hypothesis statement to accept. Page | 369

Given tcal lies between the lower and upper critical t values (– 2.18 and + 2.18), we fail to reject the null hypothesis H0. Figure 7.9 illustrates the relationship between the p-value, test statistic and critical test statistic. The evidence suggests that there is no significant difference, at the 0.05 level, between the extras purchased by the sample (i.e., the customers today) and the historical extras purchased of £2300. Excel solution Figures 7.8 and 7.9 illustrate the Excel solution.

Figure 7.8 Example 7.2 data

Page | 370

Figure 7.9 Example 7.2 Excel solution From Excel: Critical value method The calculated test statistic tcal = –0.26 and the critical t value tcri =  2.18. Given that tcal lies between the lower and upper critical t values (– 2.18 and + 2.18), we fail to reject the null hypothesis H0. P-value method From the Excel and SPSS solutions, the value of the two-tail p-value = 0.7989 > significance level (0.05), so we fail to reject the null hypothesis. Figure 7.10 illustrates the relationship between using the critical test statistic and using the p-value method in deciding on which hypothesis statement to accept.

Page | 371

Figure 7.10 Relationship between the p-value, test statistic, and critical test statistic. Table 7.5 provides the Excel functions to calculate Student’s critical test statistic or pvalues using Excel functions. Calculation Lower one-tail

P-values =T.DIST (t value, degrees of freedom, true) =T.DIST.RT (t value, degrees of freedom) =T.DIST.2T (ABS(t value), degrees of freedom).

Critical test statistic = - T.INV (significance level, degrees of freedom) for lower tail Upper one-tail = T.INV (significance level, degrees of freedom) Two-tail = T.INV.2T (significance level, degrees of freedom) for upper critical t value and = - T.INV.2T (significance level, degrees of freedom) for lower critical t value Table 7.5 Excel functions to calculate critical t values or p-values Confidence interval method The confidence interval for the population mean  is given by rearranging equation (6.7) to give equation (7.2). In other words, a true  will be somewhere inside the interval defined by equation (7.2): 𝑠

𝜇 = 𝑋̅ ± 𝑡𝑐𝑟𝑖 × ( 𝑛) √

(7.2)

If we carry out the calculation for a 5% significance level, then a 95% confidence interval for the population mean difference would be – 238.47 to + 187.55 as illustrated in Figure 7.11. Page | 372

Figure 7.11 Confidence interval solution to make a hypothesis test decision As before, cells K23 and K32 use two different Excel functions to yield the same result. Observe that this 95% confidence interval does contains the assumed population mean in the hypothesis test of – 25.46 (sample mean – population mean = 2274.54 – 2300). Therefore, we fail to reject the null hypothesis. We conclude from the evidence that there is no significant difference, at the 0.05 level, between the extras purchased by the sample (i.e. the customers today) and the historical extras purchased of £2300. If we carry out the calculation for the 5% significance level, then the 95% confidence interval for the population mean would be from -238.47 to 187.55 as illustrated in Figure 7.11. SPSS solution Enter data into SPSS

Figure 7.12 Example 7.2 SPSS data

Page | 373

Select Analyze > Compare Means

Figure 7.13 SPSS One-Sample T Test Select One-Sample T Test Transfer Data_value into Test Variable(s) box Type 2300 into the Test Value box.

Figure 7.14 Click on Options. Type 95% in the Confidence Interval Percentage box. Page | 374

Figure 7.15 SPSS one-sample t test options Click on Continue.

Figure 7.16 SPSS one-sample t test Click OK SPSS output Figure 7.17 gives the one-sample statistics and one-sample hypothesis test results.

Figure 7.17 SPSS solution

Page | 375

Summary Observe that the manual, Excel, and SPSS solutions all agree. Hypothesis test method t = - 0.26 df = 12 2 tail p-value = 0.799 > 0.05, so fail to reject null hypothesis Confidence interval method Mean difference = - 25.46 95% confidence interval – 238.47 to + 187.55 Observe mean difference = - 25.46 lies between – 238.47 and + 187.55, so fail to reject null hypothesis. Checking assumptions The assumptions of the one-sample t test are identical to those of the one-sample z test. To use the t test, the data are assumed to represent a random sample from a population that is normally distributed. In practice, if the sample size is not too small and the populations are nearly symmetrical, the t distribution provides a good approximation to the sampling distribution of the mean when the population standard deviation is unknown. The t test is called a robust test (not sensitive to departures) in that it does not lose power if the shape of the distribution departs from a normal distribution and the sample size is large (n  30); this allows the test statistic to be influenced by the central limit theorem. If the underlying population distribution is not normal and the sample size is small, then you should not use the t test. In this situation you should use an equivalent nonparametric test (see Chapter 8).

Check your understanding X7.3

Calculate the critical t values for a significance level of 1% and 12 degrees of freedom for (a) a two-tailed test, (b) a lower one-tailed test, and (c) an upper one-tailed test.

X7.4

After further data collection the marketing manager (Exercise X7.5) decides to revisit the data analysis and changes the type of test to a t test. a. Explain under what conditions a t test could be used rather than the z test. b. Calculate the corresponding p-value if the sample size was 13 and the test statistic equal to 2.03. From this result what would you conclude?

X7.5

A tyre manufacturer conducts quality assurance checks on the tyres that it manufactures. One of the tests was on its medium-quality tyres with an independent random sample of 12 tyres providing a sample mean and standard deviation of 14,500 km and 800 km, respectively. Given that the historical Page | 376

average is 15,000 km and that the population is normally distributed, test whether the sample gives cause for concern. X7.6

A new low-fat fudge bar is advertised with 120 calories. The manufacturing company conducts regular checks by selecting independent random samples and testing the sample average against the advertised average. Historically the population varies as a normal distribution and the most recent sample consists of the values 99, 132, 125, 92, 108, 127, 105, 112, 102, 112, 129, 112, 111, 102, 122. Is the population value significantly different from 120 calories (at significance level 5%)?

One-sample z test for the population proportion We now consider the one-sample z test for the proportion, π. This test also relies on the central limit theorem to ensure a standard normal distribution for its test statistic, and so we can only apply it to large samples. Null hypothesis for this test is H0: π = π0 Alternative hypothesis for this test is Left-sided test H1: π < π0 or right-sided test H1: π > π0 or two-sided test H1: π ≠ π0 Here  is a population proportion value specified in the null hypothesis. The only assumption that this test requires is that the sample size be sufficiently large that the test statistic in equation (7.3) will follow a standard normal distribution: Z=

p− π π (1− π) √ n

(7.3)

Where p is the sample proportion, n is the sample size and  is the population proportion value specified in H0. The calculated value of the test statistic is denoted by zcal.

Page | 377

Example 7.3 The human resources manager of a large company is asked to investigate a claim by the union that 20% of the employees of the company are experiencing high levels of work stress. The company, hoping to prove that this number is not as high as the union claims it to be, conducts a large study on 695 randomly selected employees. The study shows that 123 of the 695 employees included in the study suffer from high levels of work stress. The company would like to know if it can use this result to refute the claim. Assume the data sampled follow a normal distribution and test at the 5% significance level. The five-step procedure to conduct this test is as follows. Step 1. State hypothesis It was claimed that the population proportion of employees experiencing high levels of work stress is 20% or 0.2, so the null hypothesis is H0: π = 0.2. The company would like to prove that the true proportion is, in fact, lower than 0.2, so the alternative hypothesis is H1: π < 0.2. The symbol < implies that a left-sided test will be used. Step 2 - Select test • • • • •

We now need to choose an appropriate statistical test for testing H0. From the information provided we note: Number of samples – one sample. Clearly the parameter of interest is the population proportion π. The test statistic will need to reflect the distance between the sample proportion (p = 123/695= 0.176978) and the hypothesised population proportion (π0 = 0.2). The data collected follow a normal distribution.

Step 3 - Set the level of significance  = 0.05 Step 4 - Extract the relevant statistic Note that the sample proportion of employees experiencing high levels of work stress is p = 123/695= 0.176978 and that the hypothesised value of the population proportion is given as π0 = 0.2. Therefore, the test statistic calculated from this information and equation (5.11) is: Z=

p− π √π (1 − π) n

Page | 378

Z=

0.176978 − 0.2 √0.2 (1 − 0.2) 695

= −1.59

Step 5. Make a decision Using a significance level of α = 0.05, the critical value of a left-sided test is – Z = – Z0.05 = –1.645 (from tables; see Figure 7.18).

Figure 7.18 Critical values of the standardised normal distribution Comparing the calculated test statistic, zcal = –1.59, to the critical value, Z0.05 = – 1.645, we find that the Zcal does not exceed the critical value. We therefore fail to reject H0. Please note the values are quite close and we could conclude that the sample evidence is inconclusive on whether we accept or fail to reject the null hypothesis. A possible course of action would be to re-do the data collection and review how the sample data was collected. Excel solution Figure 7.19 illustrates the Excel solution.

Page | 379

Figure 7.19 Example 7.3 Excel solution Critical value method Zcal = - 1.590 Lower one-tail Zcri = - 1.645 Given Zcal > Zcri (- 1.590 > - 1.645), we conclude that we fail to reject the null hypothesis. P-value method Lower one-tail p-value = 0.0558 Given one tail test then compare this p-value with your significance level, ( = 0.05). Given p-value >  (0.0558 > 0.05), fail to reject the null hypothesis. Please note the values are quite close and we could conclude that the sample evidence is inconclusive on whether we accept or fail to reject the null hypothesis. A possible course of action would be to re-do the data collection and review how the sample data was collected.

Page | 380

Although the company hoped to disprove the claim that 20% of the work force suffers from stress, they cannot do so (at the 0.05 significance level). The rejection region is shown in Figure 7.20.

Figure 7.20 Graphical representation of the critical values, test statistic value and p-value of the test Confidence interval method The confidence interval for the population mean  is given by rearranging equation (7.3) to give equation (7.3). As before, this means that the true proportion for the population is somewhere within the interval defined by equation (7.4): p (1− p)

π = p ± Zcri × √

n

(7.4)

If we carry out the calculation for the 5% significance level, then a 95% confidence interval for the population mean would be from 0.15 to 0.21 as illustrated in Figure 7.21.

Page | 381

Figure 7.21 Confidence interval solution for Example 7.3 Observe that this 95% confidence interval does contain the assumed population proportion in the hypothesis test of 0.2. Therefore, we fail to reject the null hypothesis. Although the company hoped to disprove that 20% of the work force suffers from stress, it cannot do so (at the 0.05 significance level). Checking assumptions To use the z test, the data are assumed to represent a random sample from a population that is normally distributed. One-sample Z tests are considered robust for violations of the normal distribution. This means that the assumption can be violated without serious error being introduced into the test. The central limit theorem tells us that, if our sample is large, the sampling distribution of the mean will be approximately normal irrespective of the shape of the population distribution. Knowing that the sampling distribution is normally distributed is what makes the one-sample Z test robust for violations of the assumption of the normal distribution. If the underlying population distribution is not normal and the sample size is small, then you should not use the Z test. In this situation you should use an equivalent nonparametric test (see Chapter 8).

Check your understanding X7.7

Do 9% of Teesside commuters travel to work by car? A survey on commuting by car was done on a random sample of 250 commuters and found car commuting to be 13%. Test the claim using a 5% level of significance.

X7.8

A national provider of gas and electricity within a national market claims that 86% of its customers are very satisfied with the service they receive. To test this claim, the company regularly undertakes random sampling surveys of its customers. A recent random sample of 100 customers showed an 80% rating at the very satisfied level. Based on these findings, can we reject the hypothesis that 86% of the customers are very satisfied? Assume a significance level of 0.05. Page | 382

X7.9

The company in X7.8 now claims that at least 86% of its customers are very satisfied. Again, 100 customers are surveyed using simple random sampling, with 80% very satisfied. Based on these results, should we accept or reject the company’s hypothesis? Assume a significance level of 0.05.

X7.10 A university reviews student progress and over the last two years has implemented many initiatives to improve the retention rate. The historical data based upon the last three years show a failure rate of 6% across all university programmes. After implementing several new initiatives during the two academic years, the university re-evaluated its failure rate using a random sample of 250 students and found the failure rate for this academic year had changed to 2.5%. Test whether the failure rate has improved.

7.3 Two-sample hypothesis tests So far, we have learned how to test hypotheses involving one sample, where we contrasted what we observed with what we expected from the population. Often, researchers are faced with hypotheses about differences between groups. For example, do interest rates rise more quickly when wages increase or stay the same? In this section we will explore a range of two-sample tests covering comparing means, proportions, and variances. Table 7.6 provides a list of these. We will concentrate on two-sample tests that are available in both Excel and SPSS; other tests are explored in the online materials. Statistics test Two sample z test for two independent population means Two sample z test for two independent population proportions Two sample t test for two population means (independent samples, equal variance) Two sample t test for two population means (independent samples, unequal variance) Two sample t test for two population means (dependent samples) Two sample F test for two population variances Table 7.6 Two sample hypothesis tests

Excel Online Online

SPSS No No

Yes

Yes

Yes

Yes

Yes

Yes

Online

Online

Two-sample t test for the population mean: independent samples In testing for the difference between means we assume that the populations are normally distributed, with either equal or unequal population variances. Pooled-variance t test – equal population variances assumed For situations in which the two populations have equal variances, the pooled-variance t test is robust to moderate departures from the assumption of normality, provided the sample sizes are large (nA  30, nB  30).

Page | 383

In this case you can use the pooled-variance t test without serious effects on the power. The test is used to test whether the population means are significantly different from each other, using the means from randomly drawn samples. For example, do males and females differ in terms of their exam scores? When dealing with a normal sampling distribution we calculate the t-test statistic using equation (7.5), an estimate of the population standard deviation from equation (7.6), and the number of degrees of freedom given by equation (7.7). Test statistic t cal =

(XA − XB )−(μA − μB ) 1 1 + n A nB

(7.5)

̂ A+B ×√ σ

Pooled standard deviation (nA −1) S2A +(nB −1) S2B

̂A+B = √ σ

nA + nB −2

(7.6)

Degrees of freedom df = nA + nB – 2

(7.7)

Equation (7.6) is an estimator of the pooled standard deviation of the two samples. As indicated above, the null hypothesis for this test specifies a value for A – B, the difference between the population means. When testing using equation (7.5), H0 specifies that A – B = 0. For that reason, most textbooks omit A – B from the numerator of equation (7.5). The independent-samples (or unpaired) t test has degrees of freedom df = nA + nB – 2. The assumptions for the equal-variance t test are as follows: 1. 2. 3. 4.

Both samples are simple random samples. The two samples are independent. The sampling distribution of 𝑋̅𝐴2 − 𝑋̅𝐵2 is normal. The populations from which you sampled have equal variances.

The two-sample t test is robust to departures from normality. When checking distributions graphically, look to see that they are symmetric and have no outliers. There are a range of statistical tests that you could employ to see if the normality assumption is violated (e.g. the F test for population variances, Kolmogorov–Smirnov test or Shapiro–Wilks test). If you have an issue with the homogeneity of variance assumption then a rule of thumb on the relative sizes is that if the larger of the two variances is no more than 4 times the smaller, the t-test approximation is probably good enough, especially if the sample sizes are equal. Homogeneity of variances means that population variances are equal. The other word often used is homoscedasticity. Page | 384

It is important to note, however, that heterogeneity of variance and unequal sample sizes do not mix. If you have reason to anticipate unequal variances, make every effort to keep your sample sizes as equal as possible. If the heterogeneity of variance is too severe, there are versions of the independent-group t test that allow for unequal variances. The method used in SPSS, for example, is called Welch’s unequal variances t test. If you have a problem of this nature, then think about using a nonparametric test called the Mann–Whitney U test. Separate-variance t test for the difference between two means: unequal population variances The assumption of equal variances is critical if the sample sizes are markedly different. Welch developed an approximation method for comparing the means of two independent normal populations when their variances are not necessarily equal. Because Welch’s modified t test is not derived under the assumption of equal variances, it allows users to compare the means of two populations without first having to test for equal variances. With equations (7.5)–(7.7), we assumed that the variances were equal for the two samples and conducted a two-sample pooled t test. If the variances are unequal, we should not use these equations but use the Welch’s unequal variances t test statistic defined by equation (7.8): t cal =

̅A− X ̅ B )− (μA − μB ) (X S2 S2 √ A+ B nA nB

(7.8)

Here, SA2 and SB2 are the unbiased estimators of the standard deviations of the two samples. Unlike in Student's t test, the denominator is not based on a pooled variance estimate. For use in significance testing, the distribution of the test statistic is approximated as an ordinary Student’s t distribution with the degrees of freedom calculated using the Welch–Satterthwaite equation: 2

2 2

S S ( A+ B )

df =

aA n B 2 2 2 S S2 ( A) ( B) nA nB + nA −1 nB −1

(7.9)

The assumptions for the unequal-variance t test are as follows: 1. Both samples are simple random samples. 2. The two samples must be independent. 3. The sampling distribution of ̅ XA2 − ̅ XB2 is normal. The two-sample t test is robust to departures from normality. When checking distributions graphically, look to see that they are symmetric and have no outliers. Welch's t test remains robust for skewed distributions and large sample sizes.

Page | 385

If you have a problem of this nature, then think about using a nonparametric test called the Mann–Whitney U test. Example 7.4 A newsagent sells tea bags to the local community at slightly above cost price which are packed and delivered via a distribution sent that receives unwanted goods from national supermarkets. The newsagent shop owner decides to check on the number of tea bags in the bags to check that they have approximately the same number of tea bags in each bag. To enable this analysis the shop owner checks two random independent samples at two different time points and counts the number of tea bags with the data provided in table 7.7 (G1=Group1 and G2=Group2). Conduct a two-sample t test for the population mean to test this hypothesis at the 5% significance level (assume sampling distributions are normally distributed). G1 425 385 464 396 365 387 351 446 411 426

429 381 420 443 417 407 381 386 349 376

417 359 418 364 357 351 468 421 303 436

414 444 417 403 401 464 364 409 379 434

421 323 368

G2 409 387 385 485 405 426 402 402 407 408

471 505 364 358 393 440 413 455 434 357

439 303 392 456 444 566 434 522 469 369

368 375 283 339 413

Table 7.7 Number of tea bags in bags at time t1 and t2 If we calculate the mean number of tea bags for group 1 and group 2, we will find: Group 1 mean = 399.5349 Group 2 mean = 413.6571 We observe that the number of teabags is different between the two groups, but the question is if this difference is statistically significant? To answer this question, we will undertake a 2-sample t test for two independent samples. In this case, we are not told if the population variances are equal or unequal. The five-step procedure to conduct this test progresses as follows. Step 1. State hypothesis Null hypothesis: H0: 1 = 2 Alternative hypothesis: H1: 1 ≠ 2 Page | 386

The ≠ sign implies a two-tailed test. Step 2. Select test We now need to choose an appropriate statistical test for testing H0. From the information provided we note: • • •

Number of samples – two samples. The statistic we are testing – testing that the amount of beans in a bag sold by both shops are the same. Both population standard deviations are unknown. Nature of population from which sample drawn – population distributions are normal.

Step 3. Set the level of significance  = 0.05 Step 4. Extract relevant statistic Summary statistics are shown in Table 7.8 Sample statistic Sample size Sample mean Sample standard deviation Table 7.8 Sample statistics

Sample Group 1 43 399.5349 37.6983

Sample Group 2 35 413.6571 57.7918

Option 1. Equal population variances assumed If H0 is true (µA – µB = 0) then equations (7.5) to (7.7) give the t test statistic, pooled standard deviation, and degrees of freedom. Pooled standard deviation

σ ̂ 1+2 =



(n1 − 1) S21 + (n2 − 1) S22 n1 + n2 − 2

σ ̂ 1+2 = 2279.8498

Test statistic t cal =

(X1 − X 2 ) − (μ1 − μ2 ) 1 1 ̂1+2 × √n + n σ 1 2

t cal = −1.2992

Page | 387

Degrees of freedom df = n1 + n2 – 2 df = 76 Option 2. Unequal population variances assumed If H0 is true (µA – µB = 0) then equation (7.8) and (7.9) can be used to calculate the t test statistic and the degrees of freedom: Test statistic t cal =

̅A − X ̅ B ) − (μA − μB ) (X S2 S2 √ A+ B nA nB

tcal = - 1.2458 Degrees of freedom

df =

S2 S2 ( A + B) a A nB 2

2

2

S2 S2 ( A) ( B) nA nB + nA − 1 nB − 1

df = 56.1711 Step 5. Make a decision Option 1 Equal population variances assumed tcal = - 1.2992 df = 76 Critical t value for 5% two tailed and df = 76 can be found from the critical t tables as illustrated in Figure 7.22.

Page | 388

Figure 7.22 Critical values of the Student’s t distribution Therefore, require the critical t value for a significance level of 5% 2-tail for df = 76 given that the two-tail critical t value is 1.99 at df = 70 and 1.99 at df = 80. Therefore, using linear interpolation: t at df76 = t at df70 + (6/10) * (t at df80 – t at df70) t at df76 = 1.99 + (6/10) *(1.99 – 1.99) t at df76 = 1.99 The critical t value for 95% 2-tail with 76 degrees of freedom is  1.99. Therefore, compare the calculated value of t with this critical value. tcal = - 1.2992 tcri =  1.99 Given tcal (= - 1.2992) lies between – 1.99 and +1.99, we fail to reject the null hypothesis. Option2. Unequal population variances assumed tcal = - 1.2458 df = 56.1711 Critical t value for 5% two tailed and df = 56.1711 can be found from the critical t tables as illustrated in Figure 7.23. Page | 389

Figure 7.23 Critical values of the Student’s t distribution Therefore, require the critical t value for a significance level of 5% 2-tail for df = 56.1711 given that the two-tail critical t value is 2.01 at df = 50 and 2.00 at df = 60. Therefore, using linear interpolation: t at df56.1711 = t at df50 + (6.1711/10) * (t at df60 – t at df50) t at df56.1711 = 2.01 + (6.1711/10)*(2.00 – 2.01) t at df56.1711 = 2.0038 Therefore, the critical t value for 95% 2-tail with 76 degrees of freedom is  2.00. Therefore, compare the calculated value of t with this critical value. tcal = - 1.2458 tcri =  2.00 Given tcal (= - 1.2458) lies between – 2.00 and +2.00, we fail to reject the null hypothesis. Therefore, the analysis suggests that the number of tea bags in the bags is not significantly different at a 5% significance. Excel solution Figures 7.24–7.25 show the Excel solutions for (a) population variance equal and (b) population variance not equal. Enter data into Excel

Page | 390

Figure 7.24 Example 7.4 data Page | 391

Figure 7.24 Example 7.4 Excel solution Two sample pooled t test for means Figure 7.26 represents the Excel solution.

Figure 7.26 Two-sample pooled t test From Excel: tcal = - 1.2992 two-tailed tcri =  1.99 two-tailed p-value = 0.1978 Based upon these statistics we reject the null hypothesis. Page | 392

We conclude that, based upon the sample data collected, the quantity of teabags is not different at the 5% level of significance. Figure 7.27 illustrates the relationship between the p-value and the test statistic.

Figure 7.27 Comparison of t, critical t, and p-value It can be concluded that, based upon the sample data collected, the quantities of beans sold by shops A and B are significantly different at the 5% level of significance. It should be noted that the decision will change if you choose a 1% level of significance. Two sample t test for means assuming unequal variances Figure 7.28 represents the Excel solution.

Figure 7.28 Two sample t test for means assuming unequal variances From Excel: tcal = - 1.2458 Page | 393

two-tailed tcri =  2.00 two-tailed p-value = 0.2180 Based upon these statistics we reject the null hypothesis. We conclude that, based upon the sample data collected, the quantity of teabags is not different at the 5% level of significance. Figure 7.29 illustrates the relationship between the p-value and test statistic.

Figure 7.29 Comparison of t, critical t, and p-value Note that Excel provides the test statistic, critical test statistic, and p-value. SPSS provides the test statistic and p-value. Confidence interval solutions We can use the confidence interval method to make an hypothesis test decisions. Two sample pooled t test for means

Figure 7.30 95% confidence interval for pooled t test Page | 394

The 95% confidence interval is – 35.77 to 7.53 with a difference of 21.65. Given the confidence interval does not contain this value then we conclude that we accept the alternative hypothesis. Two sample t test for means assuming unequal variances

Figure 7.31 95% confidence interval for t test assuming unequal variances The 95% confidence interval is – 36.83 to 8.59 with a difference of 22.71. Given the confidence interval does not contain this value then we conclude that we accept the alternative hypothesis. SPSS solution Enter data into SPSS

Figure 7.32 Example 6.6 SPSS data Given that we have a two-sample independent test to perform, we have created two variables for SPSS. The Group variable takes values 1 and 2, representing tea bag type 1 and type 2, respectively. Page | 395

Select Analyze > Compare Means

Figure 7.33 Choosing SPSS Independent samples t test Select Independent-Samples T Test Transfer Beans variable into the Test Variable(s) box. Transfer Group variable into the Grouping Variable box. Click on Define Groups, and type 1 in the Group 1 box and 2 in the Group 2 box. Click Continue.

Figure 7.34 Independent samples t test Page | 396

Click on Options. Type 95% in the Confidence Interval Percentage box.

Figure 7.35 Options Click Continue.

Figure 7.36 SPSS independent samples t-test menu Click OK SPSS output Figure 7.37 gives the group statistics for group 1 and group 2. Figures 7.38 and 7.39 gives the hypothesis test results for the two-sample independent t-test results.

Figure 7.37 Example 6.6 SPSS solution

Page | 397

Figure 7.38 Example 7.4 SPSS solution continued

Figure 7.39 Example 7.4 SPSS solution continued Equal variances assumed From SPSS, we note that we have two results. When we assume equal variances, we obtain t = - 1.299 and two-tailed p-value = 0.196. This is identical to the Excel results t = - 1.2992 and two-tailed p-value = 0.1978. Given that the two-tailed pvalue is greater than 0.05, we fail to reject H0. Equal variances test (Levene’s test) The SPSS printout includes the value of the F-test statistic which can be used to test whether the two populations for group 1 and group 2 have equal population variances. The hypothesis test is H0: variances equal, and H1: variances not equal. The value of F is given as 3.235, and the associated test p-value is 0.076. Since our significance level is 0.05 and our p-value is greater than this, we accept the null hypothesis H0 and conclude at the 5% level of significance that the two population variances are not significantly different. This also means that we should use the two-sample t test for independent samples with pooled variances. Unequal variances assumed If we did have evidence for unequal variances, then we can use the second t test. When we do not assume equal variances, we obtain t = - 1.246 and two-tailed p-

Page | 398

value = 0.218. This is identical to the Excel result. Given the two-tailed p-value is greater than 0.05, we fail to reject the null hypothesis H0. Conclusions We conclude that based upon the sample data collected that we have no evidence to suggest that the number of tea bags in groups 1 and 2 are significantly different at the 5% level of significance.

Excel Data Analysis solutions As an alternative to either of the two previous methods, we can use a method embedded in Excel Data Analysis. Select Data > Data Analysis > t-Test: Two Sample Assuming Equal Variances:

Figure 7.40 Excel Data Analysis menu: two-sample t test assuming equal variances

Figure 7.41 Excel Data Analysis menu: two-sample t test assuming equal variances Page | 399

We observe from Figure 7.41 that the relevant results agree with the previous results. Select Data > Data Analysis > t-Test: Two Sample Assuming Unequal Variances:

Figure 7.42 Excel data analysis solution – t test for 2 sample assuming unequal variances

Figure 7.43 Excel data analysis solution – t test for 2 sample assuming unequal variances Warning: It should be noted that the Excel Analysis ToolPak method for this statistical test will round up the value of the degrees of freedom and use this value to calculate the critical value. We observe from Figure 7.43 that the relevant results agree with the previous results (see Excel warning about the degrees of freedom). Conclusion

Page | 400

We conclude that based upon the sample data collected that we have no evidence that the number of tea bags is different between the two data collection time points at the 5% level of significance.
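For readers who want to confirm these figures outside Excel and SPSS, both versions of the two-sample test can be reproduced directly from the summary statistics in Table 7.8. The Python sketch below is an illustration only; it assumes the scipy package is available and uses the group means, standard deviations and sample sizes quoted earlier in the example.

# Two-sample t tests (pooled and Welch) from the Example 7.4 summary statistics.
from scipy.stats import ttest_ind_from_stats

mean1, sd1, n1 = 399.5349, 37.6983, 43   # group 1 summary values from Table 7.8
mean2, sd2, n2 = 413.6571, 57.7918, 35   # group 2 summary values from Table 7.8

pooled = ttest_ind_from_stats(mean1, sd1, n1, mean2, sd2, n2, equal_var=True)
welch = ttest_ind_from_stats(mean1, sd1, n1, mean2, sd2, n2, equal_var=False)

print(round(pooled.statistic, 4), round(pooled.pvalue, 4))   # roughly -1.2992 and 0.198
print(round(welch.statistic, 4), round(welch.pvalue, 4))     # roughly -1.2458 and 0.218
# Both p-values exceed 0.05, so in each case we fail to reject H0.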

Check your understanding X7.11 During an examination board meeting concerns were raised concerning the marks obtained by students sitting the final year advanced economics (AE) and e-marketing (EM) papers (see Table 7.9). Historically the sample data follow a normal distribution and the population standard deviations are approximately equal. Assess whether there is a significant difference between the two sets of results (test at 5%). AE AE EM 51 63 71 66 35 69 50 9 63 48 39 66 54 35 43 83 44 34 68 68 57 48 36 58 45 68 Table 7.9 Student marks

EM 68 53 65 48 63 48 47 53 64

EM 61 59 55 66 61 58 77 73 54

X7.12 A university finance department would like to compare the travel expenses claimed by staff attending conferences. After initial data analysis, the finance director has identified two departments which seem to have very different levels of claims. Based upon the data provided in Table 7.10, carry out a suitable test to assess whether the level of claims from department A is significantly greater than that from department B. You can assume that the population expenses data are normally distributed and that the population standard deviations are approximately equal. Department A 156.67 146.81 147.28 169.81 143.69 157.58 130.74 155.38 179.89 158.86 170.74 Table 7.10 Travel expenses

140.67 154.78 154.86

Department B 108.21 109.10 127.16 142.68 110.93 101.85 135.92 132.91 124.94

X7.13 Repeat Exercise X7.11 but do not assume equal variances. Are the two sets of results significantly different? Test at 5%. X7.14 Repeat Exercise X7.12 but do not assume equal variances. Are the expenses claimed by department A significantly different than those by department B? Test at 5%. Page | 401

Two-sample t-test for the population mean: dependent or paired samples The paired sample t-test, sometimes called the dependent sample t-test, is a statistical procedure used to determine whether the mean difference between two sets of observations is zero. In a paired sample t-test, each subject is measured twice, resulting in pairs of observations. Suppose you are interested in evaluating the effectiveness of a weight loss diet. One approach you might consider would be to measure the weight of a person before starting the diet and the weight after being on the diet for a period of time. We would then analyse the differences using a paired sample t-test. The null hypothesis for this paired samples t test is that the difference scores are a random sample from a population in which the mean difference has some value which you specify. The test assumptions are as follows:
1. The samples are simple random samples.
2. The sample data consist of matched pairs.
3. The number of matched pairs is large (n ≥ 30) or the paired differences from the population are approximately normally distributed.
The t-test statistic is given by equations (7.10) and (7.11), with the degrees of freedom given by equation (7.12):

Test statistic
t = (d̄ - D) / (sd / √n)    (7.10)

Standard deviation of the differences
sd = √[ Σ(d - d̄)² / (n - 1) ]    (7.11)

Degrees of freedom
df = n - 1    (7.12)
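The quantities in these equations are defined in the next paragraph. For readers who like to see the calculation written as code, the short Python sketch below translates equations (7.10)-(7.12) directly; it is only an illustration (the book's own calculations use Excel and SPSS) and the before/after lists are hypothetical placeholder scores, not data from the textbook.

# Sketch of equations (7.10)-(7.12) for the paired-sample t test.
import math

def paired_t(before, after, hypothesised_diff=0.0):
    """Return (t, df) for paired data, where d = after - before."""
    d = [a - b for a, b in zip(after, before)]
    n = len(d)
    d_bar = sum(d) / n                                            # mean difference
    s_d = math.sqrt(sum((x - d_bar) ** 2 for x in d) / (n - 1))   # equation (7.11)
    t = (d_bar - hypothesised_diff) / (s_d / math.sqrt(n))        # equation (7.10)
    return t, n - 1                                               # df, equation (7.12)

# Placeholder data (not from the textbook examples).
before = [54.0, 61.5, 47.2, 58.8, 63.1]
after = [57.5, 60.9, 51.0, 62.3, 66.4]
t_stat, df = paired_t(before, after)
print(f"t = {t_stat:.4f}, df = {df}")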

In equations (7.10)-(7.11), d̄ is the sample mean of the difference scores, D is the mean difference in the population, given a true H0 (often D = 0, but not always), sd is the sample standard deviation of the difference scores, n is the number of matched pairs, and df is the degrees of freedom. Example 7.5 A university department is running extra support for international students on basic study skills. The unit that deals with this assesses student report writing skills at the start of the course and re-assesses the students at the end of the course using a standardised test. Please note that this study skills course is compulsory for all

international students but is not part of the student degree course being studied. Table 7.11 illustrates the test results for the 2019-20 academic year. Conduct a two-sample t test for the population mean for paired samples to test the hypothesis that the before and after test results are significantly different. If significant, do we have any evidence that the support worked in improving the student results? What hypothesis test would you conduct?

Pairs 1-10:   X1: 56.86 57.88 44.64 48.87 68.34 55.76 46.83 56.66 56.18 51.1
              X2: 42.66 55.71 59.12 55.75 40.91 25.58 55.4 59.2 48.44 61.18
Pairs 11-20:  X1: 40.48 44.78 87.45 66.28 59.11 50.69 57.24 48.14 68.17 64.48
              X2: 45.93 38 63.27 30.72 46.81 58.23 43.36 69.73 62.13 66.09
Pairs 21-30:  X1: 49.75 54.71 60.43 70.05 78.55 50.75 66.87 46.5 43.39 76.26
              X2: 39.09 47.32 66.12 36.76 82.46 44.92 57.28 31.35 67.8 67.08
Pairs 31-40:  X1: 67.93 70.21 55.63 46.76 26.72 45.67 60.75 65.99 30.62 40.14
              X2: 67.07 53.23 52.21 56.55 36.58 44.88 59.18 40.2 53.42 60.01
Pairs 41-45:  X1: 64.28 70.41 55.29 66.83 70.18
              X2: 69.46 79.53 44.73 33.38 62.43

Table 7.11 Before and after test results The five-step procedure to conduct this test progresses as follows. Step 1 State hypothesis The hypothesis statement implies that the population mean difference between test results is zero. Null hypothesis: No difference in population test results H0: D = μ1 - μ2 = 0 Alternative hypothesis: Difference in population test results H1: D ≠ 0 (or D = μ1 - μ2 ≠ 0) The ≠ sign implies a two-tailed test.


Step 2 Select test We now need to choose an appropriate statistical test for testing H0. From the information provided we note:
• Number of samples – two samples.
• The statistic we are testing – testing that the test results before and after are significantly different.
• Both population standard deviations are unknown.
• Size of both samples – large (n1 and n2 = 45 > 30).
• Nature of population from which sample drawn – population distribution is not known; we will assume that the population is approximately normal given the sample size is greater than 30.
In this case, we have two variables that are related to each other (test result before and test result after) and we will conduct a paired-sample t test for means. Step 3 Set the level of significance α = 0.05 Step 4 Extract relevant statistic Given equations (7.10) - (7.12) we need to calculate: sample pair size n, standard deviation of the differences (sd), the t statistic, and the degrees of freedom, df. Tables 7.12 and 7.13 present the calculation of the summary statistics.

Person   X1 (after extra help)   X2 (before extra help)   d = X1 - X2   (d - d̄)²
1        56.86                   42.66                     14.20         102.53
2        57.88                   55.71                      2.17           3.63
3        44.64                   59.12                    -14.48         344.27
4        48.87                   55.75                     -6.88         120.00
5        68.34                   40.91                     27.43         545.48
6        55.76                   25.58                     30.18         681.50
7        46.83                   55.4                      -8.57         159.88
8        56.66                   59.2                      -2.54          43.75
9        56.18                   48.44                      7.74          13.44
10       51.1                    61.18                    -10.08         200.35
11       40.48                   45.93                     -5.45          90.72
12       44.78                   38                         6.78           7.32
Table 7.12 Summary statistics


Person   X1 (after extra help)   X2 (before extra help)   d = X1 - X2   (d - d̄)²
13       87.45                   63.27                     24.18         404.23
14       66.28                   30.72                     35.56         991.34
15       59.11                   46.81                     12.30          67.66
16       50.69                   58.23                     -7.54         134.90
17       57.24                   43.36                     13.88          96.15
18       48.14                   69.73                    -21.59         658.66
19       68.17                   62.13                      6.04           3.86
20       64.48                   66.09                     -1.61          32.31
21       49.75                   39.09                     10.66          43.37
22       54.71                   47.32                      7.39          10.99
23       60.43                   66.12                     -5.69          95.34
24       70.05                   36.76                     33.29         853.55
25       78.55                   82.46                     -3.91          63.75
26       50.75                   44.92                      5.83           3.08
27       66.87                   57.28                      9.59          30.42
28       46.5                    31.35                     15.15         122.67
29       43.39                   67.8                     -24.41         811.36
30       76.26                   67.08                      9.18          26.07
31       67.93                   67.07                      0.86          10.33
32       70.21                   53.23                     16.98         166.55
33       55.63                   52.21                      3.42           0.43
34       46.76                   56.55                     -9.79         192.22
35       26.72                   36.58                     -9.86         194.17
36       45.67                   44.88                      0.79          10.79
37       60.75                   59.18                      1.57           6.27
38       65.99                   40.2                      25.79         471.57
39       30.62                   53.42                    -22.80         722.24
40       40.14                   60.01                    -19.87         573.34
41       64.28                   69.46                     -5.18          85.64
42       70.41                   79.53                     -9.12         174.09
43       55.29                   44.73                     10.56          42.06
44       66.83                   33.38                     33.45         862.92
45       70.18                   62.43                      7.75          13.51
Table 7.13 Summary statistics continued
From tables 7.12 and 7.13: D = 0, n = 45, Σd = 183.35, Σ(d - d̄)² = 10288.72. Therefore,


Average difference d̄ = 183.35/45 = 4.0744. Substituting into equation (7.11) gives the standard deviation of the differences, sd:

sd = √[ Σ(d - d̄)² / (n - 1) ] = √(10288.72 / 44) = 15.2916

Substituting into equation (7.10) gives the t test value:

t = (d̄ - D) / (sd / √n) = (4.0744 - 0) / (15.2916 / √45) = 1.7874

Substituting into equation (7.12) gives the number of degrees of freedom:

df = n - 1 = 45 - 1 = 44

Step 5 Make a decision Given t = 1.7874 and df = 44, we need to use the t critical tables to find the critical value for df = 44 and a 5% significance level (see Figure 7.44).

Figure 7.44 Critical values for the Student's t distribution From statistical tables: For a two-tail 5% significance we have: Critical t value at df = 40 is 2.02 Critical t value at df = 50 is 2.01

We can use linear interpolation to find the critical value for df = 44 as follows: tdf=44 = tdf=40 + (4/10)*(tdf=50 - tdf=40) = 2.02 + (4/10)*(2.01 - 2.02) = 2.016. We have tcal = +1.7874 and tcri = ±2.016. Given tcal lies between the critical t values (-2.016 ≤ +1.7874 ≤ +2.016), we fail to reject the null hypothesis, H0. We conclude that there is no evidence to suggest the test scores before or after are significantly different at a 5% significance level. Excel solution Figures 7.45 - 7.46 illustrate the Excel solution. Note that in Figure 7.45 rows 11-45 are hidden.

Figure 7.45 Example 7.5 Excel solution


Figure 7.46 Example 7.5 Excel solution Critical t value method From Excel, t = 1.7874 and the two-tail critical t value = ±2.0154. Given 1.7874 lies between -2.0154 and +2.0154, we fail to reject the null hypothesis. P-value method From Excel, the two-tail p-value = 0.0808 > two-tail significance level of 0.05, so again we fail to reject the null hypothesis. Figure 7.47 illustrates the relationship between the p-value and test statistic.


Figure 7.47 Relationship between t, critical t value, and the p-value We conclude that there is no evidence to suggest the test scores before or after are significantly different at a 5% significance level. Confidence interval method If you calculate the confidence intervals you will have the following results.

Figure 7.48 Confidence interval calculations

From Excel, the 95% confidence interval for the mean difference is -0.15 to 8.67. The sample mean difference is 4.07, and the interval contains the hypothesised population mean difference of zero. This tells us that we fail to reject the null hypothesis. We conclude that there is no evidence to suggest the test scores before or after are significantly different at a 5% significance level. Excel Data Analysis add-in solution for a two-sample t test for the mean As an alternative to either of the two previous methods, we can use a method embedded in Excel Data Analysis. Select Data > Data Analysis > t-Test: Paired Two Sample for Means.

Figure 7.49 Example 7.5 Excel Data Analysis solution

Figure 7.50 Example 7.5 Excel Data Analysis solution


SPSS solution Input data into SPSS

Figure 7.51 Example 7.5 SPSS data Analyze > Compare Means >

Figure 7.52 Paired sample t test Select Paired-Samples T Test Transfer variables Score_1 and Score_2 into the Paired Variables box.


Figure 7.53 SPSS paired samples t test menu Click OK SPSS output Figure 7.54 shows the paired sample statistics. Figure 7.55 shows the paired samples correlations. Figures 7.56 and 7.57 show the outcome of the two-sample dependent t test.

Figure 7.54 SPSS solution

Figure 7.55 SPSS solution continued

Figure 7.56 SPSS solution continued


Figure 7.57 SPSS solution continued From SPSS: the t value is 1.787 and the two-tailed p-value is 0.081. This agrees with the Excel results (t = 1.7874 and two-tailed p-value = 0.0808). Given that the two-tailed p-value is greater than 0.05, we fail to reject H0. We conclude that there is no evidence to suggest the test scores before or after are significantly different at a 5% significance level. Note that the equivalent nonparametric test for the two samples mean (dependent and paired samples) is the Wilcoxon signed-rank test. We will cover this test in Chapter 8.
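For readers with access to Python, the same paired test can be run in one call with SciPy using the 45 before/after scores from Table 7.11; the sketch below should reproduce the values reported above (t ≈ 1.787, two-tailed p ≈ 0.081), and stats.t.ppf gives the two-tailed critical value of approximately ±2.0154 quoted from Excel. This is offered only as an optional cross-check, not as part of the book's Excel/SPSS workflow.

# Optional cross-check of Example 7.5 with SciPy (data from Table 7.11).
from scipy import stats

score_after = [56.86, 57.88, 44.64, 48.87, 68.34, 55.76, 46.83, 56.66, 56.18,
               51.1, 40.48, 44.78, 87.45, 66.28, 59.11, 50.69, 57.24, 48.14,
               68.17, 64.48, 49.75, 54.71, 60.43, 70.05, 78.55, 50.75, 66.87,
               46.5, 43.39, 76.26, 67.93, 70.21, 55.63, 46.76, 26.72, 45.67,
               60.75, 65.99, 30.62, 40.14, 64.28, 70.41, 55.29, 66.83, 70.18]
score_before = [42.66, 55.71, 59.12, 55.75, 40.91, 25.58, 55.4, 59.2, 48.44,
                61.18, 45.93, 38.0, 63.27, 30.72, 46.81, 58.23, 43.36, 69.73,
                62.13, 66.09, 39.09, 47.32, 66.12, 36.76, 82.46, 44.92, 57.28,
                31.35, 67.8, 67.08, 67.07, 53.23, 52.21, 56.55, 36.58, 44.88,
                59.18, 40.2, 53.42, 60.01, 69.46, 79.53, 44.73, 33.38, 62.43]

# Paired (dependent samples) t test: differences are after - before.
t_stat, p_value = stats.ttest_rel(score_after, score_before)
t_critical = stats.t.ppf(0.975, df=len(score_after) - 1)  # two-tailed, alpha = 0.05

print(f"t = {t_stat:.4f}, two-tailed p = {p_value:.4f}, critical t = ±{t_critical:.4f}")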

Check your understanding X7.15 Choko Ltd provides training to its salespeople to help each salesperson increase the value of their sales. During the last training session 15 salespeople attended, and their weekly sales before and after the training are provided in Table 7.14. Assuming the populations are normally distributed, assess whether there is any evidence for the training company's claims that its training is effective (test at 5% and 1%).

Person   Before     After
1        2911.48    2287.22
2        1465.44    3430.54
3        2315.36    2439.93
4        1343.16    3071.55
5        2144.22    3002.40
6        2499.84    2271.37
7        2125.74    2964.65
8        2843.05    3510.43
9        2049.34    2727.41
10       2451.25    2969.99
11       2213.75    2597.71
12       2295.94    2890.20
13       2594.84    2194.37
14       2642.91    2800.56
15       3153.21    2365.75
Table 7.14 Change in value of sales

X7.16 Concern has been raised at the standard achieved by students completing finalyear project reports within a university department. One of the factors identified as important is the mark achieved in the research methods module, which is studied before the students start their project. The department has now collected data for 18 students as given in Table 7.15. Assuming the populations are normally distributed, is there any evidence that the marks are different? Test at 5%.


Student RM Project Student 1 38 71 9 10 2 50 46 11 3 51 56 4 75 44 12 5 58 62 13 6 42 65 14 7 54 50 15 8 39 51 Table 7.15 Research methods results

RM Project 48 43 14 62 38 66 47 75 58 60 53 75 66 63

Chapter summary The focus of parametric tests is that the underlying variables are at the interval/ratio level of measurement and the population(s) being measured is normally or approximately normally distributed. In the next chapter we explore hypothesis tests for variables that are at the nominal or ordinal level of measurement by employing the concept of the chi-square test and nonparametric tests. As a summary, Table 7.16 provides a comparison between the z test and t test. For large sample size (n ≥ 30), the results from Student's t test will be approximated by the normal distribution, given that the sample standard deviation is assumed to be approximately equal to the population standard deviation (σ ≈ s). Furthermore, Student's t distribution approximates the shape of the normal distribution when sample size is large.

Parameter                  Test     Number of samples   Samples independent or dependent   Test statistic distribution   Population standard deviation   Sample size
Test for the mean          z-test   1                   -                                  Normal                        Known                           -
Test for the mean          t-test   1                   -                                  Student's t                   Not known                       < 30
Test for the proportion    z-test   1                   -                                  Normal                        -                               -
Test for the means         z-test   2                   Independent                        Normal                        Known                           -
Test for the proportions   z-test   2                   Independent                        Normal                        -                               -
Test for the means         t-test   2                   Independent                        Student's t                   Not known                       < 30
Test for the means         t-test   2                   Dependent                          Student's t                   Not known                       < 30
Table 7.16 One and two sample z and t tests
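The claim that Student's t approaches the normal distribution for large samples is easy to illustrate. The short Python sketch below is an optional aside (not taken from the book's Excel/SPSS material); it prints the two-tailed 5% critical value of t for increasing degrees of freedom next to the normal value of approximately 1.96.

# Two-tailed 5% critical values: Student's t versus the standard normal.
from scipy import stats

z_crit = stats.norm.ppf(0.975)
print(f"Normal: {z_crit:.3f}")
for df in (5, 10, 30, 100, 1000):
    print(f"t with df = {df:>4}: {stats.t.ppf(0.975, df):.3f}")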


Test your understanding TU7.1 Calculate the critical z value if you have a two-tailed test and you choose a significance level of 0.05 and 0.01.
TU7.2 If you conduct a z test and it is a lower one-tailed test, what is your decision if the significance level is 0.05 and the value of the test statistic is -2.01?
TU7.3 Calculate the value of the p-value for the TU7.2 question.
TU7.4 Calculate the value of the z statistic if the null hypothesis is H0: μ = 63, where a random sample of size 23 is selected from a normal population with a sample mean of 66 (assume the population standard deviation is 15).
TU7.5 Calculate the probability that a sample mean is greater than 68 for the TU7.4 question, when the alternative hypothesis is (a) two-tailed, and (b) upper one-tailed (assume the significance level is 0.05).
TU7.6 Repeat TU7.5 but with the information that the population standard deviation is not known. Describe the test you would use to solve this problem. Given that the sample standard deviation was estimated to be 16.2, answer TU7.5 (a) and (b).
TU7.7 According to a recent report by a local car dealership, the mean age of cars that are scrapped is 8.7 years. To test whether this age has changed the dealership regularly keeps details of the age of cars when a decision to send to the scrap yard is made. The managing director of the dealership has provided an analyst (you) with a sample size of 52 and a sample average and standard deviation of 9.5 and 3.2, respectively. The managing director has asked you to analyse the data.
a. State the null and alternative hypothesis.
b. Explain whether you will use a z or t test.
c. Calculate the critical z value (assume significance levels of 0.05 and 0.01).
d. Calculate the value of the test statistic.
e. Decide whether H0 or H1 should be accepted and provide reasons for your decision (use the critical test statistic and p-value in making this decision).
f. Explain any reservations you may have in making your decision with reference to the test assumptions.
TU7.8 A small manufacturing company sells ice cream in 430g pots. The quality control process includes the measuring and recording of all 430g pots which are filled by hand. A recent sample of 36 pots provided a sample mean and standard deviation of 396g and 14g, respectively. You have been given the task of answering the following questions:
a. What type of test could we use to test the hypothesis H0: μ = 430, H1: μ ≠ 430?
b. Conduct the test and decide whether H0 or H1 should be accepted.

c. Provide details of any issues that the company needs to be aware of. Include a discussion of test assumptions and any potential impact on the quality control process.
TU7.9 The quality controller for the ice cream company (TU7.8) has been given responsibility for monitoring individual staff performance on the 430 g production line. The concern is that there is a difference in worker performance in terms of the number of pots filled between the morning and afternoon shifts. Given the sample data in Table 7.17, can we assume that there is a difference between the morning and afternoon shifts? Test at the 0.01 level of significance and assume independent samples. If a difference exists, what would the consequences be for the company and employees? Morning shift: 72 68 91 Afternoon shift: 81 65 88 Table 7.17 Worker performance

69

78

73

77

80

76

75

81

74

76

77

Want to learn more? The textbook online resource centre contains a range of documents to provide further information on the following topics: 1. A7Wa Two-sample z test using Excel. 2. A7Wb Comparing population variances: variance ratio F test and Levene’s test 3. A7Wc Welch ANOVA test 4. A7Wd Statistical power and type II error Factorial experiments workbook Furthermore, you will find a factorial experiments workbook that explores using Excel and SPSS the solution of data problems with more than two samples using both parametric and non-parametric tests.


Chapter 8 Chi square and non-parametric hypothesis tests 8.1 Introduction and learning objectives In previous chapters we explored choosing sample estimates from populations that may be normally distributed, or large samples that came from non-normal distributions. We carried out specific inference or hypothesis tests using z and Student’s t tests. We made assumptions about the nature of the sample and distribution. However, very little has been said about what to do if these assumptions are not met. In this chapter tests will be presented for circumstances in which the assumptions for z and t tests have not been met. In Chapters 5 and 7 we considered only interval- or ratiolevel variables. What can you do if you need to test differences or relationships among nominal or ordinal variables? We will study tests that do not involve the actual data values that were observed (e.g., test scores). Instead, we will look at two types of test. The first type will involve either the counts (or frequencies) of observations (applicable to chi-square tests). The second type will involve tests that use the ranks of scores instead of the scores themselves (applicable to nonparametric tests). Parametric tests from the previous chapter were used to assess whether the differences between means (or proportions) are statistically significant. Within parametric tests we sample from a distribution with a known parameter value, for example the population mean (), variance (2), or proportion (π). Important things to remember about the techniques described in Chapter 7 can be summarised by three assumptions: 1. The underlying population follows a normal distribution. 2. The level of measurement is of equal interval or ratio scaling. 3. The population variances are equal. Unfortunately, we will come across data that do not fit these assumptions. How do we measure the difference between the attitudes of people surveyed in assessing their favourite brand, when the response by each person is of the form 1, 2, 3,…, n? In this situation, we have ordinal data for which taking differences between the numbers (or ranks) is meaningless. Another example is if we are asking for opinions where the opinion is of a categorical form (e.g. strongly agree, agree, disagree, strongly disagree). The concept of difference is again meaningless. The responses are words, not numbers. You can solve this problem by allocating a number to each response, with 1 for strongly agree, 2 for agree, and so on. This gives you a rating scale of responses. However, recall that the opinions of people are not quite the same as measuring time or measuring the difference between two points. Can we say that the difference between strongly agree and agree is the same as the difference between agree and disagree? Another way of looking at this problem is to ask the question: can we say that a rank of 5 is five times as strong as a rank of 1?


As before, the relevance of this chapter is in specific applications that you may meet as you conduct your business. A question, such as whether the two regions that you are responsible for are performing in a uniform fashion, is a good example that can be solved using the techniques from this chapter. Another example is whether male customers respond to my advertising campaign in the same way as female customers. You might think that the answer is obvious. For example, if 65% of women and 55% of men respond positively to your campaign, does this mean that the women are more convinced by the campaign? Not necessarily. Given the population size and other factors, you might be drawing incorrect conclusions. If you apply the methods from this chapter to this specific example, you might, on the other hand, end up telling your management something different. You might say that although it appears that women are more susceptible to the campaign, you have evidence that this is not the case. In fact, you will be able to provide evidence that both genders respond similarly, despite what the actual percentages show. This is the practical value inherent in the methods covered in this chapter. To take yet another example to illustrate how the tests in this chapter can be used, imagine you are doing a small survey for your dissertation. Part of your overall research project is to establish whether students’ attitudes towards work ethics change as they progress through their studies. To establish this, you interview a group of first-year students and a group of final-year students and ask them certain questions to illuminate this problem. You present the results in a simple table, where the rows represent first-year and lastyear students and the columns represent their attitudes (a scale such as strongly agree, agree, disagree, etc.). Once you have constructed such a table, you can establish if the maturity of students is in some way linked with their views on work ethics. The chisquare test would test this claim of an association between their views on work ethics. Table 8.1 provides a list of chi-square and nonparametric statistical tests described in this book together and shows which methods are solved via Excel and SPSS. Statistics test Chi-square test of independence test Chi-square test for two proportions (independent samples) test McNemar’s test for the difference between two proportions (dependent samples) test Chi-square goodness-of-fit test Sign test Wilcoxon signed rank sum test for matched pairs Mann-Whitney test for two independent samples Table 8.1 Chi-square and nonparametric tests in Excel and SPSS

Excel Yes Yes Yes

SPSS Yes Yes Yes

Yes Yes Yes Yes

Yes Yes Yes Yes

As was the case in the previous chapter, it is sometimes difficult to select the appropriate test for the given conditions. Figure 8.1 provides a flow chart of the decisions required to decide which test to use to carry out the correct hypothesis test.


Figure 8.1 Which hypothesis test to use? The key questions are:
1. What are you testing: difference or association?
2. What type of data is being measured?
3. Can we assume that the population is normally distributed?
4. How many samples?

We will begin this chapter with tests based around the chi-square distribution.

Learning objectives On completing this chapter, you will be able to: 1. Apply a chi-square test to solve a range of problems. These include analysing tabulated data, goodness-of-fit tests, testing for normality, and testing for equality of variance. 2. Apply a range of nonparametric tests to solve a range of problems. These include the sign test, Wilcoxon signed-rank test for two paired samples, and the Mann–Whitney U test for two independent samples 3. Solve problems using Microsoft Excel and IBM SPSS Statistics software packages.

8.2 Chi-square tests A chi-square test (or χ2 test, from the Greek letter chi) is any statistical hypothesis test where the sampling distribution of the test statistic is a chi-square distribution when the null hypothesis is true. This test is versatile and widely used. It can be used when

dealing with data that are categorical (or nominal or qualitative) in nature and cannot be stated as a number (e.g. responses such as ‘yes’, ‘no’, ‘red’, and ‘disagree’). In Chapter 1 we explored tabulating such data and the use of bar and pie charts. Charts were appropriate when we were dealing with proportions that fall into each of the categories measured, which form a sample from a proportion of all possible responses. In Chapter 7 we explored the situation of comparing two proportions where we assumed that the underlying population distribution is normally distributed. In this section we will explore a series of methods that make use of the chi-square distribution to make inferences about two or more proportions: 1. Test of independence. 2. Chi-square test for two populations (independent samples) 3. McNemar’s test for differences between two proportions (dependent samples)

Chi-square test of independence In this section, we test a claim that the row and column variables are independent of each other. A test of independence tests the null hypothesis that there is no association between the variables in a contingency table. There is no requirement for the population data to be normally distributed or follow any other distribution. However, the following three assumptions are made:
1. Simple random sample: the sample data are a random sample from a fixed population where every member of the population has an equal probability of selection.
2. Independence: the observations must be independent of each other. This means the chi-square test cannot be used for correlated data (e.g. matched pairs). In that situation McNemar's test is more appropriate.
3. Cell size: the chi-square test is valid if at least 80% of the expected frequencies exceed 5 and all the expected frequencies exceed 1.
If these assumptions hold, the chi-square test statistic follows a chi-square distribution. Suppose a university department surveyed 594 students to see which module they had chosen for their final year of study. The objective was to determine whether males and females differed in module preference for five specific modules (Table 8.2). The question we would like to answer is whether we have an association between the module chosen and a person's gender.

Module       Males   Females
BIN3020-N    65      86
BIN3029-N    66      82
BIN3022-N    51      38
BIN3678-N    59      40
BIN3045-N    59      48
Table 8.2 Gender versus module attended


To answer this question, we can employ the chi-square test of independence (or test of association). This is used to determine whether the frequency of occurrences for two category variables (or more) are significantly related (or associated) to each other. Null hypothesis, H0 The null hypothesis (H0) states that the row and column categorical variables (e.g. the module, student gender) are not associated (are independent). Alternative hypothesis, H1 The alternative hypothesis (H1) states that the row and column variables are associated (are dependent). The chi-square test statistic is defined by equation (8.1):

χ2 = Σall cells (O - E)² / E    (8.1)

Where O is the observed frequency in the cell of a contingency table and E is the expected frequency in the cell of a contingency table, given the null hypothesis is true. It can be shown that if the null hypothesis is true then the chi-square (2) test statistic approximately follows a chi-square distribution with (r – 1)(c – 1) degrees of freedom. Why are the degrees of freedom equal to (r - 1) (c - 1)? Suppose that we know the marginal totals for each of the levels of our categorical variables. In other words, we know the total for each row and the total for each column. For the first row, there are c columns in our table, so there are c cells. Once we know the values of all but one of these cells, then because we know the total of all the cells it is a simple algebra problem to determine the value of the remaining cell. If we were filling in these cells of our table, we could enter c – 1 of them freely, but then the remaining cell is determined by the total of the row. Thus, there are c – 1 degrees of freedom for the first row. We continue in this manner for the next row, and there is again c – 1 degrees of freedom. This process continues until we get to the penultimate row. Each of the rows except for the last one contributes c – 1 degrees of freedom to the total. By the time that we have all but the last row, then because we know the column sum, we can determine all the entries of the final row. This gives us r - 1 rows with c - 1 degrees of freedom in each of these, for a total of (r – 1) (c – 1) degrees of freedom. As with parametric tests, you would reject the null hypothesis H0 if the value of the test statistic is equal to or greater than the upper critical value of the chi-square distribution with (r – 1)(c – 1) degrees of freedom: Reject null hypothesis: χ2cal ≥ χ2cri Fail to reject the null hypothesis: χ2cal < χ2cri Page | 421

Figure 8.2 illustrates the region of acceptance/rejection for the null hypothesis.

Figure 8.2 Chi-square region of acceptance/rejection for the null hypothesis Again, just like with parametric tests, we can also make decisions by using either the critical value criterion or the p-value. If the null hypothesis is true, then we can use the observed frequencies in the table to calculate the expected frequencies (E) for each cell using equation (8.2):

E = (Row total × Column total) / Total sample size    (8.2)

To test the null hypothesis, we would compare the expected cell frequencies with the observed cell frequencies and calculate the Pearson chi-square test statistic given by equation (8.3), which is a more detailed version of equation (8.1):

χ2cal = Σ i=1..n (Oi - Ei)² / Ei    (8.3)

The chi-square test statistic enables a comparison to be made between the observed frequency (O) and expected frequency (E). Equation (8.3) tells us what the expected frequencies would be if there was no association between the two categorical variables, for example, gender and course. If the values were close to one another then this would provide evidence that there is no association. Conversely, if we find large differences between the observed and expected frequencies, then we have evidence to suggest an association does exist between the two categorical variables: gender and course.
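Equations (8.2) and (8.3) translate directly into a short Python function. The sketch below is an illustration only (the book's own calculations use Excel and SPSS); it takes any table of observed counts and returns the expected counts, the chi-square statistic, and the (r - 1)(c - 1) degrees of freedom introduced formally just below as equation (8.4). Applied to the gender-by-module counts of Example 8.1, it should reproduce the value of 11.267 calculated there.

# Sketch of equations (8.2) and (8.3): expected frequencies and the chi-square statistic.
import numpy as np

def chi_square_statistic(observed):
    observed = np.asarray(observed, dtype=float)
    row_totals = observed.sum(axis=1, keepdims=True)
    col_totals = observed.sum(axis=0, keepdims=True)
    total = observed.sum()
    expected = row_totals * col_totals / total               # equation (8.2)
    chi_sq = ((observed - expected) ** 2 / expected).sum()   # equation (8.3)
    df = (observed.shape[0] - 1) * (observed.shape[1] - 1)   # (r - 1)(c - 1)
    return expected, chi_sq, df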


Statistical hypothesis testing allows us to confirm whether the differences are likely to be statistically significant. The chi-square distribution varies in shape with the number of degrees of freedom. Therefore, we need to find this value before we can look up the appropriate critical values. The number of degrees of freedom (df) for a contingency table is given by equation (8.4): df = (r – 1)(c – 1)

(8.4)

Where r is the number of rows and c is the number of columns. We will identify a rejection region using either the critical test statistic or the p-value calculated from the test statistic, as with the hypothesis testing methods described in Chapter 6. Example 8.1 We are interested in establishing if gender and course selection are associated. We have two attributes: gender and module, both of which have been divided into categories: two for gender and five for module. The resulting table is called a 5 × 2 contingency table because it consists of five rows and two columns. To determine whether gender and chosen module are associated (or independent), we conduct a chi-square test of association on the contingency table. Table 8.3 repeats Table 8.2, but with the addition of row, column, and overall totals.

Module       Males   Females   Totals
BIN3020-N    65      86        151
BIN3029-N    66      82        148
BIN3022-N    51      38        89
BIN3678-N    59      40        99
BIN3045-N    59      48        107
Totals       300     294       594
Table 8.3 Gender versus module chosen

The five-step procedure to conduct this test progresses as follows. Step 1 State hypothesis Null hypothesis H0: Gender and module choice are not associated (are independent). Alternative hypothesis H1: There is an association between gender and module chosen (they are dependent). Step 2 Select test
• Number of samples – two category data variables (gender and module).
• Random sample.
• Values represented as frequency counts within the contingency table.
• The statistic we are testing – testing for an association between the two category data variables.
Apply a chi-square test of association to the sample data. Step 3 Set the level of significance α = 0.05 Step 4 Extract relevant statistic Calculate the expected frequencies using equation (8.2) and the chi-square value for each observed/expected frequency pair using equation (8.3). The results are given in Tables 8.4 and 8.5.

Module       Expected male (Em)           Expected female (Ef)
BIN3020-N    =151*300/594 = 76.2626       =151*294/594 = 74.7374
BIN3029-N    =148*300/594 = 74.7475       =148*294/594 = 73.2525
BIN3022-N    =89*300/594 = 44.9495        =89*294/594 = 44.0505
BIN3678-N    =99*300/594 = 50.0000        =99*294/594 = 49.0000
BIN3045-N    =107*300/594 = 54.0404       =107*294/594 = 52.9596
Table 8.4 Calculation of the expected frequencies

We can now calculate the ratio (O - E)² / E for each cell pairing of observed (O) and expected (E) frequencies.

Module       (O - E)²/E males   (O - E)²/E females
BIN3020-N    1.6633             1.6972
BIN3029-N    1.0237             1.0446
BIN3022-N    0.8144             0.8311
BIN3678-N    1.6200             1.6531
BIN3045-N    0.4552             0.4645
Table 8.5 Calculation of the chi-square values

Calculate the test statistic using equation (8.3):
χ2cal = 1.6633 + 1.0237 + … + 1.6531 + 0.4645 = 11.2670
Calculate the number of degrees of freedom using equation (8.4) and the values from Table 8.4: df = (r - 1)(c - 1) = 4 × 1 = 4


Step 5 Make a decision We can identify a rejection region using the critical test statistic method. From chi-square critical tables, we find a critical value of 9.49 for the 5% significance level and 4 degrees of freedom (Figure 8.3).

Figure 8.3 Percentage points of the chi-square distribution Decision rule χ2cal = 11.2670 χ2cri = 9.49 From the calculations, we have χ2cal > χ2cri , so we reject the null hypothesis, H0. We conclude that there is a significant relationship (or association) between the category variables gender and module preference. If you look at table 8.5, you can see that the main contribution to this overall χ2cal value of 11.2670 comes from module BIN3020-N with more females compared to males, and BIN3678-N with more males than females. Figure 8.4 illustrates the relationship between the test statistic and the critical test statistic.


Figure 8.4 Chi-square distribution with df = 4
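For readers with access to Python, SciPy's chi2_contingency function performs the same test in one call; applied to the observed counts from Table 8.3 it should reproduce the figures above (chi-square ≈ 11.267, p ≈ 0.024, df = 4). This is offered only as an optional cross-check on the Excel and SPSS output.

# Optional cross-check of Example 8.1 with SciPy.
from scipy.stats import chi2_contingency

observed = [
    [65, 86],   # BIN3020-N: males, females
    [66, 82],   # BIN3029-N
    [51, 38],   # BIN3022-N
    [59, 40],   # BIN3678-N
    [59, 48],   # BIN3045-N
]

chi_sq, p_value, df, expected = chi2_contingency(observed)
print(f"chi-square = {chi_sq:.3f}, p = {p_value:.4f}, df = {df}")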


One of the requirements of this test is that we have an adequate expected cell count. What do we mean by an adequate expected cell count? If we have a 2 x 2 table, then every cell should have an expected value of at least 5. For a larger table, at least 80% of the cells should have expected frequencies of 5 or more, no cell should have an expected count of zero, and all expected frequencies must be greater than 1. If these criteria cannot be met, then we must increase the sample size and/or combine classes to eliminate frequencies that are too small. When the expected counts in a 2 × 2 table are small, Yates's correction for continuity (Yates's chi-square statistic) is applied. Any cell frequencies < 5 Note that none of the cells have an expected frequency less than 5. Yates's correction for continuity Using the chi-square distribution to interpret the chi-square statistic requires one to assume that the discrete probability of observed binomial frequencies in the table can be approximated by the continuous chi-square distribution. This assumption is not quite correct and introduces some error. To reduce the error in approximation, Frank Yates suggested a correction for continuity that adjusts the formula for the chi-square test by subtracting 0.5 from the difference between each observed value and its expected value in a 2 × 2 contingency table. Equation (8.5) shows Yates's correction for continuity (Yates's chi-square statistic) for Pearson's chi-square statistic:

χ2Yates = Σ i=1..n (|Oi - Ei| - 0.5)² / Ei    (8.5)

If you calculate Yates’s chi-square test statistic by using equation (8.5) in place of equation (8.3) then the values will be: χ2cal = 9.9552

χ2cri = 9.49

From the calculations, we have χ2cal > χ2cri , so we reject the null hypothesis, H0. We conclude that there is a significant relationship (or association) between the category variables gender and module preference. Notice that the corrected value of chi-square is smaller than uncorrected chi-square value: Yates's correction makes the test more ‘conservative’, meaning that it is harder to get a significant result (although in this case chi-square remains statistically significant). Note that the Excel function CHISQ.TEST does not use Yates’s chi-square continuity correction. Note also that SPSS calculates chisquare using Yates's correction for continuity for a 2 × 2 table when the data are in raw data format (see Example 8.3). Excel solution Input the data. Page | 426

Calculate the expected frequencies and chi-square ratio (O – E)2/E as illustrated in Figure 8.5.

Figure 8.5 Example 8.1 Excel solution Observe in Figure 8.5, we created extra columns called Em (Expected male), Ef (Expected female), row totals, column totals, and (O - E)²/E. Expected frequencies in Figure 8.5 are calculated as:

Expected frequencies male (Em):    Cell C16   Formula: =$E5*C$10/$E$10   Copy formula down C17:E20
Expected frequencies female (Ef):  Cell D16   Formula: =$E5*D$10/$E$10   Copy formula down D17:D20
(O-E)²/E:                          Cell E15   Formula: =(C5-C16)^2/C16   Copy formula down and across to F20
Table 8.6

Now we can proceed and calculate the test statistics for the chi-square hypothesis test as illustrated in Figure 8.6.


Figure 8.6 Example 8.1 Excel solution continued Using χ2cri to make a decision

From Excel: χ2cal = 11.267 and χ2cri = 9.49. From the calculations, we have χ2cal > χ2cri, so we reject the null hypothesis, H0. Use test p-value to make a decision From Excel: p-value = 0.0237 and significance level α = 0.05. From the calculations, the p-value is less than the significance level (0.0237 < 0.05). Therefore, reject the null hypothesis, H0. We conclude that there is a significant relationship (or association) between the category variables gender and module preference. Any cell frequencies < 5

Note in cell J29 we are checking that none of the cells have an expected frequency less than 5. In Figure 8.7, we observe that χ2cal lies in the rejection zone (11.267 > 9.488).

Figure 8.7 Chi-square distribution with df = 4 Yates's continuity correction You could calculate Yates's continuity correction value of chi-square = 9.9552 by subtracting 0.5 from the chi-square equation = (|O - E| - 0.5)²/E in cells E16:F20. SPSS solution We can use SPSS to solve this problem by converting the independent categories (Gender, Module) into SPSS codes and creating a third variable to represent the frequency of occurrence within the contingency table. Table 8.7 represents the coding to be used for the two independent categories and the frequency count for each of the category codes.

Module       Module code   Gender   Gender code   Frequency
BIN3020-N    1             Male     1             65
BIN3020-N    1             Female   2             86
BIN3029-N    2             Male     1             66
BIN3029-N    2             Female   2             82
BIN3022-N    3             Male     1             51
BIN3022-N    3             Female   2             38
BIN3678-N    4             Male     1             59
BIN3678-N    4             Female   2             40
BIN3045-N    5             Male     1             59
BIN3045-N    5             Female   2             48
Table 8.7 SPSS codes


Enter these data into SPSS as illustrated in Figure 8.8.

Figure 8.8 SPSS Data The next step is to weight the cases given the count values are frequencies. Click on Data > Weight Cases Move Frequency variable into the Frequency Variable box as illustrated in Figure 8.9.

Figure 8.9 Weight cases Click OK Now run SPSS to solve this problem. Click Analyze > Descriptive Statistics > Crosstabs… Transfer Course to Row(s) box and Gender to Column(s) box as illustrated in Figure 8.10.


Figure 8.10 SPSS Crosstabs menu Click on Statistics, Choose Chi-square Click Continue. Click on Cells Choose Observed and Expected frequencies Click Continue. Click OK. SPSS output The SPSS output solution is split into three parts: a case processing summary (Figure 8.11), a crosstabulation table (Figure 8.12) and a table labelled ‘Chi-Square Tests’ (Figure 8.13).

Figure 8.11 Example 8.1 SPSS solution


Figure 8.12 Example 8.1 SPSS solution continued

Figure 8.13 Example 8.1 SPSS solution continued If you compare the SPSS and Excel solution (Figure 8.6) you will observe that you have the same results as with the Excel solution. The chi-square p-value associated with the chi-square score (+11.267) is 0.024, which means that there is a probability of 0.024 that we would get a value of chi-square as large as the one we have if there were no effect in the population. Given p-value = 0.024 < 0.05 (significance level), we reject the null hypothesis and accept the alternative hypothesis. Conclusions The value of the chi-square test statistic is 11.267 and is highly significant given the pvalue = 0.024 < 0.05. This indicates that gender of a person is a significant factor in the type of course males and females attended.

How do you solve problems when you have raw data? Calculation of a chi-square test of association using Excel and IBM SPSS when we have raw data rather than data in a contingency table is illustrated in Example 8.2. Page | 432

Example 8.2 If you have spent any time watching political campaign adverts, you probably know that such adverts can vary in terms of what they emphasise. For example, some adverts may emphasise polishing a candidate's image, while other adverts may emphasise the candidate's stand on issues. You may also have noticed that adverts can vary in terms of how they attempt to motivate voters. For example, some adverts may attempt to appeal to fears that will scare voters into voting for the candidate, or at least against some opponent of the candidate. Other adverts may rely on other motivational strategies. Johnson and Kaid set out to learn, among other things, whether these two variables, emphasis (on either image or issues) and persuasion strategy (using a fear appeal or not using a fear appeal), might be related. (See Johnson, A. and Kaid, L.L. (2002), Image ads and issue ads in U.S. presidential advertising: Using video style to explore stylistic differences in televised from 1952 to 2000. Journal of Communication, 52, 281–300.) As an example, they hypothesised that fear appeals are more common in issue adverts than in image adverts. To investigate that possibility, Johnson and Kaid’s study examined 1213 presidential campaign ads run between 1952 and 2000. Figure 8.14 provides a screenshot of the first 10 records out of a total of 1213 records.

Figure 8.14 First 10 records out of 1213 records For this problem, we wish to check if there is an association between the type of advert and the advert’s appeal. Therefore, this problem involves running a chi-square test of association, where the null hypothesis states that we have no association between the two category variables (type of advert and appeal of advert). The alternative hypothesis states that an association exists. We will test at the 5% significance level ( = 0.05). Excel solution Enter data into Excel as shown in Figure 8.15 (only the first and the last 7 records are shown, i.e. rows 11:1209 are hidden).


Figure 8.15 Example 8.2 Excel solution Create the crosstab table. In Excel, the crosstab table is an Excel PivotTable. Click in the data area and click on Insert PivotTable (Figure 8.16).

Figure 8.16 Create PivotTable Click OK. Now we need to tell the Excel PivotTable what should go in the PivotTable rows, columns, and sum values boxes (Figure 8.17).

Figure 8.17 Excel PivotTable


Now copy this PivotTable below the current table, but this time right-click and choose Paste Special > Values. Repeat this step underneath your first copy to produce Figure 8.18.

Figure 8.18 Copy values from the PivotTable You could change the number formatting in the PivotTable from numbers to a percentage. This is achieved by clicking in the PivotTable data field (say, cell F4), and the PivotTable Fields menu will appear. Now change in  Values box from Count of Appeal of Advert to Value Field Settings…, Choose Show Values As, and choose % of Row Total.

Figure 8.19 Excel PivotTable showing percentages We observe from Figure 8.19 that 26.02% of issue adverts have a fear factor, compared to 13.99% of image adverts. Now we can use the third table. Remove the observed frequency values (cells F16:G17) and insert the Excel formulae into these cells to calculate the expected frequencies (cell F16: =H10*F12/H12, cell F17: =H16*G18/H18, cell G16: =H17*F18/H18, and cell G17: =H17*G18/H18). Note that to aid understanding we have labelled the observed and expected frequency tables in Figure 8.20. If all you want is to decide if the null or the alternative hypothesis should be accepted then calculate the chi-square p-value for this example using the Excel function =CHISQ.TEST(F10:G11, F16:G17), as illustrated in Figure 8.20.


Figure 8.20 Example 8.2 Excel solution continued From Excel:
• chi-square test statistic = 23.584 > chi-square critical value = 3.841
• chi-square p-value = 0.0000012 < test significance level = 0.05

Both methods, as expected, yield the same conclusion: reject the null hypothesis. We conclude that there is evidence to suggest an association between type of advert (image, issue) and the appeal of the advert (no fear appeal, fear appeal). Any expected frequencies < 5 Check from the table that no cell has an expected frequency < 5 (cell G37). Yates's continuity correction For completeness, we can also use Yates's correction for continuity (equation (8.5)) instead of equation (8.3) by modifying the equation in cells H25:H28 as illustrated in Figure 8.21. Our conclusion is unchanged.


Figure 8.21 Example 8.2 Excel solution with Yate’s continuity applied SPSS solution Import the data into SPSS to create the SPSS data file (the SPSS data file is available for download from the online resource centre).

Figure 8.22 Example 8.2 SPSS data Choose Analyze > Descriptive Statistics > Crosstabs… (Figure 8.23).

Figure 8.23 SPSS Crosstabs menu

Choose Statistics… Select Chi-Square Click Continue Choose Cells…, Select Observed and Expected counts Click Continue Click on OK SPSS output The first output table (Figure 8.24) shows that we have 1213 cases (or participants) to complete the study with no missing cases observed – this agrees with the Excel solution.

Figure 8.24 SPSS solution The second output table (Figure 8.25) shows the crosstabulation data for this example, and we observe that more participants showed a greater fear with an advert issue (204) than with an avert image (60).

Figure 8.25 SPSS solution continued The final output table (Figure 8.26) shows the results of the chi-square test: the Pearson chi-square value is 23.584, with a two-sided p-value given as 0.000. Although the SPSS solution shows the p-value equal to 0.000 this does not imply that the probability is zero but that it is extremely unlikely.


Figure 8.26 SPSS solution continued For this example, the null hypothesis should be rejected. We conclude that there is evidence to suggest a significant association between the type of advert (image, issue) and the appeal of the advert (no fear appeal, fear appeal). This agrees with the Excel solution provided in Figure 8.20. Finally, given this is a 2 × 2 table, Yates’s chi-square continuity value is provided in Figure 8.26, and is equal to 22.882. This value agrees with the Excel solution provided in Figure 8.21. The chi-square p-value associated with the chi-square score (+ 23.584) is 0.000, which means that there is a probability of 0.000 that we would get a value of chi-square as large as the one we have if there were no effect in the population. Given p-value = 0.000 < 0.05 (significance level), we reject the null hypothesis and accept the alternative hypothesis. Conclusions The value of the chi-square test statistic is 23.584 and is highly significant given the pvalue = 0.000 < 0.05. This indicates that association exists between type of advert and advert appeal.
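When the data arrive in raw, record-level form, as in Example 8.2, the crosstabulation and the test can also be produced in Python with pandas and SciPy. The sketch below is illustrative only: the file name and the column labels ("Type of Advert", "Appeal of Advert") are assumptions about how the downloadable data set might be organised, not a prescription, and the book's own analysis remains the Excel and SPSS output shown above.

# Illustrative sketch for raw (record-level) data, as in Example 8.2.
# File name and column labels are assumptions about the downloadable data set.
import pandas as pd
from scipy.stats import chi2_contingency

adverts = pd.read_csv("adverts.csv")  # one row per advert

# Build the contingency table from the raw records.
table = pd.crosstab(adverts["Type of Advert"], adverts["Appeal of Advert"])
print(table)

# correction=False gives the Pearson statistic; correction=True applies
# Yates's continuity correction (only relevant for 2 x 2 tables).
chi_sq, p_value, df, expected = chi2_contingency(table, correction=False)
print(f"chi-square = {chi_sq:.3f}, p = {p_value:.6f}, df = {df}")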

Check your understanding X8.1

A business consultant requests that you perform some preliminary calculations before analysing a data set using Excel: a. Calculate the number of degrees of freedom for a contingency table with three rows and four columns. b. Find the upper tail critical c2 value with a significance level of 5% and 1%. What Excel function would you use to find this value? c. Describe how you would use Excel to calculate the test p-value. What does the p-value represent if the calculated chi-square test statistic equals 8.92?

X8.2

A trainee risk manager for an investment bank has been told that the level of risk is directly related to the industry type (manufacturing, retail and financial). Table 8.8 shows the frequencies for the type of risk and different industries, as

collected in a survey. For these data, analyse whether or not a perceived risk is dependent upon the type of industry identified. Use a significance level of 5%. If the two variables are associated, then what is the form of the association?

Level of Risk   Manufacturing   Retail   Financial
Low             81              38       16
Moderate        46              42       33
High            22              26       29
Table 8.8 Type of industry versus level of risk

X8.3


A manufacturing company is concerned at the number of defects produced in the manufacture of office furniture. The firm operates three shifts and has classified the number of defects as low, moderate, high, or very high. Table 8.9 shows the number of defects recorded for the different shifts over a period of time. Is there any evidence to suggest a relationship between types of defect and shifts? Use a significance level of 5%. If the two variables are associated, then what is the form of the association?

Shift   Low   Moderate   High   Very high
1       29    40         91     25
2       54    65         63     8
3       70    33         96     38
Table 8.9 Defect type per shift


X8.4

A local trade association is concerned about the level of business activity within the local region. As part of a research project a random sample of business owners were surveyed on how optimistic they were for the coming year. Table 8.10 shows their responses. Based upon the table, do we have any evidence to suggest different levels of optimism for business activity? Use a significance level of 5%. If the two variables are associated, then what is the form of the association?

Optimism level   Bankers   Manufacturers   Retailers   Farmers
High             38        61              59          96
No change        16        32              27          29
Low              11        26              35          41
Table 8.10 Type of business versus levels of optimism

X8.5

A group of 412 students at a language school volunteered to sit a test to assess the effectiveness of a new method to teach German to English-speaking students. To assess the effectiveness, the students sat two different tests, one in English and the other in German. Table 8.11 shows their results. Is there any evidence to suggest that the student test performances in English are replicated by their test performances in German? Use a significance level of 5%. If the two variables are associated, then what is the form of the association?


German       French ≥ 60%   French 40% - 59%   French < 40%
≥ 60%        90             81                 8
40% - 59%    61             90                 8
< 40%        29             39                 6
Table 8.11 French versus German performance


Chi-square test for two proportions (independent samples) In Chapter 7 we explored the application of the z test to solve problems involving two proportions. If we are concerned that the parametric assumptions are not valid then we can use the chi-square test to test two independent proportions. With the chi-square test for two independent samples we have two samples that involve counting the number of times a categorical choice is made. In this situation we can develop a crosstabulation (or contingency table) to display the frequency that each possible value was chosen. Example 8.3 To illustrate the concept, consider the example of a firm trying to establish its environmental footprint by conducting a survey whether employees use the bus to travel to work. Table 8.12 summarises the responses for only the people who work on Mondays and Fridays. The question is whether we have a significant difference between the Monday and Friday employees who travel to work by bus. Monday Friday Take bus to work 105 90 Do not take bus to work 70 100 Total 175 190 Table 8.12 Employees’ method of travel to work

Total 195 170 365

Column variable 1 2 Totals 1 n1 n2 N Row 2 t1 – n1 t2 – n2 T-N variable Totals t1 t2 T Table 8.13 Generic 2 x 2 contingency table In general, a 2×2 contingency table can be structured as illustrated in Table 8.12. From this table, we can estimate the proportion of employees who will use the bus (and by extension the probability of their doing so) by calculating the overall proportion (ρ) using equation (8.6): ρ=

n1 + n2 t1 + t2

=

N T

(8.6)

We can now use this estimate to calculate the expected frequency (E) for each cell within the contingency table. We do this by multiplying the column total by ρ for the Page | 441

cells linked to travel by bus, and by 1 - ρ for those cells linked to not travelling by bus, using equation (8.7): E = ρ × column total

(8.7)

We then calculate the chi-square test statistic to compare the observed and expected frequencies using equation (8.3): n

( Oi − Ei )

i =1

Ei

 = 2 cal

2

Where Oi is the observed frequency in a cell, and Ei is the expected frequency in a cell calculated if the null hypothesis is true. The number of degrees of freedom is given by equation (8.4). df = (r – 1)(c – 1) In this case, we would expect the proportion of employees taking the train on the two days to be the same. This fact can then be used to calculate the expected frequencies. From the observed and expected frequencies, we can calculate the chi-square test statistic. We would then compare this value with the critical chi-square test statistic or calculate the test p-value and compare with the significance level. The five-step procedure to conduct this test progresses as follows. Step 1 State hypothesis Given that the population proportions are 1 and 2, the null and alternative hypothesis are as follows: H0: 1 = 2 (proportions travelling by bus on the two days are the same) H1: 1 ≠ 2 (proportions different) The ≠sign indicates a two-tailed test. Step 2 Select the test • • •

Two independent samples Categorical data Chi-square test for the difference between two proportions.

Step 3 Set the level of significance  = 0.05 Step 4 Extract relevant statistic Page | 442

Calculate  using equation (8.6): ρ=

n1 + n2 N 195 = = = 0.534 t1 + t 2 T 365

Calculate the expected frequencies for each cell using equation (8.7). For example, for the bus on Monday the expected frequency would be 195×175/365 = 195×0.479 = 93.493. Repeat this calculation for the other cells within the contingency table. To calculate the chi-square test statistic, we now need to calculate for each cell the ratio (O – E)2/E given in equation (8.2), as illustrated in Table 8.14. Observed frequency

Expected frequency

Chi-square value for Monday (O – E)^2/E 1.4 1.6

Monday Friday Total EM EF Bus 105 90 195 93.5 101.5 Not bus 70 100 170 81.5 88.5 Total 175 190 365 Table 8.14 Calculate the expected frequencies and chi-square values

Chi-square value for Friday (O – E)^2/E 1.3 1.4

Note we have checked that no expected frequency is less than 5. We sum these values to give the χ2cal test statistic: 2 𝜒𝑐𝑎𝑙 =∑

(𝑂−𝐸)2 𝐸

= 1.416 + 1.304 + 1.624 + 1.496 = 5.841

The degrees of freedom are calculated from equation (8.4): df = (r – 1)(c – 1) = 1. Step 5 Make a decision The critical value can be found using statistical tables with a two-tailed significance level of 0.05 and degrees of freedom = 1.

Figure 8.27 Percentage points of the chi-square distribution Page | 443

From Figure 8.27, the critical chi-square value is χ2cri = 3.84. Does the test statistic lie within the rejection region? Compare the calculated and critical chi-square values to determine which hypothesis statement (H0 or H1) to accept. We observe that the χ2cal lies in the rejection region (5.841 > 3.84), and we reject the null hypothesis H0. We conclude that there is a significant difference in the proportions travelling by bus on Monday compared to Friday. Any expected frequencies < 5 From table 8.14, we have no expected frequencies < 5. Yate’s continuity correction You could calculate Yate’s continuity correction value of chi-square = 5.345 by subtracting 0.5 from the chi-square equation = (|O − E| − 0.5)2⁄E in table 8.14. Relationship between Z and χ2 when we have 1 degree of freedom When we have 1 degree of freedom, we can show that there is a simple relationship between the value of χ2cal and the corresponding value of Zcal. If you carry out the calculation of a two-independent-sample z test for proportions:

p1 = X1/n1 = 105/175 = 0.6

p2 = X2/n2 = 90/190 = 0.47368

p = (X1 + X2)/(n1 + n2) = 195/365 = 0.53425

If H0 is true, then π1 = π2, and Z equals

z = (p1 − p2) / √[ p (1 − p) (1/n1 + 1/n2) ] = 2.4169

Note that Z² = (2.4169)² = 5.84139. This value is equal to χ²cal.

If we are interested in testing for direction in the alternative hypothesis (e.g. H1: π1 > π2), then you cannot use a χ² test but will have to use a normal distribution Z test to test for direction.

Excel solution

Figures 8.28 and 8.29 illustrate the Excel solution.

Figure 8.28 Example 8.3 Excel solution

The expected frequencies and (O − E)²/E values are calculated as:

Expected frequencies   Cell C13   Formula: =$E6*C$8/$E$8 (copy formula down and across C13:D14)
(O − E)^2/E            Cell E13   Formula: =(C6-C13)^2/C13 (copy formula down and across E13:F14)

Table 8.15


Figure 8.29 Example 8.3 Excel solution continued

From Excel: χ²cal = 5.841, χ²cri = 3.841, p-value = 0.016.

Any expected frequencies < 5?

Note we have checked that no expected frequency is less than 5.

Yates' continuity correction

You could calculate the Yates' continuity correction value of chi-square = 5.345 by changing each term to (|O − E| − 0.5)²/E in cells E13:F14.

Does the test statistic lie within the rejection region?

Critical value solution: given χ²cal = 5.841 > χ²cri = 3.841, reject the null hypothesis.

P-value solution: given p-value = 0.016 < 0.05, reject the null hypothesis.

We conclude that there is a significant difference in the proportions travelling by bus on Monday compared to Friday.

SPSS solution

Enter data into SPSS.

Figure 8.30 Example 8.3 SPSS data Codes: Day (1 = Monday, 2 = Friday) and Bus (1 = take bus to work, 2 = do not take bus to work). Given this is in grouped frequency form, we need to weight cases in SPSS so that SPSS knows the count (or frequency) values for each pairing of Day and Bus.

Figure 8.31 SPSS Weight Cases menu

Click OK. Select Analyze > Descriptive Statistics > Crosstabs (Figure 8.32). Transfer the Day variable into Row(s) and the Bus variable into Column(s). Click Statistics, select Chi-square, then select Continue.

Click Cells, select Observed and Expected, then select Continue.

Figure 8.32 SPSS Crosstabs menu Click OK. SPSS output The output is shown in Figures 8.33–8.35.

Figure 8.33 Example 8.3 SPSS solution

Figure 8.34 SPSS solution continued


Figure 8.35 SPSS solution continued

The Pearson chi-square test statistic is 5.841 with a two-sided p-value of 0.016. This is the same as the Excel solution.

Conclusions

Conclude that there is a significant difference in the proportions travelling by bus on Monday compared to Friday [chi-square test statistic = 5.841, p-value = 0.016 < 0.05].
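Although the workflow in this book is Excel and SPSS, the same calculation can be cross-checked in a few lines of Python. The sketch below is only an illustration and assumes the SciPy library is available; the variable names are our own and do not correspond to anything in the accompanying workbooks.

```python
from scipy import stats

# Observed frequencies from Table 8.14 (rows: bus / not bus, columns: Monday / Friday)
observed = [[105, 90],
            [70, 100]]

# Pearson chi-square test without Yates' correction (matches the manual calculation);
# set correction=True to reproduce the Yates'-corrected value instead.
chi2, p_value, dof, expected = stats.chi2_contingency(observed, correction=False)
print(chi2, p_value, dof)        # approximately 5.841, 0.016, 1
print(expected)                  # expected frequencies, e.g. 93.5 for bus on Monday

# With 1 degree of freedom the two-proportion z statistic squared equals chi-square
p1, p2 = 105/175, 90/190
p_pooled = 195/365
z = (p1 - p2) / (p_pooled*(1 - p_pooled)*(1/175 + 1/190))**0.5
print(z**2)                      # approximately 5.841
```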

Check your understanding X8.6

A local wine shop is considering running an advertisement to sell wine during the time when a major athletics competition takes place. The shop owner has conducted a small survey to check on his customers' likely television viewing habits during this athletic competition (watch, do not watch) and whether they drink wine during the competition (drink wine, do not drink wine). Based upon the sample data presented in Table 8.16, carry out an appropriate test to test the hypothesis that the proportions drinking wine are not affected by their viewing habits.

Drink wine?          Watch athletics   Do not watch athletics
Drink wine           16                24
Do not drink wine    4                 56

Table 8.16 Television viewing habits

X8.7

Extra Hotels owns two hotels in a popular tourist town. The managing director would like to check if the two hotels (X, Y) have similar return rates and has done a small survey to ascertain if this is true. The results are presented in Table 8.17. Conduct a suitable test to test that the proportions returning are the same for hotels X and Y.

Choose hotel again?   X     Y
Yes                   163   154
No                    64    108

Table 8.17 Hotel return rates

McNemar's test for the difference between two proportions (dependent samples)

The contingency table methods described so far are based on the data being independent. For a 2 × 2 table consisting of frequency counts that result from matched pairs, the independence rule is violated. In this situation, we can use McNemar's test for matched pairs. In general, the 2 × 2 contingency table can be structured as illustrated in Table 8.18. From this table, we observe that the sample proportions are given by equations (8.8) and (8.9).

                                Condition 2 (column variable)
                                Yes        No         Totals
Condition 1        Yes          a          b          a + b
(row variable)     No           c          d          c + d
                   Totals       a + c      b + d      N

Table 8.18 Generic 2 x 2 table

Where:
a = number who answer yes to condition 1 and yes to condition 2.
b = number who answer yes to condition 1 and no to condition 2.
c = number who answer no to condition 1 and yes to condition 2.
d = number who answer no to condition 1 and no to condition 2.
N = total sample size who answered the questions.

The question we have is: are the population proportions answering yes to condition 1 the same as for condition 2? In other words, is π1 = π2. The sample proportions who answer yes under conditions 1 and 2 are given by equations (8.8) and (8.9):

ρ1 = (a + b)/N     (8.8)
ρ2 = (a + c)/N     (8.9)

Equation (8.10) represents the McNemar test statistic used to test the null hypothesis H0: π1 = π2:

Zcal = (b − c)/√(b + c)     (8.10)

For H0 true, if b + c ≥ 25, then McNemar's z-test statistic defined by equation (8.10) is approximately normally distributed. McNemar's test can be one- or two-tailed. If you have a two-tailed test, then the test statistic can be referred either to a chi-square distribution or, approximately, to a normal distribution. If your test is one-tailed, then you can only use the approximate normal distribution test. The assumptions for the test are as follows:

1. The sample data have been randomly selected from the population.
2. The sample data consist of matched pairs of frequency counts.
3. The sample data are at the nominal level of measurement.

Exact binomial solution

If either b or c is small (b + c < 10), then the test statistic is not well approximated by the chi-square distribution. In this case, an exact binomial test can be used, where the smaller of b and c is compared to a binomial distribution with n = b + c trials and p = 0.5. To achieve a two-sided p-value, the p-value of the extreme tail should be multiplied by 2, as in equation (8.11):

Exact binomial 2-tail p-value = 2 × Σ (from k = b to n) C(n, k) (0.5)^k (1 − 0.5)^(n−k)     (8.11)

This is simply twice the binomial tail probability with p = 0.5 and n = b + c. The continuity-corrected version of McNemar's test approximates the binomial exact p-value.

Example 8.4

Consider the problem of establishing the impact a change in recipe has on consumers of pop tarts. Two focus groups of consumers are selected at random and their opinions, after blind-tasting the product ('I would never buy this product' or 'I would definitely buy this product'), recorded. Both groups are then offered the product with the modified recipe and their revised opinions recorded. The question that arises is whether the change in recipe yielded the desired effect. In this case, we have two groups who are recorded before and after, and we recognise that we are dealing with paired samples. To solve this problem, we can use McNemar's test for two sets of nominal data that are randomly sampled. Table 8.19 shows consumers' before and after preferences.

                           After
Before                     Would never buy   Would buy
Would never buy            138               42
Would buy                  21                99

Table 8.19 Before versus after buying intentions

The question is whether the change in recipe has been successful and moved those consumers who would never buy this product to the group of those that would buy it. To simplify the problem, we shall look at whether the proportion of those who would never buy has significantly changed. Our hypotheses are:

H0: Proportion stating that they would never buy has not changed
H1: Proportion stating that they would never buy has changed

In mathematical notation this can be written as H0: π1 = π2, H1: π1 ≠ π2, where π1 is the population proportion stating they would never buy before the recipe change and π2 is the population proportion stating that they would never buy after the recipe change.

The five-step procedure to conduct this test progresses as follows.

Step 1 State hypothesis

Given that the population proportions are π1 and π2, respectively denoting the proportion stating that they would never buy before and after the recipe change, the null and alternative hypotheses are as follows:

H0: π1 = π2
H1: π1 ≠ π2

The ≠ symbol indicates a two-tailed test.

Step 2 Select the test

• Sample data have been randomly selected.
• The sample consists of matched pairs.
• The data are at the nominal level of measurement.
• Since b = 42 and c = 21, b + c = 63 > 25, we can use a normal distribution or an exact binomial solution. Let us use McNemar's z test and compare it with the exact binomial test.

Step 3 Set the level of significance

α = 0.05

Step 4 Extract relevant statistic

Using McNemar's z test, we have, by equation (8.10):

Zcal = (b − c)/√(b + c) = (42 − 21)/√(42 + 21) = 2.6458

For a two-tail critical z value with 5% significance, Zcri = ±1.96.

Binomial solution


For comparison, the exact two-tailed binomial solution is given by solving equation (8.11):

Exact binomial 2-tail p-value = 2 × Σ (from k = b to n) C(n, k) (0.5)^k (1 − 0.5)^(n−k)

This is simply twice the binomial tail probability with p = 0.5, b = 42 and n = b + c = 42 + 21 = 63. We obtain:

p = 2 × Σ (from k = 42 to 63) C(63, k) (0.5)^k (0.5)^(63−k)

p = 2 × [C(63, 42)(0.5)^42(0.5)^21 + C(63, 43)(0.5)^43(0.5)^20 + … + C(63, 63)(0.5)^63(0.5)^0]

This is quite a lengthy calculation to solve manually, so it is a good idea to use a software package like Excel, as illustrated in Figure 8.36.

Figure 8.36 Calculation of the 2-tail binomial p-value

From Figure 8.36 (cell F3): p = 0.011141


Step 5 Make a decision

McNemar's z test solution

Zcal = 2.6458 and critical Zcri = ±1.96. We observe that Zcal lies in the rejection region (2.6458 > 1.96), so we accept the alternative hypothesis, H1. We conclude that there is a significant difference in 'never would buy' intentions after the recipe change compared with before the recipe change.

Binomial solution

For comparison, the two-tailed binomial exact p-value is 0.0111, which is less than the significance level of 0.05, so we accept the alternative hypothesis, H1. We conclude that there is a significant difference in 'never would buy' intentions after the recipe change compared with before the recipe change.

Excel solution

Figures 8.37 and 8.38 illustrate the Excel solution for McNemar's z and χ² tests.

Figure 8.37 Example 8.4 Excel solution


Figure 8.38 Example 8.4 Excel solution continued

Using McNemar's method, we have: McNemar's Z-test statistic = 2.6458, two-tailed critical z statistic = ±1.96, two-tailed Z-test p-value = 0.0082.

Given Z = 2.6458 > +1.96, reject the null hypothesis and accept the alternative hypothesis. Given the two-tailed p-value = 0.0082 < 0.05, reject the null hypothesis and accept the alternative hypothesis.

Using the binomial exact solution, we have a two-tailed binomial p-value = 0.0111. Given the two-tailed binomial p-value = 0.0111 < 0.05, reject the null hypothesis and accept the alternative hypothesis.

We observe that the manual and Excel solutions agree.


We conclude that there is a significant difference in 'never would buy' intentions after the recipe change compared with before the recipe change.

SPSS solution

Enter data into SPSS.

Figure 8.39 Example 8.4 SPSS data Code: Before (1=No, 2=Yes) and After (1=No, 2=Yes). Given this is frequency data we will weight cases by count (frequency value). Select Data > Weight Cases > transfer Count variable into the Frequency Variable: box (Figure 8.40).

Figure 8.40 SPSS Weight Cases menu Click OK. Select Analyze > Descriptive Statistics > Crosstabs Transfer Before into Row(s) box Transfer After into Column(s) box Select Statistics and choose McNemar’s. Select Cells and choose Counts (Observed and Expected) and Percentages (Rows and Columns).


Figure 8.41 SPSS Crosstabs menu Click OK. SPSS output SPSS outputs descriptive statistics as shown in Figure 8.42 and test statistics as shown in Figure 8.43. In SPSS, the binomial distribution is used for McNemar's test.

Figure 8.42 Example 8.4 SPSS solution continued


Figure 8.43 Example 8.4 SPSS solution continued

Conclusions

The 2-tail binomial exact p-value = 0.0111 < significance level of 0.05, so we accept the alternative hypothesis H1. We conclude that there is a significant difference in 'never would buy' intentions after the recipe change compared with before the recipe change. This agrees with the manual and Excel solutions given in Figure 8.38.
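For readers who want an additional cross-check outside Excel and SPSS, the short Python sketch below reproduces both the McNemar z test and the exact binomial p-value. It is only an illustration: it assumes SciPy (version 1.7 or later, which provides stats.binomtest) is available, and the variable names are our own.

```python
from math import sqrt
from scipy import stats

# Discordant pairs from Table 8.19 (our own variable names)
b = 42   # would never buy before, would buy after
c = 21   # would buy before, would never buy after

# McNemar z statistic, equation (8.10), with a two-tailed normal p-value
z = (b - c) / sqrt(b + c)
p_normal = 2 * stats.norm.sf(abs(z))
print(z, p_normal)                     # approximately 2.646 and 0.0082

# Exact binomial two-tail p-value, equation (8.11)
p_exact = stats.binomtest(b, n=b + c, p=0.5, alternative='two-sided').pvalue
print(p_exact)                         # approximately 0.0111
```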

Check your understanding X8.8

A business analyst requests answers to the following questions:
a. What is the p-value when the χ² test statistic is 2.89 and we have 1 degree of freedom?
b. For a χ² test statistic of 2.89 with 1 degree of freedom, what is the value of the corresponding z test statistic?
c. Find the critical χ² value for a significance level of 1% and 5%.

X8.9

During the summer of 2018 petrol prices raised concerns among new car sellers that potential customers are taking prices into account when choosing a new car. To provide evidence to test this possibility, a group of five local car showrooms agreed to ask fleet managers and individual customers during August 2018 whether they are or are not influenced by petrol prices. The results are as shown in Table 8.20. At the 5% level of significance, is there any evidence for the concerns raised by the car showroom owners? Answer this question using both the critical test statistic and the p-value.

Are petrol prices influencing you in purchasing?   Fleet customers   Individual customers
Yes                                                56                66
No                                                 23                36

Table 8.20 Are customers influenced

X8.10 A business analyst has been asked to confirm the effectiveness of a marketing campaign on people's attitudes to global warming. To confirm that the campaign was effective, a group of 500 people were randomly selected from the population. They were asked whether they agree that national governments should be concerned, answering 'yes' or 'no'. The results are as shown in Table 8.21. At the 5% level of significance, is there any evidence that the campaign has increased the number of people requesting that national governments should be concerned that global warming is an issue? Answer this question using both the critical test statistic and the p-value.

                      After campaign
Before campaign       Yes    No
Yes                   202    115
No                    89     75

Table 8.21 Attitude to global warming

8.3 Nonparametric tests

Many statistical tests require that your data follow a normal distribution. Parametric statistical hypothesis tests assume that the data on which they are applied possess certain characteristics or parameters:

1. The two samples are independent of one another.
2. The two populations have equal variance or spread.
3. The two populations are normally distributed.

There is no getting around assumption 1. That assumption must be satisfied for a t-test. When assumptions 2 and 3 (equal variance and normality) are not satisfied but the samples are large (say, greater than 30), the results are approximately correct. But when our samples are small and our data are skewed or non-normal, we probably should not place much faith in the t test. This is where nonparametric tests come in, given they can solve statistical problems where assumptions 2 and 3 are violated. Whereas the null hypothesis of the two-sample t test is equal means, the null hypothesis of nonparametric tests is concerned with equal medians. Another way to think of the null hypothesis is that the two populations have the same distribution with the same median. If we reject the null, that means we have evidence that one distribution is shifted to the left or right of the other, as illustrated in Figure 8.44.

Figure 8.44 Two distributions with the same shape but different median values

Since we are assuming our distributions have the same shape, rejecting the null hypothesis means we have evidence that the medians of the two populations differ. All the methods described in this section use the method of ranks rather than a distribution shape (e.g. the population or sampling distribution does not need to follow a normal or Student's t distribution) to carry out an appropriate nonparametric hypothesis test. We use the median rather than the mean as a measure of the centre of the distribution. The decision often depends on whether the mean or median more accurately represents the centre of your data distribution:

• If the mean accurately represents the centre of your distribution and your sample size is large enough, consider a parametric test.
• If the median better represents the centre of your distribution, consider a nonparametric test.

Given that nonparametric tests use less of the information in the data than parametric tests, they have less power than the equivalent parametric tests when the parametric assumptions (such as normality) actually hold. In addition, if you have a very small sample size, you might be stuck with using a nonparametric test (or collect more data next time if possible!). The sample size guidelines are not really that large, but your chance of detecting a significant effect, when one exists, can be very small when you have both a small sample size and a less efficient nonparametric test.

In this section, we shall explore three nonparametric tests: the sign test, the Wilcoxon signed-rank test, and the Mann–Whitney U test. Table 8.22 compares the nonparametric tests with the equivalent parametric tests for one- and two-sample tests.

Test                   Parametric test                  Nonparametric test
One sample             One sample z test                Sign test
                       One sample t test                Wilcoxon signed-rank test
Paired samples         Two paired sample z test         Sign test
                       Two paired sample t test         Wilcoxon signed-rank test
Independent samples    Two independent sample t test    Mann–Whitney U test
                                                        (Wilcoxon rank sum test)

Table 8.22 Comparison of nonparametric versus equivalent parametric tests

Sign test

The sign test is used to test the null hypothesis that the median of a distribution is equal to some value. It can be used (a) in place of a one-sample t test, (b) in place of a paired t test, or (c) for ordered categorical data where a numerical scale is inappropriate but where it is possible to rank the observations. The assumptions of the sign test are as follows:

1. The sign test is a nonparametric (distribution-free) test, so we do not assume that the data are normally distributed.
2. Data should be one or two samples. The population may differ for the two samples.
3. Dependent samples should be paired or matched.

Types of sign test

The sign test is used to test the null hypothesis that the median of a distribution is equal to some value:

a. A one-sample sign test, where the sample median value is compared to the hypothesised median value.
b. A two-sample sign test, where we assess whether two sample medians provide evidence that the population median difference is zero.

The one-sample and two-sample sign tests replace the one-sample t test and the two-sample paired t test where we have evidence to suggest the data are not normally distributed.

The observations in a random sample of size n are X1, X2, …, Xn (these observations could be paired differences); the null hypothesis is that the population median is equal to some value M. Suppose that X+ of the observations are greater than M and X− are smaller than M (in the case where the sign test is being used in place of a paired t test, M would be zero). Values of X which are exactly equal to M are ignored; the sum of X+ and X− may therefore be less than n – we will denote it by n′.

Hypothesis statements

Under the null hypothesis we would expect half the X's to be above the median and half below. Therefore, under the null hypothesis both X+ and X− follow a binomial distribution with probability of success p = 0.5 and n = n′. We will assume that our random variable X is a continuous random variable with unknown median M. Upon taking a random sample X1, X2, X3, …, Xn we are interested in testing whether the median M takes on a value Ma. That is, we are interested in testing the null hypothesis:

H0: M = Ma

against any of the possible alternative hypotheses:

H1: M > Ma or H1: M < Ma or H1: M ≠ Ma

Binomial exact solution

In this case, the probability distribution is a binomial distribution with probability (or proportion) of success p = 0.5 and the number of trials represented by the number of paired observations (n). We can therefore model the situation using a binomial distribution X ~ Bin(n, p). The value of the binomial probability is given by equation (8.12):

P(X = x) = C(n, x) p^x (1 − p)^(n−x)     (8.12)

This value given by the binomial equation leads to an exact p-value. The p-value for a sign test is found using the binomial distribution where p = proportion of non-zero values = 0.5, n = number of non-zero differences, and x = number of positive differences:

1. Upper one-tail (H1 implies p > 0.5): p-value = probability that X is greater than or equal to x, P(X ≥ x).
2. Lower one-tail (H1 implies p < 0.5): p-value = probability that X is less than or equal to x, P(X ≤ x).
3. Two-tail (H1 implies p ≠ 0.5): p-value = twice the probability that X is greater than or equal to x, 2P(X ≥ x).

Solution procedure:

1. List the data values X1, X2, …, Xn.
2. List the values that are greater than M and those that are less than M.
3. Remove the data values equal to M.
4. Count the positive differences to give x+.
5. Count the negative differences to give x–.
6. Choose the value of x depending upon the alternative hypothesis statement from x–, x+.
7. Calculate P(X ≥ x) given the binomial distribution with p = 0.5 and n = n′.

Example 8.5

To illustrate the concept, consider a sales manager who oversees 16 inside and 16 outside sales representatives. Every inside sales representative is paired with an outside sales representative. After interviewing all of them, the sales manager assigns values to the level of understanding that every inside and every outside salesperson has of their shared territories (see Table 8.23). A mark of 1 means great insight and a mark of 5 means little or no insight. The null hypothesis statement is that there is the same level of understanding and insight of their respective territories among the inside and outside salespeople. The alternative hypothesis is that the outside sales staff have a different insight compared to the inside sales staff.


Territory   A   B     Territory   A   B
1           3   4     9           4   5
2           4   3     10          5   4
3           3   5     11          2   4
4           5   3     12          2   1
5           3   2     13          4   3
6           5   3     14          5   2
7           1   2     15          2   5
8           4   2     16          4   5

Table 8.23 Sales representatives' levels of understanding

The five-step procedure to conduct this test progresses as follows.

Step 1 State hypothesis

H0: The median of the paired differences is zero.
H1: The median of the paired differences is not zero.

This is a two-tailed test.

Step 2 Select test

• Two dependent samples; both samples consist of ordinal data assigned by the sales manager, and there is no information on the form of the distribution.
• Conduct a sign test.

Step 3 Set the level of significance

Significance level α = 0.05

Step 4 Extract relevant statistic

The solution process can be broken down into a series of steps.

Enter data

Denote the sales manager's marks for inside salespeople by A and for outside salespeople by B. Calculate the differences d = B – A and state the sign (+ or –) for each paired difference.

Territory   A (Inside)   B (Outside)   d = B − A   Sign
1           3            4             1           +
2           4            3             −1          −
3           3            5             2           +
4           5            3             −2          −
5           3            2             −1          −
6           5            3             −2          −
7           1            2             1           +
8           4            2             −2          −
9           4            5             1           +
10          5            4             −1          −
11          2            4             2           +
12          2            1             −1          −
13          4            3             −1          −
14          5            2             −3          −
15          2            5             3           +
16          4            5             1           +

Table 8.24 Calculation of the positive and negative signs

The median difference, d = median(1, −1, 2, …, 1) = −1.0. The median difference for B – A is equal to −1.0, which suggests that the understanding of the outside sales staff is less than that of the inside sales staff. Allocate '+' and '–' depending on whether d > 0 or d < 0, as illustrated in Table 8.24.

Total number of trials, n = 16. From Table 8.24:

Number of – (negative signs), x– = 9
Number of + (positive signs), x+ = 7

Calculate the number of paired values that give d = 0, and find x and n′

Number of values with d equal to zero, n0 = 0
Calculate x = max(x–, x+) = max(9, 7) = 9
Adjust n to remove the number of values with d equal to zero: n′ = n – n0 = 16

Calculate binomial probabilities, P(X ≥ x)

This problem can be solved exactly, given that this is a binomial distribution with x successes from n′ trials when the probability of success in each trial is p = 0.5:

• Probability of success, p = 0.5
• Number of trials, n′ = 16
• Number of successes, x = 9

Solve P(X ≥ x) when x = 9, n′ = 16. The corresponding binomial p-value is calculated from equation (8.12):

p = P(X ≥ x) = Σ (from k = 9 to 16) C(16, k) (0.5)^k (1 − 0.5)^(16−k)

In this case we wish to solve the problem:

p = P(X ≥ 9) = P(X = 9) + P(X = 10) + P(X = 11) + P(X = 12) + P(X = 13) + P(X = 14) + P(X = 15) + P(X = 16)

To find the binomial probabilities where n = 16 and p = 0.5, we use P(X = x) = C(n, x) p^x q^(n−x). For the first term:

P(X = 9) = C(16, 9) (0.5)^9 (1 – 0.5)^(16−9) = C(16, 9) (0.5)^9 (0.5)^7 = C(16, 9) (0.5)^16

Remember that C(n, x) = n!/(x!(n − x)!), so

C(16, 9) = 16!/(9! × 7!) = (16 × 15 × 14 × 13 × 12 × 11 × 10)/(7 × 6 × 5 × 4 × 3 × 2 × 1) = 57657600/5040 = 11440

Therefore,

P(X = 9) = 11440 × (0.5)^16 = 11440 × 1.525878907 × 10^(−5) = 0.1746

Repeat the calculations for the other terms to give P(X ≥ 9), as illustrated in Table 8.25.

P(X = 9) = 0.174561
P(X = 10) = 0.122192
P(X = 11) = 0.066650
P(X = 12) = 0.027771
P(X = 13) = 0.008545
P(X = 14) = 0.001831
P(X = 15) = 0.000244
P(X = 16) = 0.000015

Table 8.25 Binomial probabilities

p = P(X ≥ 9) = 0.1746 + 0.1222 + … + 0.0002 + 0.0000 = 0.402

This represents the one-tailed binomial p-value.

Step 5 Make a decision

Given we have a two-tailed test, the 2-tailed binomial p-value = 2 × 0.402 = 0.804. Given the 2-tailed binomial p-value = 0.804 > 0.05, we fail to reject the null hypothesis. The evidence suggests that there is no significant difference at the 5% significance level.

Large sample sign test – normal approximation

If n is sufficiently large (n > 30), we can use a normal approximation with the values of the population mean and standard deviation given by equations (3.23) and (3.24). Here n′ = 16 and p = 0.5. For a binomial distribution:

Binomial population mean = n p

Binomial variance = n p q

Therefore, for n sufficiently large, we can approximate the binomial with a normal distribution:

Normal population mean, μ = n p
Normal population variance, σ² = n p q
Normal population standard deviation, σ = √(n p q)

Therefore, for the previous example:

μ = n p = 16 × 0.5 = 8
σ² = n p q = 16 × 0.5 × (1 – 0.5) = 4
σ = √(n p q) = √4 = 2

Now we need to solve P(X ≥ 9) for the binomial distribution. Remember the binomial distribution is a discrete distribution and we wish to approximate it with the normal distribution, which is a continuous distribution. Therefore, we apply a continuity correction to the value of X for the normal distribution, given the binomial value X ≥ 9:

P(X ≥ 9, binomial) ≈ P(X ≥ 8.5, normal)

Now calculate the value of the Z test statistic:

Zcal = (X − μ)/σ = (8.5 − 8)/2 = 0.25

Calculate the critical z-test statistic

From the standardised normal table, Zcri = ±1.96 for a two-tailed test at a 5% significance level.

Decision

Given Zcal = 0.25 lies between –1.96 and +1.96, we fail to reject the null hypothesis. We conclude that outside salespeople have the same level of insight into their sales territory as inside salespeople.

Excel solution

Figures 8.45 and 8.46 illustrate the Excel solution. Values labelled as A represent the inside salespeople, and B are the outside salespeople. The territories to which every matched pair belongs are marked 1, …, 16. They are considered matched because they exchange and discuss the business issues related to their territory.

Figure 8.45 Example 8.5 Excel solution

Observe in Figure 8.45 that we created extra columns called d = B – A and Sign. These are calculated as follows: d = B – A (cell E4, formula: =D4-C4, copy formula down E4:E19), and Sign (cell G4, formula: =IF(E4<0,"-",IF(E4>0,"+","0")), copy formula down G4:G19).

Figure 8.46 Example 8.5 Excel solution continued Identify rejection region using the p-value method


From Excel, the two-sided p-value is 0.804 (cell L37).

Does the test statistic lie in the rejection region? Compare the chosen significance level (α) of 5% (or 0.05) with the calculated two-sided p-value of 0.804. We observe that the p-value is greater than α, and we fail to reject the null hypothesis, H0. We conclude that outside salespeople have the same level of insight into their sales territory as inside salespeople.

Normal approximation

Figure 8.47 Example 8.5 Excel solution continued

From Excel: P(X normal ≥ 8.5) = 0.40129. Therefore, the two-tail p-value = 2 × P(X normal ≥ 8.5) = 2 × 0.40129 = 0.803.

Given the two-tail p-value = 0.803 > 0.05, we fail to reject the null hypothesis, H0. We conclude that outside salespeople have the same level of insight into their sales territory as inside salespeople.

SPSS solution

Enter data into SPSS (Figure 8.48).


Figure 8.48 Example 8.5 SPSS data Select Analyze > Nonparametric Tests > Legacy Dialogs > 2 Related samples. Transfer variables A and B into Test Pairs box (Figure 8.49).

Figure 8.49 SPSS Two-Related-Samples Tests menu

Click on Exact and choose Exact (Figure 8.50).

Figure 8.50 SPSS Exact Tests option

Click Continue. Click on Options. Choose Descriptives and select Descriptives and Quartiles (Figure 8.51).

Figure 8.51 SPSS Two-Related-Samples Tests menu Click Continue Click OK. SPSS output Figures 8.52 – 8.54 represent the SPSS solutions

Figure 8.52 SPSS solution

Figure 8.53 SPSS solution continued


Figure 8.54 SPSS solution continued

The output is shown in Figures 8.52–8.54. From Figure 8.54, the two-tailed p-value is 0.804. This is the same as the Excel p-value of 0.804 for the exact binomial method.

Conclusions

Conclude that outside salespeople have the same insight into their sales territory as inside salespeople [two-tail p-value = 0.804 > 0.05].

Comparing one sample against a population median

If you have one sample, then you can conduct a sign test by using the binomial test in Excel and SPSS. For example, consider a single sample of results with the null hypothesis H0: median = 22, and the alternative hypothesis H1: median ≠ 22.

Excel solution: change the variable labelled B to the value of the population median, 22.

SPSS solution: for sample data X, compute a new variable called newX = X – 22 (the 22 is the value from the null hypothesis), and remove any zero values of newX. In SPSS, select Data > Select Cases, click on If condition is satisfied, move the variable newX into the Numeric Expression box and add ~= 0 (it should read newX ~= 0; the SPSS symbol ~= represents ≠). Now select Analyze > Nonparametric Tests > Legacy Dialogs > Binomial. Transfer the new variable newX to the Test Variable box. Under Define Dichotomy, choose Cut point = 0 and Test Proportion = 0.5. Click OK.
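Because both the paired and the one-sample sign tests reduce to a binomial test with p = 0.5, they are also easy to cross-check in Python. The sketch below is only an illustration of Example 8.5; it assumes SciPy 1.7 or later is available, and the list names are our own.

```python
from scipy import stats

# Example 8.5 data: inside (A) and outside (B) salespeople, territories 1 to 16
A = [3, 4, 3, 5, 3, 5, 1, 4, 4, 5, 2, 2, 4, 5, 2, 4]
B = [4, 3, 5, 3, 2, 3, 2, 2, 5, 4, 4, 1, 3, 2, 5, 5]

d = [b - a for a, b in zip(A, B)]
n_pos = sum(1 for x in d if x > 0)          # 7 positive differences
n_neg = sum(1 for x in d if x < 0)          # 9 negative differences
n_nonzero = n_pos + n_neg                   # 16 (no zero differences here)

# Two-tailed exact sign test: binomial with p = 0.5
result = stats.binomtest(n_pos, n=n_nonzero, p=0.5, alternative='two-sided')
print(result.pvalue)                        # approximately 0.804
```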


Check your understanding

X8.11 A researcher has carried out a sign test with the following results: the sum of positive and negative signs is 15 and 4, respectively, with 3 ties. Given that binomial p = 0.5, assess (at the 5% significance level) whether there is evidence that the median value is greater than 0.5.

X8.12 A teacher of 40 university students studying the application of Excel within a business context is concerned that students are not taking a group work assignment seriously. This is deemed to be important given that the group work element is contributing to the development of personal development skills. To assess whether this is a problem the module tutor devised a simple experiment which judged the individual level of cooperation by each individual student within their own group. In the experiment, a rating scale was employed to measure the level of cooperation: 1 = limited cooperation, 5 = moderate cooperation and 10 = complete cooperation. The testing consisted of an initial observation, a lecture on working in groups, and a final observation. Given the raw data in Table 8.26, conduct a test to assess (at the 5% significance level) whether cooperation has significantly changed.

5, 8    4, 6    3, 3    6, 5    8, 9    10, 9   8, 8    4, 8    5, 5    3, 5
5, 4    6, 5    4, 4    7, 8    7, 9    9, 9    8, 7    5, 8    8, 7    8, 8
3, 4    5, 6    6, 7    4, 8    7, 8    9, 10   10, 10  8, 8    4, 6    4, 5
7, 8    5, 7    7, 9    8, 10   3, 6    5, 6    8, 9    5, 6    8, 9    7, 8

Table 8.26 Data in pairs. The number before the comma is the assessment before the experiment, and the number after the comma is the assessment after the experiment

X8.13 A leading business training firm advertises in its promotional material that class sizes at its Paris branch are no greater than 25. Recently the firm has received feedback from many disgruntled students complaining that class sizes are greater than 25 for most of its courses in Paris. To assess this claim, the company randomly selects 15 classes and measures the class sizes as follows: 32, 19, 26, 25, 28, 21, 29, 22, 27, 28, 26, 23, 26, 28, and 29. Carry out an appropriate test to assess (at the 5% significance level) whether there is any justification to the complaints (assess at 5%). What would your decision be if you assessed at the 1% significance level?

Wilcoxon signed-rank test for matched pairs

The Wilcoxon signed-rank test is the nonparametric equivalent of the paired two-sample t test. It is used in those situations in which the observations are paired and you have not met the assumption of normality. The Wilcoxon signed-rank test assumptions are:

1. Each pair is chosen randomly and independently.
2. Data are paired and come from the same population.
3. Both samples consist of quantitative or ordinal (rank) data. Remember your quantitative data will be converted to rank data.
4. The distribution of the paired differences is a symmetric distribution, or at least not very skewed.

If the fourth assumption is violated, then you should consider using the sign test, which does not require symmetry. As for the sign test, the Wilcoxon signed-rank test is used to test the null hypothesis that the median of a distribution is equal to some value. The method considers the differences between n matched pairs as one sample. If the two population distributions are identical, then we can show that the sample statistic has a symmetric null distribution. When the number of paired observations is small (n ≤ 20) we need to consult tables; but when the number of paired observations is large (n > 20) we can use a test based on the normal distribution.

Although the Wilcoxon signed-rank test assumes neither normality nor homogeneity of variance, it does assume that the two samples are from populations with the same distribution shape. It is also vulnerable to outliers, although not to nearly the same extent as the t test. If we cannot make this assumption about the distribution, then we should use the sign test for ordinal data. McNemar's test is available for nominal paired data relating to dichotomous qualitative variables.

In this section we shall apply the Wilcoxon signed-rank test where we have a large and a small number of paired observations. In the case of many paired observations (n > 20) we shall use a normal approximation to provide a test of our hypothesis. Furthermore, for many paired observations we shall use Excel to calculate both the p-value and the critical z value to reach a decision. The situation of a small number of paired observations (n ≤ 20) will be described together with an outline of the solution process. The solution process can be broken down into a series of steps:

1. State hypotheses H0 and H1.
2. Calculate the differences between pairs (d = X – Y).
3. If the difference between a pair is d = 0, then remove this data pair from the analysis.
4. Record the sign of the difference in one column and the absolute value of the difference in another column.
5. Rank the absolute differences from the smallest to the largest, assigning the average rank to tied values.
6. Reattach the signs of the differences to the respective ranks to obtain signed ranks.
7. Calculate the number of paired values and adjust for the removed zero differences (n0): n′ = n – n0.
8. Calculate the sums of the ranks, T– and T+.
9. State Tcal = minimum value of T– and T+.
10. Use statistical tables for the Wilcoxon signed-rank test to find the probability of observing a value of T or lower. This is your p-value if the test is one-sided. If your alternative hypothesis is two-sided, then double this probability to give the p-value.

For large samples (n > 20), the T statistic is approximately normally


distributed under the null hypothesis that the population differences are centred at zero. We shall solve the Wilcoxon signed-rank test using the normal approximation.

Example 8.6

A study is made to determine whether there is a difference between husbands' and wives' attitudes towards online marketing advertisements. A questionnaire measuring this was given to 24 couples, with the results summarised in Table 8.27 (ordinal range 0 (hate) to 20 (love)). Is there a significant difference in the couples' attitudes to the online advertisements at a 5% level of significance?

ID   Wife, W   Husband, H
1    15        17
2    8         19
3    11        18
4    19        19
5    13        17
6    4         5
7    16        13
8    5         3
9    9         16
10   15        21
11   12        12
12   11        9
13   14        10
14   4         17
15   11        12
16   17        24
17   14        12
18   5         12
19   9         8
20   8         16
21   9         12
22   11        7
23   11        17
24   12        13

Table 8.27 Survey outcomes

The five-step procedure to conduct this test progresses as follows Step 1 State hypothesis


Under the null hypothesis, we would expect the distribution of the differences to be approximately symmetric around zero and the positive and negative differences to be distributed at random among the ranks.

H0: no difference in attitudes to adverts between husband (H) and wife (W); median of (W − H) = 0
H1: difference in attitudes to adverts between husband (H) and wife (W); median of (W − H) ≠ 0

Two-tailed test at 5% significance.

Step 2 Select test

• Two dependent samples.
• Both samples consist of ordinal (rank) data.
• No information on the form of the distribution, but we shall assume the distribution of differences is symmetric.
• Wilcoxon signed-rank test.

Step 3 Set the level of significance

α = 0.05

Step 4 Extract relevant statistic

Calculate the difference (d = W – H) and the sign of the difference. Please note that the pairs where the difference d = W – H = 0 (IDs 4 and 11) are not included in the analysis.

ID   Wife, W   Husband, H   d = W − H   Sign
1    15        17           −2          −
2    8         19           −11         −
3    11        18           −7          −
4    19        19           0
5    13        17           −4          −
6    4         5            −1          −
7    16        13           3           +
8    5         3            2           +
9    9         16           −7          −
10   15        21           −6          −
11   12        12           0
12   11        9            2           +
13   14        10           4           +
14   4         17           −13         −
15   11        12           −1          −
16   17        24           −7          −
17   14        12           2           +
18   5         12           −7          −
19   9         8            1           +
20   8         16           −8          −
21   9         12           −3          −
22   11        7            4           +
23   11        17           −6          −
24   12        13           −1          −

Table 8.28 Calculation of the sign of the differences

Calculate the magnitude of the differences (d = W – H) and rank the absolute differences, noting whether each rank corresponds to a positive (r+) or negative (r−) difference.

ID   Magnitude of difference, |d| = |W − H|   Rank, R
1    2                                        6.5
2    11                                       21
3    7                                        18.5
5    4                                        12
6    1                                        2.5
7    3                                        9.5
8    2                                        6.5
9    7                                        18.5
10   6                                        14.5
12   2                                        6.5
13   4                                        12
14   13                                       22
15   1                                        2.5
16   7                                        18.5
17   2                                        6.5
18   7                                        18.5
19   1                                        2.5
20   8                                        20
21   3                                        9.5
22   4                                        12
23   6                                        14.5
24   1                                        2.5
4    (d = 0, excluded)
11   (d = 0, excluded)

Table 8.29 Magnitude and rank of differences

Identify positive and negative ranks

ID   Positive ranks, R+   Negative ranks, R−
1                         6.5
2                         21
3                         18.5
4
5                         12
6                         2.5
7    9.5
8    6.5
9                         18.5
10                        14.5
11
12   6.5
13   12
14                        22
15                        2.5
16                        18.5
17   6.5
18                        18.5
19   2.5
20                        20
21                        9.5
22   12
23                        14.5
24                        2.5

Table 8.30 Identification of positive and negative ranks

Calculate the median difference

Median difference, d = median(−2, −11, −7, …, 4, −6, −1) = −1.5

The median difference is –1.5, which supports the alternative hypothesis that the median difference is not 0. The question then is whether this difference is statistically significant.

Identify difference values d = 0 and remove from the analysis

The fourth and eleventh paired data points give a difference value d = 0. These pairs are removed from the solution (the rows left blank in Tables 8.29 and 8.30).

Calculate the number of paired values

Number of paired ranks, n = 24
Number of data pairs where d = 0 is 2, that is, n0 = 2.
Adjust n to remove the data pairs with d = 0: n′ = n – n0 = 24 – 2 = 22.

Calculate the sums of the ranks, T+ and T–:

T+ = sum of positive ranks = 9.5 + 6.5 + 6.5 + 12 + 6.5 + 2.5 + 12 = 55.5

T– = sum of negative ranks = 6.5 + 21 + 18.5 + 12 + 2.5 + 18.5 + 14.5 + 22 + 2.5 + 18.5 + 18.5 + 20 + 9.5 + 14.5 + 2.5 = 198.5

Calculate the Wilcoxon signed-rank test statistic, Tcal:

Tcal = min(T–, T+) = min(198.5, 55.5) = 55.5

Critical value

At this stage, we can use Wilcoxon signed-rank test critical tables to look up a critical T value for a 5% significance level with 22 paired data values.

Figure 8.55 Critical values of the Wilcoxon matched-pairs signed-rank test

From Figure 8.55, the critical value of T when we have n′ = 22 and are testing at a 5% significance level is Tcri = 75. Thus, Tcal < Tcri (55.5 < 75), so we reject the null hypothesis and accept the alternative hypothesis.

Step 5 Make a decision

We conclude there is a significant difference between a wife's and her husband's attitude to online adverts.

Normal approximation to the Wilcoxon signed-rank test

For large samples (n > 20), the T statistic is approximately normally distributed under the null hypothesis that the population differences are centred at zero. When this is true, the mean and standard deviation values are given by equations (8.13) and (8.14):

μT = n′(n′ + 1)/4     (8.13)

σT = √[ n′(n′ + 1)(2n′ + 1)/24 ]     (8.14)

Then, for large n, the distribution of the random variable Z is approximately standard normal, where

ZT = (Tcal − μT)/σT     (8.15)

Applying equations (8.13)–(8.15) with n′ = 22:

Population mean: μT = n′(n′ + 1)/4 = 22 × (22 + 1)/4 = 126.5

Population standard deviation: σT = √[ n′(n′ + 1)(2n′ + 1)/24 ] = √[ 22 × (22 + 1) × (2 × 22 + 1)/24 ] = √(22770/24) = 30.802

Value of Z given Tcal = 55.5, μT = 126.5 and σT = 30.802:

ZT = (Tcal − μT)/σT = (55.5 − 126.5)/30.802 = −2.305

Notes:

1. If the alternative hypothesis is upper one-tailed, then reject the null hypothesis if ZT = (Tcal − μT)/σT > +Zα     (8.16)

2. If the alternative hypothesis is lower one-tailed, then reject the null hypothesis if ZT = (Tcal − μT)/σT < −Zα     (8.17)

Decision

The calculated test statistic is Zcal = −2.305. The critical z values can be found from tables: for a two-tailed test at 5%, Zcri = ±1.96.

Does the test statistic lie within the rejection region? Compare the calculated and critical z values to determine which hypothesis (H0 or H1) to accept. Given Zcal < lower Zcri (–2.305 < –1.96), we reject the null hypothesis. We conclude there is a significant difference between a wife's and her husband's attitude to online adverts.

Continuity correction for Z

The standardised Z test is a continuous distribution that provides an approximation to the discrete T statistic (ranked data); a continuity correction can be applied, as in equation (8.18):

Zcal = (|Tcal − μT| − 0.5)/σT     (8.18)

For this example, Tcal = 55.5, μT = 126.5, σT = 30.802. Therefore, the corrected Z value is 2.2888 > 1.96. We conclude there is a significant difference between a wife's and her husband's attitude to online adverts.

Issue of tied ranks

If there is a large number of ties, then equation (8.14) can be replaced by equation (8.19), which provides a better estimator of the standard deviation:

σT = √[ n(n + 1)(2n + 1)/24 − (1/48) Σ (fi³ − fi) ]     (8.19)

Where i varies over the set of tied ranks and fi is the frequency with which rank i appears.

Excel solution

Figures 8.56 – 8.61 illustrate the Excel solution.

Figure 8.56

Figure 8.57

Figure 8.58

Figure 8.59

Figure 8.60

Figure 8.61 We conclude there is a significant difference between a wife and her husband's attitude to online adverts.


Values not corrected for ties

The values above are not corrected for ties. An estimate of the change to the variance is 2.625, based upon shared rank 12.5 with f = 4, shared rank 22.5 with f = 4, and shared rank 15.5 with f = 2. If you update the sigma value accordingly, then your value of Z will be very close to the SPSS solution.

SPSS solution

Enter data into SPSS.

Figure 8.62 Example 8.6 SPSS data

Click on Analyze > Nonparametric Tests > Legacy Dialogs > 2 Related Samples. Transfer the two paired variables (W and H) to the Test Pairs box. Click on Wilcoxon test.


Figure 8.63 SPSS Two-Related-Samples Tests menu Click Options Choose Descriptives and Quartiles

Figure 8.64 Options Click OK SPSS output

Figure 8.65 Example 8.6 SPSS solution


Figure 8.66 Example 8.6 SPSS solution continued

Figure 8.67 Example 8.6 SPSS solution continued

The rows labelled Asymptotic Sig. and Exact Sig. tell us the probability that a test statistic of at least that magnitude would occur if there were no differences between groups.

1. If you have a small sample, use the Exact Sig. value.
2. If you have a large sample (n > 10), use the Asymptotic Sig. value. The test statistic Z is approximately normally distributed for large samples.

From SPSS: Z = −2.311 with a 2-tail asymptotic p-value of 0.021. From Excel: Z = −2.305 with a 2-tail asymptotic p-value of 0.021.

The two-tail p-value associated with the Z-score (−2.311) is 0.021, which means that there is a probability of 0.021 that we would obtain a value of Z as large as this if there were no effect in the population. Given the two-tail p-value = 0.021 < 0.05 (significance level), we reject the null hypothesis and accept the alternative hypothesis.

Conclusion

We conclude there is a significant difference between a wife's and her husband's attitude to online adverts.
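As a further cross-check outside Excel and SPSS, the Wilcoxon signed-rank test for Example 8.6 can be run in Python. The sketch below is only an illustration, assuming the SciPy library is available; the list names are our own, and the reported values will differ slightly from the manual solution depending on how the software handles ties.

```python
from scipy import stats

# Example 8.6 data: wife (W) and husband (H) scores for the 24 couples
W = [15, 8, 11, 19, 13, 4, 16, 5, 9, 15, 12, 11, 14, 4, 11, 17, 14, 5, 9, 8, 9, 11, 11, 12]
H = [17, 19, 18, 19, 17, 5, 13, 3, 16, 21, 12, 9, 10, 17, 12, 24, 12, 12, 8, 16, 12, 7, 17, 13]

# Wilcoxon signed-rank test; zero differences are dropped, matching the manual method
stat, p_value = stats.wilcoxon(W, H, zero_method='wilcox', correction=False)
print(stat, p_value)   # statistic = min(T+, T-), p-value approximately 0.02 (two-tailed)
```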


Check your understanding

X8.14 The Wilcoxon paired-ranks test is more powerful than the sign test. Explain why.

X8.15 A company is planning to introduce new packaging for a product that has used the same packaging for over 20 years. Before it decides on the new packaging the company decides to ask a panel of 20 participants to rate the current and proposed packaging (using a rating scale of 0–100, where higher scores are more in favour of change); see Table 8.31. Is there any evidence that the new packaging is more favourably received than the older packaging? Assess at the 5% significance level.

Participant   Before   After     Participant   Before   After
1             80       89        11            37       40
2             75       82        12            55       68
3             84       96        13            80       88
4             65       68        14            85       95
5             40       45        15            17       21
6             72       79        16            12       18
7             41       30        17            15       21
8             10       22        18            23       25
9             16       12        19            34       45
10            17       24        20            61       80

Table 8.31 Rating of current and proposed packaging

X8.16 A local manufacturer is concerned at the number of errors made by machinists in the production of kites for a multinational retail company. To reduce the number of errors being made the company decides to retrain all staff in a new set of procedures. To assess whether the training worked, a random sample of 30 machinists was selected and the number of errors made before and after the training recorded, as illustrated in Table 8.32. Is there any evidence that the training has reduced the number of errors? Assess at the 5% significance level.

Machinist   Before   After     Machinist   Before   After     Machinist   Before   After
1           49       22        11          29       23        21          33       37
2           34       23        12          45       29        22          38       37
3           30       32        13          32       37        23          35       24
4           46       24        14          44       22        24          35       23
5           37       23        15          49       33        25          47       23
6           28       21        16          28       27        26          47       37
7           48       24        17          44       35        27          48       38
8           40       29        18          39       32        28          35       30
9           42       27        19          47       35        29          41       29
10          45       27        20          41       24        30          35       31

Table 8.32 Number of errors

Mann–Whitney U test for two independent samples

The Mann–Whitney U test is a nonparametric test that can be used in place of an unpaired t test. It is used to test the null hypothesis that two samples come from the same population (i.e. have the same median) or, alternatively, whether observations in one sample tend to be larger than observations in the other. This test is also called the Mann–Whitney–Wilcoxon test and is equivalent to the Wilcoxon rank sum test. Although it is a nonparametric test, it does assume that the two distributions are similar in shape. Where the samples are small, we need to use tables of critical values to determine whether to reject the null hypothesis. Where the samples are large, we can use a test based on the normal distribution.

The basic premise of the test is that once all the values in the two samples are put into a single ordered list, if they come from the same parent population, then the ranks at which values from sample 1 and sample 2 appear will be determined by chance. If the two samples come from different populations, then the ranks at which the sample values appear will not be random and there will be a tendency for values from one of the samples to have lower ranks than values from the other sample. We are thus testing for different locations of the two samples.

When you want to compare the distributions in two samples which are independent of each other, you have two equivalent tests you can apply: the Mann–Whitney U test or the Wilcoxon rank-sum test. In this section, we will adopt the Mann–Whitney U test method to solve this type of problem. Whenever the sample sizes are greater than 20, a large-sample approximation can be used for the distribution of the Mann–Whitney U statistic.

The Mann–Whitney U test assumptions are as follows:

1. Random samples from populations.
2. Independence within samples and mutual independence between samples.
3. Both samples consist of quantitative or ordinal (rank) data. Remember your quantitative data will be converted to rank data.
4. The populations for each sample have the same shape, so that under the null hypothesis the two populations have the same median value.

The similarity in shape can be assessed by creating a histogram (or five-number summary or boxplot) for each sample and comparing the two shapes. The Mann–Whitney U test is the nonparametric equivalent of the two-sample t test with equal variances. It is used primarily when the data have not met the assumption of normality (or when there is enough doubt). The test is based on ranks. It has good efficiency, especially for symmetric distributions. There are exact procedures for this test given small samples with no ties, and there are large-sample approximations.

The solution process can be broken down into a series of steps:

1. For each observation create a column that can be used to identify the group membership of each sample value.
2. Add the sample data values into the next column.
3. Now rank the data – if sample values are the same, then they share the (average) rank value.
4. Calculate the size of sample 1 and sample 2.
5. Calculate the sum of ranks for sample 1 and sample 2.
6. Calculate the test statistic U1 for sample 1 and U2 for sample 2.
7. Calculate the Mann–Whitney test statistic U = min(U1, U2).
8. Use statistical tables for the Mann–Whitney U test to find the probability of observing a value of U or lower. If the test is one-sided, this is your p-value; if the test is two-sided, double this probability to obtain the p-value.

For large samples (n > 20), the U statistic is approximately normally distributed under the null hypothesis that the two populations have the same location.

Example 8.7

A company is considering adopting a new method of training for its employees. To assess whether the new method improves employees' effectiveness, the firm has collected two random samples from the population of employees sitting the training assessment. Training type 1 employees have studied via the traditional method, and training type 2 employees via the new method. The exam scores are given in Table 8.33. The firm has analysed previous data, and the outcome of the results provides evidence that the distribution is not normally distributed but is skewed to the left. This information raises concerns about the suitability of using a two-sample independent t test for the analysis. Instead, we decide to use a suitable distribution-free test. In this case, the appropriate test is the Mann–Whitney U test.

Training type  Sample value   Training type  Sample value   Training type  Sample value   Training type  Sample value
2              36             2              38             2              25             2              23
1              34             2              42             2              33             2              39
2              37             2              39             2              32             2              27
1              40             1              38             2              24             2              35
1              21             2              32             1              39             2              31
2              34             1              27             1              40             2              21
2              28             2              28             2              32             1              36
1              34             1              35             2              35             2              34
2              24             2              34             2              40             1              29
1              23             2              39             2              34             2              39
1              35             2              36             2              34             1              38
2              43             1              38             1              28             1              30
1              23             2              31             2              38             1              36
2              33             1              29             2              39             1              38
1              38             1              27             2              28             2              34
1              25             1              34             2              28             2              39
2              30             2              40             2              18             1              27
2              35             1              22             1              31             2              40
2              21             2              37             2              40             1              29
2              36             2              42             2              39             1              40
1              28             2              33             2              40             2              30
2              25             2              35             2              37             1              39
2              43             1              27             2              27             2              35
2              34             1              39             1              32             1              25
1              31             2              28             1              23             1              34

Table 8.33 Training method comparison

The five-step procedure to conduct this test progresses as follows.

Step 1 State hypothesis

Null hypothesis H0: no difference in effectiveness between the two training methods; Median 1 – Median 2 = 0. This is equivalent to saying that the two samples come from the same population.

Alternative hypothesis H1: a difference exists between the training methods; Median 1 – Median 2 ≠ 0.

Two-tailed test.

Step 2 Select test

• Comparing two independent samples.
• Both samples consist of ordinal (ranked) data.
• Unknown population distribution, but similar shapes assumed for the two populations.
• Mann–Whitney U test.

Step 3 Set the level of significance

α = 0.05

Step 4 Extract relevant statistic

The solution process can be broken down into a series of steps.

Input samples into two columns and rank data

Combined sample: sample 1 = 1, and sample 2 = 2. The convention is to assign rank 1 to the smallest value and rank n′ to the largest value. If you have any shared ranks, then the policy is to assign the average rank to each of the shared values as illustrated in tables 8.34 – 8.37.
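Before looking at Tables 8.34–8.37, note that this shared-rank convention is exactly what most statistical software produces automatically. The short Python sketch below (our own illustration, assuming SciPy is available) shows the convention on a few illustrative values rather than the full data set.

```python
from scipy.stats import rankdata

# A few illustrative scores (not the full Table 8.33 data): ties share the average rank
scores = [36, 34, 37, 40, 21, 34, 28, 34]
ranks = rankdata(scores)        # the 'average' ranking method is the default
print(ranks)                    # the three values of 34 all receive rank (3 + 4 + 5)/3 = 4.0
```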

ID   Training type   Combined sample   Rank*
1    2               36                66
2    1               34                51
3    2               37                70
4    1               40                92.5
5    1               21                3
6    2               34                51
7    2               28                25
8    1               34                51
9    2               24                10.5
10   1               23                8.5
11   1               35                60
12   2               43                99.5
13   1               23                8.5
14   2               33                44
15   1               38                75
16   1               25                13.5
17   2               30                33
18   2               35                60
19   2               21                3
20   2               36                66
21   1               28                25
22   2               25                13.5
23   2               43                99.5
24   2               34                51
25   1               31                36.5

Table 8.34 Calculation of ranks

ID   Training type   Combined sample   Rank*
26   2               38                75
27   2               42                98.5
28   2               39                83.5
29   1               38                75
30   2               32                40.5
31   1               27                18.5
32   2               28                25
33   1               35                60
34   2               34                51
35   2               39                83.5
36   2               36                66
37   1               38                75
38   2               31                36.5
39   1               29                30
40   1               27                18.5
41   1               34                51
42   2               40                92.5
43   1               22                5
44   2               37                70
45   2               42                98.5
46   2               33                44
47   2               35                60
48   1               27                18.5
49   1               39                83.5
50   2               28                25

Table 8.35 Calculation of ranks cont.

ID   Training type   Combined sample   Rank*
51   2               25                13.5
52   2               33                44
53   2               32                40.5
54   2               24                10.5
55   1               39                83.5
56   1               40                92.5
57   2               32                40.5
58   2               35                60
59   2               40                92.5
60   2               34                51
61   2               34                51
62   1               28                25
63   2               38                75
64   2               39                83.5
65   2               28                25
66   2               28                25
67   2               18                1
68   1               31                36.5
69   2               40                92.5
70   2               39                83.5
71   2               40                92.5
72   2               37                70
73   2               27                18.5
74   1               32                40.5
75   1               23                8.5

Table 8.36 Calculation of ranks cont.

ID   Training type   Combined sample   Rank*
76   2               23                8.5
77   2               39                83.5
78   2               27                18.5
79   2               35                60
80   2               31                36.5
81   2               21                3
82   1               36                66
83   2               34                51
84   1               29                30
85   2               39                83.5
86   1               38                75
87   1               30                33
88   1               36                66
89   1               38                75
90   2               34                51
91   2               39                83.5
92   1               27                18.5
93   2               40                92.5
94   1               29                30
95   1               40                92.5
96   2               30                33
97   1               39                83.5
98   2               35                60
99   1               25                13.5
100  1               34                51

Table 8.37 Calculation of ranks cont.

Median values for type 1 and type 2

If you calculate the median value for the two training types, then the median values are:

M1 = 32 for sample 1
M2 = 34 for sample 2


We can observe that the median for sample 2 is larger than for sample 1 (34 > 32). The question now reduces to whether this difference is significant.

Count the number of data points in each sample:

Number in sample 1, n1 = 39
Number in sample 2, n2 = 61

Calculate the sum of the ranks, T1 and T2:

T1 = sum of sample 1 (traditional method) ranks = 51 + 92.5 + … + 13.5 + 51 = 1778.00
T2 = sum of sample 2 (new method) ranks = 66 + 70 + … + 33 + 60 = 3273.00

Calculate U1, U2 and the test statistic Ucal

The values of U1 and U2 are given by equations (8.20) and (8.21):

U1 = n1 n2 + n1(n1 + 1)/2 − T1     (8.20)

U2 = n1 n2 + n2(n2 + 1)/2 − T2     (8.21)

The test statistic U is equal to the difference between the maximum possible values of T for the sample versus the observed values of T: U1 = T1,max – T1 and U2 = T2,max – T2. Applying equations (8.20) and (8.21) enables U1 and U2 to be calculated: U1 = n1 n2 +

n1 (n1 + 1) − T1 2

U1 = 39 × 61 +

39 (39 + 1) − 1778.00 2

U1 = 2379 + 780 − 1778.00 U1 = 1382 U2 = 39 × 61 +

61 (61 + 1) − 3273 2

U2 = 2379 + 1891 − 3273 Page | 495

U2 = 997 Please note we only need to calculate either U1 or U2 given that we can find the other value from equation (8.22): U1 + U2 = n1n2

(8.22)

Check: U1 + U2 = 1382 + 997 = 2379 n1n2 = 39  61 = 2379 The value of Ucal can be either U1 or U2 and for this example we will choose Ucal = minimum value of U1, U2. Ucal = min(U1, U2) = min(1382, 997) = U2 = 997 The Wilcoxon rank sum test statistic (W) is defined as the smaller of the two ranks T1, T2. For this example, W is as follows: T1 = 1770 T2 = 3273 W = Minimum (T1, T2) = minimum (1770, 3273) = 1778.00. Critical U value Next, we can use Mann-Whitney tables for n1 = 39 and n2 = 61 to find Ucri or the associated test statistic p-value. Given both n1 and n2 are large we will use a normal approximation. Normal approximation to the Mann–Whitney test If the null hypothesis is true, then we would expect U1 and U2 both to be centred at the mean value µU and variance given by equations (8.23) and (8.24): μU = μU =

n1 n 2

(8.23)

2

39 × 61 = 1189.5 2 n1 n2 (n1 + n2 +1)

σu = √

σu = √

(8.24)

12

39 ×61 (39+ 61+1) 12

=√

240279 12

=141.5035 Page | 496
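The normal approximation can also be checked with a few lines of Python. The sketch below is a minimal illustration (not the textbook's Excel/SPSS route) that applies the mean, standard deviation and Z formulas above, equations (8.23)–(8.25), to the summary figures from this example (n1 = 39, n2 = 61, Ucal = 997).

import math
from scipy.stats import norm

n1, n2 = 39, 61
U = 997.0                                            # Ucal from the worked example

mu_U = n1 * n2 / 2                                   # equation (8.23)
sigma_U = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)    # equation (8.24)

z = (U - mu_U) / sigma_U                             # equation (8.25)
p_two_tail = 2 * norm.sf(abs(z))                     # two-tailed p-value

print(round(z, 4), round(p_two_tail, 4))             # approximately -1.3604 and 0.17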

For large sample sizes (both at least 10), the distribution of the random variable given by equation (8.25) is approximated by the normal distribution:

Z = (Ucal − μU) / σU    (8.25)

In our example the calculated Z test statistic is:

Z = (997 − 1189.5) / 141.5035 = −1.3604

Critical value
The critical z value can be found from statistical tables. For a two-tail test at 5%, Zcri = ±1.96.

Does the test statistic lie within the rejection region?
Compare the calculated and critical z values to determine which hypothesis statement (H0 or H1) to accept. Given Zcal lies between the lower and upper Zcri values (−1.96 < −1.3604 < +1.96), we fail to reject the null hypothesis, H0.

Step 5 Make a decision
There is not enough evidence at the 5% significance level to indicate that the effectiveness has improved.

Continuity correction for Z
The standardised z test uses a continuous distribution to approximate the discrete U statistic (based on ranked data); a continuity correction can therefore be applied to equation (8.25), as shown in (8.26):

Zcal = (|Ucal − μU| − 0.5) / σU    (8.26)

For this example, Ucal = 997, μU = 1189.5 and σU = 141.5035, which gives Zcal = −1.3569 and a two-tail p-value of 0.1748.

Issue of tied ranks


If there are many ties, then equation (8.27) provides a better estimator of the standard deviation:

σU = √[ (n1n2 / (n² − n)) × ( (n³ − n)/12 − Σi (fi³ − fi)/12 ) ]    (8.27)

Where i varies over the set of tied ranks, fi is the number of times (i.e. the frequency) the rank i appears, and n = n1 + n2.

Excel solution
Figures 8.68 and 8.69 illustrate the calculation of the ranks (the first and final 10 data points are used as an illustration).

Figure 8.68 Example 8.7 Excel solution

Figure 8.69 Example 8.7 Excel solution continued Figures 8.70 and 8.71 illustrate the Excel Mann–Whitney U test solution.


Figure 8.70 Example 8.7 Excel solution continued

Figure 8.71 Example 8.7 Excel solution continued

From Excel:
Median sample 1 = 32
Median sample 2 = 34
Number in group 1, n1 = 39
Number in group 2, n2 = 61
Sum of group 1 ranks, T1 = 1778.00
Sum of group 2 ranks, T2 = 3273.00
T1 max = 3159
T2 max = 4270
U1 = 1382
U2 = 997
Check: U1 + U2 = n1n2, with U1 + U2 = 2379 and n1n2 = 2379
Choose Ucal = minimum value of U1 and U2: Ucal = min(U1, U2) = min(1382, 997) = 997 (this is from the second sample, n2 = 61)
Mann–Whitney test statistic, Ucal = 997
Wilcoxon rank sum test statistic, W = minimum (T1, T2) = 1778.00
Normal approximation: two-tail p-value = 0.1748 > 0.05, so we fail to reject the null hypothesis H0. There is not enough evidence at the 5% significance level to indicate that the effectiveness has improved.

SPSS solution
Enter data into SPSS as illustrated in Figure 8.72. The first column is called Group and the values 1 and 2 represent the traditional method and the new method, respectively. The second column shows the performance of students.


Figure 8.72 Example 8.7 SPSS data

Figure 8.73 Example 8.7 SPSS data cont.

Figure 8.74 Example 8.7 SPSS data

Figure 8.75 Example 8.7 SPSS data cont. Select Analyze > Nonparametric Tests > Legacy Dialogs > 2 Independent Samples

Transfer Combined_sample to the Test Variable List. Transfer Training_type to the Grouping Variable box.

Figure 8.76 SPSS Two-Independent-Samples Tests Click on Options and choose Descriptives and Quartiles

Figure 8.77 Enter number of samples Click Continue

Figure 8.78 SPSS Two-Independent-Samples Tests

Click OK. SPSS output

Figure 8.79 Example 8.7 SPSS solution

Figure 8.80 Example 8.7 SPSS solution continued

Figure 8.81 Example 8.7 SPSS solution continued

The rows labelled Asymptotic Sig. and Exact Sig. tell us the probability that a test statistic of at least that magnitude would occur if there were no differences between the groups.
• If you have a small sample (n < 50), use the Exact Sig. value.
• If you have a large sample, use the Asymptotic Sig. value. The test value Z is approximately normally distributed for large samples.

From SPSS: Mann–Whitney U test statistic = 997, Z = −1.363 with 2-tail asymptotic p-value = 0.173. [The Wilcoxon rank sum test statistic, W = minimum (T1, T2) = 1778.00.]

From Excel: Mann–Whitney U test statistic = 997, Z = −1.360, with 2-tail p-value = 0.174. [The Wilcoxon rank sum test statistic, W = 1778.00.]

Conclusion
There is not enough evidence at the 5% significance level to indicate that the effectiveness has improved.

Check your understanding
X8.17 What assumptions need to be made about the type and distribution of the data when the Mann–Whitney test is used?
X8.18 Two groups of randomly selected students are tested on a regular basis as part of professional appraisals that are conducted on a two-year cycle. The first group has eight students, with their sum of ranks equal to 65, and the second group has nine students. Is there sufficient evidence to suggest that the performance of the second group is higher than the performance of the first group? Assess at the 5% significance level.
X8.19 The sale of new homes is closely tied to the level of confidence within the financial markets. A developer builds new homes in two European countries (A and B) and is concerned that there is a direct relationship between the country and the interest rates obtainable to build properties. To provide answers, the developer decides to carry out market research to see what interest rates would be obtainable if he decided to borrow €300,000 over 20 years from ten financial institutions in country A and 13 financial institutions in country B. Based upon the data in Table 8.38, do we have any evidence to suggest that the interest rates are significantly different?

A: 10.20 10.97 10.63 10.25 10.75 11.00 10.70 10.50 10.30 10.65
B: 10.60 10.80 11.40 10.78 11.05 11.15 10.90 10.85 11.10 11.16 11.20 11.18 10.89
Table 8.38 Regional interest rates
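As one possible starting point for X8.19 outside Excel or SPSS, the Python sketch below applies SciPy's Mann–Whitney test to the Table 8.38 rates; it is offered as an illustrative alternative rather than as part of the exercise itself.

from scipy.stats import mannwhitneyu

country_a = [10.20, 10.97, 10.63, 10.25, 10.75, 11.00, 10.70, 10.50, 10.30, 10.65]
country_b = [10.60, 10.80, 11.40, 10.78, 11.05, 11.15, 10.90, 10.85, 11.10, 11.16, 11.20, 11.18, 10.89]

# Two-sided Mann-Whitney U test on the two sets of quoted interest rates
u_stat, p_value = mannwhitneyu(country_a, country_b, alternative='two-sided')
print(u_stat, p_value)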

Chapter summary In this chapter we have explored the concept of hypothesis testing for category data using the chi-square distribution. After that, we extended the parametric tests to the case of nonparametric tests (or so-called distribution-free tests). These tests do not require the assumption of a normal population (or sample) distribution. This chapter adopted the simple five-step procedure described in Chapter 7 to aid the solution process.


The main emphasis is placed on the use of the p-value. This gives the probability of obtaining a test statistic at least as extreme as the one observed, assuming the null hypothesis is true; if the p-value is smaller than the chosen significance level, we reject the null hypothesis. The value of the p-value depends on whether we are dealing with a two- or one-tailed test. The second part of the decision-making described the use of the critical test statistic in making decisions. This is the traditional textbook method. It uses published tables to provide estimates of critical values for various test parameter values.

In the case of the chi-square test we looked at a range of applications, including:
1. Testing for differences in proportions
2. Testing for association
3. Testing how well a theoretical probability distribution fits collected sample data.

In the case of nonparametric tests, we looked at a range of tests, including:
1. Sign test for one sample
2. Two-paired-sample Wilcoxon signed-rank test
3. Two-independent-sample Mann–Whitney test.

In the case where we have more than two samples, we would have to use techniques like the Kruskal–Wallis test if we are dealing with independent samples. For dependent samples we would use the Friedman test.

Test your understanding
TU8.1 During a local election the sample data in Table 8.39 were collected to ascertain how local voters see green issues. Given these data, is there any evidence for a difference in how the voters who completed the survey are likely to vote? Test at the 5% significance level.

Political party       Support green issues   Indifferent   Opposed
Conservative          37                     72            1
Labour                67                     85            1
Liberal Democrats     106                    124           3
Table 8.39 Local voters and green issues

TU8.2 A factory makes specific components that are used in a range of family cars. The company undertakes regular sampling to check the quality of the components from three different machines. Each component in the sample is checked for any defects which would result in a batch being rejected by the company. Based on the sample data in Table 8.40, is there any association between the proportion of defectives and the machine used? Test at the 5% significance level.


Outcome          Machine 1   Machine 2   Machine 3
Defective        16          12          20
Non-defective    70          81          75
Table 8.40 Check on quality of components
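As an optional alternative to the Excel/SPSS route, the sketch below shows how the TU8.2 test of association could be attempted in Python using SciPy's chi-square contingency test on the Table 8.40 counts; it is an illustrative sketch, not part of the exercise.

from scipy.stats import chi2_contingency

# Observed counts from Table 8.40 (rows: defective, non-defective; columns: machines 1-3)
observed = [[16, 12, 20],
            [70, 81, 75]]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(chi2, p_value, dof)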

TU8.3 Further historical research by the company described in TU8.2 shows that machine 1 has a production record as follows: 80% excellent, 17% good, 3% rejected. After machine 1 has completed a refit, a sample of 200 components from machine 1 produced the following results: 157 excellent, 42 good and 1 rejected. Carry out a chi-square goodness-of-fit test to test whether the refit has changed the quality of the output from machine 1. Test at the 5% significance level.

TU8.4 The manager of a university computer network has collected data on the number of phishing attempts on university staff computers. Based on the sample data in Table 8.41, do we have any evidence that we may be able to model the relationship with a Poisson distribution? Test at the 5% significance level.

Number of phishing attempts per day   0     1     2     3    4    5    6
Frequency                             200   240   125   52   27   16   12
Table 8.41 Number of computer phishing attempts

TU8.5 Reconsider TU8.4 with the Poisson population average equal to 1.25.

TU8.6 A dairy farmer's milk production over the last 12 months is shown in Table 8.42. Based upon historical data, the milk production has been found to be uniformly distributed. Based upon the sample data given in the table, is there any statistical evidence at the 5% significance level that the sample data is not uniformly distributed?

Month                     1      2      3      4      5      6      7      8      9      10     11     12
Milk quantity (litres)    2678   2602   2649   2588   2530   2397   2410   2350   2495   2558   2602   2665
Table 8.42 Quantity of milk produced

TU8.7 The waiting times at a railway station for rail tickets are normally distributed and have a population standard deviation of 8 minutes. The station master would like to know if the variation in waiting times has been reduced after a review of performance at the customer windows. The station master has collected the following sample data: sample standard deviation s = 4 and sample size n = 25. Is there evidence at the 5% significance level that the population standard deviation has been reduced?


TU8.8 The business manager for several health centres suggested that the median waiting time for each patient to see the doctor at a practice was 22 minutes. It is believed that the median waiting time in other practice health centres is greater than 22 minutes. A random sample of 20 visits to other health centres resulted in the following results: 9.4, 13.4, 15.6, 16.2, 16.4, 16.8, 18.1, 18.7, 18.9, 19.1, 19.3, 20.1, 20.4, 21.6, 21.9, 23.4, 23.5, 24.8, 24.9, 26.8. Conduct a sign test to assess if there is statistical evidence to conclude that the median visit length in these other health practices is greater than 22 minutes.

TU8.9 Table 8.43 represents the amount spent by 17 people at two restaurants (X, Y) in a city. Conduct a Wilcoxon signed-rank test to assess whether the amount spent is different between the two restaurants. Use a significance level of 0.05.

X: 20.2  19.5  18.6  20.9  23.1  18.6  19.6  23.2  21.8  20.2  20.3  19.2  19.5  18.7  18.2  21.6  22.4
Y: 22.8  14.2  14.1  16.1  25.2  20.2  16.7  21.3  18.7  22.8  20.9  22.6  16.9  21.4  18.5  23.4  21.3
Table 8.43 Restaurant spend

TU8.10 The registrar in a business school is exploring staff resources allocated to courses to provide staff support to students while studying for their examinations. Two modules have been chosen, and the data in Table 8.44 are from two independent random samples of 11 students studying statistics on the business degree and 11 students studying e-commerce on the business technology degree. Test the sample data given in Table 8.44 to assess whether there is a difference in the median hours studied between the two groups of students.


Person   Hours spent studying for the statistics examination   Hours spent studying for the e-commerce examination
1        7                                                      14
2        8                                                      13
3        12                                                     12
4        10                                                     11
5        9                                                      9
6        13                                                     17
7        11                                                     16
8        9                                                      11
9        5                                                      12
10       14                                                     9
11       13                                                     6
Table 8.44 Hours spent studying

Want to learn more?
The textbook online resource centre contains a range of documents to provide further information on the following topics:
1. A8Wa Chi-square goodness of fit
2. A8Wb Chi-square test for one population variance
3. A8Wc Chi-square test for normality.

Factorial experiments workbook
Furthermore, you will find a factorial experiments workbook that explores, using Excel and SPSS, the solution of data problems with more than two samples using both parametric and non-parametric tests.


Chapter 9 Linear correlation and regression analysis

9.1 Introduction and chapter overview
Cross tabulation and the chi-square distributions we covered in previous chapters were used to illustrate and define possible associations between two variables. In this chapter we will introduce other methods that will help us to quantify the strength of the relationship between two variables. Once we have established that there is some form of relationship between the two variables, we can build models to see how one variable impacts the other.

The simplest way to visualise a possible association between two variables is to create a scatter plot of one variable against the other. Such a plot will help us decide visually if an association exists, as well as what the possible form of this association is, e.g. linear or non-linear. If the scatter plot suggests a possible association, then we can also use least squares regression to fit a model to the data set, as we will see in the second part of this chapter.

The first part of this chapter is dedicated to establishing possible relationships, or associations, between two interval, or ordinal, variables. For interval data, the strength of this association can be assessed by calculating Pearson's correlation coefficient. For ordinal data, we can use Spearman's rank order correlation coefficient. Although the relationships between variables could take various shapes, in this chapter we focus strictly on linear types. This will lead us to learn how to model linear relationships using a single equation. The basic model that we will introduce is simple linear regression analysis.

What are the typical applications of the methods covered in this chapter? Imagine that a local council office contracts you to participate in a project about the impact of homelessness on the local community. You conduct some data mining and establish that the level of homeless people in the municipality is correlated with the level of crime. Does this mean that homeless people commit more crime? You do not know that. But you can speculate that some other factors might be behind both variables. Could it be drug abuse? You do not know that either, but at least now you know what must be investigated next. This is an example of how, by going through a discovery phase and using a simple tool such as correlation, you can decide what the next step in the project could be. The correlation coefficient will tell you if there is association (and how strong it is) between two variables.

Regression analysis is in a way an extension of the principle of association between the variables, but it goes beyond that. When people apply for loans or credit cards, their credit rating is checked. The model that decides if you are going to be approved for a credit card is possibly based on regression analysis. The model inputs several variables that define you and your lifestyle and then predicts from these variables if you are "credit-worthy". If you were in marketing, you could use this technique to predict the changes in demand as you modify the pricing policy, or how one brand will fare based on your knowledge of

another brand. The examples are numerous. In other words, correlation and regression analysis are powerful business tools for modelling relationships between data sets and for predicting the results. You can think of this technique as one of the most valuable "assets" that you should carry with you into your professional life.

Learning objectives
On completing this unit, you should be able to:
1. Understand the meaning of calculating the correlation coefficient.
2. Apply a scatter plot to demonstrate visually a possible association between two data variables.
3. Calculate Pearson's correlation coefficient for interval data and be able to interpret the strength of this value.
4. Calculate Spearman's rank correlation coefficient for ordinal ranked data and be able to interpret the strength of this value.
5. Understand the meaning of simple linear regression analysis.
6. Fit this simple linear model to the scatter plot.
7. Fit a simple linear regression model to two variables.
8. Predict/estimate a dependent variable using an independent variable.
9. Use the coefficient of determination (or the r-squared value) to establish the strength of the model.
10. Estimate how well the regression model fits the variables.
11. Check the model assumptions and assess if they have been violated.
12. Construct a prediction interval to the population parameter estimate.
13. Solve problems using Microsoft Excel and SPSS.

9.2 Introduction to linear correlation
Correlation analysis is one of the simplest and most effective tools to provide greater insight into data. It is based on the intuitive assumption that if two variables move in the same direction, there must be some sort of "connection" between them. Equally, if they move in the opposite direction to one another, the connection is still there, but with the inverse effect.

Imagine that you work for a food conglomerate that makes a variety of snacks. By sifting through data, you discover that the increase in sales of a chocolate bar is accompanied by an increase in sales of a certain dog food. When one drops, the other one drops too. You have no idea why this is happening, but you have established that they are related. We call this correlation. However, one thing that you cannot say is that the increase in the sale of one of the items is causing the increase in the sale of the other item. They just move in the same direction for reasons unknown to us.

Correlation is often confused with causation, which is a mistake. Two variables could show a very high level of correlation, but this may not mean that either of them can cause changes in the other. It may mean that they both respond in the same way to some other undefined and invisible variables. The fact that they may not cause changes to one another does not mean that they cannot be used as good predictors of one another. This is the feature that we will explain in the sections that follow.

The only limitation to the techniques described in this chapter is that the relationship between the variables needs to be linear; otherwise our linear correlation measurement is not appropriate. What do we mean by a linear relationship? Imagine that the supply of strawberries on the market grows from week to week in June as follows: 10, 20, 30, 40 tons per week. Because the increments between the numbers are constant (10 tons per week), this is called linear growth. Non-linear growth would be: 10, 20, 40, 80 tons per week. Figures 9.1 and 9.2 illustrate these examples in a graphical format.

Figure 9.1 Linear trend representing growth

Figure 9.2 Non-linear trend representing growth As we can see, in the case of non-linear growth, the increments from week to week are 10, 20, 40, etc., i.e. they are doubling every week. The price for strawberries might decline from week to week as: 4.00, 3.50, 3.00, 2.50 GBP per kilo. This is also linear movement, though a declining one. A non-linear decline would be something like 4.00, 2.00, 1.00, 0.50 GBP per kilo. Figures 9.3 and 9.4 illustrate these two examples.


Figure 9.3 Linear trend representing decline

Figure 9.4 Non-linear trend representing decline Note that non-linear does not necessarily look like a parabola, as in our two examples, but it can be any other curve that is not a straight line. Using the techniques from this chapter, you can only calculate correlation between the two variables that are moving in linear fashion. For non-linear movements, other techniques not mentioned in this textbook are required.

9.3 Linear correlation analysis
If we suspect that there is a relationship between various sets of data, then our aims are to confirm that a relationship exists, and how strong it is. Even if we do not suspect a relationship, using the techniques from this chapter we might discover that there is one. The techniques we will use are:
• Scatter plots.
• Pearson's correlation coefficient (r) for interval data.
• Spearman's rank correlation coefficient (rs) for ordinal ranked data.
• Undertake an inference test on the value of the correlation coefficients (r and rs) being significant (online only).


Scatter plots
To introduce the concept of correlation, we need to start with scatter plots. Scatter plots are like line graphs in that they use horizontal and vertical axes to plot data points. However, the objective of a scatter plot is to show whether the two variables are in any way connected. If two variables are connected in some way, then we can say that there is a relationship between them. The relationship between two variables is called their correlation. A point on a scatter plot is where the two variables intersect. These points will create a "cloud". The closer the "cloud" of data points comes to making a straight line when plotted, the higher the correlation between the two variables. The higher the correlation, the stronger the relationship. Note that this line does not have to be straight, but in that case the relationship is non-linear, and we will not consider it in this chapter.

Example 9.1
Table 9.1 consists of two sets of data from the Office for National Statistics in the UK. Both data sets cover the period between January 2016 and May 2019. The first set shows the monthly level of employment in the UK, in percentages, for the ages of 16 and above. The second set shows monthly visits abroad, in thousands, for all ages of UK citizens in the same period.

Table 9.1 Two UK data sets for period from Jan 2016 to May 2019 (Source: ONS UK) By just looking at numbers, we can see that both data sets are increasing. We can present these two data sets in a graph, and we use the graph with the left axis showing the scale for employment, and the right axis showing the scale for all UK visits abroad.


Figure 9.5 UK data sets for age 16 and over employed and for all visits abroad (all ages)

From Figure 9.5, we can see that both data sets flow with an upward trend from month to month, as the corresponding numbers show. We might speculate that if we knew the number of UK people aged 16 and over that are employed, we might be able to predict how many people will travel abroad from the UK. In other words, travel abroad might be related to the number of people that are employed. Think of travel abroad as a dependent variable and the percentage of people employed as an independent variable. Park this thought; we will return to it later in the chapter, as this is the key point.

Excel Solution
Rather than showing the two data sets as line graphs, as in Figure 9.5, we could plot each pair of values as a point on a graph. This graph, shown in Figure 9.6, is called a scatter plot. How to construct a scatter plot using Excel has already been demonstrated in Chapter 1. We will therefore skip this part of the explanation and concentrate on the content of the plot. A dot on the graph represents the intersection of two values, both of which are captured at the same point in time. In August 2017, for example, we have the value of 61.0 on the x axis intersecting with the value of 7,210 on the y axis. This intersection is shown as a dot. Clearly, scatter plots do not show any time dimension (the date when the intersection happened is not shown on this graph), just what value from the first variable corresponds with the value from the second variable.


Figure 9.6 Scatter plot for the percentage of age 16 and over employed and all visits abroad

A scatter plot hints that there is some form of relationship. Look at the graph in Figure 9.6; as the percentage of those that are employed increases (horizontal axis), there is a tendency for more people to travel abroad too (vertical axis). The data, therefore, would indicate a positive relationship. As we will show later, it is possible to describe this relationship by fitting a line or curve to the data set. This will enable us to predict the number of visits abroad based on any number of employed people aged 16 and over.

SPSS Solution
SPSS data file: Chapter 9 Example 1 Scatter plot.sav (only first 8 records illustrated below)

Figure 9.7 Example 9.1 SPSS data A scatter plot is constructed following the procedure below. Graphs > Legacy Dialogs > Scatter/Dot Choose Simple Scatter


Figure 9.8 SPSS scatter/plot menu Click Define We move number visited to Y Axis and employed to X Axis.

Figure 9.9 SPSS simple scatterplot menu Click OK SPSS Output

Figure 9.10 SPSS solution scatterplot

Just like the same plot constructed in Excel, this one shows how employed and number of visits are distributed. We also see that low values of employed have low values of number of visits and high values of employed have high values of number of visits. Let us now conduct a little experiment. In Figure 9.11 we modified the y axis scale to run from 4,000 to 11,000 instead of 5,000 to 9,000. More importantly, we also changed one point to illustrate the issue of outliers.


Figure 9.11 Identifying outliers in data

What are outliers?
The scatter plot can be used to identify possible outliers within the data set. We can see in Figure 9.11 the same data set as in Figure 9.6 or 9.10, but with one data value of y changed (Jan 2018 = 10000, instead of 7450). This value is far greater than any other data point for the y values and it is called an outlier. It could have been a data-entry error, or it could have been a genuine "freaky" number. The point here is that outliers could have undue influence on the values of the correlation coefficients estimated and, therefore, need to be handled somehow.

Regardless of whether an outlier is an error or a genuine extreme value, we cannot leave it as it is. This one single value would distort our model and give us a false representation of reality. One of the solutions to this problem is to delete the outlier value from the data set. If the data set is not time based, then deleting the outlier is an acceptable option. Deleting just one observation from a large set will not distort the results. However, if we are using time-based data, then deleting the data point does not seem right. By deleting one observation we create a gap in the continuum of the time series. For this type of data, it is better to substitute the outlier with some more acceptable value that is more "in line" with other values in the data set.

There is no universally accepted method on how to deal with outliers. Some advocate that outliers that lie beyond ±1.5 standard deviations around the mean value should be excluded. In many cases, we suggest a "quick and dirty" method, which is to take the two neighbouring values, calculate their average and substitute this value for the outlier. If we followed this approach in our example, we would use 7260 and 7250 (the values for Dec 2017 and Feb 2018) and find that their average is 7255. This value should replace 10000 (the value for Jan 2018), which was an outlier. Given that we know that the original value was 7450, substituting this value with 7255 is much more acceptable than 10000. It will have no significant impact on the overall relationship between the two variables.
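A minimal Python sketch of the "quick and dirty" neighbouring-average fix described above, using the Dec 2017 and Feb 2018 values quoted in the text (the list is a short illustrative extract, not the full series):

# Replace a suspect value with the average of its two neighbouring observations.
visits = [7260, 10000, 7250]      # Dec 2017, Jan 2018 (outlier), Feb 2018
outlier_index = 1

visits[outlier_index] = (visits[outlier_index - 1] + visits[outlier_index + 1]) / 2
print(visits)                     # [7260, 7255.0, 7250]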

Covariance
The scatter plot enabled us to visualise and conclude that two variables are potentially jointly related, i.e. it shows us the movements of one variable in relationship to the movements of the other variable. A quantitative measure that tells us the same thing is called covariance. Covariance is a measure of how changes in one variable are associated with the changes in a second variable. Covariance takes either positive or negative values, depending on whether the two variables move in the same or opposite directions. If the covariance value is zero, or close to zero, then this is an indication that the two variables do not move closely together. Equation (9.1) defines the sample covariance:

Sx,y = Σ (xi − x̄)(yi − ȳ) / (n − 1), with the sum taken over i = 1 to n    (9.1)

In equation (9.1), x represents individual values of the first data set (percentage of employed, for example) and y represents individual values of the second data set (number of people travelling abroad, for example). The average value for people employed is x̄ and the average value for the variable describing people travelling abroad is ȳ. The symbol n represents the number of cases per data set, which in our case is 50.

Example 9.2
We are using the same data as in Example 9.1 to demonstrate how the calculations are executed. Table 9.2 illustrates the calculations (only the first row and the last four rows are shown).

Table 9.2 Example 9.2 data table and calculation of column statistics

Following equation (9.1), we first calculate the averages for x and y:
x̄ = 60.8 and ȳ = 7299.5

From there, we calculate the deviation of every value from its average, multiply the paired deviations and then add them all up:

Sx,y = [(x1 − x̄)(y1 − ȳ) + (x2 − x̄)(y2 − ȳ) + … + (xn − x̄)(yn − ȳ)] / (n − 1)
Sx,y = [(60.4 − 60.8)(6350 − 7299.5) + (60.2 − 60.8)(6760 − 7299.5) + … + (61.7 − 60.8)(7370 − 7299.5)] / 47
Sx,y = 131.46
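The same calculation can be reproduced outside Excel; the Python sketch below is illustrative only, since the five pairs shown are just a handful of values quoted in this chapter rather than the full ONS series (with the full data the sample covariance is 131.46).

import numpy as np

employed = np.array([60.4, 60.2, 61.0, 60.9, 61.7])   # a few x values quoted in the text
visits   = np.array([6350, 6760, 7210, 7250, 7370])   # the corresponding y values

# Sample covariance: ddof=1 divides by n-1, matching equation (9.1)
sample_cov = np.cov(employed, visits, ddof=1)[0, 1]
print(sample_cov)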

The covariance value for these two data sets is 131.46. As will be explained shortly, the covariance is an important building block for calculating the coefficient of correlation between two variables.

Excel solution
Figure 9.12 illustrates the Excel solution (rows 12-40 are hidden to make the table more compact for presentation purposes). The two Excel functions that were used, =COVARIANCE.P(), aimed at population data, and =COVARIANCE.S(), aimed at sample data, show similar values. If these two datasets were the whole population, which they are not, then the covariance level of 137.64 would be the appropriate value to use. We will use the sample data value of 140.57, because the datasets we used are subsets of larger datasets. The "population" is much larger than just the interval between Jan 2016 and Dec 2019, so this is just a sample.

Figure 9.12 Excel solution to calculate covariance The sample covariance value implies that the relationship is positive (because the number is positive) and that its strength is 131.46. But what is the meaning of 131.46? Not much, used in isolation. The covariance is expressed in some mixed “units” between the two variables. In other words, it is difficult to interpret and to compare with other covariances, if this is what we wanted to do. From Excel, the sample covariance is 131.46, implying that both variables are moving in the same direction (indicated by the positive value). A major shortcoming of calculating just the covariance is that the variable can take any value and we are unable to establish the strength of the relationship in relative terms. For this value of 131.46 we do not know if this represents a strong or weak relationship between UK employment and UK visits


abroad. To measure this strength in relative terms, we need to use the correlation coefficient. SPSS solution SPSS data file: Chapter 9 Example 2 Covariance.sav (only first 8 records illustrated below)

Figure 9.13 Example 9.2 SPSS data Select Analyze > Correlate > Bivariate

Figure 9.14 SPSS correlate menu selection Transfer both variables to the Variables box Select Options


Figure 9.15 SPSS bivariate correlations menu Choose Means and standard deviations and Cross-product deviations and covariances.

Figure 9.16 SPSS bivariate correlations options Click Continue Click OK SPSS Output


Figure 9.17 SPSS solution The value of the sample covariance is printed twice. The first one shows covariance between UK employment level and UK visits abroad, and the second one the reverse, i.e. UK visits abroad and UK employment level, which is the same. From SPSS, the covariance = 131.46. In summary, the difficulty with covariance is that it will be larger if the values of X and Y are larger, and smaller if the values of the two variables are smaller numbers. Effectively, the value of covariance is defined by the range of values of X and Y, so it is impossible to get any meaningful interpretation of the covariance number. Neither can we compare two covariances. To address the problem of interpretation and comparison, we need to standardize the measure that will tell us more about the relationship between variables. This new measure, or the statistic, is called the correlation coefficient. Think about the covariance as the building block that will help us calculate the correlation coefficient.

Pearson's correlation coefficient, r
The sample correlation coefficient that can be used to measure the strength and direction of a linear relationship is called Pearson's product moment correlation coefficient, r, defined by equation (9.2):

r = Sxy / (Sx Sy)    (9.2)

Where Sx is the sample standard deviation of the sample variable x, Sy is the sample standard deviation of the sample variable y, and Sxy is the sample covariance between variables x and y. If we substitute equation (9.1) into (9.2), we get an alternative, often used, equation representing Pearson's correlation coefficient. This is given by equation (9.3):

r = [1 / (n − 1)] Σ [(xi − x̄)(yi − ȳ) / (Sx Sy)], with the sum taken over i = 1 to n    (9.3)

Where the symbols Sx and Sy are the same as in equation (9.2), i.e. they are the standard deviations for variables x and y respectively, and n is the number of paired (x, y) data values. In some textbooks, you will find an even more complex equation for calculating Pearson's coefficient of correlation:

r = [Σxiyi − (Σxi Σyi)/n] / √[ (Σxi² − (Σxi)²/n)(Σyi² − (Σyi)²/n) ]    (9.4)

Equation (9.4) was frequently used before the days of computers and spreadsheets; today we have much more elegant ways to calculate the coefficient of correlation.

Example 9.3
In Example 9.2 we already calculated the sample covariance as 131.46. We have also calculated (not shown here) the standard deviations for x and y, and they are respectively sx = 0.37 and sy = 436.74. Following equation (9.2), this produces the correlation coefficient as:

r = 131.46 / (0.37 × 436.74) = 0.81
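The same coefficient can be obtained in Python; the sketch below is illustrative only, since the arrays contain just a few of the value pairs quoted in this chapter rather than the full series (for which r = 0.81).

import numpy as np
from scipy.stats import pearsonr

employed = np.array([60.4, 60.2, 61.0, 60.9, 61.7])
visits   = np.array([6350, 6760, 7210, 7250, 7370])

r, p_value = pearsonr(employed, visits)
print(r, p_value)

# Equivalent "by hand" version of equation (9.2): sample covariance divided by the
# product of the two sample standard deviations.
r_manual = np.cov(employed, visits, ddof=1)[0, 1] / (employed.std(ddof=1) * visits.std(ddof=1))
print(r_manual)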

How do we interpret this value of 0.81? The values of r can be anywhere between −1 and +1. The way this number is interpreted is as follows:
a) If r lies between −1 ≤ r ≤ −0.7 or 0.7 ≤ r ≤ 1, a strong association, or correlation, is assumed.
b) If r lies between −0.7 ≤ r ≤ −0.3 or 0.3 ≤ r ≤ 0.7, a medium association, or correlation, is assumed.
c) If r lies between −0.3 ≤ r ≤ −0.1 or 0.1 ≤ r ≤ 0.3, a weak association, or correlation, is assumed.
d) If r lies between −0.1 ≤ r ≤ 0.1, there is virtually no association, or correlation.
e) If r = 0, this implies there is no linear association, or correlation, between the variables.
In our case r = 0.81, which represents a very strong association (correlation) between these two variables.


Figures 9.18 and 9.19 show examples of perfect positive correlation (r = +1) and perfect negative correlation (r = −1).

Figure 9.18 Perfect positive correlation example

Figure 9.19 Perfect negative correlation example The rules above on how to interpret the strength of association between the two variables are not rigorous or strict. Depending on the context, they could be relaxed. Think of them as the “consensus” views typically taken in business and management. Excel solution Figure 9.20 represents the Excel solution with rows 10 to 40 hidden.


Figure 9.20 illustrates the Excel solution to calculate Pearson’s correlation coefficient, r. We used three different Excel functions to calculate the coefficient of correlation. The first two, =PEARSON() and =CORREL() are standard Excel functions and the third one converts equation (9.2) into Excel syntax. In any case, all three are returning the same value, which is +0.81. This value indicates a strong positive linear association between the value of the visits abroad (y) and the percentage of employed age 16 and over (x), confirming the impression from the scatter plot in Figures 9.6 and 9.10. SPSS solution SPSS data file: Chapter 9 Example 3 Pearson correlation analysis.sav (only first 8 records illustrated below)

Figure 9.21 Example 9.3 SPSS data Run SPSS Correlation Test - Pearson Select Analyze > Correlate > Bivariate Transfer UK employment and UK visits abroad variables to the Variables box Choose Pearson


Figure 9.22 SPSS bivariate correlations menu Click OK SPSS Output

Figure 9.23 SPSS solution

The correlation itself is 0.81. This indicates a strong (positive) linear relationship between employed and number of visits abroad. The p-value, denoted by "Sig. (2-tailed)", is 0.000. This p-value of 0.000 is smaller than 0.05, the level of significance α, and we conclude that the correlation between the two variables is significant. The results are based on N = 41 cases.

In summary, a strong linear relationship is observed between the variable employed and the number of visits, with Pearson correlation = 0.81 and p-value = 0.000 (2-sided).


On top of how employment and number of visits are distributed separately, we also see that low values of employed correspond with low values of number of visits and high values of employed with high values of number of visits.

It should be noted that if we included the outlier illustrated in Figure 9.11, rather than the original value for Jan 2018, then the value of the correlation coefficient (r) would reduce to 0.63 and would suggest a reduced correlation between the two variables (x and y). This illustrates how much one single outlier can distort the true correlation between two variables if not handled properly.

Let us clarify what the value of r does not indicate:
1. Correlation only measures the strength of a relationship between two variables but does not prove a cause and effect relationship.
2. A value of r ≈ 0 would indicate no linear relationship between x and y, but this may also indicate that the true form of the relationship is non-linear.

To show the case of negative correlation, we will take a look at the relationship between the UK unemployment rate and visits abroad in Figure 9.24. Note that Figure 9.24 replaces the employment data from Table 9.1 with the unemployment data (same source) versus the total number of people travelling abroad in the UK. In Figure 9.24 we observe that in this case as x increases, y decreases. The correlation between x and y is negative in this case. When the line goes from a high value on the y-axis down to a high value on the x-axis, the variables have a negative correlation. The correlation coefficient in this case is −0.85.

Figure 9.24 Example of negative correlation

This correlation coefficient between the unemployment rate of those aged 16 and over in the UK and total travel abroad in the UK indicates that there is a strong negative correlation between these two variables. As unemployment goes up, fewer people in the UK travel abroad. Conversely, you can also say that the fewer people are unemployed, the more people will travel abroad.

The coefficient of determination, r2 or R-squared
What happens if we take the Pearson correlation coefficient to the power of two, in other words r2? We get a new measure that is called the coefficient of determination. Most software packages refer to this statistic as R-squared or R-square. In our Example 9.3, the value of r = 0.81. This means that R-squared, or r2 = 0.81 × 0.81 = 0.65.

How do we interpret this statistic and the corresponding value? Unlike the coefficient of correlation, whose range is between −1 (negative correlation), via 0 (no correlation), to +1 (positive correlation), R-squared goes only between 0 and 1. The meaning of 1 is that the changes in one variable are 100% accompanied by the changes in the other variable. The meaning of 0 is that the two variables have no impact on one another. The value of 0.65 for R-squared in the example above means that 65% of the variations in visits abroad in the UK can be explained by the variations in employment. The remaining 35% (100 − 65 = 35) of variations is attributed to some other factors beyond the employment rate. We will return to the coefficient of determination in the context of linear regression and define how R-squared can be used to help us decide if our predictions meet the original variable.

We still have not examined how significant the linear correlation expressed as the correlation coefficient r is; that is, whether the conclusions we have made about the sample data apply to the whole population. To do this, we need to conduct a hypothesis test. The result will confirm if the same conclusion applies to the whole phenomenon (population) and, importantly, at what level of significance. This test is included in the web chapters that accompany this textbook: AW8a Testing the significance of a linear correlation between the two variables.
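The significance test referred to above is covered in the web chapter; as a hedged illustration only, the standard t-based form of that test can be sketched in Python using just the values reported for Example 9.3 (r = 0.81, N = 41), together with R-squared.

import math
from scipy.stats import t as t_dist

r, n = 0.81, 41            # correlation and sample size reported for Example 9.3
r_squared = r ** 2         # coefficient of determination, about 0.65

# Standard t statistic for testing H0: no linear correlation in the population (df = n - 2)
t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r_squared)
p_two_tail = 2 * t_dist.sf(abs(t_stat), df=n - 2)

print(round(r_squared, 2), round(t_stat, 2), p_two_tail)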

Spearman's rank correlation coefficient, rs
We will now cover an example for data collected in ranked form. In this case, a ranked correlation coefficient can be determined. Equations (9.2)–(9.4) provide the value of Pearson's correlation coefficient between two data variables x and y, which are both measured on interval scales. The question then arises: what do we do if the data variables are both ranked? In this case we can show algebraically that equation (9.4) is equivalent to equation (9.5):

rs = 1 − [6 Σ (Xr − Yr)²] / [n(n² − 1)], with the sum taken over r = 1 to n    (9.5)

Where Xr = rank order value of X, Yr = rank order value of Y, and n = number of paired observations. Equation (9.5) is known as Spearman's rank correlation coefficient. If the characteristics of any two variables cannot be expressed quantitatively, but can be ranked, we can still measure whether they are correlated, using Spearman's rank correlation coefficient.

Equivalence between equations (9.4) and (9.5) will only be true for situations where no tied ranks exist. When tied ranks exist, discrepancies between the values of r and rs arise. As with the majority of other nonparametric tests included in this textbook, ties are handled by giving each tied value the mean of the rank positions for which it is tied. The interpretation of rs is like that for r, namely: (a) a value of rs near 1.0 indicates a strong positive relationship, and (b) a value of rs near −1.0 indicates a strong negative relationship. As a reminder, note that similar to r, the same rules apply to rs:
a) If rs lies between −1 ≤ rs ≤ −0.7 or 0.7 ≤ rs ≤ 1 (strong association)
b) If rs lies between −0.7 ≤ rs ≤ −0.3 or 0.3 ≤ rs ≤ 0.7 (medium association)
c) If rs lies between −0.3 ≤ rs ≤ −0.1 or 0.1 ≤ rs ≤ 0.3 (weak association)
d) If rs = 0, as before, there is no association between the ranks.

Example 9.4 A company makes seven different brands (A, B, …, G) and they are sold to two export markets, Germany and France. The brands are ranked differently in each market. You are asked to decide whether the brand rank in Germany correlates with the rank for seven brands in France. The data is provided in Table 9.3. Since the information is ranked, we use Spearman's correlation coefficient to measure the correlation between German and French ranks.

Table 9.3 Ranks of different brands (brand A to brand G) in two markets (Germany and France)

Excel solution
Excel does not offer a built-in Spearman rank correlation coefficient function, so we will use Excel to conduct the manual calculations. Figure 9.25 shows these calculations.


Figure 9.25 Excel solution for calculating the Spearman’s rank correlation From Figure 9.25, the Spearman rank correlation is positive, rs = 0.643, indicating that there is a reasonably high positive rank correlation in this case. The way our brands A to G are ranked in German and French market indicates moderate correlation. This implies that these two markets, as far as the ranking of our brands are concerned, are not very strongly correlated. If this number was closer to +1, we would be able to claim even stronger positive rank correlation. Although Excel does not have a procedure for directly computing Spearman’s ranked correlation coefficient, there is a workaround. Since the formula for Spearman’s essentially measures the same thing as the Pearson’s correlation coefficient (but for ranked values), we can use Pearson’s, providing that we have first converted the x and y variables to rankings (see in Excel tabs: Data > Data Analysis > Rank and Percentile). SPSS solution Spearman correlation coefficient SPSS data file: Chapter 9 Example 4 Spearman correlation analysis.sav

Figure 9.26 Example 9.4 SPSS data Select Analyze > Correlate > Bivariate Transfer German Rank and French Rank variables to the Variables box Click on Spearman


Figure 9.27 SPSS bivariate correlations menu Click OK SPSS Output

Figure 9.28 SPSS solution

From SPSS: Spearman's correlation coefficient is 0.643. This indicates a moderate (positive) relationship between the French and German ranks. For pairs of data considered to have a strong relationship, just as in the case of Pearson's correlation coefficient, you will need to confirm that the value is significant. Using Excel, this test is included in the web chapters that accompany this textbook: AW8b Testing the significance of Spearman's rank correlation coefficient, rs. Using SPSS, this test is conducted by using the p-value. In Figure 9.28, under the title "Sig. (2-tailed)", we can see the value of 0.119. This is the p-value. It indicates that the p-value = 0.119 > 0.05, so we conclude that the correlation between the two variables is not significant. The results are based on N = 7 cases.
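For completeness, the same Spearman calculation can be sketched in Python. The rankings below are hypothetical placeholders, since Table 9.3 itself is not reproduced here; with the textbook's data the output would be rs = 0.643 with a p-value of 0.119.

from scipy.stats import spearmanr

german_rank = [1, 2, 3, 4, 5, 6, 7]   # placeholder rankings of brands A-G in Germany
french_rank = [2, 1, 4, 3, 7, 5, 6]   # placeholder rankings of the same brands in France

rs, p_value = spearmanr(german_rank, french_rank)
print(rs, p_value)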

Check your understanding
X9.1 Why do you need to identify outliers and decide if you need to deal with them?


X9.2 If the correlation coefficient is zero, does this mean that there is no correlation between the two variables, or that the correlation is negative?
X9.3 What is the difference between r and r-squared?
X9.4 For measuring correlations between ranked values, would you use the Pearson's or Spearman's correlation coefficient formula?
X9.5 What values of the correlation coefficient would you use to describe: strong association between two variables, medium association and weak association?

9.4 Introduction to linear regression
Most of the time, just measuring association is not enough. Once we know that there is association between the variables, we can also fit a line equation to a data set. This becomes a model that will allow estimates/predictions to be provided for a dependent variable (y) given an independent (or predictor) variable (x). Linear regression analysis is one of the most widely used techniques for modelling a linear relationship between variables. It is frequently used in business, economics, psychology, and social sciences in general. This section is dedicated to linear regression only, but the online chapters also cover non-linear and multiple regression modelling. It is important to realize that regression analysis can be used on all types of data:
• temporal or time series data (monthly sales values, for example).
• categorical or nominal data (an example is the gender of the respondent).
• ordinal data (an example is ordering respondents into low, medium and high economic status).
• interval data (an example is equally spaced categories, such as earnings below 20K, between 20–40K and 40–60K).

The point here is that regression analysis can be applied to virtually any kind of data type. In some cases, data manipulations might be necessary. Coefficients of correlation just establish the strength of the relationship between two variables; regression analysis, on the other hand, is a technique that will help you establish how to estimate, or predict, changes in one variable if you know the other.

9.5 Linear regression
Linear regression analysis attempts to model associations between two variables in the form of a straight-line relationship. If the relationship between any two variables was linear, and if we wanted to describe this relationship for the whole population, then we would use equation (9.6):

Y = β0 + β1X    (9.6)

Where Y is the dependent variable, X is the independent variable, β0 is the intercept for this linear equation and β1 is the slope, i.e. the rate at which this straight line grows or declines. No doubt you are familiar with this equation, as it is identical to the straight-line equation. However, in real life we seldom have the luxury of dealing with the whole population. Most of the time we deal with a sample, or just a section of all the data available. This means that our equation (9.6) is more likely to look like equation (9.7):

ŷ = b0 + b1x    (9.7)

Where ŷ is the estimated value of the dependent variable (y) for the given values of the independent variable (x). The values of the constants b0 and b1 are effectively estimates of some true values of β0 and β1. Let us assume that using our model in (9.7) we are trying to estimate one of the data points, for example Y7. Our model estimates this point to be ŷ7 = b0 + b1x7. Clearly the true point Y7 and the estimate ŷ7 might not be identical (because b0 and b1 are just estimates of some true values of β0 and β1). This means that effectively for every estimate ŷ there is a potential error element: y = ŷ + error = b0 + b1x + error.

We can use a practical example to explain this better. Imagine if you could gather data from all the students in a country, telling you how much time they spent preparing for every exam. Let us call this population data set X. Then you check the test results (from 0 to 100) they got for every exam. We will call this population data set Y. If you knew this, the relationship between these two variables might be something like Y = 40 + 0.5X (note that this equation is purely fictional and used here just for illustration purposes). What can you conclude from this? First, if X = 0, then Y = 40. This means that if you have spent zero hours on revision, you are likely to get a test score of 40. What if X = 50? Well, in this case Y = 65 (i.e. 40 + 0.5 × 50). In other words, 50 hours of revision is likely to give you a test score of 65.

However, the point here is that you do not know that β0 is 40 and that β1 is 0.5. If you have conducted a quick survey at your campus, you might get the value of b0 to be 45 and b1 to be 0.35. This shows that b0 and b1 are just estimates of β0 and β1. By using b0 and b1 you generalise that if students spend x number of hours revising for an exam, they will get a test score equivalent to ŷ. To validate this conclusion, you will need to conduct some further tests. However, it is quite clear that ŷ and Y might not be identical, which is the reason why we said that individual values in the model are likely to contain some error element. Every model, including regression analysis models, is just an abstraction of reality, or an approximation of reality. This means that all models contain a certain amount of error when approximating this reality. Analysing these errors, as we will see shortly, will help us decide how good our model is.

How do we calculate the constants b0 and b1 from the limited number of data points representing x and y? To do this, regression analysis uses the method of least squares.

The method assumes that the line will pass through the point of intersection of the mean values of x and y (x̄, ȳ). Let us demonstrate this using the following illustration. Let us assume that we have asked six students from our campus to tell us how many hours they have spent revising for their last Statistics examination. We also asked them to tell us the test results they obtained for this examination. The data is captured as follows:

Hours of revision (x): 20  80  170  130  210  110
Test results (y):      45  35  82   70   77   65
Table 9.4 Hours spent studying for the statistics examination

On average, these students spend 120 hours on revision and on average they got the score of 62. These are the mean values: 𝑥̅ = 120 and 𝑦̅ =62. We can plot the results as a scatter diagram and insert the lines representing 𝑥̅ and 𝑦̅ into the graph, as in Figure 9.29.

Figure 9.29 Line fitted through mean value of data points for x and y variables

We also added the regression line (red colour in this graph), although we have not shown yet how it was calculated (patience for just a bit longer). We just wanted to illustrate the point above, which is that the regression line will have to pass through the intersection point of the two means (x̄, ȳ). OK, so now we know that the regression line must go through the intersection, but what is the correct angle? There could be many lines going through the intersection. Figure 9.30 shows several possible options.


Figure 9.30 Possible line fits

The answer is that the least squares method ensures that the regression line is pivoted about this intersection point until:
I. The sum of the vertical distances of the data points above the line equals those below the line (i.e. the sum is zero).
II. The sum of the vertical squared distances of the data points is a minimum.

Again, a graphical representation of what we just said is given in Figure 9.31.

Figure 9.31 Fitting a line to data using least squares method

We can see in Figure 9.31 that we measure the distance of every actual point from the regression line. When the two conditions are satisfied, then we can say that we have found the regression line that best represents (or fits) this relationship. In practice, it would be very difficult to pivot the regression line until it meets the two criteria, so we can use some


algebra to achieve this more efficiently. Algebraically, the above conditions are defined as:

I. Σ(yi − ŷ) = 0

and

II. Σ(yi − ŷ)² = minimum

Where ŷ represents the values on the regression line, which are effectively estimates of y. From this concept, two 'normal equations' are defined:

Σyi = n b0 + b1 Σxi

and

Σxiyi = b0 Σxi + b1 Σxi²

The reason the phrase "normal equations" is used is that the distances from every point to the regression line must be orthogonal (see Figure 9.31), which is also called "normal". By solving the above equations simultaneously, we obtain the values of b0 and b1. These give the estimated equation of the line of regression of Y on X, where Y is the dependent variable and X is the independent variable. If we rearrange the two 'normal equations', then b0 and b1 are calculated using equations (9.8) and (9.9).

b1 = (n Σxiyi − Σxi Σyi) / (n Σxi² − (Σxi)²)    (9.8)

b0 = (Σyi − b1 Σxi) / n    (9.9)
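As a small numerical illustration of equations (9.8) and (9.9), the sketch below applies them to the six-student data from Table 9.4. It is written in Python with the NumPy library (Python is not part of the Excel/SPSS workflow used in this book, and the variable names are ours); it also confirms that the fitted line passes through the intersection point (x̄, ȳ) discussed above.

```
import numpy as np

# Six-student data from Table 9.4
x = np.array([20, 80, 170, 130, 210, 110], dtype=float)   # hours of revision
y = np.array([45, 35, 82, 70, 77, 65], dtype=float)       # test results
n = len(x)

# Equations (9.8) and (9.9)
b1 = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x**2) - np.sum(x)**2)
b0 = (np.sum(y) - b1 * np.sum(x)) / n

print(b0, b1)                   # intercept and slope of the fitted line
print(np.mean(x), np.mean(y))   # x-bar = 120, y-bar is approximately 62.3
print(b0 + b1 * np.mean(x))     # equals y-bar: the line passes through (x-bar, y-bar)
```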

Later, we will show that Excel can be used in several different ways to undertake regression analysis and calculate the required coefficients b0 and b1. The three possible options that Excel offers are: 1. Dedicated statistical functions – Excel contains embedded functions that allow a range of regression coefficient calculations to be undertaken. 2. Standard worksheet functions – Standard Excel functions can be used to reproduce the manual solution e.g. use =SUM(), =SQRT() functions, etc. 3. Excel Data Analysis > Regression – This method provides a complete set of solutions. Regardless of what option is used, or if we do it manually, the process of going through regression analysis can be split into a series of steps:


• Always start with a scatter plot to get an idea about a possible model.
• Calculate and fit the model to sample data.
• Conduct a goodness-of-fit test of the model (using the coefficient of determination, for example).
• Test whether the predictor variables are significant contributors (undertake a t-test).
• Test whether the overall model is a significant contributor (undertake an F-test if you have more than one independent variable).
• Calculate a confidence interval for the population slope, β1.
• Check model assumptions.

The third Excel method listed above (Excel Data Analysis option) will automatically deliver most of the steps listed here. The first two require manual calculations, though they are supported by the ready-made Excel functions.

Fit line to a scatter plot
In previous sections we learned how to create a scatter plot. In this section, we will learn how to fit a line to such a scatter plot and what that implies. Excel contains several functions that allow you to directly calculate the values of b0 and b1 in equations (9.8) and (9.9).
Example 9.5
We will use the same data sets as in Example 9.1. As both Excel and SPSS offer very elegant and time-saving options to complete the regression analysis, we will avoid manual calculations using equations (9.8) and (9.9).
Excel Solution
We will use built-in Excel functions for the intercept and slope to calculate the fitted line, as per equations (9.8) and (9.9), that will go through the data points on a scatter diagram. Figure 9.32 represents the Excel solution to fitting a line to the Example 9.5 data set (note that rows 10-40 are hidden).


Figure 9.32 Calculating the slope and intercept in Excel
From Excel: b0 = -50324.72 and b1 = 947.42. This means that the equation of the sample regression line is ŷ = -50324.72 + 947.42x, or if we rearrange the parameters, ŷ = 947.42x – 50324.72. The above equation tells us that the number of UK visits abroad can be predicted from the percentage of employed in the UK for the ages of 16 and over as:
Number of visits abroad = (947.42 × Percentage employed) – 50324.72
If x = 60.9 (percentage of employed), then the number of visits abroad according to the model is 7373 (we simply "plug in" the numbers: 947.42 × 60.9 - 50324.72 = 7373). This is the number of visits abroad that the model returns, given this percentage of employed. However, from Figure 9.32 we can see that in February 2018 we had 60.9% employed, and the number of visits abroad was actually 7250, and not 7373. Obviously, this model, as we already know, creates some errors, but we will learn how to deal with this later.
We can also show a relationship between the correlation coefficient r and the slope of the linear regression b. They are expressed as b = r (Sy/Sx), which means that r = b (Sx/Sy), where Sx and Sy are the standard deviations for x and y respectively. Try these equations and you will see that they work.
For every value of x (percentage of employed age 16 and over) we can now, using the model, estimate a value of the people travelling abroad. If we plotted these estimated values, they would represent a line of regression (sometimes called a trend line). The calculated regression line (column E from Figure 9.32) has been fitted as a dotted line to the scatter plot as shown in Figure 9.33.
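Before moving on to Figure 9.33, here is a quick way to try the r–b relationship just quoted. The sketch below is Python with NumPy (an illustration only, not the Excel route used in this chapter); it reuses the six-student data from Table 9.4 as a stand-in for any paired data set, such as the employment and visits series of Example 9.5.

```
import numpy as np

x = np.array([20, 80, 170, 130, 210, 110], dtype=float)
y = np.array([45, 35, 82, 70, 77, 65], dtype=float)

b1, b0 = np.polyfit(x, y, 1)      # equivalent to Excel =SLOPE() and =INTERCEPT()
r = np.corrcoef(x, y)[0, 1]       # equivalent to Excel =CORREL()
sx, sy = np.std(x, ddof=1), np.std(y, ddof=1)

print(b1, r * sy / sx)            # the two values agree: b = r * (Sy / Sx)
print(r, b1 * sx / sy)            # and r = b * (Sx / Sy)
```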

Figure 9.33 A regression line fitted through the data on the scatter diagram
From Figure 9.33 you can see that most of the data points do not reside on the fitted line, which is to be expected. Nevertheless, the line seems to approximate the direction that these two variables are taking. Since not all the points lie on the fitted line, we call this a model error (sometimes called residuals or variations) between the data y value and the value of the line at each data point ŷ (which is an estimate). This concept of error is used to establish different types of statistics indicators, including:
• coefficient of determination (COD, or r2, or R-squared)
• standard error of estimate (SEE)
• a range of inference indicators used to assess the suitability of the regression model
These indicators serve different purposes in completing the analysis, and we will return to them later. The simplest method to calculate the regression line in Excel is just to right-click on one of the data points in the graph and select the Add Trendline option from the box (see Figure 9.34).


Figure 9.34 Using Excel to fit the line automatically
Figure 9.35 illustrates the Format Trendline menu. Select the following options:
• Trend/Regression Type – Linear.
• Display Equation on chart.
• Display R-squared on chart.

Click Close.

Figure 9.35 Excel Format Trendline menu


With this simple operation, we now get not just the regression line, but also the equation, which is identical to the one we calculated using Excel functions =SLOPE() and =INTERCEPT(), as well as the R-squared value.

Figure 9.36 The result of the Add Trendline option in Excel We already know that the value of R2 of 0.65 implies that this regression line captures 65% of the variations in visits abroad related to the variations of the percentage of employed of 16 and over in the UK. In other words, 35% of the variations in visits abroad are not explained by this model but are dependent on some other factors not captured by this model. As 65% is a reasonably high number, we can be satisfied that our model represents the reality well. As we already know, the observed values of y and those estimated by the regression line (ŷ) will not be identical. In other words, the model will not be perfectly accurate and it will have some errors. These errors, or the differences, in the context of regression analysis are also called the residuals, and are defined by equation (9.10). Residual = 𝑦 − 𝑦̂

(9.10)

Figure 9.37 below shows the value of these errors, or residuals, calculated at every point (rows 10 – 40 hidden).


Figure 9.37 Example 9.5 Excel solution
These errors, or residuals, as they are called in the context of linear regression analysis, are a very important part of regression analysis. Before we learn how to handle residuals in regression analysis, we need to be familiar with just a few additional concepts.

Sum of squares defined Regression analysis involves identifying three important measures of variation: regression sum of squares (SSR), error sum of squares (SSE), and total sum of squares (SST). Figure 9.38 illustrates the relationship between these different measures. For the sake of clarity, we are showing only one point, yi. Note that this principle applies when we sum up the squared differences of all the points, hence the phrase sum of squares.

Figure 9.38 Understanding the relationship between SST, SSR, and SSE


As in our illustration at the beginning of this chapter, we can see the intersection of 𝑥̅ and 𝑦̅, and the regression line passing through this intersection. We can see that the data point yi (blue dot) is somewhat above the regression line. The distance between this data point and the regression line will be used to calculate the SSE (Sum of Squares of Error). In fact, we use the word Sum, because we will square and sum up all the distances between all the points and the regression line, though in this example only one single point is shown. On the other hand, the distance between the regression line and the value of 𝑦̅, when squared and summed up for all the points, will be called SSR (Sum of Squares for Regression). If we add these two sums of squares (SSE and SSR), we get SST (Sum of Squares in Total), which measures the difference between every data point and their mean value 𝑦̅. Again, if we used algebra, these expressions can be expressed in a more elegant way. Regression sum of squares (SSR) – sometimes called explained variations is defined as: SSR = ∑𝑛𝑖=1(𝑦̂𝑖 − 𝑦̅)2

(9.11)

Regression error sum of squares (SSE) – sometimes called unexplained variations is defined as: SSE = ∑𝑛𝑖=1(𝑦𝑖 − 𝑦̂𝑖 )2

(9.12)

Regression total sum of squares (SST) – sometimes called the total variation is defined as: SST = ∑𝑛𝑖=1(𝑦𝑖 − 𝑦̅)2

(9.13)

The above equations include the following symbols: yi = actual data or observation 𝑦̅ = the mean value of actual data set ŷi = predicted data using the regression model The total sum of squares is equal to the regression sum of squares plus the error sum of squares: SST = SSR + SSE

(9.14)

What we observe here is that SSR, or explained variations, measure deviations of the predicted values from the overall data mean. SSE, or unexplained variations, measure deviations of the actual from the predicted values. And lastly, SST, or total variations measure deviations of the actual data values from their mean. However, remember that the word “deviations” that we are using here is in fact the sum of the squared value of all these deviations.
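These sums of squares are straightforward to verify once the fitted values are available. The sketch below is again Python with NumPy (our own illustration, not part of the Excel solution), using the six-student data from Table 9.4; it confirms equation (9.14) and previews the R-squared ratio discussed in the next sections.

```
import numpy as np

x = np.array([20, 80, 170, 130, 210, 110], dtype=float)
y = np.array([45, 35, 82, 70, 77, 65], dtype=float)

b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x                   # predicted values from the regression line

sst = np.sum((y - y.mean())**2)       # total variation, equation (9.13)
ssr = np.sum((y_hat - y.mean())**2)   # explained variation, equation (9.11)
sse = np.sum((y - y_hat)**2)          # unexplained variation, equation (9.12)

print(sst, ssr + sse)                 # SST = SSR + SSE, equation (9.14)
print(ssr / sst)                      # coefficient of determination, R-squared
```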


Regression assumptions
Now, we have completed our calculations and produced the equation that defines the regression line for our two variables. How do we know that this equation is appropriate? Well, for a regression equation to truly represent the relationship between the variables, it must satisfy certain assumptions, and there are four of them. The four assumptions of regression are: (1) linearity, (2) independence of errors, (3) normality of errors, and (4) constant variance.
1. Linearity
Linearity assumes that the relationship between the two variables is linear. One of the methods to assess linearity is by plotting the residuals (or errors) against the independent variable, x. In Excel, if you go to Data > Data Analysis > Regression, you will automatically get this plot if requested. From Figure 9.39 we cannot see any apparent pattern between the residuals and x. Also, the residuals are evenly spread out around the zero line (if a dot falls on this line, this means that the error is zero).

Figure 9.39 Example 9.5 residuals versus x
In this example, because the errors are randomly spread, the conclusion is that a line fit to the dataset would appear appropriate. If from the scatter plot we conclude that the relationship is non-linear (i.e. there is some pattern), then a non-linear model should be fitted to the data set.
2. Independence of errors
The regression independence of errors assumption implies that the current error values are not dependent on the previous error values. To measure this effect, we use the Durbin-Watson statistic and the effect is called serial correlation. Some textbooks will use the expression autocorrelation. Serial correlation and autocorrelation are synonyms, though they are usually used in a different context. Both autocorrelation and the Durbin-Watson statistic are covered online.


3. Normality of errors
The assumption of normality of errors implies that the measured errors (or residuals) have to be normally distributed for each value of the independent variable, x. The violation of this assumption can produce unrealistic estimates for the regression coefficients b0, b1, as well as the measures of correlation. Also, calculations of the confidence intervals assume that the errors are normally distributed. The normality of errors assumption can be evaluated using two graphical methods:
(i) Construct a histogram for the errors against x and check whether the shape looks normal, or
(ii) Create a normal probability plot of the residuals (available from the Excel Data > Data Analysis > Regression).

Figure 9.40 illustrates a normal probability plot based upon the Example 9.1 data set.

Figure 9.40 Example 9.5 normal probability plot of the residuals
For a plot to confirm the normality of errors, it must follow an approximately straight line, i.e. it cannot be non-linear or "jump" too much up and down. From Figure 9.40 we observe that the relationship is linear, and we conclude that the normality assumption is not violated.
4. Constant variance
The equal variance assumption (or homoscedasticity) assumes that the variance of the errors is constant for all values of x. This requirement implies that the variability of the y values is the same for all values of x. In order to make correct inferences about β0 and β1, we must adhere to this assumption. In Figure 9.41 we observe that the errors are not growing as the value of x changes. This plot shows that the variance assumption is not violated. If the value of the error changes greatly as the value of x changes, then we would assume that the variance assumption is violated.

Figure 9.41 Example 9.5 residuals If there are violations of this assumption, then we can use some form of data transformations to attempt to improve model accuracy (beyond the scope of this book). What happens if any of the four assumptions are violated? In this case the conclusion is that linear regression is not the best method for fitting to the data set. We would in this case need to find an alternative method or model.
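For readers who want to produce equivalent diagnostic plots outside Excel's Regression tool, one possible approach is sketched below in Python. It assumes NumPy, Matplotlib and SciPy are available, and it reuses the small six-student data set purely as a placeholder; for Example 9.5 the x and y arrays would be the employment and visits columns. The independence-of-errors check (Durbin-Watson) is not shown here, as it is covered online.

```
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Placeholder data: residuals are y - y_hat from the fitted model
x = np.array([20, 80, 170, 130, 210, 110], dtype=float)
y = np.array([45, 35, 82, 70, 77, 65], dtype=float)
b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)

# Assumptions 1 and 4: residuals vs x - look for no pattern and a constant spread
plt.scatter(x, residuals)
plt.axhline(0, color="grey")
plt.xlabel("x"); plt.ylabel("residual")
plt.show()

# Assumption 3: normal probability plot - points should lie close to a straight line
stats.probplot(residuals, plot=plt)
plt.show()
```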

Test how well the model fits the data (Goodness-of-fit)
Of the several methods used to assess how good the regression line (or the model) is, we shall discuss:
(a) Coefficient of Determination (COD), or R-squared.
(b) Residuals and the Standard Error of the Estimate (SEE).
These two statistics measure two completely different properties, but both fall under the goodness of fit measures.
Coefficient of Determination or R-squared
We already introduced the coefficient of determination in relation to the coefficient of correlation. Here we explain the role of this statistic in the context of regression analysis. The regression line effectively summarises the relationship between x and y. However, the line will only partially explain the variability of the observed values. We saw that when we examined the residuals. As we already explained, the total variability of y can be split into two components:
(i) variability explained or accounted for by the regression line.
(ii) unexplained variability as indicated by the residuals.
The COD, or R2, is defined as the proportion of the total variation in y that is explained by the variation in the independent variable x. This definition is represented by equations (9.15) and (9.16) below:

R² = Regression sum of squares / Total sum of squares = SSR / SST = Explained variations / Total variations    (9.15)

R² = Σ(ŷi − ȳ)² / Σ(yi − ȳ)²    (9.16)

Equation (9.16) is a more efficient and symbolic representation of (9.15). We can also show that the coefficient of determination (COD) can also be given by equation (9.17):

COD = (Correlation Coefficient)² = R² (or r²)    (9.17)

Equation (9.17) shows that if we take the square root of the COD, we get the correlation coefficient (which means that COD is r squared). This explains why the coefficient of determination is also called the R-squared. Note that the coefficient of determination equation (9.15) can be re-written in terms of SSE and SST by making use of the relationship SSR = SST − SSE.

r² = SSR / SST = (SST − SSE) / SST = 1 − SSE / SST    (9.18)

To summarise, the coefficient of determination (COD), or R-squared value as it is often called, indicates how well the model fits the data, or to put it differently, how much of the variability in the dependent variable is explained by the variability of the independent variable.
Standard Error of the Estimate (SEE)
Residuals play an important role in establishing if we selected the correct model to fit the dataset. One of the first steps is to calculate the residuals, which, as we now know, are the difference between the actual values (y) and the values predicted by the model (ŷ). The next step is to calculate the standard deviation for these residuals. This measure, or statistic, is known as the standard error of the estimate (SEE):

SEE = √( Σ(y − ŷ)² / (n − 2) )    (9.19)

We can see that the numerator in equation (9.19) is the Regression Error Sum of Squares shown in equation (9.12), also called unexplained variations. We can, therefore, also re-write equation (9.19) into equation (9.20).

SEE = √( SSE / (n − 2) )    (9.20)

Standard Error of the Estimate (SEE) provides a measure of the scatter of observed values y around the corresponding estimated values ŷ on the regression line. SEE is measured in the same units as y. In other words, if y are inches or gallons, SEE will be expressed in the same units. SEE is effectively the standard deviation of the actual values from the predicted values. We remember that +/- 1 standard deviation covers 68% of the population/sample. This means that we can be 95% confident that the true value of y will be in the interval ŷ ± 1.96SEE. To be 99% certain, we would need to take 2.58SEE, i.e. ŷ ± 2.58SEE, etc. The Excel function =STEYX() enables the calculation of the standard error of the estimate (SEE) as illustrated in Figure 9.42.
Remember: R-squared (or COD) measures if the model selected is a good fit for the actual dataset. SEE measures how well the actual data points are estimated by the regression model. The higher the value of R-squared, the more confident we are that we have selected the correct model. The smaller the SEE, the more precise our model is, and the estimated values are a better fit for the actual data. Online chapters provide more detailed explanations and illustrations of how to interpret these two statistics in regression analysis.
Example 9.6
Re-consider the Example 9.1 data set and test the linear regression model reliability. Figure 9.42 illustrates the Excel solution to calculate the coefficient of determination and the standard error of the estimate (again rows 10-40 are hidden in this Figure).

Figure 9.42 Example 9.6 Excel solution
By plotting the regression line onto the scattergram, as shown in Figure 9.36, we know that many of the observed data points do not lie on the line. These differences between the actual points and the fitted points, or residuals, are calculated in column G in Figure 9.42. It is always advisable to plot the residuals. Plotting the residuals provides information about possible modifications of, or areas of caution in, applying the regression line. In plotting the residuals, we would look for a random, equal scatter about the zero-residual line. This would indicate that the derived line was relatively free from error. Figure 9.43 shows the errors for our example.

Figure 9.43 Example 9.6 errors plot
In Excel, as in many other software packages, the coefficient of determination (COD) is labelled R-squared. The value is calculated in cell D60 in Figure 9.42 using the Excel function =RSQ(). This is just one method to calculate the coefficient of determination. As we already said, r2 = 0.65 implies that this regression model explains 65% of all the variations in travel abroad through the variations in the employment level. The remainder of the variations are subject to some other influences not embedded in this model. This is a reasonable number (65%), so we will accept this linear model as a good fit for this relationship. Just to repeat, from the value of COD we can find Pearson's correlation coefficient r = √COD = √0.65 = 0.81.
In cell D51 in Figure 9.42 we have the value of SEE of 260.56. As the visits abroad are expressed in thousands (for example, June 2018 shows 7,690,000 visits), this means that the standard error of the estimate is also in thousands of visits (260,560 visits). We will show below how to use SEE to draw further conclusions about our regression model.
SPSS Solution
SPSS can be used to undertake the calculations described in this chapter based upon the data in Example 9.1. In this example, we are fitting a straight-line relationship between the number of visits (y) and the percentage employed (x).
Input data into SPSS
SPSS data file: Chapter 9 Example 6 Linear regression analysis.sav


Figure 9.44 SPSS data (Only the first 6 rows of data are shown) Select Analyze > Regression > Linear

Figure 9.45 SPSS linear regression menu selection Transfer UK visits to the Dependent box Transfer UK Employment to the Independent(s) box


Figure 9.46 SPSS linear regression menu Note that the Method option selected is Enter Click on Statistics Tick the boxes as shown

Figure 9.47 SPSS linear regression statistics options Click Continue Click OK SPSS Output – this will be saved to an SPSS output file Given that we have selected several option boxes in the Statistics menu option, the output from SPSS is comprehensive. Page | 552

Figure 9.48 SPSS solution


Figure 9.49 SPSS solution continued

Figure 9.50 SPSS solution continued
If you compare Figure 9.49 with Figure 9.42 (Excel output), you will see that R-Square is 0.65 and that SEE is 260.56, which matches the Excel results.

Prediction interval for an estimate of Y Two variables, one independent and one dependent, can be modelled using a linear regression equation ŷ = b0 + b1 x . In this case ŷ are the estimates of y for the given value of x. For example, we may want to know what the number of UK visits abroad (y) would be if the UK age 16 and over employed value was set at 60% (x). The prediction interval for y, at a value of x, is given by equation (9.21). (9.21)

𝑦̂ − 𝑒 < 𝑦 < 𝑦̂ + 𝑒

This implies that, within a certain probability, we are confident that the true value of y is somewhere in the interval between ŷ ± e. The error term e is calculated using equation (9.22), where xp is the value of x for which the error is calculated.

e = tcri × SEE × √( 1 + 1/n + n(xp − x̄)² / (n(Σx²) − (Σx)²) )    (9.22)

We can see in equation (9.22) that to calculate the error values for the prediction interval, it is not enough to just multiply the value of SEE by the t-value (or z-value). We need another, more complicated expression, given under the square root. The value of e calculated in this way, combined with the value of ŷ that comes from the linear regression, will provide the prediction interval.
The prediction interval is essentially a confidence interval for linear regression models. Just like the confidence interval for the population mean states that you are confident that the true mean is somewhere in the interval of values x̄ ± SE (SE = standard error of the mean), the prediction interval provides a confidence level that the true possible data value y is somewhere in the interval of values ŷ ± SEE (SEE = standard error of the estimate).
Example 9.7
Fit a prediction interval at x = 60 (i.e. if employment was 60%) to the Example 9.1 data set. Figures 9.51 and 9.52 illustrate the Excel solution to calculate the prediction interval (rows 21 – 42 hidden).

Figure 9.51 Example 9.7 Excel solution


Figure 9.52 Example 9.7 Excel solution continued
From Excel: xp = 60, n = 41, significance level = 0.05, tcri = ± 2.012, SEE = 260.56, x̄ = 60.8, Σx = 2493.7, and Σx² = 151677.3. Substituting these values into equation (9.22) gives:

e = 2.012 × 260.56 × √( 1 + 1/41 + 41(60.0 − 60.8)² / (41 × 151677.3 − (2493.7)²) ) = 564.23
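The same arithmetic can be reproduced in a few lines of Python (a sketch only; NumPy is assumed, and the function name is ours). Note that the denominator nΣx² − (Σx)² is a small difference between two very large numbers, so with the rounded sums quoted above the printed value of e will be close to, but not exactly, the 564.23 obtained in Excel from the unrounded sums.

```
import numpy as np

def prediction_error(t_cri, see, n, xp, x_bar, sum_x, sum_x2):
    """Error term e from equation (9.22)."""
    return t_cri * see * np.sqrt(1 + 1/n + n * (xp - x_bar)**2 /
                                 (n * sum_x2 - sum_x**2))

# Rounded summary values quoted above for Example 9.7
e = prediction_error(t_cri=2.012, see=260.56, n=41, xp=60,
                     x_bar=60.8, sum_x=2493.7, sum_x2=151677.3)
y_hat = -50324.72 + 947.42 * 60       # point estimate from the fitted line
print(y_hat - e, y_hat + e)           # the 95% prediction interval around y-hat
```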

The value of ŷ is calculated using equation (9.6), which is: ŷ = b0 + b1x = -50324.72 + 947.42 × 60 ≈ 6520.8 (Excel, using the unrounded coefficients from Figure 9.32, returns 6520.77). In the above equation for e, note that 2.012 is the t-value for a 95% confidence interval.
The 95% prediction interval for x = 60 is between 5956.5 and 7085.0 (calculated as: 6520.8 ± 564.23). If the percentage of 16 and over employed in the UK was 60%, we can predict/estimate that the number of UK visits abroad would in this case reach about 6520.8. In fact, we can state that we are 95% confident that this number of visits abroad would be somewhere between 5956.5 and 7085.0.
SPSS solution
The prediction interval can be calculated using SPSS as follows. Let us say we want to create a prediction interval for when the employment level is 60% (x = 60). To do a prediction, simply enter the value of the predictor variable at the last row of the data sheet under the predictor variable and go through the model building.

Figure 9.53 Enter predictor value in the SPSS data file (Xp = 60) Select Analysis > Regression > Linear Transfer UK visits to the Dependent box Transfer UK Employment to the Independent(s) box Method: Enter

Figure 9.54 SPSS Linear Regression menu Click on Save


Figure 9.55 SPSS Linear Regression Save menu
Now in the box labelled Prediction Values, click on Unstandardized. This will give the predicted Y-values from the model. The data window will have a column labelled pre_1. For the prediction intervals, in the boxes near the bottom labelled Prediction Intervals, put check marks in front of Mean and Individual. In the data window, there will now be columns labelled LMCI_1, UMCI_1, LICI_1, and UICI_1. LMCI and UMCI stand for Lower Mean Confidence Interval and Upper Mean Confidence Interval, respectively. LICI and UICI stand for Lower Individual Confidence Interval and Upper Individual Confidence Interval, respectively. The values for LICI and UICI are 5956.546 and 7085.003 respectively, which is the 95% confidence interval for the UK visits abroad providing that the Employment level is 60, as specified. Click Continue


Figure 9.56 SPSS Linear Regression menu Click OK SPSS output The calculated values are stored in the SPSS data file as illustrated in Figure 9.57.

Figure 9.57 95% Confidence and predictor intervals
From the SPSS data file, we have:
1. The predicted number of visits given x = 60 is given by PRE_1 = 6520.77.
2. The 95% confidence interval for the individual response when x = 60 is given by LICI_1 and UICI_1 (5956.54, 7085.00).
3. We also get the 95% confidence interval for the mean response when x = 60. This is given by LMCI_1 and UMCI_1 (6319.31, 6722.23).
We can see that the SPSS solutions agree with Excel.
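The difference between the individual interval (LICI/UICI) and the mean interval (LMCI/UMCI) comes down to the leading 1 under the square root in equation (9.22): the interval for the mean response uses the same expression without it. That is a standard result rather than something derived in this chapter, but the sketch below (Python with NumPy; the function and argument names are ours, and the rounded summary values mean the printed intervals will only approximate the SPSS figures) shows the idea.

```
import numpy as np

def half_width(t_cri, see, n, xp, x_bar, sum_x, sum_x2, individual=True):
    # Equation (9.22); dropping the leading 1 gives the interval for the mean response
    core = 1/n + n * (xp - x_bar)**2 / (n * sum_x2 - sum_x**2)
    return t_cri * see * np.sqrt((1 if individual else 0) + core)

args = dict(t_cri=2.012, see=260.56, n=41, xp=60,
            x_bar=60.8, sum_x=2493.7, sum_x2=151677.3)
y_hat = 6520.77                                 # PRE_1 in Figure 9.57

e_ind = half_width(individual=True, **args)     # compare with LICI_1 / UICI_1
e_mean = half_width(individual=False, **args)   # compare with LMCI_1 / UMCI_1
print(y_hat - e_ind, y_hat + e_ind)
print(y_hat - e_mean, y_hat + e_mean)
```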

Excel data analysis regression solution
If we wanted to avoid doing most of the calculations from the previous example, we can use the Excel ToolPak Regression. This package provides a complete set of solutions, including:
• Calculate the equation of the line
• Calculate measures of goodness of fit
• Check that the predictor is a significant contributor (t and F tests)
• Calculate a confidence interval for b0 and b1.

Example 9.8
Re-consider the Example 9.1 data set and use the Excel Data Analysis tool to fit the linear regression model and calculate the required reliability and significance test statistics.
Excel Solution
Select Data > Data Analysis > Select Regression
• Y Range: D5:D45
• X Range: C5:C45
• Confidence Interval: 95%
• Output Range: G3
• Click on residuals, residual plots, and normal probability plot

Figure 9.58 Example 9.8 Excel data analysis regression menu Click OK Excel will now calculate and output the required regression statistics and charts as Illustrated in Figure 9.59.


Figure 9.59 Excel data analysis regression solution
The printout in Figure 9.59 might look somewhat puzzling, so we will explain all the cells from this printout and connect some of them with the terms from the section on sums of squares.
Cell H6 = Multiple R (can also be obtained using the =RSQ(), =CORREL() or =PEARSON() function)
Cell H7 = R-Square (can be obtained as =(H6)^2)
Cell H8 = A refined version of R², adjusted R-Square for the sample size and the number of independent variables (not described in this textbook)
Cell H9 = Standard Error of Estimate (SEE) (can also be obtained using the =STEYX() function)
Cell H10 = Number of observations n
Cell H14 = dfA (the number of degrees of freedom for regression, v1)
Cell H15 = dfB (the number of degrees of freedom for residuals, v2 = n − m − 1, where m = number of independent variables, i.e. 1 in this case)
Cell H16 = dfT (total of H14 and H15)
Cell I14 = SSR (Explained variations)
Cell I15 = SSE (Unexplained variations)
Cell I16 = SST (= SSR + SSE)
Cell J14 = MSR (this is the result of I14/H14, i.e. = SSR/v1)
Cell J15 = MSE (this is the result of I15/H15, i.e. = SSE/v2). If you take the square root of this value, you get the standard error of the estimate, as per cell H9.
Cell K14 = F-statistic (this is the result of J14/J15, i.e. = MSR/MSE)
Cell H19 = b0
Cell H20 = b1
Cell I19 = sb0 (Standard Error for b0)
Cell I20 = sb1 (Standard Error for b1)
Cell J19 = t-stat, or t-calc, for b0 (this is the result of H19/I19)
Cell J20 = t-stat, or t-calc, for b1 (this is the result of H20/I20)
Cell K19 = p-value for the intercept
Cell K20 = p-value for the slope


From Figure 9.59 we can identify the required regression statistics as illustrated in Table 9.5:

Calculation | Regression statistic | Excel cell
Fit model to sample data | b0 = -50324.72, b1 = 947.4 | Cells H19, H20
Test model reliability using the coefficient of determination | COD = 0.65, SEE = 260.5 | Cells H7, H9
Test whether the predictor variables are significant contributors – t-test: H0: β0 = 0 vs. H1: β0 ≠ 0 | t = -7.48, p = 4.727E-9 | Cells J19 and K19
H0: β1 = 0 vs. H1: β1 ≠ 0 | t = 8.57, p = 1.690E-10 | Cells J20 and K20
Calculate the test statistics and p-values using Excel – F-test: H0: β1 = 0 vs. H1: β1 ≠ 0 | F = 73.38, p = 1.690E-10 | Cells K14 and L14
Confidence interval for β0 and β1: 95% CI for β0 | –63931.4 to –36719.0 | Cells L19 and M19
95% CI for β1 | 723.71 to 1171.13 | Cells L20 and M20
Table 9.5 Linear regression test statistics
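The internal consistency of these figures can be checked independently of Excel. The short sketch below uses Python with SciPy (an illustration, not part of the Excel output) to reproduce the t = √F relationship and the p-values from the reported F statistic and its degrees of freedom (1 and 39, since n = 41 and there is one predictor).

```
import numpy as np
from scipy import stats

F = 73.38            # F statistic from cell K14
df1, df2 = 1, 39     # one predictor, n - 2 = 39 residual degrees of freedom

t_slope = np.sqrt(F)                      # for simple regression, t = sqrt(F)
p_from_F = stats.f.sf(F, df1, df2)        # p-value of the F test
p_from_t = 2 * stats.t.sf(t_slope, df2)   # two-tailed p-value of the t test

print(t_slope)                 # about 8.57
print(p_from_F, p_from_t)      # both about 1.7E-10, matching cells L14 and K20
```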

Regression and p-value explained
Cells K19 and K20 in Figure 9.59 contain the so-called p-value. What is the p-value? The p-value measures the chance (or probability) of achieving a test statistic equal to or more extreme than the sample value obtained, assuming H0 is true. We are already familiar with the p-value from the previous chapters on hypothesis testing. As we already know, in order to make a decision when testing the hypothesis, we compare the calculated p-value with the level of significance (say 0.05 or 5%). If p < 0.05, then we reject H0. In the case of linear regression, the same principle applies. If p < 0.05 (assuming we used a 5% significance level), then our model is valid. As before, H0 implies that there is no connection between x and y (remember, we set H0 with the intention to reject it).
In practical terms, Excel applies the t-test to tell us if the predictor variable (in this case the percentage of employed, x) is a significant contributor to the value of y (UK visits abroad), given that the p-value in cell K20 in Figure 9.59 (= 1.690E-10) < 0.05. As 1.690E-10 < 0.05, we conclude that x is a significant contributor to y. Besides this, we can see in cell K19 in Figure 9.59 that the intercept (b0) is also a significant contributor to the value of y. The value of p shown in cell K19 in Figure 9.59 is p = 4.727E-09. This is much less than 0.05, hence the conclusion that the intercept is very important in this equation.
The significance of the F-test shown in cell L14 in Figure 9.59 confirms that the model is a significant contributor to the value of the dependent variable (p = 1.690E-10 < 0.05). This confirms the t-test solution and we conclude that there is a significant relationship between the percentage of employed and visits abroad. For simple models with just one independent variable, the relationship between t and F is given as: t = √F = √73.380 = 8.566. See cell J20 in Figure 9.59, which is the t value, and cell K14 in Figure 9.59, which is the F value.
The Regression Data Analysis also helps with checking some of the assumptions, namely linearity, constant variance, and normality, as illustrated in Figures 9.60 – 9.62.

Figure 9.60 Residual output (note that rows 36-56 are hidden for clarity of output)

Figure 9.61 Plot of residuals against x Figure 9.61 demonstrates that we have no observed pattern within the residual plot. We can, therefore, assume that the linearity assumption is not violated. A further conclusion is that both the residuals and the variance are not growing and are bounded between a high and low point. We use this conclusion to state that the variance assumption is also not violated.


Figure 9.62 Assumption check for normality The normal probability plot in Figure 9.62 illustrates that the relationship is linear. We conclude that the normality assumption is not violated. In summary, our model does not violate any of the linear regression analysis assumptions, and therefore, it is a good representation of the relationship between the level of employment in the UK and travel abroad from the UK.

Check your understanding
X9.6 State what the two coefficients, b0 and b1, in the regression equation represent and what they are estimating.
X9.7 What is the point about which the regression line pivots, and what conditions need to be satisfied for the best-fit regression line?
X9.8 What is the meaning of the word residual in regression analysis, and what other expressions are used for the same term?
X9.9 What other term is used to describe the total variations in regression analysis, and what kinds of variations constitute the total variations in this type of analysis?
X9.10 State the four assumptions that need to be satisfied for linear regression to be considered appropriate.

Chapter summary In this chapter, we have introduced the concepts of correlation, coefficient of determination, or R-squared value, and regression analysis. These concepts explain relationships between variables / data sets. As we progress through the remaining chapters, we will see that they are also important building blocks for other statistical techniques. The coefficient of correlation is based on the concept of covariance that measures how closely the two variables are associated. If the number is positive, it implies positive association (as one variable grows, so will the other). If the number is negative, it implies negative association (as one variable grows, the other will decline). However, the absolute value of the covariance has very little meaning, and it is impossible to compare the variances for data sets that are using values from a different range of numbers. To address this issue, we introduced the coefficient of correlation. Page | 564

The correlation coefficient standardizes the variances and, regardless of what range of values or units is used in the data set, it always returns the values between -1 and +1. The closer the coefficient of correlation to +1, the stronger the association between the variables (growth in one variable is accompanied by the growth in another). The opposite case is: the closer the coefficient of correlation to -1, the more opposite the association between the variables (growth in one variable is accompanied by the decline in another). The value of the coefficient of correlation that is zero, or close to zero, indicates that there is no meaningful linear association between the variables. However, there may be a non-linear association between the variables. We also emphasized that correlation and causation are not to be confused. The fact that two variables are highly correlated does not necessarily mean that they influence or cause movements between each other. Two specific correlation coefficients were introduced. The first one was Pearson’s correlation coefficient and the second one was Spearman’s rank correlation coefficient. Pearson’s coefficient of correlation measures linear relationship between two variables, whilst the Spearman’s coefficient measures the ranked values rather than raw data. As both coefficients of correlation apply to sample data, we also introduced (online) how to use the hypothesis testing to gather evidence as to whether the sample data results can be applied to the whole population. The squared value of the coefficient of correlation is called the coefficient of determination (r2, or R-squared), and we introduced this concept too. Unlike the correlation coefficient, the coefficient of determination can take values between 0 and 1. These numbers can be interpreted as percentages indicating what percentage of the variations in one variable is associated with the variations in another variable. The balance between 1 and the value of coefficient of determination indicates how much the variations in one variable are dependent on other, i.e. external factors, that are beyond the association between these two variables. Continuing with the premise that the two variables are in linear relationship, we introduced a simple linear regression analysis tool. We demonstrated how to fit an appropriate model to the data set using the least squares regression. We defined the meaning of different types of variations and what is the relevance of residuals, i.e. errors in regression analysis. This led us to explain the assumptions of linear regression, as well as how to test the reliability of the model using either the t-test or the F-test (online chapters). We also introduced the standard error of the estimate as well as how to use it to create a prediction interval.

Test your understanding TU9.1 Take the following two data sets: X1: 2, 4, 7, 5, 9, 13, 13, 15, 14, 18 and X2: 1, 3, 8, 4, 4, 9, 12, 22, 14, 15. Construct the scatter diagram. TU9.2 How would you deal with an outlier in TU9.1? TU9.3 Calculate the correlation coefficient for the TU9.1 example.


TU9.4 Show that the correlation coefficient from TU9.3 is significant and not just random.
TU9.5 Level 1 university students ranked their five top Lecturers as: John, Lucy, Steve, Mark and Alice. Level 2 students ranked the same lecturers as follows: Lucy, John, Steve, Alice and Mark. How closely are the Level 1 students and Level 2 students' perceptions about their top Lecturers aligned?
TU9.6 In the regression equation ŷ = b0 + b1x, the value of b0 is given by the equation:
A. b0 = (2ΣY − b1ΣX) / n
B. b0 = (ΣY − b1ΣX) / 2n
C. b0 = (ΣY − b1ΣX) / n
D. b0 = (ΣY − nΣX) / n

TU9.7 In the regression equation ŷ = b0 + b1x, the value of b1 is given by the equation:
A. b1 = (nΣXY² − ΣXΣY) / (nΣX − (ΣX)²)
B. b1 = (nΣXY − ΣXΣY) / (nΣX − (ΣX)²)
C. b1 = (nΣXY − ΣXΣY) / (nΣX² − (ΣX)²)
D. b1 = (nΣXY − ΣXΣY) / (nΣX² − (ΣX))

Use the ANOVA table 9.6 to answer exercise questions TU9.8 – TU9.11:

ANOVA      | df | SS       | MS       | F        | Significance F
Regression | 1  | 169759.3 | 169759.3 | 261.6392 | 9.76E-16
Residual   | 28 | 18167.14 | 648.8264 |          |
Total      | 29 | 187925.5 |          |          |
Table 9.6 ANOVA table

TU9.8 Calculate the coefficient of determination, COD:
A. 0.78    B. 0.9    C. 0.80    D. 1.80

TU9.9 Calculate the value of Pearson's correlation coefficient, r:
A. 0.99    B. 1.89    C. 0.95    D. 0.89

TU9.10 Calculate the value of the standard error of the estimate, SEE:
A. 155.20    B. 133.18    C. 35.47    D. 25.47

TU9.11 In 2019 Pet-Dog Ltd. ascertained the amount spent on advertising and the corresponding sales revenue by ten marketing clients.


Advertising (£000s), x | 5   | 13  | 11  | 16  | 2  | 19  | 22  | 8  | 11
Sales (£000s), y       | 104 | 173 | 121 | 156 | 50 | 182 | 199 | 76 | 95
Table 9.7 Spend on advertising versus sales

(a) Plot a scatter plot and comment on a possible relationship between sales and advertising.
(b) Use Excel regression functions to undertake the following tasks:
   i. Fit linear model,
   ii. Check model reliability (r and COD),
   iii. Undertake appropriate inference tests (t and F test),
   iv. Check model assumptions (residual and normality checks),
   v. Provide a 95% confidence interval for the predictor variable.

TU9.12 Over the last 14 days you register the morning temperature in degrees Celsius and count the number of students present in your Business Statistics 101 class. The two variables are showing the following values:

Morning temperature   | 16 | 18 | 18 | 16 | 18 | 24 | 22 | 18 | 16 | 17 | 20 | 24 | 20 | 18
Students in the class | 78 | 65 | 70 | 76 | 75 | 60 | 64 | 72 | 75 | 72 | 70 | 59 | 65 | 70
Table 9.8 Temperature versus attendance

Establish if there is any correlation between these two events and conduct the tests to determine if the outside temperature in general has an impact on the student numbers attending Business Statistics classes.
TU9.13 The cinemas are currently showing four new blockbuster movies. The movies are labelled as A to D. In London, the movies are ranked per number of viewers as: D, A, B, C. In Manchester the same movies are ranked per number of viewers as: D, C, B, A. Are the audience tastes in these two cities comparable, at least as far as these four movies are concerned?
TU9.14 Assignment and final examination marks for 13 undergraduate students in Business Statistics are given in Table 9.9. Fit an appropriate equation to this data set to create a model from which the final examination marks can be predicted given the assignment marks.


Assignment  | 72 | 40 | 48 | 37 | 100 | 80 | 100 | 88 | 60 | 45 | 70 | 48 | 46
Examination | 80 | 64 | 70 | 62 | 80  | 81 | 81  | 73 | 65 | 58 | 69 | 59 | 60
Table 9.9 Examination versus assignment mark

(a) Plot a scatter plot and comment on a possible relationship between the examination and assignment marks.
(b) Use Excel regression functions to undertake the following tasks:
   i. Fit linear model,
   ii. Check model reliability (r and COD),
   iii. Undertake appropriate inference tests (t and F test),
   iv. Check model assumptions (residual and normality checks),
   v. Provide a 95% confidence interval for the predictor variable.

Want to learn more? The web chapters contain additional sections to provide further information on the following topics: 1. A9Wa Testing the significance of linear correlation between the two variables. 2. A9Wb Testing the significance of Spearman rank correlation coefficient. 3. A9Wc The use of the t-test to test whether the predictor variable is a significant contributor. 4. A9Wd The use of the F-test to test whether the predictor variable is a significant contributor. 5. A9We Confidence interval estimate for the slope. 6. A9Wf Autocorrelation. 7. A9Wg Standard error for the autocorrelation function. 8. A9Wh Significance of the autocorrelation coefficients and evaluation. 9. A9Wi Partial autocorrelation coefficient. 10. A9WJ Error and residual inspection. 11. A9Wk Non-linear regression analysis. 12. A9Wl Multiple regression analysis. 13. A9Wm Linear regression and Durbin Watson test for autocorrelation. 14. A9Wn Regression goodness of fit.


Chapter 10 Introduction to time series data, long-term forecasts and seasonality 10.1 Introduction and chapter overview Time series are time-based variables. They are a series of data points listed in time order. The aim of this chapter is to provide the reader with a set of tools which can be used for time series analysis and extrapolation. This chapter will allow you to apply several time series methods for long-term forecasting and extrapolation. The methods we will cover are applicable to all types of temporal data. However, we are restricting our applications to one single time series. This is often called the univariate approach to time series analysis and forecasting. The methods covered are suitable for many applications in economics, business, finance, social sciences, and natural sciences. We will start the chapter by explaining what types of time series are likely to be found and what differentiates them. Next, we will explain how different types of models can be fitted to different types of time series. This will be followed by a brief overview of different types of error measurements that are used to establish the quality of our forecasts and the suitability of the model used. Further on, we will cover the prediction interval for time series analysis. We will modify the formula for calculating the standard error of the estimate to show that the prediction interval should grow wider the further in the future we extrapolate the time series. This will match the intuitive assumption that the further into the future we go, the uncertainty grows larger and larger. The last topic we will cover will be dedicated to seasonal time series. We will use classical decomposition method to extract different components from the time series and learn how to predict seasonality and cyclical movements in a variable. In many ways, this chapter is an extension of the Linear Regression chapter, but with one difference. While simple regression analysis involved two variables that may not be temporal, this chapter applies strictly to time series. In fact, we will still use two variables. One variable, measured in time, will be treated as a dependent variable. The other variable, the independent one and a predictor, will be just sequential units representing the time. The practical value of time series analysis is immense. The applicability of these methods is universal and covers almost any commercial or scientific discipline. No matter what you do, the chances are that you will have some data ordered in time. This could be sports results, inflation figures, mortality data, spending patterns, crime rate, etc. Once you have these figures ordered in a time series, you will invariably ask yourself: can I establish a trend here and see where this is going? If you can establish “where this is going”, you can effectively predict what this variable will be in x number of time units from now. Why is predicting something so important? Well, if we take a business example, you will soon realize why. If someone told you that next month the demand for your product is going to double, what would you do? You would probably double your production today, in order to meet the next month’s demand. This is precisely the purpose of forecasting. You can take actions today and be better prepared for the future. Forecasts provide Page | 569

insight into uncertainty that tomorrow brings. By learning how to forecast future events, you are effectively managing the uncertainty of the future. This means that you can take actions, make decisions, or plan better to be ready for this future. If you can gain glimpses into what tomorrow brings, you will be more confident in making decisions today. When we use the phrase “gain glimpses”, we do not mean that you will know exactly what will happen. We mean that the statistics will help you identify the most probable area, or range of numbers, likely to happen in the future. This is what time series analysis, or forecasting, is all about.

Learning objectives
On completing this unit, you should be able to:
1. Understand the terminology associated with time series analysis.
2. Be able to inspect and prepare data for forecasting.
3. Plot the data and visually identify patterns.
4. Fit an appropriate model to the data set using the time series approach.
5. Use the identified model to provide forecasts.
6. Calculate a measure of error for the model fit to the data set.
7. Learn how to calculate the prediction interval.
8. Learn how to handle seasonal time series and apply the decomposition method.
9. Solve problems using Microsoft Excel and SPSS.

10.2 Introduction to time series analysis
In this chapter, we apply the regression analysis principles to one single variable, measured in time. This implies that we are still using two variables, but one of them is the time. Previously we fitted a line equation that best describes the relationship between a dependent variable (y) and an independent (or predictor) variable (x). In this chapter we still use only one dependent variable (y), whilst the independent variable (x) is a sequence of numbers representing time. In linear regression, we used the model to estimate the value of y, given the knowledge of x. This means the objective was primarily to estimate y for the current range of x (though it could be used for extrapolating y for the future values of x). The forecasting objective is primarily to extrapolate the variable into the future, by relying on its history. Extrapolation implies that we can take the future values of x and estimate what the potential value of y might be.
So, how do we define a time series? A time series is a variable that is measured and recorded in equidistant units of time. A good example is inflation. We can record monthly inflation, quarterly inflation, or annual inflation. All three data sets represent a time series. In other words, it does not matter what units of time we use as long as we are consistent, and the time units are sequential. By consistent, we mean that we are not allowed to mix the units of time (daily with monthly data or minute with hourly data, for example). By sequential, we mean that we are not allowed to skip any data points and have empty values for any point in time. Should this happen, we need to somehow

estimate the missing value. The easiest way is to calculate the average of the two neighbouring values. Other, more sophisticated methods might be even more appropriate.
What is the purpose of time series analysis? Well, the main purpose of time series analysis methods is to identify the historical pattern that a time series exhibits through time, and to predict the future movements of a variable based on these historical patterns. In other words, forecasting the future values is the main concern. To assess if the correct forecasting method has been used, several other auxiliary methods have been developed. They all fall into the category of time series analysis. Nevertheless, forecasting remains the main purpose. To be fully equipped to deal with time series and forecasting, like with any other area, you need to understand the terminology of this area of statistics. This terminology is primarily related to the types of data that you will encounter, and to the types of methods that are available to be used. The following sections will provide a brief overview of the terminology used in time series analysis.
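Picking up the point above about missing observations: the simplest repair is to replace a gap with the average of its two neighbours, which is what the short Python sketch below does (NumPy assumed; the series is hypothetical, and the approach assumes the gap is not the first or last observation).

```
import numpy as np

# Hypothetical monthly series with one missing observation (np.nan)
y = np.array([102.0, 104.5, np.nan, 109.0, 111.2, 110.8])

gap = np.where(np.isnan(y))[0][0]        # position of the missing value
y[gap] = (y[gap - 1] + y[gap + 1]) / 2   # average of the two neighbours
print(y)
```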

Stationary and non-stationary data sets Example 10.1 consists of two time series data sets. The data sets come from the World Bank and they show the birth rates for India and Switzerland per 1,000 people from 2000-2018. If we just looked at the data in columns B and C, in Figure 10.1, we could conclude very little. However, if we plot these numbers, as in Figure 10.2, we immediately see the difference. As we will describe below, some time series oscillate around the fixed mean value (they are called stationary time series) and some around a continuously changing value (they are called nonstationary time series). The estimation principles and the statistics used to describe the horizontal time series vs. upward or downward moving time series are different. This implies that very often two different sets of methods are used when handling stationary vs. nonstationary time series. Stationary time series have constant mean, variance, etc., whilst nonstationary time series do not. Because of that, nonstationary time series violate a number of assumptions that are used for modelling stationary time series, and therefore, we have a separate set of models dedicated to nonstationary time series. Example 10.1


Figure 10.1 Birth rates for India and Switzerland between 2000-2018

Figure 10.2 A graph for birth rates for India and Switzerland between 2000-2018 The data for Switzerland shows a line that seems to be moving horizontally, oscillating around some central value. Meanwhile, the data for India are undoubtedly moving downwards, which implies that data do not oscillate around some central value, but around some moving value. The time series representing birth rates for Switzerland, following a horizontal line, is called stationary, whilst the time series for India is called a non-stationary time series. Every time series must fall in one of these two categories. Why is this important? Most of the time you cannot use the same method to successfully


handle a stationary and a non-stationary data set. A variety of methods have been invented to handle either the stationary or non-stationary time series. Another point that can be inferred from the opening of this section is that we can ‘see’ very little by just looking at the data values. This implies that charting the data is not optional. It is one of the pre-requisites in time series analysis.

Figure 10.3 Birth rates in Switzerland 2000-2018 Before we proceed, let us look at Figure 10.3. We charted the values for the birth rates in Switzerland between 2000 and 2018. The axis of the chart on the left specifies the years, but we could have easily shown just sequential numbers, which is what we did with the chart on the right. As the title indicates (or sometimes the chart legend) that the starting year is 2000, the number 1 on the x axis would imply that this is 2000, the number 2 that this is 2001, etc. until we come to 2018, which is sequential number 19. This also implies that when dealing with time series, the variable that we are charting is typically the dependent variable and the independent variable is simply time. The time is in this case defined by the context, but we can use the expression ‘time period’ and mark every observation with the sequential numbers starting from one onwards. This column will in fact become a variable, as we will see in the pages to follow.

Seasonal time series So, every time series is either a stationary or non-stationary, and furthermore, every time series can also be either seasonal or non-seasonal. Intuitively we understand that the word seasonal means a time series that shows some repeated pattern over the units of time less than a year (monthly, quarterly, etc.). If the pattern repeats over longer time intervals, like several years, the time series is called cyclical. However, as both types show repeated pattern regardless of the time units, for forecasting purposes they are treated in a similar way. A variety of methods exist to treat seasonal time series. Example 10.2 Here is an example of one seasonal and stationary time series (Figure 10.4) and one seasonal and non-stationary time series (Figure 10.5).


Figure 10.4 Seasonal stationary time series

Figure 10.5 Seasonal non-stationary time series
Both seasonal and cyclical patterns repeat themselves after some fixed number of time units (days, weeks, months, or quarters for the seasonal data and years for the cyclical data). This is also sometimes called periodicity. Remember that seasonal data sets represent a special set of time series, and we will learn that there are methods dedicated exclusively to seasonal time series.
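One quick way to see the periodicity of a seasonal series, before any formal modelling (the decomposition method comes later in this chapter), is to average the observations that share the same position within the season. The sketch below does this in Python with NumPy for a hypothetical quarterly series; the length-4 season is an assumption of the example, not something the code detects automatically.

```
import numpy as np

# Hypothetical quarterly series covering three years (periodicity = 4)
y = np.array([20, 35, 50, 30,
              22, 37, 53, 31,
              25, 40, 55, 33], dtype=float)
period = 4

seasonal_means = y.reshape(-1, period).mean(axis=0)   # average of Q1s, Q2s, Q3s, Q4s
print(seasonal_means)       # the repeating quarterly pattern around the overall level
```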

Check your understanding
X10.1 Chart the following time series and decide if it is stationary and/or seasonal.

x | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  | 10
y | 24 | 60 | 72 | 72 | 48 | 60 | 84 | 60 | 96 | 108
Table 10.1

X10.2 The time series in Table 10.2 is seasonal. What is the periodicity, or the number of periods over which you think this time series shows seasonality?

x | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8   | 9  | 10 | 11 | 12  | 13  | 14  | 15
y | 15 | 45 | 75 | 45 | 15 | 45 | 75 | 105 | 75 | 45 | 75 | 105 | 135 | 105 | 75
Table 10.2

X10.3 Is it possible to have a time series that is seasonal and non-stationary? If so, can you draw a graph showing how one such series could look?
X10.4 Go to one of the web sites that allow you to download financial time series (such as http://finance.yahoo.com/) and plot the series of your choice in several identical line graphs. Change the scale of the y-axis on every graph and make sure that they are radically different scales. What can you say about the appearance of every graph?
X10.5 Describe why you think that different forecasting methods are used for stationary as opposed to non-stationary time series.

10.3 Trend extrapolation as long-term forecasting method Long term forecasting is essentially the same as trend extrapolation. If we are going to produce long-term forecasts, this implies that we are not necessarily interested in every detail of the future time series. We are interested in the general direction and the speed (slope) at which the future values of the variable are going to happen. Long-term forecasts do not mean that we must go several years in the future. It means that we want to go several time periods in the future. What we are saying is that regardless of units of time (years, days, minutes), long term-forecasting implies that we are going to forecast the values of the variable for a considerable number of time periods. What is considerable number of time periods? Well, it depends how long is the time series. A general and empirical rule is that long-term forecasts should not exceed one third of the total number of observations (n/3) we have in our data set (we call this historical observations). As an example, if your time series consists of 36 observations (n), the maximum number of forecasts to be produced should not exceed 12 (n/3). In general, we need to exercise common sense and see how far in the future it makes sense to forecast. If we go too far, the confidence level will be so wide that the uncertainty will not be reduced. In other words, if everything is possible, why bother to forecast?

A trend component

A trend can be described as the general shape and direction in which something is moving. In the context of forecasting, a trend is the component that best describes the shape and direction of the time series. Graphically, this shape and direction can be approximated by any curve, though in this textbook we mainly deal with linear trends. In addition to the trend, other components can be present in a time series, but for now we will focus on the trend component only.

Let us say that, for all practical purposes, we are only interested in estimating the trend. After we have calculated the trend that models the time series, the difference between every actual value in the time series and every trend value is effectively a series of residuals (R). We can say that the time series Y, using this simplified model, consists of only two components, as defined by equation (10.1).

Y = T + R    (10.1)

However, we also need to make one further assumption, and that is that the residuals should randomly oscillate around the trend (we know this already from regression analysis in Chapter 9). In other words, if we can estimate the underlying trend of a time series, we will not worry about these random residuals fluctuating around the trend line (at least not in the context of long-term forecasts). Once we extrapolate this trend, the trend effectively becomes a forecast of the time series. We realise that this forecast will not be 100% accurate, i.e. every trend value will not be the same as the actual value, but that is OK because we are only interested in the shape and direction of the time series. In practice, extrapolating the trend is the same as producing the long-term forecasts.

Example 10.3

Our simplified example in Figure 10.6 shows that the trend we calculated is in fact the estimate (or forecast) Ŷ. If we add the residuals to the trend, we will get the values of the actual time series Y, just as per our equation (10.1).

Figure 10.6 The actual data Y, their trend values T and the residuals R

This simplified example makes the point that you can create reasonable long-term forecasts by focusing on the trend only. For short-term forecasts this approach would fall "short" of the expected precision, but we will tackle this problem in the following chapter. Let us stay with the trend for now. Another key point here is that before we extrapolate the trend, we always need to fit the trend to the existing values. Figure 10.6 does not show any extrapolated values, just trend values fitted (column C in Figure 10.6) to the historical values of the data set, i.e. the time series.
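If you would like to verify this small example outside Excel or SPSS, the following minimal Python sketch reproduces equation (10.1), using the actual and fitted values that reappear later as Table 10.5 (the same mini data set as Example 10.3). The variable names are ours and purely illustrative.

actual = [130, 120, 110, 135, 155, 160, 180]   # Y, the observed values (see Table 10.5)
trend  = [110, 120, 130, 140, 150, 160, 170]   # T, the fitted trend values
residuals = [y - t for y, t in zip(actual, trend)]   # R = Y - T, as in equation (10.1)
print(residuals)   # [20, 0, -20, -5, 5, 0, 10]

Adding the residuals back onto the trend recovers the original series exactly, which is all that equation (10.1) states.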

Fitting a trend to a time series

If a trend is the underlying pattern that indicates the general movement and direction of a time series, then this implies that a trend can be described by some regular curve. This usually means a smooth curve, such as a straight line, a parabola or a sinusoid, or any other well-defined curve. Rather than starting with manual calculations, on this occasion we will start with Excel. Excel is very well equipped to help us define the trend, fit it to the time series and extrapolate it into the future. The way Excel is used to achieve this is identical to the way we used it to demonstrate certain elements of regression analysis in the previous chapter.

Example 10.4

We will use a time series representing average annual UK petrol prices in pence per litre 1983-2020 (leaded 4-star up to 1988, unleaded thereafter). The time series consists of 38 observations, as illustrated in Table 10.3.

Year  Price per litre (p)    Year  Price per litre (p)
1983  36.7                   2002  69.9
1984  38.7                   2003  77.9
1985  42.8                   2004  77.9
1986  38.2                   2005  79.9
1987  37.8                   2006  88.9
1988  34.7                   2007  87.9
1989  38.4                   2008  103.9
1990  40.2                   2009  89.9
1991  39.5                   2010  111.9
1992  40.3                   2011  129.9
1993  45.9                   2012  134.1
1994  48.9                   2013  138.9
1995  50.9                   2014  130.9
1996  52.9                   2015  109.9
1997  57.9                   2016  103.9
1998  60.9                   2017  117.9
1999  61.9                   2018  115.9
2000  76.9                   2019  111.9
2001  77.9                   2020  111.9
Table 10.3 A time series with 38 observations

When charted as a line graph, the time series looks as illustrated in Figure 10.7.


Figure 10.7 A graph of the time series from Table 10.3

Excel Solution

Fitting a trend line to the time series is a very easy graphical process in Excel, as we already demonstrated in Chapter 9.

Figure 10.8 Fitting a trend to a time series in Excel

To fit a trend line to the time series, right-click on any data point in the Excel graph, as illustrated in Figure 10.8, and select Add Trendline. After selecting Add Trendline, choose the Linear option, as well as Display Equation on chart and Display R-squared on chart (see Figure 10.9). Click on Close.

Figure 10.9 Excel Trendline options box

The trend line is automatically added to the graph, as illustrated in Figure 10.10, with the line equation and the coefficient of determination (R2) included.

Figure 10.10 Graph of the time series from Table 10.3 and its trend line with the trend equation details

As we can see, we instantly obtained a straight line that describes the underlying movement and direction of our time series, i.e. the trend. This trend, when extrapolated, becomes a forecast.

Using a trend chart function to forecast time series

It is quite a simple task to use the fitted Excel trend line to forecast, for example, ten time periods into the future. Right-click on the trend line on the graph and choose Format Trendline, as illustrated in Figure 10.11.

Figure 10.11 Formatting the existing trendline in Excel

Figure 10.12 Trendline options box


After we click on Format Trendline, a dialogue box as in Figure 10.12 appears. We are already familiar with this dialogue box. Under the Forecast option, we can see the field called Forward. In the box next to it we enter the number of future periods, 10, by which the trend line should be extended, as seen in Figure 10.12. Figure 10.13 illustrates the modified time series chart with the trend line extended by 10 time periods to provide forecasts for time points 39 to 48.

Figure 10.13 Graph of the time series from Table 10.3 and its trend line with the trend equation details and 10-period forecasts

We can see that the actual time series is not a smooth straight line, but oscillates around the smooth straight line that we have estimated. By extrapolating our straight line, or linear trend, into the future, we are anticipating that our forecasts might not be completely accurate, but we believe that this trend represents well the direction and steepness of the movements of our variable. Excel does not just give us a picture of this trend line, but also the actual equation of this line. From Figure 10.13, we can see that this trend line moves in accordance with the equation y = 20.964 + 2.88x. The R-squared (or R2) value is 0.89. As we know, the closer R2 is to the value of 1, the better the fit of the trend to the time series. In our case R-squared is 0.89, which is very good. This confirms that our trend approximates, or fits, the historic data very well. Only 11% (1 - 0.89 = 0.11) of the data variations are not explained by this linear model. This is more than reasonable.

SPSS Solution

SPSS data file: Chapter 10 Example 4 Time series trend.sav

Enter the data into SPSS – only the first 15 data values are shown in Figure 10.14.

Figure 10.14 The first 15 values of the time series from Table 10.3 SPSS Curve Fit Analyze > Regression > Curve Estimation Transfer the variable Series to the Dependent (s) box Choose Independent Variable: time Choose Models: Linear

Figure 10.15 Selecting the linear trend model in SPSS Curve Estimation mode

Select Save
Choose Save Variables:
• Predicted values
• Residuals
• Prediction intervals, 95% confidence interval
• Predict Cases – Predict through Observation = 48

Figure 10.16 Selecting 10 future observations to be forecast (the time series has 38 observations)

Select Continue
Select OK
When you click OK the following menu appears; choose OK

Figure 10.17 A notification box from SPSS informing about the creation of 4 additional variables SPSS Data File SPSS data file modified to include predicted values, residuals, forecast values for time points 31-45, and prediction intervals. We will return to prediction intervals in greater detail later in this chapter. Figure 10.18 represents only the first 15 data values in screenshot


Figure 10.18 The first 15 observations of the data from Table 10.3 and four new variables created whilst building the forecasting model

Figure 10.19 represents the forecast values for time points 39-48 [manually entered time points t = 39 to 48].

Figure 10.19 The forecasts (future 10 observations of the data from Table 10.3) and four new variables created whilst building the forecasting model

SPSS Output

Model summary

The trend line equation statistics are provided in Figure 10.20 (T = 20.964 + 2.88 * time point) with a time series plot in Figure 10.21.

Figure 10.20 SPSS model summary for linear trend forecasts for data from Table 10.3


Figure 10.21 SPSS graph of the data from Table 10.3 and the fitted linear trend

Going back to the trend equation, we said that the trend line equation in this case was y = 20.964 + 2.88x. Excel extrapolated our trend line ten periods into the future but, unlike the SPSS printout, it gives us neither the past values nor the future values of this trend line. All we have is the chart that does this for us. We need to learn how to calculate these values manually, or by using the built-in Excel functions.

Trend parameters and calculations

Think of the equation y = 20.964 + 2.88x as a specific case, fitted to our data set. In the previous chapter we used the equation ŷ = b0 + b1x, which looks similar. In fact, it is the same equation. Do not be confused by the notation. In most textbooks this equation is written as y = ax + b, or y = a + bx. Whatever the case, the letter that stands alone (without x) is called the intercept and the other letter, associated with x, is called the slope. In our case, the value of the intercept is 20.964 and the value of the slope is 2.88. Chapter 9 explains the meaning of these two parameters; refresh your memory if you need to and re-read Chapter 9. Equations (9.8) and (9.9) refer to b1 and b0 respectively. Equation (9.9), which applies to b0, or the letter a as used here, is the intercept. Equation (9.8), which applies to b1, or the letter b as used here, is the slope of the trend. Effectively, to calculate our past and future trend values we just need these two parameters. The values of x are simply the sequential numbers that represent the time periods.


Example 10.5

We will use the same numbers as in Example 10.4 to calculate the intercept a as well as the slope b, and to show how these calculations are executed manually. We will modify equations (9.8) and (9.9) to make them a bit easier to apply for calculating the trend, so the equations for the slope (10.2) and the intercept (10.3) are:

b = (Σxy − n·x̄·ȳ) / (Σx² − n·x̄²)    (10.2)

a = ȳ − b·x̄    (10.3)

Where y is the variable, x are the incremental numbers representing the time periods, ȳ is the mean value of y, x̄ is the mean value of x and n is the number of observations, or data points, available. The manual calculations are illustrated in Figure 10.22:

Figure 10.22 Manual procedure for fitting a linear trend to the data from Table 10.3

Summary statistics: n = 38, x̄ = 19.5, ȳ = 77.123, Σxy = 70308.8, n·x̄·ȳ = 57148.6, Σx² = 19019


b = (70308.8 − 57148.6) / (19019 − 38 × 19.5²) = 2.879

a = 77.123 − 2.879 × 19.5 = 20.964

As we can see, these numbers agree with what we got from Excel in Figure 10.13 (as well as SPSS in Figure 10.20). Let us now show how to calculate the same values in an even more elegant way using built-in Excel functions.

Example 10.6

We want to fit a trend line to a time series data set given the value of the slope of the trend line and its intercept. We can either use the built-in Excel functions =SLOPE() and =INTERCEPT(), or use one single function called =TREND().

Excel Solution

Figures 10.23 and 10.24 illustrate both approaches. In addition, we extrapolated the values of ŷ ten periods into the future (rows 8-27 are hidden).

Figure 10.23 Excel calculations for finding the intercept and the slope of the linear trend


Figure 10.24 Excel calculations for the linear trend using the coefficient method or the single-function =TREND() method

Column H in Figure 10.24 shows manual calculations of the trend, after we used the Excel functions for the slope and intercept. Column L in Figure 10.24 shows the same values, but this time calculated using the dedicated Excel function for the trend. The forecasts, or future trend values at time points 39 to 48 (H34:H43 and L34:L43), are produced in the same way as the historical values. The future values of x should always be a sequential continuation of the time period numbers used in the past. In our case, the last observation is for period 38, which means that the future values of x are 39, 40, …, 48. There are exceptions, where the future values start from 1, 2, …, m, but we will make sure that this is clearly understood when we come to these methods. The principles of calculating the linear trend, as described here, can be applied to other types of curves. The Manual and the Function methods in Excel work with any curve, though the equations are different. In Excel, if we choose to apply the Function method, in addition to =TREND() another function called =GROWTH() can be applied. GROWTH is an Excel function that describes exponential trends. It is invoked and used in exactly the same way as the TREND function used for linear time series.

SPSS Solution

Figures 9.44 – 9.47 show how to calculate the linear regression/trend line coefficients 'a' and 'b', so we will not repeat this procedure here.
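As a supplementary cross-check outside Excel and SPSS, the calculations behind equations (10.2) and (10.3) and the =TREND() extrapolation can be sketched in a few lines of Python (numpy is assumed to be installed). The prices are taken from Table 10.3; because the printed values are rounded, the resulting coefficients may differ marginally from those shown in Figures 10.13 and 10.22.

import numpy as np

# Average annual UK petrol prices in pence per litre, 1983-2020 (Table 10.3)
y = np.array([36.7, 38.7, 42.8, 38.2, 37.8, 34.7, 38.4, 40.2, 39.5, 40.3,
              45.9, 48.9, 50.9, 52.9, 57.9, 60.9, 61.9, 76.9, 77.9, 69.9,
              77.9, 77.9, 79.9, 88.9, 87.9, 103.9, 89.9, 111.9, 129.9, 134.1,
              138.9, 130.9, 109.9, 103.9, 117.9, 115.9, 111.9, 111.9])
x = np.arange(1, len(y) + 1)          # time periods 1 to 38
n = len(y)

# Slope and intercept as in equations (10.2) and (10.3)
b = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x ** 2) - n * x.mean() ** 2)
a = y.mean() - b * x.mean()

# The same line from numpy's least-squares fit (equivalent to Excel's =TREND())
b_check, a_check = np.polyfit(x, y, 1)        # returns the slope first, then the intercept

# Extrapolate the trend 10 periods ahead, i.e. time points 39 to 48
future_x = np.arange(n + 1, n + 11)
forecasts = a + b * future_x
print(round(a, 3), round(b, 3), forecasts.round(1))

The manual formulae and np.polyfit() return the same slope and intercept, and the future values of x simply continue the sequence used for the historical data, exactly as in Figure 10.24.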

Check your understanding

X10.6 What would you call a model represented by the equation y = a + bx + cx², and would you say that this is a linear model?

X10.7 Define what residuals are when describing the fitting and extrapolating of trends.

X10.8 Does R-squared = 0.90 indicate a good fit? Is this the same statistic used in the chapter dedicated to linear regression?

X10.9 Extrapolate the time series below and go 3 time periods into the future. Use the TREND function. Why do you think it would not make sense to extrapolate this time series 10 time periods into the future?

X  1   2   3   4   5   6   7   8   9   10  11  12
Y  20  25  25  27  29  33  29  33  35
Table 10.4

10.4 Error measurements

You can ask yourself a simple question: why do we need forecasts? The essential purpose of forecasting is to try to manage the uncertainty that the future brings. We can never eliminate the uncertainty associated with the future, but good forecasts can reduce it to an acceptable level. Let's "unpack" this statement. What would be a good forecast? An intuitive answer is: a good forecast is one that shows the smallest error when compared with the actual event. However, it is impossible to measure errors until the event has happened. Essentially, in order to produce good forecasts, we would like to measure errors before the future unfolds. How do we do this? As we showed in Example 10.6, where we fitted a trend to the actual data, before we use the model to extrapolate the data we first "back-fit" the existing time series using the model. This is sometimes called ex-post forecasting. Once we have produced the ex-post forecasts (which is just another way of saying fitting a model to actual data), it is easy to measure deviations from the actual data. These deviations between the actual and model-fitted data (ex-post forecasts) are called forecasting errors. Essentially, forecasting errors tell us how good our method or model is. In the context of regression analysis, we referred to these errors as residuals. As already shown in the previous chapter, where we addressed regression residuals, calculating errors is a trivial exercise. An error is the difference between what happened (y) and what we thought would happen according to the model (ŷ). The same principle applies to any forecasting of a time series. An error is the difference between the actual data and the data produced by a model, or ex-post forecasts. This can be expressed as a formula:

et = At - Ft    or    et = yt - ŷt    (10.4)

Where et is the error for period t, At (or yt) is the actual value in period t and Ft (or ŷt) is the forecast value for the same period t. Just as with regression analysis, calculating errors, or residuals, tells us whether our method, or model, fits the actual data well. Again, remember that regression residuals had to be random, otherwise the assumption was that the model did not fit the data well. The same rule applies here.

Example 10.7

Let us take a hypothetical non-linear time series and fit an incorrect linear model to it. Figure 10.25 shows a time series that is non-linear, to which we deliberately fitted the wrong linear model (calculations not shown here).

Figure 10.25 A linear model fitted to a non-linear time series If we plot the errors, as in Figure 10.26, they clearly show that they are not random. If errors are not random, then we picked the wrong model. This is a good example of how errors can help us decide if the model, or the method selected, is the correct one for the data.


Figure 10.26 Errors for the linear model fitted to the non-linear time series in Figure 10.25

Another use of error measurement is if we are unsure which one of several calculated models is the most appropriate one. We can use several different forecasting methods and apply them to the same data set. We then calculate forecasting errors for all of them and compare them. Whichever model/method shows the smallest errors in the past will probably make the smallest errors when extrapolated into the future. In other words, the model with the smallest historical errors will be the best to quantify the uncertainty that the future brings. This is the key assumption. However, often it is not enough to calculate just a series of simple forecasting errors. We need to calculate some statistics based on these errors. For example, as with any other variable, we can sum all the errors and find an average error:

e̅ = Σ(At − Ft) / n    or    e̅ = Σ(yt − ŷt) / n    (10.5)

As we will see shortly, the average error is more often called the Mean Error (ME), and it is just one of many different types of error statistics that we can use to evaluate our model and our forecasts.

Example 10.8

We will use the same mini-data sample as in Example 10.3.


Period  Actual Y  Forecast Ŷ
1       130.0     110
2       120.0     120
3       110.0     130
4       135.0     140
5       155.0     150
6       160.0     160
7       180.0     170
Table 10.5 A short time series and its forecasts

Figure 10.27 shows the results in a graphical way.

Figure 10.27 A graph for the data in Table 10.5

The simple calculations of the error values are:

Period   Actual Y  Forecast Ŷ  Error
1        130.0     110         20.0
2        120.0     120         0.0
3        110.0     130         -20.0
4        135.0     140         -5.0
5        155.0     150         5.0
6        160.0     160         0.0
7        180.0     170         10.0
Sum      990.0     980.0       10.0
Average  141.4     140.0       1.4
Table 10.6 Errors for the data from Table 10.5

From Table 10.6: e1 = y1 - ŷ1, e2 = y2 - ŷ2, …, e7 = y7 - ŷ7

From these individual errors, we calculate:


Σ(yt - ŷt) = 10

e̅ = Σ(At − Ft) / n = 10.0 / 7 = 1.4

For period 1 (t=1) our forecast is below the actual value, which shows as an error of 20.0, because errors are calculated as actual minus forecast. For period t=2, our forecast is identical to the actual value. For period 3 (t=3), we are above the target, showing an error of -20.0, etc. And lastly, for period 7 (t=7), our forecast is below the actual, showing an error of 10.0. What can we conclude from this? If these were the first 7 weeks of our new business venture, and if we add all these numbers together, then our cumulative forecast for these seven weeks would have been 980. The business generated 990. This implies that the method we used made a cumulative error of 10 or, given the above formula, we underestimated reality by 10 units. If we divide this cumulative value by the number of weeks to which it applies, i.e. 7, we get the average value of our error of 1.4. The average error that our method generates per period is +1.4 and, because errors are defined as differences between the actual and forecast values, this means that on average the actual values are 1.4 units higher than our forecasts. Given the earlier assumption that the method will probably continue to perform in the future as in the past (assuming there are no dramatic or step changes), our method will probably generate similar errors in the future. Given how small the errors are, on average we have a good forecasting method.

Excel Solution

Figure 10.28 shows an example of how to calculate forecasting errors in Excel.

Figure 10.28 An example of calculating the forecasting errors

Columns B, C and D contain just the values (calculations not shown), and column E uses a simple formula for calculating errors, such as E4=C4-D4, etc. Let us conduct a brief simulation. Assume that we use two different methods to produce forecasts. One of the methods generates an average error of -2 and the other one of 4. If you assume that these are actual units in tonnes of the product that you are forecasting, then the first method overshoots the actual by 2 tonnes, and the second one undershoots the actual by 4 tonnes. In the first case you might end up with 2 tonnes of product not sold, and in the second you could have made more money by selling another


4 tonnes, but you did not have them available, because your forecast was short. Which forecast would you prefer? Difficult question. One approach is to say that in absolute terms 2 is less than 4 and, therefore, we would recommend the first method as a much better model for forecasting this business venture. On the other hand, missing the opportunity of selling 4 tonnes might be more important to a business than 2 tonnes of extra product in inventory. The point we are making is: there could be various business scenarios why you might prefer one or the other forecast. As we do not know the circumstances, we will adopt a purely numerical approach and go for the lowest absolute value. The above examples and the simple simulation illustrated how forecasting methods can help us understand and potentially quantify uncertainty. We are effectively using errors as measures of uncertainty. We learned how to calculate an average, or mean, error, but in practice other error measurements or error statistics are used too, and we will cover these in the section below.

SPSS Solution

SPSS does not provide error calculations without the context of the method for which the errors are used. We will, therefore, show how the errors and error statistics are executed in SPSS as we cover specific extrapolation methods.

Types of error statistics

A variety of error measurements, or error statistics, can be used to assess how good the forecasts are. The six most commonly used error statistics are: the mean error (ME), the mean absolute error (sometimes called the mean absolute deviation and abbreviated as MAD), the mean square error (MSE), the root mean square error (RMSE), the mean percentage error (MPE) and the mean absolute percentage error (MAPE). These errors are calculated as follows:

ME = Σ(At − Ft) / n = Σet / n    (10.6)

MAD = Σ|At − Ft| / n = Σ|et| / n    (10.7)

MSE = Σ(At − Ft)² / n = Σet² / n    (10.8)

RMSE = √(Σet² / n)    (10.9)

MPE = Σ((At − Ft) / At) / n = Σ(et / At) / n    (10.10)

MAPE = Σ(|At − Ft| / At) / n = Σ(|et| / At) / n    (10.11)


Where At represents the actual values in the time series yt, Ft represents the forecasts, or ŷt, and et are the errors. For the number of errors considered we used the symbol n.

Mean error (ME) is exactly what the phrase and equation (10.6) imply: an average error. Unfortunately, positive errors might cancel the negative errors, and often this error will be equal to zero. This does not mean that we have no errors, just that they cancelled each other out. This is the reason why this error is often used in conjunction with other error measurements.

Mean absolute error (MAD) eliminates the problem of positive and negative errors cancelling each other and produces a typical error, without saying whether it is positive or negative.

Mean square error (MSE) also eliminates the problem of ME by squaring the errors, which prevents positive and negative errors from cancelling each other out. It is a frequently used indicator, but squaring a number implies that we no longer know the units in which the errors are expressed.

Root mean square error (RMSE) solves the problem of MSE and provides a measurement in the same units as the time series. If the units of the time series are miles or litres/head, then RMSE is also expressed in the same units, i.e. miles or litres per head. As we will see later, it is also a kind of standard deviation for forecasts.

Mean percentage error (MPE) provides the mean percentage by which the forecasts deviate from the actual data, rather than the unit value that ME or MAD provide.

Mean absolute percentage error (MAPE) eliminates the sign in front of MPE and provides the mean percentage error as a typical value, without the plus or minus sign.

When evaluating forecasts and/or forecasting models, at least one, and potentially several, of these error statistics will be used to provide the evaluation.

Example 10.9

Using the same example as in 10.3 and 10.8, we will show how to do the error statistic calculations in Excel. Figure 10.29 shows the calculations.


Excel Solutions

Figure 10.29 Calculating various errors

Errors in column E are calculated as the values from column C minus the values from column D. For example, E4=C4-D4. The same cell in column F (MAD) is calculated as F4=ABS(E4). Cell G4 (MSE) is calculated as G4=E4^2 and cell H4 (RMSE) as H4=SQRT(G4). Cell I4 (MPE) is I4=E4/C4 and cell J4 (MAPE) is calculated as J4=F4/C4. From Excel: MAD = 8.57, MSE = 135.71, RMSE = 11.65, MPE = 0 and MAPE = 0.07. There is another, faster and more elegant method. Rather than calculating individual errors (as in columns E to J in Figure 10.29) and adding all the individual error values (as in row 9) or calculating the average (as in row 10), we could calculate each type of error with a single formula line.

Example 10.10

Using some of the built-in Excel functions, these errors can be calculated as illustrated in Figure 10.30.


Excel Solutions

Figure 10.30 A method of calculating various error statistics as a single function in Excel

Note that the MAD, MPE and MAPE formulae have curly brackets on both sides. Do not enter these brackets manually. Excel enters the brackets automatically if, after you have typed the formula, you press not just the Enter key but CTRL+SHIFT+ENTER (i.e. all three at the same time). This means that the range is treated as an array. You might ask yourself a question: why should we bother with so many different error statistics? That is a good question, especially given that there are a few more types of error statistics that we did not include in this chapter. The simplest answer is: they are sensitive to different things, so we will often calculate more than one type of error statistic and, depending on the results, they will help us make a better judgement about our method and our forecasts. Let us take a look at just MSE as an example. The concept of MSE was also extensively used in linear regression, where we used the residuals (errors) to evaluate the model. The rationale behind MSE is that large numbers (errors), when squared, become even larger. Take for example two errors, one with the value of 2 and the other with the value of 10. The second one is five times bigger than the first one. However, when you square them, 100 is 25 times bigger than 4. This means that if we have some large errors, in other words our model is not fitting the actual data "tightly" enough, then this model will show a large MSE. This means that MSE favours smaller errors and "penalises" models that have a few larger errors. Whether this is right or not, this is the reason why MSE is one of the error measurements most often used to assess forecasts and models.


A general rule to follow is: the lower the error statistic when compared across several potential forecasting methods, the better the model. So always select the model that has the lower error statistic. What if you have several models and you calculated several different error statistics (ME, MSE and MAD, for example) for every model, and some errors are lower for one model while other errors are lower for another model? There are no rules for such cases, so use the other statistics that we will cover shortly to make a judgement about which model to select.

SPSS Solution

SPSS will provide fit measures such as RMSE, MAPE, etc. when fitting a time series model to the data set using Analyze > Forecasting > Create Traditional Models > Statistics and choosing your model fit statistics. Unfortunately, you cannot enter actual and forecast data values into SPSS and run the SPSS Forecast command to reproduce these results (unless you execute it as a manual formula). For this reason, the error statistics in SPSS are covered in the context of specific methods.
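For readers working outside Excel and SPSS, the six error statistics can also be computed with a short Python sketch (numpy assumed). The numbers are the actual and forecast values from Table 10.5, so the results can be checked against those quoted from Figure 10.29.

import numpy as np

actual   = np.array([130.0, 120.0, 110.0, 135.0, 155.0, 160.0, 180.0])   # Table 10.5
forecast = np.array([110.0, 120.0, 130.0, 140.0, 150.0, 160.0, 170.0])
e = actual - forecast                          # errors, equation (10.4)

ME   = e.mean()                                # mean error, (10.6)
MAD  = np.abs(e).mean()                        # mean absolute error, (10.7)
MSE  = (e ** 2).mean()                         # mean square error, (10.8)
RMSE = np.sqrt(MSE)                            # root mean square error, (10.9)
MPE  = (e / actual).mean()                     # mean percentage error, (10.10)
MAPE = (np.abs(e) / actual).mean()             # mean absolute percentage error, (10.11)

print(ME, MAD, MSE, RMSE, MPE, MAPE)
# approximately 1.43, 8.57, 135.71, 11.65, 0.00 and 0.07, as in Figure 10.29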

Check your understanding

X10.10 What do you think is the difference between accuracy and precision? How would you apply these definitions in a forecasting context, i.e. what are the consequences if your forecasts are precise, but not accurate? Could you have accurate forecasts that are not precise?

X10.11 Why is the MAD or MSE type of error measurement preferred over the ME type of error?

X10.12 Two forecasts were produced, as shown below. The ME for the second forecast is 8 times larger than the ME for the first forecast. However, the MSE is 64 times larger. Can you explain?


Table 10.7 Two variables, their forecasts, errors, ME and MSE

X10.13 Is it acceptable to see some regularity of pattern when examining the series of residuals, or forecasting errors?

X10.14 If forecast values are close to actual values, what do you expect to see on a scatter diagram?

10.5 Prediction interval

We need to remind ourselves that in the sampling chapter we used the standard error of the mean (SE), together with the z-value and the estimated mean value x̄, to make an estimate that the true mean value μ is somewhere in a given interval. This interval is defined as the confidence interval CI:

CI = x̄ ± z × SE    or    CI = x̄ ± z × (s / √n)    (10.12)

See equations (4.6) and/or (5.7) to refresh your memory on the standard error of the mean (SE). Depending on the value of z, we get different confidence intervals (CI). For example: (a) z = 1.64 for a 90% CI, (b) z = 1.96 for a 95% CI, and (c) z = 2.58 for a 99% CI. If the sample, or the data set, is relatively short and represents just a small sample of the true population data values, then the t distribution is used for the computation of the confidence interval, rather than the z-value.

CI = x̄ ± t-value × SE    or    CI = x̄ ± t-value × (s / √n)    (10.13)

The only difference between equations (10.12) and (10.13) is that the t-value in the above equation will be determined not just by the level of significance (as was the case with the z-values), but also by the number of degrees of freedom.


If you compare equation (10.13) to equation (9.21) from the chapter on regression analysis, you will see that they convey an identical message. In other words, to find an interval in which the true value of y is likely to reside, you need to add to ŷ the standard error of the estimate multiplied by the desired value of z or t, depending on the required level of confidence. A general rule is that for larger samples of more than 100 observations the z-value and the t-value produce similar results, so it is discretionary which one to use. If your time series is shorter than 100 observations you should use the t-value, and for series longer than 100 observations you can use either one, though the z-value is easier to use. In any case, remember that in the context of mean estimates, the confidence interval (CI) is based on the z-value or t-value and it enables us to claim with x% confidence that, on the basis of the sample data, the true mean resides somewhere in this given interval.
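If you prefer not to read the z- or t-values from statistical tables, they can be obtained with two lines of Python. This assumes the scipy library is available; the 13 degrees of freedom shown correspond to the 15-observation example used later in this section (n - 2 = 13).

from scipy.stats import norm, t

print(norm.ppf(0.975))        # z-value for a 95% interval, approximately 1.96
print(t.ppf(0.975, df=13))    # t-value for a 95% interval with 13 degrees of freedom, approximately 2.16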

Standard errors in time series

The section above restated that the standard error of the estimate of the mean (SE) measures the differences between every observation and the mean. When dealing with time series, it would seem logical to use the same principle but, rather than calculating deviations from the mean, we calculate deviations between the actual and predicted values. We can modify equation (5.7), where we measure deviations from the mean, and instead of the mean value use the predicted values, as defined by equation (10.14). This is called the standard error of the estimate of the predicted values. As before, we just shorten it to the standard error, and from the context it is clear whether the phrase applies to the mean or to the predicted values.

SEŷ,y = √( Σ(yi − ŷi)² / (n − 2) )    (10.14)

Here yi are the actual observations and ŷi are the predicted values. As we will see shortly, the Excel version of this formula is =SQRT(SUMXMY2(array_x, array_y)/(n-2)). Excel offers an even more elegant function as a substitute for this formula. The function is called =STEYX(known_y's, known_x's). Both functions return the standard error of the predicted ŷ-value for each x in the regression. If you look into Excel's Help file, you will see that this function is a very elegant representation of an unfriendly looking equation given by (10.15).

SEŷ,y = √( (1 / (n − 2)) × ( Σ(y − ȳ)² − [Σ(x − x̄)(y − ȳ)]² / Σ(x − x̄)² ) )    (10.15)

Either way, remember that the Excel formulae =SQRT(SUMXMY2(array_x, array_y)/(n-2)) and =STEYX(known_y's, known_x's) are identical. They both return the standard error for the predicted values. Equally, equations (10.14) and (10.15) are identical, although (10.14) uses ŷ for calculating SE and (10.15) uses x̄ and ȳ to do the same. Either way, they deliver the same value for the standard error of the estimate.


As you read this, or any other statistics textbook, for the sake of expedience they all use just the phrase the standard error. From the context you need to "decipher" whether this phrase applies to the standard error of the estimate of the mean or to the standard error of the estimate of the predicted values. The formula for the former is given by equation (5.7) and the formula for the latter is given by equation (10.14). Given that we now have the equation for SEŷ,y as in (10.14), we can modify equations (10.12) and (10.13) into equation (10.16).

ŷ ± t-value × SEŷ,y    or    ŷ ± z × SEŷ,y    (10.16)

Where ŷ are the predicted values, SEŷ,y is the standard error of prediction and the t-value is the value from the Student's t critical table (the z-value is used for longer time series). Equation (10.16) is known as the prediction interval, and it is equivalent to equation (9.21) from the previous chapter. A prediction interval tells us that although we estimated a model value to be ŷ, the true value of y is likely to be somewhere in this interval. The confidence, or probability, that the true value y is in this interval is given by the t-value or z-value. As we said, for example, for z = 1.96 this confidence (or probability) is 95%. Do not confuse the confidence value with the width of the prediction interval. It is logical to expect that the higher the confidence level, the wider the prediction interval will be.

Example 10.11

In Examples 10.4 - 10.6 we used UK petrol prices per litre between 1983-2020 and extrapolated them 10 periods into the future. We will now shorten the time series and use the same data, but only 2006 - 2020. From this 15-observation time series we will extrapolate 5 periods ahead to 2025, but this time we will also calculate the prediction interval.

Excel Solution

We will use the Excel =TREND() function to produce forecasts and calculate the prediction interval for the trend forecasts. Figures 10.31 and 10.32 illustrate the technique.


Figure 10.31 A time series with its trend, prediction interval and deviations from the mean

Column D contains the values calculated using the Excel =TREND() function, i.e. the value of cell D4 is D4=TREND($C$4:$C$18, $B$4:$B$18, B4), etc. Columns E and F are based on equation (10.16). For example, E4=D4-$H$5*$H$8 and F4=D4+$H$5*$H$8, and the values of $H$5 (SE) and $H$8 (t-value) come from Figure 10.32.

Figure 10.32 Key indicators of goodness of fit

Note that cells H3 and H4 contain manual formulas for calculating the standard error, though they use different Excel functions. Cell H5 uses the dedicated Excel function.

As a side note, we can also use the Pearson correlation coefficient and the Total Sum of Squares (SST) to calculate the standard error:

SEŷ,y = √( (1 − ρ²) × SST / (n − 2) )    (10.17)

Where SST is defined by equation (9.13) from the previous chapter. This formula was used in H13 (this gives us H3, H4, H5 and H13 as four alternatives for calculating the same value, using different equations or functions). Equations (10.14) and (10.17) produce the same value, as we can see from cells H5 and H13. In Figure 10.31, the trend function was extrapolated 5 periods into the future. Figure 10.33 illustrates the graph of the prediction and the corresponding prediction interval.

Figure 10.33 A graph of the time series from Figure 10.31 with the trend and the prediction interval

The equations and calculations we performed above show how to calculate the prediction interval when forecasting time series. Unfortunately, this does not comply with one intuitive assumption, which is that the width of the prediction interval should not be constant and that it should change with time. In particular, the further we go into the future, the wider the interval should be, as the uncertainty increases. How do we calculate the prediction interval and make sure it changes with time? As we can see from the above example, the value of the standard error is a constant. To make the prediction interval change with time, we need to replace equation (10.14) with equation (10.18).

SEŷ,x = SEŷ,y × √( 1 + 1/n + (xi − x̄)² / Σ(xi − x̄)² )    (10.18)


Equation (10.18) is effectively “correcting” the standard error SEŷ,y for ŷ given by equation (10.14) or (10.15) for the changing value of x. If you compare the square root portion of equation (10.18), it looks similar to the square root portion of equation (9.22). In fact, they are identical. Although slightly different calculations are executed in (9.22) as opposed to (10.18), it can be easily proven that these two expressions are identical:

√( 1 + 1/n + (xi − x̄)² / Σ(xi − x̄)² )  [from (10.18)]    =    √( 1 + 1/n + n(xi − x̄)² / (n(Σx²) − (Σx)²) )  [from (9.22)]

The two forms are equal because Σ(xi − x̄)² = Σx² − (Σx)²/n, so multiplying the numerator and denominator of the last term by n turns one expression into the other.

As the square root expression on the left, from equation (10.18), is easier to implement in Excel, we will use this format to calculate the corrections to the standard error of the estimate of ŷ for every value of x. We will use the time series from Example 10.11 to demonstrate the effects of this additional formula.

Example 10.12

Figures 10.34 and 10.35 illustrate the Excel solution for calculating the interval estimate.

Figure 10.34 The same time series as in Figure 10.31, but using SEŷ,x for the prediction interval

Columns D, F and G are calculated in the same way as columns D, E and F in Figure 10.31.

The first value in column E (SEŷ,x) is calculated as: E4=$I$6*SQRT(1+(1/COUNT($B$4:$B$18))+(B4-AVERAGE($B$4:$B$18))^2/DEVSQ($B$4:$B$18)) and copied down. This is equation (10.18) translated into Excel syntax. The value of $J$6 (SEŷ,y) is found in Figure 10.35 as the value of the standard error of the estimate (the same value is calculated in $J$7 using a different function).

Figure 10.35 Two ways to calculate the standard error of prediction SEŷ,y

Cell J4 is calculated as =COUNT(C4:C18)-2, which gives the n - 2 used in equations (10.14) and (10.17). And finally, cell J5 is the t-value, calculated as =T.INV.2T(J3,J4). The only difference between Figure 10.31 and Figure 10.34 is that in Figure 10.31 we used one fixed value of SEŷ,y (cell I3) to calculate the prediction interval in columns E and F. In Figure 10.34, we created an additional column E where SEŷ,x was calculated as per equation (10.18). This means that SEŷ,x is no longer a single value, but changes as the value of x changes. This makes the prediction interval change with time. Figure 10.36 illustrates the Excel graphical solution.

Figure 10.36 A graph of the time series from Figure 10.34 with the trend and the widening prediction interval

Note that the prediction interval is narrowest where x = x̄, which is implied by equation (10.18). Comparing Figure 10.36 with Figure 10.33, we can see that the prediction interval in Figure 10.36, based on equation (10.18), now complies with the more intuitive assumption that the further into the future we extrapolate our trend, the greater the uncertainty, i.e. the more widely our forecasts are likely to be spread.

SPSS Solution

In order to understand how to implement these calculations in SPSS, you first need to go back to Example 10.4 and repeat what was shown in Figures 10.14 to 10.19. Figures 10.18 and 10.19 already contain the prediction interval, marked as LCL (Lower Confidence Level) and UCL (Upper Confidence Level). We will repeat the whole process, as on this occasion we are using a shorter time series, 2006-2020 (average annual UK unleaded petrol prices per litre).

SPSS data file: Chapter 10 Example 12 Prediction interval.sav

Enter the data into SPSS as in Figure 10.37.

Figure 10.37 The last 15 values of the time series from Table 10.3 SPSS Curve Fit Analyze > Regression > Curve Estimation Transfer the variable Series to the Dependent (s) box Choose Independent Variable: time Choose Models: Linear


Figure 10.38 Selecting the linear trend model in SPSS Curve Estimation mode

Select Save
Choose Save Variables:
• Predicted values
• Residuals
• Prediction intervals, 95% confidence interval
• Predict Cases – Predict through Observation = 20

Figure 10.39 Selecting 5 future observations (through observation 20) to be forecast (the time series has 15 observations)

Select Continue
Select OK
When you click OK the following menu appears; choose OK

Figure 10.40 A notification box from SPSS informing about the creation of 4 additional variables

SPSS Data File

The SPSS data file is modified to include predicted values, residuals and prediction intervals for time points 1-15, and the forecast values for time points 16-20 [manually entered time points t = 16 to 20]; these are given in Figure 10.41.

Figure 10.41 The forecasts (future 5 observations of the data from Table 10.3) and four new variables created whilst building the forecasting model

SPSS Output

Model summary

The trend line equation statistics are provided in Figure 10.42 (T = 98.9 + 1.835 * time point) with a time series plot in Figure 10.43.


Figure 10.42 SPSS model summary for linear trend forecasts for data from Table 10.3

Figure 10.43 SPSS graph of the data from Table 10.3 and the fitted linear trend

Up to this point all the steps are identical to what we covered in Example 10.4. From this point we will continue and add a confidence interval to the graph.

Modify the graph to include the confidence interval

Double-click on the graph and then double-click on the trendline – this will open the Chart Editor, as in Figure 10.44.


Figure 10.44 SPSS graph of observations and a linear trend

Click on the symbol shown in Figure 10.44. This will trigger a dialogue box as in Figure 10.45.

Figure 10.45 Add an interpolation line This joins the data points together with a straight line.


Figure 10.46 SPSS progress graph of the time series and the trend line Click on File > Close as illustrated in Figure 10.47

Figure 10.47 A dialogue box from SPSS

Figure 10.48 SPSS final graph of the time series and the trend line

Now fit the confidence interval using the LCL_1 and UCL_1 calculated values.

Select Graphs > Legacy Dialogs > Line Choose Multiple Choose Data in Chart Are: Values of individual cases

Figure 10.49 SPSS option box for multiple lines Select Define Transfer into Lines Represent box: Series, Fit for Series …., 95% LCL…, and 95% UCL …. In Category Labels, choose Variable and transfer Period into box

Figure 10.50 Selecting in SPSS which variables to include in the graph Select OK SPSS Output


Figure 10.51 The final version of the time series, the trend line and the prediction interval (compare with Figure 10.36 for the Excel version)

From the SPSS solutions we observe that the results agree with the Excel solutions: the trend line, the associated trend line statistics, the forecasts for time points 16-20, and the graphs.
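As a supplement to the Excel and SPSS solutions, the whole prediction interval calculation, following equations (10.14), (10.16) and (10.18), can be sketched in Python (numpy and scipy assumed). The prices are the 2006-2020 values from Table 10.3; because the printed prices are rounded, the coefficients and interval widths may differ slightly from those shown in Figures 10.34-10.36 and 10.42.

import numpy as np
from scipy.stats import t

# Average annual UK unleaded petrol prices, 2006-2020 (last 15 rows of Table 10.3)
y = np.array([88.9, 87.9, 103.9, 89.9, 111.9, 129.9, 134.1, 138.9,
              130.9, 109.9, 103.9, 117.9, 115.9, 111.9, 111.9])
x = np.arange(1, 16)
n = len(y)

b, a = np.polyfit(x, y, 1)                 # slope and intercept of the linear trend
x_all = np.arange(1, 21)                   # 15 historical periods plus 5 future periods (16-20)
trend = a + b * x_all                      # fitted and extrapolated trend values

resid = y - (a + b * x)                    # residuals over the historical period
se = np.sqrt(np.sum(resid ** 2) / (n - 2)) # standard error of the estimate, equation (10.14)

# Correction that widens the interval away from the mean of x, equation (10.18)
se_x = se * np.sqrt(1 + 1 / n + (x_all - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2))

t_val = t.ppf(0.975, df=n - 2)             # two-sided 95% t-value with n - 2 degrees of freedom
lower = trend - t_val * se_x               # prediction interval, equation (10.16)
upper = trend + t_val * se_x

Plotting trend, lower and upper against x_all gives the same widening "funnel" shape as Figures 10.36 and 10.51.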

Check your understanding

X10.15 What is the difference between the confidence level and the prediction interval?

X10.16 What do you think is appropriate to use to calculate the prediction interval for a time series that has 20 observations? Would you use z-values or t-values?

X10.17 Is it logical to expect that the prediction interval should get wider and wider the further into the future we extrapolate the forecasts?

10.6 Seasonality and Decomposition in classical time series analysis

We started this chapter with a simplified model. We stated that if we are interested in forecasting the long-term future of a data set, all we have to do is assume that the data set consists of only two components. Equation (10.1) stated that the first component is just a trend and everything else is treated as a residual. The residuals were the second component. In other words, if we can produce a good trend line (linear or any curve) and fit it to our data, the difference between the historical values of the time series and the trend should be considered as residuals.

These residuals, we also learned, need to fluctuate randomly around this trend line. If they do not, then the trend does not represent (or fit) the time series well and it is the wrong forecasting model. Sometimes, no matter what we do, these residuals are not random. Why? We will use an example to explain.

Example 10.13

We will look at a real-life time series, the quarterly index for US consumer energy products from Q1 2015 until Q2 2020, where 2012=100. Figures 10.52 and 10.53 show the time series and the trend we fitted to it (calculations not shown).

Figure 10.52 Quarterly index for US Consumer energy products where 2012=100 with the trend and error calculations


Figure 10.53 A graph of the time series from Figure 10.52

We can see that the linear trend was appropriately selected but, when we plot the errors as in Figure 10.54, we can see that they show regularity in their movements. It seems that something else must be embedded in our actual data. No matter what curve we pick, there seem to be some other regularities pulsating in our time series, which means that a simple trend method will not be enough to model such a time series.

Figure 10.54 Errors chart for column F from Figure 10.52

If there are other components embedded in our historical time series, then this simplistic approach prevents us from incorporating them into the forecasting model. Time to introduce the seasonal component. We all intuitively know that the sales of ice cream will have a strong seasonal component. General retail sales, we also know, have a strong seasonal component, showing strong peaks around Christmas time, for example. We obviously need to incorporate this periodic component into our model, as otherwise our model will never

be a good fit. In other words, if we do not do it, we will never get the errors to fluctuate in a random fashion, which is essential if we are to declare a model or a method fit for purpose. The time series decomposition method is precisely one of the methods that will help us capture other components from a historical time series. This will enable us to build a credible forecasting model. In fact, the method captures not only a seasonal component, but a cyclical one too. Let's explain. The classical time series decomposition method starts with the assumption that every time series can be decomposed into four elementary components:

(i) Underlying trend (T)
(ii) Cyclical variations (C)
(iii) Seasonal variations (S)
(iv) Irregular variations (I)

Depending on the model, these components can be put together in several different ways to represent the time series. The simplest of all is the so-called additive model. It states that the time series Y implicitly consists of the four components all added together:

Y = T + C + S + I    (10.19)

If you compare equation (10.19) with (10.1) you will see that what used to be called R (residuals) is now broken down into three new components (C, S and I). All the components in equation (10.19) share the plus sign, implying that this is an additive model. However, besides the additive model, another alternative is the multiplicative model, which can also be used:

Y = T × C × S × I    (10.20)

Sometimes the most appropriate model is in fact a mixed model. Here is an example of one such model:

Y = (T × C × S) + I    (10.21)

Example 10.14

To illustrate how the components make up the estimated time series, we will use the same artificially short time series from Example 10.3, as illustrated in Figure 10.55.


Figure 10.55 An example of a time series and its constituent components

Column B contains the values of the time series Y and columns C to F show the constituent components of this time series (trend, cyclical, seasonal and irregular components). We are not showing here how they were calculated, as this is just for illustration purposes. However, once we have these components (columns C to F), we can recompose them, as we did, and show them in column G as the estimated variable Ŷ. We used the mixed model, as per equation (10.21), to "reconstitute" the time series Ŷ from the components that were calculated (we will explain shortly how these components are calculated). As in previous examples, this new time series Ŷ is effectively an estimate, or an approximation (or a fit), of the actual time series Y. The character of the data in the time series will determine which model is the most appropriate. We will come back to this point too. Let us describe briefly what exactly is meant by every component, symbolized by T, C, S and I.

The underlying trend (T) is the general tendency and direction that the time series follows. We are already very familiar with this component. It can be horizontal (a stationary time series), or upward/downward (a non-stationary time series). This trend line does not have to be a straight line; it can be a curve, or even a periodic function.

The cyclical component (C) is a new one. It consists of the long-term variations that happen over a period of several years. If the time series is not long enough, we might not even be able to observe this component. If you have annual data over a long period of time, the cyclical component will move up and down around some imaginary trend line. If we used economic data, we all know that although an economy might grow over a number of years, there will be a cluster of periods when the growth is stronger (the prosperity years) and a number of years when the growth is sluggish or non-existent (recession, for example). This is a typical example of the cyclical component.

The seasonal component (S), on the other hand, applies to seasonal effects happening within one year. Therefore, if the time series consists of annual data, there is no need to worry about the seasonal component. At the same time, if we have monthly data (or quarterly, or weekly, for example) and our time series is several years long, then it will possibly include the seasonal component.

The irregular component (I) is everything else that does not fit into one of the previous three components. This component is also called the residuals, or errors, though we changed the notation to I in order not to confuse it with R as used in the previous chapter.

We know that we need to analyse this component, as it is important for the quality and accuracy of our forecasting model. A method of isolating the different components in a time series, or decomposing the time series as we will do here, is called the classical time series decomposition method. This is one of the oldest approaches to forecasting. The whole area of classical time series analysis is concerned with the theory and practice of how to decompose a time series into these components. Once you have identified the components and estimated them, you then recompose them to produce forecasts. Now that we know that a time series can be decomposed into up to four constituent components, and that these components can form additive or multiplicative models, how do we know which model is appropriate for our time series? A general rule, which applies not only to decomposition models but to many other models, is that stationary data are usually better approximated using the additive model and that non-stationary data are better fitted with a multiplicative model. The question now is how do we isolate each of these four components from the data? To demonstrate the principle, we will use some very simple algebra. We will take the multiplicative model from equation (10.20). If Y = T × C × S × I, then by dividing the historical time series Y by the trend component T, which we have calculated using the linear regression or simple trend approach, we can "isolate" the remaining three components:

Y / T = (T × C × S × I) / T = C × S × I    (10.22)

We already said that if we have annual data, then the cyclical component will be visible, but the seasonal component is potentially hidden in the trend data. This means that for annual data we do not have to worry about seasonality, so the above equation (10.22) becomes:

Y / T = (T × C × I) / T = C × I    (10.23)

Equally, if we have quarterly, monthly, weekly or daily data, in other words data expressed in less than annual form, then the cyclical component is probably not visible (it might potentially be hidden inside the trend component), but the seasonal component is. In this case, we have a new equation:

Y / T = (T × S × I) / T = S × I    (10.24)

As we can see, it is easy to get down to just one component, either C or S, but it is still "polluted" with some irregular component. There are various ways to isolate the seasonal (or cyclical) component from the irregulars (residuals), and we will now learn how to do it. Both the cyclical and the seasonal component have similar behaviour, i.e. they both repeat a pattern at some level of periodicity. The only difference is that the seasonal pattern repeats itself within every year and the cyclical pattern takes several years to repeat. The time span for each of the two components is different, but their behaviour is similar. This means that most methods for isolating the cyclical component can also be applied to the seasonal component.

Cyclical component

To illustrate a simple method of extracting the cyclical component, we will use data from the Encyclopaedia of Mathematics that show the number of Canadian lynx "trapped" in the Mackenzie River district of North-West Canada for the period 1878–1931. In fact, the time series on the website is even longer and covers 1821-1934 (https://www.encyclopediaofmath.org/index.php/Canadian_lynx_data).

Example 10.15

Year  No.   Year  No.   Year  No.   Year  No.    Year  No.
1878  299   1889  39    1900  387   1911  1388   1922  399
1879  201   1890  49    1901  758   1912  2713   1923  1132
1880  229   1891  59    1902  1307  1913  3800   1924  2432
1881  469   1892  188   1903  3465  1914  3091   1925  3574
1882  736   1893  377   1904  6991  1915  2985   1926  2935
1883  2042  1894  1292  1905  6313  1916  3790   1927  1537
1884  2811  1895  4031  1906  3794  1917  674    1928  529
1885  4431  1896  3495  1907  1836  1918  81     1929  485
1886  2511  1897  587   1908  345   1919  80     1930  662
1887  389   1898  105   1909  382   1920  108    1931  1000
1888  73    1899  153   1910  808   1921  229
Table 10.8 A time series showing the number of Canadian lynx "trapped" in the Mackenzie River district of North-West Canada for the period 1878–1931

Excel Solution

The time series consists of 54 observations and when we plot it, it looks as in Figure 10.56.


Figure 10.56 A graph of the time series from Table 10.8

This appears to be a cyclical time series, so we will now apply the classical decomposition principles and calculations in Excel, shown in Figure 10.57. The graph in Figure 10.56 indicates a repeating pattern approximately every 9 years, so the length of the lynx trapping cycle seems to be 9 years.

Figure 10.57 Calculating the trend and the C×I components from the data from Table 10.8

Column C in Figure 10.57 shows the number of lynx trapped per year (Y). The data are annual, so according to the principles of time series decomposition, they contain three components: T, C and I. We need to isolate the trend (T) component first. We achieved this in column D in Figure 10.57, which was calculated using the Excel function =TREND(). Cell D3, for example, is calculated as: D3=TREND($C$3:$C$56,$B$3:$B$56,B3), etc.


According to equation (10.23), to eliminate the trend value from the data Y, we need to divide the historical values Y (column C in Figure 10.57) by the trend data T (column D in Figure 10.57). This is done in column E where, for example, cell E3=C3/D3, etc. The result is the cyclical component (C) entangled with the irregular component (I), which is also shown in graph form in Figure 10.60. To take the next step, we will need another table, as per Figure 10.58.

Figure 10.58 Grouping data from column E in Figure 10.57 to identify typical cycle values

As the cycle is taken to be 9 years, there are 6 cycles in this time series. The block of 9 values (E3:E11) from Figure 10.57 is copied to I3:Q3 in Figure 10.58. This was repeated 6 times until all the blocks of 9 cells were copied. The typical cycle values in row 9 (I9:Q9 in Figure 10.58) are calculated as simple average values (for example, I9=AVERAGE(I3:I8), etc.). The average value effectively removes the irregular variations within every cycle and this average value for every cycle now becomes a typical cycle value. In other words, by averaging all values in every cycle, we have eliminated the irregular component. What we have in I9:Q9 in Figure 10.58 is a pure C component. Now we need to move the cyclical component C from Figure 10.58 to Figure 10.59. Cells I9:Q9 from Figure 10.58 are now, as a block of nine values, copied into column F in Figure 10.59. Note that in Figure 10.59 rows 15-45 are hidden.
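The grouping-and-averaging step can be sketched in a few lines of Python (again our own illustration, not the book's workbook): reshape the detrended ratios into one row per cycle and average down the columns, which removes the irregular component and leaves one typical value per position in the cycle.

import numpy as np

cycle_length = 9                      # decided by inspecting the graph
# 'ratios' stands for the 54 C x I values from column E (Y / T); placeholder values here
ratios = np.random.default_rng(1).normal(1.0, 0.3, 54)

blocks = ratios.reshape(-1, cycle_length)          # 6 cycles x 9 years, like I3:Q8
typical_cycle = blocks.mean(axis=0)                # like I9:Q9 in Figure 10.58
fit_C = np.tile(typical_cycle, blocks.shape[0])    # repeat C along the series (column F)
# multiplying fit_C by the trend T recomposes the series (column G)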


Figure 10.59 Extracting a pure C component from Figure 10.58 and calculation of the estimates ŷ

Column G in Figure 10.59 recomposes the time series (Ŷ) using a simple formula, G3=D3*F3, which is copied down.

Figure 10.60 A graph of the C×I components from Figure 10.59 (column E)

We decided that 9 years is the periodicity of this time series just by observing the graph. There are other, more accurate methods, such as autocorrelations, that can be used to determine the precise periodicity, but they are beyond the scope of this textbook, though you can learn about them in one of the online chapters.

Figure 10.61 shows the typical C component after the irregular component has been removed (column F in Figure 10.59).

Figure 10.61 A graph for the typical C component from Figure 10.59, column F

Figure 10.62 shows the original data and the fitted values Ŷ (column G in Figure 10.59). As the value of C is effectively an index number oscillating around 1, all we had to do was multiply every trend value T by its corresponding cycle value C.

Figure 10.62 The result of recomposing the components and showing them against the actual values for Example 10.15

As we can see, our model does not fit the actual data perfectly. However, it is much better than a simple trend line fitted to the data. We will learn later how to improve on this model. As a side comment, if we used only the trend (column D in Figure 10.59) as a model and then calculated the errors or residuals, we would get a chart that looks like Figure 10.63.


Figure 10.63 Errors if the model used for data in Figure 10.59 was just a simple linear trend (column D)

Clearly the errors in Figure 10.63 show a strong pattern, indicating that fitting just a linear model is not adequate. On the other hand, the errors calculated from the model constructed through the decomposition method look as in Figure 10.64. They indicate an improvement, but we could have improved even more (which we will demonstrate shortly).

Figure 10.64 A graph of errors calculated from the forecasts by the decomposition method

The classical decomposition method makes sense if we have a strongly cyclical or seasonal time series, and we will make further improvements as we proceed through this section and the following chapter.

SPSS Solution

SPSS has an option for time series decomposition called Seasonal Decomposition, under the Forecasting sub-menu of the Analyze menu. However, the same method is used for both cyclical and seasonal data, so we defer showing how to use it until we have covered seasonal data below.


Seasonal component

The above example applied to the cyclical component. If we have a seasonal component only, we use the same technique. Other, more complex and more accurate techniques exist, but they are beyond the scope of this textbook. To demonstrate how to isolate the seasonal component specifically, we will use an example that contains quarterly data.

Example 10.16

The data set represents the quarterly index numbers for US consumer energy products, based on 2012=100 (the same as Example 10.13). As before, we will use Excel first to show the calculations.

Year   Quarter   Y
2015   1         120.2
       2         90.6
       3         101.8
       4         98.7
2016   1         114.6
       2         92.0
       3         107.8
       4         105.8
2017   1         116.3
       2         92.6
       3         910.7
       4         108.1
2018   1         124.2
       2         98.1
       3         104.5
       4         112.2
2019   1         125.9
       2         93.6
       3         103.4
       4         110.2
2020   1         116.8
       2         88.3

Table 10.9 Quarterly index numbers for US consumer energy products, based on 2012=100

Figure 10.65 illustrates the data graph with a trend line fitted to the data set.


Figure 10.65 A graph for the time series in Table 10.9 and the corresponding trend

Excel Solution

Figures 10.66 and 10.67 show the table covering the period from 2015 until 2020 (column A). Every year is broken into four quarters (column B) and the whole time series is only 22 observations long (column C). The values of the time series are given in column D. Figure 10.66 also shows some calculations that we will explain in a moment.

Figure 10.66 Calculating T and S×I components for data in Table 10.9

As before, column E is calculated using the =TREND() function and column F is column D divided by column E (for example, F4=D4/E4).

To calculate seasonal components, we need a few more tables, as in Figure 10.67.

Figure 10.67 Extracting seasonal components for data from Table 10.9 and Figure 10.66

The same principle as before applies: the values from column F in Figure 10.66 are copied to cells K6:P9. The details are explained further down. We can now calculate the forecast values for each time point, as in Figure 10.68.


Figure 10.68 Re-composition of the components and forecasts for data from Figure 10.66

We calculated the values up to column F in Figure 10.68 in an identical way to the calculations we did in Example 10.15. The only difference is that in Example 10.16 we are handling seasonal components, whereas in Example 10.15 we handled cyclical components. Because the time series in Example 10.16 consists of quarterly data, the seasonality present is clearly based on four quarters in the year. In cells K6:P9 of Figure 10.67 we can see how the data from column F in Figure 10.66 were transposed into the appropriate cells. The next step, cells K15:P18 in Figure 10.67, is identical to the previous range K6:P9, with one difference. In every row of this little table (every row corresponds to one quarter) we have "greyed out" the min and the max value. This helps us visually, as we need to eliminate possible extremes that might have been recorded for that quarter over the range of years. After we exclude the min and max value per row, we calculate the average value of the remaining cells. These average values for every quarter are given in cells Q15:Q18 in Figure 10.67. The sum of the average values per quarter in cells Q15:Q18 should add up to 4 (because they are like index numbers per quarter and four quarters times 1 should be 4). However, as we can see in cell Q19 in Figure 10.67, they add up to 4.000330. This means we need to come up with a correction, or scaling factor, which is

given in cell Q20 in Figure 10.67. The value of the scaling factor is 0.999918 and it is calculated as Q20=4/Q19. Once we have the scaling factor, we can calculate the final values for every quarter, and these are given in cells K24:K27 in Figure 10.67. They are calculated by multiplying every seasonal factor from Q15:Q18 by the scaling factor from Q20, all in Figure 10.67. What we get are the typical quarterly indices for this time series. These typical indices are then copied into column G in Figure 10.68. From there, we return to column H in Figure 10.68. This column represents the model estimates Ŷ and they are calculated by multiplying the trend values in column E by the typical seasonal indices in column G, both in Figure 10.68. The result is the re-composed time series and it represents the fit, or model values, for our time series. You can see the result in Figure 10.69. The model seems to fit the data extremely well.
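A compact way to express this quarterly procedure, as a hedged Python sketch of our own (the function name and the data used are illustrative, not taken from the book's workbook), is to trim the min and max ratio per quarter, average the rest, and rescale so the indices sum to 4:

import numpy as np

def seasonal_indices(ratios, period=4):
    """ratios: the S x I values (Y/T) in time order, one per observation."""
    raw = []
    for q in range(period):
        vals = np.sort(ratios[q::period])                 # all ratios for this quarter
        trimmed = vals[1:-1] if len(vals) > 2 else vals   # drop the min and the max
        raw.append(trimmed.mean())
    raw = np.array(raw)
    return raw * (period / raw.sum())                     # scaling factor, cf. cell Q20

# illustrative ratios for five full years of quarterly data
rng = np.random.default_rng(0)
ratios = np.tile([1.13, 0.88, 0.98, 1.01], 5) + rng.normal(0, 0.01, 20)
print(np.round(seasonal_indices(ratios), 4))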

Figure 10.69 The result of recomposing the components and showing them against the actual values for Example 10.16 SPSS Solution SPSS data file: Chapter 10 Example 16 Seasonal time series.sav Enter series data into SPSS


Figure 10.70 SPSS data window for Example 10.16

Set the time series (2015 Q1 – 2020 Q2). We enter the data into SPSS, which is the quarterly index for US consumer energy products from 2015 Q1 until 2020 Q2 (2012=100). Before we proceed, we need to assign date labels to the data. Select Data > Define date and time

Figure 10.71 Time-stamping the data in SPSS Click OK The time data will then be added to the SPSS data file as illustrated in Figure 10.72.


Figure 10.72 SPSS creates new time label variables

Run the SPSS procedure: Seasonal Decomposition. Analyze > Forecasting > Seasonal Decomposition. Transfer the variable Quarterly_index into the Variable(s): box

Figure 10.73 SPSS dialogue box for seasonal decomposition

Click OK. A message will appear warning that 4 new variables will be added to the data set.


Figure 10.74 SPSS notification that new variables will be added to the file SPSS Output After we clicked OK, SPSS provides several outputs. Figure 10.75 shows that we have selected the multiplicative model and what the values of the typical quarterly seasonal index are.

Figure 10.75 A summary of the typical values for seasonal components for Example 10.16

SPSS adds new variables to your data set. Figure 10.76 shows the ones that we selected during the execution process. Furthermore, we can now fit this model to the time series data, given that the model fit is defined by the expression SAF_1 * STC_1. Select Transform > Compute Variable. Type Fit in the Target Variable box.

In the Numeric Expression box type the expression SAF_1 * STC_1. This will add an extra column called fit to your SPSS data file, as illustrated in Figure 10.76.

Figure 10.76 New variables created by SPSS when using seasonal decomposition as a forecasting model

SPSS has created four new variables, called SAS_1, SAF_1, STC_1 and ERR_1. The interpretation, as per the SPSS Help file, is as follows:

• SAS. Seasonally adjusted series, representing the original series with seasonal variations removed. Working with a seasonally adjusted series, for example, allows a trend component to be isolated and analyzed independent of any seasonal component.
• SAF. Seasonal adjustment factors, representing seasonal variation. For the multiplicative model, the value 1 represents the absence of seasonal variation; for the additive model, the value 0 represents the absence of seasonal variation.
• STC. Smoothed trend-cycle component, which is a smoothed version of the seasonally adjusted series that shows both trend and cyclic components.
• ERR. The residual component of the series for a particular observation.

If you compare these column values with the Excel version for Example 10.16, then:

1. The SAS variable is comparable to column F (Seasonally Adjusted series) in Figure 10.68
2. The SAF variable is comparable to cells K24:K27 in Figure 10.67 (Typical Seasonal Index) or to column G in Figure 10.68
3. The STC variable is somewhat comparable to column E (Trend component) in Figure 10.68 (this is not quite the case, as SPSS uses a moving-average trend, which will be explained in the next chapter)
4. ERR is calculated and shown in the section below

You will notice that the numbers from Excel do not match the numbers from SPSS 100%, though they are very close. For example, the typical quarterly indices for Q1-Q4 in SPSS are: 1.13, 0.89, 0.97 and 1.01. In Excel they are: 1.13, 0.87, 0.98 and 1.02. The reason for the discrepancies is that SPSS uses a slightly more sophisticated estimation method to make it universally applicable to both seasonal and cyclical components. For example, the method for trend that SPSS uses is not a simple trend method but a combined centred moving average method that incorporates both the T and C components together. This has a "knock-on" effect on both the SAS and SAF variables, hence the small differences from our Excel solution. To establish how well this decomposition method models the actual data, we need to measure the differences between the actual values in column D and the fitted values in column H in Figure 10.68. In fact, we can also apply some of the lessons from the linear regression chapter and return to the model goodness of fit concept.

Error measurement

We will now calculate the errors, as well as the squared errors, for Example 10.16. From there, we will check if the errors are random, if they follow the normal distribution, and what the mean square error (MSE) and the root mean square error (RMSE) are. You should by now be familiar with all the calculations, so we will go directly to Excel to perform these tasks.

Example 10.17

We use the data from columns D and H in Figure 10.68 to calculate the errors. They are copied into columns B and C in Figure 10.77.

Excel Solution

Figure 10.77 Error analysis for the classical decomposition forecasting method

The errors in column D in Figure 10.77 are simple differences between columns B and C (for example, D4=B4-C4, etc.). As the first step, we will plot the errors from column D in Figure 10.77. The plot is given in Figure 10.78.

Figure 10.78 A graph of errors (column D in Figure 10.77) calculated from the forecasts by the decomposition method

As we can see from Figure 10.78, the errors appear to be random, which means that our model has potentially modelled the actual data well. However, the fact that they appear random is not enough; we need to do some tests. One of them is the test for normality of errors, i.e. a check whether the errors follow the normal distribution. Columns E to J in Figure 10.77 show the procedure. Errors are ranked in column E using the Excel function =RANK(). For example, cell E4=RANK(D4,$D$4:$D$25,1), which is copied down. We copy the values from this column into column F, and then use the Excel SORT utility to sort all the errors in this column in ascending order. Column G now shows the rank for every error, calculated as G4=RANK(F4,$F$4:$F$25,1). This column is labelled rt. The rank values rt from column G in Figure 10.77 are used to calculate the cumulative probabilities for this time series. The probabilities are calculated using a simple equation (10.25):

pt = (rt − 0.5)/n    (10.25)

Where rt is the rank and n is the number of errors. These values pt are shown in column H in Figure 10.77. To calculate the pt values in column H, we convert equation (10.25) into H4=(G4-0.5)/22 (dividing by n, the number of errors), and copy the cells down. The z-values in column I are calculated as I4=NORM.S.INV(H4), and they are also copied down. And finally, cell J4=D4^2 is also copied down. In cell L3 we show the sum of all e2, calculated as L3=SUM(J4:J25). L4 contains the MSE, and it is calculated as L4=L3/COUNT(J4:J25). Cell L5 is the RMSE, calculated as L5=SQRT(L4). To summarise, to check if our errors follow the normal distribution, we need to calculate the z-value for every pt. In other words, we use the Excel inverse cumulative distribution function =NORM.S.INV(), which returns the z-value whose area under the curve to the left equals pt. As shown in column I in Figure 10.77, this is all easily executed in Excel. Finally, we can create a plot of the errors et vs the zt values (columns D and I in Figure 10.77), shown as a scatter diagram. Figure 10.79 shows the plot.
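For readers who prefer code to spreadsheet formulas, the same normality check can be sketched in plain Python (our own illustration with made-up error values; NormalDist().inv_cdf plays the role of Excel's =NORM.S.INV()):

from statistics import NormalDist

errors = [1.2, -0.8, 0.3, -2.1, 0.9, 1.7, -0.4, 0.1]        # illustrative errors only
n = len(errors)
ranks = [sorted(errors).index(e) + 1 for e in errors]        # like =RANK(...,1)
p = [(r - 0.5) / n for r in ranks]                           # equation (10.25)
z = [NormalDist().inv_cdf(pt) for pt in p]                   # z-value for every error
mse = sum(e ** 2 for e in errors) / n
rmse = mse ** 0.5
# plotting errors against z should give a roughly straight line if the errors are normal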

Figure 10.79 Error normality graph for classical decomposition forecasts in Example 10.16

As we know from Chapter 9, the dots in this scatter diagram should be aligned in a more-or-less straight line. To help us see the alignment, we right-click on any of the dots in the plot in Excel and then select Add Trendline from the dialogue box that appears. This automatically adds a straight line of best fit through the dots. As we can see, the line fits almost all the errors very well and the errors follow the straight line, which suggests that they are normally distributed. This is another confirmation that our model fits the data well. We also calculated the mean square error, which happens to be 12.84 (cell L4 in Figure 10.77). To put this error in the same units as the original time series (which happen to be indices), we take the square root of that value. The root mean square error (RMSE) is 3.58 (cell L5 in Figure 10.77). This means that on average our fitted model is "adrift" from the original time series by 3.58 index points. Figure 10.69 showed us visually that we have a good fit, but now we have also been able to quantify how good this fit is. If we used more than one model or method to fit the actual data, we would use the MSE and RMSE to decide which one to keep as the final model. As we know, the lower the MSE or RMSE, the better the model fits the time series.

Prediction interval

Equation (10.16) already defined how to use the standard error of the estimate SEŷ,y to calculate the interval within which the true value of y might be. We called this a prediction interval. We also said that for larger time series we can use either t-values or z-values to define the prediction interval. So, our prediction interval, in its most generic format, is defined as per equation (10.16), which we repeat here:

Ŷt ± z × SEŷ,y    (10.26)


Because we said that our errors must follow the normality assumption, we can use one "trick" which enables us, under these conditions, to say that the RMSE (root mean square error) is effectively the same as SEŷ,y (the standard error of the estimate). You can compare equation (9.19) with (10.9) to understand why we said that. This means that equation (10.26) becomes:

Ŷt ± z × RMSE    (10.27)

The difference between equations (9.19) and (10.9) is that the denominator in (9.19) is n−2 and in (10.9) is n. Equation (10.9) shows a general formula for calculating the RMSE, whilst (9.19) shows a specific version suited to regression analysis, where n−2 indicates the number of degrees of freedom for linear regression. For convenience, let's label the expression z × RMSE in equation (10.27) as k. In this case, equation (10.27) becomes:

Ŷt ± k    (10.28)

The above equation (10.28) is perfectly suited to one-step-ahead forecasts. The ex-post forecasts that fit the historical time series are effectively one-step-ahead forecasts, so it is appropriate to use this approach to estimate the prediction interval for the historical data. However, when we come to the last observation in the time series and forecast the future values, we are producing forecasts for multiple steps ahead (because we have reached the last actual time series observation). As we already know, it is intuitive to assume that as we go further into the future, the prediction interval is going to get wider and wider. Unfortunately, without going into some very complex methods which are beyond the scope of this textbook, there are no easy analytic methods to apply this principle to seasonal data. However, we can use a workaround, which is an empirical method. The empirical method is as follows. We take h to be the number of steps ahead for which we are producing forecasts, i.e. h = 1, 2, 3, … If we multiply the MSE by h, this will intuitively address the need to make the prediction interval grow wider as we move further into the future. In this case, the RMSE is calculated using the following equation:

RMSEh = √(h × MSE) = √(h × Σe²/n)    (10.29)

Equation (10.27) no longer contains just a single value RMSE, but RMSEh, which is a dynamic value changing with h. The factor k in the prediction interval now also becomes a dynamic value, subject to the number of future steps h. This means our equation (10.28) keeps its shape, but it is "enhanced" by the number of steps h:

Ŷt+h ± kh    (10.30)

Example 10.18

To illustrate how to calculate a prediction interval for seasonal time series, we will use the data from Example 10.16.

Excel Solution

Figure 10.80 Prediction interval for Example 10.16 based on classical decomposition forecasts

Columns A:H in Figure 10.80 are copied from Figure 10.68 and the errors in column I are calculated as the differences between the cells in columns D and H, as before. Cell J3 (MSE) is calculated as =SUMXMY2(D4:D25,H4:H25)/COUNT(D4:D25) and cell J4 (RMSE) as =SQRT(J3). Cell J6 contains the value of 0.05, which is the value of alpha for the 95% confidence interval. Cell J7 is the t-value for the 95% confidence interval, calculated as =T.INV.2T(J6,COUNT(D4:D25)-2). The second argument in this function refers to the degrees of freedom, which is n-2, or in our case 22-2=20. The new values are the steps h in cells J26:J31, which are just the future time periods, marked as 1, 2, …, 6. The dynamic RMSEh is in cells K26:K31 and it is calculated as K26=SQRT(J26*$J$3), copied down. Cells L26:L31 contain the kh values and they are calculated for the 95% confidence interval as L26=K26*$J$7, etc. To show how well our model fits the actual data, as well as the future forecasts and their prediction interval, we put all these variables in one graph in Figure 10.81. As we can see from Figure 10.81, the historical prediction interval tracks the actual observations in a consistent manner. The future prediction interval projects the 95% confidence that the future values will be somewhere within this interval. It also complies with the intuitive expectation that it should get wider the further into the future we extrapolate our forecasts.
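The widening interval of equations (10.29) and (10.30) can also be sketched in a few lines of Python (our own illustration; the MSE of 12.84 and the t-value for 20 degrees of freedom come from the worked example, while the future Ŷ values below are made up):

import math

mse = 12.84                  # MSE from Example 10.17
t_value = 2.086              # two-tailed t for alpha = 0.05 and 20 degrees of freedom
future_forecasts = [118.0, 92.0, 101.0, 106.0, 120.0, 93.0]   # illustrative Y-hat values

for h, f in enumerate(future_forecasts, start=1):
    rmse_h = math.sqrt(h * mse)          # equation (10.29): RMSE grows with the horizon h
    k_h = t_value * rmse_h
    print(f"h={h}: {f - k_h:7.2f} to {f + k_h:7.2f}")        # equation (10.30)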

Figure 10.81 The result of recomposing the components and showing them against the actual values for Example 10.18 together with the historical and future prediction intervals This might not be so obvious from the graph due to the seasonal character of the time series. However, if we use T from column E in Figure 10.80 and put it in equation (10.30), so that it reads: Tt+h ± kh, then Figure 10.81 changes into Figure 10.82.

Figure 10.82 The same as Figure 10.81, except that T is used for forecasts instead of ŷ

To demonstrate the point, in Figure 10.82 the prediction interval was calculated using just the trend value T, rather than the estimates ŷ as in Figure 10.81. This is a general method for showing the widening prediction interval for future steps h and is often used in regression analysis. In Figure 10.83 we show how the two future prediction intervals differ (compare rows 26:31).


Figure 10.83 Prediction interval calculated on the basis of ŷ values (left) and T values (right)

The historical prediction interval in cells N4:O25 in Figure 10.83 (for both cases) is calculated as N4=H4-$J$7*$J$4 and copied down to the 25th row. Cells N26:O31 use N26=H26-L26 and O26=H26+L26, copied down. Note that the prediction interval in cells N26:O31 is different in the left-hand chart (where ŷ is used as the basis for the prediction interval) from the right-hand chart (where T is used as the basis for the prediction interval). Although in the context of seasonal time series forecasting you would not expect to use T rather than ŷ for forecasting (though it is a common practice in regression analysis), we have done it here just to make it more obvious how the future prediction interval gets wider the further into the future we go, despite the fact that the confidence level remains the same, i.e. 95% in this specific case.

Check your understanding

X10.18 Determine the periodicity of the time series in Table 10.10, isolate the cyclical component and produce forecasts using the classical decomposition method.


Year   y     Year   y
2001   25    2010   26
2002   26    2011   26
2003   27    2012   27
2004   27    2013   29
2005   26    2014   28
2006   25    2015   27
2007   26    2016   26
2008   28
2009   27

Table 10.10

X10.19 Which model did you use in X10.18 and why?

X10.20 Go to one of the web sites that allow you to download financial time series (such as http://finance.yahoo.com/), but with a strong seasonal/cyclical component. Use them to practise the classical decomposition method.

Chapter summary

In this chapter we introduced time series as a special type of data set moving sequentially through time. We declared that time series analysis is mainly used for extrapolations and forecasting. We also defined various terms that describe different types of time series (stationary vs non-stationary), seasonality, as well as the types of methods used to produce forecasts. And finally, we emphasized that most of the material from this chapter is similar to the previous chapter, with the exception that in this chapter we only used one set of observations (one variable) that is implicitly defined by time (another variable). We also assumed that the easiest way to think about time series extrapolation and forecasting is to assume that a time series can be extrapolated using a simple trend function. In this case everything else that is not included in the trend function is called a residual. This approach is particularly useful if we are interested in long-term forecasts, where deviations from the trend are not as important as the overall direction and the slope of the trend. As the initial focus of this chapter was just on the trend component, we explained that it can come in different shapes (linear, curve, sinusoid, etc.). We then used a simple trend extrapolation technique to demonstrate how to fit it to the time series and how long-term forecasting is done. Once the trend was fitted to the time series, we were able to measure a variety of error terms. We covered six specific error measurement statistics: the mean error (ME), mean absolute deviation (MAD), mean square error (MSE), root mean square error (RMSE), mean percentage error (MPE) and mean absolute percentage error (MAPE). After we demonstrated how they are calculated, we also provided an interpretation of every error measurement method.

The following section was dedicated to measuring the goodness of fit of our forecasts and establishing the future uncertainty interval for our forecasts. We used the standard error of the estimate to calculate the prediction interval and to show that it can be made to grow wider the further into the future we extrapolate our time series. The last section was dedicated to time series decomposition method. We concluded that in certain instances, especially if we are dealing with seasonal or cyclical data, a simple trend extrapolation is too “crude” as a method to produce acceptable forecasts. We needed to introduce other components hidden in a time series, such as cycles and seasons, and learn how to decompose the time series to extract them. Once we isolated all the relevant components, we were able to produce forecasts that were much more realistic than simple trend extrapolations. We concluded the chapter by demonstrating how to calculate a prediction interval for seasonal time series and how it complies with an intuitive assumption that the interval should grow in width, the further into the future our forecasts are extended.

Test your understanding

Basic concepts

TU10.1 How would you classify the two time series shown below, and why:

Figure 10.84

Figure 10.85

TU10.2 Is it possible for a stationary time series to be seasonal at the same time? If you think it is, sketch how such a time series would look as a graph.


TU10.3 What is the difference between using the =TREND() method vs. the =SLOPE() and =INTERCEPT() method in Excel to calculate a linear trend? What other built-in functions in Excel enable you to calculate a non-linear trend?

TU10.4 To calculate forecasting errors, you subtract:

A. et = At - Ft    or    B. et = Ft - At

Does this create a minor inconvenience in interpretation, and what is the inconvenience?

TU10.5 What is the main advantage, and the main disadvantage, of MSE?

TU10.6 If your forecasting errors, when graphed as below in Figure 10.86, exhibit this kind of pattern, what would you conclude about your forecasts?

Figure 10.86

TU10.7 Which dedicated Excel function is used for the formula below:

SEy,ŷ = √( Σ (yi − ŷi)² / (n − 2) ), with the sum taken over i = 1 to n

What alternative Excel functions would you use to build the same formula?

Want to learn more?

The textbook online resource centre contains a range of documents to provide further information on the following topics:

1. A10Wa Other types of trends
2. A10Wb Index numbers refresher


11. Short and medium-term forecasts

11.1 Introduction and chapter overview

In the previous chapter we introduced long-term forecasts. They are typically nothing but trend extrapolations and are very similar to simple regression analysis. Short-term forecasts, on the other hand, bring a completely fresh perspective on time series analysis and forecasting. We already know that we do not expect our long-term forecasts to be too accurate. If they clearly indicate the direction and the general shape of the curve they follow, our objectives are met, and we are happy with the results. Our short-term forecasts, on the other hand, are not required to follow a trend. On the contrary, they are supposed to forecast just one period ahead, so they need to be very accurate. The medium-term forecasts go several periods ahead, so they are expected to be as accurate as possible, or at least to have a prediction interval that is as narrow as possible. This means that both the short-term and the medium-term forecasts need to be handled differently from the long-term forecasts. We must remember that words such as "short-term" and "medium-term" are not used in the time context. Short-term means just one forecast ahead, regardless of whether we use data expressed in minutes or annual data. Medium-term implies two to several (arbitrarily up to four or six) forecasts ahead. If the data are given in seconds and we need to forecast the next 24 observations, although this is only 24 seconds in the future, it is a long-term forecast. Equally, if the data are measured annually, and we must forecast just one year in the future, then this is a short-term forecast. As we said, it is not the time dimension, but the number of future observations that we are forecasting (the time horizon) that determines whether we are talking about short-term or long-term forecasts. Short-term and medium-term forecasting methods require understanding of relatively simple and intuitive concepts, such as moving averages and exponential smoothing. After we introduce the concept of a moving average, we will learn how to use this concept as a short-term forecasting technique. If we want to use the same technique for medium-term forecasts, then certain modifications are needed, and we will introduce double moving averages to explain how to achieve this. As a more sophisticated alternative to moving averages, we will introduce exponential smoothing. This method can also be used as a short-term and a medium-term approach to forecasting. We will explain how both types of forecasts are executed using this technique. For both the short-term and medium-term forecasts we need to measure how appropriate they are for a specific data set. This is achieved by understanding the data sets, as well as by measuring forecasting errors. The error measurement is identical to the approach we used in the previous chapter, so we will just explain the specifics related to the short-term and medium-term forecasts. This will lead us to the section on how the prediction intervals are constructed for these types of forecasts.

The final section will combine some of the lessons from this and the previous chapter and we will learn how to handle short to medium-term seasonal time series by combining the classical decomposition method with exponential smoothing. The chapter is completed by introducing the Holt-Winters' method, one of the most powerful methods for forecasting seasonal time series.

Why is short-term forecasting so important?

Let us assume that you own some shares and you are contemplating selling them. You are not in a rush, so your strategy is to wait until you think the shares have reached a reasonably high level. When this happens, you want to "pull the trigger" and sell. It is, therefore, important for you to monitor the shares' movement from one day to another and anticipate what might happen the following day. If you forecast that the value will go up, you will wait. If the forecast is that it will go down, you might decide to sell today. This is a very crude scenario that describes why you really want to know what will happen the following day. Another one, which you are likely to experience if you end up in supply chain management, is the question of inventory. Assume that you are running a shop that gets supplies from the distribution centre. Because of the distance from the distribution centre, you have a 12-hour delivery notice. If you can predict that you will be down significantly on one item, you quickly order it and within 12 hours it is delivered. Your short-term forecasts are the only tool to defend you from empty shelf space for this product, or from overstocking an item that you cannot sell. Again, short-term forecasts are your only tool to manage your business efficiently. We used the phrase earlier that forecasting is one of the areas of statistics that will help you today to gain glimpses of tomorrow. These glimpses imply that these statistical techniques will help you narrow down the most probable area, or range of numbers, likely to happen in the future. It is the objective of short-term forecasting to narrow down this area as much as possible and deliver as accurate and as reliable forecasts as possible.

Learning objectives

On completing this unit, you should be able to:

1. Understand moving averages.
2. Be able to forecast using single and double moving averages.
3. Understand exponential smoothing.
4. Be able to forecast using single and double exponential smoothing.
5. Calculate errors to establish if the model fits the data set.
6. Construct a prediction interval.
7. Produce seasonal short to medium range forecasts.
8. Be able to apply Holt-Winters' method to seasonal mid-range forecasts.
9. Solve problems using Microsoft Excel and SPSS.


11.2 Moving averages

Short-term and medium-term forecasting techniques are all based around some sort of moving averages, or smoothed values where past errors play a role in forecasting. One way or the other, both sets of techniques essentially "smooth" the original time series. What do we mean by this? We mean that the time series created by using one of these techniques to fit the historical values will, in its appearance, be smoother than the original time series. It will have fewer dramatic ups and downs and it will appear as if someone "ironed" the original time series. Smoothing the original time series and treating this smoothed time series as an approximation of the original time series, or the fit, means that we are eliminating some random elements from the time series. Just as in the previous chapter we assumed that the long-term forecast is the trend plus some random variations, here we are looking not for a trend, but for a moving average. In fact, it can be either a moving average or exponentially smoothed values (to be explained below). Everything beyond that is again treated as a residual. So, in principle we keep the same philosophy as at the beginning of the previous chapter, but we are substituting the trend values with the moving averages, or exponentially smoothed values.

Simple moving averages

To understand moving averages, we must remind ourselves of some of the basic properties of the mean value, or average, in the context of time series analysis. Let us use a simple Excel =AVERAGE() function to calculate the average value of a time series. We'll use a very short and artificial time series just for illustration purposes.

Example 11.1

A very short time series in Figure 11.1 has an average value of 206. This average value represents the series well, because the series flows very much horizontally.

Figure 11.1 A short stationary time series and the average value Figure 11.2 illustrates this graphically. The average of 206 is shown as a horizontal line that runs across the time series.


Figure 11.2 A graph for the time series from Figure 11.1

As we know, the above sample time series can be called a stationary time series. This implies that an average is a very good predictor of a stationary time series. If we know the average (the mean value, for example), then we can predict the next future value and it will probably be somewhere around this mean value. However, if the series was moving upwards (as in Figure 11.3), or downwards, we have a non-stationary time series. In this case this average value would not be the best representation of the series (see Figure 11.4).

Figure 11.3 A short non-stationary time series and the average value

Figure 11.4 A graph for the time series from Figure 11.3 with the fitted mean value

In this case a much more realistic representation would be a moving average. How do we calculate moving averages? Moving averages are dynamic averages and they change depending on the number of periods for which they are calculated. A general formula for moving averages is given by equation (11.1).

Mt = (1/N) Σ xi, where the sum runs over i = t−N+1, …, t    (11.1)

In equation (11.1), t is the time period, N is the number of observations in the interval taken into the calculation and xi are the observations. A simplified expression of equation (11.1) is shown as equation (11.2).

Mt = (xt + xt-1 + … + xt-N+1)/N    (11.2)

This means that the moving average M3 is calculated as:

M3 = (x3 + x2 + x1)/3

Using the data from Figure 11.3:

M3 = (200 + 250 + 150)/3 = 200

In this case, the moving average value for the first three observations is placed in the third period, i.e. at the end of the interval for which it is calculated. However, sometimes you will see the following equation:

M2 = (x3 + x2 + x1)/3

M2 = (200 + 250 + 150)/3 = 200

This is called a centred moving average. It is the same value, but it is positioned in the middle of the interval for which it is calculated, rather than placed in line with the last observation in the interval, as above. Notice that above we used an example with an odd number of observations (3) in the interval. If we had an even number of observations in the interval and we wanted to centre the moving average, it would be difficult to place it between the two middle observations. For this reason, it is easier to take an odd number of observations in the moving average interval. It is a convention to place the moving average value either aligned with the last observation from the moving average interval, or to centre it in the middle of the moving average interval. For these purposes, we recommend that, if appropriate, the moving average interval consists of an odd number of observations. Another way to express equation (11.2) is as follows:

Mt = Mt−1 + (xt − xt−N)/N    (11.3)

Equation (11.3) implies that we can still estimate the current moving average even if we do not know all the values in the moving average interval. All we need is the previous value of the moving average, plus two other values from the interval. If we had 5 elements in the moving average interval (N=5) and, for example, we tried to estimate the 8th moving average value in the series, equation (11.3) would look as follows:

M8 = M7 + (x8 − x3)/5

Although this might appear to be a useless fact here, you will see why we mentioned it when we discuss exponential smoothing. The key point here is that the current value of the moving average can be derived from the previous value of the moving average, plus some combination of the actual historical values from the time series. To standardise the notation, we will use the abbreviation MA for moving averages, or SMA (single, or simple, moving averages). If you see 3MA or 5MA, this means single moving averages for an interval of 3 or 5 observations respectively.

Example 11.2

We will use the nonstationary time series from Example 11.1 and calculate moving averages as in Figure 11.5.

Figure 11.5 An example of 3 and 5 centred and not-centred moving averages Column D shows 3 point moving averages (3MA) not centred (written at the end of the moving average interval), column E also shows 3MA, but centred, i.e. written in the middle of the moving average interval. Columns F and G show the same example, but for 5MA, and column H shows the moving averages calculated as per equation (11.3), and as expected they are the same as the values from column F.


Excel Solution

The cells in Figure 11.5 are executed in Excel as follows. The first cell in column D is D6=SUM(C4:C6)/3, or D6=AVERAGE(C4:C6). The same formula is used for cell E5. Correspondingly, we have F8=SUM(C4:C8)/5, or F8=AVERAGE(C4:C8), and the same formula in G6. In cell H9 we are using equation (11.3), so H9=G8+(C9-C4)/5. All the formulas are copied down to the last cell where it makes sense. To illustrate how the time series and the 3MA centred series look, in Figure 11.6 we show the original time series and the 3MA time series.
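The same 3MA calculations can be sketched in Python (our own illustration, mirroring what the Excel formulas above do with the series from Example 11.2):

series = [150, 250, 200, 360, 330, 380, 280, 300, 490, 450]
N = 3

# end-of-interval placement: the average of observations t-2, t-1 and t is written at period t
end_placed = {t: sum(series[t - N:t]) / N for t in range(N, len(series) + 1)}
# centred placement: the same values written one period earlier (the middle of the interval)
centred = {t - 1: v for t, v in end_placed.items()}

# recursive form of equation (11.3): M_t = M_(t-1) + (x_t - x_(t-N)) / N
m = end_placed[N]
for t in range(N + 1, len(series) + 1):
    m = m + (series[t - 1] - series[t - 1 - N]) / N
    assert abs(m - end_placed[t]) < 1e-9      # identical to the directly averaged values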

Figure 11.6 A graph of the time series and centred 3MA from Figure 11.5 SPSS Solution SPSS provides a standard function for calculating centred moving averages. Figure 11.7 contains the same time series as in Figure 11.5. The file used is Chapter 11 Example 2 Moving averages.spv.

Figure 11.7 Time series from Figure 11.5 in SPSS

To calculate moving averages, we need to create a new time series. Select Transform > Create Time Series

Figure 11.8 A dialogue box to create a new time series

This brings a dialogue box as illustrated in Figure 11.9.

• Transfer Series to the Variable -> New name box
• Use the drop-down menu Function to choose Centered moving average.
• In the Span entry box type 3.
• Click on Change

Figure 11.9 A dialogue box to create a 3MA centred time series from the original time series

Click OK

SPSS output This creates a new series, as in Figure 11.10, where Series_1 represents the 3-point centred moving average.

Figure 11.10 3MA centred time series created in SPSS

To chart both time series (the original time series and the newly created 3MA centred time series) on the same chart, select Analyze > Forecasting > Sequence charts

Figure 11.11 Charting the two time series from Figure 11.10 Move Periods to Time Axis Labels and Series and MA(Series,3,3) to variables.


Figure 11.12 Defining the chart for the two time series Click OK SPSS output

Figure 11.13 A graph of the time series from Figure 11.10 The SPSS chart given in Figure 11.13 is identical to the Excel solution given in Figure 11.6. In Example 11.2 we used only 3-period (3MA) and five period (5MA) moving averages. What happens if we extend the number of observations in the moving average interval? As the number of observations increases in the moving average interval and it reaches ultimately the full data set, the moving averages line becomes smoother and smoother,


until it becomes a straight horizontal line that represents the overall average (mean) value. Our simple data set from Example 11.2 is too short to illustrate this, so we will use the data representing the average annual UK unleaded petrol prices in pence per litre, 1983-2020 (the same dataset was previously used in Example 9.4). The data set contains 38 observations and we decided arbitrarily to calculate 3MA, 12MA and the overall average (effectively 38MA). Figure 11.14 shows the results as a graph.

Figure 11.14 A time series from Example 9.4 and three other types of averages (3MA, 12MA and simple average) Remember that the larger the number of moving averages in the interval, the “smoother”, the time series of moving averages will be when compared to the original time series. This is quite obvious from Figure 11.14. If you look at the 3-interval moving average time series, you will see that it is closely tracking the actual time series. On the other hand, a 12-interval moving average line will be much “flatter” and eliminate the extreme “jumps”, as it is averaging much larger intervals and it is, therefore, not so much a subject to most recent events. Moving averages are one of the favourite techniques often used in business reports. Typically, you will find 3-months, 6-months or 12-months moving averages used.

Short-term forecasting with moving averages

Now that we know how to create moving averages, the question is: how do we use them as a forecasting tool? The answer is very simple: we just shift the moving average by one period into the future and this becomes our forecast. Equation (11.2) can be re-written as a forecast in the following way:

Ft+1 = Mt = (xt + xt-1 + … + xt-N+1)/N    (11.4)


In other words, the moving average for the first three periods (if we are using 3 moving average periods, for example) becomes a forecast for the fourth period:

F4 = (x3 + x2 + x1)/3

F11 = (x10 + x9 + x8)/3

If we take the data from Example 11.2, then according to equation (11.4) the forecasts for one period ahead are as follows:

Period   Series   3MA forecasts   Calculations
1        150
2        250
3        200
4        360      200.0           =(150+250+200)/3
5        330      270.0           =(250+200+360)/3
6        380      296.7           =(200+360+330)/3
7        280      356.7           =(360+330+380)/3
8        300      330.0           =(330+380+280)/3
9        490      320.0           =(380+280+300)/3
10       450      356.7           =(280+300+490)/3
11                413.3           =(300+490+450)/3

Table 11.1 Forecasts using 3MA

From Table 11.1, the forecast at time point 11 is 413.3. If we present the results in a graphical form, then our forecast for one period ahead is as illustrated in Figure 11.15:
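As a hedged sketch of our own (not the book's workbook), the forecasts in Table 11.1 can be reproduced in Python by shifting the 3MA one period forward, which also makes it clear that only one step ahead can be forecast this way:

series = [150, 250, 200, 360, 330, 380, 280, 300, 490, 450]
N = 3

# the 3MA of periods t-N+1..t becomes the forecast for period t+1 (equation 11.4)
forecasts = {t + 1: sum(series[t - N:t]) / N for t in range(N, len(series) + 1)}
print(round(forecasts[4], 1))      # 200.0, the first forecast in Table 11.1
print(round(forecasts[11], 1))     # 413.3, the one-step-ahead forecast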

Figure 11.15 A graph for the values from Table 11.1

If, for example, a 12-month moving average is used in a business report, you will often find that the report calculates the moving average for the previous 12 months and declares that this is most likely to be the forecast for next month's value.


Example 11.3

Let us now use a slightly longer time series and see how to use the moving averages function in Excel for forecasting purposes. Figures 11.16 and 11.17 show the UK birth rate per 1000 people from 1960-2018: in total 59 observations (rows 15-55 are hidden in the table) and the corresponding graph.

Figure 11.16 UK birth rate per 1000 people from 1960-2018, Source: ONS

Figure 11.17 A graph of the time series from Figure 11.16 Excel Solution We have several ways to apply moving averages in Excel.


The first method is similar to what we did when we added a trend line to a time series. We right click on the time series in a graph and select Add Trendline. This invokes a dialogue box with several options included. We click on the option called “Moving Average” and change the number of periods to 5 as illustrated in Figure 11.18.

Figure 11.18 A dialogue box to invoke moving averages in Excel Excel will automatically chart the moving averages from the last observation in the period specified. If we selected a 5-period moving average, then the moving average function would start from observation five.

Figure 11.19 A graph of the time series from Figure 11.16 and its 5MA values

This shows us how to include moving averages graphically, but we still do not have the actual values of moving averages. In order to do this, we need to go to Data > Data analysis option. Select Data > Data analysis This will invoke further options, and we select Moving Average option, as shown in Figure 11.20.

Figure 11.20 A dialogue box to invoke Moving Averages algorithm in Excel Data Analysis Add-in In Figure 11.21 we are showing that we selected the range, i.e. cells C4:C62, which is where our time series resides, and we selected Interval of 5, which is the number of moving averages we decided to use. Further down in the same dialogue box, we selected the output to start in cell D4. We also selected the Chart Output and Standard Errors (more on which a bit later).

Figure 11.21 A dialogue box to define the length of the interval for moving averages in Excel

What we get is shown in Figure 11.22 (again, rows 15 – 55 hidden). Excel has inserted the 5MA values in column D and produced standard errors in column E. However, column D shows that the first 4 periods are not calculated, implying that Excel places moving averages not as centred (as we have shown in the previous SPSS example), but at the end of the moving average interval (in our case the first one is placed in the fifth period, because we are calculating 5MA).

Figure 11.22 Excel output after the dialogue box as in Figure 11.21

As before, the moving average is =AVERAGE(C4:C8), which is copied down. Column E shows an interesting-looking formula. Cell E12=SQRT(SUMXMY2(C8:C12,D8:D12)/5), which is the function for the standard error. However, in this case the denominator is 5, which corresponds with the number of moving averages. We'll explain this later in the chapter when we tackle the standard error again. The graph that was produced by Excel is given in Figure 11.23.

Figure 11.23 Excel automatic chart showing the original time series and its 5MA as forecasts


Once again, how is the moving average approach used to produce forecasts? As we already know, all we need to do is to shift the moving average calculations, as produced by Excel, by one observation. In other words, the moving average value for the first five observations (assuming we are using moving averages for five periods) becomes the forecast for the sixth observation. The seventh observation is predicted by using the next five period moving average (observations two to six), and so on. Figure 11.24 illustrates the point for five-period moving averages (5MA). In Figure 11.24 we are showing that we just inserted one blank cell D4 (we also inserted a blank cell in E4 to shift the SE down), which shifted all the calculations one row down (rows 15 – 55 hidden).

Figure 11.24 Modified Figure 11.22 to show how moving averages are converted into moving average forecasts The forecast based on 5MA looks as in Figure 11.25.


Figure 11.25 A graph of the time series and its 5MA forecasts from Figure 11.24

By just observing Figure 11.25 we can say that the 5-period moving average (5MA) follows the actual time series quite well and that the forecast of 11.62 for 2019 looks reasonably credible. We know from the previous two chapters that this statement requires more scrutiny, but we'll settle for it for now. We also calculated a 13MA interval and produced a one step ahead forecast. Figure 11.26 shows the comparison with the 5MA forecasts. As we already know, the 13MA forecasts seem to be even "smoother" than the 5MA forecasts.

Figure 11.26 A comparison between 5MA and 13MA forecasts for the time series However, there is a difficulty associated with this approach. Moving averages cannot extend our forecast beyond just one future period, which means that this method can only be used as a short-term forecasting method that predicts only one future observation. Moving averages are acceptable forecasting technique, providing we are interested in forecasting only one future period.


SPSS Solution Refer to Figure 11.9. All we need to do is to modify the SPSS Function from Centered moving average to Prior moving average for the same span observations. We will obtain the results identical to the results in Figure 11.24.

Mid-range forecasting with moving averages

If we need to extend our forecasts beyond just the immediate next forecast, for example to the following 2-6 observations, then we are effectively aiming to produce mid-range, or mid-term, forecasts. This implies that we will need to modify our simple moving average formula. In order to use moving averages as mid-term forecasts, we need to introduce one more concept, and that is the concept of double moving averages. For this reason, we will modify the notation. Simple (or single) moving averages will be called M′t or SMA and double moving averages M″t or DMA. If moving averages represent a 'rolling' average of the actual observations in the series, then we can also imagine a 'rolling' average of these moving averages, or double moving averages. Single moving averages are defined by equation (11.1) and double moving averages are:

M″t = (1/N) Σ M′i, where the sum runs over i = t−N+1, …, t    (11.5)

In other words, a moving average of the moving averages. Using single and double moving averages, we can construct a dynamic intercept and a dynamic slope coefficient, which will move and fluctuate as the original time series moves. These two coefficients are calculated as follows:

at = 2M′t − M″t    (11.6)

bt = (2/(N−1)) × (M′t − M″t)    (11.7)

Where N in the denominator is the number of moving averages in the interval. These two coefficients enable us to calculate forecasts that change dynamically as the time series changes. The formula is:

Ft+1 = Ŷt+1 = at + bt    (11.8)

Equation (11.8) will produce forecasts just one period ahead. We said that we were looking for a method that can forecast further into the future. Well, the answer is now very simple. If we need to extend forecasts m periods into the future, then the equation for double moving average (DMA) forecasts becomes:

Ft+m = Ŷt+m = at + bt × m    (11.9)

Where m is the number of future periods (1, 2, 3, ..., m). Equation (11.9) looks identical as a simple regression equation or a simple trend extrapolation. However, there are two major differences. Both simple regression and simple trend extrapolation use the fixed values of a and b. With double moving averages (DMA) equation, the coefficients at and bt are dynamic and they change from period to period. The second difference is that a simple trend extrapolation uses variable x, which represented time periods, starting from 1 and proceeding as consecutive numbers until the end of the time series. The forecasts were the continuation of the same number stream (if the time series has 20 observations, the value of x for the future calculations is 21, 22, 23, …). With DMA, the value of m applies to the future periods, and m always starts with 1 (the future values of m are: 1, 2, 3, …). Example 11.4 We’ll use the same data as in Example 11.3. Figure 11.27 summarises the whole procedure (as before, rows 15 - 55 are hidden). We are using 5-period moving averages.

Figure 11.27 Fitting the UK birth rate per 1000 between 1960-2018 time series with DMA forecasts

To calculate the first two SMAs, for years 1964 and 1965, for example, we use equation (11.1):

SMA5 = (17.5 + 17.9 + 18.3 + 18.5 + 18.8)/5 = 18.2

SMA6 = (17.9 + 18.3 + 18.5 + 18.8 + 18.3)/5 = 18.4

The last SMA, for year 2018, is:

SMA59 = (12.0 + 11.9 + 11.8 + 11.4 + 11.0)/5 = 11.6

DMAs for the same years are calculated using equation (11.5) as:

DMA5 = (18.2 + 18.4 + 18.4 + 18.2 + 17.9)/5 = 18.2

DMA6 = (18.4 + 18.4 + 18.2 + 17.9 + 17.5)/5 = 18.1

The last DMA, for year 2018, is:

DMA59 = (12.5 + 12.3 + 12.1 + 11.8 + 11.6)/5 = 12.1

Coefficients at and bt are calculated using equations (11.6) and (11.7). The examples for time periods 9 and 10 are:

a9 = 2M′9 − M″9 = 2 × 17.9 − 18.2 = 17.7

a10 = 2M′10 − M″10 = 2 × 17.5 − 18.1 = 16.9, etc.

b9 = (2/(5−1)) × (M′9 − M″9) = (2/4) × (17.9 − 18.2) = −0.1

b10 = (2/(5−1)) × (M′10 − M″10) = (2/4) × (17.5 − 18.1) = −0.3, etc.

One-step forecasts Ŷt+1 are calculated using equation (11.8). Again, we show here just periods 10 and 11:

Ŷ10 = F10 = a9 + b9 = 17.7 − 0.1 = 17.5

Ŷ11 = a10 + b10 = 16.9 − 0.3 = 16.6, etc.

When we reach the last observation, the values are SMA59=11.6, DMA59=12.1, a59=11.2 and b59=-0.2. To calculate forecasts m steps ahead, we use equation (11.9):

Ŷ59+1 = F60 = a59 + b59 × 1 = 11.2 − 0.2 × 1 = 11.0

Ŷ59+2 = F61 = a59 + b59 × 2 = 11.2 − 0.2 × 2 = 10.8

…

Ŷ59+5 = F64 = a59 + b59 × 5 = 11.2 − 0.2 × 5 = 10.2

Excel Solution

Implementing these calculations in Excel is very easy, as shown in Figure 11.28.

Figure 11.28 Double Moving Average (DMA) forecasts (5 observations)

In this example we used 5-period single moving averages (5SMA) and 5-period double moving averages (5DMA). They are shown in columns D and E. Columns F and G calculate at and bt, using equations (11.6) and (11.7). The first part of equation (11.7) contains the term 2/(N-1). As N represents the number of observations in the moving average interval (5 in this case), this translates into 2/(5-1) = 2/4 = 1/2. The past forecasts, i.e. the DMA fit of the existing time series, are given in cells H13:H61; the future forecasts are given in cells H62:H66 and were calculated using equation (11.9). The chart in Figure 11.29 shows the result, i.e. the graph of the original time series and its DMA forecasts.

Page | 665

Figure 11.29 Chart showing extrapolation using Double Moving Averages (DMA) forecasts

As we can see, because the values of the coefficients at and bt are dynamically calculated, the historical DMA forecast values 'mimic', or emulate, the movements of the original time series. Unfortunately, once we have reached the last observation in the series, these coefficients are "frozen" and all the future extrapolated values are linear. Clearly, the reason they are linear (a straight line) is that the formula for DMA forecasts is a linear formula, as per equation (11.9).

DMA, like any other method, has some advantages and some disadvantages. The values of the regression coefficients a and b (explained in the previous chapter) used for extrapolating the linear trend, for example, were based on minimising the squares of all the distances of every observation from the trend line. This gives some statistical credibility to these two coefficients. Unfortunately, the values of the coefficients at and bt used for forecasting the future with DMA are based only on the last moving average interval. This means that they do not really represent the whole time series. In other words, the basis for extrapolating the time series into the future is relatively short. This potentially increases the uncertainty of our forecasts and implies that the DMA forecasting method is only acceptable for short- to medium-term forecasts. On the other hand, if the time series is non-stationary, then the long-term history has very little relevance. It is the most recent history that is much more relevant for our immediate forecasts. Because DMA forecasts explicitly rely on the most recent history, this method is often much better suited to producing short- to medium-range forecasts for non-stationary time series.

SPSS Solution

SPSS does not offer a ready-made solution for DMA, but the formulae can be recreated using the Create Time Series option in the Transform menu.
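If you prefer scripting to spreadsheets, the same logic can be sketched in a few lines of Python. This is our own minimal illustration of equations (11.1) and (11.5)-(11.9); the function names and the short series in the example call are ours and purely illustrative, not part of the book's workbooks.

def sma(series, n):
    # Simple (single) moving averages, equation (11.1); the first value belongs to period n
    return [sum(series[i - n + 1:i + 1]) / n for i in range(n - 1, len(series))]

def dma_forecasts(series, n, m_ahead):
    # Double moving average (DMA) forecasts, equations (11.5)-(11.9)
    single = sma(series, n)                          # M't
    double = sma(single, n)                          # M''t, a moving average of the moving averages
    a_t = 2 * single[-1] - double[-1]                # dynamic intercept, equation (11.6)
    b_t = 2 / (n - 1) * (single[-1] - double[-1])    # dynamic slope, equation (11.7)
    return [a_t + b_t * m for m in range(1, m_ahead + 1)]   # equation (11.9)

# Illustrative call only: five forecasts from a made-up declining series
print(dma_forecasts([12.0, 11.9, 11.8, 11.4, 11.0, 10.9, 10.7, 10.5, 10.3, 10.1], n=5, m_ahead=5))

Note that only the last single and double moving averages are needed for the future forecasts, which is exactly why the method relies on such a short stretch of recent history.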

Check your understanding X11.1 Why is the simple mean (or, the average) the easiest and the “safest” way to predict future values of a stationary time series? X11.2 What is the difference between using moving averages as a general technique, versus moving averages as a forecasting method? X11.3 What is the impact on newly generated time series of moving averages if you increase the number of elements in the interval for moving averages? X11.4 What curve does the double moving average (DMA) forecasting method follow when extrapolating the values in the future? X11.5 Is it appropriate to use double moving averages (DMA) forecasting method for long-term forecasting?

11.3 Introduction to exponential smoothing

In order to introduce the exponential smoothing method, we need to assume that one way to think about observations in a time series is to say that the previous value in the series (yt-1), plus some error element (et), is the best predictor of the current value (ŷt). This can be expressed by equation (11.10):

ŷt = yt-1 + et			(11.10)

We can modify equation (11.10) and state that every new forecast is equal to the old forecast plus an adjustment for the error that occurred in the last forecast, i.e. et-1 = yt-1 − Ft-1, as presented in equation (11.11):

Ft = Ft-1 + et-1    or    Ft = Ft-1 + (yt-1 − Ft-1)			(11.11)

Where yt-1 is the actual result from period t − 1 and Ft-1 is the forecast for period t − 1. Remember equation (10.1) from the previous chapter? It states that Y = T + R, which is a similar principle to the one we are using here. Instead of using the trend value T, equation (11.11) uses the more general term Ft-1. This makes equations (10.1) and (11.11) similar in terms of the idea they convey. Let us now assume that the error element, i.e. (yt-1 − Ft-1), is zero. In this case the current forecast is the same as the previous forecast. However, if it is not zero, we can take the full impact of the error, or just a fraction of it. If we are going to take a fraction of the error, then this means that we need to multiply it by a value somewhere between 0 and 1. This is done using equation (11.12):

Ft = Ft-1 + α(yt-1 − Ft-1)			(11.12)

Page | 667

We use the letter α (alpha) to describe the fraction, and the word 'fraction' implies that α takes values between zero and one. If α = 0, then the current forecast is the same as the previous one. If α = 1, then the current forecast is the previous forecast plus the full amount of the deviation between the previous actual and forecasted value, which is simply the previous actual value. If, using the same formula, you substitute Ft-1 in equation (11.12) with the same expression, and then Ft-2, etc., you will see that the current Ft depends on all the past values of yt. When the past values of yt change over time by going upwards or downwards (non-stationary time series), then more recent observations are more important than older observations and they should be weighted accordingly. Simple exponential smoothing is a forecasting method that applies unequal weights to the time series data, i.e. we have the power to decide whether older or more recent data should gain more weight in deciding about the future.

Why a fraction of an error? If every current forecast/observation depends on the previous one, and this one depends on the one before, etc., then all the previous errors are in fact embedded in every current observation/forecast. By taking a fraction of the error, we are in fact discounting the influence that every previous observation and its associated error has on current observations/forecasts. So, in order to take just a fraction of that deviation, α must be greater than zero and smaller than one, i.e. 0 < α < 1. The forecasts calculated in such a way form a line that is smoother than the line formed by the actual observations. If we plot both the original observations and these newly calculated ex-post forecasts of the series, we'll see that the ex-post forecast curve eliminates some of the dynamics that the original observations exhibit. It is a smoother time series, just like the moving average series.

Equations (11.11) and (11.12) originate from Brown's single exponential smoothing method. The original Brown's formula states that:

S′t = αyt + (1 − α)S′t-1			(11.13)

Note that Brown uses the current observation yt rather than the previous one. This effectively means that Brown is using the exponentially smoothed values in the same way as we initially used moving averages; in other words, just as a smoothing technique. If we use Brown's original smoothing equation, then we must remember that Ft = S′t-1 (see equations (11.11) and (11.12)). Remember, this is the same principle as with moving averages, where we said that a moving average for an interval can be used either as an approximation value for the interval, or as a forecast for the time period that follows the interval. By changing the notation, equation (11.12) can also be rewritten as equation (11.14):

Ft = αyt-1 + (1 − α)Ft-1    or    Ft+1 = αyt + (1 − α)Ft			(11.14)

Equation (11.12) and the two forms of equation (11.14) are all identical and it is a matter of preference which one to use. They all provide identical forecasts based on smoothed approximations of the original time series. Page | 668

We implied that the smaller the α (i.e. the closer α is to zero), the smoother and more horizontal the series of newly calculated values is going to be. Conversely, the larger the α (i.e. the closer α is to one), the more impact the deviations have and potentially the more dynamic the fitted series is. When α = 1, the smoothed values are identical to the actual values, i.e. no smoothing is taking place.

There is also a connection between Brown's formula, i.e. equation (11.13), and the moving averages concept. You will recall from the section on moving averages that we used equation (11.3):

Mt = Mt-1 + (yt − yt-N) / N

The above equation can be written as:

Mt = Mt-1 + (1/N)(yt − yt-N)

This looks very much like equation (11.12). In this case, it looks as if α = 1/N. Although this is not strictly true (see the text below), the similarity between moving averages and exponential smoothing is obvious. This is the reason why exponential smoothing is sometimes called the exponentially weighted moving average (EWMA) method. The smoothing constant (α) and the number of elements in the interval for calculating moving averages are in fact related. The equation that defines this relationship is given by equation (11.15):

α = 2 / (M + 1)			(11.15)

In equation (11.15), M is the number of observations in the interval used to calculate the moving average. The formula indicates that the moving average for three observations that we used earlier is equivalent to α = 0.5. Equally, α = 0.2 is equivalent to M = 9. So, the smaller the value of the smoothing constant, the more horizontal the series will be, just as when a larger number of observations is used for the moving averages.

To reiterate, if in equation (11.14) we inserted yt-2 and Ft-2, and then yt-3 and Ft-3, etc., we would see that effectively we are multiplying the newer observations by larger weights and the older data in the series by smaller weights. By doing this we are in effect assigning a higher importance to the more recent observations. As we move further back in the series, the weight applied to the older observations drops exponentially. This is the reason why we have the word "exponential" in the phrase exponential smoothing. Every value in the time series is affected by all those that precede it, but the relative weight (importance) of these preceding values declines exponentially the further we go into the past. There is another useful interpretation of this fact. If we choose a small value of α (closer to zero), we are putting more weight on all observations, including the older ones; therefore, the time series of such exponentially smoothed values looks smoother. If we choose a larger value of α (closer to one), we are putting more weight on more recent observations; therefore, the time series of such exponentially smoothed values looks more like our original time series.

Forecasting with exponential smoothing

When discussing moving averages, we learned that, depending on where we place the moving average value, it can be considered either just a simple moving average value (if it is centred or at the end of the moving average interval), or a forecast obtained using a moving average (if it is placed one period after the moving average interval). For example, if we have a 3MA value and we place it in the third period, then we are simply implying that this value is a moving average of this interval of three observations. If, on the other hand, we put it in the fourth period position, then we imply that this moving average value is the forecast for the next period, based on the previous three. The same principle is valid when using exponential smoothing. If you remember that the past smoothed value can also be used as a forecast, then hopefully there is no confusion:

Ft+1 = S′t

(11.16)
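Before we turn to the worked example, here is a minimal Python sketch of the recursion in equations (11.14) and (11.16). It is our own illustration (the function name and the choice to initialise with the first observation are ours), not code from the book's workbooks.

def ses_forecasts(series, alpha):
    # One-step-ahead forecasts F[t] = alpha*y[t-1] + (1-alpha)*F[t-1], equation (11.14),
    # initialised with F0 = y0; the final element is the forecast for the period after the data
    forecasts = [series[0]]
    for y_prev in series:
        forecasts.append(alpha * y_prev + (1 - alpha) * forecasts[-1])
    return forecasts
    # forecasts[1:] are F1, F2, ...; the last value is the one-step-ahead forecast beyond the data

Changing alpha in the call is all that is needed to see how strongly the past is discounted.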

Example 11.5

As an example, let's use a short time series to demonstrate how to create forecasts using Brown's exponential smoothing method. The time series is given in Table 11.2 (same as Example 11.2):

Period   Yi
0        150
1        250
2        200
3        360
4        330
5        380
6        280
7        300
8        490
9        450

Table 11.2 A short time series with zero starting period

To start the smoothing process, we must choose the value of the smoothing constant α and the initial estimate of S′0. The value of S′0 is needed to determine S′1, which is calculated as S′1 = αy0 + (1 − α)S′0. In this example, we have chosen α = 0.3 and S′0 = y0 = 150. Exponentially smoothed values are calculated using equation (11.14):


F1 = αy0 + (1 − α)F0 = 0.3 × 150 + 0.7 × 150 = 150
F2 = αy1 + (1 − α)F1 = 0.3 × 250 + 0.7 × 150 = 180
F3 = αy2 + (1 − α)F2 = 0.3 × 200 + 0.7 × 180 = 186
F4 = αy3 + (1 − α)F3 = 0.3 × 360 + 0.7 × 186 = 238
F5 = αy4 + (1 − α)F4 = 0.3 × 300 + 0.7 × 238 = 257

Again, the calculations are easy to implement in Excel, as in Figure 11.30. Cell D6=C6, then D7=$C$3*C7+(1-$C$3)*D6, which is copied down. Forecasts in column F use the same formulae, except that they are shifted down by one cell.

Excel Solution

Figure 11.30 Applying simple exponential smoothing as a forecasting method

As was the case with moving averages, in order to forecast one value into the future we need to shift the exponential smoothing calculations one period ahead. The last exponentially smoothed value will, in effect, become a forecast for the following period. That is what we did in column F in Figure 11.30. As an alternative to this formula method, Excel provides the exponential smoothing method in the Data Analysis add-in pack (Select Data > Select Data Analysis > Select Exponential Smoothing).

Example 11.6

We will use the data from Examples 11.3 and 11.4 to create a new Example 11.6 that demonstrates the use of exponential smoothing via the Excel Data Analysis method. Figure 11.31 illustrates the data in Excel.

Page | 671

Figure 11.31 The same data as in Examples 11.3 and 11.4 The first step is to go to tab Data > Select Data Analysis > Select Exponential Smoothing.

Figure 11.32 Excel dialogue box to invoke exponential smoothing in the Data Analysis Add-In

Click OK to access the Exponential Smoothing menu. WARNING: Excel uses the expression "Damping factor" rather than smoothing constant, or α. The damping factor is defined as (1 − α). In other words, if you want α to be 0.3, you must specify the value of the damping factor in Excel as 0.7. Complete the menu inputs as illustrated in Figure 11.33 and click OK.

Page | 672

Figure 11.33 Dialogue box to define parameters in exponential smoothing In our case α=0.9, so we had to enter 0.1 in the dialogue box in Figure 11.33. If you wanted the Chart Output and the Standard Errors (more about that later), then you can tick the two boxes at the bottom, as shown in Figure 11.33. The final output is as in Figure 11.34 (we are again hiding rows 15:55).

Figure 11.34 Output from Excel after exponential smoothing dialogue box as in Figure 11.33 was completed (columns E and F) and manual calculations (column G) The values in column E were produced by the Excel app, and the values in column G are reproduced manually converting equation (11.13) into Excel formula. Cell G6=$G$2*C5+(1-$G$2)*E5, etc. As we said, we’ll return to SE values from column F later. At present just note that they are automatically calculated by Excel using formula: F8=SQRT(SUMXMY2(C5:C7,E5:E7)/3). From Figure 11.34, we observe the forecast for time point 2019 is 11.04.

Page | 673

As we can see, the only difference between the formulae in column E (Excel calculation) and column G (manual calculation) is that the manual calculation uses alpha (α = 0.9) and Excel uses the damping factor (1 − α = 1 − 0.9 = 0.1). The Data Analysis > Exponential Smoothing solution illustrated in Figure 11.34 always ignores the first observation and produces exponential smoothing from the second observation. It also cuts the exponential smoothing values short, as the last exponentially smoothed value corresponds with the last observation in the series. You can easily extend the last cell one period into the future, which is what we did (just Copy/Paste the last formula to the next cell down).

What becomes obvious from the example above is that by using the Excel routine you cannot change the value of α and automatically see what effect this has on your forecasts. This means that, as far as simple exponential smoothing is concerned, you are better off producing your own set of formulae. We will do this and compare two different values of alpha in Example 11.7.

Example 11.7

We are using the same data set as in Example 11.6, and the two different alpha values used are 0.1 and 0.9. Rows 15:55 are hidden in Figure 11.35.

Excel Solution

Figure 11.35 The use of two different values of alpha (0.1 and 0.9) for simple exponential smoothing

As before, D5=C4, D6=$D$2*C5+(1-$D$2)*D5 and F6=$E$2*C5+(1-$E$2)*F5, etc.

From Figure 11.35, we observe that the forecasts for time point 2019, given the two values α = 0.1 and α = 0.9, are 12.15 and 11.04 respectively. The results are charted in the graph in Figure 11.36. As we can see, the ex-post forecasts for α = 0.9 follow the original time series more closely, whilst the ex-post forecasts for α = 0.1 give us a much smoother line, as expected.

Figure 11.36 A graph of the two exponentially smoothed time series, using α = 0.1 and α = 0.9

The way we implemented these formulae, with a dedicated cell for the smoothing constant alpha (cells D2 and E2 in Figure 11.35), means that by changing just this one cell we can see the impact on how well our ex-post forecasts fit the original time series.

SPSS Solution

In this section we will explore how to use SPSS to produce forecasts using exponential smoothing for the same Example 11.7.

SPSS data file: Chapter 11 Example 7 Exponential smoothing.sav

Enter the data into SPSS – the first 15 data values are shown in Figure 11.37

Page | 675

Figure 11.37 Same data from Examples 11.3 and 11.4 in SPSS The last data points are illustrated in Figure 11.38 including the time point 60 for the 2019 forecast.

Figure 11.38 The last five observations of the data set from Figure 11.37 and the empty cell for the first future value of the time series Data > Define date and time…

Figure 11.39 A dialogue box for time-stamping the data set Define date by clicking on Define date and time (year, starting 1960)

Page | 676

Figure 11.40 Assigning time stamps to observations from the time series Click OK

Figure 11.41 The first 15 observations from the time series with the newly created time stamps

Figure 11.42 The last five and the first future observations with the time stamps Now run the analysis Analyze > Forecasting > Create Traditional Models

Page | 677

Figure 11.43 Selecting Create Traditional Models in SPSS to apply exponential smoothing

Analyze > Forecasting > Create Traditional Models
• Transfer the variable Series into the Dependent Variables box
• Select Method: Exponential Smoothing

Figure 11.44 A dialogue box to define exponential smoothing Click on Criteria button and choose Model Type – Nonseasonal – Simple

Page | 678

Figure 11.45 Selecting a simple exponential smoothing method Click Continue

Figure 11.46 Defining the series for exponential smoothing Select Statistics tab and choose options selected in Figure 11.47

Page | 679

Figure 11.47 Defining the outputs for exponential smoothing Select Plots tab and choose options selected in Figure 11.48

Figure 11.48 Defining the plots for exponential smoothing Select Save tab In Variables box choose to Save: Predicted values, Lower Confidence Limits, Upper Confidence Limits

Page | 680

Figure 11.49 Defining the prediction interval for exponential smoothing forecasts Select Options tab and choose Forecast Period: First case after end of estimation period through last case in active data set (2018) as illustrated in Figure 11.50

Figure 11.50 Defining the width of the confidence limits for the prediction interval Click OK SPSS data file Figures 11.51 and 11.52 illustrates the SPSS data file (rows 15 – 54 hidden) Page | 681

Figure 11.51 The first 15 observations, their exponentially smoothed forecasts and the prediction interval

Figure 11.52 The last five observations and one future exponentially smoothed forecast and the prediction interval From Figures 11.51 and 11.52, we observe the predicted (or forecast) value for time point 2019 is 11.0. SPSS Output SPSS will come up with the following printout.

Page | 682

Figure 11.53 Output from SPSS with a variety of error statistics

Figure 11.54 Output from SPSS with model statistics and model parameters

Figure 11.55 A single future exponentially smoothed forecast and the prediction interval

From Figure 11.55, we observe that the predicted (or forecast) value for time point 2019 is 11.0. The forecast values illustrated in Figure 11.55 are comparable but not identical to those calculated by Excel (the value for 2019 was 11.04) and shown in Figure 11.35. However, you will notice that SPSS did not ask us to determine the value of alpha. It optimised it to find the smallest RMSE. We are somewhat familiar with this already, but we will come back to it soon. Figure 11.56 illustrates a graph plot of the observed data values, the fitted predicted values from the smoothing model and the forecast value for time point 60 (the prediction interval is also included, though we'll cover it later).

Figure 11.56 SPSS graph with the actual time series, exponentially smoothed forecasts and the prediction interval Like single moving averages, simple exponential smoothing as a forecasting method can produce forecasts only one period ahead. To have longer, i.e. mid-range forecasts, a modification is needed.

Mid-range forecasting with exponential smoothing

In order to use simple exponential smoothing beyond just a single forecast, we must add a few simple equations. As we did with equations (11.5)-(11.8), we need to introduce double exponential smoothing (DES) values and the related parameters at and bt that will be used to produce linear forecasts beyond just one future value. By analogy with the double moving averages, we can say that double exponential smoothing values are exponentially smoothed values of the exponentially smoothed values. The single exponential smoothing (SES) equation (11.13) can be applied to construct double exponentially smoothed (DES) values:

S″t = αS′t + (1 − α)S″t-1			(11.17)

We are using a single apostrophe (S′) to symbolize single (SES) and a double apostrophe (S″) to symbolize double exponentially smoothed (DES) values. Again, using the analogy with the DMA forecasting method, the double exponential smoothing (DES) forecasting method is:

Ft+m = Ŷt+m = at + bt × m			(11.18)

The only difference is that the coefficients at and bt are calculated as:

at = 2S′t − S″t			(11.19)

bt = (α/(1 − α)) × (S′t − S″t)			(11.20)

Example 11.8

In Example 11.4 we produced forecasts using the DMA (Double Moving Average) method. In the following Example 11.8, we will use the same data set and produce forecasts using the DES (Double Exponential Smoothing) method. We will skip manual calculations and show only an Excel implementation. The value of alpha used is α = 0.3.

Excel Solution

Figure 11.57 DES forecasts for the same example as in 11.6

As before, the past forecasts, or the DES fit of the existing time series, are given in cells H4:H62. The future forecasts are given in cells H63:H67 and they were calculated using equation (11.18). The example below shows these calculations:

Ŷ59+1 = F60 = a59 + b59 × 1
Ŷ59+2 = F61 = a59 + b59 × 2
.
.
Ŷ59+5 = F64 = a59 + b59 × 5

In Figure 11.57, alpha is given in cell G2. SES values in column D are calculated as D5=$G$2*C5+(1-$G$2)*D4 and DES values in column E use the same pattern, E5=$G$2*D5+(1-$G$2)*E4. The dynamic intercept at in column F is F4=2*D4-E4 and the dynamic slope bt in column G is G4=($G$2/(1-$G$2))*(D4-E4). And finally, the DES ex-post forecasts in cells H5:H61 use the simple formula H5=F4+G4, whilst the future forecasts in H63:H67 use the formula H63=$F$62+$G$62*B63. The chart in Figure 11.58 shows the final result, i.e. the graph of the original time series and its DES forecasts.

Figure 11.58 A graph containing the DES forecasts for 5 years ahead

Compare these forecasts with the ones in Example 11.4. Although the future values are not identical (these were calculated using the DES method and the previous ones using the DMA method), they both show a linear extrapolation when going forward into the future.

SPSS Solution

Input the data into SPSS and include the dates, as illustrated in Figures 11.59 and 11.60. Figures 11.37 to 11.42 in Example 11.7 show how to time stamp the data, so we are not repeating them here.

Figure 11.59 Same as Figure 11.37, but with only 5 observations shown

Page | 686

Figure 11.60 Modified Figure 11.38 to show five future periods Notice the forecast time points are now 60 – 64. Select Analyze > Forecasting > Create Traditional Models Transfer Series into Dependent Variables box Choose Method: Exponential Smoothing

Figure 11.61 Selecting a variable for DES forecasts Click on Criteria and select Model Type Nonseasonal: Brown’s linear trend (this is how SPSS refers to the DES method)

Page | 687

Figure 11.62 Selecting Brown’s linear trend, which is equivalent to DES Click Continue

Figure 11.63 Defining the series for double exponential smoothing (DES) The steps that follow are identical to Figure 11.47-50, so will not show them here.

Figure 11.64 Selecting the future forecasting horizon The result is shown in Figures 11.65 and 11.66, where SPSS added more variables that define this DES, or Brown’s model.

Figure 11.65 The first 5 observations, their DES forecasts and the prediction interval Page | 688

Figure 11.66 The last five observations and five future DES forecasts and the prediction interval From Figure 11.66, the double exponential smoothing forecasts are 10.6, 10.2, 9.8, 9.5 and 9.1. The values in Figure 11.66 are slightly different from the values in Figure 11.57, which is the Excel version. If you look at SPSS output, you get the following:

Figure 11.67 Output from SPSS with model parameter details Clearly SPSS used the value of alpha as 0.864, whilst we used 0.3 in Excel. If you manually input alpha = 0.864 into Cell G2, then you will get the same solutions as SPSS. Later we will show how to optimise this value in Excel to get better Excel forecasts. Finally, Figure 11.68 shows the graph.

Figure 11.68 SPSS graph with the actual time series, DES forecasts and the prediction interval

There are methods based on the exponential smoothing principle that are not necessarily linear, such as the triple exponential smoothing (TES) method, but this is beyond the scope of this textbook.
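The whole DES procedure can also be condensed into a short Python sketch. This is our own illustration of equations (11.13) and (11.17)-(11.20); the initialisation (starting both smoothed series at the first observation) and the function name are our choices and only approximate what the Excel sheet in Figure 11.57 does.

def des_forecasts(series, alpha, m_ahead):
    # Brown's double exponential smoothing: returns the ex-post fit and the future forecasts
    s1 = s2 = series[0]                      # initialise S' and S'' with the first observation
    fit, a, b = [], None, None
    for y in series:
        fit.append(a + b if a is not None else y)    # one-step ex-post forecast, equation (11.18) with m = 1
        s1 = alpha * y + (1 - alpha) * s1            # S't, equation (11.13)
        s2 = alpha * s1 + (1 - alpha) * s2           # S''t, equation (11.17)
        a = 2 * s1 - s2                              # dynamic intercept, equation (11.19)
        b = alpha / (1 - alpha) * (s1 - s2)          # dynamic slope, equation (11.20)
    future = [a + b * m for m in range(1, m_ahead + 1)]   # equation (11.18)
    return fit, future

As with the spreadsheet version, changing a single value of alpha changes the whole fit, which is what makes the optimisation discussed in the next section so convenient.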

Check your understanding X11.6 What is the difference between single exponential smoothing as a technique to eliminate variations (smoothing technique) and single exponential smoothing as a forecasting method. X11.7 What is the range for the smoothing constant alpha and how does it impact the newly created time series of smoothed values? X11.8 Is there a difference between the smoothing constant alpha and damping factor in Excel? X11.9 What is the relationship between the smoothing constant and moving averages? X11.10 Would you describe the future double exponential smoothing forecasts as linear?

11.4 Handling errors for the moving averages or exponential smoothing forecasts

In the previous chapter we demonstrated how to use errors to validate long-term forecasts. In principle, there is no difference between the long-term and medium-term forecasts, at least as far as error handling is concerned. However, in the context of mid-term forecasts, in particular when using the exponential smoothing method, we can use errors to help us optimise the forecasts. Excel is very well suited to help us with this. We will use Example 11.9 to demonstrate how to execute the optimisation.

Example 11.9

Same example as before, and rows 15-55 are again hidden. All the formulae are the same as before.

Excel Solution

As we can see, the only addition to the spreadsheet from Example 11.8 is cell G1. This is the cell where we introduced the Mean Squared Error (MSE). Rather than adding a new column with errors, squaring them, adding them all up and finding the average (which would be the MSE), we used one single formula to calculate the MSE: =SUMXMY2(C5:C62,H5:H62)/COUNT(C5:C62).

Page | 690

Figure 11.69 DES forecasts for the same example as in Figure 11.57, but with the added MSE value

In our case, the value of the MSE is 0.37359, provided the value of alpha is α = 0.3. Let's see how, by manipulating the value of alpha, we can reduce the MSE. To use the MSE as a tool for determining the optimal value of the smoothing constant α, we need to use a small trick. We will use Excel's Solver add-in option, which can be found by clicking on the Data > Solver menu (see Figure 11.70).

Figure 11.70 Solver option in Excel By selecting the Solver option, a dialogue box appears, as in Figure 11.71. As we can see we are aiming (or as Excel would say: Set Objective) to change the values in the cell G1, which is the MSE, and we specified that we want this value to be the smallest possible value, i.e. the minimum. How are we going to do this? We are going to do this by allowing Excel to change all the possible values in G2 (the value of the smoothing constant alpha), until the value in G1 is the minimum.

Page | 691

Figure 11.71 Complete Solver dialogue box ready to optimise the content of cell G1 (get the minimum value), given the restrictions for cell G2

To recap, the logic behind the Solver function is to:

• Set Objective: G1 in our case
• By Changing Variable Cells: G2 in our case

If we want to impose some restrictions, under the heading of Subject to the Constraints we need to click on the Add button. This triggers another dialogue box, as in Figure 11.72. We need to restrict the value of alpha to a maximum of one and a minimum of zero.

Figure 11.72 Adding the Solver constraints

Note that we restricted the smoothing constant in cell G2 to a maximum of 0.9999. Sometimes Excel Solver does not converge to a solution if we put in the value of 1, hence 0.9999. Equally, we defined the minimum value as 0.0001, rather than 0, to help with the convergence towards the optimum solution. Figure 11.71 shows all the boxes completed in the Solver dialogue box before we click on the Solve button.

Once we have clicked the Solve button, Excel offers additional report sheets (reports on Answers, Sensitivity and Limits), but we will let you explore these independently. The result that Excel offers as the optimum smoothing constant that minimises the MSE is α = 0.85686. This value of alpha yields an MSE of 0.11894, which is an improvement over the existing 0.37359 for α = 0.3.

Figure 11.73 DES forecasts for the same example as in Figure 11.69, but with the optimised MSE value

If we look at the graph now, it is visible that our new ex-post forecasts are even closer to the historical values of the time series. Compare the forecast values from Figure 11.73 with the ones from Figure 11.68, which we obtained using SPSS, and you will see that they are practically identical. You can also see from Figure 11.67 that SPSS used an almost identical value of alpha (0.864) to the one obtained in our Excel example after using the Solver option (0.85686).

Page | 693

Figure 11.74 A graph containing the optimised DES forecasts for 5 years ahead

It suffices to say that we could have added other constraints if we had wanted to. The objective, i.e. the cell containing the MSE as a target, could also be changed. We could have used the ME, MAPE, RMSE, SE, or any other measure as a target. We could also have set a desired value of the MSE, for example. The possibilities are numerous, which makes this simple and elegant solution an ideal tool for optimising forecasts.

Sometimes when you run the Solver it will either not converge to any meaningful value, or it will give you the value #DIV/0!. Do not be discouraged. Either change the Solving Method in Excel from GRG Nonlinear to Simplex LP or Evolutionary, or modify the limit values for the constraints, as we did in Figure 11.72. Because alpha should be neither 1 nor 0, we modified the criteria and put 0.9999 and 0.0001. This helped with the convergence towards the solution.
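The same optimisation can be sketched without the Solver. The short grid search below is our own illustration and assumes the des_forecasts() helper sketched earlier in this chapter; Solver's GRG Nonlinear routine is more sophisticated, but with a single smoothing constant a fine grid over the same 0.0001-0.9999 range gets essentially the same answer.

def mse(series, fitted):
    # Mean squared error between the actual series and its ex-post forecasts
    return sum((y - f) ** 2 for y, f in zip(series, fitted)) / len(series)

def best_alpha(series, step=0.0001):
    # Try every alpha on a grid between 0.0001 and 0.9999 and keep the one with the smallest MSE
    candidates = [round(step * i, 6) for i in range(1, int(round(0.9999 / step)) + 1)]
    return min(candidates, key=lambda a: mse(series, des_forecasts(series, a, 1)[0]))

Any of the other error measures (ME, MAPE, RMSE) could be substituted for the MSE in exactly the same way.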

Prediction interval for short and mid-term forecasts

In Examples 11.3 and 11.6, when we selected moving averages and exponential smoothing respectively through one of Excel's Data Analysis apps, one of the dialogue boxes gave us the option to display Standard Errors. As we deferred the explanation of this statistic, we will briefly repeat the procedure, but this time we will focus on the Standard Error option.

In Example 11.3, in Figure 11.24, column E contained standard errors, as calculated by Excel. The formula that Excel used (the first cell was E12) is: =SQRT(SUMXMY2(C8:C12,D8:D12)/5). This is effectively the standard error of the estimate as defined by equation (10.14) from the previous chapter. The only difference is that equation (10.14) has n-2 in the denominator, whilst Excel used the number 5. The reason why Excel used the number 5 was to match the number of moving averages we selected. This means that the standard error, as calculated automatically by Excel in this case, is a dynamic standard error.

In other words, it is not constant for the whole data set, but it varies as the moving averages change. In Example 11.6, in Figure 11.34, column F contains standard errors calculated by Excel, but this time for the exponentially smoothed values. The first value, in cell F8, has an almost identical formula: =SQRT(SUMXMY2(C5:C7,D5:D7)/3). The only difference is that the denominator is 3. It is unclear why Excel uses 3, and it is used consistently regardless of the value of alpha. However, as before, this indicates that the standard error of the estimate is a dynamic value that changes from period to period.

Example 11.10

We will now use the same data as in Example 11.9, where we optimised the value of alpha using Excel Solver. In this Example 11.10 we will extend the calculations and use standard errors to calculate the prediction interval.

Excel Solution

Figure 11.75 illustrates the Excel solution (rows 15-55 hidden). Columns A:H contain formulae identical to Figure 11.69. We introduced a few new cells. L3 is 0.05 (this is the level of significance, also called alpha, but not to be confused with the alpha for the smoothing constant). L4 is the number of degrees of freedom, which is 57 (L4=COUNT(C4:C62)-2). L5 is the t-value, calculated as L5=T.INV.2T(L3,L4). Cells L6 and L7 give the same value of the standard error of the estimate for the whole data set, but they are calculated in two different ways, as: L6=SQRT(SUMXMY2(C5:C62,H5:H62)/(COUNT(C5:C62)-1)), and: L7=STEYX(H5:H62,C5:C62).

Figure 11.75 Same as Figure 11.73, but with the SE values calculated and the prediction interval

As we said, column I contains the dynamic Standard Errors, as produced automatically by Excel. We used these values to calculate the prediction interval for every ex-post

forecast, given in columns J and K. We are familiar with how to calculate the prediction interval, as we applied it in Chapter 10 (see equations (10.26)-(10.30) and Example 10.18): I8=SQRT(SUMXMY2(C5:C7,H5:H7)/3), J8=H8-($L$5*I8) and K8=H8+($L$5*I8). All of them are copied down to row 62.

When we reach the end of the actual time series (row 62 in Figure 11.75), the series of standard errors of the estimate calculated using the automatic Excel formula stops. This means that, in order to calculate the future prediction interval for the forecasts, we have to use some other value of the standard error. We have two options here. One is that for the first future prediction interval we use the last actual standard error (the prediction interval in this case is: ŷ60 ± tvalue × SE59). However, as this last value of the standard error might be skewed (because it is calculated for a small rolling number of observations), it is much safer to use the overall standard error of the estimate. Cell J63=H63-($L$5*I63*SQRT(B63)) and K63=H63+($L$5*I63*SQRT(B63)). The cells from J64 and K64 copied downwards are: J64=H64-($L$5*$L$6*SQRT(B64)) and K64=H64+($L$5*$L$6*SQRT(B64)). As a reminder, the equation for the future prediction interval is: ŷt ± tvalue × SE × √h. See equation (10.29) for the explanation of what is meant by h and why we are taking the square root of h. Also as a reminder, h represents the future increments (h = 1, 2, 3, …) that are used to correct the future prediction interval and anticipate the future uncertainty. Cells J63:K67 in Figure 11.75 show the future prediction interval.

Figure 11.76 The actual time series, DES forecasts and the prediction interval

As we can see, the two dotted lines, symbolizing (DES − t × SE) and (DES + t × SE), follow the red line that represents the DES forecasts. The interval is not of consistent width, because it is not calculated on the basis of a single standard error for the whole data set. It is calculated as a dynamic and moving series of standard errors for every rolling 3-period interval. For the final five prediction intervals we use the constant standard error (for the whole data set), but these are also multiplied by the square root of the h values 1, 2, 3, …, m for the future prediction intervals (see equation 10.29).

We know from equations (10.26) and (10.27) that SEŷ,y and RMSE can be used interchangeably for a prediction interval ŷt ± z × SEŷ,y, or ŷt ± z × RMSE. Equation (10.29) defines RMSEh (RMSE for multiple steps h) as:

RMSEh = √(h × MSE) = √(h × Σe²/n)

This means that for the prediction interval for multiple steps in the future, our equation is: ŷt ± tvalue × SEŷ,y × √h. Why the square root of h? SEŷ,y or RMSE is already a square root of the MSE, which means that we only need to take the square root of h. If you compare Figure 11.76 with Figure 11.68, you will see some differences between the Excel and SPSS solutions. First of all, if in Figure 11.75 we had used the constant value of SE, rather than the dynamic one, the Excel prediction interval would be identical to what we have in SPSS. The only true difference between the Excel and SPSS solutions is the future prediction interval. The SPSS future prediction interval is much wider than the one calculated using Excel. We stated that in Excel we would use a quick workaround, ŷt ± t × RMSEh, as proper solutions are too complicated for manual calculations. SPSS uses the proper algorithm to calculate the future prediction interval, hence the difference. However, these are the only differences between the two solutions.

SPSS Solution

There is no need to show how this works in SPSS given that these values were already printed out in Figures 11.65-11.68.
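Before moving on, the quick workaround for the future prediction interval can also be written as a small function. This is our own Python sketch of ŷ ± t × SE × √h using the overall standard error of the estimate; the t-value is passed in (Excel's T.INV.2T supplies it), and this is the simplified approach described above, not the exact algorithm SPSS uses.

import math

def future_prediction_intervals(future_forecasts, actual, fitted, t_value):
    # Overall standard error of the estimate of the ex-post fit (n - 2 in the denominator, as in STEYX)
    se = math.sqrt(sum((y - f) ** 2 for y, f in zip(actual, fitted)) / (len(actual) - 2))
    # For h = 1, 2, 3, ... the interval widens with the square root of h
    return [(f - t_value * se * math.sqrt(h), f + t_value * se * math.sqrt(h))
            for h, f in enumerate(future_forecasts, start=1)]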

Check your understanding X11.11 How would you use MSE to optimise your forecasts? X11.12 How is the standard error of forecast calculated in the Excel application related to exponential smoothing? X11.13 How would you construct the future prediction interval for forecasts produced using either double moving averages or double exponential smoothing? X11.14 Compare the results obtained using Excel function =STEYX() and manual formula =SQRT(SUMXMY2(known_Y, predicted_Ŷ)/(COUNT(Y)-2)). What conclusions can you draw? X11.15 How do you calculate the prediction interval for the future values, as opposed to historical values when fitting the model to the time series?

Page | 697

11.5 Handling seasonality using exponential smoothing forecasting

As an alternative to the classical time series decomposition approach, we can also use the exponential smoothing approach to forecast seasonal data. If you look at equation (11.12), you will see that effectively we said that the next forecast is equal to the previous forecast, plus a fraction of the past forecast error. We know that et = yt − Ft, and therefore equation (11.12) becomes:

Ft = Ft-1 + αet-1

(11.21)

Earlier we introduced the concept of double exponential smoothing and applied it in the double exponential smoothing (DES) method, suitable for following a linear time series trend and producing mid-range forecasts. This method is sometimes called Brown's linear exponential smoothing method. Equations (11.17)-(11.20) are used to execute the DES method, or Brown's linear exponential smoothing method. We will reuse these equations and combine them into one single equation. Effectively, combining equations (11.19)-(11.20) into equation (11.18) and substituting equations (11.13) and (11.17) for the single and double exponentially smoothed values, we get equation (11.22):

Ft = Ŷt = 2yt-1 − yt-2 − 2(1 − α)et-1 + (1 − α)² et-2

(11.22)

We will use this equation (11.22) and combine it with the trend/cycle/seasonal data principles we learned in the previous chapter.

Classical decomposition combined with exponential smoothing

Let's use an example to show how the principles of time series decomposition can be combined with exponential smoothing to produce credible seasonal forecasts.

Example 11.11

The data set used is the average quarterly CO2 emissions in ppm measured at the Mauna Loa observatory in Hawaii from Q1 2015 to Q2 2020. The objective is to forecast the next four quarters, until Q2 of 2021. As we have already covered most of the equations, we will go directly to Excel and implement the method. Figures 11.77 and 11.78 show the complete solution.

Excel Solution

The Excel solution is illustrated in Figure 11.77.

Page | 698

Figure 11.77 Time series decomposition combined with exponential smoothing for seasonal data

Figure 11.78 Calculations for time series decomposition combined with exponential smoothing

We are showing a somewhat different approach to decomposition here, so let's go through it step by step. The steps are numbered in blue boxes in Figures 11.77-11.78. In line with the decomposition philosophy, we will first try to isolate the Trend and Cycle components together. In the examples from the previous chapter, we isolated the Trend component by a simple trend calculation. However, if the time series is too short to reveal


cycles that appear on top of a trend, it is safer to use a different approach. The approach is: isolate the Trend and Cycle as two combined components bundled together (call it TC).

Step 1: If we calculate the moving averages of the data set, this moving average is effectively the TC component bundled together. All we need to do is to centre them. If we have an even number of seasons (four quarters in our case), then centring can be achieved by calculating an average of the two neighbouring averages. In column E in Figure 11.77 we take the average of the first four observations and the average of the next four observations. The average of these two averages is a centred average, and we put it in the middle of the year, or to the nearest quarter, which is Q3, i.e. cell E6. As we proceed, column E contains the combined centred Trend and Cycle (TC) component. In Figure 11.77 cell E6=(AVERAGE(D4:D7)+AVERAGE(D5:D8))/2 is copied down to cell E23.

Step 2: To establish how much the time series oscillates around this Trend/Cycle line, we need to divide the Y data values by the Trend/Cycle values. This ratio indicates seasonality, and we have it in column F in Figure 11.77. The number 1 means that this index is the same as the Trend/Cycle component. Numbers below 1 show the dips against the Trend/Cycle line, and numbers above 1 show the upswings against the same line. In Figure 11.77 cell F6=D6/E6 is copied down to F23.

Step 3: Because our time series consists of quarterly data, the next step is to find what could be called the typical quarterly indices. This has been achieved in cells P2:R5 in Figure 11.78. We can see that we actually take all Q1 data and average them, then Q2, etc. The typical seasonal indices are given in R2:R5 in Figure 11.78. Cell R6 is the sum of all quarterly indices. It adds up to 4, as it should (1 for every quarter), but sometimes this does not happen, which is the reason why we are showing this. If the sum were above or below the true sum (4 for quarterly data, or 12 for monthly data, for example), we would need to adjust our typical seasonal quarterly indices. We have done this in cells R2:R5 in Figure 11.78.

Step 4: The values of the typical seasonal index are copied to column G in Figure 11.77. You can see that the values from R2:R5 in Figure 11.78 are copied for every appropriate quarter in column G in Figure 11.77. We can now adjust our time series for seasonal effects. This is done in column H in Figure 11.77, and every value is nothing but the original data value divided by the typical seasonal index, i.e. H4=D4/G4 copied down.

Step 5: The forecasts for the seasonally adjusted values are calculated in column I in Figure 11.77, but you will notice that the first two cells (I4:I5) are just copies of the original time series. Equation (11.22) "kicks in" from cell I6. What we have in this column are double exponential smoothing (DES) forecasts produced from the seasonally adjusted time series in column H. Cell I6=2*H5-H4-2*(1-$M$3)*J5+((1-$M$3)^2)*J4 is copied down to I29.

Step 6: You will notice that the formula in column I makes a reference to column J, which contains the errors. It might seem paradoxical to take errors into account at the same time as we are producing forecasts. After all, an error is something that you realize afterwards, when you compare the actual with the forecasted value. However,

thanks to Excel's automation, we can do this immediately and the cells will be filled in as we copy them down. Cell J4=H4-I4, copied down to J29.

Step 7: To complete the forecasting process, we need to recompose our DES forecasts by multiplying them by the typical seasonal index. This is achieved in column K in Figure 11.78. Cell K4=I4*G4 is copied down to K29.

How well does this approach fit the original time series? Let us first look at the graph of the data set vs. its fit and forecasts. Figure 11.79 contains the graph. The alpha smoothing constant used (cell M3 in Figure 11.78) was 0.5.

Figure 11.79 A graph of the quarterly CO2 emissions in ppm measured at the Mauna Loa observatory in Hawaii from Q1 2015 to Q2 2020 and forecasts using the time series decomposition and exponential smoothing combined

As we can see, visually we have a good fit. For the first 5 periods our forecasts are not completely accurate, which is a consequence of using a very short time series (only 20 observations). For a longer time series, this initial mismatch would become even less relevant. From Example 11.8 we know that we can optimise the forecast by changing the value of alpha. This is achieved using Excel Solver. We have not done this here, but you can try it to see how much you can improve the forecasts. In Figure 11.78 we have the value of the MSE in cell M4=SUMXMY2(D4:D25,K4:K25)/COUNT(D4:D25) and the RMSE in cell M5=SQRT(M4), and you can use them to optimise the forecasts. We know that to properly validate our forecasts we need to do some analysis of the forecast errors, as well as to show the prediction interval for our forecasts. As before, we are keeping this for the very end of this section of the chapter.

SPSS Solution

Page | 701

SPSS does not have a “ready-made” solution that is identical to the approach we took here. However, it offers some other useful methods, and we will show how to use one of them in the next section.
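For readers who want to reproduce Steps 1-7 outside the spreadsheet, the sketch below is our own condensed Python version of the same idea: centred moving averages for the TC component, typical quarterly indices rescaled to sum to 4, DES forecasts of the deseasonalised series via equation (11.22), and finally re-seasonalised forecasts. The handling of the future periods (feeding earlier forecasts back in place of actual values, with zero errors) is our own assumption, since the text does not spell out how the last Excel rows are extended; at least two full years of data are assumed.

def seasonal_des_forecasts(y, alpha=0.5, periods=4, m_ahead=4):
    # Step 1: centred moving averages = the combined trend/cycle (TC) component
    tc = [(sum(y[i:i + periods]) / periods + sum(y[i + 1:i + 1 + periods]) / periods) / 2
          for i in range(len(y) - periods)]
    # Step 2: seasonal ratios y / TC (each ratio is aligned with observation i + periods//2)
    ratios = [y[i + periods // 2] / tc[i] for i in range(len(tc))]
    # Step 3: typical indices per quarter, rescaled so that they sum to `periods`
    by_q = {q: [] for q in range(periods)}
    for j, r in enumerate(ratios):
        by_q[(j + periods // 2) % periods].append(r)
    raw = [sum(by_q[q]) / len(by_q[q]) for q in range(periods)]
    idx = [v * periods / sum(raw) for v in raw]
    # Step 4: seasonally adjusted series
    adj = [y[t] / idx[t % periods] for t in range(len(y))]
    # Steps 5-6: DES forecasts of the adjusted series using equation (11.22), with errors
    fc, err = list(adj[:2]), [0.0, 0.0]
    for t in range(2, len(adj) + m_ahead):
        y1 = adj[t - 1] if t - 1 < len(adj) else fc[t - 1]
        y2 = adj[t - 2] if t - 2 < len(adj) else fc[t - 2]
        fc.append(2 * y1 - y2 - 2 * (1 - alpha) * err[t - 1] + (1 - alpha) ** 2 * err[t - 2])
        err.append(adj[t] - fc[t] if t < len(adj) else 0.0)
    # Step 7: re-seasonalise the fit and the future forecasts
    return [fc[t] * idx[t % periods] for t in range(len(fc))]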

Holt-Winters' seasonal exponential smoothing

One of the most effective forecasting methods for seasonal data, based on exponential smoothing, is the Holt-Winters' method. It uses three smoothing equations and three smoothing constants, but it is not as complex as it sounds. The three equations are dedicated to three different components, namely:

1. Level (ℓt)
2. Trend (bt)
3. Seasonality (St)

Effectively we are combining the linear regression components (intercept or level, and slope) with the seasonal component. The only difference is that in the case of the Holt-Winters' method, these components are dynamic. Recall that in simple regression we had only one value of a, which is the intercept, and one value of b, which is the slope. Here, we have changing values of at and bt. However, to avoid confusion, we will call the intercept the level ℓt, the trend bt and the seasonality component St. There is one more point to remember. The Holt-Winters' method comes in two "flavours": one is the model where these components are in an additive relationship, and the other where they are in a multiplicative relationship. We use the additive Holt-Winters' model mainly for stationary time series and the multiplicative Holt-Winters' model predominantly for non-stationary time series. And lastly, alpha (α), the smoothing constant, is used only for the level equation, beta (β) only for the trend equation, and gamma (γ) only for the seasonal equation. Let us show these three equations. For the additive model, the equations are:

ℓt = α(yt − St-s) + (1 − α)(ℓt-1 + bt-1)

(11.23)

𝑏𝑡 = 𝛽(ℓ𝑡 − ℓ𝑡−1 ) + (1 − 𝛽)𝑏𝑡−1

(11.24)

𝑆𝑡 = 𝛾(𝑦𝑡 − ℓ𝑡 ) + (1 − 𝛾)𝑆𝑡−𝑠

(11.25)

For the multiplicative model, the equations are:

ℓt = α(yt / St-s) + (1 − α)(ℓt-1 + bt-1)			(11.26)

bt = β(ℓt − ℓt-1) + (1 − β)bt-1			(11.27)

St = γ(yt / ℓt) + (1 − γ)St-s			(11.28)

Page | 702

The small s in the above equations is used for periodicity or seasonality (for quarterly data s=4, for monthly s=12, etc.). This means that forecasts for the additive and multiplicative models are produced respectively as: 𝐹𝑡+𝑚 = ℓ𝑡 + 𝑏𝑡 𝑚 + 𝑆𝑡−𝑠+𝑚

(11.29)

𝐹𝑡+𝑚 = (ℓ𝑡 + 𝑏𝑡 𝑚)𝑆𝑡−𝑠+𝑚

(11.30)

Where m is the number of forecasts ahead, and the values of the smoothing constants are as follows: 0 < α ≤ 1, 0 < β ≤ 1 and 0 < γ ≤ 1 − α. To start the calculations, the initial values of ℓt, bt and St are:

ℓ0 = (y1 + y2 + … + ys) / s			(11.31)

b0 = 0			(11.32)

S0,t = yt / ℓ0, for t = 1, …, s			(11.33)

The above equations indicate that ℓ0 is equal to the average value of just the first-year data, b0 is zero, and S0 is calculated only for the first year (because here t = 1, …, s) as a ratio between the individual observations and their first-year average. Note that if β = 0 and γ = 0, then the Holt-Winters' model is equivalent to single exponential smoothing (SES). Let us see how this works. We will use the same data set as in Example 11.11.

Example 11.12

The data set used is the average quarterly CO2 emissions in ppm measured at the Mauna Loa observatory in Hawaii from Q1 2015 to Q2 2020. The objective is to forecast the next four quarters, until Q2 of 2021. We will skip the manual equations and move directly into the Excel solution.

Excel Solution

Figure 11.80 shows how the model was implemented in Excel. Note that in all three equations we used the initial value of 0.5 for all three smoothing constants. This is arbitrary. We know that the range is 0 < α ≤ 1, 0 < β ≤ 1 and 0 < γ ≤ 1 − α, so 0.5 is as good as any other value in this range.

Page | 703

Figure 11.80 Holt-Winters' forecasting method for seasonal data

Cell E7=AVERAGE(D8:D11) and cell E8=$J$3*(D8/G4)+(1-$J$3)*(E7+F7), copied down to E29. Cell F7=0 and cell F8=$J$4*(E8-E7)+(1-$J$4)*F7, copied down to F29. Cell G4=D8/AVERAGE($D$8:$D$11), copied down to G7, and G8=$J$5*(D8/E8)+(1-$J$5)*G4, copied down to G29. This covers ℓt, bt and St. Forecasts Ft are calculated in column H, starting with H8=(E7+F7)*G4, copied down to H29. And finally, errors et in column I are calculated as I8=D8-H8, copied down to I29. The formulae in Figure 11.80 are direct implementations of the equations from this section in Excel format. Column E in Figure 11.80 is equation (11.26), column F equation (11.27), column G equation (11.28) and column H equation (11.30). To check the visual appearance of our forecasts (column H in Figure 11.80), we produce the graph in Figure 11.81.

Page | 704

Figure 11.81 A graph of the quarterly CO2 emissions in ppm measured at the Mauna Loa observatory in Hawaii from Q1 2015 to Q2 2020 and forecasts using the Holt-Winters’ method We can see more than a reasonable fit with the data set. However, we can again optimise this further using Excel Solver, as we did in Example 11.9. Example 11.13 demonstrates how we did the optimization here. Example 11.13 We will continue where we stopped in Example 11.12 by using Figure 11.80 to optimise the values of the three smoothing constants. Excel Solution For optimisation, we invoked the following Solver dialogue box:

Page | 705

Figure 11.82 Solver dialogue box for Holt-Winters' forecasts

As before, we are aiming to minimise cell J6 in Figure 11.80, which is the value of the MSE. The objective will be achieved by changing cells J3:J5 in Figure 11.80, which are the three smoothing constants. The constraints we used to limit the changes are that both α and β should be between zero and one, whilst γ should be between 0 and 1 − α. After we clicked on the Solve button, the improvements were not so significant. The MSE went down from 0.979 to 0.891. We set the initial value of all three smoothing constants to 0.5. After optimisation we got: α = 0.41, β = 0.31 and γ = 0.58. Although not a significant drop in MSE, we can still say that we optimised our forecasts. Figure 11.83 shows the optimised version of the same example.

Page | 706

Figure 11.83 Optimised constants for Holt-Winters’ method using Solver in Excel As we have not changed any formulae in the spreadsheet, except optimised the values in cells J3:J7, there is no need to repeat the Excel Solution descriptions. All the cells retain the same formulae as in Example 11.12. In Figure 11.84 we show the results of the optimised forecasts in a visual form. The results are not much different, but some improvement has been achieved.

Figure 11.84 Optimised forecasts (for the basic case see Figure 11.81) Page | 707

SPSS Solution The date is in year/quarters rather than in years compared to the previous examples.

Figure 11.85 Quarterly CO2 emissions in ppm measured at the Mauna Loa observatory in Hawaii from Q1 2015 to Q2 2020 time series in SPSS

Select Analyze > Forecasting > Create Traditional Model
Transfer Series into the Dependent Variables box
Select Method: Exponential Smoothing
Click on the Criteria box. Select Winters' multiplicative model

Figure 11.86 A dialogue box for selecting Winters’ multiplicative model, which is the Holt-Winters’ method Page | 708

Click on Continue

Figure 11.87 Selecting the variables for analysis Click on Statistics tab and choose options

Figure 11.88 Selecting the statistics to display in outputs Click on Plots tab and select options Page | 709

Figure 11.89 Selecting the plots to display Click on Save tab and select options

Figure 11.90 Selecting the confidence interval Click on Options tab and select options

Page | 710

Figure 11.91 Defining the confidence interval width Click OK New variables have been created showing predicted values, lower and upper confidence levels (i.e. the prediction interval), and the residuals, as in Figure 11.92.

Figure 11.92 Complete output with fitted data using the Winters’ multiplicative model (Holt-Winters’ method) together with the prediction interval and the future forecasts SPSS Output

Page | 711

Figure 11.93 Error statistics for the selected model

Figure 11.94 Model statistics

Figure 11.95 Constants for the Winters’ multiplicative model (Holt-Winters’ method)

Figure 11.96 Forecasts and the prediction interval for the future 4 observations

Page | 712

Figure 11.97 SPSS graph of the historical data and the future forecasts with the prediction interval

The results are not completely identical to the ones from Excel, for the simple reason that SPSS optimises the three constants differently from what we did with the Solver function in Excel, and it initialises the starting values in a different way. Nevertheless, the forecast values are in close vicinity to one another. We did not show how to calculate the prediction interval in this example using Excel, but it is identical to all the previous ones. However, in Figure 11.98 we show the Excel graph of the forecasts and the prediction interval. As an exercise, readers might like to reproduce these calculations in their versions of the spreadsheet.

Figure 11.98 Excel graph of the historical data and the future forecasts with the prediction interval
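If you would like to reproduce the interval in Figure 11.98 outside the spreadsheet, the minimal Python sketch below builds it as forecast plus or minus z times a standard error estimated from the in-sample errors. This is a simplified illustration of the principle only: the chapter's workbook uses the rolling standard error, z = 1.96 corresponds to a 95% interval, and the numbers shown here are placeholder values rather than results from the CO2 series.

def prediction_interval(forecasts, errors, z=1.96):
    # Standard error estimated as the RMSE of the in-sample errors (actual minus fitted).
    rmse = (sum(e * e for e in errors) / len(errors)) ** 0.5
    lower = [f - z * rmse for f in forecasts]
    upper = [f + z * rmse for f in forecasts]
    return lower, upper

in_sample_errors = [0.4, -0.9, 1.1, -0.3, 0.7, -0.6]   # placeholder residuals
future_forecasts = [407.1, 409.8, 410.5, 408.9]        # placeholder future forecasts
low, high = prediction_interval(future_forecasts, in_sample_errors)
for l, f, u in zip(low, future_forecasts, high):
    print(round(l, 2), "<=", f, "<=", round(u, 2))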

Check your understanding


X11.16 Use the dataset from Table 11.3 and apply the decomposition method using centred moving averages as the TC component. Forecast the next four periods in 2021.

Year   Q   Y        Year   Q   Y
2017   1   100.0    2019   1   141.0
2017   2   160.0    2019   2   204.0
2017   3   150.0    2019   3   201.0
2017   4   140.0    2019   4   179.0
2018   1   122.0    2020   1   150.0
2018   2   188.0    2020   2   200.0
2018   3   160.0    2020   3   220.0
2018   4   159.0    2020   4   200.0

Table 11.3 A seasonal time series data

X11.17 Use the forecasts from X11.16 and construct the future prediction interval using RMSE and h.

X11.18 Use the data from Table 11.3 (the same as in X11.16) and apply the Holt-Winters method to forecast the next four periods in 2021. Use 0.5 as the value for alpha, beta and gamma.

X11.19 Optimise the forecasts from X11.18 by using Microsoft Solver. What are the new values of alpha, beta and gamma?

X11.20 Calculate the prediction interval for the forecasts from X11.19.

Chapter summary

Short-term and medium-term forecasting methods rely very much on moving average techniques and exponential smoothing, both of which were introduced in this chapter. We started with simple, or single, moving averages (SMA) and demonstrated how this "smoothing technique" can also be used to produce short-term (one-period ahead) forecasts. This was followed by the introduction of double moving averages (DMA). They were necessary to introduce a simple DMA linear method, which enabled us to produce mid-term forecasts (2 to approximately 6 periods ahead). The next forecasting method was based on Brown's simple exponential smoothing technique. When this technique was converted into a short-term forecasting method, it became the single exponential smoothing (SES) method. We demonstrated how to use it to produce just one-period-ahead forecasts. Just as with DMA, we subsequently introduced the double exponential smoothing (DES) technique, which enabled us to produce DES forecasts that were linear in nature and capable of producing reasonable forecasts on a medium-term basis (2-6 periods ahead). We also reminded ourselves of the concept of forecasting error and explored how these errors could be used to improve forecasts. Specifically, we used MSE together with the Excel Solver function to optimise the value of the smoothing constant alpha.

This enabled us to achieve highly optimised forecasts that fit the actual time series more closely. The construction of the prediction interval for short-term and medium-term forecasts was also introduced. The application was very similar to what we learned in the previous chapter, with the exception that in this case we used the notion of the rolling standard error. We concluded the chapter by combining the principles of the classical time series decomposition method and exponential smoothing. At first, we used an improvised method to combine the two. This was followed by the introduction of the Holt-Winters' method, one of the best performing short to medium-range forecasting techniques suitable for seasonal time series.
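As a compact recap of the two basic techniques summarised above, the Python sketch below produces one-step-ahead SMA and SES forecasts and compares them by MSE. It is our own sketch of the general technique, not the chapter's workbook: initialising the first SES forecast with the first observation is an assumption, and the sample data are simply the first ten observations of the series used later in TU11.1.

def sma_forecasts(y, n=3):
    # One-step-ahead forecast: the average of the previous n observations.
    return [sum(y[t - n:t]) / n for t in range(n, len(y))]

def ses_forecasts(y, alpha=0.3):
    # One-step-ahead forecast: F(t) = alpha * y(t-1) + (1 - alpha) * F(t-1).
    f = [y[0]]          # first forecast initialised with the first observation
    for t in range(1, len(y)):
        f.append(alpha * y[t - 1] + (1 - alpha) * f[-1])
    return f

def mse(actual, forecast):
    errors = [a - f for a, f in zip(actual, forecast)]
    return sum(e * e for e in errors) / len(errors)

series = [26.21, 25.72, 27.51, 28.32, 28.42, 28.28, 27.1, 32.35, 35.33, 33.35]
print("SES (alpha = 0.3) MSE:", round(mse(series[1:], ses_forecasts(series)[1:]), 3))
print("3-period SMA MSE:", round(mse(series[3:], sma_forecasts(series, 3)), 3))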

Test your understanding

TU11.1 Take the below time series that consists of 30 observations:

Time   Series   Time   Series   Time   Series
1      26.21    11     36.41    21     29.24
2      25.72    12     29.14    22     28.75
3      27.51    13     28.42    23     28.02
4      28.32    14     28.58    24     26.69
5      28.42    15     29.05    25     25.08
6      28.28    16     30.25    26     23.39
7      27.1     17     29.42    27     22.65
8      32.35    18     27.38    28     22.02
9      35.33    19     27.68    29     23.39
10     33.35    20     30.22    30     26.35

Table 11.4

a) Calculate single moving average (SMA) forecasts for the following periods: 3, 5, 7 and 9. What can you conclude?
b) Calculate double moving averages (DMA) for the SMA values from a).

TU11.2 For the same time series as in TU11.1, calculate the following:
a) Calculate single exponential smoothing (SES) forecasts for the following values of alpha: 0.1, 0.3, 0.5 and 0.9. What can you conclude?
b) Calculate double exponential smoothing (DES) values for the SES values from a).

TU11.3 For the same time series as in TU11.1, calculate the following:
a) Produce forecasts using 5DMA for 6 periods into the future.
b) Produce forecasts using DES with alpha = 0.3 for 6 periods into the future.
c) Calculate the ME and MSE for DMA and DES. Which forecast would you recommend?

TU11.4 Optimise alpha using Solver. Which forecast is better now?

TU11.5 For the forecast you decided is the best from TU11.4, calculate the prediction interval for the fitted series and the future 6 forecasts.

TU11.6 As part of a longitudinal study, look at the UK birth rates. Table 11.5 below shows the data from 1960 to 2018.

Year   Birth rate   Year   Birth rate   Year   Birth rate
1960   17.5         1980   13.4         2000   11.5
1961   17.9         1981   13           2001   11.3
1962   18.3         1982   12.8         2002   11.3
1963   18.5         1983   12.8         2003   11.7
1964   18.8         1984   12.9         2004   11.9
1965   18.3         1985   13.3         2005   12
1966   17.9         1986   13.3         2006   12.3
1967   17.5         1987   13.7         2007   12.6
1968   17.2         1988   13.8         2008   12.9
1969   16.6         1989   13.6         2009   12.7
1970   16.2         1990   13.9         2010   12.9
1971   16.1         1991   13.8         2011   12.8
1972   14.9         1992   13.6         2012   12.8
1973   13.9         1993   13.2         2013   12.1
1974   13.1         1994   13           2014   12
1975   12.4         1995   12.6         2015   11.9
1976   12           1996   12.6         2016   11.8
1977   11.7         1997   12.5         2017   11.4
1978   12.2         1998   12.3         2018   11.0
1979   13.1         1999   11.9         2019

Table 11.5 UK birth rate per 1,000 people

a) Produce forecasts for 2019 using a linear trend, a parabola, a power trend and a polynomial trend, as well as 5MA and SES with alpha = 0.3.
b) Use error metrics to evaluate the best method (use ME, MAD, MSE, RMS, MPE and MAPE).
c) Calculate the prediction interval for the chosen method.

Want to learn more?

The textbook online resource centre contains a range of documents to provide further information on the following topics:
1. S11Wa Different ways to implement exponential smoothing forecasting in Excel
2. S11Wb Excel ETS seasonal exponential forecasting


Appendices

Appendix A Microsoft Excel Functions

Table A.1 provides a list of all Excel functions that you may find helpful in solving business statistics type problems. Each Excel function name includes a link to the Microsoft support website for that function.

Table A.1 Excel functions and their descriptions

1. AVEDEV: Returns the average of the absolute deviations of data points from their mean
2. AVERAGE: Returns the average of its arguments
3. AVERAGEA: Returns the average of its arguments, including numbers, text, and logical values
4. AVERAGEIF: Returns the average (arithmetic mean) of all the cells in a range that meet a given criteria
5. AVERAGEIFS: Returns the average (arithmetic mean) of all cells that meet multiple criteria
6. BASE: Converts a number into a text representation with the given radix (base)
7. BINOM.DIST: Returns the individual term binomial distribution probability
8. BINOM.DIST.RANGE: Returns the probability of a trial result using a binomial distribution
9. BINOM.INV: Returns the smallest value for which the cumulative binomial distribution is less than or equal to a criterion value
10. BINOMDIST: Returns the individual term binomial distribution probability
11. CHIDIST: Returns the one-tailed probability of the chi-squared distribution
12. CHIINV: Returns the inverse of the one-tailed probability of the chi-squared distribution
13. CHISQ.DIST: Returns the cumulative beta probability density function
14. CHISQ.DIST.RT: Returns the one-tailed probability of the chi-squared distribution
15. CHISQ.INV: Returns the inverse of the left-tailed probability of the chi-squared distribution
16. CHISQ.INV.RT: Returns the inverse of the one-tailed probability of the chi-squared distribution
17. CHISQ.TEST: Returns the test for independence
18. CHITEST: Returns the test for independence
19. COMBIN: Returns the number of combinations for a given number of objects
20. COMBINA: Returns the number of combinations with repetitions for a given number of items
21. CONFIDENCE: Returns the confidence interval for a population mean
22. CONFIDENCE.NORM: Returns the confidence interval for a population mean
23. CONFIDENCE.T: Returns the confidence interval for a population mean, using a Student's t distribution
24. CORREL: Returns the correlation coefficient between two data sets
25. COUNT: Counts how many numbers are in the list of arguments
26. COUNTA: Counts how many values are in the list of arguments
27. COUNTBLANK: Counts the number of blank cells within a range
28. COUNTIF: Counts the number of cells within a range that meet the given criteria
29. COUNTIFS: Counts the number of cells within a range that meet multiple criteria
30. COVAR: Returns covariance, the average of the products of paired deviations
31. COVARIANCE.P: Returns covariance, the average of the products of paired deviations
32. COVARIANCE.S: Returns the sample covariance, the average of the products of deviations for each data point pair in two data sets
33. CRITBINOM: Returns the smallest value for which the cumulative binomial distribution is less than or equal to a criterion value
34. DEVSQ: Returns the sum of squares of deviations
35. EXP: Returns 'e' raised to the power of a given number
36. EXPON.DIST: Returns the exponential distribution
37. EXPONDIST: Returns the exponential distribution
38. F.DIST: Returns the F probability distribution
39. F.DIST.RT: Returns the (right-tailed) F probability distribution for two data sets
40. F.INV: Returns the inverse of the F probability distribution
41. F.INV.RT: Returns the inverse of the (right-tailed) F probability distribution
42. F.TEST: Returns the result of an F-test
43. FACT: Returns the factorial of a number
44. FACTDOUBLE: Returns the double factorial of a number
45. FDIST: Returns the F probability distribution
46. FINV: Returns the inverse of the F probability distribution
47. FORECAST: Returns a value along a linear trend
48. FORECAST.ETS: Uses an exponential smoothing algorithm to predict a future value on a timeline, based on a series of existing values
49. FORECAST.ETS.CONFINT: Returns a confidence interval for a forecast value at a specified target date
50. FORECAST.ETS.SEASONALITY: Returns the length of the repetitive pattern Excel detects for a specified time series
51. FORECAST.ETS.STAT: Returns a statistical value relating to a time series forecasting
52. FORECAST.LINEAR: Predicts a future point on a linear trend line fitted to a supplied set of x- and y-values
53. FREQUENCY: Returns a frequency distribution as a vertical array
54. FTEST: Returns the result of an F-test
55. GEOMEAN: Returns the geometric mean
56. HARMEAN: Returns the harmonic mean
57. HYPGEOM.DIST: Returns the hypergeometric distribution
58. HYPGEOMDIST: Returns the hypergeometric distribution
59. IF: Specifies a logical test to perform
60. IFS: Tests a number of supplied conditions and returns a result corresponding to the first condition that evaluates to TRUE
61. INT: Rounds a number down to the nearest integer
62. INTERCEPT: Returns the intercept of the linear regression line
63. KURT: Returns the kurtosis of a data set
64. LARGE: Returns the k-th largest value in a data set
65. LINEST: Returns the parameters of a linear trend
66. MAX: Returns the largest value from a list of supplied numbers
67. MAXIFS: Returns the largest value from a subset of values in a list that are specified according to one or more criteria
68. MIN: Returns the smallest value from a list of supplied numbers
69. MINIFS: Returns the smallest value from a subset of values in a list that are specified according to one or more criteria
70. MEDIAN: Returns the median of the given numbers
71. MODE: Returns the most common value in a data set
72. NEGBINOM.DIST: Returns the negative binomial distribution
73. NEGBINOMDIST: Returns the negative binomial distribution
74. NORM.DIST: Returns the normal cumulative distribution
75. NORM.INV: Returns the inverse of the normal cumulative distribution
76. NORM.S.DIST: Returns the standard normal cumulative distribution
77. NORM.S.INV: Returns the inverse of the standard normal cumulative distribution
78. NORMDIST: Returns the normal cumulative distribution
79. NORMINV: Returns the inverse of the normal cumulative distribution
80. NORMSDIST: Returns the standard normal cumulative distribution
81. NORMSINV: Returns the inverse of the standard normal cumulative distribution
82. PEARSON: Returns the Pearson product moment correlation coefficient
83. PERCENTILE: Returns the k-th percentile of values in a range
84. PERCENTILE.EXC: Returns the k-th percentile of values in a range, where k is in the range 0 to 1, exclusive
85. PERCENTILE.INC: Returns the k-th percentile of values in a range, where k is in the range 0 to 1, inclusive
86. PERCENTRANK: Returns the percentage rank of a value in a data set
87. PERCENTRANK.EXC: Returns the rank of a value in a data set as a percentage (0..1, exclusive) of the data set
88. PERCENTRANK.INC: Returns the percentage rank of a value in a data set
89. PERMUT: Returns the number of permutations for a given number of objects
90. PERMUTATIONA: Returns the number of permutations for a given number of objects (with repetitions) that can be selected from the total objects
91. PI: Returns the value of pi
92. POISSON: Returns the Poisson distribution
93. POISSON.DIST: Returns the Poisson distribution
94. POWER: Returns the result of a number raised to a power
95. QUARTILE: Returns the quartile of a data set
96. QUARTILE.EXC: Returns the quartile of the data set, based on percentile values from 0..1, exclusive
97. QUARTILE.INC: Returns the quartile of a data set
98. RAND: Returns a random number between 0 and 1
99. RANDBETWEEN: Returns a random number between the numbers you specify
100. RANK: Returns the rank of a number in a list of numbers
101. RANK.AVG: Returns the rank of a number in a list of numbers
102. RANK.EQ: Returns the rank of a number in a list of numbers
103. ROUND: Rounds a number to a specified number of digits
104. ROUNDDOWN: Rounds a number down, toward zero
105. ROUNDUP: Rounds a number up, away from zero
106. RSQ: Returns the square of the Pearson product moment correlation coefficient
107. SKEW: Returns the skewness of a distribution
108. SKEW.P: Returns the skewness of a distribution based on a population: a characterization of the degree of asymmetry of a distribution around its mean
109. SLOPE: Returns the slope of the linear regression line
110. SMALL: Returns the k-th smallest value in a data set
111. SQRT: Returns a positive square root
112. STANDARDIZE: Returns a normalized value
113. STDEV: Estimates standard deviation based on a sample
114. STDEV.P: Calculates standard deviation based on the entire population
115. STDEV.S: Estimates standard deviation based on a sample
116. STDEVA: Estimates standard deviation based on a sample, including numbers, text, and logical values
117. STDEVP: Calculates standard deviation based on the entire population
118. STDEVPA: Calculates standard deviation based on the entire population, including numbers, text, and logical values
119. STEYX: Returns the standard error of the predicted y-value for each x in the regression
120. SUM: Adds its arguments
121. SUMIF: Adds the cells specified by a given criteria
122. SUMIFS: Adds the cells in a range that meet multiple criteria
123. SUMPRODUCT: Returns the sum of the products of corresponding array components
124. SUMSQ: Returns the sum of the squares of the arguments
125. SUMX2MY2: Returns the sum of the difference of squares of corresponding values in two arrays
126. SUMX2PY2: Returns the sum of the sum of squares of corresponding values in two arrays
127. SUMXMY2: Returns the sum of squares of differences of corresponding values in two arrays
128. T: Converts its arguments to text
129. T.DIST: Returns the Student's left-tailed t-distribution
130. T.DIST.2T: Returns the cumulative, two-tailed Student's t-distribution
131. T.DIST.RT: Returns the cumulative, right-tailed Student's t-distribution
132. T.INV: Returns the t-value of the Student's t-distribution as a function of the probability and the degrees of freedom
133. T.INV.2T: Returns the two-tailed inverse of the Student's t-distribution
134. T.TEST: Returns the probability associated with a Student's t-test
135. TDIST: Returns the Student's t-distribution
136. TINV: Returns the inverse of the Student's t-distribution
137. TREND: Returns values along a linear trend
138. TRUNC: Truncates a number to an integer
139. TTEST: Returns the probability associated with a Student's t-test
140. VAR: Estimates variance based on a sample
141. VAR.P: Calculates variance based on the entire population
142. VAR.S: Estimates variance based on a sample
143. VARA: Estimates variance based on a sample, including numbers, text, and logical values
144. VARP: Calculates variance based on the entire population
145. Z.TEST: Returns the one-tailed probability-value of a z-test
146. ZTEST: Returns the one-tailed probability-value of a z-test
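As a rough guide to what some of the listed functions actually compute, the short Python sketch below reproduces the calculations behind three of them (AVERAGE, STDEV.S and CONFIDENCE.NORM) for a small sample. This is our own illustration of the underlying formulas, not part of the original table, and the sample values are arbitrary.

import math

data = [12.1, 11.4, 13.2, 12.8, 11.9, 12.5]    # arbitrary sample values

mean = sum(data) / len(data)                                                  # AVERAGE
sample_sd = math.sqrt(sum((x - mean) ** 2 for x in data) / (len(data) - 1))   # STDEV.S

# CONFIDENCE.NORM(alpha, sd, n) returns the half-width z * sd / sqrt(n);
# 1.959964 is the z value for alpha = 0.05, i.e. a 95% interval.
half_width = 1.959964 * sample_sd / math.sqrt(len(data))

print("mean =", round(mean, 3), "sd =", round(sample_sd, 3),
      "95% half-width =", round(half_width, 3))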

Appendix B Areas of the standardised normal curve


Appendix C Percentage points of the Student’s t distribution (5% and 1%)


Appendix D Percentage points of the chi-square distribution


Appendix E Percentage points of the F distribution

Upper 5%


Upper 2.5%


Upper 1%


Appendix F Binomial critical values


Appendix G Critical values of the Wilcoxon matched-pairs signedranks test Source: White, Yeats, Skipworth - Tables for statisticians (1979, table 21, p32, Stanley Thornes Ltd).


Appendix H Probabilities for the Mann–Whitney U test Source: White, Yeats, Skipworth - Tables for statisticians (1979, table 22, pages 33 - 35, Stanley Thornes Ltd).

Mann–Whitney p-values (n2 = 3)

Mann–Whitney p-values (n2 = 4)

Mann–Whitney p-values (n2 = 5)


Mann–Whitney p-values (n2 = 6)

Mann–Whitney p-values (n2 = 7)


Mann–Whitney p-values (n2 = 8)


Appendix I Statistical glossary Adjusted r2 This is the same as R-squared but adjusted for the sample size (see Rsquared or coefficient of determination). Alpha, α Alpha refers to the probability that the true population parameter lies outside the confidence interval. Not to be confused with the symbol alpha in a time series context (exponential smoothing), where alpha is the smoothing constant. Alternative hypothesis (H1) The alternative hypothesis, H1, is a statement of what a statistical hypothesis test is set up to establish. Availability sampling See convenience sampling. Average This is a vague term for central tendency that usually is equivalent to the arithmetic average (or mean) but could also be represented by the median, mode, or geometric mean. Average from a frequency distribution Arithmetic average for data in a frequency distribution. Arithmetic mean The sum of a list of numbers divided by the number of numbers. Bar chart A bar chart is a way of summarising a set of categorical data using bar shapes. Beta,  Beta refers to the probability that a false population parameter lies inside the confidence interval. In the context of exponential smoothing, it is one of the smoothing constants. Binomial experiment An experiment consisting of a fixed number of independent trials each with two possible outcomes, success and failure, and the same probability of success. The probability of a given number of successes is described by a binomial distribution. Binomial distribution A distribution created by binomial experiments. A binomial experiment is a statistical experiment that has the following properties: n repeated trials, each trial with two possible outcomes; probability of success p; and trials are independent. Bootstrapping It is a statistical procedure that resamples a single dataset to create many simulated samples. Box plot A box plot is a way of summarising a set of data measured on an interval scale (also called a box-and-whisker plot). Brown’s single exponential smoothing method Identical to single exponential smoothing (SES), with the exception that it is placed in line with the observation that is smoothed, as opposed to SES which is placed as the next value and, as such, becomes a forecast. Categorical A variable whose value ranges over categories, such as (red, green, blue), or (male, female). Category A class or division of people or things regarded as having shared characteristics. Central limit theorem States that when a large number of simple random samples are selected from the population and the mean is calculated for each sample, the distribution of these sample means will assume the normal probability distribution, even if the original population from which the samples were selected was not normally distributed. Central tendency This is a catch-all term for the location of the middle or the centre of a distribution. Page | 735

Characteristics of a random experiment A random experiment is a statistical experiment that has the following properties: the experiment can be repeated any number of times; a random trial consists of at least two possible outcomes (e.g. success/failure, Monday/Tuesday/Wednesday); and the result depends on chance and cannot be predicted uniquely. Chart A chart is a graphical representation of data. Chi-square distribution The chi-square distribution is a mathematical distribution that is used directly or indirectly in many tests of significance. Chi-square test This applies the chi-square distribution to test for homogeneity, independence, or goodness of fit. Chi-square test for independence A chi-square test for independence is applied when you have two categorical variables from a single population. Chi-square test for two independent samples The chi-square test of independence tests if there is no difference in the distribution of responses to the outcome across comparison groups. Chi-square test for variance The chi-square test for variance is used to test the null hypothesis that the variance of the population from which the data sample is drawn is equal to a hypothesised value. Class interval In creating a frequency distribution or plotting a histogram, one starts by dividing the range of values into a set of non-overlapping intervals, called class intervals, in such a way that every datum is contained in some class interval. Class limit (or class boundary) A point that is the left endpoint of one class interval, and the right endpoint of another class interval. Class mid-point For any given class interval of a frequency distribution, this is the value halfway across the class interval, (upper class limit + lower class limit)/2. Class widths equal The distance between the lower- and upper-class limits for each class has the same numerical value. Class widths unequal The distance between the lower- and upper-class limits for each class does not have the same numerical value. Cluster sampling This is a sampling technique used when ‘natural’ but relatively homogeneous groupings are evident in a statistical population. It is often used in marketing research. In this technique, the total population is divided into these groups (or clusters) and a simple random sample of the groups is selected. Coefficient of correlation This measures the strength and direction of the linear relationship between two variables. Sometimes it is referred to as the Person product moment coefficient of correlation. Coefficient of determination (COD) This is the proportion of the variance in the dependent variable that is predicted from the independent variable. Also called Rsquared or r2. Coefficient of variation The coefficient of variation measures the spread of a set of data as a proportion of its mean. It is often expressed as a percentage. Component bar chart A subdivided or component bar chart is used to represent data in which the total magnitude is divided into different or components. Confidence interval A range of likely values for estimates and forecasts, usually expressed as 90%, 95%, 99% or any other interval from the trend, within which the likely values of the forecast reside (also known as a prediction interval). Confidence interval of a population mean This is an interval estimate for the population mean based upon the sample mean.


Confidence interval of a population proportion This is an interval estimate for the population proportion based upon the sample proportion. Confidence interval of a population mean when the sample size is small This is an interval estimate for the population mean based upon the sample mean where the sample size is small. Consistent estimator This is an estimator having the property that as the number of data points used increases indefinitely, the resulting sequence of estimates converges to the population value. Contingency table A contingency table is a table of frequencies classified according to the values of the variables in question. Continuous A set of data is said to be continuous if the values/observations belonging to it may take on any value within a finite or infinite interval. You can count, order and measure continuous data. Continuous probability distribution If a random variable is a continuous variable, its probability distribution is called a continuous probability distribution. Continuous random variable This is a random variable where the data can take infinitely many values. Convenience sampling This is a non-probability sampling technique where subjects are selected because of their convenient accessibility and proximity to the researcher. This method is also sometimes referred to as haphazard, accidental, or availability sampling. Coverage error Coverage error occurs in statistical estimates of a survey. It results from gaps between the sampling frame and the total population. This can lead to biased results and can affect the variance of results. Critical test statistic The critical value for a hypothesis test is a limit at which the value of the sample test statistic is judged to be such that the null hypothesis may be rejected. Cumulative distribution function A function whose value is the probability that a corresponding continuous random variable has a value less than or equal to the argument of the function. See also probability density function. Cumulative frequency distribution A cumulative frequency distribution is the sum of the class and all classes below it in a frequency distribution. Rather than displaying the frequencies from each class, a cumulative frequency distribution displays a running total of all the preceding frequencies. Damping factor A unique factor only named like this in Microsoft Excel. Its value is 1 – α (where α is a smoothing constant). This naming convention is unique to Excel and is not used in any other software or textbook. Degrees of freedom The number of degrees of freedom is the number of values in the final calculation of a statistic that are free to vary. Dependent populations Two populations are dependent if the measured values of the items observed in one population directly affect the measured values of the items observed in the other population. Dependent variable A variable that is expected to show the change as an independent variable is manipulated or changed. Determining the sample size This is the act of choosing the number of observations to include in a statistical sample. The sample size is an important feature of any empirical study in which the goal is to make inferences about a population from a sample. Discrete A set of data is said to be discrete if the values/observations belonging to it are distinct and separate, that is, they can be counted (1,2,3, …).


Discrete probability distributions If a random variable is a discrete variable, its probability distribution is called a discrete probability distribution. Discrete random variable A set of data is said to be discrete if the values belonging to it can be counted as 1, 2, 3, …. Dispersion (or spread) The variation between data values is called dispersion. Disproportionate stratification This is a type of stratified sampling. With disproportionate stratification, the sample size of each stratum does not have to be proportionate to the population size of the stratum. This means that two or more strata will have different sampling fractions. Distribution shape A graph plotting frequency (or probability) against actual values. There are some typical shapes: normal (bell-shaped), Student’s t, and F distributions. Double exponential smoothing (DES) A linear forecasting method that uses smoothing values of the smoothing values in order to define linear components at and bt, which are used to produce linear forecasts. Double moving averages (DMA) A linear forecasting method that uses moving averages of the moving averages in order to define linear components at and bt, which are used to produce linear forecasts. Durbin–Watson test A test used to detect serial correlation, or autocorrelation, among the residuals, which means a relationship between residuals separated by each other by a given time lag. Efficient estimator Among a number of estimators of the same class, the estimator having the least variance is called the efficient estimator. Empirical approach This denotes information gained by means of observation, experience, or experiment. Error analysis or residual analysis After a forecasting model is fitted to the actual time series and deviations between the actual and forecasted values are calculated, these deviations are called errors or residuals. They are scrutinised as in regression analysis to validate the model, but also to decide which model is the best, if more than one model is used to forecast the same time series. Estimator An estimator is a rule for calculating an estimate of a given quantity based on observed data. Explanatory variable Another expression for independent variable. Explained variations See Sum of squares for regression (SSR). Extrapolate Extend the results produced by a method into the future, by assuming that the existing rules will continue to apply. Exponential smoothing A method that relies on a constant which is used to correct the past forecasts’ deviations from the actual values (errors). A time series obtained in such a way is invariably ‘smoother’ than the original time series that it has been derived from. Event An event is a set of outcomes of an experiment to which a probability is assigned. Expected frequency In a contingency table the expected frequencies are the frequencies that you would predict in each cell of the table, if you knew only the row and column totals, and if you assumed that the variables under comparison were independent. Expected value of the probability distribution The expected value of a random data variable indicates its population average value. F distribution The F distribution is a probability distribution used in analysis of variance when comparing the variance of two samples for significance.


F test for equality of population variance The F test for two population variances (variance ratio test) is used to test if the variances of two populations are equal. F test for multiple regression models A test to determine whether any of the independent variables is significant. F test for simple regression models A test to determine if the independent variable is a significant contributor to the predicted values yˆ . First quartile, Q1 This is also referred to as the 25th percentile and is the value below which 25% of the population falls. Fisher kurtosis coefficient Kurtosis is a measure of the ‘peakedness’ of the probability distribution of a random variable. Fisher–Pearson skewness coefficient Skewness is a measure of the asymmetry of the probability distribution of a random variable about its mean. Fitting a trend Finding the line (linear or curve) that best fits the historical data. Five-number summary A five-number summary is especially useful when we have so many data that it is enough to present a summary of the data rather than the whole data set. Forecasting A method of predicting the future values of a variable, usually represented as the time series values. Forecasting errors The difference between the actual and forecasted values in a time series. Frequency definition of probability This defines the probability of an outcome as the frequency or the number of times the outcome occurs relative to the number of times that it could have occurred. Frequency density The frequency density can be calculated by using the following formula: frequency density = frequency ÷ class width. Frequency distribution A summary of data presented in table form representing frequency and class intervals. Frequency polygon Frequency polygons are line graphs joined by all the midpoints at the top of the bars of histograms. Goodness-of-fit test A chi-square goodness-of-fit test attempts to answer the following question: are sample data consistent with a hypothesised distribution? Graph A graph or chart is a visual illustration of a set of data. Grouped frequency distribution In a grouped frequency distribution, data are sorted and separated into groups called classes. Histogram A histogram is a way of summarising data that are measured on an interval scale (either discrete or continuous). It is often used in exploratory data analysis to illustrate the major features of the distribution of the data in a convenient form. In this case, the class intervals are constant. Histogram with unequal class intervals A histogram is a way of summarising data that are measured on an interval scale (either discrete or continuous). It is often used in exploratory data analysis to illustrate the major features of the distribution of the data in a convenient form. In this case, the class intervals are not constant. Homogeneity of variance This is another name for equal population variances. Homoscedasticity This is also called homogeneity or uniformity of variance. It refers to the requirement that the variance of regression errors, or residuals, is constant for all values of X. This is one of the linear regression assumptions. Hypothesis A testable statement about the relationship between two or more variables or a proposed explanation for some observed phenomenon. Page | 739

Hypothesis test procedure A series of steps to determine whether to accept or reject a null hypothesis, based on sample data. IBM SPSS Statistics SPSS Statistics is a software package used for statistical analysis. Independence of errors Current error must not be dependent on the previous error. Another term is serial correlation. Independent populations Two populations are said to be independent if the measured values of the items observed in one population do not affect the measured values of the items observed in the other population. Independent variable A variable that stands alone and is expected to cause some changes to the dependent variable. Also called explanatory variable. Inference This is the process of deducing properties of an underlying probability distribution by analysis of data. Inferential statistical analysis infers properties about a population; this includes testing hypotheses and deriving estimates. Independent events Two events are independent if the occurrence of one of the events has no influence on the occurrence of the other event. Intercept This is the value of the regression equation (y) when the x value is 0. Interquartile range The interquartile range is a measure of the spread of or dispersion within a data set. Interval An interval scale is a scale of measurement where the distance between any two adjacent units of measurement (or ‘intervals’) is the same but the zero point is arbitrary. Interval estimates This refers to the use of sample data to calculate an interval of plausible values of an unknown population parameter. Kurtosis This is a measure of the ‘peakedness’ of the probability distribution of a random variable. Least squares The method of least squares is a criterion for fitting a specified model to observed data. If refers to finding the smallest (least) sum of squared differences between fitted and actual values. Left-skewed (or negatively skewed) A distribution is said to be left-skewed (or lefttailed), even though the curve itself appears to be leaning to the right; ‘left’ here refers to the left tail being drawn out, with the mean value being skewed to the left of a typical centre of the data set. Leptokurtic A statistical distribution is leptokurtic when the points along the X-axis are clustered, resulting in a higher peak, or higher kurtosis, than the curvature found in a normal distribution. Level of confidence This is the likelihood that the true value being tested will lie within a specified range of values. Likelihood of an event happening This is the probability that an event will occur. Linear relationship The relationship between two variables is linear, that is, they move together in the same or opposite direction, but in a linear fashion. This is one of the linear regression assumptions. Lower class limit A point that is the left endpoint of one class interval. Lower one-tailed test A lower one-tail test is a statistical hypothesis test in which the values for which we can reject the null hypothesis, H0, are located entirely in the left tail of the probability distribution. Mann–Whitney U test The Mann–Whitney U test is used to test the null hypothesis that two populations have identical distribution functions against the alternative hypothesis that the two distribution functions differ only with respect to location (median), if at all.


Margin of error This provides an estimate of how much the results of the sample may differ due to chance when compared to what would have been found if the entire population were interviewed. It is a statistic expressing the amount of random sampling error in a survey’s results. McNemar’s test for matched pairs McNemar’s test is a nonparametric method used on nominal data to determine whether the row and column marginal frequencies are equal. Mean The mean is a measure of the average data value for a data set. Mean absolute error, or deviation (MAD) The mean value of all the differences between the actual and forecasted values in a time series. The differences between these values are represented as absolute values, with the effects of the sign ignored. Mean absolute percentage error (MAPE) The mean value of all the differences between the actual and forecasted values in a time series. The differences between these values are represented as absolute percentage values, that is, the effects of the sign are ignored. Mean error (ME) The mean value of all the differences between the actual and forecasted values in a time series. Mean of the binomial distribution The expected value, or mean, of a binomial distribution is calculated by multiplying the number of trials by the probability of successes (np). Mean of the Poisson distribution The expected value, or mean, of a Poisson distribution is the average number of events in a given time or space interval and is represented by the symbol . Mean percentage error (MPE) The mean value of all the differences between the actual and forecasted values in a time series. The differences between these values are represented as percentage values. Mean squared error (MSE) The mean value of all the differences between the actual and forecasted values in a time series. The differences between these values are squared to avoid positive and negative differences cancelling each other. Measure of average A measure of average is a number that is typical for a set of figures (e.g. mean, median, or mode). Measure of central tendency Measures of central tendency are numbers that describe what is average or typical of the distribution of data, (e.g. mean, median, or mode). Measure of dispersion Dispersion is the extent to which a distribution is stretched or squeezed (e.g. standard deviation, and interquartile range). Measure of location Measures of location are numbers that describe what is average or typical of the distribution of data, (e.g. mean, median, or mode). Measure of shape The principal measures of distribution shape used in statistics are skewness and kurtosis. Measure of spread Spread is the extent to which a distribution is stretched or squeezed (e.g. standard deviation, and interquartile range). Measure of variation Variation is the extent to which a distribution is stretched or squeezed (e.g. standard deviation, and interquartile range). Measurement error (or observational error) This is the difference between a measured value or quantity and its true value. In statistics, an error is not a ‘mistake’. Variability is an inherent part of things being measured and of the measurement process. Median The median is the value halfway through an ordered data set. Mesokurtic A statistical distribution is mesokurtic when its kurtosis is similar, or identical, to that of a normally distributed data set. Page | 741

Microsoft Excel Microsoft Excel is a spreadsheet developed by Microsoft. It organises numeric or text data in spreadsheets or workbooks. Mode The mode is the most frequently occurring value in a set of discrete data. Model An abstraction of something, in this context typically referring to an equation, or a series of equations, that mimic a variable. Moving averages are averages calculated for a limited number of rolling periods. Every subsequent moving average drops the first observation from the rolling period and takes the subsequent one. Moving average forecasts If the moving average is placed as the value following the last observation taken into the moving average interval, then this moving average becomes a forecast. Multiple regression Multiple linear regression aims to find a linear relationship between a response variable and several possible predictor variables. Multistage sampling This refers to sampling plans where the sampling is carried out in stages using smaller and smaller sampling units at each stage. Mutually exclusive events Two or more events are said to be mutually exclusive if they cannot occur at the same time. Nominal A set of data is said to be nominal if the values/observations belonging to it can be assigned a code in the form of a number, where the numbers are simply labels. You can count but not order or measure nominal data. Nonparametric tests are often used in place of their parametric counterparts when certain assumptions about the underlying population are questionable. Non-probability sampling This is a sampling technique where the samples are gathered in a process that does not give all the individuals in the population equal chances of being selected. Non-proportional quota sampling This captures a minimum number of respondents in a specific group. Non-response error Non-response errors occur when a survey fails to get a response to one, or possibly all, of the questions. Non-stationary A time series that does not have a constant mean and/or variance and oscillates around a moving mean. Normal distribution The normal distribution is a symmetrical, bell-shaped curve, centred at its expected value. Normal probability plot This is a graphical technique to assess whether a set of data is normally distributed. Normality of errors Regression errors, or residuals, need to be distributed in accordance with the normal distribution. This is one of the linear regression assumptions. Null hypothesis (H0) The null hypothesis, H0, represents a theory that has been put forward but has not been proved. Observational error (see Measurement error). Observed frequency In a contingency table the observed frequencies are the frequencies obtained in each cell of the table, from our random sample. Ogive A cumulative frequency graph. One-sample test A one-sample test is a hypothesis test for answering questions about the mean (or median) where the data are a random sample of independent observations from an underlying distribution


One-sample t test A one-sample t test is a hypothesis test for answering questions about the mean where the data are a random sample of independent observations from an underlying normal distribution whose population variance is unknown. One-sample z test A one-sample z test is used to test whether a population parameter is significantly different from some hypothesised value. One-tailed test A one-tailed test is appropriate if the estimated value may depart from the reference value in only one direction. Ordinal A set of data is said to be ordinal if the values/observations belonging to it can be ranked (put in order) or have a rating scale attached. You can count and order, but not measure, ordinal data. Outcome An outcome is a possible result of an experiment. Outlier An outlier is an observation in a data set which is far removed in value from the others in the data set. p-value The p-value is the probability of getting a value of a test statistic as extreme as or more extreme than that observed by chance alone, if the null hypothesis is true. Paired-samples t test A two-sample t test for population mean (dependent or paired samples) is used to compare two dependent population means inferred from two samples (‘dependent’ indicates that the values from both samples are numerically dependent upon each other – there is a correlation between corresponding values). Parametric Any statistic computed by procedures that assume the data were drawn from a distribution. Pearson coefficient of skewness A measure of the symmetry of a distribution Percentile The xth percentile is the value beneath which x% of the population falls. Pie chart A pie chart is a way of summarising a set of categorical data using a circle which is divided into segments, where each segment represents a category. Pivot table Excel expression for a table that was generated from a ‘flat’ table that contains raw data. A pivot table reorganises and summarises data by enabling the rotation of variables, effectively facilitating the structural (or pivotal) change the way data are presented. Platykurtic A statistical distribution is platykurtic when the points along the X-axis are extremely dispersed, resulting less peakedness, or lower kurtosis, than found in a normal distribution. Point estimate Is the use of sample data to calculate a point value of an unknown population parameter. Point estimate of the population mean Is a single value of a sample mean statistic that represents the population mean. Point estimate of the population proportion Is a single value of a sample proportion statistic that represents the population proportion. Point estimate of the population variance Is a single value of a sample variance statistic that represents the population variance. Poisson probability distribution This is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time and/or space if these events occur with a known average rate and independently of the time since the last event. Pooled estimates Pooled estimates (also known as combined, composite, or overall variance) is a method for estimating a statistic (mean, variance) of several different populations. Pooled-variance t test Two-sample t test with the two populations having the same variance. Page | 743

Population A collection of persons, objects, or items of interest. Population mean The population mean is the mean value of all possible values. Population mean for the normal distribution This is the expected value for a variable that follows a normal distribution. Population parameter This is a characteristic of a population. The population is a set of individuals, items, or data from which a statistical sample is taken. Population standard deviation The population standard deviation is the standard deviation of all possible values. Population variance The population variance is the variance of all possible values. Prediction interval See confidence interval. Probability A measure of the likelihood that events will occur. Probability density function A statistical expression for the probability distribution of a continuous variable. It provides the probability that a variable fall in some interval. Probability distribution A listing of all possible events or outcomes associated with a course of action and their associated probabilities. Probability sampling This is any method of sampling that utilises some form of random selection. To have a random selection method, you must set up some process or procedure that ensures that the different units in your population have equal probabilities of being chosen. Probability trees Method to use tree diagrams to aid the solution of problems involving probability. Proportional quota sampling This represents the major characteristics of the population by sampling a proportional amount of each. Proportionate stratification This is a type of stratified sampling. With proportionate stratification, the sample size of each stratum is proportionate to the population size of the stratum. This means that each stratum has the same sampling fraction. Purposive sampling This is a non-probability sample that is selected based on characteristics of a population and the objective of the study. Purposive sampling is also known as judgmental, selective, or subjective sampling. Qualitative A qualitative variable is one whose values are adjectives, such as colours, genders, nationalities, etc. Quantitative A variable that takes numerical values for which arithmetic makes sense (e.g. counts, temperatures, weights, amounts of money). Quartile A quartile is one of three values that divide a sample of data into four groups containing an equal number of observations. Quota sampling This is a method for selecting survey participants that is a nonprobabilistic version of stratified sampling. In quota sampling, a population is first segmented into mutually exclusive subgroups, just as in stratified sampling. Then judgement is used to select the subjects or units from each segment based on a specified proportion. R-squared or R2 See coefficient of determination. Random experiment Randomised experiments are the experiments that allow the greatest reliability and validity of statistical estimates of treatment effects. Random sample A random sample is one in which every element in the population has an equal chance of being selected. Random variable A random variable is a set of possible values from a random experiment. Range The range of a data set is a measure of the dispersion of the observations. Rank To rank is to list data in order of size. Page | 744

Ratio data are interval data with a natural zero point. Raw data Raw data, also known as primary data, are data collected from a source. Regression analysis This is a set of statistical processes used to estimate relationships between variables. Regression assumptions In order to apply linear regression, four assumptions need to be met: (i) linearity; (ii) independence of errors; (iii) normality of errors; and (iv) constant variance, or homoscedasticity. Regression coefficients In linear regression there is one coefficient that describes the value of the intercept (when x = 0) and one coefficient that describes the slope (or gradient). Rejection region This is the range of values that leads to rejection of the null hypothesis. Relative frequency This is defined as how often something happens divided by the total number of outcomes. Residual The residual represents the unexplained variation (or error) after fitting a regression model. It is also the differences between the actual and predicted values. Residual analysis Is the analysis of the residuals of the regression. It typically means that the residuals comply with certain assumptions. See regression assumptions. Right-skewed (or positively skewed) A distribution is said to be right-skewed (or right-tailed), even though the curve itself appears to be leaning to the left; ‘right’ here refers to the right tail being drawn out, with the mean value being skewed to the right of a typical centre of the data set. Robust test Robust statistics are statistics with good performance for data drawn from a wide range of probability distributions, especially for distributions that are not normal. Root mean square error (RMS) This is the root value of the MSE. It takes the error statistic back to the same units as used by the variable. It can also be used as the standard error of the estimate to calculate a prediction interval. Sample A sample is a set of data collected and/or selected from a statistical population by a defined procedure. The elements of a sample are known as sample points, sampling units or observations. Sample mean The sample mean (or empirical mean) is computed from a collection of data from a population. Sample point This is a single possible observed value of a variable; a member of the sample space of an experiment. Sample statistics Is a set of data collected and/or selected from a statistical population by a defined procedure. The elements of a sample are known as sample points, sampling units or observations. Sample standard deviation This is the extent to which a collection of data from a population are stretched or squeezed along the X-axis. Sample space The sample space is an exhaustive list of all the possible outcomes of an experiment. Sample variance This is a measure of the dispersion of a sample of observations taken from a population. Sampling distribution This is the probability distribution of a given statistic based on a random sample. Sampling distributions are important in statistics because they provide a major simplification en route to statistical inference.


Sampling distribution of the sample mean This is a theoretical distribution of the values that the mean of a sample takes on in all the possible samples of a specific size that can be drawn from a given population. Sampling distribution of the sample proportion This is a theoretical distribution of the values that the proportion of a sample takes on in all the possible samples of a specific size that can be drawn from a given population. Sampling distribution of the sample variance This is a theoretical distribution of the values that the variance of a sample takes on in all the possible samples of a specific size that can be drawn from a given population. Sampling error This occurs when the statistical characteristics of a population are estimated from a subset, or sample, of that population. Since the sample does not include all members of the population, statistics on the sample, such as means and standard deviations, generally differ from the characteristics of the entire population, which are known as parameters. Sampling frame This is the source material or device from which a sample is drawn. It is a list of all those within a population who can be sampled, and may include individuals, households, or institutions. Sampling with replacement Sampling is called ‘with replacement’ when a unit selected at random from the population is returned to the population and then a second element is selected at random. Sampling without replacement Sampling is called ‘without replacement’ when a unit selected at random from the population is not returned to the population and then a second element is selected at random. Scaling Changing of the scale of the ordinate (y-axis) on a graph to modify the resolution of the diagram. Scatter plot A scatter plot is a useful summary of a set of bivariate data (two variables), usually drawn before working out a linear correlation coefficient or fitting a regression/time series line. It gives a good visual picture of the relationship between the two variables and aids the interpretation of the correlation coefficient or regression model. Seasonal A time series, represented in the units of time smaller than a year, that shows regular pattern in repeating itself over a number of these units of time Second quartile, Q2 This is also referred to as the 50th percentile or the median and is the value that divides the population in the middle and has 50 % of the population values below it. Serial correlation There should be no serial correlation among the residuals, and this is one of the linear regression assumptions. See independence of errors. Set of all possible outcomes The set of all possible outcomes of a random experiment. Shapiro–Wilk test The Shapiro–Wilk test calculates a statistic that tests whether a random sample comes from a normal distribution. Simple random sampling This is the basic sampling technique where we select a group of subjects (a sample) for study from a larger group (a population). Everyone is chosen entirely by chance and each member of the population has an equal chance of being included in the sample. Simple regression analysis Simple linear regression aims to find a linear relationship between one response variable and one possible predictor variable by the method of least squares. Sign test The sign test is designed to test a hypothesis about the location of a population distribution. Page | 746

Single (or simple) moving averages (SMA) These are identical to moving average forecasts, but the word 'single' is used to differentiate them from double moving averages. Sometimes they are shown as 3MA or 5MA, for example, where the prefix number indicates how many periods are taken into the rolling interval.
Single exponential smoothing (SES) Simple weighted averages of the past deviations of the forecasts from the actual values are used to create a new fit for the original time series. If these values are shifted one period into the future, they become simple exponential smoothing forecasts.
Significance level, α The significance level of a statistical hypothesis test is a fixed probability of wrongly rejecting the null hypothesis, H0, if it is in fact true.
Skewness Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean.
Slope This is the gradient of the fitted regression line.
Smoothing constant, α This defines what fraction of the past forecasting error will be used to produce future forecasts. The smoothing constant varies between 0 and 1; the closer to zero, the smoother the SES time series will be. A larger smoothing constant puts more emphasis on more recent data in the time series, while a smaller smoothing constant puts more emphasis on the whole history of the time series.
Snowball sampling This is a non-probability sampling technique in which existing participants recruit further participants from among their contacts. Through its use, it is possible to make inferences about social networks and relations in areas in which sensitive, illegal, or deviant issues are involved.
Spearman's rank coefficient of correlation This measures dependence between the rankings of two variables.
Standard deviation A measure of the dispersion of the observations (the square root of the variance).
Standard deviation for the probability distribution The standard deviation of a random variable, statistical population, data set, or probability distribution is the square root of its variance.
Standard deviation of a normal distribution The standard deviation of a normal distribution is σ.
Standard deviation of the sample mean This is the standard deviation of the sampling distribution of the mean. It is therefore the square root of the variance of the sampling distribution of the mean. It is also called the standard error of the mean.
Standard error The standard error of any parameter (or statistic) is the standard deviation of its sampling distribution, or an estimate of that standard deviation. If the parameter or statistic is the mean, then we call it the standard error of the mean.
Standard error in time series This is a measure of the accuracy of predictions made using one of the extrapolation techniques.
Standard error of the estimate (SEE) This is a measure of the accuracy of predictions made with a regression line.
Standard error of the sample proportion This is the standard error of the sampling distribution of the proportion. It is therefore the square root of the variance of the sampling distribution of the proportion.
Standard normal distribution This is the normal distribution with mean 0 and standard deviation 1.
Standard normal distribution table This is a mathematical table of the values of the cumulative distribution function of the normal distribution. It is used to find the probability that a statistic is observed below, above, or between values on the standard normal distribution.
Standardised sample mean Z value A standardised value is what you get when you take a sample mean and scale it using population data (the population mean and the standard error).
Stated limits (true or mathematical) Class limits are the smallest and largest observations (data, events, etc.) in each class. Therefore, each class has two limits: a lower limit and an upper limit.
Stationary A time series that has a constant mean and variance and oscillates around this mean is referred to as stationary.
Statistical independence Two events are independent if the occurrence of one of the events gives us no information about whether the other event will occur.
Statistical inference This is the process of deducing properties of an underlying probability distribution by analysis of data. Inferential statistical analysis infers properties about a population; this includes testing hypotheses and deriving estimates.
Statistical power The power of a statistical test is the probability that it will correctly lead to the rejection of a false null hypothesis.
Stratified random sampling This is a method of sampling from a population. In statistical surveys, when subpopulations within an overall population vary, it is advantageous to sample each subpopulation (stratum) independently.
Student's t distribution This is any member of a family of continuous probability distributions that arises when estimating the mean of a normally distributed population in situations where the sample size is small and the population standard deviation is unknown.
Student's t distribution table This is a mathematical table of the values of the cumulative distribution function of the Student's t distribution. It is used to find the probability that a statistic is observed below, above, or between values on the Student's t distribution.
Student's t test This is a method of testing hypotheses about the mean of a small sample drawn from a normally distributed population when the population standard deviation is unknown.
Sum of squares for error (SSE) This is the sum of squared distances of every point from the regression line. Also called unexplained variation.
Sum of squares for regression (SSR) This is the sum of squared distances of every point on the regression line from the mean value. Also called explained variation.
Symmetric distributions A data set is symmetrical when the data values are distributed in the same way above and below the middle value.
Systematic random sampling This is a method of choosing a random sample from among a larger population. The process of systematic sampling typically involves first selecting a fixed starting point in the larger population and then obtaining subsequent observations by using a constant interval between samples taken.
t test for multiple regression models This tests whether the predictor variables in the regression equation are significant contributors.
t test for simple regression models This is a test that validates whether the regression model is usable. It tests whether the slope coefficient in the regression equation is significantly different from zero.
Table A set of facts or figures systematically displayed, especially in columns.
Tally chart A table used to record values for a variable in a data set, by hand, often as the values are collected. One tally mark is used for each occurrence of a value. Tally marks are usually grouped in sets of five, to aid the counting of the frequency for each value.
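The single exponential smoothing (SES) and smoothing constant entries above can be illustrated with a few lines of code. This is a minimal sketch in Python rather than the Excel/SPSS procedures described in the forecasting chapters; the demand figures and the choice of α = 0.3 are made up for demonstration.

def ses_forecasts(series, alpha=0.3):
    # One-step-ahead SES forecasts: forecast[t+1] = alpha*actual[t] + (1-alpha)*forecast[t],
    # with the first forecast initialised to the first observation.
    forecasts = [series[0]]
    for actual in series[:-1]:
        forecasts.append(alpha * actual + (1 - alpha) * forecasts[-1])
    return forecasts

demand = [112, 118, 132, 129, 121, 135, 148, 148, 136, 119]  # invented data
print(ses_forecasts(demand, alpha=0.3))
# A smaller alpha produces smoother forecasts that reflect the whole history of the series;
# a larger alpha puts more weight on the most recent observations.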

Test of association The chi-square test of association allows the comparison of two attributes in a sample of data to determine if there is any relationship between them.
Test statistic A test statistic is a quantity calculated from our sample of data.
Third quartile, Q3 This is also referred to as the 75th percentile and is the value below which 75% of the population falls.
Tied ranks Two or more data values share a rank value.
Time period A unit of time by which the variable is defined (an hour, day, month, year, etc.).
Time series This is a variable whose observations are measured in time and/or timestamped and equidistant in units of time.
Time series plot A time series plot is a graph that you can use to evaluate patterns and behaviour in a data variable over time.
Total sum of squares (SST) This is the total variation of the data from their mean, consisting of the regression sum of squares (SSR) and the error sum of squares (SSE).
Total variation See total sum of squares (SST).
Tree diagram A tree diagram may be used to represent a probability space. Tree diagrams may represent a series of independent events or conditional probabilities.
Trend This is a component in the classical time series analysis approach to forecasting that covers underlying directional movements of the time series. In general, it is the underlying tendency of any time series, indicating the direction and pace.
Trend parameters In the case of a linear trend, these are the slope and intercept. If a more complex trend equation is used, they do not have a name, but are referred to as a, b, c, etc.
Triple exponential smoothing (TES) A nonlinear forecasting method that uses three levels of smoothing values in order to define components a_t, b_t and c_t, which are used to produce linear forecasts.
Two-sample test A two-sample test is a hypothesis test for answering questions about the mean where the data are collected from two random samples of independent observations, each from an underlying distribution.
Two sample t test A two-sample t test for the population mean (independent samples, equal variance) is used when two separate sets of independent and identically distributed samples are obtained, one from each of the two populations being compared.
Two sample t test (independent samples, unequal variances) A two-sample t test for the population mean (independent samples, unequal variances) is used when two separate sets of independent but differently distributed samples are obtained, one from each of the two populations being compared.
Two-sample z test for the population mean A two-sample z test for the population mean is used to evaluate the difference between two group means.
Two-sample z test for the population proportion A two-sample z test for the population proportion is used to evaluate the difference between two group proportions.
Two-tailed test This is a statistical hypothesis test in which the values for which we can reject the null hypothesis, H0, are in both tails of the probability distribution.
Type I error, α A Type I error occurs when the null hypothesis, H0, is rejected when it is in fact true.


Type II error, β A Type II error occurs when the null hypothesis, H0, is not rejected when it is in fact false.
Types of trends Any curve can be a trend. Most often used are linear, logarithmic, polynomial, power, and exponential trends.
Unbiased estimator If the expected value of the sample statistic is equal to the population value, then we say that the sample statistic is an unbiased estimator of the population value.
Uncertainty This is a situation which involves imperfect and/or unknown information.
Unexplained variation See sum of squares for error (SSE).
Uniform distribution A type of probability distribution in which all outcomes are equally likely.
Univariate methods These use only one variable and try to predict its future value based on the past values of the same variable.
Upper class limit A point that is the right endpoint of one class interval.
Upper one-tail test An upper one-tail test is a statistical hypothesis test in which the values for which we can reject the null hypothesis, H0, are located entirely in the right tail of the probability distribution.
Variable A numerical value or a characteristic that can differ from individual to individual or change through time.
Variance This is a measure of the dispersion of the observations.
Variance of a binomial distribution The variance of the binomial distribution is npq.
Variance of a normal distribution The variance of a normal distribution is σ².
Variance of a Poisson distribution The variance of a Poisson distribution is λ.
Variance of a probability distribution This is the expectation of the squared deviation of a random variable from its mean. Informally, it measures how far a set of numbers is spread out from its average value.
Visual display This is a visual representation of data in graphic form.
Weighted average This is an average in which each quantity to be averaged is assigned a weight, and these weightings determine the relative importance of each quantity for the average.
Welch's unequal-variances t test This is a two-sample location test which is used to test the hypothesis that two populations have equal means.
Welch–Satterthwaite equation This equation is used to calculate an approximation to the effective degrees of freedom of a linear combination of independent sample variances, also known as the pooled degrees of freedom, corresponding to the pooled variance.
Wilcoxon signed-rank test This is designed to test a hypothesis about the location of the population median (one sample or matched pairs).
Yates' correction An adjustment made to chi-square values obtained from 2 by 2 contingency table analyses.
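The two-sample and Welch-related entries above can also be illustrated briefly. The following minimal sketch assumes Python with SciPy; the two groups are simulated with deliberately unequal variances, purely for demonstration.

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Two hypothetical independent samples with different spreads
group_a = rng.normal(loc=50, scale=5, size=25)
group_b = rng.normal(loc=54, scale=12, size=30)

# Welch's unequal-variances t test (equal_var=False); the effective degrees of freedom
# are approximated internally using the Welch-Satterthwaite equation.
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
print(f"t = {t_stat:.3f}, p-value = {p_value:.3f}")
# A p-value below the chosen significance level would lead to rejecting H0 (equal population means).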


Book index

5 steps in hypothesis testing ..... 345
Alpha (α) ..... 346
Alternative hypothesis (H1) ..... 337, 345
Assumptions for the equal-variance t-test ..... 384
Bar charts ..... 47
Bessel correction factor ..... 293
Beta, β ..... 355
Biased estimate ..... 262
Binomial ..... 171
Binomial experiment ..... 216
Binomial probability distribution ..... 211
Bootstrapping ..... 291
Box plot ..... 152, 191
Categories ..... 21
Central limit theorem ..... 172
Central Limit Theorem ..... 273, 341
Central tendency ..... 95, 96
Centred moving average ..... 649
Chance ..... 167
Chart ..... 47
Chi-square distribution ..... 208, 423
Chi-square test ..... 419
Chi-square test for two independent samples ..... 441
Chi-square test of independence ..... 421
Class interval ..... 29
Class limits ..... 22
Class widths equal ..... 66
Classical time series analysis ..... 616
Classical time series decomposition ..... 618
Cluster sampling ..... 246
Coefficient of determination ..... 529
Coefficient of variation ..... 130
Component bar chart ..... 49
Confidence interval ..... 309
Confidence interval for the population mean ..... 291
Confidence interval for the population proportion ..... 291
Confidence interval of a population mean when the sample size is small ..... 316
Consistent estimator ..... 292
Contingency table ..... 423
Continuous probability distribution ..... 171
Continuous random variable ..... 171
Continuous variable ..... 20
Convenience sampling ..... 247
Covariance ..... 519
Coverage error ..... 249
Critical test statistic ..... 349
Critical value of the test statistic ..... 347


Critical Z test statistic ..... 362
Cyclical ..... 616
Degrees of freedom ..... 294, 367, 421
Dependent ..... 340
Determining the sample size ..... 326
Discounting ..... 668
Discrete probability distributions ..... 211
Discrete random variable ..... 171, 213
Discrete variable ..... 20
Dispersion ..... 95
Disproportionate stratification ..... 245
Distribution shape ..... 131
Double exponential smoothing ..... 684
Double moving averages ..... 662
Efficient estimator ..... 292
Empirical ..... 168
Error measurements ..... 594
Error sum of squares, SSE ..... 543
Estimate ..... 250
Event ..... 169
Expected frequency ..... 423
Expected value of the probability distribution ..... 213
Experiment ..... 168
Exponential smoothing ..... 667
Exponentially smoothed ..... 646
Exponentially weighted moving average (EWMA) ..... 669
F-distribution ..... 205
First quartile ..... 111
Fisher - Pearson skewness coefficient ..... 134
Fisher's kurtosis coefficient ..... 141
Five-number summary ..... 146, 191
Forecasting errors ..... 589
Four assumptions of regression ..... 545
Frequency distribution ..... 25, 213
Frequency distributions ..... 21
Graph ..... 47
Group frequency distribution ..... 66
Grouped frequency distribution ..... 28
Histogram ..... 66
Holt-Winters method ..... 702
Homogeneity of variance assumption ..... 384
Homoscedasticity ..... 546
Hypothesis test procedure ..... 345
IBM SPSS Statistics ..... 18
Independent ..... 340
Inference ..... 242
Interquartile range ..... 119
Interval data ..... 98
Interval estimate ..... 290, 309
Irregular ..... 616
Kurtosis ..... 97, 132, 140


Left skewed ..... 133
Leptokurtic ..... 140
Likelihood ..... 167
Linear regression analysis ..... 533
Linearity regression assumption ..... 545
Long-term forecasts ..... 644
Mann-Whitney U test ..... 488
Margin of error ..... 249
McNemar's test for matched pairs ..... 450
Mean of a standard normal distribution ..... 178
Mean of the sampling distribution for the proportions ..... 282
Measure of average ..... 95
Measure of dispersion ..... 108
Measure of spread ..... 95, 109
Measure of variation ..... 109
Measurement Error ..... 249
Measures of average ..... 96
Measures of central tendency ..... 95
Measures of dispersion ..... 95, 96
Measures of shape ..... 95
Median ..... 100
Medium-term forecasts ..... 644
Mesokurtic ..... 140
Microsoft Excel ..... 18
Mid-term forecasts ..... 662
Mode ..... 102
Moving average ..... 646
Multistage sampling ..... 246
Mutually exclusive ..... 169
Negatively skewed ..... 133
Nominal data ..... 98
Nominal variable ..... 20
Non-parametric ..... 291
Non-parametric hypothesis tests ..... 334
Non-parametric test - Mann-Whitney U test ..... 385, 386
Non-parametric tests ..... 459
Non-probability samples ..... 243
Non-probability sampling ..... 243, 246, 247
Non-proportional quota sampling ..... 248
Non-response error ..... 249
Non-seasonal time series ..... 573
Non-stationary time series ..... 572
Non-symmetrical distributions ..... 147
Normal distribution ..... 172
Normal probability plot ..... 191
Normality assumption ..... 384
Normally distributed population ..... 349
Null hypothesis (H0) ..... 337, 345
Observed frequency ..... 422
Odds ..... 167
One sample t test ..... 368


One sample test ..... 340
One sample z test for the proportion, π ..... 377
One sample z-test statistic ..... 361
One-tail test ..... 338
Ordinal data ..... 98
Ordinal variable ..... 20, 21
Outcome ..... 168
Outlier ..... 518
Outliers ..... 99
Parametric ..... 291
Parametric hypothesis test ..... 334
Pearson chi square test statistic ..... 422
Pearson's product moment correlation coefficient ..... 523
Pearson's coefficient of skewness ..... 133
Percentile ..... 101, 109
Pie chart ..... 58
Pivot table ..... 38
Platykurtic ..... 140
Point estimate ..... 290
Point estimate of the population mean ..... 292
Poisson ..... 171
Poisson distribution ..... 229
Poisson distribution experiment ..... 230
Poisson probability distribution ..... 211
Pooled estimates ..... 308
Pooled standard deviation ..... 385
Pooled-variance t-test ..... 384
Population ..... 250
Population mean ..... 100, 173, 350
Population mean of the normal distribution ..... 172
Population parameters ..... 250
Population standard deviation ..... 122, 173
Population standard deviation for a normal distribution ..... 172
Population variance ..... 122
Positively skewed ..... 133
Probability distribution ..... 213
Probability samples ..... 243
Probability sampling ..... 243
Probable ..... 167
Proportional quota sampling ..... 248
Proportionate stratification ..... 245
Purposive sampling ..... 247, 248
P-value ..... 343, 347, 349
Qualitative variable ..... 20
Quantitative variable ..... 20
Quartile ..... 109
Quota sampling ..... 248
Random experiment properties ..... 168
Random variable ..... 170
Range ..... 118
Ranks ..... 461


Ratio ..... 98
Ratio data ..... 21
Raw data ..... 18
Region of rejection ..... 338
Regression equal variance assumption ..... 546
Regression independence of errors assumption ..... 545
Regression sum of squares, SSR ..... 543
Relative frequency ..... 169
Residuals ..... 542
Residuals (R) ..... 576
Right skewed ..... 133
Robust test ..... 377
R-squared ..... 529
Sample ..... 242, 250
Sample correlation coefficient ..... 523
Sample covariance ..... 519
Sample mean ..... 100, 292
Sample point ..... 168
Sample space ..... 168, 170
Sample standard deviation ..... 123, 293
Sample statistics ..... 250
Sample variance ..... 123, 293
Sampling distribution ..... 259, 339
Sampling distribution of the mean ..... 259
Sampling distribution of the proportion ..... 259, 282
Sampling distribution of the sample proportion ..... 282
Sampling error ..... 249
Sampling frame ..... 243
Scatter plot ..... 78
Scatter plots ..... 514
Seasonal ..... 616
Seasonal time series ..... 573
Second quartile ..... 109
Semi-inter quartile range ..... 119
Shape ..... 96
Short-term forecasts ..... 644
Sign test ..... 460
Significance level ..... 346
Simple moving average ..... 662
Simple random sampling ..... 244
Skewness ..... 97, 132
Snowball sampling ..... 248
Spearman's rank correlation coefficient ..... 530
Stacked chart ..... 49
Standard deviation ..... 96, 120
Standard deviation for the probability distribution ..... 214
Standard deviation of a standard normal distribution ..... 178
Standard error ..... 262
Standard error of the estimate ..... 548
Standard error of the sample means ..... 260
Standard normal distribution ..... 178


Stated limits ..... 22
Stationary time series ..... 572
Statistical inference ..... 250
Statistical power ..... 355
Statistical test to be used ..... 346
Stratified random sampling ..... 245
Student's t distribution ..... 316
Student's t distribution tables ..... 318
Student's t-distribution ..... 198
Student's t-test ..... 366
Symmetrical distributions ..... 147
Systematic random sampling ..... 244
Table ..... 23
Tally chart ..... 26
T-distribution ..... 198
Test of association ..... 422
Test statistic ..... 343, 346
Third quartile ..... 111
Time period ..... 573
Time periods ..... 575
Time series plot ..... 80
Total sum of squares, SST ..... 543
Trend ..... 616
T-test assumptions ..... 366
Two sample tests ..... 383
Two-sample tests ..... 340
Two-tail test ..... 338
Type I error ..... 354
Type II error ..... 354
Unbiased estimator ..... 259, 292
Unbiased estimator of the population mean ..... 262
Unbiased estimator of the population standard deviation ..... 294
Uncertainty ..... 167
Variance ..... 120
Variance of a binomial distribution ..... 221
Variance of a Poisson distribution ..... 230
Variance of the binomial distribution ..... 282
Variance of the probability distribution ..... 214
Visual display ..... 47
Welch's unequal variances t-test ..... 385
Welch–Satterthwaite equation ..... 385
Wilcoxon signed rank test ..... 474
Yate's chi-square statistic ..... 426
Yates's correction for continuity ..... 426
