Statistical Analysis in Excel

The Most Basic Approach To Learn, Create And Analyze Data, Charts, Tables With Helpful Tricks For Both Beginners And Pros

English Pages 315 Year 2022

Table of contents :
Evaluating Data in The Real World
The Statistical and Related Notions You Just Have to Know
Samples and Populations
Variables Dependent and Independent
Types of Data
A Little Probability
Inferential Statistics: Testing Hypotheses
Two types of Error
Some Excel Fundamentals
Auto filling cells
Referencing Cells.
Then reference a cell inside another worksheet.
Creating a cell reference using the link cells command
Conclusion
Chapter 2
Understanding Excel's statistical capabilities
Getting started
Worksheet function
Setting up for statistics
Array function
What's in a name? An array of possibilities
Creating your array formula
Using a data analysis tool
Additional data analysis tool package
Accessing commonly used functions
The new analysis data tool
Data from pictures
Related functions
DEVSQ
Average deviation
AVEDEV
Chapter 3
Meeting standards and catching some Z
Catching some Z’s
Characteristic of the Z score
Standardize
Where do you stand?
RANK.EQ and RANK.AVG
LARGE and SMALL
PERCENTILE.INC and PERCENTILE.EXC
PERCENTRANK.INC and PERCENTRANK.EXC
Data analysis tool: Rank and Percentile
Conclusion.
Chapter 4
Summarizing it all
Counting out
COUNT, COUNTA, COUNTBLANK, COUNTIF, COUNTIFS
The long and the short of it
MAX, MAXA, MIN and MINA
Getting esoteric
SKEW and SKEW.P
KURT
Turning in the frequency
FREQUENCY
DATA analysis tool histogram
Can you give me a description?
DATA analysis tool: descriptive statistics
Chapter 5
What is normal?
Hitting the curve
Digging deeper
Parameters of a normal distribution
NORM.DIST
NORM.INV
A distinguished member of the family
NORM.S.INV
PHI and GAUSS
Graphing a standard normal distribution.
Show and tell: graphing data
Why use graphs?
Examine some fundamentals
Gauging Excel's graphics (charts?) capabilities
Becoming a columnist
Stacking the columns
Slicing the pie
Drawing the line
Adding a spark
Passing the bar
Finding another use for the scatter chart
Conclusion
Chapter 6
Finding your center
Means: the lore of average
Calculating the mean
AVERAGE and AVERAGEA
AVERAGEIF and AVERAGEIFS
TRIMMEAN
Other means to an end
Medians caught in the middle
Finding the MEDIAN
Statistics mode
Finding the mode
MODE.SNGL and the MODE.MULT
Conclusion
Chapter 7
Deviating from the average
Measuring the variation
Averaging the squared deviation: variance and how to calculate it
VAR.P and VARPA
Sample variance
VAR.S and VARA
Back to the roots of standard deviation
Population standard deviation
STDEV.P and STDEVPA
Sample standard deviation
STDEV.S and STDEVA
The missing functions STDEVIF and STDEVIFS
Conclusion
Part 3 concludes with data
Chapter 8
The Confidence Game: Estimation
Understanding sampling distributions
An extremely important idea: the central limit theorem
Simulating the central limit theorem
The limits of confidence
Finding the confidence limits for mean
CONFIDENCE.NORM
Fit to a T
CONFIDENCE.T
Conclusion:
Chapter 9
One sample hypothesis testing
Hypothesis, tests, and errors
Hypothesis tests and sample distributions
Catching some Z’s again
Z.TEST
T for one
T.DIST, T.DIST.RT, and T.DIST.2T
T.INV and T.INV.2T
Visualizing a t-distribution
Testing a variance
CHISQ.DIST and CHISQ.DIST.RT
CHISQ.INV and CHISQ.INV.RT
Visualizing a chi-square distribution
Conclusion
Chapter 10
Two sample hypothesis testing
Hypotheses built for two
Sampling distributions revisited
Applying the central limit theorem
Z’s once more
(μ1-μ2) refers to the difference between the means.
Data analysis tool: Z test: two samples for means
T for two
Like peas in a pod: equal variances
Like p’s and q’s: unequal variances
T.TEST
Data Analysis Tool: T-Test: Two Sample
A matched set: hypothesis testing for paired samples.
T.TEST for matched samples
Data analysis tool: t-test: paired two samples for mean
T-test on the iPad with StatPlus
Testing two variances
Using the f in conjunction with the t
F.TEST
F.DIST and F.DIST.RT
F.INV and F.INV.RT
Data analysis tool: F-test: two samples for variances
Visualizing The F distribution
Conclusion
Chapter 11
Testing more than two samples
Testing more than two
A thorny problem
A solution
Meaningful relationships
After the F-test
Data analysis tool ANOVA: single factor
Comparing the means
Another kind of hypothesis: another kind of test
Working with repeated measures ANOVA
Getting trendy
Data analysis tool ANOVA: two factors without replications
Analyzing trend
ANOVA on the iPad
ANOVA on the iPad: another way
Repeated measures ANOVA on the iPad
Conclusion
Chapter 12
Slightly More Complicated Things
Creating the combinations
Breaking down the variance
Data analysis tool: ANOVA: two factors without replication
Cracking The Combinations Again
Rows and columns
Interactions
The analysis
Data Analysis Tool: Anova: Two Factors With Replication
Two kinds of variables—at once
Using excel with mixed design
Graphing the results
After the ANOVA
Two-factor ANOVA on the iPad
Conclusion
Chapter 13
Regression: linear and multiple
The plot of scatter
Graphing a line
Regression: what a line
Using regression for forecasting
Variation around the regression line
Testing Hypotheses About Regression
Worksheet Function For Regression
SLOPE, INTERCEPT, STEYX
FORECAST.LINEAR
Array function: TREND
Array function: LINEST
Data Analysis Tool: regression
Working with tabled output
Opting for graphical output
Juggling Many Relationships At Once: Multiple Regression
Excel tool for multiple regression
TREND revisited
LINEST revisited
The regression data analysis tool revisited
Regression analysis on the iPad
Conclusion
Chapter 14
Correlation: The Rise And The Fall Of Relationships
Scatterplots again
Understanding correlation
Correlation and regression
Testing hypotheses about the correlation
Is a correlation coefficient greater than zero?
Do two correlation coefficients differ?
Worksheet functions for correlation
CORREL and PEARSON
RSQ
COVARIANCE.P and COVARIANCE.S
Data analysis tool: correlation
Tabled output
Multiple correlations
Partial correlation
Using Excel To Test Hypotheses About Correlations
Worksheet functions: FISHER, FISHERINV.
Correlation analysis on the iPad
Conclusion
It is about time
A series and its components
A moving experience
Data analysis tool moving average
How to be a smoothie, exponentially
One-click forecasting
Working with time series on iPad
Conclusion
Chapter 15
Non-Parametric statistics
Independent samples
Two samples: Mann-Whitney U test
More than two samples: Kruskal-Wallis one-way ANOVA
Matched samples
Two samples: Wilcoxon matched-pairs signed-ranks
More than two samples: Friedman two-way ANOVA
More than two samples: Cochran's Q
Correlation: Spearman’s rs
A heads-up
Conclusion
Chapter 16
Introducing probability
What is probability
Experiments, trials, events, and sample spaces
Sample spaces and probability
Compound events
Union and intersection
Intersection, again
Conditional probability
Working with the probabilities
The foundation of hypothesis testing
Large sample spaces
Permutations
Combinations
Worksheet functions
FACT
PERMUT and PERMUTATIONA
COMBIN and COMBINA
Random variables: discrete and continuous
Probability distribution and density functions.
The Binomial Distribution
Worksheet function
BINOM.DIST and BINOM.DIST.RANGE
NEGBINOM.DIST
Hypothesis Testing With The Binomial Distribution
BINOM.INV
More On Hypothesis Testing
The Hypergeometric Distribution
HYPGEOM.DIST
Conclusion
Chapter 17
More on probability
Discovering beta
BETA.DIST
BETA.INV
POISSON
POISSON.DIST
Working with GAMMA
The gamma function and GAMMA
The gamma distribution and GAMMA
GAMMA.INV
Exponential
EXPON.DIST
Conclusion
Chapter 18
Using Probability: Modeling And Simulation
Modeling a distribution
Plunging into the POISSON distribution
Visualizing the Poisson distribution
Working with the POISSON distribution
USING POISSON.DIST again
Testing the model's fit
A word about CHISQ.TEST
Playing ball with a model
A simulating discussion
Taking a chance: the Monte Carlo method
Loading the dice
Data analysis tool: random Number generation
Simulating the central limit theorem
Simulating a business
Estimating probability: logistic regression
Working your way through logistic regression
Mining with XLMINER
Conclusion:
Chapter 19
12 Statistical And Graphical Tips And Tricks
Significant does not always mean important
Trying not to reject a null hypothesis has a few implications
Regression is not always linear
Extrapolating Beyond A Sample Scatterplot Is A Bad Idea
Examine the variability around a regression line
A sample can be too large
Consumers: know your axes
Graphing a categorical variable as a quantitative variable is wrong
Whenever appropriate, include variability in the graph
Be careful when relating statistics textbooks concepts to excel
It is always a good idea to use named ranges in excel
Statistical analysis with Excel on the iPad
Conclusion
Chapter 20
Topics that just don’t fit elsewhere
Graphing the standard error of the mean
Probabilities and distributions
PROB
WEIBULL.DIST
Drawing samples
Testing independence: the trying use of the CHISQ.TEST
Logarithmica Esoterica
What is a logarithm?
What is e?
LOGNORM.DIST
LOGNORM.INV
Array function: LOGEST
Array function: GROWTH
The logs of Gamma
Sorting Data
Appendices
When your data lives elsewhere.
Tips for teachers and learners
Augmenting analyses is a good thing
Understanding ANOVA
Revisiting regression
Simulating data is also a good thing
When all you have is a graph.
More on excel graphics
Tasting the bubbly
Taking stock
Scratching the surface
On the radar
Growing a treemap and bursting some sun
Building A Histogram
Ordering columns: Pareto
Of boxes and whiskers
3D maps
Filled maps
Conclusion
Chapter 21
The analysis of covariance
Covariance a close look
Why do you analyze covariance
How do you analyze covariance?
ANCOVA in excel
Method 1
The final column of ANCOVA is the tricky part.
Method 2
After the ANCOVA
And one more thing
INDEX

STATISTICAL ANALYSIS IN EXCEL

The Most Basic Approach to Learn, Create and Analyze Data, Charts, Tables with Helpful Tricks For Both Beginners and Pros

GOLDEN MCPHERSON

Chapter 1

Evaluating Data in The Real World

Evaluating data matters only because the results are meant to be useful in the real world. In this book, the tool for that work is Excel, not pen and paper or some other piece of software. This chapter takes you through the things you need to know about statistics before you begin.

The Statistical and Related Notions You Just Have to Know

Before we go any further, one thing needs clarifying. Statistics is the analysis and representation of quantifiable data from experiments or real-life studies. One important feature of statistics is that it simplifies information. Here are a few terms you ought to know before you can consider yourself knowledgeable about statistics.

Analytics can be descriptive, diagnostic, predictive, or prescriptive.

Descriptive analytics summarizes data about the past so that a business can better understand its level of performance. This way, stakeholders and investors have an idea of what they are working with.

Diagnostic analytics goes one step further than descriptive analytics and explains why those past results happened.

Predictive analytics gives stakeholders an idea of what is likely to happen in the future, based on the current situation and the trend of the company.

Prescriptive analytics gives you a recommendation based on the predictions that have been made. Like a doctor writing a prescription, prescriptive analytics can be the difference between disaster and success.

Samples and Populations

Many people use "sample" and "population" interchangeably, but that is a mistake, because they are two different things. The population is the entire group that you want to draw conclusions about. The sample is the specific group within the population that you collect data from. The population is generally larger than the sample.

In statistical research, a population does not have to consist of people. It can also consist of objects, events, organizations, countries, species, and so on. In everyday English, "population" usually means people or other living creatures, as in the population of Nigeria or the population of cows in the north. In statistics the word is broader: a population is the whole collection of things or creatures under study.

As a statistician, you must define the population clearly, even though its members are often hard to enumerate. Ordinarily, the population of Nigeria means everyone living in the country as counted in a census. But suppose that you, as a health expert, want to know the average blood pressure of every Nigerian between the ages of 40 and 60. Before you can collect data on this, there are many questions to settle, starting with exactly who belongs to that group. Values such as averages and standard deviations that are calculated from the whole population are called parameters. The corresponding values calculated from a sample are called statistics.

When the population has so many members that studying all of them would be impractical, it is acceptable to draw a sample from the population and study that instead. A sample is essentially a smaller, manageable version of the original data. A well-chosen sample reflects the important characteristics of the population, so that the inferences made from it at the end are valid.

One of the most important requirements of sampling is that it be unbiased and random. This is where the true test comes: when you randomize the selection, every member of the population must have a non-zero chance of being picked. Unsurprisingly, choosing the sample often proves trickier than the actual analysis.
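To make the distinction between parameters and sample statistics concrete, here is a minimal sketch in Python rather than Excel, chosen only for illustration. The population is made-up data (10,000 simulated blood pressure readings), and the variable names are hypothetical. It draws a simple random sample in which every member has the same non-zero chance of being picked, then compares the sample statistics with the population parameters.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch gives the same result each run

# Hypothetical population: 10,000 made-up blood pressure readings.
population = [random.gauss(125, 15) for _ in range(10_000)]

# Parameters describe the whole population.
population_mean = statistics.fmean(population)
population_sd = statistics.pstdev(population)   # population standard deviation

# random.sample draws without replacement, and every member of the
# population has the same non-zero chance of being picked.
sample = random.sample(population, 200)

# Statistics describe the sample and serve as estimates of the parameters.
sample_mean = statistics.fmean(sample)
sample_sd = statistics.stdev(sample)            # sample standard deviation

print(f"population: mean={population_mean:.1f}, sd={population_sd:.1f}")
print(f"sample:     mean={sample_mean:.1f}, sd={sample_sd:.1f}")
```

In a worksheet, the same comparison would typically use AVERAGE together with STDEV.P for the population and STDEV.S for the sample.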

Variables Dependent and Independent

Before we can say we understand dependent and independent variables, we have to separate the two terms. The first question is: what is a variable? A variable is a characteristic that can take different values, such as height, age, temperature, or a test score. Dependent and independent variables are studied together to assess cause-and-effect relationships.

There are two types of variables:
1. The independent variable, whose values do not depend on the other variables in the study.
2. The dependent variable, whose values depend on changes in the independent variable.

So what is an independent variable? It is a variable that you can manipulate or vary while you are running a study. It is not influenced by any other variable in the study, which is why it is called "independent." In statistics, independent variables are also called explanatory variables, predictor variables, or right-hand-side variables, because you typically evaluate how changes in the independent variable relate to the dependent variable.

There are two kinds of independent variables: experimental variables and subject variables.

With an experimental variable, you manipulate the independent variable directly to see how it influences the dependent variable, and you assess outcomes by testing the variable at different levels. To conduct a true experiment, you must randomly assign participants to the different levels of the independent variable. Random assignment lets you control for the characteristics of the participants, which gives you more confidence that differences in the dependent variable come mainly from your manipulation of the independent variable.

A subject variable is an attribute that varies across participants and that you, as the researcher, cannot manipulate. Examples are gender identity, ethnicity, and race. You cannot assign these attributes to participants, because they are innate characteristics of those individuals. What you can do is design a study that compares the outcomes of groups of participants who differ on those characteristics. Such a study is called a quasi-experimental design, because there is no random assignment.

A dependent variable is a variable that changes as a result of the manipulation of the independent variable. It is the outcome you want to measure, and it depends on the independent variable. In statistics, dependent variables are also called response variables, outcome variables, or left-hand-side variables.

The dependent variable is the outcome that you record. Statistical analysis of that record tells you whether, and to what extent, the independent variable influences the dependent variable, and how much the dependent variable changes as the independent variable varies.

Telling independent and dependent variables apart can be tricky, which makes designing a complex study hard. A variable that is independent in one study may not be independent in another, so pay extra attention to this step of your analysis.

To identify an independent variable, ask these questions:
1. Is the variable manipulated, controlled, or used as a subject variable to group the data?
2. Does the variable come before the other variable in time?
3. Is the researcher trying to understand whether or how this variable affects another variable?

To recognize a dependent variable, ask these questions:
1. Is the variable measured as an outcome of the study?
2. Does the variable depend on another variable in the study?
3. Is the variable measured only after other variables have been altered?
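The random assignment that a true experiment requires can be sketched in a few lines. This is an illustrative Python example, not a procedure from the book: the function name `assign_randomly`, the participant labels, and the two levels "control" and "treatment" are all assumptions made for the sketch.

```python
import random

def assign_randomly(participants, levels, seed=None):
    """Shuffle the participants, then deal them out to the levels in turn,
    so that each level receives an (almost) equal share."""
    rng = random.Random(seed)
    shuffled = list(participants)   # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    groups = {level: [] for level in levels}
    for index, person in enumerate(shuffled):
        groups[levels[index % len(levels)]].append(person)
    return groups

# Hypothetical study with 12 participants and two levels of the
# independent variable.
participants = [f"P{i:02d}" for i in range(1, 13)]
groups = assign_randomly(participants, ["control", "treatment"], seed=7)

for level, members in groups.items():
    print(level, members)
```

Because the assignment depends only on the shuffle, no characteristic of a participant can systematically decide which level that participant receives.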

Types of Data

In statistics, data is classified into two major categories: qualitative and quantitative. It is important to know which type of data you are working with so that you can evaluate it with the right technique.

Quantitative data carries information about quantities, that is, things that can be measured or counted. Quantitative data is numeric, so it is also called numerical data.

Qualitative data gives information about the qualities of things. These qualities cannot be measured. They are observed instead. Qualitative data is also called categorical data. Examples are socioeconomic status, opinions, and hair color.

Quantitative data is further divided into two subcategories: discrete and continuous.

Discrete data can take only specific, separate values, such as whole numbers or the points on a fixed scale. It is called discrete because the values are fixed points and nothing in between them counts. Data that you obtain by counting is discrete, such as the number of students in your class or the number of patients that a hospital records.

Continuous data can take any value within a range and can always be divided into finer parts. Height is continuous because it can be measured in meters and fractions of a meter. The time at which an event occurs is also continuous: it can be recorded in years and then divided into ever smaller units.

A Little Probability

In statistics, probability refers to the likelihood that an event occurs in a random experiment. Probability is expressed as a number between zero and one, where zero indicates that the event is impossible and one indicates that it is certain. The higher the probability of an event, the more likely that event is to happen.

Take the toss of a coin. Since the coin has two sides, a fair toss is equally likely to come up heads or tails, so the probability of each is ½.

Then there is conditional probability, which is the probability that an event occurs given that another specific event has already occurred.
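The coin example and the idea of conditional probability can be checked by listing every outcome. The sketch below is illustrative Python, not part of the book: it assumes a fair coin tossed twice, and the helper `probability` and the event functions are hypothetical names. It uses the standard relationship P(A | B) = P(A and B) / P(B).

```python
from fractions import Fraction
from itertools import product

# Sample space for tossing a fair coin twice: HH, HT, TH, TT.
sample_space = list(product("HT", repeat=2))

def probability(event, space):
    """Share of the equally likely outcomes in `space` where `event` holds."""
    favourable = [outcome for outcome in space if event(outcome)]
    return Fraction(len(favourable), len(space))

def first_is_heads(outcome):
    return outcome[0] == "H"

def at_least_one_head(outcome):
    return "H" in outcome

def both_heads(outcome):
    return outcome == ("H", "H")

# Probability that a single fair toss (here, the first one) is heads.
p_heads = probability(first_is_heads, sample_space)

# Conditional probability: both tosses are heads, GIVEN at least one head.
p_b = probability(at_least_one_head, sample_space)
p_a_and_b = probability(
    lambda o: both_heads(o) and at_least_one_head(o), sample_space
)
p_a_given_b = p_a_and_b / p_b

print(p_heads)       # 1/2
print(p_a_given_b)   # 1/3
```

Knowing that at least one head occurred removes the outcome TT from consideration, which is why the conditional probability (1/3) is higher than the unconditional probability of two heads (1/4).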

Inferential Statistics: Testing Hypotheses

Descriptive statistics summarizes the characteristics of a data set. Inferential statistics goes further: it draws conclusions and makes predictions based on the data. After collecting data from a sample, you can use inferential statistics to better understand the whole population that the sample came from.

Inferential statistics has two main uses:
1. Making estimates about the population.
2. Testing hypotheses in order to draw conclusions about the population.

Hypothesis testing is a formal way of applying inferential statistics. Its purpose is to compare populations, or to assess the relationship between variables, using a sample. The hypotheses, or predictions, are tested with statistical tests. Statistical tests also estimate the sampling error so that valid inferences can be made.

Statistical tests are either parametric or nonparametric. Parametric tests are considered more powerful because they are better at detecting an effect where one exists. Parametric tests make the following assumptions:
1. The population from which the sample comes follows a normal distribution on the measure of interest.
2. The sample is large enough to represent the population.
3. The variance, a measure of spread, is similar in each of the groups being compared.

When the data does not meet these assumptions, a nonparametric test is more suitable. Nonparametric tests are called distribution-free tests because they make no assumption about the distribution of the population data.

Statistical tests come in three forms: comparison tests, correlation tests, and regression tests.

A comparison test assesses whether there are differences in the means, medians, or rankings of the scores of different groups. To decide which comparison test suits your aim, consider whether the data meets the conditions for a parametric test, the number of samples, and the level of measurement of the variables.

A correlation test assesses whether two variables are associated with each other.

A regression test shows whether changes in a predictor variable lead to changes in an outcome variable. You choose among regression tests based on the number and types of the predictor and outcome variables. Regression tests are typically parametric. If the data is not normally distributed, you can apply a data transformation.

Two types of Error

There are two types of errors in statistics: the type 1 error and the type 2 error. A type 1 error occurs when you falsely reject a true null hypothesis. A type 2 error occurs when you fail to reject a false null hypothesis. If you want to reduce the risk of a type 1 error, you need to lower the predetermined level of statistical significance. If you want to reduce the risk of a type 2 error, you need to increase the sample size so that you can detect the effect size of interest with adequate statistical power. There is a trade-off: when you reduce the risk of a type 1 error, you become more likely to make a type 2 error.
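The trade-off between the two errors can be sketched numerically. The following Python example is only an illustration under stated assumptions: a one-sided z test of a null mean of 0, a true mean of 0.5, and a known standard deviation of 1. The function name `type2_error` and all the numbers are hypothetical and do not come from this book.

```python
from math import sqrt
from statistics import NormalDist

def type2_error(alpha, n, effect=0.5, sigma=1.0):
    """Type 2 error rate of a one-sided z test of H0: mean = 0 when the true mean is `effect`."""
    std_error = sigma / sqrt(n)
    # H0 is rejected when the sample mean exceeds this cutoff.
    cutoff = NormalDist(0, std_error).inv_cdf(1 - alpha)
    # Type 2 error: the sample mean stays below the cutoff although H0 is false.
    return NormalDist(effect, std_error).cdf(cutoff)

beta_5pct = type2_error(alpha=0.05, n=25)      # baseline
beta_1pct = type2_error(alpha=0.01, n=25)      # stricter significance level
beta_large_n = type2_error(alpha=0.05, n=100)  # larger sample
```

With these assumed numbers, tightening the significance level from 0.05 to 0.01 raises the type 2 error rate, while increasing the sample size from 25 to 100 lowers it.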

Some Excel Fundamentals

This is the part where we move from statistics to the tool that this book is really about, and that is Excel. Here is a breakdown of some of the fundamentals of Excel.

Auto filling cells

You can use the Auto Fill feature to fill cells with data that follows a pattern. Here is how to do that:

1. Select the cells that you would like to use as the basis for filling the additional cells. For example, for the series 1, 2, 3, 4, 5, type 1 and 2 in the first two cells. For the series 2, 4, 6, 8, type 2 and 4.
2. Drag the fill handle across the cells that you want to fill.
3. If necessary, click Auto Fill Options and select the option that you want.
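The idea behind a linear fill can be sketched in Python. This is only a simplified model that assumes the series grows by the constant step between the first two values; Excel's Auto Fill recognizes other patterns as well, and the function name `autofill` is hypothetical.

```python
def autofill(seed, count):
    """Extend a series by repeating the step between the first two seed values."""
    step = seed[1] - seed[0]
    return [seed[0] + step * i for i in range(count)]

series_a = autofill([1, 2], 5)   # [1, 2, 3, 4, 5]
series_b = autofill([2, 4], 4)   # [2, 4, 6, 8]
```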

Referencing Cells

A cell reference refers to a cell or a range of cells and can be used in a formula so that Excel can find the data that you want to evaluate. You can use a cell reference to refer to any of these:

1. Data from one or more contiguous cells on the worksheet
2. Data contained in different areas of the worksheet
3. Data on another worksheet in the same workbook

Here is how to create a reference on the same worksheet:

1. Select the cell in which you want to enter the formula.
2. In the formula bar, type the equal sign (=).
3. Then do one of these: reference one or more cells, or reference a defined name.
4. To reference one or more cells, select the cell or the range on the same worksheet. You can drag the border of the selection to move it, or drag the corner of the border to expand it.
5. To reference a defined name, either type the name, or press F3, choose the name in the Paste Name box, and select OK.
6. Then finish the entry. If the reference is in a single cell, press Enter. If the reference is in an array formula, press Ctrl+Shift+Enter. The reference can be one cell or a range of cells, and the array formula can calculate single or multiple results.

Referencing a cell on another worksheet

To refer to a cell on another worksheet, add the name of the worksheet followed by an exclamation mark (!) in front of the cell reference. In this example, we use the worksheet function named AVERAGE to calculate the average value of the range B1:B10 on a worksheet named Marketing: =AVERAGE(Marketing!B1:B10)

1. Marketing refers to the worksheet named Marketing.
2. B1:B10 refers to the range of cells on that worksheet.
3. The exclamation mark separates the worksheet reference from the cell range reference.

Here is how to create the reference:

1. Select the cell in which you would like to enter the formula.
2. In the formula bar, type the equal sign (=) and the formula that you would like to use.
3. Select the tab of the worksheet that you are referencing.
4. Select the cell or the range of cells that you would like to reference.
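To show how the sheet name and the range fit together, here is a small Python sketch that splits a reference at the exclamation mark and averages the referenced cells. The toy `workbook` dictionary, its values, and the helper names are all hypothetical; this models the idea only and is not how Excel works internally.

```python
import re
from statistics import mean

# Toy workbook: sheet name -> {cell address: value}. All values are made up.
workbook = {"Marketing": {f"B{row}": row * 10 for row in range(1, 11)}}

def split_reference(reference):
    """Split 'Sheet!Range' into (sheet, range); sheet is None for a same-sheet reference."""
    sheet, separator, cell_range = reference.rpartition("!")
    return (sheet if separator else None, cell_range)

def column_range(cell_range):
    """Expand a single-column range such as 'B1:B10' into cell addresses."""
    start, end = cell_range.split(":")
    column, first = re.fullmatch(r"([A-Z]+)(\d+)", start).groups()
    last = re.fullmatch(r"([A-Z]+)(\d+)", end).group(2)
    return [f"{column}{row}" for row in range(int(first), int(last) + 1)]

sheet, cell_range = split_reference("Marketing!B1:B10")
average = mean(workbook[sheet][cell] for cell in column_range(cell_range))
```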

Creating a cell reference using the link cells command

Alternatively, you can copy and paste a cell and then use the Link Cells command to create a cell reference. You can use this command in two ways.

First, you can display important information in a more prominent position. Imagine a workbook with many worksheets, where each worksheet has a cell that summarizes the information in the other cells of that worksheet. If you want to make these summary cells more prominent, you can create cell references to them on the first worksheet of the workbook. This makes it possible to see a summary of the whole workbook on the first worksheet.

Second, it makes it easier to create cell references between worksheets and workbooks. When you use the Link Cells command, Excel automatically pastes the proper syntax for you.

1. Select the cell with the data that you would like to link.
2. Press Ctrl+C, or go to the Home tab and, in the Clipboard group, select Copy.
3. Paste it where you want the reference, and then choose Paste Link from the paste options.

Conclusion

We hope that you now understand some of the fundamentals of Excel as well as the fundamentals of statistics, because statistics is what you are going to be doing with Excel.

Chapter 2

Understanding Excel's statistical capabilities

So much has been said about statistics that you might wonder why we have been talking about it for so long. The reason is that you use Excel to make statistical analyses, and statistics is the basis of that work. Now we are going to take you through the steps of using Excel for statistical purposes.

Getting started

Excel has many statistical functions, ranging from the basic mean, median, and mode to more complicated statistical distributions and probabilities. So that you have a better understanding of what the Excel statistical functions are, we divide them into two sets:

1. The basic statistical functions
2. The intermediate statistical functions

Worksheet function

The basic statistical functions in Excel include the COUNT, COUNTA, COUNTBLANK, and COUNTIFS functions. You use the COUNT function to count the number of cells that contain numbers.

Then there is the COUNTA function, which counts all of the cells that contain any information, whether numbers, error values, or even empty text.

Then there is the COUNTIF function. This is one of the most commonly used functions in Excel. It counts the cells that meet the condition that you set for it.

Now that we have taken you through the basic statistical functions in Excel, we can discuss the intermediate statistical functions. As an analyst, you are very likely to use these kinds of functions. They include the AVERAGE function, the MEDIAN function, the MODE function, the standard deviation function (STDEV), the variance function (VAR), the quartile function (QUARTILE), and the correlation function (CORREL).

1. The AVERAGE function is the most popular of these functions. It returns the arithmetic mean, or average, of the cells in a specific range.

2. The AVERAGEIF function returns the arithmetic mean of the cells that meet the criteria that you have set.

3. The MEDIAN function gives you the central value of the data. It is used in the same way as the AVERAGE function.

4. The MODE function gives you the most frequent value in the range that was given.

5. The standard deviation function helps you to discover how much the values deviate, or vary, from the average. This function is very important in Excel.

6. The variance function determines the amount of variation in the data set. The more the data is spread out, the larger the variance.

7. The quartile function divides the data into four parts, in the same way that the median divides the data into two and finds what is in the middle. The points that divide the data are the quartiles, and the data between those points falls in the first, second, third, and fourth quarter.

8. The correlation function gives you the relationship that exists between two variables. The correlation coefficient ranges from -1 to +1. All of these functions are used by analysts to examine data.
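For readers who want to check these measures outside Excel, here is a minimal Python sketch. The two small data sets are made up for illustration, and the mapping to Excel function names in the comments reflects our understanding of which variant each Python call corresponds to, so treat it as an assumption.

```python
from math import sqrt
from statistics import mean, median, mode, stdev, variance, quantiles

hours = [2, 3, 3, 5, 7, 8]            # made-up predictor values
scores = [50, 55, 60, 70, 85, 90]     # made-up outcome values

average = mean(hours)                 # comparable to AVERAGE
middle = median(hours)                # comparable to MEDIAN
most_common = mode(hours)             # comparable to MODE
sample_sd = stdev(hours)              # sample standard deviation, comparable to STDEV
sample_var = variance(hours)          # sample variance, comparable to VAR
# Inclusive quartiles, assumed comparable to QUARTILE.INC.
q1, q2, q3 = quantiles(hours, n=4, method="inclusive")

def correlation(xs, ys):
    """Pearson correlation coefficient, the quantity CORREL reports."""
    mx, my = mean(xs), mean(ys)
    covariance = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    spread = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return covariance / spread

r = correlation(hours, scores)        # always between -1 and +1
```

The sample variance is the square of the sample standard deviation, and the second quartile is the same value as the median.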

Setting up for statistics

There are different tools that you can use for analysis in Excel, in addition to the statistical functions. To analyze a data set with these tools, you need to do this:

Go to the Data tab and select the Data Analysis command button to open the Excel Data Analysis dialog box.

In the Data Analysis dialog box, you are going to see all of the analysis tools for statistics. So, if you want descriptive statistics, you can just select that tool.

Array function

One of the most powerful features in Excel is the array formula. The bad thing about array formulas is that they are not very easy to master. A single array formula can perform several calculations and replace thousands of ordinary formulas. The problem, however, is that people are afraid of using them in the first place. They might be the most confusing feature to learn. However, the guide that follows should make them much easier to learn.

What's in a name? An array of possibilities

Before we go any further, we ought to ask the question: what is an array, and what is an array formula? An array is a collection of items. These items can be text or numbers, and they can sit in a single row, a single column, or multiple rows and columns. If you want to put the list of things that you want to buy from the store in Excel array style, it is going to look something like this: {"stapler", "Shakespeare", "pencil", "pens"}

Select the cells A1 to D1, type the array above in the formula bar preceded by the equal sign, and then press Ctrl+Shift+Enter to get the result. Once you do that, you have created a one-dimensional horizontal array.

There is one difference between an array formula and a regular formula, and that is why people dread it. An array formula in Excel evaluates all of the individual values in an array and performs multiple calculations on one or several items according to the conditions expressed in the formula.

Creating your array formula

Here is a simple example of an array formula. Imagine that you have the quantities of the items sold in column B and the prices of these items in column C, and you would like to find the total of the sales that you have made. Normally, you would first calculate the subtotal in each row using a formula such as =B2*C2, repeat this for the next row, then the next, and so on, and then find the sum of those values. The array formula solves the entire issue in one step. After entering your data, just use the formula =SUM(B2:B5*C2:C5), based on the illustration that we have here for you.

The next thing to do is to press the shortcut Ctrl+Shift+Enter in order to finish the array formula. Once you do this, Excel encloses the formula in curly braces. This means that even when you are working with many rows of data, finding the total can be just that simple.
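The difference between the row-by-row approach and the array formula can be sketched in Python. The quantities and prices below are made-up values standing in for B2:B5 and C2:C5.

```python
quantities = [3, 5, 2, 4]                # stands in for B2:B5 (made-up values)
prices = [2.50, 1.20, 10.00, 0.75]       # stands in for C2:C5 (made-up values)

# Row-by-row approach: one subtotal per row (like =B2*C2), then add them up.
subtotals = [q * p for q, p in zip(quantities, prices)]
total_from_subtotals = sum(subtotals)

# Array-formula approach: multiply element by element and sum in one expression,
# which is the idea behind =SUM(B2:B5*C2:C5).
total = sum(q * p for q, p in zip(quantities, prices))
```

Both approaches give the same total; the second simply avoids the helper column of subtotals.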

Using a data analysis tool

If you want to present your data in a simpler and more illustrative manner, both quickly and intuitively, you can use the data analysis tool that has been made available for you. Depending on the version of Excel that you are using, you will find the data analysis tool on the Data tab of the ribbon. However, if you are using an older version of Excel, the data analysis tools will not be available to you from the get-go. What you need to do in situations like that is to follow these six steps:

1. Go to the File tab on the ribbon.

2. Select Options.

3. Select Add-ins in the dialog box that pops up.

4. At the bottom of the dialog box, select Go next to the Manage box that says Excel Add-ins. The Manage box offers other choices, but Excel Add-ins is the one that you need here.

5. In the next dialog box, check Analysis ToolPak and select OK. Then, voila, you are going to see the Analysis group in the ribbon.

6. To access the Analysis group, go to the Data tab on the ribbon. You are going to see it at the right end of the tab.

Additional data analysis tool package

When you are unlocking the data analysis tools for Excel, you are going to notice that there are other add-ins available. Based on the diagram that we have here, you are going to notice:

1. The Analysis ToolPak
2. The Analysis ToolPak - VBA
3. The Euro Currency Tools
4. The Solver Add-in

When you select any of these options, they become available for you to take a crack at data analysis that surpasses the basics. Here is a breakdown of the uses of these add-ins.

1. The Analysis ToolPak is included with Excel. It provides extended tools for statistical work, such as the histogram, correlation, a range of z-tests and t-tests, and a random number generator. As soon as you load it, you are going to see it in the Analysis group on the Data tab.

2. The Analysis ToolPak - VBA is a more advanced add-in that makes the Analysis ToolPak tools, such as the moving average, available to VBA macros.

3. The Euro Currency Tools convert a number from euros to a euro member currency, or from one euro member currency to another by using the euro as an intermediary (triangulation). You can only convert the currencies of European countries that have adopted the euro.

4. The Solver Add-in helps you to define and solve a problem. It can be used for what-if analysis when you are looking for the optimal value for a formula in one cell, which is known as the objective cell. This value is subject to constraints, or limits, on the values of other formula cells on the worksheet. Solver works with a group of cells, called decision variables, that are used to compute the formulas in the objective and constraint cells.

Accessing commonly used functions

A function is a predefined formula. It performs a specific calculation using specific values placed in a particular order. In a spreadsheet, you are going to see some of the most common functions, such as the sum, average, count, maximum, and minimum value of a range of cells. To use a function well, you need to know its parts and how to build its arguments in order to calculate values and cell references. Every function has three required parts: the equal sign (=), the name of the function, such as COUNT, COUNTA, or SUM, and the argument, which can be a range or a criterion. Here are a few common functions:

1. SUM adds the values of the cells in the argument.
2. AVERAGE determines the average of the values in the argument. It calculates the sum of the cells and divides that value by the number of cells in the argument.
3. COUNT counts the number of cells in the argument that contain numerical data. Use it when you want to count specific cells in a range.
4. MAX determines the highest cell value in the argument.
5. MIN determines the lowest cell value in the argument.
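As a rough illustration of how these functions treat a range, here is a short Python sketch. The list `cells` is a made-up range that mixes numbers, text, a blank cell (`None`), and an empty text string; how each cell type is modeled is an assumption of this sketch.

```python
cells = [12, "n/a", 7.5, None, 20, ""]   # made-up range with text and blank cells

# Keep only the numeric cells, which is what COUNT, SUM, and AVERAGE work with.
numbers = [c for c in cells if isinstance(c, (int, float)) and not isinstance(c, bool)]

total = sum(numbers)          # comparable to SUM
count = len(numbers)          # comparable to COUNT: numeric cells only
average = total / count       # comparable to AVERAGE: the sum divided by the count
highest = max(numbers)        # comparable to MAX
lowest = min(numbers)         # comparable to MIN

# Comparable to COUNTA: every non-blank cell, including the empty text "".
non_empty = len([c for c in cells if c is not None])
```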

The new analysis data tool

Now we are going to take you through the newer data analysis tools that you can work with in Excel, such as data from pictures and the related functions.

Data from pictures

Suppose you have a picture of some data. It can be tedious to type that data manually into the cells of an Excel spreadsheet. You can save a lot of time and reduce the risk of errors if you eliminate this task by using the Data from Picture feature. When you use this feature, Excel scans an image and analyzes it to extract the data. Once you have reviewed and edited the data that it finds, the data can be imported into the spreadsheet. At the time of writing, you are not going to find this feature in Excel for Windows, but you are going to see it in the Mac, iPhone, and Android versions.

There are three ways to insert data into Excel from an image: from a picture file that you saved, from an image that you copied, or from your iPhone or iPad. If you have saved an image that contains the data that you need, inserting the data is simple. First open the spreadsheet and go to the Insert tab. Then select Data from Picture and choose the picture from a file.

Related functions

There are also related functions that you can work with. Here we are going to take you through how to use them.

DEVSQ

The Excel DEVSQ function calculates the sum of the squared deviations from the mean of a data set. The variance and standard deviation functions deal with negative deviations by squaring the deviations before averaging them. DEVSQ calculates the sum of the squared deviations from the mean, but it does not divide that sum by N or by N-1. DEVSQ accepts multiple arguments. Each argument can be a constant, a cell reference, or a range. You can also supply a single range or array instead of multiple arguments.

Based on this illustration, the formula that we put in G5 is =DEVSQ(B5:B10), while the formulas that we put in C5 and D5 are =B5-$G$4 and =C5^2.
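The calculation behind DEVSQ can be sketched in a few lines of Python. The data set is made up, and the function name `devsq` is simply borrowed from Excel for readability.

```python
def devsq(values):
    """Sum of squared deviations from the mean, without dividing by N or N-1."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

data = [2, 4, 4, 4, 5, 5, 7, 9]                     # made-up values, mean is 5
sum_of_squares = devsq(data)                        # 32.0
population_variance = sum_of_squares / len(data)    # dividing by N gives 4.0
sample_variance = sum_of_squares / (len(data) - 1)  # dividing by N-1
```

Dividing the result by N or by N-1 gives the population or the sample variance, which is exactly the step that DEVSQ leaves out.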

Average deviation

Before we take you through how to find the average deviation in Excel, the first thing that we ought to do is clarify what an average deviation is. When you have a data set, the average deviation is the average of all of the deviations from a central point. It measures the average distance of the values from that central point, which can be the mean or the median. The mean is the average value of the numbers in the data set, and the median is the middle value when the numbers are arranged from the lowest to the highest. The average deviation is also called the mean absolute deviation or the average absolute deviation. If you are working with a small set of data, you can calculate the average deviation on your own. Larger data sets, however, will need software to make this calculation faster and easier. There are a few steps to find the average deviation:

1. Find the mean: Add all of the values in the data set and divide the sum by the number of values. If you don't want to use the mean, you can use the median instead. To find the median, arrange the numbers in order and count them. If the count is odd, divide it by two and round up to find the position of the median. If the count is even, divide it by two and then find the average of the number in that position and the number in the next higher position.

2. Find the deviations from the mean: For each value in the data set, find the difference between the value and the mean that you calculated previously, and write down the absolute value of the result. The absolute value of a number is also known as its modulus, or its non-negative value. Since the direction of the variation does not matter when you are calculating the average deviation, every deviation is recorded as a non-negative number.

3. Find the sum of the deviations: Add the absolute deviations together. Since we are working with absolute values, the sum cannot be negative.

4. Find the average deviation: Divide the sum of the deviations that you calculated earlier by the number of deviations that you added together. The number that you get is the average deviation from the mean.
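The four steps above can be sketched in Python as follows. The data set is made up, and the function names are hypothetical.

```python
def median(values):
    """Middle value of the sorted data; for an even count, the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        return ordered[n // 2]            # position n/2 rounded up, counting from 1
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

def average_deviation(values):
    mean = sum(values) / len(values)                    # step 1: find the mean
    deviations = [abs(v - mean) for v in values]        # step 2: absolute deviations
    total = sum(deviations)                             # step 3: sum of the deviations
    return total / len(deviations)                      # step 4: average deviation

data = [2, 4, 4, 4, 5, 5, 7, 9]        # made-up values, mean is 5
result = average_deviation(data)       # (3+1+1+1+0+0+2+4) / 8 = 1.5
```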

AVEDEV

The AVEDEV function calculates the average of the absolute deviations from the mean of a data set. The variance and the standard deviation deal with negative deviations by first squaring the deviations and then averaging them. AVEDEV works only with absolute values. The result is also called the average absolute deviation. It is one easy way to describe the variability in a data set, although it is less commonly used than the variance and the standard deviation. The advantage of AVEDEV, however, is that the unit does not change. If you have values in centimeters, the average absolute deviation is also going to be in centimeters. AVEDEV accepts up to 255 arguments. Each argument can be a hardcoded constant, a cell reference, or a range. Based on the example that we have here, the formula is =AVEDEV(B5:B10)

The formulas that you see in C5 and D5 are =B5-$G$4 for the deviation and =ABS(C5) for the absolute deviation.

Chapter 3

Meeting standards and catching some Z

Before we go any further into catching some Z, we have to first explain what a Z is. When you want to compare an alternative hypothesis with the null hypothesis, you make use of a hypothesis test. By null hypothesis, we mean the commonly accepted statement. When we run a Z test, we are trying to refute the null hypothesis in favor of the alternative hypothesis. In this chapter, we are going to take you through how to meet standards and work with Z scores.

Catching some Z's

First, what is a Z score? A Z score is simply the number of standard deviations that a value lies from the mean. You can also call the Z score a standardized score. To calculate the Z score, subtract the mean from the data point and divide the result by the standard deviation. So, to find the Z score, two things are required: the mean and the standard deviation. The data that we have here is a list of exam scores.

What we need to do first is to calculate the mean of the data set.

To find the mean, select a new empty cell. For illustrative purposes, we have chosen the cell next to the cell labeled MEAN.

Then type in the formula =AVERAGE(A2:A9). When you press Enter, you are going to get the mean. Then you need to calculate the standard deviation. To do this, go to an empty cell. It is probably best to label it as the standard deviation. Then go to the empty cell next to the label and enter the formula =STDEV(A2:A9)

When you press Enter, you are going to get the standard deviation.

In this instance, our standard deviation is 22.40177. Now, to calculate the Z score, go to the cell next to the number and enter the formula =(A2-F4)/F5

When you press Enter, you are going to get the following number.

This means that the value 87 has a Z score of 0.574731, so it lies about 0.57 standard deviations above the mean. If you want to replicate this for all of the values, you can click the cell with the formula and drag it down over the cells where you want it replicated. But before you can do that, you need to lock the cells that hold the mean and the standard deviation. To lock a cell, enter the dollar symbol before the column letter and the row number, so that the formula becomes =(A2-$F$4)/$F$5, as we have done in this illustration.

Then press Enter.

After pressing Enter, select the cell with the formula and drag it down. All of the Z scores then appear.

There are other ways to find the Z score and we are going to be taking you through that in other segments of this book.
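The same calculation can be sketched in Python. The exam scores below are made up and are not the data from the illustration; `stdev` computes the sample standard deviation, which we assume corresponds to the STDEV function used above.

```python
from statistics import mean, stdev

scores = [60, 70, 80, 90, 100]      # made-up exam scores

average = mean(scores)              # plays the role of the cell holding the mean
sd = stdev(scores)                  # plays the role of the cell holding the standard deviation

# One Z score per value: (value - mean) / standard deviation.
z_scores = [(x - average) / sd for x in scores]
```

The score equal to the mean gets a Z score of zero, scores above the mean get positive Z scores, and scores below the mean get negative ones.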

Characteristics of the Z score

The Z score is very important in statistics. It tells you more about a value, and it also tells you where that value lies within the distribution of the numbers. For instance, if a value is 3 standard deviations above the mean, you know that it is three times the average distance above the mean and that it represents one of the higher numbers in the sample. Alternatively, if a value is just one standard deviation below the mean, you know that it is on the lower side of the midrange of the values in the sample. The Z score is particularly important when you want to do statistical inference on quantitative variables. Simply put, a Z score is the signed distance of a value from the mean, measured in standard deviations. The formula above shows precisely what the Z score is: the value minus the mean, divided by the standard deviation. With the Z score, you can take a value such as a life expectancy and see at once how it ranks among the others. Here are some of the characteristics of the Z score:

1. The mean of the Z scores is always zero.

2. The standard deviation of the Z scores is 1.

3. The graph of the Z scores has the same shape as the graph of the original group of numbers.

4. When you square the Z scores and add them up, the sum is the same as the number of Z scores.

5. When the Z score is above 0, the sample value is above the mean, and when the Z score is under 0, the sample value is under the mean.
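These characteristics can be checked numerically. The Python sketch below uses made-up values and the population standard deviation (`pstdev`); that choice is an assumption of the sketch. With the sample standard deviation, the sum of the squared Z scores would equal the number of values minus one instead.

```python
from statistics import mean, pstdev

values = [4, 8, 15, 16, 23, 42]        # made-up values

mu = mean(values)
sigma = pstdev(values)                 # population standard deviation
z = [(v - mu) / sigma for v in values]

mean_of_z = mean(z)                    # characteristic 1: zero, up to rounding
sd_of_z = pstdev(z)                    # characteristic 2: one, up to rounding
sum_of_squares = sum(s * s for s in z) # characteristic 4: the number of values

# Characteristic 5: a Z score is positive exactly when the value is above the mean.
signs_match = all((s > 0) == (v > mu) for s, v in zip(z, values))
```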

Standardize

Then there is the Excel STANDARDIZE function. It gives you a normalized value, the Z score, using the mean and the standard deviation. If you want to use this function, you first need the mean, calculated with the AVERAGE function just as we have already done, and the standard deviation, calculated with a function such as STDEV.P. To work with the STANDARDIZE function, go to an empty cell, or a cell that you have designated for the standardized value, and enter the formula =STANDARDIZE(A2, F4, F5). This gives you the Z score for the number in A2.

When you press Enter, you are going to get the following.

You are going to notice that the Z score is the same number that we got with the method we used earlier. To lock the cells so that you can drag and fill the other cells swiftly, use the dollar sign. This means that our formula becomes =STANDARDIZE(A2, $F$4, $F$5)

Where do you stand?

RANK.EQ and RANK.AVG

Let us take this step by step. First, what is RANK.AVG? The Excel RANK.AVG function gives you the statistical rank of a given value within a data set. If there are duplicate values in the list, it returns the average rank. Here is the syntax for this function: RANK.AVG(number, ref, [order])

On the other hand, there is the RANK.EQ function. It gives you the rank of a number within a list of numbers. The rank is the size of the number relative to the other values in the list, and if multiple values have the same rank, the top rank of that set of values is returned. If you sorted the list, the rank of the number would be its position. The syntax of RANK.EQ is the same as that of RANK.AVG: RANK.EQ(number, ref, [order])

Based on the description of these two functions, you are going to notice that the difference appears when there are duplicate values in the list. The RANK.EQ function returns the top rank of the duplicates, whereas the RANK.AVG function gives you their average rank. Take a look at this example. If you have the list 4, 5, 5, 6 in ascending order, the number 5 is both the second and the third value. Hence, for the value 5, RANK.EQ gives you the rank 2 and RANK.AVG gives you the rank 2.5. To illustrate further, take a look at these examples.

Based on this illustration, we have ranked the number 87 by selecting its cell in the formula and then the reference, that is, the list (A2:A8). When you do not supply the order argument, the function automatically ranks in descending order.

This illustration shows the number 1 as the last argument of the formula, as in =RANK.AVG(A4, A2:A8, 1). The 1 is the order argument, and it means that the ranking is done in ascending order.
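The difference between the two ranking rules can be sketched in Python. The helper names mirror the Excel functions only for readability, and the `ascending` flag is a stand-in for the order argument; both are assumptions of this sketch, which also requires that the number appears in the list.

```python
def rank_eq(number, ref, ascending=False):
    """Top rank among ties; descending order unless ascending is True."""
    if number not in ref:
        raise ValueError("number must appear in ref")
    if ascending:
        return 1 + sum(1 for v in ref if v < number)
    return 1 + sum(1 for v in ref if v > number)

def rank_avg(number, ref, ascending=False):
    """Average of the ranks shared by the tied values."""
    ties = ref.count(number)
    return rank_eq(number, ref, ascending) + (ties - 1) / 2

values = [4, 5, 5, 6]
top_rank = rank_eq(5, values, ascending=True)       # 2
average_rank = rank_avg(5, values, ascending=True)  # 2.5
```

The two values of 5 occupy the second and third positions, so the top rank is 2 and the average rank is 2.5, matching the example above.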

LARGE and SMALL

Then there are the LARGE and SMALL functions in Excel. Here is a breakdown of how they work. The LARGE function gives you a numeric value based on its position in the list when the list is sorted in descending order. This means that the LARGE function gives you the nth largest value in the list, where 1 corresponds to the largest value and 2 to the second largest. So, if you want to get the first, second, and third highest values from a list, you can use the numbers 1, 2, and 3. The LARGE function works with two arguments: the array and k. The array is the range, and k is the position. Based on the example that we have here, this is what to do to get the first, second, and third largest numbers.

Based on this illustration, to get the largest value, enter the formula =LARGE(A2:A11,1). To get the second largest, enter the formula =LARGE(A2:A11,2). If you want to see the third largest, enter the formula =LARGE(A2:A11,3)

Now that we have taken you through the LARGE function, the SMALL function works similarly. The difference is that the SMALL function gives you the nth smallest value. Take a look at how it works in this illustration.

Based on this illustration, we have been able to get the smallest number, which is 34.
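Here is a minimal Python sketch of the same idea. The list of values is made up and is not the data from the illustrations; the function names are borrowed from Excel only for readability.

```python
def large(array, k):
    """k-th largest value, where k = 1 is the largest."""
    if not 1 <= k <= len(array):
        raise ValueError("k must be between 1 and the number of values")
    return sorted(array, reverse=True)[k - 1]

def small(array, k):
    """k-th smallest value, where k = 1 is the smallest."""
    if not 1 <= k <= len(array):
        raise ValueError("k must be between 1 and the number of values")
    return sorted(array)[k - 1]

values = [34, 87, 56, 91, 45, 78, 62, 99, 50, 70]   # made-up values

top_three = [large(values, k) for k in (1, 2, 3)]   # [99, 91, 87]
smallest = small(values, 1)                          # 34
```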

PERCENTILE.INC and PERCENTILE.EXC
The nth percentile of a list is the value that cuts off the lowest n percent of the data values when you sort all the values from smallest to largest. For example, the 90th percentile is the value that separates the bottom 90 percent of the data values from the top 10 percent. PERCENTILE.INC returns the kth percentile of a data set, where k can be anywhere from 0 to 1 inclusive. PERCENTILE.EXC returns the kth percentile of the data set, where k must lie strictly between 0 and 1, excluding 0 and 1 themselves. There is also a third, older function, PERCENTILE, which returns the same values as PERCENTILE.INC. Based on what we have in the illustration below, to get the 30th percentile with PERCENTILE.EXC, enter the formula =PERCENTILE.EXC(A2:A11,0.3), with 0.3 standing for thirty percent.

Using PERCENTILE.INC, here is what you will get.
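The difference between the two functions comes down to how the position of the percentile is computed before interpolating between neighboring values. The Python sketch below shows the two interpolation rules as they are commonly described (position k(n-1) for the inclusive rule and k(n+1) for the exclusive rule). The function names and the data are assumptions made for illustration, not the worksheet from the illustration.

```python
import math


def percentile_inc(values, k):
    """Inclusive rule: k may be anywhere from 0 to 1 (mirrors PERCENTILE.INC)."""
    if not values or not 0 <= k <= 1:
        raise ValueError("need data and 0 <= k <= 1")
    s = sorted(values)
    pos = k * (len(s) - 1)            # zero-based fractional position
    lo = math.floor(pos)
    if lo >= len(s) - 1:              # k = 1 lands on the largest value
        return s[-1]
    return s[lo] + (pos - lo) * (s[lo + 1] - s[lo])


def percentile_exc(values, k):
    """Exclusive rule: k must fall between 1/(n+1) and n/(n+1)."""
    s = sorted(values)
    n = len(s)
    pos = k * (n + 1)                 # one-based fractional position
    if n == 0 or pos < 1 or pos > n:
        raise ValueError("k is outside the range this rule can interpolate")
    lo = math.floor(pos)
    if lo >= n:                       # pos is exactly n: the largest value
        return s[-1]
    return s[lo - 1] + (pos - lo) * (s[lo] - s[lo - 1])


data = list(range(1, 11))             # hypothetical data: 1, 2, ..., 10
print(round(percentile_inc(data, 0.3), 4))   # 3.7
print(round(percentile_exc(data, 0.3), 4))   # 3.3
```

With ten values, the exclusive rule cannot return anything for k below 1/11 or above 10/11, which is why the sketch raises an error in that case.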

PERCENTRANK.INC and PERCENTRANK.EXC
PERCENTRANK.INC and PERCENTRANK.EXC both give you the percentage rank of a value in a data set. The percentage rank of a value is its relative position within the data set, expressed as a number between 0 and 1. Roughly speaking, it is the proportion of the values in the data set that fall below the value itself. The arguments of both functions are array, x, and significance. Array is the array or the reference to the range of cells that holds the data. X is the value (a number or a reference to a cell holding a number) whose percentage rank you want to find. Significance is an optional integer, or a reference to a cell holding one, that specifies the number of significant digits in the returned percentage. If you omit significance, the default of 3 is used.

If the value is not in the data, the function interpolates to find its percentage rank. If the value falls outside the range of the data, however, the function returns the #N/A error. If significance is not an integer, it is truncated. The major difference between the two functions is that PERCENTRANK.INC returns 0 for the smallest value in the range and 1 for the largest, whereas PERCENTRANK.EXC returns a value strictly between 0 and 1, even for the smallest and largest values.
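To make the difference between the inclusive and exclusive ranks concrete, here is a small Python sketch. It assumes the usual formulas, (number of values below x) / (n - 1) for the inclusive rank and (number of values below x + 1) / (n + 1) for the exclusive rank, and it only handles values that actually appear in the data. Interpolation and the significance argument are left out, and the names and data are hypothetical.

```python
def percentrank_inc(values, x):
    """Inclusive percent rank of x, for x present in the data."""
    if len(values) < 2 or x not in values:
        raise ValueError("x must be one of at least two data values")
    below = sum(1 for v in values if v < x)
    return below / (len(values) - 1)


def percentrank_exc(values, x):
    """Exclusive percent rank of x, for x present in the data."""
    if not values or x not in values:
        raise ValueError("x must be one of the data values")
    below = sum(1 for v in values if v < x)
    return (below + 1) / (len(values) + 1)


data = list(range(1, 11))   # hypothetical data: 1, 2, ..., 10

print(percentrank_inc(data, 1), percentrank_inc(data, 10))   # 0.0 1.0
print(round(percentrank_exc(data, 1), 3),
      round(percentrank_exc(data, 10), 3))                   # 0.091 0.909
```

The smallest and largest values show the contrast most clearly: the inclusive rank reaches 0 and 1, while the exclusive rank stays inside those bounds.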

Data analysis tool: Rank and Percentile
You can also use the data analysis tool to find the rank and percentile of the values in a list. One advantage of this feature is that the percentile is added to the output table alongside the rank. To use the Analysis ToolPak for rank and percentile, go to the Data tab on the ribbon and choose Data Analysis in the Analysis group. Then scroll through the list until you find Rank and Percentile.

Here is the dialog box you are going to find next.

In the Input Range box, enter the cell reference for the list of numbers that you want to analyze. The Grouped By option indicates whether the input range is arranged in columns or in rows. In this example, the data is arranged in a single column. Tick the Labels in First Row box if the first row of the data holds labels; the labels then appear in the output table as well. The Output Range is the upper-left cell of the output table. Excel creates one output table for each column or row of data. Here we have only one column, so we get just one output table. Select New Worksheet Ply if you want the results on a new worksheet. Select New Workbook to create a brand-new workbook and paste the results into a new worksheet in that workbook. And it is that simple. You have found the rank and the percentile.

Conclusion
We hope that you now have a better understanding of these Excel functions.

Chapter 4
Summarizing it all
Summary functions are among the most basic functions in Excel. In short, Excel does not get any easier than this. Here we are going to take you through some of the ways of summarizing data with Excel.

Counting out
Excel has a series of functions for counting: COUNT, COUNTA, COUNTBLANK, COUNTIF, and COUNTIFS. We are going to break down all of these now.

COUNT, COUNTA, COUNTBLANK, COUNTIF, COUNTIFS
Here is a breakdown of these functions. You use the COUNT function to count the number of cells that contain numbers.

Then there is the COUNTA function, which counts all of the cells that are not empty, whether they hold numbers, text, error values, or empty text. The COUNTBLANK function does the opposite and counts the empty cells in a range.

Then there is the COUNTIF function, which is one of the most commonly used functions in Excel. It counts the cells that meet a condition that you set. The difference between COUNTIF and COUNTIFS is the number of conditions: COUNTIF supports a single condition, while COUNTIFS works with multiple conditions.

Based on the illustration that we have here, we have used the following formulas for the cells G5, G6, and G7. Note that COUNTIF is not case-sensitive.
=COUNTIF(D5:D12,">100") counts the sales that are over 100.

=COUNTIF(B5:B12,"Jim") counts the sales where the name is Jim.
=COUNTIF(C5:C12,"ca") counts the sales where the entry is ca.
When the criterion is a text value, you must enclose it in double quotes. With a plain number, you do not have to. The situation changes when you combine a logical operator with the number: then the whole criterion goes inside the quotes.
=COUNTIF(A1:A10,100) counts the cells that are equal to 100.
=COUNTIF(A1:A10,">32") counts the cells that hold a number greater than 32.
When the value that you want to compare against is in a different cell, you can add it to the criterion with concatenation. The following formula counts the values in A1:A10 that are less than the value in B1.
=COUNTIF(A1:A10,"<"&B1)
[...] 5, where n refers to the sample size and p refers to the probability of a success in the population. This implies that we can use a probability model to quantify the uncertainty when we make an inference about a population mean based on the sample mean. For random samples taken from the population, the mean of the sample means equals the population mean: μ_x̄ = μ

The standard deviation of the sample means, also called the standard error, is σ_x̄ = σ/√n, where σ is the population standard deviation and n is the sample size.

Now, let us have a look at the central limit theorem for a normally distributed population.

If we take samples of size n = 10 at random from this population and compute the mean of each sample, the distribution of the sample means ought to be normal according to the central limit theorem. Note that the sample size is less than 30; however, the source population is normally distributed, so this is not a problem here.

This is the distribution of the sample means. You will notice that the horizontal axis differs from the one in the previous illustration and that the range is much narrower. The mean of the sample means is 75 and the standard deviation is 2.5. This is how that is calculated:

If we took samples of n = 5 instead of n = 10, we would get a similar distribution, but the variation among the sample means would be larger.

Simulating the central limit theorem
To better understand statistical analysis in Excel, it can be helpful to simulate the central limit theorem. Yes, that sounds odd, but there is a way to do it. The simulation creates a sampling distribution of the mean for a small sample drawn from a population that is not normally distributed. As you will see, even though the population is not normal and the sample is small, the sampling distribution of the mean ends up looking very much like a normal distribution. First, imagine a large population that consists of just three scores, 1, 2, and 3, each of which has the same probability of landing in a sample. Then imagine that you randomly choose a sample of three scores from this population.

You can see the result in this table. The sample mean that we see most often is 2.00, and the sample means that we see least often are 1.00 and 3.00. In this simulation, we have chosen a score at random from the population and then randomly chosen two more. The group of three scores is called a sample. What you then need to do is calculate the mean of the sample. We have repeated this process for 60 samples, which results in 60 sample means. At the end of it all, you can graph the distribution of the sample means. Here is what the simulated sampling distribution shows.

In this worksheet, each row is a sample. The columns labeled x1, x2, and x3 show the three scores for the sample. Column E holds the average of the sample in each row. Column G lists the possible values of the sample mean, and column H shows how often each mean appears in the 60 samples. Columns G and H and the graph show that the distribution has its maximum frequency at 2.00 and that the frequency falls off as the means get further from 2.00. The population does not look like a normal distribution and the sample size is small. Even under these conditions, however, the sampling distribution of the mean looks roughly like a normal distribution. What parameters does the central limit theorem predict for the sampling distribution? We need to begin with the population. The population mean in this instance is 2.00 and the standard deviation is about .82 (the variance is .67). Then we turn to the sampling distribution. The mean of the 60 means is 1.98 and their standard deviation is .48. Those numbers closely approximate the parameters that the central limit theorem predicts for the sampling distribution of the mean: 2.00 for the mean and .47 for the standard deviation, which is .82 divided by the square root of 3.
If you want to do the simulation yourself, here is what to do:
1. Choose a cell for the first randomly chosen number. We have chosen cell B2.
2. Use the worksheet function RANDBETWEEN to choose one of 1, 2, or 3. This randomly picks a number from the population consisting of the numbers 1, 2, and 3, with an equal chance of choosing any of them. You can either choose FORMULAS, Math & Trig, RANDBETWEEN and use the Function Arguments dialog box, or simply type =RANDBETWEEN(1,3) into cell B2 and press Enter. The first argument is the smallest number and the second argument is the largest number.
3. Choose the cell to the right of the original cell and generate another random number between 1 and 3. Do this again in the cell to the right of the second one. The easiest way to do this is to autofill the two cells to the right of the first cell. In our worksheet, these cells are C2 and D2.
4. Treat the three cells as the sample and calculate their mean in the cell to the right of the third cell. Just enter =AVERAGE(B2:D2) in cell E2 and press Enter.
5. Repeat the process for as many samples as you would like to include in the simulation. The fastest way is to select the row that holds the three random numbers and their mean, and then autofill the remaining rows. The set of sample means in column E then becomes the simulated sampling distribution of the mean.
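The same simulation can be sketched outside Excel. The Python below is a minimal illustration, assuming the population of scores 1, 2, and 3 described above. The function name, the seed, and the number of samples are choices made here for illustration, and because the draws are random the simulated figures will not match the worksheet exactly.

```python
import random
import statistics

POPULATION = [1, 2, 3]   # each score is equally likely


def simulate_sample_means(num_samples, sample_size, seed=None):
    """Draw samples with replacement and return the mean of each sample."""
    rng = random.Random(seed)
    means = []
    for _ in range(num_samples):
        sample = [rng.choice(POPULATION) for _ in range(sample_size)]
        means.append(statistics.mean(sample))
    return means


# Parameters predicted by the central limit theorem.
population_mean = statistics.mean(POPULATION)     # 2
population_sd = statistics.pstdev(POPULATION)     # about 0.82
predicted_se = population_sd / 3 ** 0.5           # about 0.47

# 60 samples of 3 scores, like the 60 rows of the worksheet.
means = simulate_sample_means(60, 3, seed=1)
print("predicted:", population_mean, round(predicted_se, 2))
print("simulated:", round(statistics.mean(means), 2),
      round(statistics.stdev(means), 2))
```

Raising the number of samples from 60 to several thousand brings the simulated mean and standard deviation closer to the predicted values, which is a useful way to see the theorem at work.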

The limits of confidence
We are not talking about confidence as a personality trait; we are talking about confidence in Excel. Sometimes it is necessary to find the confidence limits of a mean in Excel. Here is how to do it.

Finding the confidence limits for mean
If you already know the standard deviation of a population, you can find a confidence interval for the mean of the population. When a statistical characteristic such as income or price is measured in numbers, people often want to estimate its mean value for the population. You can estimate the population mean with the sample mean, plus or minus a margin of error. The result is referred to as a confidence interval for the population mean.

Here we have a table that shows the value of z* for the confidence levels listed. These values come from the standard normal distribution. The area between z* and the negative of z* is the confidence percentage. For example, the area between z* = 1.28 and -z* = -1.28 is approximately 0.80. The chart can be extended to other confidence percentages too. Here is how to find the confidence interval under these conditions:
1. Determine the confidence level and find the appropriate z* value.
2. Find the sample mean (x̄) for the sample of size n.
3. Multiply z* by σ and divide the answer by the square root of n. This gives you the margin of error.
4. Take the sample mean plus or minus the margin of error to get the confidence interval. The lower end of the confidence interval is the mean minus the margin of error, and the upper end is the mean plus the margin of error.
Take this example: you work for the department of natural resources and you want to estimate, with 95 percent confidence, the mean length of all the walleye fingerlings in a fish pond.
Because you want a 95% interval, the z* value is 1.96.
Suppose you take a random sample of 100 fingerlings and find that the average length is 7.5 inches, and assume that the population standard deviation is 2.3 inches. This means that x̄ = 7.5, σ = 2.3, and n = 100.
Multiply 1.96 by 2.3 and divide by the square root of 100. The margin of error is ±1.96(2.3/10) = 1.96*0.23 = 0.45 inches.
The 95% confidence interval is 7.5 inches plus or minus 0.45 inches, that is, from 7.05 to 7.95 inches.
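The steps above can be sketched in Python using the standard library's NormalDist class to look up z*. The function names z_star and confidence_interval are hypothetical; the numbers are those of the walleye example.

```python
from statistics import NormalDist


def z_star(confidence_level):
    """z* such that the area between -z* and +z* equals confidence_level."""
    return NormalDist().inv_cdf(0.5 + confidence_level / 2)


def confidence_interval(sample_mean, sigma, n, confidence_level):
    """Return (lower end, upper end, margin of error) for a known sigma."""
    margin = z_star(confidence_level) * sigma / n ** 0.5
    return sample_mean - margin, sample_mean + margin, margin


# z* for some common confidence levels.
for level in (0.80, 0.90, 0.95, 0.99):
    print(level, round(z_star(level), 2))

# Walleye example: x-bar = 7.5, sigma = 2.3, n = 100, 95% confidence.
low, high, margin = confidence_interval(7.5, 2.3, 100, 0.95)
print(round(margin, 2), round(low, 2), round(high, 2))   # 0.45 7.05 7.95
```

The lookup uses 0.5 plus half the confidence level because half of the excluded area sits in each tail; for 95% confidence that is the 97.5th percentile of the standard normal distribution.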

CONFIDENCE.NORM
If you are looking for the confidence interval of a population mean, you can use the CONFIDENCE.NORM function. The confidence interval is a range of values with the sample mean x at the center, and the range is x plus or minus CONFIDENCE.NORM. For any population mean μ0 inside this range, the probability of obtaining a sample mean further from μ0 than x is greater than alpha; for any population mean μ0 outside this range, the probability of obtaining a sample mean further from μ0 than x is less than alpha. In other words, imagine using x, the standard deviation, and the size to construct a two-tailed test at significance level alpha of the hypothesis that the population mean is μ0. We do not reject that hypothesis when μ0 is inside the confidence interval, and we reject it when μ0 is outside. The confidence interval does not let us infer that a future observation will fall inside the interval with probability 1 - alpha.
The syntax is CONFIDENCE.NORM(alpha, standard_dev, size).
Alpha refers to the significance level that is used to compute the confidence level. The confidence level equals 100*(1 - alpha)%, so an alpha of 0.05 indicates a 95 percent confidence level.

Standard_dev refers to the population standard deviation for the data range, which is assumed to be known.
Size refers to the sample size.
Here is what it looks like.

If any argument is non-numeric, the function returns the #VALUE! error. If alpha is less than or equal to 0 or greater than or equal to 1, it returns the #NUM! error. If standard_dev is less than or equal to 0, it returns the #NUM! error too. If size is not an integer, it is truncated. If size is less than 1, it returns the #NUM! error.
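The argument rules above can be mirrored in a short Python sketch. This is not how Excel implements the function; it is a hypothetical confidence_norm helper that applies the same checks and the usual formula, z for 1 - alpha/2 times standard_dev divided by the square root of size. A Python ValueError stands in for Excel's #NUM! error.

```python
from statistics import NormalDist


def confidence_norm(alpha, standard_dev, size):
    """Half-width of the confidence interval, in the style of CONFIDENCE.NORM."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must be greater than 0 and less than 1")
    if standard_dev <= 0:
        raise ValueError("standard_dev must be greater than 0")
    size = int(size)                      # a non-integer size is truncated
    if size < 1:
        raise ValueError("size must be at least 1")
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * standard_dev / size ** 0.5


# Example values chosen for illustration: alpha 0.05, sigma 2.5, n 50.
print(round(confidence_norm(0.05, 2.5, 50), 4))   # 0.693
```

Adding and subtracting the returned value from the sample mean gives the upper and lower ends of the interval.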

Fit to a T
First, what is a t-distribution? It describes the standardized distance of the sample mean from the population mean when the population standard deviation is not known and the observations come from a normally distributed population.

CONFIDENCE.T
CONFIDENCE.T also gives you the confidence interval for a mean, but it uses the t-distribution. The syntax is CONFIDENCE.T(alpha, standard_dev, size).
Alpha refers to the significance level that is used to compute the confidence level.
Standard_dev refers to the standard deviation for the data range.
Size refers to the sample size.
Here is what to know: If any argument is non-numeric, the function returns the #VALUE! error. If alpha is less than or equal to 0 or greater than or equal to 1, it returns the #NUM! error. If standard_dev is less than or equal to 0, it returns the #NUM! error too. If size is not an integer, it is truncated. If size is less than 1, it returns the #NUM! error, and if size equals 1, it returns the #DIV/0! error.

Conclusion
We have discussed ways to find confidence limits in Excel in this chapter, so stick around to learn more.

Chapter 9
One sample hypothesis testing
In this chapter, we are going to take you through one-sample testing. There are t-tests and z-tests that can help you determine whether an unknown population mean differs from a specific value. Before we go any further, we are going to explain some of these terms.

Hypothesis, tests, and errors
A hypothesis is a claim or an explanation, based on limited evidence, that serves as the starting point for an investigation. Hypotheses are predictions that your research sets out to test. In some disciplines, the hypothesis is also known as a thesis statement. When creating a hypothesis, you predict the relationship that exists between two variables. Hypotheses are meant to be put to the proof, and that is where tests come into play. A test in statistics puts your hypothesis on trial: the test either supports or refutes the hypothesis.

Hypothesis tests and sample distributions
Imagine that you take a sample of a specific size from a normal population and ask whether the sample mean differs from the population mean. This is equivalent to testing the null hypothesis H0: μ = μ0 against the alternative H1: μ ≠ μ0, where μ0 is the hypothesized population mean.

This is a two-tailed hypothesis, although sometimes only a one-tailed hypothesis is needed. You can use the sample mean to test the null hypothesis directly, or you can use a z-score as the test statistic.

Here is an example. The national norms for a school mathematics exam have been established with a mean of 80 and a standard deviation of 20. A random sample of 60 students from California has a mean proficiency score of 75. The question is whether the sample score differs significantly from the national mean at a significance level of α = .05. The null hypothesis is that the deviation from the expected value of 80 is due only to chance. There are three methods that we can try. All of them use the one-tailed null hypothesis: imagine that before you collected the data, you had predicted that the sample would have a mean lower than the population mean.
The distribution of the sample mean is N(μ, σ²/n), where μ = 80, σ = 20, and n = 60. The standard error is σ/√n = 20/√60 = 2.58, so the sample mean is normally distributed with mean 80 and standard deviation 2.58. The critical region is the left tail that represents α = 5% of the distribution. What we do next is test whether x̄ falls in the critical region.
critical value (left tail) = NORM.INV(α, μ, σ/√n) = NORM.INV(.05, 80, 2.58) = 75.75
Since x̄ = 75 < 75.75, we reject the null hypothesis.
Alternatively, we can test whether the p-value is less than α:
p-value = NORM.DIST(x̄, μ, σ/√n, TRUE) = NORM.DIST(75, 80, 2.58, TRUE) = .0264
Since the p-value = .0264 < .05 = α, we again reject the null hypothesis.
Finally, we can use the z-score, z_obs = (x̄ - μ)/(σ/√n) = (75 - 80)/2.58 = -1.94. Based on either of the tests below, we reject the null hypothesis:
p-value = NORM.S.DIST(z, TRUE) = NORM.S.DIST(-1.94, TRUE) = .0264 < .05 = α
z_crit = NORM.S.INV(α) = -1.64 > -1.94 = z_obs
The conclusion is that the sample has a significantly lower score than the general population.
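The three ways of running this test can be checked with a short Python sketch. It uses the figures from the example (μ = 80, σ = 20, n = 60, x̄ = 75, α = .05) and the standard library's NormalDist class in place of NORM.INV, NORM.DIST, and NORM.S.INV. The variable names are chosen here for illustration.

```python
from statistics import NormalDist

mu, sigma, n = 80, 20, 60      # national norms and sample size
x_bar, alpha = 75, 0.05        # observed sample mean and significance level

std_error = sigma / n ** 0.5                  # sigma / sqrt(n)
sampling_dist = NormalDist(mu, std_error)     # distribution of the sample mean

# Method 1: compare x-bar with the left-tail critical value.
critical_value = sampling_dist.inv_cdf(alpha)

# Method 2: compare the p-value with alpha.
p_value = sampling_dist.cdf(x_bar)

# Method 3: compare the observed z-score with the critical z-score.
z_obs = (x_bar - mu) / std_error
z_crit = NormalDist().inv_cdf(alpha)

print(round(std_error, 2), round(critical_value, 2))   # 2.58 75.75
print(round(p_value, 4))                               # 0.0264
print(round(z_obs, 2), round(z_crit, 2))               # -1.94 -1.64
print(x_bar < critical_value, p_value < alpha, z_obs < z_crit)
```

All three comparisons come out True, so each method leads to the same decision to reject the null hypothesis.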

Catching some Z’s again Now we are not talking about sleep but testing a hypothesis. Here is how to catch some Z with excel

Z.TEST
If you know the variance or the standard deviation of a population, you can run a z-test in Excel using the Data Analysis add-in. The z value is useful when you work with confidence levels and confidence intervals for normally distributed data. Here are the steps to perform a z-test in Excel:
1. Go to the Data tab and click the Data Analysis command button.
2. When Excel shows the Data Analysis dialog box, choose z-Test: Two Sample for Means and click OK. This brings up the dialog box for the z-test.
3. In the Variable 1 Range and Variable 2 Range text boxes, enter the ranges that hold the two samples.
4. In the Hypothesized Mean Difference box, state the difference that you hypothesize between the means. If you think the means are equal, enter zero; if you do not believe they are equal, enter the difference that you predict.
5. Use the Variable 1 Variance (known) and Variable 2 Variance (known) text boxes to enter the population variance for the first and the second samples.
6. In the Alpha text box, enter the significance level for the test. The significance level lies between 0 and 1, and the default is 0.05.
7. In the Output options, indicate the cell where you want the results to appear.
8. Click OK. Excel then calculates the z-test results.
The z-test output shows the mean of each data set, the known variances, the number of observations, the hypothesized mean difference, the z value, and the probability values for both the one-tail and the two-tail tests.
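The calculation behind a two-sample z-test with known variances can be sketched in Python as follows. This is an illustration of the standard formula, z = (mean1 - mean2 - hypothesized difference) / sqrt(var1/n1 + var2/n2), not a reproduction of the tool's output table. The function name, the two samples, and the variances are hypothetical.

```python
from statistics import NormalDist, mean


def z_test_two_sample(sample1, sample2, var1, var2, hypothesized_diff=0.0):
    """Return (z, one-tail p, two-tail p) for two samples with known variances."""
    std_error = (var1 / len(sample1) + var2 / len(sample2)) ** 0.5
    z = (mean(sample1) - mean(sample2) - hypothesized_diff) / std_error
    p_one_tail = 1 - NormalDist().cdf(abs(z))
    return z, p_one_tail, 2 * p_one_tail


# Hypothetical samples with known population variances of 4.
group1 = [9, 10, 11, 10, 9, 11, 10, 10]
group2 = [7, 8, 9, 8, 7, 9, 8, 8]

z, p_one, p_two = z_test_two_sample(group1, group2, var1=4, var2=4)
print(round(z, 2), round(p_one, 4), round(p_two, 4))   # 2.0 0.0228 0.0455
```

With an alpha of 0.05, both probability values here fall below alpha, so the hypothesis of equal means would be rejected for these made-up samples.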

T for one

When we talked about hypothesis testing, we said that it is used to draw conclusions about the mean of a population, whatever its distribution function, based on a sample data set. A statistical hypothesis is a hypothesis that is tested by observing random variables. Statistical testing also involves something called degrees of freedom. In mathematics, the degrees of freedom of a distribution equal the number of standard normal deviates being summed. To better understand the concept, think of degrees of freedom as the number of independent possibilities in an event. Suppose that we toss a coin 100 times and get heads 48 times. We can then conclude that tails came up 52 times, so there is 1 degree of freedom. However, if we watch a traffic light and want to find the probability of a red light at a given moment, there are 2 degrees of freedom, because we need information about the 2 other colors. For a sample, the degrees of freedom equal the sample size minus 1. We are going to take you through how to find t-distribution probabilities for tailed tests with T.DIST, T.DIST.RT, and T.DIST.2T.

T.DIST, T.DIST.RT, and T.DIST.2T
In Excel, T.DIST gives you the left-tailed Student's t-distribution for a value x. The function takes the value x, the degrees of freedom of the distribution, and a cumulative flag. T.DIST has the following syntax:
=T.DIST(x, deg_freedom, cumulative)
X refers to the value at which the distribution is evaluated.
Deg_freedom refers to the degrees of freedom.
Cumulative is a logical value: TRUE returns the cumulative distribution function and FALSE returns the probability density function.
The older TDIST function takes a different third argument:
=TDIST(x, deg_freedom, tails)
Tails specifies whether the test is one-tailed or two-tailed. For a one-tailed test, use 1, and for a two-tailed test, use 2.
Then there is T.DIST.RT, which gives the right-tailed distribution for a one-tailed test based on the value x and the degrees of freedom. Here is the syntax:
=T.DIST.RT(x, deg_freedom)
X refers to the value at which the distribution is evaluated, and deg_freedom refers to the degrees of freedom.
Then there is the T.DIST.2T function, which gives you the two-tailed distribution for the value x and the degrees of freedom. The syntax looks like this:
=T.DIST.2T(x, deg_freedom)
When you use these functions, here are some things that you might notice:
1. These functions only work with numbers.
2. The function gives you a #NUM! error if x is negative (for T.DIST.2T) or if the degrees of freedom is less than 1 or greater than 10^10.
3. The cumulative argument takes a Boolean value (TRUE or FALSE).
4. A probability entered as a decimal and the same probability entered as a percentage are the same value.
5. You can type the arguments directly into the function or use cell references.

T.INV and T.INV.2T
Excel can be used to do various t-tests. We talked about the distribution functions earlier; here are the inverse functions. They are handy when we want to reverse the process, that is, when we already have a proportion and want to know the value that produces it. There are 2 inverse functions for this: T.INV for the left-tailed inverse of the Student's t-distribution, and T.INV.2T for the two-tailed inverse of the Student's t-distribution. These functions have two arguments. The first is the probability and the second is the degrees of freedom of the distribution. Here is an example: suppose you are working with a t-distribution that has twelve degrees of freedom. If you want to know the point of the distribution that has 10% of the area under the curve to its left, enter the following in an empty cell: =T.INV(0.1,12). Excel returns the number -1.356.

However, if you use the T.INV.2T function with the same arguments, you get 1.782. This implies that 10% of the area under the graph of the distribution lies in the two tails combined, below -1.782 and above 1.782.
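These two values can be checked outside Excel. Python's standard library has no t-distribution, so the sketch below builds one from the density formula, integrates it numerically with Simpson's rule, and inverts it by bisection. The function names, the step count, and the search bounds are all choices made for this illustration; Excel's own algorithm is not documented here and is likely different.

```python
import math


def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def t_cdf(x, df, steps=2000):
    """Left-tail area up to x, via Simpson's rule on [0, |x|] and symmetry."""
    a = abs(x)
    if a == 0:
        return 0.5
    h = a / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x > 0 else 0.5 - area


def t_inv(probability, df):
    """Value with the given left-tail area (in the style of T.INV)."""
    lo, hi = -100.0, 100.0
    for _ in range(60):                # bisection on the increasing CDF
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < probability:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def t_inv_2t(probability, df):
    """Positive value with the given two-tail area (in the style of T.INV.2T)."""
    return -t_inv(probability / 2, df)


print(round(t_inv(0.1, 12), 3))       # -1.356
print(round(t_inv_2t(0.1, 12), 3))    # 1.782
```

The two-tailed inverse simply splits the probability in half and looks up one tail, which is why it returns a larger value than the one-tailed inverse for the same probability.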

Visualizing a t-distribution
Here are the steps to create a t-distribution graph in Excel:
1. First, enter the number of degrees of freedom in an empty cell.
2. Then create one column for the range of values of the t-distributed variable.
3. Then make a column to hold the PDF of the t-distribution for each of those values.
4. Then create a graph by going to the Insert tab. Use the scatter chart with smooth lines to see the curve.
5. You can then change the way the graph appears.

Testing a variance
If you want to compare the variance of two groups, you can perform a hypothesis test for variance. In this part of the book, we take you through hypothesis testing of variance. A variance test is a hypothesis test that helps you find out how two groups compare to each other. Like every other hypothesis test, the variance test makes use of sample data. Here we are going [...]

CHISQ.DIST and CHISQ.DIST.RT
If you are looking for the right-tailed probability of the chi-squared distribution, you can use the CHISQ.DIST.RT function. The χ² distribution is associated with the χ² test, which you can use to compare observed and expected values. Take, for example, a genetic experiment in which you hypothesize that the next generation of plants will show a particular set of colors. By comparing the results that you observed with the ones that you expected, you can check whether your hypothesis is valid. Here is the syntax for the CHISQ.DIST.RT function:
=CHISQ.DIST.RT(x, deg_freedom)
X refers to the value at which you want to evaluate the distribution, and deg_freedom refers to the number of degrees of freedom.
NOTE: If either argument is non-numeric, you get the #VALUE! error. If deg_freedom is not an integer, it is truncated. If deg_freedom is less than 1 or greater than 10^10, you get the #NUM! error. Here is an illustration of the CHISQ.DIST.RT function.

The CHISQ.DIST function is a little different from its counterpart. It also returns the chi-squared distribution, but for the left tail. The chi-squared distribution is commonly used to study variation in the percentage of something across samples, for example the part of the day that people spend watching TV. Here is the syntax for the CHISQ.DIST function:
=CHISQ.DIST(x, deg_freedom, cumulative)
X refers to the value at which you want to evaluate the distribution, deg_freedom refers to the number of degrees of freedom, and cumulative is a logical value that determines the form of the function. When cumulative is TRUE, the function gives you the cumulative distribution function; when it is FALSE, it gives you the probability density function.
NOTE: If any argument is non-numeric, you get the #VALUE! error. If deg_freedom is not an integer, it is truncated. If deg_freedom is less than 1 or greater than 10^10, you get the #NUM! error.

CHISQ.INV and CHISQ.INV.RT
When you are looking for the inverse of the left-tailed probability of the chi-squared distribution, you can use the CHISQ.INV function. The chi-squared distribution is commonly used to study variation in the percentage of something across samples, for example the part of the day that people spend watching television. Here is the syntax of the CHISQ.INV function:
=CHISQ.INV(probability, deg_freedom)
Probability refers to the probability associated with the distribution, and deg_freedom refers to the number of degrees of freedom.
If an argument is non-numeric, the function gives you the #VALUE! error. If probability is below 0 or above 1, it gives you the #NUM! error. If deg_freedom is not an integer, it is truncated. If deg_freedom is less than 1 or greater than 10^10, the function gives you the #NUM! error.
On the other hand, there is CHISQ.INV.RT, which returns the inverse of the right-tailed probability of the chi-squared distribution. If probability = CHISQ.DIST.RT(x, ...), then CHISQ.INV.RT(probability, ...) = x. You can use this function to compare the results that you observed with the results that you expected, to decide whether your hypothesis is valid. The syntax goes like this:
=CHISQ.INV.RT(probability, deg_freedom)
Probability refers to the probability associated with the chi-squared distribution, and deg_freedom refers to the number of degrees of freedom.
NOTE: If any argument is non-numeric, the function gives you the #VALUE! error. If probability is less than 0 or greater than 1, the function gives you the #NUM! error. If deg_freedom is not an integer, it is truncated. If deg_freedom is less than 1, CHISQ.INV.RT gives you the #NUM! error.
Given a value for probability, this function looks for the value x such that CHISQ.DIST.RT(x, deg_freedom) equals the probability. This means that the precision of CHISQ.INV.RT depends on the precision of CHISQ.DIST.RT. CHISQ.INV.RT uses an iterative search technique. If the search has not converged after 64 iterations, the function gives you the #N/A error.
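The inverse relationship between the right-tailed probability and its inverse can be demonstrated with a Python sketch. Because the standard library has no chi-squared functions, the cumulative probability is computed here from the series expansion of the regularized lower incomplete gamma function, and the inverse is found by bisection. The function names and the search strategy are assumptions made for illustration and are not Excel's implementation.

```python
import math


def chisq_dist(x, df):
    """Left-tail cumulative probability (in the style of CHISQ.DIST with TRUE)."""
    if x <= 0:
        return 0.0
    a, half_x = df / 2, x / 2
    term = 1 / a                       # first term of the gamma series
    total = term
    k = 1
    while term > total * 1e-15:        # stop once new terms no longer matter
        term *= half_x / (a + k)
        total += term
        k += 1
    return total * math.exp(-half_x + a * math.log(half_x) - math.lgamma(a))


def chisq_dist_rt(x, df):
    """Right-tail probability (in the style of CHISQ.DIST.RT)."""
    return 1 - chisq_dist(x, df)


def chisq_inv_rt(probability, df):
    """Value x whose right-tail probability equals the given probability."""
    if not 0 <= probability <= 1:
        raise ValueError("probability must be between 0 and 1")
    lo, hi = 0.0, max(float(df), 1.0)
    while chisq_dist_rt(hi, df) > probability:   # widen until x is bracketed
        hi *= 2
    for _ in range(64):                          # bisection search
        mid = (lo + hi) / 2
        if chisq_dist_rt(mid, df) > probability:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


p = chisq_dist_rt(18.307, 10)
print(round(p, 4))                        # 0.05
print(round(chisq_inv_rt(0.05, 10), 3))   # 18.307
```

Feeding the output of one function into the other returns the starting value, which is the round trip that the text describes.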

Visualizing a chi-square distribution
In this section, we take you through how to plot a chi-square distribution.
1. The first thing to do is to define the range of the X values that we are going to use for the plot.

2. Then you need to calculate the Y values. The Y values in this plot are the PDF values of the chi-square distribution. Type a formula into cell B2 to find the PDF value of the chi-square distribution for the X value of 0 and 3 degrees of freedom, for example =CHISQ.DIST(A2,3,FALSE) if the X values start in cell A2. Then copy the formula down the rest of the column.
3. Then you can plot the chi-square distribution. Highlight the two columns, select Insert, and in the Charts group open the Scatter option and select Scatter with Smooth Lines.

4. Here is the chart that is going to be created

5. The x-axis indicates the value of the random variables that come after the chi-square distribution with just three-degree freedom while the yaxis indicates the PDF value that corresponds with the PDF values of the Chi-Square distribution
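As a cross-check outside Excel, the following Python sketch tabulates the same kind of X and Y columns using the textbook chi-square density formula. The step size of 0.5, the upper limit of 20, and the function name are assumptions made for illustration, and the function as written is only meant for 2 or more degrees of freedom.

```python
import math

def chisq_pdf(x, df):
    """Chi-square probability density at x (this sketch assumes df >= 2)."""
    if x < 0:
        return 0.0
    k = df / 2
    return x ** (k - 1) * math.exp(-x / 2) / (2 ** k * math.gamma(k))

# X values from 0 to 20 in steps of 0.5, and the matching PDF values for df = 3
xs = [i * 0.5 for i in range(41)]
ys = [chisq_pdf(x, 3) for x in xs]

for x, y in list(zip(xs, ys))[:5]:
    print(f"{x:4.1f}  {y:.4f}")
```

Each printed row corresponds to one worksheet row: the X value and the density that a CHISQ.DIST call with the cumulative argument set to FALSE would be expected to return.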

Conclusion

This section has introduced ideas that may have been entirely new to you. Now it is time to start practicing.

Chapter 10
Two sample hypothesis testing

A two-sample test is also called an independent-samples test. It is the method used to find out whether the unknown population means of two groups are equal. These tests are used when the data values are independent and are sampled randomly from two populations that are themselves independent and of equal variance. In this chapter, we take you through the ins and outs of the two-sample hypothesis test.

Hypotheses built for two

Just as in the one-sample case, hypothesis testing with two samples begins with a null hypothesis (H0) and an alternative hypothesis (H1). The null hypothesis says that any difference between the samples is due simply to chance. The alternative hypothesis says that the difference is real and not due to chance.

You can have a one-tailed test, where the alternative hypothesis specifies the direction of the difference between the two means, or a two-tailed test, where the alternative hypothesis does not specify a direction. The hypotheses are usually written in terms of the difference μ1 - μ2, with zero as the typical hypothesized value. Nevertheless, you can test for any other value just by substituting that value for zero.

To carry out the test, first set α, the probability of a Type I error that you are willing to accept. Then calculate the mean and the standard deviation of each sample, subtract one mean from the other, and use a formula to convert the result into a test statistic. Next, compare the test statistic to the sampling distribution of the test statistic. If it falls in the rejection region defined by α, reject H0; if it does not, do not reject H0.

Sampling distributions revisited

In an earlier chapter we introduced the idea of a sampling distribution. It is worth defining it again: a sampling distribution is the distribution of all possible values of a statistic for a given sample size. We talked earlier about the sampling distribution of the mean and its connection with one-sample hypothesis testing. Two-sample hypothesis testing needs a different sampling distribution.

The sampling distribution of the difference between means is the distribution of all possible values of the differences between pairs of sample means, with the sample sizes held constant from pair to pair. By "held constant from pair to pair" we mean that the first sample in each pair always has the same size, and the second sample in each pair always has the same size. The two sample sizes are not necessarily equal to each other.

Within each pair, the two samples come from different populations. All the samples are independent, which means that picking an individual for one sample has no effect on the probability of picking individuals for another.

To create this sampling distribution, take a sample from one population and a sample from the other, calculate each mean, and subtract one mean from the other. Return the samples to their populations and do it all again, many times. What you end up with is a set of differences between means. This set of differences is the sampling distribution of the difference between means.

Applying the central limit theorem

Sampling distributions are like any other set of numbers, so they have a mean and a standard deviation. This means that we can also apply the central limit theorem. When we discussed the theorem, we said that if the samples are large, the sampling distribution of the mean is approximately normal. And if the populations are normally distributed, the sampling distribution is normal regardless of whether the samples are small or large.

The theorem also tells us about the parameters. Suppose the parameters of the first population are μ1 and σ1, and the parameters of the second population are μ2 and σ2. Then the mean of the sampling distribution of the difference between means is

μ1 - μ2

and the standard deviation, also called the standard error of the difference between means, is

√(σ1²/N1 + σ2²/N2)

N1 is the number of individuals in the sample from the first population, and N2 is the number of individuals in the sample from the second.
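One way to convince yourself of these two results is to build the sampling distribution by brute force. The Python sketch below draws many pairs of samples, records the difference between each pair of means, and compares the mean and standard deviation of those differences with the standard central limit theorem values, μ1 - μ2 and √(σ1²/N1 + σ2²/N2). The population parameters, the sample sizes, and the number of repetitions are hypothetical values chosen only for illustration.

```python
import math
import random
import statistics

random.seed(42)  # fixed seed so the sketch gives the same numbers each run

# Hypothetical population parameters and sample sizes
MU1, SIGMA1, N1 = 105.0, 16.0, 25
MU2, SIGMA2, N2 = 100.0, 16.0, 30

def sample_mean(mu, sigma, n):
    """Mean of one random sample of size n from a normal population."""
    return statistics.fmean([random.gauss(mu, sigma) for _ in range(n)])

# Each entry is one difference between a pair of sample means
differences = [
    sample_mean(MU1, SIGMA1, N1) - sample_mean(MU2, SIGMA2, N2)
    for _ in range(5000)
]

simulated_mean = statistics.fmean(differences)
simulated_sd = statistics.stdev(differences)
theoretical_mean = MU1 - MU2
theoretical_sd = math.sqrt(SIGMA1**2 / N1 + SIGMA2**2 / N2)

print(f"mean of differences: simulated {simulated_mean:.2f}, theory {theoretical_mean:.2f}")
print(f"sd of differences:   simulated {simulated_sd:.2f}, theory {theoretical_sd:.2f}")
```

The simulated values will not match the theoretical ones exactly, but with thousands of repetitions they should be close.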

Z's once more

Because the central limit theorem tells us that the sampling distribution is approximately normal for large samples, we can use the z-score as the test statistic. Another way to say "use the z-score as the test statistic" is "perform a z-test". The formula is

z = ((x̄1 - x̄2) - (μ1 - μ2)) / √(σ1²/N1 + σ2²/N2)

Here (μ1 - μ2) is the difference between the means specified in H0. The formula converts the difference between the sample means into a standard score. Compare the standard score against a standard normal distribution. If the score falls in the rejection region defined by α, reject H0; if it does not, do not reject H0. You can use this formula when you know the values of σ1² and σ2².

To put that into a quantitative perspective, imagine a new technique designed to increase IQ. You take a sample of 24 people and teach them this technique, and you take another 25 people chosen at random who receive no training. Suppose the sample mean for the new-technique sample is 107, and the sample mean for the no-training sample is 101.2. The hypothesis test is

H0: μ1 - μ2 = 0
H1: μ1 - μ2 > 0

Set α at .05.

IQ has a known standard deviation of 16, and we assume that the standard deviation in the population of people taught the technique would be the same. That population does not exist yet, of course; the assumption is that if it did, it would have the same standard deviation as the regular population of IQ scores. Does the mean of that theoretical population have the same value as the mean of the regular population? H0 says it does, and H1 says that it is larger. Our test statistic is

z = (107 - 101.2) / √(16²/24 + 16²/25) = 5.8 / 4.57 ≈ 1.27

With α = .05, the critical value of z is the value that cuts off the top 5 percent of the area under the standard normal distribution, which is 1.645. You can verify this with the Excel function NORM.S.INV. The calculated value is less than the critical value, so we do not reject H0.

Here is the sampling distribution of the difference between means.
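The arithmetic of the IQ example can be checked with a few lines of Python. This is a sketch using the numbers quoted in the example (sample means of 107 and 101.2, a standard deviation of 16 in both populations, and sample sizes of 24 and 25); the function name is an illustrative assumption, and NormalDist from the standard library stands in for NORM.S.INV.

```python
import math
from statistics import NormalDist

def two_sample_z(mean1, mean2, sigma1, sigma2, n1, n2, hypothesized_diff=0.0):
    """z statistic for the difference between two means with known sigmas."""
    standard_error = math.sqrt(sigma1**2 / n1 + sigma2**2 / n2)
    return ((mean1 - mean2) - hypothesized_diff) / standard_error

z = two_sample_z(107, 101.2, 16, 16, 24, 25)
critical = NormalDist().inv_cdf(0.95)      # upper 5 percent cutoff
p_one_tail = 1 - NormalDist().cdf(z)       # area above the calculated z

print(f"z = {z:.2f}, critical value = {critical:.3f}, one-tailed p = {p_one_tail:.3f}")
print("reject H0" if z > critical else "do not reject H0")
```

Because the calculated z is below the critical value, the sketch reaches the same decision as the text: do not reject H0.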

Data analysis tool: z-Test: Two Sample for Means

Excel provides a data analysis tool for performing a z-test. It is called z-Test: Two Sample for Means. Here is the dialog box for the z-Test: Two Sample for Means tool.

Here are the steps to use this tool:

1. Type the data for each sample into a separate data array.
2. Select Data, then Data Analysis, to open the Data Analysis dialog box.
3. In the Data Analysis dialog box, find z-Test: Two Sample for Means.
4. Choose OK to open the z-Test: Two Sample for Means dialog box.
5. In the Variable 1 Range box, type the cell range holding the data for one sample. In the Variable 2 Range box, enter the cell range holding the data for the other sample.
6. In the Hypothesized Mean Difference box, type the difference between μ1 and μ2 specified in H0.
7. In the Variable 1 Variance (known) box, type the variance for the first sample, and in the Variable 2 Variance (known) box, type the variance for the other sample. In our example, the variance is 256.
8. If the cell ranges include column headings, select the Labels check box.
9. The Alpha box contains 0.05 by default. Change it if you want a different α.
10. In the Output options, choose a radio button to show where you would like the results to go.
11. Select OK.

In this example, the value of the test statistic is in cell B8, the critical value for the one-tailed test is in cell B10, and the critical value for the two-tailed test is in cell B12.

T for two

The example in the preceding section involves a situation you will rarely encounter: known population variances. If you know a population's variance, you probably also know the population mean, and if you know the mean, you probably do not need to test hypotheses about it. Not knowing the variances takes the central limit theorem out of play. This means that you cannot use the normal distribution as an approximation of the sampling distribution of the difference between means. Instead, you use the t distribution, the same family of distributions that we applied to one-sample hypothesis testing.

The members of this family differ from one another in a parameter called degrees of freedom. Think of the degrees of freedom as the denominator of the variance estimate that you use when you calculate a value of t as a test statistic. Calculating a value of t as a test statistic is also called performing a t-test.

Unknown population variances lead to two possibilities for hypothesis testing. One is that, although the variances are unknown, you have reason to assume that they are equal. The other is that you cannot assume that they are equal. The sections below explore both.

Like peas in a pod: equal variances

A sample variance is used to estimate a population variance. If you have two samples and can assume equal variances, you can combine the two sample variances into a single estimate. Combining sample variances to estimate a population variance is known as pooling. Here is how to do this with two samples:

sp² = ((N1 - 1)s1² + (N2 - 1)s2²) / ((N1 - 1) + (N2 - 1))

In this formula, sp² is the pooled estimate. Notice that the estimate has a denominator of (N1 - 1) + (N2 - 1), which is the degrees of freedom. The t formula is then

t = ((x̄1 - x̄2) - (μ1 - μ2)) / (sp √(1/N1 + 1/N2))

For example, suppose a robotics company wants to choose between two machines to make a new component for a microrobot, and wants to know which machine is faster. The company has each machine produce ten copies of the part and times each production run. The hypotheses are

H0: μ1 - μ2 = 0
H1: μ1 - μ2 ≠ 0

This is a two-tailed test, since we do not know ahead of time which machine is faster.

Here are the sample statistics:

Then the pooled estimate becomes,

sp² = (9 × 7.34 + 9 × 7.78) / 18 = 7.56

using the two sample variances, 7.34 and 7.78. The estimate of σ² is 7.56, and its square root, the estimate of σ, is 2.75. Our test statistic goes thus:

For this test statistic, the degrees of freedom are 18, which is the denominator of the variance estimate. In a t distribution with 18 degrees of freedom, the critical value is 2.10 for the right-side tail and -2.10 for the left-side tail. You can use the T.INV.2T function to verify this. The calculated value of the test statistic is more than 2.10, so the decision is to reject H0. The data provide evidence that the second machine is the faster one.
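The pooling step and the t formula can be sketched in Python as follows. The standard deviations (2.71 and 2.79) and the sample sizes (ten parts per machine) come from this chapter. The two sample means are hypothetical placeholders, because the table of sample statistics is not reproduced here, so treat the printed t value as an illustration of the formula rather than as the book's result. The function names are assumptions as well.

```python
import math

def pooled_variance(s1, n1, s2, n2):
    """Pool two sample variances, weighting each by its degrees of freedom."""
    return ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / ((n1 - 1) + (n2 - 1))

def pooled_t(mean1, mean2, s1, n1, s2, n2, hypothesized_diff=0.0):
    """Return (t, degrees of freedom) for the equal-variances t-test."""
    sp = math.sqrt(pooled_variance(s1, n1, s2, n2))
    t = ((mean1 - mean2) - hypothesized_diff) / (sp * math.sqrt(1 / n1 + 1 / n2))
    return t, (n1 - 1) + (n2 - 1)

# Standard deviations and sample sizes from the text;
# the means 23.0 and 20.0 are hypothetical placeholders.
sp2 = pooled_variance(2.71, 10, 2.79, 10)
t, df = pooled_t(23.0, 20.0, 2.71, 10, 2.79, 10)

print(f"pooled variance = {sp2:.2f}, pooled sd = {math.sqrt(sp2):.2f}")
print(f"t = {t:.2f} with {df} degrees of freedom")
```

The pooled variance of about 7.56 and its square root of about 2.75 depend only on the values given in the chapter, so they can be compared directly with the estimates quoted above.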

Like p's and q's: unequal variances

Unequal variances present a problem. When the variances are not equal, the t distribution with (N1 - 1) + (N2 - 1) degrees of freedom is not a close approximation to the sampling distribution. Statisticians deal with this problem by reducing the degrees of freedom. To do so, they use a formula based on the sample standard deviations and the sample sizes. Since the variances are not equal, a pooled estimate is not appropriate, so you calculate the t-test in a different way:

t = ((x̄1 - x̄2) - (μ1 - μ2)) / √(s1²/N1 + s2²/N2)

You then compare the test statistic with the t distribution that has the reduced degrees of freedom.
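The text does not spell out the formula for the reduced degrees of freedom. A common choice is the Welch-Satterthwaite approximation, and the Python sketch below assumes that this is the formula intended. The function name and all example values are illustrative assumptions.

```python
import math

def welch_t(mean1, mean2, s1, n1, s2, n2, hypothesized_diff=0.0):
    """Return (t, reduced degrees of freedom) for the unequal-variances t-test.

    The degrees of freedom use the Welch-Satterthwaite approximation,
    which this sketch assumes is the reduction the text refers to.
    """
    v1, v2 = s1**2 / n1, s2**2 / n2
    t = ((mean1 - mean2) - hypothesized_diff) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t, df

# Hypothetical samples with clearly different spreads
t, df = welch_t(23.0, 20.0, 1.0, 10, 5.0, 10)
print(f"t = {t:.2f}, reduced df = {df:.1f} (pooled df would be 18)")

# With nearly equal standard deviations, very little is given up
t_close, df_close = welch_t(23.0, 20.0, 2.71, 10, 2.79, 10)
print(f"t = {t_close:.2f}, reduced df = {df_close:.2f}")
```

The more the two sample variances differ, the further the reduced degrees of freedom fall below (N1 - 1) + (N2 - 1).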

T.TEST

The worksheet function T.TEST takes the work out of t-tests. Here is how to use the T.TEST function:

1. Type the data for each sample into a separate array and select the cell that you want the result to be in.
2. From the Statistical Functions menu, choose T.TEST to open the Function Arguments dialog box for T.TEST.
3. In the Function Arguments dialog box, enter the appropriate values for the arguments. In the Array1 box, enter the cell range holding the data for one sample, and in the Array2 box, enter the cell range holding the data for the other sample. The Tails box indicates whether the test is one-tailed (1) or two-tailed (2). The Type box holds a number indicating the type of t-test: 1 for a paired test, 2 for two samples assuming equal variances, and 3 for two samples assuming unequal variances.
4. Select OK to put the answer in the selected cell.

Data Analysis Tool: t-Test: Two Sample

You can also use a data analysis tool for t-tests. Excel provides one tool for the equal-variances case and another for the unequal-variances case. Here is how to use the t-Test: Two-Sample Assuming Equal Variances tool:

1. Type the data for each sample into a separate array.
2. Choose Data from the ribbon, then open the Data Analysis dialog box.
3. Scroll through the list in the dialog box and choose t-Test: Two-Sample Assuming Equal Variances.
4. Select OK to open the tool's dialog box.
5. In the Variable 1 Range box, enter the cell range holding the data for one sample.
6. In the Variable 2 Range box, enter the cell range holding the data for the other sample.
7. In the Hypothesized Mean Difference box, enter the difference between μ1 and μ2 specified by H0.
8. If the cell ranges include column headings, select the Labels check box.
9. The Alpha box contains 0.05 by default. You can change the value if you like.
10. In the Output options, choose the radio button that indicates where the result should go.
11. Select OK.

The remaining rows give t-related information. Cell B10 holds the calculated value of the test statistic. Cell B11 gives the proportion of area that the positive value of the test statistic cuts off in the upper tail of the t distribution with the indicated degrees of freedom. Cell B12 gives the critical value for a one-tailed test: the value that cuts off a proportion of the area in the upper tail equal to α.

For the unequal-variances case, follow the same steps. The only difference is that in the Data Analysis dialog box you choose t-Test: Two-Sample Assuming Unequal Variances. Where the two tools differ, you see the effects in the remaining statistics: the t value, the critical values, and the probabilities are not the same.

A matched set: hypothesis testing for paired samples

In the hypothesis tests so far, the samples have been independent: choosing an individual for one sample has no effect on the choice of an individual for the other. Sometimes, however, the samples are matched. The most common situation is a before-and-after study, in which each individual provides a score before and a score after. For example, suppose ten people participate in a weight-loss program. They weigh in before the program starts and again one month into it. With data like these, you treat the differences between the paired scores as a single sample and work with them the same way you would in a one-sample t-test. The first thing to do is to set up a test of the following hypotheses.

H0: μd ≤ 0
H1: μd > 0

The subscript d stands for "difference". Set α = .05. The test statistic is

t = (d̄ - μd) / s_d̄

In this formula, d̄ is the mean of the differences. To calculate s_d̄, find the standard deviation of the differences and divide it by the square root of the number of pairs. The degrees of freedom are N - 1.

Since there are ten pairs, the degrees of freedom are 9 (the number of pairs minus 1), and the critical value for a one-tailed test with α = .05 is 1.83, which you can verify with =T.INV(0.95,9). (The value 2.26 is the two-tailed critical value, which T.INV.2T returns.) The calculated value exceeds the critical value, so the decision is to reject H0. The only difference between this and the one-sample test is that here the one sample consists of the differences between the pairs.
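Here is a short Python sketch of the paired calculation: take the differences, find their mean and standard deviation, divide the standard deviation by the square root of the number of pairs, and form t. The before and after weights are hypothetical numbers made up for illustration; they are not the data behind the figure in this chapter.

```python
import math
import statistics

# Hypothetical weights (in pounds) for ten people, before and after a program
before = [198, 201, 210, 185, 204, 156, 167, 197, 220, 186]
after = [194, 203, 200, 183, 200, 153, 166, 197, 215, 184]

differences = [b - a for b, a in zip(before, after)]
n = len(differences)

d_bar = statistics.mean(differences)   # mean of the differences
s_d = statistics.stdev(differences)    # standard deviation of the differences
standard_error = s_d / math.sqrt(n)    # divide by the square root of the number of pairs

hypothesized_mean_diff = 0
t = (d_bar - hypothesized_mean_diff) / standard_error
df = n - 1

print(f"mean difference = {d_bar:.2f}, t = {t:.2f}, df = {df}")
```

With these made-up numbers the calculated t is above the one-tailed critical value for 9 degrees of freedom, so the decision would be to reject H0.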

T.TEST for matched samples

We have already described the T.TEST worksheet function and shown how to use it with independent samples. You can also use it for matched samples. The figure shows the Function Arguments dialog box for T.TEST along with the weight-loss data.

Here is what to do:

1. Enter the data for each sample into a separate data array and select a cell for the result. In this example, the Before data are in column B and the After data are in column C.
2. From the Statistical Functions menu, choose T.TEST to open its Function Arguments dialog box.
3. Enter the appropriate values for the arguments. In the Array1 box, we entered the cell range holding the Before data, and in the Array2 box, the cell range holding the After data. The Tails box indicates whether the test is one-tailed or two-tailed; we typed 1 for a one-tailed test. The Type box holds a number indicating the type of t-test to perform: 1 for a paired test, 2 for a two-sample test assuming equal variances, and 3 for a two-sample test assuming unequal variances. We typed 1.
4. With values supplied for all the arguments, the dialog box shows the probability associated with the t value for the data. It does not show the value of t.
5. Select OK to place the answer inside the selected cell.

The value in the dialog box is less than .05, so the decision is to reject H0. If you assign the column headers in the figure as names for the respective arrays, the formula is

=T.TEST(Before,After,1,1)

Data analysis tool: t-Test: Paired Two Sample for Means

t-Test: Paired Two Sample for Means is the Excel data analysis tool that handles matched samples.

Here is what you need to do:

1. Enter the data for each sample into a separate array. As you can see, we placed the Before sample in column B and the After sample in column C.
2. Choose Data and select Data Analysis to open the Data Analysis dialog box.
3. In the Data Analysis dialog box, scroll through the list to find t-Test: Paired Two Sample for Means.
4. Choose OK to open the tool's dialog box.
5. In the Variable 1 Range box, enter an absolute reference to the cell range holding the Before sample.
6. In the Variable 2 Range box, enter an absolute reference to the cell range holding the After sample.
7. In the Hypothesized Mean Difference box, enter the difference specified by H0. Ours is 0, so we entered zero.
8. If the cell ranges include column headings, select the Labels check box.
9. The Alpha box contains 0.05 by default. Change the value if you want to use another α.
10. In the Output options, pick the radio button that indicates where you would like the result to go.
11. Select OK.

Here we have the result. The output includes the correlation between the two samples. If this number is close to 1, high scores in one sample are associated with high scores in the other, and low scores in one with low scores in the other. If the number is close to -1, high scores in the first sample are associated with low scores in the second, and low scores in the first with high scores in the second. If the number is close to zero, scores in the first sample are unrelated to scores in the second. You should expect a high value here, since the two samples consist of scores on the same people.

Cell B8 holds the difference between the population means specified by H0, the null hypothesis. The remaining rows give t-related information. Cell B10 holds the calculated value of the test statistic. Cell B11 gives the proportion of area that the positive value of the test statistic cuts off in the upper tail of the t distribution. Cell B12 gives the critical value for a one-tailed test: the value that cuts off a proportion of the area in the upper tail equal to α.

Cell B13 doubles the proportion in B11. It holds the proportion of area from B11 added to the proportion of area that the negative value of the test statistic cuts off in the lower tail. Cell B14 gives the critical value for a two-tailed test: the positive value cuts off α/2 in the upper tail, and the corresponding negative value cuts off α/2 in the lower tail.
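The correlation in the tool's output can be reproduced by hand. The Python sketch below computes a Pearson correlation coefficient from its definition and tries it on three small cases. All of the data are hypothetical, and the function name is an assumption.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical before-and-after weights for the same ten people
before = [198, 201, 210, 185, 204, 156, 167, 197, 220, 186]
after = [194, 203, 200, 183, 200, 153, 166, 197, 215, 184]

print(f"same people, before vs after: r = {pearson_r(before, after):.2f}")
print(f"perfectly increasing: r = {pearson_r([1, 2, 3], [2, 4, 6]):.2f}")
print(f"perfectly decreasing: r = {pearson_r([1, 2, 3], [6, 4, 2]):.2f}")
```

Because each person's second score stays close to the first, the before-and-after correlation comes out close to 1, which is the pattern the text tells you to expect for matched samples.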

T-test on the iPad with StatPlus

Testing two variances

The two-sample hypothesis testing so far has dealt with means. You can also test hypotheses about variances. Consider the following example. A robotics company produces a part that must have a specific length with very little variability. The company is choosing between two machines that produce the part and would like to pick the one with the least variability. It takes a sample of parts from each machine, measures them, finds the variance of each sample, and performs a hypothesis test to see whether the variance for one machine differs from the variance for the other. Here are the hypotheses:

H0: σ1² = σ2²
H1: σ1² ≠ σ2²

As always, we need an α, and ours is .05. When you test two variances, you do not subtract one from the other. Instead, you divide one by the other to get the test statistic. This test statistic is called the F-ratio, named in honor of Sir Ronald Fisher, the statistician who worked out the statistics of comparing variances this way. The test is called the F-test, and the family of distributions for the test is known as the F-distribution. The members of the family differ from one another in their degrees of freedom. What makes this family distinct is that two variance estimates are involved, so each member of the family is associated with two values of degrees of freedom instead of one, as in the t-test. Another difference between the F-distribution and the others is that F cannot have a negative value. Here we show you two members of the F-distribution family.

The test statistic becomes

F = larger s² / smaller s²

Suppose the first machine produces ten parts with a sample variance of .60 square inches, and the second machine produces 15 parts with a sample variance of .44 square inches. Can the company reject the null hypothesis? The test statistic is F = .60/.44 = 1.36, with 9 degrees of freedom for the numerator and 14 for the denominator.

It matters which degrees of freedom belong to the numerator and which to the denominator. The F-distribution with 9 and 14 degrees of freedom differs from the F-distribution with 14 and 9 degrees of freedom. With α = .05 in a two-tailed test (.025 in the upper tail), the critical value in the first case is 3.21, while in the second case it is about 3.80. Because 1.36 is less than 3.21, the company cannot reject H0.
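A small Python sketch makes the bookkeeping explicit: put the larger sample variance in the numerator and keep each variance's degrees of freedom with it. The sample variances and sample sizes are the ones from the robotics example; the function name is an illustrative assumption.

```python
def f_ratio(var_a, n_a, var_b, n_b):
    """Return (F, numerator df, denominator df), larger variance on top."""
    if var_a >= var_b:
        return var_a / var_b, n_a - 1, n_b - 1
    return var_b / var_a, n_b - 1, n_a - 1

# Machine 1: ten parts, variance .60; machine 2: fifteen parts, variance .44
f, df_num, df_den = f_ratio(0.60, 10, 0.44, 15)
print(f"F = {f:.2f} with {df_num} and {df_den} degrees of freedom")

# Passing the samples in the other order gives the same F and the same pairing
assert f_ratio(0.44, 15, 0.60, 10) == (f, df_num, df_den)
```

Keeping the degrees of freedom paired with their variances matters because the critical value you look up depends on which value belongs to the numerator.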

Using the F in conjunction with the t

You can use the F-test together with the t-test for independent samples. Before running the t-test, use the F-test to help decide whether to assume equal or unequal variances in the samples. In the equal-variances t-test example earlier, the standard deviations were 2.71 and 2.79, and the variances were 7.34 and 7.78.

The F-ratio for these variances is

F = 7.78 / 7.34 = 1.06

Each of these samples is based on ten observations, so each sample variance has 9 degrees of freedom. An F-ratio of 1.06 cuts off the upper 47 percent of the F-distribution with 9 and 9 degrees of freedom. This means that you can safely use the equal-variances version of the t-test for these data.

On rare occasions, a high α is a good thing. When H0 is the outcome you would like to keep, and you want to be sure before keeping it, you stack the deck against yourself by setting α at a high level, so that even a small difference leads you to reject H0. This is one of those occasions: the equal-variances t-test is the more desirable one, because it typically provides more degrees of freedom than the unequal-variances t-test. Setting a high value of α for the F-test lets you be more confident when you do assume equal variances.

F.TEST

The worksheet function F.TEST calculates an F-ratio on the data from two samples. Be careful: it does not return the F-ratio itself. Instead, it returns the two-tailed probability of the calculated F-ratio under the null hypothesis. This means that the answer combines the proportion of area to the right of the F-ratio with the proportion of area to the left of the reciprocal of the F-ratio. The figure shows the data for the robotics company's two machines along with the Function Arguments dialog box for F.TEST.

Here are the steps to apply the function:

1. Enter the data for each sample into a separate data array and select a cell for the answer. In this example, we put the first sequence of numbers in cells A2:A12 and the second sequence of numbers in cells B2:B17.
2. Enter the function with the following arguments in the selected cell, like this: =F.TEST(A2:A12,B2:B17). Alternatively, open the function's dialog box from the Statistical Functions menu and enter the appropriate values there.
3. Select OK to put the answer into the selected cell.

If the value in the dialog box is more than .05, we should not reject the null hypothesis. If you assign names to the arrays above, you can also use the named arrays in the formula.

F.DIST and F.DIST.RT

You can use the worksheet functions F.DIST and F.DIST.RT to decide whether a calculated F-ratio falls in the rejection region. For F.DIST, you supply a value for F, a value for each of the degrees of freedom, and a value for an argument called Cumulative. If Cumulative is TRUE, F.DIST returns the probability of obtaining an F-ratio less than or equal to yours when the null hypothesis is true. If that probability is more than 1 - α, reject the null hypothesis. If Cumulative is FALSE, F.DIST returns the height of the F-distribution at your value of F. We use this option later in this chapter to chart the F-distribution.

F.DIST.RT returns the probability of obtaining an F-ratio at least as large as yours when the null hypothesis is true. This is called a right-tailed probability. If the value is less than α, reject the null hypothesis. In practice, F.DIST.RT is the easier of the two to work with.

Here we apply F.DIST.RT to our earlier example, with an F-ratio of 1.36 and 9 and 14 degrees of freedom. Here are the steps:

1. Select the cell where you want the answer.
2. From the Statistical Functions menu, choose F.DIST.RT to open its Function Arguments dialog box.
3. In the dialog box, enter the appropriate values for the arguments.
4. In the X box, enter the calculated F. In this instance it is 1.36. In the Deg_freedom1 box, type the degrees of freedom for the variance estimate in the numerator of F. In this instance, that is 9. In the Deg_freedom2 box, type the degrees of freedom for the variance estimate in the denominator. In this instance, that is 14. With values entered for all the arguments, the answer appears in the dialog box.
5. Choose OK to close the dialog box and put the answer in the selected cell.

If the value in the dialog box is more than the α of .05, you should not reject the null hypothesis.

F.INV and F.INV.RT

These functions reverse the effects of F.DIST and F.DIST.RT. F.INV finds the value in the F-distribution that cuts off a given proportion of the area in the lower (left) tail, while F.INV.RT finds the value that cuts off a given proportion of the area in the upper (right) tail. You can use either one to find the critical value of F.

Based on the machine example from earlier, here is how to find the critical value for the two-tailed test:

1. Select the cell where you want the answer.
2. From the Statistical Functions menu, choose F.INV.RT to open its Function Arguments dialog box.
3. In the dialog box, enter the appropriate values for the arguments. In the Probability box, enter the proportion of area in the upper tail. In our example, we entered .025, since this is a two-tailed test with α = .05.
4. In the Deg_freedom1 box, we entered 9.
5. In the Deg_freedom2 box, we entered 14. With values entered for all the arguments, the answer appears in the dialog box. Choose OK to put the answer in the selected cell.
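For readers who want to see what a right-tailed F probability and its inverse involve, here is a rough Python sketch that uses only the standard library. It integrates the textbook F density numerically with Simpson's rule and inverts the result by bisection. It is not how Excel computes these functions, the function names mirror the worksheet functions only for readability, and the step count and search bounds are assumptions. The sketch is intended for numerator degrees of freedom greater than 2.

```python
import math

def f_pdf(x, d1, d2):
    """Height of the F-distribution with d1 and d2 degrees of freedom at x."""
    if x <= 0:
        return 0.0  # adequate when d1 > 2, which this sketch assumes
    log_beta = math.lgamma(d1 / 2) + math.lgamma(d2 / 2) - math.lgamma((d1 + d2) / 2)
    log_pdf = (
        0.5 * (d1 * math.log(d1 * x) + d2 * math.log(d2)
               - (d1 + d2) * math.log(d1 * x + d2))
        - math.log(x) - log_beta
    )
    return math.exp(log_pdf)

def f_dist_rt(x, d1, d2, steps=2000):
    """Upper-tail area beyond x: 1 minus the Simpson's-rule area from 0 to x."""
    h = x / steps
    total = f_pdf(0.0, d1, d2) + f_pdf(x, d1, d2)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f_pdf(i * h, d1, d2)
    return 1.0 - total * h / 3

def f_inv_rt(probability, d1, d2, low=0.0, high=50.0):
    """Value of F that cuts off `probability` in the upper tail, by bisection."""
    for _ in range(60):
        mid = (low + high) / 2
        # The upper-tail area shrinks as F grows
        if f_dist_rt(mid, d1, d2) > probability:
            low = mid
        else:
            high = mid
    return (low + high) / 2

p = f_dist_rt(1.36, 9, 14)
critical = f_inv_rt(0.025, 9, 14)
print(f"upper-tail probability for F = 1.36: {p:.3f}")
print(f"critical F for .025 in the upper tail: {critical:.2f}")
```

The upper-tail probability is well above .05 and the critical value is well above 1.36, so both routes lead to the same decision for the machine example: do not reject the null hypothesis.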

Data analysis tool: F-Test Two-Sample for Variances

There is also a data analysis tool for the F-test on two sample variances. Here is an example of the data side by side with the dialog box for F-Test Two-Sample for Variances.

Here are the steps to work with this tool:

1. Enter the data for each sample into a separate data array. In this example, we put the data for the first machine in column A and the data for the second machine in column D.
2. Go to Data and select Data Analysis so that the Data Analysis dialog box opens.
3. In the Data Analysis dialog box, scroll through the list of analysis tools and choose F-Test Two-Sample for Variances.
4. Select OK so that the tool's dialog box opens.
5. In the Variable 1 Range box, enter the cell range holding the first data set. In this instance we entered A2:A12.
6. In the Variable 2 Range box, enter the cell range holding the second data set.
7. If the cell ranges include column headings, select the Labels check box.
8. The Alpha box contains 0.05 by default. You can change that to your preferred α.
9. In the Output options, choose your preferred radio button.
10. Select OK.

You will find the answer in the location that you chose in the Output options.

Visualizing the F distribution
Seeing the F distribution makes a difference in understanding it. Here we show you a range of numbers and the finished product, and then you are going to see how to graph it.

Here is what you need to do:
1. Put the degrees of freedom in empty cells. In this instance, we put the first degrees of freedom in cell B1 and the second degrees of freedom in cell B2.
2. Create a column of values for the statistic. We used cells D2 through D17, with values in increments of 0.2.
3. In the first cell of the next column, enter the formula for the probability density. We used =F.DIST(D2,$B$1,$B$2,FALSE) and autofilled it down the column.
4. Go to Insert, choose to create a chart, and select Scatter with Smooth Lines.
5. Modify elements of the chart if you want to.
6. Try different values for the degrees of freedom and observe how they influence the chart.
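If you want to check the density values outside Excel, the following Python sketch computes the same kind of column. It is an illustration under assumptions: the degrees of freedom (9 and 14) and the starting value of 0.2 are our own choices, and f_pdf is a hypothetical helper that evaluates the standard F density formula, which is what F.DIST returns when its last argument is FALSE.

```python
import math


def f_pdf(x, df1, df2):
    """Probability density of the F distribution at x (x must be > 0)."""
    if x <= 0:
        raise ValueError("this sketch only handles x > 0")
    # log of the beta function B(df1/2, df2/2)
    log_beta = (math.lgamma(df1 / 2) + math.lgamma(df2 / 2)
                - math.lgamma((df1 + df2) / 2))
    log_density = (
        0.5 * (df1 * math.log(df1 * x) + df2 * math.log(df2)
               - (df1 + df2) * math.log(df1 * x + df2))
        - math.log(x)
        - log_beta
    )
    return math.exp(log_density)


df1, df2 = 9, 14  # assumed degrees of freedom (cells B1 and B2 in the worksheet)
xs = [round(0.2 * i, 1) for i in range(1, 17)]  # 16 values in steps of 0.2
densities = [f_pdf(x, df1, df2) for x in xs]

for x, density in zip(xs, densities):
    print(f"{x:>4}  {density:.4f}")
```

Plotting the second column against the first gives the same curve as the scatter chart in the steps above.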

Conclusion
When we say that we went into statistics in depth, we mean it. By now, working through the Excel formulas for these statistical functions should be easy for you.

Chapter 11 Testing more than two samples
It would not make a lot of sense if all you could work with in Excel were one or two samples. The previous chapters dealt extensively with one or two samples, as though it were impossible to work with more. So in this chapter we focus on how to work with more than two samples in Excel. It does not matter whether the samples depend on each other or not: we explore both cases. We also introduce the Excel data analysis tools for the job. They do not do everything, but when you combine them with Excel's built-in features, they work well.

Testing more than two
Imagine that you teach a course and you have been asked to try three different methods of teaching it. You take thirty students and separate them into three groups, one for each method. The plan is to train them, test them, put the scores in a table, and make a statistical inference from the data. For some reason, though, three of the students have to leave their groups, whether by suspension or something else. One of them leaves the group for the first method and two leave the group for the third method.

The three methods might give you similar results, so similar that you cannot tell one apart from another. To decide, you have to make use of a hypothesis test:
H0: μ1 = μ2 = μ3
H1: Not H0
with alpha equal to .05.

A thorny problem
If you have read the previous chapter, you will be tempted to think that this is very easy. Take the mean from the first class and the mean from the second class, perform a t-test, and voila, you have your answer. You might then think that the same procedure applies here: just add the third class and test every pair of means. If any t-test shows a significant difference, you reject the null hypothesis and your work is done.
It is not that easy. If alpha stays at .05 for each of the t-tests, that is a problem in itself, because the overall probability of an error ends up higher than you intended. With three t-tests, the probability that at least one of them shows a significant difference just by chance is well above .05. It is about .14, and that is too high.
With more than three samples, the thorny problem gets even more twisted. Four groups need six t-tests, and the probability that at least one of them is significant by chance is about .26. Here is what happens as the number of samples increases.

This means that you cannot continue with the plan of just carrying out multiple t-tests.
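The .14 and .26 figures can be reproduced with a short Python sketch. It assumes that the t-tests are independent of one another, which is only approximately true for pairwise tests on the same samples, so treat the numbers as a rough guide. The function name familywise_error is our own.

```python
from math import comb


def familywise_error(groups, alpha=0.05):
    """Return (number of pairwise t-tests, chance of at least one false alarm).

    Assumes the tests are independent, which is a simplification.
    """
    tests = comb(groups, 2)          # every pair of samples gets one t-test
    return tests, 1 - (1 - alpha) ** tests


for groups in range(3, 8):
    tests, rate = familywise_error(groups)
    print(f"{groups} samples: {tests:>2} t-tests, "
          f"P(at least one significant by chance) = {rate:.2f}")
```

The probability climbs quickly as samples are added, which is why a single overall test is needed.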

A solution
Here is one way out. Think in terms of variances instead of means, and think of the variance a little differently. Here is the formula for estimating the population variance:
\[ s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} \]

Here xi refers to the value of an observation, x̄ to the mean of the observations, and n to the number of observations. Since the variance is almost a mean of the squared deviations from the mean, it helps to think of it as a mean square. The numerator of the variance is the sum of the squared deviations from the mean, or sum of squares (SS). The denominator is the degrees of freedom (df). This means that we can see it like this:
\[ \text{Mean Square} = \frac{SS}{df} \]
So, to get around our problem, you find the mean squares in the data. Remember that a mean square estimates the variance of the population the sample came from. You also have to assume that the population variances are equal when you estimate them. Then you can use the estimates.
In the table above, we have created three different mean squares. Next, treat all of the scores as one set, regardless of which group they belong to. Taking the 27 scores as a whole, the mean is 84.92308. This means that we get the following variance among the means:
\[ \frac{(10)(89.55556 - 84.92308)^2 + (10)(81.22222 - 84.92308)^2 + (9)(83.875 - 84.92308)^2}{3 - 1} \]

The degrees of freedom in this instance are 2. This is referred to as the variance between the sample means. There are now three estimates of the population variance: MST, MSW, and MSB.
What you want to do is test the hypothesis about the three means. According to the null hypothesis, any differences among the means are due strictly to chance. This means that the variance between the means is equivalent to the variance among three numbers chosen at random from the whole population. You could compare the variance among the means with the population variance to see whether that holds up, if only you had an estimate of the population variance that is independent of the differences among the groups. You do have such an estimate: MSW, which is based on the pooled variance within the samples. If you assume that the samples come from populations with equal variances, it is a good estimate. So if MSB is much larger than MSW, the data are inconsistent with the null hypothesis. This converts the hypotheses from
H0: μ1 = μ2 = μ3
H1: Not H0
into
H0: σB² ≤ σW²
H1: σB² > σW²
Instead of working with t-tests on the sample means, you test the difference between two variances. To perform the test, you divide one of the variances by the other and compare the result with a family of distributions called the F distribution. Since two variances are involved, two values for degrees of freedom identify each member of the family. In this example, F has 2 degrees of freedom for the numerator and 24 degrees of freedom for the denominator. That is the distribution of F values when the null hypothesis is true. In our example:
F2,24 = MSB / MSW = MSB / 15.31
Then look at the proportion of the area that the calculated value cuts off in the upper tail. In our example that proportion is microscopic (the values on the horizontal axis of the chart only go up to 5), so it is highly unlikely that the differences among the means are due to chance, and you reject the null hypothesis.
What we have just done is called the analysis of variance, known as ANOVA. In ANOVA, the denominator of the F ratio is called the error term, and the independent variable is called a factor. In our example there is one factor, the training method. Each value of the independent variable is referred to as a level, so this factor has three levels. More complex studies have more than one factor, and each factor can have several levels.
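The whole calculation can be sketched in a few lines of Python. The scores below are hypothetical and are not the data from the example; the function name one_way_anova and the dictionary keys are our own. The sketch shows how MSB and MSW come from sums of squares and degrees of freedom, and how F is their ratio.

```python
def one_way_anova(groups):
    """Single-factor ANOVA: sums of squares, df, mean squares and F."""
    all_scores = [score for group in groups for score in group]
    grand_mean = sum(all_scores) / len(all_scores)
    group_means = [sum(group) / len(group) for group in groups]

    # Between: squared deviation of each group mean, weighted by group size
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Within: squared deviation of each score from its own group mean
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    ss_total = sum((x - grand_mean) ** 2 for x in all_scores)

    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return {
        "ss_between": ss_between, "ss_within": ss_within, "ss_total": ss_total,
        "df_between": df_between, "df_within": df_within,
        "ms_between": ms_between, "ms_within": ms_within,
        "f": ms_between / ms_within,
    }


# Hypothetical scores for three teaching methods (assumed, not the book's data)
scores = [
    [95, 91, 89, 90, 99],
    [83, 89, 85, 89, 81],
    [68, 75, 79, 74, 75],
]
result = one_way_anova(scores)
print(f"F({result['df_between']}, {result['df_within']}) = {result['f']:.2f}")
```

To finish the test you would compare the printed F with the critical value from F.INV.RT for the same degrees of freedom.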

Meaningful relationships
Take a good look at the mean squares in our example, along with the sum of squares and the degrees of freedom for each of them. When we calculated the mean squares, we did not show you each sum of squares explicitly. This image does.

You might ask whether it is a coincidence that the degrees of freedom add up: dfB = 2, dfW = 24, and dfT = 26. It is not. This always happens when you analyze variance, and the sums of squares add up in the same way. In other words, the analysis partitions SST into one part for SSB and another part for SSW, and it partitions dfT into one amount for dfB and another for dfW.

After the F-test
The F-test lets you decide whether to reject the null hypothesis. When you reject it, you know that somewhere among the means something differs from something else, but the test does not specify what. That is why we are here: to clarify these things.
The first approach is the planned comparison. To find the differences, you have to carry out some other tests. Not only that, you have to plan those tests before you carry out the ANOVA. The tests are t-tests. That might look inconsistent with the earlier warning about the inflated alpha of multiple t-tests, but it is not: once an analysis of variance allows you to reject the null hypothesis, it is fine to use t-tests to search the data for the differences. In this instance, the t-test is a little different from the one you have seen. These tests are also known as a priori tests.
Take a look at this illustration. Suppose that before you gathered the data, you expected the first method to give higher scores than the second, and the second to give higher scores than the third. In that case, you plan in advance to compare the means of those samples if the ANOVA decision is to reject the null hypothesis. Here is the formula for the t-test that we are about to carry out:
\[ t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{MS_W\left[\frac{1}{N_1} + \frac{1}{N_2}\right]}} \]

This is the test of
H0: μ1 ≤ μ2
H1: μ1 > μ2
MSW replaces the pooled estimate sp², and the df for the t-test becomes dfW instead of (N1 − 1) + (N2 − 1). This means that the comparison of class A and class B is

This planned-comparison t-test formula is just like the t-test for two samples. You can write the planned comparison in a way that opens up more possibilities. Start by writing the numerator differently, as +1 times the first mean plus −1 times the second mean. The +1 and −1 are the coefficients of the planned comparison.
Planned comparisons are not the whole story. There are times when you want to go deeper into your data and look for things that you did not plan for. The comparisons you make then are unplanned, and they are known as a posteriori tests or post hoc tests. Statisticians have come up with many kinds of these tests. They generally involve stacking the deck against rejecting the null hypothesis for any particular comparison. Many of them rely on convoluted formulas and special distributions; the one we like begins with the test from the planned comparison and adds a little to it. It is the Scheffé test, and here is how to use it:
1. Calculate the planned-comparison t-test.
2. Square the value to create an F.
3. Look up the critical value of F for dfB and dfW at your alpha.
4. Multiply that critical value by the number of samples minus 1. The result is the critical value for the unplanned comparison. Call it F′.
5. Compare the calculated F to F′. If the calculated F is greater, reject the null hypothesis for this comparison.
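The Scheffé steps are simple enough to sketch in Python. In its usual form, the test multiplies the critical F by the number of samples minus 1, and that is what the sketch does. The t value below is hypothetical, and the critical F of 3.40 is the approximate value that F.INV.RT(0.05, 2, 24) returns; the function name scheffe_test is our own.

```python
def scheffe_test(t_value, critical_f, num_samples):
    """Return (F, F_prime, reject) for an unplanned comparison.

    F is the squared t value; F_prime is the critical F multiplied by
    (number of samples - 1), which is the usual Scheffé adjustment.
    """
    f_value = t_value ** 2
    f_prime = (num_samples - 1) * critical_f
    return f_value, f_prime, f_value > f_prime


# Hypothetical t from a comparison of two of three samples (assumed value)
f_value, f_prime, reject = scheffe_test(t_value=2.5, critical_f=3.40, num_samples=3)
print(f"F = {f_value:.2f}, F' = {f_prime:.2f}, reject H0: {reject}")
```

Notice that a t of 2.5 would be significant as a planned comparison with 24 degrees of freedom, but it does not clear the higher bar that the Scheffé test sets.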

Data analysis tool: Anova: Single Factor
The calculations for ANOVA can be arduous. But there is one tool that does all of the heavy lifting, and it is called Anova: Single Factor. Here are the steps to work with it:
1. Enter the data for each sample into its own data array.
2. Select Data and choose Data Analysis to open the dialog box with the tools from the ToolPak.
3. Look for Anova: Single Factor in the dialog box and choose it.
4. Select OK to open the Anova: Single Factor dialog box.
5. In the Input Range box, enter the cell range that holds all of the data.
6. If the cell range includes headings, mark the Labels check box.
7. The Alpha box holds 0.05 as the default.
8. Select one of the output options and choose OK.

What we have in the illustration above is the output.

Comparing the means
Excel has no built-in tool for carrying out comparisons among the means. However, you can do it yourself with SUMPRODUCT. The worksheet page that holds the ANOVA output is where you can start.
We are going to take you through a planned comparison, comparing the mean of class A with the mean of class B. First create a column to hold the comparison coefficients. Below those rows, place the t-test numerator, the denominator, and the t value. Use separate cells for the numerator and the denominator so that the formulas stay simple. You could put everything together in one big formula and have a single cell for t, but it would be hard to keep track of it all.
Then use the SUMPRODUCT function. It takes arrays of cells, multiplies the numbers in corresponding cells, and sums the products. We use it to multiply each coefficient by its sample mean and add the products.
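Here is a small Python sketch of the same idea. The helper sumproduct mimics what Excel's SUMPRODUCT does with two ranges. The means and sample sizes are hypothetical; the MSW of 15.31 is the value from the example earlier in the chapter. The names sumproduct and planned_comparison_t are our own.

```python
import math


def sumproduct(xs, ys):
    """Multiply corresponding items and add the products, like SUMPRODUCT."""
    return sum(x * y for x, y in zip(xs, ys))


def planned_comparison_t(coefficients, means, sizes, ms_within):
    """t value for a planned comparison defined by a set of coefficients."""
    numerator = sumproduct(coefficients, means)
    denominator = math.sqrt(
        ms_within * sum(c * c / n for c, n in zip(coefficients, sizes))
    )
    return numerator / denominator


coefficients = [1, -1, 0]        # compare the first mean with the second
means = [90.0, 85.0, 88.0]       # hypothetical sample means
sizes = [9, 10, 8]               # hypothetical sample sizes
ms_within = 15.31                # MSW from the ANOVA table

t_value = planned_comparison_t(coefficients, means, sizes, ms_within)
print(f"t = {t_value:.2f} with df = {sum(sizes) - len(sizes)}")
```

A coefficient of 0 drops a sample from the comparison, which is why the same formula handles any pair of means.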

Another kind of hypothesis, another kind of test
There are times when you work with independent samples and there are times when you work with matched ones. In practice, matched samples often mean that one person participates in each of the conditions of the independent variable. In the section below, we explore how to use ANOVA when you have more than two matched samples. This ANOVA is known as repeated measures.
Working with repeated measures ANOVA
To know how effective the weight-loss program is, you have to do a hypothesis test:
H0: μBefore = μ1 = μ2 = μ3
H1: Not H0
As before, alpha is equal to .05.
Just as in the prior ANOVA, you begin with the variances in the data. MST is the variance of all of the scores around the grand mean. One more source of variance is in the data. You can interpret it as the variance that is left over when you pull the variance in the rows and the variance in the columns out of the total variance. More precisely, it is the sum of squares that is left over when you subtract the SS for rows and the SS for columns from SST. This variance is referred to as MSError. In ANOVA, the denominator of an F is called the error term, so the word error tells you that this MS is a denominator of F. To find MSError, you make use of the relationships among the sums of squares and among the degrees of freedom.

You can also calculate the degree of freedom this other way
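These relationships can be sketched in Python. The weights below are hypothetical, with one row per person and one column per weighing, and there are only four people to keep the listing short. The function name repeated_measures_anova is our own. The error term is what remains of the total after the rows and the columns are removed.

```python
def repeated_measures_anova(scores):
    """scores[i][j] is the score of person i (row) in condition j (column)."""
    n_rows, n_cols = len(scores), len(scores[0])
    grand_mean = sum(sum(row) for row in scores) / (n_rows * n_cols)
    row_means = [sum(row) / n_cols for row in scores]
    col_means = [sum(row[j] for row in scores) / n_rows for j in range(n_cols)]

    ss_total = sum((x - grand_mean) ** 2 for row in scores for x in row)
    ss_rows = n_cols * sum((m - grand_mean) ** 2 for m in row_means)
    ss_cols = n_rows * sum((m - grand_mean) ** 2 for m in col_means)
    ss_error = ss_total - ss_rows - ss_cols      # what is left over

    df_rows, df_cols = n_rows - 1, n_cols - 1
    df_error = df_rows * df_cols                 # same as df_total - df_rows - df_cols
    ms_cols = ss_cols / df_cols
    ms_error = ss_error / df_error
    return {
        "ss_error": ss_error, "df_rows": df_rows, "df_cols": df_cols,
        "df_error": df_error, "ms_error": ms_error, "f_cols": ms_cols / ms_error,
    }


# Hypothetical weights: Before, Month 1, Month 2, Month 3 (assumed data)
weights = [
    [198, 194, 191, 188],
    [201, 203, 200, 196],
    [210, 200, 192, 188],
    [185, 183, 180, 181],
]
result = repeated_measures_anova(weights)
print(f"F({result['df_cols']}, {result['df_error']}) = {result['f_cols']:.2f}")
```

With ten people instead of four, the degrees of freedom would be 3 and 27, as in the example that follows.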

With 3 and 27 degrees of freedom, the critical F for an alpha of .05 is 2.96. You can use Excel's F.INV.RT function to verify this. When the F that you calculate is greater than the critical F, you reject the null hypothesis. What about an F that involves MSRows? That one does not figure into the null hypothesis for this example. If you find a significant F there, it only shows that the individuals differ from one another. To carry out a planned comparison and find where the differences are, you can use the following formulas.

The expression on the right follows from the formula on the left because, in repeated measures, the Ns are equal. The degrees of freedom for this test are dfError.

Getting trendy
In this sheet, which records the weights of people in an exercise program, the independent variable is quantitative: its levels are separated by equal intervals. With this kind of independent variable, it is often a good idea to look for trends in the data instead of just planning comparisons among the means. When you graph the means, they seem to approximate a line. Trend analysis is the statistical way to study that pattern. The aim is to see whether the pattern contributes to the differences among the means. A trend can be linear or nonlinear. To analyze a trend, you use comparison coefficients, the same kind of numbers that you use in planned comparisons. The only difference is that you use them in a slightly different way.

In this example, you use the comparison coefficients to find a sum of squares for the linear trend, abbreviated SSLinear. When you divide SSLinear by the degrees of freedom for the linear trend, you get MSLinear. That is easy, because the degrees of freedom for the linear trend are equal to 1. Divide MSLinear by MSError and you have an F. If that F is greater than the critical value of F with 1 and dfError degrees of freedom at the alpha level you use, then the weight decreases linearly over the course of the weight-loss program. The comparison coefficients change with the number of samples. For four samples, the coefficients are −3, −1, 1, and 3. To form SSLinear, use the following formula:
\[ SS_{Linear} = \frac{N\left(\sum c\,\bar{x}\right)^2}{\sum c^2} \]
In this formula, N refers to the number of people and c represents the coefficients. When we apply the formula to our example, here is how it looks

This is a large proportion of SSColumns, so SSNonlinear is relatively small.
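Here is a minimal Python sketch of the linear trend calculation, using the usual formula in which N times the squared weighted sum of the means is divided by the sum of the squared coefficients. The means, the number of people, and MSError are hypothetical values chosen for illustration, and the function name ss_linear is our own.

```python
def ss_linear(coefficients, means, n_people):
    """Sum of squares for a trend defined by the comparison coefficients."""
    weighted_sum = sum(c * m for c, m in zip(coefficients, means))
    return n_people * weighted_sum ** 2 / sum(c * c for c in coefficients)


coefficients = [-3, -1, 1, 3]            # linear trend, four samples
means = [200.0, 196.0, 193.0, 189.0]     # hypothetical column means
n_people = 10                            # hypothetical number of people
ms_error = 25.0                          # hypothetical MSError from the ANOVA table

ss_lin = ss_linear(coefficients, means, n_people)
ms_lin = ss_lin / 1                      # the linear trend has 1 degree of freedom
f_linear = ms_lin / ms_error
print(f"SSLinear = {ss_lin}, F = {f_linear:.2f}")
```

Subtracting SSLinear from SSColumns gives the part of the variation that the linear trend does not account for.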

Data analysis tool: Anova: Two-Factor Without Replication
Here is how to use Anova: Two-Factor Without Replication:
1. Enter the data for the samples in different data arrays.
2. If the Analysis ToolPak add-in is activated, go to Data and select Data Analysis.
3. In the dialog box that opens, select Anova: Two-Factor Without Replication.
4. Select OK.
5. Input the appropriate range.
6. If there are column headings, select Labels.
7. You can change the number in the Alpha box to one that works for you.
8. In the output options, pick your preferred output and select OK.
The ANOVA table gives you the sums of squares, the degrees of freedom, the mean squares, the F values, the p-values, and the critical F ratios for the degrees of freedom indicated. The table also shows an F for the rows, but it does not matter here, since the null hypothesis only concerns the columns in the data.

Analyzing trend
The Anova: Two-Factor Without Replication tool does not analyze trends. However, just as with the planned comparison, you can use SUMPRODUCT and SUMSQ to calculate it. Start by adding the comparison coefficients for a linear trend in cells J11 to J15. Then in J20 to J22, add the information related to SSLinear: the numerator, the denominator, and the value of the sum of squares. It is better to use separate cells for the numerator and the denominator so that the formulas stay simple.

ANOVA on the iPad
You can use the StatPlus add-in on the iPad to work with ANOVA:
1. Enter your data into an array.
2. Select Insert, choose Add-ins, and then select StatPlus to open the add-in box.
3. In the StatPlus panel, select New Analysis so that the command box opens.
4. Choose the analysis tool.
5. Enter the values in the boxes in the StatPlus panel.
6. Select Run.

ANOVA on the iPad: another way
You can also perform the ANOVA with the data arranged differently. Look at this image, for example.

The format on the left is the common one. It is referred to as wide format, while the one on the right is referred to as narrow format (also called stacked format). It holds the same data but in a different arrangement. Transforming data from wide format to narrow format is called melting. You can use the StatPlus one-way ANOVA on the narrow format:
1. Enter the data into an array.
2. Select Insert and choose Add-ins, then select StatPlus. This opens the add-in box.
3. In the StatPlus panel, select New Analysis. This opens the command box for the analysis tools.
4. Enter the appropriate values in the boxes.
5. Tap Run.
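Melting is easy to picture with a short Python sketch. The column labels and scores are hypothetical, and the function name melt is our own; the point is only that every score in the wide table becomes one (group, score) row in the narrow table.

```python
def melt(wide):
    """Turn {column label: [scores]} into a list of (label, score) rows."""
    return [(label, score) for label, scores in wide.items() for score in scores]


# Hypothetical wide-format data: one column per method (assumed values)
wide = {
    "Method 1": [95, 91, 89],
    "Method 2": [83, 89],
    "Method 3": [68, 75, 79],
}

narrow = melt(wide)
for label, score in narrow:
    print(f"{label}\t{score}")
```

The narrow table has one row per score, so columns of unequal length cause no trouble.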

Repeated measures ANOVA on the IPAD

You can also use the StatPlus tool for two-way ANOVA without replication to run a repeated measures ANOVA. Here is how:
1. Enter the data into an array.
2. Select Insert and choose Add-ins, then select StatPlus. This opens the add-in box.
3. In the StatPlus panel, select New Analysis. This opens the command box for the analysis tools.
4. Enter the appropriate values in the boxes.
5. Tap Run.

Conclusion
We have been detailed and specific in our lesson on testing more than two samples. By now, this should be easy for you.

Chapter 12 Slightly More Complicated Things
Mind you, the last three chapters have been the simpler stuff. Now it is time to get to the adult stuff. Here we cover how to break down the variance when an ANOVA involves two factors in a data set. The best way to work with this kind of ANOVA is through the Excel analysis tools, so follow us:

Creating the combinations
Suppose a car company wants to test three new engines and see how efficient they are in three different modes: cruising, driving up a steep incline, and driving down a steep decline. The company decides to work with nine different cars and assigns each car at random to one combination of engine and mode. It then records the number of miles each engine runs before it needs servicing. Here is what the data looks like:

Two hypothesis tests are needed:
H0: μengine1 = μengine2 = μengine3
H1: Not H0
and
H0: μcruise = μdecline = μincline
H1: Not H0
In both tests, alpha is .05 as usual.

Breaking down the variance
The appropriate analysis for this kind of study is an analysis of variance. The variables, engines and tasks, are the factors, so this is a two-factor ANOVA. To understand this kind of ANOVA, first consider the variance in all nine numbers, which is the mean square total. The grand mean is approximately 61.33, so the variance is then:
MST =
The means of the three engines also vary. To get that variance, this is what we do:
MSEngines =
When you are working with means, you have to take into account the number of scores that produced each mean. The task means vary too:
The next variance to find is called MSError. It is what is left over after you subtract the sum of squares for the engines and the sum of squares for the tasks from the total sum of squares, divided by its degrees of freedom. To test the hypotheses, you calculate one F for the effect of the engines and another for the effect of the tasks. In both of them, the denominator is MSError. So
F =
and
F =
With 2 and 4 degrees of freedom and an alpha of .05, the critical F is 6.94. This means that you reject the null hypothesis for the engines. To see where the differences among the engines are, you can carry out planned comparisons.

Data analysis tool: Anova: Two-Factor Without Replication
The Anova: Two-Factor Without Replication tool handles an analysis with two factors and no replication. Without replication means that, as with the cars above, only one car is assigned to each combination of engine and task. When you assign more than one car to a combination, that is with replication. Here are the steps to work with this tool:
1. Enter the data into the worksheet and add the labels for the rows and the columns.
2. Go to the ribbon, choose Data, and select Data Analysis to open the Data Analysis dialog box.
3. In the Data Analysis dialog box, find Anova: Two-Factor Without Replication and choose it.
4. Select OK.
5. Input the range and tick the appropriate boxes. It is better to use an absolute reference.
6. If the range includes headings, mark Labels.
7. By default, the Alpha box holds .05. You can change it if you want to.
8. Choose the option you want for the output.
9. Select OK.
Here is our output

There are two sections in the SUMMARY table. The first shows summary statistics for the rows and the second shows summary statistics for the columns. In the ANOVA table you will also see the degrees of freedom, the mean squares, the F values, the p-values, and the critical F values for the degrees of freedom indicated. From this output, you decide whether to reject the null hypothesis for the engines and for the tasks.

Cracking the combinations again
In the previous section, we dealt with one score for each combination of the two factors. Assigning one individual to each combination is appropriate when you can assume that one object is much like another, as with the cars. When people are involved, it is a different story. Because individuals vary, you must assign a sample of people to each combination of the factors, not just one individual.

Rows and columns
Suppose you are a teacher and you want to study two factors that affect your students. One is the composition of the class: in a calm class you can teach without raising your voice and without stopping, while in an unruly class you have to teach with little breaks to control the disruption. The other is your tolerance: in one class you are welcoming and in the other you are distant and cold.
When we combine the levels of these factors, we have four combinations. So there are four classes, with four students in each of them, and at the end you test their aptitude. We call the first factor composition in class and the second factor tolerance. The hypotheses are:
H0: μunruly = μcalm
H1: Not H0
and
H0: μwelcoming = μdistant
H1: Not H0
Unruly and calm are the levels of the composition in class factor, and welcoming and distant are the levels of the tolerance factor.

Interactions
When the data are arranged in rows and columns and you test hypotheses about the row factor and the column factor, there is one more thing to consider: the row-column combinations. The question is whether a particular combination produces a peculiar effect. In this example, combining an unruly class with a welcoming or a distant teacher might give you an unexpected result. Such an unexpected result is called an interaction. An interaction happens when the levels of one factor affect the levels of the other factor differently.

The analysis
As before, the ANOVA in this situation depends on the variances in the data. The first variance is the total variance of all of the scores. Its denominator tells you the degrees of freedom. The next variance comes from the row factor. It is a mean square: the variance of the row means around the grand mean. Each squared deviation is multiplied by 8, because you have to take into account the number of scores that produced each row mean. The degrees of freedom for this mean square are the number of rows minus 1, which is 1.
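The full breakdown for two factors with replication can be sketched in Python. The scores are hypothetical: two rows, two columns, and four scores in each combination, which gives the 8 scores per row mean mentioned above. The function name two_factor_anova is our own, and the sketch assumes that every combination holds the same number of scores.

```python
def mean(values):
    return sum(values) / len(values)


def two_factor_anova(cells):
    """cells[i][j] is the list of scores for row level i and column level j."""
    n_rows, n_cols, n_per_cell = len(cells), len(cells[0]), len(cells[0][0])
    all_scores = [x for row in cells for cell in row for x in cell]
    grand_mean = mean(all_scores)
    row_means = [mean([x for cell in row for x in cell]) for row in cells]
    col_means = [mean([x for row in cells for x in row[j]]) for j in range(n_cols)]

    # Each row mean rests on n_cols * n_per_cell scores, each column mean
    # on n_rows * n_per_cell scores
    ss_rows = n_cols * n_per_cell * sum((m - grand_mean) ** 2 for m in row_means)
    ss_cols = n_rows * n_per_cell * sum((m - grand_mean) ** 2 for m in col_means)
    ss_within = sum((x - mean(cell)) ** 2
                    for row in cells for cell in row for x in cell)
    ss_total = sum((x - grand_mean) ** 2 for x in all_scores)
    ss_interaction = ss_total - ss_rows - ss_cols - ss_within

    return {
        "ss_rows": ss_rows, "ss_cols": ss_cols,
        "ss_interaction": ss_interaction, "ss_within": ss_within,
        "df_rows": n_rows - 1, "df_cols": n_cols - 1,
        "df_interaction": (n_rows - 1) * (n_cols - 1),
        "df_within": n_rows * n_cols * (n_per_cell - 1),
    }


# Hypothetical aptitude scores (rows: unruly, calm; columns: welcoming, distant)
scores = [
    [[70, 68, 75, 71], [55, 60, 58, 59]],
    [[80, 82, 79, 85], [78, 75, 80, 77]],
]
result = two_factor_anova(scores)
ms_within = result["ss_within"] / result["df_within"]
for source in ("rows", "cols", "interaction"):
    f_value = (result[f"ss_{source}"] / result[f"df_{source}"]) / ms_within
    print(f"{source}: F = {f_value:.2f}")
```

Each F uses the within-combination mean square as its denominator, which is what the Anova: Two-Factor With Replication tool reports.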

Data analysis tool: Anova: Two-Factor With Replication
Here is how to use the data analysis tool for a two-factor ANOVA with replication.

1. Enter your data into the worksheet and, if needed, add labels to the rows and the columns.
2. Go to Data on the ribbon and select Data Analysis. This brings up the dialog box for the ToolPak.
3. Find Anova: Two-Factor With Replication.
4. Select OK to open the tool's dialog box.
5. In the Input Range box, enter the cell range that holds the data.
6. In the Rows per Sample box, enter the number of scores in each combination. In our example, we entered 4.
7. We left the alpha as it was.
8. Choose the output option you prefer and select OK.
9. When you are done, you can decide whether or not to reject the null hypotheses.

Two kinds of variables at once
Sometimes you have two kinds of variables at once. Imagine that you are looking for the effect of the reading medium on how the students read. You assign some students to read on e-readers and others to read printed books. That is a between-groups variable. You also want to see whether the font matters, so each student reads text in each of several fonts. That is a within-groups variable. To make the study complete, you randomly order the fonts for each subject. Here is how the ANOVA table looks:

One thing that you have to note is the three F ratios. The first tests the differences among the levels of A, the second tests the differences among the levels of B, and the third tests the interaction of A and B. You will also notice that the denominator of the first F ratio differs from the denominator of the others. The more complex the ANOVA, the more this happens. You also have to know the relationships among the terms at the top level. When you know those relationships, you can complete the ANOVA table.

Using Excel with a mixed design
Excel currently has no mixed-design ANOVA tool, but there is still something that you can do: run two ANOVAs on the data and combine their tables into one complete ANOVA table. Here is what to do:
1. Enter the data.
2. Go to the Data Analysis dialog box and choose Anova: Two-Factor Without Replication.
3. In the Input Range box, enter the cell range that holds the data.
4. Select the option for where you want the output to be and select OK.
5. Modify the resulting ANOVA table: insert rows for terms from the second ANOVA, change the names of the sources of variance, and delete any unnecessary values. First insert rows between Rows and Columns, then change Rows to Between Group and Columns to Font. Then delete all of the information for Error, and also delete the F ratios, the P-values, and the F crit values.
6. Select Data again and go to Data Analysis.
7. In the Data Analysis dialog box, select Anova: Two-Factor With Replication.
8. Input the appropriate range.
9. In the Rows per Sample box, enter the number of subjects in each level of the between-groups variable.
10. Select where you want the output to be.
11. Copy the resulting ANOVA table and paste it into the first ANOVA worksheet, under the first table.
12. In the first ANOVA table, four rows below Between Group, add Within Group, and calculate its sum of squares and degrees of freedom.
13. Copy the Sample row from the second ANOVA table and paste it into the first ANOVA table under Between Group.
14. Change the name Sample to the name of the between-groups variable (the A variable).
15. In the next row, enter the name of the source and calculate its sum of squares, degrees of freedom, and mean square.
16. In the appropriate cell, calculate the F ratio for the A variable.
17. Copy the Interaction row from the second ANOVA table, with its sum of squares and degrees of freedom, and paste it into the first ANOVA table in the row under the B variable. It is better to change the name Interaction to the name of the interaction between the A variable and the B variable.
18. In the row that follows, enter the name of the source for B X S/A and calculate its sum of squares, degrees of freedom, and mean square.
19. Finally, calculate the F ratios that remain.
For this whole procedure to work, there must be an equal number of people in each level of the between-groups variable.

Graphing the results

Since a mixed-design ANOVA tool is not built in, Excel does not calculate the descriptive statistics for each combination of conditions. You need those statistics to chart the results: calculate the average and the standard deviation for each combination, then create a chart from them.

After the ANOVA

When you have the interaction, you can start the post-analysis tests: look at the effects of the levels of the first variable, and then of the second variable. Statistics texts give complex formulas for these analyses. In Excel, however, it is easy. You just need to arrange the numbers in two columns.

Two-factor ANOVA on the iPad

The StatPlus add-in carries out two-factor ANOVA on the iPad. Here is how to go ahead:

1. Enter the data in the array.
2. Go to Insert, select Add-ins, and select StatPlus.
3. In the panel, start a new analysis.
4. Choose an analysis tool.
5. Enter the appropriate values in the panel. Do the same again if you need to.
6. Select Run.

Conclusion

Now, is that not easy? Well, with a little practice it should be.

Chapter 13
Regression: Linear and Multiple

When you predict with Excel's statistical functions, you take one or more variables and use them to make predictions about another variable. Before you can make those predictions, you first need to summarize the relationship between the variables; then you can go on to test hypotheses about it. In this chapter we go through regression and how to do it with Excel.

The plot of scatter

Imagine that you are the head of a training institute that receives a lot of applications, but there are only so many people you can train. You want to predict, from the information that you have, how well each applicant is likely to do in your institute. If you do not know much about a potential trainee, you can still go by a grade based on an aptitude test that you gave to the applicants. With more information we would be saying something different, but for now the test grade is all we have. We want to use it, together with the scores from the interviews arranged with the applicants, to make a prediction. Here is what you do.

So we have the interview scores and the grade points from the test that you set for them. As you can see, the points are scattered, which is why this kind of chart is called a scatterplot. The vertical axis shows the data that you want to predict, known as the dependent variable. The horizontal axis shows the data that you use to make the prediction, known as the independent variable.

Graphing a line

When we talk about lines in statistics and mathematics, we are not talking about lines you draw on the beach. In mathematics, a line is a precise way to describe the connection between an independent variable and a dependent variable.

There are a few things to notice about lines. A line can be described by how it slants and by where it crosses the y-axis. The slant is known as the slope: it tells you how much y changes when x changes by one unit. The point where the line crosses the y-axis is known as the y-intercept. The slope can be positive, negative, or zero. When the slope is negative, the line falls from left to right, and when it is positive, the line rises from left to right. When the slope is zero, y does not change as x changes, and the line is horizontal. The same is true of the y-intercept: it can be a positive number, a negative number, or zero. When it is positive, the line crosses the y-axis above the x-axis, and when it is negative, the line crosses the y-axis below the x-axis. When the intercept is zero, the line passes through the point where the y-axis and the x-axis intersect, which is called the origin.

Regression: what a line

Now back to regression. The best way to summarize the relationship in a scatterplot is a line. It is possible to draw a limitless number of lines through a scatterplot; the problem is finding the one that fits best. The best-fitting line is the one that passes as close as possible to all the points. To a statistician, this line is special: if you draw a line through the scatterplot, measure the vertical distance from each point to the line, square those distances, and add them up, the best-fitting line is the one for which that sum is a minimum. That line is known as the regression line, and its equation is

$$y' = a + bx$$

Here y' is a point on the line: the predicted value of y for a given value of x. To know where the line is, you need the slope b and the intercept a. For a regression line, the slope and the intercept are referred to as the regression coefficients.

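Here is the formula for the slope:

$$b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$$

and the formula for the intercept is

$$a = \bar{y} - b\bar{x}$$

[Figure: worked example of the slope and intercept calculations]

In this case, the equation of the best-fitting line is the slope times x, plus the intercept.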
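As a rough illustration of the least-squares calculation, here is a minimal Python sketch. The slope is the sum of the cross-products of the deviations divided by the sum of the squared x deviations, and the intercept is the mean of y minus the slope times the mean of x. The function name fit_line and the five SAT and GPA values are made up for this example; they are not the book's data set.

```python
def fit_line(xs, ys):
    """Return (intercept a, slope b) of the least-squares line y' = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: cross-products of deviations over squared x deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    # Intercept: the line passes through the point (mean_x, mean_y)
    a = mean_y - b * mean_x
    return a, b

# Made-up scores for five applicants (an assumption, not the book's data)
sat = [1000, 1100, 1200, 1300, 1400]
gpa = [2.0, 2.4, 3.1, 3.3, 3.7]

a, b = fit_line(sat, gpa)
print(f"slope b = {b:.4f}, intercept a = {a:.2f}")
```

With these made-up numbers the slope comes out at about 0.0043 and the intercept at about -2.26, so the line is roughly y' = -2.26 + 0.0043x.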

Using regression for forecasting

Once you have the regression line, you can take an applicant's SAT score and use it to predict that applicant's GPA: put the SAT score into the equation as x, and the result is the predicted GPA. With this line, the GPA is the only thing you can predict.

Variation around the regression line

The mean of a data set does not tell you all that you need to know about it. You also need to know how the data vary around the mean, which is why we find the variance and the standard deviation. The same goes for a scatterplot: to describe it fully, you need to know how the scores vary around the regression line. So we are going to talk about the residual variance and the standard error of estimate, which are the counterparts of the variance and the standard deviation. The residual variance is the average of the squared deviations of the actual y values from the predicted y values. Each deviation of a data point from its predicted point is known as a residual.

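As we said earlier about variance estimates, the denominator is the degrees of freedom, which here is N - 2. The residual variance is

$$s_{yx}^2 = \frac{\sum (y - y')^2}{N - 2}$$

and the standard error of estimate is its square root:

$$s_{yx} = \sqrt{\frac{\sum (y - y')^2}{N - 2}}$$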

Finally, when the residual variance and the standard error of estimate are small, the regression line fits the data in the scatterplot well; when they are large, the regression line does not fit well.
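The residual variance and the standard error of estimate can be sketched in Python as follows. The sketch fits the line, takes each residual (actual y minus predicted y), and divides the sum of squared residuals by N - 2. The function name and the data are assumptions made for illustration only.

```python
import math

def standard_error_of_estimate(xs, ys):
    """Return (residual variance, standard error of estimate) for a fitted line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    # Residual = actual y minus the y predicted by the line
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    ss_residual = sum(e ** 2 for e in residuals)
    residual_variance = ss_residual / (n - 2)  # N - 2 degrees of freedom
    return residual_variance, math.sqrt(residual_variance)

# Made-up data (an assumption, not the book's data set)
sat = [1000, 1100, 1200, 1300, 1400]
gpa = [2.0, 2.4, 3.1, 3.3, 3.7]

res_var, se = standard_error_of_estimate(sat, gpa)
print(f"residual variance = {res_var:.4f}, standard error = {se:.4f}")
```

For this small data set the residual variance is about 0.017 and the standard error of estimate about 0.13 grade points, which is small compared with the spread of the GPAs, so the line fits well.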

Testing Hypotheses About Regression

The regression equation shown earlier summarizes the relationship in a scatterplot of sample data. The regression coefficients a and b are sample statistics, and they are used to test hypotheses about population parameters. In this section we take you through hypothesis tests based on regression. The regression line in the population is the graph of an equation made up of parameters instead of statistics. Greek letters are typically used for parameters, which makes the regression equation

$$y = \alpha + \beta x + \varepsilon$$

Epsilon stands for error in the population. It covers the specific things that you do not know about, and it shows up in the residuals. The more you know about what you are measuring, the smaller the error is likely to be. You cannot measure the error in the relationship between SAT and GPA directly, but it is there. For example, someone may not score well on the SAT and still do well in training. In a scatterplot, that person's point looks very much like an error. We can go on to test hypotheses about alpha, beta, and epsilon.

But before we go any further, we need to test the fit. Testing the fit starts with finding how well the regression line goes along with the scatterplot, which means first testing epsilon, the error in the relationship. The question is whether the line really represents the relationship between the variables. It is possible that the apparent relationship is due to chance alone, in which case calculating the regression line is pointless. You can test that possibility with the following hypotheses:

H0: No real relationship
H1: Not H0

To test them you need variances, and to get variances you begin with deviations. Consider one point in the scatterplot: it has a deviation from the regression line and a deviation from the mean of the y variable. There is also the deviation between the regression line and the mean.

In the figure above, you can see that the distance between the point and the regression line, plus the distance between the regression line and the mean, equals the distance between the point and the mean:

$$(y - y') + (y' - \bar{y}) = (y - \bar{y})$$

This leads to some important relationships. Square each deviation and add up all the squared deviations, and you get

$$SS_{\text{Residual}} + SS_{\text{Regression}} = SS_{\text{Total}}$$

SS Residual is also the numerator of the residual variance. It represents the variability around the regression line. Now for the hypotheses. If the null hypothesis is true and there is no real relationship between x and y, then the regression line represents no real gain in prediction: the variability due to regression is no greater than the variability around the regression line. If the null hypothesis is false and there is a substantial gain in prediction, then the mean square regression should be much larger than the mean square residual. There is another question to answer in linear regression: is the slope of the regression line different from zero? When the slope is not significantly different from zero, the mean predicts just as well as the regression line does.
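To make the comparison of the two mean squares concrete, here is a hedged Python sketch. It splits the total sum of squares into a regression part and a residual part, divides each by its degrees of freedom (1 and N - 2 for a single x variable), and forms the F ratio. The function name and data are made up for illustration; looking up the probability for F is left to Excel or a table.

```python
def regression_f_ratio(xs, ys):
    """Return (SS regression, SS residual, SS total, F) for a one-variable fit."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    predicted = [a + b * x for x in xs]
    ss_regression = sum((p - mean_y) ** 2 for p in predicted)
    ss_residual = sum((y - p) ** 2 for y, p in zip(ys, predicted))
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    ms_regression = ss_regression / 1      # df = 1 for one x variable
    ms_residual = ss_residual / (n - 2)    # df = N - 2
    return ss_regression, ss_residual, ss_total, ms_regression / ms_residual

# Made-up data (an assumption, not the book's data set)
sat = [1000, 1100, 1200, 1300, 1400]
gpa = [2.0, 2.4, 3.1, 3.3, 3.7]

ss_reg, ss_res, ss_tot, f_ratio = regression_f_ratio(sat, gpa)
print(f"SS regression = {ss_reg:.3f}, SS residual = {ss_res:.3f}, F = {f_ratio:.2f}")
```

Here the two parts add up to the total (about 1.849 + 0.051 = 1.900), and the F ratio is large (about 109), which is what you expect when the null hypothesis is false.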

Worksheet Functions For Regression

By now you should understand that Excel is built for heavy work and has strong predictive capabilities. It has many functions and tools that help with regression problems. Here we focus on the worksheet functions, and we use the following data set for all of them.

SLOPE, INTERCEPT, STEYX

These functions all work in the same way, but we will be as detailed as we can here. Here are the steps to use them:

1. After you enter the data, pick a blank cell for the result.
2. Type =SLOPE, =INTERCEPT, or =STEYX into the cell, depending on which you want.
3. Enter the cell ranges as the arguments. The syntax is =SLOPE(known_ys,known_xs), =INTERCEPT(known_ys,known_xs), and =STEYX(known_ys,known_xs). In the example below, we use the GPA column for the known_ys and the SAT column for the known_xs.
4. Press Enter, and you see the answer immediately.

Here the SLOPE function gives us 69.5791.

The INTERCEPT function gives us 779.4543.

The STEYX function gives us 117.3739. Notice that all three functions take the same two arguments, known_ys and known_xs, in the same order.

FORECAST.LINEAR

FORECAST.LINEAR works a little differently from the other three. You supply not only the x and y ranges but also a single value of x, and the function returns the predicted y for that value, based on the regression relationship between the x and y variables.

We entered one of the SAT scores as the x argument, the GPA range as the known_ys, and the SAT range as the known_xs.
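A minimal Python sketch of what FORECAST.LINEAR does is shown below: fit the line to the known values, then evaluate it at the single x you supply. The function name forecast_linear only mimics the Excel argument order, and the data and the score of 1250 are made up for illustration.

```python
def forecast_linear(x, known_ys, known_xs):
    """Predict y at x from a least-squares line, in FORECAST.LINEAR argument order."""
    n = len(known_xs)
    mean_x = sum(known_xs) / n
    mean_y = sum(known_ys) / n
    b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(known_xs, known_ys))
         / sum((xi - mean_x) ** 2 for xi in known_xs))
    a = mean_y - b * mean_x
    return a + b * x

# Made-up data (an assumption, not the book's data set)
sat = [1000, 1100, 1200, 1300, 1400]
gpa = [2.0, 2.4, 3.1, 3.3, 3.7]

predicted_gpa = forecast_linear(1250, gpa, sat)
print(f"predicted GPA for an SAT of 1250: {predicted_gpa:.3f}")
```

With these numbers the prediction is about 3.115.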

Array function: TREND

TREND can do several things. It can find the predicted y values for the x values in the sample. It can also take a new set of x values and predict their y values from the linear relationship in the sample. In effect, it does the job of FORECAST.LINEAR for many values at once. Here is how it goes. First, we use TREND to predict the GPA for each student in our sample.

As you can see, cell D2 holds the predicted GPA for student number one. You can repeat the step with a new set of xs to predict a new set of ys. Next, we use TREND to predict the GPAs for some other SAT scores:

1. After you enter the data, choose a cell and enter the TREND function.
2. In the Known_ys box, enter the range with the y variable, and in the Known_xs box, enter the range with the x variable. In the New_xs box, enter the cell range with the new scores. In the Const box, enter TRUE or FALSE: TRUE to calculate the y-intercept, FALSE to set the y-intercept to zero.
3. Press Enter.
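The behavior described in the steps above can be sketched in Python for the one-variable case. This is an illustration under assumptions, not Excel's implementation: when new_xs is omitted the function predicts for the known xs, and when const is False it fits a line through the origin (slope = sum of x times y over sum of x squared). The data and the new scores are made up.

```python
def trend(known_ys, known_xs, new_xs=None, const=True):
    """Predict y values from a fitted line, loosely following TREND's arguments."""
    if new_xs is None:
        new_xs = known_xs  # predict for the sample itself
    n = len(known_xs)
    if const:
        mean_x = sum(known_xs) / n
        mean_y = sum(known_ys) / n
        b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(known_xs, known_ys))
             / sum((x - mean_x) ** 2 for x in known_xs))
        a = mean_y - b * mean_x
    else:
        # Intercept forced to zero: least-squares line through the origin
        b = (sum(x * y for x, y in zip(known_xs, known_ys))
             / sum(x * x for x in known_xs))
        a = 0.0
    return [a + b * x for x in new_xs]

# Made-up data (an assumption, not the book's data set)
sat = [1000, 1100, 1200, 1300, 1400]
gpa = [2.0, 2.4, 3.1, 3.3, 3.7]

fitted = trend(gpa, sat)                    # predictions for the sample
new_predictions = trend(gpa, sat, [1050, 1350])
through_origin = trend(gpa, sat, [0], const=False)
print([round(p, 3) for p in new_predictions])
```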

Array function: LINEST

LINEST is the statistical function for finding a linear trend. It combines SLOPE, INTERCEPT, and STEYX and returns some additional statistics as well. One thing you will notice is that it returns a five-row by two-column array; for linear regression with one x variable, that is the size of the array you select. To use LINEST, go to the ribbon and open the Formulas tab.

Then go to the symbol on the left and choose Insert Function. All of this is done after you enter your data. Then, in the Insert Function dialog box, do the following.

Select LINEST.

Enter the arguments in the Function Arguments dialog box. We used the GPAs as the known_ys and the SATs as the known_xs, and we set both const and stats to TRUE. Then select OK so that you can see the answers.

Data Analysis Tool: Regression

Now it is time to bring out the big guns: Excel's Regression data analysis tool. It does pretty much everything, and it will even label the output if you want it to. Here is how to get it to work:

1. Enter the data in the worksheet and add some labels.
2. Go to the ribbon and choose Data. In the Analysis group, select Data Analysis so that the Data Analysis dialog box pops up.
3. Select Regression from the list and select OK.
4. In the Input Y Range box, enter the cell range with the data for the y variable. We have put the data, without the labels, in $C$2:$C$11.
5. In the Input X Range box, enter the range for the x variable. We used the absolute reference $B$2:$B$11.
6. If your ranges include headings, check the Labels box.
7. We went with the default confidence level, so there is no reason to change it. If you want to, select the check box and change the value.
8. In the Output Options area, select one of the options.
9. The Residuals area gives you four ways to see the deviations between the data points and the predicted points.
10. If you also want a graph, check the Normal Probability Plots box.
11. Select OK.

Working with tabled output

This is the output from the regression analysis. The first three rows of the Regression Statistics table give information related to R2; the fourth row gives the standard error of the estimate, and the fifth gives the number of observations in the sample. Under the ANOVA table is the table with the regression coefficients; its remaining columns give the results of the t-tests, the P-values, and so on.

Opting for graphical output

In the image above, you will notice three graphical outputs: the normal probability plot, the X Variable 1 residual plot, and the X Variable 1 line fit plot. Here are their images.

Juggling Many Relationships At Once: Multiple Regression

With linear regression, you can make predictions about a lot of things. You need to know two things: the slope and the intercept of the line relating the two variables. With those, you can use a new x value to predict a new y value, for example using an SAT score to predict a GPA. If you have other variables, you can use them as well. Working with multiple independent variables is known as multiple regression. Just as with linear regression, you need to find the regression coefficients that best fit the scatterplot, where best-fitting again means that the sum of the squared distances from the data points to the fitted values is a minimum. With two independent variables, you cannot show the scatterplot in two dimensions; you need three, and the line becomes a plane. Here is the equation:

$$y' = a + b_1 x_1 + b_2 x_2$$

Here are a few things to keep in mind about multiple regression:

1. You can have more than two independent variables.
2. The coefficient for an x variable can change when you move from linear regression to multiple regression. The intercept can change too.
3. The standard error of the estimate can also decrease from linear regression to multiple regression.
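For two independent variables, the least-squares coefficients can be found by solving two equations in the deviations from the means. The Python sketch below does that directly. The function name and the data are assumptions: the y values were deliberately constructed as 1 + 2*x1 + 3*x2 with no error, so that you can check the result by eye. Real data would not fit this perfectly.

```python
def fit_two_predictors(x1, x2, y):
    """Return (a, b1, b2) for the least-squares plane y' = a + b1*x1 + b2*x2."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    d1 = [v - m1 for v in x1]
    d2 = [v - m2 for v in x2]
    dy = [v - my for v in y]
    # Sums of squares and cross-products of the deviations
    s11 = sum(u * u for u in d1)
    s22 = sum(u * u for u in d2)
    s12 = sum(u * v for u, v in zip(d1, d2))
    s1y = sum(u * v for u, v in zip(d1, dy))
    s2y = sum(u * v for u, v in zip(d2, dy))
    # Solve the two normal equations (fails if x1 and x2 are perfectly correlated)
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    a = my - b1 * m1 - b2 * m2
    return a, b1, b2

# Contrived data: y = 1 + 2*x1 + 3*x2 exactly (an assumption for easy checking)
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 6]
y = [9, 8, 19, 18, 29]

a, b1, b2 = fit_two_predictors(x1, x2, y)
print(f"a = {a:.2f}, b1 = {b1:.2f}, b2 = {b2:.2f}")
```

The sketch recovers an intercept of 1 and coefficients of 2 and 3, the values used to build the data.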

Excel tools for multiple regression

Multiple regression is linear regression with some extra variables, and the same tools that work for linear regression work for multiple regression too.

TREND revisited

With TREND, not much has changed. The only difference is the additional x variable: the known_xs range now covers the columns for all the x variables. You can follow the steps given earlier, and you get predictions based on whichever x and y variables you supply. The syntax is the same: a y range, an x range, and a constant.

LINEST revisited

LINEST follows the same steps mentioned earlier in this chapter. Enter the y variable and the x variables in their fields, and set the constant to TRUE or FALSE.

The Regression data analysis tool revisited

The Regression data analysis tool follows the same steps as before: just enter the appropriate ranges for the x variables and the y variable.

Regression analysis on the iPad

Regression analysis also works on the iPad, using the StatPlus add-in. Here is how to go about it:

1. Enter the data in the array.
2. Go to Insert, select Add-ins, and choose StatPlus.
3. Go to the StatPlus tab and select New Analysis to open the panel.
4. Choose an analysis tool.
5. Enter the appropriate values in the boxes on the StatPlus panel.
6. When you are done, select Run.

Voila! It is just that easy!

Conclusion

Linear and multiple regression could not be much easier, and we have now covered them in detail.

Chapter 14
Correlation: The Rise And The Fall Of Relationships

In the previous chapter we talked about regression. Now we turn to correlation. A correlation is a statistical measure of the relationship between two variables. The correlation coefficient is a single number that expresses how strongly, and in which direction, the two variables vary together.

Scatterplots again

The best way to show the relationship between two variables graphically is the scatterplot. Here is what our scatterplot looks like.

This image is from our earlier example of the students' GPAs and SAT scores. The GPA is on the vertical axis and the SAT score is on the horizontal axis.

Understanding correlation

So what is correlation? As we said, it is used to describe the relationship between two variables, and like regression it is a statistical measure. When we say that two things correlate, we mean that they vary together. There are two kinds of correlation worth remembering: positive and negative. A positive correlation means that high scores on one variable go with high scores on the other, and low scores on one go with low scores on the other. A negative correlation says the opposite: high scores on one variable go with low scores on the other. A typical example is a weight-loss program. The more time you spend in the program, the lower your weight, and the less time you spend in it, the higher your weight is likely to be. We are going to work with the SAT scores as before. Here is the formula for the correlation coefficient:

$$r = \frac{\frac{1}{N-1}\sum (x - \bar{x})(y - \bar{y})}{s_x s_y}$$

r is the correlation coefficient. It tells you how the two variables vary together. The numerator is the covariance, which, as we said earlier, shows how the variables vary together, and the denominator is the product of the standard deviations of x and y. So you divide the covariance by the product of the two standard deviations. The correlation coefficient has a lower and an upper limit: the upper limit is +1.00 and the lower limit is -1.00. A perfect positive correlation has a coefficient of +1.00 and a perfect negative correlation has a coefficient of -1.00. A correlation of 0.00 means that the variables have no linear relationship.
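As a rough illustration, the correlation coefficient can be computed in Python by dividing the covariance by the product of the two standard deviations. The function name and both data sets are made up: the first mirrors the SAT and GPA idea, and the second is a contrived weight-loss example in which weight drops by exactly the same amount every week, so the correlation reaches the lower limit of -1.

```python
import math

def correlation(xs, ys):
    """Pearson r: covariance divided by the product of the standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return covariance / (sd_x * sd_y)

# Made-up data (assumptions, not the book's data set)
sat = [1000, 1100, 1200, 1300, 1400]
gpa = [2.0, 2.4, 3.1, 3.3, 3.7]
weeks_in_program = [1, 2, 3, 4]
weight_kg = [90, 88, 86, 84]

r_positive = correlation(sat, gpa)
r_negative = correlation(weeks_in_program, weight_kg)
print(f"SAT vs GPA: r = {r_positive:.3f}; weeks vs weight: r = {r_negative:.3f}")
```

The first correlation is strongly positive (about 0.986) and the second is -1, apart from floating-point rounding.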

Correlation and regression

Here is an image of a scatterplot with its best-fitting line.

A best-fitting line has to meet a specific standard: the sum of the squared distances between the points and the line is as small as it can be. Such a line is called a regression line, and its main job is to let you make predictions. Without a regression line, your best prediction for any y is the mean of the y variable. The regression line bases its predictions on the relationship between x and y, and each point on the line is the predicted value of y for a given x. Here is how regression corresponds with correlation.

In the image above, there are a few things to consider. We labeled three distances. The distance labeled (y - y') shows the difference between a point and where the regression line predicts the point should be, and the distance labeled (y - ȳ) shows the difference between the point and the mean. The distance labeled (y' - ȳ) shows the gain in prediction capability from using the regression line instead of the mean.

When you work with the regression line, you can square all the deviations of the points from the mean and add them up, and do the same for the residuals and for the deviations of the predicted values from the mean. These sums are related like this:

$$SS_{\text{Residual}} + SS_{\text{Regression}} = SS_{\text{Total}}$$

When SS Regression is large compared with SS Residual, the relationship between x and y is strong, and the variability around the regression line in the scatterplot is small. When SS Regression is small compared with SS Residual, the relationship is weak, and the variability around the regression line is large. You can test SS Regression against SS Residual by dividing each by its degrees of freedom, which gives two variance estimates (mean squares), and then dividing the mean square regression by the mean square residual to get F. When the mean square regression is significantly larger than the mean square residual, you have evidence of a relationship between x and y. You can also evaluate SS Regression by comparing it with SS Total: divide SS Regression by SS Total, and a large ratio means a strong relationship. This ratio is known as the coefficient of determination. Take its square root, give it the sign of the slope, and you have the correlation coefficient. So the coefficient of determination gives you both the correlation coefficient and a way to understand what its value represents.
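The link between the sums of squares and the correlation coefficient can be sketched in Python. The sketch computes the coefficient of determination as SS Regression divided by SS Total, then takes the square root and attaches the sign of the slope. The function name and data are made up for illustration.

```python
import math

def coefficient_of_determination(xs, ys):
    """Return (r squared, slope) where r squared = SS regression / SS total."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    ss_regression = sum((a + b * x - mean_y) ** 2 for x in xs)
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    return ss_regression / ss_total, b

# Made-up data (an assumption, not the book's data set)
sat = [1000, 1100, 1200, 1300, 1400]
gpa = [2.0, 2.4, 3.1, 3.3, 3.7]

r_squared, slope = coefficient_of_determination(sat, gpa)
# The square root loses the sign, so take it from the slope
r = math.copysign(math.sqrt(r_squared), slope)
print(f"r squared = {r_squared:.4f}, r = {r:.4f}")
```

Here about 97 percent of the variability in the made-up GPAs is accounted for by the regression, and the square root gives the same r (about 0.986) as the covariance formula.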

Testing hypotheses about the correlation

Now we test hypotheses about the correlation, using sample statistics to make inferences about population parameters. The correlation coefficient r tells us how strong the linear relationship between x and y is in the sample, but we need to look at both the value of r and the sample size. The hypothesis test tells us whether the sample data are strong enough to conclude that there is a relationship in the population. It decides whether the population correlation coefficient is close to zero or significantly different from it, based on the sample correlation and the sample size.

Is a correlation coefficient greater than zero?

If the test shows that the correlation coefficient is significantly different from zero, we say the correlation coefficient is significant. That means we have evidence of a linear relationship between x and y in the population, and we can use the regression line from our sample to model it. If, on the other hand, the coefficient is not significantly different from zero, we say it is nonsignificant. The sample r may still differ from zero, but there is not enough evidence of a relationship between x and y in the population, so the regression line should not be used to model it.

Do two correlation coefficients differ?

Suppose we have a sample of foreign students, and the correlation between their IELTS scores and their GPAs is .752. Is this different from the correlation in another group of students? If there is no reason to assume that one correlation should be higher than the other, the hypotheses are two-sided. If we compare the correlation coefficients for the University of Oxford and the University of Leicester, with alpha = .05, the hypotheses are

H0: ρLeicester = ρOxford
H1: ρLeicester ≠ ρOxford

The symbol ρ represents the population correlation coefficient. A t-test does not work for these hypotheses, for reasons that are highly technical.

What you have to do is transform each correlation coefficient into something else and use the transformed values in a z-test formula. Here is how r is transformed to z:

$$z_r = \frac{1}{2}\left[\log_e(1 + r) - \log_e(1 - r)\right]$$

We show you how to find log_e in a later chapter, so sit tight.

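After you transform each r to z, the formula becomes

$$Z = \frac{z_1 - z_2}{\sqrt{\dfrac{1}{N_1 - 3} + \dfrac{1}{N_2 - 3}}}$$

where z_1 and z_2 are the transformed correlations and N_1 and N_2 are the two sample sizes. Then compare the calculated value with the standard normal distribution. For a two-tailed test with alpha = .05, the critical values are -1.96 and +1.96.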
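Here is a hedged Python sketch of the whole comparison: transform each r with the log formula, divide the difference by the square root of 1/(N1 - 3) + 1/(N2 - 3), and compare the result with 1.96. The correlation of .752 comes from the text; the second correlation (.630) and both sample sizes (24) are made up, because the text does not give them.

```python
import math

def fisher_z(r):
    """r-to-z transformation: half the difference of two natural logs."""
    return 0.5 * (math.log(1 + r) - math.log(1 - r))

def compare_correlations(r1, n1, r2, n2):
    """Z statistic for the difference between two independent correlations."""
    z1, z2 = fisher_z(r1), fisher_z(r2)
    standard_error = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    return (z1 - z2) / standard_error

# .752 is quoted in the text; .630 and the sample sizes are assumptions
z = compare_correlations(0.752, 24, 0.630, 24)
reject_h0 = abs(z) > 1.96  # two-tailed critical value for alpha = .05
print(f"Z = {z:.3f}, reject H0: {reject_h0}")
```

With these made-up inputs Z is well inside the critical values, so the null hypothesis of equal correlations is not rejected.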

Worksheet functions for correlation

There are two worksheet functions for correlation, CORREL and PEARSON, and they work the same way. There are also related functions: RSQ, COVARIANCE.P, and COVARIANCE.S. We take you through all of them, including how RSQ finds the coefficient of determination.

CORREL and PEARSON

Here is how to use the CORREL and PEARSON functions:

1. Enter the data in the array and choose a cell where you want to see the result of CORREL.
2. Go to the Statistical Functions menu and choose CORREL.
3. Enter the appropriate references in the appropriate boxes.
4. The dialog box shows you the answer for the ranges you entered.
5. Select OK, and the answer appears in the cell.
6. To use PEARSON instead, go to the Statistical Functions menu and choose PEARSON.
7. Then follow steps 3 to 5.

RSQ

The RSQ function calculates the coefficient of determination, which is the square of the correlation coefficient. You can get the same result by squaring the result of CORREL, so RSQ is not strictly necessary, but its syntax is =RSQ(known_ys,known_xs).

Follow the same steps that we used for the correlation to get the RSQ.

COVARIANCE.P and COVARIANCE.S

The difference between COVARIANCE.P and COVARIANCE.S is that one works with a population while the other works with a sample: the .P stands for population and the .S for sample. To use these functions to find r, enter =COVARIANCE.P(array1,array2) and divide the result by the product of STDEV.P(array1) and STDEV.P(array2).

Calculating with COVARIANCE.S works the same way, with STDEV.S in place of STDEV.P.
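The two routes to r can be checked against each other with a short Python sketch. It computes the covariance with either N (population) or N - 1 (sample) in the denominator and divides by the matching standard deviations from the statistics module. The helper name covariance and the data are made up for illustration; the point is that both routes give the same r as long as you do not mix the population and sample versions.

```python
from statistics import pstdev, stdev

def covariance(xs, ys, population=True):
    """Covariance with N (population) or N - 1 (sample) in the denominator."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (n if population else n - 1)

# Made-up data (an assumption, not the book's data set)
sat = [1000, 1100, 1200, 1300, 1400]
gpa = [2.0, 2.4, 3.1, 3.3, 3.7]

# Counterpart of COVARIANCE.P / (STDEV.P * STDEV.P)
cov_p = covariance(sat, gpa, population=True)
r_from_p = cov_p / (pstdev(sat) * pstdev(gpa))

# Counterpart of COVARIANCE.S / (STDEV.S * STDEV.S)
cov_s = covariance(sat, gpa, population=False)
r_from_s = cov_s / (stdev(sat) * stdev(gpa))

print(f"r from .P functions = {r_from_p:.4f}, r from .S functions = {r_from_s:.4f}")
```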

Data analysis tool: Correlation

The Correlation data analysis tool is just as simple. Its usefulness becomes clear when you work with multiple correlation, which we focus on later. Here is how to work with it:

1. Enter the data.
2. Go to Data on the ribbon and open the Data Analysis dialog box.
3. Scroll down to find Correlation. The tools are arranged in alphabetical order.
4. Select OK so that the Correlation dialog box opens.
5. Enter the appropriate range.
6. Select one of the output options.
7. Choose OK.

Tabled output

Each cell gives the correlation between the variable in its row and the variable in its column. Cell B3 gives the correlation of SAT with GPA.

Multiple correlations

The multiple correlation gives you the maximum degree of linear relationship that can be obtained between two or more independent variables and one dependent variable. The independent variables are weighted so that their combination has the largest possible correlation with the dependent variable. Because the weights are estimated from a sample, sampling error affects them, and the multiple correlation should be shrunken to adjust for it. The amount of shrinkage depends on how many independent variables there are and on the size of the sample: with a small number of independent variables and a large sample, hardly any shrinkage is needed. Furthermore, the correlations of the independent variables used in calculating R can also be corrected, and that in turn increases R.
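To give a feel for the numbers, here is a Python sketch for the case of two independent variables. It uses the standard formula for R squared from the three pairwise correlations, and the common adjusted R squared formula as the shrinkage step. That the text has this particular shrinkage formula in mind is an assumption, and the correlations and sample sizes are made up.

```python
import math

def multiple_r_squared(r_y1, r_y2, r_12):
    """R squared for two predictors, from the three pairwise correlations."""
    return (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)

def adjusted_r_squared(r_squared, n, k):
    """Shrunken R squared for n cases and k independent variables."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

# Made-up correlations: y with x1, y with x2, and x1 with x2
r2 = multiple_r_squared(0.6, 0.5, 0.3)
multiple_r = math.sqrt(r2)

# The shrinkage is noticeable in a small sample and tiny in a large one
r2_adj_small = adjusted_r_squared(r2, n=30, k=2)
r2_adj_large = adjusted_r_squared(r2, n=3000, k=2)
print(f"R = {multiple_r:.3f}, R squared = {r2:.3f}, "
      f"adjusted (n=30) = {r2_adj_small:.3f}, adjusted (n=3000) = {r2_adj_large:.3f}")
```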

Partial correlation

The partial correlation gives you the degree of association between two random variables after removing the effect of another random variable. If you want to know how strongly two variables are related and you simply use the ordinary correlation coefficient, you will get a misleading result whenever a third, influential variable is related to both of them. To avoid this error, you need to control for the confounding variable, and you do this by computing the partial correlation coefficient. Suppose, for example, that you have data on the production, income, and wealth of a specific group of people and you want to see the relationship between production and income. If you do not control for wealth when you compute the correlation coefficient, the result will not describe the data properly. It is called a partial correlation because it isolates the part of the relationship that remains once the third variable is held constant. The partial correlation coefficient takes the same values as the ordinary correlation coefficient, ranging from -1 to 1. A value of -1 is a perfect negative correlation, 1 is a perfect positive correlation, and 0 means there is no linear relationship.

Data analysis tool covariance

The data analysis tool is just as simple. When you work with multiple correlations, which we focus on later, you will understand its usefulness. Here is how to work with it:

1. Enter the data.
2. Go to Data on the ribbon and open the Data Analysis dialog box.
3. Scroll down to find Correlation. The tools are typically arranged in alphabetical order.
4. Select OK so that the Correlation dialog box opens.
5. Enter the appropriate range.
6. Select one of the output options.
7. Choose OK.
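Once the tool has produced the pairwise correlations, a partial correlation can be worked out from them by hand. The sketch below uses the standard first-order partial correlation formula. The three correlation values are hypothetical and only stand in for production, income, and wealth.

```python
import math


def partial_correlation(r_xy, r_xz, r_yz):
    """Correlation between x and y with the effect of z removed."""
    denominator = math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    return (r_xy - r_xz * r_yz) / denominator


# Hypothetical pairwise correlations: x = production, y = income, z = wealth
r_partial = partial_correlation(r_xy=0.60, r_xz=0.50, r_yz=0.70)
print(round(r_partial, 3))
```

In this made-up case the partial correlation is clearly smaller than the raw .60, because part of the raw relationship was carried by the third variable.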

Using Excel To Test Hypotheses About Correlations

There is no worksheet function that tests a hypothesis about a correlation directly. Instead, you calculate a t statistic from the correlation and pass the result to TDIST to get a one-tailed probability. When that probability is below .05, you reject the null hypothesis.
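The t statistic for a correlation is usually computed as r times the square root of N - 2, divided by the square root of 1 - r squared, with N - 2 degrees of freedom. The sketch below assumes that standard formula. The values of r and N are hypothetical.

```python
import math


def t_for_correlation(r, n):
    """t statistic for testing that the population correlation is 0 (n - 2 df)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


# Hypothetical sample: r = .50 from 27 pairs of scores
t_value = t_for_correlation(r=0.50, n=27)
degrees_of_freedom = 27 - 2
print(round(t_value, 3), degrees_of_freedom)
```

The resulting t value and degrees of freedom are what you would then hand to TDIST in the worksheet.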

Worksheet functions: FISHER, FISHERINV

You can use the FISHER and FISHERINV functions to evaluate a hypothesis about the difference between two correlations. FISHER converts r to z, while FISHERINV does the opposite and converts z back to r. Here is how to use Excel to do this:

1. After entering your data, choose a cell to hold the result of FISHER.
2. Go to the statistical functions and look for FISHER.
3. When the dialog box for FISHER opens, enter the appropriate value.
4. Select OK.
5. Finally, use NORM.S.INV to find the critical value for rejecting the null hypothesis with a two-tailed alpha of .05.
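The same comparison can be sketched outside the worksheet. The example below assumes the usual test for two independent correlations, in which the difference between the two Fisher z values is divided by the square root of 1/(N1 - 3) + 1/(N2 - 3). The correlations and sample sizes are hypothetical.

```python
import math
from statistics import NormalDist


def fisher_z(r):
    """Fisher r-to-z transformation (what FISHER does)."""
    return math.atanh(r)


def fisher_inv(z):
    """Inverse transformation, z back to r (what FISHERINV does)."""
    return math.tanh(z)


def z_for_difference(r1, n1, r2, n2):
    """Test statistic for the difference between two independent correlations."""
    standard_error = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    return (fisher_z(r1) - fisher_z(r2)) / standard_error


# Hypothetical samples: r = .80 and r = .50, each from 50 pairs
z_stat = z_for_difference(r1=0.80, n1=50, r2=0.50, n2=50)
critical = NormalDist().inv_cdf(1 - 0.05 / 2)  # two-tailed alpha of .05
reject_null = abs(z_stat) > critical
print(round(z_stat, 3), round(critical, 3), reject_null)
```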

Correlation analysis on the iPad

On the iPad you can use StatPlus to work with correlations. Here is how to go about it:

1. Enter the data inside the array.
2. Select Insert and go to Add-ins.
3. Open StatPlus.
4. In the StatPlus pane, select new analysis.
5. Pick an analysis tool. In this instance, pick linear correlations.
6. Enter the necessary values inside the panel.
7. Select Run.

Conclusion

There is so much that goes on with regression and correlation. So here is a start!

It is about time

When you are working on a project or research, sooner or later you are going to be dealing with time. When you track how a specific data set changes over a period of time, the resulting sequence of values is known as a time series, and you might have to study it to understand your project or research better. You can even use a time series to make predictions and forecasts. We are going to take you through the ways to use Excel to accomplish these steps.

A series and its components

Suppose you are the CEO of a small startup that wants to succeed. You must take your data and observe the time series that relate to your sales or income. The numbers in a time series can show a lot of things, such as a trend up or a trend down. If you look at a graph of the income or sales of a successful company over time, you will notice that growth is very rarely exponential. Most of the time there is a pattern to it. Take the example of a company producing umbrellas in a tropical place that has a rainy season and a dry season. In the dry season, umbrella sales will be down because there is very little demand for them, and in the rainy season sales will be up because people need shelter from the rain. Typically, the umbrella company is not going to sell massively at once. Sales may be low at the beginning and rise as the company gains experience. If sales generally move up because the company is expanding or becoming more trusted, that movement is called the trend component of the time series. That is an example of a linear trend. However, a trend is not the only thing a time series can show. When the series rises and falls in a recurring pattern of highs and lows, that pattern is known as the cyclic component of the time series. There are also sporadic, nonrecurring influences that affect the time series, and these are called the irregular component.

A moving experience

A single mean that takes all of the peaks and all of the dips into account does not give you the full perspective of what is going on in the trend. The best way to smooth out the bumps is to work with a moving average. This is the average of the most recent scores in the time series. It is called a moving average because it is recalculated as you move through the data: you create a series of averages from successive subsets of the whole data set. Moving averages are used in finance as a stock indicator for technical analysis. They matter for stocks because they smooth prices out by constantly giving you an updated average price, so temporary, random price fluctuations at specific periods are less of a worry.

Looking at the data set above that shows the sales our umbrella company is making, you can use a moving average of the most recent sales figures. In our example, you average quarters 1 to 4, then quarters 2 to 5, then quarters 3 to 6, and so on through the remaining quarters of the series. This gives you a forecast: each prediction is based on the average of the most recent sales figures.

There are two ways to find the moving average in Excel. The first is the trendline and the other is the data analysis tool. Working with the trendline, here is what you need to do:

1. Place your data inside the sheet.
2. Add a line chart by going to Insert and choosing a line chart with markers.
3. Hover above the chart to reveal the plus sign.
4. Select the plus sign to reveal the chart elements tool.
5. Select Trendline and open its additional options.
6. In the Format Trendline panel, choose Moving Average and change the periods.

Lining up to the trend

To add a trendline, follow the same steps as in the previous section: place your data in the sheet, insert a line chart with markers, select the plus sign on the chart to open the chart elements tool, and choose Trendline. In the Format Trendline panel, choose Moving Average and change the periods.

Data analysis tool moving average

You can also use the data analysis tool to find the moving average. It both charts the moving average and gives you the numerical values. However, unlike the trendline, it does not let you explore different periods while you are working with it. Here is how to use the data analysis tool for moving averages:

1. Enter the data inside the spreadsheet.
2. Go to Data on the ribbon, open Data Analysis from the menu, and select Moving Average from the dialog box that pops up.
3. Enter the range inside the box. The input range can be a single row or a single column, not both.
4. Select OK and you are going to see the following.

Do not mind the error symbols; they do not mean much. The numbers in column K are the moving averages, and the numbers in column L are the standard errors. Each standard error is the square root of the average of the squared differences between sales and the forecast for the last five quarters. In the graph, the series labeled Forecast shows you the moving average. At times the forecast matches up with the data, and at other times it does not.
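The moving-average calculation itself is short enough to sketch by hand. The example below is only an illustration: the sales figures are invented, and positions without enough earlier data are left empty, much as the tool leaves its first rows without a value.

```python
def moving_average(values, periods):
    """Return a list aligned with values; None where too few points exist."""
    result = []
    for i in range(len(values)):
        if i + 1 < periods:
            result.append(None)  # not enough data yet for a full window
        else:
            window = values[i + 1 - periods : i + 1]
            result.append(sum(window) / periods)
    return result


# Hypothetical quarterly sales figures
sales = [10, 12, 14, 13, 15, 18]
averages = moving_average(sales, periods=3)
print(averages)
```

Changing the periods argument plays the same role as changing the period in the trendline panel.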

How to be a smoothie, exponentially

Exponential smoothing does something similar to what a moving average does. It is another technique for making forecasts. However, whereas a moving average works only with a sequence of actual values, exponential smoothing also works with the previous prediction. So not only the observed data is taken into account here but also the predicted data. It works on the basis of something called a damping factor. The damping factor works like a car's shock absorber, which smooths the ride out when it is bumpy. In the same vein, the damping factor reduces the bumps in the data so that you can observe large patterns easily. The damping factor is a number between 0 and 1, and it is represented by alpha. Here is the formula:

y'(t) = (1 - alpha) x y(t-1) + alpha x y'(t-1)

In our example, y'(t) represents the predicted sales at a specific time. Here t is the current quarter and t-1 is the quarter that precedes it, so y(t-1) is the preceding quarter's actual sales and y'(t-1) is the preceding quarter's predicted sales. In the sequence of predictions, the first predicted value is simply the observed value from the quarter that immediately precedes it. With a larger damping factor, the previous predictions carry more weight than they do with a small damping factor, and when it sits in the middle at 0.5, the previous prediction and the previous observation balance each other out. Using the example above, we have applied exponential smoothing.
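The recursion can be sketched in a few lines. This assumes the form in which each forecast equals (1 - damping factor) times the previous actual value plus the damping factor times the previous forecast, with the first forecast set to the first observed value as described above. The sales figures are hypothetical.

```python
def exponential_smoothing(actual, damping):
    """Forecasts aligned with actual values.

    The first position has no forecast, and the second forecast is simply
    the first observed value. After that:
    forecast[t] = (1 - damping) * actual[t-1] + damping * forecast[t-1]
    """
    if not 0 <= damping <= 1:
        raise ValueError("damping factor must be between 0 and 1")
    forecasts = [None]
    for t in range(1, len(actual)):
        if t == 1:
            forecasts.append(actual[0])
        else:
            forecasts.append(
                (1 - damping) * actual[t - 1] + damping * forecasts[t - 1]
            )
    return forecasts


# Hypothetical quarterly sales, damping factor in the middle at 0.5
sales = [100, 120, 110, 130]
smoothed = exponential_smoothing(sales, damping=0.5)
print(smoothed)
```

Raising the damping factor toward 1 makes each forecast lean more on the previous forecast and less on the latest observation.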

One-click forecasting

You can have Excel take a time series and give you a set of estimated forecasts along with confidence intervals for the forecasts. A few Excel functions handle forecasting:

FORECAST.ETS gives you a forecast based on triple exponential smoothing.
FORECAST.ETS.STAT gives you the values of the statistics that are related to the ETS forecast.
FORECAST.ETS.CONFINT gives you the confidence interval for a forecast value.
FORECAST.ETS.SEASONALITY gives you the length of the seasonal pattern inside the data.

To use one-click forecasting, here is what to do:

1. Enter the data and put the dates in a column.
2. Choose the data.
3. Go to the Data tab and find the Forecast group.
4. Choose Forecast Sheet.
5. Make any changes you need in the dialog box that you see.
6. Choose Create on the lower right.

Working with time series on iPad

You can use the trendline to show moving averages on your iPad. Here is how to do it:

1. Select the data, then tap Insert and choose Recommended Charts. It is going to offer you a line chart.
2. Choose the line chart and it is created for you.
3. Select the Chart tab, choose Elements, go to Trendline, and choose a moving average.

You can also use StatPlus to work with the time series:

1. Enter the data inside the array.
2. Go to Insert, locate Add-ins, and find StatPlus. This pops up a pane for StatPlus.
3. Select new analysis so that it gives you a dialog box.
4. Select the analysis tool: choose time series and then moving average.
5. Select Run.

With exponential smoothing, it is pretty much the same. Here is how that goes:

1. Enter the data inside the array.
2. Go to Insert, locate Add-ins, and find StatPlus. This pops up a pane for StatPlus.
3. Select new analysis so that it gives you a command box.
4. Select the analysis tool: choose time series and then exponential smoothing.
5. Select Run.

Conclusion

We have just worked through exponential smoothing and moving averages as smoothly as we know how. Now it is your turn to work things out in practice.

Chapter 15
Non-Parametric statistics

So far we have been talking about estimating parameters and making use of the central limit theorem to know the sampling distribution before beginning a hypothesis test. Trust us when we say there are times when the only data you will be evaluating are nominal or ordinal, and in these times it is impossible to know the distribution of the variables. This can be a problem if you do not know what you are doing, but this is where nonparametric statistics come in. Nonparametric statistics are statistical methods that do not assume the data come from a prescribed model determined by a small number of parameters. Parametric methods, by contrast, rely on such models; examples include the normal distribution and the linear regression model. Nonparametric statistics will at times make use of data that are not numerical but follow some order or ranking, such as a survey of consumer likes and dislikes. The structure of a nonparametric model is not specified a priori; it is determined by the data itself. You might conclude that nonparametric data lack parameters, but you would be wrong to assume that. What it means is that the parameters are flexible and are not fixed before you work with the data. One example of a nonparametric estimate of a probability distribution is the histogram. There is no dedicated tool for nonparametric tests in Excel, so what you have to do is translate the test formulas into Excel worksheet functions and then carry out the hypothesis test.

Independent samples

We are going to take you through nonparametric tests that are analogous to the independent-groups t-test and to the one-factor analysis of variance.

Two samples: Mann-Whitney U test

The Mann-Whitney U test is one of the best-known nonparametric tests. It is used to evaluate whether two independent samples come from the same population when the data are not interval or ratio data, so that you cannot make use of the t-test. For example, if the data you have are ordinal, you have to work with the ranks of the data instead of the data points themselves, and that is pretty much what the test does.

To illustrate this, we are going to work on a study where we randomly choose people, expose them to either a stressful or a relaxing situation while they do a task, and rate their level of stress on a scale of 1 to 100, giving 20 scores in each group. Our null hypothesis is that the two groups are from the same population, and our alternative hypothesis is that stress levels are higher in the stressful situation, so we are going to work with a one-tailed test.

The first thing we do is rank all of the scores together. If the null hypothesis is true, then the high ranks and the low ranks should be distributed about equally between the two groups. The formula for the Mann-Whitney U test can be applied to either the ranks for group A or the ranks for group B. If we choose A, the formula is

U = NA*NB + NA*(NA+1)/2 - RA

and if we choose B, the roles are reversed:

U = NA*NB + NB*(NB+1)/2 - RB

RA is the sum of the ranks in the stressful situation, while RB is the sum of the ranks in the relaxing situation. NA is the number of scores in the stressful situation and NB is the number of scores in the relaxing situation. When the samples are large, the sampling distribution of U is approximately a normal distribution.

At this point you need to translate the formulas into Excel. Go into an empty cell and enter the formula =RANK.AVG(A2,$A$2:$C$21,1) to rank the first score; in columns E and F, I auto-filled the formula that I placed in E2. Since we have 20 scores in column A and 20 scores in column C, we calculate U with =20*20+((20*(20+1)/2)-G22), select Enter, and get 357 as the value of U. To get the mean of the sampling distribution, type =(20*20)/2, so the mean of our distribution is 200. The next thing to do is to find the standard deviation, which we get with =SQRT((20*20*(20+20+1))/12). With all of these in place, we get the probability with =NORM.DIST(J4,J5,J6,TRUE), with U in J4, the mean in J5, and the standard deviation in J6. The result is 0.99, which tells us that the cumulative probability of U is more than 0.95. The probability of a more extreme value is therefore less than .05, meaning that we can reject the null hypothesis.
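The worksheet steps above can be sketched as one function: rank all scores together (averaging ties, as RANK.AVG does), sum the ranks of group A, and apply the U, mean, and standard deviation formulas. The data below are hypothetical and far smaller than the 20-per-group example, so the normal approximation is shown only to illustrate the mechanics.

```python
import math
from statistics import NormalDist


def average_ranks(values):
    """Rank values in ascending order, giving tied values their average rank."""
    ordered = sorted(values)
    rank_of = {}
    for v in set(ordered):
        first = ordered.index(v) + 1  # rank of the first occurrence
        count = ordered.count(v)      # how many values are tied
        rank_of[v] = first + (count - 1) / 2
    return [rank_of[v] for v in values]


def mann_whitney_u(group_a, group_b):
    """Return U (based on group A's ranks), its mean, SD, and cumulative probability."""
    n_a, n_b = len(group_a), len(group_b)
    ranks = average_ranks(list(group_a) + list(group_b))
    r_a = sum(ranks[:n_a])
    u = n_a * n_b + n_a * (n_a + 1) / 2 - r_a
    mean_u = n_a * n_b / 2
    sd_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    cumulative = NormalDist(mean_u, sd_u).cdf(u)  # like NORM.DIST(..., TRUE)
    return u, mean_u, sd_u, cumulative


# Hypothetical scores for two small groups
u, mean_u, sd_u, cumulative = mann_whitney_u([1, 2, 3, 4], [5, 6, 7, 8])
print(u, mean_u, round(sd_u, 3), round(cumulative, 3))
```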

More than two samples: Kruskal-Wallis one-way ANOVA

If we are working with more than two samples, what we do is different. Take the example of four categories of movies, where viewers rate their satisfaction on a scale from 1 to 100, and we create a worksheet that illustrates what to do. The null hypothesis is that the samples all come from the same population, and the alternative hypothesis is that they do not. The appropriate nonparametric test is the Kruskal-Wallis one-way analysis of variance. Just as we did earlier, we first rank all 20 scores in ascending order. If the null hypothesis is true, the ranks should be distributed about equally across the groups. Here is the formula for the Kruskal-Wallis one-way analysis of variance:

H = [12 / (N(N + 1))] x Σ(R²/n) - 3(N + 1)

Here N is the total number of scores, R is the sum of the ranks in a group, and n is the number of scores in that group. We have 20 scores in total, in four groups of 5. We ranked the scores and summed the ranks of each film category in A14, B14, C14, and D14. To calculate H, type =(12/(20*(20+1)))*(SUMSQ(A14:D14)/5)-3*(20+1). For the hypothesis test we use =CHISQ.DIST.RT(H5,3), where 3 is the degrees of freedom (the number of groups minus 1). The result is 0.43, and since that is more than .05, we cannot reject the null hypothesis.
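As a cross-check on the worksheet formula, H can be sketched directly from the raw scores. The example assumes the standard Kruskal-Wallis formula (12 / (N(N + 1)) times the sum of each group's squared rank sum divided by its size, minus 3(N + 1)) and uses invented data. The probability would still come from CHISQ.DIST.RT with the number of groups minus 1 as degrees of freedom.

```python
def average_ranks(values):
    """Rank values in ascending order, giving tied values their average rank."""
    ordered = sorted(values)
    rank_of = {}
    for v in set(ordered):
        first = ordered.index(v) + 1
        count = ordered.count(v)
        rank_of[v] = first + (count - 1) / 2
    return [rank_of[v] for v in values]


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic for a list of groups of scores."""
    all_scores = [score for group in groups for score in group]
    n_total = len(all_scores)
    ranks = average_ranks(all_scores)
    total = 0.0
    start = 0
    for group in groups:
        rank_sum = sum(ranks[start : start + len(group)])
        total += rank_sum ** 2 / len(group)
        start += len(group)
    return 12 / (n_total * (n_total + 1)) * total - 3 * (n_total + 1)


# Hypothetical ratings for three small groups
groups = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
h = kruskal_wallis_h(groups)
degrees_of_freedom = len(groups) - 1
print(round(h, 3), degrees_of_freedom)
```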

Matched samples

These nonparametric tests are analogous to the matched-groups t-test and to the repeated measures analysis of variance.

Two samples: Wilcoxon matched-pairs signed-ranks

This test is appropriate for the differences between matched pairs, and it goes a step further than a simple count: it takes into account not only the direction of the differences but also their size. It therefore works best with ordinal data that are precise enough for you to rank the differences between the pairs. We are going to use the example of 20 pairs of similarly colored cars, and we rate how worn they look at the end of the year on a scale of 1 to 100. Our null hypothesis is that there is no difference in how badly worn they are at the end of the year, and our alternative hypothesis is that there is a difference. After we enter the data for the home-washed and the car-wash-washed cars, we count the pairs that have negative differences and then the pairs that have positive differences. If the null hypothesis is true, the sum of the ranks for the positive differences should be about the same as the sum of the ranks for the negative differences. So you work with the category that occurs less frequently and add up its ranks. Call that sum A. When the number of pairs is large enough, A is approximately normally distributed. We find the number of positive and negative differences using the COUNTIF function. In the illustration we used =COUNTIF(D2:D21,">0") to find the number of positive differences and =COUNTIF(D2:D21,"<0") to find the number of negative differences.

... /1000, and the other statistics that you will see there are simple. Then, to begin the simulation, J2 to J10001 will hold all of the results of the thousand simulations.

Estimating probability: logistic regression

Now we are going to work with logistic regression. The regression we worked on earlier involved a continuous dependent variable whose values you predict from independent variables. With this kind of regression, the dependent variable is the probability that an event is a success or a failure, and the independent variable is whatever you use to predict that event. What we aim to get is an estimate of the probability that someone is going to buy an iPhone after seeing the ad on TV, based on the time that person spent watching. Ordinarily, the dependent variable in a regression is continuous, but since we are working with a probability, its minimum value is 0 and its maximum is 1. This is referred to as logistic regression.

Working your way through logistic regression

Here we are going to work with the data of 20 people. The first column holds the time they spent watching the promotion, and the second column holds the outcome, 1 or 0.

Here is what a graph of a logistic model looks like. Logistic regression is simply about finding the values of a and b that result in the curve that best fits the data found in columns A and B.
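To make "best fits" concrete, the sketch below assumes the common form of the logistic model, p = 1 / (1 + e^-(a + bx)), and scores a candidate pair (a, b) by its log-likelihood, where a higher value means a better fit. The data and the candidate values are hypothetical, and this is not a fitting routine, only a way to compare two guesses.

```python
import math


def logistic_probability(x, a, b):
    """Predicted probability of a 1 outcome under the logistic model."""
    return 1 / (1 + math.exp(-(a + b * x)))


def log_likelihood(xs, outcomes, a, b):
    """Sum of log-probabilities of the observed outcomes (higher is better)."""
    total = 0.0
    for x, y in zip(xs, outcomes):
        p = logistic_probability(x, a, b)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total


# Hypothetical data: minutes spent watching, and whether the person bought (1) or not (0)
minutes = [1, 2, 3, 4, 5, 6]
bought = [0, 0, 1, 0, 1, 1]

flat = log_likelihood(minutes, bought, a=0.0, b=0.0)     # ignores watching time
rising = log_likelihood(minutes, bought, a=-3.5, b=1.0)  # probability rises with time
print(round(flat, 3), round(rising, 3))
```

A fitting tool searches for the a and b that make this log-likelihood as large as possible.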

Mining with XLMINER

There is no built-in logistic regression tool in Excel for now. However, there are still things that you can do. First download the XLMiner Analysis ToolPak. When you add the XLMiner ToolPak, you are going to see it on the right side of the Home tab. To open the ToolPak, select Insert, choose Add-ins, and tap XLMiner Analysis ToolPak. Then select logistic regression and you are going to see the fields for the Y range, the X range, and the output range.

Conclusion

So far, there is no Excel tool for logistic regression, but there is a way around that with XLMiner. Good luck!

Chapter 19
12 Statistical And Graphical Tips And Tricks

Statistics might be complex, but if you know what you are doing, there are a lot of opportunities to benefit from. Here are some of the best statistical and graphical tips and tricks to help you seize those opportunities.

Significant does not always mean important

In everyday language, the word significant is synonymous with important. In statistics, that is not the case. When you run a statistical test, get a significant result, and reject the null hypothesis, this does not imply that the study is important. Statistics can only help your decision-making and let you make inferences when they are needed. The tests do not automatically make your studies important. At the end of the day, that judgment is left to you.

Trying not to reject a null hypothesis has a few implications

Whether or not you reject a null hypothesis can have consequences in the real world. Many people will say that business or statistics is all about numbers, but in the real world that is not the whole story. Take, for example, companies that measure how much they pollute the air around them. Many of them run statistical tests of a null hypothesis to determine the extent to which they pollute the environment, and at the end of the day they conclude, backed by a null hypothesis that was not rejected, that they do not pollute significantly. It is easy to set up a null hypothesis that defends what you do, but the result feeds into real-life decisions with real-life consequences. Nevertheless, there are times when not rejecting the null hypothesis is the outcome you are hoping for. When that is the case, make sure that alpha is set at a high value, so that even a small divergence from the null hypothesis causes it to be rejected.

Regression is not always linear

When we talked about regression all through this book, we referred to it mostly as linear, hence the importance of lines. The truth is that regression is not always linear. Granted, a line is the easiest model to understand, and once you understand it fully you know what the slope and the intercept mean. However, there are other kinds of regression that are not linear. One, for instance, is curvilinear regression: yes, you can fit a curve through a scatter plot. Such models tend to be complex when compared with the simple linear regression that we all easily understand. Nonlinear regression opens up a lot of opportunities, but it has a few notable drawbacks:

1. The effect that each predictor has on the response is not as easy to understand.
2. It is virtually impossible to calculate p-values for the predictors.
3. Confidence intervals may or may not be calculable.

Extrapolating Beyond A Sample Scatterplot Is A Bad Idea

There is one thing to know about sample scatter plots: do not make any assertions or assumptions that go beyond the boundaries of the scatter plot. Imagine that you have established a relationship between two variables, and the scatter plot you used covers just a narrow range of one of the variables. There is no way to know if the relationship still holds beyond that range, so any prediction that you make outside of the range is invalid. What you need to do is extend the scatter plot by collecting more valid values over a wider range, which might give you a better picture of what you are looking for.

Examine the variability around a regression line

When you look at the differences between the observed values and the predicted values, you get a better understanding of how well the line fits the data. The typical assumption is that the variability around the regression line is the same all along the line. When that is not the case, the model does not predict equally well everywhere. When the variability is systematic, curvilinear regression tends to be more appropriate than linear regression. The standard error of the estimate alone will not always indicate this.

A sample can be too large

We have established the difference between the population and the sample: the population is the entire set of values, while the sample is drawn randomly from the population. When you choose a sample from a population, it must not be too large, because a very large sample can make even a tiny correlation coefficient statistically significant. Take, for example, 100 degrees of freedom with alpha at .05. A correlation coefficient of .195 is enough to reject the null hypothesis that the population correlation coefficient is 0. That takes us to the question of what the correlation coefficient really means. The coefficient of determination, r2, is equal to .038, which means that the regression sum of squares is less than 4% of the total sum of squares. Ultimately, you must keep the sample size in mind whenever you are working with the correlation coefficient. When the sample is too large, it makes trivial associations statistically significant.
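To see how quickly the bar for significance drops as the sample grows, the sketch below approximates the smallest significant correlation for several sample sizes. It uses the Fisher z normal approximation rather than the exact t-based table, so its values differ slightly from tabled ones such as .195. The function name and the sample sizes are chosen for illustration.

```python
import math
from statistics import NormalDist


def approximate_critical_r(n, alpha=0.05):
    """Approximate smallest |r| that is significant (two-tailed) for n pairs.

    Uses the Fisher z approximation: the transformed r has a standard
    error of 1 / sqrt(n - 3).
    """
    z_critical = NormalDist().inv_cdf(1 - alpha / 2)
    return math.tanh(z_critical / math.sqrt(n - 3))


for n in (12, 102, 1002, 10002):
    r = approximate_critical_r(n)
    # r squared is the share of variance such a correlation explains
    print(n, round(r, 3), round(r ** 2, 4))
```

With around ten thousand pairs, a correlation that explains a tiny fraction of a percent of the variance already counts as significant, which is the point this tip is making.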

Consumers: know your axes

The axes matter. I will repeat it: the axes matter. Knowing what they represent can save you from a lot of trouble. In a graph, it is also important that you know what the units of measure are, and which variable is the dependent variable and which is the independent variable. These need to be fully grasped before a graph can be read properly. Because graphs look so professional and visual, their producers often use them to their own advantage, whether or not the graph makes sense. They rely on the idea that people buy with their eyes and not with their brains. So look at graphs with your eyes, but think with your brain. The graphs are meant to mean something. Figure that out.

Graphing a categorical variable as a quantitative variable is wrong

There is a difference between a categorical variable and a quantitative variable, and graphing them involves two different processes. Take, for example, that you are gathering statistics on Lionel Messi, Cristiano Ronaldo, and Lewandowski. You notice in the graph that Ronaldo has been dominating the statistics in the number of goals scored for the past ten years. If you want to summarize all of the outcomes where Lewandowski or Messi scores more, you can make a graph for that. There are two things to know, however, when you are creating the graph:

1. A line in a graph implies continuity from one point to the next, so a line is appropriate when it reflects a continuous scale such as the time interval over the years.
2. A line graph is not appropriate when the variable is a set of categories. In that situation, create a column graph or a pie chart instead.

Whenever appropriate, include variability in the graph

It is often important to show variability in a graph. When you graph means, that may require error bars showing the standard error of each mean. This way the person looking at the data has a better idea of the variability inside the data. Means alone are not very descriptive, so there are other statistics to take into consideration. Make sure that you understand the variance and the standard deviation so that you know what you are working with. When high variances go together with large means, that is a clue to a relationship that you did not see earlier.

Be careful when relating statistics textbook concepts to Excel

This is not easy to admit, but it is necessary: do not expect every concept that you find in statistics textbooks to appear in exactly the same form in Excel. Some concepts are presented differently just to make the study simpler, especially when the theoretical explanations seem complicated, and that can cause some clashes. But do not worry: if you follow the process with care, you are on the right track.

It is always a good idea to use named ranges in Excel

Named ranges are the real deal. I advise people who work with Excel to use named ranges instead of the usual range coordinates or references. Things are easier and simpler afterward: you can write formulas more easily and also read them more easily. Some people who want to read your spreadsheet have no idea how Excel range coordinates work, and if they cannot follow them, you have a problem. With named ranges, a novice can easily read a spreadsheet and make sense of the concepts within. Named ranges are also helpful if you work with the R language. Formulas in R operate on vectors, which are named items much like named ranges. If you are already naming your ranges in Excel, life will be easier when you move to R.

Statistical analysis with Excel on the iPad

The iPad version of Excel might be missing a few tools, like what-if analysis and some graphical capabilities, but there is still a lot that you can do with it. The iPad might not have dialog boxes, but it has contextual menus and variants of Excel add-ins made for the iPad, such as StatPlus, which is an iPad counterpart of the ToolPak.

Conclusion

These tips and tricks will make your statistical and graphical analysis easier.

Chapter 20
Topics that just don't fit elsewhere

In the course of this book, there are a few concepts that did not fit anywhere else. We have put all of those concepts in this section so that we can address them.

Graphing the standard error of the mean Now when you are generating a graph and the data you are working with it is important that you work with the standard error of the mean in the graph. Now when you create the standard error, then people can have an understanding of how the scores are spread through the mean. We are going to be using a test score of five groups of people and the time that it took for them to prepare for the tests then we represented that in the graph below. What we did is we collected our data and found the mean with the average function, then we found the standard deviation with the STDEV.S function and we then looked for the standard error in the 14th row. So we entered the following formula in A14 =A13/SQRT(COUNT(A2:A11)). Then the trick comes when you try to add the standard error inside the graph. The first thing to do is to choose the graph. And you are going to see the format tab. Then select the design and add a chart element then select error bars. You are going to see a few error bar options to choose from. The appropriate one here is the more error bar option.

In this instance, however, you need to specify the values for the error bars yourself.

We used our standard error row to specify them, and it gave us an image like this
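
The worksheet calculation above can be sketched in Python for readers who want to check their numbers outside Excel. This is only an illustration: the scores are hypothetical (the book's data are not reproduced here), and the comments show which Excel function each line is meant to mirror.

```python
import math
import statistics

# Hypothetical test scores for one group (not the book's data).
scores = [72, 85, 90, 66, 78, 88, 95, 70, 81, 75]

mean = statistics.mean(scores)        # Excel: =AVERAGE(A2:A11)
sd = statistics.stdev(scores)         # Excel: =STDEV.S(A2:A11), the sample SD
sem = sd / math.sqrt(len(scores))     # Excel: =A13/SQRT(COUNT(A2:A11))

print(f"mean={mean:.2f}  sd={sd:.2f}  standard error={sem:.2f}")
```

The standard error is the value you would feed to the custom error bars, one per group.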

Probabilities and distributions

There are some other probability and distribution functions that we did not expand on earlier. Here is a breakdown of how they work.

PROB

The PROB function is appropriate when you have the probability distribution of a discrete random variable and you want to find the probability that the variable takes a specific value or falls within a range of values. We will show you the dialog box and explain how it works. On the ribbon, go to Formulas, open the statistical functions, and scroll down to PROB.

It pops up the following dialog box.

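Enter the range of values of the random variable (X_range), the range of probabilities for those values (Prob_range), and then the lower limit. The upper limit is optional; if you leave it out, PROB returns the probability that the variable equals the lower limit.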
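
As a rough illustration of what PROB computes, here is a small Python sketch. The function name `prob` and the sample distribution are our own, chosen only for this example; it sums the probabilities of the values that lie between the lower and upper limits.

```python
def prob(x_values, probabilities, lower, upper=None):
    """Sum the probabilities of the x values between lower and upper (inclusive).

    If upper is omitted, return the probability that x equals lower.
    """
    if upper is None:
        upper = lower
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(p for x, p in zip(x_values, probabilities) if lower <= x <= upper)


# Hypothetical discrete distribution: x takes the values 0..3.
x = [0, 1, 2, 3]
p = [0.1, 0.2, 0.3, 0.4]

print(prob(x, p, 2))      # probability that x equals 2
print(prob(x, p, 1, 3))   # probability that x is between 1 and 3
```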

WEIBULL.DIST

We did not cover this function earlier. It works with a probability density function that is popular with engineers, who use it as a model for the failure of the systems they build. The Weibull distribution has two parameters. The alpha parameter determines how spread out the distribution is, while the beta parameter determines where it is centered on the x-axis. Working through the equation on your own would be very complex without Excel, so this is where the WEIBULL.DIST function comes in. Simply open the Insert Function dialog box and select it, and you are going to see all of the parameters that are associated with it.

When you select it, you are going to see the following boxes in the dialog box

For our example, imagine that the number of hours before a light bulb fails follows a Weibull distribution with an alpha of .75 and a beta of 1000 hours. With X set to 2000, we found the probability for a bulb lasting 2000 hours. With Cumulative set to TRUE, WEIBULL.DIST returns the probability that the bulb lasts at most 2000 hours.
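
For readers curious about the arithmetic behind the function, here is a minimal sketch of the cumulative Weibull formula, 1 - exp(-(x/beta)^alpha), in Python. It assumes the cumulative form of the distribution; the function name `weibull_cdf` is ours.

```python
import math


def weibull_cdf(x, alpha, beta):
    """Probability that a Weibull variable is at most x (alpha = shape, beta = scale)."""
    return 1.0 - math.exp(-((x / beta) ** alpha))


# The light bulb example: alpha = .75, beta = 1000 hours, x = 2000 hours.
p_fail_by_2000 = weibull_cdf(2000, 0.75, 1000)
print(round(p_fail_by_2000, 3))
```

Subtracting the result from 1 gives the probability that the bulb lasts longer than 2000 hours.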

Drawing samples

You can use the Sampling tool to draw samples in Excel. For example, if you are putting together a focus group and want to choose the participants from a pool, you can assign a number to each person and then use the Sampling tool to choose your group. You can sample periodically by supplying n, and Excel picks every nth number as the sample. You can also sample randomly by supplying how many values you would like to select at random. Go to the Data tab on the ribbon and open the Data Analysis tool. You are going to see the following dialog box, where you can scroll down to Sampling

Then, from the image, you have a few options for sampling.

First, input the range that you want to sample from. Then select one of the two methods: Periodic for every nth value, or Random for a given number of values. Finally, choose one of the radio buttons to specify where you want the output to appear.
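
The two sampling methods can be sketched in Python as follows. The pool of participant numbers is hypothetical, and the random method here draws with replacement, which is an assumption about how the tool behaves rather than something stated above.

```python
import random

# Hypothetical pool: participants numbered 1 to 20.
pool = list(range(1, 21))


def periodic_sample(values, n):
    """Return every nth value: the nth, the 2nth, the 3nth, and so on."""
    return values[n - 1::n]


def random_sample(values, k, seed=None):
    """Return k values picked at random (with replacement, by assumption)."""
    rng = random.Random(seed)
    return [rng.choice(values) for _ in range(k)]


print(periodic_sample(pool, 5))        # every 5th participant
print(random_sample(pool, 6, seed=1))  # six participants picked at random
```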

Testing independence: the true use of CHISQ.TEST

We already know how to use CHISQ.TEST to test how well a model fits. However, there are a few things to look out for. CHISQ.TEST is designed for one specific purpose: testing independence. Suppose you have a survey asking people for their favorite type of movie, with respondents from two environments, urban and rural, and you want to know whether movie preference is independent of environment. You need a hypothesis test, with the null hypothesis that movie preference is independent of environment and the alternative hypothesis that movie preference is not independent of environment. To follow this through, you need to work out the numbers you would expect if the two were independent. Then you compare the observed numbers with the expected numbers and see if they match. If they do not, you know what to do: reject the null hypothesis. CHISQ.TEST needs you to supply two things, the observed numbers and the expected numbers, and it returns the probability that $\chi^2$ is at least as high as the result from your data. If that probability is small, you reject the null hypothesis.
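
To make the expected numbers concrete, here is a small Python sketch with hypothetical survey counts (two environments by three movie types). Each expected count is row total × column total ÷ grand total. Because this table has 2 degrees of freedom, the tail probability has the simple closed form exp(-χ²/2); that shortcut holds only for 2 degrees of freedom and is used here to keep the sketch free of extra libraries.

```python
import math

# Hypothetical observed counts: rows = environment, columns = movie type.
observed = [
    [30, 20, 10],  # urban
    [15, 20, 25],  # rural
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected counts if preference is independent of environment.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

chi_sq = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Upper-tail probability; exp(-chi_sq / 2) is exact only when df == 2.
p_value = math.exp(-chi_sq / 2)

print(f"chi-square={chi_sq:.3f}  df={df}  p={p_value:.4f}")
```

In Excel, the observed and expected tables would be the two ranges you hand to CHISQ.TEST.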

Logarithmica Esoterica

Now we are going to work with mathematically based functions that people in technical fields are more likely to use than laypeople.

What is a logarithm?

A logarithm is the exponent to which a base has to be raised to yield a given number. Take for example $10^2$, which gives 100. The exponent 2 is the logarithm of 100 to the base 10, so $\log_{10} 100 = 2$. It is a different way of saying $10^2 = 100$. Logarithms were invented in the 17th century to make calculations faster. They reduce the time it takes to multiply numbers with a lot of digits. Scientists love logarithms because they have properties that make them very useful. In particular, it is easy to find the product of two numbers m and n by looking up their logarithms in a table, adding the logarithms together, and then going back to the table to find the number with the calculated logarithm. For common logarithms, the relationship is $\log(mn) = \log m + \log n$. Imagine that you are looking for 100 × 1000. You look up the logs of 100 and 1000, which are 2 and 3 respectively, add them to get 5, and then go back to the table to find the antilogarithm of 5, which is 100,000. It is that simple. You can also use logarithms for division: $\log(m/n) = \log m - \log n$. Logarithms can also be used to find roots, and they can be converted between different positive bases.
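
The table-lookup trick is easy to try in Python, with `math.log10` standing in for the logarithm table. This is just a sketch of the 100 × 1000 example from the text.

```python
import math

m, n = 100, 1000

# Multiply by adding logarithms, then take the antilogarithm.
log_sum = math.log10(m) + math.log10(n)   # 2 + 3
product = 10 ** log_sum

# Divide by subtracting logarithms.
log_diff = math.log10(m) - math.log10(n)  # 2 - 3
quotient = 10 ** log_diff

print(product, quotient)
```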

What is e?

e, also called Euler’s number, is a mathematical constant approximately equal to 2.71828 that can be expressed in several ways. It is the base of natural logarithms and the limit of $(1 + 1/n)^n$ as n approaches infinity. It can also be calculated as the sum of the infinite series

$$e = \sum_{n=0}^{\infty} \frac{1}{n!} = 1 + \frac{1}{1!} + \frac{1}{2!} + \frac{1}{3!} + \cdots$$

Furthermore, it is the unique positive number a such that the graph of the function $y = a^x$ has a slope of 1 at x = 0.

The exponential function $f(x) = e^x$ is the unique function that equals its own derivative and satisfies $f(0) = 1$, so you can define e as $f(1)$. The natural logarithm, the logarithm to the base e, is the inverse of the exponential function. The natural logarithm of a number k > 1 is defined as the area under the curve y = 1/x between x = 1 and x = k, which makes e the value of k for which that area equals one. There are other ways to express it.
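
Two of these definitions, the limit and the infinite series, are easy to check numerically. The sketch below is illustrative only; the function names are ours.

```python
import math


def e_from_limit(n):
    """Approximate e as (1 + 1/n)^n for a large n."""
    return (1 + 1 / n) ** n


def e_from_series(terms):
    """Approximate e as the sum of 1/k! for k = 0 .. terms-1."""
    return sum(1 / math.factorial(k) for k in range(terms))


print(e_from_limit(1_000_000))  # approaches e slowly
print(e_from_series(20))        # approaches e very quickly
print(math.e)
```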

LOGNORM.DIST

A random variable is lognormally distributed when its natural logarithm is normally distributed. While a normal distribution can have negative numbers as possible values, the lognormal cannot. It is also not symmetrical; instead it is stretched out to the right. It shares similarities with the Weibull in the sense that engineers also use the lognormal to model the breakdown of the systems they create. Excel's LOGNORM.DIST function works with the lognormal distribution. You specify a value, the mean, and the standard deviation for the lognormal (these are the mean and standard deviation of the natural logarithm of the variable), and you get the probability that the variable is at most that value. Imagine that you are testing failures of some product and you discover that the hours to failure are lognormally distributed with a mean of 10 and a standard deviation of 2.5, and you want to find the probability that the product fails in at most 10,000 hours. You will find this function if you go to Insert Function on the ribbon and search for it.

Then we used the parameters earlier stated here
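
The same probability can be sketched in Python. The idea is that the lognormal probability for a value x equals the normal probability for ln(x). The sketch assumes the cumulative form of the function; `lognorm_cdf` is our own name.

```python
import math
from statistics import NormalDist


def lognorm_cdf(x, mean, sd):
    """Probability that a lognormal variable is at most x.

    mean and sd describe the natural logarithm of the variable.
    """
    return NormalDist(mean, sd).cdf(math.log(x))


# Hours to failure: mean 10, standard deviation 2.5, x = 10,000 hours.
p_fail = lognorm_cdf(10_000, 10, 2.5)
print(round(p_fail, 3))
```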

LOGNORM.INV

LOGNORM.INV is the inverse of LOGNORM.DIST. Here you provide a probability, the mean, and the standard deviation for the lognormal distribution, and you get the value of the random variable that cuts off the probability you specify. To find the value that cuts off .001, use LOGNORM.INV with the mean of 10 and the standard deviation of 2.5. Look for the function in the Insert Function search bar:

When you select it, you are going to see the following dialog box. Enter the parameters given above and you are going to get the answer.
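
As a sketch of the inverse calculation in Python: find the normal value that cuts off the given probability, then exponentiate it. The function name `lognorm_inv` is ours, and `NormalDist.inv_cdf` requires Python 3.8 or later.

```python
import math
from statistics import NormalDist


def lognorm_inv(probability, mean, sd):
    """Value of a lognormal variable that cuts off the given cumulative probability."""
    return math.exp(NormalDist(mean, sd).inv_cdf(probability))


# The value that cuts off .001 when the mean is 10 and the SD is 2.5.
hours = lognorm_inv(0.001, 10, 2.5)
print(round(hours, 2))
```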

Array function: LOGEST

One way to fit a curve through a scatter plot is with the following equation:

$$y = a \cdot b^{x}$$

With the LOGEST function, you can find the estimates of a and b for the curvilinear equation. Here is the dialog box for the LOGEST function.

These are the steps to apply the function:

1. With the data entered, choose a cell to begin the LOGEST output array.
2. Go to the statistical functions menu and look for LOGEST.
3. When the dialog box pops up, enter the range with the y variable in Known_y's and the range with the x variable in Known_x's.
4. In the Const box, choose TRUE if you want the value of a to be calculated, or FALSE to set a to 1. In the Stats box, choose TRUE to return the regression statistics along with a and b, or FALSE to return only a and b.
5. Select OK so that the answer appears in the output array.

Array function: GROWTH

GROWTH is curvilinear regression's answer to TREND. This function can be used in two ways: to predict the y's for the x's in the sample, and to predict a set of y values for a new set of x values. To predict the y values for the x values in the sample, use the GROWTH function and follow these steps:

1. Enter the data.
2. Open the statistical function GROWTH.
3. For Known_y's, enter the range with the y variable.
4. For Known_x's, enter the range with the x variable.
5. Since we are not calculating values for new x's, leave New_x's blank.
6. In Const, enter TRUE to calculate a, or FALSE to set a to 1.
7. Select OK so that the GROWTH answer appears in the array you set as the output.
8. To predict a new set of y's, follow the same process but enter the value or the range for the new x's in New_x's.
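
The kind of curve fit that LOGEST and GROWTH perform can be sketched in Python. The sketch assumes the curve has the form y = a·b^x and fits it by ordinary least squares on ln(y), which is one common way to do it; the function names and the data are hypothetical.

```python
import math


def fit_exponential(xs, ys):
    """Fit y = a * b**x by least squares on ln(y); return (a, b)."""
    n = len(xs)
    log_ys = [math.log(y) for y in ys]
    mean_x = sum(xs) / n
    mean_log_y = sum(log_ys) / n
    sxy = sum((x - mean_x) * (ly - mean_log_y) for x, ly in zip(xs, log_ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx                      # estimates ln(b)
    intercept = mean_log_y - slope * mean_x  # estimates ln(a)
    return math.exp(intercept), math.exp(slope)


def predict(a, b, new_xs):
    """Predict y values on the fitted curve for a list of x values."""
    return [a * b ** x for x in new_xs]


# Hypothetical data generated from y = 3 * 2**x.
xs = [1, 2, 3, 4, 5]
ys = [3 * 2 ** x for x in xs]

a, b = fit_exponential(xs, ys)
print(a, b)                  # estimates of a and b
print(predict(a, b, xs))     # y's for the x's in the sample
print(predict(a, b, [6, 7])) # y's for a new set of x's
```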

The logs of Gamma

GAMMALN.PRECISE returns the natural log of the gamma function value of its argument x. It looks like this: =GAMMALN.PRECISE(5.3)
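
For comparison, Python's standard library offers the same calculation as `math.lgamma`. The short sketch below also confirms that it matches taking the natural log of the gamma function directly.

```python
import math

log_gamma = math.lgamma(5.3)           # natural log of gamma(5.3)
check = math.log(math.gamma(5.3))      # the same value, computed in two steps

print(round(log_gamma, 4), round(check, 4))
```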

Sorting Data

To sort data that is in no particular order, here is what to do:

1. Choose the cell range with the data.
2. Choose Data and select Sort.
3. In the drop-down menu next to Sort By, choose the first variable that you want to sort by. Then adjust the settings in the Sort On list and in the Order list.
4. Select the Add Level button.
5. In the drop-down menu next to Then By, choose the next variable to sort by, and adjust its Sort On and Order settings.
6. Once you have done this for the last variable, select OK.
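
A multi-level sort like the one above can be sketched in Python with a sort key made of several parts. The rows are hypothetical: the first level sorts by group from A to Z, and the second level sorts by score from largest to smallest.

```python
# Hypothetical rows: (group, score).
rows = [("B", 82), ("A", 75), ("B", 91), ("A", 88)]

# Sort By group (ascending), Then By score (descending).
sorted_rows = sorted(rows, key=lambda row: (row[0], -row[1]))

for row in sorted_rows:
    print(row)
```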

Appendices

When your data lives elsewhere

There are times when you do not physically have the data with you and have to reach it from somewhere outside your computer. Most times you have the option to download the data and work on it. When it is not that straightforward, you can use Microsoft 365: open the Data tab, go to the Get & Transform Data area, and select a general category. Then navigate to the target and follow the steps in the dialog box in order to load the data into the worksheet.

Tips for teachers and learners

Throughout this book, you will have come to understand how useful Excel is. It is quick and easy to work with, which is why it is one of the most common spreadsheet programs. In this section, we take you through ways to use Excel as a teaching tool and, as a student, to have a productive day in class. So here it goes:

Augmenting analyses is a good thing

When you look at the output of a data analysis tool, you should understand where the numbers come from. A lot of the time the output looks complicated and convoluted, but it is possible to understand it. Throughout this book, we have gone through ways to use Excel to augment its own analyses. This can mean making some extra calculations outside of the tool's output and comparing the results, or even using those results to modify your work. When you do this, you understand your work better, and things go much more smoothly afterward.

Understanding ANOVA

There are a few concepts to understand in ANOVA. Suppose you have created an ANOVA output table, for example, this one.

To understand the ANOVA table, start with the MS total. The ANOVA tool does not give you this term. To find it, divide the SS total by the df total, which here means 5405.846/25. Mean square is a synonym for variance, so you can apply VAR.S to all of the scores to check the term. You are going to see that the answer is equal to the value we calculated from 5405.846/25.

Next is the MS within. It is the weighted average of the variances within the groups. To weight a group's variance, multiply it by the number of scores in the group minus 1. You can use the Count column to find the number of scores in each group, so add a Count-1 column that subtracts one from each of the counts. When you are done, define Variance as the name of the three rows in the Variance column and Count_1 as the name of the three rows in the Count-1 column.

Then the MS between refers to the variance among the group means, with each group mean weighted by the number of scores in its group. Here you square the deviation of each group mean from the grand mean and then multiply by the group's count.
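
The three mean squares can be reproduced in a few lines of Python. The scores below are hypothetical (they are not the data behind the table above); the point of the sketch is that MS total equals the sample variance of all scores, and that SS within plus SS between adds up to SS total.

```python
import statistics

# Hypothetical scores for three groups.
groups = {
    "group1": [10.0, 12.0, 14.0],
    "group2": [20.0, 22.0, 24.0],
    "group3": [30.0, 31.0, 35.0],
}

all_scores = [x for scores in groups.values() for x in scores]
n_total = len(all_scores)
k = len(groups)
grand_mean = statistics.mean(all_scores)

# MS total: SS total divided by df total (same as VAR.S on all scores).
ss_total = sum((x - grand_mean) ** 2 for x in all_scores)
ms_total = ss_total / (n_total - 1)

# MS within: each group variance weighted by (count - 1).
ss_within = sum(statistics.variance(s) * (len(s) - 1) for s in groups.values())
ms_within = ss_within / (n_total - k)

# MS between: squared deviation of each group mean, multiplied by the count.
ss_between = sum(
    len(s) * (statistics.mean(s) - grand_mean) ** 2 for s in groups.values()
)
ms_between = ss_between / (k - 1)

print(ms_total, ms_within, ms_between)
```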

Revisiting regression

In statistical modeling, regression analysis is used to find the relationship that exists between multiple variables. Variables are either dependent or independent. The dependent variable is the main quantity you want to understand, while the independent variables are the factors that seem to affect the dependent variable. You use regression analysis to understand how the dependent variable changes when one of the independent variables changes. Regression also helps you understand which of the independent variables are important and have an impact on the data. Regression analysis is based on sums of squares, which measure the dispersion of the data points. The point of the regression model is to find the minimum sum of squares, so that you can draw the line that is closest to the data. Regression can be simple linear or multiple linear. With simple linear regression, you find the relationship between the dependent variable and one independent variable with a linear function. It is that simple. When you are working with more than one independent variable, you are working with multiple linear regression.
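
A simple linear regression can be sketched in Python to show what "minimum sum of squares" means in practice. The data are hypothetical, and the function names are ours; the formulas are the standard least-squares slope and intercept.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def sum_of_squared_residuals(xs, ys, intercept, slope):
    """Sum of squared vertical distances between the points and the line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data: x is the independent variable, y the dependent variable.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_line(xs, ys)
best = sum_of_squared_residuals(xs, ys, intercept, slope)
worse = sum_of_squared_residuals(xs, ys, intercept, slope + 0.1)

print(intercept, slope, best, worse)
```

Any other line, such as the one with a slightly different slope, gives a larger sum of squares than the fitted line.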

Simulating data is also a good thing

When you are teaching data analysis, you may be working with material that presents a between-groups ANOVA or an independent-groups t-test. You may then want students to replicate the analysis, but the material gives no raw data, only summary statistics like the mean and the standard deviation. Students can still complete the analysis by simulating data to work on: simply generate numbers that are consistent with the summary statistics. Imagine that you want to generate the scores of two groups that have a mean of 25 and a standard deviation of 4. Place the mean and the standard deviation in empty cells. Then use RAND to generate ten random values and paste the generated values back into the cells as values, so that the numbers are stable and do not change each time the worksheet recalculates.
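
Here is one way to sketch the simulation in Python. It is not the worksheet method described above: it draws random values and then rescales them so that the sample mean and standard deviation match the targets exactly, which is a design choice made for this illustration. A fixed seed plays the role of pasting as values, so the numbers stay the same on every run.

```python
import random
import statistics


def simulate_scores(n, target_mean, target_sd, seed=0):
    """Generate n scores whose sample mean and SD equal the targets."""
    rng = random.Random(seed)  # fixed seed keeps the values stable
    raw = [rng.gauss(0, 1) for _ in range(n)]
    raw_mean = statistics.mean(raw)
    raw_sd = statistics.stdev(raw)
    # Standardize the raw values, then rescale to the target mean and SD.
    return [target_mean + target_sd * (x - raw_mean) / raw_sd for x in raw]


scores = simulate_scores(10, 25, 4, seed=1)
print([round(s, 1) for s in scores])
print(statistics.mean(scores), statistics.stdev(scores))
```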

When all you have is a graph

Sometimes you work with sources that present their data points only in graphs. To analyze them, you need to get the numerical values for the data points. You can use WebPlotDigitizer to do this. You supply the image to WebPlotDigitizer, extract the data points, and download them as a spreadsheet.

More on Excel graphics

Now let us go into the world of other Excel graphs that you probably have not heard of before.

Tasting the bubbly

Have you ever had data that looks like it needs to be represented in three dimensions, but you feel that a two-dimensional chart will do the trick? Let us introduce you to the bubble chart. In a bubble chart, the data points are represented as circular bubbles. The position of each bubble follows its x and y coordinates, and the size of the bubble represents the third dimension. Here is how to work with it. The first thing to know about the bubble chart is that it is a variant of the scatter chart, so you add it the way you would add a scatter chart. Go to Insert on the ribbon, choose the scatter chart button from the Charts group, and select Bubble

When you select Bubble, you are going to see the chart, as shown here

To edit the chart, select a part of the chart and you are going to see the chart tools. From there you can do any of the following:

1. Under the chart tools, open the Design tab and select the chart style that you prefer from Chart Styles.
2. If there is a legend on the chart that you do not need, select it and delete it.
3. To change the size of the chart, go to the Format tab on the ribbon. In the Size group, enter the shape height and the shape width that you like, and then press Enter.
4. To edit the chart title, select the title on the chart. It automatically gives you the option to edit it.
5. If the chart title is too big or too small, right-click on it, choose Font, and enter the size you like.
6. To align the chart title with the plot area, select the title and move it to where you want it to be.
7. To add an axis title, select the chart area, select the plus sign next to the chart, choose Axis Titles, and then choose Primary Horizontal. Select the axis title box and enter the text you want.

There are many other options, and they all look pretty intuitive if you look at them closely.

Taking stock

The stock chart is used for what it is named after: keeping track of stock market data. You can use it to chart values such as volume, high, low, and close. Here is how to get around with this chart.

1. First, go to the Insert tab.
2. Then select Recommended Charts. At the top of the dialog box you are going to see the All Charts tab.
3. Select it and look for Stock in the list.
4. Then select any of the available variants and choose OK.

Once you have the chart, there is also the option to format it however you like.

1. First, choose the chart, and the chart tools tabs are activated in the ribbon.
2. Then select the Design tab.
3. Then choose the element that you would like to format.
4. Save the changes that you made to the chart.

And it is that simple. You have your chart.

Scratching the surface

Charts are used to represent data, and picking the right kind of chart for the data should not be underrated. The surface chart is the appropriate chart when you are looking for the best combination of two sets of data. Just as in a map, the colors and the patterns of the chart show the areas that fall in the same range of values. Here is how to create a surface chart:

1. Gather the data in your worksheet and make sure that it is arranged properly.
2. Select the data that you want to be represented.
3. Go to the ribbon and select the Insert tab. In the Charts group, select the Stock, Surface, or Radar chart button.

Then you can preview any of the charts that you like. There are a few surface chart options, as you can see from the image, and each meets a specific purpose.

The 3-D surface is the one that we have here. It is like a sheet draped over a 3-D column chart. You can use it to see relationships between large amounts of data that you might not otherwise see easily. The color bands in these charts do not represent the data series. Instead, they show the differences between the values.

Then there is the wireframe. A wireframe is like a building without the brick walls covering it; it is simply a frame. Wireframe charts are not very easy to read, but they are good for plotting large data sets. Use them when you want to display the trends in values across two dimensions on a continuous curve, when both the categories and the series are numeric, and when the data curves behind itself.

There are also the contour surface charts. A contour chart is basically a surface chart seen from the top. The colors show the ranges of values, and the lines connect the interpolated points of equal value.

On the radar

The radar chart is used to show performance and to highlight where the strengths and the weaknesses are. Excel provides a few options: the radar, the radar with markers, and the filled radar. The radar shows the data values relative to a center point. It is used when the categories cannot be compared directly.

The radar with markers is a radar chart with markers on the data points; that is the only difference between the two. The filled radar is pretty much the same, except that the area enclosed by the radar is filled with color. If you have played PES11 or one of those earlier games, then you should know what a radar chart is. The categories, which play the role of the x-axis, sit at the ends of the spokes of the spider web, and the steps along each spoke act as the y-axis. To create one, here is what to do:

1. Enter the data in the worksheet.
2. Select Insert from the ribbon and choose the Stock, Surface, or Radar chart button from the Charts group.
3. There are three radar options. Choose any of them so that the chart appears in the worksheet.
4. You can then format the chart if you want to.

Growing a treemap and bursting some sun

Treemaps are for hierarchical data with multiple relationships. They are best for showing things like best-selling products, regional sales, and so on. The treemap has nested rectangles that act like branches. Each item in the dataset is represented by a rectangle whose size corresponds to the item's value. Treemaps are good for spotting patterns and similarities, and because they are so compact they take up little space on the spreadsheet. To create a treemap in Excel:

1. Enter the data.
2. Select Insert from the ribbon.
3. Open the dialog box for the charts.
4. You will see all of the charts.
5. Choose Treemap from the options.

Formatting the treemap is pretty easy. To change the title, choose the chart title and change it. Then, on the ribbon, go to the Chart Design tab to make any changes to the chart that you want.

The sunburst is just underneath the treemap in the list of charts. The difference between the two is that while the treemap represents data in rectangles, the sunburst makes a doughnut shape. In the hierarchy, the highest level is the inner ring and the sublevels are the outer rings.

Building a histogram

To create a histogram in Excel, you can use the Analysis ToolPak, or the FREQUENCY function together with a column chart. There is even the option to select the array of data and insert a histogram chart directly. To make the histogram:

1. Enter the data.
2. Go to Insert on the ribbon and open the dialog box for the Charts group.
3. Select Histogram.

Ordering columns: Pareto

The Pareto chart is both a column chart and a line chart. The columns show the items arranged in decreasing order of magnitude, and the line shows the cumulative percentage of the items.

Of boxes and whiskers

The box and whisker chart represents the distribution of data by quartiles, showing the mean and the outliers. The boxes have lines that extend vertically, and those are called whiskers. The whiskers show the variability outside the upper and the lower quartiles, and the points outside the whiskers are called outliers. These charts are commonly used in statistics. Here is how to use one:

1. Enter your data.
2. Go to the ribbon, choose Insert, then Insert Statistic Chart, and choose Box and Whisker.
3. To change the options for the boxes and the whiskers, right-click a box within the chart and select Format Data Series from the shortcut menu.
4. In the Format Data Series pane, select Series Options and make any changes that you want.

3D maps

3D Maps is another three-dimensional Excel tool that helps you to better visualize your data. It is more insightful and allows you to see things differently. 3D Maps helps you plot geographic and temporal data on a 3D globe. The maps are good for:

1. Mapping data
2. Gaining new insight, especially into how the data is related to geographic areas
3. Sharing stories

You find 3D Maps in the Tours group on the ribbon. First, make sure that the Excel data has geographic properties and is placed in table format or in a data model. It ought to have properties like cities, states, counties, etc. Then:

1. Open the workbook with the table or data model. A worksheet without such data will not work with 3D Maps.
2. Select a cell inside the table.
3. Select Insert and choose 3D Map.
4. Go to the Layer pane and verify that the fields are correctly mapped. For any fields that are mapped incorrectly, use the drop-down arrow to match them with the appropriate geographic property.

Filled maps

The filled map is one feature that makes Excel's maps even better. With it, you can create a two-dimensional map that displays descriptive statistics.

Conclusion

As far as charts and graphs are concerned, Excel is the best, and more options and features are still to come. The good thing is that no matter what you need, there is an Excel feature that works for you.

Chapter 21 The analysis of covariance

When you analyze covariance, you compare data sets that involve three kinds of variables: the treatment variable, the variable you measure as the outcome, and a third variable that cannot be controlled but can be measured. The third variable still affects the other two. Adjusting for it gives you a form of indirect control, which adds precision to the study and leaves very little room for bias. A common example is the study of organ weight in a toxicity study. What we are interested in is the effect of the dose of a toxin on the weight of an organ. However, organ weight also depends on the body weight of the subject. Since the covariate, body weight, is not itself of interest in this example, we measure it only so that we can adjust the measurements of the variate that we are concerned about. ANCOVA lets you make these kinds of adjustments simply. Before you use ANCOVA, you need to tread with caution so that the relationship between the variate and the covariate is reliable. ANCOVA has a few rigid assumptions. Here they are:

1. The regression slope of Y on X is equal from group to group.
2. The relationship between X and Y is linear.
3. There are no errors in the measurement of the covariate X.
4. There are no unmeasured confounding variables.
5. The errors within the data set are not related to each other.
6. The variance of the errors is equal between the groups.
7. The measurements that make up the groups are normally distributed.

Covariance a close look

Before working with that third variable, take a closer look at what covariance means. Take for example a test of people's proficiency in mathematics and of their social skills. If you observe that the people who are more mathematically proficient turn out to be highly sociable, and the people who are not very good with mathematics are not very sociable, the covariance is high and positive. Because the relationship is positive, it is referred to as a direct relationship. Now, you may already be assuming that mathematically proficient people tend to be geeks, hence less sociable, while the people who are not good at math are the popular kids, hence more socially accepted. Under these conditions, the covariance is high and negative, and the relationship is referred to as an inverse relationship. Finally, it is possible that there is no connection between a person’s proficiency in mathematics and how sociable they are. Under these conditions, the two variables are independent, so the covariance is virtually nonexistent. When you look at all this you might say: wait! Is this not correlation? Pretty close. Covariance is the numerator of the correlation coefficient. Correlations are simply easier to interpret than covariances.
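
The link between covariance and correlation can be sketched in Python. The scores are hypothetical, and the function names are ours. The correlation is the covariance divided by the product of the two standard deviations, which is why covariance is described as its numerator.

```python
import statistics


def covariance(xs, ys):
    """Sample covariance of two equally long lists."""
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    n = len(xs)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def correlation(xs, ys):
    """Correlation: covariance divided by the product of the standard deviations."""
    return covariance(xs, ys) / (statistics.stdev(xs) * statistics.stdev(ys))


# Hypothetical scores for five people.
math_scores = [50, 60, 70, 80, 90]
social_scores = [3, 4, 4, 6, 7]

print(covariance(math_scores, social_scores))   # positive: a direct relationship
print(correlation(math_scores, social_scores))  # always between -1 and +1
```

Unlike the covariance, the correlation does not depend on the units of the two variables, which is what makes it easier to interpret.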

Why do you analyze covariance

To understand why we analyze covariance in the first place, let us use an example. You test people under different conditions and you want to see how well they perform under each condition. Let us go to the classroom and randomly assign 15 children to three groups that prepare for a mathematics test in different ways. In the first group, they listen to an instructor. In the second group, they use an interactive program on the computer, and in the third, they read a textbook. Then they go into the examination hall. The way they perform in the test is the dependent variable and the preparation method is the independent variable. We want to know whether the independent variable has any bearing on the outcome of the test, which is the dependent variable. So we need a hypothesis test, with the null hypothesis that the preparation method has no bearing on performance and the alternative hypothesis that the null hypothesis is wrong. You can do a simple ANOVA for this, but there is also ANCOVA lurking somewhere. In this example, we have mentioned only the dependent variable and the independent variable. If you look closely, however, there is also the possibility of a third variable. Suppose you have another measure for the children: mathematics aptitude. Mathematics aptitude also affects the children’s performance in the test. This third variable is referred to as the covariate, and the relationship between the covariate and the dependent variable is what covariance is. This brings us to the question of why we need statistical control. Suppose you run the study and observe no significant difference between the preparation groups. This may imply that the experimental control (the term for assigning individuals to the conditions of the independent variable while keeping everything else the same) is not strong enough to show the effect of the type of preparation. It is in cases like this that one assesses the effect of the covariate. ANCOVA combines experimental control with statistical control, and it is suited to situations where, for example, mathematics aptitude affects performance.

How do you analyze covariance?

In ANCOVA, you use the relationship between the dependent variable and the covariate to adjust SSbetween and SSwithin. When that relationship is strong, the adjustment is likely to increase SSbetween and reduce SSwithin. Both ANOVA and ANCOVA help you decide whether the differences among the groups are too unlikely to be due to chance. ANCOVA goes a step further by first removing the variability that the covariate explains.

ANCOVA in Excel

Excel has no built-in tool for ANCOVA like the ones it has for ANOVA, but that need not stop you. You can adapt tools that are already there, in one of two ways: use the ANOVA: Single Factor analysis tool together with some worksheet functions, or use the Regression analysis tool together with some worksheet functions. We are going to be working with the following data.

Method 1

In the worksheet, we have placed the dependent variable, which is the exam score, alongside the covariate, which is the math aptitude.

In this layout, we separated the covariate from the dependent variable. We structured it this way because we are going to run the ANOVA: Single Factor tool on both the dependent variable and the covariate. We also have to find bwithin, which is the number we use to adjust the SS and then to adjust the group means for post-analysis testing.

To picture bwithin, imagine a scatter plot for each of the three groups with a regression line drawn through each one. The slopes of those regression lines determine bwithin: it is the average of the group slopes, with each slope weighted by the variance of the covariate within its group. Here is the formula:

bwithin = Σ(VARx × SLOPExy) / Σ(VARx), summed over the groups
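As a rough check outside Excel, the weighted average of group slopes can be sketched in Python. The scores below are invented purely for illustration (they are not the book's worksheet data), and the function names are hypothetical. The sketch assumes the sample variance (the VAR.S convention) and a least-squares slope like the one Excel's SLOPE function returns.

```python
from statistics import mean, variance


def slope(xs, ys):
    """Least-squares slope of ys on xs, like Excel's SLOPE(known_ys, known_xs)."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def b_within(groups):
    """Average of the group slopes, each weighted by the covariate variance in its group.

    groups maps a group name to (covariate values, dependent-variable values).
    """
    weighted_sum = 0.0
    variance_sum = 0.0
    for xs, ys in groups.values():
        var_x = variance(xs)  # sample variance of the covariate (assumed VAR.S convention)
        weighted_sum += var_x * slope(xs, ys)
        variance_sum += var_x
    return weighted_sum / variance_sum


# Invented aptitude (x) and exam (y) scores for three preparation groups.
scores = {
    "Human": ([10, 12, 14, 16, 18], [40, 45, 47, 52, 56]),
    "Computer": ([9, 11, 13, 15, 17], [42, 44, 50, 53, 57]),
    "Text": ([11, 12, 15, 16, 19], [38, 41, 44, 47, 52]),
}
print("b_within =", round(b_within(scores), 4))
```

If every group has the same slope, the weighted average simply returns that slope, which is a quick sanity check on the worksheet formulas.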

First, find the variance of the covariate within each group. We placed these variances in row 8. Then we used the SLOPE function to find the slope within each group, and multiplied each group's VARx by its SLOPExy. To find bwithin, we divided the sum of the VARx*SLOPExy products by the sum of the VARx values. In G18 we entered btotal, which is the SLOPE function over all the data: =SLOPE(C12:E16, C3:E7).

Next comes the ANCOVA table (A18:A22), which you have to fill in. You can format it to resemble an ANOVA: Single Factor table. Fill in the sources of variation and the degrees of freedom. They are just like those in ANOVA; the only difference is that you lose one degree of freedom from dfwithin and therefore from dftotal.

Then run the ANOVA: Single Factor tool twice, once for the dependent variable and once for the covariate. We need the ANOVA on the covariate because it gives us SS values needed to finish the ANCOVA. When the ANOVAs are done, fill in the ANCOVA table. To find the adjusted SStotal, here is the formula:

Adjusted SStotal = SStotalY - (btotal)^2 × SStotalX

The adjusted SSwithin follows the same pattern, using bwithin and the within-group SS values:

Adjusted SSwithin = SSwithinY - (bwithin)^2 × SSwithinX

The adjusted SSbetween is the adjusted SStotal minus the adjusted SSwithin. To finish the ANCOVA table, divide each adjusted SS by its degrees of freedom to get the adjusted MS, and divide the adjusted MSbetween by the adjusted MSwithin to find the F.
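The adjustment steps can be sketched in Python as a cross-check on the worksheet. This is a minimal sketch under stated assumptions: the data are invented, the function and key names are hypothetical, and bwithin is computed as the pooled within-group slope (sum of products over sum of squares), which matches the variance-weighted average of group slopes when all groups are the same size.

```python
from statistics import mean


def sum_sq(values):
    """Sum of squared deviations from the mean (what Excel's DEVSQ returns)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def sum_prod(xs, ys):
    """Sum of cross-products of deviations of xs and ys from their means."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys))


def ancova_table(groups):
    """groups maps a name to (covariate values, dependent-variable values)."""
    all_x = [x for xs, _ in groups.values() for x in xs]
    all_y = [y for _, ys in groups.values() for y in ys]
    k, n = len(groups), len(all_y)

    ss_total_x, ss_total_y = sum_sq(all_x), sum_sq(all_y)
    ss_within_x = sum(sum_sq(xs) for xs, _ in groups.values())
    ss_within_y = sum(sum_sq(ys) for _, ys in groups.values())

    b_total = sum_prod(all_x, all_y) / ss_total_x
    # Pooled within-group slope (assumption: equal group sizes, see lead-in).
    b_within = sum(sum_prod(xs, ys) for xs, ys in groups.values()) / ss_within_x

    adj_total = ss_total_y - b_total ** 2 * ss_total_x
    adj_within = ss_within_y - b_within ** 2 * ss_within_x
    adj_between = adj_total - adj_within

    df_between = k - 1
    df_within = n - k - 1  # one degree of freedom is lost to the covariate
    ms_between = adj_between / df_between
    ms_within = adj_within / df_within
    return {
        "b_total": b_total, "b_within": b_within,
        "adj_total": adj_total, "adj_within": adj_within,
        "adj_between": adj_between,
        "df_between": df_between, "df_within": df_within,
        "f": ms_between / ms_within,
    }


# Invented aptitude (x) and exam (y) scores for three preparation groups.
scores = {
    "Human": ([10, 12, 14, 16, 18], [40, 45, 47, 52, 56]),
    "Computer": ([9, 11, 13, 15, 17], [42, 44, 50, 53, 57]),
    "Text": ([11, 12, 15, 16, 19], [38, 41, 44, 47, 52]),
}
for key, value in ancova_table(scores).items():
    print(key, round(value, 4))
```

The resulting F is compared against the F distribution with df_between and df_within degrees of freedom, just as in the single-factor ANOVA.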

The final column of the ANCOVA table is the tricky part. Under the ANCOVA table, we entered the adjusted means for the levels of the independent variable. These set you up for the post-analysis testing. Here is the formula:

Adjusted group mean = group mean of Y - bwithin × (group mean of X - mean of all X)

So for D22 we entered something like this: = mean for Human - bwithin * (covariate mean for Human - AVERAGE of the covariate). The other groups follow the same pattern.
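The adjusted-mean calculation can also be sketched in Python. The data and the value of bwithin below are hypothetical placeholders chosen only to show the arithmetic; in practice bwithin comes from the earlier step.

```python
from statistics import mean


def adjusted_means(groups, b_within):
    """Adjusted mean = group mean of Y - b_within * (group mean of X - grand mean of X).

    groups maps a name to (covariate values, dependent-variable values).
    """
    all_x = [x for xs, _ in groups.values() for x in xs]
    grand_x = mean(all_x)  # AVERAGE over every covariate value
    return {
        name: mean(ys) - b_within * (mean(xs) - grand_x)
        for name, (xs, ys) in groups.items()
    }


# Invented aptitude (x) and exam (y) scores for three preparation groups.
scores = {
    "Human": ([10, 12, 14, 16, 18], [40, 45, 47, 52, 56]),
    "Computer": ([9, 11, 13, 15, 17], [42, 44, 50, 53, 57]),
    "Text": ([11, 12, 15, 16, 19], [38, 41, 44, 47, 52]),
}
hypothetical_b_within = 1.9  # placeholder value, not derived from the book's data
for name, value in adjusted_means(scores, hypothetical_b_within).items():
    print(name, round(value, 2))
```

A group whose covariate mean is above the grand mean has its exam mean adjusted downward, and a group below the grand mean is adjusted upward (for a positive bwithin).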

The conclusion is that taking account of the relationship between the dependent variable and the covariate can reveal an effect that you might otherwise not notice.

Method 2

The other way is with the regression model.

After the ANCOVA

There are two kinds of analysis to do when you are done with the ANCOVA. The first is the planned comparison, which is based on what you expected the data to show before you gathered it. The second is the post hoc test, which follows up on whatever looks interesting in the data after you have gathered it. The difference between ANCOVA and ANOVA here is that with ANCOVA you work with the adjusted group means instead of the raw group means. The error term, the mean square, and the sum of squares need adjusting too.

And one more thing: a lot of practice is really what makes the difference. We hope that by the end of this book you have a better understanding of how to go about statistical analysis in Excel.
