Table of contents:
Hypothesis
Hypothesis testing
Frequency
Using Python
Sample and Population
Type I Error and Type II Error
Measuring the Central Tendencies (mean, median, and mode), and Range
Using Python
Geometric Mean
Using Python
Grand Mean
Using Python
Harmonic Mean
Using Python
Mean Deviation
Using Python
Mean Difference
Using Python
Root mean square
Using Python
Sample mean and population mean
Types of data in statistics
Data Collection and its types
Sampling plan
Sampling Methods
Required Sample Size
Using Python
Simple Random Sampling
Cluster Sampling
Stratified Sampling
Graphs and Plots
Bar Graph
Using Python
Pie Chart
Using Python
Scatter plot or Bubble chart
Different patterns of data in bubble charts
Using Python
Dot plot
Using Python
Matrix plot
Using Python
Pareto Chart
Using Python
Histogram
Using Python
Stem and Leaf plot
Using Python
Box plot
Using Python
Outlier
Using Python
Quantile
Using Python
Standard deviation and Variance
Using Python
Range Rule of Thumb
Probability
Bayesian statistics
Using Python
Reliability Coefficient
Cohen's Kappa Coefficient
Using Python
Fleiss’ Kappa Coefficient
Using Python
Cronbach's alpha
Using Python
Coefficient of variation
Using Python
Chebyshev's Theorem
Using Python
Factorial
Using Python
Distribution, and Standardization
Using Python
Prediction interval
Using Python
Tolerance interval
Using Python
Parameters to describe the form of a distribution
Skewness
Kurtosis
Using Python
Different functions of distributions
Probability Density Function
Using Python
Cumulative Distribution Function
Using Python
Types of Distribution
Binomial Distribution
Using Python
Chi-squared Distribution
Using Python
Continuous Uniform Distribution
Using Python
Cumulative Poisson Distribution
Using Python
Exponential Distribution
Using Python
Normal Distribution
Using Python
Poisson Distribution
Using Python
Beta Distribution
Using Python
F Distribution
Using Python
Gamma Distribution
Using Python
Negative Binomial Distribution
Using Python
Gumbel Distribution
Using Python
Hypergeometric Distribution
Using Python
Inverse Gamma Distribution
Using Python
Log Gamma Distribution
Using Python
Laplace Distribution
Using Python
Geometric Distribution
Using Python
Level of Significance and confidence level
Statistical Estimation
Interval Estimation
Using Python
Best Point estimation
Using Python
Correlation
Central Limit Theorem
Standard Error of the Mean
Using Python
Statistical Significance
Tests for Non-normally distributed data
One tail and two tail tests
Mood’s Median Test
Using Python
Goodness of Fit
Chi-square test
Using Python
McNemar Test
Using Python
Kolmogorov-Smirnov Test (KS-Test)
Using Python
One-tailed test, two-tailed test, and Wilcoxon Rank Sum Test / Mann Whitney U Test
Using Python
The Sign Test
Using Python
Wilcoxon Signed Rank Test
Using Python
The Kruskal-Wallis Test
Using Python
Degrees of Freedom
The Friedman Test
Using Python
Tests for Normally distributed data
Unpaired “t” test
Using Python
Paired “t” test
Using Python
Analysis of Variance (ANOVA)
Sum of Squares
Residual Sum of Squares
One way ANOVA
Using Python
Two way ANOVA
Using Python
Different types of ANOVAs
Using Python – General MANOVA
Factor Analysis
Path Analysis
Structural Equation Modeling
Effect size
Using Python
Odds Ratio and Mantel-Haenszel Odds Ratio
Using Python
Correlation Coefficient
Using Python
R-squared and Adjusted R-squared
Regression Analysis
Using Python
Logistic Regression
Using Python
Black-Scholes model
Using Python
Combination
Using Python
Permutation
Using Python
Even and Odd Permutations
Circular Permutation
Survival Analysis
Kaplan-Meier method
Using Python
Bonus Topics
Most commonly used non-normal distributions in health, education, and social sciences
Circular permutation in Nature
Time Series
Using Python
Monte Carlo Simulation
Density Estimation
Decision Tree
Meta-analysis
Important Statistical Techniques/Procedures used in Medical Research
Lite Colorful Statistics with Basic Steps in Python Programming Language
Usman Zafar Paracha
M. Phil. Pharmaceutics, Rawalpindi, Pakistan (2020)
This book will help students learn and use some basic concepts of statistics with the Python programming language. Any feedback will be highly appreciated.
Usman Zafar Paracha
Owner of SayPeople.com
[email protected]
https://www.facebook.com/usmanzparacha
Some words from the author
I tried to make this book on statistics as informative as possible, especially for beginners in statistics. Spyder (The Scientific PYthon Development EnviRonment) 3.2.8 (Python 3.6) has been used for this book. Anaconda can be downloaded from this link: https://www.continuum.io/downloads. Instructions for opening Spyder after installation of Anaconda can be found here: https://docs.continuum.io/anaconda/ide_integration. It is assumed that the reader knows about the software and how to start and use it (even at a beginner level). This book may also contain some trademarked names without the trademark symbol. However, they are used only in an editorial context, and there is no intention of trademark infringement. It is important to note that the calculations and examples used in this book cannot take the place of actual research. Statistics has to be used under the guidance of experts. Readers with authentic comments and/or feedback (on Amazon) can ask me questions or send me a "Message" here: https://www.facebook.com/usmanzparacha, and I will try to answer them.
Hypothesis
Statistics is an art as well as a science, and it needs help from both fields for explanation: you need strong imagination as an artist and well-designed research as a scientist. Let's start with the concept of hypotheses. According to an ancient Chinese myth, there is an entirely different world behind mirrors. That world has its own creatures and is known as the Fauna of Mirrors. So, if a person is of the opinion that the fauna of mirrors actually exists, and he wants to perform research on the fauna of mirrors, his hypothesis will be that the fauna of mirrors actually exists. This is known as the "research hypothesis" or "alternative hypothesis", as the person is doing research on this hypothesis. It is represented by Ha or H1. On the other hand, there is another opinion: that there is nothing like the fauna of mirrors. This opinion can be considered the "null hypothesis", as it negates the statement of the research hypothesis and holds that the phenomenon under study is not real. The null hypothesis is represented by Ho or H0. For the research hypothesis to be supported, the null hypothesis usually has to be rejected. So, in order to establish the fauna of mirrors after performing research, the null hypothesis has to be rejected. In short, according to the null hypothesis nothing has changed and no significant new thing can be found (anywhere, in any group), whereas according to the alternative hypothesis some significant change has occurred or some significant new thing can be found (somewhere, or in some group).
Hypothesis testing
Suppose there are several circles that are assumed to have diameters of about 5 cm.
1. Specify the hypotheses
Null hypothesis (H0): population mean = µ0 = 5 cm
Alternative hypothesis (H1): population mean = µa < 5 cm, > 5 cm, or ≠ 5 cm
2. Determine the sample size
How many circles need to be measured to have a good chance of detecting a difference from 5 cm?
3. Choose a significance level (α)
Selecting a significance level of α = 0.05 means accepting a 5% risk of concluding that the mean diameter differs from 5 cm when it actually does not.
4. Collect the data
Note down the diameters of 6 of the circles.
5. Compare the p-value from the test to the chosen α
Suppose the hypothesis test gives a p-value of 0.004, which is less than 0.05.
6. Decide whether to reject or accept H0
We reject H0 and conclude that the mean diameter is not 5 cm, as 0.004 is less than 0.05.
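As a quick illustration of these steps in Python, the following is a minimal sketch that runs a one-sample t-test on six made-up circle diameters against the hypothesized mean of 5 cm (the diameters, and therefore the printed p-value, are only assumptions for illustration; they are not taken from the example above):
from scipy import stats
# hypothetical diameters (in cm) of the 6 measured circles
diameters = [4.1, 4.4, 4.6, 4.2, 4.5, 4.3]
# one-sample t-test against the hypothesized population mean of 5 cm
t_stat, p_value = stats.ttest_1samp(diameters, 5)
print(t_stat, p_value)
If the printed p-value is less than 0.05, H0 is rejected and the mean diameter is concluded to differ from 5 cm.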
Frequency
Frequency is of different types, including absolute frequency, relative frequency, and cumulative frequency. Usually, frequency is reported as absolute frequency.
1. Absolute frequency: the number of times an event or incidence (i) occurred. Represented by ni.
2. Relative frequency: the number of times a specific event (i) occurred divided by the total number of events (N). Represented by fi = ni/N.
3. Cumulative frequency: the sum of all previously presented frequencies. Represented by n1+n2+n3+…+ni.
Numbers (with repetitions): 1,1,1,2,2,3,3,3,3,3,4,5,5,5,5
Number   Absolute frequency   Relative frequency   Cumulative frequency
1        3                    0.20 (=3/15)         3
2        2                    0.13 (=2/15)         5 (=3+2)
3        5                    0.33 (=5/15)         10 (=5+5)
4        1                    0.07 (=1/15)         11 (=10+1)
5        4                    0.27 (=4/15)         15 (=11+4)
Using Python
In order to work on the absolute frequency, relative frequency, and cumulative frequency, we can use these numbers: 1,1,1,2,2,3,3,3,3,3,4,5,5,5,5. So, write the following lines of codes:
import numpy as np
from scipy import stats
x=np.array([1,1,1,2,2,3,3,3,3,3,4,5,5,5,5])
relfreq = stats.relfreq(x, numbins=5)
cumfreq = stats.cumfreq(x, numbins=5)
freq=np.bincount(x)
print(freq)
print(relfreq.frequency)
print(cumfreq.cumcount)
This gives the following results:
[0 3 2 5 1 4]
[0.2        0.13333333 0.33333333 0.06666667 0.26666667]
[ 3. 5. 10. 11. 15.] Sample and Population Sample refers to a small part of something that is used to represent the whole (population). Suppose there are 15 balls in a box. It can be considered as a population (N). If four balls are selected randomly, then we have a sample (n) of four balls that can be of different colors. It can be considered as random selection. On the other hand, sample (n) of only two blue color balls can also be taken, but it could not be considered as random selection. Nevertheless, samples help in making inferences about population. Type I Error and Type II Error Consider the hypotheses mentioned earlier. Suppose initial findings on the hypotheses show that fauna of mirrors actually exist. So, it can be said that the null hypothesis is rejected. However, it is important to note that the
results from the initial findings could be false (or true) because of errors in the research. Those errors are known as the α error and the β error.
The α error is also known as a Type I error or a false positive. In this case we incorrectly reject the null hypothesis, i.e., the statement "there is nothing like the fauna of mirrors" appears wrong after performing an experiment when in reality it is right. It is considered a false positive because, under this error, we come to think that the alternative hypothesis is right (positive). This type of error can be caught by performing further tests. Moreover, changing the level of significance can also help in reducing Type I error.
On the other hand, the β error is also known as a Type II error or a false negative. In this case we incorrectly accept the null hypothesis. A Type II error is more serious than a Type I error because, after this error, nobody would do further research on the alternative hypothesis.
Errors of higher kinds can also be present. For example, a Type III error occurs when a researcher or investigator gives the right answer to the wrong question; it is also said to occur when an investigator correctly rejects the null hypothesis for the wrong reason(s).
Measuring the Central Tendencies (mean, median, and mode), and Range
The mean is the sum of all data values divided by the number of data values in the sample. It is calculated by the following formula:
x̅ = Σx ⁄ n
where Σx is the sum of all data values, and n is the number of data values in the sample. In the case of a population, the following formula is used:
µ = Σx ⁄ N
where N is the number of data values in the population.
The median is the middle value (or the mean of the two middle values).
Mode is/are the value/s which appear/s most often. Range is the difference between the highest value and the lowest value. For example, we have a data set: 1,2,2,3,4,5,6,7,8,8,9,10,11. Mean value is calculated as (1+2+2+3+4+5+6+7+8+8+9+10+11) / 13 = 5.846. Median value is 6 as 50% of the values in the dataset are below 6 and 50% of the values in the dataset are above 6. There are two modes in the dataset, i.e., 2 and 8. Range is 11 – 1 = 10. Using Python In order to find the mean, median, mode, and range, we can use the numbers (as noted above). So, the numbers are 1,2,2,3,4,5,6,7,8,8,9,10, and 11. We have first to (1) import statistics as st, and (2) import math as mt. So, these are imported as import statistics as st import math as mt After importing them, we can find the mean by writing the following: x=st.mean([1,2,2,3,4,5,6,7,8,8,9,10,11]) print(x) It gives the answer 5.846153846153846. In the same way, we can find the median value of the entire set of numbers, x=st.median([1,2,2,3,4,5,6,7,8,8,9,10,11]) print(x) It gives the answer 6. In the same way, we can find the mode value of the entire set of numbers, x=st.mode([1,2,2,3,4,5,6,7,8,8,9,10,11]) print(x) As there are more than one mode, i.e. 2 and 8, in this dataset, so the following
output is obtained: "StatisticsError: no unique mode; found 2 equally common values"
Suppose we add another "8" and try to get the mode:
x=st.mode([1,2,2,3,4,5,6,7,8,8,8,9,10,11])
print(x)
This gives the answer 8.
In order to find the range of the entire set of numbers, we have to import numpy, an important package for scientific computing with Python:
import numpy as np
Then the range is found by writing the following:
x=np.min([1,2,2,3,4,5,6,7,8,8,9,10,11])
y=np.max([1,2,2,3,4,5,6,7,8,8,9,10,11])
z=y-x
print(z)
It gives the answer 10.
Geometric Mean
The geometric mean refers to the nth root of the product of a set of data having n numbers: for example, the square root of the product of 2 numbers, the cube root of the product of 3 numbers, and so on. It is calculated as:
Geometric mean = (x1 × x2 × x3 × … × xn)^(1/n)
In this equation, n is the total count of numbers in the data set, and x1×x2×x3×…×xn is the product of the data values. Suppose a geometric figure has different values of height, width, and length, or there is annual variation in a price; the geometric mean then gives a single representative mean value for them.
Example:
Suppose there is a rectangle with a length of 32 cm and a width of 2 cm; then the geometric mean will be √(32 × 2) = 8 cm. (If you try to make a figure with these dimensions, you will get a square with sides of 8 cm.)
Example:
Suppose the value of a piece of land priced at $80,000 increases by 10% in the 1st year, 20% in the 2nd year, and 50% in the 3rd year.
Manual calculation:
After one year, the price will be $80,000 × 110% (or 1.1) = $88,000.
After two years, the price will be $88,000 × 120% (or 1.2) = $105,600.
After three years, the price will be $105,600 × 150% (or 1.5) = $158,400.
With the help of the geometric mean:
The geometric mean of the yearly increase is (110% × 120% × 150%)^(1/3) = (1.1 × 1.2 × 1.5)^(1/3) = 1.25571.
The value after 3 years will be $80,000 × (1.25571)^3 = $158,401.
Using Python
In order to find the geometric mean, we can use the example of the percentage increase in the price of land (as noted above). So, the increases are 10% in the first year, i.e. (making 110% or) 1.1; then 20%, i.e. (making 120% or) 1.2; and finally 50%, i.e. (making 150% or) 1.5. So, write the following lines of codes:
from scipy import stats
print(stats.gmean([1.1,1.2,1.5]))
In this case, the result (geometric mean) is 1.2557072356438912. So, this is the geometric mean of the yearly increase in our example.
Grand Mean
The grand mean of a set of samples is the sum of all the values in the data divided by the total number of values. One use of the grand mean is in Analysis of Variance (ANOVA) to calculate the sum of squares (SSQ).
Example:
Suppose there are three cities in which the daily earnings of five random people are as follows:
City     Person #1   Person #2   Person #3   Person #4   Person #5
City A   10          60          70          80          40
City B   50          20          80          120         60
City C   80          30          90          100         70
Compute all the means (here, the mean for each person across the three cities):
Person #1: (10+50+80)/3 = 46.67
Person #2: (60+20+30)/3 = 36.67
Person #3: (70+80+90)/3 = 80
Person #4: (80+120+100)/3 = 100
Person #5: (40+60+70)/3 = 56.67
Calculate the mean of these means:
Grand mean = (46.67 + 36.67 + 80 + 100 + 56.67)/5 = 64
OR divide the sum of all the values by the total number of values:
Grand mean = (10+60+70+80+40+50+20+80+120+60+80+30+90+100+70)/15 = 960/15 = 64
Using Python
In order to calculate the grand mean, we can take the example as noted above. Write the following lines of codes:
import statistics as st
a=st.mean([10,50,80])
b=st.mean([60,20,30])
c=st.mean([70,80,90])
d=st.mean([80,120,100])
e=st.mean([40,60,70])
gm=st.mean([a,b,c,d,e])
print(gm)
In this case, the result (grand mean) is 64.
Harmonic Mean
The harmonic mean is the number of observations in a data set divided by the sum of the reciprocals of the observations. It is calculated by the following equation:
Harmonic mean = n / (1/x1 + 1/x2 + … + 1/xn)
Suppose there are three numbers: 5, 8, and 13. The harmonic mean is calculated as:
Harmonic mean = 3 / (1/5 + 1/8 + 1/13) = 3 / 0.4019 ≈ 7.4641
Weighted harmonic mean
It is calculated by the following equation:
Weighted harmonic mean = (w1 + w2 + … + wn) / (w1/x1 + w2/x2 + … + wn/xn)
Suppose there are three numbers: 5, 8, and 13, with weights (w): 2, 1, 1. Then:
Weighted harmonic mean = (2 + 1 + 1) / (2/5 + 1/8 + 1/13) = 4 / 0.6019 ≈ 6.6454
Using Python In order to calculate the harmonic mean, we can take the example as noted above. So, the numbers are 5, 8, and 13. Now, write the following lines of codes: import statistics as s
x=([5,8,13])
print(s.harmonic_mean(x))
In this case, the result (harmonic mean) is 7.464114832535885.
In the case of the weighted harmonic mean, we enter the values 5, 5, 8, and 13 in the code (listing 5 twice reproduces its weight of 2). So, write the following lines of codes:
import statistics as s
x=([5,5,8,13])
print(s.harmonic_mean(x))
In this case, the result (weighted harmonic mean) is 6.645367412140574.
Mean Deviation
The mean deviation is the mean of the absolute deviations of the observed (variable) values from the mean value. It is represented by the following equation:
Mean deviation = Σ|x − µ| / N
In this equation, "x" is a variable value; "µ" is the mean value; the vertical bars (| |) represent the absolute value, i.e., ignoring minus signs; "N" is the number of observations or values, and "D" stands for the deviation (difference).
As an example, suppose we have three values: 3, 4, and 6. Their mean is (3+4+6)/3 = 4.33. So,
Σ|x − µ| = |3 − 4.33| + |4 − 4.33| + |6 − 4.33| = 1.33 + 0.33 + 1.67 = 3.33
And
Mean deviation = 3.33 / 3 ≈ 1.11
Using Python In order to work on mean deviation in Python, we can use the same example as noted above. So, the example numbers are 3, 4, and 6. Now write the following lines of codes: import pandas as pd
import numpy as np
x =[3,4,6]
series = pd.Series(x)
mad = series.mad()
print(mad)
In this case, the result (mean deviation) is 1.111. (Note that Series.mad() has been removed in recent versions of pandas; the same value can be obtained as the mean of the absolute deviations from the mean.)
Mean Difference
It is also known as the difference in means. It determines the absolute difference between the two groups' mean values, thereby helping to quantify the change on average. So,
Mean difference = Mean of one group – Mean of the other group
Using Python
Suppose in one group we have the numbers 3, 4, and 6, and in the other group we have the numbers 2, 3, and 4. Now write the following lines of codes:
import numpy as np
x=np.array([3,4,6])
y=np.array([2,3,4])
print(x-y)
In this case, the result is [1 1 2]. These are the element-wise differences; the mean difference itself is the mean of x minus the mean of y, i.e., 4.33 − 3 = 1.33.
Root mean square
The root mean square is the square root of the sum of the squares of the observations or values (x1², x2², …, xn²) divided by the total number of values (n). It is represented by the following equation:
Root mean square (RMS) = √((x1² + x2² + … + xn²) / n)
The root mean square is used for predicting the speed of molecules, such as
that of gas, at a given temperature. This equation is also commonly used for measuring the voltage of an alternating current.
Using Python
In order to work on this in Python, we can use the numbers 3, 4, and 6 as an example. So, write the following lines of codes:
from math import sqrt
def rms(numb):
    return sqrt(sum(n*n for n in numb)/len(numb))
print(rms([3,4,6]))
In this case, the result (root mean square) is 4.509249752822894.
Sample mean and population mean
The sample mean and the population mean are two different things. The sample mean is represented by x̄, and the population mean is represented by μ. A population consists of all the elements in a collection of data (N), and a sample consists of some observations from that data (n). Here,
x̄ = Σx ⁄ n
and
µ = Σx ⁄ N
In these equations, Σx = x1 + x2 + … + xn is the sum of the data values.
And “x” is an individual value, “N” represents all the elements from a collection of data, and “n” represents some observations from that data. Types of data in statistics Usually, there are three types of data, including Nominal data, Ordinal data, and Interval data. Nominal data is represented by names or categories as, for example, eggs of different colors such as brown, blue, white, rose, green, and spotted eggs. Ordinal data is represented by ordered data in which the items are ranked or graded as, for example, suppose green colored eggs have three
different shades, so they can be arranged in an ordinal data as light green, green, and dark green. Interval data is represented by numbers that are not only arranged in a series but also arranged with an exact difference between them as, for example, suppose, we have a series of numbers: 4,5,6,7,7,9,11. First four numbers (i.e. 4,5,6,7) have equal and definite interval, and show one interval data. And last three numbers (i.e. 7,9,11) are arranged with a different set of equal and definite interval; thereby, showing another interval data. Data Collection and its types Data collection is the process of collecting information in relation to selected variables. There are two main types: primary data and secondary data. Primary data can be categorized into Qualitative data and Quantitative data. On the other hand, the example of secondary data is the data obtained from the U.S. Census Bureau. Qualitative data is the data that can only be described and cannot easily be evaluated using statistics as, for example, attitude of people towards a certain policy. It can be obtained through in-depth interviews, focus group interviews, and indirect interviews, such as story completion study. Quantitative data is the data that can be expressed in numbers and can be evaluated with the help of statistics as, for example, weights, heights, and ages of people, and/or their mean values. It can be obtained through surveys with closed-ended questions; clinical trials or experiments, such as randomized controlled trials, and observational study, such as cross-sectional survey, case-control study, and cohort study in which cohort refers to a group of individuals having specific characteristics. Sampling plan It is the detailed outline of study measurements, including types, times, materials, manner/procedure, and person involved. The steps in the sampling plan may include (1) identification of attributes or parameters, such as ranges of probable values, or required resolution for selecting the appropriate equipment or instrument, (2) selection of sampling method, (3) choosing the sample sizes, while considering population parameters to be studied, cost of sampling, available/present knowledge, practicality, and spread or variability of the population, (4) choosing the sampled data storage format, (5)
assignment of roles and responsibilities to every individual related to the study, and (6) verification of the sampling plan and its execution. Sampling Methods Sampling methods are of two main types: (1) Probability sampling and (2) Non-probability sampling. Probability sampling refers to the known probability or chance of samples to be selected. It can be of different types, including (a) Simple random sampling in which each sample has an equal chance of being selected, (b) Stratified sampling that can be proportionate or disproportionate, (c) Cluster sampling that can be one stage, two stage, or multi-stage, (d) Systematic sampling that refers to random selection of nth sample and then selecting every nth sample in succession, and (e) Multistage sampling that is the combination of two or more sampling methods. Non-probability sampling refers to no known probability or chance of samples to be selected. It can be of different types, including (a) Volunteer samples, and (b) Convenience (haphazard) samples. However, these are considered as “sampling disasters.” Required Sample Size Sample size can be determined either through a subjective approach or through a mathematical approach. The subjective approach depends on (1) the nature of the population, e.g., homogeneity or heterogeneity of population, (2) the nature of respondent, e.g., accessibility and availability of respondents, (3) the nature of study, (4) sampling technique used, (5) complexity of findings/tabulation, e.g., quantity and types of discoveries to be assembled, (6) availability of time and resources, e.g., little time and resources may lead to little sample size, (7) levels of required precision and accuracy, e.g., larger sample size for more precision and accuracy. The mathematical approach consists of different formulae, including Standard formula / Cochran’s sample size formula, and Slovin’s formula. Standard formula / Cochran’s sample size formula is represented by the following equation:
n0 = (Z² × p × q) / e²
where "Z" is the z-score from the z-table, "p" is the estimated proportion of the population, "q" is equal to 1 − p, and "e" is the margin of error, i.e., the desired level of precision. This equation can also be modified when the population is small. The modified equation is as follows:
n = n0 / (1 + (n0 − 1) / N)
where "n" is the adjusted sample size, n0 is the sample size from the standard formula above, and "N" is the population size. On the other hand, Slovin's formula is represented by the following equation:
n = N / (1 + N × e²)
where "N" is the population size and "e" is the margin of error.
Using Python
Cochran's sample size formula
In order to work on Cochran's sample size formula in Python, we can develop an example. Suppose a researcher wants to know, with 5% precision, how many people in a city of 10,000 people have cars, and assume with 95% confidence that 60% of all the people have cars. The z-value at the 95% confidence level is 1.96 per the normal table. Now write the following line of code:
print((((1.96)**2)*0.6*0.4)/0.05**2)
The result (sample size) is 368.7936, or simply 369. This result is according to Cochran's sample size formula.
Cochran's sample size formula when the population size is small
In order to work on Cochran's sample size formula when the population size is small, for example, 1,000 people:
print(369/(1+(368/1000)))
The result (sample size) is 269.7368, or simply 270. This result is according to the modification of Cochran's sample size formula when the population
size is small.
Slovin's formula
In order to work on Slovin's formula in Python, write the following line of code:
print(10000/(1+10000*0.05**2))
The result (sample size) is 384.61538, or simply 385. This result is according to Slovin's formula.
Simple Random Sampling
Here we have a population with "N" items or objects and a sample with "n" items or objects. In this sampling method, each sample has an equal chance of being selected, which eventually gives an unbiased, representative sample. Suppose we have 15 (N) different colored balls in a box, and with simple random selection, four (n) different colored balls are selected. Simple random sampling can be done with replacement or without replacement. In the case of simple random sampling with replacement, once a sample is selected, it is placed back in the population (i.e., the selected sample is replaced). For example, from a box containing 10 different colored balls, three different colored balls are selected one by one, and each is placed back into the box after selection. In the case of simple random sampling without replacement, once a sample is selected, it is not placed back in the population (i.e., the selected sample is not replaced), and further samples are selected from the remaining items. For example, from a box containing 10 different colored balls, three different colored balls are selected and none are placed back into the box. Sampling without replacement is considered the preferred method.
Cluster Sampling
In cluster sampling, a population is divided into separate groups known as clusters, and then simple random sampling of clusters (i.e., sampled clusters) is done. It can be of three types: (1) one-stage cluster sampling, in which data is collected from every item/unit in the sampled clusters, (2) two-stage cluster sampling, in which data is collected from randomly selected items/units in the sampled clusters, and (3) multi-stage cluster sampling, in which
sampled clusters or items/units are selected in multiple stages. Suppose, we have a population of 15 squares and circles (i.e. 15 items) of different colors. If we place these 15 different colored squares and circles into five groups each with three different colored squares and circles (i.e. three items) in each group, then these five groups can be considered as clusters. Again, if we select three clusters from these five clusters, then it is known as one stage cluster sampling. If we select six different colored squares and circles (i.e. six items) from three clusters selected in the one stage cluster sampling, then it is known as two stage cluster sampling. However, if we conduct simple random sampling of two clusters out of three clusters selected in the one stage cluster sampling, then it is known as multi-stage cluster sampling. Stratified Sampling In stratified sampling, a population is divided into separate groups known as strata, and then simple random sampling of units/items from each strata is done. It can be of different types, including proportionate stratified sampling in which number of items/units from each stratum is proportionate (according) to the size of the stratum, and disproportionate stratified sampling in which number of items/units from each stratum is disproportionate or variable (not according) to the size of the stratum. For example, we have a population of 15 squares and circles (i.e., 15 items) of different colors. Then the formation of five groups with different number of squares and circles (items) in each group, could be considered as the formation of five strata. For instance, in the first strata, there are two circles (two items); in the second strata, there are three circles and one square (four items); in the third strata, there are four squares and one circle (five items); in the fourth strata, there is one circle (one item), and in the fifth strata, there are three circles (three items). If we select exactly half number of items from each strata, i.e., one item in first strata, two items in second strata, two and a half items in the third strata, half item in the fourth strata, and one and a half items in fifth strata, then this can be considered as proportionate stratified sampling. On the other hand, if we select random number of items from each strata, i.e., two items in the first strata, two items in the second strata, one and a half item in the third strata, one item in the fourth strata, and one item in the fifth strata, then this can be
considered as disproportionate stratified sampling.
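To illustrate the simple random sampling described above in Python, here is a minimal sketch (the population of 15 labeled items, the sample size, and the seed are made up only for illustration) using the standard random module:
import random
# hypothetical population of 15 labeled items
population = ['item' + str(i) for i in range(1, 16)]
random.seed(1)  # only to make the illustration reproducible
# simple random sampling without replacement (each item can be picked at most once)
print(random.sample(population, 4))
# simple random sampling with replacement (an item can be picked more than once)
print(random.choices(population, k=4))
random.sample never repeats an item, while random.choices can, which corresponds to the distinction between sampling without and with replacement described above.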
Graphs and Plots
Bar Graph
A bar graph represents different categories as bars on a graph. Suppose a person has performed a survey on why people travel and gets the following information:
Reason of traveling             Percentage of people
Visiting family                 25
Spending time with friends      15
Work-related                    18
Personal issues                 10
Escaping the colder climate     6
Discovering a new culture       11
Leisure                         15
The bar chart/graph is as follows:
It is a graphical representation of different categories as bars. In a bar graph, the y-axis usually represents the values of the categories, and the x-axis represents the categories. The heights of the bars are proportional to the values of the categories.
Using Python
In order to develop the bar graph, we can use the example (as noted above). To get the bar graph, write the following lines of codes:
import matplotlib.pyplot as plt; plt.rcdefaults()
import numpy as np
Reason_of_travelling = ('Visiting family', 'Spending time with friends', 'Work-related', 'Personal issues', 'Escaping the colder climate', 'Discovering a new culture', 'Leisure') y_pos = np.arange(len(Reason_of_travelling)) Percentage_of_people = [25,15,18,10,6,11,15]
plt.bar(y_pos, Percentage_of_people, align='center', alpha=0.5) plt.xticks(y_pos, Reason_of_travelling, fontsize=10, rotation=90) plt.xlabel('Reason of travelling') plt.ylabel('Percentage of People') plt.title('Chart of Percentage of people and Reason of travelling')
plt.show() A chart appears as follows:
Pie Chart
A pie chart, also known as a pie graph, is a statistical chart presented in a circular form in which numerical proportions are divided into slices. Suppose the following information has been obtained regarding the cost of construction of a house:
Cost of construction of house              Supposed amount spent in USD    Percentage
Labor                                      60000                           20
Timber                                     18000                           6
Electrical works                           18000                           6
Supervision                                90000                           30
Steel                                      30000                           10
Bricks                                     30000                           10
Cement                                     27000                           9
Design and fee for Engineer/Architect      15000                           5
Other                                      12000                           4
The pie chart/graph is as follows:
It is a graphical representation of different categories in a circular form. It has no x-axis or y-axis. In this graph, the 360 degrees of the circle represent 100%. The percentages are obtained by dividing each value by the total value and
multiplying by 100.
Using Python
In order to develop the pie chart, we can use the example (as noted above). So, write the following lines of codes:
import matplotlib.pyplot as plt labels = 'Labor', 'Timber', 'Electrical works', 'Supervision', 'Steel', 'Bricks', 'Cement', 'Design and fee for Engineer/Architect', 'Other' sizes = [20, 6, 6, 30, 10, 10, 9, 5, 4] fig1, ax1 = plt.subplots() ax1.pie(sizes, labels=labels, autopct='%1.1f%%', shadow=True, startangle=90) ax1.axis('equal') plt.show() A chart appears as follows:
Scatter plot or Bubble chart
A scatter plot or scatter chart graphically displays the relationship between two variables (two sets of data) in the form of dots. Suppose the number of tourists coming to Pakistan during the last 7 years is as follows:
Year    Number of tourists
2011    1167000
2012    966000
2013    565212
2014    530000
2015    563400
2016    965498
2017    1750000
The scatter plot associated with this data is as follows:
It shows a graphical representation of the data in the form of dots. The position of a dot refers to its values on the x-axis and y-axis. A scatter chart is called a bubble chart if the data points (dots) in the graph are replaced with circles (bubbles) of variable size and characteristics, for example, circles with only an outline and no color inside.
Different patterns of data in bubble charts Bubble charts could be of different types depending on linearity, slope, and strength.
Using Python
In order to develop the scatter plot, we can use the example (as noted above). Write the following lines of codes:
import matplotlib.pyplot as plt
x = ([2011,2012,2013,2014,2015,2016,2017]) y = ([1167000,966000,565212,530000,563400,965498,1750000])
plt.scatter(x, y, alpha=0.5) plt.title('Scatter plot of Number of tourists vs Year') plt.xlabel('Year') plt.ylabel('Number of tourists') plt.show() A simple chart appears as follows:
In order to draw Bubble chart, write the following lines of codes: import numpy as np import matplotlib.pyplot as plt x = ([2011,2012,2013,2014,2015,2016,2017]) y = ([1167000,966000,565212,530000,563400,965498,1750000]) area = (30 * np.array([1.167000,.966000,.565212,.530000,.563400,.965498,1.750000]))**2
plt.scatter(x, y, s=area, alpha=0.5) plt.show() A simple chart appears as follows:
Dot plot
A dot plot displays data graphically in the form of dots. Suppose the number of siblings in 7 houses in an area is as follows:
House number    Number of siblings
1               4
2               3
3               1
4               0
5               5
6               2
7               2
The dot plot is as follows:
It shows a graphical representation of the data in the form of dots.
Using Python
In order to develop the dot plot, we can use the example (as noted above). Write the following lines of codes:
import matplotlib.pyplot as plt
a = ([1,2,3,4,5,6,7]) b = ([4,3,1,0,5,2,2]) c = ([3,2,0,0,4,1,1]) d = ([2,1,0,0,3,0,0])
e = ([1,0,0,0,2,0,0]) f = ([0,0,0,0,1,0,0]) g = ([0,0,0,0,0,0,0]) plt.scatter(a,b,c='#A9A9A9') plt.scatter(a,c,c='#A9A9A9') plt.scatter(a,d,c='#A9A9A9') plt.scatter(a,e,c='#A9A9A9') plt.scatter(a,f,c='#A9A9A9') plt.scatter(a,g,c='#A9A9A9')
plt.show() A simple chart appears as follows:
Matrix plot
A matrix plot is used to study the association of several pairs of variables at a time. Suppose we have the following data with four variables (A, B, C, and D):
A    B     C     D
1    3.6   4.9   4.1
1    5.0   3.5   4.1
1    4.7   4.1   4.4
1    6.1   7.3   5.3
2    7.6   5.2   2.4
2    6.9   4.7   4.0
2    5.6   5.0   5.4
2    5.0   5.7   5.8
The matrix plot is as follows:
It is a graphical representation of different variables in the form of dots. The position of a dot refers to its value on the x-axis and y-axis of the two specific variables.
Using Python
In order to develop the matrix plot, we can use the same example (as noted above). Write the following lines of codes:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
A=([1,1,1,1,2,2,2,2])
B=([3.6,5.0,4.7,6.1,7.6,6.9,5.6,5.0])
C=([4.9,3.5,4.1,7.3,5.2,4.7,5.0,5.7])
D=([4.1,4.1,4.4,5.3,2.4,4.0,5.4,5.8])
df = pd.DataFrame({'A':A, 'B':B, 'C':C, 'D':D})
sns.pairplot(df)
plt.show()
A simple chart appears as follows:
Pareto Chart
A Pareto chart, also known as a Pareto distribution diagram, is used to identify the most commonly experienced defects or causes of customer complaints so that the most beneficial areas for improvement can be targeted. It is based on the Pareto principle, which states that about 80% of the output is associated with 20% of the input. Suppose a hotel gets the following complaints along with their counts:
Complaint                        Counts
Difficult parking                175
Bad behavior of receptionists    25
Poor lightning                   11
Room service is not good         21
Cleaning issues                  145
Overpriced                       12
Too noisy                        32
Confusing layout                 16
The Pareto chart is as follows:
This graph shows that improvements in "Difficult parking", "Cleaning issues", and "Too noisy" could solve about 80% of the complaints.
Using Python
In order to develop the Pareto chart, we can use the same example (as noted above). Now write the following lines of codes:
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter
df = pd.DataFrame({'Complaint': [175,25,11,21,145,12,32,16]})
df.index = ['Difficult parking', 'Bad behavior of receptionists','Poor lightning','Room service is not good','Cleaning issues','Overpriced','Too noisy','Confusing layout']
df = df.sort_values(by='Complaint',ascending=False)
df["cumpercentage"] = df["Complaint"].cumsum()/df["Complaint"].sum()*100
fig, ax = plt.subplots() ax.bar(df.index, df["Complaint"], color="C0") ax2 = ax.twinx() ax2.plot(df.index, df["cumpercentage"], color="C1", marker="D", ms=7) ax2.yaxis.set_major_formatter(PercentFormatter())
ax.tick_params(axis="y", colors="C0") ax2.tick_params(axis="y", colors="C1") ax.tick_params(axis="x",labelrotation=90) plt.show() A simple chart appears as follows:
Histogram
A histogram is also a graph that groups numbers into ranges. Suppose the number of siblings in 7 houses in an area is as follows:
House number    Number of siblings
1               4
2               3
3               1
4               0
5               5
6               2
7               2
The histogram is as follows:
It is a graphical representation of data in the form of bars, but without any gaps between them. The x-axis shows different ranges: 0–2, 2–4, and 4–6. This graph shows that there are 4 houses with 0 to 2 siblings, and so on.
Using Python
In order to develop the histogram, we can use the example (as noted above). In Python we have to import numpy and matplotlib.pyplot, and write the following lines of codes:
import numpy as np
import matplotlib.pyplot as mp
hist, bin_edges = np.histogram([4,3,1,0,5,2,2], bins = range(7))
mp.bar(bin_edges[:-1], hist, width = 1)
mp.xlim(min(bin_edges), max(bin_edges))
mp.show() A simple graph appears as shown below:
Stem and Leaf plot
In a stem and leaf plot, the data values are split into the first digit or digits, which form the "stem", and the last digit, which forms the "leaf". Suppose we have the following numbers:
15,14,17,12,23,24,29,35,47,41,59,62,67,69
The stem and leaf plot is as follows:
Stem   Leaf
1      5 4 7 2
2      3 4 9
3      5
4      7 1
5      9
6      2 7 9
Here, the first number, i.e., 15, splits into 1 (stem) and 5 (leaf), and the last number, 69, splits into 6 (stem) and 9 (leaf). In this way, large amounts of data can quickly be organized.
Using Python
In order to develop the stem and leaf plot, we can use the example (as noted above): 15,14,17,12,23,24,29,35,47,41,59,62,67,69
First install the module “stemgraphic” by writing the following line in the console: !pip install stemgraphic Now write the following lines of codes: import stemgraphic x = [15,14,17,12,23,24,29,35,47,41,59,62,67,69] stemgraphic.stem_graphic(x) Results appear as follows:
Box plot The box plot (also known as box and whisker diagram) is a graphical way of representing the distribution of data on the basis of five number summary: (1) minimum, (2) first quartile, (3) median, (4) third quartile, and (5) maximum. The difference between the third quartile (Q3) and the first quartile (Q1) is referred to as interquartile range. There could also be an outlier that is an extreme value as shown below:
An example of box plot is shown below:
In this graph, the median values show that people work more and more as the week goes on, and then work less on the weekend.
Using Python
In order to generate a boxplot, we can use the following example:
                              School 1   School 2   School 3
Number of desks in class 1    54         42         56
Number of desks in class 2    60         48         68
Number of desks in class 3    65         53         78
Number of desks in class 4    66         54         80
Number of desks in class 5    67         55         82
Number of desks in class 6    69         57         86
Number of desks in class 7    70         58         88
Number of desks in class 8    72         60         92
Number of desks in class 9    73         61         94
Number of desks in class 10   75         63         98
Write the following lines of codes:
import matplotlib.pyplot as plt School1=([54,60,65,66,67,69,70,72,73,75]) School2=([42,48,53,54,55,57,58,60,61,63]) School3=([56,68,78,80,82,86,88,92,94,98]) boxplot=[School1,School2,School3] plt.boxplot(boxplot) plt.show() A simple chart appears as follows:
Outlier
An outlier is a value that lies at an abnormal distance from (outside of) the other values. It may indicate some sort of problem in the data, such as a case (associated with that value) that does not fit the model under study, or an error that occurred during measurement. For example, a supercomputer in a lab of personal computers would be considered an outlier. Examples of outliers in the form of graphs are as follows:
The value at 6 (on x-axis) is an outlier. Box plots are commonly used to find or display outliers.
Outliers can be found on Tuesday, Thursday, and Saturday.
Using Python
Suppose the dataset is as follows:
342.962, 346.033, 345.917, 344.717, 343.048, 345.855, 344.717, 395.548, 345.464, 345.703
In order to find the outliers, write the following lines of codes, containing the dataset and flagging any point that lies more than 2 standard deviations from the mean:
import numpy as np
dataset= ([342.962, 346.033, 345.917, 344.717, 343.048, 345.855, 344.717, 395.548, 345.464, 345.703])
outliers=[]
def outlier(x):
    stdev=2
    mean_1 = np.mean(x)
    std_1 = np.std(x)
    for y in x:
        z_score = (y - mean_1)/std_1
        if np.abs(z_score) > stdev:
            outliers.append(y)
    return outliers
outlier_value = outlier(dataset)
print(outlier_value)
This gives the following result:
[395.548]
This shows that 395.548 is an outlier. Another way is to use the box plot. In this case, write the following lines of codes:
import matplotlib.pyplot as plt
dataset= ([342.962, 346.033, 345.917, 344.717, 343.048, 345.855, 344.717, 395.548, 345.464, 345.703])
plt.boxplot(dataset)
A simple chart appears as follows:
This graph shows that one value, which is close to 395, is far away from other values, i.e. an outlier. Quantile
This word comes from "quantity". Usually, a quantile is obtained by dividing a sample into adjacent and equal-sized subgroups, OR by dividing a probability distribution into areas or intervals of equal probability. There are different types, including the quartile, decile, and percentile.
Quartile: one of the four subgroups of equal size in a sample, or one of the four areas of equal probability in a probability distribution. For grouped data, the jth quartile is
Qj = l + (h/f) × (jN/4 − c)
Decile: one of the ten subgroups of equal size in a sample, or one of the ten areas of equal probability in a probability distribution. For grouped data, the jth decile is
Dj = l + (h/f) × (jN/10 − c)
Percentile: one of the 100 subgroups of equal size in a sample, or one of the 100 areas of equal probability in a probability distribution. For grouped data, the jth percentile is
Pj = l + (h/f) × (jN/100 − c)
In these equations, h is the width of the quantile (quartile, decile, or percentile) group; N is the total number of observations or frequencies; c is the cumulative frequency preceding the quantile group; f is the frequency of the quantile group, and l is the lower boundary of the quantile group.
Example:
Suppose we have ten numbers: 3,2,5,6,7,8,1,10,9,4
Arranged in order, these numbers are: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
There are
Three quartiles, i.e. Q1 (lower quartile), Q2 (median), and Q3 (upper quartile) Nine deciles, i.e. D1 (between 1 and 2), D2 (between 2 and 3), D3 (between 3 and 4),…, D9 (between 9 and 10) Ninety nine percentiles, i.e. P1, P2,…,P99 Each showing a specific percentage of data, e.g. below Q3, 75% of data falls; below D3, 30% of data falls, and below P80, 80% of data falls, and so on... Using Python In order to work on percentiles, we can use the example (as noted above), i.e. 3,2,5,6,7,8,1,10,9,4. In Python, different results could be obtained depending on the types of “interpolation.” Write the following lines of codes: import numpy as np x = ([3,2,5,6,7,8,1,10,9,4]) print(np.percentile(x, [10,25,50,75,100], interpolation='linear')) This gives the following results: [ 1.9 3.25 5.5 7.75 10. ] Now write the following line: print(np.percentile(x, [10,25,50,75,100], interpolation='midpoint')) This gives the following results: [ 1.5 3.5 5.5 7.5 10. ] Now write the following line: print(np.percentile(x, [10,25,50,75,100], interpolation='higher')) This gives the following results: [ 2 4 6 8 10] Now write the following line: print(np.percentile(x, [10,25,50,75,100], interpolation='lower')) This gives the following results:
[ 1 3 5 7 10]
Now write the following line:
print(np.percentile(x, [10,25,50,75,100], interpolation='nearest'))
This gives the following results:
[ 2 3 5 8 10]
Standard deviation and Variance
The standard deviation helps in knowing the variability of the observations, i.e., the spread of the numbers about the mean. It is represented by the Greek letter sigma (σ). A low standard deviation shows that the values are close to the mean, i.e., close to the normal or required range. σ can be obtained by taking the square root of the variance, which is represented by σ². So, we have to calculate the variance to calculate the standard deviation. The variance is the average of the squared deviations about the mean. It is also important to note that standard deviation and variance are each of two types: population standard deviation and population variance, and sample standard deviation and sample variance, respectively. The population standard deviation is obtained by the following equation:
σ = √( Σ(x − µ)² / N )
where N is the number of data values, and µ is the population mean, which is measured by the following equation:
µ = Σx ⁄ N
On the other hand, the sample standard deviation is obtained by the following equation:
s = √( Σ(x − x̄)² / (n − 1) )
where x is an observed data value (representing each value in the data set), and x̄ is the sample mean, which is measured by the following equation:
x̄ = Σx ⁄ n
And the sample variance is given by s². In order to calculate Σ(x − x̄)² for the standard deviation or variance, suppose a person goes to 7 different shops and buys a different number of eggs from each shop:
Number of shop    Number of eggs (purchased)
1                 5
2                 7
3                 6
4                 4
5                 7
6                 9
7                 11
Mean number of eggs = 7
Squared deviation (or squared difference) about the mean = (Number of eggs – Mean of the numbers)²
(5 − 7)² = (−2)² = 4
(7 − 7)² = (0)² = 0
(6 − 7)² = (−1)² = 1
(4 − 7)² = (−3)² = 9
(7 − 7)² = (0)² = 0
(9 − 7)² = (2)² = 4
(11 − 7)² = (4)² = 16
Total = Σ(x − x̄)² = 34
Using Python
In order to calculate the sample variance and standard deviation, we can use the following example:
Number of flowering plant    Number of flowers on the plant
1                            5
2                            7
3                            6
4                            4
5                            7
6                            9
7                            11
So, we have the data as follows:
5,7,6,4,7,9,11 In order to calculate sample variance, we have to import statistics and write the following lines of codes: import statistics as st x=st.variance([5,7,6,4,7,9,11]) print(x) This gives the sample variance that is equal to 5.666667. In order to calculate population variance, write the following: import statistics as st x=st.pvariance([5,7,6,4,7,9,11]) print(x) This gives the population variance that is equal to 4.8571. In order to calculate standard deviation for sample, write the following lines of codes:
import statistics as st
x=st.stdev([5,7,6,4,7,9,11])
print(x)
In this case, the value is 2.3804.
In order to calculate the standard deviation for a population, write the following lines of codes:
import statistics as st
x=st.pstdev([5,7,6,4,7,9,11])
print(x)
In this case, the value is 2.2039.
Range Rule of Thumb
The range rule of thumb gives a rough estimate of the standard deviation from data obtained from known samples or a population:
s ≈ range/4
Here, range = (maximum value) – (minimum value), and s is the standard deviation.
Example:
Suppose we have ten numbers: 3,2,5,6,7,8,1,10,9,4
The population standard deviation is 2.87, and the sample standard deviation is 3.02. A rough estimate of the standard deviation through the range rule of thumb is
s ≈ (10 − 1)/4
s ≈ 9/4 = 2.25
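A short Python check of the range rule of thumb against the exact standard deviations (a sketch using the ten numbers above):
import statistics as st
x = [3,2,5,6,7,8,1,10,9,4]
# rough estimate of the standard deviation: range divided by 4
rough = (max(x) - min(x)) / 4
print(rough)            # 2.25
print(st.pstdev(x))     # population standard deviation, about 2.87
print(st.stdev(x))      # sample standard deviation, about 3.03
The rule only gives a ballpark figure; here it underestimates the exact values somewhat.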
Probability
Probability is a measure of how likely it is that some event will occur. Some of the basic rules of probability are as follows:
Probability Rule # 1: For any event A, the probability P(A) satisfies 0 ≤ P(A) ≤ 1, i.e., a probability can be as low as zero or as high as 1. Here P(A) = Number of favorable outcomes / Total number of outcomes = m/n.
Probability Rule # 2: A probability model for a sample space S satisfies P(S) = 1, i.e., the sum of the probabilities of all possible outcomes is 1.
Probability Rule # 3 (Complement Rule): For an event A, P(not A) = 1 – P(A), i.e., the probability of the event that is the complement of event A (i.e., not A) is 1 minus the probability of event A.
Probability Rule # 4 (The Addition Rule for Disjoint Events): If A and B are two disjoint or mutually exclusive events, P(A or B) = P(A) + P(B).
Probability Rule # 5 (The General Addition Rule): P(A or B) = P(A ∪ B) = P(A) + P(B) – P(A and B), i.e., the probability that event A or event B will occur is the probability that event A occurs plus the probability that event B occurs minus the probability of event A and event B occurring simultaneously (which is also written as P(A and B) = P(A∩B)).
Probability Rule # 6 (The Multiplication Rule for Independent Events): If A and B are two independent events, then P(A and B) = P(A∩B) = P(A) × P(B).
Probability Rule # 7 (Conditional Probability Rule): The conditional probability of event B, given event A, is P(B | A) = P(A and B) / P(A).
Probability Rule # 8 (General Multiplication Rule): For any two events A and B, P(A and B) = P(A∩B) = P(A) × P(B | A).
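As a small numerical illustration of these rules, the following sketch uses two made-up independent events (the probabilities 0.3 and 0.5 are only assumptions for illustration):
# hypothetical probabilities of two independent events A and B
p_a = 0.3
p_b = 0.5
print(1 - p_a)                  # complement rule: P(not A)
print(p_a * p_b)                # multiplication rule for independent events: P(A and B)
print(p_a + p_b - p_a * p_b)    # general addition rule: P(A or B)
print((p_a * p_b) / p_a)        # conditional probability P(B | A), equal to P(B) for independent events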
Bayesian statistics
Bayesian statistics is named for Thomas Bayes (1702-1761), an English priest. Bayes arrived at a theorem with the help of which a hypothesis can be evaluated based on the observation of its consequences. The following equation is commonly considered in Bayesian statistics:
P(A1|B) = [P(A1) × P(B|A1)] / P(B)
In this equation, P(A1|B) refers to the conditional probability of observing A1 given that B is true, and it is also considered the "Posterior." P(A1) refers to the probability of A1 without conditioning on the probability of B [i.e., P(B)] or any other probability/possibility, and it is also considered the "Prior distribution." P(B|A1) refers to the conditional probability of observing B given that A1 is true, and it is also considered the "Likelihood." P(B) refers to the probability of B without conditioning on the probability of A1 [i.e., P(A1)] or any other probability/possibility, and it is also considered the "Marginal" (it remains constant and affects all models in the same way). Based on these considerations, the above equation can also be written as:
Posterior = (Prior × Likelihood) / Marginal
This equation is also equal to the following equation:
P(Ai|B) = [P(Ai) × P(B|Ai)] / [P(A1)×P(B|A1) + P(A2)×P(B|A2) + … + P(An)×P(B|An)]
In these equations, A and B are two events; A1, A2, …, Ai, …, An are mutually exclusive events, and B1, B2, …, Bj are also mutually exclusive events.
Example:
Suppose about 40% of restaurants serve burgers; then the probability of burgers given that there is a restaurant is P(Burger|Restaurant) = 0.4. But burgers are commonly found (about 80% of all markets contain burgers), so the probability of burgers is P(Burger) = 0.8.
And restaurants are fairly common (suppose 70% of all markets have a restaurant…), then the probability of a restaurant is P(Restaurant)=0.7… Then what is the probability that we see a restaurant when we see burgers? That is, the probability of a restaurant given burgers, P(Restaurant|Burger)=? In this case,
P(Restaurant|Burger) = [P(Restaurant) × P(Burger|Restaurant)] / P(Burger) = (0.7 × 0.4) / 0.8 = 0.35
So, there is a 35% chance of finding a restaurant when there are burgers.
Another example: Suppose seeds are common (15%) but plants are scarce (5%) due to a small area of land, and about 85% of all seeds produce plants. Then
P(seed|plant) = [P(seed) × P(plant|seed)] / P(plant)
P(seed|plant) = [15% × 85%] / 5%
P(seed|plant) = 255%
So, the probability of a seed when there is a plant comes out to 255% (a value greater than 100% indicates that the assumed percentages are not internally consistent).
Another example: We have 4 boxes with black and white balls. All balls are of the same shape and size.
Box 1: 4 black balls
Box 2: 2 black balls and 2 white balls
Box 3: 2 black balls and 2 white balls
Box 4: 4 white balls
One box is selected at random and a black ball is drawn out. What is the probability that the ball is from box 1? We have the following events:
A1 – box 1 is selected
A2 – box 2 is selected
A3 – box 3 is selected
A4 – box 4 is selected
B – a black ball is drawn out.
In this case, the probabilities without condition are
P(A1) = P(A2) = P(A3) = P(A4) = 1/4,
whereas the conditional probabilities of drawing a black ball from each box are
P(B|A1) = 4/4 = 1, P(B|A2) = 2/4 = 1/2, P(B|A3) = 2/4 = 1/2, and P(B|A4) = 0/4 = 0.
Further, by Bayes' theorem,
P(A1|B) = [P(A1) × P(B|A1)] / [P(A1)P(B|A1) + P(A2)P(B|A2) + P(A3)P(B|A3) + P(A4)P(B|A4)]
P(A1|B) = (1/4 × 1) / (1/4 × 1 + 1/4 × 1/2 + 1/4 × 1/2 + 1/4 × 0) = (1/4) / (1/2) = 1/2
So, the probability that the black ball came from box 1 is 0.5. The same calculation can also be arranged with the help of a table.
Using Python
The above examples can easily be calculated in Python. For example, write the following lines of codes:
print((0.7*0.4)/0.8)
print((15*85)/5)
Following results are obtained:
0.3499999999999999
255.0
Reliability Coefficient
Reliability refers to the consistency of a measure, or the trust or dependence that can be placed on a measure. The reliability coefficient refers to the degree of consistency or accuracy of a measuring instrument or a test. Usually, the reliability coefficient is determined by the correlation of two different sets of measures. Different ways are used to assess reliability coefficients, for example, inter-rater reliability, which is usually measured by Cohen's kappa coefficient; internal consistency reliability, which is usually measured by Cronbach's alpha; and test-retest reliability, which can be measured by Pearson's correlation coefficient.
Cohen's Kappa Coefficient
It is used to determine inter-rater or inter-observer agreement for qualitative/categorical items. It is more robust than a simple % agreement calculation as "k" also considers the chance agreement. However, it should be determined when a trial on each sample is rated by two different observers/raters, or two trials on each sample are rated by one observer/rater. It is determined by the equation:
k = (po – pe) / (1 – pe)
Where po is the observed proportion of agreement among raters, and pe is the expected proportion of agreement that is attributable to chance. It ranges from 0 to 1, in which 0 shows agreement similar to chance, 1 shows perfect agreement, and the values in between may represent slight agreement, moderate agreement, and substantial agreement.
Example: Suppose two observers (A and B) are checking the presence of a particular behavior in 100 children. They give their response as "Yes" or "No", and the following information is obtained.
            B: Yes    B: No
A: Yes        40        10
A: No         20        30
This table shows 40 "Yes" from both A and B, and 30 "No" from both A and B. So, the observed proportion of agreement among raters is
po = (40 + 30) / 100 = 0.7
Moreover, there are 50 "Yes" and 50 "No" from A, showing "Yes" 50% (0.5) and "No" 50% (0.5) of the time. Moreover, there are 60 "Yes" and 40 "No" from B, showing "Yes" 60% (0.6) and "No" 40% (0.4) of the time.
Probability of saying "Yes" by both A and B is 0.5 × 0.6 = 0.3
Probability of saying "No" by both A and B is 0.5 × 0.4 = 0.2
Expected proportion of agreement that is attributable to chance is pe = 0.3 + 0.2 = 0.5
So, k = (0.7 – 0.5) / (1 – 0.5) = 0.4
Using Python
In order to work on Python, we can use the same example as noted above.
            B: Yes    B: No
A: Yes        40        10
A: No         20        30
Write the following lines of codes:
import statsmodels.stats.inter_rater as st
table=[[40,10],[20,30]]
print(st.cohens_kappa(table))
This gives the following results:
Cohen’s Kappa Statistics is 0.4. Fleiss’ Kappa Coefficient It is used to determine inter-rater or inter-observer agreement for Likert scale data, nominal scale data, or ordinal scale data. It also considers the chance agreement. It is determined when a trial on each sample is rated by three or more different observers/raters. It also ranges from 0 to 1, in which 0 shows agreement similar to chance, 1 shows perfect agreement, and the values in between may represent slight agreement, moderate agreement, and substantial agreement. Here, substantial agreement has a co-efficient of more than 0.75, and it is also considered as good. Nevertheless, “Acceptable” level of agreement depends on the field of
research.
Using Python
In order to work on Python, we can use the following lines of codes.
import statsmodels.stats.inter_rater as st
table=[[x,y,z],[a,b,c]…]
print(st.fleiss_kappa(table))
In these lines of codes, [[x,y,z],[a,b,c]…] is a placeholder representing the table: each inner list corresponds to one subject, and its entries are the numbers of raters who assigned that subject to each category.
Cronbach's alpha
It is used to measure the internal consistency reliability. It is measured by the equation:
α = (N × c̄) / (v̄ + (N – 1) × c̄)
Here, c̄ shows the average inter-task/item covariance among the tasks/items, N shows the number of tasks/items, and v̄ shows the average variance of each task/item. Here,
α-value < 0.5 => unacceptable
α-value > 0.7 => okay
α-value ≥ 0.9 => excellent
Using Python
Suppose we have 7 questions (or 7 items) in a questionnaire. A total of 18 participants completed the questionnaire. Each question is measured utilizing a 5-point Likert item ranging from "strongly agree" (5) to "strongly disagree" (1). Suppose we got the following results (rows are individuals, columns are questions Q1 to Q7):
Individual1:  5 1 3 2 1 4 2
Individual2:  1 5 4 3 5 5 4
Individual3:  4 2 1 3 5 3 4
Individual4:  4 4 3 5 3 4 5
Individual5:  2 3 3 5 2 5 4
Individual6:  5 1 1 2 4 5 1
Individual7:  2 1 1 3 1 5 2
Individual8:  3 2 2 5 2 5 4
Individual9:  2 1 5 3 1 3 4
Individual10: 4 4 1 4 2 5 4
Individual11: 4 3 4 3 4 4 5
Individual12: 5 1 1 3 3 5 3
Individual13: 5 4 3 2 5 3 4
Individual14: 2 4 4 2 4 4 5
Individual15: 2 4 3 2 2 3 2
Individual16: 4 5 2 3 4 2 3
Individual17: 3 4 4 1 4 2 4
Individual18: 2 4 3 5 1 3 5
Write the following lines of codes:
import numpy as np
def sumvar(A):
    n = float(len(A))
    # sample variance of the list A (the n/(n-1) factor converts the population variance to the sample variance)
    sumvar = (sum([(x - np.mean(A))**2 for x in A]) / n) * n / (n - 1.)
    return sumvar
def CronbachAlpha(data):
    ivars = [sumvar(it) for it in data]        # variance of each item
    tscores = [0] * len(data[0])
    for it in data:
        for i in range(len(it)):
            tscores[i] += it[i]                # total score of each participant
    nitems = len(data)
    Cronalpha = nitems / (nitems - 1.) * (1 - sum(ivars) / sumvar(tscores))
    return Cronalpha
data = [[5,1,4,4,2,5,2,3,2,4,4,5,5,2,2,4,3,2],
        [1,5,2,4,3,1,1,2,1,4,3,1,4,4,4,5,4,4],
        [3,4,1,3,3,1,1,2,5,1,4,1,3,4,3,2,4,3],
        [2,3,3,5,5,2,3,5,3,4,3,3,2,2,2,3,1,5],
        [1,5,5,3,2,4,1,2,1,2,4,3,5,4,2,4,4,1],
        [4,5,3,4,5,5,5,5,3,5,4,5,3,4,3,2,2,3],
        [2,4,4,5,4,1,2,4,4,4,5,3,4,5,2,3,4,5]]
print("Cronbach alpha = ", CronbachAlpha(data))
This gives the results for Cronbach's Alpha as follows:
Cronbach alpha = 0.09138691081722555
Interpretation of result: This value of Cronbach's Alpha (0.091) is unacceptable as it is lower than 0.5, thereby showing a low level of internal consistency for our scale.
Coefficient of variation
It is a measure of dispersion or relative variability. It is measured by the following equation:
CV = (standard deviation / mean) × 100
Example: Suppose two weight loss programs, i.e. X and Y, are started having the same goal as well as the same target population.
                               Weight loss program X    Weight loss program Y
Average weight loss per week          5                        6
Standard deviation                    3                        7
CV                              3×100/5 = 60             7×100/6 = 116.67
Since the Coefficient of Variation of program X is lower than the Coefficient of Variation of program Y, program X is more consistent.
Using Python
In order to work on Python, we can use the following example:
               Weight loss program X    Weight loss program Y
Individual1            1                        1
Individual2            2                        1
Individual3            1                        1
Individual4            6                        2
Individual5            7                        8
Individual6            4                        6
Individual7            8                        2
Individual8            7                        11
Individual9            9                        22
Write the following lines of codes:
import scipy.stats as sp
X=([1,2,1,6,7,4,8,7,9])
Y=([1,1,1,2,8,6,2,11,22])
print((sp.variation((X)))*100)
print((sp.variation((Y)))*100)
This gives the following results:
58.11865258054232
109.99438818457405
These results are slightly different from the manually obtained results (scipy.stats.variation uses the population standard deviation by default, whereas the manual calculation used the sample standard deviation), but their interpretation agrees with the manually obtained findings.
Interpretation of results: These results show that program Y has more variability relative to its mean as compared to program X.
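If we want the Python output to follow the same convention as the manual calculation (sample standard deviation rather than population standard deviation), a minimal sketch using numpy (the variable names are only illustrative):
import numpy as np
X = [1,2,1,6,7,4,8,7,9]
Y = [1,1,1,2,8,6,2,11,22]
# coefficient of variation based on the sample standard deviation (ddof=1)
print(np.std(X, ddof=1) / np.mean(X) * 100)
print(np.std(Y, ddof=1) / np.mean(Y) * 100)
These values are closer to the hand-calculated CVs above.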
Chebyshev's Theorem
This theorem states that for any set of observations or numbers, the minimum proportion of the values that lie within k standard deviations of the mean is 1 – 1/k², where k is a positive number greater than 1. This theorem applies to all types of distributions of data.
k        1 – 1/k²                          Minimum percentage of data covered
1        1 – 1/1² = 1 – 1 = 0              0%
2        1 – 1/2² = 1 – 1/4 = 0.75         75%
3        1 – 1/3² = 1 – 1/9 = 0.8889       88.89%
4        1 – 1/4² = 1 – 1/16 = 0.9375      93.75%
5        1 – 1/5² = 1 – 1/25 = 0.96        96%
Using Python
Suppose k=2 as noted in the above example. Write the following line of code:
print(1-(1/2**2))
In this case, the result is 0.75 (or 75%).
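As an additional check, we can verify empirically that the theorem's bound holds for an arbitrary data set; a minimal sketch (the sample data are only an assumption for illustration):
import numpy as np
data = np.array([5,7,6,4,7,9,11,3,8,2,14,6])   # arbitrary example data
k = 2
mean, std = data.mean(), data.std()
# fraction of values within k standard deviations of the mean
within = np.mean(np.abs(data - mean) <= k * std)
print(within)            # should be at least 1 - 1/k**2 = 0.75
print(1 - 1/k**2)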
Factorial
Factorial is the product of all the positive integers less than or equal to a given integer (n). The formula is as follows:
n! = 1×2×3×…×n
here n! is n factorial, and n is a non-negative integer.
Example: 4! = 4×3×2×1 = 24
Also note that, by definition, 0! = 1.
Using Python
Write the following lines of codes:
import math
print(math.factorial(4))
In this case, the result is 24.
Distribution, and Standardization
Suppose there is a place known as "Virana" where beneficial viruses are living and fighting against some harmful bacteria. Those harmful bacteria are assigned a "value-of-danger" from 1 to 9 according to their harmfulness to viruses, i.e. the higher the number, the more harmful the bacteria are. Suppose we have different numbers of bacteria (frequency) according to their value-of-danger, as shown in the following table:
Frequency (number of bacteria)    Value-of-danger
200,000                           1
300,000                           2
500,000                           3
900,000                           4
1,000,000                         5
900,000                           6
500,000                           7
300,000                           8
200,000                           9
Then a frequency distribution plot could be developed as follows:
This frequency distribution plot shows “bell-curve” as it looks like a bell. Technically, this type of distribution is known as normal or Gaussian distribution as noted by statistician professor Gauss. This type of distribution is commonly found in nature as, for example, blood sugar levels and heart rates follow this type of distribution. Normal distribution has three important characteristics. 1. In a normal distribution, mean, median, and mode are equal to each other. 2. In this type of distribution, there is symmetry about the central point. 3. Half of the values, i.e. 50% are less than the mean value, and half of the values, i.e. 50% are more than the mean value. In the above illustration, the value-of-danger “5” is the mean and median, and as it is related to the most commonly found value (highest or largest number of bacteria that could be frequently found), so it is also mode. Disturbance in the values would result in the disturbance of the normal distribution; thereby, leading to non-normal or non-Gaussian
distribution in which there is no appropriate bell-shaped curve. Frequency distribution has a close relationship with standard deviations: About 68.27% of all values lie within one standard deviation of the mean on both sides, i.e., total of two standard deviations, About 95% of all values lie within two standard deviations of the mean on both sides, i.e. total of four standard deviations, and About 99.7% of all values lie within three standard deviations of the mean on both sides, i.e., total of six standard deviations. The number of standard deviations from the mean is known as “Standard Score”, “z-score”, or “sigma”, and in order to convert a value to a z-score or Standard Score, subtract the mean and then divide the value by Standard Deviation. It is represented as
z = (x – μ) / σ
in which z is the z-score; x is the value that has to be standardized; μ is the mean value, and σ is the Standard Deviation. This process of getting a z-score is known as "Standardizing." Let's consider the frequency distribution and frequency distribution plot again:
And its values:
Frequency (number of bacteria)    Value-of-danger
200,000                           1
300,000                           2
500,000                           3
900,000                           4
1,000,000                         5
900,000                           6
500,000                           7
300,000                           8
200,000                           9
In this table, the mean value is shown by the value 5, and the Standard Deviation could be calculated from the variance. Taking the nine value-of-danger levels as the data, the variance is calculated as
σ² = Σ(x – μ)² / N = [(1–5)² + (2–5)² + … + (9–5)²] / 9 = 60/9 ≈ 6.67
So, Standard Deviation = σ = 2.58 In order to get the z-score of the first value, i.e. 1, first subtract the mean. So, it will be 1-5 = -4. Then the value will be divided by the Standard Deviation.
So, it will be -4/2.58 = -1.55. So, the z-score will be -1.55, i.e. the value-of-danger 1 will be -1.55 Standard Deviations from the mean. If we calculate the z-scores of all the values and place the values in the table, we get:
Frequency (number of bacteria)    Value-of-danger    z-score
200,000                           1                  (1-5)/2.58 = -1.55
300,000                           2                  (2-5)/2.58 = -1.16
500,000                           3                  (3-5)/2.58 = -0.78
900,000                           4                  (4-5)/2.58 = -0.39
1,000,000                         5                  (5-5)/2.58 = 0
900,000                           6                  (6-5)/2.58 = 0.39
500,000                           7                  (7-5)/2.58 = 0.78
300,000                           8                  (8-5)/2.58 = 1.16
200,000                           9                  (9-5)/2.58 = 1.55
From this table, Standard Normal Distribution Graph could also be obtained in which z-scores are along x-axis and frequency (number of bacteria) is along y-axis.
The graph shows that nearly 68.27% of the values in the "value-of-danger" are present within one standard deviation of the mean on both sides, i.e. from -1 to 1. In that case, 68.27% is also considered as a confidence interval between the upper limit of z-score=1 and the lower limit of z-score=-1. On a further note, nearly 95% of all values are present within two standard deviations of the mean on both sides, i.e. from -2 to 2. In case of disturbance in the values, the normal graph could change into a non-normal one and start showing, for example, a binomial or Poisson distribution.
Using Python
In order to calculate z-scores, enter the values as two series. We can take the example of "Frequency (number of bacteria)" and their "Value-of-danger".
Frequency (number of bacteria)    Value-of-danger
200,000                           1
300,000                           2
500,000                           3
900,000                           4
1,000,000                         5
900,000                           6
500,000                           7
300,000                           8
200,000                           9
In order to calculate z-scores, we can use the Value-of-danger (above) as an example series. So, the numbers are 1,2,3,4,5,6,7,8,9. We can use scipy stats, and write the following:
x = ([1,2,3,4,5,6,7,8,9])
from scipy import stats
y=stats.zscore(x)
print(y)
This gives the following z-scores:
[-1.54919334 -1.161895   -0.77459667 -0.38729833  0.          0.38729833
  0.77459667  1.161895    1.54919334]
These are the values of z-scores. A graph of z-scores on the x-axis and frequency (number of bacteria) on the y-axis can also be plotted by using the following lines of codes:
import matplotlib.pyplot as plt
x = ([-1.54919334, -1.161895, -0.77459667, -0.38729833, 0.0, 0.38729833, 0.77459667, 1.161895, 1.54919334])
y = ([200000,300000,500000,900000,1000000,900000,500000,300000,200000])
fig, ax = plt.subplots()
ax.plot(x, y)
plt.show()
Following graph is obtained:
You can also give the labels and title by writing the following line of code:
ax.set(xlabel='z-scores', ylabel='Number of bacteria', title='Graph of Number of bacteria vs z-scores')
Now the graph looks as follows:
Prediction interval
It is a modified form of the confidence interval that covers a range of values likely to contain a future value. It works on the basis of pre-existing values and some sort of regression analysis. For simple linear regression, the formula for the prediction interval when the predictor is xh is
ŷh ± t(α/2, n-2) × √[MSE × (1 + 1/n + (xh - x̄)² / Σ(xi - x̄)²)]
OR (in simple words)
Sample estimate (predicted value or fitted value) ± t-multiplier × Standard error of the prediction
In this equation, the standard error of the prediction is similar in form to the standard error of the fit. It is important to note that prediction intervals are wider than confidence intervals, which are represented by the following formula:
ŷh ± t(α/2, n-2) × √[MSE × (1/n + (xh - x̄)² / Σ(xi - x̄)²)]
Using Python
In order to work on the prediction interval, we can use the following example of weights and heights of 8 males in the age range of 18 years to 28 years:
Serial no.    Weight of person (kg)    Height of person (cm)
1             84.2                     189
2             98.8                     178
3             62.6                     176
4             73.6                     172
5             67.4                     168
6             77.9                     179
7             72.9                     186
8             68.6                     164
Write the following lines of codes:
import statsmodels.api as sm
import pandas as pd
y = (84.2,98.8,62.6,73.6,67.4,77.9,72.9,68.6)
x = (189,178,176,172,168,179,186,164)
X = sm.add_constant(x)
model = sm.OLS(y, X)
results = model.fit()
predicted = results.predict()
pred = results.get_prediction()
pred_df = pred.summary_frame()
pd.set_option('display.max_columns', None)
print(pred_df)
In these lines of codes, the code pd.set_option('display.max_columns', None) is used to show all the columns. This gives the following results:
In this result, mean_ci_lower and mean_ci_upper show the 95% confidence interval (lower and upper values), and obs_ci_lower and obs_ci_upper show the 95% prediction interval (lower and upper values). So, the 95% prediction interval for the weight of a randomly selected individual who is 168 cm tall (the row with index 4 in the result, with a mean value of 70.773) is 40.1473–101.399, and the 95% confidence interval for the average weight of all individuals who are 168 cm tall is 56.6860–84.8602. In other words, we can be 95% confident that the weight of a single future individual who is 168 cm tall will fall within the range of 40.1473–101.399 kg.
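A prediction interval for a new, unobserved height can be obtained in a similar way by passing new exogenous values to get_prediction; a minimal sketch, assuming the model fitted above and a hypothetical new height of 175 cm:
import numpy as np
# new observation: the constant term plus the hypothetical height of 175 cm
new_X = np.array([[1.0, 175.0]])
new_pred = results.get_prediction(new_X)
# obs_ci_lower / obs_ci_upper give the 95% prediction interval for this height
print(new_pred.summary_frame(alpha=0.05))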
Tolerance interval
It is an interval, also known as an enclosure interval, covering a fixed proportion (or percentage) of the population with a particular confidence. The endpoints, also known as the maximum value and the minimum value, are known as tolerance limits. For a normally distributed population, it has the following form (due to Howe):
x̄ ± k×s, where k = z(p+1)/2 × √[df × (1 + 1/n) / χ²1-α,df]
In this equation, z(p+1)/2 shows the critical value of the normal distribution related to cumulative probability (p+1)/2, and χ²1-α,df shows the critical value of the chi-square distribution with df degrees of freedom exceeded with probability α. The difference between a tolerance limit and a confidence limit is that a tolerance limit is a limit within which a specific proportion (or percentage) of the population may lie with stated confidence, and a confidence limit is a limit within which a given population parameter (such as the mean value) may lie with stated confidence, as shown below:
Using Python
In order to work on the tolerance interval, we can use the following example:
Serial no.    Weight of person (kg)
1             84.2
2             98.8
3             62.6
4             73.6
5             67.4
6             77.9
7             72.9
8             68.6
Write the following lines of codes:
from numpy import mean
from numpy import sqrt
from scipy.stats import chi2
from statistics import stdev
from matplotlib import pyplot
data = (84.2,98.8,62.6,73.6,67.4,77.9,72.9,68.6)
n = len(data)
dof = n - 1
prob = 0.95
chi_critical = chi2.isf(q=prob, df=dof)
# 1.96 is the normal critical value z(p+1)/2 for a coverage proportion p = 0.95
interval = 1.96 * (sqrt((dof * (1 + (1/n))) / chi_critical))
data_mean = mean(data)
data_stdev = stdev(data)
val = interval * data_stdev
lower, upper = data_mean - val, data_mean + val
print('Tolerance factor k: %.3f' % interval)
print(lower, upper)
print(data_mean)
print(data_stdev)
pyplot.errorbar(n, 75.75, yerr=val, color='blue', fmt='o')
pyplot.show()
Following results are obtained:
Tolerance factor k: 3.736
33.06228597226652 118.43771402773348
75.75
11.425785374694005
These results show that the tolerance factor k is 3.736; the tolerance limits are 33.06228597226652 (lower) and 118.43771402773348 (upper); the mean value is 75.75, and the standard deviation is 11.425785374694005. According to these results, it can be said with 95% confidence that the weights of 95% of persons lie in the range of 33.062 kg to 118.438 kg.
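Instead of hard-coding 1.96, the normal critical value for any coverage proportion p can be computed with scipy; a minimal sketch building on the variables defined in the code above:
from scipy.stats import norm
p = 0.95                       # desired proportion of the population to cover
z = norm.ppf((1 + p) / 2)      # about 1.96 for p = 0.95
k = z * sqrt((dof * (1 + (1/n))) / chi_critical)
print(k)
print(data_mean - k * data_stdev, data_mean + k * data_stdev)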
Parameters to describe the form of a distribution
Different parameters to describe the form of a distribution are (1) location parameters, (2) scale parameters, and (3) shape parameters. The location parameter (x0) is used to determine the shift or location of a distribution. It is determined by the equation: fx0(x)=f(x-x0). Examples of location parameters may include the mean, median, or mode. The scale parameter (s) is used to describe the width of a probability distribution. It is determined by the equation: fs(x)=f(x/s)/s. A large "s" indicates a more spread-out distribution, while a small "s" indicates a more concentrated distribution.
The shape parameter is used to describe all other parameters beyond location parameter and scale parameter. The examples of shape parameter may include skewness and kurtosis.
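These ideas map directly onto scipy.stats, where most continuous distributions take loc (location) and scale arguments; a minimal sketch with the normal distribution and arbitrary example values:
from scipy.stats import norm
# same distribution family, shifted (loc) and stretched (scale)
print(norm.pdf(0, loc=0, scale=1))    # standard normal density at 0
print(norm.pdf(5, loc=5, scale=1))    # shifted by loc=5, same density at its center
print(norm.pdf(5, loc=5, scale=2))    # a larger scale spreads the distribution and lowers the peak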
Skewness
Skewness helps in knowing about the symmetry (or the lack of symmetry) of a distribution or data set. A distribution is said to be symmetric if it appears the same to the right and left of the center point. Skewness is most commonly measured by Karl Pearson's methods, which can be represented by SKP. The relative measures of skewness in relation to Pearson's Coefficient of Skewness are as follows:
Method # 1: SKP = (Mean – Mode) / Standard deviation
Method # 2: SKP = 3 × (Mean – Median) / Standard deviation
If the coefficient of skewness has a negative value, then it is referred to as a negatively skewed distribution; if the value of the coefficient of skewness is zero, then it is referred to as a symmetrical distribution, and if the coefficient of skewness has a positive value, then it is referred to as a positively skewed distribution.
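A minimal Python sketch for skewness, using an arbitrary example series (the data are only an assumption for illustration); it shows both Pearson's second coefficient (Method # 2) and the moment-based skewness from scipy:
import numpy as np
from scipy.stats import skew
data = np.array([2, 3, 3, 4, 5, 6, 7, 9, 12])    # example data with a longer right tail
# Pearson's second coefficient of skewness (Method # 2)
pearson_sk = 3 * (np.mean(data) - np.median(data)) / np.std(data, ddof=1)
print(pearson_sk)      # positive, so the data are positively skewed
# moment-based skewness
print(skew(data))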
Kurtosis
It is a measure of the degree of peakedness or flatness of a frequency-distribution curve, i.e. the extent to which a distribution shows a peak or flatness. The equation for the general form of kurtosis is given below:
β2 = E[(X – µ)⁴] / σ⁴
In this equation, µ shows the population mean, and σ shows the population standard deviation. Kurtosis can be of different forms, including the (+) leptokurtic curve, the (0) mesokurtic curve, and the (-) platykurtic curve. The leptokurtic curve shows positive kurtosis, and it is a more peaked curve than the normal curve. In a leptokurtic curve, β2 > 3. The mesokurtic curve is a normal curve, and it shows a normal distribution. In a mesokurtic curve, β2 = 3. The platykurtic curve shows negative kurtosis, and it is less peaked than the normal curve. In a platykurtic curve, β2 < 3.
Mood's Median Test
It is a nonparametric test that is used to compare the medians of two or more independent groups. Following steps can be considered in the Mood's Median test:
1. Determination of hypothesis: Null Hypothesis – H0: The population medians of the groups are equal, and Alternative Hypothesis – H1: The population medians of the groups are not equal.
2. Calculation of the overall median of the combined data.
3. For each group, counting of the observations that are greater than the overall median (>median) and the observations that are less than or equal to the overall median (≤median).
4. Arrangement of these counts in a contingency table with the rows >median and ≤median.
5. Calculation of the Chi-square statistic by the following formula:
χ² = Σ (Oij – Eij)² / Eij
Here, Eij shows the expected cases in row i at column j, and Oij shows the observed cases in row i at column j.
6. Determination of the degrees of freedom (DF) by the following formula: DF = k-1. Here, k shows the number of levels of the categorical variable, or (simply) categories, samples, populations, etc.
7. Comparison of values, i.e., compare the χ² test statistic (i.e. the calculated χ² value) at the determined DF with the χ² value in a table of χ².
8. Interpretation of results, i.e., if the calculated value of the χ² test is more than the value in the χ² table, the null hypothesis will be rejected.
Example: Suppose we are working on a disease and the pain associated with the disease, and we have worked on two different medicines, i.e. A and B. After receiving the treatments, we assessed the quality-of-life and pain through the Quality-of-Life Scale and an overall ranking of pain, respectively. After getting the scores for quality-of-life and pain, those scores are joined and an overall score out of 10 is developed. Higher scores show improvement, while lower scores show no effect of treatment.
Overall score out of 10    Medicines
7                          A
2                          A
5                          A
4                          A
6                          B
8                          B
6                          B
9                          B
Calculate the overall median of the 8 values; in this case it is "6". Now, for each medicine, count the number of observations that are equal to or less than the overall median (≤6) and greater than the overall median (>6). These values are placed in a 2×2 contingency table.
              A    B
>median       1    2
≤median       3    2
Now calculate the chi-square by using the equation χ² = Σ((O - E)² / E), where O shows the observed values and E shows the expected values. The expected values can be considered as the average values as follows:
              A    B
>median      1.5  1.5
≤median      2.5  2.5
Now the Chi-square contributions are as follows:
((O - E)² / E) = ((1 – 1.5)² / 1.5) = 0.167
((O - E)² / E) = ((3 – 2.5)² / 2.5) = 0.1
((O - E)² / E) = ((2 – 1.5)² / 1.5) = 0.167
((O - E)² / E) = ((2 – 2.5)² / 2.5) = 0.1
Total χ² = 0.167 + 0.1 + 0.167 + 0.1 = 0.534
Now we will check the degrees of freedom (df). For χ², we will calculate df by the formula df = k-1 = 2 - 1 = 1. So, we have df = 1. Now we will compare our calculated χ² value, i.e. 0.534, with the χ² value in a
table of χ² with 1 df. If the calculated value of the Chi-square test is more than the value in the Chi-square table, the null hypothesis will be rejected. The χ² value at α = 0.05 and df=1 is 3.841. Our observed χ² value of 0.534 is less than 3.841. This shows that the median of medicine A has a nonsignificant difference from that of medicine B.
Using Python
In order to work on Mood's Median test, we can use the same example as noted above:
Overall score out of 10    Medicines
7                          A
2                          A
5                          A
4                          A
6                          B
8                          B
6                          B
9                          B
Write the following lines of codes:
from scipy.stats import median_test
medicineA=(7,2,5,4)
medicineB=(6,8,6,9)
stat, p, med, tbl = median_test(medicineA, medicineB, correction=False)
print('Overall median: %f' %med)
print('Chi-squared statistics: %f' %stat)
print('p-value: %f' %p)
print(tbl)
Following results appear:
Overall median: 6.000000
Chi-squared statistics: 0.533333 p-value: 0.465209 [[1 2] [3 2]] In these answers, print(tbl) gives contingency table. Our observed χ2 value is 0.53 and p-value is 0.465. This p-value>0.05, thereby showing that the median of A medicine has a nonsignificant difference from the B medicine.
Goodness of Fit A goodness of fit test is used to check how well an observed frequency distribution (or observed set of data) fits a claimed distribution of a population (or an expected outcome). A goodness of fit test is among the most commonly used non-parametric tests. Chi-square test is commonly utilized for goodness of fit tests.
Chi-square test Chi-square (χ2) test is a very simple non-parametric test. It is commonly used for discrete distributions, such as Binomial distribution or Poisson distribution. If we consider the pain as a variable, this test can be used to show an association between the efficacies of treatments in reducing the pain. Testing Assumptions: Two variables have to be measured as categories, usually at a nominal or ordinal level. The study groups have to be independent of each other. Following steps can be considered in the Chi-square test: 1. Determination of hypothesis: Null Hypothesis – H0: Observed value = Expected value, and Alternative Hypothesis – H1: Observed value ≠ Expected value 2. Determination of χ2 test statistic by the following formula:
χ² = Σ (Oi – Ei)² / Ei
Here, Oi is the observed value of the ith level of the variable, and Ei is the expected value of the ith level of the variable.
3. Determination of the degrees of freedom (DF) by the formula DF = k-1, where k is the number of levels of the categorical variable, or (simply) categories, OR by the formula: DF = (Number of rows - 1) × (Number of columns - 1)
4. Comparison of values / Determination of the p-value by comparing the χ² test statistic (i.e. the calculated χ² value) at the determined DF with the χ² value in a table of χ².
5. Interpretation of results: If the calculated value of the χ² test is more than the value in the χ² table, the null hypothesis will be rejected. OR If the calculated p-value is less than the significance level (0.05), the null hypothesis will be rejected. If H0 is rejected, it is not a good fit.
Example: Suppose a random sample of 100 boys and girls were surveyed about their views about mathematics... Here, Null Hypothesis – H0: No difference between the boys and girls, and Alternative Hypothesis – H1: A significant difference exists between boys and girls. The observed values are presented in the following table:
Groups    Total    Like mathematics    Don't like mathematics
Boys       50         40                   10
Girls      50         20                   30
Total     100         60                   40
Suppose there is no relationship between anything, and everything is normal. The expected values can be presented by the following table:
Groups    Total    Like mathematics    Don't like mathematics
Boys       50         30                   20
Girls      50         30                   20
Total     100         60                   40
For instance, the expected value of 30 in the case of girls who like mathematics is obtained by multiplying the total number of girls by the total number of participants who like mathematics and dividing the value by the overall total number. Therefore, 50 × 60 / 100 = 30. Now, determine the chi-square test statistic by the following formula:
χ² = (40–30)²/30 + (10–20)²/20 + (20–30)²/30 + (30–20)²/20 = 3.33 + 5 + 3.33 + 5 = 16.66
Here, DF = 1. The calculated value of the χ² test statistic, i.e. 16.66, is more than the value in the χ² table at DF=1 and 0.05 significance level, i.e. 3.841. So, the null hypothesis will be rejected at the 0.05 level of significance.
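The same 2×2 contingency table can also be analyzed with scipy's chi2_contingency, which computes the expected values itself; a minimal sketch (correction=False disables Yates' continuity correction so the statistic matches the manual calculation above):
from scipy.stats import chi2_contingency
observed = [[40, 10],
            [20, 30]]
stat, p, dof, expected = chi2_contingency(observed, correction=False)
print(stat)        # about 16.67
print(p)           # well below 0.05
print(dof)         # 1
print(expected)    # the expected counts: 30, 20, 30, 20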
Using Python
In order to work on the chi-square test, we can use the following data as an example:
Groups    Total    Like mathematics    Don't like mathematics
Boys       50         40                   10
Girls      50         20                   30
Total     100         60                   40
In order to work on the chi-square test, we have to import scipy stats and write the following lines of codes:
from scipy.stats import chisquare
x= chisquare([40,10,20,30], f_exp=[30,20,30,20])
print(x)
Here, f_exp represents the expected data. This gives the following answer:
Power_divergenceResult(statistic=16.666666666666668, pvalue=0.0008275246236931499)
This p-value is less than 0.05, i.e. p < 0.05, so the null hypothesis is rejected, showing a significant difference between the boys and girls.
The Sign Test
It is a nonparametric test that is used to check whether the median difference between paired observations is zero. The test statistic is the smaller of the number of positive signs and the number of negative signs of the differences. Interpretation of results: If the test statistic is less than or equal to the critical value (from a table of critical values for the Sign Test), the null hypothesis is rejected, and if the test statistic is greater than the critical value, the null hypothesis is not rejected.
Suppose we perform a test on a sample of 8 participants, i.e. n=8, having a disease resulting in pain. The ranking of pain is assessed before and after the treatment. Initially, each participant is asked for the level of pain caused by the disease, and then they are asked about the pain one week after starting the new treatment. The ranking of pain is from 1 to 12. One shows the least pain and 12 shows the highest level of pain. Suppose we get the following results:
Participants    Ranking of pain before treatment    Ranking of pain after 1 week of treatment
1               9                                   8
2               7                                   5
3               4                                   5
4               7                                   4
5               8                                   2
6               8                                   7
7               6                                   4
8               3                                   5
This data shows that some participants showed a great level of improvement in the condition of pain, for example, participant 5; whereas some participants showed worsening of the condition, for example, participants 3 and 8. However, it is important to establish the statistical significance after one week of treatment. Here, we can use the Sign Test. For this test, we can start by calculating the difference scores for each participant. So, we will subtract the ranking of pain after treatment from that determined before treatment.
Participants of the study    Before treatment    After 1 week of treatment    Difference, i.e. Before treatment – After treatment
1                            9                   8                            1
2                            7                   5                            2
3                            4                   5                            -1
4                            7                   4                            3
5                            8                   2                            6
6                            8                   7                            1
7                            6                   4                            2
8                            3                   5                            -2
Here, the scores, i.e. differences, vary widely, and 6 can be considered as an outlier. For this test, our null hypothesis, i.e. H0, is that the median difference is zero, and our alternative hypothesis, i.e. H1, is that the median difference is positive. So, our null hypothesis shows that there is no difference in scores before treatment versus after treatment, i.e., positive differences (showing improvement of pain) and negative differences (showing worsening of pain) are equally likely. On the other hand, our alternative or research hypothesis shows that there are more positive differences after treatment than before treatment. Our α = 0.05. From another table, we can find the number of positives and negatives (i.e. the number of positive outcomes and negative outcomes).
Participants of the study    Before treatment    After 1 week of treatment    Difference    Sign
1                            9                   8                            1             +
2                            7                   5                            2             +
3                            4                   5                            -1            -
4                            7                   4                            3             +
5                            8                   2                            6             +
6                            8                   7                            1             +
7                            6                   4                            2             +
8                            3                   5                            -2            -
This table shows more positive differences suggesting that the alternative hypothesis is true. However, it is important to know whether the suggestion has statistically significant value or it is just by chance. Now take a look at the critical values for the Sign Test. From the data, it is clear that we have 2 negative signs and 6 positive signs. Overall, there are 8 participants, i.e. n=8. This is showing that number of negative signs is smaller, and we will compare this with the critical value for the Sign Test. If the smaller of the number of negative or positive signs is less than or equal to critical value, we can reject null hypothesis in favor of alternative hypothesis. However, if the smaller of the number of negative or positive signs is more than the critical value, we cannot reject null hypothesis. For this research, we will consider two-tailed significance level and check the critical value with n=8 and α=0.05. The critical value is zero. In our case, we cannot reject null hypothesis as the number of negative signs which is equal to 2 is more than zero.
In short, in the Sign Test, the null hypothesis is that the median difference is zero. The test statistic is the smaller of the number of negative signs and the number of positive signs. In this case, the calculated value of the test statistic is 2, which is more than the critical value of zero. So, the null hypothesis cannot be rejected, thereby showing that the new treatment is not working.
Using Python
In order to work on the Sign Test, we can use the following data as an example:
Participants    Ranking of pain before treatment    Ranking of pain after 1 week of treatment
1               9                                   8
2               7                                   5
3               4                                   5
4               7                                   4
5               8                                   2
6               8                                   7
7               6                                   4
8               3                                   5
To work on the Sign Test, we work with scipy stats and numpy. So, first we write the following:
import numpy as np
x=([9,7,4,7,8,8,6,3])
y=([8,5,5,4,2,7,4,5])
z=list(np.array(x) - np.array(y))
print(z)
This gives us the result showing negative and positive numbers. So, the numbers are [1, 2, -1, 3, 6, 1, 2, -2]. Count the number of positives and negatives. In this case, we have 2 negative values and 6 positive values. Check the number of successes in trials, i.e. the smaller of the number of positively signed cases and the number of negatively signed cases. In this regard, it is 2, i.e. we have 2 negative values. Now, write the following lines of
codes:
from scipy.stats import binom_test
a = binom_test(2, 8)
print(a)
In this, 2 shows the number of successes and 8 is the number of trials. This gives a p-value of 0.2891. This calculated p-value is more than the alpha level of 0.05, so we cannot reject the null hypothesis, and the results are not statistically significant.
Wilcoxon Signed Rank Test
It is a nonparametric test.
Testing Assumptions: Two variables, in which one is a dependent variable that is measured on an ordinal or continuous level, and the other is an independent variable that has two groups, or pairs (as you can see in the example of the treatment of pain).
Following steps can be considered in the Wilcoxon Signed Rank test:
1. Calculation of difference scores: In this case, subtract the data points after the intervention or process from those determined before the intervention or process.
2. Ordering the absolute values of the difference scores.
3. Ranking of the ordered absolute values of the difference scores: In case of the same absolute values of the difference scores, we will assign the mean rank.
4. Attach the positive or negative signs of the ordered absolute values of the differences to the ranks to get Signed Ranks.
5. Calculate the sums of the (two) Signed Ranks, i.e. W+ and W-. Here, W+ = sum of the positive ranks, and W- = sum of the (absolute values of the) negative ranks. After adding the sums of the (two) Signed Ranks (i.e. [W-]+[W+]), the answer must always be equal to n(n+1)/2 to proceed further. Here, "n" is the total number of samples in the two groups.
6. Determination of hypothesis: Null Hypothesis – H0: Median difference is zero, OR W+ = W-, and Alternative Hypothesis – H1: Median difference is not zero, OR W+ ≠ W-.
7. Find the value of "W", i.e., the smaller value of W+ and W-.
8. Compare the observed W value with the critical value. The critical value is determined with the help of the table of critical values of W.
9. Interpretation of results: Observed W value ≤ Critical value, the null hypothesis is rejected, and Observed W value > Critical value, the null hypothesis is not rejected.
Wilcoxon Signed Rank Test can also be performed on the same data (as mentioned in The Sign Test). We will rank the difference scores from 1 through 8 after ordering the absolute values of the difference scores. In case of the same absolute values of the difference scores, we will assign the mean rank. Then we will attach the positive or negative signs of the observed differences to the ranks. So, we can get the following table:
Difference, i.e. Before treatment – After treatment    Ordered absolute values of differences    Ranks or Mean Ranks    Signed ranks
1                                                      1                                         (1+2+3)/3 = 2          2
2                                                      -1                                        (1+2+3)/3 = 2          -2
-1                                                     1                                         (1+2+3)/3 = 2          2
3                                                      2                                         (4+5+6)/3 = 5          5
6                                                      -2                                        (4+5+6)/3 = 5          -5
1                                                      2                                         (4+5+6)/3 = 5          5
2                                                      3                                         7                      7
-2                                                     6                                         8                      8
We will work with the same hypotheses as in The Sign Test. However, capital W is the test statistic for the Wilcoxon Signed Rank Test. W is the smaller value of the sum of the positive ranks, which is represented by W+, and the sum of the negative ranks, which is represented by W-. The null hypothesis is accepted if W+ is similar to W-, whereas the null hypothesis is rejected if W+ is much larger in value than W-. From the data, we have found that W+ = 29 and W- = 7. The sum of the ranks must always be equal to n(n+1)/2. So, 8(8+1)/2 = 72/2 = 36, which is equal to 29 + 7 = 36. Our test statistic is 7, i.e. W = 7, which is the smaller value of 29 and 7. Now we will check the table of critical values of W by considering the sample size, i.e. n = 8, and the level of significance of 5%, i.e. α=0.05. If the observed value, i.e. 7, is less than or equal to the critical value, the null hypothesis is rejected, whereas if the observed value is greater than the critical value, the null hypothesis is not rejected.
In the table, the critical value of W in a two-tailed test is 4, which is smaller than 7. This shows that the null hypothesis cannot be rejected. In short, in the Wilcoxon Signed Rank Test, the test statistic is W. The null hypothesis is that the median difference is zero, or W+ is equal to W-. W is equal to the smaller value of W+ and W-. In this case, the calculated value of W is 7, which is more than the critical value of 4. So, the null hypothesis cannot be rejected, thereby showing that the new treatment is not working.
Using Python
In order to work on the Wilcoxon Signed Rank Test, we can use the following data as an example:
Participants    Ranking of pain before treatment    Ranking of pain after 1 week of treatment
1               9                                   8
2               7                                   5
3               4                                   5
4               7                                   4
5               8                                   2
6               8                                   7
7               6                                   4
8               3                                   5
In order to work on this test, we work on scipy stats as follows:
from scipy.stats import wilcoxon
x=([9,7,4,7,8,8,6,3])
y=([8,5,5,4,2,7,4,5])
z= wilcoxon (x,y, zero_method='wilcox')
print(z)
This gives the following answer:
WilcoxonResult(statistic=7.0, pvalue=0.11979493042591832)
This gives the WilcoxonResult; the statistic is equal to 7 in this case, and the p-value is 0.1198 in this case. This p-value is more than 0.05, so the results are
statistically not significant.
The Kruskal-Wallis Test It is a nonparametric test. It is an alternative to One Way ANOVA. It can be performed on more than two independent groups. It is usually performed, When sample sizes are small, and They are not normally distributed Testing Assumptions: Two variables in which one is dependent variable that is measured on an ordinal or continuous level, and the other is independent variable that has two or more categorical groups. Observations are independent of each other, i.e. they are not related to each other. Following steps can be considered in the Kruskal-Wallis test: 1. Determination of hypothesis: Null Hypothesis – H0: Population medians of all the groups are equal, and Alternative Hypothesis – H1: Population medians of the groups are not equal. 2. Ordering the data of total number of samples (in all groups) from smallest value to largest value while keeping the track of group assignments in the total sample. 3. Calculate the sums of the ranks of (all of the) groups, i.e. R1+ R2+ R3+…. After adding the sums of the ranks of (all of the) groups, the answer must always be equal to n(n+1)/2 to proceed further. Here, “n” is the total number of samples in the two groups, so n=n1 (number of samples in group 1) + n2 (number of samples in group 2) + n3 +…. 4. Calculate the value of “H” (by the following equation for 3 groups):
H = [12 / (n(n+1))] × (R1²/n1 + R2²/n2 + R3²/n3) – 3(n+1)
5. Compare the calculated H value with the critical value. In this case, the critical value is determined with the help of the table of critical values of the Kruskal-Wallis H.
6. Interpretation of results: Calculated H value ≥ Critical value, the null hypothesis is rejected, and Calculated H value < Critical value, the null hypothesis is not rejected.
Suppose, some people are feeling pain due to some disease and we want to know the effect of different concentrations of a new drug on the treatment of
pain. We will get 12 participants (samples) from those people and check the efficacy of the new drug and its different concentrations in reducing the pain. The concentrations of the new drug include 5%, 10% and 15% of drug in the solution, and the response of the people will be taken in the form of a ranking of pain on a scale from 1 to 12, in which 1 shows the least pain and 12 shows the highest level of pain. Suppose we give the 5% solution to 3 people (i.e. n1), the 10% solution to 5 people (i.e. n2), and the 15% solution to 4 people (i.e. n3). In this situation, we will perform the Kruskal-Wallis Test as the sample sizes are small and they are not equal, i.e. n1 = 3, n2 = 5, and n3 = 4. Moreover, they are not normally distributed. The Kruskal-Wallis Test is useful as it helps in comparing the outcomes of more than two independent groups. Our null hypothesis is that the population medians of all three groups are equal, whereas the alternative hypothesis is that the population medians of the groups are not equal, at the 5% level of significance. Suppose the following responses are obtained:
5% solution    10% solution    15% solution
7              6               5
6              5               2
9              7               3
               8               4
               5
This table is showing that 15% solution of the drug is more helpful in reducing pain as compared to 5% solution. However, it is important to check whether this observed difference is statistically significant or not. For Kruskal Wallis Test, we have to order the data of 12 subjects from smallest value to largest value while keeping the track of group assignments in the total sample.
First we will check whether the total value of ranks, i.e. 29+38+11=78 is equal to n(n+1)/2 or not. So, n(n+1)/2= 12(13)/2 = 78. So, these are equal. The test statistic for the Kruskal Wallis test is represented by H and can be calculated by using the equation,
H = [12 / (n(n+1))] × (R1²/n1 + R2²/n2 + R3²/n3) – 3(n+1)
Where n is the total number of subjects or samples, i.e. 12; R1 is the sum of the ranks in the first group, i.e. 29; R2 is the sum of the ranks in the second group, i.e. 38; R3 is the sum of the ranks in the third group, i.e. 11; n1 is the sample size of the first group, i.e. 3; n2 is the sample size of the second group, i.e. 5, and n3 is the sample size of the third group, i.e. 4. So,
H = [12 / (12 × 13)] × (29²/3 + 38²/5 + 11²/4) – 3(13) ≈ 7
Now we will check whether this test statistic, i.e. H=7, is in favor of the null hypothesis or rejects the null hypothesis. So, we will check it by considering the critical value of H. If this value is more than or equal to the critical value, the null hypothesis will be rejected, whereas if this value is less than the critical value, the null hypothesis will not be rejected. From the table, the critical value with sample sizes of 3, 5, and 4, and α=0.05 is 5.656. Our observed H value is 7 and it is greater than the critical value, so we can reject the null hypothesis. It can also be said that the test is statistically significant as the null hypothesis is rejected. It is quite good news, as we can use an increased concentration of the drug to reduce the pain. In short, in the Kruskal-Wallis Test, the test statistic is H. The null hypothesis is that the medians of all the populations are equal. Our calculated value of H is 7, which is more than the critical value of 5.656. In this case, the null hypothesis is rejected. It cannot be rejected only when the calculated H value is less than the critical value.
Using Python
In order to work on the Kruskal-Wallis Test, we can use the following data as an example:
5% solution    10% solution    15% solution
7              6               5
6              5               2
9              7               3
               8               4
               5
We have to work on scipy stats. So, write the following lines of codes:
from scipy.stats import kruskal
x=([7,6,9])
y=([6,5,7,8,5])
z=([5,2,3,4])
a= kruskal (x,y, z)
print(a)
This gives the following answer:
KruskalResult(statistic=7.258690476190476, pvalue=0.026533551881464005)
This p-value is less than 0.05, so the results are statistically significant.
Degrees of Freedom
The degrees of freedom are the number of values or scores in a data set (or distribution) that are free to vary while considering the final calculation. For single group tests, they are represented by df=N-1. For two group tests, they are represented by df=N1+N2-2. For the Error degrees of freedom (as found in ANOVA), they are represented by df=N-M, where N shows the total number of data points, and M shows the number of groups.
Example: Suppose we have a data set of three numbers with 3+2+4=9 and a mean value of 3 (i.e. 9/3=3). We have N-1 = 3-1 = 2 degrees of freedom, i.e., we have only two free choices (out of three). In this case, if we know any 2 numbers and the mean value, then the 3rd number is determined, i.e. it can't be changed… In other words, only 2 of the three numbers can be chosen freely, while the 3rd number cannot be left to choice.
Suppose, The two numbers are 1 and 2, the third number must be 6; The two numbers are 3 and 5, the third number must be 1; The two numbers are 4 and 3, the third number must be 2, and so on...
The Friedman Test It is a nonparametric test that is considered as an alternative to Two Way ANOVA. It can be performed on three or more groups having the same number of sample sizes. Testing Assumptions: Data, especially related to dependent variable, has to be measured at the continuous or ordinal level. Data is related to a single group measured at a minimum of three different occasions. Random sampling strategy for the group from the selected population. Following steps can be considered in the Friedman test: 1. Determination of hypothesis: Null Hypothesis – H0: Median rankings of all the groups are equal, and Alternative Hypothesis – H1: Median ranking of at least one of the groups is different from the median ranking of at least one of the other groups. 2. Arrangement of data into blocks (or columns). 3. Each column is ranked separately, in which the smallest rank (score) is 1. 4. The ranks are summed (i.e. summed value for each ranked column) to get total rank (R). Here, R1 may show the total rank for group 1, R2 may show the total rank for group 2, R3 for group 3, and so on... 5. Calculate the sums of the ranks of (all of the) groups, i.e., R1+ R2+ R3+ …. After adding the sums of the ranks of (all of the) groups, the answer must always be equal to rc(c+1)/2 to proceed further. Here, “r” is the number of samples/participants in every group, and “c” is the number of groups. 6. Calculate the value of “F” (by using the following formula that is for 4
groups):
7. Compare the calculated F value with the critical value. In this case, critical value is determined with the help of chi-square table while considering (c-1) degrees of freedom. 8. Interpretation of results: Calculated F value > Critical value, null hypothesis is rejected, and Calculated F value ≤ Critical value, null hypothesis is not rejected.
Suppose, there is a treatment for pain (in some disease) that works equally on
pain-related outcomes after the use of a 15% solution of that treatment in all age groups. We give the new treatment to four different groups and get their responses. Further suppose that the four groups are A, B, C, and D. Group A has 6 participants having 15 years of age. Group B has 6 participants having 25 years of age. Group C has 6 participants having 35 years of age. Group D has 6 participants having 45 years of age. This difference of years is helpful in knowing the differences of the new treatment on different ages of samples. In this case, the Friedman test can help as it is used to compare three or more groups, and all the groups have the same number of samples. Our null hypothesis is that the median rankings of the four groups are equal, while the alternative hypothesis is that the median ranking of at least one of the groups is different from the median ranking of at least one of the other groups at α = 0.05. After giving the treatment to the participants (blocks of raters), suppose we get the following results:
Participants    Group A    Group B    Group C    Group D
1               8          7          6          5
2               9          8          7          6
3               7          5          6          4
4               6          7          5          3
5               9          6          4          4
6               8          7          8          5
Now, we need to convert the data to ranks within blocks (i.e. within each participant), assigning the smallest score the rank 1 and using mean ranks for ties:
Participants    Group A    Group B    Group C    Group D
1               4          3          2          1
2               4          3          2          1
3               4          2          3          1
4               3          4          2          1
5               4          3          1.5        1.5
6               3.5        2          3.5        1
Rank Totals     22.5       17         14         6.5
These Rank Totals are showing that there are differences in the rankings of pain of different age groups. So, we have to test the statistical significance of these results. First, we will check the rankings in the Friedman Test, so
R1 + R2 + R3 + R4 = rc(c+1)/2
Where r = number of participants in every group = 6, and c = number of groups = 4. So, we have
22.5 + 17 + 14 + 6.5 = 60 = 6 × 4 × (4+1)/2 = 60
For the Friedman test, we have the following equation,
FM = [12 / (r × c × (c+1))] × (R1² + R2² + R3² + R4²) – 3r(c+1)
FM = [12 / (6 × 4 × 5)] × (22.5² + 17² + 14² + 6.5²) – 3 × 6 × 5 = 0.1 × 1033.5 – 90 = 13.35
Now, we have to check whether the calculated value is greater than, equal to, or smaller than the critical value of the chi-square distribution with c -1 = 3 degrees of freedom and α = 0.05. Critical value in the table is 7.815, and our calculated value is 13.35, which is greater than the critical value. Therefore, we can reject the null hypothesis, and there are significant differences in the median rankings of participants from different age groups. In short, in Friedman test, the test statistic is F. The null hypothesis is that the median rankings of all the groups are equal. Our calculated value of F is 13.35, which is more than the critical value of 7.815. In this case, we can reject the null hypothesis. It cannot be rejected only when calculated F value is less than the critical value. Using Python In order to work on the Friedman Test, we can work on the following data:
Write the following lines of codes:
from scipy.stats import friedmanchisquare
u=([8,7,6,5])
v=([9,8,7,6])
w=([7,5,6,4])
x=([6,7,5,3])
y=([9,6,4,4])
z=([8,7,8,5])
a= friedmanchisquare (u, v, w, x, y, z)
print(a)
This gives the following answer:
FriedmanchisquareResult(statistic=12.251908396946558, pvalue=0.03149442426414504)
This gives the Friedman chi-squared, which is equal to 12.252, and the p-value, which is 0.03149. This p-value is less than 0.05, so the results are statistically significant.
Tests for Normally distributed data
Unpaired "t" test, Paired "t" test, One Way ANOVA, and Two Way ANOVA can be used to test normally distributed data. Usually, Student's t test is presented by the following equation:
t = (x̄ – µ) / (s / √n)
Here, x̄ shows the observed (sample) mean, s shows the standard deviation, µ shows the population mean, and n shows the total sample size. The observed mean is represented by the following equation:
x̄ = (Σ xi) / n
The standard deviation is represented by the following equation:
s = √[ Σ (xi – x̄)² / (n – 1) ]
Example: Suppose a random sample of 9 values from a normal population showed a mean of 41.5 inches, and the sum of squares of deviations from this mean is equal to 72 square inches. Show whether the assumption of a mean of 44.5 inches in the population is reasonable. (For v=8, the two-tailed t0.05 = 2.306.)
Here, x̄=41.5, µ=44.5, n=9, and Σ(xi – x̄)² = 72.
Following steps could be used:
1. Specify the hypothesis: Null hypothesis (H0) – Population mean = µ = 44.5, and Alternative hypothesis (H1) – Population mean = µ < 44.5 or µ > 44.5 or µ ≠ 44.5.
2. Choose a significance level (α): Selecting the significance level α=0.05 means that a 5% risk of wrongly rejecting the null hypothesis is allowed; the corresponding two-tailed table value of t is 2.306.
3. Determine the sample size: It is n=9 in this case, so the degrees of freedom are v = n-1 = 8.
4. Calculate the standard deviation:
s = √[ Σ(xi – x̄)² / (n – 1) ] = √(72/8) = √9 = 3
5. Apply the t-test:
t = (x̄ – µ) / (s / √n) = (41.5 – 44.5) / (3/√9) = -3/1 = -3
Degrees of freedom, v = n-1 = 9-1 = 8. For v=8, t0.05 for a two-tailed test = 2.306.
6. Decide whether to reject or accept H0: Since the calculated value of |t| (3) > the table value of t (2.306), we reject the null hypothesis.
7. Conclusion: We conclude that the population mean is not equal to 44.5.
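As a quick check in Python, the same one-sample t statistic and two-tailed p-value can be computed from the summary values above; a minimal sketch using scipy:
from math import sqrt
from scipy import stats
n = 9
xbar = 41.5          # sample mean
mu0 = 44.5           # hypothesized population mean
ss = 72.0            # sum of squared deviations from the sample mean
s = sqrt(ss / (n - 1))                  # sample standard deviation = 3
t = (xbar - mu0) / (s / sqrt(n))        # t = -3
p = 2 * stats.t.sf(abs(t), df=n - 1)    # two-tailed p-value
print(t, p)                             # p is below 0.05, so H0 is rejected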
Unpaired "t" test
Testing Assumptions: A dependent continuous variable and an independent categorical variable with two groups, or levels. The dependent variable follows a normal distribution, or at least an approximately normal distribution. The variances or standard deviations of the two groups are equal, i.e. homogeneity of variance. N.B.: If these assumptions are not followed, you may consider using non-parametric tests.
Suppose we are working on a disease and the pain associated with the disease, and we have developed two groups of 8 participants each. One group receives the new treatment, while the other group receives a placebo. After receiving the treatments, those participants are assessed for the quality-of-life and pain through the Quality-of-Life Scale and an overall ranking of pain, respectively. After getting the scores for quality-of-life and pain, those scores are joined and an overall score out of 10 is developed. Higher scores show improvement, while lower scores show no effect of treatment.
Overall scores out of 10
For Treatment Group    For Placebo Group
6                      4
3                      4
4                      1
5                      3
5                      5
9                      6
7                      2
8                      7
In this case, we will use Student's t-test as the number of samples is less than 30, i.e. the overall sample size is equal to 16. Moreover, we will use the unpaired or independent t-test as the comparison is made between the outcomes of two different groups. Now we have to perform some calculations on these results as follows:
t = (x̄1 – x̄2) / √[ s² × (1/n1 + 1/n2) ], with s² = [ (n1 – 1)s1² + (n2 – 1)s2² ] / (n1 + n2 – 2)
Where s² is the pooled sample variance. Proceeding with this data, we have
x̄1 = 5.875, x̄2 = 4, s1² = 4.125, s2² = 4, and s² = (7 × 4.125 + 7 × 4) / 14 = 4.0625
t = (5.875 – 4) / √[ 4.0625 × (1/8 + 1/8) ] = 1.875 / 1.008 ≈ 1.86
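A quick numerical check of this pooled calculation in Python; a minimal sketch computed directly from the two score lists (the "Using Python" part of this section below reaches the same value with scipy's ttest_ind):
import numpy as np
treatment = np.array([6,3,4,5,5,9,7,8])
placebo = np.array([4,4,1,3,5,6,2,7])
n1, n2 = len(treatment), len(placebo)
# pooled sample variance
s2 = ((n1 - 1) * treatment.var(ddof=1) + (n2 - 1) * placebo.var(ddof=1)) / (n1 + n2 - 2)
t = (treatment.mean() - placebo.mean()) / np.sqrt(s2 * (1/n1 + 1/n2))
print(t)    # about 1.86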
Now, we have to look at the t table with two tailed test as we are not sure about the efficacy of the new treatment on human beings. We will check the table at degrees of freedom = number of samples – 2 = 14 and at α = 0.05.
Table 3: t table (Source: Anonymous)
In the table, the critical value is 2.145, but our calculated value of about 1.86 is less than the critical value of the table. This shows that the difference between the two groups is not statistically significant.
Using Python
In order to work on the unpaired t-test, we can use the following data as an example:
Overall scores out of 10
For Treatment Group    For Placebo Group
6                      4
3                      4
4                      1
5                      3
5                      5
9                      6
7                      2
8                      7
Normality test
First, we have to work on the testing assumptions. In order to know the normality of the data, we can use the Kolmogorov-Smirnov Test (KS-Test) with Lilliefors Correction. For this test, write the following lines of codes:
import statsmodels.api as sm
a=([6,3,4,5,5,9,7,8,4,4,1,3,5,6,2,7])
z=sm.stats.lilliefors(a)
print(z)
Following result is obtained:
(0.1135359521708621, 0.2)
Here, the KS statistic is 0.1135359521708621 and the p-value is 0.2. These results show that the data points are comparatively close to the fitted normal distribution line. The p-value (0.2) is more than 0.05. So, it can be said that the null hypothesis could not be rejected, and the data follow a normal distribution.
The normality test in Python can also be performed by using the following lines of codes:
from scipy import stats
a=([6,3,4,5,5,9,7,8,4,4,1,3,5,6,2,7])
z=stats.normaltest(a)
print(z)
Following results are obtained:
NormaltestResult(statistic=0.06927143471667326, pvalue=0.9659572336259948)
It is again showing that the null hypothesis could not be rejected (as the p-value > 0.05), and the data follow a normal distribution.
Homogeneity of variance
In order to test the homogeneity of variance, we can use the following lines of codes:
from scipy import stats
x=([6,3,4,5,5,9,7,8])
y=([4,4,1,3,5,6,2,7])
z=stats.levene(x,y)
print(z)
In this, "x" shows "For Treatment Group", and "y" shows "For Placebo Group." These lines of codes give the following result:
LeveneResult(statistic=0.046357615894039736, pvalue=0.8326320992667434)
The analysis of homogeneity of variances shows that the findings meet the assumption of homogeneity of variance, as the p-value is 0.8326320992667434, i.e. p-value > 0.05, and there is no statistically significant difference between the group variances. So, now we can conduct the "t" test after looking at other assumptions.
"t" test
In order to perform the "t" test, write the following lines of codes:
from scipy.stats import ttest_ind
x=([6,3,4,5,5,9,7,8])
y=([4,4,1,3,5,6,2,7])
z= ttest_ind(x,y, equal_var=True)
print(z)
This gives the t-test results as follows:
Ttest_indResult(statistic=1.860521018838127, pvalue=0.08394585608714018)
In this case, we get the t-statistic as 1.8605 and the p-value as 0.08395, which is more than 0.05, showing that the results are not statistically significant.
Paired “t” test Testing Assumptions: There is a dependent continuous variable and an independent variable with two groups, or pairs. The observations or values are independent of each other. The dependent variable shows approximate normal distribution. There must not be any significant outlier in the dependent variable. N.B.: If these assumptions are not fulfilled, you may consider using non-parametric tests. Suppose we are working on the same treatment for the disease as mentioned above on 20 other participants. This time, we are checking the efficacy of the new treatment on 20 participants, i.e. n = 20 before the start of treatment and after the start of treatment, i.e. after one month of starting the treatment. After working on the participants, suppose we have obtained the following data that is showing the combined value of the quality-of-life and pain ranking out of 10. Higher scores show improvement, while lower scores show no effect of treatment.
Overall score out of 10

Participant | Before treatment | After treatment
1 | 3 | 8
2 | 1 | 6
3 | 5 | 7
4 | 6 | 9
5 | 1 | 10
6 | 1 | 7
7 | 1 | 5
8 | 3 | 6
9 | 0 | 4
10 | 5 | 8
11 | 5 | 7
12 | 5 | 9
13 | 3 | 6
14 | 2 | 8
15 | 6 | 6
16 | 2 | 7
17 | 2 | 9
18 | 4 | 8
19 | 1 | 6
20 | 7 | 5
From this data, we develop another table:

Participant | Before treatment | After treatment | Difference
1 | 3 | 8 | 5
2 | 1 | 6 | 5
3 | 5 | 7 | 2
4 | 6 | 9 | 3
5 | 1 | 10 | 9
6 | 1 | 7 | 6
7 | 1 | 5 | 4
8 | 3 | 6 | 3
9 | 0 | 4 | 4
10 | 5 | 8 | 3
11 | 5 | 7 | 2
12 | 5 | 9 | 4
13 | 3 | 6 | 3
14 | 2 | 8 | 6
15 | 6 | 6 | 0
16 | 2 | 7 | 5
17 | 2 | 9 | 7
18 | 4 | 8 | 4
19 | 1 | 6 | 5
20 | 7 | 5 | -2
After calculating the differences, it can be found that those differences are following approximately normal distribution and there are almost no extreme outliers. Therefore, paired t-test could be performed on the data. Mean difference calculated from this data is equal to 3.9, and standard deviation is 2.343. Therefore, standard error of the mean difference is equal to 0.52 (i.e. standard deviation / square root of total number of participants = 2.343 / square root of 20). In order to calculate the t-statistic, we have, t = mean difference / standard error of the mean difference = 3.9 / 0.52 = 7.5. Now we will look at the t table with two tailed test. For the table, degrees of freedom = number of samples – 1 = 19, and α = 0.05. Critical value in the t table is 2.093 and our calculated value is 7.5, i.e. larger than the critical value. So, there is strong evidence that the new treatment would work effectively. Using Python In order to work on paired t-test, we can use the following data, Overall score out of 10
Participant | Before treatment | After treatment
1 | 3 | 8
2 | 1 | 6
3 | 5 | 7
4 | 6 | 9
5 | 1 | 10
6 | 1 | 7
7 | 1 | 5
8 | 3 | 6
9 | 0 | 4
10 | 5 | 8
11 | 5 | 7
12 | 5 | 9
13 | 3 | 6
14 | 2 | 8
15 | 6 | 6
16 | 2 | 7
17 | 2 | 9
18 | 4 | 8
19 | 1 | 6
20 | 7 | 5

Enter the values in the columns. In this case, the values of "Before treatment" are entered in the first column, i.e. C1, and the values of "After treatment" are entered in the second column, i.e. C2.
Normality test
First, we have to work on the testing assumptions. In order to check the normality of the data, we can use the Kolmogorov-Smirnov Test (KS-Test) with Lilliefors Correction. For this test, write the following lines of code:

import statsmodels.api as sm
a = [3,1,5,6,1,1,1,3,0,5,5,5,3,2,6,2,2,4,1,7,8,6,7,9,10,7,5,6,4,8,7,9,6,8,6,7,9,8,6,5]
z = sm.stats.lilliefors(a)
print(z)

The following result is obtained:
(0.1351568591762069, 0.06343004032332776)
The data points are comparatively close to the fitted normal distribution line. The KS statistic is 0.135, and the p-value (0.063) is more than 0.05. So, the null hypothesis cannot be rejected, and the data follow a normal distribution.
Outlier Test
In order to find the outliers, write the following lines of code, which flag any point more than 2 standard deviations from the mean:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# combined before- and after-treatment scores
dataset = [3,1,5,6,1,1,1,3,0,5,5,5,3,2,6,2,2,4,1,7,8,6,7,9,10,7,5,6,4,8,7,9,6,8,6,7,9,8,6,5]
outliers = []

def outlier(x):
    stdev = 2
    mean_1 = np.mean(x)
    std_1 = np.std(x)
    for y in x:
        z_score = (y - mean_1) / std_1
        if np.abs(z_score) > stdev:
            outliers.append(y)
    return outliers

outlier_value = outlier(dataset)
print(outlier_value)
plt.boxplot(dataset)

This gives the following result and graph:
[]
These results show that there is no significant outlier. So, the paired "t" test can be conducted after looking at the other assumptions.
"t" test
Write the following lines of code:

import numpy as np
from scipy import stats

x = [3,1,5,6,1,1,1,3,0,5,5,5,3,2,6,2,2,4,1,7]   # before treatment
y = [8,6,7,9,10,7,5,6,4,8,7,9,6,8,6,7,9,8,6,5]  # after treatment
z = stats.ttest_rel(y, x)
zmx = np.mean(x)
zmy = np.mean(y)
print(z)
print(zmx)
print(zmy)

This gives the following results:
Ttest_relResult(statistic=7.2552976687586765, pvalue=6.930991031370908e-07)
3.15
7.05
These results show a statistic of 7.2553 and a p-value of 0.0000006931. This p-value is less than 0.05, so we can say that the results are statistically significant. If we look at the mean values, we can see that the mean increases "After treatment", i.e. 7.05 after treatment as compared to 3.15 before treatment.
Analysis of Variance (ANOVA)
ANOVA relates to the analysis of groups using one or more of their properties. For example,
1. A thing is made by two processes; which process performed better?
2. Students from different colleges took the same test; which college performed better?
3. Doctors treated similar patients with 3 different medicines; which medicine performed better?
Three different types of ANOVA are (1) One-Way ANOVA, in which the hypothesis is based on one property, such as the effect of age on learning; (2) Two-Way Factorial ANOVA, in which the hypothesis is based on 2 properties, such as the effect of age and gender on learning, and (3) N-way or Multivariate ANOVA, in which the hypothesis is based on more than 2 properties, such as the effect of age, gender, country, etc. on learning. The Two-Way ANOVA can further be of two types: (a) Balanced, in which the groups have the same sample size, and (b) Unbalanced, in which the groups do not have the same sample size. The Unbalanced case can further be of different subtypes, including (i) Type 1 – Hierarchical approach, in which unbalancing is unintentional and hierarchy is present; (ii) Type 2 – Classical experimental approach, in which unbalancing is unintentional and hierarchy is not present, and (iii) Type 3 – Full Regression approach, in which unbalancing is intentional. The following steps can be considered in the ANOVA test procedure:
1. Specify the hypotheses: the null hypothesis (H0) states no significant difference, and the alternative hypothesis (H1) states a significant difference.
2. Calculate the F-ratio and the probability of F.
3. Choose a significance level (α).
4. Compare the p-value of the F-ratio test with the above α.
5. Decide whether to reject or accept H0. If the null hypothesis is rejected, conclude that the means of the groups are not all equal.
Sum of Squares
It is used in ANOVA, and is represented by the following equation:
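In standard notation:

$$SS = \sum_{i=1}^{n} (x_i - \bar{x})^2$$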
Here, xi shows an individual value in the sample, and x̄ shows the mean of the sample.
Residual Sum of Squares Residual sum of squares is also known as sum of squared residuals (SSR), or sum of squared errors of prediction (SSE). It is represented by the following equation:
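In standard notation, with the fitted regression line α + βx:

$$RSS = \sum_{i=1}^{n} \left( y_i - (\alpha + \beta x_i) \right)^2$$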
Here, x shows the x-values, y shows the y-values, i shows an individual point, α shows the intercept of the regression line (a constant), β shows the slope of the regression line (a constant), and n shows the total number of points. It is represented by the following graph:
Example: Consider two population groups, where X = 1, 2, 3, 4 and Y = 4, 5, 6, 7, with constant values α = 1 and β = 2. Find the Residual Sum of Squares (RSS) for these values. Here, X = 1, 2, 3, 4, Y = 4, 5, 6, 7, α = 1, and β = 2. The solution is worked out below.
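As a minimal Python sketch of this calculation (using only the values given in the example), the residual for each point is y − (α + βx), and RSS is the sum of the squared residuals:

import numpy as np

x = np.array([1, 2, 3, 4])
y = np.array([4, 5, 6, 7])
alpha, beta = 1, 2

residuals = y - (alpha + beta * x)   # array([ 1,  0, -1, -2])
rss = np.sum(residuals ** 2)
print(rss)                           # 6

So RSS = 1 + 0 + 1 + 4 = 6 for these values.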
One way ANOVA
Testing Assumptions: There is a continuous dependent variable and an independent variable with two or more categorical groups. The data must represent independent observations. There must not be any significant outliers. The dependent variable must be normally distributed. The dependent variable must have equal variance in each population, i.e. homogeneity of variance.
N.B.: If these assumptions are not fulfilled, you may consider using non-parametric tests.
Suppose we have worked on the same disease and treatment as mentioned above, but now we have divided the participants of the study into four different groups receiving four different concentrations of the treatment or placebo for three months. Among those four groups, one group receives a 25% solution of the new treatment; the second group receives a 15% solution of the new treatment; the third group receives a 5% solution of the new treatment, and the fourth group receives placebo and is considered the control group. Every group has 5 samples or participants. Those groups give scores for quality of life and pain, and those scores are combined into an overall score out of 10.

Group A (placebo) | Group B (5% solution) | Group C (15% solution) | Group D (25% solution)
3 | 4 | 3 | 9
3 | 6 | 5 | 2
5 | 5 | 4 | 7
1 | 3 | 6 | 8
4 | 4 | 2 | 4
These numbers follow an approximately normal distribution, as shown by the following graph:
We will use the Analysis of Variance (ANOVA) procedure, i.e. One way ANOVA, in this case as we have more than two groups of data.
In order to perform ANOVA, our null hypothesis is that the means of all the groups are equal, and the alternative hypothesis is that the means of the groups are not all equal, at α=0.05. The test statistic for this process is the F statistic for ANOVA, where F is equal to the Mean Squares Between Treatments divided by the Mean Squares Error (or Residual). The appropriate critical value will be noted from the table of probabilities for the F distribution. Our degrees of freedom will be degree of freedom one (df1), which is equal to the total number of groups (k) minus one, i.e. df1 = k - 1 = 4 - 1 = 3, and degree of freedom two (df2), which is equal to the total number of samples in all groups (N) minus the number of groups (k), i.e. df2 = N – k = 20 – 4 = 16. With these values of degrees of freedom and at α=0.05, the critical value is 3.24. Therefore, we have to reject the null hypothesis if the observed F value is greater than or equal to 3.24. Table 4: F distribution table at alpha level of 0.05 (Source: Anonymous)
Now we will calculate the F statistic. For this calculation, it is important to take the sample mean for each group and then the overall mean on the basis of the total sample.

| Group A (placebo) | Group B (5% solution) | Group C (15% solution) | Group D (25% solution)
Number of samples (n) | 5 | 5 | 5 | 5
Group mean | 3.2 | 4.4 | 4 | 6
If we consider all N = 20 observations, the overall mean is equal to 4.4, i.e. 88/20 = 4.4. Now the Sum of Squares Between Treatments (SSB) is calculated by the following formula:
SSB = (number of samples in Group A)(Group A mean – Overall mean)² + (number of samples in Group B)(Group B mean – Overall mean)² + (number of samples in Group C)(Group C mean – Overall mean)² + (number of samples in Group D)(Group D mean – Overall mean)²
So,
SSB = 5 (3.2 – 4.4)² + 5 (4.4 – 4.4)² + 5 (4 – 4.4)² + 5 (6 – 4.4)²
SSB = 5 (1.44) + 5 (0) + 5 (0.16) + 5 (2.56)
SSB = 7.2 + 0 + 0.8 + 12.8
SSB = 20.8
Now, we will calculate the Sum of Squares for Errors (or Residuals) [SSE]. In order to calculate SSE, the squared differences between each observation and its group mean are required, i.e.
SSE = sum of (Score – 3.2)² over Group A + sum of (Score – 4.4)² over Group B + sum of (Score – 4.0)² over Group C + sum of (Score – 6.0)² over Group D.
It is calculated in parts. For the samples in Group A,

Group A | (Score – 3.2) | (Score – 3.2)²
3 | -0.2 | 0.04
3 | -0.2 | 0.04
5 | 1.8 | 3.24
1 | -2.2 | 4.84
4 | 0.8 | 0.64
Total | 0 | 8.8

For the samples in Group B,

Group B | (Score – 4.4) | (Score – 4.4)²
4 | -0.4 | 0.16
6 | 1.6 | 2.56
5 | 0.6 | 0.36
3 | -1.4 | 1.96
4 | -0.4 | 0.16
Total | 0 | 5.2

For the samples in Group C,

Group C | (Score – 4) | (Score – 4)²
3 | -1 | 1
5 | 1 | 1
4 | 0 | 0
6 | 2 | 4
2 | -2 | 4
Total | 0 | 10

For the samples in Group D,

Group D | (Score – 6) | (Score – 6)²
9 | 3 | 9
2 | -4 | 16
7 | 1 | 1
8 | 2 | 4
4 | -2 | 4
Total | 0 | 34
Now,
SSE = sum of (Score – 3.2)² over Group A + sum of (Score – 4.4)² over Group B + sum of (Score – 4.0)² over Group C + sum of (Score – 6.0)² over Group D
SSE = 8.8 + 5.2 + 10 + 34
SSE = 58
Now, the ANOVA Table is as follows:

Source of Variation | Sums of Squares (SS) | Degrees of Freedom (df) | Mean Squares (MS) | F
Between Treatments | 20.8 = SSB | 4-1=3 = df1 | 20.8/3=6.93 = MSB | MSB / MSE = 1.91
Error (or Residual) | 58 = SSE | 20-4=16 = df2 | 58/16=3.63 = MSE |
Total | 78.8 | 20-1=19 | |
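The following short numpy sketch is one way to verify these hand calculations; the group lists simply reuse the data from the table above:

import numpy as np

# scores for the four groups from the example above
groups = [
    [3, 3, 5, 1, 4],  # Group A (placebo)
    [4, 6, 5, 3, 4],  # Group B (5% solution)
    [3, 5, 4, 6, 2],  # Group C (15% solution)
    [9, 2, 7, 8, 4],  # Group D (25% solution)
]

all_scores = np.concatenate(groups)
overall_mean = all_scores.mean()

# Sum of Squares Between treatments and Sum of Squares for Error
ssb = sum(len(g) * (np.mean(g) - overall_mean) ** 2 for g in groups)
sse = sum(np.sum((np.array(g) - np.mean(g)) ** 2) for g in groups)

df1 = len(groups) - 1                  # 3
df2 = len(all_scores) - len(groups)    # 16
f_value = (ssb / df1) / (sse / df2)

print(round(ssb, 1), round(sse, 1), round(f_value, 2))   # 20.8 58.0 1.91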
From this calculated F value, i.e. 1.91, we can conclude that we cannot reject the null hypothesis, as 1.91 is smaller than 3.24. We don't have statistically significant evidence at α=0.05 to show that there is a difference in the mean score among the four groups or treatment levels. In short, the new treatment at the tested concentrations (up to 25% in solution) is not more effective than placebo.
Using Python
In order to work on one way ANOVA, we can use the following data:

Group A (placebo) | Group B (5% solution) | Group C (15% solution) | Group D (25% solution)
3 | 4 | 3 | 9
3 | 6 | 5 | 2
5 | 5 | 4 | 7
1 | 3 | 6 | 8
4 | 4 | 2 | 4
Normality test
First, we have to work on the testing assumptions. In order to check the normality of the data, we can use the Kolmogorov-Smirnov Test (KS-Test) with Lilliefors Correction. For this test, write the following lines of code:

import statsmodels.api as sm
a = [3, 4, 3, 9, 3, 6, 5, 2, 5, 5, 4, 7, 1, 3, 6, 8, 4, 4, 2, 4]
z = sm.stats.lilliefors(a)
print(z)

The following result is obtained:
(0.17785715032749605, 0.09726116062372032)
The p-value (0.0972) is more than 0.05. So, the null hypothesis cannot be rejected, and the data follow a normal distribution.
Outlier Test
In order to find the outliers, write the following lines of code, which flag any point more than 3 standard deviations from the mean:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# all 20 scores from the four groups combined
dataset = [3,4,3,9,3,6,5,2,5,5,4,7,1,3,6,8,4,4,2,4]
outliers = []

def outlier(x):
    stdev = 3
    mean_1 = np.mean(x)
    std_1 = np.std(x)
    for y in x:
        z_score = (y - mean_1) / std_1
        if np.abs(z_score) > stdev:
            outliers.append(y)
    return outliers

outlier_value = outlier(dataset)
print(outlier_value)
plt.boxplot(dataset)

This gives the following result and graph:
[]
Although the boxplot shows "9" as an outlier, it is still not a point outside 3 standard deviations.
Homogeneity of variance
In order to test the homogeneity of variance, we can use the following lines of code:

from scipy import stats
A = [3,3,5,1,4]
B = [4,6,5,3,4]
C = [3,5,4,6,2]
D = [9,2,7,8,4]
z = stats.levene(A, B, C, D)
print(z)

In this, "A" shows "Group A", "B" shows "Group B", "C" shows "Group C", and "D" shows "Group D". These lines of code give the following result:
LeveneResult(statistic=1.2677595628415304, pvalue=0.3189265658323015)
The analysis shows that the data meet the assumption of homogeneity of variance, as the p-value (0.3189) is more than 0.05, and there is no statistically significant difference in variance between the groups. So, we can now conduct one way ANOVA after looking at the other assumptions.
One Way ANOVA
Write the following lines of code:

from scipy.stats import f_oneway
A = [3,3,5,1,4]
B = [4,6,5,3,4]
C = [3,5,4,6,2]
D = [9,2,7,8,4]
x = f_oneway(A, B, C, D)
print(x)

This gives the following result:
F_onewayResult(statistic=1.9126436781609193, pvalue=0.16824503153272194)
This shows that the F-value is equal to 1.913 and the p-value is equal to 0.168. This p-value is more than 0.05, showing that the results are not statistically significant.
Two way ANOVA
A Two-Factor (two way) ANOVA procedure considers two independent variables at the same time; for example, we can consider different ages and different concentrations of treatment together, as opposed to only different concentrations as shown above in the One way ANOVA.
Testing Assumptions: There is a continuous dependent variable and two independent variables with two or more categorical groups. Observations are independent of each other. Samples represent a normal distribution. Samples show homogeneity of variance. There are no significant outliers in the data.
N.B.: If these assumptions are not fulfilled, you may consider using non-parametric tests.
Suppose we have participants belonging to two different age groups, i.e. Group A having 15 samples (participants) in the age range of 10-25 years, and Group B having 15 samples (participants) in the age range of 26-40 years. Each group of 15 samples was randomly assigned to three different solutions of the new treatment, i.e. Treatment X received a 5% solution, Treatment Y received a 15% solution, and Treatment Z received a 25% solution. Suppose we have obtained the following data:

Treatment | Group A (age 10-25 years) | Group B (age 26-40 years)
X (5% solution) | 3, 4, 4, 5, 4 | 4, 5, 5, 6, 7
Y (15% solution) | 5, 6, 5, 8, 6 | 6, 5, 6, 7, 7
Z (25% solution) | 7, 7, 8, 6, 6 | 7, 9, 9, 7, 8
Here, we will use two way ANOVA as we have two groups, i.e. Group A of participants in the age range of 10-25 years and Group B of participants in the age range of 26-40 years, and three treatments, i.e. X receiving a 5% solution, Y receiving a 15% solution, and Z receiving a 25% solution. In order to perform two way ANOVA, we have the following ANOVA table:

Source of Variation | Sums of Squares (SS) | Degrees of freedom (df) | Mean Squares (MS) | F
Total number of groups | SSN | df1 = Total number of groups – 1 = 6-1 = 5 | SSN/df1 = MSN | MSN/MSE
Treatment | SST | df2 = Total number of treatment groups – 1 = 3-1 = 2 | SST/df2 = MST | MST/MSE
Groups | SSG | df3 = Total number of groups by age range – 1 = 2-1 = 1 | SSG/df3 = MSG | MSG/MSE
Treatment versus Group interaction | SSTG = SSN – (SST + SSG) | df4 = df2 * df3 = 2*1 = 2 | SSTG/df4 = MSTG | MSTG/MSE
Error (or Residual) | SSE | df5 = Total number of samples – (Total number of groups by age range * Total number of treatment groups) = 30 – (2*3) = 30-6 = 24 | SSE/df5 = MSE |
Total | | | |

After thorough calculation, we can make another table with the final values.

Source of Variation | Sums of Squares (SS) | Degrees of freedom (df) | Mean Squares (MS) | F
Total number of groups | 45 | df1 = 5 | 45/5 = 9 | 9/0.95 = 9.47
Treatment | 36.5 | df2 = 2 | 36.5/2 = 18.3 | 18.3/0.95 = 19.26
Groups | 6.5 | df3 = 1 | 6.5/1 = 6.5 | 6.5/0.95 = 6.84
Treatment versus Group interaction | 2 | df4 = 2 | 2/2 = 1 | 1/0.95 = 1.05
Error (or Residual) | 22.8 | df5 = 24 | 22.8/24 = 0.95 |
Total | | | |
In this table, there are four statistical tests. The first test is an overall test to check whether a difference exists between the 6 group means. The F statistic is 9.47, which is greater than the critical value of 2.62 at an alpha level of 0.05, so it is statistically significant. After looking at the significance of the overall test, it is important to check the factors that may be responsible for the significance, i.e. treatment, group, or their interaction. The F statistic for the treatment is 19.26, which is greater than the critical value of 3.40, so it is significant. Similarly, the F statistic for the groups is 6.84, which is greater than the critical value of 4.26, so it is significant. However, the F statistic for the treatment versus group interaction is 1.05, which is lower than the critical value of 3.40, so it is non-significant. Now, we will look at the mean values of the different groups and treatments. So, we have the following table:

Treatment | Group A | Group B
X | 4 | 5.4
Y | 6 | 6.2
Z | 6.8 | 8

Treatment Z appears to be the best treatment across the different treatments and groups. Among all treatments and groups, Group B shows better results.
Using Python
In order to work on two-way ANOVA, we can use the following data as an example:
Treatment | Group A (age 10-25 years) | Group B (age 26-40 years)
X (5% solution) | 3, 4, 4, 5, 4 | 4, 5, 5, 6, 7
Y (15% solution) | 5, 6, 5, 8, 6 | 6, 5, 6, 7, 7
Z (25% solution) | 7, 7, 8, 6, 6 | 7, 9, 9, 7, 8
Normality test
First, we have to work on the testing assumptions. In order to check the normality of the data, we can use the Kolmogorov-Smirnov Test (KS-Test) with Lilliefors Correction. For this test, write the following lines of code:

import statsmodels.api as sm
a = [3,4,4,5,4,5,5,6,4,7,5,6,6,5,5,6,8,7,6,7,7,7,7,9,8,9,6,7,6,8]
z = sm.stats.lilliefors(a)
print(z)

The following result is obtained:
(0.12910523676981722, 0.2)
The p-value (0.2) is more than 0.05. So, the null hypothesis cannot be rejected, and the data follow a normal distribution.
Outlier Test
In order to find the outliers, write the following lines of code, which flag any point more than 3 standard deviations from the mean:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# all 30 scores from the six treatment-by-group cells combined
dataset = [3,4,4,5,4,5,5,6,4,7,5,6,6,5,5,6,8,7,6,7,7,7,7,9,8,9,6,7,6,8]
outliers = []

def outlier(x):
    stdev = 3
    mean_1 = np.mean(x)
    std_1 = np.std(x)
    for y in x:
        z_score = (y - mean_1) / std_1
        if np.abs(z_score) > stdev:
            outliers.append(y)
    return outliers

outlier_value = outlier(dataset)
print(outlier_value)
plt.boxplot(dataset)

This gives the following result and graph:
[]
These results show that there is no significant outlier.
Homogeneity of variance
In order to test the homogeneity of variance, we can use the following lines of code:

from scipy import stats
AX = [3,4,4,5,4]
BX = [4,5,5,6,7]
AY = [5,6,5,8,6]
BY = [6,5,6,7,7]
AZ = [7,7,8,6,6]
BZ = [7,9,9,7,8]
z = stats.levene(AX, BX, AY, BY, AZ, BZ)
print(z)

These lines of code give the following result:
LeveneResult(statistic=0.3200000000000001, pvalue=0.8960017225912339)
The analysis shows that the data meet the assumption of homogeneity of variance, as the p-value (0.896) is more than 0.05, and there is no statistically significant difference in variance between the groups. So, we can now conduct two way ANOVA after looking at the other assumptions.
Two Way ANOVA
The data in the table (in the example) is imported as a .csv file, with the information from the table written in long format (one row per observation, with columns for group, treatment, and response) as follows:
We have to import plotly, numpy, pandas, scipy, and statsmodels, and write the following lines of code:

import plotly.plotly as py
import plotly.graph_objs as go
from plotly.tools import FigureFactory as FF
import numpy as np
import pandas as pd
import scipy
import statsmodels
import statsmodels.api as sm
from statsmodels.formula.api import ols

data = pd.read_csv('/Book1.csv')
df = data[0:30]
table = FF.create_table(df)
py.iplot(table, filename='Book1.csv')

formula = 'response ~ C(group) + C(treatment) + C(group):C(treatment)'
model = ols(formula, data).fit()
aov_table = statsmodels.stats.anova.anova_lm(model, typ=2)
print(aov_table)

These lines of code give the following results:
In this table, the p-values for groups and treatments are less than 0.05, i.e. they show statistically significant results. We can also check the mean values of the different groups and treatments to know which treatment is best and which group is performing better than the other. For instance, write the following lines of code:

import statistics as st

AX = [3,4,4,5,4]
BX = [4,5,5,6,7]
AY = [5,6,5,8,6]
BY = [6,5,6,7,7]
AZ = [7,7,8,6,6]
BZ = [7,9,9,7,8]

print(st.mean(AX))
print(st.mean(BX))
print(st.mean(AY))
print(st.mean(BY))
print(st.mean(AZ))
print(st.mean(BZ))

This gives the following results:
4
5.4
6
6.2
6.8
8
These results show that Treatment # 3 (i.e. Z) appears to be the best treatment across the different treatments and groups. Among all treatments and groups, Group # 2 (i.e. B) shows better results.
Different types of ANOVAs
ANOVA is a type of regression that is used to compare means across more than two independent groups. As noted earlier, it can be of different types, including:
1. One Way ANOVA, which is related to one independent variable, for example, the effect of a drug, standard drug & placebo on heart rate;
2. Two Way ANOVA, which has two independent variables, for example, the effect of a drug, standard drug, & placebo and gender on heart rate;
3. Three Way ANOVA, which has three independent variables, for example, the effect of a drug, standard drug, & placebo, gender, and age on heart rate;
4. Repeated measures ANOVA, which is used when the same samples or participants are measured over all the treatments, for example, the effect of a weight loss program on a group of subjects at baseline and then every month;
5. An extension of ANOVA, i.e. Multivariate Analysis of Variance (MANOVA), which is used when there is more than one dependent variable, for example, the effect of a drug, standard drug, & placebo on heart rate and mobility, and
6. An ANOVA with a covariate (i.e. independent variable + dependent variable + covariate), referred to as Analysis of Covariance (ANCOVA). It controls for the confounding variable (a covariate that can change the effect of the independent variable on the dependent variable). For example, the effect of a drug, standard drug, & placebo on heart rate, while considering the baseline heart rate (outcome variable before treatment) or age as a covariate.
Using Python – General MANOVA
Suppose researchers want to check the effect of a Drug on heart rate and mobility, and perform an experiment to check the difference of the Drug from the Standard drug and placebo. In order to perform the experiment, the researchers work with 36 participants, and expose 12 participants to the Drug, 12 participants to the Standard drug, and 12 participants to the placebo. The results are obtained in the form of ratings from 1 to 10, where higher numbers show increased heart rate and mobility. Put this data in Excel and save the file as a .csv file as shown below:
This .csv file is then imported to get the required results. So, write the following lines of code:

import pandas as pd
from statsmodels.multivariate.manova import MANOVA

file = 'D:/folder/General Manova example.csv'
df = pd.read_csv(file, index_col=0)
df.head()
maov = MANOVA.from_formula('Heart_rate + Mobility ~ Intervention', data=df)
print(maov.mv_test())

In this code, you have to specify the path to your own file in place of 'D:/folder/General Manova example.csv'. The following results are obtained:
These results show that the one-way MANOVA is statistically significant, as the p-values are less than 0.05. Therefore, it can be said that there is a statistically significant difference between the interventions on the outcomes. In order to look at the mean values of the different treatments (interventions) and outcomes, write the following lines of code:

groupby_mean = df.groupby(['Intervention']).mean()
print(groupby_mean)

This gives the following results:
The "Drug" shows better results as compared to the "Standard Drug" and "Placebo" in the case of Heart rate. However, the "Drug" is almost similar to the "Standard Drug" in the case of Mobility, but both show useful results as compared to "Placebo."
Factor Analysis
Factor analysis is used as a data reduction method or a structure detection method. It is used for complex sets of data as, for example, related to
Socioeconomic status, Psychological studies, Data mining, and Other such studies. It is mainly used for 1. Reduction of the number of variables (i.e. various items), and 2. Detection of the structure or pattern in the relationships between different variables. In this case, it is helpful in classification of the variables, and related variables are considered as clusters or factors. Factors are more efficient as compared to individual variables in providing outcomes in some studies. A factor represents a cluster of several correlated variables as shown below:
For example, a product may have several features (factors) but some features (variable a) make it more useful, and some features (variable b) are just to support them. Path Analysis It is an extension of multiple regression that is used to work on 1. Relationship (correlations) between variables shown by path coefficients, which are the standard regression coefficients from multiple regression, and 2. Direction or causality in the relationship between variables. For example,
In this example, Variable 1 can directly impact Variable 5, OR Variable 1 can indirectly impact Variable 5 through Variable 3, OR Variable 1 can indirectly impact Variable 5 through Variables 2 and 4. Structural Equation Modeling It is a technique that is an advancement of path analysis and it is based on causal relationships between variables. It is a confirmatory technique (rather than exploratory technique). Here, confirmatory technique is used to confirm the working of a proposed model, and exploratory technique is used to explore a specific relationship with less model building. It is used in Social sciences Population genetics Economics Ecology Marketing, and Other such fields For example,
In this example, Latent variable 1 is causing or affecting Latent variables 2 and 3, and Manifest variables 1, 2, and 3, and so on. Here, Latent variables are key variables of interest that can be measured only indirectly as, for example, quality of soil. Manifest variables can be measured and/or scored as, for example, bulk density of soil.
Effect size
Effect size shows the magnitude of the difference between two different variables or groups, indicating the strength of the difference between groups on a numeric scale. An absolute effect size shows the difference between the mean or average outcomes of two groups. The effect size can be of different types, such as the standardized means difference, odds ratio, Cohen's d effect size, Pearson r correlation, Hedges' g method of effect size, and Glass's Δ method of effect size. The standardized means difference is obtained by dividing the difference of the means of two groups by their standard deviation. The equation is as follows:
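In standard notation, with group means x̄1 and x̄2 and a common standard deviation s:

$$\text{SMD} = \frac{\bar{x}_1 - \bar{x}_2}{s}$$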
Odds ratio shows the odds of success in one group relative to the odds of success in the other group. It is measured by considering the following table and equation:

| Success | Failure
Treatment | a | b
Placebo | c | d
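In standard notation:

$$OR = \frac{a/b}{c/d} = \frac{a \times d}{b \times c}$$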
Answer | ES
1 | The odds are the same in both groups (no effect)
1.5 | Small
2 | Medium
3 | Large
Cohen’s d effect size is obtained by dividing the difference of means of two groups by the standard deviation from the data. It is measured by considering the following equations and table:
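In standard notation, with group means x̄1 and x̄2 and group standard deviations s1 and s2:

$$d = \frac{\bar{x}_1 - \bar{x}_2}{s}, \qquad s = \sqrt{\frac{s_1^2 + s_2^2}{2}}$$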
Answer | ES
0.2 | Small
0.5 | Medium
0.8 | Large
1.3 | Very Large
Pearson r correlation shows an association of two variables, x and y. It is measured by considering the following equation and table:
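In standard notation:

$$r = \frac{N\sum xy - (\sum x)(\sum y)}{\sqrt{\left[N\sum x^2 - (\sum x)^2\right]\left[N\sum y^2 - (\sum y)^2\right]}}$$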
Here, N shows the number of pairs of scores, Σxy shows the sum of the products of paired scores, Σx shows the sum of x scores, Σy shows the sum of y scores, Σx² shows the sum of squared x scores, and Σy² shows the sum of squared y scores.

Answer | ES
±0.2 | Small
±0.5 | Medium
±0.8 | Large
Hedges’ g method of effect size is a modified method of Cohen’s d. It is determined by the following equation:
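In standard notation:

$$g = \frac{\bar{x}_1 - \bar{x}_2}{s^{*}}$$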
where

$$s^{*} = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$$

Here, n1 and n2 are the sample sizes and s1 and s2 are the standard deviations of the two groups.
Glass’s Δ method of effect size is a modified method of Cohen’s d. It is determined by the following equation:
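In standard notation:

$$\Delta = \frac{\bar{x}_1 - \bar{x}_2}{s_2}$$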
Here, s2 is the standard deviation of the control group or second group.
Using Python
Simple code can be developed to measure the different effect sizes.
Odds Ratio and Mantel-Haenszel Odds Ratio
The Odds Ratio (OR) is used to determine the effect size of the difference between two interventions or treatments. Suppose there is a bacterial disease that could be treated with fish meat, especially rainbow trout. So, we can develop two groups of mice (animal models) to compare the treatments' efficacy. One group receives the standard treatment with commonly available fishes, and the other group receives rainbow trout. After giving the treatment with the fishes, suppose we get the following results.
| Standard treatment with fishes | Treatment with rainbow trout | Odds
Mouse Died | 454 | 19 | 454/19 = 23.89
Mouse Survived | 358 | 105 | 358/105 = 3.41
Total | 812 | 124 |

Odds Ratio = 23.89/3.41 = 7
This table shows that the animals who received the standard treatment with fishes died 7 times more often as compared to the animals who received the treatment with rainbow trout.
Now suppose we perform the experiment on male and female animal models and get the following results:

Female animal models
| Standard treatment with fishes | Treatment with rainbow trout | Totals | OR
Animals died | 53 = a | 7 = b | 60 = tfd |
Animals survived | 102 = c | 36 = d | 138 = tfs | 2.67
Totals | 155 = nfs | 43 = nfr | 198 = nf |

Male animal models
| Standard treatment with fishes | Treatment with rainbow trout | Totals | OR
Animals died | 111 = e | 13 = f | 124 = tmd |
Animals survived | 152 = g | 71 = h | 223 = tms | 3.99
Totals | 363 = nms | 84 = nmr | 447 = nm |

The table shows that the impact of treatment in males is greater, as the OR in male animal models is higher than the OR in female animal models. In order to check the impact of treatment across the different sexes, a "weighted" OR, the Mantel-Haenszel OR (ORMH), is used. ORMH is as follows:
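In standard notation, combining the female and male tables (with nf and nm their totals):

$$OR_{MH} = \frac{\dfrac{a\,d}{n_f} + \dfrac{e\,h}{n_m}}{\dfrac{b\,c}{n_f} + \dfrac{f\,g}{n_m}} = \frac{\dfrac{53 \times 36}{198} + \dfrac{111 \times 71}{447}}{\dfrac{7 \times 102}{198} + \dfrac{13 \times 152}{447}} \approx 3.4$$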
This shows that the weighted chance of death related to the standard treatment is 3.4 times the chance of death of animal models receiving treatment with rainbow trout.
Using Python
In order to work on the Odds Ratio, we can use the following data as an example:

| Standard treatment with fishes | Treatment with rainbow trout
Mouse Died | 454 | 19
Mouse Survived | 358 | 105

We calculate the odds ratio by writing the following lines of code:

from scipy import stats
oddsratio = stats.fisher_exact([[454, 19], [358, 105]])
print(oddsratio)

In this case, we get 7.008 as the first element of the output (the second element is the p-value of Fisher's exact test). In order to work on ORMH, we can use the following data:
Female animal models
| Standard treatment with fishes | Treatment with rainbow trout | Totals | OR
Animals died | 53 = a | 7 = b | 60 = tfd |
Animals survived | 102 = c | 36 = d | 138 = tfs | 2.67
Totals | 155 = nfs | 43 = nfr | 198 = nf |

Male animal models
| Standard treatment with fishes | Treatment with rainbow trout | Totals | OR
Animals died | 111 = e | 13 = f | 124 = tmd |
Animals survived | 152 = g | 71 = h | 223 = tms | 3.99
Totals | 363 = nms | 84 = nmr | 447 = nm |
We can write the following lines of code:

x = (((53*36)/198) + ((111*71)/447)) / (((102*7)/198) + ((152*13)/447))
print(x)

This gives us the Mantel-Haenszel (weighted) odds ratio, ORMH. In this result, we get 3.397.
Correlation Coefficient
It is represented by the following equation:
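In standard notation:

$$r = \frac{N\sum xy - (\sum x)(\sum y)}{\sqrt{\left[N\sum x^2 - (\sum x)^2\right]\left[N\sum y^2 - (\sum y)^2\right]}}$$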
Here, N shows the number of pairs of scores, Σxy shows the sum of the products of paired scores, Σx shows the sum of x scores, Σy shows the sum of y scores, Σx² shows the sum of squared x scores, and Σy² shows the sum of squared y scores.
Example: Suppose research is performed on young animal models of approximately the same weight. Initially, the animals were injected with the disease-causing bacteria and placed in the lab for five days. After 5 days, their physical condition was checked, and their weight was assessed. After thorough assessment, those animals were provided with a specific amount of rainbow trout, and after 5 days their physical condition was again checked and their weight was assessed. Suppose the thorough assessment gives the following results:

Animal Model # | Gram of rainbow trout per day (i.e. feed of animal models for 5 days) | Weight of animals (gms) after providing the specific amount of rainbow trout per day
1 | 1 | 7
2 | 2 | 10
3 | 3 | 15
4 | 4 | 16
5 | 5 | 17
6 | 6 | 18
This information helps in providing the following graph:
The correlation coefficient, which is also represented by 'r', can help in finding whether increased grams of rainbow trout are associated with an increased level of improvement in the animal models. The sample correlation coefficient varies from -1 to +1. The sign gives the direction of the linear association between the two variables, i.e. grams of rainbow trout given to the animal models and improvement in their condition, and the closer the value is to -1 or +1, the stronger that association; if the correlation is close to zero, there is little or no linear relation between the two variables. However, before going further, it is important to know that in this case, grams of rainbow trout is the independent variable and is presented on the x-axis, and weight of animals is the dependent variable and is presented on the y-axis. So, we develop a scatter diagram as shown here. Each point on the diagram shows an (x,y) pair. This scatter diagram apparently shows a positive or direct relation between the two variables, i.e. an increased amount of fish taken per day can increase the level of improvement. We can also develop a table showing the total and mean values of grams of rainbow trout given to the animal models and their condition.
Animal Model # | Gram of rainbow trout per day (i.e. feed of animal models for 5 days) | Weight of animals after providing a specific amount of rainbow trout
1 | 1 | 7
2 | 2 | 10
3 | 3 | 15
4 | 4 | 16
5 | 5 | 17
6 | 6 | 18
Total | 21 | 83
Mean value | 3.5 = X-mean | 13.8 = Y-mean
In order to work on the sample correlation coefficient, the variance of the values on the Y-axis, the variance of the values on the X-axis, and the covariance of the values on both the X-axis and Y-axis [Cov(x,y)] are required. The variance of the values on the X-axis is as follows:

Animal Model # | Gram of rainbow trout per day (i.e. feed of animal models for 5 days) | Grams – X-mean | (Grams – X-mean)²
1 | 1 | -2.5 | 6.25
2 | 2 | -1.5 | 2.25
3 | 3 | -0.5 | 0.25
4 | 4 | 0.5 | 0.25
5 | 5 | 1.5 | 2.25
6 | 6 | 2.5 | 6.25
Total | 21 | 0 | 17.5

The variance of the values on the X-axis = 17.5/6 = 2.9.
The variance of the values on the Y-axis is as follows:

Animal Model # | Weight of animals after providing a specific amount of rainbow trout | Weight – Y-mean | (Weight – Y-mean)²
1 | 7 | -6.83 | 46.69
2 | 10 | -3.83 | 14.69
3 | 15 | 1.17 | 1.36
4 | 16 | 2.17 | 4.69
5 | 17 | 3.17 | 10.03
6 | 18 | 4.17 | 17.36
Total | 83 | ≈0 | 94.83

The variance of the values on the Y-axis = 94.83/6 = 15.8.
The covariance of the values on the X-axis and Y-axis is presented as:

Animal Model # | Grams – X-mean | Weight – Y-mean | (Grams – X-mean)(Weight – Y-mean)
1 | -2.5 | -6.83 | 17.075
2 | -1.5 | -3.83 | 5.745
3 | -0.5 | 1.17 | -0.585
4 | 0.5 | 2.17 | 1.085
5 | 1.5 | 3.17 | 4.755
6 | 2.5 | 4.17 | 10.425
Total | | | 38.5

The covariance of the values on the X-axis and the values on the Y-axis = Cov(x,y) = 38.5/6 = 6.4. The formula for calculation of the sample correlation coefficient is as follows:
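In standard notation, with Sx and Sy the standard deviations of the X and Y values (the square roots of the variances calculated above):

$$r = \frac{Cov(x, y)}{S_x \times S_y} = \frac{6.4}{\sqrt{2.9} \times \sqrt{15.8}}$$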
r = 6.4 / 6.8
r = 0.94
This value of the sample correlation coefficient clearly shows a strong positive correlation.
Using Python
In order to work on the Correlation Coefficient, we can use the following data:

Animal Model # | Gram of rainbow trout per day (i.e. feed of animal models for 5 days) | Weight of animals (gms) after providing the specific amount of rainbow trout per day
1 | 1 | 7
2 | 2 | 10
3 | 3 | 15
4 | 4 | 16
5 | 5 | 17
6 | 6 | 18

We can write the following lines of code:
import numpy as np
x = [1,2,3,4,5,6]
y = [7,10,15,16,17,18]
z = np.corrcoef(x,y)
print(z)

This gives a correlation matrix whose off-diagonal value is 0.945064. This value of the sample correlation coefficient clearly shows a strong positive correlation.
R-squared and Adjusted R-squared
R-squared estimates how much of the variation in one variable (the dependent variable) is explained by the variation in a second variable (the independent variable). Adjusted R-squared adjusts this statistic on the basis of the number of independent variables. It is represented by the following equation:
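In standard notation:

$$\text{Adjusted } R^2 = 1 - \frac{(1 - R^2)(N - 1)}{N - p - 1}$$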
Here, R² shows the coefficient of determination (R-squared) or the square of the correlation coefficient (R); N shows the number of observations/data points, and p shows the number of parameters/independent variables/independent regressors. Note that Adjusted R² ≤ R². Also note that a decrease in the number of useless variables results in an increase in the value of Adjusted R², whereas an increase in the number of useless variables results in a decrease in the value of Adjusted R².
Regression Analysis
Simple linear regression on the obtained data can be performed to estimate the relationships between variables.
For simple linear regression analysis, we can consider the data as mentioned above and develop the following table:

Animal Model # | Values on X-axis | Values on Y-axis | (Values on X-axis)² | (Values on Y-axis)² | (Values on X-axis)(Values on Y-axis)
1 | 1 | 7 | 1 | 49 | 7
2 | 2 | 10 | 4 | 100 | 20
3 | 3 | 15 | 9 | 225 | 45
4 | 4 | 16 | 16 | 256 | 64
5 | 5 | 17 | 25 | 289 | 85
6 | 6 | 18 | 36 | 324 | 108
Mean | 3.5 | 13.83 | | |
Sum Total | 21 = Ʃx | 83 = Ʃy | 91 = Ʃx² | 1243 = Ʃy² | 329 = Ʃxy
Square of the sum total | 441 = (Ʃx)² | 6889 = (Ʃy)² | | |
Now we need three corrected sums of squares, i.e. SST, SSX, and SSXY. So,
SST = Ʃy² – (Ʃy)²/n = 1243 – 6889/6 = 94.83
Then,
SSX = Ʃx² – (Ʃx)²/n = 91 – 441/6 = 17.5
Then,
SSXY = Ʃxy – (Ʃx)(Ʃy)/n = 329 – (21 × 83)/6 = 38.5
Now the slope of the best fit line, which is represented by b, and the intercept, which is represented by a, can be calculated by the following equations:
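In standard notation, the slope is:

$$b = \frac{SSXY}{SSX} = \frac{38.5}{17.5} = 2.2$$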
And
a = Mean of the values on Y-axis – b × Mean of the values on X-axis
a = 13.8 – 2.2 × 3.5 = 6.1
Now, we will calculate the regression sum of squares (SSR) and the error sum of squares (SSE) as follows:
SSR = b × SSXY
SSR = 2.2 × 38.5 = 84.7
And
SSE = SST – SSR
SSE = 94.8 – 84.7 = 10.13
A completed ANOVA table with all these values can be represented as follows:

Source | SS | df | MS | F
Regression | 84.7 | 1 | 84.7 | 33.48
Error | 10.13 | 4 | 2.53 |
Total | 94.83 | 5 | |
The calculated F value of 33.48 is much larger than the critical value of 7.71 at α=0.05 with 1 degree of freedom in the numerator and 4 degrees of freedom in the denominator. This clearly shows that we have to reject the null hypothesis that the slope of the line is zero, and we can confidently work on the increased amount of rainbow trout for increasing the level of improvement in the case of the disease. Standard errors of the slope (SEb) and intercept (SEa) can be calculated as follows:
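In the usual simple linear regression notation (with MSE the error mean square from the ANOVA table above, n the number of points, and x̄ the mean of the X values):

$$SE_b = \sqrt{\frac{MSE}{SSX}}, \qquad SE_a = \sqrt{MSE\left(\frac{1}{n} + \frac{\bar{x}^2}{SSX}\right)}$$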
In order to calculate 95% confidence intervals for the slope and intercept, the 2-tailed value of Student's t with 4 degrees of freedom is required. So, it is 2.776. This value is then multiplied by SEa and SEb to get the 95% confidence intervals for the intercept and slope, respectively. So,
a = 6.1 ± (SEa × 2.776) = 6.1 ± (1.48 × 2.776) = 6.1 ± 4.11
b = 2.2 ± (SEb × 2.776) = 2.2 ± (0.37 × 2.776) = 2.2 ± 1.03 = 1.17 to 3.23
This shows that we are 95% confident that the average weight of the animal models increases between 1.17 and 3.23 grams per one gram increase of rainbow trout in the diet. Moreover, this confidence interval does not contain zero, thereby showing a significant relationship between the two variables at an alpha level of 0.05.
Using Python
In order to work on Regression Analysis, we can use the following data:
Animal Model # | Gram of rainbow trout per day (i.e. feed of animal models for 5 days) | Weight of animals (gms) after providing the specific amount of rainbow trout per day
1 | 1 | 7
2 | 2 | 10
3 | 3 | 15
4 | 4 | 16
5 | 5 | 17
6 | 6 | 18

Enter the following lines of code:

import numpy as np
import statsmodels.api as sm

Y = [7,10,15,16,17,18]
X = [1,2,3,4,5,6]
X = sm.add_constant(X)
model = sm.OLS(Y, X)
results = model.fit()
print(results.summary())

This gives the results. In the results, the F-statistic is 33.43 and the p-value is 0.004. The const (intercept) is 6.133 and x1 (the slope of the best fit line) is 2.2. With a p-value less than 0.05, they show statistical significance.
Logistic Regression
Using Python
In order to work on logistic regression, we can use the classic example of hours studied versus passing an exam. Write the following lines of code:

import statsmodels.api as sm

# hours studied by 20 students and whether each passed (1) or failed (0)
hours = (0.50,0.75,1.00,1.25,1.50,1.75,1.75,2.00,2.25,2.50,2.75,3.00,3.25,3.50,4.00,4.25,4.50,4.75,5.00,5.50)
pass_h = (0,0,0,0,0,0,1,0,1,0,1,0,1,0,1,1,1,1,1,1)

hours = sm.add_constant(hours)
model = sm.Logit(pass_h, hours)
result = model.fit()
print(result.summary())

The following results are obtained:
Black-Scholes model
It was developed by Fischer Black, Robert Merton, and Myron Scholes in 1973. It is most commonly used for pricing European-style options. It assumes that the price of a heavily traded asset follows a Geometric Brownian motion (i.e. showing constant drift and volatility). An equation for the Black-Scholes Model is shown below:
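In standard notation (with the symbols defined below, and consistent with the Python implementation at the end of this section):

$$C = S\,N(d_1) - K e^{-rT} N(d_2)$$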
Here, C shows Value of Call Option, N(.) shows the cumulative distribution function of the standard normal distribution, and S shows Stock Price. Moreover,
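$$d_1 = \frac{\ln(S/K) + \left(r + \frac{\sigma^2}{2}\right)T}{\sigma\sqrt{T}}$$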
Here, T shows the exercise date, K shows strike or exercise price, and r shows risk free interest rate, and
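$$d_2 = \frac{\ln(S/K) + \left(r - \frac{\sigma^2}{2}\right)T}{\sigma\sqrt{T}} = d_1 - \sigma\sqrt{T}$$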
Here, σ shows annualized volatility, and T-t shows time to maturity (expressed in years).
Another equation for Black-Scholes Model is shown below:
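$$P = K e^{-rT} N(-d_2) - S\,N(-d_1)$$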
Here, P shows the value of the put option.
Using Python
In order to work on the Black-Scholes model, we can use the following lines of code:

import numpy as np
import scipy.stats as si

def BlackScholes(S, K, T, r, sigma, option='call'):
    # d1 and d2 from the Black-Scholes formulas above
    d1 = (np.log(S / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * np.sqrt(T))
    d2 = (np.log(S / K) + (r - 0.5 * sigma ** 2) * T) / (sigma * np.sqrt(T))
    if option == 'call':
        result = S * si.norm.cdf(d1, 0.0, 1.0) - K * np.exp(-r * T) * si.norm.cdf(d2, 0.0, 1.0)
    if option == 'put':
        result = K * np.exp(-r * T) * si.norm.cdf(-d2, 0.0, 1.0) - S * si.norm.cdf(-d1, 0.0, 1.0)
    return result

print(BlackScholes(45, 105, 2, 0.10, 0.5, option='put'))
print(BlackScholes(45, 105, 2, 0.10, 0.5, option='call'))

This gives the following results:
45.09841171397385
4.131682640785767
for the "put" and "call" options, respectively.
Combination
A combination is a selection of items or set of objects without considering the order of selection of the items or objects. It is represented by the following equation:
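In standard notation:

$$^{n}C_{r} = \binom{n}{r} = \frac{n!}{r!\,(n - r)!}$$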
Here, n shows the total number of items or set of objects, and r shows the size of each combination, i.e. the number of picked/selected items. There is also combination with replacement, in which an object can be selected several times from an unordered list.
Example: Suppose, we have a week of seven days: Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, and Sunday. In order to get combination of 2 days at a time, i.e. the number of ways in which the days can be selected, we have
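Plugging in n = 7 and r = 2:

$$^{7}C_{2} = \frac{7!}{2!\,5!} = 21$$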
So, there are 21 ways in which the days can be paired.
Using Python
In order to work on this in Python, we can use the same example of combinations of 2 days at a time. So, write the following lines of code:

from itertools import combinations
days = [c for c in combinations(['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday'], 2)]
print(len(days))

In this case, the result (number of combinations) is 21.
Permutation
A permutation is an ordered arrangement of items or a set of objects. Permutation with repetition is represented by nʳ, and permutation without repetition is represented by
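In standard notation:

$$^{n}P_{r} = \frac{n!}{(n - r)!}$$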
Here, n shows total number of items or set of objects, and r shows size of each permutation, i.e. number of picked/selected items. Example: Suppose, we have a week of seven days: Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, and Sunday. In order to get permutation of 2 days (without repetition) at a time, i.e. the number of ways in which the days can be selected, we have
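Plugging in n = 7 and r = 2:

$$^{7}P_{2} = \frac{7!}{5!} = 42$$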
So, there are 42 ways in which the days can be selected.
Using Python
In order to work on this in Python, we can use the same example of permutations of 2 days at a time. So, write the following lines of code:

from itertools import permutations
days = [p for p in permutations(['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday'], 2)]
print(len(days))

In this case, the result (number of permutations) is 42.
Even and Odd Permutations
An even permutation is obtained by composing a zero or even number of inversions or swaps of two elements. It has a permutation symbol of +1. For a set of n (>2) elements, the number of possible even permutations is n!/2. An odd permutation is obtained by composing an odd number of inversions or swaps of two elements. It has a permutation symbol of -1. For a set of n (≥2) elements, the number of possible odd permutations is n!/2.
Example: Suppose we have an initial set: {4,5,6}
The total number of odd or even permutations possible is n!/2 = 3!/2 = 3.
Possible permutations are as follows:
p1: {4,5,6} Even permutation (as there is zero swapping).
p2: {4,6,5} Odd permutation (one large number is before a small number, i.e. one inversion).
p3: {5,4,6} Odd permutation (one large number is before a small number, i.e. one inversion).
p4: {5,6,4} Even permutation (two large numbers are before a small number, i.e. two inversions).
p5: {6,4,5} Even permutation (one large number is before two small numbers, i.e. two inversions).
p6: {6,5,4} Odd permutation (one large number [6] is before two small numbers, and one large number [5] is before one small number, i.e. three inversions).
Circular Permutation
It refers to the total number of ways to arrange n distinct objects, such as w, x, y, and z, around a fixed circle. When clockwise orders are different from anticlockwise orders, then Pn = (n – 1)!. When clockwise orders are the same as anticlockwise orders, then Pn = (n – 1)!/2.
Here, Pn shows circular permutation, and n shows number of objects. The clockwise orders and anticlockwise orders are presented below:
Survival Analysis A model for time from start of a study (baseline) to the happening of a certain event. For example, the study may start at birth and the event is death, when study ends. Among the other examples of baseline or first event are manufacture of component, and first heart attack, and among the other examples of second event are component failure, and second heart attack. Survival analysis gives survival data. This data consists of (A) completely observed data and (B) censored data, which is also known as Missing data that may occur during the study, when (1) the subject or sample doesn’t have the event of interest (i.e. second event), which is also known as Right censoring, and/or (2) the data is lost. In the form of graph, the survival analysis is represented as follows:
This graph shows that: At 30 years, the survival probability is about 0.95, i.e. 95%, At 60 years, the survival probability is about 0.4, i.e. 40%, At 100 years, the survival probability is about 0.05, i.e. 5%.
Kaplan-Meier method
One of the most commonly used nonparametric methods for survival analysis is the Kaplan-Meier method, which is also known as the Product Limit method. Suppose we have the following data, in which 12 participants (over the age of 50 years) were studied for 25 years until they died, were lost to follow-up, or the study ended (remember this data is just for understanding and can't take the place of original research).

Participant number | Year of death | Year of last contact
1 | | 25
2 | | 16
3 | 4 |
4 | | 8
5 | | 17
6 | | 25
7 | | 10
8 | | 10
9 | 14 |
10 | 2 |
11 | | 14
12 | | 18

This table shows that 3 participants died before the completion of the study, and 2 participants completed the study. The Kaplan-Meier method utilizes this formula, St(i) = St(i)-1*((Ni-Di)/Ni), to compute the survival probability. In this formula, Ni shows the number at risk, Di shows the number of deaths, St(i) shows the survival probability, and St(i)-1 shows the survival probability just previous to the present one. We can use the life table approach for the Kaplan-Meier method.

Time (years) | Number at risk (Ni) | Number of deaths (Di) | Number censored (C) | Survival probability (St(i) = St(i)-1*((Ni-Di)/Ni))
2 | 12 | 1 | | 1*((12-1)/12)=0.917
4 | 11 | 1 | | 0.917*((11-1)/11)=0.834
8 | 10 | | 1 | 0.834*((10-0)/10)=0.834
10 | 9 | | 2 | 0.834*((9-0)/9)=0.834
14 | 7 | 1 | 1 | 0.834*((7-1)/7)=0.715
16 | 5 | | 1 | 0.715*((5-0)/5)=0.715
17 | 4 | | 1 | 0.715*((4-0)/4)=0.715
18 | 3 | | 1 | 0.715*((3-0)/3)=0.715
25 | 2 | | 2 | 0.715*((2-0)/2)=0.715

Roughly, it can also be represented by the following graph:
This shows that at 4 years after the start of the study, the survival probability is about 0.834, i.e. 83.4%, and the survival probability at 17 years and 18 years is almost the same, i.e. 0.715 or 71.5%.
Using Python
In order to work on this in Python, we can use the same example as noted above. However, the above table is considered as follows:

Observation | Status
25 | 1
16 | 0
4 | 1
8 | 0
17 | 0
25 | 1
10 | 0
10 | 0
14 | 1
2 | 1
14 | 0
18 | 0

In this table, "0" shows no event or the occurrence of the event after the given time, and "1" shows the occurrence of the event, i.e. the time related to the participants who either died or completed the study. Write the following lines of code:

import statsmodels.duration.survfunc as sd

Observation = [25,16,4,8,17,25,10,10,14,2,14,18]
Status = [1,0,1,0,0,1,0,0,1,1,0,0]
x = sd.SurvfuncRight(Observation, Status, exog=None)
print(x.summary())
x.plot()

This gives us the results and graph as follows:
This shows that at 4 years after the start of study, the survival probability is about 0.833, i.e. 83.3%, and the survival probability at 17 years and 18 years is almost same, i.e. 0.714 or 71.4%. This graph assumes that after 25 years, there is no further survival probability.
Bonus Topics
Most commonly used non-normal distributions in health, education, and social sciences
The most commonly used non-normal distributions in health sciences, education, and social sciences have been obtained from the paper "Non-normal Distributions Commonly Used in Health, Education, and Social Sciences: A Systematic Review" by Roser Bono, María J. Blanca, Jaume Arnau, and Juana Gómez-Benito, published in the journal Frontiers in Psychology. From high to low:
1. Gamma distribution
2. Negative binomial distribution
(These two distributions can fit variables associated with health costs or income in social sciences)
3. Multinomial distribution
4. Binomial distribution
(These two distributions can fit data obtained from discrete measurement scales)
5. Lognormal distribution
(This distribution is commonly found in latent periods (the time from infection to the appearance of first symptoms) of infectious diseases)
6. Exponential distribution
7. Poisson distribution
Circular permutation in Nature
For example, proteins such as Lectins, Transaldolases, Glycosyl hydrolases, Saposins, and Methyltransferases contain a changed order of amino acids in their peptide sequence. Circular permutation in proteins can occur at the peptide level or at the genetic level, and genetic rearrangements are more commonly found in nature. However, usually, proteins show the same or almost similar function after circular permutation. In the case of circular permutation, usually, protein domains are in a different sequence order. Here, domains are the parts of a protein sequence (having
specific number of amino acids) that can function, evolve, and exist independently in comparison to the rest of the protein chain. For example, a protein (T) may have three independent domains A, B, and C in the form ABC. These domains can also be joined as BCA or CAB, thereby giving rise to two more different proteins having different sequence orders of domains but similar functions.
Time Series
A time series is represented by a series of data points that are recorded at specific times or time intervals. It is represented by a timeplot in which the values (variables) are represented on the y-axis and time is displayed on the x-axis. Time series help in showing variations in data with the passage of time. This is useful in finding patterns in the data and in making predictions regarding the values (variables) in relation to time. Some examples of time series are as follows:
Daily air temperature in a specific location
The size of an organism, measured daily
Daily closing stock rates
Weekly interest rates
National income
Annual number of earthquakes in the world
The patterns of time series are often difficult to analyze due to reasons such as the increased number of data points with equal intervals. In these situations, the technique of "smoothing" is used, in which a line graph is utilized without the series of dots. This technique is also considered important in the prediction of future events as, for example, predicting whether the stock market would go up or down. It is also helpful in spotting outliers. One of the simple ways of smoothing the timeplots is by utilizing the moving average. In time series data, periodic fluctuations often appear, and these fluctuations are referred to as "seasonality". It is important to note that seasonality in a time series may occur at any time period in the data as, for example, the daily air temperature may increase in the middle of the day and decrease during the night time.
Using Python
Suppose we have the following data:

Days | Values
1 | 3
2 | 2
3 | 7
4 | 5
5 | 10
6 | 9
7 | 8
8 | 5

Write the following lines of code:

from matplotlib import pyplot
Days = [1,2,3,4,5,6,7,8]
Values = [3,2,7,5,10,9,8,5]
pyplot.plot(Days, Values)
pyplot.show()

The following graph is obtained:
In order to get a smoothed line, write the following lines of code:

import numpy as np
import matplotlib.pyplot as plt

Days = [1,2,3,4,5,6,7,8]
Values = [3,2,7,5,10,9,8,5]
plt.figure()
pol = np.polyfit(Days, Values, 5)      # fit a degree-5 polynomial for smoothing
pol_y = np.poly1d(pol)(Days)
plt.plot(Days, pol_y)
plt.plot(Days, Values)
plt.show()

The following graph is obtained:

In order to get information about the trend, we need to add the trendline. So, write the following lines of code:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Days = [1,2,3,4,5,6,7,8]
Values = [3,2,7,5,10,9,8,5]
plt.plot(Days, Values)
pol = np.polyfit(Days, Values, 1)      # degree-1 (linear) fit for the trendline
pol_y = np.poly1d(pol)
plt.plot(Days, pol_y(Days), "r--")
plt.show()

The following graph is obtained:
This graph shows an upward trend.
Monte Carlo Simulation
The Monte Carlo method is a numerical method of solving mathematical problems by random sampling of a large number of variables. It is used for obtaining numerical solutions to problems that are too complicated to solve analytically. Examples may include:
Calculating the volume of figures in a space of several dimensions
Calculating some definite integrals
Simulation of a system with many degrees of freedom, such as fluids
In economics, predicting failure and overdraft costs
The following steps can be considered in this case:
Step 1: Developing samples/sampling of the random input variables x = (x1, x2, x3, …, xn). This can be considered as the distributions of the input variables.
Step 2: Analysis of the numerical problem on the sample (evaluation of the model output y). In this step, the set of random input variables, i.e. x = (x1, x2, x3, …, xn), is run through the model y = g(x), and the set of output variables, i.e. y = (y1, y2, …, yn), proceeds to the next step.
Step 3: Statistical analysis of the model output (getting probabilistic information). This can be associated with the probabilistic characteristics and the uncertainty of the output variables.
Density Estimation
Density estimation is the process of estimating a continuous density field from a discretely sampled set of data points obtained from that density field. It can be (1) parametric, in which the shape of the distribution is known, and (2) non-parametric, in which the form of the density is completely determined by the data rather than a model. Parametric density estimation can be (a) maximum likelihood estimation, or (b) Bayesian estimation. Non-parametric density estimation can be (a) Parzen windows, or (b) nearest neighbor.
Decision Tree
It is used for classification, prediction, estimation, clustering, and visualization with the help of a branched structure of data between which relations exist. In this case, the "Root" is considered the initial question; "Branches" are considered answers to the initial question; "Subgroups" are considered further questions; "Further branches" are considered further answers, and a "Leaf" – a subgroup without a branch – can be considered a decision. This is also represented below:
Decision tree for test statistics is as follows:
Meta-analysis
The following information has been obtained from "Basics of Meta-analysis with Basic Steps in R" written by Usman Zafar Paracha. It is available here: https://amzn.to/31ff3Kj
Meta-analysis is a quantitative and systematic method to combine the results of previous studies to reach conclusions. The combination of results leads to combined results – the Overall Effect size / Pooled Statistical Results. The following steps can be considered in the process:
1. Formulation of the research question
2. Searching the required articles and information, working on inclusion and exclusion criteria, and utilizing the "Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA)" to note the flow of information.
3. Collection of abstract data, such as age of participants, sample size, outcomes, etc.
4. Statistical analysis that consists of (a) calculation of the effect size, which is the size of the difference in effect between two groups, (b) heterogeneity statistics, and (c) development of a Forest Plot as shown below:
The heterogeneity statistics may include Q, which is a ratio of the observed variation to the within-study variance; Tau-squared; and I-squared, which is the proportion of the observed variation that can be attributed to actual differences between studies.
The heterogeneity statistics may also include plots, such as Funnel plot, Baujat plot, and/or L’Abbe plot. An example of funnel plot is as shown below:
Important Statistical Techniques/Procedures used in Medical Research
The following table shows some important statistical techniques or procedures commonly used in medical research. This information has been taken from the paper titled "Statistical trends in the Journal of the American Medical Association and implications for training across the continuum of medical education" by Arnold, Braganza, Salih, & Colditz, published in PLoS ONE in 2013.
Table 5: Important Statistical techniques/procedures used in Medical Research

Statistical Techniques/Procedures | Comments (about the increased, decreased, or static use in the past 3 decades)
Pearson correlation coefficient | Its use is declining at a faster rate
Mantel-Haenszel | Its use is declining
ANOVA | Its use is declining
Simple Linear regression | Its use is almost static
Fisher Exact | Its use is almost static
t-test | Its use is almost static
Logistic regression | Its use is almost static
Chi-square | Its use is almost static
Descriptive Statistics | Used most commonly but its use is declining
Morbidity and Mortality | Used most commonly but its use is almost static
Transformation | Its use is almost static
Multiple comparison | Its use is increasing slowly
Epidemiologic statistics | Its use is increasing slowly
Poisson Regression | Its use is increasing
p-trend | Its use is increasing
Log-rank test | Its use is increasing at a faster rate
Wilcoxon Rank | Its use is increasing
Non-parametric test | Its use is increasing
Intention to treat | Its use is increasing
Kaplan Meier | Its use is increasing
Power | Its use is increasing at a faster rate
Cox models | Its use is increasing
Multi-level modeling | Its use is increasing at a faster rate
Survival analysis | Its use is increasing at a faster rate
Multiple regression | Its use is increasing
Sensitivity analysis | Its use is increasing
Low-level Statistical measures |