Basic Biostatistics with Basic Steps in Stata®

This book covers the following topics: •Hypothesis •Frequency •Type I Error and Type II Error, and Sample and Population


English Pages 110


Table of contents :
Hypothesis
Frequency
Sample and Population
Type I Error and Type II Error
Measuring the central tendencies (mean, median, and mode), and Range
Using Stata
Types of data in statistics
Sample mean and population mean
Histogram, frequency distribution plot, and Bar Graph
Using Stata
Variance, Standard deviation, and Outliers
Using Stata
Level of Significance and confidence level
Correlation
Types of Distribution, and Standardization
Using Stata
Standard Error of the Mean
Using Stata
Tests for Non-normally distributed data
Chi-square test
Using Stata
One-tailed test, two-tailed test, and Wilcoxon Rank Sum Test / Mann Whitney U Test
Using Stata
The Sign Test
Using Stata
Wilcoxon Signed Rank Test
Using Stata
The Kruskal-Wallis Test
Using Stata
The Friedman Test
Using Stata
Tests for Normally distributed data
Unpaired “t” test
Using Stata
Paired “t” test
Using Stata
One way ANOVA
Using Stata
Two way ANOVA
Using Stata
Odds Ratio and Mantel-Haenszel Odds Ratio
Using Stata
Correlation Coefficient
Using Stata
Regression Analysis
Using Stata
Important Statistical Techniques/Procedures used in Medical Research


Basic Biostatistics with Basic Steps in Stata®

Usman Zafar Paracha Assistant Professor, Pharmaceutics, Hajvery University, Lahore, Pakistan (2017)

This Book will help students to learn and utilize some basic concepts of Biostatistics.

Any Feedback will be Highly Appreciated.

Usman Zafar Paracha Owner of SayPeople.com [email protected] https://www.facebook.com/usmanzparacha

Some words from the author

I tried to make this ebook on biostatistics as informative as possible, especially for beginners in biostatistics; that is why I used simple calculations. Assumptions, suppositions, or calculations in this eBook are taken from my eBook “Biostatistics – When Pain becomes Treatment”, available here: http://amzn.to/2iOR8yG

I have worked on Stata/MP 14.0. It is assumed that the reader knows the software and how to start and use it (even at a beginner level).

This book may contain some trademarked names without the trademark symbol. However, they are used only in an editorial context, with no intention of trademark infringement.

It is important to note that the calculations and examples used in this book cannot take the place of actual research. Biostatistics has to be used under the guidance of experts.

People with authentic comments and/or feedback (on Amazon) can ask me questions or send me a “Message” here: https://www.facebook.com/usmanzparacha, and I will try to answer them.

Perhaps you would also like:
Basic Biostatistics with Basic Steps in Microsoft Excel - http://amzn.to/2qd0vcs
Basic Biostatistics with Basic Steps in SPSS® - http://amzn.to/2p9Uz5o
Basic Biostatistics with Basic Steps in R - http://amzn.to/2ps7OPv
Basic Biostatistics with Basic Steps in Minitab® - http://amzn.to/2pDNCIh


Hypothesis

Statistics is an art as well as a science, and draws on both fields. You need the strong imagination of an artist and the well-designed research of a scientist. Let’s start with the concept of hypotheses. According to an ancient Chinese myth, there is an entirely different world behind mirrors. That world has its own creatures, known as the fauna of mirrors. So, if a person believes that the fauna of mirrors actually exist and wants to perform research on them, his hypothesis will be that the fauna of mirrors actually exist. This is known as the “research hypothesis” or “alternative hypothesis”, as the person is doing research on this hypothesis. It is represented by Ha or H1.

On the other hand, there is another opinion that there is nothing like the fauna of mirrors. This opinion can be considered the “null hypothesis”, as it negates the statement of the research hypothesis and says that the research hypothesis is not a commonly observed phenomenon. The null hypothesis is represented by Ho or H0. Research supports the alternative hypothesis only when the null hypothesis is rejected. So, in order to support the existence of the fauna of mirrors after performing research, the null hypothesis has to be rejected.

In short, according to the null hypothesis, nothing has changed and no significant new thing can be found (anywhere, in any group), whereas according to the alternative hypothesis, some significant change has occurred or some significant new thing can be found (somewhere or in some group).

Frequency

Frequency refers to the number of times an event occurred in a study. More strictly, it is known as absolute frequency. Relative frequency, in comparison, is the number of times a specific event occurred divided by the total number of events. Suppose an event is represented by “i”, the number of times that event occurred by “ni”, and the total number of events by “N”. The relative frequency of the event, “fi”, is then fi = ni / N. Suppose a person sees his reflection in a mirror 3 times. Two of those times he sees his reflection alone, and the third time he sees it along with his friend. So, the relative frequency of seeing the reflection alone is 2/3, in which ni = 2 and N = 3.
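The relative frequency formula fi = ni / N can be checked with a few lines of Python. This is only an illustrative sketch outside the book's Stata workflow; the function name relative_frequency is a hypothetical helper, not something from the book.

```python
from fractions import Fraction


def relative_frequency(n_i, total):
    """Return the relative frequency n_i / N as an exact fraction."""
    if total <= 0:
        raise ValueError("total number of events must be positive")
    return Fraction(n_i, total)


# Mirror example: 3 looks in total, 2 alone and 1 with a friend.
alone = relative_frequency(2, 3)
with_friend = relative_frequency(1, 3)

print("alone:", alone, "=", round(float(alone), 3))
print("with friend:", with_friend)
# Relative frequencies of all possible events add up to 1.
print("sum equals 1:", alone + with_friend == 1)
```

Using exact fractions keeps 2/3 as 2/3 instead of a rounded decimal, which makes the "adds up to 1" check exact.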

Sample and Population


Suppose we have a big mirror. If we cut three small pieces of mirrors from that big mirror, those pieces could be considered as Sample, and that big mirror could be considered as a Universe or Population in statistical terms. Sample (those three pieces) could help us in getting the result for the entire Universe or Population (the big mirror).

Type I Error and Type II Error

Consider the hypotheses mentioned earlier. Suppose initial findings show that the fauna of mirrors actually exist. So, it can be said that the null hypothesis is rejected. However, it is important to note that the results from initial findings could be false (or they could be true) due to the presence of errors in the research. Those errors are known as α error and β error. α error is also known as Type I error or False positive. In this condition, we incorrectly reject the null hypothesis, i.e. the statement “there is nothing like fauna of mirrors” seems wrong after performing an experiment, when in reality it is right. It is considered a False positive because we think that the alternative hypothesis is right (positive). This type of error can be detected by performing further tests. Moreover, changing the level of significance can also help in reducing Type I error.

On the other hand, β error is also known as Type II Error or False negative. In this condition, we incorrectly accept (fail to reject) the null hypothesis, i.e. the statement “there is nothing like fauna of mirrors” seems right after performing an experiment, when in reality it is wrong.

A Type II error can be considered more serious than a Type I error because, after this error, further research on the alternative hypothesis is unlikely to be done. Errors of a higher kind could also be present. For example, a Type III error occurs when a researcher or an investigator gives the right answer to the wrong question. It is also said to occur when an investigator or researcher correctly rejects the null hypothesis for the wrong reason(s).

Measuring the central tendencies (mean, median, and mode), and Range

Suppose there are 6 trees, each having eggs under it on the ground. The 1st tree has 5 eggs, the 2nd tree has 7 eggs, the 3rd tree has 6 eggs, the 4th tree has 4 eggs, the 5th tree has 7 eggs, and the 6th tree has 9 eggs. If we sum up all the numbers of eggs and divide the sum by how many numbers there are, we get the mean value. The formula for the mean value is x̄ = Σx / n, where x̄ (“x bar”) is the mean value; ∑ (“sigma”) means “sum of”; x is the number of eggs under each tree; and n is the number of trees, i.e. 6. So, ∑x = 5+7+6+4+7+9 = 38. Overall, there are 38 eggs on the ground, and if we divide 38 by 6, we get the mean value of 6.33. So, x̄ = 6.33.

If we arrange the numbers of eggs from the least value to the greatest value, the series becomes 4,5,6,7,7,9. From this series, we can calculate the median by finding the mean of the two middle values. The sum of the middle values is 6+7 = 13, and 13 divided by 2 gives the median value, 6.5.

Mode is the number that is repeated most often in a series. In the above series, 7 is the mode.

Range of the series is obtained by subtracting the lowest number from the highest number. In the above series, 9 is the highest number and 4 is the lowest number; so, the range is 9-4=5.

Now try to calculate these things again. Suppose another tree is considered, having 11 eggs under it on the ground. Now the mean, mode, median, and range all become equal to 7. Mean is 49/7 = 7. Mode is 7, which is the repeated number. Median is 7, as it is the middle number in the series 4,5,6,7,7,9,11. Here, it is important to note that when a series contains an even amount of numbers, the median is calculated by taking the average of the two middle numbers (as mentioned above), and when a series contains an odd amount of numbers, the median is the middle number after arranging the numbers from lowest to highest. Range is 11-4=7.
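As a cross-check of the hand calculations above, the same four quantities can be computed with Python's standard statistics module. This is a minimal sketch, not part of the book's Stata steps; the helper name describe is an assumption made for illustration.

```python
import statistics


def describe(values):
    """Return mean, median, mode, and range for a list of numbers."""
    ordered = sorted(values)
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),  # averages the two middle values when n is even
        "mode": statistics.mode(ordered),      # most frequently repeated value
        "range": ordered[-1] - ordered[0],     # highest minus lowest
    }


six_trees = [5, 7, 6, 4, 7, 9]
seven_trees = six_trees + [11]

print(describe(six_trees))
print(describe(seven_trees))
```

For the six-tree series the median is the average of 6 and 7; adding the seventh tree makes the series odd-length, so the median becomes the single middle value.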

Using Stata

In order to find the mean, median, mode, and range, we can use the number of eggs under the different trees (as noted above). So, the numbers are 4, 5, 6, 7, 7, 9, and 11. Click on the “Command” box at the bottom of the Stata window, write “edit”, and press Enter. A “Data Editor” window appears. This is a kind of spreadsheet window. Enter the values in the first column from row 1 to row 7. The column is automatically named “var1” (pay special attention to the number, capitalization, etc.). Close this window; the values in “var1” are kept automatically. In order to find the mean and median of the numbers, we can write

    summarize var1, detail

OR

    sum var1, d

This gives us the results. In the results, the mean value is 7. The 50th percentile is the median, so in this case the median is 7.

Strangely, Stata has no dedicated command for the calculation of mode. However, in this case, the mode can be obtained using the mode() function of the egen command. You can write the following in the Command window:

    egen mode = mode(var1)
    tab mode

It is important to note that this function does not work for every dataset; for example, when several values are tied for the mode, it returns a missing value unless an option such as minmode or maxmode is added.

In order to calculate the range, use the following command:

    sum var1

This gives the Min (minimum value) and Max (maximum value). The maximum value in this case is 11 and the minimum value is 4. So, these values can be subtracted to get the range. In order to subtract these values, we write

    display 11-4

Types of data in statistics

There are different types of data in statistics: Nominal data, Ordinal data, and Interval data. Nominal data are simply represented by “names” or “categories”. Suppose there are eggs of different colors; they can be classified by color as brown, blue, white, rose, green, and spotted eggs. Ordinal data are also known as ordered data, in which the items are ranked or graded. Suppose there are green colored eggs in three different shades from light green to dark green; they can be graded as light green, green, and dark green. This is ordinal data. In interval data, the values are not only ordered but also separated by exact, definite differences. Consider the above-mentioned series of eggs, i.e. 4,5,6,7,7,9,11. The difference between 4 and 5 eggs is exactly the same as the difference between 6 and 7 eggs (one egg), and the difference between 7 and 9 eggs is the same as the difference between 9 and 11 eggs (two eggs). Such exact and comparable differences are what make these data interval data.

Sample mean and population mean

Sample mean and population mean are two different things. Sample mean is represented by x̄, as mentioned earlier, and population mean is represented by μ. A population consists of all the elements in a collection of data, and a sample consists of some observations from that data (as mentioned earlier in the case of the mirror). In the above series, we can consider 6.33 a sample mean and 7 a population mean. When we take the mean number of eggs for the first 6 trees out of 7 trees, it is a sample mean, and when we take the mean number of eggs for all 7 trees, it is the population mean.

Histogram, frequency distribution plot, and Bar Graph

A bar graph represents data as bars on a graph. Suppose we have data for different colored eggs as follows:

Color of eggs    Number of eggs
Green            5
Light green      4
Dark green       6
Brown            5
Blue             5
White            11
Rose             5
Spotted          8

This data can be represented in a bar graph as follows:

[Figure: bar graph of the number of eggs for each color]

A histogram is also a graph, but it groups numbers into ranges. In order to develop a histogram, we again take the following information. Suppose there are 7 trees, each having eggs under it on the ground. The 1st tree had 5 eggs, the 2nd tree had 7 eggs, the 3rd tree had 6 eggs, the 4th tree had 4 eggs, the 5th tree had 7 eggs, the 6th tree had 9 eggs, and the 7th tree had 11 eggs. On the basis of this information, the following histogram could be developed.

[Figure: histogram of the number of eggs under the 7 trees]

This graph shows that 2.1 to 4 eggs were present under only one tree, i.e. the 4th tree had 4 eggs. 4.1 to 6 eggs were present under 2 trees, i.e. the 1st tree had 5 eggs under it and the 3rd tree had 6 eggs under it. And so on. This graph also shows the frequency distribution, and can also be called a frequency distribution plot.
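The grouping behind a histogram can be sketched in Python as follows. The bin edges (width 2, starting at 2) are an assumption inferred from the ranges “2.1 to 4” and “4.1 to 6” quoted above, and bin_counts is a hypothetical helper name.

```python
def bin_counts(values, width=2, start=2):
    """Count how many values fall in each bin (lower, upper]."""
    counts = {}
    lower = start
    top = max(values)
    while lower < top:
        upper = lower + width
        # A value belongs to the bin if lower < value <= upper.
        counts[(lower, upper)] = sum(1 for v in values if lower < v <= upper)
        lower = upper
    return counts


eggs = [5, 7, 6, 4, 7, 9, 11]  # eggs under trees 1 to 7

for (lower, upper), count in bin_counts(eggs).items():
    # Print a simple text histogram: one '#' per tree in the bin.
    print(f"more than {lower}, up to {upper} eggs: {'#' * count} ({count})")
```

Each printed row corresponds to one bar of the histogram; the number of '#' marks is the frequency for that range.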

Using Stata

In order to develop a Bar Graph, we can use the following data:

Color of eggs    Number of eggs
Green            5
Light green      4
Dark green       6
Brown            5
Blue             5
White            11
Rose             5
Spotted          8

Click on the “Command” box at the bottom of the Stata window, write “edit”, and press Enter. Enter the values in the columns. In this case, we have “Color of eggs” and “Number of eggs” in the first two columns. Close this window; the data are kept automatically as “var1”, representing the categorical variable, and “var2”, representing the numerical variable. In the Command window, enter

    graph bar (mean) var2, over(var1)

A graph appears as shown below:

[Figure: Stata bar graph of var2 over var1]

In order to work on a histogram, we can use the same data as shown above. So, in the Command window, write the following:

    histogram var2, bin(8)

Bins are a kind of container. So, try to set the number of bins according to the number of results. For example, with the above command, we get the following result:

[Figure: Stata histogram of var2 with 8 bins]

But if we write the following, with “2” as the number of bins,

    histogram var2, bin(2)

we get the following result:

[Figure: Stata histogram of var2 with 2 bins]

Therefore, it is important to choose the number of bins according to the number of results.

Variance, Standard deviation, and Outliers

Standard deviation helps in knowing the variability of the observations, i.e. the spread of the numbers about the mean. It is represented by the Greek letter sigma (σ). A low standard deviation shows that the values are close to the mean, i.e. close to the normal or required range. σ is obtained by taking the square root of the variance, which is represented by σ². So, we have to calculate the variance in order to calculate the standard deviation. Variance is the average of the squared deviations about the mean. If we consider the mean of the number of eggs, i.e. 7, we have the squared deviations (or squared differences) as shown here:

Number of eggs (x), N = 7 trees    Squared deviation about the mean = (x - x̄)²
5                                  (5 - 7)² = (-2)² = 4
7                                  (7 - 7)² = (0)² = 0
6                                  (6 - 7)² = (-1)² = 1
4                                  (4 - 7)² = (-3)² = 9
7                                  (7 - 7)² = (0)² = 0
9                                  (9 - 7)² = (2)² = 4
11                                 (11 - 7)² = (4)² = 16
Mean of the numbers = 7            Total = ∑(x - x̄)² = 34

∑(x - x̄)² is the sum of the squared deviations (or squared differences) about the mean, as you can see in the table, and is equal to 34; N is the number of values and is equal to 7. Variance is obtained by taking the average of the squared deviations about the mean, i.e. ∑(x - x̄)²/N = (4 + 0 + 1 + 9 + 0 + 4 + 16) / 7 = 34 / 7 = 4.86 = σ². The square root of 4.86 is the standard deviation, i.e. σ = 2.20.

7 ± 2.2 shows the variability of the observations, i.e. the spread of the numbers about the mean. So, it can be considered that the approximate normal range of eggs is 4.8 to 9.2. Considering this range, we can say that the 4th tree is below the normal range with 4 eggs, the 7th tree is above the normal range with 11 eggs, and all the other trees are within the normal range of 4.8 to 9.2 eggs. The values for the 4th tree and the 7th tree could be considered outliers, i.e. they are extreme compared with the others.

It is also important to note that variance is of two types, i.e. population variance and sample variance. Population variance is given by $\sigma^2 = \frac{\sum (x - \mu)^2}{N}$, and sample variance is given by $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$.
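The variance table and the two kinds of variance can be reproduced in a short Python sketch. It is offered only as a cross-check of the arithmetic, not as the book's method; the “mean ± 1 standard deviation” range is the rule of thumb used in the text above, not a general definition of outliers.

```python
import statistics

eggs = [5, 7, 6, 4, 7, 9, 11]
n = len(eggs)
mean = statistics.mean(eggs)

# Squared deviation of every value about the mean, as in the table.
squared_deviations = [(x - mean) ** 2 for x in eggs]
total = sum(squared_deviations)

pop_var = total / n            # population variance: divide by N
sample_var = total / (n - 1)   # sample variance: divide by n - 1
pop_sd = pop_var ** 0.5
sample_sd = sample_var ** 0.5

# Approximate "normal range" used in the text: mean +/- one population SD.
low, high = mean - pop_sd, mean + pop_sd
outside = [x for x in eggs if x < low or x > high]

print(f"sum of squared deviations = {total}")
print(f"population variance = {pop_var:.2f}, SD = {pop_sd:.2f}")
print(f"sample variance = {sample_var:.2f}, SD = {sample_sd:.2f}")
print(f"range {low:.1f} to {high:.1f}; values outside it: {outside}")
```

The only difference between the two variances is the divisor, which is why software that reports sample statistics gives slightly larger values than the hand calculation with N.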

Using Stata

In order to calculate variance and standard deviation, we can use the number of eggs under the seven trees as noted above. So, the data could be:

Number of tree    Number of eggs under that tree
1                 5
2                 7
3                 6
4                 4
5                 7
6                 9
7                 11

Click on the “Command” box at the bottom of the Stata window, write “edit”, and press Enter. Enter the values in the columns. In this case, we enter the data in the first two columns, i.e. “Number of tree” in the first column and “Number of eggs under that tree” in the second column. Now close this window. We now have var1 for “Number of tree” and var2 for “Number of eggs under that tree.” Now write

    sum var2, d

This gives results with many outcomes. In these results, Std. Dev. (Standard deviation) is 2.38 and Variance is 5.67. These are sample values, i.e. calculated with n - 1. This command has no option for the population standard deviation or population variance, the answers for which have been presented above, i.e. 2.20 and 4.86 respectively.

Level of Significance and confidence level

Some events occur commonly, and they may have less significance; whereas some events or outcomes occur rarely, and they are of a significant nature. The same thing applies in biostatistics. We usually say that the probability of an event, denoted by a capital ‘P’, is low if the event is rare. The level of significance, which can be expressed as a percentage, is the cut-off at which a rare event is separated from other normal events. For example, researchers usually take 5% or 0.05 as the level of significance, and a result whose P-value is below this cut-off is written with the ‘less than’ sign, i.e. P < 0.05. So, a result is of a significant nature in biostatistics if it would occur with less than 5% probability when the null hypothesis is true. Significance level also refers to the probability of incorrectly rejecting a null hypothesis, i.e. the probability of committing a Type I error. Again, take a look at the Type I error.

As the level of significance is related to α-error, it is also known as the α level. So, by changing the level of significance, the chances of Type I error can also be changed. The remaining level, which is obtained after removing the level of significance, is considered the level of confidence. It is represented by γ, and is obtained by the equation γ = 100% - α level or γ = 1 - α level. Usually, 5% or 0.05 is considered the level of significance, so 95% or 0.95 is considered the level of confidence.

Correlation

Correlation refers to systematic changes in the amount of one variable in relation to systematic changes in the amount of another variable. The correlation coefficient is represented by ‘r’ and it ranges from -1 through 0 to +1. A value of -1 shows a perfect negative relationship, i.e. with an increase in one variable, the second variable decreases; a value of 0 shows no linear relationship; and a value of +1, which is the highest value, shows a perfect positive relationship, i.e. with an increase in one variable, the second variable also increases. The two variables can be plotted on the x-axis and y-axis of a graph, and their correlation can be checked with the help of the line in the plot. (Correlation coefficient is further explained later in the book.)

Types of Distribution, and Standardization

Suppose there is a place known as “Virana” where beneficial viruses are living and fighting against some harmful bacteria. Those harmful bacteria are assigned a “value-of-danger” from 1 to 9 according to their harmfulness to viruses, i.e. the higher the number, the more harmful the bacteria are. Suppose we have different numbers of bacteria (frequency) according to their value-of-danger, as shown in the following table:

Frequency (number of bacteria)    Value-of-danger
200,000                           1
300,000                           2
500,000                           3
900,000                           4
1,000,000                         5
900,000                           6
500,000                           7
300,000                           8
200,000                           9

Then a frequency distribution plot could be developed as follows:

[Figure: frequency distribution plot of number of bacteria against value-of-danger]

This frequency distribution plot shows a “bell curve”, as it looks like a bell. Technically, this type of distribution is known as the normal or Gaussian distribution, named after the mathematician Carl Friedrich Gauss. This type of distribution is commonly found in nature; for example, blood sugar levels and heart rates follow this type of distribution. Normal distribution has three important characteristics.
1. In a normal distribution, mean, median, and mode are equal to each other.
2. In this type of distribution, there is symmetry about the central point.
3. Half of the values, i.e. 50%, are less than the mean value, and half of the values, i.e. 50%, are more than the mean value.

In the above illustration, the value-of-danger “5” is the mean and median, and as it is the most commonly found value (it has the largest number of bacteria), it is also the mode. Disturbance in the values would disturb the normal distribution, leading to a non-normal or non-Gaussian distribution in which there is no appropriate bell-shaped curve.

Frequency distribution has a close relationship with standard deviations:
About 68.27% of all values lie within one standard deviation of the mean on both sides, i.e. a total of two standard deviations,
About 95% of all values lie within two standard deviations of the mean on both sides, i.e. a total of four standard deviations, and
About 99.7% of all values lie within three standard deviations of the mean on both sides, i.e. a total of six standard deviations.

The number of standard deviations from the mean is known as the “Standard Score”, “z-score”, or “sigma”. In order to convert a value to a z-score or Standard Score, subtract the mean and then divide the result by the Standard Deviation. It is represented as

$$z = \frac{x - \mu}{\sigma}$$

in which z is the z-score; x is the value to be standardized; μ is the mean value; and σ is the Standard Deviation. This process of getting a z-score is known as “Standardizing.” Let’s consider the frequency distribution and frequency distribution plot again:

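And its values:

Frequency (number of bacteria)    Value-of-danger
200,000                           1
300,000                           2
500,000                           3
900,000                           4
1,000,000                         5
900,000                           6
500,000                           7
300,000                           8
200,000                           9

In this table, the mean value is 5, and its Standard Deviation can be calculated from the variance. Treating the nine values-of-danger (1 to 9) as the data, the variance is calculated as

$$\sigma^2 = \frac{\sum (x - \mu)^2}{N} = \frac{16 + 9 + 4 + 1 + 0 + 1 + 4 + 9 + 16}{9} = \frac{60}{9} = 6.67$$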

So, Standard Deviation = σ = 2.58. In order to get the z-score of the first value, i.e. 1, first subtract the mean. So, it will be 1-5 = -4. Then this value is divided by the Standard Deviation. So, it will be -4/2.58 = -1.55. So, the z-score is -1.55, i.e. the value-of-danger 1 is 1.55 Standard Deviations below the mean. If we calculate the z-scores of all the values and place them in the table, we get:

Frequency (number of bacteria)    Value-of-danger    z-score
200,000                           1                  (1-5)/2.58 = -1.55
300,000                           2                  (2-5)/2.58 = -1.16
500,000                           3                  (3-5)/2.58 = -0.78
900,000                           4                  (4-5)/2.58 = -0.39
1,000,000                         5                  (5-5)/2.58 = 0
900,000                           6                  (6-5)/2.58 = 0.39
500,000                           7                  (7-5)/2.58 = 0.78
300,000                           8                  (8-5)/2.58 = 1.16
200,000                           9                  (9-5)/2.58 = 1.55

From this table, Standard Normal Distribution Graph could also be obtained in which z-scores are along x-axis and frequency (number of bacteria) is along y-axis.

The graph shows that nearly 68.27% of the values in the “value-of-danger” are present within one standard deviation of the mean on both sides, i.e. from -1 to 1. In that case, 68.27% is also considered a confidence interval between the upper limit of z-score = 1 and the lower limit of z-score = -1. Further, nearly 95% of all values are present within two standard deviations of the mean on both sides, i.e. from -2 to 2. In case of disturbance in the values, the distribution may no longer be normal; other types of distribution include the binomial and Poisson distributions.
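The standardizing step and the percentages quoted for one, two, and three standard deviations can be checked with a small Python sketch. It assumes, as the hand calculation does, that the nine values-of-danger are the data and that the population standard deviation is used; normal_coverage is a hypothetical helper based on the standard error function, and it describes an ideal normal distribution rather than this particular table.

```python
import math
import statistics

values = list(range(1, 10))          # values-of-danger 1 to 9
mu = statistics.mean(values)         # mean of the values
sigma = statistics.pstdev(values)    # population standard deviation

# z = (x - mu) / sigma for every value.
z_scores = [(x - mu) / sigma for x in values]


def normal_coverage(k):
    """Share of a normal distribution lying within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


print(f"mean = {mu}, sigma = {sigma:.2f}")
for x, z in zip(values, z_scores):
    # Third decimals can differ slightly from the table, which rounds sigma to 2.58 first.
    print(f"value {x}: z = {z:.3f}")
for k in (1, 2, 3):
    print(f"within {k} SD: {normal_coverage(k):.2%}")
```

Replacing statistics.pstdev with statistics.stdev gives z-scores based on the sample standard deviation instead.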

Using Stata

In order to calculate z-scores, we can use the following data:

Frequency (number of bacteria)    Value-of-danger
200,000                           1
300,000                           2
500,000                           3
900,000                           4
1,000,000                         5
900,000                           6
500,000                           7
300,000                           8
200,000                           9

In the Command window, write “edit” and press Enter. In the window that appears, enter the values in the columns. In this case, we enter “Frequency (number of bacteria)” in the first column, which is saved as var1, and “Value-of-danger” in the second column, which is saved as var2. In the Command window, write

    egen float zscore = std(var2), mean(0) std(1)

The egen (“extended generate”) command creates a new variable. That new variable holds the z-scores. You can check the z-scores by either writing

    list zscore

in the Command window, OR by writing

    edit

in the Command window and looking at the third column. Results are as follows:

var1       var2    zscore
200000     1       -1.460593
300000     2       -1.095445
500000     3       -.7302967
900000     4       -.3651484
1.0e+06    5       0
900000     6       .3651484
500000     7       .7302967
300000     8       1.095445
200000     9       1.460593

It is important to note that these z-scores are according to sample standard deviation.

Standard Error of the Mean

Standard Error of Mean is represented by the symbol for Standard deviation, i.e. σ, along with a subscript M representing mean. So, σM is the symbol for Standard Error of Mean. It can be calculated by the formula:

$$\sigma_M = \frac{\sigma}{\sqrt{n}}$$

where σ is the standard deviation and n is the amount of numbers. If we consider the table of the number of bacteria along with their value-of-danger (above), with σ = 2.58 and n = 9, we have the standard error of mean as follows:

$$\sigma_M = \frac{2.58}{\sqrt{9}} = \frac{2.58}{3} = 0.86$$

This value is helpful in knowing about the whole population of the harmful bacteria, i.e. the smaller the standard error of mean, the more accurately the sample represents the entire population of bacteria. Standard error of mean helps in working on confidence intervals. The mean value 5 ± 2 σM corresponds to a probability of 95%, i.e. the range between 3.28 (which is 5 - 2 x 0.86) and 6.72 (which is 5 + 2 x 0.86) can be called the 95% confidence interval for the mean. So, it shows that we are 95% confident that the mean value lies between 3.28 and 6.72.
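The standard error of the mean and the approximate 95% confidence interval above can be recomputed with this Python sketch. It uses the text's multiplier of 2 (rather than the more exact 1.96) and shows both the population-based and the sample-based standard error; it is an illustration only, and the variable names are assumptions.

```python
import math
import statistics

values = list(range(1, 10))   # values-of-danger 1 to 9
n = len(values)
mean = statistics.mean(values)

# Standard error of the mean = standard deviation / sqrt(n).
sem_population = statistics.pstdev(values) / math.sqrt(n)  # uses the N divisor
sem_sample = statistics.stdev(values) / math.sqrt(n)       # uses the n - 1 divisor

# Approximate 95% confidence interval: mean +/- 2 standard errors.
lower = mean - 2 * sem_population
upper = mean + 2 * sem_population

print(f"SEM (population SD) = {sem_population:.2f}")
print(f"SEM (sample SD)     = {sem_sample:.7f}")
print(f"approximate 95% CI  = {lower:.2f} to {upper:.2f}")
```

The sample-based figure is the one statistical software usually reports, so it is slightly larger than the hand-calculated value.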

Using Stata

In order to calculate the standard error of mean, we can use the following data as an example:

Frequency (number of bacteria)    Value-of-danger
200,000                           1
300,000                           2
500,000                           3
900,000                           4
1,000,000                         5
900,000                           6
500,000                           7
300,000                           8
200,000                           9

In the Command window, write “edit” and press Enter. In the window that appears, enter the values in the columns. In this case, we enter “Frequency (number of bacteria)” in the first column, which is saved as var1, and “Value-of-danger” in the second column, which is saved as var2. In the Command window, write

    tabstat var2, stats(semean)

This gives the standard error of mean based on the sample standard deviation. In this case, it is equal to 0.9128709. Our calculated standard error of mean (above), on the other hand, is based on the population standard deviation; both of them are right.

Tests for Non-normally distributed data

Non-parametric testing can be applied to ranked, ordinal, or continuous outcome variables, and for those variables that do not follow normal distribution, i.e. non-normally distributed data. For example, severity of pain is an ordinal outcome and can be ordered from no pain to severe pain to agonizing pain on a scale from 1 to 12.

Chi-square test

Chi-square (χ²) test is a very simple non-parametric test. If we consider pain as a variable, this test can be used to show an association between treatments and their efficacy in reducing the pain. Suppose we have a sample of 100 subjects and make two groups of 50 subjects each. Fifty of them (i.e. one group) receive a new treatment while the other 50 subjects (i.e. second group) receive placebo. There are two possible outcomes of the research, i.e. either pain will be reduced (alternative hypothesis, i.e. there is a significant difference between placebo and the new treatment) or not (null hypothesis, i.e. there is no difference between placebo and the new treatment and they are the same in efficacy). For a chi-square test, we develop a 2 x 2 contingency table showing the two groups, and note the findings of the treatment in the two groups. Suppose the findings are as follows:

Groups             Pain removed    Pain did not remove    Total
Treatment group    40              10                     50
Placebo group      20              30                     50
Total              60              40                     100

At first look, this table shows that the new treatment is better than placebo in removing or reducing the pain, but it is important to know the statistical significance. To move ahead, we have to compare this data, i.e. the Observed data, with the Expected data. In the table, the total number of subjects is 100, i.e. 50 in the treatment group and 50 in the placebo group; 60 subjects reported that the pain was removed and 40 reported that the pain was not removed. Suppose there is no relationship between group and outcome. We would then Expect the following table, i.e. the Expected data:

Groups             Pain removed    Pain did not remove    Total
Treatment group    30              20                     50
Placebo group      30              20                     50
Total              60              40                     100

In this table, the value in the upper left cell is obtained by multiplying the total number of subjects with pain removed by the total number of subjects in the treatment group and dividing the value by the total number of subjects. So, 50 x 60 / 100 = 30, which is the Expected value in the upper left cell. Similarly, the Expected values of the other cells can be calculated. Chi-square can be calculated by using the equation χ² = ∑ ((O - E)² / E), where O shows the Observed values and E shows the Expected values. So, the following table can be developed:

Groups             Pain removed               Pain did not remove
Treatment group    (40 - 30)² / 30 = 3.33     (10 - 20)² / 20 = 5.00
Placebo group      (20 - 30)² / 30 = 3.33     (30 - 20)² / 20 = 5.00
χ² = 3.33 + 5.00 + 3.33 + 5.00 = 16.67

Now we check the Degrees of freedom (df). For χ², we calculate df by the following formula: df = (Number of rows - 1) x (Number of columns - 1) = (2 - 1) x (2 - 1) = 1. So, we have df = 1. Now we compare our calculated χ² value, i.e. 16.67, with the χ² value in a table of χ² with 1 df. If the calculated value of the Chi-square test is more than the value in the Chi-square table, the null hypothesis is rejected.

Table 1: Chi-square table (Source: Anonymous)

The χ² value at α = 0.05 and df = 1 is 3.841. Our observed χ² value is 16.67, which is greater than 3.841. This shows that the treatment group has a significant difference from the placebo group. Our treatment group shows significant results even at an α level of 0.005. So, we can say that the new treatment can be used for the reduction of pain.
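The expected counts, the χ² statistic, and a p-value for the 2 x 2 table can be reproduced with the following Python sketch. The function name chi_square_2x2 is hypothetical, and the p-value shortcut through math.erfc is an assumption that holds only for 1 degree of freedom, which is the case here.

```python
import math


def chi_square_2x2(observed):
    """Return (expected counts, chi-square statistic, p-value) for a 2 x 2 table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)

    # Expected count = row total x column total / grand total.
    expected = [[r * c / total for c in col_totals] for r in row_totals]

    # Sum (O - E)^2 / E over all four cells.
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )

    # Upper-tail probability of chi-square with df = 1 only.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return expected, statistic, p_value


# Rows: treatment, placebo. Columns: pain removed, pain did not remove.
observed = [[40, 10], [20, 30]]
expected, statistic, p_value = chi_square_2x2(observed)

print("expected counts:", expected)
print(f"chi-square = {statistic:.4f}")
print(f"p-value = {p_value:.6f}")
print("reject null at alpha = 0.05:", statistic > 3.841)
```

Comparing the statistic with the critical value 3.841 and comparing the p-value with 0.05 are two views of the same decision.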

Using Stata

In order to work on the chi-square test, we can use the following data as an example:

Groups             Pain removed    Pain did not remove    Total
Treatment group    40              10                     50
Placebo group      20              30                     50
Total              60              40                     100

In the Command window, write the following:

    tabi 40 10 \ 20 30, chi2

This gives the results. Pearson chi2(1) is 16.6667 and Pr (probability value) is 0.000. This p-value is less than 0.05, i.e. p