Suppose you want to test the hypothesis (derived from some theory) that, in general, college men are more satisfied with their social life than college women. It would be a big pain to measure the satisfaction of every single college person, but you could do it with a sample of people. Basically, you get a list of all college students, you randomly select a certain number (say, 1000), and you call them up and ask two questions: 1. What sex are you? 2. Are you satisfied with your social life? Yes or no.
Now, according to your hypothesis, the number of men saying 'yes' should be higher than the number of women. Right? Well, not quite. What if the sample has more men in it than women? Then, even if men are unhappier than women, there will still be more 'yeses' among the men simply because there are more of them. The solution, obviously, is to use percentages. The hypothesis is that the percent of men saying 'yes' should be higher than the percent of women who say 'yes'.
Important note: it doesn't matter whether, in absolute terms, it's a large or small percentage of men that say 'yes', as long as it's bigger than the women's. For example, suppose 35% of the men say they are satisfied, while only 15% of the women are. The hypothesis has been supported, even though most men actually said they were dissatisfied.
Now, suppose we did the above, interviewing 100 men and 100 women, and got this result:
Response | Men | Women |
Satisfied | 30 | 20 |
Is the hypothesis supported? Well, maybe. It certainly was for the sample of 200. But it was just a sample. It wasn't everybody. What if we did the study all over again, choosing a different sample of 200 people? Would we get the same result? It's unlikely to come out exactly the same. Think about flipping a coin. The chance of it coming up heads is 50%, right? So if you flip it ten times, it should come up heads 5 times, right? But it doesn't have to. It is possible to get 6 heads. It is possible to get 7 heads. It is even possible to get 10 heads in a row. It has happened. It may be unlikely, but it can happen.
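To put numbers on the coin example, here is a minimal sketch that computes the chance of each outcome from the binomial formula. The function name `prob_heads` is just an illustrative choice, and the coin is assumed to be fair.

```python
from math import comb

def prob_heads(k, flips=10, p=0.5):
    """Probability of exactly k heads in `flips` tosses of a coin with P(heads) = p."""
    return comb(flips, k) * p**k * (1 - p) ** (flips - k)

# Chance of 5, 6, 7 and 10 heads in ten flips of a fair coin
for k in (5, 6, 7, 10):
    print(k, round(prob_heads(k), 4))
```

Exactly 5 heads turns up only about a quarter of the time (252 out of 1024 equally likely sequences), while 10 heads in a row happens about once in a thousand tries.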
So suppose that out there in America, there really is no difference in the satisfaction levels between male and female college students. Suppose that, in fact, 25% of all college students, male or female, are satisfied with their social lives. Now we come along and interview 100 students of each sex at random. "At random" means that we essentially flipped a many-sided coin to see which of all the college males in America we would interview. Each male student had an equal chance of being chosen. If there are 1,000,000 college males in the population, then each one had the same chance of being chosen (1 in 10,000). And since 25% of them are satisfied, we should have gotten 25 that were satisfied and 75 that were not. But if we can flip a coin 10 times and not get exactly 5 heads, what are the chances we can pick 100 males out of a million and get exactly 25 that are satisfied? By chance alone, we could have gotten 30. Or 40. Or even 100. After all, there are 250,000 satisfied males out there. You could, in principle, have picked 100 of them.
And the same is true of the women. Our sample of 100 could have easily included 30 satisfied women, or just 10. So when we get a result like 30% of the men sampled being satisfied while only 20% of the women are, we cannot be sure that the difference of 10 percentage points is not due to chance. This is what is called sampling error, although it is not really an error. After all, when the coin gives you 6 heads instead of 5, do you consider it an error?
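A quick way to see sampling variation is to simulate it. The sketch below is an illustration only: it assumes a population in which 25% of both sexes are satisfied, treats each interview as an independent draw (reasonable when the population is far larger than the sample), and uses a made-up helper name, `sample_satisfied`.

```python
import random

def sample_satisfied(rng, n=100, rate=0.25):
    """Count satisfied respondents in one random sample of n students."""
    return sum(rng.random() < rate for _ in range(n))

rng = random.Random(0)  # fixed seed so the run can be repeated
diffs = []
for _ in range(10):
    men = sample_satisfied(rng)
    women = sample_satisfied(rng)
    diffs.append(men - women)
    # Satisfied men, satisfied women, and the gap, for one imaginary study
    print(men, women, men - women)
```

Each printed line is one imaginary repeat of the study. Even though men and women are equally satisfied in this made-up population, the gap between the two counts is rarely exactly zero.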
So what we would like to have is a sense for the probability of getting a difference like 10 percentage points by chance alone. In other words, we would like to know what the chances are of getting a 10-point difference in the sample when there is actually no difference in the population (i.e., the difference we observe is due to sampling variation rather than a real difference in the population).
That's what significance tests are for. What a significance test does is give you that probability, which is often called a "p-value". A p-value is the probability of observing a difference at least as large as the one you actually observed, given that the null hypothesis of no difference is actually true in the population. You can think of it loosely as the probability that you are wrong when you claim that men and women college students have different rates of satisfaction.
So what's a decent p-value? Typically, people say that if the p-value is less than 0.05 (i.e., 5 percent), then that's good enough: it is significant. If the p-value is larger than 0.05, we say it is non-significant. If the p-value is lower than 0.001, we say it is highly significant. What it means is that if the p-value is less than 5%, then it is low enough that we are willing to take the chance that we might be wrong.
Although there is a simpler way to calculate the p-value than the one I'm going to show you, the one we will use is better because we can generalize it to other situations. That way, you only need to learn one statistic instead of several different ones. The method I'm going to teach is called the chi-square test of independence.
The basic idea is to compare the frequencies you got with the frequencies that you would expect to get if there was no difference in satisfaction between men and women in the population. That means that the variable "sex" and the variable "satisfaction with social life" are unrelated or independent of each other.
So let's review basic probability theory for a moment. If two events are independent, then the probability of both occurring at the same time can be calculated by multiplying the probability of one by the probability of the other. So if we flip two coins at the same time, the probability that both will come out heads is 0.5 x 0.5 = 0.25. It should happen about a quarter of the time. Similarly, suppose we flipped a coin and rolled a die (the singular of "dice") at the same time. What's the probability of getting a head and a 4 at the same time? Well it's just 1/2 times 1/6 = 1/12.
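The multiplication rule can be checked by brute force. This small sketch lists every equally likely coin-and-die outcome and counts the ones we care about; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

coin = ["H", "T"]
die = [1, 2, 3, 4, 5, 6]

# All 12 equally likely (coin, die) pairs
outcomes = list(product(coin, die))

# Fraction of outcomes that are a head together with a 4
hits = sum(1 for c, d in outcomes if c == "H" and d == 4)
p_joint = Fraction(hits, len(outcomes))

p_head = Fraction(1, 2)
p_four = Fraction(1, 6)
print(p_joint, p_head * p_four)  # both are 1/12
```

Counting outcomes and multiplying the two separate probabilities give the same answer, which is exactly what independence means.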
Therefore, if a person's sex has nothing to do with whether they are satisfied with their social life, then the proportion of the whole sample made up of satisfied males should be equal to the probability of being male (i.e., what percentage of your sample was male) times the probability of being satisfied (i.e., what percentage of your sample was satisfied, regardless of sex).
So, just to make things interesting, let's assume that in our sample of 200, we obtained, by the luck of the draw, 120 females and 80 males, and 25% of the men said they were satisfied while 20% of the women said the same. We can arrange the information in a pair of tables:
Frequencies
Satisfied? | Males | Females | Sampled |
Yes | 20 | 24 | 44 |
No | 60 | 96 | 156 |
Sampled | 80 | 120 | 200 |
Column Percentages
Satisfied? | Males | Females | Sampled |
Yes | 25% | 20% | 22% |
No | 75% | 80% | 78% |
Sampled* | 40% | 60% | 100% |
* The last row is not a set of column percentages,
but rather percentages of the whole sample
So if sex is independent of satisfaction, the probability of sampling a satisfied male is 0.4 x 0.22 = 0.088. So we would have expected 8.8 percent of the total sample, or between 17 and 18 people, to be satisfied males (see table below). In other words, I multiply 0.088 by the total sample size (200) to get the expected frequency of satisfied males. Similarly, the probability of getting a satisfied female is 0.6 x 0.22 = 0.132. So we would have expected 13.2 percent of the sample, or between 26 and 27 people, to be satisfied females. We carry out the same calculations for the unsatisfied males and the unsatisfied females.
Calculating Expected Frequencies
(given assumption of independence)
Satisfied? | Males | Females | Sampled |
Yes | .4*.22*200 = 17.6 | .6*.22*200 = 26.4 | 44 |
No | .4*.78*200 = 62.4 | .6*.78*200 = 93.6 | 156 |
Sampled | 80 | 120 | 200 |
Expected Frequencies
(given assumption of independence)
Satisfied? | Males | Females | Sampled |
Yes | 17.6 | 26.4 | 44 |
No | 62.4 | 93.6 | 156 |
Sampled | 80 | 120 | 200 |
Computationally, instead of bothering to compute the proportions of males (.4), females (.6), YESes (.22) and NOs (.78), multiplying the appropriate ones together, and then multiplying by the sample size (200), it is faster to work directly from the raw frequencies: multiply the row sum by the column sum and divide by the sample size, as follows:
Expected Frequencies
(given assumption of independence)
Satisfied? | Males | Females | Sampled |
Yes | 44*80/200 = 17.6 | 44*120/200 = 26.4 | 44 |
No | 156*80/200 = 62.4 | 156*120/200 = 93.6 | 156 |
Sampled | 80 | 120 | 200 |
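The row-sum-times-column-sum shortcut is easy to express in code. The following is a minimal sketch using the observed frequencies from the example; the function name `expected_frequencies` and the list-of-rows layout are illustrative choices, not part of the method itself.

```python
# Observed frequencies: rows are Yes / No, columns are Males / Females
observed = [[20, 24],
            [60, 96]]

def expected_frequencies(table):
    """Expected cell counts under independence: row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

expected = expected_frequencies(observed)
for row in expected:
    print([round(value, 1) for value in row])
```

This reproduces the 17.6, 26.4, 62.4 and 93.6 shown in the table, and the same function works unchanged for tables with more rows or columns.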
Now we compare the expected frequencies with the observed frequencies. For each cell in the table, we take the observed frequency, subtract the expected value, square the difference, and divide it by the expected value. Then we sum all those quantities to get a single value called chi-square. In symbols, we calculate
$$\frac{(o - e)^2}{e}$$
for each of the four cells in the table and add them up.
Differences
(observed minus expected)
Satisfied? | Males | Females | Sampled |
Yes | 2.4 | -2.4 | 0 |
No | -2.4 | 2.4 | 0 |
Sampled | 0 | 0 | 0 |
Squared Differences
(given assumption of independence)
Satisfied? | Males | Females | Sampled |
Yes | 5.76 | 5.76 | 11.52 |
No | 5.76 | 5.76 | 11.52 |
Sampled | 11.52 | 11.52 | 23.04 |
Squared Differences Divided By Expected
(given assumption of independence)
Satisfied? | Males | Females | Sampled |
Yes | 0.327 | 0.218 | 0.545 |
No | 0.092 | 0.062 | 0.154 |
Sampled | 0.42 | 0.28 | 0.7 |
For our data, the value of chi-square is about 0.7. This is an unusually small value (I'll explain what it means in a moment). Often, chi-square values are numbers like "32.87" or "116.2". The smallest chi-square value possible is 0, but there is no upper bound: it depends on the size of the numbers.
Notice that the smaller the difference between observed and expected, the smaller the value of chi-square will be. Chi-square is zero only when there is absolutely no difference between the observed and the expected. So when will chi-square be small? Whenever the sample data are consistent with the null hypothesis that there is no difference in satisfaction between males and females in the population. So if we were hoping that there really was a difference between males and females, we want chi-square to be large.
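The whole calculation, from observed frequencies to the chi-square value, can be sketched in a few lines. The function names are illustrative, and the second table (`proportional`) is a made-up example in which both sexes have exactly the same satisfaction rate, included only to show the zero case.

```python
def expected_frequencies(table):
    """Expected cell counts under independence: row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def chi_square(table):
    """Sum of (observed - expected)^2 / expected over every cell."""
    expected = expected_frequencies(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )

observed = [[20, 24], [60, 96]]       # the sample from the example
proportional = [[20, 30], [60, 90]]   # hypothetical: 25% satisfied in both sexes

print(round(chi_square(observed), 3))
print(chi_square(proportional))
```

For the example data this gives roughly 0.7, and for the hypothetical table with identical rates it gives 0, since every observed count equals its expected count.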
But how large is large? Chi-square has no fixed upper limit, so the raw number by itself does not tell us. What we want to know is the probability of getting a chi-square at least as large as the one actually observed, given that in the population the variables are independent of each other. The probabilities are given in a chi-square table such as this one:
Table of Chi-Square Values
df \ P | 0.100 | 0.050 | 0.025 | 0.010 | 0.005 | 0.001 |
1 | 2.7055 | 3.8414 | 5.0238 | 6.6349 | 7.8794 | 10.828 |
2 | 4.6051 | 5.9914 | 7.3777 | 9.2103 | 10.5966 | 13.816 |
3 | 6.2513 | 7.8147 | 9.3484 | 11.3449 | 12.8381 | 16.266 |
4 | 7.7794 | 9.4877 | 11.1433 | 13.2767 | 14.8602 | 18.467 |
5 | 9.2363 | 11.0705 | 12.8325 | 15.0863 | 16.7496 | 20.515 |
6 | 10.6446 | 12.5916 | 14.4494 | 16.8119 | 18.5476 | 22.458 |
7 | 12.0170 | 14.0671 | 16.0128 | 18.4753 | 20.2777 | 24.322 |
8 | 13.3616 | 15.5073 | 17.5346 | 20.0902 | 21.9550 | 26.125 |
9 | 14.6837 | 16.9190 | 19.0228 | 21.6660 | 23.5893 | 27.877 |
10 | 15.9871 | 18.3070 | 20.4831 | 23.2093 | 25.1882 | 29.588 |
11 | 17.2750 | 19.6751 | 21.9200 | 24.7250 | 26.7569 | 31.264 |
12 | 18.5494 | 21.0261 | 23.3367 | 26.2170 | 28.2995 | 32.909 |
13 | 19.8119 | 22.3621 | 24.7356 | 27.6883 | 29.8194 | 34.528 |
14 | 21.0642 | 23.6848 | 26.1190 | 29.1413 | 31.3193 | 36.123 |
15 | 22.3072 | 24.9958 | 27.4884 | 30.5779 | 32.8013 | 37.697 |
16 | 23.5418 | 26.2962 | 28.8454 | 31.9999 | 34.2672 | 39.252 |
17 | 24.7690 | 27.5871 | 30.1910 | 33.4087 | 35.7185 | 40.790 |
18 | 25.9894 | 28.8693 | 31.5264 | 34.8058 | 37.1564 | 42.312 |
19 | 27.2036 | 30.1435 | 32.8523 | 36.1908 | 38.5822 | 43.820 |
20 | 28.4120 | 31.4104 | 34.1696 | 37.5662 | 39.9968 | 45.315 |
21 | 29.6151 | 32.6705 | 35.4789 | 38.9321 | 41.4010 | 46.797 |
22 | 30.8133 | 33.9244 | 36.7807 | 40.2894 | 42.7956 | 48.268 |
23 | 32.0069 | 35.1725 | 38.0757 | 41.6384 | 44.1813 | 49.726 |
24 | 33.1963 | 36.4151 | 39.3641 | 42.9798 | 45.5585 | 51.179 |
25 | 34.3816 | 37.6525 | 40.6465 | 44.3141 | 46.9278 | 52.620 |
26 | 35.5631 | 38.8852 | 41.9232 | 45.6417 | 48.2899 | 54.052 |
27 | 36.7412 | 40.1133 | 43.1944 | 46.9629 | 49.6449 | 55.476 |
28 | 37.9159 | 41.3372 | 44.4607 | 48.2782 | 50.9933 | 56.892 |
29 | 39.0875 | 42.5569 | 45.7222 | 49.5879 | 52.3356 | 58.302 |
30 | 40.2560 | 43.7729 | 46.9792 | 50.8922 | 53.6720 | 59.703 |
40 | 51.8050 | 55.7585 | 59.3417 | 63.6907 | 66.7659 | 73.402 |
50 | 63.1671 | 67.5048 | 71.4202 | 76.1539 | 79.4900 | 86.661 |
60 | 74.3970 | 79.0819 | 83.2976 | 88.3794 | 91.9517 | 99.607 |
70 | 85.5271 | 90.5312 | 95.0231 | 100.425 | 104.215 | 112.317 |
80 | 96.5782 | 101.879 | 106.629 | 112.329 | 116.321 | 124.839 |
90 | 107.565 | 113.145 | 118.136 | 124.116 | 128.299 | 137.208 |
100 | 118.498 | 124.342 | 129.561 | 135.807 | 140.169 | 149.449 |
The columns of the table correspond to p-values. In general, we look down the column marked "0.05", because we use 0.05 as the conventional cut-off level of statistical significance. In choosing the .05 level, we are saying that if the probability of a certain result occurring just because of sampling variation is greater than 5%, then we are not willing to assume that the results are real (i.e., that sex and satisfaction are associated in the population from which the sample was drawn).
The cells of the table correspond to chi-square values (such as the 0.7 we computed above). The rows correspond to degrees of freedom. To calculate degrees of freedom for a simple table such as we have, we use the following formula:
df = (R - 1) x (C - 1)
where R is the number of categories in the Satisfaction variable, and C is the number of categories in the Sex variable. In our case, both variables have 2 categories, so the degrees of freedom is 1 x 1 = 1.
Now we look at the first row of the table (corresponding to 1 degree of freedom) and read across to the 0.05 column. The value in the table is 3.8414. Comparing that to the 0.7 that we calculated, we see that our value is smaller than the value in the table. This means that the differences between observed and expected were relatively small: so small that differences this large would turn up by chance (due to sampling variation) in more than 5% of samples.
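The lookup can be written out as a small sketch. The dictionary below copies only the first five rows of the 0.05 column from the table; the names `CRITICAL_05`, `degrees_of_freedom` and `is_significant` are illustrative.

```python
# 0.05 column of the chi-square table, first five rows (df -> critical value)
CRITICAL_05 = {1: 3.8414, 2: 5.9914, 3: 7.8147, 4: 9.4877, 5: 11.0705}

def degrees_of_freedom(n_rows, n_cols):
    """df = (R - 1) x (C - 1) for a table with R rows and C columns."""
    return (n_rows - 1) * (n_cols - 1)

def is_significant(chi_square, n_rows, n_cols):
    """True when chi-square exceeds the 0.05 critical value for the table's df."""
    df = degrees_of_freedom(n_rows, n_cols)
    return chi_square > CRITICAL_05[df]

print(degrees_of_freedom(2, 2))   # 1
print(is_significant(0.7, 2, 2))  # False
```

For our 2 x 2 table the degrees of freedom are 1, and 0.7 falls well short of 3.8414, so the result is not significant at the 0.05 level.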
When something could occur by sampling variation more than 5% of the time, we call that "non-significant" and don't trust it, since there is a substantial chance (more than 5%) that it occurred solely because of the luck of the draw: a weird sample. Hence, it makes no sense to try to interpret it as a real difference in satisfaction between men and women. In fact, in this case, the 0.7 is smaller than the chi-square value in the "0.100" column as well, indicating that a difference like the one we observed is likely to occur in more than 10% of samples drawn from a population in which there is actually no difference between males and females in satisfaction.
So we conclude that we cannot reject the null hypothesis of independence (no difference between sexes). In other words, the difference in percentages between males and females is so small that there might not be any difference in the population: it might easily have been a fluke of our sample.
Now suppose the chi-square value that we had computed from our data had been somewhat larger, say 7.9. Looking along the first row of the table, we see that it is just larger than the value under the "0.005" column. That means that a result this large would occur by chance in fewer than one half of one percent of samples. In other words, it is really unlikely to have happened by chance, so in that case we would be willing to believe that there really is a difference between men and women in the population.
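Instead of bracketing the p-value between table columns, it can be computed directly when there is 1 degree of freedom. The sketch below relies on a standard fact not derived in these notes: a chi-square variable with 1 degree of freedom is the square of a standard normal variable, so its upper-tail probability equals erfc(sqrt(x / 2)). The function name `p_value_1df` is illustrative, and the shortcut does not apply to larger tables.

```python
from math import erfc, sqrt

def p_value_1df(chi_square):
    """Upper-tail probability of a chi-square value with 1 degree of freedom."""
    return erfc(sqrt(chi_square / 2))

# The value from our data, the 0.05 cut-off from the table, and the larger example
for value in (0.7, 3.8414, 7.9):
    print(value, round(p_value_1df(value), 4))
```

The p-value for 0.7 comes out at roughly 0.40, far above 0.05, while 7.9 gives a p-value just under 0.005, in line with what the table shows.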