Percentages

Suppose you want to test the hypothesis (derived from some theory) that, in general, college men are more satisfied with their social life than college women. It would be a big pain to measure the satisfaction of every single college person, but you could do it with a sample of people. Basically, you get a list of all college students, you randomly select a certain number (say, 1000), and you call them up and ask two questions: 1. 'What sex are you?' 2. 'Are you satisfied with your social life? Yes or no.'

Now, according to your hypothesis, the number of men saying 'yes' should be higher than the number of women. Right? Well, not quite. What if the sample has more men in it than women? Then, even if men are unhappier than women, there will still be more 'yeses' among the men simply because there are more of them. The solution, obviously, is to use percentages. The hypothesis is that the percent of men saying 'yes' should be higher than the percent of women who say 'yes'.

Important note: it doesn't matter whether, in absolute terms, it's a large or small percentage of men that say 'yes', as long as it's bigger than the women's. For example, suppose 35% of the men say they are satisfied, while only 15% of the women are. The hypothesis has been supported, even though most men actually said they were dissatisfied.

Now, suppose we did the above, interviewing 100 men and 100 women, and found that 30 of the men and 20 of the women said they were satisfied.

Is the hypothesis supported? Well, maybe. It certainly was for the sample of 200. But it was just a sample. It wasn't everybody. What if we did the study all over again, choosing a different sample of 200 people? Would we get the same result? It's unlikely to come out exactly the same. Think about flipping a coin. The chance of it coming up heads is 50%, right? So if you flip it ten times, it should come up heads 5 times, right? But it doesn't have to. It is possible to get 6 heads. It is possible to get 7 heads. It is even possible to get 10 heads in a row. It has happened. It may be unlikely, but it can happen.
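To put rough numbers on the coin example, here is a minimal Python sketch that applies the standard binomial formula for a fair coin. The function name `prob_heads` is invented for illustration and is not part of the original discussion.

```python
from math import comb

def prob_heads(k, n=10, p=0.5):
    """Probability of exactly k heads in n flips of a coin with P(heads) = p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Chances of the outcomes mentioned in the text: 5, 6, 7 and 10 heads out of 10 flips.
for k in (5, 6, 7, 10):
    print(f"P({k} heads in 10 flips) = {prob_heads(k):.4f}")
```

Even the "expected" result of exactly 5 heads turns up in only about a quarter of all runs of ten flips, which is the point of the example: the most likely outcome is still not a sure thing.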

So suppose that out there in America, there really is no difference in the satisfaction levels between male and female college students. Actually, 25% of all college students, male or female, are satisfied with their social lives. Now we come along and interview 100 students of each sex at random. "At random" means that we essentially flipped a many-sided coin to see which of all the college males in America we would interview. Each male student had an equal chance of being chosen. If there are 1,000,000 college males in the population, then each one had the same chance of being chosen (1 in 10,000). And since 25% of them are satisfied, we should have gotten 25 that were satisfied and 75 that were not. But if we can flip a coin 10 times and not get exactly 5 heads, what are the chances we can pick 100 males out of a million and get exactly 25 that are satisfied? By chance alone, we could have gotten 30. Or 40. Or even 100. After all, there are 250,000 satisfied males out there. You could have picked 100 of them.
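The question of how likely "exactly 25" really is can be answered with a short calculation. The sketch below is an approximation under one stated assumption: it treats each of the 100 draws as independent (a binomial model), which is very close to sampling 100 students from a million without replacement. The helper names `binom_pmf` and `prob_at_least` are made up for this example.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k 'satisfied' answers among n students sampled."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def prob_at_least(k, n, p):
    """Probability of k or more 'satisfied' answers among n students sampled."""
    return sum(binom_pmf(j, n, p) for j in range(k, n + 1))

n, p = 100, 0.25  # sample size and true satisfaction rate used in the text

p_exactly_25 = binom_pmf(25, n, p)
p_30_or_more = prob_at_least(30, n, p)
p_40_or_more = prob_at_least(40, n, p)

print(f"P(exactly 25 satisfied) = {p_exactly_25:.4f}")
print(f"P(30 or more satisfied) = {p_30_or_more:.4f}")
print(f"P(40 or more satisfied) = {p_40_or_more:.6f}")
```

Running it should show that hitting exactly 25 is itself fairly unusual, that 30 or more is quite plausible, and that 40 or more is rare. Getting all 100 satisfied is possible in principle but so improbable that it would essentially never happen.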

And the same is true of the women. Our sample of 100 could have easily included 30 satisfied women, or just 10. So when we get a result like 30% of the men sampled being satisfied while only 20% of the women are, we cannot be sure that the difference of 10 percentage points is not due to chance. This is what is called sampling error, although it is not really an error. After all, when the coin gives you 6 heads instead of 5, do you consider it an error?

So what we would like to have is a sense of the probability of getting a difference like 10 percentage points by chance alone. In other words, we would like to know what the chances are of getting a 10-point difference in our sample when there is actually no difference in the population (i.e., the difference we observe is due to sampling variation rather than a real difference in the population).
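As an illustration of what "the chances of a 10-point difference by chance alone" means, the following sketch computes that probability directly for the running example. It is not the chi-square method, and it rests on assumptions: both sexes share the same true rate of 25% (the figure used in the text), each group of 100 is modeled as binomial, and only gaps in the hypothesized direction (men higher) are counted. The function names are hypothetical.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def prob_gap_at_least(gap, n=100, p=0.25):
    """P(men_yes - women_yes >= gap) when both groups share the same true rate p."""
    pmf = [binom_pmf(k, n, p) for k in range(n + 1)]
    return sum(
        pmf[men] * pmf[women]
        for men in range(n + 1)
        for women in range(n + 1)
        if men - women >= gap
    )

# With 100 per group, a gap of 10 people is a gap of 10 percentage points.
p_gap = prob_gap_at_least(10)
print(f"P(men exceed women by 10+ points by chance) = {p_gap:.4f}")
```

The answer depends on the assumed population rate, which in a real study is unknown. That is one reason to use a general-purpose significance test rather than a calculation like this one.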

That's what significance tests are for. What a significance test does is give you that probability, which is often called a "p-value". A p-value is the probability of observing a difference at least as large as the one you actually observed, given that the null hypothesis of no difference is actually true in the population. You can think of it, very loosely, as the chance you are taking of being wrong when you claim that men and women college students have different rates of satisfaction.

So what's a decent p-value? Typically, people say that if the p-value is less than 0.05 (i.e., 5 percent), then that's good enough: it is significant. If the p-value is larger than 0.05, we say it is non-significant. If the p-value is lower than 0.001, we say it is highly significant. What it means is that if the p-value is less than 5%, then it is low enough that we are willing to take the chance that we might be wrong.

Calculating the p-value

Although there is a simpler way to calculate the p-value than the one I'm going to show you, the one we will use is better because we can generalize it to other situations. That way, you only need to learn one statistic instead of several different ones. The method I'm going to teach is called the chi-square test of independence.

The basic idea is to compare the frequencies you got with the frequencies that you would expect to get if there was no difference in satisfaction between men and women in the population. That means that the variable "sex" and the variable "satisfaction with social life" are unrelated or independent of each other.

So let's review basic probability theory for a moment. If two events are independent, then the probability of both occurring at the same time can be calculated by multiplying the probability of one by the probability of the other. So if we flip two coins at the same time, the probability that both will come out heads is 0.5 x 0.5 = 0.25. It should happen about a quarter of the time. Similarly, suppose we flipped a coin and rolled a die (the singular of "dice") at the same time. What's the probability of getting a head and a 4 at the same time? Well it's just 1/2 times 1/6 = 1/12.
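The multiplication rule can be checked by brute force: list every equally likely outcome of the coin-and-die experiment and count the ones we want. This is a small sketch using only the Python standard library; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

coin = ["H", "T"]
die = [1, 2, 3, 4, 5, 6]

# Every equally likely (coin, die) outcome: 2 x 6 = 12 of them.
outcomes = list(product(coin, die))

# Count the outcomes that are "a head and a 4".
favorable = [o for o in outcomes if o == ("H", 4)]
p_by_counting = Fraction(len(favorable), len(outcomes))

# The multiplication rule for independent events.
p_by_rule = Fraction(1, 2) * Fraction(1, 6)

print(p_by_counting, p_by_rule)
```

Both routes give 1/12, as in the text.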

Therefore, if a person's sex has nothing to do with whether they are satisfied with their social life, then the proportion of the sample who are satisfied males should be equal to the probability of being male (i.e., what proportion of your sample was male) times the probability of being satisfied (i.e., what proportion of your sample was satisfied, regardless of sex).

So, just to make things interesting, let's assume that in our sample of 200, we obtained, by the luck of the draw, 120 females and 80 males, and 25% of the men said they were satisfied while 20% of the women said the same. We can arrange the information in a pair of tables:

* The last row contains not column percentages, but rather percentages of the whole sample.

So if sex is independent of satisfaction, the probability of sampling a satisfied male is 0.4 x 0.22 = 0.088. So we would have expected 8.8 percent of the total sample, or between 17 and 18 people, to be satisfied males (see table below). In other words, I multiply 0.088 by the total sample size (200) to get the expected frequency of satisfied males, 17.6. Similarly, the probability of getting a satisfied female is 0.6 x 0.22 = 0.132. So we would have expected 13.2 percent of the sample, or between 26 and 27 people, to be satisfied females. We carry out the same calculations for the unsatisfied males and the unsatisfied females.

Computationally, instead of bothering to compute the proportions of males (.4), females (.6), YESes (.22) and NOs (.78), multiplying the appropriate ones together, and then multiplying by the sample size (200), it is faster to work directly from the raw frequencies: for each cell, multiply its row sum by its column sum and divide by the sample size. For satisfied males, for example, that is 44 x 80 / 200 = 17.6.
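The row-sum-times-column-sum shortcut is easy to express in code. In the sketch below, the observed counts are not quoted from a table in the text; they are derived from the stated percentages (25% of 80 males is 20, 20% of 120 females is 24). The dictionary layout and names are illustrative choices.

```python
# Observed counts implied by the text: 25% of 80 males and 20% of 120 females said yes.
observed = {
    "Yes": {"Males": 20, "Females": 24},
    "No": {"Males": 60, "Females": 96},
}

row_sums = {row: sum(cells.values()) for row, cells in observed.items()}
col_sums = {
    col: sum(observed[row][col] for row in observed)
    for col in ("Males", "Females")
}
n = sum(row_sums.values())  # total sample size, 200

# Expected frequency of each cell under independence: row sum * column sum / n.
expected = {
    row: {col: row_sums[row] * col_sums[col] / n for col in col_sums}
    for row in observed
}

for row, cells in expected.items():
    print(row, cells)
```

The results match the figures obtained from the proportions: 17.6 and 26.4 in the Yes row, 62.4 and 93.6 in the No row.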

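Now we compare the expected frequencies with the observed frequencies. For each cell in the table, we take the observed frequency, subtract the expected value, square the difference, and divide it by the expected value. Then we sum all those quantities to get a single value called chi-square. In symbols, we calculate

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

where O is the observed frequency in a cell, E is the expected frequency in that cell, and the sum runs over all the cells of the table.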

For our data, the value of chi-square is about 0.7. This is an unusually small value (I'll explain what it means in a moment). Often, chi-square values are numbers like "32.87" or "116.2". The smallest chi-square value possible is 0, but there is no upper bound: it depends on the size of the numbers.
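The whole calculation, from observed counts to the chi-square value, fits in a few lines. This is a minimal sketch rather than a library-grade implementation: `chi_square` is a name chosen here, and the observed counts (20, 24, 60, 96) are derived from the percentages given in the text.

```python
def chi_square(observed):
    """Chi-square statistic for a table given as a list of rows of counts."""
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(col) for col in zip(*observed)]
    n = sum(row_sums)
    total = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_sums[i] * col_sums[j] / n  # expected count under independence
            total += (o - e) ** 2 / e
    return total

# Rows: Yes, No. Columns: Males, Females.
observed = [[20, 24], [60, 96]]
stat = chi_square(observed)
print(round(stat, 3))  # about 0.7, as in the text
```

Note that this is the plain chi-square described here, with no continuity correction. Some statistics packages apply a correction to 2 x 2 tables by default (for example, `scipy.stats.chi2_contingency` does unless called with `correction=False`), so their output can differ slightly.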

Notice that the smaller the difference between observed and expected, the smaller the value of chi-square will be. Chi-square is zero only when there is absolutely no difference between the observed and the expected. So when will chi-square be small? Whenever the sample data are consistent with the null hypothesis that there is no difference in satisfaction between males and females in the population. So if we were hoping that there really was a difference between males and females, we want chi-square to be large.

But how large is large? The maximum value of chi-square is ... big. What we want to know is the probability of getting a chi-square as large as the one actually observed, given that in the population the variables are independent of each other. A chi-square table such as this one gives the chi-square values that correspond to selected probabilities:

df \ P	0.100	0.050	0.025	0.010	0.005	0.001
1	2.7055	3.8414	5.0238	6.6349	7.8794	10.828
2	4.6051	5.9914	7.3777	9.2103	10.5966	13.816
3	6.2513	7.8147	9.3484	11.3449	12.8381	16.266
4	7.7794	9.4877	11.1433	13.2767	14.8602	18.467
5	9.2363	11.0705	12.8325	15.0863	16.7496	20.515
6	10.6446	12.5916	14.4494	16.8119	18.5476	22.458
7	12.0170	14.0671	16.0128	18.4753	20.2777	24.322
8	13.3616	15.5073	17.5346	20.0902	21.9550	26.125
9	14.6837	16.9190	19.0228	21.6660	23.5893	27.877
10	15.9871	18.3070	20.4831	23.2093	25.1882	29.588
11	17.2750	19.6751	21.9200	24.7250	26.7569	31.264
12	18.5494	21.0261	23.3367	26.2170	28.2995	32.909
13	19.8119	22.3621	24.7356	27.6883	29.8194	34.528
14	21.0642	23.6848	26.1190	29.1413	31.3193	36.123
15	22.3072	24.9958	27.4884	30.5779	32.8013	37.697
16	23.5418	26.2962	28.8454	31.9999	34.2672	39.252
17	24.7690	27.5871	30.1910	33.4087	35.7185	40.790
18	25.9894	28.8693	31.5264	34.8058	37.1564	42.312
19	27.2036	30.1435	32.8523	36.1908	38.5822	43.820
20	28.4120	31.4104	34.1696	37.5662	39.9968	45.315
21	29.6151	32.6705	35.4789	38.9321	41.4010	46.797
22	30.8133	33.9244	36.7807	40.2894	42.7956	48.268
23	32.0069	35.1725	38.0757	41.6384	44.1813	49.726
24	33.1963	36.4151	39.3641	42.9798	45.5585	51.179
25	34.3816	37.6525	40.6465	44.3141	46.9278	52.620
26	35.5631	38.8852	41.9232	45.6417	48.2899	54.052
27	36.7412	40.1133	43.1944	46.9629	49.6449	55.476
28	37.9159	41.3372	44.4607	48.2782	50.9933	56.892
29	39.0875	42.5569	45.7222	49.5879	52.3356	58.302
30	40.2560	43.7729	46.9792	50.8922	53.6720	59.703
40	51.8050	55.7585	59.3417	63.6907	66.7659	73.402
50	63.1671	67.5048	71.4202	76.1539	79.4900	86.661
60	74.3970	79.0819	83.2976	88.3794	91.9517	99.607
70	85.5271	90.5312	95.0231	100.425	104.215	112.317
80	96.5782	101.879	106.629	112.329	116.321	124.839
90	107.565	113.145	118.136	124.116	128.299	137.208
100	118.498	124.342	129.561	135.807	140.169	149.449

The columns of the table correspond to p-values. In general, we look down the column marked "0.05", because we use 0.05 as the conventional cut-off level of statistical significance. In choosing the .05 level, we are saying that if the probability of a certain result occurring just because of sampling variation is greater than 5%, then we are not willing to assume that the results are real (i.e., that sex and satisfaction are associated in the population from which the sample was drawn).

The cells of the table correspond to chi-square values (such as the 0.7 we computed above). The rows correspond to degrees of freedom. To calculate degrees of freedom for a simple table such as we have, we use the following formula:

$$df = (R - 1)(C - 1)$$

where R is the number of categories in the Satisfaction variable, and C is the number of categories in the Sex variable. In our case, both variables have 2 categories, so the degrees of freedom is (2 - 1) x (2 - 1) = 1 x 1 = 1.

Now we look at the first row of the table (corresponding to 1 degree of freedom), and look down the 0.05 column. The value in the table is 3.8414. Comparing that to the 0.7 that we calculated, we see that our value is smaller than the value in the table. This means that the differences between observed and expected were relatively small -- so small, that it could have happened by chance (due to sampling variation) more than 5% of the time.

When something could occur by sampling variation more than 5% of the time, we call that "non-significant" and don't trust it, since there is a sizable chance (more than 5%) that it occurred solely because of the luck of the draw: a weird sample. Hence, it makes no sense to try to interpret it as a real difference in satisfaction between men and women. In fact, in this case, the 0.7 is smaller than the chi-square value in the "0.100" column as well, indicating that differences like the ones we observed are likely to occur in more than 10% of samples drawn from a population in which there is actually no difference between males and females in satisfaction.
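The table lookup just described can be mimicked in a few lines. The sketch below copies the 1-degree-of-freedom row from the table above and reports the smallest tabulated probability level whose chi-square value is exceeded. The names `degrees_of_freedom` and `smallest_level_reached` are hypothetical, and only the df = 1 row is included.

```python
# Chi-square values for 1 degree of freedom, copied from the first row of the table.
CRITICAL_DF1 = {
    0.100: 2.7055,
    0.050: 3.8414,
    0.025: 5.0238,
    0.010: 6.6349,
    0.005: 7.8794,
    0.001: 10.828,
}

def degrees_of_freedom(n_rows, n_cols):
    """Degrees of freedom for a table with n_rows and n_cols categories."""
    return (n_rows - 1) * (n_cols - 1)

def smallest_level_reached(chi_sq, critical=CRITICAL_DF1):
    """Smallest tabulated p level whose table value chi_sq exceeds, or None if none."""
    reached = [p for p, cutoff in critical.items() if chi_sq > cutoff]
    return min(reached) if reached else None

print(degrees_of_freedom(2, 2))     # 1 for our 2 x 2 table
print(smallest_level_reached(0.7))  # None: not even the 0.100 level is reached
```

A result of `None` corresponds to the conclusion in the text: with chi-square at about 0.7, the p-value is larger than 0.10, so the difference is non-significant.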

So we conclude that we cannot reject the null hypothesis of independence (no difference between sexes). In other words, the difference in percentages between males and females is so small that there might not be any difference in the population: it might easily have been a fluke of our sample.

Now suppose the chi-square value that we had computed from our data had been somewhat larger -- say, 7.9. Looking along the first row of the table, we see that it is just larger than the value under the "0.005" column. That means that a result this large would occur by chance in fewer than one half of one percent of samples. In other words, it is really unlikely to have happened by chance, so in that case we would be willing to believe that there really is a difference between men and women in the population.
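For one degree of freedom, the chi-square tail probability has a closed form in terms of the complementary error function, so the table lookup can be cross-checked with the Python standard library alone. The sketch below assumes df = 1 only. `p_value_df1` is a made-up name, and other degrees of freedom would need a statistics library.

```python
from math import erfc, sqrt

def p_value_df1(chi_sq):
    """P(chi-square with 1 df >= chi_sq), i.e. P(|Z| >= sqrt(chi_sq)) for standard normal Z."""
    return erfc(sqrt(chi_sq / 2))

p_small = p_value_df1(0.7)       # the value computed from our data
p_large = p_value_df1(7.9)       # the hypothetical larger value
p_cutoff = p_value_df1(3.8414)   # the table entry for the 0.05 column

print(f"chi-square 0.7    -> p = {p_small:.3f}")
print(f"chi-square 7.9    -> p = {p_large:.4f}")
print(f"chi-square 3.8414 -> p = {p_cutoff:.4f}")
```

The first p-value should land well above 0.10 and the second just under 0.005, in line with what the table shows. The third recovers the 0.05 column heading, which is a handy check that the formula and the table agree.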

	Men	Women
Satisfied:	30	20

	Males	Females	Sampled
Yes	25%	20%	22%
No	75%	80%	78%
Sampled*	40%	60%	100%

	Males	Females	Sampled
Yes	.4 x .22 x 200 = 17.6	.6 x .22 x 200 = 26.4	44
No	.4 x .78 x 200 = 62.4	.6 x .78 x 200 = 93.6	156
Sampled	80	120	200

	Males	Females	Sampled
Yes	17.6	26.4	44
No	62.4	93.6	156
Sampled	80	120	200

	Males	Females	Sampled
Yes	44*80/200 = 17.6	44*120/200 = 26.4	44
No	156*80/200 = 62.4	156*120/200 = 93.6	156
Sampled	80	120	200

Squared differences between observed and expected frequencies:
	Males	Females	Sampled
Yes	5.76	5.76	11.52
No	5.76	5.76	11.52
Sampled	11.52	11.52	23.04

Squared differences divided by the expected frequencies (the grand total is chi-square):
	Males	Females	Sampled
Yes	0.327	0.218	0.545
No	0.092	0.062	0.154
Sampled	0.42	0.28	0.7

The Chi-Square Test
