Introduction to Sampling

Why Sample?

Pool of possible cases is too large (e.g., 280 million Americans): surveying everyone would cost too much and take too long.
Don't want to use up the cases: e.g., when testing light bulbs to see how long they last, you take a bulb and leave it on until it burns out. The manufacturer can't test all the bulbs this way, because the whole objective is to sell the bulbs, not burn them out.
It's not necessary to survey all cases: for most purposes, taking a sample yields estimates that are accurate enough.
The trade-off is that sampling does introduce some error. You didn't interview everybody, so certain opinions or combinations of opinions won't be represented in your data. When the population is very diverse, your sample can't include all the possible combinations of attributes that are found in the population, such as blacks and whites, men and women, cardiac patients and non-patients, black women, white men, white women with heart trouble who like Oprah and don't like Ally McBeal, etc.

Populations, Sampling Frames, and Elements

Population is the universe of cases. It is the group that you ultimately want to say something about. For example, if you want to report 'what Americans think about Clinton', then the population is all Americans.
Elements are the individual cases in the population (usually, persons)
Sampling ratio is size of sample divided by size of population. Contrary to popular belief, a large sampling ratio is not crucial.
Sampling frame is the specific list of names from which sample elements will be chosen. The Literary Digest poll in 1936 mailed ballots to about 10 million people, drawn largely from lists of automobile and telephone owners. It predicted that Alf Landon would beat Franklin Roosevelt by a wide margin, but Roosevelt won by a landslide. The reason was that the sampling frame did not match the population: in 1936, automobile and telephone owners were disproportionately well-off, and they were the ones who favored Landon.
Replacement. Sampling with replacement means that after you draw a name out of the hat and record it, you put the name back and it can be chosen again. Sampling without replacement means that once you draw the name out, it is not available to be chosen again.
Bias. Systematic error produced by your sampling procedure. For example, if you sample people and ask them whether they watch Ally McBeal, and the percentage always comes out too high (maybe because you are interviewing your friends and your whole group really likes Ally McBeal), then your procedure is biased.
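
The difference between sampling with and without replacement can be seen in a small sketch. This is only an illustration using Python's standard random module; the names in the hat and the function names are made up for the example.

```python
import random

def draw_without_replacement(names, k, seed=None):
    """Draw k names; once drawn, a name cannot be chosen again."""
    rng = random.Random(seed)
    return rng.sample(names, k)

def draw_with_replacement(names, k, seed=None):
    """Draw k names; each name goes back in the hat, so repeats are possible."""
    rng = random.Random(seed)
    return rng.choices(names, k=k)

# Hypothetical hat of five names
hat = ["Ana", "Ben", "Cy", "Dee", "Eli"]

print(draw_without_replacement(hat, 3, seed=1))  # three distinct names
print(draw_with_replacement(hat, 8, seed=1))     # more draws than names is fine
```

Note that with replacement you can draw more names than the hat contains, while without replacement the sample can never be larger than the frame.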

Non-Probability Sampling

Haphazard/Convenience

Whoever happens to walk by your office; who's on the street when the camera crews come out
If you have a choice, don't use this method. Often produces really wrong answers, because certain attributes tend to cluster with certain geographic and temporal variables. For example, at 8am in NYC, most of the people on the street are workers heading for their jobs. At 10am, there are many more people who don't work, and the proportion of women is much higher. At midnight, there are young people and muggers.

Quota

Haphazard sampling within categories (e.g., first 5 males to come by)
Is an improvement, but still has problems. How do you know which categories are key? How many do you get of each category?

Purposive/Judgement

Expert judgement picks useful cases for study
Good for exploratory, qualitative work, and for pre-testing a questionnaire.

Snowball

Recruiting people based on recommendation of people you have just interviewed
Useful for studying invisible/illegal populations, such as drug addicts

Probability Sampling

Probability sampling is any sampling scheme in which every individual has the same probability of being chosen (or at least a known probability, so that the results can be adjusted mathematically). Such schemes are also called random sampling. They require more work, but they are generally much more accurate than the non-probability methods above. They also allow the researcher to calculate the amount of error she can expect, and this is really important.

Simple Random

Develop a sampling frame, then randomly select elements (place all names on cards, then randomly draw cards from a hat; in Excel, use the RAND() function to attach a random number to each row, then sort and take the N largest)
Typically use sampling without replacement, but with replacement can be done (and is easier mathematically)
Any one sample is likely to yield statistics (such as the average income or the percentage of respondents that watch Ally McBeal) that are different from the population parameters
The average statistic from many random samples should come very close to the population parameter. In other words, if you took 150 different samples of Americans, each of 300 people, calculated the percentage that like Ally McBeal in each of the samples, and then averaged all those percentages together, the result should be very close to the "real" percentage of all Americans that like Ally McBeal
It is the Law of Large Numbers that guarantees that as the number of random samples increases, the average of those samples converges on the population parameter. The Central Limit Theorem adds that the sample statistics are distributed approximately normally around that parameter
Because of these mathematical guarantees, we can estimate how far off a sample might be from the population, giving rise to confidence intervals
Random samples are unbiased and, on average, representative of the population.
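
The thought experiment of taking 150 samples of 300 people can be simulated. The sketch below is only an illustration: the population of 100,000 people, of whom 30% like the show, is invented, and the function name is hypothetical.

```python
import random
import statistics

def simulate_sample_percentages(population, sample_size, n_samples, seed=0):
    """Draw n_samples simple random samples (without replacement) and
    return the percentage of 1s found in each sample."""
    rng = random.Random(seed)
    percentages = []
    for _ in range(n_samples):
        sample = rng.sample(population, sample_size)
        percentages.append(100 * sum(sample) / sample_size)
    return percentages

# Hypothetical population: 1 = likes the show, 0 = does not (true value 30%)
population = [1] * 30_000 + [0] * 70_000

pcts = simulate_sample_percentages(population, sample_size=300, n_samples=150)
print("lowest and highest single-sample percentage:", min(pcts), max(pcts))
print("average of all sample percentages:", statistics.mean(pcts))
```

Individual samples miss the true 30% by a few points in either direction, but the average across all 150 samples lands much closer to it.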

Example. A company of 680 employees wants to know whether to bother with instituting a program to deal with employee drug-taking. To find out, they will test a sample of employees on an anonymous basis: if a person tests positive, the company will not know who it is and will not try to find out. The objective is solely to estimate what percentage of the company might be doing drugs. If the percentage is high enough, the company will consider instituting a mandatory drug testing program. Given this objective, a simple random sampling design is perfect: the results will generalize to the whole company.
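
A rough sketch of how this example might be carried out follows. The notes do not say how many employees are tested or what the result is, so the sample of 100 and the 12 positive tests are made-up numbers. The interval uses the usual normal approximation for a proportion with a finite population correction, which is an assumption of this sketch rather than something the example specifies.

```python
import math
import random

def draw_simple_random_sample(frame, n, seed=None):
    """Pick n elements from the sampling frame, each equally likely."""
    rng = random.Random(seed)
    return rng.sample(frame, n)

def proportion_interval(positives, n, population_size, z=1.96):
    """Approximate 95% confidence interval for a proportion,
    with a finite population correction for a small population."""
    p = positives / n
    fpc = math.sqrt((population_size - n) / (population_size - 1))
    margin = z * math.sqrt(p * (1 - p) / n) * fpc
    return p, max(0.0, p - margin), min(1.0, p + margin)

# Sampling frame: the 680 employees, numbered only for the purpose of drawing
employee_ids = list(range(1, 681))
tested = draw_simple_random_sample(employee_ids, 100, seed=42)  # hypothetical n

# Hypothetical outcome: 12 of the 100 tests come back positive
p, low, high = proportion_interval(positives=12, n=100, population_size=680)
print(f"estimate {p:.1%}, roughly {low:.1%} to {high:.1%}")
```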

Stratified Sampling

Better than simple random sampling in terms of efficiency, but sometimes not possible (you have to know which stratum each element belongs to before you sample)
Procedure is this: Divide the population into strata (mutually exclusive classes), such as men and women. Then randomly sample within strata.
Suppose a population is 51% male and 49% female. To get a sample of 100 people, we randomly choose 51 males (from the population of all males) and, separately, choose 49 females. Our sample is then guaranteed to have exactly the correct proportion of sexes.
This avoids the problem with simple random sampling that, just by chance, the sample proportions could come out 50-50, 48-52, etc.
Especially important when one group is so small (say, 3% of the population) that a random sample might miss them entirely.
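
The procedure above can be sketched as follows, using the 51% male and 49% female example. The population of 10,000 people and the function name are made up for illustration, and the sketch assumes proportional allocation (each stratum gets its population share of the sample).

```python
import random

def stratified_sample(strata, total_n, seed=None):
    """strata maps a stratum name to the list of its members.
    Each stratum receives a share of total_n proportional to its size,
    then members are drawn at random within the stratum."""
    rng = random.Random(seed)
    population_size = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        # Rounding can make the shares add up to slightly more or less
        # than total_n when the proportions are not whole numbers.
        n_stratum = round(total_n * len(members) / population_size)
        sample[name] = rng.sample(members, n_stratum)
    return sample

# Hypothetical population of 10,000: 51% male, 49% female
strata = {
    "male": [f"M{i}" for i in range(5100)],
    "female": [f"F{i}" for i in range(4900)],
}

sample = stratified_sample(strata, total_n=100, seed=7)
print({name: len(chosen) for name, chosen in sample.items()})
```

For a very small group, a researcher might deliberately give the stratum more than its proportional share and reweight afterwards; that variation is not shown here.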

Example. The VP for Human Resources of a large manufacturing company is considering creating a stress-management program for employees. To get an idea of what kinds of needs the program would have to fill, she will interview a sample of 50 employees first. If she does a simple random sample, it's possible that her sample will not include any representatives of some of the smaller groups, just by chance. Since she knows that different kinds of jobs within the company produce different kinds of stress, she wants to get separate samples from the workmen (who handle dangerous chemicals), the foremen (who balance the interests of the workmen with management), and the managers (who are responsible to shareholders). So she uses a stratified random sample.

Cluster Sampling

Used when (a) sampling frame not available or too expensive, and (b) cost of reaching an individual element is too high
E.g., there is no list of automobile mechanics in the US. Even if I could construct it, it would cost too much money to reach randomly selected mechanics across the entire US: would have to have unbelievable travel budget
In cluster sampling, first define large clusters of people. These clusters should have a lot of heterogeneity within, but be fairly similar to other clusters. For example, cities make good clusters.
Then sample among the clusters. Then once you have chosen the clusters, randomly sample within the clusters.
Clusters might be cities. Once you've chosen the cities, might be able to get a reasonably accurate list of all the mechanics in each of those cities. Is also much less expensive to fly to just 10 cities instead of 2000 cities.
Cluster sampling is less expensive than other methods, but less accurate: each stage introduces its own sampling error.
Suppose you want to sample college students. You start by sampling 300 colleges. Then choose 10 students from each college. Problem is, if the colleges are of different sizes, a student at a big college has a smaller probability of being chosen than a student at a small college. So you need to choose a fixed proportion of students from each college, not a fixed number. Or don't choose colleges with equal probability (let the big schools be more likely to be in the sample). This is called PPS, Probability Proportionate to Size sampling
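
The arithmetic behind the college example can be worked through in a short sketch. A student's overall chance of selection is the chance that the college is chosen times the chance of being picked within it. The ten college sizes, the choice of 3 colleges, and the 1% fraction below are invented for illustration, and the function name is hypothetical.

```python
def student_inclusion_probabilities(sizes, n_colleges, per_college=10, fraction=0.01):
    """For each college size, return a student's overall selection
    probability under three two-stage designs."""
    total_students = sum(sizes)
    n_total = len(sizes)
    if n_colleges * max(sizes) > total_students:
        raise ValueError("a college would get a selection probability above 1")
    rows = []
    for size in sizes:
        # Colleges equally likely, fixed number of students per college
        equal_fixed = (n_colleges / n_total) * (per_college / size)
        # Colleges equally likely, fixed proportion of students per college
        equal_fraction = (n_colleges / n_total) * fraction
        # Colleges chosen with probability proportional to size,
        # fixed number of students per college
        size_weighted_fixed = (n_colleges * size / total_students) * (per_college / size)
        rows.append((size, equal_fixed, equal_fraction, size_weighted_fixed))
    return rows

# Hypothetical college enrollments (40,000 students in total)
sizes = [500, 1000, 1500, 2000, 2500, 3000, 4000, 5000, 8000, 12500]

rows = student_inclusion_probabilities(sizes, n_colleges=3)
for size, equal_fixed, equal_fraction, size_weighted_fixed in rows:
    print(size, equal_fixed, equal_fraction, size_weighted_fixed)
```

In the first design the probability shrinks as the college gets bigger. In the other two designs the college size cancels out, so every student has the same chance of ending up in the sample.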

Example. Once a quarter, a large retail chain sends auditors to randomly chosen stores to check that proper procedures are being carried out. They look at the physical layout, the interactions between staff and customers, backroom procedures, and so on. A simple random sample could have an auditor visiting a California store one day, a New York store the next, then another California store, and so on. Using cluster sampling, the auditor might first select a random sample of states, then visit a random sample of stores within each state, thus reducing travel time.

Sample Size

As a rule of thumb, the bigger the better, up to about 2500. Beyond 2500, it doesn't really matter (accuracy increases very slowly after this point)
The smaller the population, the bigger the sampling ratio that is needed.
For small populations (under 1000), you need a sampling ratio of about 30% (e.g., 300 elements for a population of 1000) to be really accurate.
For populations of about 10,000, you need a sampling ratio of about 10%
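
One way to see why accuracy improves so slowly for large samples is to compute the approximate 95% margin of error for a percentage at several sample sizes. The sketch below uses the standard normal approximation with p = 0.5 (the worst case) and an optional finite population correction. These formulas are an assumption of the sketch; the rules of thumb above are not derived from them in the notes.

```python
import math

def margin_of_error(n, population_size=None, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion from a simple
    random sample of size n. If population_size is given, apply the
    finite population correction."""
    se = math.sqrt(p * (1 - p) / n)
    if population_size is not None:
        se *= math.sqrt((population_size - n) / (population_size - 1))
    return z * se

# Very large population: how the margin shrinks as the sample grows
for n in (100, 300, 1000, 2500, 10000):
    print(n, round(100 * margin_of_error(n), 2), "percentage points")

# Small population: the same 300 elements go further when N is only 1000
print(round(100 * margin_of_error(300, population_size=1000), 2))
```

The margin is proportional to one over the square root of n, so quadrupling the sample from 2500 to 10,000 only cuts the margin in half.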
