Introduction to Sampling
Why Sample?
- Pool of possible cases is too large (e.g., 280 million Americans) -- would cost too much
and take too long
- Don't want to use up the cases: e.g., when testing light bulbs to see how long they
last, you take a bulb and leave it on until it burns out. You can't test all the bulbs
this way, because the manufacturer's whole objective is to sell the bulbs, not burn them out.
- It's not necessary to survey all cases: for most purposes, taking a sample yields
estimates that are accurate enough.
- The trade-off is that sampling does introduce some error. You didn't interview
everybody, so certain opinions or combinations of opinions won't be represented in your
data. When the population is very diverse, your sample can't include all the possible
combinations of attributes that are found in the population, such as blacks and whites, men
and women, cardiac patients and non-patients, black women, white men, white women with heart
trouble who like Oprah and don't like Ally McBeal, etc.
Populations, Sampling Frames, and Elements
- Population is the universe of cases. It is the group that you ultimately want
to say something about. For example, if you want to report 'what Americans think about
Clinton', then the population is all Americans.
- Elements are the individual cases in the population (usually, persons)
- Sampling ratio is size of sample divided by size of population. Contrary to
popular belief, a large sampling ratio is not crucial.
- Sampling frame is a specific list of names from which sample elements will be
chosen. The Literary Digest poll in 1936 mailed ballots to about 10 million people, drawn
largely from lists of automobile and telephone owners. It predicted that Alf Landon would
beat Franklin Roosevelt by a wide margin, but instead Roosevelt won by a landslide. The
reason was that the sampling frame did not match the population: in 1936, automobile and
telephone owners were disproportionately well-off, and they were the ones who favored Landon.
- Replacement. Sampling with replacement means that after you draw a name out of
the hat and record it, you put the name back and it can be chosen again. Sampling without
replacement means that once you draw the name out, it is not available to be chosen again.
- Bias. Systematic errors produced by your sampling procedure. For example, you
sample people and ask them whether they watch Ally McBeal, but the percentage always
comes out too high (maybe because you are interviewing your friends and your whole group
really likes Ally McBeal).
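The difference between sampling with and without replacement can be seen in a short Python sketch. The ten-name frame and the fixed seed are made up for illustration; only the standard library's random module is used.

```python
import random

# Hypothetical sampling frame: ten names in a hat.
frame = [f"person_{i}" for i in range(1, 11)]

rng = random.Random(42)  # fixed seed so the run is repeatable

# With replacement: each drawn name goes back in the hat, so repeats are possible.
with_replacement = rng.choices(frame, k=5)

# Without replacement: a drawn name is not available again, so all are distinct.
without_replacement = rng.sample(frame, k=5)

print("with replacement:   ", with_replacement)
print("without replacement:", without_replacement)
```

The second list can never contain the same name twice; the first one can.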
Non-Probability Sampling
Haphazard/Convenience
- Whoever happens to walk by your office; who's on the street when the camera crews come
out
- If you have a choice, don't use this method. Often produces really wrong answers,
because certain attributes tend to cluster with certain geographic and temporal variables.
For example, at 8am in NYC, most of the people on the street are workers heading for their
jobs. At 10am, there are many more people who don't work, and the proportion of women is
much higher. At midnight, there are young people and muggers.
Quota
- Haphazard sampling within categories (e.g., first 5 males to come by)
- Is an improvement, but still has problems. How do you know which categories are key? How
many do you get of each category?
Purposive/Judgement
- Expert judgement picks useful cases for study
- Good for exploratory, qualitative work, and for pre-testing a questionnaire.
Snowball
- Recruiting people based on recommendation of people you have just interviewed
- Useful for studying invisible/illegal populations, such as drug addicts
Probability Sampling
Probability sampling is any sampling scheme in which the probability of choosing each
individual is the same (or at least known, so it can be adjusted for mathematically). Such
schemes are also called random sampling. They require more work, but are much more accurate.
They also allow the researcher to calculate the amount of error she can expect, which
non-probability methods cannot do.
Simple Random
- Develop a sampling frame, then randomly select elements (place all names on cards, then
randomly draw cards from a hat; in Excel, use the RAND function to attach a random number
to each row, then sort and take the N largest)
- Typically use sampling without replacement, but with replacement can be done (and is
easier mathematically)
- Any one sample is likely to yield statistics (such as the average income or the
percentage of respondents that watch Ally McBeal) that are different from the population
parameters
- The average statistic from many random samples should come very close to the population
parameter. In other words, if you took 150 different samples of Americans, each of 300
people, and calculated the percentage that like Ally McBeal in each of the samples, then
averaged all those percentages together, the result should be very close to the "real"
percentage of all Americans that like Ally McBeal
- It is the Law of Large Numbers that guarantees that as the number of random samples
increases, the average of those samples converges on the population parameter. The Central
Limit Theorem adds that the sample statistics are approximately normally distributed
around that parameter
- Because of these mathematical guarantees, we can estimate how far off a sample might be
from the population, giving rise to confidence intervals
- Random samples are unbiased and, on average, representative of the population.
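The claim about averaging many samples can be checked with a small simulation. This is only a sketch: the population of 100,000 people and the 30% "true" share are invented for illustration, while the 150 samples of 300 people follow the numbers used above.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population of 100,000 people: 1 = likes the show, 0 = does not.
# The true share is set to 30% purely for illustration.
population = [1] * 30_000 + [0] * 70_000

def sample_percentage(n):
    """Percentage of 'likes' in one simple random sample of size n."""
    sample = rng.sample(population, n)  # drawn without replacement
    return 100 * sum(sample) / n

# 150 samples of 300 people each.
percentages = [sample_percentage(300) for _ in range(150)]

print("lowest single sample: ", min(percentages))
print("highest single sample:", max(percentages))
print("mean of all samples:  ", statistics.mean(percentages))
```

Individual samples scatter several points above and below 30, but their mean lands close to it.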
Example. A company of 680 employees wants to know whether to bother
with instituting a program to deal with employee drug-taking. To find out, they will test
a sample of employees on an anonymous basis: if a person tests positive, the company will
not know who it is and will not try to find out. The objective is solely to estimate what
percentage of the company might be doing drugs. If the percentage is high enough, the
company will consider instituting a mandatory drug testing program. Given this objective,
a simple random sampling design is perfect: the results will generalize to the whole
company.
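A rough sketch of how the company's estimate might be computed follows. The 680 employees come from the example; the sample size of 100, the employee ID list, and the count of 12 positive tests are all assumptions made up for illustration. The interval uses the usual normal approximation with a finite population correction.

```python
import math
import random

rng = random.Random(1)  # fixed seed so the run is repeatable

N = 680                                # employees, from the example
n = 100                                # assumed sample size (not given in the text)
employee_ids = list(range(1, N + 1))   # hypothetical sampling frame

# Simple random sample without replacement: every employee is equally likely.
tested = rng.sample(employee_ids, n)

positives = 12                         # made-up count of positive tests
p = positives / n

# Approximate 95% interval. The finite population correction matters here
# because the sample is a sizeable fraction of the whole company.
fpc = math.sqrt((N - n) / (N - 1))
margin = 1.96 * math.sqrt(p * (1 - p) / n) * fpc

print(f"estimate: {p:.1%}, 95% interval: {p - margin:.1%} to {p + margin:.1%}")
```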
Stratified Sampling
- More efficient than simple random sampling, but sometimes not possible (you must know
which stratum each element belongs to before sampling)
- Procedure is this: Divide the population into strata (mutually exclusive classes), such
as men and women. Then randomly sample within strata.
- Suppose a population is 51% male and 49% female. To get a sample of 100 people, we
randomly choose 51 males (from the population of all males) and, separately, choose 49
females. Our sample is then guaranteed to have exactly the correct proportion of sexes.
- This avoids the problem with simple random sampling that the sample proportions could
come out 50-50, 48-52, etc.
- Especially important when one group is so small (say, 3% of the population) that a
random sample might miss them entirely.
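The 51/49 procedure above can be sketched in Python. The population of 10,000 labeled records and the helper name stratified_sample are hypothetical; the sketch allocates sample slots to each stratum in proportion to its size and then draws a simple random sample within each stratum.

```python
import random

rng = random.Random(7)  # fixed seed so the run is repeatable

# Hypothetical population of 10,000 people: 51% male, 49% female.
population = [("M", i) for i in range(5100)] + [("F", i) for i in range(4900)]

def stratified_sample(people, total_n):
    """Proportionate stratified sample: simple random sampling within strata."""
    strata = {}
    for person in people:
        strata.setdefault(person[0], []).append(person)
    sample = []
    for members in strata.values():
        # Allocate slots in proportion to the stratum's share of the population.
        k = round(total_n * len(members) / len(people))
        sample.extend(rng.sample(members, k))
    return sample

sample = stratified_sample(population, 100)
males = sum(1 for sex, _ in sample if sex == "M")
print("stratified:   ", males, "males,", len(sample) - males, "females")

# For comparison, one simple random sample of the same size.
srs = rng.sample(population, 100)
srs_males = sum(1 for sex, _ in srs if sex == "M")
print("simple random:", srs_males, "males,", 100 - srs_males, "females")
```

The stratified sample always splits 51 to 49, while the simple random sample varies from run to run. With shares that do not divide evenly, the rounding step can leave the total one or two off the target, which a fuller implementation would need to handle.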
Example. The VP for Human Resources of a large manufacturing company is
considering creating a stress-management program for employees. To get an idea of what
kinds of needs the program would have to fill, she will interview a sample of 50 employees
first. If she does a simple random sample, it's possible that her sample will not include any
representatives of some of the smaller departments, just by chance. Since she knows that
different kinds of jobs within the company produce different kinds of stress, she wants to
get separate samples from the workmen (who handle dangerous chemicals), the foremen (who
balance the interests of the workmen with management), and the managers (who are
responsible to shareholders). So she uses a stratified random sample.
See also the Wikipedia entry on stratified sampling.
Cluster Sampling
- Used when (a) sampling frame not available or too expensive, and (b) cost of reaching an
individual element is too high
- E.g., there is no list of automobile mechanics in the US. Even if I could construct it,
it would cost too much money to reach randomly selected mechanics across the entire US:
would have to have unbelievable travel budget
- In cluster sampling, first define large clusters of people. These clusters should have a
lot of heterogeneity within, but be fairly similar to other clusters. For example, cities
make good clusters.
- Then sample among the clusters. Then once you have chosen the clusters, randomly sample
within the clusters.
- Clusters might be cities. Once you've chosen the cities, might be able to get a
reasonably accurate list of all the mechanics in each of those cities. Is also much less
expensive to fly to just 10 cities instead of 2000 cities.
- Cluster sampling is less expensive than other methods, but less accurate: each stage
introduces its own sampling error.
- Suppose you want to sample college students. You start by sampling 300 colleges with
equal probability. Then choose 10 students from each college. Problem is, if the colleges
are of different sizes, the probability of a person being chosen if they are from a big
college is smaller than for a small college. So you need to choose a proportion of students,
not a fixed number. Or keep the fixed number but don't choose colleges with equal
probability (let the big schools be more likely to be in the sample). This second approach
is called PPS, Probability Proportionate to Size sampling.
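The college example can be worked through numerically. In the sketch below the five colleges and their enrollments are made up, and the inclusion probabilities are idealized (each college's chance of selection is treated as exactly equal or exactly proportional to size). It compares a student's overall chance of selection under three designs.

```python
# Hypothetical colleges and enrollments (made-up numbers).
enrollment = {
    "College A": 1_000,
    "College B": 2_000,
    "College C": 4_000,
    "College D": 5_000,
    "College E": 8_000,
}

m = 2       # colleges selected
k = 10      # students per selected college (fixed-number designs)
f = 0.01    # fraction of students taken (fixed-proportion design)
M = len(enrollment)
total = sum(enrollment.values())

rows = {}
for name, size in enrollment.items():
    # Design 1: colleges equally likely, fixed number of students from each.
    p_fixed_number = (m / M) * (k / size)
    # Design 2: colleges equally likely, fixed proportion of students from each.
    p_fixed_share = (m / M) * f
    # Design 3: college's chance proportional to its size (PPS), then a fixed
    # number of students. The college size cancels out.
    p_pps = (m * size / total) * (k / size)
    rows[name] = (p_fixed_number, p_fixed_share, p_pps)
    print(f"{name}: fixed number={p_fixed_number:.4f}  "
          f"fixed share={p_fixed_share:.4f}  PPS={p_pps:.4f}")
```

Under the first design a student at the smallest college is eight times as likely to be chosen as one at the largest. The other two designs give every student the same chance.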
Example. Once a quarter, a large retail chain sends auditors to
randomly chosen stores to check that proper procedures are being carried out. They look at
the physical layout, the interactions between staff and customers, backroom procedures,
and so on. A simple random sample could have an auditor visiting a California store one
day, a New York store the next, then another California store, and so on. Using cluster
sampling, the auditor might first select a random sample of states, then visit a random
sample of stores within each state, thus reducing travel time.
Sample Size
- The bigger the better, up to about 2500. Beyond 2500, it doesn't really matter (accuracy
increases very slowly after this point; a simple random sample of 2500 already has a
margin of error of roughly 2 percentage points)
- The smaller the population, the bigger the sampling ratio that is needed.
- As a rule of thumb, for populations under 1000, you need a sampling ratio of about 30%
(about 300 elements for a population of 1000) to be really accurate.
- For populations of about 10,000, you need a sampling ratio of about 10%
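The diminishing returns can be illustrated with the standard margin-of-error approximation for a proportion. This sketch assumes a simple random sample, a 95% confidence level, and the worst case of a 50/50 split; the function name margin_of_error is hypothetical, and the rules of thumb above are the notes' own, not something the formula derives.

```python
import math

def margin_of_error(n, N=None, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion from a sample of size n.

    N is the population size; when given, the finite population correction
    is applied. p = 0.5 is the worst case.
    """
    se = math.sqrt(p * (1 - p) / n)
    if N is not None:
        se *= math.sqrt((N - n) / (N - 1))
    return z * se

# Very large population: each increase in n buys less and less accuracy.
for n in (100, 500, 1000, 2500, 5000, 10000):
    print(f"n = {n:>5}: +/- {100 * margin_of_error(n):.1f} points")

# The two rules of thumb for smaller populations.
print(f"300 of 1,000:    +/- {100 * margin_of_error(300, N=1000):.1f} points")
print(f"1,000 of 10,000: +/- {100 * margin_of_error(1000, N=10000):.1f} points")
```

Going from 100 to 2500 cases cuts the margin of error by a factor of five, while quadrupling the sample again to 10,000 only halves it.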