The idea of sampling
I want to measure what is happening at my library. I want to know, say, about:
- the circulation of stock: which items are in heavy and which are in low demand
- the number of people who visit the library
- the ways they actually use it: study, recreation, IT, meeting friends, etc.
For the sake of the argument, I assume that my library works on a manual basis. It lacks an automated catalog and has no electronic counter at the entrance. But the idea of sampling is the same in the manual and the automated case.
It is clear that I cannot document everything that goes on, all the time. Monitoring and writing down every single event could easily double the total workload at the library.
Data collection is work - hard, disciplined work. It should be kept to a minimum. Statistical sampling is a technique for collecting data that minimizes the work while still providing the answers we need. When we sample, we take a selection from a larger total. Statistical samples are drawn according to strict procedures. If the procedures are followed, we can treat the sample as if it were the total.
The total is often called the universe or the population.
Perfect accuracy is seldom needed. In a library it hardly matters whether we had, say, 13,380 or 13,410 visitors last year. But if visitor numbers increase substantially, we want to register this fact. As a rule of thumb, I would say:
- do not bother about a difference of 1-2 percent
- a difference of 3-4 percent is small, but may be meaningful
- a difference of 5-15 percent is real and interesting
- anything more is very interesting
We can usually get information that is good enough for practical decision-making from a sample of a few hundred items. The greater the sample, the greater the accuracy.
A very basic, and also very surprising, statistical rule is: The size of the original population hardly matters. As long as the sample is only a small part of the population, accuracy depends almost entirely on the size of the sample.
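The rule can be checked with a few lines of Python. This is only a sketch: it uses the standard textbook formula for the margin of error of a percentage from a simple random sample, with the usual finite population correction. The function name margin_of_error, the 95 percent level (z = 1.96) and the worst-case proportion of 0.5 are my own assumptions, not part of the library example.

```python
import math

def margin_of_error(sample_size, population_size=None, proportion=0.5, z=1.96):
    """Approximate margin of error for a proportion (simple random sample).

    proportion=0.5 is the worst case; z=1.96 corresponds to about 95 percent.
    Both defaults are assumptions made for this illustration.
    """
    standard_error = math.sqrt(proportion * (1 - proportion) / sample_size)
    if population_size is not None:
        # Finite population correction: only matters when the sample
        # is a large share of the population.
        standard_error *= math.sqrt(
            (population_size - sample_size) / (population_size - 1)
        )
    return z * standard_error

# Same sample size, very different populations.
for population in (10_000, 1_000_000):
    print(population, round(100 * margin_of_error(200, population), 1))

# Same population, larger sample.
print(round(100 * margin_of_error(800, 10_000), 1))
```

A sample of 200 gives almost the same margin whether the population is ten thousand or one million. To halve the margin, the sample has to be roughly four times as large.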
Let me apply the idea of sampling to the book collection. My library has, say, ten thousand books. I want to know how up-to-date my collection is, by looking at the year of publication.
Checking ten thousand cards and writing down ten thousand numbers is heavy work. I turn to statistics and take a sample of two hundred cards instead. This could, for obvious reasons, be called a two percent sample.
The big idea in statistical sampling lies in the way you go about selecting the sample from the total. You should not pull 200 consecutive cards from the nearest drawer. Nor should you rummage around, taking one here and one there as the mood takes you. The sample should
- come from the whole population
- and not depend on a series of deliberate choices
When you sample, you should act like a robot with no personal interest in the outcome.
There are many ways of achieving this. The simplest is probably to look at every fiftieth card in the catalogue and write down the year of publication. The distribution of these two hundred numbers will provide a good approximation to the true distribution (based on all ten thousand publication dates).
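In an automated library the same "every fiftieth card" routine could be sketched as follows. The catalogue here is invented test data (random publication years), and the function name systematic_sample is hypothetical. The random starting point is a common refinement that I have added; it is not required by the procedure described above.

```python
import random
from collections import Counter

def systematic_sample(items, step, start=None):
    """Take every step-th item, beginning at position start.

    If start is not given, a random starting point within the first
    step items is used (an added refinement, see the text above).
    """
    if start is None:
        start = random.randrange(step)
    return items[start::step]

# Invented catalogue: ten thousand publication years (illustration only).
random.seed(1)
catalogue = [random.randint(1950, 2020) for _ in range(10_000)]

sample = systematic_sample(catalogue, step=50)

# Group the sampled years by decade to see how up-to-date the stock is.
by_decade = Counter((year // 10) * 10 for year in sample)
for decade in sorted(by_decade):
    print(decade, by_decade[decade])
```

The sample always contains exactly two hundred years, whichever starting card is chosen.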
My library is open - say - six days a week. We open at 9 am, take a break from noon till 2 pm, and open again from 2 till 6 pm. On Saturday, there is no afternoon session. The library is also closed for a total of four weeks during holidays.
This means that the library is open 6 * (52 - 4) = 288 days a year.
- It is open for three hours on 48 Saturdays - which gives a total of 144 "Saturday hours".
- It is open for seven hours on 240 weekdays - giving a total of 1680 "weekday hours"
- The total number of hours is 1680 + 144 = 1824 hours per year.
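Arithmetic like this is easy to get wrong by hand, so it may be worth letting a few lines of Python do it. This is only a sketch of the calculation above; the variable names are my own.

```python
OPEN_DAYS_PER_WEEK = 6
WEEKS_OPEN = 52 - 4        # closed four weeks during holidays
SATURDAY_HOURS = 3         # 9 am to noon
WEEKDAY_HOURS = 3 + 4      # 9 am to noon, then 2 pm to 6 pm

open_days = OPEN_DAYS_PER_WEEK * WEEKS_OPEN
saturdays = WEEKS_OPEN     # one Saturday per open week
weekdays = open_days - saturdays

saturday_total = saturdays * SATURDAY_HOURS
weekday_total = weekdays * WEEKDAY_HOURS
total_hours = saturday_total + weekday_total

print("open days:", open_days)
print("Saturday hours:", saturday_total)
print("weekday hours:", weekday_total)
print("total hours:", total_hours)
```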
I want to know the number of visitors we have in a year. I know, from experience, that library use tends to vary systematically during the day, during the week and during the year.
If I want to know the true number of visitors, I should not
- take my "best hour" - and multiply by 1824
- take my best day - and multiply by 288
- take my best week - and multiply by 48
These estimates will be far too high. I have to choose my sample from the "whole population" - and to do it in a proper "robotic" way.
There are, as before, many ways of achieving this. The easiest is probably to select a small number of "counting days" throughout the year. On these days all visitors are counted.
You may, for instance, start with the first Monday in January - and continue with the first Tuesday in February, the first Wednesday in March and so on.
This approach will give you the visitor numbers for twelve separate days, or two full six-day weeks, covering the whole year. Since the library is open 48 weeks a year, and 48 / 2 = 24, you find the total number of visitors by multiplying the observed number by 24.
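The counting days and the final estimate could be worked out along these lines. It is a sketch under my own assumptions: the function names are hypothetical, the daily counts are invented, and the code does not check whether a counting day happens to fall in a holiday week, when another day would have to be chosen.

```python
from datetime import date, timedelta

def first_weekday_of_month(year, month, weekday):
    """First date in the month that falls on weekday (Monday = 0)."""
    first = date(year, month, 1)
    offset = (weekday - first.weekday()) % 7
    return first + timedelta(days=offset)

def counting_days(year):
    """First Monday in January, first Tuesday in February, and so on.

    The weekday cycles Monday..Saturday, so each open weekday
    turns up twice in twelve months.
    """
    return [
        first_weekday_of_month(year, month, (month - 1) % 6)
        for month in range(1, 13)
    ]

def estimate_annual_visitors(daily_counts, weeks_open=48, open_days_per_week=6):
    """Scale the counted days up to the whole year."""
    weeks_observed = len(daily_counts) / open_days_per_week
    return sum(daily_counts) * weeks_open / weeks_observed

for day in counting_days(2024):
    print(day.strftime("%A %d %B"))

# Invented visitor counts for the twelve counting days.
counts = [42, 55, 38, 61, 47, 30, 44, 58, 40, 52, 49, 28]
print(round(estimate_annual_visitors(counts)))
```

With twelve counting days the scaling factor works out to 48 / 2 = 24, as in the text. If you count on more days, the factor adjusts itself.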
Concepts are important. If we want to study users, we must first decide the limits of the population. For instance, does user mean:
- The people that visit the library on a regular basis?
- The people that have visited the library at least once during the last year?
- The people that have visited the library at least once during the last five years?
- The people that are registered as users?
- The people that are registered as users and have borrowed materials during the last year?
- The people that are registered as users and have borrowed materials during the last five years?
And so on.
In this example I define user as a person that is registered as a user. My population consists of a set of registration cards.
I want to check the social impact of the library by looking at the geographical distribution of the users. Where do these people live? Are there places we have "missed" - and where we might do some extra marketing? Let us say we have 6,000 registered users. I choose a sample of 200. The selection procedure follows the first example. Since 6,000 / 200 = 30, I may simply take every thirtieth card, write down the addresses and plot them on a local map.
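If the registration cards were available as a file, the tally could be sketched like this. Everything in the example is invented for illustration: the district names, the way users are spread over them, and the function name every_nth. Each sampled card stands for thirty registered users, so the counts are multiplied by thirty to give a rough estimate per district.

```python
import random
from collections import Counter

def every_nth(cards, step):
    """Take card number step, 2 * step, 3 * step, ... from the file."""
    return cards[step - 1::step]

# Invented register: 6,000 cards spread unevenly over four made-up districts.
rng = random.Random(0)
districts = ["North", "South", "East", "West"]
cards = [
    {"card_no": n, "district": rng.choices(districts, weights=[4, 3, 2, 1])[0]}
    for n in range(1, 6001)
]

step = len(cards) // 200          # 6,000 / 200 = 30
sample = every_nth(cards, step)

tally = Counter(card["district"] for card in sample)

# Each sampled card represents `step` registered users.
estimated_users = {district: count * step for district, count in tally.items()}

for district in districts:
    print(district, tally[district], estimated_users.get(district, 0))
```

A district with few or no cards in the sample is a candidate for the extra marketing mentioned above, though with only two hundred cards the small numbers should be read with care.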