Syllabus
Module 1:
Survey in Society:
The need for statistical information seems endless in modern society. In particular, data are regularly collected to satisfy the need for information about specified sets of elements, called finite populations.
For example, our objective might be to obtain information about the households in a city and their spending patterns, the business enterprises in an industry and their profits, the individuals in a country and their participation in the work force, or the farms in a region and their production of cereals.
One of the most important modes of data collection for satisfying such needs is a sample survey, that is, a partial investigation of the finite population. A sample survey costs less than a complete enumeration, is usually less time consuming, and may even be more accurate than the complete enumeration.
Over the last few decades, survey sampling has evolved into an extensive body of theory, methods, and operations used daily all over the world. As Rossi, Wright, and Anderson (1983) point out, it is appropriate today to speak of a worldwide survey industry with different sectors:
A government sector
An academic sector
A private sector
A mass media sector
A residual sector, consisting of ad hoc and in-house surveys
In many countries, a central statistical office is mandated by law to provide statistical information about the state of the nation, and surveys are an important part of this activity.
For example, in Canada, the 1971 Statistics Act mandates Statistics Canada to "collect, compile, analyze, abstract, and publish statistical information relating to the commercial, industrial, financial, social, economic, and general activities and condition of the people."
Thus, national statistical offices regularly produce statistics on important national characteristics and activities, including
Demography (age and sex distribution, fertility and mortality)
Agriculture (crop distribution)
Labor force (employment)
Health and living conditions
Industry and trade
In the academic sector, survey sampling is extensively used, especially in
Sociology and public opinion research
Economics
Political Science
Psychology and Social Psychology
Many academically affiliated survey institutes are heavily engaged in survey sampling activity. In the private and mass media sectors, we find
Television audience surveys
Readership surveys
Polls and marketing surveys
The content of ad hoc and in-house surveys varies greatly. Examples include
Payroll surveys
Surveys for auditing purposes
News media provide the public with the results of new or recurring surveys. It is widely accepted that a sample of fairly modest size is sufficient to give an accurate picture of a much larger universe; for example, a well-selected sample of a few thousand individuals can portray with great accuracy a total population of millions. However, data gathering is costly. Therefore, it makes a great difference if a major national survey uses 20,000 observations when 15,000 or even 10,000 observations might suffice. For reasons of cost effectiveness, it is imperative to use the best methods available for sampling design and estimation, to profit from auxiliary information, and so on.
Here statistical knowledge and insight become highly important. The expert survey statistician must have a good grasp of statistical concepts in general, as well as the particular reasoning used in survey sampling. A good measure of practical experience is also necessary.
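The claim above that a sample of a few thousand can portray a population of millions can be checked with a rough calculation. The sketch below assumes simple random sampling without replacement, a true proportion of 0.5 (the worst case for a proportion), and a normal-approximation 95% interval. The function name margin_of_error and the population size of 10 million are illustrative choices, not taken from the text.

```python
import math

def margin_of_error(n, N, p=0.5, z=1.96):
    """Approximate 95% margin of error for an estimated proportion.

    Assumes simple random sampling without replacement of n elements
    from a finite population of N elements.
    """
    fpc = (N - n) / (N - 1)                 # finite population correction
    se = math.sqrt(p * (1 - p) / n * fpc)   # standard error of the sample proportion
    return z * se

N = 10_000_000  # hypothetical population size
for n in (1_000, 2_000, 5_000, 10_000, 20_000):
    # Print the margin of error in percentage points for each sample size.
    print(n, round(100 * margin_of_error(n, N), 2))
```

Under these assumptions the margin of error shrinks roughly with the square root of the sample size, so each further gain in precision costs more observations than the last. This is one reason the choice of sample size matters so much for cost.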
Skeleton Outline of a Survey:
To start, we need a skeleton outline of a survey and some basic terminology. The terms "survey" and "sample survey" are used to denote statistical investigations with the following methodologic features:
(1) A survey concerns a finite set of elements called a finite population. An enumeration rule exists that unequivocally defines the elements belonging to the population. The goal of a survey is to provide information about the finite population in question or about subpopulations of special interest, for example, "men" and "women" as two subpopulations of "all persons". Such subpopulations are called domains of study or just domains.
(2) A value of one or more variables of study is associated with each population element. The goal of a survey is to get information about unknown population characteristics or parameters. Parameters are functions of the study variable values. They are unknown, quantitative measures of interest to the investigator, for example, total revenue, mean revenue, total yield, number of unemployed, for the entire population or for specified domains.
(3) In most surveys, access to and observation of individual population elements is established through a sampling frame, a device that associates the elements of the population with the sampling units in the frame.
(4) From the population, a sample (that is, a subset) of elements is selected. This can be done by selecting sampling units in the frame. A sample is a probability sample if realized by a chance mechanism.
(5) The sample elements are observed, that is, for each element in the sample, the variables of study are measured and the values recorded. The measurement conforms to a well-defined measurement plan, specified in terms of measurement instruments, one or more measurement operations, the order between these, and the conditions under which they are carried out.
(6) The recorded variable values are used to calculate (point) estimates of the finite population parameters of interest (totals, means, medians, ratios, regression coefficients, etc.). Estimates of the precision of the estimates are also calculated. The estimates are finally published.
In a sample survey, observation is limited to a subset of the population. The special type of survey where the whole population is observed is called a "census" or a "complete enumeration".
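The notions of finite population, domain, and parameter in the outline above can be made concrete with a small sketch. The six-person population, the variable name income, and the helper functions total, mean, and domain are all hypothetical and serve only to show that a parameter is a function of the study variable values, computed for the whole population or for a domain.

```python
# Hypothetical finite population of six persons; all values are made up.
population = [
    {"id": 1, "sex": "F", "income": 310.0},
    {"id": 2, "sex": "M", "income": 250.0},
    {"id": 3, "sex": "F", "income": 420.0},
    {"id": 4, "sex": "M", "income": 180.0},
    {"id": 5, "sex": "F", "income": 270.0},
    {"id": 6, "sex": "M", "income": 390.0},
]

def total(elements, variable):
    """Population (or domain) total of a study variable."""
    return sum(e[variable] for e in elements)

def mean(elements, variable):
    """Population (or domain) mean of a study variable."""
    return total(elements, variable) / len(elements)

def domain(elements, key, value):
    """Subpopulation of elements whose attribute `key` equals `value`."""
    return [e for e in elements if e[key] == value]

women = domain(population, "sex", "F")
men = domain(population, "sex", "M")

print("total income, all persons:", total(population, "income"))
print("mean income, women:", mean(women, "income"))
print("mean income, men:", mean(men, "income"))
```

In a census with no measurement error, these values would be known exactly. In a sample survey, they are the unknown quantities that the estimates aim at.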
Example 1:
Labor force surveys are conducted in many countries. Such a survey aims at answering questions of the following type:
How many persons are currently in the labor force in the country as a whole and in various regions of the country?
What proportion of these are unemployed?
In this case, some of the key concepts may be as follows:
Population: All persons in the country with certain exceptions (such as infants, people in institutions).
Domains of interest: Age/sex groups of the population, occupational groups in the population, and regions of the country.
Variables: Each person can be described at the time of the survey as
i. belonging to the labor force or not, and
ii. employed or not.
Correspondingly, there is a variable of interest that takes the value "one" for a person in the labor force, "zero" for a person not in the labor force. To measure unemployment, a second variable of interest is defined as taking the value "one" if a person is unemployed, "zero" otherwise.
If the purpose is to estimate unemployment in a given month, and if an interviewed person states that he worked one week during that month but is unemployed on the day of the interview, there must be a clear rule stating whether he is to be recorded as unemployed or not.
Population characteristics of interest:
i. Number of persons in the labor force.
ii. Number of unemployed persons in the labor force.
iii. Proportion of unemployed persons in the labor force.
Sample: A sample of persons is selected from the population in an efficient manner given existing devices for observational access to the persons in the country.
Observations: Each person included in the sample is visited by a trained interviewer who asks questions following a standardized questionnaire and records the answers.
Data processing and estimation: The recorded data are edited, that is, prepared for the estimation phase; rules for handling of nonresponse are observed; estimates of the population characteristics are calculated. Indicators of the uncertainty of the estimates (variance estimates) are calculated and finally results are published.
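The estimation step of this example can be sketched as follows, under a strong simplifying assumption: the sample is a simple random sample without replacement from a population of known size N. Real labor force surveys use more complex designs, so this is an illustration of the idea, not the method of any particular survey. The function name expansion_estimate, the population size, and the ten sample records are hypothetical.

```python
import math

def expansion_estimate(values, N):
    """Estimate a population total and its standard error.

    Assumes `values` come from a simple random sample without
    replacement out of a population of N elements.
    """
    n = len(values)
    ybar = sum(values) / n
    s2 = sum((y - ybar) ** 2 for y in values) / (n - 1)  # sample variance
    total_hat = N * ybar                                 # point estimate of the total
    se = math.sqrt(N ** 2 * (1 - n / N) * s2 / n)        # estimated standard error
    return total_hat, se

N = 1000  # hypothetical population size
# Each record: (in labor force? 1/0, unemployed? 1/0); made-up data.
sample = [(1, 0), (1, 0), (1, 1), (0, 0), (1, 0),
          (0, 0), (1, 0), (1, 1), (1, 0), (0, 0)]

in_labor_force = [a for a, _ in sample]
unemployed = [b for _, b in sample]

lf_hat, lf_se = expansion_estimate(in_labor_force, N)
un_hat, un_se = expansion_estimate(unemployed, N)
rate_hat = un_hat / lf_hat  # estimated proportion unemployed in the labor force

print(lf_hat, lf_se)
print(un_hat, un_se)
print(rate_hat)
```

The unemployment rate is a ratio of two estimated totals, so its precision cannot be read off directly from the two standard errors; the sketch computes only the point estimate of the rate.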
Example 2:
Consider a household survey whose aim is to obtain information about planned household expenditures in the coming year for specified durable goods. Here, some of the basic concepts may be as follows.
Population: All households in the country.
Variables: Planned expenditure amounts for specified goods, such as automobiles, refrigerators, etc.
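Population characteristics of interest: Total of planned household expenditures for the specified goods.
Sample: A sample of households is obtained by initially selecting a sample of geographic areas and then selecting households within the selected areas.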
Observations: Each household in the sample receives a self-administered questionnaire. For a majority of households, the questions are answered and the questionnaire returned. Households not returning the questionnaire are followed up by telephone or visited by a trained interviewer to obtain the desired information.
Data processing and estimation: Data are edited. The calculation of point estimates and precision takes into account the two-stage design of the survey.
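A two-stage design of the kind mentioned here can be sketched as follows. The sketch assumes simple random sampling without replacement at both stages (areas first, then households within selected areas) and that every area contains at least one household; the source does not specify the actual selection scheme. The names two_stage_sample and estimate_total, the area labels, and the expenditure figures are hypothetical.

```python
import random

def two_stage_sample(areas, m, n_per_area, rng):
    """Select m areas, then up to n_per_area households in each selected area.

    `areas` maps an area label to the list of household values in that area.
    Both stages use simple random sampling without replacement.
    """
    selected = rng.sample(sorted(areas), m)
    return {a: rng.sample(areas[a], min(n_per_area, len(areas[a])))
            for a in selected}

def estimate_total(areas, sample):
    """Estimate the population total from a two-stage sample.

    Each sampled household is weighted up to its area, and each
    sampled area is weighted up to the full set of areas.
    Only the area sizes are taken from `areas`, as a frame would provide.
    """
    M, m = len(areas), len(sample)
    weighted_area_totals = sum(
        len(areas[a]) / len(ys) * sum(ys) for a, ys in sample.items()
    )
    return M / m * weighted_area_totals

# Made-up planned expenditures per household, grouped by area.
areas = {
    "north": [100.0, 0.0, 250.0, 50.0],
    "south": [0.0, 300.0, 150.0],
    "east": [75.0, 25.0, 0.0, 200.0, 100.0],
    "west": [500.0, 0.0],
}

rng = random.Random(1)
sample = two_stage_sample(areas, m=2, n_per_area=2, rng=rng)
print(sample)
print(estimate_total(areas, sample))
```

When all areas and all households are selected, the weights equal one and the estimate reduces to the true total, which is a quick sanity check on the weighting.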
Comments on the methodologic features (1) to (6)
The complexity of a survey can vary greatly, depending on the size of the population and the means of accessing that population. To survey the members of a professional society, the hospitals in a region, or the residents in a small municipality may be a relatively simple matter. At the other extreme are complex nationwide surveys, with a population of many millions spread over a large territory; such surveys are typically carried out by government statistical agencies and require extensive administrative and financial resources.
Although a survey involves observations on individual population elements, the purpose of a survey is not to use such data for decision-making about individual elements, but to obtain summary statistics for the population or for specific subgroups.
In the same survey, there are often many variables of study and many domains of interest. The number of characteristics to estimate may be large, in the hundreds or even in the thousands.
Finite population parameters are quantitative measures of various aspects of the population. Prior to a survey they are unknown. Here, we examine the estimation of different types of parameters: the total of a variable of study, the mean of the variable, the median of the variable, the correlation coefficient between two variables, and so on. The exact value of a finite population parameter can be obtained in a special case, namely, if we observe the complete population (i.e., the survey is a census), and there are no measurement errors and no nonresponse. A census does not automatically mean "estimation without error".
Most people are aware of the term "census" in a particular sense, namely, as a fact-finding operation about a country's population, in particular about such sociodemographic characteristics as the age distribution, education levels, special skills, mother tongue, housing conditions, household structure, and migration patterns. In these situations there is often a "census proper", done through a "short form" (a questionnaire with few questions) going out to all individuals, while a "long form" may be administered to a 20% sample with a request for more extensive information.
A sample is any subset of the population. It may or may not be drawn by a probability mechanism. A simple example of a probability sampling scheme is one that gives every sample of a fixed size the same probability of selection. This is "simple random sampling without replacement". In practice, selection schemes are usually more complex. Probability sampling has over the years proved to be a highly accurate instrument.
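The defining property of simple random sampling without replacement can be verified by brute force on a tiny population. The sketch below enumerates every possible sample of a fixed size, assigns each the same probability, and derives the probability that a given element is included. The five element labels and the sample size of 2 are arbitrary illustrative choices.

```python
from fractions import Fraction
from itertools import combinations
from math import comb
import random

population = ["a", "b", "c", "d", "e"]  # hypothetical element labels
n = 2                                   # fixed sample size

# Every subset of size n is a possible sample, each with the same probability.
all_samples = list(combinations(population, n))
p_sample = Fraction(1, comb(len(population), n))

# Inclusion probability of an element: the total probability of all
# samples that contain it. Exact fractions avoid rounding noise.
inclusion = {
    k: sum((p_sample for s in all_samples if k in s), Fraction(0))
    for k in population
}

# Drawing one such sample in practice: random.sample selects without replacement.
rng = random.Random(2024)
drawn = rng.sample(population, n)

print(len(all_samples), p_sample)
print(inclusion)
print(drawn)
```

Every element ends up with the same inclusion probability, n/N, which here is 2/5.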
To correctly measure and record the desired information for all sampled elements may be difficult or impossible. False responses may be obtained. For some elements designated for the survey, measurements may be missing because of, for example, impossibility to contact or refusal to respond. These non-sampling errors may be considerable.
Advances in computer technology have made it possible to produce a great deal of official statistics (for example, in the economic sector) from administrative data files. Several files may be used. For example, elements are matched in two complete population registers, and the information combined. The matched files give a more extensive base for the production of statistics. (For populations of individuals, matching may conflict with privacy considerations.) Information from a sample survey may also be combined with the information from one or more complete administrative registers. The administrative data may then serve as "auxiliary information" to strengthen the survey estimates.
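The matching of two registers described above can be sketched in its simplest form, assuming both registers share a reliable common identifier. In practice identifiers may be missing or inconsistent, and matching is then much harder. The register names, identifiers, and field values below are made up for illustration.

```python
# Two hypothetical registers keyed by the same element identifier.
tax_register = {
    101: {"revenue": 5.2},
    102: {"revenue": 0.8},
    104: {"revenue": 12.0},
}
business_register = {
    101: {"industry": "retail"},
    102: {"industry": "construction"},
    103: {"industry": "transport"},
}

def match_registers(a, b):
    """Combine the records of elements that appear in both registers."""
    common = a.keys() & b.keys()
    return {k: {**a[k], **b[k]} for k in sorted(common)}

matched = match_registers(tax_register, business_register)

# Elements found in only one of the two registers remain unmatched.
unmatched = set(tax_register) ^ set(business_register)

print(matched)
print(sorted(unmatched))
```

The matched records carry variables from both sources, which is what gives the more extensive base for producing statistics.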
Probability Sampling
Probability sampling is an approach to sample selection that satisfies certain conditions, which, for the case of selecting elements directly from the population, are described as follows:
Example
The Central Frame Data Base is a sampling frame compiled by Statistics Canada for use in business surveys. It is a fairly complex frame that consists of several parts and is based on two types of Canadian tax returns: corporate and individual, the latter for the self-employed. The frame has two main components:
an "integrated" component (containing all of the large business establishments) and
a "non-integrated" component, which is further divided into separate parts using information from Revenue Canada Taxation.
Business firms reporting small total operating revenue are considered "out-of-scope" for business surveys. Continuous updating is required to register "births" (the start of new business activity) and changes in classification based on geography, industry, or size.
Direct Element Sampling:
We use the term "direct element sampling" to denote sample selection from a frame that directly identifies the individual elements of the population of interest. That is, the units in the frame are objects of the same kind as those that we wish to measure and observe. A selection of elements can take place directly from the frame. Ideally, the set of elements identified in the frame is equal to the set of elements in the population of interest.
"Frame" here refers to a device with four successive layers. School districts are the first-stage sampling units, schools the second-stage sampling units, and classes the third-stage units. The individual elements (the students) are the sampling units in the fourth and final stage of sampling. In a selection consisting of several stages, each stage has its own type of sampling unit.
A finite population is made up of elements. They are sometimes also called "units of analysis", which underscores that they are entities that are measured and for which values are recorded. For example, if one is interested in estimating the population total of the variable "household income", the element (or unit of analysis) is naturally the household. The frame is an instrument for gaining access (more or less directly) to these elements. One way is to first select city blocks and then observe households in selected blocks.
Target Population and Frame Population
It becomes necessary at this point to distinguish target population from frame population. The target population is the set of elements about which information is wanted and parameter estimates are required. The frame population is the set of all elements that are either listed directly as units in the frame or can be identified through a more complex frame concept, such as a frame for selecting in several stages.
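Because the two populations are defined separately, they need not coincide. A minimal sketch with Python sets shows how the two can be compared; the household labels and the variable names are hypothetical.

```python
# Hypothetical household labels; neither set comes from the text.
target_population = {"h1", "h2", "h3", "h4", "h5"}
frame_population = {"h2", "h3", "h4", "h5", "h6"}

# Target elements that can be reached through the frame.
covered = target_population & frame_population

# Target elements that the frame gives no access to.
missing_from_frame = target_population - frame_population

# Frame elements that do not belong to the target population.
not_in_target = frame_population - target_population

# Share of the target population that the frame covers.
coverage_rate = len(covered) / len(target_population)

print(sorted(covered))
print(sorted(missing_from_frame), sorted(not_in_target))
print(coverage_rate)
```

In the ideal case described earlier for direct element sampling, both difference sets are empty and the coverage rate is 1.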