Introduction: Samples and Levels of Measurement

1.1 Populations and Samples

Statistics studies groups of people, objects, or data measurements and produces summarizing mathematical information on the groups. The groups are usually not all of the possible people, objects, or data measurements. The groups are called samples. The larger collection of people, objects or data measurements is called the population.

Statistics attempts to predict measurements for a population from measurements made on the smaller sample. For example, to determine the average weight of a student at the college, a study might select a random sample of fifty students to weigh. Then the measured average weight could be used to estimate the average weight for all students at the college. The fifty students would be the sample; all students at the college would be the population.
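The weight example can be sketched in Python. This is only an illustration: the population below is simulated, and the figures of 2000 students, a 65 kg center, and a 12 kg spread are arbitrary assumptions, not data from any college.

```python
import random
import statistics

# Hypothetical population: simulated weights in kg for 2000 students.
# The seed is fixed only so that the sketch gives repeatable numbers.
rng = random.Random(42)
population = [rng.gauss(65, 12) for _ in range(2000)]

# A random sample of fifty students, drawn without replacement.
sample = rng.sample(population, 50)

# Parameter: measured on the whole population (usually unknown in practice).
population_mean = statistics.mean(population)
# Statistic: measured on the sample and used to estimate the parameter.
sample_mean = statistics.mean(sample)

print(f"N = {len(population)}, population mean = {population_mean:.1f} kg")
print(f"n = {len(sample)}, sample mean = {sample_mean:.1f} kg")
```

In a real study only the sample mean would be available. The population mean is printed here only because the population was simulated.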

Population: The complete group of elements, objects, observations, or people.
Parameters: Measurements of the population: population size N, population median, population mean μ...

Sample: A part of the population. A sample is usually more than five measurements, observations, objects, or people, and smaller than the complete population.
Statistics: Measurements of a sample: sample size n, sample median, sample mean x̄.

Examples

We could use the ratio of females to males in a class to estimate the ratio of females to males on campus. The sample is the class. The intended population is all students on campus. Whether the class is a "good" sample (representative, unbiased, randomly selected) would be a concern.

We could use the average body fat index for a randomly selected group of females between the ages of 18 and 22 on campus to estimate the average body fat index for females in the FSM between the ages of 18 and 22. The sample is those females on campus that we've measured. The intended population is all females between the ages of 18 and 22 in the FSM. Again, there would be concerns about how the sample was selected.

Measurements are made of individual elements in a sample or population. The elements could be objects, animals, or people.

Sample size n

The sample size is the number of elements or measurements in a sample. The lower case letter n is used for sample size. If the population size is being reported, then an upper case N is used. The spreadsheet function for calculating the sample size is the COUNT function.

=COUNT(data)

If one wants to count the sample size for a nominal level list of words, the COUNTA function is used.

=COUNTA(data)
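For readers who also work outside a spreadsheet, the difference between the two counting functions can be sketched in Python. The helper names count and counta and the two data lists are hypothetical, and None is assumed to stand in for an empty cell.

```python
# Hypothetical columns of data; None stands in for an empty cell.
heights = [152, 160, None, 171, 158]
colors = ["red", "blue", None, "green", "blue"]

def count(cells):
    """Rough analogue of COUNT: count only the numeric entries."""
    return sum(
        isinstance(c, (int, float)) and not isinstance(c, bool)
        for c in cells
    )

def counta(cells):
    """Rough analogue of COUNTA: count every non-empty entry, words included."""
    return sum(c is not None for c in cells)

print(count(heights))   # 4 numeric entries
print(count(colors))    # 0, because words are not numbers
print(counta(colors))   # 4 non-empty entries
```

This mirrors why COUNT returns zero for a list of words while COUNTA returns the sample size.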

1.2 Types and Levels of measurement

Types of measurement

Data can be put into categories such as words or numbers, countable and uncountable, and into levels of measurement.

Words or numbers

Countable or uncountable

Levels of measurement

There are four levels of measurement. In this text most of the data and examples are at the ratio level of measurement.

Nominal: Qualitative, discrete data values: Data that is words only. Baby names, favorite colors, sports you play

Ordinal: Qualitative/quantitative borderline, discrete data values: Data that can be put in a rank order. Letter grades A, B, C, D, F. Sakau market rating system where the number of cups until one is "pwopihda"...

Interval: Quantitative discrete or continuous data values: Data where differences in numeric values have meaning but ratios do not have meaning. Some measurement scales in fields such as psychology, temperature in Celsius. There is either a lack of a zero or the zero is not a true zero. The number of occupants of a car on Pohnpei: neither zero nor fractional values occur.

Ratio: Quantitative continuous data values: Data where differences, ratios, and fractions have meaning. Zero exists and has meaning. Distance, height, speed, velocity, time in seconds, altitude, acceleration, mass.

Nesting of the levels of measurement

The levels of measurement can also be thought of as being nested. For example, ratio level data consists of numbers. Numbers can be put in order, hence ratio level data is also orderable data and is thus also ordinal level data. To some extent, each level includes the ones below that level. The highest level of measurement that the data could be considered to be is said to be the level of measurement. There are instances where qualitative data might be placed in an order and thus be considered ordinal data, so ordinal level data may be either qualitative or quantitative. When a survey says, "Strongly agree, agree, disagree, strongly disagree" the data technically consists of answers which are words. Yet these words have an order, and in some instances the answers are mapped to numbers and a median value is then calculated. Above the ordinal level the data is quantitative, numeric data.

Note that at higher levels, such as at the ratio level, the mean is usually chosen to represent the middle, but the median and mode can also be calculated. Statistics that can be calculated at lower levels of measurement can be used in higher levels of measurement.
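A short Python sketch of which measures of the middle are available at each level, using the standard statistics module. All three data lists are made up for illustration, and mapping letter grades to the ranks 0 through 4 is an assumption of the sketch.

```python
import statistics

# Hypothetical data at three levels of measurement.
favorite_colors = ["blue", "red", "blue", "green"]   # nominal
letter_grades = ["A", "B", "B", "C", "F"]            # ordinal
heights_cm = [150.0, 160.0, 160.0, 170.0, 185.0]     # ratio

# Nominal: only the mode (most frequent value) is meaningful.
mode_color = statistics.mode(favorite_colors)

# Ordinal: map the words to ranks, take the median rank, map it back.
grade_rank = {"F": 0, "D": 1, "C": 2, "B": 3, "A": 4}
rank_to_grade = {rank: grade for grade, rank in grade_rank.items()}
median_rank = statistics.median_low([grade_rank[g] for g in letter_grades])
median_grade = rank_to_grade[median_rank]

# Ratio: the mean is available, and so are the median and the mode.
mean_height = statistics.mean(heights_cm)
median_height = statistics.median(heights_cm)
mode_height = statistics.mode(heights_cm)

print(mode_color)                               # blue
print(median_grade)                             # B
print(mean_height, median_height, mode_height)  # 165.0 160.0 160.0
```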

Descriptive statistics: Numerical or graphical representations of samples or populations. Can include numerical measures such as mode, median, mean, standard deviation. Also includes images such as graphs, charts, visual linear regressions.

Inferential statistics: Using descriptive statistics of a sample to predict the parameters or distribution of values for a population.

1.3 Simple random samples

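The number of measurements, elements, objects, or people in a sample is the sample size n. A simple random sample of n measurements from a population is one selected in a way that every member of the population is equally likely to be selected and every possible sample of size n is equally likely to be selected.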

Ensuring that a sample is random is difficult. Suppose I want to study how many Pohnpeians own cars. Would people I meet/poll on main street Kolonia be a random sample? Why? Why not?

Studies often use random numbers to help randomly select objects or subjects for a statistical study. Obtaining random numbers can be more difficult than one might at first presume.
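As a minimal sketch of using random numbers to select subjects, the Python below draws a sample without replacement from a list of ID numbers. The population of 200 numbered people and the sample size of 10 are assumptions made for the illustration.

```python
import random

# Hypothetical sampling frame: ID numbers 1 to 200 for a population of N = 200.
population_ids = list(range(1, 201))

# A fixed seed makes the sketch repeatable; a real study would not fix it.
rng = random.Random(2024)

# Draw n = 10 IDs without replacement. Every ID has the same chance of
# being chosen, and no ID can be chosen twice.
n = 10
sample_ids = rng.sample(population_ids, n)

print(sorted(sample_ids))
```

The hard part in practice is not the draw itself but having a complete list of the population to draw from, which a street poll does not provide.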

Computers can generate pseudo-random numbers. "Pseudo" means seemingly random but not truly random. Computer generated random numbers are produced by a deterministic calculation, so they behave very much like random numbers but are not truly random. Next we will learn to generate pseudo-random numbers using a computer. This section will also serve as an introduction to functions in spreadsheets.

Coins and dice can be used to generate random numbers.

Using a spreadsheet to generate random numbers

The random function RAND generates a number greater than or equal to 0 and less than 1, that is, between 0 and 0.9999...

=RAND()

The random number function consists of a function name, RAND, followed by parentheses. For the random function nothing goes between the parentheses, not even a space.

To get other numbers the random function can be multiplied by a coefficient. To get whole numbers the integer function INT can be used to discard the decimal portion.

=INT(argument)

The integer function takes an "argument." The argument is a computer term for an input to the function. Inputs could include a number, a function, a cell address or a range of cell addresses. The following function, when typed into a spreadsheet, will mimic the flipping of a coin. A 1 will be a head, a 0 will be a tail.

=INT(RAND()*2)

The spreadsheet can be made to display the word "head" or "tail" using the following code:

=CHOOSE(INT(RAND()*2+1),"head","tail")

A single die can also be simulated using the following function:

=INT(6*RAND()+1)
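The same multiply-then-truncate idea can be sketched in Python, where random.random() plays the role of RAND and math.floor plays the role of INT. The function names coin_flip and die_roll are hypothetical names chosen for this sketch.

```python
import math
import random

def coin_flip():
    """Mirror of =INT(RAND()*2): returns 0 (tail) or 1 (head)."""
    return math.floor(random.random() * 2)

def die_roll():
    """Mirror of =INT(6*RAND()+1): returns a whole number from 1 to 6."""
    return math.floor(6 * random.random() + 1)

# Simulate 1000 flips and 1000 rolls, then tally the results.
flips = [coin_flip() for _ in range(1000)]
rolls = [die_roll() for _ in range(1000)]

print("heads:", flips.count(1), "tails:", flips.count(0))
for face in range(1, 7):
    print("face", face, "came up", rolls.count(face), "times")
```

Because random.random() never returns exactly 1, the die never comes up 7. The same holds for RAND in the spreadsheet formula.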

To randomly select among a set of student names, the following model can be built upon.

=CHOOSE(INT(RAND()*5+1),"Jan","Jen","Jin","Jon","Jun")
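A comparable sketch in Python, using the same five names. The function name pick_name is hypothetical. Note one difference from the spreadsheet: CHOOSE numbers its choices starting at 1, while a Python list starts at position 0, so the "+1" is not needed here.

```python
import math
import random

names = ["Jan", "Jen", "Jin", "Jon", "Jun"]

def pick_name(name_list):
    """Pick one name at random, in the spirit of CHOOSE(INT(RAND()*5+1), ...)."""
    # random.random() * 5 lies in [0, 5), so the floor is 0, 1, 2, 3, or 4.
    index = math.floor(random.random() * len(name_list))
    return name_list[index]

print(pick_name(names))
```

Python also provides random.choice(names), which does the same job in one call. The longer form is shown only to parallel the spreadsheet formula.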