Chapter 1/2

The video above is from a past semester.  The material is the same, but any dates mentioned will be incorrect.  For the current term, please refer to the due dates in the syllabus.

Chapter 1

Section 1.2 Types of Data

This section is mostly definitions.  Below you will find some definitions along with some examples and videos of them.

Statistic- a numerical measurement describing some characteristic of the SAMPLE.

Parameter-   a numerical measurement describing some characteristic of the POPULATION.

Example: Suppose we want to find out the percentage of people in the United States that use marijuana.  

-To find the exact value, we would have to survey EVERY member of the country. This result would be the parameter since we selected every member in the population.

-If we wanted to estimate the percentage that use marijuana we could take a sample from the people in the U.S. (say we randomly select 1000 from the country).  This would be a statistic since we only selected a sample and not the entire population.

- Statistic is to Sample as Parameter is to Population
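If it helps to see the difference in action, here is a small Python sketch. Everything in it is made up for illustration: the population of 100,000 people and the 15% usage rate are assumptions, not real survey results. The point is only that the parameter uses every member while the statistic uses the 1000 sampled members.

```python
import random

random.seed(1)  # fixed seed so the sketch gives the same result each run

# Hypothetical population of 100,000 people; True means "uses marijuana".
# The 15% rate is an assumption chosen purely for illustration.
population = [random.random() < 0.15 for _ in range(100_000)]

# Parameter: computed from EVERY member of the population.
parameter = sum(population) / len(population)

# Statistic: computed from a random sample of 1000 members.
sample = random.sample(population, 1000)
statistic = sum(sample) / len(sample)

print(f"parameter (population proportion): {parameter:.3f}")
print(f"statistic (sample proportion):     {statistic:.3f}")
```

The two numbers should be close but usually not identical, since the statistic is only an estimate of the parameter.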



Quantitative (or Numerical) Data  - consist of numbers representing counts or measurements.

Categorical (or Qualitative) Data- consist of names or labels that do not represent counts or measurements.

Example: 

Quantitative examples: Number of students in the class, wait times at a coffee shop, length of a boat.

Categorical examples: political affiliation, race, gender



Discrete data: result when the data values come from a finite or countable set (we could list the possible values one by one).

Continuous data: result when the data values come from an uncountably infinite set, such as every value in an interval.

Example: 

Discrete examples: amount of money earned at a job, number of people on a bus

Note: You can list all the possible outcomes. Amount of money earned on a roofing job can be $10,000 or $10,000.01 or $10,000.02, etc (there are a lot of possibilities but we can list them all...discrete!)

Continuous examples: heights, weights

Note: We can NOT list all the outcomes.  Take weight: if we have weights of 175 lbs and 176 lbs, are there any possible weights between the two? YES! 175.5 lbs.  What about between 175 and 175.5, any other possibilities? YES! 175.2 lbs.  No matter what two weights are given, we can always find another value between them.  In fact, we can find an infinite number of values between any two given weights....Continuous data!



Levels of measurement (4 levels below)

Nominal-data that consists of names or categories.  No ordering possible. 

 Examples: political affiliation, race, gender (grouping with no ordering possible)

Ordinal- Ordering is possible but mathematical differences are meaningless.

Examples: Course grades (A,B,C,D,F).  We know that an “A” is better than a “B” but we can’t measure the difference mathematically.  Places in a race (1st, 2nd, 3rd) work the same way.

Interval: Ordering is possible and mathematical differences are meaningful. No absolute zero.

Example: temperature in degrees Celsius.  If we have a 10 degree day and a 20 degree day we can say that the 20 degree day was 10 degrees warmer...i.e., mathematical differences are meaningful.  Note that 0 degrees Celsius does not mean absence of heat...it is just the point where water freezes.  Therefore, there is no absolute zero. When no absolute zero exists, ratios cannot be used...we cannot say the 20 degree day was twice as warm as the 10 degree day.

Ratio: Ordering is possible, mathematical differences and ratios are meaningful.  Absolute zero present

Example: weight.  If we have a 100lb object and a 200lb object we can say that the 200lb object weighs 100 lbs more (differences are meaningful). We can also say that the 200lb object weighs twice as much (ratios are meaningful)!  We’re able to use ratios here since an absolute zero is present. What does 0lbs mean?  Means it’s without weight or weightless! 
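One way to convince yourself of the interval vs. ratio difference is to change units and see what survives. The sketch below is only an illustration: the Kelvin and kilogram conversions are not part of the notes above, they are just standard unit conversions used here as a check. Differences in temperature stay meaningful, but the "twice as warm" ratio falls apart, while the weight ratio stays at 2 no matter the unit.

```python
def celsius_to_kelvin(c):
    """Convert degrees Celsius to kelvins (a scale with a true zero)."""
    return c + 273.15

def pounds_to_kilograms(lb):
    """Convert pounds to kilograms (1 lb = 0.45359237 kg)."""
    return lb * 0.45359237

cool_c, warm_c = 10.0, 20.0

# Differences agree in both scales: the warm day is 10 degrees warmer.
diff_c = warm_c - cool_c
diff_k = celsius_to_kelvin(warm_c) - celsius_to_kelvin(cool_c)

# Ratios do NOT agree: 20/10 = 2 in Celsius, but only about 1.035 in kelvins.
ratio_c = warm_c / cool_c
ratio_k = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cool_c)

# Weight has an absolute zero, so the ratio is 2 in pounds and in kilograms.
ratio_lb = 200 / 100
ratio_kg = pounds_to_kilograms(200) / pounds_to_kilograms(100)

print(f"temperature difference: {diff_c:.1f} C, {diff_k:.1f} K")
print(f"temperature ratio: {ratio_c:.3f} (Celsius) vs {ratio_k:.3f} (kelvin)")
print(f"weight ratio: {ratio_lb:.3f} (lb) vs {ratio_kg:.3f} (kg)")
```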

Section 1.3 Collecting Sample Data

Simple Random Sample (SRS)- a sample of n participants selected from a population at random, in such a way that every possible group of n members has the same chance of being chosen.

Systematic Sampling- start with a randomly selected starting point, then select additional participants at every kth interval.

Ex: Sampling from a grocery store line.  Randomly select the first participant, then select every 5th person in line after that.

Convenience Sampling- just use data that is simple to obtain (quite often biased in some manner)

Stratified Sampling- The population is divided into subgroups (strata) based on a characteristic (gender, race, political affiliation), and a random sample is then drawn within each stratum.

Ex: Stratify by political affiliation (Rep, Dem, Ind) then sample within each stratum.  Advantage: You learn about the population as a whole and each individual group.

Cluster Sampling- a method of sampling in which you divide a population into groups (clusters), such as city blocks or districts, and then randomly select some of these clusters for your sample. All members of a selected cluster are sampled.

Ex: Cluster by city block, randomly select several clusters (blocks) and sample each member in that cluster (block).  Saves time versus randomly selecting households all over the city.
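The sampling methods above can be sketched in a few lines of Python. This is just one possible way to code them, under some assumptions: the population of 60 people, their party labels, and their block numbers are all hypothetical, and the function names are made up for this example. Convenience sampling is left out since there is nothing random to code.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 60 people, each with an id, a party, and a city block.
parties = ["Rep", "Dem", "Ind"]
population = [
    {"id": i, "party": parties[i % 3], "block": i // 10}
    for i in range(60)
]

def simple_random_sample(pop, n):
    """Every group of n members is equally likely to be chosen."""
    return random.sample(pop, n)

def systematic_sample(pop, k):
    """Random starting point among the first k, then every kth member."""
    start = random.randrange(k)
    return pop[start::k]

def stratified_sample(pop, key, n_per_stratum):
    """Group by a characteristic, then randomly sample within each stratum."""
    strata = {}
    for person in pop:
        strata.setdefault(person[key], []).append(person)
    sample = []
    for members in strata.values():
        sample.extend(random.sample(members, n_per_stratum))
    return sample

def cluster_sample(pop, key, n_clusters):
    """Randomly pick whole clusters and keep EVERY member of each one."""
    clusters = sorted({person[key] for person in pop})
    chosen = set(random.sample(clusters, n_clusters))
    return [person for person in pop if person[key] in chosen]

srs = simple_random_sample(population, 6)
systematic = systematic_sample(population, 5)
stratified = stratified_sample(population, "party", 2)
clustered = cluster_sample(population, "block", 2)

print(len(srs), len(systematic), len(stratified), len(clustered))
```

Notice the key difference between the last two: stratified sampling takes a few members from every group, while cluster sampling takes every member from a few groups.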

Chapter 2: 

Section 2.1 Frequency distributions

Frequency distribution- Displays the data in several different categories (or classes) and the number that fall in each class (frequency).

Below is an example of a frequency distribution and the terminology that goes along with it

Age        Frequency

20-29        20

30-39        18

40-49        10

50-59          8

60-69          0

70-79          2

lower class limits- 20, 30, 40, 50, 60, and 70

upper class limits- 29, 39, 49, 59, 69, and 79

class midpoints- 24.5, 34.5, 44.5, 54.5, 64.5, 74.5 (halfway between the lower and upper class limits)

class boundaries- 19.5, 29.5, 39.5, 49.5, 59.5, 69.5, and 79.5 (halfway between the upper class limit and the next lower class limit, don't forget the first and last one)

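class width (careful, this is tricky)- 10. This can be found by finding the difference in two consecutive lower class limits OR two consecutive upper class limits OR two consecutive class boundaries. It is NOT the upper class limit minus the lower class limit of the same class (29 - 20 = 9, which is not the width).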

The purpose of a frequency distribution is to display where the data falls. For example, the above frequency distribution shows that there are more younger people than older in the sample.
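If you want to check the terminology above by computer, here is a short Python sketch using the age classes from the table. The variable names are just my own choices for illustration.

```python
# Class limits copied from the age frequency distribution above.
lower_limits = [20, 30, 40, 50, 60, 70]
upper_limits = [29, 39, 49, 59, 69, 79]

# Midpoints: halfway between the lower and upper limit of each class.
midpoints = [(lo + hi) / 2 for lo, hi in zip(lower_limits, upper_limits)]

# Boundaries: halfway between an upper limit and the next lower limit,
# plus one extra boundary below the first class and above the last class.
gap = lower_limits[1] - upper_limits[0]  # 1 for these classes
boundaries = [lower_limits[0] - gap / 2] + [hi + gap / 2 for hi in upper_limits]

# Width: difference between two consecutive lower class limits.
class_width = lower_limits[1] - lower_limits[0]

print("midpoints: ", midpoints)
print("boundaries:", boundaries)
print("width:     ", class_width)
```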



Building a Frequency Distribution:

Fish Length (in) (Rockfish, California Halibut, and Lingcod lengths from a mercury study that a few buddies and I are conducting in Northern California)

26.50, 29.50, 24.70, 29, 29.25, 22, 14, 14, 16, 16.25, 16.25, 17, 19, 20, 13.25*, 15.50, 16, 16, 16.75, 17.75, 18, 28.75, 29, 33.50, 33.50, 33.50, 34, 35, 38.50, 40*

(* marks the minimum and maximum values, which are used in step 2 below.)

1) Select the number of classes. Generally, between 5 and 20 depending on the size of the data set. (this will be determined for you in the homework and tests). Since we have a small data set, let's go with a small number of classes...say 7.

2) Find the class width.

class width = (max value - min value)/number of classes = (40 - 13.25)/7 = 26.75/7 ≈ 3.82. Since this is not a convenient number to work with, we can round UP to 4 for simplicity. Note: we cannot round down, since the 7 classes would then no longer cover all the data and we would need to add another class. This calculation gives us the minimum class width...no smaller than 3.82!

3) Choose the first lower class limit. Either the min value or a more convenient value less than the min value. In this case, the min is 13.25, not convenient, so we can use 13.

4) Find the lower class limits. Starting with 13, add the class width (4) repeatedly.

13

17

21

25

29

33

37

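5) Determine the upper class limits. Each upper class limit is one less than the next lower class limit, and the last one (40) follows the same pattern.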

classes

13-16

17-20

21-24

25-28

29-32

33-36

37-40

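6) Tally the frequencies for each class (count the number of fish in each size class). Since some lengths have decimals, use the class boundaries (12.5, 16.5, 20.5, 24.5, 28.5, 32.5, 36.5, 40.5) to decide where the in-between values go. For example, 16.25 is counted in the 13-16 class and 16.75 is counted in the 17-20 class.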

classes     frequency

13-16           9

17-20           6

21-24          1

25-28          2

29-32          5

33-36          5

37-40          2
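The six steps above can be sketched in Python as follows. Treat this as one possible way to do it, not the only one: rounding the width up with `math.ceil`, starting at `math.floor` of the minimum, and assigning the decimal lengths by class boundaries (0.5 below each lower limit to 0.5 above each upper limit) are assumptions chosen to mirror the choices made in the steps.

```python
import math

# Fish lengths (in) from the data set above.
lengths = [
    26.50, 29.50, 24.70, 29, 29.25, 22, 14, 14, 16, 16.25,
    16.25, 17, 19, 20, 13.25, 15.50, 16, 16, 16.75, 17.75,
    18, 28.75, 29, 33.50, 33.50, 33.50, 34, 35, 38.50, 40,
]

# Step 1: number of classes.
num_classes = 7

# Step 2: minimum class width, rounded UP to a whole number.
raw_width = (max(lengths) - min(lengths)) / num_classes  # about 3.82
class_width = math.ceil(raw_width)

# Step 3: first lower class limit, a convenient value at or below the min.
first_lower = math.floor(min(lengths))

# Step 4: lower class limits, adding the class width repeatedly.
lower_limits = [first_lower + i * class_width for i in range(num_classes)]

# Step 5: upper class limits, one less than the next lower limit.
upper_limits = [lo + class_width - 1 for lo in lower_limits]

# Step 6: tally, using class boundaries so decimal lengths have a home.
frequencies = [
    sum(1 for x in lengths if lo - 0.5 <= x < hi + 0.5)
    for lo, hi in zip(lower_limits, upper_limits)
]

for lo, hi, freq in zip(lower_limits, upper_limits, frequencies):
    print(f"{lo}-{hi}    {freq}")
```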

This can also be displayed as a RELATIVE frequency distribution. This is the same idea but instead of a frequency count as above, it's represented as a percentage of the total. Since we have a total of 30 data elements, we would divide each frequency by 30 and change into a percentage.

classes    frequency    relative frequency

13-16      9            9/30 = 0.300 = 30%

17-20      6            6/30 = 0.200 = 20%

21-24      1            1/30 ≈ 0.033 = 3.3%

25-28      2            2/30 ≈ 0.067 = 6.7%

29-32      5            5/30 ≈ 0.167 = 16.7%

33-36      5            5/30 ≈ 0.167 = 16.7%

37-40      2            2/30 ≈ 0.067 = 6.7%
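Here is a short Python sketch of the relative frequency calculation, using the fish length classes and frequencies from the table. Rounding to one decimal place is my own choice for the sketch.

```python
classes = ["13-16", "17-20", "21-24", "25-28", "29-32", "33-36", "37-40"]
frequencies = [9, 6, 1, 2, 5, 5, 2]

total = sum(frequencies)  # 30 fish in all

# Relative frequency: each frequency divided by the total, as a percentage.
relative = [round(100 * f / total, 1) for f in frequencies]

for label, freq, pct in zip(classes, frequencies, relative):
    print(f"{label}    {freq}    {pct}%")
```

Because each percentage is rounded, the column can add up to slightly more or less than 100% (here it adds to 100.1%). That is just rounding, not a mistake.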



Section 2.2

Histogram-Graph with bars representing the frequency in each class. Bars will touch unless there is a class with 0 as a frequency. The horizontal scale represents the classes and the vertical scale the frequencies.


I've taken the frequency distribution in the previous section and made a histogram. Please note that the class boundaries were used along the horizontal scale. The bars represent the frequencies in each one of the classes. Basically, a histogram is a graphical representation of a frequency distribution. We're just trying to find out where the data falls. We can see here that half of the fish (15 of 30) fall in the two smallest classes, not many are in the mid range, and a good number are larger fish.
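For a quick look at the shape without any graphing software, a rough text version of the histogram can be printed in Python. This is only a sketch: the bars run sideways and each `*` stands for one fish, which is my own convention for the example.

```python
classes = ["13-16", "17-20", "21-24", "25-28", "29-32", "33-36", "37-40"]
frequencies = [9, 6, 1, 2, 5, 5, 2]

# One row per class; the number of stars is the frequency of that class.
lines = [f"{label} | {'*' * freq}" for label, freq in zip(classes, frequencies)]

for line in lines:
    print(line)
```

The long rows at the top, the short rows in the middle, and the medium rows near the bottom show the same pattern as the bars of the histogram.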


Dotplot-graph of data where a dot is placed above the value it is to represent.  This graph has the same effect as the histogram as you can see where the data lies.

Fish Length (in) (Rockfish, California Halibut, and Lingcod lengths from a mercury study that a few buddies and I are conducting in Northern California)

26.50, 29.50, 24.70, 29, 29.25, 22, 14, 14, 16, 16.25, 16.25, 17, 19, 20, 13.25, 15.50, 16, 16, 16.75, 17.75, 18, 28.75, 29, 33.50, 33.50, 33.50, 34, 35, 38.50, 40

Statcrunch instructions (I highly recommend using statcrunch for this course as it will be very helpful)

Dotplot: Click the little box to the right of the data set  (in the homework and test questions) and select "Open in Statcrunch". Now that Statcrunch is open, click "Graph" and then "Dotplot".  Select the variable.  Click "Compute".

Stem and leaf plot- Graph where the data is broken up in two parts, the stem and leaf.  The stem is the left most digit or digits and the leaf would be the right most digit or digits.

Using the same fish length data:

Variable: fish lengths

Decimal point is 1 digit(s) to the right of the colon.

Leaf unit = 1

1 : 344

1 : 66666677889

2 : 02

2 : 579999

3 : 04444

3 : 59

4 : 0

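Note: the original data were rounded to the nearest whole number before plotting, so 16 and 16.25 are both represented as 1 : 6 while 16.75 is represented as 1 : 7. Each stem is also split into two rows, one for leaves 0-4 and one for leaves 5-9.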

If you turn the stem and leaf 90 degrees you can see the resemblance to the histogram.  The longer the string of digits, the more data in that group.
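The display above can be reproduced with a short Python sketch. Two assumptions are built in, both chosen because they match the output shown: each length is rounded to the nearest whole inch with halves rounded up, and each stem is split into two rows (leaves 0-4 and leaves 5-9). I can't say for certain this is how Statcrunch does it internally.

```python
import math

# Fish lengths (in) from the data set above.
lengths = [
    26.50, 29.50, 24.70, 29, 29.25, 22, 14, 14, 16, 16.25,
    16.25, 17, 19, 20, 13.25, 15.50, 16, 16, 16.75, 17.75,
    18, 28.75, 29, 33.50, 33.50, 33.50, 34, 35, 38.50, 40,
]

# Round halves UP to whole inches. (Python's built-in round() rounds
# halves to the nearest even number, so it is not used here.)
rounded = sorted(math.floor(x + 0.5) for x in lengths)

# Split each value into stem (tens digit) and leaf (ones digit), and
# put leaves 0-4 and 5-9 on separate rows of the same stem.
rows = {}
for value in rounded:
    stem, leaf = divmod(value, 10)
    half = 0 if leaf < 5 else 1
    rows.setdefault((stem, half), []).append(str(leaf))

lines = [
    f"{stem} : {''.join(leaves)}"
    for (stem, half), leaves in sorted(rows.items())
]

for line in lines:
    print(line)
```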

Statcrunch instructions (I highly recommend using statcrunch for this course as it will be very helpful)

Stem and Leaf Plot: Click the little box to the right of the data set  (in the homework and test questions) and select "Open in Statcrunch". Now that Statcrunch is open, click "Graph" and then "Stem and Leaf".  Select the variable.  Click "Compute".

 

Many of the problems in this section will ask if the data is approximately normally distributed.  What we are referring to is the shape of the data...does it resemble the symmetric bell shape of a normal curve?

Normal curve

If we can put a normal curve over the histogram/dotplot/stem and leaf and it looks to have roughly the same shape, then the data would be considered approximately normally distributed.

2.4 #6

Construct a Scatterplot.

Bear chest size   Bear Weight

26                                   80

45                                 344

54                                 416

49                                 348

35                                166

41                                220

41                                262

Statcrunch instructions (I highly recommend using statcrunch for this course as it will be very helpful)

Scatter plot: Click the little box to the right of the data set (in the homework and test questions) and select "Open in Statcrunch". Now that Statcrunch is open, click "Graph" and then "Scatter Plot".  Select the two variables for the X variable and Y variable.  Click "Compute".