Lecture 11
Introduction:
Statistics is the science of collecting, analyzing, interpreting, and presenting data. Statistical methods are used in a wide range of fields, including science, business, healthcare, social sciences, and many others. In this introduction, we will explore some of the basic statistical concepts that are fundamental to understanding and analyzing data.
One of the primary tasks of statistics is to describe and summarize data using descriptive statistics. Descriptive statistics include measures of central tendency (such as mean, median, and mode), measures of variability (such as standard deviation and range), and graphical summaries of the distribution (such as histograms and box plots).
Another important aspect of statistics is making inferences about a population based on a sample of data. This involves using inferential statistics, which includes methods such as hypothesis testing and confidence intervals. These methods allow us to draw conclusions about a larger population based on a smaller sample.
Probability is another key concept in statistics. Probability describes the likelihood of an event occurring and is used in statistical analysis to calculate the chance of events such as drawing a specific card from a deck or flipping a coin and getting heads.
Sampling is the process of selecting a subset of individuals or units from a larger population to represent the population as a whole. Statistical methods are used to ensure that the sample is representative of the population and to estimate population parameters based on the sample data.
Correlation is another important statistical concept that refers to the degree to which two variables are related to each other. Correlation can be positive, negative, or zero and is used to describe the relationship between two variables.
In summary, understanding basic statistical concepts is essential for interpreting and analyzing data in a wide range of fields. Whether you are working in science, business, healthcare, or any other field, knowledge of statistics can help you make informed decisions based on data.
Here are the other basic statistical concepts you must know:
Lecture 11. Basic Statistical Concepts:
We begin with a simple example. There are millions of passenger automobiles in the Philippines. What is their average value? It is obviously impractical to attempt to solve this problem directly by assessing the value of every single car in the country, adding up all those values, and then dividing by the number of values, one for each car. In practice the best we can do would be to estimate the average value. A natural way to do so would be to randomly select some of the cars, say 200 of them, ascertain the value of each of those cars, and find the average of those 200 values. The set of all those millions of vehicles is called the population of interest, and the number attached to each one, its value, is a measurement.
The average value is a parameter: a number that describes a characteristic of the population, in this case monetary worth. The set of 200 cars selected from the population is called a sample, and the 200 numbers, the monetary values of the cars we selected, are the sample data. The average of the data is called a statistic: a number calculated from the sample data. This example illustrates the meaning of the following definitions.
Concepts and Definitions:
A population is any specific collection of objects of interest.
A sample is any subset or subcollection of the population, including the case that the sample consists of the whole population, in which case it is termed a census.
A measurement is a number or attribute computed for each member of a population or of a sample.
Sample data are the measurements made on the elements of the sample.
A parameter is a number that summarizes some aspect of the population as a whole.
A statistic is a number computed from the sample data. Continuing with our example, if the average value of the cars in our sample was 500,000, then it seems reasonable to conclude that the average value of all cars is about 500,000. In reasoning this way, we have drawn an inference about the population based on information obtained from the sample.
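To make the distinction concrete, here is a minimal Python sketch of the car example. The 10,000 normally distributed values are entirely made up, so only the workflow matters: the population mean is the parameter, and the sample mean is the statistic that estimates it.

    import random

    random.seed(11)

    # Hypothetical population: one made-up value per car
    # (10,000 cars here instead of millions, for speed).
    population = [random.gauss(500_000, 120_000) for _ in range(10_000)]

    # Parameter: the population mean (unknown in practice).
    mu = sum(population) / len(population)

    # Sample: 200 cars selected at random; their values are the sample data.
    sample = random.sample(population, 200)

    # Statistic: the sample mean, our estimate of the parameter.
    x_bar = sum(sample) / len(sample)

    print(f"population mean (parameter): {mu:,.0f}")
    print(f"sample mean (statistic):     {x_bar:,.0f}")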
In general, statistics is a study of data: describing properties of the data, which is called descriptive statistics, and drawing conclusions about a population of interest from information extracted from a sample, which is called inferential statistics. Computing the single number 500,000 to summarize the data was an operation of descriptive statistics; using it to make a statement about the population was an operation of inferential statistics.
Statistics is a collection of methods for collecting, displaying, analyzing, and drawing conclusions from data.
Areas of Statistics: Descriptive statistics is the branch of statistics that involves organizing, displaying, and describing data.
Areas of Statistics: Inferential statistics is the branch of statistics that involves drawing conclusions about a population based on information contained in a sample taken from that population. The measurement made on each element of a sample need not be numerical. In the case of automobiles, what is noted about each car could be its color, its make, its body type, and so on. Such data are categorical or qualitative, as opposed to numerical or quantitative data such as value or age. This is a general distinction of data.
Qualitative data are measurements for which there is no natural numerical scale, but which consist of attributes, labels, or other non-numerical characteristics.
Quantitative data are numerical measurements that arise from a natural numerical scale.
Remember that qualitative data can generate numerical sample statistics. In the automobile example, for instance, we might be interested in the proportion of all cars that are less than six years old. In our sample of 200 cars we could note for each car whether it is less than six years old or not, which is a qualitative measurement. If 172 of the 200 cars in the sample are less than six years old, a proportion of 172/200 = 0.86, or 86%, then we would estimate the parameter of interest, the population proportion, to be about the same as the sample statistic, the sample proportion, that is, about 0.86.
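The same computation in Python, hard-coding the 172-of-200 count from the example (each True means "less than six years old"):

    # 172 cars under six years old, 28 that are not.
    is_young = [True] * 172 + [False] * 28

    p_hat = sum(is_young) / len(is_young)  # sample proportion
    print(p_hat)  # 0.86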
The relationship between a population of interest and a sample drawn from that population is perhaps the most important concept in statistics, since everything else rests on it. This relationship is illustrated graphically in the Figure below. The circles in the large box represent elements of the population. The solid black circles represent the elements of the population that are selected at random and that together form the sample. For each element of the sample there is a measurement of interest, denoted by a lowercase x (which we have indexed as x1, ..., xn to tell them apart).
These measurements collectively form the sample data set. From the data we may calculate various statistics. To anticipate the notation that will be used later, we might compute the sample mean x̄ and the sample proportion p̂, and take them as approximations to the population mean µ (this is the lowercase Greek letter mu, the traditional symbol for this parameter) and the population proportion p, respectively. The other symbols in the figure stand for other parameters and statistics that we will encounter.
A probability sampling method is a sampling method that uses randomization to choose survey participants.
A Random sample is a sample in which each member of the population has an equal chance of being included and in which the selection of one member is independent of the selection of all other members.
[Example: You want to select a simple random sample of 100 employees of Company X. You assign a number to every employee in the company database from 1 to 1000, and use a random number generator to select 100 numbers.]
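As a sketch, Python's standard random module can draw such a simple random sample directly; the IDs 1 to 1000 stand in for the company database:

    import random

    employee_ids = range(1, 1001)  # every employee, numbered 1..1000

    # Each employee is equally likely to be drawn, with no repeats.
    simple_random_sample = random.sample(employee_ids, 100)
    print(sorted(simple_random_sample)[:10])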
Stratified sampling attempts to account for the demographics and traits of the larger population by recreating their proportions in the sample. For example, if you're surveying college history majors, and you already know that 40% of history majors are female and 60% are male, you might want your sample to have the same proportions.
[Example: The company has 800 female employees and 200 male employees. You want to ensure that the sample reflects the gender balance of the company, so you sort the population into two strata based on gender. Then you use random sampling on each group, selecting 80 women and 20 men, which gives you a representative sample of 100 people.]
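A minimal sketch of that stratified draw, assuming hypothetical ID ranges for the two strata:

    import random

    female_ids = list(range(1, 801))   # 800 female employees (hypothetical IDs)
    male_ids = list(range(801, 1001))  # 200 male employees (hypothetical IDs)

    # Sample each stratum in proportion to its share of the population.
    stratified_sample = random.sample(female_ids, 80) + random.sample(male_ids, 20)
    print(len(stratified_sample))  # 100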
Cluster sampling also involves dividing the population into subgroups, but here each subgroup should have characteristics similar to those of the whole population. Instead of sampling individuals from each subgroup, you randomly select entire subgroups.
[Example: The company has offices in 10 cities across the country (all with roughly the same number of employees in similar roles). You don’t have the capacity to travel to every office to collect your data, so you use random sampling to select 3 offices – these are your clusters.]
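The cluster version randomly selects whole offices rather than individual employees; a sketch:

    import random

    offices = [f"Office {i}" for i in range(1, 11)]  # the 10 city offices

    # Select 3 entire clusters; every employee in a chosen office is surveyed.
    clusters = random.sample(offices, 3)
    print(clusters)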
Nonprobability sampling methods are methods that do not use any randomization to select survey participants. Therefore, population members do not have an equal chance of being included.
Convenience sampling is a nonprobability sampling method that includes participants based on their availability and accessibility. Essentially, it includes people who are easy to reach.
[Example: You are researching opinions about student support services in your university, so after each of your classes, you ask your fellow students to complete a survey on the topic. This is a convenient way to gather data, but as you only surveyed students taking the same classes as you at the same level, the sample is not representative of all the students at your university.]
Voluntary response sampling is similar to convenience sampling: a voluntary response sample is mainly based on ease of access. Instead of the researcher choosing participants and directly contacting them, people volunteer themselves (e.g. by responding to a public online survey). Voluntary response samples are always at least somewhat biased, as some people will inherently be more likely to volunteer than others.
[Example: You send out the survey to all students at your university and a lot of students decide to complete it. This can certainly give you some insight into the topic, but the people who responded are more likely to be those who have strong opinions about the student support services, so you can't be sure that their opinions are representative of all students.]
Snowball sampling relies on the first survey participants to refer you to the next ones, and so on. Once you've found enough people to meet your required sample size, you stop the survey.
[Example: You are researching experiences of homelessness in your city. Since there is no list of all homeless people in the city, probability sampling isn’t possible. You meet one person who agrees to participate in the research, and she puts you in contact with other homeless people that she knows in the area.]
Purposive sampling, also known as judgement sampling, involves the researcher using their expertise to select a sample that is most useful to the purposes of the research.
[Example: You want to know more about the opinions and experiences of disabled students at your university, so you purposefully select a number of students with different support needs in order to gather a varied range of data on their experiences with student services.]
Quota sampling is similar to stratified sampling. The difference is that this method doesn't randomly select participants. As with stratified sampling, the researchers first define categories they want to represent in their sample and choose appropriate proportions for each group. These could be equal quotas, like 100 men and 100 women, or they could seek to replicate a target population's demographics. Instead of randomly selecting participants, the surveyors use some form of convenience sampling. When they've hit the right quotas for each category, they stop the survey.
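A sketch of the quota logic, using a hypothetical stream of (person, group) pairs in arrival order; respondents are taken as they come (convenience) until each quota is filled:

    def quota_sample(stream, quotas):
        counts = {group: 0 for group in quotas}
        chosen = []
        for person, group in stream:
            if group in counts and counts[group] < quotas[group]:
                chosen.append(person)
                counts[group] += 1
            if counts == quotas:  # every quota hit: stop surveying
                break
        return chosen

    arrivals = [(1, "F"), (2, "M"), (3, "F"), (4, "F"), (5, "M")]
    print(quota_sample(arrivals, {"F": 2, "M": 2}))  # [1, 2, 3, 5]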
Types of Variables
A Random variable is a variable that represents value(s) from a random sample. We will use letters at the end of the alphabet, especially x, y and z, as random variables.
An Independent random variable is a variable that is chosen, and then measured or manipulated, by the researcher in order to study some observed behavior.
A Dependent random variable is a variable whose value depends on the value of one or more independent variables.
A Discrete variable is a variable which can take a discrete set of values (e.g. cards in a deck or scores on an IQ test). Discrete variables can take either a finite or infinite set of values, although for our purposes we usually consider discrete variables which only take a finite set of values.
A Continuous variable is a variable that can take all the values in a finite or infinite interval (e.g. weight or temperature). A continuous variable can therefore take an infinite set of values.
Types of Data Measurement
Data Scales: Nominal Data provides a name; if numeric, then no scale is implied.
Example: Gender (Male, Female); Primary Color (Yellow, Red, Blue)
Data Scales: Ordinal Data provides an ordered scale.
Example: Educational Level (High School, BS, MS, Ph.D.)
Data Scales: Interval Data can be manipulated mathematically and are scaled in equal increments. An interval scale is one where there is order and the difference between two values is meaningful.
Example: Temperature (Fahrenheit), Temperature (Celsius), pH, SAT score (200-800)
Data Scales: Ratio Data are Interval scale with a meaningful zero. A ratio scale is a quantitative scale where there is a true zero and equal intervals between neighboring points. Unlike on an interval scale, a zero on a ratio scale means there is a total absence of the variable you are measuring.
Example: Length, area, and population
Types of Error
A type I error (false-positive) occurs if an investigator rejects a null hypothesis that is actually true in the population.
A type II error (false-negative) occurs if the investigator fails to reject a null hypothesis that is actually false in the population.
Hypothesis testing is used to assess the plausibility of a hypothesis by using sample data. The test provides evidence concerning the plausibility of the hypothesis, given the data. Statistical analysts test a hypothesis by measuring and examining a random sample of the population being analyzed.
A p-value is the probability of obtaining a result at least as extreme as the one observed, assuming the null hypothesis is true. The smaller the p-value, the stronger the evidence against the null hypothesis.
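A minimal sketch of a hypothesis test in Python, assuming SciPy is available; the ten sample values are made up. The null hypothesis is that the population mean car value is 500,000, and the p-value measures how surprising the sample would be if that were true:

    from scipy import stats

    sample = [512_000, 495_000, 530_000, 478_000, 505_000,
              498_000, 521_000, 489_000, 515_000, 502_000]

    # One-sample t-test of H0: population mean = 500,000.
    t_stat, p_value = stats.ttest_1samp(sample, popmean=500_000)

    alpha = 0.05
    # Rejecting a true H0 would be a Type I error;
    # failing to reject a false H0 would be a Type II error.
    print("reject H0" if p_value < alpha else "fail to reject H0")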
A non-parametric test (sometimes called a distribution free test) does not assume anything about the underlying distribution (for example, that the data comes from a normal distribution).
Parametric tests are those that make assumptions about the parameters of the population distribution from which the sample is drawn. This is often the assumption that the population data are normally distributed.
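For contrast, here is a sketch (again assuming SciPy, with made-up data) that runs a parametric test next to its nonparametric counterpart on the same two groups:

    from scipy import stats

    group_a = [4.1, 5.2, 6.0, 5.5, 4.8, 5.9]
    group_b = [6.3, 7.1, 5.8, 7.4, 6.6, 6.9]

    # Parametric: assumes both samples come from normal distributions.
    print(stats.ttest_ind(group_a, group_b))

    # Nonparametric: compares ranks and assumes no particular distribution.
    print(stats.mannwhitneyu(group_a, group_b))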
A Normal distribution is a continuous probability distribution wherein values lie symmetrically about the mean, with most values situated near it.
A Measure of central tendency is a statistic that represents an entire population or dataset with a single value.
The mean (or average) is the most popular and well-known measure of central tendency. It can be used with both discrete and continuous data, although its use is most often with continuous data. The mean is equal to the sum of all the values in the data set divided by the number of values in the data set.
The median is the middle score for a set of data that has been arranged in order of magnitude. The median is less affected by outliers and skewed data.
The mode is the most frequent score in our data set.
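All three measures are available in Python's standard statistics module; a quick sketch with made-up scores:

    import statistics

    data = [2, 3, 3, 3, 4, 5, 7, 8, 9]

    print(statistics.mean(data))    # sum of values / number of values = 4.89
    print(statistics.median(data))  # middle value after sorting = 4
    print(statistics.mode(data))    # most frequent value = 3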
Measures of Dispersion help to interpret the variability of data, i.e. to know how homogeneous or heterogeneous the data are.
A standard deviation is a measure of how dispersed the data is in relation to the mean. Low standard deviation means data are clustered around the mean, and high standard deviation indicates data are more spread out.
The Variance measures how far a data set is spread out. It is mathematically defined as the average of the squared differences from the mean.
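A sketch of both measures using the standard library; pvariance and pstdev implement the population ("divide by n") definitions given above:

    import statistics

    data = [2, 4, 4, 4, 5, 5, 7, 9]  # mean = 5

    print(statistics.pvariance(data))  # average squared difference from mean = 4
    print(statistics.pstdev(data))     # square root of the variance = 2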
Measures of Relative position are conversions of values, usually standardized test scores, to show where a given value stands in relation to other values of the same grouping.
In statistics, a Quartile is a type of quantile which divides the number of data points into four parts, or quarters, of more-or-less equal size.
Deciles divide a distribution into ten equal parts. When data are divided into deciles, a decile rank is assigned to each data point in order to sort the data in ascending or descending order.
A percentile is a number below which a certain percentage of scores fall.
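A sketch of all three measures of relative position, assuming Python 3.8+ for statistics.quantiles; the scores 1 to 100 are made up for readability:

    import statistics

    data = list(range(1, 101))  # scores 1..100

    # Quartiles: 3 cut points dividing the data into 4 parts.
    print(statistics.quantiles(data, n=4))

    # Deciles: 9 cut points dividing the data into 10 parts.
    print(statistics.quantiles(data, n=10))

    # 90th percentile: the value below which about 90% of scores fall.
    print(statistics.quantiles(data, n=100)[89])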
Skewness refers to a distortion or asymmetry that deviates from the symmetrical bell curve, or normal distribution, in a set of data. If the curve is shifted to the left or to the right, it is said to be skewed.
Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution. That is, data sets with high kurtosis tend to have heavy tails, or outliers. Data sets with low kurtosis tend to have light tails, or lack of outliers.
In descriptive statistics, the interquartile range tells you the spread of the middle half of your distribution.
The Mean Absolute Deviation of a dataset is the average distance between each data point and the mean. It gives us an idea about the variability in a dataset.
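A combined sketch of these shape and spread measures, assuming NumPy and SciPy; the data set is made up, with one deliberate outlier:

    import numpy as np
    from scipy import stats

    data = np.array([2, 3, 3, 4, 5, 5, 6, 7, 8, 30])  # 30 is an outlier

    print(stats.skew(data))      # > 0: right (positive) skew
    print(stats.kurtosis(data))  # excess kurtosis relative to the normal's 0

    q1, q3 = np.percentile(data, [25, 75])
    print(q3 - q1)               # interquartile range: spread of middle half

    print(np.mean(np.abs(data - data.mean())))  # mean absolute deviation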
A frequency in statistics is the number of times an event or observation happened in an experiment or study. It can also be defined simply as a count of a certain event.
A scatter plot is a graph in which the values of two variables are plotted along two axes, the pattern of the resulting points revealing any correlation present.
A Graphical representation refers to the use of charts and graphs to visually display, analyze, clarify, and interpret numerical data, functions, and other qualitative structures.
Types of Graphs: Line Graphs – A line graph (or linear graph) is used to display continuous data and is useful for predicting future events over time.
Types of Graphs: Bar Graphs – A bar graph is used to display categories of data and compares the data using solid bars to represent the quantities.
Types of Graphs: Histograms – A histogram uses bars to represent the frequency of numerical data organized into intervals. Since all the intervals are equal and continuous, all the bars have the same width.
Types of Graphs: Line Plot – It shows the frequency of data on a given number line. An 'x' is placed above the number line each time that data value occurs.
Types of Graphs: Circle Graph – Also known as a pie chart, it shows the relationship of the parts to the whole. The whole circle represents 100%, and each sector is drawn in proportion to the part it represents.
Types of Graphs: Stem and Leaf Plot – In a stem and leaf plot, the data are organized from least value to greatest value. The digits of the least place value form the leaves, and the next place value digits form the stems.
Types of Graphs: Box and Whisker Plot – The plot summarizes the data by dividing it into four parts. The box and whiskers show the range (spread) and the middle (median) of the data.
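Several of these graphs can be drawn in a few lines with matplotlib, assuming it is installed; the 200 scores below are randomly generated, so the exact picture will vary:

    import random
    import matplotlib.pyplot as plt

    random.seed(11)
    scores = [random.gauss(70, 10) for _ in range(200)]

    fig, axes = plt.subplots(1, 3, figsize=(12, 3))

    axes[0].hist(scores, bins=10)  # histogram: frequency per interval
    axes[0].set_title("Histogram")

    axes[1].boxplot(scores)        # box and whisker: median and spread
    axes[1].set_title("Box and Whisker")

    axes[2].plot(sorted(scores))   # line graph of the ordered scores
    axes[2].set_title("Line Graph")

    plt.tight_layout()
    plt.show()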
A trimmed mean (similar to an adjusted mean) is a method of averaging that removes a small designated percentage of the largest and smallest values before calculating the mean. After removing the specified outlier observations, the trimmed mean is found using a standard arithmetic averaging formula. The use of a trimmed mean helps eliminate the influence of outliers or data points on the tails that may unfairly affect the traditional or arithmetic mean.
The coefficient of variation (CV) is the ratio of the standard deviation to the mean and shows the extent of variability in relation to the mean of the population. The higher the CV, the greater the dispersion.
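A closing sketch of both ideas, assuming SciPy for the trimmed mean; the data are made up, with 95 as a deliberate outlier:

    import statistics
    from scipy import stats

    data = [12, 14, 15, 15, 16, 17, 18, 19, 20, 95]  # 95 is an outlier

    # Trimmed mean: drop the top and bottom 10% of values, then average.
    print(stats.trim_mean(data, proportiontocut=0.10))  # 16.75

    # Coefficient of variation: standard deviation relative to the mean.
    cv = statistics.pstdev(data) / statistics.mean(data)
    print(f"{cv:.1%}")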
ACTIVITY 01.
Instruction: Complete the crossword puzzle below.