Lecture 11
Basic Statistical Concepts
In this lecture, we will first go over the basic terms and concepts in statistics. This will make future lectures easier to follow.
Authors usually define statistics in the following manner.
Statistics is the branch of science which deals with the collection, organization, analysis, interpretation, and presentation of data.
That's quite a lot. Let's break this definition into pieces.
Etymology
(just trivia, no need to memorize)
"Statistics" may have come from
German "Statistik" introduced by the philosopher G. Achenwall to mean the study of data about the state [1].
Latin "status" meaning "condition"
Ancient Greek "στατιστική" (statistiké) meaning "pertaining to the state"
Italian "statista" meaning "statesman" or "politician"
Usage
The word statistics may be used in three different ways:
statistics as a branch of science (always in plural form)
statistic as any measurement taken from a sample such as sample mean, standard deviation (may take singular form)
Statistics as a course/subject you take in College/High School (proper noun, i.e., first letter always capital)
Statistics is not a branch of mathematics. It is related to mathematics (it belongs to a class of sciences called mathematical sciences) and it uses mathematics, but it is definitely not mathematics.
Mathematics is concerned with patterns, structures, numbers, space, and change, all independent of the real world. (Guy Ritchie once paraphrased Aristotle, "Are days and miles necessary? Numbers can exist without them.") It is fundamentally deductive: it starts with axioms and uses logic to derive theorems. Truth is absolute in mathematics.
Statistics, on the other hand, is concerned with data, and data is tied to the real world. Its reasoning is inductive: it uses data as evidence to support its conclusions. Therefore, its results are supported by likelihood and not by absolute truth.
So,
Mathematics is concerned with patterns but statistics is concerned with data.
Mathematics (i.e., pure mathematics) is independent of the real world but statistics is tied to the real world.
Mathematics is deductive but statistics is inductive.
Mathematics proves statements using axioms through logic; statistics supports statements using data as evidence.
Mathematics concludes with absolute truth; statistics concludes with probable truth.
Data collection is the process of systematically obtaining observations or measurements. It includes the sampling procedure and actual data collection such as surveys and interviews.
Raw data is messy. Organization is the process of structuring, classifying, and summarizing raw data into a coherent form suitable for analysis and interpretation. It includes cleaning, coding, and tabulating the data.
Analysis is the application of statistical and mathematical methods in order to identify patterns, estimate quantities, test hypotheses, or make predictions. This is where the calculations or the use of statistical software happens. It will make up the majority of our statistical work in this class.
The analysis produces results, and those results have meaning. Interpretation is the process of assigning meaning to statistical results by relating them to the research question, assumptions, and context.
We should tell people our statistical results!
Presentation is the communication of data and statistical findings through tables, graphs, and narratives.
So don't forget the five steps of the statistical process: collection, organization, analysis, interpretation, and presentation.
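To make the five steps (collection, organization, analysis, interpretation, presentation) more concrete, here is a minimal sketch in Python. The raw entries are hypothetical and made up only for illustration, and the variable names are our own. It is one possible way to walk through the steps, not a prescribed procedure.

```python
from statistics import mean, stdev

# 1. Collection: raw responses as they might arrive from a survey form.
#    These entries are hypothetical and only for illustration.
raw_weights_kg = ["45", " 52", "61", "", "48", "56 "]

# 2. Organization: clean the raw entries (drop blanks), code them as
#    numbers, and arrange them in order.
weights_kg = sorted(float(w) for w in raw_weights_kg if w.strip())

# 3. Analysis: compute summary statistics from the sample.
sample_mean = mean(weights_kg)
sample_sd = stdev(weights_kg)  # sample standard deviation (n - 1 denominator)

# 4. Interpretation: relate the numbers back to the question being asked.
interpretation = (
    f"A typical respondent weighs about {sample_mean:.1f} kg, "
    f"give or take roughly {sample_sd:.1f} kg."
)

# 5. Presentation: communicate the findings, here as a small text table.
print(f"{'n':<6}{'mean (kg)':<12}{'sd (kg)':<10}")
print(f"{len(weights_kg):<6}{sample_mean:<12.1f}{sample_sd:<10.1f}")
print(interpretation)
```

Notice that most of the code sits in the organization and analysis steps, while the interpretation and presentation steps are mostly about words and layout.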
These terms might be considered the most basic in statistics: data, variable, observation, dataset, and element.
Data consists of recorded observations or measurements that represent attributes, outcomes, or events associated with a phenomenon of interest. It is any fact or information collected for the purpose of solving a certain problem.
Well, sometimes, data is collected for a different purpose. For example, banks collect depositors' data for a number of purposes, including security. But in a statistical process, data is collected for problem solving.
Take note that data has no plural form of its own. (There is no such word as "datas.") In everyday usage it is treated as an uncountable noun, although it began as the plural of the Latin "datum."
A variable is a characteristic or attribute that can take different values across observations. Examples include hair color, height, weight, and age.
The elements are the units or entities from which the data was collected. For example, if data was collected from a sample of students, then these students are the elements.
An observation is a single recorded instance. It is the value of a single variable for a certain element. It may also be called a score.
A dataset (some authors write "data set") is the set of observations under one or more variables.
In this example,
Sex, Age, and Weight (kg) are the variables
Juan, Pedro, Bartolome, Rhodora, and Edgar are the elements
19 is the observation from Pedro for the variable Age
{45, 52, 61, 48, 56} is the dataset under the variable Weight (kg)
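The same terms can be sketched in Python using only the values stated above. Two things here are assumptions of ours: the weights are paired with the names in the order listed, and the nested-dictionary layout is just one convenient way to store the values. The variable Sex and the other ages are left out because their values are not given in the text.

```python
# Elements (the students) and the weights listed in the example.
# Assumption: the weights are listed in the same order as the names.
elements = ["Juan", "Pedro", "Bartolome", "Rhodora", "Edgar"]
weight_kg = [45, 52, 61, 48, 56]

# Each variable maps an element to one observation.
# Only values stated in the text are filled in; the rest are left out.
dataset = {
    "Age": {"Pedro": 19},
    "Weight (kg)": dict(zip(elements, weight_kg)),
}

variables = list(dataset)                        # the variables stored here
observation = dataset["Age"]["Pedro"]            # one variable, one element
weights = list(dataset["Weight (kg)"].values())  # dataset under one variable

print(variables)
print(observation)
print(weights)
```

Reading `dataset["Age"]["Pedro"]` mirrors the definition of an observation: we pick one variable, then one element, and get back a single value.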
There are two types of variables:
Qualitative variables (also called categorical variables) are variables whose values represent categories or labels. Examples are hair color, sex, degree program;
Quantitative variables (also called numerical variables) are variables whose values represent numerical measurements. Examples include age, weight, final grade.
Furthermore, there are two types of quantitative variables:
Discrete variables are quantitative variables that take countable values. They usually (though not always) answer the question, "how many?" The values of a discrete variable may be represented by separate, individual points on a number line. Examples include number of children, age (in number of years), number of cars.
Continuous variables are quantitative variables that can take any value in an interval. They usually (though not always) answer the question, "how much?" The values of a continuous variable may be represented by an unbroken line segment, that is, an interval. For any two values, there exist infinitely many values between them. Examples include weight, height, duration (as in length of time).
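The difference between the two can be illustrated with a small sketch. For a continuous variable such as weight, we can keep finding a new value between any two values. For a discrete variable such as number of children, there is nothing between two consecutive counts. The helper function `midpoint` and the specific numbers are our own choices for illustration.

```python
from fractions import Fraction

def midpoint(a, b):
    """Return the value halfway between a and b as an exact fraction."""
    return (Fraction(a) + Fraction(b)) / 2

# Continuous: between any two weights there is always another possible weight.
low, high = Fraction(52), Fraction(53)
between = []
for _ in range(5):
    high = midpoint(low, high)  # keep halving the gap above 52 kg
    between.append(high)

# Discrete: number of children takes only whole-number values,
# so no valid count lies strictly between 2 and 3.
counts_between = [n for n in range(2, 4) if 2 < n < 3]

print([float(x) for x in between])
print(counts_between)
```

The loop could run forever and still produce new weights between 52 and 53, while the list of counts between 2 and 3 stays empty.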
[1] Wikipedia contributors. (n.d.). History of statistics. In Wikipedia. https://en.wikipedia.org/wiki/History_of_statistics