Statistics and probability are used in almost all fields of human endeavor. Most often, they are used to analyze the results of surveys and serve as tools in scientific research for making decisions about controlled experiments. Statistics is defined as the science of conducting studies to collect, organize, summarize, analyze, and draw conclusions from data.
As civil engineering students, we study statistics to understand statistical studies performed in the realm of civil engineering, to apply statistical procedures when conducting research, and to draw inferences from data and make intelligent decisions related to the civil engineering profession.
Statisticians usually collect information for variables, where a variable is a characteristic or attribute that can assume different values (called data). A collection of data values forms a data set, and each value in the data set is called a data value or datum.
Statistics is also divided into two main areas, which are as follows:
Descriptive Statistics - Consists of the collection, organization, summarization, and presentation of data.
Inferential Statistics - consists of generalizing from samples to populations, performing estimations and hypothesis tests, determining relationships among variables, and making predictions.
Here are some of the important terms used in statistics. Note that it is important to distinguish between the terms listed below.
Population - Set of all entities under study
Sample - Subset of population
Unit - Individual object or person in the population
Parameters - All descriptive measures or characteristics of a population
Statistics - Characteristics of a sample
Findings - These are the results of an investigation.
Conclusion - It is an opinion based on findings; a generalization about the population based on the results of the investigation of the samples.
Inference - It is an educated guess or a meaningful prediction based on findings and conclusions.
Census - It is the process of gathering information from each element of the population.
Survey - It is the process of getting information from every element in the sample.
Variables can be classified as qualitative or quantitative. Qualitative variables are variables that can be placed into distinct categories, according to some characteristic or attribute. Quantitative variables are numerical and can be ordered or ranked. Quantitative variables can be further classified into two groups: discrete and continuous.
Discrete variables can be assigned values such as 0, 1, 2, 3 and are said to be countable.
Continuous variables, by comparison, can assume an infinite number of values in an interval between any two specific values.
Variables can be classified into three types:
Independent Variable - It is used to describe or explain the differences in the dependent variables.
Dependent Variable - An outcome of interest that is observed and measured by a researcher in order to assess the effects of the independent variable.
Nuisance Variable - A random variable that is fundamental to the probabilistic model, but that is of no particular interest in itself or is no longer of interest.
Variables can be classified by how they are categorized, counted, or measured. This type of classification uses measurement scales, and four common types of scales are used: nominal, ordinal, interval, and ratio.
Nominal Level of Measurement - classifies data into mutually exclusive (nonoverlapping), exhaustive categories in which no order or ranking can be imposed on the data.
Ordinal Level of Measurement - classifies data into categories that can be ranked; however, precise differences between the ranks do not exist.
Interval Level of Measurement - ranks data, and precise differences between units of measure do exist; however, there is no meaningful zero.
Ratio Level of Measurement - possesses all the characteristics of interval measurement, and there exists a true zero. In addition, true ratios exist when the same variable is measured on two different members of the population.
Major considerations in collecting data include the nature of the problem, objectives of the researcher, types of data required, and sources of data.
Data sources may be primary, coming from primary sources such as government agencies, business establishments, or individuals that hold original data, or secondary, obtained from secondary sources such as newspapers, magazines, journals, and republished material.
Data can be collected in a variety of ways. One of the most common methods is through the use of surveys. Surveys can be done by using a variety of methods. Three of the most common methods are the telephone survey, the mailed questionnaire, and the personal interview.
Other methods of collecting data are shown below:
Observation Method
Experimental Method
Use of Existing Studies
Registration Method
Listed below are the two types of data organization:
Stem-and-Leaf Plot - A data plot that uses part of each data value as the stem and part as the leaf to form groups or classes. It is an alternative method for describing a set of data and presents a histogram-like picture of the data while allowing the experimenter to retain the actual observed value of each data point.
Frequency Distribution - An organized set of numbers representing the frequency of observations that fall within a specific category or class of a variable.
Two types of frequency distributions that are most often used are the categorical frequency distribution (ungrouped frequency distribution) and the grouped frequency distribution.
The categorical frequency distribution is used for data that can be placed in specific categories, such as nominal or ordinal-level data.
Grouped frequency distribution is a frequency distribution in which the values of the variable have been grouped into classes.
Listed below is the procedure for constructing a grouped frequency distribution (a brief computational sketch follows these steps).
Decide on the number of class intervals to use (between 5 and 15). Sturges' formula may be used as a guide.
Find the range (the difference between the highest value and the lowest value in the data set).
Divide the range by the desired number of class intervals. Round the result up to the next whole unit if the scores to be grouped are whole numbers; otherwise, round it up to the next number with the same number of decimal places as the given measurements. The resulting number is called the interval size, class size, or class width.
Choose an appropriate lower limit for the first class interval. This number is less than or equal to the lowest value in the data; it is convenient to use a lower limit that is divisible by the class width. Add the class width to obtain the next lower class limit, and keep adding the class width to get all the remaining lower class limits.
Find the upper class limits. If the class size was rounded off to the units place, subtract one from the second lower class limit to arrive at the first upper class limit; subtract 0.1 if it was rounded to the tenths place, and 0.01 if it was rounded to the hundredths place.
Determine the class boundaries. The class boundaries are the true limits of a class interval, made up of the lower class boundary and the upper class boundary. Each class boundary lies midway between the upper limit of one class and the lower limit of the next higher class interval.
Find the class mark or midpoint of each class interval.
Tally the raw score and indicate the frequency for each of the class intervals.
Add the frequencies and indicate the sum.
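As a rough illustration of the steps above, here is a minimal Python sketch that builds a grouped frequency distribution for a small made-up set of whole-number scores; the data values are hypothetical, and Sturges' formula is taken in its common form k = 1 + 3.322 log10(n).

import math

# Hypothetical whole-number scores (for illustration only)
data = [12, 15, 22, 8, 31, 27, 19, 24, 35, 14, 29, 21, 18, 26, 33]

n = len(data)
num_classes = round(1 + 3.322 * math.log10(n))     # Sturges' formula as a guide
data_range = max(data) - min(data)                 # highest value minus lowest value
class_width = math.ceil(data_range / num_classes)  # round up to the next whole unit

lower = min(data)                                  # first lower class limit
classes = []
for _ in range(num_classes):
    upper = lower + class_width - 1                # upper class limit for whole-number data
    freq = sum(lower <= x <= upper for x in data)  # tally the raw scores
    midpoint = (lower + upper) / 2                 # class mark
    classes.append((lower, upper, midpoint, freq))
    lower += class_width                           # next lower class limit

for lo, hi, mid, f in classes:
    print(f"{lo}-{hi}  midpoint={mid}  frequency={f}")
print("Total frequency:", sum(c[3] for c in classes))

For whole-number data such as this, the class boundaries would then lie 0.5 below each lower limit and 0.5 above each upper limit.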
Listed below are the three types of data presentation:
Textual Method - Data are presented in paragraph form. This method is effective when the objective is to call the reader's attention to some data that require special emphasis.
Tabular Method - Data are presented in rows and columns. It is more convenient and understandable than the textual method because numerical information is displayed in a concise and systematic manner using vertical and horizontal lines with corresponding headings.
Graphical Method - The most effective way of presenting statistical data because important relationships are brought out clearly. Comparisons and trends of quantitative values are readily seen, making it easy to communicate results or information.
The graphs most commonly used when organizing data into a frequency distribution are the following:
Histogram
Frequency Polygon
The cumulative frequency graph, or ogive
Frequency Curve
The histogram is a graph that displays the data by using contiguous vertical bars (unless the frequency of a class is 0) of various heights to represent the frequencies of the classes.
The frequency polygon is a graph that displays the data by using lines that connect points plotted for the frequencies at the midpoints of the classes. The frequencies are represented by the heights of the points.
The ogive is a graph that represents the cumulative frequencies for the classes in a frequency distribution.
The frequency curve is a smoothed frequency polygon obtained by increasing the number of class intervals and consequently decreasing the interval size.
Frequency curves can be bell-shaped or skewed (positively skewed or negatively skewed).
To obtain data and information that are reliable and realistic, a sample should not be selected in a haphazard way. A sampling technique is a procedure used to determine which elements are to be included in the sample; this procedure is called a random sampling technique or probability sampling technique. Sample size can be determined using Slovin's formula (sketched after the list of methods below). Shown below are the different random sampling methods used.
Systematic Sampling
Stratified Sampling
Cluster Sampling
Multi-Stage Sampling
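Since Slovin's formula itself is not written out above, a standard statement of it is n = N / (1 + N·e²), where n is the sample size, N is the population size, and e is the desired margin of error. For example, for a population of N = 1,000 and a 5% margin of error, n = 1000 / (1 + 1000(0.05)²) = 1000 / 3.5, or approximately 286 respondents.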
On the other hand, non-probability sampling is a sampling technique in which the participants of the investigation are not all given an equal chance of being selected. Certain parts of the overall group are deliberately not included in the selection of the representative subgroup; therefore, judgment is used in selecting the items to be included in the subgroup. Non-random sampling is classified into three types.
Data can be summarized by finding averages. The average refers to the center of the distribution. Measures of the average are also called measures of central tendency and include the mean, median, mode, and midrange. On the other hand, measures that determine the spread of the data values are called measures of variation, or measures of dispersion; these include the range, variance, and standard deviation. Lastly, another set of measures needed to describe data are called measures of position. They tell where a specific data value falls within the data set, or its relative position in comparison with other data values. The most common measures of position are percentiles, deciles, and quartiles. These measures are sometimes referred to as norms.
The mean, also known as the arithmetic average, is found by adding the values of the data and dividing by the total number of values. The mean is appropriate for determining the central tendency of interval or ratio data. X̄ is used to represent the mean of a sample, and μ is used to denote the mean of a population.
The weighted mean is useful when the various classes or groups contribute differently to the total. It is found by multiplying each value by its corresponding weight and dividing by the sum of the weights.
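As a small sketch of both computations, the following Python snippet uses made-up grades and unit weights (all values are hypothetical, chosen only for illustration).

# Hypothetical course grades and their unit weights
values  = [85, 90, 78, 92]
weights = [3, 2, 3, 1]

mean = sum(values) / len(values)                               # arithmetic mean
weighted_mean = sum(v * w for v, w in zip(values, weights)) / sum(weights)

print(f"Mean: {mean:.2f}")            # 86.25
print(f"Weighted mean: {weighted_mean:.2f}")   # 84.56, since heavier-unit courses count more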
The median is the midpoint of the data array; the symbol for the median is MD. When a data set is ordered, whether ascending or descending, it is called a data array. The median is an appropriate measure of central tendency for data that are ordinal or above, but it is especially valuable for ordinal data.
The value that occurs most often in a data set is called the mode. A data set that has only one value that occurs with the greatest frequency is said to be unimodal. If a data set has two values that occur with the same greatest frequency, both values are considered to be the mode and the data set is said to be bimodal. If a data set has more than two values that occur with the same greatest frequency, each value is used as the mode, and the data set is said to be multimodal. When no data value occurs more than once, the data set is said to have no mode.
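A minimal Python sketch of the median and mode, using the standard library statistics module on a made-up data set:

import statistics

data = [3, 7, 7, 2, 9, 7, 4, 2]            # hypothetical observations

ordered = sorted(data)                     # the data array
print("Data array:", ordered)
print("Median (MD):", statistics.median(data))   # midpoint of the data array
print("Mode:", statistics.mode(data))            # value occurring most often
print("All modes:", statistics.multimode(data))  # handles bimodal/multimodal sets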
Measures of Variability are statistics that describe the amount of difference and spread in a data set. For the spread or variability of a data set, three measures are commonly used: range, variance, and standard deviation.
The range is the simplest of the three measures and is defined as the highest value minus the lowest value. The symbol R is used for the range.
To have a more meaningful statistic to measure the variability, statisticians use measures called the variance and standard deviation.
The variance is the average of the squares of the distances of each value from the mean. The symbol for the population variance is σ² (σ is the Greek lowercase letter sigma).
The standard deviation is the square root of the variance. The symbol for the population standard deviation is σ.
The rounding rule for the standard deviation is the same as that for the mean. The final answer should be rounded to one more decimal place than that of the original data.
When computing the variance of a sample, the formula given above is not usually used, since in most cases the purpose of calculating the statistic is to estimate the corresponding parameter. The expression does not give the best estimate of the population variance when the population is large and the sample is small. Therefore, instead of dividing by n, the variance of the sample is found by dividing by n - 1, giving a slightly larger value and an unbiased estimate of the population variance.
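The following Python sketch contrasts dividing by n with dividing by n - 1 on a made-up sample (the numbers are hypothetical):

import math

sample = [10, 12, 23, 23, 16, 23, 21, 16]
n = len(sample)
mean = sum(sample) / n

ss = sum((x - mean) ** 2 for x in sample)   # sum of squared deviations from the mean

pop_var = ss / n             # population-style variance, dividing by n
samp_var = ss / (n - 1)      # sample variance s^2, dividing by n - 1 (slightly larger)
samp_std = math.sqrt(samp_var)

print(f"Dividing by n:     {pop_var:.2f}")
print(f"Dividing by n - 1: {samp_var:.2f}")
print(f"Sample std dev:    {samp_std:.2f}")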
The procedure for finding the variance and standard deviation for grouped data is similar to that for finding the mean for grouped data, and it uses the midpoints of each class.
When two samples have the same units of measures, the variance and standard deviation for each can be compared directly. A statistic that allows the statistician to compare standard deviations when the units are different is called the coefficient of variation.
The coefficient of variation, denoted by CVar, is the standard deviation divided by the mean. The result is expressed as a percentage.
The range can be used to approximate the standard deviation. The approximation is called the range rule of thumb. If the range is divided by four, an approximate value for standard deviation is obtained.
The range rule of thumb is only an approximation and should be used when the distribution of data values is unimodal and roughly symmetric. The range rule of thumb can be used to estimate the largest and smallest data values of a data set. The smallest data value will be approximately 2 standard deviations below the mean, and the largest data value will be approximately 2 standard deviations above the mean of the data set.
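A small Python sketch of the coefficient of variation and the range rule of thumb, again with made-up data:

import statistics

data = [55, 60, 62, 70, 58, 65, 72, 68]    # hypothetical sample

mean = statistics.mean(data)
std = statistics.stdev(data)               # sample standard deviation

cvar = std / mean * 100                    # coefficient of variation, as a percentage
approx_std = (max(data) - min(data)) / 4   # range rule of thumb: s is roughly range / 4

print(f"CVar: {cvar:.1f}%")
print(f"Std dev: {std:.2f}, range-rule estimate: {approx_std:.2f}")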
The proportion of values from a data set that will fall within k standard deviations of the mean will be at least 1 - 1/k², where k is a number greater than 1 (k is not necessarily an integer). This is known as Chebyshev's theorem; for example, with k = 2, at least 1 - 1/2² = 75% of the data values fall within 2 standard deviations of the mean.
When a distribution is bell-shaped (or what is called normal), the following statements, which make up the empirical rule, are true.
Approximately 68% of the data values will fall within 1 standard deviation of the mean.
Approximately 95% of the data values will fall within 2 standard deviations of the mean.
Approximately 99.7% of the data values will fall within 3 standard deviations of the mean.
In addition to measures of central tendency and measures of variation, there are measures of position or location. These measures include standard scores, percentiles, deciles, and quartiles. They are used to locate the relative position of a data value in the data set.
A standard score or z score tells how many standard deviations a data value is above or below the mean for a specific distribution of values. If a standard score is zero, then the data value is the same as the mean. It is obtained by subtracting the mean from the value and dividing the result by the standard deviation. The symbol for a standard score is z.
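In symbols, the standard score is commonly written as z = (x - X̄) / s for a sample value and z = (x - μ) / σ for a population value. For example, a test score of 85 from a distribution with mean 70 and standard deviation 10 has z = (85 - 70) / 10 = 1.5, meaning the score lies 1.5 standard deviations above the mean.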
Percentiles are position measures used in educational and health-related fields to indicate the position of an individual in a group. They divide the data set into 100 equal groups.
Quartiles divide the distribution into four groups, separated by Q1, Q2, Q3.
In addition to dividing the data set into four groups, quartiles can be used as a rough measurement of variability. The interquartile range (IQR) is defined as the difference between Q1 and Q3 and is the range of the middle 50% of the data.
Deciles divide the distribution into 10 groups, as shown in the right. They are denoted by D1, D2, etc. D1 corresponds to P10; D2 corresponds to P20; etc. Deciles can be found by using the formulas given for percentiles.
An outlier is an extremely high or an extremely low data value when compared with the rest of the data values.
Shown in the left figure is the procedure on how to identify outliers in a given data set.
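Since the referenced figure is not reproduced here, the commonly used procedure is the 1.5 × IQR rule: a value is flagged as an outlier if it is less than Q1 - 1.5(IQR) or greater than Q3 + 1.5(IQR). A minimal Python sketch with made-up data:

import statistics

data = [5, 7, 8, 12, 13, 14, 18, 21, 23, 50]   # 50 is a suspiciously large value

q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")  # quartiles Q1, Q2, Q3
iqr = q3 - q1                                   # interquartile range, the middle 50%
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower_fence or x > upper_fence]
print("Q1, Q2, Q3:", q1, q2, q3)
print("IQR:", iqr)
print("Outliers:", outliers)

The same statistics.quantiles function with n=10 or n=100 returns the cut points for deciles or percentiles, respectively.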
Probability is defined as the chance of an event occurring, as in games of chance such as cards, slot machines, and even lotteries. Probability is also used in insurance and weather forecasting. It is likewise a basis of inferential statistics, where predictions are based on probability and hypotheses are tested using probability.
For a person to determine the number of all possible outcomes for a sequence of events, three rules can be used. These are the fundamental counting rule, permutation rule, and the combination rule.
Sum Rule - Suppose that an event can be performed by either of two different procedures with m possible outcomes for the first procedure and n possible outcomes for the second. If the two sets of possible outcomes are disjoint, then the number of possible outcomes for the event is m+n.
Product Rule - In a sequence of events in which the first event has n1 possibilities, the second event has n2, the third has n3, and so forth, the total number of possibilities for the sequence is n1 x n2 x n3 x ... x nk.
A permutation is an arrangement of all or part of a number of things (or objects) in a definite order. The number of permutations of n objects taken r at a time is given by the permutation formula, sketched below.
A combination is a grouping or selection of all or part of a number of things or objects without reference to the arrangement of the things selected. The number of combinations of n objects taken r at a time is given by the combination formula, also sketched below.
Combinations are used when the order or arrangement is not important, as in the selecting process.
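In standard form, the two formulas are nPr = n! / (n - r)! and nCr = n! / (r!(n - r)!). A minimal Python sketch with made-up numbers:

import math

n, r = 5, 3
print("5P3 =", math.perm(n, r))   # 5!/(5-3)! = 60 ordered arrangements
print("5C3 =", math.comb(n, r))   # 5!/(3!2!) = 10 unordered selections

Here 5P3 counts ordered arrangements while 5C3 counts unordered selections, which is why permutations apply when the order matters and combinations when it does not.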
The theory of probability grew out of the study of various games of chance using coins, dice, and cards. Processes such as flipping a coin, rolling a die, or drawing a card from a deck are called probability experiments.
A probability experiment is a chance process that leads to well-defined results called outcomes.
An outcome is the result of a single trial of a probability experiment.
A sample space is the set of all possible outcomes of a probability experiment.
Examples of sample spaces for given probability experiments are shown in the left figure.
A simple event is an event that includes one and only one of the outcomes of an experiment and is denoted by E.
On the other hand, a compound event is a collection of more than one outcome of an experiment; it is also called a composite event.
Probability as a general concept can be defined as the chance of an event occurring. An event that cannot occur has a probability of zero and is called an impossible event, while an event that is certain to occur has a probability equal to 1 and is called a sure event. There are four basic probability rules that are helpful in solving probability problems.
The probability of any event E is a number (either a fraction or decimal) between and including 0 and 1. This is denoted by 0 ≤ P(E) ≤ 1.
If an event E cannot occur (i.e., the event contains no members in the sample space), its probability is 0.
If an event E is certain, then the probability of E is 1.
The sum of the probabilities of all the outcomes in the sample space is 1.
The complement of an event E is the set of outcomes in the sample space that are not included in the outcomes of event E. The complement of E is denoted by Ē (read “E bar”)
Shown in the left figure are the rules for complementary events
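Since that figure is not reproduced here, the standard rules for complementary events are: P(Ē) = 1 - P(E), P(E) = 1 - P(Ē), and P(E) + P(Ē) = 1. For example, if the probability of rain is 0.3, the probability of no rain is 1 - 0.3 = 0.7.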
Empirical probability is the type of probability that uses frequency distributions based on observations to determine the numerical probabilities of events.
Subjective probability uses a probability value based on an educated guess or estimate, employing opinions and inexact information.
Shown in the right figure are the addition rules for probability.
Shown in the left figure are the multiplication rules for probability.
Marginal Probability is the probability of a single event without consideration of any other event; it is also called single probability.
Conditional Probability is a probability that an event will occur given that another event has already occurred.
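Since those figures are not reproduced here, the standard forms of the rules above are: for mutually exclusive events, P(A or B) = P(A) + P(B); in general, P(A or B) = P(A) + P(B) - P(A and B); for independent events, P(A and B) = P(A) · P(B); in general, P(A and B) = P(A) · P(B | A); and the conditional probability of B given A is P(B | A) = P(A and B) / P(A).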
A discrete probability distribution consists of the values a random variable can assume and the corresponding probabilities of the values. Probabilities are determined theoretically or by observation.
A random variable is a function or rule that assigns a number to each outcome of an experiment (a chance variable). Random variables can be discrete or continuous. A discrete random variable assumes values that can be counted, while a continuous random variable can assume all values between any two specific values; it is obtained by measuring, and its values are contained in one or more intervals.
The mean, variance, and standard deviation for a probability distribution are computed differently from the mean, variance, and standard deviation for samples. The formulas used for computing the mean and variance of a probability distribution, as well as the expected value of a discrete random variable, are shown in the left figure.
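Since that figure is not reproduced here, the standard formulas are μ = Σ[x · P(x)], σ² = Σ[x² · P(x)] - μ², and E(X) = Σ[x · P(x)]. A minimal Python sketch with a made-up distribution:

# Hypothetical discrete probability distribution (probabilities must sum to 1)
values = [1, 2, 3, 4]
probs  = [0.1, 0.2, 0.3, 0.4]

mean = sum(x * p for x, p in zip(values, probs))                    # mu = sum of x * P(x)
variance = sum(x ** 2 * p for x, p in zip(values, probs)) - mean ** 2
expected_value = mean                                               # E(X) equals the mean

print(f"Mean: {mean:.2f}, Variance: {variance:.2f}, E(X): {expected_value:.2f}")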
If a probability experiment has only two outcomes, or outcomes that can be reduced to two (with each outcome considered as either a success or a failure), it is called a binomial experiment.
The outcomes of a binomial experiment and the corresponding probabilities of these outcomes are called a binomial distribution.
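For reference, the standard form is: in a binomial experiment with n trials, probability of success p, and q = 1 - p, the probability of exactly x successes is P(X = x) = nCx · p^x · q^(n - x), with mean μ = n·p and variance σ² = n·p·q. For example, the probability of exactly 2 heads in 3 tosses of a fair coin is 3C2 · (0.5)² · (0.5)¹ = 0.375.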
A discrete probability distribution that is useful when n is large and p is small and when the independent variables occur over a period of time is called the Poisson distribution.
In addition to being used for the stated conditions (i.e., n is large, p is small, and the variables occur over a period of time), the Poisson distribution can be used when a density of items is distributed over a given area or volume.
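For reference, the standard form of the Poisson distribution is P(X = x; λ) = (e^(-λ) · λ^x) / x!, where λ is the mean number of occurrences per interval (of time, area, or volume); its mean and variance are both equal to λ. For example, if calls arrive at an average of λ = 2 per minute, the probability of no calls in a given minute is e^(-2), or about 0.135.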
CLO1 overall discusses the methods of data collection, presentation, and organization, sampling techniques, ways of describing data (finding averages), and probability distributions. These topics are highly important in analyzing the results of surveys in order to make decisions based on controlled experiments. Through this, I learned how to extract important information from raw data, organize it into a frequency distribution, and present data in different graphs.
After our thorough discussion on descriptive statistics, I have learned that there are three groups of measures used to describe a data set. These are the measures of central tendency, which include the mean, median, and mode; the measures of variation, consisting of the range, variance, and standard deviation; and the measures of position, which include percentiles, deciles, and quartiles. Also, I have grasped various concepts on probability and counting rules. These concepts include probability experiments, sample spaces, the addition and multiplication rules, and the probabilities of complementary events. I also learned the rules for counting, the differences between permutations and combinations, how to identify them based on the sample problems given, and how to figure out how many different combinations exist for specific situations.