Unit 6: Statistical Analysis and Data Presentation

Statistical methods are extremely important in engineering, because they provide a means for representing large amounts of data in a concise form that is easily interpreted and understood. Usually, the data are represented with a statistical distribution function that can be characterized by a measure of central tendency and measures of dispersion.

Measuring central tendency

Measuring dispersion

Graphing

In general, in graphs that present experimental data, the x-axis represents the independent variable and the y-axis represents the dependent variable. Graphs allow experimenters to analyse unsystematic raw data and help other scientists interpret the experimental results. Different experiments call for different graphs. Because the choice depends on the specific experiment, not every type is illustrated in this section; only the common forms of graphs are described below.

Shape of graphs

These are the more professional terms used to describe the shape, and hence the trend, of a set of experimental data.

Best-fit line

A best-fit line is a straight line that shows the trend of the data. Note that it may not pass through any of the points, but it does show the overall trend of the data set.

The example below shows the best-fit line (purple solid line) of Hong Kong’s population over the past 50 years. As you can see, the line extends beyond the last data point. The extended part is a forecast of the population of Hong Kong based on the previous data.
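As a rough illustration of how such a line can be obtained, the sketch below fits a straight line by ordinary least squares and extends it past the last data point to make a forecast. Least squares is assumed here as the fitting method; the function name `best_fit_line` and the numbers are hypothetical and are not the Hong Kong population data.

```python
def best_fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares straight line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations in x, and of cross deviations in x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical measurements, for illustration only.
years = [0, 10, 20, 30, 40]
values = [2.0, 4.1, 5.9, 8.2, 9.8]

slope, intercept = best_fit_line(years, values)

# Extending the line beyond the last data point gives a forecast.
forecast = slope * 50 + intercept
print(f"y = {slope:.3f} * x + {intercept:.3f}")
print(f"forecast at x = 50: {forecast:.2f}")
```

A forecast of this kind is only as good as the assumption that the trend stays linear beyond the measured range.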

Exponential and logarithm

In biology experiments, exponential and logarithmic progressions are often used to represent growth or to set the values of independent variables.

The growth of bacteria is a good example. Bacteria reproduce and multiply by binary fission, which means each division produces 2 individuals. Suppose the population of bacteria doubles every 3 hours and you inoculate n bacteria on a fresh culture at 12:00 pm. At 3 pm you will have 2n bacteria, at 6 pm you will have 4n bacteria, at 9 pm you will have 8n bacteria, and so on. If this were the case, the growth process would be geometric. A geometric growth model predicts that the population increases at discrete time points. In other words, the population does not increase by a constant amount in each interval; it increases by a constant factor, which is exponential growth.
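The doubling example can be sketched in a few lines of Python. The function name `population` and the starting count of 100 are hypothetical; the 3-hour doubling time is taken from the example above.

```python
def population(n0, hours, doubling_time=3.0):
    """Population after `hours`, starting from n0 and doubling every `doubling_time` hours."""
    return n0 * 2 ** (hours / doubling_time)


n0 = 100  # hypothetical number of bacteria inoculated at 12:00 pm
for hours in (0, 3, 6, 9):
    print(f"{hours} h after inoculation: {population(n0, hours):.0f} bacteria")
```

Each step multiplies the previous value by 2, so the printed counts follow the n, 2n, 4n, 8n pattern described in the text.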


We usually use log graphs to represent measurements against a log-progressive concentration. In a log-linear graph, one of the axes (usually the x-axis) progresses in powers of 10. For example, successive uniform intervals of the axis may represent 10^-10, 10^-8, 10^-6, 10^-4, 10^-2, and 1, increasing in concentration. The dose-response graph below shows the variation of percentage response against the log of the drug concentration.
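To see why a logarithmic axis spaces these concentrations evenly, the short sketch below takes the base-10 logarithm of each value. The variable names are hypothetical; the concentrations are the ones listed in the text.

```python
import math

# Concentrations that progress in powers of 10, as in the text.
exponents = [-10, -8, -6, -4, -2, 0]
concentrations = [10.0 ** e for e in exponents]

# Position of each concentration on a log10 axis.
log_positions = [math.log10(c) for c in concentrations]

# Distance between neighbouring points on the log axis.
steps = [b - a for a, b in zip(log_positions, log_positions[1:])]

for c, p in zip(concentrations, log_positions):
    print(f"concentration {c:.0e} -> log10 = {p:.1f}")
```

On a linear axis these points would be crowded near zero; on the log axis every neighbouring pair is the same distance apart.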


Calibration

Often when we design and create experiments in our own lab, the instruments we use are not designed to measure directly the parameters we are interested in. The reading of the measuring instrument is therefore said to be against an unknown standard. We have to create a known standard so that we can compare our measurements to it and get sensible results.

For example, suppose we would like to measure the uric acid concentration in a medium using genetically engineered bacteria. We design and transform bacteria that express GFP (green fluorescent protein) according to the uric acid concentration in the medium. We introduce known concentrations of uric acid to the GM bacteria. The GFP expression, and hence the green fluorescence produced by the GM bacteria, is found to be directly proportional to the uric acid concentration in the medium. We measure and record the intensity of the green fluorescence emitted by the GM bacteria at each known concentration. From these values we plot a calibration curve of uric acid concentration against green fluorescence intensity. When we put the GM bacteria into a medium of unknown uric acid concentration, we can use the calibration curve to read off the uric acid concentration from the measured green fluorescence intensity.
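The calibration procedure can be sketched as a straight-line fit followed by an inversion. This is a minimal illustration that assumes a linear response, as described above. The function names, the units, and all numbers are hypothetical and idealised; they are not measured data.

```python
def fit_line(xs, ys):
    """Least-squares straight line through the points; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical standards: known uric acid concentrations (arbitrary units)
# and the fluorescence intensity recorded for each one.
known_concentrations = [0.0, 1.0, 2.0, 3.0, 4.0]
fluorescence = [5.0, 105.0, 205.0, 305.0, 405.0]

slope, intercept = fit_line(known_concentrations, fluorescence)


def concentration_from_fluorescence(intensity):
    """Invert the calibration line to estimate an unknown concentration."""
    return (intensity - intercept) / slope


# A sample of unknown concentration that gives a reading of 255.
estimated = concentration_from_fluorescence(255.0)
print(f"estimated concentration: {estimated:.2f}")
```

The estimate is only trustworthy for readings that fall within the range covered by the standards.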

Correlation coefficient

The correlation coefficient (r) measures the strength of a linear relationship. It captures only linear relationships, so it can be close to 0 even when a curved relationship exists. A positive correlation coefficient indicates a direct relationship, whereas a negative one indicates an inverse relationship. The closer the coefficient is to 0, the more poorly a straight line describes the data (bad fit); the closer it is to -1 or 1, the stronger the fit. Below are some examples of correlation coefficients. The graph in the lower right corner, where r = -0.99, has the best fit of all 6 graphs: with the magnitude of r so close to 1, the points lie very near the line. The graph in the upper left corner, on the other hand, has the worst fit: since r = 0, it demonstrates no linear fit.
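For illustration, the sketch below computes the Pearson correlation coefficient, which is assumed here to be the coefficient the text refers to. The function name `pearson_r` and the data sets are hypothetical, chosen to show a direct relationship, an inverse relationship, and a curved relationship that a straight line cannot describe.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


xs = [-2, -1, 0, 1, 2]
direct = [2 * x + 1 for x in xs]    # rises with x
inverse = [-2 * x + 1 for x in xs]  # falls as x rises
curved = [x ** 2 for x in xs]       # symmetric parabola, no linear trend

r_direct = pearson_r(xs, direct)
r_inverse = pearson_r(xs, inverse)
r_curved = pearson_r(xs, curved)
print(r_direct, r_inverse, r_curved)
```

The parabola gives a coefficient of zero even though y is completely determined by x, which is why a low r should not be read as "no relationship" without looking at the graph.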

Error calculation

Percentage error

When we calculate results that aim for known, exact values, percentage error is a useful tool for determining the accuracy of our calculations.

It can be calculated by:

percentage error = |experimental value - exact value| / |exact value| × 100%
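As a minimal sketch, assuming the usual definition (the absolute difference between the experimental and exact values, divided by the exact value, times 100), the calculation can be written in Python as follows. The function name and the sample numbers are hypothetical.

```python
def percentage_error(experimental, exact):
    """Percentage error of an experimental value relative to a known exact value."""
    if exact == 0:
        raise ValueError("percentage error is undefined when the exact value is 0")
    return abs(experimental - exact) / abs(exact) * 100.0


# Hypothetical example: a result of 9.5 where the exact value is 10.
error = percentage_error(9.5, 10.0)
print(f"percentage error: {error:.1f}%")
```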

Error bar

Error bars are graphical representations of the variability of data and are used on graphs to indicate the error or uncertainty in a measurement. When an estimator (typically a mean, or average) is based on a sample from a large population, error bars help depict how far the estimator is likely to be from the true value.


1. Calculate the estimator, i.e. the mean of the experimental data, by evaluating the following formula:

2. Calculate the standard deviation of the population by evaluating the following formula:

3. Divide the standard deviation by the square root of the sample size. This gives the standard error.

4. The standard error bar indicates mean ± standard error.
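The four steps above can be sketched with Python's standard library. This is an illustration under stated assumptions: the sample standard deviation (`statistics.stdev`, which divides by n - 1) is assumed, and `statistics.pstdev` would be used instead if the population formula is intended. The function name and the readings are hypothetical.

```python
import math
import statistics


def standard_error_bar(samples):
    """Return (mean, standard error, (lower, upper)) for a list of readings."""
    mean = statistics.mean(samples)            # step 1: the estimator
    sd = statistics.stdev(samples)             # step 2: sample SD (assumption: n - 1)
    se = sd / math.sqrt(len(samples))          # step 3: standard error
    return mean, se, (mean - se, mean + se)    # step 4: mean ± standard error


# Hypothetical replicate readings, for illustration only.
readings = [4.0, 5.0, 6.0, 5.0, 5.0]
mean, se, (lower, upper) = standard_error_bar(readings)
print(f"mean = {mean:.3f}, standard error = {se:.3f}")
print(f"error bar spans {lower:.3f} to {upper:.3f}")
```

The lower and upper values are the ends of the error bar drawn around the mean on the graph.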

Below is a demonstration of plotting error bars using data from Hong_Kong_UCCKE iGEM2016.