Exploratory Data Analysis

(4 plots)

4PlotsEDAdataPredicitveAnalysis.xlsx (the workbook used)

Objective: this method determines whether a data set can support predictions or conclusions, and with what degree of confidence.

It relies on statistics computed from the data rather than on heuristics (a heuristic is a practical approach to problem solving, learning, or discovery that is not guaranteed to be optimal or perfect, but is sufficient for the immediate goals).

The Exploratory Data Analysis (EDA) approach does not impose deterministic or probabilistic models on the data. Rather, it allows the data to suggest the admissible models that best fit them.


IF the four assumptions are true: (1) random sampling from (2) a fixed distribution with (3) a fixed location [mean] and (4) a fixed variation [standard deviation]

THEN the data are parametric [they follow a known distribution] AND

THEN probabilistic predictability is achieved: the process is in statistical control and repeatable, and it can be modeled [Yi = C + Ei, i.e. a constant (the average) plus random error] to make scientifically and legally valid predictions and conclusions [based on probability], for example: the data will fall within Y +/- error 19 times out of 20. Control limits can be established and outliers identified and rejected; otherwise all data points are significant, and non-conforming results cannot be rejected arbitrarily, they must be rejected on the basis of data from a statistically controlled process (a code sketch of these calculations follows this IF/ELSE block).

ELSE the process is unpredictable, out of control, and drifting; no conclusions or judgements [about the past] or predictions [about the future] can be made, and repeating the tests will yield different and unrelated results [there may be other unknown and unknowable variables not accounted for]

IF the sample is not randomly selected THEN it is biased and not representative of the population it claims to represent AND no valid decisions can be made
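The following Python sketch illustrates the Yi = C + Ei characterization above: it estimates the constant C (the mean), the spread of the error term, an approximate 95% prediction band (the "19 times out of 20" statement) and conventional 3-sigma control limits. It assumes the four assumptions have already been verified; the simulated data, the 1.96 multiplier and the 3-sigma threshold are illustrative choices, not part of the original notes.

```python
import numpy as np

def characterize_process(y, sigma_mult=3.0):
    """Fit the univariate model Yi = C + Ei and derive limits from it."""
    y = np.asarray(y, dtype=float)
    c = y.mean()                      # location estimate (the constant C)
    s = y.std(ddof=1)                 # spread of the error term Ei

    # approximate 95% band: the "Y +/- error 19 times out of 20" statement
    pred_low, pred_high = c - 1.96 * s, c + 1.96 * s

    # conventional 3-sigma control limits for judging individual points
    lcl, ucl = c - sigma_mult * s, c + sigma_mult * s
    outliers = np.flatnonzero((y < lcl) | (y > ucl))

    return c, s, (pred_low, pred_high), (lcl, ucl), outliers

# Example with simulated in-control data (illustrative only)
rng = np.random.default_rng(0)
c, s, pred, limits, outliers = characterize_process(rng.normal(10.0, 0.5, 200))
print(pred, limits, outliers)
```

Rejecting a point only when it falls outside limits computed this way is what "rejected on the basis of data in a statistically controlled process" means in practice.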

This 'unpredictable' domain is where statistics are invalid; it is the quadrant where statistics are misleading [they create a false sense of security], Taleb's 'black swan' or 'turkey in November' territory where anything can happen and the dramatic and unexpected does happen (the majority of human experience and language (politics, values) is undebatable and unpredictable).

An unknown or changing distribution of a complex process makes accurate and exact predictions impossible.

Only take direct action on the process or the operator for special causes of variation [these can be meaningful signals]. To reduce variation due to common causes [noise], action must be taken on the system by management.
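As one illustration of separating special causes from common-cause noise, the sketch below builds an individuals control chart whose sigma is estimated from the average moving range (MR-bar / 1.128, the usual d2 constant for ranges of two consecutive points). The chart type, the 3-sigma rule and the simulated data are illustrative assumptions, not something prescribed by these notes.

```python
import numpy as np

def individuals_chart_limits(y):
    """Estimate 3-sigma limits for an individuals (I) chart.

    Sigma is estimated from the average moving range (MR-bar / d2,
    d2 = 1.128 for ranges of two consecutive points), which is less
    sensitive to drift than the overall standard deviation.
    """
    y = np.asarray(y, dtype=float)
    center = y.mean()
    mr_bar = np.mean(np.abs(np.diff(y)))     # average moving range
    sigma_hat = mr_bar / 1.128
    lcl, ucl = center - 3 * sigma_hat, center + 3 * sigma_hat
    special = np.flatnonzero((y < lcl) | (y > ucl))
    return center, (lcl, ucl), special

# Points outside the limits are candidate special causes (signals);
# points inside are common-cause noise and call for action on the system.
rng = np.random.default_rng(1)
data = rng.normal(50.0, 2.0, size=100)
data[42] += 12.0                             # simulated special-cause event
center, limits, special = individuals_chart_limits(data)
print(limits, special)                       # should include index 42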

Test the assumptions with the 4-plot [run sequence plot, histogram, lag plot, normal probability plot]; if they hold, develop a model for the system. The objective is to characterize and model the underlying function [regression, forecasting function].
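A minimal Python version of the 4-plot, assuming matplotlib and scipy are available; the figure layout and the simulated data set are illustrative, not taken from the workbook mentioned above.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def four_plot(y, title="4-plot"):
    """Run sequence, lag plot, histogram, and normal probability plot."""
    y = np.asarray(y, dtype=float)
    fig, ax = plt.subplots(2, 2, figsize=(9, 7))

    ax[0, 0].plot(y, marker=".", linestyle="-")    # fixed location and variation?
    ax[0, 0].set_title("Run sequence plot")

    ax[0, 1].plot(y[:-1], y[1:], ".")              # randomness? (no structure expected)
    ax[0, 1].set_title("Lag plot (Y[i] vs Y[i-1])")

    ax[1, 0].hist(y, bins="auto")                  # distribution shape?
    ax[1, 0].set_title("Histogram")

    stats.probplot(y, dist="norm", plot=ax[1, 1])  # consistent with a normal distribution?
    ax[1, 1].set_title("Normal probability plot")

    fig.suptitle(title)
    fig.tight_layout()
    return fig

# Example on simulated data that satisfies the four assumptions
rng = np.random.default_rng(2)
four_plot(rng.normal(0.0, 1.0, size=250))
plt.show()
```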


Predictability is an all-important goal in science and engineering. If the four underlying assumptions hold, then we have achieved probabilistic predictability--the ability to make probability statements not only about the process in the past, but also about the process in the future. In short, such processes are said to be "in statistical control".

If the four assumptions are valid, then the process is amenable to the generation of valid scientific and engineering conclusions. If the four assumptions are not valid, then the process is drifting (with respect to location, variation, or distribution), unpredictable, and out of control. A simple characterization of such processes by a location estimate, a variation estimate, or a distribution estimate inevitably leads to engineering conclusions that are not valid, are not supportable (scientifically or legally), and which are not repeatable in the laboratory.

Because the validity of the final scientific/engineering conclusions is inextricably linked to the validity of the underlying univariate assumptions, it naturally follows that there is a real necessity that each and every one of the above four assumptions be routinely tested.

Extreme data points do not necessarily mean that they are due to a special cause [especially if the process is not in statistical control]. They can still be part of the normal variation of the process.


Signal to Noise ratio [SN] vs time [T]: the longer the period the data covers, the better the signal-to-noise ratio. Stock market observations: for T = 1 year, SN = 1/1; for T = 1 day, SN = 0.05/0.95; for T = 1 hour, SN = 0.005/0.995. Daily or hourly 'news' is at best irrelevant and at worst misleading more than 95% of the time.
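The qualitative reason is that over a horizon T the expected move (signal) grows roughly linearly with T while random fluctuation (noise) grows roughly with the square root of T, so the ratio improves as sqrt(T). The sketch below assumes an annual drift of 7%, annual volatility of 15% and a 252-day, 6.5-hour trading calendar; these numbers are illustrative assumptions and will not reproduce the exact ratios quoted above, only the scaling.

```python
import math

def signal_to_noise(horizon_years, annual_drift=0.07, annual_vol=0.15):
    """Rough signal-to-noise ratio of a return series over a given horizon.

    Signal (expected move) scales linearly with time; noise (random
    fluctuation) scales with sqrt(time), so SN shrinks as the horizon
    shortens. Drift and volatility values are illustrative assumptions.
    """
    signal = annual_drift * horizon_years
    noise = annual_vol * math.sqrt(horizon_years)
    return signal / noise

for label, years in [("1 year", 1.0), ("1 day", 1 / 252), ("1 hour", 1 / (252 * 6.5))]:
    print(f"{label:>7}: SN ~ {signal_to_noise(years):.3f}")
```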