S5 Identifying Distributions from Data

Distributions are MODELS for data which exist in the imagination of the statistician, while the data itself is an observed, real world phenomenon. The idea that the models we make are TRUE has been the source of major confusions and misunderstandings in data analysis. When we look at a data series, it is commonly believed that there is a UNIQUE TRUE model which generates the data. This is known as the DGP -- the Data Generating Process. This belief is the result of a linguistic confusion. Obviously there is a real world process which generates the data, so one can say that there exists a real world Data Generating Process. But it is unlikely that this DGP bears even a remote resemblance to any member of the entire class of models for data ever studied by statisticians. The Metaphor-Model takes for granted that models do not belong to the real world, and are always FALSE, in the sense that they DO NOT correctly represent the real world data generating process. The only question is whether these oversimplified models are adequate for the purpose at hand.

With these preliminary cautions, we turn to one of the central questions of statistics: identifying the model from the data. Here we generate three different series from the Uniform, Gamma, and Half-Normal distributions, and ask whether we can find out the parent distribution by looking at the data. The difference in our point of view is that we are thinking about finding a model which is ADEQUATE for our purposes, and NOT about finding the TRUE model, which does not exist.

When we have real data sets, we are looking for models which are compatible with the data set. Such models provide a crude and oversimplified picture, which smooths out the complexities of the data set and reduces it to a model that is simple to understand. The adequacy of the model must be evaluated in the context of the application. As a FIRST STEP to knowing which models are useful for describing which types of data sets, we consider a simpler problem. We generate artificial (simulated) data exactly according to the model, and then look at the data to try to classify which of the many possible models generated it. If we can classify data by model type when the data actually does follow the model, we will be in a better position to do the same type of exercise on real data, which may or may not follow the model. The exercise below provides some practice in this most fundamental of statistical problems.

EXERCISE: The EXCEL spreadsheet below uses THREE models to generate data in columns A,B,C.

M0: Uniform U(0,2.5). IID random variables from a uniform distribution with range [0,2.5]

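M1: Gamma with shape parameter p=2 and parameter L=1/2 -- Note that this is the same as a Chi-Square density with 4 degrees of freedom only if L is read as a rate, that is, scale 1/L=2. If L=1/2 is a scale parameter in the usual sense, then the mean is pL=1, and it is 4X rather than X that has the Chi-Square distribution with 4 degrees of freedom.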

M2: Half-Normal. This is the normal distribution restricted to be positive: X=|Z| where Z is standard normal. Note that P(0<X<1)=P(-1<Z<+1).
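For readers who want to reproduce the three columns outside the spreadsheet, the following is a minimal sketch in Python using only the standard `random` module. It is not the spreadsheet's own formula. The function name `simulate_models`, the seed, and the sample size are hypothetical choices made for illustration, and the code ASSUMES that L=1/2 in M1 is a scale parameter (the argument `gamma_scale` can be set to 2.0 to get the Chi-Square reading instead).

```python
import random


def simulate_models(n, seed=0, gamma_scale=0.5):
    """Return n IID draws from each of M0, M1, M2 as a dict of lists.

    gamma_scale is an assumption: L = 1/2 is read here as a scale
    parameter. random.gammavariate(alpha, beta) takes shape, then scale.
    """
    rng = random.Random(seed)
    return {
        # M0: Uniform on [0, 2.5]
        "M0": [rng.uniform(0.0, 2.5) for _ in range(n)],
        # M1: Gamma with shape p = 2
        "M1": [rng.gammavariate(2.0, gamma_scale) for _ in range(n)],
        # M2: Half-Normal, X = |Z| with Z standard normal
        "M2": [abs(rng.gauss(0.0, 1.0)) for _ in range(n)],
    }


samples = simulate_models(1000)
for name, xs in samples.items():
    # Print the sample mean and sample maximum of each series.
    print(name, round(sum(xs) / len(xs), 3), round(max(xs), 3))
```

Under these assumptions the theoretical means are 1.25 for M0, 1 for M1, and about 0.8 for M2, so the three series live on a similar scale, and the sample maximum is a useful clue because only M0 is bounded above by 2.5.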

Your job is to find out which column comes from which distribution. The difficulty of the task depends on the sample size. The job is very easy with large samples, and becomes harder as the sample size becomes smaller. Three different sample sizes have been provided on the spreadsheet.
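The effect of sample size can be explored by simulation. The sketch below is one possible approach and not the method the exercise asks for: it draws a sample from a known model, picks whichever of the three models gives the data the highest log-likelihood, and records how often the pick is right. The names `loglik`, `draw`, and `hit_rate`, the sample sizes 5, 50, and 500, and the number of trials are all hypothetical, and `GAMMA_SCALE` again ASSUMES that L=1/2 is a scale parameter.

```python
import math
import random

MODELS = ("M0", "M1", "M2")
GAMMA_SCALE = 0.5  # assumption: L = 1/2 read as a scale parameter


def loglik(model, xs):
    """Log-likelihood of the sample xs under the named model."""
    total = 0.0
    for x in xs:
        if model == "M0":  # Uniform on [0, 2.5], density 1/2.5
            if not 0.0 <= x <= 2.5:
                return -math.inf
            total += -math.log(2.5)
        elif model == "M1":  # Gamma, shape 2: x * exp(-x/s) / s**2
            if x <= 0.0:
                return -math.inf
            total += math.log(x) - x / GAMMA_SCALE - 2.0 * math.log(GAMMA_SCALE)
        else:  # Half-Normal: sqrt(2/pi) * exp(-x**2 / 2) for x >= 0
            total += 0.5 * math.log(2.0 / math.pi) - 0.5 * x * x
    return total


def draw(model, n, rng):
    """Draw n IID values from the named model."""
    if model == "M0":
        return [rng.uniform(0.0, 2.5) for _ in range(n)]
    if model == "M1":
        return [rng.gammavariate(2.0, GAMMA_SCALE) for _ in range(n)]
    return [abs(rng.gauss(0.0, 1.0)) for _ in range(n)]


def hit_rate(n, trials=100, seed=0):
    """Fraction of simulated samples of size n assigned to the right model."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        for true_model in MODELS:
            xs = draw(true_model, n, rng)
            guess = max(MODELS, key=lambda m: loglik(m, xs))
            hits += guess == true_model
    return hits / (trials * len(MODELS))


for n in (5, 50, 500):  # illustrative sizes, not those on the spreadsheet
    print(n, hit_rate(n))
```

The likelihood comparison has an advantage that the exercise does not give: it knows the three candidate models exactly. Even so, the fraction of correct identifications should rise toward 1 as n grows, which is the sense in which the job is easy with large samples and harder with small ones.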
