Chapter 1: mean, median, variance, and correlation
Here we will use Matlab to simulate data from a pair of correlated variables, and then compute the mean, median, variance and correlation of these variables. We begin with the pseudorandom multivariate Gaussian random number generator, mvnrnd (multivariate normal random numbers; supplied with the Statistics and Machine Learning Toolbox):
mu=[0 0]; V1=1; V2=2^2; r=.85; N=100;
xy=mvnrnd(mu,[V1 r*sqrt(V1*V2); r*sqrt(V1*V2) V2],N);
These commands produce 100 xy data-pairs (the size of the xy matrix is 100 x 2), where the variance in the x-dimension is V1 and the variance in the y-dimension is V2. We can get a sense of the distribution of simulated data by plotting their scatter and histograms:
GaussPoints(xy,mu,[V1 r*sqrt(V1*V2); r*sqrt(V1*V2) V2]);
This plot command produces a 2D scatterplot with a histogram of each dimension drawn along the corresponding axis, as in Fig. 1.6c. We can see from the plot that the data are roughly centered on the origin (we specified that mu should be [0 0] when we generated the data), and that the spread of the data is larger along the y-axis than along the x-axis. There is also a clear positive correlation between the data values along the two axes, with values in the 2D scatter adhering closely to a straight line of positive slope.
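The covariance matrix passed to the generator and to the plot command above places the two variances on the diagonal and the covariance, r*sqrt(V1*V2), off the diagonal. As a minimal sketch (the names S, sd and R are illustrative and do not appear in the text above), the following builds that matrix once and confirms that it implies the intended correlation:

```matlab
% Illustrative sketch: build the covariance matrix from the variances and
% the correlation, then recover the correlation matrix it implies.
V1 = 1; V2 = 2^2; r = .85;
S  = [V1 r*sqrt(V1*V2); r*sqrt(V1*V2) V2];  % covariance matrix [1 1.7; 1.7 4]
sd = sqrt(diag(S));                         % standard deviations, [1; 2]
R  = S ./ (sd*sd');                         % divide each entry by sd_i*sd_j
disp(R)                                     % off-diagonal entries equal r
```

Storing the matrix in a single variable such as S also avoids typing it twice when the same matrix is needed for both simulating and plotting.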
We can compute the mean and median of the data with the commands:
mubar=mean(xy)
muhat=median(xy)
Both commands produce a 2-element vector output, where the first element is the mean or median in the x-dimension and the second the mean or median in the y-dimension. Use of the var command produces a similar 2-element output that describes the variance of the two dimensions:
Vhat=var(xy)
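To see what the mean and var commands compute, the sketch below reproduces them column by column from their definitions. It assumes the default behavior of var, which divides by N-1 rather than N. So that the block runs on its own, it simulates stand-in data with randn and a Cholesky factor instead of the generator used above; that substitution, and the names S, dev and the _manual variables, are illustrative assumptions.

```matlab
% Illustrative sketch: sample mean and variance from their definitions.
N  = 100;
S  = [1 1.7; 1.7 4];                 % same covariance matrix as in the text
xy = randn(N,2)*chol(S);             % stand-in data; chol(S)'*chol(S) equals S

mubar_manual = sum(xy,1)/N;                 % column means
dev          = xy - repmat(mubar_manual,N,1); % deviations from the means
Vhat_manual  = sum(dev.^2,1)/(N-1);         % sample variance, N-1 normalization

disp([mubar_manual; mean(xy)])       % the two rows should agree
disp([Vhat_manual;  var(xy)])        % the two rows should agree
```

Because the data are random, the printed values change from run to run, but each pair of rows should match to within rounding error.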
Finally, we compute the correlation of the 2D scatter with the command:
rhat=corrcoef(xy)
which produces a 2x2 matrix, where rhat(1,1) is the correlation of the first column of xy with itself, rhat(1,2) is the correlation of the first column of xy with the second column of xy, etc. The diagonal elements are therefore always 1, and the matrix is symmetric [e.g., rhat(1,2)=rhat(2,1)]. Structuring the output as a matrix in this way means that the corrcoef output extends naturally when the input matrix has more than two columns: an input with k columns produces a k x k matrix. However, in the case where we are looking at a simple correlation between two dimensions, x and y, we only care about the single number describing the correlation between x and y, and not the entire matrix:
rhat=corrcoef(xy); rhat=rhat(2,1)
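The single correlation value can also be obtained from the sample covariance matrix, by dividing the covariance of x and y by the product of their standard deviations. The sketch below is one way to check this against corrcoef; as an assumption made only to keep the block self-contained, it simulates stand-in data with randn and a Cholesky factor, and the names C, rhat_manual and Rfull are illustrative.

```matlab
% Illustrative sketch: correlation computed from the sample covariance matrix.
N  = 100;
S  = [1 1.7; 1.7 4];                 % same covariance matrix as in the text
xy = randn(N,2)*chol(S);             % stand-in data; chol(S)'*chol(S) equals S

C           = cov(xy);                        % 2x2 sample covariance matrix
rhat_manual = C(1,2)/sqrt(C(1,1)*C(2,2));     % cov(x,y)/(sd_x*sd_y)

Rfull = corrcoef(xy);                % full 2x2 correlation matrix
rhat  = Rfull(2,1);                  % the single value of interest
disp([rhat_manual rhat])             % the two values should agree
```

With N=100 samples the estimate will typically fall near, but not exactly on, the value r=.85 used to generate the data.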