of two random variables:    cov(x, y) = E[ (x - E[x]) (y - E[y]) ]. By linearity of expected value we have: cov(x, y) = E[xy] - E[x] E[y].
The magnitude of the covariance is not easy to interpret. The normalized version of the covariance, the correlation coefficient, however, shows by its magnitude the strength of the linear relation.
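A minimal sketch in R of the two covariance formulas and the normalization into a correlation coefficient. The data values are made up for illustration, and the expectation E is approximated by the sample mean; note that R's built-in cov() divides by n - 1 rather than n.

```r
# Hypothetical toy data; the values are made up for illustration.
x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 5)
n <- length(x)

# Definition: E[(x - E[x]) (y - E[y])], with E taken as the sample mean.
cov_def <- mean((x - mean(x)) * (y - mean(y)))

# Shortcut obtained by linearity of expectation: E[xy] - E[x] E[y].
cov_short <- mean(x * y) - mean(x) * mean(y)

# R's cov() divides by n - 1 instead of n, so rescale it before comparing.
cov_r <- cov(x, y) * (n - 1) / n

# Correlation: covariance normalized by both standard deviations.
r_manual <- cov(x, y) / (sd(x) * sd(y))
r_builtin <- cor(x, y)

print(c(cov_def = cov_def, cov_short = cov_short, cov_r = cov_r))
print(c(r_manual = r_manual, r_builtin = r_builtin))
```

The covariance changes if x or y is rescaled (for example, meters to centimeters), while the correlation stays the same, which is why its magnitude is easier to interpret.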

Covariance Matrix

A matrix whose (i, j)th element is the covariance between the ith and jth elements of a random vector (a vector of random variables). It generalizes the notion of variance to multiple dimensions.

Suppose we have a random vector X = [x1, x2, ..., xn], where each xi is a random variable with its own mean and variance.
Its covariance matrix is the matrix whose (i, j) element is the covariance between random variables xi and xj:
Covariance(xi, xj) = E[ (xi - E[xi]) (xj - E[xj]) ]
This is equivalent to the following vector multiplication (with X as a column vector): Σ = E[ (X - E[X]) (X - E[X])^T ]
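A small sketch in R of the matrix form Σ = E[ (X - E[X]) (X - E[X])^T ], estimated from samples. The data matrix is hypothetical, each row is one observation of the random vector, and the division by (number of rows - 1) is an assumption chosen to match R's cov().

```r
# Hypothetical sample: 4 observations (rows) of a random vector with 3 components (columns).
X <- matrix(c(1, 2, 0,
              2, 1, 1,
              3, 5, 0,
              4, 4, 3), ncol = 3, byrow = TRUE)

# Center each column by subtracting its mean, the sample stand-in for E[xi].
Xc <- sweep(X, 2, colMeans(X))

# Sample version of E[(X - E[X]) (X - E[X])^T]; dividing by (rows - 1) matches cov().
Sigma <- t(Xc) %*% Xc / (nrow(X) - 1)

# Element (i, j) is the covariance between columns i and j;
# the diagonal holds the variance of each component.
print(Sigma)
print(cov(X))
```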

Spearman or Pearson correlation can be used to find correlated features and keep only one feature from each correlated group, as a form of feature reduction.
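One possible way to do this in R is sketched below. The feature table, the 0.9 cutoff, and the rule "drop the later feature of a correlated pair" are all assumptions for illustration, not a standard recipe.

```r
# Hypothetical features: f2 is nearly a rescaled copy of f1, f3 is unrelated.
features <- data.frame(
  f1 = c(1, 2, 3, 4, 5, 6),
  f2 = c(2.1, 3.9, 6.2, 8.0, 9.8, 12.1),
  f3 = c(5, 1, 4, 2, 6, 3)
)

threshold <- 0.9  # assumed cutoff, not a standard value

# Absolute pairwise correlations; method = "pearson" works the same way.
corr <- abs(cor(features, method = "spearman"))

# Keep only pairs (i, j) with i < j so each pair is checked once.
corr[lower.tri(corr, diag = TRUE)] <- 0

# Drop every feature that is highly correlated with an earlier feature.
to_drop <- colnames(corr)[apply(corr, 2, max) > threshold]
reduced <- features[, !(names(features) %in% to_drop), drop = FALSE]

print(to_drop)
print(names(reduced))
```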

Spearman's rank correlation coefficient
A nonparametric measure of statistical dependence between two variables.
It assesses how well the relationship between the two variables can be described by a monotonic function.
Perfect Spearman correlation of +1 or −1 occurs when each of the variables is a perfect monotone function of the other.
The Spearman correlation coefficient is often described as being "nonparametric". This can have two meanings. First, a perfect Spearman correlation results when X and Y are related by any monotonic function; this contrasts with the Pearson correlation, which only gives a perfect value when X and Y are related by a linear function. Second, the Spearman correlation is nonparametric in that its exact sampling distribution can be obtained without requiring knowledge (i.e., knowing the parameters) of the joint probability distribution of X and Y.
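ρ = Σ (xi - mean(x)) (yi - mean(y)) / sqrt( Σ (xi - mean(x))^2 · Σ (yi - mean(y))^2 ), where xi is the rank (index) of Xi and yi is the rank of Yi

If no two values share the same rank (index # in the sorted list): ρ = 1 - 6 Σ di^2 / (n (n^2 - 1)), where di = xi - yi (difference in rank) and n is the number of pairs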
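A short sketch in R of both routes to Spearman's coefficient: the Pearson correlation computed on the ranks, and the no-ties shortcut ρ = 1 - 6 Σ di^2 / (n (n^2 - 1)). The raw scores are hypothetical and chosen so that there are no ties.

```r
# Hypothetical raw scores without ties.
X <- c(12, 7, 30, 18, 25)
Y <- c(3.1, 2.0, 4.5, 9.9, 4.0)

# Replace each value by its rank (1 = smallest).
x <- rank(X)  # 2 1 5 3 4
y <- rank(Y)  # 2 1 4 5 3

# Route 1: Pearson correlation applied to the ranks.
rho_ranks <- cor(x, y)

# Route 2: shortcut that is valid only when there are no tied ranks.
d <- x - y
n <- length(X)
rho_shortcut <- 1 - 6 * sum(d^2) / (n * (n^2 - 1))

# Built-in version for comparison.
rho_builtin <- cor(X, Y, method = "spearman")

print(c(rho_ranks, rho_shortcut, rho_builtin))
```

With tied values the shortcut no longer agrees exactly with the rank-based definition, so the first route (or the built-in) is the safer default.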

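A relationship can be monotone without being linear; Spearman's coefficient then still reaches +1 or −1 while Pearson's does not.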
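As an illustration in R, take an exponential relation, which is strictly increasing but far from linear. The choice of y = exp(x) on the integers 1 to 10 is only an example.

```r
# A strictly increasing but non-linear relation (illustrative choice).
x <- 1:10
y <- exp(x)

pearson <- cor(x, y)                        # below 1: the relation is not linear
spearman <- cor(x, y, method = "spearman")  # the ranks agree perfectly

# Flipping the sign gives a strictly decreasing relation.
spearman_decreasing <- cor(x, -y, method = "spearman")

print(c(pearson = pearson, spearman = spearman,
        spearman_decreasing = spearman_decreasing))
```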


Best and worst choices in multiple choice questions:
Question   Best choice/preference   Worst selected option/preference
1          C                        E
2          A                        E
3          B                        F
4          C                        F
5          C                        B
6          B                        F
We can count up the number of times an alternative is chosen as best and the number of times it is chosen as worst, and the difference between these is the count. For this experiment, alternative C is seen to be most preferred, alternatives B and A are equal second, D is fourth, E fifth and F sixth (i.e., C > B = A > D > E > F).
Alternative   Best   Worst   diff
A             1      0        1
B             2      1        1
C             3      0        3
D             0      0        0
E             0      2       -2
F             0      3       -3
Some researchers compute the ratio of best to worst scores.
In R, the following code enters this data and computes the counts:
# 1 = chosen as best, -1 = chosen as worst, 0 = shown but not chosen, NA = not shown
mdData <- matrix(c(NA, NA,  1,  0, -1, NA,
                    1, NA, NA,  0, -1, NA,
                   NA,  1, NA,  0, NA, -1,
                    0, NA,  1, NA, NA, -1,
                    0, -1,  1, NA, NA, NA,
                   NA,  1, NA, NA,  0, -1),
                 6, byrow = TRUE,
                 dimnames = list(Block = 1:6, Alternatives = LETTERS[1:6]))
apply(mdData, 2, sum, na.rm = TRUE)

Similarly, for unbalanced designs, the code in R uses mean instead of sum:

apply(mdData,2,mean, na.rm=TRUE)
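An alternative sketch that starts from the raw answers in the question table instead of a pre-coded matrix, and rebuilds the Best / Worst / diff table. The vectors `best` and `worst` are transcribed from that table; the variable names are hypothetical.

```r
alternatives <- LETTERS[1:6]

# Answers to questions 1..6, transcribed from the table of choices.
best  <- c("C", "A", "B", "C", "C", "B")
worst <- c("E", "E", "F", "F", "B", "F")

# Using a factor with all six levels keeps alternatives that were never chosen (count 0).
best_count  <- table(factor(best,  levels = alternatives))
worst_count <- table(factor(worst, levels = alternatives))

scores <- data.frame(Alternative = alternatives,
                     Best  = as.vector(best_count),
                     Worst = as.vector(worst_count))
scores$diff <- scores$Best - scores$Worst

# Most preferred alternative first.
print(scores[order(-scores$diff), ])
```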

To determine whether a result is statistically significant, a researcher calculates a p-value, which is the probability of observing an effect at least as extreme as the one seen, given that the null hypothesis is true.[11] The null hypothesis is rejected if the p-value is less than the significance level α. The α level is the probability of rejecting the null hypothesis when it is true (type I error) and is usually set at 0.05 (5%), the most widely used value.[2] If the α level is 0.05, then the probability of committing a type I error is 5%.[12] Thus, a statistically significant result is one in which the p-value for obtaining that result is less than 5%, which is formally written as p < 0.05.[12]
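A minimal worked example in R, assuming a hypothetical coin-toss experiment: 9 heads in 10 tosses, with the null hypothesis that the coin is fair. The two-sided p-value is the probability, under the null, of a result at least as extreme as the observed one (here: at most 1 head or at least 9 heads).

```r
# Hypothetical experiment; the numbers are made up for illustration.
n <- 10       # tosses
heads <- 9    # observed heads
alpha <- 0.05 # significance level

# P(X <= 1) + P(X >= 9) under a fair coin (X ~ Binomial(10, 0.5)).
p_value <- pbinom(n - heads, n, 0.5) +
  pbinom(heads - 1, n, 0.5, lower.tail = FALSE)

# Reject the null hypothesis when the p-value is below alpha.
significant <- p_value < alpha

# Built-in exact binomial test for comparison.
p_builtin <- binom.test(heads, n, p = 0.5)$p.value

print(c(p_value = p_value, p_builtin = p_builtin))
print(significant)
```

Here the p-value is 22/1024, so the result would count as significant at α = 0.05 but not at α = 0.01.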

A relationship is called statistically "significant" when, if only chance were at work, a relationship this strong would be observed with probability below 5%. This is sometimes loosely read as "if the study is repeated, it will lead to the same result with 95% probability", although strictly the 5% is a statement about the null hypothesis, not about replication. The 95% threshold is arbitrary; it is a convention we have chosen. Another conventional threshold that matters is 99%. When a correlation test reaches the 99% level, the result is said to be statistically highly significant.[1]

Discriminant analysis

Linear discriminant analysis (LDA) and the related Fisher's linear discriminant find a linear combination of features that characterizes or separates two or more classes of objects.

LDA seeks to reduce dimensionality while preserving as much of the class discriminatory information as possible:
- Assume we have a set of d-dimensional samples {x_1, ..., x_n}, n1 of which belong to class ω1 and n2 to class ω2.
- We seek to obtain a scalar y by projecting the samples x onto a line y = w^T x.
- Of all the possible lines we would like to select the one that maximizes the separability of the scalars.
The resulting combination may be used as a linear classifier, or, more commonly, for dimensionality reduction before later classification.

How to calculate:
Maximize the distance between the class means in Y space (μ1 and μ2 are the means of class 1 and class 2 in X space; their projections in Y space are written μ1' and μ2'): |μ1' - μ2'| = |w^T (μ1 - μ2)|
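A small sketch in R of the projection y = w^T x and of the identity |μ1' - μ2'| = |w^T (μ1 - μ2)|. The two classes of 2-D samples and the direction w are hypothetical; w is assumed to have unit length, since otherwise the distance could be inflated just by scaling w.

```r
# Hypothetical 2-D samples, one row per sample.
X1 <- matrix(c(4, 2,
               2, 4,
               2, 3,
               3, 6,
               4, 4), ncol = 2, byrow = TRUE)   # class 1
X2 <- matrix(c(9, 10,
               6, 8,
               9, 5,
               8, 7,
               10, 8), ncol = 2, byrow = TRUE)  # class 2

w <- c(1, 0)  # assumed unit-length direction (here: the first axis)

# Project every sample onto the line: y = w^T x.
y1 <- as.vector(X1 %*% w)
y2 <- as.vector(X2 %*% w)

# Distance between the projected class means ...
dist_projected <- abs(mean(y1) - mean(y2))

# ... equals the projection of the difference of the original means.
dist_formula <- abs(sum(w * (colMeans(X1) - colMeans(X2))))

print(c(dist_projected, dist_formula))
```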

Fisher's solution:
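Maximize the difference between the projected means, normalized by a measure of the within-class scatter (the variance-like quantity s):
max over w of J(w) = |μ1' - μ2'|^2 / (s1^2 + s2^2), where s1^2 = Σ (y - μ1')^2 over the projected samples y of class 1 (and likewise s2^2 for class 2) is the equivalent of the variance in Y space (instead of an expectation it only takes the sum and does not divide by the number of samples)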
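A minimal sketch of Fisher's criterion in base R, assuming the common form J(w) = (μ1' - μ2')^2 / (s1^2 + s2^2) and the standard closed-form maximizer w ∝ Sw^{-1} (μ1 - μ2), where Sw is the sum of the two class scatter matrices. The data, the helper `fisher_J`, and the other names are hypothetical.

```r
# Hypothetical 2-D samples, one row per sample.
X1 <- matrix(c(4, 2,  2, 4,  2, 3,  3, 6,  4, 4), ncol = 2, byrow = TRUE)
X2 <- matrix(c(9, 10,  6, 8,  9, 5,  8, 7,  10, 8), ncol = 2, byrow = TRUE)

# Fisher's criterion for a direction w: squared distance between the projected
# means divided by the summed within-class scatter of the projections.
fisher_J <- function(w, X1, X2) {
  y1 <- as.vector(X1 %*% w)
  y2 <- as.vector(X2 %*% w)
  s1_sq <- sum((y1 - mean(y1))^2)  # scatter: sum only, no division by n
  s2_sq <- sum((y2 - mean(y2))^2)
  (mean(y1) - mean(y2))^2 / (s1_sq + s2_sq)
}

mu1 <- colMeans(X1)
mu2 <- colMeans(X2)

# Within-class scatter matrix Sw = S1 + S2 (sums of outer products of deviations).
S1 <- crossprod(sweep(X1, 2, mu1))
S2 <- crossprod(sweep(X2, 2, mu2))
Sw <- S1 + S2

# Closed-form direction, normalized to unit length.
w_star <- solve(Sw, mu1 - mu2)
w_star <- w_star / sqrt(sum(w_star^2))

print(w_star)
print(c(J_star = fisher_J(w_star, X1, X2),
        J_axis1 = fisher_J(c(1, 0), X1, X2),
        J_axis2 = fisher_J(c(0, 1), X1, X2)))
```

J(w) does not change when w is rescaled, so the normalization to unit length is only a convention; the sign of w_star is likewise arbitrary.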