multivariate.outlier.fnc

Objective

Computes the multivariate D² (squared Mahalanobis distance) for each record in the data set, so that the user can exclude the resulting outlier records from subsequent multivariate analyses. Two criteria can be used to mark a record as an outlier: p.criteria (the default, with an alpha of 0.001), which marks distances whose p value is at or below the criterion, and d2.criteria, which marks distances above a D² value chosen by the user. You can also search for outliers within segments of the data by passing the which.factor argument the name of the factor or factors (two at most) by which you want to segment the data.

Multivariate Outlier

The function multivariate.outlier.fnc estimates D² (the squared Mahalanobis distance) for each record of a matrix of quantitative variables taken from the data set specified by the user. The user can then remove from the analysis every case labeled as an outlier, either because the probability associated with its distance to the centroid of the multivariate distribution is less than or equal to the probability criterion (p.criteria, 0.001 by default) or because its D² exceeds a distance criterion set by the user in the d2.criteria argument.
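The two criteria described above can be illustrated with base R alone. The following is a minimal sketch, not the code of multivariate.outlier.fnc: it uses the built-in iris data, the stats functions mahalanobis() and pchisq(), and assumes that the p value is the upper tail of a chi-square distribution with as many degrees of freedom as variables. The object names (X, d2, p.val, flag.by.p, flag.by.d2) and the cut of 10 are illustrative choices.

```r
# Sketch only: base-R stand-in, not the internals of multivariate.outlier.fnc.
X <- as.matrix(iris[, 1:4])                 # quantitative variables only
X <- X[complete.cases(X), , drop = FALSE]   # keep complete records

# Squared Mahalanobis distance of each record to the centroid
d2 <- mahalanobis(X, center = colMeans(X), cov = cov(X))

# Assumption: p value taken from a chi-square with df = number of variables
p.val <- pchisq(d2, df = ncol(X), lower.tail = FALSE)

flag.by.p  <- p.val <= 0.001   # probability criterion (alpha = 0.001)
flag.by.d2 <- d2 > 10          # distance criterion; 10 is an arbitrary cut

# First rows of a table similar in layout to the $Mahalanobis element
head(data.frame(D = sqrt(d2), D2 = d2, p.val = round(p.val, 4)))
```

Either logical vector can then be negated and used as a row index to keep only the records that were not flagged.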

The Synthetic Aperture Personality Assessment (SAPA) questionnaire data found in the bfi data set will be used as an example. It has 25 items measuring 5 personality constructs. Typing ?bfi at the console shows information about the items that compose it.

outlier= multivariate.outlier.fnc(bfi, variables=1:25)

#------------------------------------------------------------------

# MULTIVARIATE OUTLIER DETECTION

#------------------------------------------------------------------

#------------------------------------------------------------------

# Exclusion criteria: p.val <=0.001

#------------------------------------------------------------------

*** Mahalanobis distances estimated from a matrix of 2436 records in 25 variables

*** from a original data matrix with 2800 records in 25 variables.

*** 82 problematic records (D2 p <= 0.001) have been detected.

*** Follow the next example procedure in order to delete outliers records:

*** Ej: outlier=multivariate.outlier.fnc(bfi, variables=1:25)

*** new.dat=bfi[outlier$records.to.maintain, ] # PAY ATTENTION TO COMMA (,)

*** factorial.analysis.fnc(new.dat, variables=1:25)

*** HEAD OF RESULTS (Only 15 first records will be show) ***

$Mahalanobis

D D2 p.val record

mydata.63467 10.310152 106.29923 0.0000 63467

mydata.65407 9.525357 90.73242 0.0000 65407

mydata.65439 9.270040 85.93364 0.0000 65439

mydata.63763 9.097980 82.77324 0.0000 63763

mydata.67288 9.023625 81.42580 0.0000 67288

mydata.64341 8.961424 80.30713 0.0000 64341

mydata.63406 8.825714 77.89322 0.0000 63406

mydata.64192 8.770832 76.92749 0.0000 64192

mydata.62181 8.684167 75.41476 0.0000 62181

mydata.65170 8.553129 73.15601 0.0000 65170

mydata.64724 8.514739 72.50078 0.0000 64724

mydata.64642 8.350569 69.73200 0.0000 64642

mydata.65974 8.350569 69.73200 0.0000 65974

mydata.63762 8.340674 69.56684 0.0000 63762

mydata.64079 8.289636 68.71807 0.0000 64079

$records.to.maintain

[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

Note that we assigned the output of the procedure to the object outlier. The function stores there a list with two elements: the Mahalanobis distances of the records and the logical vector records.to.maintain. We will use this logical vector as an index to filter the bfi data, so that the 82 detected records do not take part in the subsequent factor analysis. Only the first 15 elements of the vector are shown. They follow the row order of the data, not the order of the Mahalanobis table, and since all 15 are TRUE those records are kept when filtering.

We can see that record 63467 (among others) has a large and statistically significant Mahalanobis distance from the centroid of the joint multivariate distribution of the 25 variables in the BFI (Big Five) questionnaire.

We can now carry out a factor analysis of the filtered data, which excludes the subjects whose distance is statistically significant (p <= 0.001).

First we remove all records flagged with p <= 0.001 in the previous function call (if needed, go back to the first steps to review how to select or delete records from a data set).

bfi.2=bfi[outlier$records.to.maintain, ]

dim(bfi)

[1] 2800 28

dim(bfi.2)

[1] 2718 28
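The filtering step relies on logical row indexing, where the comma inside the brackets matters. A minimal sketch on toy data follows; dat, keep and clean are hypothetical names, with keep standing in for outlier$records.to.maintain.

```r
# Toy data (hypothetical), standing in for bfi
dat  <- data.frame(id = 1:5, score = c(2.1, 3.4, 9.9, 2.8, 3.0))

# Hypothetical keep/drop vector, standing in for outlier$records.to.maintain
keep <- c(TRUE, TRUE, FALSE, TRUE, TRUE)

clean <- dat[keep, ]   # the comma selects rows and keeps every column

dim(dat)               # 5 rows, 2 columns
dim(clean)             # 4 rows, 2 columns
sum(!keep)             # number of records dropped
```

The logical vector must have one element per row of the original data, in the same order, which is why it can be applied directly to bfi.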

Notice how we use the logical vector records.to.maintain from the results list that we assigned to the outlier object. The filtered data no longer contain the 82 problematic records detected (2800 - 82 = 2718). By default (graph=T) the function also draws a histogram of the Mahalanobis distances, which helps you choose a new cut point for flagging a case as an outlier; pass the chosen distance value in the d2.criteria argument. Judging from the histogram, we could repeat the procedure with a D² criterion of 60.

outlier= multivariate.outlier.fnc(bfi, variables=1:25, d2.criteria=60)

*** 39 problematic records (D2 > 60) have been detected.

With this new criterion we would remove only about half as many records as with the probability criterion associated with the Mahalanobis distance (39 vs. 82).
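One way to relate the two criteria is to translate the probability criterion into a distance. The sketch below assumes that the p values come from a chi-square distribution with as many degrees of freedom as variables (25 here); that is an assumption about the function, and n.vars, d2.at.alpha and p.at.60 are illustrative names.

```r
# Assumption: p values come from a chi-square with df = number of variables
n.vars <- 25

# D2 value whose upper-tail probability equals the default alpha of 0.001
d2.at.alpha <- qchisq(0.001, df = n.vars, lower.tail = FALSE)
d2.at.alpha   # roughly 52.6

# Upper-tail probability that a cut of D2 = 60 corresponds to
p.at.60 <- pchisq(60, df = n.vars, lower.tail = FALSE)
p.at.60 < 0.001
```

Under that assumption a cut of 60 lies above the distance implied by alpha = 0.001, so it is the stricter of the two and flags fewer records.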

We will now conduct a factor analysis of the bfi questionnaire using the cleaned data set bfi.2 created above.

factorial.analysis.fnc(bfi.2, variables=1:25, n.factors=5)

reliability.fnc(bfi.2, variables=1:25)

We omit both outputs. The user can easily replicate this example and compare the results for the cleaned data with those for the uncleaned data matrix.

Multivariate Outlier Cleaning with Data Segmentation

If you include the which.factor argument with the name of the factor or factors (two at most) by which you want to segment the data, the function performs the cleaning procedure separately within each level of the segmentation.

outlier=multivariate.outlier.fnc(mydata, variables=1:10, which.factor='sexo')

outlier=multivariate.outlier.fnc(mydata, variables=1:10, which.factor='sex:zone')

outlier=multivariate.outlier.fnc(iris, variables=1:4, which.factor='Species', d2.criteria=10)

*** 8 problematic records (D2 > 10) have been detected.

$Mahalanobis

D D2 p.val record

virginica.119 3.697239 13.669574 0.0084 119

versicolor.69 3.523733 12.416691 0.0145 69

setosa.44 3.518075 12.376852 0.0148 44

setosa.42 3.488495 12.169595 0.0161 42

setosa.23 3.345523 11.192522 0.0245 23

virginica.132 3.299671 10.887828 0.0279 132

versicolor.99 3.199508 10.236852 0.0366 99

setosa.15 3.187075 10.157448 0.0379 15

setosa.25 3.116110 9.710142 0.0456 25

virginica.118 3.092769 9.565221 0.0484 118

virginica.135 2.974840 8.849674 0.0650 135

virginica.101 2.961209 8.768759 0.0671 101

versicolor.71 2.929822 8.583860 0.0724 71

setosa.45 2.922615 8.541677 0.0736 45

virginica.142 2.885865 8.328218 0.0803 142

$records.to.maintain

[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
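The segmented procedure can be sketched in base R as follows. This is an illustration of the idea (distances computed against each group's own centroid and covariance matrix), not the function's source; whether multivariate.outlier.fnc computes them in exactly this way is an assumption, and vars, d2, idx and clean are illustrative names.

```r
# Sketch only: per-group squared Mahalanobis distances on iris
vars <- 1:4
d2   <- numeric(nrow(iris))

# Loop over the row indices of each Species level
for (idx in split(seq_len(nrow(iris)), iris$Species)) {
  X <- as.matrix(iris[idx, vars])
  # Distance of each record to the centroid of its own group
  d2[idx] <- mahalanobis(X, center = colMeans(X), cov = cov(X))
}

# Logical vector in the original row order; FALSE marks records to drop
records.to.maintain <- d2 <= 10

clean <- iris[records.to.maintain, ]        # comma: filter rows, keep columns
table(iris$Species[!records.to.maintain])   # flagged records per group
```

Because the logical vector is filled by original row position, it can index the unsegmented data directly, as in the earlier bfi example.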