clean.dv.fnc

Copy, Paste and Adapt:

cleaned.data = clean.dv.fnc(stacked.data, dv='rt',
                            which.factor='condition', cuts=2.5,
                            which.min=200, which.max=2000, save.deleted=T)

Objective

Always working on stacked data (each row in the file is a trial and not a subject), the function cleans a user-defined numeric variable within each experimental condition of each participant. This cleaning is routine in repeated measures designs where one or more groups of participants see different items across J x K repeated measures experimental conditions.

In general, the process consists of recoding to NA any observation that falls outside the range defined by ± 2 standard deviations around the mean.
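The rule described above can be sketched in base R. This is only an illustration under stated assumptions, not the toolbox's implementation: trim_by_cell is a hypothetical helper, the toy data are invented, and a "cell" is taken to be one subject in one condition.

```r
# Hypothetical helper (not part of the toolbox): recode to NA any value lying
# more than `cuts` standard deviations from the mean of its own cell.
trim_by_cell <- function(x, cell, cuts = 2) {
  m <- ave(x, cell, FUN = function(v) mean(v, na.rm = TRUE))
  s <- ave(x, cell, FUN = function(v) sd(v, na.rm = TRUE))
  out <- abs(x - m) > cuts * s
  x[which(out)] <- NA   # which() skips cells where the SD is undefined
  x
}

# Invented stacked data: two subjects, one condition, ten trials each.
toy <- data.frame(
  subject   = rep(c("sub1", "sub2"), each = 10),
  condition = "a1#b1",
  dv        = c(1:9, 100, 11:19, -100)
)

# One cell per subject x condition combination.
cell <- interaction(toy$subject, toy$condition, drop = TRUE)
toy$dv.clean <- trim_by_cell(toy$dv, cell)

# Rows whose value was recoded to NA.
toy[is.na(toy$dv.clean), ]
```

Computing the mean and standard deviation separately for each cell is what lets the same raw value be kept for one participant and removed for another.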


To demonstrate its use we will simulate data from a repeated measures experiment with 30 subjects in a 2 x 2 factorial design with 20 items per experimental condition. For simplicity we will generate the data matrix assuming an “incorrect” diagonal covariance matrix for the 80 items (that is, uncorrelated items with unit variance), because we only want to simulate an experiment where 30 participants respond to these 80 stimuli.

set.seed(20)

The line above makes the simulated data reproducible.

library(MASS)  # provides mvrnorm()
dat=data.frame(mvrnorm(30,mu=rep(c(1,-1,-1,1),each=20),
                       Sigma=diag(80)))

dim(dat)

[1] 30 80

within.factor=list(mrA=c('a1','a2'), mrB=c('b1','b2'))

dat.st=stack.data.fnc(dat, within.factor=within.factor,
                      col.start.rm=1, n.item=80)

head(dat.st)

#------------------------------------------------------------------

# DATA STACKED

#------------------------------------------------------------------

*** Head of your stacked data.

       dv  item subject mrA mrB condition
1 2.89071 item1    sub1  a1  b1     a1#b1
2 0.46034 item1    sub2  a1  b1     a1#b1
3 0.33693 item1    sub3  a1  b1     a1#b1
4 2.77245 item1    sub4  a1  b1     a1#b1
5 0.36962 item1    sub5  a1  b1     a1#b1
6 2.25805 item1    sub6  a1  b1     a1#b1
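For readers who want to see what a stacked layout like the one above involves, here is a minimal base R sketch. It does not reproduce stack.data.fnc: the toy wide data and the item-to-condition map in design are invented for illustration, with a single item per condition to keep the example small.

```r
# Invented wide data: 3 subjects (rows) by 4 items (columns).
wide <- data.frame(
  item1 = c( 1.2,  0.8,  1.1),
  item2 = c(-0.9, -1.3, -1.0),
  item3 = c(-1.1, -0.7, -1.2),
  item4 = c( 0.9,  1.4,  1.0)
)

# Hypothetical item-to-condition map for a 2 x 2 within-subject design.
design <- data.frame(
  item = names(wide),
  mrA  = c("a1", "a1", "a2", "a2"),
  mrB  = c("b1", "b2", "b1", "b2")
)

# stack() puts every column into a single 'values' column, one item after
# another, and records the source column in 'ind'.
long <- stack(wide)
names(long) <- c("dv", "item")
long$item <- as.character(long$item)

# Subjects repeat in the same order within each item.
long$subject <- rep(paste0("sub", seq_len(nrow(wide))), times = ncol(wide))

# Attach the within-subject factors and the combined condition label.
idx <- match(long$item, design$item)
long$mrA <- design$mrA[idx]
long$mrB <- design$mrB[idx]
long$condition <- paste(long$mrA, long$mrB, sep = "#")

head(long)
```

Each row of long is now one trial (one subject responding to one item), which is the layout the cleaning step expects.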

We can now request a histogram of the dv variable across all experimental conditions. This histogram is very important for choosing some of the mandatory arguments of the cleaning function.

histogram.fnc(dat.st, which.factor='mrA:mrB', check=T)

Setting the check argument to TRUE overlays the histograms of the four experimental conditions in a single plot.

As the pre-cleaning figure shows, the histogram in the bottom left panel has some very distant values at its low end. We can therefore start by telling the cleaning function not to allow values below -2.5.

dat.cl=clean.dv.fnc(dat.st, which.factor='condition',
                    which.min=-2.5)

Several arguments are omitted from this call: dv, because the data to clean already contain a variable with that name; and cuts, because by default the interval is set to ± 2 standard deviations around the mean of each experimental condition within each subject. The Z argument is omitted too, because we want the interval around the mean and not the MAD around the median (Z = F). Finally, which.max is not included because we let the function itself determine the appropriate cutoff for high values. If the function fails to remove values that are unacceptable in the research context, add the which.max argument with the desired cutoff.
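The difference between an interval around the mean and one based on the MAD around the median can be seen on a small invented vector. This is a sketch only: it uses base R's mad() with its default scaling constant, which is an assumption here and not necessarily what the toolbox computes.

```r
# Invented response values: eight ordinary observations and two extreme ones.
x <- c(0.9, 1.1, 1.0, 1.2, 0.8, 1.05, 0.95, 1.0, 6.0, 6.2)
cuts <- 2

# Interval of +/- cuts standard deviations around the mean.
lim_mean <- mean(x) + c(-1, 1) * cuts * sd(x)

# Interval of +/- cuts MADs around the median (mad() rescales by 1.4826).
lim_mad <- median(x) + c(-1, 1) * cuts * mad(x)

# Values each rule would flag for recoding to NA.
flag_mean_sd <- x[x < lim_mean[1] | x > lim_mean[2]]
flag_mad     <- x[x < lim_mad[1]  | x > lim_mad[2]]

flag_mean_sd
flag_mad
```

On this vector the two extreme values inflate the standard deviation enough that the mean-based interval keeps them, while the median/MAD interval flags both. That is the usual argument for the robust option when several outliers sit on the same side.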


head(dat.cl)

       dv  item subject mrA mrB condition dv.original dv.deleted
1 2.89071 item1    sub1  a1  b1     a1#b1     2.89071          0
2 0.46034 item1    sub2  a1  b1     a1#b1     0.46034          0
3 0.33693 item1    sub3  a1  b1     a1#b1     0.33693          0
4 2.77245 item1    sub4  a1  b1     a1#b1     2.77245          0
5 0.36962 item1    sub5  a1  b1     a1#b1     0.36962          0
6 2.25805 item1    sub6  a1  b1     a1#b1     2.25805          0

In dat.cl (the cleaned data) two new variables have been created: dv.original and a dummy variable, dv.deleted. We now filter dat.cl for dv.deleted equal to 1 (records whose dv was recoded to NA).

head(subset(dat.cl, dv.deleted==1))

    dv  item subject mrA mrB condition dv.original dv.deleted
10  NA item1   sub10  a1  b1     a1#b1     -1.2848          1
43  NA item2   sub13  a1  b1     a1#b1     -0.5665          1
81  NA item3   sub21  a1  b1     a1#b1     -1.8294          1
84  NA item3   sub24  a1  b1     a1#b1     -1.5054          1
87  NA item3   sub27  a1  b1     a1#b1     -1.4127          1
254 NA item9   sub14  a1  b1     a1#b1      2.8246          1
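Because dv.deleted is a 0/1 dummy, its mean within a condition is the proportion of removed observations. The sketch below shows this on an invented data frame (mock is hypothetical and only mimics the two relevant columns of the cleaned data).

```r
# Invented stand-in for the cleaned data: five trials per condition.
mock <- data.frame(
  condition  = rep(c("a1#b1", "a1#b2", "a2#b1", "a2#b2"), each = 5),
  dv.deleted = c(0, 0, 1, 0, 0,
                 0, 0, 0, 0, 0,
                 1, 0, 0, 1, 0,
                 0, 0, 0, 0, 1)
)

# Number of removed observations per condition.
n_deleted <- tapply(mock$dv.deleted, mock$condition, sum)

# Percentage of removed observations per condition.
pct_deleted <- 100 * tapply(mock$dv.deleted, mock$condition, mean)

n_deleted
pct_deleted
```

A summary like this is a quick check that the cleaning has not removed a disproportionate share of the data in any one condition.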

We see that the value -1.28 in the a1#b1 condition has been removed for subject 10. It is a negative value, but apparently not a particularly striking one. To understand why the cleaning procedure removed it, we first extract subject 10 from both data sets (original and cleaned) and request a histogram of that subject's response values.

sub10a=subset(dat.st, subject=='sub10')

histogram.fnc(sub10a, which.factor='mrA:mrB', check=T)

sub10b=subset(dat.cl, subject=='sub10')

histogram.fnc(sub10b, which.factor='mrA:mrB', check=T)

The conclusion is clear. For this subject, the value -1.28 is clearly far from the mean of the a1#b1 experimental condition, to which it belongs (black line in the histogram). In the right-hand histogram (cleaned data) you can see that removing that value gives the distribution of this condition a much more acceptable shape.
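The same point can be checked numerically by standardizing a value twice: against all of a subject's responses, and against only the responses in its own condition. The data below are invented for illustration (they are not the simulated data of this tutorial); only the value -1.28 is borrowed from the example.

```r
# Invented responses of one subject: 20 trials in each of four conditions.
vals <- list(
  "a1#b1" = c(seq(0.2, 2.0, by = 0.1), -1.28),
  "a1#b2" = seq(-2.0, -0.1, by = 0.1),
  "a2#b1" = seq(-2.0, -0.1, by = 0.1),
  "a2#b2" = seq(0.1, 2.0, by = 0.1)
)
sub <- data.frame(
  dv        = unlist(vals, use.names = FALSE),
  condition = rep(names(vals), lengths(vals))
)

# z score relative to all of the subject's responses.
z_overall <- (sub$dv - mean(sub$dv)) / sd(sub$dv)

# z score relative to the responses in the same condition only.
z_cell <- ave(sub$dv, sub$condition,
              FUN = function(v) (v - mean(v)) / sd(v))

target <- which(sub$dv == -1.28)
c(overall = z_overall[target], within.condition = z_cell[target])
```

In this toy case the value lies about one standard deviation from the subject's overall mean but almost three from the mean of its own condition, so a rule applied per condition and subject removes it even though it looks unremarkable in the pooled data.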