Bootsy (Version R.A.1)

Bootstrap Nonparametric

Comparison of Medians

# [begin]

# --------------------------------------------------------------------------------------------

# BOOTSY (Version R.A.1)

# Richard Anderson, PhD., Dept. of Psychology, Bowling Green State University

# --------------------------------------------------------------------------------------------

# The nonparametric bootstrap works regardless of the shapes of the distributions. In addition,

# the use of the median (rather than the mean) makes the effects robust against influence by

# extreme data points.

# This script uses R's bcaboot package (Efron & Narasimhan, 2020) to generate nonparametric

# bootstrap confidence intervals for any of the following statistics:

# [1] The Median.

# [2] D: The difference between the median of one group and

# the median of another group.

# [3] DELTA D: For four groups, the difference between the

# difference between two medians.

# This is offered as a nonparametric alternative to a 2-by-2 between-subject ANOVA

# interaction.

# The formula is: ( (A1,B1 - A2,B1) - (A1,B2 - A2,B2) ), where A1 is Level 1

# of Variable A, B1 is Level 1 of Variable B, A2 is Level 2 of Variable A, and

# B2 is Level 2 of Variable B.

# [4] DOUBLE DELTA D: For eight groups, the difference

# between two Delta Ds.

# This is offered as a nonparametric alternative to a 2-by-2-by-2 ANOVA interaction.

# The formula is:

# ((A1,B1,C1 - A2,B1,C1) - (A1,B2,C1 - A2,B2,C1)) -

# ((A1,B1,C2 - A2,B1,C2) - (A1,B2,C2 - A2,B2,C2)),

# where A1 is Level 1 of Independent Variable A, A2 is Level 2 of Independent

# Variable A, B1 is Level 1 of Independent Variable B, B2 is Level 2 of

# Independent Variable B, C1 is Level 1 of Independent Variable C, and C2 is

# Level 2 of Independent Variable C.

# Accommodating Repeated Measures

# If you want your test to include one or more repeated-measures factors you can

# build that into your dependent variable prior to using the present R script.

# For example, suppose you want to assess an interaction-like statistic for the two

# repeated-measures factors, Sess (Session 1 and Session 2) and Task (A and B), with

# the dependent variable being response time (RT), and with RT distributed across

# the four columns: RT.Sess1.TaskA, RT.Sess2.TaskA, RT.Sess1.TaskB, and

# RT.Sess2.TaskB.

# You could create the computed dependent variable,

# y = (RT.Sess1.TaskA - RT.Sess2.TaskA) - (RT.Sess1.TaskB - RT.Sess2.TaskB)

# Then use the Bootsy script to obtain a bootstrapped confidence interval for the

# median of y.

# --------------------------------------------------------------------------------------------

# CITATION

# Efron, B., & Narasimhan, B. (2020). The automatic construction of bootstrap confidence

# intervals. Journal of Computational and Graphical Statistics, 29(3), 608-619.

# https://doi.org/10.1080/10618600.2020.1714633

# --------------------------------------------------------------------------------------------

# (i) Your data file must be in CSV format. In the data file, each variable

# name (column header) should begin with a letter and should consist only of

# letters, numbers, underscore symbols, or periods (no spaces).

# (ii) Place a copy of your data file into into a directory (e.g., folder) that

# will be easy for you to navigate to without having to rely on shortcut-links.

# (iii) Change R's working directory to the directory containing the data file.

# In R, select: FILE, CHANGE DIR. (On a Mac, select: Misc, Change Working

# Directory.)

# (iv) Copy this page of code into a text editor such as Microsoft Word, edit

# the code, then paste it into the R console. It may help that the parts you

# should edit are bolded, in orange.

# WARNING: Be sure that your text editor does not automatically "correct" text as you type.

# For example, it's important that quote marks not be changed to smart quotes.

# Perhaps a safer move would be to use a plain text editor, though the text's bold and color

# formatting would be lost.)

# Note: If you've opted use an R script (.R) file, you need to have that file be in the same

# directory that contains your data file.

# --------------------------------------------------------------------------------------------

clean <- function(a) {if (is.character(a) & (a != " ")) a <- trimws(a) ; return(a)}

# --------------------------------------------------------------------------------------------

# SET YOUR PARAMETERS

# --------------------------------------------------------------------------------------------

Auto.Install.And.Load.Packages <- "yes"

# (The default is "yes". If you enter "no", you should install bcaboot, readr,

# and magrittr before running this script.)

clean(Auto.Install.And.Load.Packages)

if (Auto.Install.And.Load.Packages == "yes") {

# https://vbaliga.github.io/verify-that-r-packages-are-installed-and-loaded/

packages = c('bcaboot','readr','magrittr') # First specify the packages of interest.

# Now load or install&load all packages.

package.check <- lapply(

packages,

FUN = function(x) {

if (!require(x, character.only = TRUE)) {

install.packages(x, dependencies = TRUE)

library(x, character.only = TRUE)

}

)

}

MyInputFileName <- "Data_For_R.csv" %>% clean()

Missing_Value_Strings <- c( "" , " " , "NA" , "na" , "." )

# (If the data file has, or might have strings indicating missing values,

# list those strings.)

Add.Normal.Noise.SD <- 0.0

# (Adds normal noise to the dependent variable. Specify 0.0 [the noise standard deviation]

# for no added noise.)

Test_Type <- 4

# (1) A one-group test.

# (2) A two-group test of D.

# (3) A test of Delta D: the difference between the difference between two groups.

# (4) A test of Double Delta D (the difference between two Delta Ds).

DV_input <- "Reading_Time_seconds" %>% clean()

# (The dependent-variable column name from data file.)

First_IV_input <- "Congruency" %>% clean()

# The statement above specifies the first independent variable, A. (Column-name from data file.)

# If there isn't one, specify "none".

# If First_IV_input isn't "none", specify below the order of the two levels of the variable.

Level1_of_First_IV <- "Incongruent" %>% clean()

Level2_of_First_IV <- "Congruent" %>% clean()

# Note: Subtraction order will be Level 1 minus Level 2.

Second_IV_input <- "Expl_Status_of_Addtl_Info" %>% clean()

# Second independent variable, B. (Column-name from data file.)

# (If there isn't one, specify "none".)

# If Second_IV_input isn't "none", specify below the order of the two levels of the variable.

Level1_of_Second_IV <- "Non-Explanatory" %>% clean()

Level2_of_Second_IV <- "Explanatory" %>% clean()

# Note: Subtraction order will be Level 1 minus Level 2.

Third_IV_input <- "Binned_Confidence__0thru3_4and5" %>% clean()

# Third independent variable, C. (column-name from data file)

# (If there isn't one, specify "none".)

# If Third_IV_input isn't "none", specify below the order of the two levels of the variable.

Level1_of_Third_IV <- "Low" %>% clean()

Level2_of_Third_IV <- "High" %>% clean()

# Note: Subtraction order will be Level 1 minus Level 2.

VariableForFiltering <- "none" %>% clean()

# An expression for filtering-in a subset of the data, based on a variable from the data set.

# For no filtering, indicate "none".

RunFilter <- function(data) {

data <- data[data[, 5]

>= 4 # For no filtering, indicate "none" for VariableForFiltering, above.

return(data)

}

This.Many.Bootstrap.Samples <- 10000 # (There is no correct number. The larger, the better.)

My.Random.Seed.Value <- 12345 # Any integer. Makes the result reproducible.

# ------------------------------------------------------------------------------------

# TYPICALLY, YOU SHOULD NOT CHANGE ANYTHING BELOW THIS LINE.

# ------------------------------------------------------------------------------------

# Old code: Rank.All.Data.Before.Bootstrapping <- "no"

inputData <- as.data.frame(read_csv(MyInputFileName))

blank <- rep("blank", length(inputData[, 1]))

myData <- data.frame(DV = blank, IVA = blank, IVB = blank, IVC = blank, FilterVar = blank)

myData[,1] <- inputData[, DV_input]

myData[,1] <- myData[,1] + rnorm(length(myData[,1]),0,Add.Normal.Noise.SD)

# Add random normal noise to y

if (Test_Type == 2 | Test_Type == 3 | Test_Type == 4) {

myData[,2] <- as.character(inputData[,First_IV_input])

for (i in seq(1,length(myData[,1]))) {

if (myData[i,2] == as.character(Level1_of_First_IV)) {

myData[i,2] <- paste0("1_",myData[i,2])

}

if (myData[i,2] == as.character(Level2_of_First_IV)) {

myData[i,2] <- paste0("2_",myData[i,2])

}

if (Test_Type == 3 | Test_Type == 4) {

myData[,3] <- as.character(inputData[,Second_IV_input])

for (i in seq(1,length(myData[,1]))) {

if (myData[i,3] == as.character(Level1_of_Second_IV)) {

myData[i,3] <- paste0("1_",myData[i,3])

}

if (myData[i,3] == as.character(Level2_of_Second_IV)) {

myData[i,3] <- paste0("2_",myData[i,3])

}

if (Test_Type == 4) {

myData[,4] <- as.character(inputData[,Third_IV_input])

for (i in seq(1,length(myData[,1]))) {

if (myData[i,4] == as.character(Level1_of_Third_IV)) {

myData[i,4] <- paste0("1_",myData[i,4])

}

if (myData[i,4] == as.character(Level2_of_Third_IV)) {

myData[i,4] <- paste0("2_",myData[i,4])

}

if (VariableForFiltering != "none") {

myData[,5] <- inputData[, VariableForFiltering]

myData <- RunFilter(myData)

}

myData <- myData[complete.cases(myData),] # Remove cases containing NA

dichot_to_1_2 <- function(h) {

if (nlevels(as.factor(h)) == 2) {

q <- rank(h, ties.method = "average")

q <- q - min(q)

q <- (q / max(q)) + 1

return(q)

} else {

q <- "should.have.had.two.levels"

}

print(myData)

num.levels.error <- "no"

if (Test_Type != 1) {myData[,2] <- dichot_to_1_2(myData[,2])}

if (Test_Type == 3 | Test_Type == 4) {myData[,3] <- dichot_to_1_2(myData[,3])}

print("Below, the digits '1' and '2' substitute for the levels of the categorical values.")

if (Test_Type == 4) {myData[,4] <- dichot_to_1_2(myData[,4])}

print("Below, the digits '1' and '2' substitute for the levels of the categorical values.")

print(myData)

if (any((myData[,2] == "should.have.had.two.levels")

| (myData[,3] == "should.have.had.two.levels")

| (myData[,4] == "should.have.had.two.levels") )) {

num.levels.error <- "yes"

show.num.levels.error <- "** ERROR: A VARIABLE THAT SHOULD HAVE HAD EXACTLY TWO LEVELS, DID NOT."

}

# The following functions serve as inputs to bcajack.

# Note. Each input function should not refer to element numbers (row numbers, column numbers,

# or vector element numbers) within the function's input data, since bcajack controls

# which element numbers are utilized at any given point in time.

if (Test_Type == 1) {

myFunction <- function(k) {

myStat <- median(k[, 1])

return(myStat)

}

if (Test_Type == 2) {

myFunction <- function(k) {

myStat <- median(k[k[,2] == 1, 1]) - median(k[k[,2] == 2 ,1])

return(myStat)

}

if (Test_Type == 3) {

myFunction <- function(k) {

# old code:

# if (Rank.Within.Bootstrap.Sample == "yes") {k[,1] <- rank(k[,1], ties.method = "average")}

myStat <- (median(k[k[,2] == 1 & k[,3] == 1, 1]) - median(k[k[,2] == 2 & k[,3] == 1, 1])) -

(median(k[k[,2] == 1 & k[,3] == 2, 1]) - median(k[k[,2] == 2 & k[,3] == 2, 1]))

return(myStat)

}

if (Test_Type == 4) {

myFunction <- function(k) {

# old code:

# if (Rank.Within.Bootstrap.Sample == "yes") {k[,1] <- rank(k[,1], ties.method = "average")}

myStatTemp1 <-

{ (median(k[k[,2] == 1 & k[,3] == 1 & k[,4] == 1, 1]) -

median(k[k[,2] == 2 & k[,3] == 1 & k[,4] == 1, 1])) } -

{ (median(k[k[,2] == 1 & k[,3] == 2 & k[,4] == 1, 1]) -

median(k[k[,2] == 2 & k[,3] == 2 & k[,4] == 1, 1])) }

myStatTemp2 <-

{ (median(k[k[,2] == 1 & k[,3] == 1 & k[,4] == 2, 1]) -

median(k[k[,2] == 2 & k[,3] == 1 & k[,4] == 2, 1])) } -

{ (median(k[k[,2] == 1 & k[,3] == 2 & k[,4] == 2, 1]) -

median(k[k[,2] == 2 & k[,3] == 2 & k[,4] == 2, 1])) }

myStat <- myStatTemp1 - myStatTemp2

return(myStat)

}

NOTES <- "------------------------------------------------------------ NOTE. The most important results (above) are in the $lims table--specifically, Column 2 which provides the population-parameter estimates for various levels of confidence. As examples, the bootstrapped 95% confidence interval ranges from the bca value for the 0.025 level to the bca value for the 0.975 level; the bootstrapped 99% confidence interval ranges from the bca value for the 0.005 level to the bca value for the 0.995 level. Other results are as follows. The jacksd column is the standard error, in the algorithm's computations, for each value in the bca column. In the $stats table, Est Theta the original, un-bootstrapped sample statistic. Est Sdboot is the bootstrapped standard error for the sample statistic. It is possible for the bootstrap to fail when the sample size is so small that some bootstrapped samples are missing some cells required to compute the test statistic. For further details, see: https://cran.r-project.org/web/packages/bcaboot/bcaboot.pdf "

if (num.levels.error == "no") {

set.seed(My.Random.Seed.Value)

myBcaResult <- bcajack(x = myData, B = This.Many.Bootstrap.Samples, func = myFunction,