Post date: Dec 19, 2012 8:12:05 PM
A script I wrote using Amelia II and the plyr package for multiple imputation of missing values in a compositional dataset. Basically, the script uses these two tools to automate the data-imputation process. Note that your dataset must have a column named "ANID" containing a unique ID for each specimen, and missing values (often recorded as zeroes) must first be replaced with "NA". Also, although Amelia II can handle ordinal and nominal data, this script treats your data explicitly as ratio-scale data that are multivariate normal when converted to base-10 logarithms.
Use at your own risk, feel free to modify as long as you share, and if you have ideas on how to improve it, please let me know!
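Before running the script, zeroes that stand for missing values need to become NA. A minimal sketch of that preparation step, assuming your raw table is in a file and data frame both named my_data (hypothetical names; substitute your own):

```r
# Read the raw table; the ANID column becomes the row names
my.data <- read.csv("my_data.csv", row.names = "ANID")
# Treat zeroes as missing: replace every 0 with NA
my.data[my.data == 0] <- NA
# Write the cleaned table back out for use with the imputation script
write.csv(my.data, "my_data_NA.csv")
```

Only do this if zero genuinely means "not measured" in your data; a true measured concentration of zero should not be converted.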
# Script to multiply impute missing values in a dataset using a bootstrapping algorithm of AMELIA II
# December 19, 2012
# Matthew T. Boulanger
# University of Missouri Research Reactor
#Script uses Amelia II to impute missing values in a dataset.
#Automatically creates a file named "imputed_data.csv" in your working directory
#Option for the user to set the number of imputed datasets
#-------------------------------->
#Load required libraries
library("Amelia")
library("plyr")
#-------------------------------->
#Begin user-defined variables
# Set working directory
setwd("")
# Set name of dataset in CSV format to modify
dataset <- "clay.csv" #Change clay.csv to your filename
#Set number of imputed datasets (the m argument to amelia)
iter <- 10
#End user-defined variables
#-------------------------------->
#Read in the dataset
data.ppm <- read.table(dataset, sep=",", header=TRUE, na.strings="NA", row.names="ANID")
#Create vector of ANIDs, repeated once per imputed dataset
anid.list <- rep(row.names(data.ppm), iter)
#Log transform data
data.log <- log10(data.ppm)
#Impute missing values
imputed.log <- amelia(data.log, m = iter)
#Write each imputed dataset to files
write.amelia(obj=imputed.log, file.stem = "imputed_")
#Create list of the imputed files (list.files takes a regular expression,
#not a glob, so anchor the pattern to avoid re-reading "imputed_data.csv" on later runs)
imputed.files <- list.files(pattern = "^imputed_\\d+\\.csv$")
#Read in all imputed files
imputed.list = lapply(imputed.files, read.csv, header=TRUE, row.names = "X")
#Compile all imputed data into one big data frame
big.list <- ldply(imputed.list, data.frame)
#Back-transform the base-10 log concentrations to ppm
imputed.ppm <- 10^big.list
#Convert imputed data to data frame
imputed.ppm <- data.frame(imputed.ppm, row.names=NULL)
#Insert ANIDs into imputed data data frame
imputed.ppm$ANID <- anid.list
#Calculate mean of all imputations for each specimen, dropping the ANID
#column by name rather than by a hardcoded position
final.imputed.ppm <- aggregate(imputed.ppm[, names(imputed.ppm) != "ANID"], list(ANID = imputed.ppm$ANID), mean)
#Write imputed data to disk (row numbers are redundant with the ANID column)
write.csv(final.imputed.ppm, file="imputed_data.csv", row.names=FALSE)