Post date: Dec 19, 2012 8:12:05 PM
A script I wrote using Amelia II and the plyr package for multiple imputation of missing values in a compositional dataset. Basically, the script uses these two tools to automate the data-imputation process. Note that your dataset must have a column named "ANID" containing a unique ID for each specimen, and missing values (often recorded as zeroes) must first be replaced with "NA". Also, although Amelia II can handle ordinal and nominal data, this script treats your data explicitly as ratio-scale data that are multivariate normal when converted to base-10 logarithms.
Use at your own risk, feel free to modify as long as you share, and if you have ideas on how to improve it, please let me know!
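Before running the script, zeroes that stand for missing values need to become NA. A minimal sketch of that preparation step, assuming your raw table is in a file and data frame both named my_data (hypothetical names; substitute your own):

```r
# Read the raw table; the ANID column becomes the row names
my.data <- read.csv("my_data.csv", row.names = "ANID")
# Treat zeroes as missing: replace every 0 with NA
my.data[my.data == 0] <- NA
# Write the cleaned table back out for use with the imputation script
write.csv(my.data, "my_data_NA.csv")
```

Only do this if zero genuinely means "not measured" in your data; a true measured concentration of zero should not be converted.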
# Script to multiply impute missing values in a dataset using a bootstrapping algorithm of AMELIA II
# December 19, 2012
# Matthew T. Boulanger
# University of Missouri Research Reactor
#Script uses Amelia II to impute missing values in a dataset.
#Automatically creates a file named "imputed_data.csv" in your working directory
#Option for the user to set the number of imputed datasets
#-------------------------------->
#Load required libraries
library("Amelia")
library("plyr")
#-------------------------------->
#Begin user-defined variables
# Set working directory
setwd("")
# Set name of dataset in CSV format to modify
dataset <- "clay.csv" #Change clay.csv to your filename
#Set number of imputed datasets (the m argument to amelia)
iter <- 10
#End user-defined variables
#-------------------------------->
#Read in the dataset
data.ppm <- read.table(dataset, sep=",", header=TRUE, na.strings="NA", row.names="ANID")
#Create vector of ANIDs, repeated once per imputed dataset
anid.list <- rep(row.names(data.ppm), iter)
#Log transform data
data.log <- log10(data.ppm)
#Impute missing values
imputed.log <- amelia(data.log, m = iter)
#Write each imputed dataset to files
write.amelia(obj=imputed.log, file.stem = "imputed_")
#Create list of the imputed files (list.files takes a regular expression,
#not a glob, so anchor the pattern to avoid re-reading "imputed_data.csv" on later runs)
imputed.files <- list.files(pattern = "^imputed_\\d+\\.csv$")
#Read in all imputed files
imputed.list = lapply(imputed.files, read.csv, header=TRUE, row.names = "X")
#Compile all imputed data into one big data frame
big.list <- ldply(imputed.list, data.frame)
#Back-transform the base-10 log concentrations to ppm
imputed.ppm <- 10^big.list
#Convert imputed data to data frame
imputed.ppm <- data.frame(imputed.ppm, row.names=NULL)
#Insert ANIDs into imputed data data frame
imputed.ppm$ANID <- anid.list
#Calculate mean of all imputations for each specimen, dropping the ANID
#column by name rather than by a hardcoded position
final.imputed.ppm <- aggregate(imputed.ppm[, names(imputed.ppm) != "ANID"], list(ANID = imputed.ppm$ANID), mean)
#Write imputed data to disk (row numbers are redundant with the ANID column)
write.csv(final.imputed.ppm, file="imputed_data.csv", row.names=FALSE)