EcoPlate Analysis (1)

1) Preparation of ecoplate data file

Software for converting raw data from ecoplates is often bundled with the microplate reader itself, but such preprocessing is not required here at all; simply save the raw data as a text file or csv file on your computer, as follows.

The content of such a csv file, when opened in MS Excel, looks as follows.

Either text or csv format can be used, but the explanation below assumes the csv format. In a typical research project, such a file is created repeatedly, for each sample and for each incubation time of each sample. There are two important rules for file preparation.

1) Keep consistency of folder name and file name

For example, a scheme such as "sample date/time-process name or sampling point-measurement time.csv". Use only numbers and letters in file names, and do not include spaces (see the short example after rule 2 below).

2) Keep consistency in the format of file contents

The format of the file differs between microplate readers. Normally, metadata such as the measurement date, time, and protocol is included at the top of the raw data (for example, lines 1 to 6 of the csv file above); leave it as it is, without any manual modification. Also leave blank lines and the like untouched. This keeps the raw data format consistent as long as you use the same microplate reader.
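
As an illustration of both rules (the file names below are hypothetical examples, not files distributed with the workshop), you can check candidate file names for forbidden characters and peek at the untouched metadata lines at the top of a raw file like this:

#hypothetical file names: sample date/time, place, and incubation time joined without spaces
example_names<-c("20150512-1500higashihori_24.csv", "20150512-1500honmachi_48.csv")
grepl("^[A-Za-z0-9._-]+$", example_names)  #TRUE means only letters, numbers, "-", "_", "." (no spaces)
#peek at the first 8 lines of a raw file (hypothetical path); the metadata lines are left as they are
readLines("20150512-1500higashihori_24.csv", n=8)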

2) Design of the entire R script

Keeping scripts and source code readable, in any language and not just R, is at once the most important goal and the hardest to achieve. A good basic approach is to organize the script into "blocks", one for each type of content, as follows.

##########Comments on the whole script###########
########Load all of the necessary libraries (Library Block)##########
library(library_name1)
library(library_name2)
...
########Define all necessary functions (Function Block)#############
func1<-function(...)
{
}
func2<-function(...)
{
}
...
########Execute calculations/analyses by calling functions (Analysis Block) #######
#####Create blocks for each different type of analyses##############
####Analysis01#######
a<-func1(...)
####Analysis02#######
....

#######It might be OK to keep the "trial-and-error" scripts that were used for creating the functions####
#end of the script

However, since the sample code distributed at this workshop is meant for learning the analysis methods, it is not structured as above. Instead, the sample code is written in the following order.

#1) Load libraries necessary for analysis 1 (if any)
#2) Trial and error, preliminary calculations for defining functions
#3) Definitions of functions
#4) Analysis using the functions defined
#1) Load libraries necessary for analysis 2
(then return to 2 and repeat)

When writing your own scripts after the course, it is worth paying attention to this structure.

3) Reading ecoplate data files

Suppose that the data files are named consistently, like "sample date-process name or sampling location-measurement time.csv". With this consistency, it is possible to write a script that reads all the data files automatically. As preparation for turning such a script into a function, you need to become familiar with the "paste" function, which joins multiple strings into a new string.

In the following example (the last line), the three strings "ACGCTT", "AGCCTTC", and "data01-023" are combined with "_" between them (via the option sep="_"), and the new string "ACGCTT_AGCCTTC_data01-023" is generated.

####Preparation#####
####use the function "paste" to generate the sequence of relative path to efficiently access to multiple files of ecoplate data

#example 01
paste("ACGCTT", "AGCCTTC", "data01-023", sep="_")

Writing a long string directly as a parameter of paste() lowers readability, so it is also fine to split it over several lines as follows.

place_name<-"osaka"
file_name<-"sample01.csv"
file_path<-paste("./data", place_name, file_name, sep="/")
file_path
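
Running these lines, file_path becomes "./data/osaka/sample01.csv".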

Let's apply this example and write a script to read several files. Here there are two folders in the parent folder (data_merged and r_analysis); the R script is placed in r_analysis while the data files are stored in data_merged. Suppose there are multiple data files named 20150512-1500higashihori_24.csv, 20150512-1500higashihori_48.csv, and 20150512-1500higashihori_72.csv. These names follow the format "sampling year-month-day-time" + "sampling place" + "_" + "ecoplate culture time" + ".csv".

In this case, the relative path to a data file must first go up one level from r_analysis (../) and then into data_merged. In other words, the relative path should be

../data_merged/20150512-1500higashihori_XX.csv .

Here, XX is 24, 48, or 72. Write the following two lines in the script to hold the path information up to (but not including) XX.

############Simple trial to load three files at the same time and stock the data into list##############
path_file<-"../data_merged/"   #folder name
data_name<-"20150512-1500higashihori"   #data name

Since the culture-time part XX has to change from file to file, prepare it as a vector.

time_m<-c(24,48,72)  #sequence of time_point

Access to each element of the object time_m, and its data type, can be checked as follows.

#check the value
time_m[1]
#check the type
class(time_m[1])  #--> this should be converted into character to be used in the part of file path

Since the class of time_m[1] is numeric, you need the function as.character(), which converts it to a character string, before it can be incorporated into the relative path to the file.
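
As a quick check (these lines are not part of the main script), you can see the conversion directly:

#as.character() turns the numeric value 24 into the string "24"
as.character(time_m[1])
class(as.character(time_m[1]))

The following script then generates the relative path for reading just one file (20150512-1500higashihori_24.csv).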

#To generate the file path "../data_merged/20150512-1500higashihori_24.csv"
j<-1
file_ph<-paste(path_file, data_name, "_", as.character(time_m[j]), ".csv", sep="")
file_ph

You can find that file_ph is "../data_merged/20150512-1500higashihori_24.csv".
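
As an optional sanity check (assuming the folder layout described above), you can confirm that the path actually points to an existing file before trying to read it:

#optional check: TRUE if the generated path points to an existing file
file.exists(file_ph)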

Then, prepare the "list" described on webpage For R Beginners as a box to read multiple ecoplate data.

#generate an empty list; ecoplate data are saved as a list
data_box<-list()

Consider reading the contents of 20150512-1500higashihori_24.csv and saving them into the first element of this data_box, data_box[[1]]. As shown in the Excel screenshot above, lines 1 to 6 hold the metadata of the measurement and line 7 holds the column numbers (1, 2, ..., 12) of the ecoplate wells; none of these need to be read. The first column also contains part of the metadata and the row labels (A, B, ..., H) of the ecoplate wells; this column has to be read and is then deleted afterwards. To skip the first seven lines when reading the data, specify the number of lines to skip as follows.

#the first several lines in csv file include metadata, but should be skipped
no_skip<-7

Immediately after the skipped seven lines, the ecoplate data begin, so no header is needed (header = FALSE). Read the data and look at the contents.

#read the csv file and save it into an element of the list
data_box[[j]] <- read.csv(file_ph, skip=no_skip, header=FALSE)  
View(data_box[[j]])

When you look at the contents of data_box with the View() function, the automatically generated labels V1 - V13 are assigned as column names (colnames). Since column V1 records only the row labels A to H, there is no need to keep it: delete it and overwrite the same object, data_box[[j]]. The "-1" in data_box[[j]][, -1] means "remove the first column".

#Delete the first column (A,B,...H)
data_box[[j]] <-data_box[[j]][,-1]  

Let's name the rows and columns so that they are easy to recognize. Here it is enough to know the correspondence with the ecoplate wells, so simple names suffice; there is no need to include the chemical substrate names.

#set names
rownames(data_box[[j]])<-c("A", "B", "C", "D", "E", "F", "G", "H")   
colnames(data_box[[j]])<-c("V1","V2","V3", "V4","V5","V6","V7","V8", "V9", "V10", "V11", "V12") 

With View(data_box[[j]]), you should be able to confirm that the data were read in the intended order (note that the values do not have to match exactly the values in the Excel screenshot above).
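
If you are not using RStudio, or simply prefer a console check, the same confirmation can be done with dim(), head(), and str(); this is just an alternative to View(), not part of the main script:

#console alternatives to View(): the data frame should have 8 rows and 12 columns
dim(data_box[[j]])
head(data_box[[j]], 3)   #first three rows (wells A to C)
str(data_box[[j]])       #structure and column types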

You should now understand the process of reading the contents of one data file into the list. Let's automate this with a for loop. Three values are already saved in time_m (by time_m<-c(24,48,72)), so everything is ready for reading data from three files. With the following for loop, the contents of the three files are read sequentially into the elements of the list; the scripts described above are simply combined inside the loop.

#If you like to load three files ***24.csv, ***48.csv, ***72.csv at the same time, you can use the loop function
for(j in 1:3) {
  #file path
  file_ph<-paste(path_file, data_name, "_", as.character(time_m[j]), ".csv", sep="")
  data_box[[j]] <- read.csv(file_ph, skip=no_skip, header=FALSE)  
  #Delete the first column (A,B,...H)
  data_box[[j]] <-data_box[[j]][,-1]  
  #set names
  rownames(data_box[[j]])<-c("A", "B", "C", "D", "E", "F", "G", "H")   
  colnames(data_box[[j]])<-c("V1","V2","V3", "V4","V5","V6","V7","V8", "V9", "V10", "V11", "V12") 
}

Whether or not it succeeded can be checked with, for example, View(data_box[[2]]).
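
Another quick, optional check is to confirm the length of the list and the dimensions of each element; every element should be an 8 x 12 data frame:

#optional check: three elements, each 8 rows x 12 columns
length(data_box)
sapply(data_box, dim)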

Next, let's build a function around this for loop so that more files can be read just by specifying the file names. Before defining the function, consider the following error handling. Prepare a vector called time_m2 and see what happens when you also try to access a file for a culture time (28) that does not actually exist.

#Add one more complexity to avoid the error
#Skip the index j if the file does not exist
time_m2<-c(24,28, 48,72)  #sequence of time_point, noting that the data (file) at 28 does not exist

Using the first element of time_m2, time_m2[1], we access a file that does exist, so the class of the object "e" in the following script becomes "data.frame" and the data from the csv file are read in.

#check what happens without or with error by class(e)
#without error
j<-1
file_ph<-paste(path_file, data_name, "_", as.character(time_m2[j]), ".csv", sep="")
e<-try(read.csv(file_ph, skip=no_skip, header=T), silent=FALSE)   #error management
class(e) 
e

Next, use time_m2[2] and check what happens when accessing a nonexistent file. When the following lines are executed, an error message is displayed on the console and the class of e becomes "try-error".

#with error
j<-2
file_ph<-paste(path_file, data_name, "_", as.character(time_m2[j]), ".csv", sep="")
e<-try(read.csv(file_ph, skip=no_skip, header=T), silent=FALSE)   #error management
class(e)  

Now, if you wrap what you actually want to execute (read.csv in this case) in this try() function and put the try(...) inside the function to be defined, then even if an error occurs, only that part is skipped and the calculation continues to the end of the function. The script suddenly looks more complicated, but the following function is built simply from the sample scripts above. The first few comment lines document the function definition. The key point is the error check if(class(e)=="try-error") next: if the file does not exist and read.csv() returns an error, the calculation skips the else block, in which the file would be read and the names changed, and moves on to the next loop index i.

#function version 1 using file names
###############list of parameters#######################
#relative_path & folder_name: need to specify the folder position of data 
#date_time_sample: the string specifying the sample date and time, which should correspond to the file name
#place_sample: the name of sample place, which is the part of file name
#measured_time: the vector of the measured time
#no_skip_line: the first few lines of the raw output from the microplate reader contain metadata, which should be excluded
##################Output is a list; each element of the list is the data frame to have ecoplate 96 values

load_ecoplate_data_osaka <-  function(relative_path="../", folder_name="data_merged", date_time_sample="20150115-0700", place_sample ="higashihori", measured_time=c(24,48,72), no_skip_line=7)    #Parameters that already have values in the function definition are treated as defaults.
{  
  data_list <-list() #generate empty list, ecoplate data are saved as a list
  loop_length=length(measured_time)  #count the length of loop by the number of files 
  for(i in 1:loop_length) {
    file_name <- paste(relative_path,folder_name,"/", date_time_sample, place_sample, "_", as.character(measured_time[i]),".csv", sep="")   #the function paste is used to combine multiple character sequences. 
    e <-try(read.csv(file_name, skip=no_skip_line, header=T), silent=FALSE)   #error management
    if(class(e) == "try-error") next  #if the file doesn't exist, skip the index; it could happen because the measurement dates may not continuous
    else {
      data_list[[i]] <- read.csv(file_name, skip=no_skip_line, header=FALSE)  #read the csv file and save it into an element of the list
      data_list[[i]] <-data_list[[i]][,-1]  #Delete the first column (A,B,...H)
      rownames(data_list[[i]])<-c("A", "B", "C", "D", "E", "F", "G", "H")   #set names
      colnames(data_list[[i]])<-c("V1","V2","V3", "V4","V5","V6","V7","V8", "V9", "V10", "V11", "V12") #set names
    }
  }
  data_list  #output (return value) of this function
}

To check whether the error handling works, the vector time_m3 below intentionally includes the time "34", for which no file exists. You can see how the data are stored in the list by running the following script.

#example, you can see the second element of the result list is empty
time_m3<-c(24,34,48,72,120,168)
#call the function and save the result into "ecoplate_data01"
ecoplate_data01<-load_ecoplate_data_osaka(folder_name="data_merged", date_time_sample="20150114-2300", place_sample ="higashihori", measured_time=time_m3, no_skip_line=7)

You will see that the second element [[2]] is NULL (empty).

> ecoplate_data01[[2]]
NULL
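
With longer lists it becomes tedious to inspect each element by hand; as an optional one-liner, you can list the indices of all empty (NULL) elements at once:

#find which elements of the list are empty (i.e., files that did not exist)
which(sapply(ecoplate_data01, is.null))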

This is still not a perfect function for reading many files. With load_ecoplate_data_osaka above, if there are multiple sampling dates/times and places, you must call the function separately for each combination of sampling date, time, and place. If the amount of data is relatively small this is not a problem, but with a lot of data you will want to automate further. Let's therefore prepare a csv file that lists the sampling information corresponding to each file name, as below.
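
The exact contents depend on your own samples, but as a purely hypothetical sketch, the metadata csv could look like the following, with one row per data file and columns corresponding to the parts of the file name:

sample_ID,date_time_sample,place_sample,measured_time
1,20150114-2300,higashihori,24
2,20150114-2300,higashihori,48
3,20150114-2300,higashihori,72
...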

Suppose you save this file, named meta_sample.csv, in a separate folder "meta_data" at the same level as the folder holding the R script. First, load this file.

#load the metadata
meta_test<-read.csv("../meta_data/meta_sample.csv", header=TRUE)
View(meta_test)

You can also check the number of samples by examining row number.

nrow(meta_test)  #sample size in metadata

meta_test contains four variables (four columns): sample_ID, date_time_sample, place_sample, and measured_time. You can build a file path from these, but you first need to check the data types. For example, the type of date_time_sample is factor, as follows,

> class(meta_test$date_time_sample[2])
[1] "factor"

so if you simply convert it to a numeric value, you get the internal integer code of the factor level (its position among the ordered levels) rather than the original string, which will cause trouble.

> as.numeric(meta_test$date_time_sample[2])
[1] 9

To extract the original string, convert it with as.character().

> as.character(meta_test$date_time_sample[2])
[1] "20150114-2300"

Based on the sample scripts above, you can, for example, create the path to the second file ("../data_merged/20150114-2300higashihori_48.csv") with the following script.

#We can generate file path
i<-2
folder_name<-"data_merged"
file_ph<- paste("../" ,folder_name,"/", as.character(meta_test$date_time_sample[i]), as.character(meta_test$place_sample[i]), "_", as.character(meta_test$measured_time[i]), ".csv", sep="") 
file_ph

Let's create a new file reading function by combining the above pieces of sample scripts.

#function version 2 using a metadata file, which stores the metadata of the results from multiple samples
#data in the metadata file are used to generate the file paths, instead of specifying them manually as in the function above
###############list of parameters#######################
#relative_path & folder_name_data: need to specify the folder position of data 
#metadata: dataframe that includes the sampling date, time, and measured time information
#no_skip_line: the first few lines of the raw output from the microplate reader contain metadata, which should be excluded
##################Output is a list; each element of the list is the data frame to have ecoplate 96 values

load_ecoplate_data_osaka2 <-  function(relative_path="../", folder_name_data="data_merged", metadata=meta_test, no_skip_line=7)    
{
  
  data_list <-list() #generate empty list, ecoplate data are saved as a list
  loop_length=nrow(metadata)  #decide the loop length based on the sample size
  cat("The number of sample to be loaded is:", loop_length, "\n")  #output basic information
  cat("The levels of sample places are:", levels(metadata$place_sample), "\n")  #output basic information
  #file_name
  
  for(i in 1:loop_length) {
    file_name <- paste(relative_path,folder_name_data,"/", as.character(metadata$date_time_sample[i]), as.character(metadata$place_sample[i]), "_", as.character(metadata$measured_time[i]), ".csv", sep="")   #the function paste is used to combine multiple character sequences. 
    e <-try(read.csv(file_name, skip=no_skip_line), silent=FALSE)   #error management
    if(class(e) == "try-error") next  #if the file doesn't exist, skip the index; it could happen because the measurement dates may not continuous
    else {
      data_list[[i]] <- read.csv(file_name, skip=no_skip_line, header=FALSE)  #read the csv file and save it into an element of the list
      data_list[[i]] <-data_list[[i]][,-1]  #Delete the first column (A,B,...H)
      rownames(data_list[[i]])<-c("A", "B", "C", "D", "E", "F", "G", "H")   #set names
      colnames(data_list[[i]])<-c("V1","V2","V3", "V4","V5","V6","V7","V8", "V9", "V10", "V11", "V12") #set names
      
    }
  }
  
  data_list  #output (return value) of this function
}

The function is now quite long. In RStudio's editor you can fold the region enclosed by {}: a small triangle icon appears next to the line numbers at the left edge, and clicking it hides or redisplays the content.

Now define the function as above, read the metadata file, call the function, and read the data into a list called osaka_data.

#load metadata
meta_test<-read.csv("../meta_data/meta_sample.csv", header=TRUE)  
#example
osaka_data<-load_ecoplate_data_osaka2(folder_name_data="data_merged", metadata=meta_test, no_skip_line=7)  

The result of the function executed should be displayed on the console as follows.

> osaka_data<-load_ecoplate_data_osaka2(folder_name_data="data_merged", metadata=meta_test, no_skip_line=7)
The number of samples to be loaded is: 145 
The levels of sample places are: higashihori honmachi 

You can see that there are 145 samples, which consist of two sampling locations: higashihori and honmachi.
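
To see how the samples break down by location, or by location and measurement time, a simple cross-tabulation of the metadata works; this is an optional check, not part of the original script:

#optional: count samples per sampling place, and per place x measured time
table(meta_test$place_sample)
table(meta_test$place_sample, meta_test$measured_time)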

Now, if you want to check the metadata of a specific sample and its ecoplate values, you can use lines like the following.

#Check the information of metadata & list of data (osaka_data)
meta_test[133,]
osaka_data[[133]]

Results will be displayed on the console as follows.

> meta_test[133,]
    date_time_sample place_sample measured_time
133    20150717-1100     honmachi           168
> osaka_data[[133]]
     V1    V2    V3    V4    V5    V6    V7    V8    V9   V10   V11   V12
A 0.543 2.306 1.745 2.993 0.165 2.636 1.232 2.907 0.174 1.808 1.793 2.531
B 2.674 2.927 2.010 2.844 2.625 3.194 2.285 2.771 2.036 2.615 2.237 3.201
C 2.884 0.574 0.211 2.664 2.673 1.600 0.054 2.679 2.892 2.497 0.093 2.183
D 2.804 1.974 2.548 2.648 2.533 2.469 2.542 2.042 2.901 2.070 2.165 2.622
E 1.997 2.222 0.994 2.742 2.449 2.261 1.074 2.587 2.500 2.196 1.002 2.021
F 2.466 2.060 2.656 2.161 1.803 1.781 2.655 2.279 2.468 2.635 2.564 1.907
H 1.977 1.782 2.187 2.829 1.754 2.000 2.082 2.790 1.923 2.030 1.968 1.835
G 2.308 1.122 0.756 2.607 1.965 1.633 2.687 1.615 2.323 1.033 2.157 1.860

Reading the data is now complete. For your own samples, you should be able to read all the data automatically by adapting the functions introduced above to your own file-naming rules.