First, we convert the raw data in two ways, the integration method and the standard method, as shown in the figure below.
Currently, all ecoplate data are stored in the list osaka_data, with multiple samples mixed together. Unless we split this list by sample, we cannot convert the data with either the integration method or the standard method shown in the figure above. In meta_test, the information that distinguishes the samples is stored in sample_ID as class "integer".
#We need to separate the data set into each sample with multiple measurements
class(meta_test$sample_ID[3])
The actual number of samples can be found by inspecting the contents of meta_test directly, but it is more reliable to check it with either of the following two methods.
#number of samples
max(meta_test$sample_ID)
#OR
length(levels(meta_test$date_time_sample))
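If sample_ID were not guaranteed to run consecutively from 1, counting the unique IDs would be a safer cross-check; this extra line is only a suggestion, not part of the original script.
#a third check that does not assume consecutive IDs
length(unique(meta_test$sample_ID))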
Next, the list will be rebuilt so that the multiple measurement results from each sample_ID are grouped together. A list can actually hold anything, so we can also store a list in each element of another list. With the following approach, you can handle the list like a two-dimensional matrix and store one measurement result (96 values) in each element of that matrix.
#trial to generate the list of list
test_list<-list()
test_list[[1]]<-list()
test_list[[1]][[1]]<-osaka_data[[1]]
test_list[[1]][[2]]<-osaka_data[[2]]
test_list[[1]]
test_list[[1]][[2]]
Based on this trial, we will create a list named sample_list and store the k-th measurement result of each sample ID in the format sample_list[[sample_ID]][[k]].
#separating the data to sample ID-specific list
sample_list<-list()
no_sample<-max(meta_test$sample_ID)
for(i in 1:no_sample) {
sample_list[[i]]<-list()
}
Likewise, I'd like to divide the metadata by sample ID, but this time each element of the list should be a data frame rather than a list. To do that, we prepare the list as follows.
#summarizing the metadata into sample-ID-specific list
metadata_list<-list()
no_sample<-max(meta_test$sample_ID)
for(i in 1:no_sample) {
metadata_list[[i]]<-as.data.frame(meta_test[1,]) #initialize with a one-row placeholder; it is overwritten when the real metadata are copied in below
}
Now, the number of measurements for each sample can be counted as the number of observations with the identical sample_ID in the metadata. For example, if sample_ID is 5, the number of measurements can be confirmed as follows.
#Check the number of repeated measurements
sum(meta_test$sample_ID==5)
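To see the counts for all samples at once rather than one ID at a time, the base function table() gives a quick overview (an optional convenience):
#overview of the number of measurements for every sample_ID
table(meta_test$sample_ID)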
Based on the above preparations, the following double loop organizes the ecoplate measurement data and the metadata by sample ID.
i<-1 #initialization
for(j in 1: no_sample) {
no_measurement<-sum(meta_test$sample_ID==j) #check the number of repeated measurements
#loop for loading all measurement results from sample index j
for(k in 1:no_measurement) {
sample_list[[j]][[k]]<-osaka_data[[i]] #copy the data of each measurement into sample_list[[sample_index]]
metadata_list[[j]][k,]<-meta_test[i,] #copy the metadata
i<-i+1 #update the index of the raw data (list osaka_data)
}
}
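Note that this double loop assumes the measurements in osaka_data are ordered by sample_ID. Under that assumption, an optional sanity check is to confirm that every raw measurement was copied exactly once:
#sanity check: the total number of stored measurements should equal the raw data size
sum(sapply(sample_list, length))==length(osaka_data)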
For example, you can see the effect of this script by checking the contents of sample ID 3 as follows.
#example
sample_id<-3
sample_list[[sample_id]]
metadata_list[[sample_id]]
Next, we prepare a function that automatically performs the integration; its arguments take the elements of sample_list and metadata_list created above. The first few lines are the function definition. This function has three parameters, ecop_series, metadata, and min_interval_hour, and its output is a data frame storing the integral values from one sample.
##########[2] Definitions of functions to data arrangement (integration, maximum, average, minimum, etc)##################
#The function for taking integration (=calculation of the area below the color development curve) even when there are gaps in measurement dates
##########################list of parameters
#ecop_series: data source of ecoplate color patterns from a single sample ID (list format)
#metadata: metadata; we need the information on the measurement times
#min_interval_hour: the minimum interval in hours, here 24 hours
##################Output is a dataframe
integ_ecoplate <- function(ecop_series, metadata, min_interval_hour=24)
{
The next few lines calculate the number of measurements, the incubation period, etc., using the measurement interval (min_interval_hour). The first line prepares a list to store the measured data of this sample. sampled_index holds the measurement-day information, converted from metadata$measured_time; metadata$measured_time stores the measurement time in hours after inoculation.
Since end_sample is the maximum value of sampled_index, it tells us how many days the sample was incubated. Note, however, that length_period is the actual number of measurements, which may be smaller than end_sample. For example, if the final measurement is taken 6 days after the beginning of incubation (144 hours) but no measurements were made after 3 and 4 days, end_sample will be 6 while length_period will be 4.
data_eco <- list() #prepare an empty list
sampled_index <- metadata$measured_time/min_interval_hour #convert hours to n-th observation (n-th day)
end_sample <- max(sampled_index) #The final date of the measurement
length_period <- length(ecop_series) #number of measurements, generally less than or equal to end_sample, because the observation interval is often longer than min_interval_hour
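#For illustration with hypothetical values: if a sample was measured after 24, 48, 120, and 144 hours,
#sampled_index becomes c(24, 48, 120, 144)/24 = c(1, 2, 5, 6),
#so end_sample is 6 while length_period is only 4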
In the next line, we copy the multiple measurements stored in ecop_series into data_eco. Continuing the example above, note that data_eco[[3]] and data_eco[[4]] will remain empty (NULL) because sampled_index has only the four values {1, 2, 5, 6}.
for(i in 1:length_period) data_eco[[sampled_index[i]]]<-ecop_series[[i]] #load the ecoplate data only for the sampled dates; some elements of data_eco may remain empty (NULL)
In the next two lines we prepare a data frame named data_integ and store the first measurement there.
data_integ <- data.frame() #prepare a data frame
data_integ <- data_eco[[1]] #Note that "data_eco" should be a list; each item is the information of color development from each measurement date.
Now for the contents of the for loop over i, which performs the actual integration; remember that we assume the measurements may be sparse (for example, no measurements after 3 and 4 days). The calculation does not use a complex integral approximation but proceeds as shown in the following figure.
Suppose that the graph above shows the absorbance of one well of the 96-well microplate, with four actual observations after 24, 48, 120, and 144 hours. In the following script, the index i of the for loop increases one by one from 2 to 6, but when i becomes 3 (corresponding to 72 hours in the graph), there is no data in data_eco[[3]] (NULL). This is detected by the statement if(is.null(data_eco[[i]])). When is.null() is true, the inner loop over j searches forward until data exist, at 120 hours (j = 5). When j = 5, !is.null(data_eco[[j]]) is true, and the missing values are filled in by the loop over k. As the figure shows, the interpolation is very simple: the value after 48 hours and the value after 120 hours are connected by a straight line.
for(i in 2:end_sample) {
#first check the NULL dates (lack in measurement)
if(is.null(data_eco[[i]])) {
for(j in (i+1):end_sample) { #search forward for the next non-NULL date
if(!is.null(data_eco[[j]])) { #when encountering a non-NULL date (j)
for(k in i:(j-1)) data_eco[[k]]<-((j - k)*data_eco[[i-1]] + (k - i + 1)*data_eco[[j]])/(j - i + 1) #linear interpolation between the two non-NULL dates (i-1 and j)
break #stop searching once non-NULL data is found
}#end of if data_eco[[j]]
}#end of for j
}#end of if data_eco[[i]]
In this way, the originally empty (NULL) elements data_eco[[k]] are filled with interpolated (approximated) values; we then simply add each day's value, including the interpolated ones, in the following manner. This is an integral approximation that computes the total area of the gray bars in the figure above. A more sophisticated approximation could be prepared, but for now we will assume that this simple method is good enough.
data_integ <- data_integ + data_eco[[i]] #each element of data_eco[[i]] is added element-wise
}#end of for loop i
Finally, by dividing the integrated value by the measurement period, we correct for variation between samples with different measurement periods.
data_integ/(end_sample*min_interval_hour) #output, take the average value over the integration period, i.e., the normalization by integration period
}
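To confirm that the interpolation formula really traces a straight line between the two measured values, here is a minimal standalone check with made-up scalar absorbances (hypothetical values, purely for illustration):
#minimal check of the linear interpolation used above (hypothetical scalar values)
d_before<-0.4 #measured value at day i-1 = 2
d_after<-1.0 #measured value at day j = 5
i<-3
j<-5
for(k in i:(j-1)) print(((j - k)*d_before + (k - i + 1)*d_after)/(j - i + 1))
#prints 0.6 (day 3) and 0.8 (day 4), the points on the line between 0.4 and 1.0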
By defining the function like this, you can check the execution result for one sample in the console, for example as follows.
> sample_id<-4
> integ_ecoplate(ecop_series=sample_list[[sample_id]], metadata_list[[sample_id]], min_interval_hour=24)
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
A 0.003101190 0.003029762 0.005258929 0.009366071 0.013357143 0.003461310 0.006866071 0.012779762 0.002916667 0.004008929 0.005410714 0.009952381
B 0.013651786 0.013860119 0.008663690 0.013125000 0.014080357 0.032985119 0.008386905 0.024922619 0.013059524 0.014000000 0.009032738 0.021288690
C 0.013592262 0.003193452 0.002702381 0.004514881 0.014809524 0.003488095 0.002398810 0.003205357 0.012190476 0.003354167 0.002455357 0.003413690
D 0.009056548 0.005023810 0.006800595 0.016619048 0.008199405 0.004750000 0.005482143 0.012154762 0.007809524 0.007529762 0.015559524 0.016934524
E 0.014101190 0.012833333 0.004172619 0.003681548 0.014827381 0.015178571 0.003818452 0.003779762 0.013169643 0.017023810 0.003514881 0.004056548
F 0.022017857 0.004964286 0.005997024 0.010467262 0.027886905 0.003592262 0.003937500 0.008949405 0.017943452 0.003363095 0.004735119 0.014077381
H 0.004181548 0.003544643 0.004979167 0.004633929 0.003872024 0.003217262 0.005169643 0.006580357 0.003767857 0.003535714 0.005550595 0.004443452
G 0.003776786 0.005199405 0.005005952 0.009931548 0.003166667 0.005229167 0.004812500 0.006276786 0.002913690 0.005029762 0.003267857 0.011389881
>
Now let's compute the integral values of all samples with the following loop, which calls integ_ecoplate on the corresponding elements of the data list sample_list and the metadata list metadata_list created above.
#integrating all samples
integrated_osaka<-list() #prepare empty list
no_sample<-max(meta_test$sample_ID)
for(i in 1:no_sample) integrated_osaka[[i]] <- integ_ecoplate(ecop_series=sample_list[[i]], metadata_list[[i]], min_interval_hour=24)
integrated_osaka
Next, let's extract the data according to the standard method, which uses only the final-day measurement of each sample. This is quite simple: you can access sample_list[[sample_ID]][[k]] with k equal to length(sample_list[[sample_ID]]).
#extracting only the final day measurement data
#example
sample_list[[1]][[length(sample_list[[1]])]]
finalday_osaka<-list() #prepare empty list
no_sample<-max(meta_test$sample_ID)
for(i in 1:no_sample) finalday_osaka[[i]] <-sample_list[[i]][[length(sample_list[[i]])]]
finalday_osaka
This is the end of the first phase of raw data conversion.
Now for the second phase of raw data conversion. Here we process the three replicate values (triplicates) for each substrate. As shown in the illustration, for these triplicates, including the control (blank) wells, we calculate the maximum, average, and minimum values.
In these processes, note that the data from each plate are stored regularly in a data frame of 12 columns by 8 rows: columns V1 to V12 and rows A to H. Also note that V1-V4, V5-V8, and V9-V12 are three replicate sets. In other words, by assembling the four groups (V1, V5, V9), (V2, V6, V10), (V3, V7, V11), and (V4, V8, V12), we can calculate the maximum, average, and minimum.
The average is the easiest: if you specify the columns of each group, you can obtain it with the following simple script.
data_ave1<-(data_f$V1+data_f$V5+data_f$V9)/3.0
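If you prefer not to spell out each column, the same group average can be written with the base function rowMeans(); this is simply an equivalent idiom, not the form used in the function below:
#equivalent, using rowMeans on the three replicate columns
data_ave1<-rowMeans(data_f[,c("V1","V5","V9")])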
By performing this calculation for each of the four groups and finally combining them with the function append(), the 96 values are converted into 32 average values. The function is as follows. Note that the line data_sum_nor<-data_sum - data_sum[1] #normalizing by water well standardizes the absorbance of all substrates by subtracting the value of the control (blank) well.
#The function for averaging triplicate
#The parameter is data frame
#Output is a vector
ave_ecoplate <- function(data_f){
data_ave1<-(data_f$V1+data_f$V5+data_f$V9)/3.0 #take average
data_ave2<-(data_f$V2+data_f$V6+data_f$V10)/3.0
data_ave3<-(data_f$V3+data_f$V7+data_f$V11)/3.0
data_ave4<-(data_f$V4+data_f$V8+data_f$V12)/3.0
data_sum<-append(append(append(data_ave1,data_ave2),data_ave3),data_ave4) #combine data
data_sum_nor<-data_sum - data_sum[1] #normalizing by water well
data_sum_nor #output
}
This function can be used as follows. Note that the output is no longer a data frame but a single vector of 32 values.
> sample_id<-2
> ave_ecoplate(finalday_osaka[[sample_id]])
[1] 0.000000000 0.466666667 0.688333333 0.035000000 0.837666667 0.087666667 0.541333333 -0.333000000 -0.191666667 -0.324333333 -0.346333333 -0.072666667
[13] 0.784666667 -0.161333333 -0.342000000 -0.174333333 0.312666667 0.323333333 -0.372666667 -0.268666667 -0.314333333 -0.009666667 -0.350333333 -0.305000000
[25] -0.114333333 0.845333333 -0.329000000 0.524000000 -0.327333333 0.125666667 -0.269333333 -0.090333333
Next, to calculate the maximum and minimum values, we need to pick out exactly the three values corresponding to the same substrate (for example, max(data_f$V1[1], data_f$V5[1], data_f$V9[1])). The functions are easy to define.
#The function for taking maximum of triplicate
#The parameter is data frame
#Output is a vector
max_ecoplate <- function(data_f){
data_max<-max(data_f$V1[1],data_f$V5[1], data_f$V9[1])
for(i in 2:8) data_max <-append(data_max, max(data_f$V1[i], data_f$V5[i], data_f$V9[i]))
for(i in 1:8) data_max <-append(data_max, max(data_f$V2[i], data_f$V6[i], data_f$V10[i]))
for(i in 1:8) data_max <-append(data_max, max(data_f$V3[i], data_f$V7[i], data_f$V11[i]))
for(i in 1:8) data_max <-append(data_max, max(data_f$V4[i], data_f$V8[i], data_f$V12[i]))
data_max_nor <- data_max - data_max[1] #normalizing by water well
data_max_nor #output
}
#The function for taking minimum of triplicate
#The parameter is data frame
#Output is a vector
min_ecoplate <- function(data_f){
data_min<-min(data_f$V1[1], data_f$V5[1], data_f$V9[1])
for(i in 2:8) data_min <-append(data_min, min(data_f$V1[i], data_f$V5[i], data_f$V9[i]))
for(i in 1:8) data_min <-append(data_min, min(data_f$V2[i], data_f$V6[i], data_f$V10[i]))
for(i in 1:8) data_min <-append(data_min, min(data_f$V3[i], data_f$V7[i], data_f$V11[i]))
for(i in 1:8) data_min <-append(data_min, min(data_f$V4[i], data_f$V8[i], data_f$V12[i]))
data_min_nor <- data_min - data_min[1] #normalizing by water well
data_min_nor #output
}
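As an aside, base R's pmax() and pmin() compute element-wise maxima and minima across vectors, so the same result can be written more compactly. The following function is only a sketch of that alternative (the loop versions above are the ones used in the rest of this tutorial):
#a compact alternative sketch of max_ecoplate using element-wise pmax
max_ecoplate2 <- function(data_f){
data_max<-c(pmax(data_f$V1, data_f$V5, data_f$V9),
pmax(data_f$V2, data_f$V6, data_f$V10),
pmax(data_f$V3, data_f$V7, data_f$V11),
pmax(data_f$V4, data_f$V8, data_f$V12))
data_max - data_max[1] #normalizing by water well
}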
Now let's convert the data stored in integrated_osaka and finalday_osaka into average, maximum, and minimum values, and collect the results into new data frames. Finally, we can save each data frame as a csv file in a local folder.
The average value for each substrate of the integrated data sets can be computed with the following loop.
#Integration + average
osaka_summary_integ_ave<-ave_ecoplate(integrated_osaka[[1]])
for(i in 2: length(integrated_osaka)) osaka_summary_integ_ave<-rbind(osaka_summary_integ_ave,ave_ecoplate(integrated_osaka[[i]]))
View(osaka_summary_integ_ave)
Similarly, the maximum and minimum of the integrated values, and the average, maximum, and minimum of the final-day data, are processed in the same way.
#Integration + max
osaka_summary_integ_max<-max_ecoplate(integrated_osaka[[1]])
for(i in 2: length(integrated_osaka)) osaka_summary_integ_max<-rbind(osaka_summary_integ_max,max_ecoplate(integrated_osaka[[i]]))
#Integration + min
osaka_summary_integ_min<-min_ecoplate(integrated_osaka[[1]])
for(i in 2: length(integrated_osaka)) osaka_summary_integ_min<-rbind(osaka_summary_integ_min,min_ecoplate(integrated_osaka[[i]]))
#finalday + average
osaka_summary_final_ave<-ave_ecoplate(finalday_osaka[[1]])
for(i in 2: length(finalday_osaka)) osaka_summary_final_ave<-rbind(osaka_summary_final_ave,ave_ecoplate(finalday_osaka[[i]]))
#finalday + max
osaka_summary_final_max<-max_ecoplate(finalday_osaka[[1]])
for(i in 2: length(finalday_osaka)) osaka_summary_final_max<-rbind(osaka_summary_final_max,max_ecoplate(finalday_osaka[[i]]))
#finalday + min
osaka_summary_final_min<-min_ecoplate(finalday_osaka[[1]])
for(i in 2: length(finalday_osaka)) osaka_summary_final_min<-rbind(osaka_summary_final_min,min_ecoplate(finalday_osaka[[i]]))
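By the way, each of these six loops follows the same pattern, which could also be written in one line with lapply() and do.call(); this is merely an equivalent sketch (the result is the same matrix, apart from row names):
#equivalent one-liner for the same pattern, e.g., finalday + average
osaka_summary_final_ave<-do.call(rbind, lapply(finalday_osaka, ave_ecoplate))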
These steps summarize the data, but the format (class) of the results is still matrix (check with class(osaka_summary_final_ave)). They should therefore be converted to the data frame format, which fits better with the multivariate statistical analyses at later stages. At the same time, the value of the control well (already normalized to 0) is removed with osaka_summary_integ_ave[,-1].
#convert them into dataframe, delete the control value (0)
class(osaka_summary_final_ave)
osaka_summary_integ_ave<-as.data.frame(osaka_summary_integ_ave[,-1])
osaka_summary_integ_max<-as.data.frame(osaka_summary_integ_max[,-1])
osaka_summary_integ_min<-as.data.frame(osaka_summary_integ_min[,-1])
osaka_summary_final_ave<-as.data.frame(osaka_summary_final_ave[,-1])
osaka_summary_final_max<-as.data.frame(osaka_summary_final_max[,-1])
osaka_summary_final_min<-as.data.frame(osaka_summary_final_min[,-1])
The metadata should also be converted into the data frame format so that it can be used together with the ecoplate data in the multivariate analyses later.
#prepare the meta data file again for sample_ID
metadata_osaka<-metadata_list[[1]][1,1:3]
for(i in 2:length(metadata_list)) metadata_osaka<-rbind(metadata_osaka,metadata_list[[i]][1,1:3]) #start at 2, because sample 1 is already stored in the first row
metadata_osaka<-as.data.frame(metadata_osaka)
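As a quick optional check, the assembled metadata should now have exactly one row per sample:
#sanity check: one metadata row per sample
nrow(metadata_osaka)==no_sample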
The above results can be saved as csv files in the folder at the specified path by using the function write.csv().
#write the result into csv files
write.csv(metadata_osaka, "./metadata_osaka.csv", row.names=F, quote=F)
All versions will be used later, so save them all as below.
write.csv(osaka_summary_integ_ave, "./osaka_summary_integ_ave.csv", row.names=F, quote=F)
write.csv(osaka_summary_integ_max, "./osaka_summary_integ_max.csv", row.names=F, quote=F)
write.csv(osaka_summary_integ_min, "./osaka_summary_integ_min.csv", row.names=F, quote=F)
write.csv(osaka_summary_final_ave, "./osaka_summary_final_ave.csv", row.names=F, quote=F)
write.csv(osaka_summary_final_max, "./osaka_summary_final_max.csv", row.names=F, quote=F)
write.csv(osaka_summary_final_min, "./osaka_summary_final_min.csv", row.names=F, quote=F)
This is the end of the second phase of raw data conversion. We have now obtained the functional matrices (as data frames) shown in the figure above, for both the integrated data (average, maximum, minimum) and the final-day data (average, maximum, minimum).