For R beginners

We offer a few introductory tutorials on basic R concepts and programming for beginners in R and RStudio. This website is designed for Google Chrome.

0) Why do we need programming just for statistical analysis?

1) Basic information on the RStudio desktop GUI (Graphical User Interface)

2) Basics in R script 1 (executing script, simple calculation, loading library, loading and showing dataset)

3) Basics in R script 2 (data modification, graphics, data.frame and list format)

4) Programming in R script (conditional statement, loop statement, function, and memory release)

The following R script and data sets will be used in this section.

0) Why do we need programming?

MS Excel and other GUI applications are available for statistical analysis, so why do we need to write program code that looks like a magic formula? There are two major reasons.

[1] Required for guaranteeing reproducibility

In scientific reports such as journal articles, one of the most important requirements is to ensure that the results are reproducible. For observations and experiments, this is achieved by a detailed protocol; for numerical simulations and models, by program code, equations, and parameter tables. The same is required for statistical analysis and for the preprocessing and conversion of raw data. If these steps rely on copy and paste with formulas in spreadsheets, or on many clicks in a GUI application, it is very difficult to describe them step by step and therefore to keep them reproducible. Saving every step in a script makes the statistical analysis reproducible.

[2] Facilitating reanalysis, repeated processes, and handling large data sets

Recent research projects can easily generate large (and repeated) data sets, which need to be handled and analyzed with the same methods. Without programming, this requires an endless sequence of mouse clicks, copy and paste, and value conversions. Once programmed, complex and repeated procedures are all executed automatically. Even when new (or modified) data sets become available, all you need to change is the input to the program.

1) GUI (Graphical User Interface) of RStudio desktop

The RStudio window consists of multiple panes. The toolbar [1] is at the top of the window (or at the top of the desktop on OSX); use it to change the settings of the application. The upper left is the editor for R scripts [2], where you can write and modify R code (R scripts) and check loaded numerical data. The lower left is the "Console" window [3], which shows the results of the scripts written in the editor (window [2]) as well as error messages. You can also write and execute simple scripts directly in the console if you do not need to reuse them later. Window [4] lists the variables, vectors, matrices, and data (all called "objects" in R). Clicking an object shows the values saved in it as a new tab in the editor window. Window [5] groups several frequently used functions: you can check the list of files in the current folder (Files), view the graphs produced by an analysis (Plots), and read the help files of the packages and functions used in your script (Help).

To open an R script file (a file with the extension .R) in RStudio, select File > Open File... at the left end of the toolbar [1]. The standard file dialog (the figure below) is displayed, so select the file you want to open (for_intro_cer.R here).

When the script is opened, it is displayed in the editor [2]. Files are opened as tabs, so you can have several files open at the same time. With the default settings, the background of the code is white and the text is colored, for example in black and green.

RStudio automatically recognizes the syntax of the R language (and C) and colors the text accordingly to make programming easier. You can change this setting in the Options window, which appears when you select Tools > Global Options... on the toolbar [1].

Various settings can be adjusted in the Options window. In particular, you can change the appearance of the code by selecting Appearance and choosing a theme under Editor theme, as shown below.

One more step is necessary each time you open a new script file. A script can read and write data by accessing files on the computer. To tell R where it currently is on the computer so that it can access the files properly, select Session > Set Working Directory > To Source File Location on the toolbar [1]. The files then become accessible by a relative path from the folder where the script file (called the "Source File" here) is located.
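
The working directory can also be checked and changed from a script. The following is a minimal sketch using the base functions getwd() and setwd(); the temporary folder is only a stand-in (an assumption for illustration) for the folder that contains your script.

```r
# Show the folder that R currently treats as the working directory
old_dir <- getwd()
print(old_dir)

# Move to another folder (a temporary folder is used here only as a
# stand-in for the folder that contains your script)
setwd(tempdir())

# Relative paths such as "./data_sample.csv" are now resolved from this folder
file.exists("./data_sample.csv")

# Return to the original folder
setwd(old_dir)
```

With the menu operation described above, RStudio sets the working directory for you, so this is needed only when you want to do the same step inside a script.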

2) Basics in R script 1

Execution of script · Simple calculation

To make coding easier, line numbers are displayed at the left edge of the editor window. Lines beginning with # are called "comments" and are ignored by R when the script runs. A comment is a message for the people reading the script, not for the computer. Even if you do not plan to share a script with others, leave comments frequently: a comment is also a message to your future self, who will probably open the same script again.

Line 4 of the script is the expression 1 + 1. To execute it, move the cursor to line 4 and highlight it (see the figure below), then press the Control key (Ctrl) and the Enter key at the same time on Windows (or the Command key and the Enter key on Mac). Be careful: if you press only the Enter key while the script is highlighted, the highlighted part is deleted.

Execution of script · Simple calculation (continued)

By this key operation, only this part ("1 + 1") is automatically copied to the console screen, and the result is displayed as "[1] 2" (lower figure).

If you want to execute scripts written on multiple lines from top to bottom, highlight, for example, both lines 4 and 5 and press the keys for execution. The results are then displayed in order in the console (the result does not change even if you include the comment on line 3 in the highlight; see the figure below). The execution result displayed on the console looks as follows.

Since it is troublesome to show screenshots of the R editor and console every time, from here on the results are shown as plain text, as below. Note that the font style and color differ from those in RStudio.

> #Simple calculation

> 1+1

[1] 2

> exp(-2.0)

[1] 0.1353353

>

Load library

Various statistical methods used in this workshop are included in the ecology library called "vegan". To use them, you need to load this library (assuming it has been installed as described in "Preparation of R environment"). Just execute the following line of script to load the library.

library(vegan)
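
If you are not sure whether the library is installed, the following sketch checks first and loads it only when it is available. It uses the base functions requireNamespace() and library(); the variable names pkg_name and is_installed are made up for this illustration.

```r
# Check whether the package is installed before trying to load it
pkg_name <- "vegan"
is_installed <- requireNamespace(pkg_name, quietly = TRUE)

if (is_installed) {
  # character.only = TRUE lets library() take the name from a variable
  library(pkg_name, character.only = TRUE)
  cat(pkg_name, "is loaded", "\n")
} else {
  cat(pkg_name, "is not installed; run install.packages(\"vegan\") first", "\n")
}
```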

Reading data

First, download the following file, which will be read (loaded), and place it in the same folder where "for_intro_cer.R" is saved.

data_sample.csv

To read the contents of this csv format file into R, execute the following script.

data_test<-read.csv("./data_sample.csv", header=TRUE)

This uses a function called read.csv(), specifying the file by its relative path ("./data_sample.csv") and loading the first line of the csv file as a header rather than as part of the data (header = TRUE). The script reads the contents of the file into a "box" named "data_test". Such a "box", which stores, for example, numerical data, is called an "object" in R.
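
To see what the header option does without the workshop file, here is a small self-contained sketch. It writes a tiny csv file with made-up values to a temporary location (the file contents and the object names are assumptions for illustration) and reads it back in both ways.

```r
# Write a tiny csv file to a temporary location (made-up values)
tmp_file <- tempfile(fileext = ".csv")
writeLines(c("id,value", "a,1.5", "b,2.5"), tmp_file)

# header = TRUE: the first line becomes the column names
with_header <- read.csv(tmp_file, header = TRUE)
colnames(with_header)   # "id" "value"
nrow(with_header)       # 2

# header = FALSE: the first line is treated as data
no_header <- read.csv(tmp_file, header = FALSE)
colnames(no_header)     # "V1" "V2"
nrow(no_header)         # 3
```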

Confirm the contents of the data: To see the contents of this "box" (object), display it in the R editor by executing the function View() as follows. Because we read the csv file with the option header = TRUE, the first line of the csv file appears in the data frame as the column names ("data_time_sample", "place_sample", "measured_time").

View(data_test)

Even without executing the script above, you can see the contents of data_test by double-clicking it in the list of objects displayed in window [4] of RStudio.
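
For large data it is often enough to look at a few rows on the console instead of opening the whole table. The following sketch uses the base functions head(), str(), and summary() on a small made-up data frame (demo_df is a hypothetical name, not part of the workshop data).

```r
# A small made-up data frame with 20 rows
demo_df <- data.frame(id = 1:20, value = (1:20) * 0.5)

head(demo_df)            # first 6 rows
head(demo_df, 3)         # first 3 rows
str(demo_df)             # size and type of each column
summary(demo_df$value)   # basic statistics of one column
```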

Deleting data: If you click the "broom" icon on this screen, you can delete all the objects currently held by R at once and release the memory consumed by those objects.

To delete objects individually, write the following script in the R editor and execute it.

rm(data_test)
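
The following sketch shows how to confirm that an object has really been deleted, using the base functions ls() and exists(). The objects tmp_a and tmp_b are made up for this illustration.

```r
# Create two small objects
tmp_a <- 1
tmp_b <- c(2, 3)

ls()               # names of the objects currently in the workspace

rm(tmp_a)          # delete one object
exists("tmp_a")    # FALSE
exists("tmp_b")    # TRUE

rm(tmp_b)          # delete the other one as well

# rm(list = ls()) would delete every object at once, like the broom icon
```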

Confirmation of data type: There are various kinds of "boxes" (objects) that can be used in R, and various kinds of values that can be put in them. One of the most frequently used types, the data frame (data.frame), stores multivariate data such as those summarized in the csv file. As you create more objects, it becomes increasingly difficult to remember the type (class) of each one. To check the type of an object, execute the following script.

#load (again) the data set

data_test<-read.csv("./data_sample.csv", header=TRUE)

#Check the type (class) of the object

class(data_test)
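
For comparison, the following sketch applies class() to a few other common kinds of objects. All values are made up for illustration.

```r
class(1.5)                     # "numeric"
class("higashihori")           # "character"
class(TRUE)                    # "logical"
class(c(1, 0, 2))              # "numeric" (a vector of numbers)
class(list(1, "a"))            # "list"
class(data.frame(x = 1:3))     # "data.frame"
```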

3) Basics in R script 2

The following covers only a part of the basic rules of the R language. Almost all of this information is available on the following site, and in the book that summarizes its contents, so you can study further on your own.

http://cse.naro.affrc.go.jp/takezawa/r-tips/r.html

Data processing

You can access a part of the data saved in the data frame data_test (here, the column named place_sample) with the following script.

#Check the part of the data set

data_test$place_sample

The R editor of RStudio conveniently displays the candidates after you type data_test$.

Alternatively, you can access a row, a column, or a specific row-column value as follows.

data_test[1, ]

data_test[, 1]

data_test[3,2]
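
The same indexing can be tried on a small made-up data frame where the expected values are easy to verify by eye. The name demo_df and its contents are assumptions for illustration.

```r
# A small made-up data frame: 3 rows, 2 columns
demo_df <- data.frame(site = c("A", "B", "C"), count = c(10, 20, 30))

demo_df[1, ]             # first row (still a data frame)
demo_df[, 2]             # second column as a vector: 10 20 30
demo_df[3, 2]            # value in row 3, column 2: 30
demo_df[2:3, "count"]    # rows 2 to 3 of the column "count": 20 30
demo_df$count[2]         # second value of the column "count": 20
```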

Next, copy the contents of data_test$place_sample to another box (object) and try a little processing. To copy the values of data_test$place_sample to a new object named sample_test, use the symbol "<-" as follows.

#Copy the part of the data set into a new object

sample_test<-data_test$place_sample

#Check the data

sample_test

However, with this script the new object is no longer a data frame. To keep it as a data frame, wrap the right-hand side in a function called as.data.frame() as follows.

#Copy the part of the data set into a new data.frame

sample_test<-as.data.frame(data_test$place_sample)

class(sample_test)

View(sample_test)

If you generate a new data frame like this, a column name is assigned automatically.

colnames(sample_test)

To change it to another name, do as follows.

colnames(sample_test)<-"ID"
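
As an alternative sketch, a single column can be extracted with the option drop = FALSE, which keeps the result as a data frame and preserves the original column name. The data frame demo_df is made up for illustration; stringsAsFactors = FALSE is set explicitly so that the text column is stored as character in any R version.

```r
# A small made-up data frame
demo_df <- data.frame(site = c("A", "B", "C"), count = c(10, 20, 30),
                      stringsAsFactors = FALSE)

site_vec <- demo_df$site                      # extracted as a plain vector
class(site_vec)                               # "character"

site_df <- demo_df[, "site", drop = FALSE]    # kept as a one-column data frame
class(site_df)                                # "data.frame"
colnames(site_df)                             # "site"

colnames(site_df) <- "ID"                     # rename the column
colnames(site_df)                             # "ID"
```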

Now let us use the original data frame data_test again. A part of the data can be extracted using subset(data frame name, conditional expression). Note that the equality operator is "==".

subset(data_test, place_sample=="higashihori")

With the above script the result is only displayed on the console, so assign it to a new data frame with "<-" as before.

data_test01<-subset(data_test, place_sample=="higashihori")

class(data_test01)

This time the new object is a data frame even without as.data.frame(), because subset() returns a data frame with all the columns kept. In this respect R is flexible or, put less kindly, inconsistent. When an error occurs, it is important to check the type of the data each time with the class() function. Similarly, assign the data that meet the condition "honmachi" to another data frame.

data_test02<-subset(data_test, place_sample=="honmachi")
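
The following sketch shows subset() on a small made-up data frame, together with the equivalent extraction by a logical condition inside the brackets and a combined condition using "&". The names demo_df, only_a, only_a2, and large_a are hypothetical.

```r
# A small made-up data frame
demo_df <- data.frame(site = c("A", "B", "A", "C"), count = c(10, 20, 30, 40),
                      stringsAsFactors = FALSE)

# Rows where site equals "A"
only_a <- subset(demo_df, site == "A")
only_a$count                                # 10 30

# The same rows selected with a logical condition inside the brackets
only_a2 <- demo_df[demo_df$site == "A", ]
only_a2$count                               # 10 30

# Two conditions combined with "&" (and)
large_a <- subset(demo_df, site == "A" & count > 15)
large_a$count                               # 30
```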

The contents and size of the data frames generated in this way are displayed in the Environment tab of window [4] of RStudio. For example, data_test01 consists of 89 observations of 3 variables, which can also be read as 89 rows and 3 columns of data. The point to note here is the R convention that each row represents the "observation" from one sample and each column holds the values of one "variable". It is important to keep this convention in mind when saving raw data in Excel or csv files. From the script, you can access the number of rows and the number of columns as follows. It is not easy to remember which is which.

#Check the size

#Row size = length of the column vector

length(data_test01[,1])

#Column size = length of the row vector

length(data_test01[1,])
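
Base R also provides the functions nrow(), ncol(), and dim(), whose names make it harder to mix up rows and columns. A minimal sketch on a made-up data frame (demo_df is a hypothetical name):

```r
# A small made-up data frame: 5 rows, 3 columns
demo_df <- data.frame(x = 1:5, y = 6:10, z = 11:15)

nrow(demo_df)             # number of rows: 5
ncol(demo_df)             # number of columns: 3
dim(demo_df)              # both at once: 5 3

# These agree with the length-based approach
length(demo_df[, 1])      # 5
length(demo_df[1, ])      # 3
```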

Having separated the data into two data frames, data_test01 and data_test02, you can also combine them. You have to be careful about whether the number of rows or the number of columns is shared. Here the number of columns is the same, so you can combine them in the row direction (= vertically) using the function rbind().

#data binding

data_test03<-rbind(data_test01, data_test02)

View(data_test03)

To combine them in the column direction (= horizontally), use the function cbind().

data_test04<-cbind(data_test01, data_test02)

However, since the number of rows does not match, an error message is displayed on the console as follows.

> data_test04<-cbind(data_test01, data_test02)

Error in data.frame(..., check.names = FALSE) :

arguments imply differing number of rows: 89, 56
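
The same behavior can be reproduced with two small made-up data frames. In this sketch the error from cbind() is caught with tryCatch() so that the script keeps running; the names top, bottom, stacked, and cbind_result are assumptions for illustration.

```r
# Two made-up data frames with the same columns but 3 and 2 rows
top    <- data.frame(site = c("A", "B", "C"), count = c(1, 2, 3))
bottom <- data.frame(site = c("D", "E"),      count = c(4, 5))

# Row-wise binding works because the columns match
stacked <- rbind(top, bottom)
dim(stacked)    # 5 2

# Column-wise binding fails because the row counts differ (3 vs 2);
# tryCatch() stores the error message instead of stopping the script
cbind_result <- tryCatch(
  cbind(top, bottom),
  error = function(e) conditionMessage(e)
)
cbind_result
```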

Display graph

There are many graphic options for graphs, which you can study on your own at http://cse.naro.affrc.go.jp/takezawa/r-tips/r/48.html . Only a very simple example is explained here.

Download the new sample data (data_sample2.csv) and save it in the same folder as the R script. If you read the csv file and pass the values for the horizontal axis and the values for the vertical axis to the plot() function, you can easily draw a 2D scatter plot.

#For sample plot

#load data

data_plot<-read.csv("./data_sample2.csv", header=TRUE)

#plot x vs y

plot(data_plot$x, data_plot$y)

#plot y vs z

plot(data_plot$y, data_plot$z, col=4,cex=2)

The result of the plot is displayed in window [5] of RStudio (see the figure below).
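
A few commonly used options of plot() are sketched below with made-up data, so the example runs without the csv file. The data frame demo_plot and the label texts are assumptions for illustration; xlab, ylab, main, col, pch, and cex are standard graphical arguments.

```r
# Made-up data: y is the square of x
demo_plot <- data.frame(x = 1:10)
demo_plot$y <- demo_plot$x^2

# Scatter plot with axis labels, a title, and modified symbols
plot(demo_plot$x, demo_plot$y,
     xlab = "x",                     # label of the horizontal axis
     ylab = "y = x^2",               # label of the vertical axis
     main = "Sample scatter plot",   # title of the graph
     col = 4,                        # color number 4 (blue in the default palette)
     pch = 16,                       # filled circles
     cex = 1.5)                      # symbol size relative to the default
```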

Data frame and list

Data frames are objects suitable for storing data with row names and column names. A list, on the other hand, is a container that can collect multiple objects (e.g. multiple data frames, or data frames and vectors) into one object. In other words, it is an object that collects multiple boxes. An empty list can be generated by list(), and each box can be accessed (copied, called) by putting a number in double brackets [[ ]].

#data.frame and list

#generate an empty list

test_list<-list()


#check object type

class(test_list)

class(data_test01)

class(data_test02)


#assign object

test_list[[1]]<-data_test01 #assign data.frame

test_list[[2]]<-data_test02 #assign data.frame

test_list[[3]]<-c(1,0,2) #assign numerical values

You can see the contents of the list all at once, and you can also specify a part of it to check.

#check all objects in the list

test_list

#check the specific item in the list

test_list[[1]]

Since each item in the list behaves as the object that was assigned to it, test_list[[1]], for example, behaves as a data frame, and you can access a part of its contents with a script as follows:

#acting as data.frame

test_list[[1]]$place_sample[3]

You can check the class of each element by executing the following script in order.

#check the class

class(test_list[[1]])

class(test_list[[1]]$place_sample)

class(test_list[[3]])
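
List items can also be given names, which are often easier to read than numbers. The following is a minimal sketch with made-up contents; named_list and its item names are hypothetical.

```r
# A list whose items have names
named_list <- list(first = data.frame(v = 1:2), numbers = c(1, 0, 2))

names(named_list)             # "first" "numbers"
length(named_list)            # 2

named_list$numbers            # access by name with $: 1 0 2
named_list[["numbers"]][3]    # access by name with [[ ]]: 2

named_list$extra <- "text"    # add a new named item
length(named_list)            # 3
```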

"Programming" in script file

In this course we explain the basics of conditional branching, repetition (loops), and functions as the minimum knowledge needed to automatically convert and statistically process values obtained from ecoplate (or, more generally, multivariate observations). It takes practice to get used to them, and we recommend studying further on your own with the references and books listed above.

Conditional statements (if / else)

The following is a simple script. Note that cat() is a function that displays character strings and other values.

#if statement

a <- 3

if(a > 3) { #check the condition

cat("a is greater than 3", "\n") #this is executed when the condition is true

} else{

cat("a is less than or equal to 3", "\n") #this is executed when the condition is not true

}

A minor point here is that an error occurs if you put a line break before else at the top level of a script. However, the line break often improves readability, so you can mark the start and end of the whole if statement with curly braces { } as follows.

{

if(a > 3) { #check the condition

cat("a is greater than 3", "\n") #this is executed when the condition is true

}

else{

cat("a is less than or equal to 3", "\n") #this is executed when the condition is not true

}

}
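
When there are more than two cases, the conditions can be chained with "else if". The following sketch stores the message in a variable (msg, a hypothetical name) and prints it at the end.

```r
a <- 3

if (a > 3) {
  msg <- "a is greater than 3"
} else if (a == 3) {
  msg <- "a is equal to 3"       # checked only when the first condition is false
} else {
  msg <- "a is less than 3"      # used when none of the conditions is true
}

cat(msg, "\n")
```

Because each "else" is written on the same line as the preceding closing brace, this form runs without the outer braces.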

In general, when programs become complicated, many nested structures appear, so you need to be careful about the beginning and end of each part.

Loop statement (for loop)

Here is a script using a "for loop". It creates a vector x with the values from 1 to 10 and specifies the beginning and end of the loop (start, end). The elements x[1] to x[10] are added one by one to the variable sum by sum<-sum+x[i]. It is possible to write the beginning and end of the loop directly as numbers, as in for(i in 1:10), but to keep the high readability generally required in programming, it is better to avoid putting bare numbers into control structures (e.g. if statements and loop statements).

#loop structure

x<-c(1,2,3,4,5,6,7,8,9,10) #prepare a vector

x[3]

start<-1 #start of loop

end<-10 #end of loop

sum<-0 #prepare the variable to calculate the sum of the vector,initialized as zero

for(i in start:end) {

sum<-sum+x[i] #add each element of vector x

}

sum

Note that compound assignment operators such as += in C-family languages are not available in R, so "sum += x[i]" cannot be used.
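
As a variation, the loop range can be taken from the vector itself with seq_along(), and the result can be checked against the built-in function sum(). This is a minimal sketch; x_demo and total are hypothetical names chosen so that they do not clash with the objects above.

```r
x_demo <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)   # prepare a vector

total <- 0                                   # initialize the running total
for (i in seq_along(x_demo)) {               # i runs over 1, 2, ..., length(x_demo)
  total <- total + x_demo[i]                 # add each element
}

total          # 55
sum(x_demo)    # 55, the built-in function gives the same result
```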

Function

Now, let's create a function using the example of conditional statement and iteration handled above. To create a function we use the following structure.

Name of function <- function(list of parameters) {

definition of function

}

Specifically, if you define the above if / else calculation as a function, the script is as follows. The body of the function is almost the same as the script above. "check_3" is the name of the function, "input" is its parameter, and a default parameter value can be specified as in "input = 2". You do not always have to specify a default value.

#function example 01

check_3<-function(input=2){

if(input > 3) { #check the condition

cat("Your input is greater than 3", "\n") #this is executed when the condition is true

}

else{

cat("Your input is less than or equal to 3", "\n") #this is executed when the condition is not true

}

}

To call this function, prepare the following script. The parameter values in this example are 2 and 4, respectively.

#execute the function

check_3(input=2)

check_3(4)

If you call the function without specifying a parameter value, the default value is used.

check_3()

Since a function can take more than one parameter, the iterative calculation described above can be turned into a function as follows. This time there are three parameters: vec, start, and end. vec is the vector whose elements are added, and start and end are the start and end points of the loop, respectively.

#function example 02

sum_vector<-function(vec, start, end) {

sum2<-0

for(i in start:end) {

sum2<-sum2+vec[i] #add each element of vector vec

}

print(sum2)

}

If you want to add elements 1 to 5 of the vector x we made earlier, call the function as follows.

sum_vector(x, 1,5)
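
A variation of this function is sketched below. It hands back the result with return() instead of printing it, and gives start and end default values so that the whole vector is summed when they are omitted. The names sum_range, total, and x_demo are hypothetical and not part of the workshop script.

```r
# Sum the elements of vec from position start to position end
sum_range <- function(vec, start = 1, end = length(vec)) {
  total <- 0
  for (i in start:end) {
    total <- total + vec[i]   # add each element of vec
  }
  return(total)               # hand the result back to the caller
}

x_demo <- 1:10

sum_range(x_demo, 1, 5)                        # 15
sum_range(x_demo)                              # 55 (defaults cover the whole vector)
sum_range(x_demo, 1, 5) == sum(x_demo[1:5])    # TRUE
```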

Memory release

When you execute complicated scripts that use many vectors, data frames, lists, and so on, more and more memory is consumed. To release unused memory, you may want to run the following script twice in succession.

gc(verbose=TRUE, reset=TRUE)
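
Memory can only be released for objects that are no longer in use, so large objects are usually deleted with rm() before gc() is called. The following sketch shows this order with a made-up vector; big_vec and size_bytes are hypothetical names.

```r
# Create a large object: one million numeric values
big_vec <- numeric(1e6)

# Check how much memory the object uses (about 8 million bytes)
size_bytes <- as.numeric(object.size(big_vec))
size_bytes

# Delete the object, then ask R to release the unused memory
rm(big_vec)
invisible(gc())   # invisible() hides the summary table that gc() returns
```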

(This is additional information)

Caution about assignment in functions (<-, <<-)

Finally, there is an important rule that can lead to serious mistakes if functions are misused. First, let's run the sum_vector function we created earlier and check the value of the variable sum2 defined in the function.

sum_vector(x,1,10)

sum2

At this time, an error message should appear as follows on the console.

> sum_vector(x,1,10)

[1] 55

> sum2

Error: object 'sum2' not found

>

This means that sum2 cannot be found. Because sum2 is defined within the function, it is valid only inside the function and cannot be accessed from outside.

What happens if you define sum2 and assign a value to it before calling (or before defining) the function, as follows?

sum2<-3

sum_vector(x,1,10)

sum2

Now sum2 does exist, but its value does not change even after the function is called. With this approach, sum2 inside the function and sum2 outside the function are handled as separate objects. In other words,

  • With the ordinary assignment operator (<-), it is impossible to change the value of a variable defined outside the function from within the function (sum2 in the example above).

However,

  • It is possible to read the value of a variable defined outside the function from within the function.

In the function below, sum_vector2, we use a variable named end2, which is defined outside the function.

#function example 03

end2<-3

sum_vector2<-function(vec, start) {

sum3<-0

for(i in start:end2) { #using the value of end2

sum3<-sum3+vec[i] #add each element of vector vec

}

print(sum3)

}


sum_vector2(x,2)

Even though the rules above allow it, to write a highly readable script it is important 1) to avoid, as much as possible, using the same variable name inside and outside a function, and 2) when a function needs the value of an outside variable, not to access it directly as in sum_vector2 but to pass the value to the function through a parameter.

#use parameters

st<-1

ende<-10

sum_vector(x,st,ende)

Then, what should you do if you want to use the result of a function as another variable? There are two solutions. The first is simple: just assign the result of the function to another variable.

#assign the value of the function into another variable

summ<-sum_vector(x,st,ende)

summ

The second solution is to change the value of a variable outside the function using another assignment operator (<<-), as follows. Such coding is necessary in some situations.

#function example 04

sum3<-3

sum_vector3<-function(vec, start, end) {

sum3<<-0

for(i in start:end) {

sum3<<-sum3+vec[i] #add each element of vector vec

}

print(sum3)

}


sum_vector3(x, 1,10)

sum3
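
The difference between the two assignment operators can be seen side by side in the following minimal sketch. The variable counter and the functions add_local and add_global are made up for this illustration.

```r
counter <- 0

# "<-" creates a separate variable inside the function
add_local <- function() {
  counter <- counter + 1    # the outside counter is read but not changed
  counter
}

# "<<-" changes the variable defined outside the function
add_global <- function() {
  counter <<- counter + 1
  counter
}

add_local()     # 1
counter         # 0 (unchanged)

add_global()    # 1
counter         # 1 (changed by the function)
```

Because "<<-" changes objects that are not visible in the parameter list, assigning the returned value to a variable is usually easier to follow.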