Episode II: Getting familiar with R and RStudio - for beginners
If you are planning to analyse spatial data in R but do not know much about R and RStudio, this page is for you! If you are already familiar with R, you can still go through it, since I introduce current best practices that might also be useful. I provide a few tips that should be enough to set everything up correctly; this is not intended to be a comprehensive introduction to R.
I suggest that you install the following software, in this order:
Install R: Link to download R for PC, Mac, and Linux machines. Download the latest version available. When the installation is complete, restart your machine.
Install RStudio: Link to download RStudio for PC, Mac, and Linux machines. Download the latest version available. When the installation is complete, restart your machine. When you open RStudio, it should automatically detect R.
👈 Left panel: This is the R console, which appears when you run R directly. In general we do not use R directly because, in most cases, an interface such as RStudio is more convenient. I recommend that you access R via RStudio, and I will explicitly mention when R is more suitable than RStudio.
R Console
RStudio interface
👉 Right panel: This is the RStudio interface as it appears when you open RStudio. It contains four panels whose size and position can be changed manually. By default you have the following panels: A: R code (.R file); B: several tabs, the main one showing your environment, which lists the objects you have created in R; C: several tabs, the main one being the R console, where you can run R (as when you use R directly) or see the output of your code; D: many important tabs, including a view of your plots and of your folder structure.
In R we can create and save objects using names. Names can be made of letters, numbers, dots (.) and underscores (_); other special characters such as ! or " are not allowed. Numbers are allowed, but not at the beginning of an object name. In R the symbol # is used to define a comment. You can add comments at the beginning of your code or next to a line of code.
To give a name to an object, and later retrieve the object by its name, we use the assignment operator <-. Try to run this code:
# Let's do our first assignment
tomato <- 'red'
print(tomato) # print the object 'tomato'
You should see the following result in the R console: [1] "red". In the Environment panel of RStudio you should also see the object tomato with the value "red". If you run print(Tomato) in the console, does it work? Note that R is case sensitive. What if you try to run this: 2tomato <- 'red'? Note that you can use single or double quotation marks to define character objects. For example, tomato <- 'red' is equivalent to tomato <- "red". However, if you mix the two, as in tomato <- 'red", R will not run the line correctly because the string is never closed.
Also, try to create more complex objects. R can deal with many kinds of objects, from simple characters as in the example above to very complex objects such as spatial features. When you feel confident about assigning objects in R, go to the next tip.
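As a minimal sketch of the naming rules and of a few slightly more complex objects, you could try the lines below. All object names and values are invented for illustration.

```r
# Object names and values below are illustrative only.
tomato_colour  <- "red"                  # underscores are allowed in names
tomato.weights <- c(102.5, 98, 110.2)    # dots are allowed too; this is a numeric vector
tomato2        <- TRUE                   # digits are fine, just not at the start

# A list can hold objects of different types.
basket <- list(item = "tomato", colour = tomato_colour, weights = tomato.weights)

# A data frame is a table with one row per observation.
tomatoes <- data.frame(id = 1:3, weight = tomato.weights)

# make.names() shows how R would repair an invalid name.
make.names("2tomato")  # returns "X2tomato"
```

After running this, the Environment panel should list basket and tomatoes alongside the simpler objects.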
⛔ What you should not do to initiate a work in RStudio
From the top-left panel of RStudio, you can easily write R code and save it using File -> Save As or by clicking on the disk icon, as you would in a Microsoft Word document. Here we create the file Example.R
Things will work if you do so, but I do not recommend starting your work in RStudio that way. It will not be very convenient when you want to share your work with your colleagues. If you do not explicitly define the paths of the files you want to load or save, R will look for them, and save them, in your working directory. To find out where your working directory is, run in the R console: getwd()
Now let's assume that you share a project with your colleague and in your code you import a csv file from your path: "C:/Users/user/mytutorial/myfile.csv". In R, one way to do so is to use the read.csv() function:
mydf <- read.csv("C:/Users/user/mytutorial/myfile.csv")
When trying to run your code, your colleague will for sure get an error message if the file is located in another path, say "D:/Mydrive/mytutorial/myfile.csv". That is very annoying and is one reason some people hate programming!
✅ What you should do instead to initiate a work in RStudio
To start working with R in a way that prevents unnecessary issues, I suggest that you always begin by creating an RStudio project. From the project you can create .R files and put all your input and output data in the same folder where you saved your project. This sets up the working directory automatically for any user who opens the project. How to do so? Follow these four steps:
1. Click on the R project icon
2. Select New Directory
3. Select New Project
4. Browse to the location you want, give a name to your project folder, and click Create Project
This operation will create a project in your folder with the extension .Rproj.
You should always start your work by opening the .Rproj file with RStudio instead of opening .R files directly.
This will make your life easier, especially when collaborating with other people and it will also simplify the way you code your paths. If you want to open for example myfile.csv located in a folder that you named 'tutorial' where you saved your project (.Rproj) and other files, you can write:
mydf <- read.csv("myfile.csv")
That's it. Anyone with whom you share the project will be able to load it that way, without any changes in the code needed. You should now be all set up!
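Here is a minimal sketch of why relative paths travel well. It assumes a hypothetical subfolder named data inside the project folder, and it uses a temporary directory as a stand-in for the project folder so that the example runs anywhere. Inside a real project you would skip the setwd() lines, since opening the .Rproj file sets the working directory for you.

```r
# A temporary folder stands in for the project folder (illustration only).
project_dir <- file.path(tempdir(), "mytutorial")
dir.create(file.path(project_dir, "data"), recursive = TRUE, showWarnings = FALSE)

old_wd <- setwd(project_dir)  # opening the .Rproj file does this for you

# Write a small example file, then read it back using a relative path.
# file.path() builds the path with the right separator on any system.
write.csv(head(mtcars), file.path("data", "myfile.csv"), row.names = FALSE)
mydf <- read.csv(file.path("data", "myfile.csv"))

setwd(old_wd)  # restore the previous working directory
```

Because the path "data/myfile.csv" is relative to the project folder, the same line works whether the project lives under C:/Users/user or D:/Mydrive.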
The R community is very active: a large amount of code, much of it organised into what we call 'packages', is put online for the benefit of users across the world. In most cases, R packages are centralised in an online repository named CRAN, which also hosts the R software.
From R you can install any package on CRAN with a simple command, as long as you have an Internet connection. For example, to install the package 'dplyr' you can run the following command in the R console: install.packages('dplyr'). A package needs to be installed only once. To access its content, such as functions or datasets, we use the command library. In our example we would type library('dplyr') to load the package 'dplyr'.
Once a package is installed, there is no reason to install it again unless you specifically want to update it. To install only the packages that are missing and then load all packages, you can use the code below:
list.of.packages <- c("dplyr", "ggplot2") # list of packages you need
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[, "Package"])] # check if some packages need to be installed
if (length(new.packages)) install.packages(new.packages) # install missing packages
invisible(lapply(list.of.packages, library, character.only = TRUE)) # load all packages
You can amend it by replacing the package names in the first line of code with the packages you need. The code checks whether any packages from the list you provide (list.of.packages) are not installed and, if so, installs them. Then it loads all the packages in the list. This is also useful when you want to share your work with others.
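The same logic can also be wrapped in a small helper function, sketched below. The function name load_or_install is hypothetical and not part of any package. It uses requireNamespace() to test whether each package is available.

```r
# Hypothetical helper: installs any missing packages, then attaches all of them.
load_or_install <- function(pkgs) {
  # requireNamespace() returns TRUE when a package can be loaded
  available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  missing_pkgs <- pkgs[!available]
  if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
  invisible(lapply(pkgs, library, character.only = TRUE))
}

# Usage with packages that ship with R, so nothing is downloaded here.
load_or_install(c("stats", "utils"))
```

In your own scripts you would replace c("stats", "utils") with the packages you need, for example c("dplyr", "ggplot2").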
An additional tip about installing R packages
It is pretty rare that using R directly is more beneficial than using an interface such as RStudio. Installing R packages, however, is one such case. I recommend that you close RStudio and install all the R packages you need directly from the R console. The reason is that RStudio may keep packages from your previous sessions loaded, which sometimes prevents R from installing or updating them.
Pipes are tools in R that facilitate writing and interpreting a chain of operations. To demonstrate their utility, let's compute the average horsepower for each category of engine (4, 6, or 8 cylinders) using mtcars, a dataset that ships with R. More on the dataset here.
To do so, you can use the pipe operator %>%, which allows you to write code that follows the sequence of operations. Here we assume that you have already installed the package dplyr using install.packages("dplyr"). Please see my recommendations above to install and load packages using best practices.
Loading dplyr also makes the pipe operator %>% available. The operator comes from the package magrittr, one of the packages that dplyr relies on (these are called dependencies), and dplyr re-exports it. If you run the code below:
library(dplyr) # makes the pipe operator %>% available
mtcars %>%
  dplyr::group_by(cyl) %>%
  dplyr::summarize(mean_hp = mean(hp, na.rm = TRUE)) %>%
  print()
You should get the following result:
# A tibble: 3 × 2
    cyl mean_hp
  <dbl>   <dbl>
1     4    82.6
2     6   122.
3     8   209.
The results mean that the average horsepower for cars in the mtcars dataset with 4, 6, and 8 cylinders is about 83, 122, and 209, respectively.
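For comparison, here is a sketch of the same computation in base R without any pipe. It is not part of the original workflow. It only illustrates what the pipe saves you from: either a nested call that must be read from the inside out, or a series of intermediate objects.

```r
# Nested call: the innermost arguments are evaluated first.
round(tapply(mtcars$hp, mtcars$cyl, mean, na.rm = TRUE), 1)

# Step by step, with intermediate objects.
hp_by_cyl <- split(mtcars$hp, mtcars$cyl)          # one vector of hp per cylinder count
mean_hp   <- sapply(hp_by_cyl, mean, na.rm = TRUE) # mean of each vector
round(mean_hp, 1)
#     4     6     8
#  82.6 122.3 209.2
```

Both versions give the same means as the piped code, but the order in which you read them no longer matches the order of the operations.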
From R version >=4.1.0, we can use the native R piping operator |> and run the following code as well to do the same operation:
library(dplyr)
mtcars |>
  dplyr::group_by(cyl) |>
  dplyr::summarize(mean_hp = mean(hp, na.rm = TRUE)) |>
  print()
You should get exactly the same result. Note that the dplyr package is used here only to get access to the functions group_by and summarize (note the double colon symbol ::, with the name of the package before it and the name of the function after it). The pipe operator |> is native here and is not provided by the magrittr package via dplyr.
Another, less common approach is the pipe operator %>>% from the package pipeR:
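As a small sketch, the native pipe and the :: notation can also be tried with functions that ship with R, so no extra package is needed. This assumes R version 4.1.0 or later.

```r
# Requires R >= 4.1.0; uses only functions that ship with R.
n_four_cyl <- mtcars |>
  subset(cyl == 4) |>   # keep the cars with 4 cylinders
  nrow()                # count them
n_four_cyl  # 11

# `::` calls a function from a package without attaching it with library().
stats::median(mtcars$hp)  # 123
```

Writing stats::median() rather than median() makes it explicit which package the function comes from, which is the same idea as dplyr::group_by().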
library(dplyr)
library(pipeR)
mtcars %>>%
  dplyr::group_by(cyl) %>>%
  dplyr::summarize(mean_hp = mean(hp, na.rm = TRUE)) %>>%
  print()
There is some debate in the R community about how piping should be done, and I do not have a strong opinion on which piping operator you should use. However, be aware that |> is not available to those who use an R version older than 4.1.0. I would also recommend that you get to know all the piping approaches, so you won't be surprised the next time you see code that uses piping in a different way from the one you learned.
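If you are unsure whether the native pipe is available on a given machine, a quick check along these lines can help. The variable name has_native_pipe is made up for this sketch.

```r
# Compare the running R version with the first version that supports |>.
has_native_pipe <- getRversion() >= "4.1.0"

if (has_native_pipe) {
  message("The native pipe |> is available in this R installation.")
} else {
  message("This R version is older than 4.1.0; use %>% from magrittr or dplyr instead.")
}
```

You could ask a colleague to run this before you share a script that relies on |>.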
In the next episode we will introduce how to use sf and terra to do simple operations with spatial data. Stay with us!