Simple Variables
Entering input is quite straightforward. You can simply type in data in the R environment using the symbol "<-" as the assignment operator. For example:
x <- 5
The arrow can point in either direction, as long as it points from the value to the variable:
5 -> x
x <- 5
x -> 5 # this will not work
Seeing what each variable contains is easy. Simply type the name of the variable and press enter:
x
[1] 5
Input can be of any type, including characters, logical etc:
names <- "Christoforos"   # a character value
status <- FALSE           # a logical value (no quotes)
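Quoting matters here: a value written in quotes is stored as text, even if it spells a logical value. The sketch below uses class() to check how R stored each value; the variable names are made up for illustration.

```r
person <- "Christoforos"   # character: text in quotes
flag   <- FALSE            # logical: no quotes
quoted <- "False"          # still character, because of the quotes

class(person)   # "character"
class(flag)     # "logical"
class(quoted)   # "character"
```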
Integer sequences can be created as vectors with the colon operator (:)
x <- 1:20
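A few variations on the same idea, as a small sketch. The seq() function is not mentioned above; it is a base R function that allows steps other than 1. The names rev_x and odds are chosen only for this example.

```r
x <- 1:20                    # integers from 1 to 20
length(x)                    # 20

rev_x <- 20:1                # the colon operator also counts downwards
odds  <- seq(1, 20, by = 2)  # 1, 3, 5, ..., 19
```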
Little Arithmetics
R supports all the usual arithmetic operations on simple variables.
x <- 2
y <- x+2 # addition/subtraction
y
[1] 4
z <- x*y # multiplication
z
[1] 8
d <- z/4 # division
d
[1] 2
d**1.5 # power
Precedence follows the normal mathematical rules (** first, then * and /, then + and -), so use brackets whenever a formula combines several operations and the intended order might be unclear. Brackets can be nested, like:
p <- 2*((x-y)**2)-3.14
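To see precedence at work, here is a small sketch with assumed values x = 2 and y = 4. It uses ^, which R treats as the same operator as **.

```r
x <- 2
y <- 4

a <- -2^2        # power binds tighter than the minus sign: -(2^2) = -4
b <- (-2)^2      # brackets force the minus to apply first: 4
q <- 2*x - y^2   # 2*2 - 16 = -12
p <- 2*((x-y)^2) - 3.14   # 2*4 - 3.14 = 4.86
```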
R carries a great number of predefined arithmetical functions for basic operations such as square root, logarithms etc.
x <- 64
sqrt(x) # square root of x
[1] 8
y <- -sqrt(x)
y
[1] -8
y <- abs(y) # absolute value of -8
y
[1] 8
log(y) # natural logarithm
[1] 2.079442
log(y)/log(2) # changing the base to log2
[1] 3
Now, can you think about a way to get the cubic root of a simple variable x?
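One possible answer, as a sketch: raise x to the power 1/3. For negative numbers this returns NaN, so the helper cube_root() below (a name invented for this example) handles the sign separately.

```r
x <- 27
r <- x^(1/3)          # cubic root of a positive number

neg <- (-27)^(1/3)    # NaN: a fractional power of a negative number

# Hypothetical helper that also works for negative input
cube_root <- function(v) sign(v) * abs(v)^(1/3)
cube_root(-27)
```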
If we only enter data at the command line, we have seen nothing of R's power yet. R can handle large amounts of data at various levels of organization, and the best way to feed them in is to have R read them from a file stored on our computer. There are numerous ways to do so, depending on the format, size and type of the data as well as on the downstream analyses we intend to conduct. In the following we will look at the most common ones, starting with read.table(). Simply invoke it with:
data<-read.table("myfile.txt")
Keep in mind that myfile.txt needs to be in the directory you are currently working in. Otherwise you will need to give the full path of the file, e.g.
data<-read.table("~/Documents/R/myfile.txt")
R will read through the file, skip any line starting with a comment hash "#" and store the values it reads in a data frame.
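The behaviour described above can be tried without an existing file. The sketch below writes a small temporary file first (its contents are invented for illustration) and then reads it back with read.table().

```r
# Create a small example file in a temporary location
tmp <- tempfile(fileext = ".txt")
writeLines(c("# a comment line that read.table skips",
             "1 2.5 a",
             "2 3.5 b",
             "3 4.5 c"), tmp)

data <- read.table(tmp)
class(data)   # "data.frame"
dim(data)     # 3 rows and 3 columns
names(data)   # default column names: "V1" "V2" "V3"

unlink(tmp)   # remove the temporary file
```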
We can now check what the data frame holds by asking R to return the first rows using head() or the last ones using tail():
head(data)
tail(data)
If we want to be more specific we can ask R to return only a specific subset of the data frame's columns, or rows, or even combinations of the two. But let's leave this for later chapters.
Reading big files can take time even with R (or especially with R), because it has to work out the properties of the data "on the fly" as the file is being read. R has to split each line into columns (read.table splits on white space by default), guess the class of the data in each column and store the whole file in a single data structure. For all these reasons, it is important that we make R's job easier by providing some of that information ourselves. Both read.table and read.delim allow us to provide additional information before reading the file. In particular, there are some important options we can set, related to the column separator (sep), the presence of a header line (header) and the number of rows to read (nrows).
Let's tell R that we want to read a file using tab as the column separator, keeping the first line of the file as the column header and reading 1000 rows. We can do this with read.delim()
data<-read.delim("myfile.txt", header=T, sep="\t", nrows=1000)
This will keep the first line as column headers and will stop reading after the first 1000 lines (excluding the header). Each line is split into columns wherever a tab occurs. There are a number of ways to separate columns in tables, the most common ones being spaces, tabs and commas. You may come across filenames ending with the *.tsv or *.csv extensions. These indicate tab-separated values and comma-separated values respectively. R has a dedicated function for reading the latter, called read.csv()
data<-read.csv("myfile.csv")
This will read the data and store them directly in columns, as long as the values are comma-separated.
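Both functions can be tried on small temporary files, as in the sketch below. The file contents and variable names are invented for illustration. The colClasses option, not discussed above, is a read.table option that states the class of each column so R does not have to guess.

```r
# A tab-separated file with a header line
tsv <- tempfile(fileext = ".tsv")
writeLines(c("id\tvalue", "1\t10.5", "2\t11.5", "3\t12.5"), tsv)

# Read only the first 2 data rows, stating the column classes up front
first_two <- read.delim(tsv, header = TRUE, sep = "\t", nrows = 2,
                        colClasses = c("integer", "numeric"))

# The same kind of data, comma-separated
csv <- tempfile(fileext = ".csv")
writeLines(c("id,value", "1,10.5", "2,11.5", "3,12.5"), csv)
from_csv <- read.csv(csv)   # header = TRUE and sep = "," are the defaults

unlink(c(tsv, csv))         # remove the temporary files
```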
readLines() is a function that reads files line-by-line storing each line in a separate vector element. The output is thus a vector holding the lines of the file in the order they appear in it. This may be useful for text mining purposes but not so much when the data are numeric and you want to store them in data frames or matrices.
text<-readLines("file.txt")
text<-readLines("file.txt", 100)
The second command will only read the first 100 lines of the file.
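A self-contained sketch of the same two calls, using a temporary file with invented contents:

```r
path <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), path)

all_lines <- readLines(path)         # one vector element per line
first_two <- readLines(path, n = 2)  # only the first 2 lines

length(all_lines)   # 3
first_two[2]        # "second line"
nchar(all_lines)    # number of characters in each line

unlink(path)
```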
More advanced reading can be performed with specialized functions contained in R libraries. One such function, read.xlsx() from the xlsx library, allows the user to import Excel files:
library(xlsx)
mydata <- read.xlsx("c:/myexcel.xlsx", 1) # 1 = index of the sheet to read
Missing Values
In many cases, the files fed into R will contain "holes", incorrectly formatted values or values that cannot be treated numerically. R is not able to understand what you meant with a funny character, or lack thereof, but is "clever" enough to mark the value as "NA" or "NaN". "NA" signifies a missing value (a hole in the data table) while "NaN" stands for not-a-number and is returned when a mathematical operation is nonsensical (e.g. 0/0; note that dividing a non-zero number by 0 gives Inf rather than NaN). Be extremely careful with both NA and NaN values as they may either stop certain functions from running (the good scenario, because you notice the error immediately) or make functions return erroneous values (the bad scenario, because you may not always notice the error).
Always remove NA values or at least exclude them from calculations. You can test whether a variable is NA or NaN simply by asking R:
is.na(x)
is.nan(x)
in which case R will return TRUE or FALSE. More on this in the chapter on Subsetting.
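A short sketch of how these tests behave on a vector with invented values. The na.rm option of mean(), not covered above, is one common way to leave missing values out of a calculation.

```r
x <- c(1, NA, 3, 0/0)     # 0/0 produces NaN

na_flags  <- is.na(x)     # FALSE TRUE FALSE TRUE: NaN also counts as missing
nan_flags <- is.nan(x)    # FALSE FALSE FALSE TRUE: only the NaN itself

m_all   <- mean(x)                # a missing result, because of the NA and NaN
m_clean <- mean(x, na.rm = TRUE)  # 2: the mean of 1 and 3

inf <- 1/0                # Inf, which is neither NA nor NaN
```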
Writing Data
We saw how to feed data into R and how to make sure nonsensical values are not included. Now how about getting data out of R and into a file on our computer? R has specific functions for writing data to files, most of which mirror the reading ones. Thus if we want to write a vector or a matrix to a file we can make use of the write function:
write(data, file="out.txt")
which can be made more elaborate by passing additional options to the command:
write(data, file="out.txt", append=T, sep="\t", ncolumns=3)
This will not only write the "data" to a file named "out.txt" but will also append the data to the end of this file if it already exists. Moreover, the data will be written in 3 columns separated by tabs.
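The effect of append and ncolumns is easiest to see on a small vector. The sketch below writes to a temporary file instead of "out.txt" so that nothing in the working directory is overwritten; the values are invented.

```r
out <- tempfile(fileext = ".txt")

# Six values laid out in 3 tab-separated columns give 2 lines
write(1:6, file = out, sep = "\t", ncolumns = 3)

# append = TRUE adds a third line instead of overwriting the file
write(7:9, file = out, append = TRUE, sep = "\t", ncolumns = 3)

written <- readLines(out)
written      # "1\t2\t3" "4\t5\t6" "7\t8\t9"

unlink(out)
```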
The command above works for vectors and matrices, but write() cannot handle a data frame. For data frames, and for more control over the process in general, use write.table instead:
write.table(data, file="out.txt", append=T, sep="\t", row.names=F, col.names=T)
This is only to be used on a data frame or a matrix that already has the values laid out in rows and columns (notice that there is no option for the number of columns, as these are predefined by the data frame structure). row.names=F (meaning FALSE) tells R to skip writing the row names, since these usually add an extra column to the output file (try this yourselves).
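A round trip shows that writing and reading mirror each other. This is a sketch with an invented data frame. quote=FALSE is an extra write.table option, not used above, that keeps quotation marks out of the output.

```r
df <- data.frame(id = 1:3, score = c(2.5, 3.5, 4.5))
out <- tempfile(fileext = ".txt")

write.table(df, file = out, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = TRUE)

header_line <- readLines(out)[1]   # "id\tscore"
back <- read.delim(out)            # read the file back into a data frame

all.equal(df, back)                # TRUE if the round trip kept the data
unlink(out)
```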
Writing to files can also be done in csv mode with (you might have guessed already) write.csv().
In the same way, with the xlsx library loaded, we can write data to a file that Excel can open:
write.xlsx(mydata, "c:/mydata.xlsx")
Reading R code
As you advance in working with R you may need to store parts of your code (or simply commands) in plain text files and invoke them without having to rewrite them. This can easily be done with the source() function. Simply store the code you want in a file (it helps if it carries the *.R extension) and then call the source function on it like this:
source("file.R")
This will immediately execute the commands contained in "file.R" in order, stopping with an error message at the first command that cannot be executed.
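As a sketch, the example below creates a tiny script in a temporary file (standing in for "file.R", with invented contents) and runs it with source(). The variables the script defines are available afterwards.

```r
script <- tempfile(fileext = ".R")
writeLines(c("a <- 2",
             "b <- a * 21"), script)

source(script)   # runs both commands from the file
b                # 42

unlink(script)
```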
What may these commands be? Carry on to the next session(s).