As with many computer programming languages, R uses different strategies to manage data internally. In this chapter, we define the basic constituents of data types. Using these data types we will define vectors, matrices, lists and our ultimate data management structure: the data frame. We will learn how to create data frames out of simple text files as well as Excel or SAS files. The last part of this chapter explains how relational data are managed inside R, their advantages in saving space and some of the pitfalls encountered when dealing with factors inside data frames.
Two of the most important attributes of data in R are mode and class. It is important to understand not only the differences between data types, but also the kinds of errors that arise when addressing data in the wrong way. We can use the mode function to tell us whether our data is a numeric, a character or a logical variable. All numbers (real and imaginary) fall into the numeric category (e.g. 1, 200, -100, 10i); character corresponds to a collection of letters, numbers and symbols (e.g. "red", "blue", "names in a list", "my characters: 2$%").
Notice how we enclosed the different character values in quotation marks. Finally, the logical mode corresponds to a binary description: TRUE or FALSE.
Logical values are interpreted as ones and zeros in mathematical expressions (e.g. TRUE * 10 = 10; FALSE * 3000 = 0). As seen in the previous section, we can assign values to variables:
a <- FALSE
b <- 1000
c <- "hello"
We can do simple operations with these variables, though, as we will see, not every combination is allowed:
summation <- a + b
values <- a * b
Let's check the mode of the resulting variables:
mode(summation)
## [1] "numeric"
mode(values)
## [1] "numeric"
You can check that after performing basic math operations between numeric and logical values, we end up with a numeric mode. However, if we try a mathematical expression between a numeric and a character mode, we get an error!
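For example, using the variables defined above, adding the character variable c to the numeric variable b fails:
b + c
## Error in b + c : non-numeric argument to binary operator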
The class function returns the object's class type. When we decide to store data in R, one important consideration should be the mode of the data being studied. Some objects, like matrices or vectors, require all stored data to be of the same mode. Other objects, like data frames or lists, allow for multiple modes stored in the same object. Additionally, we can use the typeof function to return additional information about the type of an object.
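The three functions can give different answers for the same value. For instance, an integer literal (written with the L suffix) has mode numeric but a more specific type and class:
mode(1L)
## [1] "numeric"
typeof(1L)
## [1] "integer"
class(1L)
## [1] "integer"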
Another very important data type concerns dates and times. We could save this type of information simply as a character variable; however, that would make it very hard to manipulate. In order to specify a date, we can take a two-step approach: first specify the content of a variable as a character, then use the as.Date function to turn it into a date. The as.Date function takes format as an argument, a character expression specifying the date format:
as.Date("1970-12-30", format = "%Y-%m-%d")
## [1] "1970-12-30"
as.Date("20/1/1997", format = "%d/%m/%Y")
## [1] "1997-01-20"
as.Date("12/7/1998", format = "%m/%d/%Y")
## [1] "1998-12-07"
The separation character between days, months and years has to be the same as the one in the format expression. Once loaded, R will display the date in the same format as your operating system settings. We can extract the components of a date variable using the format function. It might sound confusing to have a format argument in the as.Date function and a separate format function to extract values: the name is the same and their functionality is related, but they are used in different contexts. For instance, if we have a date variable that we want to query for the year, we use the following call:
my_date <- as.Date("12/7/1946", format = "%m/%d/%Y")
format(my_date, "%Y")
## [1] "1946"
format(my_date, "%m")
## [1] "12"
format(my_date,"%d")
## [1] "07"
Notice that the return expression is a character. To turn that into a number we should use the as.numeric function.
as.numeric(format(my_date,"%Y"))
## [1] 1946
R provides many mechanisms to store dates, including the built-in Date, POSIXlt and POSIXct classes. There is also a chron package that we are going to review later.
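As a quick illustration, the current system time comes back with the POSIXct class (the second element, POSIXt, is a common parent class of both POSIX classes):
now <- Sys.time()
class(now)
## [1] "POSIXct" "POSIXt"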
Running a data analysis involves managing large amounts of data. We rarely base a data analysis solely on scalar operations. In computer science, arrays are one of the building blocks used to manage data such as vectors and matrices. An array is defined as a collection of multiply subscripted elements of a single data type (or mode). An array has a dimension attribute, a vector describing the dimensions of the array. The two most common arrays are one-dimensional vectors and multidimensional matrices.
Vectors are the most basic way to store a collection of numbers in an array. Mathematically, vectors are defined as columns of numbers. In R, we can define a vector using the function c. One of the basic measurements when studying trees is the evaluation of tree height over time. This gives us an idea of the speed of growth, setting the basis for evaluating the quality of a given site to produce volume. For tree height measurements over time we will have a vector defining measurement age (in years), Ages = 5 6 7 9 12 15, and another vector defining height (in feet), Heights = 30 49 53 68 76 83. In R we can simply define each vector using the function c:
Ages <- c(5,6,7,9,12,15)
Heights <- c(30,49,53,68,76,83)
We can name vectors any way we want as long as we follow the rules stated in the previous chapter for variable names. Here our example includes two numeric vectors. We can create a third variable indicating presence or absence of fungal disease and a fourth holding the name of the person that measured each tree.
Disease <- c(FALSE, FALSE, FALSE, TRUE, TRUE, TRUE)
Name <- c("Johnson", "Johnson", "Burk", "Burk", "Burk", "Davis")
In words, we can say that our measured tree got infected somewhere between the age 7 and age 9 measurements. Vectors Ages and Heights correspond to numeric mode vectors, vector Disease corresponds to a logical mode vector, and Name corresponds to a character mode vector. Just like in math, we can perform mathematical operations with vectors and assign the result of an operation to create new vectors:
Height.meters <- Heights * 0.3048
Average.growth <- Heights / Ages
Height.meters
## [1] 9.1440 14.9352 16.1544 20.7264 23.1648 25.2984
Average.growth
## [1] 6.000000 8.166667 7.571429 7.555556 6.333333 5.533333
In order to address an element of a vector we apply a subscript to the variable. This notation is important, as it is the basis for managing data further below. Subscripting a vector uses square brackets:
Height.meters[3]
## [1] 16.1544
This gives us the 3rd element from the Height.meters vector. Likewise, we can use the same notation to address all values in a vector omitting one (or several):
Height.meters[-3]
## [1] 9.1440 14.9352 20.7264 23.1648 25.2984
It might seem silly to address vector elements this way; however, it becomes handy when we start dealing with large data sets with thousands of elements per vector.
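Subscripts also accept vectors of positions or logical conditions, which is how we will extract subsets from larger data sets later on:
Height.meters[c(2, 4)]
## [1] 14.9352 20.7264
Height.meters[Height.meters > 20]
## [1] 20.7264 23.1648 25.2984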
Performing operations between vectors of unequal length can cause unexpected results. The shorter vector will be recycled in order to match the longer vector. For example, vectors a and b have different lengths, and their sum is computed by recycling values of the shorter vector a.
a <- c(10,20,30)
b <- c(1,2,3,4,5,6,7,8)
a+b
## Warning in a + b: longer object length is not a multiple of shorter object
## length
## [1] 11 22 33 14 25 36 17 28
Operations with character vectors are somewhat different. We can't add character vectors in a mathematical sense, but we can concatenate character vectors using the paste function:
Generic <- "Pinus"
Specific <- "taeda"
TheTree <- paste(Generic,Specific)
TheTree
## [1] "Pinus taeda"
Now let's try with a character vector:
Generic <- c("Pinus" , "Pinus", "Eucalyptus", "Eucalyptus")
Specific <- c("taeda", "radiata", "globulus", "grandis")
Scientific.Names <- paste(Generic, Specific)
Scientific.Names
## [1] "Pinus taeda" "Pinus radiata" "Eucalyptus globulus" ## [4] "Eucalyptus grandis"
We can combine a series of vectors simply using the c function as well:
AllinOne <- c(Heights, Ages)
AllinOne
## [1] 30 49 53 68 76 83 5 6 7 9 12 15
NoTrees <- c(Generic, Specific, AllinOne)
NoTrees
## [1] "Pinus" "Pinus" "Eucalyptus" "Eucalyptus" "taeda"
## [6] "radiata" "globulus" "grandis" "30" "49"
## [11] "53" "68" "76" "83" "5"
## [16] "6" "7" "9" "12" "15"
Notice how combining character vectors with numeric vectors results in data coercion (combination of different data modes), with the final mode being character. The elements of a vector can be assigned names, which are used when the object is displayed and which can also be used to access elements of the vector through subscripts. Vector names can be assigned when the vector is first created or added/changed after the fact using the names function.
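Here is a minimal sketch of named vector elements, reusing the first three height measurements from above:
named_heights <- c(Tree1 = 30, Tree2 = 49, Tree3 = 53)
named_heights["Tree2"]
## Tree2
##    49
We can also gather several vectors of different modes into a single named object using the list function, which we cover in more detail later in this chapter: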
Measurements <- list(TreeAge = Ages,
TreeHeight = Heights,
TreeDisease = Disease,
MeasurementName = Generic)
Measurements
## $TreeAge
## [1] 5 6 7 9 12 15
##
## $TreeHeight
## [1] 30 49 53 68 76 83
##
## $TreeDisease
## [1] FALSE FALSE FALSE TRUE TRUE TRUE
##
## $MeasurementName
## [1] "Pinus" "Pinus" "Eucalyptus" "Eucalyptus"
TreeList <- list(Gender = Generic, Species = Specific)
TreeList
## $Gender
## [1] "Pinus" "Pinus" "Eucalyptus" "Eucalyptus"
##
## $Species
## [1] "taeda"    "radiata"  "globulus" "grandis"
We can also populate a vector using a sequence of numbers. To create a sequence we can use the ":" operator or the seq function:
1:10
## [1] 1 2 3 4 5 6 7 8 9 10
20:100
## [1] 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
## [18] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53
## [35] 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70
## [52] 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87
## [69] 88 89 90 91 92 93 94 95 96 97 98 99 100
seq(1,10,1)
## [1] 1 2 3 4 5 6 7 8 9 10
seq(0,20,2)
## [1] 0 2 4 6 8 10 12 14 16 18 20
Other specialized functions to create vectors deal with assigning values given a specific distribution. For example, if we want to create a vector with 25 random numbers picked out of a uniform distribution whose range stays between 0 and 50, we would run:
uniform <- runif(25, min = 0, max = 50)
uniform
## [1] 40.0161410 4.2584316 44.5244364 0.9480228 24.0588032 30.5123242
## [7] 2.8025374 4.7772487 23.0321201 32.4474112 4.9134951 5.4160432
## [13] 6.8663596 21.4316589 8.6614231 2.8645446 43.6334687 6.8342773
## [19] 23.6691387 36.5921673  2.9047624 14.6854276 15.2848573 20.1399581
## [25] 16.9273849
A matrix is a doubly subscripted array of a single data type. To illustrate a simple use of a matrix we are going to examine a simple Markov chain model. In epidemiology, a disease transition matrix (a matrix used to determine how a disease will affect a given population from time $t$ to time $t+1$) is defined as a square matrix. We can divide a population into healthy, infected, diseased and dead. A simple transition matrix for a highly pathogenic fungus affecting radiata pine in Chile, Phytophthora pinifolia, can be expressed as follows:
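$$
P = \begin{pmatrix}
0.8 & 0.2 & 0 & 0 \\
0.2 & 0.6 & 0.1 & 0.1 \\
0 & 0.1 & 0.6 & 0.3 \\
0 & 0 & 0 & 1
\end{pmatrix}
$$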
We can interpret this matrix by assigning the first column and first row to the healthy population, the second column and row to the infected, the third column and row to the diseased, and the fourth column and row to the dead. The chances of being healthy and staying healthy are $.8$, the chances of being healthy and becoming infected are $.2$, and there is no chance of expressing the disease or dying right away. The chances of getting infected and becoming healthy again are $.2$, of staying infected $.6$, of expressing the disease $.1$, and of dying $.1$. The chances of expressing the disease and becoming healthy are $0$, of getting back to infected $.1$, of continuing to express the disease $.6$, and of dying $.3$. The last row shows there is no chance of resurrecting; once an individual is dead, it stays that way. We can define this matrix in R using the matrix function. The arguments taken by the function are: a vector of length $n \times m$ (the size of the matrix with $n$ rows and $m$ columns); a logical argument, byrow, indicating if the matrix should be populated row by row; and nrow or ncol to specify the number of rows or columns involved. Here are two examples populating the same matrix by row and by column:
A <- matrix(c(.8,.2, 0, 0, .2,.6,.1,.1, 0,.1,.6,.3, 0, 0, 0, 1), byrow = TRUE, nrow = 4)
A
## [,1] [,2] [,3] [,4]
## [1,] 0.8 0.2 0.0 0.0
## [2,] 0.2 0.6 0.1 0.1
## [3,] 0.0 0.1 0.6 0.3
## [4,] 0.0 0.0 0.0 1.0
B <- matrix(c(.8,.2, 0, 0, .2,.6,.1, 0, 0,.1,.6, 0, 0,.1,.3, 1), byrow = FALSE, ncol = 4)
B
## [,1] [,2] [,3] [,4]
## [1,] 0.8 0.2 0.0 0.0
## [2,] 0.2 0.6 0.1 0.1
## [3,] 0.0 0.1 0.6 0.3
## [4,] 0.0 0.0 0.0 1.0
If we want to know what will happen to the disease at time $t+2$ we would need to multiply the matrix by itself.
A*A
## [,1] [,2] [,3] [,4]
## [1,] 0.64 0.04 0.00 0.00
## [2,] 0.04 0.36 0.01 0.01
## [3,] 0.00 0.01 0.36 0.09
## [4,] 0.00 0.00 0.00 1.00
Stop! That doesn't look right. This is because matrix multiplication requires the %*% operator; otherwise, we get the element-wise product of the matrix entries, which is not going to give us insight into the disease behavior. So, for the correct answer, we multiply enclosing the * symbol between % characters:
A%*%A
## [,1] [,2] [,3] [,4]
## [1,] 0.68 0.28 0.02 0.02
## [2,] 0.28 0.41 0.12 0.19
## [3,] 0.02 0.12 0.37 0.49
## [4,] 0.00 0.00 0.00 1.00
This tells us that the probability of a healthy tree staying healthy at time $t+2$ is $0.68$, the chances of becoming infected are $0.28$, of expressing the disease $0.02$ and of dying $0.02$. You can figure out the rest on your own. For this transition matrix, the long-term prospects don't look very good. For more advanced calculations (matrix powers) we will need to load the expm package. Calculating the probabilities at time $t+10$ already gives us:
library(expm)
A %^% 10
## [,1] [,2] [,3] [,4]
## [1,] 0.33750429 0.22091887 0.06335512 0.37822173
## [2,] 0.22091887 0.14826298 0.04710432 0.58371384
## [3,] 0.06335512 0.04710432 0.02155275 0.86798782
## [4,] 0.00000000 0.00000000 0.00000000 1.00000000
If $A$ is a matrix, then $A_{[i,j]}$ is the $ij$-th element of $A$. In R we express this as A[i,j]. Addressing the $i$-th row has the notation A[i,] and addressing the $j$-th column has the notation A[,j]. A range of rows or columns can be extracted using the ":" sequence operator. For example, A[2:3,1:4] extracts the 2 by 4 matrix containing rows 2 and 3 and columns 1 through 4 of A.
A[2:3, 1:4]
## [,1] [,2] [,3] [,4]
## [1,] 0.2 0.6 0.1 0.1
## [2,] 0.0 0.1 0.6 0.3
If you don't know the size of your matrix, use the dim function. It will give you the number of rows as well as columns present in your matrix.
dim(A)
## [1] 4 4
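Relatedly, the nrow and ncol functions return each dimension on its own:
nrow(A)
## [1] 4
ncol(A)
## [1] 4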
One very powerful trait of R is its ability to represent abstract concepts inside a variable, giving us an indication of how to treat those special values. Inf and -Inf can be stored in a numeric variable. Operations involving Inf values return Inf:
a <- 1/0
a
## [1] Inf
a * 10
## [1] Inf
Some numbers are not defined (like $\sqrt{-1}$) but we can still get a result out of them:
my_nan <- sqrt(-1)
## Warning in sqrt(-1): NaNs produced
my_nan
## [1] NaN
In this case, NaN stands for Not a Number. This is not giving us an error; instead, it indicates that the number is not defined. Doing operations with a NaN always returns a NaN. For example:
my_weird_vector <- c(sqrt(-1), 1,4,3,5)
## Warning in sqrt(-1): NaNs produced
my_normal_vector <- 1:5
output <- my_weird_vector * my_normal_vector
output
## [1] NaN 2 12 12 25
It is always good to check our operations for NaN values because these results ramify through all our calculations, giving us a final NaN. The last special value is NA. It stands for Not Available. In a typical survey, it is not uncommon to have records belonging to unobserved or incomplete data as the result of a missing value. R represents those records with an NA. Again, NA ramifies through your calculations to produce an NA as the final result of any operation. We can remove those values by adding the na.rm = TRUE argument to a function call. In the next example, we have a vector with tree heights. The 4th and 7th values were not measured; the individuals are there but there is no recorded height. We simply add an NA as part of the database. However, if we wanted to have an estimate of the average height, we would need to remove those missing records to get a result:
heights <- c(10,15,56,NA,87,12, NA)
mean(heights)
## [1] NA
mean(heights, na.rm = TRUE)
## [1] 36
It might seem silly to keep storing NA values; however, the main reason is that operations with NA remind us that we are dealing with a probably unbalanced data set and we need to find ways to account for it. We can also query for NA values using the is.na function. This gives us a logical vector showing TRUE for NA values and FALSE for non-NA values. When combined with the sum function, we can determine the number of NA values in a vector or a data frame:
is.na(heights)
## [1] FALSE FALSE FALSE TRUE FALSE FALSE TRUE
sum(is.na(heights))
## [1] 2
Likewise, using a data frame, we can query specific columns looking at either NA, Inf or NaN values: is.infinite to query for infinite values or is.finite for finite values; is.nan for Not a Number; and anyNA to show if there is an NA value at all.
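Combining a logical subscript with is.na gives a quick way to extract only the complete records by hand:
heights[!is.na(heights)]
## [1] 10 15 56 87 12
anyNA(heights)
## [1] TRUE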
One of the most flexible ways of storing data in R is a list. A list's flexibility comes from its ability to accommodate objects of different modes and lengths. Many functions in R store or return data as a list. Any time we would like to find out about the modes of the components of a list, we can use the sapply function. In the following example, we define a list and query its contents using the sapply function:
my_list <- list(x=c(1,2,3,4), y = c("1", "2", "3", "4"), z = c("one", "two", "three"))
sapply(my_list, mode)
## x y z
## "numeric" "character" "character"
Notice how y is interpreted as a character vector, given that we defined its values within quotation marks. We can access the contents of the list by specifying the name of the variable preceded by the list name and the "$" symbol.
my_list$x
## [1] 1 2 3 4
my_list$y
## [1] "1" "2" "3" "4"
Lists can be collections of objects of any size. A special type of list that we are going to use extensively to manage data is the data frame. A data frame is a collection of objects, all of the same length, intended to be used as a relational table. This means that elements in the same column are related to each other, sharing the same units, and elements in each row are expected to come from the same observation. Looking at the structure of a data frame will tell us the number of observations as well as the number of variables instead of the number of rows and columns. Data frames are different from matrices because they can hold heterogeneous data types among columns. The advantage of a data frame over a list is that we can use some relational properties. We can define a data frame just like we did for the list:
my_df <- data.frame(x=c(1,2,3,4), y = c("1", "2", "3", "4"), z = c("one", "two", "three","four"))
Notice we had to add a 4th element to variable z so that all items in the data.frame are the same length.
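A quick way to confirm this structure is the str function, which reports observations and variables rather than rows and columns (on R versions before 4.0 the character columns would appear as factors, since data.frame defaulted to stringsAsFactors = TRUE):
str(my_df)
## 'data.frame':    4 obs. of  3 variables:
##  $ x: num  1 2 3 4
##  $ y: chr  "1" "2" "3" "4"
##  $ z: chr  "one" "two" "three" "four"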
We normally have to deal with data that someone else put together. Such data could have been written in a text editor, in Microsoft Excel, or as an old legacy SAS data set. In such cases, we refer to these data as a table. Common traits of tables are a header for every column and several records, using one row for each observation.
The most common text file we will use is one that can be exported directly from Excel: the Comma Separated Value (.csv) file. These are tables created in a text editor or exported from Excel. As the name indicates, the fields in this type of data set are separated by some character (typically a comma). This kind of file involves a first row with the column names and subsequent rows holding the records for each observation:
DATE,TREE,DBH,HEIGHT
2006/12/11,1,15,20
2006/12/11,2,20,34
2006/12/11,3,40,35
2006/12/11,4,22,36
.
.
.
In order to read a csv file, we need to make sure our Working Directory points at the folder our file is in. Then we can use the read.csv function and assign the result to a data frame object. Here is an example from a table holding forest stand measurements for average diameters, average heights, basal area, dominant height, top height and number of trees per hectare.
data <- read.csv("data.csv")
The dimension (number of rows and columns) can be queried using the dim function. We can quickly inspect the names of the variables involved in this data frame using the names function, as well as display the first 6 records using the head function or the last 6 records using the tail function.
dim(data)
## [1] 48 6
names(data)
## [1] "DBH" "HT" "G" "DomHT" "TopHT" "TPA"
head(data)
## DBH HT G DomHT TopHT TPA
## 1 5.573684 30.15789 451.2404 34 38 447.0588
## 2 5.428947 30.13158 403.6835 35 41 400.0000
## 3 5.857895 30.21053 408.1455 34 35 404.2553
## 4 6.082927 30.87805 379.8534 31 36 376.1468
## 5 5.525641 30.33333 401.6719 36 36 397.9592
## 6 5.282500 31.42500 474.8547 34 36 470.5882
tail(data)
## DBH HT G DomHT TopHT TPA
## 43 5.972973 33.08108 389.1692 38 40 385.4167
## 44 6.140541 33.00000 359.2901 40 42 355.7692
## 45 5.520000 32.37143 436.1188 36 37 432.0988
## 46 5.700000 33.08333 386.6079 33 39 382.9787
## 47 5.721053 31.86842 416.9674 33 39 413.0435
## 48 5.797143 33.11429 401.5360 42 42 397.7273
Your operating system locale (especially if you are running an international version of Windows) will sometimes determine that the decimal mark be a comma and the separation character a semicolon (";"). When trying to import such a file with read.csv we will probably get errors. We can get around this by using the read.csv2 function, which assumes a semicolon as the separator and the comma as the decimal mark. read.csv2 is what we call a wrapper for the general purpose function scan. Other ways of reading files are read.table, read.delim and read.delim2. The first one is the most flexible, where you can specify the separation character using the sep argument, the number of lines to skip at the beginning of the file (skip) and the file encoding. The last two help us open TAB separated values: read.delim assumes a point as the decimal mark, while read.delim2 assumes a comma.
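As a sketch of read.table in action, suppose we receive a file (the name here is hypothetical) that uses semicolons as separators, commas as the decimal mark and starts with two comment lines:
# hypothetical file: semicolon-separated, comma decimals, two comment lines
eu_data <- read.table("stand_data_eu.txt", sep = ";", dec = ",",
                      header = TRUE, skip = 2)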
Excel has some peculiarities in the way it stores data. Sheets correspond to different tables; therefore, in order to open an Excel table we have to specify the sheet where the table is stored. There are several good packages to read Excel files. One of the most popular, and very easy to install on Windows systems, is RODBC. This package is very flexible, but it comes at the cost of more programming to open a simple file. First we need to specify a connection to the file we want using the odbcConnectExcel2007 function. Next, we need to specify the sheet we want to read using the sqlFetch function. Finally we need to close the connection to the file using the close function. There is also a function to query for the tables present in the odbc connection, sqlTables.
library(RODBC)
my_file <- odbcConnectExcel2007("data.xlsx")
my_data <- sqlFetch(my_file, "data")
close(my_file)
Unfortunately, if you are using a Mac computer, getting RODBC installed can be quite cumbersome. There are two alternatives to it: gdata and readxl. gdata uses Perl to connect to your Excel file, so it can be slower sometimes. readxl provides a good compromise between connectivity and speed.
library(readxl)
my_data <- read_excel("data.xlsx", sheet = 1)
This command will read the first sheet inside your Excel file. However, sometimes (well... most of the time) we don't remember the names of the sheets present in a file, or we would like to create a vector with the sheet names so that we can access all of them inside a program. The way to do that is using the excel_sheets function:
excel_sheets("data.xlsx")
## [1] "data"
Finally, there is a chance you will have to read someone's SAS file. There is an experimental package, sas7bdat. At the time of this writing, the package is at version 0.5 and has not been updated since 2014; however, for simple SAS tables, sas7bdat does the job. Reading the SAS data file makes use of the read.sas7bdat function:
library(sas7bdat)
my_data <- read.sas7bdat("Measurement.sas7bdat")
There is a plethora of different file formats used in analyzing natural resources. Most likely, you will encounter some of these files as legacy from other types of research conducted in the past. The package foreign handles many of these data types. The function call varies with every file type:
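A few of the readers provided by foreign are sketched below; the file names are hypothetical, and the package documentation lists the full set along with version limitations (read.dta, for instance, only covers Stata files up to version 12):
library(foreign)
# my_spss <- read.spss("survey.sav")      # SPSS
# my_stata <- read.dta("plots.dta")       # Stata
# my_systat <- read.systat("growth.syd")  # Systat
# my_octave <- read.octave("matrix.mat")  # Octave text files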
If you have to deal with environmental sciences, it is not uncommon to find agencies providing data in a single file with a mix of explanations and data. The first task is to find out where the data starts (how many rows have to be skipped from the original file), determine what separates each column (commas, semicolons, TABs or spaces), determine what is used as the decimal mark and the file encoding, and whether the column names make sense or if you will need to specify those later. If you find something like that, you will need to tell R some options before we can read that data. Here we have an example of $CO_2$ data over time. The data corresponds to records of $CO_2$ taken at the Mauna Loa station in Hawaii, and it sets a standard for $CO_2$ values in modern times. The file, named co2_mm_mlo.txt, contains 73 rows of pure explanation; after that, we have the $CO_2$ records. If we want to use this data, we will need to open it in the following way.
my_CO2.MaunaLoa <- read.table("./DATA/co2_mm_mlo.txt", skip = 73)
names(my_CO2.MaunaLoa) <- c("YEAR", "MONTH", "Year.month", "CO2", "CO2.spl","TREND", "N.Days")
head(my_CO2.MaunaLoa)
## YEAR MONTH Year.month CO2 CO2.spl TREND N.Days
## 1 1958 4 1958.292 317.45 317.45 315.29 -1
## 2 1958 5 1958.375 317.50 317.50 314.71 -1
## 3 1958 6 1958.458 -99.99 317.10 314.85 -1
## 4 1958 7 1958.542 315.86 315.86 314.98 -1
## 5 1958 8 1958.625 314.93 314.93 315.94 -1
## 6 1958 9 1958.708 313.20 313.20 315.91 -1
plot(CO2.spl ~ Year.month, data = my_CO2.MaunaLoa, type = "l")
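Notice the -99.99 in the CO2 column of the third record; judging from the values around it, the station appears to use -99.99 as a missing-value code, so before any serious analysis we would recode it as NA:
# recode the station's missing-value sentinel (-99.99) as NA
my_CO2.MaunaLoa$CO2[my_CO2.MaunaLoa$CO2 == -99.99] <- NA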
A factor is a vector object used to specify a discrete classification of the components of other vectors of the same length; we call these the levels of a factor. The use of factors helps save memory, particularly when we have very long character vectors with values replicated over many rows. In the following example, we have a vector describing attributes from soil samples. In soil sciences, soil mapping plays a fundamental role in characterizing soil fertility. There are several important measurements that fit right into a numeric attribute, like clay, sand and silt percentages, organic matter content and various chemical variables describing nutrient contents. At the same time, there is one qualitative descriptor used when aiming at soil classification and comparison anywhere in the world. The system has three components: hue (a specific color), value (lightness and darkness), and chroma (color intensity). Color values come from a look-up table (the Munsell color chart) and our mission is to store that information for later processing. We can turn a character vector into a factor using the as.factor function.
soil_id <- c(12002, 12003, 15001, 9001,
9002, 9003, 101001, 101002,
101003, 21001, 21002, 21003)
soil_color <- c("10YR 6/1", "5Y 5/1", "10YR 6/1", "2.5YR 4/6",
"10YR 8/6", "2.5YR 4/6", "10YR 8/6", "5Y 5/1",
"10YR 8/6", "5Y 5/1", "5Y 6/4", "5Y 6/4")
soil_depth <- c(150,300,250,185,123,80,100,110,180,210,280,110)
soils <- data.frame(soil.number = 1:12,
soil.depth = soil_depth,
soil.color = as.factor(soil_color))
soils$soil.color
## [1] 10YR 6/1 5Y 5/1 10YR 6/1 2.5YR 4/6 10YR 8/6 2.5YR 4/6 10YR 8/6
## [8] 5Y 5/1 10YR 8/6 5Y 5/1 5Y 6/4 5Y 6/4
## Levels: 10YR 6/1 10YR 8/6 2.5YR 4/6 5Y 5/1 5Y 6/4
The soil.color factor now holds two vectors: one with the levels of the factor, the other with integer indices pointing into those levels. So instead of having a long vector of characters, we have a smaller object of integer numbers.
By default, R orders the levels alphabetically. But this doesn't mean R knows what comes first; for the sake of an analysis, the only thing R does is display the factor labels in an ordered way. We can change the display order by explicitly specifying the levels of the factor. We can query the levels present in a factor vector using the levels function.
levels(soils$soil.color)
## [1] "10YR 6/1" "10YR 8/6" "2.5YR 4/6" "5Y 5/1" "5Y 6/4"
Continuing with our previous example, we know each soil color code from the Munsell color table represents an actual color, or a particular combination of red, green and blue (RGB) values. We would like to create a data frame of soils that incorporates those values; however, we do not want to repeat information over and over again. Plus, we will need to create extra variables to hold the RGB values. The first part is very straightforward: using the same code we have seen so far, we create a simple data frame that holds the variables soil.color and soil.name.
Munsell <- data.frame(soil.color = c("10YR 6/1", "10YR 8/6","2.5YR 4/6", "5Y 5/1","5Y 6/4"), soil.name = c("light grey", "yellow", "red", "dark gray", "yellow1"))
Now that we have color names, we would like to find out the RGB value for each color. This is accomplished using the col2rgb function. The function takes a character variable as an argument and returns a numeric matrix with three rows holding values between 0 and 255 for red, green and blue.
col2rgb("magenta")
## [,1]
## red 255
## green 0
## blue 255
It is desirable to store each color value in a separate variable. So we will first create a matrix containing the RGB values, transpose it using the t function and then assign it to the actual data frame.
RGB_values <- sapply(Munsell$soil.name, FUN = col2rgb)
RGB_values
## [,1] [,2] [,3] [,4] [,5]
## [1,] 211 255 255 169 255
## [2,] 211 255 0 169 255
## [3,] 211 0 0 169 0
RGB_values <- t(RGB_values)
RGB_values
## [,1] [,2] [,3]
## [1,] 211 211 211
## [2,] 255 255 0
## [3,] 255 0 0
## [4,] 169 169 169
## [5,] 255 255 0
Munsell$R <- col2rgb(Munsell$soil.name)[1,]
Munsell$G <- col2rgb(Munsell$soil.name)[2,]
Munsell$B <- col2rgb(Munsell$soil.name)[3,]
head(Munsell,3)
## soil.color soil.name R G B
## 1 10YR 6/1 light grey 211 211 211
## 2 10YR 8/6 yellow 255 255 0
## 3 2.5YR 4/6 red 255 0 0
Now that we have a table with the descriptors for each soil color, we can merge it with the sampling table.
soils <- merge(soils, Munsell)
head(soils)
## soil.color soil.number soil.depth soil.name R G B
## 1 10YR 6/1 1 150 light grey 211 211 211
## 2 10YR 6/1 3 250 light grey 211 211 211
## 3 10YR 8/6 5 123 yellow 255 255 0
## 4 10YR 8/6 7 100 yellow 255 255 0
## 5 10YR 8/6 9 180 yellow 255 255 0
## 6 2.5YR 4/6 4 185 red 255 0 0
You can see how each element in soils has its color name and values for red, green and blue. In this brief example, R tries to merge the data sets using their common field name (soil.color).
Performing evaluations over characters can be tricky, particularly when we have a classification variable spelled with both capital and non-capital letters. For example, we could have a sample coming from the Piedmont, or the piedmont, or the PIEDMONT. From our standpoint, they are all the same; however, inside R, they are all different. If we were to compute summaries for each class, we would get a separate group for each way of spelling. As with variable names, character values and derived factors have to be consistent. Functions to run before getting our table summaries are tolower, toupper and casefold.
tolower("Piedmont")
## [1] "piedmont"
toupper("Piedmont")
## [1] "PIEDMONT"
casefold("piedmont", upper = TRUE)
## [1] "PIEDMONT"
Exercises
1. Simple data frame
2. Factor/merge data
3. Matrix / Markov Chain exercise