Data types are important in R. Most, if not all, of its functions are meant to be run on specific types of data. It is therefore crucial to make sure you are using the right one.
Data types (pieces of data) belong to one of the following five classes: numeric, integer, character, logical and complex.
Data objects may be organized in data classes, which may be simple variables, vectors, matrices, factors, data frames or lists.
Simple variables are data types holding only one value, such as "Christoforos", 6 or TRUE.
Vectors are enumerated arrays of data in the sense of uni-dimensional matrices, while matrices are...well matrices in the traditional form.
Conceptually, factors are variables in R which take on a limited number of different values; such variables are often referred to as categorical variables. One of the most important uses of factors is in statistical modeling. You can read a bit more about them here: http://www.stat.berkeley.edu/~s133/factors.html (but there will be more in the following).
Data frames (or dataframes) are lists of vectors of equal length but not necessarily of the same class. In this sense they differ from matrices, whose elements must all be of the same class (for example all numeric or all character). Data frames are the most versatile (and convenient) way of representing and analyzing data.
Lists are generic vectors containing other objects. Lists do not need to contain vectors of the same size and are thus the most complex of data types in R.
Coercion of data types
Data types can be changed into one another with a technique called explicit (forced) coercion, in which a value of one type is converted to another. This works for some transformations but not all, and is to be used with caution. For instance
x<- 1
is a number
but
y<-as.character(x)
makes y a character equal to "1" (notice the double quotation marks). This can be changed back to a number with
xx<-as.numeric(y)
which makes xx equal to 1 again.
Changes can be performed between numbers and characters, as well as between numbers and logicals with as.logical() (0 becomes FALSE, any other number becomes TRUE). Nonetheless, forced coercions are not very good practice, especially for beginners. Consider yourselves warned.
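The round trip described above can be sketched as follows. This is only a minimal illustration; the variable names flags and bad are ours and are not used elsewhere in the text.

```r
# Explicit coercion between numeric, character and logical values
x  <- 1
y  <- as.character(x)   # the character "1"
xx <- as.numeric(y)     # back to the number 1

# Numbers to logicals: 0 becomes FALSE, 1 becomes TRUE
flags <- as.logical(c(0, 1))

# A string that does not look like a number cannot be converted:
# R returns NA and issues a warning (silenced here on purpose)
bad <- suppressWarnings(as.numeric("abc"))
is.na(bad)
```

The last line shows why caution is needed: a failed coercion does not stop the script, it quietly leaves an NA behind.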
Vectors
Vectors may be created with the simple "concatenate" function c() or with the use of the vector() function
x <- c(1,2,3)
y <- c("me","you","him")
The vector function is to be used mostly for initialization purposes
z <- vector("numeric", length=20)
This creates a vector of 20 "0" values.
Data types containing mixed objects are to be treated with extreme caution. This is because R coerces data
y <- c(1.7, "a") # y is now character
y <- c(TRUE, 2) # y is now numeric
We can find out the class of a data object by typing
x <- c(1,2,"TRUE")
class(x)
and coerce the data to the type we desire with the as.*() family of functions (as.logical(), as.numeric(), as.character())
as.logical(x)
[1] NA NA TRUE
When the coercion makes no sense, R returns the "NA" value. Be prepared to see this a lot if you are not careful with your data assignments.
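To see the implicit coercion rules at work, here is a short sketch. The names mixed_chr and mixed_num are illustrative only.

```r
# Mixing types inside c() promotes every element to the most general type
mixed_chr <- c(1.7, "a")    # numeric + character -> character
mixed_num <- c(TRUE, 2)     # logical + numeric   -> numeric (TRUE becomes 1)

class(mixed_chr)
class(mixed_num)

# x is a character vector because of the "TRUE" string, so only
# its third element can be read as a logical value
x <- c(1, 2, "TRUE")
class(x)
as.logical(x)
```

Checking class() right after building a vector is a cheap way to catch an unintended promotion early.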
Matrices
Matrices are introduced with the matrix function. As with vectors, matrices need to be assigned with dimension specifications. Matrices need two dimensions as they are two-dimensional.
m <- matrix(0, nrow=2, ncol=5)
creates a 2x5 matrix of zeros
The dimensions of a matrix can be retrieved with the dim function
dim(m)
R fills matrices by completing the columns, starting from the upper left part (element[1,1]). So if you wanted to fill m with the first 10 numbers that would be done by:
m <- matrix(1:10, nrow=2, ncol=5)
If we now wanted to see what m looks like, we would simply type:
m
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    3    5    7    9
[2,]    2    4    6    8   10
We can also create a matrix from a vector by adding a dimension attribute. The dimensions are in this case a vector themselves
x <- 1:10 # a vector of the numbers 1 to 10
dim(x) <- c(2,5) # dimensions read as number of rows, number of columns
x
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    3    5    7    9
[2,]    2    4    6    8   10
Two very useful functions for matrix manipulation allow us to add or join rows and columns to an existing matrix, or to create matrices by joining vectors.
rbind joins vectors by rows and cbind does the same by columns
x <- 1:10
y <- 11:20
z <- rbind(x,y) # join x and y by treating them as rows of a matrix called z
z
  [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
x    1    2    3    4    5    6    7    8    9    10
y   11   12   13   14   15   16   17   18   19    20
z <- cbind(x,y) # join x and y by treating them as columns of a matrix called z
z
       x  y
 [1,]  1 11
 [2,]  2 12
 [3,]  3 13
 [4,]  4 14
 [5,]  5 15
 [6,]  6 16
 [7,]  7 17
 [8,]  8 18
 [9,]  9 19
[10,] 10 20
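The following sketch contrasts the default column-wise filling with row-wise filling, and checks the shapes produced by rbind() and cbind(). The byrow argument of matrix() is not discussed in the text above and is shown here only for comparison; m_col and m_row are illustrative names.

```r
# Default behaviour: the matrix is filled column by column
m_col <- matrix(1:10, nrow = 2, ncol = 5)
# byrow = TRUE fills it row by row instead
m_row <- matrix(1:10, nrow = 2, ncol = 5, byrow = TRUE)

m_col[1, ]   # first row when filled by column: 1 3 5 7 9
m_row[1, ]   # first row when filled by row:    1 2 3 4 5

# rbind() and cbind() produce transposed shapes from the same vectors
x <- 1:10
y <- 11:20
dim(rbind(x, y))   # 2 rows, 10 columns
dim(cbind(x, y))   # 10 rows, 2 columns
```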
Data Frames
Data frames are one of the most common and versatile data types in R. They are tabular lists and so can contain elements of different classes, but they are also matrix-like in the sense that all vectors in the list should be of the same size (length). Data frames are mostly read into R with specific commands (see Reading Data). They have special attributes that refer to the names of the data elements stored in each row (row.names), they can carry titles for the columns etc. Data frames can be converted to matrices by calling the data.matrix() function, but this should be handled with extra care due to the coercion issues covered earlier.
When not reading data frames from a file already stored in the computer, we can declare them with commands like:
x<- data.frame(a=1:4, b=c("Me","You","Him","Her"))
x
  a   b
1 1  Me
2 2 You
3 3 Him
4 4 Her
Notice that the columns have names ("a" and "b") which were given in the declaration of the variable. Also notice that rows are numbered from 1 to 4. These can be recalled with the row.names() function
row.names(x)
[1] "1" "2" "3" "4"
The size of the data frame can be given either with dim() or by calling the functions nrow() and ncol(), which also work on matrices
dim(x) # [1] 4 2
nrow(x) # 4
ncol(x) # 2
Data frames are the most commonly used data type, especially when handling external data (files from your computer). We will see more of that later on.
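A small sketch of how a data frame can be inspected, and of what happens when it is turned into a matrix. Two assumptions are ours: stringsAsFactors = FALSE is set explicitly so the result does not depend on the R version, and as.matrix() is used instead of the data.matrix() mentioned earlier, simply because its behaviour is easier to show. The names df and mat are illustrative.

```r
# stringsAsFactors = FALSE keeps column b as character in every R version
df <- data.frame(a = 1:4,
                 b = c("Me", "You", "Him", "Her"),
                 stringsAsFactors = FALSE)

names(df)            # the column names
sapply(df, class)    # the class of each column
dim(df)              # 4 rows, 2 columns

# A matrix can hold only one class, so converting a mixed data frame
# coerces every element to character
mat <- as.matrix(df)
class(mat[1, 1])
```

This is the coercion issue in practice: the numbers in column a survive only as text once the data frame becomes a matrix.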
Factors
Imagine your data are based on some categorical, non-numeric variable such as "good", "bad", or "diseased", "healthy" etc. In this case you will need a data type that deals with non-numeric data. Factors are special types of vectors that handle categorical (non-numerical) data. Their main (and foremost) difference from vectors is that they are labeled instead of ordered. This means that a factor has "names" such as "Me", "You" and "Him" instead of numbers such as 1, 2, 3 designated to its elements. In this sense, they are much more useful when trying to address different subsets of the data, a very important aspect of analysis that is called Subsetting.
Inserting a factor is as easy as:
fac<-factor(c("me", "me", "you", "me", "him", "you"))
Notice that we actually call a function called factor() upon a vector, thus we say we "factorize" a vector. fac now holds the values "me", "you" and "him", each in its specific positions. The different categorical values held can be listed with the use of the levels() function
levels(fac)
[1] "him" "me"  "you"
Notice the variables are returned in alphabetical order. This order is used to assign specific numbers to each factor level. In this scheme, "him" will be given 1, "me" 2 and "you" 3. This can be visualized with unclass(). The unclass() function converts the levels to their attributed numbers.
unclass(fac)
[1] 2 2 3 2 1 3
attr(,"levels")
[1] "him" "me"  "you"
table(fac)
fac
him  me you
  1   3   2
Below the "numerized" factor, R also returns the attributed levels, a sort of legend that tells you which number corresponds to which level. Not all factors you'll be dealing with will be that small, though, and so it would be useful to have a summary of how the levels are represented in the factor. This is returned by the table() function, which lists the levels ordered by attributed number with the corresponding number of elements below each. This tells us that in our fac factor we had one instance of "him", three instances of "me" and two of "you".
Remember that if the alphabetical ordering is not very suitable/convenient for you, you can always change it by adding a levels option in the factor declaration
fac<-factor(c("me", "me", "you", "me", "him", "you"), levels=c("me","you","him"))
table now will return the order that you chose instead of the default one
table(fac)
fac
 me you him
  3   2   1
More on factors in later, not so introductory chapters.
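The effect of the levels option can be checked with a short sketch. The names values, fac_default and fac_custom are illustrative.

```r
values <- c("me", "me", "you", "me", "him", "you")

# Default: levels are sorted alphabetically
fac_default <- factor(values)
levels(fac_default)       # "him" "me" "you"
as.integer(fac_default)   # 2 2 3 2 1 3

# Custom order given through the levels argument
fac_custom <- factor(values, levels = c("me", "you", "him"))
levels(fac_custom)        # "me" "you" "him"
as.integer(fac_custom)    # 1 1 2 1 3 2

# The counts are the same, only their order changes
table(fac_custom)
```

Note that the integer codes change together with the level order, so code that relies on the numbers rather than the labels is fragile.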
Lists
Lists resemble data frames that do not have to follow the restriction of equal vector size. In this sense you may see them as data "blobs" that can hold simple variables or vectors of any type or size. For the moment it would be useful to know how to introduce one. Let's do it step by step, by creating three vectors first:
vec <- c(1,5,7);mec <- 1:10;dec <- c("TRUE","TRUE","FALSE","FALSE","FALSE")
Now let's put them all in a list, in that order
l<-list(vec,mec,dec)
l now contains vec, mec and dec in this order. This means that if we call back the first element of l (by asking for l[1]) we will be getting the complete vector vec
l[1]
[[1]]
[1] 1 5 7
Notice the two-lined output R returns containing a "reference" with double brackets that points to our choice for the first element. We can get more than one element by invoking simple subsetting techniques. For instance we can retrieve the first and the third elements of the list by asking for them with a vector containing 1 and 3.
l[c(1,3)]
[[1]]
[1] 1 5 7
[[2]]
[1] "TRUE"  "TRUE"  "FALSE" "FALSE" "FALSE"
which returns vec and dec, the 1st and the 3rd elements (but see more of subsetting later on).
Although a large proportion of R's built-in functions cannot handle lists, they remain important for the organization of data, especially when we are talking about big data. We'll just leave them aside for the moment and get back to them when the time is ripe.
Arrays
In case you are wondering, R also supports multi-dimensional (higher-dimensional) data types, called arrays. These are more complex data types of higher order that we choose to skip discussing for the time being.
What class is my data? The class() and str() functions
Too often we are faced with the problem of not knowing the class of the data we are handling. This is especially troubling in the case of data frames and lists, whose components may be of different classes.
The class() function is called upon any data object and returns the data type. In the case of the above list l
class(l)
[1] "list"
class(l) returns "list" which is exactly what l is. Now what if we wanted to know what is the data type of each of the object in l? In this case we may use the str() function
str(l)
List of 3
 $ : num [1:3] 1 5 7
 $ : int [1:10] 1 2 3 4 5 6 7 8 9 10
 $ : chr [1:5] "TRUE" "TRUE" "FALSE" "FALSE" ...
str(l) returns a much more detailed output that contains the data class of l (List), the number of objects in it (of 3) and the data type of each of the objects (num, int, chr) alongside the first instances in each one. Notice how the last object of l has been assigned the type "character" (chr). We can easily coerce it back to logical and put it in the list with
l<-list(vec,mec,as.logical(dec))
str(l)
List of 3
 $ : num [1:3] 1 5 7
 $ : int [1:10] 1 2 3 4 5 6 7 8 9 10
 $ : logi [1:5] TRUE TRUE FALSE FALSE FALSE
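When only the class or the length of each component is needed, a more compact alternative to reading the str() output is to apply a function over the list. The sketch below uses sapply() and lengths(), neither of which is introduced in the text above, so take it as an optional shortcut.

```r
vec <- c(1, 5, 7)
mec <- 1:10
dec <- c("TRUE", "TRUE", "FALSE", "FALSE", "FALSE")
l <- list(vec, mec, as.logical(dec))

sapply(l, class)   # class of each component: "numeric" "integer" "logical"
lengths(l)         # length of each component: 3 10 5
```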
Subsetting refers to the selection of parts of data from greater sets. Subsetting of data types is one very important aspect of the R environment in the sense that it can be performed with extreme precision and at great speeds. In this sense it constitutes one of R's main advantages.
Subsetting uses a number of special characters to perform various tasks, such as obtaining specific rows, columns or elements from all possible data types, depending on the user's choice. It can be roughly divided into:
structural subsetting, where data are subsetted based on the structure of the data type (e.g. the 3 first columns)
logical subsetting, where data are subsetted based on a logical restriction (e.g. all values that are not "NA")
numerical subsetting, where data are subsetted based on numerical/categorical operation/control (e.g. all values >10 or all values equal to "FALSE")
Structural Subsetting
Makes use of the [], [[]] and $ operators, which, with a clever combination of commas, can provide absolute precision in the choice of data with only a few characters of code.
[] may be used on any vector, factor, matrix or dataframe to subset it in one or two dimensions. In the case of vectors and factors there is only one dimension. Therefore, if we want the nth element of a vector v we simply put n within brackets
x<-v[6]
x now holds the sixth value of the vector v. R enumerates all data types starting from 1 (and not 0 like Perl or Python) so 6 will actually return the 6th value.
x<-v[6:8]
will get a "slice" of v and store it in x which now becomes a vector itself carrying the 6th,7th and 8th elements of v. Remember how the ":" operator is used for ordered integers. Now what if we wanted some compartmentalized subsetting that does not follow a certain order
x<-v[c(1, 6:8, 11:15, 18, 20)]
This intricate subsetting allows us to get the 1st, then the 6th-8th, then the 11th-15th, then the 18th and then the 20th elements of v and store them in a vector called x. Notice how the subsetting indices (the numbers in the brackets) are a vector in themselves and thus they are introduced with the concatenate function c(). We could have greater control in this subsetting if we split the process in two
ind<-c(1, 6:8, 11:15, 18, 20)
x<-v[ind]
This first creates a vector called ind that carries the indices (the numbers of elements we want to obtain from v) and then passes it to v with [] to perform the subsetting.
The exact same process stands for factors (which as we already know are categorical vectors). But what about matrices and dataframes? Here there are two dimensions on which we can subset (rows and columns). R uses the same operator [] but allows for two values separated by a comma to provide information on rows and columns (in this order). Although both of them are not always needed (suppose you only want to subset columns but not rows), R needs to know we are treating two-dimensional data types, so we need to use a comma in between. This will become less confusing with an example. Suppose we need to keep only the first and the third column from a matrix m. This is done with:
mm <- m[,c(1,3)]
Notice that the indices inside the brackets are of the form [ ,vector]. That is because the value before the comma is reserved for row subsetting. Since we don't want to subset on the rows we leave this empty, but use the comma since this is compulsory. After the comma we simply provide the vector of indices we want to subset columns by (in this case c(1,3) for the 1st and the 3rd). In perfect symmetry subsetting on rows 10 through 20 would be performed with:
mm <- m[10:20,]
As in this case the indices are serial we need not use a concatenated structure, so 10:20 will do. Alternatively we could use c(10,11,12,13,14,15,16,17,18,19,20) but we don't for obvious reasons.
Bear in mind that in both cases above mm is a matrix (or data frame, depending on what m was) whose dimensions have now changed. If m was an MxN matrix, then mm is Mx2 in the first case and 11xN in the second (rows 10 through 20 inclusive make 11 rows). In the case we subset on both rows and columns
mm <- m[c(1:5,8), c(10,11, 12:14)]
mm is now a 6x5 matrix.
Understandably we can return any single element of a matrix or data frame by providing its exact "coordinates" and thus
m[6,9]
will return the element in the 6th row of the 9th column.
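The dimensions resulting from each of the subsetting commands above can be verified on a concrete matrix. The 20x15 matrix below is a hypothetical stand-in, chosen only so that all the indices used in the text exist. The drop argument at the end is not covered above and is included as a side note.

```r
# A hypothetical 20 x 15 matrix, filled column by column with 1..300
m <- matrix(1:300, nrow = 20, ncol = 15)

dim(m[, c(1, 3)])                      # all rows, 2 columns: 20 2
dim(m[10:20, ])                        # rows 10 to 20 inclusive: 11 15
dim(m[c(1:5, 8), c(10, 11, 12:14)])    # 6 rows, 5 columns: 6 5

# A single element, addressed by row and column
m[6, 9]

# Selecting a single column returns a plain vector by default;
# drop = FALSE keeps the result as a one-column matrix
is.matrix(m[, 1])
is.matrix(m[, 1, drop = FALSE])
```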
In the case of data frames with named columns we can also use the $ operator to subset columns. Take the built-in R data frame called mtcars simply by typing
mtcars
This small dataframe contains makes of cars alongside their constructor specifications. In order to choose one specific specification simply ask for the name of the data frame followed by $ and the name of the column. For instance
mtcars$cyl
will return the vector containing the cyl column. This is identical to calling mtcars[,2] asking thus for the second column of the data frame (not including the names of the cars).
The [[]] and $ operators are mostly used in list context. Although, as we saw earlier, [n] can be used to invoke the nth element of a list, it returns that element wrapped in a list of length one; [[n]] returns the element itself, without the reference (the name or index) of that element. This is often desirable, especially when we want to pass data from a list to another function. Consider the list l we saw earlier
l<-list(vec,mec,dec)
where vec, mec and dec are three vectors of different type and size. l can be created by keeping names for each of the three
l<-list(a=vec, b=mec, c=dec)
now vec can be invoked with all of the following commands
l[1]
l[[1]]
l$a
There are subtle differences in the form of the output. From top to bottom, the output is stripped of (sometimes unnecessary) references.
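The difference between the three forms becomes clearer when we look at the class of what each one returns. This is a minimal sketch using the same vectors as above.

```r
vec <- c(1, 5, 7)
mec <- 1:10
dec <- c("TRUE", "TRUE", "FALSE", "FALSE", "FALSE")
l <- list(a = vec, b = mec, c = dec)

class(l[1])      # "list": a list of length one that still carries the name a
class(l[[1]])    # "numeric": the vector itself
class(l$a)       # "numeric": same as l[[1]], addressed by name

# [[ ]] and $ return exactly the same object
identical(l[[1]], l$a)

# Functions expecting a vector need the [[ ]] or $ form
mean(l[[1]])
```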
Logical Subsetting
Makes use of logical operators ("&", "!", "|", see more on that in Control Structures)
Logical operators stand for AND (="&"), OR (="|") and NOT ("!"). While the first two are more complex as "joining" operators and we will see more of them when we discuss control structures, NOT "!" can stand on its own as it signifies the negation of a statement. In this sense
!is.na(x)
returns logical values (TRUE or FALSE) depending on whether each value in x is NOT "NA". In this sense, !is.na() is the mirror image of is.na(). Let's use this in a subset
y<- x[!is.na(x)]
y is now a vector with all the elements of x that are NOT "NA". There are two things to be careful of here. One, that the subsetting index refers to the subsetted data object (x). Notice how x is subsetted based on one of its own properties (how many of its elements are not "NA"). The other is that the output is a vector, regardless of the structure of x. Even if x was to be a matrix or a data frame, the logical subsetting will return the values reading by column starting from the top left.
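Both points can be seen in a small sketch. The vector x and the matrix mat below are made-up examples.

```r
x <- c(4, NA, 7, NA, 10)

is.na(x)     # TRUE where a value is missing
!is.na(x)    # the mirror image: TRUE where a value is present

y <- x[!is.na(x)]   # keeps 4, 7 and 10

# On a matrix the result is a plain vector, read column by column
mat <- matrix(c(1, NA, 3, 4, NA, 6), nrow = 2)
mat[!is.na(mat)]    # 1 3 4 6
```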
Numerical Subsetting
Makes use of the common comparison operators (>, <, >=, <=, != and ==). You should already be familiar with ">", "<", "<=" and ">=" as "greater", "smaller", "smaller or equal" and "greater or equal", but notice "!=" signifying "not equal to" and "==" for "equal to" (this is because in R, as in most programming languages, "=" is used for assignment of variable values; in R "<-" and "=" are largely interchangeable for assignment, but we strongly recommend "<-" to avoid confusion).
Numerical subsetting works exactly like logical. So assuming x is a data frame holding these values
x
  a   b
1 1  Me
2 2 You
3 3 Him
4 4 Her
Then, subsetting it numerically by asking only for values equal to 2 would be
x[(x==2)]
[1] "2"
This returns the value "2" (only once in this case, but more times if 2 were to be found more than once). What if we wanted to get values that are greater than 2. We could try
x[(x>2)]
and we would get
[1] "3" "4" NA  NA  NA  NA
Warning message:
In Ops.factor(left, right) : > not meaningful for factors
What happened here? We got the two numerical values that fulfill our restriction (>2), but we also got four NA values at the end and a warning message saying that our operation is not meaningful for factors. What R is trying to tell us is that a comparison >2 is not possible for categorical data like "Me" or "You", because column b is stored as a factor (the default for character columns in data.frame() before R 4.0.0). In this sense the operation "Me">2 returns a missing value (NA). R prints these at the end of the vector but is kind enough to point the problem out to us.
Numerical subsetting is not always numerical. It can be categorical as well with the use of the "==" and the "!=" operators. If we try
x[(x!="Me")]
[1] "1"   "2"   "3"   "4"   "You" "Him" "Her"
we get back all the elements of x that are not equal to "Me" regardless if they are numbers or characters. This is because R performs coercion of variables wherever this is possible before conducting the subsetting. This is very handy in the general cases of data comparisons that we discuss next in Control Structures.
One very handy function for subsetting is which(). It can be used for both logical and numerical subsetting to produce a subset of indices fulfilling certain conditions.
ind <- which(x>=1)
new_x <- x[ind]
which() can also be used in logical context.
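A short sketch of which() on a made-up numeric vector, showing how it differs from direct subsetting when missing values are present.

```r
x <- c(3, 12, NA, 8, 25)

# which() returns the positions where the condition is TRUE;
# positions where the comparison gives NA are left out
ind <- which(x > 10)
ind          # 2 5
x[ind]       # 12 25

# Direct subsetting with the condition keeps an NA
# for the element that could not be compared
x[x > 10]    # 12 NA 25
```

Going through which() is therefore a simple way to avoid carrying NA values into the subset.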
An example. Subsetting for not NA values
Let's now use subsetting in an example that is very useful (and also quite common): removing NA values from a data frame. The task we want to complete is, given a data frame (or a matrix), to extract the instances (rows) that do not have NA values (holes). As in many cases, R already has a built-in function to check for rows in a matrix that do not carry NA values. This function is complete.cases() and can be invoked on a matrix m like this
ind<-complete.cases(m)
ind is now a logical vector with one value per row of m: TRUE for rows that fulfill the condition (not having an NA value) and FALSE otherwise. All we have to do now is to ask for a subset of m with the rows flagged in ind
mm<-m[ind,]
And we are done! A clean data set.
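The same two steps can be tried on a real data set. As an assumption for illustration, the sketch below uses the built-in airquality data frame in place of m; the name clean is ours.

```r
# One logical value per row: TRUE when the row has no NA
ind <- complete.cases(airquality)
is.logical(ind)
length(ind) == nrow(airquality)

# Keep only the complete rows
clean <- airquality[ind, ]

sum(ind)       # how many rows are complete
nrow(clean)    # the same number
anyNA(clean)   # FALSE: no missing values are left
```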
And finally. The subset function
A number of things in R can be done with the use of predefined functions. Subsetting is not an exception. There is a specific subsetting function called subset().
subset() combines the use of all types of subsetting (numerical, logical and structural) in data frames with named columns. It also makes use of logical operators to combine subsetting commands into a single call. An example may be seen with a default R dataset called "airquality". "airquality" is structured data in a data frame concerning information on ozone, solar radiation, wind and temperature for a number of dates organized by month and day. To have a better view of its contents simply type
head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6
Notice how there are holes in the data with a number of NA values. Suppose now that we want to obtain a slice of the data that contains all dates with a temperature higher than 60 degrees and a wind speed of 10 miles per hour or more. We would type
dates<-subset(airquality, Temp > 60 & Wind>=10)
Observe how the function works. We call subset() on the dataframe airquality asking that a combined condition is fulfilled, namely Temp>60 AND Wind>=10. The logical "AND" is coded by the ampersand "&" symbol. Alternatively, had we wanted to keep the dates with either Temp>60 OR Wind>=10 we would have asked for
dates<-subset(airquality, Temp > 60 | Wind>=10)
in which case the logical "OR" is coded with the bar symbol "|".
Think about the case where we would have wanted mutual exclusion of conditions (e.g. Temp>60 "AND NOT" Wind>=10). In this case we would have to think a bit more and code for an equivalent condition. That would be Temp>60 & Wind<10 (inverting the condition on wind).
Finally, subset can also incorporate structural subsetting in the form of retaining specific parts of the data. In the case e.g. that we would have wanted to keep only the dates (that is month and day) fulfilling the above condition (that is dropping the meteorological data) all we need to do is to use the select argument of subset(). The command would now be:
dates<-subset(airquality, Temp > 60 & Wind<10, select = c(Day, Month))
head(dates)
   Day Month
1    1     5
2    2     5
7    7     5
10  10     5
11  11     5
12  12     5
Notice how we have complete freedom to manipulate the data frame in terms of ordering of its vectors. In this example we choose to show Days before Months although their order was the other way round in the initial data frame.
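For comparison, the same selection can be written with the [] operator alone. The sketch below is one possible equivalent, not the only one; the names with_subset, keep and with_brackets are illustrative, and which() is used to guard against NA values in the condition.

```r
# The subset() call from the text
with_subset <- subset(airquality, Temp > 60 & Wind < 10,
                      select = c(Day, Month))

# An equivalent written with [ ]: rows chosen by a logical condition,
# columns chosen by name
keep <- airquality$Temp > 60 & airquality$Wind < 10
with_brackets <- airquality[which(keep), c("Day", "Month")]

# Both approaches should select the same rows and columns
all.equal(with_subset, with_brackets)
```

subset() saves us from repeating the data frame name inside the condition, which is most of its appeal for interactive work.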