R Tutorial 5: Object-oriented programming

The topic of this post came up in a tangential rant in my previous post, and I thought I might as well expand on it a bit. I'm not going to talk about programming language models or anything like that since I'm not a programmer - rather, I will treat this more like a tutorial or a "Pro-tip" kind of post.

I will be focusing on an aspect of R that is often taken for granted and maybe not well known by entry-level users. That is, R is an object-oriented programming language. If you already know this, then this blog post is not for you.

First, I'll list out a few interesting/useful features of R:

  1. R is interpreted
  2. R is based on vectors
  3. R can utilise functions (i.e. functional programming)
  4. R utilises objects (object-oriented programming)
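
For what it's worth, here is a quick sketch of points 2 to 4 that you can paste into the console. The names v and square are just ones I made up for illustration; everything else is base R.

```r
# Point 2: R is based on vectors, so operations apply to every element
v <- c(1, 2, 3)
v * 2              # 2 4 6
length(1)          # 1 (even a single number is a vector of length 1)

# Point 3: functions can be stored under a name and passed to other functions
square <- function(n) n^2
sapply(v, square)  # 1 4 9

# Point 4: vectors and functions alike are objects with a class
class(v)           # "numeric"
class(square)      # "function"
```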

Like I've already mentioned, this post will focus on point 4, that R is an object-oriented programming language (or simply that R can be object-oriented if you don't want to call R a programming language...).

We can start with a very simple situation.

Let's assign a name to a value in R:

> x <- 1

There you go. That's an object. We can call that object x, add 1 to it, and assign the result to y.

> y <- x + 1

So that's object-oriented programming right there. Instead of displaying the result of the operation

> 1 + 1

which will return

[1] 2

we've called the object x and assigned the outcome of the operation to an object y. This kind of programming is core to R.
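
If you want to poke at this yourself, here is a minimal sketch of the same two assignments, plus a few base R functions (class(), exists() and ls()) for inspecting what you've created. Nothing here goes beyond the example above.

```r
x <- 1          # assign the value 1 to the name x
y <- x + 1      # use x in an operation and store the result as y

y               # 2
class(x)        # "numeric"
exists("y")     # TRUE: y is now stored in your workspace
ls()            # lists the names of all objects in your workspace
```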

Another example:

Let's say you want to calculate the mean value of a sequence from 1 to 10. You can do it like this:

> mean(1:10)

[1] 5.5

You can also do it like this:

> x <- 1:10

> x

[1] 1 2 3 4 5 6 7 8 9 10

> y <- mean(x)

> y

[1] 5.5

You might think that's an extra line of code compared to the first variation. But if you make a habit of coding in the second, object-oriented style, you will probably find that as your code gets more complicated, everything becomes easier to keep track of and, maybe more importantly, easier to debug.

For starters, you've stored the outcome of the operation as object y so you can use it later on if you need it again, and you won't have to type that operation again (which saves you from unnecessary typos).
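
As a small sketch of what "using it later" might look like: once x and y are stored, further steps can refer to them by name instead of retyping mean(1:10). The names dev and above are hypothetical, picked just for this example.

```r
x <- 1:10
y <- mean(x)        # 5.5, computed once and stored

dev <- x - y        # deviation of each value from the mean
sum(dev)            # 0

above <- x[x > y]   # values greater than the mean
above               # 6 7 8 9 10
```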

Let's go for a slightly more advanced example.

Suppose we have a data.frame object df which contains some morphometric data for Darwin's finches. Let's say that you want to subset the data to those rows (species) that have wingL greater than or equal to 4:

> df[df$wingL >= 4, ]

                      Taxon Name_in_Tree    wingL  tarsusL  culmenL    beakD   gonysW
1     Geospiza_magnirostris magnirostris 4.404200 3.038950 2.724667 2.823767 2.675983
2      Geospiza_conirostris  conirostris 4.349867 2.984200 2.654400 2.513800 2.360167
3       Geospiza_difficilis   difficilis 4.224067 2.898917 2.277183 2.011100 1.929983
4         Geospiza_scandens     scandens 4.261222 2.929033 2.621789 2.144700 2.036944
5           Geospiza_fortis       fortis 4.244008 2.894717 2.407025 2.362658 2.221867
6       Geospiza_fuliginosa   fuliginosa 4.132957 2.806514 2.094971 1.941157 1.845379
7      Camarhynchus_pallida      pallida 4.265425 3.089450 2.430250 2.016350 1.949125
9     Camarhynchus_parvulus     parvulus 4.131600 2.973060 1.974420 1.873540 1.813340
10      Camarhynchus_pauper       pauper 4.232500 3.035900 2.187000 2.073400 1.962100
11    Pinaroloxias_inornata Pinaroloxias 4.188600 2.980200 2.311100 1.547500 1.630100
12 Platyspiza_crassirostris   Platyspiza 4.419686 3.270543 2.331471 2.347471 2.282443
13  Camarhynchus_psittacula   psittacula 4.235020 3.049120 2.259640 2.230040 2.073940

And then if you want to extract just the species names and calculate the mean tarsusL, you can do this:

> df[df$wingL >= 4, ]$Taxon

[1] "Geospiza_magnirostris" "Geospiza_conirostris" "Geospiza_difficilis" "Geospiza_scandens" "Geospiza_fortis" "Geospiza_fuliginosa"

[7] "Camarhynchus_pallida" "Camarhynchus_parvulus" "Camarhynchus_pauper" "Pinaroloxias_inornata" "Platyspiza_crassirostris" "Camarhynchus_psittacula"

> mean(df[df$wingL >= 4, ]$tarsusL)

[1] 2.995884

Up to here, it might not be too bad to subset by the condition at every operation, but this can get annoying if you need to do something a bit more involved. For instance, subset the data as above, but then further subset it to only return rows with tarsusL above the mean tarsusL of the subsetted data:

> df[df$wingL >= 4, ][df[df$wingL >= 4, ]$tarsusL > mean(df[df$wingL >= 4, ]$tarsusL), ]

                      Taxon Name_in_Tree    wingL  tarsusL  culmenL    beakD   gonysW
1     Geospiza_magnirostris magnirostris 4.404200 3.038950 2.724667 2.823767 2.675983
7      Camarhynchus_pallida      pallida 4.265425 3.089450 2.430250 2.016350 1.949125
10      Camarhynchus_pauper       pauper 4.232500 3.035900 2.187000 2.073400 1.962100
12 Platyspiza_crassirostris   Platyspiza 4.419686 3.270543 2.331471 2.347471 2.282443
13  Camarhynchus_psittacula   psittacula 4.235020 3.049120 2.259640 2.230040 2.073940

That can get confusing and is really error-prone - I actually got it wrong the first couple of tries.

What I would do instead is:

# set a condition where wingL >= 4

> cond1 <- df$wingL >= 4

# see what that looks like

> cond1

[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE

# it's a logical (TRUE/FALSE) vector

# Now subset according to the logical condition cond1 and call that object df1

> df1 <- df[cond1, ]

# see what the subsetted data df1 looks like

> df1

                      Taxon Name_in_Tree    wingL  tarsusL  culmenL    beakD   gonysW
1     Geospiza_magnirostris magnirostris 4.404200 3.038950 2.724667 2.823767 2.675983
2      Geospiza_conirostris  conirostris 4.349867 2.984200 2.654400 2.513800 2.360167
3       Geospiza_difficilis   difficilis 4.224067 2.898917 2.277183 2.011100 1.929983
4         Geospiza_scandens     scandens 4.261222 2.929033 2.621789 2.144700 2.036944
5           Geospiza_fortis       fortis 4.244008 2.894717 2.407025 2.362658 2.221867
6       Geospiza_fuliginosa   fuliginosa 4.132957 2.806514 2.094971 1.941157 1.845379
7      Camarhynchus_pallida      pallida 4.265425 3.089450 2.430250 2.016350 1.949125
9     Camarhynchus_parvulus     parvulus 4.131600 2.973060 1.974420 1.873540 1.813340
10      Camarhynchus_pauper       pauper 4.232500 3.035900 2.187000 2.073400 1.962100
11    Pinaroloxias_inornata Pinaroloxias 4.188600 2.980200 2.311100 1.547500 1.630100
12 Platyspiza_crassirostris   Platyspiza 4.419686 3.270543 2.331471 2.347471 2.282443
13  Camarhynchus_psittacula   psittacula 4.235020 3.049120 2.259640 2.230040 2.073940

# looks identical to df[df$wingL >= 4, ] above

# assign an object that is the mean of tarsusL from the subsetted data df1

> mean.tl <- mean(df1$tarsusL)

# set a second condition: tarsusL in df1 that is greater than mean.tl

> cond2 <- df1$tarsusL > mean.tl

# view the condition

> cond2

[1] TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE TRUE TRUE

# subset df1 according to condition cond2 and call that object df2

> df2 <- df1[cond2, ]

# see what df2 looks like

> df2

                      Taxon Name_in_Tree    wingL  tarsusL  culmenL    beakD   gonysW
1     Geospiza_magnirostris magnirostris 4.404200 3.038950 2.724667 2.823767 2.675983
7      Camarhynchus_pallida      pallida 4.265425 3.089450 2.430250 2.016350 1.949125
10      Camarhynchus_pauper       pauper 4.232500 3.035900 2.187000 2.073400 1.962100
12 Platyspiza_crassirostris   Platyspiza 4.419686 3.270543 2.331471 2.347471 2.282443
13  Camarhynchus_psittacula   psittacula 4.235020 3.049120 2.259640 2.230040 2.073940

There.

Really clean, readable and easy to see what stage of your operations you're at.

The second series of code above is what we'd call object-oriented programming. By storing the logical conditions as separate R objects, the subsetting step becomes really clean, and it's easier to check that you've got the right conditions set up. Most important of all, it reduces typing errors, especially if you're subsetting within a data.frame and you get confused about indexing columns using $, e.g. df[df$wingL >= 4, ].
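
If you don't have the finch data to hand, here is a self-contained sketch of the same workflow that you can run as is. It is only an illustration under a few assumptions of mine: the data frame finches holds just six of the rows and three of the columns printed above, and the wingL threshold is 4.2 rather than 4 so that the first condition actually drops something in this small subset. The sum() and which() calls are one possible way of checking a condition before you use it.

```r
# Toy data: six rows and three columns copied from the output above
# (an illustrative subset, not the full data set)
finches <- data.frame(
  Taxon = c("Geospiza_magnirostris", "Geospiza_difficilis",
            "Geospiza_fuliginosa", "Camarhynchus_pallida",
            "Camarhynchus_parvulus", "Platyspiza_crassirostris"),
  wingL = c(4.404200, 4.224067, 4.132957, 4.265425, 4.131600, 4.419686),
  tarsusL = c(3.038950, 2.898917, 2.806514, 3.089450, 2.973060, 3.270543),
  stringsAsFactors = FALSE
)

# First condition, stored as its own object (4.2 is chosen for this toy data)
cond1 <- finches$wingL >= 4.2
sum(cond1)       # number of rows that pass: 4
which(!cond1)    # positions of rows that fail: 3 5

# Subset once, then build the second condition from the stored subset
df1 <- finches[cond1, ]
mean.tl <- mean(df1$tarsusL)
cond2 <- df1$tarsusL > mean.tl
df2 <- df1[cond2, ]
df2$Taxon        # "Camarhynchus_pallida" "Platyspiza_crassirostris"

# The nested one-liner gives the same rows, but is much harder to read
one.liner <- finches[finches$wingL >= 4.2, ][
  finches[finches$wingL >= 4.2, ]$tarsusL >
    mean(finches[finches$wingL >= 4.2, ]$tarsusL), ]
identical(df2$Taxon, one.liner$Taxon)   # TRUE
```

Because cond1 and cond2 exist as objects, you can look at each one on its own before trusting the subset that comes out of it.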

If you can switch to object-oriented programming in R, that will make your life a heck of a lot easier. R is really geared towards this kind of coding so you might as well use it!

Remember, efficient coding stems from inherent laziness - i.e. you don't want to repeat menial tasks too much.