R for fish and wildlife grads - Week 7. Creating and working with R functions

Week 7. Creating and working with R functions

Welcome to week 7. Looks like you’re in luck! It's time to learn how to create your own custom R functions. The script for today's lesson can be found here. Bonus inverse logit link function can be found here (please download before class).

At the end of this lesson you will know...

What's a function, you ask? Well, you've been using them all term: mean(), count(), write.csv(), sum(), seq(), and rep() are all examples of existing functions. One of R's greatest strengths is the user's ability to easily (and elegantly) write and add functions. Writing your own function lets you bundle several computations into a single, reusable piece of code.

As you have noticed, there are literally thousands of R functions for performing a variety of tasks and you can generally find one or more to meet your particular needs. However, sometimes you have a unique task to perform, or maybe you have a repetitive task (e.g., summarizing data sets and creating reports) and you don't want to rewrite your code every time. That's when you want to create your own function.

The basic format of functions:

function.name <- function(arguments) {

computation of the function i.e. calculation of the argument(s)

}

Let’s create our own function for calculating a mean and call it "my.mean". Remember that a mean is the sum of the elements in a data set divided by the number of elements in the data set. We use sum() to add up the elements and length() to count the number of elements, with the mean equal to sum()/length().

To create a function, we declare it using function and assign the function a name. Similar to for loops and if else comparisons, function actions are delimited by curly brackets "{}". The function can take one or more arguments (these are the inputs to the function). For instance, my.mean (below) takes one argument: variable. If there is more than one input to a function, THE ORDER THEY ARE LISTED MATTERS (more later).

EXAMPLE #1

# create a function for calculating the mean

# first name it and identify arguments

my.mean <- function(variable){

# users of the function provide "variable" and this is what is done with it

sum(variable)/length(variable)

}

# create a dummy data set

data.catch <- c(3,5,0,12,4,8,4,1,1,6,7)

# use the new function

my.mean(data.catch)

#compare to built-in R mean function

mean(data.catch)

Our function for calculating the mean is spot on. Now, let's create something that requires more than one input. Remember, ORDER MATTERS. Examine the code below and interpret it.

EXAMPLE #2

## create a function for dividing 2 numbers and squaring the result (despite the name, nothing is added)

my.add.square <- function(a,b){

(a/b)^2

}

# use the function

my.add.square(5,2)

# use the function but reverse the numbers

my.add.square(2,5)

# use the function reverse the numbers, but use the arguments to assign the values

my.add.square(b=2,a=5) ## order only matters when you don't name the arguments

For grins, look at the contents of your workspace using ls(). Can you find “a” and “b” objects? The answer should be no (unless you created them yourself earlier), because R functions create and use local variables. This means that anything created in a function is not saved to your workspace.
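One way to check this for yourself is sketched below. The function name my.div.square and the local object ratio are hypothetical, made up for this illustration; exists() simply asks whether an object with that name lives in the workspace (the global environment).

```r
# hypothetical function: "ratio" is created inside the function only
my.div.square <- function(a, b) {
  ratio <- a / b   # local object, discarded when the function finishes
  ratio^2
}

result <- my.div.square(5, 2)
result

# look for "ratio" in the workspace only (not inside the function)
ratio.in.workspace <- exists("ratio", envir = globalenv(), inherits = FALSE)
ratio.in.workspace   # FALSE, assuming you have not created your own "ratio"
```

Only the value handed back by the function (stored here in result) survives; the intermediate object does not.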

RETURN, LIST and PRINT STATEMENTS and FUNCTIONS

Often, we have a function which performs multiple tasks (see calc.function below), and we need a way of returning the results that allows multiple results to be reported. Using a return, list or print statement (with c()) inside the function allows us to see all the defined results. In the following example the function calc.function doesn’t print a result when it is called.

EXAMPLE #3

calc.function <- function(x, y) {

a <- x*5 + y

b <- (a/x)^2

d <- a + b*1.05

}

calc.function(3,4) ## nothing happens, why?

# How about we just print out d afterward?

# NOTE objects created within a function are temporary objects that do not

# exist outside of the function

## Now just calculate d, don't assign it to object d?

calc.function <- function(x, y) {

a <- x*5 + y

b <- (a/x)^2

a + b*1.05 ## note that is the last command executed

}

calc.function(3,4)
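Why did the first version print nothing? When the last command in a function is an assignment, R hands the value back invisibly, so it is still there if you capture it. The sketch below uses a hypothetical copy of the function, named calc.quiet so it does not overwrite anything above.

```r
calc.quiet <- function(x, y) {
  a <- x*5 + y
  b <- (a/x)^2
  d <- a + b*1.05   # an assignment is the last command, so nothing is printed
}

calc.quiet(3, 4)          # prints nothing
out <- calc.quiet(3, 4)   # but the value can still be captured
out
(calc.quiet(3, 4))        # wrapping the call in parentheses also prints it

# withVisible() reports whether a value would have been printed automatically
vis <- withVisible(calc.quiet(3, 4))$visible
vis
```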

If we add a return, list or print statement to calc.function, it will report all values listed in the statement. The function calc.function.two has a return statement and therefore reports a, b and c.

EXAMPLE #4

calc.function.two <- function(x, y) {

a <- x*5 + y

b <- (a/x)^2

c <- a + b*1.05

return(c(a,b,c))

}

calc.function.two(3,4)

calc.function.two(4,3)

calc.function.two(y=4,x=3)

Try using list and print instead of return! Any differences? Using list reports the variables in a list format (handy for further analysis). Using print allows the variables to be reported at any stage throughout the function's computation (compared with return, which only reports at the point where the function exits).

BEWARE: the return statement exits your function! If you use return, put it at the end of your function. Once a return has been executed, any code placed after it is never run, so results computed there are lost.
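Here is a small sketch of what exiting early looks like. The function safe.sqrt is hypothetical and exists only to show that the line after return() is skipped whenever the return() is triggered.

```r
# hypothetical function used only to illustrate an early return()
safe.sqrt <- function(x) {
  if (x < 0) {
    return("negative input: square root not calculated")
  }
  # this line is reached only when the return() above was not triggered
  sqrt(x)
}

res.pos <- safe.sqrt(9)
res.neg <- safe.sqrt(-9)
res.pos
res.neg
```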

Furthermore, we commonly use and write functions which perform multiple computations. All variables computed within a function only exist there! A way to keep these values once the function has been run is to have the function return them in a named list (see example #5) and assign that result to an object. Naming the elements inside list() lets us easily extract each one later with the $ operator. Note that although the object in example #5 is called df, str() shows that it is a list rather than a data frame.

EXAMPLE #5

calc.function.three <- function(x, y) {

a1 <- x*5 + y

b1 <- (a1/x)^2

c1 <- a1 + b1*1.05

list(temp=a1, temp.square=b1,add.temp=c1)

}

df <- calc.function.three(2, 5)

str(df)

df$temp.square

df$add.temp
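If you would rather have a true data frame than a list, one option (a sketch, not part of the original script) is to convert the named list with as.data.frame(). The object names res.list and res.df are made up for this illustration.

```r
calc.function.three <- function(x, y) {
  a1 <- x*5 + y
  b1 <- (a1/x)^2
  c1 <- a1 + b1*1.05
  list(temp=a1, temp.square=b1, add.temp=c1)
}

res.list <- calc.function.three(2, 5)
is.data.frame(res.list)   # the function returns a list, not a data frame

# each named element of the list becomes a column
res.df <- as.data.frame(res.list)
res.df
is.data.frame(res.df)
```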

Example 6: Global objects vs. temporary objects

Objects created outside of functions are usually global objects. That is, they can be used inside and outside of a function. This can be dangerous if you don't know what you are doing. To illustrate, let's first create a global object BAD outside of the function and then use it within a function.

BAD<-100

# now create a function

add.one.mtply<-function(x){

y<-x+1

y*BAD

}

## now invoke the function

add.one.mtply(5)

# change value of BAD

BAD<-0

## now invoke the function

add.one.mtply(5)

I know you are now wondering, what happens if we try to change the value of a global object in a function? Let's see.

add.one.mtply<-function(x){

y<-x+1

BAD <- y*BAD

}

BAD<-50

add.one.mtply(5)

### Nothing happens!!

BAD

SUGGESTION: use global objects very sparingly within a function (and never when you are a beginner).
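A safer pattern is to hand the value to the function as an argument, so the result depends only on what you pass in. The sketch below is one way to do that; the function name add.one.mtply.arg and the argument name multiplier are hypothetical.

```r
# same calculation as above, but the multiplier is an argument, not a global
add.one.mtply.arg <- function(x, multiplier) {
  y <- x + 1
  y * multiplier
}

res.100 <- add.one.mtply.arg(5, multiplier = 100)
res.0 <- add.one.mtply.arg(5, multiplier = 0)
res.100
res.0
```

Now changing some object in the workspace cannot silently change what the function returns.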

For loops within FUNCTIONS

You have learned how to use 'for loops' and 'if else' statements. You can put these, and even other functions, inside your own functions to reduce the amount of repetitive coding you need to do. FUNCTIONS within FUNCTIONS: tricky, but it makes your code more efficient!

EXAMPLE #7

calc.function.four <- function(x) {

for(i in 1:x) {

y <- i*2

print(y)

}

return(y*2)

}

calc.function.four(12)
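Printing inside the loop shows the values but does not keep them. If you want to use them later, a common approach (sketched here with a hypothetical function name, calc.function.store) is to fill a vector inside the loop and return it.

```r
calc.function.store <- function(x) {
  out <- numeric(x)          # empty numeric vector with x slots
  for (i in seq_len(x)) {
    out[i] <- i*2            # store each result instead of printing it
  }
  out                        # last command executed, so this is returned
}

doubled <- calc.function.store(12)
doubled
```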

Combining and pasting within a FUNCTION

Frequently in fisheries and wildlife research we use scientific names in our data. A common problem is combining this information into a single 'workable' column. For example, often we have genus and species names listed in separate columns of a data frame and for ease of reporting we want to combine these. We can't just use Genus + Species (try it below using the code!). We need to use the paste function and the syntax for creating a new binary operator in R. Using the user-defined binary operator %[name]% we can combine the two columns.

EXAMPLE #8

Genus <- 'Greenis'

Species <- 'backyardis'

Genus + Species ## fails with an error: "+" only works on numbers

'%p%' <- function (x,y){paste(x,y, sep=' ')}

Genus %p% Species
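Because paste() works element by element, the same operator can combine whole columns of a data frame. The data frame fish and its column names below are made up for this sketch.

```r
`%p%` <- function(x, y) { paste(x, y, sep = ' ') }

# hypothetical data frame with genus and species in separate columns
fish <- data.frame(Genus = c("Salmo", "Oncorhynchus"),
                   Species = c("trutta", "mykiss"))

# combine the two columns into one new column
fish$Sci.name <- fish$Genus %p% fish$Species
fish
```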

Fun with functions

The above demonstrates the usefulness of functions and the idea that the order of the inputs matters. OK, all of that is fine and good, but why would you want to bother learning how to create a function? Let's make something that you might actually use, like finding and replacing NA with mean values.

# first create a dataframe with fake data, let's call it fau_dat

fau_dat<-data.frame(Abcs=letters[1:22],Num1=c(1:22),Num2= seq(-5,15, length = 22),

Num3=seq(10, -22,length=22))

# now assign missing values to a few places

fau_dat[3:4,2]<-fau_dat[10,3]<-fau_dat[20,4]<- NA

# take a look

fau_dat

# create a copy for later use

fau_datII<-fau_dat

Function creation usually starts with writing the code that does the work. Then we make the code generic so we can use it with other objects, and the last thing we do is turn the code into a function. So let's start by creating code that replaces missing values; recall the ifelse lesson.

fau_dat$Num1<-ifelse(is.na(fau_dat$Num1),mean(fau_dat$Num1, na.rm = T),fau_dat$Num1)

The code above is a bit clunky because it handles only one column. If we want to apply it to another object, we will have to make it generic enough to apply to any dataframe. We know how to use for loops and bracket notation to accomplish this task, but first we need to find out how many columns are in the dataframe.

z<-ncol(fau_dat)

for(zz in 1:z){

fau_dat[,zz]<-ifelse(is.na(fau_dat[,zz]),mean(fau_dat[,zz], na.rm = T),fau_dat[,zz])

}

## lets take a look

fau_dat

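Whoa, what just happened? If you are running a version of R older than 4.0, where data.frame() converts character columns to factors by default, all the letters turned to numbers (in R 4.0 and later the letters are left alone). Either way, we should restrict the filling-in to just numeric columns. How do we identify those?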

is.numeric(fau_dat[,2])

Note that the function returns 1 value, either TRUE or FALSE. Let's combine it with an if statement to prevent filling in of non-numeric columns, but first, let's restore the original dataframe.

fau_dat<-fau_datII

## now give it a try

z<-ncol(fau_dat)

for(zz in 1:z){

if(is.numeric(fau_dat[,zz])) fau_dat[,zz]<-ifelse(is.na(fau_dat[,zz]),

mean(fau_dat[,zz], na.rm = T),fau_dat[,zz])

}

# take a look, it worked!

fau_dat
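If you just want to see which columns are numeric without writing a loop, one option is to apply is.numeric() to every column with sapply(). This is only a sketch on a small made-up data frame (toy); the lesson's loop does the same check one column at a time.

```r
# hypothetical small data frame with one text column and two numeric columns
toy <- data.frame(Abcs = letters[1:4],
                  Num1 = c(1, NA, 3, 4),
                  Num2 = c(10, 20, NA, 40))

# one TRUE/FALSE per column
num.cols <- sapply(toy, is.numeric)
num.cols

# names of the numeric columns only
names(toy)[num.cols]
```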

Now we need to fix the code so it is generic. Here we replace the actual dataframe name with a temporary object name that is used in the function; let's call it df.

df<-fau_datII

z<-ncol(df)

for(zz in 1:z){

if(is.numeric(df[,zz])) df[,zz]<-ifelse(is.na(df[,zz]),mean(df[,zz], na.rm = T),df[,zz])

}

## did it work? Yes!

# now we create the function with a single argument df

# let's call it "fills.nas"

fills.nas<-function(df){

z<-ncol(df)

for(zz in 1:z){

if(is.numeric(df[,zz])) df[,zz]<-ifelse(is.na(df[,zz]),mean(df[,zz], na.rm = T),df[,zz])

}

return(df)

}

# now lets try it out, it works!

fills.nas(fau_datII)

# don't forget to assign the results to an object

fau_datII.fix<-fills.nas(fau_datII)

Pretty neat. Now let's create another object and try the function on it. Let's try a vector.

vect<-seq(5,55,by = 4)

vect[c(3,6,8)]<-NA

## try it out, whoops what happened?

fills.nas(vect)

Maybe we should put in an error catcher. What kind of objects do we want to be able to use? Let's say just dataframes for this example. How do we test to see if an object is a dataframe? Hmmmm......

is.data.frame(vect)

Let's add this to the beginning of the function. Here's a useful function in R, "stop", that stops the execution of a function and prints out the message of your choice. Remember that is.data.frame returns FALSE when the object is not a dataframe, so we need to turn that into a TRUE comparison to execute the stop.

fills.nas<-function(df){

if(is.data.frame(df) == FALSE) stop("Object is not a dataframe")

z<-ncol(df)

for(zz in 1:z){

if(is.numeric(df[,zz])) df[,zz]<-ifelse(is.na(df[,zz]),mean(df[,zz], na.rm = T),df[,zz])

}

return(df)

}

# now let's try it on the vector: the error catcher stops the function

fills.nas(vect)

# and it still works on a dataframe

fills.nas(fau_datII)
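When stop() is triggered inside a longer script, the whole script halts at that point. If you would rather capture the message and carry on, base R's tryCatch() can be wrapped around the call. The sketch below repeats the function so it runs on its own; the object name msg is made up.

```r
fills.nas <- function(df){
  if(is.data.frame(df) == FALSE) stop("Object is not a dataframe")
  z <- ncol(df)
  for(zz in 1:z){
    if(is.numeric(df[,zz])) df[,zz] <- ifelse(is.na(df[,zz]), mean(df[,zz], na.rm = T), df[,zz])
  }
  return(df)
}

vect <- seq(5, 55, by = 4)
vect[c(3,6,8)] <- NA

# if fills.nas() stops, keep the error message instead of halting
msg <- tryCatch(fills.nas(vect), error = function(e) conditionMessage(e))
msg
```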

Now let's modify the function so that it also writes the corrected dataframe to a csv file. This means that the function will require 2 arguments.

fills.nas<-function(df,fle.name){

if(is.data.frame(df) == FALSE) stop("Object is not a dataframe")

z<-ncol(df)

for(zz in 1:z){

if(is.numeric(df[,zz])) df[,zz]<-ifelse(is.na(df[,zz]),mean(df[,zz], na.rm = T),df[,zz])

}

write.csv(df,fle.name)

return(df)

}

# now lets try it out, it works!

fills.nas(fau_datII,"Fixed fau data.csv")
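To convince yourself the file was written correctly, you can read it back in. The sketch below is a variation, not the lesson's function: it is named fills.nas.csv (hypothetical), writes to a temporary file so nothing is left in your working directory, and adds row.names = FALSE, an optional tweak that keeps write.csv() from writing the row names as an extra first column.

```r
fills.nas.csv <- function(df, fle.name){
  if(!is.data.frame(df)) stop("Object is not a dataframe")
  for(zz in seq_len(ncol(df))){
    if(is.numeric(df[,zz])) df[,zz] <- ifelse(is.na(df[,zz]), mean(df[,zz], na.rm = TRUE), df[,zz])
  }
  write.csv(df, fle.name, row.names = FALSE)   # assumption: row names not wanted
  return(df)
}

# hypothetical toy data with one missing count
toy <- data.frame(id = c("a", "b", "c"), count = c(2, NA, 4))

tmp.file <- tempfile(fileext = ".csv")   # temporary file path
fixed <- fills.nas.csv(toy, tmp.file)

# read the file back and check that the NA was filled with the mean
check <- read.csv(tmp.file)
check
```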

If time allows, let's make another function.

Finally... loading saved functions

The neat thing about functions is that they can be saved for later use. To load a function, use the source command. Be sure the script name and path are correct. Here, we use the bonus inverse logit link function we downloaded above.

# Load the function

source("Inverse logit link function.R")

## what's in the function?

inv.logit

## now we use it

inv.logit(0.75)

I wonder if it can use a vector of values?
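One way to explore that question is sketched below. Because the contents of the downloaded file may differ, the sketch defines its own stand-in, inv.logit.sketch (a hypothetical name), using the usual inverse logit formula 1/(1+exp(-eta)). Arithmetic and exp() in R work element by element, so a function written this way accepts a vector.

```r
# stand-in inverse logit function (assumed formula, not the downloaded file)
inv.logit.sketch <- function(eta) { 1/(1 + exp(-eta)) }

etas <- c(-2, 0, 0.75, 2)
probs <- inv.logit.sketch(etas)
probs   # one probability per input value
```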

Helpful Hints - Links to CRAN R package Reference Sheets

R Reference Card - published 2004-11-07

R Reference Card 2.0 - published 2012-12-24

This Week's Assignment

Download the following 2 comma separated (csv) text files for this week's assignments:

coral_count.csv: contains coral counts from 50 quadrats

weather.csv: weather measurements made concurrently with coral counts

Due 1 week from tomorrow (Tuesday) by 5pm Pacific. Use the 2 data frames (above): coral_count and weather to complete the following:

1) Combine the two data sets into a single dataframe. (Hint... think merge. Also note that one dataframe contains “Date” and the other contains "year", so you need to extract year from the date before merging.

Hint 2: thoroughly examine BOTH dataframes; there is another variable that you should be using to merge.)

NOTE: 2-5 below should be done within a SINGLE function. The function should be sufficiently generic that you can use it on any dataframe.

2) Create a function that calculates the mean, standard deviation, minimum and maximum of all the numeric columns in the merged dataframe, except year and quadrat number.

3) The summary values should be in a single data frame with the following columns: variable name, mean, sd, minimum, and maximum.

4) The function also should write the summary dataframe to a csv file. (Hint... the function should take 2 arguments: the name of the dataframe to summarize and the name of a comma separated value (csv) file to write the summary.)

5) The function also should output the summary dataframe to R.
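As a nudge for step 1, the sketch below shows the general idea of pulling a year out of a date and merging on more than one column. Everything in it is made up: the toy data frames counts and wx, their column names, and the assumption that the dates are stored as text in year-month-day format. Check how the dates in the real files are actually stored before borrowing any of it.

```r
# hypothetical toy data, NOT the assignment files
counts <- data.frame(Date = c("2019-06-01", "2020-06-03"),
                     site = c("A", "A"),
                     n = c(4, 7))
wx <- data.frame(year = c(2019, 2020),
                 site = c("A", "A"),
                 temp = c(26.1, 27.4))

# convert the text to a Date, then keep only the 4-digit year as a number
counts$year <- as.numeric(format(as.Date(counts$Date), "%Y"))

# merge on both shared columns
both <- merge(counts, wx, by = c("year", "site"))
both
```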

Bonus material: reading functions from files

The code below creates a folder "Funcs4fun" in your working directory. It then creates 2 cool (maybe) functions and writes them to the folder. This step is needed to create a folder and function files so I can demonstrate below how to automatically read folders (directories) and load functions from the folders. Note that all of the code below can be found in this script.

####################################################################

### assign the working directory path to an object

orig.wd<-getwd()

## now create a folder in the working directory

dir.create("Funcs4fun")

# set the working directory to the newly created function folder

setwd(paste(orig.wd, "/Funcs4fun", sep =""))

##### create logit link function with error checker and write to an R script file

sink(file="Logit link function.R")

cat("logit.lnk<-function(p){ if ( p>0 & p<1) log(p/(1-p)) else stop('Not a probability')}",fill = TRUE)

sink()

##### Create inverse logit link function too

sink(file="Inverse logit link function.R")

cat("inv.logit.lnk<-function(eta){ 1/(1+exp(-eta))}",fill = TRUE)

sink()

## reset the original working directory

setwd(orig.wd)

###################################### End of creating folder and functions

This is the code that you would use to load R functions from an external file. I assume that you know how to set your working directory to the location of the folder that contains the function folder. I usually have population or decision models in a single folder. Inside that folder, I have a folder containing functions and another containing inputs (e.g., weather, initial population size, etc.).

## first, for grins lets try to run the functions created above

logit.lnk(.5) ## you should get errors because they are not loaded

inv.logit.lnk(0)

## This just uses the current working directory and adds the name/location of the

## file containing the functions

func.place<-paste(getwd(), "/Funcs4fun", sep ="")

## Read in functions from common location. "dir" will provide a list of file names

## ending in ".R"

functs<- dir(path = func.place, full.names = FALSE, pattern = "\\.R$", recursive = FALSE)

### lets see what that did for us

functs

## now we read them in using a for loop and the source command

for(ia in 1:length(functs)) source(paste(func.place,functs[ia], sep = "/"))

## Now lets try to run the functions

logit.lnk(.5) ## They work!

inv.logit.lnk(0)

# nested functions, why? Because we can!

inv.logit.lnk(logit.lnk(.2))
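One thing to watch: the error checker in logit.lnk tests p>0 & p<1 inside if(), which expects a single TRUE or FALSE. Handing it a vector of probabilities produces a warning in older versions of R and an error from R 4.2.0 onward. A possible fix, sketched below under the hypothetical name logit.lnk.vec, is to wrap the comparison in all() so the whole vector is checked at once.

```r
# vector-friendly version of the logit link (hypothetical name)
logit.lnk.vec <- function(p){
  if (all(p > 0 & p < 1)) log(p/(1-p)) else stop('Not a probability')
}

inv.logit.lnk <- function(eta){ 1/(1+exp(-eta)) }

p <- c(0.2, 0.5, 0.8)

# nested functions on a vector: we should get the original values back
round.trip <- inv.logit.lnk(logit.lnk.vec(p))
round.trip

# the error checker still catches values that are not probabilities
bad.msg <- tryCatch(logit.lnk.vec(c(0.2, 1.2)), error = function(e) conditionMessage(e))
bad.msg
```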

Enjoy!