Ecological Niche Modelling

Finally, I decided to create this page. I am very much thankful to Dr. Abhishek Mukherjee, Indian Statistical Institute, Kolkata for introducing me to this domain. I also thank Dr. Sabyasachi Bhattacharya and Prof. Santanu Ray for their ever increasing support to me. I am very new to this, but, learned something useful in R. This page is mainly developed to build species distribution models (SDM) using occurrence data. My student, Shobhana Gopal Iyer, and I have developed R codes for variable selection using the glmnet package in R in SDM. So, the module will be continuously updated to cover several statistical techniques related to variable selection. I thought to share all the steps we have performed. Usually, I expect criticisms because that is the only way how we can improve further. Much of the following materials was covered in the Species Distribution Modeling with MaxEnt organized by the Agricultural and Ecological Research Unit, Indian Statistical Institute, Kolkata during 09-14th of January 2017. I am sincerely thankful to Dr. Achyut Kumar Banerjee for sharing valuable information that enriched me from many directions. The occurrence data on the invasive species Mikania micrantha Kunth has been used in this study. The document has been further modified during chapter II of the workshop during 03-09 December 2018: Workshop on Species Distribution Modelling with MaxEnt and R.

  • Section: Introduction

Background of the species: Mikania micrantha Kunth, commonly known as the 'mile-a-minute' weed, is an herbaceous terrestrial creeper with a pantropical distribution. It is now an established noxious weed in natural forests, plantations including tea, and agricultural systems in several Indian states. This plant which is considered to be one of the top 10 worst weeds in the world, is also a cause of serious concern in India since it is extremely difficult to eradicate. Mikania micrantha has already demonstrated an ability to establish and spread elsewhere in the world.

The species distribution model (SDM) is used to predict the distribution of species in the geographic space based on the mathematical representation of their known distribution in the environmental space. Within a multidimensional space defined by a set of climatic variables, the boundaries of the species' records determine its climatic niche or 'climatic envelope'. SDMs operate by generating a statistical relationship between these known occurrences of a species and environmental covariates. The statistical relationship of the climatic envelope thus developed can be spatially and temporarily extrapolated to generate the potential distribution of the species under novel climatic conditions.

Statistical and machine learning techniques are being increasingly used to identify the environmental factors and the resultant environmental covariates are used to build a model distribution on the climatic conditions in which species would survive, which are further used to identify the locations of species occurrence. The suitability of climatic conditions between the native and invaded region is known to be a crucial factor for the establishment, growth, and reproduction of an invasive species. Identification of the climatically suitable region in the invasive habitat, therefore, can help to identify habitats prone to invasion and guide management decisions. In similar approaches, SDMs are frequently used to identify areas in a novel habitat that are climatically suitable for an invasive species.

The following packages are required to be useful:

install.packages(c('raster', 'rgdal', 'dismo', 'rJava', 'rgeos', 'sp', 'maptools', 'sdm', 'adehabitatHR'), dependencies = TRUE)

library(raster)

library(dismo)

library(rgdal)

library(sp)

library(maptools)

dismo - This package implements a few species distribution models, including an R link to the 'maxent' model, and native implementations of Bioclim and Domain. It also provides a number of functions that can assist in using Boosted Regression Trees. In addition, there are a number of functions, such sampling background points, k-fold sampling, and for model evaluation (AUC) that are useful for these and for other species distribution modeling methods available in R (e.g. GLM, GAM, and RandomForest).

raster - The raster package has functions for creating, reading, manipulating, and writing raster data. The package provides, among other things, general raster data manipulation functions that can easily be used to develop more specific functions.

rgdal - Bindings for the 'Geospatial' Data Abstraction Library

sp

maptools

rJava - Required for running MaxEnt from R.

sdm - fits multiple models and can be used to generate multiple runs (replicates) of each method through partitioning (using one or several partitioning methods including: subsampling, cross-validation, and bootstrapping.

rgeos

  • Section: Occurrence Data for M. micrantha (From Internet)

In this section, we discuss the processing of raw data that can be downloaded from GBIF from the R environment itself. The occurrence data was collected from the Global Biodiversity Information Facility (www.gbif.org) (GBIF). Using the following codes, the function gbif() will load all the data with the genus name as "Mikania". The code shows that 39772 records found (as on Jan 4th 2017), so don't try this. As per the record shown on 4th December 2018, the records were 61672. The function gbif() downloads species occurrence records from the GBIF data portal. In fact, there is a separate package rgbif available in R that adds additional flexibility from R to gbif. One can download either a single species (if you append a "*" to the species name) or a subspecies of comparable level. One can download the data for an entire genus by using species=*. GBIF provides free and open access to biodiversity data worldwide.

library(dismo) # Load the library dismo

mic_gbif_all = gbif(genus = "Mikania", geo=FALSE) # Download data for genus Mikania

mic_gbif_all = subset(mic_gbif_all, !is.na(lon) & !is.na(lat) & !is.na(species)) # Removing the missing rows for lon, lat

dim(mic_gbif_all); names(mic_gbif_all) # Dimension of the data, variable names

We consider only those observations for which lon and lat information is available. See the use of "is.na()" function. We need to collect only the species name, longitude and latitude information. By using the species name, we can get particular species Mikania micrantha.

mic_gbif_all = mic_gbif_all[,c("species","lon","lat")] # Data contains only name, lon and lat

length(unique(mic_gbif_all$species)) # Unique species names

unique(mic_gbif_all$species)[1:5] # Show first few unique names

There are 387 specific names under the genus "Mikania" and we require on "Mikania micrantha". We have printed the first five names only. One should always Cross-check the records as the same species might be saved with a different name. A strong literature survey and expert comments are suggested. It is also advised to search for alternative names of the species if there is such. From the data set mic_gbif_all, we can obtain the data for our species by subsetting the data with respect to the species =="Mikania micrantha kunth"

# Subset the all genus to get a species

mic_gbif = subset(mic_gbif_all, species == "Mikania micrantha")

unique(mic_gbif$species) # Check the unique species records

head(mic_gbif); dim(mic_gbif) # Print first few records and check dimension

We can directly download the dataset by using species name also.

mic_gbif = gbif("Mikania", "micrantha Kunth", geo=TRUE) # Pass species name along with genus name

mic_gbif = mic_gbif[,c("species","lon","lat")]

mic_gbif = subset(mic_gbif, !is.na(lon) & !is.na(lat) & !is.na(species))

unique(mic_gbif$species)

dim(mic_gbif)

write.csv(mic_gbif, "mic_gbif.csv", row.names = FALSE) # Store for future use

It is always a good idea to visualize the occurrence records. We can visualize this data on the world map. For this, we shall use maptools package. wrld_simpl is a SpatialPolygonDataFrame that is available with maptools package. Type help(wrld_simpl) in the console for more information. I want to emphasize the plot function. If you type help(plot) in the console, the help page will contain Generic X-Y Plotting with the plot(x, y, ...). It is like a scatterplot in excel. In R plot function can operate on different kinds of objects. If you want to work with R, it is better to know this function nicely.

library(maptools) # Load library

data("wrld_simpl") # Load data for world map

mic_gbif = read.csv("mic_gbif.csv", header = TRUE) # Read the csv file we just saved

xlim = c(min(mic_gbif$lon),max(mic_gbif$lon)) # Set the limit of graph along x-axis

ylim = c(min(mic_gbif$lat),max(mic_gbif$lat)) # Set the limit of graph along y-axis

plot(wrld_simpl, xlim=xlim, ylim=ylim, axes=TRUE, col="light yellow") # Plot the graph

box(lwd = 2) # Bounding box for the graph

points(mic_gbif$lon, mic_gbif$lat, col='orange', pch=20, cex=0.75) # Plot the occurrence data for M. micrantha

points(mic_gbif$lon, mic_gbif$lat, col='red', cex=0.75) # Play with various plot options

The studied species is native to South America. We can subset the data and collect the occurrence points from South America only. The extent is defined as (1) the westernmost longitude, (2) the easternmost longitude, (3) the southernmost latitude, and (4) the northern most latitude.

e = extent(-120, -30, -60, 30) # Extent for South America region

plot(e, add = TRUE, col='red', lwd = 2) # Add the rectangular extent on the world map

mic_gbif_america = subset(mic_gbif, lon <(-30)) # Subset data for South America

plot(wrld_simpl, xlim=c(-100,-60), ylim=c(-50,30), axes=TRUE, col="light yellow") # Plot for South America only

points(mic_gbif_america$lon, mic_gbif_america$lat, col = "red", lwd=2) # Add Occurrence points

write.csv(mic_gbif_america, "mic_gbif_america.csv", row.names = FALSE) # Store the data for future purpose

e_india = extent(65, 100, 5, 45) # Extent of India

mic_gbif_india = subset(mic_gbif, lon < 100) # Subset data for India

plot(wrld_simpl, xlim= c(65, 100), ylim=c(5,45), axes=TRUE, col="light yellow") # Plot Indian subcontinent

points(mic_gbif_india$lon, mic_gbif_india$lat, col = "red", lwd=2) # Add occurrence points

box() # Add bounding box

write.csv(mic_gbif_india, "mic_gbif_india.csv", row.names = FALSE) # Store the data for future purpose

From the above figure, it is easy to see that, GBIF does not contain data from India. We shall use the occurrence records collected by Dr. Achyut Banerjee. In the above code note that e_india is not the precise extent of India. This is just a visual inspection about the rectangle within which complete India map is lying. If we have the shapefiles of India map, then we can obtain the actual extent of India by using extent() function available in raster package.

Also take caution while using write.csv() function. Make sure to write row.names = FALSE. So, to recall, our excel file mic_gbif_america.csv contains species name, lon and lat of of all the records of Mikania micrantha Kunth from gbif.

  • Section: Processing of occurrence Data (Cleaning - Removing duplicates)

Removal of duplicated records is very important in occurrence data. Data from the same location may be updated by multiple researchers. In that case, the duplicate will occur for with respect to longitude and latitude. Ignoring the species, the duplicates can be found out using the following code:

dups = duplicated(mic_gbif_america[,c("lon","lat")]) # TRUE if duplicate finds else FALSE

sum(dups) # Total number of lon-lat duplicates

mic_gbif_america = mic_gbif_america[!dups,] # Remove the duplicated rows

dim(mic_gbif_america) # Check the dimension of data again

The total number of duplicates are given by sum(dups). The following piece of code, that represents a toy example, will be helpful to understand the role of the function duplicated(). After removing the duplicates, we store the data again in mic_gbif_america.

x = c(letters[1:4],letters[1:3])

y = c("a", "b", "a", "b", "a", "c", "d")

duplicated(x); duplicated(y)

d = data.frame(x, y); print(d) # Note the rows carefully

duplicated(d[,c("x","y")])

  • Section: Processing of occurrence Data (Cleaning - Points falling in ocean)

In the following piece of code, we remove the occurrence locations that fall into the ocean. We first read partially cleaned data (after removing the duplicates). mic_gbif_america is a data frame which is converted to a spatial object by assigning the spatial coordinates. We set the coordinate reference system of wrld_simpl with mic_gbif_america to have a common reference and will be used for comparison. Then we apply the function over() to retrieve the information of the occurrence points from the wrld_simpl. You should type help(over) for further information. Recall that, wrld_simpl is a SpatialPolygonsDataFrame and each location is retrieved from it contains information on the name of countries, their corresponding areas, the number of regions and sub-regions, their lon-lat information etc. The idea is that the coordinates which fall in the ocean have the country name associated with it. So if some occurrence points do not have a country name associated with it, we declare it as falling into the ocean. Using which() function, we identify the row numbers of ovr that has no country name. We remove those rows from the coordinates of mic_gbif_america. Then we plot occurrence points in green and identify the points on the ocean ass red plus sign.

mic_gbif_america = read.csv("mic_gbif_america.csv", header = TRUE) # Read the data we just saved

library(sp) # load library

coordinates(mic_gbif_america) = ~lon+lat # Convert dataframe to SpatialPoints object

crs(mic_gbif_america ) = crs(wrld_simpl) # Associate coordinate reference system

class(mic_gbif_america) # Check the class

ovr = over(mic_gbif_america, wrld_simpl) # help(over)

class(ovr); names(over) # a dataframe, check the column names

head(ovr)

country = ovr$NAME # Country names associated with locations

sum(is.na(country)) # Number of rows with countries = NA

is_ocean = which(is.na(country)); head(is_ocean); table(is_ocean) # Check the variable is_ocean

lonlat_ocean = mic_gbif_america@coords[is_ocean,] # lon-lat of points which fall into ocean

mic_gbif_america@coords = mic_gbif_america@coords[-is_ocean,] # Remove those coordinates

plot(mic_gbif_america, col = "green", cex= 0.5, pch=16) # Plot all points with green

plot(wrld_simpl, add=TRUE, lwd = 2) # Add the world map, note they have same crs

points(lonlat_ocean, col="red", cex = 2, pch=3, lwd= 2) # Ocean points marked as red

box() # Bounding box

It is always a good practice to check the objects every time. It will give more flexibility over time in programming with R. For example, in the above code, we have used the class() function to check what type of object ovr is.

  • Section: Processing of occurrence Data (Cleaning - geographic bias)

A critical issue while dealing with the presence-only data is the presence of sample selection bias. It is often observed that the investigator visits the areas for sample collection which are either geographically convenient or environmental-friendly. This gives rise to two different categories of sample selection bias, one occurs in the geographic space and the other in environmental space. For the current Mikania micrantha data, we shall deal with both types of bias.

Using the following piece of code, we shall remove the geographic bias from the sample. The idea is similar to the concept of fishnet in ArcGIS. We shall divide the whole region into small grids of a certain resolution and from each grid, we shall select one occurrence point only. The choice of grid size is subjective in nature and will vary for different problems. Note that, we have extended the raster to take care of points that fall within the boundary of the extent we are studying. By using the function rasterToPolygons(), we create the fishnet and named it as net. Using black dots we represent all the occurrence points and with the red cross sign, we denote only those points that are selected by using the gridSample() function. Type help(gridSample) function to get the details. This is a very important function. Now save the geographically unbiased data in a CSV file and give it the same name as mic_gbif_america.csv. We shall deal with environmental bias after we understand the loading of environmental data.

r = raster(mic_gbif_america) # Create a RasterLayer object

res(r) = 0.3 # Set the resolution

r = extend(r, extent(r)+0.2) # Extend the reolution to cover boundaries

occ_one = gridSample(mic_gbif_america, r, n=1) # Select only one occurrence from each grid

class(occ_one); dim(occ_one) # Check the class and dimension

net = rasterToPolygons(r) # Create the net

plot(net, border = 'gray') # Plot the net

points(mic_gbif_america) # Plot all the points

points(occ_one, cex = 0.5, col = 'red', pch = 'x') # Plot the selected points only

dim(occ_one)

write.csv(occ_one, "mic_gbif_america.csv", row.names = FALSE) # Store the data for future use

It important to keep two points in mind: (1) Subsampling reduces the number of records, and it cannot correct the data for areas that have not been sampled at all and (2) It also suffers from the problem that locally dense records might in fact be a true reflection of the relative suitable of habitat.

  • Section: Climate Data Preparation in R Environment (From Internet)

We have got all the records from GBIF, cleaned them and collected the observations which we shall use for modeling. The data is considered only for South America. Now we will collect the environmental variables and process them to input into the model as covariates. In the following section, we shall how one can access the freely available climate data. The raster package provides functions to download the Bioclim open-source climate dataset, 19 variables derived from the WORLDCLIM database of temperature and rainfall measurements taken worldwide from 1950 to 2000 http://www.molecularecologist.com/2013/04/species-distribution-models-in-r/ and http://worldclim.org/bioclim for more information).

library(raster) # Load library

bioclim_data = getData(name = "worldclim", download = FALSE, var = "bio", # Download climate data

res = 2.5, path = "C:/STUDY/SDM at ISI Kolkata/bioclim_data/")

help(getData) # Get the information on this function

class(bioclim_data) # RasterStack, composed of different layers

names(bioclim_data) # Names of the Raster Layers

class(bioclim_data$bio1) # RasterLayer

str(bioclim_data$bio1) # Structure of the RasterLayer object

Once the download is done in the first run, from next run set the option download = FALSE. One can also load the environmental data from the local folder to the current R environment. Valid resolutions are 0.5, 2.5, 5, and 10 (minutes of a degree). The function crop() returns a geographic subset of an object as specified by an Extent object (or object from which an extent object can be extracted/created). Let us get the environmental conditions for e_india and e_america Extents. This is again a reminder that these two extents are not exact extents according to the shapefiles of South America and India. The objective of this complete document is to get more flexibility in spatial data handling, not to give some ready-made R codes that will generate the niche models.

bioclim_data_india = crop(bioclim_data, e_india) # Environmental layers for e_india extent

class(bioclim_data_india) # RasterBrick attributed raster package

bioclim_data_america = crop(bioclim_data, e_america)

class(bioclim_data_america)

writeRaster(bioclim_data_america, filename ="bioclim_data_america.grd", # Save the raster files with .grd extension

overwrite = TRUE)

writeRaster(bioclim_data_india,filename ="bioclim_data_india.grd", # Save the raster files with .grd extension

overwrite = TRUE)

help(writeRaster) # Details about the function

We can read the environmental layers from local files as well. Let us read the files that we have just saved in the local directory.

bioclim_data_america = brick("bioclim_data_america.grd") # Environmental layers from local directory

bioclim_data_india = brick("bioclim_data_india.grd")

class(bioclim_data_america) # RasterBrick

names(bioclim_data_america) # Names of the RasterLayers

plot(bioclim_data_america, 1:9) # Plot first 9 RasterLayers

plot(bioclim_data_america, 10:18)

plot(bioclim_data_india, 1:9)

plot(bioclim_data_india, 10:18)

So, in the above piece of code, we have cropped the bioclim_data to get the environmental variables for e_america and e_india. However, another geographic region above the India map is also appearing. The reason is that an extent is defined by four corners of a rectangular region, however, any country boundaries are not rectangular in nature. There are several breaks and it is a polygon with several sides. In the following, we now crop the variables only for the geographic region for India and South America. To do this, we shall use the mask() function which is available in raster package. To mask the geographic region for India, we first get the administrative boundaries of India from the database of global administrative boundaries which is stored in 'GADM'. The 'GADM' database can be accessed through the getData() function, by which we had already downloaded the data from worldclim.org/bioclim.

india_boundary = getData(name = "GADM", download=TRUE, country='IND', level = 0) # Try with level = 1, 2 and plot

class(india_boundary) # SpatialPolygonsDataFrame

mask_india = mask(bioclim_data_india, india_boundary) # Climate data within India only

class(mask_india)

bioclim_data_india = mask_india # Keep the same name for simplicity

writeRaster(bioclim_data_india, "bioclim_data_india.grd", overwrite = TRUE) # Save the masked RasterLayers of India

plot(bioclim_data_india, 1:9) # Only India map will appear in the plot

plot(bioclim_data_india, 10:18)

Note that, the class of India is a SpatialPolygonsDataFrame. So the idea is to mask india_boundary from e_india and create a new raster. The same process will be followed for data from South America also, but I will describe this later in Generation of Background Points. We could have masked India from bioclim_data directly. But, it might take few minutes to run. The following code can be used for that. The following code may be used for that.

mask_india = mask(bioclim_data, india_boundary)

class(mask_india)

  • Preparing Final Data for Modeling

We have learned how we can download the environmental variables from open source climate data repository using the getData() function. We have also studied how we can get the covariate information in specific locations, saving the data in local directories and reading them from the local directories. Now we shall concentrate on modeling part, where the predictors will be the predictors_america and predictors_india. Note that they are RasterStack objects. So, we have to extract the numeric values of covariates from them and prepare them for modeling.

plot(bioclim_data_america, 1, main = "Annual mean temp") # Let us plot again

points(mic_gbif_america[,c("lon","lat")], col="red", pch=16, cex=0.5) # Plot the occurrence points

In the plot, the red circles indicate the locations where the occurrences of the species have been recorded. We now require the climate data at these locations only, not all over America. By using the extract() function we collect the values of the climatic variables at the coordinates where the presence of the species was recorded.

predictors_values_america = extract(bioclim_data_america, mic_gbif_america[,c("lon", "lat")])

In predictors_america, the climatic information is available for complete South America. The object predictors_values_america contains the climatic information only in those locations where the species is observed. It might happen that, at those coordinates, the climatic variables may not be available. So we can use those locations for modeling. So we have to remove them from predictors_values_america. The number of coordinates at which the predictor information is available can be obtained by using the function, complete.cases(). Note the use of the function as.data.frame. Look into the ind variable also. It is a good practice to check the data summary again and again.

ind = complete.cases(as.data.frame(predictors_values_america)) # Return TRUE for complete rows else FALSE

predictors_values_america = as.data.frame(predictors_values_america[ind,]) # Collect only complete rows

dim(predictors_values_america)

head(predictors_values_america)

tail(predictors_values_america)

summary(predictors_values_america)

A modeler should take special care of the function summary() in R. It is a generic function that is used for summarizing the results or output. You can check some nice looking graphs using some functions in R as follows.

pairs(predictors_values_america[2:4], col = 2)

library(car)

scatterplotMatrix(predictors_values_america[,2:4], col = 2)

scatterplotMatrix(predictors_values_america[,2:4], diagonal = "histogram")

options(digits = 2)

cor(predictors_values_america[,1:9])

cor(predictors_values_america[,10:19])

cor.test(predictors_values_america[,3], predictors_values_america[,4])

cor.test(predictors_values_america[,"bio1"], predictors_values_america[,"bio15"])

corr_mat = cor(predictors_values_america)

require(corrplot)

corrplot(corr_mat, method = "circle")

corrplot(corr_mat, method = "number")

corrplot(corr_mat, order = "hclust", addrect = 4, rect.lwd = 2, rect.col = "red")

There are several other ways by which you can customize the plots and perform exploratory data analysis on these variables. Graphics in R itself very vast, it will be a wise decision not to cover this in this workshop. For India, we have to do the same thing. But, we shall not use the object mic_gbif_india, because GBIF does not contain occurrence data of Mikania micrantha in India subcontinent. We shall use the data mic_india that were collected by the investigators and cleaned.

plot(bioclim_data_india, 1)

points(mic_india[,c("lon","lat")], col="red", pch = 16, cex=0.75)

predictors_values_india = extract(bioclim_data_india, mic_india[,c("lon", "lat")])

ind = complete.cases(as.data.frame(predictors_values_india))

predictors_values_india = as.data.frame(predictors_values_india[ind,])

head(predictors_values_india)

tail(predictors_values_india)

dim(predictors_values_india)

summary(predictors_values_india)

pairs(predictors_values_india[,2:4], col = "red", pch=16)

library(car)

scatterplotMatrix(predictors_values_india[,2:4], col = 2)

scatterplotMatrix(predictors_values_india[,2:4], diagonal = "histogram")

In the previous section, we have collected the values of the climatic variables at each coordinate where the occurrence has been recorded. So the data sets predictors_values_america and predictors_values_india are presence only data. For modeling, it better to add a binary variable in the dataframe indicating that the species is present, let y denote the variable. y=1 means presence and y=0 mean absence. Since the data is presence-only, all entries will be one.

Statistically, our goal is to model the quantity P(y = 1| f(z)), where z is the environmental/climatic conditions. By using Bayes Rule, the quantity is equal to f_1(z)P(y=1)/f(z)$. Both the density f_1(z) and f(z) can be estimated from the climate data. But P(y=1) is not possible to estimate. The quantity represents the prevalence of the species in that region, that is the proportion of occupied sites of the species. Since it is presence-only, there is no possibility to estimate P(y=1). This means that it can not be exactly determined, regardless of the sample size, this is a fundamental limitation of presence-only data (Elith et al. 2009, "A statistical explanation for MaxEnt for ecologists", Diversity and Distributions). To overcome this difficulty, we generate background data and consider it as the pseudo absence. This can be done in the following way. We generate random points that can be used to extract background values ("random-absence"). The points are sampled (without replacement) from the cells that are not 'NA' in raster 'mask'. The set.seed() function plays very important role during simulation. You are encouraged to see the help file (but only for statistician).

set.seed(123) # Ensure repetitive random samples are generated

background = randomPoints(bioclim_data_america, 10000) # Generate random coordinates

help(randomPoints)

We have to take caution whether the background points are falling into the ocean. If so, we have to remove those points from the data.

library(sp)

background = as.data.frame(background)

names(background) = c("lon", "lat")

coordinates(background) = ~lon+lat

crs(background) = crs(wrld_simpl)

class(background)

ovr = over(background, wrld_simpl)

class(ovr)

head(ovr)

country = ovr$NAME

sum(is.na(country))

is_ocean = which(is.na(country))

lonlat_ocean = background@coords[is_ocean,]

background@coords = background@coords[-is_ocean,]

background = background@coords

pseudo_absence_values = extract(bioclim_data_america, background)

dim(pseudo_absence_values)

head(pseudo_absence_values)

plot(bioclim_data_america, 1, main = "Annual mean temperature")

points(background, pch=".", col='black')

points(mic_gbif_america[,c("lon","lat")], pch="*", col="red", cex=1)

legend("topright", legend = c("background", "presence"), col = c("black", "red"), pch = c(".","*"), bty = "n")


Note that, we have checked for the background points also whether they are falling into the ocean or not. It means that the background environment mic_data_america contains points from the ocean also. This is due to the crop() function. The function crops the raster object or raster layers by extent. Even if we pass some spatial polygons at which covariate information is required. It returns the background corresponds to the extent of the polygon, not for the region enclosed within the polygon. To collect the covariate information for South America without region falling into the ocean, we use the following code.

library(sp)

library(rgeos)

library(raster)

library(rgdal)

library(maptools)

setwd("C:/STUDY/SDM at ISI Kolkata/RDS files_South America")

Mexico = readRDS("MEX_adm0.rds") # A part of North America

crs(Mexico) = crs(wrld_simpl)

Venezuela = readRDS("VEN_adm0.rds")

crs(Venezuela) = crs(wrld_simpl)

Brazil = readRDS("BRA_adm0.rds")

crs(Brazil) = crs(wrld_simpl)

Columbia = readRDS("COL_adm0.rds")

crs(Columbia) = crs(wrld_simpl)

Peru = readRDS("PER_adm0.rds")

crs(Peru) = crs(wrld_simpl)

Bolivia = readRDS("BOL_adm0.rds")

crs(Bolivia) = crs(wrld_simpl)

Chile = readRDS("CHL_adm0.rds")

crs(Chile) = crs(wrld_simpl)

Argentina = readRDS("ARG_adm0.rds")

crs(Argentina) = crs(wrld_simpl)

Paraguay = readRDS("PRY_adm0.rds")

crs(Paraguay) = crs(wrld_simpl)

Uruguay = readRDS("URY_adm0.rds")

crs(Uruguay) = crs(wrld_simpl)

Guyana = readRDS("GUY_adm0.rds")

crs(Guyana) = crs(wrld_simpl)

Suriname = readRDS("SUR_adm0.rds")

crs(Suriname) = crs(wrld_simpl)

Ecuador = readRDS("ECU_adm0.rds")

crs(Ecuador) = crs(wrld_simpl)

French_Guiana = readRDS("GUF_adm0.rds")

crs(French_Guiana) = crs(wrld_simpl)

#Central America

Guatemala = readRDS("GTM_adm0.rds")

crs(Guatemala) = crs(wrld_simpl)

Nicaragua = readRDS("NIC_adm0.rds")

crs(Nicaragua) = crs(wrld_simpl)

Costa_Rica = readRDS("CRI_adm0.rds")

crs(Costa_Rica) = crs(wrld_simpl)

Panama = readRDS("PAN_adm0.rds")

crs(Panama) = crs(wrld_simpl)

Honduras = readRDS("HND_adm0.rds")

crs(Honduras) = crs(wrld_simpl)

Belize = readRDS("BLZ_adm0.rds")

crs(Belize) = crs(wrld_simpl)

Cuba = readRDS("CUB_adm0.rds") # islands near south America in the Carribean sea

crs(Cuba) = crs(wrld_simpl)

Haiti = readRDS("HTI_adm0.rds")

crs(Haiti) = crs(wrld_simpl)

Jamaica = readRDS("JAM_adm0.rds")

crs(Jamaica) = crs(wrld_simpl)

Bahamas = readRDS("BHS_adm0.rds")

crs(Bahamas) = crs(wrld_simpl)

Dominician_Republic = readRDS("DOM_adm0.rds")

crs(Dominician_Republic) = crs(wrld_simpl)

Trinidad_Tobago = readRDS("TTO_adm0.rds")

crs(Trinidad_Tobago) = crs(wrld_simpl)

Puerto_Rico = readRDS("PRI_adm0.rds")

crs(Puerto_Rico) = crs(wrld_simpl)

Faulkland = readRDS("FLK_adm0.rds")

crs(Faulkland) = crs(wrld_simpl)

south_america = rbind(Mexico, Venezuela, Brazil, Columbia, Peru, Bolivia, Chile, Argentina, Paraguay, Uruguay, Guyana, Suriname, Ecuador, French_Guiana, Guatemala, Nicaragua, Costa_Rica, Panama, Honduras, Cuba, Haiti, Jamaica, Bahamas, Dominician_Republic, Trinidad_Tobago, Puerto_Rico, Faulkland, Belize)

plot(south_america)

box()

class(south_america)

# Selecting the environmental variables from South America (mask)

bioclim_data_america = mask(bioclim_data, south_america)

writeRaster(bioclim_data_america, "bioclim_data_america.grd", overwrite = TRUE)

background = randomPoints(bioclim_data_america, 10000)

pseudo_absence_values = extract(bioclim_data_america, background)

dim(pseudo_absence_values)

plot(bio_america, 1:9)

plot(bio_america, 10:18)

Note that, in the above code, we have created the mask for South America in a hard way. However, it delineates a simple way to create mask by joining countries. This can be used for many purposes. Basically, we have downloaded the level-0 global administrative boundaries for the countries in South America as R (SpatialPolygonsDataFrame) and joined them to create the mask. Now, this background does not contain any locations that are falling into the ocean. You are encouraged to run the code from the previous section to verify this.

  • Some Operations on shape files (Discussed in the Workshop 2018)

# A bit dealing with shapefiles in R. Recollect the session from ArcGIS

library(maptools)

library(rgdal)

shape_invaded = readOGR("Shape files/country_India/india.shp")

plot(shape_invaded, col = "light yellow")

class(shape_invaded)

str(shape_invaded)

showDefault(shape_invaded)

crs(shape_invaded)

shape_invaded@bbox

extent(shape_invaded)


shape_native = readOGR("Shape files/country_native/native_countries.shp")

plot(shape_native, col = "lightgrey")

shape_combined = gUnion(shape_invaded, shape_native)

plot(shape_combined, col = "lightgreen")

plot(wrld_simpl, col = "lightyellow", add = TRUE)

plot(shape_combined, col = "lightgreen", add = TRUE)


# Let us make the plots again

plot(shape_combined, col = "lightgreen")

mik_occ_raw = read.table("mic_gbif.csv", header = TRUE, sep = ",")

names(mik_occ_raw)

points(mik_occ_raw[,2], mik_occ_raw[,3], pch = 16, col = "red", cex = 0.6)


  • Minimum Convex Polygon

In the above code, we have considered the complete South America as the background for the Mikania micrantha. In the following, we shall create the minimum convex polygon (MCP) and select the background sample from the MCP. The MCP is an estimate of the native home range of M. micrantha. The function mcp() available in the adehabitatHR package is used to perform the operation. The function also allows buffering. I will update more about it in later sections.

data = cbind(id = rep("Mikania micrantha", nrow(mic_gbif_america)), mic_gbif_america) # dataframe with first column as spcies

head(data); names(data) # first few rows and column names

coordinates(data) = ~lon + lat # SpatialPointsDataFrame

library(adehabitatHR)

# Load library to access function mcp()

cp = mcp(data, percent = 100)

# minimum convex polygon

cp@proj4string = crs(wrld_simpl)

# add crs of wrld_simpl

plot(cp) # Plot of minimum convex polygon

points(mic_gbif_america, col= "green", cex=0.9, pch=20)

# add occurrece locations to the plot

plot(wrld_simpl, add= TRUE, lwd=1)

# Show it on the world map

For selecting the background points from MCP, we have to take the intersection of South America and minimum convex polygon. The rgeos package has some helpful functions for manipulating SpatialPolygons - including ways to check invalid geometries and typical GIS tools for checking and manipulating geometries, such as gUnion(), gDifference(), gIntersection(), gBuffer(), etc. The steps can be written in the way: Intersection of world map and cp to get the MCP area only -> crop the environmental RasterLayers for the MCP area -> create mask for MCP -> Generate bakcground points from the mask. We have to take an intersection of cp and wrld_simpl, because cp is created from occurrence poits by joining the outermost points. Hence, within the cp there may ocean areas. Look into the plots carefully to understand the difference.

library(rgeos) # Load the library rgeos

mcp = gIntersection(wrld_simpl, cp) # Intersection of cp and world map, call mcp

plot(mcp, lwd =1) # Plot of MCP

points(mic_gbif_america, col= "red", cex=1, pch=20)

bioclim_data_mcp = mask(crop(bioclim_data_america, mcp), mcp) # Crop bioclim data on mcp and then mask it

background = randomPoints(bioclim_data_mcp, 1000) # Generate background points from the mask

Do we need to check whether the points falling into the ocean. Think and convince yourself that we not need to do that again.

plot(mcp, lwd =1)

points(mic_gbif_america, col= "red", cex = 0.9, pch = 20, lwd=2)

points(background, pch=19, col='grey', cex = 0.2, lwd=1)

legend("bottomleft", legend = c("background points", "occurrence points"), col = c("black","blue"), lwd = c(1,2), bty="n") # add legends to the plot

box()

A crucial point to be noted is that predictors_america contains the complete background data and the random selection of points are independent of the species' presence or absence. Now we include the pseudo absence data and occurrence data in a single data.frame.

  • Running MaxEnt in R environment

In the following, we run MaxEnt in R Environment using dismo package. Install MaxEnt in the computer. Put the file 'maxent.jar' in the 'java' folder of this package. First we prepare the final data for passing to MaxEnt.

y = c(rep(1, nrow(predictors_values_america)), rep(0, nrow(pseudo_absence_values))) # Preparing binary response 1/0

sdm_data_america = cbind(y, rbind(predictors_values_america, pseudo_absence_values)) # Join response with climatic data

summary(sdm_data_america) # Summary of the data

head(sdm_data_america, n=10) # First 10 observations

tail(sdm_data_america)

dim(sdm_data_america)

library(dismo)

library(rJava)

Model= maxent(x = sdm_data_america[,-1], p = y) # Run MaxEnt using maxent() function

show(Model)

response(Model)

help(maxent)

In the maxent() function, -1 is used as we do not need the y column. Note that it is already passed to the function using the argument p = y. In the following we run MaxEnt for different training sets which are created by selecting 70% of rows randomly. Rest 30% has been kept as test set.

For each run, we did the following things

  1. Random selected 70% rows of occurrence data and kept it training the model (train_data)

  2. Rest 30% is kept for testing the model (test_data)

  3. Selected the background sample of size 10000

  4. train_y contains 1’s and 0’s according to the number of rows in training occurrence data and background data respectively

  5. Prepare training data and run MaxEnt on it

  6. Then run the model on test_data

  7. The above process was carried out 10 times

This is a very simple way of doing it. Of course something more needs to be added for generating summary statistics.

run = 10

for(i in 1:run){

ind = sample(1:nrow(predictors_values_america), size = floor(0.7*nrow(predictors_values_america)), replace = FALSE)

train_data = predictors_values_america[ind, ]

test_data = predictors_values_america[-ind, ]

background = randomPoints( predictors_america, 10000)

pseudo_absence_values = extract(predictors_america, background)

train_y = c(rep(1,length(ind)), rep(0,nrow(pseudo_absence_values)))

train_sdm_data_america = cbind(train_y, rbind(train_data, pseudo_absence_values))

Model = maxent(train_sdm_data_america[,-1], p = train_y)

show(Model)

r = predict(Model, test_data, progress='text', args = c("outputformat=raw"))

}

  • Environmental bias in occurrence data

bioclim_data_america and bioclim_data_india will be used as predictors. These two data sets contain RasterLayers of all of the nineteen variables available in https://www.climond.org. In this section, we shall identify the plausible presence of environmental bias in the data. We capture the background environment information and create the complete environmental space of South America. Then, we identify the occurrence points in this space. If the environmental variables at occurrence records appear to represent a specific segment or clusters in the complete environmental space, then we suspect for the plausible presence of environmental bias in the occurrence data. Let us investigate the fact in our dataset. To generate the background environmental space, we randomly select a large number of locations from the complete background of South America. The function randomPoints() returns randomly selected locations. In this case, we select 10000 background locations and retrieve the quantitative values of environmental variables at those using the extract() function. Both these functions are available in the dismo package. The variable (a matrix object) bioclim_values_america contains the numerical predictor variables. Similarly, environmental data is obtained at occurrence points and is stored in bioclim_values_occ_america (a matrix object) after removing the missing values (if any).

require(dismo) # Check the difference between require & library

lonlat_america = randomPoints(bioclim_data_america, 10000) # Randomly selects 10000 coordinates

class(lonlat_america) # matrix

head(lonlat_america)

bioclim_values_america = extract(bioclim_data_america, lonlat_america) # Extract the values of bioclimatic variables

class(bioclim_values_america) # matrix

head(bioclim_values_america)

dim(bioclim_values_america) # number of rows and number of columns

sum(complete.cases(as.data.frame(bioclim_values_america))) # Returns the number of rows with missing values

I want to emphasize on the function complete.cases(). The function returns a logical vector containing TRUE and FALSE. The function scans each row in the data and returns TRUE if there no missing values, otherwise returns FALSE. Very handy function to cross check the data. Then by applying the sum() function, you return the number of rows that contain no missing values. I suggest trying this also sum(!complete.cases(data)) that will directly return the number rows that contain missing values. There are several ways how one can treat missing values in R. In the next, we extract environmental variables at the occurrence locations.

bioclim_values_occ_america = extract (bioclim_data_america, mic_gbif_america[, c("lon","lat") ] ) # extract climatic data at occurrence locations

class(bioclim_values_occ_america)

head(bioclim_values_occ_america)

dim(bioclim_values_occ_america)

sum(!complete.cases(as.data.frame(bioclim_values_occ_america)))

ind = complete.cases(as.data.frame(bioclim_values_occ_america)) # Row indices that contain no missing values

bioclim_values_occ_america = bioclim_values_occ_america[ind, ] # Save the data with complete information only

dim(bioclim_values_occ_america) # data dimension after removinng missing rows

sum(ind); print(ind)

In the following, we create a scatter plot plot of the environmental variables at both background and occurrence points but with the different color. From the following figure it is evident that the cold places are not visited. There may be possible bias with respect to the temperature, but no bias with respect to precipitation. But of course, that needs to be verified with expert comments.

par(mfrow = c(2,2))

plot(bioclim_values_america[,1], bioclim_values_america[,6], col="green", pch=16, xlab = "Annual Mean Temperature (°C)", ylab = "Min Temp of Coldest Month (°C)")

points(bioclim_values_occ_america[,1],bioclim_values_occ_america[,6], col="red", pch=16)

legend("topleft", legend = c("background", "occurrence"), col = c("green", "red"), bty = "n", pch = c(16,16))

plot(bioclim_values_america[,4], bioclim_values_america[,6], col="green", pch=16, xlab = "Temperature Seasonality (sd *100)", ylab = "Min Temp of Coldest Month (°C)")

points(bioclim_values_occ_america[,4],bioclim_values_occ_america[,6], col="red", pch=16)

legend("topright", legend = c("background ", "occurrence "), col = c("green", "red"), bty = "n", pch = c(16,16))

plot(bioclim_values_america[,7], bioclim_values_america[,12], col="green", pch=16, xlab = "Temperature Annual Range (BIO5-BIO6) (°C)", ylab = "Annual Precipitation")

points(bioclim_values_occ_america[,7],bioclim_values_occ_america[,12], col="red", pch=16)

legend("topright", legend = c("background ", "occurrence "), col = c("green", "red"), bty = "n", pch = c(16,16))

plot(bioclim_values_america[,12], bioclim_values_america[,15], col="green", pch=16, xlab = "Total (annual) precipitation", ylab = " Precipitation seasonality (cv)")

points(bioclim_values_occ_america[,12],bioclim_values_occ_america[,15], col="red", pch=16)

legend("topright", legend = c("background ", "occurrence "), col = c("green", "red"), bty = "n", pch = c(16,16))

par(mfrow = c(1,1))

  • Choice of background

Selection of background points is an important decision in presence-only species distribution modeling for model parametrization. It is also important to consider the eco-geographical extent while selecting the background points. In order to analyze what extent might provide most accurate results, we choose two Extents from our region of interest. After the complete data processing, we have 865 occurrence points from South America and 105 from India. In our data analysis, we considered four different backgrounds, namely South America (SA), the minimum convex polygon of South America (SA MCP), South America and India & MCP of South America (SA MCP + IND MCP) and MCP of India (IND MCP). The environmental covariates are masked accordingly for the regions of interest. For each occurrence point, the values of 19 environmental covariates are extracted.

In the earlier sections, we have already prepared the data and environmental layers for South America and South America MCP. We have also created the bioclim_data_india environmental layers for India map only. Here I first process the occurrence data from India and then do rest of the modeling part. We also create the MCP for India.

mic_india = read.csv("India occur_Final.csv", header = TRUE) # Load the occurrence data from India

mic_india = mic_india[,-c(1)] # Remove the first column only to consider lon lat

names(mic_india) = c("lon", "lat") # Set the column names as lon and lat respectively

data = cbind(id = rep("Mikania micrantha", nrow(mic_india)), mic_india) # Use cbind to create first column as species name

coordinates(data) = ~lon + lat # Convert to SpatialPointsDataFrame object

require(adehabitatHR) # Load library to access mcp() function

cp = mcp(data, percent = 100) # Join outermost points to creat convex polygon

cp@proj4string = crs(wrld_simpl) # add coordinate reference system

plot(cp) # Plot the covex polygon

points(mic_india, col= "green", cex=0.9, pch=20) # add occurrence points to the convex polygon

plot(wrld_simpl, add= TRUE, lwd=1) # Show it on the world map

library(rgeos) # load library rgeos

mcp = gIntersection(wrld_simpl, cp) # Create mcp by taking intersection with wrld_simpl

plot(mcp, lwd =1)

points(mic_india, col= "blue", cex=0.9, pch=20)

bioclim_data_india_mcp = mask(crop(bioclim_data_india, mcp), mcp) # Get the RasterLayers at occurrence coordinates

background = randomPoints(bioclim_data_india_mcp, 1000) # Generate background from India MCP

plot(mcp, lwd =1) # Plot of the India MCP

points(mic_india, col= "red", cex = 0.9, pch = 20, lwd=2) # add occurrence points coorodinates to MCP

points(background, pch = 19, col='grey', cex = 0.2, lwd=1) # add the background points

legend("bottomright", legend = c("background points", "occurrence points"), col = c("black","blue"), bty="n", pch=c(20,21)) # add an appropriate legends

legend("bottomright", legend = "India MCP", bty="n", cex = 2) # add legends

box() # bounding box

writeRaster(bioclim_data_india_mcp, "bioclim_data_india_mcp.grd", overwrite = T) # Save the RasterLayers for future use

At this point, one may get confused. So I just put a summary of the variables names here.

  • mic_gbif_america - longitude-latitude information about the presence locations of the species in the region of South America.

  • mic_gbif_india - consists of longitude-latitude information about the presence locations of the species in the Indian subcontinent.

  • bioclim_data_world - raster file consisting of information about the environmental covariates for the whole world.

  • bioclim_data_america - raster file consisting of information about the environmental covariates for the region of South America.

  • bioclim_data_india - raster file consisting of information about the environmental covariates for the region of India.

  • bioclim_data_mcp - raster file consisting of information about the environmental covariates for the minimum convex polygon built in the region of South America.

  • bioclim_data_india_mcp - raster file consisting of information about the environmental covariates for the minimum convex polygon built in the region of India.

  • predictors_values_america - Information about the environmental covariates for the presence locations of species in the region of South America.

  • predictors_values_india - Information about the environmental covariates for the presence locations of species in the region of India.

  • predictors_values_america_india - Information about the environmental covariates for the presence locations of species in the region of South America and India.

  • wrld_simpl - SpatialPolygonsDataFrame containing world map.

  • sdm_data_america - It consists of a response variable with a binary outcome (either 0 or 1) and the 19 explanatory variables (environmental covariates) of presence and background locations.

Handling RasterLayers from Local Folder

Occurrence Data Processing which was covered in the R sessions of Workshop on SDM using MaxEnt in 2017 (January 4 - 9, 2017) organized by the Agricultural and Ecological Research Unit, Indian Statistical Institute. Bioclim data, processed using GIS software, can be obtained on request.

In the previous session, we have learned that how one can directly access the occurrence data and bioclimatic variables from the online repositories directly. However, one may also want to download the data manually from the online resources. The following pieces of codes will be helpful for the researchers who have collected the data on their own. The data on Mikania micrantha which was manually downloaded from Global Biodiversity Information Facility (http://www.gbif.org/) are available at this link. The data are stored in the files native_occrrance final.csv (for South America Location) and India occur_Final.csv for India location. Throughout the workshop, we shall consider theses two files for modeling. Each of the files contains three variables, the first one is species name, the second one is longitude (X) and the third one is latitude (Y). In the following, we first read the data from local files and process them for the use in R environment. Download the data files and save them in a folder. Set the path of the folder using setwd() function.

setwd("C:/STUDY/SDM at ISI Kol (9-14 Jan 2017)/maxent")

mic_america = read.table("native_occurrence final.csv", header = TRUE, sep = ",")

The function setwd() is used to set the current working folder. Here, I have set the folder as my current working folder in which all the data files and climate variables are stored. You can also check the current working directory by using getwd(). The function read.table() is used to read the CSV file, but make sure to put the argument sep = ","; header = TRUE means the R will consider the first row as the names of the columns. You can also use the function read.csv() for reading CSV file. In such case, no need to put the sep = option.

mic_america = read.csv("native_occurrence final.csv", header = TRUE)

We have stored the data in the variable named mic_america. After loading the data into the current R environment, a basic scrutiny of the data is suggested. These can be performed using the following functions in R.

class(mic_america)

names(mic_america)

dim(mic_america)

head(is.na.data.frame(mic_america))

sum(is.na.data.frame(mic_america))

head(complete.cases(mic_america))

sum(complete.cases(mic_america))

Instead of complete.cases() one can use na.omit() also, the function removes all the rows containing NA values

mic_america = na.omit(mic_america)

So the data has no missing values, It has 874 observations and three variables. The variables are Species, X and Y. The function class() returns the type of object in which the data is stored in the R environment. The functions names(), head() are self explanatory. Type help(names) in the console and read the help file that appears in the right-end corner of your RStudio for further information.

head(mic_america[,"Species"]) # Print first few observation of variable Species

head(mic_america[,c("X","Y")])

head(mic_america)

Take caution, X = longitude, Y = latitude. longitude always come before the latitude in the modeling work. All functions related to SDM considers the first variable as longitude. So while importing the data be alert. Now we need only the last two variables "X" and "Y". In the following chunk of codes, we remove the species column and store the longitude and latitude information in mic_america that contains only two columns. We rename them as lon and lat.

mic_america = mic_america[, c("X","Y")]

head(mic_america)

names(mic_america) = c("lon", "lat")

head(mic_america)

library(maptools)

data("wrld_simpl")

plot(wrld_simpl, xlim= c(-100, -34), ylim=c(-50,30), axes=TRUE, col="light yellow")

points(mic_america$lon, mic_america$lat, col = "red", lwd=2)

box()

file = "India occur_Final.csv"

mic_india = read.table(file, header = TRUE, sep = ",")

mic_india = mic_india[,c("X","Y")]

head(mic_india)

names(mic_india) = c("lon", "lat")

plot(wrld_simpl, axes=TRUE, col="lightgrey")

points(mic_india$lon, mic_india$lat, col = "red", pch=16, cex = 0.75)

box()

  • Environmental Data Processing which was covered in the R sessions of Workshop on SDM using MaxEnt

The environmental data is usually stored as a raster type (grid) file. Now each environmental covariate/predictor is a raster object. Avoid ASCII files if you can, as they tend to considerably slow down processing speed. In a particular modeling exercise, all the layers should have same spatial extent, resolution, origin, and projection. For completeness of the report, we shall first read the files which are with .asc extension. Then, we read the environmental layers which are save as text files (downloaded from the internet). For both cases, the same function read.asciigrid() will work. Let us first read the environmental variables which are downloaded from https://www.climond.org/ as ascii file. Details will be available here https://www.climond.org/BioclimRegistry.aspx#Table1

setwd("C:/STUDY/SDM at ISI Kol (9-14 Jan 2017)/maxent")

bio01 = read.asciigrid("layers/bio01.asc") # Annual mean temperature

bio04 = read.asciigrid("layers/bio04.asc") # Temperature seasonality (C of V)

bio06 = read.asciigrid("layers/bio06.asc") # Min temperature of coldest week (°C)

bio07 = read.asciigrid("layers/bio07.asc") # Temperature annual range (Bio05-Bio06) (°C)

bio12 = read.asciigrid("layers/bio12.asc") # Annual precipitation (mm)

bio15 = read.asciigrid("layers/bio15.asc") # Precipitation seasonality (C of V)

class(bio01)

The class of bio01 is SpatialGridDataFrame. To plot this we have to convert it to a raster object and then by using stack function, we can convert a set of predictors to raster layers. A raster is a spatial (ge-ographic) data structure that divides a region into rectangles called 'cells' (or 'pixels') that can store one or more values for each of these cells. Such a data structure is also referred to as a 'grid' and is often contrasted with 'vector' data that is used to represent points, lines, and polygons.

class(raster(bio01))

This is a raster layer object. So for each environmental variables, we have to create separate RasterLayer objects. Let us stack all the raster objects together. In the following, we have created RasterStack object with the name predictors for all the environmental variables using the above command. So all the environmental variables which were stored as an ASCII files in your system, are now stored in a RasterStack with each RasterLayer representing the environmental variables. Let us check what are the names also.

predictors = stack(raster(bio01), raster(bio04), raster(bio06), raster(bio07), raster(bio12), raster(bio15)) # Create RasterStack object

names(predictors) # names of the Raster Layers

names(predictors) = c("bio01", "bio04", "bio06", "bio07", "bio12", "bio15") # Rename the Raster Layers

class(predictors)

plot(predictors) # Plot all the Raster Layers separately

In the plot note that, South America and India data map is available. Recall from the previous lectures that, in the workshop, we shall build the model using the data from South America and predict the distribution of Mikania micrantha in India. We can plot separately as well for each layer. In the following, we show only for the second layer which is bio01.

plot(predictors, 1, main = "Annual mean temperature")

mic_gbif_america = read.csv("mic_gbif_america.csv")

plot(wrld_simpl, add = TRUE)

points(mic_gbif_america[,1:2], col = "red", pch=".", cex = 2)

box()

We can plot South America and India separately also. Since we are interested in modeling the distribution of Mikania micrantha in South America, so we have to identify the climate at those locations only. We shall predict the distribution of occurrences in India, so we separate the climate data in these locations also. The extent function takes the matrix as input and returns a raster object. The extent is defined as (1) the westernmost longitude, (2) the easternmost longitude, (3) the southernmost latitude, and (4) the northern most latitude. In the following code e_america and e_india correspond to the south America and India respectively.

e_america = extent(-120, -30, -60, 30)

e_india = extent(65, 100, 5, 45)

bio01_america = crop(raster(bio01), e_america)

bio04_america = crop(raster(bio04), e_america)

bio06_america = crop(raster(bio06), e_america)

bio07_america = crop(raster(bio07), e_america)

bio12_america = crop(raster(bio12), e_america)

bio15_america = crop(raster(bio15), e_america)

bio01_india = crop(raster(bio01), e_india)

bio04_india = crop(raster(bio04), e_india)

bio06_india = crop(raster(bio06), e_india)

bio07_india = crop(raster(bio07), e_india)

bio12_india = crop(raster(bio12), e_india)

bio15_india = crop(raster(bio15), e_india)

The raster object bio01_america corresponds to the climate data of Annual mean temperature in South America only. Similarly, bio04_america and other names are defined. The raster object bio01_india corresponds to the climate data of Annual mean temperature in India only. Similarly, the other names are defined.

predictors_america = stack(bio01_america, bio04_america, bio06_america, bio07_america, bio12_america, bio15_america)

names(predictors_america) = c("bio01", "bio04", "bio06", "bio07", "bio12", "bio15")

plot(predictors_america, main = c("Annual mean temperature","Temperature seasonality (C of V)", "Min temperature of coldest week(°C)", "Temperature annual range", "Annual precipitation (mm)", "Precipitation seasonality (C of V)"))

predictors_india = stack(bio01_india, bio04_india, bio06_india, bio07_india, bio12_india, bio15_india)

names(predictors_india) = c("bio01", "bio04", "bio06", "bio07", "bio12", "bio15")

plot(predictors_india, main = c("Annual mean temperature","Temperature seasonality (C of V)", "Min temperature of coldest week(°C)", "Temperature annual range", "Annual precipitation (mm)", "Precipitation seasonality (C of V)"))

The environmental layers may be stored in a text file. For example, the file ''CM10_1975H_Bio01_V1.2.txt'' contains the Annual mean temperature at different locations of the world. We can read that file using the same function.

bio1 = read.asciigrid("layers/CM10_1975H_Bio01_V1.2.txt")

class(bio1)

plot(bio1)

box()

Some alternative way of generating plots with background points

library(maps)

SA <- c("argentina", "bolivia", "brazil", "chile", "colombia", "ecuador", "guyana", "paraguay", "peru", "suriname", "uruguay", "venezuela", "mexico", "guatemala", "belize", "honduras", "nicaragua", "panama", "costa rica", "cuba", "jamaica", "puerto rico", "dominican rep", "trinidad", "el salvador", "bahamas")

map(regions = SA)

background = randomPoints(bioclim_data_america, 1000)

points(background, pch = 19, col='grey', cex = 0.2, lwd=1)

points(mic_america$lon, mic_america$lat, col= "red", cex = 0.9, pch = 20, lwd=2)

require(rworldmap)

nativeCountries = c("BLZ", "ARG", "CRI","CUB","DMA","DOM","SLV", "GRD", "GLP", "GTM", "JAM", "MTQ", "PAN", "PRI", "LCA", "TTO", "ARG", "BOL", "BRA", "COL", "ECU", "PER", "VEN", "MEX", "CHL", "PRY", "URY", "GUY", "SUR", "GUF", "NIC", "HND", "HTI", "BHS", "FLK")

alienCountries = c("ASM", "AUS", "BGD", "IOT", "IND", "PHL", "CHN", "CXR", "COK", "FSM", "FJI", "PYF", "GUM", "HKG", "IDN", "MYS", "MUS", "NPL", "NCL", "NIU", "MNP", "PLW", "PNG", "REU", "WSM", "SLB", "LKA", "THA", "TKL", "TON", "TUV", "VUT", "WLF")

statusDF = data.frame(country = c(nativeCountries, alienCountries), status = c(rep("native", length(nativeCountries)), rep("alien", length(alienCountries))))

statusMap = joinCountryData2Map(statusDF, joinCode = "ISO3",

nameJoinColumn = "country")

mapCountryData(statusMap, nameColumnToPlot="status", catMethod = "categorical", missingCountryCol = gray(.9),addLegend = FALSE, mapTitle = "", lwd=0.9, borderCol = "blue")

legend(-180,-91,legend = c("Alien range", "Native range"), pch = c(22,22), cex =1.2, col = c("yellow", "red"), pt.bg = c("yellow", "red"), bty="n")

box()

Some basics of Spatial Data Handling in R

  • Basic data storing facility lon and lat information

lon = c(65.7, 85.5, 73.4, 80, 85, 70.5, 88.9)

lat = c(6.5, 11.8, 15.8, 35.1, 30, 30.8, 14.1)

locations = cbind(lon, lat)

class(locations)

names = c("A", "B", "C", "D", "E", "F", "G")

row.names(locations) = names

print(locations)

  • Plotting locations

par(mfrow = c(2,2))

plot(locations, xlim=c(60,95), ylim=c(3,40))

help(plot)

plot(locations, xlim=c(60,95), ylim=c(3,40), col = "red")

plot(locations, xlim=c(60,95), ylim=c(3,40), col = "blue", lwd = 2, cex = 2)

plot(locations, xlim=c(60,95), ylim=c(3,40), col = "red", lwd = 2, pch = 20, cex = 2)

par(mfrow=c(1,1))

  • SpatialPoints

locations = as.data.frame(locations)

library(sp)

SpPoints = SpatialPoints(locations)

class(SpPoints); str(SpPoints)

crs(SpPoints) = '+proj=longlat +datum=WGS84'

SpPoints@proj4string; plot(SpPoints);

box()

  • Populating spatial points with additional covariate information: SpatialPointsDataFrame

annual_rainfall = c( 17.9, 14.8,50, 5.5, 24.8, 20.05, 23.8)

annual_rainfall = data.frame(annual_rainfall)

SpPointsData = SpatialPointsDataFrame(SpPoints, annual_rainfall)

class(SpPointsData)

str(SpPointsData)

SpPointsData@data

plot(SpPointsData)

  • Customizing plots

row.names(locations) = LETTERS[1:7]

plot(SpPoints, xlim=c(60,95), ylim=c(3,40), cex = sqrt(annual_rainfall[,1])/2, pch = 20, col = "red", main = "annual rainfall (cm)")

box()

text(locations, names, pos = 4)

legend("topright",legend = c(10,20,30,40), pch = 20, pt.cex = sqrt(annual_rainfall[,1])/2, col = "red", bty = "n")

  • SpatialLines

Line1 = Line(locations[1:3,])

Line2 = Line(locations[4:7,])

class(line_1)

Lines1 = Lines(list(Line1), ID = "a")

Lines2 = Lines(list(Line2), ID = "b")

class(Lines1)

SpLines = SpatialLines(LinesList = list(Lines1, Lines2))

SpLines@proj4string = crs(SpPoints)

str(SpLines)

  • Plot SpatialLines

plot(SpLines)

plot(SpLines, col = c("red", "blue"))

plot(SpLines, col = c("red", "blue"), lwd=2)

points(locations, pch=19, col="black")

box()

text(locations, names, pos = 4)

  • Incorporating covariate information in SpatialLines: SpatialLinesDataFrame

LineData = data.frame(c(0,1))

row.names(LineData) = c("a", "b")

SpLinesData = SpatialLinesDataFrame(SpLines, data = LineData)

class(SpLinesData)

crs(SpLinesData)= '+proj=longlat +datum=WGS84'

str(SpLinesData)

SpLinesData@lines

  • SpatialPolygons

P1 = Polygon(locations[1:3,])

P2 = Polygon(locations[4:7,])

PP = Polygons(list(P1, P2), ID = "a")

class(PP)

P1 = Polygons(list(P1), ID = "a")

P2 = Polygons(list(P2), ID = "b")

SpPolygons = SpatialPolygons(list(P1, P2))

class(SpPolygons)

  • Plot SpatialPolygons

plot(SpPolygons)

plot(SpPolygons, col= c("red", "blue"))

points(locations, pch=20, col="black")

box()

text(locations, names, pos = 3)

  • Incorporating covariate information in SpatialPolygons: SpatialPolygonsDataFrame

d = data.frame(y = c(0,1))

row.names(d) = c("a", "b")

SpPolygonsDF = SpatialPolygonsDataFrame(SpPolygons, data = d)

class(SpPolygonsDF)

str(SpPolygonsDF)

SpPolygonsDF@data

  • SpatialGrid

grid = GridTopology(c(60,5), c(1,2), c(10,10))

help(GridTopology)

SpGrid = SpatialGrid(grid)

crs(SpGrid) = '+proj=longlat +datum=WGS84'

plot(coordinates(SpGrid), pch = '.', col = "blue", cex = 3, xlab = 'lon', ylab = 'lat')

plot(SpGrid)

class(SpGrid)

str(SpGrid)

  • SpatialGridDataFrame

d = data.frame(x =runif(grid@cells.dim[1]* grid@cells.dim[2]))

SpGridDF = SpatialGridDataFrame(SpGrid, data )

class(SpGridDF)

str(SpGridDF)