Teaching Resources for Workshops
This page supports my students who are aspiring researchers, as well as the people I teach at various workshops. It is under construction, and I hope to complete it by the end of February. These resources primarily cover species distribution modelling, biogeography, and avian ecology. I hope to include more topics in the future as I navigate through different parts of academia.
Visualization of Data in R
At the start of your work in R, set up a working directory where your files are saved. This working directory should be the same path where you save the R script as a .R file. For example, I have saved my files in the folder "Workshop", within the folder "Teaching", on the "D" drive.
So I'll set my working directory as:
setwd("D:/Teaching/workshop")
Here, setwd() is the function, and the file path ("D:/Teaching/workshop") is the argument we pass to it. Keep in mind that whenever we write a file path or file name in an R script, we write it within quotation marks. R treats anything written within quotation marks as a character string.
We have set the working directory. To confirm which directory R is currently using, we write:
> getwd()
[1] "D:/Teaching/workshop"
Now to read your files, you can use functions such as read.csv() and read.table(), which come with base R, or read.xls(), which comes from an add-on package such as gdata. Here I give examples of reading csv and txt files.
For csv:
df1 <- read.csv("demodata.csv", header = TRUE)
Here, demodata.csv is the name of the csv file to be read. The logical argument, header = TRUE, tells R that the first row of the csv file contains the column names.
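For txt: the same idea works with read.table(). Below is a minimal sketch, assuming a tab-separated text file. The file is a temporary one filled with made-up values (both the path txt_path and the object df2 are hypothetical names), created only so the example runs on its own.

```r
# Hypothetical demo data, written to a temporary tab-separated .txt file
txt_path <- tempfile(fileext = ".txt")
demo <- data.frame(Number.of.Participant = c(1, 3, 5),
                   Attendance = c("y", "y", "n"),
                   Answer = c("a", "a", "b"))
write.table(demo, txt_path, sep = "\t", row.names = FALSE, quote = FALSE)

# header = TRUE: the first line holds the column names
# sep = "\t": the columns are separated by tabs
df2 <- read.table(txt_path, header = TRUE, sep = "\t")
str(df2)
```

If your txt file uses spaces or commas instead of tabs, change the sep argument accordingly.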
If you want to know the names of the columns, run the following code:
> names(df1)
[1] "Number.of.Participant" "Attendance" "Answer"
If you want to print the first six rows of the data, run
> head(df1)
  Number.of.Participant Attendance Answer
1                     1          y      a
2                     3          y      a
3                     5          n      b
4                     4          n      c
5                     6          n      d
6                     7          y      e
If you want to view the complete data set in a spreadsheet-style viewer, run
View(df1)
# A basic statistical summary of every column
summary(df1)
# Dimensions of the data
dim(df1)   # number of rows, then number of columns
nrow(df1)  # number of rows only
ncol(df1)  # number of columns only
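Once the data are loaded and inspected, you can start plotting them. Below is a minimal sketch using base R graphics. It rebuilds only the six rows printed by head(df1) above so that it runs without demodata.csv; the choice of plots and the name attendance_counts are my own assumptions for illustration.

```r
# Rebuild the six rows printed by head(df1) so the sketch runs without the csv file
df1 <- data.frame(
  Number.of.Participant = c(1, 3, 5, 4, 6, 7),
  Attendance = c("y", "y", "n", "n", "n", "y"),
  Answer = c("a", "a", "b", "c", "d", "e")
)

# Count how many rows fall in each attendance category
attendance_counts <- table(df1$Attendance)
print(attendance_counts)

# Bar plot of the counts
barplot(attendance_counts,
        xlab = "Attendance", ylab = "Number of rows",
        main = "Attendance in the demo data")

# Box plot of a numeric column split by a categorical column
boxplot(Number.of.Participant ~ Attendance, data = df1,
        xlab = "Attendance", ylab = "Number.of.Participant")
```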
For more, download the HTML file.
For today's hands-on assessment, download the data by clicking here.
Biostatistics
Regression types
Regression analysis is a statistical method used to identify and analyze the relationship between a dependent variable (response variable) and one or more independent variables (predictor variables). It is commonly used to make predictions or to identify the factors that influence an outcome.
There are several types of regression analysis, including:
Simple linear regression: This type of regression involves a single independent variable and a dependent variable. The goal is to identify a linear relationship between the variables. Here are some examples of simple linear regression:
Exam scores and study time: Suppose you want to determine if there is a linear relationship between the amount of time a student spends studying and their exam scores. You can collect data on the number of hours spent studying (independent variable) and the exam scores (dependent variable) of a group of students. You can then use simple linear regression to determine the relationship between the two variables.
Advertising spending and sales: Suppose you want to determine if there is a linear relationship between the amount of money spent on advertising and the sales of a product. You can collect data on the amount of money spent on advertising (independent variable) and the sales (dependent variable) of the product over a certain period of time. You can then use simple linear regression to determine the relationship between the two variables.
Age and height: Suppose you want to determine if there is a linear relationship between a person's age and their height. You can collect data on the ages (independent variable) and heights (dependent variable) of a group of people. You can then use simple linear regression to determine the relationship between the two variables.
In each of these examples, simple linear regression can be used to determine if there is a linear relationship between the independent and dependent variables, and to estimate the strength and direction of that relationship.
Multiple linear regression: This type of regression involves two or more independent variables and a dependent variable. The goal is to identify the relationship between each independent variable and the dependent variable while controlling for the effects of the other independent variables. Here are some examples of multiple linear regression:
House prices and square footage: Suppose you want to determine the relationship between house prices and various factors, including square footage, number of bedrooms, and number of bathrooms. You can collect data on the square footage, number of bedrooms, number of bathrooms (independent variables), and house prices (dependent variable) of a group of houses. You can then use multiple linear regression to determine the relationship between the independent variables and house prices, while controlling for the effects of the other variables.
Employee salary and experience: Suppose you want to determine the relationship between employee salary and various factors, including years of experience, education level, and job title. You can collect data on the years of experience, education level, job title (independent variables), and salaries (dependent variable) of a group of employees. You can then use multiple linear regression to determine the relationship between the independent variables and salaries, while controlling for the effects of the other variables.
Crop yield and weather conditions: Suppose you want to determine the relationship between crop yield and various factors, including temperature, precipitation, and soil type. You can collect data on the temperature, precipitation, soil type (independent variables), and crop yield (dependent variable) of a group of farms. You can then use multiple linear regression to determine the relationship between the independent variables and crop yield, while controlling for the effects of the other variables.
Polynomial regression: This type of regression involves fitting a polynomial function to the data. This is useful when the relationship between the variables is not linear. Here are some examples of multiple polynomial regression, in which more than one independent variable enters the polynomial:
Sales revenue and advertising spending: Suppose you want to determine the relationship between sales revenue and various factors, including advertising spending and time. You can collect data on the advertising spending, time (independent variables), and sales revenue (dependent variable) of a company over a certain period of time. You can then use multiple polynomial regression to fit a polynomial function to the data, and determine the relationship between the independent variables and sales revenue.
Electricity usage and temperature: Suppose you want to determine the relationship between electricity usage and various factors, including temperature and time of day. You can collect data on the temperature, time of day (independent variables), and electricity usage (dependent variable) of a building over a certain period of time. You can then use multiple polynomial regression to fit a polynomial function to the data, and determine the relationship between the independent variables and electricity usage.
Crop yield and fertilizer usage: Suppose you want to determine the relationship between crop yield and various factors, including fertilizer usage and soil type. You can collect data on the fertilizer usage, soil type (independent variables), and crop yield (dependent variable) of a group of farms. You can then use multiple polynomial regression to fit a polynomial function to the data, and determine the relationship between the independent variables and crop yield.
Logistic regression: This type of regression is used when the dependent variable is binary (i.e., takes on one of two possible values). It models the probability of the outcome taking on a particular value based on one or more predictor variables, which helps identify the factors that influence that probability. Here are some examples of logistic regression:
Credit card default prediction: Suppose you want to predict the likelihood of a credit card customer defaulting on their payments. You can collect data on the customer's credit score, income, age, and other relevant factors (predictor variables), and whether they have defaulted on their payments (binary outcome variable). You can then use logistic regression to model the probability of default based on the predictor variables.
Breast cancer diagnosis: Suppose you want to diagnose breast cancer based on various clinical and demographic factors. You can collect data on the patient's age, family history of breast cancer, mammogram results, and other relevant factors (predictor variables), and whether they have been diagnosed with breast cancer (binary outcome variable). You can then use logistic regression to model the probability of breast cancer based on the predictor variables.
Email spam detection: Suppose you want to detect whether an email is spam or not based on its content and other features. You can collect data on various email features, such as subject line, sender address, content, and attachments (predictor variables), and whether the email is spam or not (binary outcome variable). You can then use logistic regression to model the probability of the email being spam based on the predictor variables.
Time series regression: This type of regression is used when the dependent variable is a time series, that is, when the data are collected over time. The goal is to identify the relationship between the dependent variable and time, while controlling for other variables that may influence the outcome. Here are some examples of time series regression:
Stock price prediction: Suppose you want to predict the future price of a stock based on its historical price data. You can collect time series data on the stock price, as well as other relevant factors such as economic indicators, company earnings reports, and news articles (independent variables). You can then use time series regression to model the relationship between the independent variables and the stock price, and make predictions about future stock prices.
Demand forecasting: Suppose you want to forecast the demand for a certain product based on historical sales data. You can collect time series data on the sales of the product, as well as other relevant factors such as advertising spending, seasonality, and competitor activity (independent variables). You can then use time series regression to model the relationship between the independent variables and the demand for the product, and make predictions about future demand.
Climate modeling: Suppose you want to model the effects of climate change on a certain region based on historical weather data. You can collect time series data on the temperature, precipitation, and other weather variables, as well as other relevant factors such as greenhouse gas emissions and land use (independent variables). You can then use time series regression to model the relationship between the independent variables and the climate variables, and make predictions about future climate trends.
Steps of regression
Linear regression:
Collect and organize the data: Collect the data you need for the regression analysis and organize it in a table or spreadsheet. Label the columns for the dependent variable (Y) and the independent variable(s) (X).
Calculate the mean and standard deviation of the variables: Calculate the mean and standard deviation of the dependent variable (Y) and the independent variable(s) (X).
Calculate the covariance of the variables: Calculate the covariance of the dependent variable (Y) and the independent variable(s) (X).
Calculate the slope of the regression line: Calculate the slope of the regression line using the formula:
slope = covariance of X and Y / variance of X
Calculate the intercept of the regression line: Calculate the intercept of the regression line using the formula:
intercept = mean of Y - (slope * mean of X)
Plot the regression line: Plot the regression line on a scatter plot of the data. The slope of the line represents the relationship between the independent variable (X) and the dependent variable (Y).
Evaluate the fit of the regression line: Evaluate the fit of the regression line by calculating the coefficient of determination (R-squared value) and assessing the residuals (the difference between the predicted and actual values of Y).
Use the regression line to make predictions: Once you have a well-fitting regression line, you can use it to make predictions about the dependent variable (Y) for new observations or conditions by plugging in the values for the independent variable(s) (X).
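The steps above can be sketched in R as follows. This is a minimal illustration on made-up numbers (hours studied and exam scores are hypothetical values, not workshop data); it computes the slope and intercept by hand from the formulas and then cross-checks them against R's built-in lm() function.

```r
# Step 1: made-up data, hours studied (X) and exam score (Y)
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(52, 55, 61, 64, 70, 71, 78, 83)

# Steps 2-3: means and covariance
mean_x <- mean(x)
mean_y <- mean(y)
cov_xy <- cov(x, y)

# Steps 4-5: slope and intercept from the formulas
slope <- cov_xy / var(x)
intercept <- mean_y - slope * mean_x

# Cross-check against R's built-in linear model
fit <- lm(y ~ x)
print(c(intercept = intercept, slope = slope))
print(coef(fit))

# Step 6: scatter plot with the regression line
plot(x, y, xlab = "Hours studied", ylab = "Exam score")
abline(a = intercept, b = slope)

# Step 7: R-squared and residuals
print(summary(fit)$r.squared)
print(residuals(fit))

# Step 8: prediction for a new observation (9 hours of study)
print(predict(fit, newdata = data.frame(x = 9)))
```

In practice you would call lm() directly; the hand calculation is here only to show that it gives the same line.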
Multiple linear regression:
Collect and organize the data: Collect the data you need for the regression analysis and organize it in a table or spreadsheet. Label the columns for the dependent variable (Y) and the independent variables (X1, X2, ... Xn).
Calculate the mean and standard deviation of the variables: Calculate the mean and standard deviation of the dependent variable (Y) and the independent variables (X1, X2, ... Xn).
Calculate the covariance matrix of the variables: Calculate the covariance matrix of the dependent variable (Y) and the independent variables (X1, X2, ... Xn).
Calculate the coefficients of the regression equation: Calculate the coefficients of the regression equation using the formula:
coefficients = inverse of (transpose of X * X) * transpose of X * Y
where X is the matrix of independent variables (including a column of ones for the intercept term) and Y is the vector of dependent variable values.
Use the coefficients to form the regression equation: Use the coefficients obtained in step 4 to form the multiple regression equation. The equation will have the form:
Y = b0 + b1X1 + b2X2 + ... + bnXn
Evaluate the fit of the regression equation: Evaluate the fit of the regression equation by calculating the coefficient of determination (R-squared value) and assessing the residuals (the difference between the predicted and actual values of Y).
Use the regression equation to make predictions: Once you have a well-fitting regression equation, you can use it to make predictions about the dependent variable (Y) for new observations or conditions by plugging in the values for the independent variables (X1, X2, ... Xn).
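The matrix formula above can be tried out in R. The sketch below uses simulated data (the variables x1 and x2 and their coefficients are invented for illustration) and compares the hand-computed coefficients with those from lm().

```r
# Simulated data: two predictors and one response (all values are made up)
set.seed(1)
n <- 50
x1 <- runif(n, 50, 200)               # e.g. square footage
x2 <- sample(1:5, n, replace = TRUE)  # e.g. number of bedrooms
y <- 20 + 0.8 * x1 + 15 * x2 + rnorm(n, sd = 5)

# Design matrix X with a column of ones for the intercept term
X <- cbind(1, x1, x2)

# coefficients = inverse of (transpose of X * X) * transpose of X * Y
b <- solve(t(X) %*% X) %*% t(X) %*% y

# The same model fitted with R's built-in function
fit <- lm(y ~ x1 + x2)
print(cbind(matrix_formula = as.vector(b), lm = coef(fit)))

# Fit evaluation and prediction
print(summary(fit)$r.squared)
print(predict(fit, newdata = data.frame(x1 = 120, x2 = 3)))
```

lm() solves the same least-squares problem with a numerically more stable method, so it is the better choice for real data.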
Polynomial regression:
Collect and organize the data: Collect the data you need for the regression analysis and organize it in a table or spreadsheet. Label the columns for the dependent variable (Y) and the independent variable (X).
Plot the data: Plot the data on a scatter plot to visualize the relationship between the dependent variable (Y) and the independent variable (X).
Choose the degree of the polynomial: Choose the degree of the polynomial that best fits the data. This can be done by visually inspecting the scatter plot or using statistical methods such as the adjusted R-squared or Akaike Information Criterion (AIC).
Create a matrix of predictor variables: Create a matrix of predictor variables by raising the independent variable (X) to the powers of the chosen degree. For example, if the degree is 2, create a matrix with columns for X, X^2, and a column of ones for the intercept term.
Calculate the coefficients of the regression equation: Calculate the coefficients of the regression equation using the formula:
coefficients = inverse of (transpose of X * X) * transpose of X * Y
where X is the matrix of predictor variables and Y is the vector of dependent variable values.
Use the coefficients to form the regression equation: Use the coefficients obtained in step 5 to form the polynomial regression equation. The equation will have the form:
Y = b0 + b1X + b2X^2 + ... + bnX^n
Evaluate the fit of the regression equation: Evaluate the fit of the regression equation by calculating the coefficient of determination (R-squared value) and assessing the residuals (the difference between the predicted and actual values of Y).
Use the regression equation to make predictions: Once you have a well-fitting regression equation, you can use it to make predictions about the dependent variable (Y) for new observations or conditions by plugging in the values for the independent variable (X).
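A minimal R sketch of these steps, on simulated data with a built-in quadratic relationship (the numbers are made up for illustration). It compares polynomials of degree 1 to 3 using AIC and then checks the matrix formula for the degree-2 model.

```r
# Simulated data with a curved (quadratic) relationship
set.seed(2)
x <- seq(0, 10, length.out = 40)
y <- 3 + 2 * x - 0.3 * x^2 + rnorm(length(x), sd = 0.5)

# Step 2: look at the data
plot(x, y)

# Step 3: compare candidate degrees; lower AIC indicates a better trade-off
fit1 <- lm(y ~ x)
fit2 <- lm(y ~ x + I(x^2))
fit3 <- lm(y ~ x + I(x^2) + I(x^3))
print(AIC(fit1, fit2, fit3))

# Steps 4-5: matrix of predictors for degree 2 and the matrix formula
X <- cbind(1, x, x^2)
b <- solve(t(X) %*% X) %*% t(X) %*% y
print(cbind(matrix_formula = as.vector(b), lm = coef(fit2)))

# Fit evaluation and prediction for a new value of X
print(summary(fit2)$r.squared)
print(predict(fit2, newdata = data.frame(x = 11)))
```

The I() wrapper tells R to treat x^2 as arithmetic (x squared) inside the model formula.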
Logistic regression:
Collect and organize the data: Collect the data you need for the regression analysis and organize it in a table or spreadsheet. Label the columns for the dependent variable (Y) and the independent variable(s) (X1, X2, ... Xn).
Transform the dependent variable: Logistic regression models a binary outcome (0 or 1). If your dependent variable (Y) is not already in binary form, you will need to transform it. For example, you could transform a continuous variable into a binary variable by assigning a value of 1 to observations that exceed a certain threshold and 0 to those that do not.
Plot the data: Plot the data on a scatter plot to visualize the relationship between the dependent variable (Y) and the independent variable(s) (X1, X2, ... Xn).
Choose a model: Choose a logistic regression model that best fits the data. This can be done by visually inspecting the scatter plot or using statistical methods such as the Akaike Information Criterion (AIC) or Bayesian Information Criterion (BIC).
Calculate the coefficients of the logistic regression equation: Calculate the coefficients of the logistic regression equation using maximum likelihood estimation. This involves finding the values of the coefficients that maximize the likelihood of observing the data given the model.
Use the coefficients to form the logistic regression equation: Use the coefficients obtained in step 5 to form the logistic regression equation. The equation will have the form:
log(odds) = b0 + b1X1 + b2X2 + ... + bnXn
where odds = P(Y=1) / P(Y=0)
Evaluate the fit of the regression equation: Evaluate the fit of the regression equation by assessing the goodness of fit of the model and calculating the classification accuracy.
Use the logistic regression equation to make predictions: Once you have a well-fitting logistic regression equation, you can use it to make predictions about the dependent variable (Y) for new observations or conditions by plugging in the values for the independent variable(s) (X1, X2, ... Xn) and using the logistic function to convert the predicted log-odds to probabilities.
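In R, logistic regression is usually fitted with glm() and family = binomial, which performs the maximum likelihood estimation for you. The sketch below uses simulated data (the predictors x1, x2 and their true coefficients are assumptions made for illustration) and shows how predicted log-odds are converted to probabilities with the logistic function plogis().

```r
# Simulated binary outcome driven by two predictors (all values are made up)
set.seed(3)
n <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
p_true <- plogis(-0.5 + 1.5 * x1 - 1 * x2)
y <- rbinom(n, size = 1, prob = p_true)

# Maximum likelihood fit of log(odds) = b0 + b1*x1 + b2*x2
fit <- glm(y ~ x1 + x2, family = binomial)
print(coef(fit))
print(AIC(fit))

# Prediction for a new observation: log-odds first, then probability
new_obs <- data.frame(x1 = 0.5, x2 = -1)
log_odds <- predict(fit, newdata = new_obs, type = "link")
prob <- plogis(log_odds)
prob_direct <- predict(fit, newdata = new_obs, type = "response")
print(c(from_log_odds = unname(prob), direct = unname(prob_direct)))

# Classification accuracy on the fitted data, using a 0.5 threshold
pred_class <- as.integer(fitted(fit) > 0.5)
accuracy <- mean(pred_class == y)
print(accuracy)
```

The 0.5 threshold is a common default, not a rule; for rare events a different cut-off is often more sensible.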
Time series regression:
Collect and organize the data: Collect the time series data you need for the regression analysis and organize it in a table or spreadsheet. Label the columns for the dependent variable (Y) and the independent variable(s) (X1, X2, ... Xn), along with the date or time stamp for each observation.
Plot the data: Plot the time series data on a line graph to visualize the relationship between the dependent variable (Y) and the independent variable(s) (X1, X2, ... Xn) over time.
Calculate the autocorrelation function (ACF) and partial autocorrelation function (PACF): Calculate the ACF and PACF of the dependent variable (Y) and the independent variable(s) (X1, X2, ... Xn) to assess the extent of autocorrelation in the data.
Choose a model: Choose a time series regression model that best fits the data. This can be done by visually inspecting the line graph and ACF/PACF plots or using statistical methods such as the Akaike Information Criterion (AIC) or Bayesian Information Criterion (BIC).
Calculate the coefficients of the time series regression equation: Calculate the coefficients of the time series regression equation using maximum likelihood estimation. This involves finding the values of the coefficients that maximize the likelihood of observing the data given the model.
Use the coefficients to form the time series regression equation: Use the coefficients obtained in step 5 to form the time series regression equation. The equation will have the form:
Y(t) = b0 + b1X1(t) + b2X2(t) + ... + bnXn(t) + e(t)
where Y(t) is the dependent variable at time t, X1(t), X2(t), ... Xn(t) are the independent variables at time t, and e(t) is the error term at time t.
Evaluate the fit of the regression equation: Evaluate the fit of the regression equation by assessing the goodness of fit of the model and calculating the mean absolute percentage error (MAPE) or root mean squared error (RMSE).
Use the time series regression equation to make predictions: Once you have a well-fitting time series regression equation, you can use it to make predictions about the dependent variable (Y) for future time periods by plugging in the values for the independent variable(s) (X1, X2, ... Xn) for those time periods.
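A minimal R sketch of this workflow on a simulated monthly series (all numbers are made up). It assumes the simplest case, a linear trend plus one predictor with independent normal errors, so lm() is used; under that assumption least squares gives the same coefficients as maximum likelihood. If the residuals turn out to be autocorrelated, a model with an explicit error structure would be needed instead.

```r
# Simulated series: 60 time steps, a trend, and one seasonal predictor
set.seed(4)
n <- 60
t_idx <- 1:n
x1 <- 10 + sin(2 * pi * t_idx / 12) + rnorm(n, sd = 0.3)
y <- 5 + 0.1 * t_idx + 2 * x1 + rnorm(n, sd = 0.5)

# Steps 2-3: line plot, ACF and PACF of the dependent variable
plot(t_idx, y, type = "l", xlab = "Time", ylab = "Y")
acf(y)
pacf(y)

# Hold out the last 12 time steps to evaluate predictions
train <- data.frame(t_idx = t_idx[1:48], x1 = x1[1:48], y = y[1:48])
test <- data.frame(t_idx = t_idx[49:60], x1 = x1[49:60], y = y[49:60])

# Steps 4-6: Y(t) = b0 + b1 * t + b2 * X1(t) + e(t)
fit <- lm(y ~ t_idx + x1, data = train)
print(coef(fit))

# Check whether the residuals still show autocorrelation
acf(residuals(fit))

# Steps 7-8: predict the held-out period and compute RMSE and MAPE
pred <- predict(fit, newdata = test)
rmse <- sqrt(mean((test$y - pred)^2))
mape <- mean(abs((test$y - pred) / test$y)) * 100
print(c(RMSE = rmse, MAPE = mape))
```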
Application of GIS and remote sensing in spatial ecology through Species Distribution Modelling (SDM)
Ecological modelers apply remote sensing and GIS in four major fields of Ecology: Ecosystem monitoring, Biogeography, Land use/land cover monitoring, and Ecological Informatics. All four areas ultimately contribute to conservation policy-making and urbanization. So, in a broader sense, these two are integral parts of the outcome of remote sensing and GIS in Ecology. As a researcher, you generally ask the following questions in each field, and you need GIS and remote sensing data to answer them:
Ecosystem monitoring:
(a) What are the components of a functioning ecosystem?
(b) How is their abundance changing in different locations?
(c) What are the impacts of the changing components?
Biogeography:
(a) Where and when to find your species of interest?
(b) How is your species of interest distributed: patchy, continuous, seasonal, etc.?
(c) Where can the species go and invade?
Land use/land cover monitoring:
(a) How much land is covered by a particular activity?
(b) Which parts of the landscape have conflicting activities?
(c) Where is conservation necessary?
Ecological geospatial Informatics:
(a) What will be the impact of your species of interest, and where?
(b) Where will management be necessary?
(c) How can metadata be integrated to devise conservation plans?
If you think about these questions, you will notice a common pattern. All of them ask for the location of something you are interested in and how that location may change over time. This information about where and when the object, action, or species of your interest will occur is called its spatiotemporal distribution. You may be interested in the spatial distribution, the temporal distribution, or both. In any case, you cannot always go and survey everywhere to obtain the data on the distribution. This is when you need the modeling part. You can model the data you obtained from surveying some parts of the field at some times, and use this information to predict where the species can be in other parts of the field at other times.
Understanding how your species is distributed over space and time requires an understanding of what the resources of your species are, and how those resources are available to it over space and time. For example, chalk, dusters, a blackboard, lights, etc. are the resources I need inside a classroom to teach you a topic. There is plenty of chalk, dusters, and light in the classroom, but only one blackboard. So I tend to stand near the blackboard. In the same way, a species may utilize many resources, but only the scarcest resource decides where the species will be located. This concept is known as Liebig's law of the minimum. There can also be more than one limiting resource. In other cases, a resource is more abundant than the others but is consumed by the species at a much higher rate. Resources with such high consumption rates are also important in deciding the location of the species.
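The classroom example can be written as a small calculation. In the sketch below, the amounts available and the amounts needed per lesson are invented numbers; the limiting resource is taken to be the one with the smallest ratio of availability to need, which is one simple way to express Liebig's law of the minimum while also accounting for consumption rates.

```r
# Hypothetical classroom resources: how much is available and how much one lesson needs
resources <- data.frame(
  resource = c("chalk", "duster", "light", "blackboard"),
  available = c(20, 5, 8, 1),
  needed = c(2, 1, 1, 1)
)

# How many lessons each resource could support on its own
resources$supply_ratio <- resources$available / resources$needed
print(resources)

# The limiting resource is the one with the smallest supply ratio
limiting <- resources$resource[which.min(resources$supply_ratio)]
print(limiting)
```

Note that chalk is the most abundant resource here but is also used up fastest, so its supply ratio is lower than its raw abundance suggests.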
Therefore, to understand a species' locations, all the limiting resources and their consumption/ utilization rates are important to know. This concept of how many resources and how much of them are utilized by a species is known as the niche of a species. This is why we often refer to species distribution models as environmental niche models.
There are three concepts of niche in Ecology:
(a) Grinnellian niche: The habitat or geographic space that has all the suitable resources for your species. When wildlife scientists, autecologists, and conservation scientists want to predict where these habitats are located and estimate how much of the habitat the species occupies, they use species distribution modelling frameworks known as the habitat suitability model and the habitat utilization model. These models provide a habitat suitability index and a utilization score.
(b) Eltonian niche: Elton, in 1927, suggested that the niche is the role of a species in a biotic community. The biotic role, such as competitor, cooperator, prey, or predator, may vary over a large geographic scale. This variation can be random too. This is why conventional SDMs may treat biotic interactions over a large geography as noise (Eltonian noise) in the species distribution. I'll discuss the concept of Eltonian noise in detail later in this series. Nevertheless, there are some scenarios where you may need to model the Eltonian niche to understand the species distribution. For example, I modeled both the Eltonian niche and the Grinnellian niche to predict the distribution of the successful nesting sites of Merops philippinus (read Ghosh et al. 2022).
(c) Hutchinson's niche: If a species utilizes n resources, each in a different amount, plotting the resources against the amounts utilized gives an n-dimensional hypervolume. That n-dimensional hypervolume is Hutchinson's niche. One example of modelling this niche in SDM is predicting the distribution of an endemic community. Integrative species distribution models are good predictors of Hutchinson's niche.
Data collection for SDM
Geospatial data is a key component of Species Distribution Modeling (SDM), which is a technique used to predict the potential distribution of a species based on environmental variables. Some of the types of geospatial data used in SDM include:
Environmental variables: These are biophysical and climatic variables that can influence the distribution of a species, such as temperature, precipitation, elevation, soil type, and land cover.
Occurrence data: This refers to the geographic locations where a species has been observed or recorded, either through direct observation or inferred from other sources such as museum records, literature, or citizen science databases.
Habitat suitability data: This is a type of geospatial data that represents the potential suitability of different areas for a species based on environmental conditions. Habitat suitability models are often used in SDM to predict the potential distribution of a species.
Landscape variables: These include features such as topography, hydrology, and vegetation cover that can influence the suitability of a habitat for a species.
Climate models: These are models, typically physically based numerical simulations, that project future climate conditions under various scenarios of greenhouse gas emissions and other factors. Climate models can be used in SDM to project how the distribution of a species may shift under different climate scenarios.
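To see how occurrence data and environmental variables come together, here is a minimal sketch in R. Everything in it is simulated: the site coordinates, the temperature and precipitation values, and the species' preference for warm, wet sites are assumptions made for illustration. A logistic regression (glm) stands in for the SDM here because it is simple; it is only one of many modelling methods used in practice.

```r
# Simulated survey sites with coordinates and two environmental variables
set.seed(5)
n_sites <- 150
sites <- data.frame(
  lon = runif(n_sites, 85, 90),
  lat = runif(n_sites, 21, 27),
  temperature = runif(n_sites, 10, 35),       # degrees Celsius
  precipitation = runif(n_sites, 500, 3000)   # mm per year
)

# Simulated occurrence data: 1 = species recorded, 0 = not recorded
p <- plogis(-8 + 0.2 * sites$temperature + 0.002 * sites$precipitation)
sites$presence <- rbinom(n_sites, size = 1, prob = p)

# A simple SDM: probability of presence as a function of the environment
sdm <- glm(presence ~ temperature + precipitation,
           data = sites, family = binomial)
print(coef(sdm))

# Predict habitat suitability for unsurveyed environmental conditions
new_area <- expand.grid(temperature = seq(10, 35, by = 5),
                        precipitation = seq(500, 3000, by = 500))
new_area$suitability <- predict(sdm, newdata = new_area, type = "response")
print(head(new_area))
```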
We ecologists and geostatisticians often collect these datasets using remote sensing.
Remote sensing refers to the process of gathering information about an object or area without physically contacting it. This is typically done using sensors mounted on aircraft, satellites, or other platforms that capture data in the form of electromagnetic radiation. The data can then be processed and analyzed to generate information about the object or area being studied.
There are different types of remote sensing data, including:
Optical data: This is the most common type of remote sensing data and is captured by sensors that detect visible and near-infrared light. This data can be used to generate information about vegetation health, land use and land cover, and other features.
Thermal data: This type of remote sensing data captures information about the temperature of the Earth's surface. It can be used to generate information about areas of high heat flux, such as volcanic activity or wildfires.
Radar data: This type of remote sensing data uses radio waves to detect objects on the Earth's surface. It can penetrate clouds and vegetation, making it useful for generating information about topography, soil moisture, and other features.
The collection process for remote sensing data typically involves the following steps:
Platform selection: A platform, such as a satellite or aircraft, is chosen based on the specific needs of the study.
Sensor selection: A sensor is chosen based on the type of data needed and the specifications of the platform.
Data acquisition: The sensor captures data by scanning the Earth's surface from the platform.
Data transmission: The data is transmitted to a ground station for processing.
Pre-processing: The data is processed to correct for atmospheric effects and other factors that can affect the quality of the data.
Image processing: The data is processed to generate images and other information about the Earth's surface.
Analysis: The data is analyzed to generate information about the object or area being studied, such as land use and land cover, vegetation health, or soil moisture.
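As a small illustration of the analysis step for optical data, the sketch below computes the Normalized Difference Vegetation Index (NDVI), a widely used indicator of vegetation health, from red and near-infrared reflectance. The two 5 x 5 matrices are simulated stand-ins for real image bands, and the 0.3 threshold for flagging vegetation is an arbitrary value chosen for illustration.

```r
# Simulated reflectance values for a 5 x 5 pixel image (not real satellite data)
set.seed(6)
red <- matrix(runif(25, 0.05, 0.30), nrow = 5)
nir <- matrix(runif(25, 0.20, 0.60), nrow = 5)

# NDVI = (NIR - Red) / (NIR + Red); values always lie between -1 and 1
ndvi <- (nir - red) / (nir + red)
print(round(ndvi, 2))
print(range(ndvi))

# Display the NDVI values as an image
image(ndvi, main = "NDVI (simulated)")

# Flag pixels above an illustrative vegetation threshold
veg_mask <- ndvi > 0.3
print(sum(veg_mask))
```

With real imagery you would read the bands from raster files using a spatial package instead of simulating matrices.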