Introduction
Species distribution modelling attempts to link bat sightings (species samples) with other geographic variables, such as terrain, vegetation or altitude above sea level. The process differs fundamentally from the modelling methods used to derive the bat activity maps in that it is "predictive" rather than "interpolative". A web search on species distribution modelling will turn up several methods in common use, the most notable being:
MaxEnt: Maximum Entropy Model (presence only model)
GARP: Genetic Algorithm for Rule Set Production
GAM: Generalised Additive Model
GLM: Generalised Linear Model
+ many more...
Each technique has its own advantages, and a really good summary of the pros and cons can be found here. For this project, I was looking for something ready-built (and cheap), or something that would be easy to implement in "R", and one that required presence-only sample data. This very quickly focussed my attention on MaxEnt, a piece of Java software written and freely distributed by Princeton University. The Bat Cave page will, in due course, have details of where to get the software and, more importantly, how to prepare the sample and environmental data (this was more of a headache than running the model...).
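As an aside, if you'd rather drive MaxEnt from R than from its GUI, the dismo package provides a wrapper around maxent.jar. A minimal sketch of the setup check (this assumes you've already downloaded maxent.jar; dismo expects to find it in its own java folder):

    # Sketch: using the dismo wrapper around maxent.jar from R
    library(dismo)
    jar <- file.path(system.file("java", package = "dismo"), "maxent.jar")
    if (!file.exists(jar)) {
      stop("Copy maxent.jar into: ", system.file("java", package = "dismo"))
    }

The later snippets on this page use this wrapper, but every option shown maps one-to-one onto a setting in the MaxEnt GUI.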
I've also looked at OpenModeller, a complete framework for running SDM experiments that includes several modelling algorithms. It's only worth a look if you are not going to use categorical data in your models, as all the algorithms available within this framework sadly lack this capability at present (March 2013). So it turned out to be of little use for my needs (why would you ignore categorical data in general-purpose SDMs?). This is a great pity, as it's a pretty fundamental limitation, but it's worth keeping an eye on this package for future developments.
Also, if you want to know more about the mathematics behind the various modelling methods, make sure you visit Prof. Andrew Moore's tutorial page (School of Computer Science, Carnegie Mellon University). It focuses on data-mining applications, but the maths is the same, and the slides show a wicked sense of humour too... these are not your average maths lectures!
Parkhurst Forest Model
I've used up to six environment variables in the model:
Tree species (includes grassland types and built-over land, giving 10 sub-variables)
Altitude
Nearest distance to the inside edge (metres)
Nearest distance to the outside edge (metres)
Nearest distance to a hedgerow centre-line (metres)
Nearest distance to a stream centre-line (metres)
There is significant diversity in the forest planting, and each species is given a separate categorical variable in the model, except for some large areas of mixed broadleaf woodland (a mixture of oak, ash, beech, hornbeam and sweet chestnut), which are treated as a single category. The environment grids were prepared using Quantum GIS (QGIS) to capture the forest plantation plan and then create the uniform raster files required by MaxEnt. Note: you now have the option of downloading open GIS data for Forestry Commission run estates, so make sure you visit this link if your project includes any Forestry Commission areas. It won't give you everything, but it will potentially save you days of mapping! Another useful and free map resource in the UK is the Ordnance Survey open GIS data located here.
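To give an idea of the grid preparation, here is a hedged R sketch of the same job using the raster package: rasterise the digitised plantation polygons onto a template grid, resample the continuous layers onto that grid, and write everything out as the ESRI ASCII grids MaxEnt expects. The extent and file names are illustrative, not the real project files:

    library(raster)

    # Template grid: 0.0001-degree cells over the study area (illustrative extent)
    template <- raster(xmn = -1.32, xmx = -1.27, ymn = 50.69, ymx = 50.73,
                       resolution = 0.0001, crs = "+proj=longlat +datum=WGS84")

    # Categorical layer: rasterise the plantation polygons by species code
    plantation <- rasterize(shapefile("plantation.shp"), template, field = "species_id")

    # Continuous layer: resample onto the same grid so all layers line up exactly
    altitude <- resample(raster("altitude_raw.tif"), template, method = "bilinear")

    # MaxEnt reads ESRI ASCII grids; every layer must share extent and cell size
    writeRaster(plantation, "layers/plantation.asc", format = "ascii", overwrite = TRUE)
    writeRaster(altitude,   "layers/altitude.asc",   format = "ascii", overwrite = TRUE)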
The altitude data was painstakingly reconstructed in QGIS from the Ordnance Survey map. The table below gives some idea of what this data looks like. The plantation map is categorical; the altitude and edge-distance grids are continuous variables.
Prompted by work carried out by Chloe Bellamy at Leeds University as part of her PhD (and by discussions with Prof. John Altringham at Leeds in November 2012), I decided to add some further environment variables: the land gradient and compass aspect. These grids were derived from the raw altitude data using the terrain analysis QGIS plug-in. Although the altitude range is small (circa 80m), the area is relatively exposed, so compass aspect may be important, and there are some significant gradient variations. These models have now been optimised and are available to view.
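The same derivation can be done outside QGIS if that's more convenient; a short sketch using the raster package, continuing from the altitude grid in the earlier sketch:

    # Derive gradient (slope) and compass aspect from the altitude grid;
    # this mirrors what the QGIS terrain analysis plug-in produces
    slope  <- terrain(altitude, opt = "slope",  unit = "degrees", neighbors = 8)
    aspect <- terrain(altitude, opt = "aspect", unit = "degrees")
    writeRaster(slope,  "layers/gradient.asc", format = "ascii", overwrite = TRUE)
    writeRaster(aspect, "layers/aspect.asc",   format = "ascii", overwrite = TRUE)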
[Environment grid maps: Plantation Map, Altitude, Exterior Edge Function, Compass Aspect, Interior Edge Function, Hedgerow Proximity, Running Water Proximity, Land Gradient (rate of slope change)]
You can see the plantation plan is somewhat busier than the one on the activity map pages; this is because I've included all the detail from the Forestry Commission's published plan and updated it with on-site visits. The resolution is 0.0001 degrees, or about 7.5m (note: a regular lat/lon grid is not regular in distance), which gives about 430x300 points in the model matrix. The sample data was extracted from the database in the form of a simple CSV file, with a species, longitude, latitude format. I did not remove duplicates, as I can readily do this in MaxEnt as one of the run-time options.
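For reference, the samples file really is that simple: a header row of species, longitude, latitude, followed by one row per record. A hedged sketch of loading it and handing it to MaxEnt via the dismo wrapper (the species and file names are illustrative):

    library(dismo)
    library(raster)

    # Presence samples: species, longitude, latitude
    samples <- read.csv("bat_samples.csv")
    pips <- samples[samples$species == "Common Pipistrelle",
                    c("longitude", "latitude")]

    # Stack the prepared .asc grids; "plantation" is flagged as categorical
    env <- stack(list.files("layers", pattern = "\\.asc$", full.names = TRUE))
    me <- maxent(env, pips, factors = "plantation",
                 args = c("removeduplicates=true"))  # the run-time duplicate option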
Generalised Model Results
Some results using the sample data between March and August 2012 are given in the table below. These results give the logistic output (probability of presence), and each is the median of 50 trials with randomly selected test points. I've included these to demonstrate how I've assessed the general fit and performance of the model outputs, which are given on separate pages. Note: a logistic output of 0.5 indicates a normal or average habitat; red indicates very favourable habitat, and blue not very favourable. For each of these I've included the response curves as well, to show the key environment drivers, and the ROC (Receiver Operating Characteristic) curve. The response graphs give some indication of the importance of the modelled variables, and the ROC curve shows whether the model has any statistical significance. The category key for the plantation plan is: 1=Mixed Broadleaf, 2=Pines, 3=Spruces, 4=Oaks, 5=Beeches, 6=Larches, 7=Scrubland, 8=Ashes, 9=Improved Grassland. Note: the data on this page are not kept up to date with the current field survey data, so please refer to the individual species appraisals linked from the Parkhurst SDM page for the most up-to-date information on each species.
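For the record, runs along these lines can be reproduced with settings like the following (a sketch; the test-point percentage is my assumption, not a recorded project setting):

    # 50 sub-sampled replicates, logistic output, with response curves for each run
    me <- maxent(env, pips, factors = "plantation",
                 args = c("replicates=50",
                          "replicatetype=subsample",
                          "randomtestpoints=25",    # hold back 25% of samples as test points
                          "outputformat=logistic",
                          "responsecurves=true"))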
Barbastelle
[Figures: species distribution map; response curves for altitude and plantation; ROC curve]
The initial view here is that the model fit is reasonably tight, with the ROC curve indicating an estimate significantly better than 50/50 chance. There appear to be stronger links to broadleaf woodland than to conifer or grassland.
Serotine Bat
[Figures: species distribution map; response curves for altitude and plantation; ROC curve]
The model fit for altitude may not be that significant here, and the key woodland areas seem to be ash followed by the conifers.
Leisler's Bat
[Figures: species distribution map; response curves for altitude and plantation; ROC curve]
The overall fit looks good, with some significance indicated for the spruce and scrub (recovering from fire damage) areas.
Noctule Bat
[Figures: species distribution map; response curves for altitude and plantation; ROC curve]
Another good fit, with some significance to the ash and oak areas.
Common Pipistrelle
[Figures: species distribution map; response curves for altitude and plantation; ROC curve]
This appears to be a very tight fit, perhaps over-fitted in this case due to the high number of samples. There seems to be a slightly stronger link to the deciduous areas, though.
Soprano Pipistrelle
[Figures: species distribution map; response curves for altitude and plantation; ROC curve]
The fit here is less extreme than the common pipistrelle above, and shows similar links to the plantation variables.
Replicate Strategy
I've had a few emails with questions about the best replicate strategy to use for model validation. If you've read the basic MaxEnt tutorial and the other reference material, you will realise that there are three different methods incorporated into MaxEnt. The best method to use will depend to some extent on the number of presence samples you have, but in my experience the sub-sampling method provides the most reliable technique for validation. The reason is that the number of replicates used to generate the validation statistics can be increased (within the limits of meaningful combinations, of course) to provide a good view of the model fit, and it is easy to control how many samples are used for training and how many for testing. The cross-validation method, on the other hand, has the advantage that all the presence points get used for both training and testing, whereas the sub-sample method excludes the testing points from the training run; however, with cross-validation you need to keep the number of folds (replicates) down to a small number (say 5-10 for 100 samples). Bootstrapping is like the sub-sample method, but it samples the training points with replacement, feeding them back into the model; I would only use this if I was short of presence samples (say fewer than 30).
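In terms of MaxEnt's settings, the three strategies look like this (a sketch; the replicate and percentage figures are just the illustrative numbers from the paragraph above):

    # Sub-sampling: repeated random train/test splits (my usual choice)
    subsample <- c("replicatetype=subsample", "replicates=50", "randomtestpoints=25")

    # Cross-validation: every presence point used for training and testing,
    # but keep the fold count small (say 5-10 folds for ~100 samples)
    crossval <- c("replicatetype=crossvalidate", "replicates=10")

    # Bootstrapping: training points sampled with replacement; for small sets (<30)
    bootstrap <- c("replicatetype=bootstrap", "replicates=50")

    me <- maxent(env, pips, factors = "plantation", args = subsample)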
Feature Selection
I've also had a few questions regarding the feature selection process. My advice is to initially run with all feature types enabled, but examine the output data for each replicate to see what has been used by MaxEnt. The models I've produced tend to use the linear, quadratic and hinge features, which produce smooth feature curves. The downside is that true step functions will get suppressed unless the model also uses the threshold feature; however, I've found that including the threshold feature can give unrealistic steps in the feature curves under some circumstances, so it's probably best to use this feature only if you have a large presence sample set. I've put an example below where you can see that an unrealistic step has been produced in the feature response curve.
[Figures: response curve with no threshold feature; three examples with the threshold feature enabled]
These examples are taken from the same basic MaxEnt model, but the first example has the threshold feature disabled.
The step function around 450m is very pronounced, but there is no obvious physical reason why the peak is there. Before discounting the use of the threshold feature, check all the replicates and see if it is a common feature across all the training scenarios; if not, as in this case, it's very likely to be just a modelling artefact and not a real physical relationship. A narrow corridor close to the water, followed by a slower decay in presence probability away from this corridor, is a far more realistic result in this case.
My advice is to carefully check the physical basis of any unusual or unexpected results when using threshold features...
Product features start to become useful where there is the possibility of an interaction between two environment variables, such as the ground slope and compass aspect in the particular models I've put together. The downside is that the use of product features makes it difficult to interpret the model results, and you cannot use the "explain tool" to investigate the detail of the model. As with any mathematical model, there needs to be a physical explanation for any output, so if you can't explain the output from a model in physical terms, chances are that it is overfitted, biased, or incorrect.
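For completeness, the feature classes can be toggled explicitly rather than leaving MaxEnt on auto-features; a sketch of the combination I tend to end up with:

    # Turn auto-selection off and pick the feature classes explicitly:
    # linear/quadratic/hinge give smooth response curves
    features <- c("autofeature=false",
                  "linear=true", "quadratic=true", "hinge=true",
                  "threshold=false",  # enable only with a large presence sample set
                  "product=false")    # enable where variable interactions are plausible
    me <- maxent(env, pips, factors = "plantation", args = features)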
Model Assessment
The most important aspect of any modelling activity is the assessment of the results. With MaxEnt you have a number of model tests built into the software at your disposal: the ROC (Receiver Operating Characteristic) and jack-knife tests. The ROC plot provides a picture of how many of the test samples were true positives or negatives against the number that were false positives or negatives. There is a nice and simple description of this on Wikipedia here (ROC) and here (Sensitivity and Specificity).
An example is given below to the left. In this example, the black line represents 50/50 chance. If your data (the mean AUC, which is the area under the ROC curve) is close to this line, then your model is not a good one. Typically, models can be considered good if they produce an AUC of 0.7 to 0.8, and very good if the AUC is greater than 0.8. AUCs less than 0.7 are marginal and unlikely to be of much use.
The downside of this approach is that the absence data is estimated by the model itself (as background points) and is not based on true field data; however, the method used in the MaxEnt model is reasonably robust, and the ROC and AUC provide a reasonable test for most applications. There are a number of other validation methods that can be used to post-process the data output from your model and provide a more robust statistical test; if you need more information on model validation, I recommend following up the various discussion threads on the MaxEnt Google user group.
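If you want an AUC figure outside MaxEnt's own report, dismo can compute one against a random background sample; a hedged sketch, assuming a single (non-replicated) run and a hypothetical held-back test set:

    # Evaluate against held-back presences and random background points
    # (background points are "pseudo-absences", not true field absences)
    bg <- randomPoints(env, n = 10000)
    test <- pips[sample(nrow(pips), 25), ]   # hypothetical held-back test points
    ev <- evaluate(p = test, a = bg, model = me, x = env)
    ev@auc   # rule of thumb: 0.7-0.8 good, above 0.8 very good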
In addition to the general model fit assessment, you will also be interested in finding out the relative impact of the environment variables. MaxEnt provides the capability to run a variable jack-knife process to enable you to do just this. A typical jack-knife plot is shown below:
This plot shows the relative importance of the variables: the impact on the AUC of each variable on its own, and the impact of removing the variable altogether. Looking at the hedgerow grid, we can see that removing this variable has a big impact on the model AUC, so it is clearly a key driver, and similarly it is a key contributor to the model. Taking the interior edge distance grid next, we can see that its removal barely impacts the overall AUC, and it's a minimal contributor to the model.
Using this information, you may wish to remove the variables contributing least to the AUC, noting that an AUC of 0.5 indicates a 50/50 probability and no better than chance.
So, in this example, I'd probably consider re-running the model with the interior edge distance and distance to running water grids removed. MaxEnt will generate an HTML summary file where you can access all this information, plus a brief assessment of each variable's relative importance.
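In R terms, that re-run is just a matter of dropping the weak layers from the stack (the layer names here are hypothetical, and the subsample settings are those from the earlier sketch):

    # Drop the layers the jack-knife flagged as weak contributors, then re-fit
    env_trim <- dropLayer(env, c("interior_edge", "water_dist"))
    me2 <- maxent(env_trim, pips, factors = "plantation", args = subsample)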