Addition of Government Data

In December 2011 we strengthened our longstanding collaboration with Texas Parks and Wildlife Department (TPWD) by signing a Memorandum of Understanding facilitating continued development of the Fishes of Texas Project and including long-term sharing of data as part of the project. The text of that memorandum can be seen here. This is our most recent stride toward broader data sharing and increasing the breadth and overall value of the Fishes of Texas project. Texas Commission on Environmental Quality, another of our project sponsors, also maintains aquatic biodiversity data of interest and we are currently working with them to formalize a similar data sharing agreement. Here we explain our reasons for wanting to add these otherwise not readily accessible data sets to our project.

Project Overview

Various government offices maintain biodiversity and water quality data sets from monitoring and survey efforts that complement museum-based organismal occurrence data like those of the Fishes of Texas Project (FoTX) in many ways. However, most of these data sets are not online or otherwise easily obtained, and without extensive processing they are generally not useful for purposes other than their original one. Adding these data sets to the FoTX online database would not only increase the number of simple fish occurrence records available via our web database by an order of magnitude (81,000 to over 1 million), but would also add data about fish communities (e.g., abundance measures, size-frequency and specimen weight data) and, importantly, more recently collected data than are typically found in museum collections. The data sets we propose to use also complement one another in many ways that will help researchers overcome previously troublesome biases and other issues inherent to each data set when used in isolation. Finally, this project will go beyond our current focus on fish occurrences to make a wealth of temporally and spatially linked environmental data (e.g., diverse water quality measures, temperature, and discharge) readily available to researchers, along with records of other co-occurring organisms (e.g., freshwater mussels, reptiles and amphibians, and aquatic plants).

We have been working with state government for many years - Texas Parks and Wildlife Department (TPWD) and Texas Commission on Environmental Quality (TCEQ) have respectively helped fund compilation of our Fishes of Texas (FoTX) database for 6 and 4 years, and both TPWD and the Austin Office of U.S. Geological Survey have long been donors of specimens to our own TNHC Ichthyology Collection. We now look to broaden these relationships to help these offices make some of their diverse related databases available through our FoTX project.


Our own Fishes of Texas Project Database is extensively documented elsewhere in these pages and here we provide a basic discussion of the value of adding to it the data from five government databases - TPWD's Inland Fisheries, Coastal Fisheries, and Fish Kills databases; TCEQ's State Water Quality Monitoring database; and USGS's National Water-Quality Assessment Program data. These data sources are compared and contrasted in a basic way in Table 1, and the discussion that follows points out the perhaps less obvious but impressive and diverse complementarity among these data sources. Combining them and making them all available through a common interface would greatly increase research power for addressing a broad spectrum of ecological questions.


Table 1. Preliminary summary and comparison of the six largest fish databases to be aggregated as part of this project.

Overall increase in database size (Table 1) - While it is obvious from Table 1 that adding the government databases to our online resources would massively increase the number of basic fish occurrence records we could provide for research, rigorous quantitative comparisons are not yet possible, due primarily to differences in data structure and data definition. For example, while the TPWD Coastal Fisheries database is by far the biggest if only the number of records is considered, every record in it was apparently independently georeferenced by GPS. Though such fine-scale geographic precision may be useful in some research, it is clear that what is represented as many records in TPWD Coastal would have been recorded as a single record in most museum databases. The power of simply adding records, however, is perhaps overshadowed by the extensive complementarity of the databases: used together, they can help overcome the unique biases of each and function synergistically to greatly increase overall analytical power.


Species identification issues and how we will deal with them - All identifications in the FoTX database are verifiable via specimen inspection, and over 4,000 lots of suspect specimens (mostly distributional outliers) have recently been inspected and their identifications verified. However, none of the government databases we propose to add to our resources are vouchered, or, if voucher specimens were taken (apparently very rarely), the connection between the record and the specimens has not yet been made. Knowing well the difficulties of identifying live fishes in the field, as the government data gatherers have done, we were not surprised to find a good number of probable identification errors in these government data sets. However, we are confident in our ability to find and correct or eliminate a large proportion of them. We will use the FoTX database to vet all records in all other databases, flagging suspect records that fall outside of distributions documented by vouchers, and will implement a feedback system to alert government collectors of the need to voucher suspect taxa in the future. This interaction should serve to further promote the practice of vouchering collections. We are also prepared, with the new interactive digital identification keys we have been working on, to help provide training to government staff and, in the process, to enlist their assistance in improving our keys for a broader audience.
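The vetting step described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that a "distribution documented by vouchers" is represented as the set of river basins in which vouchered specimens of a species exist; the function names, species labels, and basin assignments below are hypothetical and are not taken from FoTX or any government database.

```python
from collections import defaultdict


def vouchered_ranges(voucher_records):
    """Map each species to the set of basins documented by vouchered specimens."""
    ranges = defaultdict(set)
    for species, basin in voucher_records:
        ranges[species].add(basin)
    return ranges


def flag_suspect(records, ranges):
    """Return unvouchered records whose basin lies outside the vouchered range.

    Species with no vouchers at all are also flagged, since nothing
    documents their occurrence anywhere.
    """
    return [(species, basin) for species, basin in records
            if basin not in ranges.get(species, set())]


# Hypothetical example data (placeholders, not real records).
vouchers = [("species_a", "Brazos"), ("species_a", "Colorado"),
            ("species_b", "Guadalupe")]
agency_records = [("species_a", "Brazos"), ("species_b", "Red"),
                  ("species_c", "Brazos")]

suspects = flag_suspect(agency_records, vouchered_ranges(vouchers))
print(suspects)  # [('species_b', 'Red'), ('species_c', 'Brazos')]
```

A flagged record is not necessarily wrong; it is simply a candidate for review or for future vouchering. A real implementation would likely use finer spatial units than whole basins.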


Temporal complementarity among databases (Table 1) – The oldest record in the FoTX database is >120 years older than any records in the non-vouchered government data sets and the median year of collection of FoTX records predates that of any govt. data by 16 years. Its large number of records and thorough geographic coverage for the decades of the 1950’s – 1980’s, not to mention its verifiability via specimen inspection, are obviously important to many research endeavors, while the other databases notably compensate for the declining numbers of collections in the FoTX data starting in the 1980’s.


Spatial overlap among these databases (Figure 1) is generally broad, but TPWD coastal is tightly focused in the narrow band of inshore marine and marine-influenced systems that appears to have been far more densely sampled than has any freshwater area. However, every record in TPWD Coastal was apparently independently georeferenced by GPS. Though such very fine scale geographic precision may be useful in some research, it is clear that what is represented as many separate records in TPWD Coastal would have been recorded as a single record in most museum databases. In other words, the very large number of records in TPWD Coastal is at least in part an artifact of its structure and very fine scale geographic sampling compared to the other databases.
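To make the record-count artifact concrete, the following sketch collapses individually georeferenced records into locality-level records by rounding coordinates. The rounding precision (two decimal places), the record layout, and the sample values are assumptions chosen for illustration only; they do not reflect how TPWD Coastal or any museum database actually defines a locality.

```python
def collapse(records, decimals=2):
    """Count distinct (species, date, rounded lat, rounded lon) combinations.

    Records that differ only in fine-scale GPS position are treated as
    one locality-level record, as a museum database typically would.
    """
    keys = {(species, date, round(lat, decimals), round(lon, decimals))
            for species, date, lat, lon in records}
    return len(keys)


# Hypothetical records: three same-day samples a few tens of meters apart,
# plus one sample at the same spot a month later.
records = [
    ("species_a", "2010-05-01", 27.8012, -97.3921),
    ("species_a", "2010-05-01", 27.8019, -97.3928),
    ("species_a", "2010-05-01", 27.8031, -97.3917),
    ("species_a", "2010-06-01", 27.8012, -97.3921),
]

print(len(records), collapse(records))  # 4 2
```

Comparing raw and collapsed counts in this way is one possible route to more comparable record totals across databases.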


Figure 1. Spatial distribution and numbers of sampled localities, and number of fish occurrence records, in the six largest databases to be aggregated by this project. Major river basin divides are also shown.


Dealing with inherent biases that differ among databases - Given their distinct missions and histories, it is not surprising that these databases each have unique biases that can be limiting when it comes to using them in rigorous analyses. However, one of the biggest benefits of using these databases together may be that doing so allows users to correct or otherwise account for those biases, enabling rigorous analyses that otherwise would not have been possible. Here we provide a simple and very preliminary analysis of differences and similarities among the databases that focuses on the freshwater groups with which we are most familiar. To allow this, we have preliminarily synonymized records to current taxonomy for freshwater groups. We note that there are still many taxonomy normalization issues for marine fishes (with which we are less familiar) in the combined database that we have not attempted to resolve for this preliminary analysis.


While FoTX and TPWD Coastal both record similar numbers of taxa, their lists of species overlap relatively little. Following definitions of Hubbs et al. (2008), 67% of FoTX records are of obligate freshwater species, and 22.5% are obligate marine or estuarine. The same percentages for TPWD Coastal are, respectively, 1% and 69%. Other databases have freshwater/marine ratios resembling that of FoTX, and their lists of taxa recorded overlap that of FoTX much more than they overlap the species list of TPWD Coastal. However, all other databases (except TPWD Coastal) have species lists ranging from 20 – 46% as long as the FoTX list. This can likely be attributed primarily to the fact that these other databases have sampled far fewer localities, ranging from about 20% (TPWD Fish Kill and TPWD Inland) of FoTX’s nearly 9,000 localities, all the way down to 0.03% (USGS NAWQA).


FoTX records include 22 non-native species (Hubbs et al. 2008) from the state. All other databases analyzed here have records for, on average, 10 non-native species (range 7-13). Looking at numbers of records, however, 71% of the total number of records of non-natives in the combined database are from the TPWD Inland database and only 10% from FoTX. Thus, any researcher would obviously be hard pressed to rigorously analyze the distributions and habitat relations of non-natives using only museum records.


It is noteworthy that a large proportion of the total non-native species recorded in the literature (e.g. Howells 2001) as occurring or having occurred in TX are not documented in any of these databases. We are confident that this museum – govt. collaboration will improve future vouchering of non-native occurrences.


Among these databases, many of the state’s freshwater fishes are known only from the FoTX database. Of the 58 species of the family Cyprinidae known from Texas, for example, nearly 26% (15 species) are documented only by FoTX. Digging deeper into the data, we find that in the combined data set FoTX has > 90% of the total records for almost 60% of the total cyprinid species list. Large numbers of unidentified cyprinids are recorded in some of the govt. databases and we are confident that govt. workers are sampling far more cyprinids than they identify, thus under-representing the actual species richness in their samples. This project will focus on improving the fish identification skills of those contributing to government databases and on increasing their propensity to voucher their specimens.


Interestingly, records of a few cyprinids that are rare in the FoTX database are much more common in the other databases. For example, while museum records of the non-natives Carassius auratus, Ctenopharyngodon idella and Cyprinus carpio do exist in FoTX, we found them inadequate for production of Species Distribution Models (SDMs) of useful quality. Adding the records for these species from the government databases increased total record counts for each to 4 to 21 times what FoTX alone provides, which should make production of high-quality SDMs feasible. Similar complementarity of these databases is obvious for other families as well. In Catostomidae, for example, total numbers of records of ictiobines will be increased to 20 – 35 times what FoTX alone provides.


Redundancy of sampling at discrete localities is a valuable asset of the combined data - Museum specimen databases are well known to generally lack any documentation of species absences, and so their data are generally considered to be indicative only of species presence (see Graham et al. 2004). This simple limitation can greatly constrain many analyses, and is especially important in modeling. Much more powerful modeling methods can be used with data that are informative about both presence and absence. However, determination of absence generally requires methodical sampling over time and knowledge of the effort expended. The FoTX museum-based data set is typical for such data, with on average 9 samples at each FoTX locality, often spread over very long time periods and almost always with no measure of effort, gear or other biases. However, some of the data sets proposed to be included here will clearly support determination of absences. While USGS NAWQA has sampled very few sites, it has sampled each on average 94 times. Similarly, TCEQ SWQIM and TPWD Inland have each sampled at about 20% as many localities as has FoTX, but they have respectively sampled each site on average 68 and 35 times. The redundant site-specific sampling that the government data provide can often be used to reliably infer absences, and when absence data are available, researchers can apply much more powerful species distribution modeling methods than when they are lacking (Brotons et al. 2004; Guisan and Thuiller 2005).
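The reasoning behind inferring absence from repeated sampling can be sketched with a simple calculation. Under the simplifying assumptions (ours, for illustration) that visits are independent and that a species present at a site is detected on any one visit with a constant probability p, the chance that it goes undetected on all n visits is (1 - p)^n. The per-visit detection probability used below is an arbitrary placeholder, not an estimate from any of the databases; only the average visit counts come from the text above.

```python
def prob_missed(p_detect, n_visits):
    """Probability that a species present at a site is never detected
    in n independent visits, each with detection probability p_detect."""
    return (1.0 - p_detect) ** n_visits


# Average numbers of samples per locality, as reported in the text.
mean_visits = {
    "FoTX": 9,
    "TPWD Inland": 35,
    "TCEQ SWQIM": 68,
    "USGS NAWQA": 94,
}

# Hypothetical per-visit detection probability (placeholder value).
P_DETECT = 0.05

for name, n in mean_visits.items():
    print(f"{name}: {prob_missed(P_DETECT, n):.4f}")
```

The more often a site is sampled, the smaller this probability becomes, and the more defensible it is to treat a species never recorded there as absent. Because museum data rarely record effort or gear, a realistic detection probability is hard to estimate for them, which is part of why absences are so difficult to infer from museum records alone.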

Rare, endangered and extinct fishes – FoTX essentially corners the market on all species in these categories. Occurrences of none of the state’s extinct or extirpated taxa are recorded in any other database and most rare species are documented nearly exclusively by the FoTX.


Data on other aquatic organisms and water quality will also be included - Some of the data sets we propose to serve extend beyond fishes to provide data on other components of biodiversity. Mussels (currently a hot topic in conservation since five Central Texas species will soon be listed as endangered) and other aquatic macroinvertebrates are also monitored by government programs. TCEQ SWQIM and USGS NAWQA both contain extensive data on these other taxa, as well as various indices of biotic integrity developed from them, which provide integrated summaries of habitat quality that could be useful to many researchers. It is also important to recall the intimate host-parasite interactions between fishes and mussels, which could easily be explored by researchers obtaining data from this proposed project. Obviously, mussel conservation must depend at least in part on fish communities, but, at least in Texas, we are unaware of research that has thoroughly explored these interactions and their importance to the sustainability of intact aquatic biodiversity and ecosystems.



Literature Cited

Brotons L, Thuiller W, Araújo MB, Hirzel AH (2004) Presence-absence versus presence-only modelling methods for predicting bird habitat suitability. Ecography 27: 437–448.

Graham CH, Ferrier S, Huettmann F, Moritz C, Peterson AT (2004) New developments in museum-based informatics and applications in biodiversity analysis. Trends in Ecology & Evolution 19(9): 497–503.

Guisan A, Thuiller W (2005) Predicting species distribution: offering more than simple habitat models. Ecology Letters 8: 993–1009.

Howells RG (2001) Introduced non-native fishes and shellfishes in Texas waters: an updated checklist. Texas Parks and Wildlife Management Data Series No. 188. Texas Parks and Wildlife Department, Austin, Texas.

Hubbs C, Edwards RJ, Garrett GP (2008) An annotated checklist of the freshwater fishes of Texas, with keys to identification of species. Texas Journal of Science 43: 1-87.





Dean Hendrickson,
Dec 8, 2011, 2:13 PM