Cataloging your geo-datasets
2/11/22
The bigger picture behind my and @blackary's (Zachary Blackwood's) PR to intake-geopandas
Interactive code:
https://colab.research.google.com/drive/1qw47XHIRer6ojBLMQeuSmfRtlPPuwxqx?usp=sharing
https://mybinder.org/v2/gh/raybellwaves/intake-gpd-blog/main?urlpath=lab
Why do you need a data catalog?
Imagine you are working in a large team. The team may consume large amounts of data, and it may produce large amounts of data.
If you are working on a project by yourself, you probably know the data you are working with and may see cataloging it as a burden.
Cataloging data should be as important as other good coding/project practices: writing tests, writing docstrings, adding type hints, etc. While these add a bit of friction to early project work, they will ultimately improve your coding/project skills.
A catalog makes it easy for teammates to discover and use data. It will also help you if you come back to a project after a number of months, and it avoids duplicated effort from multiple people ingesting the same data. This is especially important in large remote teams.
How to create a data catalog?
Something is better than nothing. It could be:
A shared Excel document
A shared website or Confluence page
However, there are well-suited open-source tools to help with this. If you are using Python, a good tool is intake.
Intake
Intake is an open-source tool developed at Anaconda that focuses on cataloging data. It provides readers that tell Python how to open files. This is by far its best feature, as it removes the need to understand the nuances of reading complex file formats such as GRIB.
Intake works with plugins, which are separate packages that read different data formats.
Intake-geopandas
The focus of this blog post is geo-datasets, which in general means files that can be opened with geopandas. Intake-geopandas is an intake plugin that handles passing these files to geopandas.
To open a geo-parquet with geopandas you would use:
import geopandas as gpd
gpd.read_parquet("file.parquet")
The same read can be described for intake in a YAML catalog file (cat.yaml):
metadata:
  version: 1
sources:
  file:
    description: My file containing geodata.
    driver: geoparquet
    args:
      urlpath: file.parquet
You would then open the file as:
from intake import open_catalog
cat = open_catalog("cat.yaml")
cat.file.read()
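If a team ends up with many similar entries, the catalog text could be generated rather than written by hand. The following is a minimal sketch using only the standard library; `render_source` and `render_catalog` are hypothetical helpers made up for this post, not part of intake, and they assume the values contain no characters that need YAML quoting.

```python
def render_source(name, description, urlpath, driver="geoparquet"):
    """Return the YAML lines for one catalog source (hypothetical helper)."""
    return "\n".join(
        [
            f"  {name}:",
            f"    description: {description}",
            f"    driver: {driver}",
            "    args:",
            f"      urlpath: {urlpath}",
        ]
    )


def render_catalog(sources):
    """Return a full catalog document for a list of source dicts."""
    header = "metadata:\n  version: 1\nsources:\n"
    body = "\n".join(render_source(**source) for source in sources)
    return header + body + "\n"


# Rebuild the single-entry catalog shown above.
catalog_text = render_catalog(
    [
        {
            "name": "file",
            "description": "My file containing geodata.",
            "urlpath": "file.parquet",
        }
    ]
)
print(catalog_text)
```

For anything beyond simple values, a YAML library would likely be a safer way to write the file than string formatting.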
How to create a data catalog using intake?
A team could contribute pull requests to a shared YAML file such as the one above.
A more robust approach is to create a package, which lets you add tests and declare dependencies (see this tweet by @tdhopper (Tim Hopper)).
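As an illustration of the kind of test such a package might carry, here is a minimal sketch that checks every source has a description, a driver and a urlpath. It assumes the catalog has already been parsed into a plain dict (for example by a YAML library); `find_catalog_problems` is a hypothetical helper, not an intake function.

```python
def find_catalog_problems(catalog):
    """Return a list of human-readable problems found in a parsed catalog dict."""
    problems = []
    sources = catalog.get("sources", {})
    if not sources:
        problems.append("catalog has no sources")
    for name, entry in sources.items():
        # Every entry should say what it is and which driver reads it.
        for key in ("description", "driver"):
            if not entry.get(key):
                problems.append(f"{name}: missing {key}")
        # The driver needs to know where the data lives.
        if not entry.get("args", {}).get("urlpath"):
            problems.append(f"{name}: missing args.urlpath")
    return problems


good_catalog = {
    "metadata": {"version": 1},
    "sources": {
        "file": {
            "description": "My file containing geodata.",
            "driver": "geoparquet",
            "args": {"urlpath": "file.parquet"},
        }
    },
}
bad_catalog = {"sources": {"file": {"driver": "geoparquet", "args": {}}}}

print(find_catalog_problems(good_catalog))
print(find_catalog_problems(bad_catalog))
```

A check like this could run in continuous integration on every pull request that touches the shared catalog.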
What about your and @blackary's PR to intake-geopandas?
We switched the reader for geo-parquet files to dask-geopandas. You can see the PR here. This PR brings intake-geopandas in line with other intake plugins such as intake-parquet, which uses dask as the driver. It also has other benefits:
It can speed up reading the geo-parquet using multiple cores.
You can use a wildcard to combine multiple files during reading. If, for example, you have a folder with one file per date, such as file_2022-01-01.parquet and file_2022-01-02.parquet, you can read these files together as:
import dask_geopandas as dgpd
dgpd.read_parquet("file_2022-01-*.parquet")
The intake catalog entry is the same as above, except that the wildcard is passed to urlpath:
metadata:
  version: 1
sources:
  file_2022-01:
    description: My file containing geodata for January 2022.
    driver: geoparquet
    args:
      urlpath: file_2022-01-*.parquet
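To get a feel for which files a wildcard like this picks up, the pattern can be checked against a list of names. The sketch below uses the standard-library fnmatch module as a stand-in. The file list is made up for illustration, and the real matching is done by the file-system layer underneath dask-geopandas, which may behave differently in edge cases.

```python
from fnmatch import fnmatchcase

# Hypothetical folder contents: two January files and one February file.
filenames = [
    "file_2022-01-01.parquet",
    "file_2022-01-02.parquet",
    "file_2022-02-01.parquet",
]
pattern = "file_2022-01-*.parquet"

# Keep only the names that match the shell-style wildcard.
january_files = sorted(name for name in filenames if fnmatchcase(name, pattern))
print(january_files)
```

Only the two January files match, so the February file would be left out of the combined dataset.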
See also
intake-pattern-catalog
intake-pattern-catalog is an intake plugin developed by @blackary and @tdhopper that helps you access individual files in a folder. Given the folder above, you could access any file by passing a datetime object.
The YAML will look like the following:
metadata:
  version: 1
sources:
  file_daily:
    description: My daily file containing geodata.
    driver: intake_pattern_catalog
    args:
      driver: geoparquet
      urlpath: file_{datetime:%Y-%m-%d}.parquet
This can be read in as:
from datetime import datetime
from intake import open_catalog
cat = open_catalog("cat.yaml")
cat.file_daily.get_entry(datetime=datetime(2022, 1, 1)).read()
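The placeholder in the urlpath uses the same syntax as Python's format strings, so you can preview which file a given datetime would point to with plain str.format. This is only an illustration of the pattern syntax; whether intake-pattern-catalog resolves the path in exactly this way internally is an assumption.

```python
from datetime import datetime

# Same pattern as the urlpath in the catalog entry.
pattern = "file_{datetime:%Y-%m-%d}.parquet"

# The format spec after the colon is applied to the datetime like strftime.
path = pattern.format(datetime=datetime(2022, 1, 1))
print(path)  # file_2022-01-01.parquet
```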
arrow
@blackary implemented a fix in arrow (https://github.com/apache/arrow/pull/10104) to get intake-geopandas to work with remote files (https://github.com/intake/intake_geopandas/pull/29).