Cataloging your geo-datasets

2/11/22

The bigger picture behind my and @blackary's (Zachary Blackwood's) PR to intake-geopandas

Interactive code:

https://colab.research.google.com/drive/1qw47XHIRer6ojBLMQeuSmfRtlPPuwxqx?usp=sharing

https://mybinder.org/v2/gh/raybellwaves/intake-gpd-blog/main?urlpath=lab

Why do you need a data catalog?

Imagine you are working in a large team. The team may consume large amounts of data, and it may produce large amounts of data.

If you are working on a project by yourself, you probably know the data you are working with and may see cataloging it as a burden.

Cataloging the data should be as important as other good coding/project practices: writing tests, writing docstrings, type hinting, etc. While these add a bit of friction in early project work, they will ultimately improve your coding/project skills.

A catalog makes it easy for teammates to discover and use data. It will also help you when you come back to a project after a number of months, and it avoids the duplicated effort of multiple people ingesting the same data. This is especially important in large remote teams.

How to create a data catalog?

Something is better than nothing. It could be:

A shared Excel document
A shared website or Confluence page

However, there are well-suited open-source tools to help with this. If you are using Python, a good tool is Intake.

Intake

Intake is an open-source tool developed at Anaconda that focuses on cataloging data. It provides readers, which tell Python how to open files. This is by far its best feature, as it removes the need to understand the nuances of reading complex file formats such as GRIB files.

Intake works with plugins, which are separate packages that read different data formats.

Intake-geopandas

The focus of this blog is geo-datasets, which refers, in general, to files that can be opened with geopandas. Intake-geopandas is an Intake plugin that handles passing objects to geopandas.

To open a geo-parquet with geopandas you would use:

import geopandas as gpd

gpd.read_parquet("file.parquet")

This can be written as an Intake catalog, which uses YAML (cat.yaml), as:

metadata:
  version: 1
sources:
  file:
    description: My file containing geodata.
    driver: geoparquet
    args:
      urlpath: file.parquet

You would then open the file as:

from intake import open_catalog

cat = open_catalog("cat.yaml")

cat.file.read()
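If you want to try the catalog without hand-editing a file, one option is to write cat.yaml from Python first. The sketch below uses only the standard library and assumes the usual Intake catalog layout, where each source is nested under sources and its arguments under args. The helper name write_catalog is hypothetical and is not part of Intake.

```python
from pathlib import Path
from textwrap import dedent

# Catalog text for the single-file example; nesting is significant in YAML.
CATALOG_TEXT = dedent(
    """\
    metadata:
      version: 1
    sources:
      file:
        description: My file containing geodata.
        driver: geoparquet
        args:
          urlpath: file.parquet
    """
)


def write_catalog(path="cat.yaml"):
    """Write the example catalog to `path` and return it (hypothetical helper)."""
    target = Path(path)
    target.write_text(CATALOG_TEXT)
    return target


catalog_path = write_catalog()
print(catalog_path.read_text())
```

With intake and intake-geopandas installed, and a real file.parquet next to the catalog, open_catalog("cat.yaml") should then pick up this file.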

How to create a data catalog using intake?

A team could contribute pull requests to a shared YAML file such as the one above.

A more robust approach is to create a package, which provides testing and dependencies (see this tweet by @tdhopper (Tim Hopper)).

What about your and @blackary's PR to intake-geopandas?

We switched the reader for geo-parquet files to use dask-geopandas. You can see the PR here. This PR brings intake-geopandas in line with other Intake plugins such as intake-parquet, which uses dask as the driver. It also has other benefits:

It can speed up reading the geo-parquet using multiple cores.
You can use a wildcard to append multiple files during reading. If, for example, you have a folder with one file per date, such as file_2022-01-01.parquet and file_2022-01-02.parquet, you can append these files as:

import dask_geopandas as dgpd

dgpd.read_parquet("file_2022-01-*.parquet")
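To check which files a pattern like this would select before reading anything, a small standard-library sketch can help. It uses fnmatch only as a stand-in for shell-style matching; dask-geopandas resolves the pattern through its own file-system layer, so treat this as an approximation. The file names are made up for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical file names in a data folder (not from a real dataset).
filenames = [
    "file_2022-01-01.parquet",
    "file_2022-01-02.parquet",
    "file_2022-02-01.parquet",
    "notes.txt",
]

pattern = "file_2022-01-*.parquet"

# Keep only the names that the shell-style pattern matches.
january_files = sorted(name for name in filenames if fnmatchcase(name, pattern))
print(january_files)
```

Only the two January files match; the February file and the text file are left out.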

The Intake catalog is the same as above, except that the wildcard is passed to urlpath:

metadata:
  version: 1
sources:
  file_2022-01:
    description: My file containing geodata for January 2022.