data.wa.gov is the open data portal for the state of Washington and is one of the larger open government data (OGD) state portals. In 2019, I worked with the Washington State Library Open Data Consultant to assess the completeness of metadata on the portal and make recommendations for curation. One outcome of this was a set of Python scripts in Jupyter Notebooks to gather and analyze metadata from the portal. One notebook gathers the metadata, creates visualizations, and adds to a dataset for tracking over time. The other notebook uses the longitudinal dataset to create some basic visualizations. This page shows how the metadata completeness has changed over time on data.wa.gov.
These charts represent data from September 2019 to July 2020. State agencies are given permission to publish data on the portal and the Washington State Library was given curatorial responsibility in 2020.
The name, category, tags, description, and attribution are the most commonly filled out metadata elements. Name, or title, is the only required element on data.wa.gov.
When these five elements are filled out, a user has the information needed to use the data appropriately and should be able to evaluate its trustworthiness. The quality of the metadata would improve further if data publishers filled out posting frequency (postingFreq) and the license more often.
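Here is a minimal sketch of how per-element fill rates like these could be computed. It is not the code from the notebooks. The record layout, the field names other than postingFreq, and the sample values are assumptions made for illustration, and an element counts as filled in only if it holds a non-empty value.

```python
def is_filled(value):
    """Treat None, blank strings, and empty collections as not filled in."""
    if value is None:
        return False
    if isinstance(value, str):
        return value.strip() != ""
    if isinstance(value, (list, tuple, set, dict)):
        return len(value) > 0
    return True


def fill_rates(records, fields):
    """Return the share of records (0.0 to 1.0) that have each field filled in."""
    total = len(records)
    rates = {}
    for field in fields:
        filled = sum(1 for record in records if is_filled(record.get(field)))
        rates[field] = filled / total if total else 0.0
    return rates


# Hypothetical records: made-up values, not real portal metadata.
sample_records = [
    {"name": "A", "category": "Health", "tags": ["x"], "description": "d",
     "attribution": "Agency", "license": "Public Domain", "postingFreq": "Monthly"},
    {"name": "B", "category": "Education", "tags": [], "description": "",
     "attribution": None},
    {"name": "C", "category": "Health"},
    {"name": "D"},
]

# Assumed field names for the elements discussed above.
FIELDS = ["name", "category", "tags", "description", "attribution",
          "license", "postingFreq"]

rates = fill_rates(sample_records, FIELDS)
for field, rate in rates.items():
    print(f"{field}: {rate:.0%}")
```

Blank strings and empty tag lists count as missing here because a field that is present but empty gives a user nothing to work with.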
The dataset Health Care Provider Credential Data offers a good example of how a publisher should fill out metadata. The name and description are not shown in the screenshot, but you can see them on the dataset's page on the portal. This dataset has all five core elements filled out, along with all the other available elements.
These plots cover all asset types, including datasets, maps, charts, and files. The State Health Department is one of the largest publishers on the portal, followed by Education. Not all of the published datasets have the proper metadata, though!
Notice the difference in scale between these charts and the ones to the left.
There are still almost 100 datasets with none of the available metadata filled in, which means the publisher only filled in the required name (or title) element. A positive sign is the growing number of datasets with 5, 6, and 9 metadata elements. These may be the result of individual agencies implementing strong data publishing workflows.
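A distribution like this one could be tallied along the following lines. This is a sketch under stated assumptions, not the notebook code: the list of optional fields is an illustrative subset with assumed names, the records are made up, and the required name element is left out of the count.

```python
from collections import Counter

# Illustrative subset of optional elements; the field names are assumptions.
OPTIONAL_FIELDS = ["category", "tags", "description", "attribution",
                   "license", "postingFreq"]


def has_value(value):
    """Return True if a metadata value is present and non-empty."""
    if value is None:
        return False
    if isinstance(value, str):
        return value.strip() != ""
    if isinstance(value, (list, tuple, set, dict)):
        return len(value) > 0
    return True


def count_filled(record, fields):
    """Count how many of the given fields are filled in for one record."""
    return sum(1 for field in fields if has_value(record.get(field)))


def completeness_distribution(records, fields):
    """Map each filled-element count to the number of records with that count."""
    return Counter(count_filled(record, fields) for record in records)


# Hypothetical records: made-up values, not real portal metadata.
example_records = [
    {"name": "A"},
    {"name": "B", "category": "Health"},
    {"name": "C", "category": "Health", "description": "text", "tags": ["a"]},
    {"name": "D", "category": " ", "tags": []},
]

distribution = completeness_distribution(example_records, OPTIONAL_FIELDS)
for n_elements in sorted(distribution):
    print(n_elements, "elements:", distribution[n_elements], "record(s)")
```

Passing a shorter list of fields to the same function would restrict the count to whichever subset of elements is of interest.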
As seen above, over 80% of datasets have 'category' filled in. This, together with datasets that have 'description' or 'attribution' filled in, shows up here in the plot for 1 element. The slight increase in datasets with 3 and 5 elements is encouraging.
The Washington State Library (WSL) faces the challenging task of curating this portal going forward. It is especially difficult because state agencies are able to publish whatever they want. This certainly increases the amount of data published (which is great!), but it means that quality will vary. I will continue to track metadata completeness on the portal in hopes of seeing changes as data publishers follow guidance from WSL.