The State of Art in Wikidata

About

My (perhaps over-ambitious) goal with this project was to perform a global survey of artwork data in Wikidata, using Python scripts that queried, collected, and analyzed the properties associated with ALL items that are an “instance of” (P31) “painting” (Q3305213), including all of its subclasses. The hope was to give cultural heritage institutions a way to understand best practices in data modeling and to align application profiles, working toward a more robust and universally usable art dataset in the sphere of linked data. The scripts work in conjunction with each other, passing data as JSON: they identify the QIDs of all items classified as “painting” or any of its subclasses, request the claim information (i.e., the associated properties) for those QIDs from Wikidata, and then output an aggregate listing each property number with its number of occurrences in the dataset.
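The final aggregation step might look something like the sketch below. It is not the project's actual script: the function name `count_properties` and the sample data are made up for illustration, and it assumes each property is counted once per item that uses it, with records in the shape that Special:EntityData returns.

```python
import json
from collections import Counter


def count_properties(entities):
    """Count how many items carry at least one claim for each property.

    `entities` maps a QID to its entity record, in the shape
    {"Q1": {"claims": {"P31": [...], ...}}, ...}.
    """
    counts = Counter()
    for record in entities.values():
        # The keys of "claims" are property IDs; each counts once per item.
        counts.update(record.get("claims", {}).keys())
    return counts


# Tiny hand-made sample standing in for the collected claim data.
sample = {
    "Q1": {"claims": {"P31": [{}], "P170": [{}], "P195": [{}, {}]}},
    "Q2": {"claims": {"P31": [{}], "P195": [{}]}},
}
counts = count_properties(sample)
print(json.dumps(counts.most_common(), indent=2))
```

Counting statements instead of items would only require adding `len(statements)` for each property rather than 1.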

I say “ambitious” because it took multiple plans of attack to gather only a portion of the originally intended scope. At the time of this project, there were 566,444 items that were instances of painting or of one of its subclasses, but the results only consider 203,063 of those items because of the considerable time it took to pull all the claims.

[Data + Method]: The data came from Wikidata’s SPARQL endpoint (the Wikidata Query Service) and from the Special:EntityData API endpoint, each accessed through its own Python script. The scripts ran in succession, collecting, storing, and passing data along as JSON.
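As a rough sketch of those two access paths (again, not the original scripts), the helpers below build a query URL for the SPARQL endpoint, pull QIDs out of a SPARQL JSON result, and fetch one item's claims from Special:EntityData. The helper names and the User-Agent string are assumptions for illustration; the query uses the common `wdt:P31/wdt:P279*` pattern for "instance of a class or any subclass".

```python
import json
import urllib.parse
import urllib.request

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"
ENTITY_DATA_URL = "https://www.wikidata.org/wiki/Special:EntityData/{qid}.json"
# Hypothetical User-Agent; replace with one that identifies your own script.
HEADERS = {"User-Agent": "painting-survey-sketch/0.1 (example contact)"}

# Instances of painting (Q3305213) or of any of its subclasses.
QUERY = "SELECT ?item WHERE { ?item wdt:P31/wdt:P279* wd:Q3305213 . }"


def sparql_url(query):
    """Build a GET URL that asks the query service for JSON results."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    return SPARQL_ENDPOINT + "?" + params


def get_json(url):
    """Download a URL and parse the response body as JSON."""
    request = urllib.request.Request(url, headers=HEADERS)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)


def extract_qids(sparql_result):
    """Reduce each result binding (an entity URI) to its trailing QID."""
    bindings = sparql_result["results"]["bindings"]
    return [b["item"]["value"].rsplit("/", 1)[-1] for b in bindings]


def fetch_claims(qid):
    """Return the claims dictionary for one item via Special:EntityData."""
    data = get_json(ENTITY_DATA_URL.format(qid=qid))
    # Take the single entity in the response rather than assuming its key.
    entity = next(iter(data["entities"].values()))
    return entity.get("claims", {})
```

In use, `extract_qids(get_json(sparql_url(QUERY)))` would produce the QID list, which could be written to a JSON file and read back by the script that calls `fetch_claims` for each entry.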

[Conclusion]: Some interesting findings:

Across the 203,063 items, 566 unique properties were used in claims.
Top 10 properties by usage (excluding “instance of”, which every item had as a property claim):
- collection
- inventory number
- location
- creator
- inception
- height
- width
- made from material
- title
- copyright status
It was interesting to note the properties whose names contain “ID”, which imply an effort by a particular institution to provide a unique identifier for an object in its collection. Such a claim can then serve as a vehicle to aggregate all items in a particular collection.
- Additionally, this is an optimal identifier for search
There were 174 properties used in only one claim and 52 properties used in only two.
- 116 of those properties had “ID” in their property name, which could indicate inconsistencies in identifier assignment
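Tallies like the ones above could be derived from the property counts roughly as follows. This is a sketch under assumptions: the function `rare_properties` is hypothetical, the usage counts in the sample are invented, and matching “ID” as a whole word in the property label is just one possible way to flag identifier properties.

```python
from collections import Counter


def rare_properties(counts, labels, max_uses=2):
    """Find rarely used properties and the ones among them that look like IDs.

    `counts` maps property IDs to usage counts; `labels` maps property IDs
    to their English labels.
    """
    rare = {pid: n for pid, n in counts.items() if n <= max_uses}
    # Assumption: a property is "ID-like" if "ID" is a whole word in its label.
    id_like = [pid for pid in rare if "ID" in labels.get(pid, "").split()]
    return rare, id_like


# Invented counts for illustration; the labels are real Wikidata labels.
counts = Counter({"P195": 5, "P350": 1, "P3634": 2, "P1476": 1})
labels = {
    "P195": "collection",
    "P350": "RKDimages ID",
    "P3634": "The Met object ID",
    "P1476": "title",
}
rare, id_like = rare_properties(counts, labels)
print(len(rare), "rare properties,", len(id_like), "with ID in the label")
```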

[Some barriers]: Wikidata doesn’t like it when you send hundreds of thousands of requests in succession. Weird, right? That necessitated adjusting the plan of attack for collecting, retrieving, and storing the Special:EntityData results. Since it took a long time to request the claim information for each of the 566,444 items, I decided to limit the claim collection to just 203,063 items (instances of “painting”).
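One common way to cope with this kind of throttling is to pause between requests, back off after failures, and checkpoint progress to disk so an interrupted run can resume. The sketch below illustrates that general pattern and is not the approach this project actually took; the function names, delays, and checkpoint file are all assumptions, and `fetch` stands for any function that returns the claims for a QID.

```python
import json
import os
import time


def fetch_with_backoff(fetch, qid, retries=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(qid), waiting twice as long after each failed attempt."""
    for attempt in range(retries):
        try:
            return fetch(qid)
        except OSError:  # urllib's URLError and HTTPError are OSError subclasses
            if attempt == retries - 1:
                raise
            sleep(base_delay * 2 ** attempt)


def save(done, path):
    """Write everything collected so far to the checkpoint file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(done, handle)


def collect(qids, fetch, path, pause=0.5, save_every=100, sleep=time.sleep):
    """Fetch claims item by item, skipping QIDs already in the checkpoint."""
    done = {}
    if os.path.exists(path):
        with open(path, encoding="utf-8") as handle:
            done = json.load(handle)
    fetched = 0
    for qid in qids:
        if qid in done:
            continue
        done[qid] = fetch_with_backoff(fetch, qid, sleep=sleep)
        fetched += 1
        if fetched % save_every == 0:
            save(done, path)
        sleep(pause)  # Be polite: leave a gap between requests.
    save(done, path)
    return done
```

Passing `sleep` in as a parameter keeps the pattern easy to try out with a fake fetch function and no real waiting.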

Future iterations of this project will include completing the 500K survey, expanding the scope to all Wikidata items that are instances of “work of art” (Q838948) for similar analysis, and creating more robust code in the process. Additionally, I would like to investigate the data modeling relationships between items tagged with a certain identifier, for example comparing those with “RKDimages ID” (P350) to those with “The Met object ID” (P3634).

This project has definitely served as my foray into programmatic inquiry. I look forward to further expanding this project (and skillset) to “instances of” other entities in the Wiki-verse, in the hopes of creating a utility for understanding current modeling practices (by popularity) within the linked open data realm.

Image Attribution:

"Jacob Cornelisz. van Oostsanen Painting a Portrait of His Wife"

Artist: Dirck Jacobsz.

Date: 1550

Retrieved 2021

With edits by Jessika Davis
