Challenges

Slides of the Challenges Presentations: https://tinyurl.com/yb59jyvf

Challenge 1: Crowdsourced Open Data from «Züri wie neu»

«Züri wie neu» is an online platform of the Zurich city administration that facilitates the reporting of damaged infrastructure by citizens. It went online in 2013 and has since been moderated by the city administration and managed transparently, with all (anonymized) reports available as open data.

The open dataset derived from «Züri wie neu» currently contains around 14’000 reports (!). It includes the exact georeferenced location of the infrastructure damage, the time of recording, the full description, the categorisation, the processing status and the time when the report was completed. For approximately 1’700 reports, the submitted photos are available as well. All data is available in open geo formats and can also be queried via the open Open311 interface.
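
For a first look at the data, the Open311 interface can be queried directly from a script. The following is a minimal sketch assuming a standard GeoReport v2 endpoint; the base URL below is a placeholder, and the actual endpoint is documented alongside the open dataset.

    # Minimal sketch: fetch «Züri wie neu» reports via an Open311 (GeoReport v2) interface.
    # BASE_URL is a placeholder, not the real endpoint.
    import requests

    BASE_URL = "https://example.org/open311/v2"

    def fetch_requests(service_code=None, status=None):
        """Fetch service requests (reports) as a list of dicts."""
        params = {}
        if service_code:
            params["service_code"] = service_code
        if status:
            params["status"] = status  # e.g. "open" or "closed"
        resp = requests.get(f"{BASE_URL}/requests.json", params=params, timeout=30)
        resp.raise_for_status()
        return resp.json()

    if __name__ == "__main__":
        reports = fetch_requests(status="closed")
        for r in reports[:5]:
            # GeoReport v2 uses 'lat'/'long' for the coordinates of a request
            print(r.get("service_name"), r.get("requested_datetime"), r.get("lat"), r.get("long"))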

The spatial and temporal components of these data naturally invite geo analysis and cartographic visualisations, such as heatmaps or animated maps. But there might be further interesting facts hidden in the data. The detailed damage descriptions written by the users seem especially interesting. Since they are unstructured text, analysing them for further facts is not trivial.

For the administration it would be very helpful, for example, if the category could be derived automatically from the description text and/or a picture, so that the report can be assigned to the responsible office. The dataset is thus well suited for machine-learning-based classification techniques. Here you can find more information about this challenge.
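
As one possible starting point, a simple text classifier could be trained on the existing reports. The sketch below assumes a CSV export with 'description' and 'category' columns; the file name and column names are assumptions, not the actual schema of the dataset.

    # Minimal sketch: predict the report category from the free-text description.
    # File name and column names ('description', 'category') are assumptions.
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline

    df = pd.read_csv("zueriwieneu_reports.csv")  # assumed export of the open dataset
    X_train, X_test, y_train, y_test = train_test_split(
        df["description"].astype(str), df["category"], test_size=0.2, random_state=42)

    model = make_pipeline(
        TfidfVectorizer(max_features=20000, ngram_range=(1, 2)),  # German stop words could be added
        LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)
    print(classification_report(y_test, model.predict(X_test)))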

Challenge Slack channel: #zueriwieneu

Challenge 2: Extracting individual trees from LIDAR

The city of Zürich maintains a tree database in which all trees planted by the municipal gardeners (Grün Stadt Zürich) are recorded. But cities do not become green only through the work of the municipality: garden and balcony owners also contribute to the greening of the city.

By using LIDAR data, green areas can be analyzed in a more precise and comparable way. This dataset thus offers an interesting alternative for measuring the greening of municipalities. Carrying out this calculation with LIDAR data is, however, not trivial, especially since the datasets are huge. Finding a methodology that allows fast and precise analysis of this dataset is therefore very rewarding.

Possible topics for analysis are the extraction of individual trees as points or as tree outlines. Moreover, additional information could be attributed to each extracted tree, such as tree height and crown area. The tree database of the city of Zürich can serve as a validation dataset, as it records properties such as tree position, tree height and tree species.

Both the LIDAR data and the tree data are available as Open Data. The LIDAR data was acquired in spring 2014. In addition to the X, Y and Z coordinates, other variables such as the pulse return magnitude and the standard set of ASPRS classifications are available.
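
One possible entry point, sketched below, is to rasterise the point cloud into a crude canopy height model and pick local maxima as candidate tree tops. The file name, cell size and height threshold are assumptions; a real workflow would need tiling and more careful surface modelling.

    # Minimal sketch: crude canopy height model (CHM) from a LAS tile plus local-maxima tree tops.
    # File name, cell size and thresholds are assumptions.
    import numpy as np
    import laspy
    from scipy import ndimage

    las = laspy.read("lidar_tile.las")                      # assumed file name
    cls = np.asarray(las.classification)
    x, y, z = np.asarray(las.x), np.asarray(las.y), np.asarray(las.z)

    cell = 1.0                                              # raster cell size in metres (assumption)
    cols = ((x - x.min()) / cell).astype(int)
    rows = ((y - y.min()) / cell).astype(int)
    shape = (rows.max() + 1, cols.max() + 1)

    def rasterize(mask, values, reducer, init):
        """Reduce point values into a grid, e.g. highest/lowest return per cell."""
        grid = np.full(shape, init)
        reducer.at(grid, (rows[mask], cols[mask]), values[mask])
        return grid

    surface = rasterize(np.isin(cls, [3, 4, 5]), z, np.fmax, -np.inf)  # ASPRS vegetation classes
    ground = rasterize(cls == 2, z, np.fmin, np.inf)                   # ASPRS ground class
    chm = surface - ground                                             # height above ground
    chm[~np.isfinite(chm)] = 0.0

    # Candidate tree tops: cells equal to the maximum in a 5x5 window and taller than 3 m
    tops = (chm == ndimage.maximum_filter(chm, size=5)) & (chm > 3.0)
    print(f"{tops.sum()} candidate tree tops")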

Challenge Slack channel: #lidar

Challenge 3: OpenStreetMap POI Completeness

«Points of Interest» (POI) are a dataset in high demand and of economic, social and scientific interest. OpenStreetMap (OSM), the biggest crowdsourced geospatial mapping project, contains over 1000 different POI categories. They range from amenities (e.g. restaurants) and shops to leisure facilities (e.g. fitness centers).

A frequently asked question about OSM concerns its quality, especially its completeness and spatial coverage. For streets, for example, completeness was over 80% worldwide in 2017 (see figure). As it turns out, this question is not easy to answer - and this is our challenge!

This challenge focuses on evaluating POI completeness at a city or country level. An obvious approach is to compare the POI in OSM with authoritative, 'official' sources. Unfortunately, these sources mostly cover only certain categories (such as restaurants or hotels) and are otherwise inaccessible. An alternative is to analyze the OSM data against itself, 'intrinsically'. This can potentially be processed at scale.

In order to evaluate POI completeness intrinsically, some statistically sound approaches are already known, which we will briefly outline. One of them looks at the history of the POI, and the other measures the average number of POI along the streets. But first, we propose to collect validation data by hand - just as the Hackathon motto "Crowdsourced Data Analysis" suggests.
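
As a warm-up before working on intrinsic measures, current POI counts per category can be pulled from OSM via the public Overpass API. The sketch below uses an assumed area filter for Zürich and only looks at nodes, ignoring POI mapped as ways or relations.

    # Minimal sketch: count OSM POI per category inside a city using the Overpass API.
    # The area filter and the tag keys queried here are assumptions; adapt them to your study area.
    import requests
    from collections import Counter

    OVERPASS = "https://overpass-api.de/api/interpreter"
    query = """
    [out:json][timeout:90];
    area["name"="Zürich"]["boundary"="administrative"]->.a;
    (
      node["amenity"](area.a);
      node["shop"](area.a);
      node["leisure"](area.a);
    );
    out tags;
    """

    resp = requests.post(OVERPASS, data={"data": query}, timeout=120)
    resp.raise_for_status()
    elements = resp.json()["elements"]

    counts = Counter()
    for el in elements:
        tags = el.get("tags", {})
        for key in ("amenity", "shop", "leisure"):
            if key in tags:
                counts[f"{key}={tags[key]}"] += 1

    for category, n in counts.most_common(20):
        print(f"{category}: {n}")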

Then comes probably the most demanding and fun phase of the challenge: discussing the pros and cons of the intrinsic POI completeness approaches. Join this challenge and the world of spatio-temporal data analysis!

You can find a whitepaper on the topic here.

Challenge Slack channel: #osmcompleteness

Challenge 4: Online Search Behaviour and Government Information

The website of the Canton of Zurich (zh.ch) is the digital interface between the public and the cantonal administration. On zh.ch, people find trusted information on the Canton of Zurich and its public services. Due to the large quantity of information on very diverse topics and a site structure that reflects the organizational hierarchy, it is not always easy for users to find the information most relevant to them.

By analyzing website traffic as well as web search data a better understanding of user behaviour and user needs can be gained.

  • What do people looking for government information search for on the web?
  • What patterns can be found in the aggregated website traffic and search engine data?
  • Are there clusters of search terms that are equally related to pages of several units of the public administration?
  • Do the search terms people use before navigating to zh.ch match official language?

Help us understand whether the structure and content of zh.ch mirror the needs of our users, and what should be done to ensure a user experience that responds to those needs.

Perform analyses of the data to provide insights, and/or even build an app that allows continuous monitoring of search and web behaviour related to zh.ch.
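
One way to approach the clustering question, for instance, is to group search terms by character n-gram similarity. The sketch below assumes a CSV export with a 'query' column; the file name and column name are assumptions and may not match the actual exports listed under DATA.

    # Minimal sketch: cluster search terms into groups of related queries.
    # File name and column name are assumptions about the provided dataset.
    import pandas as pd
    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    df = pd.read_csv("search_terms.csv")                                  # assumed export
    vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))  # robust to compounds/typos
    X = vectorizer.fit_transform(df["query"].astype(str))

    kmeans = KMeans(n_clusters=20, n_init=10, random_state=0).fit(X)
    df["cluster"] = kmeans.labels_

    for cid, group in df.groupby("cluster"):
        print(cid, group["query"].head(5).tolist())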

Challenge Slack channel: #zhweb

DATA

  • Google Search Terms related to zh.ch: data (caveat: varying timespan for different zh.ch domains!)
  • Web Analytics & Google Search Data for kapo.zh.ch / statistik.zh.ch: data (.zip)
  • List of Topics (A-Z) → “official language”: data

Challenge 5: Automatic Detection of Color for Water Quality Strip Tests

The CrowdWater project collects crowdsourced hydrological data to predict floods and low flow. In this data collection process, citizen scientists use color strips to label water quality conditions. Interpreting colors can be very subjective, so the CrowdWater team proposes a challenge to eliminate the subjectivity of color comparison for water quality strip tests: code an application that automatically detects the colors (and the corresponding water quality value) from strip tests and creates a fun experience for citizen scientists. We will give you 25 sample images, and your application should match the colors in the images to a color palette and assign the quality value.
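
One possible pipeline, sketched below with OpenCV, averages the colour of the reactive pad and matches it against a reference palette in Lab colour space. The image path, pad location and palette values are invented for illustration only.

    # Minimal sketch: match the colour of a test-strip pad against a reference palette.
    # Image path, pad location and palette values are assumptions.
    import cv2
    import numpy as np

    # Hypothetical palette: quality value -> BGR colour of the reference chart
    PALETTE = {0: (200, 200, 255), 10: (180, 150, 245), 25: (160, 100, 235), 50: (140, 60, 225)}

    img = cv2.imread("strip_photo.jpg")        # assumed sample image
    patch = img[100:140, 200:240]              # assumed location of the reactive pad
    mean_bgr = patch.reshape(-1, 3).mean(axis=0)

    def to_lab(bgr):
        """Convert one BGR colour to Lab, where Euclidean distance roughly matches perception."""
        px = np.uint8([[list(map(int, bgr))]])
        return cv2.cvtColor(px, cv2.COLOR_BGR2LAB)[0, 0].astype(float)

    sample_lab = to_lab(mean_bgr)
    best_value = min(PALETTE, key=lambda v: np.linalg.norm(to_lab(PALETTE[v]) - sample_lab))
    print(f"Estimated value: {best_value}")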

Challenge Slack channel: #crowdwater

Challenge 6: Adding and Correcting Entities in Executive Minutes

The Staatsarchiv Kanton Zürich has 150’000 pages of handwritten executive minutes (Regierungsratsprotokolle, 1803-1883). These documents inform us about high politics and daily life in 19th-century Zürich, and are therefore highly valuable. In order to enhance the usability and searchability of these documents, it is important to identify the entities in the text (i.e., people, places, organisations etc.). To make life easier for human curators, we have automatically identified many of these entities (using a fixed list of places and persons). We challenge participants to create an interface to access and alter the documents, so that people can check the quality of the identified entities, correct any potential mistakes made by the automatic algorithm, and add any entities that the algorithm might have missed.
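
To give an idea of the data such an interface would consume, the sketch below shows how gazetteer matches might be represented as annotations with character offsets that a correction UI can display and edit. The entity lists and the sample sentence are invented and not taken from the minutes.

    # Minimal sketch: produce entity annotations (with offsets) from a fixed gazetteer.
    # Entity lists and the sample sentence are invented for illustration.
    import json
    import re

    GAZETTEER = {
        "place": ["Zürich", "Winterthur", "Uster"],
        "person": ["Alfred Escher"],
    }

    def annotate(text):
        """Return a list of {start, end, text, type} annotations found in `text`."""
        annotations = []
        for ent_type, names in GAZETTEER.items():
            for name in names:
                for m in re.finditer(re.escape(name), text):
                    annotations.append({"start": m.start(), "end": m.end(),
                                        "text": m.group(), "type": ent_type})
        return sorted(annotations, key=lambda a: a["start"])

    sample = "Alfred Escher reiste von Zürich nach Winterthur."
    print(json.dumps(annotate(sample), ensure_ascii=False, indent=2))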

Challenge Slack channel: #staatsarchiv

Challenge 7: The RefBank Challenge - Cleaning and De-duplicating Bibliographic References

The RefBank Corpus contains more than 1 million bibliographic references. Since references can have different levels of specificity, the same publication may have been referenced multiple times using different strings (see the example in the picture). To be able to provide a high-quality corpus, it is important to identify duplicates. PLAZI challenges you to work on this automatic deduplication process. You will need to develop a method that extracts individual pieces of the references (e.g., author names and journal titles) and clusters the references.
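
A very simple baseline, sketched below, normalises each reference into word tokens and greedily merges references whose token overlap exceeds a threshold. The threshold and sample strings are assumptions; a real solution would also need parsed fields (authors, year, journal) and a scalable blocking step, since pairwise comparison of a million references is not feasible.

    # Minimal sketch: group near-duplicate reference strings by token-overlap similarity.
    # Threshold and sample references are assumptions; the pairwise loop is O(n^2).
    import re
    from itertools import combinations

    def tokens(ref):
        """Lowercase word tokens; digits kept, since years and volumes help discriminate."""
        return set(re.findall(r"[a-zäöü0-9]+", ref.lower()))

    def similarity(a, b):
        ta, tb = tokens(a), tokens(b)
        return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

    def cluster(references, threshold=0.6):
        """Greedy single-link clustering via union-find over similar pairs."""
        parent = list(range(len(references)))
        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]
                i = parent[i]
            return i
        for i, j in combinations(range(len(references)), 2):
            if similarity(references[i], references[j]) >= threshold:
                parent[find(i)] = find(j)
        groups = {}
        for i, ref in enumerate(references):
            groups.setdefault(find(i), []).append(ref)
        return list(groups.values())

    refs = [
        "Smith, J. (2001) On beetles. J. Zool. 12: 3-10",
        "Smith J. 2001. On beetles. Journal of Zoology 12, 3-10",
        "Miller, A. (1999) Ants of Europe. Entomol. Rev. 5: 1-20",
    ]
    for group in cluster(refs):
        print(group)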

Challenge Slack channel: #plazi

Challenge 8: Looking for the WOW Wikidata query

Wikidata, the free knowledge base that anyone can use and edit, is growing fast and is heavily used. In order to encourage further use of the data, we need to showcase what can be done with Wikidata. Currently, WikidataFacts tweets amazing queries and the community manually collects some representative queries in the wiki. But can we spot WOW queries by leveraging collective knowledge and mining the queries that people submit to the Query Service? We challenge you to analyse a query log released by WMDE and the University of Dresden, and to find WOW queries that we can use to showcase the power of Wikidata and teach others how to use it.

We will give you a 101 introduction to SPARQL, and you will have the chance to learn how to edit and query one of the most successful knowledge bases in the world.
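
To get a feel for SPARQL before digging into the logs, a query can be sent to the Wikidata Query Service from a few lines of Python. The query below is a classic beginner example, not one of the logged queries, and the User-Agent string is an arbitrary placeholder.

    # Minimal sketch: run a SPARQL query against the public Wikidata Query Service.
    import requests

    ENDPOINT = "https://query.wikidata.org/sparql"
    QUERY = """
    SELECT ?item ?itemLabel WHERE {
      ?item wdt:P31 wd:Q146.   # instance of: house cat
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }
    LIMIT 10
    """

    resp = requests.get(ENDPOINT, params={"query": QUERY, "format": "json"},
                        headers={"User-Agent": "hackathon-demo/0.1 (example)"}, timeout=60)
    resp.raise_for_status()
    for row in resp.json()["results"]["bindings"]:
        print(row["itemLabel"]["value"], row["item"]["value"])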

Challenge Slack channel: #wikidata