We will use approximately 130,000 German-language Wikipedia articles as a dataset. The dataset was obtained from the Wikimedia dump service, processed, and prepared for the purposes of the workshop.
The complete dataset can be accessed on the workshop Google Drive (data folder).
The dataset is structured in JSON format. Check out the official documentation for information on how to work with JSON-formatted data.
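For a quick start, here is a minimal sketch of loading the data in Python, assuming the dataset is a single JSON file containing a list of article objects; the file name corpus.json is a placeholder for the actual file in the data folder:

import json

# "corpus.json" is a placeholder -- use the actual file name from the
# workshop Google Drive (data folder).
with open("corpus.json", encoding="utf-8") as f:
    articles = json.load(f)  # list of article objects (dicts)

print(len(articles))         # ~130,000 articles
print(articles[0]["title"])  # title of the first article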
Each object has the following keys:
title
Article title
intro
Introduction text of the article
intro_len_char
Length of the intro in characters
intro_len_token
Length of the intro in tokens
body
Main text of the article
body_len_char
Length of the body in characters
body_len_token
Length of the body in tokens
gold_kws
Keys: keyword lemmas (normalized keywords, e.g. "Schweiz")
Values: dictionaries mapping the lemma and, if applicable, its running forms (alternative forms that appear in the article, e.g. "Schweizer" as a running form of the lemma "Schweiz") to their respective counts; see the sketch after the example below
num_kws_lemma
Number of distinct gold keyword lemmas
num_kws_all
Total number of gold keyword forms (lemmas plus running forms)
noisy_kws
List of gold keyword lemmas plus automatically added noisy keywords
num_noisy_kws
Total number of entries in noisy_kws, i.e. gold keyword lemmas plus added noisy keywords
The example below represents a single article:
{
"title": "Aussagenlogik",
"intro": "Das ist die Einleitung.",
"intro_len_char": 23,
"intro_len_token": 5,
"body": "Das ist der spannende Haupttext.",
"body_len_char": 29,
"body_len_token": 6,
"gold_kws": {
"Logik": {"Logik": 3, "logischer": 8},
"Wahrheit": {"Wahrheit": 5}
},
"num_kws_lemmas": 2, # Logik, Wahrheit
"num_kws_all": 3, # Logik, logischer, Wahrheit
"noisy_kws": ["Logik, "Wahrheit", "Ausrede", "Logic
Pro"], # noisy = Ausrede, Logic Pro
"num_noisy_kws": 4 # 2 gold lemmas + 2 added noisy
}
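To make the nested gold_kws structure concrete, the following sketch recomputes num_kws_lemma and num_kws_all for the example object above:

gold_kws = {
    "Logik": {"Logik": 3, "logischer": 8},
    "Wahrheit": {"Wahrheit": 5},
}

# Outer keys are the lemmas; each inner dict maps a form (the lemma
# itself plus any running forms) to its count in the article.
num_kws_lemma = len(gold_kws)                                 # 2
num_kws_all = sum(len(forms) for forms in gold_kws.values())  # 3

for lemma, forms in gold_kws.items():
    occurrences = sum(forms.values())
    print(f"{lemma}: {occurrences} occurrences across {len(forms)} form(s)")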
You may follow our steps for creating the gold standard corpus by accessing and running the following notebooks, which are also available on the workshop Google Drive (notebooks folder):
corpus_overview.ipynb: examine corpus characteristics (e.g. mean character and token counts; see the sketch after this list)
build_corpus.ipynb: filter texts to build the dataset
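As a sketch of the kind of statistics corpus_overview.ipynb reports (not the notebook's actual code), the mean character and token counts can be computed directly from the length fields:

from statistics import mean

# "articles" is the list of article objects loaded earlier.
mean_body_chars = mean(a["body_len_char"] for a in articles)
mean_body_tokens = mean(a["body_len_token"] for a in articles)
mean_intro_tokens = mean(a["intro_len_token"] for a in articles)

print(f"mean body length: {mean_body_chars:.1f} chars / {mean_body_tokens:.1f} tokens")
print(f"mean intro length: {mean_intro_tokens:.1f} tokens")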
To model real-world data, we added noise to the gold standard keywords by automatically extracting keyword phrases using an implementation of TextRank and named entity recognition.
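The sketch below shows how such noisy candidates could be produced with the summa TextRank implementation and spaCy's German NER model. This is only an illustration of the general approach, not the exact pipeline we used:

import spacy
from summa import keywords as textrank

# German model: python -m spacy download de_core_news_sm
nlp = spacy.load("de_core_news_sm")

def extract_candidates(text):
    """Combine TextRank keyword phrases with named entities."""
    # TextRank keyword extraction with German stopwords
    tr_kws = textrank.keywords(text, language="german", split=True)
    # Named entities recognized by spaCy
    ents = [ent.text for ent in nlp(text).ents]
    # Merge and deduplicate while preserving order
    return list(dict.fromkeys(tr_kws + ents))

print(extract_candidates(articles[0]["body"]))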
You can use the noisy keywords as a starting point for testing your own keyword consolidation methods to improve the quality of the keyword phrases, i.e. to remove the noisy data. This could include removing very similar keywords, repairing incorrect segmentation, or resolving coreferences.
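For instance, a simple first consolidation step could drop near-duplicate keywords using character-level similarity from the standard library; the 0.8 threshold is an arbitrary value to tune:

from difflib import SequenceMatcher

def similarity(a, b):
    """Character-level similarity between two strings (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def deduplicate_keywords(kws, threshold=0.8):
    """Keep a keyword only if it is not too similar to one already kept."""
    kept = []
    for kw in kws:
        if all(similarity(kw, k) < threshold for k in kept):
            kept.append(kw)
    return kept

print(deduplicate_keywords(["Logik", "Logiker", "Wahrheit", "Ausrede"]))
# ['Logik', 'Wahrheit', 'Ausrede'] -- "Logiker" is merged into "Logik"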