This data is made available to you for this workshop only.
Do not share it with anyone.
Delete it when you are done.
The data can be downloaded here. The workshop organizers will provide you with the password.
In the protected folder, you will find a Python pickle with the data (organized as a train/dev/test split), as well as an Excel file with the same data. The Excel files in the original_excel_data folder are redundant.
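If you prefer to inspect the pickle yourself before using the sample script, here is a minimal sketch. It assumes the pickle holds a dict-like object keyed by split name; that layout and the helper names `load_splits` and `describe_splits` are assumptions made for illustration, not a description of the actual file. If the pickled objects come from a third-party library, that library must be installed for loading to work.

```python
import pickle
from pathlib import Path


def load_splits(pickle_path):
    """Load the pickled data and return it unchanged."""
    # Only unpickle files from a source you trust, such as the workshop folder.
    with Path(pickle_path).open("rb") as handle:
        return pickle.load(handle)


def describe_splits(data):
    """Return {split_name: size} if data is a dict, else an empty dict.

    Assumption: the splits are stored as values of a dict. Adapt this
    if the real object is organized differently.
    """
    if not isinstance(data, dict):
        return {}
    return {
        name: len(split)
        for name, split in data.items()
        if hasattr(split, "__len__")
    }
```

Call `describe_splits(load_splits(path))` with the path to the downloaded pickle to check the size of each split before doing anything else.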
The numerous data fields are described in this shared spreadsheet. Remember that not all fields are useful.
Many acronyms appear in the text. Here is a list of online resources to resolve them: Wikipedia, AllAcronyms, interaviagroup, CASA, EASA. A TSV file containing acronym mappings from the Wikipedia and CASA sources has been prepared for you in the GitHub repository. Be careful: some acronyms appear multiple times and others are missing.
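Because an acronym can appear several times in the TSV file, one way to handle it is to keep every expansion and only substitute when there is exactly one. The sketch below assumes two tab-separated columns (acronym, then expansion) and no header row; check the actual file, since its layout is not described here. The function names `load_acronyms` and `expand` are hypothetical.

```python
import csv
from collections import defaultdict


def load_acronyms(tsv_path):
    """Map each acronym to the list of distinct expansions found for it."""
    expansions = defaultdict(list)
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if len(row) < 2:
                continue  # skip blank or malformed lines
            # Assumed column order: acronym first, expansion second.
            acronym, expansion = row[0].strip().upper(), row[1].strip()
            if expansion and expansion not in expansions[acronym]:
                expansions[acronym].append(expansion)
    return dict(expansions)


def expand(token, acronyms):
    """Return the expansion only when it is unambiguous, else the token itself."""
    candidates = acronyms.get(token.upper(), [])
    return candidates[0] if len(candidates) == 1 else token
```

Tokens that are missing from the mapping, or that have several candidate expansions, are returned unchanged, so you can decide separately how to disambiguate them.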
The data contains 461k defects from the years 2018-2019, each with many fields, including a textual description (average length: 13 ± 9 tokens). About 9% of these defects are recurrent. Here is a sample of the kind of texts to expect.
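To check length statistics like these on your own subset, a rough sketch follows. It splits on whitespace and reports the sample standard deviation; the tokenizer and the kind of deviation behind the figures above are not specified, so your numbers may differ. The function name `token_length_stats` is hypothetical.

```python
from statistics import mean, stdev


def token_length_stats(descriptions):
    """Return (mean, sample standard deviation) of whitespace token counts.

    Needs at least two descriptions, because statistics.stdev does.
    """
    lengths = [len(text.split()) for text in descriptions]
    return mean(lengths), stdev(lengths)
```

Pass it any iterable of description strings, for example the text column of one split converted to a list.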
We have created a GitHub repository here, where you will find a sample script that loads the data and performs a few sample operations. This should help you get started. The code will evolve during the workshop, and you are welcome to send pull requests or emails to update it.
Feel free to start from scratch if you want!