My Submission for the Dataline Competition, June 2022.
Bolaji O. | 08 Jul, 2022.
I recently participated in a data analytics competition. In this write-up, I share how I went about building my report submission for the Dataline Competition. I cover everything from data cleaning and data modeling to report design choices, along with the thought process behind some of my decisions.
The purpose of this write-up is to share some of my thought process and provide a guide for anyone who wishes to try their hands on the data.
Challenge Requirements
The requirement was to create a report that answers a specific set of questions by analysing data on Aircraft Accidents over a period of time.
The data provided comprises four tables: Aircrafts, Events, Injury, and Country. In addition, a data dictionary was provided explaining what the columns meant; where abbreviations were used as column values, their full definitions were included as well.
The Aircrafts table describes each aircraft and the flight it was on (departure and destination locations). The Events table holds the accidents: each row records an accident, along with where and when it happened and the atmospheric conditions. The Injury table covers the number of injured persons and the severity of each injury. The Country table simply maps country codes to full country names.
Table Descriptions
While going through the data to understand it, I realised I had to substitute the codes in some columns, such as the atmospheric conditions, with descriptions that would be more explanatory and readable in a report. To do that, I created a few lookup tables from the data dictionary so I could do the replacement once I got to the data cleaning.
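I did the replacement itself in Power Query, but for illustration, here is a DAX sketch of the same idea once a lookup table is in the model. The names WeatherCodes, Events[Weather_Code], and the fallback value are hypothetical stand-ins, not the actual columns from the challenge data.

```dax
-- Calculated column on Events: look up the readable description
-- for a weather code. "Unknown" is the fallback when no code matches.
Weather Description =
LOOKUPVALUE (
    WeatherCodes[Description],
    WeatherCodes[Code], Events[Weather_Code],
    "Unknown"
)
```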
Model Design
Before diving into Power Query to clean and wrangle the data, I started to think about what the data model design should be. The first step was identifying the fact tables and their granularity. The Events table (aka Accidents) provided a row per accident with an event_id: fact table #1. The Injury table was fact table #2, with the number of injured persons broken down by event, injury level, and passenger type.
Transformation Suggestions
Now to the dimensions. The Aircrafts table had one row per aircraft and an event_id linking the aircraft to an accident. There was also a column, Aircraft_Key, whose values ran from 1 to n depending on how many aircraft were involved in an accident. The issue was that no single column uniquely identified an aircraft, so that had to be resolved first. Even after resolving that, how would the Aircrafts table link to the Events (Accidents) table? How would relationships be built between Aircrafts, Accidents, and Injury while keeping calculations easy to write and maintaining a star schema? Some data cleaning was required.
Data Cleaning
Aircrafts and Events were linked using both event_id and Aircraft_Key. To reduce that to a single column, I added an index column to the Aircrafts table to serve as a proper AircraftKey that uniquely identifies each aircraft. Linking this to the Events table became the next hurdle. Should I merge the new AircraftKey into the Events table and have rows duplicated wherever more than one aircraft was involved in an accident? Doing that would mean that even a simple aggregation such as Number of Accidents would need a DISTINCTCOUNT, which can be an expensive calculation on large data. Instead, I resorted to a data modeling technique I had learnt: building a bridge table between Aircrafts and Events. From my learning and research, the problem was similar to what is known in BI/DW as a multi-valued dimension, where more than one row from a dimension relates to a single row in the fact table. The bridge table had just two columns, event_id (AccidentID) and AircraftKey, so an accident involving more than one aircraft had n rows in the bridge table, one per aircraft involved. Problem solved!
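To make the payoff concrete, here is a minimal sketch of the measures this shape enables; the table and measure names are illustrative rather than the exact ones from my pbix.

```dax
-- Events keeps one row per accident, so no DISTINCTCOUNT is needed.
Number of Accidents = COUNTROWS ( Events )

-- Each bridge row is one (accident, aircraft) pair, so filtering by
-- Events and counting bridge rows gives the aircraft involved.
Number of Aircraft Involved = COUNTROWS ( Bridge )

-- To slice accident measures by aircraft attributes, filters must flow
-- Aircrafts -> Bridge -> Events, e.g. via CROSSFILTER on the
-- Bridge-to-Events relationship.
Accidents by Aircraft =
CALCULATE (
    COUNTROWS ( Events ),
    CROSSFILTER ( Bridge[event_id], Events[event_id], BOTH )
)
```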
Finding that solution took days (of learning and research) because it wasn't a concept I was familiar with. But with that out of the way, the rest of the data cleaning and modeling was easy: renaming columns to more descriptive titles, replacing codes with their descriptions, creating a date table, adding a few columns for some of the analysis, and so on.
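As an example of one of those steps, a DAX date table can be as small as the sketch below; the extra columns here are illustrative, not the exact ones I added.

```dax
-- CALENDARAUTO() builds one row per date spanning the model's date range.
Date =
ADDCOLUMNS (
    CALENDARAUTO (),
    "Year", YEAR ( [Date] ),
    "Month Number", MONTH ( [Date] ),
    "Month", FORMAT ( [Date], "mmm" ),
    "Year-Month", FORMAT ( [Date], "yyyy-mm" )
)
```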
Final Data Model
Power Query Folder Organisation
The most important part of the challenge was doing the analysis and creating the report. The required analysis was split into three sections: Single Answer, Visual Solution, and Dig Deep, growing in complexity from one section to the next.
With these requirements in mind, I started, again, by thinking about how I wanted the report to look. I brought out the best design tools ever created, pen and paper, and did a few sketches based on the report requirements, making notes as I went. With the sketches done, I began writing the calculations to answer the questions. Once some base DAX measures were written down, I went into PowerPoint to create the design layout, using Dataline's brand colors as the color theme for the report.
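For a flavour of those base measures, here is a rough sketch; the Injury column names and the "Fatal" level value are assumptions based on the data dictionary description, not the exact schema.

```dax
-- Total number of injured persons across all accidents.
Total Injuries = SUM ( Injury[Injured_Count] )

-- Injuries recorded at the most severe level.
Fatal Injuries =
CALCULATE ( [Total Injuries], Injury[Injury_Level] = "Fatal" )

-- Share of injuries that were fatal; DIVIDE avoids divide-by-zero errors.
Fatal Injury % = DIVIDE ( [Fatal Injuries], [Total Injuries] )
```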
Initial Report Sketches
Report Layout Design in PowerPoint
The last step was putting it all together: data cleaned and transformed, data model created, DAX calculations written, and visual layout designed. Here is the final output of all that work.
View the interactive version to get the full User Experience.
Thanks for reading. Hopefully, some of my data model & report design decisions resonate with you. If they don't, do share your thoughts & critiques with me on LinkedIn or Twitter.
If you would like to have a go at the data, download it here. Also, here is my pbix file to download.
Check out some of my other projects on the Project page or visit my YouTube channel.