On the previous page we have loaded the cc_staging table. For our prototype, the table has about 188.000 rows, which is not very much. But, let's assume that we are building a production system and that the row count could grow significantly.
So, we will look at a few ways on how to make our design more efficient. To do so, we have to examine the data a bit more closely. Many times, this will involve some 'detective work' using SQL to find out relationships in our data.
What do we have in our table ?
1) the first column is the iso3_country code. We need to create a dimension (lookup) table to find the corresponding country names.
2) the column 'sector' corresponds to each of the ZIP files we have downloaded on the previous page.
sector
agriculture
buildings
fluorinated-gases
forestry-and-land-use
fossil-fuel-operations
manufacturing
mineral-extraction
power
transportation
waste
3) the column 'subsector' is a more fine-grained level of the sectors from the previous column, for example the sector 'transportation' has the following subsectors (found by the query SELECT distinct "subsector" FROM "public"."cc_staging" where "sector" = 'transportation';
domestic-aviation
domestic-shipping
international-aviation
international-shipping
non-broadcasting-vessels
other-transport
railways
road-transportation
8 row(s)
So, it seems we could create a 2-level Dimension Sector/Subsector
4) the columns 'start time' and 'end time' seem to indicate that the granularity of our data is annual. This is confirmed by the fact that the only value in the 'temporal_granularity' column is 'annual'.
5) the column 'gas' only contains one value : co2e_100yr. We could choose to ignore this column (not include it), but it might be wiser to keep in in case later on we add new data sources where different gases might appear.
6) the column 'emissions_quantity' will be our Fact column.
7) the column 'emissions_quantity_units' has NULL in all rows, so we can ignore it for now. Same goes for the columns 'created_date' and 'modified_date'
8) the column 'temporal_granularity' is always 'annual', we will ignore it for now.
First, I will create the Countries lookup table. The information can be found in this file .
Below you can see the workflow to load the dimension table and the resulting table in Postgres. Note that the table also has a Continent_name column, which will be useful for our reporting later on.
Now I will create lookup tables for the sector and subsector columns. I will just assign a sequential number to each sector and subsector, by using an auto-incrementing column in the database. Below you can see the workflow for the dim_sector table and the resulting table. The workflow to create the dim_subsector table is very similar.
Now that we have all this in place, we can create our fact table. We will see this on the next page.