Product Name: GHI Data Integration
Product Version: v2.05
Product Phase: Where applicable
Current Date: 1/10/2022
GHI Data Integration is a collection of Python, Pentaho, SQL, and AWS scripts designed to transform and combine non-uniform global health data sets into an end-user-consumable format. Sources include large authoritative health organizations such as the World Health Organization, the Global Health Data Exchange, UNICEF, and the World Bank.
The scripts are designed to remove the manual steps involved in data aggregation and transformation. They also allow the end user to choose the sets of data they want to extract (e.g., specific diseases, years, or indicators) without pulling the entire source.
A script is written specifically for each data organization to transform its particular format into one uniform format.
The script starts by transforming data locations into standardized ISO 3166 country codes. This allows for different data sets to be combined based on one common field.
Column data types are then formatted consistently:
For numeric measures, numbers stored in thousands or millions are converted to units.
Label fields are made uniform (e.g., text in year labels is stripped out so only the year number remains).
Column names are standardized (e.g., all country headers are renamed to country_name, regardless of the name used in the source).
The data set is normalized. More information on data normalization can be found here: https://en.wikipedia.org/wiki/Database_normalization
Finally, the data is grouped so that column headers match.
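For illustration only, here is a minimal sketch of the country-code and header-standardization steps using the country_converter and pandas modules; all column names in this sketch are placeholders, not the scripts' actual names:

import country_converter as coco
import pandas as pd

# Placeholder input resembling a raw source extract.
df = pd.DataFrame({
    'Country': ['United States of America', 'Viet Nam'],
    'Deaths (thousands)': [1.5, 2.0],
})

# Standardize locations to ISO 3166 alpha-3 codes so data sets share one common field.
df['country_code'] = coco.convert(names=df['Country'].tolist(), to='ISO3')

# Convert numbers stored in thousands to units.
df['deaths'] = df['Deaths (thousands)'] * 1000

# Standardize the country header to country_name, regardless of the source name.
df = df.rename(columns={'Country': 'country_name'})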
You will need Python 3 and the accompanying Python modules. You will also need to download the data set you want to work with from its source.
1. Install Python 3 using the latest release for your operating system, available at https://www.python.org/downloads/
2. Open your terminal (Command Prompt in Windows, which can be opened by typing "cmd" into the Start menu).
3. Install the following modules using the following command: pip install <module_name>
country_converter
pandas
numpy
glob (note: glob is part of the Python standard library, so it does not need to be installed with pip)
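For example, the pip-installable modules above can be installed in one command:

pip install country_converter pandas numpy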
4. Download your data set. For example, 2019 HIV data can be downloaded from the Global Health Data Exchange at: http://ghdx.healthdata.org/gbd-results-tool
5. Put the downloaded file into the input folder for that organization. For example, GHDx downloads should be put into:
\Global Health Impact Organization\GHI-Data-Consolidation\GHDx\input
6. Complete the steps in the "Configuring the Script" section.
7. Optional: Complete the steps in the "Bringing it all together" section.
Users can edit the parse function (shown below) in each data set script to filter and select specific attributes.
Filters: Enter the column name in quotes, followed by the values it should be filtered by, using the following format: ['<column name>']:['<filtervalue1>', '<filtervalue2>', '<filtervalueN>'],
For example, to select only 2019 and 2020 data, you would enter:
['year']:['2019', '2020'],
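As a rough sketch only, assuming the filters are held in a Python dictionary (the variable name filters and the second filter column are hypothetical, and your script's exact structure may differ):

filters = {
    'year': ['2019', '2020'],   # keep only these years
    'sex': ['Both'],            # hypothetical additional filter column
}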
Pivot Index: These will be the fixed columns in your data set. For the most part, only location should be put here. This section should not be modified unless you know what you are doing.
Pivot Columns: These will be your variable columns, and they will be combined with the other columns in this list. For example, if you put Cause, Age, and Measure in this list, there will be a column for each unique combination of the three inputs.
Consider the following input data attributes for each Pivot Column:
Cause: Hookworm, Ringworm
Age: 1-15, 16-80
Measure: Number
The output columns would be the following:
Hookworm 1-15 Number
Hookworm 16-80 Number
Ringworm 1-15 Number
Ringworm 16-80 Number
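As a rough illustration of how such combined headers can be produced (this is not the scripts' actual code; the column names location, cause, age, measure, and val are assumptions), a pandas pivot along these lines yields the output above:

import pandas as pd

df = pd.DataFrame({
    'location': ['Kenya'] * 4,
    'cause':    ['Hookworm', 'Hookworm', 'Ringworm', 'Ringworm'],
    'age':      ['1-15', '16-80', '1-15', '16-80'],
    'measure':  ['Number'] * 4,
    'val':      [120, 85, 40, 22],
})

# One column per unique (cause, age, measure) combination, indexed by location.
wide = df.pivot_table(index='location', columns=['cause', 'age', 'measure'], values='val')
wide.columns = [' '.join(col) for col in wide.columns]  # e.g., 'Hookworm 1-15 Number'
print(wide)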
You can now run the script by doing the following:
Start a terminal in the folder containing the Python script. In Windows, you can do this by pressing Shift + Right Click inside the folder and selecting "Open PowerShell window here". In Linux, use the cd command to navigate to the correct directory.
Run the script with the following command: python <script name>.py
e.g., python GHDx.py
If you have multiple data sets from different organizations and you want to combine them, you can follow these steps. At this point, you should have configured and run the script for each data set.
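As a sketch of the general idea (the file names below are hypothetical; use the output files your per-organization scripts actually produced), the standardized sets can be joined on the common country field with pandas:

import pandas as pd

# Hypothetical output files from two scripts that have already been configured and run.
ghdx = pd.read_csv('GHDx/output/ghdx_processed.csv')
who = pd.read_csv('WHO/output/who_processed.csv')

# Every script standardizes locations and renames the country header to
# country_name, so the sets can be merged on that one common field.
combined = ghdx.merge(who, on='country_name', how='outer')
combined.to_csv('combined_output.csv', index=False)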
If you have additional questions or want more information, see the following resources:
An automation product overview: https://docs.google.com/document/d/1eR6YfQTYueIlMDgmlYY5FzZqvfdLH1Cwn5V4G3GrNYA/edit
An overview of the data-automation work (GHI Data Integration tab): https://sites.google.com/view/ghiteamintranet/ghi-data-integration?authuser=0
How we build the scripts that automate the data sets:
1) Fork the GitHub code or be added as a collaborator: https://github.com/SimplySaid/GHI-Data-Consolidation. Look up how to fork a repository if you do not know how.
2) “Pre-processing” Folder
i) Originally contained an individual script for each major database (GBD, PCT, WHO), but it proved unproductive to have someone write complex code for each specific database when only 5-10 columns are needed in our models.
ii) Instead, we created a “Generic” folder:
a) Generic_processing.py contains the code.
b) It builds small modules to do simple things like:
1) Standardizing country names from our models to the countries in the databases. As described above, the script converts locations to standardized ISO 3166 country codes so that different data sets can be combined on one common field.
2) Data sets differ in their column headings. The script standardizes “normalized” data sets (where data points sit under the column heading “VAL”) and “un-normalized” data sets (where data points sit under the column heading of the indicator name). We convert to un-normalized to match the format of our existing models; a sketch of this reshaping appears at the end of this section. More information on data normalization can be found here: https://en.wikipedia.org/wiki/Database_normalization
c) Create a user input that anyone can complete to describe the data set they are pulling from (e.g., normalized or un-normalized) and to specify the information they want from the database (such as year or database reference) by selecting options in the script. Some users might need a specific year (e.g., 2017), while others want the most up-to-date data available (e.g., anything up to 2017).
3) Joining Folder:
i) Output file:
a) Insert the main columns we need in our models, along with the file containing the data we might need for the model.
Video Explanation from Alex: https://drive.google.com/file/d/1FRTuza5x4X7nsa7pYTwWtt87WNb7CnWm/view?usp=sharing
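A rough sketch of the normalized-to-un-normalized reshaping described above (the column headings indicator and VAL follow the description; the data and the country_name column are illustrative):

import pandas as pd

# 'Normalized' input: one row per data point, with values under VAL.
df = pd.DataFrame({
    'country_name': ['Kenya', 'Kenya', 'Ghana', 'Ghana'],
    'indicator':    ['hiv_prevalence', 'tb_incidence'] * 2,
    'VAL':          [4.2, 267.0, 1.7, 143.0],
})

# 'Un-normalized' output: one column per indicator, matching the model format.
wide = df.pivot(index='country_name', columns='indicator', values='VAL').reset_index()
print(wide)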