1. Concepts & Definitions
1.1. Continous random distribution of probability
1.2. Normal distribution of probability
1.3. Standard normal distribution of probability
1.4. Inverse standard normal distribution
1.6. Inverse Student's T distribution
2. Problem & Solution
2.1. Weight, dimension, and value per HS6
2.2. How to fit a distribution
2.3. Employing standard deviation
2.4. Total time spent in a system
The first task is to read a database with HS6 code which is based on INTTRA database. This data is stored in Excel file with multiple sheets.
import pandas as pd
url = 'https://drive.google.com/uc?export=download&id=1iR7-yUYoaT4jfvFwK4uhriJJj0yFQyjv'
xls = pd.ExcelFile(url)
The next step is to obtain all sheets names.
sheet_names = xls.sheet_names
sheet_names
This will produce the following output:
['bookings', 'shipping_instructions', 'track&trace']
Finally, let's select data from the first sheet.
df1 = pd.read_excel(xls, sheet_names[0])
df1
This will return the following data frame:
Let's read the names of all columns of the first sheet.
df1.columns
This will return the following column names:
Index(['bk_transaction_id', 'si_transaction_id', 'revision_date', 'carrier', 'requester', 'requester_street', 'requester_city', 'requester_state', 'requester_postal_code', 'requester_country', 'shipper', 'shipper_street', 'shipper_city', 'shipper_state', 'shipper_postal_code', 'shipper_country', 'forwarder', 'forwarder_street', 'forwarder_city', 'forwarder_state', 'forwarder_postal_code', 'forwarder_country', 'consignee', 'consignee_street', 'consignee_city', 'consignee_state', 'consignee_postal_code', 'consignee_country', 'contract_party', 'contract_party_street', 'contract_party_city', 'contract_party_state', 'contract_party_postal_code', 'contract_party_country', 'notify_party', 'notify_party_street', 'notify_party_city', 'notify_party_state', 'notify_party_postal_code', 'notify_party_country', 'second_notify_party', 'second_notify_party_street', 'second_notify_party_city', 'second_notify_party_state', 'second_notify_party_postal_code', 'second_notify_party_country', 'third_notify_party', 'third_notify_party_street', 'third_notify_party_city', 'third_notify_party_state', 'third_notify_party_postal_code', 'third_notify_party_country', 'move_type', 'pol_vessel_name', 'pol_conveyance_number', 'pol_city_unlocode', 'pol_city_name', 'pol_country_name', 'pol_etd', 'pod_vessel_name', 'pod_conveyance_number', 'pod_city_unlocode', 'pod_city_name', 'pod_country_name', 'pod_eta', 'plor_city_unlocode', 'plor_city_name', 'plor_country_name', 'plor_etd', 'plod_city_unlocode', 'plod_city_name', 'plod_country_name', 'plod_eta', 'line_item_id', 'package_count', 'cargo_description', 'harmonized_system_code', 'ncm_code', 'schedule_b', 'hazmat_undg_number', 'hazmat_imo_class', 'weight_kg', 'allocated_container_count', 'allocated_teu_count'], dtype='object')
Now, it is possible to filter rows according to certain values in column 'weight_kg' and showing the remaining rows.
filter = df1['weight_kg'] < 100000
df1.loc[filter]
This will return a smaller data frame in terms of number of rows.
Another possibility, it is to filter rows according to certain values in column 'weight_kg' and show the distribution for column 'weight_kg'.
weight = df1.loc[filter]['weight_kg']
weight.hist()
This will produce the following graph:
A very useful library is fitter that automatically searches over a certain range of probability distribution and tries to find the one that best fits the data. In the next section, we will see how to install and use it.
The previous complete code is available in the following link:
https://colab.research.google.com/drive/1Xo-2dWDgL-gmDJH3QmB6b4YMlntgQqtu?usp=sharing