1. Concepts & Definitions
1.1. Continuous probability distribution
1.2. Normal probability distribution
1.3. Standard normal probability distribution
1.4. Inverse standard normal distribution
1.6. Inverse Student's T distribution
2. Problem & Solution
2.1. Weight, dimension, and value per HS6
2.2. How to fit a distribution
2.3. Employing standard deviation
2.4. Total time spent in a system
The first step is to read the INTTRA data from an Excel file with multiple sheets.
import pandas as pd
url = 'https://drive.google.com/uc?export=download&id=1iR7-yUYoaT4jfvFwK4uhriJJj0yFQyjv'
xls = pd.ExcelFile(url)
xls
<pandas.io.excel._base.ExcelFile at 0x7f5d051f9c10>
The next code extracts the names of all sheets.
sheet_names = xls.sheet_names
sheet_names
['bookings', 'shipping_instructions', 'track&trace']
The next code reads the data from the first sheet and stores it in the variable df1.
df1 = pd.read_excel(xls, sheet_names[0])
df1
The next code lists the column names of the data frame df1.
df1.columns
Index(['bk_transaction_id', 'si_transaction_id', 'revision_date', 'carrier', 'requester', 'requester_street', 'requester_city', 'requester_state', 'requester_postal_code', 'requester_country', 'shipper', 'shipper_street', 'shipper_city', 'shipper_state', 'shipper_postal_code', 'shipper_country', 'forwarder', 'forwarder_street', 'forwarder_city', 'forwarder_state', 'forwarder_postal_code', 'forwarder_country', 'consignee', 'consignee_street', 'consignee_city', 'consignee_state', 'consignee_postal_code', 'consignee_country', 'contract_party', 'contract_party_street', 'contract_party_city', 'contract_party_state', 'contract_party_postal_code', 'contract_party_country', 'notify_party', 'notify_party_street', 'notify_party_city', 'notify_party_state', 'notify_party_postal_code', 'notify_party_country', 'second_notify_party', 'second_notify_party_street', 'second_notify_party_city', 'second_notify_party_state', 'second_notify_party_postal_code', 'second_notify_party_country', 'third_notify_party', 'third_notify_party_street', 'third_notify_party_city', 'third_notify_party_state', 'third_notify_party_postal_code', 'third_notify_party_country', 'move_type', 'pol_vessel_name', 'pol_conveyance_number', 'pol_city_unlocode', 'pol_city_name', 'pol_country_name', 'pol_etd', 'pod_vessel_name', 'pod_conveyance_number', 'pod_city_unlocode', 'pod_city_name', 'pod_country_name', 'pod_eta', 'plor_city_unlocode', 'plor_city_name', 'plor_country_name', 'plor_etd', 'plod_city_unlocode', 'plod_city_name', 'plod_country_name', 'plod_eta', 'line_item_id', 'package_count', 'cargo_description', 'harmonized_system_code', 'ncm_code', 'schedule_b', 'hazmat_undg_number', 'hazmat_imo_class', 'weight_kg', 'allocated_container_count', 'allocated_teu_count'], dtype='object')
The next code defines a function to compute the feature that will be analyzed with a Gaussian Mixture Model.
# Calculate weight per unit: 𝑤 = 𝑊 / 𝑛
def get_weight(df):
    df['weight_per_unit'] = df['weight_kg'] / df['package_count']
    return df
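As a quick check, the division can be illustrated on a small hypothetical frame (the column names match the booking sheet; the numbers are made up):

```python
import pandas as pd

def get_weight(df):
    df['weight_per_unit'] = df['weight_kg'] / df['package_count']
    return df

# Hypothetical data; real values come from the INTTRA booking sheet.
demo = pd.DataFrame({'weight_kg': [20000.0, 9000.0],
                     'package_count': [10, 3]})
demo = get_weight(demo)
print(demo['weight_per_unit'].tolist())  # [2000.0, 3000.0]
```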
This function can be combined with another one that selects the columns used in the analysis.
# Filter the dataset down to the columns of interest
def batch_filter(df):
    columns = ['si_transaction_id', 'harmonized_system_code', 'package_count', 'allocated_container_count', 'weight_kg']
    df_filt = df[columns].copy()
    #df_filt = df_filt[~df_filt.harmonized_system_code.isna()]
    df_filt = get_weight(df_filt)
    return df_filt
# Get the filtered dataset with the additional field weight_per_unit
df_filtered = batch_filter(df1)
df_filtered
It is possible to extract summary statistics from the 'weight_kg' column.
df_filtered['weight_kg'].describe()
count 992.000000
mean 20071.400137
std 27296.583183
min 0.000000
25% 5999.672500
50% 20000.000000
75% 24500.000000
max 405904.000000
Name: weight_kg, dtype: float64
A second filter can be applied to keep only the rows whose 'weight_kg' value is lower than 25000, close to the 75th percentile shown above.
df_filtered_75p_lower = df_filtered[df_filtered['weight_kg']< 25000]
df_filtered_75p_lower
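Instead of hard-coding 25000, the cut-off could also be derived from the sample itself. The sketch below uses hypothetical exponential data to show the quantile-based variant:

```python
import numpy as np
import pandas as pd

# Hypothetical data standing in for the real weights; the cut-off is
# taken as the sample's own 75th percentile rather than a fixed 25000.
rng = np.random.default_rng(0)
df_demo = pd.DataFrame({'weight_kg': rng.exponential(20000.0, size=1000)})
cutoff = df_demo['weight_kg'].quantile(0.75)
df_lower = df_demo[df_demo['weight_kg'] < cutoff]
print(len(df_lower))  # 750 of the 1000 rows survive
```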
It is possible to get a first insight into the number of classes from a histogram plot, using the following commands.
import matplotlib.pyplot as plt
h = plt.hist(df_filtered_75p_lower['weight_kg'], bins=40)
counts, bins, ax = h
print(counts)
print(bins)
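The same counts and bin edges can be computed without plotting at all, via np.histogram (plt.hist calls it internally). A small sketch with made-up data:

```python
import numpy as np

# Hypothetical data; np.histogram returns the same (counts, bin edges)
# pair that plt.hist produces, without drawing a figure.
data = np.array([1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 9.0, 9.0, 10.0])
counts, bins = np.histogram(data, bins=3)
print(counts)  # [6 0 3]
print(bins)    # [ 1.  4.  7. 10.]
```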
The next code fits a Gaussian Mixture Model to the bin edges created above, assigning each histogram bin to a group.
from sklearn.mixture import GaussianMixture
import numpy as np
x = bins[:-1]
y = counts
n_clusters = 4
gmm = GaussianMixture(n_components=n_clusters, random_state=42)
gmm.fit(x.reshape(-1, 1))
target_class = gmm.predict(x.reshape(-1, 1))
target_class
array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0])
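Note that the mixture above is fitted to the bin edges rather than to the raw observations. An alternative, sketched here on synthetic data with two known groups, is to fit the GMM directly to the observations and read the recovered centers from gmm.means_:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic (hypothetical) data: two well-separated normal groups.
rng = np.random.default_rng(42)
obs = np.concatenate([rng.normal(5000.0, 300.0, 200),
                      rng.normal(20000.0, 500.0, 200)])

# Fit the mixture to the raw observations, not to histogram bins.
gmm2 = GaussianMixture(n_components=2, random_state=42)
gmm2.fit(obs.reshape(-1, 1))
means = sorted(gmm2.means_.ravel())
print(means)  # close to the true centers, 5000 and 20000
```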
The next code reorders the group labels and relates each group to a specific color.
list_x = [[] for d in range(n_clusters)]
list_y = [[] for d in range(n_clusters)]
k = 0
for elem in target_class:
    # Remap the arbitrary GMM labels so the groups are ordered left to right
    if (elem == 0):
        ind = elem + 3
    elif (elem == 2):
        ind = elem - 2
    elif (elem == 3):
        ind = elem - 1
    else:
        ind = elem
    list_x[ind].append(x[k])
    list_y[ind].append(y[k])
    k = k + 1
list_colors = ['red', 'blue', 'green', 'yellow']
#plt.hist(df_filtered_75p_lower['weight_kg'], bins=40)
for cluster, color in zip(range(n_clusters), list_colors):
    plt.scatter(list_x[cluster], list_y[cluster], color=color, label=f'Group {cluster}')
plt.show()
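The if/elif remapping above is specific to one fitted model, since GMM component labels are arbitrary. A more general sketch, shown here on hypothetical data, sorts the components by their fitted means with np.argsort so that label 0 is always the leftmost group:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical 1-D data with four well-separated groups.
x_demo = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0,
                   20.0, 21.0, 22.0, 30.0, 31.0, 32.0])
gmm_demo = GaussianMixture(n_components=4, random_state=42)
raw = gmm_demo.fit_predict(x_demo.reshape(-1, 1))

# Component ids sorted by fitted mean, low to high, then relabeled.
order = np.argsort(gmm_demo.means_.ravel())
remap = {int(old): new for new, old in enumerate(order)}
ordered = np.array([remap[int(c)] for c in raw])
print(ordered)  # labels now increase with the data
```

This replaces the hand-written if/elif branches with a mapping that works for any number of components.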
The complete code above is available at the following link:
https://colab.research.google.com/drive/1XviDrKZ3RTBks8vEqqD7Fa6ED6XJ4wqF?usp=sharing