1. Concepts & Definitions
1.1. Continuous probability distribution
1.2. Normal probability distribution
1.3. Standard normal probability distribution
1.4. Inverse standard normal distribution
1.6. Inverse Student's T distribution
2. Problem & Solution
2.1. Weight, dimension, and value per HS6
2.2. How to fit a distribution
2.3. Employing standard deviation
2.4. Total time spent in a system
The first step is to read the INTTRA data from an Excel file with multiple sheets.
import pandas as pd
url = 'https://drive.google.com/uc?export=download&id=1iR7-yUYoaT4jfvFwK4uhriJJj0yFQyjv'
xls = pd.ExcelFile(url)
xls
<pandas.io.excel._base.ExcelFile at 0x7f5d051f9c10>
The next code extracts the names of all sheets.
sheet_names = xls.sheet_names
sheet_names
['bookings', 'shipping_instructions', 'track&trace']
The next code reads the data from the first sheet and stores it in the variable df1.
df1 = pd.read_excel(xls, sheet_names[0])
df1
The next code lists the column names of the data frame df1.
df1.columns
Index(['bk_transaction_id', 'si_transaction_id', 'revision_date', 'carrier', 'requester', 'requester_street', 'requester_city', 'requester_state', 'requester_postal_code', 'requester_country', 'shipper', 'shipper_street', 'shipper_city', 'shipper_state', 'shipper_postal_code', 'shipper_country', 'forwarder', 'forwarder_street', 'forwarder_city', 'forwarder_state', 'forwarder_postal_code', 'forwarder_country', 'consignee', 'consignee_street', 'consignee_city', 'consignee_state', 'consignee_postal_code', 'consignee_country', 'contract_party', 'contract_party_street', 'contract_party_city', 'contract_party_state', 'contract_party_postal_code', 'contract_party_country', 'notify_party', 'notify_party_street', 'notify_party_city', 'notify_party_state', 'notify_party_postal_code', 'notify_party_country', 'second_notify_party', 'second_notify_party_street', 'second_notify_party_city', 'second_notify_party_state', 'second_notify_party_postal_code', 'second_notify_party_country', 'third_notify_party', 'third_notify_party_street', 'third_notify_party_city', 'third_notify_party_state', 'third_notify_party_postal_code', 'third_notify_party_country', 'move_type', 'pol_vessel_name', 'pol_conveyance_number', 'pol_city_unlocode', 'pol_city_name', 'pol_country_name', 'pol_etd', 'pod_vessel_name', 'pod_conveyance_number', 'pod_city_unlocode', 'pod_city_name', 'pod_country_name', 'pod_eta', 'plor_city_unlocode', 'plor_city_name', 'plor_country_name', 'plor_etd', 'plod_city_unlocode', 'plod_city_name', 'plod_country_name', 'plod_eta', 'line_item_id', 'package_count', 'cargo_description', 'harmonized_system_code', 'ncm_code', 'schedule_b', 'hazmat_undg_number', 'hazmat_imo_class', 'weight_kg', 'allocated_container_count', 'allocated_teu_count'], dtype='object')
The next code defines a function to compute the feature that will be analyzed with a Gaussian Mixture Model.
# Calculate weight per unit: 𝑤 = 𝑊 / 𝑛
def get_weight(df):
    df['weight_per_unit'] = df['weight_kg'] / df['package_count']
    return df
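As a quick check, the division can be illustrated on a small hypothetical frame (the column names match the booking sheet; the numbers are made up):

```python
import pandas as pd

def get_weight(df):
    df['weight_per_unit'] = df['weight_kg'] / df['package_count']
    return df

# Hypothetical data; real values come from the INTTRA booking sheet.
demo = pd.DataFrame({'weight_kg': [20000.0, 9000.0],
                     'package_count': [10, 3]})
demo = get_weight(demo)
print(demo['weight_per_unit'].tolist())  # [2000.0, 3000.0]
```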
This function can be combined with another one that selects the columns used in the analysis.
# Filter the dataset down to the columns of interest
def batch_filter(df):
    columns = ['si_transaction_id', 'harmonized_system_code', 'package_count', 'allocated_container_count', 'weight_kg']
    df_filt = df[columns].copy()
    #df_filt = df_filt[~df_filt.harmonized_system_code.isna()]
    df_filt = get_weight(df_filt)
    return df_filt
# Get the filtered dataset with the additional field weight_per_unit
df_filtered = batch_filter(df1)
df_filtered
It is possible to extract summary statistics from the 'weight_kg' column.
df_filtered['weight_kg'].describe()
count 992.000000
mean 20071.400137
std 27296.583183
min 0.000000
25% 5999.672500
50% 20000.000000
75% 24500.000000
max 405904.000000
Name: weight_kg, dtype: float64
A second filter can be applied to keep only the rows whose 'weight_kg' value is lower than 25000, close to the 75th percentile shown above.
df_filtered_75p_lower = df_filtered[df_filtered['weight_kg']< 25000]
df_filtered_75p_lower
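Instead of hard-coding 25000, the cut-off could also be derived from the sample itself. The sketch below uses hypothetical exponential data to show the quantile-based variant:

```python
import numpy as np
import pandas as pd

# Hypothetical data standing in for the real weights; the cut-off is
# taken as the sample's own 75th percentile rather than a fixed 25000.
rng = np.random.default_rng(0)
df_demo = pd.DataFrame({'weight_kg': rng.exponential(20000.0, size=1000)})
cutoff = df_demo['weight_kg'].quantile(0.75)
df_lower = df_demo[df_demo['weight_kg'] < cutoff]
print(len(df_lower))  # 750 of the 1000 rows survive
```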
It is possible to get a first insight into the number of classes from a histogram plot, using the following commands.
import matplotlib.pyplot as plt
h = plt.hist(df_filtered_75p_lower['weight_kg'], bins=40)
counts, bins, ax = h
print(counts)
print(bins)
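The same counts and bin edges can be computed without plotting at all, via np.histogram (plt.hist calls it internally). A small sketch with made-up data:

```python
import numpy as np

# Hypothetical data; np.histogram returns the same (counts, bin edges)
# pair that plt.hist produces, without drawing a figure.
data = np.array([1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 9.0, 9.0, 10.0])
counts, bins = np.histogram(data, bins=3)
print(counts)  # [6 0 3]
print(bins)    # [ 1.  4.  7. 10.]
```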
The next code fits a Gaussian Mixture Model to the bin edges created above, assigning each histogram bin to a group.
from sklearn.mixture import GaussianMixture
import numpy as np
x = bins[:-1]
y = counts
n_clusters = 4
gmm = GaussianMixture(n_components=n_clusters, random_state=42)
gmm.fit(x.reshape(-1, 1))
target_class = gmm.predict(x.reshape(-1, 1))
target_class
array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0])
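Note that the mixture above is fitted to the bin edges rather than to the raw observations. An alternative, sketched here on synthetic data with two known groups, is to fit the GMM directly to the observations and read the recovered centers from gmm.means_:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic (hypothetical) data: two well-separated normal groups.
rng = np.random.default_rng(42)
obs = np.concatenate([rng.normal(5000.0, 300.0, 200),
                      rng.normal(20000.0, 500.0, 200)])

# Fit the mixture to the raw observations, not to histogram bins.
gmm2 = GaussianMixture(n_components=2, random_state=42)
gmm2.fit(obs.reshape(-1, 1))
means = sorted(gmm2.means_.ravel())
print(means)  # close to the true centers, 5000 and 20000
```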
The next code reorders the group labels and relates each group to a specific color.
list_x = [[] for d in range(n_clusters)]
list_y = [[] for d in range(n_clusters)]
k = 0
for elem in target_class:
    # Remap the arbitrary GMM labels so the groups are ordered left to right
    if (elem == 0):
        ind = elem + 3
    elif (elem == 2):
        ind = elem - 2
    elif (elem == 3):
        ind = elem - 1
    else:
        ind = elem
    list_x[ind].append(x[k])
    list_y[ind].append(y[k])
    k = k + 1
list_colors = ['red', 'blue', 'green', 'yellow']
#plt.hist(df_filtered_75p_lower['weight_kg'], bins=40)
for cluster, color in zip(range(n_clusters), list_colors):
    plt.scatter(list_x[cluster], list_y[cluster], color=color, label=f'Group {cluster}')
plt.show()
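The if/elif remapping above is specific to one fitted model, since GMM component labels are arbitrary. A more general sketch, shown here on hypothetical data, sorts the components by their fitted means with np.argsort so that label 0 is always the leftmost group:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical 1-D data with four well-separated groups.
x_demo = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0,
                   20.0, 21.0, 22.0, 30.0, 31.0, 32.0])
gmm_demo = GaussianMixture(n_components=4, random_state=42)
raw = gmm_demo.fit_predict(x_demo.reshape(-1, 1))

# Component ids sorted by fitted mean, low to high, then relabeled.
order = np.argsort(gmm_demo.means_.ravel())
remap = {int(old): new for new, old in enumerate(order)}
ordered = np.array([remap[int(c)] for c in raw])
print(ordered)  # labels now increase with the data
```

This replaces the hand-written if/elif branches with a mapping that works for any number of components.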
The complete code above is available at the following link:
https://colab.research.google.com/drive/1XviDrKZ3RTBks8vEqqD7Fa6ED6XJ4wqF?usp=sharing