This project involved working with the parcel delivery company [Name not mentioned for privacy reasons] to understand the primary causes of unfulfilled ride requests and develop solutions to optimise driver locations. By leveraging causal inference techniques, I helped identify factors that increase the fraction of completed orders. Since drivers are paid based on the number of requests they accept, this solution directly impacts both client satisfaction and business growth.
The project focused on estimating causal effects of specific variables (treatments) on outcomes of interest, particularly analysing how driver proximity and location recommendations influence request fulfilment rates.
A variable X can be said to cause another variable Y if, once all confounders are adjusted for, an intervention on X results in a change in Y, but an intervention on Y does not necessarily result in a change in X. This contrasts with correlation, which is inherently symmetric: if X correlates with Y, then Y correlates with X, whereas if X causes Y, Y need not cause X.
In summary, X causes Y if, once all confounders are adjusted for, an intervention on X changes Y, while an intervention on Y does not necessarily change X.
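The asymmetry between intervening on X and intervening on Y can be illustrated with a small simulation. This is only a sketch under an assumed toy model (Y := 2·X + noise); the function name `simulate` and the `do_x` / `do_y` parameters are hypothetical and are not part of the project code.

```python
import random


def simulate(n=10_000, do_x=None, do_y=None, seed=0):
    """Sample from an assumed toy model in which X causes Y (Y := 2*X + noise).

    Passing do_x or do_y forces that variable to a fixed value, which mimics
    an intervention. Returns the sample means of X and Y.
    """
    rng = random.Random(seed)
    x_total = y_total = 0.0
    for _ in range(n):
        # Always draw both noise terms so every run sees the same random stream.
        x_noise = rng.gauss(0, 1)
        y_noise = rng.gauss(0, 1)
        x = x_noise if do_x is None else do_x
        y = 2 * x + y_noise if do_y is None else do_y
        x_total += x
        y_total += y
    return x_total / n, y_total / n


baseline_x, baseline_y = simulate()
_, y_after_do_x = simulate(do_x=3.0)  # intervene on X: Y shifts
x_after_do_y, _ = simulate(do_y=3.0)  # intervene on Y: X is unaffected

print(f"mean Y: baseline {baseline_y:.2f}, after do(X=3) {y_after_do_x:.2f}")
print(f"mean X: baseline {baseline_x:.2f}, after do(Y=3) {x_after_do_y:.2f}")
```

Forcing X moves the mean of Y to roughly 6, while forcing Y leaves the mean of X exactly where it was, even though X and Y are strongly correlated in the baseline samples.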
1. Causality is represented mathematically via Structural Causal Models (SCMs). The two key elements of SCMs are a graph and a set of equations. More specifically, the graph is a directed acyclic graph (DAG), and the set of equations is a structural equation model (SEM).
DAGs are graphs that provide visual representations of causal relationships among a set of variables. These causal relationships are either known to be true or, more commonly, are only assumed to be true. Each arrow in a DAG represents a causal relationship.
For example, in the following DAG, age has a causal effect on both the risk of skin cancer and the decision to move to Florida.
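A DAG like this one can be written down as a mapping from each variable to its direct causes. The sketch below uses only the Python standard library (`graphlib`, Python 3.9+); the node names are illustrative spellings of the variables in the example, and `is_dag` is a hypothetical helper.

```python
from graphlib import CycleError, TopologicalSorter

# Each node maps to the set of its direct causes (its parents in the DAG).
parents = {
    "age": set(),
    "skin_cancer_risk": {"age"},
    "moves_to_florida": {"age"},
}


def is_dag(parent_map):
    """Return True if the graph has no directed cycles."""
    try:
        tuple(TopologicalSorter(parent_map).static_order())
    except CycleError:
        return False
    return True


# A causal ordering lists every cause before its effects.
causal_order = list(TopologicalSorter(parents).static_order())

# Variables that share a parent are confounded by it.
common_causes = parents["skin_cancer_risk"] & parents["moves_to_florida"]

print(causal_order)
print(common_causes)
```

Because age is a common cause, skin cancer risk and moving to Florida can be correlated without either one causing the other.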
2. Structural Equation Models (SEMs): Structural equation modeling (SEM) is a powerful multivariate technique, increasingly used in scientific investigations, for testing and evaluating multivariate causal relationships [4]. SEMs differ from other modeling approaches in that they test both the direct and indirect effects within pre-assumed causal relationships. SEM is a nearly 100-year-old statistical method that has progressed over three generations.
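To make the distinction between direct and indirect effects concrete, here is a minimal sketch of a linear SEM with one mediator. The variables X, M and Y, the coefficients `a`, `b`, `c`, and both function names are assumptions chosen for illustration; they do not come from the project.

```python
def outcome(x, a, b, c):
    """Noise-free linear SEM with a mediator.

    M := a * X          (X affects the mediator M)
    Y := c * X + b * M  (direct effect c, plus the path through M)
    """
    m = a * x
    return c * x + b * m


def total_effect(a, b, c):
    """Total effect of X on Y: direct effect plus indirect effect via M."""
    direct = c
    indirect = a * b
    return direct + indirect


# Assumed path coefficients, for illustration only.
a, b, c = 0.5, 2.0, 1.0

# Raising X by one unit changes Y by exactly the total effect.
change_in_y = outcome(1.0, a, b, c) - outcome(0.0, a, b, c)
print(change_in_y, total_effect(a, b, c))
```

In a linear model the indirect effect along a path is the product of the coefficients on that path, which is why the two printed numbers agree.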
I implemented a structured causal inference methodology based on Pearl's framework, combining statistical analysis with domain expertise. The approach involved:
Performing exploratory data analysis and feature engineering
Inferring causal graphs from observational data and validating them
Constructing causal graphs based on algorithm recommendations
Implementing causal inference tasks to answer key business questions
Evaluating the Causal Model's prediction performance
The dataset consists of:
two years of data (2021 and 2022)
service records for almost every hour of the day (only four hours, [1, 3, 4, 5], are not covered)
drivers who received an offer but did not get the job.
At the end and the beginning of the day (around 23:00 and 02:00), service duration increases significantly, which suggests that traffic is heavy at those times. This is supported by the distribution of the hour column in the dataset.
The hour distribution is roughly bell-shaped and peaks at 14:00, which suggests that demand is highest at that time and that normal traffic conditions allow the service provider to deliver the requests.
Service duration increases on holidays.
Distance and duration are not correlated; a long duration does not mean the delivery covered a long distance.
Driver proximity correlates directly with the driver action (accepted or rejected), which suggests that drivers close to the location of a new request are the ones selected to serve it.
Most deliveries are completed in a short time and over a short distance, so the scatter plot is concentrated at the lower values.
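Observations like these typically come from two simple checks: a correlation coefficient between two numeric columns and a per-group average. The sketch below implements both with the standard library. The `records` list is made-up toy data for illustration only, not values from the project dataset, and `pearson` and `mean_by_key` are hypothetical helper names.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def mean_by_key(rows, key, value):
    """Average of `value` for each distinct value of `key`."""
    groups = {}
    for row in rows:
        groups.setdefault(row[key], []).append(row[value])
    return {k: sum(v) / len(v) for k, v in groups.items()}


# Toy records, invented purely to exercise the helpers.
records = [
    {"hour": 2, "distance": 3.0, "duration": 50.0},
    {"hour": 14, "distance": 3.0, "duration": 20.0},
    {"hour": 14, "distance": 8.0, "duration": 30.0},
    {"hour": 23, "distance": 2.0, "duration": 55.0},
]

duration_by_hour = mean_by_key(records, "hour", "duration")
distance_vs_duration = pearson(
    [r["distance"] for r in records], [r["duration"] for r in records]
)
print(duration_by_hour)
print(round(distance_vs_duration, 3))
```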
The causal inference follows the official CausalNex guide.
Step 1: Preparing the Causal Graph: The CausalNex package allows you to map the causal relationships between different features manually. So the first step is to create an empty structure model and populate it, either by adding edges manually or by letting the model learn the causal relations from the data.
We start by importing StructureModel and creating an empty instance as follows.
from causalnex.structure import StructureModel
sm = StructureModel()
Approach 1: As shown below, we can insert relations manually, but with a large number of features this step becomes tedious, so approach 2 is preferred.
sm.add_edges_from([
('isWeekDay', 'speed'),
('driver_proximity', 'distance')
])
Approach 2: This approach uses the from_pandas function from the causalnex.structure.notears module to learn the structure model from the given data.
from causalnex.structure.notears import from_pandas
sm = from_pandas(df_causal_norm)
At first, this step produces all-to-all relations, as shown below.
To solve this issue, we can use some thresholds and remove edges below that as follows.
sm.remove_edges_below_threshold(0.8)
After thresholding, the result is much improved. The graph still shows a mix of logical and erroneous connections. For example, the following causal relations are correct:
distance -> duration
speed -> duration
and the following causal relations are wrong:
speed -> isWeekDay (speed doesn't cause the day to be a weekday or weekend)
driver_proximity -> isWeekDay (driver proximity doesn't determine the day of the week either)
So we can add useful edges and remove erroneous edges to have a good causal graph as shown below.
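The prune-then-correct workflow can be sketched without CausalNex using plain Python sets. The edge weights below are hypothetical stand-ins for what structure learning might return, and `prune` and `is_acyclic` are illustrative helpers rather than CausalNex functions. The only assumption about the library is that its threshold step drops edges whose absolute weight falls below the threshold.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical learned edge weights: (cause, effect) -> weight.
learned = {
    ("distance", "duration"): 1.4,
    ("speed", "duration"): -1.1,
    ("speed", "isWeekDay"): 0.9,
    ("driver_proximity", "isWeekDay"): 0.85,
    ("isWeekDay", "distance"): 0.2,
}


def prune(weighted_edges, threshold):
    """Keep only edges whose absolute weight reaches the threshold."""
    return {edge for edge, w in weighted_edges.items() if abs(w) >= threshold}


def is_acyclic(edges):
    """Check that a set of (cause, effect) pairs contains no directed cycle."""
    parents = {}
    for cause, effect in edges:
        parents.setdefault(cause, set())
        parents.setdefault(effect, set()).add(cause)
    try:
        tuple(TopologicalSorter(parents).static_order())
    except CycleError:
        return False
    return True


edges = prune(learned, 0.8)

# Domain knowledge: nothing in the data can cause the day of the week.
edges = {(cause, effect) for cause, effect in edges if effect != "isWeekDay"}

# Add the edges we believe in but the algorithm missed.
edges |= {("isWeekDay", "speed"), ("driver_proximity", "distance")}

print(sorted(edges))
print(is_acyclic(edges))
```

Since a CausalNex StructureModel is built on a NetworkX directed graph, the same corrections can be applied there with the usual edge add and remove methods; checking acyclicity afterwards matters because a Bayesian network cannot be built on a graph with cycles.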
The causal model was implemented as a Bayesian network using the following steps:
Converting the dataset into discrete form using CausalNex's Discretizer
Creating meaningful categories for numeric features:
Distance (short, 1km, 5km, 10km, 20km, long)
Driver proximity (short, 1km, 5km, 10km, 20km, long)
Duration (1min, 1hr, 2hr, above-2hr)
Speed (slow, medium, fast)
Fitting the Bayesian Network's conditional probability distributions using the BayesianEstimator with K2 prior
Using the fitted model to predict the fulfillment status of requests
This approach enabled robust probabilistic modeling of the key factors influencing request fulfillment.
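The discretisation and fitting steps above can be sketched in plain Python to show what they compute. This is not the CausalNex implementation: the split points, the toy rows, the `status` column name, and the choice of speed as the parent of the fulfilment status are all assumptions made for illustration. The K2 prior is taken here in its usual sense of adding one pseudo-count to every state before normalising.

```python
from bisect import bisect_right
from collections import Counter

SPEED_SPLITS = [20.0, 40.0]  # hypothetical thresholds in km/h
SPEED_LABELS = ["slow", "medium", "fast"]
STATUS_STATES = ["fulfilled", "unfulfilled"]


def discretise(value, split_points, labels):
    """Map a numeric value to the label of the bin it falls into."""
    if len(labels) != len(split_points) + 1:
        raise ValueError("need exactly one more label than split points")
    return labels[bisect_right(split_points, value)]


def fit_cpd_k2(rows, parent, child, parent_states, child_states):
    """Estimate P(child | parent) with a K2 prior (one pseudo-count per state)."""
    counts = Counter((row[parent], row[child]) for row in rows)
    cpd = {}
    for p in parent_states:
        total = sum(counts[(p, c)] for c in child_states) + len(child_states)
        cpd[p] = {c: (counts[(p, c)] + 1) / total for c in child_states}
    return cpd


def predict(cpd, parent_state):
    """Return the most probable child state for a given parent state."""
    distribution = cpd[parent_state]
    return max(distribution, key=distribution.get)


# Toy rows, invented for illustration: (speed in km/h, fulfilment status).
raw = [(55.0, "fulfilled")] * 3 + [(12.0, "unfulfilled")] * 2 + [(12.0, "fulfilled")]
rows = [
    {"speed": discretise(speed, SPEED_SPLITS, SPEED_LABELS), "status": status}
    for speed, status in raw
]

cpd = fit_cpd_k2(rows, "speed", "status", SPEED_LABELS, STATUS_STATES)
print(cpd)
print(predict(cpd, "fast"), predict(cpd, "slow"))
```

The pseudo-counts keep every probability strictly positive, so a parent state that never appears in the data (here "medium") falls back to a uniform distribution instead of causing a division by zero.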
The causal model showed strong overall performance with an AUC of 0.988, indicating excellent discriminative ability. The classification report revealed:
High precision for identifying unfulfilled requests
Lower recall for the minority class (unfulfilled requests)
Strong overall performance for the majority class (fulfilled requests)
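For readers who want to see how these metrics are defined, the sketch below computes precision, recall and AUC from scratch. The labels and scores are toy values invented for illustration, not the project's predictions, and the function names are hypothetical. AUC is computed as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half.

```python
def precision_recall(y_true, y_pred, positive):
    """Precision and recall for the class named by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def roc_auc(y_true, scores, positive):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: the minority class is "unfulfilled".
y_true = ["unfulfilled", "fulfilled", "fulfilled", "unfulfilled", "fulfilled"]
y_pred = ["unfulfilled", "fulfilled", "fulfilled", "fulfilled", "fulfilled"]
# Assumed predicted probability that each request is unfulfilled.
scores = [0.9, 0.1, 0.2, 0.25, 0.3]

precision, recall = precision_recall(y_true, y_pred, "unfulfilled")
auc = roc_auc(y_true, scores, "unfulfilled")
print(precision, recall, round(auc, 3))
```

In this toy case every predicted "unfulfilled" is correct (high precision) but one of the two unfulfilled requests is missed (lower recall), the same pattern the classification report describes for the minority class.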
Key causal insights from the model:
Driver proximity significantly impacts request fulfillment
Time of day influences service completion rates
Weekday vs. weekend patterns affect fulfillment differently
The project presented several significant challenges:
Data imbalance: The dataset was heavily skewed with 96.3% fulfilled requests, requiring careful handling to avoid overfitting
Complex causal relationships: Distinguishing genuine causal links from correlations required iterative refinement
Discretization decisions: Finding appropriate thresholds for continuous variables that maintained information while enabling causal analysis
Future improvements could include collecting more data, particularly on unfulfilled requests, and adding more features that provide valuable information to the model.
This project showcased expertise in:
Python for data analysis and modeling
CausalNex for causal discovery and Bayesian Network modeling
NetworkX for graph manipulation and visualization
Structural Causal Models (SCMs) and Directed Acyclic Graphs (DAGs)
Bayesian probability theory and conditional probability distributions
Data preprocessing and feature engineering
Geographic data visualization using OSMnx
Model evaluation and interpretation