Market Basket Analysis

Market Basket Analysis: Exploiting Product Relationships to Optimize Sales Strategies

The objective of the project is to conduct Market Basket Analysis (MBA) - a popular analytical technique in the retail industry - to:

Understand which products are often purchased together
Develop product association rules
Propose appropriate sales strategies, product recommendations and promotions

You can find the full code on GitHub here.

Theoretical Overview

1. What is Market Basket Analysis?

Market Basket Analysis is a method used to discover product combination patterns in customers' shopping behavior. For example, if many people who buy bread also buy peanut butter, we can recommend these two products together to increase revenue.

2. Important Indicators:

To quantify product relationships, MBA uses three measures: support, confidence, and lift.

Support: The share of all transactions that contain the item combination. A minimum support threshold removes rules that are too rare to have business value.
Confidence: The likelihood that a customer buys product B given that they have bought product A.
Lift: How much more often two products are purchased together than would be expected if they were bought independently. A lift above 1 indicates a positive association. Helps eliminate rules that have high confidence only because B is already popular.

For example, if the Lift of [Coffee ⇒ Milk] is 3.5, customers who buy Coffee are 3.5 times as likely to buy Milk as customers overall.
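To make the three definitions concrete, here is a minimal sketch that computes them by hand. The baskets are toy data invented for illustration, and the helper names (`support`, `confidence`, `lift`) are my own rather than part of any library.

```python
# Toy baskets, invented for illustration (not the project's data).
transactions = [
    {"Coffee", "Milk"},
    {"Coffee", "Milk", "Bread"},
    {"Coffee", "Bread"},
    {"Milk"},
    {"Bread", "Peanut Butter"},
    {"Bread"},
    {"Tea"},
    {"Tea", "Bread"},
]

def support(itemset, baskets):
    """Share of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Estimated P(consequent | antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)

def lift(antecedent, consequent, baskets):
    """Confidence divided by the consequent's baseline support."""
    return confidence(antecedent, consequent, baskets) / support(consequent, baskets)

rule_support = support({"Coffee", "Milk"}, transactions)
rule_confidence = confidence({"Coffee"}, {"Milk"}, transactions)
rule_lift = lift({"Coffee"}, {"Milk"}, transactions)
print(rule_support, rule_confidence, rule_lift)
```

In this toy data, Coffee and Milk appear together in 2 of 8 baskets, so the support is 0.25, the confidence is about 0.67 and the lift is about 1.78.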

3. Data requirements for MBA:

To perform MBA, we'll need transactional data, which must include:

Transaction ID
Products purchased in each transaction

Approach

1. Data preparation:

To explore customer purchase patterns in online fashion retail, inspired by platforms like SHEIN, I use a synthetic dataset modeled on Vietnam's weather conditions (hot, humid, and rainy).

Because actual SHEIN data isn’t publicly available, I generated a realistic dataset with:

2,000 transactions representing customer shopping baskets
25 products, chosen to match Vietnam’s weather and current fashion trends

Each basket contains 1 to 6 randomly selected items, simulating typical SHEIN carts.

Examples of items:

Summerwear: Crop Top, Maxi Skirt, Linen Pants
Rain-ready: Light Rain Jacket, Rain Boots, Umbrella
Accessories: Sunglasses, Tote Bag, Bucket Hat
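A dataset of this shape could be generated along the following lines. This is only a sketch under assumptions: the nine product names come from the examples above, the other 16 are placeholders because the full catalogue is not listed here, and uniform random sampling with a fixed seed is my guess at the simplest approach rather than a description of the actual generator.

```python
import random

# Nine names come from the examples above; the remaining 16 are placeholders
# because the full 25-product catalogue is not listed in this write-up.
named_products = [
    "Crop Top", "Maxi Skirt", "Linen Pants",
    "Light Rain Jacket", "Rain Boots", "Umbrella",
    "Sunglasses", "Tote Bag", "Bucket Hat",
]
products = named_products + [f"Placeholder Item {i}" for i in range(1, 17)]

def generate_baskets(n_transactions=2000, min_items=1, max_items=6, seed=42):
    """Draw each basket as a uniform random sample of distinct products."""
    rng = random.Random(seed)  # fixed seed (an assumption) for reproducibility
    baskets = []
    for _ in range(n_transactions):
        size = rng.randint(min_items, max_items)
        baskets.append(rng.sample(products, size))
    return baskets

baskets = generate_baskets()
print(len(baskets), baskets[0])
```

With purely uniform sampling, every pair of items has an expected lift close to 1, so any associations found are sampling noise. A generator meant to plant realistic affinities would need weighted or correlated sampling.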

2. Data normalization:

Transforming the data into a one-hot encoded table (one row per transaction, one boolean column per product) for the Apriori algorithm

3. Applying Apriori algorithm to find association rules

Using mlxtend in Python to generate association rules

Running Apriori & Extracting Rules

The first step in creating a set of association rules is to choose suitable thresholds for support and lift/confidence. I tried different values of support and lift and plotted how many rules each combination generates.
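The idea behind such a threshold sweep can be sketched in plain Python. This is a simplification that counts only rules between two single items, whereas the mlxtend workflow also covers larger itemsets. The function name `count_pair_rules`, the toy baskets and the grid values are all assumptions for illustration.

```python
from collections import Counter
from itertools import combinations

def count_pair_rules(baskets, min_support, min_lift):
    """Count directed A -> B rules between single items that pass both thresholds."""
    n = len(baskets)
    item_counts = Counter(item for basket in baskets for item in set(basket))
    pair_counts = Counter(
        pair for basket in baskets for pair in combinations(sorted(set(basket)), 2)
    )
    n_rules = 0
    for (a, b), count in pair_counts.items():
        pair_support = count / n
        pair_lift = pair_support / ((item_counts[a] / n) * (item_counts[b] / n))
        if pair_support >= min_support and pair_lift >= min_lift:
            n_rules += 2  # lift is symmetric, so A -> B and B -> A both pass
    return n_rules

# Toy baskets invented for illustration.
toy_baskets = [
    ["A", "B"], ["A", "B"], ["A", "C"], ["B"],
    ["C"], ["C", "D"], ["D"], ["A", "B", "D"],
]

# Rule count for every (support, lift) combination in a small grid.
grid = {
    (s, l): count_pair_rules(toy_baskets, s, l)
    for s in (0.10, 0.25, 0.50)
    for l in (1.0, 1.5, 2.0)
}
for (s, l), n_rules in sorted(grid.items()):
    print(f"support >= {s:.2f}, lift >= {l:.1f}: {n_rules} rules")
```

Plotting the values in `grid` as one line per support level gives the kind of chart used to compare thresholds.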

Support level of 3%: The number of rules is zero across all lift thresholds. This indicates that no item combination both reaches the 3% minimum support and clears the lift thresholds tested.

=> This threshold is too strict for this dataset and results in no actionable rules.

Support level of 2.5%: At a low lift threshold of 1.10, more than 30 rules are generated. As the lift threshold increases (to 1.20, 1.30...), the number of rules drops quickly. At a lift of 1.40 or above, no rules are found.

=> This support level allows a moderate number of rules to be discovered, but those with strong correlations (high lift) are limited.

Support level of 2%: The highest number of rules appears at lift = 1.10, with over 115 rules. The rule count declines sharply as lift increases: about 32 rules at lift = 1.20, 6 rules at lift = 1.30, and almost none at lift = 1.40 or higher.

=> This threshold captures the broadest range of associations, including weaker and stronger ones. It’s ideal for exploration, especially in highly varied datasets like fashion.

To balance rule quantity and business relevance, I will use: Support = 2.5% and Lift = 1.20. This combination ensures:

Rules are generated from items that appear frequently enough to matter.
The number of rules is manageable for analysis and action.
A lift threshold of 1.2 filters for stronger associations, ideal for curating product bundles, designing cross-sell campaigns, and identifying co-purchased seasonal trends.

"Co-ord Set" is the most frequently purchased item, appearing in 15.8% of all transactions.

Item pairs appear in both directions, meaning the presence of either item increases the likelihood of the other being purchased:

1. Maxi Skirt → Denim Jacket

Support: 2.6% of all transactions
Confidence: ~19%
Lift: 1.40

→ Shoppers who buy a Maxi Skirt are 1.4 times as likely to also buy a Denim Jacket as shoppers overall.

2. Denim Jacket → Maxi Skirt

Same support and lift (both measures are symmetric) and a similar confidence, indicating bidirectional affinity between these two items.
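As a sanity check, the three reported figures for the Maxi Skirt and Denim Jacket pair pin down how often each item appears on its own, and from there the confidence of the reverse rule. The sketch below uses only the rounded numbers quoted above, so the results are approximate.

```python
# Figures reported above for Maxi Skirt -> Denim Jacket (rounded in the text).
rule_support = 0.026   # share of baskets containing both items
confidence = 0.19      # P(Denim Jacket | Maxi Skirt), reported as ~19%
lift = 1.40

# support(A) = support(A and B) / confidence(A -> B)
maxi_skirt_share = rule_support / confidence
# support(B) = confidence(A -> B) / lift(A -> B)
denim_jacket_share = confidence / lift
# confidence(B -> A) = support(A and B) / support(B)
reverse_confidence = rule_support / denim_jacket_share

print(f"Implied Maxi Skirt share:   {maxi_skirt_share:.1%}")    # 13.7%
print(f"Implied Denim Jacket share: {denim_jacket_share:.1%}")  # 13.6%
print(f"Implied reverse confidence: {reverse_confidence:.1%}")  # 19.2%
```

Both items appear in roughly 14% of baskets, which is why the two directions of the rule have nearly the same confidence.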

3. Maxi Skirt → Bermuda Shorts

Support: 2.7%
Confidence: 19.6%
Lift: 1.37

→ Although these two items serve different fashion purposes, the association suggests that some customers may be buying diverse bottoms in one purchase (for different occasions or climates).

4. Printed Blouse → Sleeveless Dress

Support: 2.5%
Confidence: 17.7%
Lift: 1.31

→ Customers purchasing Printed Blouses are about 1.3 times as likely as shoppers overall to add a Sleeveless Dress, possibly for layering or style coordination.

Business Takeaways

Cross-sell Opportunities: Promote bundled recommendations such as “Pair your Maxi Skirt with a Denim Jacket” to boost average order value (AOV).

Seasonal Promotions: Use these item pairs to create seasonal style guides.

Appendix: Detailed Interpretation by Scenarios

High Support & High Confidence:

Many transactions include this item combination, and when the first item is bought, the second is very likely to be bought too. This is a strong, reliable rule. These combinations are ideal candidates for:

Product bundling
Cross-promotion
Store layout optimization

High Support & Low Confidence:

The item combination appears frequently, but the conditional probability is weak - buying A does not strongly indicate buying B. May not be effective for personalized recommendations, but still useful for:

Co-location of products
Promotions targeting large audiences

Low Support & High Confidence:

The rule applies to a niche group, but within that group, the behavior is consistent. Great for targeted marketing to specific segments, such as:

Loyalty programs
Personalized email offers

Low Support & Low Confidence:

Rarely occurs and isn’t consistent when it does. Likely noise; it should generally be ignored or filtered out using thresholds. May still be worth investigating under seasonal conditions, but not for general-purpose strategies.
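The four scenarios can be turned into a simple lookup. In this sketch the function name `classify_rule` and the default cutoffs (2.5% support, 19% confidence) are illustrative assumptions; what counts as "high" should be set for each dataset.

```python
def classify_rule(support, confidence, support_cutoff=0.025, confidence_cutoff=0.19):
    """Place a rule in one of the four support/confidence scenarios.

    The default cutoffs are illustrative assumptions, not values from the analysis.
    """
    support_level = "High" if support >= support_cutoff else "Low"
    confidence_level = "High" if confidence >= confidence_cutoff else "Low"
    return f"{support_level} Support & {confidence_level} Confidence"

# Suggested uses, paraphrased from the scenarios above.
SUGGESTED_USES = {
    "High Support & High Confidence": "bundling, cross-promotion, layout optimization",
    "High Support & Low Confidence": "co-location, broad-audience promotions",
    "Low Support & High Confidence": "loyalty programs, personalized email offers",
    "Low Support & Low Confidence": "usually filter out; revisit seasonally",
}

scenario = classify_rule(support=0.027, confidence=0.196)
print(scenario, "->", SUGGESTED_USES[scenario])
```

Applied to a whole table of rules, a helper like this gives a quick first pass at which rules deserve which kind of follow-up.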
