This project builds a predictive model for an online book club subscription based on customer purchase history. The process includes descriptive statistics, regression modeling, customer profiling, and the development of a targeted marketing strategy. Each part builds toward understanding which customer attributes are associated with a higher likelihood of subscribing.
Descriptive Statistics
The analysis begins with descriptive statistics to understand customer spending, purchase behavior, and recency. Summary measures and categorical breakdowns help clarify the structure of the dataset and set the foundation for deeper analysis. Group comparisons by gender and subscription status are also included.
CODE:
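# Install only the required packages that are not already installed
pkgs <- c("tidyverse", "skimr", "ggplot2", "janitor", "broom", "dplyr", "ggcorrplot", "caret", "pROC", "ROCR")
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))
if (length(missing_pkgs) > 0) install.packages(missing_pkgs)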
# Load libraries
library(tidyverse)
library(skimr)
library(ggplot2)
library(janitor)
library(broom)
library(ggcorrplot)
library(pROC)
library(ROCR)
# Read in the data
df <- read_csv("OnlineBookClub.csv")
# Clean column names
df <- clean_names(df)
# Preview data
skim(df)
RESULTS
To better understand the customer base and prepare the dataset for modeling, a full data scan was conducted using summary statistics from skim(). The dataset is clean, with no missing values in any variable. Numeric fields show a wide range of behaviors, including varied spending levels, purchase frequencies, and time since first or last purchase, suggesting strong potential for segmentation and predictive modeling. Categorical fields such as gender and subscription status contain no unexpected or malformed values. Overall, the data is complete, well-structured, and ready for advanced analysis.
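The completeness claims above can also be checked directly rather than read off the skim() output. The following is a minimal sketch, assuming column names like those used elsewhere in this write-up; `toy_df` and its values are hypothetical stand-ins for df, not the real data.

```r
library(dplyr)

# Hypothetical toy data standing in for df (illustrative values only)
toy_df <- tibble(
  gender    = c("F", "M/NB", "F", "F"),
  subscribe = c("no", "yes", "no", "no"),
  total     = c(120, 310, 95, 208)
)

# Missing values per column; all zeros means the data is complete
missing_counts <- colSums(is.na(toy_df))
print(missing_counts)

# Frequency of each level, to spot unexpected or malformed categories
gender_levels <- count(toy_df, gender)
subscribe_levels <- count(toy_df, subscribe)
print(gender_levels)
print(subscribe_levels)
```

Running the same calls on df would show whether any column has missing values and whether gender or subscribe contains stray labels.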
CODE:
# Basic stats
summary_stats <- df %>%
  summarise(
    avg_total_spent = mean(total, na.rm = TRUE),
    sd_total_spent = sd(total, na.rm = TRUE),
    avg_books_purchased = mean(purch, na.rm = TRUE),
    sd_books_purchased = sd(purch, na.rm = TRUE),
    avg_recency = mean(last, na.rm = TRUE),
    sd_recency = sd(last, na.rm = TRUE)
  )
print(summary_stats)
# Gender & subscription stats
gender_subs <- df %>%
  group_by(gender, subscribe) %>%
  summarise(count = n(), .groups = "drop") %>%
  pivot_wider(names_from = subscribe, values_from = count, values_fill = 0) %>%
  mutate(
    total = yes + no,
    sub_rate = yes / total
  )
print(gender_subs)
RESULTS
Subscription Rate by Gender
The subscription rate among M/NB (male/non-binary) customers (12.77%) is roughly 1.8 times that of female customers (7.17%). Although female customers are more numerous, they are proportionally less likely to subscribe to the online book club. This suggests gender-based differences in engagement that may call for targeted marketing strategies or a review of how the offering is positioned to different segments.
Descriptive Statistics: Spend, Purchases, and Recency
On average, customers have spent $208 in offline stores, purchased about 4 books, and have not made a purchase in approximately 12 months. The high standard deviation for both spend and recency reflects a wide range of engagement levels, with some customers being highly active and others potentially lapsed. This variability reinforces the need for personalized outreach strategies based on recency and spending behavior.
This bar chart visually shows the proportion of customers who subscribed to the online book club, broken down by gender. While both groups have relatively low subscription rates, a higher percentage of M/NB (Male/Non-Binary) customers subscribed compared to female customers. The visual emphasizes a noticeable gap in engagement that may warrant further investigation into how messaging or product appeal differs across gender groups.
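The code behind the chart is not shown in this write-up. A minimal sketch of how such a chart could be produced is given here, assuming `subscribe` takes the values "yes" and "no" as in the earlier code. The `toy_df` data and the "M/NB" label are hypothetical stand-ins for df.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data standing in for df (illustrative values only)
toy_df <- tibble(
  gender    = c("F", "F", "F", "F", "M/NB", "M/NB", "M/NB", "M/NB"),
  subscribe = c("yes", "no", "no", "no", "yes", "yes", "no", "no")
)

# Share of subscribers within each gender group
sub_rates <- toy_df %>%
  group_by(gender) %>%
  summarise(sub_rate = mean(subscribe == "yes"), .groups = "drop")

# Bar chart of subscription rate by gender, with the y-axis shown as percent
p <- ggplot(sub_rates, aes(x = gender, y = sub_rate)) +
  geom_col() +
  scale_y_continuous(labels = scales::percent) +
  labs(x = "Gender", y = "Subscription rate",
       title = "Subscription Rate by Gender")
print(p)
```

Plotting the rate rather than the raw count keeps the two groups comparable even though one group is larger.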
Regression Models
Linear and logistic regression models are used to examine relationships between customer characteristics and outcomes. The linear model focuses on total offline spending, while the logistic model estimates subscription likelihood. Both models incorporate behavioral and demographic predictors and are interpreted using standard output metrics.
CODE:
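# is_female must exist before the model is fit (it is derived from gender)
df <- df %>% mutate(is_female = ifelse(gender == "F", 1, 0))
lm_model <- lm(total ~ is_female + first + child + youth + cook + do_it + refernce + art + geog, data = df)
# Tidy output
tidy(lm_model)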
RESULTS
Linear Regression Output: Predicting Total Spend
This model estimates how customer characteristics relate to total offline spend. Nearly every book category (child, youth, cook, do_it, reference, art, and geography) has a positive and statistically significant relationship with spending, with p-values near zero. Each estimate reflects the dollar change in total spend associated with one additional purchase in that category, holding the other predictors constant (e.g., one more cookbook purchase is associated with about $15.66 more in total spend on average).
Variables like gender (is_female) and tenure (first) are not statistically significant (p > 0.05), suggesting that demographics or how long someone has been a customer do not meaningfully predict spend once purchase behavior is accounted for.
CODE:
# Create binary variable for female
df <- df %>%
  mutate(
    is_female = ifelse(gender == "F", 1, 0),
    subscribe_bin = ifelse(subscribe == "yes", 1, 0)
  )
# Correlation among numeric variables
corr_data <- df %>%
  select(is_female, first, last, total, purch, child, youth, cook, do_it, refernce, art, geog)
corr_matrix <- cor(corr_data)
ggcorrplot(corr_matrix, lab = TRUE, type = "lower", title = "Correlation Matrix")
RESULTS
The correlation matrix supports the regression results by showing moderate to strong relationships between total spend and specific book categories. For instance, total spend is moderately correlated with child books (0.52) and highly correlated with total books purchased (0.69). Book category purchases also tend to be correlated with one another, which reinforces the importance of tracking what customers are buying.
The is_female indicator has weak negative correlations with most numeric variables, indicating only a slight difference in behavior between female and M/NB customers. These differences are not strong enough to be considered predictive of spend on their own.
Buying Prediction and Customer Profiling Overview
Predicted probabilities from the logistic model are used to rank customers by likelihood to subscribe. Customers are divided into deciles, and the top and bottom groups are profiled to identify key differences in behavior. Aggregated summaries provide insight into how characteristics vary across likelihood tiers.
CODE:
logit_model <- glm(subscribe_bin ~ last + total + is_female + child + youth + cook + do_it + refernce + art + geog,
                   data = df, family = binomial)
summary(logit_model)
# Tidy view
tidy(logit_model)
RESULTS
The logistic regression model uses behavioral and demographic variables to predict the probability of subscription. Significant negative predictors include:
Recency (last): The longer it has been since a customer’s last purchase, the less likely they are to subscribe.
Gender (is_female): Female customers are significantly less likely to subscribe, even after accounting for behavior.
Book categories like child, youth, cook, and do_it: These are all negatively associated with subscription, indicating these buyers may be more traditional, less inclined toward digital formats, or more value-oriented.
Positive predictors include:
Total spend: Slight but significant positive effect, showing that more engaged spenders are more likely to subscribe.
Reference and art books: Strongest positive relationships. Customers interested in these categories may value curated or intellectual reading experiences, aligning well with a subscription model.
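Logistic regression coefficients are on the log-odds scale, so the signs above are easier to read as odds ratios. The following is a minimal sketch of that conversion, not the model fit in this project. The `toy` data is simulated and hypothetical, and its coefficients were chosen only for illustration.

```r
# Hypothetical simulated data; the real model is fit on df
set.seed(1)
n <- 2000
toy <- data.frame(last = rpois(n, 12), art = rpois(n, 1))
toy$subscribe_bin <- rbinom(n, 1, plogis(-1 - 0.1 * toy$last + 0.8 * toy$art))

toy_model <- glm(subscribe_bin ~ last + art, data = toy, family = binomial)

# exp(coefficient) is the multiplicative change in the odds of subscribing
# for a one-unit increase in the predictor, holding the others constant
odds_ratios <- exp(coef(toy_model))
print(round(odds_ratios, 3))
```

An odds ratio below 1 corresponds to a negative predictor and one above 1 to a positive predictor. For the real model, `tidy(logit_model, exponentiate = TRUE, conf.int = TRUE)` from broom should return the same quantities with confidence intervals.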
CODE:
# Predict subscription probabilities
df$predicted_prob <- predict(logit_model, type = "response")
# Decile assignment
df <- df %>%
mutate(decile = ntile(predicted_prob, 10))
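# Profile top and bottom decile (numeric columns only; averaging text columns such as gender returns NA with warnings)
top_decile <- df %>% filter(decile == 10) %>% summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
bottom_decile <- df %>% filter(decile == 1) %>% summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))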
print(top_decile)
print(bottom_decile)
RESULTS
Predict Subscription Probabilities | Top Decile
Predict Subscription Probabilities | Bottom Decile
The predicted probabilities were divided into deciles to compare those most and least likely to subscribe.
Top Decile:
Average predicted probability of subscription is approximately 39 percent.
Customers are more recently active (recency 7 months), have higher spending totals ($257), and purchase more books on average (6.5).
Higher engagement in niche categories like art, reference, and geography.
Lower proportion of females compared to the full sample, aligning with regression findings.
Bottom Decile:
Average predicted probability is extremely low, around 0.7 percent.
Recency is much higher (26 months), total spend is lower ($204), and customers have fewer overall purchases (~4.2).
Behavior is concentrated in more traditional or low-engagement categories like cookbooks and non-book items.
Much higher proportion of female customers, reinforcing the gender disparity in likelihood to subscribe.
The combined results from the logistic regression model and decile profiling offer a clear picture of which customers are most likely to subscribe to the online book club. Customers in the top decile are active, high-spending, and engaged with categories like art and reference, which align with the value proposition of a curated digital subscription. On the other hand, customers in the bottom decile tend to be inactive, lower spenders, and more likely to purchase general or functional book types such as cookbooks or DIY.
These patterns suggest that recent behavioral engagement and content preference are stronger signals of subscription likelihood than demographic attributes alone, although gender remained a significant predictor in the logistic model. The findings support a marketing approach that focuses primarily on customer behavior rather than on broad segments like gender or tenure.
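The pROC package is loaded at the start of the analysis, but the write-up does not report how well the model ranks customers. One way this could be checked is sketched here, assuming an AUC plus observed subscription rates per decile would be the measures of interest. The `toy` data is simulated and hypothetical.

```r
library(pROC)

# Hypothetical simulated data standing in for df
set.seed(42)
n <- 1000
toy <- data.frame(last = rpois(n, 12), total = rgamma(n, shape = 4, scale = 50))
toy$subscribe_bin <- rbinom(n, 1, plogis(-1 - 0.1 * toy$last + 0.003 * toy$total))

toy_model <- glm(subscribe_bin ~ last + total, data = toy, family = binomial)
toy$predicted_prob <- predict(toy_model, type = "response")

# Area under the ROC curve: 0.5 is chance-level ranking, 1.0 is perfect
roc_obj <- roc(toy$subscribe_bin, toy$predicted_prob, quiet = TRUE)
auc_value <- as.numeric(auc(roc_obj))
print(auc_value)

# Observed subscription rate within each predicted-probability decile
toy$decile <- dplyr::ntile(toy$predicted_prob, 10)
decile_rates <- aggregate(subscribe_bin ~ decile, data = toy, FUN = mean)
print(decile_rates)
```

Because the predictions here are scored on the same rows used to fit the model, the AUC is likely optimistic. A held-out sample would give a fairer estimate.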
Marketing Actions
A marketing strategy is developed based on patterns identified in the analysis. Recommendations include who to target, how to craft messaging, and when to reach out. An A/B test is proposed to measure the impact of different subject lines and refine the email campaign approach.
To drive subscriptions to the online book club, an email campaign should target customers who recently made a purchase and have shown interest in art, reference, or geography books. These individuals ranked highest in predicted subscription likelihood based on behavioral data. The campaign should prioritize customers in the top subscription probability decile and exclude those who are inactive or heavily oriented toward categories like cookbooks and DIY, which were negatively associated with subscription.
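The targeting rule described above could be expressed as a simple filter. This is a sketch under stated assumptions: `customer_id` is a hypothetical identifier, `toy_df` holds made-up rows, and the two-month recency cutoff is an assumed threshold rather than one taken from the analysis.

```r
library(dplyr)

# Hypothetical toy data with the columns the targeting rule needs
toy_df <- tibble(
  customer_id = 1:6,                     # hypothetical identifier
  decile      = c(10, 10, 9, 10, 1, 10), # predicted-probability decile
  last        = c(1, 2, 1, 8, 30, 2),    # months since last purchase
  art         = c(2, 0, 1, 1, 0, 0),
  refernce    = c(0, 0, 1, 0, 0, 1),
  geog        = c(1, 0, 0, 0, 0, 0)
)

max_recency_months <- 2  # assumed cutoff for "recently made a purchase"

# Keep top-decile, recently active customers with at least one
# art, reference, or geography purchase
target_list <- toy_df %>%
  filter(decile == 10,
         last <= max_recency_months,
         art + refernce + geog > 0)
print(target_list)
```

Keeping the cutoff in a named variable makes it easy to widen or narrow the audience once campaign results are available.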
Subject line:
“Your Next Favorite Read Is Waiting – Let Us Pick It For You”
Email content:
The body of the email should highlight the convenience and curation of the online book club, with emphasis on personalized selections, early access to digital content, and an exclusive feel. Messaging should appeal to readers who enjoy discovery, intellectual depth, or seasonal themes. Include a strong call to action with a simple “Join Now” button, and possibly a limited-time incentive like a bonus e-book.
Timing:
Because the data records recency rather than calendar dates, the message should be triggered by how recently the customer purchased. Emails should be sent within 30 to 60 days after a customer’s last offline purchase. This timing ensures the brand is still top of mind and leverages recent engagement to prompt an online conversion.
Metrics to track:
Open rate
Click-through rate (CTR)
Conversion rate (email-to-subscription)
Unsubscribe rate
Bounce rate
A/B test design:
To test effectiveness, run an A/B test on the subject line:
Version A: “Your Next Favorite Read Is Waiting – Let Us Pick It For You”
Version B: “Two Free E-Books. Every Month. Yours to Explore.”
The test will help determine whether emotional curiosity or clear value-based messaging drives more engagement. Performance will be measured by open rate and subsequent click-through rate to refine future campaign strategies.
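Once the campaign has run, the open rates of the two versions could be compared with a two-sample test of proportions. The sketch below uses base R's prop.test(). The send and open counts are hypothetical placeholders, not campaign results.

```r
# Hypothetical counts, for illustration only (not campaign results)
sends <- c(A = 5000, B = 5000)
opens <- c(A = 1100, B = 1200)

# Observed open rate for each subject line
open_rates <- opens / sends
print(open_rates)

# Two-sample test of equal proportions (continuity correction by default)
ab_test <- prop.test(x = opens, n = sends)
print(ab_test$p.value)
```

The same call could be repeated with click counts in place of opens to compare click-through rates. A small p-value would suggest the difference between versions is unlikely to be due to chance alone.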