Fraudulent activities in financial transactions pose significant challenges to businesses, financial institutions, and customers. Detecting fraudulent credit card transactions efficiently is critical for minimizing financial losses and maintaining trust among users. This project aims to develop a predictive model using Random Forest to detect fraudulent credit card transactions with high precision and recall.
Through advanced machine learning techniques and robust feature engineering, this project explores transaction patterns, identifies key indicators of fraud, and implements a tuned thresholding strategy to balance the trade-off between precision and recall. Additionally, geospatial data is leveraged to analyze and visualize fraud distribution across states, adding depth to the understanding of the problem.
By achieving a balanced model performance, this project demonstrates how machine learning can support businesses in mitigating the risks of fraud while reducing customer inconvenience caused by false positives.
Dataset Source:
The dataset used in this project is sourced from Kaggle: Credit Card Transactions Fraud Detection Dataset. This dataset, simulated using the Sparkov tool by Brandon Harris and published by Kartik Shenoy, represents credit card transactions from January 1, 2019, to December 31, 2020.
The dataset contains transactions from 1,000 customers interacting with 800 merchants, offering a realistic and diverse representation of legitimate and fraudulent activities.
About the Simulation Process:
The dataset was generated using the Sparkov Data Generator.
Merchant, customer, and transaction categories were pre-defined.
Profiles such as age, location, and transaction patterns were used to simulate realistic behavior.
Faker, a Python library, was used to simulate customer behavior based on distributions such as transaction frequency, amounts, and categories.
This detailed simulation is designed to mirror real-world behavior, making the dataset a good fit for fraud detection analysis.
Dataset Overview:
Total Records: 594,643
Key Columns:
cc_num (Credit Card Number)
trans_date_trans_time (Transaction Date and Time)
merchant (Merchant Name)
category (Transaction Category)
amt (Transaction Amount)
gender (Cardholder’s Gender)
city, state (Location Details)
is_fraud (Fraud Label: 1 = Fraud, 0 = Non-Fraud)
The dataset combines demographic, transactional, and geographic information, making it ideal for fraud detection analysis.
Project Goals:
Analyze fraud trends and patterns across demographics, time, and geography.
Build an interactive Power BI dashboard to visualize insights.
Develop a machine learning model (Random Forest) to predict fraudulent transactions with high precision and recall.
This project showcases the seamless integration of SQL-based data processing, business intelligence tools, and machine learning techniques, offering a comprehensive solution to detect and analyze fraud effectively.
Fraud Counts Across Age Groups: Identifying Vulnerable Age Brackets
This query calculates the age of individuals at the time of fraud using their dob (date of birth) and groups the fraudulent transactions (is_fraud = 1) by age. The result is sorted to display the age groups with the highest number of fraudulent activities.
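The SQL query itself is not reproduced here. As a rough pandas sketch of the same logic, the snippet below derives age in completed years at transaction time and counts fraud cases per age. The sample rows are hypothetical, and the exact age formula is an assumption rather than the one used in the project.

```python
import pandas as pd

# Hypothetical sample rows using the dataset's column names.
df = pd.DataFrame({
    "trans_date_trans_time": pd.to_datetime([
        "2020-05-01 22:10:00", "2020-05-02 23:05:00",
        "2019-03-15 12:00:00", "2020-02-20 21:30:00"]),
    "dob": pd.to_datetime(["1980-06-15", "1985-01-20", "2000-09-01", "1980-06-15"]),
    "is_fraud": [1, 1, 0, 1],
})

trans, dob = df["trans_date_trans_time"], df["dob"]

# Completed years: subtract one if the birthday has not yet occurred that year.
before_birthday = (trans.dt.month < dob.dt.month) | (
    (trans.dt.month == dob.dt.month) & (trans.dt.day < dob.dt.day)
)
df["age"] = trans.dt.year - dob.dt.year - before_birthday.astype(int)

# Count fraudulent transactions per age, highest first.
fraud_by_age = (
    df[df["is_fraud"] == 1]
    .groupby("age")
    .size()
    .sort_values(ascending=False)
    .rename("fraud_cnt")
)
print(fraud_by_age)
```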
Key Insights from the Chart (Based on the SQL):
Peak Fraud Activity in Middle Age (30–60):
The data shows that cardholders aged 30 to 60 account for the highest number of fraudulent transactions.
This group likely represents individuals with higher purchasing power and frequent transactions, making them more susceptible to fraud.
Lower Fraud Rates Among Younger and Older Groups:
Individuals under 30 years and above 60 years show a significantly lower number of fraud incidents.
Possible reasons include fewer financial transactions (e.g., limited access to credit cards among younger people and conservative spending among older people) or reduced targeting by fraudsters.
Sharp Increase Followed by Gradual Decline:
The chart suggests a steep rise in fraudulent activity as individuals transition from their 20s to their 30s, likely correlating with increased financial activity.
The decline in fraud from middle age to senior years appears more gradual, indicating behavioral or exposure-related changes that reduce fraud susceptibility.
Behavioral and Financial Patterns:
This pattern highlights the interplay between age, spending habits, and fraud targeting strategies. Younger individuals might not be as active in financial systems, while older adults may employ more cautious spending behaviors.
Monthly Fraud Trends: Seasonal Peaks and Annual Comparisons
This query examines monthly fraud trends by extracting the year and month from trans_date_trans_time and counting the total transactions (total_transactions) and fraudulent transactions (fraud_cnt). The results are sorted by fraud count in descending order so that the worst months appear first.
Seasonality of Fraud:
The chart reflects seasonal patterns in fraudulent activity, with distinct peaks and troughs observed throughout the years.
Fraud seems to spike during specific months, potentially coinciding with high transaction volumes like sales events, tax seasons, or holidays.
Peak Fraud in May 2020:
May 2020 stands out as the month with the highest fraudulent activity, indicating an unusual surge in fraud cases.
This could be attributed to external events, such as shifts in consumer behavior during the COVID-19 pandemic, where online transactions increased, creating opportunities for fraudsters.
Late 2020 Surge in Fraud:
The chart highlights a potential upward trend in fraud during November and December 2020, aligning with the holiday shopping season when transaction volumes typically increase.
Fraudsters might exploit this period when consumers and merchants are less vigilant due to the rush of the season.
Comparison Between 2019 and 2020:
While 2019 exhibits some peaks, the overall fraud count is noticeably higher in 2020.
This increase could be linked to global trends like a shift towards online shopping during the pandemic, leading to more opportunities for fraud.
Interestingly, the peaks in 2019 are less pronounced, indicating more consistent fraud rates throughout the year compared to 2020's sharp fluctuations.
Fraud Activity by Time: Late-Night Peaks and Weekday Patterns
This query dissects fraudulent transactions by day of the week (week_day) and hour of the day (hour_of_day). Fraud counts (fraud_cnt) and total transactions (total_transactions) are grouped and ordered to reveal high-risk periods for fraudulent activity.
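A pandas equivalent of this grouping might look like the sketch below. The output names week_day, hour_of_day, fraud_cnt and total_transactions follow the query description above; the sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical sample rows; weekdays noted for readability.
df = pd.DataFrame({
    "trans_date_trans_time": pd.to_datetime([
        "2020-05-01 22:10",  # Friday
        "2020-05-01 22:45",  # Friday
        "2020-05-02 23:05",  # Saturday
        "2020-05-04 10:30",  # Monday
    ]),
    "is_fraud": [1, 0, 1, 0],
})

by_time = (
    df.assign(
        week_day=df["trans_date_trans_time"].dt.day_name(),
        hour_of_day=df["trans_date_trans_time"].dt.hour,
    )
    .groupby(["week_day", "hour_of_day"])
    .agg(
        total_transactions=("is_fraud", "count"),
        fraud_cnt=("is_fraud", "sum"),
    )
    .sort_values("fraud_cnt", ascending=False)
)
print(by_time)
```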
Significant Peak in Late Evening/Night:
Fraudulent transactions spike dramatically between 9 PM and 11 PM.
This timeframe represents the highest fraud activity, likely due to reduced vigilance by account holders during late hours.
Low Fraud Activity During Daytime:
Fraud rates are significantly lower from 6 AM to 6 PM, indicating that fraudsters prefer off-peak hours when transactions are less likely to be monitored.
Weekday Variations:
Fraud levels are consistently higher on Friday and Saturday nights, potentially correlating with increased spending during weekends and reduced consumer vigilance.
This trend underscores how behavioral patterns (e.g., weekend leisure activities) may be exploited by fraudsters.
Reduced Monitoring at Night:
People are less likely to actively monitor their accounts or approve transactions late at night, providing a window of opportunity for fraudsters.
Automated Attacks:
Activities like card testing or credential stuffing may be conducted by bots during late hours to exploit lower detection rates.
Time Zone Differences:
Transactions processed from global merchants or fraudsters operating in different time zones could contribute to spikes in specific hours.
Fraud Hotspots: City and State Trends by Gender
The SQL queries analyze fraudulent transactions by state and city, grouped by gender. Fraud counts are ranked in descending order to identify the most affected areas and potential gender disparities.
1. Fraud Patterns by State and Gender:
Uneven Distribution Across States:
Fraudulent activity is significantly higher in states like New York (NY), Texas (TX), California (CA), and Florida (FL).
These states typically have higher population densities and active economies, increasing transaction volumes and fraud opportunities.
Fraud counts are notably lower in less populated or rural states, likely due to fewer transactions overall.
Gender Disparity (Potentially):
In some states, male-associated fraud counts slightly exceed female-associated counts, suggesting a potential gender-based pattern.
However, in other states, the fraud counts are more evenly distributed, or females may even slightly dominate. This variability indicates no universal trend across all states.
2. Fraud Breakdown by City and Gender:
Concentration in Major Cities:
Fraudulent transactions are disproportionately high in major metropolitan areas such as Houston, Dallas, New York City, and Los Angeles.
Cities with larger populations and economic hubs naturally process more transactions, making them attractive targets for fraudsters.
Granular Gender Disparity:
At the city level, certain areas exhibit a clearer gender imbalance in fraud counts, with some cities showing a more pronounced male or female predominance in fraudulent activity.
However, this disparity remains inconsistent and could vary due to local demographics or spending patterns.
Population Density and Economic Activity:
High fraud rates in populous states and cities are likely due to increased transaction volumes and diverse economic activities.
Gender-Based Behavioral Patterns:
Spending patterns, transaction habits, or targeted campaigns may contribute to the observed gender disparities in certain locations.
Fraudulent Schemes in Urban Areas:
Urban centers are more susceptible to sophisticated fraud schemes due to the density of digital and in-person transactions.
Fraud Detection Insights: Merchant and Category Analysis
This analysis highlights fraud patterns based on merchant and category statistics, summarizing fraud count, total transactions, and fraud rates across various merchants and categories. Below are key insights grouped by ranges of fraud count.
Merchants with the highest fraud counts in this range include grocery and shopping categories, with fraud rates ranging from 1.58% to 2.57%. Key findings:
Rau and Sons (grocery_pos): 49 frauds out of 2,490 transactions (1.97%).
Kozey-Boehm (shopping_net): 48 frauds out of 1,866 transactions (2.57%).
Cormier LLC (shopping_net): 45 frauds out of 1,959 transactions (2.30%).
Merchants within this range display slightly lower fraud counts but still significant fraud rates, highlighting specific vulnerabilities in certain categories:
Schumm PLC (shopping_net): 30 frauds out of 1,906 transactions (1.57%).
Rutherford-Mertz (grocery_pos): 29 frauds out of 2,489 transactions (1.17%).
Hackett-Lueilwitz (grocery_pos): 28 frauds out of 2,568 transactions (1.09%).
This group represents a wide variety of merchants with lower fraud counts, typically below 1% fraud rates. Key examples:
Brekke and Sons (gas_transport): 1 fraud out of 2,653 transactions (0.38%).
Nicolas, Hills and McGlynn (entertainment): 1 fraud out of 1,984 transactions (0.50%).
Bins, Balistreri and Beatty (shopping_pos): 1 fraud out of 2,315 transactions (0.39%).
Grocery and Shopping Categories: Frequently associated with higher fraud rates.
Miscellaneous Categories: Lower transaction volumes and fraud counts, but occasional spikes in fraud rates.
Gas Transport: Lowest fraud rates across all categories, suggesting stronger transactional security.
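The fraud rate quoted for each merchant is the fraud count divided by the total transactions, expressed as a percentage. The short sketch below reproduces that arithmetic for four of the merchants listed above; the helper name fraud_rate_pct is illustrative only.

```python
# (merchant, category, fraud_cnt, total_transactions) as quoted in the text above.
merchants = [
    ("Rau and Sons", "grocery_pos", 49, 2490),
    ("Kozey-Boehm", "shopping_net", 48, 1866),
    ("Cormier LLC", "shopping_net", 45, 1959),
    ("Schumm PLC", "shopping_net", 30, 1906),
]


def fraud_rate_pct(fraud_cnt, total_transactions):
    """Fraud rate as a percentage of all transactions, rounded to 2 decimals."""
    return round(100 * fraud_cnt / total_transactions, 2)


rates = {name: fraud_rate_pct(f, t) for name, _, f, t in merchants}

# Rank merchants from highest to lowest fraud rate.
ranked = sorted(rates.items(), key=lambda kv: kv[1], reverse=True)
for name, rate in ranked:
    print(f"{name}: {rate:.2f}%")
```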
**For a detailed view, explore the Power BI dashboard, which provides visualizations that support this analysis.**
Before applying the Random Forest model, I analyzed fraud patterns based on the distance between users and merchants. Distance serves as a critical feature, as unusual patterns in proximity can be indicative of fraudulent activities.
Key Findings:
Distance Distribution: Fraud occurrences were analyzed across distances ranging from 0.12 km to 150.67 km, with an average of ~76 km.
Categories: Distances were divided into bins (0-20, 20-50, 50-90, 90-120, and 120+ km) for detailed analysis.
Fraud Insights:
The highest fraud counts (~248k) occurred in the 50-90 km range.
The 90-120 km and 20-50 km categories also showed significant fraud activity.
The 0-20 km and 120+ km ranges had relatively lower fraud counts.
Visualization:
The bar chart highlights that mid-range distances (50-90 km) had the highest fraud frequency. This finding underscores the importance of distance as a feature in the fraud detection model, helping to identify potential anomalies efficiently.
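The binning described above can be sketched with pandas.cut. The column name distance_km, the sample values, and the choice to treat each bin as closed on the left (so 20 km falls in 20-50) are assumptions for illustration, not details taken from the project code.

```python
import pandas as pd

# Hypothetical distances (km) and fraud labels.
df = pd.DataFrame({
    "distance_km": [0.12, 15.0, 35.5, 62.0, 76.0, 88.9, 101.3, 150.67],
    "is_fraud": [0, 0, 1, 1, 0, 1, 1, 0],
})

bins = [0, 20, 50, 90, 120, float("inf")]
labels = ["0-20", "20-50", "50-90", "90-120", "120+"]

# right=False makes each bin include its lower edge and exclude its upper edge.
df["distance_bin"] = pd.cut(df["distance_km"], bins=bins, labels=labels, right=False)

fraud_by_bin = df.groupby("distance_bin", observed=False).agg(
    total_transactions=("is_fraud", "count"),
    fraud_cnt=("is_fraud", "sum"),
)
print(fraud_by_bin)
```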
Random Forest is an ensemble learning method primarily used for classification and regression tasks. It builds multiple decision trees during training and combines their outputs for a more accurate and robust prediction. Key features include:
Robustness to Overfitting: By averaging multiple trees, Random Forest reduces the risk of overfitting.
Feature Importance: It provides insights into which features are most influential in predictions.
Handling of Imbalanced Data: Random Forest does not correct class imbalance on its own, but it combines well with techniques such as class weighting or oversampling.
In this project, Random Forest is applied to detect fraudulent transactions by analyzing patterns and relationships in the dataset. The model predicts the likelihood of a transaction being fraudulent based on features such as merchant name, category, and transaction amounts.
Steps:
Preprocessing: Clean and encode categorical data to prepare for model training.
Model Training: Train the Random Forest model on historical transaction data labeled as fraudulent or non-fraudulent.
Evaluation: Use metrics like precision, recall, and F1-score to assess the model's performance.
Prediction: Apply the model to predict fraud in unseen data, prioritizing high-risk transactions.
The provided code handles the Fraud Detection Project and involves the following key steps:
Data Loading and Cleanup
Read Data: The dataset is loaded from Google Drive using pandas.
Column Cleanup: Drops unnecessary columns (like Unnamed: 0) and standardizes column names by stripping whitespace.
Handle Missing and Duplicate Values: Drops rows with missing values (dropna) and checks for duplicates.
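A minimal sketch of these cleanup steps follows. The Google Drive path is not given in this write-up, so an in-memory CSV with made-up rows stands in for the real file.

```python
import io

import pandas as pd

# Stand-in for the CSV stored on Google Drive (rows are made up).
csv_text = """,trans_date_trans_time, amt ,is_fraud
0,2020-06-21 12:14:25,2.86,0
1,2020-06-21 12:14:33,,0
2,2020-06-21 12:14:53,41.28,1
3,2020-06-21 12:14:53,41.28,1
"""
df = pd.read_csv(io.StringIO(csv_text))

# Drop the leftover index column and strip whitespace from column names.
df = df.drop(columns=["Unnamed: 0"], errors="ignore")
df.columns = df.columns.str.strip()

# Drop rows with missing values, then count exact duplicate rows.
df = df.dropna()
n_duplicates = int(df.duplicated().sum())
print(df.shape, n_duplicates)
```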
Descriptive Statistics
Summary statistics of the amt column provide insight into transaction amounts, indicating potential outliers or skewness.
Feature Engineering
Datetime Parsing: Converts trans_date_trans_time and dob to datetime format for feature extraction.
Extracts age, transaction month, hour, and day of the week to capture temporal patterns.
Geographic Distance Calculation: Computes the distance between user and merchant coordinates using the geodesic function.
This can help detect anomalies (e.g., transactions far from a user's location).
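The project computes this distance with a geodesic function. As a dependency-free approximation, the sketch below uses the haversine great-circle formula instead, which treats the Earth as a sphere and so differs slightly from a true geodesic on an ellipsoid. The coordinates are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Hypothetical cardholder and merchant coordinates, one degree of latitude apart.
distance = haversine_km(40.0, -75.0, 41.0, -75.0)
print(f"{distance:.1f} km")
```

In a DataFrame this function could be applied row by row to the cardholder and merchant coordinate columns to produce the distance feature.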
The preprocessed data is now ready for model training and evaluation. Using the Random Forest Classifier, we can classify transactions as fraudulent or non-fraudulent based on these engineered features.
After preparing the data, I implemented a Random Forest model to detect fraudulent transactions. Random Forest, known for its robustness and ability to handle imbalanced datasets, was an ideal choice for this classification task. Here's a breakdown of the process:
1. Feature Selection and Target Variable
Features (X): All relevant numerical and categorical features, excluding transaction-specific identifiers, geographical details, and other non-predictive columns, were selected.
Target (y): The is_fraud column, which indicates whether a transaction is fraudulent, was designated as the target variable.
2. One-Hot Encoding
Nominal features like category and gender were encoded using One-Hot Encoding to ensure they were compatible with the model.
3. Train-Test Split
The dataset was split into 70% training and 30% testing so that the model is evaluated on transactions it did not see during training.
4. Model Training
A Random Forest Classifier with default parameters and a fixed random seed was trained on the data.
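The model achieved an accuracy of 1.00 (after rounding) on the test set. Because fraud cases make up only a tiny fraction of the data, near-perfect accuracy says little about how many frauds are actually caught, so threshold adjustment and further evaluation were necessary.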
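Steps 2 to 4 can be sketched with pandas and scikit-learn as follows. The data here is synthetic, the seed value 42 is an assumption (the write-up only says a fixed seed was used), and the feature list is a small illustrative subset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed transactions.
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "amt": rng.exponential(70, n),
    "trans_hour": rng.integers(0, 24, n),
    "category": rng.choice(["grocery_pos", "shopping_net", "gas_transport"], n),
    "gender": rng.choice(["F", "M"], n),
})
# Synthetic label: larger evening transactions are marked as fraud.
df["is_fraud"] = ((df["amt"] > 60) & (df["trans_hour"] >= 18)).astype(int)

# One-hot encode the nominal features.
X = pd.get_dummies(df.drop(columns="is_fraud"), columns=["category", "gender"])
y = df["is_fraud"]

# 70/30 train-test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Default Random Forest with a fixed seed.
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Probability of the fraud class, used later for threshold tuning.
fraud_proba = model.predict_proba(X_test)[:, 1]
print(f"Test accuracy: {model.score(X_test, y_test):.2f}")
```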
5. Threshold Adjustment
Since accuracy alone can be misleading in imbalanced datasets, I evaluated precision, recall, and F1-score across different probability thresholds.
The results showed how adjusting the threshold impacted the balance between correctly identifying fraud and avoiding false positives.
Threshold: 0.1 | Precision: 0.62 | Recall: 0.89 | F1-Score: 0.73
Threshold: 0.2 | Precision: 0.83 | Recall: 0.82 | F1-Score: 0.82
Threshold: 0.3 | Precision: 0.92 | Recall: 0.77 | F1-Score: 0.84
Threshold: 0.4 | Precision: 0.94 | Recall: 0.71 | F1-Score: 0.81
Threshold: 0.5 | Precision: 0.97 | Recall: 0.66 | F1-Score: 0.78
Threshold: 0.6 | Precision: 0.98 | Recall: 0.59 | F1-Score: 0.74
Threshold: 0.7 | Precision: 0.99 | Recall: 0.52 | F1-Score: 0.68
Threshold: 0.8 | Precision: 0.99 | Recall: 0.42 | F1-Score: 0.59
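A threshold sweep like the one above can be reproduced from the true labels and the predicted fraud probabilities. The sketch below uses made-up labels and probabilities, and it assumes a transaction is flagged when its probability is greater than or equal to the threshold.

```python
def metrics_at_threshold(y_true, fraud_proba, threshold):
    """Precision, recall and F1 when proba >= threshold is flagged as fraud."""
    y_pred = [int(p >= threshold) for p in fraud_proba]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical labels and predicted fraud probabilities.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
fraud_proba = [0.02, 0.05, 0.15, 0.30, 0.10, 0.45, 0.25, 0.55, 0.75, 0.90]

for threshold in [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]:
    p, r, f1 = metrics_at_threshold(y_true, fraud_proba, threshold)
    print(f"Threshold: {threshold:.1f} | Precision: {p:.2f} | Recall: {r:.2f} | F1-Score: {f1:.2f}")
```

Raising the threshold can only remove flagged transactions, so recall never increases as the threshold goes up, which matches the pattern in the table.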
6. Final Threshold Selection
A threshold of 0.25 was chosen, between the 0.2 and 0.3 settings where the F1-score peaks, offering a good trade-off between precision and recall.
7. Prediction Analysis
Applied the selected threshold to generate predictions:
False Negatives (Fraud missed): 129 cases.
False Positives (Non-fraud identified as fraud): 65 cases.
This threshold adjustment highlights the flexibility and interpretability of the Random Forest model in optimizing fraud detection performance.
Confusion Matrix:
[[166011     65]
 [   129    511]]
True Negatives: 166,011 (Non-fraud correctly classified as non-fraud)
False Positives: 65 (Non-fraud misclassified as fraud)
False Negatives: 129 (Fraud misclassified as non-fraud)
True Positives: 511 (Fraud correctly classified as fraud)
Key Metrics:
Accuracy: 1.00 after rounding (166,522 of 166,716 test predictions correct, about 99.9%)
Precision: 0.89 (89%) — Among predicted frauds, 89% were actual frauds.
Recall: 0.80 (80%) — The model captured 80% of the actual fraud cases.
F1-Score: 0.84 — Balancing precision and recall.
ROC-AUC Score:
ROC-AUC: 0.98 — The model demonstrates an excellent ability to separate fraud from non-fraud cases.
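The accuracy, precision, recall and F1-score above follow directly from the four confusion-matrix counts. The snippet below recomputes them as a quick consistency check.

```python
# Counts from the confusion matrix reported above.
tn, fp, fn, tp = 166011, 65, 129, 511

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # share of flagged transactions that are fraud
recall = tp / (tp + fn)     # share of fraud cases that were flagged
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.2f}")
print(f"Recall:    {recall:.2f}")
print(f"F1-Score:  {f1:.2f}")
```

Accuracy comes out at about 0.9988, which rounds to 1.00 even though 129 fraud cases are missed. This is why precision and recall are the more informative metrics here.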
Feature Importances: The chart demonstrates the relative importance of features in the model:
amt (Transaction amount): Most critical feature for fraud detection.
trans_hour: Time of transaction is a significant predictor.
age and distances (between cardholder and merchant): Highly impactful.
Lesser but still relevant features include category_grocery_pos, zip, and city_pop.
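Importances like these are typically read from the fitted model's feature_importances_ attribute in scikit-learn. The sketch below shows the mechanics on synthetic data in which the label depends only on amt, so it illustrates how the ranking is obtained rather than reproducing the project's chart.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic features; names mirror those discussed above.
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "amt": rng.exponential(70, n),
    "trans_hour": rng.integers(0, 24, n),
    "city_pop": rng.integers(100, 1_000_000, n),
})
# Synthetic label driven only by the transaction amount.
y = (X["amt"] > 120).astype(int)

model = RandomForestClassifier(random_state=42).fit(X, y)

# Impurity-based importances; they sum to 1 across features.
importances = pd.Series(model.feature_importances_, index=X.columns).sort_values(
    ascending=False
)
print(importances)
```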
The model performs well in detecting fraud but still misses 129 of the 640 fraud cases in the test set (about 20%). Lowering the threshold or fine-tuning the model could raise recall, at some cost to precision.
Transaction-related features such as amount, time, and location are crucial predictors, highlighting the patterns of fraudulent behavior.
This project effectively showcased how machine learning aids in fraud detection. Key insights from EDA and visualization revealed that higher transaction amounts, late-night transactions, and specific merchant categories had a higher likelihood of fraud. Geographic proximity of transactions to merchant locations also played a crucial role.
Using the Random Forest model, we achieved a high AUC score of 0.98, identifying transaction amount and time as the most influential factors. SQL and Power BI integration enabled seamless querying and insightful dashboard creation, uncovering fraud patterns and empowering data-driven decision-making.