Dataset Overview:
This dataset simulates realistic financial transaction patterns and was generated using Python code.
It was generated to mimic a wide range of transactional scenarios across multiple categories, including retail, grocery, dining, travel, and more, making it ideal for exploring patterns that distinguish legitimate transactions from fraudulent ones.
It is a publicly available dataset sourced from Kaggle, then cleaned and formatted using Python and the Pandas library.
The data has been uploaded to a Google Spreadsheet and attached below. [Further analysis of the dataset is in the section below]
Numerical Variables: amount, is_fraud, transaction_hour
Categorical Variables: customer_id, merchant_category, country, card_type.
Reference: Transactions - Ismat Samadov. Kaggle, 2024
Statistical Analysis of Dataset
The spreadsheet attached below shows the types of statistical analysis that can be done for each variable.
The analysis required by the activity includes:
Present the possible measures of central tendency and measures of dispersion for each variable.
Visualize each variable in plots.
Examine the association between any two categorical variables from the dataset by plotting a 100% stacked bar chart, and provide insights regarding the observed relationship between the variables.
Examine the association between any two numerical variables, say X1 and X2, from the dataset and draw a scatter plot of them.
Provide insights regarding the observed relationship between the variables based on the scatter plot visualization.
Find the covariance and correlation between the two selected variables and interpret the relationship between them based on the obtained values.
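As a rough illustration of the first requirement, the measures of central tendency and dispersion can be computed with Pandas along the following lines. This is a minimal sketch: the column names come from the dataset, but the rows are made-up values used only so the example runs on its own, and the choice of measures (mean, median, mode, range, variance, standard deviation, IQR) is an assumption about what the activity expects.

```python
import pandas as pd

# Illustrative rows only; the real dataset is not reproduced here.
df = pd.DataFrame({
    "amount": [120.0, 80.0, 80.0, 300.0, 20.0],
    "transaction_hour": [4, 13, 14, 13, 20],
    "is_fraud": [0, 0, 1, 0, 0],
    "card_type": ["Basic Debit", "Premium Debit", "Basic Debit",
                  "Platinum Credit", "Basic Debit"],
})

numeric_cols = ["amount", "transaction_hour", "is_fraud"]

# Measures of central tendency for the numerical variables.
central = pd.DataFrame({
    "mean": df[numeric_cols].mean(),
    "median": df[numeric_cols].median(),
    "mode": df[numeric_cols].mode().iloc[0],  # first mode if there are several
})

# Measures of dispersion for the numerical variables.
q1 = df[numeric_cols].quantile(0.25)
q3 = df[numeric_cols].quantile(0.75)
dispersion = pd.DataFrame({
    "range": df[numeric_cols].max() - df[numeric_cols].min(),
    "variance": df[numeric_cols].var(),  # sample variance (ddof=1)
    "std": df[numeric_cols].std(),       # sample standard deviation
    "iqr": q3 - q1,
})

# For a categorical variable, the mode is the applicable measure of center.
card_type_mode = df["card_type"].mode().iloc[0]

print(central)
print(dispersion)
print(card_type_mode)
```

Because is_fraud is a 0/1 flag, its mean can be read as the share of transactions marked as fraudulent.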
Graphs:
Measures of Central Tendency Visualized:
transaction_hour
amount
is_fraud
Measures of Dispersion Visualized:
is_fraud
transaction_hour
amount
Numerical Association:
Covariance: 97398.7059
The positive sign shows a slight tendency for higher transaction amounts to occur at later hours, but the relationship is very weak.
Correlation: 0.06220389575
A correlation this close to zero means there is almost no linear relationship between the time of the transaction (hour of day) and the amount being transacted.
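A large covariance next to a near-zero correlation is not a contradiction: covariance is expressed in the units of both variables, while correlation divides it by the two standard deviations. The sketch below illustrates this with Pandas. The values are invented for the example and do not reproduce the figures reported above.

```python
import pandas as pd

# Illustrative values only, not taken from the actual dataset.
df = pd.DataFrame({
    "transaction_hour": [1, 4, 9, 13, 14, 20],
    "amount": [2500.0, 9100.0, 1200.0, 8700.0, 7900.0, 1500.0],
})

cov = df["transaction_hour"].cov(df["amount"])    # sample covariance
corr = df["transaction_hour"].corr(df["amount"])  # Pearson correlation

# Pearson r equals the covariance divided by both standard deviations.
corr_from_cov = cov / (df["transaction_hour"].std() * df["amount"].std())

# Rescaling amount (here to thousands) shrinks the covariance by the same
# factor but leaves the correlation unchanged.
amount_k = df["amount"] / 1000
cov_k = df["transaction_hour"].cov(amount_k)
corr_k = df["transaction_hour"].corr(amount_k)

print(cov, corr, corr_from_cov)
print(cov_k, corr_k)
```

This is why the correlation, not the covariance, is the value to use when judging how strong the relationship is.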
Chart Interpretation:
The chart shows that the highest transaction amounts occur at specific hours, notably around 04:00, 13:00, and 14:00, with the 04:00 spike being unusual for an off-peak hour. Most other hours have moderate amounts (1–3M), and the evening (18:00–21:00) has the lowest amounts. This suggests that there are peak transaction windows and that the early-morning transactions may contain anomalies.
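A per-hour view like the one described can be approximated by aggregating amount by transaction_hour. The sketch below assumes the chart plots the total (summed) amount per hour, which the source does not state explicitly, and uses made-up rows.

```python
import pandas as pd

# Illustrative rows only; the real dataset is not reproduced here.
df = pd.DataFrame({
    "transaction_hour": [4, 4, 13, 14, 19, 13],
    "amount": [900.0, 600.0, 700.0, 800.0, 100.0, 500.0],
})

# Total amount per hour, reindexed to all 24 hours so that hours without
# any transaction show up as 0 instead of being dropped.
hourly_total = (
    df.groupby("transaction_hour")["amount"].sum()
      .reindex(range(24), fill_value=0.0)
)

# The three hours with the highest total amount, highest first.
top_hours = hourly_total.sort_values(ascending=False).head(3)

print(top_hours)
```

Comparing the summed amount with the transaction count per hour helps tell a spike caused by many transactions from one caused by a few very large ones.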
Categorical Association:
Table of categorical association between 2 variables
Chart between merchant_category & card_type
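A count table and the row percentages behind a 100% stacked bar chart can be obtained with `pd.crosstab`. The following is a minimal sketch with made-up rows. Only the column names merchant_category and card_type come from the dataset.

```python
import pandas as pd

# Illustrative rows only; the real dataset is not reproduced here.
df = pd.DataFrame({
    "merchant_category": ["Grocery", "Grocery", "Grocery", "Gas", "Gas", "Travel"],
    "card_type": ["Platinum Credit", "Platinum Credit", "Basic Debit",
                  "Basic Credit", "Premium Debit", "Premium Debit"],
})

# Raw counts of each card type within each merchant category.
counts = pd.crosstab(df["merchant_category"], df["card_type"])

# Row percentages: each merchant category sums to 100.
shares = pd.crosstab(
    df["merchant_category"], df["card_type"], normalize="index"
) * 100

print(counts)
print(shares.round(1))

# With matplotlib installed, shares.plot(kind="bar", stacked=True)
# would draw the 100% stacked bar chart from these percentages.
```

Normalizing by row makes categories with different transaction volumes directly comparable, which raw counts alone do not.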
Chart Interpretation:
1. Premium Debit cards are popular across many categories
Especially in Education (31), Healthcare (30), Restaurant (31), and Travel (33). Indicates customers prefer Premium Debit cards for diverse, high-value transactions.
2. Platinum Credit cards show strength in Grocery, Healthcare, and Retail
Highest count in Grocery (33) and strong presence in Healthcare (28) and Retail (28). Likely reflects mid- to high-tier customers who use it regularly for everyday and essential purchases.
3. Basic Debit cards are most frequently used in Restaurant and Retail
Highest count in Restaurant (32) and Retail (29).
4. Basic Credit cards are most used in Gas and Restaurant
Highest count in Gas (32) and Restaurant (29). Suggests Basic Credit cards are popular for fuel and dining expenses.