Workshop on
AI for Financial Crime Fight
AI4FCF @ ICDM 2025
(IEEE International Conference on Data Mining 2025 - Washington DC, USA)
Open Datasets
A common barrier to entry to the AI-aided fight against financial crimes is the scarce availability of open datasets.
Banks and other financial institutions can leverage internal datasets, which generally cannot be released because they contain confidential customer and transaction data.
To encourage the ICDM community to work on the topic, we have collected a curated list of open datasets, which can be, and have been, used to conduct extensive analyses and build machine learning models.
🧼 IBM AMLSim
(Synthetic dataset)
This project aims to build a multi-agent simulator for anti-money laundering research and to share synthetically generated data, so that researchers can design and implement new algorithms on a common dataset.
💳 Credit Card Fraud Detection
(Real, anonymized dataset)
The dataset contains credit card transactions made in September 2013 by European cardholders. The main features in the dataset have been "anonymized" through principal component analysis (PCA), so that only a (compressed) projection of the original data is provided.
The transactions in the dataset occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly imbalanced, as the positive class (frauds) accounts for 0.172% of all transactions.
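To make the imbalance concrete, here is a minimal sketch that recomputes the fraud rate from the two counts quoted above and shows the accuracy of a trivial model that never flags anything. The function names are ours and purely illustrative; nothing here comes from the dataset's own tooling.

```python
def fraud_rate(n_fraud: int, n_total: int) -> float:
    """Fraction of transactions labelled as fraud."""
    return n_fraud / n_total


def majority_baseline_accuracy(n_fraud: int, n_total: int) -> float:
    """Accuracy of a classifier that predicts 'not fraud' for every transaction."""
    return (n_total - n_fraud) / n_total


# Counts quoted in the dataset description.
N_FRAUD, N_TOTAL = 492, 284_807

rate = fraud_rate(N_FRAUD, N_TOTAL)
baseline = majority_baseline_accuracy(N_FRAUD, N_TOTAL)

print(f"fraud rate: {rate:.2%}")                     # about 0.17%
print(f"always-negative accuracy: {baseline:.2%}")   # about 99.83%
```

Because the do-nothing baseline already scores close to 100% accuracy, metrics that focus on the positive class (for example precision, recall, or the area under the precision-recall curve) are usually more informative on this dataset.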
The dataset is available on Kaggle.
🏦 BankSim payments simulator
(Synthetic dataset)
This dataset has been generated with the BankSim tool. It contains data covering approximately 6 months. The generator has been tuned to obtain a plausible data distribution. Anomalous behavior has been injected in a controlled way. The dataset contains 594,643 transactions (7,200 being fraudulent transactions).
The dataset is available on Kaggle.
🏝️ Paradise/Panama Papers
(Real dataset)
The Paradise Papers are a cache of some 13GB of data containing 13.4 million confidential records of offshore investments by 120,000 people and companies in 19 tax jurisdictions (an awesome video to understand this). They were published by the International Consortium of Investigative Journalists (ICIJ) on November 5, 2017.
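The Panama Papers are a leak of 11.5 million documents from the Panamanian law firm Mossack Fonseca. The 38GB cache of data from the national corporate registry of the Bahamas, which names some of the world's top politicians and influential persons as heads and directors of offshore companies registered in the Bahamas, corresponds to what the ICIJ published as the Bahamas Leaks.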
The dataset is available on Kaggle.
🕵️ PaySim (Synthetic Financial Datasets for Fraud Detection)
(Synthetic dataset)
This dataset has been generated using the PaySim generator. This generator uses aggregated data from a private (real) dataset to generate a synthetic dataset that resembles the normal operation of transactions and injects malicious behaviour to later evaluate the performance of fraud detection methods.
The real data used to synthesize this dataset was extracted from one month of financial logs from a mobile money service implemented in an African country.
The dataset is available on Kaggle.
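The general idea described above (sample synthetic transactions from aggregate statistics, then inject labelled malicious behaviour) can be sketched as a toy example. This is not the PaySim implementation: the transaction types, shares, mean amounts, the exponential amount distribution, and the "unusually large transfer" fraud pattern are all hypothetical placeholders chosen for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Transaction:
    step: int        # simulated time step in which the transaction occurs
    kind: str        # transaction type
    amount: float
    is_fraud: bool   # ground-truth label set by the simulator


# Hypothetical aggregates standing in for statistics of the private real data:
# relative frequency and mean amount per transaction type.
AGGREGATES = {
    "PAYMENT": {"share": 0.6, "mean_amount": 50.0},
    "TRANSFER": {"share": 0.3, "mean_amount": 400.0},
    "CASH_OUT": {"share": 0.1, "mean_amount": 200.0},
}


def simulate(n_normal: int, n_fraud: int, n_steps: int, seed: int = 0) -> list[Transaction]:
    """Sample normal transactions from the aggregates, then inject fraud."""
    rng = random.Random(seed)
    kinds = list(AGGREGATES)
    weights = [AGGREGATES[k]["share"] for k in kinds]

    txs = []
    # Normal behaviour: type drawn by share, amount drawn around the type's mean.
    for kind in rng.choices(kinds, weights=weights, k=n_normal):
        amount = rng.expovariate(1.0 / AGGREGATES[kind]["mean_amount"])
        txs.append(Transaction(rng.randrange(n_steps), kind, round(amount, 2), False))

    # Injected behaviour (assumed pattern): transfers 10x to 20x the mean amount.
    for _ in range(n_fraud):
        amount = 10 * AGGREGATES["TRANSFER"]["mean_amount"] * rng.uniform(1.0, 2.0)
        txs.append(Transaction(rng.randrange(n_steps), "TRANSFER", round(amount, 2), True))

    rng.shuffle(txs)  # mix fraudulent and normal records
    return txs


transactions = simulate(n_normal=1000, n_fraud=10, n_steps=720, seed=42)
print(len(transactions), sum(t.is_fraud for t in transactions))
```

Because the simulator sets the labels itself, every injected transaction comes with exact ground truth, which is what makes synthetic data of this kind convenient for evaluating detection methods.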
🕸️ Libra Bank transaction graph
(Real dataset)
This dataset has been made available by Libra Bank. It contains anonymized information about the interactions between customers, in an aggregated form (i.e., cumulative sum transferred, and number of transactions). It contains data collected over a period of 3 months.
Ground truth information is also available in the form of:
interactions that have raised an alert (these alerts can be internal, typically rule-based, and indicate that further attention should be given)
interactions that have raised a report (alerts that are further investigated and reported, e.g., to authorities) - although not specified, these may be comparable to a SAR (Suspicious Activity Report)
The dataset is available on this website.
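To illustrate what "aggregated form" means for a transaction graph, the sketch below collapses a list of individual transfers into one edge per ordered customer pair, carrying the cumulative sum and the transaction count. The input layout and the field names `total_amount` and `n_transactions` are our own assumptions, not the column names of the Libra Bank files.

```python
from collections import defaultdict


def aggregate_interactions(transactions):
    """Collapse (sender, receiver, amount) records into one edge per ordered pair."""
    edges = defaultdict(lambda: {"total_amount": 0.0, "n_transactions": 0})
    for sender, receiver, amount in transactions:
        edge = edges[(sender, receiver)]
        edge["total_amount"] += amount
        edge["n_transactions"] += 1
    return dict(edges)


# Hypothetical raw transfers between anonymized customers.
raw = [
    ("A", "B", 100.0),
    ("A", "B", 50.0),
    ("B", "C", 20.0),
]

edges = aggregate_interactions(raw)
for (sender, receiver), stats in edges.items():
    print(sender, receiver, stats["total_amount"], stats["n_transactions"])
```

In a graph built this way, alert and report labels attach to edges rather than to individual transactions, so models trained on it classify interactions between customer pairs.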
🥃 Amaretto dataset (A Synthetic Capital Market Dataset)
(Synthetic dataset)
The dataset consists of 29,704,090 transactions executed by 400 end clients buying and selling specific securities in a specific market. Anomalous transactions have been generated and injected following five known patterns of fraudulent behavior (more details in the repository).
The dataset is available on this repository.
🇨🇿 Czech financial dataset
(Real dataset)
The Czech financial dataset contains real anonymized transactions of a Czech bank from January 1993 to December 1998. It was released for the PKDD'99 Discovery Challenge. The dataset provides information on the bank's clients, accounts, loans, credit cards, and transactions. It contains data on 4,500 accounts, 5,369 clients, and roughly 1 million transactions.