Rare Event Clustering Function in R: Overview

While working on a project at the Ministry of Finance in Belgium, we discussed how to predict fraud on imported goods. Fortunately, fraudulent cases are few and far between: only a handful appear in a long series of controls. For fraud prediction, however, this was a problem, and the team struggled to apply classical predictive modeling techniques. Fraud detection typically deals with "rare" event phenomena, where only a few cases of fraud (the "rare" events) are available for analysis.

We worked out an approach that combines matrix reduction techniques with clustering. It aims to find homogeneous groups (clusters) of observations (cases) within a dataset that contain the highest proportion of identified "rare" cases. I refined and coded the approach in R (with a Shiny application) so that it can be reproduced easily and quickly on any kind of data.
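To make the idea concrete, here is a minimal sketch in base R. It is not the actual function described in this post: the choice of PCA (prcomp) as the matrix reduction step and k-means (kmeans) as the clustering step is an assumption made for illustration, and the data, the variable names and the number of components and clusters are all made up.

```r
# Minimal sketch: matrix reduction followed by clustering.
# ASSUMPTIONS: PCA and k-means stand in for the unspecified techniques;
# the data are simulated and every name here is hypothetical.
set.seed(42)

n <- 1000
x <- matrix(rnorm(n * 6), ncol = 6)        # six numeric inputs
rare_idx <- 1:20                           # 20 "rare" cases (2% of the data)
x[rare_idx, 1:2] <- x[rare_idx, 1:2] + 4   # shift them so they sit together
is_rare <- seq_len(n) %in% rare_idx

# Step 1: reduce the input matrix to a few principal components.
pca <- prcomp(x, center = TRUE, scale. = TRUE)
scores <- pca$x[, 1:3]

# Step 2: cluster the reduced data. The rare flag is not used here.
km <- kmeans(scores, centers = 5, nstart = 10)

# Step 3: proportion of rare cases within each cluster.
rare_share <- tapply(is_rare, km$cluster, mean)
round(rare_share, 3)
```

A cluster whose share of rare cases is well above the overall rate (2% in this simulated data) is the kind of group the approach is looking for.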

Predictive methods purposely look for a relationship between the inputs and a chosen target, and produce parameter estimates that model this relationship. The method proposed here is instead unsupervised: it exposes the data to many clustering model trials, increasing the chance of finding a cluster that groups a significant number of these "rare" cases. We let the data "speak" and "reveal" the structure of the relationships within it, which forms natural groups. We then check for any potentially interesting association by cross-tabulating the "rare" events with the groups of each cluster model.
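The trial-and-check loop described above could look roughly like the sketch below. Again, this is only an illustration under assumptions: PCA and k-means are stand-ins for the reduction and clustering steps, the trial grid (number of components, number of clusters) is arbitrary, and the data are simulated. The rare flag is never used to fit a model; it only enters through the cross-tabulation.

```r
# Sketch: run many clustering trials, then cross-tabulate each one
# against the rare flag. ASSUMPTIONS: PCA + k-means, simulated data,
# hypothetical names and an arbitrary grid of trials.
set.seed(1)

n <- 1000
x <- matrix(rnorm(n * 6), ncol = 6)
is_rare <- seq_len(n) <= 20                # 20 "rare" cases
x[is_rare, 1:2] <- x[is_rare, 1:2] + 4     # give them some shared structure

scores <- prcomp(x, center = TRUE, scale. = TRUE)$x

# One trial per combination of components kept and number of clusters.
trials <- expand.grid(n_comp = 2:4, k = 3:8)
trials$best_cluster <- NA_integer_
trials$rare_share <- NA_real_
trials$rare_count <- NA_integer_

for (i in seq_len(nrow(trials))) {
  km <- kmeans(scores[, seq_len(trials$n_comp[i]), drop = FALSE],
               centers = trials$k[i], nstart = 10)

  # Cross-tabulation of clusters (rows) against the rare flag (columns).
  tab <- table(cluster = km$cluster, rare = is_rare)
  share <- tab[, "TRUE"] / rowSums(tab)

  best <- which.max(share)                 # cluster with the highest rare share
  trials$best_cluster[i] <- best
  trials$rare_share[i] <- share[best]
  trials$rare_count[i] <- tab[best, "TRUE"]
}

# Trials ranked by the rare share of their best cluster.
head(trials[order(-trials$rare_share), ])
```

Looking at rare_count next to rare_share helps to avoid being impressed by a tiny cluster that contains a single rare case.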