I have worked on Data Mining projects (clustering for telco & financial services, predictive modelling for X-Sell, churn and cluster scoring) as well as on DM campaign management projects (creating tools to help campaign analysts screen the quality of their selections and perform post-campaign analysis of overperforming sub-segments).
I have worked mainly with SAS (BASE, STAT, MACRO, Eminer, Eguide), but I have also developed a sound knowledge of R (open source), so that I can offer my customers an analytical service for which they do not have to buy a new piece of software.
What I am aiming at is a typology (segmentation) that tells a business story. Segment descriptions have to be intuitive to marketing managers so that they can be enthused and put the segments to good use in concrete marketing actions.
The first and most important challenge is to clearly identify the business objective justifying the clustering exercise. Is it a first analysis to get an eagle-eye view of the customer base's most important behavioural patterns? Is it meant to be used as a KPI tracking the effects of marketing initiatives over time? Is it meant for targeting groups of customers in specific DM campaigns? The answer drives all the subsequent aspects of the analysis.
A clear understanding of the business context and objectives then makes it possible to specify a set of measures on customers that best represent the behavioural aspects to be analyzed. The discussion on the measures to be used (availability, quality audit) has to include the marketers.
In most clustering exercises the population to analyze has to be defined: certain types of customers known a priori (new customers, inactives, outliers) may have to be excluded, as their behaviour or status could interfere with the other hidden patterns we want to discover.
Appropriate rescaling of the measures has to be considered when we compare populations using different products or subject to different commercial initiatives: we have to compare what is comparable, and all customers have to be comparable to one another on the measures used. Depending on business needs, a categorization of the continuous measures may be requested, especially if the business requires a "clean-cut" segmentation with strict boundaries between the groups. This categorization influences the clustering algorithm itself: the distance matrix has to be adapted for categorical inputs, or a specific matrix reduction technique has to be applied.
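As an illustration of the rescaling step (the original work was done in SAS/R), here is a minimal z-score standardization sketch in Python; the measures and their values are hypothetical:

```python
import math

def zscore(values):
    """Standardize a list of measures to mean 0, std 1 so that
    customers become comparable across differently scaled variables."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

# Monthly spend in euros vs. number of calls: very different scales,
# but after standardization both contribute comparably to a distance.
spend = [120.0, 80.0, 200.0, 40.0]
calls = [3.0, 5.0, 2.0, 10.0]
spend_z, calls_z = zscore(spend), zscore(calls)
```

Without this step, a Euclidean distance would be dominated by whichever measure happens to have the largest units.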
The clustering process is a multi-stage approach combining matrix reduction techniques with K-means and Ward hierarchical techniques. Depending on the data and on the objectives of the clustering (a high-level view, or a more refined clustering supporting DM campaigns), a 2- or 3-stage approach is taken. The combination of these techniques is designed to build a clustering model that relies on the robust structure of the relationships among the measures taken together. Of course, as often mentioned in textbooks, clustering is about "letting the data speak", and if the data don't have much to say, you are left with poor conclusions to draw. The aim of this methodology is to maximize the exposure of potentially robust patterns in the data to the algorithm and so bring natural, distinct patterns to the surface. In some cases the aim is to focus on minor (infrequent) behavioural patterns with large negative consequences, such as fraud; in that case the methodology is adapted to expose distinct but minor patterns.
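To make the K-means stage concrete, here is a bare-bones Lloyd's algorithm sketched in Python; this is an illustration only, as the actual multi-stage process (in SAS/R) also involves matrix reduction upstream and Ward linkage, and the data points below are made up:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm: alternate between assigning each point
    to its nearest centre and recomputing centres as cluster means."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centre.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        # Update step: recompute each centre as the mean of its cluster.
        for j, cl in enumerate(clusters):
            if cl:
                centers[j] = tuple(sum(dim) / len(cl) for dim in zip(*cl))
    return centers, clusters

# Two obvious behavioural blobs; K-means should recover them.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(points, k=2)
```

In the multi-stage approach, K-means typically produces many small, tight pre-clusters that the Ward stage then merges into the final, business-readable segments.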
I have developed algorithms and methods in SAS & R that cover the different methodological aspects mentioned above.
"your model is only as good as the predictive variables you use", Acquiring the most appropriate data to predict the event or phenomenon at stake, defining carefully the population to analyze, auditing & manipulating these data is just as important (if not more) than the modelling technique itself. I would rather spend money on developing new data sources to be integrated into a datawarehouse than on a new software package offering the latest modelling techniques.
Appropriate sampling is also crucial: the modelling data have to represent the population on which the model will be rolled out.
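One common way to honour this principle is stratified sampling, which keeps every stratum represented in the same proportion as in the roll-out population. A minimal sketch in Python; the segment labels and fraction are hypothetical:

```python
import random

def stratified_sample(customers, strata_key, fraction, seed=0):
    """Draw the same fraction from each stratum so that the modelling
    sample mirrors the population the model will be rolled out on."""
    rng = random.Random(seed)
    by_stratum = {}
    for c in customers:
        by_stratum.setdefault(strata_key(c), []).append(c)
    sample = []
    for members in by_stratum.values():
        n = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, n))
    return sample

# Hypothetical population: 100 segment-A customers, 50 segment-B.
population = [("A", i) for i in range(100)] + [("B", i) for i in range(50)]
sample = stratified_sample(population, lambda c: c[0], 0.1)
```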
The modelling technique of course remains crucial and should be carefully chosen according to several factors, chiefly how the scoring will take place. There is no simple answer as to which modelling technique is the most appropriate; several criteria have to be taken into account:
Whenever possible I tend to use (logistic) regression as a modelling technique: the inner workings of the modelling process are interpretable, so the model can tell a business story that can be explained to the business, and it reassures you that your model makes sense (another hint that it should perform well when rolled out).
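Why that interpretability matters can be shown with a toy example: once fitted, exp(b1) is an odds ratio the business can read directly ("one extra unit of X multiplies the odds of responding by exp(b1)"). A minimal single-feature fit by gradient descent, sketched in Python with made-up data:

```python
import math

def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Single-feature logistic regression fitted by batch gradient
    descent; b1 is directly interpretable via exp(b1), the odds ratio
    for a one-unit increase in the predictor."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += (p - y)
            g1 += (p - y) * x
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1

# Hypothetical data: response (1) is more likely at higher engagement scores.
xs = [0, 0, 1, 1, 2, 2, 3, 3]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
b0, b1 = fit_logistic(xs, ys)
odds_ratio = math.exp(b1)
```

A marketing manager does not need to understand the fitting procedure to act on "each extra point of engagement multiplies the odds of responding by this factor".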
A few years ago I carried out a comparative analysis of the predictive performance of various modelling techniques versus logistic regression, using real customer data (financial services): "A comparison of the predictive performance of non-parametric discriminant functions against logistic regression & a comparison of the predictive performance of logistic regression using raw categorical inputs and logistic regression using multiple correspondence analysis coordinates as inputs".
Decision trees are also relatively easy to interpret (if not too bushy), and they deal well with missing values and non-linearity. Moreover, they are easy to implement on a large scoring population using simple SQL scripts.
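The SQL-scoring point can be illustrated by flattening a fitted tree into a single CASE expression that runs in-database. A sketch in Python; the tree structure, column names and leaf scores are all hypothetical:

```python
def tree_to_sql(node):
    """Flatten a small decision tree (nested dicts) into a SQL CASE
    expression, so a large customer table can be scored in-database."""
    if "score" in node:  # leaf: emit the predicted response rate
        return str(node["score"])
    return (f"CASE WHEN {node['var']} <= {node['cut']} "
            f"THEN {tree_to_sql(node['left'])} "
            f"ELSE {tree_to_sql(node['right'])} END")

# Hypothetical churn tree: split on tenure first, then on spend.
tree = {"var": "tenure_months", "cut": 12,
        "left": {"score": 0.08},
        "right": {"var": "monthly_spend", "cut": 50,
                  "left": {"score": 0.03},
                  "right": {"score": 0.12}}}
sql = f"SELECT customer_id, {tree_to_sql(tree)} AS churn_score FROM customers"
```

The generated statement can then be handed to the DWH team as-is, with no analytical software needed on the scoring side.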
Other advanced methods like SVM or neural networks should be tested, especially when logistic regression or decision trees do not do a good enough predictive job.
I have developed predictive models in both SAS EMiner and SAS BASE/STAT, but I have also developed a modelling process in SAS Eguide that takes the best of both worlds by combining the flexibility of MACRO coding with the power of the procedures built into SAS EMiner: EMiner procedures such as the decision tree are used in an Eguide macro code node, in a loop over several populations. This combination gives the analyst ease of use, automates the process so it can run on different populations (countries, for example), delivers results in the appropriate format and pushes them automatically into the DWH.
In order to ensure continuous learning and improvement/maintenance of models, a proper implementation of control groups is essential. It can be a hard sell to managers, as you are sending DM to a random selection of customers (which goes against your argument for using a model in the first place), but it is a sound way to really assess the performance of your model, and it provides a new database for refining/revamping it.
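A control group is simply a random holdout drawn before the selection is mailed. A minimal sketch in Python; the control rate and customer IDs are hypothetical:

```python
import random

def split_control(customer_ids, control_rate=0.1, seed=42):
    """Randomly hold out a control group that receives no mailing,
    so that campaign uplift can later be measured against it."""
    rng = random.Random(seed)
    n_control = round(len(customer_ids) * control_rate)
    control = set(rng.sample(customer_ids, n_control))
    target = [c for c in customer_ids if c not in control]
    return target, sorted(control)

# Hypothetical campaign selection of 10,000 customers.
ids = list(range(1, 10001))
target, control = split_control(ids)
```

Persisting the control IDs alongside the campaign metadata is what makes the post-campaign comparison possible months later.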
This post-campaign analysis is paramount to really assess the performance of your DM campaign and makes it possible to draw key learnings from it. Which creative/offer performed better? Did the model really outperform a naive selection in terms of sales rate, and by how much? All these learnings close the loop of the modelling process and get you ready for the next iteration.
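The "did the model outperform, and by how much" question is typically answered with a lift figure plus a two-proportion z-test against the control group. A sketch in Python with hypothetical response counts:

```python
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    """z statistic comparing two response rates, e.g. the model-based
    selection (a) against a random/naive control selection (b)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p = (success_a + success_b) / (n_a + n_b)  # pooled rate
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

# Hypothetical readout: 300/5000 responders in target vs 150/5000 in control.
z = two_proportion_z(300, 5000, 150, 5000)
lift = (300 / 5000) / (150 / 5000)
```

Here the model doubles the response rate (lift of 2), and |z| > 1.96 indicates the difference is significant at the 5% level rather than a fluke of the draw.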
I have developed a "Micro-segmentation" tool that will analyze a posteriori the response of customers a few months after a specific DM campaign. The aim is to find and describe "MICRO-SEGMENTS" (groups of customers) of customers that will outperfom in terms of their response to the DM campaign from both a statistical (significance level) and business viewpoint (analyst interpretation). Decision trees provide the statistical background for the analyst to interpret those groups in a business perspective. The Micro-segments can be used at a tactical level to improve campaign selections (and help further increase ROI) and also at a more strategic level by increasing knowledge about our customers and get new insights on customer beahviour (cross-analyzing the micro-segments with existing basic segmentations).
Some consistent, ever-recurring micro-segments could even become part of a basic segmentation per se, provided they are validated by the business and statistically proven.
A presentation describing the approach is available here.
The whole analysis needed to be automated, as different specific populations had to be analyzed on a weekly basis. So a SAS macro was created in SAS Eguide that extracts the data, creates the rules (PROC ASSOC from Eminer is used inside SAS macro code, in conjunction with the other required Eminer PROCs) and scores the customers to assign them their appropriate GAP offers, following a set of user-defined parameters indicated in an Excel parameter sheet that the macro reads at the start of each call.
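The rule-generation step itself boils down to computing support and confidence over customer product baskets. A simplified one-to-one-rule sketch in Python (the actual tool uses PROC ASSOC; the products, thresholds and baskets below are hypothetical):

```python
from itertools import combinations
from collections import Counter

def one_to_one_rules(baskets, min_support=0.2, min_confidence=0.5):
    """Mine simple A -> B association rules from product baskets.
    A GAP offer is a rule whose right-hand side the customer
    does not yet own."""
    n = len(baskets)
    item_count = Counter(i for b in baskets for i in set(b))
    pair_count = Counter()
    for b in baskets:
        for a, c in combinations(sorted(set(b)), 2):
            pair_count[(a, c)] += 1
    rules = []
    for (a, c), cnt in pair_count.items():
        support = cnt / n  # share of baskets containing both items
        if support < min_support:
            continue
        for lhs, rhs in ((a, c), (c, a)):
            confidence = cnt / item_count[lhs]  # P(rhs | lhs)
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return rules

# Hypothetical product holdings per customer.
baskets = [{"savings", "card"}, {"savings", "card"},
           {"savings"}, {"card", "loan"}]
rules = one_to_one_rules(baskets)
```

The user-defined parameters of the real tool (minimum support, minimum confidence, eligible product groups) map directly onto the thresholds shown here.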
The latest developments of the AA tool allow a selection of pre-defined groups of transactions (products) to be analyzed separately. This allows for more flexibility and an appropriate separation of the different categories of GAP offers made in marketing campaigns, and it helps us control more accurately the commercial relevance of the GAP offers automatically produced by the AA tool.
I find it interesting to share a very old personal project that I developed during my very rewarding experience in the UK. I have always been rather fascinated by resampling methods like the bootstrap & jackknife. My interest has since shifted towards more pragmatic approaches, but back then I spent my weekends (yes, I was sometimes a geek) developing a SAS macro that produces gains charts showing the BCa confidence interval around my estimates at different depths of the file. For those interested in this approach, here is a copy of the document.
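For readers who want the flavour without the full BCa machinery, here is a plain percentile bootstrap for the response rate at a given file depth, sketched in Python (the original macro computed the more refined BCa interval in SAS; the response flags below are made up):

```python
import random

def bootstrap_ci(responses, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a response rate:
    resample the observations with replacement, recompute the rate,
    and read the interval off the empirical quantiles."""
    rng = random.Random(seed)
    n = len(responses)
    stats = sorted(
        sum(rng.choices(responses, k=n)) / n for _ in range(n_boot)
    )
    lo = stats[int(n_boot * alpha / 2)]
    hi = stats[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi

# Hypothetical response flags (1 = responder) in the top decile of a scored file.
top_decile = [1] * 30 + [0] * 70
lo, hi = bootstrap_ci(top_decile)
```

Plotting lo and hi at each file depth gives the gains chart with its uncertainty band, which is far more honest than a single curve.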
An idea for mapping stocks on the basis of a selection of meaningful financial indicators (or stock analysts' sentiment).
Being a (very average) stock trader myself, I came up with the idea of comparing stocks from specific sectors using multivariate matrix reduction techniques. I downloaded data for financial-services stocks from my online trading account and created an intuitive map the way I would like to see it. I would be very happy to extend the experience by implementing this very experimental idea for real, in a project for an online trading broker. A copy of this new initiative is available here (excuse my French! :-) this presentation is only available in French).
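The mapping idea rests on projecting stocks onto their leading principal axes. As a rough sketch in Python (the indicators and their values are entirely made up, and the real analysis used proper matrix reduction techniques), power iteration recovers the dominant axis of the covariance matrix:

```python
def first_principal_axis(rows, iters=200):
    """Power iteration on the covariance matrix of the indicators:
    returns the dominant principal axis, along which the stocks
    would spread out most on the map."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(x[i] * x[j] for x in centred) / n for j in range(d)]
           for i in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]  # renormalize each iteration
    return v

# Hypothetical indicators per stock: (P/E ratio, dividend yield).
stocks = [(8.0, 4.0), (12.0, 3.5), (20.0, 2.0), (30.0, 1.0)]
axis = first_principal_axis(stocks)
```

Projecting each stock onto the first two such axes yields the 2-D coordinates of the map, with nearby stocks sharing a similar financial profile.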
I have set up a SAS course for data analysts who want to quickly tap into the power of the SAS language for analyzing data. It covers all the important aspects of the SAS language for handling data & creating descriptive statistics; modelling is not included. A section of the course is devoted to the MACRO language. You can have a look at the table of contents here.