CRISP-DM

Cross-Industry Standard Process for Data Mining

I have found this methodology to be very beneficial, both as a working framework and as a way of explaining the process to clients who are unfamiliar with data mining and predictive analytics. It also helps explain why data mining does not always fit well within the confines of the traditional waterfall project methodology, and why no phase can be considered complete until all phases have been completed.

In my own projects, I tend to work in a manner close to the agile method: I complete short cycles that each include every phase and produce a working solution or model. I then go back and complete another iteration in which I incorporate additional data, business knowledge, or whatever I believe the next step toward an improved model might be. This ensures that I always have a working solution, even if that solution can still be improved upon, and it stays true to the iterative nature of CRISP-DM.

The Cross-Industry Standard Process for Data Mining (or CRISP-DM) was conceived by a consortium that included DaimlerChrysler, SPSS (formerly ISL) and NCR with funding from the European Commission. The process was intended to be industry-, tool-, and application-neutral and was developed with input from a wide range of practitioners, vendors, and management consultants through the CRISP-DM Special Interest Group.

The current revision of the methodology is CRISP-DM 1.0, which was released in August 2000. There has been some talk about updating the methodology, but no action has been taken so far and it seems unlikely that this will ever happen. In 2015, IBM Corp. released a methodology that refines and extends CRISP-DM under the name ASUM-DM, which is short for Analytics Solutions Unified Method for Data Mining/Predictive Analytics.

The new methodology fills in some of the gaps in the process that are not covered by CRISP-DM, especially around project management and deployment, and it updates the language to better match current terminology and technology. The term data mining now seems a little dated and has largely been replaced by either predictive analytics or data science, but I have mostly kept the original terminology intact in this description.

This page contains a brief description of the different phases of a data mining project as defined by CRISP-DM. The full user's guide can be found attached at the bottom of the page.

The life cycle of a data mining project consists of six phases. The sequence of the phases is not strict and moving back and forth between different phases is always required. The arrows indicate the most important and frequent dependencies between phases.

The CRISP-DM process model

The outer circle in the figure symbolizes the cyclic nature of data mining itself. A data mining process continues after a solution has been deployed. The lessons learned during the process can trigger new, often more focused business questions. Subsequent data mining processes will benefit from the experiences of previous ones.
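To make the non-strict ordering concrete, here is a minimal sketch of the six phases and the arrows between them. The transition table reflects my own reading of the process diagram rather than anything the guide specifies in code, and the names `Phase`, `ARROWS`, and `follows_arrow` are purely illustrative.

```python
from enum import Enum


class Phase(Enum):
    BUSINESS_UNDERSTANDING = "Business Understanding"
    DATA_UNDERSTANDING = "Data Understanding"
    DATA_PREPARATION = "Data Preparation"
    MODELING = "Modeling"
    EVALUATION = "Evaluation"
    DEPLOYMENT = "Deployment"


# Arrows as I read them from the process diagram (an assumption, not an
# official specification). The outer circle, which restarts the whole cycle
# after deployment, is deliberately not modeled here.
ARROWS = {
    Phase.BUSINESS_UNDERSTANDING: {Phase.DATA_UNDERSTANDING},
    Phase.DATA_UNDERSTANDING: {Phase.BUSINESS_UNDERSTANDING, Phase.DATA_PREPARATION},
    Phase.DATA_PREPARATION: {Phase.MODELING},
    Phase.MODELING: {Phase.DATA_PREPARATION, Phase.EVALUATION},
    Phase.EVALUATION: {Phase.BUSINESS_UNDERSTANDING, Phase.DEPLOYMENT},
    Phase.DEPLOYMENT: set(),
}


def follows_arrow(current: Phase, nxt: Phase) -> bool:
    """Return True if the diagram has an arrow from `current` to `nxt`."""
    return nxt in ARROWS[current]
```

Since the arrows only mark the most important and frequent dependencies, a check like `follows_arrow(Phase.MODELING, Phase.DATA_PREPARATION)` is best treated as a hint about typical movement between phases, not as a rule to enforce.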

Below follows a brief outline of the phases as they are described in the CRISP-DM 1.0 User's Guide.

Business Understanding

This initial phase focuses on understanding the project objectives and requirements from a business perspective, and then converting this knowledge into a data mining problem definition, and a preliminary plan designed to achieve the objectives.

Data Understanding

The data understanding phase starts with an initial data collection and proceeds with activities in order to get familiar with the data, to identify data quality problems, to discover first insights into the data, or to detect interesting subsets to form hypotheses for hidden information.
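As an illustration of what getting familiar with the data can look like in practice, the sketch below profiles a small set of records for missing and distinct values. The `profile` helper and the `customers` sample are hypothetical and not taken from the guide; in a real project this step would more likely be done in a dedicated tool.

```python
from collections import Counter


def profile(records):
    """Summarise each column: row count, missing values, distinct values."""
    columns = sorted({key for row in records for key in row})
    report = {}
    for col in columns:
        values = [row.get(col) for row in records]
        # Treat None and the empty string as missing (an assumption).
        present = [v for v in values if v not in (None, "")]
        report[col] = {
            "rows": len(values),
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
            "most_common": Counter(present).most_common(1),
        }
    return report


# Hypothetical sample data, for illustration only.
customers = [
    {"id": 1, "segment": "retail", "age": 34},
    {"id": 2, "segment": "", "age": 51},
    {"id": 3, "segment": "retail", "age": None},
]

summary = profile(customers)
```

Here `summary["segment"]["missing"]` is 1, which is the kind of data quality finding this phase is meant to surface before any modeling starts.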

Data Preparation

The data preparation phase covers all activities to construct the final dataset (data that will be fed into the modeling tool(s)) from the initial raw data. Data preparation tasks are likely to be performed multiple times, and not in any prescribed order. Tasks include table, record, and attribute selection as well as transformation and cleaning of data for modeling tools.
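The three kinds of tasks mentioned above (selection, cleaning, and transformation) can be sketched in a few lines. This is only one possible arrangement, shown under simplifying assumptions: the `prepare` function, the choice to drop incomplete records, and the min-max scaling are my own illustrative choices, and the `raw` records are made up.

```python
def prepare(records, attributes, scale):
    """Build a modeling dataset from raw records (illustrative only)."""
    # Attribute selection: keep only the columns the model will use.
    selected = [{a: row.get(a) for a in attributes} for row in records]
    # Cleaning / record selection: drop rows with missing values.
    complete = [
        row for row in selected
        if all(v not in (None, "") for v in row.values())
    ]
    if not complete:
        return []
    # Transformation: min-max scale the requested numeric columns to [0, 1].
    for col in scale:
        values = [row[col] for row in complete]
        lo, hi = min(values), max(values)
        span = (hi - lo) or 1  # avoid division by zero for constant columns
        for row in complete:
            row[col] = (row[col] - lo) / span
    return complete


# Hypothetical raw data, for illustration only.
raw = [
    {"id": 1, "segment": "retail", "age": 30, "notes": "x"},
    {"id": 2, "segment": "", "age": 50},
    {"id": 3, "segment": "corporate", "age": 40},
    {"id": 4, "segment": "retail", "age": 50},
]

dataset = prepare(raw, attributes=["segment", "age"], scale=["age"])
```

Because these tasks tend to be repeated in no fixed order, keeping them in a single rerunnable function like this makes it cheap to come back and adjust the dataset after a modeling attempt.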

Modeling

In this phase, various modeling techniques are selected and applied, and their parameters are calibrated to optimal values. Typically, there are several techniques for the same data mining problem type. Some techniques have specific requirements on the form of data. Therefore, stepping back to the data preparation phase is often needed.
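Parameter calibration can be illustrated with a deliberately tiny example: a one-feature threshold rule whose cut-off is chosen by trying a few candidates against validation data. This is a toy sketch and not a technique the guide prescribes; the `accuracy` and `calibrate` helpers, the candidate values, and the validation rows are all hypothetical.

```python
def accuracy(threshold, rows):
    """Share of rows where the rule `score >= threshold` matches the label."""
    hits = sum((score >= threshold) == label for score, label in rows)
    return hits / len(rows)


def calibrate(candidates, validation):
    """Pick the candidate threshold with the highest validation accuracy."""
    return max(candidates, key=lambda t: accuracy(t, validation))


# Hypothetical (score, label) pairs held out for validation.
validation = [
    (0.2, False),
    (0.4, False),
    (0.5, True),
    (0.6, True),
    (0.9, True),
]

best = calibrate([0.3, 0.5, 0.7], validation)
```

With this sample the search settles on 0.5, the only candidate that classifies every validation row correctly. Real techniques have more parameters and need more careful validation, but the loop of trying values and comparing results has the same shape.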

Evaluation

At this stage in the project you have built a model (or models) that appears to have high quality from a data analysis perspective. Before proceeding to final deployment of the model, it is important to more thoroughly evaluate the model, and review the steps executed to construct the model, to be certain it properly achieves the business objectives. A key objective is to determine if there is some important business issue that has not been sufficiently considered. At the end of this phase, a decision on the use of the data mining results should be reached.
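One way to picture the difference between quality from a data analysis perspective and achieving the business objectives is to score a model in business terms. The sketch below assumes a hypothetical marketing campaign in which each contact has a cost and each response has a value; the function name, the figures, and the sample predictions are invented for illustration.

```python
def campaign_profit(predictions, actuals, value_per_response, cost_per_contact):
    """Profit from contacting only the customers the model flags."""
    contacted = sum(predictions)
    responders = sum(p and a for p, a in zip(predictions, actuals))
    return responders * value_per_response - contacted * cost_per_contact


# Hypothetical model output and observed outcomes.
predictions = [True, True, False, False, False, False]
actuals = [True, True, False, False, False, True]

model_profit = campaign_profit(predictions, actuals, 25, 10)

# Baseline: contact every customer, with no model at all.
baseline_profit = campaign_profit([True] * len(actuals), actuals, 25, 10)

# The decision at the end of the phase, reduced to a single comparison.
use_model = model_profit > baseline_profit
```

In this made-up case the model yields a profit of 30 against 15 for contacting everyone, so the comparison favors using it. A real evaluation would also review how the model was built and whether any business issue has been overlooked, which no single number captures.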

Deployment

Creation of the model is generally not the end of the project. Even if the purpose of the model is to increase knowledge of the data, the knowledge gained will need to be organized and presented in a way that the customer can use. Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process. In many cases it will be the customer, not the data analyst, who will carry out the deployment steps. However, even if the analyst will not carry out the deployment effort, it is important for the customer to understand up front the actions that will need to be carried out in order to actually make use of the created models.

I am one of the active editors of the Wikipedia entry for the Cross Industry Standard Process for Data Mining, which is why the content of that entry is very similar to the content found here. I am responsible for creating the version of the process diagram shown as part of the Wikipedia entry and in this article, but it is based on the original artwork from the CRISP-DM 1.0 process guide.

Documents

The original CRISP-DM process guide, released in August 2000. The layout is slightly dated and it does not use color, so all of the images and graphics are presented in grayscale.

The most current release of the guide, from 2016, which ships with the IBM SPSS Modeler 18 software. There are very few changes between the two versions, but the graphics are now in full color.