Nowadays, the full CRISP-DM (Cross-Industry Standard Process for Data Mining) cycle is applied in real-life data mining (DM) applications with machine learning (ML) techniques. The realisation of DM in many areas of life, as described in the CRISP-DM cycle, creates the need for various tools for statistics, data analytics, data processing, data mining, modelling and evaluation. CRISP-DM was developed within the EU FP4-ESPRIT 4 project, ID 24959 [CRISP-DM 1999], funded in part by the European Commission. It is now the leading, de facto standard for DM applications.
CRISP-DM consists of six phases: business understanding, data understanding, data preparation, modelling, evaluation and deployment (Fig. 1).
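The cycle can be sketched as a loop over its phases: the first five are repeated with different settings until the evaluation criteria are met, after which the model is deployed. The following is a toy illustration only; the phase functions, the synthetic dataset, the settings and the majority-class "model" are all hypothetical placeholders, not part of any CRISP-DM tooling.

```python
def business_understanding():
    # Phase 1: define project objectives and success criteria.
    return {"metric": "accuracy", "threshold": 0.90}

def data_understanding(goals):
    # Phase 2: collect and explore the data (here: a synthetic toy set).
    return [(x, 1 if x < 15 else 0) for x in range(100)]

def data_preparation(raw, settings):
    # Phase 3: clean and split the data; settings vary between iterations.
    return raw[: settings["train_size"]], raw[settings["train_size"]:]

def modelling(train):
    # Phase 4: fit a trivial majority-class model as a stand-in for real ML.
    labels = [y for _, y in train]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority

def evaluation(model, test, goals):
    # Phase 5: check the model against the business success criteria.
    correct = sum(model(x) == y for x, y in test)
    return correct / len(test) >= goals["threshold"]

# Development: phases 1-5, repeated with new settings after a failed
# evaluation; phase 6 (deployment) runs only once the criteria are met.
goals = business_understanding()
raw = data_understanding(goals)
deployed = None
for settings in ({"train_size": 10}, {"train_size": 80}):
    train, test = data_preparation(raw, settings)
    model = modelling(train)
    if evaluation(model, test, goals):
        deployed = model  # Phase 6: deployment.
        break
```

In this sketch the first pass fails evaluation (the small training split is unrepresentative), so development repeats with different settings, mirroring the repetitive nature of the cycle described above.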
The whole CRISP-DM cycle is iterative. The first five phases are also called the development part of the cycle, and they can be repeated with different settings according to the evaluation results. It is important to highlight that ML algorithms learn from data; therefore, in practice, the data understanding and data preparation phases can consume a large portion of the total time of every DM project that uses ML techniques.

Recently, almost all disciplines and research areas, including computer science, business and medicine, have become deeply involved in the spreading computational culture of Big Data because of its broad reach and potential across multiple disciplines. The change in data collection has led to changes in data processing. Big Data is commonly characterised by many Vs, such as Volume, Velocity and Variety, as well as Veracity, Variability, Visualisation, Value and so on. Consequently, the methods and procedures used to process such large-scale data must be capable of handling, for example, high-volume and real-time data. Furthermore, data analysis is expected to change in this new era: the characteristics of large-scale data require new approaches and new tools that can accommodate different data structures and different spatial and temporal scales [Liu 2016]. The surge of large volumes of information to be processed by DM and ML algorithms, especially data with the Variety characteristic of the Big Data era, demands new, transformative parallel and distributed computing solutions capable of scaling computation effectively and efficiently.

Graphics processing units (GPUs) have become widespread tools for speeding up general-purpose computation over the last decade [Cano 2017]. They offer massive parallelism that extends algorithms to large-scale data at a fraction of the cost of a traditional high-performance CPU cluster.

The content of the document is organised as follows. Part 1 gives an introduction to data mining for large-scale data.
Part 2 presents a comprehensive overview of the evolution and emerging trends in ML and DL. It also briefly describes the connection between DL and accelerated computing. The main part of the document is Part 3, which surveys the state of the art in DL, NN and ML frameworks and libraries. This part is divided into three subparts: general frameworks and libraries, DL with GPU support, and ML/DL integrated with MapReduce. Finally, Part 4 concludes the document.