When facing the huge number of available ML algorithms, the most frequent question is: “Which algorithm is the right solution for the given problem?” The answer varies depending on many factors, including 1) the size, quality, and nature of the domain data; 2) the available computational time; 3) the urgency of the task; and 4) the aim of the analysis. In many cases, no one can tell which algorithm will perform best without trying several of them after a thoughtful data examination. A concrete algorithm is therefore usually chosen based on data characteristics and exploratory data analysis. As is generally true for DM using an ML approach, the performance of data models depends strongly on the representativeness of the provided data set. The complementarity of methods encourages trying different options from the wide spectrum of available modelling methods. To reach maximum performance, it is often necessary to train each model multiple times with different parameters and options. It can also be suitable to combine several independent models of different types, because each type can be strong in fitting different cases. The full potential of the data can be tapped by a cooperation of partial weak models, e.g. using ensemble learning methods (so-called model ensembling) based on principles such as voting, record weighting, multiple training processes or random selection. Hence, a proper combination of several types of models with different advantages and disadvantages can be used to reach maximum accuracy and stability in predictions.
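The voting principle mentioned above can be illustrated with a minimal sketch of hard (majority) voting. The base-model predictions below are hypothetical placeholders, not taken from the document; the combiner itself is generic:

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-model predictions by hard voting.

    predictions: list of lists, one list of class labels per base model,
    all of the same length (one label per record).
    """
    combined = []
    for labels in zip(*predictions):  # labels predicted for one record
        combined.append(Counter(labels).most_common(1)[0][0])
    return combined

# Three hypothetical base models that disagree on some records:
model_a = [1, 0, 1, 1]
model_b = [1, 1, 0, 1]
model_c = [0, 0, 1, 1]

print(majority_vote([model_a, model_b, model_c]))  # -> [1, 0, 1, 1]
```

Because each base model errs on different records, the combined vote can be more accurate and more stable than any single model, which is the rationale for combining weak models stated above.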
The simplest and most customary way is to categorize ML algorithms into supervised, unsupervised, and semi-supervised learning [Goodfellow 2016], as follows.
It is interesting to note that ML algorithms have no strict categorization, i.e. a given method can belong to more than one category. For example, NNs can be trained in a supervised manner for some problems and in an unsupervised manner for others. Although the problem of algorithm categorization is interesting, it is out of the scope of this document.
Pre-processing and post-processing algorithms can also be categorized into a number of subcategories such as dimensionality reduction, sampling (subsampling, oversampling), linear methods, statistical testing, and feature engineering with feature extraction, feature encoding, feature transformation and feature selection (e.g. mutual information, chi-square (χ2) statistics). Many more algorithms can be listed here for overfitting prevention (e.g. regularization, threshold setting, pruning, dropout), model selection and performance optimization (e.g. hyper-parameter tuning, grid search, local minimum search, bio-inspired optimization) and model evaluation (e.g. cross-validation, k-fold, holdout) with various metrics such as accuracy (ACC), precision, recall, F1, Matthews correlation coefficient (MCC), receiver operating characteristic (ROC), area under the curve (ROC AUC), mean absolute error (MAE), mean squared error (MSE), and root-mean-square error (RMSE).
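Several of the evaluation concepts listed above can be sketched compactly. The following minimal, self-contained example computes the binary-classification metrics ACC, precision, recall, F1 and MCC from their standard definitions, and builds contiguous k-fold index splits; the sample label vectors are hypothetical:

```python
import math

def confusion_counts(y_true, y_pred):
    """Counts for a binary confusion matrix (positive class = 1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

def evaluation_metrics(y_true, y_pred):
    """ACC, precision, recall, F1 and MCC from the textbook formulas."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    acc = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "precision": precision, "recall": recall,
            "F1": f1, "MCC": mcc}

def k_fold_indices(n_records, k):
    """Split record indices 0..n_records-1 into k contiguous folds;
    in practice the records would be shuffled or stratified first."""
    folds, start = [], 0
    for i in range(k):
        size = n_records // k + (1 if i < n_records % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

# Hypothetical labels and predictions:
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(evaluation_metrics(y_true, y_pred))
print(k_fold_indices(6, 3))  # -> [[0, 1], [2, 3], [4, 5]]
```

In k-fold cross-validation each fold serves once as the held-out test set while the remaining folds are used for training, and the metrics above are averaged over the k runs.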
Fig. 3 provides a comprehensive graphical overview of ML methods for modelling as well as for pre-processing and post-processing. However, this overview is subject to change as the number of ML algorithms increases continually.