Several data models and algorithms are used to achieve the digital flight analysis task: clustering, classification (binomial prediction) and clustering-based classification enhancement. The clustering models used are X-means and K-means; the classification models (binomial prediction) used are logistic regression, random forest and decision tree; clustering-based classification enhancement uses the same classification models, but each model is applied after K-means clustering. The performance of these data models is compared to find the most suitable model for the digital flight analysis task.
To achieve the second objective of this project, which is the prediction of the duration of departure delay, several classification models (continuous prediction) are used: linear regression, support vector machines (SVM) and neural network. The models are built according to the delay type of the departure delay, so a separate data model is built for each delay type. The average performance of each data model is then calculated and compared with the average performance of the other data models.
Classification is a data mining function that assigns items in a collection to target categories or classes. The goal of classification is to accurately predict the target class for each case in the data. Classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations whose category membership is known.
This is a snapshot of the process used to apply logistic regression.
To apply classification and prediction to the dataset, several modules in RapidMiner are used.
Logistic regression is a statistical analysis method used to predict a data value based on prior observations of a data set. A logistic regression model predicts a dependent data variable by analyzing the relationship between one or more existing independent variables. Logistic regression predicts the probability of an outcome that can only have two values.
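As a rough illustration of the same idea outside RapidMiner, the sketch below fits a logistic regression with scikit-learn; the generated data, feature count and split are placeholders standing in for the flight dataset and the RapidMiner process, not the project's actual setup.

```python
# Minimal logistic regression sketch with scikit-learn (illustrative only).
# X and y stand in for the flight features and the binary delayed/not-delayed label.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Placeholder data in place of the real flight dataset.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# predict_proba gives the probability of each of the two possible outcomes.
probabilities = model.predict_proba(X_test)
predictions = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, predictions))
```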
This is a snapshot of the logistic regression model.
A decision tree is a map of the possible outcomes of a series of related choices. Decision trees are a non-parametric supervised learning method used for classification. A decision tree can be used to visually and explicitly represent decisions and decision making; as the name suggests, it uses a tree-like model of decisions. A decision tree can be used either to drive informal discussion or to map out an algorithm that predicts the best choice mathematically. It typically starts with a single node, which branches into possible outcomes, and each of those outcomes leads to additional nodes, which branch off into further possibilities.
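The following is a minimal decision tree sketch in scikit-learn, again only as a stand-in for the RapidMiner operator; the synthetic data and the depth limit are illustrative assumptions.

```python
# Minimal decision tree sketch with scikit-learn (illustrative only).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# max_depth limits how far the tree branches from the root node into further outcomes.
tree = DecisionTreeClassifier(max_depth=4, random_state=0)
tree.fit(X_train, y_train)

print("Accuracy:", accuracy_score(y_test, tree.predict(X_test)))
# The fitted tree can be printed as nested if/else rules, mirroring the tree-like model of decisions.
print(export_text(tree))
```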
This is a snapshot of the decision tree model.
Random forests, or random decision forests, are an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time. Random forest is a flexible, easy-to-use machine learning algorithm that produces a good result most of the time, even without hyper-parameter tuning. It is also one of the most widely used algorithms because of its simplicity and the fact that it can be used for both classification and regression tasks.
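A comparable random forest sketch is shown below; the number of trees and the synthetic data are illustrative assumptions rather than the settings used in the project.

```python
# Minimal random forest sketch with scikit-learn (illustrative only).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# An ensemble of decision trees; default hyper-parameters already give a reasonable baseline.
forest = RandomForestClassifier(n_estimators=100, random_state=1)
forest.fit(X_train, y_train)

print("Accuracy:", accuracy_score(y_test, forest.predict(X_test)))
```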
This is a snapshot of the random forest model.
This is a snapshot of the process used to apply linear regression.
To apply the model, several modules in RapidMiner are used.
Linear regression is a linear approach to modeling the relationship between a dependent variable and one or more independent variables. In linear regression, the relationships are modeled using linear predictor functions whose unknown model parameters are estimated from the data. A linear regression line has an equation of the form Y = a + bX, where X is the explanatory variable and Y is the dependent variable. The slope of the line is b, and a is the intercept (the value of Y when X = 0).
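The sketch below fits the line Y = a + bX with scikit-learn on synthetic data (an illustrative assumption, not the project's delay data), recovering the intercept a and slope b.

```python
# Minimal linear regression sketch with scikit-learn (illustrative only).
# The single feature X and target y stand in for an explanatory variable and the delay duration.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))             # explanatory variable
y = 3.0 + 2.0 * X[:, 0] + rng.normal(0, 1, 200)   # Y = a + bX plus noise

model = LinearRegression()
model.fit(X, y)

# intercept_ corresponds to a and coef_ to b in Y = a + bX.
print("a (intercept):", model.intercept_)
print("b (slope):", model.coef_[0])
print("Prediction at X = 5:", model.predict([[5.0]])[0])
```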
This is a snapshot of the linear regression model.
Neural networks are computing systems vaguely inspired by the biological neural networks that constitute animal brains. A neural network itself is not an algorithm, but rather a framework for many different machine learning algorithms to work together and process complex data inputs.
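Below is a minimal multi-layer perceptron sketch in scikit-learn; a regressor is shown because the project applies the neural network to continuous prediction of delay duration, and the layer sizes and synthetic data are illustrative assumptions.

```python
# Minimal neural network (multi-layer perceptron) sketch with scikit-learn (illustrative only).
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=1000, n_features=8, noise=10.0, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2)

# Two hidden layers of 32 neurons each; the network learns weights that map inputs to the target.
net = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000, random_state=2)
net.fit(X_train, y_train)

print("Test RMSE:", mean_squared_error(y_test, net.predict(X_test)) ** 0.5)
```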
This is a snapshot of the neural network model.
Support vector machines are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis. Given a set of training examples, each marked as belonging to one or the other of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier.
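The following sketch uses support vector regression (SVR), the regression counterpart of the SVM classifier described above, to match the project's use of SVM for continuous prediction of delay duration; the kernel, C value and synthetic data are illustrative assumptions.

```python
# Minimal support vector regression (SVR) sketch with scikit-learn (illustrative only).
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.metrics import mean_absolute_error

X, y = make_regression(n_samples=1000, n_features=8, noise=15.0, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=3)

# SVMs are sensitive to feature scale, so the features are standardised first.
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0))
model.fit(X_train, y_train)

print("Test MAE:", mean_absolute_error(y_test, model.predict(X_test)))
```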
This is a snapshot of the support vector machine (SVM) model.
K-means is one of the clustering models applied in this project.
Clustering methods are used to group the data so that data within a segment are alike while data across segments are different. In K-means, cluster centroids are chosen randomly for a fixed number of clusters K. The number of clusters K must be supplied by the user and the search is prone to local minima, so different values of K will affect the result of the clustering algorithm. Thus, the value of K is tested from 2 to 10 to find which value of K is the most suitable for this K-means model.
The Davies-Bouldin index and the average cluster distance are used to evaluate the performance of the clustering models. The Davies-Bouldin index is a metric introduced by David L. Davies and Donald W. Bouldin to evaluate clustering algorithms. It is an internal evaluation scheme, where the validation of how well the clustering has been done is made using quantities and features inherent in the dataset. The lower the Davies-Bouldin index, the better the separation between the clusters and the "tightness" inside the clusters, which represents good performance for the clustering model.
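The k sweep described above can be sketched as follows, with scikit-learn standing in for the RapidMiner operators: K-means is run for k = 2 to 10 and each run is scored with the Davies-Bouldin index, the lowest value indicating the preferred k. The synthetic data is only a placeholder for the flight dataset.

```python
# Minimal sketch of testing k from 2 to 10 and scoring each run with the Davies-Bouldin index.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

# Placeholder data in place of the real flight dataset.
X, _ = make_blobs(n_samples=1000, centers=5, random_state=4)

for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=4)
    labels = kmeans.fit_predict(X)
    # A lower Davies-Bouldin index indicates better-separated, tighter clusters.
    print(f"k = {k:2d}  Davies-Bouldin index = {davies_bouldin_score(X, labels):.3f}")
```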
This is a snapshot of the K-means model.
X-means is one example of a clustering algorithm and technique. X-means performs a division of objects into clusters that are "similar" to each other and "dissimilar" to the objects belonging to other clusters. X-means clustering is essentially a variation of K-means clustering. The variation is important because in K-means the number of clusters K must be supplied by the user, the search is prone to local minima for that fixed number of clusters, and different values of K will affect the result of the clustering algorithm. Thus, X-means was created to treat cluster allocations by repeatedly attempting partitions and keeping the optimal resulting splits until some criterion is reached. X-means reveals the true number of classes in the underlying distribution, and it is much faster than repeatedly running accelerated K-means for different values of K.
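As an illustration, the sketch below uses the X-means implementation from the pyclustering library (an assumption made for illustration only; the project itself uses the RapidMiner X-means operator). The algorithm is seeded with two centres and allowed to split clusters up to a maximum of ten.

```python
# Minimal X-means sketch using the pyclustering library (illustrative assumption;
# the project applies the equivalent RapidMiner operator).
from pyclustering.cluster.xmeans import xmeans
from pyclustering.cluster.center_initializer import kmeans_plusplus_initializer
from sklearn.datasets import make_blobs

# Placeholder data in place of the real flight dataset.
X, _ = make_blobs(n_samples=1000, centers=5, random_state=5)
data = X.tolist()

# Start from 2 initial centres and let X-means split clusters up to a maximum of 10,
# keeping only the splits that improve its internal splitting criterion.
initial_centers = kmeans_plusplus_initializer(data, 2).initialize()
model = xmeans(data, initial_centers, kmax=10)
model.process()

clusters = model.get_clusters()
print("Estimated number of clusters:", len(clusters))
```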
This is the process for K-means and X-means clustering. The only difference is that K-means requires the number of clusters K to be set manually, whereas X-means estimates the best value of the number of clusters K itself.
This is a snapshot of the X-means model.
The task of classification is a supervised approach for the discovery of useful information from data. Data are growing in both size and complexity and, unfortunately, most classification methods are not scalable. Thus, clustering-based classification enhancement is used. Clustering-based classification has two main steps: applying the clustering model, followed by the classification model. This method is applied to this dataset to find the most suitable model for achieving the data mining task.
Random forest is one of the classification models that is mainly used. In this data model, the K-means model is first applied to the dataset. The value of k used in this model is 8 because the Davies-Bouldin index has been shown to be the lowest at k = 8 in the previous section. After applying K-means, random forest is applied, and its performance is evaluated by its accuracy.
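A minimal sketch of this two-step model is shown below. One common way to combine the two steps, assumed here for illustration, is to append the K-means cluster label to the attributes before training the classifier; swapping in a decision tree or logistic regression gives the other two variants described below. The synthetic data is a placeholder for the flight dataset.

```python
# Minimal sketch of the clustering-based enhancement: run K-means with k = 8, append the
# cluster label as an extra attribute, then train and evaluate the classifier.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Placeholder data in place of the real flight dataset.
X, y = make_classification(n_samples=2000, n_features=10, random_state=6)

# Step 1: K-means with k = 8, the value with the lowest Davies-Bouldin index.
kmeans = KMeans(n_clusters=8, n_init=10, random_state=6)
cluster_labels = kmeans.fit_predict(X)
X_enhanced = np.column_stack([X, cluster_labels])

# Step 2: train the classifier on the cluster-enhanced data and report its accuracy.
X_train, X_test, y_train, y_test = train_test_split(X_enhanced, y, test_size=0.3, random_state=6)
forest = RandomForestClassifier(n_estimators=100, random_state=6)
forest.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, forest.predict(X_test)))
```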
This is a snapshot of the process used to apply random forest after applying K-means clustering.
Decision tree is one of the classification models that is mainly used. In this data model, the K-means model is first applied to the dataset. The value of k used in this model is 8 because the Davies-Bouldin index has been shown to be the lowest at k = 8 in the previous section. After applying K-means, the decision tree is applied, and its performance is evaluated by its accuracy.
This is a snapshot of the process used to apply decision tree after applying K-means clustering.
Logistic regression is one of the classification models that is mainly used. In this data model, the K-means model is first applied to the dataset. The value of k used in this model is 8 because the Davies-Bouldin index has been shown to be the lowest at k = 8 in the previous section. After applying K-means, logistic regression is applied, and its performance is evaluated by its accuracy.
This is a snapshot of the process used to apply logistic regression after applying K-means clustering.