Several data models and algorithms are used to achieve the digital flight analysis task: clustering, classification (binomial prediction) and clustering-based classification enhancement. The clustering models used are X-means and K-means; the classification models (binomial prediction) used are logistic regression, random forest and decision tree; clustering-based classification enhancement uses the same classification models, but each model is applied after K-means clustering. The performance of these data models is compared to find the most suitable model for the digital flight analysis task.
To achieve the second objective of this project, which is the prediction of the duration of departure delay, several classification models (continuous prediction) are used: linear regression, support vector machines (SVM) and neural network. The models are built according to the delay type of the departure delay, so a separate data model is built for each delay type. The average performance of each data model is then calculated and compared with the average performance of the other data models.
Classification is a data mining function that assigns items in a collection to target categories or classes. The goal of classification is to accurately predict the target class for each case in the data. Classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations whose category membership is known.
This is a snapshot of the process used to apply logistic regression.
To apply classification and prediction to the dataset, several modules in RapidMiner are used.
Logistic regression is a statistical analysis method used to predict a data value based on prior observations of a data set. A logistic regression model predicts a dependent data variable by analyzing the relationship between one or more existing independent variables. Logistic regression predicts the probability of an outcome that can only have two values.
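As a rough illustration of the same idea outside RapidMiner, the sketch below fits a logistic regression with scikit-learn; the generated data, feature count and split are placeholders standing in for the flight dataset and the RapidMiner process, not the project's actual setup.

```python
# Minimal logistic regression sketch with scikit-learn (illustrative only).
# X and y stand in for the flight features and the binary delayed/not-delayed label.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Placeholder data in place of the real flight dataset.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# predict_proba gives the probability of each of the two possible outcomes.
probabilities = model.predict_proba(X_test)
predictions = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, predictions))
```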
This is a snapshot of the logistic regression model.
A decision tree is a map of the possible outcomes of a series of related choices. Decision trees are a non-parametric supervised learning method used for classification. A decision tree can be used to visually and explicitly represent decisions and decision making; as the name suggests, it uses a tree-like model of decisions. A decision tree can be used either to drive informal discussion or to map out an algorithm that predicts the best choice mathematically. It typically starts with a single node, which branches into possible outcomes, and each of those outcomes leads to additional nodes, which branch off into further possibilities.
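The following is a minimal decision tree sketch in scikit-learn, again only as a stand-in for the RapidMiner operator; the synthetic data and the depth limit are illustrative assumptions.

```python
# Minimal decision tree sketch with scikit-learn (illustrative only).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# max_depth limits how far the tree branches from the root node into further outcomes.
tree = DecisionTreeClassifier(max_depth=4, random_state=0)
tree.fit(X_train, y_train)

print("Accuracy:", accuracy_score(y_test, tree.predict(X_test)))
# The fitted tree can be printed as nested if/else rules, mirroring the tree-like model of decisions.
print(export_text(tree))
```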
This is a snapshot of the decision tree model.
Random forests, or random decision forests, are an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time. Random forest is a flexible, easy-to-use machine learning algorithm that produces a good result most of the time, even without hyper-parameter tuning. It is also one of the most widely used algorithms because of its simplicity and the fact that it can be used for both classification and regression tasks.
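A comparable random forest sketch is shown below; the number of trees and the synthetic data are illustrative assumptions rather than the settings used in the project.

```python
# Minimal random forest sketch with scikit-learn (illustrative only).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# An ensemble of decision trees; default hyper-parameters already give a reasonable baseline.
forest = RandomForestClassifier(n_estimators=100, random_state=1)
forest.fit(X_train, y_train)

print("Accuracy:", accuracy_score(y_test, forest.predict(X_test)))
```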
This is a snapshot of the random forest model.
This is a snapshot of the process used to apply linear regression.
To apply the model, several modules in RapidMiner are used.
Linear regression is a linear approach to modeling the relationship between a dependent variable and one or more independent variables. In linear regression, the relationships are modeled using linear predictor functions whose unknown model parameters are estimated from the data. A linear regression line has an equation of the form Y = a + bX, where X is the explanatory variable and Y is the dependent variable. The slope of the line is b, and a is the intercept (the value of Y when X = 0).
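The sketch below fits the line Y = a + bX with scikit-learn on synthetic data (an illustrative assumption, not the project's delay data), recovering the intercept a and slope b.

```python
# Minimal linear regression sketch with scikit-learn (illustrative only).
# The single feature X and target y stand in for an explanatory variable and the delay duration.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))             # explanatory variable
y = 3.0 + 2.0 * X[:, 0] + rng.normal(0, 1, 200)   # Y = a + bX plus noise

model = LinearRegression()
model.fit(X, y)

# intercept_ corresponds to a and coef_ to b in Y = a + bX.
print("a (intercept):", model.intercept_)
print("b (slope):", model.coef_[0])
print("Prediction at X = 5:", model.predict([[5.0]])[0])
```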
This is a snapshot of the linear regression model.
Neural networks are computing systems vaguely inspired by the biological neural networks that constitute animal brains. A neural network itself is not an algorithm, but rather a framework for many different machine learning algorithms to work together and process complex data inputs.
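Below is a minimal multi-layer perceptron sketch in scikit-learn; a regressor is shown because the project applies the neural network to continuous prediction of delay duration, and the layer sizes and synthetic data are illustrative assumptions.

```python
# Minimal neural network (multi-layer perceptron) sketch with scikit-learn (illustrative only).
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=1000, n_features=8, noise=10.0, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2)

# Two hidden layers of 32 neurons each; the network learns weights that map inputs to the target.
net = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000, random_state=2)
net.fit(X_train, y_train)

print("Test RMSE:", mean_squared_error(y_test, net.predict(X_test)) ** 0.5)
```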
This is a snapshot of the neural network model.
Support vector machines are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis. Given a set of training examples, each marked as belonging to one or the other of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier.
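The following sketch uses support vector regression (SVR), the regression counterpart of the SVM classifier described above, to match the project's use of SVM for continuous prediction of delay duration; the kernel, C value and synthetic data are illustrative assumptions.

```python
# Minimal support vector regression (SVR) sketch with scikit-learn (illustrative only).
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.metrics import mean_absolute_error

X, y = make_regression(n_samples=1000, n_features=8, noise=15.0, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=3)

# SVMs are sensitive to feature scale, so the features are standardised first.
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0))
model.fit(X_train, y_train)

print("Test MAE:", mean_absolute_error(y_test, model.predict(X_test)))
```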
This is a snapshot of the support vector machine (SVM) model.
K-means is one of the clustering models applied in this project.
Clustering methods are used to group the data so that data within a segment are alike while data across segments are different. In K-means, cluster centroids are chosen randomly for a fixed number of clusters K. The number of clusters K must be supplied by the user and the search is prone to local minima, so different values of K will affect the result of the clustering algorithm. Thus, the value of K is tested from 2 to 10 to find which value of K is the most suitable for this K-means model.
The Davies-Bouldin index and the average cluster distance are used to evaluate the performance of the clustering models. The Davies-Bouldin index is a metric introduced by David L. Davies and Donald W. Bouldin to evaluate clustering algorithms. It is an internal evaluation scheme, where the validation of how well the clustering has been done is made using quantities and features inherent in the dataset. The lower the Davies-Bouldin index, the better the separation between the clusters and the "tightness" inside the clusters, which represents good performance for the clustering model.
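The k sweep described above can be sketched as follows, with scikit-learn standing in for the RapidMiner operators: K-means is run for k = 2 to 10 and each run is scored with the Davies-Bouldin index, the lowest value indicating the preferred k. The synthetic data is only a placeholder for the flight dataset.

```python
# Minimal sketch of testing k from 2 to 10 and scoring each run with the Davies-Bouldin index.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

# Placeholder data in place of the real flight dataset.
X, _ = make_blobs(n_samples=1000, centers=5, random_state=4)

for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=4)
    labels = kmeans.fit_predict(X)
    # A lower Davies-Bouldin index indicates better-separated, tighter clusters.
    print(f"k = {k:2d}  Davies-Bouldin index = {davies_bouldin_score(X, labels):.3f}")
```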
This is a snapshot of the K-means model.
X-means is one example of a clustering algorithm and technique. X-means performs a division of objects into clusters that are "similar" to each other and "dissimilar" to the objects belonging to other clusters. X-means clustering is essentially a variation of K-means clustering. The variation is important because in K-means the number of clusters K must be supplied by the user, the search is prone to local minima for that fixed number of clusters, and different values of K will affect the result of the clustering algorithm. Thus, X-means was created to treat cluster allocations by repeatedly attempting partitions and keeping the optimal resulting splits until some criterion is reached. X-means reveals the true number of classes in the underlying distribution, and it is much faster than repeatedly running accelerated K-means for different values of K.
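As an illustration, the sketch below uses the X-means implementation from the pyclustering library (an assumption made for illustration only; the project itself uses the RapidMiner X-means operator). The algorithm is seeded with two centres and allowed to split clusters up to a maximum of ten.

```python
# Minimal X-means sketch using the pyclustering library (illustrative assumption;
# the project applies the equivalent RapidMiner operator).
from pyclustering.cluster.xmeans import xmeans
from pyclustering.cluster.center_initializer import kmeans_plusplus_initializer
from sklearn.datasets import make_blobs

# Placeholder data in place of the real flight dataset.
X, _ = make_blobs(n_samples=1000, centers=5, random_state=5)
data = X.tolist()

# Start from 2 initial centres and let X-means split clusters up to a maximum of 10,
# keeping only the splits that improve its internal splitting criterion.
initial_centers = kmeans_plusplus_initializer(data, 2).initialize()
model = xmeans(data, initial_centers, kmax=10)
model.process()

clusters = model.get_clusters()
print("Estimated number of clusters:", len(clusters))
```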
This is the process for K-means and X-means clustering. The only difference is that K-means requires the number of clusters K to be set manually, whereas X-means estimates the best value of the number of clusters K itself.
This is a snapshot of the X-means model.
The task of classification is a supervised approach for the discovery of useful information from data. Data are growing in both size and complexity and, unfortunately, most classification methods are not scalable. Thus, clustering-based classification enhancement is used. Clustering-based classification has two main steps: applying the clustering model, followed by the classification model. This method is applied to this dataset to find the most suitable model for achieving the data mining task.
Random forest is one of the classification models that is mainly used. In this data model, the K-means model is first applied to the dataset. The value of k used in this model is 8 because the Davies-Bouldin index has been shown to be the lowest at k = 8 in the previous section. After applying K-means, random forest is applied, and its performance is evaluated by its accuracy.
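A minimal sketch of this two-step model is shown below. One common way to combine the two steps, assumed here for illustration, is to append the K-means cluster label to the attributes before training the classifier; swapping in a decision tree or logistic regression gives the other two variants described below. The synthetic data is a placeholder for the flight dataset.

```python
# Minimal sketch of the clustering-based enhancement: run K-means with k = 8, append the
# cluster label as an extra attribute, then train and evaluate the classifier.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Placeholder data in place of the real flight dataset.
X, y = make_classification(n_samples=2000, n_features=10, random_state=6)

# Step 1: K-means with k = 8, the value with the lowest Davies-Bouldin index.
kmeans = KMeans(n_clusters=8, n_init=10, random_state=6)
cluster_labels = kmeans.fit_predict(X)
X_enhanced = np.column_stack([X, cluster_labels])

# Step 2: train the classifier on the cluster-enhanced data and report its accuracy.
X_train, X_test, y_train, y_test = train_test_split(X_enhanced, y, test_size=0.3, random_state=6)
forest = RandomForestClassifier(n_estimators=100, random_state=6)
forest.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, forest.predict(X_test)))
```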
This is a snapshot of the process used to apply random forest after applying K-means clustering.
Decision tree is one of the classification models that is mainly used. In this data model, the K-means model is first applied to the dataset. The value of k used in this model is 8 because the Davies-Bouldin index has been shown to be the lowest at k = 8 in the previous section. After applying K-means, the decision tree is applied, and its performance is evaluated by its accuracy.
This is a snapshot of the process used to apply decision tree after applying K-means clustering.
Logistic regression is one of the classification models that is mainly used. In this data model, the K-means model is first applied to the dataset. The value of k used in this model is 8 because the Davies-Bouldin index has been shown to be the lowest at k = 8 in the previous section. After applying K-means, logistic regression is applied, and its performance is evaluated by its accuracy.
This is a snapshot of the process used to apply logistic regression after applying K-means clustering.