Data Analysis

Correlation

Preliminary analysis is done before we move forward with the actual modeling.

Objective 1

Correlation is a statistical technique that shows whether, and how strongly, pairs of variables are related. A correlation coefficient is a number between -1 and +1 that measures the degree of linear association between two attributes (call them X and Y). A positive value implies a positive association: large values of X tend to be associated with large values of Y, and small values of X with small values of Y. A negative value implies a negative, or inverse, association: large values of X tend to be associated with small values of Y, and vice versa. The diagram above shows the process we used to determine the correlation of the variables. The Weight by Correlation operator calculates the weight of each attribute with respect to the label attribute by using correlation. The higher the weight of an attribute, the more relevant it is considered.
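
As a rough illustration of the idea, the sketch below computes the Pearson correlation between each attribute and a label, and ranks the attributes by the absolute value of that correlation. It is not the RapidMiner operator itself: the function names, the use of the absolute value as the weight, and the sample numbers are all assumptions made for the example.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def correlation_weights(attributes, label):
    """Weight of each attribute = |correlation with the label| (assumed convention)."""
    return {name: abs(pearson_correlation(values, label))
            for name, values in attributes.items()}


# Made-up sample values, only to show the calculation (not the project data).
attributes = {
    "departure_delay": [0, 5, 10, 20, 40],
    "taxi_out_tm": [12, 15, 11, 18, 20],
}
arrival_delay = [2, 4, 12, 18, 45]

weights = correlation_weights(attributes, arrival_delay)
# Attributes sorted from most to least relevant under this weighting.
ranking = sorted(weights, key=weights.get, reverse=True)
```

Sorting by the weight gives the same kind of ranking that the report reads off the weight table.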

The diagram above gives the value of correlation for all the variables.

Based on the diagram above, the attribute with the highest correlation is Departure Delay, with a value of 0.624. We can therefore initially conclude that this attribute has a moderate relationship with the arrival delay. Compared with all the other attributes, departure delay is the one most strongly associated with arrival delay, as it has the highest correlation. In a nutshell, based on the weights, the significant variables are Departure Delay, TAXI_OUT_TM and TOT_PAX_CT, which have the most effect on arrival delay.

Objective 2

For objective 2, we use the same operator, Weight by Correlation, to calculate the weight of each attribute with respect to the label attribute. The diagram above shows the process used to calculate the correlation.

The diagram above gives the value of correlation for all the variables.

Based on the diagram above, the attribute with the highest correlation is LATEST_TAIL_NR=9MMUA, with a value of 0.426. We can therefore initially conclude that this attribute has a moderate relationship with the departure delay. Compared with all the other attributes, LATEST_TAIL_NR=9MMUA is the one most strongly associated with departure delay, as it has the highest correlation. In a nutshell, based on the weights, the significant variables are LATEST_TAIL_NR=9MMUA, LATEST_TAIL_NR=9MMUD and TAXI_OUT_TM, which have the most effect on departure delay.

Objective 1: Prediction of the arrival delay of a flight

1) Classification model

a) Logistic Regression

Logistic regression is applied to the data set to predict the arrival delay of the flight. The model is built using RapidMiner.

This is the result of the logistic regression model

A regression coefficient describes the size and direction of the relationship between a predictor and the response variable. Coefficients are the numbers by which the values of the terms are multiplied in a regression equation. Use the coefficient to determine whether a change in a predictor variable makes the event more likely or less likely. Generally, positive coefficients make the event more likely and negative coefficients make the event less likely. An estimated coefficient near 0 implies that the effect of the predictor is small. For example, “Taxi_out_tm.range4” has a coefficient of 14.250, so a flight that falls in this taxi-out range is estimated to be more likely to have the predicted event. The standard error estimates the uncertainty of the coefficient, that is, how much the estimate would vary from sample to sample. The z-value is the regression coefficient divided by its standard error. If the z-value is large in magnitude, it indicates that the corresponding true regression coefficient is probably not 0. The p-value helps you decide whether there is evidence of a relationship between the predictor and the response. The smaller the p-value, the stronger the evidence that the relationship exists. For example, “Taxi_out_tm.range4” has a p-value of 0.910, so despite its large coefficient there is little evidence that it is related to the response.
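
The relationship between a coefficient, its standard error, the z-value and the p-value can be sketched as follows. This is a minimal illustration that assumes a two-sided test against a standard normal distribution; the function names and the sample numbers are hypothetical and are not taken from the RapidMiner output.

```python
import math


def z_value(coefficient, standard_error):
    """z-value: the coefficient divided by its standard error."""
    if standard_error <= 0:
        raise ValueError("standard error must be positive")
    return coefficient / standard_error


def two_sided_p_value(z):
    """Probability of seeing |Z| >= |z| if the true coefficient were 0."""
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical numbers: a large coefficient with an even larger standard error
# still gives a small z-value and therefore a large p-value.
z = z_value(coefficient=3.0, standard_error=30.0)
p = two_sided_p_value(z)
```

This is why a coefficient can look large and still not be statistically significant: what matters is its size relative to its standard error.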

This is the performance of the logistic regression model.

From the table above, 603 of the true not-delayed flights are predicted correctly, whereas 130 are not. Of the true delayed flights, 163 are not predicted correctly, whereas 801 are. The accuracy of the logistic regression model is 82.73%.
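
The accuracy figure follows directly from the four counts in the table: correct predictions divided by all predictions. The short sketch below reproduces it; the function and argument names are chosen for this example only.

```python
def accuracy(nd_correct, nd_wrong, d_wrong, d_correct):
    """Share of all flights whose class (ND or D) was predicted correctly."""
    total = nd_correct + nd_wrong + d_wrong + d_correct
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return (nd_correct + d_correct) / total


# Counts reported for the logistic regression model.
logistic_accuracy = accuracy(nd_correct=603, nd_wrong=130, d_wrong=163, d_correct=801)
logistic_accuracy_percent = round(logistic_accuracy * 100, 2)
```

With these counts, 1,404 of 1,697 flights are classified correctly.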

b) Decision tree

This is the result of the decision tree model

The picture above shows the decision tree model on which the predictions are based. All the input data are passed through the tree to determine whether the arrival is predicted as delay or not delay.

This is the performance of the decision tree model.

From the table above, 636 of the true not-delayed flights are predicted correctly, whereas 97 are not. Of the true delayed flights, 228 are not predicted correctly, whereas 736 are. The accuracy of the decision tree model is 80.85%.

c) Random forest

This is one of the results of the random forest model

With the random forest, the data are used to build many trees that each predict the arrival delay of the flight. The trees are built using the random subspace method. The picture above shows one of the trees produced by the random forest model.

This is the performance of the random forest model.

From the table above, 586 of the true not-delayed flights are predicted correctly, whereas 147 are not. Of the true delayed flights, 141 are not predicted correctly, whereas 823 are. The accuracy of the random forest model is 83.03%.

Comparison between classification models

This is the comparison table for the 3 classification models

From the table above, all three classification models perform well on this data set, with accuracy above 80%. This suggests that the data set is suitable for classification models. Decision tree performs worst of the three, as it has the lowest accuracy (80.85%). Based on the accuracies reported above, random forest has the highest accuracy (83.03%), slightly ahead of logistic regression (82.73%).
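
Accuracy alone does not show how each class is handled, so it can help to look at the per-class rates as well. The sketch below computes, for each model, the share of true delayed and true not-delayed flights that were predicted correctly. It assumes that the second pair of counts in each performance table refers to the true delay class; the dictionary layout and names are made up for this example.

```python
# (nd_correct, nd_wrong, d_wrong, d_correct) as reported for each model.
confusion_counts = {
    "Logistic regression": (603, 130, 163, 801),
    "Decision tree": (636, 97, 228, 736),
    "Random forest": (586, 147, 141, 823),
}


def per_class_recall(nd_correct, nd_wrong, d_wrong, d_correct):
    """Return (recall for not delay, recall for delay)."""
    recall_nd = nd_correct / (nd_correct + nd_wrong)
    recall_d = d_correct / (d_correct + d_wrong)
    return recall_nd, recall_d


recalls = {model: per_class_recall(*counts)
           for model, counts in confusion_counts.items()}

for model, (recall_nd, recall_d) in recalls.items():
    print(f"{model}: ND {recall_nd:.2%}, D {recall_d:.2%}")
```

Under that assumption, a model can be better at catching delayed flights while being worse at recognising flights that are on time, which a single accuracy figure hides.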

2) Clustering model

a) K-means

Different values of k are used in the k-means model.

This is the result of using different values of k for the k-means model

The average Davies-Bouldin index for the k-means model across the different values of k is 0.421. A lower Davies-Bouldin index indicates more compact and better-separated clusters; this value is in an acceptable range, which means that the clustering model is suitable for this dataset. From the table, the average within-centroid distance decreases as the value of k increases. The Davies-Bouldin index is 0.495 when k = 2. It rises to 0.497 when k = 3, which is the highest value among all values of k, so the clusters are least well separated when k = 3. The index then decreases from 0.497 to 0.343 as k increases from 3 to 7. It rises again to 0.373 when k = 8, to 0.375 when k = 9 and to 0.396 when k = 10. From this table, the lowest Davies-Bouldin index occurs when k = 7.
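
For readers unfamiliar with the index, here is a minimal sketch of the standard Davies-Bouldin definition: for each cluster, take the worst ratio of combined within-cluster scatter to the distance between centroids, then average over clusters. It uses plain Euclidean distance and tiny made-up points; RapidMiner's own computation may differ in details such as the distance measure, so treat this as an illustration only.

```python
import math


def centroid(points):
    """Component-wise mean of a list of points (tuples of equal length)."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def davies_bouldin(clusters):
    """Davies-Bouldin index for a list of clusters; lower is better."""
    k = len(clusters)
    if k < 2:
        raise ValueError("need at least two clusters")
    centroids = [centroid(points) for points in clusters]
    # Scatter: average distance from each point to its own centroid.
    scatters = [sum(math.dist(p, c) for p in points) / len(points)
                for points, c in zip(clusters, centroids)]
    total = 0.0
    for i in range(k):
        total += max((scatters[i] + scatters[j]) / math.dist(centroids[i], centroids[j])
                     for j in range(k) if j != i)
    return total / k


# Made-up 2-D points: the same two centroids, with loose and tight clusters.
loose = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
tight = [[(0, 0.5), (0, 1.5)], [(10, 0.5), (10, 1.5)]]
db_loose = davies_bouldin(loose)
db_tight = davies_bouldin(tight)
```

Tighter clusters with the same centroids give a lower index, which is why the k with the lowest value is preferred.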

This is the heat map of k-means when k = 7

A heat map is a graphical way of displaying a table of numbers by using colors to represent numerical values. The clustering algorithm groups related rows and/or columns together by similarity. In the heat map, for cluster 6, “arrival airport – AKL” and “departure airport - AKL” are at the top of the scale, whereas “Latest_tail_nr – 9MMTL” is in the middle range. All the other attributes are at the bottom of the scale.

This is the result of k-means when k = 7

When k = 7 is used, the data are separated into 7 groups. Cluster 0 has 2,917 items with an average distance of 27692.728; Cluster 1 has 64 items with an average distance of 59512.005; Cluster 2 has 296 items with an average distance of 186448.756; Cluster 3 has 619 items with an average distance of 96622.722; Cluster 4 has 227 items with an average distance of 338746.766; Cluster 5 has 1,505 items with an average distance of 96528.736; Cluster 6 has 29 items with an average distance of 825.180. The distance is measured using squared Euclidean distance.

This is the overall performance vector for k-means when k = 7

The average within-cluster distance for k-means when k = 7 is 74559.298 and the Davies-Bouldin index is 0.343.
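
One way to see how the overall figure relates to the per-cluster figures: the size-weighted mean of the seven per-cluster average distances reproduces the overall average distance. The sketch below does that arithmetic with the numbers reported above; the variable names are chosen for this example.

```python
# (cluster size, average distance) for clusters 0 to 6 when k = 7.
clusters = [
    (2917, 27692.728),
    (64, 59512.005),
    (296, 186448.756),
    (619, 96622.722),
    (227, 338746.766),
    (1505, 96528.736),
    (29, 825.180),
]

total_items = sum(size for size, _ in clusters)
# Weight each cluster's average distance by the number of items in it.
overall_average = sum(size * distance for size, distance in clusters) / total_items

print(total_items, round(overall_average, 3))
```

Large clusters dominate the overall average, which is why the small but very compact cluster 6 has almost no effect on it.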

b) X-means

This is one example of the performance of X-means

The number of clusters selected by the X-means algorithm is 3. Cluster 0 has 9,702 items with an average distance of 558393.337; Cluster 1 has 186 items with an average distance of 883315.303; Cluster 2 has 1,426 items with an average distance of 1920403.726. The distance is measured using Euclidean distance.

This is one example of the performance of X-means.

The average within-centroid distance is 735400.788. The Davies-Bouldin index for X-means clustering is 0.454. This means that this clustering model is acceptable for this dataset.

Comparison between clustering models

This is the table of comparison between the 2 clustering models

From the table above, both k-means (average) and X-means have a Davies-Bouldin index lower than 0.5. This means that both clustering models are suitable for this dataset. We can conclude that k-means is more suitable than X-means for this dataset because k-means has the lower Davies-Bouldin index, 0.421 compared with 0.454.

This is the result of using different values of k for the k-means model

From the table above, the Davies-Bouldin index is lowest when the value of k is 7. This means that the k-means model produces the best-separated clusters when k is 7.

3) Clustering-based classification enhancement

a) Random Forest (Clustered)

This is the result of random forest (Clustered)

This is the prediction table obtained after applying the data model. In row 1 of the table above, the predicted arrival status matches the actual arrival status, which means that the prediction made by this data model is correct. The prediction is based on the cluster result obtained after applying the k-means model. Row 1 belongs to cluster_0. Its confidence for ND is 0.554, whereas its confidence for D is 0.446.

This is one of the results of random forest (Clustered)

This is the performance of random forest after applying the k-means clustering model. The accuracy is 81.73%.
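
The link between the two confidence values and the predicted class can be sketched in a couple of lines: the predicted label is the class with the higher confidence. The function name and the dictionary layout are assumptions made for this example; the confidence values are those of row 1.

```python
def predicted_label(confidences):
    """Return the class with the highest confidence."""
    return max(confidences, key=confidences.get)


# Confidence values reported for row 1 (ND = not delay, D = delay).
row_1 = {"ND": 0.554, "D": 0.446}
row_1_prediction = predicted_label(row_1)
```

For row 1 the two confidences are fairly close, so the prediction of ND is correct but not made with a large margin.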

b) Decision Tree (Clustered)

This is the result of decision tree (Clustered)

This is the result of the decision tree (Clustered)

This is the performance of decision tree after applying the k-means clustering model. The accuracy is 81.32%.

c) Logistic Regression (clustered)

This is the result of logistic regression (Clustered)

This is the result of logistic regression

This is the performance of logistic regression after applying the k-means clustering model. The accuracy is 87.15%.

Comparison between clustering-based classification enhancement models

This is the comparison table for the 3 clustering-based classification enhancement models

From the table above, all three clustering-based classification enhancement models perform well on this data set, with accuracy above 80%. This suggests that the data set is suitable for clustering-based classification enhancement. Decision Tree (Clustered) performs worst of the three, as it has the lowest accuracy (81.32%). Logistic Regression (Clustered) performs best, as it has the highest accuracy (87.15%).

Comparison of clustering-based classification enhancement and standard classification

This is the comparison table for the 3 clustering-based classification enhancement models and the 3 standard classification models

From the table above, decision tree and logistic regression achieve higher accuracy when they are used with clustering-based classification enhancement. Applying the k-means model first increases the accuracy of the decision tree model from 80.85% to 81.32%, and the accuracy of logistic regression from 82.73% to 87.15%. However, the accuracy of random forest decreases from 83.03% to 81.73%. This means that clustering-based classification enhancement improves the accuracy of two of the three standard classification models, but not of random forest.
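
The size of each change can be read off with a short calculation. The sketch below takes the accuracies quoted above and reports the difference in percentage points for each model; the dictionary and function names are made up for the example.

```python
# Accuracy in percent, before and after applying k-means first.
standard = {"Decision tree": 80.85, "Logistic regression": 82.73, "Random forest": 83.03}
clustered = {"Decision tree": 81.32, "Logistic regression": 87.15, "Random forest": 81.73}


def accuracy_change(before, after):
    """Change in percentage points for each model (positive = improvement)."""
    return {model: round(after[model] - before[model], 2) for model in before}


changes = accuracy_change(standard, clustered)
improved = [model for model, delta in changes.items() if delta > 0]
```

The differences come out at +0.47 for decision tree, +4.42 for logistic regression and -1.30 for random forest.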

Conclusion

From the comparison above, we can interpret all the findings and choose the best model by selecting the one with the highest accuracy. The selected model is logistic regression with clustering-based classification enhancement, that is, logistic regression applied after the k-means model with k = 7.

The selected model is transformed into a data simulator that enables a stakeholder or user to input data and predict the arrival delay of a flight. The model is transformed using a RapidMiner feature called “Data Simulator”.

This is the example data for the logistic regression (clustered) model

This is the result of the prediction based on the data input

The prediction updates in real time whenever the input data are changed. The input data are applied to the data model to obtain the prediction. In the data simulator, the prediction is shown as a graph with the percentage for each outcome. For this data input, the predicted arrival status is not delay, and the confidence for ND is 84.94%. The most important factor in the model's decision is Departure delay. The other factors that support the prediction are Distance, SVC_TYPE, Arrival Airport and TOT_PAX_CT. The factors that contradict the prediction are TAXI_OUT_TM and TAXI_IN_TM.

The simulator can then be developed into a web or mobile application that lets users and stakeholders predict the arrival delay of an aircraft. By using these predictions, some of the losses for both airline companies and customers can be minimized.

Objective 2 – Prediction of duration of departure delay based on delay type

a) Linear regression

The models are built based on the departure delay type. Different attributes are chosen based on the delay type of the flight.

From the table above, for the linear regression model the average root mean squared error across all delay types is 14.079 and the average squared error is 481.342. This means that the model is suitable to apply to this dataset. 7 out of 10 of the delay types have a root mean squared error lower than 10, which means that the model fits these delay types well. The damage/failure delay type has the lowest root mean squared error, 1.930, and the miscellaneous type has the highest, 53.853. The miscellaneous type covers delay time caused by any other reason, so it has a higher root mean squared error than the others. The delay time with no type is calculated by “(LATEST_DEP_DT - SCH_DEP_DT) - Delay Time(internal) - Delay time(technical) - Delay time(passenger) - Delay time(miscellaneous) - Delay time(handling) - Delay time(operation) - Delay time(damage/failure)”. In other words, this is delay time that is not attributed to any stated delay type; it may be due to, for example, the speed of the flight.
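
The calculation of the delay time with no type can be sketched as below: the total departure delay minus every delay that has a stated type. The sketch assumes that all delay times are in minutes (the report does not state the unit), and the function name, timestamps and sample values are hypothetical.

```python
from datetime import datetime


def untyped_delay_minutes(sch_dep_dt, latest_dep_dt, typed_delays):
    """Total departure delay minus all delays that have a stated type."""
    total_minutes = (latest_dep_dt - sch_dep_dt).total_seconds() / 60.0
    return total_minutes - sum(typed_delays.values())


# Hypothetical flight: 45 minutes late in total, 30 of which have a stated type.
typed_delays = {
    "internal": 0.0,
    "technical": 20.0,
    "passenger": 10.0,
    "miscellaneous": 0.0,
    "handling": 0.0,
    "operation": 0.0,
    "damage/failure": 0.0,
}
no_type_delay = untyped_delay_minutes(
    sch_dep_dt=datetime(2019, 1, 1, 10, 0),
    latest_dep_dt=datetime(2019, 1, 1, 10, 45),
    typed_delays=typed_delays,
)
```

In this made-up case, 15 minutes of the delay remain unexplained by any stated delay type.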

This is the result of linear regression when the delay type is damage/failure

In linear regression, a positive coefficient increases the predicted delay time and a negative coefficient decreases it. An estimated coefficient near 0 implies that the effect of the predictor is small. Here, Flight period = PM lowers the predicted delay time. The standard error is 0.098. The tolerance measures how much of a predictor's variation is not explained by the other independent variables, so a low tolerance indicates collinearity. Flight period = PM has only a small impact on the prediction. The t-statistic is the ratio of the departure of the estimated value of a parameter from its hypothesized value to its standard error. The p-value helps you decide whether there is evidence of a relationship between two variables. The smaller the p-value, the stronger the evidence that the relationship exists.

This is the performance of linear regression when the delay type is damage/failure

The prediction performs best for the damage/failure delay type. The root mean squared error is 1.930 and the squared error is 3.725.
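
The two figures are linked: the root mean squared error is the square root of the mean squared error. The sketch below shows the calculation on made-up values and then checks the reported pair, assuming that the "squared error" in the performance table is the mean squared error.

```python
import math


def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Square root of the mean squared error, in the same unit as the target."""
    return math.sqrt(mean_squared_error(actual, predicted))


# Made-up delay times, only to show the calculation.
example_rmse = root_mean_squared_error([0.0, 2.0, 5.0], [1.0, 2.0, 3.0])

# Consistency check on the reported figures for damage/failure.
rmse_from_reported_squared_error = math.sqrt(3.725)
```

The square root of 3.725 is about 1.930, which matches the reported root mean squared error.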

b) Support vector machines (SVM)

Support vector machines are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis. Given a set of training examples, each marked as belonging to one or the other of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier.

The models are built based on the departure delay type. Different attributes are chosen based on the delay type of the flight.

From the table above, for the SVM model the average root mean squared error across all delay types is 13.502 and the average squared error is 427.515. This means that the model is suitable to apply to this dataset. 6 out of 10 of the delay types have a root mean squared error lower than 10, which means that the model fits these delay types well. The damage/failure delay type has the lowest root mean squared error, 0.860, and the miscellaneous type has the highest, 50.384. The miscellaneous type covers delay time caused by any other reason, so it has a higher root mean squared error than the others. The delay time with no type is calculated by “(LATEST_DEP_DT - SCH_DEP_DT) - Delay Time(internal) - Delay time(technical) - Delay time(passenger) - Delay time(miscellaneous) - Delay time(handling) - Delay time(operation) - Delay time(damage/failure)”. In other words, this is delay time that is not attributed to any stated delay type; it may be due to, for example, the speed of the flight.

This is the result of SVM when the delay type is damage/failure

The total number of support vectors is 3960. The weight for the PM flight period is 0.029, whereas the weight for the AM flight period is 0.059.

This is the performance of SVM when the delay type is damage/failure

The prediction performs best for the damage/failure delay type. The root mean squared error is 0.860 and the squared error is 0.739.

c) Neural Network

Neural networks are computing systems vaguely inspired by the biological neural networks that constitute animal brains. The neural network itself is not an algorithm, but rather a framework for many different machine learning algorithms to work together and process complex data inputs.

The models are built based on the departure delay type. Different attributes are chosen based on the delay type of the flight.

From the table above, for the neural network model the average root mean squared error across all delay types is 12.905 and the average squared error is 403.267. This means that the model is suitable to apply to this dataset. 7 out of 10 of the delay types have a root mean squared error lower than 10, which means that the model fits these delay types well. The damage/failure delay type has the lowest root mean squared error, 1.004, and the miscellaneous type has the highest, 48.492. The miscellaneous type covers delay time caused by any other reason, so it has a higher root mean squared error than the others. The delay time with no type is calculated by “(LATEST_DEP_DT - SCH_DEP_DT) - Delay Time(internal) - Delay time(technical) - Delay time(passenger) - Delay time(miscellaneous) - Delay time(handling) - Delay time(operation) - Delay time(damage/failure)”. In other words, this is delay time that is not attributed to any stated delay type; it may be due to, for example, the speed of the flight.

This is part of the result of neural network when the delay type is damage/failure

This is the performance of neural network when the delay type is damage/failure

The prediction performs best for the damage/failure delay type. The root mean squared error is 1.004 and the squared error is 1.009.

Conclusion

From the table above, all three models have a root mean squared error lower than 15. This means that all three models are suitable for predicting the duration of the delay time based on delay type for this data set. Neural network has the lowest root mean squared error of all the models, 12.905, whereas linear regression has the highest, 14.079. This means that neural network is the most suitable of the three models.

The selected model is transformed into a data simulator that enables a stakeholder or user to input data and predict the delay time based on the delay type. The model is transformed using a RapidMiner feature called “Data Simulator”.

This is the example data for neural network

This is the result of the prediction of departure delay (damage/failure)

The prediction updates in real time whenever the input data are changed. The input data are applied to the data model to obtain the prediction. For this data input, the predicted departure delay (damage/failure) is 1.045. The most important factor in the model's decision is flight week. The other factor that supports the prediction is Departure Airport = HKG. The factors that contradict the prediction are TAXI_OUT_TM, TAXI_IN_TM, Flight date, TOT_PAX_CT and Flight Period = PM.

This simulator covers only one of the delay types, damage/failure. To predict the departure delay for a flight, a model has to be applied for each delay type to obtain the delay time for that type. The delay times for all the delay types are then summed to calculate the departure delay for the flight.

A disadvantage of the data model is that too little data is available for each delay type to predict its delay time reliably. For example, for the passenger delay time only 13 entries are greater than 0 and all the others are 0. This can decrease the accuracy and the performance of the data model. With more data, the data model would likely be more accurate.
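
The summing step can be sketched as follows. Only the damage/failure value (1.045) comes from the simulator result above; the other predicted values, the dictionary keys and the function name are hypothetical placeholders for the output of the per-type models.

```python
def total_departure_delay(predicted_by_type):
    """Sum the predicted delay times of all delay types for one flight."""
    return sum(predicted_by_type.values())


# Predicted delay time per delay type for one flight.
# Only "damage/failure" is taken from the report; the rest are made up.
predicted_by_type = {
    "damage/failure": 1.045,
    "technical": 3.2,
    "passenger": 0.0,
    "handling": 2.5,
}
flight_delay = total_departure_delay(predicted_by_type)
```

Because the total is a sum of separate predictions, the error of each per-type model also carries into the final departure delay.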
