RapidMiner Studio was used to preprocess the PFBI sample dataset. The data cleaning steps were as follows:
Rows with a scheduled flying time (SCH_FLY_TM) of less than 19 were filtered out because they are not relevant.
The 588 missing values in the total passenger count (TOT_PAX_CT) column were replaced with the column average, which is 97.
A few columns were binned. Because scheduled flying time (SCH_FLY_TM) takes many different values, it was binned into 3 groups and renamed FLY_DURATION.
Service type (SVC_TYPE) initially consisted of 8 different values, which were then grouped into 3 categories.
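The cleaning steps above were carried out in RapidMiner Studio. For readers who prefer code, the same logic can be sketched roughly in Python with pandas. This is only an illustration under assumptions: the function name clean_flights and the svc_map argument are hypothetical, the mapping from the 8 service types to 3 categories is not given here and must be supplied by the caller, the bin edges (300 and 600 minutes) are taken from the FLY_DURATION coding in the attribute list, and the average is computed after filtering, which may differ from the original process.

```python
import numpy as np
import pandas as pd


def clean_flights(df: pd.DataFrame, svc_map: dict) -> pd.DataFrame:
    """Hypothetical pandas sketch of the cleaning steps (not the RapidMiner process)."""
    # Step 1: keep only rows with a scheduled flying time of at least 19.
    out = df[df["SCH_FLY_TM"] >= 19].copy()

    # Step 2: replace missing passenger counts with the rounded column average.
    # Assumption: the average is taken over the filtered rows.
    mean_pax = int(round(float(out["TOT_PAX_CT"].mean())))
    out["TOT_PAX_CT"] = out["TOT_PAX_CT"].fillna(mean_pax)

    # Step 3: bin flying time into 3 groups:
    # 1 = under 300 mins, 2 = 300 to 600 mins, 3 = over 600 mins.
    t = out["SCH_FLY_TM"]
    out["FLY_DURATION"] = np.select([t < 300, t <= 600], [1, 2], default=3)

    # Step 4: group service types into categories using a caller-supplied mapping.
    # Assumption: svc_map covers every SVC_TYPE value present in the data.
    out["SVC_CATEGORIES"] = out["SVC_TYPE"].map(svc_map)
    return out
```

A call such as clean_flights(raw, {"A": 1, "B": 2, "C": 3}) would return a new frame and leave raw unchanged. The service type codes in that mapping are placeholders, not the actual SVC_TYPE values.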
The cleaned dataset has 6047 rows and 8 attributes, which are:
LEQ_SEQ_NR: leg sequence number (1,2,3)
ACF_VERSION: aircraft version (38, 48, 72, ...)
DELAY: 1 - delayed; 0 - on time
SVC_CATEGORIES: 1 - passenger service; 2 - freighter/cargo; 3 - others
ASIA: 1 - Southeast Asia; 0 - non-Southeast Asia
BUSINESS_CLASS: 1 - with business class passengers; 0 - without business class passengers
FLY_DURATION: 1 - <300 mins; 2 - 300 to 600 mins; 3 - >600 mins
TOTAL_PASSENGER: passenger count (5 groups)
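
The codings listed above can be checked after export. The following is a minimal sketch in Python with pandas, assuming the cleaned data is loaded into a DataFrame with the attribute names above. The function name check_codes and the ALLOWED table are hypothetical, and since the actual codes of the 5 TOTAL_PASSENGER groups are not stated, only the number of distinct groups is checked.

```python
import pandas as pd

# Allowed codes per attribute, taken from the attribute list above.
ALLOWED = {
    "DELAY": {0, 1},
    "SVC_CATEGORIES": {1, 2, 3},
    "ASIA": {0, 1},
    "BUSINESS_CLASS": {0, 1},
    "FLY_DURATION": {1, 2, 3},
}


def check_codes(df: pd.DataFrame) -> list:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    for col, allowed in ALLOWED.items():
        if col not in df.columns:
            problems.append(f"{col}: column missing")
            continue
        # Convert to native Python values so set arithmetic and printing are clean.
        seen = set(df[col].dropna().unique().tolist())
        unexpected = seen - allowed
        if unexpected:
            problems.append(f"{col}: unexpected values {sorted(unexpected, key=str)}")
    # Assumption: TOTAL_PASSENGER should hold at most 5 distinct group codes.
    if "TOTAL_PASSENGER" in df.columns and df["TOTAL_PASSENGER"].nunique() > 5:
        problems.append("TOTAL_PASSENGER: more than 5 groups")
    return problems
```

Running check_codes on the cleaned dataset and confirming that df.shape equals (6047, 8) would give a quick sanity check that the export matches the description.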