Fast DT version 3.5.7

Code specification

  • Works on all dimensions of the feature vectors simultaneously; the features need not be independent.
  • Reduced from three layers to two for speed.
  • The DT parameters are learned on the fly with the EM algorithm, so the model does not have to be trained on a separate sample. This makes this version of DT fully unsupervised.
  • The experiment ID can be either text or a number.

The two modes are

Single dataset mode

Run demo_input_interface.m and inspect the result. The script first adds the paths of several toolboxes that DT needs, for instance variational Bayes GMM (VBEMGMM), the visualization toolbox (toolbox_visualization) and hierarchical clustering.

Users can tune the clustering parameters in script_DT_option.m, which defines the important parameters for DT; the details are given in the comments of that file. The script supplies the input parameters to fn_input_interface.m, which serves as the interface between the user parameters and the core DT engine.

fn_input_interface.m gathers the pre-processing, the main DT engine and the post-processing in one place. Its sub-processes are listed below.

  1. First, the 10-column data matrix is converted into the 29-column format d, which is stored in the file
       inputData_exp_[experiment_ID].mat
     The format is inherited from the previous version of the DT code and need not be followed strictly in future versions. All data produced by the code are stored in the folder
       all_data_exp_[experiment_ID]
     which the program creates automatically.
  2. Next, the data matrix d is split into two parts according to its 15th column: the rows with d(:,15) = 1 and d(:,15) = 2 are stored in the files
       Dataset1_exp_[experiment_ID].mat
     and
       Dataset2_exp_[experiment_ID].mat
     which hold the training and test datasets respectively. In the current version of the code the training dataset is not required, because the EM algorithm makes DT completely unsupervised. It is therefore recommended to set d(:,15) to 2, i.e. the test set.
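The split in step 2 can be sketched as follows. This is a minimal illustration, not the code in fn_input_interface.m: the toy matrix, the experiment ID 'demo01', the variable names train_set and test_set stored inside the .mat files, and the use of tempdir as the parent folder are all assumptions made for the example.

```matlab
% Toy 29-column matrix (made up); only column 15 matters for the split.
d = zeros(5, 29);
d(:, 15) = [1; 2; 2; 1; 2];

experiment_ID = 'demo01';            % hypothetical ID; may also be numeric
id_str = num2str(experiment_ID);     % leaves text unchanged, converts numbers

% Logical indexing on the 15th column: 1 --> training, 2 --> testing.
train_set = d(d(:, 15) == 1, :);
test_set  = d(d(:, 15) == 2, :);

% Folder named as in step 1; its location (tempdir) is an assumption.
out_dir = fullfile(tempdir, ['all_data_exp_' id_str]);
if ~exist(out_dir, 'dir')
    mkdir(out_dir);
end

% File names follow the pattern given in step 2.
file1 = fullfile(out_dir, ['Dataset1_exp_' id_str '.mat']);
file2 = fullfile(out_dir, ['Dataset2_exp_' id_str '.mat']);
save(file1, 'train_set');
save(file2, 'test_set');
```

With a numeric ID such as experiment_ID = 7, num2str yields '7' and the same file-name pattern applies.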
  3. Next, the test dataset is under-segmented using geographic locations only, by the function
       fn_dataGeoClustering
     whose core is implemented with the agglomerative clustering in MATLAB's toolbox. Users can adjust the geo-clustering behaviour through the parameters in option_geogroup:
       option_geogroup.x = 7;
       option_geogroup.y = 8;
     Only the x-y location of the samples (and no other feature) is used in the geo-clustering. Here, columns 7 and 8 of the matrix d hold the x- and y-coordinates of the samples.
       option_geogroup.cutoff_distance = 25;
     The cut-off distance is the most important parameter of this function, as it sets the approximate distance between the clusters produced by the geo-clustering algorithm. More precisely, no two points from different geo-groups have a pairwise distance smaller than the cut-off distance; in other words, the distance between any two points from different geo-groups is always at least the cut-off distance.
       option_geogroup.groundtruth_provided = 'yes';
     Set to 'yes' when the ground truth of the target is available.
       option_geogroup.data_purpose = 2; % 2 --> testing, 1 --> training
     This number selects the dataset to be segmented by geo-clustering.
     All information about the geo-groups, for instance how many there are and what their IDs are, is stored in the file
       GeogroupIndex_Dataset2_exp_[experiment_ID]_comb_2.mat
     After segmentation, the data is divided into several geo-groups, each stored in its own file, for example
       Dataset2_exp_[experiment_ID]_Geogroup[geo-group_ID].mat
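The guarantee stated for the cut-off distance is the one provided by single-linkage agglomerative clustering cut at that distance. The source does not say which linkage fn_dataGeoClustering uses, so the sketch below assumes single linkage. It relies on linkage, cluster, pdist and squareform from the Statistics and Machine Learning Toolbox, and the coordinates are made up.

```matlab
% Toy data in the 29-column layout: column 7 = x, column 8 = y (values made up).
d = zeros(6, 29);
d(:, 7) = [0; 5; 10; 100; 104; 300];
d(:, 8) = [0; 0;  0;   0;   3;   0];

option_geogroup.x = 7;
option_geogroup.y = 8;
option_geogroup.cutoff_distance = 25;

% Only the x-y location is used for geo-clustering.
xy = d(:, [option_geogroup.x, option_geogroup.y]);

% Assumption: single linkage. Build the tree, then cut it at the cut-off distance.
Z = linkage(xy, 'single');
geo_id = cluster(Z, 'cutoff', option_geogroup.cutoff_distance, ...
                 'criterion', 'distance');

% Check the guarantee: smallest distance between points of different geo-groups.
D = squareform(pdist(xy));
different_group = bsxfun(@ne, geo_id, geo_id');
min_gap = min(D(different_group));

fprintf('%d geo-groups, smallest inter-group distance %.2f\n', ...
        numel(unique(geo_id)), min_gap);
```

Raising cutoff_distance merges nearby geo-groups into fewer, larger ones; lowering it splits the data more finely.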
  4. Next, a good initial solution is computed for each geo-group using a variational Bayesian Gaussian mixture model (VBGMM), which determines the number of clusters (root nodes) automatically, to some extent, through user-defined prior distribution parameters. For details, refer to the treatment of VBGMM in Bishop's text.
       PriorPar.D = 2; % only x-y location
       PriorPar.alpha = 0.0001; % the larger alpha, the more clusters result
       PriorPar.mu = zeros(PriorPar.D,1);
       PriorPar.beta = 1;
       PriorPar.W = 10*eye(PriorPar.D); % the larger the coefficient, the more clusters result
       PriorPar.v = 20;
     There are some further parameters that users need to set:
       option_VBGMM.max_cluster = 5; % maximum number of clusters allowed in VBGMM
       option_VBGMM.maxIter = 200; % maximum number of iterations for VBGMM
       option_VBGMM.threshold = 1e-5; % stopping criterion for VBGMM
       option_VBGMM.displayFig = 0; % 1 --> show a 2D plot at each iteration, 0 --> otherwise
       option_VBGMM.displayIter = 0; % 1 --> show which iteration is running, 0 --> otherwise
  5. The initial cluster/partition is passed to the main DT engine fn_EM_DT, whose main components are EM-GMM, a structure perturbation strategy and a simulated annealing (SA) algorithm. Users can set the following display options:
       option_DT.display = 0; % 1 --> display the learning curve
       option_DT.display_final = 0; % 1 --> display the final result
       option_DT.display_T = 1; % display the temperature on the learning curve
       option_DT.pause = 0.01; % seconds paused after each plot
       option_DT.dy_plot = 600; % printing offset for the temperature
     The function returns the following variables:
       • xy_posterior: the posterior distribution of the optimal centroid of each cluster
       • best_gmm_obj: the GMM object at the best solution
       • x_best: the column vector of labels assigned by DT to each sample
       • storage_best: the best parameters of the SA optimization routine
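To make the role of the temperature concrete, here is a generic simulated annealing loop over cluster labels. It is a textbook-style sketch and not the algorithm inside fn_EM_DT: the toy data, the within-cluster sum of squares used as the cost, the single-label flip used as the perturbation, and the geometric cooling schedule are all assumptions chosen for brevity.

```matlab
% Toy data: two well-separated 2-D blobs (made up for illustration).
X = [0 0; 0 1; 1 0; 1 1; 10 10; 10 11; 11 10; 11 11];
K = 2;                                   % number of clusters, fixed in this sketch
labels = [1; 2; 1; 2; 1; 2; 1; 2];       % deliberately poor initial partition

% Cost: within-cluster sum of squared distances (an empty cluster costs 0).
wcss_k = @(P) sum(sum(bsxfun(@minus, P, mean(P, 1)).^2));
cost   = @(lab) sum(arrayfun(@(k) wcss_k(X(lab == k, :)), 1:K));

T = 10; T_min = 1e-3; cooling = 0.95;    % hypothetical annealing schedule
cur = labels;  cur_cost = cost(cur);
init_cost = cur_cost;
best = cur;    best_cost = cur_cost;

while T > T_min
    % Perturbation: give one randomly chosen sample a random label.
    cand = cur;
    cand(randi(numel(cand))) = randi(K);
    cand_cost = cost(cand);

    % Metropolis rule: always accept improvements; accept a worse
    % candidate with probability exp(-delta / T).
    delta = cand_cost - cur_cost;
    if delta <= 0 || rand < exp(-delta / T)
        cur = cand;  cur_cost = cand_cost;
    end

    % Remember the best solution seen so far.
    if cur_cost < best_cost
        best = cur;  best_cost = cur_cost;
    end

    T = cooling * T;                     % cool down
end
```

At high temperature the loop accepts many worsening moves and can escape poor partitions; as T falls it behaves more and more like a greedy search.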
  6. The confusion matrix and the adjusted Rand index are calculated for the results.
  7. The final result is stored in the file
       Dataset2_exp_[experiment_ID]_Geogroup[geo-group_ID]_DT_result.mat
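The evaluation in step 6 can be sketched with core MATLAB functions only. This is a minimal illustration of the standard confusion (contingency) matrix and adjusted Rand index formulas, not the evaluation routine shipped with DT; the two label vectors are made up, with pred standing in for a label vector such as x_best.

```matlab
% Made-up ground-truth and predicted labels for eight samples.
truth = [1; 1; 1; 2; 2; 2; 3; 3];
pred  = [1; 1; 2; 2; 2; 2; 3; 3];

% Map arbitrary label values to 1..K indices.
[~, ~, ti] = unique(truth);
[~, ~, pj] = unique(pred);

% Confusion matrix: rows = ground-truth classes, columns = predicted clusters.
C = accumarray([ti(:), pj(:)], 1);

% Adjusted Rand index from pair counts.
n      = numel(truth);
comb2  = @(x) x .* (x - 1) / 2;          % number of pairs, "x choose 2"
sum_ij = sum(comb2(C(:)));               % pairs placed together in both partitions
sum_a  = sum(comb2(sum(C, 2)));          % pairs placed together in truth
sum_b  = sum(comb2(sum(C, 1)));          % pairs placed together in pred
expected  = sum_a * sum_b / comb2(n);    % agreement expected by chance
max_index = (sum_a + sum_b) / 2;
ari = (sum_ij - expected) / (max_index - expected);

disp(C);
fprintf('Adjusted Rand index: %.4f\n', ari);
```

An index of 1 means the two partitions agree up to a relabeling, and values near 0 are what random labeling would give. The denominator is zero for degenerate partitions (for example, when both put all samples in one cluster), a case this sketch does not handle.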