Code documentation DFBN version 3.1.4

The code can be used with large data sets, as the heuristic structure updating ("directed perturbation") has been implemented in the code. However, I admit that the code is not convenient to use with data of varying dimensionality: I have to manually correct the parameters in the code every time the dimensionality of the data changes. In this document I will show where in the code we have to make changes to accommodate a different dimensionality. Suppose that we originally have 3 features, each of which is 2-dimensional, and we want to add one more dimension to the 3rd feature so that it becomes 3D. We then do the following.

  1. First, in the file input_interface.m, which is used for the 'one-click' mode of the DFBN, we have to manually add a 10th column to the input matrix:
        data = [...
        1 1 1 10.497 25.005 4.0606 3.3294 20.164 18.722 4.76464922181493
        3 1 1 14.189 21.795 4.5319 4.4582 20.66 19.841 5.77364700090192
        ......
        45 3 2 32.614 2.2473 23.813 -15.165 40.416 1.2933 4.71457580927461
        ];
     Then we have to place the added column in the 29-column internal format. For instance, here we put the 3rd dimension (the 10th input column) in the 6th column of the 29-column matrix (a quick sanity check on this mapping is sketched right after this step):
        d = zeros(size(data,1),29);
        d(:,1) = data(:,1);
        ...
        d(:,5) = data(:,9);
        d(:,6) = data(:,10); % 3rd dimension of feature 3
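     Because this column bookkeeping is easy to get wrong, a quick sanity check right after building d may help. This is a minimal sketch, not part of the shipped code; the column counts are the ones used in this example:

        % guard against mis-counted columns (the numbers match this example)
        assert(size(data,2) == 10, 'expected 10 input columns after adding the 3rd dimension');
        assert(isequal(d(:,6), data(:,10)), 'column 6 of d should carry the new 3rd dimension');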
  2. Second, in parameterLearning.m, set the dimensionality of each feature in the cell array feature_list, which originally looks like
        feature_list = {[7 8];[2 3];[4 5]}; % feat#1, #2, #3; be careful about the order of the data
     After the 3rd dimension is added to the 3rd feature, this becomes
        feature_list = {[7 8];[2 3];[4 5 6]}; % feat#1, #2, #3; be careful about the order of the data
     Note that 6 is the column of the 29-column matrix where the additional dimension was placed (the dimensionalities can also be derived automatically; see the sketch below).
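     If you prefer not to track the dimensionalities by hand, they can be derived from feature_list itself. A minimal sketch (cellfun and numel are standard MATLAB; the variable dims is introduced here only for illustration):

        % per-feature dimensionality, derived instead of hard-coded
        dims = cellfun(@numel, feature_list);  % gives [2; 2; 3] after the change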
  3. Finally, we make the changes in main7_everygroup2.m (a loop-based alternative to the normalization block is sketched after this listing):
        % pick the feature columns to plot
        f1 = 7;
        f2 = 8;
        f3 = 2;
        f4 = 3;
        f5 = 4;
        f6 = 5;
        f7 = 6;
        order_column = 1;
        target_column = 10;
        plf_column = 11;
        fake_column = 16;
        % determine the feature vectors
        F = 3; % number of features
        feature_cell = cell(F,1);
        feature_cell{1,1} = [f1,f2]; % feature 1 (2D -- f1, f2)
        feature_cell{2,1} = [f3,f4]; % feature 2 (2D -- f3, f4)
        feature_cell{3,1} = [f5,f6,f7]; % feature 3 (3D -- f5, f6, f7)
        % determine the dimension of each feature
        D = cell(F,1);
        D{1,1} = size(feature_cell{1,1},2); % feature 1 is 2D
        D{2,1} = size(feature_cell{2,1},2); % feature 2 is 2D
        D{3,1} = size(feature_cell{3,1},2); % feature 3 is 3D
        load learnedParameters; % load the learned parameter CovMatrix
        % ===== Get the CPT for each feature and each platform =====
        Sigma_j = CovMatrix;
        Sigma_i = cell(F,1);
        Sigma_i{1,1} = [7.2665 2.2070; 2.2070 2.2119]; % user-defined parameter for feat#1
        Sigma_i{2,1} = [5.4858 4.2034; 4.2034 7.2827]; % user-defined parameter for feat#2
        % Sigma_i{3,1} = [3.5234 -0.6989; -0.6989 0.6549]; % user-defined parameter for feat#3 (2D)
        Sigma_i{3,1} = 3*eye(D{3,1}); % user-defined parameter for feat#3 (3D)
        % weight for each feature
        wf = zeros(F,1);
        wf(1) = 1/3;
        wf(2) = 1/3;
        wf(3) = 1/3;
        % normalize the features. Believe it or not, the dynamic range of a
        % feature affects the value of the loglik
        f1_min = min(d_test(:,f1),[],1);
        d_test(:,f1) = d_test(:,f1) - f1_min;
        f2_min = min(d_test(:,f2),[],1);
        d_test(:,f2) = d_test(:,f2) - f2_min;
        f3_min = min(d_test(:,f3),[],1);
        d_test(:,f3) = d_test(:,f3) - f3_min;
        f4_min = min(d_test(:,f4),[],1);
        d_test(:,f4) = d_test(:,f4) - f4_min;
        f5_min = min(d_test(:,f5),[],1);
        d_test(:,f5) = d_test(:,f5) - f5_min;
        f6_min = min(d_test(:,f6),[],1);
        d_test(:,f6) = d_test(:,f6) - f6_min;
        f7_min = min(d_test(:,f7),[],1);
        d_test(:,f7) = d_test(:,f7) - f7_min;
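     If the individual minima f1_min through f7_min are not referenced later in main7_everygroup2.m (an assumption worth verifying before editing), the copy-pasted normalization block above can be collapsed into a loop so that it survives future dimensionality changes. A sketch:

        % shift every feature column so its minimum is zero (same effect as above)
        feat_cols = [f1 f2 f3 f4 f5 f6 f7];
        for c = feat_cols
            d_test(:,c) = d_test(:,c) - min(d_test(:,c),[],1);
        end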
  4. Note that we don't have to change anything in the following files: preprocessData.m, dataGeoClustering.m, and plotDataWholeDataset.m.

Warnings:

  • In one geo-group, each platform should have at least max(D)+1 observations for stability. For instance, if platform#2 has dimensionalities 2, 2, and 3 for features 1, 2, and 3 respectively, then it should contain at least max(2, 2, 3) + 1 = 4 samples (a programmatic check is sketched after this list).
  • When nothing is known about the platform IDs, the safest approach is to assume that all samples come from one single platform. The trade-off is that convergence to a good solution takes longer, or, in the worst case, does not happen at all when the number of iterations is insufficient.
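Both warnings lend themselves to quick programmatic checks. Below is a minimal sketch that reuses d, D, and plf_column from main7_everygroup2.m; for brevity it counts samples per platform over the whole data set rather than per geo-group, and the commented-out last line shows the single-platform fallback:

    % sanity check: every platform should contribute at least max(D)+1 samples
    Dmax = max(cell2mat(D));        % largest feature dimensionality (3 in this example)
    pid  = d(:,plf_column);         % platform IDs
    for p = unique(pid).'
        n = sum(pid == p);
        if n < Dmax + 1
            warning('platform %d has only %d samples; at least %d are needed', p, n, Dmax + 1);
        end
    end
    % fallback when platform IDs are unknown: assume one single platform
    % d(:,plf_column) = 1;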

Extra Notes

  • Dataset1 --> training, and Dataset2 --> testing
  • In some extra settings, parameterLearning.m will combine the testing and training data to learn the parameters.
  • In some testing scenarios you might not have training data, but you can still benefit from parameterLearning.m if you have the ground-truth target IDs. In that case, the code uses the ground-truth labels as training data and gives you the covariance matrix Sigma_j. Unfortunately, it will not provide Sigma_i, which you have to come up with yourself (a simple fallback is sketched below).
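When the ground-truth labels yield Sigma_j but no Sigma_i, one pragmatic fallback is an isotropic covariance per feature, mirroring the 3*eye(...) pattern already used for feature 3 in main7_everygroup2.m. This is a sketch, and the scale factor is a placeholder rather than a learned value:

    % user-defined fallback Sigma_i when no training data is available
    % (F and D as defined in main7_everygroup2.m)
    Sigma_i = cell(F,1);
    for i = 1:F
        Sigma_i{i,1} = 3*eye(D{i,1});  % scale 3 is an assumption; tune per feature
    end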