Code documentation DFBN version 3.1.4

The code can be used with large data sets, as the heuristic structure updating ("directed perturbation") has been implemented in the code. However, I admit that the code is not convenient to use with data of varying dimensionality: I have to manually correct the parameters in the code every time the dimensionality of the data changes. In this document I will show where in the code we have to make changes to accommodate a different dimensionality. Suppose that we originally have 3 features, each of which is 2-dimensional, and we want to add one more dimension to the 3rd feature so that it becomes 3D. We then do the following.

  1. First, in the file input_interface.m, which is used for the 'one-click' mode of the DFBN, we have to manually add a 10th column to the input matrix:
        data = [...
        1 1 1 10.497 25.005 4.0606 3.3294 20.164 18.722 4.76464922181493
        3 1 1 14.189 21.795 4.5319 4.4582 20.66 19.841 5.77364700090192
        ......
        45 3 2 32.614 2.2473 23.813 -15.165 40.416 1.2933 4.71457580927461
        ];
     Then we have to place the added column in the 29-column internal format. For instance, here we put the 3rd dimension (the 10th input column) in the 6th column of the 29-column matrix (a quick sanity check on this mapping is sketched right after this step):
        d = zeros(size(data,1),29);
        d(:,1) = data(:,1);
        ...
        d(:,5) = data(:,9);
        d(:,6) = data(:,10); % 3rd dimension of feature 3
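     Because this column bookkeeping is easy to get wrong, a quick sanity check right after building d may help. This is a minimal sketch, not part of the shipped code; the column counts are the ones used in this example:

        % guard against mis-counted columns (the numbers match this example)
        assert(size(data,2) == 10, 'expected 10 input columns after adding the 3rd dimension');
        assert(isequal(d(:,6), data(:,10)), 'column 6 of d should carry the new 3rd dimension');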
  2. Second, in parameterLearning.m, set the dimensionality of each feature in the cell array feature_list, which originally looks like
        feature_list = {[7 8];[2 3];[4 5]}; % feat#1, #2, #3; be careful about the order of the data
     After the 3rd dimension is added to the 3rd feature, this becomes
        feature_list = {[7 8];[2 3];[4 5 6]}; % feat#1, #2, #3; be careful about the order of the data
     Note that 6 is the column of the 29-column matrix where the additional dimension was placed (the dimensionalities can also be derived automatically; see the sketch below).
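     If you prefer not to track the dimensionalities by hand, they can be derived from feature_list itself. A minimal sketch (cellfun and numel are standard MATLAB; the variable dims is introduced here only for illustration):

        % per-feature dimensionality, derived instead of hard-coded
        dims = cellfun(@numel, feature_list);  % gives [2; 2; 3] after the change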
  3. Finally, we make the changes in main7_everygroup2.m (a loop-based alternative to the normalization block is sketched after this listing):
        % pick the feature columns to plot
        f1 = 7;
        f2 = 8;
        f3 = 2;
        f4 = 3;
        f5 = 4;
        f6 = 5;
        f7 = 6;
        order_column = 1;
        target_column = 10;
        plf_column = 11;
        fake_column = 16;
        % determine the feature vectors
        F = 3; % number of features
        feature_cell = cell(F,1);
        feature_cell{1,1} = [f1,f2]; % feature 1 (2D -- f1, f2)
        feature_cell{2,1} = [f3,f4]; % feature 2 (2D -- f3, f4)
        feature_cell{3,1} = [f5,f6,f7]; % feature 3 (3D -- f5, f6, f7)
        % determine the dimension of each feature
        D = cell(F,1);
        D{1,1} = size(feature_cell{1,1},2); % feature 1 is 2D
        D{2,1} = size(feature_cell{2,1},2); % feature 2 is 2D
        D{3,1} = size(feature_cell{3,1},2); % feature 3 is 3D
        load learnedParameters; % load the learned parameter CovMatrix
        % ===== Get the CPT for each feature and each platform =====
        Sigma_j = CovMatrix;
        Sigma_i = cell(F,1);
        Sigma_i{1,1} = [7.2665 2.2070; 2.2070 2.2119]; % user-defined parameter for feat#1
        Sigma_i{2,1} = [5.4858 4.2034; 4.2034 7.2827]; % user-defined parameter for feat#2
        % Sigma_i{3,1} = [3.5234 -0.6989; -0.6989 0.6549]; % user-defined parameter for feat#3 (2D)
        Sigma_i{3,1} = 3*eye(D{3,1}); % user-defined parameter for feat#3 (3D)
        % weight for each feature
        wf = zeros(F,1);
        wf(1) = 1/3;
        wf(2) = 1/3;
        wf(3) = 1/3;
        % normalize the features. Believe it or not, the dynamic range of a
        % feature affects the value of the loglik
        f1_min = min(d_test(:,f1),[],1);
        d_test(:,f1) = d_test(:,f1) - f1_min;
        f2_min = min(d_test(:,f2),[],1);
        d_test(:,f2) = d_test(:,f2) - f2_min;
        f3_min = min(d_test(:,f3),[],1);
        d_test(:,f3) = d_test(:,f3) - f3_min;
        f4_min = min(d_test(:,f4),[],1);
        d_test(:,f4) = d_test(:,f4) - f4_min;
        f5_min = min(d_test(:,f5),[],1);
        d_test(:,f5) = d_test(:,f5) - f5_min;
        f6_min = min(d_test(:,f6),[],1);
        d_test(:,f6) = d_test(:,f6) - f6_min;
        f7_min = min(d_test(:,f7),[],1);
        d_test(:,f7) = d_test(:,f7) - f7_min;
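     If the individual minima f1_min through f7_min are not referenced later in main7_everygroup2.m (an assumption worth verifying before editing), the copy-pasted normalization block above can be collapsed into a loop so that it survives future dimensionality changes. A sketch:

        % shift every feature column so its minimum is zero (same effect as above)
        feat_cols = [f1 f2 f3 f4 f5 f6 f7];
        for c = feat_cols
            d_test(:,c) = d_test(:,c) - min(d_test(:,c),[],1);
        end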
  4. Note that we don't have to change anything in the following files: preprocessData.m, dataGeoClustering.m, and plotDataWholeDataset.m.

Warnings:

  • In one geo-group, each platform should have at least max(D)+1 observations for stability. For instance, if platform#2 has dimensionalities 2, 2, and 3 for features 1, 2, and 3 respectively, then it should contain at least max(2, 2, 3) + 1 = 4 samples (a programmatic check is sketched after this list).
  • When nothing is known about the platform IDs, the safest approach is to assume that all samples come from one single platform. The trade-off is that convergence to a good solution takes longer, or, in the worst case, does not happen at all when the number of iterations is insufficient.
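Both warnings lend themselves to quick programmatic checks. Below is a minimal sketch that reuses d, D, and plf_column from main7_everygroup2.m; for brevity it counts samples per platform over the whole data set rather than per geo-group, and the commented-out last line shows the single-platform fallback:

    % sanity check: every platform should contribute at least max(D)+1 samples
    Dmax = max(cell2mat(D));        % largest feature dimensionality (3 in this example)
    pid  = d(:,plf_column);         % platform IDs
    for p = unique(pid).'
        n = sum(pid == p);
        if n < Dmax + 1
            warning('platform %d has only %d samples; at least %d are needed', p, n, Dmax + 1);
        end
    end
    % fallback when platform IDs are unknown: assume one single platform
    % d(:,plf_column) = 1;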

Extra Notes

  • Dataset1 --> training, and Dataset2 --> testing
  • In some extra settings, parameterLearning.m will combine the testing and training data to learn the parameters.
  • In some testing scenarios you might not have training data, but you can still benefit from parameterLearning.m if you have the ground-truth target IDs. In that case, the code uses the ground-truth labels as training data and gives you the covariance matrix Sigma_j. Unfortunately, it will not provide Sigma_i, which you have to come up with yourself (a simple fallback is sketched below).
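When the ground-truth labels yield Sigma_j but no Sigma_i, one pragmatic fallback is an isotropic covariance per feature, mirroring the 3*eye(...) pattern already used for feature 3 in main7_everygroup2.m. This is a sketch, and the scale factor is a placeholder rather than a learned value:

    % user-defined fallback Sigma_i when no training data is available
    % (F and D as defined in main7_everygroup2.m)
    Sigma_i = cell(F,1);
    for i = 1:F
        Sigma_i{i,1} = 3*eye(D{i,1});  % scale 3 is an assumption; tune per feature
    end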