Not all voxels in the brain are equally important. We want to achieve the best accuracy with a minimal number of voxels. This problem is called "feature selection", and it can be formulated as a combinatorial optimization problem. When the number of decision variables is large, heuristic optimization approaches reportedly give decent results, so in this section I will use Simulated Annealing (SA) to find the optimal set of voxels.
The Haxby et al. data will be used in the experiment. The data set consists of 10 runs, each containing 8 data instances, one from each of the 8 categories. We adopted the conventional 10-fold classification approach described as follows:
1. Use run#i, say i=1, as the test run, run#(i+1) (i.e., run#2) as the validation set, and the rest (run#3-run#10) as the training set.
2. Train candidate classifiers on the training set (run#3-run#10).
3. Evaluate the candidates on the validation set (run#2), and pick the best set of parameters (voxels) from it. In other words, we perform "parameter selection" on the validation data set.
4. Repeat the procedure with i=2, 3, 4, ..., 10 as the test set.

For SA, more details of the procedures are listed below:
1. Use run#i as the test set, run#(i+1) as the validation set, and the rest as the training set. That is, test <-- X(run#i,:), validation <-- X(run#(i+1),:), train <-- X(the rest,:).
2. SA proposes a candidate set of voxels "idxSelect".
3. model <-- trainClassifier(X(the rest, idxSelect)): Train the model with the training data X(the rest, idxSelect).
4. accuracy_valid <-- predict(model, X(run#(i+1), idxSelect)): Evaluate the accuracy on the validation set.
5. Repeat steps 2-4 until SA terminates, keeping the set of voxels with the best validation accuracy as idxSelect_best.
6. model_best <-- trainClassifier(X(the rest, idxSelect_best)): Train the model with the training data X(the rest, idxSelect_best) obtained from the optimal set of voxels idxSelect_best.
7. accuracy_test#i <-- predict(model_best, X(run#i, idxSelect_best)): Evaluate the accuracy on the test set run#i.
8. Repeat for i=1 to 10, and report the averaged accuracy from the 10 folds.

The objective function we are interested in is
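The run-wise fold loop described above can be sketched in Python as follows. This is a sketch, not the actual analysis code: `select_voxels`, `train_classifier`, and `accuracy` are hypothetical stand-ins for the SA voxel selection, `trainClassifier`, and `predict` steps.

```python
import numpy as np

def run_folds(X, y, runs, select_voxels, train_classifier, accuracy):
    """Run-wise 10-fold CV: run i is the test set, run (i+1) the
    validation set, and the remaining 8 runs the training set."""
    run_ids = np.unique(runs)                  # e.g. array([0, 1, ..., 9])
    n = len(run_ids)
    test_acc = []
    for i in range(n):
        test_run = run_ids[i]
        valid_run = run_ids[(i + 1) % n]       # wrap around after the last run
        test = runs == test_run
        valid = runs == valid_run
        train = ~(test | valid)
        # SA-based voxel selection using the train/validation split
        idx_best = select_voxels(X[train], y[train], X[valid], y[valid])
        # Retrain on the training data restricted to the selected voxels,
        # then evaluate once on the held-out test run
        model = train_classifier(X[train][:, idx_best], y[train])
        test_acc.append(accuracy(model, X[test][:, idx_best], y[test]))
    return float(np.mean(test_acc))            # averaged accuracy over folds
```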
obj(x) = classificationAccuracy(x) - c*ratioVoxelsUsed(x)
The function classificationAccuracy produces a real value in the range of 0 to 1. ratioVoxelsUsed can be calculated as #(voxels used in the classifier)/#(total voxels), also a real value in the range of 0 to 1. c is a constant, usually less than or equal to 1; the smaller c is, the less we care about the number of voxels used in the classification. We are interested in several objective functions:
1. c=0: the penalty term vanishes and we optimize classification accuracy alone.
2. c=0.000001: a tiny penalty, so that among voxel sets with equal accuracy, the one using fewer voxels is preferred.
3. A grow-then-prune strategy: first run with c=0 to include lots of voxels to achieve a voxel combination that gives the best accuracy. Then, in the prune step, the voxels are removed one by one, given that the best accuracy is still retained.

A strategy to update the solution:
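One simple way to update the solution is a bit-flip move on the voxel-inclusion mask. Below is a minimal sketch of that move together with the objective obj(x) defined above; the function names are hypothetical and the flip move is one reasonable choice, not necessarily the one used here.

```python
import numpy as np

def objective(x, classification_accuracy, c=1e-6):
    """obj(x) = classificationAccuracy(x) - c * ratioVoxelsUsed(x),
    where x is a boolean mask over all voxels."""
    ratio_used = x.sum() / x.size      # #(voxels used) / #(total voxels)
    return classification_accuracy(x) - c * ratio_used

def flip_one_voxel(x, rng):
    """Neighbor move: flip the inclusion bit of one randomly chosen voxel."""
    x_new = x.copy()
    j = rng.integers(x.size)
    x_new[j] = ~x_new[j]
    return x_new
```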
Simple SA for voxel selection (SA_feature_selection_1): The code performs SA on any pre-determined training, cross-validation (cv), and test data set. The code is self-documented, so the reader should understand the mechanism without much difficulty. It can serve as a building block for further analysis such as N-fold cross-validation, which can be done by wrapping this code in a for-loop. The learning curve is shown below:
The top figure shows the learning curve, that is, the training accuracy versus the iterations. Please refer to the SA toolbox page for the meaning of each symbol. The bottom figure shows which voxels (sorted by mutual information, descending) are included in the solution.
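For completeness, the accept/reject core that produces such a learning curve can be sketched generically as below. This is a minimal annealing loop with a geometric cooling schedule; the parameter names and schedule are illustrative assumptions, not the actual interface of the SA toolbox.

```python
import math
import random

def simulated_annealing(x0, objective, neighbor, n_iter=1000,
                        t0=1.0, cooling=0.99):
    """Maximize `objective` by SA: always accept improving moves, and
    accept worse moves with probability exp(delta / T)."""
    x, f = x0, objective(x0)
    best_x, best_f = x, f
    t = t0
    for _ in range(n_iter):
        x_new = neighbor(x)                # propose a candidate solution
        f_new = objective(x_new)
        delta = f_new - f
        if delta >= 0 or random.random() < math.exp(delta / t):
            x, f = x_new, f_new            # accept the move
            if f > best_f:
                best_x, best_f = x, f      # track the best solution found
        t *= cooling                       # geometric cooling schedule
    return best_x, best_f
```

In the voxel-selection setting, `x` would be the voxel mask, `neighbor` the bit-flip move, and `objective` the penalized validation accuracy obj(x).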