Simulated Annealing (SA) for Voxel Selection

Not all voxels in the brain are equally important. We want to achieve the best accuracy with a minimal number of voxels. This problem is called "feature selection", and it can be formulated as a combinatorial optimization problem. When the number of decision variables is large, heuristic optimization approaches reportedly give decent results, so in this section I will use Simulated Annealing (SA) to find an optimal set of voxels.

The Haxby et al. data set will be used in this experiment. It consists of 10 runs, each containing 8 data instances, one from each of the 8 categories. We adopt the conventional 10-fold classification approach described as follows:

  1. Pick run#i, say i=1, as the test run, run#(i+1) (i.e., run#2) as the validation set, and the rest (run#3-run#10) as the training set.
  2. Pick a classification algorithm (say SVM) and train a predictive model based on the training set (run#3-run#10).
  3. Run the predictive model obtained in 2) on the validation set (run#2), and pick the best set of parameters (voxels) from it. In other words, we perform "parameter selection" on the validation set.
  4. Apply the predictive model with the optimal parameters on the test set and report the accuracy.
  5. Repeat 1)-4), incrementing i by 1, that is, using i=2, 3, ..., 10 as the test run (wrapping around so that run#1 serves as the validation set when i=10).
  6. Report the average accuracy over the 10 folds.
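The fold construction above can be sketched in Python. This is a minimal sketch: the helper name `make_folds` is hypothetical, and the wrap-around of the validation run at i=10 is an assumption.

```python
def make_folds(n_runs=10):
    """Build (test, validation, train) run numbers for each fold.

    Runs are numbered 1..n_runs, as in the text. The validation run is
    the one after the test run, assumed to wrap back to run#1 at the end.
    """
    folds = []
    for i in range(1, n_runs + 1):
        test = i
        valid = i % n_runs + 1  # run following the test run, wrapping around
        train = [r for r in range(1, n_runs + 1) if r not in (test, valid)]
        folds.append((test, valid, train))
    return folds

print(make_folds()[0])  # (1, 2, [3, 4, 5, 6, 7, 8, 9, 10])
```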

For SA, the procedure is detailed below:

  1. Pick run#i as the test set, run#(i+1) as the validation set, and the rest as the training set. That is, test <-- X(run#i,:), validation <-- X(run#(i+1),:), train <-- X(the rest,:)
  2. In this step, we will use SA to pick the optimal set of voxels:
    1. Pick a set of voxels, denoted by the selected indices idxSelect.
    2. model <-- trainClassifier(X(the rest, idxSelect)): Train the model on the training data X(the rest, idxSelect).
    3. accuracy_valid <-- predict(model, X(run#(i+1), idxSelect)): Evaluate the accuracy on the validation set.
    4. Keep updating the best set of voxels, idxSelect_best.
    5. Repeat 1-4 in the SA loop until it finishes; the best set of voxels found is denoted by idxSelect_best.
  3. model_best <-- trainClassifier(X(the rest, idxSelect_best)): Train the model on the training data restricted to the optimal set of voxels idxSelect_best.
  4. accuracy_test#i <-- predict(model_best, X(run#i, idxSelect_best)): Evaluate the accuracy on the test set run#i.
  5. Repeat 1-4 for all i=1 to 10, and report the average accuracy over the 10 folds.

Objective function for feature selection

The objective function we are interested in is

obj(x) = classificationAccuracy(x) - c*ratioVoxelsUsed(x)

The function classificationAccuracy produces a real value in the range 0 to 1. ratioVoxelsUsed is calculated as #(voxels used in the classifier)/#(total voxels), also a real value in the range 0 to 1. c is a constant, usually less than or equal to 1; the smaller c is, the less we care about the number of voxels used in the classification. We are interested in several objective functions:

  1. Aim at the accuracy only, and don't care how many voxels are used in the classification. If a new voxel is included and does not degrade the accuracy, it is kept in the solution. This case is equivalent to c=0.
  2. Aim at the accuracy as the first priority and the number of voxels as the second. That is, as long as the accuracy improves, we don't care about the number of voxels; at equal accuracy, however, we prefer the solution with fewer voxels. There are at least 2 ways:
    1. Simultaneous: Use the objective function above and pick a very small c, for instance, c=0.000001.
    2. Include-and-Prune: First, in the include step, we use c=0 and include many voxels to reach a combination that gives the best accuracy. Then, in the prune step, voxels are removed one by one as long as the best accuracy is retained.
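A minimal sketch of the objective and the Include-and-Prune variant. Both function names are hypothetical, and `evaluate` is an assumed helper that returns validation accuracy for a list of voxel indices.

```python
def obj(accuracy, n_selected, n_total, c=1e-6):
    """obj(x) = classificationAccuracy(x) - c * ratioVoxelsUsed(x)."""
    return accuracy - c * (n_selected / n_total)

def include_and_prune(n_voxels, evaluate):
    """Two-phase selection sketch: include with c=0, then prune."""
    selected, best_acc = [], 0.0
    # Include step: keep any voxel that does not degrade accuracy (c = 0).
    for v in range(n_voxels):
        acc = evaluate(selected + [v])
        if acc >= best_acc:
            selected, best_acc = selected + [v], acc
    # Prune step: remove voxels one by one while best accuracy is retained.
    for v in list(selected):
        trial = [u for u in selected if u != v]
        if trial and evaluate(trial) >= best_acc:
            selected = trial
    return selected, best_acc
```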

Strategies to update the solution:

  1. Shuffle: Randomly shuffle the voxel list and introduce the voxels to the SA one by one, regardless of the mutual-information (MI) descending order.
  2. Conditional-MI-Shuffle: Same as 1), but with some chance of skipping a given voxel.
  3. MI-descend: Greedily introduce each voxel to the classifier one by one in MI-descending order.
  4. Conditional-MI-descend: Same as 3), but with some chance of skipping a given voxel.
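The four update strategies can be sketched as orderings of the voxel list. Here `mi_scores` is assumed to hold precomputed mutual information per voxel, and the skip probability is an illustrative choice.

```python
import random

def voxel_order(mi_scores, strategy="MI-descend", skip_prob=0.2, seed=0):
    """Sketch of the four orderings for introducing voxels to the SA.

    mi_scores[v] is the (assumed precomputed) mutual information of
    voxel v with the class labels; higher means more informative.
    """
    rng = random.Random(seed)
    # Start from MI-descending order (most informative voxels first).
    idx = sorted(range(len(mi_scores)), key=lambda v: -mi_scores[v])
    if strategy in ("Shuffle", "Conditional-MI-Shuffle"):
        rng.shuffle(idx)                                    # ignore MI order
    if strategy.startswith("Conditional"):
        idx = [v for v in idx if rng.random() > skip_prob]  # random skips
    return idx
```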

Example code

Simple SA for voxel selection (SA_feature_selection_1): The code performs SA on any pre-determined training, cross-validation (CV), and test data set. The code is self-documented, so the reader should be able to understand the mechanism without much difficulty. It can serve as a good building block for further analysis, such as N-fold cross-validation, which can be done by wrapping this code in a for-loop. The learning curve is shown below:

The top figure shows the learning curve, that is, the training accuracy versus the iteration count. Please refer to the SA toolbox page for the meaning of each symbol. The bottom figure shows which voxels (sorted in MI-descending order) are included in the solution.