Please refer to the following article when using the source code:
The Anh Pham, Sabine Barrat, Mathieu Delalandre and Jean-Yves Ramel, “An efficient indexing scheme based on linked-node m-ary tree structure”. In Proceedings of the 17th International Conference on Image Analysis and Processing (ICIAP 2013), A. Petrosino (Ed.): Part I, LNCS, Volume 8156, pp. 752–762, 2013

1. Introduction

Fast nearest neighbor search is a crucial need for many recognition systems. Although a large number of indexing algorithms have been proposed in the literature, few of them (e.g., randomized KD-trees, the hierarchical K-means tree, randomized clustering trees, and LSH-based schemes) have been validated by extensive experiments to give satisfactory performance on specific benchmarks. In this work, we propose a linked-node m-ary tree (LM-tree) indexing algorithm that works well for both exact and approximate nearest neighbor search. The main contribution of the LM-tree is three-fold. First, a new polar-space-based method of data decomposition is presented to construct the LM-tree. Second, a novel pruning rule is proposed to efficiently narrow down the search space. Finally, a bandwidth search method is introduced to explore the nodes of the LM-tree. Our experiments, applied to one million 128-dimensional SIFT features and 250,000 960-dimensional GIST features, show that the proposed algorithm gives the best search performance compared to the aforementioned indexing algorithms.

2. Construction of LM-tree

Given a dataset X consisting of N feature vectors in a D-dimensional space RD, we present in this section an indexing structure that indexes the dataset X to support efficient proximity search. For a better presentation of our approach, we use the notation p for a point in the RD feature space, and pi for the ith component of the point p (1 <= i <= D). We also denote p = (pi1, pi2) as a point in a 2D space. Before constructing the LM-tree, the dataset X is normalized by aligning it to the principal axes obtained from PCA. Note that no dimension reduction is performed in this step; PCA is used only to align the data. Next, the LM-tree is constructed by recursively partitioning the dataset X into m roughly equal-sized subsets as follows:
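As a concrete illustration, the PCA-based alignment step can be sketched as follows. This is a minimal NumPy sketch under our own naming; it rotates the data onto its principal axes without dropping any dimension, as the description requires, and is not the authors' actual code.

```python
import numpy as np

def pca_align(X):
    """Rotate the dataset onto its principal axes (no dimension reduction).

    X: (N, D) array of feature vectors. Returns the aligned data, the
    mean, and the rotation matrix so that queries can be mapped the
    same way at search time. Illustrative sketch only.
    """
    mean = X.mean(axis=0)
    Xc = X - mean
    # Principal axes are the covariance eigenvectors, sorted by
    # decreasing eigenvalue (i.e., decreasing variance).
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]
    R = eigvecs[:, order]
    return Xc @ R, mean, R
```

After this rotation, axis 0 of the aligned data carries the highest variance, axis 1 the second highest, and so on, which is what the axis-selection step below relies on.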

  • Sort the axes in decreasing order of variance, and randomly choose two axes, i1 and i2, from the L highest-variance axes (L < D).

  • Project every point p in X onto the plane i1ci2, where c is the centroid of the set X, and then compute the angle phi = arctan2(pi1 - ci1, pi2 - ci2).

  • Sort the angles {phi_t} with 1 <= t <= n (n = |X|) in increasing order, and then divide the angles into m equal sub-partitions: (0, phi_t1], (phi_t1, phi_t2], ..., (phi_t(m-1), 360].

  • Partition the set X into m subsets Xk with 1 <= k <= m corresponding to m angle sub-partitions obtained in the previous step.
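The four steps above can be sketched as follows. This is an illustrative NumPy sketch under our own naming (angles are handled in degrees, matching the (0, 360] range in the description); the argument order of arctan2 follows the formula above.

```python
import numpy as np

def polar_partition(X, i1, i2, m):
    """Split X into m roughly equal-sized subsets by angle around the
    centroid in the (i1, i2) plane. Illustrative sketch only.
    """
    c = X.mean(axis=0)
    # Angle of each point in the chosen 2D projection, mapped to [0, 360).
    phi = np.degrees(np.arctan2(X[:, i1] - c[i1], X[:, i2] - c[i2])) % 360.0
    order = np.argsort(phi)
    # m contiguous angular slices with near-equal populations.
    chunks = np.array_split(order, m)
    subsets = [X[idx] for idx in chunks]
    split_angles = [phi[idx[-1]] for idx in chunks]
    return subsets, c, split_angles
```

Because the split is made on sorted angle ranks rather than on fixed angular widths, the m subsets stay balanced even when the points are unevenly distributed around the centroid.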

For each subset Xk, a new node Tk is constructed and attached to its parent node, where we also store the following information: the split axes (i.e., i1 and i2), the split centroid (ci1, ci2), the split angles phi_tk, and the split projected points {(pk_i1, pk_i2)}, where the point (pk_i1, pk_i2) corresponds to the split angle phi_tk. For efficient access across these child nodes, a direct link is established between every two adjacent nodes Tk-1 and Tk, and the last node Tm is linked back to the first node T1. Next, we repeat this partitioning process for each subset Xk associated with the child node Tk until the number of data points in each node falls below a pre-defined threshold. It is worth pointing out that each time a partition is performed, the two highest-variance axes of the corresponding dataset are employed. This is in contrast to many existing tree-based indexing algorithms, where only one axis is employed to partition the data; consequently, in a high-dimensional feature space such as that of 128-dimensional SIFT features, the total number of axes involved in the tree construction is rather limited, making any pruning rules inefficient and the tree less discriminative for later searching. Naturally, the more principal axes are involved in partitioning the data, the more benefit we obtain in both search efficiency and precision. Figure 1(a) illustrates the first and second levels of the LM-tree construction with a branching factor m = 6.
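The node layout and the circular sibling links described above can be sketched as follows. The class and field names are our own; only the stored quantities and the Tm-to-T1 wraparound link are taken from the description.

```python
class LMNode:
    """Illustrative LM-tree node: holds the split information listed
    above plus a circular link to its adjacent sibling."""

    def __init__(self, points):
        self.points = points          # data points held at a leaf
        self.children = []            # child nodes T1..Tm
        self.next = None              # link to the adjacent sibling node
        self.split_axes = None        # the pair (i1, i2)
        self.split_centroid = None    # (ci1, ci2)
        self.split_angle = None       # phi_tk
        self.split_point = None       # the projected point for phi_tk

def link_siblings(children):
    """Wire each node Tk to T(k+1); the last node Tm wraps to T1."""
    m = len(children)
    for k, node in enumerate(children):
        node.next = children[(k + 1) % m]
```

With this wiring, a search routine can move from any child node to its angular neighbors in constant time, without going back through the parent.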

3. Search performance

We have compared the exact nearest neighbor (ENN) search performance of three systems: the proposed LM-tree, the KD-tree, and the hierarchical K-means tree. Figure 2(left) shows the speedup over brute-force search of the three systems, applied to SIFT datasets of different sizes. We can see that the LM-tree outperforms the two other systems on all tests. Figure 2(right) presents the search performance of the three systems for the GIST features. The proposed LM-tree again outperforms the others, and performs even better than it does on the SIFT features. Taking the test where #Points = 150000 in Figure 2(right), for example, the LM-tree gives a speedup of 17.2, the KD-tree a speedup of 3.53, and the K-means tree a speedup of 1.42 over brute-force search. These results confirm the efficiency of the LM-tree for ENN search relative to the two baseline systems.

Fig 2. Exact nearest neighbor search performance of 3 systems

For approximate nearest neighbor (ANN) search performance, four systems participate in this evaluation: the proposed LM-trees, RKD-trees, RC-trees, and the K-means tree. We used 8 parallel trees in the first three systems, while the last uses a single tree, because it was shown in Muja09 that using multiple K-means trees does not improve search performance. For the LM-trees, the parameters Emax and epsilon are empirically determined to obtain a search precision varying in [90%, 99%]. Figure 3(left) shows the search speedup versus the search precision of all the systems. As we can see, the proposed LM-trees algorithm gives significantly better search performance than the other systems across the whole precision range. Taking a search precision of 95%, for example, the speedups over brute-force search of the LM-trees, RKD-trees, RC-trees, and K-means tree are 167.7, 108.4, 122.4, and 114.5, respectively. To make a comparison with the multi-probe LSH indexing algorithm possible, we converted the real-valued SIFT features to binary vectors and tried several parameter settings (i.e., the number of hash tables, the number of multi-probe levels, and the length of the hash key) to obtain the best search performance. However, the result obtained on one million SIFT vectors is rather limited: taking a search precision of 74.7%, for instance, the speedup over brute-force search (using Hamming distance) is just 1.5.

Figure 3(right) shows the search performance of all systems for 200000 GIST features. Again, the LM-trees algorithm clearly outperforms the others, and tends to perform much better than it does on the SIFT features. The RC-trees algorithm also works reasonably well, while the RKD-trees and K-means tree work poorly on this dataset. Taking a search precision of 90%, for example, the speedups over brute-force search of the LM-trees, RKD-trees, RC-trees, and K-means tree are 113.5, 15.0, 45.2, and 21.2, respectively.

Three crucial factors explain these outstanding results of the LM-trees. First, the use of the two highest-variance axes for data partitioning in the LM-tree gives a more discriminative representation of the data, compared with the common use of the single highest-variance axis in the literature. Second, with the approximate pruning rule, a larger fraction of nodes is inspected, but many of them are eliminated after checking the lower bound. In this way, the number of data points that are actually searched is kept under the pre-defined threshold Emax, while a larger number of nodes is inspected, increasing the chance of reaching the true nodes closest to the query. Finally, the use of bandwidth search yields a substantial saving in computation cost, compared with the priority search used in the baseline indexing systems.
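The interplay of the lower-bound pruning and the Emax limit described in the second factor can be illustrated with a generic sketch. This is not the LM-tree's actual pruning rule or bandwidth search; it only shows the general pattern, using toy ball-shaped cells whose distance-to-centroid minus radius gives a valid lower bound, so that cells which cannot beat the current best are discarded and at most Emax points are compared against the query.

```python
import numpy as np

class Cell:
    """Toy leaf cell: a ball around a centroid. lower_bound() is a
    valid lower bound on the distance from q to any point inside."""

    def __init__(self, points):
        self.points = np.asarray(points)
        self.center = self.points.mean(axis=0)
        self.radius = np.max(np.linalg.norm(self.points - self.center, axis=1))

    def lower_bound(self, q):
        return max(0.0, np.linalg.norm(q - self.center) - self.radius)

def bounded_search(cells, q, emax):
    """Visit cells in order of increasing lower bound; prune a cell
    whose bound cannot beat the current best; compare at most emax
    data points against the query. Illustrative sketch only."""
    best, best_d, examined = None, np.inf, 0
    for cell in sorted(cells, key=lambda c: c.lower_bound(q)):
        if examined >= emax:
            break
        if cell.lower_bound(q) >= best_d:
            continue  # pruned: no point in this cell can improve the best
        for p in cell.points:
            d = np.linalg.norm(p - q)
            examined += 1
            if d < best_d:
                best, best_d = p, d
    return best, best_d
```

Because the bound is valid, pruning never discards the true nearest neighbor; only the Emax cutoff makes the search approximate, which is exactly the trade-off the paragraph above describes.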

Fig 3. Approximate nearest neighbor search performance of 4 systems

4. Source code
The source code of the LM-tree and all the experiments is available on the Download page.