Locality Sensitive Hashing

Locality Sensitive Hashing (LSH)

LSH is a probabilistic technique for similarity search in high-dimensional spaces. Given a distance metric (e.g., Euclidean distance, cosine similarity), it aims to map similar data points to the same or similar hash codes with high probability. This allows for efficient retrieval of approximate nearest neighbors (ANN) in large datasets.

Here's a breakdown of the core concepts:

LSH schemes often employ multiple hash functions in a family, generating multiple hash codes for each data point. This helps to improve the probability of finding similar points even if they collide in some hash functions.

There are various LSH constructions, each with its own trade-offs between efficiency and accuracy. Common techniques include:

Fast Walsh-Hadamard Transform (FWHT)

The FWHT is a computationally efficient algorithm for computing the (Walsh) Hadamard transform of a vector. The Hadamard transform, denoted by H, operates on vectors of size N (power of 2) containing elements +1 or -1. It produces another vector of size N where each element is the sum of the element-wise product of the original vector and a basis vector. 

Connection between LSH and FWHT

By applying a fixed randomly chosen pattern of sign flips to the input data that scrambles the orderly basis vectors of the FWHT resulting in a Random Projection. For sparse input data another application of sign flips and a further FWHT is required. The output of the resulting Random Projection can be binarized to give a fast Locality Sensitive Hash algorithm.  In higher dimensions the angle between the Random Projection and the binarized version is about 57-59 degrees which is rather close.