
[20131130] Performance Shootout of Nearest Neighbours: Intro

Violent as the title sounds, I’ll actually be benchmarking software packages that implement nearest-neighbour search in high-dimensional vector spaces. Which approach is the fastest, the easiest to use, the best overall? No neighbours were harmed in the writing of this post.

This is a primer on similarity search, first in a series. Feel free to skip to the meatier Part 2 if you’re familiar with the topic.

Oooh, spaces

Feature vectors with high dimensionality come up in applications surprisingly often: computer vision, recommendation systems, database systems. Arcane as they sound, they’re simply a convenient way of encoding an observation/document/item/whatever as a flat sequence of numbers, aka a vector. A document magically becomes a sparse vector (a sequence of floats… floats in spaces!), as in this gensim tutorial. There are tons of algorithms, math tricks and libraries that make effective use of vectors, and this item-to-vector transformation plugs conveniently into that pool. You can read more about vector spaces elsewhere; let’s move on to search.
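To make this concrete, here is a minimal sketch of the item-to-vector transformation, using gensim’s standard dictionary/bag-of-words workflow (the three-document toy corpus is made up):

import gensim

docs = ["human computer interaction", "graph of trees", "human trees"]  # toy corpus
texts = [doc.lower().split() for doc in docs]

dictionary = gensim.corpora.Dictionary(texts)  # map each word to an integer id
bow_vectors = [dictionary.doc2bow(text) for text in texts]

print(bow_vectors[0])  # a sparse vector: list of (word_id, count) pairs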

Similarity search in high dimensions

What are the top-10 most similar items to this one? How quickly can you find them? Searching for the “nearest neighbours” (=most similar vectors) to a given item query is a fundamental task in many domains, like information retrieval, bioinformatics, machine learning, data compression, pattern recognition.

When the vectors contain many numbers (“high dimensionality”), a nasty property called the curse of dimensionality kicks in. In short, neat techniques that work wonders in 1D or 2D, like binary search, get complicated. Once “D” reaches hundreds or thousands, fancy space partitioning can perform worse than a simple linear scan. Now, a linear scan is a dumb algorithm that visits each item in turn, explicitly computes its similarity to the query and keeps track of the best-N results so far. For a database of 1 million items, that’s one million item-query similarity computations per query. So anything that performs worse than that is pretty sad.
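To be explicit about what that baseline looks like, here’s a minimal NumPy sketch of a linear scan (the function name and toy data are mine, not from any benchmarked library):

import numpy as np

def linear_scan(index_vectors, query, topn=10):
    # rows of `index_vectors` and `query` are assumed unit-length,
    # so a dot product equals cosine similarity
    sims = index_vectors.dot(query)        # one similarity per indexed item
    best = np.argsort(sims)[::-1][:topn]   # rank all N items, keep the best N
    return best, sims[best]

vectors = np.random.randn(1000, 500)  # 1 million items work the same, just 1000x slower
vectors /= np.linalg.norm(vectors, axis=1)[:, np.newaxis]
ids, sims = linear_scan(vectors, vectors[0])  # item 0 is its own nearest neighbour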

Dodging the Curse, State of the Art

So what methods are used for similarity queries in high-dimensional spaces? Facing the curse of dimensionality, direct extensions of low-dimensional algorithms, like BSP (binary space partitioning, known from Quake) or the kd-tree, don’t work very well. People have come up with novel solutions over the past two decades: settling for an approximate answer (get slightly wrong neighbours, but get them fast), creating several randomized trees (for robustness), or clustering the vectors with k-means repeatedly to build a hierarchical indexing structure. A particularly interesting approach is to implement several methods at once, and decide which one to use (and with which parameters) at runtime, depending on the concrete input dataset. This is the approach taken by FLANN, one of the libraries I’ll be benchmarking here.

Gensim contains the brute force aka linear, dumb indexing. Its redeeming qualities: it’s very, very fast (a low constant factor in the linear scan, down to optimizing for CPU cache sizes), it’s trivial to shard and parallelize, and like all dumb scans it has no memory overhead, because there is no extra search structure. I’ve flirted with other algorithms in the past, ones that build more complex indexing structures, but the integration always failed for one reason or another. Usually poor performance… beware of academic big-O promises, where “n” tends to infinity (unlike your data) and constant factors are “irrelevant” (unlike your hardware) :-)

“You cannot sort an array of numbers faster than O(N logN). ‘Tis known.”

— Hapless computer science theorists who forget their basic assumptions.

What’s next?

I talked to Erik Bernhardsson from Spotify recently, which finally nudged me toward doing a more thorough, practical benchmarking of the various nearest-neighbour (or NN) approaches. For the purpose of music recommendations at Spotify, Erik developed and open sourced a neat little library for NN search, called Annoy. I just had to try it :-)

I’ll be comparing the retrieval performance of Annoy and a few others, on the English Wikipedia. The goal is to see whether it’s time to supplement the similarity search in gensim with something more refined.

You can do some more reading, or continue on to Part 2: The Contestants (or check my twitter for updates if you’re too early).

Part 2: The Contestants

Continuing the benchmark of libraries for nearest-neighbour similarity search. What is the best software out there for similarity search in high-dimensional vector spaces?

Document Similarity @ English Wikipedia

I’m not very fond of benchmarks on artificial datasets, and similarity search in particular is sensitive to actual data densities and distance profiles. Using fake “random gaussian datasets” seemed unfair to the more advanced indexing algorithms (brute force linear scans don’t care, of course). So I decided to pin the domain as similar document retrieval, with the English Wikipedia as our text corpus and benchmark dataset.

Prof. Gumby ranks Wikipedia at “lower middle class”, on the much-contested Big Data Scale (™).

The data was prepared as follows: first download the latest Wikipedia dump (9.7GB) & extract plain text from each article:

import bz2
import multiprocessing

import gensim.corpora.wikicorpus
import gensim.utils

def process_article((title, text)):
    """
    Parse a wikipedia article, returning its content as
    `(title, list of tokens)`, all utf8.

    """
    text = gensim.corpora.wikicorpus.filter_wiki(text) # remove markup, get plain text
    # tokenize plain text, throwing away sentence structure, short words etc
    return title.encode('utf8'), gensim.utils.simple_preprocess(text)

def convert_wiki(infile, processes=multiprocessing.cpu_count()):
    """
    Yield articles from a bz2 Wikipedia dump `infile` as (title, tokens) 2-tuples.

    Only articles of sufficient length are returned (short articles & redirects
    etc are ignored).

    Uses multiple processes to speed up the parsing in parallel.

    """
    pool = multiprocessing.Pool(processes)
    texts = gensim.corpora.wikicorpus._extract_pages(bz2.BZ2File(infile)) # generator
    ignore_namespaces = 'Wikipedia Category File Portal Template MediaWiki User Help Book Draft'.split()
    # process the corpus in smaller chunks of docs, because multiprocessing.Pool
    # is dumb and would try to load the entire dump into RAM...
    for group in gensim.utils.chunkize(texts, chunksize=10 * processes):
        for title, tokens in pool.imap(process_article, group):
            if len(tokens) >= 50 and not any(title.startswith(ignore + ':') for ignore in ignore_namespaces):
                yield title.replace('\t', ' '), tokens  # clean tabs from title; tab delimits our output
    pool.terminate()

for title, tokens in convert_wiki('enwiki-latest-pages-articles.xml.bz2'):
    print "%s\t%s" % (title, ' '.join(tokens))

This outputs each article on one line: its title, followed by a tab and a space-separated list of the article’s tokens (wiki markup, stubs and redirects removed):

Extracted plain text: ~3.7 million documents, ~1.9 billion tokens

Article    | Title                       | Extracted tokens
#1         | Anarchism                   | anarchism is political philosophy that advocates stateless societies based on non hierarchical free associations …
#2         | Autism                      | autism is disorder of neural development characterized by impaired social interaction and verbal …
#3         | A                           | named plural aes is the first letter and vowel in the iso basic latin alphabet it is similar to the ancient greek …
#4         | Alabama                     | alabama is state located in the southeastern region of the united states it is bordered by tennessee …
#5         | Abraham Lincoln             | abraham lincoln february april was the th president of the united states serving from march until his assassination …
…          | …                           | …
#3,651,419 | Doug McCarthy (ice hockey)  | doug mccarthy born november is canadian former professional ice hockey player and former …
#3,651,420 | Csák I Hahót                | csák from the kindred hahót died after was hungarian noble who held several secular positions during the reign …
#3,651,421 | Fair and Warmer (1919 film) | fair and warmer is american silent film directed by henry otto starring may allison and eugene pallette …
#3,651,422 | Russell Jones (orientalist) | dr russell jones is content to be called an orientalist of the old mould he has devoted his professional life to …


Using gensim’s memory-friendly streaming API I then converted these plain text tokens to TF-IDF vectors, ran Singular Value Decomposition (SVD) on this TF-IDF matrix to build a latent semantic analysis (LSA) model and finally stored each Wikipedia document as a 500-dimensional LSA vector to disk. These are the vectors that will be used for indexing and querying in the benchmarks:

import gensim

class ShootoutCorpus(gensim.corpora.TextCorpus):
    """Turn the plain text from above into a streamed corpus of sparse vectors."""
    def get_texts(self):
        lineno = -1
        for lineno, line in enumerate(open(self.input)):
            yield line.split('\t')[1].split() # return tokens (ignore the title)
        self.length = lineno + 1

# lazy-load (stream) the corpus of 3.7M documents generated by convert_wiki above
corpus = ShootoutCorpus('title_tokens.txt')

# remove too rare/too common words, then keep 50k most common words
corpus.dictionary.filter_extremes(no_below=20, no_above=0.1, keep_n=50000)

# build tf-idf and lsi models on the wiki corpus
tfidf = gensim.models.TfidfModel(corpus)
lsi = gensim.models.LsiModel(tfidf[corpus], id2word=corpus.dictionary, num_topics=500)

# convert wiki to LSA space & store vectors to disk in MatrixMarket format
gensim.corpora.MmCorpus.serialize('wiki_vectors.mm', lsi[tfidf[corpus]])

I also published this preprocessing code on GitHub, including logging and gzipping the output, to save disk space. The output is a 3,651,422 x 500 dense matrix serialized to disk in MatrixMarket format (48GB; gzipped 12GB). There was no special NLP preprocessing, such as named entity extraction or model tuning — we only care about the final vectors here, from a benchmarking perspective.
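In case you want to replay the benchmarks, the stored vectors stream back just as easily; a sketch of how I’d feed them to the NumPy-based libraries (gensim’s MmCorpus and sparse2full do the heavy lifting):

import numpy as np
import gensim

wiki_vectors = gensim.corpora.MmCorpus('wiki_vectors.mm')  # streamed, not loaded into RAM

# densify the streamed sparse documents chunk by chunk
for docs in gensim.utils.chunkize(wiki_vectors, chunksize=1024):
    chunk = np.array([gensim.matutils.sparse2full(doc, wiki_vectors.num_terms)
                      for doc in docs], dtype=np.float32)
    # ...hand `chunk` over to whichever index is being built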

The Contestants

When picking the competitor libraries for similarity search, I placed two constraints: 1. the software has to be callable from Python (no time to write custom bindings), and 2. it has to be up and running in half an hour or less (no time/patience for unruly or unmaintained software). Note that I’ve been creating & hacking & porting sundry software in sundry languages for sundry platforms for years and years, so half an hour of mine is not as cut-throat as it may sound.

Examples of libraries that didn’t make it include VLFeat (I managed to whip the Python wrappers into obedience, just shy of 30mins, only to realize they don’t expose the similarity routines); Inria’s Yael (atrocious documentation, couldn’t figure out how it works), scikits-ann (wraps ANN; discontinued?) and Jubatus (very promising, but missed the 30min up&running limit; mental note to check back later).

Survivors

Subjective TL;DR summary of the descriptions below.
Library       | NN algo | Ease of install | API & docs | Indexing speed | Index size | Shared mem & loading | Incremental indexing
flann         | ◕‿◕     | ◕‿◕             | ಠ_ಠ        | ◕‿◕            | ಠ_ಠ        | ಠ_ಠ                  | ◕‿◕
yahoo/lsh     | ◕‿◕     | ◕‿◕             | ತ︿ತ        | ◕‿◕            | ʘ‿ʘ        | ಠ_ಠ                  | ◕‿◕
scikit-learn  | ಠ_ಠ     | ◕‿◕             | ◕‿◕        | ◕‿◕            | ತ︿ತ        | ಠ_ಠ                  | ತ︿ತ
spotify/annoy | ◕‿◕     | ◕‿◕             | ◕‿◕        | ◕‿◕            | ಠ_ಠ        | ◕‿◕                  | ತ︿ತ
gensim        | ತ︿ತ     | ◕‿◕             | ◕‿◕        | ◕‿◕            | ಠ_ಠ        | ◕‿◕                  | ◕‿◕


FLANN is a “library for performing fast approximate nearest neighbor (ANN) searches in high dimensional spaces”. Used version 1.8.4 (=the latest), smooth installation on both OS X and Debian Linux. Pros: Possibly the most widely used ANN library (e.g. internally by OpenCV). Indexing takes a while, but that’s due to an automatic selection of an optimal algorithm for the input data. This turns developer-tunes-indexing time into computer-tunes-indexing time, a happy tradeoff and a great feature. User-tunable tradeoff between retrieval accuracy, build speed and query speed. FLANN contains fast algos for brute force (linear) scan, hierarchical k-means or a forest of kd-trees (plus their parameters); the auto-tuning on wiki chooses the hierarchical k-means. Supports batch queries (=submitting several queries at once, for faster processing). Cons: Requires the original dataset to be stored & loaded separately. No index memory sharing between processes, which hampers query parallelization.
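A sketch of how FLANN is driven from Python through its pyflann bindings (the dataset file is hypothetical; parameter values are illustrative, see the FLANN manual for the full list):

import numpy as np
from pyflann import FLANN

dataset = np.load('wiki_vectors.npy')  # hypothetical dense matrix of the LSA vectors

flann = FLANN()
# 'autotuned' makes FLANN pick and tune the algorithm itself (linear scan,
# kd-tree forest or hierarchical k-means), targeting the requested precision
params = flann.build_index(dataset, algorithm='autotuned', target_precision=0.9)

# batch query: 10 nearest neighbours for each of the first 100 rows at once
ids, dists = flann.nn_index(dataset[:100], num_neighbors=10)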

Yahoo/LSH: ANN via “efficient implementation of locality-sensitive hashing (LSH)”. Constructs several random projections to hash input vectors into bins; vectors falling into the same (or nearby) bins are considered similar. Uses NumPy for vector manipulations internally. Pros: unlike all the other libs, only stores a few fingerprints (=hashes=dict entries) per input vector => low memory footprint. A single .py file, there’s no installation nor versioning. Lots of potential for runtime optimization. Cons: Academic implementation: unsightly coding style, weird library design (prepares data for matlab code as well…), one-letter variable names, no sane default values. The biggest problem is that it returns an arbitrary number of nearest neighbours: whatever falls into the same bins as the query, which can be anything between zero and the entire dataset. Taming the “k” in k-NN requires insight and some non-obvious parameter tuning.
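The core idea is simple enough to sketch in a few lines of NumPy; this is the generic random-projection scheme, not Yahoo!’s actual code:

import numpy as np
from collections import defaultdict

def lsh_index(vectors, num_bits=16, seed=0):
    rng = np.random.RandomState(seed)
    planes = rng.randn(vectors.shape[1], num_bits)  # one random hyperplane per bit
    signs = vectors.dot(planes) > 0                 # which side of each hyperplane?
    buckets = defaultdict(list)
    for docid, bits in enumerate(signs):
        buckets[tuple(bits)].append(docid)          # same signature => candidate neighbours
    return planes, buckets

def candidates(query, planes, buckets):
    # note the "k problem": this returns however many documents share the
    # query's bucket, which can be anything between zero and the entire dataset
    return buckets.get(tuple(query.dot(planes) > 0), [])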

Scikit-learn: “machine learning in Python”. Used the latest 0.14.1 version, smooth installation. Contains optimized Ball Tree + KD-Tree + brute-force algos for exact NN search. Pros: “Everything-for-everybody” approach, covering various domains and algorithms in one coherent, well documented package. Auto-tune API param for choosing the best NN algo, similar to FLANN (chooses kd-tree on the wiki subset). Cons: Blows up with memory errors much sooner than the other libs (can only process small datasets). The stored index consumes 3x as much memory as the next most memory-hungry lib, FLANN.
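Usage is as pleasant as the rest of scikit-learn; a minimal sketch of its NN interface (the random data stands in for the wiki vectors):

import numpy as np
from sklearn.neighbors import NearestNeighbors

data = np.random.randn(10000, 500)
data /= np.linalg.norm(data, axis=1)[:, np.newaxis]  # unit length: euclidean ranks like cosine

# algorithm='auto' picks ball tree / kd-tree / brute force based on the data
nn = NearestNeighbors(n_neighbors=10, algorithm='auto').fit(data)
distances, ids = nn.kneighbors(data[:100])  # batch query; results are exact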

Annoy: “approximate nearest neighbors something something” :-). C++ wrapped with Boost, version 1.0, one minor installation hiccup. Annoy builds a forest of trees for ANN search. Each internal tree node separates the dataset recursively, down to a fixed leaf size (like in a kd-tree), except the hyperplanes are chosen randomly. Pros: Optimized for memory usage. Stores index in a format that is memory-mapped back on load, for fast index sharing between processes. A single parameter controlling precision vs. speed. Clean, no bullshit, clearly built for practical use. Cons: not much… the dependency on Boost, maybe.
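A sketch of the whole Annoy workflow against its Python API (random vectors stand in for the wiki dataset; the single knob is the number of trees passed to build()):

import numpy as np
from annoy import AnnoyIndex

index = AnnoyIndex(500, 'angular')  # 'angular' = euclidean on unit vectors, i.e. cosine-like
vectors = np.random.randn(10000, 500)
for docid, vec in enumerate(vectors):
    index.add_item(docid, vec.tolist())

index.build(100)           # number of trees, the accuracy/speed knob (an "annoy@100" index)
index.save('wiki.annoy')   # the saved file is memory-mapped back on load(), cheap to share

ids = index.get_nns_by_vector(vectors[0].tolist(), 10)  # top-10 approximate neighbours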

Gensim: “topic modelling for humans”, version 0.8.8, by yours truly. Pros: Indexes input into shards, which are memory-mapped for multiprocessing (like Annoy). Can index datasets larger than RAM. Each index shard can be independently processed/distributed. Optimized batch queries (like FLANN). Can update shards with new documents incrementally. Cons: Only supports exact NN, via a linear brute-force search with cosine similarity.
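A sketch of gensim’s sharded index on the wiki vectors (num_best makes queries return only the top-N hits; add_documents is the incremental part):

import gensim

wiki_vectors = gensim.corpora.MmCorpus('wiki_vectors.mm')  # from the preprocessing step above

# the index is stored in shards on disk; shards are memory-mapped when queried,
# so several processes can share one index
index = gensim.similarities.Similarity('/tmp/wiki_index', wiki_vectors,
                                       num_features=500, num_best=10)

print(index[wiki_vectors[0]])  # top-10 (docid, cosine similarity) pairs for document #0
# incremental indexing: index.add_documents(more_vectors)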

Neither yahoo/lsh nor scikit-learn supports memory sharing, as far as I could tell. But both are built on top of NumPy, so I expect hacking them to use mmap would be straightforward. Annoy has the same potential to use very little memory as yahoo/lsh: the original input vectors are only ever used for sorting the (few) neighbour candidates, which may be unnecessary in some applications. Annoy’s code is fairly clean and to the point, so hacking it to not sort candidate neighbours (or to sort them by how many trees they appear in, as an approximation) should be straightforward too. For packages that don’t support cosine similarity directly, I normalize the input vectors to unit euclidean length and use euclidean distance, which produces the same ranking. For motivation, academic links and theoretic background on similarity search in vector spaces, see Part 1.
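The normalization trick rests on a one-line identity: for unit-length vectors, squared euclidean distance is a monotonically decreasing function of cosine similarity, so both orderings agree. A quick NumPy check (generic verification, not code from any of the libraries):

import numpy as np

a, b = np.random.randn(500), np.random.randn(500)
a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)  # unit euclidean length

# ||a - b||^2 = ||a||^2 + ||b||^2 - 2*a.b = 2 - 2*cos(a, b) for unit vectors
lhs = np.sum((a - b) ** 2)
rhs = 2 - 2 * np.dot(a, b)
assert abs(lhs - rhs) < 1e-12  # smaller distance <=> higher cosine similarity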

I know it’s a bit of a cliff-hanger, but I’ll postpone graphs of accuracy & query performance to the next episode. This one got longer than I anticipated (again!), but at least we got the preliminaries out of the way. Stay tuned.

EDIT: Part III with hard numbers here! Who won?


Part 3: Hard Numbers

Previous posts explained the whys & whats of nearest-neighbour search, the available OSS libraries and Python wrappers. We converted the English Wikipedia to vector space, to be used as our testing dataset for retrieving “similar articles”. In this post, I finally get to some hard performance numbers, plus a live demo near the end.

Accuracy

First, there’s an obvious tradeoff between accuracy and performance — the approximation methods can be as fast as you like, as long as you don’t mind them returning rubbish. So I’ll be scatter-plotting speed vs. accuracy in a single graph, like for FLANN here:

10-NN FLANN avg error vs. speed (left) and precision vs. speed (right), for several FLANN accuracy settings. All measurements are an average over 100 (distinct) queries. Each reported timing is the best of three runs, using a single core, on a dedicated Hetzner EX40-SSD machine.

For example, when the FLANN index is built with target_precision=0.7, getting top-10 nearest neighbours to a query takes ~0.26ms (wow!), but it gets only 2.3 out of 10 nearest neighbours right, on average. And cosine similarities of these FLANN neighbours are on average ~0.09 away from the “ideal” neighbour similarities. With FLANN index built at target_precision=0.99, the speed drops to ~0.33ms/query (still wow!), but average precision rises to ~0.37 and average error drops to ~0.04. Unsurprisingly, the greater the accuracy requested, the slower the query (and the larger the index).

Interestingly and annoyingly, the 0.04 (or 4%) average cossim error is the best FLANN can do, on this dataset. When asked to build a more accurate index, something inside FLANN overflows and the indexing hangs in an infinite loop.

A few words on accuracy: as shown above, I recorded two different measures, each with a different focus. Precision is defined as the fraction of overlap between the retrieved documents and the expected documents (=what an exact search would return). For example, if we ask an approximative method for the top-3 most similar documents, and it returns [doc_id1, doc_id2, doc_id3] while the exact three nearest neighbours are [doc_id2, doc_id3, doc_id4], the precision is 2/3=0.67. Precision is a number between 0.0 and 1.0, higher is better. The document order (ranking) doesn’t matter for this measure.

The other accuracy measure is average similarity difference (~regret on cumulative gain), which is the average error in cosine similarity, again measured between the approximate vs. exact results. Smaller is better. For the same example as above, if a method returns top-3 similarities of [0.98, 0.9, 0.7], and the exact top-3 similarities are [0.98, 0.94, 0.9], its avg diff will be ((0.98-0.98) + (0.94-0.9) + (0.9-0.7)) / 3 = 0.08. Again, the particular ranking of results makes no difference.
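Both measures take only a few lines of Python; a sketch reproducing the two worked examples above (the helper names are mine):

def precision(approx_ids, exact_ids):
    # fraction of the exact nearest neighbours that the approximate result found
    return len(set(approx_ids) & set(exact_ids)) / float(len(exact_ids))

def avg_diff(approx_sims, exact_sims):
    # average cosine-similarity gap between approximate and exact results
    pairs = zip(sorted(approx_sims, reverse=True), sorted(exact_sims, reverse=True))
    return sum(exact - approx for approx, exact in pairs) / float(len(exact_sims))

print(precision([1, 2, 3], [2, 3, 4]))                # 2/3 = 0.67
print(avg_diff([0.98, 0.9, 0.7], [0.98, 0.94, 0.9]))  # 0.08, as computed above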

Average difference makes better sense than precision in exploratory apps, where we don’t care about what particular documents are retrieved, as long as they are all as similar to the query as possible. I picked avg error because it’s intuitive and makes the accuracy results easy to interpret and reason about. More refined extensions, like the (normalized) discounted cumulative gain, exist and can be tuned in retrieval apps. I also recorded std dev and max of the errors, you can find them in the table at the end.

I won’t be going over the library or dataset descriptions here again; see the previous post for that.

Comparisons

10-NN with Flann vs Annoy, on several accuracy settings.

It’s hard to tell from this graph, with the sub-millisecond values, but compared to FLANN, Annoy is about 4x slower at roughly the same accuracy. Unlike FLANN, though, Annoy has no problem getting arbitrarily close to the “exact” results. Annoy 10-NN gets to a respectable precision of 0.82 and an avg error of 0.006 with only 6.6ms/query; see annoy@500 in the plots above. Compare that with gensim’s brute force search taking 679ms/query (precision 1.0, avg error 0.0), which is ~100x slower than Annoy.

Scikit-learn manages to create an index (but not store it) before running out of RAM, on the 32GB Hetzner machine. Getting the 10 nearest neighbours takes 2143ms/query, neighbours are exact (precision 1.0, avg error 0.0). This is even slower than brute force search, so I removed scikit-learn from the contestants pool. To be fair, scikit-learn’s documentation hints that its algorithm is not likely to perform well in high dimensional spaces.

The following plots show the same comparison, but asking for a different number of “nearest neighbours”. For example, an application may be interested in only a single nearest neighbour. Or it may need 1,000 nearest neighbours. How do the k-NN libraries scale with “k”?

1000-NN with Flann and Annoy, on several accuracy settings.

100-NN with Flann and Annoy, on several accuracy settings.

10-NN with Flann and Annoy, on several accuracy settings.

1-NN with Flann and Annoy, on several accuracy settings. All methods, on all settings, get the top one neighbour right, so it’s precision=1.0 and avgdiff=0.0 across the board.

Gensim (=brute force) doesn’t care about the number of neighbours, so its performance is 679ms/query regardless of “k”. When run in batch mode (issuing all 100 queries at once), average query time drops to 354ms/query.

In case you’re wondering where the LSH contestant went: I didn’t manage to get LSH to behave nicely with regard to returning no fewer than “k” neighbours. I asked for the top-10, got only 2 neighbours back, and so on. Kind people pointed me toward NearPy, a much nicer implementation of LSH than Yahoo!’s. But even NearPy doesn’t solve the “k problem”. I opened a GitHub issue for this, your help is welcome! I’ll update the graphs with LSH measurements once they’re ready.

So, which nearest-neighbour implementation is the best?

FLANN is spectacularly fast, but it’s hard to say how it would fare on better accuracies.

On that note, let me say one more thing: getting to these results was about a hundred times more painful than I had anticipated. Lots of suffering. Things freezing, or not compiling, then compiling but failing tests, running out of memory, quirky idiosyncratic interfaces… And that’s not counting the libraries that I had pruned outright last time. I really thought the landscape of open source high-dim k-NN implementations would be a lot merrier than this. I find it surprising, given how fundamental and well-researched the domain is academically.

Landscape of open source high-dim similarity libraries for Python. Left: What I imagined. Right: Reality, one-and-a-half survivors. Brutal shootout.

Ok. So, each library takes a different approach, has different weak points, excels in different things. Results may vary with different datasets, yadda yadda. I hate vapid conclusions, so I’ll pick a winner anyway: the best implementation is Spotify’s Annoy.

FLANN is faster, but Annoy is saner. I like how it has a single accuracy parameter that “just works”. It offers pleasant accuracy/speed tradeoffs, has a responsive developer, is memory-lean and concise, and supports mmap for sharing an index between processes (FLANN can only share the input matrix, not the index itself).

So I’ll probably plug either Annoy or NearPy into gensim, depending on how the “k” thing in LSH pans out. I’d likely reimplement Annoy in NumPy anyway though, because Annoy’s dependency on a C++ compiler and Boost is annoying (ha!), going against gensim’s “for humans” goal.

Bonus application

A bonus app for those who managed to read this far: type the title of any Wikipedia article exactly as it appears on Wikipedia, including letter casing, and see what the top-10 most similar articles are:


[interactive demo: article title → query time + top-10 most similar articles]

This uses the annoy@100 index (average precision 50%, average similarity error 3%) over all ~3.5 million LSI vectors (=Wikipedia documents). Mind that this is only a performance/sanity check; there was no quality tuning nor NLP preprocessing done to the Wikipedia corpus.

Appendix: Raw measurements & code

Measurements in table form, including how large the various indexes were when serialized to disk:

Index times, accuracies and sizes for the various methods.
Library@settings  | K    | Prec | Avg diff | Std diff | Max diff | Query [ms] | Index size [GB] | Mmap'able portion of index [GB]
gensim@openblas   | any  | 1.00 | 0.00     | 0.000    | 0.00     | 678.9      | 6.6             | 6.6
annoy@1           | 1    | 1.00 | 0.00     | 0.000    | 0.00     | 0.4        | 6.7             | 6.7
annoy@10          | 1    | 1.00 | 0.00     | 0.000    | 0.00     | 0.5        | 7.1             | 7.1
annoy@50          | 1    | 1.00 | 0.00     | 0.000    | 0.00     | 0.6        | 8.8             | 8.8
annoy@100         | 1    | 1.00 | 0.00     | 0.000    | 0.00     | 0.9        | 11.0            | 11.0
annoy@500         | 1    | 1.00 | 0.00     | 0.000    | 0.00     | 2.7        | 29.0            | 29.0
annoy@1           | 10   | 0.16 | 0.17     | 0.163    | 0.87     | 0.4        | 6.7             | 6.7
annoy@10          | 10   | 0.23 | 0.12     | 0.126    | 0.55     | 0.5        | 7.1             | 7.1
annoy@50          | 10   | 0.39 | 0.05     | 0.064    | 0.36     | 1.0        | 8.8             | 8.8
annoy@100         | 10   | 0.50 | 0.03     | 0.043    | 0.19     | 1.6        | 11.0            | 11.0
annoy@500         | 10   | 0.82 | 0.01     | 0.013    | 0.07     | 6.6        | 29.0            | 29.0
annoy@1           | 100  | 0.04 | 0.40     | 0.225    | 0.87     | 0.4        | 6.7             | 6.7
annoy@10          | 100  | 0.17 | 0.15     | 0.126    | 0.56     | 1.1        | 7.1             | 7.1
annoy@50          | 100  | 0.50 | 0.04     | 0.039    | 0.21     | 4.2        | 8.8             | 8.8
annoy@100         | 100  | 0.66 | 0.02     | 0.024    | 0.13     | 8.0        | 11.0            | 11.0
annoy@500         | 100  | 0.94 | 0.00     | 0.005    | 0.07     | 36.6       | 29.0            | 29.0
annoy@1           | 1000 | 0.05 | 0.49     | 0.193    | 0.93     | 1.0        | 6.7             | 6.7
annoy@10          | 1000 | 0.30 | 0.11     | 0.085    | 0.48     | 7.4        | 7.1             | 7.1
annoy@50          | 1000 | 0.70 | 0.02     | 0.020    | 0.11     | 34.2       | 8.8             | 8.8
annoy@100         | 1000 | 0.85 | 0.01     | 0.010    | 0.10     | 64.8       | 11.0            | 11.0
annoy@500         | 1000 | 0.99 | 0.00     | 0.001    | 0.01     | 287.7      | 29.0            | 29.0
flann@0.7         | 1    | 1.00 | 0.00     | 0.000    | 0.00     | 0.2        | 6.8             | 6.6
flann@0.9         | 1    | 1.00 | 0.00     | 0.000    | 0.00     | 0.3        | 10.0            | 6.6
flann@0.99        | 1    | 1.00 | 0.00     | 0.000    | 0.00     | 0.3        | 10.0            | 6.6
flann@0.7         | 10   | 0.23 | 0.09     | 0.112    | 0.83     | 0.3        | 6.8             | 6.6
flann@0.9         | 10   | 0.35 | 0.04     | 0.053    | 0.71     | 0.3        | 10.0            | 6.6
flann@0.99        | 10   | 0.37 | 0.04     | 0.048    | 0.27     | 0.3        | 10.0            | 6.6
flann@0.7         | 100  | 0.13 | 0.26     | 0.212    | 0.91     | 0.3        | 6.8             | 6.6
flann@0.9         | 100  | 0.25 | 0.12     | 0.114    | 0.82     | 0.4        | 10.0            | 6.6
flann@0.99        | 100  | 0.24 | 0.11     | 0.100    | 0.87     | 0.4        | 10.0            | 6.6
flann@0.7         | 1000 | 0.18 | 0.27     | 0.205    | 0.95     | 1.4        | 6.8             | 6.6
flann@0.9         | 1000 | 0.33 | 0.13     | 0.127    | 0.78     | 1.6        | 10.0            | 6.6
flann@0.99        | 1000 | 0.37 | 0.11     | 0.119    | 0.80     | 1.6        | 10.0            | 6.6
scikit-learn@auto | 10   | 1.00 | 0.00     | 0.000    | 0.00     | 2142.8     | -               | -

By the way, I extended all the evaluation code so that running a single bash script will download the English wiki, parse its XML, compute the Latent Semantic vectors, create all indexes, measure all timings, flush the toilet and brush your teeth for you, automatically. You can find the repo on GitHub: https://github.com/piskvorky/sim-shootout.

(If you liked this series, you may also like Optimizing deep learning with word2vec.)
