https://micvog.com/2015/07/18/frequency-counting-algorithms-over-data-streams/
https://whitane.com/post/how-to-countish/
https://github.com/jparkie/PDD Advanced Bloom Filter Based Algorithms for Efficient Approximate Data De-Duplication in Streams
http://blog.fastforwardlabs.com/post/153566952648/probabilistic-data-structure-showdown-cuckoo
https://gist.github.com/debasishg/8172796 A collection of links for streaming algorithms and data structures
https://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
https://iwringer.wordpress.com/2015/08/03/patterns-for-streaming-realtime-analytics/
http://www.infoq.com/presentations/abstract-algebra-analytics
https://www.mapr.com/blog/some-important-streaming-algorithms-you-should-know-about
https://queue.acm.org/detail.cfm?id=2855183
https://nytlabs.github.io/streamtools/
https://news.ycombinator.com/item?id=9917008
http://jeremykun.com/2016/01/04/hashing-to-estimate-the-size-of-a-stream/
http://micvog.com/2015/07/18/frequency-counting-algorithms-over-data-streams/
http://queue.acm.org/detail.cfm?ref=rss&id=2534976 Real-time Algo
https://www.mapr.com/blog/some-important-streaming-algorithms-you-should-know-about
https://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining
http://www.bravenewgeek.com/stream-processing-and-probabilistic-methods/
https://gist.github.com/debasishg/8172796 links to streaming and probabilistic articles
http://www.designofapproxalgs.com/download.php
https://www.youtube.com/watch?v=jKBwGlYb13w
http://habrahabr.ru/post/115147/ 2 Sets Comparison
http://en.wikipedia.org/wiki/Moving_average
Grouping by fuzzy criteria: http://knol.google.com/k/simple-simhashing#
Free books
https://free-programming-books.zeef.com/victor.felder
Algorithms for massive data sets:
http://massivedatasets.wordpress.com/
http://www.cs.princeton.edu/courses/archive/spring02/cs493/schedule.html
Random line from file: Reservoir Sampling
string selected_line;
for(size_t linenumber=0;!file.eof();linenumber++) {
string s = file.getline();
int64_t r = rand64();
r = r%linenumber;
if(r==0) {
selected_line = s;
}
}
cout << selected_line << endl;
The function call r%linenumber picks a random number between 0 and the current line number. Therefore, you have a one in N chance, that is, 1/N, of keeping the Nth line. Therefore you've a 100% chance of keeping the first line, a 50% chance of keeping the second, a 33% chance of keeping the third, and so on. The question is whether this is fair for all N, where N is any positive integer.
First, some concrete examples, then abstract ones.
Obviously, a file with one line (N=1) is fair: you always keep the first line because 1/1 = 100%, making it fair for files of 1 line. For a file with two lines, N=2. You always keep the first line; then when reaching the second line, you have a 50% chance of keeping it. Thus, both lines have an equal chance of being selected, which shows that N=2 is fair. For a file with three lines, N=3. You have a one-third chance, 33%, of keeping that third line. That leaves a two-thirds chance of retaining one of the first two out of the three lines. But we've already shown that for those first two lines there's a 50-50 chance of selecting either one. 50 percent of two-thirds is one-third. Thus, you have a one-third chance of selecting each of the three lines of the file.
In the general case, a file of N+1 lines will choose the last line 1/(N+1) times and one of the previous N lines N/(N+1) times. Dividing N/(N+1) by N leaves us with 1/(N+1) for each the first N lines in our N+1 line file, and also 1/(N+1) for line number N+1. The algorithm is therefore fair for all N, where N is a positive integer.
http://engineering.bloomreach.com/mapreduce-fun-sampling-for-large-data-set/
http://blog.cloudera.com/blog/2013/04/hadoop-stratified-randosampling-algorithm/
http://gregable.com/2007/10/reservoir-sampling.html
http://blog.cloudera.com/blog/2013/04/hadoop-stratified-randosampling-algorithm/
http://nedbatchelder.com/blog/201208/selecting_randomly_from_an_unknown_sequence.html
http://avva.livejournal.com/2659266.html choose random line from file
http://www.bryceboe.com/2009/03/23/random-lines-from-a-file/
Reservoir sampling: http://gregable.com/2007/10/reservoir-sampling.html
http://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle Random permutation
http://habrahabr.ru/post/228575/ Median Estimation
Cuckoo
http://mybiasedcoin.blogspot.com/2014/10/cuckoo-filters.html
https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf
https://news.ycombinator.com/item?id=8489971
https://github.com/seiflotfy/cuckoofilter
Cardinality Estimation HyperLogLog4
https://www.cs.princeton.edu/~rs/talks/AC11-Cardinality.pdf
https://github.com/aaw/hyperloglog-redis.
"The basic idea of HyperLogLog (and its predecessors PCSA, LogLog, and others) is to apply a good hash function to each value observed in the stream and record the longest run of zeros seen as a prefix of any hashed value. If the hash function is good, the bits in any hashed value should be close to statistically independent, so seeing a value that starts with exactly X zeros should happen with probability close to 2 -(X + 1). So, if you've seen a run of 5 zeros in one of your hash values, you're likely to have around 2 6 = 64 values in the underlying set. The actual implementation and analysis are much more advanced than this, but that's the idea."
http://opensourceconnections.com/blog/2015/02/04/its-log-its-log-its-big-its-hyper-its-good/
https://blog.codeship.com/counting-distinct-values-with-hyperloglog/
http://moderndescartes.com/essays/hyperloglog
http://blog.notdot.net/2012/09/Dam-Cool-Algorithms-Cardinality-Estimation
http://adityasastry.in/viewer.php?cno=35 KVM
https://github.com/clarkduvall/hypy
http://metamarkets.com/2012/fast-cheap-and-98-right-cardinality-estimation-for-big-data/
http://blog.notdot.net/2012/09/Dam-Cool-Algorithms-Cardinality-Estimation
https://periscope.io/blog/hyperloglog-in-pure-sql.html
https://news.ycombinator.com/item?id=7506774
https://news.ycombinator.com/item?id=4488946
http://stackoverflow.com/questions/12327004/how-does-the-hyperloglog-algorithm-work
http://blog.aggregateknowledge.com/tag/hyperloglog/
http://metamarkets.com/2012/fast-cheap-and-98-right-cardinality-estimation-for-big-data/
http://research.stefanheule.com/papers/edbt2013-hyperloglog.pdf
Bloom Filter
https://habrahabr.ru/post/304800/
https://fylux.github.io/2017/03/19/Bloom-Filter/
https://github.com/le1ca/bloomfilter
https://news.ycombinator.com/item?id=14697431
Given:
1) set S: (e1,e2, ... )
2) element: y
Task: to detect if "y" exists inside the set S (membership problem)
Positive answer is probabilistic: probably "yes" (false positives)
https://www.youtube.com/watch?v=-SuTGoFYjZs
Negative answer is not probabilistic.
http://blog.michaelschmatz.com/2016/04/11/how-to-write-a-bloom-filter-cpp/
https://hackernoon.com/counting-bloom-filter-in-c-9672ec25b3ec
https://github.com/krisives/jbloomer/blob/master/src/jbloomer/BloomFilter.java
https://alexandrnikitin.github.io/blog/bloom-filter-for-scala/
you have a lot of items in a list of some kind and you want to know if a particular one is present already without incurring the heavy lookup cost.
When you check the Bloom filter it tells you:
1) it might be there
or
2) it definitely isn't there.
In the case of 2, you don't need to look it up. In case 1, you'll need to do the actual lookup.
It is commonly used to filter high volume / frequency requests for something. For example, if you have a list of banned IP addresses, user accounts, etc, you can quickly go through the bloom filter without hitting the database.
At the heart of every bloom filter lies two key elements
An array of n bits, initially all set to 0.
A collection of k independent hash functions h(x). Each hash function takes a value v and generates a number i where i < n which effectively maps to a position in the bit array.
The underlying idea of a bloom filter is quite simple and can be explained in the following steps -
Initialize a bit array of n bits with zeros. Generally n is chosen to be much greater than the number of elements in the set.
Whenever the filter sees a new element apply each of the hash functions h(x) on the element. With the value generated, which is an index in the bit array, set the bit to 1 in the array. For example, if there are k hash functions there will be k indices generated. For each of these k positions in the bit array set array[i] = 1
To check if an element exists in the set, simply carry out the exact same procedure with a slight twist. Generate k values by applying the k hash-functions on the input. If at least one of these k indices in the bit array is set to zero then the element is a new element else this is an existing element in the set.
http://habrahabr.ru/post/242285/
https://github.com/tylertreat/BoomFilters
http://tech.okcupid.com/swiping-right-on-bloom-filters/
http://www.michaelnielsen.org/ddi/why-bloom-filters-work-the-way-they-do/
http://www.jasondavies.com/bloomfilter/
https://www.youtube.com/watch?v=-SuTGoFYjZs
http://www.reddit.com/r/programming/comments/29wahu/bloom_filters_explained/
http://spyced.blogspot.com/2009/01/all-you-ever-wanted-to-know-about.html
http://matthias.vallentin.net/blog/2011/06/a-garden-variety-of-bloom-filters/
http://habrahabr.ru/post/112069/
http://billmill.org/bloomfilter-tutorial/
http://code.activestate.com/recipes/577684-bloom-filter/ Python
C++ https://svn.boost.org/svn/boost/sandbox/bloom_filter/trunk/boost/bloom_filter/
http://blog.zacharyvoase.com/2012/08/31/m2mbloom/
http://matthias.vallentin.net/blog/2011/06/a-garden-variety-of-bloom-filters/
http://www.jasondavies.com/bloomfilter/
http://habrahabr.ru/blogs/algorithm/112069/
http://llimllib.github.com/bloomfilter-tutorial/
http://glowingpython.blogspot.it/2013/01/bloom-filter.html
http://www.reddit.com/r/Python/comments/17fega/bloom_filter_is_easy_with_python/
http://www.reddit.com/r/Python/comments/hrk9u/bloom_filter_python_recipes_space_efficient/
http://code.activestate.com/recipes/577684-bloom-filter/
http://corte.si/posts/code/bloom-filter-rules-of-thumb/index.html
http://stackoverflow.com/questions/497338/efficient-set-intersection-algorithm
http://en.wikipedia.org/wiki/Bloom_filter
http://news.ycombinator.com/item?id=1862047
http://www.cap-lore.com/code/BloomTheory.html
http://abhishek-tiwari.com/2010/06/bloom-filters-for-bioinformatics.html