HyperLogLog Bloom Streaming Sampling

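The expression r%linenumber, where r is a random non-negative integer, yields a number from 0 up to, but not including, the current line number. Testing the result against one particular value, such as 0, therefore gives you a one in N chance, that is, 1/N, of keeping the Nth line in place of whichever line you were holding before. So you have a 100% chance of keeping the first line, a 50% chance of keeping the second, a 33% chance of keeping the third, and so on. The question is whether the final choice is fair, meaning every line ends up selected with equal probability, for a file of N lines, where N is any positive integer.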

First, some concrete examples, then the general case.

Obviously, a file with one line (N=1) is fair: you always keep the first line because 1/1 = 100%.

For a file with two lines (N=2), you always keep the first line; then, on reaching the second line, you have a 50% chance of replacing it with the second. Thus both lines have an equal chance of being selected, which shows that N=2 is fair.

For a file with three lines (N=3), you have a one-third chance, 33%, of keeping the third line. That leaves a two-thirds chance of retaining one of the first two lines. But we've already shown that those first two lines have a 50-50 chance of being the one held so far, and 50 percent of two-thirds is one-third. Thus you have a one-third chance of selecting each of the three lines of the file.

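In the general case, assume the algorithm is fair for a file of N lines. A file of N+1 lines will then choose the last line with probability 1/(N+1) and keep one of the previous N lines with probability N/(N+1). By the assumption, each of those N lines is equally likely to be the one held at that point, so dividing N/(N+1) by N leaves us with 1/(N+1) for each of the first N lines in our N+1 line file, and also 1/(N+1) for line number N+1. Since the one-line case is fair, the algorithm is, by induction, fair for all N, where N is a positive integer.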
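The argument above can be tried out with a minimal Python sketch. The function name random_line and the rng parameter are made up for this illustration; rng.randrange(line_number) stands in for r%linenumber, and "keep the line when the result is 0" is an assumption about how the result is tested.

```python
import random


def random_line(lines, rng=random):
    """Return one line chosen uniformly at random from an iterable.

    The number of lines does not need to be known in advance, and only
    one line is held in memory at a time.
    """
    chosen = None
    for line_number, line in enumerate(lines, start=1):
        # randrange(line_number) is uniform over 0 .. line_number - 1,
        # so it equals 0 with probability 1/line_number.
        if rng.randrange(line_number) == 0:
            chosen = line
    return chosen


if __name__ == "__main__":
    # Rough empirical check of fairness on a three-line "file".
    counts = {"a": 0, "b": 0, "c": 0}
    for _ in range(30000):
        counts[random_line(["a", "b", "c"])] += 1
    print(counts)  # each count should be close to 10000
```

Because the loop works on any iterable, the same function can be passed an open file object to pick a random line from a file in a single pass.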

http://engineering.bloomreach.com/mapreduce-fun-sampling-for-large-data-set/

http://blog.cloudera.com/blog/2013/04/hadoop-stratified-randosampling-algorithm/

https://en.wikipedia.org/wiki/Reservoir_sampling

http://nedbatchelder.com/blog/201208/selecting_randomly_from_an_unknown_sequence.html

http://avva.livejournal.com/2659266.html choose random line from file

http://www.bryceboe.com/2009/03/23/random-lines-from-a-file/

http://stackoverflow.com/questions/2016240/how-can-i-return-a-random-line-from-a-file-interview-question

Reservoir sampling: http://gregable.com/2007/10/reservoir-sampling.html

http://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle Random permutation

http://habrahabr.ru/post/228575/ Median Estimation

Cuckoo Filter

http://mybiasedcoin.blogspot.com/2014/10/cuckoo-filters.html

https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf

https://news.ycombinator.com/item?id=8489971

https://github.com/seiflotfy/cuckoofilter

Cardinality Estimation HyperLogLog

https://www.cs.princeton.edu/~rs/talks/AC11-Cardinality.pdf

https://github.com/aaw/hyperloglog-redis

"The basic idea of HyperLogLog (and its predecessors PCSA, LogLog, and others) is to apply a good hash function to each value observed in the stream and record the longest run of zeros seen as a prefix of any hashed value. If the hash function is good, the bits in any hashed value should be close to statistically independent, so seeing a value that starts with exactly X zeros should happen with probability close to 2^-(X + 1). So, if you've seen a run of 5 zeros in one of your hash values, you're likely to have around 2^6 = 64 values in the underlying set. The actual implementation and analysis are much more advanced than this, but that's the idea."
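The idea in the quote can be sketched in a few lines of Python. This is only the naive single-counter version described above, not HyperLogLog itself (which splits the stream across many registers and averages them), so the estimate is always a power of two and can be far off. The names hash32, leading_zeros and rough_cardinality, the choice of SHA-1, and the 32-bit hash width are all assumptions made for the illustration.

```python
import hashlib

HASH_BITS = 32  # assumed hash width for this sketch


def hash32(value):
    """Map a value to a 32-bit integer using the first 4 bytes of SHA-1."""
    digest = hashlib.sha1(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def leading_zeros(x, bits=HASH_BITS):
    """Count the zero bits before the first 1 in a bits-wide integer."""
    return bits - x.bit_length()


def rough_cardinality(stream):
    """Estimate the number of distinct values as 2^(longest zero run + 1)."""
    longest = -1
    for value in stream:
        longest = max(longest, leading_zeros(hash32(value)))
    return 0 if longest < 0 else 2 ** (longest + 1)


if __name__ == "__main__":
    # Duplicates hash to the same value, so they do not change the estimate.
    print(rough_cardinality(range(1000)))
    print(rough_cardinality(list(range(1000)) * 5))
```

Only one small integer is stored no matter how long the stream is, which is what makes this family of estimators so cheap in memory.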

http://opensourceconnections.com/blog/2015/02/04/its-log-its-log-its-big-its-hyper-its-good/

https://blog.codeship.com/counting-distinct-values-with-hyperloglog/

http://moderndescartes.com/essays/hyperloglog

http://blog.notdot.net/2012/09/Dam-Cool-Algorithms-Cardinality-Estimation

http://adityasastry.in/viewer.php?cno=35 KVM

https://github.com/clarkduvall/hypy

http://metamarkets.com/2012/fast-cheap-and-98-right-cardinality-estimation-for-big-data/

http://antirez.com/news/75

https://periscope.io/blog/hyperloglog-in-pure-sql.html

https://news.ycombinator.com/item?id=7506774

https://news.ycombinator.com/item?id=4488946

http://stackoverflow.com/questions/12327004/how-does-the-hyperloglog-algorithm-work

http://blog.aggregateknowledge.com/tag/hyperloglog/

http://highscalability.com/blog/2012/4/5/big-data-counting-how-to-count-a-billion-distinct-objects-us.html

http://research.stefanheule.com/papers/edbt2013-hyperloglog.pdf

Bloom Filter

https://habrahabr.ru/post/304800/

https://fylux.github.io/2017/03/19/Bloom-Filter/

https://github.com/le1ca/bloomfilter

https://news.ycombinator.com/item?id=14697431

Given:

1) set S: (e1,e2, ... )

2) element: y

Task: determine whether "y" is a member of the set S (the membership problem)

A positive answer is probabilistic: it means "probably yes" (false positives are possible).

A negative answer is not probabilistic: it means "definitely no" (there are no false negatives).

http://blog.michaelschmatz.com/2016/04/11/how-to-write-a-bloom-filter-cpp/

https://hackernoon.com/counting-bloom-filter-in-c-9672ec25b3ec

https://github.com/krisives/jbloomer/blob/master/src/jbloomer/BloomFilter.java

https://alexandrnikitin.github.io/blog/bloom-filter-for-scala/

You have a lot of items in a list of some kind and you want to know whether a particular one is already present without incurring the heavy lookup cost.

When you check the Bloom filter it tells you:

1) it might be there

2) it definitely isn't there.

In case 2, you don't need to look it up. In case 1, you'll need to do the actual lookup.

It is commonly used to filter high-volume, high-frequency requests. For example, if you have a list of banned IP addresses, user accounts, etc., you can check each request against the Bloom filter first and hit the database only when the filter says the item might be there.

At the heart of every Bloom filter lie two key elements:

1) An array of n bits, initially all set to 0.

2) A collection of k independent hash functions h(x). Each hash function takes a value v and generates a number i, where 0 <= i < n, which maps to a position in the bit array.

The underlying idea of a Bloom filter is quite simple and can be explained in the following steps:

1) Initialize a bit array of n bits with zeros. Generally n is chosen to be much greater than the number of elements in the set.

2) Whenever the filter sees a new element, apply each of the k hash functions h(x) to the element. Each value generated is an index into the bit array, so k hash functions produce k indices. For each of these k positions i, set array[i] = 1.

3) To check whether an element exists in the set, carry out the same procedure with a slight twist. Generate k indices by applying the k hash functions to the input. If the bit at any of these k indices is zero, the element is definitely new; otherwise it is probably already in the set (it may be a false positive).
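The steps above can be sketched in Python roughly as follows. The class and method names (BloomFilter, add, might_contain) are invented for this illustration, and deriving the k hash functions by salting SHA-256 with a counter is just one convenient assumption; real implementations usually pack the bits and use faster hashes.

```python
import hashlib


class BloomFilter:
    def __init__(self, n_bits, k_hashes):
        self.n = n_bits
        self.k = k_hashes
        # One byte per bit keeps the sketch simple; all positions start at 0.
        self.bits = bytearray(n_bits)

    def _indices(self, item):
        """Yield k positions in the range 0 .. n - 1 for the item."""
        data = str(item).encode("utf-8")
        for seed in range(self.k):
            # Salting with the seed simulates k different hash functions.
            salted = str(seed).encode("utf-8") + b":" + data
            digest = hashlib.sha256(salted).digest()
            yield int.from_bytes(digest[:8], "big") % self.n

    def add(self, item):
        for i in self._indices(item):
            self.bits[i] = 1

    def might_contain(self, item):
        # False means "definitely not there"; True means "might be there".
        return all(self.bits[i] for i in self._indices(item))


if __name__ == "__main__":
    banned = BloomFilter(n_bits=1024, k_hashes=3)
    for ip in ["10.0.0.1", "10.0.0.2", "192.168.1.7"]:
        banned.add(ip)
    print(banned.might_contain("10.0.0.1"))  # True: it was added
    print(banned.might_contain("8.8.8.8"))   # usually False; True would be a false positive
```

In the banned-IP scenario, only a True answer would trigger the real database lookup; a False answer lets the request through without touching the database.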

http://habrahabr.ru/post/242285/

https://github.com/tylertreat/BoomFilters

http://tech.okcupid.com/swiping-right-on-bloom-filters/

http://www.michaelnielsen.org/ddi/why-bloom-filters-work-the-way-they-do/

http://www.jasondavies.com/bloomfilter/

https://www.youtube.com/watch?v=-SuTGoFYjZs

http://www.reddit.com/r/programming/comments/29wahu/bloom_filters_explained/

http://spyced.blogspot.com/2009/01/all-you-ever-wanted-to-know-about.html

http://matthias.vallentin.net/blog/2011/06/a-garden-variety-of-bloom-filters/

http://habrahabr.ru/post/112069/

http://billmill.org/bloomfilter-tutorial/

http://code.activestate.com/recipes/577684-bloom-filter/ Python