cm-eclectics

- [Computational Biology] "These are not the k-mers you are looking for: efficient online k-mer counting using a probabilistic data structure." Qingpeng Zhang, Jason Pell, Rosangela Canino-Koning, Adina Chuang Howe, C. Titus Brown. The paper discusses our Count-Min Sketch implementation of a memory-efficient k-mer counting system.

- [Graph Machine Learning] Scaling Graph-based Semi Supervised Learning to Large Number of Labels Using Count-Min Sketch. Partha Pratim Talukdar and William W. Cohen. AISTATS 2014.
- [NLP] Sketch Techniques for Scaling Distributional Similarity to the Web. Amit Goyal, Jagadeesh Jagarlamudi, Hal Daumé III, and Suresh Venkatasubramanian. GEMS 2010.

"In this paper, we propose a memory, space, and time efficient framework to scale distributional similarity to the web. We exploit sketch techniques, especially the Count-Min sketch, which approximates the frequency of an item in the corpus without explicitly storing the item itself. These methods use hashing to deal with massive amounts of the streaming text. We store all item counts computed from 90 GB of web data in just 2 billion counters (8 GB main memory) of CM sketch. Our method returns semantic similarity between word pairs in O(K) time and can compute similarity between any word pairs that are stored in the sketch. In our experiments, we show that our framework is as effective as using the exact counts."
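The idea in the abstract above, approximating an item's frequency without storing the item, can be illustrated with a minimal Count-Min sketch. This is only a sketch under stated assumptions, not the authors' implementation: the class name `CountMinSketch`, the `width` and `depth` defaults, and the use of salted BLAKE2b as the hash family are all choices made here for illustration.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counts in fixed memory; items are never stored."""

    def __init__(self, width=1024, depth=4):
        # width and depth are illustrative defaults, not values from the paper.
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _buckets(self, item):
        # One bucket per row, chosen by a row-salted hash of the item.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                item.encode("utf-8"),
                digest_size=8,
                salt=row.to_bytes(8, "little"),
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        for row, col in self._buckets(item):
            self.rows[row][col] += count

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum over rows
        # is the tightest available estimate and never undercounts.
        return min(self.rows[row][col] for row, col in self._buckets(item))


sketch = CountMinSketch()
for word in "the cat sat on the mat the end".split():
    sketch.add(word)
print(sketch.estimate("the"))  # at least 3
```

Memory use is fixed at `width * depth` counters regardless of how many distinct items arrive, which is the property that lets very large corpora fit in main memory. The price is that estimates may overshoot when unrelated items share buckets.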

Lossy Conservative Update (LCU) sketch: Succinct approximate count storage. Amit Goyal and Hal Daumé III. AAAI 2011.

Considers a variety of sketch implementation issues for the same application.

- [Security] Popularity is everything: A new approach to protecting passwords from statistical-guessing attacks. Stuart Schechter, Cormac Herley, Michael Mitzenmacher. HotSec 2010.

From My Biased Coin and int main(): "The idea here is that the real problem with passwords is that some are too popular, making them easy to guess. Providers respond by forcing users to choose passwords that pass certain rules -- you must have a capital and lower-case letter, you must have a number, etc. These rules are somewhat arbitrary and don't directly tackle the significant problem of popularity. Our paper is about how that can be done. (As you might imagine, from my involvement, some Bloom filter variant -- the count-min filter in this case -- is part of the solution.)"
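As a rough illustration of limiting password popularity with a count-min structure, the sketch below rejects a password once its estimated count reaches a cap. It is not the scheme from the paper: the function `try_register`, the dimensions `WIDTH` and `DEPTH`, and the cap `MAX_POPULARITY` are hypothetical, and a real deployment would need to consider what the counters themselves reveal to an attacker.

```python
import hashlib

WIDTH, DEPTH = 2048, 4   # hypothetical sketch dimensions
MAX_POPULARITY = 3       # hypothetical cap on users sharing one password
counters = [[0] * WIDTH for _ in range(DEPTH)]


def _cells(password):
    # One counter per row, chosen by a row-salted hash of the password.
    for row in range(DEPTH):
        digest = hashlib.blake2b(
            password.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        yield row, int.from_bytes(digest, "little") % WIDTH


def try_register(password):
    """Accept the password unless its estimated popularity has reached the cap."""
    cells = list(_cells(password))
    estimate = min(counters[row][col] for row, col in cells)
    if estimate >= MAX_POPULARITY:
        return False
    for row, col in cells:
        counters[row][col] += 1
    return True


results = [try_register("123456") for _ in range(5)]
print(results)  # [True, True, True, False, False]
```

Because the estimate can only overshoot, this check may occasionally turn away a rare password that collides with popular ones, but it never lets a password exceed the cap.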

- [Privacy] Pan-Private Streaming Algorithms. Cynthia Dwork, Moni Naor, Toniann Pitassi, Guy N. Rothblum and Sergey Yekhanin. ICS 2010.

"With certain technical changes (involving the use of epsilon-biased hash functions instead of the pseudo-randomness used in the original work), the sketching techniques can be used to give a weak form of pan-private algorithms for estimating the number of times an item appears in a stream. This can also be achieved using Count-Min Sketch..."

- [Machine Learning] Hash Kernel. Qinfeng Shi, James Petterson, Gideon Dror, John Langford, Alex Smola, Alex Strehl.

"Firstly, we show that the sampling schemes of Kontorovich (2007) and Rahimi and Recht (2008) can be applied to a considerably larger class of kernels than originally suggested — the authors only consider languages and radial basis functions respectively. Secondly, we propose a biased approximation φ̄ of φ which allows efficient computations even on data streams. Our work is inspired by the count-min sketch of Cormode and Muthukrishnan (2004), which uses hash functions as a computationally efficient means of randomization. This affords storage efficiency (we need not store random vectors) and at the same time they give performance guarantees comparable to those obtained by means of random projections."

- [Machine Learning] On Classification of High-Cardinality Data Streams. Charu Aggarwal, Philip Yu. SIAM SDM 2010.

"In this paper, we will use a sketch-based approach to perform classification of data streams with massive domain-sizes. The idea is to create a sketch-based model which can approximately identify combinations of attributes which have high discriminatory power. This approximation is used for classification."

- [PageRank] To Randomize or Not To Randomize: Space Optimal Summaries for Hyperlink Analysis. Tamás Sarlós, András A. Benczúr, Károly Csalogány, Dániel Fogaras, Balázs Rácz.

"The key idea of our algorithms is that we use lossy representation of large vectors either by rounding or sketching. Sketches are compact randomized data structures that enable approximate computation in low dimension. To be more precise, we adapt the Count-Min Sketch of Cormode and Muthukrishnan [7], which was primarily introduced for data stream computation. We use sketches for small space computation; in the same spirit Palmer et al. [25] apply probabilistic counting sketches to approximate the sizes of neighborhoods of vertices in large graphs."

- [Social Networks] Scalable Proximity Estimation and Link Prediction in Online Social Networks. Han Hee Song, Tae Won Cho, Vacha Dave, Yin Zhang and Lili Qiu. IMC 2009.

"Our proximity sketch effectively summarizes each row of P: P[x, *] using a count-min sketch. As a result, we provide the same probabilistic accuracy guarantee as the count-min sketch,..."

- [Entity Categorization] Entity Categorization Over Large Document Collections. Venkatesh Ganti, Arnd Christian König and Rares Vernica. KDD 08.

"... we resort to a very space-efficient hash-based approximation scheme to track entity frequencies. We employ a sketching technique called Count-Min Sketch (CM-Sketch) [12]."

- [Games] Move Prediction in the Game of Go. Brett Alexander Harrison. B. S. Thesis, Harvard, 2010.

"To successfully count pattern frequencies in a way that is both time and space efficient, we use a count-min filter, also called a count-min sketch [9]."

- [Garbage Collection] Sketch based Distributed Garbage Collection, Theory and Empirical Evaluation. Joannès Vermorel.
- [Information Theory] Estimating Entropy over Data Streams, Lakshminath Bhuvanagiri and Sumit Ganguly, European Symposium on Algorithms, 2006.

“This data-structure is typically instantiated using standard synopsis structures, such as, COUNT-MIN (for estimating entropy).”

- [Clustering] Algorithms for dynamic geometric problems over data streams, Piotr Indyk, STOC 2004.

“Our Solution to this Problem uses hashing techniques akin to min-count sketches”
