cm-eclectics

"In this paper, we propose a memory, space, and time efficient framework to scale distributional
similarity to the web. We exploit sketch techniques, especially the Count-Min sketch, which approximates the frequency of an item in the corpus without explicitly storing the item itself. These methods use hashing to deal with massive amounts of the streaming text. We store all item counts computed from 90 GB of web data in just 2 billion counters (8 GB main memory) of CM sketch. Our method returns semantic similarity between word pairs in O(K) time and can compute similarity between any word pairs that are stored in the sketch. In our experiments, we show that our framework is as effective as using the exact counts."

Lossy Conservative Update (LCU) sketch: Succinct approximate count storage Amit Goyal and Hal Daum´e III, AAAI 2011
Considers a variety of sketch implementation issues for the same application.
From My Biased Coin and int main(): "The idea here is that the real problem with passwords is that some are too popular, making them easy to guess.  Providers respond by forcing users to choose passwords that pass certain rules -- you must have a capital and lower-case letter, you must have a number, etc.  These rules are somewhat arbitrary and don't directly tackle the significant problem of popularity.  Our paper is about how that can be done.  (As you might imagine, from my involvement, some Bloom filter variant -- the count-min filter in this case -- is part of the solution.) "
"With certain technical changes (involving the use of epsilon-biased hash functions instead of the pseudo-randomness used in the original work), the sketching techniques can be used to give a weak form of pan-private algorithms for estimating the number of times an item appears in a stream. This can also be achieved using Count-Min Sketch..."

  • [Machine Learning]  Hash Kernel. Qinfeng Shi, James Petterson, Gideon Dror, John Langford, Alex Smola, Alex Strehl.
"Firstly, we show that the sampling schemes of Kontorovich (2007) and Rahimi and Recht (2008) can be applied to a considerably larger class of kernels than originally suggested — the authors only consider languages and radial basis functions respectively. Secondly, we propose a biased approximation ¯. of . which allows efficient computations even on data streams. Our work is inspired by the count-min sketch of Cormode and Muthukrishnan (2004), which uses hash functions as a computationally efficient means of randomization. This affords storage efficiency (we need not store random vectors) and at the same time they give performance guarantees comparable to those obtained by means of random projections."

See also Introduction to Machine Learning by Alex Smola.
"In this paper, we will use a sketch-based approach to perform classification of data streams with massive domain-sizes. The idea is to create a sketch-based model which can approximately identify combinations of attributes which have high discriminatory power. This approximation is used for classification."
"The key idea of our algorithms is that we use lossy representation of large vectors either by rounding or sketching. Sketches are compact randomized data structures that enable approximate computation in low dimension. To be more precise, we adapt the Count-Min Sketch of Cormode and Muthukrishnan [7], which was primarily introduced for data stream computation. We use sketches for small space computation; in the same spirit Palmer et al. [25] apply probabilistic counting sketches to approximate the sizes of neighborhoods of vertices in large graphs. "
"Our proximity sketch effectively summarizes each row of P: P[x, *] using a count-min sketch . As a result, we provide the same probabilistic accuracy guarantee as the count-min sketch,..."
"... we resort to a very space-efficient hash-based approximation scheme to track entity frequencies. We employ a sketching technique called Count-Min Sketch (CM-Sketch) [12]."

"To successfully count pattern frequencies in a way that is both time and space efficient,
we use a count-min filter, also called a count-min sketch [9]."
“This data-structure is typically instantiated using standard synopsis structures, such as, COUNT-MIN (for estimating entropy).” 
Comments