FAQ on the Count-Min Sketch

What is a "sketch"?

A sketch is a compact summary of a large amount of data.  Exact definitions vary, but we prefer to think of it as a small data structure that is a linear function of the input data, where this linear function has a very compact description.  This linearity is what provides many of the data structure's useful properties: for example, the sketches of two streams can be added entrywise to obtain the sketch of the combined stream.
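To make the linearity concrete, here is a minimal, illustrative CM sketch in Python (a sketch of the idea, not the canonical implementation; the hash family and parameters are chosen for illustration): summing the counter tables of two sketches built with the same hash functions gives exactly the sketch of the combined stream.

```python
import random

class CMSketch:
    """A minimal Count-Min sketch (illustrative): a depth x width array of
    counters, updated via pairwise-independent hashes of the form
    h(x) = ((a*x + b) mod p) mod width."""

    PRIME = 2**31 - 1  # Mersenne prime for the hash family

    def __init__(self, width, depth, seed=0):
        rng = random.Random(seed)
        self.width, self.depth = width, depth
        # One (a, b) pair per row of the table.
        self.hashes = [(rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
                       for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _cell(self, row, item):
        a, b = self.hashes[row]
        return ((a * item + b) % self.PRIME) % self.width

    def update(self, item, count=1):
        # Purely additive, so the whole structure is linear in the input.
        for row in range(self.depth):
            self.table[row][self._cell(row, item)] += count

    def query(self, item):
        # Point estimate: minimum over the item's counters (never too low).
        return min(self.table[row][self._cell(row, item)]
                   for row in range(self.depth))
```

Because updates only add to counters, adding the tables of two sketches (with identical hash functions) entry by entry yields the same table as sketching the concatenation of the two streams.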

How does CM sketch compare to a Bloom Filter?

The Bloom Filter is a data structure that compactly represents a set as a bitmap which is updated via hashing.  The main conceptual difference is that the CM sketch represents a multiset, and makes different assumptions about the kinds of updates.  More formally, the CM sketch summarizes a frequency distribution, while the Bloom Filter is concerned with representing which elements are present in a set.  Consequently, the Bloom Filter must be linear in the size of the set, whereas the CM sketch is typically sublinear in the size of the frequency vector.  Although "counting" variations of the Bloom Filter have been introduced, these are still concerned with sets rather than multisets.  Another difference lies in the assumptions on randomness: a tight Bloom Filter analysis assumes fully random hash functions (however, Section 3.3 of MV08 shows that low randomness suffices at some cost in space usage), while the analysis of the CM sketch requires only pairwise independence.
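For contrast, a minimal, illustrative Bloom Filter (the hashing scheme here is an arbitrary choice for the example): it records only set membership in a bitmap, so repeated insertions of the same element are indistinguishable and no frequency information survives.

```python
import hashlib

class BloomFilter:
    """A minimal Bloom filter (illustrative): a bitmap updated via hashing.
    Answers approximate set membership; it cannot report counts, and
    inserting the same item twice leaves the bitmap unchanged."""

    def __init__(self, num_bits, num_hashes):
        self.m, self.k = num_bits, num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item):
        # Derive k bit positions per item by salting a cryptographic hash
        # (just one convenient way to get independent-looking hashes).
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False positives are possible; false negatives are not.
        return all(self.bits[pos] for pos in self._positions(item))
```

Note that after `add("apple")` is called twice, the bitmap is identical to the state after a single call: the multiset information that the CM sketch retains is simply not represented here.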

How does CM sketch compare to Multistage Filter?

The multistage filter due to Estan and Varghese is conceptually quite similar to the CM sketch.  The main difference is that the CM sketch is analyzed under much weaker assumptions (limited-independence hash functions), and is shown to have many applications to other problems.  Estan and Varghese propose a "conservative update" mechanism, which improves accuracy for arrival-only sequences, but removes the linearity property, and hence the ability to process "departures" or negative updates.
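The conservative update rule can be sketched as follows (an illustrative variant, using the same toy hash family as a plain CM sketch): on an increment, each of the item's counters is raised only as far as the item's current point estimate plus the increment, rather than unconditionally.

```python
import random

class ConservativeCMSketch:
    """CM sketch variant with the Estan-Varghese conservative update
    (illustrative).  Counters are only raised as far as needed, which
    tightens estimates on increment-only streams, but the structure is
    no longer linear, so decrements ("departures") are unsupported."""

    PRIME = 2**31 - 1

    def __init__(self, width, depth, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.hashes = [(rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
                       for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        return [((a * item + b) % self.PRIME) % self.width
                for a, b in self.hashes]

    def update(self, item, count=1):
        cells = self._cells(item)
        # Current point estimate, plus the increment, is the target...
        target = min(row[c] for row, c in zip(self.table, cells)) + count
        # ...and each counter is raised only up to that target.
        for row, c in zip(self.table, cells):
            row[c] = max(row[c], target)

    def query(self, item):
        return min(row[c] for row, c in zip(self.table, self._cells(item)))
```

Estimates still never undercount on increment-only streams, but since `update` depends on the current table state, sketches can no longer be merged by addition.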

How does the CM sketch compare to the Count-Sketch?

The Count sketch (Charikar, Chen and Farach-Colton) provides guarantees relative to the L2 norm of the data frequencies, while the CM sketch gives guarantees relative to the L1 norm, at the expense of a worse dependence on epsilon (1/epsilon^2 instead of 1/epsilon).  Which of these is preferable depends on the characteristics of the data.  However, note that it is possible to extract a Count sketch from a CM sketch data structure (by taking the difference of adjacent entries), while the reverse direction is not possible.  A set of detailed experiments comparing the accuracy and speed of these sketches appears in "Methods for finding frequent items in data streams".  Rusu and Dobra conclude that the Count-Min sketch tends to outperform its theoretical worst-case bounds by a considerable margin, and gives better results than some other sketches for the problem of inner-product estimation.
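The extraction by differencing adjacent entries can be sketched as follows (an illustrative construction, assuming a CM sketch of even width): pairing each counter with its neighbour (cell index XOR 1), the difference of the pair behaves like a Count sketch counter, and taking the median over rows gives a Count-sketch-style estimate, which can be negative, unlike the CM minimum.

```python
import random
from statistics import median

PRIME = 2**31 - 1  # Mersenne prime for the toy hash family

def make_hashes(depth, seed=0):
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME))
            for _ in range(depth)]

def cm_sketch(stream, width, hashes):
    """Build a plain CM sketch table from (item, count) pairs."""
    table = [[0] * width for _ in hashes]
    for item, count in stream:
        for row, (a, b) in zip(table, hashes):
            row[((a * item + b) % PRIME) % width] += count
    return table

def count_sketch_estimate(table, hashes, width, item):
    """Count-sketch-style estimate recovered from a CM sketch of even
    width (illustrative): difference each of the item's counters with
    its adjacent partner (cell XOR 1), then take the median over rows."""
    ests = []
    for row, (a, b) in zip(table, hashes):
        c = ((a * item + b) % PRIME) % width
        ests.append(row[c] - row[c ^ 1])
    return median(ests)
```

Intuitively, each pair of adjacent CM cells plays the role of one signed Count sketch bucket, with the item's sign determined by the parity of the cell it hashed to.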

Why does the CM sketch yield a biased estimator?

The basic estimates produced by the CM sketch are biased.  That is, they are never less than the true value, and with high probability they exceed it by at most a fixed amount.  This one-sided error turns out to be a useful property in several applications.  If the bias is troubling, note that it is possible to generate unbiased estimators by subtracting an appropriately scaled quantity from the estimator.
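One way to carry out such a debiasing (an illustrative construction, not the only one): under pairwise-independent hashing a counter for an item of frequency f has expectation f + (N - f)/w, where N is the total count and w the width, so solving for f gives an unbiased per-row estimate, and averaging the rows preserves unbiasedness (taking the minimum would not).

```python
import random

PRIME = 2**31 - 1  # Mersenne prime for the toy hash family

def make_sketch(stream, width, depth, seed=0):
    """Build a plain CM sketch from (item, count) pairs (illustrative)."""
    rng = random.Random(seed)
    hashes = [(rng.randrange(1, PRIME), rng.randrange(PRIME))
              for _ in range(depth)]
    table = [[0] * width for _ in range(depth)]
    for item, count in stream:
        for row, (a, b) in zip(table, hashes):
            row[((a * item + b) % PRIME) % width] += count
    return table, hashes

def unbiased_estimate(table, hashes, width, total, item):
    """Debiased point estimate.  Each counter c has expectation
    f + (total - f)/width, so (width*c - total)/(width - 1) has
    expectation f; the rows are averaged to keep the result unbiased."""
    ests = []
    for row, (a, b) in zip(table, hashes):
        c = row[((a * item + b) % PRIME) % width]
        ests.append((width * c - total) / (width - 1))
    return sum(ests) / len(ests)
```

The trade-off is that the debiased estimate loses the one-sided guarantee: it can fall below the true frequency.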

When was CM sketch invented?

The ideas behind the data structure have a long history.  The sketch was first formalized in early 2003, with a technical report appearing in the DIMACS series in June 2003 (original technical report), with subsequent publication in LATIN 2004 and the Journal of Algorithms.  The precursor to the sketch was work on finding frequent items in data streams from 2002 ("What's hot and what's not"), which in turn was inspired by work on tracking histograms on data streams...

Is it a coincidence that 'CM' is also the initials of the authors?