Data Stream Algorithms

                     Notes from Barbados 09 Complexity Theory Meeting, by S. Muthukrishnan

The course covered:

1.  Data stream model. CM Sketch and its applications to point estimates, range sums, heavy hitters, quantiles, etc. These use 1/\eps log(1/\delta) space, get accuracy up to additive \eps X with probability at least 1-\delta, and rely on pairwise-independent hash functions. The same structure with 4-wise independent hash functions gives a relative-error (1+\eps) approximation for F_2 (the sum of squares of frequencies), and inner products with additive L_2 error guarantees, using (1/\eps^2) space. A minimal Count-Min sketch appears after this list.
2. Problems over inverse distributions, such as estimating the number of distinct elements, estimating the ratio of the number of unique elements to the number of distinct elements, etc. Problems like quantiles, heavy hitters, and point estimates can all be solved using 1/\eps^2 log(1/\delta) space, as in (1) above, by maintaining a distinct sample in the presence of inserts and deletes. A toy min-hash distinct sampler appears after this list.
3. Graph and geometric streams. (Andrew). Slides. Problems studied include counting the number of triangles, weighted bipartite matching, graph spanners, and the k-center problem. A nice overview of graph streaming results is in Graph Mining on Streams by Andrew McGregor.
4. Sparse approximation problems. Ease of approximation when the dictionary is orthonormal (via Parseval's identity), and hardness of the problem for arbitrary dictionaries. Weighted-norm sparse approximation. Haar wavelets on streams. Connections to compressed sensing. Since there was some interest in the audience in coding problems, I presented an explicit construction of compressed sensing measurements using k-(strongly) separating sets. The easy orthonormal case is illustrated after this list.
5. Multipass, random-order streams (Andrew). Slides. Median (quantiles) as an example to show that with random order one can get better accuracy than with worst-case order (from \pm \eps N to \pm \eps \sqrt{N}). Also extensions to multiple passes. Hardness via Index and pointer-chasing problems. Models for random-order lower bounds are tricky.
6. Linear-algebraic problems, such as approximating matrix products, least-squares approximation, and SVD. A small sketch-and-multiply example appears after this list.
7. Continuous monitoring problems. A central site outputs 1 when the total count F_1 of items seen at the distributed sites exceeds a threshold, and 0 otherwise. The cost is the total number of bits communicated. This is an example where the number of bits does not depend on k, the number of distributed sites. Other problems include monitoring F_0, F_2, etc.
8. MapReduce and distributed streaming. The MUD model and its connection to streaming. A toy MUD-style pipeline appears after this list.
  • On Distributing Symmetric Streaming Computations, by Feldman et al. One has to really read the original paper to understand the insights that went into building MapReduce.
9. Final Lecture: Summary + discussions. The areas that were not covered: geometric streams (coresets, k-median, bichromatic matching, etc.), strings (embeddings, longest increasing or common subsequences, etc.), probabilistic streams, windowed streams, and statistics and machine learning problems on streams. Andrew presented a few nuggets of results, including reductions from the Gap-Hamming problem to streaming problems, and entropy estimation. Faith suggested a Markovian extension to the probabilistic streams model which may be of interest.
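
To make item (1) concrete, here is a minimal Count-Min sketch in Python. It is a toy: for simplicity it seeds Python's built-in hash per row instead of drawing from a true pairwise-independent family such as ((a*x + b) mod p), and it uses the standard parameter choices w = e/\eps and d = ln(1/\delta).

```python
import math
import random

class CountMinSketch:
    """Toy Count-Min sketch: point_query(x) overestimates the true
    frequency of x by at most eps * (total count), with probability
    at least 1 - delta."""

    def __init__(self, eps, delta, seed=0):
        self.width = math.ceil(math.e / eps)          # counters per row
        self.depth = math.ceil(math.log(1 / delta))   # independent rows
        self.table = [[0] * self.width for _ in range(self.depth)]
        # One seed per row; a real implementation would draw each row's
        # hash from a pairwise-independent family, e.g. (a*x + b) mod p.
        rng = random.Random(seed)
        self.seeds = [rng.getrandbits(64) for _ in range(self.depth)]

    def _col(self, row, x):
        return hash((self.seeds[row], x)) % self.width

    def update(self, x, count=1):
        for r in range(self.depth):
            self.table[r][self._col(r, x)] += count

    def point_query(self, x):
        # Each row can only overestimate; the minimum is the estimate.
        return min(self.table[r][self._col(r, x)] for r in range(self.depth))

cms = CountMinSketch(eps=0.01, delta=0.01)
for item in ["a", "b", "a", "c", "a"]:
    cms.update(item)
print(cms.point_query("a"))   # 3 here; in general, >= the true count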
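
Item (2)'s distinct sampling can be illustrated with a toy min-hash sampler (insert-only here; supporting deletes, as in the lecture, takes more machinery). Each repetition keeps the item minimizing a hash value, which is a near-uniform sample from the distinct items seen so far; the repetitions sample with replacement, which is exactly the point of Eric's question in the discussion bullets below.

```python
import random

class MinHashSampler:
    """Per repetition, keep the item with the smallest hash value:
    a near-uniform sample from the *distinct* items seen so far.
    Repetitions sample with replacement (see the discussion below)."""

    def __init__(self, repetitions, seed=0):
        rng = random.Random(seed)
        self.seeds = [rng.getrandbits(64) for _ in range(repetitions)]
        self.best = [(float("inf"), None)] * repetitions  # (hash, item)

    def update(self, x):
        for i, s in enumerate(self.seeds):
            h = hash((s, x))
            if h < self.best[i][0]:
                self.best[i] = (h, x)

    def samples(self):
        return [item for _, item in self.best]
```

If the stream contains a single unique item, every repetition returns that item; we get log(1/\delta) copies rather than distinct samples, which suffices for the claims.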
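
For item (4), the "easy" orthonormal case is a one-screen calculation: by Parseval, the best k-term approximation over an orthonormal dictionary keeps the k largest-magnitude coefficients, and the squared error is exactly the energy of the dropped coefficients. The sketch below uses an arbitrary orthonormal basis from a QR decomposition as a stand-in for the (orthonormal) Haar wavelet basis.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 64, 8

# Any orthonormal dictionary works; QR of a random matrix stands in
# for the Haar wavelet basis here.
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))

x = rng.standard_normal(n)
coeffs = Q.T @ x                       # analysis: exact by orthonormality

# By Parseval, the best k-term approximation keeps the k largest
# |coefficients|; its squared error is the sum of the dropped coeffs^2.
idx = np.argsort(-np.abs(coeffs))[:k]
approx = Q[:, idx] @ coeffs[idx]

dropped = np.delete(coeffs, idx)
assert np.isclose(np.sum((x - approx) ** 2), np.sum(dropped ** 2))
```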
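
For item (6), here is a minimal instance of the sketch-and-multiply idea (a random sign sketch, which is one standard route, not necessarily the exact algorithm covered): compress the long dimension with a k x n sign matrix S scaled so that E[S^T S] = I, and multiply the small sketched matrices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 5_000, 20, 400

A = rng.standard_normal((n, d))
B = rng.standard_normal((n, d))

# Random sign sketch, scaled so E[S^T S] = I; then (S A)^T (S B)
# is an unbiased estimator of A^T B with small Frobenius-norm error.
S = rng.choice([-1.0, 1.0], size=(k, n)) / np.sqrt(k)
estimate = (S @ A).T @ (S @ B)

exact = A.T @ B
err = np.linalg.norm(estimate - exact) / np.linalg.norm(exact)
print(f"relative Frobenius-norm error: {err:.3f}")
```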
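
And for item (8), the MUD shape in miniature (a toy illustration, not the paper's formalism): a local streaming computation per shard, a commutative and associative merge of partial states, and a cleanup step producing the final output. Note, per Faith's question below, that only the cleaned-up output is required to be a symmetric function of the input, not the intermediate states.

```python
from collections import Counter
from functools import reduce

def local(stream):            # per-machine streaming pass
    return Counter(stream)

def merge(s1, s2):            # commutative, associative merge of states
    return s1 + s2

def cleanup(state):           # final output: the heaviest item
    return state.most_common(1)[0]

shards = [["a", "b", "a"], ["c", "a"], ["b", "c", "a"]]
state = reduce(merge, (local(s) for s in shards))
print(cleanup(state))         # ('a', 4), independent of shard order
```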
Some questions and discussions that arose during the week:
  • The analysis was always in terms of the number of words, where one assumes a word can store the size of the stream or the numbers that arise from the computations of interest. This was distracting to many, who cautioned that long words can be used to stuff bits and do unusual computations. In the upper bounds, the algorithms shown all behave properly without word gymnastics, and in the lower bounds we count bits (not words), so we circumvent these dangers.

  • In the windowed stream model, Mario said the quantifiers were not clear. The probability we are estimating is, for a given stream and over the choice of hash function, the probability that a *single query* gives the correct answer. It does not matter how many windows one sees, contrary to the fear that, with an exponential number of windows, one could hit all the bad hash-function choices. If we need more samples, we have to ask independent queries and use independent hash functions. Since Mario kept working this over in his mind during the evening, I sketched the solution the following day from: Estimating Rarity and Similarity on Data Stream Windows, by M. Datar and S. Muthukrishnan. Faith pointed out that the space guarantee is in the expected sense.

  • When I presented min-hash sampling, Eric mentioned that when only one item is seen, the different hash functions will return the same item as the sample. In other words, we won't get log(1/\delta) different samples. Indeed, since there is only one unique item, we will get log(1/\delta) copies of that item. In other words, we are sampling with replacement. That suffices for our claims.

  • Faith asked whether symmetric functions are automatically symmetric on subsets of the data. In our MUD work, we had a cleanup operator that produced the final output, which is a symmetric function. The intermediate states of the algorithm need not be the same for two permuted inputs. As a result, we could not apply the symmetry assumption to subsets of the input data.

  • Rocco asked if there was a connection between streaming and property testing beyond the joint Dagstuhl workshop on "sublinear algorithms". I do not know of a formal connection (such as property-testing algorithms being simulatable in streaming models). Even nonadaptive property-testing algorithms may be hard to simulate via streaming, given the nature of their sampling. There may be connections at the level of individual algorithms; for example, I used a property-testing algorithm of Dana Ron et al. for F_2 at some point. Andrew announced a streaming workshop at DIMACS later in March, and in a fantastic coincidence, Mario announced a property testing workshop at DIMACS shortly after that. The two groups were not aware of each other's meetings!

  • Toni asked if there was a study of privacy via CM sketches, perhaps using differential privacy. I pointed out "Secure Multiparty Computation of Approximations," by Feigenbaum, Ishai, Malkin, Nissim, Strauss, and Wright, ACM Transactions on Algorithms 2 (2006). More forthcoming.
  • Faith said a bunch of asynchronous distributed algorithms from the PODC community (some even in books!) may be useful in MapReduce.

  • I gave the puzzle of sorting n integers, each fitting in a log n-bit word, in place. The solution appears in: Radix Sorting with No Extra Space, by G. Franceschini, S. Muthukrishnan, and M. Patrascu. Nicole asked me how the final merge step works. Two arrays can be merged in place in linear time using comparisons, even stably, as shown in the paper by Salowe and Steiger cited in this reference. Maybe one can do it more simply for this special case. For contrast, a standard counting sort, which achieves the time bound but not the space bound, is sketched below.
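
The contrast mentioned above: since each key fits in a log n-bit word, keys lie in [0, n), so counting sort achieves O(n) time easily, but the textbook version uses O(n) extra space. The puzzle (and the paper) is to match the time bound in place.

```python
def counting_sort(a):
    """Sorts n integers drawn from [0, n) -- i.e., log n-bit keys --
    in O(n) time, but with O(n) extra space. The puzzle is to match
    the time bound with no extra space, as in the paper above."""
    n = len(a)
    count = [0] * n
    for x in a:
        count[x] += 1
    out = []
    for value, c in enumerate(count):
        out.extend([value] * c)
    return out

print(counting_sort([3, 1, 4, 1, 5, 2, 6, 5]))  # [1, 1, 2, 3, 4, 5, 5, 6]
```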
 

Reading material: Book PDF

 

Overall comments:

  • The participants scribed notes. Maybe they will be available in the future, and hopefully will contain the many HW exercises and the various anecdotes. :)

  • My personal recollections: I was happy to reconnect with this material after a few years, and I used this opportunity to read recent results and make notes. This should help with the follow-up book I am working on with Andrew.
  • My regret, if any, is that I could not finish the course the way I wanted: having the participants speak for 30 minutes or so on what they got out of the meeting, the results and questions that stayed in their minds, the new questions they formulated in their own research areas inspired by the material in the lectures, etc. Most of the audience was from complexity or combinatorics research and the material presented was quite new, so this would have been the right way to end the course. As it turned out, the course became a monologue; the dialogue did not happen (yet).
  • Advice to future lecturers: the facilities are basic (with respect to shared rooms, food, etc.). It gives you a week away from things, so you can focus and get things done (prepare lecture notes, design HW problems, write chapters of a book, etc. Noam Nisan did these lectures some time ago and used them to write his book on Communication Complexity). For the participants too, it is time away from family, faculty responsibilities, etc., so they can focus on catching up on their research. It is unreasonable to expect them to read papers, do HWs, or solve problems in research areas away from their expertise. Denis has a great solution: get the lecturers to suggest some participants; these will presumably be experts on the lecture topic, so embedding such experts in the audience may work well to draw others out. Audiences, seasoned researchers or not, behave like students for a week (e.g., they talk among themselves during the lecture), which means you have to do crowd control!
  • This is pure fun. Some of the questions I was asked, with an incredulous look on the face: "This is *your* result?", "What is the *main result* in your paper?", etc. I am still trying to decipher those. ;)
  • Finally, I had one day to explore Barbados. I spent it on the beach and in the water. Discovered a new tea place that used rooibos tea to make "espresso" substitutes. Also, it was heartening to see that the worldwide standard for pizza is "NY-style pizza".