Randomized Multi-pass Streaming Skyline Algorithms

Atish Das Sarma, Aswin Lall, Danupon Nanongkai, and Jun XuVLDB 2009 [wiki].

(Author names are in alphabetical order.)


From a theoretical point of view, we develop a streaming algorithm that computes skyline in about $\log n$ passes and $m$ space where $n$ is the number of points and $m$ is the size of the skyline. We also extend it to fixed-window and distributed versions. Our experiment shows that our algorithm is comparable to other algorithms on in the average case and significantly faster even when the datasets are perturbed by just sorting. 


We consider external algorithms for skyline computation without
pre-processing. Our goal is to develop an algorithm with a good
worst case guarantee while performing well on average. Due to the
nature of disks, it is desirable that such algorithms access the
input as a stream (even if in multiple passes). Using the tools of
randomness, proved to be useful in many applications, we present
an efficient multi-pass streaming algorithm, RAND,  for skyline
computation. As far as we are aware, RAND is the first
randomized skyline algorithm in the literature.

RAND is near-optimal for the streaming model, which we prove
via a simple lower bound. Additionally, our algorithm is
distributable and can handle partially ordered domains on each
attribute. Finally, we demonstrate the robustness of RAND
via extensive experiments on both real and synthetic datasets.
RAND is comparable to the existing algorithms in average case
and additionally tolerant to simple modifications of the data, while
other algorithms degrade considerably with such variation.