CSCE6350: Advanced Database Systems, Spring 2009 Instructor Dr. Yan Huang Group Members
Project : a survey paper Topic Clustering in Data Stream Proposal click here to view the proposal ... Survey Paper click here to view the document ... Presentation click here to view the slides ... Paper Analysis A Topic Clustering in Stream Data Management References [1] C. Jia, C. Tan, A. Yong, "A Grid and Density-Based Clustering Algorithm for Processing Data Stream," Genetic and Evolutionary Computing, 2008. WGEC '08. Second International Conference on, vol., no., pp.517-521, 25-26 Sept. 2008 [2] D. K. Tasoulis, N. M. Adams, and D. J. Hand, "Unsupervised Clustering In Streaming Data," Data Mining Workshops, 2006. ICDM Workshops 2006. Sixth IEEE International Conference on, vol., no., pp.638-642, Dec. 2006 [3] D. Tasoulis and M. Vrahatis, "Unsupervised clustering on dynamic databases," Pattern Recognition Letters, 26(13):2116–2127, 2005. [4] M. Motoyoshi, T. Miura, and I. Shioya, "Clustering Stream Data by Regression Analysis," In Proc. Australasian Workshop DMWI 2004, Dunedin, New Zealand. CRPIT, 32. Purvis, M., Ed. ACS. 115-120. [5] S. Guha, A. Meyerson, N. Mishra, R. Motwani, and L. O'Callaghan, "Clustering Data Streams: Theory and Practice," IEEE Trans. Knowl. Data Eng., vol. 15, 2003. [6] C. C. Aggarwal, J. Han, J. Wang, and P. Yu, "A Framework for Clustering Evolving Data Streams," VLDB Conference, 2003. [7] L. O'Callaghan, N. Mishra, A. Meyerson, S. Guha, and R. Motwani, "Streaming-Data Algorithms For High-Quality Clustering," In Proc. ICDE 2002, Feb. 2002. [8] S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan, "Clustering Data Streams," In Proc. FOCS 2000, Nov. 2000. Summary click here to view the slides ... [6] Traditional data clustering methods are not applicable for data streams since stream data may have infinite length and evolve over time. In order to deal with this scenario, the authors divided the clustering process into two components: one is online micro- clustering component which requires efficient process for storing summary statistics, the other is offline macro-clustering component which uses the summary statistics to provide clusters as per user-requirement. In this paper, the authors combine micro-clustering approach with the concept of pyramidal time frame which stores micro-clusters at snapshots in time which follow a pyramid pattern. The experiment result shows that this approach is both efficient and effective. [7] The authors present a LOCALSEARCH algorithm called to solve the k-Median problem by performing a binary search within a range between 0 and an easy-to-calculate upper bound. LOCALSEARCH is then transformed into one suitable for use on streams by removing the requirement of random access. Assuming data arrives in chunks fit in the main memory, any stream can be turned into trunks by waiting for enough points to arrive. Each chunk is re-represented as a data set of distinct points with weights equal to their original frequency in the chunk assigned. LOCALSEARCH is applied on the chunks individually to obtain weighted centers, on which a second round of LOCALSEARCH is performed to obtain the centers for the entire stream. The authors show that the streaming algorithm decrease the asymptotic running time and drastically cutting the memory usage, with theoretical quality guarantees. [8] The authors study a class of clustering algorithm in an attempt to maintain a consistently good clustering of the sequence observed so far using little memory and time under the data stream model of computation. The Small-Space algorithm examines the data in a piecemeal fashion, i.e., divide the data into pieces, cluster each of these pieces and then again cluster the obtained centers which are weighted by the number of points closer to it than to any other center. Another algorithm, Smaller-Space, is evaluated, which repeatedly reclusters weighted centers instead of reclustering only once in Small-Space. The proposed algorithms achieve much less memory usage with a tradeoff of sacrificing somewhat the quality of the clustering approximation. Paper Analysis B A. Arasu and J. Widom, "Resource Sharing in Continuous Sliding-Window Aggregates," In Proc. VLDB 2004, Sept. 2004 click here to view the slides ... click here to view the paper critique ... click here to view the questions ... Discussion on Reading Materials [2] D. Abadi, Y. Ahmad, M. Balazinska, U. Çetintemel, M. Cherniack, J.-H. Hwang, W. , Lindner, A. Maskey, A. Rasin, E. Ryvkina, N. Tatbul, Y. Xing, and Stan Zdonik, "The Design of the Borealis Stream Processing Engine," In proceedings of the 2nd Biennial Conference on Innovative Data Systems Research (CIDR'05), Asilomar, CA, January 2005 Q1: How did they solve the server heavy optimization problem? A1: By means of a distributed system. Q2: How does QoS-based resource allocation and optimization really work? A2: With the help of a Vector of Metrics (VM) that includes content-related or performance related properties. [3] Arasu, S. Babu and J. Widom, "The CQL Continuous Query Language: Semantic Foundations and Query Execution," in VLDB Journal, 2005. Q1: Difference between data operators and system operators mentioned in the implementation? A1: Data operators perform stream-to-relation, relation-to-relation, relation-to-stream data processing; system operators handle low level issues such as asynchronous and out-of-order arrival of streams, load sheddding, and external stream data formats. Q2: Difference between the three classes of strem-to-relation operators? A2: (1) Time-based: size of sliding window specified by time interval (2) Tuple-based: size of sliding window specified by the number of tuples ordered by timestamps (3) Partition-based: logically partitions the stream into substreams, and perform tuple-based sliding window [4] Babcock, Brian and Babu, Shivnath and Datar, Mayur and Motwani, Rajeev, "Chain: Operator Scheduling for Memory Minimization in Data Stream Systems," In ACM International Conference on Management of Data (SIGMOD 2003), June 9-12, 2003 Q1: Why traditional greedy algorithm does not work for minimizing the memory? A1: Scheduling the highest-priority available operator seems to be able to minimize the memory usage, however, the higher-priority operator that follows the lower-priority operator rarely runs. The lower-priority operator will be unable to get scheduled. Thus, there is no input available for the higher-priority operator. Q2: Why chain scheduling algorithm works for minimizing the memory? A2: Chain scheduling algorithm calculates lower envelope and defines the slope of lower envelope segment as the priority of the operator. Then by always scheduling the highest-priority operator available, the algorithm can reach the near-optimum. [5] D. Carney, U. Cetintemel, A. Rasin, S. Zdonik, M. Cherniack, M. Stonebraker, "Operator Scheduling in a Data Stream Manager," In proceedings of the 29th International Conference on Very Large Data Bases (VLDB'03), Berlin, Germany, September 2003. Q1: What are the differences between Aurora's state-based execution model and the traditional thread-based execution model for structuring database servers? A1: Thread-based: a thread is assigned to each query or operator. This model suffers from significant overhead when the number of threads becomes large, though easy to program; State-based: a single scheduler thread monitors system state and maintains execution queue, shared by a small number of worker threads responsible for executing queue entries. It enables effective batching thereby cutting down the scheduling and execution overhead. Q2: How does the two-level scheduling work in operator batching? A2: A sequence of boxes is grouped together as a superbox to be scheduled and executed atomically. The first-level scheduling identifies one potential superbox for each query tree statically before run time, after which the second level specifies the ordering of component boxes based on throughput, latency, and memory consumption performance metrics before processing them within the superbox. [6] C. Olston, J. Jiang, and J. Widom. “Adaptive Filters for Continuous Queries over Distributed Data Streams,” In Proc. of SIGMOD 2003, June 2003 Q1: Why can't we take bound width allocation as uniform allocation? A1: Uniform allocation has poor performance in reducing communication cost. It cannot account for the conditions that CQs can have overlapping sets of objects and have different constraints. Moreover, data values of objects change at different rates, thus, uniform allocation is not applicable. Q2: How to overcome the latency problem? A2: Using serializing queues and symmetric hash join. In serializing queses, orders updates by timestamps. Larger upper bound latency receives more update with more delay. In symmetric hash join, value update and width update are combined to achieve bound update. [8] R. H. Guting, T. Behr, V. T. De Almeida, Z. Ding, F. Hoffmann, and M. Spiekermann, "SECONDO: An Extensible DBMS Architecture and Prototype," Fernuniversität Hagen, Informatik-Report 313, March 2004. Q1: A1: Q2: A2: |