CSCE6350: Advanced Database Systems, Spring 2009


  Instructor  

        Dr. Yan Huang


  Group Members  

 Ning (Martin) Xu
martinxu9@gmail.com
 Yao Shenshenyao1016@hotmail.com


 
  Project : a survey paper  

        Topic                          Clustering in Data Stream

        Proposal                    
click here to view the proposal ...

        Survey Paper            click here to view the document ...

        Presentation             click here to view the slides ...


  Paper Analysis A  

        Topic                          Clustering in Stream Data Management

        References      
                                    [1]
     C. Jia, C. Tan, A. Yong, "A Grid and Density-Based Clustering Algorithm for Processing Data Stream," Genetic and Evolutionary
                                             Computing, 2008. WGEC '08. Second International Conference on, vol., no., pp.517-521, 25-26 Sept. 2008

                                    [2]
     D. K. Tasoulis, N. M. Adams, and D. J. Hand, "Unsupervised Clustering In Streaming  Data," Data Mining Workshops, 2006. ICDM
                                             Workshops 2006. Sixth IEEE International Conference on, vol., no., pp.638-642, Dec. 2006

                                    [3]
     D. Tasoulis and M. Vrahatis, "Unsupervised clustering on dynamic databases," Pattern Recognition Letters, 26(13):2116–2127,
                                             2005.

                                    [4]
     M. Motoyoshi, T. Miura, and I. Shioya, "Clustering Stream Data by Regression Analysis," In Proc. Australasian Workshop DMWI
                                             2004
, Dunedin, New Zealand. CRPIT, 32. Purvis, M., Ed. ACS. 115-120.

                                    [5]
     S. Guha, A. Meyerson, N. Mishra, R. Motwani, and L. O'Callaghan, "Clustering Data Streams: Theory and Practice," IEEE Trans.
                                             Knowl. Data Eng.
, vol. 15, 2003.
                                                
                                    [6]
     C. C. Aggarwal, J. Han, J. Wang, and P. Yu, "A Framework for Clustering Evolving Data Streams," VLDB Conference, 2003.

                                    [7]
     L. O'Callaghan, N. Mishra, A. Meyerson, S. Guha, and R. Motwani, "Streaming-Data Algorithms For High-Quality Clustering," In
                                             Proc. ICDE 2002
, Feb. 2002.
                                           
                                    [8]
     S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan, "Clustering Data Streams," In Proc. FOCS 2000, Nov. 2000.

        Summary           click here to view the slides ...

                                    [6] Traditional data clustering methods are not applicable for data streams since stream data may have infinite length and evolve over
                                          time. In order to deal with this scenario, the authors divided the clustering process into two components: one is online micro-
                                          clustering component which requires efficient process for storing summary statistics, the other is offline macro-clustering component
                                          which uses the summary statistics to provide clusters as per user-requirement. In this paper, the authors combine micro-clustering                                                      approach with the concept of pyramidal time frame which stores micro-clusters at snapshots in time which follow a pyramid pattern.
                                          The experiment result shows that this approach is both efficient and effective.

                                    [7] The authors present a LOCALSEARCH algorithm called to solve the k-Median problem by performing a binary search within a range
                                          between 0 and an easy-to-calculate upper bound. LOCALSEARCH is then transformed into one suitable for use on streams by
                                          removing the requirement of random access. Assuming data arrives in chunks fit in the main memory, any stream can be turned into
                                          trunks by waiting for enough points to arrive. Each chunk is re-represented as a data set of distinct points with weights equal to their
                                          original frequency in the chunk assigned. LOCALSEARCH is applied on the chunks individually to obtain weighted centers, on which
                                          a second round of LOCALSEARCH is performed to obtain the centers for the entire stream. The authors show that the streaming
                                          algorithm decrease the asymptotic running time and drastically cutting the memory usage, with theoretical quality guarantees.


                                    [8] The authors study a class of clustering algorithm in an attempt to maintain a consistently good clustering of the sequence observed
                                          so far using little memory and time under the data stream model of computation. The Small-Space algorithm examines the data in
                                          a piecemeal fashion, i.e., divide the data into pieces, cluster each of these pieces and then again cluster the obtained centers which
                                          are weighted by the number of points closer to it than to any other center. Another algorithm, Smaller-Space, is evaluated, which
                                          repeatedly reclusters weighted centers instead of reclustering only once in Small-Space. The proposed algorithms achieve much
                                          less memory usage with a tradeoff of sacrificing somewhat the quality of the clustering approximation.


  Paper Analysis B  

        A. Arasu and J. Widom, "Resource Sharing in Continuous Sliding-Window Aggregates," In Proc. VLDB 2004, Sept. 2004

    
    
click here to view the slides ...

        click here to view the paper critique ...

        click here to view the questions ...

    
  
  Discussion on Reading Materials  

        [2]   D. Abadi, Y. Ahmad, M. Balazinska, U. Çetintemel, M. Cherniack, J.-H. Hwang, W. , Lindner, A. Maskey, A. Rasin, E. Ryvkina, N. Tatbul, Y. Xing, and                       Stan Zdonik, "The Design of the Borealis Stream Processing Engine," In proceedings of the 2nd Biennial Conference on Innovative Data Systems                       Research (CIDR'05),  Asilomar, CA, January 2005

        Q1: How did they solve the server heavy optimization problem?
        A1: By means of a distributed system.

        Q2: How does QoS-based resource allocation and optimization really work?
        A2:
With the help of a Vector of Metrics (VM) that includes content-related or performance related properties.



        [3]   Arasu, S. Babu and J. Widom, "The CQL Continuous Query Language: Semantic Foundations and Query Execution," in VLDB Journal, 2005.

        Q1: Difference between data operators and system operators mentioned in the implementation?
        A1: Data operators perform stream-to-relation, relation-to-relation, relation-to-stream data processing; system operators handle low level issues such
               as asynchronous and out-of-order arrival of streams, load sheddding, and external stream data formats.

        Q2: Difference between the three classes of strem-to-relation operators?
        A2: (1)
Time-based: size of sliding window specified by time interval
               (2) Tuple-based: size of sliding window specified by the number of tuples ordered by timestamps
               (3) Partition-based: logically partitions the stream into substreams, and perform tuple-based sliding window




        [4]   Babcock, Brian and Babu, Shivnath and Datar, Mayur and Motwani, Rajeev, "Chain: Operator Scheduling for Memory Minimization in Data Stream                       Systems," In ACM International Conference on Management of Data (SIGMOD 2003), June 9-12, 2003

        Q1: Why traditional greedy algorithm does not work for minimizing the memory?
        A1: Scheduling the highest-priority available operator seems to be able to minimize the memory usage, however, the higher-priority operator that follows
               the lower-priority operator rarely runs. The lower-priority operator will be unable to get scheduled. Thus, there is no input available for the higher-priority
               operator.

        Q2: Why chain scheduling algorithm works for minimizing the memory?
        A2
: Chain scheduling algorithm calculates lower envelope and defines the slope of lower envelope segment as the priority of the operator. Then by
               always scheduling the highest-priority operator
available, the algorithm can reach the near-optimum.

    
        [5]  D. Carney, U. Cetintemel, A. Rasin, S. Zdonik, M. Cherniack, M. Stonebraker, "Operator Scheduling in a Data Stream Manager," In proceedings
              of the 29th International Conference on Very Large Data Bases (VLDB'03), Berlin, Germany, September 2003.

        Q1: What are the differences between Aurora's state-based execution model and the traditional thread-based execution model for structuring
               database servers?
        A1:
Thread-based: a thread is assigned to each query or operator. This model suffers from significant overhead when the number of threads becomes
               large, though easy to program;
State-based: a single scheduler thread monitors system state and maintains execution queue, shared by a small
               number of worker threads responsible for executing queue entries. It enables effective batching thereby cutting down the scheduling and execution
               overhead.

       
        Q2: How does the two-level scheduling work in operator batching?
        A2: A sequence of boxes is grouped together as a superbox to be scheduled and executed atomically. The first-level scheduling identifies one
               potential superbox for each query tree statically before run time, after which the second level specifies the ordering of component boxes based
               on throughput, latency, and memory consumption performance metrics before processing them within the superbox.


        [6]  C. Olston, J. Jiang, and J. Widom. “Adaptive Filters for Continuous Queries over Distributed Data Streams,” In Proc. of SIGMOD 2003, June 2003

        
Q1: Why can't we take bound width allocation as uniform allocation?
        A1: Uniform allocation has poor performance in reducing communication cost. It cannot account for the conditions that CQs can have overlapping sets
               of objects and have different constraints. Moreover, data values of objects change at different rates, thus, uniform allocation is not applicable.

        Q2: How to overcome the latency problem?
        A2: Using serializing queues and symmetric hash join. In serializing queses, orders updates by timestamps. Larger upper bound latency receives
               more update with more delay. In symmetric hash join, value update and width update are combined to achieve bound update.

        [8]  R. H. Guting, T. Behr, V. T. De Almeida, Z. Ding, F. Hoffmann, and M. Spiekermann, "SECONDO: An Extensible DBMS Architecture and
              Prototype," Fernuniversität Hagen, Informatik-Report 313, March 2004.

        Q1:
        A1:

        Q2:
        A2:

Attachments (8)

  • advdb_discussion_xu_shen.pdf - on Mar 24, 2009 7:23 PM by Martin Xu (version 13 / earlier versions)
    74k Download
  • advdb_papercritique_xushen.pdf - on Mar 23, 2009 12:51 PM by Martin Xu (version 7 / earlier versions)
    35k Download
  • advdb_ppt_03242009_xu_shen.pdf - on Mar 24, 2009 7:47 AM by Martin Xu (version 1)
    481k Download
  • ppt_2.24.2009_xu.shen.pdf - on Apr 7, 2009 4:44 PM by Martin Xu (version 3 / earlier versions)
    91k View Download
  • ppt_4.28.2009_xu.shen.pdf - on Apr 28, 2009 12:15 PM by Martin Xu (version 3 / earlier versions)
    957k View Download
  • ppt_4.28.2009_xu.shen.pptx - on Apr 28, 2009 12:15 PM by Martin Xu (version 2 / earlier versions)
    138k Download
  • survey_final_xu_shen.pdf - on Apr 27, 2009 8:06 PM by Martin Xu (version 1)
    113k Download
  • survey_proposal_xu_shen.pdf - on Apr 27, 2009 4:01 PM by Martin Xu (version 7 / earlier versions)
    39k Download