Assignments‎ > ‎

Assignment 2

posted ‎‎Oct 17, 2008 12:42 AM‎‎ by Beth Trushkowsky   [ updated ‎‎Oct 26, 2008 10:07 PM‎‎ ]

Project Overview

In Project 1, you studied how to change the page replacement policy of the PostgreSQL buffer manager. In this project you will move to a higher level in the system and add functionality to the PostgreSQL query executor. We will restrict our focus to grouped aggregates. This project will be considerably more complex than Project 1, both in terms of the amount of coding involved and in understanding existing code. The major parts of the project are:
  1. Gaining a “big picture” understanding of a query executor
  2. Examining and understanding existing code
  3. Implementing a new query operator
The grouped aggregate COUNT() that you've seen in lecture assumes that all elements for the column(s) specified will be counted and returned  For many cases however, you only actually want the groups with the k-highest counts.  One example would be a query that lists the 100 most popular videos on YouTube.  

Your task in this project is to implement a new grouped aggregate called APPROX_COUNT() that saves space in memory by only keeping track of a fixed number of elements to maintain an approximate count of the most frequent values. You will experiment with the amount of available memory and observe how your APPROX_COUNT aggregate performs compared to the traditional COUNT. You've seen in lecture saw how grouped aggregates can be implemented with sorting as well as hashing. In this project, you will be implementing your new aggregate operator using the hashing technique.

Administrivia

  • You may work on this project in teams of two
  • The project will be due October 29th.
  • Since understanding the code is an important part of this project, the TAs and Professor will not assist you in understanding the existing code beyond what is discussed here.

Description of APPROX_COUNT

The approximate count algorithm saves space by keeping track of the frequency of only the most popular items, instead of all items in the system.  Pseudocode for the approximate count aggregate is shown below.

Algorithm Pseudocode

Algorithm: 

Set up a structure to hold
elements (e_1, e_2,...e_m) and counts (count_1, ... count_m)
where m is the number of slots specified by the user.

per_input_tuple(e)
If e is in the structure as e_i,
increment count_i;
else{ let e_m be the element with least hits
Replace e_m with e;
Increment count_m;
}
end;
Your implementation will be a little different as the system automatically maintains a hash table of elements that are in your structure and tells you where e_i is located, if it exists already.

If you're interested in more details, see the original paper: 
http://www.cs.ucsb.edu/~dsl/publications/2005/ICDT2005-metwally.pdf

Setup

Clean-up old versions

In order to avoid going over quota, you will want to start by deleting the large directories we used in the last project (~/pgdata, /home/tmp/<USER>/postgres-8.0.3).  This can be done with 'rm -rf <file>'.  If that still prompts you for every file to remove, try typing 'unalias rm' first.

Download Postgres-8.3.4

$ cd /home/tmp/$USER
$ wget http://ftp7.us.postgresql.org/pub/postgresql/source/v8.3.4/postgresql-8.3.4.tar.gz
$ gunzip postgresql-8.3.4.tar.gz
$ tar xvf postgresql-8.3.4.tar

Apply the Patch We Provide

To get you started, we made many modifcations to Postgres.  All of these can be made on your source tree by applying a diff (a diff is a collection of line changes in a special format).  It is applied using the patch program.  Do this on icluster.eecs.berkeley.edu.
$ patch -p1 < ~cs186/fa08/hw2/approx_agg.diff

Build and Install Postgres 

Please refer to Assignment 1 for details on how to build and install postgres from source.

What to Implement

You will be modifying code in two files:
./src/backend/executor/nodeAgg.c (most of your code will be here) 
./src/include/nodes/execnodes.h

You must add any data structures that you think you will need. 
You will need to add code to some existing functions:
  • void approx_agg_per_input(AggState *aggstate, AggHashEntry entry, bool entry_is_new)
    • This will be called once for each tuple
    • You will need to take the data from entry and use it to update the state you are keeping in aggstate
    • If the entry is new, that means its either something you haven't seen before, or something you have seen, but evicted.  This means you will either need to make a new bucket for it, or if you are out of memory buckets, evict something.
    • If you evict an entry, you will need to delete it from the hash table as follows:
      • remove_hash_entry(aggstate, old_entry)
  • TupleTableSlot *agg_retrieve_approx(AggState *aggstate)
    • This is the function that is called after all the input has been passed to approx_agg_per_input.  It returns a TupleTableSlot with the results.  We have provided most of the code for this function
  • AggHashEntry approx_advance_iter(AggState *aggstate)
    • This function is called by agg_retrieve_approx each time we are ready for a new result tuple.
You will also need to add any additional functions that you deem necessary.

Helpful Tips

  • Looking through the diff will give you a good idea of what changes we made to the code to make this project possible.  While you don't need to understand all of it, a basic knowledge of what goes where could be very helpful
  • Parts where you will need to insert code have been marked with /* CS186: */ comments.  Try 'fgrep -rn CS186 *' from the postgres directory
  • There is a doubly-linked list implementation in src/backend/lib/dllist.c that you may find useful.
  • In agg_retrieve_approx, the setting of your count to aggvalues[] needs to be of type Int64Datum. This means that the data type of your counts should be int64, which is just a 64-bit integer. An Int64Datum is a wrapper that either stores the value of the int64 or a pointer to a Datum object that stores it.
  • We haven't discussed the concept of contexts in Postgres, but just know that to allocate and free memory, use palloc and pfree

Testing Implementation

Test Data

We have provided you with two datafiles in ~cs186/fa08/hw2/{data1,data2}.  If you have trouble accessing that path to the data files, try the absolute path: /home/ff/cs186/fa08/hw2/. Each contains a list of numbers with different distributions.  Feel free to explore by running queries over this data using the real COUNT() aggregate to get a feel for what they look like.  Create a Test database before running these queries, like in the first assignment.
test=# CREATE TABLE numbers (num int);
test=# COPY numbers FROM '<path to data file>' DELIMITERS ',';

Queries to Run

Run the following query for each dataset. Record your results for each (for your own reference for the discussion questions).
SELECT num, COUNT(*) FROM numbers GROUP BY num ORDER BY COUNT(*) DESC LIMIT 10;
Next, run the following query for each dataset using the approximate aggregate and vary the memory spaces to 10, 50 and 100.
SELECT num, APPROX_COUNT(10, mem) FROM numbers GROUP BY num;
Save the results to these queries as data<file_number>.<mem_size>.txt for submission. 

Discussion

In two-three sentences each, discuss your answers to the following:
  1. For which of the datafiles does your APPROX_COUNT opperator work well?  When doesn't it work?  What characteristics of the data explain this behavior?
  2. What effect does varying the memory size have on the accuracy?  What memory size worked best for each datafile?  Why?
Save your responses in discussion.txt.

What to turn in

  1. nodeAgg.c
  2. execnodes.h
  3.  6 files in the form: data<file number>.<mem size>.txt - Table of query run results
  4. discussion.txt - Answers to the discussion questions, TEXT ONLY!
  5. MY.PARTNERS - Listing the other person that you worked with

How to Submit

  1. Save your files in a directory called "hw2" within your cs186 home directory.
  2. cd into hw2
  3. Run: submit hw2