Assignments‎ > ‎

Assignment 3: PostgreSQL Query Optimizer

posted ‎‎Oct 27, 2008 10:41 AM‎‎ by akshay krishnamurthy   [ updated ‎‎Nov 7, 2008 12:36 PM‎‎ ]

In this homework you will carry out a number of exercises involving the optimization of relational queries using the PostgreSQL optimizer and the visualization command EXPLAIN. You will not have to write any code. This assignment has to be done individually.

There are four parts to this assignment:

  • Reading some PostgreSQL documentation
  • Setting up the database, loading the tables, creating indexes
  • Answering questions by observing the query plans generated by PostgreSQL
  • Writing and running some queries of your own.

General Instructions

  • Please spend some time to read this assignment completely before beginning.
  • You can get all the files for this assignment at ~cs186/Hw3.
  • We will be using PostgreSQL version 8.2.4, which is now set up to be the default for cs186 accounts. The instructions provided here are designed to work with the EECS instructional machines and may not work as smoothly elsewhere. The TAs and Professors will only support EECS instructional machines.
  • By default, you should be able to access the 8.2.4 versions of pg_ctl, psql, etc. (test this out by using the commands "psql --version", "pg_ctl --version"). If your path is pointing to some other version of these binaries, you may have to explicitly add to your PATH environment variable, the directory ~cs186/pgsql/bin, which has the 8.2.4 binaries. 
  • Homework 2 used PostgreSQL version 8.3.4, so make sure that you are not using that version with this assignment. Using the binaries at ~cs186/pgsql/bin should make everything work.
  • To make sure that your solutions are correct, please use one of the Solaris 10 X86 servers to do this assignment. These servers are named "rhombus", "pentagon", "cube", "sphere", "po", and "torus" followed by a ".cs.berkeley.edu".
  • Before moving from one question to the next, you should remove any indexes that you created specifically for the previous question. Also, enable any access/join methods that you had disabled for the previous question.
  • The deadline for this assignment is 5PM on Monday November 10, 2006 in the CS186 drop box in 283 Soda Hall.  
  • If you know how to use PostgreSQL, this assignment should be doable in a single sitting.

Documentation and PostgreSQL Tips

The PostgreSQL 8.0 Documentation is on-line (and searchable) at:  http://www.postgresql.org/docs/8.2/static/index.html


 

You will find the following parts particularly useful for this assignment:

  • PostgreSQL Performance Tips  http://www.postgresql.org/docs/8.2/static/performance-tips.html

  • The CREATE INDEX Command: http://www.postgresql.org/docs/8.2/static/sql-createindex.html

    • the command for creating indexes;  you will also need “DROP INDEX”
  • The SET Command: http://www.postgresql.org/docs/8.2/static/sql-set.html

    • this command lets you turn on and off operators that the query optimizer can consider.  For example, the SQL command “SET enable_hashjoin TO off;” tells the query optimizer not to use Hash Joins.
  • Runtime Configuration: http://www.postgresql.org/docs/8.2/static/runtime-config.html

    • see section 17.6.1 “planner method configuration” for  a list of the operators you can disable and re-enable for the optimizer using the SET command.
  • PostgreSQL System Catalogs: http://www.postgresql.org/docs/8.2/static/catalogs.html

    • note that PostgreSQL keeps its catalog information in tables.   The tables/views that are likely to be most useful in this assignment are “pg_stats”, “pg_class”, “pg_indexes and  pg_attribute”.   You can query these in psql just like regular tables.  For example, the query “select * from pg_indexes where tablename = 'rankings';” gives you information about the indexes defined on the rankings table.  You can see the statistics the optimizer has for the table ‘rankings’ using the query: “select * from pg_stats where tablename = 'rankings'” (note the single quotes around the tablename, which is a string attributed  in the pg_stats view).  The documentation for pg_stats (linked from the above page) explains what the numbers mean.  You can get information on tables by querying the “pg_class” table and/or using the “/d” command in psql. 

Setting up the Database
  • Create a database directory with initdb .  After that, start the postmaster with pg_ctl  (initdb tells you the exact command) Then, create a database with createdb.
$ ~cs186/pgsql/bin/initdb ~/pgdata
$ ~cs186/pgsql/bin/pg_ctl -D ~/pgdata -l hw3.log start
$ ~cs186/pgsql/bin/createdb colleges 

  • Create tables and import data into the database. We are creating three tables: 'rankings' and 'financials' and 'students' 
 $ ~cs186/pgsql/bin/psql colleges -f ~cs186/Hw3/hw3.sql
  • You MUST update the PostgreSQL statistics after adding or deleting indexes. The command for updating the statistics (in a psql session) is "VACUUM ANALYZE;". You MUST update the statistics once before starting the exercises. This step is crucial to getting correct answers to the questions!
  • When you start psql make sure you turn off the bitmap scan acces methods by typing "SET enable_bitmapscan TO off;" Do not re-enable this access method throughout the assignment.
After completing the above steps you should be able to analyze query plans produced by the optimizer for a given query (using EXPLAIN). You should also be able to modify the runtime variables for enabling/disabling join and access methods (using SET).

Questions –Hand in from here below…

CS 186 Fall 2008 Homework 3 Answer Sheet (don’t forget to staple it together!).
NAME: ____________________________              STUDENT ID:____ ______________   
IMPORTANT: Circle the last two letters of your class account:
 cs186 a b c d e f g h i j k l m n o p q r s t u v w x y z
       a b c d e f g h i j k l m n o p q r s t u v w x y z

 

1.      SOME EASY STUFF

Start psql and make sure you are running version 8.2.4 (it tells you when it starts up).  Then, using queries in psql, examine the set of tables and access methods in the database you just created. In particular examine the relation Rankings. Answer the following questions:

A. What are the attributes of the Students relation?




B. How many indexes are built on the Rankings relation? Name them and write down their type.




C. How many tuples are there in the Rankings relation?




D. How many distinct values of  “gradrate” does the query planner estimate there are for the Rankings relation?  (hint: Query pg_stats to find out)




E. How many distinct values of “gradrate” are there actually in the rankings table? (hint: it’s probably easiest to run a query to compute this!)





F. What query did you use to find the answer to E? 

 



 


2.      USING THE QUERY PLAN VIEWER

This question requires you to use the PostgreSQL query plan visualization command EXPLAIN. Read the documentation for EXPLAIN at the link given above.   Note that (like System R) EXPLAIN estimates query costs in units of disk I/Os (CPU instructions are added in by multiplying times a conversion factor).
Consider the following query:

SELECT * FROM rankings WHERE state = 'CA'; 

Answer the following questions looking at the query plan generated by the EXPLAIN command:


A. Briefly describe the plan chosen.   (e.g., what kind of scan is used?).




    

B. In what order would the tuples be returned by this plan? Why?



    

    

C. What is the estimated total cost of running the plan?

    



D. What is the estimated result cardinality for this plan? The estimated result cardinality is the number of colleges that the optimizer estimates to be in California.   Looking at the statistics, why does the optimizer come up with this estimate?


    



    

E. How many colleges actually do have “state = 'CA'”?  (hint: it’s probably easiest to run a query to compute this!)

    




F. Looking at the statistics, what are the top 10 states with the most colleges and the percentage of colleges in each of those state? (hint: you can select these by querying on the pg_stats table) 

    




G. Which value of  “state” is actually the most popular, and how many tuples have that “state”? How did you figure this out (what query did you use) ?

 



3.      SELECTS WITH INDEXES

Consider the same query from previous question: 

SELECT * FROM rankings WHERE state = 'CA'; 

Answer the following questions looking at the plans and the access methods:


A. Create a btree index on the attribute state of the relation Rankings. What is the plan chosen for the query now?   (e.g., what kind of scan is used and what is scanned?).

 



B. What is the estimated total cost of running the plan?



    

    

C. Compare this plan with the plan obtained in question 2.A above.  Which is cheaper and why?

 

 

 

4.      RANGE SELECTS

DROP the index that you created for Question 3. Don't forget to VACUUM ANALYZE

Now analyze the query plan that PostgreSQL comes up for the following query: 


SELECT * FROM rankings WHERE gradrate < 10;

Answer the following questions:


A. How many ranking tuples that have gradrate < 10 does the optimizer think there are?

 


B. How does the optimizer arrive at this estimate of the number of tuples? That is, what calculations does it perform, and where does the supporting data come from?

 

 

    

C. In what order will the tuples be returned by this plan?

 


    

D. What is a value of the constant (i.e. '10' in the above query) such that the optimizer chooses a different plan? What is that plan and why does the optimizer think it will be cheaper than the previous plan when used with this new constant?




 

     

    


E. Explain why one of the access methods is costlier than the other.

 




5.      SIMPLE JOIN

RE-ENABLE (using “SET”) the access method you turned off above.

Create a B-tree index on 'studentfacultyratio' of the rankings table.

Analyze the query plan for the following query that finds the average salary at schools who have a student to faculty ratio < 3.



SELECT R.name, F.avesalary 
FROM rankings R, financials F
WHERE R.id = F.id AND R.studentfacultyratio < 3; 


Answer the following questions:

A. What is the estimated cost of this plan?

    

    

B. What kind of join is used by the plan?




C. Disable the join type used in the above plan and re-optimize the query.  What type of join is used now, and what is the total estimated cost of the query?


 


d. What relations are sorted in this plan, and why?




6.      THREE-WAY JOIN

RE-ENABLE (using “SET”) the join method you turned off above.

Answer the following questions referring to the query below: 

SELECT S.name, R.name, F.avesalary

FROM students S, rankings R, financials F

WHERE S.school = R.id and R.id = F.id;


A. Describe the best plan estimated by the optimizer. List the joins and access methods it uses, and the order in which the relations are joined.

 

 

 


B. What is the Relational Algrebra expression for the above join order?





C. Modify the query by adding a condition R.studentfacultyratio < 10. What are the differences between this plan and the one above? Why is this new join ordering better for the extended query than the ordering obtained in part A?

 

 

 

7.      PLAYING WITH SQL 

Answer the following questions about the database (by writing queries!): 


A. What is the name of the public school (public = 1) with the highest average salary?  (hint: while one way to get close to this is simply to list all schools and the average salary for them sorted by average salary, you may also want to try writing a variant that only produces a single result row – it’s a bit ugly).




B. Find the public school (public = 1) with the largest difference between instate tuition and out of state tuition that has at least one student attending.




C. Show a query to find out some other interesting fact in the database and tell us what the answer is (or at least summarize it).