D. How many distinct values of “gradrate” does the query planner estimate there are for the Rankings relation? (hint: Query pg_stats to find out)
E. How many distinct values of “gradrate” are there actually in the rankings table? (hint: it’s probably easiest to run a query to compute this!)
F. What query did you use to find the answer to E?
2. USING THE QUERY PLAN VIEWER
This question requires you to use the PostgreSQL query plan visualization command EXPLAIN. Read the documentation for EXPLAIN at the link given above. Note that (like System R) EXPLAIN estimates query costs in units of disk I/Os (CPU instructions are added in by multiplying times a conversion factor).
Consider the following query:
SELECT * FROM rankings WHERE state = 'CA';
Answer the following questions looking at the query plan generated by the EXPLAIN command:
A. Briefly describe the plan chosen. (e.g., what kind of scan is used?).
B. In what order would the tuples be returned by this plan? Why?
C. What is the estimated total cost of running the plan?
D. What is the estimated result cardinality for this plan? The estimated result cardinality is the number of colleges that the optimizer estimates to be in California. Looking at the statistics, why does the optimizer come up with this estimate?
E. How many colleges actually do have “state = 'CA'”? (hint: it’s probably easiest to run a query to compute this!)
F. Looking at the statistics, what are the top 10 states with the most colleges and the percentage of colleges in each of those state? (hint: you can select these by querying on the pg_stats table)
G. Which value of “state” is actually the most popular, and how many tuples have that “state”? How did you figure this out (what query did you use) ?
3. SELECTS WITH INDEXES
Consider the same query from previous question:
SELECT * FROM rankings WHERE state = 'CA';
Answer the following questions looking at the plans and the access methods:
A. Create a btree index on the attribute state of the relation Rankings. What is the plan chosen for the query now? (e.g., what kind of scan is used and what is scanned?).
B. What is the estimated total cost of running the plan?
C. Compare this plan with the plan obtained in question 2.A above. Which is cheaper and why?
4. RANGE SELECTS
DROP the index that you created for Question 3. Don't forget to VACUUM ANALYZE
Now analyze the query plan that PostgreSQL comes up for the following query:
SELECT * FROM rankings WHERE gradrate < 10;
Answer the following questions:
A. How many ranking tuples that have gradrate < 10 does the optimizer think there are?
B. How does the optimizer arrive at this estimate of the number of tuples? That is, what calculations does it perform, and where does the supporting data come from?
C. In what order will the tuples be returned by this plan?
D. What is a value of the constant (i.e. '10' in the above query) such that the optimizer chooses a different plan? What is that plan and why does the optimizer think it will be cheaper than the previous plan when used with this new constant?
E. Explain why one of the access methods is costlier than the other.
5. SIMPLE JOIN
RE-ENABLE (using “SET”) the access method you turned off above.
Create a B-tree index on 'studentfacultyratio' of the rankings table.
Analyze the query plan for the following query that finds the average salary at schools who have a student to faculty ratio < 3.
SELECT R.name, F.avesalary
FROM rankings R, financials F
WHERE R.id = F.id AND R.studentfacultyratio < 3;
Answer the following questions:
A. What is the estimated cost of this plan?
B. What kind of join is used by the plan?
C. Disable the join type used in the above plan and re-optimize the query. What type of join is used now, and what is the total estimated cost of the query?
d. What relations are sorted in this plan, and why?
6. THREE-WAY JOIN
RE-ENABLE (using “SET”) the join method you turned off above.
Answer the following questions referring to the query below:
SELECT S.name, R.name, F.avesalary
FROM students S, rankings R, financials F
WHERE S.school = R.id and R.id = F.id;
A. Describe the best plan estimated by the optimizer. List the joins and access methods it uses, and the order in which the relations are joined.
B. What is the Relational Algrebra expression for the above join order?
C. Modify the query by adding a condition R.studentfacultyratio < 10. What are the differences between this plan and the one above? Why is this new join ordering better for the extended query than the ordering obtained in part A?
7. PLAYING WITH SQL
Answer the following questions about the database (by writing queries!):
A. What is the name of the public school (public = 1) with the highest average salary? (hint: while one way to get close to this is simply to list all schools and the average salary for them sorted by average salary, you may also want to try writing a variant that only produces a single result row – it’s a bit ugly).
B. Find the public school (public = 1) with the largest difference between instate tuition and out of state tuition that has at least one student attending.
C. Show a query to find out some other interesting fact in the database and tell us what the answer is (or at least summarize it).