Complex Queries
 

Often applications require grouping rows that have certain properties and then applying an aggregate function on one column for each group separately. For this, SQL provides the clause group by <group column(s)>. This clause appears after the where clause and must refer to columns of tables listed in the from clause:.

select <column(s)>
from <table(s)>
where <condition>
group by <group column(s)>
[having <group condition(s)>];

Those rows retrieved by the selected clause that have the same value(s) for <group column(s)> are grouped. Aggregations speci ed in the select clause are then applied to each group separately. It is important that only those columns that appear in the <group column(s)> clause can be listed without an aggregate function in the select clause !
Example: For each department, we want to retrieve the minimum and maximum salary.

select DEPTNO, min(SAL), max(SAL)
from EMP
group by DEPTNO;

Rows from the table EMP are grouped such that all rows in a group have the same department number. The aggregate functions are then applied to each such group. We thus get the following query result:

DEPTNO MIN(SAL) MAX(SAL)
10              1300             5000
20                800             3000
30                950             2850
Rows to form a group can be restricted in the where clause. For example, if we add the condition where JOB = 'CLERK', only respective rows build a group. The query then would retrieve the minimum and maximum salary of all clerks for each department. Note that is not allowed to specify any other column than DEPTNO without an aggregate function in the select clause since this is the only column listed in the group by clause (is it also easy to see that other columns would not make any sense).
Once groups have been formed, certain groups can be eliminated based on their properties, e.g., if a group contains less than three rows. This type of condition is speci ed using the having clause. As for the select clause also in a having clause only <group column(s)> and aggregations can be used.
Example: Retrieve the minimum and maximum salary of clerks for each department having more than three clerks.

select DEPTNO, min(SAL), max(SAL)
from EMP
where JOB = 'CLERK'
group by DEPTNO
having count(*) > 3;

Note that it is even possible to specify a subquery in a having clause. In the above query, for example, instead of the constant 3, a subquery can be speci ed.
A query containing a group by clause is processed in the following way: (IMP)
1. Select all rows that satisfy the condition speci ed in the where clause.
2. From these rows form groups according to the group by clause.
3. Discard all groups that do not satisfy the condition in the having clause.
4. Apply aggregate functions to each group.
5. Retrieve values for the columns and aggregations listed in the select clause.

Suppose that you want to start lumping together information from multiple rows. For example, you're interested in JOINing users with their classified ads. That will give you one row per ad posted. But you want to mush all the rows together for a particular user and just look at the most recent posting time. What you need is the GROUP BY construct:

select users.user_id, users.email, max(classified_ads.posted)
from users, classified_ads
where users.user_id = classified_ads.user_id
group by users.user_id, users.email
order by upper(users.email);

   USER_ID EMAIL          MAX(CLASSI
---------- ----------------------------------- ----------
     39406 102140.1200@compuserve.com        1998-10-08
     39842 102144.2651@compuserve.com        1998-12-13
     41426 50@seattle.va.gov         1997-01-13

The group by users.user_id, users.email tells SQL to "lump together all the rows that have the same values in these two columns." In addition to the grouped by columns, we can run aggregate functions on the columns that aren't being grouped. For example, the MAX above applies to the posting dates for the rows in a particular group. We can also use COUNT to see at a glance how active and how recently active a user has been:

select users.user_id, users.email, count(*), max(classified_ads.posted)
from users, classified_ads
where users.user_id = classified_ads.user_id
group by users.user_id, users.email
order by upper(users.email);

Let's find our most recently active users. At the same time, let's get rid of the unsightly "MAX(CLASSI" at the top of the report:

select users.user_id,
       users.email,
       count(*) as how_many,
       max(classified_ads.posted) as how_recent
from users, classified_ads
where users.user_id = classified_ads.user_id
group by users.user_id, users.email
order by how_recent desc, how_many desc;

   USER_ID EMAIL     HOW_MANY HOW_RECENT
---------- ----------------------------------- ---------- ----------
     39842 102144.2651@compuserve.com   3 1998-12-13
     39968 mkravit@mindspring.com   1 1998-12-13
     36758 mccallister@mindspring.com   1 1998-12-13
     38513 franjeff@alltel.net    1 1998-12-13
     34530 nverdesoto@earthlink.net   3 1998-12-13
     34765 jrl@blast.princeton.edu   1 1998-12-13
     38497 jeetsukumaran@pd.jaring.my   1 1998-12-12
     38879 john.macpherson@btinternet.com  5 1998-12-12
     37808 eck@coastalnet.com    1 1998-12-12
     37482 dougcan@arn.net    1 1998-12-12

Note that we were able to use our correlation names of "how_recent" and "how_many" in the ORDER BY clause. The desc ("descending") directives in the ORDER BY clause instruct Oracle to put the largest values at the top. The default sort order is from smallest to largest ("ascending").

Finding co-moderators: The HAVING Clause

The WHERE clause restricts which rows are returned. The HAVING clause operates analogously but on groups of rows. Suppose, for example, that we're interested in finding those users who've contributed heavily to our discussion forum and a posting contributed three years ago is not necessarily evidence of interest in the community right now. So the query reads as: "show me users who've posted at least 30 messages in the past 60 days, ranked in descending order of volubility":

select user_id, count(*) as how_many
from bboard
where posting_time + 60 > sysdate
group by user_id
having count(*) >= 30
order by how_many desc;

   USER_ID   HOW_MANY
---------- ----------
     34375    80
     34004    79
     37903    49
     41074    46
     42485    46
     35387    30
     42453    30

7 rows selected.

We had to do this in a HAVING clause because the number of rows in a group is a concept that doesn't make sense at the per-row level on which WHERE clauses operate.Oracle 8's SQL parser is too feeble to allow you to use the how_many correlation variable in the HAVING clause. You therefore have to repeat the count(*) incantation.

Set Operations: UNION, INTERSECT, and MINUS

Oracle provides set operations that can be used to combine rows produced by two or more separate SELECT statements.Sometimes it is useful to combine query results from two or more queries into a single result. SQL supports three set operators which have the pattern:
<query 1> <set operator> <query 2>
The set operators are:

  • union [all] returns a table consisting of all rows either appearing in the result of <query1> or in the result of <query 2>. Duplicates are automatically eliminated unless the clause all is used.
  • intersect returns all rows that appear in both results <query 1> and <query 2>.
  • minus returns those rows that appear in the result of <query 1> but not in the result of <query 2>.

Of the three, UNION is the most useful in practice.

Example: Assume that we have a table EMP2 that has the same structure and columns as the table EMP:

  • All employee numbers and names from both tables:

select EMPNO, ENAME from EMP
union
select EMPNO, ENAME from EMP2;

  • Employees who are listed in both EMP and EMP2:

select * from EMP
intersect
select * from EMP2;

  • Employees who are only listed in EMP:

select * from EMP
minus
select * from EMP2;

Each operator requires that both tables have the same data types for the columns to which the operator is applied.

Another example,
select
  'today - ' || to_char(trunc(sysdate),'Mon FMDDFM'),
  trunc(sysdate) as deadline
from dual
UNION
select
  'tomorrow - '|| to_char(trunc(sysdate+1),'Mon FMDDFM'),
  trunc(sysdate+1) as deadline
from dual
UNION
select
  'next week - '|| to_char(trunc(sysdate+7),'Mon FMDDFM'),
  trunc(sysdate+7) as deadline
from dual
UNION
select
  'next month - '|| to_char(trunc(ADD_MONTHS(sysdate,1)),'Mon FMDDFM'),
  trunc(ADD_MONTHS(sysdate,1)) as deadline
from dual
UNION
select
  name || ' - ' || to_char(deadline, 'Mon FMDDFM'),
  deadline
from ticket_deadlines
where project_id = :project_id
and deadline >= trunc(sysdate)
order by deadline

The INTERSECT and MINUS operators are seldom used.