Profile



Profiling filters out well-performing, lightweight SQL and collects information on heavyweight SQL. Heavyweight SQL statements are either long-running queries or queries that are short but run so often that they put load on the database.

  The profiler takes snapshots of user/session activity once a second and builds up a statistical model of the load on the database.
  The sampled data is displayed in three ways:
            1. Load on the database, measured in average number of active sessions
            2. Top Activity - top SQL, Event and Session
            3. Details - detail on a SQL, Session or Event
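The once-a-second snapshot described above can be sketched, on Oracle, as a simple query against v$session. This is only an illustration of the kind of state a sampler collects, not DB Optimizer's actual collector; column availability varies by Oracle version:

```sql
-- Sketch of a once-a-second sampling query (Oracle, assumes SELECT on v$session)
SELECT sid,
       serial#,
       username,
       status,      -- ACTIVE sessions are the ones that count toward load
       sql_id,      -- the SQL the session is currently running
       event,       -- what the session is waiting on, if waiting
       wait_class   -- e.g. User I/O, Concurrency, Commit
FROM   v$session
WHERE  status = 'ACTIVE'
AND    type   = 'USER';
```

Run once a second and stored, these rows accumulate into the statistical model of database load that the profiler displays.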

The idea is to start with the load on the database: the graph at the top of the screen. This graph shows the load on the database and can quickly indicate how the database is functioning. The database could be:
            1. idle
            2. under light load
            3. under heavy load
            4. bottlenecked

Problems can come from four areas:

1. Machine - CPU, slow disks, network
2. Application - locks, invalid SQL
3. Database - cache sizes, log files, etc.
4. SQL


Database Tuning


Intro

DB Optimizer's methodology grew out of the impossible predicament presented by the de facto method of database tuning. The standard method was trying to collect 100% of the statistics 100% of the time. Trying to collect all the statistics as fast as possible ends up putting load on the monitored database and creating problems. Stories of problems created by database monitoring products abound in the industry. In order to avoid putting load on the target database, performance monitoring tools have to compromise and collect less often. Oracle compromised in 10g with AWR (their automated performance data collector) by running it only once an hour because of the performance impact. Not only is the impact on the monitored target high, and the amount of data collected staggering, but the worst problem of all is the impossibility of correlating the statistics with the sessions and SQL that created the problems or suffered the consequences.

The solution to collecting performance data required letting go of the old, stodgy, problematic paradigm of trying to collect as many performance counters as possible as often as we could, and instead freeing ourselves with the simple approach of sampling session state. Session state includes who the session is, what their state is (active or waiting, and if waiting, what they are waiting on) and what SQL they are running. I used the session sampling method occasionally over the years, and I even wrote up a method to access this information without SQL on Oracle by reading the live database memory via C code. The session state method was officially packaged by Oracle in 10g when they introduced "Active Session History" (ASH). ASH is an automated collection of session state sampling. The rich, robust data from ASH in its raw form is difficult to read and interpret. The solution for this was "Average Active Sessions" (AAS). AAS is a single powerful metric which measures the load on the database based on the ASH data. AAS data provides the perfect road map for which data to drill into. The main drill-downs are "top SQL", "top session", "top event" and "top objects".
    Other aggregations are possible based on the different dimensions in the ASH data.
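On Oracle 10g and up, the relationship between ASH and AAS can be sketched directly in SQL: since ASH samples active sessions once a second, AAS over a window is simply the sample count divided by the elapsed seconds. This is an illustrative query, not DB Optimizer's implementation, and querying v$active_session_history requires the Oracle Diagnostics Pack license:

```sql
-- AAS over the last 5 minutes, broken down by CPU vs. wait class (sketch)
SELECT decode(session_state, 'ON CPU', 'CPU', wait_class) AS activity,
       round(count(*) / (5 * 60), 2)                      AS aas
FROM   v$active_session_history
WHERE  sample_time > sysdate - 5 / (24 * 60)
GROUP  BY decode(session_state, 'ON CPU', 'CPU', wait_class)
ORDER  BY aas DESC;
```

Swapping the GROUP BY dimension to sql_id, session_id or event gives the "top SQL", "top session" and "top event" aggregations mentioned above.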

Teaser

Here is an example screenshot of the same batch job being run 4 times. Between each run, performance modifications are made based on what we see in the profiling load chart:
Runs:
    1. "log file sync" - so we moved the log files to a faster device. (You can see the checkpoint activity just after run 1 where we moved the log files.)
    2. "buffer busy wait" - so we moved the table from a normal tablespace to an Automatic Segment Space Managed (ASSM) tablespace.
    3. "log file switch (checkpoint incomplete)" - so we increased the size of the log files. (You can see the IO time spent creating the new redo logs just after run 3.)
    4. The run time is the shortest and all the time is spent on CPU, which was our goal - i.e. to take advantage of all the processors and run the batch job as quickly as possible.
Conclusion:
With the load chart we can quickly and easily identify the bottlenecks in the database, take corrective actions, and see the results. In the first run, almost all the time is spent waiting. In the second run, where we eliminated one bottleneck, we actually spent more time - i.e. the next bottleneck was worse. This sometimes happens, since eliminating one bottleneck can cause greater contention on the next one. (You can see that the width of the run - the time it ran - is wider in run 2.) By the third run, we are using more CPU and the run time is faster, and finally by the 4th run all the time is spent on CPU, with no waiting, and the run is the fastest by far.

What's happening on the database?

Is the database idle, working, or bottlenecked?

When a bottleneck happens, how can you know whether:
  • It's an application problem
  • The machine is undersized
  • The SQL requires optimization
  • The database is misconfigured
All of this can easily be determined from DB Optimizer's performance profiling screen.
Let's look at the components of the screen.
The screen has 6 important parts:
  1. Databases
  2. Average Active Sessions (AAS) load of the selected database
  3. Maximum CPU line
  4. Top SQL
  5. Top Bottlenecks
  6. Top Sessions
First, on the top left, is a list of the databases we have registered.
The most important part of the screen is the Average Active Sessions (AAS) graph, which shows the performance of the database measured in this single, powerful, unified metric. Compared against the Maximum CPU line, AAS easily and quickly reveals any performance bottleneck on the database. The Max CPU line is a yardstick for performance on the database: when AAS is larger than the Max CPU line, there is a bottleneck on the database. Bottleneck identification is that easy.
AAS, or the average number of sessions active, shows how many sessions are active on average (over a 5 second range in DB Optimizer) and what the breakdown of their activity is. If all the users are running on CPU then the AAS bar is all green. If some users are running on CPU and some are doing IO, represented by blue, then the AAS bar will be partly green and partly blue.
The "Max CPU" line represents the number of CPU processors on the machine. If we have one CPU, then only one user can be running on the CPU at a time. If we have two CPUs, then only 2 users can be on CPU at any instant in time. Of course users can go on and off the CPU extremely rapidly. When we talk about sessions on the CPU, we are talking about the average number of sessions on CPU. A load of one session on the CPU is thus an average, which could represent one user who is consistently on the CPU or many users who are each on the CPU for short time slices. When CPU becomes a resource bottleneck on the database, we will see the average active sessions in the CPU state go over the Max CPU line. The number of sessions above the Max CPU line is the average number of sessions waiting for CPU.
The Max CPU  is a yardstick for performance on the database.
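The processor count behind the Max CPU line can be read from the database itself. On Oracle, for example, one sketch is to query v$parameter (this is an illustration; a monitoring tool may obtain the value differently):

```sql
-- Number of CPUs the Oracle instance sees, used to draw the Max CPU line
SELECT value AS cpu_count
FROM   v$parameter
WHERE  name = 'cpu_count';
```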
In the above chart, do you think it's an application problem or a machine resource problem?
In order to tell, we have to find out where that demand is coming from. To do that, we can look at the Top SQL and Top Session tables below the load chart. In the case shown here, the load is well distributed over all the SQL in Top SQL and all the sessions in Top Session. There is no outlier or resource hog, so in this case it's the machine that's underpowered. What does a case look like where we should tune the application instead?
In this case, again the CPU demand is more than the machine can supply, but if we look at "Top SQL" we can see that the first SQL statement (with the large green bar) uses up much more CPU than any of the rest - actually 60%! If we could get it down to 10% CPU then we'd save 50% of the CPU usage on the machine! Thus in this case it's worth our while to spend a day, a week, or even a couple of weeks trying to tune that one SQL statement instead of buying a bigger machine.
Finally, how do we know when the database configuration is a problem? We know it's a configuration problem when we see something other than CPU as the bottleneck in the Top Bottleneck section. Here's an example:

In this case we can see the load is higher than the Max CPU line, but the load is coming from brown colored bars rather than the green CPU colored bars. If we look at Top SQL, we see that there is only one SQL statement taking up almost all the load - but not because of CPU, which would be a green bar, but some other color. What does this other color represent? We can look at the Top Bottleneck section and see that it is "log file switch (checkpoint incomplete)", which basically means the log files are too small, i.e. the database is not correctly configured. This bottleneck was resolved simply by increasing the log size.
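Increasing the log size on Oracle amounts to adding larger redo log groups and dropping the old, small ones. This is a sketch; the group numbers, sizes and use of Oracle-managed file naming are illustrative:

```sql
-- Add larger redo log groups, then retire the old small ones
ALTER DATABASE ADD LOGFILE GROUP 4 SIZE 512M;
ALTER DATABASE ADD LOGFILE GROUP 5 SIZE 512M;

ALTER SYSTEM SWITCH LOGFILE;          -- cycle until an old group goes INACTIVE
ALTER DATABASE DROP LOGFILE GROUP 1;  -- repeat for each old group
```

A group can only be dropped once its status in v$log is INACTIVE, so in practice you alternate log switches and drops until all the small groups are gone.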

Tuning Example in 3 parts

Part 1 The database is hanging! AKA "the application has problems"

I wonder if you can imagine, or have had the experience of, the application guys calling with anger and panic in their voices: "the database is sooo slow, you've got to speed it up."
What's your first reaction? What tools do you use? How long does it take to figure out what's going on?
Let's take a look at how it would work with DB Optimizer. When I get a call like this, I take a look at the database with DB Optimizer:

I can clearly see that the database is not bottlenecked, so there must be a problem in the application.
Why do I think it's the application and not the database? The database is showing plenty of free CPU in the load chart - the largest chart, at the top of the image above. In the load chart there is a horizontal red line. The red line represents the number of CPUs on the system, which in this case is 2. The CPU line is rarely crossed by the bars, which represent the load on the database measured in average number of sessions. The session activity is averaged over 5 samples over 5 seconds, thus the bars are 5 seconds wide. The bars fall mostly around 1 average active session and are rarely green. Green represents CPU load; any other color indicates sessions waiting. The main wait in this case is orange, which is "log file sync", i.e. waits for commits. Why is the database more or less idle, and why are most of the waits we do see for commits? I look at the code coming to the database and see something like this:
insert into foo values ('a');
commit;
insert into foo values ('a');
commit;
insert into foo values ('a');
commit;
insert into foo values ('a');
commit;
insert into foo values ('a');
commit;
insert into foo values ('a');
commit;
insert into foo values ('a');
commit;

Doing single-row inserts and committing after each one is very inefficient. There is a lot of time wasted on network communication, which is why the database is mainly idle. While the application thinks it's running full speed ahead, it is actually waiting mainly on network communication and commits. If we commit less often and batch the work we send to the database, reducing network communications, we will run much more efficiently. Changing the code to

begin
  for i in 1..1000 loop
    insert into foo values ('a');
    -- commit;
  end loop;
end;
/
commit;

improves the communication delay, and now we get a fully loaded database - but we run into database configuration issues.

Part 2 It *is* the database (i.e. the DBA gets to work)


In the above DB Optimizer screen, the same workload was run 4 times. We can see that the time (the width of each load spike) decreased and the percentage of activity on CPU increased.
Runs:
    1. "log file sync", the orange color, is the biggest color area, which means users are waiting on commits - even though we are now committing less in the code. In this case we moved the log files to a faster device. (You can see the checkpoint activity just after run 1 where we moved the log files.)
    2. "buffer busy wait", the burnt red, is the biggest color area. We drilled down on the buffer busy wait event in the Top Event section, and the details told us to move the table from a normal tablespace to an Automatic Segment Space Managed (ASSM) tablespace.
    3. "log file switch (checkpoint incomplete)", the dark brown, is the largest color area, so we increased the size of the log files. (You can see the IO time spent creating the new redo logs just after run 3.)
    4. The run time is the shortest and all the time is spent on CPU, which was our goal - i.e. to take advantage of all the processors and run the batch job as quickly as possible.
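The buffer busy wait fix in run 2 amounts to rebuilding the table in an ASSM tablespace. A sketch, where the tablespace name and datafile sizing are illustrative:

```sql
-- Create an ASSM tablespace and move the contended table into it
CREATE TABLESPACE assm_ts
  DATAFILE SIZE 1G
  EXTENT MANAGEMENT LOCAL
  SEGMENT SPACE MANAGEMENT AUTO;

ALTER TABLE foo MOVE TABLESPACE assm_ts;
-- MOVE leaves indexes on foo unusable, so rebuild them afterwards
```

ASSM replaces freelist management with bitmaps, which spreads concurrent inserts across blocks and relieves buffer busy waits.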

Part 3 It's the machine (rock paper scissors)

Now that the application is tuned and the database is tuned let's run a bigger load:

We can see that the CPU load is constantly over the Max CPU line. How can the CPU load be bigger than the number of CPUs actually on the machine? This actually means that the demand for CPU is higher than the CPU available on the machine. In the image above there are 2 CPUs on the machine but an average of 3 users who think they are on the CPU, which means that on average 1 user is not really on the CPU but is ready to run and waiting for the CPU.
At this point we have two options. In this case we are only running one kind of load, i.e. the insert. For inserts we can actually go even further in tuning and use Oracle's bulk load commands:

declare
   TYPE IDX IS TABLE OF Integer INDEX BY BINARY_INTEGER;
   MY_IDX IDX;
BEGIN
   for i in 1..8000 loop
      MY_IDX(i) := 1;
   end loop;
   FORALL indx IN MY_IDX.FIRST .. MY_IDX.LAST
      INSERT INTO foo ( dummy ) VALUES ( MY_IDX(indx) );
   COMMIT;
end;
/

But if this were an application that had a lot of different SQL, with the load well distributed across the system, then we'd have a case for adding more hardware. Making the decision to add more hardware can be difficult, because in general the information needed to make it is unknown, unclear or just plain confusing, but DB Optimizer makes it easy and clear, which can save weeks and months of wasteful meetings and debates. For example:
If we look in the bottom left, there is no SQL that takes up a significant amount of the load, i.e. there is no outlier SQL that we could tune to gain back a lot of wasted CPU. We'd have to tune many SQL statements and make improvements on most of them to gain back enough CPU to get our load down below the Max CPU line. In this case, adding CPUs to the machine might be the easiest and most cost-effective solution.
Conclusion:
With the load chart we can quickly and easily identify the bottlenecks in the database, take corrective actions, and see the results. In part 1 we had an application problem, in part 2 we had three database configuration issues, and in part 3 we had a hardware sizing issue. In all three parts, DB Optimizer provided a clear and easy presentation of the data and issues, making the solutions clear.


SQL Tuning





Finding and Tuning Problem SQL

DB Optimizer is targeted at finding problem SQL in a running load with the profiler and then tuning that specific query (or queries) with the tuner.

It’s not efficient to simply dump a bunch of procedure code into the tuner and then see if any of the SQL in the package or procedure is tunable. Most queries should, by default, run optimally on a database, so the goal of DB Optimizer is to tune those queries that for one reason or another are not handled optimally by the database by default. The easiest way to find those queries is to identify them on a running system, where they stand out because they take up a lot of resources. If we find a resource-intensive query, then it’s worth the time to generate cases and analyze it for missing indexes to see if there is a way to tune it.
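On Oracle, the resource-intensive candidates can be sketched from v$sqlstats (or v$sql on older versions), ranked by elapsed time. This is an illustrative query, not the profiler's own mechanism:

```sql
-- Top 10 SQL by elapsed time since the statements entered the shared pool (sketch)
SELECT *
FROM  (SELECT sql_id,
              round(elapsed_time / 1e6, 1) AS elapsed_secs,
              round(cpu_time / 1e6, 1)     AS cpu_secs,
              executions
       FROM   v$sqlstats
       ORDER  BY elapsed_time DESC)
WHERE rownum <= 10;
```

Any sql_id near the top of this list with a large share of the total is the kind of outlier worth feeding into the tuner.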

Platform Specific Features

Object IO (Oracle)

Procedure Tab (Sybase and SQL Server)

Advanced




DB Optimizer Version/Feature Matrix 


Profiling SQL tab enhancements

Platform    | Average SQL exec time | Show procedure content | Show executed line in procedure
------------|-----------------------|------------------------|----------------------------------------
Oracle      | yes                   | (planned 3.0)          | (planned 3.0, for 10.2.0.3 and higher)
Sybase      | (planned 3.0)         | (planned 2.5)          | (planned 2.5)
DB2         | yes                   | (planned 2.5)          | (planned 2.5)
SQL Server  | (planned 2.5)         | (planned 2.5)          | (planned 2.5)


Oracle profiling already distinguishes between the calling procedure and the SQL executed by the procedure; these appear as different rows in Top SQL. By contrast, SQL Server, Sybase and DB2 only report the procedure in Top SQL - hence the priority of adding procedure content and line number in 2.5 for these platforms, while delaying the improvements on Oracle until 3.0, since Oracle's existing functionality is already sufficient for proper performance analysis.

Editor Detail Drilldown

For statements

Platform    | SQL Text | SQL Details   | Events | Sessions | Children Details | Object I/O
------------|----------|---------------|--------|----------|------------------|-----------
Oracle      | yes      | yes           | yes    | yes      | yes              | yes
Sybase      | yes      | (planned 2.5) | yes    | yes      |                  |
DB2         | yes      | yes           | yes    | yes      |                  |
SQL Server  | yes      | (planned 2.5) | yes    | yes      |                  |


For events:

Platform    | SQL | Sessions | Raw Data | Analysis
------------|-----|----------|----------|--------------------------------------------------------------
Oracle      | yes | yes      | yes      | only for “buffer busy waits” and “cache buffer chains latch”
Sybase      | yes | yes      |          |
DB2         | yes | yes      |          |
SQL Server  | yes | yes      |          |

For Session:

Platform    | Session Details | SQL | Events
------------|-----------------|-----|-------
Oracle      | yes             | yes | yes
Sybase      | yes             | yes | yes
DB2         | (planned 3.0)   | yes | yes
SQL Server  | yes             | yes | yes

