DB Optimizer's methodology grew out of the impossible predicament presented by the de facto method of database tuning. The standard method was to try to collect 100% of the statistics 100% of the time. Trying to collect all the statistics as fast as possible puts load on the monitored database and creates problems; stories of problems created by database monitoring products abound in the industry. To avoid putting load on the target database, performance monitoring tools have to compromise by collecting less often. Oracle compromised in 10g with AWR (their automated performance data collector), running it only once an hour because of the performance impact. Not only is the impact on the monitored target high and the amount of data collected staggering, but the worst problem of all is the impossibility of correlating the statistics with the sessions and SQL that created the problems or suffered the consequences.
The solution to collecting performance data required letting go of the old, problematic paradigm of trying to collect as many performance counters as possible as often as we could, and instead freeing ourselves with the simple approach of sampling session state. Session state includes who the session is, what its state is (active or waiting, and if waiting, what it is waiting on) and what SQL it is running. I had used the session sampling method occasionally over the years, and I even wrote up a method to access this information on Oracle without SQL by reading the live database memory via C code. The session state method was officially packaged by Oracle in 10g when they introduced "Active Session History" (ASH), an automated collection of session state samples. The rich data from ASH is difficult to read and interpret in its raw form. The solution to this was "Average Active Sessions" (AAS), a single powerful metric that measures the load on the database based on the ASH data. AAS provides the perfect road map for which data to drill into. The main drill downs are "top sql", "top session", "top event" and "top objects".
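The sampling idea itself is simple enough to sketch in a few lines. The loop below is a minimal illustration, not DB Optimizer's actual collector: `get_active_sessions()` is a hypothetical stand-in for whatever call reads session state (on Oracle, a query against V$SESSION), and the sample rows are invented.

```python
import time

# Hypothetical stand-in for reading current session state from the
# database (e.g. querying V$SESSION on Oracle). Invented sample data.
def get_active_sessions():
    return [
        {"sid": 101, "state": "CPU",     "event": None,                      "sql_id": "a1b2"},
        {"sid": 102, "state": "WAITING", "event": "db file sequential read", "sql_id": "c3d4"},
    ]

def sample_session_state(history, interval=1.0, samples=3):
    """ASH-style loop: every `interval` seconds, record a snapshot of
    who is active, what state they are in, and what SQL they are running."""
    for _ in range(samples):
        ts = time.time()
        for s in get_active_sessions():
            history.append({"sample_time": ts, **s})
        time.sleep(interval)

history = []
sample_session_state(history, interval=0.01, samples=3)
# 3 samples x 2 active sessions = 6 rows of ASH-like history
print(len(history))
```

The point is that each sample is cheap: we never ask the database to total up counters for us, we just note who was doing what at that instant.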
Other aggregations are possible based on the different dimensions in the ASH data.
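Because every sampled row carries several dimensions at once, any of them can be grouped to build a "top" list. A minimal sketch of this aggregation, using invented ASH-style rows (the `sql_id` values and counts are made up for illustration):

```python
from collections import Counter

# Invented ASH-style samples; each row is one active session
# observed at one sample time.
ash = [
    {"sql_id": "a1b2", "sid": 101, "event": "CPU"},
    {"sql_id": "a1b2", "sid": 101, "event": "CPU"},
    {"sql_id": "c3d4", "sid": 102, "event": "buffer busy waits"},
    {"sql_id": "a1b2", "sid": 103, "event": "CPU"},
]

def top_by(samples, dimension):
    """Aggregate ASH samples along one dimension (sql_id, sid, event, ...).
    Each sample counts as one unit of active time for that dimension value."""
    return Counter(row[dimension] for row in samples).most_common()

print(top_by(ash, "sql_id"))  # top SQL: [('a1b2', 3), ('c3d4', 1)]
print(top_by(ash, "event"))   # top event: CPU dominates
```

The same four rows yield top SQL, top session, or top event just by changing the grouping key, which is exactly why the raw sample data is such a rich source of drill downs.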
Here is an example screenshot of the same batch job being run four times. Between each run, performance modifications are made based on what we see in the profiling load chart:
First run - lots of orange
What is that orange color in the graph? It's waits for commits in the application. Every time a user commits, they have to wait for their data to be written to disk. The solution here, as given in the flyover help, is to move the database log files to a faster disk. We created new database log files on a faster disk, which you can see in the graph where it's marked "checkpoint".
Second run - lots of dark red
The dark red color represents waits on a resource. We can see which resource we are waiting on under the EVENT section below the graph. The first event, the one we have been waiting on the most, is "buffer busy wait". Clicking on "buffer busy wait" takes us to a drilldown analysis page that suggests we move our table from a normal tablespace to an Automatic Segment Space Managed (ASSM) tablespace.
Third run - lots of dark brown
The dark brown color comes from the event "log file switch (checkpoint incomplete)". The flyover help suggests we increase the size of the log files. (You can see the I/O time spent creating the new redo logs just after run 3.)
Fourth run - all green
In this run our batch job is tuned! We can see by the width of the last spike, which represents the last batch run, that the batch job ran much faster than the previous runs. The previous runs each showed a bottleneck in the graph, and DB Optimizer suggested solutions in each case. After applying all the solutions, the run time is the shortest and all the time is spent on CPU, which was our goal: to take advantage of all the processors and run the batch job as quickly as possible.
With the load chart we can quickly and easily identify the bottlenecks in the database, take corrective actions, and see the results. In the first run, almost all the time is spent waiting. In the second run, where we eliminated one bottleneck, we actually spent more time, i.e. the next bottleneck was worse. This sometimes happens: eliminating one bottleneck causes greater contention on the next bottleneck. (You can see that the width of the run, the time it took, is wider in run 2.) By the third run we are using more CPU and the run time is faster, and finally by the fourth run all the time is spent on CPU, with no waiting, and the run is the fastest by far.
Is the database idle, working, or bottlenecked?
When a bottleneck happens, how can you know whether the problem is the machine, the application, or the database configuration?
All of this can be easily determined from DB Optimizer's performance profiling screen.
Let's look at the components of the screen
The screen has 6 important parts
First, on the top left, is a list of the databases we have registered.
The most important part of the screen is the Average Active Sessions (AAS) graph. AAS measures the performance of the database in a single, powerful, unified metric. Compared against the Maximum CPU line, AAS quickly and easily reveals any performance bottleneck on the database. The Max CPU line is a yardstick for performance on the database: when AAS is larger than the Max CPU line, there is a bottleneck on the database. Bottleneck identification is that easy.
AAS, the average number of active sessions, shows how many sessions are active on average (over a 5 second range in DB Optimizer) and what the breakdown of their activity was. If all the users were running on CPU, then the AAS bar is all green. If some users were running on CPU and some were doing I/O, represented by blue, then the AAS bar will be partly green and partly blue.
The "Max CPU" line represents the number of CPU processors on the machine. If we have one CPU, then only one user can be running on the CPU at a time. If we have two CPUs, then only two users can be on CPU at any instant in time. Of course, users can go on and off the CPU extremely rapidly. When we talk about sessions on the CPU, we are talking about the average number of sessions on CPU: a load of one session on the CPU is an average that could represent one user who is consistently on the CPU, or many users who are each on the CPU for short time slices. When CPU becomes a resource bottleneck on the database, we will see the average active sessions in the CPU state go over the Max CPU line. The number of sessions above the Max CPU line is the average number of sessions waiting for CPU.
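The arithmetic behind the chart can be sketched in a few lines. This is an illustration only, with invented sample data: each inner list is one sample interval's active sessions, each marked either "CPU" or a wait event name.

```python
# Invented data: each element is one sample interval's active sessions,
# labeled "CPU" or a wait event name.
samples = [
    ["CPU", "CPU", "CPU"],
    ["CPU", "CPU", "CPU", "CPU", "db file sequential read"],
    ["CPU", "CPU"],
]

def average_active_sessions(samples):
    """AAS = total active sessions observed / number of samples."""
    return sum(len(s) for s in samples) / len(samples)

def cpu_bottlenecked(samples, max_cpu):
    """Flag a CPU bottleneck when the average number of sessions in the
    CPU state exceeds the Max CPU line (the machine's processor count)."""
    avg_on_cpu = sum(s.count("CPU") for s in samples) / len(samples)
    return avg_on_cpu > max_cpu

aas = average_active_sessions(samples)  # (3 + 5 + 2) / 3 ≈ 3.33
print(round(aas, 2), cpu_bottlenecked(samples, max_cpu=2))
```

With an average of three sessions on CPU against a Max CPU line of two, one session on average is queued waiting for a processor; with four CPUs the same load would not be flagged.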
The Max CPU is a yardstick for performance on the database.
In the above chart, do you think it's an application problem or a machine resource problem?
In order to tell, we have to find out where that demand is coming from. To do so, we can look at the Top SQL and Top Session tables below the load chart. In the case shown here, the load is well distributed over all the SQL in Top SQL and all the sessions in Top Session. There is no outlier or resource hog. In this case it's the machine that's underpowered. What does a case look like where we should tune the application?
In this case, again, the CPU demand is more than the machine can supply, but if we look at "Top SQL" we can see that the first SQL statement (with the large green bar) uses much more CPU than any of the rest: 60% of it, in fact. If we could get it down to 10% of CPU, we'd save 50% of the CPU usage on the machine! In this case it's worth our while to spend a day, a week, or even a couple of weeks trying to tune that one SQL statement instead of buying a bigger machine.
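The skewed-versus-distributed distinction is easy to compute from sample counts. A minimal sketch, with invented `sql_id` names and counts chosen to resemble the screenshot (one statement at 60% of the load):

```python
def load_shares(sample_counts):
    """Convert per-SQL sample counts into each statement's share of load."""
    total = sum(sample_counts.values())
    return {sql: n / total for sql, n in sample_counts.items()}

# Hypothetical profile: one statement carries 60% of the samples.
skewed = {"a1b2": 60, "c3d4": 15, "e5f6": 15, "g7h8": 10}

shares = load_shares(skewed)
outliers = [sql for sql, share in shares.items() if share >= 0.5]
print(outliers)  # ['a1b2'] -- tune this one before buying hardware
```

A flat profile (no statement clearing the threshold) points at the machine; a single dominant statement points at the application.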
Finally, how do we know when the database configuration is a problem? We know it's a configuration problem when we see something other than CPU as the bottleneck in the Top Bottleneck section. Here's an example:
In this case we can see that the load is higher than the Max CPU line, but the load is coming from the brown colored bars, not the green CPU colored bars. If we look at Top SQL, we see that there is only one SQL statement taking up almost all the load, but not because of CPU, which would be a green bar, but some other color. What does this other color represent? We can look at the Top Bottleneck section and see that it is "log file switch (checkpoint incomplete)", which basically means the log files are too small, i.e. the database is not correctly configured. This bottleneck was resolved simply by increasing the log size.
Finding and Tuning Problem SQL
DB Optimizer is targeted at finding problem SQL in a running load with the profiler and then tuning that (or those) specific queries with the tuner.
It's not efficient just to dump a bunch of procedure code into the tuner and then try to see whether any of the SQL in the package or procedure is tunable. Most queries should, by default, run optimally on a database, so the goal of DB Optimizer is to tune those queries that, for one reason or another, are not optimally executed by the database by default. The easiest way to find those queries is to identify them on a running system, where they stand out because they take up a lot of resources. If we find a resource-intensive query, then it's worth the time to generate cases and analyze it for missing indexes to see if there is a way to tune it.
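Sample counts translate directly into an estimate of active time, since each sample represents roughly one sampling interval of activity, and that estimate gives a simple filter for which statements are worth sending to the tuner. A sketch under assumed values (the 5-second interval, the 60-second threshold, and the `sql_id` counts are all invented for illustration):

```python
from collections import Counter

SAMPLE_INTERVAL = 5.0  # seconds between samples; an assumed value

# Invented observations from a profiling run: each occurrence is one
# sample in which that statement was seen active.
observed = ["g7h8"] * 40 + ["a1b2"] * 5 + ["c3d4"] * 3

def resource_intensive(observed, interval, threshold_secs=60.0):
    """Estimate active time per statement (samples x interval) and
    return the statements worth sending to the tuner, busiest first."""
    est = {sql: n * interval for sql, n in Counter(observed).items()}
    return sorted(
        ((sql, secs) for sql, secs in est.items() if secs >= threshold_secs),
        key=lambda kv: kv[1],
        reverse=True,
    )

print(resource_intensive(observed, SAMPLE_INTERVAL))
# only "g7h8" (~200s estimated active time) clears the 60s threshold
```

Statements that never accumulate meaningful active time in the profile are, by this logic, already running well enough to leave alone.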