Every DBA needs to know how to do their work under pressure, while facing crisis after crisis. Our most important contribution is to keep the database running. It's an on-call function; you never know what might happen. Half the branch was up all night last night restoring a database because of a failure. Data recovery is very important, and so is performance tuning and problem solving. IT tends to start at the back end and work out to see where the problem lies, so generally, we're one of the first areas that will be contacted when a problem occurs.
There are hundreds of possible sources of downtime hiding in the crooks and crannies of your company. Many outage situations cascade into each other, where an attempt to resolve a problem creates a new outage and quickly spirals further and further out of control. DBAs should always try to obtain critical diagnostics even when under pressure to bring the database up or press before system reboot.
This article describes some of the steps for a thorough post-mortem that a DBA in crisis mode should follow, especially in crises that happened unknowingly, caused database outage or company dollars. It is very important that some of these tasks listed below be performed before you resort to such drastic measures as restarting the database or server it self. I classify these tasks into three categories, tested on 10g and Linux environments.
Please ensure you can run the tasks on your specific environment before you use them.
After the system has been stabilized and when there is time, open a Service Request with Oracle. Don't forget to collect necessary trace and log files listed in the end of this article and upload them for the Oracle to work on your TAR
HANGANALYZE helps you to determine if a performance analysis will be required to diagnose the real problem when a HANG is taking place
When: Hang & deadlock situations in a RAC database or ASM instance.
Make sure you have it generated on all nodes – collect DIAG trace files from all nodes. If the above procedure does not work for any reason, then get systemstate dump individually on every node – see below.
Individual systemstate dumps:
The dump files are generated in the $ORACLE_BASE/admin/<DB_name>/udump directory. It's easy to locate the trace file as it will have 'systemstate' suffix in its name:
- individual system state dumps using OS debugger
Following is the how-to steps.
Again, have a look inside to make sure the file contains what we need – see examples of required patterns in 2.2.2.
When: Hang & deadlock situations in a database or ASM – when you can narrow down your suspicions to a specific Oracle process.
In the following demo, we obtain the error stack, call stack and process state info for the RECO background process.
After you have found the process's OS PID, connect to the Oracle instance:
When you have to log a Service Request for RAC Oracle environment, there are a few more log files you have to collect. Following is a list of files you may want to upload every time you log a Service Request.
RDA output from all cluster nodes. An alternative is to always keep the recent RDA in proactive SR. Remember to renew RDA stuff each time there is a change in the environment – for instance, when a database or ASM persistent configuration parameter is changed.
10gR1: collect log files from $CRS_HOME/crs/log, $CRS_HOME/css/log, $CRS_HOME/evm/log, $CRS_HOME/crs/racg/dump.
10gR2: collect log files from $CRS_HOME/log/`hostname` directory – you may want to 'tar' the entire directory altogether rather than go through every subdirectory in there.
If you, as a DBA, need to call in Oracle Support they are going to treat you in much the same way as you have treated the user who reported the problem to you. They are going to ask for a very precise definition of the problem, and if they cannot solve it at once, they are going to ask for hard evidence. It is also helpful if another DBA is called in to deal with a related problem later. It may be possible for them to save a vast amount of time if they have full information as to what you have done already. Indeed, you yourself may need this written record if the problem recurs in a few months time or if it becomes so complex that it is hard to remember everything that has happened.
SQL> oradebug setorapname diag
SQL> oradebug setmypid
The views expressed are my own and not necessarily those of my current or past employer. The views and opinions expressed by visitors to this blog/web are theirs and do not necessarily reflect mine. : VIRAG