(This article is best viewed with its formatting at http://tinyurl.com/ybyjazq)
Waits on the cache buffers chains latch, i.e. the wait event "latch: cache buffers chains", happen when there is extremely high, concurrent access to the same block in a database. Access to a block is normally a fast operation, but if concurrent users access a block fast enough and repeatedly, then simple access to the block can become a bottleneck. The most common occurrence of CBC (cache buffers chains) latch contention happens when multiple users are running nested loop joins on a table and accessing the table being driven into via an index. Since the NL join is basically
    for all rows in i
        look up a value in j where j.field1 = i.val
then table j's index on field1 will get hit for every row returned from i. Now if the lookup on i returns a lot of rows, and if multiple users are running this same query, then the root block of the index on j(field1) is going to get hammered.
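To make the access pattern concrete, here is a minimal Python sketch of the nested loop above. It is only a toy model, not how Oracle implements the join: the tables, the dictionary standing in for the index on j(field1), and the counter name `j_field1_index_root` are all made up for illustration. It simply counts how often the index root would be visited.

```python
from collections import Counter

def nested_loop_join(outer_rows, inner_index, touches):
    """Join each outer value to inner rows with one index probe per outer row."""
    result = []
    for val in outer_rows:
        # Every index probe starts at the root block of the index on j(field1).
        touches["j_field1_index_root"] += 1
        for match in inner_index.get(val, []):
            result.append((val, match))
    return result

# Hypothetical data: table i returns 1,000 rows; the dict maps j.field1 -> rows of j.
i_rows = [n % 10 for n in range(1000)]
j_index = {n: [f"j_row_{n}"] for n in range(10)}

touches = Counter()
sessions = 5  # five users running the same query
for _ in range(sessions):
    joined = nested_loop_join(i_rows, j_index, touches)

print(len(joined))                     # rows produced by one session
print(touches["j_field1_index_root"])  # root block visits across all sessions
```

The number of root block visits grows as (rows returned from i) x (number of sessions), which is why a single block can become a bottleneck.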
In order to solve a CBC latch bottleneck we need to know which SQL statement is causing the bottleneck, and which table or index used by that statement is the hot one.
From ASH data this is fairly easy:
From the output it looks like we have both the SQL (at least the ID; we can get the text from the ID) and the block:
But the block is probably left over from a recent I/O and is not necessarily the CBC hot block, though it might be.
We can investigate further by looking at P1, P2 and P3 for the CBC latch wait. How can we find out what P1, P2 and P3 mean? By looking them up in V$EVENT_NAME:
So, for the CBC latch wait, P1 is the address of the latch.
Now we can group the CBC latch waits by address and find out which address had the most waits:
In this case there is only one address that we had waits for, so now we can look up which blocks (buffer headers, actually) are at that address:
We look for the block with the highest "TCH" or "touch count". Touch count is a count of the times the block has been accessed, but it has some restrictions. The count is only incremented once every 3 seconds, so even if I access the block 1 million times a second, the count will only go up once every 3 seconds. Also, unfortunately, the count gets zeroed out if the block cycles through the buffer cache. Probably the most unfortunate restriction is that this analysis only works while the problem is happening. Once the problem is over, the blocks will usually get pushed out of the buffer cache.
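The 3 second rule is easy to see in a toy model. The Python sketch below is not Oracle's code; the class and field names are hypothetical and only the 3 second interval comes from the description above. It shows how a block accessed thousands of times ends up with a very small touch count.

```python
class BufferHeader:
    """Toy buffer header whose touch count is rate limited."""

    TOUCH_INTERVAL_MS = 3000  # 3 seconds, per the description above

    def __init__(self):
        self.tch = 0
        self.last_touch_ms = None  # time of the last increment, in milliseconds

    def access(self, now_ms):
        # Only bump TCH if at least 3 seconds have passed since the last bump.
        if (self.last_touch_ms is None
                or now_ms - self.last_touch_ms >= self.TOUCH_INTERVAL_MS):
            self.tch += 1
            self.last_touch_ms = now_ms

hdr = BufferHeader()
accesses = 0
# Simulate 9 seconds of access at 1,000 accesses per second.
for now_ms in range(9000):
    hdr.access(now_ms)
    accesses += 1

print(accesses, hdr.tch)
```

In this run the block is accessed 9,000 times but TCH only reaches 3, so TCH is a rough indicator of sustained access rather than a true access count.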
In the case where the CBC latch contention is happening right now, we can run all of this analysis in one query:
This can be misleading, as TCH gets set to 0 on every wrap around the LRU and only gets updated once every 3 seconds, so in this case DUAL was my problem table, not MGMT_EMD_PING.
Deeper Analysis from Tanel Poder
Using Tanel's ideas, here's a script to get the objects that we have the most CBC latch waits on:
Why do we get cache buffers chains latch contention?
In order to understand why we get CBC latch contention we have to understand what the CBC latch protects. The CBC latch protects information controlling the buffer cache. Here is a schematic of computer memory and the Oracle processes, SGA and the main components of the SGA:
The buffer cache holds in-memory versions of data blocks for faster access. Can you imagine, though, how we find a block we want in the buffer cache? The buffer cache doesn't have an index of the blocks it contains, and we certainly don't scan the whole cache looking for the block we want (though I have heard that raised as a concern when people increase the size of their buffer cache). The way we find a block in the buffer cache is by taking the block's address, i.e. its file and block number, and hashing it. What's hashing? A simple example of hashing is the "modulo" function:
Using "mod 4" as a hash function creates 4 possible results. These results are used by Oracle as "buckets", or identifiers of locations to store things. The things in this case will be block headers.
Block headers are metadata about data blocks, including pointers to the actual data block as well as pointers to the other headers in the same bucket.
The block headers in the hash buckets are connected via a doubly linked list. One link points forward, the other points backward:
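As a small illustration of the "mod 4" idea, here is a Python sketch. The block addresses are made-up integers and the function name `hash_bucket` is hypothetical; Oracle's real hash function is different and not shown here.

```python
N_BUCKETS = 4

def hash_bucket(block_address):
    """Toy hash: the remainder after dividing by the number of buckets."""
    return block_address % N_BUCKETS

# Hypothetical block addresses, chosen only for illustration.
addresses = [437, 288, 1039, 1221, 570, 856, 2345]

buckets = {b: [] for b in range(N_BUCKETS)}
for addr in addresses:
    buckets[hash_bucket(addr)].append(addr)

for b, members in buckets.items():
    print(b, members)
```

Different addresses can land in the same bucket (here 437, 1221 and 2345 all hash to bucket 1), which is why each bucket needs a list of headers rather than a single slot.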
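A doubly linked chain of headers can be sketched in a few lines of Python. This is only a teaching model under my own assumptions: the class names `BufferHeader` and `HashChain` and the choice to insert at the head of the chain are mine, not Oracle's internal structures.

```python
class BufferHeader:
    """Toy block header: identifies a block and links to its neighbours."""

    def __init__(self, file_no, block_no):
        self.file_no = file_no
        self.block_no = block_no
        self.next = None  # forward link
        self.prev = None  # backward link

class HashChain:
    """Doubly linked list of the headers that hash to one bucket."""

    def __init__(self):
        self.head = None

    def insert(self, hdr):
        # Link the new header in at the head of the chain.
        hdr.prev = None
        hdr.next = self.head
        if self.head is not None:
            self.head.prev = hdr
        self.head = hdr

    def remove(self, hdr):
        # The backward link lets us unlink without walking the chain.
        if hdr.prev is not None:
            hdr.prev.next = hdr.next
        else:
            self.head = hdr.next
        if hdr.next is not None:
            hdr.next.prev = hdr.prev
        hdr.next = hdr.prev = None

    def find(self, file_no, block_no):
        # Walk the chain until the matching file and block number is found.
        node = self.head
        while node is not None:
            if (node.file_no, node.block_no) == (file_no, block_no):
                return node
            node = node.next
        return None

chain = HashChain()
a, b, c = BufferHeader(1, 437), BufferHeader(1, 1221), BufferHeader(2, 2345)
for hdr in (a, b, c):
    chain.insert(hdr)

chain.remove(b)  # unlink the middle header
found = chain.find(1, 437)
missing = chain.find(1, 1221)
```

After the removal, `found` is the header for file 1, block 437 and `missing` is `None`, while the remaining two headers still point at each other.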
How many copies of a block are in the cache?
The steps to find a block in the cache are:
If there are a lot of sessions concurrently accessing the same buffer header (or buffer headers in the same bucket), then the latch that protects that bucket will get hot and users will have to wait, showing up as the "latch: cache buffers chains" wait.
Two ways this can happen (among probably several others):
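To see how one hot block makes one latch hot, here is a toy Python count of latch gets per bucket. It assumes one latch per bucket, a made-up hash function, and hypothetical file and block numbers; it does not model real waiting or spinning, only which latch is requested most often.

```python
from collections import Counter

N_BUCKETS = 8

def bucket_of(file_no, block_no):
    """Toy hash of a block's file and block number (not Oracle's function)."""
    return (file_no * 31 + block_no) % N_BUCKETS

latch_gets = Counter()

def read_block(file_no, block_no):
    # Each lookup has to take the latch covering the block's bucket.
    latch_gets[bucket_of(file_no, block_no)] += 1

hot_block = (4, 100)  # hypothetical block that every session keeps reading
for session in range(20):
    for _ in range(500):
        read_block(*hot_block)
    read_block(4, 200 + session)  # plus one block private to this session

hot_bucket, gets = latch_gets.most_common(1)[0]
print(hot_bucket, gets)
```

Nearly all of the latch gets land on the bucket holding the hot block, and any unrelated block that happens to hash to the same bucket competes for the same latch.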
For the nested loops example, Oracle will in some (most?) cases try to pin the root block of the index, because Oracle knows we will be using it over and over. When a block is pinned we don't have to use the CBC latch. There seem to be cases (some I think might be bugs) where the root block doesn't get pinned. (I want to look into this more; let me know if you have more info.)
One thing that can make CBC latch contention worse is a session modifying the data block that users are reading, because readers will clone a block with uncommitted changes and roll back the changes in the cloned block:
All these clone copies will go in the same bucket and be protected by the same latch:
Notice that the number of copies, 14, is higher than the max number of copies allowed, set by "_db_block_max_cr_dba = 6" in 10g. The reason is that this value is just a directive, not a restriction. Oracle tries to limit the number of copies.
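The reason all the clones share a bucket is that they share the same file and block number, so they hash to the same value. A minimal Python sketch, with a made-up hash function and hypothetical block numbers and labels:

```python
N_BUCKETS = 4

def bucket_of(file_no, block_no):
    """Toy hash of a block's address (not Oracle's function)."""
    return (file_no * 31 + block_no) % N_BUCKETS

buckets = {b: [] for b in range(N_BUCKETS)}

def add_buffer(file_no, block_no, kind):
    # The bucket depends only on the block address, not on which copy it is.
    buckets[bucket_of(file_no, block_no)].append((file_no, block_no, kind))

add_buffer(1, 42, "current")           # the block being modified
for n in range(6):
    add_buffer(1, 42, f"cr_clone_{n}")  # consistent read clones for readers
add_buffer(1, 43, "current")           # a neighbouring block, for contrast

clone_bucket = bucket_of(1, 42)
print(clone_bucket, len(buckets[clone_bucket]))
```

Every extra clone lengthens the same chain, so sessions hold that one latch for longer while walking it.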
Find SQL (why is the application hitting the block so hard?)
Possibly change application logic
Eliminate hot spots
What would OEM do?