dev - System shuts down mysteriously

Here is a recent case of a surprising shutdown during production hours.

All the developers were called in to explain this unwanted system restart.

The root cause turned out to be interesting.

The lesson is: always train your IT Ops staff properly before putting your million-dollar system into their hands.

============================================================

The cause was improper Ops actions. The logs are attached below with explanations.

1> A good run of the fix-up script on Nov 13, Sun.

--- This is a good run of the fix-up on Sun, Nov 13. The fix-up finishes within minutes; the log below reports a processing time of 3893 ms.

131111 15:27:14:807 Thread[Pool thread #5,5,Server Threads] : Command Received From Athos: 'FIXUP_DB -password=****** -executorClass=com.coexis.syn.serverutils.orafixup.DumpTables'

131111 15:27:19:242 Thread[Syn,5,main] : Processing took 3893

2> The same fix-up was issued 3 times within a few minutes on Nov 19, Sat.

With the three requests in flight together, core reported processing times of about 15, 21, and 22 hours for them, and the last one only finished on the morning of Nov 22.

This is still not the direct cause of the shutdown. But issuing an Athos command without understanding its impact, and without looking at the log files to confirm the action, is dangerous. Such tasks can easily drag down performance.

--- Processing for this pair took 15.26243861 hours (54944779 ms).

core/archive/2011-11-20-(00)-coreGeneral.txt:191111 19:35:09:418 Thread[Pool thread #6,5,Server Threads] : Command Received From Athos: 'FIXUP_DB -password=****** -executorClass=com.coexis.syn.serverutils.orafixup.DumpTables -executorArgs=true'

core/archive/2011-11-21-(00)-coreGeneral.txt:201111 10:50:54:804 Thread[Syn,5,main] : Processing took 54944779

--- Processing for this pair took 21.16090417 hours (76179255 ms).

core/archive/2011-11-20-(00)-coreGeneral.txt:191111 19:37:55:830 Thread[Pool thread #6,5,Server Threads] : Command Received From Athos: 'FIXUP_DB -password=****** -executorClass=com.coexis.syn.serverutils.orafixup.DumpTables -executorArgs=true'

core/archive/2011-11-21-(01)-coreGeneral.txt:211111 08:00:51:974 Thread[Syn,5,main] : Processing took 76179255

--- Processing for this pair took 22.22061583 hours (79994217 ms).

core/archive/2011-11-20-(00)-coreGeneral.txt:191111 19:41:35:059 Thread[Pool thread #6,5,Server Threads] : Command Received From Athos: 'FIXUP_DB -password=****** -executorClass=com.coexis.syn.serverutils.orafixup.DumpTables -executorArgs=true'

core/archive/2011-11-22-(00)-coreGeneral.txt:221111 06:14:06:657 Thread[Syn,5,main] : Processing took 79994217
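The hour figures above can be re-derived from the log lines themselves. Below is a minimal Python sketch (Python is my choice here, not part of the original tooling). It assumes the timestamps are laid out as ddMMyy HH:MM:SS:mmm and that "Processing took" is reported in milliseconds, which is consistent with the first pair. The names `parse_ts`, `ms_to_hours`, `summarize` and `PAIRS` are made up for illustration.

```python
from datetime import datetime

def parse_ts(text):
    """Parse a log timestamp, assumed to be ddMMyy HH:MM:SS:mmm."""
    stamp, millis = text.rsplit(":", 1)
    parsed = datetime.strptime(stamp, "%d%m%y %H:%M:%S")
    return parsed.replace(microsecond=int(millis) * 1000)

def ms_to_hours(ms):
    """Convert a 'Processing took' value (assumed milliseconds) to hours."""
    return ms / 3_600_000

# (command received, time of the 'Processing took' line, reported duration in ms)
# Values copied from the three log pairs above.
PAIRS = [
    ("191111 19:35:09:418", "201111 10:50:54:804", 54944779),
    ("191111 19:37:55:830", "211111 08:00:51:974", 76179255),
    ("191111 19:41:35:059", "221111 06:14:06:657", 79994217),
]

def summarize(pairs):
    """Return (processing hours, received-to-finished hours) per pair."""
    rows = []
    for received, finished, took_ms in pairs:
        elapsed = parse_ts(finished) - parse_ts(received)
        wall_hours = elapsed.total_seconds() / 3600
        rows.append((round(ms_to_hours(took_ms), 2), round(wall_hours, 2)))
    return rows

if __name__ == "__main__":
    for processing_h, wall_h in summarize(PAIRS):
        print(f"processing {processing_h:6.2f} h, "
              f"received-to-finished {wall_h:6.2f} h")
```

Under these assumptions the "Processing took" values reproduce the hour figures quoted above. The gap between receiving the command and the completion line is longer for the second and third requests (roughly 36 and 58 hours), which would be consistent with them waiting behind the earlier ones.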

3> The synProc.sh scheduled restart of the system did not complete cleanly.

The two shutdown commands were received on Nov 20, but they were only queued for processing, because the 3 fix-up tasks above were chewing up resources.

The first shutdown was only accepted and processed this morning, after the fix-up had finished.

This caused the shutdown of the system.

--- These are the shutdown commands received on Sun. They were only queued, and the system did not actually shut down: an Ops issue here.

core/archive/2011-11-21-(00)-coreGeneral.txt:201111 17:43:36:612 Thread[Pool thread #6,5,Server Threads] : Command Received From Athos: 'shutdown'

core/archive/2011-11-21-(00)-coreGeneral.txt:201111 17:48:23:367 Thread[Pool thread #6,5,Server Threads] : Command Received From Athos: 'shutdown'

The issue could have been avoided if, on Sunday, while performing the scheduled restart, we had really made sure that all processes were shut down properly:

1> by checking synProc.sh.

2> by checking Athos.

3> by checking log files.

Any one of the above checks would have shown that core was still running on Sunday, and an inspection of the log file would have revealed the issue.
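The log-file check can be partly automated. Here is a rough Python sketch that pulls every "Command Received From Athos" entry out of grep-style log lines and flags a command that was received again shortly after an identical one. The regular expression is based only on the lines quoted in this post, the 10-minute window is an arbitrary assumption, and the function names are hypothetical. It is a starting point, not an existing tool.

```python
import re
from datetime import datetime, timedelta

# Matches the log lines quoted in this post, with or without a leading
# "path/to/file.txt:" prefix as produced by grep.
LINE_RE = re.compile(
    r"(?P<ts>\d{6} \d{2}:\d{2}:\d{2}):\d{3} .*?"
    r"Command Received From Athos: '(?P<cmd>[^']*)'"
)

def athos_commands(lines):
    """Yield (timestamp, command) for every Athos command in the lines."""
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            # Assumed timestamp layout: ddMMyy HH:MM:SS
            ts = datetime.strptime(match.group("ts"), "%d%m%y %H:%M:%S")
            yield ts, match.group("cmd")

def repeated_commands(lines, window=timedelta(minutes=10)):
    """Return commands received again within `window` of an identical one."""
    last_seen = {}
    repeats = []
    for ts, cmd in sorted(athos_commands(lines)):
        previous = last_seen.get(cmd)
        if previous is not None and ts - previous <= window:
            repeats.append((ts, cmd))
        last_seen[cmd] = ts
    return repeats

FIXUP = ("FIXUP_DB -password=****** "
         "-executorClass=com.coexis.syn.serverutils.orafixup.DumpTables "
         "-executorArgs=true")
POOL = "Thread[Pool thread #6,5,Server Threads] : Command Received From Athos:"
F20 = "core/archive/2011-11-20-(00)-coreGeneral.txt:"
F21 = "core/archive/2011-11-21-(00)-coreGeneral.txt:"

# Lines copied from the excerpts in this post.
SAMPLE_LINES = [
    f"{F20}191111 19:35:09:418 {POOL} '{FIXUP}'",
    f"{F20}191111 19:37:55:830 {POOL} '{FIXUP}'",
    f"{F20}191111 19:41:35:059 {POOL} '{FIXUP}'",
    f"{F21}201111 10:50:54:804 Thread[Syn,5,main] : Processing took 54944779",
    f"{F21}201111 17:43:36:612 {POOL} 'shutdown'",
    f"{F21}201111 17:48:23:367 {POOL} 'shutdown'",
]

if __name__ == "__main__":
    for ts, cmd in repeated_commands(SAMPLE_LINES):
        print(f"{ts:%Y-%m-%d %H:%M:%S} repeated: {cmd.split()[0]}")
```

Run against the lines quoted in this post, it flags the second and third FIXUP_DB and the second shutdown. A repeated command is a hint that the first one did not do what the operator expected, which is exactly the moment to stop and read the log.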

On the other hand, we were lucky that the system shut down at a quiet time :).

We should not only operate the system, but also understand its behaviour by reading the log files and performing some analysis.

The Dev team will always stand by to help.
